top_secret 0.1.0 → 0.2.0
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- checksums.yaml +4 -4
- data/CHANGELOG.md +29 -2
- data/CODEOWNERS +15 -0
- data/README.md +152 -29
- data/lib/top_secret/filtered_text/result.rb +29 -0
- data/lib/top_secret/filtered_text.rb +73 -0
- data/lib/top_secret/filters/ner.rb +2 -2
- data/lib/top_secret/text/batch_result.rb +45 -0
- data/lib/top_secret/text/result.rb +26 -0
- data/lib/top_secret/text.rb +121 -7
- data/lib/top_secret/version.rb +1 -1
- data/lib/top_secret.rb +29 -14
- metadata +7 -3
- data/lib/top_secret/result.rb +0 -24
checksums.yaml
CHANGED
@@ -1,7 +1,7 @@
|
|
1
1
|
---
|
2
2
|
SHA256:
|
3
|
-
metadata.gz:
|
4
|
-
data.tar.gz:
|
3
|
+
metadata.gz: 7b57ce12774e37584e4d7a25a2606799b642ddff1eaaeb6f8fa53c2c1e4cae58
|
4
|
+
data.tar.gz: 60b70028bdf30ea474c1987ea15a27df87c7d5614743b69e9f8a20e00a358401
|
5
5
|
SHA512:
|
6
|
-
metadata.gz:
|
7
|
-
data.tar.gz:
|
6
|
+
metadata.gz: de3cb592711f29551bdcb81c8e2c4effb03d7d0e47a1c995adb7a4c828652a98050f133918b63c9a5cbdd3e03aba1e15c0abd06e969f9c7f711b5ce9f22247de
|
7
|
+
data.tar.gz: cb08a6e173106fa653f16f9abbf8ee7b1dd6018fe4097b6841701004ada46ab37b4a9be2079da7398233e35a6c334ee1a445c8d5f0faf7c82eb4c35b0343b79c
|
data/CHANGELOG.md
CHANGED
@@ -1,5 +1,32 @@
|
|
1
1
|
## [Unreleased]
|
2
2
|
|
3
|
-
## [0.
|
3
|
+
## [0.2.0] - 2025-08-18
|
4
4
|
|
5
|
-
|
5
|
+
### Added
|
6
|
+
|
7
|
+
- Added `TopSecret::Text.filter_all` for batch processing multiple messages with globally consistent redaction labels
|
8
|
+
- Added `TopSecret::Text::BatchResult` class to hold results from batch operations
|
9
|
+
- Added `TopSecret::FilteredText` class for restoring filtered text by substituting placeholders with original values
|
10
|
+
- Added `TopSecret::FilteredText::Result` class to track restoration success and failures
|
11
|
+
|
12
|
+
### Changed
|
13
|
+
|
14
|
+
- **BREAKING:** Moved `TopSecret::Result` to `TopSecret::Text::Result` and `TopSecret::BatchResult` to `TopSecret::Text::BatchResult` for better namespace organization
|
15
|
+
- **BREAKING:** Refactored configuration system to use individual filter accessors instead of nested `default_filters`
|
16
|
+
- Updated `TopSecret::Text.filter` to accept keyword arguments for filter overrides and `custom_filters` array
|
17
|
+
- Each default filter now has its own configuration accessor (e.g., `TopSecret.email_filter`, `TopSecret.people_filter`)
|
18
|
+
|
19
|
+
### Migration Guide
|
20
|
+
|
21
|
+
- Replace `TopSecret::Result` with `TopSecret::Text::Result` and `TopSecret::BatchResult` with `TopSecret::Text::BatchResult`
|
22
|
+
- Replace `TopSecret.configure { |c| c.default_filters.email_filter = filter }` with `TopSecret.configure { |c| c.email_filter = filter }`
|
23
|
+
- Replace `TopSecret::Text.filter(text, filters: { email_filter: filter })` with `TopSecret::Text.filter(text, email_filter: filter)`
|
24
|
+
- For new filters, use `TopSecret::Text.filter(text, custom_filters: [filter])` instead of adding to `default_filters`
|
25
|
+
|
26
|
+
## [0.1.1] - 2025-08-08
|
27
|
+
|
28
|
+
- Ensure `TopSecret.min_confidence_score` is respected
|
29
|
+
|
30
|
+
## [0.1.0] - 2025-08-08
|
31
|
+
|
32
|
+
- Initial release
|
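The migration guide above moves `TopSecret::Result` to `TopSecret::Text::Result`. For applications that cannot update every call site at once, a temporary constant alias is one possible bridge. The sketch below is a minimal illustration under stated assumptions: the class definitions are stand-ins so the snippet runs without the gem, and the alias itself is a hypothetical application-side shim that 0.2.0 does not ship.

```ruby
# Stand-in for the 0.2.0 class layout so this sketch runs without the gem.
# In an application you would `require "top_secret"` instead of defining these.
module TopSecret
  class Text
    class Result
      attr_reader :input, :output, :mapping

      def initialize(input, output, mapping)
        @input = input
        @output = output
        @mapping = mapping
      end
    end
  end
end

# Hypothetical application-side shim (not part of the gem): keep the old
# constant name resolving to the new class until every caller is migrated.
module TopSecret
  Result = Text::Result unless const_defined?(:Result, false)
end

legacy = TopSecret::Result.new(
  "hi ralph@thoughtbot.com",
  "hi [EMAIL_1]",
  {EMAIL_1: "ralph@thoughtbot.com"}
)

puts legacy.class                           # the new, namespaced class
puts legacy.is_a?(TopSecret::Text::Result)  # true
```

Such an alias would only cover the constant rename. The configuration and `filters:` keyword changes listed in the migration guide still need to be updated at each call site.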
data/CODEOWNERS
ADDED
@@ -0,0 +1,15 @@
|
|
1
|
+
# Lines starting with '#' are comments.
|
2
|
+
# Each line is a file pattern followed by one or more owners.
|
3
|
+
|
4
|
+
# More details are here: https://help.github.com/articles/about-codeowners/
|
5
|
+
|
6
|
+
# The '*' pattern is global owners.
|
7
|
+
|
8
|
+
# Order is important. The last matching pattern has the most precedence.
|
9
|
+
# The folders are ordered as follows:
|
10
|
+
|
11
|
+
# In each subsection folders are ordered first by depth, then alphabetically.
|
12
|
+
# This should make it easy to add new rules without breaking existing ones.
|
13
|
+
|
14
|
+
# Global rule:
|
15
|
+
* @stevepolitodesign
|
data/README.md
CHANGED
@@ -1,6 +1,8 @@
|
|
1
1
|
# Top Secret
|
2
2
|
|
3
|
-
|
3
|
+
[](https://github.com/thoughtbot/top_secret/actions/workflows/main.yml)
|
4
|
+
|
5
|
+
Filter sensitive information from free text before sending it to external services or APIs, such as chatbots and LLMs.
|
4
6
|
|
5
7
|
By default it filters the following:
|
6
8
|
|
@@ -123,7 +125,7 @@ TopSecret::Text.filter("Ralph can be reached at ralph@thoughtbot.com")
|
|
123
125
|
This will return
|
124
126
|
|
125
127
|
```ruby
|
126
|
-
<TopSecret::Result
|
128
|
+
<TopSecret::Text::Result
|
127
129
|
@input="Ralph can be reached at ralph@thoughtbot.com",
|
128
130
|
@mapping={:EMAIL_1=>"ralph@thoughtbot.com", :PERSON_1=>"Ralph"},
|
129
131
|
@output="[PERSON_1] can be reached at [EMAIL_1]"
|
@@ -154,26 +156,123 @@ result.mapping
|
|
154
156
|
# => {:EMAIL_1=>"ralph@thoughtbot.com", :PERSON_1=>"Ralph"}
|
155
157
|
```
|
156
158
|
|
159
|
+
### Batch Processing
|
160
|
+
|
161
|
+
When processing multiple messages, use `filter_all` to ensure consistent redaction labels across all messages:
|
162
|
+
|
163
|
+
```ruby
|
164
|
+
messages = [
|
165
|
+
"Contact ralph@thoughtbot.com for details",
|
166
|
+
"Email ralph@thoughtbot.com again if needed",
|
167
|
+
"Also CC ruby@thoughtbot.com on the thread"
|
168
|
+
]
|
169
|
+
|
170
|
+
result = TopSecret::Text.filter_all(messages)
|
171
|
+
```
|
172
|
+
|
173
|
+
This will return
|
174
|
+
|
175
|
+
```ruby
|
176
|
+
<TopSecret::Text::BatchResult
|
177
|
+
@mapping={:EMAIL_1=>"ralph@thoughtbot.com", :EMAIL_2=>"ruby@thoughtbot.com"},
|
178
|
+
@items=[
|
179
|
+
<TopSecret::Text::BatchResult::Item @input="Contact ralph@thoughtbot.com for details", @output="Contact [EMAIL_1] for details">,
|
180
|
+
<TopSecret::Text::BatchResult::Item @input="Email ralph@thoughtbot.com again if needed", @output="Email [EMAIL_1] again if needed">,
|
181
|
+
<TopSecret::Text::BatchResult::Item @input="Also CC ruby@thoughtbot.com on the thread", @output="Also CC [EMAIL_2] on the thread">
|
182
|
+
]
|
183
|
+
>
|
184
|
+
```
|
185
|
+
|
186
|
+
Access the global mapping
|
187
|
+
|
188
|
+
```ruby
|
189
|
+
result.mapping
|
190
|
+
|
191
|
+
# => {:EMAIL_1=>"ralph@thoughtbot.com", :EMAIL_2=>"ruby@thoughtbot.com"}
|
192
|
+
```
|
193
|
+
|
194
|
+
Access individual items
|
195
|
+
|
196
|
+
```ruby
|
197
|
+
result.items[0].input
|
198
|
+
# => "Contact ralph@thoughtbot.com for details"
|
199
|
+
|
200
|
+
result.items[0].output
|
201
|
+
# => "Contact [EMAIL_1] for details"
|
202
|
+
```
|
203
|
+
|
204
|
+
The key benefit is that identical values receive the same labels across all messages - notice how `ralph@thoughtbot.com` becomes `[EMAIL_1]` in both the first and second messages.
|
205
|
+
|
206
|
+
### Restoring Filtered Text
|
207
|
+
|
208
|
+
When external services (like LLMs) return responses containing filter placeholders, use `TopSecret::FilteredText.restore` to substitute them back with original values:
|
209
|
+
|
210
|
+
```ruby
|
211
|
+
# Filter messages before sending to LLM
|
212
|
+
messages = ["Contact ralph@thoughtbot.com for details"]
|
213
|
+
batch_result = TopSecret::Text.filter_all(messages)
|
214
|
+
|
215
|
+
# Send filtered text to LLM: "Contact [EMAIL_1] for details"
|
216
|
+
# LLM responds with: "I'll email [EMAIL_1] about this request"
|
217
|
+
llm_response = "I'll email [EMAIL_1] about this request"
|
218
|
+
|
219
|
+
# Restore the original values
|
220
|
+
restore_result = TopSecret::FilteredText.restore(llm_response, mapping: batch_result.mapping)
|
221
|
+
```
|
222
|
+
|
223
|
+
This will return
|
224
|
+
|
225
|
+
```ruby
|
226
|
+
<TopSecret::FilteredText::Result
|
227
|
+
@output="I'll email ralph@thoughtbot.com about this request",
|
228
|
+
@restored=["[EMAIL_1]"],
|
229
|
+
@unrestored=[]
|
230
|
+
>
|
231
|
+
```
|
232
|
+
|
233
|
+
Access the restored text
|
234
|
+
|
235
|
+
```ruby
|
236
|
+
restore_result.output
|
237
|
+
# => "I'll email ralph@thoughtbot.com about this request"
|
238
|
+
```
|
239
|
+
|
240
|
+
Track which placeholders were restored
|
241
|
+
|
242
|
+
```ruby
|
243
|
+
restore_result.restored
|
244
|
+
# => ["[EMAIL_1]"]
|
245
|
+
|
246
|
+
restore_result.unrestored
|
247
|
+
# => []
|
248
|
+
```
|
249
|
+
|
250
|
+
The restoration process tracks both successful and failed placeholder substitutions, allowing you to handle cases where the LLM response contains placeholders not found in your mapping.
|
251
|
+
|
157
252
|
### Advanced Examples
|
158
253
|
|
159
254
|
#### Overriding the default filters
|
160
255
|
|
161
256
|
When overriding or [disabling](#disabling-a-default-filter-1) a [default filter](#default-filters), you must map to the correct key.
|
162
257
|
|
258
|
+
> [!IMPORTANT]
|
259
|
+
> Invalid filter keys will raise an `ArgumentError`. Only the following keys are valid:
|
260
|
+
> `credit_card_filter`, `email_filter`, `phone_number_filter`, `ssn_filter`, `people_filter`, `location_filter`
|
261
|
+
|
163
262
|
```ruby
|
164
263
|
regex_filter = TopSecret::Filters::Regex.new(label: "EMAIL_ADDRESS", regex: /\b\w+\[at\]\w+\.\w+\b/)
|
165
264
|
ner_filter = TopSecret::Filters::NER.new(label: "NAME", tag: :person, min_confidence_score: 0.25)
|
166
265
|
|
167
|
-
TopSecret::Text.filter("Ralph can be reached at ralph[at]thoughtbot.com",
|
266
|
+
TopSecret::Text.filter("Ralph can be reached at ralph[at]thoughtbot.com",
|
168
267
|
email_filter: regex_filter,
|
169
268
|
people_filter: ner_filter
|
170
|
-
|
269
|
+
)
|
171
270
|
```
|
172
271
|
|
173
272
|
This will return
|
174
273
|
|
175
274
|
```ruby
|
176
|
-
<TopSecret::Result
|
275
|
+
<TopSecret::Text::Result
|
177
276
|
@input="Ralph can be reached at ralph[at]thoughtbot.com",
|
178
277
|
@mapping={:EMAIL_ADDRESS_1=>"ralph[at]thoughtbot.com", :NAME_1=>"Ralph", :NAME_2=>"ralph["},
|
179
278
|
@output="[NAME_1] can be reached at [EMAIL_ADDRESS_1]"
|
@@ -183,22 +282,29 @@ This will return
|
|
183
282
|
#### Disabling a default filter
|
184
283
|
|
185
284
|
```ruby
|
186
|
-
TopSecret::Text.filter("Ralph can be reached at ralph@thoughtbot.com",
|
285
|
+
TopSecret::Text.filter("Ralph can be reached at ralph@thoughtbot.com",
|
187
286
|
email_filter: nil,
|
188
287
|
people_filter: nil
|
189
|
-
|
288
|
+
)
|
190
289
|
```
|
191
290
|
|
192
291
|
This will return
|
193
292
|
|
194
293
|
```ruby
|
195
|
-
<TopSecret::Result
|
294
|
+
<TopSecret::Text::Result
|
196
295
|
@input="Ralph can be reached at ralph@thoughtbot.com",
|
197
296
|
@mapping={},
|
198
297
|
@output="Ralph can be reached at ralph@thoughtbot.com"
|
199
298
|
>
|
200
299
|
```
|
201
300
|
|
301
|
+
#### Error handling for invalid filter keys
|
302
|
+
|
303
|
+
```ruby
|
304
|
+
# This will raise ArgumentError: Unknown key: :invalid_filter. Valid keys are: ...
|
305
|
+
TopSecret::Text.filter("some text", invalid_filter: some_filter)
|
306
|
+
```
|
307
|
+
|
202
308
|
### Custom Filters
|
203
309
|
|
204
310
|
#### Adding new [Regex filters][]
|
@@ -209,15 +315,15 @@ ip_address_filter = TopSecret::Filters::Regex.new(
|
|
209
315
|
regex: /\b\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}\b/
|
210
316
|
)
|
211
317
|
|
212
|
-
TopSecret::Text.filter("Ralph's IP address is 192.168.1.1",
|
213
|
-
|
214
|
-
|
318
|
+
TopSecret::Text.filter("Ralph's IP address is 192.168.1.1",
|
319
|
+
custom_filters: [ip_address_filter]
|
320
|
+
)
|
215
321
|
```
|
216
322
|
|
217
323
|
This will return
|
218
324
|
|
219
325
|
```ruby
|
220
|
-
<TopSecret::Result
|
326
|
+
<TopSecret::Text::Result
|
221
327
|
@input="Ralph's IP address is 192.168.1.1",
|
222
328
|
@mapping={:PERSON_1=>"Ralph", :IP_ADDRESS_1=>"192.168.1.1"},
|
223
329
|
@output="[PERSON_1]'s IP address is [IP_ADDRESS_1]"
|
@@ -235,15 +341,15 @@ language_filter = TopSecret::Filters::NER.new(
|
|
235
341
|
min_confidence_score: 0.75
|
236
342
|
)
|
237
343
|
|
238
|
-
TopSecret::Text.filter("Ralph's favorite programming language is Ruby.",
|
239
|
-
|
240
|
-
|
344
|
+
TopSecret::Text.filter("Ralph's favorite programming language is Ruby.",
|
345
|
+
custom_filters: [language_filter]
|
346
|
+
)
|
241
347
|
```
|
242
348
|
|
243
349
|
This will return
|
244
350
|
|
245
351
|
```ruby
|
246
|
-
<TopSecret::Result
|
352
|
+
<TopSecret::Text::Result
|
247
353
|
@input="Ralph's favorite programming language is Ruby.",
|
248
354
|
@mapping={:PERSON_1=>"Ralph", :LANGUAGE_1=>"Ruby"},
|
249
355
|
@output="[PERSON_1]'s favorite programming language is [LANGUAGE_1]"
|
@@ -265,9 +371,9 @@ regex_filter = TopSecret::Filters::Regex.new(
|
|
265
371
|
regex: /\b\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}\b/
|
266
372
|
)
|
267
373
|
|
268
|
-
result = TopSecret::Text.filter("Server IP: 192.168.1.1",
|
269
|
-
|
270
|
-
|
374
|
+
result = TopSecret::Text.filter("Server IP: 192.168.1.1",
|
375
|
+
custom_filters: [regex_filter]
|
376
|
+
)
|
271
377
|
|
272
378
|
result.output
|
273
379
|
# => "Server IP: [IP_ADDRESS_1]"
|
@@ -285,9 +391,9 @@ ner_filter = TopSecret::Filters::NER.new(
|
|
285
391
|
min_confidence_score: 0.25
|
286
392
|
)
|
287
393
|
|
288
|
-
result = TopSecret::Text.filter("Ralph and Ruby work at thoughtbot.",
|
394
|
+
result = TopSecret::Text.filter("Ralph and Ruby work at thoughtbot.",
|
289
395
|
people_filter: ner_filter
|
290
|
-
|
396
|
+
)
|
291
397
|
|
292
398
|
result.output
|
293
399
|
# => "[PERSON_1] and [PERSON_2] work at thoughtbot."
|
@@ -326,7 +432,7 @@ end
|
|
326
432
|
|
327
433
|
```ruby
|
328
434
|
TopSecret.configure do |config|
|
329
|
-
config.
|
435
|
+
config.email_filter = TopSecret::Filters::Regex.new(
|
330
436
|
label: "EMAIL_ADDRESS",
|
331
437
|
regex: /\b\w+\[at\]\w+\.\w+\b/
|
332
438
|
)
|
@@ -337,18 +443,20 @@ end
|
|
337
443
|
|
338
444
|
```ruby
|
339
445
|
TopSecret.configure do |config|
|
340
|
-
config.
|
446
|
+
config.email_filter = nil
|
341
447
|
end
|
342
448
|
```
|
343
449
|
|
344
|
-
### Adding
|
450
|
+
### Adding custom filters globally
|
345
451
|
|
346
452
|
```ruby
|
453
|
+
ip_address_filter = TopSecret::Filters::Regex.new(
|
454
|
+
label: "IP_ADDRESS",
|
455
|
+
regex: /\b\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}\b/
|
456
|
+
)
|
457
|
+
|
347
458
|
TopSecret.configure do |config|
|
348
|
-
config.
|
349
|
-
label: "IP_ADDRESS",
|
350
|
-
regex: /\b\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}\b/
|
351
|
-
)
|
459
|
+
config.custom_filters << ip_address_filter
|
352
460
|
end
|
353
461
|
```
|
354
462
|
|
@@ -361,11 +469,26 @@ After checking out the repo, run `bin/setup` to install dependencies. Then, run
|
|
361
469
|
>
|
362
470
|
> You'll need to download and extract [ner_model.dat][] first, and place it in the root of this project.
|
363
471
|
|
472
|
+
### Performance Benchmarks
|
473
|
+
|
474
|
+
Run `bin/benchmark` to test performance and catch regressions:
|
475
|
+
|
476
|
+
```bash
|
477
|
+
bin/benchmark # CI-optimized benchmark with pass/fail thresholds
|
478
|
+
```
|
479
|
+
|
480
|
+
> [!NOTE]
|
481
|
+
> When adding new public methods to the API, ensure they are included in the benchmark script to catch performance regressions.
|
482
|
+
|
364
483
|
To install this gem onto your local machine, run `bundle exec rake install`. To release a new version, update the version number in `version.rb`, and then run `bundle exec rake release`, which will create a git tag for the version, push git commits and the created tag, and push the `.gem` file to [rubygems.org](https://rubygems.org).
|
365
484
|
|
366
485
|
## Contributing
|
367
486
|
|
368
|
-
Bug reports
|
487
|
+
[Bug reports](https://github.com/thoughtbot/top_secret/issues/new?template=bug_report.md) and [pull requests](https://github.com/thoughtbot/top_secret/pulls) are welcome on GitHub at [https://github.com/thoughtbot/top_secret](https://github.com/thoughtbot/top_secret).
|
488
|
+
|
489
|
+
Please create a [new discussion](https://github.com/thoughtbot/top_secret/discussions/new?category=ideas) if you want to share ideas for new features.
|
490
|
+
|
491
|
+
This project is intended to be a safe, welcoming space for collaboration, and contributors are expected to adhere to the [code of conduct](https://github.com/thoughtbot/top_secret/blob/main/CODE_OF_CONDUCT.md).
|
369
492
|
|
370
493
|
## License
|
371
494
|
|
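The README change above states that invalid filter keys raise an `ArgumentError` and lists the six valid keys. The sketch below illustrates that check in plain Ruby so it can run without the gem or ActiveSupport. `VALID_FILTER_KEYS` and `validate_filter_keys!` are hypothetical names used only for this illustration; the gem itself relies on ActiveSupport's `Hash#assert_valid_keys`, as the `text.rb` diff shows.

```ruby
# Valid override keys, as listed in the README for 0.2.0.
VALID_FILTER_KEYS = %i[
  credit_card_filter email_filter phone_number_filter
  ssn_filter people_filter location_filter
].freeze

# Hypothetical helper: raises ArgumentError on the first unknown key,
# following the message format quoted in the README.
def validate_filter_keys!(overrides)
  overrides.each_key do |key|
    next if VALID_FILTER_KEYS.include?(key)

    raise ArgumentError,
      "Unknown key: #{key.inspect}. Valid keys are: #{VALID_FILTER_KEYS.map(&:inspect).join(", ")}"
  end
  overrides
end

# A nil value is allowed: it disables the corresponding default filter.
validate_filter_keys!(email_filter: nil)

begin
  validate_filter_keys!(invalid_filter: :anything)
rescue ArgumentError => e
  puts e.message
end
```

Validating before any filtering runs means a misspelled key fails loudly instead of silently leaving a default filter active.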
data/lib/top_secret/filtered_text/result.rb
ADDED
@@ -0,0 +1,29 @@
|
|
1
|
+
# frozen_string_literal: true
|
2
|
+
|
3
|
+
module TopSecret
|
4
|
+
class FilteredText
|
5
|
+
# Result object returned by FilteredText restoration operations.
|
6
|
+
#
|
7
|
+
# Contains the restored text along with tracking information about which
|
8
|
+
# placeholders were successfully restored and which remain unrestored.
|
9
|
+
class Result
|
10
|
+
# @return [String] The text with placeholders restored to original values
|
11
|
+
attr_reader :output
|
12
|
+
|
13
|
+
# @return [Array<String>] Array of placeholder strings that could not be restored
|
14
|
+
attr_reader :unrestored
|
15
|
+
|
16
|
+
# @return [Array<String>] Array of placeholder strings that were successfully restored
|
17
|
+
attr_reader :restored
|
18
|
+
|
19
|
+
# @param output [String] The restored text
|
20
|
+
# @param unrestored [Array<String>] Placeholders that could not be restored
|
21
|
+
# @param restored [Array<String>] Placeholders that were successfully restored
|
22
|
+
def initialize(output, unrestored, restored)
|
23
|
+
@output = output
|
24
|
+
@unrestored = unrestored
|
25
|
+
@restored = restored
|
26
|
+
end
|
27
|
+
end
|
28
|
+
end
|
29
|
+
end
|
data/lib/top_secret/filtered_text.rb
ADDED
@@ -0,0 +1,73 @@
|
|
1
|
+
# frozen_string_literal: true
|
2
|
+
|
3
|
+
require_relative "filtered_text/result"
|
4
|
+
|
5
|
+
module TopSecret
|
6
|
+
# Restores filtered text by substituting placeholders with original values.
|
7
|
+
#
|
8
|
+
# This class is used to reverse the filtering process, typically when processing
|
9
|
+
# responses from external services like LLMs that may contain filtered placeholders.
|
10
|
+
class FilteredText
|
11
|
+
# @return [String] The text being processed for restoration
|
12
|
+
attr_reader :output
|
13
|
+
|
14
|
+
# @param filtered_text [String] Text containing filter placeholders like [EMAIL_1]
|
15
|
+
# @param mapping [Hash] Hash mapping filter symbols to original values
|
16
|
+
def initialize(filtered_text, mapping:)
|
17
|
+
@mapping = mapping
|
18
|
+
@output = filtered_text.dup
|
19
|
+
end
|
20
|
+
|
21
|
+
# Convenience method to restore filtered text in one call
|
22
|
+
#
|
23
|
+
# @param filtered_text [String] Text containing filter placeholders
|
24
|
+
# @param mapping [Hash] Hash mapping filter symbols to original values
|
25
|
+
# @return [Result] Contains restored text and tracking information
|
26
|
+
#
|
27
|
+
# @example Basic restoration
|
28
|
+
# mapping = {EMAIL_1: "john@example.com"}
|
29
|
+
# result = TopSecret::FilteredText.restore("Contact [EMAIL_1]", mapping: mapping)
|
30
|
+
# result.output # => "Contact john@example.com"
|
31
|
+
# result.restored # => ["[EMAIL_1]"]
|
32
|
+
# result.unrestored # => []
|
33
|
+
def self.restore(filtered_text, mapping:)
|
34
|
+
new(filtered_text, mapping:).restore
|
35
|
+
end
|
36
|
+
|
37
|
+
# Performs the restoration process
|
38
|
+
#
|
39
|
+
# Substitutes all found placeholders with their mapped values and tracks
|
40
|
+
# which placeholders were successfully restored vs those that remain unrestored.
|
41
|
+
#
|
42
|
+
# @return [Result] Contains the restored text and tracking arrays
|
43
|
+
def restore
|
44
|
+
restored = []
|
45
|
+
|
46
|
+
mapping.each do |filter, value|
|
47
|
+
placeholder = build_placeholder(filter)
|
48
|
+
|
49
|
+
if output.include? placeholder
|
50
|
+
restored << placeholder
|
51
|
+
output.gsub! placeholder, value
|
52
|
+
end
|
53
|
+
end
|
54
|
+
|
55
|
+
unrestored = output.scan(/\[\w*_\d\]/)
|
56
|
+
|
57
|
+
Result.new(output, unrestored, restored)
|
58
|
+
end
|
59
|
+
|
60
|
+
private
|
61
|
+
|
62
|
+
# @return [Hash] Mapping from filter symbols to original values
|
63
|
+
attr_reader :mapping
|
64
|
+
|
65
|
+
# Builds a placeholder string from a filter symbol
|
66
|
+
#
|
67
|
+
# @param filter [Symbol] The filter symbol (e.g., :EMAIL_1)
|
68
|
+
# @return [String] The placeholder string (e.g., "[EMAIL_1]")
|
69
|
+
def build_placeholder(filter)
|
70
|
+
"[#{filter}]"
|
71
|
+
end
|
72
|
+
end
|
73
|
+
end
|
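The `restore` method above finds leftover placeholders with `/\[\w*_\d\]/`, which allows exactly one digit after the final underscore. The sketch below is a plain-Ruby stand-in for that loop (it returns a hash rather than a `Result` object) and compares the shipped pattern with a hypothetical `\d+` variant. The variant and the constant names are assumptions made for illustration, not part of the gem.

```ruby
# Pattern used by FilteredText#restore in 0.2.0 to find leftover placeholders.
SHIPPED_PATTERN = /\[\w*_\d\]/

# Hypothetical variant (not in the gem): allow multi-digit indices.
MULTI_DIGIT_PATTERN = /\[\w*_\d+\]/

# Minimal stand-in for the restore loop, parameterised on the scan pattern.
def restore(text, mapping:, pattern:)
  output = text.dup
  restored = []

  mapping.each do |label, value|
    placeholder = "[#{label}]"
    next unless output.include?(placeholder)

    restored << placeholder
    output.gsub!(placeholder, value)
  end

  {output: output, restored: restored, unrestored: output.scan(pattern)}
end

response = "Ask [EMAIL_1], then [PERSON_3] and [EMAIL_10]"
mapping = {EMAIL_1: "ralph@thoughtbot.com"}

shipped = restore(response, mapping: mapping, pattern: SHIPPED_PATTERN)
variant = restore(response, mapping: mapping, pattern: MULTI_DIGIT_PATTERN)

p shipped[:restored]    # => ["[EMAIL_1]"]
p shipped[:unrestored]  # => ["[PERSON_3]"]
p variant[:unrestored]  # => ["[PERSON_3]", "[EMAIL_10]"]
```

With the shipped pattern, an unmapped placeholder such as `[EMAIL_10]` stays in the output but is not reported in `unrestored`, so callers handling ten or more values of one type may want to check the output text directly.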
data/lib/top_secret/filters/ner.rb
CHANGED
@@ -13,7 +13,7 @@ module TopSecret
|
|
13
13
|
def initialize(label:, tag:, min_confidence_score: nil)
|
14
14
|
@label = label
|
15
15
|
@tag = tag.upcase.to_s
|
16
|
-
@min_confidence_score = min_confidence_score
|
16
|
+
@min_confidence_score = min_confidence_score
|
17
17
|
end
|
18
18
|
|
19
19
|
# Filters and extracts entity texts matching the tag and score threshold.
|
@@ -21,7 +21,7 @@ module TopSecret
|
|
21
21
|
# @param entities [Array<Hash>] List of entity hashes with keys :tag, :score, and :text
|
22
22
|
# @return [Array<String>] Matched entity texts
|
23
23
|
def call(entities)
|
24
|
-
tags = entities.filter { _1.fetch(:tag) == tag && _1.fetch(:score) >= min_confidence_score }
|
24
|
+
tags = entities.filter { _1.fetch(:tag) == tag && _1.fetch(:score) >= (min_confidence_score || TopSecret.min_confidence_score) }
|
25
25
|
tags.map { _1.fetch(:text) }
|
26
26
|
end
|
27
27
|
|
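The `call` change above falls back to the global `TopSecret.min_confidence_score` when a filter was built without its own threshold. The sketch below reproduces that lookup in plain Ruby. `NerFilterSketch` and `GLOBAL_MIN_CONFIDENCE` are hypothetical stand-ins, and the `0.5` default is an illustrative value rather than the gem's actual constant.

```ruby
# Stand-in for TopSecret.min_confidence_score (the global default).
# The value here is illustrative, not the gem's actual constant.
GLOBAL_MIN_CONFIDENCE = 0.5

# Simplified NER filter: a per-filter threshold wins when given,
# otherwise the global default is looked up at call time.
class NerFilterSketch
  def initialize(tag:, min_confidence_score: nil)
    @tag = tag.to_s.upcase
    @min_confidence_score = min_confidence_score
  end

  # Returns the text of each entity matching the tag and the threshold.
  def call(entities)
    threshold = @min_confidence_score || GLOBAL_MIN_CONFIDENCE
    entities
      .select { |e| e.fetch(:tag) == @tag && e.fetch(:score) >= threshold }
      .map { |e| e.fetch(:text) }
  end
end

entities = [
  {tag: "PERSON", score: 0.9, text: "Ralph"},
  {tag: "PERSON", score: 0.3, text: "Ruby"},
  {tag: "LOCATION", score: 0.8, text: "Boston"}
]

p NerFilterSketch.new(tag: :person).call(entities)
# => ["Ralph"]
p NerFilterSketch.new(tag: :person, min_confidence_score: 0.25).call(entities)
# => ["Ralph", "Ruby"]
```

Resolving the default inside `call` rather than in the constructor means a filter created before the global setting changes still picks up the new value.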
data/lib/top_secret/text/batch_result.rb
ADDED
@@ -0,0 +1,45 @@
|
|
1
|
+
# frozen_string_literal: true
|
2
|
+
|
3
|
+
module TopSecret
|
4
|
+
class Text
|
5
|
+
# Holds the result of a batch redaction operation on multiple messages.
|
6
|
+
# Contains a global mapping that ensures consistent labeling across all messages
|
7
|
+
# and a collection of individual input/output pairs.
|
8
|
+
class BatchResult
|
9
|
+
# @return [Hash] Global mapping of redaction labels to original values across all messages
|
10
|
+
attr_reader :mapping
|
11
|
+
|
12
|
+
# @return [Array<Item>] Array of input/output pairs for each processed message
|
13
|
+
attr_reader :items
|
14
|
+
|
15
|
+
# Creates a new BatchResult instance
|
16
|
+
#
|
17
|
+
# @param mapping [Hash] Global mapping of redaction labels to original values
|
18
|
+
# @param items [Array<Item>] Array of input/output pairs
|
19
|
+
def initialize(mapping: {}, items: [])
|
20
|
+
@mapping = mapping
|
21
|
+
@items = items
|
22
|
+
end
|
23
|
+
|
24
|
+
# Represents a single message within a batch redaction operation.
|
25
|
+
# Contains only the input and output text, without individual mappings.
|
26
|
+
# The mapping is maintained at the BatchResult level for global consistency.
|
27
|
+
class Item
|
28
|
+
# @return [String] The original unredacted input
|
29
|
+
attr_reader :input
|
30
|
+
|
31
|
+
# @return [String] The redacted output
|
32
|
+
attr_reader :output
|
33
|
+
|
34
|
+
# Creates a new Item instance
|
35
|
+
#
|
36
|
+
# @param input [String] The original text
|
37
|
+
# @param output [String] The redacted text
|
38
|
+
def initialize(input, output)
|
39
|
+
@input = input
|
40
|
+
@output = output
|
41
|
+
end
|
42
|
+
end
|
43
|
+
end
|
44
|
+
end
|
45
|
+
end
|
data/lib/top_secret/text/result.rb
ADDED
@@ -0,0 +1,26 @@
|
|
1
|
+
# frozen_string_literal: true
|
2
|
+
|
3
|
+
module TopSecret
|
4
|
+
class Text
|
5
|
+
# Holds the result of a redaction operation.
|
6
|
+
class Result
|
7
|
+
# @return [String] The original unredacted input
|
8
|
+
attr_reader :input
|
9
|
+
|
10
|
+
# @return [String] The redacted output
|
11
|
+
attr_reader :output
|
12
|
+
|
13
|
+
# @return [Hash] Mapping of redacted labels to matched values
|
14
|
+
attr_reader :mapping
|
15
|
+
|
16
|
+
# @param input [String] The original text
|
17
|
+
# @param output [String] The redacted text
|
18
|
+
# @param mapping [Hash] Map of labels to matched values
|
19
|
+
def initialize(input, output, mapping)
|
20
|
+
@input = input
|
21
|
+
@output = output
|
22
|
+
@mapping = mapping
|
23
|
+
end
|
24
|
+
end
|
25
|
+
end
|
26
|
+
end
|
data/lib/top_secret/text.rb
CHANGED
@@ -1,37 +1,110 @@
|
|
1
1
|
# frozen_string_literal: true
|
2
2
|
|
3
|
+
require "active_support/core_ext/hash/keys"
|
4
|
+
require_relative "text/result"
|
5
|
+
require_relative "text/batch_result"
|
6
|
+
|
3
7
|
module TopSecret
|
4
8
|
# Processes text to identify and redact sensitive information using configured filters.
|
5
9
|
class Text
|
6
10
|
# @param input [String] The original text to be filtered
|
7
11
|
# @param filters [Hash, nil] Optional set of filters to override the defaults
|
8
|
-
|
12
|
+
# @param custom_filters [Array] Additional custom filters to apply
|
13
|
+
# @param model [Mitie::NER, nil] Optional pre-loaded MITIE model for performance
|
14
|
+
def initialize(input, custom_filters: [], filters: {}, model: nil)
|
9
15
|
@input = input
|
10
16
|
@output = input.dup
|
11
17
|
@mapping = {}
|
12
18
|
|
13
|
-
@model = Mitie::NER.new(TopSecret.model_path)
|
19
|
+
@model = model || Mitie::NER.new(TopSecret.model_path)
|
14
20
|
@doc = @model.doc(@output)
|
15
21
|
@entities = @doc.entities
|
16
22
|
|
17
23
|
@filters = filters
|
24
|
+
@custom_filters = custom_filters
|
18
25
|
end
|
19
26
|
|
20
27
|
# Convenience method to create an instance and filter input
|
21
28
|
#
|
22
29
|
# @param input [String] The text to filter
|
23
|
-
# @param filters [Hash] Optional filters to override defaults
|
30
|
+
# @param filters [Hash] Optional filters to override defaults (only valid filter keys accepted)
|
31
|
+
# @param custom_filters [Array] Additional custom filters to apply
|
24
32
|
# @return [Result] The filtered result
|
25
|
-
|
26
|
-
|
33
|
+
# @raise [ArgumentError] If invalid filter keys are provided
|
34
|
+
def self.filter(input, custom_filters: [], **filters)
|
35
|
+
new(input, filters:, custom_filters:).filter
|
36
|
+
end
|
37
|
+
|
38
|
+
# Filters multiple messages with globally consistent redaction labels
|
39
|
+
#
|
40
|
+
# Processes a collection of messages and ensures that identical sensitive values
|
41
|
+
# receive the same redaction labels across all messages. This is useful when
|
42
|
+
# processing conversation threads or document collections where consistency matters.
|
43
|
+
#
|
44
|
+
# @param messages [Array<String>] Array of text messages to filter
|
45
|
+
# @param custom_filters [Array] Additional custom filters to apply
|
46
|
+
# @param filters [Hash] Optional filters to override defaults (only valid filter keys accepted)
|
47
|
+
# @return [BatchResult] Contains global mapping and array of input/output pairs
|
48
|
+
# @raise [ArgumentError] If invalid filter keys are provided
|
49
|
+
#
|
50
|
+
# @example Basic usage
|
51
|
+
# messages = ["Contact john@test.com", "Email john@test.com again"]
|
52
|
+
# result = TopSecret::Text.filter_all(messages)
|
53
|
+
# result.items[0].output # => "Contact [EMAIL_1]"
|
54
|
+
# result.items[1].output # => "Email [EMAIL_1] again"
|
55
|
+
# result.mapping # => { EMAIL_1: "john@test.com" }
|
56
|
+
#
|
57
|
+
# @example With custom filters
|
58
|
+
# ip_filter = TopSecret::Filters::Regex.new(label: "IP", regex: /\d+\.\d+\.\d+\.\d+/)
|
59
|
+
# result = TopSecret::Text.filter_all(messages, custom_filters: [ip_filter])
|
60
|
+
def self.filter_all(messages, custom_filters: [], **filters)
|
61
|
+
shared_model = Mitie::NER.new(TopSecret.model_path)
|
62
|
+
|
63
|
+
individual_results = messages.map do |message|
|
64
|
+
new(message, filters:, custom_filters:, model: shared_model).filter
|
65
|
+
end
|
66
|
+
|
67
|
+
global_mapping = {}
|
68
|
+
label_counters = {}
|
69
|
+
|
70
|
+
individual_results.each do |result|
|
71
|
+
result.mapping.each do |individual_key, value|
|
72
|
+
next if global_mapping.key?(value)
|
73
|
+
|
74
|
+
# TODO: This assumes labels are formatted consistently.
|
75
|
+
# We need to account for the following for the case where a label could begin with an "_"
|
76
|
+
label_type = individual_key.to_s.rpartition("_").first
|
77
|
+
|
78
|
+
label_counters[label_type] ||= 0
|
79
|
+
label_counters[label_type] += 1
|
80
|
+
global_key = :"#{label_type}_#{label_counters[label_type]}"
|
81
|
+
|
82
|
+
global_mapping[value] = global_key
|
83
|
+
end
|
84
|
+
end
|
85
|
+
|
86
|
+
inverted_global_mapping = global_mapping.invert
|
87
|
+
|
88
|
+
items = individual_results.map do |result|
|
89
|
+
output = result.input.dup
|
90
|
+
inverted_global_mapping.each { |filter, value| output.gsub!(value, "[#{filter}]") }
|
91
|
+
Text::BatchResult::Item.new(result.input, output)
|
92
|
+
end
|
93
|
+
|
94
|
+
Text::BatchResult.new(mapping: global_mapping.invert, items:)
|
27
95
|
end
|
28
96
|
|
29
97
|
# Applies configured filters to the input, redacting matches and building a mapping.
|
30
98
|
#
|
31
99
|
# @return [Result] Contains original input, redacted output, and mapping of labels to values
|
32
100
|
# @raise [Error] If an unsupported filter is encountered
|
101
|
+
# @raise [ArgumentError] If invalid filter keys are provided
|
33
102
|
def filter
|
34
|
-
|
103
|
+
validate_filters!
|
104
|
+
|
105
|
+
all_filters.each do |filter|
|
106
|
+
next if filter.nil?
|
107
|
+
|
35
108
|
values = case filter
|
36
109
|
when TopSecret::Filters::Regex
|
37
110
|
filter.call(input)
|
@@ -45,7 +118,7 @@ module TopSecret
|
|
45
118
|
|
46
119
|
substitute_text
|
47
120
|
|
48
|
-
Result.new(input, output, mapping)
|
121
|
+
Text::Result.new(input, output, mapping)
|
49
122
|
end
|
50
123
|
|
51
124
|
private
|
@@ -65,6 +138,9 @@ module TopSecret
|
|
65
138
|
# @return [Hash] Active filters used for redaction
|
66
139
|
attr_reader :filters
|
67
140
|
|
141
|
+
# @return [Array] Custom filters to apply
|
142
|
+
attr_reader :custom_filters
|
143
|
+
|
68
144
|
# Builds the mapping of label keys to matched values, indexed uniquely.
|
69
145
|
#
|
70
146
|
# @param values [Array<String>] Values matched by a filter
|
@@ -85,5 +161,43 @@ module TopSecret
|
|
85
161
|
output.gsub! value, "[#{filter}]"
|
86
162
|
end
|
87
163
|
end
|
164
|
+
|
165
|
+
# Collects all filters to apply: default filters with overrides plus custom filters
|
166
|
+
#
|
167
|
+
# @return [Array] Array of filter objects to apply
|
168
|
+
def all_filters
|
169
|
+
merged_filters.values.compact + TopSecret.custom_filters + custom_filters
|
170
|
+
end
|
171
|
+
|
172
|
+
# Merges default filters with user-provided filter overrides
|
173
|
+
#
|
174
|
+
# @return [Hash] Hash containing default filters with any user overrides applied
|
175
|
+
# @private
|
176
|
+
def merged_filters
|
177
|
+
default_filters.merge(filters)
|
178
|
+
end
|
179
|
+
|
180
|
+
# Validates that all provided filter keys are recognized
|
181
|
+
#
|
182
|
+
# @return [void]
|
183
|
+
# @raise [ArgumentError] If invalid filter keys are provided
|
184
|
+
def validate_filters!
|
185
|
+
merged_filters.assert_valid_keys(*default_filters.keys)
|
186
|
+
end
|
187
|
+
|
188
|
+
# Returns the default filters configuration hash
|
189
|
+
#
|
190
|
+
# @return [Hash] Hash containing all configured default filters, keyed by filter name
|
191
|
+
# @private
|
192
|
+
def default_filters
|
193
|
+
{
|
194
|
+
credit_card_filter: TopSecret.credit_card_filter,
|
195
|
+
email_filter: TopSecret.email_filter,
|
196
|
+
phone_number_filter: TopSecret.phone_number_filter,
|
197
|
+
ssn_filter: TopSecret.ssn_filter,
|
198
|
+
people_filter: TopSecret.people_filter,
|
199
|
+
location_filter: TopSecret.location_filter
|
200
|
+
}
|
201
|
+
end
|
88
202
|
end
|
89
203
|
end
|
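The `filter_all` method above relabels per-message mappings so that each distinct value gets one global label. The sketch below isolates that consolidation step in plain Ruby, without the MITIE model or the gem. `consolidate` is a hypothetical helper name and the input mappings are made-up examples. It uses `Hash.new(0)` where the gem uses `||= 0`, which should behave the same here.

```ruby
# Per-message mappings, shaped like Text::Result#mapping (label => value).
# Each message numbers its own labels starting from 1.
per_message_mappings = [
  {EMAIL_1: "ralph@thoughtbot.com"},
  {EMAIL_1: "ralph@thoughtbot.com"},
  {EMAIL_1: "ruby@thoughtbot.com", EMAIL_ADDRESS_1: "ruby[at]thoughtbot.com"}
]

# Assign one global label per distinct value, in first-seen order.
def consolidate(mappings)
  value_to_label = {}
  counters = Hash.new(0)

  mappings.each do |mapping|
    mapping.each do |label, value|
      next if value_to_label.key?(value)

      # "EMAIL_ADDRESS_1" -> "EMAIL_ADDRESS": split on the last underscore.
      label_type = label.to_s.rpartition("_").first
      counters[label_type] += 1
      value_to_label[value] = :"#{label_type}_#{counters[label_type]}"
    end
  end

  value_to_label.invert
end

global = consolidate(per_message_mappings)
p global

# Re-apply the global labels to one raw input.
output = "Also CC ruby@thoughtbot.com on the thread".dup
global.each { |label, value| output.gsub!(value, "[#{label}]") }
puts output # prints "Also CC [EMAIL_2] on the thread"
```

Splitting on the last underscore keeps multi-word label types such as `EMAIL_ADDRESS` intact, though it assumes every label ends in `_<index>`, which is the formatting concern the TODO comment in the diff raises.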
data/lib/top_secret/version.rb
CHANGED
data/lib/top_secret.rb
CHANGED
@@ -11,10 +11,10 @@ require_relative "top_secret/constants"
|
|
11
11
|
require_relative "top_secret/filters/ner"
|
12
12
|
require_relative "top_secret/filters/regex"
|
13
13
|
require_relative "top_secret/error"
|
14
|
-
require_relative "top_secret/result"
|
15
14
|
require_relative "top_secret/text"
|
15
|
+
require_relative "top_secret/filtered_text"
|
16
16
|
|
17
|
-
# TopSecret filters sensitive information from free text before it's sent to external services or APIs, such as
|
17
|
+
# TopSecret filters sensitive information from free text before it's sent to external services or APIs, such as chatbots and LLMs.
|
18
18
|
#
|
19
19
|
# @!attribute [rw] model_path
|
20
20
|
# @return [String] the path to the MITIE NER model
|
@@ -22,23 +22,38 @@ require_relative "top_secret/text"
|
|
22
22
|
# @!attribute [rw] min_confidence_score
|
23
23
|
# @return [Float] the minimum confidence score required for NER matches
|
24
24
|
#
|
25
|
-
# @!attribute [rw]
|
26
|
-
# @return [
|
25
|
+
# @!attribute [rw] custom_filters
|
26
|
+
# @return [Array] array of custom filters that can be configured
|
27
|
+
#
|
28
|
+
# @!attribute [rw] credit_card_filter
|
29
|
+
# @return [TopSecret::Filters::Regex] filter for credit card numbers
|
30
|
+
#
|
31
|
+
# @!attribute [rw] email_filter
|
32
|
+
# @return [TopSecret::Filters::Regex] filter for email addresses
|
33
|
+
#
|
34
|
+
# @!attribute [rw] phone_number_filter
|
35
|
+
# @return [TopSecret::Filters::Regex] filter for phone numbers
|
36
|
+
#
|
37
|
+
# @!attribute [rw] ssn_filter
|
38
|
+
# @return [TopSecret::Filters::Regex] filter for social security numbers
|
39
|
+
#
|
40
|
+
# @!attribute [rw] people_filter
|
41
|
+
# @return [TopSecret::Filters::NER] filter for person names
|
42
|
+
#
|
43
|
+
# @!attribute [rw] location_filter
|
44
|
+
# @return [TopSecret::Filters::NER] filter for location names
|
27
45
|
module TopSecret
|
28
46
|
include ActiveSupport::Configurable
|
29
47
|
|
30
48
|
config_accessor :model_path, default: "ner_model.dat"
|
31
49
|
config_accessor :min_confidence_score, default: MIN_CONFIDENCE_SCORE
|
32
50
|
|
33
|
-
config_accessor :
|
34
|
-
options = ActiveSupport::OrderedOptions.new
|
35
|
-
options.credit_card_filter = TopSecret::Filters::Regex.new(label: "CREDIT_CARD", regex: CREDIT_CARD_REGEX)
|
36
|
-
options.email_filter = TopSecret::Filters::Regex.new(label: "EMAIL", regex: EMAIL_REGEX)
|
37
|
-
options.phone_number_filter = TopSecret::Filters::Regex.new(label: "PHONE_NUMBER", regex: PHONE_REGEX)
|
38
|
-
options.ssn_filter = TopSecret::Filters::Regex.new(label: "SSN", regex: SSN_REGEX)
|
39
|
-
options.people_filter = TopSecret::Filters::NER.new(label: "PERSON", tag: :person)
|
40
|
-
options.location_filter = TopSecret::Filters::NER.new(label: "LOCATION", tag: :location)
|
51
|
+
config_accessor :custom_filters, default: []
|
41
52
|
|
42
|
-
|
43
|
-
|
53
|
+
config_accessor :credit_card_filter, default: TopSecret::Filters::Regex.new(label: "CREDIT_CARD", regex: CREDIT_CARD_REGEX)
|
54
|
+
config_accessor :email_filter, default: TopSecret::Filters::Regex.new(label: "EMAIL", regex: EMAIL_REGEX)
|
55
|
+
config_accessor :phone_number_filter, default: TopSecret::Filters::Regex.new(label: "PHONE_NUMBER", regex: PHONE_REGEX)
|
56
|
+
config_accessor :ssn_filter, default: TopSecret::Filters::Regex.new(label: "SSN", regex: SSN_REGEX)
|
57
|
+
config_accessor :people_filter, default: TopSecret::Filters::NER.new(label: "PERSON", tag: :person)
|
58
|
+
config_accessor :location_filter, default: TopSecret::Filters::NER.new(label: "LOCATION", tag: :location)
|
44
59
|
end
|
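A note on the configuration hunk above: each built-in filter is now exposed as its own top-level accessor, next to a new `custom_filters` array, rather than living inside a single nested options object. The sketch below mimics that shape in plain Ruby so it runs without ActiveSupport. `FakeTopSecret`, its `configure` method, and the string placeholders are assumptions for illustration and are not the gem's code.

```ruby
# Hypothetical stand-in for the TopSecret module; strings replace filter objects.
module FakeTopSecret
  class << self
    # One accessor per setting, mirroring the flat layout introduced in the diff.
    attr_accessor :custom_filters, :email_filter, :ssn_filter
  end

  self.custom_filters = []
  self.email_filter = "EMAIL filter"
  self.ssn_filter = "SSN filter"

  # Yields the module so settings can be changed in a block.
  def self.configure
    yield self
  end
end

FakeTopSecret.configure do |config|
  config.email_filter = "stricter EMAIL filter" # replace one default
  config.custom_filters << "IP_ADDRESS filter"  # register an extra filter
end

p FakeTopSecret.email_filter
p FakeTopSecret.custom_filters
```

With a flat layout like this, replacing one default does not require rebuilding a whole options object.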
metadata
CHANGED
@@ -1,7 +1,7 @@
|
|
1
1
|
--- !ruby/object:Gem::Specification
|
2
2
|
name: top_secret
|
3
3
|
version: !ruby/object:Gem::Version
|
4
|
-
version: 0.
|
4
|
+
version: 0.2.0
|
5
5
|
platform: ruby
|
6
6
|
authors:
|
7
7
|
- Steve Polito
|
@@ -44,7 +44,7 @@ dependencies:
|
|
44
44
|
- !ruby/object:Gem::Version
|
45
45
|
version: 0.3.2
|
46
46
|
description: Filter sensitive information from free text before sending it to external
|
47
|
-
services or APIs, such as
|
47
|
+
services or APIs, such as chatbots and LLMs.
|
48
48
|
email:
|
49
49
|
- stevepolito@hey.com
|
50
50
|
executables: []
|
@@ -52,6 +52,7 @@ extensions: []
|
|
52
52
|
extra_rdoc_files: []
|
53
53
|
files:
|
54
54
|
- CHANGELOG.md
|
55
|
+
- CODEOWNERS
|
55
56
|
- CODE_OF_CONDUCT.md
|
56
57
|
- LICENSE.txt
|
57
58
|
- README.md
|
@@ -59,10 +60,13 @@ files:
|
|
59
60
|
- lib/top_secret.rb
|
60
61
|
- lib/top_secret/constants.rb
|
61
62
|
- lib/top_secret/error.rb
|
63
|
+
- lib/top_secret/filtered_text.rb
|
64
|
+
- lib/top_secret/filtered_text/result.rb
|
62
65
|
- lib/top_secret/filters/ner.rb
|
63
66
|
- lib/top_secret/filters/regex.rb
|
64
|
-
- lib/top_secret/result.rb
|
65
67
|
- lib/top_secret/text.rb
|
68
|
+
- lib/top_secret/text/batch_result.rb
|
69
|
+
- lib/top_secret/text/result.rb
|
66
70
|
- lib/top_secret/version.rb
|
67
71
|
- sig/top_secret.rbs
|
68
72
|
homepage: https://github.com/thoughtbot/top_secret
|
data/lib/top_secret/result.rb
DELETED
@@ -1,24 +0,0 @@
|
|
1
|
-
# frozen_string_literal: true
|
2
|
-
|
3
|
-
module TopSecret
|
4
|
-
# Holds the result of a redaction operation.
|
5
|
-
class Result
|
6
|
-
# @return [String] The original unredacted input
|
7
|
-
attr_reader :input
|
8
|
-
|
9
|
-
# @return [String] The redacted output
|
10
|
-
attr_reader :output
|
11
|
-
|
12
|
-
# @return [Hash] Mapping of redacted labels to matched values
|
13
|
-
attr_reader :mapping
|
14
|
-
|
15
|
-
# @param input [String] The original text
|
16
|
-
# @param output [String] The redacted text
|
17
|
-
# @param mapping [Hash] Map of labels to matched values
|
18
|
-
def initialize(input, output, mapping)
|
19
|
-
@input = input
|
20
|
-
@output = output
|
21
|
-
@mapping = mapping
|
22
|
-
end
|
23
|
-
end
|
24
|
-
end
|