top_secret 0.1.1 → 0.3.0
- checksums.yaml +4 -4
- data/CHANGELOG.md +44 -0
- data/CODEOWNERS +15 -0
- data/README.md +317 -29
- data/lib/top_secret/constants.rb +3 -0
- data/lib/top_secret/filtered_text/result.rb +29 -0
- data/lib/top_secret/filtered_text.rb +73 -0
- data/lib/top_secret/mapping.rb +15 -0
- data/lib/top_secret/null_model.rb +32 -0
- data/lib/top_secret/text/batch_result.rb +42 -0
- data/lib/top_secret/text/global_mapping.rb +63 -0
- data/lib/top_secret/text/result.rb +59 -0
- data/lib/top_secret/text/scan_result.rb +18 -0
- data/lib/top_secret/text.rb +162 -13
- data/lib/top_secret/version.rb +3 -1
- data/lib/top_secret.rb +31 -15
- metadata +19 -11
- data/lib/top_secret/result.rb +0 -24
checksums.yaml
CHANGED
@@ -1,7 +1,7 @@
|
|
1
1
|
---
|
2
2
|
SHA256:
|
3
|
-
metadata.gz:
|
4
|
-
data.tar.gz:
|
3
|
+
metadata.gz: cef5c9c267dd71870fd244408a3f9a020d19978381810f99e7ed5defc67f12a7
|
4
|
+
data.tar.gz: ec20793792721a47371cc6a7cf3c687ad4284b9c9f1f9b14c57e002118c5532a
|
5
5
|
SHA512:
|
6
|
-
metadata.gz:
|
7
|
-
data.tar.gz:
|
6
|
+
metadata.gz: 0eea9a1087e5082e245b73cdbb434543dbfcff277b6d58b7a89e463b1162215e0058f5190df3863a73fd32223fdf42fa65a253781193ea7e44ecc6b68c359e45
|
7
|
+
data.tar.gz: 6626056578ec3feecf27d843a994f782c144b363a6b7473e6e0be0fa80044c2283c329ae068c4f0f66af13f021743039864e13f220174489bd822e0c162872a7
|
data/CHANGELOG.md
CHANGED
@@ -1,5 +1,49 @@
|
|
1
1
|
## [Unreleased]
|
2
2
|
|
3
|
+
## [0.3.0] - 2025-09-19
|
4
|
+
|
5
|
+
### Added
|
6
|
+
|
7
|
+
- Added `TopSecret::Text.scan` method for detecting sensitive information without redacting text
|
8
|
+
- Added `TopSecret::Text::ScanResult` class to hold scan operation results with `mapping` and `sensitive?` methods
|
9
|
+
- Added `TopSecret::Text::GlobalMapping` class to manage consistent labeling across multiple filtering operations
|
10
|
+
- Added factory methods to domain objects: `BatchResult.from_messages`, `Result.from_messages`, and `Result.with_global_labels`
|
11
|
+
- Added support for disabling NER filtering by setting `model_path` to `nil` for improved performance and deployment flexibility
|
12
|
+
- Added support for Rails 7.0 and newer
|
13
|
+
- Added `#safe?` predicate method as the logical opposite of `#sensitive?` for `BatchResult`, `Result` and `ScanResult` classes
|
14
|
+
|
15
|
+
### Changed
|
16
|
+
|
17
|
+
- **BREAKING:** `TopSecret::Text.filter_all` now returns `TopSecret::Text::Result` objects instead of `TopSecret::Text::BatchResult::Item` objects for individual items
|
18
|
+
- Each item in `BatchResult#items` now includes an individual `mapping` attribute containing only the sensitive information found in that specific message
|
19
|
+
- `TopSecret::Text.filter_all` now only processes sensitive results when building global mappings, improving efficiency
|
20
|
+
- Refactored `TopSecret::Text.filter_all` to use domain objects with better separation of concerns and testability
|
21
|
+
- Improved performance by implementing lazy loading of MITIE model and document processing
|
22
|
+
- NER filtering now gracefully falls back when MITIE model is unavailable, continuing with regex-based filters only
|
23
|
+
|
24
|
+
## [0.2.0] - 2025-08-18
|
25
|
+
|
26
|
+
### Added
|
27
|
+
|
28
|
+
- Added `TopSecret::Text.filter_all` for batch processing multiple messages with globally consistent redaction labels
|
29
|
+
- Added `TopSecret::Text::BatchResult` class to hold results from batch operations
|
30
|
+
- Added `TopSecret::FilteredText` class for restoring filtered text by substituting placeholders with original values
|
31
|
+
- Added `TopSecret::FilteredText::Result` class to track restoration success and failures
|
32
|
+
|
33
|
+
### Changed
|
34
|
+
|
35
|
+
- **BREAKING:** Moved `TopSecret::Result` to `TopSecret::Text::Result` and `TopSecret::BatchResult` to `TopSecret::Text::BatchResult` for better namespace organization
|
36
|
+
- **BREAKING:** Refactored configuration system to use individual filter accessors instead of nested `default_filters`
|
37
|
+
- Updated `TopSecret::Text.filter` to accept keyword arguments for filter overrides and `custom_filters` array
|
38
|
+
- Each default filter now has its own configuration accessor (e.g., `TopSecret.email_filter`, `TopSecret.people_filter`)
|
39
|
+
|
40
|
+
### Migration Guide
|
41
|
+
|
42
|
+
- Replace `TopSecret::Result` with `TopSecret::Text::Result` and `TopSecret::BatchResult` with `TopSecret::Text::BatchResult`
|
43
|
+
- Replace `TopSecret.configure { |c| c.default_filters.email_filter = filter }` with `TopSecret.configure { |c| c.email_filter = filter }`
|
44
|
+
- Replace `TopSecret::Text.filter(text, filters: { email_filter: filter })` with `TopSecret::Text.filter(text, email_filter: filter)`
|
45
|
+
- For new filters, use `TopSecret::Text.filter(text, custom_filters: [filter])` instead of adding to `default_filters`
|
46
|
+
|
3
47
|
## [0.1.1] - 2025-08-08
|
4
48
|
|
5
49
|
- Ensure `TopSecret.min_confidence_score` is respected
|
data/CODEOWNERS
ADDED
@@ -0,0 +1,15 @@
|
|
1
|
+
# Lines starting with '#' are comments.
|
2
|
+
# Each line is a file pattern followed by one or more owners.
|
3
|
+
|
4
|
+
# More details are here: https://help.github.com/articles/about-codeowners/
|
5
|
+
|
6
|
+
# The '*' pattern is global owners.
|
7
|
+
|
8
|
+
# Order is important. The last matching pattern has the most precedence.
|
9
|
+
# The folders are ordered as follows:
|
10
|
+
|
11
|
+
# In each subsection folders are ordered first by depth, then alphabetically.
|
12
|
+
# This should make it easy to add new rules without breaking existing ones.
|
13
|
+
|
14
|
+
# Global rule:
|
15
|
+
* @stevepolitodesign
|
data/README.md
CHANGED
@@ -1,6 +1,8 @@
|
|
1
1
|
# Top Secret
|
2
2
|
|
3
|
-
|
3
|
+
[](https://github.com/thoughtbot/top_secret/actions/workflows/main.yml)
|
4
|
+
|
5
|
+
Filter sensitive information from free text before sending it to external services or APIs, such as chatbots and LLMs.
|
4
6
|
|
5
7
|
By default it filters the following:
|
6
8
|
|
@@ -32,6 +34,13 @@ gem install top_secret
|
|
32
34
|
>
|
33
35
|
> You'll need to download and extract [ner_model.dat][] first.
|
34
36
|
|
37
|
+
> [!TIP]
|
38
|
+
> Due to its large size, you'll likely want to avoid committing [ner_model.dat][] into version control.
|
39
|
+
>
|
40
|
+
> You'll need to ensure the file exists in deployed environments. See relevant [discussion][discussions_60] for details.
|
41
|
+
>
|
42
|
+
> Alternatively, you can disable NER filtering entirely by setting `model_path` to `nil` if you only need regex-based filters (credit cards, emails, phone numbers, SSNs). This improves performance and eliminates the model file dependency.
|
43
|
+
|
35
44
|
By default, Top Secret assumes the file will live at the root of your project, but this can be configured.
|
36
45
|
|
37
46
|
```ruby
|
@@ -123,7 +132,7 @@ TopSecret::Text.filter("Ralph can be reached at ralph@thoughtbot.com")
|
|
123
132
|
This will return
|
124
133
|
|
125
134
|
```ruby
|
126
|
-
<TopSecret::Result
|
135
|
+
<TopSecret::Text::Result
|
127
136
|
@input="Ralph can be reached at ralph@thoughtbot.com",
|
128
137
|
@mapping={:EMAIL_1=>"ralph@thoughtbot.com", :PERSON_1=>"Ralph"},
|
129
138
|
@output="[PERSON_1] can be reached at [EMAIL_1]"
|
@@ -154,26 +163,264 @@ result.mapping
|
|
154
163
|
# => {:EMAIL_1=>"ralph@thoughtbot.com", :PERSON_1=>"Ralph"}
|
155
164
|
```
|
156
165
|
|
166
|
+
Check if sensitive information was found
|
167
|
+
|
168
|
+
```ruby
|
169
|
+
result.sensitive?
|
170
|
+
|
171
|
+
# => true
|
172
|
+
|
173
|
+
result.safe?
|
174
|
+
|
175
|
+
# => false
|
176
|
+
```
|
177
|
+
|
178
|
+
### Scanning for Sensitive Information
|
179
|
+
|
180
|
+
Use `TopSecret::Text.scan` to detect sensitive information without redacting the text. This is useful when you only need to check if sensitive data exists or get a mapping of what was found:
|
181
|
+
|
182
|
+
```ruby
|
183
|
+
TopSecret::Text.scan("Ralph can be reached at ralph@thoughtbot.com")
|
184
|
+
```
|
185
|
+
|
186
|
+
This will return
|
187
|
+
|
188
|
+
```ruby
|
189
|
+
<TopSecret::Text::ScanResult
|
190
|
+
@mapping={:EMAIL_1=>"ralph@thoughtbot.com", :PERSON_1=>"Ralph"}
|
191
|
+
>
|
192
|
+
```
|
193
|
+
|
194
|
+
Check if sensitive information was found
|
195
|
+
|
196
|
+
```ruby
|
197
|
+
result.sensitive?
|
198
|
+
|
199
|
+
# => true
|
200
|
+
|
201
|
+
result.safe?
|
202
|
+
|
203
|
+
# => false
|
204
|
+
```
|
205
|
+
|
206
|
+
View the mapping of found sensitive information
|
207
|
+
|
208
|
+
```ruby
|
209
|
+
result.mapping
|
210
|
+
|
211
|
+
# => {:EMAIL_1=>"ralph@thoughtbot.com", :PERSON_1=>"Ralph"}
|
212
|
+
```
|
213
|
+
|
214
|
+
The `scan` method accepts the same filter options as `filter`:
|
215
|
+
|
216
|
+
```ruby
|
217
|
+
# Override default filters
|
218
|
+
email_filter = TopSecret::Filters::Regex.new(
|
219
|
+
label: "EMAIL_ADDRESS",
|
220
|
+
regex: /\w+\[at\]\w+\.\w+/
|
221
|
+
)
|
222
|
+
result = TopSecret::Text.scan("Contact user[at]example.com", email_filter:)
|
223
|
+
result.mapping
|
224
|
+
# => {:EMAIL_ADDRESS_1=>"user[at]example.com"}
|
225
|
+
|
226
|
+
# Disable specific filters
|
227
|
+
result = TopSecret::Text.scan("Ralph works in Boston", people_filter: nil)
|
228
|
+
result.mapping
|
229
|
+
# => {:LOCATION_1=>"Boston"}
|
230
|
+
|
231
|
+
# Add custom filters
|
232
|
+
ip_filter = TopSecret::Filters::Regex.new(
|
233
|
+
label: "IP_ADDRESS",
|
234
|
+
regex: /\b\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}\b/
|
235
|
+
)
|
236
|
+
result = TopSecret::Text.scan("Server IP is 192.168.1.1", custom_filters: [ip_filter])
|
237
|
+
result.mapping
|
238
|
+
# => {:IP_ADDRESS_1=>"192.168.1.1"}
|
239
|
+
```
|
240
|
+
|
241
|
+
### Batch Processing
|
242
|
+
|
243
|
+
When processing multiple messages, use `filter_all` to ensure consistent redaction labels across all messages:
|
244
|
+
|
245
|
+
```ruby
|
246
|
+
messages = [
|
247
|
+
"Contact ralph@thoughtbot.com for details",
|
248
|
+
"Email ralph@thoughtbot.com again if needed",
|
249
|
+
"Also CC ruby@thoughtbot.com on the thread"
|
250
|
+
]
|
251
|
+
|
252
|
+
result = TopSecret::Text.filter_all(messages)
|
253
|
+
```
|
254
|
+
|
255
|
+
This will return
|
256
|
+
|
257
|
+
```ruby
|
258
|
+
<TopSecret::Text::BatchResult
|
259
|
+
@mapping={:EMAIL_1=>"ralph@thoughtbot.com", :EMAIL_2=>"ruby@thoughtbot.com"},
|
260
|
+
@items=[
|
261
|
+
<TopSecret::Text::Result @input="Contact ralph@thoughtbot.com for details", @output="Contact [EMAIL_1] for details", @mapping={:EMAIL_1=>"ralph@thoughtbot.com"}>,
|
262
|
+
<TopSecret::Text::Result @input="Email ralph@thoughtbot.com again if needed", @output="Email [EMAIL_1] again if needed", @mapping={:EMAIL_1=>"ralph@thoughtbot.com"}>,
|
263
|
+
<TopSecret::Text::Result @input="Also CC ruby@thoughtbot.com on the thread", @output="Also CC [EMAIL_2] on the thread", @mapping={:EMAIL_2=>"ruby@thoughtbot.com"}>
|
264
|
+
]
|
265
|
+
>
|
266
|
+
```
|
267
|
+
|
268
|
+
Access the global mapping
|
269
|
+
|
270
|
+
```ruby
|
271
|
+
result.mapping
|
272
|
+
|
273
|
+
# => {:EMAIL_1=>"ralph@thoughtbot.com", :EMAIL_2=>"ruby@thoughtbot.com"}
|
274
|
+
```
|
275
|
+
|
276
|
+
Access individual items
|
277
|
+
|
278
|
+
```ruby
|
279
|
+
result.items[0].input
|
280
|
+
# => "Contact ralph@thoughtbot.com for details"
|
281
|
+
|
282
|
+
result.items[0].output
|
283
|
+
# => "Contact [EMAIL_1] for details"
|
284
|
+
|
285
|
+
result.items[0].mapping
|
286
|
+
# => {:EMAIL_1=>"ralph@thoughtbot.com"}
|
287
|
+
|
288
|
+
result.items[0].sensitive?
|
289
|
+
# => true
|
290
|
+
|
291
|
+
result.items[0].safe?
|
292
|
+
# => false
|
293
|
+
```
|
294
|
+
|
295
|
+
The key benefit is that identical values receive the same labels across all messages - notice how `ralph@thoughtbot.com` becomes `[EMAIL_1]` in both the first and second messages.
|
296
|
+
|
297
|
+
Each item also maintains its own mapping containing only the sensitive information found in that specific message, while the batch result provides a global mapping of all sensitive information across all messages.
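The relationship between per-message mappings and the global mapping can be illustrated with a small standalone sketch. It does not load the gem: `global_labels` is a hypothetical helper, and the one-counter-per-label-type strategy is an assumption about how consistent labels could be assigned, not the library's actual implementation.

```ruby
# Standalone sketch; hypothetical helper, not part of top_secret.
# Takes one mapping per message (label => value) and returns a single
# mapping in which each distinct value gets exactly one label.
def global_labels(per_message_mappings)
  counters = Hash.new(0)   # next index per label type, e.g. "EMAIL" => 2
  label_for_value = {}     # value => globally unique label

  per_message_mappings.each do |mapping|
    mapping.each do |label, value|
      next if label_for_value.key?(value) # already labeled in an earlier message

      type = label.to_s.sub(/_\d+\z/, "") # :EMAIL_1 -> "EMAIL"
      counters[type] += 1
      label_for_value[value] = :"#{type}_#{counters[type]}"
    end
  end

  label_for_value.invert   # label => value, like the mappings shown above
end

local_mappings = [
  {EMAIL_1: "ralph@thoughtbot.com"},
  {EMAIL_1: "ralph@thoughtbot.com"},
  {EMAIL_1: "ruby@thoughtbot.com"}
]

global = global_labels(local_mappings)
# => {:EMAIL_1=>"ralph@thoughtbot.com", :EMAIL_2=>"ruby@thoughtbot.com"}
```

Keying the intermediate hash by value is what makes a repeated address reuse its earlier label instead of receiving a new one.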
|
298
|
+
|
299
|
+
### Restoring Filtered Text
|
300
|
+
|
301
|
+
When external services (like LLMs) return responses containing filter placeholders, use `TopSecret::FilteredText.restore` to substitute them back with original values:
|
302
|
+
|
303
|
+
```ruby
|
304
|
+
# Filter messages before sending to LLM
|
305
|
+
messages = ["Contact ralph@thoughtbot.com for details"]
|
306
|
+
batch_result = TopSecret::Text.filter_all(messages)
|
307
|
+
|
308
|
+
# Send filtered text to LLM: "Contact [EMAIL_1] for details"
|
309
|
+
# LLM responds with: "I'll email [EMAIL_1] about this request"
|
310
|
+
llm_response = "I'll email [EMAIL_1] about this request"
|
311
|
+
|
312
|
+
# Restore the original values
|
313
|
+
restore_result = TopSecret::FilteredText.restore(llm_response, mapping: batch_result.mapping)
|
314
|
+
```
|
315
|
+
|
316
|
+
This will return
|
317
|
+
|
318
|
+
```ruby
|
319
|
+
<TopSecret::FilteredText::Result
|
320
|
+
@output="I'll email ralph@thoughtbot.com about this request",
|
321
|
+
@restored=["[EMAIL_1]"],
|
322
|
+
@unrestored=[]
|
323
|
+
>
|
324
|
+
```
|
325
|
+
|
326
|
+
Access the restored text
|
327
|
+
|
328
|
+
```ruby
|
329
|
+
restore_result.output
|
330
|
+
# => "I'll email ralph@thoughtbot.com about this request"
|
331
|
+
```
|
332
|
+
|
333
|
+
Track which placeholders were restored
|
334
|
+
|
335
|
+
```ruby
|
336
|
+
restore_result.restored
|
337
|
+
# => ["[EMAIL_1]"]
|
338
|
+
|
339
|
+
restore_result.unrestored
|
340
|
+
# => []
|
341
|
+
```
|
342
|
+
|
343
|
+
The restoration process tracks both successful and failed placeholder substitutions, allowing you to handle cases where the LLM response contains placeholders not found in your mapping.
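To see what a failed substitution looks like, here is a minimal standalone sketch that mirrors the restore logic shown in `lib/top_secret/filtered_text.rb` in this diff. It runs without the gem; `restore_sketch` and its hash return value are illustrative names, not part of the library's API.

```ruby
# Standalone sketch; mirrors the restore steps without requiring top_secret.
def restore_sketch(filtered_text, mapping)
  output = filtered_text.dup
  restored = []

  mapping.each do |label, value|
    placeholder = "[#{label}]"
    next unless output.include?(placeholder)

    restored << placeholder
    output.gsub!(placeholder, value)
  end

  # Anything that still looks like a placeholder was not in the mapping.
  unrestored = output.scan(/\[\w*_\d\]/)

  {output: output, restored: restored, unrestored: unrestored}
end

result = restore_sketch(
  "Ask [EMAIL_1] or [PERSON_2]",
  {EMAIL_1: "ralph@thoughtbot.com"}
)

result[:output]     # => "Ask ralph@thoughtbot.com or [PERSON_2]"
result[:restored]   # => ["[EMAIL_1]"]
result[:unrestored] # => ["[PERSON_2]"]
```

A caller could check whether `unrestored` is empty before showing the response to a user, for example to log or retry when the LLM invents a placeholder.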
|
344
|
+
|
345
|
+
### Working with LLMs
|
346
|
+
|
347
|
+
When sending filtered information to LLMs, they'll likely need to be instructed on how to handle those filters. Otherwise, we risk them not being returned in the response, which would break the restoration process.
|
348
|
+
|
349
|
+
Here's a recommended approach:
|
350
|
+
|
351
|
+
```ruby
|
352
|
+
instructions = <<~TEXT
|
353
|
+
I'm going to send filtered information to you in the form of free text.
|
354
|
+
If you need to refer to the filtered information in a response, just reference it by the filter.
|
355
|
+
TEXT
|
356
|
+
```
|
357
|
+
|
358
|
+
Complete example:
|
359
|
+
|
360
|
+
```ruby
|
361
|
+
require "openai"
|
362
|
+
require "top_secret"
|
363
|
+
|
364
|
+
openai = OpenAI::Client.new(
|
365
|
+
api_key: Rails.application.credentials.openai.api_key!
|
366
|
+
)
|
367
|
+
|
368
|
+
original_messages = [
|
369
|
+
"Ralph lives in Boston.",
|
370
|
+
"You can reach them at ralph@thoughtbot.com or 877-976-2687"
|
371
|
+
]
|
372
|
+
|
373
|
+
# Filter all messages
|
374
|
+
result = TopSecret::Text.filter_all(original_messages)
|
375
|
+
filtered_messages = result.items.map(&:output)
|
376
|
+
|
377
|
+
user_messages = filtered_messages.map { {role: "user", content: it} }
|
378
|
+
|
379
|
+
# Instruct LLM how to handle filtered messages
|
380
|
+
instructions = <<~TEXT
|
381
|
+
I'm going to send filtered information to you in the form of free text.
|
382
|
+
If you need to refer to the filtered information in a response, just reference it by the filter.
|
383
|
+
TEXT
|
384
|
+
|
385
|
+
messages = [
|
386
|
+
{role: "system", content: instructions},
|
387
|
+
*user_messages
|
388
|
+
]
|
389
|
+
|
390
|
+
chat_completion = openai.chat.completions.create(messages:, model: :"gpt-5")
|
391
|
+
response = chat_completion.choices.last.message.content
|
392
|
+
|
393
|
+
# Restore the response from the mapping
|
394
|
+
mapping = result.mapping
|
395
|
+
restored_response = TopSecret::FilteredText.restore(response, mapping:).output
|
396
|
+
|
397
|
+
puts(restored_response)
|
398
|
+
```
|
399
|
+
|
157
400
|
### Advanced Examples
|
158
401
|
|
159
402
|
#### Overriding the default filters
|
160
403
|
|
161
404
|
When overriding or [disabling](#disabling-a-default-filter-1) a [default filter](#default-filters), you must map to the correct key.
|
162
405
|
|
406
|
+
> [!IMPORTANT]
|
407
|
+
> Invalid filter keys will raise an `ArgumentError`. Only the following keys are valid:
|
408
|
+
> `credit_card_filter`, `email_filter`, `phone_number_filter`, `ssn_filter`, `people_filter`, `location_filter`
|
409
|
+
|
163
410
|
```ruby
|
164
411
|
regex_filter = TopSecret::Filters::Regex.new(label: "EMAIL_ADDRESS", regex: /\b\w+\[at\]\w+\.\w+\b/)
|
165
412
|
ner_filter = TopSecret::Filters::NER.new(label: "NAME", tag: :person, min_confidence_score: 0.25)
|
166
413
|
|
167
|
-
TopSecret::Text.filter("Ralph can be reached at ralph[at]thoughtbot.com",
|
414
|
+
TopSecret::Text.filter("Ralph can be reached at ralph[at]thoughtbot.com",
|
168
415
|
email_filter: regex_filter,
|
169
416
|
people_filter: ner_filter
|
170
|
-
|
417
|
+
)
|
171
418
|
```
|
172
419
|
|
173
420
|
This will return
|
174
421
|
|
175
422
|
```ruby
|
176
|
-
<TopSecret::Result
|
423
|
+
<TopSecret::Text::Result
|
177
424
|
@input="Ralph can be reached at ralph[at]thoughtbot.com",
|
178
425
|
@mapping={:EMAIL_ADDRESS_1=>"ralph[at]thoughtbot.com", :NAME_1=>"Ralph", :NAME_2=>"ralph["},
|
179
426
|
@output="[NAME_1] can be reached at [EMAIL_ADDRESS_1]"
|
@@ -183,22 +430,29 @@ This will return
|
|
183
430
|
#### Disabling a default filter
|
184
431
|
|
185
432
|
```ruby
|
186
|
-
TopSecret::Text.filter("Ralph can be reached at ralph@thoughtbot.com",
|
433
|
+
TopSecret::Text.filter("Ralph can be reached at ralph@thoughtbot.com",
|
187
434
|
email_filter: nil,
|
188
435
|
people_filter: nil
|
189
|
-
|
436
|
+
)
|
190
437
|
```
|
191
438
|
|
192
439
|
This will return
|
193
440
|
|
194
441
|
```ruby
|
195
|
-
<TopSecret::Result
|
442
|
+
<TopSecret::Text::Result
|
196
443
|
@input="Ralph can be reached at ralph@thoughtbot.com",
|
197
444
|
@mapping={},
|
198
445
|
@output="Ralph can be reached at ralph@thoughtbot.com"
|
199
446
|
>
|
200
447
|
```
|
201
448
|
|
449
|
+
#### Error handling for invalid filter keys
|
450
|
+
|
451
|
+
```ruby
|
452
|
+
# This will raise ArgumentError: Unknown key: :invalid_filter. Valid keys are: ...
|
453
|
+
TopSecret::Text.filter("some text", invalid_filter: some_filter)
|
454
|
+
```
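The check behind this error can be sketched in a few lines. This is an assumption-laden illustration, not the gem's code: `VALID_FILTER_KEYS` and `validate_filter_keys!` are hypothetical names, and the message only imitates the format quoted above.

```ruby
# Standalone sketch; hypothetical names, not the gem's implementation.
VALID_FILTER_KEYS = %i[
  credit_card_filter email_filter phone_number_filter
  ssn_filter people_filter location_filter
].freeze

# Returns the filters unchanged, or raises ArgumentError on the first unknown key.
def validate_filter_keys!(filters)
  unknown = filters.keys - VALID_FILTER_KEYS
  return filters if unknown.empty?

  valid = VALID_FILTER_KEYS.map(&:inspect).join(", ")
  raise ArgumentError, "Unknown key: #{unknown.first.inspect}. Valid keys are: #{valid}"
end

validate_filter_keys!(email_filter: nil) # passes: known key

begin
  validate_filter_keys!(invalid_filter: nil)
rescue ArgumentError => error
  message = error.message
end
```

Validating up front means a misspelled key such as `emial_filter` fails loudly rather than silently leaving the default filter in place.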
|
455
|
+
|
202
456
|
### Custom Filters
|
203
457
|
|
204
458
|
#### Adding new [Regex filters][]
|
@@ -209,15 +463,15 @@ ip_address_filter = TopSecret::Filters::Regex.new(
|
|
209
463
|
regex: /\b\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}\b/
|
210
464
|
)
|
211
465
|
|
212
|
-
TopSecret::Text.filter("Ralph's IP address is 192.168.1.1",
|
213
|
-
|
214
|
-
|
466
|
+
TopSecret::Text.filter("Ralph's IP address is 192.168.1.1",
|
467
|
+
custom_filters: [ip_address_filter]
|
468
|
+
)
|
215
469
|
```
|
216
470
|
|
217
471
|
This will return
|
218
472
|
|
219
473
|
```ruby
|
220
|
-
<TopSecret::Result
|
474
|
+
<TopSecret::Text::Result
|
221
475
|
@input="Ralph's IP address is 192.168.1.1",
|
222
476
|
@mapping={:PERSON_1=>"Ralph", :IP_ADDRESS_1=>"192.168.1.1"},
|
223
477
|
@output="[PERSON_1]'s IP address is [IP_ADDRESS_1]"
|
@@ -235,15 +489,15 @@ language_filter = TopSecret::Filters::NER.new(
|
|
235
489
|
min_confidence_score: 0.75
|
236
490
|
)
|
237
491
|
|
238
|
-
TopSecret::Text.filter("Ralph's favorite programming language is Ruby.",
|
239
|
-
|
240
|
-
|
492
|
+
TopSecret::Text.filter("Ralph's favorite programming language is Ruby.",
|
493
|
+
custom_filters: [language_filter]
|
494
|
+
)
|
241
495
|
```
|
242
496
|
|
243
497
|
This will return
|
244
498
|
|
245
499
|
```ruby
|
246
|
-
<TopSecret::Result
|
500
|
+
<TopSecret::Text::Result
|
247
501
|
@input="Ralph's favorite programming language is Ruby.",
|
248
502
|
@mapping={:PERSON_1=>"Ralph", :LANGUAGE_1=>"Ruby"},
|
249
503
|
@output="[PERSON_1]'s favorite programming language is [LANGUAGE_1]"
|
@@ -265,9 +519,9 @@ regex_filter = TopSecret::Filters::Regex.new(
|
|
265
519
|
regex: /\b\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}\b/
|
266
520
|
)
|
267
521
|
|
268
|
-
result = TopSecret::Text.filter("Server IP: 192.168.1.1",
|
269
|
-
|
270
|
-
|
522
|
+
result = TopSecret::Text.filter("Server IP: 192.168.1.1",
|
523
|
+
custom_filters: [regex_filter]
|
524
|
+
)
|
271
525
|
|
272
526
|
result.output
|
273
527
|
# => "Server IP: [IP_ADDRESS_1]"
|
@@ -285,9 +539,9 @@ ner_filter = TopSecret::Filters::NER.new(
|
|
285
539
|
min_confidence_score: 0.25
|
286
540
|
)
|
287
541
|
|
288
|
-
result = TopSecret::Text.filter("Ralph and Ruby work at thoughtbot.",
|
542
|
+
result = TopSecret::Text.filter("Ralph and Ruby work at thoughtbot.",
|
289
543
|
people_filter: ner_filter
|
290
|
-
|
544
|
+
)
|
291
545
|
|
292
546
|
result.output
|
293
547
|
# => "[PERSON_1] and [PERSON_2] work at thoughtbot."
|
@@ -314,6 +568,22 @@ TopSecret.configure do |config|
|
|
314
568
|
end
|
315
569
|
```
|
316
570
|
|
571
|
+
### Disabling NER filtering
|
572
|
+
|
573
|
+
For improved performance or when the MITIE model file cannot be deployed, you can disable NER-based filtering entirely. This will disable people and location detection but retain all regex-based filters (credit cards, emails, phone numbers, SSNs):
|
574
|
+
|
575
|
+
```ruby
|
576
|
+
TopSecret.configure do |config|
|
577
|
+
config.model_path = nil
|
578
|
+
end
|
579
|
+
```
|
580
|
+
|
581
|
+
This is useful in environments where:
|
582
|
+
|
583
|
+
- The model file cannot be deployed due to size constraints
|
584
|
+
- You only need regex-based filtering
|
585
|
+
- You want to optimize for performance over NER capabilities
|
586
|
+
|
317
587
|
### Overriding the confidence score
|
318
588
|
|
319
589
|
```ruby
|
@@ -326,7 +596,7 @@ end
|
|
326
596
|
|
327
597
|
```ruby
|
328
598
|
TopSecret.configure do |config|
|
329
|
-
config.
|
599
|
+
config.email_filter = TopSecret::Filters::Regex.new(
|
330
600
|
label: "EMAIL_ADDRESS",
|
331
601
|
regex: /\b\w+\[at\]\w+\.\w+\b/
|
332
602
|
)
|
@@ -337,18 +607,20 @@ end
|
|
337
607
|
|
338
608
|
```ruby
|
339
609
|
TopSecret.configure do |config|
|
340
|
-
config.
|
610
|
+
config.email_filter = nil
|
341
611
|
end
|
342
612
|
```
|
343
613
|
|
344
|
-
### Adding
|
614
|
+
### Adding custom filters globally
|
345
615
|
|
346
616
|
```ruby
|
617
|
+
ip_address_filter = TopSecret::Filters::Regex.new(
|
618
|
+
label: "IP_ADDRESS",
|
619
|
+
regex: /\b\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}\b/
|
620
|
+
)
|
621
|
+
|
347
622
|
TopSecret.configure do |config|
|
348
|
-
config.
|
349
|
-
label: "IP_ADDRESS",
|
350
|
-
regex: /\b\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}\b/
|
351
|
-
)
|
623
|
+
config.custom_filters << ip_address_filter
|
352
624
|
end
|
353
625
|
```
|
354
626
|
|
@@ -361,11 +633,26 @@ After checking out the repo, run `bin/setup` to install dependencies. Then, run
|
|
361
633
|
>
|
362
634
|
> You'll need to download and extract [ner_model.dat][] first, and place it in the root of this project.
|
363
635
|
|
636
|
+
### Performance Benchmarks
|
637
|
+
|
638
|
+
Run `bin/benchmark` to test performance and catch regressions:
|
639
|
+
|
640
|
+
```bash
|
641
|
+
bin/benchmark # CI-optimized benchmark with pass/fail thresholds
|
642
|
+
```
|
643
|
+
|
644
|
+
> [!NOTE]
|
645
|
+
> When adding new public methods to the API, ensure they are included in the benchmark script to catch performance regressions.
|
646
|
+
|
364
647
|
To install this gem onto your local machine, run `bundle exec rake install`. To release a new version, update the version number in `version.rb`, and then run `bundle exec rake release`, which will create a git tag for the version, push git commits and the created tag, and push the `.gem` file to [rubygems.org](https://rubygems.org).
|
365
648
|
|
366
649
|
## Contributing
|
367
650
|
|
368
|
-
Bug reports
|
651
|
+
[Bug reports](https://github.com/thoughtbot/top_secret/issues/new?template=bug_report.md) and [pull requests](https://github.com/thoughtbot/top_secret/pulls) are welcome on GitHub at [https://github.com/thoughtbot/top_secret](https://github.com/thoughtbot/top_secret).
|
652
|
+
|
653
|
+
Please create a [new discussion](https://github.com/thoughtbot/top_secret/discussions/new?category=ideas) if you want to share ideas for new features.
|
654
|
+
|
655
|
+
This project is intended to be a safe, welcoming space for collaboration, and contributors are expected to adhere to the [code of conduct](https://github.com/thoughtbot/top_secret/blob/main/CODE_OF_CONDUCT.md).
|
369
656
|
|
370
657
|
## License
|
371
658
|
|
@@ -400,3 +687,4 @@ We are [available for hire][hire].
|
|
400
687
|
[train]: https://github.com/ankane/mitie-ruby?tab=readme-ov-file#training
|
401
688
|
[Regex filters]: https://github.com/thoughtbot/top_secret/blob/main/lib/top_secret/filters/regex.rb
|
402
689
|
[NER filters]: https://github.com/thoughtbot/top_secret/blob/main/lib/top_secret/filters/ner.rb
|
690
|
+
[discussions_60]: https://github.com/thoughtbot/top_secret/discussions/60
|
data/lib/top_secret/constants.rb
CHANGED
data/lib/top_secret/filtered_text/result.rb
ADDED
@@ -0,0 +1,29 @@
|
|
1
|
+
# frozen_string_literal: true
|
2
|
+
|
3
|
+
module TopSecret
|
4
|
+
class FilteredText
|
5
|
+
# Result object returned by FilteredText restoration operations.
|
6
|
+
#
|
7
|
+
# Contains the restored text along with tracking information about which
|
8
|
+
# placeholders were successfully restored and which remain unrestored.
|
9
|
+
class Result
|
10
|
+
# @return [String] The text with placeholders restored to original values
|
11
|
+
attr_reader :output
|
12
|
+
|
13
|
+
# @return [Array<String>] Array of placeholder strings that could not be restored
|
14
|
+
attr_reader :unrestored
|
15
|
+
|
16
|
+
# @return [Array<String>] Array of placeholder strings that were successfully restored
|
17
|
+
attr_reader :restored
|
18
|
+
|
19
|
+
# @param output [String] The restored text
|
20
|
+
# @param unrestored [Array<String>] Placeholders that could not be restored
|
21
|
+
# @param restored [Array<String>] Placeholders that were successfully restored
|
22
|
+
def initialize(output, unrestored, restored)
|
23
|
+
@output = output
|
24
|
+
@unrestored = unrestored
|
25
|
+
@restored = restored
|
26
|
+
end
|
27
|
+
end
|
28
|
+
end
|
29
|
+
end
|
data/lib/top_secret/filtered_text.rb
ADDED
@@ -0,0 +1,73 @@
|
|
1
|
+
# frozen_string_literal: true
|
2
|
+
|
3
|
+
require_relative "filtered_text/result"
|
4
|
+
|
5
|
+
module TopSecret
|
6
|
+
# Restores filtered text by substituting placeholders with original values.
|
7
|
+
#
|
8
|
+
# This class is used to reverse the filtering process, typically when processing
|
9
|
+
# responses from external services like LLMs that may contain filtered placeholders.
|
10
|
+
class FilteredText
|
11
|
+
# @return [String] The text being processed for restoration
|
12
|
+
attr_reader :output
|
13
|
+
|
14
|
+
# @param filtered_text [String] Text containing filter placeholders like [EMAIL_1]
|
15
|
+
# @param mapping [Hash] Hash mapping filter symbols to original values
|
16
|
+
def initialize(filtered_text, mapping:)
|
17
|
+
@mapping = mapping
|
18
|
+
@output = filtered_text.dup
|
19
|
+
end
|
20
|
+
|
21
|
+
# Convenience method to restore filtered text in one call
|
22
|
+
#
|
23
|
+
# @param filtered_text [String] Text containing filter placeholders
|
24
|
+
# @param mapping [Hash] Hash mapping filter symbols to original values
|
25
|
+
# @return [Result] Contains restored text and tracking information
|
26
|
+
#
|
27
|
+
# @example Basic restoration
|
28
|
+
# mapping = {EMAIL_1: "john@example.com"}
|
29
|
+
# result = TopSecret::FilteredText.restore("Contact [EMAIL_1]", mapping: mapping)
|
30
|
+
# result.output # => "Contact john@example.com"
|
31
|
+
# result.restored # => ["[EMAIL_1]"]
|
32
|
+
# result.unrestored # => []
|
33
|
+
def self.restore(filtered_text, mapping:)
|
34
|
+
new(filtered_text, mapping:).restore
|
35
|
+
end
|
36
|
+
|
37
|
+
# Performs the restoration process
|
38
|
+
#
|
39
|
+
# Substitutes all found placeholders with their mapped values and tracks
|
40
|
+
# which placeholders were successfully restored vs those that remain unrestored.
|
41
|
+
#
|
42
|
+
# @return [Result] Contains the restored text and tracking arrays
|
43
|
+
def restore
|
44
|
+
restored = []
|
45
|
+
|
46
|
+
mapping.each do |filter, value|
|
47
|
+
placeholder = build_placeholder(filter)
|
48
|
+
|
49
|
+
if output.include? placeholder
|
50
|
+
restored << placeholder
|
51
|
+
output.gsub! placeholder, value
|
52
|
+
end
|
53
|
+
end
|
54
|
+
|
55
|
+
unrestored = output.scan(/\[\w*_\d\]/)
|
56
|
+
|
57
|
+
Result.new(output, unrestored, restored)
|
58
|
+
end
|
59
|
+
|
60
|
+
private
|
61
|
+
|
62
|
+
# @return [Hash] Mapping from filter symbols to original values
|
63
|
+
attr_reader :mapping
|
64
|
+
|
65
|
+
# Builds a placeholder string from a filter symbol
|
66
|
+
#
|
67
|
+
# @param filter [Symbol] The filter symbol (e.g., :EMAIL_1)
|
68
|
+
# @return [String] The placeholder string (e.g., "[EMAIL_1]")
|
69
|
+
def build_placeholder(filter)
|
70
|
+
"[#{filter}]"
|
71
|
+
end
|
72
|
+
end
|
73
|
+
end
|
data/lib/top_secret/mapping.rb
ADDED
@@ -0,0 +1,15 @@
|
|
1
|
+
# frozen_string_literal: true
|
2
|
+
|
3
|
+
module TopSecret
|
4
|
+
module Mapping
|
5
|
+
# @return [Boolean] Whether sensitive information was found
|
6
|
+
def sensitive?
|
7
|
+
mapping.any?
|
8
|
+
end
|
9
|
+
|
10
|
+
# @return [Boolean] Whether sensitive information was not found
|
11
|
+
def safe?
|
12
|
+
!sensitive?
|
13
|
+
end
|
14
|
+
end
|
15
|
+
end
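As a rough illustration of how a mixin like this behaves, the sketch below copies the two predicates into a standalone module and includes it in a throwaway struct. `MappingSketch` and `DemoResult` are hypothetical names used only here; the one requirement the module places on its host is a `mapping` reader.

```ruby
# Standalone sketch; same predicates as above, under illustrative names.
module MappingSketch
  def sensitive?
    mapping.any?
  end

  def safe?
    !sensitive?
  end
end

# Any object exposing `mapping` can include the module.
DemoResult = Struct.new(:mapping) do
  include MappingSketch
end

found = DemoResult.new({EMAIL_1: "ralph@thoughtbot.com"})
clean = DemoResult.new({})

found.sensitive? # => true
found.safe?      # => false
clean.sensitive? # => false
clean.safe?      # => true
```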
|
data/lib/top_secret/null_model.rb
ADDED
@@ -0,0 +1,32 @@
|
|
1
|
+
# frozen_string_literal: true
|
2
|
+
|
3
|
+
module TopSecret
|
4
|
+
# A null object implementation that provides a no-op interface compatible with Mitie::NER.
|
5
|
+
# Used when NER filtering is disabled (model_path is nil) to eliminate conditional checks
|
6
|
+
# throughout the codebase.
|
7
|
+
#
|
8
|
+
# @example
|
9
|
+
# model = TopSecret::NullModel.new
|
10
|
+
# doc = model.doc("some text")
|
11
|
+
# doc.entities # => []
|
12
|
+
class NullModel
|
13
|
+
# A null document implementation that provides an empty entities array.
|
14
|
+
# Used as the return value from NullModel#doc to maintain interface compatibility.
|
15
|
+
class NullDoc
|
16
|
+
# Returns an empty array of entities.
|
17
|
+
#
|
18
|
+
# @return [Array] Always returns an empty array
|
19
|
+
def entities
|
20
|
+
[]
|
21
|
+
end
|
22
|
+
end
|
23
|
+
|
24
|
+
# Creates a null document that returns empty entities.
|
25
|
+
#
|
26
|
+
# @param input [String] The input text (ignored)
|
27
|
+
# @return [NullDoc] A document-like object with empty entities
|
28
|
+
def doc(input)
|
29
|
+
NullDoc.new
|
30
|
+
end
|
31
|
+
end
|
32
|
+
end
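The value of the null object is that calling code never has to ask whether NER is enabled. The sketch below shows the idea in isolation, assuming a simplified model interface: `NullModelSketch` copies the class above, while `StubModel`, `model_for`, and `entity_texts` are hypothetical stand-ins invented for this example (the stub just treats capitalized words as entities and is not how MITIE works).

```ruby
# Standalone sketch; does not load top_secret or MITIE.
class NullModelSketch
  class NullDoc
    def entities
      []
    end
  end

  def doc(_input)
    NullDoc.new
  end
end

# Hypothetical stand-in for a loaded NER model.
class StubModel
  Doc = Struct.new(:entities)

  def doc(input)
    names = input.scan(/\b[A-Z][a-z]+\b/)
    Doc.new(names.map { |name| {text: name, tag: "PERSON"} })
  end
end

# Hypothetical selection step: a nil path means NER is disabled.
def model_for(model_path)
  model_path.nil? ? NullModelSketch.new : StubModel.new
end

# The caller uses the same two calls regardless of which model it received.
def entity_texts(model, input)
  model.doc(input).entities.map { |entity| entity[:text] }
end

entity_texts(model_for(nil), "Ralph lives in Boston")
# => []
entity_texts(model_for("ner_model.dat"), "Ralph lives in Boston")
# => ["Ralph", "Boston"]
```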
|
data/lib/top_secret/text/batch_result.rb
ADDED
@@ -0,0 +1,42 @@
|
|
1
|
+
# frozen_string_literal: true
|
2
|
+
|
3
|
+
module TopSecret
|
4
|
+
class Text
|
5
|
+
# Holds the result of a batch redaction operation on multiple messages.
|
6
|
+
# Contains a global mapping that ensures consistent labeling across all messages
|
7
|
+
# and a collection of individual input/output pairs.
|
8
|
+
class BatchResult # TODO Rename to FilterBatchResult
|
9
|
+
include Mapping
|
10
|
+
|
11
|
+
# @return [Hash] Global mapping of redaction labels to original values across all messages
|
12
|
+
attr_reader :mapping
|
13
|
+
|
14
|
+
# @return [Array<Item>] Array of input/output pairs for each processed message
|
15
|
+
attr_reader :items
|
16
|
+
|
17
|
+
# Creates a new BatchResult instance
|
18
|
+
#
|
19
|
+
# @param mapping [Hash] Global mapping of redaction labels to original values
|
20
|
+
# @param items [Array<Item>] Array of input/output pairs
|
21
|
+
def initialize(mapping: {}, items: [])
|
22
|
+
@mapping = mapping
|
23
|
+
@items = items
|
24
|
+
end
|
25
|
+
|
26
|
+
# Creates a BatchResult from multiple messages with consistent global labeling
|
27
|
+
#
|
28
|
+
# @param messages [Array<String>] Array of text messages to filter
|
29
|
+
# @param custom_filters [Array] Additional custom filters to apply
|
30
|
+
# @param filters [Hash] Optional filters to override defaults (only valid filter keys accepted)
|
31
|
+
# @return [BatchResult] Contains global mapping and array of Result objects with individual mappings
|
32
|
+
# @raise [ArgumentError] If invalid filter keys are provided
|
33
|
+
def self.from_messages(messages, custom_filters: [], **filters)
|
34
|
+
individual_results = TopSecret::Text::Result.from_messages(messages, custom_filters:, **filters)
|
35
|
+
mapping = TopSecret::Text::GlobalMapping.from_results(individual_results)
|
36
|
+
items = TopSecret::Text::Result.with_global_labels(individual_results, mapping)
|
37
|
+
|
38
|
+
Text::BatchResult.new(mapping:, items:)
|
39
|
+
end
|
40
|
+
end
|
41
|
+
end
|
42
|
+
end
|
@@ -0,0 +1,63 @@
|
|
1
|
+
# frozen_string_literal: true
|
2
|
+
|
3
|
+
module TopSecret
|
4
|
+
class Text
|
5
|
+
# Manages consistent labeling across multiple filtering operations by ensuring
|
6
|
+
# identical sensitive values receive the same redaction labels globally.
|
7
|
+
class GlobalMapping
|
8
|
+
# Creates a global mapping from individual filter results
|
9
|
+
#
|
10
|
+
# @param individual_results [Array<Result>] Array of individual filter results
|
11
|
+
# @return [Hash] Inverted mapping from filter labels to original values
|
12
|
+
def self.from_results(individual_results)
|
13
|
+
new.build_from_results(individual_results)
|
14
|
+
end
|
15
|
+
|
16
|
+
# Creates a new GlobalMapping instance
|
17
|
+
def initialize
|
18
|
+
@mapping = {}
|
19
|
+
@label_counters = {}
|
20
|
+
end
|
21
|
+
|
22
|
+
# Builds the global mapping by processing all individual results
|
23
|
+
#
|
24
|
+
# @param individual_results [Array<Result>] Array of individual filter results
|
25
|
+
# @return [Hash] Inverted mapping from filter labels to original values
|
26
|
+
def build_from_results(individual_results)
|
27
|
+
individual_results.each { |result| process_result(result) if result.sensitive? }
|
28
|
+
|
29
|
+
mapping.invert
|
30
|
+
end
|
31
|
+
|
32
|
+
private
|
33
|
+
|
34
|
+
attr_reader :mapping
|
35
|
+
attr_reader :label_counters
|
36
|
+
|
37
|
+
# Processes a single result, adding new values to the global mapping
|
38
|
+
#
|
39
|
+
# @param result [Result] Individual filter result to process
|
40
|
+
def process_result(result)
|
41
|
+
result.mapping.each do |individual_key, value|
|
42
|
+
next if mapping.key?(value)
|
43
|
+
|
44
|
+
mapping[value] = generate_global_key(individual_key)
|
45
|
+
end
|
46
|
+
end
|
47
|
+
|
48
|
+
# Generates a consistent global key for a given individual key
|
49
|
+
#
|
50
|
+
# @param individual_key [Symbol] The individual key from a filter result
|
51
|
+
# @return [Symbol] The global key with consistent numbering
|
52
|
+
def generate_global_key(individual_key)
|
53
|
+
# TODO: This assumes labels are formatted consistently.
|
54
|
+
# We still need to handle the case where a label begins with an "_".
|
55
|
+
label_type = individual_key.to_s.rpartition("_").first
|
56
|
+
|
57
|
+
label_counters[label_type] ||= 0
|
58
|
+
label_counters[label_type] += 1
|
59
|
+
:"#{label_type}_#{label_counters[label_type]}"
|
60
|
+
end
|
61
|
+
end
|
62
|
+
end
|
63
|
+
end
|
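The renumbering logic in `GlobalMapping` can be sketched as a standalone function. This is a simplified illustration, assuming each per-message mapping is a plain Hash of label to value; the function name `global_mapping` and the sample data are hypothetical.

```ruby
# Simplified sketch of the global relabeling idea; not the gem's implementation.
def global_mapping(per_message_mappings)
  value_to_label = {}      # original value => global label
  counters = Hash.new(0)   # label type => how many labels were issued so far

  per_message_mappings.each do |mapping|
    mapping.each do |local_label, value|
      # A value seen in an earlier message keeps its first global label.
      next if value_to_label.key?(value)

      # "PHONE_NUMBER_1" => "PHONE_NUMBER" (everything before the last "_").
      label_type = local_label.to_s.rpartition("_").first
      counters[label_type] += 1
      value_to_label[value] = :"#{label_type}_#{counters[label_type]}"
    end
  end

  # Flip to label => value, the shape callers consume.
  value_to_label.invert
end

first = {EMAIL_1: "ralph@example.com"}
second = {EMAIL_1: "ruby@example.com", EMAIL_2: "ralph@example.com", PHONE_NUMBER_1: "555-0100"}
global = global_mapping([first, second])
# => {EMAIL_1: "ralph@example.com", EMAIL_2: "ruby@example.com", PHONE_NUMBER_1: "555-0100"}
```

Both messages use `EMAIL_1` locally for different addresses, yet each address ends up with exactly one global label. The split relies on `rpartition("_")`, which returns an empty first element for a key that contains no underscore, so the label-format assumption noted in the TODO matters.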
@@ -0,0 +1,59 @@
|
|
1
|
+
# frozen_string_literal: true
|
2
|
+
|
3
|
+
module TopSecret
|
4
|
+
class Text
|
5
|
+
# Holds the result of a redaction operation.
|
6
|
+
class Result # TODO: Rename to FilterResult
|
7
|
+
include Mapping
|
8
|
+
|
9
|
+
# @return [String] The original unredacted input
|
10
|
+
attr_reader :input
|
11
|
+
|
12
|
+
# @return [String] The redacted output
|
13
|
+
attr_reader :output
|
14
|
+
|
15
|
+
# @return [Hash] Mapping of redacted labels to matched values
|
16
|
+
attr_reader :mapping
|
17
|
+
|
18
|
+
# @param input [String] The original text
|
19
|
+
# @param output [String] The redacted text
|
20
|
+
# @param mapping [Hash] Map of labels to matched values
|
21
|
+
def initialize(input, output, mapping)
|
22
|
+
@input = input
|
23
|
+
@output = output
|
24
|
+
@mapping = mapping
|
25
|
+
end
|
26
|
+
|
27
|
+
# Filters multiple messages individually using a shared model for performance
|
28
|
+
#
|
29
|
+
# @param messages [Array<String>] Array of text messages to filter
|
30
|
+
# @param custom_filters [Array] Additional custom filters to apply
|
31
|
+
# @param filters [Hash] Optional filters to override defaults (only valid filter keys accepted)
|
32
|
+
# @return [Array<Result>] Array of individual Result objects for each message
|
33
|
+
# @raise [ArgumentError] If invalid filter keys are provided
|
34
|
+
def self.from_messages(messages, custom_filters: [], **filters)
|
35
|
+
shared_model = TopSecret.model_path ? Mitie::NER.new(TopSecret.model_path) : nil
|
36
|
+
|
37
|
+
messages.map do |message|
|
38
|
+
TopSecret::Text.new(message, filters:, custom_filters:, model: shared_model).filter
|
39
|
+
end
|
40
|
+
end
|
41
|
+
|
42
|
+
# Creates Result objects with globally consistent labels applied to text
|
43
|
+
#
|
44
|
+
# @param individual_results [Array<Result>] Array of individual filter results
|
45
|
+
# @param global_mapping [Hash] Global mapping from filter labels to original values
|
46
|
+
# @return [Array<Result>] Array of Result objects with globally consistent redaction and individual mappings
|
47
|
+
def self.with_global_labels(individual_results, global_mapping)
|
48
|
+
individual_results.map do |result|
|
49
|
+
output = global_mapping.reduce(result.input.dup) do |text, (filter, value)|
|
50
|
+
text.gsub(value, "[#{filter}]")
|
51
|
+
end
|
52
|
+
filter_keys = output.scan(/\[([^\]]+)\]/).flatten.map(&:to_sym)
|
53
|
+
mapping = global_mapping.slice(*filter_keys)
|
54
|
+
new(result.input, output, mapping)
|
55
|
+
end
|
56
|
+
end
|
57
|
+
end
|
58
|
+
end
|
59
|
+
end
|
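The substitution step in `with_global_labels` can be sketched in isolation: replace every known value with its global label, then keep only the mapping entries whose labels actually appear in the output. The function name `apply_global_labels` and the sample strings are hypothetical.

```ruby
# Simplified sketch of applying a global label => value mapping to one message.
def apply_global_labels(input, global_mapping)
  # Replace each sensitive value with its bracketed global label.
  output = global_mapping.reduce(input.dup) do |text, (label, value)|
    text.gsub(value, "[#{label}]")
  end

  # Collect the labels that ended up in the text and keep only those entries.
  used_labels = output.scan(/\[([^\]]+)\]/).flatten.map(&:to_sym)
  [output, global_mapping.slice(*used_labels)]
end

global = {EMAIL_1: "ralph@example.com", EMAIL_2: "ruby@example.com"}
output, mapping = apply_global_labels("Write to ruby@example.com", global)
# output  => "Write to [EMAIL_2]"
# mapping => {EMAIL_2: "ruby@example.com"}
```

The per-message mapping is derived from the redacted text rather than from the original per-message result, so a message only reports the global labels it contains. `Hash#slice` ignores keys that are absent, so bracketed text that is not a label does no harm in this sketch.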
@@ -0,0 +1,18 @@
|
|
1
|
+
# frozen_string_literal: true
|
2
|
+
|
3
|
+
module TopSecret
|
4
|
+
class Text
|
5
|
+
# Holds the result of a scan operation.
|
6
|
+
class ScanResult
|
7
|
+
include Mapping
|
8
|
+
|
9
|
+
# @return [Hash] Mapping of redacted labels to matched values
|
10
|
+
attr_reader :mapping
|
11
|
+
|
12
|
+
# @param mapping [Hash] Map of labels to matched values
|
13
|
+
def initialize(mapping)
|
14
|
+
@mapping = mapping
|
15
|
+
end
|
16
|
+
end
|
17
|
+
end
|
18
|
+
end
|
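`ScanResult`, `Result`, and `BatchResult` all include `Mapping`, whose body is not part of this hunk. The sketch below shows one way such a mixin could provide `sensitive?`; the assumption that it simply checks whether the mapping has any entries is mine, and the names `SketchMapping` and `SketchScanResult` are hypothetical.

```ruby
# Hypothetical mixin: assumes "sensitive" means "the mapping is non-empty".
module SketchMapping
  def sensitive?
    mapping.any?
  end
end

# Minimal result object that exposes a mapping and gains sensitive? from the mixin.
class SketchScanResult
  include SketchMapping

  attr_reader :mapping

  def initialize(mapping)
    @mapping = mapping
  end
end

found = SketchScanResult.new({EMAIL_1: "ralph@example.com"})
clean = SketchScanResult.new({})
```

Sharing the predicate through a module lets every result type answer `sensitive?` the same way without a common superclass.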
data/lib/top_secret/text.rb
CHANGED
@@ -1,37 +1,116 @@
|
|
1
1
|
# frozen_string_literal: true
|
2
2
|
|
3
|
+
require "active_support/core_ext/hash/keys"
|
4
|
+
require_relative "null_model"
|
5
|
+
require_relative "text/result"
|
6
|
+
require_relative "text/batch_result"
|
7
|
+
require_relative "text/scan_result"
|
8
|
+
require_relative "text/global_mapping"
|
9
|
+
|
3
10
|
module TopSecret
|
4
11
|
# Processes text to identify and redact sensitive information using configured filters.
|
5
12
|
class Text
|
6
13
|
# @param input [String] The original text to be filtered
|
7
14
|
# @param filters [Hash, nil] Optional set of filters to override the defaults
|
8
|
-
|
15
|
+
# @param custom_filters [Array] Additional custom filters to apply
|
16
|
+
# @param model [Mitie::NER, nil] Optional pre-loaded MITIE model for performance
|
17
|
+
def initialize(input, custom_filters: [], filters: {}, model: nil)
|
9
18
|
@input = input
|
10
19
|
@output = input.dup
|
11
20
|
@mapping = {}
|
12
21
|
|
13
|
-
@model =
|
14
|
-
@doc = @model.doc(@output)
|
15
|
-
@entities = @doc.entities
|
22
|
+
@model = model || default_model
|
16
23
|
|
17
24
|
@filters = filters
|
25
|
+
@custom_filters = custom_filters
|
18
26
|
end
|
19
27
|
|
20
28
|
# Convenience method to create an instance and filter input
|
21
29
|
#
|
22
30
|
# @param input [String] The text to filter
|
23
|
-
# @param filters [Hash] Optional filters to override defaults
|
31
|
+
# @param filters [Hash] Optional filters to override defaults (only valid filter keys accepted)
|
32
|
+
# @param custom_filters [Array] Additional custom filters to apply
|
24
33
|
# @return [Result] The filtered result
|
25
|
-
|
26
|
-
|
34
|
+
# @raise [ArgumentError] If invalid filter keys are provided
|
35
|
+
def self.filter(input, custom_filters: [], **filters)
|
36
|
+
new(input, filters:, custom_filters:).filter
|
27
37
|
end
|
28
38
|
|
29
|
-
#
|
39
|
+
# Filters multiple messages with globally consistent redaction labels
|
30
40
|
#
|
31
|
-
#
|
41
|
+
# Processes a collection of messages and ensures that identical sensitive values
|
42
|
+
# receive the same redaction labels across all messages. This is useful when
|
43
|
+
# processing conversation threads or document collections where consistency matters.
|
44
|
+
#
|
45
|
+
# @param messages [Array<String>] Array of text messages to filter
|
46
|
+
# @param custom_filters [Array] Additional custom filters to apply
|
47
|
+
# @param filters [Hash] Optional filters to override defaults (only valid filter keys accepted)
|
48
|
+
# @return [BatchResult] Contains global mapping and array of Result objects with individual mappings
|
49
|
+
# @raise [ArgumentError] If invalid filter keys are provided
|
50
|
+
#
|
51
|
+
# @example Basic usage
|
52
|
+
# messages = ["Contact john@test.com", "Email john@test.com again"]
|
53
|
+
# result = TopSecret::Text.filter_all(messages)
|
54
|
+
# result.items[0].output # => "Contact [EMAIL_1]"
|
55
|
+
# result.items[1].output # => "Email [EMAIL_1] again"
|
56
|
+
# result.items[0].mapping # => { EMAIL_1: "john@test.com" }
|
57
|
+
# result.mapping # => { EMAIL_1: "john@test.com" }
|
58
|
+
#
|
59
|
+
# @example With custom filters
|
60
|
+
# ip_filter = TopSecret::Filters::Regex.new(label: "IP", regex: /\d+\.\d+\.\d+\.\d+/)
|
61
|
+
# result = TopSecret::Text.filter_all(messages, custom_filters: [ip_filter])
|
62
|
+
def self.filter_all(messages, custom_filters: [], **filters)
|
63
|
+
Text::BatchResult.from_messages(messages, custom_filters:, **filters)
|
64
|
+
end
|
65
|
+
|
66
|
+
# Convenience method to scan input text for sensitive information without redacting it
|
67
|
+
#
|
68
|
+
# This method detects sensitive information using configured filters but does not modify
|
69
|
+
# the original text. Use this when you only need to check if sensitive data exists or
|
70
|
+
# get a mapping of what was found.
|
71
|
+
#
|
72
|
+
# @param input [String] The text to scan for sensitive information
|
73
|
+
# @param filters [Hash] Optional filters to override defaults (only valid filter keys accepted)
|
74
|
+
# @param custom_filters [Array] Additional custom filters to apply
|
75
|
+
# @return [ScanResult] Contains mapping of found sensitive information and sensitive? flag
|
76
|
+
# @raise [ArgumentError] If invalid filter keys are provided
|
77
|
+
#
|
78
|
+
# @example Basic scanning
|
79
|
+
# result = TopSecret::Text.scan("Contact john@example.com")
|
80
|
+
# result.sensitive? # => true
|
81
|
+
# result.mapping # => {:EMAIL_1=>"john@example.com"}
|
82
|
+
#
|
83
|
+
# @example With custom filters
|
84
|
+
# ip_filter = TopSecret::Filters::Regex.new(label: "IP", regex: /\d+\.\d+\.\d+\.\d+/)
|
85
|
+
# result = TopSecret::Text.scan("Server IP: 192.168.1.1", custom_filters: [ip_filter])
|
86
|
+
# result.mapping # => {:IP_1=>"192.168.1.1"}
|
87
|
+
#
|
88
|
+
# @example Overriding default filters
|
89
|
+
# custom_email = TopSecret::Filters::Regex.new(label: "EMAIL_ADDR", regex: /\w+@\w+/)
|
90
|
+
# result = TopSecret::Text.scan("user@test.com", email_filter: custom_email)
|
91
|
+
# result.mapping # => {:EMAIL_ADDR_1=>"user@test.com"}
|
92
|
+
def self.scan(input, custom_filters: [], **filters)
|
93
|
+
new(input, filters:, custom_filters:).scan
|
94
|
+
end
|
95
|
+
|
96
|
+
# Scans the input text for sensitive information using configured filters
|
97
|
+
#
|
98
|
+
# This method applies all active filters to detect sensitive information but does not
|
99
|
+
# redact the original text. It builds a mapping of found values and returns whether
|
100
|
+
# any sensitive information was detected.
|
101
|
+
#
|
102
|
+
# @return [ScanResult] Contains mapping of found sensitive information and sensitive? flag
|
32
103
|
# @raise [Error] If an unsupported filter is encountered
|
33
|
-
|
34
|
-
|
104
|
+
# @raise [ArgumentError] If invalid filter keys are provided
|
105
|
+
def scan
|
106
|
+
@doc ||= model.doc(@output) if model
|
107
|
+
@entities ||= doc.entities if model
|
108
|
+
|
109
|
+
validate_filters!
|
110
|
+
|
111
|
+
all_filters.each do |filter|
|
112
|
+
next if filter.nil?
|
113
|
+
|
35
114
|
values = case filter
|
36
115
|
when TopSecret::Filters::Regex
|
37
116
|
filter.call(input)
|
@@ -43,9 +122,20 @@ module TopSecret
|
|
43
122
|
build_mapping(values, label: filter.label)
|
44
123
|
end
|
45
124
|
|
46
|
-
|
125
|
+
ScanResult.new(mapping)
|
126
|
+
end
|
47
127
|
|
48
|
-
|
128
|
+
# Applies configured filters to the input, redacting matches and building a mapping.
|
129
|
+
#
|
130
|
+
# @return [Result] Contains original input, redacted output, and mapping of labels to values
|
131
|
+
# @raise [Error] If an unsupported filter is encountered
|
132
|
+
# @raise [ArgumentError] If invalid filter keys are provided
|
133
|
+
def filter
|
134
|
+
scan_result = scan
|
135
|
+
|
136
|
+
substitute_text if scan_result.sensitive?
|
137
|
+
|
138
|
+
Text::Result.new(input, output, scan_result.mapping)
|
49
139
|
end
|
50
140
|
|
51
141
|
private
|
@@ -59,12 +149,21 @@ module TopSecret
|
|
59
149
|
# @return [Hash] Mapping from redaction labels to original values
|
60
150
|
attr_reader :mapping
|
61
151
|
|
152
|
+
# @return [Object] The NER model (typically Mitie::NER or a test double)
|
153
|
+
attr_reader :model
|
154
|
+
|
155
|
+
# @return [Object] The document created from the output text (typically Mitie::Document or a test double)
|
156
|
+
attr_reader :doc
|
157
|
+
|
62
158
|
# @return [Array<Hash>] Named entities extracted by MITIE
|
63
159
|
attr_reader :entities
|
64
160
|
|
65
161
|
# @return [Hash] Active filters used for redaction
|
66
162
|
attr_reader :filters
|
67
163
|
|
164
|
+
# @return [Array] Custom filters to apply
|
165
|
+
attr_reader :custom_filters
|
166
|
+
|
68
167
|
# Builds the mapping of label keys to matched values, indexed uniquely.
|
69
168
|
#
|
70
169
|
# @param values [Array<String>] Values matched by a filter
|
@@ -85,5 +184,55 @@ module TopSecret
|
|
85
184
|
output.gsub! value, "[#{filter}]"
|
86
185
|
end
|
87
186
|
end
|
187
|
+
|
188
|
+
# Collects all filters to apply: default filters with overrides plus custom filters
|
189
|
+
#
|
190
|
+
# @return [Array] Array of filter objects to apply
|
191
|
+
def all_filters
|
192
|
+
merged_filters.values.compact + TopSecret.custom_filters + custom_filters
|
193
|
+
end
|
194
|
+
|
195
|
+
# Merges default filters with user-provided filter overrides
|
196
|
+
#
|
197
|
+
# @return [Hash] Hash containing default filters with any user overrides applied
|
198
|
+
# @private
|
199
|
+
def merged_filters
|
200
|
+
default_filters.merge(filters)
|
201
|
+
end
|
202
|
+
|
203
|
+
# Validates that all provided filter keys are recognized
|
204
|
+
#
|
205
|
+
# @return [void]
|
206
|
+
# @raise [ArgumentError] If invalid filter keys are provided
|
207
|
+
def validate_filters!
|
208
|
+
merged_filters.assert_valid_keys(*default_filters.keys)
|
209
|
+
end
|
210
|
+
|
211
|
+
# Returns the default filters configuration hash
|
212
|
+
#
|
213
|
+
# @return [Hash] Hash containing all configured default filters, keyed by filter name
|
214
|
+
# @private
|
215
|
+
def default_filters
|
216
|
+
{
|
217
|
+
credit_card_filter: TopSecret.credit_card_filter,
|
218
|
+
email_filter: TopSecret.email_filter,
|
219
|
+
phone_number_filter: TopSecret.phone_number_filter,
|
220
|
+
ssn_filter: TopSecret.ssn_filter,
|
221
|
+
people_filter: TopSecret.people_filter,
|
222
|
+
location_filter: TopSecret.location_filter
|
223
|
+
}
|
224
|
+
end
|
225
|
+
|
226
|
+
# Creates the default model based on configuration.
|
227
|
+
# Returns a MITIE NER model if a model path is configured, otherwise returns a null model.
|
228
|
+
#
|
229
|
+
# @return [Mitie::NER, NullModel] The model instance to use for NER processing
|
230
|
+
def default_model
|
231
|
+
if TopSecret.model_path
|
232
|
+
Mitie::NER.new(TopSecret.model_path)
|
233
|
+
else
|
234
|
+
NullModel.new
|
235
|
+
end
|
236
|
+
end
|
88
237
|
end
|
89
238
|
end
|
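The way `scan` combines defaults, overrides, and custom filters can be sketched without ActiveSupport. This is a simplified illustration: the constant `SKETCH_DEFAULTS`, the function `active_filters`, and the symbols standing in for filter objects are hypothetical, and the globally configured `TopSecret.custom_filters` are left out.

```ruby
# Symbols stand in for filter objects; only two defaults are shown.
SKETCH_DEFAULTS = {email_filter: :default_email, ssn_filter: :default_ssn}.freeze

def active_filters(overrides, custom_filters: [])
  # Reject override keys that do not name a default filter.
  unknown = overrides.keys - SKETCH_DEFAULTS.keys
  raise ArgumentError, "Unknown filter key(s): #{unknown.inspect}" unless unknown.empty?

  # Overrides win over defaults; a nil override disables that filter.
  SKETCH_DEFAULTS.merge(overrides).values.compact + custom_filters
end

disabled = active_filters({ssn_filter: nil})
# => [:default_email]
replaced = active_filters({email_filter: :custom_email}, custom_filters: [:ip_filter])
# => [:custom_email, :default_ssn, :ip_filter]
```

Validating keys up front turns a misspelled option such as `emial_filter:` into an `ArgumentError` instead of a filter that silently never runs.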
data/lib/top_secret/version.rb
CHANGED
data/lib/top_secret.rb
CHANGED
@@ -8,13 +8,14 @@ require "mitie"
|
|
8
8
|
# modules
|
9
9
|
require_relative "top_secret/version"
|
10
10
|
require_relative "top_secret/constants"
|
11
|
+
require_relative "top_secret/mapping"
|
11
12
|
require_relative "top_secret/filters/ner"
|
12
13
|
require_relative "top_secret/filters/regex"
|
13
14
|
require_relative "top_secret/error"
|
14
|
-
require_relative "top_secret/result"
|
15
15
|
require_relative "top_secret/text"
|
16
|
+
require_relative "top_secret/filtered_text"
|
16
17
|
|
17
|
-
# TopSecret filters sensitive information from free text before it's sent to external services or APIs, such as
|
18
|
+
# TopSecret filters sensitive information from free text before it's sent to external services or APIs, such as chatbots and LLMs.
|
18
19
|
#
|
19
20
|
# @!attribute [rw] model_path
|
20
21
|
# @return [String] the path to the MITIE NER model
|
@@ -22,23 +23,38 @@ require_relative "top_secret/text"
|
|
22
23
|
# @!attribute [rw] min_confidence_score
|
23
24
|
# @return [Float] the minimum confidence score required for NER matches
|
24
25
|
#
|
25
|
-
# @!attribute [rw]
|
26
|
-
# @return [
|
26
|
+
# @!attribute [rw] custom_filters
|
27
|
+
# @return [Array] array of custom filters that can be configured
|
28
|
+
#
|
29
|
+
# @!attribute [rw] credit_card_filter
|
30
|
+
# @return [TopSecret::Filters::Regex] filter for credit card numbers
|
31
|
+
#
|
32
|
+
# @!attribute [rw] email_filter
|
33
|
+
# @return [TopSecret::Filters::Regex] filter for email addresses
|
34
|
+
#
|
35
|
+
# @!attribute [rw] phone_number_filter
|
36
|
+
# @return [TopSecret::Filters::Regex] filter for phone numbers
|
37
|
+
#
|
38
|
+
# @!attribute [rw] ssn_filter
|
39
|
+
# @return [TopSecret::Filters::Regex] filter for social security numbers
|
40
|
+
#
|
41
|
+
# @!attribute [rw] people_filter
|
42
|
+
# @return [TopSecret::Filters::NER] filter for person names
|
43
|
+
#
|
44
|
+
# @!attribute [rw] location_filter
|
45
|
+
# @return [TopSecret::Filters::NER] filter for location names
|
27
46
|
module TopSecret
|
28
47
|
include ActiveSupport::Configurable
|
29
48
|
|
30
|
-
config_accessor :model_path, default:
|
49
|
+
config_accessor :model_path, default: MODEL_PATH
|
31
50
|
config_accessor :min_confidence_score, default: MIN_CONFIDENCE_SCORE
|
32
51
|
|
33
|
-
config_accessor :
|
34
|
-
options = ActiveSupport::OrderedOptions.new
|
35
|
-
options.credit_card_filter = TopSecret::Filters::Regex.new(label: "CREDIT_CARD", regex: CREDIT_CARD_REGEX)
|
36
|
-
options.email_filter = TopSecret::Filters::Regex.new(label: "EMAIL", regex: EMAIL_REGEX)
|
37
|
-
options.phone_number_filter = TopSecret::Filters::Regex.new(label: "PHONE_NUMBER", regex: PHONE_REGEX)
|
38
|
-
options.ssn_filter = TopSecret::Filters::Regex.new(label: "SSN", regex: SSN_REGEX)
|
39
|
-
options.people_filter = TopSecret::Filters::NER.new(label: "PERSON", tag: :person)
|
40
|
-
options.location_filter = TopSecret::Filters::NER.new(label: "LOCATION", tag: :location)
|
52
|
+
config_accessor :custom_filters, default: []
|
41
53
|
|
42
|
-
|
43
|
-
|
54
|
+
config_accessor :credit_card_filter, default: TopSecret::Filters::Regex.new(label: "CREDIT_CARD", regex: CREDIT_CARD_REGEX)
|
55
|
+
config_accessor :email_filter, default: TopSecret::Filters::Regex.new(label: "EMAIL", regex: EMAIL_REGEX)
|
56
|
+
config_accessor :phone_number_filter, default: TopSecret::Filters::Regex.new(label: "PHONE_NUMBER", regex: PHONE_REGEX)
|
57
|
+
config_accessor :ssn_filter, default: TopSecret::Filters::Regex.new(label: "SSN", regex: SSN_REGEX)
|
58
|
+
config_accessor :people_filter, default: TopSecret::Filters::NER.new(label: "PERSON", tag: :person)
|
59
|
+
config_accessor :location_filter, default: TopSecret::Filters::NER.new(label: "LOCATION", tag: :location)
|
44
60
|
end
|
metadata
CHANGED
@@ -1,7 +1,7 @@
|
|
1
1
|
--- !ruby/object:Gem::Specification
|
2
2
|
name: top_secret
|
3
3
|
version: !ruby/object:Gem::Version
|
4
|
-
version: 0.
|
4
|
+
version: 0.3.0
|
5
5
|
platform: ruby
|
6
6
|
authors:
|
7
7
|
- Steve Polito
|
@@ -13,22 +13,22 @@ dependencies:
|
|
13
13
|
name: activesupport
|
14
14
|
requirement: !ruby/object:Gem::Requirement
|
15
15
|
requirements:
|
16
|
-
- - "~>"
|
17
|
-
- !ruby/object:Gem::Version
|
18
|
-
version: '8.0'
|
19
16
|
- - ">="
|
20
17
|
- !ruby/object:Gem::Version
|
21
|
-
version:
|
18
|
+
version: 7.0.8
|
19
|
+
- - "<"
|
20
|
+
- !ruby/object:Gem::Version
|
21
|
+
version: '9'
|
22
22
|
type: :runtime
|
23
23
|
prerelease: false
|
24
24
|
version_requirements: !ruby/object:Gem::Requirement
|
25
25
|
requirements:
|
26
|
-
- - "~>"
|
27
|
-
- !ruby/object:Gem::Version
|
28
|
-
version: '8.0'
|
29
26
|
- - ">="
|
30
27
|
- !ruby/object:Gem::Version
|
31
|
-
version:
|
28
|
+
version: 7.0.8
|
29
|
+
- - "<"
|
30
|
+
- !ruby/object:Gem::Version
|
31
|
+
version: '9'
|
32
32
|
- !ruby/object:Gem::Dependency
|
33
33
|
name: mitie
|
34
34
|
requirement: !ruby/object:Gem::Requirement
|
@@ -44,7 +44,7 @@ dependencies:
|
|
44
44
|
- !ruby/object:Gem::Version
|
45
45
|
version: 0.3.2
|
46
46
|
description: Filter sensitive information from free text before sending it to external
|
47
|
-
services or APIs, such as
|
47
|
+
services or APIs, such as chatbots and LLMs.
|
48
48
|
email:
|
49
49
|
- stevepolito@hey.com
|
50
50
|
executables: []
|
@@ -52,6 +52,7 @@ extensions: []
|
|
52
52
|
extra_rdoc_files: []
|
53
53
|
files:
|
54
54
|
- CHANGELOG.md
|
55
|
+
- CODEOWNERS
|
55
56
|
- CODE_OF_CONDUCT.md
|
56
57
|
- LICENSE.txt
|
57
58
|
- README.md
|
@@ -59,10 +60,17 @@ files:
|
|
59
60
|
- lib/top_secret.rb
|
60
61
|
- lib/top_secret/constants.rb
|
61
62
|
- lib/top_secret/error.rb
|
63
|
+
- lib/top_secret/filtered_text.rb
|
64
|
+
- lib/top_secret/filtered_text/result.rb
|
62
65
|
- lib/top_secret/filters/ner.rb
|
63
66
|
- lib/top_secret/filters/regex.rb
|
64
|
-
- lib/top_secret/
|
67
|
+
- lib/top_secret/mapping.rb
|
68
|
+
- lib/top_secret/null_model.rb
|
65
69
|
- lib/top_secret/text.rb
|
70
|
+
- lib/top_secret/text/batch_result.rb
|
71
|
+
- lib/top_secret/text/global_mapping.rb
|
72
|
+
- lib/top_secret/text/result.rb
|
73
|
+
- lib/top_secret/text/scan_result.rb
|
66
74
|
- lib/top_secret/version.rb
|
67
75
|
- sig/top_secret.rbs
|
68
76
|
homepage: https://github.com/thoughtbot/top_secret
|
data/lib/top_secret/result.rb
DELETED
@@ -1,24 +0,0 @@
|
|
1
|
-
# frozen_string_literal: true
|
2
|
-
|
3
|
-
module TopSecret
|
4
|
-
# Holds the result of a redaction operation.
|
5
|
-
class Result
|
6
|
-
# @return [String] The original unredacted input
|
7
|
-
attr_reader :input
|
8
|
-
|
9
|
-
# @return [String] The redacted output
|
10
|
-
attr_reader :output
|
11
|
-
|
12
|
-
# @return [Hash] Mapping of redacted labels to matched values
|
13
|
-
attr_reader :mapping
|
14
|
-
|
15
|
-
# @param input [String] The original text
|
16
|
-
# @param output [String] The redacted text
|
17
|
-
# @param mapping [Hash] Map of labels to matched values
|
18
|
-
def initialize(input, output, mapping)
|
19
|
-
@input = input
|
20
|
-
@output = output
|
21
|
-
@mapping = mapping
|
22
|
-
end
|
23
|
-
end
|
24
|
-
end
|