top_secret 0.1.0 → 0.2.0

checksums.yaml CHANGED
@@ -1,7 +1,7 @@
1
1
  ---
2
2
  SHA256:
3
- metadata.gz: d570b8ecd9cb5ab35f59dc688a6a13a749cebf1abfcbce36906f12d1d8452189
4
- data.tar.gz: 8b5d639506a7ecd6bbb548e245db54384a2d16c0b47bbd564230c4140639c05d
3
+ metadata.gz: 7b57ce12774e37584e4d7a25a2606799b642ddff1eaaeb6f8fa53c2c1e4cae58
4
+ data.tar.gz: 60b70028bdf30ea474c1987ea15a27df87c7d5614743b69e9f8a20e00a358401
5
5
  SHA512:
6
- metadata.gz: fef5120cc93ac1772270816788ee9bd3a7141779f300a839ecb7d8d35d228c7539e33bb55cac1ff97d975b35251d46913020879d985dad75655de385a47b31ca
7
- data.tar.gz: 5acc9bb4f77210d0ae804ee50472ec75952f12812892c6d304feb4209ee152a5d3a3740fe938d17dd47cb664945d90dc73ddfa1578b2c4f1400a1ef23d8e93bb
6
+ metadata.gz: de3cb592711f29551bdcb81c8e2c4effb03d7d0e47a1c995adb7a4c828652a98050f133918b63c9a5cbdd3e03aba1e15c0abd06e969f9c7f711b5ce9f22247de
7
+ data.tar.gz: cb08a6e173106fa653f16f9abbf8ee7b1dd6018fe4097b6841701004ada46ab37b4a9be2079da7398233e35a6c334ee1a445c8d5f0faf7c82eb4c35b0343b79c
data/CHANGELOG.md CHANGED
@@ -1,5 +1,32 @@
1
1
  ## [Unreleased]
2
2
 
3
- ## [0.1.0] - 2025-07-23
3
+ ## [0.2.0] - 2025-08-18
4
4
 
5
- - Initial release
5
+ ### Added
6
+
7
+ - Added `TopSecret::Text.filter_all` for batch processing multiple messages with globally consistent redaction labels
8
+ - Added `TopSecret::Text::BatchResult` class to hold results from batch operations
9
+ - Added `TopSecret::FilteredText` class for restoring filtered text by substituting placeholders with original values
10
+ - Added `TopSecret::FilteredText::Result` class to track restoration success and failures
11
+
12
+ ### Changed
13
+
14
+ - **BREAKING:** Moved `TopSecret::Result` to `TopSecret::Text::Result` and `TopSecret::BatchResult` to `TopSecret::Text::BatchResult` for better namespace organization
15
+ - **BREAKING:** Refactored configuration system to use individual filter accessors instead of nested `default_filters`
16
+ - Updated `TopSecret::Text.filter` to accept keyword arguments for filter overrides and `custom_filters` array
17
+ - Each default filter now has its own configuration accessor (e.g., `TopSecret.email_filter`, `TopSecret.people_filter`)
18
+
19
+ ### Migration Guide
20
+
21
+ - Replace `TopSecret::Result` with `TopSecret::Text::Result` and `TopSecret::BatchResult` with `TopSecret::Text::BatchResult`
22
+ - Replace `TopSecret.configure { |c| c.default_filters.email_filter = filter }` with `TopSecret.configure { |c| c.email_filter = filter }`
23
+ - Replace `TopSecret::Text.filter(text, filters: { email_filter: filter })` with `TopSecret::Text.filter(text, email_filter: filter)`
24
+ - For new filters, use `TopSecret::Text.filter(text, custom_filters: [filter])` instead of adding to `default_filters`
25
+
26
+ ## [0.1.1] - 2025-08-08
27
+
28
+ - Ensure `TopSecret.min_confidence_score` is respected
29
+
30
+ ## [0.1.0] - 2025-08-08
31
+
32
+ - Initial release
data/CODEOWNERS ADDED
@@ -0,0 +1,15 @@
1
+ # Lines starting with '#' are comments.
2
+ # Each line is a file pattern followed by one or more owners.
3
+
4
+ # More details are here: https://help.github.com/articles/about-codeowners/
5
+
6
+ # The '*' pattern is global owners.
7
+
8
+ # Order is important. The last matching pattern has the most precedence.
9
+ # The folders are ordered as follows:
10
+
11
+ # In each subsection folders are ordered first by depth, then alphabetically.
12
+ # This should make it easy to add new rules without breaking existing ones.
13
+
14
+ # Global rule:
15
+ * @stevepolitodesign
data/README.md CHANGED
@@ -1,6 +1,8 @@
1
1
  # Top Secret
2
2
 
3
- Filter sensitive information from free text before sending it to external services or APIs, such as Chatbots.
3
+ [![Ruby](https://github.com/thoughtbot/top_secret/actions/workflows/main.yml/badge.svg?branch=main)](https://github.com/thoughtbot/top_secret/actions/workflows/main.yml)
4
+
5
+ Filter sensitive information from free text before sending it to external services or APIs, such as chatbots and LLMs.
4
6
 
5
7
  By default it filters the following:
6
8
 
@@ -123,7 +125,7 @@ TopSecret::Text.filter("Ralph can be reached at ralph@thoughtbot.com")
123
125
  This will return
124
126
 
125
127
  ```ruby
126
- <TopSecret::Result
128
+ <TopSecret::Text::Result
127
129
  @input="Ralph can be reached at ralph@thoughtbot.com",
128
130
  @mapping={:EMAIL_1=>"ralph@thoughtbot.com", :PERSON_1=>"Ralph"},
129
131
  @output="[PERSON_1] can be reached at [EMAIL_1]"
@@ -154,26 +156,123 @@ result.mapping
154
156
  # => {:EMAIL_1=>"ralph@thoughtbot.com", :PERSON_1=>"Ralph"}
155
157
  ```
156
158
 
159
+ ### Batch Processing
160
+
161
+ When processing multiple messages, use `filter_all` to ensure consistent redaction labels across all messages:
162
+
163
+ ```ruby
164
+ messages = [
165
+ "Contact ralph@thoughtbot.com for details",
166
+ "Email ralph@thoughtbot.com again if needed",
167
+ "Also CC ruby@thoughtbot.com on the thread"
168
+ ]
169
+
170
+ result = TopSecret::Text.filter_all(messages)
171
+ ```
172
+
173
+ This will return
174
+
175
+ ```ruby
176
+ <TopSecret::Text::BatchResult
177
+ @mapping={:EMAIL_1=>"ralph@thoughtbot.com", :EMAIL_2=>"ruby@thoughtbot.com"},
178
+ @items=[
179
+ <TopSecret::Text::BatchResult::Item @input="Contact ralph@thoughtbot.com for details", @output="Contact [EMAIL_1] for details">,
180
+ <TopSecret::Text::BatchResult::Item @input="Email ralph@thoughtbot.com again if needed", @output="Email [EMAIL_1] again if needed">,
181
+ <TopSecret::Text::BatchResult::Item @input="Also CC ruby@thoughtbot.com on the thread", @output="Also CC [EMAIL_2] on the thread">
182
+ ]
183
+ >
184
+ ```
185
+
186
+ Access the global mapping
187
+
188
+ ```ruby
189
+ result.mapping
190
+
191
+ # => {:EMAIL_1=>"ralph@thoughtbot.com", :EMAIL_2=>"ruby@thoughtbot.com"}
192
+ ```
193
+
194
+ Access individual items
195
+
196
+ ```ruby
197
+ result.items[0].input
198
+ # => "Contact ralph@thoughtbot.com for details"
199
+
200
+ result.items[0].output
201
+ # => "Contact [EMAIL_1] for details"
202
+ ```
203
+
204
+ The key benefit is that identical values receive the same labels across all messages - notice how `ralph@thoughtbot.com` becomes `[EMAIL_1]` in both the first and second messages.
205
+
206
+ ### Restoring Filtered Text
207
+
208
+ When external services (like LLMs) return responses containing filter placeholders, use `TopSecret::FilteredText.restore` to substitute them back with original values:
209
+
210
+ ```ruby
211
+ # Filter messages before sending to LLM
212
+ messages = ["Contact ralph@thoughtbot.com for details"]
213
+ batch_result = TopSecret::Text.filter_all(messages)
214
+
215
+ # Send filtered text to LLM: "Contact [EMAIL_1] for details"
216
+ # LLM responds with: "I'll email [EMAIL_1] about this request"
217
+ llm_response = "I'll email [EMAIL_1] about this request"
218
+
219
+ # Restore the original values
220
+ restore_result = TopSecret::FilteredText.restore(llm_response, mapping: batch_result.mapping)
221
+ ```
222
+
223
+ This will return
224
+
225
+ ```ruby
226
+ <TopSecret::FilteredText::Result
227
+ @output="I'll email ralph@thoughtbot.com about this request",
228
+ @restored=["[EMAIL_1]"],
229
+ @unrestored=[]
230
+ >
231
+ ```
232
+
233
+ Access the restored text
234
+
235
+ ```ruby
236
+ restore_result.output
237
+ # => "I'll email ralph@thoughtbot.com about this request"
238
+ ```
239
+
240
+ Track which placeholders were restored
241
+
242
+ ```ruby
243
+ restore_result.restored
244
+ # => ["[EMAIL_1]"]
245
+
246
+ restore_result.unrestored
247
+ # => []
248
+ ```
249
+
250
+ The restoration process tracks both successful and failed placeholder substitutions, allowing you to handle cases where the LLM response contains placeholders not found in your mapping.
251
+
157
252
  ### Advanced Examples
158
253
 
159
254
  #### Overriding the default filters
160
255
 
161
256
  When overriding or [disabling](#disabling-a-default-filter-1) a [default filter](#default-filters), you must map to the correct key.
162
257
 
258
+ > [!IMPORTANT]
259
+ > Invalid filter keys will raise an `ArgumentError`. Only the following keys are valid:
260
+ > `credit_card_filter`, `email_filter`, `phone_number_filter`, `ssn_filter`, `people_filter`, `location_filter`
261
+
163
262
  ```ruby
164
263
  regex_filter = TopSecret::Filters::Regex.new(label: "EMAIL_ADDRESS", regex: /\b\w+\[at\]\w+\.\w+\b/)
165
264
  ner_filter = TopSecret::Filters::NER.new(label: "NAME", tag: :person, min_confidence_score: 0.25)
166
265
 
167
- TopSecret::Text.filter("Ralph can be reached at ralph[at]thoughtbot.com", filters: {
266
+ TopSecret::Text.filter("Ralph can be reached at ralph[at]thoughtbot.com",
168
267
  email_filter: regex_filter,
169
268
  people_filter: ner_filter
170
- })
269
+ )
171
270
  ```
172
271
 
173
272
  This will return
174
273
 
175
274
  ```ruby
176
- <TopSecret::Result
275
+ <TopSecret::Text::Result
177
276
  @input="Ralph can be reached at ralph[at]thoughtbot.com",
178
277
  @mapping={:EMAIL_ADDRESS_1=>"ralph[at]thoughtbot.com", :NAME_1=>"Ralph", :NAME_2=>"ralph["},
179
278
  @output="[NAME_1] can be reached at [EMAIL_ADDRESS_1]"
@@ -183,22 +282,29 @@ This will return
183
282
  #### Disabling a default filter
184
283
 
185
284
  ```ruby
186
- TopSecret::Text.filter("Ralph can be reached at ralph@thoughtbot.com", filters: {
285
+ TopSecret::Text.filter("Ralph can be reached at ralph@thoughtbot.com",
187
286
  email_filter: nil,
188
287
  people_filter: nil
189
- })
288
+ )
190
289
  ```
191
290
 
192
291
  This will return
193
292
 
194
293
  ```ruby
195
- <TopSecret::Result
294
+ <TopSecret::Text::Result
196
295
  @input="Ralph can be reached at ralph@thoughtbot.com",
197
296
  @mapping={},
198
297
  @output="Ralph can be reached at ralph@thoughtbot.com"
199
298
  >
200
299
  ```
201
300
 
301
+ #### Error handling for invalid filter keys
302
+
303
+ ```ruby
304
+ # This will raise ArgumentError: Unknown key: :invalid_filter. Valid keys are: ...
305
+ TopSecret::Text.filter("some text", invalid_filter: some_filter)
306
+ ```
307
+
202
308
  ### Custom Filters
203
309
 
204
310
  #### Adding new [Regex filters][]
@@ -209,15 +315,15 @@ ip_address_filter = TopSecret::Filters::Regex.new(
209
315
  regex: /\b\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}\b/
210
316
  )
211
317
 
212
- TopSecret::Text.filter("Ralph's IP address is 192.168.1.1", filters: {
213
- ip_address_filter: ip_address_filter
214
- })
318
+ TopSecret::Text.filter("Ralph's IP address is 192.168.1.1",
319
+ custom_filters: [ip_address_filter]
320
+ )
215
321
  ```
216
322
 
217
323
  This will return
218
324
 
219
325
  ```ruby
220
- <TopSecret::Result
326
+ <TopSecret::Text::Result
221
327
  @input="Ralph's IP address is 192.168.1.1",
222
328
  @mapping={:PERSON_1=>"Ralph", :IP_ADDRESS_1=>"192.168.1.1"},
223
329
  @output="[PERSON_1]'s IP address is [IP_ADDRESS_1]"
@@ -235,15 +341,15 @@ language_filter = TopSecret::Filters::NER.new(
235
341
  min_confidence_score: 0.75
236
342
  )
237
343
 
238
- TopSecret::Text.filter("Ralph's favorite programming language is Ruby.", filters: {
239
- language_filter: language_filter
240
- })
344
+ TopSecret::Text.filter("Ralph's favorite programming language is Ruby.",
345
+ custom_filters: [language_filter]
346
+ )
241
347
  ```
242
348
 
243
349
  This will return
244
350
 
245
351
  ```ruby
246
- <TopSecret::Result
352
+ <TopSecret::Text::Result
247
353
  @input="Ralph's favorite programming language is Ruby.",
248
354
  @mapping={:PERSON_1=>"Ralph", :LANGUAGE_1=>"Ruby"},
249
355
  @output="[PERSON_1]'s favorite programming language is [LANGUAGE_1]"
@@ -265,9 +371,9 @@ regex_filter = TopSecret::Filters::Regex.new(
265
371
  regex: /\b\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}\b/
266
372
  )
267
373
 
268
- result = TopSecret::Text.filter("Server IP: 192.168.1.1", filters: {
269
- ip_address_filter: regex_filter
270
- })
374
+ result = TopSecret::Text.filter("Server IP: 192.168.1.1",
375
+ custom_filters: [regex_filter]
376
+ )
271
377
 
272
378
  result.output
273
379
  # => "Server IP: [IP_ADDRESS_1]"
@@ -285,9 +391,9 @@ ner_filter = TopSecret::Filters::NER.new(
285
391
  min_confidence_score: 0.25
286
392
  )
287
393
 
288
- result = TopSecret::Text.filter("Ralph and Ruby work at thoughtbot.", filters: {
394
+ result = TopSecret::Text.filter("Ralph and Ruby work at thoughtbot.",
289
395
  people_filter: ner_filter
290
- })
396
+ )
291
397
 
292
398
  result.output
293
399
  # => "[PERSON_1] and [PERSON_2] work at thoughtbot."
@@ -326,7 +432,7 @@ end
326
432
 
327
433
  ```ruby
328
434
  TopSecret.configure do |config|
329
- config.default_filters.email_filter = TopSecret::Filters::Regex.new(
435
+ config.email_filter = TopSecret::Filters::Regex.new(
330
436
  label: "EMAIL_ADDRESS",
331
437
  regex: /\b\w+\[at\]\w+\.\w+\b/
332
438
  )
@@ -337,18 +443,20 @@ end
337
443
 
338
444
  ```ruby
339
445
  TopSecret.configure do |config|
340
- config.default_filters.email_filter = nil
446
+ config.email_filter = nil
341
447
  end
342
448
  ```
343
449
 
344
- ### Adding new default filters
450
+ ### Adding custom filters globally
345
451
 
346
452
  ```ruby
453
+ ip_address_filter = TopSecret::Filters::Regex.new(
454
+ label: "IP_ADDRESS",
455
+ regex: /\b\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}\b/
456
+ )
457
+
347
458
  TopSecret.configure do |config|
348
- config.default_filters.ip_address_filter = TopSecret::Filters::Regex.new(
349
- label: "IP_ADDRESS",
350
- regex: /\b\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}\b/
351
- )
459
+ config.custom_filters << ip_address_filter
352
460
  end
353
461
  ```
354
462
 
@@ -361,11 +469,26 @@ After checking out the repo, run `bin/setup` to install dependencies. Then, run
361
469
  >
362
470
  > You'll need to download and extract [ner_model.dat][] first, and place it in the root of this project.
363
471
 
472
+ ### Performance Benchmarks
473
+
474
+ Run `bin/benchmark` to test performance and catch regressions:
475
+
476
+ ```bash
477
+ bin/benchmark # CI-optimized benchmark with pass/fail thresholds
478
+ ```
479
+
480
+ > [!NOTE]
481
+ > When adding new public methods to the API, ensure they are included in the benchmark script to catch performance regressions.
482
+
364
483
  To install this gem onto your local machine, run `bundle exec rake install`. To release a new version, update the version number in `version.rb`, and then run `bundle exec rake release`, which will create a git tag for the version, push git commits and the created tag, and push the `.gem` file to [rubygems.org](https://rubygems.org).
365
484
 
366
485
  ## Contributing
367
486
 
368
- Bug reports and pull requests are welcome on GitHub at [https://github.com/thoughtbot/top_secret](https://github.com/thoughtbot/top_secret). This project is intended to be a safe, welcoming space for collaboration, and contributors are expected to adhere to the [code of conduct](https://github.com/thoughtbot/top_secret/blob/main/CODE_OF_CONDUCT.md).
487
+ [Bug reports](https://github.com/thoughtbot/top_secret/issues/new?template=bug_report.md) and [pull requests](https://github.com/thoughtbot/top_secret/pulls) are welcome on GitHub at [https://github.com/thoughtbot/top_secret](https://github.com/thoughtbot/top_secret).
488
+
489
+ Please create a [new discussion](https://github.com/thoughtbot/top_secret/discussions/new?category=ideas) if you want to share ideas for new features.
490
+
491
+ This project is intended to be a safe, welcoming space for collaboration, and contributors are expected to adhere to the [code of conduct](https://github.com/thoughtbot/top_secret/blob/main/CODE_OF_CONDUCT.md).
369
492
 
370
493
  ## License
371
494
 
data/lib/top_secret/filtered_text/result.rb ADDED
@@ -0,0 +1,29 @@
1
+ # frozen_string_literal: true
2
+
3
+ module TopSecret
4
+ class FilteredText
5
+ # Result object returned by FilteredText restoration operations.
6
+ #
7
+ # Contains the restored text along with tracking information about which
8
+ # placeholders were successfully restored and which remain unrestored.
9
+ class Result
10
+ # @return [String] The text with placeholders restored to original values
11
+ attr_reader :output
12
+
13
+ # @return [Array<String>] Array of placeholder strings that could not be restored
14
+ attr_reader :unrestored
15
+
16
+ # @return [Array<String>] Array of placeholder strings that were successfully restored
17
+ attr_reader :restored
18
+
19
+ # @param output [String] The restored text
20
+ # @param unrestored [Array<String>] Placeholders that could not be restored
21
+ # @param restored [Array<String>] Placeholders that were successfully restored
22
+ def initialize(output, unrestored, restored)
23
+ @output = output
24
+ @unrestored = unrestored
25
+ @restored = restored
26
+ end
27
+ end
28
+ end
29
+ end
data/lib/top_secret/filtered_text.rb ADDED
@@ -0,0 +1,73 @@
1
+ # frozen_string_literal: true
2
+
3
+ require_relative "filtered_text/result"
4
+
5
+ module TopSecret
6
+ # Restores filtered text by substituting placeholders with original values.
7
+ #
8
+ # This class is used to reverse the filtering process, typically when processing
9
+ # responses from external services like LLMs that may contain filtered placeholders.
10
+ class FilteredText
11
+ # @return [String] The text being processed for restoration
12
+ attr_reader :output
13
+
14
+ # @param filtered_text [String] Text containing filter placeholders like [EMAIL_1]
15
+ # @param mapping [Hash] Hash mapping filter symbols to original values
16
+ def initialize(filtered_text, mapping:)
17
+ @mapping = mapping
18
+ @output = filtered_text.dup
19
+ end
20
+
21
+ # Convenience method to restore filtered text in one call
22
+ #
23
+ # @param filtered_text [String] Text containing filter placeholders
24
+ # @param mapping [Hash] Hash mapping filter symbols to original values
25
+ # @return [Result] Contains restored text and tracking information
26
+ #
27
+ # @example Basic restoration
28
+ # mapping = {EMAIL_1: "john@example.com"}
29
+ # result = TopSecret::FilteredText.restore("Contact [EMAIL_1]", mapping: mapping)
30
+ # result.output # => "Contact john@example.com"
31
+ # result.restored # => ["[EMAIL_1]"]
32
+ # result.unrestored # => []
33
+ def self.restore(filtered_text, mapping:)
34
+ new(filtered_text, mapping:).restore
35
+ end
36
+
37
+ # Performs the restoration process
38
+ #
39
+ # Substitutes all found placeholders with their mapped values and tracks
40
+ # which placeholders were successfully restored vs those that remain unrestored.
41
+ #
42
+ # @return [Result] Contains the restored text and tracking arrays
43
+ def restore
44
+ restored = []
45
+
46
+ mapping.each do |filter, value|
47
+ placeholder = build_placeholder(filter)
48
+
49
+ if output.include? placeholder
50
+ restored << placeholder
51
+ output.gsub! placeholder, value
52
+ end
53
+ end
54
+
55
+ unrestored = output.scan(/\[\w*_\d\]/)
56
+
57
+ Result.new(output, unrestored, restored)
58
+ end
59
+
60
+ private
61
+
62
+ # @return [Hash] Mapping from filter symbols to original values
63
+ attr_reader :mapping
64
+
65
+ # Builds a placeholder string from a filter symbol
66
+ #
67
+ # @param filter [Symbol] The filter symbol (e.g., :EMAIL_1)
68
+ # @return [String] The placeholder string (e.g., "[EMAIL_1]")
69
+ def build_placeholder(filter)
70
+ "[#{filter}]"
71
+ end
72
+ end
73
+ end
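Note on `TopSecret::FilteredText#restore`: the restoration loop can be exercised without the gem or a MITIE model. The following is a minimal standalone sketch that mirrors the loop shown in the diff. `RestoreOutcome`, `PLACEHOLDER_PATTERN`, and `restore_placeholders` are hypothetical names used only for illustration; they are not part of the gem. The sketch also probes the `/\[\w*_\d\]/` scan, which as written appears to match only single-digit indexes.

```ruby
# frozen_string_literal: true

# Hypothetical stand-in for TopSecret::FilteredText::Result.
RestoreOutcome = Struct.new(:output, :unrestored, :restored)

# Same pattern as in the diff: a label, an underscore, and exactly one digit.
PLACEHOLDER_PATTERN = /\[\w*_\d\]/

# Mirrors the restore loop from the diff (hypothetical helper, not the gem API).
def restore_placeholders(filtered_text, mapping:)
  output = filtered_text.dup
  restored = []

  mapping.each do |label, value|
    placeholder = "[#{label}]"
    next unless output.include?(placeholder)

    restored << placeholder
    output.gsub!(placeholder, value)
  end

  # Anything that still looks like a placeholder was not in the mapping.
  RestoreOutcome.new(output, output.scan(PLACEHOLDER_PATTERN), restored)
end

outcome = restore_placeholders(
  "Ask [PERSON_1] to email [EMAIL_1] or [EMAIL_10]",
  mapping: {EMAIL_1: "ralph@thoughtbot.com"}
)

puts outcome.output
puts outcome.restored.inspect
puts outcome.unrestored.inspect
```

In this sketch `[EMAIL_1]` is restored and `[PERSON_1]` is reported as unrestored, while `[EMAIL_10]` is neither restored nor reported, because `\d` matches a single digit. A pattern such as `/\[\w+_\d+\]/` would report it; whether the released behavior is intentional is not stated in the diff.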
data/lib/top_secret/filters/ner.rb CHANGED
@@ -13,7 +13,7 @@ module TopSecret
13
13
  def initialize(label:, tag:, min_confidence_score: nil)
14
14
  @label = label
15
15
  @tag = tag.upcase.to_s
16
- @min_confidence_score = min_confidence_score || TopSecret.min_confidence_score
16
+ @min_confidence_score = min_confidence_score
17
17
  end
18
18
 
19
19
  # Filters and extracts entity texts matching the tag and score threshold.
@@ -21,7 +21,7 @@ module TopSecret
21
21
  # @param entities [Array<Hash>] List of entity hashes with keys :tag, :score, and :text
22
22
  # @return [Array<String>] Matched entity texts
23
23
  def call(entities)
24
- tags = entities.filter { _1.fetch(:tag) == tag && _1.fetch(:score) >= min_confidence_score }
24
+ tags = entities.filter { _1.fetch(:tag) == tag && _1.fetch(:score) >= (min_confidence_score || TopSecret.min_confidence_score) }
25
25
  tags.map { _1.fetch(:text) }
26
26
  end
27
27
 
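Note on the `TopSecret::Filters::NER` change: the fallback to `TopSecret.min_confidence_score` moves from the constructor to `call`. A likely motivation, given that the default NER filters are built when the library loads, is that a filter created before `TopSecret.configure` runs should still see the configured threshold. The sketch below illustrates the difference with stand-ins; `SketchConfig`, `EagerFilter`, and `LazyFilter` are hypothetical names, not gem classes.

```ruby
# frozen_string_literal: true

# Hypothetical stand-in for TopSecret's global configuration.
module SketchConfig
  class << self
    attr_accessor :min_confidence_score
  end
  self.min_confidence_score = 0.5
end

# Before: the global default is copied when the filter is built.
class EagerFilter
  def initialize(min_confidence_score: nil)
    @min_confidence_score = min_confidence_score || SketchConfig.min_confidence_score
  end

  def call(entities)
    entities.select { |e| e.fetch(:score) >= @min_confidence_score }.map { |e| e.fetch(:text) }
  end
end

# After: the global default is looked up on every call.
class LazyFilter
  def initialize(min_confidence_score: nil)
    @min_confidence_score = min_confidence_score
  end

  def call(entities)
    threshold = @min_confidence_score || SketchConfig.min_confidence_score
    entities.select { |e| e.fetch(:score) >= threshold }.map { |e| e.fetch(:text) }
  end
end

eager = EagerFilter.new
lazy = LazyFilter.new

# The configuration changes after both filters already exist.
SketchConfig.min_confidence_score = 0.9

entities = [{text: "Ralph", score: 0.95}, {text: "Ruby", score: 0.6}]
eager_matches = eager.call(entities) # still compares against 0.5
lazy_matches = lazy.call(entities)   # compares against 0.9

puts eager_matches.inspect
puts lazy_matches.inspect
```

An explicit `min_confidence_score:` passed to the constructor still wins in both variants; only the fallback timing differs.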
data/lib/top_secret/text/batch_result.rb ADDED
@@ -0,0 +1,45 @@
1
+ # frozen_string_literal: true
2
+
3
+ module TopSecret
4
+ class Text
5
+ # Holds the result of a batch redaction operation on multiple messages.
6
+ # Contains a global mapping that ensures consistent labeling across all messages
7
+ # and a collection of individual input/output pairs.
8
+ class BatchResult
9
+ # @return [Hash] Global mapping of redaction labels to original values across all messages
10
+ attr_reader :mapping
11
+
12
+ # @return [Array<Item>] Array of input/output pairs for each processed message
13
+ attr_reader :items
14
+
15
+ # Creates a new BatchResult instance
16
+ #
17
+ # @param mapping [Hash] Global mapping of redaction labels to original values
18
+ # @param items [Array<Item>] Array of input/output pairs
19
+ def initialize(mapping: {}, items: [])
20
+ @mapping = mapping
21
+ @items = items
22
+ end
23
+
24
+ # Represents a single message within a batch redaction operation.
25
+ # Contains only the input and output text, without individual mappings.
26
+ # The mapping is maintained at the BatchResult level for global consistency.
27
+ class Item
28
+ # @return [String] The original unredacted input
29
+ attr_reader :input
30
+
31
+ # @return [String] The redacted output
32
+ attr_reader :output
33
+
34
+ # Creates a new Item instance
35
+ #
36
+ # @param input [String] The original text
37
+ # @param output [String] The redacted text
38
+ def initialize(input, output)
39
+ @input = input
40
+ @output = output
41
+ end
42
+ end
43
+ end
44
+ end
45
+ end
data/lib/top_secret/text/result.rb ADDED
@@ -0,0 +1,26 @@
1
+ # frozen_string_literal: true
2
+
3
+ module TopSecret
4
+ class Text
5
+ # Holds the result of a redaction operation.
6
+ class Result
7
+ # @return [String] The original unredacted input
8
+ attr_reader :input
9
+
10
+ # @return [String] The redacted output
11
+ attr_reader :output
12
+
13
+ # @return [Hash] Mapping of redacted labels to matched values
14
+ attr_reader :mapping
15
+
16
+ # @param input [String] The original text
17
+ # @param output [String] The redacted text
18
+ # @param mapping [Hash] Map of labels to matched values
19
+ def initialize(input, output, mapping)
20
+ @input = input
21
+ @output = output
22
+ @mapping = mapping
23
+ end
24
+ end
25
+ end
26
+ end
data/lib/top_secret/text.rb CHANGED
@@ -1,37 +1,110 @@
1
1
  # frozen_string_literal: true
2
2
 
3
+ require "active_support/core_ext/hash/keys"
4
+ require_relative "text/result"
5
+ require_relative "text/batch_result"
6
+
3
7
  module TopSecret
4
8
  # Processes text to identify and redact sensitive information using configured filters.
5
9
  class Text
6
10
  # @param input [String] The original text to be filtered
7
11
  # @param filters [Hash, nil] Optional set of filters to override the defaults
8
- def initialize(input, filters: TopSecret.default_filters)
12
+ # @param custom_filters [Array] Additional custom filters to apply
13
+ # @param model [Mitie::NER, nil] Optional pre-loaded MITIE model for performance
14
+ def initialize(input, custom_filters: [], filters: {}, model: nil)
9
15
  @input = input
10
16
  @output = input.dup
11
17
  @mapping = {}
12
18
 
13
- @model = Mitie::NER.new(TopSecret.model_path)
19
+ @model = model || Mitie::NER.new(TopSecret.model_path)
14
20
  @doc = @model.doc(@output)
15
21
  @entities = @doc.entities
16
22
 
17
23
  @filters = filters
24
+ @custom_filters = custom_filters
18
25
  end
19
26
 
20
27
  # Convenience method to create an instance and filter input
21
28
  #
22
29
  # @param input [String] The text to filter
23
- # @param filters [Hash] Optional filters to override defaults
30
+ # @param filters [Hash] Optional filters to override defaults (only valid filter keys accepted)
31
+ # @param custom_filters [Array] Additional custom filters to apply
24
32
  # @return [Result] The filtered result
25
- def self.filter(input, filters: {})
26
- new(input, filters:).filter
33
+ # @raise [ArgumentError] If invalid filter keys are provided
34
+ def self.filter(input, custom_filters: [], **filters)
35
+ new(input, filters:, custom_filters:).filter
36
+ end
37
+
38
+ # Filters multiple messages with globally consistent redaction labels
39
+ #
40
+ # Processes a collection of messages and ensures that identical sensitive values
41
+ # receive the same redaction labels across all messages. This is useful when
42
+ # processing conversation threads or document collections where consistency matters.
43
+ #
44
+ # @param messages [Array<String>] Array of text messages to filter
45
+ # @param custom_filters [Array] Additional custom filters to apply
46
+ # @param filters [Hash] Optional filters to override defaults (only valid filter keys accepted)
47
+ # @return [BatchResult] Contains global mapping and array of input/output pairs
48
+ # @raise [ArgumentError] If invalid filter keys are provided
49
+ #
50
+ # @example Basic usage
51
+ # messages = ["Contact john@test.com", "Email john@test.com again"]
52
+ # result = TopSecret::Text.filter_all(messages)
53
+ # result.items[0].output # => "Contact [EMAIL_1]"
54
+ # result.items[1].output # => "Email [EMAIL_1] again"
55
+ # result.mapping # => { EMAIL_1: "john@test.com" }
56
+ #
57
+ # @example With custom filters
58
+ # ip_filter = TopSecret::Filters::Regex.new(label: "IP", regex: /\d+\.\d+\.\d+\.\d+/)
59
+ # result = TopSecret::Text.filter_all(messages, custom_filters: [ip_filter])
60
+ def self.filter_all(messages, custom_filters: [], **filters)
61
+ shared_model = Mitie::NER.new(TopSecret.model_path)
62
+
63
+ individual_results = messages.map do |message|
64
+ new(message, filters:, custom_filters:, model: shared_model).filter
65
+ end
66
+
67
+ global_mapping = {}
68
+ label_counters = {}
69
+
70
+ individual_results.each do |result|
71
+ result.mapping.each do |individual_key, value|
72
+ next if global_mapping.key?(value)
73
+
74
+ # TODO: This assumes labels are formatted consistently.
75
+ # We need to account for the following for the case where a label could begin with an "_"
76
+ label_type = individual_key.to_s.rpartition("_").first
77
+
78
+ label_counters[label_type] ||= 0
79
+ label_counters[label_type] += 1
80
+ global_key = :"#{label_type}_#{label_counters[label_type]}"
81
+
82
+ global_mapping[value] = global_key
83
+ end
84
+ end
85
+
86
+ inverted_global_mapping = global_mapping.invert
87
+
88
+ items = individual_results.map do |result|
89
+ output = result.input.dup
90
+ inverted_global_mapping.each { |filter, value| output.gsub!(value, "[#{filter}]") }
91
+ Text::BatchResult::Item.new(result.input, output)
92
+ end
93
+
94
+ Text::BatchResult.new(mapping: global_mapping.invert, items:)
27
95
  end
28
96
 
29
97
  # Applies configured filters to the input, redacting matches and building a mapping.
30
98
  #
31
99
  # @return [Result] Contains original input, redacted output, and mapping of labels to values
32
100
  # @raise [Error] If an unsupported filter is encountered
101
+ # @raise [ArgumentError] If invalid filter keys are provided
33
102
  def filter
34
- TopSecret.default_filters.merge(filters).compact.each_value do |filter|
103
+ validate_filters!
104
+
105
+ all_filters.each do |filter|
106
+ next if filter.nil?
107
+
35
108
  values = case filter
36
109
  when TopSecret::Filters::Regex
37
110
  filter.call(input)
@@ -45,7 +118,7 @@ module TopSecret
45
118
 
46
119
  substitute_text
47
120
 
48
- Result.new(input, output, mapping)
121
+ Text::Result.new(input, output, mapping)
49
122
  end
50
123
 
51
124
  private
@@ -65,6 +138,9 @@ module TopSecret
65
138
  # @return [Hash] Active filters used for redaction
66
139
  attr_reader :filters
67
140
 
141
+ # @return [Array] Custom filters to apply
142
+ attr_reader :custom_filters
143
+
68
144
  # Builds the mapping of label keys to matched values, indexed uniquely.
69
145
  #
70
146
  # @param values [Array<String>] Values matched by a filter
@@ -85,5 +161,43 @@ module TopSecret
85
161
  output.gsub! value, "[#{filter}]"
86
162
  end
87
163
  end
164
+
165
+ # Collects all filters to apply: default filters with overrides plus custom filters
166
+ #
167
+ # @return [Array] Array of filter objects to apply
168
+ def all_filters
169
+ merged_filters.values.compact + TopSecret.custom_filters + custom_filters
170
+ end
171
+
172
+ # Merges default filters with user-provided filter overrides
173
+ #
174
+ # @return [Hash] Hash containing default filters with any user overrides applied
175
+ # @private
176
+ def merged_filters
177
+ default_filters.merge(filters)
178
+ end
179
+
180
+ # Validates that all provided filter keys are recognized
181
+ #
182
+ # @return [void]
183
+ # @raise [ArgumentError] If invalid filter keys are provided
184
+ def validate_filters!
185
+ merged_filters.assert_valid_keys(*default_filters.keys)
186
+ end
187
+
188
+ # Returns the default filters configuration hash
189
+ #
190
+ # @return [Hash] Hash containing all configured default filters, keyed by filter name
191
+ # @private
192
+ def default_filters
193
+ {
194
+ credit_card_filter: TopSecret.credit_card_filter,
195
+ email_filter: TopSecret.email_filter,
196
+ phone_number_filter: TopSecret.phone_number_filter,
197
+ ssn_filter: TopSecret.ssn_filter,
198
+ people_filter: TopSecret.people_filter,
199
+ location_filter: TopSecret.location_filter
200
+ }
201
+ end
88
202
  end
89
203
  end
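Note on `TopSecret::Text.filter_all`: the global relabeling step can be studied in isolation. The sketch below follows the steps visible in the diff (skip values already seen, derive the label type with `rpartition("_")`, number labels per type, then rewrite each message by substring replacement). `build_global_mapping` and `apply_global_mapping` are hypothetical helper names, and the per-message mappings are hand-written sample data rather than output from the gem.

```ruby
# frozen_string_literal: true

# Builds one label => value mapping from several per-message mappings,
# following the relabeling steps in the diff (hypothetical helper).
def build_global_mapping(individual_mappings)
  value_to_label = {}
  counters = Hash.new(0)

  individual_mappings.each do |mapping|
    mapping.each do |label, value|
      next if value_to_label.key?(value)

      # "EMAIL_1" => "EMAIL"; assumes every label ends in "_<number>".
      label_type = label.to_s.rpartition("_").first
      counters[label_type] += 1
      value_to_label[value] = :"#{label_type}_#{counters[label_type]}"
    end
  end

  value_to_label.invert
end

# Rewrites one message with the global labels (plain substring replacement).
def apply_global_mapping(message, global_mapping)
  global_mapping.reduce(message.dup) do |text, (label, value)|
    text.gsub(value, "[#{label}]")
  end
end

per_message = [
  {EMAIL_1: "ralph@thoughtbot.com"},
  {EMAIL_1: "ralph@thoughtbot.com"},
  {EMAIL_1: "ruby@thoughtbot.com", PERSON_1: "Ruby"}
]

global = build_global_mapping(per_message)
third = apply_global_mapping("Ruby says CC ruby@thoughtbot.com", global)

puts global.inspect
puts third
```

Here the third message's local `EMAIL_1` becomes the global `EMAIL_2`, since `EMAIL_1` is already taken by a different value. Because replacement is by plain substring, the order of entries can matter when one value is contained in another; the sketch does not attempt to handle that case.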
data/lib/top_secret/version.rb CHANGED
@@ -1,5 +1,5 @@
1
1
  # frozen_string_literal: true
2
2
 
3
3
  module TopSecret
4
- VERSION = "0.1.0"
4
+ VERSION = "0.2.0"
5
5
  end
data/lib/top_secret.rb CHANGED
@@ -11,10 +11,10 @@ require_relative "top_secret/constants"
11
11
  require_relative "top_secret/filters/ner"
12
12
  require_relative "top_secret/filters/regex"
13
13
  require_relative "top_secret/error"
14
- require_relative "top_secret/result"
15
14
  require_relative "top_secret/text"
15
+ require_relative "top_secret/filtered_text"
16
16
 
17
- # TopSecret filters sensitive information from free text before it's sent to external services or APIs, such as Chatbots.
17
+ # TopSecret filters sensitive information from free text before it's sent to external services or APIs, such as chatbots and LLMs.
18
18
  #
19
19
  # @!attribute [rw] model_path
20
20
  # @return [String] the path to the MITIE NER model
@@ -22,23 +22,38 @@ require_relative "top_secret/text"
22
22
  # @!attribute [rw] min_confidence_score
23
23
  # @return [Float] the minimum confidence score required for NER matches
24
24
  #
25
- # @!attribute [rw] default_filters
26
- # @return [ActiveSupport::OrderedOptions] a set of default filters used to identify sensitive data
25
+ # @!attribute [rw] custom_filters
26
+ # @return [Array] array of custom filters that can be configured
27
+ #
28
+ # @!attribute [rw] credit_card_filter
29
+ # @return [TopSecret::Filters::Regex] filter for credit card numbers
30
+ #
31
+ # @!attribute [rw] email_filter
32
+ # @return [TopSecret::Filters::Regex] filter for email addresses
33
+ #
34
+ # @!attribute [rw] phone_number_filter
35
+ # @return [TopSecret::Filters::Regex] filter for phone numbers
36
+ #
37
+ # @!attribute [rw] ssn_filter
38
+ # @return [TopSecret::Filters::Regex] filter for social security numbers
39
+ #
40
+ # @!attribute [rw] people_filter
41
+ # @return [TopSecret::Filters::NER] filter for person names
42
+ #
43
+ # @!attribute [rw] location_filter
44
+ # @return [TopSecret::Filters::NER] filter for location names
27
45
  module TopSecret
28
46
  include ActiveSupport::Configurable
29
47
 
30
48
  config_accessor :model_path, default: "ner_model.dat"
31
49
  config_accessor :min_confidence_score, default: MIN_CONFIDENCE_SCORE
32
50
 
33
- config_accessor :default_filters do
34
- options = ActiveSupport::OrderedOptions.new
35
- options.credit_card_filter = TopSecret::Filters::Regex.new(label: "CREDIT_CARD", regex: CREDIT_CARD_REGEX)
36
- options.email_filter = TopSecret::Filters::Regex.new(label: "EMAIL", regex: EMAIL_REGEX)
37
- options.phone_number_filter = TopSecret::Filters::Regex.new(label: "PHONE_NUMBER", regex: PHONE_REGEX)
38
- options.ssn_filter = TopSecret::Filters::Regex.new(label: "SSN", regex: SSN_REGEX)
39
- options.people_filter = TopSecret::Filters::NER.new(label: "PERSON", tag: :person)
40
- options.location_filter = TopSecret::Filters::NER.new(label: "LOCATION", tag: :location)
51
+ config_accessor :custom_filters, default: []
41
52
 
42
- options
43
- end
53
+ config_accessor :credit_card_filter, default: TopSecret::Filters::Regex.new(label: "CREDIT_CARD", regex: CREDIT_CARD_REGEX)
54
+ config_accessor :email_filter, default: TopSecret::Filters::Regex.new(label: "EMAIL", regex: EMAIL_REGEX)
55
+ config_accessor :phone_number_filter, default: TopSecret::Filters::Regex.new(label: "PHONE_NUMBER", regex: PHONE_REGEX)
56
+ config_accessor :ssn_filter, default: TopSecret::Filters::Regex.new(label: "SSN", regex: SSN_REGEX)
57
+ config_accessor :people_filter, default: TopSecret::Filters::NER.new(label: "PERSON", tag: :person)
58
+ config_accessor :location_filter, default: TopSecret::Filters::NER.new(label: "LOCATION", tag: :location)
44
59
  end
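Note on the configuration refactor: with one accessor per default filter, a call to `TopSecret::Text.filter` resolves its filters by merging per-call overrides over the six defaults, rejecting unknown keys, dropping `nil` entries, and appending custom filters. The sketch below is a simplified, dependency-free model of that resolution. `DEFAULT_FILTERS` and `resolve_filters` are hypothetical names, symbols stand in for filter objects, the error text is illustrative only (the gem delegates to ActiveSupport's `Hash#assert_valid_keys`), and globally configured `TopSecret.custom_filters` are left out for brevity.

```ruby
# frozen_string_literal: true

# Hypothetical stand-ins for the six configured default filters.
DEFAULT_FILTERS = {
  credit_card_filter: :credit_card,
  email_filter: :email,
  phone_number_filter: :phone_number,
  ssn_filter: :ssn,
  people_filter: :people,
  location_filter: :location
}.freeze

# Resolves the filters for one call: defaults, then per-call overrides
# (nil disables a default), then custom filters appended at the end.
def resolve_filters(custom_filters: [], **overrides)
  unknown = overrides.keys - DEFAULT_FILTERS.keys
  unless unknown.empty?
    valid = DEFAULT_FILTERS.keys.join(", ")
    raise ArgumentError, "Unknown key: #{unknown.first.inspect}. Valid keys are: #{valid}"
  end

  DEFAULT_FILTERS.merge(overrides).values.compact + custom_filters
end

resolved = resolve_filters(
  email_filter: nil,
  people_filter: :custom_people,
  custom_filters: [:ip_address]
)

# A new filter passed under an unrecognized key is rejected.
error_message =
  begin
    resolve_filters(ip_address_filter: :ip_address)
    nil
  rescue ArgumentError => e
    e.message
  end

puts resolved.inspect
puts error_message
```

This is why new filters go through `custom_filters:` in 0.2.0: keyword overrides are reserved for the six known default keys.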
metadata CHANGED
@@ -1,7 +1,7 @@
1
1
  --- !ruby/object:Gem::Specification
2
2
  name: top_secret
3
3
  version: !ruby/object:Gem::Version
4
- version: 0.1.0
4
+ version: 0.2.0
5
5
  platform: ruby
6
6
  authors:
7
7
  - Steve Polito
@@ -44,7 +44,7 @@ dependencies:
44
44
  - !ruby/object:Gem::Version
45
45
  version: 0.3.2
46
46
  description: Filter sensitive information from free text before sending it to external
47
- services or APIs, such as Chatbots.
47
+ services or APIs, such as chatbots and LLMs.
48
48
  email:
49
49
  - stevepolito@hey.com
50
50
  executables: []
@@ -52,6 +52,7 @@ extensions: []
52
52
  extra_rdoc_files: []
53
53
  files:
54
54
  - CHANGELOG.md
55
+ - CODEOWNERS
55
56
  - CODE_OF_CONDUCT.md
56
57
  - LICENSE.txt
57
58
  - README.md
@@ -59,10 +60,13 @@ files:
59
60
  - lib/top_secret.rb
60
61
  - lib/top_secret/constants.rb
61
62
  - lib/top_secret/error.rb
63
+ - lib/top_secret/filtered_text.rb
64
+ - lib/top_secret/filtered_text/result.rb
62
65
  - lib/top_secret/filters/ner.rb
63
66
  - lib/top_secret/filters/regex.rb
64
- - lib/top_secret/result.rb
65
67
  - lib/top_secret/text.rb
68
+ - lib/top_secret/text/batch_result.rb
69
+ - lib/top_secret/text/result.rb
66
70
  - lib/top_secret/version.rb
67
71
  - sig/top_secret.rbs
68
72
  homepage: https://github.com/thoughtbot/top_secret
data/lib/top_secret/result.rb DELETED
@@ -1,24 +0,0 @@
1
- # frozen_string_literal: true
2
-
3
- module TopSecret
4
- # Holds the result of a redaction operation.
5
- class Result
6
- # @return [String] The original unredacted input
7
- attr_reader :input
8
-
9
- # @return [String] The redacted output
10
- attr_reader :output
11
-
12
- # @return [Hash] Mapping of redacted labels to matched values
13
- attr_reader :mapping
14
-
15
- # @param input [String] The original text
16
- # @param output [String] The redacted text
17
- # @param mapping [Hash] Map of labels to matched values
18
- def initialize(input, output, mapping)
19
- @input = input
20
- @output = output
21
- @mapping = mapping
22
- end
23
- end
24
- end