confidential_info_redactor_lite 0.0.3

checksums.yaml ADDED
@@ -0,0 +1,7 @@
1
+ ---
2
+ SHA1:
3
+ metadata.gz: e843b44fe21d8ebfba53cc804d37e47b68573bdb
4
+ data.tar.gz: 0f1d5c38bb7c203d0df2137cddde0513fe404e3b
5
+ SHA512:
6
+ metadata.gz: 745830d5d5c2d6796167d6af93d6035d08d9071ba2297147aae9931f791bf94036818f0c7af74005721a2ba6f7c0be35c4992cb1468dd2dbca4c0ca9f1739829
7
+ data.tar.gz: 354a0d50bbc246490c39653aad4939f65a3329539ea6da99b8d8d7c608b93f2294e5dcc111bd3a3620219ce31364d22b3178e79b415e90a0707becbb9fe8cacb
data/.gitignore ADDED
@@ -0,0 +1,14 @@
1
+ /.bundle/
2
+ /.yardoc
3
+ /Gemfile.lock
4
+ /_yardoc/
5
+ /coverage/
6
+ /doc/
7
+ /pkg/
8
+ /spec/reports/
9
+ /tmp/
10
+ *.bundle
11
+ *.so
12
+ *.o
13
+ *.a
14
+ mkmf.log
data/.rspec ADDED
@@ -0,0 +1 @@
1
+ --color
data/.travis.yml ADDED
@@ -0,0 +1,5 @@
1
+ language: ruby
2
+ rvm:
3
+ - "2.1.0"
4
+ - "2.1.5"
5
+ - "2.2.0"
data/Gemfile ADDED
@@ -0,0 +1,4 @@
1
+ source 'https://rubygems.org'
2
+
3
+ # Specify your gem's dependencies in confidential_info_redactor_lite.gemspec
4
+ gemspec
data/LICENSE.txt ADDED
@@ -0,0 +1,22 @@
1
+ Copyright (c) 2015 Kevin S. Dias
2
+
3
+ MIT License
4
+
5
+ Permission is hereby granted, free of charge, to any person obtaining
6
+ a copy of this software and associated documentation files (the
7
+ "Software"), to deal in the Software without restriction, including
8
+ without limitation the rights to use, copy, modify, merge, publish,
9
+ distribute, sublicense, and/or sell copies of the Software, and to
10
+ permit persons to whom the Software is furnished to do so, subject to
11
+ the following conditions:
12
+
13
+ The above copyright notice and this permission notice shall be
14
+ included in all copies or substantial portions of the Software.
15
+
16
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
17
+ EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
18
+ MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
19
+ NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE
20
+ LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION
21
+ OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION
22
+ WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
data/README.md ADDED
@@ -0,0 +1,120 @@
1
+ # Confidential Info Redactor Lite
2
+
3
+ [![Gem Version](https://badge.fury.io/rb/confidential_info_redactor_lite.svg)](http://badge.fury.io/rb/confidential_info_redactor_lite) [![Build Status](https://travis-ci.org/diasks2/confidential_info_redactor_lite.png)](https://travis-ci.org/diasks2/confidential_info_redactor_lite) [![License](https://img.shields.io/badge/license-MIT-brightgreen.svg?style=flat)](https://github.com/diasks2/confidential_info_redactor_lite/blob/master/LICENSE.txt)
4
+
5
+
6
+ Confidential Info Redactor Lite is a Ruby gem to semi-automatically redact confidential information from a text. It is the "lite" version of [Confidential Info Redactor](https://github.com/diasks2/confidential_info_redactor). The difference is that this gem includes no languages; you need to supply the vocabulary lists for each language you want to use.
7
+
8
+ This gem is a poor man's named-entity recognition (NER) library built to extract (and later redact) information in a text (such as proper nouns) that may be confidential.
9
+
10
+ It differs from typical NER as it makes no attempt to identify whether a token is a person, company, location, etc. It only attempts to extract tokens that might fit into one of those categories.
11
+
12
+ Your use case may vary, but the gem was written to first extract potentially sensitive tokens from a text, then show the extracted tokens to the user, who decides which ones should be redacted (and can add missing tokens to the list).
13
+
14
+ The way the gem works is rather simple. It uses regular expressions to search for capitalized tokens (1-grams, 2-grams, 3-grams, etc.) and then checks whether those tokens match a list of the common vocabulary for that language (e.g. the x most frequent words, where x depends on what is available for that language). If a token is not in the list of words for that language, it is added to an array of tokens that the user should check as potential "confidential information".
15
+
16
+ In the sentence "Pepsi and Coca-Cola battled for position in the market." the gem would extract "Pepsi" and "Coca-Cola" as potential tokens to redact.
17
+
18
+ In addition to searching for proper nouns, the gem can also redact numbers, dates, emails and hyperlinks.
19
+
20
+ ## Install
21
+
22
+ **Ruby**
23
+ *Supports Ruby 2.1.0 and above*
24
+ ```
25
+ gem install confidential_info_redactor_lite
26
+ ```
27
+
28
+ **Ruby on Rails**
29
+ Add this line to your application’s Gemfile:
30
+ ```ruby
31
+ gem 'confidential_info_redactor_lite'
32
+ ```
33
+
34
+ ## Usage
35
+
36
+ * If no language is specified, the library will default to English.
37
+ * To specify a language, use its two-character [ISO 639-1 code](https://www.tm-town.com/languages).
38
+
39
+ ```ruby
40
+ text = 'Coca-Cola announced a merger with Pepsi that will happen on December 15th, 2020 for $200,000,000,000. Please contact John Smith at j.smith@example.com or visit http://www.super-fake-merger.com.'
41
+ corpus = ['array', 'of', 'common', 'english', 'words']
42
+ tokens = ConfidentialInfoRedactorLite::Extractor.new(text: text, corpus: corpus).extract
43
+ # => ["Coca-Cola", "Pepsi", "John Smith"]
44
+
45
+ en_dow = %w(monday tuesday wednesday thursday friday saturday sunday)
46
+ en_dow_abbr = %w(mon tu tue tues wed th thu thur thurs fri sat sun)
47
+ en_months = %w(january february march april may june july august september october november december)
48
+ en_month_abbr = %w(jan feb mar apr jun jul aug sep sept oct nov dec)
49
+ ConfidentialInfoRedactorLite::Redactor.new(text: text, tokens: tokens, dow: en_dow, dow_abbr: en_dow_abbr, months: en_months, months_abbr: en_month_abbr).redact
50
+ # => '<redacted> announced a merger with <redacted> that will happen on <redacted date> for <redacted number>. Please contact <redacted> at <redacted> or visit <redacted>.'
51
+
52
+ # You can also just use a specific redactor
53
+ ConfidentialInfoRedactorLite::Redactor.new(text: text, dow: en_dow, dow_abbr: en_dow_abbr, months: en_months, months_abbr: en_month_abbr).dates
54
+ # => 'Coca-Cola announced a merger with Pepsi that will happen on <redacted date> for $200,000,000,000. Please contact John Smith at j.smith@example.com or visit http://www.super-fake-merger.com.'
55
+
56
+ ConfidentialInfoRedactorLite::Redactor.new(text: text, dow: en_dow, dow_abbr: en_dow_abbr, months: en_months, months_abbr: en_month_abbr).numbers
57
+ # => 'Coca-Cola announced a merger with Pepsi that will happen on December 15th, 2020 for <redacted number>. Please contact John Smith at j.smith@example.com or visit http://www.super-fake-merger.com.'
58
+
59
+ ConfidentialInfoRedactorLite::Redactor.new(text: text, dow: en_dow, dow_abbr: en_dow_abbr, months: en_months, months_abbr: en_month_abbr).emails
60
+ # => 'Coca-Cola announced a merger with Pepsi that will happen on December 15th, 2020 for $200,000,000,000. Please contact John Smith at <redacted> or visit http://www.super-fake-merger.com.'
61
+
62
+ ConfidentialInfoRedactorLite::Redactor.new(text: text, dow: en_dow, dow_abbr: en_dow_abbr, months: en_months, months_abbr: en_month_abbr).hyperlinks
63
+ # => 'Coca-Cola announced a merger with Pepsi that will happen on December 15th, 2020 for $200,000,000,000. Please contact John Smith at j.smith@example.com or visit <redacted>.'
64
+
65
+ ConfidentialInfoRedactorLite::Redactor.new(text: text, tokens: tokens, dow: en_dow, dow_abbr: en_dow_abbr, months: en_months, months_abbr: en_month_abbr).proper_nouns
66
+ # => '<redacted> announced a merger with <redacted> that will happen on December 15th, 2020 for $200,000,000,000. Please contact <redacted> at j.smith@example.com or visit http://www.super-fake-merger.com.'
67
+
68
+ # It is possible to 'turn off' any of the specific redactors
69
+ ConfidentialInfoRedactorLite::Redactor.new(text: text, tokens: tokens, ignore_numbers: true, dow: en_dow, dow_abbr: en_dow_abbr, months: en_months, months_abbr: en_month_abbr).redact
70
+ # => '<redacted> announced a merger with <redacted> that will happen on <redacted date> for $200,000,000,000. Please contact <redacted> at <redacted> or visit <redacted>.'
71
+
72
+ # It is also possible to change the redaction text
73
+ text = 'Coca-Cola announced a merger with Pepsi that will happen on December 15th, 2020 for $200,000,000,000. Please contact John Smith at j.smith@example.com or visit http://www.super-fake-merger.com.'
74
+ tokens = ['Coca-Cola', 'Pepsi', 'John Smith']
75
+ ConfidentialInfoRedactorLite::Redactor.new(text: text, tokens: tokens, number_text: '**redacted number**', date_text: '^^redacted date^^', token_text: '*****', dow: en_dow, dow_abbr: en_dow_abbr, months: en_months, months_abbr: en_month_abbr).redact
76
+ # => '***** announced a merger with ***** that will happen on ^^redacted date^^ for **redacted number**. Please contact ***** at ***** or visit *****.'
77
+ ```
78
+
79
+ #### Redactor class options
80
+ * `language` *(optional - defaults to 'en' if not specified)*
81
+ * `tokens` *(optional - any tokens to redact from the text)*
82
+ * `number_text` *(optional - change the text for redacted numbers; the default is `<redacted number>`)*
83
+ * `date_text` *(optional - change the text for redacted dates; the default is `<redacted date>`)*
84
+ * `token_text` *(optional - change the text for redacted tokens, emails and hyperlinks; the default is `<redacted>`)*
85
+ * `ignore_emails` *(optional - set to true if you do not want to redact emails)*
86
+ * `ignore_dates` *(optional - set to true if you do not want to redact dates)*
87
+ * `ignore_numbers` *(optional - set to true if you do not want to redact numbers)*
88
+ * `ignore_hyperlinks` *(optional - set to true if you do not want to redact hyperlinks)*
89
+
90
+ ## Contributing
91
+
92
+ 1. Fork it ( https://github.com/diasks2/confidential_info_redactor_lite/fork )
93
+ 2. Create your feature branch (`git checkout -b my-new-feature`)
94
+ 3. Commit your changes (`git commit -am 'Add some feature'`)
95
+ 4. Push to the branch (`git push origin my-new-feature`)
96
+ 5. Create a new Pull Request
97
+
98
+ ## License
99
+
100
+ The MIT License (MIT)
101
+
102
+ Copyright (c) 2015 Kevin S. Dias
103
+
104
+ Permission is hereby granted, free of charge, to any person obtaining a copy
105
+ of this software and associated documentation files (the "Software"), to deal
106
+ in the Software without restriction, including without limitation the rights
107
+ to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
108
+ copies of the Software, and to permit persons to whom the Software is
109
+ furnished to do so, subject to the following conditions:
110
+
111
+ The above copyright notice and this permission notice shall be included in
112
+ all copies or substantial portions of the Software.
113
+
114
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
115
+ IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
116
+ FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
117
+ AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
118
+ LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
119
+ OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
120
+ THE SOFTWARE.
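Review note: the extraction approach the README describes (find capitalized n-grams, then drop any that appear in a common-word list) can be sketched roughly as follows. This is a minimal illustration, not the gem's `Extractor`: the constant `CAPITALIZED_NGRAM`, the method `candidate_tokens`, and the tiny word list are all hypothetical, and the sketch skips the sentence segmentation and punctuation cleanup that the real class performs.

```ruby
# Hypothetical sketch of the README's idea; none of these names exist in the gem.
# Matches one or more consecutive capitalized words (hyphens allowed inside a word).
CAPITALIZED_NGRAM = /\b[A-Z][\w\-]*(?:\s[A-Z][\w\-]*)*/

# Returns capitalized n-grams whose lowercase form is not a known common word.
def candidate_tokens(text, common_words)
  text.scan(CAPITALIZED_NGRAM)
      .reject { |ngram| common_words.include?(ngram.downcase) }
      .uniq
end

# Assumed stand-in for a real frequency list supplied by the caller.
common_words = %w(and battled for in market position the)
candidates = candidate_tokens('Pepsi and Coca-Cola battled for position in the market.', common_words)
puts candidates.inspect
```

A sentence-initial common word such as "The" is dropped because its lowercase form is in the list, which is why the quality of the supplied vocabulary matters so much for this approach.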
data/Rakefile ADDED
@@ -0,0 +1,5 @@
1
+ require 'bundler/gem_tasks'
2
+ require 'rspec/core/rake_task'
3
+
4
+ RSpec::Core::RakeTask.new(:spec)
5
+ task :default => :spec
@@ -0,0 +1,26 @@
1
+ # coding: utf-8
2
+ lib = File.expand_path('../lib', __FILE__)
3
+ $LOAD_PATH.unshift(lib) unless $LOAD_PATH.include?(lib)
4
+ require 'confidential_info_redactor_lite/version'
5
+
6
+ Gem::Specification.new do |spec|
7
+ spec.name = "confidential_info_redactor_lite"
8
+ spec.version = ConfidentialInfoRedactorLite::VERSION
9
+ spec.authors = ["Kevin S. Dias"]
10
+ spec.email = ["diasks2@gmail.com"]
11
+ spec.summary = %q{Semi-automatically redact confidential information from a text (supply your own language packs)}
12
+ spec.description = %q{The lite version of https://github.com/diasks2/confidential_info_redactor - include your own language packs.}
13
+ spec.homepage = ""
14
+ spec.license = "MIT"
15
+
16
+ spec.files = `git ls-files -z`.split("\x0")
17
+ spec.executables = spec.files.grep(%r{^bin/}) { |f| File.basename(f) }
18
+ spec.test_files = spec.files.grep(%r{^(test|spec|features)/})
19
+ spec.require_paths = ["lib"]
20
+ spec.required_ruby_version = '>= 2.1.0'
21
+
22
+ spec.add_development_dependency "bundler", "~> 1.7"
23
+ spec.add_development_dependency "rake", "~> 10.0"
24
+ spec.add_development_dependency "rspec"
25
+ spec.add_runtime_dependency "pragmatic_segmenter"
26
+ end
@@ -0,0 +1,4 @@
1
+ require 'confidential_info_redactor_lite/version'
2
+ require 'confidential_info_redactor_lite/extractor'
3
+ require 'confidential_info_redactor_lite/redactor'
4
+ require 'pragmatic_segmenter'
@@ -0,0 +1,149 @@
1
+ module ConfidentialInfoRedactorLite
2
+ class Date
3
+ # Rubular: http://rubular.com/r/73CZ2HU0q6
4
+ DMY_MDY_REGEX = /(\d{1,2}(\/|\.|-)){2}\d{4}/
5
+
6
+ # Rubular: http://rubular.com/r/GWbuWXw4t0
7
+ YMD_YDM_REGEX = /\d{4}(\/|\.|-)(\d{1,2}(\/|\.|-)){2}/
8
+
9
+ # Rubular: http://rubular.com/r/SRZ27XNlvR
10
+ DIGIT_ONLY_YEAR_FIRST_REGEX = /[12]\d{7}\D/
11
+
12
+ # Rubular: http://rubular.com/r/mpVSeaKwdY
13
+ DIGIT_ONLY_YEAR_LAST_REGEX = /\d{4}[12]\d{3}\D/
14
+
15
+ attr_reader :string, :dow, :dow_abbr, :months, :months_abbr
16
+ def initialize(string:, dow:, dow_abbr:, months:, months_abbr:)
17
+ @string = string
18
+ @dow = dow
19
+ @dow_abbr = dow_abbr
20
+ @months = months
21
+ @months_abbr = months_abbr
22
+ end
23
+
24
+ def includes_date?
25
+ long_date || number_only_date
26
+ end
27
+
28
+ def replace
29
+ new_string = string.dup
30
+ counter = 0
31
+ dow_abbr.each do |day|
32
+ counter +=1 if string.include?('day')
33
+ end
34
+ if counter > 0
35
+ dow_abbr.each do |day|
36
+ months.each do |month|
37
+ new_string = new_string.gsub(/#{Regexp.escape(day)}(\.)*(,)*\s#{Regexp.escape(month)}\s\d+(rd|th|st)*(,)*\s\d{4}/i, ' <redacted date> ')
38
+ end
39
+ months_abbr.each do |month|
40
+ new_string = new_string.gsub(/#{Regexp.escape(day)}(\.)*(,)*\s#{Regexp.escape(month)}(\.)*\s\d+(rd|th|st)*(,)*\s\d{4}/i, ' <redacted date> ')
41
+ end
42
+ end
43
+ dow.each do |day|
44
+ months.each do |month|
45
+ new_string = new_string.gsub(/#{Regexp.escape(day)}(,)*\s#{Regexp.escape(month)}(\.)*\s\d+(rd|th|st)*(,)*\s\d{4}/i, ' <redacted date> ')
46
+ .gsub(/\d{2}(\.|-|\/)*\s?#{Regexp.escape(month)}(\.|-|\/)*\s?(\d{4}|\d{2})/i, ' <redacted date> ')
47
+ .gsub(/#{Regexp.escape(month)}(\.)*\s\d+(rd|th|st)*(,)*\s\d{4}/i, ' <redacted date> ')
48
+ .gsub(/\d{4}\.*\s#{Regexp.escape(month)}\s\d+(rd|th|st)*/i, ' <redacted date> ')
49
+ .gsub(/\d{4}(\.|-|\/)*#{Regexp.escape(month)}(\.|-|\/)*\d+/i, ' <redacted date> ')
50
+ .gsub(/#{Regexp.escape(month)}(\.)*\s\d+(rd|th|st)*/i, ' <redacted date> ')
51
+ end
52
+ months_abbr.each do |month|
53
+ new_string = new_string.gsub(/#{Regexp.escape(day)}(,)*\s#{Regexp.escape(month)}(\.)*\s\d+(rd|th|st)*(,)*\s\d{4}/i, ' <redacted date> ')
54
+ .gsub(/\d{2}(\.|-|\/)*\s?#{Regexp.escape(month)}(\.|-|\/)*\s?(\d{4}|\d{2})/i, ' <redacted date> ')
55
+ .gsub(/#{Regexp.escape(month)}(\.)*\s\d+(rd|th|st)*(,)*\s\d{4}/i, ' <redacted date> ')
56
+ .gsub(/\d{4}\.*\s#{Regexp.escape(month)}\s\d+(rd|th|st)*/i, ' <redacted date> ')
57
+ .gsub(/\d{4}(\.|-|\/)*#{Regexp.escape(month)}(\.|-|\/)*\d+/i, ' <redacted date> ')
58
+ .gsub(/#{Regexp.escape(month)}(\.)*\s\d+(rd|th|st)*/i, ' <redacted date> ')
59
+ end
60
+ end
61
+ else
62
+ dow.each do |day|
63
+ months.each do |month|
64
+ new_string = new_string.gsub(/#{Regexp.escape(day)}(,)*\s#{Regexp.escape(month)}(\.)*\s\d+(rd|th|st)*(,)*\s\d{4}/i, ' <redacted date> ')
65
+ .gsub(/\d{2}(\.|-|\/)*\s?#{Regexp.escape(month)}(\.|-|\/)*\s?(\d{4}|\d{2})/i, ' <redacted date> ')
66
+ .gsub(/#{Regexp.escape(month)}(\.)*\s\d+(rd|th|st)*(,)*\s\d{4}/i, ' <redacted date> ')
67
+ .gsub(/\d{4}\.*\s#{Regexp.escape(month)}\s\d+(rd|th|st)*/i, ' <redacted date> ')
68
+ .gsub(/\d{4}(\.|-|\/)*#{Regexp.escape(month)}(\.|-|\/)*\d+/i, ' <redacted date> ')
69
+ .gsub(/#{Regexp.escape(month)}(\.)*\s\d+(rd|th|st)*/i, ' <redacted date> ')
70
+ end
71
+ months_abbr.each do |month|
72
+ new_string = new_string.gsub(/#{Regexp.escape(day)}(,)*\s#{Regexp.escape(month)}(\.)*\s\d+(rd|th|st)*(,)*\s\d{4}/i, ' <redacted date> ')
73
+ .gsub(/\d{2}(\.|-|\/)*\s?#{Regexp.escape(month)}(\.|-|\/)*\s?(\d{4}|\d{2})/i, ' <redacted date> ')
74
+ .gsub(/#{Regexp.escape(month)}(\.)*\s\d+(rd|th|st)*(,)*\s\d{4}/i, ' <redacted date> ')
75
+ .gsub(/\d{4}\.*\s#{Regexp.escape(month)}\s\d+(rd|th|st)*/i, ' <redacted date> ')
76
+ .gsub(/\d{4}(\.|-|\/)*#{Regexp.escape(month)}(\.|-|\/)*\d+/i, ' <redacted date> ')
77
+ .gsub(/#{Regexp.escape(month)}(\.)*\s\d+(rd|th|st)*/i, ' <redacted date> ')
78
+ end
79
+ end
80
+ dow_abbr.each do |day|
81
+ months.each do |month|
82
+ new_string = new_string.gsub(/#{Regexp.escape(day)}(\.)*(,)*\s#{Regexp.escape(month)}\s\d+(rd|th|st)*(,)*\s\d{4}/i, ' <redacted date> ')
83
+ end
84
+ months_abbr.each do |month|
85
+ new_string = new_string.gsub(/#{Regexp.escape(day)}(\.)*(,)*\s#{Regexp.escape(month)}(\.)*\s\d+(rd|th|st)*(,)*\s\d{4}/i, ' <redacted date> ')
86
+ end
87
+ end
88
+ end
89
+ new_string = new_string.gsub(DMY_MDY_REGEX, ' <redacted date> ')
90
+ .gsub(YMD_YDM_REGEX, ' <redacted date> ')
91
+ .gsub(DIGIT_ONLY_YEAR_FIRST_REGEX, ' <redacted date> ')
92
+ .gsub(DIGIT_ONLY_YEAR_LAST_REGEX, ' <redacted date> ')
93
+ end
94
+
95
+ def occurences
96
+ replace.scan(/<redacted date>/).size
97
+ end
98
+
99
+ def replace_number_only_date
100
+ string.gsub(DMY_MDY_REGEX, ' <redacted date> ')
101
+ .gsub(YMD_YDM_REGEX, ' <redacted date> ')
102
+ .gsub(DIGIT_ONLY_YEAR_FIRST_REGEX, ' <redacted date> ')
103
+ .gsub(DIGIT_ONLY_YEAR_LAST_REGEX, ' <redacted date> ')
104
+ end
105
+
106
+ private
107
+
108
+ def long_date
109
+ match_found = false
110
+ dow.each do |day|
111
+ months.each do |month|
112
+ break if match_found
113
+ match_found = check_for_matches(day, month)
114
+ end
115
+ months_abbr.each do |month|
116
+ break if match_found
117
+ match_found = check_for_matches(day, month)
118
+ end
119
+ end
120
+ dow_abbr.each do |day|
121
+ months.each do |month|
122
+ break if match_found
123
+ match_found = !(string !~ /#{Regexp.escape(day)}(\.)*(,)*\s#{Regexp.escape(month)}\s\d+(rd|th|st)*(,)*\s\d{4}/i)
124
+ end
125
+ months_abbr.each do |month|
126
+ break if match_found
127
+ match_found = !(string !~ /#{Regexp.escape(day)}(\.)*(,)*\s#{Regexp.escape(month)}(\.)*\s\d+(rd|th|st)*(,)*\s\d{4}/i)
128
+ end
129
+ end
130
+ match_found
131
+ end
132
+
133
+ def number_only_date
134
+ !(string !~ DMY_MDY_REGEX) ||
135
+ !(string !~ YMD_YDM_REGEX) ||
136
+ !(string !~ DIGIT_ONLY_YEAR_FIRST_REGEX) ||
137
+ !(string !~ DIGIT_ONLY_YEAR_LAST_REGEX)
138
+ end
139
+
140
+ def check_for_matches(day, month)
141
+ !(string !~ /#{Regexp.escape(day)}(,)*\s#{Regexp.escape(month)}(\.)*\s\d+(rd|th|st)*(,)*\s\d{4}/i) ||
142
+ !(string !~ /#{Regexp.escape(month)}(\.)*\s\d+(rd|th|st)*(,)*\s\d{4}/i) ||
143
+ !(string !~ /\d{4}\.*\s#{Regexp.escape(month)}\s\d+(rd|th|st)*/i) ||
144
+ !(string !~ /\d{4}(\.|-|\/)*#{Regexp.escape(month)}(\.|-|\/)*\d+/i) ||
145
+ !(string !~ /#{Regexp.escape(month)}(\.)*\s\d+(rd|th|st)*/i) ||
146
+ !(string !~ /\d{2}(\.|-|\/)*#{Regexp.escape(month)}(\.|-|\/)*(\d{4}|\d{2})/i)
147
+ end
148
+ end
149
+ end
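Review note: to see what the numeric date patterns in this file actually catch, the sketch below copies `DMY_MDY_REGEX` and `YMD_YDM_REGEX` verbatim from the `Date` class and applies them through a hypothetical helper, `redact_numeric_dates`, which is not part of the gem. The whitespace normalization step is an assumption modeled on what `Redactor#redact_dates` does after calling `replace`.

```ruby
# Patterns copied from the Date class in this diff.
DMY_MDY_REGEX = /(\d{1,2}(\/|\.|-)){2}\d{4}/
YMD_YDM_REGEX = /\d{4}(\/|\.|-)(\d{1,2}(\/|\.|-)){2}/

# Hypothetical helper: replaces numeric dates, then collapses the padding
# spaces that the replacement introduces around the placeholder.
def redact_numeric_dates(string, placeholder = '<redacted date>')
  string.gsub(DMY_MDY_REGEX, " #{placeholder} ")
        .gsub(YMD_YDM_REGEX, " #{placeholder} ")
        .gsub(/\s*#{Regexp.escape(placeholder)}\s*/, " #{placeholder} ")
        .strip
end

puts redact_numeric_dates('Invoice dated 12/15/2020 was paid.')
puts redact_numeric_dates('Due 2020-12-15')
```

As written, `YMD_YDM_REGEX` requires a separator after the final number, so a year-first date such as `2020-12-15` with nothing following it is left untouched by these two patterns. Whether that is intended is not clear from the diff.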
@@ -0,0 +1,34 @@
1
+ module ConfidentialInfoRedactorLite
2
+ # This class extracts proper nouns from a text
3
+ class Extractor
4
+ # Rubular: http://rubular.com/r/qE0g4r9zR7
5
+ EXTRACT_REGEX = /(?<=\s|^|\s\")([A-Z]\S*\s)*[A-Z]\S*(?=(\s|\.|\z))|(?<=\s|^|\s\")[i][A-Z][a-z]+/
6
+ attr_reader :text, :language, :corpus
7
+ def initialize(text:, corpus:, **args)
8
+ @text = text.gsub(/[’‘]/, "'")
9
+ @corpus = corpus
10
+ @language = args[:language] || 'en'
11
+ end
12
+
13
+ def extract
14
+ extracted_terms = []
15
+ PragmaticSegmenter::Segmenter.new(text: text, language: language).segment.each do |segment|
16
+ initial_extracted_terms = segment.gsub(EXTRACT_REGEX).map { |match| match unless corpus.include?(match.downcase.gsub(/[\?\.\)\(\!\\\/\"\:\;]/, '').gsub(/\'$/, '')) }.compact
17
+ initial_extracted_terms.each do |ngram|
18
+ ngram.split(/[\?\)\(\!\\\/\"\:\;\,]/).each do |t|
19
+ extracted_terms << t.gsub(/[\?\)\(\!\\\/\"\:\;\,]/, '').gsub(/\'$/, '').gsub(/\.\z/, '').strip unless corpus.include?(t.downcase.gsub(/[\?\.\)\(\!\\\/\"\:\;]/, '').gsub(/\'$/, '').strip)
20
+ end
21
+ end
22
+ end
23
+
24
+ if language.eql?('de')
25
+ extracted_terms.delete_if do |token|
26
+ corpus.include?(token.split(' ')[0].downcase.strip) &&
27
+ token.split(' ')[0].downcase.strip != 'deutsche'
28
+ end.uniq.reject(&:empty?)
29
+ else
30
+ extracted_terms.uniq.reject(&:empty?)
31
+ end
32
+ end
33
+ end
34
+ end
@@ -0,0 +1,31 @@
1
+ require 'uri'
2
+
3
+ module ConfidentialInfoRedactorLite
4
+ class Hyperlink
5
+ NON_HYPERLINK_REGEX = /\A\w+:$/
6
+
7
+ # Rubular: http://rubular.com/r/fXa4lp0gfS
8
+ HYPERLINK_REGEX = /(http|https|www)(\.|:)/
9
+
10
+ attr_reader :string
11
+ def initialize(string:)
12
+ @string = string
13
+ end
14
+
15
+ def hyperlink?
16
+ !(string !~ URI.regexp) && string !~ NON_HYPERLINK_REGEX && !(string !~ HYPERLINK_REGEX)
17
+ end
18
+
19
+ def replace
20
+ new_string = string.dup
21
+ string.split(/\s+/).each do |token|
22
+ if !(token !~ URI.regexp) && token !~ NON_HYPERLINK_REGEX && !(token !~ HYPERLINK_REGEX) && token.include?('">')
23
+ new_string = new_string.gsub(/#{Regexp.escape(token.split('">')[0].gsub(/\.\z/, ''))}/, ' <redacted> ')
24
+ elsif !(token !~ URI.regexp) && token !~ NON_HYPERLINK_REGEX && !(token !~ HYPERLINK_REGEX)
25
+ new_string = new_string.gsub(/#{Regexp.escape(token.gsub(/\.\z/, ''))}/, ' <redacted> ')
26
+ end
27
+ end
28
+ new_string
29
+ end
30
+ end
31
+ end
@@ -0,0 +1,86 @@
1
+ require 'confidential_info_redactor_lite/date'
2
+ require 'confidential_info_redactor_lite/hyperlink'
3
+
4
+ module ConfidentialInfoRedactorLite
5
+ # This class redacts various tokens from a text
6
+ class Redactor
7
+ # Rubular: http://rubular.com/r/LRRPtDgJOe
8
+ NUMBER_REGEX = /(?<=\A|\A\()[^(]?\d+((,|\.)*\d)*(\D?\s|\s|\.?\s|\.$)|(?<=\s|\s\()[^(]?\d+((,|\.)*\d)*(?=(\D?\s|\s|\.?\s|\.$))|(?<=\s)\d+(nd|th|st)|(?<=\s)\d+\/\d+\"*(?=\s)/
9
+ EMAIL_REGEX = /(?<=\A|\s)[\w+\-.]+@[a-z\d\-]+(\.[a-z]+)*\.[a-z]+(?=\z|\s|\.)/i
10
+
11
+ attr_reader :text, :language, :number_text, :date_text, :token_text, :tokens, :ignore_emails, :ignore_dates, :ignore_numbers, :ignore_hyperlinks, :dow, :dow_abbr, :months, :months_abbr
12
+ def initialize(text:, dow:, dow_abbr:, months:, months_abbr:, **args)
13
+ @text = text
14
+ @language = args[:language] || 'en'
15
+ @tokens = args[:tokens] || []
16
+ @number_text = args[:number_text] || '<redacted number>'
17
+ @date_text = args[:date_text] || '<redacted date>'
18
+ @token_text = args[:token_text] || '<redacted>'
19
+ @ignore_emails = args[:ignore_emails]
20
+ @ignore_dates = args[:ignore_dates]
21
+ @ignore_numbers = args[:ignore_numbers]
22
+ @ignore_hyperlinks = args[:ignore_hyperlinks]
23
+ @dow = dow
24
+ @dow_abbr = dow_abbr
25
+ @months = months
26
+ @months_abbr = months_abbr
27
+ end
28
+
29
+ def dates
30
+ redact_dates(text)
31
+ end
32
+
33
+ def numbers
34
+ redact_numbers(text)
35
+ end
36
+
37
+ def emails
38
+ redact_emails(text)
39
+ end
40
+
41
+ def hyperlinks
42
+ redact_hyperlinks(text)
43
+ end
44
+
45
+ def proper_nouns
46
+ redact_tokens(text)
47
+ end
48
+
49
+ def redact
50
+ if ignore_emails
51
+ redacted_text = text
52
+ else
53
+ redacted_text = redact_emails(text)
54
+ end
55
+ redacted_text = redact_hyperlinks(redacted_text) unless ignore_hyperlinks
56
+ redacted_text = redact_dates(redacted_text) unless ignore_dates
57
+ redacted_text = redact_numbers(redacted_text) unless ignore_numbers
58
+ redact_tokens(redacted_text)
59
+ end
60
+
61
+ private
62
+
63
+ def redact_hyperlinks(txt)
64
+ ConfidentialInfoRedactorLite::Hyperlink.new(string: txt).replace.gsub(/<redacted>/, "#{token_text}").gsub(/\s*#{Regexp.escape(token_text)}\s*/, " #{token_text} ").gsub(/#{Regexp.escape(token_text)}\s{1}\.{1}/, "#{token_text}.").gsub(/#{Regexp.escape(token_text)}\s{1}\,{1}/, "#{token_text},")
65
+ end
66
+
67
+ def redact_dates(txt)
68
+ ConfidentialInfoRedactorLite::Date.new(string: txt, dow: dow, dow_abbr: dow_abbr, months: months, months_abbr: months_abbr).replace.gsub(/<redacted date>/, "#{date_text}").gsub(/\s*#{Regexp.escape(date_text)}\s*/, " #{date_text} ").gsub(/\A\s*#{Regexp.escape(date_text)}\s*/, "#{date_text} ").gsub(/#{Regexp.escape(date_text)}\s{1}\.{1}/, "#{date_text}.")
69
+ end
70
+
71
+ def redact_numbers(txt)
72
+ txt.gsub(NUMBER_REGEX, " #{number_text} ").gsub(/\s*#{Regexp.escape(number_text)}\s*/, " #{number_text} ").gsub(/\A\s*#{Regexp.escape(number_text)}\s*/, "#{number_text} ").gsub(/#{Regexp.escape(number_text)}\s{1}\.{1}/, "#{number_text}.").gsub(/#{Regexp.escape(number_text)}\s{1}\,{1}/, "#{number_text},").gsub(/#{Regexp.escape(number_text)}\s{1}\){1}/, "#{number_text})").gsub(/\(\s{1}#{Regexp.escape(number_text)}/, "(#{number_text}")
73
+ end
74
+
75
+ def redact_emails(txt)
76
+ txt.gsub(EMAIL_REGEX, "#{token_text}")
77
+ end
78
+
79
+ def redact_tokens(txt)
80
+ tokens.sort_by{ |x| x.split.count }.reverse.each do |token|
81
+ txt = txt.gsub(/#{Regexp.escape(token)}/, "#{token_text}")
82
+ end
83
+ txt
84
+ end
85
+ end
86
+ end
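Review note: `redact_tokens` sorts tokens by word count, longest first, so that a multi-word token such as "John Smith" is replaced before its shorter part "John". The sketch below illustrates that ordering with a hypothetical standalone method, `redact_longest_first`, which is not part of the gem. It assumes a non-mutating style and treats a missing token list as empty.

```ruby
# Hypothetical sketch: replace tokens longest-first without mutating the input.
def redact_longest_first(text, tokens, placeholder = '<redacted>')
  # Array(nil) is [], so a missing token list leaves the text unchanged.
  ordered = Array(tokens).sort_by { |token| -token.split.size }
  ordered.reduce(text) do |result, token|
    # String#gsub with a String pattern matches it literally.
    result.gsub(token, placeholder)
  end
end

original = 'John Smith met John at Pepsi.'
puts redact_longest_first(original, ['John', 'John Smith', 'Pepsi'])
puts original # the caller's string is unchanged
```

If "John" were replaced first, "John Smith" would no longer match and "Smith" would remain in the output, which is the case the sort is guarding against.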