confidential_info_redactor 0.0.1

checksums.yaml ADDED
@@ -0,0 +1,7 @@
1
+ ---
2
+ SHA1:
3
+ metadata.gz: 54ab8c079b1020a3b43ac9fd7f0d566b91a3ed60
4
+ data.tar.gz: e565672ba5e464adf70b909ceb6dc047f39df58c
5
+ SHA512:
6
+ metadata.gz: d942f1a36c6a09978aea347a0a61edd89feb2202202eee57a4dc9bffc8e2c503654fb5afd3b799bc1ab53e6678c788484740b979b03594e7605b1f40992fc82f
7
+ data.tar.gz: e3206d083d5eeb24989bf596316d063ec9c09027fcad7ed8f320380b43f421b1ebea9d459d34e829b2d3bcd43272c50112bc6d24a3d7cd71c855821293e9a4f6
data/.gitignore ADDED
@@ -0,0 +1,14 @@
1
+ /.bundle/
2
+ /.yardoc
3
+ /Gemfile.lock
4
+ /_yardoc/
5
+ /coverage/
6
+ /doc/
7
+ /pkg/
8
+ /spec/reports/
9
+ /tmp/
10
+ *.bundle
11
+ *.so
12
+ *.o
13
+ *.a
14
+ mkmf.log
data/.rspec ADDED
@@ -0,0 +1 @@
1
+ --color
data/.travis.yml ADDED
@@ -0,0 +1,5 @@
1
+ language: ruby
2
+ rvm:
3
+ - "2.1.0"
4
+ - "2.1.5"
5
+ - "2.2.0"
data/Gemfile ADDED
@@ -0,0 +1,4 @@
1
+ source 'https://rubygems.org'
2
+
3
+ # Specify your gem's dependencies in confidential_info_redactor.gemspec
4
+ gemspec
data/LICENSE.txt ADDED
@@ -0,0 +1,22 @@
1
+ Copyright (c) 2015 Kevin S. Dias
2
+
3
+ MIT License
4
+
5
+ Permission is hereby granted, free of charge, to any person obtaining
6
+ a copy of this software and associated documentation files (the
7
+ "Software"), to deal in the Software without restriction, including
8
+ without limitation the rights to use, copy, modify, merge, publish,
9
+ distribute, sublicense, and/or sell copies of the Software, and to
10
+ permit persons to whom the Software is furnished to do so, subject to
11
+ the following conditions:
12
+
13
+ The above copyright notice and this permission notice shall be
14
+ included in all copies or substantial portions of the Software.
15
+
16
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
17
+ EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
18
+ MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
19
+ NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE
20
+ LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION
21
+ OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION
22
+ WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
data/README.md ADDED
@@ -0,0 +1,127 @@
1
+ # Confidential Info Redactor
2
+
3
+ [![Gem Version](https://badge.fury.io/rb/confidential_info_redactor.svg)](http://badge.fury.io/rb/confidential_info_redactor) [![Build Status](https://travis-ci.org/diasks2/confidential_info_redactor.png)](https://travis-ci.org/diasks2/confidential_info_redactor) [![License](https://img.shields.io/badge/license-MIT-brightgreen.svg?style=flat)](https://github.com/diasks2/confidential_info_redactor/blob/master/LICENSE.txt)
4
+
5
+ Confidential Info Redactor is a Ruby gem to semi-automatically redact confidential information from a text.
6
+
7
+ This gem is a poor man's named-entity recognition (NER) library built to extract (and later redact) information in a text (such as proper nouns) that may be confidential.
8
+
9
+ It differs from typical NER as it makes no attempt to identify whether a token is a person, company, location, etc. It only attempts to extract tokens that might fit into one of those categories.
10
+
11
+ Your use case may vary, but the gem was written for a two-step workflow: first extract potentially sensitive tokens from a text, then show them to the user, who decides which ones should be redacted and can add any missing tokens to the list.
12
+
13
+ The way the gem works is rather simple. It uses regular expressions to search for capitalized tokens (1-grams, 2-grams, 3-grams, etc.) and then checks each one against a list of common vocabulary for the language (e.g. the x most frequent words, where x depends on what is available for that language). A token that is not in the word list is added to an array of tokens for the user to review as potential "confidential information".
14
+
15
+ In the sentence "Pepsi and Coca-Cola battled for position in the market." the gem would extract "Pepsi" and "Coca-Cola" as potential tokens to redact.
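
The extraction step described above can be sketched in a few lines of plain Ruby. This is only an illustration under stated assumptions: `COMMON_WORDS`, `CAPITALIZED_RUN` and `candidate_tokens` are hypothetical names, the word list is a tiny stand-in for the gem's per-language list, and the pattern is much simpler than the gem's own `EXTRACT_REGEX`.

```ruby
# Hypothetical stand-in for the gem's per-language list of common words.
COMMON_WORDS = %w[the and for in position market battled].freeze

# Simplified pattern (an assumption, not the gem's EXTRACT_REGEX):
# one or more capitalized tokens separated by single whitespace characters.
CAPITALIZED_RUN = /\b[A-Z][\w-]*(?:\s[A-Z][\w-]*)*/

# Returns capitalized n-grams whose lowercase form is not a common word.
def candidate_tokens(text, common_words = COMMON_WORDS)
  text.scan(CAPITALIZED_RUN)
      .reject { |token| common_words.include?(token.downcase) }
      .uniq
end

candidate_tokens('Pepsi and Coca-Cola battled for position in the market.')
# => ["Pepsi", "Coca-Cola"]
```

The returned array is what would be shown to the user for review before anything is redacted.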
16
+
17
+ In addition to searching for proper nouns, the gem can also redact numbers, dates, emails and hyperlinks.
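
As a rough illustration of the date case, the sketch below reuses the two numeric date patterns defined in this release's `Date` class (`DMY_MDY_REGEX` and `YMD_YDM_REGEX`). The helper `redact_numeric_dates` is a hypothetical name and not part of the gem's API, and unlike the gem it does not pad the placeholder with spaces.

```ruby
# Patterns copied from the Date class in this release.
DMY_MDY_REGEX = /(\d{1,2}(\/|\.|-)){2}\d{4}/
YMD_YDM_REGEX = /\d{4}(\/|\.|-)(\d{1,2}(\/|\.|-)){2}/

# Hypothetical helper: replaces digit-only dates with a placeholder.
def redact_numeric_dates(text, placeholder = '<redacted date>')
  text.gsub(DMY_MDY_REGEX, placeholder).gsub(YMD_YDM_REGEX, placeholder)
end

redact_numeric_dates('The contract was signed on 12/15/2020.')
# => "The contract was signed on <redacted date>."
```

Dates written with month names are handled by separate, language-specific patterns in the gem and are not covered by this sketch.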
18
+
19
+ ## Install
20
+
21
+ **Ruby**
22
+ *Supports Ruby 2.1.0 and above*
23
+ ```
24
+ gem install confidential_info_redactor
25
+ ```
26
+
27
+ **Ruby on Rails**
28
+ Add this line to your application’s Gemfile:
29
+ ```ruby
30
+ gem 'confidential_info_redactor'
31
+ ```
32
+
33
+ ## Usage
34
+
35
+ * If no language is specified, the library will default to English.
36
+ * To specify a language, use its two-character [ISO 639-1 code](https://www.tm-town.com/languages).
37
+
38
+ ```ruby
39
+ text = 'Coca-Cola announced a merger with Pepsi that will happen on December 15th, 2020 for $200,000,000,000. Please contact John Smith at j.smith@example.com or visit http://www.super-fake-merger.com.'
40
+
41
+ tokens = ConfidentialInfoRedactor::Extractor.new(text: text).extract
42
+ # => ["Coca-Cola", "Pepsi", "John Smith"]
43
+
44
+ ConfidentialInfoRedactor::Redactor.new(text: text, tokens: tokens).redact
45
+ # => '<redacted> announced a merger with <redacted> that will happen on <redacted date> for <redacted number>. Please contact <redacted> at <redacted> or visit <redacted>.'
46
+
47
+ # You can also just use a specific redactor
48
+ ConfidentialInfoRedactor::Redactor.new(text: text).dates
49
+ # => 'Coca-Cola announced a merger with Pepsi that will happen on <redacted date> for $200,000,000,000. Please contact John Smith at j.smith@example.com or visit http://www.super-fake-merger.com.'
50
+
51
+ ConfidentialInfoRedactor::Redactor.new(text: text).numbers
52
+ # => 'Coca-Cola announced a merger with Pepsi that will happen on December 15th, 2020 for <redacted number>. Please contact John Smith at j.smith@example.com or visit http://www.super-fake-merger.com.'
53
+
54
+ ConfidentialInfoRedactor::Redactor.new(text: text).emails
55
+ # => 'Coca-Cola announced a merger with Pepsi that will happen on December 15th, 2020 for $200,000,000,000. Please contact John Smith at <redacted> or visit http://www.super-fake-merger.com.'
56
+
57
+ ConfidentialInfoRedactor::Redactor.new(text: text).hyperlinks
58
+ # => 'Coca-Cola announced a merger with Pepsi that will happen on December 15th, 2020 for $200,000,000,000. Please contact John Smith at j.smith@example.com or visit <redacted>.'
59
+
60
+ ConfidentialInfoRedactor::Redactor.new(text: text, tokens: tokens).proper_nouns
61
+ # => '<redacted> announced a merger with <redacted> that will happen on December 15th, 2020 for $200,000,000,000. Please contact <redacted> at j.smith@example.com or visit http://www.super-fake-merger.com.'
62
+
63
+ # It is possible to 'turn off' any of the specific redactors
64
+ ConfidentialInfoRedactor::Redactor.new(text: text, tokens: tokens, ignore_numbers: true).redact
65
+ # => '<redacted> announced a merger with <redacted> that will happen on <redacted date> for $200,000,000,000. Please contact <redacted> at <redacted> or visit <redacted>.'
66
+
67
+ # German Example
68
+ text = 'Viele Mitarbeiter der Deutschen Bank suchen eine andere Arbeitsstelle.'
69
+
70
+ tokens = ConfidentialInfoRedactor::Extractor.new(text: text, language: 'de').extract
71
+ # => ['Deutschen Bank']
72
+
73
+ ConfidentialInfoRedactor::Redactor.new(text: text, language: 'de', tokens: tokens).redact
74
+ # => 'Viele Mitarbeiter der <redacted> suchen eine andere Arbeitsstelle.'
75
+
76
+ # It is also possible to change the redaction text (`text` here is the English example from the top of this block)
77
+ tokens = ['Coca-Cola', 'Pepsi', 'John Smith']
78
+ ConfidentialInfoRedactor::Redactor.new(text: text, tokens: tokens, number_text: '**redacted number**', date_text: '^^redacted date^^', token_text: '*****').redact
79
+ # => '***** announced a merger with ***** that will happen on ^^redacted date^^ for **redacted number**. Please contact ***** at ***** or visit *****.'
80
+ ```
81
+
82
+ #### Redactor class options
83
+ * `language` *(optional - defaults to 'en' if not specified)*
84
+ * `tokens` *(optional - any tokens to redact from the text)*
85
+ * `number_text` *(optional - change the text for redacted numbers; the default is `<redacted number>`)*
86
+ * `date_text` *(optional - change the text for redacted dates; the default is `<redacted date>`)*
87
+ * `token_text` *(optional - change the text for redacted tokens, emails and hyperlinks; the default is `<redacted>`)*
88
+ * `ignore_emails` *(optional - set to true if you do not want to redact emails)*
89
+ * `ignore_dates` *(optional - set to true if you do not want to redact dates)*
90
+ * `ignore_numbers` *(optional - set to true if you do not want to redact numbers)*
91
+ * `ignore_hyperlinks` *(optional - set to true if you do not want to redact hyperlinks)*
92
+
93
+ #### Languages Supported
94
+ * English ('en')
95
+ * German ('de')
96
+
97
+ ## Contributing
98
+
99
+ 1. Fork it ( https://github.com/diasks2/confidential_info_redactor/fork )
100
+ 2. Create your feature branch (`git checkout -b my-new-feature`)
101
+ 3. Commit your changes (`git commit -am 'Add some feature'`)
102
+ 4. Push to the branch (`git push origin my-new-feature`)
103
+ 5. Create a new Pull Request
104
+
105
+ ## License
106
+
107
+ The MIT License (MIT)
108
+
109
+ Copyright (c) 2015 Kevin S. Dias
110
+
111
+ Permission is hereby granted, free of charge, to any person obtaining a copy
112
+ of this software and associated documentation files (the "Software"), to deal
113
+ in the Software without restriction, including without limitation the rights
114
+ to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
115
+ copies of the Software, and to permit persons to whom the Software is
116
+ furnished to do so, subject to the following conditions:
117
+
118
+ The above copyright notice and this permission notice shall be included in
119
+ all copies or substantial portions of the Software.
120
+
121
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
122
+ IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
123
+ FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
124
+ AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
125
+ LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
126
+ OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
127
+ THE SOFTWARE.
data/Rakefile ADDED
@@ -0,0 +1,5 @@
1
+ require 'bundler/gem_tasks'
2
+ require 'rspec/core/rake_task'
3
+
4
+ RSpec::Core::RakeTask.new(:spec)
5
+ task :default => :spec
@@ -0,0 +1,25 @@
1
+ # coding: utf-8
2
+ lib = File.expand_path('../lib', __FILE__)
3
+ $LOAD_PATH.unshift(lib) unless $LOAD_PATH.include?(lib)
4
+ require 'confidential_info_redactor/version'
5
+
6
+ Gem::Specification.new do |spec|
7
+ spec.name = "confidential_info_redactor"
8
+ spec.version = ConfidentialInfoRedactor::VERSION
9
+ spec.authors = ["Kevin S. Dias"]
10
+ spec.email = ["diasks2@gmail.com"]
11
+ spec.summary = %q{Semi-automatically redact confidential information from a text}
12
+ spec.description = %q{A Ruby gem to semi-automatically redact confidential information from a text}
13
+ spec.homepage = "https://github.com/diasks2/confidential_info_redactor"
14
+ spec.license = "MIT"
15
+
16
+ spec.files = `git ls-files -z`.split("\x0")
17
+ spec.executables = spec.files.grep(%r{^bin/}) { |f| File.basename(f) }
18
+ spec.test_files = spec.files.grep(%r{^(test|spec|features)/})
19
+ spec.require_paths = ["lib"]
20
+
21
+ spec.add_development_dependency "bundler", "~> 1.7"
22
+ spec.add_development_dependency "rake", "~> 10.0"
23
+ spec.add_development_dependency "rspec"
24
+ spec.add_runtime_dependency "pragmatic_segmenter"
25
+ end
@@ -0,0 +1,4 @@
1
+ require 'confidential_info_redactor/version'
2
+ require 'confidential_info_redactor/extractor'
3
+ require 'confidential_info_redactor/redactor'
4
+ require 'pragmatic_segmenter'
@@ -0,0 +1,172 @@
1
+ module ConfidentialInfoRedactor
2
+ class Date
3
+ EN_DOW = %w(monday tuesday wednesday thursday friday saturday sunday)
4
+ EN_DOW_ABBR = %w(mon tu tue tues wed th thu thur thurs fri sat sun)
5
+ EN_MONTHS = %w(january february march april may june july august september october november december)
6
+ EN_MONTH_ABBR = %w(jan feb mar apr jun jul aug sep sept oct nov dec)
7
+
8
+ DE_DOW = %w(montag dienstag mittwoch donnerstag freitag samstag sonntag sonnabend)
9
+ DE_DOW_ABBR = %w(mo di mi do fr sa so)
10
+ DE_MONTHS = %w(januar februar märz april mai juni juli august september oktober november dezember)
11
+ DE_MONTH_ABBR = %w(jan jän feb märz apr mai juni juli aug sep sept okt nov dez)
12
+ # Rubular: http://rubular.com/r/73CZ2HU0q6
13
+ DMY_MDY_REGEX = /(\d{1,2}(\/|\.|-)){2}\d{4}/
14
+
15
+ # Rubular: http://rubular.com/r/GWbuWXw4t0
16
+ YMD_YDM_REGEX = /\d{4}(\/|\.|-)(\d{1,2}(\/|\.|-)){2}/
17
+
18
+ # Rubular: http://rubular.com/r/SRZ27XNlvR
19
+ DIGIT_ONLY_YEAR_FIRST_REGEX = /[12]\d{7}\D/
20
+
21
+ # Rubular: http://rubular.com/r/mpVSeaKwdY
22
+ DIGIT_ONLY_YEAR_LAST_REGEX = /\d{4}[12]\d{3}\D/
23
+
24
+ attr_reader :string, :language, :dow, :dow_abbr, :months, :months_abbr
25
+ def initialize(string:, language:)
26
+ @string = string
27
+ @language = language
28
+ case language
29
+ when 'en'
30
+ @dow = EN_DOW
31
+ @dow_abbr = EN_DOW_ABBR
32
+ @months = EN_MONTHS
33
+ @months_abbr = EN_MONTH_ABBR
34
+ when 'de'
35
+ @dow = DE_DOW
36
+ @dow_abbr = DE_DOW_ABBR
37
+ @months = DE_MONTHS
38
+ @months_abbr = DE_MONTH_ABBR
39
+ else
40
+ @dow = EN_DOW
41
+ @dow_abbr = EN_DOW_ABBR
42
+ @months = EN_MONTHS
43
+ @months_abbr = EN_MONTH_ABBR
44
+ end
45
+ end
46
+
47
+ def includes_date?
48
+ long_date || number_only_date
49
+ end
50
+
51
+ def replace
52
+ new_string = string.dup
53
+ counter = 0
54
+ dow_abbr.each do |day|
55
+ counter +=1 if string.include?('day')
56
+ end
57
+ if counter > 0
58
+ dow_abbr.each do |day|
59
+ months.each do |month|
60
+ new_string = new_string.gsub(/#{Regexp.escape(day)}(\.)*(,)*\s#{Regexp.escape(month)}\s\d+(rd|th|st)*(,)*\s\d{4}/i, ' <redacted date> ')
61
+ end
62
+ months_abbr.each do |month|
63
+ new_string = new_string.gsub(/#{Regexp.escape(day)}(\.)*(,)*\s#{Regexp.escape(month)}(\.)*\s\d+(rd|th|st)*(,)*\s\d{4}/i, ' <redacted date> ')
64
+ end
65
+ end
66
+ dow.each do |day|
67
+ months.each do |month|
68
+ new_string = new_string.gsub(/#{Regexp.escape(day)}(,)*\s#{Regexp.escape(month)}(\.)*\s\d+(rd|th|st)*(,)*\s\d{4}/i, ' <redacted date> ')
69
+ .gsub(/\d{2}(\.|-|\/)*\s?#{Regexp.escape(month)}(\.|-|\/)*\s?(\d{4}|\d{2})/i, ' <redacted date> ')
70
+ .gsub(/#{Regexp.escape(month)}(\.)*\s\d+(rd|th|st)*(,)*\s\d{4}/i, ' <redacted date> ')
71
+ .gsub(/\d{4}\.*\s#{Regexp.escape(month)}\s\d+(rd|th|st)*/i, ' <redacted date> ')
72
+ .gsub(/\d{4}(\.|-|\/)*#{Regexp.escape(month)}(\.|-|\/)*\d+/i, ' <redacted date> ')
73
+ .gsub(/#{Regexp.escape(month)}(\.)*\s\d+(rd|th|st)*/i, ' <redacted date> ')
74
+ end
75
+ months_abbr.each do |month|
76
+ new_string = new_string.gsub(/#{Regexp.escape(day)}(,)*\s#{Regexp.escape(month)}(\.)*\s\d+(rd|th|st)*(,)*\s\d{4}/i, ' <redacted date> ')
77
+ .gsub(/\d{2}(\.|-|\/)*\s?#{Regexp.escape(month)}(\.|-|\/)*\s?(\d{4}|\d{2})/i, ' <redacted date> ')
78
+ .gsub(/#{Regexp.escape(month)}(\.)*\s\d+(rd|th|st)*(,)*\s\d{4}/i, ' <redacted date> ')
79
+ .gsub(/\d{4}\.*\s#{Regexp.escape(month)}\s\d+(rd|th|st)*/i, ' <redacted date> ')
80
+ .gsub(/\d{4}(\.|-|\/)*#{Regexp.escape(month)}(\.|-|\/)*\d+/i, ' <redacted date> ')
81
+ .gsub(/#{Regexp.escape(month)}(\.)*\s\d+(rd|th|st)*/i, ' <redacted date> ')
82
+ end
83
+ end
84
+ else
85
+ dow.each do |day|
86
+ months.each do |month|
87
+ new_string = new_string.gsub(/#{Regexp.escape(day)}(,)*\s#{Regexp.escape(month)}(\.)*\s\d+(rd|th|st)*(,)*\s\d{4}/i, ' <redacted date> ')
88
+ .gsub(/\d{2}(\.|-|\/)*\s?#{Regexp.escape(month)}(\.|-|\/)*\s?(\d{4}|\d{2})/i, ' <redacted date> ')
89
+ .gsub(/#{Regexp.escape(month)}(\.)*\s\d+(rd|th|st)*(,)*\s\d{4}/i, ' <redacted date> ')
90
+ .gsub(/\d{4}\.*\s#{Regexp.escape(month)}\s\d+(rd|th|st)*/i, ' <redacted date> ')
91
+ .gsub(/\d{4}(\.|-|\/)*#{Regexp.escape(month)}(\.|-|\/)*\d+/i, ' <redacted date> ')
92
+ .gsub(/#{Regexp.escape(month)}(\.)*\s\d+(rd|th|st)*/i, ' <redacted date> ')
93
+ end
94
+ months_abbr.each do |month|
95
+ new_string = new_string.gsub(/#{Regexp.escape(day)}(,)*\s#{Regexp.escape(month)}(\.)*\s\d+(rd|th|st)*(,)*\s\d{4}/i, ' <redacted date> ')
96
+ .gsub(/\d{2}(\.|-|\/)*\s?#{Regexp.escape(month)}(\.|-|\/)*\s?(\d{4}|\d{2})/i, ' <redacted date> ')
97
+ .gsub(/#{Regexp.escape(month)}(\.)*\s\d+(rd|th|st)*(,)*\s\d{4}/i, ' <redacted date> ')
98
+ .gsub(/\d{4}\.*\s#{Regexp.escape(month)}\s\d+(rd|th|st)*/i, ' <redacted date> ')
99
+ .gsub(/\d{4}(\.|-|\/)*#{Regexp.escape(month)}(\.|-|\/)*\d+/i, ' <redacted date> ')
100
+ .gsub(/#{Regexp.escape(month)}(\.)*\s\d+(rd|th|st)*/i, ' <redacted date> ')
101
+ end
102
+ end
103
+ dow_abbr.each do |day|
104
+ months.each do |month|
105
+ new_string = new_string.gsub(/#{Regexp.escape(day)}(\.)*(,)*\s#{Regexp.escape(month)}\s\d+(rd|th|st)*(,)*\s\d{4}/i, ' <redacted date> ')
106
+ end
107
+ months_abbr.each do |month|
108
+ new_string = new_string.gsub(/#{Regexp.escape(day)}(\.)*(,)*\s#{Regexp.escape(month)}(\.)*\s\d+(rd|th|st)*(,)*\s\d{4}/i, ' <redacted date> ')
109
+ end
110
+ end
111
+ end
112
+ new_string = new_string.gsub(DMY_MDY_REGEX, ' <redacted date> ')
113
+ .gsub(YMD_YDM_REGEX, ' <redacted date> ')
114
+ .gsub(DIGIT_ONLY_YEAR_FIRST_REGEX, ' <redacted date> ')
115
+ .gsub(DIGIT_ONLY_YEAR_LAST_REGEX, ' <redacted date> ')
116
+ end
117
+
118
+ def occurences
119
+ replace.scan(/<redacted date>/).size
120
+ end
121
+
122
+ def replace_number_only_date
123
+ string.gsub(DMY_MDY_REGEX, ' <redacted date> ')
124
+ .gsub(YMD_YDM_REGEX, ' <redacted date> ')
125
+ .gsub(DIGIT_ONLY_YEAR_FIRST_REGEX, ' <redacted date> ')
126
+ .gsub(DIGIT_ONLY_YEAR_LAST_REGEX, ' <redacted date> ')
127
+ end
128
+
129
+ private
130
+
131
+ def long_date
132
+ match_found = false
133
+ dow.each do |day|
134
+ months.each do |month|
135
+ break if match_found
136
+ match_found = check_for_matches(day, month)
137
+ end
138
+ months_abbr.each do |month|
139
+ break if match_found
140
+ match_found = check_for_matches(day, month)
141
+ end
142
+ end
143
+ dow_abbr.each do |day|
144
+ months.each do |month|
145
+ break if match_found
146
+ match_found = !(string !~ /#{Regexp.escape(day)}(\.)*(,)*\s#{Regexp.escape(month)}\s\d+(rd|th|st)*(,)*\s\d{4}/i)
147
+ end
148
+ months_abbr.each do |month|
149
+ break if match_found
150
+ match_found = !(string !~ /#{Regexp.escape(day)}(\.)*(,)*\s#{Regexp.escape(month)}(\.)*\s\d+(rd|th|st)*(,)*\s\d{4}/i)
151
+ end
152
+ end
153
+ match_found
154
+ end
155
+
156
+ def number_only_date
157
+ !(string !~ DMY_MDY_REGEX) ||
158
+ !(string !~ YMD_YDM_REGEX) ||
159
+ !(string !~ DIGIT_ONLY_YEAR_FIRST_REGEX) ||
160
+ !(string !~ DIGIT_ONLY_YEAR_LAST_REGEX)
161
+ end
162
+
163
+ def check_for_matches(day, month)
164
+ !(string !~ /#{Regexp.escape(day)}(,)*\s#{Regexp.escape(month)}(\.)*\s\d+(rd|th|st)*(,)*\s\d{4}/i) ||
165
+ !(string !~ /#{Regexp.escape(month)}(\.)*\s\d+(rd|th|st)*(,)*\s\d{4}/i) ||
166
+ !(string !~ /\d{4}\.*\s#{Regexp.escape(month)}\s\d+(rd|th|st)*/i) ||
167
+ !(string !~ /\d{4}(\.|-|\/)*#{Regexp.escape(month)}(\.|-|\/)*\d+/i) ||
168
+ !(string !~ /#{Regexp.escape(month)}(\.)*\s\d+(rd|th|st)*/i) ||
169
+ !(string !~ /\d{2}(\.|-|\/)*#{Regexp.escape(month)}(\.|-|\/)*(\d{4}|\d{2})/i)
170
+ end
171
+ end
172
+ end
@@ -0,0 +1,43 @@
1
+ require 'confidential_info_redactor/word_lists'
2
+
3
+ module ConfidentialInfoRedactor
4
+ # This class extracts proper nouns from a text
5
+ class Extractor
6
+ # Rubular: http://rubular.com/r/qE0g4r9zR7
7
+ EXTRACT_REGEX = /(?<=\s|^|\s\")([A-Z]\S*\s)*[A-Z]\S*(?=(\s|\.|\z))|(?<=\s|^|\s\")[i][A-Z][a-z]+/
8
+ attr_reader :text, :language, :corpus
9
+ def initialize(text:, **args)
10
+ @text = text.gsub(/[’‘]/, "'")
11
+ @language = args[:language] || 'en'
12
+ case @language
13
+ when 'en'
14
+ @corpus = ConfidentialInfoRedactor::WordLists::EN_WORDS
15
+ when 'de'
16
+ @corpus = ConfidentialInfoRedactor::WordLists::DE_WORDS
17
+ else
18
+ @corpus = ConfidentialInfoRedactor::WordLists::EN_WORDS
19
+ end
20
+ end
21
+
22
+ def extract
23
+ extracted_terms = []
24
+ PragmaticSegmenter::Segmenter.new(text: text, language: language).segment.each do |segment|
25
+ initial_extracted_terms = segment.gsub(EXTRACT_REGEX).map { |match| match unless corpus.include?(match.downcase.gsub(/[\?\.\)\(\!\\\/\"\:\;]/, '').gsub(/\'$/, '')) }.compact
26
+ initial_extracted_terms.each do |ngram|
27
+ ngram.split(/[\?\)\(\!\\\/\"\:\;\,]/).each do |t|
28
+ extracted_terms << t.gsub(/[\?\)\(\!\\\/\"\:\;\,]/, '').gsub(/\'$/, '').gsub(/\.\z/, '').strip unless corpus.include?(t.downcase.gsub(/[\?\.\)\(\!\\\/\"\:\;]/, '').gsub(/\'$/, '').strip)
29
+ end
30
+ end
31
+ end
32
+
33
+ if language.eql?('de')
34
+ extracted_terms.delete_if do |token|
35
+ corpus.include?(token.split(' ')[0].downcase.strip) &&
36
+ token.split(' ')[0].downcase.strip != 'deutsche'
37
+ end.uniq.reject(&:empty?)
38
+ else
39
+ extracted_terms.uniq.reject(&:empty?)
40
+ end
41
+ end
42
+ end
43
+ end