RubyGems - confidential_info_redactor - Versions diffs - 0.0.1 - Mend

confidential_info_redactor 0.0.1

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (22)

checksums.yaml +7 -0
data/.gitignore +14 -0
data/.rspec +1 -0
data/.travis.yml +5 -0
data/Gemfile +4 -0
data/LICENSE.txt +22 -0
data/README.md +127 -0
data/Rakefile +5 -0
data/confidential_info_redactor.gemspec +25 -0
data/lib/confidential_info_redactor.rb +4 -0
data/lib/confidential_info_redactor/date.rb +172 -0
data/lib/confidential_info_redactor/extractor.rb +43 -0
data/lib/confidential_info_redactor/hyperlink.rb +31 -0
data/lib/confidential_info_redactor/redactor.rb +83 -0
data/lib/confidential_info_redactor/version.rb +3 -0
data/lib/confidential_info_redactor/word_lists.rb +6 -0
data/spec/confidential_info_redactor/date_spec.rb +273 -0
data/spec/confidential_info_redactor/extractor_spec.rb +115 -0
data/spec/confidential_info_redactor/hyperlink_spec.rb +61 -0
data/spec/confidential_info_redactor/redactor_spec.rb +152 -0
data/spec/spec_helper.rb +1 -0
metadata +126 -0

checksums.yaml ADDED

@@ -0,0 +1,7 @@
+---
+SHA1:
+  metadata.gz: 54ab8c079b1020a3b43ac9fd7f0d566b91a3ed60
+  data.tar.gz: e565672ba5e464adf70b909ceb6dc047f39df58c
+SHA512:
+  metadata.gz: d942f1a36c6a09978aea347a0a61edd89feb2202202eee57a4dc9bffc8e2c503654fb5afd3b799bc1ab53e6678c788484740b979b03594e7605b1f40992fc82f
+  data.tar.gz: e3206d083d5eeb24989bf596316d063ec9c09027fcad7ed8f320380b43f421b1ebea9d459d34e829b2d3bcd43272c50112bc6d24a3d7cd71c855821293e9a4f6

data/.gitignore ADDED

@@ -0,0 +1,14 @@
+/.bundle/
+/.yardoc
+/Gemfile.lock
+/_yardoc/
+/coverage/
+/doc/
+/pkg/
+/spec/reports/
+/tmp/
+*.bundle
+*.so
+*.o
+*.a
+mkmf.log

data/.rspec ADDED

@@ -0,0 +1 @@
+--color

data/.travis.yml ADDED

@@ -0,0 +1,5 @@
+language: ruby
+rvm:
+  - "2.1.0"
+  - "2.1.5"
+  - "2.2.0"

data/Gemfile ADDED

@@ -0,0 +1,4 @@
+source 'https://rubygems.org'
+# Specify your gem's dependencies in confidential_info_redactor.gemspec
+gemspec

data/LICENSE.txt ADDED

@@ -0,0 +1,22 @@
+Copyright (c) 2015 Kevin S. Dias
+MIT License
+Permission is hereby granted, free of charge, to any person obtaining
+a copy of this software and associated documentation files (the
+"Software"), to deal in the Software without restriction, including
+without limitation the rights to use, copy, modify, merge, publish,
+distribute, sublicense, and/or sell copies of the Software, and to
+permit persons to whom the Software is furnished to do so, subject to
+the following conditions:
+The above copyright notice and this permission notice shall be
+included in all copies or substantial portions of the Software.
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE
+LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION
+OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION
+WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

data/README.md ADDED

@@ -0,0 +1,127 @@
+# Confidential Info Redactor
+[![Gem Version](https://badge.fury.io/rb/confidential_info_redactor.svg)](http://badge.fury.io/rb/confidential_info_redactor) [![Build Status](https://travis-ci.org/diasks2/confidential_info_redactor.png)](https://travis-ci.org/diasks2/confidential_info_redactor) [![License](https://img.shields.io/badge/license-MIT-brightgreen.svg?style=flat)](https://github.com/diasks2/confidential_info_redactor/blob/master/LICENSE.txt)
+Confidential Info Redactor is a Ruby gem to semi-automatically redact confidential information from a text.
+This gem is a poor man's named-entity recognition (NER) library built to extract (and later redact) information in a text (such as proper nouns) that may be confidential.
+It differs from typical NER as it makes no attempt to identify whether a token is a person, company, location, etc. It only attempts to extract tokens that might fit into one of those categories.
+Your use case may vary, but the gem was written to first extract potential sensitive tokens from a text and then show the user the extracted tokens and let the user decide which ones should be redacted (or add missing tokens to the list).
+The way the gem works is rather simple. It uses regular expressions to search for capitalized tokens (1-grams, 2-grams, 3-grams etc.) and then checks whether those tokens match a list of the common vocabulary for that language (e.g. the x most frequent words - the size of x depending on what is available for that language). If the token is not in the list of words for that language it is added to an array of tokens that should be checked by the user as potential "confidential information".
+In the sentence "Pepsi and Coca-Cola battled for position in the market." the gem would extract "Pepsi" and "Coca-Cola" as potential tokens to redact.
+In addition to searching for proper nouns, the gem also has the functionality to redact numbers, dates, emails and hyperlinks.
+## Install
+**Ruby**
+*Supports Ruby 2.1.0 and above*
+```
+gem install confidential_info_redactor
+```
+**Ruby on Rails**
+Add this line to your application’s Gemfile:
+```ruby
+gem 'confidential_info_redactor'
+```
+## Usage
+* If no language is specified, the library will default to English.
+* To specify a language use its two character [ISO 639-1 code](https://www.tm-town.com/languages).
+```ruby
+text = 'Coca-Cola announced a merger with Pepsi that will happen on December 15th, 2020 for $200,000,000,000. Please contact John Smith at j.smith@example.com or visit http://www.super-fake-merger.com.'
+tokens = ConfidentialInfoRedactor::Extractor.new(text: text).extract
+# => ["Coca-Cola", "Pepsi", "John Smith"]
+ConfidentialInfoRedactor::Redactor.new(text: text, tokens: tokens).redact
+# => '<redacted> announced a merger with <redacted> that will happen on <redacted date> for <redacted number>. Please contact <redacted> at <redacted> or visit <redacted>.'
+# You can also just use a specific redactor
+ConfidentialInfoRedactor::Redactor.new(text: text).dates
+# => 'Coca-Cola announced a merger with Pepsi that will happen on <redacted date> for $200,000,000,000. Please contact John Smith at j.smith@example.com or visit http://www.super-fake-merger.com.'
+ConfidentialInfoRedactor::Redactor.new(text: text).numbers
+# => 'Coca-Cola announced a merger with Pepsi that will happen on December 15th, 2020 for <redacted number>. Please contact John Smith at j.smith@example.com or visit http://www.super-fake-merger.com.'
+ConfidentialInfoRedactor::Redactor.new(text: text).emails
+# => 'Coca-Cola announced a merger with Pepsi that will happen on December 15th, 2020 for $200,000,000,000. Please contact John Smith at <redacted> or visit http://www.super-fake-merger.com.'
+ConfidentialInfoRedactor::Redactor.new(text: text).hyperlinks
+# => 'Coca-Cola announced a merger with Pepsi that will happen on December 15th, 2020 for $200,000,000,000. Please contact John Smith at j.smith@example.com or visit <redacted>.'
+ConfidentialInfoRedactor::Redactor.new(text: text, tokens: tokens).proper_nouns
+# => '<redacted> announced a merger with <redacted> that will happen on December 15th, 2020 for $200,000,000,000. Please contact <redacted> at j.smith@example.com or visit http://www.super-fake-merger.com.'
+# It is possible to 'turn off' any of the specific redactors
+ConfidentialInfoRedactor::Redactor.new(text: text, tokens: tokens, ignore_numbers: true).redact
+# => '<redacted> announced a merger with <redacted> that will happen on <redacted date> for $200,000,000,000. Please contact <redacted> at <redacted> or visit <redacted>.'
+# German Example
+text = 'Viele Mitarbeiter der Deutschen Bank suchen eine andere Arbeitsstelle.'
+tokens = ConfidentialInfoRedactor::Extractor.new(text: text, language: 'de').extract
+# => ['Deutschen Bank']
+ConfidentialInfoRedactor::Redactor.new(text: text, language: 'de', tokens: tokens).redact
+# => 'Viele Mitarbeiter der <redacted> suchen eine andere Arbeitsstelle.'
+# It is also possible to change the redaction text
+tokens = ['Coca-Cola', 'Pepsi']
+ConfidentialInfoRedactor::Redactor.new(text: text, tokens: tokens, number_text: '**redacted number**', date_text: '^^redacted date^^', token_text: '*****').redact
+# => '***** announced a merger with ***** that will happen on ^^redacted date^^ for **redacted number**. Please contact ***** at ***** or visit *****.'
+```
+#### Redactor class options
+* `language` *(optional - defaults to 'en' if not specified)*
+* `tokens` *(optional - any tokens to redact from the text)*
+* `number_text` *(optional - change the text for redacted numbers; the standard is `<redacted number>`)*
+* `date_text` *(optional - change the text for redacted dates; the standard is `<redacted date>`)*
+* `token_text` *(optional - change the text for redacted tokens, emails and hyperlinks; the standard is `<redacted>`)*
+* `ignore_emails` *(optional - set to true if you do not want to redact emails)*
+* `ignore_dates` *(optional - set to true if you do not want to redact dates)*
+* `ignore_numbers` *(optional - set to true if you do not want to redact numbers)*
+* `ignore_hyperlinks` *(optional - set to true if you do not want to redact hyperlinks)*
+#### Languages Supported
+* English ('en')
+* German ('de')
+## Contributing
+1. Fork it ( https://github.com/diasks2/confidential_info_redactor/fork )
+2. Create your feature branch (`git checkout -b my-new-feature`)
+3. Commit your changes (`git commit -am 'Add some feature'`)
+4. Push to the branch (`git push origin my-new-feature`)
+5. Create a new Pull Request
+## License
+The MIT License (MIT)
+Copyright (c) 2015 Kevin S. Dias
+Permission is hereby granted, free of charge, to any person obtaining a copy
+of this software and associated documentation files (the "Software"), to deal
+in the Software without restriction, including without limitation the rights
+to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+copies of the Software, and to permit persons to whom the Software is
+furnished to do so, subject to the following conditions:
+The above copyright notice and this permission notice shall be included in
+all copies or substantial portions of the Software.
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
+THE SOFTWARE.
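
Note on the README above: the extraction approach it describes (find capitalized n-grams, then drop those that appear in a common-word list for the language) can be sketched roughly as follows. This is a simplified illustration, not the gem's code. `COMMON_WORDS`, `CAPITALIZED_NGRAM` and `candidate_tokens` are hypothetical names, and the tiny word list only stands in for the gem's much larger per-language lists.

```ruby
require 'set'

# Hypothetical stand-in for a per-language frequency list (assumption:
# the real lists hold the most frequent words of the language).
COMMON_WORDS = Set.new(%w[and battled for position in the market please contact]).freeze

# One or more capitalized words separated by single whitespace characters.
# Simplified on purpose; the gem's own EXTRACT_REGEX is more involved.
CAPITALIZED_NGRAM = /\b[A-Z][\w'\-]*(?:\s[A-Z][\w'\-]*)*/

# Returns capitalized n-grams that are not common vocabulary, i.e. the
# candidates a user would review before redaction.
def candidate_tokens(text, common_words: COMMON_WORDS)
  text.scan(CAPITALIZED_NGRAM)
      .reject { |ngram| common_words.include?(ngram.downcase) }
      .uniq
end

p candidate_tokens('Pepsi and Coca-Cola battled for position in the market.')
# => ["Pepsi", "Coca-Cola"]
```

A sentence-initial common word such as "Please" is capitalized but is dropped by the word-list check, which is the reason the README pairs the regular expression with a vocabulary lookup.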

data/Rakefile ADDED

@@ -0,0 +1,5 @@
+require 'bundler/gem_tasks'
+require 'rspec/core/rake_task'
+RSpec::Core::RakeTask.new(:spec)
+task :default => :spec

data/confidential_info_redactor.gemspec ADDED

@@ -0,0 +1,25 @@
+# coding: utf-8
+lib = File.expand_path('../lib', __FILE__)
+$LOAD_PATH.unshift(lib) unless $LOAD_PATH.include?(lib)
+require 'confidential_info_redactor/version'
+Gem::Specification.new do |spec|
+  spec.name          = "confidential_info_redactor"
+  spec.version       = ConfidentialInfoRedactor::VERSION
+  spec.authors       = ["Kevin S. Dias"]
+  spec.email         = ["diasks2@gmail.com"]
+  spec.summary       = %q{Semi-automatically redact confidential information from a text}
+  spec.description   = %q{A Ruby gem to semi-automatically redact confidential information from a text}
+  spec.homepage      = "https://github.com/diasks2/confidential_info_redactor"
+  spec.license       = "MIT"
+  spec.files         = `git ls-files -z`.split("\x0")
+  spec.executables   = spec.files.grep(%r{^bin/}) { |f| File.basename(f) }
+  spec.test_files    = spec.files.grep(%r{^(test|spec|features)/})
+  spec.require_paths = ["lib"]
+  spec.add_development_dependency "bundler", "~> 1.7"
+  spec.add_development_dependency "rake", "~> 10.0"
+  spec.add_development_dependency "rspec"
+  spec.add_runtime_dependency "pragmatic_segmenter"
+end

data/lib/confidential_info_redactor.rb ADDED

@@ -0,0 +1,4 @@
+require 'confidential_info_redactor/version'
+require 'confidential_info_redactor/extractor'
+require 'confidential_info_redactor/redactor'
+require 'pragmatic_segmenter'

data/lib/confidential_info_redactor/date.rb ADDED

@@ -0,0 +1,172 @@
+module ConfidentialInfoRedactor
+  class Date
+    EN_DOW = %w(monday tuesday wednesday thursday friday saturday sunday)
+    EN_DOW_ABBR = %w(mon tu tue tues wed th thu thur thurs fri sat sun)
+    EN_MONTHS = %w(january february march april may june july august september october november december)
+    EN_MONTH_ABBR = %w(jan feb mar apr jun jul aug sep sept oct nov dec)
+    DE_DOW = %w(montag dienstag mittwoch donnerstag freitag samstag sonntag sonnabend)
+    DE_DOW_ABBR = %w(mo di mi do fr sa so)
+    DE_MONTHS = %w(januar februar märz april mai juni juli august september oktober november dezember)
+    DE_MONTH_ABBR = %w(jan jän feb märz apr mai juni juli aug sep sept okt nov dez)
+    # Rubular: http://rubular.com/r/73CZ2HU0q6
+    DMY_MDY_REGEX = /(\d{1,2}(\/|\.|-)){2}\d{4}/
+    # Rubular: http://rubular.com/r/GWbuWXw4t0
+    YMD_YDM_REGEX = /\d{4}(\/|\.|-)(\d{1,2}(\/|\.|-)){2}/
+    # Rubular: http://rubular.com/r/SRZ27XNlvR
+    DIGIT_ONLY_YEAR_FIRST_REGEX = /[12]\d{7}\D/
+    # Rubular: http://rubular.com/r/mpVSeaKwdY
+    DIGIT_ONLY_YEAR_LAST_REGEX = /\d{4}[12]\d{3}\D/
+    attr_reader :string, :language, :dow, :dow_abbr, :months, :months_abbr
+    def initialize(string:, language:)
+      @string = string
+      @language = language
+      case language
+      when 'en'
+        @dow = EN_DOW
+        @dow_abbr = EN_DOW_ABBR
+        @months = EN_MONTHS
+        @months_abbr = EN_MONTH_ABBR
+      when 'de'
+        @dow = DE_DOW
+        @dow_abbr = DE_DOW_ABBR
+        @months = DE_MONTHS
+        @months_abbr = DE_MONTH_ABBR
+      else
+        @dow = EN_DOW
+        @dow_abbr = EN_DOW_ABBR
+        @months = EN_MONTHS
+        @months_abbr = EN_MONTH_ABBR
+      end
+    end
+    def includes_date?
+      long_date || number_only_date
+    end
+    def replace
+      new_string = string.dup
+      counter = 0
+      dow_abbr.each do |day|
+        counter +=1 if string.include?('day')
+      end
+      if counter > 0
+        dow_abbr.each do |day|
+          months.each do |month|
+            new_string = new_string.gsub(/#{Regexp.escape(day)}(\.)*(,)*\s#{Regexp.escape(month)}\s\d+(rd|th|st)*(,)*\s\d{4}/i, ' <redacted date> ')
+          end
+          months_abbr.each do |month|
+            new_string = new_string.gsub(/#{Regexp.escape(day)}(\.)*(,)*\s#{Regexp.escape(month)}(\.)*\s\d+(rd|th|st)*(,)*\s\d{4}/i, ' <redacted date> ')
+          end
+        end
+        dow.each do |day|
+          months.each do |month|
+            new_string = new_string.gsub(/#{Regexp.escape(day)}(,)*\s#{Regexp.escape(month)}(\.)*\s\d+(rd|th|st)*(,)*\s\d{4}/i, ' <redacted date> ')
+                                   .gsub(/\d{2}(\.|-|\/)*\s?#{Regexp.escape(month)}(\.|-|\/)*\s?(\d{4}|\d{2})/i, ' <redacted date> ')
+                                   .gsub(/#{Regexp.escape(month)}(\.)*\s\d+(rd|th|st)*(,)*\s\d{4}/i, ' <redacted date> ')
+                                   .gsub(/\d{4}\.*\s#{Regexp.escape(month)}\s\d+(rd|th|st)*/i, ' <redacted date> ')
+                                   .gsub(/\d{4}(\.|-|\/)*#{Regexp.escape(month)}(\.|-|\/)*\d+/i, ' <redacted date> ')
+                                   .gsub(/#{Regexp.escape(month)}(\.)*\s\d+(rd|th|st)*/i, ' <redacted date> ')
+          end
+          months_abbr.each do |month|
+            new_string = new_string.gsub(/#{Regexp.escape(day)}(,)*\s#{Regexp.escape(month)}(\.)*\s\d+(rd|th|st)*(,)*\s\d{4}/i, ' <redacted date> ')
+                                   .gsub(/\d{2}(\.|-|\/)*\s?#{Regexp.escape(month)}(\.|-|\/)*\s?(\d{4}|\d{2})/i, ' <redacted date> ')
+                                   .gsub(/#{Regexp.escape(month)}(\.)*\s\d+(rd|th|st)*(,)*\s\d{4}/i, ' <redacted date> ')
+                                   .gsub(/\d{4}\.*\s#{Regexp.escape(month)}\s\d+(rd|th|st)*/i, ' <redacted date> ')
+                                   .gsub(/\d{4}(\.|-|\/)*#{Regexp.escape(month)}(\.|-|\/)*\d+/i, ' <redacted date> ')
+                                   .gsub(/#{Regexp.escape(month)}(\.)*\s\d+(rd|th|st)*/i, ' <redacted date> ')
+          end
+        end
+      else
+        dow.each do |day|
+          months.each do |month|
+            new_string = new_string.gsub(/#{Regexp.escape(day)}(,)*\s#{Regexp.escape(month)}(\.)*\s\d+(rd|th|st)*(,)*\s\d{4}/i, ' <redacted date> ')
+                                   .gsub(/\d{2}(\.|-|\/)*\s?#{Regexp.escape(month)}(\.|-|\/)*\s?(\d{4}|\d{2})/i, ' <redacted date> ')
+                                   .gsub(/#{Regexp.escape(month)}(\.)*\s\d+(rd|th|st)*(,)*\s\d{4}/i, ' <redacted date> ')
+                                   .gsub(/\d{4}\.*\s#{Regexp.escape(month)}\s\d+(rd|th|st)*/i, ' <redacted date> ')
+                                   .gsub(/\d{4}(\.|-|\/)*#{Regexp.escape(month)}(\.|-|\/)*\d+/i, ' <redacted date> ')
+                                   .gsub(/#{Regexp.escape(month)}(\.)*\s\d+(rd|th|st)*/i, ' <redacted date> ')
+          end
+          months_abbr.each do |month|
+            new_string = new_string.gsub(/#{Regexp.escape(day)}(,)*\s#{Regexp.escape(month)}(\.)*\s\d+(rd|th|st)*(,)*\s\d{4}/i, ' <redacted date> ')
+                                   .gsub(/\d{2}(\.|-|\/)*\s?#{Regexp.escape(month)}(\.|-|\/)*\s?(\d{4}|\d{2})/i, ' <redacted date> ')
+                                   .gsub(/#{Regexp.escape(month)}(\.)*\s\d+(rd|th|st)*(,)*\s\d{4}/i, ' <redacted date> ')
+                                   .gsub(/\d{4}\.*\s#{Regexp.escape(month)}\s\d+(rd|th|st)*/i, ' <redacted date> ')
+                                   .gsub(/\d{4}(\.|-|\/)*#{Regexp.escape(month)}(\.|-|\/)*\d+/i, ' <redacted date> ')
+                                   .gsub(/#{Regexp.escape(month)}(\.)*\s\d+(rd|th|st)*/i, ' <redacted date> ')
+          end
+        end
+        dow_abbr.each do |day|
+          months.each do |month|
+            new_string = new_string.gsub(/#{Regexp.escape(day)}(\.)*(,)*\s#{Regexp.escape(month)}\s\d+(rd|th|st)*(,)*\s\d{4}/i, ' <redacted date> ')
+          end
+          months_abbr.each do |month|
+            new_string = new_string.gsub(/#{Regexp.escape(day)}(\.)*(,)*\s#{Regexp.escape(month)}(\.)*\s\d+(rd|th|st)*(,)*\s\d{4}/i, ' <redacted date> ')
+          end
+        end
+      end
+      new_string = new_string.gsub(DMY_MDY_REGEX, ' <redacted date> ')
+                     .gsub(YMD_YDM_REGEX, ' <redacted date> ')
+                     .gsub(DIGIT_ONLY_YEAR_FIRST_REGEX, ' <redacted date> ')
+                     .gsub(DIGIT_ONLY_YEAR_LAST_REGEX, ' <redacted date> ')
+    end
+    def occurences
+      replace.scan(/<redacted date>/).size
+    end
+    def replace_number_only_date
+      string.gsub(DMY_MDY_REGEX, ' <redacted date> ')
+            .gsub(YMD_YDM_REGEX, ' <redacted date> ')
+            .gsub(DIGIT_ONLY_YEAR_FIRST_REGEX, ' <redacted date> ')
+            .gsub(DIGIT_ONLY_YEAR_LAST_REGEX, ' <redacted date> ')
+    end
+    private
+    def long_date
+      match_found = false
+      dow.each do |day|
+        months.each do |month|
+          break if match_found
+          match_found = check_for_matches(day, month)
+        end
+        months_abbr.each do |month|
+          break if match_found
+          match_found = check_for_matches(day, month)
+        end
+      end
+      dow_abbr.each do |day|
+        months.each do |month|
+          break if match_found
+          match_found = !(string !~ /#{Regexp.escape(day)}(\.)*(,)*\s#{Regexp.escape(month)}\s\d+(rd|th|st)*(,)*\s\d{4}/i)
+        end
+        months_abbr.each do |month|
+          break if match_found
+          match_found = !(string !~ /#{Regexp.escape(day)}(\.)*(,)*\s#{Regexp.escape(month)}(\.)*\s\d+(rd|th|st)*(,)*\s\d{4}/i)
+        end
+      end
+      match_found
+    end
+    def number_only_date
+      !(string !~ DMY_MDY_REGEX) ||
+      !(string !~ YMD_YDM_REGEX) ||
+      !(string !~ DIGIT_ONLY_YEAR_FIRST_REGEX) ||
+      !(string !~ DIGIT_ONLY_YEAR_LAST_REGEX)
+    end
+    def check_for_matches(day, month)
+      !(string !~ /#{Regexp.escape(day)}(,)*\s#{Regexp.escape(month)}(\.)*\s\d+(rd|th|st)*(,)*\s\d{4}/i) ||
+      !(string !~ /#{Regexp.escape(month)}(\.)*\s\d+(rd|th|st)*(,)*\s\d{4}/i) ||
+      !(string !~ /\d{4}\.*\s#{Regexp.escape(month)}\s\d+(rd|th|st)*/i) ||
+      !(string !~ /\d{4}(\.|-|\/)*#{Regexp.escape(month)}(\.|-|\/)*\d+/i) ||
+      !(string !~ /#{Regexp.escape(month)}(\.)*\s\d+(rd|th|st)*/i) ||
+      !(string !~ /\d{2}(\.|-|\/)*#{Regexp.escape(month)}(\.|-|\/)*(\d{4}|\d{2})/i)
+    end
+  end
+end
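
Note on date.rb: the four digit-only patterns can be exercised in isolation to see what they accept. The sketch below copies the constants from the hunk above into a hypothetical module (`NumericDateSketch` and `numeric_date?` are illustration names, not part of the gem) and mirrors the truth table of the private `number_only_date` method.

```ruby
# Constants copied verbatim from the date.rb hunk; the module name and the
# helper method are hypothetical and exist only for this illustration.
module NumericDateSketch
  DMY_MDY_REGEX = /(\d{1,2}(\/|\.|-)){2}\d{4}/
  YMD_YDM_REGEX = /\d{4}(\/|\.|-)(\d{1,2}(\/|\.|-)){2}/
  DIGIT_ONLY_YEAR_FIRST_REGEX = /[12]\d{7}\D/
  DIGIT_ONLY_YEAR_LAST_REGEX = /\d{4}[12]\d{3}\D/

  ALL = [DMY_MDY_REGEX, YMD_YDM_REGEX,
         DIGIT_ONLY_YEAR_FIRST_REGEX, DIGIT_ONLY_YEAR_LAST_REGEX].freeze

  # True when any of the four patterns matches somewhere in the string.
  def self.numeric_date?(string)
    ALL.any? { |regex| regex.match?(string) }
  end
end

samples = ['12/25/2020', '25.12.2020', '2020-12-25', '2020-12-25.', '20201225', '20201225 ']
samples.each { |s| puts "#{s.inspect} => #{NumericDateSketch.numeric_date?(s)}" }
# "12/25/2020" => true
# "25.12.2020" => true
# "2020-12-25" => false
# "2020-12-25." => true
# "20201225" => false
# "20201225 " => true
```

As written in this diff, `YMD_YDM_REGEX` requires a separator after the final one- or two-digit group, and both digit-only patterns require a trailing non-digit character, so a year-first date at the very end of a string is not detected. When these patterns are used with `gsub`, that trailing character is part of the match and is replaced along with the date.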

data/lib/confidential_info_redactor/extractor.rb ADDED

@@ -0,0 +1,43 @@
+require 'confidential_info_redactor/word_lists'
+module ConfidentialInfoRedactor
+  # This class extracts proper nouns from a text
+  class Extractor
+    # Rubular: http://rubular.com/r/qE0g4r9zR7
+    EXTRACT_REGEX = /(?<=\s|^|\s\")([A-Z]\S*\s)*[A-Z]\S*(?=(\s|\.|\z))|(?<=\s|^|\s\")[i][A-Z][a-z]+/
+    attr_reader :text, :language, :corpus
+    def initialize(text:, **args)
+      @text = text.gsub(/[’‘]/, "'")
+      @language = args[:language] || 'en'
+      case @language
+      when 'en'
+        @corpus = ConfidentialInfoRedactor::WordLists::EN_WORDS
+      when 'de'
+        @corpus = ConfidentialInfoRedactor::WordLists::DE_WORDS
+      else
+        @corpus = ConfidentialInfoRedactor::WordLists::EN_WORDS
+      end
+    end
+    def extract
+      extracted_terms = []
+      PragmaticSegmenter::Segmenter.new(text: text, language: language).segment.each do |segment|
+        initial_extracted_terms = segment.gsub(EXTRACT_REGEX).map { |match| match unless corpus.include?(match.downcase.gsub(/[\?\.\)\(\!\\\/\"\:\;]/, '').gsub(/\'$/, '')) }.compact
+        initial_extracted_terms.each do |ngram|
+          ngram.split(/[\?\)\(\!\\\/\"\:\;\,]/).each do |t|
+            extracted_terms << t.gsub(/[\?\)\(\!\\\/\"\:\;\,]/, '').gsub(/\'$/, '').gsub(/\.\z/, '').strip unless corpus.include?(t.downcase.gsub(/[\?\.\)\(\!\\\/\"\:\;]/, '').gsub(/\'$/, '').strip)
+          end
+        end
+      end
+      if language.eql?('de')
+        extracted_terms.delete_if do |token|
+          corpus.include?(token.split(' ')[0].downcase.strip) &&
+            token.split(' ')[0].downcase.strip != 'deutsche'
+        end.uniq.reject(&:empty?)
+      else
+        extracted_terms.uniq.reject(&:empty?)
+      end
+    end
+  end
+end
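
Note on extractor.rb: `extract` calls `segment.gsub(EXTRACT_REGEX)` with neither a replacement nor a block. In that form `String#gsub` returns an Enumerator over the full match strings, which is presumably why it is used here instead of `scan`: when a pattern contains a capture group, `scan` returns the captured groups rather than the whole matches. A minimal sketch of the idiom, using a simplified pattern (`NGRAM_WITH_GROUP` is a hypothetical name, not the gem's `EXTRACT_REGEX`):

```ruby
# Simplified pattern with one capture group, loosely modelled on the shape
# of the gem's regex (hypothetical; lookarounds omitted for clarity).
NGRAM_WITH_GROUP = /([A-Z]\w*\s)*[A-Z]\w*/

text = 'Contact John Smith at Acme today'

# gsub without a replacement or block yields every full match.
enumerator = text.gsub(NGRAM_WITH_GROUP)
matches = enumerator.map { |match| match }
p enumerator.class # => Enumerator
p matches          # => ["Contact John Smith", "Acme"]

# scan returns one array of capture-group values per match instead,
# so the full n-grams are not recoverable from it directly.
p text.scan(NGRAM_WITH_GROUP) == matches # => false
```

In the gem, each yielded match is then normalized (lowercased, punctuation stripped) and kept only if it is absent from the language's word list.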