stockade 0.1.1 → 0.2.0

checksums.yaml CHANGED
@@ -1,7 +1,7 @@
  ---
  SHA1:
-   metadata.gz: 84a7317f812734960f8ed6f56c0194d783aad1c7
-   data.tar.gz: 6d361b573054fb0b0b2c5d19a02cebe7cb705173
+   metadata.gz: 2e40900217ae6ce9c9548707578efcfe90d27211
+   data.tar.gz: 8130b7c181f4d8ff0b2e322d4a408b4311505aa8
  SHA512:
-   metadata.gz: 431ecb92de2cd3e67596af6347d23c09e86bc4efbd8436617bdd599d25140fa67872385b848bd989635b24f8477da44d880047c2ac112fa8614f9b291b844727
-   data.tar.gz: 0b9040960eb148c06f4a664d6f074c075200afd36f10046c8c50eaa914d5550ebb1c91ea8ca5f9d53d0add1f0ddfb464c7b0636beb5ec6e361eaeefce129f417
+   metadata.gz: 39b25eff37a4405e38312d5c6cce3836468f1db7791421f90176060140496c27dfec84cd5e78289e2a316e00787884a09c6d18dbe9681ca7a851693ddd76f1d5
+   data.tar.gz: abf9b314b14437d08dc1fc5116a51d2d129881097ac6491d0ac66c6c73cc16af9f0749af0b8b1886a45126fecdbf790ce27751063e2fcf9eeed0fb2051d6cb9b
@@ -2,19 +2,35 @@ PATH
    remote: .
    specs:
      stockade (0.1.1)
+       credit_card_validations (~> 3.5)
        memoist (~> 0.1)
-       rambling-trie
+       rambling-trie (~> 2.0)
+       rubyzip (~> 1.0)
 
  GEM
    remote: https://rubygems.org/
    specs:
+     activemodel (5.2.1)
+       activesupport (= 5.2.1)
+     activesupport (5.2.1)
+       concurrent-ruby (~> 1.0, >= 1.0.2)
+       i18n (>= 0.7, < 2)
+       minitest (~> 5.1)
+       tzinfo (~> 1.1)
      ast (2.4.0)
      byebug (10.0.2)
      coderay (1.1.2)
+     concurrent-ruby (1.0.5)
+     credit_card_validations (3.5.0)
+       activemodel (>= 3, <= 6)
+       activesupport (>= 3, <= 6)
      diff-lcs (1.3)
+     i18n (1.1.0)
+       concurrent-ruby (~> 1.0)
      jaro_winkler (1.5.1)
      memoist (0.16.0)
      method_source (0.9.0)
+     minitest (5.11.3)
      parallel (1.12.1)
      parser (2.5.1.2)
        ast (~> 2.4.0)
@@ -50,6 +66,10 @@ GEM
        ruby-progressbar (~> 1.7)
        unicode-display_width (~> 1.0, >= 1.0.1)
      ruby-progressbar (1.9.0)
+     rubyzip (1.2.1)
+     thread_safe (0.3.6)
+     tzinfo (1.2.5)
+       thread_safe (~> 0.1)
      unicode-display_width (1.4.0)
 
  PLATFORMS
@@ -57,10 +77,10 @@ PLATFORMS
 
  DEPENDENCIES
    bundler (~> 1.16)
-   pry-byebug
+   pry-byebug (~> 3.0)
    rake (~> 10.0)
    rspec (~> 3.0)
-   rubocop
+   rubocop (~> 0.49)
    stockade!
 
  BUNDLED WITH
data/README.md CHANGED
@@ -1,11 +1,10 @@
- # PII Lexer
+ # PII Detector
 
  _This is a proof-of-concept level software._
 
- Stockade is a lexer for Personally Identifiable Information (PII). It scans
- unstructured text (from files, logs, databases, web etc.) and tokenized
- recognized pieces of PII. This information can be used to raise errors,
- discard, mask data.
+ Stockade is a Personally Identifiable Information (PII) detector. It scans
+ unstructured text (from files, logs, databases, web etc.) and masks all
+ identified pieces of PII.
 
  ## Installation
 
@@ -17,21 +16,40 @@ gem install stockade
 
  ```ruby
  require 'stockade'
- #=> true
- Stockade.mask('Mr. John Smith email is jsmith@example.com')
- #=> "Mr. **** ***** email is ******************"
- Stockade.mask('and his phone is 555-123-4567.')
- #=> *** his phone is ************.
+
+ puts Stockade.mask(<<-EOS)
+
+ Dossier on Mr. John Smith born 09/02/1995
+ His email is jsmith@example.com and his phone is 555-123-4567.
+ He is using Visa card 4111 1111 1111 1111
+
+ EOS
+
+ #=>
+ Dossier on Mr. **** ***** born **********
+ His email is ****************** and his phone is ************.
+ ** is using Visa card *******************
+
  ```
- Yes, 'and' looks like PII because there is also a lastname.
+ Notice how the word 'He' was incorrectly identified as a name.
 
  ## Implementation
 
- It uses
+ This is done in three stages.
+
+ ### Scanning
+
+ Using a manually curated list of regexes and
  [StringScanner](https://ruby-doc.org/stdlib-2.5.1/libdoc/strscan/rdoc/StringScanner.html)
- and a manually curated list of regular expressions to match strings that _look_
- like PII. This works for things like emails, phone numbers, dates, national
- ids, credit card numbers and ip addresses. But it does not work for names.
- Names are verified against the list of known first and last names that are
- stored as a trie.
+ it extracts and labels lexeme candidates.
+
+ ### Evaluation
+
+ Lexeme candidates are further evaluated (in some cases this is a no-op) to filter
+ out false positives. For example, first and last names are checked against a
+ database of known names. Dates are checked to be in the past.
+
+ ### Parsing
 
+ Some rudimentary parsing is done. Lexemes that are fully covered by other lexemes
+ are eliminated. Ambiguous lexemes are disambiguated using rules of precedence.
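Reviewer note: the three stages described in the README can be illustrated with a small, self-contained sketch. Everything in it is an assumption made for illustration (the `Candidate` struct, the two toy patterns, the `KNOWN_NAMES` list, and the helper names); it is not the gem's code, and it assumes ASCII input so that `StringScanner#pos` (a byte offset) doubles as a character offset.

```ruby
require 'strscan'

# Hypothetical stand-in for a lexeme; end_pos is exclusive.
Candidate = Struct.new(:type, :value, :start_pos, :end_pos)

# Toy patterns and a toy name "database" (assumptions, not the gem's data).
PATTERNS = {
  email: /[\w.+-]+@[\w-]+(?:\.[\w-]+)+/,
  name: /[A-Za-z]+/
}.freeze
KNOWN_NAMES = %w[john smith].freeze

# Stage 1, scanning: try every pattern at every position of the text.
def scan(text)
  scanner = StringScanner.new(text)
  found = []
  until scanner.eos?
    PATTERNS.each do |type, re|
      value = scanner.check(re) # anchored match, does not advance
      next unless value

      found << Candidate.new(type, value, scanner.pos, scanner.pos + value.size)
    end
    scanner.getch
  end
  found
end

# Stage 2, evaluation: names must be in the list; emails pass through (no-op).
def evaluate(candidates)
  candidates.select do |c|
    c.type != :name || KNOWN_NAMES.include?(c.value.downcase)
  end
end

# Stage 3, parsing: drop candidates fully covered by a longer candidate.
def parse(candidates)
  candidates.reject do |c|
    candidates.any? do |o|
      o.value.size > c.value.size &&
        o.start_pos <= c.start_pos && o.end_pos >= c.end_pos
    end
  end
end

def mask(text)
  parse(evaluate(scan(text))).each_with_object(text.dup) do |c, out|
    out[c.start_pos...c.end_pos] = '*' * c.value.size
  end
end

puts mask('John Smith: jsmith@example.com')
```

In this sketch the `smith` inside `jsmith@example.com` survives evaluation as a known name, and is only removed in the parsing stage because the email candidate covers it.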
data/bin/load CHANGED
@@ -8,5 +8,5 @@ require 'rambling-trie'
 
  %w[lastnames firstnames words].each do |type|
    trie = Rambling::Trie.create("data/#{type}.txt")
-   Rambling::Trie.dump(trie, "data/#{type}.dump")
+   Rambling::Trie.dump(trie, "data/#{type}.zip")
  end
Binary file
Binary file
@@ -0,0 +1,13 @@
+ #-###-###-####
+ +#-###-###-####
+ +##-###-###-####
+ +###-###-###-####
+ ###-###-####
+ ### ### ####
+ (## ##) #### ####
+ ##########
+ (##) #### ####
+ (##) ## #### ####
+ ###-###-###-####
+ ###-####
+ (###) ###-####
Binary file
@@ -13,17 +13,41 @@ require 'stockade/lexemes/dict'
  require 'stockade/lexemes/word'
  require 'stockade/lexemes/lastname'
  require 'stockade/lexemes/firstname'
+ require 'stockade/lexemes/payment_card'
 
  # Stockade module
  module Stockade
-   # Mask all PII in `text` with `*`
-   #
-   def self.mask(text)
-     lexemes = Parser.call(Lexer.call(text))
-     lexemes.inject(text) do |mask, lexeme|
-       prefix = lexeme.start_pos.zero? ? '' : mask[0..lexeme.start_pos - 1]
-       postfix = mask[lexeme.end_pos..-1]
-       "#{prefix}#{lexeme.mask}#{postfix}"
+   class << self
+     # Mask all PII in `text` with `*`
+     #
+     def mask(text)
+       process(text, :mask)
+     end
+
+     def tokenize(text)
+       process(text, :token)
+     end
+
+     def process(text, action)
+       lexemes(text).inject(text) do |mask, lexeme|
+         prefix = lexeme.start_pos.zero? ? '' : mask[0..lexeme.start_pos - 1]
+         postfix = mask[lexeme.end_pos..-1]
+         "#{prefix}#{lexeme.send(action)}#{postfix}"
+       end
+     end
+
+     def extract(text)
+       lexemes(text).map do |lexeme|
+         {
+           lexeme.class.name.to_s.split('::').last.downcase => lexeme.value
+         }
+       end
+     end
+
+     private
+
+     def lexemes(text)
+       Parser.call(Lexer.call(text))
      end
    end
  end
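Reviewer note: the splice in `process` uses offsets computed against the original text, which stays valid only because both `mask` and `token` return a replacement of the same length as the matched value. A minimal sketch of that property follows; the `Span` struct and the `span_for` helper are hypothetical stand-ins for lexemes, and only `SecureRandom.base64` and the slicing expressions are taken from the diff.

```ruby
require 'securerandom'

# Hypothetical stand-in for a lexeme; end_pos is exclusive, as in the diff.
Span = Struct.new(:raw_value, :start_pos, :end_pos) do
  def mask
    '*' * raw_value.size
  end

  def token
    # base64 of n random bytes is at least n characters long, so the slice
    # always yields exactly raw_value.size characters.
    SecureRandom.base64(raw_value.size)[0..raw_value.size - 1]
  end
end

# Helper (an assumption) that locates a literal value in the text.
def span_for(text, value)
  start = text.index(value)
  Span.new(value, start, start + value.size)
end

# Same splice as in the diff: each replacement has the length of the value it
# replaces, so later offsets still point at the right characters.
def process(text, spans, action)
  spans.inject(text) do |acc, span|
    prefix = span.start_pos.zero? ? '' : acc[0..span.start_pos - 1]
    postfix = acc[span.end_pos..-1]
    "#{prefix}#{span.public_send(action)}#{postfix}"
  end
end

text = 'call 555-123-4567 or mail jsmith@example.com'
spans = [span_for(text, '555-123-4567'), span_for(text, 'jsmith@example.com')]

puts process(text, spans, :mask)
puts process(text, spans, :token)
```

If a replacement ever differed in length from the original value, every offset after it would drift, so a variable-length token scheme would presumably need a different splicing strategy.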
@@ -42,6 +42,24 @@ module Stockade
        def mask
          '*' * raw_value.size
        end
+
+       def token
+         SecureRandom.base64(raw_value.size)[0..raw_value.size - 1]
+       end
+
+       def type
+         self.class.name.split('::').last.downcase.to_sym
+       end
+
+       def self.types
+         %i[date word email firstname lastname phone]
+       end
+
+       types.each do |type_name|
+         define_method :"#{type_name}?" do
+           type == type_name
+         end
+       end
      end
    end
  end
@@ -27,13 +27,17 @@ module Stockade
          Word.new(value: value).valid?
        end
 
+       def capitalized?
+         raw_value[0] == raw_value[0].upcase
+       end
+
        class << self
          extend Memoist
 
          def dict_name; end
 
          def dict
-           Rambling::Trie.load("data/#{dict_name}.dump")
+           Rambling::Trie.load("data/#{dict_name}.zip")
          end
          memoize :dict
        end
@@ -0,0 +1,8 @@
+ # frozen_string_literal: true
+
+ module Stockade
+   module Lexemes
+     class Filler < Base
+     end
+   end
+ end
@@ -0,0 +1,28 @@
+ # frozen_string_literal: true
+
+ require 'credit_card_validations'
+
+ module Stockade
+   module Lexemes
+     # Lexeme for anything that resembles payment card numbers
+     # https://en.wikipedia.org/wiki/Payment_card_number
+     #
+     # Any 10-19 character long sequences of digits optionally grouped using
+     # ' ' or '-' delimiters are suspects
+     class PaymentCard < Base
+       def self.regex
+         /
+           (?<!\d) # NaN
+           \d
+           ([\s\-]*\d[\s\-]*){10,17}
+           \d
+           (?!\d) # NaN
+         /x
+       end
+
+       def valid?
+         CreditCardValidations::Detector.new(value).valid?
+       end
+     end
+   end
+ end
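Reviewer note: the class comment speaks of 10-19 characters, but as written the pattern appears to require one leading digit, 10 to 17 grouped digits, and one trailing digit, which is 12 to 19 digits in total. The sketch below copies the regex from the diff and checks that reading; the constant name `CARD_RE` and the helper `digits_matched` are hypothetical, and the `credit_card_validations` check from `valid?` is deliberately left out so the snippet has no gem dependency.

```ruby
# Regex copied from the diff above; only the constant name is new.
CARD_RE = /
  (?<!\d) # not preceded by a digit
  \d
  ([\s\-]*\d[\s\-]*){10,17}
  \d
  (?!\d) # not followed by a digit
/x

# Hypothetical helper: number of digits in the first match, or nil.
def digits_matched(text)
  match = CARD_RE.match(text)
  match && match[0].count('0-9')
end

p digits_matched('4111 1111 1111 1111') # a 16-digit, space-grouped number
p digits_matched('555-123-4567')        # a 10-digit phone number
p digits_matched('1' * 12)
p digits_matched('1' * 20)
```

Under this reading a 10-digit phone number is never a payment card candidate, so the regex itself already keeps most phone masks away from the card lexeme.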
@@ -4,22 +4,7 @@ module Stockade
    module Lexemes
      # Phone lexeme
      class Phone < Base
-       # Less noisy phone mask syntax compared to regexes
-       MASKS = [
-         '#-###-###-####',
-         '+#-###-###-####',
-         '+##-###-###-####',
-         '+###-###-###-####',
-         '###-###-####',
-         '### ### ####',
-         '(## ##) #### ####',
-         '##########',
-         '(##) #### ####',
-         '(##) ## #### ####',
-         '###-###-###-####',
-         '###-####',
-         '(###) ###-####'
-       ].freeze
+       MASKS = File.readlines('data/phones.txt').freeze
 
        class << self
          def regex
@@ -33,7 +18,7 @@ module Stockade
 
          private
 
-         # Convert phone number mask to its regex
+         # Convert less noisy phone mask syntax to regexes
          # ### ### #### => (?:\d{3}\s\d{3}\s\d{4})
          def to_re(mask)
            '(?:' +
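Reviewer note: the body of `to_re` is cut off in this hunk, so the following is only a hypothetical conversion that reproduces the one mapping documented in the comment (`### ### ####` to `(?:\d{3}\s\d{3}\s\d{4})`). The function name `mask_to_re` and its rules for non-`#` characters are assumptions. It also calls `chomp`, because `File.readlines` without `chomp: true` keeps the trailing newline on every mask; whether the gem strips it elsewhere is not visible in the diff.

```ruby
# Hypothetical mask-to-regex conversion; not the gem's to_re.
def mask_to_re(mask)
  body = mask.chomp.scan(/#+|[^#]/).map do |part|
    if part.start_with?('#')
      "\\d{#{part.size}}"  # a run of n '#' becomes \d{n}
    elsif part =~ /\s/
      '\\s'                # whitespace in the mask becomes \s
    else
      Regexp.escape(part)  # literal punctuation such as '(' or '+'
    end
  end.join
  "(?:#{body})"
end

puts mask_to_re("### ### ####\n")

phone_re = Regexp.new(mask_to_re("(###) ###-####\n"))
puts phone_re.match?('(555) 123-4567')
```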
@@ -12,6 +12,10 @@ module Stockade
        def mask
          raw_value
        end
+
+       def token
+         raw_value
+       end
      end
    end
  end
@@ -27,6 +27,7 @@ module Stockade
 
      def lexeme_classes
        [
+         Stockade::Lexemes::PaymentCard,
          Stockade::Lexemes::Date,
          Stockade::Lexemes::Email,
          Stockade::Lexemes::Phone,
@@ -75,18 +75,13 @@ module Stockade
      end
 
      def priority(lexeme)
-       lexeme_priorities.index(lexeme.class)
-     end
-
-     def lexeme_priorities
-       [
-         Lexemes::Word,
-         Lexemes::Firstname,
-         Lexemes::Lastname,
-         Lexemes::Phone,
-         Lexemes::Date,
-         Lexemes::Email
-       ]
+       if [Lexemes::Firstname, Lexemes::Lastname].include?(lexeme.class)
+         lexeme.capitalized? ? 3 : 1
+       elsif lexeme.is_a?(Lexemes::Word)
+         2
+       else
+         3
+       end
      end
    end
  end
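Reviewer note: the new `priority` replaces a fixed class ordering with a rule that depends on capitalization. The sketch below mirrors that rule on a hypothetical `Lexeme` struct (with a `kind` symbol instead of real classes) and assumes that, among ambiguous candidates, the one with the highest priority wins. How the parser actually consumes the number is not shown in this hunk, so the `pick` helper is an assumption.

```ruby
# Hypothetical stand-in for lexeme objects; `kind` replaces the class check.
Lexeme = Struct.new(:kind, :raw_value) do
  def capitalized?
    raw_value[0] == raw_value[0].upcase
  end
end

# Same branching as the diff, expressed on the stand-in struct.
def priority(lexeme)
  if %i[firstname lastname].include?(lexeme.kind)
    lexeme.capitalized? ? 3 : 1
  elsif lexeme.kind == :word
    2
  else
    3
  end
end

# Assumption: the candidate with the highest priority is kept.
def pick(candidates)
  candidates.max_by { |lexeme| priority(lexeme) }
end

p pick([Lexeme.new(:word, 'will'), Lexeme.new(:firstname, 'will')]).kind
p pick([Lexeme.new(:word, 'Will'), Lexeme.new(:firstname, 'Will')]).kind
```

If that assumption holds, it would also be consistent with the README's observation that 'He' gets masked: a capitalized string found in the names dictionary outranks its reading as an ordinary word.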
@@ -1,5 +1,5 @@
  # frozen_string_literal: true
 
  module Stockade
-   VERSION = '0.1.1'
+   VERSION = '0.2.0'
  end
@@ -4,6 +4,7 @@ lib = File.expand_path('lib', __dir__)
  $LOAD_PATH.unshift(lib) unless $LOAD_PATH.include?(lib)
  require 'stockade/version'
 
+ # rubocop:disable Metrics/BlockLength
  Gem::Specification.new do |spec|
    spec.name = 'stockade'
    spec.version = Stockade::VERSION
@@ -29,8 +30,11 @@ Gem::Specification.new do |spec|
    spec.add_development_dependency 'pry-byebug', '~> 3.0'
    spec.add_development_dependency 'rake', '~> 10.0'
    spec.add_development_dependency 'rspec', '~> 3.0'
-   spec.add_development_dependency 'rubocop', '~> 0.5'
+   spec.add_development_dependency 'rubocop', '~> 0.49'
 
+   spec.add_runtime_dependency 'credit_card_validations', '~> 3.5'
    spec.add_runtime_dependency 'memoist', '~> 0.1'
    spec.add_runtime_dependency 'rambling-trie', '~> 2.0'
+   spec.add_runtime_dependency 'rubyzip', '~> 1.0'
  end
+ # rubocop:enable Metrics/BlockLength
metadata CHANGED
@@ -1,14 +1,14 @@
  --- !ruby/object:Gem::Specification
  name: stockade
  version: !ruby/object:Gem::Version
-   version: 0.1.1
+   version: 0.2.0
  platform: ruby
  authors:
  - Stan Mazhara
  autorequire:
  bindir: bin
  cert_chain: []
- date: 2018-08-05 00:00:00.000000000 Z
+ date: 2018-08-12 00:00:00.000000000 Z
  dependencies:
  - !ruby/object:Gem::Dependency
    name: bundler
@@ -72,14 +72,28 @@ dependencies:
      requirements:
      - - "~>"
        - !ruby/object:Gem::Version
-         version: '0.5'
+         version: '0.49'
    type: :development
    prerelease: false
    version_requirements: !ruby/object:Gem::Requirement
      requirements:
      - - "~>"
        - !ruby/object:Gem::Version
-         version: '0.5'
+         version: '0.49'
+ - !ruby/object:Gem::Dependency
+   name: credit_card_validations
+   requirement: !ruby/object:Gem::Requirement
+     requirements:
+     - - "~>"
+       - !ruby/object:Gem::Version
+         version: '3.5'
+   type: :runtime
+   prerelease: false
+   version_requirements: !ruby/object:Gem::Requirement
+     requirements:
+     - - "~>"
+       - !ruby/object:Gem::Version
+         version: '3.5'
  - !ruby/object:Gem::Dependency
    name: memoist
    requirement: !ruby/object:Gem::Requirement
@@ -108,6 +122,20 @@ dependencies:
      - - "~>"
        - !ruby/object:Gem::Version
          version: '2.0'
+ - !ruby/object:Gem::Dependency
+   name: rubyzip
+   requirement: !ruby/object:Gem::Requirement
+     requirements:
+     - - "~>"
+       - !ruby/object:Gem::Version
+         version: '1.0'
+   type: :runtime
+   prerelease: false
+   version_requirements: !ruby/object:Gem::Requirement
+     requirements:
+     - - "~>"
+       - !ruby/object:Gem::Version
+         version: '1.0'
  description: "\n Stockade is a lexer that reads unstructured text information (from
    files,\n logs, databases etc.) and tokenizes pieces that look like personally\n
    \ identifiable information (PII).\n "
@@ -124,19 +152,19 @@ files:
  - README.md
  - Rakefile
  - bin/load
- - data/firstnames.dump
- - data/firstnames.txt
- - data/lastnames.dump
- - data/lastnames.txt
- - data/words.dump
- - data/words.txt
+ - data/firstnames.zip
+ - data/lastnames.zip
+ - data/phones.txt
+ - data/words.zip
  - lib/stockade.rb
  - lib/stockade/lexemes/base.rb
  - lib/stockade/lexemes/date.rb
  - lib/stockade/lexemes/dict.rb
  - lib/stockade/lexemes/email.rb
+ - lib/stockade/lexemes/filler.rb
  - lib/stockade/lexemes/firstname.rb
  - lib/stockade/lexemes/lastname.rb
+ - lib/stockade/lexemes/payment_card.rb
  - lib/stockade/lexemes/phone.rb
  - lib/stockade/lexemes/word.rb
  - lib/stockade/lexer.rb