wilderpeople 0.1.0

checksums.yaml ADDED
@@ -0,0 +1,7 @@
1
+ ---
2
+ SHA1:
3
+ metadata.gz: 4591ba4d3f4654cae88164e3106ef022b0dfbc32
4
+ data.tar.gz: d364ee06d64951745386aba6b7fadbc3783cd4e5
5
+ SHA512:
6
+ metadata.gz: 756050f37ba7b4ed278e80276b977d006ed74fd7ef8cfc1c29819ffed3c5c9623d1b03b085d9b53711872508365c2d547aad1379b86532485556df6eb18a2fc9
7
+ data.tar.gz: 4068c8206a34f4f0bac26d7eb7f7c34663ed090a8cd2f18510aefc15965ae54425c252c9053a84644b5acd228cee282187f5a3e59d072776cd25e1a0fc8549d9
data/.gitignore ADDED
@@ -0,0 +1,9 @@
1
+ /.bundle/
2
+ /.yardoc
3
+ /Gemfile.lock
4
+ /_yardoc/
5
+ /coverage/
6
+ /doc/
7
+ /pkg/
8
+ /spec/reports/
9
+ /tmp/
data/.travis.yml ADDED
@@ -0,0 +1,5 @@
1
+ sudo: false
2
+ language: ruby
3
+ rvm:
4
+ - 2.0.0
5
+ before_install: gem install bundler -v 1.13.6
data/Gemfile ADDED
@@ -0,0 +1,4 @@
1
+ source 'https://rubygems.org'
2
+
3
+ # Specify your gem's dependencies in wilderpeople.gemspec
4
+ gemspec
data/LICENSE.txt ADDED
@@ -0,0 +1,21 @@
1
+ The MIT License (MIT)
2
+
3
+ Copyright (c) 2017 Rob Nichols and Warwickshire County Council
4
+
5
+ Permission is hereby granted, free of charge, to any person obtaining a copy
6
+ of this software and associated documentation files (the "Software"), to deal
7
+ in the Software without restriction, including without limitation the rights
8
+ to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
9
+ copies of the Software, and to permit persons to whom the Software is
10
+ furnished to do so, subject to the following conditions:
11
+
12
+ The above copyright notice and this permission notice shall be included in
13
+ all copies or substantial portions of the Software.
14
+
15
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16
+ IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17
+ FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18
+ AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19
+ LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20
+ OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
21
+ THE SOFTWARE.
data/README.md ADDED
@@ -0,0 +1,173 @@
1
+ # Wilderpeople
2
+
3
+ This tool was specifically designed to assist in the identification of people
4
+ based on sets of matching data. However, it could also be used to retrieve
5
+ other types of data from an array.
6
+
7
+ The tool assumes you have an array of hashes, where each hash describes a
8
+ unique item or person. It then allows you to pass in a "matching dataset" and
9
+ in return it will return the hash that best matches that data, **if there
10
+ is only one best match.**
11
+
12
+ ## Installation
13
+
14
+ Add this line to your application's Gemfile:
15
+
16
+ ```ruby
17
+ gem 'wilderpeople'
18
+ ```
19
+
20
+ And then execute:
21
+
22
+ $ bundle
23
+
24
+ Or install it yourself as:
25
+
26
+ $ gem install wilderpeople
27
+
28
+ ## Usage
29
+
30
+ Given the following dataset:
31
+
32
+ ```ruby
33
+ data = [
34
+ {surname: 'Bloggs', forename: 'Fred', gender: 'Male'},
35
+ {surname: 'Bloggs', forename: 'Winifred', gender: 'Female'},
36
+ {surname: 'Bloggs', forename: 'Jane', gender: 'Female'}
37
+ ]
38
+ ```
39
+
40
+ We have someone called Fred Bloggs, and we wish to find the hash that matches
41
+ them.
42
+
43
+ To do that we first need to identify the criteria on which to base the match.
44
+ As we just have a forename and surname, let's start with:
45
+
46
+ ```ruby
47
+ config = {
48
+ must: {
49
+ surname: :exact,
50
+ forename: :exact
51
+ }
52
+ }
53
+ ```
54
+
55
+ That is, to get a match the hash in the dataset must exactly match both the
56
+ surname and the forename.
57
+
58
+ With that information, we can now perform a search:
59
+
60
+ ```ruby
61
+ search = Wilderpeople::Search.new(data: data, config: config)
62
+ person = search.find surname: 'Bloggs', forename: 'Fred'
63
+ person == {surname: 'Bloggs', forename: 'Fred', gender: 'Male'}
64
+ ```
65
+
66
+ Success... However, it turns out that the Fred Bloggs we're after is actually
67
+ Winifred Bloggs, Frederick Bloggs's sister, who also goes by the name 'Fred'.
68
+
69
+ So we could try changing the criteria in the config to look for a female Bloggs:
70
+
71
+ ```ruby
72
+ config = {
73
+ must: {
74
+ surname: :exact,
75
+ gender: :exact
76
+ }
77
+ }
78
+ search = Wilderpeople::Search.new(data: data, config: config)
79
+ person = search.find surname: 'Bloggs', gender: 'Female'
80
+ person == nil
81
+ ```
82
+
83
+ Unfortunately, both Winifred Bloggs and Jane Bloggs match those criteria, so the
84
+ search is unable to find a unique match and returns `nil`.
85
+
86
+ So how do we identify the correct hash? The solution is to use fuzzy matching.
87
+
88
+ ```ruby
89
+ config = {
90
+ must: {
91
+ surname: :exact,
92
+ forename: :hypocorism,
93
+ gender: :exact
94
+ }
95
+ }
96
+ search = Wilderpeople::Search.new(data: data, config: config)
97
+ person = search.find surname: 'Bloggs', forename: 'Fred', gender: 'Female'
98
+ person == {surname: 'Bloggs', forename: 'Winifred', gender: 'Female'}
99
+ ```
100
+
101
+ ### Criteria configuration
102
+
103
+ The `config` is a hash with one or two parts:
104
+
105
+ * **must** The rules that must be matched
106
+ * **can** The rules that may be matched
107
+
108
+ Each part is itself a hash where the keys match the data keys, and the values
109
+ are the matchers to be used to compare the data in those keys.
110
+
111
+ ```ruby
112
+ config = { must: { surname: :exact } }
113
+ ```
114
+
115
+ With this configuration, the `exact` matcher will be used to compare the
116
+ `surname` of each hash in the data.
117
+
118
+ ### Matchers
119
+
120
+ The matchers are defined by the protected methods in the
121
+ [Wilderpeople::Matcher class](https://github.com/reggieb/wilderpeople/blob/master/lib/wilderpeople/matcher.rb).
122
+ Please read the comments in this class to find out how each matcher operates.
123
+
124
+ ### Can matching
125
+
126
+ If possible the match will be attempted using only the `must` criteria, but
127
+ if more than one match is returned, the system can then use the `can` criteria
128
+ to try and find a unique match.
129
+
130
+ So in the example above, Winifred could have been found using:
131
+
132
+ ```ruby
133
+ config = {
134
+ must: {
135
+ surname: :exact,
136
+ gender: :exact
137
+ },
138
+ can: {
139
+ forename: :hypocorism
140
+ }
141
+ }
142
+ search = Wilderpeople::Search.new(data: data, config: config)
143
+ person = search.find surname: 'Bloggs', forename: 'Fred', gender: 'Female'
144
+ person == {surname: 'Bloggs', forename: 'Winifred', gender: 'Female'}
145
+ ```
146
+
147
+ With this configuration the `hypocorism` matcher would only be called if the
148
+ search was unable to find a unique record by `surname` and `gender`. That is,
149
+ if we then went on to get Frederick's hash using the same configuration:
150
+
151
+ ```ruby
152
+ person = search.find surname: 'Bloggs', forename: 'Fred', gender: 'Male'
153
+ ```
154
+
155
+ A unique record would be identified without having to do the `hypocorism` match.
156
+
157
+ ## Development
158
+
159
+ After checking out the repo, run `bin/setup` to install dependencies. Then, run
160
+ `rake` to run the tests. You can also run `bin/console` for an interactive
161
+ prompt that will allow you to experiment.
162
+
163
+ ## Contributing
164
+
165
+ Bug reports and pull requests are welcome on GitHub at
166
+ https://github.com/reggieb/wilderpeople.
167
+
168
+
169
+ ## License
170
+
171
+ The gem is available as open source under the terms of the
172
+ [MIT License](http://opensource.org/licenses/MIT).
173
+
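The uniqueness rule described in the README (a record is returned only when exactly one record satisfies every `must` rule) can be sketched without the gem. This is a minimal illustration, not the gem's implementation: `unique_match` and the `exact` lambda are hypothetical names, and the lambda only approximates the `exact` matcher by ignoring case and surrounding whitespace.

```ruby
# Hypothetical stand-in for a `must`-only search: every rule must pass,
# and a record is returned only when exactly one record survives.
def unique_match(data, rules, wanted)
  matches = data.select do |datum|
    rules.all? do |key, matcher|
      # A missing value on either side never counts as a match.
      next false unless datum[key] && wanted[key]
      matcher.call(datum[key], wanted[key])
    end
  end
  matches.first if matches.size == 1
end

# Assumed approximation of an exact comparison (case and padding ignored).
exact = ->(a, b) { a.to_s.strip.downcase == b.to_s.strip.downcase }

data = [
  { surname: 'Bloggs', forename: 'Fred', gender: 'Male' },
  { surname: 'Bloggs', forename: 'Winifred', gender: 'Female' },
  { surname: 'Bloggs', forename: 'Jane', gender: 'Female' }
]

# One record matches both rules, so it is returned.
fred = unique_match(data, { surname: exact, forename: exact },
                    { surname: 'Bloggs', forename: 'Fred' })

# Two records match, so the result is nil rather than a guess.
ambiguous = unique_match(data, { surname: exact, gender: exact },
                         { surname: 'Bloggs', gender: 'Female' })
```

Returning `nil` for an ambiguous result keeps the caller from silently attaching data to the wrong person, which appears to be the design goal stated in the README.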
data/Rakefile ADDED
@@ -0,0 +1,10 @@
1
+ require "bundler/gem_tasks"
2
+ require "rake/testtask"
3
+
4
+ Rake::TestTask.new(:test) do |t|
5
+ t.libs << "test"
6
+ t.libs << "lib"
7
+ t.test_files = FileList['test/**/*_test.rb']
8
+ end
9
+
10
+ task :default => :test
data/bin/console ADDED
@@ -0,0 +1,14 @@
1
+ #!/usr/bin/env ruby
2
+
3
+ require "bundler/setup"
4
+ require "wilderpeople"
5
+
6
+ # You can add fixtures and/or initialization code here to make experimenting
7
+ # with your gem easier. You can also use a different console, if you like.
8
+
9
+ # (If you use this, don't forget to add pry to your Gemfile!)
10
+ # require "pry"
11
+ # Pry.start
12
+
13
+ require "irb"
14
+ IRB.start
data/bin/setup ADDED
@@ -0,0 +1,8 @@
1
+ #!/usr/bin/env bash
2
+ set -euo pipefail
3
+ IFS=$'\n\t'
4
+ set -vx
5
+
6
+ bundle install
7
+
8
+ # Do any other automated setup that you need to do here
data/lib/wilderpeople/matcher.rb ADDED
@@ -0,0 +1,153 @@
1
+ require 'hypocorism'
2
+ require 'date'
3
+ require 'levenshtein'
4
+ module Wilderpeople
5
+ class Matcher
6
+
7
+ class << self
8
+ attr_writer :levenshtein_threshold
9
+
10
+ def levenshtein_threshold
11
+ @levenshtein_threshold ||= 0.3
12
+ end
13
+
14
+ # Using `method_missing` so that instead of having to do:
15
+ # Matcher.new(a, b).exact
16
+ # We can do
17
+ # Matcher.exact(a, b)
18
+ def method_missing(method, *args, &block)
19
+ if protected_instance_methods.include?(method)
20
+ by(method, args[0], args[1])
21
+ else
22
+ super
23
+ end
24
+ end
25
+
26
+ # Another way to use matcher is:
27
+ # Matcher.by :exact, a, b
28
+ def by(method, a, b)
29
+ raise "Method must be defined" unless method
30
+ new(a, b).send(method)
31
+ end
32
+ end
33
+
34
+ attr_reader :a, :b, :dates
35
+
36
+ # Passing arguments into initialize so that prep can be done just once
37
+ # and before main test.
38
+ # Also means that I don't have to worry about the result of one match
39
+ # polluting the next match
40
+ def initialize(a, b)
41
+ @a, @b = a.clone, b.clone
42
+ prep
43
+ end
44
+
45
+ protected # note that method_missing looks for protected_instance_methods
46
+
47
+ # `exact` used in other methods so needs to work with either stored `a` and `b`
48
+ # or items passed into it
49
+ def exact(x = a, y = b)
50
+ x == y
51
+ end
52
+
53
+ # All the right letters but not necessarily in the right order
54
+ def transposed
55
+ return true if exact
56
+ exact *[a,b].collect{|x| x.chars.sort}
57
+ end
58
+
59
+ # Designed for matching street names, so don't have to worry about
60
+ # Road/Rd and Street/St.
61
+ # Note that if matching one word, just that word is matched
62
+ def exact_except_last_word
63
+ return true if exact
64
+ words = [a,b].collect(&:split)
65
+ exact *words.collect{|w| w.size == 1 ? w : w[0..-2]}
66
+ end
67
+
68
+ # Match the first letter only
69
+ def first_letter
70
+ return true if exact
71
+ exact *[a,b].collect{|x| x[0]}
72
+ end
73
+
74
+ # Match 'Foo bar' with 'Foobar'
75
+ def exact_no_space
76
+ return true if exact
77
+ exact *[a,b].collect{|x| x.gsub(/\s/, '')}
78
+ end
79
+
80
+ # Match English first names with alternative forms
81
+ # So 'Robert' matches 'Rob'
82
+ def hypocorism
83
+ Hypocorism.match(a,b)
84
+ end
85
+
86
+ # Match dates
87
+ def date
88
+ return true if exact
89
+ @dates = [a, b].collect{|x| date_parse(x)}
90
+ exact(*dates)
91
+ rescue ArgumentError # Error raised when entry won't parse
92
+ false
93
+ end
94
+
95
+ # Matches dates, but also handles day and month being swapped.
96
+ # So 3/5/2001 matches 5/3/2001
97
+ def fuzzy_date
98
+ return true if date
99
+ return false unless dates
100
+ return false if dates[1].day > 12
101
+ exact(dates[0], swap_day_month(@dates[1]))
102
+ end
103
+
104
+ # Use Levenshtein distance to compare similar strings.
105
+ # The Levenshtein distance is a string metric for measuring the difference
106
+ # between two sequences. Informally, the Levenshtein distance between two
107
+ # words is the minimum number of single-character edits required to change
108
+ # one word into the other.
109
+ # See https://github.com/tliff/levenshtein
110
+ def fuzzy(threshold = self.class.levenshtein_threshold)
111
+ return true if exact
112
+ return false if a.empty? || b.empty?
113
+ !!Levenshtein.normalized_distance(a,b, threshold)
114
+ end
115
+
116
+ private
117
+
118
+ def prep(items = [a, b])
119
+ items.each do |attr|
120
+ case attr
121
+ when String
122
+ attr.downcase!
123
+ attr.strip!
124
+ when Array
125
+ prep(attr) # normalise each element of a nested array
126
+ else
127
+ attr
128
+ end
129
+ end
130
+ end
131
+
132
+ def swap_day_month(date)
133
+ Date.new(date.year, date.day, date.month)
134
+ end
135
+
136
+ def date_parse(string)
137
+ Date.parse(string)
138
+ rescue ArgumentError
139
+ try_american_format(string)
140
+ end
141
+
142
+ def try_american_format(string)
143
+ Date.strptime string, "%m/%d/%Y"
144
+ rescue ArgumentError
145
+ # Need to catch instance where system is using American format
146
+ try_proper_format(string)
147
+ end
148
+
149
+ def try_proper_format(string)
150
+ Date.strptime string, "%d/%m/%Y"
151
+ end
152
+ end
153
+ end
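The day and month swap that `fuzzy_date` relies on can be shown in isolation using only the standard library. This is a sketch under assumptions: `day_month_swapped_match?` is a hypothetical helper that takes already-parsed `Date` objects, so it skips the parsing fallbacks used in the class above.

```ruby
require 'date'

# Hypothetical helper: true when two dates are equal, or equal once the
# day and month of the second date are swapped (e.g. 3 May vs 5 March).
def day_month_swapped_match?(first, second)
  return true if first == second
  # A day above 12 cannot be read as a month, so no swap is possible.
  return false if second.day > 12
  first == Date.new(second.year, second.day, second.month)
end

a = Date.new(2001, 5, 3)   # 3 May 2001, written 3/5/2001 in day/month order
b = Date.new(2001, 3, 5)   # 5 March 2001, written 5/3/2001 in day/month order
c = Date.new(2001, 3, 25)  # 25 March 2001

swapped   = day_month_swapped_match?(a, b) # true: swapping b gives 3 May 2001
unrelated = day_month_swapped_match?(a, c) # false: day 25 is not a valid month
```

Checking `day > 12` before building the swapped date avoids the `ArgumentError` that `Date.new` raises for an invalid month.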
data/lib/wilderpeople/search.rb ADDED
@@ -0,0 +1,64 @@
1
+ require 'active_support/core_ext/hash/indifferent_access'
2
+ module Wilderpeople
3
+ class Search
4
+ attr_reader :data, :config, :result, :args
5
+ def initialize(data: [], config: {})
6
+ @data = data.collect(&:with_indifferent_access)
7
+ @config = config
8
+ end
9
+
10
+ def find(args)
11
+ @result = data
12
+ @args = args
13
+ select_must || select_can
14
+ end
15
+
16
+ private
17
+
18
+ def select_must
19
+ return if result.empty?
20
+ return unless must_rules
21
+ @result = result.select do |datum|
22
+ must_rules.all? do |key, matcher_method|
23
+ next false unless datum[key] && args[key]
24
+ Matcher.by matcher_method, datum[key], args[key]
25
+ end
26
+ end
27
+ result.first if result.size == 1
28
+ end
29
+
30
+ def must_rules
31
+ config[:must]
32
+ end
33
+
34
+ def select_can
35
+ return if result.empty?
36
+ return unless can_rules
37
+ # Get all of the matches for each of the can rules.
38
+ matches = can_rules.collect do |key, matcher_method|
39
+ result.select do |datum|
40
+ next false unless datum[key] && args[key]
41
+ Matcher.by matcher_method, datum[key], args[key]
42
+ end
43
+ end
44
+ # Then determine if one datum appears in more matches
45
+ # than any other, and if so return that one.
46
+ occurrences = find_occurrences(matches)
47
+ count_of_commonest = occurrences.values.max
48
+ if occurrences.values.count(count_of_commonest) == 1
49
+ occurrences.rassoc(count_of_commonest)[0]
50
+ end
51
+ end
52
+
53
+ def can_rules
54
+ config[:can]
55
+ end
56
+
57
+ # Returns a hash with each item as key, and the count of occurrences as value
58
+ # See: http://jerodsanto.net/2013/10/ruby-quick-tip-easily-count-occurrences-of-array-elements/
59
+ def find_occurrences(array)
60
+ array.flatten.each_with_object(Hash.new(0)){ |item,count| count[item] += 1 }
61
+ end
62
+
63
+ end
64
+ end
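The tie-breaking step in `select_can` (count how many `can` rules each candidate passes, then accept a winner only if it is unique) can be sketched on its own. The helper name `unique_commonest` and the symbol stand-ins for records are assumptions made for illustration.

```ruby
# Hypothetical sketch: each inner array holds the candidates that passed one
# `can` rule. The winner is the candidate that passes more rules than any other.
def unique_commonest(matches)
  counts = matches.flatten.each_with_object(Hash.new(0)) { |item, tally| tally[item] += 1 }
  top = counts.values.max
  # No candidates at all, or a tie for first place, means no unique winner.
  return nil if top.nil? || counts.values.count(top) != 1
  counts.key(top)
end

clear_winner = unique_commonest([[:winifred, :jane], [:winifred]]) # :winifred passes two rules, :jane one
tie          = unique_commonest([[:winifred], [:jane]])            # one rule each, so nil
no_matches   = unique_commonest([])                                # nothing matched, so nil
```

As with the `must` step, a tie yields `nil` rather than an arbitrary pick.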
data/lib/wilderpeople/version.rb ADDED
@@ -0,0 +1,3 @@
1
+ module Wilderpeople
2
+ VERSION = "0.1.0"
3
+ end
data/lib/wilderpeople.rb ADDED
@@ -0,0 +1,7 @@
1
+ require 'wilderpeople/version'
2
+ require 'wilderpeople/matcher'
3
+ require 'wilderpeople/search'
4
+
5
+ module Wilderpeople
6
+
7
+ end
data/wilderpeople.gemspec ADDED
@@ -0,0 +1,32 @@
1
+ # coding: utf-8
2
+ lib = File.expand_path('../lib', __FILE__)
3
+ $LOAD_PATH.unshift(lib) unless $LOAD_PATH.include?(lib)
4
+ require 'wilderpeople/version'
5
+
6
+ Gem::Specification.new do |spec|
7
+ spec.name = "wilderpeople"
8
+ spec.version = Wilderpeople::VERSION
9
+ spec.authors = ["Rob Nichols"]
10
+ spec.email = ["robnichols@warwickshire.gov.uk"]
11
+
12
+ spec.summary = %q{A tool for fuzzy matching people data}
13
+ spec.description = %q{Allow people data from one source to be compared with another source}
14
+ spec.homepage = "https://github.com/reggieb/wilderpeople"
15
+ spec.license = "MIT"
16
+
17
+ spec.files = `git ls-files -z`.split("\x0").reject do |f|
18
+ f.match(%r{^(test|spec|features)/})
19
+ end
20
+ spec.bindir = "exe"
21
+ spec.executables = spec.files.grep(%r{^exe/}) { |f| File.basename(f) }
22
+ spec.require_paths = ["lib"]
23
+
24
+ spec.add_dependency 'hypocorism', "~> 0.0.2"
25
+ spec.add_dependency 'levenshtein'
26
+ spec.add_dependency 'activesupport'
27
+
28
+ spec.add_development_dependency "bundler", "~> 1.13"
29
+ spec.add_development_dependency "rake", "~> 10.0"
30
+ spec.add_development_dependency "minitest", "~> 5.0"
31
+ spec.add_development_dependency "pry"
32
+ end
metadata ADDED
@@ -0,0 +1,155 @@
1
+ --- !ruby/object:Gem::Specification
2
+ name: wilderpeople
3
+ version: !ruby/object:Gem::Version
4
+ version: 0.1.0
5
+ platform: ruby
6
+ authors:
7
+ - Rob Nichols
8
+ autorequire:
9
+ bindir: exe
10
+ cert_chain: []
11
+ date: 2017-03-09 00:00:00.000000000 Z
12
+ dependencies:
13
+ - !ruby/object:Gem::Dependency
14
+ name: hypocorism
15
+ requirement: !ruby/object:Gem::Requirement
16
+ requirements:
17
+ - - ~>
18
+ - !ruby/object:Gem::Version
19
+ version: 0.0.2
20
+ type: :runtime
21
+ prerelease: false
22
+ version_requirements: !ruby/object:Gem::Requirement
23
+ requirements:
24
+ - - ~>
25
+ - !ruby/object:Gem::Version
26
+ version: 0.0.2
27
+ - !ruby/object:Gem::Dependency
28
+ name: levenshtein
29
+ requirement: !ruby/object:Gem::Requirement
30
+ requirements:
31
+ - - '>='
32
+ - !ruby/object:Gem::Version
33
+ version: '0'
34
+ type: :runtime
35
+ prerelease: false
36
+ version_requirements: !ruby/object:Gem::Requirement
37
+ requirements:
38
+ - - '>='
39
+ - !ruby/object:Gem::Version
40
+ version: '0'
41
+ - !ruby/object:Gem::Dependency
42
+ name: activesupport
43
+ requirement: !ruby/object:Gem::Requirement
44
+ requirements:
45
+ - - '>='
46
+ - !ruby/object:Gem::Version
47
+ version: '0'
48
+ type: :runtime
49
+ prerelease: false
50
+ version_requirements: !ruby/object:Gem::Requirement
51
+ requirements:
52
+ - - '>='
53
+ - !ruby/object:Gem::Version
54
+ version: '0'
55
+ - !ruby/object:Gem::Dependency
56
+ name: bundler
57
+ requirement: !ruby/object:Gem::Requirement
58
+ requirements:
59
+ - - ~>
60
+ - !ruby/object:Gem::Version
61
+ version: '1.13'
62
+ type: :development
63
+ prerelease: false
64
+ version_requirements: !ruby/object:Gem::Requirement
65
+ requirements:
66
+ - - ~>
67
+ - !ruby/object:Gem::Version
68
+ version: '1.13'
69
+ - !ruby/object:Gem::Dependency
70
+ name: rake
71
+ requirement: !ruby/object:Gem::Requirement
72
+ requirements:
73
+ - - ~>
74
+ - !ruby/object:Gem::Version
75
+ version: '10.0'
76
+ type: :development
77
+ prerelease: false
78
+ version_requirements: !ruby/object:Gem::Requirement
79
+ requirements:
80
+ - - ~>
81
+ - !ruby/object:Gem::Version
82
+ version: '10.0'
83
+ - !ruby/object:Gem::Dependency
84
+ name: minitest
85
+ requirement: !ruby/object:Gem::Requirement
86
+ requirements:
87
+ - - ~>
88
+ - !ruby/object:Gem::Version
89
+ version: '5.0'
90
+ type: :development
91
+ prerelease: false
92
+ version_requirements: !ruby/object:Gem::Requirement
93
+ requirements:
94
+ - - ~>
95
+ - !ruby/object:Gem::Version
96
+ version: '5.0'
97
+ - !ruby/object:Gem::Dependency
98
+ name: pry
99
+ requirement: !ruby/object:Gem::Requirement
100
+ requirements:
101
+ - - '>='
102
+ - !ruby/object:Gem::Version
103
+ version: '0'
104
+ type: :development
105
+ prerelease: false
106
+ version_requirements: !ruby/object:Gem::Requirement
107
+ requirements:
108
+ - - '>='
109
+ - !ruby/object:Gem::Version
110
+ version: '0'
111
+ description: Allow people data from one source to be compared with another source
112
+ email:
113
+ - robnichols@warwickshire.gov.uk
114
+ executables: []
115
+ extensions: []
116
+ extra_rdoc_files: []
117
+ files:
118
+ - .gitignore
119
+ - .travis.yml
120
+ - Gemfile
121
+ - LICENSE.txt
122
+ - README.md
123
+ - Rakefile
124
+ - bin/console
125
+ - bin/setup
126
+ - lib/wilderpeople.rb
127
+ - lib/wilderpeople/matcher.rb
128
+ - lib/wilderpeople/search.rb
129
+ - lib/wilderpeople/version.rb
130
+ - wilderpeople.gemspec
131
+ homepage: https://github.com/reggieb/wilderpeople
132
+ licenses:
133
+ - MIT
134
+ metadata: {}
135
+ post_install_message:
136
+ rdoc_options: []
137
+ require_paths:
138
+ - lib
139
+ required_ruby_version: !ruby/object:Gem::Requirement
140
+ requirements:
141
+ - - '>='
142
+ - !ruby/object:Gem::Version
143
+ version: '0'
144
+ required_rubygems_version: !ruby/object:Gem::Requirement
145
+ requirements:
146
+ - - '>='
147
+ - !ruby/object:Gem::Version
148
+ version: '0'
149
+ requirements: []
150
+ rubyforge_project:
151
+ rubygems_version: 2.4.8
152
+ signing_key:
153
+ specification_version: 4
154
+ summary: A tool for fuzzy matching people data
155
+ test_files: []