wilderpeople 0.1.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
checksums.yaml ADDED
@@ -0,0 +1,7 @@
1
+ ---
2
+ SHA1:
3
+ metadata.gz: 4591ba4d3f4654cae88164e3106ef022b0dfbc32
4
+ data.tar.gz: d364ee06d64951745386aba6b7fadbc3783cd4e5
5
+ SHA512:
6
+ metadata.gz: 756050f37ba7b4ed278e80276b977d006ed74fd7ef8cfc1c29819ffed3c5c9623d1b03b085d9b53711872508365c2d547aad1379b86532485556df6eb18a2fc9
7
+ data.tar.gz: 4068c8206a34f4f0bac26d7eb7f7c34663ed090a8cd2f18510aefc15965ae54425c252c9053a84644b5acd228cee282187f5a3e59d072776cd25e1a0fc8549d9
data/.gitignore ADDED
@@ -0,0 +1,9 @@
1
+ /.bundle/
2
+ /.yardoc
3
+ /Gemfile.lock
4
+ /_yardoc/
5
+ /coverage/
6
+ /doc/
7
+ /pkg/
8
+ /spec/reports/
9
+ /tmp/
data/.travis.yml ADDED
@@ -0,0 +1,5 @@
1
+ sudo: false
2
+ language: ruby
3
+ rvm:
4
+ - 2.0.0
5
+ before_install: gem install bundler -v 1.13.6
data/Gemfile ADDED
@@ -0,0 +1,4 @@
1
+ source 'https://rubygems.org'
2
+
3
+ # Specify your gem's dependencies in wilderpeople.gemspec
4
+ gemspec
data/LICENSE.txt ADDED
@@ -0,0 +1,21 @@
1
+ The MIT License (MIT)
2
+
3
+ Copyright (c) 2017 Rob Nichols and Warwickshire County Council
4
+
5
+ Permission is hereby granted, free of charge, to any person obtaining a copy
6
+ of this software and associated documentation files (the "Software"), to deal
7
+ in the Software without restriction, including without limitation the rights
8
+ to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
9
+ copies of the Software, and to permit persons to whom the Software is
10
+ furnished to do so, subject to the following conditions:
11
+
12
+ The above copyright notice and this permission notice shall be included in
13
+ all copies or substantial portions of the Software.
14
+
15
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16
+ IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17
+ FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18
+ AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19
+ LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20
+ OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
21
+ THE SOFTWARE.
data/README.md ADDED
@@ -0,0 +1,173 @@
1
+ # Wilderpeople
2
+
3
+ This tool was specifically designed to assist in the identification of people
4
+ based on sets of matching data. However, it could also be used to retrieve
5
+ other types of data from an array.
6
+
7
+ The tool assumes you have an array of hashes, where each hash describes a
8
+ unique item or person. It then allows you to pass in a "matching dataset" and
9
+ returns the hash that best matches that data, **if there
10
+ is only one best match.**
11
+
12
+ ## Installation
13
+
14
+ Add this line to your application's Gemfile:
15
+
16
+ ```ruby
17
+ gem 'wilderpeople'
18
+ ```
19
+
20
+ And then execute:
21
+
22
+ $ bundle
23
+
24
+ Or install it yourself as:
25
+
26
+ $ gem install wilderpeople
27
+
28
+ ## Usage
29
+
30
+ Given the following dataset:
31
+
32
+ ```ruby
33
+ data = [
34
+ {surname: 'Bloggs', forename: 'Fred', gender: 'Male'},
35
+ {surname: 'Bloggs', forename: 'Winifred', gender: 'Female'},
36
+ {surname: 'Bloggs', forename: 'Jane', gender: 'Female'}
37
+ ]
38
+ ```
39
+
40
+ We have someone called Fred Bloggs, and we wish to find the hash that matches
41
+ them.
42
+
43
+ To do that we first need to identify the criteria on which to base the match.
44
+ As we just have a forename and surname, let's start with:
45
+
46
+ ```ruby
47
+ config = {
48
+ must: {
49
+ surname: :exact,
50
+ forename: :exact
51
+ }
52
+ }
53
+ ```
54
+
55
+ That is, to get a match the hash in the dataset must exactly match both the
56
+ surname and the forename.
57
+
58
+ With that information, we can now perform a search:
59
+
60
+ ```ruby
61
+ search = Wilderpeople::Search.new(data: data, config: config)
62
+ person = search.find surname: 'Bloggs', forename: 'Fred'
63
+ person == {surname: 'Bloggs', forename: 'Fred', gender: 'Male'}
64
+ ```
65
+
66
+ Success... However, it turns out that the Winifred Bloggs we're after is
67
+ Frederick Bloggs's sister, who also goes by the name 'Fred'.
68
+
69
+ So we could try changing the criteria in the config to look for a female Bloggs:
70
+
71
+ ```ruby
72
+ config = {
73
+ must: {
74
+ surname: :exact,
75
+ gender: :exact
76
+ }
77
+ }
78
+ search = Wilderpeople::Search.new(data: data, config: config)
79
+ person = search.find surname: 'Bloggs', gender: 'Female'
80
+ person == nil
81
+ ```
82
+
83
+ Unfortunately, both Winifred Bloggs and Jane Bloggs match those criteria, so the
84
+ search is unable to find a unique match and returns `nil`.
85
+
86
+ So how do we identify the correct hash? The solution is to use fuzzy matching.
87
+
88
+ ```ruby
89
+ config = {
90
+ must: {
91
+ surname: :exact,
92
+ forename: :hypocorism,
93
+ gender: :exact
94
+ }
95
+ }
96
+ search = Wilderpeople::Search.new(data: data, config: config)
97
+ person = search.find surname: 'Bloggs', forename: 'Fred', gender: 'Female'
98
+ person == {surname: 'Bloggs', forename: 'Winifred', gender: 'Female'}
99
+ ```
100
+
101
+ ### Criteria configuration
102
+
103
+ The `config` is a hash of one or two parts:
104
+
105
+ * **must** The rules that must be matched
106
+ * **can** The rules that may be matched
107
+
108
+ Each part is itself a hash where the keys match the data keys, and the values
109
+ are the matchers to be used to compare the data in those keys.
110
+
111
+ ```ruby
112
+ config = { must: { surname: :exact } }
113
+ ```
114
+
115
+ With this configuration, the `exact` matcher will be used to compare the
116
+ `surname` of each hash in the data.
117
+
118
+ ### Matchers
119
+
120
+ The matchers are defined by the protected methods in the
121
+ [Wilderpeople::Matcher class](https://github.com/reggieb/wilderpeople/blob/master/lib/wilderpeople/matcher.rb).
122
+ Please read the comments in this class to find out how each matcher operates.
123
+
124
+ ### Can matching
125
+
126
+ If possible the match will be attempted using only the `must` criteria, but
127
+ if more than one match is returned, the system can then use the `can` criteria
128
+ to try and find a unique match.
129
+
130
+ So in the example above, Winifred could have been found using:
131
+
132
+ ```ruby
133
+ config = {
134
+ must: {
135
+ surname: :exact,
136
+ gender: :exact
137
+ },
138
+ can: {
139
+ forename: :hypocorism
140
+ }
141
+ }
142
+ search = Wilderpeople::Search.new(data: data, config: config)
143
+ person = search.find surname: 'Bloggs', forename: 'Fred', gender: 'Female'
144
+ person == {surname: 'Bloggs', forename: 'Winifred', gender: 'Female'}
145
+ ```
146
+
147
+ With this configuration the `hypocorism` matcher would only be called if the
148
+ search was unable to find a unique record by `surname` and `gender`. That is,
149
+ if we then went on to get Frederick's hash using the same configuration:
150
+
151
+ ```ruby
152
+ person = search.find surname: 'Bloggs', forename: 'Fred', gender: 'Male'
153
+ ```
154
+
155
+ A unique record would be identified without having to do the `hypocorism` match.
156
+
157
+ ## Development
158
+
159
+ After checking out the repo, run `bin/setup` to install dependencies. Then, run
160
+ `rake` to run the tests. You can also run `bin/console` for an interactive
161
+ prompt that will allow you to experiment.
162
+
163
+ ## Contributing
164
+
165
+ Bug reports and pull requests are welcome on GitHub at
166
+ https://github.com/reggieb/wilderpeople.
167
+
168
+
169
+ ## License
170
+
171
+ The gem is available as open source under the terms of the
172
+ [MIT License](http://opensource.org/licenses/MIT).
173
+
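The `must`/`can` flow described in the README can be sketched with the standard library alone. This is a minimal illustration, not the gem's implementation: `sketch_find`, `MATCHERS`, `NICKNAMES`, and the `:nickname` matcher are hypothetical names, and the tiny nickname table merely stands in for the `hypocorism` matcher.

```ruby
# Hypothetical stand-ins for the gem's matchers; not the gem's own code.
NICKNAMES = { 'fred' => %w[frederick winifred alfred] }.freeze

MATCHERS = {
  # Case-insensitive, whitespace-trimmed equality.
  exact: ->(a, b) { a.to_s.strip.downcase == b.to_s.strip.downcase },
  # Equality, or one value listed as a short form of the other.
  nickname: lambda do |a, b|
    x, y = [a, b].map { |s| s.to_s.strip.downcase }
    x == y || NICKNAMES.fetch(x, []).include?(y) || NICKNAMES.fetch(y, []).include?(x)
  end
}.freeze

# Filter by every `must` rule; if several candidates remain, score them
# against the `can` rules and return the single highest scorer, if any.
def sketch_find(data, config, args)
  candidates = data.select do |datum|
    config.fetch(:must, {}).all? { |key, m| MATCHERS.fetch(m).call(datum[key], args[key]) }
  end
  return candidates.first if candidates.size == 1
  return nil if candidates.empty? || !config[:can]

  scores = candidates.map do |datum|
    config[:can].count { |key, m| MATCHERS.fetch(m).call(datum[key], args[key]) }
  end
  best = scores.max
  return nil if best.zero? || scores.count(best) > 1

  candidates[scores.index(best)]
end

DATA = [
  { surname: 'Bloggs', forename: 'Fred', gender: 'Male' },
  { surname: 'Bloggs', forename: 'Winifred', gender: 'Female' },
  { surname: 'Bloggs', forename: 'Jane', gender: 'Female' }
].freeze

CONFIG = { must: { surname: :exact, gender: :exact }, can: { forename: :nickname } }.freeze

winifred = sketch_find(DATA, CONFIG, surname: 'Bloggs', forename: 'Fred', gender: 'Female')
frederick = sketch_find(DATA, CONFIG, surname: 'Bloggs', forename: 'Fred', gender: 'Male')
```

In this sketch the `can` rules are consulted only for the female search, where two candidates survive the `must` rules; the male search resolves on `must` alone.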
data/Rakefile ADDED
@@ -0,0 +1,10 @@
1
+ require "bundler/gem_tasks"
2
+ require "rake/testtask"
3
+
4
+ Rake::TestTask.new(:test) do |t|
5
+ t.libs << "test"
6
+ t.libs << "lib"
7
+ t.test_files = FileList['test/**/*_test.rb']
8
+ end
9
+
10
+ task :default => :test
data/bin/console ADDED
@@ -0,0 +1,14 @@
1
+ #!/usr/bin/env ruby
2
+
3
+ require "bundler/setup"
4
+ require "wilderpeople"
5
+
6
+ # You can add fixtures and/or initialization code here to make experimenting
7
+ # with your gem easier. You can also use a different console, if you like.
8
+
9
+ # (If you use this, don't forget to add pry to your Gemfile!)
10
+ # require "pry"
11
+ # Pry.start
12
+
13
+ require "irb"
14
+ IRB.start
data/bin/setup ADDED
@@ -0,0 +1,8 @@
1
+ #!/usr/bin/env bash
2
+ set -euo pipefail
3
+ IFS=$'\n\t'
4
+ set -vx
5
+
6
+ bundle install
7
+
8
+ # Do any other automated setup that you need to do here
@@ -0,0 +1,153 @@
1
+ require 'hypocorism'
2
+ require 'date'
3
+ require 'levenshtein'
4
+ module Wilderpeople
5
+ class Matcher
6
+
7
+ class << self
8
+ attr_writer :levenshtein_threshold
9
+
10
+ def levenshtein_threshold
11
+ @levenshtein_threshold ||= 0.3
12
+ end
13
+
14
+ # Using `method_missing` so that instead of having to do:
15
+ # Matcher.new(a, b).exact
16
+ # We can do
17
+ # Matcher.exact(a, b)
18
+ def method_missing(method, *args, &block)
19
+ if protected_instance_methods.include?(method)
20
+ by(method, args[0], args[1])
21
+ else
22
+ super
23
+ end
24
+ end
25
+
26
+ # Another way to use matcher is:
27
+ # Matcher.by :exact, a, b
28
+ def by(method, a, b)
29
+ raise "Method must be defined" unless method
30
+ new(a, b).send(method)
31
+ end
32
+ end
33
+
34
+ attr_reader :a, :b, :dates
35
+
36
+ # Passing arguments into initialize so that prep can be done just once
37
+ # and before main test.
38
+ # Also means that I don't have to worry about the result of one match
39
+ # polluting the next match
40
+ def initialize(a, b)
41
+ @a, @b = a.clone, b.clone
42
+ prep
43
+ end
44
+
45
+ protected # note that method_missing looks for protected_instance_methods
46
+
47
+ # `exact` used in other methods so needs to work with either stored `a` and `b`
48
+ # or items passed into it
49
+ def exact(x = a, y = b)
50
+ x == y
51
+ end
52
+
53
+ # All the right letters but not necessarily in the right order
54
+ def transposed
55
+ return true if exact
56
+ exact *[a,b].collect{|x| x.chars.sort}
57
+ end
58
+
59
+ # Designed for matching street names, so don't have to worry about
60
+ # Road/Rd and Street/St.
61
+ # Note that if matching one word, just that word is matched
62
+ def exact_except_last_word
63
+ return true if exact
64
+ words = [a,b].collect(&:split)
65
+ exact *words.collect{|w| w.size == 1 ? w : w[0..-2]}
66
+ end
67
+
68
+ # Match the first letter only
69
+ def first_letter
70
+ return true if exact
71
+ exact *[a,b].collect{|x| x[0]}
72
+ end
73
+
74
+ # Match 'Foo bar' with 'Foobar'
75
+ def exact_no_space
76
+ return true if exact
77
+ exact *[a,b].collect{|x| x.gsub(/\s/, '')}
78
+ end
79
+
80
+ # Match English first names with alternative forms
81
+ # So 'Robert' matches 'Rob'
82
+ def hypocorism
83
+ Hypocorism.match(a,b)
84
+ end
85
+
86
+ # Match dates
87
+ def date
88
+ return true if exact
89
+ @dates = [a, b].collect{|x| date_parse(x)}
90
+ exact(*dates)
91
+ rescue ArgumentError # Error raised when entry won't parse
92
+ false
93
+ end
94
+
95
+ # Matches dates, but also handles day and month being swapped.
96
+ # So 3/5/2001 matches 5/3/2001
97
+ def fuzzy_date
98
+ return true if date
99
+ return false unless dates
100
+ return false if dates[1].day > 12
101
+ exact(dates[0], swap_day_month(@dates[1]))
102
+ end
103
+
104
+ # Use Levenshtein distance to compare similar strings.
105
+ # The Levenshtein distance is a string metric for measuring the difference
106
+ # between two sequences. Informally, the Levenshtein distance between two
107
+ # words is the minimum number of single-character edits required to change
108
+ # one word into the other.
109
+ # See https://github.com/tliff/levenshtein
110
+ def fuzzy(threshold = self.class.levenshtein_threshold)
111
+ return true if exact
112
+ return false if a.empty? || b.empty?
113
+ !!Levenshtein.normalized_distance(a,b, threshold)
114
+ end
115
+
116
+ private
117
+
118
+ def prep
119
+ [a,b].each do |attr|
120
+ case attr
121
+ when String
122
+ attr.downcase!
123
+ attr.strip! if attr
124
+ when Array
125
+ attr.collect{|x| prep(x)}
126
+ else
127
+ attr
128
+ end
129
+ end
130
+ end
131
+
132
+ def swap_day_month(date)
133
+ Date.new(date.year, date.day, date.month)
134
+ end
135
+
136
+ def date_parse(string)
137
+ Date.parse(string)
138
+ rescue ArgumentError
139
+ try_american_format(string)
140
+ end
141
+
142
+ def try_american_format(string)
143
+ Date.strptime string, "%m/%d/%Y"
144
+ rescue ArgumentError
145
+ # Need to catch instance where system is using American format
146
+ try_proper_format(string)
147
+ end
148
+
149
+ def try_proper_format(string)
150
+ Date.strptime string, "%d/%m/%Y"
151
+ end
152
+ end
153
+ end
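The threshold comparison behind the `fuzzy` matcher can be illustrated without the `levenshtein` gem. The sketch below assumes that a normalized distance means the edit distance divided by the length of the longer string, which may differ from what the gem does internally; `levenshtein_distance` and `similar?` are hypothetical names used only for this illustration.

```ruby
# Classic two-row dynamic-programming edit distance (insert, delete, substitute).
def levenshtein_distance(a, b)
  previous = (0..b.length).to_a
  a.each_char.with_index(1) do |char_a, i|
    current = [i]
    b.each_char.with_index(1) do |char_b, j|
      cost = char_a == char_b ? 0 : 1
      current << [previous[j] + 1, current[j - 1] + 1, previous[j - 1] + cost].min
    end
    previous = current
  end
  previous.last
end

# Assumption: normalise by the longer string, then compare with the threshold.
# The default of 0.3 mirrors Matcher.levenshtein_threshold.
def similar?(a, b, threshold = 0.3)
  a = a.to_s.strip.downcase
  b = b.to_s.strip.downcase
  return true if a == b
  return false if a.empty? || b.empty?

  levenshtein_distance(a, b).to_f / [a.length, b.length].max <= threshold
end
```

With the default threshold, a one-letter difference in a five-letter surname (a normalized distance of 0.2) counts as a match, while two unrelated surnames do not.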
@@ -0,0 +1,64 @@
1
+ require 'active_support/core_ext/hash/indifferent_access'
2
+ module Wilderpeople
3
+ class Search
4
+ attr_reader :data, :config, :result, :args
5
+ def initialize(data: [], config: {})
6
+ @data = data.collect(&:with_indifferent_access)
7
+ @config = config
8
+ end
9
+
10
+ def find(args)
11
+ @result = data
12
+ @args = args
13
+ select_must || select_can
14
+ end
15
+
16
+ private
17
+
18
+ def select_must
19
+ return if result.empty?
20
+ return unless must_rules
21
+ @result = result.select do |datum|
22
+ must_rules.all? do |key, matcher_method|
23
+ next false unless datum[key] && args[key]
24
+ Matcher.by matcher_method, datum[key], args[key]
25
+ end
26
+ end
27
+ result.first if result.size == 1
28
+ end
29
+
30
+ def must_rules
31
+ config[:must]
32
+ end
33
+
34
+ def select_can
35
+ return if result.empty?
36
+ return unless can_rules
37
+ # Get all of the matches for each of the can rules.
38
+ matches = can_rules.collect do |key, matcher_method|
39
+ result.select do |datum|
40
+ # Keep only the data that satisfy this rule
41
+ Matcher.by matcher_method, datum[key], args[key]
42
+ end
43
+ end
44
+ # Then determine if one datum appears in more matches
45
+ # than any other, and if so return that one.
46
+ occurrences = find_occurrences(matches)
47
+ count_of_commonest = occurrences.values.max
48
+ if occurrences.values.count(count_of_commonest) == 1
49
+ occurrences.rassoc(count_of_commonest)[0]
50
+ end
51
+ end
52
+
53
+ def can_rules
54
+ config[:can]
55
+ end
56
+
57
+ # Returns a hash with each item as key, and the count of occurrences as value
58
+ # See: http://jerodsanto.net/2013/10/ruby-quick-tip-easily-count-occurrences-of-array-elements/
59
+ def find_occurrences(array)
60
+ array.flatten.each_with_object(Hash.new(0)){ |item,count| count[item] += 1 }
61
+ end
62
+
63
+ end
64
+ end
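The tie-breaking step in `select_can` (count how often each datum appears across the per-rule matches, then accept it only if it is the sole most frequent one) can be shown in isolation. This is a minimal sketch of that idea; `unique_commonest` is a hypothetical helper name, not part of the gem.

```ruby
# groups is an array of arrays: one list of matching items per rule.
# Returns the item that appears in more groups than any other, or nil on a tie.
def unique_commonest(groups)
  counts = groups.flatten(1).each_with_object(Hash.new(0)) { |item, tally| tally[item] += 1 }
  top = counts.values.max
  return nil if top.nil? # no rule matched anything

  winners = counts.select { |_item, n| n == top }.keys
  winners.first if winners.size == 1
end
```

Returning `nil` on a tie keeps the guarantee stated in the README: a result is returned only when there is exactly one best match.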
@@ -0,0 +1,3 @@
1
+ module Wilderpeople
2
+ VERSION = "0.1.0"
3
+ end
@@ -0,0 +1,7 @@
1
+ require 'wilderpeople/version'
2
+ require 'wilderpeople/matcher'
3
+ require 'wilderpeople/search'
4
+
5
+ module Wilderpeople
6
+
7
+ end
@@ -0,0 +1,32 @@
1
+ # coding: utf-8
2
+ lib = File.expand_path('../lib', __FILE__)
3
+ $LOAD_PATH.unshift(lib) unless $LOAD_PATH.include?(lib)
4
+ require 'wilderpeople/version'
5
+
6
+ Gem::Specification.new do |spec|
7
+ spec.name = "wilderpeople"
8
+ spec.version = Wilderpeople::VERSION
9
+ spec.authors = ["Rob Nichols"]
10
+ spec.email = ["robnichols@warwickshire.gov.uk"]
11
+
12
+ spec.summary = %q{A tool for fuzzy matching people data}
13
+ spec.description = %q{Allow people data from one source to be compared with another source}
14
+ spec.homepage = "https://github.com/reggieb/wilderpeople"
15
+ spec.license = "MIT"
16
+
17
+ spec.files = `git ls-files -z`.split("\x0").reject do |f|
18
+ f.match(%r{^(test|spec|features)/})
19
+ end
20
+ spec.bindir = "exe"
21
+ spec.executables = spec.files.grep(%r{^exe/}) { |f| File.basename(f) }
22
+ spec.require_paths = ["lib"]
23
+
24
+ spec.add_dependency 'hypocorism', "~> 0.0.2"
25
+ spec.add_dependency 'levenshtein'
26
+ spec.add_dependency 'activesupport'
27
+
28
+ spec.add_development_dependency "bundler", "~> 1.13"
29
+ spec.add_development_dependency "rake", "~> 10.0"
30
+ spec.add_development_dependency "minitest", "~> 5.0"
31
+ spec.add_development_dependency "pry"
32
+ end
metadata ADDED
@@ -0,0 +1,155 @@
1
+ --- !ruby/object:Gem::Specification
2
+ name: wilderpeople
3
+ version: !ruby/object:Gem::Version
4
+ version: 0.1.0
5
+ platform: ruby
6
+ authors:
7
+ - Rob Nichols
8
+ autorequire:
9
+ bindir: exe
10
+ cert_chain: []
11
+ date: 2017-03-09 00:00:00.000000000 Z
12
+ dependencies:
13
+ - !ruby/object:Gem::Dependency
14
+ name: hypocorism
15
+ requirement: !ruby/object:Gem::Requirement
16
+ requirements:
17
+ - - ~>
18
+ - !ruby/object:Gem::Version
19
+ version: 0.0.2
20
+ type: :runtime
21
+ prerelease: false
22
+ version_requirements: !ruby/object:Gem::Requirement
23
+ requirements:
24
+ - - ~>
25
+ - !ruby/object:Gem::Version
26
+ version: 0.0.2
27
+ - !ruby/object:Gem::Dependency
28
+ name: levenshtein
29
+ requirement: !ruby/object:Gem::Requirement
30
+ requirements:
31
+ - - '>='
32
+ - !ruby/object:Gem::Version
33
+ version: '0'
34
+ type: :runtime
35
+ prerelease: false
36
+ version_requirements: !ruby/object:Gem::Requirement
37
+ requirements:
38
+ - - '>='
39
+ - !ruby/object:Gem::Version
40
+ version: '0'
41
+ - !ruby/object:Gem::Dependency
42
+ name: activesupport
43
+ requirement: !ruby/object:Gem::Requirement
44
+ requirements:
45
+ - - '>='
46
+ - !ruby/object:Gem::Version
47
+ version: '0'
48
+ type: :runtime
49
+ prerelease: false
50
+ version_requirements: !ruby/object:Gem::Requirement
51
+ requirements:
52
+ - - '>='
53
+ - !ruby/object:Gem::Version
54
+ version: '0'
55
+ - !ruby/object:Gem::Dependency
56
+ name: bundler
57
+ requirement: !ruby/object:Gem::Requirement
58
+ requirements:
59
+ - - ~>
60
+ - !ruby/object:Gem::Version
61
+ version: '1.13'
62
+ type: :development
63
+ prerelease: false
64
+ version_requirements: !ruby/object:Gem::Requirement
65
+ requirements:
66
+ - - ~>
67
+ - !ruby/object:Gem::Version
68
+ version: '1.13'
69
+ - !ruby/object:Gem::Dependency
70
+ name: rake
71
+ requirement: !ruby/object:Gem::Requirement
72
+ requirements:
73
+ - - ~>
74
+ - !ruby/object:Gem::Version
75
+ version: '10.0'
76
+ type: :development
77
+ prerelease: false
78
+ version_requirements: !ruby/object:Gem::Requirement
79
+ requirements:
80
+ - - ~>
81
+ - !ruby/object:Gem::Version
82
+ version: '10.0'
83
+ - !ruby/object:Gem::Dependency
84
+ name: minitest
85
+ requirement: !ruby/object:Gem::Requirement
86
+ requirements:
87
+ - - ~>
88
+ - !ruby/object:Gem::Version
89
+ version: '5.0'
90
+ type: :development
91
+ prerelease: false
92
+ version_requirements: !ruby/object:Gem::Requirement
93
+ requirements:
94
+ - - ~>
95
+ - !ruby/object:Gem::Version
96
+ version: '5.0'
97
+ - !ruby/object:Gem::Dependency
98
+ name: pry
99
+ requirement: !ruby/object:Gem::Requirement
100
+ requirements:
101
+ - - '>='
102
+ - !ruby/object:Gem::Version
103
+ version: '0'
104
+ type: :development
105
+ prerelease: false
106
+ version_requirements: !ruby/object:Gem::Requirement
107
+ requirements:
108
+ - - '>='
109
+ - !ruby/object:Gem::Version
110
+ version: '0'
111
+ description: Allow people data from one source to be compared with another source
112
+ email:
113
+ - robnichols@warwickshire.gov.uk
114
+ executables: []
115
+ extensions: []
116
+ extra_rdoc_files: []
117
+ files:
118
+ - .gitignore
119
+ - .travis.yml
120
+ - Gemfile
121
+ - LICENSE.txt
122
+ - README.md
123
+ - Rakefile
124
+ - bin/console
125
+ - bin/setup
126
+ - lib/wilderpeople.rb
127
+ - lib/wilderpeople/matcher.rb
128
+ - lib/wilderpeople/search.rb
129
+ - lib/wilderpeople/version.rb
130
+ - wilderpeople.gemspec
131
+ homepage: https://github.com/reggieb/wilderpeople
132
+ licenses:
133
+ - MIT
134
+ metadata: {}
135
+ post_install_message:
136
+ rdoc_options: []
137
+ require_paths:
138
+ - lib
139
+ required_ruby_version: !ruby/object:Gem::Requirement
140
+ requirements:
141
+ - - '>='
142
+ - !ruby/object:Gem::Version
143
+ version: '0'
144
+ required_rubygems_version: !ruby/object:Gem::Requirement
145
+ requirements:
146
+ - - '>='
147
+ - !ruby/object:Gem::Version
148
+ version: '0'
149
+ requirements: []
150
+ rubyforge_project:
151
+ rubygems_version: 2.4.8
152
+ signing_key:
153
+ specification_version: 4
154
+ summary: A tool for fuzzy matching people data
155
+ test_files: []