matching 0.14.1

data/.document ADDED
@@ -0,0 +1,5 @@
1
+ lib/**/*.rb
2
+ bin/*
3
+ -
4
+ features/**/*.feature
5
+ LICENSE.txt
data/.rspec ADDED
@@ -0,0 +1,2 @@
1
+ --format Fuubar
2
+ --color
data/Gemfile ADDED
@@ -0,0 +1,22 @@
1
+ source "http://rubygems.org"
2
+
3
+ gem 'text'
4
+
5
+ # Add dependencies required to use your gem here.
6
+ # Example:
7
+ # gem "activesupport", ">= 2.3.5"
8
+
9
+ # Add dependencies to develop your gem here.
10
+ # Include everything needed to run rake, tests, features, etc.
11
+ group :development do
12
+ gem "bundler"
13
+ gem "jeweler"
14
+ end
15
+
16
+ group :test do
17
+ gem "rspec", ">= 0"
18
+ gem "fuubar"
19
+ gem "activerecord"
20
+ gem "sqlite3"
21
+ gem "redis"
22
+ end
data/Gemfile.lock ADDED
@@ -0,0 +1,60 @@
1
+ GEM
2
+ remote: http://rubygems.org/
3
+ specs:
4
+ activemodel (3.2.1)
5
+ activesupport (= 3.2.1)
6
+ builder (~> 3.0.0)
7
+ activerecord (3.2.1)
8
+ activemodel (= 3.2.1)
9
+ activesupport (= 3.2.1)
10
+ arel (~> 3.0.0)
11
+ tzinfo (~> 0.3.29)
12
+ activesupport (3.2.1)
13
+ i18n (~> 0.6)
14
+ multi_json (~> 1.0)
15
+ arel (3.0.0)
16
+ builder (3.0.0)
17
+ diff-lcs (1.1.3)
18
+ fuubar (1.0.0)
19
+ rspec (~> 2.0)
20
+ rspec-instafail (~> 0.2.0)
21
+ ruby-progressbar (~> 0.0.10)
22
+ git (1.2.5)
23
+ i18n (0.6.0)
24
+ jeweler (1.8.3)
25
+ bundler (~> 1.0)
26
+ git (>= 1.2.5)
27
+ rake
28
+ rdoc
29
+ json (1.6.5)
30
+ multi_json (1.0.4)
31
+ rake (0.9.2.2)
32
+ rdoc (3.12)
33
+ json (~> 1.4)
34
+ redis (2.2.2)
35
+ rspec (2.8.0)
36
+ rspec-core (~> 2.8.0)
37
+ rspec-expectations (~> 2.8.0)
38
+ rspec-mocks (~> 2.8.0)
39
+ rspec-core (2.8.0)
40
+ rspec-expectations (2.8.0)
41
+ diff-lcs (~> 1.1.2)
42
+ rspec-instafail (0.2.2)
43
+ rspec-mocks (2.8.0)
44
+ ruby-progressbar (0.0.10)
45
+ sqlite3 (1.3.5)
46
+ text (1.0.3)
47
+ tzinfo (0.3.31)
48
+
49
+ PLATFORMS
50
+ ruby
51
+
52
+ DEPENDENCIES
53
+ activerecord
54
+ bundler
55
+ fuubar
56
+ jeweler
57
+ redis
58
+ rspec
59
+ sqlite3
60
+ text
data/README.md ADDED
@@ -0,0 +1,319 @@
1
+ # matching
2
+
3
+ Matching is a library for performing rules-based matches between records in two
4
+ datasets. These datasets are typically from two different sources that pertain
5
+ to the same or similar set of transactions. Matching allows you to compare
6
+ the datasets and produces an array of matched records as well as an array of
7
+ exceptions (nonmatches) for each input dataset.
8
+
9
+ Matching is designed primarily for reconciliations. Example use cases:
10
+
11
+ * Bank reconciliations, where input datasets come from an accounting system and an
12
+ online bank statement.
13
+
14
+ * Cellular commission reconciliation, where input datasets come from an
15
+ independent retailer's Point Of Sale system and a carrier's commission
16
+ statement.
17
+
18
+ This library is not a replacement for database joins on a
19
+ properly-designed RDBMS. It's designed for real-world situations where
20
+ the programmer must handle data from different sources and find commonality between them.
21
+
22
+ ## Example
23
+
24
+ To illustrate how Matching is useful in situations where a database join can
25
+ lead to errors, take the example of reconciling a bank statement against an
26
+ accounting system's transactions. In this example, the bookkeeper incorrectly
27
+ recorded the Basecamp transaction twice and the two Github transactions have different dates.
28
+
29
+ #### Accounting System
30
+
31
+ <table>
32
+ <tr><th>Date</th><th>Description</th><th>Amount</th></tr>
33
+ <tr>
34
+ <td>2012-01-01</td>
35
+ <td>Basecamp</td>
36
+ <td>25.00</td>
37
+ </tr>
38
+ <tr>
39
+ <td>2012-01-01</td>
40
+ <td>Basecamp</td>
41
+ <td>25.00</td>
42
+ </tr>
43
+ <tr>
44
+ <td>2012-01-02</td>
45
+ <td>Github</td>
46
+ <td>25.00</td>
47
+ </tr>
48
+ </table>
49
+
50
+ #### Bank Statement
51
+
52
+ <table>
53
+ <tr><th>Date</th><th>Description</th><th>Amount</th></tr>
54
+ <tr>
55
+ <td>2012-01-01</td>
56
+ <td>Basecamp (37 signals)</td>
57
+ <td>25.00</td>
58
+ </tr>
59
+ <tr>
60
+ <td>2012-01-03</td>
61
+ <td>Github</td>
62
+ <td>25.00</td>
63
+ </tr>
64
+ </table>
65
+
66
+ Using a database approach, you might load the datasets into two tables,
67
+ "ledger" and "bank" then join on amount:
68
+
69
+ ``` sql
70
+ select * from ledger a join bank b on a.amount = b.amount;
71
+
72
+ 2012-01-01|Basecamp|25.0|2012-01-01|Basecamp (37 signals)|25.0
73
+ 2012-01-01|Basecamp|25.0|2012-01-03|Github|25.0
74
+ 2012-01-01|Basecamp|25.0|2012-01-01|Basecamp (37 signals)|25.0
75
+ 2012-01-01|Basecamp|25.0|2012-01-03|Github|25.0
76
+ 2012-01-02|Github|25.0|2012-01-01|Basecamp (37 signals)|25.0
77
+ 2012-01-02|Github|25.0|2012-01-03|Github|25.0
78
+ ```
79
+
80
+ That's clearly not the right answer. Because amount was the only criterion
81
+ used for joining, the query joins each record with a $25 value (3*2 pairs).
82
+
83
+ OK, how about adding in the date:
84
+
85
+ ``` sql
86
+ select * from ledger a join bank b on a.amount = b.amount and a.date = b.date;
87
+
88
+ 2012-01-01|Basecamp|25.0|2012-01-01|Basecamp (37 signals)|25.0
89
+ 2012-01-01|Basecamp|25.0|2012-01-01|Basecamp (37 signals)|25.0
90
+ ```
91
+
92
+ Still incorrect because the bookkeeper recorded the Github transaction on Jan. 2
93
+ and the bank shows the debit on Jan. 3. How about using description and amount?
94
+
95
+ ``` sql
96
+ select * from ledger a join bank b on a.amount = b.amount and a.description = b.description;
97
+
98
+ 2012-01-02|Github|25.0|2012-01-03|Github|25.0
99
+ ```
100
+
101
+ Even worse. Because two different people or systems entered these records, they
102
+ have slightly different descriptions. Now you might try some more complicated SQL:
103
+
104
+ ``` sql
105
+ select * from ledger a join bank b on a.amount = b.amount and (a.description = b.description or a.date = b.date);
106
+
107
+ 2012-01-01|Basecamp|25.0|2012-01-01|Basecamp (37 signals)|25.0
108
+ 2012-01-01|Basecamp|25.0|2012-01-01|Basecamp (37 signals)|25.0
109
+ 2012-01-02|Github|25.0|2012-01-03|Github|25.0
110
+ ```
111
+
112
+ At first blush that might look right, but because there are two bank statement
113
+ lines, a correctly matched result *must not* contain more than two
114
+ records. What we want is this:
115
+
116
+ ``` sql
117
+ 2012-01-01|Basecamp|25.0|2012-01-01|Basecamp (37 signals)|25.0
118
+ 2012-01-02|Github|25.0|2012-01-03|Github|25.0
119
+ ```
120
+
121
+ ### Solution using Matching
122
+
123
+ ``` ruby
124
+ require 'matching'
125
+ include Matching
126
+
127
+ class Transaction
128
+ attr_accessor :date, :desc, :amount
129
+ def initialize(date, desc, amount)
130
+ @date, @desc, @amount = date, desc, amount
131
+ end
132
+ def to_s
133
+ [@date, @desc, @amount].join(',')
134
+ end
135
+ end
136
+
137
+ ledger_txns = [
138
+ Transaction.new(Date.new(2012,1,1),'Basecamp','25.0'),
139
+ Transaction.new(Date.new(2012,1,1),'Basecamp','25.0'),
140
+ Transaction.new(Date.new(2012,1,2),'Github','25.0')
141
+ ]
142
+
143
+ bank_txns = [
144
+ Transaction.new(Date.new(2012,1,1),'Basecamp (37 signals)','25.0'),
145
+ Transaction.new(Date.new(2012,1,3),'Github','25.0')
146
+ ]
147
+
148
+ matcher = Matcher.new(
149
+ :left_store => ArrayStore.new(ledger_txns),
150
+ :right_store => ArrayStore.new(bank_txns),
151
+ :min_score => 1.0
152
+ )
153
+
154
+ matcher.define do
155
+ join :amount, :amount, 1.0
156
+ compare :date, :date, 0.5, :fuzzy => true
157
+ end
158
+
159
+ matcher.match
160
+
161
+ puts "Matches:\n"
162
+ matcher.matches.each do |match|
163
+ puts [match.left_obj, "%.2f" % match.score, match.right_obj].join(',')
164
+ end
165
+
166
+ puts "Left exceptions:\n"
167
+ matcher.left_exceptions.each { |l_exc| puts l_exc }
168
+
169
+ puts "Right exceptions:\n"
170
+ matcher.right_exceptions.each { |r_exc| puts r_exc }
171
+
172
+ ```
173
+
174
+ This is the correct result according to the rules we supplied to the matcher.
175
+
176
+ ``` bash
177
+ Matches:
178
+ 2012-01-01,Basecamp,25.0,1.50,2012-01-01,Basecamp (37 signals),25.0
179
+ 2012-01-02,Github,25.0,1.48,2012-01-03,Github,25.0
180
+ Left exceptions:
181
+ 2012-01-01,Basecamp,25.0
182
+ Right exceptions:
183
+ ```
184
+
185
+ ## How It Works
186
+
187
+ Data is loaded into the matcher using either an ArrayStore or an ActiveRelationStore. These classes use duck typing and it
188
+ would be simple to create your own for different data sources.
189
+
190
+ You describe the matching rules during initialization and in a "define" block. The initializer expects a "left" and "right" data store
191
+ and optionally a minimum score for considering two objects to be a match (default is 1.0). The matcher assigns a float score to each matched object pair
192
+ according to the rules you supply.
193
+
194
+ The define block describes which attribute pairs from the left and right data stores will be used for comparison, how they are
195
+ to be compared, and the score assigned for a successful pairing. In the example above, all objects are from the same class (Transaction)
196
+ but this isn't required.
197
+
198
+ Attribute pairs are either joined or compared. Joined attributes are indexed in either a hash (default) or Redis. For each
199
+ left object, the matcher first builds an array of potential right matches from the union of index lookups on the join
200
+ attributes. It then applies the comparison rules to compute a total score between the left object and each
201
+ candidate match on the right.
202
+
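The join step described above can be sketched with plain hashes. This is a minimal illustration under stated assumptions, not the gem's internal code: `Person`, `Contact`, `JOINS`, `build_indexes` and `candidates_for` are hypothetical names, right objects are identified by their position in the array, and comparison rules are left out.

```ruby
# Hypothetical sketch of hash-indexed joins; not taken from the gem's source.
Person  = Struct.new(:first, :last)
Contact = Struct.new(:first_name, :last_name)

# Each join rule is [left attribute, right attribute, score].
JOINS = [[:first, :first_name, 0.5], [:last, :last_name, 1.0]].freeze

# Build one index per join rule: right attribute value => positions of right objects.
def build_indexes(right_objs, joins)
  joins.map do |_left_attr, right_attr, _score|
    index = Hash.new { |hash, key| hash[key] = [] }
    right_objs.each_with_index { |r, pos| index[r.public_send(right_attr)] << pos }
    index
  end
end

# Union of the index lookups for one left object. A right object found through
# several joins accumulates the score of each join that found it.
def candidates_for(left, joins, indexes)
  scores = Hash.new(0.0)
  joins.each_with_index do |(left_attr, _right_attr, score), i|
    hits = indexes[i].fetch(left.public_send(left_attr), [])
    hits.each { |pos| scores[pos] += score }
  end
  scores
end

contacts = [Contact.new('Ann', 'Lee'), Contact.new('Bob', 'Lee'), Contact.new('Ann', 'Ray')]
indexes  = build_indexes(contacts, JOINS)
p candidates_for(Person.new('Ann', 'Lee'), JOINS, indexes)
```

In this sketch a right object only becomes a candidate if at least one join finds it, so the expensive per-pair comparisons run on a small filtered set rather than on every left/right combination.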
203
+ In cases where a match is "contested" because the highest-scored right candidate is already matched, the left object with the highest
204
+ score is awarded the match and the "loser" attempts to match its next-highest ranked right object, if any exists. In situations
205
+ where there is no right object with a high enough score to pair, that left object is added to the array of left exceptions. Right exceptions
206
+ are the array of right objects that fail to pair with any left object.
207
+
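One way to picture the contest resolution described above is the following sketch. It assumes the pair scores have already been computed and resolves ties in favor of the current holder (the README only says the choice is arbitrary). `resolve_matches` and `SCORES` are hypothetical names, and the scores are made-up values for illustration, not output from the gem.

```ruby
# Hypothetical sketch of 1:1 contest resolution; not the gem's implementation.
# scores: { left_id => { right_id => score } }
def resolve_matches(scores, right_ids, min_score = 1.0)
  # Rank each left object's candidates from best to worst, dropping low scores.
  ranked = scores.transform_values do |cands|
    cands.select { |_right, score| score >= min_score }.sort_by { |_right, score| -score }
  end

  owner = {}             # right_id => [left_id, score] of the current holder
  left_exceptions = []
  queue = ranked.keys

  until queue.empty?
    left = queue.shift
    right, score = ranked[left].shift
    if right.nil?
      left_exceptions << left        # no remaining candidate scores high enough
    elsif !owner.key?(right)
      owner[right] = [left, score]   # uncontested
    elsif score > owner[right][1]
      queue << owner[right][0]       # previous holder loses and tries its next candidate
      owner[right] = [left, score]
    else
      queue << left                  # challenger loses and tries its next candidate
    end
  end

  matches = owner.map { |right_id, (left_id, match_score)| [left_id, right_id, match_score] }
  [matches, left_exceptions, right_ids - owner.keys]
end

# Made-up scores: l1 and l2 are duplicate ledger rows competing for b1.
SCORES = {
  l1: { b1: 1.5, b2: 1.2 },
  l2: { b1: 1.5, b2: 1.2 },
  l3: { b1: 1.2, b2: 1.48 }
}.freeze

matches, left_exceptions, right_exceptions = resolve_matches(SCORES, %i[b1 b2])
p matches
p left_exceptions
p right_exceptions
```

With these inputs `l2` loses the contest for `b1`, then loses again to `l3` for `b2`, and ends up in the left exceptions, which mirrors the duplicated Basecamp entry in the example.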
208
+ ## Describing match pairs
209
+
210
+ At least one join (exact match) pair must be defined. My company uses this system for analyzing data with serialized values. In our experience, record pairs with no exact matches are typically low-quality matches and are
211
+ best left for a manual review process. Also, without the benefit of indexing, comparing every left object against every right
212
+ object would kill performance for large datasets.
213
+
214
+ ``` ruby
215
+ # Join "amount" from both the left and right data stores and award a 1.0 to each pairing
216
+ matcher.define do
217
+ join :amount, :amount, 1.0
218
+ end
219
+ ```
220
+
221
+ If multiple joins are defined, that means one index for each join pair will be created. It *does not* mean that both joins
222
+ must be satisfied in order for a pair to be awarded a score. Scores are additive and the highest-scored pair "wins".
223
+
224
+ ``` ruby
225
+ # Join on first and last names, giving higher weight to the last name
226
+ # This is analogous to a database OR join (not AND). Later scoring will link only
227
+ # the highest-scoring pair.
228
+ matcher.define do
229
+ join :first, :first_name, 0.5
230
+ join :last, :last_name, 1.0
231
+ end
232
+ ```
233
+
234
+ Comparisons are performed after joins have created a filtered array of right objects for each left object. The result of
235
+ each comparison is added to the score awarded by joins.
236
+
237
+ ``` ruby
238
+ # Award an additional point for each pair where the age attribute is the same. Attributes with frequent value
239
+ # commonality are poor candidates for joins because many comparisons will be made between left and right object pairs.
240
+ # It's best to use attributes with frequent unique values for joins (e.g. name, phone number, SSN, etc.)
241
+ # and use comparisons for more common attributes (e.g., date, age, sex).
242
+ matcher.define do
243
+ join :last, :last_name, 1.0
244
+ compare :age, :age, 1.0
245
+ end
246
+
247
+ # Do a fuzzy comparison on first name using Levenshtein edit distance. Currently there are a limited number of
248
+ # built-in fuzzy comparison functions but these can easily be extended. The attribute being compared must
249
+ # respond to 'similarity_to(l,r)' and return a float value from 0 to 1.
250
+ # See custom rules below for more flexible options.
251
+ matcher.define do
252
+ join :last, :last_name, 1.0
253
+ compare :first, :first_name, 1.0, :fuzzy => true
254
+ end
255
+
256
+ # Use a lambda to perform the comparisons. The lambda must accept two arguments (left and right objects) and
257
+ # return a score for the pair as a float. In this case, award 1.0 to each pair whose dates are within two days
258
+ # of each other.
259
+
260
+ within_two_days = lambda { |l,r| ((l.date - r.date).abs <= 2 ? 1.0 : 0.0) }
261
+
262
+ matcher.define do
263
+ join :amount, :amount, 1.0
264
+ custom within_two_days
265
+ end
266
+ ```
267
+
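For the fuzzy comparison mentioned above, a similarity in the range 0 to 1 can be derived from Levenshtein edit distance. The sketch below is a standalone, pure-Ruby illustration; normalizing by the longer string's length is an assumption made here and may differ from how the gem computes its score. `levenshtein_distance` and `levenshtein_similarity` are hypothetical helper names.

```ruby
# Hypothetical helpers; the normalization is an assumption, not the gem's formula.

# Classic dynamic-programming edit distance, keeping only the previous row.
def levenshtein_distance(a, b)
  prev = (0..b.length).to_a
  a.each_char.with_index(1) do |char_a, i|
    curr = [i]
    b.each_char.with_index(1) do |char_b, j|
      cost = char_a == char_b ? 0 : 1
      curr << [prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost].min
    end
    prev = curr
  end
  prev.last
end

# Map the distance to a float between 0.0 (nothing in common) and 1.0 (identical).
def levenshtein_similarity(a, b)
  longest = [a.length, b.length].max
  return 1.0 if longest.zero?

  1.0 - levenshtein_distance(a, b).to_f / longest
end

puts levenshtein_similarity('Jon', 'John')
```

A score like this could also be returned from a `custom` lambda if you need a different normalization than the built-in fuzzy comparison provides.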
268
+ ## Using It
269
+
270
+ Add to your Gemfile:
271
+
272
+ ``` bash
273
+ gem 'matching', :git => 'git://github.com/btedev/matching.git'
274
+ ```
275
+
276
+ ``` bash
277
+ $ bundle install
278
+ ```
279
+
280
+ In your project:
281
+
282
+ ``` ruby
283
+ require 'matching'
284
+ include Matching
285
+ ```
286
+
287
+ ## Comments and Caveats
288
+
289
+ * This is designed for 1:1 matching. You will need to fork and modify it for any other use.
290
+ Check out fuzzy_match for a different approach to rich, rules-based searching: https://github.com/seamusabshere/fuzzy_match.
291
+ FEBRL is another free data linking library written in Python: http://sourceforge.net/projects/febrl/.
292
+ * Every object will be allocated to one of three resulting arrays: matches, left exceptions, and right exceptions.
293
+ * Fuzzy != magic. Every object from the left store will be matched with the highest-possible
294
+ scoring match from the right store according to the rules you supply the matcher.
295
+ * You can use negative scores to decrease the likelihood of pairing.
296
+ * In cases where two or more left objects match the same right object with the same score, the object chosen for final match
297
+ assignment is arbitrary. The other left object(s) will be added to the left exceptions array.
298
+ * RSpec is your friend. Test your rules in the controlled environment of the test suite before deploying on production data.
299
+ * If you use it, I'd love to know what problem you're applying it to. Besides using it in my company, I also use it for reconciling my bank statement.
300
+
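As a sketch of the negative-score caveat above, a custom rule can subtract from a pair's total when two records look unrelated. The `Txn` struct, the `first_word` helper, the `PAYEE_PENALTY` lambda and the -0.5 penalty are all hypothetical choices for illustration; only the two-argument lambda shape comes from the custom rule example earlier in this README.

```ruby
# Hypothetical rule; names and the penalty value are illustrative assumptions.
Txn = Struct.new(:date, :desc, :amount)

# First word of a description, lowercased; empty string for blank input.
def first_word(text)
  text.to_s.split.first.to_s.downcase
end

# Returns 0.0 when both descriptions start with the same word, otherwise a penalty.
PAYEE_PENALTY = lambda do |l, r|
  first_word(l.desc) == first_word(r.desc) ? 0.0 : -0.5
end

# With the gem loaded, the rule would be supplied like the within_two_days lambda:
#
#   matcher.define do
#     join :amount, :amount, 1.0
#     custom PAYEE_PENALTY
#   end

puts PAYEE_PENALTY.call(Txn.new(nil, 'Basecamp', 25.0), Txn.new(nil, 'Github', 25.0))
```

Because scores are additive, a pair that earns 1.0 from the amount join and -0.5 from this rule would fall below a minimum score of 1.0 and would not be paired.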
301
+ ## Contributing
302
+
303
+ * Check out the latest master to make sure the feature hasn't been implemented or the bug hasn't been fixed yet
304
+ * Check out the issue tracker to make sure someone hasn't already requested it and/or contributed it
305
+ * Fork the project
306
+ * Start a feature/bugfix branch
307
+ * Commit and push until you are happy with your contribution
308
+ * Make sure to add tests for it. This is important so I don't break it in a future version unintentionally.
309
+ * Please try not to mess with the Rakefile, version, or history. If you want to have your own version, or a change is otherwise necessary, that is fine, but please isolate it in its own commit so I can cherry-pick around it.
310
+
311
+ ## Copyright
312
+
313
+ Copyright (c) 2012 Barry Ezell. MIT License:
314
+
315
+ Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:
316
+
317
+ The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
318
+
319
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
data/Rakefile ADDED
@@ -0,0 +1,47 @@
1
+ # encoding: utf-8
2
+
3
+ require 'rubygems'
4
+ require 'bundler'
5
+ require 'rspec/core/rake_task'
6
+
7
+ begin
8
+ Bundler.setup(:default, :development)
9
+ rescue Bundler::BundlerError => e
10
+ $stderr.puts e.message
11
+ $stderr.puts "Run `bundle install` to install missing gems"
12
+ exit e.status_code
13
+ end
14
+ require 'rake'
15
+
16
+ require 'jeweler'
17
+ Jeweler::Tasks.new do |gem|
18
+ # gem is a Gem::Specification... see http://docs.rubygems.org/read/chapter/20 for more options
19
+ gem.name = "matching"
20
+ gem.homepage = "http://github.com/btedev/matching"
21
+ gem.license = "MIT license"
22
+ gem.summary = "Dataset matching engine"
23
+ gem.description = ""
24
+ gem.email = "barrye@gmail.com"
25
+ gem.authors = ["Barry Ezell"]
26
+ # dependencies defined in Gemfile
27
+
28
+ gem.files.exclude 'db/**/*'
29
+ end
30
+
31
+ #Jeweler::RubygemsDotOrgTasks.new
32
+
33
+ desc "Run all specs"
34
+ RSpec::Core::RakeTask.new(:spec) do |spec|
35
+ spec.rspec_opts = ["--color"]
36
+ spec.pattern = 'spec/**/*_spec.rb'
37
+ end
38
+
39
+ require 'rdoc/task'
40
+ Rake::RDocTask.new do |rdoc|
41
+ version = File.exist?('VERSION') ? File.read('VERSION') : ""
42
+
43
+ rdoc.rdoc_dir = 'rdoc'
44
+ rdoc.title = "matching #{version}"
45
+ rdoc.rdoc_files.include('README*')
46
+ rdoc.rdoc_files.include('lib/**/*.rb')
47
+ end