RubyGems - bayesic_matching - Versions diffs - 0.1.0 - Mend

bayesic_matching 0.1.0

Files changed (18)

checksums.yaml +7 -0
data/.gitignore +11 -0
data/.rspec +3 -0
data/.travis.yml +5 -0
data/Gemfile +6 -0
data/Gemfile.lock +37 -0
data/LICENSE.txt +21 -0
data/README.md +39 -0
data/Rakefile +6 -0
data/bayesic_matching.gemspec +28 -0
data/bin/console +14 -0
data/bin/setup +8 -0
data/examples/benchmark.rb +82 -0
data/examples/favorite_recent_movies.csv +11 -0
data/examples/popular_recent_movies.csv +2601 -0
data/lib/bayesic_matching/version.rb +3 -0
data/lib/bayesic_matching.rb +34 -0
metadata +116 -0

checksums.yaml ADDED

@@ -0,0 +1,7 @@
+---
+SHA1:
+  metadata.gz: 5cea0e170af73887df8e76272b3cd884b83ddf06
+  data.tar.gz: a3448c71f0ea4614fea06983678f988bcde33c27
+SHA512:
+  metadata.gz: 159b74bf8b9224c7cba78e483b88f2f69c6ac943d5bcabee6bebc2a65077755d2402512139e94adde5b3b0e3be4c074d37d8fcd32262c3d6ee8b822dc6925294
+  data.tar.gz: 3e451a929ab45a35cd7462f72fd23ad962b844d7a3b854cf77479cb4347e34d647fab4d4f801c94d29e2f67015461bb5e87d2e44b6500a1fd723db0815dcdd1c

data/.gitignore ADDED

@@ -0,0 +1,11 @@
+/.bundle/
+/.yardoc
+/_yardoc/
+/coverage/
+/doc/
+/pkg/
+/spec/reports/
+/tmp/
+# rspec failure tracking
+.rspec_status

data/.rspec ADDED

@@ -0,0 +1,3 @@
+--format documentation
+--color
+--require spec_helper

data/.travis.yml ADDED

@@ -0,0 +1,5 @@
+sudo: false
+language: ruby
+rvm:
+  - 2.3.1
+before_install: gem install bundler -v 1.16.0

data/Gemfile ADDED

@@ -0,0 +1,6 @@
+source "https://rubygems.org"
+git_source(:github) {|repo_name| "https://github.com/#{repo_name}" }
+# Specify your gem's dependencies in bayesic_matching.gemspec
+gemspec

data/Gemfile.lock ADDED

@@ -0,0 +1,37 @@
+PATH
+  remote: .
+  specs:
+    bayesic_matching (0.1.0)
+GEM
+  remote: https://rubygems.org/
+  specs:
+    benchmark-ips (2.7.2)
+    diff-lcs (1.3)
+    rake (10.5.0)
+    rspec (3.7.0)
+      rspec-core (~> 3.7.0)
+      rspec-expectations (~> 3.7.0)
+      rspec-mocks (~> 3.7.0)
+    rspec-core (3.7.0)
+      rspec-support (~> 3.7.0)
+    rspec-expectations (3.7.0)
+      diff-lcs (>= 1.2.0, < 2.0)
+      rspec-support (~> 3.7.0)
+    rspec-mocks (3.7.0)
+      diff-lcs (>= 1.2.0, < 2.0)
+      rspec-support (~> 3.7.0)
+    rspec-support (3.7.0)
+PLATFORMS
+  ruby
+DEPENDENCIES
+  bayesic_matching!
+  benchmark-ips (~> 2.7)
+  bundler (~> 1.16)
+  rake (~> 10.0)
+  rspec (~> 3.0)
+BUNDLED WITH
+   1.16.0

data/LICENSE.txt ADDED

@@ -0,0 +1,21 @@
+The MIT License (MIT)
+Copyright (c) 2017 Michael Ries
+Permission is hereby granted, free of charge, to any person obtaining a copy
+of this software and associated documentation files (the "Software"), to deal
+in the Software without restriction, including without limitation the rights
+to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+copies of the Software, and to permit persons to whom the Software is
+furnished to do so, subject to the following conditions:
+The above copyright notice and this permission notice shall be included in
+all copies or substantial portions of the Software.
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
+THE SOFTWARE.

data/README.md ADDED

@@ -0,0 +1,39 @@
+# BayesicMatching
+Like NaiveBayes, except useful for the case of many possible classes with small training sets per class.
+This is useful if you have two lists of names or titles and you want to match between them with a given confidence level.
+## Usage
+```ruby
+matcher = BayesicMatching.new
+matcher.train(["it","was","the","best","of","times"], "novel")
+matcher.train(["tonight","on","the","seven","o'clock"], "news")
+matcher.classify(["the","best","of"])
+# => {"novel"=>1.0, "news"=>0.667}
+matcher.classify(["the","time"])
+#  => {"novel"=>0.667, "news"=>0.667}
+```
+## How It Works
+This library uses the basic idea of [Bayes' Theorem](https://en.wikipedia.org/wiki/Bayes%27_theorem).
+It records which tokens it has seen for each possible classification. Later, when you pass a set of tokens and ask for the most likely classification, it looks for all potential matches and then ranks them by the probability of each match given the tokens it sees.
+Tokens that appear in many records (i.e. not very unique) have a smaller impact on the probability of a match, and more unique tokens have a larger impact.
+## Will It Work For My Dataset?
+I'm using this in a project that has to match several hundred records against a list of ~10k possible matches.
+At these sizes this project will train a matcher in ~10ms and each record that I check for a match takes ~1.2ms.
+You can try it out with your own dataset by producing two simple CSV files and running the `examples/benchmark.rb` script in this repo.
+For example, from the `examples` directory you can run `bundle exec ruby benchmark.rb popular_recent_movies.csv favorite_recent_movies.csv` (both files are provided in that directory).
+If you can create a similar pair of CSV files you can test on whatever dataset you want and see the accuracy and performance of the library.
+## License
+The gem is available as open source under the terms of the [MIT License](https://opensource.org/licenses/MIT).
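
The "How It Works" section of the README above can be illustrated with a small sketch. This is not the gem's code (`lib/bayesic_matching.rb` is not reproduced in this part of the diff). The class name `TokenMatcherSketch`, the uniform prior over classes, and the assumption that a class's own tokens appear with probability 1.0 are all choices made here for illustration. Under those assumptions the sketch reproduces the two results quoted in the README usage example.

```ruby
require "set"

# Hypothetical illustration class; an assumption, not the gem's implementation.
class TokenMatcherSketch
  def initialize
    @classes = Set.new
    # token => Set of classes whose training tokens included that token
    @classes_by_token = Hash.new { |hash, token| hash[token] = Set.new }
  end

  # Record that every token in `tokens` has been seen for `klass`.
  def train(tokens, klass)
    @classes << klass
    tokens.each { |token| @classes_by_token[token] << klass }
  end

  # Returns a hash of class => probability for every class that shares
  # at least one known token with `tokens`.
  def classify(tokens)
    known = tokens.uniq.select { |token| @classes_by_token.key?(token) }
    known.each_with_object({}) do |token, posteriors|
      candidates = @classes_by_token[token]
      # Share of all classes, excluding the candidate itself, that contain
      # this token. Tokens shared by many classes make this larger, which
      # makes the update below weaker.
      p_token_given_other = (candidates.size - 1) / @classes.size.to_f
      candidates.each do |klass|
        # Start from a uniform prior, then reuse the running posterior.
        prior = posteriors.fetch(klass, 1.0 / @classes.size)
        # Bayes' theorem with P(token | klass) assumed to be 1.0.
        posteriors[klass] = prior / (prior + p_token_given_other * (1.0 - prior))
      end
    end
  end
end

matcher = TokenMatcherSketch.new
matcher.train(["it", "was", "the", "best", "of", "times"], "novel")
matcher.train(["tonight", "on", "the", "seven", "o'clock"], "news")
p matcher.classify(["the", "best", "of"]).transform_values { |v| v.round(3) }
# => {"novel"=>1.0, "news"=>0.667}
```

A token unique to one class drives that class's probability to 1.0, while a token shared by every class ("the" here) only nudges it, which is the weighting the README describes.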

data/Rakefile ADDED

@@ -0,0 +1,6 @@
+require "bundler/gem_tasks"
+require "rspec/core/rake_task"
+RSpec::Core::RakeTask.new(:spec)
+task :default => :spec

data/bayesic_matching.gemspec ADDED

@@ -0,0 +1,28 @@
+lib = File.expand_path("../lib", __FILE__)
+$LOAD_PATH.unshift(lib) unless $LOAD_PATH.include?(lib)
+require "bayesic_matching/version"
+Gem::Specification.new do |spec|
+  spec.name          = "bayesic_matching"
+  spec.version       = BayesicMatching::VERSION
+  spec.authors       = ["Michael Ries"]
+  spec.email         = ["michael@riesd.com"]
+  spec.summary       = "bayesian approach to matching one list of strings with another"
+  spec.description   = spec.summary
+  spec.homepage      = "https://github.com/mmmries/bayesic_matching"
+  spec.license       = "MIT"
+  spec.files         = `git ls-files -z`.split("\x0").reject do |f|
+    f.match(%r{^(test|spec|features)/})
+  end
+  spec.bindir        = "exe"
+  spec.executables   = spec.files.grep(%r{^exe/}) { |f| File.basename(f) }
+  spec.require_paths = ["lib"]
+  spec.add_development_dependency "benchmark-ips", "~> 2.7"
+  spec.add_development_dependency "bundler", "~> 1.16"
+  spec.add_development_dependency "rake", "~> 10.0"
+  spec.add_development_dependency "rspec", "~> 3.0"
+end

data/bin/console ADDED

@@ -0,0 +1,14 @@
+#!/usr/bin/env ruby
+require "bundler/setup"
+require "bayesic_matching"
+# You can add fixtures and/or initialization code here to make experimenting
+# with your gem easier. You can also use a different console, if you like.
+# (If you use this, don't forget to add pry to your Gemfile!)
+# require "pry"
+# Pry.start
+require "irb"
+IRB.start(__FILE__)

data/bin/setup ADDED

@@ -0,0 +1,8 @@
+#!/usr/bin/env bash
+set -euo pipefail
+IFS=$'\n\t'
+set -vx
+bundle install
+# Do any other automated setup that you need to do here

data/examples/benchmark.rb ADDED

@@ -0,0 +1,82 @@
+require "bayesic_matching"
+require "benchmark/ips"
+require "csv"
+if ARGV.size < 2
+  puts "please provide a pair of CSV files. (e.g. ruby benchmark.rb training.csv matching.csv)"
+  puts "\ttraining.csv should have source_id and source_string columns"
+  puts "\tmatching.csv should have match_string and source_id columns"
+  exit(1)
+end
+training_csv_path = ARGV[0]
+matching_csv_path = ARGV[1]
+# You can tokenize your strings using many different schemes.
+# The method below just downcases and splits on word boundaries,
+# then removes punctuation and filters single-letter words.
+# Feel free to change this to a tokenization scheme of your preference
+def tokenize_string(str)
+  str.downcase.split(/\b+/).map do |word|
+    word.gsub(/[^\w ]/,"")
+  end.reject{|word| word.size < 2 }
+end
+training_rows = []
+::CSV.foreach(training_csv_path, :headers => true, :header_converters => :symbol) do |row|
+  training_rows << {:string => row[:source_string], :id => row[:source_id], :tokens => tokenize_string(row[:source_string])}
+end
+matching_rows = []
+::CSV.foreach(matching_csv_path, :headers => true, :header_converters => :symbol) do |row|
+  matching_rows << {:string => row[:match_string], :source_id => row[:source_id], :tokens => tokenize_string(row[:match_string])}
+end
+def train_matcher(training_rows)
+  matcher = BayesicMatching.new
+  training_rows.each do |row|
+    matcher.train(row[:tokens], row[:id])
+  end
+  matcher
+end
+def attempt_matches(matcher, matching_rows, print_mismatch_data = false)
+  results = {:correct => 0, :incorrect => 0, :unmatched => 0, :total => 0}
+  matching_rows.each do |row|
+    probabilities = matcher.classify(row[:tokens])
+    next if row[:source_id].nil? or row[:source_id].size == 0 # if no source_id was present don't bother counting the statistics
+    results[:total] += 1
+    if probabilities.empty?
+      results[:unmatched] += 1
+    else
+      best_match, confidence = probabilities.max_by{|_klass, probability| probability }
+      if best_match == row[:source_id]
+        results[:correct] += 1
+      else
+        results[:incorrect] += 1
+        if print_mismatch_data
+          puts "MISMATCH of #{row[:string]} (#{row[:tokens]}) to #{best_match} (should have been #{row[:source_id]})"
+          puts "\tconfidence: #{confidence}"
+        end
+      end
+    end
+  end
+  results
+end
+matcher = train_matcher(training_rows)
+Benchmark.ips do |x|
+  x.config(:time => 5, :warmup => 2)
+  x.report("training") { train_matcher(training_rows) }
+  x.report("matching") { attempt_matches(matcher, matching_rows) }
+end
+puts "= Checking Accuracy"
+results = attempt_matches(matcher, matching_rows, true)
+puts "= Accuracy Results"
+puts "\t#{results[:total]} attempted matches"
+puts "\t#{results[:correct]} correct (#{format('%.1f', 100.0 * results[:correct] / results[:total])}%)"
+puts "\t#{results[:incorrect]} incorrect (#{format('%.1f', 100.0 * results[:incorrect] / results[:total])}%)"
+puts "\t#{results[:unmatched]} unmatched (#{format('%.1f', 100.0 * results[:unmatched] / results[:total])}%)"
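
The benchmark's comment invites swapping in a different tokenization scheme. One possible alternative is sketched below. The method name `tokenize_sketch` is hypothetical, and it is not byte-for-byte equivalent to `tokenize_string` above: it collects runs of word characters with `String#scan` rather than splitting on word boundaries and then stripping punctuation.

```ruby
# Hypothetical alternative tokenizer; an assumption for illustration only.
# Downcases the input, keeps runs of word characters (letters, digits,
# underscore), and drops single-character tokens.
def tokenize_sketch(str)
  str.downcase.scan(/\w+/).reject { |word| word.size < 2 }
end

p tokenize_sketch("The Lord of the Rings: The Two Towers")
# => ["the", "lord", "of", "the", "rings", "the", "two", "towers"]
```

Whatever scheme is chosen, the same tokenizer should be applied to both the training strings and the strings being matched, since the matcher only compares tokens for equality.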

data/examples/favorite_recent_movies.csv ADDED

@@ -0,0 +1,11 @@
+match_string,source_id
+Forest Gump,2101
+Benjamin Button,720
+12 Years a Slave,262
+Green Mile,1612
+Pulp Fiction,2110
+Titanic,1801
+Inglourious Basterds,625
+The Lord of the Rings: The Fellowship of the Ring,1402
+The Lord of the Rings: The Two Towers,1302
+Fight Club,1654
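
For anyone preparing their own pair of files, the two layouts that `examples/benchmark.rb` reads (training: `source_id`, `source_string`; matching: `match_string`, `source_id`) can be sketched as follows. The sample rows and the helper name `load_rows` are hypothetical. The `CSV` options mirror the ones the benchmark passes to `CSV.foreach`.

```ruby
require "csv"

# Hypothetical sample data in the two layouts the benchmark script expects.
training_csv = <<~ROWS
  source_id,source_string
  1,An Example Title
  2,Another Example Title
ROWS

matching_csv = <<~ROWS
  match_string,source_id
  Example Title,1
  Unlabeled Title,
ROWS

# Parse CSV text into an array of hashes keyed by symbolized header names.
def load_rows(csv_text)
  CSV.parse(csv_text, headers: true, header_converters: :symbol).map(&:to_h)
end

training_rows = load_rows(training_csv)
matching_rows = load_rows(matching_csv)

p training_rows.first[:source_string]
p matching_rows.first[:match_string]
```

A matching row may leave `source_id` blank, as the last sample row does. The benchmark still classifies such rows but leaves them out of the accuracy statistics.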