mendel 1.0.0 (RubyGems)

Files changed (37)

checksums.yaml +7 -0
data/.gitignore +18 -0
data/.rspec +2 -0
data/.ruby-version +1 -0
data/Gemfile +4 -0
data/LICENSE.txt +22 -0
data/README.md +119 -0
data/Rakefile +52 -0
data/TODO.md +15 -0
data/benchmark/addition_combiner.rb +6 -0
data/benchmark/benchmarker.rb +58 -0
data/benchmark/graph.rb +33 -0
data/benchmark/run_six_lists.rb +17 -0
data/benchmark/run_two_lists.rb +14 -0
data/benchmark/simple.rb +34 -0
data/lib/mendel.rb +5 -0
data/lib/mendel/combiner.rb +174 -0
data/lib/mendel/min_priority_queue.rb +48 -0
data/lib/mendel/observable_combiner.rb +24 -0
data/lib/mendel/version.rb +3 -0
data/lib/mendel/visualizers/ascii.rb +54 -0
data/lib/mendel/visualizers/base.rb +41 -0
data/mendel.gemspec +28 -0
data/spec/fixtures/example_input.rb +13 -0
data/spec/fixtures/example_output/different_lengths.rb +303 -0
data/spec/fixtures/example_output/inc_integers_w_inc_decimals.rb +10003 -0
data/spec/fixtures/example_output/inc_integers_w_repeats.rb +10003 -0
data/spec/fixtures/example_output/inc_integers_w_repeats_and_skips.rb +10003 -0
data/spec/fixtures/example_output/inc_integers_w_skips.rb +10003 -0
data/spec/mendel/combiner_spec.rb +256 -0
data/spec/mendel/min_priority_queue_spec.rb +70 -0
data/spec/mendel/observable_combiner_spec.rb +42 -0
data/spec/spec_helper.rb +42 -0
data/spec/support/foosball_team.rb +24 -0
data/visualizer_spec/ascii_spec.rb +119 -0
data/visualizer_spec/base_spec.rb +74 -0
metadata +175 -0

checksums.yaml ADDED

@@ -0,0 +1,7 @@
+---
+SHA1:
+  metadata.gz: 773c767d9054373c84771c9ddc4d8d0a0b83b823
+  data.tar.gz: a56609df4cc57d0f978206562d6962dd3a673ee2
+SHA512:
+  metadata.gz: 6278978743f8c42ac2a3a35bff2dc2e6de520ccd9786e484d7b168b2b0243a30bfbedc75083b490181361f922b7b2705d5108dc2dbef11445f1d8919574256d1
+  data.tar.gz: 8c2f2e5514b45ce1f646d3785b0d8aae0dc62d17d4386b35216d32b568a63227b6c7212f86710bc4c435dba3b7abb2ab9e004fd41ccd5507546a9cbdead7e722

data/.gitignore ADDED

@@ -0,0 +1,18 @@
+*.gem
+*.rbc
+.bundle
+.config
+.yardoc
+Gemfile.lock
+InstalledFiles
+_yardoc
+coverage
+doc/
+lib/bundler/man
+pkg
+rdoc
+spec/reports
+test/tmp
+test/version_tmp
+tmp
+benchmark/data

data/.rspec ADDED

@@ -0,0 +1,2 @@
+--color
+--format progress

data/.ruby-version ADDED

@@ -0,0 +1 @@
+ruby-2.3.0

data/Gemfile ADDED

@@ -0,0 +1,4 @@
+source 'https://rubygems.org'
+# Specify your gem's dependencies in mendel.gemspec
+gemspec

data/LICENSE.txt ADDED

@@ -0,0 +1,22 @@
+Copyright (c) 2014 Nathan Long
+MIT License
+Permission is hereby granted, free of charge, to any person obtaining
+a copy of this software and associated documentation files (the
+"Software"), to deal in the Software without restriction, including
+without limitation the rights to use, copy, modify, merge, publish,
+distribute, sublicense, and/or sell copies of the Software, and to
+permit persons to whom the Software is furnished to do so, subject to
+the following conditions:
+The above copyright notice and this permission notice shall be
+included in all copies or substantial portions of the Software.
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE
+LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION
+OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION
+WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

data/README.md ADDED

@@ -0,0 +1,119 @@
+# Mendel
+Mendel breeds the best combinations of sorted lists that you provide.
+For example, suppose you have 100 shirts, 200 pairs of pants, and 50 hats, ordered by price. How could you find the 50 cheapest outfits?
+A brute force approach would build all 1 million possibilities (100 * 200 * 50), sort by price, and take the best 50. An ideal solution would build the best 50 and stop.
+Mendel gets much closer to the ideal by incrementally building candidates for the "next best" combination and using a priority queue to pull the best one at any given moment.
+## How it Works
+Mendel is easiest to explain for two lists. In that case, we can think of the combinations as a grid, where the X value is from the first list and the Y value is from the second. Inside the grid, we can represent combinations as the sum of the coordinate values.
+**The lists must be sorted by score**. This means that the sums will increase (or remain constant) along one or both axes.
+For example, imagine that these grids are landscapes, and the scores in the middle are elevations. **Mendel chooses combinations like a tide, rising from the bottom left.**
+     +---+    +---+    +---+    +---+
+    1|555|   1|567|   3|777|   3|789|
+    1|555|   1|567|   2|666|   2|678|
+    1|555|   1|567|   1|555|   1|567|
+     +---+    +---+    +---+    +---+
+      444      456      444      456
+In every case, we are guaranteed that the bottom left corner - the best item from list Y combined with the best item from list X - has the lowest elevation. Beyond that, the next best combination could be at `0,1` or `1,0`; we don't know. All we can do is check them both and choose the best one. "Check them both" means producing a score, and to "choose the best one", Mendel uses a [priority queue](https://en.wikipedia.org/wiki/Priority_queue).
+If we find that we've chosen `0,1`, before we return it, we add `1,1` and `0,2` to the queue. We don't know yet whether either of them is better than `1,0`, but next time we need a value, the priority queue will decide. So the water line continues to move up and to the right. Any coordinate "under water" has been returned, any coordinate above the water line has not yet been scored, and any coordinate the water line is just touching is a combination that's currently in the priority queue.
+Run `rake visualize` to see this process in action.
+Mendel does the same process for combinations of 3 or more lists, too. Imagining a 6-dimensional graph is beyond the author's cognitive abilities, but in principle, it's the same.
+## Usage
+Create a combiner class that knows how to score combinations of your items. Then provide lists of items, sorted in ascending value.
+For example:
+```ruby
+# Simple lists of numbers. Any combination of these
+# can be scored by adding them together
+list1 = (1..100).to_a
+list2 = (1..100).map(&:to_f) # a Float range cannot be iterated with to_a
+class NumericCombiner
+  include Mendel::Combiner
+  # Scores a combination from the two lists by adding them
+  def score_combination(numbers)
+    numbers.reduce(0) { |sum, number| sum + number }
+  end
+end
+nc = NumericCombiner.new(list1, list2)
+nc.take(50) # The 50 best combinations
+```
+Mendel will return two-item arrays of `[combination, score]`.
+A combination of items is, by default, an array with one item from each list. However, if you like, you may specify how to build combinations of your items.
+```ruby
+defense_players = [{name: 'Jimmy', age: 10}, {name: 'Susan', age: 12}]
+offense_players = [{name: 'Roger', age: 8},  {name: 'Carla',  age: 14}]
+class FoosballTeam
+  attr_accessor :players
+  def initialize(*players)
+    self.players = players
+  end
+  def average_age
+    players.reduce(0){ |total, player|
+      total + player.fetch(:age)
+    } / 2.0
+  end
+end
+class TeamBuilder
+  include Mendel::Combiner
+  def build_combination(players)
+    FoosballTeam.new(*players)
+  end
+  def score_combination(team)
+    team.average_age
+  end
+end
+pc = TeamBuilder.new(defense_players, offense_players)
+pc.take(2) # The youngest teams
+```
+If you need to apply other criteria besides the score, use lazy enumeration and chain other calls:
+```ruby
+  pc.each.lazy.reject { |team, score| team.contains_siblings? }.take(50).to_a
+```
+## Serialization and deserialization
+`Mendel::Combiner` provides the instance methods `#dump` and `#dump_json` and the class methods `.load` and `.load_json`. This allows you to pause enumeration, save the data, and resume enumerating some time later.
+## Caveats
+1. **Single Enumeration**. For memory's sake, Mendel **does not keep** combinations it has returned to you. Combinations are built and flushed as you enumerate, so if you enumerate twice, there will be no data the second time; you will have to build a new combiner. If you need to keep the combinations, it is up to you to do so.
+2. **Memory**. Producing ALL combinations of your lists is inherently expensive. Mendel shines at producing the N best. It will allow you to enumerate all of the combinations, but the more there are, the more memory it will need to queue them up. If you want the top 10,000 combinations, you'll probably be fine. If you want the top 10 billion, I hope you have lots of RAM.
+## Installation
+In Bundler:
+    gem 'mendel', git: (this repo address)
+## Naming
+Mendel is named for [Gregor Mendel](https://en.wikipedia.org/wiki/Gregor_Mendel), "the father of modern genetics", a scientist and monk who discovered patterns of inheritance while breeding pea plants. The Mendel gem helps you breed the best possible hybrids of your data.
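
Reviewer note on the README's "How it Works" section: the two-list case can be sketched without the gem. This is a minimal illustration under stated assumptions, not Mendel's implementation. `best_sums` is a hypothetical helper, scoring is plain addition, and a re-sorted array stands in for a real priority queue.

```ruby
require "set"

# Hypothetical helper: returns the `count` lowest-scoring pairs from two
# ascending lists as [[x_item, y_item], score] entries.
def best_sums(list_x, list_y, count)
  queue   = [[list_x[0] + list_y[0], 0, 0]] # entries are [score, x, y]
  seen    = Set[[0, 0]]
  results = []

  while results.length < count && !queue.empty?
    # A sorted array stands in for a priority queue in this sketch.
    queue.sort!
    score, x, y = queue.shift
    results << [[list_x[x], list_y[y]], score]

    # Queue the neighbours one step along each axis, at most once each.
    [[x + 1, y], [x, y + 1]].each do |nx, ny|
      next if nx >= list_x.length || ny >= list_y.length
      next unless seen.add?([nx, ny])
      queue << [list_x[nx] + list_y[ny], nx, ny]
    end
  end
  results
end

xs  = [4, 5, 6]
ys  = [1, 2, 3]
top = best_sums(xs, ys, 4)
```

For these inputs, the scores in `top` match the first four entries of a brute-force sort over all nine sums, while only seven of the nine coordinates are ever scored.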

data/Rakefile ADDED

@@ -0,0 +1,52 @@
+require "bundler/gem_tasks"
+require 'mendel'
+require_relative "benchmark/addition_combiner"
+require 'rspec/core/rake_task'
+Bundler.setup
+RSpec::Core::RakeTask.new(:spec)
+task "default" => "spec"
+desc "Open IRB to experiment with Mendel::Combiner"
+task :console do
+  require 'irb'
+  require 'irb/completion'
+  require_relative 'lib/mendel'
+  ARGV.clear
+  IRB.start
+end
+desc "See a visualization of the Mendel::Combiner algorithm"
+task :visualize do
+  require 'irb'
+  require "mendel/observable_combiner"
+  require "mendel/visualizers/ascii"
+  class ConsoleCombiner < Mendel::ObservableCombiner
+    def score_combination(items)
+      items.reduce(0) { |sum, item| sum + item }
+    end
+  end
+  def clear_screen
+    system('clear') or system('cls')
+  end
+  def show(limit = nil)
+    list1 = 10.times.map { rand(100) }.sort
+    list2 = 10.times.map { rand(1.0...100.0) }.sort
+    combiner = ConsoleCombiner.new(list1, list2)
+    visualizer = Mendel::Visualizers::ASCII.new(combiner)
+    combiner.each_with_index do |combo, i|
+      break if limit.kind_of?(Numeric) && i >= limit
+      clear_screen
+      puts visualizer.output
+      sleep(0.5)
+    end; nil
+  end
+  ARGV.clear
+  clear_screen
+  puts "Mendel works like rising water, finding the lowest points"
+  puts "Type 'show()'. Optionally, pass a max number of frames"
+  IRB.start
+end

data/TODO.md ADDED

@@ -0,0 +1,15 @@
+# TODO
+- Pretty documentation
+# Probably Not TODO
+## Memory Optimization
+We currently track a set of all seen coordinates. As we build combinations, the size of the set approaches the total number of possible combinations. Although each item in the set is very small (an array of fixnums), the number can grow large.
+The purpose of the set is to keep from queuing the same coordinates repeatedly (eg, [1,1] could be queued as a child of [1,0] and again as a child of [0,1]). We could save memory by not remembering every coordinate we've seen and rejecting subsequent attempts to score that spot; rather, for this purpose it's probably enough to have the priority queue reject duplicates of what it currently has in it. This would make the set an order of magnitude smaller: instead of having every coordinate in a grid (2D), it would have only the advancing edge (1D), or instead of every coordinate in a cube, only the advancing surface.
+This only works if we can assume that a child will never be returned before its parent; eg, [1,1] won't be returned before [0,1]; if it were, when [0,1] got returned, [1,1] would be queued again. That assumption, in turn, is only true if we don't have duplicate scores, which we very well might. In that case, if [0,1] and [1,1] are both scored 10, we can make no guarantees about which will be returned first. A workaround is to give to the priority queue a score consisting of the "normal" score PLUS the coordinates; eg, [10, [1,1]]. This guarantees that children have higher scores than parents.
+A test indicated that this does save memory, but it seems not to be worth the trouble. It makes the code more complicated and, in an upper bound use case for me, consisting of 10 lists of 200 items each and pulling (I think) 40k results, it saved (I think) something like 100MB. So I scrapped it. I record this here only because someone may have a use case where it matters, and an order of magnitude in memory use may be important for them. So: free idea.
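
Reviewer note on the TODO's tie-breaking workaround: it relies on Ruby comparing arrays element by element, so a composite priority of `[score, coordinates]` falls back to the coordinates when scores are equal. A small sketch follows; `composite_priority` is a hypothetical helper and not part of Mendel.

```ruby
# Hypothetical helper, not part of Mendel: builds the composite priority
# described in the TODO from a score and its coordinates.
def composite_priority(score, coordinates)
  [score, coordinates]
end

parent = composite_priority(10, [0, 1])
child  = composite_priority(10, [1, 1])

# Array#<=> compares element by element, so equal scores are ordered
# by their coordinates.
parent_first = (parent <=> child) == -1
sorted       = [child, parent].sort
```

Because a child's coordinates are its parent's with one index incremented, the child always compares greater when the scores tie, which is the ordering guarantee the TODO depends on.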

data/benchmark/addition_combiner.rb ADDED

@@ -0,0 +1,6 @@
+class AdditionCombiner
+  include Mendel::Combiner
+  def score_combination(items)
+    items.reduce(0) { |sum, item| sum + item }
+  end
+end

data/benchmark/benchmarker.rb ADDED

@@ -0,0 +1,58 @@
+"#{File.expand_path('..',File.dirname(__FILE__))}/lib".tap {|lib_dir|
+  $LOAD_PATH << lib_dir unless $LOAD_PATH.include?(lib_dir)
+}
+require 'mendel'
+require 'time'
+require 'benchmark'
+require 'csv'
+class Mendel::Benchmarker
+  attr_accessor :combiner, :chunk_size
+  def initialize(combiner, chunk_size)
+    self.combiner   = combiner
+    self.chunk_size = chunk_size
+  end
+  # Look! A big ol' procedural script shoved into a method!
+  def go!
+    column_names = %i[cstime cutime real stime total utime]
+    puts "Benchmarking..."
+    stats = []
+    $stdout = File.open(File::NULL, 'w')
+    Benchmark.bm do |benchmark|
+      done = false
+      until done do
+        # Ensure GC doesn't run during benchmarking
+        GC.disable
+        bm = benchmark.report do
+          chunk = combiner.take(chunk_size)
+          done = true if chunk.empty?
+        end
+        data_point = {queue_length: combiner.queue_length}
+        column_names.each {|colname| data_point[colname] = bm.send(colname) }
+        stats << data_point
+        GC.enable
+      end
+    end
+    $stdout = STDOUT
+    puts "Writing performance data into 'benchmark/data'"
+    Dir.chdir('benchmark') do
+      Dir.mkdir('data') unless Dir.exist?('data')
+      Dir.chdir('data') do
+        lengths  = combiner.lists.map {|l| l.length.to_s}.join('x')
+        filename = "#{lengths}-#{chunk_size}_each"
+        CSV.open("#{filename}.csv", "wb") do |csv|
+          csv << stats.first.keys
+          stats.each do |entry|
+            csv << entry.values
+          end
+        end
+      end
+    end
+    puts "Done!"
+  end
+end

data/benchmark/graph.rb ADDED

@@ -0,0 +1,33 @@
+data_dir = File.expand_path('data', File.dirname(__FILE__))
+unless Dir.exist?(data_dir)
+  puts "No data directory found: #{data_dir}"
+  exit
+end
+require 'gruff'
+require 'csv'
+Dir.glob("#{data_dir}/*.csv") do |data_file|
+  queue_lengths = []
+  utimes        = []
+  CSV.foreach(data_file, headers: true) do |row|
+    utimes        << row.fetch('utime').to_f
+    queue_lengths << row.fetch('queue_length').to_i
+  end
+  base_image_name = data_file.sub('.csv', '')
+  g = Gruff::Line.new
+  g.data(:utimes, utimes)
+  g.title = "Time per .take()"
+  g.write("#{base_image_name}_utimes.png")
+  g = Gruff::Line.new
+  g.data(:queue_length, queue_lengths)
+  g.title = "Queue Length per .take()"
+  g.write("#{base_image_name}_queue_lengths.png")
+end

data/benchmark/run_six_lists.rb ADDED

@@ -0,0 +1,17 @@
+require_relative 'benchmarker'
+require_relative 'addition_combiner'
+chunk_size   = ENV.fetch('CS', 25).to_i
+puts "Using chunk size #{chunk_size} - set ENV var CS to change"
+lists = 6.times.map {
+  # More than this eats a ton of memory
+  8.times.map { rand(1.0...1_000.0) }.sort
+}
+# require 'pry'
+# binding.pry
+# exit
+benchmarker = Mendel::Benchmarker.new(AdditionCombiner.new(*lists), chunk_size)
+benchmarker.go!

data/benchmark/run_two_lists.rb ADDED

@@ -0,0 +1,14 @@
+require_relative 'benchmarker'
+require_relative 'addition_combiner'
+list1_length = ENV.fetch('L1', 100).to_i
+list2_length = ENV.fetch('L2', 200).to_i
+chunk_size   = ENV.fetch('CS', 10).to_i
+puts "Using list lengths #{list1_length} and #{list2_length} and chunk size #{chunk_size}"
+puts "Set ENV vars L1, L2, and CS to change"
+list1    = list1_length.times.map  { rand(1_000_000)       }.sort
+list2    = list2_length.times.map  { rand(1.0...1_000_000) }.sort
+benchmarker = Mendel::Benchmarker.new(AdditionCombiner.new(list1, list2), chunk_size)
+benchmarker.go!
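
Reviewer note on the benchmark scripts' environment overrides: `ENV.fetch` returns the fallback object untouched when the variable is unset, but always a String when it is set, so a value later passed to `Integer#times` or `take` needs converting. A short sketch of that behaviour, using a hypothetical variable name chosen only for this illustration:

```ruby
# MENDEL_DEMO_LENGTH is a made-up variable name for this illustration.
ENV.delete("MENDEL_DEMO_LENGTH")
default_value = ENV.fetch("MENDEL_DEMO_LENGTH", 100)      # fallback is returned as-is

ENV["MENDEL_DEMO_LENGTH"] = "250"
raw_value = ENV.fetch("MENDEL_DEMO_LENGTH", 100)          # value from the environment is a String
length    = ENV.fetch("MENDEL_DEMO_LENGTH", 100).to_i     # Integer whether set or unset
```

Calling `.to_i` on the result covers both cases, since `Integer#to_i` simply returns the integer.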

data/benchmark/simple.rb ADDED

@@ -0,0 +1,34 @@
+"#{File.expand_path('..',File.dirname(__FILE__))}/lib".tap {|lib_dir|
+  $LOAD_PATH << lib_dir unless $LOAD_PATH.include?(lib_dir)
+}
+require 'mendel'
+require 'time'
+require_relative 'addition_combiner'
+list_count   = ENV.fetch('LIST_COUNT', 10).to_i
+list_length  = ENV.fetch('LIST_LENGTH', 200).to_i
+result_count = ENV.fetch('RESULT_COUNT', 10_000).to_i
+puts "Pulling #{result_count} results from #{list_count} lists of #{list_length} each"
+puts "You may override with ENV vars LIST_COUNT, LIST_LENGTH, RESULT_COUNT"
+if result_count >= list_length**list_count
+  puts "***(You asked for #{result_count} results, but only #{list_length**list_count} are possible...)"
+end
+lists = list_count.times.map {
+  list_length.times.map { rand(1.0...1_000.0) }.sort
+}
+puts "Look at the starting memory usage - you have 10 seconds"
+sleep(10)
+puts "about to do the work"
+start = Time.now
+GC.disable
+bc = AdditionCombiner.new(*lists)
+bc.take(result_count)
+fin = Time.now
+puts "Took #{fin - start} seconds to pull #{result_count} combos"
+puts "Look at the final memory usage - you have 10 seconds till exit"
+sleep(10)

data/lib/mendel.rb ADDED

@@ -0,0 +1,5 @@
+module Mendel
+end
+require_relative 'mendel/version'
+require_relative 'mendel/combiner'

data/lib/mendel/combiner.rb ADDED

@@ -0,0 +1,174 @@
+require "mendel/version"
+require "mendel/min_priority_queue"
+require "observer"
+require "json"
+require "set"
+module Mendel
+  module Combiner
+    include Enumerable
+    attr_accessor :lists, :priority_queue
+    def self.included(target)
+      target.extend(ClassMethods)
+    end
+    def initialize(*lists)
+      raise EmptyList if lists.any?(&:empty?)
+      self.lists          = lists
+      self.priority_queue = MinPriorityQueue.new
+      queue_combo_at(lists.map {0} )
+    end
+    def each
+      return self.to_enum unless block_given?
+      loop do
+        combo = next_combination
+        break if combo == :none
+        yield combo
+      end
+    end
+    def dump
+      {INPUT => lists, SEEN => seen_set.to_a, QUEUED => priority_queue.dump }
+    end
+    def dump_json
+      JSON.dump(dump)
+    end
+    def queue_length
+      priority_queue.length
+    end
+    def score_combination(items)
+      raise NotImplementedError,
+        <<-MESSAGE
+        Including class must define. Must take a combination and produce a score.
+          - If you have not defined `build_combination`, `score_combination` will receive
+            an array of N items (one from each list)
+          - If you have defined `build_combination`, `score_combination` will receive
+            whatever `build_combination` returns
+        MESSAGE
+    end
+    private
+    def seen_set
+      @seen ||= Set.new
+    end
+    def seen_set=(set)
+      @seen = set
+    end
+    def next_combination
+      pair = pop_queue
+      return :none if pair.nil?
+      data, score = pair
+      coordinates = data.fetch(COORDINATES)
+      combo       = data.fetch(COMBO)
+      queue_children_of(coordinates)
+      [combo, score]
+    end
+    def pop_queue
+      priority_queue.pop
+    end
+    def queue_children_of(coordinates)
+      children_coordinates = next_steps_from(coordinates)
+      children_coordinates.each {|cc| queue_combo_at(cc) }
+    end
+    def queue_combo_at(coordinates)
+      return if seen_set.include?(coordinates)
+      seen_set << coordinates
+      queue_item = queueable_item_for(coordinates)
+      score = queue_item.delete(SCORE)
+      priority_queue.push(queue_item, score)
+    end
+    def queueable_item_for(coordinates)
+      raise InvalidCoordinates, coordinates unless valid_for_lists?(coordinates, lists)
+      combo = combo_at(coordinates)
+      score = score_combination(combo)
+      {COMBO => combo, COORDINATES => coordinates, SCORE => score}
+    end
+    def combo_at(coordinates)
+      items = lists.each_with_index.map {|list, i| list[coordinates[i]] }
+      build_combination(items)
+    end
+    def build_combination(items)
+      items
+    end
+    # Increments which are valid for instance's lists
+    def next_steps_from(coordinates)
+      increments_from(coordinates).select { |coords| valid_for_lists?(coords, lists) }
+    end
+    # All possible coordinates which are one greater than the given
+    # coords in a single direction.
+    # Eg:
+    # increments_from([0,0])
+    #   #=> [[1, 0], [0, 1]]
+    # increments_from([10,5,7])
+    #   #=> [[11, 5, 7], [10, 6, 7], [10, 5, 8]]
+    def increments_from(coordinates)
+      coordinates.length.times.map { |i| coordinates.dup.tap { |c| c[i] += 1} }
+    end
+    # Do the coordinates represent a valid location given these lists?
+    # Eg:
+    #   valid_for_lists?([0,1], [['thundercats', 'voltron'], ['hi', 'ho']])
+    #     #=> true - represents ['thundercats', 'ho']
+    #   valid_for_lists?([0,2], [['thundercats', 'voltron'], ['hi', 'ho']])
+    #     #=> false - first list has an index 0, but second list has no index 2
+    #   valid_for_lists?([0,2,0], [['thundercats', 'voltron'], ['hi', 'ho']])
+    #     #=> false - there are only two lists
+    def valid_for_lists?(coords, lists)
+      # Must give exactly one index per list
+      return false unless coords.length == lists.length
+      coords.each_with_index.all? { |value, index| valid_index_in?(lists[index], value) }
+    end
+    # Eg:
+    #   valid_index_in?(['hi', 'ho'],  1) #=> true
+    #   valid_index_in?(['hi', 'ho'],  2) #=> false
+    #   valid_index_in?(['hi', 'ho'], -2) #=> true
+    #   valid_index_in?(['hi', 'ho'], -3) #=> false
+    def valid_index_in?(array, index)
+      index <= (array.length - 1) && index >= (0 - array.length)
+    end
+    # To keep from allocating so many strings
+    COMBO       = 'combo'.freeze
+    COORDINATES = 'coordinates'.freeze
+    INPUT       = 'input'.freeze
+    QUEUED      = 'queued'.freeze
+    SCORE       = 'score'.freeze
+    SEEN        = 'seen'.freeze
+    module ClassMethods
+      def load(data)
+        instance = new(*data.fetch(INPUT))
+        instance.instance_eval {
+          self.seen_set       = Set.new(data.fetch(SEEN))
+          self.priority_queue = MinPriorityQueue.new.tap {|q| q.load(data.fetch(QUEUED))}
+        }
+        instance
+      end
+      def load_json(json)
+        self.load(JSON.parse(json))
+      end
+    end
+    InvalidCoordinates = Class.new(StandardError)
+    EmptyList          = Class.new(StandardError)
+  end
+end
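
Reviewer note on the coordinate helpers in `combiner.rb`: both are private in the module, so the sketch below copies their bodies into top-level methods purely so they can be called directly. Nothing else here comes from the gem.

```ruby
# Standalone copies of two private helpers from Mendel::Combiner,
# lifted out only so they can be exercised without the gem.
def increments_from(coordinates)
  coordinates.length.times.map { |i| coordinates.dup.tap { |c| c[i] += 1 } }
end

def valid_index_in?(array, index)
  index <= (array.length - 1) && index >= (0 - array.length)
end

steps_2d = increments_from([0, 0])      # one step along each of two axes
steps_3d = increments_from([10, 5, 7])  # one step along each of three axes

in_range     = valid_index_in?(%w[hi ho], 1)
past_the_end = valid_index_in?(%w[hi ho], 2)
too_negative = valid_index_in?(%w[hi ho], -3)
```

The steps come back in axis order, first index incremented first, and for a two-item array the accepted indices run from `-2` to `1`.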