data_hut 0.0.7 → 0.0.8
- data/.gitignore +3 -1
- data/CHANGELOG.md +16 -0
- data/README.md +51 -112
- data/Rakefile +6 -0
- data/lib/data_hut/data_warehouse.rb +56 -17
- data/lib/data_hut/version.rb +1 -1
- data/samples/basic.rb +1 -1
- data/samples/common/D3JS_LICENSE +26 -0
- data/samples/common/d3.v3.min.js +4 -0
- data/samples/common/report.html.haml +16 -0
- data/samples/{sample_helper.rb → common/sample_helper.rb} +1 -1
- data/samples/common/samples.gemfile +8 -0
- data/samples/league_of_legends.rb +21 -7
- data/samples/reddit_science.rb +77 -0
- data/samples/weather_files/screenshot.png +0 -0
- data/samples/weather_files/weather.css +24 -0
- data/samples/weather_files/weather.js +89 -0
- data/samples/weather_station.rb +62 -0
- data/test/spec/basic_test.rb +42 -0
- data/test/unit/data_warehouse_test.rb +18 -1
- metadata +12 -3
data/.gitignore
CHANGED
data/CHANGELOG.md
CHANGED
@@ -1,5 +1,21 @@
 # Changelog
 
+## 0.0.8
+
+* handle unsanitized nil values properly - If your input data has occasional nil values during extract or transform, you may have seen:
+        DataHut: Ruby type 'NilClass' not supported by Sequel...
+  DataHut now handles nil values instead of raising this exception so that it is easier to work with unsanitized datasets.
+
+* added `DataHut::DataWarehouse#not_unique` which allows you to specify any test of uniqueness for early skipping during transform or extract phases. DataHut has duplicate detection built in, i.e. it doesn't allow identical records to be inserted. However, in the past you had to wait for all the fields to be added or transformed before this detection was done. `not_unique` allows you to define more specific uniqueness parameters for early skipping without going through all that, i.e. if you have a feed where you know a dup shares some kind of GUID... simply test whether the GUID is unique *before* going any further...
+
+        dh.extract(data) do |r, d|
+          next if dh.not_unique(guid: d[:guid])
+          r.guid = d[:guid]
+          r.name = d[:name]
+          r.age = d[:age]
+          ...
+        end
+
 ## 0.0.7
 
 * added capability to store and fetch arbitrary metadata from the DataHut.
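As a quick illustration of the new nil handling (a minimal sketch; the database name and fields below are made up):

    require 'data_hut'

    dh = DataHut.connect("people")
    data = [{name: "fred", age: 44},
            {name: "phil", age: nil}]   # unsanitized: age is missing

    dh.extract(data) do |r, d|
      r.name = d[:name]
      r.age  = d[:age]   # in 0.0.7 this nil ended in "DataHut: Ruby type 'NilClass' not supported by Sequel..."
    end                  # in 0.0.8 the nil field is simply skipped for this record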
data/README.md
CHANGED
@@ -6,6 +6,9 @@ DataHut has basic features for small one-off analytics like parsing error logs a
 
 *Extract* your data from anywhere, *transform* it however you like and *analyze* it for insights!
 
+<img src="https://raw.github.com/coldnebo/data_hut/master/samples/weather_files/screenshot.png" width="70%"/>
+*from [samples/weather_station.rb](https://github.com/coldnebo/data_hut/blob/master/samples/weather_station.rb)*
+
 
 ## Installation
 
@@ -50,9 +53,14 @@ Setting up a datahut is easy...
 
     binding.pry
 
-
+DataHut provides access to the underlying [Sequel::Dataset](http://sequel.rubyforge.org/rdoc/classes/Sequel/Dataset.html) using
+a Sequel::Model binding. This allows you to query individual fields and stats from the dataset, but also returns rows as objects that are accessed with the same uniform object syntax you used for extracting and transforming... i.e.:
+
+    [1] pry(main)> person = ds.first
+    [2] pry(main)> [person.name, person.age]
+    => ["barney", 27]
 
-And here's the kinds of powerful things you can do:
+And here's some of the other powerful things you can do with a Sequel::Dataset:
 
     [2] pry(main)> ds.where(eligible: false).count
     => 2
@@ -63,7 +71,7 @@ And here's the kinds of powerful things you can do:
     [5] pry(main)> ds.min(:age)
     => 27
 
-But wait, you can name these collections:
+But you can also name subsets of data and work with those instead:
 
     [6] pry(main)> ineligible = ds.where(eligible: false)
     => #<Sequel::SQLite::Dataset: "SELECT * FROM `data_warehouse` WHERE (`eligible` = 'f')">
@@ -74,7 +82,7 @@ But wait, you can name these collections:
     => [#< @values={:dw_id=>3, :name=>"fred", :age=>44, :eligible=>false}>,
     #< @values={:dw_id=>2, :name=>"phil", :age=>31, :eligible=>false}>]
 
-The results are Sequel::Model objects, so you can treat them as such:
+And results remain Sequel::Model objects, so you can access fields with object notation:
 
     [32] pry(main)> record = ineligible.order(Sequel.desc(:age)).first
     => #< @values={:dw_id=>3, :name=>"fred", :age=>44, :eligible=>false}>
@@ -84,113 +92,18 @@ The results are Sequel::Model objects, so you can treat them as such:
     => 44
 
 
-Read more about the [Sequel gem](http://sequel.rubyforge.org/
+Read more about the [Sequel gem](http://sequel.rubyforge.org/) to determine what operations you can perform on a DataHut dataset.
 
 ## A More Ambitious Example...
 
-Taking a popular game like League of Legends and hand-rolling some simple analysis of the champions
-
-    require 'data_hut'
-    require 'nokogiri'
-    require 'open-uri'
-    require 'pry'
-
-    root = 'http://na.leagueoflegends.com'
-
-    # load the data once... (manually delete it to refresh)
-    unless File.exists?("lolstats.db")
-      dh = DataHut.connect("lolstats")
-
-      champions_page = Nokogiri::HTML(open("#{root}/champions"))
-
-      urls = champions_page.css('table.champion_item td.description span a').collect{|e| e.attribute('href').value}
-
-      # keep the powers for later since they are on different pages.
-      powers = {}
-      champions_page.css('table.champion_item').each do |c|
-        name = c.css('td.description span.highlight a').text
-        attack = c.css('td.graphing td.filled_attack').count
-        health = c.css('td.graphing td.filled_health').count
-        spells = c.css('td.graphing td.filled_spells').count
-        difficulty = c.css('td.graphing td.filled_difficulty').count
-        powers.store(name, {attack_power: attack, defense_power: health, ability_power: spells, difficulty: difficulty})
-      end
-
-      puts "loading champion data"
-      dh.extract(urls) do |r, url|
-        champion_page = Nokogiri::HTML(open("#{root}#{url}"))
-        r.name = champion_page.css('div.page_header_text').text
-
-        st = champion_page.css('table.stats_table')
-        names = st.css('td.stats_name').collect{|e| e.text.strip.downcase.gsub(/ /,'_')}
-        values = st.css('td.stats_value').collect{|e| e.text.strip}
-        modifiers = st.css('td.stats_modifier').collect{|e| e.text.strip}
-
-        dh.store_meta(:stats, names)
-
-        (0..names.count-1).collect do |i|
-          stat = (names[i] + "=").to_sym
-          r.send(stat, values[i].to_f)
-          stat_per_level = (names[i].downcase.gsub(/ /,'_') << "_per_level=").to_sym
-          per_level_value = modifiers[i].match(/\+([\d\.]+)/)[1].to_f rescue 0
-          r.send(stat_per_level, per_level_value)
-        end
-
-        # add the powers for this champion...
-        power = powers[r.name]
-        r.attack_power = power[:attack_power]
-        r.defense_power = power[:defense_power]
-        r.ability_power = power[:ability_power]
-        r.difficulty = power[:difficulty]
-
-        print "."
-      end
-      puts "done."
-    end
-
-    # connect again in case extract was skipped because the core data already exists:
-    dh = DataHut.connect("lolstats")
-
-    # instead of writing out each stat line manually, we can use some metaprogramming along with some metadata to automate this.
-    def total_stat(r,stat)
-      total_stat = ("total_" + stat + "=").to_sym
-      stat_per_level = r.send((stat + "_per_level").to_sym)
-      base = r.send(stat.to_sym)
-      total = base + (stat_per_level * 18.0)
-      r.send(total_stat, total)
-    end
-    # we need to fetch metadata that was written during extract (potentially in a previous process run)
-    stats = dh.fetch_meta(:stats)
-
-    puts "first transform"
-    dh.transform do |r|
-      stats.each do |stat|
-        total_stat(r,stat)
-      end
-      print '.'
-    end
-
-    puts "second transform"
-    # there's no need to do transforms all in one batch either... you can layer them...
-    dh.transform(true) do |r|
-      # this index combines the tank dimensions above for best combination (simple Euclidean metric)
-      r.nuke_index = r.total_damage * r.total_move_speed * r.total_mana * (r.ability_power)
-      r.easy_nuke_index = r.total_damage * r.total_move_speed * r.total_mana * (r.ability_power) * (1.0/r.difficulty)
-      r.tenacious_index = r.total_armor * r.total_health * r.total_spell_block * r.total_health_regen * (r.defense_power)
-      r.support_index = r.total_mana * r.total_armor * r.total_spell_block * r.total_health * r.total_health_regen * r.total_mana_regen * (r.ability_power * r.defense_power)
-      print '.'
-    end
-
-    # use once at the end to mark records processed.
-    dh.transform_complete
-    puts "transforms complete"
-
-    ds = dh.dataset
-
-    binding.pry
+Let's take a popular game like League of Legends and hand-roll some simple analysis of the champions. Look at the following sample
+code:
 
+* [samples/league_of_legends.rb](https://github.com/coldnebo/data_hut/blob/master/samples/league_of_legends.rb)
 
-
+Running this sample scrapes some game statistics from an official website and then transforms this base data with
+extra fields containing different totals and indices that we can construct however we like.
+Now that we have some data extracted and some initial transforms defined, let's play with the results...
 
 * who has the most base damage?
 
@@ -202,7 +115,7 @@ Now that we have some data, lets play...
     {"Poppy"=>56.3}]
 
 
-* but wait a minute... what about at level 18? Fortunately, we've transformed our data to add some extra fields for each stat...
+* but wait a minute... what about at level 18? Fortunately, we've transformed our data to add some extra "total" fields for each stat...
 
     [2] pry(main)> ds.order(Sequel.desc(:total_damage)).limit(5).collect{|c| {c.name => c.total_damage}}
     => [{"Skarner"=>129.70000000000002},
@@ -211,8 +124,7 @@ Now that we have some data, lets play...
     {"Taric"=>121.0},
     {"Alistar"=>120.19}]
 
-* how about using some of the indices we defined?... for instance, if we want to know which champions produce the greatest damage we could try sorting by our 'nuke_index', (notice that the assumptions on what make a good '
-nuke are subjective, but that's the fun of it; we can model our assumptions and see how the data changes in response.)
+* how about using some of the indices we defined?... for instance, if we want to know which champions produce the greatest damage we could try sorting by our 'nuke_index', (notice that the assumptions on what make a good 'nuke' are subjective, but that's the fun of it; we can model our assumptions and see how the data changes in response.)
 
     [3] pry(main)> ds.order(Sequel.desc(:nuke_index)).limit(5).collect{|c| {c.name => [c.total_damage, c.total_move_speed, c.total_mana, c.ability_power]}}
     => [{"Karthus"=>[100.7, 335.0, 1368.0, 10]},
@@ -221,14 +133,14 @@ nuke are subjective, but that's the fun of it; we can model our assumptions and
     {"Karma"=>[109.4, 335.0, 1320.0, 9]},
     {"Lux"=>[109.4, 340.0, 1150.0, 10]}]
 
-
+From my experience in the game, these champions are certainly heavy hitters. What do you think?
 
-* and (now I risk becoming addicted to
+* and (now I risk becoming addicted to DataHut myself), here's some further guesses with an 'easy_nuke' index (champions that have a lot of damage, but are also less difficult to play):
 
     [4] pry(main)> ds.order(Sequel.desc(:easy_nuke_index)).limit(5).collect{|c| c.name}
     => ["Sona", "Ryze", "Nasus", "Soraka", "Heimerdinger"]
 
-* makes sense, but is still fascinating... what about my crack at a support_index?
+* makes sense, but is still fascinating... what about my crack at a support_index (champions that have a lot of regen, staying power, etc.)?
 
     [5] pry(main)> ds.order(Sequel.desc(:support_index)).limit(5).collect{|c| c.name}
     => ["Sion", "Diana", "Nunu", "Nautilus", "Amumu"]
@@ -240,6 +152,33 @@ You get the idea now! *Extract* your data from anywhere, *transform* it however
 Have fun!
 
 
+## Metadata Object Store
+
+DataHut also supports a basic Ruby object store for storing persistent metadata that might be useful during extract and transform passes.
+
+*Examples:*
+
+* [samples/league_of_legends.rb](https://github.com/coldnebo/data_hut/blob/master/samples/league_of_legends.rb):
+
+        dh.extract(urls) do |r, url|
+          ...
+          # names => ["damage", "health", "mana", "move_speed", "armor", "spell_block", "health_regen", "mana_regen"]
+
+          # DataHut also allows you to store metadata for the data warehouse during any processing phase for later retrieval.
+          # Since we extract the data only once, but may need stats names for subsequent transforms, we can store the
+          # stats names in the metadata for later use:
+          dh.store_meta(:stats, names)
+          ...
+        end
+        ...
+        # later... we can fetch the metadata that was written during the extract phase and use it...
+        stats = dh.fetch_meta(:stats)
+        # stats => ["damage", "health", "mana", "move_speed", "armor", "spell_block", "health_regen", "mana_regen"]
+
+**Caveats:** Because the datastore can support any Ruby object (including custom ones) it is up to the caller to make sure that custom classes are in context before storage and fetch. i.e. if you store a custom object and then fetch it in a context that doesn't have that class loaded, you'll get an error. For this reason it is safest to use standard Ruby types (e.g. Array, Hash, etc.) that will always be present.
+
+See {DataHut::DataWarehouse#store_meta} and {DataHut::DataWarehouse#fetch_meta} for details.
+
 ## TODOS
 
 * further optimizations
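To make the metadata caveat above concrete, here is a minimal sketch (the `scratch` database name and the `Point` struct are illustrative, not from the gem):

    require 'data_hut'

    dh = DataHut.connect("scratch")

    # safe: standard Ruby types marshal and load in any process
    dh.store_meta(:stats, ["damage", "health"])
    dh.fetch_meta(:stats)    # => ["damage", "health"]

    # risky: a custom class must be defined in any process that fetches it later
    Point = Struct.new(:x, :y)
    dh.store_meta(:origin, Point.new(0, 0))
    # a separate script that never defines Point will get an error from fetch_meta(:origin)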
data/Rakefile
CHANGED
@@ -12,4 +12,10 @@ task :default => :test
 desc "clean up"
 task :clean do
   FileUtils.rm(FileList["samples/**/*.db"], force: true, verbose: true)
+  FileUtils.rm(FileList["samples/*.html"], force: true, verbose: true)
+end
+
+desc "install gems for running samples"
+task :samples do
+  system('bundle install --gemfile=samples/common/samples.gemfile')
 end
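In practice: `rake samples` installs the gems the samples need (via the bundled `samples/common/samples.gemfile`), and `rake clean` now also removes any generated sample `.html` output along with the `.db` files.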
data/lib/data_hut/data_warehouse.rb
CHANGED
@@ -82,6 +82,12 @@ module DataHut
     #   more information about supported ruby data types you can use.
     # @yieldparam element an element from your data.
     # @raise [ArgumentError] if you don't provide a block
+    # @return [void]
+    # @note Duplicate records (all fields and values must match) are automatically not inserted at the end of an extract iteration. You may
+    #   also skip duplicate extracts early in the iteration by using {#not_unique}.
+    # @note Fields with nil values in records are skipped because the underlying database defaults these to
+    #   nil already. However you must have at least one non-nil value in order for the field to be automatically created,
+    #   otherwise subsequent transform layers may report errors on trying to access the field.
     def extract(data)
       raise(ArgumentError, "a block is required for extract.", caller) unless block_given?
 
@@ -102,7 +108,7 @@ module DataHut
     #
     # @param forced if set to 'true', this transform will iterate over records already marked processed. This can be useful for
     #   layers of transforms that deal with analytics where the analytical model may need to rapidly change as you explore the data.
-    #   See the second transform in {
+    #   See the second transform in {https://github.com/coldnebo/data_hut/blob/master/samples/league_of_legends.rb#L102 samples/league_of_legends.rb:102}.
     # @yield [record] lets you modify the DataHut record
     # @yieldparam record an OpenStruct that fronts the DataHut record. You may access existing fields on this record or create new
     #   fields to store synthetic data from a transform pass.
@@ -110,6 +116,7 @@ module DataHut
     # See {http://sequel.rubyforge.org/rdoc/files/doc/schema_modification_rdoc.html Sequel Schema Modification Methods} for
     #   more information about supported ruby data types you can use.
     # @raise [ArgumentError] if you don't provide a block
+    # @return [void]
     def transform(forced=false)
       raise(ArgumentError, "a block is required for transform.", caller) unless block_given?
 
@@ -126,10 +133,10 @@ module DataHut
         r = OpenStruct.new(h)
         # and let the transformer modify it...
         yield r
-        # now add any new transformation fields to the schema...
-        adapt_schema(r)
         # get the update hash from the openstruct
-        h = r.marshal_dump
+        h = ostruct_to_hash(r)
+        # now add any new transformation fields to the schema...
+        adapt_schema(h)
         # and use it to update the record
         @db[:data_warehouse].where(dw_id: dw_id).update(h)
       end
@@ -144,6 +151,8 @@ module DataHut
     #   transform_complete (marks the update complete)
     #   dh.dataset is used to visualize graphs with d3.js
     # end
+    #
+    # @return [void]
     def transform_complete
       @db[:data_warehouse].update(:dw_processed => true)
     end
@@ -156,17 +165,21 @@ module DataHut
     #
     # @param logger [Logger] a logger for the underlying Sequel actions.
     # @raise [ArgumentError] if passed a logger that is not a kind of {http://www.ruby-doc.org/stdlib-1.9.3//libdoc/logger/rdoc/Logger.html Logger}.
+    # @return [void]
     def logger=(logger)
       raise(ArgumentError, "logger must be a type of Logger.") unless logger.kind_of?(Logger)
       @db.logger = logger
     end
 
-
-
-    # stores metadata
+    # stores any Ruby object as metadata in the datahut.
     #
-    # @param key [Symbol] to
-    # @param value [Object] ruby object to store
+    # @param key [Symbol] to reference the metadata by
+    # @param value [Object] ruby object to store in metadata
+    # @return [void]
+    # @note Because the datastore can support any Ruby object (including custom ones) it is up to
+    #   the caller to make sure that custom classes are in context before storage and fetch. i.e. if you
+    #   store a custom object and then fetch it in a context that doesn't have that class loaded, you'll get an error.
+    #   For this reason it is safest to use standard Ruby types (e.g. Array, Hash, etc.) that will always be present.
     def store_meta(key, value)
       key = key.to_s if key.instance_of?(Symbol)
       begin
@@ -177,14 +190,18 @@ module DataHut
           @db[:data_warehouse_meta].insert(key: key, value: value)
         end
       rescue Exception => e
-        raise(ArgumentError, "DataHut: unable to store metadata value #{value.inspect}.", caller)
+        raise(ArgumentError, "DataHut: unable to store metadata value #{value.inspect}: #{e.message}", caller)
       end
     end
 
-    # retrieves
+    # retrieves any Ruby object stored as metadata.
     #
     # @param key [Symbol] to lookup the metadata by
-    # @return [Object] ruby object that was fetched
+    # @return [Object] ruby object that was fetched from metadata
+    # @note Because the datastore can support any Ruby object (including custom ones) it is up to
+    #   the caller to make sure that custom classes are in context before storage and fetch. i.e. if you
+    #   store a custom object and then fetch it in a context that doesn't have that class loaded, you'll get an error.
+    #   For this reason it is safest to use standard Ruby types (e.g. Array, Hash, etc.) that will always be present.
     def fetch_meta(key)
       key = key.to_s if key.instance_of?(Symbol)
       begin
@@ -192,11 +209,29 @@ module DataHut
         value = r[:value] unless r.nil?
         value = Marshal.load(value) unless value.nil?
       rescue Exception => e
-        raise(
+        raise(RuntimeError, "DataHut: unable to fetch metadata key #{key}: #{e.message}", caller)
       end
       value
     end
 
+    # used to determine if the specified fields and values are unique in the datahut.
+    #
+    # @example
+    #   dh.extract(data) do |r, d|
+    #     next if dh.not_unique(name: d[:name])
+    #     r.name = d[:name]
+    #     r.age = d[:age]
+    #     ...
+    #   end
+    #
+    # @note exactly duplicate records are automatically skipped at the end of an extract iteration (see {#extract}). This
+    #   method is useful if an extract iteration takes a long time and you want to skip duplicates early in the iteration.
+    # @param hash [Hash] of the key, value pairs specifying a partial record by which to consider records unique.
+    # @return [Boolean] true if the {field: value} already exists, false otherwise (including if the column doesn't yet exist.)
+    def not_unique(hash)
+      @db[:data_warehouse].where(hash).count > 0 rescue false
+    end
+
     private
 
     def initialize(name)
@@ -221,16 +256,20 @@ module DataHut
     end
 
     def store(r)
-      adapt_schema(r)
-      h = r.marshal_dump
+      h = ostruct_to_hash(r)
+      adapt_schema(h)
       # don't insert dups
-      unless @db[:data_warehouse].where(h).count > 0
+      unless not_unique(h)
         @db[:data_warehouse].insert(h)
       end
     end
 
-    def adapt_schema(r)
+    def ostruct_to_hash(r)
       h = r.marshal_dump
+      h.reject{|k,v| v.nil?} # you can't define a column type "NilClass", so strip these before adapting the schema
+    end
+
+    def adapt_schema(h)
       h.keys.each do |key|
         type = h[key].class
         unless Sequel::Schema::CreateTableGenerator::GENERIC_TYPES.include?(type)
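The nil handling in 0.0.8 hinges on this small `ostruct_to_hash` helper; here is a minimal sketch of the OpenStruct behavior it relies on:

    require 'ostruct'

    r = OpenStruct.new(name: "fred", age: nil)
    h = r.marshal_dump            # => {:name=>"fred", :age=>nil}
    h.reject { |k, v| v.nil? }    # => {:name=>"fred"}
    # nil-valued fields never reach adapt_schema, so Sequel is never asked
    # to create a column of type NilClass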
data/lib/data_hut/version.rb
CHANGED
data/samples/basic.rb
CHANGED
data/samples/common/D3JS_LICENSE
ADDED
@@ -0,0 +1,26 @@
+Copyright (c) 2013, Michael Bostock
+All rights reserved.
+
+Redistribution and use in source and binary forms, with or without
+modification, are permitted provided that the following conditions are met:
+
+* Redistributions of source code must retain the above copyright notice, this
+  list of conditions and the following disclaimer.
+
+* Redistributions in binary form must reproduce the above copyright notice,
+  this list of conditions and the following disclaimer in the documentation
+  and/or other materials provided with the distribution.
+
+* The name Michael Bostock may not be used to endorse or promote products
+  derived from this software without specific prior written permission.
+
+THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
+AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
+DISCLAIMED. IN NO EVENT SHALL MICHAEL BOSTOCK BE LIABLE FOR ANY DIRECT,
+INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING,
+BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY
+OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING
+NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE,
+EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.