hivemeta 0.0.5 → 0.0.6 (RubyGems)

Files changed (5)

data/CHANGELOG CHANGED

@@ -3,7 +3,10 @@
 - perf: 4x+ faster ... now basically on par with manual split into array
 - perf: create extra hash for column index by name
 - perf: remove unnecessary string indexed assignment
-- clean: table.each does each inside rather than each_with_index
+- clean: Table#each does each inside rather than each_with_index
+- new: Table#process works on file input, by default STDIN
+- new: can now use environmental variables in order to minimize code
+       all prefixed by hivemeta_ : db_user, db_pass, db_host, db_name
 * 2011-05-17 - fsf
 - bugfix: default unspecified delimiter is ^A rather than TAB

data/README CHANGED

@@ -43,11 +43,23 @@ gem install hivemeta
 API Usage
-streaming map/reduce code snippet:
+streaming map/reduce code snippet (abstracted processing loop):
 require 'hivemeta'
-h = HiveMeta::Connection.new(...) # see sample-mapper.rb for detail
+h = HiveMeta::Connection.new # see below for detail
+h.table('sample_inventory').process do |row|
+  item_id = row.item_id # can access by method or [:sym] or ['str']
+  count   = row.inv_cnt.to_i
+  puts "#{item_id}\t#{count}" if count >= 1000
+end
+streaming map/reduce code snippet (normal STDIN processing loop):
+require 'hivemeta'
+h = HiveMeta::Connection.new # see below for detail
 inv_table = h.table 'sample_inventory'
 STDIN.each_line do |line|
@@ -62,6 +74,35 @@ STDIN.each_line do |line|
   puts "#{item_id}\t#{count}" if count >= 1000
 end
+establishing a connection (in ruby code):
+db_user    = 'hive'
+db_pass    = 'hivepasshere'
+db_host    = 'localhost'
+db_name    = 'hivemeta'
+dbi_string = "DBI:Mysql:#{db_name}:#{db_host}"
+h = HiveMeta::Connection.new(dbi_string, db_user, db_pass)
+establishing a connection (environment variables):
+# when no arguments are passed, the following env variables will be used:
+#
+#  hivemeta_db_host
+#  hivemeta_db_name
+#  hivemeta_db_user
+#  hivemeta_db_pass
+#
+# to set these in a streaming map/reduce job, use -D arguments like so:
+#
+#  -D hivemeta.db_host=mydbhost \
+#  -D hivemeta.db_name=hivemeta \
+#  -D hivemeta.db_user=hive \
+#  -D hivemeta.db_pass=mydbpass \
+# the connection will be made with those env variables without any other code
+h = HiveMeta::Connection.new
 ---
 hivemeta_query.rb Usage
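
The README hunk above passes settings as `-D hivemeta.db_host=...` but reads them back as `hivemeta_db_host`. This relies on Hadoop Streaming exposing job configuration properties to the task as environment variables with the dots turned into underscores. A minimal sketch of that name mapping, assuming non-alphanumeric characters are the ones replaced; `streaming_env_name` is a hypothetical helper for illustration and is not part of hivemeta or Hadoop:

```ruby
# Hypothetical helper: approximates how a job property name passed with -D
# would appear as an environment variable name inside a streaming task.
# Assumption: every non-alphanumeric character becomes an underscore.
def streaming_env_name(property)
  property.gsub(/[^A-Za-z0-9]/, '_')
end

properties = %w[hivemeta.db_host hivemeta.db_name hivemeta.db_user hivemeta.db_pass]
env_names  = properties.map { |p| streaming_env_name(p) }
env_names.each { |name| puts name }
```

The resulting names match the four variables listed in the README comment block.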

data/lib/hivemeta/connection.rb CHANGED

@@ -6,6 +6,12 @@ module HiveMeta
   class Connection
     def initialize(dbi_string = nil, db_user = nil, db_pass = nil)
+      db_name = ENV['hivemeta_db_name']
+      db_host = ENV['hivemeta_db_host']
+      dbi_string ||= "DBI:Mysql:#{db_name}:#{db_host}"
+      db_user    ||= ENV['hivemeta_db_user']
+      db_pass    ||= ENV['hivemeta_db_pass']
       @dbi_string = dbi_string
       @db_user    = db_user
       @db_pass    = db_pass
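
The fallback added in this hunk can be exercised without a database. The sketch below copies the `||=` logic into a standalone function so its behavior is easy to check; `resolve_connection_settings` and the injectable `env` parameter are assumptions made for illustration and do not exist in hivemeta:

```ruby
# Hypothetical helper mirroring the fallback logic added to
# HiveMeta::Connection#initialize in 0.0.6. The env parameter is an
# assumption so the example can run against a plain Hash instead of ENV.
def resolve_connection_settings(dbi_string = nil, db_user = nil, db_pass = nil, env = ENV)
  db_name = env['hivemeta_db_name']
  db_host = env['hivemeta_db_host']
  # Explicit arguments win; environment values only fill in nils.
  dbi_string ||= "DBI:Mysql:#{db_name}:#{db_host}"
  db_user    ||= env['hivemeta_db_user']
  db_pass    ||= env['hivemeta_db_pass']
  [dbi_string, db_user, db_pass]
end

fake_env = {
  'hivemeta_db_name' => 'hivemeta',
  'hivemeta_db_host' => 'mydbhost',
  'hivemeta_db_user' => 'hive',
  'hivemeta_db_pass' => 'mydbpass'
}

from_env = resolve_connection_settings(nil, nil, nil, fake_env)
explicit = resolve_connection_settings('DBI:Mysql:other:dbhost2', 'admin', nil, fake_env)
unset    = resolve_connection_settings(nil, nil, nil, {})
```

Note that with no arguments and no variables set, the DBI string still gets built from empty values (`"DBI:Mysql::"`), so a missing variable would surface later as a connection failure rather than at construction time.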

data/lib/hivemeta/table.rb CHANGED

@@ -49,6 +49,24 @@ module HiveMeta
         return Record.new(line, self)
       end
     end
+    # process all input (default to STDIN for Hadoop Streaming)
+    # via a provided block
+    def process(f = STDIN, warning = nil)
+      if not block_given?
+        return process_row f.readline
+      end
+      f.each_line do |line|
+        begin
+          process_row(line) {|row| yield row}
+        rescue HiveMeta::FieldCountError
+          warning ||= "reporter:counter:bad_data,row_size,1"
+          STDERR.puts warning
+          next
+        end
+      end
+    end
   end
 end
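
The new `Table#process` loop skips malformed rows and writes a `reporter:counter:<group>,<counter>,<amount>` line to STDERR, which Hadoop Streaming interprets as a counter increment. A rough standalone sketch of the same pattern follows. `FieldCountError`, `parse_row`, `process_lines`, and the two-field row layout are stand-ins assumed for illustration, not the gem's own classes:

```ruby
require 'stringio'

# Stand-in for HiveMeta::FieldCountError (assumption for this sketch).
class FieldCountError < StandardError; end

EXPECTED_FIELDS = 2 # assumed row width for the example data

# Split a tab-delimited line and reject rows with the wrong field count.
def parse_row(line)
  fields = line.chomp.split("\t", -1)
  raise FieldCountError, "got #{fields.size} fields" unless fields.size == EXPECTED_FIELDS
  fields
end

# Yield each well-formed row; for a bad row, emit the counter line and move on.
def process_lines(input, err = STDERR, warning = 'reporter:counter:bad_data,row_size,1')
  input.each_line do |line|
    begin
      yield parse_row(line)
    rescue FieldCountError
      err.puts warning
      next
    end
  end
end

input  = StringIO.new("a1\t1500\nbroken\nb2\t20\n")
errors = StringIO.new
rows   = []
process_lines(input, errors) { |row| rows << row }
```

As in the diff, the `rescue` also covers the caller's block, so a `FieldCountError` raised inside the block would be counted as bad data too.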

metadata CHANGED

@@ -5,8 +5,8 @@ version: !ruby/object:Gem::Version
   segments:
   - 0
   - 0
-  - 5
-  version: 0.0.5
+  - 6
+  version: 0.0.6
 platform: ruby
 authors:
 - Frank Fejes