wukong-load 0.0.2 → 0.1.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
@@ -0,0 +1,5 @@
+ --readme README.md
+ --markup markdown
+ -
+ LICENSE.md
+ README.md
data/Gemfile CHANGED
@@ -5,4 +5,20 @@ gemspec
  group :development do
  gem 'rake', '~> 0.9'
  gem 'rspec', '~> 2'
+ gem 'yard'
+ gem 'redcarpet'
  end
+
+ group :mongo do
+ gem 'mongo'
+ gem 'bson_ext'
+ end
+
+ group :sql do
+ gem 'mysql2'
+ end
+
+ group :kafka do
+ gem 'kafka-rb'
+ end
+
data/LICENSE.md CHANGED
@@ -1,4 +1,4 @@
- # License for Wukong
+ # License for Wukong-Load

  The wukong code is __Copyright (c) 2011, 2012 Infochimps, Inc__

data/README.md CHANGED
@@ -1,7 +1,7 @@
  # Wukong-Load

  This Wukong plugin makes it easy to load data from the command-line
- into various.
+ into various data stores.

  It is assumed that you will independently deploy and configure each
  data store yourself (but see
@@ -19,7 +19,7 @@ useful when developing flows in concert with wu-local.
  Wukong-Load can be installed as a RubyGem:

  ```
- $ sudo gem install wukong-hadoop
+ $ sudo gem install wukong-load
  ```

  ## Usage
@@ -39,7 +39,14 @@ $ wu-load store_name --help

  Further details will depend on the data store you're writing to.

- ### Elasticsearch Usage
+ ### Expected Input
+
+ All input to `wu-load` should be newline-separated, JSON-formatted,
+ hash-like records. For some data stores, keys in the record may be
+ interpreted as metadata about the record or about how to route the
+ record within the data store.
+
+ ## Elasticsearch Usage

  Lets you load JSON-formatted records into an
  [Elasticsearch](http://www.elasticsearch.org) database. See full
@@ -49,36 +56,10 @@ options with
  $ wu-load elasticsearch --help
  ```

- #### Expected Input
-
- All input to `wu-load` should be newline-separated, JSON-formatted,
- hash-like record. Some keys in the record will be interpreted as
- metadata about the record or about how to route the record within the
- database but the entire record will be written to the database
- unmodified.
+ ### Connecting

- A (pretty-printed for clarity -- the real record shouldn't contain
- newlines) record like
-
- ```json
- {
- "_index": "publications"
- "_type": "book",
- "ISBN": "0553573403",
- "title": "A Game of Thrones",
- "author": "George R. R. Martin",
- "description": "The first of half a hundred novels to come out since...",
- ...
- }
- ```
-
- might use the `_index` and `_type` fields as metadata but the
- **whole** record will be written.
-
- #### Connecting
-
- `wu-load` has a default host (localhost) and port (9200) it tries to
- connect to but you can change these:
+ `wu-load` tries to connect to an Elasticsearch server at a default
+ host (localhost) and port (9200). You can change these:

  ```
  $ cat data.json | wu-load elasticsearch --host=10.122.123.124 --port=80
@@ -86,7 +67,7 @@ $ cat data.json | wu-load elasticsearch --host=10.122.123.124 --port=80

  All queries will be sent to this address.

- #### Routing
+ ### Routing

  Elasticsearch stores data in several *indices* which each contain
  *documents* of various *types*.
@@ -98,7 +79,10 @@ Elasticsearch stores data in several *indices* which each contain
  $ cat data.json | wu-load elasticsearch --host=10.123.123.123 --index=publication --es_type=book
  ```

- ##### Creates vs. Updates
+ A record with an `_index` or `_es_type` field will override these
+ default settings. You can change the names of the fields used.
+
+ ### Creates vs. Updates

  If an input document contains a value for the field `_id` then that
  value will be used as the ID of the record when written, possibly
@@ -109,3 +93,85 @@ You can change the field you use for the Elasticsearch ID property:
  ```
  $ cat data.json | wu-load elasticsearch --host=10.123.123.123 --index=media --es_type=books --id_field="ISBN"
  ```
+
+ ## Kafka Usage
+
+ Lets you load JSON-formatted records into a
+ [Kafka](http://kafka.apache.org/) queue. See full options with
+
+ ```
+ $ wu-load kafka --help
+ ```
+
+ ### Connecting
+
+ `wu-load` tries to connect to a Kafka broker at a default host
+ (localhost) and port (9092). You can change these:
+
+ ```
+ $ cat data.json | wu-load kafka --host=10.122.123.124 --port=1234
+ ```
+
+ All records will be sent to this address.
+
+ ### Routing
+
+ Kafka stores data in several named *queues*. Each queue can have
+ several numbered *partitions*.
+
+ `wu-load` loads each record into the default queue (`test`) and
+ partition (0), but you can change these:
+
+ ```
+ $ cat data.json | wu-load kafka --host=10.123.123.123 --topic=messages --partition=6
+ ```
+
+ A record with a `_topic` or `_partition` field will override these
+ default settings. You can change the names of the fields used.
+
+ ## MongoDB Usage
+
+ Lets you load JSON-formatted records into a
+ [MongoDB](http://www.mongodb.org) database. See full options with
+
+ ```
+ $ wu-load mongodb --help
+ ```
+
+ ### Connecting
+
+ `wu-load` tries to connect to a MongoDB server at a default host
+ (localhost) and port (27017). You can change these:
+
+ ```
+ $ cat data.json | wu-load mongodb --host=10.122.123.124 --port=1234
+ ```
+
+ All queries will be sent to this address.
+
+ ### Routing
+
+ MongoDB stores *documents* in several *databases* which each contain
+ *collections*.
+
+ `wu-load` loads each document into a default database (`wukong`) and
+ collection (`streaming_record`), but you can change these:
+
+ ```
+ $ cat data.json | wu-load mongodb --host=10.123.123.123 --database=publication --collection=book
+ ```
+
+ A record with a `_database` or `_collection` field will override these
+ default settings. You can change the names of the fields used.
+
+ ### Creates vs. Updates
+
+ If an input document contains a value for the field `_id` then that
+ value will be used as the ID of the record when written, possibly
+ overwriting a record that already exists -- an update.
+
+ You can change the field you use for the MongoDB ID property:
+
+ ```
+ $ cat data.json | wu-load mongodb --host=10.123.123.123 --database=media --collection=books --id_field="ISBN"
+ ```
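
The "Expected Input" section above describes newline-separated JSON records whose routing fields (`_index`, `_es_type`, and `_id` in the Elasticsearch case) are read as metadata. As a minimal sketch, a script that emits one such record might look like the following; the field values are illustrative only.

```ruby
#!/usr/bin/env ruby
# Illustrative sketch: emit one newline-terminated JSON record of the kind
# wu-load reads on STDIN. All field values here are made up for the example.
require 'json'

record = {
  '_index'   => 'publications',       # overrides the default Elasticsearch index
  '_es_type' => 'book',               # overrides the default Elasticsearch type
  '_id'      => '0553573403',         # present, so the write is treated as an update
  'title'    => 'A Game of Thrones',
  'author'   => 'George R. R. Martin'
}

# One record per line; the entire hash is sent to the data store.
puts JSON.generate(record)
```

Piping the output of a script like this into `wu-load elasticsearch` matches the usage shown above.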
@@ -1,50 +1,4 @@
  #!/usr/bin/env ruby

  require 'wukong-load'
- settings = Wukong::Load::Configuration
- settings.use(:commandline)
-
- settings.usage = "usage: #{File.basename($0)} DATA_STORE [ --param=value | -p value | --param | -p]"
- settings.description = <<-EOF
- wu-load is a tool for loading data from Wukong into data stores. It
- supports multiple, pluggable data stores, including:
-
- Supported data stores:
-
- elasticsearch
- hbase (planned)
- mongob (planned)
- mysql (planned)
-
- Get specific help for a data store with
-
- $ wu-load store_name --help
-
- Elasticsearch Usage:
-
- Pass newline-separated, JSON-formatted records over STDIN:
-
- $ cat data.json | wu-load elasticsearch
-
- By default, wu-load attempts to write each input record to a local
- Elasticsearch database. Records will be routed to a default
- Elasticsearch index and type. Records with an '_id' field will be
- considered updates. The rest will be creates. You can override these
- options:
-
- $ cat data.json | wu-load elasticsearch --host=10.123.123.123 --index=my_app --es_type=my_obj --id_field="doc_id"
-
- Params:
- --host=String Elasticsearch host, without HTTP prefix [Default: localhost]
- --port=Integer Port on Elasticsearch host [Default: 9200]
- --index=String Default Elasticsearch index for records [Default: wukong]
- --es_type=String Default Elasticsearch type for records [Default: streaming_record]
- --index_field=String Field in each record naming desired Elasticsearch index
- --es_type_field=String Field in each record naming desired Elasticsearch type
- --id_field=String Field in each record naming providing ID of existing Elasticsearch record to update
- EOF
-
- require 'wukong/boot' ; Wukong.boot!(settings)
-
- require 'wukong-load/runner'
- Wukong::Load::Runner.run(settings)
+ Wukong::Load::LoadRunner.run
@@ -0,0 +1,4 @@
+ #!/usr/bin/env ruby
+
+ require 'wukong-load'
+ Wukong::Load::SourceRunner.run
@@ -3,8 +3,41 @@ require 'wukong'
  module Wukong
  # Loads data from the command-line into data stores.
  module Load
+ include Plugin
+
+ # Configure `settings` for Wukong-Load.
+ #
+ # Will ensure that `wu-load` has the same settings as `wu-local`.
+ #
+ # @param [Configliere::Param] settings the settings to configure
+ # @param [String] program the currently executing program name
+ def self.configure settings, program
+ case program
+ when 'wu-load'
+ settings.define :tcp_port, description: "Consume TCP requests on the given port instead of lines over STDIN", type: Integer, flag: 't'
+ when 'wu-source'
+ settings.define :per_sec, description: "Number of events produced per second", type: Float
+ settings.define :period, description: "Number of seconds between events (overrides --per_sec)", type: Float
+ settings.define :batch_size, description: "Trigger a finalize across the dataflow each time this many records are processed", type: Integer
+ end
+ end
+
+ # Boot Wukong-Load from the resolved `settings` in the given
+ # `dir`.
+ #
+ # @param [Configliere::Param] settings the resolved settings
+ # @param [String] dir the directory to boot in
+ def self.boot settings, dir
+ end
+
  end
  end
- require_relative 'wukong-load/version'
- require_relative 'wukong-load/configuration'
- require_relative 'wukong-load/elasticsearch'
+ require_relative 'wukong-load/load_runner'
+ require_relative 'wukong-load/source_runner'
+
+ require_relative 'wukong-load/models/http_request'
+
+ require_relative 'wukong-load/loaders/elasticsearch'
+ require_relative 'wukong-load/loaders/kafka'
+ require_relative 'wukong-load/loaders/mongodb'
+ require_relative 'wukong-load/loaders/sql'
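
The `configure` hook above only defines settings; Wukong's plugin machinery calls it with the live `Configliere::Param` object and the program name. A rough standalone sketch of that effect is shown below, assuming only that wukong-load and its Configliere dependency are installed; this is not how `wu-load` or `wu-source` actually boot.

```ruby
# Sketch: exercise the plugin's configure hook by hand to see which
# command-line settings it defines for wu-source.
require 'configliere'
require 'wukong-load'

settings = Configliere::Param.new
settings.use(:commandline)

Wukong::Load.configure(settings, 'wu-source')   # defines per_sec, period, batch_size

settings.resolve!                               # e.g. run with --per_sec=10
puts "events per second: #{settings[:per_sec].inspect}"
```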
@@ -0,0 +1,64 @@
+ module Wukong
+ module Load
+
+ # Runs the wu-load command.
+ class LoadRunner < Wukong::Local::LocalRunner
+
+ usage "DATA_STORE"
+
+ description <<-EOF.gsub(/^ {8}/,'')
+ wu-load is a tool for loading data from Wukong into data stores. It
+ supports multiple, pluggable data stores, including:
+
+ Supported data stores:
+
+ elasticsearch
+ kafka
+ mongodb
+ mysql
+ hbase (planned)
+
+ Get specific help for a data store with
+
+ $ wu-load store_name --help
+ EOF
+
+ include Logging
+
+ # Ensure that we were passed a data store name that we know
+ # about.
+ #
+ # @raise [Wukong::Error] if the data store is missing or unknown
+ # @return [true]
+ def validate
+ case
+ when data_store_name.nil?
+ raise Error.new("Must provide the name of a data store as the first argument")
+ when processor.nil?
+ raise Error.new("No loader defined for data store <#{data_store_name}>")
+ end
+ true
+ end
+
+ # The name of the data store
+ #
+ # @return [String]
+ def data_store_name
+ args.first
+ end
+
+ # The name of the processor that should handle the data store
+ #
+ # @return [String]
+ def processor
+ case data_store_name
+ when 'elasticsearch' then :elasticsearch_loader
+ when 'kafka' then :kafka_loader
+ when 'mongo','mongodb' then :mongodb_loader
+ when 'sql', 'mysql' then :sql_loader
+ end
+ end
+
+ end
+ end
+ end
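
The dispatch that `validate` and `processor` perform on the first command-line argument boils down to a small lookup. The following is a standalone reimplementation of that logic for illustration, not a call into the class itself.

```ruby
# Sketch: mirror LoadRunner's mapping from data store name to processor.
LOADERS = {
  'elasticsearch' => :elasticsearch_loader,
  'kafka'         => :kafka_loader,
  'mongo'         => :mongodb_loader,
  'mongodb'       => :mongodb_loader,
  'sql'           => :sql_loader,
  'mysql'         => :sql_loader,
}

store = ARGV.first
abort "Must provide the name of a data store as the first argument" if store.nil?
abort "No loader defined for data store <#{store}>" unless LOADERS.key?(store)

puts "would run processor: #{LOADERS[store]}"
```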
@@ -4,10 +4,17 @@ module Wukong
  # Base class from which to build Loaders.
  class Loader < Wukong::Processor::FromJson

+ # Calls super() to leverage its deserialization and then calls
+ # #load on the yielded record.
+ #
+ # @param [String] line JSON to parse.
  def process line
  super(line) { |record| load(record) }
  end

+ # Override this method to load a record into the data store.
+ #
+ # @param [Hash] record
  def load record
  end

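
To show the contract this base class defines, here is a hypothetical loader subclass; the class name, registered name, and file target are invented for the example, but the `field`/`load`/`register` pattern is the same one the bundled loaders below use.

```ruby
# Hypothetical sketch of a custom loader (names and output target invented).
require 'wukong-load'
require 'multi_json'

module Wukong
  module Load
    # "Loads" each deserialized record by appending it to a local file,
    # one JSON object per line.
    class FileLoader < Loader

      field :path, String, :default => 'records.jsonl', :doc => "File to append records to"

      # Loader#process parses each input line and hands the record here.
      def load record
        File.open(path, 'a') { |f| f.puts MultiJson.dump(record) }
      end

      register :file_loader
    end
  end
end
```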
@@ -0,0 +1,151 @@
+ require_relative('../loader')
+
+ module Wukong
+ module Load
+
+ # Loads data into Elasticsearch.
+ #
+ # Uses Elasticsearch's HTTP API to communicate.
+ #
+ # Allows loading records into a given index and type. Records can
+ # have fields `_index` and `_es_type` which override the given
+ # index and type on a per-record basis.
+ #
+ # Records can have an `_id` field which indicates an update, not a
+ # create.
+ #
+ # The names of these fields within each record (`_index`,
+ # `_es_type`, and `_id`) can be customized.
+ class ElasticsearchLoader < Loader
+
+ field :host, String, :default => 'localhost', :doc => "Elasticsearch host"
+ field :port, Integer,:default => 9200, :doc => "Port on Elasticsearch host"
+ field :index, String, :default => 'wukong', :doc => "Default Elasticsearch index for records"
+ field :es_type, String, :default => 'streaming_record', :doc => "Default Elasticsearch type for records"
+ field :index_field, String, :default => '_index', :doc => "Name of field in each record overriding default Elasticsearch index"
+ field :es_type_field, String, :default => '_es_type', :doc => "Name of field in each record overriding default Elasticsearch type"
+ field :id_field, String, :default => '_id', :doc => "Name of field in each record providing ID of existing Elasticsearch record to update"
+
+ description <<-EOF.gsub(/^ {8}/,'')
+ Loads newline-separated, JSON-formatted records over STDIN
+ into Elasticsearch using its HTTP API.
+
+ $ cat data.json | wu-load elasticsearch
+
+ By default, wu-load attempts to write each input record to a
+ local Elasticsearch database.
+
+ Input records will be written to a default Elasticsearch index
+ and type. Each record can have _index and _es_type fields to
+ override this on a per-record basis.
+
+ Records with an _id field will be trigger updates, the rest
+ creates.
+
+ The fields used (_index, _es_type, and _id) can be changed:
+
+ $ cat data.json | wu-load elasticsearch --host=10.123.123.123 --index=web_events --es_type=impressions --id_field="impression_id"
+ EOF
+
+ # The Net::HTTP connection we'll use for talking to
+ # Elasticsearch.
+ attr_accessor :connection
+
+ # Creates a connection
+ def setup
+ h = host.gsub(%r{^http://},'')
+ log.debug("Connecting to Elasticsearch cluster at #{h}:#{port}...")
+ begin
+ self.connection = Net::HTTP.new(h, port)
+ self.connection.use_ssl = true if host =~ /^https/
+ rescue => e
+ raise Error.new(e.message)
+ end
+ end
+
+ # Load a single record into Elasticsearch.
+ #
+ # If the record has an ID, we'll issue an update, otherwise a create
+ #
+ # @param [Hash] record
+ def load record
+ id_for(record) ? request(Net::HTTP::Put, update_path(record), record) : request(Net::HTTP::Post, create_path(record), record)
+ end
+
+ # :nodoc:
+ def create_path record
+ File.join('/', index_for(record).to_s, es_type_for(record).to_s)
+ end
+
+ # :nodoc:
+ def update_path record
+ File.join('/', index_for(record).to_s, es_type_for(record).to_s, id_for(record).to_s)
+ end
+
+ # :nodoc:
+ def index_for record
+ record[index_field] || self.index
+ end
+
+ # :nodoc:
+ def es_type_for record
+ record[es_type_field] || self.es_type
+ end
+
+ # :nodoc:
+ def id_for record
+ record[id_field]
+ end
+
+ # Make a request via the existing #connection. Record will be
+ # turned to JSON automatically.
+ #
+ # @param [Net::HTTPRequest] request_type
+ # @param [String] path
+ # @param [Hash] record
+ def request request_type, path, record
+ perform_request(create_request(request_type, path, record))
+ end
+
+ private
+
+ # :nodoc:
+ def create_request request_type, path, record
+ request_type.new(path).tap do |req|
+ req.body = MultiJson.dump(record)
+ end
+ end
+
+ # :nodoc:
+ def perform_request req
+ begin
+ response = connection.request(req)
+ status = response.code.to_i
+ if (200..201).include?(status)
+ log.info("#{req.class} #{req.path} #{status}")
+ else
+ handle_elasticsearch_error(status, response)
+ end
+ rescue => e
+ log.error("#{e.class} - #{e.message}")
+ end
+ end
+
+ # :nodoc:
+ def handle_elasticsearch_error status, response
+ begin
+ error = MultiJson.load(response.body)
+ log.error("#{response.code}: #{error['error']}")
+ rescue => e
+ log.error("Received a response code of #{status}: #{response.body}")
+ end
+ end
+
+ register :elasticsearch_loader
+
+ end
+ end
+ end
+
+
+
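
The create-versus-update routing in this loader comes down to the path helpers above: a record with an ID is PUT to `/index/type/id`, anything else is POSTed to `/index/type`. The following standalone sketch re-implements that decision for clarity; it does not call into the gem.

```ruby
# Sketch: reimplementation of the ElasticsearchLoader routing decision.
require 'json'

DEFAULT_INDEX   = 'wukong'
DEFAULT_ES_TYPE = 'streaming_record'

def route(record)
  index   = record['_index']   || DEFAULT_INDEX
  es_type = record['_es_type'] || DEFAULT_ES_TYPE
  id      = record['_id']
  if id
    [:put,  File.join('/', index.to_s, es_type.to_s, id.to_s)]   # update an existing document
  else
    [:post, File.join('/', index.to_s, es_type.to_s)]            # create a new document
  end
end

p route(JSON.parse('{"_index":"publications","_es_type":"book","title":"A Game of Thrones"}'))
#=> [:post, "/publications/book"]
p route(JSON.parse('{"_index":"publications","_es_type":"book","_id":"0553573403"}'))
#=> [:put, "/publications/book/0553573403"]
```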