wukong-load 0.0.2 → 0.1.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
@@ -0,0 +1,5 @@
+ --readme README.md
+ --markup markdown
+ -
+ LICENSE.md
+ README.md
data/Gemfile CHANGED
@@ -5,4 +5,20 @@ gemspec
  group :development do
  gem 'rake', '~> 0.9'
  gem 'rspec', '~> 2'
+ gem 'yard'
+ gem 'redcarpet'
  end
+
+ group :mongo do
+ gem 'mongo'
+ gem 'bson_ext'
+ end
+
+ group :sql do
+ gem 'mysql2'
+ end
+
+ group :kafka do
+ gem 'kafka-rb'
+ end
+
data/LICENSE.md CHANGED
@@ -1,4 +1,4 @@
- # License for Wukong
+ # License for Wukong-Load

  The wukong code is __Copyright (c) 2011, 2012 Infochimps, Inc__

data/README.md CHANGED
@@ -1,7 +1,7 @@
  # Wukong-Load

  This Wukong plugin makes it easy to load data from the command-line
- into various.
+ into various data stores.

  It is assumed that you will independently deploy and configure each
  data store yourself (but see
@@ -19,7 +19,7 @@ useful when developing flows in concert with wu-local.
  Wukong-Load can be installed as a RubyGem:

  ```
- $ sudo gem install wukong-hadoop
+ $ sudo gem install wukong-load
  ```

  ## Usage
@@ -39,7 +39,14 @@ $ wu-load store_name --help

  Further details will depend on the data store you're writing to.

- ### Elasticsearch Usage
+ ### Expected Input
+
+ All input to `wu-load` should be newline-separated, JSON-formatted,
+ hash-like records. For some data stores, keys in the record may be
+ interpreted as metadata about the record or about how to route the
+ record within the data store.
+
+ ## Elasticsearch Usage

  Lets you load JSON-formatted records into an
  [Elasticsearch](http://www.elasticsearch.org) database. See full
@@ -49,36 +56,10 @@ options with
  $ wu-load elasticsearch --help
  ```

- #### Expected Input
-
- All input to `wu-load` should be newline-separated, JSON-formatted,
- hash-like record. Some keys in the record will be interpreted as
- metadata about the record or about how to route the record within the
- database but the entire record will be written to the database
- unmodified.
+ ### Connecting

- A (pretty-printed for clarity -- the real record shouldn't contain
- newlines) record like
-
- ```json
- {
- "_index": "publications"
- "_type": "book",
- "ISBN": "0553573403",
- "title": "A Game of Thrones",
- "author": "George R. R. Martin",
- "description": "The first of half a hundred novels to come out since...",
- ...
- }
- ```
-
- might use the `_index` and `_type` fields as metadata but the
- **whole** record will be written.
-
- #### Connecting
-
- `wu-load` has a default host (localhost) and port (9200) it tries to
- connect to but you can change these:
+ `wu-load` tries to connect to an Elasticsearch server at a default
+ host (localhost) and port (9200). You can change these:

  ```
  $ cat data.json | wu-load elasticsearch --host=10.122.123.124 --port=80
@@ -86,7 +67,7 @@ $ cat data.json | wu-load elasticsearch --host=10.122.123.124 --port=80

  All queries will be sent to this address.

- #### Routing
+ ### Routing

  Elasticsearch stores data in several *indices* which each contain
  *documents* of various *types*.
@@ -98,7 +79,10 @@ Elasticsearch stores data in several *indices* which each contain
  $ cat data.json | wu-load elasticsearch --host=10.123.123.123 --index=publication --es_type=book
  ```

- ##### Creates vs. Updates
+ A record with an `_index` or `_es_type` field will override these
+ default settings. You can change the names of the fields used.
+
+ ### Creates vs. Updates

  If an input document contains a value for the field `_id` then that
  value will be used as the ID of the record when written, possibly
@@ -109,3 +93,85 @@ You can change the field you use for the Elasticsearch ID property:
  ```
  $ cat data.json | wu-load elasticsearch --host=10.123.123.123 --index=media --es_type=books --id_field="ISBN"
  ```
+
+ ## Kafka Usage
+
+ Lets you load JSON-formatted records into a
+ [Kafka](http://kafka.apache.org/) queue. See full options with
+
+ ```
+ $ wu-load kafka --help
+ ```
+
+ ### Connecting
+
+ `wu-load` tries to connect to a Kafka broker at a default host
+ (localhost) and port (9092). You can change these:
+
+ ```
+ $ cat data.json | wu-load kafka --host=10.122.123.124 --port=1234
+ ```
+
+ All records will be sent to this address.
+
+ ### Routing
+
+ Kafka stores data in several named *queues*. Each queue can have
+ several numbered *partitions*.
+
+ `wu-load` loads each record into the default queue (`test`) and
+ partition (0), but you can change these:
+
+ ```
+ $ cat data.json | wu-load kafka --host=10.123.123.123 --topic=messages --partition=6
+ ```
+
+ A record with a `_topic` or `_partition` field will override these
+ default settings. You can change the names of the fields used.
+
+ ## MongoDB Usage
+
+ Lets you load JSON-formatted records into a
+ [MongoDB](http://www.mongodb.org) database. See full options with
+
+ ```
+ $ wu-load mongodb --help
+ ```
+
+ ### Connecting
+
+ `wu-load` tries to connect to a MongoDB server at a default host
+ (localhost) and port (27017). You can change these:
+
+ ```
+ $ cat data.json | wu-load mongodb --host=10.122.123.124 --port=1234
+ ```
+
+ All queries will be sent to this address.
+
+ ### Routing
+
+ MongoDB stores *documents* in several *databases* which each contain
+ *collections*.
+
+ `wu-load` loads each document into a default database (`wukong`) and
+ collection (`streaming_record`), but you can change these:
+
+ ```
+ $ cat data.json | wu-load mongodb --host=10.123.123.123 --database=publication --collection=book
+ ```
+
+ A record with a `_database` or `_collection` field will override these
+ default settings. You can change the names of the fields used.
+
+ ### Creates vs. Updates
+
+ If an input document contains a value for the field `_id` then that
+ value will be used as the ID of the record when written, possibly
+ overwriting a record that already exists -- an update.
+
+ You can change the field you use for the MongoDB ID property:
+
+ ```
+ $ cat data.json | wu-load mongodb --host=10.123.123.123 --database=media --collection=books --id_field="ISBN"
+ ```
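
The "Expected Input" section above describes newline-separated JSON records whose routing fields (`_index`, `_es_type`, and `_id` in the Elasticsearch case) are read as metadata. As a minimal sketch, a script that emits one such record might look like the following; the field values are illustrative only.

```ruby
#!/usr/bin/env ruby
# Illustrative sketch: emit one newline-terminated JSON record of the kind
# wu-load reads on STDIN. All field values here are made up for the example.
require 'json'

record = {
  '_index'   => 'publications',       # overrides the default Elasticsearch index
  '_es_type' => 'book',               # overrides the default Elasticsearch type
  '_id'      => '0553573403',         # present, so the write is treated as an update
  'title'    => 'A Game of Thrones',
  'author'   => 'George R. R. Martin'
}

# One record per line; the entire hash is sent to the data store.
puts JSON.generate(record)
```

Piping the output of a script like this into `wu-load elasticsearch` matches the usage shown above.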
@@ -1,50 +1,4 @@
  #!/usr/bin/env ruby

  require 'wukong-load'
- settings = Wukong::Load::Configuration
- settings.use(:commandline)
-
- settings.usage = "usage: #{File.basename($0)} DATA_STORE [ --param=value | -p value | --param | -p]"
- settings.description = <<-EOF
- wu-load is a tool for loading data from Wukong into data stores. It
- supports multiple, pluggable data stores, including:
-
- Supported data stores:
-
- elasticsearch
- hbase (planned)
- mongob (planned)
- mysql (planned)
-
- Get specific help for a data store with
-
- $ wu-load store_name --help
-
- Elasticsearch Usage:
-
- Pass newline-separated, JSON-formatted records over STDIN:
-
- $ cat data.json | wu-load elasticsearch
-
- By default, wu-load attempts to write each input record to a local
- Elasticsearch database. Records will be routed to a default
- Elasticsearch index and type. Records with an '_id' field will be
- considered updates. The rest will be creates. You can override these
- options:
-
- $ cat data.json | wu-load elasticsearch --host=10.123.123.123 --index=my_app --es_type=my_obj --id_field="doc_id"
-
- Params:
- --host=String Elasticsearch host, without HTTP prefix [Default: localhost]
- --port=Integer Port on Elasticsearch host [Default: 9200]
- --index=String Default Elasticsearch index for records [Default: wukong]
- --es_type=String Default Elasticsearch type for records [Default: streaming_record]
- --index_field=String Field in each record naming desired Elasticsearch index
- --es_type_field=String Field in each record naming desired Elasticsearch type
- --id_field=String Field in each record naming providing ID of existing Elasticsearch record to update
- EOF
-
- require 'wukong/boot' ; Wukong.boot!(settings)
-
- require 'wukong-load/runner'
- Wukong::Load::Runner.run(settings)
+ Wukong::Load::LoadRunner.run
@@ -0,0 +1,4 @@
+ #!/usr/bin/env ruby
+
+ require 'wukong-load'
+ Wukong::Load::SourceRunner.run
@@ -3,8 +3,41 @@ require 'wukong'
  module Wukong
  # Loads data from the command-line into data stores.
  module Load
+ include Plugin
+
+ # Configure `settings` for Wukong-Load.
+ #
+ # Will ensure that `wu-load` has the same settings as `wu-local`.
+ #
+ # @param [Configliere::Param] settings the settings to configure
+ # @param [String] program the currently executing program name
+ def self.configure settings, program
+ case program
+ when 'wu-load'
+ settings.define :tcp_port, description: "Consume TCP requests on the given port instead of lines over STDIN", type: Integer, flag: 't'
+ when 'wu-source'
+ settings.define :per_sec, description: "Number of events produced per second", type: Float
+ settings.define :period, description: "Number of seconds between events (overrides --per_sec)", type: Float
+ settings.define :batch_size, description: "Trigger a finalize across the dataflow each time this many records are processed", type: Integer
+ end
+ end
+
+ # Boot Wukong-Load from the resolved `settings` in the given
+ # `dir`.
+ #
+ # @param [Configliere::Param] settings the resolved settings
+ # @param [String] dir the directory to boot in
+ def self.boot settings, dir
+ end
+
  end
  end
- require_relative 'wukong-load/version'
- require_relative 'wukong-load/configuration'
- require_relative 'wukong-load/elasticsearch'
+ require_relative 'wukong-load/load_runner'
+ require_relative 'wukong-load/source_runner'
+
+ require_relative 'wukong-load/models/http_request'
+
+ require_relative 'wukong-load/loaders/elasticsearch'
+ require_relative 'wukong-load/loaders/kafka'
+ require_relative 'wukong-load/loaders/mongodb'
+ require_relative 'wukong-load/loaders/sql'
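
The `configure` hook above only defines settings; Wukong's plugin machinery calls it with the live `Configliere::Param` object and the program name. A rough standalone sketch of that effect is shown below, assuming only that wukong-load and its Configliere dependency are installed; this is not how `wu-load` or `wu-source` actually boot.

```ruby
# Sketch: exercise the plugin's configure hook by hand to see which
# command-line settings it defines for wu-source.
require 'configliere'
require 'wukong-load'

settings = Configliere::Param.new
settings.use(:commandline)

Wukong::Load.configure(settings, 'wu-source')   # defines per_sec, period, batch_size

settings.resolve!                               # e.g. run with --per_sec=10
puts "events per second: #{settings[:per_sec].inspect}"
```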
@@ -0,0 +1,64 @@
+ module Wukong
+ module Load
+
+ # Runs the wu-load command.
+ class LoadRunner < Wukong::Local::LocalRunner
+
+ usage "DATA_STORE"
+
+ description <<-EOF.gsub(/^ {8}/,'')
+ wu-load is a tool for loading data from Wukong into data stores. It
+ supports multiple, pluggable data stores, including:
+
+ Supported data stores:
+
+ elasticsearch
+ kafka
+ mongodb
+ mysql
+ hbase (planned)
+
+ Get specific help for a data store with
+
+ $ wu-load store_name --help
+ EOF
+
+ include Logging
+
+ # Ensure that we were passed a data store name that we know
+ # about.
+ #
+ # @raise [Wukong::Error] if the data store is missing or unknown
+ # @return [true]
+ def validate
+ case
+ when data_store_name.nil?
+ raise Error.new("Must provide the name of a data store as the first argument")
+ when processor.nil?
+ raise Error.new("No loader defined for data store <#{data_store_name}>")
+ end
+ true
+ end
+
+ # The name of the data store
+ #
+ # @return [String]
+ def data_store_name
+ args.first
+ end
+
+ # The name of the processor that should handle the data store
+ #
+ # @return [String]
+ def processor
+ case data_store_name
+ when 'elasticsearch' then :elasticsearch_loader
+ when 'kafka' then :kafka_loader
+ when 'mongo','mongodb' then :mongodb_loader
+ when 'sql', 'mysql' then :sql_loader
+ end
+ end
+
+ end
+ end
+ end
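
The dispatch that `validate` and `processor` perform on the first command-line argument boils down to a small lookup. The following is a standalone reimplementation of that logic for illustration, not a call into the class itself.

```ruby
# Sketch: mirror LoadRunner's mapping from data store name to processor.
LOADERS = {
  'elasticsearch' => :elasticsearch_loader,
  'kafka'         => :kafka_loader,
  'mongo'         => :mongodb_loader,
  'mongodb'       => :mongodb_loader,
  'sql'           => :sql_loader,
  'mysql'         => :sql_loader,
}

store = ARGV.first
abort "Must provide the name of a data store as the first argument" if store.nil?
abort "No loader defined for data store <#{store}>" unless LOADERS.key?(store)

puts "would run processor: #{LOADERS[store]}"
```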
@@ -4,10 +4,17 @@ module Wukong
  # Base class from which to build Loaders.
  class Loader < Wukong::Processor::FromJson

+ # Calls super() to leverage its deserialization and then calls
+ # #load on the yielded record.
+ #
+ # @param [String] line JSON to parse.
  def process line
  super(line) { |record| load(record) }
  end

+ # Override this method to load a record into the data store.
+ #
+ # @param [Hash] record
  def load record
  end

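
To show the contract this base class defines, here is a hypothetical loader subclass; the class name, registered name, and file target are invented for the example, but the `field`/`load`/`register` pattern is the same one the bundled loaders below use.

```ruby
# Hypothetical sketch of a custom loader (names and output target invented).
require 'wukong-load'
require 'multi_json'

module Wukong
  module Load
    # "Loads" each deserialized record by appending it to a local file,
    # one JSON object per line.
    class FileLoader < Loader

      field :path, String, :default => 'records.jsonl', :doc => "File to append records to"

      # Loader#process parses each input line and hands the record here.
      def load record
        File.open(path, 'a') { |f| f.puts MultiJson.dump(record) }
      end

      register :file_loader
    end
  end
end
```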
@@ -0,0 +1,151 @@
+ require_relative('../loader')
+
+ module Wukong
+ module Load
+
+ # Loads data into Elasticsearch.
+ #
+ # Uses Elasticsearch's HTTP API to communicate.
+ #
+ # Allows loading records into a given index and type. Records can
+ # have fields `_index` and `_es_type` which override the given
+ # index and type on a per-record basis.
+ #
+ # Records can have an `_id` field which indicates an update, not a
+ # create.
+ #
+ # The names of these fields within each record (`_index`,
+ # `_es_type`, and `_id`) can be customized.
+ class ElasticsearchLoader < Loader
+
+ field :host, String, :default => 'localhost', :doc => "Elasticsearch host"
+ field :port, Integer,:default => 9200, :doc => "Port on Elasticsearch host"
+ field :index, String, :default => 'wukong', :doc => "Default Elasticsearch index for records"
+ field :es_type, String, :default => 'streaming_record', :doc => "Default Elasticsearch type for records"
+ field :index_field, String, :default => '_index', :doc => "Name of field in each record overriding default Elasticsearch index"
+ field :es_type_field, String, :default => '_es_type', :doc => "Name of field in each record overriding default Elasticsearch type"
+ field :id_field, String, :default => '_id', :doc => "Name of field in each record providing ID of existing Elasticsearch record to update"
+
+ description <<-EOF.gsub(/^ {8}/,'')
+ Loads newline-separated, JSON-formatted records over STDIN
+ into Elasticsearch using its HTTP API.
+
+ $ cat data.json | wu-load elasticsearch
+
+ By default, wu-load attempts to write each input record to a
+ local Elasticsearch database.
+
+ Input records will be written to a default Elasticsearch index
+ and type. Each record can have _index and _es_type fields to
+ override this on a per-record basis.
+
+ Records with an _id field will be trigger updates, the rest
+ creates.
+
+ The fields used (_index, _es_type, and _id) can be changed:
+
+ $ cat data.json | wu-load elasticsearch --host=10.123.123.123 --index=web_events --es_type=impressions --id_field="impression_id"
+ EOF
+
+ # The Net::HTTP connection we'll use for talking to
+ # Elasticsearch.
+ attr_accessor :connection
+
+ # Creates a connection
+ def setup
+ h = host.gsub(%r{^http://},'')
+ log.debug("Connecting to Elasticsearch cluster at #{h}:#{port}...")
+ begin
+ self.connection = Net::HTTP.new(h, port)
+ self.connection.use_ssl = true if host =~ /^https/
+ rescue => e
+ raise Error.new(e.message)
+ end
+ end
+
+ # Load a single record into Elasticsearch.
+ #
+ # If the record has an ID, we'll issue an update, otherwise a create
+ #
+ # @param [Hash] record
+ def load record
+ id_for(record) ? request(Net::HTTP::Put, update_path(record), record) : request(Net::HTTP::Post, create_path(record), record)
+ end
+
+ # :nodoc:
+ def create_path record
+ File.join('/', index_for(record).to_s, es_type_for(record).to_s)
+ end
+
+ # :nodoc:
+ def update_path record
+ File.join('/', index_for(record).to_s, es_type_for(record).to_s, id_for(record).to_s)
+ end
+
+ # :nodoc:
+ def index_for record
+ record[index_field] || self.index
+ end
+
+ # :nodoc:
+ def es_type_for record
+ record[es_type_field] || self.es_type
+ end
+
+ # :nodoc:
+ def id_for record
+ record[id_field]
+ end
+
+ # Make a request via the existing #connection. Record will be
+ # turned to JSON automatically.
+ #
+ # @param [Net::HTTPRequest] request_type
+ # @param [String] path
+ # @param [Hash] record
+ def request request_type, path, record
+ perform_request(create_request(request_type, path, record))
+ end
+
+ private
+
+ # :nodoc:
+ def create_request request_type, path, record
+ request_type.new(path).tap do |req|
+ req.body = MultiJson.dump(record)
+ end
+ end
+
+ # :nodoc:
+ def perform_request req
+ begin
+ response = connection.request(req)
+ status = response.code.to_i
+ if (200..201).include?(status)
+ log.info("#{req.class} #{req.path} #{status}")
+ else
+ handle_elasticsearch_error(status, response)
+ end
+ rescue => e
+ log.error("#{e.class} - #{e.message}")
+ end
+ end
+
+ # :nodoc:
+ def handle_elasticsearch_error status, response
+ begin
+ error = MultiJson.load(response.body)
+ log.error("#{response.code}: #{error['error']}")
+ rescue => e
+ log.error("Received a response code of #{status}: #{response.body}")
+ end
+ end
+
+ register :elasticsearch_loader
+
+ end
+ end
+ end
+
+
+
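
The create-versus-update routing in this loader comes down to the path helpers above: a record with an ID is PUT to `/index/type/id`, anything else is POSTed to `/index/type`. The following standalone sketch re-implements that decision for clarity; it does not call into the gem.

```ruby
# Sketch: reimplementation of the ElasticsearchLoader routing decision.
require 'json'

DEFAULT_INDEX   = 'wukong'
DEFAULT_ES_TYPE = 'streaming_record'

def route(record)
  index   = record['_index']   || DEFAULT_INDEX
  es_type = record['_es_type'] || DEFAULT_ES_TYPE
  id      = record['_id']
  if id
    [:put,  File.join('/', index.to_s, es_type.to_s, id.to_s)]   # update an existing document
  else
    [:post, File.join('/', index.to_s, es_type.to_s)]            # create a new document
  end
end

p route(JSON.parse('{"_index":"publications","_es_type":"book","title":"A Game of Thrones"}'))
#=> [:post, "/publications/book"]
p route(JSON.parse('{"_index":"publications","_es_type":"book","_id":"0553573403"}'))
#=> [:put, "/publications/book/0553573403"]
```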