wukong-load 0.0.2 → 0.1.0

@@ -0,0 +1,5 @@
+ --readme README.md
+ --markup markdown
+ -
+ LICENSE.md
+ README.md
data/Gemfile CHANGED
@@ -5,4 +5,20 @@ gemspec
  group :development do
  gem 'rake', '~> 0.9'
  gem 'rspec', '~> 2'
+ gem 'yard'
+ gem 'redcarpet'
  end
+
+ group :mongo do
+ gem 'mongo'
+ gem 'bson_ext'
+ end
+
+ group :sql do
+ gem 'mysql2'
+ end
+
+ group :kafka do
+ gem 'kafka-rb'
+ end
+
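
These groups presumably exist so that the client library for each store is only pulled in when you actually use that loader; with ordinary Bundler usage you can load just the groups you need (a sketch of standard Bundler behavior, not something this diff prescribes):

```ruby
# Load the default gems plus only the Kafka group; the mongo/bson_ext and
# mysql2 gems in the other groups are left alone.
require 'bundler'
Bundler.require(:default, :kafka)
```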
data/LICENSE.md CHANGED
@@ -1,4 +1,4 @@
- # License for Wukong
+ # License for Wukong-Load
 
  The wukong code is __Copyright (c) 2011, 2012 Infochimps, Inc__
 
data/README.md CHANGED
@@ -1,7 +1,7 @@
  # Wukong-Load
 
  This Wukong plugin makes it easy to load data from the command-line
- into various.
+ into various data stores.
 
  It is assumed that you will independently deploy and configure each
  data store yourself (but see
@@ -19,7 +19,7 @@ useful when developing flows in concert with wu-local.
  Wukong-Load can be installed as a RubyGem:
 
  ```
- $ sudo gem install wukong-hadoop
+ $ sudo gem install wukong-load
  ```
 
  ## Usage
@@ -39,7 +39,14 @@ $ wu-load store_name --help
 
  Further details will depend on the data store you're writing to.
 
- ### Elasticsearch Usage
+ ### Expected Input
+
+ All input to `wu-load` should be newline-separated, JSON-formatted,
+ hash-like records. For some data stores, keys in the record may be
+ interpreted as metadata about the record or about how to route the
+ record within the data store.
+
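
For illustration, a stream of such records might be produced like the sketch below (not part of the gem; the field names are made up, and MultiJson is assumed only because the loaders later in this diff already use it):

```ruby
# Sketch: emit newline-separated, JSON-formatted, hash-like records
# suitable for piping into `wu-load`. Field names are hypothetical.
require 'multi_json'

records = [
  { 'title' => 'A Game of Thrones', 'author' => 'George R. R. Martin' },
  { 'title' => 'A Clash of Kings',  'author' => 'George R. R. Martin' }
]

records.each { |record| puts MultiJson.dump(record) } # one JSON object per line
```

Piping the output of a script like this into `wu-load <store>` is the expected mode of use.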
+ ## Elasticsearch Usage
 
  Lets you load JSON-formatted records into an
  [Elasticsearch](http://www.elasticsearch.org) database. See full
@@ -49,36 +56,10 @@ options with
  $ wu-load elasticsearch --help
  ```
 
- #### Expected Input
-
- All input to `wu-load` should be newline-separated, JSON-formatted,
- hash-like record. Some keys in the record will be interpreted as
- metadata about the record or about how to route the record within the
- database but the entire record will be written to the database
- unmodified.
+ ### Connecting
 
- A (pretty-printed for clarity -- the real record shouldn't contain
- newlines) record like
-
- ```json
- {
- "_index": "publications"
- "_type": "book",
- "ISBN": "0553573403",
- "title": "A Game of Thrones",
- "author": "George R. R. Martin",
- "description": "The first of half a hundred novels to come out since...",
- ...
- }
- ```
-
- might use the `_index` and `_type` fields as metadata but the
- **whole** record will be written.
-
- #### Connecting
-
- `wu-load` has a default host (localhost) and port (9200) it tries to
- connect to but you can change these:
+ `wu-load` tries to connect to an Elasticsearch server at a default
+ host (localhost) and port (9200). You can change these:
 
  ```
  $ cat data.json | wu-load elasticsearch --host=10.122.123.124 --port=80
@@ -86,7 +67,7 @@ $ cat data.json | wu-load elasticsearch --host=10.122.123.124 --port=80
 
  All queries will be sent to this address.
 
- #### Routing
+ ### Routing
 
  Elasticsearch stores data in several *indices* which each contain
  *documents* of various *types*.
@@ -98,7 +79,10 @@ Elasticsearch stores data in several *indices* which each contain
  $ cat data.json | wu-load elasticsearch --host=10.123.123.123 --index=publication --es_type=book
  ```
 
- ##### Creates vs. Updates
+ A record with an `_index` or `_es_type` field will override these
+ default settings. You can change the names of the fields used.
+
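
As a rough illustration of that per-record routing (following the `index_for` and `es_type_for` helpers of the Elasticsearch loader added later in this diff), a record carrying these fields routes itself:

```ruby
# Sketch: a record's own `_index`/`_es_type` fields win over the
# command-line defaults, mirroring ElasticsearchLoader#index_for and
# #es_type_for from later in this diff.
record = {
  '_index'   => 'publications',   # overrides --index
  '_es_type' => 'book',           # overrides --es_type
  'ISBN'     => '0553573403',
  'title'    => 'A Game of Thrones'
}

record['_index']   || 'wukong'            #=> "publications"
record['_es_type'] || 'streaming_record'  #=> "book"
```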
+ ### Creates vs. Updates
 
  If an input document contains a value for the field `_id` then that
  value will be used as the ID of the record when written, possibly
@@ -109,3 +93,85 @@ You can change the field you use for the Elasticsearch ID property:
  ```
  $ cat data.json | wu-load elasticsearch --host=10.123.123.123 --index=media --es_type=books --id_field="ISBN"
  ```
+
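
A worked sketch of that decision, following the `load`, `create_path`, and `update_path` methods of the loader added later in this diff (the `media`/`books` names come from the command above; the record is hypothetical):

```ruby
# Sketch: with --id_field="ISBN", a record carrying an ISBN becomes an
# update (HTTP PUT to a path ending in that ID); without one it becomes
# a create (HTTP POST to the index/type path).
require 'net/http'

record   = { 'ISBN' => '0553573403', 'title' => 'A Game of Thrones' }
id_field = 'ISBN'

if record[id_field]
  verb = Net::HTTP::Put
  path = File.join('/', 'media', 'books', record[id_field]) #=> "/media/books/0553573403"
else
  verb = Net::HTTP::Post
  path = File.join('/', 'media', 'books')                   #=> "/media/books"
end
```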
+ ## Kafka Usage
+
+ Lets you load JSON-formatted records into a
+ [Kafka](http://kafka.apache.org/) queue. See full options with
+
+ ```
+ $ wu-load kafka --help
+ ```
+
+ ### Connecting
+
+ `wu-load` tries to connect to a Kafka broker at a default host
+ (localhost) and port (9092). You can change these:
+
+ ```
+ $ cat data.json | wu-load kafka --host=10.122.123.124 --port=1234
+ ```
+
+ All records will be sent to this address.
+
+ ### Routing
+
+ Kafka stores data in several named *topics*. Each topic can have
+ several numbered *partitions*.
+
+ `wu-load` loads each record into the default topic (`test`) and
+ partition (0), but you can change these:
+
+ ```
+ $ cat data.json | wu-load kafka --host=10.123.123.123 --topic=messages --partition=6
+ ```
+
+ A record with a `_topic` or `_partition` field will override these
+ default settings. You can change the names of the fields used.
+
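
For example, a record like the following would steer itself to the `messages` topic, partition 6 (a hypothetical record; the Kafka loader's code is required but not shown in this diff, so the field handling is inferred from the README text above):

```ruby
# Hypothetical record using the per-record routing fields described above.
require 'multi_json'

record = {
  '_topic'     => 'messages', # overrides --topic
  '_partition' => 6,          # overrides --partition
  'body'       => 'hello'
}
puts MultiJson.dump(record)   # pipe lines like this into `wu-load kafka`
```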
+ ## MongoDB Usage
+
+ Lets you load JSON-formatted records into a
+ [MongoDB](http://www.mongodb.org) database. See full options with
+
+ ```
+ $ wu-load mongodb --help
+ ```
+
+ ### Connecting
+
+ `wu-load` tries to connect to a MongoDB server at a default host
+ (localhost) and port (27017). You can change these:
+
+ ```
+ $ cat data.json | wu-load mongodb --host=10.122.123.124 --port=1234
+ ```
+
+ All queries will be sent to this address.
+
+ ### Routing
+
+ MongoDB stores *documents* in several *databases* which each contain
+ *collections*.
+
+ `wu-load` loads each document into a default database (`wukong`) and
+ collection (`streaming_record`), but you can change these:
+
+ ```
+ $ cat data.json | wu-load mongodb --host=10.123.123.123 --database=publication --collection=book
+ ```
+
+ A record with a `_database` or `_collection` field will override these
+ default settings. You can change the names of the fields used.
+
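
Analogously to the Kafka case, a record can carry its own destination (again hypothetical; the MongoDB loader's code is not shown in this diff, so the field handling is inferred from the README text):

```ruby
# Hypothetical record using the per-record routing fields described above.
require 'multi_json'

record = {
  '_database'   => 'publication', # overrides --database
  '_collection' => 'book',        # overrides --collection
  'ISBN'        => '0553573403',
  'title'       => 'A Game of Thrones'
}
puts MultiJson.dump(record)       # pipe lines like this into `wu-load mongodb`
```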
+ ### Creates vs. Updates
+
+ If an input document contains a value for the field `_id` then that
+ value will be used as the ID of the record when written, possibly
+ overwriting a record that already exists -- an update.
+
+ You can change the field you use for the MongoDB ID property:
+
+ ```
+ $ cat data.json | wu-load mongodb --host=10.123.123.123 --database=media --collection=books --id_field="ISBN"
+ ```
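
So with `--id_field="ISBN"`, a record such as `{ "ISBN": "0553573403", "title": "A Game of Thrones" }` would presumably be written with `0553573403` as its MongoDB `_id`, replacing any existing document with that ID (an update rather than a create).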
@@ -1,50 +1,4 @@
  #!/usr/bin/env ruby
 
  require 'wukong-load'
- settings = Wukong::Load::Configuration
- settings.use(:commandline)
-
- settings.usage = "usage: #{File.basename($0)} DATA_STORE [ --param=value | -p value | --param | -p]"
- settings.description = <<-EOF
- wu-load is a tool for loading data from Wukong into data stores. It
- supports multiple, pluggable data stores, including:
-
- Supported data stores:
-
- elasticsearch
- hbase (planned)
- mongob (planned)
- mysql (planned)
-
- Get specific help for a data store with
-
- $ wu-load store_name --help
-
- Elasticsearch Usage:
-
- Pass newline-separated, JSON-formatted records over STDIN:
-
- $ cat data.json | wu-load elasticsearch
-
- By default, wu-load attempts to write each input record to a local
- Elasticsearch database. Records will be routed to a default
- Elasticsearch index and type. Records with an '_id' field will be
- considered updates. The rest will be creates. You can override these
- options:
-
- $ cat data.json | wu-load elasticsearch --host=10.123.123.123 --index=my_app --es_type=my_obj --id_field="doc_id"
-
- Params:
- --host=String Elasticsearch host, without HTTP prefix [Default: localhost]
- --port=Integer Port on Elasticsearch host [Default: 9200]
- --index=String Default Elasticsearch index for records [Default: wukong]
- --es_type=String Default Elasticsearch type for records [Default: streaming_record]
- --index_field=String Field in each record naming desired Elasticsearch index
- --es_type_field=String Field in each record naming desired Elasticsearch type
- --id_field=String Field in each record naming providing ID of existing Elasticsearch record to update
- EOF
-
- require 'wukong/boot' ; Wukong.boot!(settings)
-
- require 'wukong-load/runner'
- Wukong::Load::Runner.run(settings)
+ Wukong::Load::LoadRunner.run
@@ -0,0 +1,4 @@
+ #!/usr/bin/env ruby
+
+ require 'wukong-load'
+ Wukong::Load::SourceRunner.run
@@ -3,8 +3,41 @@ require 'wukong'
  module Wukong
    # Loads data from the command-line into data stores.
    module Load
+     include Plugin
+
+     # Configure `settings` for Wukong-Load.
+     #
+     # Will ensure that `wu-load` has the same settings as `wu-local`.
+     #
+     # @param [Configliere::Param] settings the settings to configure
+     # @param [String] program the currently executing program name
+     def self.configure settings, program
+       case program
+       when 'wu-load'
+         settings.define :tcp_port,   description: "Consume TCP requests on the given port instead of lines over STDIN", type: Integer, flag: 't'
+       when 'wu-source'
+         settings.define :per_sec,    description: "Number of events produced per second", type: Float
+         settings.define :period,     description: "Number of seconds between events (overrides --per_sec)", type: Float
+         settings.define :batch_size, description: "Trigger a finalize across the dataflow each time this many records are processed", type: Integer
+       end
+     end
+
+     # Boot Wukong-Load from the resolved `settings` in the given
+     # `dir`.
+     #
+     # @param [Configliere::Param] settings the resolved settings
+     # @param [String] dir the directory to boot in
+     def self.boot settings, dir
+     end
+
    end
  end
- require_relative 'wukong-load/version'
- require_relative 'wukong-load/configuration'
- require_relative 'wukong-load/elasticsearch'
+ require_relative 'wukong-load/load_runner'
+ require_relative 'wukong-load/source_runner'
+
+ require_relative 'wukong-load/models/http_request'
+
+ require_relative 'wukong-load/loaders/elasticsearch'
+ require_relative 'wukong-load/loaders/kafka'
+ require_relative 'wukong-load/loaders/mongodb'
+ require_relative 'wukong-load/loaders/sql'
@@ -0,0 +1,64 @@
+ module Wukong
+   module Load
+
+     # Runs the wu-load command.
+     class LoadRunner < Wukong::Local::LocalRunner
+
+       usage "DATA_STORE"
+
+       description <<-EOF.gsub(/^ {8}/,'')
+         wu-load is a tool for loading data from Wukong into data stores. It
+         supports multiple, pluggable data stores, including:
+
+         Supported data stores:
+
+           elasticsearch
+           kafka
+           mongodb
+           mysql
+           hbase (planned)
+
+         Get specific help for a data store with
+
+           $ wu-load store_name --help
+       EOF
+
+       include Logging
+
+       # Ensure that we were passed a data store name that we know
+       # about.
+       #
+       # @raise [Wukong::Error] if the data store is missing or unknown
+       # @return [true]
+       def validate
+         case
+         when data_store_name.nil?
+           raise Error.new("Must provide the name of a data store as the first argument")
+         when processor.nil?
+           raise Error.new("No loader defined for data store <#{data_store_name}>")
+         end
+         true
+       end
+
+       # The name of the data store
+       #
+       # @return [String]
+       def data_store_name
+         args.first
+       end
+
+       # The name of the processor that should handle the data store
+       #
+       # @return [String]
+       def processor
+         case data_store_name
+         when 'elasticsearch'   then :elasticsearch_loader
+         when 'kafka'           then :kafka_loader
+         when 'mongo','mongodb' then :mongodb_loader
+         when 'sql', 'mysql'    then :sql_loader
+         end
+       end
+
+     end
+   end
+ end
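
Based on the mapping in `#processor` above and the `register` call at the bottom of the Elasticsearch loader later in this diff, wiring in an additional store would presumably look like the following sketch (the `redis` entry and `:redis_loader` are hypothetical and do not exist in this gem):

```ruby
# Hypothetical extension of LoadRunner#processor for a new data store.
def processor
  case data_store_name
  when 'elasticsearch'   then :elasticsearch_loader
  when 'kafka'           then :kafka_loader
  when 'mongo','mongodb' then :mongodb_loader
  when 'sql', 'mysql'    then :sql_loader
  when 'redis'           then :redis_loader # hypothetical; requires a registered RedisLoader
  end
end
```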
@@ -4,10 +4,17 @@ module Wukong
    # Base class from which to build Loaders.
    class Loader < Wukong::Processor::FromJson
 
+     # Calls super() to leverage its deserialization and then calls
+     # #load on the yielded record.
+     #
+     # @param [String] line JSON to parse.
      def process line
        super(line) { |record| load(record) }
      end
 
+     # Override this method to load a record into the data store.
+     #
+     # @param [Hash] record
      def load record
      end
 
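
A custom loader built on this base class would presumably look like the sketch below: the inherited `#process` parses each JSON line and hands the resulting hash to `#load`. The `FileLoader` itself is hypothetical; only the `field`, `#load`, and `register` patterns come from the code in this diff.

```ruby
# Hypothetical loader: appends each record to a local file, one JSON object
# per line. Registered the same way the Elasticsearch loader is below.
require 'wukong-load'

module Wukong
  module Load
    class FileLoader < Loader
      field :path, String, :default => '/tmp/records.jsonl', :doc => "File to append records to"

      # Called once per deserialized record by the inherited #process.
      def load record
        File.open(path, 'a') { |f| f.puts MultiJson.dump(record) }
      end

      register :file_loader
    end
  end
end
```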
@@ -0,0 +1,151 @@
+ require_relative('../loader')
+
+ module Wukong
+   module Load
+
+     # Loads data into Elasticsearch.
+     #
+     # Uses Elasticsearch's HTTP API to communicate.
+     #
+     # Allows loading records into a given index and type. Records can
+     # have fields `_index` and `_es_type` which override the given
+     # index and type on a per-record basis.
+     #
+     # Records can have an `_id` field which indicates an update, not a
+     # create.
+     #
+     # The names of these fields within each record (`_index`,
+     # `_es_type`, and `_id`) can be customized.
+     class ElasticsearchLoader < Loader
+
+       field :host,          String,  :default => 'localhost',        :doc => "Elasticsearch host"
+       field :port,          Integer, :default => 9200,               :doc => "Port on Elasticsearch host"
+       field :index,         String,  :default => 'wukong',           :doc => "Default Elasticsearch index for records"
+       field :es_type,       String,  :default => 'streaming_record', :doc => "Default Elasticsearch type for records"
+       field :index_field,   String,  :default => '_index',           :doc => "Name of field in each record overriding default Elasticsearch index"
+       field :es_type_field, String,  :default => '_es_type',         :doc => "Name of field in each record overriding default Elasticsearch type"
+       field :id_field,      String,  :default => '_id',              :doc => "Name of field in each record providing ID of existing Elasticsearch record to update"
+
+       description <<-EOF.gsub(/^ {8}/,'')
+         Loads newline-separated, JSON-formatted records over STDIN
+         into Elasticsearch using its HTTP API.
+
+           $ cat data.json | wu-load elasticsearch
+
+         By default, wu-load attempts to write each input record to a
+         local Elasticsearch database.
+
+         Input records will be written to a default Elasticsearch index
+         and type. Each record can have _index and _es_type fields to
+         override this on a per-record basis.
+
+         Records with an _id field will trigger updates, the rest
+         creates.
+
+         The fields used (_index, _es_type, and _id) can be changed:
+
+           $ cat data.json | wu-load elasticsearch --host=10.123.123.123 --index=web_events --es_type=impressions --id_field="impression_id"
+       EOF
+
+       # The Net::HTTP connection we'll use for talking to
+       # Elasticsearch.
+       attr_accessor :connection
+
+       # Creates a connection
+       def setup
+         h = host.gsub(%r{^http://},'')
+         log.debug("Connecting to Elasticsearch cluster at #{h}:#{port}...")
+         begin
+           self.connection = Net::HTTP.new(h, port)
+           self.connection.use_ssl = true if host =~ /^https/
+         rescue => e
+           raise Error.new(e.message)
+         end
+       end
+
+       # Load a single record into Elasticsearch.
+       #
+       # If the record has an ID, we'll issue an update, otherwise a create.
+       #
+       # @param [Hash] record
+       def load record
+         id_for(record) ? request(Net::HTTP::Put, update_path(record), record) : request(Net::HTTP::Post, create_path(record), record)
+       end
+
+       # :nodoc:
+       def create_path record
+         File.join('/', index_for(record).to_s, es_type_for(record).to_s)
+       end
+
+       # :nodoc:
+       def update_path record
+         File.join('/', index_for(record).to_s, es_type_for(record).to_s, id_for(record).to_s)
+       end
+
+       # :nodoc:
+       def index_for record
+         record[index_field] || self.index
+       end
+
+       # :nodoc:
+       def es_type_for record
+         record[es_type_field] || self.es_type
+       end
+
+       # :nodoc:
+       def id_for record
+         record[id_field]
+       end
+
+       # Make a request via the existing #connection. Record will be
+       # turned into JSON automatically.
+       #
+       # @param [Net::HTTPRequest] request_type
+       # @param [String] path
+       # @param [Hash] record
+       def request request_type, path, record
+         perform_request(create_request(request_type, path, record))
+       end
+
+       private
+
+       # :nodoc:
+       def create_request request_type, path, record
+         request_type.new(path).tap do |req|
+           req.body = MultiJson.dump(record)
+         end
+       end
+
+       # :nodoc:
+       def perform_request req
+         begin
+           response = connection.request(req)
+           status   = response.code.to_i
+           if (200..201).include?(status)
+             log.info("#{req.class} #{req.path} #{status}")
+           else
+             handle_elasticsearch_error(status, response)
+           end
+         rescue => e
+           log.error("#{e.class} - #{e.message}")
+         end
+       end
+
+       # :nodoc:
+       def handle_elasticsearch_error status, response
+         begin
+           error = MultiJson.load(response.body)
+           log.error("#{response.code}: #{error['error']}")
+         rescue => e
+           log.error("Received a response code of #{status}: #{response.body}")
+         end
+       end
+
+       register :elasticsearch_loader
+
+     end
+   end
+ end
+
+
+
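
To make the request flow above concrete, one create boils down to roughly the following (a sketch using only the Net::HTTP and MultiJson calls the loader itself makes; the host, path, and record are placeholders):

```ruby
# Sketch: a single create against a local Elasticsearch node, mirroring
# #setup, #create_path, #create_request, and #perform_request above.
require 'net/http'
require 'multi_json'

record     = { 'title' => 'A Game of Thrones' }
connection = Net::HTTP.new('localhost', 9200)                   # as in #setup

request      = Net::HTTP::Post.new('/wukong/streaming_record')  # as in #create_path
request.body = MultiJson.dump(record)                           # as in #create_request

response = connection.request(request)                          # as in #perform_request
puts "#{response.code}: #{response.body}"
```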