wukong-storm 0.1.1 → 0.2.0
- data/.gitignore +1 -0
- data/.rspec +1 -2
- data/Gemfile +1 -1
- data/README.md +174 -18
- data/bin/wu-bolt +4 -0
- data/lib/wukong-storm.rb +50 -10
- data/lib/wukong-storm/bolt_driver.rb +81 -0
- data/lib/wukong-storm/bolt_runner.rb +44 -0
- data/lib/wukong-storm/storm_invocation.rb +386 -0
- data/lib/wukong-storm/storm_runner.rb +123 -0
- data/lib/wukong-storm/version.rb +1 -1
- data/lib/wukong-storm/wukong-storm.jar +0 -0
- data/pom.xml +111 -0
- data/spec/spec_helper.rb +13 -1
- data/spec/wukong-storm/bolt_driver_spec.rb +46 -0
- data/spec/wukong-storm/storm_invocation_spec.rb +204 -0
- data/spec/wukong-storm/storm_runner_spec.rb +76 -0
- data/spec/{wu_storm_spec.rb → wukong-storm/wu-bolt_spec.rb} +14 -14
- data/spec/wukong-storm/wu-storm_spec.rb +17 -0
- data/spec/wukong-storm_spec.rb +5 -0
- data/src/main/java/com/infochimps/wukong/storm/Builder.java +53 -0
- data/src/main/java/com/infochimps/wukong/storm/DataflowBuilder.java +74 -0
- data/src/main/java/com/infochimps/wukong/storm/SpoutBuilder.java +237 -0
- data/src/main/java/com/infochimps/wukong/storm/StateBuilder.java +46 -0
- data/src/main/java/com/infochimps/wukong/storm/TopologyBuilder.java +130 -0
- data/src/main/java/com/infochimps/wukong/storm/TopologySubmitter.java +181 -0
- data/wukong-storm.gemspec +3 -2
- metadata +49 -11
- data/lib/wukong-storm/driver.rb +0 -58
- data/lib/wukong-storm/runner.rb +0 -40
data/.gitignore
CHANGED
data/.rspec
CHANGED
data/Gemfile
CHANGED
data/README.md
CHANGED
@@ -1,31 +1,187 @@
-# Wukong
+# Wukong-Storm
 
+The Storm plugin for Wukong lets you run <a href="http://github.com/infochimps-labs/wukong/tree/3.0.0">Wukong</a> processors and dataflows as <a href="https://github.com/nathanmarz/storm">Storm</a> topologies, reading data in and out of <a href="http://kafka.apache.org/">Kafka</a>.
+
+Before you use Wukong-Storm to develop, test, and run your Storm topologies, you might want to read about <a href="http://github.com/infochimps-labs/wukong/tree/3.0.0">Wukong</a>, write some <a href="http://github.com/infochimps-labs/wukong/tree/3.0.0#writing-simple-processors">simple processors</a>, and read about some of Storm's <a href="https://github.com/nathanmarz/storm/wiki/Concepts">core concepts</a>.
+
+You might also want to check out some other projects which enrich the Wukong and Storm experience:
+
+* <a href="http://github.com/infochimps-labs/wukong-hadoop">wukong-hadoop</a>: Run Wukong processors and dataflows as mappers and/or reducers within the Hadoop framework. Model jobs locally before you run them.
+* <a href="http://github.com/infochimps-labs/wukong-load">wukong-load</a>: Load the output data from your local Wukong jobs and flows into a variety of different data stores.
+* <a href="http://github.com/infochimps-labs/wukong-deploy">wukong-deploy</a>: Orchestrate Wukong and other wu-tools together to support an application running on the Infochimps Platform.
+
+<a name="installation"></a>
+## Installation & Setup
+
+Wukong-Storm can be installed as a RubyGem:
+
+```
+$ sudo gem install wukong-storm
 ```
-usage: wu-storm PROCESSOR|FLOW [...--param=value...]
 
+If you actually want to run your dataflows as functioning Storm topologies reading from and writing to Kafka, you'll of course need access to Storm and Kafka installations. <a href="http://github.com/infochimps-labs/ironfan">Ironfan</a> is a great tool for quickly and easily building and managing Storm clusters and other distributed infrastructure.
+
+To run Storm jobs through Wukong-Storm, you'll need to deploy your Wukong code to each worker of the Storm cluster, install Wukong-Storm on each, and log in and launch your job from one of them. Ironfan again helps with configuring this.
 
-'foo\n|\n'. The '|' character is the specified End-Of-File delimiter.
+<a name="anatomy"></a>
+## Anatomy of a running topology
 
+Storm defines the concept of a **topology**. A topology contains spouts and bolts. A **spout** is a source of data. A **bolt** processes data. Bolts can be connected to each other and to spouts in arbitrary ways.
+
+Topologies are submitted to Storm's Nimbus but run within a Storm supervisor. Each supervisor can dedicate a certain number of **workers** to a topology. Within each worker, **parallelism** controls the number of threads the worker assigns to the topology.
+
+Wukong-Storm runs each Wukong dataflow as a single bolt within a single topology. Data is passed to this bolt over STDIN and collected over STDOUT, similar to the way <a href="http://hadoop.apache.org/docs/r0.15.2/streaming.html">Hadoop streaming</a> operates.
+
+This topology is hooked up to a `storm.kafka.trident.OpaqueTridentKafkaSpout` (part of [storm-contrib](https://github.com/nathanmarz/storm-contrib)) which reads from a single input topic within Kafka.
+
+Output records are written to a default Kafka topic, but this can be overridden on a per-record basis.
+
+<a name="protocol"></a>
+### Communication protocol
+
+A Wukong dataflow launched within Storm runs as a single bolt (see [`com.infochimps.wukong.storm.SubprocessFunction`](https://github.com/infochimps-labs/wukong-storm/blob/master/src/main/java/com/infochimps/wukong/storm/SubprocessFunction.java)). This bolt works by launching an arbitrary command line, sending it records over STDIN, and reading its output over STDOUT. The `SubprocessFunction` class expects whatever command it launches to obey a protocol under which the output after **each** input consists of each output record followed by a newline, with the full batch of output records followed by a batch terminator (default: `---`) and another newline.
+
+Wukong-Storm comes with a command `wu-bolt` which works very similarly to `wu-local` but implements this protocol. Here's an example of using `wu-bolt` directly with a processor:
+
+```
+$ echo 2 | wu-bolt prime_factorizer.rb
+2
+---
+$ echo 12 | wu-bolt prime_factorizer.rb
+2
+2
+3
+---
+$ echo 1 | wu-bolt prime_factorizer.rb
+---
 ```
 
+Notice that in the last example the presence of the batch delimiter after each input record makes it easy to tell the difference between "no output records" and "no output records yet", which is otherwise rather hard to distinguish over STDIN/STDOUT.
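The protocol is simple to implement in any language, which is what makes the harness reusable. As a rough sketch (not part of Wukong-Storm; the whitespace-tokenizing `process` logic and the `---` default are illustrative assumptions), a minimal filter speaking the batch protocol might look like:

```ruby
# Minimal sketch of a filter speaking the wu-bolt batch protocol:
# after *each* input line, emit zero or more output records, one per
# line, then the batch terminator on its own line.

DELIMITER = '---' # assumed default batch terminator

# Turn one input record into zero or more output records.  Here we
# just split on whitespace (like the tokenizer example); any logic
# could go here.
def process(line)
  line.split
end

# Format one batch: each output record on its own line, then the
# terminator and a trailing newline.  An empty batch is just the
# terminator, so the reader always sees *something* per input.
def batch_for(line)
  (process(line) + [DELIMITER]).join("\n") + "\n"
end

# To run as an actual filter, hook it up to STDIN/STDOUT, flushing
# per batch so Storm sees output promptly:
#   $stdin.each_line { |l| $stdout.write(batch_for(l.chomp)); $stdout.flush }
```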
+
+## Running a dataflow
+
+### A simple processor
+
+Assuming you have correctly installed Wukong-Storm, Storm, Kafka, Zookeeper, &c., and you have defined a simple dataflow (or in this case, just a single processor) like this:
+
+```ruby
+# in upcaser.rb
+Wukong.processor(:upcaser) do
+  def process line
+    yield line.upcase
+  end
+end
+```
+
+Then you can launch it directly into Storm:
+
+```
+$ wu-storm upcaser.rb --input=some_input_topic --output=some_output_topic
+```
+
+If a topology named `upcaser` already exists, you'll get an error. Add the `--rm` flag to first kill the running topology before launching the new one:
+
+```
+$ wu-storm upcaser.rb --input=some_input_topic --output=some_output_topic --rm
+```
+
+The default amount of time to wait for the topology to die is 300 seconds (5 minutes), just like the `storm kill` command (which is used under the hood). When debugging a topology in development, it's helpful to add `--wait=1` to kill the topology almost immediately.
+
+See exactly what happens behind the scenes by adding the `--dry_run` flag, which prints the commands instead of executing them:
+
+```
+$ wu-storm upcaser.rb --input=some_input_topic --output=some_output_topic --rm --dry_run
+```
+
+### A more complicated example
+
+Say you have a dataflow:
+
+```ruby
+# in my_flow.rb
+Wukong.dataflow(:my_flow) do
+  my_parser | does_something | then_something_else | to_json
+end
+```
+
+You can launch it using a different topology name as well as target arbitrary locations for your Zookeeper, Kafka, and Storm servers:
+
+```
+$ wu-storm my_flow.rb --name=my_flow_attempt_3 --zookeeper_hosts=10.121.121.121,10.122.122.122 --kafka_hosts=10.123.123.123 --nimbus_host=10.124.124.124 --input=some_input_topic --output=some_output_topic
+```
+
+### Running non-Wukong or non-Ruby code
+
+You can also use Wukong-Storm as a harness to run non-Wukong or non-Ruby code. As long as you can specify a command line to run which supports the [communication protocol](#protocol), you can run it with `wu-storm`:
+
+```
+$ wu-storm --bolt_command='my_cmd --some-option=value -af -q 3' --input=some_input_topic --output=some_output_topic
+```
+
+### Scaling options
+
+Storm provides several options for scaling a topology up or down. Wukong-Storm makes them accessible at launch time via the following options:
 
+* `--workers`: the number of workers (a.k.a. "executors" or "slots") for the topology. Defaults to 1.
+* `--input_parallelism`: the number of threads within the spout reading from Kafka within each worker. Defaults to 1.
+* `--parallelism`: the number of threads within the bolt running Wukong code within each worker. Defaults to 1.
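Since each worker runs its own spout and bolt threads, the flags multiply: total spout threads are `workers × input_parallelism` and total bolt threads are `workers × parallelism`. A quick sketch of that arithmetic (the flag names are the real ones; the helper itself is hypothetical):

```ruby
# Sketch: total thread counts implied by the scaling flags above.
# Each worker runs `input_parallelism` spout threads and
# `parallelism` bolt threads for the topology.
def total_threads(workers: 1, input_parallelism: 1, parallelism: 1)
  { spout_threads: workers * input_parallelism,
    bolt_threads:  workers * parallelism }
end

total_threads(workers: 4, input_parallelism: 2, parallelism: 3)
# => {:spout_threads=>8, :bolt_threads=>12}
```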
data/bin/wu-bolt
ADDED
data/lib/wukong-storm.rb
CHANGED
@@ -12,15 +12,54 @@ module Wukong
   # @param [Configliere::Param] settings the settings to configure
   # @param [String] program the name of the currently executing program
   def self.configure settings, program
+    case program
+    when 'wu-bolt'
+      settings.define :run,       description: 'Name of the processor or dataflow to use. Defaults to basename of the given path', flag: 'r'
+      settings.define :delimiter, description: 'Emitted as a single record to mark the end of the batch', default: 'X', flag: 't'
+    when 'wu-storm'
+      settings.define :name,           wukong_storm: true, description: "Name for the launched topology"
+      settings.define :command_prefix, wukong_storm: true, description: "Prefix to insert before all Wukong commands"
+      settings.define :bolt_command,   wukong_storm: true, description: "Command-line to run within the spawned Storm bolt"
+      settings.define :dry_run,        wukong_storm: true, description: "Echo commands that will be run, but don't run them", type: :boolean, default: false
+      settings.define :wait,           wukong_storm: true, description: "How many seconds to wait when killing a topology", type: Integer, default: 300
+      settings.define :rm,             wukong_storm: true, description: "Kill any running topology of the same name before launching", type: :boolean, default: false
+      settings.define :delimiter,      wukong_storm: true, description: "Batch delimiter to use with wu-bolt"
+      settings.define :parallelism,    wukong_storm: true, description: "Parallelism hint for wu-bolt", default: 1
+
+      settings.define :input,             wukong_storm: true, description: "Input URI for the topology. The scheme of the URI determines the type of spout."
+      settings.define :input_parallelism, wukong_storm: true, description: "Parallelism (number of simultaneous threads) reading input. Only used by some spouts.", default: 1
+      settings.define :offset,            wukong_storm: true, description: "Offset to use when starting to read from input. Interpreted in a spout-dependent way."
+
+      settings.define :from_beginning, wukong_storm: true, description: "Start reading from the beginning of the input.", type: :boolean, default: false
+      settings.define :from_end,       wukong_storm: true, description: "Start reading from the end of the input.", type: :boolean, default: false
+      settings.define :resume,         wukong_storm: true, description: "Start reading from where the topology left off. This is the default behavior.", type: :boolean, default: true
+
+      settings.define :kafka_partitions, wukong_storm: true, description: "Number of Kafka partitions on the input topic", default: 1
+      settings.define :kafka_batch,      wukong_storm: true, description: "Batch size when reading from input topic (bytes)", default: 1_048_576
+
+      settings.define :aws_key,    wukong_storm: true, description: "AWS access key. (Required for S3 input)"
+      settings.define :aws_secret, wukong_storm: true, description: "AWS secret key. (Required for S3 input)"
+      settings.define :aws_region, wukong_storm: true, description: "AWS region, one of: us-east-1, us-west-[1,2], eu-west-1, ap-southeast-[1,2], ap-northeast-1, sa-east-1. (Required for S3 input)", default: 'us-east-1'
+
+      settings.define :output, wukong_storm: true, description: "Output URI for the topology. The scheme of the URI determines the type of state used."
+
+      settings.define :debug,       wukong_storm: true, storm: true, description: 'topology.debug'
+      settings.define :optimize,    wukong_storm: true, storm: true, description: 'topology.optimize'
+      settings.define :timeout,     wukong_storm: true, storm: true, description: 'topology.message.timeout.secs'
+      settings.define :workers,     wukong_storm: true, storm: true, description: 'topology.workers'
+      settings.define :worker_opts, wukong_storm: true, storm: true, description: 'topology.worker.childopts'
+      settings.define :ackers,      wukong_storm: true, storm: true, description: 'topology.acker.executors'
+      settings.define :sample_rate, wukong_storm: true, storm: true, description: 'topology.stats.sample.rate'
+
+      settings.define :nimbus_host,     wukong_storm: true, storm: true, description: 'nimbus.host', default: 'localhost'
+      settings.define :nimbus_port,     wukong_storm: true, storm: true, description: 'nimbus.thrift.port', default: 6627
+      settings.define :kafka_hosts,     wukong_storm: true, description: "Comma-separated list of Kafka hosts", default: 'localhost'
+      settings.define :zookeeper_hosts, wukong_storm: true, description: "Comma-separated list of Zookeeper hosts", default: 'localhost'
+
+      settings.define :storm_home,   wukong_storm: true, description: "Path to Storm installation", env_var: "STORM_HOME", default: "/usr/lib/storm"
+      settings.define :storm_runner, wukong_storm: true, description: "Path to Storm executable. Use this for non-standard Storm installations"
+
+    end
   end
 end
 
 # Boots the Wukong::Storm plugin.
@@ -33,4 +72,5 @@ module Wukong
   end
 end
 
-require 'wukong-storm/
+require 'wukong-storm/storm_runner'
+require 'wukong-storm/bolt_runner'
data/lib/wukong-storm/bolt_driver.rb
ADDED
@@ -0,0 +1,81 @@

```ruby
module Wukong
  module Storm

    # Modifies the behavior of Wukong::Local::StdioDriver by appending
    # a batch delimiter after each set of output records, including
    # when there are 0 output records or if an error occurs.
    class BoltDriver < Local::StdioDriver

      include Logging

      #
      # == Startup ==
      #

      # Override the behavior of StdioDriver by initializing an empty
      # array of output records.
      def initialize(label, settings)
        super(label, settings)
        @output = []
      end

      # Do *not* sync $stdout as in the StdioDriver.
      def setup()
      end

      #
      # == Reading Input ==
      #

      # Called by the EventMachine framework after successfully reading a
      # line from $stdin.
      #
      # Relies on StdioDriver, but calls #write_output afterwards to
      # ensure that a delimiter is also sent.
      #
      # @param [String] line
      def receive_line line
        super(line)
        write_output
      end

      #
      # == Handling Output ==
      #

      # Don't write the record to $stdout, but store it in an array of
      # output records instead.
      #
      # @param [Object] record
      #
      # @see #write_output
      def process(record)
        @output << record
      end

      # Writes all output records out in a single batch write with a
      # batch delimiter appended to the end.
      #
      # All output records are newline delimited within the batch.
      #
      # The batch itself includes a newline character after the final
      # batch delimiter.
      #
      # $stdout is flushed after the write and accumulated outputs are
      # cleared.
      #
      # @see #process
      def write_output
        @output.each do |record|
          $stdout.write(record)
          $stdout.write("\n")
        end
        $stdout.write(settings.delimiter)
        $stdout.write("\n")
        $stdout.flush
        @output.clear
      end

    end
  end
end
```
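The batching behavior of `#process` and `#write_output` can be exercised in isolation. A sketch with a stubbed-out driver (the `FakeBoltDriver` class, its plain `delimiter` argument, and the injected IO are inventions for the demonstration; the real class wires into EventMachine and reads its delimiter from settings):

```ruby
require 'stringio'

# Stand-in reproducing BoltDriver's batching: accumulate records,
# then flush them all with a trailing delimiter, clearing the buffer.
class FakeBoltDriver
  def initialize(delimiter = '---')
    @delimiter = delimiter
    @output    = []
  end

  # Buffer a record instead of writing it immediately.
  def process(record)
    @output << record
  end

  # Same shape as BoltDriver#write_output, writing to an injected IO.
  def write_output(io = $stdout)
    @output.each do |record|
      io.write(record)
      io.write("\n")
    end
    io.write(@delimiter)
    io.write("\n")
    io.flush
    @output.clear
  end
end

driver = FakeBoltDriver.new
driver.process('2')
driver.process('3')
buffer = StringIO.new
driver.write_output(buffer)
buffer.string # => "2\n3\n---\n"

# An empty batch still emits the delimiter:
driver.write_output(buffer)
buffer.string # => "2\n3\n---\n---\n"
```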
data/lib/wukong-storm/bolt_runner.rb
ADDED
@@ -0,0 +1,44 @@

```ruby
require_relative('bolt_driver')

module Wukong
  module Storm

    # Implements the runner for wu-bolt.
    class StormBoltRunner < Wukong::Local::LocalRunner

      include Logging

      usage "PROCESSOR|FLOW"

      description <<-EOF.gsub(/^ {8}/,'')
        wu-bolt is a commandline tool for running Wukong dataflows as
        bolts within a Storm topology.

        wu-bolt behaves like wu-local except it adds a batch
        terminator after the output generated from each input record.
        This allows Storm to differentiate "no output" from "no output
        yet", important for back-propagating acks.

        For example

          $ echo "adds a terminator" | wu-bolt tokenizer.rb
          adds
          a
          terminator
          ---
          $ echo "" | wu-bolt tokenizer.rb
          ---

        If there is ever a suppressed error in processing, or a
        skipped record for any reason, wu-bolt will still output the
        batch terminator.
      EOF

      # :nodoc:
      def driver
        BoltDriver
      end

    end
  end
end
```
data/lib/wukong-storm/storm_invocation.rb
ADDED
@@ -0,0 +1,386 @@

```ruby
require 'shellwords'

module Wukong
  module Storm

    # This module defines several methods that generate command lines
    # that interact with Storm using the `storm` program.
    module StormInvocation

      #
      # == Topology Structure & Properties ==
      #

      # Return the name of the Storm topology from the given settings
      # and/or commandline args.
      #
      # @return [String] the name of the Storm topology
      def topology_name
        settings[:name] || dataflow_name
      end

      # Name of the Wukong dataflow to be launched.
      #
      # Obtained from either the first non-option argument passed to
      # `wu-storm` or the `--run` option.
      #
      # @return [String]
      def dataflow_name
        args.first || settings[:run]
      end

      # The input URI for the topology.  Will determine the Trident
      # spout that will be used.
      #
      # @return [URI]
      def input_uri
        @input_uri ||= URI.parse(settings[:input])
      end

      # Does this topology read from Kafka?
      #
      # @return [true, false]
      def kafka_input?
        ! blob_input?
      end

      # Does this topology read from a filesystem?
      #
      # @return [true, false]
      def blob_input?
        s3_input? || file_input?
      end

      # Does this topology read from Amazon's S3?
      #
      # @return [true, false]
      def s3_input?
        input_uri.scheme == 's3'
      end

      # Does this topology read from a local filesystem?
      #
      # @return [true, false]
      def file_input?
        input_uri.scheme == 'file'
      end

      # The output URI for the topology.  Will determine the Trident
      # state that will be used.
      #
      # @return [URI]
      def output_uri
        @output_uri ||= URI.parse(settings[:output])
      end

      # Does this topology write to Kafka?
      #
      # @return [true, false]
      def kafka_output?
        true # only option right now
      end

      #
      # == Interaction w/Storm ==
      #

      # Generates a commandline that can be used to launch a new Storm
      # topology based on the given dataflow, input and output topics,
      # and settings.
      #
      # @return [String]
      def storm_launch_commandline
        [
          storm_runner,
          "jar #{wukong_topology_submitter_jar}",
          fully_qualified_class_name,
          native_storm_options,
          storm_topology_options,
        ].flatten.compact.join(" \t\\\n  ")
      end

      # Generates a commandline that can be used to kill a running
      # Storm topology based on the given topology name.
      #
      # @return [String]
      def storm_kill_commandline
        "#{storm_runner} kill #{topology_name} #{storm_kill_options} > /dev/null 2>&1"
      end

      # Generates the commandline that will be used to launch wu-bolt
      # within each bolt of the Storm topology.
      #
      # @return [String]
      def wu_bolt_commandline
        return settings[:bolt_command] if settings[:bolt_command]
        [settings[:command_prefix], 'wu-bolt', dataflow_name, non_wukong_storm_params_string].compact.map(&:to_s).reject(&:empty?).join(' ')
      end

      # Return the path to the `storm` program.
      #
      # Will pay attention to `--storm_runner` and `--storm_home`
      # options.
      #
      # @return [String]
      def storm_runner
        explicit_runner = settings[:storm_runner]
        home_runner     = File.join(settings[:storm_home], 'bin/storm')
        default_runner  = 'storm'
        case
        when explicit_runner          then explicit_runner
        when File.exist?(home_runner) then home_runner
        else default_runner
        end
      end

      # Path to the Java jar file containing the submitter class.
      #
      # @return [String]
      #
      # @see #fully_qualified_class_name
      def wukong_topology_submitter_jar
        File.expand_path("wukong-storm.jar", File.dirname(__FILE__))
      end

      # The default Java Submitter class.
      #
      # @see #fully_qualified_class_name
      TOPOLOGY_SUBMITTER_CLASS = "com.infochimps.wukong.storm.TopologySubmitter"

      # Returns the fully qualified name of the Java submitter class.
      #
      # @see TOPOLOGY_SUBMITTER_CLASS
      def fully_qualified_class_name
        TOPOLOGY_SUBMITTER_CLASS
      end

      # Return Java `-D` options constructed from mapping the passed
      # in "friendly" options (`--timeout`) to native Storm options
      # (`topology.message.timeout.secs`).
      #
      # @return [Array<String>] an array of each `-D` option
      def native_storm_options
        settings.params_with(:storm).map do |option, value|
          defn = settings.definition_of(option, :description)
          [defn, settings[option.to_sym]]
        end.map { |option, value| java_option(option, value) }
      end

      # Return Java `-D` options for Wukong-specific options.
      #
      # @return [Array<String>]
      def storm_topology_options
        (services_options + topology_options + spout_options + dataflow_options + state_options).reject do |pair|
          key, value = pair
          value.nil? || value.to_s.strip.empty?
        end.map { |pair| java_option(*pair) }.sort
      end

      # Return Java `-D` option key-value pairs related to services
      # used by the topology.
      #
      # @return [Array<Array>] an Array of key-value pairs
      def services_options
        [
          ["wukong.kafka.hosts",     settings[:kafka_hosts]],
          ["wukong.zookeeper.hosts", settings[:zookeeper_hosts]],
        ]
      end

      # Return Java `-D` option key-value pairs related to the overall
      # topology.
      #
      # @return [Array<Array>] an Array of key-value pairs
      def topology_options
        [
          ["wukong.topology", topology_name],
        ]
      end

      # Return Java `-D` option key-value pairs related to the
      # topology's spout.
      #
      # @return [Array<Array>] an Array of key-value pairs
      def spout_options
        case
        when blob_input?
          blob_spout_options + (s3_input? ? s3_spout_options : file_spout_options)
        else
          kafka_spout_options
        end
      end

      # Return Java `-D` option key-value pairs related to the
      # topology's spout if it is reading from a generic filesystem.
      #
      # @return [Array<Array>] an Array of key-value pairs
      def blob_spout_options
        [
          ["wukong.input.type", "blob"],
        ].tap do |so|
          so << ["wukong.input.blob.marker", settings[:offset]] if settings[:offset]
          so << case
                when settings[:from_beginning]
                  ["wukong.input.blob.start", "EARLIEST"]
                when settings[:from_end]
                  ["wukong.input.blob.start", "LATEST"]
                when settings[:offset]
                  ["wukong.input.blob.start", "EXPLICIT"]
                else
                  ["wukong.input.blob.start", "RESUME"]
                end
        end
      end

      # Return Java `-D` option key-value pairs related to the
      # topology's spout if it is reading from S3.
      #
      # @return [Array<Array>] an Array of key-value pairs
      def s3_spout_options
        [
          ["wukong.input.blob.type",        "s3"],
          ["wukong.input.blob.path",        input_uri.path.gsub(%r{^/},'')],
          ["wukong.input.blob.s3_bucket",   input_uri.host],
          ["wukong.input.blob.aws_key",     settings[:aws_key]],
          ["wukong.input.blob.aws_secret",  settings[:aws_secret]],
          ["wukong.input.blob.s3_endpoint", s3_endpoint]
        ]
      end

      # The AWS endpoint used to communicate with AWS for S3 access.
      #
      # Determined by the AWS region the S3 bucket was declared to be
      # in.
      #
      # @see http://docs.aws.amazon.com/general/latest/gr/rande.html#s3_region
      def s3_endpoint
        case settings[:aws_region]
        when 'us-east-1'       then 's3.amazonaws.com'
        when 'us-west-1'       then 's3-us-west-1.amazonaws.com'
        when 'us-west-2'       then 's3-us-west-2.amazonaws.com'
        when /EU/, 'eu-west-1' then 's3-eu-west-1.amazonaws.com'
        when 'ap-southeast-1'  then 's3-ap-southeast-1.amazonaws.com'
        when 'ap-southeast-2'  then 's3-ap-southeast-2.amazonaws.com'
        when 'ap-northeast-1'  then 's3-ap-northeast-1.amazonaws.com'
        when 'sa-east-1'       then 's3-sa-east-1.amazonaws.com'
        end
      end

      # Return Java `-D` option key-value pairs related to the
      # topology's spout if it is reading from a local file.
      #
      # @return [Array<Array>] an Array of key-value pairs
      def file_spout_options
        [
          ["wukong.input.blob.type", "file"],
          ["wukong.input.blob.path", input_uri.path],
        ]
      end

      # Return Java `-D` option key-value pairs related to the
      # topology's spout if it is reading from Kafka.
      #
      # @return [Array<Array>] an Array of key-value pairs
      def kafka_spout_options
        [
          ["wukong.input.type",             'kafka'],
          ["wukong.input.kafka.topic",      settings[:input]],
          ["wukong.input.kafka.partitions", settings[:kafka_partitions]],
          ["wukong.input.kafka.batch",      settings[:kafka_batch]],

          ["wukong.input.parallelism",      settings[:input_parallelism]],
          case
          when settings[:from_beginning]
            ["wukong.input.kafka.offset", "-2"]
          when settings[:from_end]
            ["wukong.input.kafka.offset", "-1"]
          when settings[:offset]
            ["wukong.input.kafka.offset", settings[:offset]]
          else
            # Do *not* set anything and the spout will attempt to
            # resume and, finding no prior offset, will start from the
            # end, as though we'd passed "-1"
          end
        ]
      end
```
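The offset logic in `kafka_spout_options` maps the friendly flags onto Kafka's sentinel offsets (`-2` for the earliest available offset, `-1` for the latest). Extracted as a standalone sketch (the helper name is hypothetical; the branch order matches the method above, so `--from_beginning` wins over `--from_end`, which wins over an explicit `--offset`):

```ruby
# Sketch of the offset selection above: returns the
# ["wukong.input.kafka.offset", value] pair, or nil to let the spout
# resume from whatever offset it last stored.
def kafka_offset_option(settings)
  case
  when settings[:from_beginning] then ["wukong.input.kafka.offset", "-2"]
  when settings[:from_end]       then ["wukong.input.kafka.offset", "-1"]
  when settings[:offset]         then ["wukong.input.kafka.offset", settings[:offset]]
  end # else nil: resume from the prior offset (or the end, if none stored)
end

kafka_offset_option(from_beginning: true) # => ["wukong.input.kafka.offset", "-2"]
kafka_offset_option(offset: "12345")      # => ["wukong.input.kafka.offset", "12345"]
kafka_offset_option({})                   # => nil
```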
```ruby
      # Return Java `-D` option key-value pairs related to the Wukong
      # dataflow run by the topology.
      #
      # @return [Array<Array>] an Array of key-value pairs
      def dataflow_options
        [
          ["wukong.directory",   Dir.pwd],
          ["wukong.dataflow",    dataflow_name],
          ["wukong.command",     wu_bolt_commandline],
          ["wukong.parallelism", settings[:parallelism]],
        ].tap do |opts|
          opts << ["wukong.environment", settings[:environment]] if settings[:environment]
        end
      end

      # Return Java `-D` option key-value pairs related to the final
      # state used by the topology.
      #
      # @return [Array<Array>] an Array of key-value pairs
      def state_options
        case
        when kafka_output?
          kafka_state_options
        end
      end

      # Return Java `-D` option key-value pairs related to the final
      # state used by the topology when it is writing to Kafka.
      #
      # @return [Array<Array>] an Array of key-value pairs
      def kafka_state_options
        [
          ["wukong.output.kafka.topic", settings[:output]],
        ]
      end

      protected

      # Return a String of options used when attempting to kill a
      # running Storm topology.
      #
      # @return [String]
      def storm_kill_options
        "-w #{settings[:wait]}"
      end

      # Format the given `option` and `value` into a Java option
      # (`-D`).
      #
      # @param [Object] option
      # @param [Object] value
      # @return [String]
      def java_option option, value
        return unless value
        return if value.to_s.strip.empty?
        "-D#{option}=#{Shellwords.escape(value.to_s)}"
      end

      # Parameters that should be passed on to subprocesses.
      #
      # @return [Configliere::Param]
      def params_to_pass
        settings
      end

      # Return a String stripped of any `wu-storm`-specific params but
      # still including any other params.
      #
      # @return [String]
      def non_wukong_storm_params_string
        params_to_pass.reject do |param, val|
          (params_to_pass.definition_of(param, :wukong_storm) || params_to_pass.definition_of(param, :wukong))
        end.map do |param, val|
          "--#{param}=#{Shellwords.escape(val.to_s)}"
        end.join(" ")
      end

    end
  end
end
```
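Taken together, `#java_option` and `#storm_launch_commandline` turn flat settings into a `storm jar` invocation. The escaping step can be checked on its own with a standalone copy of `java_option`'s logic (the sample option values below are made up for illustration):

```ruby
require 'shellwords'

# Standalone copy of the java_option formatting above: skip nil or
# blank values, shell-escape the rest into a -D option.
def java_option(option, value)
  return unless value
  return if value.to_s.strip.empty?
  "-D#{option}=#{Shellwords.escape(value.to_s)}"
end

java_option('topology.workers', 4)
# => "-Dtopology.workers=4"
java_option('wukong.command', 'wu-bolt upcaser.rb')
# => "-Dwukong.command=wu-bolt\\ upcaser.rb"  (the space is escaped)
java_option('topology.debug', '')
# => nil (blank values are dropped from the commandline)
```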