wukong-storm 0.1.1 → 0.2.0

data/.gitignore CHANGED
@@ -57,3 +57,4 @@ away
57
57
  .rbx
58
58
  Gemfile.lock
59
59
  Backup*of*.numbers
60
+ target/*
data/.rspec CHANGED
@@ -1,3 +1,2 @@
1
- --format documentation
1
+ --format progress
2
2
  --color
3
- --drb
data/Gemfile CHANGED
@@ -1,4 +1,4 @@
1
- source :rubygems
1
+ source 'https://rubygems.org'
2
2
 
3
3
  gemspec
4
4
 
data/README.md CHANGED
@@ -1,31 +1,187 @@
1
- # Wukong Storm
1
+ # Wukong-Storm
2
2
 
3
- ## Usage
3
+ The Storm plugin for Wukong lets you run <a
4
+ href="http://github.com/infochimps-labs/wukong/tree/3.0.0">Wukong</a>
5
+ processors and dataflows as <a
6
+ href="https://github.com/nathanmarz/storm">Storm</a> topologies that read from and write to <a href="http://kafka.apache.org/">Kafka</a>.
4
7
 
5
- The Wukong Storm plugin is very basic at the moment. It functions entirely over STDIN and STDOUT. Taken from the `wu-storm` executable:
8
+ Before you use Wukong-Storm to develop, test, and write your Storm
9
+ jobs, you might want to read about <a
10
+ href="http://github.com/infochimps-labs/wukong/tree/3.0.0">Wukong</a>,
11
+ write some <a
12
+ href="http://github.com/infochimps-labs/wukong/tree/3.0.0#writing-simple-processors">simple
13
+ processors</a>, and read about some of Storm's <a
14
+ href="https://github.com/nathanmarz/storm/wiki/Concepts">core
15
+ concepts</a>.
6
16
 
17
+ You might also want to check out some other projects which enrich the
18
+ Wukong and Storm experience:
19
+
20
+ * <a href="http://github.com/infochimps-labs/wukong-hadoop">wukong-hadoop</a>: Run Wukong processors and dataflows as mappers and/or reducers within the Hadoop framework. Model jobs locally before you run them.
21
+ * <a href="http://github.com/infochimps-labs/wukong-load">wukong-load</a>: Load the output data from your local Wukong jobs and flows into a variety of different data stores.
22
+ * <a href="http://github.com/infochimps-labs/wukong-deploy">wukong-deploy</a>: Orchestrate Wukong and other wu-tools together to support an application running on the Infochimps Platform.
23
+
24
+ <a name="installation"></a>
25
+ ## Installation & Setup
26
+
27
+ Wukong-Storm can be installed as a RubyGem:
28
+
29
+ ```
30
+ $ sudo gem install wukong-storm
7
31
  ```
8
- usage: wu-storm PROCESSOR|FLOW [...--param=value...]
9
32
 
10
- wu-storm is a commandline tool for running Wukong processors and flows in
11
- a storm or trident topology.
33
+ If you actually want to run your dataflows as functioning Storm
34
+ topologies reading/writing to/from Kafka, you'll of course need access
35
+ to Storm and Kafka installations. <a
36
+ href="http://github.com/infochimps-labs/ironfan">Ironfan</a> is a
37
+ great tool for building and managing Storm clusters and other
38
+ distributed infrastructure quickly and easily.
39
+
40
+ To run Storm jobs through Wukong-Storm, you'll need to move your
41
+ Wukong code to each worker of the Storm cluster, install Wukong-Storm
42
+ on each, and log in and launch your job from one of them. Ironfan
43
+ again helps with configuring this.
12
44
 
13
- wu-storm operates over STDIN and STDOUT and has a one-to-one message guarantee.
14
- For example, when using an identity processor, wu-storm, given an event 'foo', will return
15
- 'foo\n|\n'. The '|' character is the specified End-Of-File delimiter.
45
+ <a name="anatomy"></a>
46
+ ## Anatomy of a running topology
16
47
 
17
- If there is ever a suppressed error in processing, or a skipped record for any reason,
18
- wu-storm will still respond with a '|\n', signifying an empty return event.
48
+ Storm defines the concept of a **topology**. A topology contains
49
+ spouts and bolts. A **spout** is a source of data. A **bolt**
50
+ processes data. Bolts can be connected to each other and to spouts in
51
+ arbitrary ways.
19
52
 
20
- If there are multiple messages that have resulted from a single event, wu-storm will return
21
- them newline separated, followed by the delimite, e.g. 'foo\nbar\nbaz\n|\n'.
53
+ Topologies are submitted to Storm's Nimbus but run within a Storm
54
+ supervisor. Each supervisor can dedicate a certain number of
55
+ **workers** to a topology. Within each worker, **parallelism**
56
+ controls the number of threads the worker assigns to the topology.
22
57
 
58
+ Wukong-Storm runs each Wukong dataflow as a single bolt within a
59
+ single topology. Data is passed to this bolt over STDIN and collected
60
+ over STDOUT, similar to the way <a
61
+ href="http://hadoop.apache.org/docs/r0.15.2/streaming.html">Hadoop
62
+ streaming</a> operates.
23
63
 
24
- Params:
25
- -t, --delimiter=String The EOF specifier when returning events [Default: |]
26
- -r, --run=String Name of the processor or dataflow to use. Defaults to basename of the given path
64
+ This topology is hooked up to a
65
+ `storm.kafka.trident.OpaqueTridentKafkaSpout` (part of
66
+ [storm-contrib](https://github.com/nathanmarz/storm-contrib)) which
67
+ reads from a single input topic within Kafka.
68
+
69
+ Output records are written to a default Kafka topic but this can be
70
+ overridden on a per-record basis.
71
+
72
+ <a name="protocol"></a>
73
+ ### Communication protocol
74
+
75
+ A Wukong dataflow launched within Storm runs as a single bolt (see
76
+ [`com.infochimps.wukong.storm.SubprocessFunction`](https://github.com/infochimps-labs/wukong-storm/blob/master/src/main/java/com/infochimps/wukong/storm/SubprocessFunction.java)).
77
+ This bolt works by launching an arbitrary command-line and sending it
78
+ records over STDIN and reading its output over STDOUT. The
79
+ `SubprocessFunction` class expects whatever command it launched to
80
+ obey a protocol under which the output after **each** input consists
81
+ of each output record followed by a newline, with the full batch of
82
+ output records followed by a batch terminator (default: `---`) then
83
+ another newline.
84
+
85
+ Wukong-Storm comes with a command `wu-bolt` which works very similarly
86
+ to `wu-local` but implements this protocol. Here's an example of
87
+ using `wu-bolt` directly with a processor:
88
+
89
+ ```
90
+ $ echo 2 | wu-bolt prime_factorizer.rb
91
+ 2
92
+ ---
93
+ $ echo 12 | wu-bolt prime_factorizer.rb
94
+ 2
95
+ 2
96
+ 3
97
+ ---
98
+ $ echo 19 | wu-bolt prime_factorizer.rb
99
+ ---
27
100
  ```
28
101
 
29
- ## TODO
102
+ Notice that in the last example, the presence of the batch delimiter
103
+ after each input record makes it easy to tell the difference between
104
+ "no output records" and "no output records yet" which, over
105
+ STDIN/STDOUT, is rather hard to tell otherwise.
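
To make the protocol concrete from the reading side, here is a minimal Ruby sketch of how a consumer could split a bolt's STDOUT into batches. It is only an illustration: `read_batch` is a hypothetical helper, not part of Wukong-Storm, and the real consumer is the Java `SubprocessFunction` class. The `---` terminator is the default described above.

```ruby
require 'stringio'

# Hypothetical helper (not part of Wukong-Storm): reads one batch of
# output records from `io`, stopping at the batch terminator line.
# Returns the records, or nil if the stream ended before a terminator.
def read_batch(io, delimiter = '---')
  records = []
  while (line = io.gets)
    line = line.chomp
    return records if line == delimiter
    records << line
  end
  nil # stream ended mid-batch: "no output yet", not "no output"
end

# Simulated bolt output: one batch with three records, then an empty batch.
output = StringIO.new("2\n2\n3\n---\n---\n")
p read_batch(output) # => ["2", "2", "3"]
p read_batch(output) # => []
p read_batch(output) # => nil
```

An empty array means the bolt finished the input and emitted nothing, while `nil` means no terminator has arrived yet. That distinction is what the delimiter buys.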
106
+
107
+ ## Running a dataflow
108
+
109
+ ### A simple processor
110
+
111
+ Assuming you have correctly installed Wukong-Storm, Storm, Kafka,
112
+ Zookeeper, &c., and you have defined a simple dataflow (or in this
113
+ case, just a single processor) like this:
114
+
115
+ ```ruby
116
+ # in upcaser.rb
117
+ Wukong.processor(:upcaser) do
118
+ def process line
119
+ yield line.upcase
120
+ end
121
+ end
122
+ ```
123
+
124
+ Then you can launch it directly into Storm:
125
+
126
+ ```
127
+ $ wu-storm upcaser.rb --input=some_input_topic --output=some_output_topic
128
+ ```
129
+
130
+ If a topology named `upcaser` already exists, you'll get an error.
131
+ Add the `--rm` flag to first kill the running topology before
132
+ launching the new one:
133
+
134
+ ```
135
+ $ wu-storm upcaser.rb --input=some_input_topic --output=some_output_topic --rm
136
+ ```
137
+
138
+ The default amount of time to wait for the topology to die is 300
139
+ seconds (5 minutes), just like the `storm kill` command (which is used
140
+ under the hood). When debugging a topology in development, it's
141
+ helpful to add `--wait=1` to immediately kill the topology.
142
+
143
+ See exactly what would happen behind the scenes by adding the `--dry_run`
144
+ flag which will print commands and not execute them:
145
+
146
+ ```
147
+ $ wu-storm upcaser.rb --input=some_input_topic --output=some_output_topic --rm --dry_run
148
+ ```
149
+
150
+ ### A more complicated example
151
+
152
+ Say you have a dataflow:
153
+
154
+ ```ruby
155
+ # in my_flow.rb
156
+ Wukong.dataflow(:my_flow) do
157
+ my_parser | does_something | then_something_else | to_json
158
+ end
159
+ ```
160
+
161
+ You can launch it using a different topology name as well as target
162
+ arbitrary locations for your Zookeeper, Kafka, and Storm servers:
163
+
164
+ ```
165
+ $ wu-storm my_flow.rb --name=my_flow_attempt_3 --zookeeper_hosts=10.121.121.121,10.122.122.122 --kafka_hosts=10.123.123.123 --nimbus_host=10.124.124.124 --input=some_input_topic --output=some_output_topic
166
+ ```
167
+
168
+ ### Running non-Wukong or non-Ruby code
169
+
170
+ You can also use Wukong-Storm as a harness to run non-Wukong or
171
+ non-Ruby code. As long as you can specify a command-line to run
172
+ which supports the [communication protocol](#protocol), then you can
173
+ run it with `wu-storm`:
174
+
175
+ ```
176
+ $ wu-storm --bolt_command='my_cmd --some-option=value -af -q 3' --input=some_input_topic --output=some_output_topic
177
+ ```
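
As a rough sketch of what such a command has to do, the Ruby below implements the protocol without using Wukong at all. Everything in it is an assumption made for illustration: `run_bolt` is a hypothetical name standing in for `my_cmd`, splitting each line into words is an arbitrary choice of processing, and `---` is the default terminator.

```ruby
require 'stringio'

# Hypothetical stand-in for `my_cmd`: for each input line, emit zero or
# more output records (one per line), then the batch terminator, then
# flush so the parent process sees the complete batch immediately.
def run_bolt(input, output, delimiter = '---')
  input.each_line do |line|
    line.chomp.split.each { |word| output.puts(word) } # one record per word
    output.puts(delimiter)
    output.flush
  end
end

# Demonstration with in-memory streams; a real command would instead
# call run_bolt($stdin, $stdout).
out = StringIO.new
run_bolt(StringIO.new("hello storm\n\n"), out)
print out.string
# Prints:
# hello
# storm
# ---
# ---
```

The blank second input line still produces a terminator, so the caller can tell that the record was handled and yielded nothing.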
178
+
179
+ ### Scaling options
180
+
181
+ Storm provides several options for scaling up or down a topology.
182
+ Wukong-Storm makes them accessible at launch time via the following
183
+ options:
30
184
 
31
- The configuration file has __all__ of the options for storm listed. Slowly translating into real Configliere options.
185
+ * `--workers` specifies the number of worker processes (each occupying a supervisor "slot") for the topology. Defaults to 1.
186
+ * `--input_parallelism` specifies the number of threads within the spout reading from Kafka within each worker. Defaults to 1.
187
+ * `--parallelism` specifies the number of threads within the bolt running Wukong code within each worker. Defaults to 1.
data/bin/wu-bolt ADDED
@@ -0,0 +1,4 @@
1
+ #!/usr/bin/env ruby
2
+
3
+ require 'wukong-storm'
4
+ Wukong::Storm::StormBoltRunner.run
data/lib/wukong-storm.rb CHANGED
@@ -12,15 +12,54 @@ module Wukong
12
12
  # @param [Configliere::Param] settings the settings to configure
13
13
  # @param [String] program the name of the currently executing program
14
14
  def self.configure settings, program
15
- return unless program == 'wu-storm'
16
- settings.define :zookeepers_servers, description: 'storm.zookeeper.servers'
17
- settings.define :zookeepers_port, description: 'storm.zookeeper.port'
18
- settings.define :local_dir, description: 'storm.local.dir'
19
- settings.define :scheduler, description: 'storm.scheduler'
20
- settings.define :cluster_mode, description: 'storm.cluster.mode'
21
- settings.define :local_hostname, description: 'storm.local.hostname'
22
- settings.define :run, description: 'Name of the processor or dataflow to use. Defaults to basename of the given path', flag: 'r'
23
- settings.define :delimiter, description: 'Emitted as a single record to mark the end of the batch ', default: '---', flag: 't'
15
+ case program
16
+ when 'wu-bolt'
17
+ settings.define :run, description: 'Name of the processor or dataflow to use. Defaults to basename of the given path', flag: 'r'
18
+ settings.define :delimiter, description: 'Emitted as a single record to mark the end of the batch ', default: 'X', flag: 't'
19
+ when 'wu-storm'
20
+ settings.define :name, wukong_storm: true, description: "Name for the launched topology"
21
+ settings.define :command_prefix, wukong_storm: true, description: "Prefix to insert before all Wukong commands"
22
+ settings.define :bolt_command, wukong_storm: true, description: "Command-line to run within the spawned Storm bolt"
23
+ settings.define :dry_run, wukong_storm: true, description: "Echo commands that will be run, but don't run them", type: :boolean, default: false
24
+ settings.define :wait, wukong_storm: true, description: "How many seconds to wait when killing a topology", type: Integer, default: 300
25
+ settings.define :rm, wukong_storm: true, description: "Will kill any running topology of the same name before launching", type: :boolean, default: false
26
+ settings.define :delimiter, wukong_storm: true, description: "Batch delimiter to use with wu-bolt"
27
+ settings.define :parallelism, wukong_storm: true, description: "Parallelism hint for wu-bolt", default: 1
28
+
29
+ settings.define :input, wukong_storm: true, description: "Input URI for the topology. The scheme of the URI determines the type of spout."
30
+ settings.define :input_parallelism, wukong_storm: true, description: "Parallelism (number of simultaneous threads) reading input. Only used by some spouts.", default: 1
31
+ settings.define :offset, wukong_storm: true, description: "Offset to use when starting to read from input. Interpreted in a spout-dependent way."
32
+
33
+ settings.define :from_beginning, wukong_storm: true, description: "Start reading from the beginning of the input.", type: :boolean, default: false
34
+ settings.define :from_end, wukong_storm: true, description: "Start reading from the end of the input.", type: :boolean, default: false
35
+ settings.define :resume, wukong_storm: true, description: "Start reading from where the topology left off. This is the default behavior.", type: :boolean, default: true
36
+
37
+ settings.define :kafka_partitions, wukong_storm: true, description: "Number of Kafka partitions on the input topic", default: 1
38
+ settings.define :kafka_batch, wukong_storm: true, description: "Batch size when reading from input topic (bytes)", default: 1_048_576
39
+
40
+ settings.define :aws_key, wukong_storm: true, description: "AWS access key. (Required for S3 input)"
41
+ settings.define :aws_secret, wukong_storm: true, description: "AWS secret key. (Required for S3 input)"
42
+ settings.define :aws_region, wukong_storm: true, description: "AWS region, one of: us-east-1, us-west-[1,2], eu-west-1, ap-southeast-[1,2], ap-northeast-1, sa-east-1. (Required for S3 input)", default: 'us-east-1'
43
+
44
+ settings.define :output, wukong_storm: true, description: "Output URI for the topology. The scheme of the URI determines the type of state used."
45
+
46
+ settings.define :debug, wukong_storm: true, storm: true, description: 'topology.debug'
47
+ settings.define :optimize, wukong_storm: true, storm: true, description: 'topology.optimize'
48
+ settings.define :timeout, wukong_storm: true, storm: true, description: 'topology.message.timeout.secs'
49
+ settings.define :workers, wukong_storm: true, storm: true, description: 'topology.workers'
50
+ settings.define :worker_opts, wukong_storm: true, storm: true, description: 'topology.worker.childopts'
51
+ settings.define :ackers, wukong_storm: true, storm: true, description: 'topology.acker.executors'
52
+ settings.define :sample_rate, wukong_storm: true, storm: true, description: 'topology.stats.sample.rate'
53
+
54
+ settings.define :nimbus_host, wukong_storm: true, storm: true, description: 'nimbus.host', default: 'localhost'
55
+ settings.define :nimbus_port, wukong_storm: true, storm: true, description: 'nimbus.thrift.port', default: 6627
56
+ settings.define :kafka_hosts, wukong_storm: true, description: "Comma-separated list of Kafka hosts", default: 'localhost'
57
+ settings.define :zookeeper_hosts, wukong_storm: true, description: "Comma-separated list of Zookeeper hosts", default: 'localhost'
58
+
59
+ settings.define :storm_home, wukong_storm: true, description: "Path to Storm installation", env_var: "STORM_HOME", default: "/usr/lib/storm"
60
+ settings.define :storm_runner, wukong_storm: true, description: "Path to Storm executable. Use this for non-standard Storm installations"
61
+
62
+ end
24
63
  end
25
64
 
26
65
  # Boots the Wukong::Storm plugin.
@@ -33,4 +72,5 @@ module Wukong
33
72
  end
34
73
  end
35
74
 
36
- require 'wukong-storm/runner'
75
+ require 'wukong-storm/storm_runner'
76
+ require 'wukong-storm/bolt_runner'
@@ -0,0 +1,81 @@
1
+ module Wukong
2
+ module Storm
3
+
4
+ # Modifies the behavior of Wukong::Local::StdioDriver by appending
5
+ # a batch delimiter after each set of output records, including
6
+ # when there are 0 output records or if an error occurs.
7
+ class BoltDriver < Local::StdioDriver
8
+
9
+ include Logging
10
+
11
+ #
12
+ # == Startup ==
13
+ #
14
+
15
+ # Override the behavior of StdioDriver by initializing an empty
16
+ # array of output records.
17
+ def initialize(label, settings)
18
+ super(label, settings)
19
+ @output = []
20
+ end
21
+
22
+ # Do *not* sync $stdout as in the StdioDriver.
23
+ def setup()
24
+ end
25
+
26
+ #
27
+ # == Reading Input ==
28
+ #
29
+
30
+ # Called by EventMachine framework after successfully reading a
31
+ # line from $stdin.
32
+ #
33
+ # Relies on StdioDriver, but calls #write_output afterwards to
34
+ # ensure that a delimiter is also sent.
35
+ #
36
+ # @param [String] line
37
+ def receive_line line
38
+ super(line)
39
+ write_output
40
+ end
41
+
42
+ #
43
+ # == Handling Output ==
44
+ #
45
+
46
+ # Don't write the record to $stdout, but store it in an array of
47
+ # output records instead.
48
+ #
49
+ # @param [Object] record
50
+ #
51
+ # @see #write_output
52
+ def process(record)
53
+ @output << record
54
+ end
55
+
56
+ # Writes all output records out in a single batch write with a
57
+ # batch delimiter appended to the end.
58
+ #
59
+ # All output records are newline delimited within the batch.
60
+ #
61
+ # The batch itself includes a newline character after the final
62
+ # batch delimiter.
63
+ #
64
+ # $stdout is flushed after the write and accumulated outputs are
65
+ # cleared.
66
+ #
67
+ # @see #process
68
+ def write_output
69
+ @output.each do |record|
70
+ $stdout.write(record)
71
+ $stdout.write("\n")
72
+ end
73
+ $stdout.write(settings.delimiter)
74
+ $stdout.write("\n")
75
+ $stdout.flush
76
+ @output.clear
77
+ end
78
+
79
+ end
80
+ end
81
+ end
@@ -0,0 +1,44 @@
1
+ require_relative('bolt_driver')
2
+
3
+ module Wukong
4
+ module Storm
5
+
6
+ # Implements the runner for wu-bolt.
7
+ class StormBoltRunner < Wukong::Local::LocalRunner
8
+
9
+ include Logging
10
+
11
+ usage "PROCESSOR|FLOW"
12
+
13
+ description <<-EOF.gsub(/^ {8}/,'')
14
+ wu-bolt is a commandline tool for running Wukong dataflows as
15
+ bolts within a Storm topology.
16
+
17
+ wu-bolt behaves like wu-local except it adds a batch
18
+ terminator after the output generated from each input record.
19
+ This allows Storm to differentiate "no output" from "no output
20
+ yet", important for back-propagating acks.
21
+
22
+ For example
23
+
24
+ $ echo "adds a terminator" | wu-bolt tokenizer.rb
25
+ adds
26
+ a
27
+ terminator
28
+ ---
29
+ $ echo "" | wu-bolt tokenizer.rb
30
+ ---
31
+
32
+ If there is ever a suppressed error in processing, or a
33
+ skipped record for any reason, wu-bolt will still output the
34
+ batch terminator.
35
+ EOF
36
+
37
+ # :nodoc:
38
+ def driver
39
+ BoltDriver
40
+ end
41
+
42
+ end
43
+ end
44
+ end
@@ -0,0 +1,386 @@
1
+ require 'shellwords'
2
+
3
+ module Wukong
4
+ module Storm
5
+
6
+ # This module defines several methods that generate command lines
7
+ # that interact with Storm using the `storm` program.
8
+ module StormInvocation
9
+
10
+ #
11
+ # == Topology Structure & Properties
12
+ #
13
+
14
+ # Return the name of the Storm topology from the given settings
15
+ # and/or commandline args.
16
+ #
17
+ # @return [String] the name of the Storm topology
18
+ def topology_name
19
+ settings[:name] || dataflow
20
+ end
21
+
22
+ # Name of the Wukong dataflow to be launched.
23
+ #
24
+ # Obtained from either the first non-option argument passed to
25
+ # `wu-storm` or the `--run` option.
26
+ #
27
+ # @return [String]
28
+ def dataflow_name
29
+ args.first || settings[:run]
30
+ end
31
+
32
+ # The input URI for the topology. Will determine the Trident
33
+ # spout that will be used.
34
+ #
35
+ # @return [URI]
36
+ def input_uri
37
+ @input_uri ||= URI.parse(settings[:input])
38
+ end
39
+
40
+ # Does this topology read from Kafka?
41
+ #
42
+ # @return [true, false]
43
+ def kafka_input?
44
+ ! blob_input?
45
+ end
46
+
47
+ # Does this topology read from a filesystem?
48
+ #
49
+ # @return [true, false]
50
+ def blob_input?
51
+ s3_input? || file_input?
52
+ end
53
+
54
+ # Does this topology read from Amazon's S3?
55
+ #
56
+ # @return [true, false]
57
+ def s3_input?
58
+ input_uri.scheme == 's3'
59
+ end
60
+
61
+ # Does this topology read from a local filesystem?
62
+ #
63
+ # @return [true, false]
64
+ def file_input?
65
+ input_uri.scheme == 'file'
66
+ end
67
+
68
+ # The output URI for the topology. Will determine the Trident
69
+ # state that will be used.
70
+ #
71
+ # @return [URI]
72
+ def output_uri
73
+ @output_uri ||= URI.parse(settings[:output])
74
+ end
75
+
76
+ # Does this topology write to Kafka?
77
+ #
78
+ # @return [true, false]
79
+ def kafka_output?
80
+ true # only option right now
81
+ end
82
+
83
+ #
84
+ # == Interaction w/Storm ==
85
+ #
86
+
87
+ # Generates a commandline that can be used to launch a new Storm
88
+ # topology based on the given dataflow, input and output topics,
89
+ # and settings.
90
+ #
91
+ # @return [String]
92
+ def storm_launch_commandline
93
+ [
94
+ storm_runner,
95
+ "jar #{wukong_topology_submitter_jar}",
96
+ fully_qualified_class_name,
97
+ native_storm_options,
98
+ storm_topology_options,
99
+ ].flatten.compact.join("\ \t\\\n ")
100
+ end
101
+
102
+ # Generates a commandline that can be used to kill a running
103
+ # Storm topology based on the given topology name.
104
+ #
105
+ # @return [String]
106
+ def storm_kill_commandline
107
+ "#{storm_runner} kill #{topology_name} #{storm_kill_options} > /dev/null 2>&1"
108
+ end
109
+
110
+ # Generates the commandline that will be used to launch wu-bolt
111
+ # within each bolt of the Storm topology.
112
+ #
113
+ # @return [String]
114
+ def wu_bolt_commandline
115
+ return settings[:bolt_command] if settings[:bolt_command]
116
+ [settings[:command_prefix], 'wu-bolt', dataflow_name, non_wukong_storm_params_string].compact.map(&:to_s).reject(&:empty?).join(' ')
117
+ end
118
+
119
+ # Return the path to the `storm` program.
120
+ #
121
+ # Will pay attention to `--storm_runner` and `--storm_home`
122
+ # options.
123
+ #
124
+ # @return [String]
125
+ def storm_runner
126
+ explicit_runner = settings[:storm_runner]
127
+ home_runner = File.join(settings[:storm_home], 'bin/storm')
128
+ default_runner = 'storm'
129
+ case
130
+ when explicit_runner then explicit_runner
131
+ when File.exist?(home_runner) then home_runner
132
+ else default_runner
133
+ end
134
+ end
135
+
136
+ # Path to the Java jar file containing the submitter class.
137
+ #
138
+ # @return [String]
139
+ #
140
+ # @see #fully_qualified_class_name
141
+ def wukong_topology_submitter_jar
142
+ File.expand_path("wukong-storm.jar", File.dirname(__FILE__))
143
+ end
144
+
145
+ # The default Java Submitter class.
146
+ #
147
+ # @see #fully_qualified_class_name
148
+ TOPOLOGY_SUBMITTER_CLASS = "com.infochimps.wukong.storm.TopologySubmitter"
149
+
150
+ # Returns the fully qualified name of the Java submitter class.
151
+ #
152
+ # @see TOPOLOGY_SUBMITTER_CLASS
153
+ def fully_qualified_class_name
154
+ TOPOLOGY_SUBMITTER_CLASS
155
+ end
156
+
157
+ # Return Java `-D` options constructed from mapping the passed
158
+ # in "friendly" options (`--timeout`) to native, Storm options
159
+ # (`topology.message.timeout.secs`).
160
+ #
161
+ # @return [Array<String>] an array of each `-D` option
162
+ def native_storm_options
163
+ settings.params_with(:storm).map do |option, value|
164
+ defn = settings.definition_of(option, :description)
165
+ [defn, settings[option.to_sym]]
166
+ end.map { |option, value| java_option(option, value) }
167
+ end
168
+
169
+ # Return Java `-D` options for Wukong-specific options.
170
+ #
171
+ # @return [Array<String>]
172
+ def storm_topology_options
173
+ (services_options + topology_options + spout_options + dataflow_options + state_options).reject do |pair|
174
+ key, value = pair
175
+ value.nil? || value.to_s.strip.empty?
176
+ end.map { |pair| java_option(*pair) }.sort
177
+ end
178
+
179
+ # Return Java `-D` option key-value pairs related to services
180
+ # used by the topology.
181
+ #
182
+ # @return [Array<Array>] an Array of key-value pairs
183
+ def services_options
184
+ [
185
+ ["wukong.kafka.hosts", settings[:kafka_hosts]],
186
+ ["wukong.zookeeper.hosts", settings[:zookeeper_hosts]],
187
+ ]
188
+ end
189
+
190
+ # Return Java `-D` option key-value pairs related to the overall
191
+ # topology.
192
+ #
193
+ # @return [Array<Array>] an Array of key-value pairs
194
+ def topology_options
195
+ [
196
+ ["wukong.topology", topology_name],
197
+ ]
198
+ end
199
+
200
+ # Return Java `-D` option key-value pairs related to the
201
+ # topology's spout.
202
+ #
203
+ # @return [Array<Array>] an Array of key-value pairs
204
+ def spout_options
205
+ case
206
+ when blob_input?
207
+ blob_spout_options + (s3_input? ? s3_spout_options : file_spout_options)
208
+ else
209
+ kafka_spout_options
210
+ end
211
+ end
212
+
213
+ # Return Java `-D` option key-value pairs related to the
214
+ # topology's spout if it is reading from a generic filesystem.
215
+ #
216
+ # @return [Array<Array>] an Array of key-value pairs
217
+ def blob_spout_options
218
+ [
219
+ ["wukong.input.type", "blob"],
220
+ ].tap do |so|
221
+ so << ["wukong.input.blob.marker", settings[:offset]] if settings[:offset]
222
+ so << case
223
+ when settings[:from_beginning]
224
+ ["wukong.input.blob.start", "EARLIEST"]
225
+ when settings[:from_end]
226
+ ["wukong.input.blob.start", "LATEST"]
227
+ when settings[:offset]
228
+ ["wukong.input.blob.start", "EXPLICIT"]
229
+ else
230
+ ["wukong.input.blob.start", "RESUME"]
231
+ end
232
+ end
233
+ end
234
+
235
+ # Return Java `-D` option key-value pairs related to the
236
+ # topology's spout if it is reading from S3.
237
+ #
238
+ # @return [Array<Array>] an Array of key-value pairs
239
+ def s3_spout_options
240
+ [
241
+ ["wukong.input.blob.type", "s3"],
242
+ ["wukong.input.blob.path", input_uri.path.gsub(%r{^/},'')],
243
+ ["wukong.input.blob.s3_bucket", input_uri.host],
244
+ ["wukong.input.blob.aws_key", settings[:aws_key]],
245
+ ["wukong.input.blob.aws_secret", settings[:aws_secret]],
246
+ ["wukong.input.blob.s3_endpoint", s3_endpoint]
247
+ ]
248
+ end
249
+
250
+ # The AWS endpoint used to communicate with AWS for S3 access.
251
+ #
252
+ # Determined by the AWS region the S3 bucket was declared to be
253
+ # in.
254
+ #
255
+ # @see http://docs.aws.amazon.com/general/latest/gr/rande.html#s3_region
256
+ def s3_endpoint
257
+ case settings[:aws_region]
258
+ when 'us-east-1' then 's3.amazonaws.com'
259
+ when 'us-west-1' then 's3-us-west-1.amazonaws.com'
260
+ when 'us-west-2' then 's3-us-west-2.amazonaws.com'
261
+ when /EU/, 'eu-west-1' then 's3-eu-west-1.amazonaws.com'
262
+ when 'ap-southeast-1' then 's3-ap-southeast-1.amazonaws.com'
263
+ when 'ap-southeast-2' then 's3-ap-southeast-2.amazonaws.com'
264
+ when 'ap-northeast-1' then 's3-ap-northeast-1.amazonaws.com'
265
+ when 'sa-east-1' then 's3-sa-east-1.amazonaws.com'
266
+ end
267
+ end
268
+
269
+ # Return Java `-D` option key-value pairs related to the
270
+ # topology's spout if it is reading from a local file.
271
+ #
272
+ # @return [Array<Array>] an Array of key-value pairs
273
+ def file_spout_options
274
+ [
275
+ ["wukong.input.blob.type", "file"],
276
+ ["wukong.input.blob.path", input_uri.path],
277
+ ]
278
+ end
279
+
280
+ # Return Java `-D` option key-value pairs related to the
281
+ # topology's spout if it is reading from Kafka.
282
+ #
283
+ # @return [Array<Array>] an Array of key-value pairs
284
+ def kafka_spout_options
285
+ [
286
+ ["wukong.input.type", 'kafka'],
287
+ ["wukong.input.kafka.topic", settings[:input]],
288
+ ["wukong.input.kafka.partitions", settings[:kafka_partitions]],
289
+ ["wukong.input.kafka.batch", settings[:kafka_batch]],
290
+
291
+ ["wukong.input.parallelism", settings[:input_parallelism]],
292
+ case
293
+ when settings[:from_beginning]
294
+ ["wukong.input.kafka.offset", "-2"]
295
+ when settings[:from_end]
296
+ ["wukong.input.kafka.offset", "-1"]
297
+ when settings[:offset]
298
+ ["wukong.input.kafka.offset", settings[:offset]]
299
+ else
300
+ # Do *not* set anything and the spout will attempt to
301
+ # resume and, finding no prior offset, will start from the
302
+ # end, as though we'd passed "-1"
303
+ end
304
+ ]
305
+ end
306
+
307
+ # Return Java `-D` option key-value pairs related to the Wukong
308
+ # dataflow run by the topology.
309
+ #
310
+ # @return [Array<Array>] an Array of key-value pairs
311
+ def dataflow_options
312
+ [
313
+ ["wukong.directory", Dir.pwd],
314
+ ["wukong.dataflow", dataflow_name],
315
+ ["wukong.command", wu_bolt_commandline],
316
+ ["wukong.parallelism", settings[:parallelism]],
317
+ ].tap do |opts|
318
+ opts << ["wukong.environment", settings[:environment]] if settings[:environment]
319
+ end
320
+ end
321
+
322
+ # Return Java `-D` option key-value pairs related to the final
323
+ # state used by the topology.
324
+ #
325
+ # @return [Array<Array>] an Array of key-value pairs
326
+ def state_options
327
+ case
328
+ when kafka_output?
329
+ kafka_state_options
330
+ end
331
+ end
332
+
333
+ # Return Java `-D` option key-value pairs related to the final
334
+ # state used by the topology when it is writing to Kafka.
335
+ #
336
+ # @return [Array<Array>] an Array of key-value pairs
337
+ def kafka_state_options
338
+ [
339
+ ["wukong.output.kafka.topic", settings[:output]],
340
+ ]
341
+ end
342
+
343
+ protected
344
+
345
+ # Return a String of options used when attempting to kill a
346
+ # running Storm topology.
347
+ #
348
+ # @return [String]
349
+ def storm_kill_options
350
+ "-w #{settings[:wait]}"
351
+ end
352
+
353
+ # Format the given `option` and `value` into a Java option
354
+ # (`-D`).
355
+ #
356
+ # @param [Object] option
357
+ # @param [Object] value
358
+ # @return [String]
359
+ def java_option option, value
360
+ return unless value
361
+ return if value.to_s.strip.empty?
362
+ "-D#{option}=#{Shellwords.escape(value.to_s)}"
363
+ end
364
+
365
+ # Parameters that should be passed onto subprocesses.
366
+ #
367
+ # @return [Configliere::Param]
368
+ def params_to_pass
369
+ settings
370
+ end
371
+
372
+ # Return a String stripped of any `wu-storm`-specific params but
373
+ # still including any other params.
374
+ #
375
+ # @return [String]
376
+ def non_wukong_storm_params_string
377
+ params_to_pass.reject do |param, val|
378
+ (params_to_pass.definition_of(param, :wukong_storm) || params_to_pass.definition_of(param, :wukong))
379
+ end.map do |param, val|
380
+ "--#{param}=#{Shellwords.escape(val.to_s)}"
381
+ end.join(" ")
382
+ end
383
+
384
+ end
385
+ end
386
+ end
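
For readers skimming the diff, the following standalone Ruby sketch shows roughly how key-value pairs become `-D` options. The `java_option` body is copied from the method above so the snippet runs on its own. The sample pairs are hypothetical values chosen for illustration, not output captured from a real launch.

```ruby
require 'shellwords'

# Same formatting rule as `java_option` in the diff above, reproduced
# here as a top-level method so the example is self-contained.
def java_option(option, value)
  return unless value
  return if value.to_s.strip.empty?
  "-D#{option}=#{Shellwords.escape(value.to_s)}"
end

# Hypothetical key-value pairs, shaped like those built by
# native_storm_options and dataflow_options.
pairs = [
  ['topology.message.timeout.secs', 30],
  ['wukong.command', 'wu-bolt upcaser.rb'],
  ['topology.debug', nil], # unset options are dropped
]

puts pairs.map { |key, value| java_option(key, value) }.compact
# Prints:
# -Dtopology.message.timeout.secs=30
# -Dwukong.command=wu-bolt\ upcaser.rb
```

Escaping with `Shellwords.escape` matters because the bolt command usually contains spaces and is itself passed through a shell as part of the `storm jar` command line.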