elasticity 1.5 → 2.0
- data/.rspec +2 -1
- data/.rvmrc +1 -1
- data/HISTORY.md +47 -24
- data/LICENSE +1 -1
- data/README.md +165 -317
- data/Rakefile +4 -3
- data/elasticity.gemspec +3 -5
- data/lib/elasticity.rb +10 -5
- data/lib/elasticity/aws_request.rb +81 -20
- data/lib/elasticity/custom_jar_step.rb +33 -0
- data/lib/elasticity/emr.rb +45 -117
- data/lib/elasticity/hadoop_bootstrap_action.rb +27 -0
- data/lib/elasticity/hive_step.rb +57 -0
- data/lib/elasticity/job_flow.rb +109 -39
- data/lib/elasticity/job_flow_status.rb +53 -0
- data/lib/elasticity/job_flow_status_step.rb +35 -0
- data/lib/elasticity/job_flow_step.rb +17 -25
- data/lib/elasticity/pig_step.rb +82 -0
- data/lib/elasticity/support/conditional_raise.rb +23 -0
- data/lib/elasticity/version.rb +1 -1
- data/spec/lib/elasticity/aws_request_spec.rb +159 -51
- data/spec/lib/elasticity/custom_jar_step_spec.rb +59 -0
- data/spec/lib/elasticity/emr_spec.rb +231 -762
- data/spec/lib/elasticity/hadoop_bootstrap_action_spec.rb +26 -0
- data/spec/lib/elasticity/hive_step_spec.rb +74 -0
- data/spec/lib/elasticity/job_flow_integration_spec.rb +197 -0
- data/spec/lib/elasticity/job_flow_spec.rb +369 -138
- data/spec/lib/elasticity/job_flow_status_spec.rb +147 -0
- data/spec/lib/elasticity/job_flow_status_step_spec.rb +73 -0
- data/spec/lib/elasticity/job_flow_step_spec.rb +26 -64
- data/spec/lib/elasticity/pig_step_spec.rb +104 -0
- data/spec/lib/elasticity/support/conditional_raise_spec.rb +35 -0
- data/spec/spec_helper.rb +1 -50
- data/spec/support/be_a_hash_including_matcher.rb +35 -0
- metadata +101 -119
- data/.autotest +0 -2
- data/lib/elasticity/custom_jar_job.rb +0 -38
- data/lib/elasticity/hive_job.rb +0 -69
- data/lib/elasticity/pig_job.rb +0 -109
- data/lib/elasticity/simple_job.rb +0 -51
- data/spec/fixtures/vcr_cassettes/add_instance_groups/one_group_successful.yml +0 -44
- data/spec/fixtures/vcr_cassettes/add_instance_groups/one_group_unsuccessful.yml +0 -41
- data/spec/fixtures/vcr_cassettes/add_jobflow_steps/add_multiple_steps.yml +0 -266
- data/spec/fixtures/vcr_cassettes/custom_jar_job/cloudburst.yml +0 -41
- data/spec/fixtures/vcr_cassettes/describe_jobflows/all_jobflows.yml +0 -75
- data/spec/fixtures/vcr_cassettes/direct/terminate_jobflow.yml +0 -38
- data/spec/fixtures/vcr_cassettes/hive_job/hive_ads.yml +0 -41
- data/spec/fixtures/vcr_cassettes/modify_instance_groups/set_instances_to_3.yml +0 -38
- data/spec/fixtures/vcr_cassettes/pig_job/apache_log_reports.yml +0 -41
- data/spec/fixtures/vcr_cassettes/pig_job/apache_log_reports_with_bootstrap.yml +0 -41
- data/spec/fixtures/vcr_cassettes/run_jobflow/word_count.yml +0 -41
- data/spec/fixtures/vcr_cassettes/set_termination_protection/nonexistent_job_flows.yml +0 -41
- data/spec/fixtures/vcr_cassettes/set_termination_protection/protect_multiple_job_flows.yml +0 -38
- data/spec/fixtures/vcr_cassettes/terminate_jobflows/one_jobflow.yml +0 -38
- data/spec/lib/elasticity/custom_jar_job_spec.rb +0 -118
- data/spec/lib/elasticity/hive_job_spec.rb +0 -90
- data/spec/lib/elasticity/pig_job_spec.rb +0 -226
data/.rspec
CHANGED
data/.rvmrc
CHANGED
@@ -1 +1 @@
|
|
1
|
-
rvm use ruby-1.9.
|
1
|
+
rvm use ruby-1.9.3-p194@elasticity --create
|
data/HISTORY.md
CHANGED
@@ -1,56 +1,79 @@
|
|
1
|
-
|
1
|
+
## 2.0 - June 26, 2012
|
2
|
+
|
3
|
+
2.0 is a rewrite of the simplified API after a year's worth of daily use at [Sharethrough](http://www.sharethrough.com/). We're investing heavily in our data processing infrastructure and many Elasticity feature ideas have come from those efforts.
|
4
|
+
|
5
|
+
In order to move more quickly and support interesting features like a command-line interface, configuration-file-based launching, keep-alive clusters and more - a remodeling of the simplified API was done. This is going to result in breaking changes to the API, hence the bump to 2.0. I hope that most of you were using ```gem 'elasticity', '~> 1.5'``` in your Gemfile :)
|
6
|
+
|
7
|
+
#### API Changes
|
8
|
+
|
9
|
+
+ The ```SimpleJob```-based API has been removed in favour of a more modular step-based approach using the "job flow" and "step" vernacular, in line with Amazon's own language. If you're familiar with the AWS UI, using Elasticity will be a bit more straightforward.
|
10
|
+
+ The functionality provided by ```JobFlow``` and ```JobFlowStep``` has been transitioned to ```JobFlowStatus``` and ```JobFlowStatusStep``` respectively, clearing the path for use of ```JobFlow``` and ```JobFlowStep``` in job submission.
|
11
|
+
|
12
|
+
#### New Features!
|
13
|
+
|
14
|
+
+ When submitting jobs via ```JobFlow``` API, it is now possible to specify the version of the AMI, whether or not the cluster is keep-alive, and the subnet ID (for launching into a VPC). Keep in mind that AWS will error out if you specify an unsupported combination of AMI and Hadoop version.
|
15
|
+
+ The default version of Hadoop in ```JobFlow``` is now 0.20.205. The previous default was 0.20 in case you'd like to set it yourself.
|
16
|
+
+ It is now possible to name Hadoop bootstrap actions, making it easier to understand the actions when looking in the AWS UI after a job is submitted.
|
17
|
+
|
18
|
+
#### Under The Hood
|
19
|
+
|
20
|
+
+ AWS requests are now POSTs (thanks to [Menno van der Sman](https://github.com/menno)) in order to avoid server-imposed GET request size limits. Rather than maintain two separate code paths for GET and POST, we decided to only support POST as there is no reason to support both.
|
21
|
+
+ Drastic simplification of the testing around EMR submission, reducing LoC (however important that metric is to you :) and complexity by ~50%.
|
22
|
+
+ Development dependency updates: updated to ruby-1.9.3-p194 and rspec-2.10. Removed dependency on VCR and WebMock (no longer using either of these).
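To illustrate the GET-to-POST change noted above, here is a minimal sketch using only Ruby's standard library. It is not Elasticity's actual request code: the endpoint URL is an assumption, and the signing parameters that a real AWS request needs are deliberately left out. The point is only that the parameters travel in the request body instead of the URL.

```ruby
require 'net/http'
require 'uri'

# Assumed endpoint, for illustration only. A real request must also be signed.
uri = URI.parse('https://elasticmapreduce.amazonaws.com/')

# A single unsigned parameter; real requests carry many more.
params = { 'Operation' => 'DescribeJobFlows' }

# With POST, the parameters are form-encoded into the body, so
# server-imposed limits on URL length no longer apply to them.
request = Net::HTTP::Post.new(uri.request_uri)
request.set_form_data(params)

puts request.method
puts request.body
```

The request is built but never sent, so the sketch runs without credentials or network access.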
|
23
|
+
|
24
|
+
## 1.5
|
2
25
|
|
3
26
|
+ Added support for Hadoop bootstrap actions to all job types (Pig, Hive and Custom Jar).
|
4
27
|
+ Added support for REE 1.8.7-2011.12, Ruby 1.9.2 and 1.9.3.
|
5
28
|
+ Updated to the latest versions of all development dependencies (notably VCR 2).
|
6
29
|
|
7
|
-
|
30
|
+
## 1.4.1
|
8
31
|
|
9
|
-
+ Added Elasticity::EMR#describe_jobflow("jobflow_id") for describing a specific job. If you happen to run hundreds of EMR jobs, this makes retrieving jobflow status much faster than using Elasticity::EMR#describe_jobflowS which pulls down and parses XML status for hundreds of jobs.
|
32
|
+
+ Added ```Elasticity::EMR#describe_jobflow("jobflow_id")``` for describing a specific job. If you happen to run hundreds of EMR jobs, this makes retrieving jobflow status much faster than using ```Elasticity::EMR#describe_jobflowS``` which pulls down and parses XML status for hundreds of jobs.
|
10
33
|
|
11
|
-
|
34
|
+
## 1.4
|
12
35
|
|
13
|
-
+ Added Elasticity::CustomJarJob for launching "Custom Jar" jobs.
|
36
|
+
+ Added ```Elasticity::CustomJarJob``` for launching "Custom Jar" jobs.
|
14
37
|
|
15
|
-
|
38
|
+
## 1.3.1
|
16
39
|
|
17
40
|
+ Explicitly requiring 'time' (only a problem if you aren't running from within a Rails environment).
|
18
|
-
+ Elasticity::JobFlow now exposes last_state_change_reason
|
41
|
+
+ ```Elasticity::JobFlow``` now exposes ```last_state_change_reason```.
|
19
42
|
|
20
|
-
|
43
|
+
## 1.3 (Contributions from Wouter Broekhof)
|
21
44
|
|
22
45
|
+ The default mode of communication is now via HTTPS.
|
23
|
-
+ Elasticity::AwsRequest new option
|
24
|
-
+ Elasticity::AwsRequest new option
|
25
|
-
+ Elasticity::EMR#describe_jobflows now accepts additional params for filtering the jobflow query (see docs).
|
46
|
+
+ ```Elasticity::AwsRequest``` new option ```:secure => true|false``` (whether to use HTTPS).
|
47
|
+
+ ```Elasticity::AwsRequest``` new option ```:region => eu-west-1|...``` (which region to run the EMR job).
|
48
|
+
+ ```Elasticity::EMR#describe_jobflows``` now accepts additional params for filtering the jobflow query (see docs).
|
26
49
|
|
27
|
-
|
50
|
+
## 1.2.2
|
28
51
|
|
29
|
-
+ HiveJob and PigJob now support configuring Hadoop options via
|
52
|
+
+ ```HiveJob``` and ```PigJob``` now support configuring Hadoop options via ```#add_hadoop_bootstrap_action()```.
|
30
53
|
|
31
|
-
|
54
|
+
## 1.2.1
|
32
55
|
|
33
56
|
+ Shipping up E_PARALLELS Pig variable with each invocation; reasonable default value for PARALLEL based on the number and type of instances configured.
|
34
57
|
|
35
|
-
|
58
|
+
## 1.2
|
36
59
|
|
37
|
-
+ Added PigJob
|
60
|
+
+ Added ```PigJob```!
|
38
61
|
|
39
|
-
|
62
|
+
## 1.1.1
|
40
63
|
|
41
|
-
+ HiveJob critical bug fixed, now it works :)
|
42
|
-
+ Added log_uri and action_on_failure as options to HiveJob
|
43
|
-
+ Added integration tests to HiveJob
|
64
|
+
+ ```HiveJob``` critical bug fixed, now it works :)
|
65
|
+
+ Added ```log_uri``` and ```action_on_failure``` as options to ```HiveJob```.
|
66
|
+
+ Added integration tests to ```HiveJob```.
|
44
67
|
|
45
|
-
|
68
|
+
## 1.1
|
46
69
|
|
47
|
-
+ Added HiveJob
|
70
|
+
+ Added ```HiveJob```, a simplified way to launch basic Hive job flows.
|
48
71
|
+ Added HISTORY.
|
49
72
|
|
50
|
-
|
73
|
+
## 1.0.1
|
51
74
|
|
52
75
|
+ Added LICENSE.
|
53
76
|
|
54
|
-
|
77
|
+
## 1.0
|
55
78
|
|
56
79
|
+ Released!
|
data/LICENSE
CHANGED
@@ -186,7 +186,7 @@
|
|
186
186
|
same "printed page" as the copyright notice for easier
|
187
187
|
identification within third-party archives.
|
188
188
|
|
189
|
-
Copyright 2011 Robert Slifka
|
189
|
+
Copyright 2011-2012 Robert Slifka
|
190
190
|
|
191
191
|
Licensed under the Apache License, Version 2.0 (the "License");
|
192
192
|
you may not use this file except in compliance with the License.
|
data/README.md
CHANGED
@@ -1,78 +1,131 @@
|
|
1
|
-
Elasticity provides programmatic access to Amazon's Elastic Map Reduce service. The aim is to conveniently
|
1
|
+
Elasticity provides programmatic access to Amazon's Elastic Map Reduce service. The aim is to conveniently map the EMR REST API calls to higher level operations that make working with job flows more productive and more enjoyable.
|
2
2
|
|
3
|
-
[![Build Status](https://secure.travis-ci.org/rslifka/elasticity.png)](http://travis-ci.org/rslifka/elasticity)
|
3
|
+
[![Build Status](https://secure.travis-ci.org/rslifka/elasticity.png)](http://travis-ci.org/rslifka/elasticity) REE, 1.8.7, 1.9.2, 1.9.3
|
4
4
|
|
5
|
-
|
5
|
+
Elasticity provides two ways to access EMR:
|
6
6
|
|
7
|
-
|
7
|
+
* **Indirectly through a JobFlow-based API**. This README discusses the Elasticity API.
|
8
|
+
* **Directly through access to the EMR REST API**. The less-discussed hidden darkside... I use this to enable the Elasticity API though it is not documented save for RubyDoc available at the RubyGems [auto-generated documentation site](http://rubydoc.info/gems/elasticity/frames). Be forewarned: Making the calls directly requires that you understand how to structure EMR requests at the Amazon API level and from experience I can tell you there are more fun things you could be doing :) Scroll to the end for more information on the Amazon API.
|
9
|
+
|
10
|
+
# Installation
|
11
|
+
|
12
|
+
```
|
8
13
|
gem install elasticity
|
9
|
-
|
14
|
+
```
|
10
15
|
|
11
|
-
|
16
|
+
or in your Gemfile
|
12
17
|
|
13
|
-
|
18
|
+
```
|
19
|
+
gem 'elasticity', '~> 2.0'
|
20
|
+
```
|
14
21
|
|
15
|
-
|
22
|
+
This will ensure that you protect yourself from API changes, which will only be made in major revisions.
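If you're curious what ```~> 2.0``` admits, RubyGems can tell you. The sketch below uses ```Gem::Requirement``` and ```Gem::Version``` from RubyGems itself; the version numbers other than 1.5 and 2.0 are hypothetical and only there to show where the boundary falls.

```ruby
require 'rubygems'

# The same constraint as the Gemfile line above.
requirement = Gem::Requirement.new('~> 2.0')

# 2.4.1 and 3.0 are made-up versions used purely for illustration.
%w(1.5 2.0 2.4.1 3.0).each do |version|
  allowed = requirement.satisfied_by?(Gem::Version.new(version))
  puts "#{version}: #{allowed ? 'allowed' : 'rejected'}"
end
```

Anything from 2.0 up to (but not including) 3.0 satisfies the constraint, so a future major revision would not be picked up automatically.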
|
16
23
|
|
17
|
-
|
18
|
-
@action_on_failure = "TERMINATE_JOB_FLOW"
|
19
|
-
@ec2_key_name = "default"
|
20
|
-
@hadoop_version = "0.20"
|
21
|
-
@instance_count = 2
|
22
|
-
@master_instance_type = "m1.small"
|
23
|
-
@name = "Elasticity Job"
|
24
|
-
@slave_instance_type = "m1.small"
|
25
|
-
</pre>
|
24
|
+
# Kicking Off a Job
|
26
25
|
|
27
|
-
|
26
|
+
When using the EMR UI, there are several sample jobs that Amazon supplies. The assets for these sample jobs are hosted on S3 and publicly available meaning you can run this code as-is (supplying your AWS credentials appropriately) and ```JobFlow#run``` will return the ID of the job flow.
|
28
27
|
|
29
|
-
|
28
|
+
```
|
29
|
+
require 'elasticity'
|
30
30
|
|
31
|
-
|
31
|
+
# Create a job flow with your AWS credentials
|
32
|
+
jobflow = Elasticity::JobFlow.new('AWS access key', 'AWS secret key')
|
32
33
|
|
33
|
-
|
34
|
-
|
35
|
-
|
36
|
-
|
37
|
-
|
34
|
+
# This is the first step in the jobflow - running a custom jar
|
35
|
+
step = Elasticity::CustomJarStep.new('s3n://elasticmapreduce/samples/cloudburst/cloudburst.jar')
|
36
|
+
|
37
|
+
# Here are the arguments to pass to the jar
|
38
|
+
step.arguments = %w(s3n://elasticmapreduce/samples/cloudburst/input/s_suis.br s3n://elasticmapreduce/samples/cloudburst/input/100k.br s3n://slif-output/cloudburst/output/2012-06-22 36 3 0 1 240 48 24 24 128 16)
|
39
|
+
|
40
|
+
# Add the step to the jobflow
|
41
|
+
jobflow.add_step(step)
|
42
|
+
|
43
|
+
# Let's go!
|
44
|
+
jobflow.run
|
45
|
+
```
|
46
|
+
|
47
|
+
Note that this example is only for ```CustomJarStep```. ```PigStep``` and ```HiveStep``` will have different means of passing parameters.
|
48
|
+
|
49
|
+
# Working with Job Flows
|
50
|
+
|
51
|
+
Job flows are the center of the EMR universe. The general order of operations is:
|
52
|
+
|
53
|
+
1. Create a job flow.
|
54
|
+
1. Specify options.
|
55
|
+
1. Add bootstrap actions.
|
56
|
+
1. Create steps.
|
57
|
+
1. Run the job flow.
|
58
|
+
1. (optional) Add additional steps.
|
59
|
+
1. (optional) Shutdown the job flow.
|
60
|
+
|
61
|
+
## 1 - Creating Job Flows
|
62
|
+
|
63
|
+
Only your AWS credentials are needed.
|
64
|
+
|
65
|
+
```
|
66
|
+
jobflow = Elasticity::JobFlow.new('AWS access key', 'AWS secret key')
|
67
|
+
```
|
68
|
+
|
69
|
+
## 2 - Specifying Job Flow Options
|
70
|
+
|
71
|
+
Configure job flow options as needed; they are shown below with their default values. Note that these defaults are subject to change - they are reasonable defaults at the time(s) I work on them (e.g. the latest version of Hadoop).
|
72
|
+
|
73
|
+
These options are sent up as part of job flow submission (i.e. ```JobFlow#run```), so be sure to configure these before running the job.
|
74
|
+
|
75
|
+
```
|
76
|
+
jobflow.action_on_failure = 'TERMINATE_JOB_FLOW'
|
77
|
+
jobflow.ami_version = 'latest'
|
78
|
+
jobflow.ec2_key_name = 'default'
|
79
|
+
jobflow.ec2_subnet_id = nil
|
80
|
+
jobflow.hadoop_version = '0.20.205'
|
81
|
+
jobflow.instance_count = 2
|
82
|
+
jobflow.keep_job_flow_alive_when_no_steps = true
|
83
|
+
jobflow.log_uri = nil
|
84
|
+
jobflow.master_instance_type = 'm1.small'
|
85
|
+
jobflow.name = 'Elasticity Job Flow'
|
86
|
+
jobflow.slave_instance_type = 'm1.small'
|
87
|
+
```
|
88
|
+
|
89
|
+
## 3 - Adding Bootstrap Actions
|
38
90
|
|
39
|
-
|
91
|
+
Bootstrap actions are run as part of setting up the job flow, so be sure to configure these before running the job.
|
40
92
|
|
41
|
-
|
93
|
+
```
|
94
|
+
[
|
95
|
+
Elasticity::HadoopBootstrapAction.new('-m', 'mapred.map.tasks=101'),
|
96
|
+
Elasticity::HadoopBootstrapAction.new('-m', 'mapred.reduce.child.java.opts=-Xmx200m'),
|
97
|
+
Elasticity::HadoopBootstrapAction.new('-m', 'mapred.tasktracker.map.tasks.maximum=14')
|
98
|
+
].each do |action|
|
99
|
+
jobflow.add_bootstrap_action(action)
|
100
|
+
end
|
101
|
+
```
|
42
102
|
|
43
|
-
|
44
|
-
hive = Elasticity::HiveJob.new(ENV["AWS_ACCESS_KEY_ID"], ENV["AWS_SECRET_KEY"])
|
45
|
-
hive.run("s3n://slif-hive/test.q", {
|
46
|
-
"LIB" => "s3n://slif-test/lib",
|
47
|
-
"OUTPUT" => "s3n://slif-test/output"
|
48
|
-
})
|
49
|
-
|
50
|
-
> "j-129V5AQFMKO1C"
|
51
|
-
</pre>
|
103
|
+
## 4 - Adding Steps
|
52
104
|
|
53
|
-
|
105
|
+
Each type of step has a default name that can be overridden (the :name field). Apart from that, steps are configured differently - exhaustively described below.
|
54
106
|
|
55
|
-
|
107
|
+
### Adding a Pig Step
|
56
108
|
|
57
|
-
|
58
|
-
|
59
|
-
|
60
|
-
pig.ec2_key_name = "slif_dev"
|
61
|
-
pig.run("s3n://elasticmapreduce/samples/pig-apache/do-reports.pig", {
|
62
|
-
"INPUT" => "s3n://elasticmapreduce/samples/pig-apache/input",
|
63
|
-
"OUTPUT" => "s3n://slif-elasticity/pig-apache/output/2011-05-04"
|
64
|
-
})
|
65
|
-
|
66
|
-
> "j-16PZ24OED71C6"
|
67
|
-
</pre>
|
109
|
+
```
|
110
|
+
# Path to the Pig script
|
111
|
+
pig_step = Elasticity::PigStep.new('s3n://mybucket/script.pig')
|
68
112
|
|
69
|
-
|
113
|
+
# (optional) These variables are available during the execution of your script
|
114
|
+
pig_step.variables = {
|
115
|
+
'VAR1' => 'VALUE1',
|
116
|
+
'VAR2' => 'VALUE2'
|
117
|
+
}
|
118
|
+
|
119
|
+
jobflow.add_step(pig_step)
|
120
|
+
```
|
121
|
+
|
122
|
+
#### PARALLEL
|
70
123
|
|
71
124
|
Given the importance of specifying a reasonable value for [the number of parallel reducers](http://pig.apache.org/docs/r0.8.1/cookbook.html#Use+the+Parallel+Features PARALLEL), Elasticity calculates a reasonable default and passes it with every invocation as a script variable called E_PARALLELS. This default value is based on the formula in the Pig Cookbook and the number of reducers AWS configures per instance.
|
72
125
|
|
73
126
|
For example, if you had 8 instances in total and your slaves were m1.xlarge, the value is 26 (as shown below).
|
74
127
|
|
75
|
-
|
128
|
+
```
|
76
129
|
s3://elasticmapreduce/libs/pig/pig-script
|
77
130
|
--run-pig-script
|
78
131
|
--args
|
@@ -80,309 +133,104 @@ For example, if you had 8 instances in total and your slaves were m1.xlarge, the
|
|
80
133
|
-p OUTPUT=s3n://slif-elasticity/pig-apache/output/2011-05-04
|
81
134
|
-p E_PARALLELS=26
|
82
135
|
s3n://elasticmapreduce/samples/pig-apache/do-reports.pig
|
83
|
-
|
136
|
+
```
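The value of 26 in the invocation above can be reproduced with a rough sketch of the calculation. This is not Elasticity's own code: it assumes the Pig Cookbook formula is machines times reduce slots per machine times 0.9, that one of the 8 instances is the master, and that an m1.xlarge has 4 reduce slots. The method and constant names are hypothetical.

```ruby
# Assumed reduce slots per slave instance type. The figure for m1.xlarge is
# chosen because it reproduces the README's example, not read from Elasticity.
ASSUMED_REDUCE_SLOTS = { 'm1.xlarge' => 4 }

# Hypothetical helper: slaves * reduce slots per slave * 0.9, rounded up.
def estimate_parallels(instance_count, slave_instance_type)
  slave_count = instance_count - 1  # assumes exactly one master instance
  slots = ASSUMED_REDUCE_SLOTS.fetch(slave_instance_type)
  (slave_count * slots * 0.9).ceil
end

puts estimate_parallels(8, 'm1.xlarge')
```

Under these assumptions, 7 slaves with 4 slots each give 28 slots, and 90% of that rounded up is 26.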
|
84
137
|
|
85
138
|
Use this as you would any other Pig variable.
|
86
139
|
|
87
|
-
|
140
|
+
```
|
88
141
|
A = LOAD 'myfile' AS (t, u, v);
|
89
142
|
B = GROUP A BY t PARALLEL $E_PARALLELS;
|
90
143
|
...
|
91
|
-
|
144
|
+
```
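For reference, script variables reach Pig as ```-p NAME=value``` pairs, as in the invocation shown earlier. Here is a minimal sketch of that mapping; ```pig_variable_args``` is a hypothetical helper written for illustration and is not part of Elasticity.

```ruby
# Hypothetical helper: turn a hash of variables into "-p NAME=value" pairs,
# preserving the order in which the variables were given.
def pig_variable_args(variables)
  variables.flat_map { |name, value| ['-p', "#{name}=#{value}"] }
end

args = pig_variable_args({
  'OUTPUT'      => 's3n://slif-elasticity/pig-apache/output/2011-05-04',
  'E_PARALLELS' => 26
})

puts args.join(' ')
```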
|
145
|
+
|
146
|
+
### Adding a Hive Step
|
147
|
+
|
148
|
+
```
|
149
|
+
# Path to the Hive Script
|
150
|
+
hive_step = Elasticity::HiveStep.new('s3n://mybucket/script.hql')
|
151
|
+
|
152
|
+
# (optional) These variables are available during the execution of your script
|
153
|
+
hive_step.variables = {
|
154
|
+
'VAR1' => 'VALUE1',
|
155
|
+
'VAR2' => 'VALUE2'
|
156
|
+
}
|
157
|
+
|
158
|
+
jobflow.add_step(hive_step)
|
159
|
+
```
|
160
|
+
|
161
|
+
### Adding a Custom Jar Step
|
162
|
+
|
163
|
+
```
|
164
|
+
# Path to your jar
|
165
|
+
jar_step = Elasticity::CustomJarStep.new('s3n://mybucket/my.jar')
|
166
|
+
|
167
|
+
# (optional) Arguments passed to the jar
|
168
|
+
jar_step.arguments = ['arg1', 'arg2']
|
92
169
|
|
93
|
-
|
170
|
+
jobflow.add_step(jar_step)
|
171
|
+
```
|
94
172
|
|
95
|
-
|
173
|
+
## 5 - Running the Job Flow
|
96
174
|
|
97
|
-
|
98
|
-
custom_jar = Elasticity::CustomJarJob.new(ENV["AWS_ACCESS_KEY_ID"], ENV["AWS_SECRET_KEY"])
|
99
|
-
custom_jar.log_uri = "s3n://slif-test/output/logs"
|
100
|
-
custom_jar.action_on_failure = "TERMINATE_JOB_FLOW"
|
101
|
-
jobflow_id = custom_jar.run('s3n://elasticmapreduce/samples/cloudburst/cloudburst.jar', [
|
102
|
-
"s3n://elasticmapreduce/samples/cloudburst/input/s_suis.br",
|
103
|
-
"s3n://elasticmapreduce/samples/cloudburst/input/100k.br",
|
104
|
-
"s3n://slif_hadoop_test/cloudburst/output/2011-12-09",
|
105
|
-
])
|
106
|
-
|
107
|
-
> "j-1IU6NM8OUPS9I"
|
108
|
-
</pre>
|
175
|
+
Submit the job flow to Amazon, storing the ID of the running job flow.
|
109
176
|
|
110
|
-
|
177
|
+
```
|
178
|
+
jobflow_id = jobflow.run
|
179
|
+
```
|
111
180
|
|
112
|
-
|
113
|
-
Elasticity::CustomJarJob.new(key, secret).run(s3_jar_path, ['MyCustomClass', 'arg1', 'arg2'])
|
114
|
-
</pre>
|
181
|
+
## 6 - Adding Additional Steps (optional)
|
115
182
|
|
116
|
-
|
183
|
+
Steps can be added to a running job flow by calling ```#add_step``` on it, exactly as you would before submitting the job.
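A minimal sketch of that sequence follows. ```run_then_extend``` is a hypothetical helper, not part of Elasticity; it assumes only that the job flow responds to ```#run``` (returning the job flow ID) and ```#add_step```, as described in this README.

```ruby
# Hypothetical helper: +jobflow+ is any object that responds to #run and
# #add_step in the way this README describes for Elasticity::JobFlow.
def run_then_extend(jobflow, later_steps)
  jobflow_id = jobflow.run                            # submit the job flow
  later_steps.each { |step| jobflow.add_step(step) }  # added after submission
  jobflow_id
end
```

With a real job flow this might be called as ```run_then_extend(jobflow, [Elasticity::PigStep.new('s3n://mybucket/script.pig')])```.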
|
184
|
+
|
185
|
+
## 7 - Shutting Down the Job Flow (optional)
|
186
|
+
|
187
|
+
By default, job flows are set to terminate when there are no more running steps. You can tell the job flow to stay alive when it has nothing left to do:
|
188
|
+
|
189
|
+
```
|
190
|
+
jobflow.keep_job_flow_alive_when_no_steps = true
|
191
|
+
```
|
192
|
+
|
193
|
+
If that's the case, or if you'd just like to terminate a running jobflow before waiting for it to finish:
|
194
|
+
|
195
|
+
```
|
196
|
+
jobflow.shutdown
|
197
|
+
```
|
198
|
+
|
199
|
+
# Amazon EMR Documentation
|
117
200
|
|
118
201
|
Elasticity wraps all of the EMR API calls. Please see the Amazon guide for details on these operations because the default values aren't obvious (e.g. the meaning of <code>DescribeJobFlows</code> without parameters).
|
119
202
|
|
120
|
-
You may opt for "direct" access to the API where you specify the params and Elasticity takes care of the signing for you, responding with the XML from Amazon.
|
203
|
+
You may opt for "direct" access to the API where you specify the params and Elasticity takes care of the signing for you, responding with the XML from Amazon.
|
121
204
|
|
122
|
-
In addition to the [AWS EMR
|
205
|
+
In addition to the [AWS EMR site](http://aws.amazon.com/elasticmapreduce/), there are three primary resources of reference information for EMR:
|
123
206
|
|
124
207
|
* [Amazon EMR Getting Started Guide](http://docs.amazonwebservices.com/ElasticMapReduce/latest/GettingStartedGuide/)
|
125
208
|
* [Amazon EMR Developer Guide](http://docs.amazonwebservices.com/ElasticMapReduce/latest/DeveloperGuide/)
|
126
209
|
* [Amazon EMR API Reference](http://docs.amazonwebservices.com/ElasticMapReduce/latest/API/)
|
127
210
|
|
128
|
-
Unfortunately, the documentation is sometimes incorrect and sometimes missing. E.g. the allowable values for AddInstanceGroups are present in the [PDF](http://awsdocs.s3.amazonaws.com/ElasticMapReduce/20090331/emr-api-20090331.pdf) version of the API reference but not in the [HTML](http://docs.amazonwebservices.com/ElasticMapReduce/latest/API/) version. Elasticity implements the API as specified in the PDF reference as that is the most complete description I could find.
|
129
|
-
|
130
|
-
## AddInstanceGroups
|
131
|
-
|
132
|
-
AddInstanceGroups adds a group of instances to an existing job flow. The available instance configuration options are listed in the EMR API reference. They've been converted to be more Ruby-like in the wrappers, as shown in the example below.
|
133
|
-
|
134
|
-
<pre>
|
135
|
-
emr = Elasticity::EMR.new(ENV["AWS_ACCESS_KEY_ID"], ENV["AWS_SECRET_KEY"])
|
136
|
-
instance_group_config = {
|
137
|
-
:instance_count => 1,
|
138
|
-
:instance_role => "TASK",
|
139
|
-
:instance_type => "m1.small",
|
140
|
-
:market => "ON_DEMAND",
|
141
|
-
:name => "Go Canucks Go!"
|
142
|
-
}
|
143
|
-
emr.add_instance_groups("j-26LIXPUNSC0M3", [instance_group_config])
|
144
|
-
|
145
|
-
> ["ig-E7C8MGA2ULQ1"]
|
146
|
-
</pre>
|
147
|
-
|
148
|
-
Some combinations of the options will be rejected by Amazon and some once-valid options will sometimes be rejected if they are not relevant to the current state of the job flow (e.g. duplicate addition of TASK groups to the same job flow).
|
149
|
-
|
150
|
-
<pre>
|
151
|
-
emr.add_instance_groups("j-26LIXPUNSC0M3", [instance_group_config])
|
152
|
-
|
153
|
-
> Task instance group already exists in the job flow, cannot add more task groups
|
154
|
-
</pre>
|
155
|
-
|
156
|
-
## AddJobFlowSteps
|
157
|
-
|
158
|
-
AddJobFlowSteps adds the specified steps to the specified job flow.
|
159
|
-
|
160
|
-
<pre>
|
161
|
-
emr = Elasticity::EMR.new(ENV["AWS_ACCESS_KEY_ID"], ENV["AWS_SECRET_KEY"])
|
162
|
-
jobflow_id = emr.run_job_flow(...)
|
163
|
-
emr.add_jobflow_steps(jobflow_id, {
|
164
|
-
:steps => [
|
165
|
-
{
|
166
|
-
:action_on_failure => "TERMINATE_JOB_FLOW",
|
167
|
-
:hadoop_jar_step => {
|
168
|
-
:args => [
|
169
|
-
"s3://elasticmapreduce/libs/pig/pig-script",
|
170
|
-
"--base-path",
|
171
|
-
"s3://elasticmapreduce/libs/pig/",
|
172
|
-
"--install-pig"
|
173
|
-
],
|
174
|
-
:jar => "s3://elasticmapreduce/libs/script-runner/script-runner.jar"
|
175
|
-
},
|
176
|
-
:name => "Setup Pig"
|
177
|
-
}
|
178
|
-
]
|
179
|
-
})
|
180
|
-
</pre>
|
181
|
-
|
182
|
-
## describe_jobflow (Elasticity Convenience Method)
|
183
|
-
|
184
|
-
This is a convenience method that wraps DescribeJobFlow to return the status of a single job.
|
185
|
-
|
186
|
-
<pre>
|
187
|
-
emr = Elasticity::EMR.new(ENV["AWS_ACCESS_KEY_ID"], ENV["AWS_SECRET_KEY"])
|
188
|
-
jobflow = emr.describe_jobflow("j-129V5AQFMKO1C")
|
189
|
-
p jobflow.jobflow_id
|
190
|
-
> "j-129V5AQFMKO1C"
|
191
|
-
p jobflow.name
|
192
|
-
> "Elasticity Test Job"
|
193
|
-
</pre>
|
194
|
-
|
195
|
-
## DescribeJobFlows
|
196
|
-
|
197
|
-
DescribeJobFlows returns detailed information as to the state of all jobs. Currently this is wrapped in an <code>Elasticity::JobFlow</code> that contains the <code>name</code>, <code>jobflow_id</code> and <code>state</code>.
|
198
|
-
|
199
|
-
<pre>
|
200
|
-
emr = Elasticity::EMR.new(ENV["AWS_ACCESS_KEY_ID"], ENV["AWS_SECRET_KEY"])
|
201
|
-
jobflows = emr.describe_jobflows
|
202
|
-
p jobflows.map(&:name)
|
203
|
-
|
204
|
-
> ["Hive Test", "Pig Test", "Interactive Hadoop", "Interactive Hive"]
|
205
|
-
</pre>
|
206
|
-
|
207
|
-
## ModifyInstanceGroups
|
208
|
-
|
209
|
-
A job flow contains several "instance groups" of various types. These instances are where the work for your EMR task occurs. After a job flow has been created, you can find these instance groups in the AWS web UI by clicking on a job flow and then clicking on the "Instance Groups" tab.
|
210
|
-
|
211
|
-
<pre>
|
212
|
-
emr = Elasticity::EMR.new(ENV["AWS_ACCESS_KEY_ID"], ENV["AWS_SECRET_KEY"])
|
213
|
-
emr.modify_instance_groups({"ig-2T1HNUO61BG3O" => 3})
|
214
|
-
</pre>
|
215
|
-
|
216
|
-
If there's an error, you'll receive an ArgumentError containing the message from Amazon. For example if you attempt to modify an instance group that's part of a terminated job flow:
|
217
|
-
|
218
|
-
<pre>
|
219
|
-
emr = Elasticity::EMR.new(ENV["AWS_ACCESS_KEY_ID"], ENV["AWS_SECRET_KEY"])
|
220
|
-
emr.modify_instance_groups({"ig-some_terminated_group" => 3})
|
221
|
-
|
222
|
-
> ArgumentError: An instance group may only be modified when the job flow is running or waiting
|
223
|
-
</pre>
|
224
|
-
|
225
|
-
Or if you attempt to increase the instance count of the MASTER instance group:
|
226
|
-
|
227
|
-
<pre>
|
228
|
-
emr = Elasticity::EMR.new(ENV["AWS_ACCESS_KEY_ID"], ENV["AWS_SECRET_KEY"])
|
229
|
-
emr.modify_instance_groups({"ig-some_terminated_group" => 3})
|
230
|
-
|
231
|
-
> ArgumentError: A master instance group may not be modified
|
232
|
-
</pre>
|
233
|
-
|
234
|
-
## RunJobFlow
|
235
|
-
|
236
|
-
RunJobFlow creates and starts a new job flow. Specifying the arguments to RunJobFlow is a bit of a hot mess at the moment, requiring you to understand the EMR syntax as well as the data structure for specifying jobs. Here's a beefy example:
|
237
|
-
|
238
|
-
<pre>
|
239
|
-
emr = Elasticity::EMR.new(ENV["AWS_ACCESS_KEY_ID"], ENV["AWS_SECRET_KEY"])
|
240
|
-
jobflow_id = emr.run_job_flow({
|
241
|
-
:name => "Elasticity Test Flow (EMR Pig Script)",
|
242
|
-
:instances => {
|
243
|
-
:ec2_key_name => "sharethrough-dev",
|
244
|
-
:hadoop_version => "0.20",
|
245
|
-
:instance_count => 2,
|
246
|
-
:master_instance_type => "m1.small",
|
247
|
-
:placement => {
|
248
|
-
:availability_zone => "us-east-1a"
|
249
|
-
},
|
250
|
-
:slave_instance_type => "m1.small",
|
251
|
-
},
|
252
|
-
:steps => [
|
253
|
-
{
|
254
|
-
:action_on_failure => "TERMINATE_JOB_FLOW",
|
255
|
-
:hadoop_jar_step => {
|
256
|
-
:args => [
|
257
|
-
"s3://elasticmapreduce/libs/pig/pig-script",
|
258
|
-
"--base-path",
|
259
|
-
"s3://elasticmapreduce/libs/pig/",
|
260
|
-
"--install-pig"
|
261
|
-
],
|
262
|
-
:jar => "s3://elasticmapreduce/libs/script-runner/script-runner.jar"
|
263
|
-
},
|
264
|
-
:name => "Setup Pig"
|
265
|
-
},
|
266
|
-
{
|
267
|
-
:action_on_failure => "TERMINATE_JOB_FLOW",
|
268
|
-
:hadoop_jar_step => {
|
269
|
-
:args => [
|
270
|
-
"s3://elasticmapreduce/libs/pig/pig-script",
|
271
|
-
"--run-pig-script",
|
272
|
-
"--args",
|
273
|
-
"-p",
|
274
|
-
"INPUT=s3n://elasticmapreduce/samples/pig-apache/input",
|
275
|
-
"-p",
|
276
|
-
"OUTPUT=s3n://slif-elasticity/pig-apache/output/2011-04-19",
|
277
|
-
"s3n://elasticmapreduce/samples/pig-apache/do-reports.pig"
|
278
|
-
],
|
279
|
-
:jar => "s3://elasticmapreduce/libs/script-runner/script-runner.jar"
|
280
|
-
},
|
281
|
-
:name => "Run Pig Script"
|
282
|
-
}
|
283
|
-
]
|
284
|
-
})
|
285
|
-
|
286
|
-
> "j-129V5AQFMKO1C"
|
287
|
-
</pre>
|
288
|
-
|
289
|
-
Currently Elasticity doesn't do much to ease this pain although this is what I would like to focus on in coming releases. Feel free to ship ideas my way. In the meantime, have a look at the EMR API [PDF](http://awsdocs.s3.amazonaws.com/ElasticMapReduce/20090331/emr-api-20090331.pdf) under the RunJobFlow action and riff off of the example here.
|
290
|
-
|
291
|
-
## SetTerminationProtection
|
292
|
-
|
293
|
-
Enable or disable "termination protection" on the specified job flows. Termination protection prevents a job flow from being terminated by any user-initiated action.
|
294
|
-
|
295
|
-
<pre>
|
296
|
-
emr = Elasticity::EMR.new(ENV["AWS_ACCESS_KEY_ID"], ENV["AWS_SECRET_KEY"])
|
297
|
-
emr.set_termination_protection(["j-1B4D1XP0C0A35", "j-1YG2MYL0HVYS5"])
|
298
|
-
</pre>
|
299
|
-
|
300
|
-
To disable termination protection, specify false as the second parameter.
|
301
|
-
|
302
|
-
<pre>
|
303
|
-
emr.set_termination_protection(["j-1B4D1XP0C0A35", "j-1YG2MYL0HVYS5"], false)
|
304
|
-
</pre>
|
305
|
-
|
306
|
-
## TerminateJobFlows
|
307
|
-
|
308
|
-
Terminate the specified job flow. When the job flow '''exists''', you will receive no output. This is because Amazon does not return anything other than a 200 when you terminate a job flow :) You'll want to continuously poll with DescribeJobFlows to see when the job was actually terminated.
|
309
|
-
|
310
|
-
<pre>
|
311
|
-
emr = Elasticity::EMR.new(ENV["AWS_ACCESS_KEY_ID"], ENV["AWS_SECRET_KEY"])
|
312
|
-
emr.terminate_jobflows("j-BOWBV7884XD0")
|
313
|
-
</pre>
|
314
|
-
|
315
|
-
When the job flow '''doesn't exist''':
|
316
|
-
|
317
|
-
<pre>
|
318
|
-
emr = Elasticity::EMR.new(ENV["AWS_ACCESS_KEY_ID"], ENV["AWS_SECRET_KEY"])
|
319
|
-
emr.terminate_jobflows("no-flow")
|
320
|
-
|
321
|
-
> ArgumentError: Job flow 'no-flow' does not exist.
|
322
|
-
</pre>
|
323
|
-
|
324
|
-
# Direct Response Access
|
325
|
-
|
326
|
-
If you're fine with Elasticity's invocation wrapping and would prefer to get at the resulting XML rather than the wrapped response, throw a block our way and we'll yield the result. This still saves you the trouble of having to create the params and sign the request yet gives you direct access to the response XML for your parsing pleasure.
|
327
|
-
|
328
|
-
<pre>
|
329
|
-
emr = Elasticity::EMR.new(ENV["AWS_ACCESS_KEY_ID"], ENV["AWS_SECRET_KEY"])
|
330
|
-
emr.describe_jobflows{|xml| puts xml[0..77]}
|
331
|
-
|
332
|
-
> <DescribeJobFlowsResponse xmlns="http://elasticmapreduce.amazonaws.com/doc/200...
|
333
|
-
</pre>
|
334
|
-
|
335
|
-
# Direct Request/Response Access
|
336
|
-
|
337
|
-
If you're chomping at the bit to initiate some EMR functionality that isn't wrapped (or isn't wrapped in a way you prefer :) feel free to access the AWS EMR API directly by using <code>EMR.direct()</code>. You can find the allowed values in Amazon's EMR API [developer documentation](http://docs.amazonwebservices.com/ElasticMapReduce/latest/DeveloperGuide/index.html).
|
338
|
-
|
339
|
-
<pre>
|
340
|
-
emr = Elasticity::EMR.new(ENV["AWS_ACCESS_KEY_ID"], ENV["AWS_SECRET_KEY"])
|
341
|
-
params = {"Operation" => "DescribeJobFlows"}
|
342
|
-
result_xml = emr.direct(params)
|
343
|
-
result_xml[0..78]
|
344
|
-
|
345
|
-
> <DescribeJobFlowsResponse xmlns="http://elasticmapreduce.amazonaws.com/doc/2009...
|
346
|
-
</pre>
|
347
|
-
|
348
|
-
# Something Borrowed...
|
349
|
-
|
350
|
-
AWS signing was used from [RightScale's](http://www.rightscale.com/) amazing [right_aws gem](https://github.com/rightscale/right_aws) which works extraordinarily well! If you need access to any AWS service (EC2, S3, etc.), have a look.
|
351
|
-
|
352
|
-
Used camelize from ActiveSupport as well, thank you \Rails :)
|
211
|
+
Unfortunately, the documentation is sometimes incorrect and sometimes missing. E.g. the allowable values for ```AddInstanceGroups``` are present in the [PDF](http://awsdocs.s3.amazonaws.com/ElasticMapReduce/20090331/emr-api-20090331.pdf) version of the API reference but not in the [HTML](http://docs.amazonwebservices.com/ElasticMapReduce/latest/API/) version. Elasticity implements the API as specified in the PDF reference as that is the most complete description I could find.
|
353
212
|
|
354
213
|
# Thanks!
|
355
214
|
|
356
|
-
|
215
|
+
* AWS signing was used from [RightScale's](http://www.rightscale.com/) amazing [right_aws gem](https://github.com/rightscale/right_aws) which works extraordinarily well! If you need access to any AWS service (EC2, S3, etc.), have a look.
|
216
|
+
* <code>camelize</code> was used from ActiveSupport to assist in converting parameters to AWS request format.
|
217
|
+
* Thanks to the following people who have contributed patches or helpful suggestions: [Ryan Weald](https://github.com/rweald), [Aram Price](https://github.com/aramprice/), [Wouter Broekhof](https://github.com/wouter/) and [Menno van der Sman](https://github.com/menno)
|
357
218
|
|
358
|
-
+ [Aram Price](https://github.com/aramprice/)
|
359
|
-
+ [Wouter Broekhof](https://github.com/wouter/)
|
360
219
|
|
361
220
|
# License
|
362
221
|
|
363
|
-
|
222
|
+
```
|
364
223
|
Copyright 2011-2012 Robert Slifka
|
365
224
|
|
366
225
|
Licensed under the Apache License, Version 2.0 (the "License");
|
367
226
|
you may not use this file except in compliance with the License.
|
368
227
|
You may obtain a copy of the License at
|
369
228
|
|
370
|
-
|
229
|
+
http://www.apache.org/licenses/LICENSE-2.0
|
371
230
|
|
372
231
|
Unless required by applicable law or agreed to in writing, software
|
373
232
|
distributed under the License is distributed on an "AS IS" BASIS,
|
374
233
|
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
375
234
|
See the License for the specific language governing permissions and
|
376
235
|
limitations under the License.
|
377
|
-
|
378
|
-
|
379
|
-
### Development Notes for Slif
|
380
|
-
|
381
|
-
[Versioning Guide](http://docs.rubygems.org/read/chapter/7#page27), c/o [@brokenladder](https://twitter.com/#!/brokenladder)
|
382
|
-
|
383
|
-
<pre>
|
384
|
-
rake build # Build lorem-0.0.2.gem into the pkg directory
|
385
|
-
rake install # Build and install lorem-0.0.2.gem into system gems
|
386
|
-
rake release # Create tag v0.0.2 and build
|
387
|
-
# and push lorem-0.0.2.gem to Rubygems
|
388
|
-
</pre>
|
236
|
+
```
|