mobilize-hdfs 1.351 → 1.361

data/README.md CHANGED
@@ -1,234 +1,4 @@
- Mobilize-Hdfs
- ===============
+ Mobilize
+ ========
 
- Mobilize-Hdfs adds the power of HDFS to [mobilize-ssh][mobilize-ssh]:
- * read, write, and copy HDFS files through Google Spreadsheets.
-
- Table Of Contents
- -----------------
- * [Overview](#section_Overview)
- * [Install](#section_Install)
-   * [Mobilize-Hdfs](#section_Install_Mobilize-Hdfs)
-   * [Install Dirs and Files](#section_Install_Dirs_and_Files)
- * [Configure](#section_Configure)
-   * [Hadoop](#section_Configure_Hadoop)
- * [Start](#section_Start)
-   * [Create Job](#section_Start_Create_Job)
-   * [Run Test](#section_Start_Run_Test)
- * [Meta](#section_Meta)
- * [Author](#section_Author)
-
- <a name='section_Overview'></a>
- Overview
- -----------
-
- * Mobilize-hdfs adds HDFS methods to mobilize-ssh.
-
- <a name='section_Install'></a>
- Install
- ------------
-
- Make sure you go through all the steps in the
- [mobilize-base][mobilize-base] and [mobilize-ssh][mobilize-ssh]
- install sections first.
-
- <a name='section_Install_Mobilize-Hdfs'></a>
- ### Mobilize-Hdfs
-
- Add this to your Gemfile:
-
- ``` ruby
- gem "mobilize-hdfs"
- ```
-
- or run
-
-     $ gem install mobilize-hdfs
-
- for a ruby-wide install.
-
- <a name='section_Install_Dirs_and_Files'></a>
- ### Dirs and Files
-
- ### Rakefile
-
- Inside the Rakefile in your project's root dir, make sure you have:
-
- ``` ruby
- require 'mobilize-base/tasks'
- require 'mobilize-ssh/tasks'
- require 'mobilize-hdfs/tasks'
- ```
-
- This defines the rake tasks essential to running the environment.
-
- ### Config Dir
-
- Run
-
-     $ rake mobilize_hdfs:setup
-
- This will copy a sample hadoop.yml to your config dir.
-
- <a name='section_Configure'></a>
- Configure
- ------------
-
- <a name='section_Configure_Hadoop'></a>
- ### Configure Hadoop
-
- * Hadoop handles big data, so we need to be careful when reading from
- the cluster: an unbounded read could easily fill up our MongoDB instance,
- RAM, local disk space, etc.
- * Accordingly, all hadoop operations, stage outputs, etc. are
- executed and stored on the cluster only.
- * The exceptions are:
-   * writing to the cluster from an external source, such as a Google
- Sheet. Here there is no risk, as the external source has much stricter
- size limits than HDFS.
-   * reading from the cluster, such as for posting to a Google Sheet. In
- this case, the read_limit parameter dictates the maximum amount that can
- be read. If the data is bigger than the read limit, an exception will be
- raised.
-
- The Hadoop configuration consists of:
- * output_dir - the absolute path to the directory in HDFS that will store stage
- outputs. Directory names should end with a slash (/). Mobilize will choose the
- first listed cluster as the default cluster to write to.
- * read_limit - the maximum amount of data that can be read from the
- cluster. The default is 1GB.
- * clusters - defines aliases for clusters, which are used as
- parameters for Hdfs stages. Each cluster alias contains:
-   * namenode - defines the name and port for accessing the namenode
-     * name - the namenode's full name, as in namenode1.host.com
-     * port - the namenode's port, by default 50070
-   * gateway_node - defines the node that executes the cluster commands.
- This node must be defined in ssh.yml according to the specs in
- [mobilize-ssh][mobilize-ssh]. The gateway node can be the same for
- multiple clusters, depending on your cluster setup.
-   * exec_path - defines the path to the hadoop executable.
-
- Sample hadoop.yml:
-
- ``` yml
- ---
- development:
-   output_dir: /user/mobilize/development/
-   read_limit: 1000000000
-   clusters:
-     dev_cluster:
-       namenode:
-         name: dev_namenode.host.com
-         port: 50070
-       gateway_node: dev_hadoop_host
-       exec_path: /path/to/hadoop
-     dev_cluster_2:
-       namenode:
-         name: dev_namenode_2.host.com
-         port: 50070
-       gateway_node: dev_hadoop_host
-       exec_path: /path/to/hadoop
- test:
-   output_dir: /user/mobilize/test/
-   read_limit: 1000000000
-   clusters:
-     test_cluster:
-       namenode:
-         name: test_namenode.host.com
-         port: 50070
-       gateway_node: test_hadoop_host
-       exec_path: /path/to/hadoop
-     test_cluster_2:
-       namenode:
-         name: test_namenode_2.host.com
-         port: 50070
-       gateway_node: test_hadoop_host
-       exec_path: /path/to/hadoop
- production:
-   output_dir: /user/mobilize/production/
-   read_limit: 1000000000
-   clusters:
-     prod_cluster:
-       namenode:
-         name: prod_namenode.host.com
-         port: 50070
-       gateway_node: prod_hadoop_host
-       exec_path: /path/to/hadoop
-     prod_cluster_2:
-       namenode:
-         name: prod_namenode_2.host.com
-         port: 50070
-       gateway_node: prod_hadoop_host
-       exec_path: /path/to/hadoop
- ```
-
- <a name='section_Start'></a>
- Start
- -----
-
- <a name='section_Start_Create_Job'></a>
- ### Create Job
-
- * For mobilize-hdfs, the following stages are available.
-   * cluster and user are optional for all of the below.
-     * cluster defaults to output_cluster;
-     * user is treated the same way as in [mobilize-ssh][mobilize-ssh].
-   * hdfs.write `source:<full_path>, target:<hdfs_full_path>, user:<user>`
-     * The full_path can use `<gsheet_path>` or `<hdfs_path>`. The test uses "test_hdfs_1.in".
-     * `<hdfs_path>` is the cluster alias followed by the absolute path on the cluster.
-     * If a full path is supplied without a preceding cluster alias (e.g. "/user/mobilize/test/test_hdfs_1.in"),
- the first listed cluster will be used as the default.
-     * The test uses "/user/mobilize/test/test_hdfs_1.in" for the initial
- write, then "test_cluster_2/user/mobilize/test/test_hdfs_1_copy.out" for
- the cross-cluster write.
-     * Both cluster arguments and user are optional. If writing from
- one cluster to another, your source_cluster gateway_node must be able to
- access both clusters.
-
- <a name='section_Start_Run_Test'></a>
- ### Run Test
-
- To run tests, you will need to:
-
- 1) Go through the [mobilize-base][mobilize-base] and [mobilize-ssh][mobilize-ssh] tests first.
-
- 2) Clone the mobilize-hdfs repository.
-
- 3) From the project folder, run
-
-        $ rake mobilize_hdfs:setup
-
-    Copy over the config files from the mobilize-base and mobilize-ssh
- projects into the config dir, and populate the values in the hadoop.yml file.
- If you don't have two clusters, you can populate test_cluster_2 with the
- same cluster as your first.
-
- 4) Run
-
-        $ rake test
-
- * The test runs a 3-stage job:
-   * test_hdfs_1:
-     * `hdfs.write target:"/user/mobilize/test/test_hdfs_1.out", source:"test_hdfs_1.in"`
-     * `hdfs.write source:"/user/mobilize/test/test_hdfs_1.out",target:"test_cluster_2/user/mobilize/test/test_hdfs_1_copy.out"`
-     * `gsheet.write source:"hdfs://test_cluster_2/user/mobilize/test/test_hdfs_1_copy.out", target:"test_hdfs_1_copy.out"`
-   * At the end of the test, there should be a sheet named "test_hdfs_1_copy.out" with the same data as test_hdfs_1.in.
-
- <a name='section_Meta'></a>
- Meta
- ----
-
- * Code: `git clone git://github.com/dena/mobilize-hdfs.git`
- * Home: <https://github.com/dena/mobilize-hdfs>
- * Bugs: <https://github.com/dena/mobilize-hdfs/issues>
- * Gems: <http://rubygems.org/gems/mobilize-hdfs>
-
- <a name='section_Author'></a>
- Author
- ------
-
- Cassio Paes-Leme :: cpaesleme@dena.com :: @cpaesleme
-
- [mobilize-base]: https://github.com/dena/mobilize-base
- [mobilize-ssh]: https://github.com/dena/mobilize-ssh
+ Please refer to the mobilize-server wiki: https://github.com/DeNA/mobilize-server/wiki
data/lib/mobilize-hdfs.rb CHANGED
@@ -3,6 +3,9 @@ require "mobilize-ssh"
 
   module Mobilize
     module Hdfs
+      def Hdfs.home_dir
+        File.expand_path('..',File.dirname(__FILE__))
+      end
     end
   end
   require "mobilize-hdfs/handlers/hadoop"
@@ -127,17 +127,7 @@ module Mobilize
         path = path.starts_with?("/") ? path : "/#{path}"
       end
       url = "hdfs://#{cluster}#{path}"
-      hdfs_url = Hdfs.hdfs_url(url)
-      begin
-        response = Hadoop.run(cluster, "fs -tail '#{hdfs_url}'", user_name)
-        if response['exit_code']==0 or is_target
-          return "hdfs://#{cluster}#{path}"
-        else
-          raise "Unable to find #{url} with error: #{response['stderr']}"
-        end
-      rescue => exc
-        raise Exception, "Unable to find #{url} with error: #{exc.to_s}", exc.backtrace
-      end
+      return url
     end
 
     def Hdfs.user_name_by_stage_path(stage_path,cluster=nil)
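The first hunk above adds `Hdfs.home_dir`, which resolves to the gem's root directory; the second drops the `hadoop fs -tail` existence check, so `url_by_path` now returns the constructed `hdfs://` URL without verifying it on the cluster. A minimal standalone sketch of the `home_dir` idiom (the `Demo` module name and paths are illustrative, not from the gem):

``` ruby
# Assume this file lives at <gem_root>/lib/demo.rb.
module Demo
  def Demo.home_dir
    # File.dirname(__FILE__) => "<gem_root>/lib";
    # expanding '..' against it => "<gem_root>".
    File.expand_path('..', File.dirname(__FILE__))
  end
end

puts Demo.home_dir
```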
data/lib/mobilize-hdfs/tasks.rb CHANGED
@@ -1,6 +1,7 @@
- namespace :mobilize_hdfs do
+ require 'yaml'
+ namespace :mobilize do
    desc "Set up config and log folders and files"
-   task :setup do
+   task :setup_hdfs do
      sample_dir = File.dirname(__FILE__) + '/../samples/'
      sample_files = Dir.entries(sample_dir)
      config_dir = (ENV['MOBILIZE_CONFIG_DIR'] ||= "config/mobilize/")
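The setup task is renamed here: the namespace moves from `mobilize_hdfs` to `mobilize` and the task from `setup` to `setup_hdfs`, so the command becomes `rake mobilize:setup_hdfs` rather than the `rake mobilize_hdfs:setup` documented in the removed README. The consuming project's Rakefile stays as before:

``` ruby
# Rakefile in a consuming project, per the removed README's install
# section; after this change, run `rake mobilize:setup_hdfs` to copy
# the sample hadoop.yml into the config dir.
require 'mobilize-base/tasks'
require 'mobilize-ssh/tasks'
require 'mobilize-hdfs/tasks'
```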
data/lib/mobilize-hdfs/version.rb CHANGED
@@ -1,5 +1,5 @@
  module Mobilize
    module Hdfs
-     VERSION = "1.351"
+     VERSION = "1.361"
    end
  end
data/mobilize-hdfs.gemspec CHANGED
@@ -16,5 +16,5 @@ Gem::Specification.new do |gem|
    gem.executables = gem.files.grep(%r{^bin/}).map{ |f| File.basename(f) }
    gem.test_files = gem.files.grep(%r{^(test|spec|features)/})
    gem.require_paths = ["lib"]
-   gem.add_runtime_dependency "mobilize-ssh","1.351"
+   gem.add_runtime_dependency "mobilize-ssh","1.361"
  end
data/test/fixtures/hdfs1.in.yml ADDED
@@ -0,0 +1,91 @@
+ ---
+ - test0: test0
+   test1: test1
+   test2: test2
+   test3: test3
+   test4: test4
+   test5: test5
+   test6: test6
+   test7: test7
+   test8: test8
+   test9: test9
+ - test0: test0
+   test1: test1
+   test2: test2
+   test3: test3
+   test4: test4
+   test5: test5
+   test6: test6
+   test7: test7
+   test8: test8
+   test9: test9
+ - test0: test0
+   test1: test1
+   test2: test2
+   test3: test3
+   test4: test4
+   test5: test5
+   test6: test6
+   test7: test7
+   test8: test8
+   test9: test9
+ - test0: test0
+   test1: test1
+   test2: test2
+   test3: test3
+   test4: test4
+   test5: test5
+   test6: test6
+   test7: test7
+   test8: test8
+   test9: test9
+ - test0: test0
+   test1: test1
+   test2: test2
+   test3: test3
+   test4: test4
+   test5: test5
+   test6: test6
+   test7: test7
+   test8: test8
+   test9: test9
+ - test0: test0
+   test1: test1
+   test2: test2
+   test3: test3
+   test4: test4
+   test5: test5
+   test6: test6
+   test7: test7
+   test8: test8
+   test9: test9
+ - test0: test0
+   test1: test1
+   test2: test2
+   test3: test3
+   test4: test4
+   test5: test5
+   test6: test6
+   test7: test7
+   test8: test8
+   test9: test9
+ - test0: test0
+   test1: test1
+   test2: test2
+   test3: test3
+   test4: test4
+   test5: test5
+   test6: test6
+   test7: test7
+   test8: test8
+   test9: test9
+ - test0: test0
+   test1: test1
+   test2: test2
+   test3: test3
+   test4: test4
+   test5: test5
+   test6: test6
+   test7: test7
+   test8: test8
+   test9: test9
+ - test0: test0
+   test1: test1
+   test2: test2
+   test3: test3
+   test4: test4
+   test5: test5
+   test6: test6
+   test7: test7
+   test8: test8
+   test9: test9
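This fixture holds the same 10×10 grid that the deleted test (at the bottom of this diff) built inline as a TSV string. Rendered as tab-separated text, the grid is exactly 599 characters, which matches both the old test's `length == 599` assertion and the new test's `'min_length' => 599` check:

``` ruby
# Rebuild the fixture grid as TSV, as the old test did inline.
row = (0..9).map { |i| "test#{i}" }       # ["test0", ..., "test9"]
tsv = ([row.join("\t")] * 10).join("\n")  # 10 identical rows
puts tsv.length  # => 599 (10 rows * 59 chars + 9 newlines)
```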
data/test/fixtures/integration_expected.yml ADDED
@@ -0,0 +1,17 @@
+ ---
+ - path: "Runner_mobilize(test)/jobs"
+   state: working
+   count: 1
+   confirmed_ats: []
+ - path: "Runner_mobilize(test)/jobs/hdfs1/stage1"
+   state: working
+   count: 1
+   confirmed_ats: []
+ - path: "Runner_mobilize(test)/jobs/hdfs1/stage2"
+   state: working
+   count: 1
+   confirmed_ats: []
+ - path: "Runner_mobilize(test)/jobs/hdfs1/stage3"
+   state: working
+   count: 1
+   confirmed_ats: []
data/test/fixtures/integration_jobs.yml ADDED
@@ -0,0 +1,10 @@
+ - name: hdfs1
+   active: true
+   trigger: once
+   status: ""
+   stage1: hdfs.write target:"/user/mobilize/test/hdfs1.out",
+     source:"hdfs1.in"
+   stage2: hdfs.write source:"/user/mobilize/test/hdfs1.out",
+     target:"test_cluster_2/user/mobilize/test/hdfs1_copy.out",
+   stage3: gsheet.write source:"hdfs://test_cluster_2/user/mobilize/test/hdfs1_copy.out",
+     target:"Runner_mobilize(test)/hdfs1_copy.out"
data/test/integration/mobilize-hdfs_test.rb ADDED
@@ -0,0 +1,42 @@
+ require 'test_helper'
+ describe "Mobilize" do
+   # enqueues 4 workers on Resque
+   it "runs integration test" do
+
+     puts "restart workers"
+     Mobilize::Jobtracker.restart_workers!
+
+     u = TestHelper.owner_user
+     r = u.runner
+     user_name = u.name
+     gdrive_slot = u.email
+
+     puts "add test data"
+     ["hdfs1.in"].each do |fixture_name|
+       target_url = "gsheet://#{r.title}/#{fixture_name}"
+       TestHelper.write_fixture(fixture_name, target_url, 'replace')
+     end
+
+     puts "add/update jobs"
+     u.jobs.each{|j| j.delete}
+     jobs_fixture_name = "integration_jobs"
+     jobs_target_url = "gsheet://#{r.title}/jobs"
+     TestHelper.write_fixture(jobs_fixture_name, jobs_target_url, 'update')
+
+     puts "job rows added, force enqueue runner, wait for stages"
+     #wait for stages to complete
+     expected_fixture_name = "integration_expected"
+     Mobilize::Jobtracker.stop!
+     r.enqueue!
+     TestHelper.confirm_expected_jobs(expected_fixture_name)
+
+     puts "update job status and activity"
+     r.update_gsheet(gdrive_slot)
+
+     puts "check posted data"
+     ['hdfs1_copy.out'].each do |out_name|
+       url = "gsheet://#{r.title}/#{out_name}"
+       assert TestHelper.check_output(url, 'min_length' => 599) == true
+     end
+   end
+ end
data/test/test_helper.rb CHANGED
@@ -8,3 +8,4 @@ $dir = File.dirname(File.expand_path(__FILE__))
  ENV['MOBILIZE_ENV'] = 'test'
  require 'mobilize-hdfs'
  $TESTING = true
+ require "#{Mobilize::Ssh.home_dir}/test/test_helper"
metadata CHANGED
@@ -1,7 +1,7 @@
  --- !ruby/object:Gem::Specification
  name: mobilize-hdfs
  version: !ruby/object:Gem::Version
-   version: '1.351'
+   version: '1.361'
  prerelease:
  platform: ruby
  authors:
@@ -9,7 +9,7 @@ authors:
  autorequire:
  bindir: bin
  cert_chain: []
- date: 2013-04-25 00:00:00.000000000 Z
+ date: 2013-05-31 00:00:00.000000000 Z
  dependencies:
  - !ruby/object:Gem::Dependency
    name: mobilize-ssh
@@ -18,7 +18,7 @@ dependencies:
      requirements:
      - - '='
        - !ruby/object:Gem::Version
-         version: '1.351'
+         version: '1.361'
      type: :runtime
      prerelease: false
      version_requirements: !ruby/object:Gem::Requirement
@@ -26,7 +26,7 @@ dependencies:
      requirements:
      - - '='
        - !ruby/object:Gem::Version
-         version: '1.351'
+         version: '1.361'
  description: Adds hdfs read, write, and copy support to mobilize-ssh
  email:
  - cpaesleme@dena.com
@@ -46,8 +46,10 @@ files:
  - lib/mobilize-hdfs/version.rb
  - lib/samples/hadoop.yml
  - mobilize-hdfs.gemspec
- - test/hdfs_job_rows.yml
- - test/mobilize-hdfs_test.rb
+ - test/fixtures/hdfs1.in.yml
+ - test/fixtures/integration_expected.yml
+ - test/fixtures/integration_jobs.yml
+ - test/integration/mobilize-hdfs_test.rb
  - test/redis-test.conf
  - test/test_helper.rb
  homepage: http://github.com/dena/mobilize-hdfs
@@ -64,7 +66,7 @@ required_ruby_version: !ruby/object:Gem::Requirement
      version: '0'
    segments:
    - 0
-   hash: -2398358469652320205
+   hash: 2190980204946063989
  required_rubygems_version: !ruby/object:Gem::Requirement
    none: false
    requirements:
@@ -73,7 +75,7 @@ required_rubygems_version: !ruby/object:Gem::Requirement
      version: '0'
    segments:
    - 0
-   hash: -2398358469652320205
+   hash: 2190980204946063989
  requirements: []
  rubyforge_project:
  rubygems_version: 1.8.25
@@ -81,7 +83,9 @@ signing_key:
  specification_version: 3
  summary: Adds hdfs read, write, and copy support to mobilize-ssh
  test_files:
- - test/hdfs_job_rows.yml
- - test/mobilize-hdfs_test.rb
+ - test/fixtures/hdfs1.in.yml
+ - test/fixtures/integration_expected.yml
+ - test/fixtures/integration_jobs.yml
+ - test/integration/mobilize-hdfs_test.rb
  - test/redis-test.conf
  - test/test_helper.rb
data/test/hdfs_job_rows.yml DELETED
@@ -1,10 +0,0 @@
- - name: test_hdfs_1
-   active: true
-   trigger: once
-   status: ""
-   stage1: hdfs.write target:"/user/mobilize/test/test_hdfs_1.out",
-     source:"test_hdfs_1.in"
-   stage2: hdfs.write source:"/user/mobilize/test/test_hdfs_1.out",
-     target:"test_cluster_2/user/mobilize/test/test_hdfs_1_copy.out",
-   stage3: gsheet.write source:"hdfs://test_cluster_2/user/mobilize/test/test_hdfs_1_copy.out",
-     target:"Runner_mobilize(test)/test_hdfs_1_copy.out"
data/test/mobilize-hdfs_test.rb DELETED
@@ -1,70 +0,0 @@
- require 'test_helper'
-
- describe "Mobilize" do
-
-   def before
-     puts 'nothing before'
-   end
-
-   # enqueues 4 workers on Resque
-   it "runs integration test" do
-
-     puts "restart workers"
-     Mobilize::Jobtracker.restart_workers!
-
-     gdrive_slot = Mobilize::Gdrive.owner_email
-     puts "create user 'mobilize'"
-     user_name = gdrive_slot.split("@").first
-     u = Mobilize::User.where(:name=>user_name).first
-     r = u.runner
-     hdfs_1_sheet = Mobilize::Gsheet.find_by_path("#{r.path.split("/")[0..-2].join("/")}/test_hdfs_1.in",gdrive_slot)
-     [hdfs_1_sheet].each {|s| s.delete if s}
-
-     puts "add test_source data"
-     hdfs_1_sheet = Mobilize::Gsheet.find_or_create_by_path("#{r.path.split("/")[0..-2].join("/")}/test_hdfs_1.in",gdrive_slot)
-     hdfs_1_tsv = ([%w{test0 test1 test2 test3 test4 test5 test6 test7 test8 test9}.join("\t")]*10).join("\n")
-     hdfs_1_sheet.write(hdfs_1_tsv,u.name)
-
-     jobs_sheet = r.gsheet(gdrive_slot)
-
-     test_job_rows = ::YAML.load_file("#{Mobilize::Base.root}/test/hdfs_job_rows.yml")
-     test_job_rows.map{|j| r.jobs(j['name'])}.each{|j| j.delete if j}
-     jobs_sheet.add_or_update_rows(test_job_rows)
-
-     hdfs_1_target_sheet = Mobilize::Gsheet.find_by_path("#{r.path.split("/")[0..-2].join("/")}/test_hdfs_1_copy.out",gdrive_slot)
-     [hdfs_1_target_sheet].each {|s| s.delete if s}
-
-     puts "job row added, force enqueued requestor, wait for stages"
-     r.enqueue!
-     wait_for_stages
-
-     puts "jobtracker posted data to test sheet"
-     test_destination_sheet = Mobilize::Gsheet.find_by_path("#{r.path.split("/")[0..-2].join("/")}/test_hdfs_1_copy.out",gdrive_slot)
-
-     assert test_destination_sheet.read(u.name).length == 599
-   end
-
-   def wait_for_stages(time_limit=600,stage_limit=120,wait_length=10)
-     time = 0
-     time_since_stage = 0
-     #check for 10 min
-     while time < time_limit and time_since_stage < stage_limit
-       sleep wait_length
-       job_classes = Mobilize::Resque.jobs.map{|j| j['class']}
-       if job_classes.include?("Mobilize::Stage")
-         time_since_stage = 0
-         puts "saw stage at #{time.to_s} seconds"
-       else
-         time_since_stage += wait_length
-         puts "#{time_since_stage.to_s} seconds since stage seen"
-       end
-       time += wait_length
-       puts "total wait time #{time.to_s} seconds"
-     end
-
-     if time >= time_limit
-       raise "Timed out before stage completion"
-     end
-   end
-
- end