mobilize-base 1.0.0

data/.gitignore ADDED
@@ -0,0 +1,9 @@
+ *.gem
+ *.swp
+ .bundle
+ Gemfile.lock
+ pkg/*
+ lib/mobilize-base/tmp/*
+ config/*
+ log/*
+ .idea/
data/Gemfile ADDED
@@ -0,0 +1,4 @@
+ source "http://rubygems.org"
+
+ # Specify your gem's dependencies in mobilize-base.gemspec
+ gemspec
data/LICENSE.txt ADDED
@@ -0,0 +1,20 @@
+ Copyright (c) Cassio Paes-Leme
+
+ Permission is hereby granted, free of charge, to any person obtaining
+ a copy of this software and associated documentation files (the
+ "Software"), to deal in the Software without restriction, including
+ without limitation the rights to use, copy, modify, merge, publish,
+ distribute, sublicense, and/or sell copies of the Software, and to
+ permit persons to whom the Software is furnished to do so, subject to
+ the following conditions:
+
+ The above copyright notice and this permission notice shall be
+ included in all copies or substantial portions of the Software.
+
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE
+ LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION
+ OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION
+ WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
data/README.md ADDED
@@ -0,0 +1,509 @@
1
+ Mobilize
2
+ ========
3
+
4
+ Mobilize is an end-to-end data transfer workflow manager with:
5
+ * a Google Spreadsheets UI through [google-drive-ruby][google_drive_ruby];
6
+ * a queue manager through [Resque][resque];
7
+ * a persistent caching / database layer through [Mongoid][mongoid];
8
+ * gems for data transfers to/from Hive, mySQL, and HTTP endpoints
9
+ (coming soon).
10
+
11
+ Mobilize-Base includes all the core scheduling and processing
12
+ functionality, allowing you to:
13
+ * put workers on the Mobilize Resque queue.
14
+ * create [Requestors](#section_Start_Requestors_Requestor) and their associated Google Spreadsheet [Jobspecs](#section_Start_Requestors_Jobspec);
15
+ * poll for [Jobs](#section_Job) on Jobspecs (currently gsheet to gsheet only) and add them to Resque;
16
+ * monitor the status of Jobs on a rolling log.
17
+
18
+ Table Of Contents
19
+ -----------------
20
+ * [Overview](#section_Overview)
21
+ * [Install](#section_Install)
22
+ * [Redis](#section_Install_Redis)
23
+ * [MongoDB](#section_Install_MongoDB)
24
+ * [Mobilize-Base](#section_Install_Mobilize-Base)
25
+ * [Default Folders and Files](#section_Install_Folders_and_Files)
26
+ * [Configure](#section_Configure)
27
+ * [Google Drive](#section_Configure_Google_Drive)
28
+ * [Jobtracker](#section_Configure_Jobtracker)
29
+ * [Mongoid](#section_Configure_Mongoid)
30
+ * [Resque](#section_Configure_Resque)
31
+ * [Start](#section_Start)
32
+ * [Start resque-web](#section_Start_Start_resque-web)
33
+ * [Set Environment](#section_Start_Set_Environment)
34
+ * [Create Requestor](#section_Start_Create_Requestor)
35
+ * [Start Workers](#section_Start_Start_Workers)
36
+ * [View Logs](#section_Start_View_Logs)
37
+ * [Start Jobtracker](#section_Start_Start_Jobtracker)
38
+ * [Create Job](#section_Start_Create_Job)
39
+ * [Run Test](#section_Start_Run_Test)
40
+ * [Meta](#section_Meta)
41
+ * [Author](#section_Author)
42
+
43
+ <a name='section_Overview'></a>
44
+ Overview
45
+ -----------
46
+
47
+ * Mobilize is a fun, centralized way to access data that lives in multiple different technologies, under one roof understood by everyone: spreadsheets!
48
+ * Mobilize can transfer data to and from diverse technologies such as Hive, HDFS, HBase, various APIs, and different databases, so that people who are already comfortable with Excel sheets can interact with these technologies and be productive.
49
+ * The spreadsheets are currently hosted in the cloud on Google Spreadsheets, so that you can access them anywhere - even on your tablets.
50
+ * Mobilize is pluggable and extensible, so if tomorrow you want to access data from a cool new database technology, you can just add a module for that.
51
+
52
+
53
+ <a name='section_Install'></a>
54
+ Install
55
+ ------------
56
+
57
+ Mobilize requires Ruby 1.9.3, and has been tested on OSX and Ubuntu.
58
+
59
+ [RVM][rvm] is great for managing your rubies.
60
+
61
+ <a name='section_Install_Redis'></a>
62
+ ### Redis
63
+
64
+ Redis is a pre-requisite for running Resque.
65
+
66
+ Please refer to the [Resque Redis Section][resque_redis] for complete
67
+ instructions.
68
+
69
+ <a name='section_Install_MongoDB'></a>
70
+ ### MongoDB
71
+
72
+ MongoDB is used to persist caches between reads and writes, keep track
73
+ of Requestors and Jobs, and store Datasets that map to endpoints.
74
+
75
+ Please refer to the [MongoDB Quickstart Page][mongodb_quickstart] to get started.
76
+
77
+ The settings for database and port are set in config/mongoid.yml
78
+ and are best left as default. Please refer to [Configure
79
+ Mongoid](#section_Configure_Mongoid) for details.
80
+
81
+ <a name='section_Install_Mobilize-Base'></a>
82
+ ### Mobilize-Base
83
+
84
+ Mobilize-Base contains all of the gems it needs to run.
85
+
86
+ add this to your Gemfile:
87
+
88
+ ``` ruby
89
+ gem "mobilize-base", "~>1.0"
90
+ ```
91
+
92
+ or do
93
+
94
+     $ gem install mobilize-base
95
+
96
+ for a ruby-wide install.
97
+
98
+ <a name='section_Install_Folders_and_Files'></a>
99
+ ### Folders and Files
100
+
101
+ Mobilize requires a config folder and a log folder.
102
+
103
+ If you're on Rails, it will use the built-in config and log folders.
104
+
105
+ Otherwise, it will use log and config folders in the project folder (the
106
+ same one that contains your Rakefile)
107
+
108
+ ### Rakefile
109
+
110
+ Inside the Rakefile in your project's root folder, make sure you have:
111
+
112
+ ``` ruby
113
+ require 'mobilize-base/tasks'
114
+ ```
115
+
116
+ This defines tasks essential to run the environment.
117
+
118
+ ### Config and Log Folders
119
+
120
+ run
121
+
122
+     $ rake mobilize:setup
123
+
124
+ Mobilize will create config and log folders at the project root
125
+ level. (same as the Rakefile)
126
+
127
+ It will also create all required config files, which are detailed below.
128
+
129
+ Resque will create a mobilize-resque-`<environment>`.log in the log folder,
130
+ and loop over 10 files, 10MB each.
131
+
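The rotation described above (10 files of 10MB each) is the kind of policy Ruby's standard `Logger` supports through its `shift_age` and `shift_size` arguments. The following is a minimal sketch of that policy, not necessarily how mobilize-base wires up its logger; the helper name `build_mobilize_logger` and the temporary log folder are assumptions made for illustration.

```ruby
require 'logger'
require 'fileutils'
require 'tmpdir'

# Hypothetical helper (not part of mobilize-base): builds a size-rotated logger.
# Logger.new(path, shift_age, shift_size) keeps up to `shift_age` files and
# rotates once the current file grows past `shift_size` bytes.
def build_mobilize_logger(log_dir, environment, files = 10, max_bytes = 10 * 1024 * 1024)
  FileUtils.mkdir_p(log_dir)
  path = File.join(log_dir, "mobilize-resque-#{environment}.log")
  Logger.new(path, files, max_bytes)
end

# A temporary folder stands in for the project's log folder so the sketch
# can be run anywhere without touching a real project.
LOG_DIR = File.join(Dir.mktmpdir, 'log')
logger = build_mobilize_logger(LOG_DIR, 'development')
logger.info('worker started')
logger.close
```

In a real project the first argument would be the `log` folder created by `rake mobilize:setup`.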
132
+ <a name='section_Configure'></a>
133
+ Configure
134
+ ------------
135
+
136
+ All Mobilize configurations live in files in `config/*.yml`. Samples can
137
+ be found below or on github in the [lib/samples][git_samples] folder.
138
+
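Each config file is keyed by environment name at the top level, so reading a setting comes down to loading the YAML and picking one section. The sketch below illustrates that layout with values copied from the resque.yml sample in this README; the helper `config_for` is hypothetical and is not the gem's own loader, which would read the text from `config/*.yml` instead of an inline string.

```ruby
require 'yaml'

# Hypothetical helper: returns the settings hash for one environment.
def config_for(yaml_text, environment)
  settings = YAML.load(yaml_text)
  settings.fetch(environment) do
    raise KeyError, "no '#{environment}' section in config"
  end
end

# Inline copy of the resque.yml sample; a real project would use
# File.read('config/resque.yml') instead.
RESQUE_YML = <<-YML
development:
  queue_name: 'mobilize'
  max_workers: 4
  redis_port: 6379
test:
  queue_name: 'mobilize'
  max_workers: 4
  redis_port: 9736
production:
  queue_name: 'mobilize'
  max_workers: 36
  redis_port: 6379
YML

resque_config = config_for(RESQUE_YML, 'test')
puts resque_config['redis_port']
```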
139
+ <a name='section_Configure_Google_Drive'></a>
140
+ ### Configure Google Drive
141
+
142
+ Google drive needs:
143
+ * an owner email address and password. You can set up separate owners
144
+ for different environments as in the below file, which will keep your
145
+ mission critical workers from getting rate-limit errors.
146
+ * one or more admins with email attributes -- these will be for people
147
+ who should be given write permissions to ALL Mobilize sheets, for
148
+ maintenance purposes.
149
+ * one or more workers with email and pw attributes -- they will be used
150
+ to queue up google reads and writes. This can be the same as the owner
151
+ account for testing purposes or low-volume environments.
152
+
153
+ __Mobilize only allows one Resque
154
+ worker at a time to use a Google drive worker account for
155
+ reading/writing.__
156
+
157
+ Sample gdrive.yml:
158
+
159
+ ``` yml
+
+ development:
+   owner:
+     email: 'owner_development@host.com'
+     pw: "google_drive_password"
+   admins:
+     - {email: 'admin@host.com'}
+   workers:
+     - {email: 'worker_development001@host.com', pw: "worker001_google_drive_password"}
+     - {email: 'worker_development002@host.com', pw: "worker002_google_drive_password"}
+ test:
+   owner:
+     email: 'owner_test@host.com'
+     pw: "google_drive_password"
+   admins:
+     - {email: 'admin@host.com'}
+   workers:
+     - {email: 'worker_test001@host.com', pw: "worker001_google_drive_password"}
+     - {email: 'worker_test002@host.com', pw: "worker002_google_drive_password"}
+ production:
+   owner:
+     email: 'owner_production@host.com'
+     pw: "google_drive_password"
+   admins:
+     - {email: 'admin@host.com'}
+   workers:
+     - {email: 'worker_production001@host.com', pw: "worker001_google_drive_password"}
+     - {email: 'worker_production002@host.com', pw: "worker002_google_drive_password"}
+
+ ```
190
+
191
+ <a name='section_Configure_Jobtracker'></a>
192
+ ### Configure Jobtracker
193
+
194
+ The Jobtracker sits on your Resque and does 2 things:
195
+ * check for Requestors that are due for polling;
196
+ * send out notifications when:
197
+ * there are failed jobs on Resque;
198
+ * there are jobs on Resque that have run beyond the max run time.
199
+
200
+ Emails are sent using ActionMailer, through the owner Google Drive
201
+ account.
202
+
203
+ To this end, it needs these parameters, for which there is a sample
204
+ below and in the [lib/samples][git_samples] folder:
205
+
206
+ ``` yml
+ development:
+   cycle_freq: 10 #10 secs between Jobtracker sweeps
+   notification_freq: 3600 #1 hour between failure/timeout notifications
+   requestor_refresh_freq: 300 #5 min between requestor checks
+   max_run_time: 14400 # if a job runs for 4h+, notification will be sent
+   admins: #emails to send notifications to
+     - {email: 'admin@host.com'}
+ test:
+   cycle_freq: 10 #10 secs between Jobtracker sweeps
+   notification_freq: 3600 #1 hour between failure/timeout notifications
+   requestor_refresh_freq: 300 #5 min between requestor checks
+   max_run_time: 14400 # if a job runs for 4h+, notification will be sent
+   admins: #emails to send notifications to
+     - {email: 'admin@host.com'}
+
+ production:
+   cycle_freq: 10 #10 secs between Jobtracker sweeps
+   notification_freq: 3600 #1 hour between failure/timeout notifications
+   requestor_refresh_freq: 300 #5 min between requestor checks
+   max_run_time: 14400 # if a job runs for 4h+, notification will be sent
+   admins: #emails to send notifications to
+     - {email: 'admin@host.com'}
+ ```
230
+
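To make the `max_run_time` and `notification_freq` settings concrete, here is a rough sketch of the two checks they imply. The job hashes, the helper names `timed_out_jobs` and `notification_due?`, and the fixed clock are all assumptions for illustration; the real Jobtracker works from the state Resque keeps and may be structured differently.

```ruby
# Values taken from the jobtracker.yml sample above (both in seconds).
JOBTRACKER_CONFIG = { 'notification_freq' => 3600, 'max_run_time' => 14_400 }

# Hypothetical helper: jobs that have been running longer than max_run_time.
def timed_out_jobs(jobs, now, max_run_time)
  jobs.select { |job| now - job[:started_at] > max_run_time }
end

# Hypothetical helper: true when no notification has been sent yet, or when
# at least notification_freq seconds have passed since the last one.
def notification_due?(last_notified_at, now, notification_freq)
  last_notified_at.nil? || now - last_notified_at >= notification_freq
end

now = Time.utc(2012, 11, 1, 12, 0, 0)
jobs = [
  { :name => 'short_job', :started_at => now - 600 },       # running for 10 minutes
  { :name => 'stuck_job', :started_at => now - 5 * 3600 }   # running for 5 hours
]

late = timed_out_jobs(jobs, now, JOBTRACKER_CONFIG['max_run_time'])
puts late.map { |job| job[:name] }.inspect
```

Passing the clock in as an argument keeps both checks easy to exercise without waiting for real time to elapse.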
231
+ <a name='section_Configure_Mongoid'></a>
232
+ ### Configure Mongoid
233
+
234
+ Mongoid is the abstraction layer on top of MongoDB so we can interact
235
+ with it in an ActiveRecord-like fashion.
236
+
237
+ It needs the below parameters, which can be found in the [lib/samples][git_samples] folder.
238
+
239
+ You shouldn't need to change anything in this file.
240
+
241
+ ``` yml
+ development:
+   sessions:
+     default:
+       database: mobilize-development
+       persist_in_safe_mode: true
+       hosts:
+         - 127.0.0.1:27017
+ test:
+   sessions:
+     default:
+       database: mobilize-test
+       persist_in_safe_mode: true
+       hosts:
+         - 127.0.0.1:27017
+ production:
+   sessions:
+     default:
+       database: mobilize-production
+       persist_in_safe_mode: true
+       hosts:
+         - 127.0.0.1:27017
+ ```
264
+
265
+ <a name='section_Configure_Resque'></a>
266
+ ### Configure Resque
267
+
268
+ Resque keeps track of Jobs, Workers and logging.
269
+
270
+ It needs the below parameters, which can be found in the [lib/samples][git_samples] folder.
271
+
272
+ * queue_name - the name of the Resque queue where you would like the Jobtracker and Resque Workers to
273
+ run. Default is mobilize.
274
+ * max_workers - the total number of simultaneous workers you would like
275
+ on your queue. Default is 4 for development and test, 36 in
276
+ production, but feel free to adjust depending on your hardware.
277
+ * redis_port - you should probably leave this alone, it specifies the
278
+ default port for dev and prod and a separate one for testing.
279
+
280
+ ``` yml
+ development:
+   queue_name: 'mobilize'
+   max_workers: 4
+   redis_port: 6379
+ test:
+   queue_name: 'mobilize'
+   max_workers: 4
+   redis_port: 9736
+ production:
+   queue_name: 'mobilize'
+   max_workers: 36
+   redis_port: 6379
+ ```
294
+
295
+ <a name='section_Start'></a>
296
+ Start
297
+ -----
298
+
299
+ A Mobilize instance can be considered "started" or "running" when you have:
300
+
301
+ 1. Resque workers running on the Mobilize queue;
302
+ 2. A Jobtracker running on one of the Resque workers;
303
+ 3. One or more Requestors created in your MongoDB;
304
+ 4. One or more Jobs created in a Requestor's Jobspec;
305
+
306
+ <a name='section_Start_Start_resque-web'></a>
307
+ ### Start resque-web
308
+
309
+ To start resque-web, which is a kickass UI layer built in Sinatra,
310
+ you'll need to install the resque gem explicitly, as in
311
+
312
+ ``` sh
+ gem install resque
+ ```
315
+
316
+ then, you can do
317
+
318
+     $ resque-web
319
+
320
+ and it'll start an instance on 127.0.0.1:5678
321
+
322
+ You'll want to keep an eye on this as it tracks your workers in real
323
+ time and allows you to keep track of failed jobs. More detail on the
324
+ [Resque Standalone section][resque-web].
325
+
326
+ <a name='section_Start_Set_Environment'></a>
327
+ ### Set Environment
328
+
329
+ Mobilize takes the environment from Rails.env if you're running
329
+ Rails; otherwise it reads the MOBILIZE_ENV environment variable, and
330
+ assumes "development" if neither is set. Valid values are "development", "test", and "production", as per the yml files.
331
+
332
+ You can set MOBILIZE_ENV from irb before requiring the gem, as in:
334
+
335
+ ``` ruby
336
+ > ENV['MOBILIZE_ENV'] = 'production'
337
+ > require 'mobilize-base'
338
+ ```
339
+ This affects all parameters as set in the yml files, including the
340
+ database.
341
+
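The lookup order for the environment can be sketched as a small function. This is an illustration of the precedence described in this section (Rails first, then MOBILIZE_ENV, then "development"), not the gem's actual code; `resolve_environment` and `VALID_ENVIRONMENTS` are hypothetical names.

```ruby
VALID_ENVIRONMENTS = %w[development test production]

# Hypothetical helper mirroring the lookup order described above.
# `rails_env` stands in for Rails.env and is nil when Rails is not loaded;
# `env_vars` is anything that responds to [] like ENV does.
def resolve_environment(rails_env, env_vars)
  name = (rails_env || env_vars['MOBILIZE_ENV'] || 'development').to_s
  unless VALID_ENVIRONMENTS.include?(name)
    raise ArgumentError, "unknown environment: #{name}"
  end
  name
end

rails_env = defined?(Rails) ? Rails.env : nil
puts resolve_environment(rails_env, ENV)
```

Rejecting unknown names early is a design choice of this sketch: every yml file is keyed by these three names, so a typo would otherwise surface later as a missing config section.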
342
+ <a name='section_Start_Create_Requestor'></a>
343
+ ### Create Requestor
344
+
345
+ Requestors are people who use the Mobilize service to move data from one
346
+ endpoint to another. They each have a Jobspec, which is a google sheet
347
+ that contains one or more Jobs.
348
+
349
+ To create a requestor, use the Requestor.find_or_create_by_email
350
+ command in irb (replace the user with your own email, or any email
351
+ google recognizes).
352
+
353
+ ``` ruby
354
+ > Requestor.find_or_create_by_email("user@host.com")
355
+ ```
356
+
357
+ <a name='section_Start_Start_Workers'></a>
358
+ ### Start Workers
359
+
360
+ Workers are rake tasks that load the Mobilize environment and allow the
361
+ processing of the Jobtracker, Requestors and Jobs.
362
+
363
+ The command below starts as many workers as are defined in your resque.yml.
364
+
365
+ To start workers, do:
366
+
367
+ ``` ruby
368
+ > Jobtracker.prep_workers
369
+ ```
370
+
371
+ if you have workers already running and would like to kill and refresh
372
+ them, do:
373
+
374
+ ``` ruby
375
+ > Jobtracker.restart_workers!
376
+ ```
377
+
378
+ Note that this will kill any workers on the Mobilize queue.
379
+
380
+ <a name='section_Start_View_Logs'></a>
381
+ ### View Logs
382
+
383
+ At this point, you'll want to start viewing the logs for the Resque
384
+ workers -- they will be stored under your log folder. You can do:
385
+
386
+ $ tail -f log/mobilize-`<environment>`.log
387
+
388
+ to view them.
389
+
390
+ <a name='section_Start_Start_Jobtracker'></a>
391
+ ### Start Jobtracker
392
+
393
+ Once the Resque workers are running, and you have at least one Requestor
394
+ set up, it's time to start the Jobtracker:
395
+
396
+ ``` ruby
397
+ > Jobtracker.start
398
+ ```
399
+
400
+ The Jobtracker will automatically enqueue any Requestors that have not
401
+ been processed in the requestor_refresh_freq period defined in the
402
+ jobtracker.yml, and create their Jobspecs if they do not exist. You can
403
+ see this process on your Resque UI and in the log file.
404
+
405
+ <a name='section_Start_Create_Job'></a>
406
+ ### Create Job
407
+
408
+ Now it's time to go onto the Jobspec and add a Job to be processed.
409
+
410
+ To do this, you should log into your Google Drive with either the
411
+ owner's account, an admin account, or the Jobspec Requestor's account. These
412
+ will be the accounts with edit permissions to a given Jobspec.
413
+
414
+ Navigate to the Jobs tab on the Jobspec `(denoted by Jobspec_<requestor
415
+ name>)` and enter values under each header:
416
+
417
+ * name This is the name of the job you would like to add. Names must be unique across all your jobs, otherwise you will get an error
418
+
419
+ * active set this to blank or FALSE if you want to turn off a job
420
+
421
+ * schedule This uses human readable syntax to schedule jobs. It accepts the following:
422
+ * every `<integer>` hour -- fire the job at increments of `<integer>` hours, minimum of 1 hour
423
+ * every `<integer>` day -- fire the job at increments of `<integer>` days, minimum of 1
424
+ * every `<integer>` day after `<HH:MM>` -- fire the job at increments of `<integer>` days, after HH:MM UTC time
425
+ * every `<integer>` day_of_week after `<HH:MM>` -- fire the job on specified day of week, after HH:MM UTC time; 1=Sunday
426
+ * every `<integer>` day_of_month after `<HH:MM>` -- fire the job on specified day of month, after HH:MM UTC time
427
+ * once -- fire the job once if active is set to TRUE, set active to FALSE right after
428
+ * after `<jobname>` -- fire the job after the job named `<jobname>`
429
+
430
+ * status Mobilize writes this field with the last status returned by the job
431
+
432
+ * last_error Mobilize writes any errors to this field, and wipes it if
433
+ the job completes successfully.
434
+
435
+ * destination_url Mobilize writes this field with a link to the last dataset returned by the job, blank if none
436
+
437
+ * read_handler This is where the job reads its data from. For
438
+ mobilize-base, you should enter "gsheet"
439
+
440
+ * write_handler This is where the job writes its data to. For
441
+ mobilize-base, you should enter "gsheet"
442
+
443
+ * param_source This is the path to an array of data, as read from a google sheet,
444
+ that is relayed to the job.
445
+ The format is `<google docs book>/<google docs sheet>`, so if you
446
+ wanted to read from the "output" sheet on the "monthly results" book you
447
+ would write in `<monthly results>/<output>`. For a sheet in the Jobspec
448
+ itself you could write simply `<output>`.
449
+
450
+ * params This is a hash of data, expressed as JSON. Not relevant to
451
+ mobilize-base
452
+
453
+ * destination This is the destination for the data, relayed to the job.
454
+ For a gsheet write_handler, this would be the name of the sheet to be
455
+ written to, similar to param_source.
456
+
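The schedule strings listed above follow a small grammar, which the sketch below turns into a structured value. It is an illustrative parser only: the regular expressions, the returned hash layout, and the name `parse_schedule` are assumptions, not Mobilize's own implementation. It is also slightly more permissive than the list, since it accepts an `after HH:MM` suffix on any unit.

```ruby
# Longer unit names come first so the alternation reads unambiguously.
SCHEDULE_PATTERN =
  /\Aevery (\d+) (hour|day_of_week|day_of_month|day)(?: after (\d{1,2}):(\d{2}))?\z/

# Hypothetical parser for the schedule column of a Jobspec.
def parse_schedule(text)
  text = text.strip
  return { :type => :once } if text == 'once'

  if (m = text.match(/\Aafter (.+)\z/))
    return { :type => :after_job, :job => m[1] }
  end

  m = SCHEDULE_PATTERN.match(text)
  raise ArgumentError, "unrecognized schedule: #{text.inspect}" unless m

  count = m[1].to_i
  raise ArgumentError, 'interval must be at least 1' if count < 1

  schedule = { :type => :every, :count => count, :unit => m[2].to_sym }
  # Normalize the optional UTC time to zero-padded HH:MM.
  schedule[:after] = format('%02d:%02d', m[3].to_i, m[4].to_i) if m[3]
  schedule
end

puts parse_schedule('every 1 day after 04:00').inspect
```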
457
+ <a name='section_Start_Run_Test'></a>
458
+ ### Run Test
459
+
460
+ To run tests, you will need to:
461
+
462
+ 1) clone the repository;
463
+
464
+ 2) from the project folder, run
465
+
466
+     $ rake mobilize:setup
467
+
468
+ and populate the "test" environment in the config files with the
469
+ necessary details;
470
+
471
+ 3) run `rake test` from the same folder.
472
+
473
+ This will create a test Jobspec with a sample job. These will run off a
474
+ test redis instance which will be killed once the tests finish.
475
+
476
+ <a name='section_Meta'></a>
477
+ Meta
478
+ ----
479
+
480
+ * Code: `git clone git://github.com/ngmoco/mobilize-base.git`
481
+ * Home: <https://github.com/ngmoco/mobilize-base>
482
+ * Bugs: <https://github.com//mobilize-base/issues>
483
+ * Gems: <http://rubygems.org/gems/mobilize-base>
484
+
485
+ <a name='section_Author'></a>
486
+ Author
487
+ ------
488
+
489
+ Cassio Paes-Leme :: cpaesleme@ngmoco.com :: @cpaesleme
490
+
491
+ <a name='section_Special_Thanks'></a>
492
+ Special Thanks
493
+ --------------
494
+
495
+ * Al Thompson and Sagar Mehta for awesome design advice and discussions
496
+ * Elliott Clark for enlightening me to the wonders of Resque
497
+ * Bob Colner for pointing me to google-drive-ruby when I tried to
498
+ reinvent the wheel
499
+ * ngmoco:) and DeNA Global for supporting and adopting the Mobilize
500
+ platform
501
+
502
+ [google_drive_ruby]: https://github.com/gimite/google-drive-ruby
503
+ [resque]: https://github.com/defunkt/resque
504
+ [mongoid]: http://mongoid.org/en/mongoid/index.html
505
+ [resque_redis]: https://github.com/defunkt/resque#section_Installing_Redis
506
+ [mongodb_quickstart]: http://www.mongodb.org/display/DOCS/Quickstart
507
+ [git_samples]: https://github.ngmoco.com/Ngpipes/mobilize-base/tree/master/lib/samples
508
+ [rvm]: https://rvm.io/
509
+ [resque-web]: https://github.com/defunkt/resque#standalone
data/Rakefile ADDED
@@ -0,0 +1,34 @@
+ require 'rubygems'
+
+ begin
+   require 'bundler/setup'
+ rescue LoadError => e
+   warn e.message
+   warn "Run `gem install bundler` to install Bundler"
+   exit -1
+ end
+
+ #
+ # Bundler
+ #
+ require "bundler/gem_tasks"
+
+ #
+ # Setup
+ #
+ $LOAD_PATH.unshift 'lib'
+ require 'mobilize-base/tasks'
+
+
+ #
+ # Tests
+ #
+ require 'rake/testtask'
+
+ Rake::TestTask.new do |test|
+   test.verbose = true
+   test.libs << "test"
+   test.libs << "lib"
+   test.test_files = FileList['test/**/*_test.rb']
+ end
+ task :default => :test
@@ -0,0 +1,22 @@
+
+ class Array
+   def sel(&blk)                  # shorthand for Array#select
+     return self.select(&blk)
+   end
+   def group_count                # maps each element to its number of occurrences
+     counts = Hash.new(0)
+     self.each { |m| counts[m] += 1 }
+     return counts
+   end
+   def sum                        # returns nil for an empty array (inject has no seed)
+     return self.inject { |acc, x| acc + x }
+   end
+   def hash_array_to_tsv          # header row from the first hash's keys, one row per hash
+     if self.first.nil? or self.first.class != Hash
+       return ""
+     end
+     header = self.first.keys.join("\t")
+     rows = self.map { |r| r.values.join("\t") }
+     ([header] + rows).join("\n")
+   end
+ end
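A short usage sketch for two of the helpers above. The two methods are repeated in condensed form only so the snippet runs on its own; the sample rows and statuses are made up for illustration.

```ruby
# Condensed copy of two methods from the extension above.
class Array
  def group_count
    counts = Hash.new(0)
    each { |m| counts[m] += 1 }
    counts
  end

  def hash_array_to_tsv
    return "" unless first.is_a?(Hash)
    header = first.keys.join("\t")
    rows = map { |r| r.values.join("\t") }
    ([header] + rows).join("\n")
  end
end

# Made-up rows, shaped like job records read from a sheet.
rows = [
  { 'name' => 'job1', 'status' => 'ok' },
  { 'name' => 'job2', 'status' => 'failed' }
]

tsv = rows.hash_array_to_tsv
counts = %w[ok failed ok].group_count

puts tsv
puts counts.inspect
```

Note that `hash_array_to_tsv` takes the header from the first hash only, so it assumes every row has the same keys in the same order, and it does not escape tabs or newlines inside values.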