massive 0.1.0

Files changed (55)
  1. checksums.yaml +15 -0
  2. data/.gitignore +22 -0
  3. data/.rspec +3 -0
  4. data/.rvmrc +1 -0
  5. data/.travis.yml +7 -0
  6. data/Gemfile +19 -0
  7. data/Gemfile.lock +141 -0
  8. data/Guardfile +9 -0
  9. data/LICENSE.txt +22 -0
  10. data/README.md +196 -0
  11. data/Rakefile +8 -0
  12. data/lib/massive.rb +63 -0
  13. data/lib/massive/cancelling.rb +20 -0
  14. data/lib/massive/file.rb +80 -0
  15. data/lib/massive/file_job.rb +9 -0
  16. data/lib/massive/file_process.rb +7 -0
  17. data/lib/massive/file_step.rb +7 -0
  18. data/lib/massive/job.rb +115 -0
  19. data/lib/massive/locking.rb +27 -0
  20. data/lib/massive/memory_consumption.rb +15 -0
  21. data/lib/massive/notifications.rb +40 -0
  22. data/lib/massive/notifiers.rb +6 -0
  23. data/lib/massive/notifiers/base.rb +32 -0
  24. data/lib/massive/notifiers/pusher.rb +17 -0
  25. data/lib/massive/process.rb +69 -0
  26. data/lib/massive/process_serializer.rb +12 -0
  27. data/lib/massive/retry.rb +49 -0
  28. data/lib/massive/status.rb +59 -0
  29. data/lib/massive/step.rb +143 -0
  30. data/lib/massive/step_serializer.rb +12 -0
  31. data/lib/massive/timing_support.rb +10 -0
  32. data/lib/massive/version.rb +3 -0
  33. data/massive.gemspec +23 -0
  34. data/spec/fixtures/custom_job.rb +4 -0
  35. data/spec/fixtures/custom_step.rb +19 -0
  36. data/spec/models/massive/cancelling_spec.rb +83 -0
  37. data/spec/models/massive/file_job_spec.rb +24 -0
  38. data/spec/models/massive/file_spec.rb +209 -0
  39. data/spec/models/massive/file_step_spec.rb +22 -0
  40. data/spec/models/massive/job_spec.rb +319 -0
  41. data/spec/models/massive/locking_spec.rb +52 -0
  42. data/spec/models/massive/memory_consumption_spec.rb +24 -0
  43. data/spec/models/massive/notifications_spec.rb +107 -0
  44. data/spec/models/massive/notifiers/base_spec.rb +48 -0
  45. data/spec/models/massive/notifiers/pusher_spec.rb +49 -0
  46. data/spec/models/massive/process_serializer_spec.rb +38 -0
  47. data/spec/models/massive/process_spec.rb +235 -0
  48. data/spec/models/massive/status_spec.rb +104 -0
  49. data/spec/models/massive/step_serializer_spec.rb +40 -0
  50. data/spec/models/massive/step_spec.rb +490 -0
  51. data/spec/models/massive/timing_support_spec.rb +55 -0
  52. data/spec/shared/step_context.rb +25 -0
  53. data/spec/spec_helper.rb +42 -0
  54. data/spec/support/mongoid.yml +78 -0
  55. metadata +175 -0
checksums.yaml ADDED
@@ -0,0 +1,15 @@
1
+ ---
2
+ !binary "U0hBMQ==":
3
+ metadata.gz: !binary |-
4
+ NDgzMGYwNDUzYmUyZTIzYTYwNmJiMzU1MzdkYTY1MGY5OWUzYmYwOQ==
5
+ data.tar.gz: !binary |-
6
+ ODUxZDVlMjA3YWVmZjE0MjEzNTdhODU5ODRiNWU2ODg0YTBhZTE3YQ==
7
+ SHA512:
8
+ metadata.gz: !binary |-
9
+ OTE3ODEyZjc5NDVjYWI4MWVmMDk1YzkwYjFkYTJjZGM1YWU1ZmY0YjJiYzY4
10
+ MTZhZjU0ZGU0MGUwYTUzZWNjOWE4YmMyMDE5MzEzNTE4MGM4YjM1MzFhODEy
11
+ YzIyNTA1M2RmOWI4MDk1NTAyZjRkOTMyNjQ4ZjM0Mzk5NmFlNGU=
12
+ data.tar.gz: !binary |-
13
+ NTM0ZGQ5MDc0NGYwYjdmYTYwYjdlNWZiOWJhODFjZThlNjY3ZWJjZWEzMmI3
14
+ YWYzN2RkMDlkODllNzczNzMzZDdjMjk0ZGUwOGE5NTJhMmI4MTRhY2E0ZTNk
15
+ YWU0YWQ5YjQ3NjE2MDQxNDk3MTcyMTkwOWM0NjU0OTUwMThkNWI=
data/.gitignore ADDED
@@ -0,0 +1,22 @@
1
+ *.gem
2
+ *.rbc
3
+ .bundle
4
+ .config
5
+ .yardoc
6
+ .DS_Store
7
+ InstalledFiles
8
+ _yardoc
9
+ coverage
10
+ doc/
11
+ lib/bundler/man
12
+ pkg
13
+ rdoc
14
+ spec/reports
15
+ test/tmp
16
+ test/version_tmp
17
+ tmp
18
+ spec/dummy/db/*.sqlite3
19
+ spec/dummy/log/*.log
20
+ spec/dummy/tmp/
21
+ spec/dummy/.sass-cache
22
+ vendor/bundle
data/.rspec ADDED
@@ -0,0 +1,3 @@
1
+ --color
2
+ --format documentation
3
+ --drb
data/.rvmrc ADDED
@@ -0,0 +1 @@
1
+ rvm use 1.9.3@massive --create
data/.travis.yml ADDED
@@ -0,0 +1,7 @@
1
+ language: ruby
2
+ rvm:
3
+ - 1.9.3
4
+ services:
5
+ - mongodb
6
+ - redis-server
7
+ bundler_args: --without development
data/Gemfile ADDED
@@ -0,0 +1,19 @@
1
+ source 'https://rubygems.org'
2
+
3
+ gemspec
4
+
5
+ gem 'rake'
6
+
7
+ gem 'fog'
8
+
9
+ group(:development) do
10
+ gem 'debugger'
11
+ gem 'guard-rspec', require: false
12
+ gem 'terminal-notifier-guard'
13
+ end
14
+
15
+ group(:test) do
16
+ gem 'simplecov', require: false
17
+ gem 'rspec'
18
+ gem 'database_cleaner'
19
+ end
data/Gemfile.lock ADDED
@@ -0,0 +1,141 @@
1
+ PATH
2
+ remote: .
3
+ specs:
4
+ massive (0.1.0)
5
+ active_model_serializers
6
+ file_processor (= 0.2.0)
7
+ mongoid (~> 3.1.x)
8
+ resque
9
+
10
+ GEM
11
+ remote: https://rubygems.org/
12
+ specs:
13
+ active_model_serializers (0.8.1)
14
+ activemodel (>= 3.0)
15
+ activemodel (3.2.17)
16
+ activesupport (= 3.2.17)
17
+ builder (~> 3.0.0)
18
+ activesupport (3.2.17)
19
+ i18n (~> 0.6, >= 0.6.4)
20
+ multi_json (~> 1.0)
21
+ builder (3.0.4)
22
+ celluloid (0.15.2)
23
+ timers (~> 1.1.0)
24
+ celluloid-io (0.15.0)
25
+ celluloid (>= 0.15.0)
26
+ nio4r (>= 0.5.0)
27
+ coderay (1.1.0)
28
+ columnize (0.3.6)
29
+ database_cleaner (1.2.0)
30
+ debugger (1.6.2)
31
+ columnize (>= 0.3.1)
32
+ debugger-linecache (~> 1.2.0)
33
+ debugger-ruby_core_source (~> 1.2.3)
34
+ debugger-linecache (1.2.0)
35
+ debugger-ruby_core_source (1.2.3)
36
+ diff-lcs (1.2.4)
37
+ excon (0.31.0)
38
+ ffi (1.9.3)
39
+ file_processor (0.2.0)
40
+ fog (1.20.0)
41
+ builder
42
+ excon (~> 0.31.0)
43
+ formatador (~> 0.2.0)
44
+ mime-types
45
+ multi_json (~> 1.0)
46
+ net-scp (~> 1.1)
47
+ net-ssh (>= 2.1.3)
48
+ nokogiri (>= 1.5.11)
49
+ formatador (0.2.4)
50
+ guard (2.5.1)
51
+ formatador (>= 0.2.4)
52
+ listen (~> 2.6)
53
+ lumberjack (~> 1.0)
54
+ pry (>= 0.9.12)
55
+ thor (>= 0.18.1)
56
+ guard-rspec (4.2.8)
57
+ guard (~> 2.1)
58
+ rspec (>= 2.14, < 4.0)
59
+ i18n (0.6.9)
60
+ listen (2.7.0)
61
+ celluloid (>= 0.15.2)
62
+ celluloid-io (>= 0.15.0)
63
+ rb-fsevent (>= 0.9.3)
64
+ rb-inotify (>= 0.9)
65
+ lumberjack (1.0.4)
66
+ method_source (0.8.2)
67
+ mime-types (2.1)
68
+ mini_portile (0.5.2)
69
+ mongoid (3.1.6)
70
+ activemodel (~> 3.2)
71
+ moped (~> 1.4)
72
+ origin (~> 1.0)
73
+ tzinfo (~> 0.3.29)
74
+ mono_logger (1.1.0)
75
+ moped (1.5.2)
76
+ multi_json (1.8.2)
77
+ net-scp (1.1.2)
78
+ net-ssh (>= 2.6.5)
79
+ net-ssh (2.8.0)
80
+ nio4r (1.0.0)
81
+ nokogiri (1.6.1)
82
+ mini_portile (~> 0.5.0)
83
+ origin (1.1.0)
84
+ pry (0.9.12.6)
85
+ coderay (~> 1.0)
86
+ method_source (~> 0.8)
87
+ slop (~> 3.4)
88
+ rack (1.5.2)
89
+ rack-protection (1.5.2)
90
+ rack
91
+ rake (10.1.0)
92
+ rb-fsevent (0.9.4)
93
+ rb-inotify (0.9.3)
94
+ ffi (>= 0.5.0)
95
+ redis (3.0.7)
96
+ redis-namespace (1.4.1)
97
+ redis (~> 3.0.4)
98
+ resque (1.25.2)
99
+ mono_logger (~> 1.0)
100
+ multi_json (~> 1.0)
101
+ redis-namespace (~> 1.3)
102
+ sinatra (>= 0.9.2)
103
+ vegas (~> 0.1.2)
104
+ rspec (2.14.1)
105
+ rspec-core (~> 2.14.0)
106
+ rspec-expectations (~> 2.14.0)
107
+ rspec-mocks (~> 2.14.0)
108
+ rspec-core (2.14.7)
109
+ rspec-expectations (2.14.3)
110
+ diff-lcs (>= 1.1.3, < 2.0)
111
+ rspec-mocks (2.14.4)
112
+ simplecov (0.7.1)
113
+ multi_json (~> 1.0)
114
+ simplecov-html (~> 0.7.1)
115
+ simplecov-html (0.7.1)
116
+ sinatra (1.4.4)
117
+ rack (~> 1.4)
118
+ rack-protection (~> 1.4)
119
+ tilt (~> 1.3, >= 1.3.4)
120
+ slop (3.4.7)
121
+ terminal-notifier-guard (1.5.3)
122
+ thor (0.18.1)
123
+ tilt (1.4.1)
124
+ timers (1.1.0)
125
+ tzinfo (0.3.39)
126
+ vegas (0.1.11)
127
+ rack (>= 1.0.0)
128
+
129
+ PLATFORMS
130
+ ruby
131
+
132
+ DEPENDENCIES
133
+ database_cleaner
134
+ debugger
135
+ fog
136
+ guard-rspec
137
+ massive!
138
+ rake
139
+ rspec
140
+ simplecov
141
+ terminal-notifier-guard
data/Guardfile ADDED
@@ -0,0 +1,9 @@
1
+ # A sample Guardfile
2
+ # More info at https://github.com/guard/guard#readme
3
+
4
+ guard :rspec do
5
+ watch(%r{^spec/.+_spec\.rb$})
6
+ watch(%r{^lib/(.+)\.rb$}) { |m| "spec/models/#{m[1]}_spec.rb" }
7
+ watch('spec/spec_helper.rb') { "spec" }
8
+ end
9
+
data/LICENSE.txt ADDED
@@ -0,0 +1,22 @@
1
+ Copyright (c) 2013 Vicente Mundim
2
+
3
+ MIT License
4
+
5
+ Permission is hereby granted, free of charge, to any person obtaining
6
+ a copy of this software and associated documentation files (the
7
+ "Software"), to deal in the Software without restriction, including
8
+ without limitation the rights to use, copy, modify, merge, publish,
9
+ distribute, sublicense, and/or sell copies of the Software, and to
10
+ permit persons to whom the Software is furnished to do so, subject to
11
+ the following conditions:
12
+
13
+ The above copyright notice and this permission notice shall be
14
+ included in all copies or substantial portions of the Software.
15
+
16
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
17
+ EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
18
+ MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
19
+ NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE
20
+ LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION
21
+ OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION
22
+ WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
data/README.md ADDED
@@ -0,0 +1,196 @@
1
+ # Massive
2
+
3
+ [![build status][1]][2]
4
+ [![Bitdeli Badge](https://d2weczhvl823v0.cloudfront.net/dtmtec/massive/trend.png)](https://bitdeli.com/free "Bitdeli Badge")
5
+
6
+ [1]: https://travis-ci.org/dtmtec/massive.png
7
+ [2]: http://travis-ci.org/dtmtec/massive
8
+
9
+ Massive gives you a basic infrastructure to parallelize processing of large files and/or data using Resque, Redis and MongoDB.
10
+
11
+ ## Installation
12
+
13
+ Add this line to your application's Gemfile:
14
+
15
+ gem 'massive'
16
+
17
+ And then execute:
18
+
19
+ $ bundle
20
+
21
+ Or install it yourself as:
22
+
23
+ $ gem install massive
24
+
25
+ ## Requirements
26
+
27
+ If you want notifications using [Pusher](http://pusher.com), you'll need to add `pusher-gem` to your Gemfile. Also, if you'd like these notifications to be sent at intervals of less than one second, you'll need to use version 2.6 of Redis.
28
+
29
+ ## Usage
30
+
31
+ Massive gives you a basic model structure to process data, either coming from a file or from some other source, like a database. It has three basic concepts:
32
+
33
+ * __Process__: defines the set of steps and controls their execution.
34
+ * __Step__: a step of the process. For example, when processing a CSV file you may want to gather some info about the file, then read the data from the file and import it into the database, and later perform some processing on that data. In this scenario you would create 3 steps, and each step would split its processing into smaller jobs.
35
+ * __Job__: here lies the basic processing logic: iterating through each item of the data set reserved for it and then processing the item. It also updates the number of processed items, so you can poll the jobs about their progress.
36
+
37
+ Typical usage consists of subclassing `Massive::Step` and `Massive::Job` to add the logic your processing requires.
38
+
39
+ For example, suppose you want to perform an operation on a model, such as caching the number of friends a `User` has on a social network. Suppose we have 100 thousand users in the database. This would probably take some time, so we want to do it in the background.
40
+
41
+ We just need one step for it, and we will call it `CacheFriendsStep`:
42
+
43
+ ```ruby
44
+ class CacheFriendsStep < Massive::Step
45
+ # here we tell it how to calculate the total number of items we want it to process
46
+ calculates_total_count_with { User.count }
47
+
48
+ # we define the job class, otherwise it would use the default, which is Massive::Job
49
+ job_class 'CacheFriendsJob'
50
+ end
51
+ ```
52
+
53
+ Then we define the job class `CacheFriendsJob`, redefining two methods: `each_item` and `process_each`. The first one is used to iterate through our data set, yielding each item to the given block. Note that it uses the job offset and limit, so that the job can be parallelized. The second one is used to actually process an item, receiving the item and its index within the job data set.
54
+
55
+ ```ruby
56
+ class CacheFriendsJob < Massive::Job
57
+ def each_item(&block)
58
+ User.offset(offset).limit(limit).each(&block)
59
+ end
60
+
61
+ def process_each(user, index)
62
+ user.friends_count = user.friends.count
63
+ end
64
+ end
65
+ ```
66
+
67
+ Now we just create a process, and add the `CacheFriendsStep` to it, then enqueue the step:
68
+
69
+ ```ruby
70
+ process = Massive::Process.new
71
+ process.steps << CacheFriendsStep.new
72
+ process.save
73
+
74
+ process.enqueue_next
75
+ ```
76
+
77
+ Now the `CacheFriendsStep` is enqueued in the Resque queue. When it is run by a Resque worker it will split the processing into a number of jobs based on the step `limit_ratio`. This `limit_ratio` could be defined like this:
78
+
79
+ ```ruby
80
+ class CacheFriendsStep < Massive::Step
81
+ # here we tell it how to calculate the total number of items we want it to process
82
+ calculates_total_count_with { User.count }
83
+
84
+ # we define the job class, otherwise it would use the default, which is Massive::Job
85
+ job_class 'CacheFriendsJob'
86
+
87
+ # defining a different limit ratio
88
+ limit_ratio 2000 => 1000, 1000 => 200, 0 => 100
89
+ end
90
+ ```
91
+
92
+ What this means is that when the number of items to process is greater than or equal to 2000, it will split the work so that each job processes 1000 items. If the number of items is less than 2000 but at least 1000, each job will process 200 items. If the number of items is less than 1000, each job will process 100 items.
93
+
94
+ The default limit ratio is defined like this: `3000 => 1000, 0 => 100`. When the number of items is greater than or equal to 3000, each job processes 1000 items; otherwise, each job processes 100.
95
+
96
+ For the above example, it would create `100000 / 1000 == 100` jobs, where the first one would have an offset of 0, and a limit of 1000, the next one an offset of 1000 and a limit of 1000, and so on.
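
The splitting described above can be sketched in plain Ruby. This is only an illustration of the arithmetic, not Massive's own implementation: `limit_for` and `job_ranges` are hypothetical helper names, and the sketch assumes the highest threshold that the total count reaches decides the per-job limit.

```ruby
# Hypothetical helpers that illustrate the limit ratio arithmetic.
# They are not part of Massive's API.
DEFAULT_LIMIT_RATIO = { 3000 => 1000, 0 => 100 }

# Returns the per-job limit for a given total count: the value mapped to
# the highest threshold that total_count is greater than or equal to.
def limit_for(total_count, limit_ratio = DEFAULT_LIMIT_RATIO)
  threshold = limit_ratio.keys.sort.reverse.find { |t| total_count >= t }
  limit_ratio.fetch(threshold)
end

# Returns one { offset:, limit: } pair per job needed to cover total_count.
def job_ranges(total_count, limit_ratio = DEFAULT_LIMIT_RATIO)
  limit = limit_for(total_count, limit_ratio)
  (0...total_count).step(limit).map { |offset| { offset: offset, limit: limit } }
end
```

With the ratio from `CacheFriendsStep` and 100000 users, `job_ranges(100_000, 2000 => 1000, 1000 => 200, 0 => 100)` yields 100 pairs, starting with an offset of 0 and a limit of 1000. When the total is not a multiple of the limit, the last pair simply covers fewer items than its limit.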
97
+
98
+ With 100 jobs in a Resque queue you may want to start more than one worker so that it can process this queue more quickly.
99
+
100
+ Now you just need to wait until all jobs have been completed, by polling the step once in a while:
101
+
102
+ ```ruby
103
+ process.reload
104
+ step = process.steps.first
105
+
106
+ step.processed # gives you the sum of the number of items processed by all jobs
107
+ step.processed_percentage # the percentage of processed items based on the total count
108
+ step.elapsed_time # the elapsed time from when the step started processing until now, or its duration once it is finished
109
+ step.processing_time # the sum of the elapsed time for each job, which basically gives you the total time spent processing your data set.
110
+ ```
111
+
112
+ You can check whether the step is completed, or started:
113
+
114
+ ```ruby
115
+ step.started? # true when it has been started
116
+ step.completed? # true once every job has been completed; false while at least one job has not been completed
117
+ ```
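
Polling can be wrapped in a small helper. The sketch below is an assumption about how one might do it, not something Massive provides: `wait_for_completion` is a hypothetical name, and it only relies on the object returned by the block responding to `completed?`, as a step does.

```ruby
# Hypothetical polling helper; not part of Massive's API.
# The block must return a freshly loaded step on every call.
def wait_for_completion(poll_interval: 5, timeout: 600, sleeper: ->(seconds) { sleep(seconds) })
  waited = 0
  loop do
    step = yield
    return step if step.completed?
    raise "step did not complete within #{timeout} seconds" if waited >= timeout

    sleeper.call(poll_interval)
    waited += poll_interval
  end
end

# Possible usage with the process from the example above:
#
#   step = wait_for_completion do
#     process.reload
#     process.steps.first
#   end
#   step.processed_percentage
```

The `sleeper` parameter exists only so that the waiting can be replaced in tests; in normal use the default `sleep` is enough.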
118
+
119
+ ### Retry
120
+
121
+ When an error occurs while processing an item, the job will automatically retry it a number of times, waiting a fixed interval between attempts. By default it will retry 10 times with a 2 second interval. This is very useful when you expect some external service to fail for a short period of time and want to recover from it without retrying the entire job.
122
+
123
+ If the processing of a single item fails for the maximum number of retries the exception will be raised again, making the job fail. The error message will be stored and can be accessed through `Massive::Job#last_error`. It will also record the time when the error occurred.
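
The retry behaviour described above follows a common pattern that can be sketched in plain Ruby. This is not Massive's code: `with_retries` is a hypothetical helper, and it assumes that the maximum counts total attempts, which may differ from how Massive counts retries.

```ruby
# Hypothetical helper that illustrates retrying with a fixed interval.
# It is not Massive's implementation.
def with_retries(maximum_retries: 10, retry_interval: 2, sleeper: ->(seconds) { sleep(seconds) })
  attempts = 0
  begin
    attempts += 1
    yield attempts
  rescue StandardError
    # Give up and re-raise the original error once the limit is reached.
    raise if attempts >= maximum_retries

    sleeper.call(retry_interval)
    retry
  end
end

# Possible usage around a flaky call (the service is a placeholder):
#
#   with_retries(maximum_retries: 3, retry_interval: 5) do
#     external_service.fetch(item)
#   end
```

Because the error is re-raised after the last attempt, the caller still sees the original exception, which matches the behaviour of a job failing once its retries are exhausted.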
124
+
125
+ You can change the retry interval and the maximum number of retries if you want:
126
+
127
+ ```ruby
128
+ class CacheFriendsJob < Massive::Job
129
+ retry_interval 5
130
+ maximum_retries 3
131
+
132
+ def each_item(&block)
133
+ User.offset(offset).limit(limit).each(&block)
134
+ end
135
+
136
+ def process_each(user, index)
137
+ user.friends_count = user.friends.count
138
+ end
139
+ end
140
+ ```
141
+
142
+ ### File
143
+
144
+ One common use for __Massive__ is to process large CSV files, so it comes with `Massive::FileProcess`, `Massive::FileStep` and `Massive::FileJob`. A `Massive::FileProcess` embeds one `Massive::File`, which has a URL to a local or external file, and a [file processor](https://github.com/dtmtec/file_processor).
145
+
146
+ With this structure you can easily import users from a CSV file:
147
+
148
+ ```ruby
149
+ class ImportUsersStep < Massive::FileStep
150
+ job_class 'ImportUsersJob'
151
+ end
152
+
153
+ class ImportUsersJob < Massive::FileJob
154
+ def process_each(row, index)
155
+ User.create(row)
156
+ end
157
+ end
158
+
159
+ process = Massive::FileProcess.new(file_attributes: { url: 'http://myserver.com/path/to/my/file.csv' })
160
+ process.steps << ImportUsersStep.new
161
+ ```
162
+
163
+ Notice that we didn't have to specify how the step would calculate the total count for the `ImportUsersStep`. It is already set to the number of lines in the CSV file of the `Massive::FileProcess`. For this to work, we need to make sure that we have gathered information about the file:
164
+
165
+ ```ruby
166
+ process.file.gather_info!
167
+ ```
168
+
169
+ We also didn't have to specify how the job would iterate through each row, since that is already defined. We just get a `CSV::Row`, a Hash-like structure keyed by the CSV headers, so we can pass it straight to `User.create`. Of course this is a simple example: you should protect the attributes, or pass only the ones you want from the CSV.
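
One way to pass only the attributes you want is to filter the row against a whitelist before creating the record. The sketch below uses only Ruby's standard `csv` library. The `ALLOWED_ATTRIBUTES` list and the `permitted_attributes` helper are hypothetical names chosen for illustration, and the column names are assumptions about the CSV.

```ruby
require 'csv'

# Hypothetical whitelist; adjust it to the columns your model accepts.
ALLOWED_ATTRIBUTES = %w[name email]

# Returns a plain Hash with only the whitelisted columns of a CSV::Row.
def permitted_attributes(row, allowed = ALLOWED_ATTRIBUTES)
  row.to_hash.select { |header, _value| allowed.include?(header) }
end

# Possible usage inside the job from the example above:
#
#   def process_each(row, index)
#     User.create(permitted_attributes(row))
#   end
```

Any column that is not in the whitelist, such as an unexpected `admin` column, is dropped before the attributes reach the model.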
170
+
171
+ The `Massive::File` has support for [Fog::Storage](http://fog.io/storage/). To use it you must define the `fog_credentials` and optionally the `fog_directory` and `fog_authenticated_url_expiration`:
172
+
173
+ ```ruby
174
+ Massive.fog_credentials = {
175
+ provider: 'AWS',
176
+ aws_access_key_id: 'INSERT-YOUR-AWS-KEY-HERE',
177
+ aws_secret_access_key: 'INSERT-YOUR-AWS-SECRET-HERE'
178
+ }
179
+
180
+ Massive.fog_directory = 'your-bucket-here' # defaults to 'massive'
181
+ Massive.fog_authenticated_url_expiration = 30.minutes # defaults to 1.hour
182
+ ```
183
+
184
+ Then set the `filename` field when creating the `Massive::File` instead of setting its `url`. Notice that the filename should point to the full path within the bucket, not just the actual filename.
185
+
186
+ ```ruby
187
+ process = Massive::FileProcess.new(file_attributes: { filename: '/path/to/my/file.csv' })
188
+ ```
189
+
190
+ ## Contributing
191
+
192
+ 1. Fork it
193
+ 2. Create your feature branch (`git checkout -b my-new-feature`)
194
+ 3. Commit your changes (`git commit -am 'Add some feature'`)
195
+ 4. Push to the branch (`git push origin my-new-feature`)
196
+ 5. Create new Pull Request