kraps 0.5.0 → 0.6.0
- checksums.yaml +4 -4
- data/CHANGELOG.md +6 -0
- data/Gemfile.lock +1 -1
- data/README.md +112 -18
- data/lib/kraps/actions.rb +2 -0
- data/lib/kraps/drivers.rb +24 -0
- data/lib/kraps/job.rb +74 -0
- data/lib/kraps/job_resolver.rb +13 -0
- data/lib/kraps/runner.rb +26 -1
- data/lib/kraps/step.rb +1 -1
- data/lib/kraps/temp_path.rb +1 -27
- data/lib/kraps/temp_paths.rb +2 -2
- data/lib/kraps/version.rb +1 -1
- data/lib/kraps/worker.rb +137 -34
- data/lib/kraps.rb +8 -5
- metadata +3 -2
checksums.yaml CHANGED

```diff
@@ -1,7 +1,7 @@
 ---
 SHA256:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: 02ba582478178273300e77d5ddd18a8568fd682b0c53a444f1c5e7b756f9fd9a
+  data.tar.gz: 9e1698a67c252512a2f277ca1a6e079e063f8f1e6c13500245aabd969cca122d
 SHA512:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: 2d1b3bd10d1048c64804ddf86069c0757247b581edea0f38330189a10d35ed096970d008533a167ddaf97c16479425e3baa2bb56f5a8888255ecfd12b911a168
+  data.tar.gz: 045263d6aa920cef97a162fcbfd41f239087c459a8c86bd2112d3ce4524c48c6e65ceaac8993f8ea7fb9089a8877db2863e39644888c8bd884df0fa95241277d
```
data/CHANGELOG.md CHANGED
data/Gemfile.lock CHANGED
data/README.md CHANGED

````diff
@@ -3,11 +3,12 @@
 **Easily process big data in ruby**
 
 Kraps allows to process and perform calculations on very large datasets in
-parallel using a map/reduce framework
-you already have. You just need some
-layer with temporary lifecycle policy
-job framework (like sidekiq,
-progress. Most things you most
+parallel using a map/reduce framework similar to [spark](https://spark.apache.org/),
+but runs on a background job framework you already have. You just need some
+space on your filesystem, S3 as a storage layer with temporary lifecycle policy
+enabled, the already mentioned background job framework (like sidekiq,
+shoryuken, etc) and redis to keep track of the progress. Most things you most
+likely already have in place anyways.
 
 ## Installation
 
````
````diff
@@ -115,13 +116,13 @@ be able to give 300-400 megabytes to Kraps then, but now divide this by 10 and
 specify a `memory_limit` of around `30.megabytes`, better less. The
 `memory_limit` affects how much chunks will be written to disk depending on the
 data size you are processing and how big these chunks are. The smaller the
-value, the more chunks
-the chunks.
-
-
-
-
-
+value, the more chunks. The more chunks, the more runs Kraps need to merge
+the chunks. The `chunk_limit` ensures that only the specified amount of chunks
+are processed in a single run. A run basically means: it takes up to
+`chunk_limit` chunks, reduces them and pushes the result as a new chunk to the
+list of chunks to process. Thus, if your number of file descriptors is
+unlimited, you want to set it to a higher number to avoid the overhead of
+multiple runs. `concurrency` tells Kraps how much threads to use to
 concurrently upload/download files from the storage layer. Finally, `retries`
 specifies how often Kraps should retry the job step in case of errors. Kraps
 will sleep for 5 seconds between those retries. Please note that it's not yet
````
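For orientation, here is a minimal sketch of where these knobs end up, based on the `Kraps::Worker` signature shown in the `data/lib/kraps/worker.rb` diff further down; the Sidekiq wiring and ActiveSupport's `.megabytes` helper are assumptions:

```ruby
# Hypothetical Sidekiq worker: memory_limit, chunk_limit and concurrency
# are keyword arguments of Kraps::Worker#initialize, retries of #call.
class MyKrapsWorker
  include Sidekiq::Worker

  def perform(json)
    Kraps::Worker.new(
      json,
      memory_limit: 30.megabytes, # smaller value => more chunks on disk
      chunk_limit: 64,            # max chunks merged in a single run
      concurrency: 8              # threads for storage uploads/downloads
    ).call(retries: 3)            # 5 second sleep between retries
  end
end
```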
````diff
@@ -130,7 +131,6 @@ Kraps. Please note, however, that `parallelize` is not covered by `retries`
 yet, as the block passed to `parallelize` is executed by the runner, not the
 workers.
 
-
 Now, executing your job is super easy:
 
 ```ruby
````
````diff
@@ -182,11 +182,11 @@ https://github.com/mrkamel/map-reduce-ruby/#limitations-for-keys
 ## Storage
 
 Kraps stores temporary results of steps in a storage layer. Currently, only S3
-is supported besides a in
+is supported besides a in-memory driver used for testing purposes. Please be
 aware that Kraps does not clean up any files from the storage layer, as it
-would be a safe thing to do in case of errors anyways. Instead, Kraps
-lifecycle features of modern object storage systems. Therefore, it is
-to
+would not be a safe thing to do in case of errors anyways. Instead, Kraps
+relies on lifecycle features of modern object storage systems. Therefore, it is
+required to configure a lifecycle policy to delete any files after e.g. 7 days
 either for a whole bucket or for a certain prefix like e.g. `temp/` and tell
 Kraps about the prefix to use (e.g. `temp/kraps/`).
 
````
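A configuration sketch to make the prefix handling concrete; `bucket:` and `prefix:` match the `S3Driver` initializer visible in the `data/lib/kraps/drivers.rb` diff below, while the `s3_client:` keyword is an assumption:

```ruby
# Hypothetical setup: temporary objects then live under temp/kraps/...,
# which the S3 lifecycle policy is expected to expire after e.g. 7 days.
Kraps.configure(
  driver: Kraps::Drivers::S3Driver.new(
    s3_client: Aws::S3::Client.new, # assumed keyword
    bucket: "some-bucket",
    prefix: "temp/kraps/"
  )
)
```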
````diff
@@ -229,6 +229,19 @@ The block gets each key-value pair passed and the `collector` block can be
 called as often as neccessary. This is also the reason why `map` can not simply
 return the new key-value pair, but the `collector` must be used instead.
 
+* `map_partitions`: Maps the key value pairs to other key value pairs, but the
+  block receives all data of each partition as an enumerable and sorted by key.
+  Please be aware that you should not call `to_a` or similar on the enumerable.
+  Prefer `map` over `map_partitions` when possible.
+
+```ruby
+job.map_partitions(partitions: 128, partitioner: partitioner, worker: MyKrapsWorker) do |pairs, collector|
+  pairs.each do |key, value|
+    collector.call("changed #{key}", "changed #{value}")
+  end
+end
+```
+
 * `reduce`: Reduces the values of pairs having the same key
 
 ```ruby
````
````diff
@@ -245,6 +258,24 @@ The `key` itself is also passed to the block for the case that you need to
 customize the reduce calculation according to the value of the key. However,
 most of the time, this is not neccessary and the key can simply be ignored.
 
+* `combine`: Combines the results of 2 jobs by combining every key available
+  in the current job result with the corresponding key from the passed job
+  result. When the passed job result does not have the corresponding key,
+  `nil` will be passed to the block. Keys which are only available in the
+  passed job result are completely omitted.
+
+```ruby
+job.combine(other_job, worker: MyKrapsWorker) do |key, value1, value2|
+  (value1 || {}).merge(value2 || {})
+end
+```
+
+Please note that the keys, partitioners and the number of partitions must match
+for the jobs to be combined. Further note that the results of `other_job` must
+be reduced, meaning that every key must be unique. Finally, `other_job` must
+not neccessarily be listed in the array of jobs returned by the `call` method,
+since Kraps detects the dependency on its own.
+
 * `repartition`: Used to change the partitioning
 
 ```ruby
````
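The `nil` and omission rules are easiest to see with concrete data; the values below are invented purely for illustration:

```ruby
# current job result: [["a", 1], ["b", 2]]  (reduced, keys unique)
# other_job result:   [["b", 10], ["c", 30]]
#
# The combine block is invoked as:
#   ("a", 1, nil) # "a" is missing in other_job, so value2 is nil
#   ("b", 2, 10)  # key present in both results
# "c" is never yielded, because it only exists in other_job.
```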
````diff
@@ -255,7 +286,8 @@ Repartitions all data into the specified number of partitions and using the
 specified partitioner.
 
 * `each_partition`: Passes the partition number and all data of each partition
-  as
+  as an enumerable and sorted by key. Please be aware that you should not call
+  `to_a` or similar on the enumerable.
 
 ```ruby
 job.each_partition do |partition, pairs|
````
````diff
@@ -265,6 +297,22 @@ job.each_partition do |partition, pairs|
 end
 ```
 
+* `dump`: Store all current data per partition under the specified prefix
+
+```ruby
+job.dump(prefix: "path/to/dump", worker: MyKrapsWorker)
+```
+
+It creates a folder for every partition and stores one or more chunks in there.
+
+* `load`: Loads the previously dumped data
+
+```ruby
+job.load(prefix: "path/to/dump", partitions: 32, partitioner: Kraps::HashPartitioner.new, worker: MyKrapsWorker)
+```
+
+The number of partitions and the partitioner must be specified.
+
 Please note that every API method accepts a `before` callable:
 
 ```ruby
````
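Given the `dump` implementation in the `data/lib/kraps/job.rb` diff below, the layout is one `chunk.json` object per partition; a sketch using the driver's `list` method, assuming its result is enumerable and ignoring the configured storage prefix for brevity:

```ruby
Kraps.driver.list(prefix: "path/to/dump/").to_a
# => ["path/to/dump/0/chunk.json",
#     "path/to/dump/1/chunk.json",
#     ...
#     "path/to/dump/31/chunk.json"]
```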
````diff
@@ -326,6 +374,52 @@ When you execute the job, Kraps will execute the jobs one after another and as
 the jobs build up on each other, Kraps will execute the steps shared by both
 jobs only once.
 
+## Testing
+
+Kraps ships with an in-memory fake driver for storage, which you can use for
+testing purposes instead of the s3 driver:
+
+```ruby
+Kraps.configure(
+  driver: Kraps::Drivers::FakeDriver.new(bucket: "kraps"),
+  # ...
+)
+```
+
+This is of course much faster than using s3 or some s3 compatible service.
+Moreover, when testing large Kraps jobs you maybe want to test intermediate
+steps. You can use `#dump` for this purpose and test that the data dumped is
+correct.
+
+```ruby
+job = job.dump(prefix: "path/to/dump")
+```
+
+and in your tests do
+
+```ruby
+Kraps.driver.value("path/to/dump/0/chunk.json") # => data of partition 0
+Kraps.driver.value("path/to/dump/1/chunk.json") # => data of partition 1
+# ...
+```
+
+The data is stored in lines, each line is a json encoded array of key and
+value.
+
+```ruby
+data = Kraps.driver.value("path/to/dump/0/chunk.json").lines.map do |line|
+  JSON.parse(line) # => [key, value]
+end
+```
+
+The API of the driver is:
+
+* `store(name, data_or_io, options = {})`: Stores `data_or_io` as `name`
+* `list(prefix: nil)`: Lists all objects or all objects matching the `prefix`
+* `value(name)`: Returns the object content of `name`
+* `download(name, path)`: Downloads the object `name` to `path` in your
+  filesystem
+* `exists?(name)`: Returns `true`/`false`
+* `flush`: Removes all objects from the fake storage
+
 ## Dependencies
 
 Kraps is built on top of
````
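A test sketch tying `#dump` and the fake driver together; the RSpec framing and the `run_job` helper are assumptions, only the driver calls come from the API listed above:

```ruby
RSpec.describe MyJob do
  before { Kraps.driver.flush } # start each example with empty fake storage

  it "dumps the expected pairs for partition 0" do
    run_job(MyJob) # assumed helper that drains the background job queue

    data = Kraps.driver.value("path/to/dump/0/chunk.json").lines.map do |line|
      JSON.parse(line) # each line is a [key, value] array
    end

    expect(data).to include(["some key", "some value"])
  end
end
```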
data/lib/kraps/actions.rb CHANGED
data/lib/kraps/drivers.rb CHANGED

```diff
@@ -8,6 +8,26 @@ module Kraps
       def with_prefix(path)
         File.join(*[@prefix, path].compact)
       end
+
+      def list(prefix: nil)
+        driver.list(bucket, prefix: prefix)
+      end
+
+      def value(name)
+        driver.value(name, bucket)
+      end
+
+      def download(name, path)
+        driver.download(name, bucket, path)
+      end
+
+      def exists?(name)
+        driver.exists?(name, bucket)
+      end
+
+      def store(name, data_or_io, options = {})
+        driver.store(name, data_or_io, bucket, options)
+      end
     end
 
     class S3Driver
@@ -32,6 +52,10 @@ module Kraps
         @bucket = bucket
         @prefix = prefix
       end
+
+      def flush
+        driver.flush
+      end
     end
   end
 end
```
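These delegation methods mean callers no longer have to reach through `Kraps.driver.driver` and pass the bucket themselves, which is exactly what the `data/lib/kraps/worker.rb` diff below cleans up. A hypothetical usage with an invented object name:

```ruby
# with_prefix joins the configured prefix (e.g. "temp/kraps/") with the
# given path; store/exists?/value now imply the configured bucket.
name = Kraps.driver.with_prefix("token/0/chunk.0.json")

Kraps.driver.store(name, "[\"key\",1]\n")
Kraps.driver.exists?(name) # => true
Kraps.driver.value(name)   # => "[\"key\",1]\n"
```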
data/lib/kraps/job.rb CHANGED

```diff
@@ -45,6 +45,24 @@ module Kraps
       end
     end
 
+    def map_partitions(partitions: nil, partitioner: nil, worker: @worker, before: nil, &block)
+      fresh.tap do |job|
+        job.instance_eval do
+          @partitions = partitions if partitions
+          @partitioner = partitioner if partitioner
+
+          @steps << Step.new(
+            action: Actions::MAP_PARTITIONS,
+            partitions: @partitions,
+            partitioner: @partitioner,
+            worker: worker,
+            before: before,
+            block: block
+          )
+        end
+      end
+    end
+
     def reduce(worker: @worker, before: nil, &block)
       fresh.tap do |job|
         job.instance_eval do
```
```diff
@@ -60,6 +78,23 @@ module Kraps
       end
     end
 
+    def combine(other_job, worker: @worker, before: nil, &block)
+      fresh.tap do |job|
+        job.instance_eval do
+          @steps << Step.new(
+            action: Actions::COMBINE,
+            partitions: @partitions,
+            partitioner: @partitioner,
+            worker: worker,
+            before: before,
+            block: block,
+            dependency: other_job,
+            options: { combine_step_index: other_job.steps.size - 1 }
+          )
+        end
+      end
+    end
+
     def each_partition(worker: @worker, before: nil, &block)
       fresh.tap do |job|
         job.instance_eval do
```
```diff
@@ -81,6 +116,45 @@ module Kraps
       end
     end
 
+    def dump(prefix:, worker: @worker)
+      each_partition(worker: worker) do |partition, pairs|
+        tempfile = Tempfile.new
+
+        pairs.each do |pair|
+          tempfile.puts(JSON.generate(pair))
+        end
+
+        Kraps.driver.store(File.join(prefix, partition.to_s, "chunk.json"), tempfile.tap(&:rewind))
+      ensure
+        tempfile&.close(true)
+      end
+    end
+
+    def load(prefix:, partitions:, partitioner:, worker: @worker)
+      job = parallelize(partitions: partitions, partitioner: proc { |key, _| key }, worker: worker) do |collector|
+        (0...partitions).each do |partition|
+          collector.call(partition)
+        end
+      end
+
+      job.map_partitions(partitioner: partitioner, worker: worker) do |partition, _, collector|
+        tempfile = Tempfile.new
+
+        path = File.join(prefix, partition.to_s, "chunk.json")
+        next unless Kraps.driver.exists?(path)
+
+        Kraps.driver.download(path, tempfile.path)
+
+        tempfile.each_line do |line|
+          key, value = JSON.parse(line)
+
+          collector.call(key, value)
+        end
+      ensure
+        tempfile&.close(true)
+      end
+    end
+
     def fresh
       dup.tap do |job|
         job.instance_variable_set(:@steps, @steps.dup)
```
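One detail of `load` worth spelling out: its `parallelize` step emits the partition numbers themselves as keys and uses an identity partitioner, so each number lands in exactly that partition, and the subsequent `map_partitions` block then reads that partition's dumped chunk. A small check of the idea, assuming the partitioner's second argument is the partition count:

```ruby
# The key *is* the target partition, so partition 3 ends up reading
# path/to/dump/3/chunk.json.
identity = proc { |key, _num_partitions| key }
identity.call(3, 32) # => 3
```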
data/lib/kraps/job_resolver.rb ADDED

```diff
@@ -0,0 +1,13 @@
+module Kraps
+  class JobResolver
+    def call(jobs)
+      resolve_dependencies(Array(jobs)).uniq
+    end
+
+    private
+
+    def resolve_dependencies(jobs)
+      jobs.map { |job| [resolve_dependencies(job.steps.map(&:dependency).compact), job] }.flatten
+    end
+  end
+end
```
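`resolve_dependencies` recursively collects the jobs referenced via each step's `dependency` (set by `combine`) and flattens them ahead of their dependents; `uniq` then drops duplicates. A hypothetical usage, assuming `job_b` was built with `job_b.combine(job_a) { ... }`:

```ruby
# Dependencies come first, so job_a runs before job_b even if the user
# only returned job_b from their #call method.
Kraps::JobResolver.new.call([job_b]) # => [job_a, job_b]
```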
data/lib/kraps/runner.rb CHANGED

```diff
@@ -5,7 +5,7 @@ module Kraps
     end
 
     def call(*args, **kwargs)
-
+      JobResolver.new.call(@klass.new.call(*args, **kwargs)).tap do |jobs|
         jobs.each_with_index do |job, job_index|
           job.steps.each_with_index.inject(nil) do |frame, (_, step_index)|
             StepRunner.new(
```
```diff
@@ -69,6 +69,16 @@ module Kraps
       end
     end
 
+    def perform_map_partitions
+      with_distributed_job do |distributed_job|
+        push_and_wait(distributed_job, 0...@frame.partitions) do |partition, part|
+          enqueue(token: distributed_job.token, part: part, partition: partition)
+        end
+
+        Frame.new(token: distributed_job.token, partitions: @step.partitions)
+      end
+    end
+
     def perform_reduce
       with_distributed_job do |distributed_job|
         push_and_wait(distributed_job, 0...@frame.partitions) do |partition, part|
```
```diff
@@ -79,6 +89,21 @@ module Kraps
       end
     end
 
+    def perform_combine
+      combine_job = @step.dependency
+      combine_step = combine_job.steps[@step.options[:combine_step_index]]
+
+      raise(IncompatibleFrame, "Incompatible number of partitions") if combine_step.partitions != @step.partitions
+
+      with_distributed_job do |distributed_job|
+        push_and_wait(distributed_job, 0...@frame.partitions) do |partition, part|
+          enqueue(token: distributed_job.token, part: part, partition: partition, combine_frame: combine_step.frame.to_h)
+        end
+
+        Frame.new(token: distributed_job.token, partitions: @step.partitions)
+      end
+    end
+
     def perform_each_partition
       with_distributed_job do |distributed_job|
         push_and_wait(distributed_job, 0...@frame.partitions) do |partition, part|
```
data/lib/kraps/step.rb CHANGED
data/lib/kraps/temp_path.rb CHANGED

```diff
@@ -1,29 +1,3 @@
 module Kraps
-  class TempPath
-    attr_reader :path
-
-    def initialize(prefix: nil, suffix: nil)
-      @path = File.join(Dir.tmpdir, [prefix, SecureRandom.hex[0, 16], Process.pid, suffix].compact.join("."))
-
-      File.open(@path, File::CREAT | File::EXCL) {}
-
-      ObjectSpace.define_finalizer(self, self.class.finalize(@path))
-
-      return unless block_given?
-
-      begin
-        yield
-      ensure
-        unlink
-      end
-    end
-
-    def unlink
-      FileUtils.rm_f(@path)
-    end
-
-    def self.finalize(path)
-      proc { FileUtils.rm_f(path) }
-    end
-  end
+  TempPath = MapReduce::TempPath
 end
```
data/lib/kraps/temp_paths.rb CHANGED

data/lib/kraps/version.rb CHANGED
data/lib/kraps/worker.rb CHANGED

```diff
@@ -1,10 +1,13 @@
 module Kraps
   class Worker
-    def initialize(json, memory_limit:, chunk_limit:, concurrency:)
+    include MapReduce::Mergeable
+
+    def initialize(json, memory_limit:, chunk_limit:, concurrency:, logger: Logger.new("/dev/null"))
       @args = JSON.parse(json)
       @memory_limit = memory_limit
       @chunk_limit = chunk_limit
       @concurrency = concurrency
+      @logger = logger
     end
 
     def call(retries: 3)
```
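The new `logger:` keyword defaults to a null logger; failed attempts are reported via `@logger.error(e)` before the retry sleep (see the `with_retries` hunk below). A hypothetical wiring:

```ruby
# Passing Sidekiq's logger so retried exceptions show up in the worker log.
Kraps::Worker.new(
  json,
  memory_limit: 128 * 1024 * 1024,
  chunk_limit: 64,
  concurrency: 8,
  logger: Sidekiq.logger
).call(retries: 3)
```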
```diff
@@ -36,24 +39,14 @@ module Kraps
       mapper.shuffle(chunk_limit: @chunk_limit) do |partitions|
         Parallelizer.each(partitions.to_a, @concurrency) do |partition, path|
           File.open(path) do |stream|
-            Kraps.driver.
+            Kraps.driver.store(Kraps.driver.with_prefix("#{@args["token"]}/#{partition}/chunk.#{@args["part"]}.json"), stream)
           end
         end
       end
     end
 
     def perform_map
-      temp_paths =
-
-      files = Kraps.driver.driver.list(Kraps.driver.bucket, prefix: Kraps.driver.with_prefix("#{@args["frame"]["token"]}/#{@args["partition"]}/")).sort
-
-      temp_paths_index = files.each_with_object({}) do |file, hash|
-        hash[file] = temp_paths.add
-      end
-
-      Parallelizer.each(files, @concurrency) do |file|
-        Kraps.driver.driver.download(file, Kraps.driver.bucket, temp_paths_index[file].path)
-      end
+      temp_paths = download_all(token: @args["frame"]["token"], partition: @args["partition"])
 
       current_step = step
 
```
```diff
@@ -85,14 +78,45 @@ module Kraps
       mapper.shuffle(chunk_limit: @chunk_limit) do |partitions|
         Parallelizer.each(partitions.to_a, @concurrency) do |partition, path|
           File.open(path) do |stream|
-            Kraps.driver.driver.
-
-
+            Kraps.driver.store(Kraps.driver.with_prefix("#{@args["token"]}/#{partition}/chunk.#{@args["part"]}.json"), stream)
+          end
+        end
+      end
+    ensure
+      temp_paths&.delete
+    end
+
+    def perform_map_partitions
+      temp_paths = download_all(token: @args["frame"]["token"], partition: @args["partition"])
+
+      current_step = step
+      current_partition = @args["partition"]
+
+      implementation = Object.new
+      implementation.define_singleton_method(:map) do |enum, &block|
+        current_step.block.call(current_partition, enum, block)
+      end
+
+      subsequent_step = next_step
+
+      if subsequent_step&.action == Actions::REDUCE
+        implementation.define_singleton_method(:reduce) do |key, value1, value2|
+          subsequent_step.block.call(key, value1, value2)
+        end
+      end
+
+      mapper = MapReduce::Mapper.new(implementation, partitioner: partitioner, memory_limit: @memory_limit)
+      mapper.map(k_way_merge(temp_paths.each.to_a, chunk_limit: @chunk_limit))
+
+      mapper.shuffle(chunk_limit: @chunk_limit) do |partitions|
+        Parallelizer.each(partitions.to_a, @concurrency) do |partition, path|
+          File.open(path) do |stream|
+            Kraps.driver.store(Kraps.driver.with_prefix("#{@args["token"]}/#{partition}/chunk.#{@args["part"]}.json"), stream)
           end
         end
       end
     ensure
-      temp_paths&.
+      temp_paths&.delete
     end
 
     def perform_reduce
```
```diff
@@ -105,8 +129,8 @@ module Kraps
 
       reducer = MapReduce::Reducer.new(implementation)
 
-      Parallelizer.each(Kraps.driver.
-        Kraps.driver.
+      Parallelizer.each(Kraps.driver.list(prefix: Kraps.driver.with_prefix("#{@args["frame"]["token"]}/#{@args["partition"]}/")), @concurrency) do |file|
+        Kraps.driver.download(file, reducer.add_chunk)
       end
 
       tempfile = Tempfile.new
```
```diff
@@ -115,35 +139,96 @@ module Kraps
         tempfile.puts(JSON.generate([key, value]))
       end
 
-      Kraps.driver.
+      Kraps.driver.store(Kraps.driver.with_prefix("#{@args["token"]}/#{@args["partition"]}/chunk.#{@args["part"]}.json"), tempfile.tap(&:rewind))
     ensure
       tempfile&.close(true)
     end
 
+    def perform_combine
+      temp_paths1 = download_all(token: @args["frame"]["token"], partition: @args["partition"])
+      temp_paths2 = download_all(token: @args["combine_frame"]["token"], partition: @args["partition"])
+
+      enum1 = k_way_merge(temp_paths1.each.to_a, chunk_limit: @chunk_limit)
+      enum2 = k_way_merge(temp_paths2.each.to_a, chunk_limit: @chunk_limit)
+
+      combine_method = method(:combine)
+      current_step = step
+
+      implementation = Object.new
+      implementation.define_singleton_method(:map) do |&block|
+        combine_method.call(enum1, enum2) do |key, value1, value2|
+          block.call(key, current_step.block.call(key, value1, value2))
+        end
+      end
+
+      mapper = MapReduce::Mapper.new(implementation, partitioner: partitioner, memory_limit: @memory_limit)
+      mapper.map
+
+      mapper.shuffle(chunk_limit: @chunk_limit) do |partitions|
+        Parallelizer.each(partitions.to_a, @concurrency) do |partition, path|
+          File.open(path) do |stream|
+            Kraps.driver.store(Kraps.driver.with_prefix("#{@args["token"]}/#{partition}/chunk.#{@args["part"]}.json"), stream)
+          end
+        end
+      end
+    ensure
+      temp_paths1&.delete
+      temp_paths2&.delete
+    end
+
+    def combine(enum1, enum2)
+      current1 = begin; enum1.next; rescue StopIteration; nil; end
+      current2 = begin; enum2.next; rescue StopIteration; nil; end
+
+      loop do
+        return if current1.nil? && current2.nil?
+        return if current1.nil?
+
+        if current2.nil?
+          yield(current1[0], current1[1], nil)
+
+          current1 = begin; enum1.next; rescue StopIteration; nil; end
+        elsif current1[0] == current2[0]
+          loop do
+            yield(current1[0], current1[1], current2[1])
+
+            current1 = begin; enum1.next; rescue StopIteration; nil; end
+
+            break if current1.nil?
+            break if current1[0] != current2[0]
+          end
+
+          current2 = begin; enum2.next; rescue StopIteration; nil; end
+        else
+          res = current1[0] <=> current2[0]
+
+          if res < 0
+            yield(current1[0], current1[1], nil)
+
+            current1 = begin; enum1.next; rescue StopIteration; nil; end
+          else
+            current2 = begin; enum2.next; rescue StopIteration; nil; end
+          end
+        end
+      end
+    end
+
     def perform_each_partition
       temp_paths = TempPaths.new
 
-      files = Kraps.driver.
+      files = Kraps.driver.list(prefix: Kraps.driver.with_prefix("#{@args["frame"]["token"]}/#{@args["partition"]}/")).sort
 
       temp_paths_index = files.each_with_object({}) do |file, hash|
         hash[file] = temp_paths.add
       end
 
       Parallelizer.each(files, @concurrency) do |file|
-        Kraps.driver.
+        Kraps.driver.download(file, temp_paths_index[file].path)
      end
 
-
-      File.open(temp_path.path) do |stream|
-        stream.each_line do |line|
-          yielder << JSON.parse(line)
-        end
-      end
-      end
-
-      step.block.call(@args["partition"], enum)
+      step.block.call(@args["partition"], k_way_merge(temp_paths.each.to_a, chunk_limit: @chunk_limit))
     ensure
-      temp_paths&.
+      temp_paths&.delete
     end
 
     def with_retries(num_retries)
```
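The `combine` helper above is a classic merge join over two key-sorted streams, which is why both inputs come from `k_way_merge`. A self-contained re-statement of the same traversal with invented sample data, assuming keys that are mutually comparable:

```ruby
# Both inputs must be sorted by key; keys seen only in enum2 are skipped
# by design, matching the documented combine semantics.
def merge_join(enum1, enum2)
  current1 = begin; enum1.next; rescue StopIteration; nil; end
  current2 = begin; enum2.next; rescue StopIteration; nil; end

  while current1
    if current2.nil? || current1[0] < current2[0]
      yield(current1[0], current1[1], nil) # no partner on the right side
      current1 = begin; enum1.next; rescue StopIteration; nil; end
    elsif current1[0] == current2[0]
      yield(current1[0], current1[1], current2[1]) # matching keys
      current1 = begin; enum1.next; rescue StopIteration; nil; end
    else
      current2 = begin; enum2.next; rescue StopIteration; nil; end
    end
  end
end

merge_join([["a", 1], ["b", 2]].each, [["b", 10], ["c", 30]].each) do |key, v1, v2|
  p [key, v1, v2] # => ["a", 1, nil], then ["b", 2, 10]; "c" never appears
end
```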
```diff
@@ -154,12 +239,14 @@ module Kraps
     rescue Kraps::Error
       distributed_job.stop
       raise
-    rescue StandardError
+    rescue StandardError => e
      if retries >= num_retries
        distributed_job.stop
        raise
      end
 
+      @logger.error(e)
+
      sleep(5)
      retries += 1
 
```
```diff
@@ -167,8 +254,24 @@ module Kraps
       end
     end
 
+    def download_all(token:, partition:)
+      temp_paths = TempPaths.new
+
+      files = Kraps.driver.list(prefix: Kraps.driver.with_prefix("#{token}/#{partition}/")).sort
+
+      temp_paths_index = files.each_with_object({}) do |file, hash|
+        hash[file] = temp_paths.add
+      end
+
+      Parallelizer.each(files, @concurrency) do |file|
+        Kraps.driver.download(file, temp_paths_index[file].path)
+      end
+
+      temp_paths
+    end
+
     def jobs
-      @jobs ||=
+      @jobs ||= JobResolver.new.call(@args["klass"].constantize.new.call(*@args["args"], **@args["kwargs"].transform_keys(&:to_sym)))
     end
 
     def job
```
data/lib/kraps.rb CHANGED

```diff
@@ -1,3 +1,9 @@
+require "distributed_job"
+require "ruby-progressbar"
+require "ruby-progressbar/outputs/null"
+require "map_reduce"
+require "redis"
+
 require_relative "kraps/version"
 require_relative "kraps/drivers"
 require_relative "kraps/actions"
```
```diff
@@ -8,21 +14,18 @@ require_relative "kraps/temp_paths"
 require_relative "kraps/timeout_queue"
 require_relative "kraps/interval"
 require_relative "kraps/job"
+require_relative "kraps/job_resolver"
 require_relative "kraps/runner"
 require_relative "kraps/step"
 require_relative "kraps/frame"
 require_relative "kraps/worker"
-require "distributed_job"
-require "ruby-progressbar"
-require "ruby-progressbar/outputs/null"
-require "map_reduce"
-require "redis"
 
 module Kraps
   class Error < StandardError; end
   class InvalidAction < Error; end
   class InvalidStep < Error; end
   class JobStopped < Error; end
+  class IncompatibleFrame < Error; end
 
   def self.configure(driver:, redis: Redis.new, namespace: nil, job_ttl: 24 * 60 * 60, show_progress: true, enqueuer: ->(worker, json) { worker.perform_async(json) })
     @driver = driver
```
metadata CHANGED

```diff
@@ -1,14 +1,14 @@
 --- !ruby/object:Gem::Specification
 name: kraps
 version: !ruby/object:Gem::Version
-  version: 0.5.0
+  version: 0.6.0
 platform: ruby
 authors:
 - Benjamin Vetter
 autorequire:
 bindir: exe
 cert_chain: []
-date: 2022-11-
+date: 2022-11-16 00:00:00.000000000 Z
 dependencies:
 - !ruby/object:Gem::Dependency
   name: attachie
```
```diff
@@ -147,6 +147,7 @@ files:
 - lib/kraps/hash_partitioner.rb
 - lib/kraps/interval.rb
 - lib/kraps/job.rb
+- lib/kraps/job_resolver.rb
 - lib/kraps/parallelizer.rb
 - lib/kraps/runner.rb
 - lib/kraps/step.rb
```