kraps 0.5.0 → 0.6.0

checksums.yaml CHANGED
@@ -1,7 +1,7 @@
 ---
 SHA256:
-  metadata.gz: 921ae08326c96216136418861b88af7f11bce519c924cd1813216165f7f02690
-  data.tar.gz: '0913d31d3caeea0be664bc714e9d0da58227f515c047be31359e96040bc0c141'
+  metadata.gz: 02ba582478178273300e77d5ddd18a8568fd682b0c53a444f1c5e7b756f9fd9a
+  data.tar.gz: 9e1698a67c252512a2f277ca1a6e079e063f8f1e6c13500245aabd969cca122d
 SHA512:
-  metadata.gz: d8e43e5229fc310019801e62a2e278470a1eb37b50e4aca27b9c64edb6666115f0f25c7a7375790516e2726fcf10980cdac1523c54dde8d3527a39fd919a2a5a
-  data.tar.gz: 30b1a9edcdd4f7ff476bfa4c070aef31debd727500e27a08b59f1df2663362c60e3cc3a3c860455d568abd994bb56a216f7eedf8baea6cc06ca73b1d0bdf9a07
+  metadata.gz: 2d1b3bd10d1048c64804ddf86069c0757247b581edea0f38330189a10d35ed096970d008533a167ddaf97c16479425e3baa2bb56f5a8888255ecfd12b911a168
+  data.tar.gz: 045263d6aa920cef97a162fcbfd41f239087c459a8c86bd2112d3ce4524c48c6e65ceaac8993f8ea7fb9089a8877db2863e39644888c8bd884df0fa95241277d
data/CHANGELOG.md CHANGED
@@ -1,5 +1,11 @@
 # CHANGELOG
 
+## v0.6.0
+
+* Added `map_partitions`
+* Added `combine`
+* Added `dump` and `load`
+
 ## v0.5.0
 
 * Added a `before` option to specify a callable to run before
data/Gemfile.lock CHANGED
@@ -1,7 +1,7 @@
 PATH
   remote: .
   specs:
-    kraps (0.5.0)
+    kraps (0.6.0)
       attachie
       distributed_job
       map-reduce-ruby (>= 3.0.0)
data/README.md CHANGED
@@ -3,11 +3,12 @@
 **Easily process big data in ruby**
 
 Kraps allows to process and perform calculations on very large datasets in
-parallel using a map/reduce framework and runs on a background job framework
-you already have. You just need some space on your filesystem, S3 as a storage
-layer with temporary lifecycle policy enabled, the already mentioned background
-job framework (like sidekiq, shoryuken, etc) and redis to keep track of the
-progress. Most things you most likely already have in place anyways.
+parallel using a map/reduce framework similar to [spark](https://spark.apache.org/),
+but runs on a background job framework you already have. You just need some
+space on your filesystem, S3 as a storage layer with temporary lifecycle policy
+enabled, the already mentioned background job framework (like sidekiq,
+shoryuken, etc) and redis to keep track of the progress. Most things you most
+likely already have in place anyways.
 
 ## Installation
 
@@ -115,13 +116,13 @@ be able to give 300-400 megabytes to Kraps then, but now divide this by 10 and
 specify a `memory_limit` of around `30.megabytes`, better less. The
 `memory_limit` affects how much chunks will be written to disk depending on the
 data size you are processing and how big these chunks are. The smaller the
-value, the more chunks and the more chunks, the more runs Kraps need to merge
-the chunks. It can affect the performance The `chunk_limit` ensures that only
-the specified amount of chunks are processed in a single run. A run basically
-means: it takes up to `chunk_limit` chunks, reduces them and pushes the result
-as a new chunk to the list of chunks to process. Thus, if your number of file
-descriptors is unlimited, you want to set it to a higher number to avoid the
-overhead of multiple runs. `concurrency` tells Kraps how much threads to use to
+value, the more chunks. The more chunks, the more runs Kraps need to merge
+the chunks. The `chunk_limit` ensures that only the specified amount of chunks
+are processed in a single run. A run basically means: it takes up to
+`chunk_limit` chunks, reduces them and pushes the result as a new chunk to the
+list of chunks to process. Thus, if your number of file descriptors is
+unlimited, you want to set it to a higher number to avoid the overhead of
+multiple runs. `concurrency` tells Kraps how much threads to use to
 concurrently upload/download files from the storage layer. Finally, `retries`
 specifies how often Kraps should retry the job step in case of errors. Kraps
 will sleep for 5 seconds between those retries. Please note that it's not yet
@@ -130,7 +131,6 @@ Kraps. Please note, however, that `parallelize` is not covered by `retries`
 yet, as the block passed to `parallelize` is executed by the runner, not the
 workers.
 
-
 Now, executing your job is super easy:
 
 ```ruby
@@ -182,11 +182,11 @@ https://github.com/mrkamel/map-reduce-ruby/#limitations-for-keys
 ## Storage
 
 Kraps stores temporary results of steps in a storage layer. Currently, only S3
-is supported besides a in memory driver used for testing purposes. Please be
+is supported besides a in-memory driver used for testing purposes. Please be
 aware that Kraps does not clean up any files from the storage layer, as it
-would be a safe thing to do in case of errors anyways. Instead, Kraps relies on
-lifecycle features of modern object storage systems. Therefore, it is recommend
-to e.g. configure a lifecycle policy to delete any files after e.g. 7 days
+would not be a safe thing to do in case of errors anyways. Instead, Kraps
+relies on lifecycle features of modern object storage systems. Therefore, it is
+required to configure a lifecycle policy to delete any files after e.g. 7 days
 either for a whole bucket or for a certain prefix like e.g. `temp/` and tell
 Kraps about the prefix to use (e.g. `temp/kraps/`).
 
@@ -229,6 +229,19 @@ The block gets each key-value pair passed and the `collector` block can be
 called as often as neccessary. This is also the reason why `map` can not simply
 return the new key-value pair, but the `collector` must be used instead.
 
+* `map_partitions`: Maps the key value pairs to other key value pairs, but the
+  block receives all data of each partition as an enumerable and sorted by key.
+  Please be aware that you should not call `to_a` or similar on the enumerable.
+  Prefer `map` over `map_partitions` when possible.
+
+```ruby
+job.map_partitions(partitions: 128, partitioner: partitioner, worker: MyKrapsWorker) do |pairs, collector|
+  pairs.each do |key, value|
+    collector.call("changed #{key}", "changed #{value}")
+  end
+end
+```
+
 * `reduce`: Reduces the values of pairs having the same key
 
 ```ruby
@@ -245,6 +258,24 @@ The `key` itself is also passed to the block for the case that you need to
 customize the reduce calculation according to the value of the key. However,
 most of the time, this is not neccessary and the key can simply be ignored.
 
+* `combine`: Combines the results of 2 jobs by combining every key available
+  in the current job result with the corresponding key from the passed job
+  result. When the passed job result does not have the corresponding key,
+  `nil` will be passed to the block. Keys which are only available in the
+  passed job result are completely omitted.
+
+```ruby
+job.combine(other_job, worker: MyKrapsWorker) do |key, value1, value2|
+  (value1 || {}).merge(value2 || {})
+end
+```
+
+Please note that the keys, partitioners and the number of partitions must match
+for the jobs to be combined. Further note that the results of `other_job` must
+be reduced, meaning that every key must be unique. Finally, `other_job` must
+not neccessarily be listed in the array of jobs returned by the `call` method,
+since Kraps detects the dependency on its own.
+
 * `repartition`: Used to change the partitioning
 
 ```ruby
@@ -255,7 +286,8 @@ Repartitions all data into the specified number of partitions and using the
 specified partitioner.
 
 * `each_partition`: Passes the partition number and all data of each partition
-  as a lazy enumerable
+  as an enumerable and sorted by key. Please be aware that you should not call
+  `to_a` or similar on the enumerable.
 
 ```ruby
 job.each_partition do |partition, pairs|
@@ -265,6 +297,22 @@ job.each_partition do |partition, pairs|
 end
 ```
 
+* `dump`: Store all current data per partition under the specified prefix
+
+```ruby
+job.dump(prefix: "path/to/dump", worker: MyKrapsWorker)
+```
+
+It creates a folder for every partition and stores one or more chunks in there.
+
+* `load`: Loads the previously dumped data
+
+```ruby
+job.load(prefix: "path/to/dump", partitions: 32, partitioner: Kraps::HashPartitioner.new, worker: MyKrapsWorker)
+```
+
+The number of partitions and the partitioner must be specified.
+
 Please note that every API method accepts a `before` callable:
 
 ```ruby
@@ -326,6 +374,52 @@ When you execute the job, Kraps will execute the jobs one after another and as
 the jobs build up on each other, Kraps will execute the steps shared by both
 jobs only once.
 
+## Testing
+
+Kraps ships with an in-memory fake driver for storage, which you can use for
+testing purposes instead of the s3 driver:
+
+```ruby
+Kraps.configure(
+  driver: Kraps::Drivers::FakeDriver.new(bucket: "kraps"),
+  # ...
+)
+```
+
+This is of course much faster than using s3 or some s3 compatible service.
+Moreover, when testing large Kraps jobs you maybe want to test intermediate
+steps. You can use `#dump` for this purpose and test that the data dumped is
+correct.
+
+```ruby
+job = job.dump(prefix: "path/to/dump")
+```
+
+and in your tests do
+
+```ruby
+Kraps.driver.value("path/to/dump/0/chunk.json") # => data of partition 0
+Kraps.driver.value("path/to/dump/1/chunk.json") # => data of partition 1
+# ...
+```
+
+The data is stored in lines, each line is a json encoded array of key and
+value.
+
+```ruby
+data = Kraps.driver.value("path/to/dump/0/chunk.json").lines.map do |line|
+  JSON.parse(line) # => [key, value]
+end
+```
+
+The API of the driver is:
+
+* `store(name, data_or_io, options = {})`: Stores `data_or_io` as `name`
+* `list(prefix: nil)`: Lists all objects or all objects matching the `prefix`
+* `value(name)`: Returns the object content of `name`
+* `download(name, path)`: Downloads the object `name` to `path` in your
+  filesystem
+* `exists?(name)`: Returns `true`/`false`
+* `flush`: Removes all objects from the fake storage
+
 ## Dependencies
 
 Kraps is built on top of
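
The `dump` and `load` steps documented in the README changes above are meant to work as a pair: one job persists its partitioned result under a prefix, and a later job picks it up without recomputing. The following is a purely illustrative sketch of that composition, not code from the gem; the job classes, keys and the `MyKrapsWorker` class are hypothetical, and the job construction follows the README's earlier examples (not shown in this excerpt).

```ruby
# Illustrative only. MyKrapsWorker is a hypothetical background job worker
# class as used throughout the README examples.
class BuildWordCounts
  def call
    job = Kraps::Job.new(worker: MyKrapsWorker)

    job = job.parallelize(partitions: 32, partitioner: Kraps::HashPartitioner.new) do |collector|
      %w[file1 file2 file3].each { |name| collector.call(name) }
    end

    job = job.map do |filename, _value, collector|
      # read the file behind `filename` and emit one pair per word (omitted here)
      collector.call("some word from #{filename}", 1)
    end

    job = job.reduce { |_key, count1, count2| count1 + count2 }

    # Persist the reduced result; one folder per partition, as described above.
    [job.dump(prefix: "temp/kraps/word_counts")]
  end
end

class ReuseWordCounts
  def call
    job = Kraps::Job.new(worker: MyKrapsWorker)

    # Pick up what BuildWordCounts dumped; partitions should match the dump.
    job = job.load(prefix: "temp/kraps/word_counts", partitions: 32, partitioner: Kraps::HashPartitioner.new)

    [job]
  end
end
```
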
data/lib/kraps/actions.rb CHANGED
@@ -3,7 +3,9 @@ module Kraps
     ALL = [
       PARALLELIZE = "parallelize",
       MAP = "map",
+      MAP_PARTITIONS = "map_partitions",
       REDUCE = "reduce",
+      COMBINE = "combine",
       EACH_PARTITION = "each_partition"
     ]
   end
@@ -8,6 +8,26 @@ module Kraps
8
8
  def with_prefix(path)
9
9
  File.join(*[@prefix, path].compact)
10
10
  end
11
+
12
+ def list(prefix: nil)
13
+ driver.list(bucket, prefix: prefix)
14
+ end
15
+
16
+ def value(name)
17
+ driver.value(name, bucket)
18
+ end
19
+
20
+ def download(name, path)
21
+ driver.download(name, bucket, path)
22
+ end
23
+
24
+ def exists?(name)
25
+ driver.exists?(name, bucket)
26
+ end
27
+
28
+ def store(name, data_or_io, options = {})
29
+ driver.store(name, data_or_io, bucket, options)
30
+ end
11
31
  end
12
32
 
13
33
  class S3Driver
@@ -32,6 +52,10 @@ module Kraps
32
52
  @bucket = bucket
33
53
  @prefix = prefix
34
54
  end
55
+
56
+ def flush
57
+ driver.flush
58
+ end
35
59
  end
36
60
  end
37
61
  end
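
The methods added above give `Kraps.driver` a single facade over the underlying storage driver, matching the driver API the README's Testing section documents. A small, purely illustrative usage sketch against the in-memory fake driver; the object names and content below are made up.

```ruby
# Illustrative only; paths and payloads are made up.
Kraps.configure(driver: Kraps::Drivers::FakeDriver.new(bucket: "kraps"))

Kraps.driver.store("temp/kraps/example/0/chunk.json", "[\"some key\",1]\n")

Kraps.driver.exists?("temp/kraps/example/0/chunk.json") # => true
Kraps.driver.list(prefix: "temp/kraps/example/")        # => the matching object names
Kraps.driver.value("temp/kraps/example/0/chunk.json")   # => "[\"some key\",1]\n"

Kraps.driver.flush # wipe the fake storage, e.g. between test cases
```
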
data/lib/kraps/job.rb CHANGED
@@ -45,6 +45,24 @@ module Kraps
       end
     end
 
+    def map_partitions(partitions: nil, partitioner: nil, worker: @worker, before: nil, &block)
+      fresh.tap do |job|
+        job.instance_eval do
+          @partitions = partitions if partitions
+          @partitioner = partitioner if partitioner
+
+          @steps << Step.new(
+            action: Actions::MAP_PARTITIONS,
+            partitions: @partitions,
+            partitioner: @partitioner,
+            worker: worker,
+            before: before,
+            block: block
+          )
+        end
+      end
+    end
+
     def reduce(worker: @worker, before: nil, &block)
       fresh.tap do |job|
         job.instance_eval do
@@ -60,6 +78,23 @@ module Kraps
       end
     end
 
+    def combine(other_job, worker: @worker, before: nil, &block)
+      fresh.tap do |job|
+        job.instance_eval do
+          @steps << Step.new(
+            action: Actions::COMBINE,
+            partitions: @partitions,
+            partitioner: @partitioner,
+            worker: worker,
+            before: before,
+            block: block,
+            dependency: other_job,
+            options: { combine_step_index: other_job.steps.size - 1 }
+          )
+        end
+      end
+    end
+
     def each_partition(worker: @worker, before: nil, &block)
       fresh.tap do |job|
         job.instance_eval do
@@ -81,6 +116,45 @@ module Kraps
       end
     end
 
+    def dump(prefix:, worker: @worker)
+      each_partition(worker: worker) do |partition, pairs|
+        tempfile = Tempfile.new
+
+        pairs.each do |pair|
+          tempfile.puts(JSON.generate(pair))
+        end
+
+        Kraps.driver.store(File.join(prefix, partition.to_s, "chunk.json"), tempfile.tap(&:rewind))
+      ensure
+        tempfile&.close(true)
+      end
+    end
+
+    def load(prefix:, partitions:, partitioner:, worker: @worker)
+      job = parallelize(partitions: partitions, partitioner: proc { |key, _| key }, worker: worker) do |collector|
+        (0...partitions).each do |partition|
+          collector.call(partition)
+        end
+      end
+
+      job.map_partitions(partitioner: partitioner, worker: worker) do |partition, _, collector|
+        tempfile = Tempfile.new
+
+        path = File.join(prefix, partition.to_s, "chunk.json")
+        next unless Kraps.driver.exists?(path)
+
+        Kraps.driver.download(path, tempfile.path)
+
+        tempfile.each_line do |line|
+          key, value = JSON.parse(line)
+
+          collector.call(key, value)
+        end
+      ensure
+        tempfile&.close(true)
+      end
+    end
+
     def fresh
       dup.tap do |job|
         job.instance_variable_set(:@steps, @steps.dup)
data/lib/kraps/job_resolver.rb ADDED
@@ -0,0 +1,13 @@
+module Kraps
+  class JobResolver
+    def call(jobs)
+      resolve_dependencies(Array(jobs)).uniq
+    end
+
+    private
+
+    def resolve_dependencies(jobs)
+      jobs.map { |job| [resolve_dependencies(job.steps.map(&:dependency).compact), job] }.flatten
+    end
+  end
+end
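
In effect, this new resolver walks each job's steps, pulls in any `dependency` (set by `combine`), and returns a flattened, deduplicated list with dependencies ordered first. A standalone illustration (not from the gem) using stand-in structs that only provide the `steps`/`dependency` readers the resolver relies on:

```ruby
require "kraps"

# Stand-ins mimicking the #steps and #dependency interface used by JobResolver.
FakeStep = Struct.new(:dependency, keyword_init: true)
FakeJob  = Struct.new(:name, :steps, keyword_init: true)

base     = FakeJob.new(name: "base", steps: [FakeStep.new(dependency: nil)])
combined = FakeJob.new(name: "combined", steps: [FakeStep.new(dependency: base)])

# The dependency comes first and is listed only once, even though it appears
# both explicitly and as a dependency of `combined`.
Kraps::JobResolver.new.call([combined, base]).map(&:name) # => ["base", "combined"]
```
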
data/lib/kraps/runner.rb CHANGED
@@ -5,7 +5,7 @@ module Kraps
     end
 
     def call(*args, **kwargs)
-      Array(@klass.new.call(*args, **kwargs)).tap do |jobs|
+      JobResolver.new.call(@klass.new.call(*args, **kwargs)).tap do |jobs|
         jobs.each_with_index do |job, job_index|
           job.steps.each_with_index.inject(nil) do |frame, (_, step_index)|
             StepRunner.new(
@@ -69,6 +69,16 @@ module Kraps
       end
     end
 
+    def perform_map_partitions
+      with_distributed_job do |distributed_job|
+        push_and_wait(distributed_job, 0...@frame.partitions) do |partition, part|
+          enqueue(token: distributed_job.token, part: part, partition: partition)
+        end
+
+        Frame.new(token: distributed_job.token, partitions: @step.partitions)
+      end
+    end
+
     def perform_reduce
       with_distributed_job do |distributed_job|
         push_and_wait(distributed_job, 0...@frame.partitions) do |partition, part|
@@ -79,6 +89,21 @@ module Kraps
       end
     end
 
+    def perform_combine
+      combine_job = @step.dependency
+      combine_step = combine_job.steps[@step.options[:combine_step_index]]
+
+      raise(IncompatibleFrame, "Incompatible number of partitions") if combine_step.partitions != @step.partitions
+
+      with_distributed_job do |distributed_job|
+        push_and_wait(distributed_job, 0...@frame.partitions) do |partition, part|
+          enqueue(token: distributed_job.token, part: part, partition: partition, combine_frame: combine_step.frame.to_h)
+        end
+
+        Frame.new(token: distributed_job.token, partitions: @step.partitions)
+      end
+    end
+
     def perform_each_partition
       with_distributed_job do |distributed_job|
         push_and_wait(distributed_job, 0...@frame.partitions) do |partition, part|
data/lib/kraps/step.rb CHANGED
@@ -1,3 +1,3 @@
 module Kraps
-  Step = Struct.new(:action, :partitioner, :partitions, :block, :worker, :before, :frame, keyword_init: true)
+  Step = Struct.new(:action, :partitioner, :partitions, :block, :worker, :before, :frame, :dependency, :options, keyword_init: true)
 end
data/lib/kraps/temp_path.rb CHANGED
@@ -1,29 +1,3 @@
 module Kraps
-  class TempPath
-    attr_reader :path
-
-    def initialize(prefix: nil, suffix: nil)
-      @path = File.join(Dir.tmpdir, [prefix, SecureRandom.hex[0, 16], Process.pid, suffix].compact.join("."))
-
-      File.open(@path, File::CREAT | File::EXCL) {}
-
-      ObjectSpace.define_finalizer(self, self.class.finalize(@path))
-
-      return unless block_given?
-
-      begin
-        yield
-      ensure
-        unlink
-      end
-    end
-
-    def unlink
-      FileUtils.rm_f(@path)
-    end
-
-    def self.finalize(path)
-      proc { FileUtils.rm_f(path) }
-    end
-  end
+  TempPath = MapReduce::TempPath
 end
data/lib/kraps/temp_paths.rb CHANGED
@@ -17,9 +17,9 @@ module Kraps
       end
     end
 
-    def unlink
+    def delete
       synchronize do
-        @temp_paths.each(&:unlink)
+        @temp_paths.each(&:delete)
       end
     end
 
data/lib/kraps/version.rb CHANGED
@@ -1,3 +1,3 @@
 module Kraps
-  VERSION = "0.5.0"
+  VERSION = "0.6.0"
 end
data/lib/kraps/worker.rb CHANGED
@@ -1,10 +1,13 @@
 module Kraps
   class Worker
-    def initialize(json, memory_limit:, chunk_limit:, concurrency:)
+    include MapReduce::Mergeable
+
+    def initialize(json, memory_limit:, chunk_limit:, concurrency:, logger: Logger.new("/dev/null"))
       @args = JSON.parse(json)
       @memory_limit = memory_limit
       @chunk_limit = chunk_limit
       @concurrency = concurrency
+      @logger = logger
     end
 
     def call(retries: 3)
@@ -36,24 +39,14 @@ module Kraps
       mapper.shuffle(chunk_limit: @chunk_limit) do |partitions|
         Parallelizer.each(partitions.to_a, @concurrency) do |partition, path|
           File.open(path) do |stream|
-            Kraps.driver.driver.store(Kraps.driver.with_prefix("#{@args["token"]}/#{partition}/chunk.#{@args["part"]}.json"), stream, Kraps.driver.bucket)
+            Kraps.driver.store(Kraps.driver.with_prefix("#{@args["token"]}/#{partition}/chunk.#{@args["part"]}.json"), stream)
           end
         end
       end
     end
 
     def perform_map
-      temp_paths = TempPaths.new
-
-      files = Kraps.driver.driver.list(Kraps.driver.bucket, prefix: Kraps.driver.with_prefix("#{@args["frame"]["token"]}/#{@args["partition"]}/")).sort
-
-      temp_paths_index = files.each_with_object({}) do |file, hash|
-        hash[file] = temp_paths.add
-      end
-
-      Parallelizer.each(files, @concurrency) do |file|
-        Kraps.driver.driver.download(file, Kraps.driver.bucket, temp_paths_index[file].path)
-      end
+      temp_paths = download_all(token: @args["frame"]["token"], partition: @args["partition"])
 
       current_step = step
 
@@ -85,14 +78,45 @@ module Kraps
      mapper.shuffle(chunk_limit: @chunk_limit) do |partitions|
        Parallelizer.each(partitions.to_a, @concurrency) do |partition, path|
          File.open(path) do |stream|
-            Kraps.driver.driver.store(
-              Kraps.driver.with_prefix("#{@args["token"]}/#{partition}/chunk.#{@args["part"]}.json"), stream, Kraps.driver.bucket
-            )
+            Kraps.driver.store(Kraps.driver.with_prefix("#{@args["token"]}/#{partition}/chunk.#{@args["part"]}.json"), stream)
+          end
+        end
+      end
+    ensure
+      temp_paths&.delete
+    end
+
+    def perform_map_partitions
+      temp_paths = download_all(token: @args["frame"]["token"], partition: @args["partition"])
+
+      current_step = step
+      current_partition = @args["partition"]
+
+      implementation = Object.new
+      implementation.define_singleton_method(:map) do |enum, &block|
+        current_step.block.call(current_partition, enum, block)
+      end
+
+      subsequent_step = next_step
+
+      if subsequent_step&.action == Actions::REDUCE
+        implementation.define_singleton_method(:reduce) do |key, value1, value2|
+          subsequent_step.block.call(key, value1, value2)
+        end
+      end
+
+      mapper = MapReduce::Mapper.new(implementation, partitioner: partitioner, memory_limit: @memory_limit)
+      mapper.map(k_way_merge(temp_paths.each.to_a, chunk_limit: @chunk_limit))
+
+      mapper.shuffle(chunk_limit: @chunk_limit) do |partitions|
+        Parallelizer.each(partitions.to_a, @concurrency) do |partition, path|
+          File.open(path) do |stream|
+            Kraps.driver.store(Kraps.driver.with_prefix("#{@args["token"]}/#{partition}/chunk.#{@args["part"]}.json"), stream)
           end
        end
      end
    ensure
-      temp_paths&.unlink
+      temp_paths&.delete
    end
 
    def perform_reduce
@@ -105,8 +129,8 @@ module Kraps
 
       reducer = MapReduce::Reducer.new(implementation)
 
-      Parallelizer.each(Kraps.driver.driver.list(Kraps.driver.bucket, prefix: Kraps.driver.with_prefix("#{@args["frame"]["token"]}/#{@args["partition"]}/")), @concurrency) do |file|
-        Kraps.driver.driver.download(file, Kraps.driver.bucket, reducer.add_chunk)
+      Parallelizer.each(Kraps.driver.list(prefix: Kraps.driver.with_prefix("#{@args["frame"]["token"]}/#{@args["partition"]}/")), @concurrency) do |file|
+        Kraps.driver.download(file, reducer.add_chunk)
       end
 
       tempfile = Tempfile.new
@@ -115,35 +139,96 @@ module Kraps
         tempfile.puts(JSON.generate([key, value]))
       end
 
-      Kraps.driver.driver.store(Kraps.driver.with_prefix("#{@args["token"]}/#{@args["partition"]}/chunk.#{@args["part"]}.json"), tempfile.tap(&:rewind), Kraps.driver.bucket)
+      Kraps.driver.store(Kraps.driver.with_prefix("#{@args["token"]}/#{@args["partition"]}/chunk.#{@args["part"]}.json"), tempfile.tap(&:rewind))
     ensure
       tempfile&.close(true)
     end
 
+    def perform_combine
+      temp_paths1 = download_all(token: @args["frame"]["token"], partition: @args["partition"])
+      temp_paths2 = download_all(token: @args["combine_frame"]["token"], partition: @args["partition"])
+
+      enum1 = k_way_merge(temp_paths1.each.to_a, chunk_limit: @chunk_limit)
+      enum2 = k_way_merge(temp_paths2.each.to_a, chunk_limit: @chunk_limit)
+
+      combine_method = method(:combine)
+      current_step = step
+
+      implementation = Object.new
+      implementation.define_singleton_method(:map) do |&block|
+        combine_method.call(enum1, enum2) do |key, value1, value2|
+          block.call(key, current_step.block.call(key, value1, value2))
+        end
+      end
+
+      mapper = MapReduce::Mapper.new(implementation, partitioner: partitioner, memory_limit: @memory_limit)
+      mapper.map
+
+      mapper.shuffle(chunk_limit: @chunk_limit) do |partitions|
+        Parallelizer.each(partitions.to_a, @concurrency) do |partition, path|
+          File.open(path) do |stream|
+            Kraps.driver.store(Kraps.driver.with_prefix("#{@args["token"]}/#{partition}/chunk.#{@args["part"]}.json"), stream)
+          end
+        end
+      end
+    ensure
+      temp_paths1&.delete
+      temp_paths2&.delete
+    end
+
+    def combine(enum1, enum2)
+      current1 = begin; enum1.next; rescue StopIteration; nil; end
+      current2 = begin; enum2.next; rescue StopIteration; nil; end
+
+      loop do
+        return if current1.nil? && current2.nil?
+        return if current1.nil?
+
+        if current2.nil?
+          yield(current1[0], current1[1], nil)
+
+          current1 = begin; enum1.next; rescue StopIteration; nil; end
+        elsif current1[0] == current2[0]
+          loop do
+            yield(current1[0], current1[1], current2[1])
+
+            current1 = begin; enum1.next; rescue StopIteration; nil; end
+
+            break if current1.nil?
+            break if current1[0] != current2[0]
+          end
+
+          current2 = begin; enum2.next; rescue StopIteration; nil; end
+        else
+          res = current1[0] <=> current2[0]
+
+          if res < 0
+            yield(current1[0], current1[1], nil)
+
+            current1 = begin; enum1.next; rescue StopIteration; nil; end
+          else
+            current2 = begin; enum2.next; rescue StopIteration; nil; end
+          end
+        end
+      end
+    end
+
     def perform_each_partition
       temp_paths = TempPaths.new
 
-      files = Kraps.driver.driver.list(Kraps.driver.bucket, prefix: Kraps.driver.with_prefix("#{@args["frame"]["token"]}/#{@args["partition"]}/")).sort
+      files = Kraps.driver.list(prefix: Kraps.driver.with_prefix("#{@args["frame"]["token"]}/#{@args["partition"]}/")).sort
 
       temp_paths_index = files.each_with_object({}) do |file, hash|
         hash[file] = temp_paths.add
       end
 
       Parallelizer.each(files, @concurrency) do |file|
-        Kraps.driver.driver.download(file, Kraps.driver.bucket, temp_paths_index[file].path)
+        Kraps.driver.download(file, temp_paths_index[file].path)
       end
 
-      enum = Enumerator::Lazy.new(temp_paths) do |yielder, temp_path|
-        File.open(temp_path.path) do |stream|
-          stream.each_line do |line|
-            yielder << JSON.parse(line)
-          end
-        end
-      end
-
-      step.block.call(@args["partition"], enum)
+      step.block.call(@args["partition"], k_way_merge(temp_paths.each.to_a, chunk_limit: @chunk_limit))
     ensure
-      temp_paths&.unlink
+      temp_paths&.delete
     end
 
     def with_retries(num_retries)
@@ -154,12 +239,14 @@ module Kraps
     rescue Kraps::Error
       distributed_job.stop
       raise
-    rescue StandardError
+    rescue StandardError => e
       if retries >= num_retries
         distributed_job.stop
         raise
       end
 
+      @logger.error(e)
+
       sleep(5)
       retries += 1
 
@@ -167,8 +254,24 @@ module Kraps
       end
     end
 
+    def download_all(token:, partition:)
+      temp_paths = TempPaths.new
+
+      files = Kraps.driver.list(prefix: Kraps.driver.with_prefix("#{token}/#{partition}/")).sort
+
+      temp_paths_index = files.each_with_object({}) do |file, hash|
+        hash[file] = temp_paths.add
+      end
+
+      Parallelizer.each(files, @concurrency) do |file|
+        Kraps.driver.download(file, temp_paths_index[file].path)
+      end
+
+      temp_paths
+    end
+
     def jobs
-      @jobs ||= Array(@args["klass"].constantize.new.call(*@args["args"], **@args["kwargs"].transform_keys(&:to_sym)))
+      @jobs ||= JobResolver.new.call(@args["klass"].constantize.new.call(*@args["args"], **@args["kwargs"].transform_keys(&:to_sym)))
     end
 
     def job
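
The new `combine` helper in the worker is a merge join over two key-sorted streams: it walks both sides in lockstep, yields `nil` for keys missing on the right, and skips keys that only exist on the right. A standalone sketch of the same idea over plain sorted arrays (illustrative, not the gem's code):

```ruby
# Merge-join two key-sorted lists of [key, value] pairs. Mirrors the semantics
# documented for the combine step: right-only keys are dropped, missing right
# values come through as nil.
def merge_join(left, right)
  right_enum = right.each
  current_right = begin; right_enum.next; rescue StopIteration; nil; end

  left.each do |key, value|
    # advance the right side until it is >= the current left key
    while current_right && (current_right[0] <=> key).negative?
      current_right = begin; right_enum.next; rescue StopIteration; nil; end
    end

    right_value = current_right && current_right[0] == key ? current_right[1] : nil
    yield(key, value, right_value)
  end
end

left  = [["a", 1], ["b", 2], ["d", 4]]
right = [["a", 10], ["c", 30], ["d", 40]]

merge_join(left, right) { |key, v1, v2| p [key, v1, v2] }
# ["a", 1, 10]
# ["b", 2, nil]
# ["d", 4, 40]
```
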
data/lib/kraps.rb CHANGED
@@ -1,3 +1,9 @@
+require "distributed_job"
+require "ruby-progressbar"
+require "ruby-progressbar/outputs/null"
+require "map_reduce"
+require "redis"
+
 require_relative "kraps/version"
 require_relative "kraps/drivers"
 require_relative "kraps/actions"
@@ -8,21 +14,18 @@ require_relative "kraps/temp_paths"
 require_relative "kraps/timeout_queue"
 require_relative "kraps/interval"
 require_relative "kraps/job"
+require_relative "kraps/job_resolver"
 require_relative "kraps/runner"
 require_relative "kraps/step"
 require_relative "kraps/frame"
 require_relative "kraps/worker"
-require "distributed_job"
-require "ruby-progressbar"
-require "ruby-progressbar/outputs/null"
-require "map_reduce"
-require "redis"
 
 module Kraps
   class Error < StandardError; end
   class InvalidAction < Error; end
   class InvalidStep < Error; end
   class JobStopped < Error; end
+  class IncompatibleFrame < Error; end
 
   def self.configure(driver:, redis: Redis.new, namespace: nil, job_ttl: 24 * 60 * 60, show_progress: true, enqueuer: ->(worker, json) { worker.perform_async(json) })
     @driver = driver
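
For reference, the `configure` signature visible above takes the driver plus optional `redis`, `namespace`, `job_ttl`, `show_progress` and `enqueuer` settings. A hedged example of a full setup; the bucket, prefix and namespace are placeholders, and the `S3Driver` constructor arguments are assumed from the README rather than shown in this diff:

```ruby
require "kraps"

Kraps.configure(
  # S3Driver arguments are an assumption based on the README, not this diff.
  driver: Kraps::Drivers::S3Driver.new(s3_client: Aws::S3::Client.new, bucket: "some-bucket", prefix: "temp/kraps/"),
  redis: Redis.new,                    # used to track job progress
  namespace: "my-app",                 # optional key namespace, defaults to nil
  job_ttl: 24 * 60 * 60,               # how long progress state may live
  show_progress: true,                 # progress bar output on the runner
  enqueuer: ->(worker, json) { worker.perform_async(json) } # default enqueuer
)
```
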
metadata CHANGED
@@ -1,14 +1,14 @@
 --- !ruby/object:Gem::Specification
 name: kraps
 version: !ruby/object:Gem::Version
-  version: 0.5.0
+  version: 0.6.0
 platform: ruby
 authors:
 - Benjamin Vetter
 autorequire:
 bindir: exe
 cert_chain: []
-date: 2022-11-10 00:00:00.000000000 Z
+date: 2022-11-16 00:00:00.000000000 Z
 dependencies:
 - !ruby/object:Gem::Dependency
   name: attachie
@@ -147,6 +147,7 @@ files:
 - lib/kraps/hash_partitioner.rb
 - lib/kraps/interval.rb
 - lib/kraps/job.rb
+- lib/kraps/job_resolver.rb
 - lib/kraps/parallelizer.rb
 - lib/kraps/runner.rb
 - lib/kraps/step.rb