kraps 0.3.0 → 0.5.0
- checksums.yaml +4 -4
- data/CHANGELOG.md +10 -0
- data/Gemfile.lock +3 -3
- data/README.md +42 -16
- data/lib/kraps/job.rb +38 -10
- data/lib/kraps/runner.rb +4 -4
- data/lib/kraps/step.rb +1 -1
- data/lib/kraps/version.rb +1 -1
- data/lib/kraps/worker.rb +19 -4
- metadata +2 -2
checksums.yaml
CHANGED

@@ -1,7 +1,7 @@
 ---
 SHA256:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: 921ae08326c96216136418861b88af7f11bce519c924cd1813216165f7f02690
+  data.tar.gz: '0913d31d3caeea0be664bc714e9d0da58227f515c047be31359e96040bc0c141'
 SHA512:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: d8e43e5229fc310019801e62a2e278470a1eb37b50e4aca27b9c64edb6666115f0f25c7a7375790516e2726fcf10980cdac1523c54dde8d3527a39fd919a2a5a
+  data.tar.gz: 30b1a9edcdd4f7ff476bfa4c070aef31debd727500e27a08b59f1df2663362c60e3cc3a3c860455d568abd994bb56a216f7eedf8baea6cc06ca73b1d0bdf9a07
data/CHANGELOG.md
CHANGED

@@ -1,5 +1,15 @@
 # CHANGELOG
 
+## v0.5.0
+
+* Added a `before` option to specify a callable to run before
+  a step to e.g. populate caches upfront, etc.
+
+## v0.4.0
+
+* Pre-reduce in a map step when the subsequent step is a
+  reduce step
+
 ## v0.3.0
 
 * Changed partitioners to receive the number of partitions
data/Gemfile.lock
CHANGED

@@ -1,7 +1,7 @@
 PATH
   remote: .
   specs:
-    kraps (0.
+    kraps (0.5.0)
      attachie
      distributed_job
      map-reduce-ruby (>= 3.0.0)

@@ -23,7 +23,7 @@ GEM
       connection_pool
       mime-types
     aws-eventstream (1.2.0)
-    aws-partitions (1.
+    aws-partitions (1.657.0)
     aws-sdk-core (3.166.0)
       aws-eventstream (~> 1, >= 1.0.2)
       aws-partitions (~> 1, >= 1.651.0)

@@ -62,7 +62,7 @@ GEM
     rake (13.0.6)
     redis (5.0.5)
       redis-client (>= 0.9.0)
-    redis-client (0.11.
+    redis-client (0.11.1)
       connection_pool
     regexp_parser (2.6.0)
     rexml (3.2.5)
data/README.md
CHANGED

@@ -95,28 +95,41 @@ class MyKrapsWorker
   include Sidekiq::Worker
 
   def perform(json)
-    Kraps::Worker.new(json, memory_limit:
+    Kraps::Worker.new(json, memory_limit: 16.megabytes, chunk_limit: 64, concurrency: 8).call(retries: 3)
   end
 end
 ```
 
 The `json` argument is automatically enqueued by Kraps and contains everything
 it needs to know about the job and step to execute. The `memory_limit` tells
-Kraps how much memory it is allowed to allocate for temporary chunks
+Kraps how much memory it is allowed to allocate for temporary chunks. More
+concretely, it tells Kraps how big the file size of a temporary chunk can grow
+in memory before Kraps must write it to disk. However, Ruby of course
+allocates much more memory for a chunk than the raw file size of the chunk. As
+a rule of thumb, it allocates 10 times more memory. Still, choosing a value
+for `memory_limit` depends on the memory size of your container/server, how
+many worker threads your background queue spawns and how much memory your
+workers need besides Kraps. Let's say your container/server has 2 gigabytes of
+memory and your background framework spawns 5 threads. Theoretically, you
+might be able to give 300-400 megabytes to Kraps then, but now divide this by
+10 and specify a `memory_limit` of around `30.megabytes`, better less. The
+`memory_limit` affects how many chunks will be written to disk, depending on
+the size of the data you are processing and on how big these chunks are. The
+smaller the value, the more chunks; and the more chunks there are, the more
+runs Kraps needs to merge them, which can affect performance. The
+`chunk_limit` ensures that only the specified number of chunks are processed
+in a single run. A run basically means: it takes up to `chunk_limit` chunks,
+reduces them and pushes the result as a new chunk to the list of chunks to
+process. Thus, if your number of file descriptors is unlimited, you want to
+set it to a higher number to avoid the overhead of multiple runs.
+`concurrency` tells Kraps how many threads to use to concurrently
+upload/download files from the storage layer. Finally, `retries` specifies how
+often Kraps should retry the job step in case of errors. Kraps will sleep for
+5 seconds between those retries. Please note that it's not yet possible to use
+the retry mechanism of your background job framework with Kraps. Please note,
+however, that `parallelize` is not covered by `retries` yet, as the block
+passed to `parallelize` is executed by the runner, not the workers.
 
 Now, executing your job is super easy:
 
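The budgeting rule of thumb above (spare memory divided by roughly 10 for Ruby's allocation overhead) can be sketched in plain Ruby; the helper name and the concrete numbers are illustrative, not part of Kraps:

```ruby
# Rough memory_limit budgeting per the rule of thumb above: take the memory
# you can spare for Kraps and divide by ~10, since Ruby allocates roughly
# ten times the raw chunk file size. Names and numbers are illustrative.
MEGABYTE = 1024 * 1024

def suggested_memory_limit(spare_bytes, overhead_factor: 10)
  spare_bytes / overhead_factor
end

# 2 GB container, 5 worker threads, ~350 MB sparable for Kraps overall:
suggested_memory_limit(350 * MEGABYTE) / MEGABYTE # => 35
```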
@@ -252,6 +265,19 @@ job.each_partition do |partition, pairs|
 end
 ```
 
+Please note that every API method accepts a `before` callable:
+
+```ruby
+before_block = proc do
+  # runs once before the map action in every worker, which can be useful to
+  # e.g. populate caches etc.
+end
+
+job.map(before: before_block) do |key, value, collector|
+  # ...
+end
+```
+
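As a concrete (hypothetical) use of the `before` option shown above: warm an in-process lookup table once per worker so the map block avoids per-record lookups. The `LOOKUP` table and its data are made up, and the worker-side call sequence is simulated in plain Ruby rather than run through Kraps:

```ruby
# Hypothetical before-callable: populate a cache once, then the map block
# reads from it. The last lines simulate what a Kraps worker does: call
# `before`, then invoke the map block per record with a collector.
LOOKUP = {}

before_block = proc do
  # In a real app this might load reference data from a database.
  LOOKUP.merge!(1 => "US", 2 => "DE")
end

map_block = proc do |_key, value, collector|
  collector.call(LOOKUP[value[:shop_id]], 1)
end

results = []
before_block.call
map_block.call("item", { shop_id: 2 }, ->(k, v) { results << [k, v] })
results # => [["DE", 1]]
```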
 ## More Complex Jobs
 
 Please note that a job class can return multiple jobs and jobs can build up on
data/lib/kraps/job.rb
CHANGED

@@ -9,46 +9,74 @@ module Kraps
       @partitioner = HashPartitioner.new
     end
 
-    def parallelize(partitions:, partitioner: HashPartitioner.new, worker: @worker, &block)
+    def parallelize(partitions:, partitioner: HashPartitioner.new, worker: @worker, before: nil, &block)
       fresh.tap do |job|
         job.instance_eval do
           @partitions = partitions
           @partitioner = partitioner
 
-          @steps << Step.new(
+          @steps << Step.new(
+            action: Actions::PARALLELIZE,
+            partitions: @partitions,
+            partitioner: @partitioner,
+            worker: worker,
+            before: before,
+            block: block
+          )
         end
       end
     end
 
-    def map(partitions: nil, partitioner: nil, worker: @worker, &block)
+    def map(partitions: nil, partitioner: nil, worker: @worker, before: nil, &block)
       fresh.tap do |job|
         job.instance_eval do
           @partitions = partitions if partitions
           @partitioner = partitioner if partitioner
 
-          @steps << Step.new(
+          @steps << Step.new(
+            action: Actions::MAP,
+            partitions: @partitions,
+            partitioner: @partitioner,
+            worker: worker,
+            before: before,
+            block: block
+          )
         end
       end
     end
 
-    def reduce(worker: @worker, &block)
+    def reduce(worker: @worker, before: nil, &block)
       fresh.tap do |job|
         job.instance_eval do
-          @steps << Step.new(
+          @steps << Step.new(
+            action: Actions::REDUCE,
+            partitions: @partitions,
+            partitioner: @partitioner,
+            worker: worker,
+            before: before,
+            block: block
+          )
         end
       end
     end
 
-    def each_partition(worker: @worker, &block)
+    def each_partition(worker: @worker, before: nil, &block)
       fresh.tap do |job|
         job.instance_eval do
-          @steps << Step.new(
+          @steps << Step.new(
+            action: Actions::EACH_PARTITION,
+            partitions: @partitions,
+            partitioner: @partitioner,
+            worker: worker,
+            before: before,
+            block: block
+          )
         end
       end
     end
 
-    def repartition(partitions:, partitioner: nil, worker: @worker)
-      map(partitions: partitions, partitioner: partitioner, worker: worker) do |key, value, collector|
+    def repartition(partitions:, partitioner: nil, worker: @worker, before: nil)
+      map(partitions: partitions, partitioner: partitioner, worker: worker, before: before) do |key, value, collector|
         collector.call(key, value)
       end
     end
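The keyword-style `Step.new(...)` calls above read as if `Step` is a keyword-initialized value object carrying the step configuration, including the new `before` callable (its real definition lives in data/lib/kraps/step.rb, which this diff does not show). A minimal stand-in for illustration only:

```ruby
# Minimal stand-in for a Step-like value object, mirroring the keyword
# arguments used in job.rb above. This is NOT Kraps' actual Step class.
StepLike = Struct.new(:action, :partitions, :partitioner, :worker, :before, :block,
                      keyword_init: true)

step = StepLike.new(action: :map, partitions: 8, partitioner: nil,
                    worker: "MyKrapsWorker", before: proc { :warmed }, block: proc {})

step.before&.call # => :warmed; a nil `before` is simply skipped via `&.`
```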
data/lib/kraps/runner.rb
CHANGED

@@ -55,7 +55,7 @@ module Kraps
           enqueue(token: distributed_job.token, part: part, item: item)
         end
 
-        Frame.new(token: distributed_job.token, partitions: @step.
+        Frame.new(token: distributed_job.token, partitions: @step.partitions)
       end
     end
 
@@ -65,7 +65,7 @@ module Kraps
           enqueue(token: distributed_job.token, part: part, partition: partition)
         end
 
-        Frame.new(token: distributed_job.token, partitions: @step.
+        Frame.new(token: distributed_job.token, partitions: @step.partitions)
       end
     end
 
@@ -75,7 +75,7 @@ module Kraps
           enqueue(token: distributed_job.token, part: part, partition: partition)
         end
 
-        Frame.new(token: distributed_job.token, partitions: @step.
+        Frame.new(token: distributed_job.token, partitions: @step.partitions)
      end
    end
 
@@ -91,7 +91,7 @@ module Kraps
 
     def enqueue(token:, part:, **rest)
       Kraps.enqueuer.call(
-        @step.
+        @step.worker,
         JSON.generate(
           job_index: @job_index,
           step_index: @step_index,
data/lib/kraps/step.rb
CHANGED
data/lib/kraps/version.rb
CHANGED
data/lib/kraps/worker.rb
CHANGED

@@ -13,6 +13,8 @@ module Kraps
       raise(InvalidAction, "Invalid action #{step.action}") unless Actions::ALL.include?(step.action)
 
       with_retries(retries) do # TODO: allow to use queue based retries
+        step.before&.call
+
         send(:"perform_#{step.action}")
 
         distributed_job.done(@args["part"])
@@ -60,6 +62,14 @@ module Kraps
         current_step.block.call(key, value, block)
       end
 
+      subsequent_step = next_step
+
+      if subsequent_step&.action == Actions::REDUCE
+        implementation.define_singleton_method(:reduce) do |key, value1, value2|
+          subsequent_step.block.call(key, value1, value2)
+        end
+      end
+
       mapper = MapReduce::Mapper.new(implementation, partitioner: partitioner, memory_limit: @memory_limit)
 
       temp_paths.each do |temp_path|
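The hunk above implements the v0.4.0 pre-reduce feature: when the next step is a reduce step, a `reduce` method is defined on the map implementation, which lets values for the same key be combined already during the map phase (a combiner), so fewer pairs reach the reduce step. The idea, reduced to plain Ruby with no Kraps or map-reduce-ruby involved:

```ruby
# Combiner idea behind the pre-reduce hunk above: if we already know how two
# values for the same key will be reduced, we can merge them while mapping.
reduce_block = ->(_key, a, b) { a + b } # stands in for the reduce step's block

mapped = [["a", 1], ["b", 2], ["a", 3], ["a", 4]]

pre_reduced = mapped.each_with_object({}) do |(key, value), acc|
  acc[key] = acc.key?(key) ? reduce_block.call(key, acc[key], value) : value
end

pre_reduced # => {"a" => 8, "b" => 2}
```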
@@ -143,15 +153,16 @@ module Kraps
       yield
     rescue Kraps::Error
       distributed_job.stop
+      raise
     rescue StandardError
-      sleep(5)
-      retries += 1
-
       if retries >= num_retries
         distributed_job.stop
         raise
       end
 
+      sleep(5)
+      retries += 1
+
       retry
     end
   end
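The reordered `with_retries` above now re-raises `Kraps::Error` after stopping the distributed job, and checks the retry budget before sleeping, so the final failed attempt raises immediately instead of sleeping one last time. The control flow, reduced to plain Ruby (names assumed; `delay` stands in for Kraps' fixed 5-second sleep):

```ruby
# Control flow of the reordered with_retries: budget check first, then sleep
# and retry. With num_retries = 3 a persistently failing block runs 4 times.
def with_retries(num_retries, delay: 0)
  retries = 0
  begin
    yield
  rescue StandardError
    raise if retries >= num_retries

    sleep(delay)
    retries += 1
    retry
  end
end

attempts = 0
begin
  with_retries(3) { attempts += 1; raise "boom" }
rescue RuntimeError
end
attempts # => 4 (the initial try plus 3 retries)
```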
@@ -180,8 +191,12 @@ module Kraps
       end
     end
 
+    def next_step
+      @next_step ||= steps[@args["step_index"] + 1]
+    end
+
     def partitioner
-      @partitioner ||= proc { |key| step.
+      @partitioner ||= proc { |key| step.partitioner.call(key, step.partitions) }
     end
 
     def distributed_job
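The `partitioner` hunk above matches the v0.3.0 changelog entry: partitioners now receive the number of partitions as a second argument. A sketch of a partitioner honoring that contract; Kraps' built-in `HashPartitioner` may be implemented differently:

```ruby
# Sketch of a hash partitioner matching the calling convention above:
# callable with (key, num_partitions), returning a partition index.
require "digest"

hash_partitioner = proc do |key, num_partitions|
  Digest::SHA1.hexdigest(key.to_s).to_i(16) % num_partitions
end

partition = hash_partitioner.call("some key", 8)
partition.between?(0, 7) # => true, and the same key always maps to the same partition
```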
metadata
CHANGED

@@ -1,14 +1,14 @@
 --- !ruby/object:Gem::Specification
 name: kraps
 version: !ruby/object:Gem::Version
-  version: 0.
+  version: 0.5.0
 platform: ruby
 authors:
 - Benjamin Vetter
 autorequire:
 bindir: exe
 cert_chain: []
-date: 2022-11-
+date: 2022-11-10 00:00:00.000000000 Z
 dependencies:
 - !ruby/object:Gem::Dependency
   name: attachie