kraps 0.3.0 → 0.5.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
checksums.yaml CHANGED
@@ -1,7 +1,7 @@
 ---
 SHA256:
-  metadata.gz: f5bb601e7ee415b95b4b258a0241c25e6fe19eb3e772c06d4149afbfcfbe6c3d
-  data.tar.gz: cb948c05947e48d2d8e970eebbc6e2c4a5b0a88cb162ad87bf0743196f6bcaef
+  metadata.gz: 921ae08326c96216136418861b88af7f11bce519c924cd1813216165f7f02690
+  data.tar.gz: '0913d31d3caeea0be664bc714e9d0da58227f515c047be31359e96040bc0c141'
 SHA512:
-  metadata.gz: 1d1c5a16205c5584626fed5bca9b6c7dd6fae3b4f3c725b158e7740f6fa05a17abdcb483b43cbdad813576e2fc2c7621b89b94d61b32776d85ae774f5a4332d1
-  data.tar.gz: 2670dbc002633e801d8cf98fc8454c8881295f72b505bd4baf6cf0c8685a8c97a8a2dbf26e8a617c74b452ef627e820807e7af6e05b20a627fb99ce2eb216a1a
+  metadata.gz: d8e43e5229fc310019801e62a2e278470a1eb37b50e4aca27b9c64edb6666115f0f25c7a7375790516e2726fcf10980cdac1523c54dde8d3527a39fd919a2a5a
+  data.tar.gz: 30b1a9edcdd4f7ff476bfa4c070aef31debd727500e27a08b59f1df2663362c60e3cc3a3c860455d568abd994bb56a216f7eedf8baea6cc06ca73b1d0bdf9a07
data/CHANGELOG.md CHANGED
@@ -1,5 +1,15 @@
 # CHANGELOG
 
+## v0.5.0
+
+* Added a `before` option to specify a callable to run before
+  a step to e.g. populate caches upfront, etc.
+
+## v0.4.0
+
+* Pre-reduce in a map step when the subsequent step is a
+  reduce step
+
 ## v0.3.0
 
 * Changed partitioners to receive the number of partitions
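The v0.4.0 entry above describes pre-reducing during a map step. A minimal plain-Ruby sketch of the idea — not the Kraps API, all names here are illustrative — under the assumption that the reduce block is associative:

```ruby
# Illustrative sketch of pre-reducing: when the step following a map is a
# reduce, the reduce block can already be applied while mapped pairs sit
# in the in-memory buffer, shrinking the data written out between steps.
reduce_block = ->(key, value1, value2) { value1 + value2 }

buffer = {}
emit = lambda do |key, value|
  buffer[key] = buffer.key?(key) ? reduce_block.call(key, buffer[key], value) : value
end

# A word-count style map phase: identical keys get reduced immediately
# instead of being buffered as separate pairs.
%w[a b a c a b].each { |word| emit.call(word, 1) }

buffer # => {"a" => 3, "b" => 2, "c" => 1}
```

Only three pairs remain in the buffer instead of six, which is exactly the saving this optimization targets.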
data/Gemfile.lock CHANGED
@@ -1,7 +1,7 @@
 PATH
   remote: .
   specs:
-    kraps (0.2.0)
+    kraps (0.5.0)
       attachie
       distributed_job
       map-reduce-ruby (>= 3.0.0)
@@ -23,7 +23,7 @@ GEM
       connection_pool
       mime-types
     aws-eventstream (1.2.0)
-    aws-partitions (1.654.0)
+    aws-partitions (1.657.0)
     aws-sdk-core (3.166.0)
       aws-eventstream (~> 1, >= 1.0.2)
       aws-partitions (~> 1, >= 1.651.0)
@@ -62,7 +62,7 @@ GEM
     rake (13.0.6)
     redis (5.0.5)
       redis-client (>= 0.9.0)
-    redis-client (0.11.0)
+    redis-client (0.11.1)
       connection_pool
     regexp_parser (2.6.0)
     rexml (3.2.5)
data/README.md CHANGED
@@ -95,28 +95,41 @@ class MyKrapsWorker
   include Sidekiq::Worker
 
   def perform(json)
-    Kraps::Worker.new(json, memory_limit: 128.megabytes, chunk_limit: 64, concurrency: 8).call(retries: 3)
+    Kraps::Worker.new(json, memory_limit: 16.megabytes, chunk_limit: 64, concurrency: 8).call(retries: 3)
   end
 end
 ```
 
 The `json` argument is automatically enqueued by Kraps and contains everything
 it needs to know about the job and step to execute. The `memory_limit` tells
-Kraps how much memory it is allowed to allocate for temporary chunks, etc. This
-value depends on the memory size of your container/server and how much worker
-threads your background queue spawns. Let's say your container/server has 2
-gigabytes of memory and your background framework spawns 5 threads.
-Theoretically, you might be able to give 300-400 megabytes to Kraps then. The
-`chunk_limit` ensures that only the specified amount of chunks are processed in
-a single run. A run basically means: it takes up to `chunk_limit` chunks,
-reduces them and pushes the result as a new chunk to the list of chunks to
-process. Thus, if your number of file descriptors is unlimited, you want to set
-it to a higher number to avoid the overhead of multiple runs. `concurrency`
-tells Kraps how much threads to use to concurrently upload/download files from
-the storage layer. Finally, `retries` specifies how often Kraps should retry
-the job step in case of errors. Kraps will sleep for 5 seconds between those
-retries. Please note that it's not yet possible to use the retry mechanism of
-your background job framework with Kraps.
+Kraps how much memory it is allowed to allocate for temporary chunks. More
+concretely, it tells Kraps how big the file size of a temporary chunk can grow
+in memory before Kraps must write it to disk. However, Ruby allocates much more
+memory for a chunk than the raw file size of the chunk; as a rule of thumb,
+about 10 times more. Still, choosing a value for `memory_limit` depends on the
+memory size of your container/server, how many worker threads your background
+queue spawns and how much memory your workers need besides Kraps. Let's say
+your container/server has 2 gigabytes of memory and your background framework
+spawns 5 threads. Theoretically, you might be able to give 300-400 megabytes
+to Kraps then, but now divide this by 10 and specify a `memory_limit` of
+around `30.megabytes`, better less. The `memory_limit` affects how many chunks
+will be written to disk, depending on the size of the data you are processing
+and how big these chunks are. The smaller the value, the more chunks; and the
+more chunks there are, the more runs Kraps needs to merge them, which can
+affect performance. The `chunk_limit` ensures that only the specified amount
+of chunks are processed in a single run. A run basically means: it takes up to
+`chunk_limit` chunks, reduces them and pushes the result as a new chunk to the
+list of chunks to process. Thus, if your number of file descriptors is
+unlimited, you want to set it to a higher number to avoid the overhead of
+multiple runs. `concurrency` tells Kraps how many threads to use to
+concurrently upload/download files from the storage layer. Finally, `retries`
+specifies how often Kraps should retry the job step in case of errors. Kraps
+will sleep for 5 seconds between those retries. Please note that it's not yet
+possible to use the retry mechanism of your background job framework with
+Kraps. Please note, however, that `parallelize` is not covered by `retries`
+yet, as the block passed to `parallelize` is executed by the runner, not the
+workers.
 
 Now, executing your job is super easy:
 
@@ -252,6 +265,19 @@ job.each_partition do |partition, pairs|
 end
 ```
 
+Please note that every API method accepts a `before` callable:
+
+```ruby
+before_block = proc do
+  # runs once before the map action in every worker, which can be useful to
+  # e.g. populate caches etc.
+end
+
+job.map(before: before_block) do |key, value, collector|
+  # ...
+end
+```
+
 ## More Complex Jobs
 
 Please note that a job class can return multiple jobs and jobs can build up on
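The README's `memory_limit` sizing advice can be worked through numerically. This is only a back-of-the-envelope sketch; the `headroom_factor` is an assumption introduced here, not something Kraps defines:

```ruby
# Worked example of the README's sizing advice: split container memory
# across worker threads, keep some headroom for the rest of the app, then
# divide by the ~10x allocation overhead the README attributes to Ruby.
container_memory_mb = 2048   # 2 GB container, as in the README's example
worker_threads      = 5      # background framework spawns 5 threads
headroom_factor     = 0.8    # assumption: reserve ~20% for non-Kraps memory

per_thread_mb   = (container_memory_mb / worker_threads) * headroom_factor
memory_limit_mb = (per_thread_mb / 10).floor

memory_limit_mb # => 32, in the ballpark of the suggested `30.megabytes`
```

Rounding down (and then some) is the safe direction here, since exceeding the limit means more disk writes, while exceeding the container's memory means the worker gets killed.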
data/lib/kraps/job.rb CHANGED
@@ -9,46 +9,74 @@ module Kraps
       @partitioner = HashPartitioner.new
     end
 
-    def parallelize(partitions:, partitioner: HashPartitioner.new, worker: @worker, &block)
+    def parallelize(partitions:, partitioner: HashPartitioner.new, worker: @worker, before: nil, &block)
       fresh.tap do |job|
         job.instance_eval do
           @partitions = partitions
           @partitioner = partitioner
 
-          @steps << Step.new(action: Actions::PARALLELIZE, args: { partitions: @partitions, partitioner: @partitioner, worker: worker }, block: block)
+          @steps << Step.new(
+            action: Actions::PARALLELIZE,
+            partitions: @partitions,
+            partitioner: @partitioner,
+            worker: worker,
+            before: before,
+            block: block
+          )
         end
       end
     end
 
-    def map(partitions: nil, partitioner: nil, worker: @worker, &block)
+    def map(partitions: nil, partitioner: nil, worker: @worker, before: nil, &block)
       fresh.tap do |job|
         job.instance_eval do
           @partitions = partitions if partitions
           @partitioner = partitioner if partitioner
 
-          @steps << Step.new(action: Actions::MAP, args: { partitions: @partitions, partitioner: @partitioner, worker: worker }, block: block)
+          @steps << Step.new(
+            action: Actions::MAP,
+            partitions: @partitions,
+            partitioner: @partitioner,
+            worker: worker,
+            before: before,
+            block: block
+          )
         end
       end
     end
 
-    def reduce(worker: @worker, &block)
+    def reduce(worker: @worker, before: nil, &block)
       fresh.tap do |job|
         job.instance_eval do
-          @steps << Step.new(action: Actions::REDUCE, args: { partitions: @partitions, partitioner: @partitioner, worker: worker }, block: block)
+          @steps << Step.new(
+            action: Actions::REDUCE,
+            partitions: @partitions,
+            partitioner: @partitioner,
+            worker: worker,
+            before: before,
+            block: block
+          )
        end
      end
    end
 
-    def each_partition(worker: @worker, &block)
+    def each_partition(worker: @worker, before: nil, &block)
       fresh.tap do |job|
         job.instance_eval do
-          @steps << Step.new(action: Actions::EACH_PARTITION, args: { partitions: @partitions, partitioner: @partitioner, worker: worker }, block: block)
+          @steps << Step.new(
+            action: Actions::EACH_PARTITION,
+            partitions: @partitions,
+            partitioner: @partitioner,
+            worker: worker,
+            before: before,
+            block: block
+          )
         end
       end
     end
 
-    def repartition(partitions:, partitioner: nil, worker: @worker)
-      map(partitions: partitions, partitioner: partitioner, worker: worker) do |key, value, collector|
+    def repartition(partitions:, partitioner: nil, worker: @worker, before: nil)
+      map(partitions: partitions, partitioner: partitioner, worker: worker, before: before) do |key, value, collector|
         collector.call(key, value)
       end
     end
data/lib/kraps/runner.rb CHANGED
@@ -55,7 +55,7 @@ module Kraps
           enqueue(token: distributed_job.token, part: part, item: item)
         end
 
-        Frame.new(token: distributed_job.token, partitions: @step.args[:partitions])
+        Frame.new(token: distributed_job.token, partitions: @step.partitions)
       end
     end
 
@@ -65,7 +65,7 @@ module Kraps
           enqueue(token: distributed_job.token, part: part, partition: partition)
         end
 
-        Frame.new(token: distributed_job.token, partitions: @step.args[:partitions])
+        Frame.new(token: distributed_job.token, partitions: @step.partitions)
       end
     end
 
@@ -75,7 +75,7 @@ module Kraps
           enqueue(token: distributed_job.token, part: part, partition: partition)
         end
 
-        Frame.new(token: distributed_job.token, partitions: @step.args[:partitions])
+        Frame.new(token: distributed_job.token, partitions: @step.partitions)
       end
     end
 
@@ -91,7 +91,7 @@ module Kraps
 
     def enqueue(token:, part:, **rest)
       Kraps.enqueuer.call(
-        @step.args[:worker],
+        @step.worker,
         JSON.generate(
           job_index: @job_index,
           step_index: @step_index,
data/lib/kraps/step.rb CHANGED
@@ -1,3 +1,3 @@
 module Kraps
-  Step = Struct.new(:action, :args, :block, :frame, keyword_init: true)
+  Step = Struct.new(:action, :partitioner, :partitions, :block, :worker, :before, :frame, keyword_init: true)
 end
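The `Step` change above flattens what used to live in a nested `args` hash into first-class struct members, which is what lets the runner and worker write `@step.partitions` instead of `@step.args[:partitions]`. A quick sketch of the new access style (the struct definition is taken verbatim from the diff; the sample values are made up):

```ruby
# New flat Step struct from this diff: options like partitions and worker
# become struct members instead of entries in an :args hash.
Step = Struct.new(:action, :partitioner, :partitions, :block, :worker, :before, :frame, keyword_init: true)

step = Step.new(action: :map, partitions: 8, worker: "MyKrapsWorker")

# Callers now write step.partitions rather than step.args[:partitions].
step.partitions # => 8
step.before     # => nil, since no before callable was given
```

With `keyword_init: true`, omitted members simply default to `nil`, which is how the new optional `before` slot stays backward compatible.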
data/lib/kraps/version.rb CHANGED
@@ -1,3 +1,3 @@
 module Kraps
-  VERSION = "0.3.0"
+  VERSION = "0.5.0"
 end
data/lib/kraps/worker.rb CHANGED
@@ -13,6 +13,8 @@ module Kraps
       raise(InvalidAction, "Invalid action #{step.action}") unless Actions::ALL.include?(step.action)
 
       with_retries(retries) do # TODO: allow to use queue based retries
+        step.before&.call
+
         send(:"perform_#{step.action}")
 
         distributed_job.done(@args["part"])
@@ -60,6 +62,14 @@ module Kraps
         current_step.block.call(key, value, block)
       end
 
+      subsequent_step = next_step
+
+      if subsequent_step&.action == Actions::REDUCE
+        implementation.define_singleton_method(:reduce) do |key, value1, value2|
+          subsequent_step.block.call(key, value1, value2)
+        end
+      end
+
       mapper = MapReduce::Mapper.new(implementation, partitioner: partitioner, memory_limit: @memory_limit)
 
       temp_paths.each do |temp_path|
@@ -143,15 +153,16 @@ module Kraps
       yield
     rescue Kraps::Error
       distributed_job.stop
+      raise
     rescue StandardError
-      sleep(5)
-      retries += 1
-
       if retries >= num_retries
         distributed_job.stop
         raise
       end
 
+      sleep(5)
+      retries += 1
+
       retry
     end
   end
@@ -180,8 +191,12 @@ module Kraps
       end
     end
 
+    def next_step
+      @next_step ||= steps[@args["step_index"] + 1]
+    end
+
     def partitioner
-      @partitioner ||= proc { |key| step.args[:partitioner].call(key, step.args[:partitions]) }
+      @partitioner ||= proc { |key| step.partitioner.call(key, step.partitions) }
     end
 
     def distributed_job
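The `with_retries` hunk above reorders the retry loop so the limit check happens before sleeping. A standalone plain-Ruby sketch of that control flow — without the `distributed_job.stop` side effect, and with a configurable interval so the example runs instantly:

```ruby
# Sketch of the reordered retry loop: checking the limit before sleeping
# means the final failing attempt re-raises immediately instead of
# sleeping for 5 seconds first, as the old ordering did.
def with_retries(num_retries, interval: 0)
  retries = 0

  begin
    yield
  rescue StandardError
    raise if retries >= num_retries

    sleep(interval)
    retries += 1
    retry
  end
end

attempts = 0

begin
  with_retries(3) do
    attempts += 1
    raise "boom"
  end
rescue RuntimeError
  # all retries exhausted, error propagated to the caller
end

attempts # => 4: the initial attempt plus 3 retries
```

The other fix in that hunk is the added `raise` in the `Kraps::Error` branch, so fatal errors now propagate to the background job framework instead of being swallowed after stopping the distributed job.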
metadata CHANGED
@@ -1,14 +1,14 @@
 --- !ruby/object:Gem::Specification
 name: kraps
 version: !ruby/object:Gem::Version
-  version: 0.3.0
+  version: 0.5.0
 platform: ruby
 authors:
 - Benjamin Vetter
 autorequire:
 bindir: exe
 cert_chain: []
-date: 2022-11-07 00:00:00.000000000 Z
+date: 2022-11-10 00:00:00.000000000 Z
 dependencies:
 - !ruby/object:Gem::Dependency
   name: attachie