kraps 0.2.0 → 0.4.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
checksums.yaml CHANGED
@@ -1,7 +1,7 @@
 ---
 SHA256:
-  metadata.gz: 57d06d20f406e72e26424dcba5c2af296a1551085113371c2cae9894f18e72ff
-  data.tar.gz: a25c8c3440cdd26eeb9b32655e77556e4be4966d47bd3e290d0a702c3b4cde9f
+  metadata.gz: 15d08cf8952d4e5a083a6a4f9791fd16e9e2dbf67c1c71326f3af840c0c72eb8
+  data.tar.gz: c6542584846c54e5897b7b59ef40e8ec282ee27521b2d5ff39551eb02755882d
 SHA512:
-  metadata.gz: 655e2d0f525e136b72e87231a8d359eda952b7e3b30a8cd38b25f7b4fba27baebedacced0a00071915166325b621437869026c6785ab7353eea928ca736bc2a7
-  data.tar.gz: 0d450b3032a6809d2a5300e18e3cfc9a948ca5f589f1161823d9923084939d249282d28bf0619c3b9ace10dc742118dfcf5895c898e2d9fe7c55f98d00e683b4
+  metadata.gz: 21d1ef7a132edacf54e0b2df12b8d085af84ec1ed1cd019d258e43aba4cffbecdeada9b2b7f4baeefec4b59d115eb3e38400da94a3d7961ab19bbbb7dd2cf58c
+  data.tar.gz: fde066e9fdc5f9df7e95be43142cb04a7a1c5279decb277f1d815db508c87d2c04be46ea9559069c8a2c9539ee2eaa949a2fe2fdc3bf862937f9211cdfd8fbd5
data/CHANGELOG.md CHANGED
@@ -1,5 +1,15 @@
 # CHANGELOG
 
+## v0.4.0
+
+* Pre-reduce in a map step when the subsequent step is a
+  reduce step
+
+## v0.3.0
+
+* Changed partitioners to receive the number of partitions
+  as second parameter
+
 ## v0.2.0
 
 * Updated map-reduce-ruby to allow concurrent uploads
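
The v0.4.0 entry is easiest to see outside of Kraps: when the step after a map is a reduce, values that share a key can already be merged while mapping, so fewer pairs have to be written and shuffled. A minimal sketch of that combiner idea in plain Ruby (the word-count data is invented for illustration, not taken from the gem):

```ruby
# Without pre-reduction, the map step would emit one pair per word. With it,
# pairs sharing a key are reduced (here: summed) before being written out,
# shrinking the data that reaches the reduce step.
pairs = [["a", 1], ["b", 1], ["a", 1]]

pre_reduced = pairs.each_with_object(Hash.new(0)) do |(key, value), acc|
  acc[key] += value # the subsequent step's reduce block, applied early
end

pre_reduced # => {"a"=>2, "b"=>1}, i.e. two pairs instead of three
```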
data/Gemfile.lock CHANGED
@@ -1,7 +1,7 @@
 PATH
   remote: .
   specs:
-    kraps (0.2.0)
+    kraps (0.4.0)
       attachie
       distributed_job
       map-reduce-ruby (>= 3.0.0)
@@ -23,7 +23,7 @@ GEM
       connection_pool
       mime-types
     aws-eventstream (1.2.0)
-    aws-partitions (1.654.0)
+    aws-partitions (1.657.0)
     aws-sdk-core (3.166.0)
       aws-eventstream (~> 1, >= 1.0.2)
       aws-partitions (~> 1, >= 1.651.0)
@@ -62,7 +62,7 @@ GEM
     rake (13.0.6)
     redis (5.0.5)
       redis-client (>= 0.9.0)
-    redis-client (0.11.0)
+    redis-client (0.11.1)
       connection_pool
     regexp_parser (2.6.0)
     rexml (3.2.5)
data/README.md CHANGED
@@ -95,28 +95,41 @@ class MyKrapsWorker
   include Sidekiq::Worker
 
   def perform(json)
-    Kraps::Worker.new(json, memory_limit: 128.megabytes, chunk_limit: 64, concurrency: 8).call(retries: 3)
+    Kraps::Worker.new(json, memory_limit: 16.megabytes, chunk_limit: 64, concurrency: 8).call(retries: 3)
   end
 end
 ```
 
 The `json` argument is automatically enqueued by Kraps and contains everything
 it needs to know about the job and step to execute. The `memory_limit` tells
-Kraps how much memory it is allowed to allocate for temporary chunks, etc. This
-value depends on the memory size of your container/server and how much worker
-threads your background queue spawns. Let's say your container/server has 2
-gigabytes of memory and your background framework spawns 5 threads.
-Theoretically, you might be able to give 300-400 megabytes to Kraps then. The
-`chunk_limit` ensures that only the specified amount of chunks are processed in
-a single run. A run basically means: it takes up to `chunk_limit` chunks,
-reduces them and pushes the result as a new chunk to the list of chunks to
-process. Thus, if your number of file descriptors is unlimited, you want to set
-it to a higher number to avoid the overhead of multiple runs. `concurrency`
-tells Kraps how much threads to use to concurrently upload/download files from
-the storage layer. Finally, `retries` specifies how often Kraps should retry
-the job step in case of errors. Kraps will sleep for 5 seconds between those
-retries. Please note that it's not yet possible to use the retry mechanism of
-your background job framework with Kraps.
+Kraps how much memory it is allowed to allocate for temporary chunks. More
+concretely, it tells Kraps how big the file size of a temporary chunk can grow
+in memory before Kraps must write it to disk. However, Ruby of course
+allocates much more memory for a chunk than the raw file size of the chunk. As
+a rule of thumb, it allocates 10 times more memory. Still, choosing a value
+for `memory_limit` depends on the memory size of your container/server, how
+many worker threads your background queue spawns and how much memory your
+workers need besides Kraps. Let's say your container/server has 2 gigabytes of
+memory and your background framework spawns 5 threads. Theoretically, you
+might be able to give 300-400 megabytes to Kraps then, but now divide this by
+10 and specify a `memory_limit` of around `30.megabytes`, or better, less. The
+`memory_limit` affects how many chunks will be written to disk, depending on
+the size of the data you are processing and how big these chunks are. The
+smaller the value, the more chunks there will be, and the more chunks, the
+more runs Kraps needs to merge them, which can affect performance. The
+`chunk_limit` ensures that only the specified number of chunks is processed in
+a single run. A run basically means: it takes up to `chunk_limit` chunks,
+reduces them and pushes the result as a new chunk to the list of chunks to
+process. Thus, if your number of file descriptors is unlimited, you want to
+set it to a higher number to avoid the overhead of multiple runs.
+`concurrency` tells Kraps how many threads to use to concurrently
+upload/download files from the storage layer. Finally, `retries` specifies how
+often Kraps should retry the job step in case of errors. Kraps will sleep for
+5 seconds between those retries. Please note that it's not yet possible to use
+the retry mechanism of your background job framework with Kraps. Also note
+that `parallelize` is not covered by `retries` yet, as the block passed to
+`parallelize` is executed by the runner, not the workers.
+
 
 Now, executing your job is super easy:
 
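
To make the sizing rule of thumb concrete, here is the arithmetic from the paragraph above as a hedged Ruby sketch (the numbers are the paragraph's illustrative values; only `memory_limit` is a real Kraps parameter):

```ruby
# Worked example of the memory_limit heuristic described above.
container_memory = 2048 # megabytes available on the container/server
worker_threads   = 5    # threads spawned by the background framework

per_thread = container_memory / worker_threads # ~400 megabytes per thread

# Ruby allocates roughly 10x the raw chunk size, so divide by 10 and
# leave headroom for whatever else your workers allocate:
memory_limit = per_thread / 10 # ~40; the README suggests ~30 megabytes or less
```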
@@ -143,17 +156,18 @@ split. Kraps assigns every `key` to a partition, either using a custom
 `partitioner` or the default built in hash partitioner. The hash partitioner
 simply calculates a hash of your key modulo the number of partitions and the
 resulting partition number is the partition where the respective key is
-assigned to. A partitioner is a callable which gets the key as argument and
-returns a partition number. The built in hash partitioner looks similar to this
-one:
+assigned to. A partitioner is a callable which gets the key and the number of
+partitions as arguments and returns a partition number. The built in hash
+partitioner looks similar to this one:
 
 ```ruby
-partitioner = proc { |key| Digest::SHA1.hexdigest(key.inspect)[0..4].to_i(16) % 128 } # 128 partitions
+partitioner = proc { |key, num_partitions| Digest::SHA1.hexdigest(key.inspect)[0..4].to_i(16) % num_partitions }
 ```
 
 Please note, it's important that the partitioner and the specified number of
 partitions stay in sync. When you use a custom partitioner, please make sure
-that the partitioner operates on the same number of partitions you specify.
+that the partitioner correctly returns a partition number in the range of
+`0...num_partitions`.
 
 ## Datatypes
 
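
To illustrate the `0...num_partitions` contract from the hunk above, here is a hedged sketch of a custom partitioner (the first-byte routing scheme is invented for this example and is not part of Kraps):

```ruby
# A custom partitioner honoring the 0...num_partitions contract: keys are
# routed by their first byte, so keys with the same leading character always
# land in the same partition (illustrative scheme only).
partitioner = proc do |key, num_partitions|
  str = key.to_s
  str.empty? ? 0 : str.bytes.first % num_partitions
end

partitioner.call("users/42", 32) # => always within 0...32
```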
data/lib/kraps/hash_partitioner.rb ADDED
@@ -0,0 +1,7 @@
+module Kraps
+  class HashPartitioner
+    def call(key, num_partitions)
+      Digest::SHA1.hexdigest(JSON.generate(key))[0..4].to_i(16) % num_partitions
+    end
+  end
+end
data/lib/kraps/job.rb CHANGED
@@ -6,10 +6,10 @@ module Kraps
       @worker = worker
       @steps = []
       @partitions = 0
-      @partitioner = MapReduce::HashPartitioner.new(@partitions)
+      @partitioner = HashPartitioner.new
     end
 
-    def parallelize(partitions:, partitioner: MapReduce::HashPartitioner.new(partitions), worker: @worker, &block)
+    def parallelize(partitions:, partitioner: HashPartitioner.new, worker: @worker, &block)
       fresh.tap do |job|
         job.instance_eval do
           @partitions = partitions
@@ -24,7 +24,7 @@ module Kraps
       fresh.tap do |job|
         job.instance_eval do
           @partitions = partitions if partitions
-          @partitioner = partitioner || MapReduce::HashPartitioner.new(partitions) if partitioner || partitions
+          @partitioner = partitioner if partitioner
 
           @steps << Step.new(action: Actions::MAP, args: { partitions: @partitions, partitioner: @partitioner, worker: worker }, block: block)
         end
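
Given the new `parallelize` signature above, a custom two-argument partitioner can be passed per job. A hedged usage sketch follows; the `TenantPartitioner` class, the key format and the collector block are assumptions for illustration (see the Kraps README for the actual block semantics):

```ruby
require "digest"

# Hypothetical: keep all keys of one tenant in the same partition by hashing
# only the tenant prefix of a "tenant:record" style key.
class TenantPartitioner
  def call(key, num_partitions)
    tenant = key.to_s.split(":").first.to_s
    Digest::SHA1.hexdigest(tenant)[0..4].to_i(16) % num_partitions
  end
end

# Somewhere inside a Kraps job definition, assuming `job` is the Kraps::Job
# instance handed to your job's #call method:
job = job.parallelize(partitions: 64, partitioner: TenantPartitioner.new) do |collector|
  # emit items here; note this block runs on the runner, not the workers
end
```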
data/lib/kraps/version.rb CHANGED
@@ -1,3 +1,3 @@
 module Kraps
-  VERSION = "0.2.0"
+  VERSION = "0.4.0"
 end
data/lib/kraps/worker.rb CHANGED
@@ -60,6 +60,14 @@ module Kraps
         current_step.block.call(key, value, block)
       end
 
+      subsequent_step = next_step
+
+      if subsequent_step&.action == Actions::REDUCE
+        implementation.define_singleton_method(:reduce) do |key, value1, value2|
+          subsequent_step.block.call(key, value1, value2)
+        end
+      end
+
       mapper = MapReduce::Mapper.new(implementation, partitioner: partitioner, memory_limit: @memory_limit)
 
       temp_paths.each do |temp_path|
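
The hunk above bolts a `reduce` method onto the per-step implementation object only when the next step is a reduce, which is how the pre-reduce feature from the changelog is wired in. A standalone illustration of the `define_singleton_method` pattern (names and the summing block are invented for this sketch):

```ruby
# Attach a reduce method to a plain object so a consumer that duck-types on
# #reduce can pre-combine values for equal keys.
implementation = Object.new

reduce_block = proc { |key, value1, value2| value1 + value2 } # e.g. summing counts

implementation.define_singleton_method(:reduce) do |key, value1, value2|
  reduce_block.call(key, value1, value2)
end

implementation.reduce("word", 2, 3) # => 5
```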
@@ -143,15 +151,16 @@
         yield
       rescue Kraps::Error
         distributed_job.stop
+        raise
       rescue StandardError
-        sleep(5)
-        retries += 1
-
         if retries >= num_retries
           distributed_job.stop
           raise
         end
 
+        sleep(5)
+        retries += 1
+
         retry
       end
     end
@@ -180,8 +189,12 @@
       end
     end
 
+    def next_step
+      @next_step ||= steps[@args["step_index"] + 1]
+    end
+
     def partitioner
-      @partitioner ||= step.args[:partitioner]
+      @partitioner ||= proc { |key| step.args[:partitioner].call(key, step.args[:partitions]) }
    end
 
     def distributed_job
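
The `partitioner` change above is a small adapter: the step's partition count is curried into the new two-argument Kraps partitioner, yielding the one-argument callable that the underlying map-reduce machinery keeps calling with just a key. The same idea in isolation (the hash scheme and the count of 32 are illustrative):

```ruby
require "digest"

# Curry the partition count into a two-argument partitioner to obtain a
# one-argument callable.
two_arg = proc { |key, num_partitions| Digest::SHA1.hexdigest(key.to_s)[0..4].to_i(16) % num_partitions }
one_arg = proc { |key| two_arg.call(key, 32) } # 32 = the step's partition count

one_arg.call("some key") # => a number in 0...32
```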
data/lib/kraps.rb CHANGED
@@ -2,6 +2,7 @@ require_relative "kraps/version"
 require_relative "kraps/drivers"
 require_relative "kraps/actions"
 require_relative "kraps/parallelizer"
+require_relative "kraps/hash_partitioner"
 require_relative "kraps/temp_path"
 require_relative "kraps/temp_paths"
 require_relative "kraps/timeout_queue"
metadata CHANGED
@@ -1,14 +1,14 @@
 --- !ruby/object:Gem::Specification
 name: kraps
 version: !ruby/object:Gem::Version
-  version: 0.2.0
+  version: 0.4.0
 platform: ruby
 authors:
 - Benjamin Vetter
 autorequire:
 bindir: exe
 cert_chain: []
-date: 2022-11-01 00:00:00.000000000 Z
+date: 2022-11-09 00:00:00.000000000 Z
 dependencies:
 - !ruby/object:Gem::Dependency
   name: attachie
@@ -144,6 +144,7 @@ files:
 - lib/kraps/actions.rb
 - lib/kraps/drivers.rb
 - lib/kraps/frame.rb
+- lib/kraps/hash_partitioner.rb
 - lib/kraps/interval.rb
 - lib/kraps/job.rb
 - lib/kraps/parallelizer.rb