kraps 0.6.0 → 0.8.0

checksums.yaml CHANGED
@@ -1,7 +1,7 @@
  ---
  SHA256:
- metadata.gz: 02ba582478178273300e77d5ddd18a8568fd682b0c53a444f1c5e7b756f9fd9a
- data.tar.gz: 9e1698a67c252512a2f277ca1a6e079e063f8f1e6c13500245aabd969cca122d
+ metadata.gz: d261c779e82209152e26decbc6c5a6c5c5ddb0fb40803884383617635727d3b2
+ data.tar.gz: 1b9c6fa8db7a7811cbac5a7a5db518e1f3ee75df583521b64417341e830425f4
  SHA512:
- metadata.gz: 2d1b3bd10d1048c64804ddf86069c0757247b581edea0f38330189a10d35ed096970d008533a167ddaf97c16479425e3baa2bb56f5a8888255ecfd12b911a168
- data.tar.gz: 045263d6aa920cef97a162fcbfd41f239087c459a8c86bd2112d3ce4524c48c6e65ceaac8993f8ea7fb9089a8877db2863e39644888c8bd884df0fa95241277d
+ metadata.gz: dcb05139042149be087b1a2c7f14a31cd5e28dedb1517aca83299f63b90046e4d05e0ab19dfaeede329e784880623abda19675252cdeaad04f8ccd87249afde5
+ data.tar.gz: 10fd07c322c659ae21a682832eba30416c830f9d2146af685d69168ad5137045ef4268c0a43cee4e879bb875edf900ca740bbe4cbfe8b91b34ad3df40763bce0
data/.rubocop.yml CHANGED
@@ -80,3 +80,6 @@ Style/WordArray:
 
  Style/RedundantEach:
  Enabled: false
+
+ Lint/NonLocalExitFromIterator:
+ Enabled: false
data/CHANGELOG.md CHANGED
@@ -1,5 +1,19 @@
  # CHANGELOG
 
+ ## v0.8.0
+
+ * Use number of partitions of previous step for `jobs` option by default
+ * Changed `combine` to receive a `collector`
+ * Added mandatory `concurrency` argument to `load`
+
+ ## v0.7.0
+
+ * Added a `jobs` option to the actions to limit the concurrency
+ when e.g. accessing external data stores and to avoid overloading
+ them
+ * Added a queue using redis for the jobs to avoid starving workers
+ * Removed `distributed_job` dependency
+
  ## v0.6.0
 
  * Added `map_partitions`
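For orientation, here is what those changes look like at the call sites, as a hedged sketch based on the README updates below (editor's illustration; `MyKrapsWorker` and `other_job` are assumed to exist):

```ruby
# v0.7.0: `jobs` caps how many workers process a step concurrently;
# v0.8.0: it defaults to the number of partitions of the previous step.
job = job.map(partitions: 128, jobs: 8) do |key, value, collector|
  collector.call(key, value)
end

# v0.8.0: `combine` now receives a collector instead of using the block's return value.
job = job.combine(other_job, worker: MyKrapsWorker) do |key, value1, value2, collector|
  collector.call(key, (value1 || {}).merge(value2 || {}))
end

# v0.8.0: `load` now requires `concurrency`, the number of download threads.
job = job.load(prefix: "path/to/dump", partitions: 32, concurrency: 8, partitioner: Kraps::HashPartitioner.new, worker: MyKrapsWorker)
```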
data/Gemfile.lock CHANGED
@@ -1,9 +1,8 @@
  PATH
  remote: .
  specs:
- kraps (0.6.0)
+ kraps (0.7.0)
  attachie
- distributed_job
  map-reduce-ruby (>= 3.0.0)
  redis
  ruby-progressbar
@@ -41,8 +40,6 @@ GEM
  concurrent-ruby (1.1.10)
  connection_pool (2.3.0)
  diff-lcs (1.5.0)
- distributed_job (3.1.0)
- redis (>= 4.1.0)
  i18n (1.12.0)
  concurrent-ruby (~> 1.0)
  jmespath (1.6.1)
@@ -62,7 +59,7 @@ GEM
  rake (13.0.6)
  redis (5.0.5)
  redis-client (>= 0.9.0)
- redis-client (0.11.1)
+ redis-client (0.11.2)
  connection_pool
  regexp_parser (2.6.0)
  rexml (3.2.5)
data/README.md CHANGED
@@ -30,7 +30,7 @@ Kraps.configure(
  driver: Kraps::Drivers::S3Driver.new(s3_client: Aws::S3::Client.new("..."), bucket: "some-bucket", prefix: "temp/kraps/"),
  redis: Redis.new,
  namespace: "my-application", # An optional namespace to be used for redis keys, default: nil
- job_ttl: 24.hours, # Job information in redis will automatically be removed after this amount of time, default: 24 hours
+ job_ttl: 7.days, # Job information in redis will automatically be removed after this amount of time, default: 4 days
  show_progress: true # Whether or not to show the progress in the terminal when executing jobs, default: true
  enqueuer: ->(worker, json) { worker.perform_async(json) } # Allows customizing the enqueueing of worker jobs
)
@@ -220,7 +220,7 @@ items are used as keys and the values are set to `nil`.
  * `map`: Maps the key value pairs to other key value pairs
 
  ```ruby
- job.map(partitions: 128, partitioner: partitioner, worker: MyKrapsWorker) do |key, value, collector|
+ job.map(partitions: 128, partitioner: partitioner, worker: MyKrapsWorker, jobs: 8) do |key, value, collector|
  collector.call("changed #{key}", "changed #{value}")
  end
  ```
@@ -229,13 +229,22 @@ The block gets each key-value pair passed and the `collector` block can be
  called as often as necessary. This is also the reason why `map` cannot simply
  return the new key-value pair, but the `collector` must be used instead.
 
+ The `jobs` argument can be useful when you need to access an external data
+ source, like a relational database, and you want to limit the number of workers
+ accessing the store concurrently to avoid overloading it. If you don't specify
+ it, it will be identical to the number of partitions of the previous step. It
+ is recommended to only use it for steps where you need to throttle the
+ concurrency, because it will of course slow down the processing. The `jobs`
+ argument only applies to the current step. The following steps don't inherit
+ the argument, but reset it.
+
  * `map_partitions`: Maps the key value pairs to other key value pairs, but the
  block receives all data of each partition as an enumerable and sorted by key.
  Please be aware that you should not call `to_a` or similar on the enumerable.
  Prefer `map` over `map_partitions` when possible.
 
  ```ruby
- job.map_partitions(partitions: 128, partitioner: partitioner, worker: MyKrapsWorker) do |pairs, collector|
+ job.map_partitions(partitions: 128, partitioner: partitioner, worker: MyKrapsWorker, jobs: 8) do |pairs, collector|
  pairs.each do |key, value|
  collector.call("changed #{key}", "changed #{value}")
  end
@@ -245,7 +254,7 @@ end
  * `reduce`: Reduces the values of pairs having the same key
 
  ```ruby
- job.reduce(worker: MyKrapsWorker) do |key, value1, value2|
+ job.reduce(worker: MyKrapsWorker, jobs: 8) do |key, value1, value2|
  value1 + value2
  end
  ```
@@ -265,8 +274,8 @@ most of the time, this is not necessary and the key can simply be ignored.
  passed job result are completely omitted.
 
  ```ruby
- job.combine(other_job, worker: MyKrapsWorker) do |key, value1, value2|
- (value1 || {}).merge(value2 || {})
+ job.combine(other_job, worker: MyKrapsWorker, jobs: 8) do |key, value1, value2, collector|
+ collector.call(key, (value1 || {}).merge(value2 || {}))
  end
  ```
 
@@ -279,7 +288,7 @@ since Kraps detects the dependency on its own.
  * `repartition`: Used to change the partitioning
 
  ```ruby
- job.repartition(partitions: 128, partitioner: partitioner, worker: MyKrapsWorker)
+ job.repartition(partitions: 128, partitioner: partitioner, worker: MyKrapsWorker, jobs: 8)
  ```
 
  Repartitions all data into the specified number of partitions using the
@@ -290,7 +299,7 @@ specified partitioner.
  `to_a` or similar on the enumerable.
 
  ```ruby
- job.each_partition do |partition, pairs|
+ job.each_partition(jobs: 8) do |partition, pairs|
  pairs.each do |key, value|
  # ...
  end
@@ -308,10 +317,12 @@ It creates a folder for every partition and stores one or more chunks in there.
  * `load`: Loads the previously dumped data
 
  ```ruby
- job.load(prefix: "path/to/dump", partitions: 32, partitioner: Kraps::HashPartitioner.new, worker: MyKrapsWorker)
+ job.load(prefix: "path/to/dump", partitions: 32, concurrency: 8, partitioner: Kraps::HashPartitioner.new, worker: MyKrapsWorker)
  ```
 
- The number of partitions and the partitioner must be specified.
+ The number of partitions, the partitioner and the concurrency must be specified.
+ The concurrency specifies the number of threads used for downloading chunks in
+ parallel.
 
  Please note that every API method accepts a `before` callable:
 
@@ -379,7 +390,8 @@ jobs only once.
  Kraps ships with an in-memory fake driver for storage, which you can use for
  testing purposes instead of the s3 driver:
 
- ```ruby Kraps.configure(
+ ```ruby
+ Kraps.configure(
  driver: Kraps::Drivers::FakeDriver.new(bucket: "kraps"),
  # ...
  ) ```
@@ -425,8 +437,6 @@ The API of the driver is:
  Kraps is built on top of
  [map-reduce-ruby](https://github.com/mrkamel/map-reduce-ruby) for the
  map/reduce framework,
- [distributed_job](https://github.com/mrkamel/distributed_job)
- to keep track of the job/step status,
  [attachie](https://github.com/mrkamel/attachie) to interact with the storage
  layer (s3),
  [ruby-progressbar](https://github.com/jfelchner/ruby-progressbar) to
data/docker-compose.yml CHANGED
@@ -1,6 +1,6 @@
  version: '2'
  services:
- elasticsearch:
+ redis:
  image: redis
  ports:
  - 6379:6379
data/lib/kraps/downloader.rb ADDED
@@ -0,0 +1,19 @@
+ module Kraps
+ class Downloader
+ def self.download_all(prefix:, concurrency:)
+ temp_paths = TempPaths.new
+
+ files = Kraps.driver.list(prefix: prefix).sort
+
+ temp_paths_index = files.each_with_object({}) do |file, hash|
+ hash[file] = temp_paths.add
+ end
+
+ Parallelizer.each(files, concurrency) do |file|
+ Kraps.driver.download(file, temp_paths_index[file].path)
+ end
+
+ temp_paths
+ end
+ end
+ end
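The new `Downloader` extracts the download-and-tempfile pattern that was previously duplicated in `Job#load` and `Worker#download_all`. A minimal usage sketch (editor's illustration; the prefix is hypothetical):

```ruby
# Fetches every chunk below the prefix in parallel into temp files.
temp_paths = Kraps::Downloader.download_all(prefix: "some-token/0/", concurrency: 8)

temp_paths.each do |temp_path|
  File.open(temp_path.path) do |stream|
    stream.each_line { |line| p JSON.parse(line) }
  end
end

temp_paths.delete # remove the temp files when done
```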
data/lib/kraps/job.rb CHANGED
@@ -27,14 +27,17 @@ module Kraps
  end
  end
 
- def map(partitions: nil, partitioner: nil, worker: @worker, before: nil, &block)
+ def map(partitions: nil, partitioner: nil, jobs: nil, worker: @worker, before: nil, &block)
  fresh.tap do |job|
  job.instance_eval do
+ jobs = [jobs, @partitions].compact.min
+
  @partitions = partitions if partitions
  @partitioner = partitioner if partitioner
 
  @steps << Step.new(
  action: Actions::MAP,
+ jobs: jobs,
  partitions: @partitions,
  partitioner: @partitioner,
  worker: worker,
@@ -45,14 +48,17 @@ module Kraps
  end
  end
 
- def map_partitions(partitions: nil, partitioner: nil, worker: @worker, before: nil, &block)
+ def map_partitions(partitions: nil, partitioner: nil, jobs: nil, worker: @worker, before: nil, &block)
  fresh.tap do |job|
  job.instance_eval do
+ jobs = [jobs, @partitions].compact.min
+
  @partitions = partitions if partitions
  @partitioner = partitioner if partitioner
 
  @steps << Step.new(
  action: Actions::MAP_PARTITIONS,
+ jobs: jobs,
  partitions: @partitions,
  partitioner: @partitioner,
  worker: worker,
@@ -63,11 +69,12 @@ module Kraps
  end
  end
 
- def reduce(worker: @worker, before: nil, &block)
+ def reduce(jobs: nil, worker: @worker, before: nil, &block)
  fresh.tap do |job|
  job.instance_eval do
  @steps << Step.new(
  action: Actions::REDUCE,
+ jobs: [jobs, @partitions].compact.min,
  partitions: @partitions,
  partitioner: @partitioner,
  worker: worker,
@@ -78,11 +85,12 @@ module Kraps
  end
  end
 
- def combine(other_job, worker: @worker, before: nil, &block)
+ def combine(other_job, jobs: nil, worker: @worker, before: nil, &block)
  fresh.tap do |job|
  job.instance_eval do
  @steps << Step.new(
  action: Actions::COMBINE,
+ jobs: [jobs, @partitions].compact.min,
  partitions: @partitions,
  partitioner: @partitioner,
  worker: worker,
@@ -95,11 +103,12 @@ module Kraps
  end
  end
 
- def each_partition(worker: @worker, before: nil, &block)
+ def each_partition(jobs: nil, worker: @worker, before: nil, &block)
  fresh.tap do |job|
  job.instance_eval do
  @steps << Step.new(
  action: Actions::EACH_PARTITION,
+ jobs: [jobs, @partitions].compact.min,
  partitions: @partitions,
  partitioner: @partitioner,
  worker: worker,
@@ -110,8 +119,8 @@ module Kraps
  end
  end
 
- def repartition(partitions:, partitioner: nil, worker: @worker, before: nil)
- map(partitions: partitions, partitioner: partitioner, worker: worker, before: before) do |key, value, collector|
+ def repartition(partitions:, jobs: nil, partitioner: nil, worker: @worker, before: nil)
+ map(jobs: jobs, partitions: partitions, partitioner: partitioner, worker: worker, before: before) do |key, value, collector|
  collector.call(key, value)
  end
  end
@@ -130,7 +139,7 @@ module Kraps
  end
  end
 
- def load(prefix:, partitions:, partitioner:, worker: @worker)
+ def load(prefix:, partitions:, partitioner:, concurrency:, worker: @worker)
  job = parallelize(partitions: partitions, partitioner: proc { |key, _| key }, worker: worker) do |collector|
  (0...partitions).each do |partition|
  collector.call(partition)
@@ -138,20 +147,19 @@ module Kraps
  end
 
  job.map_partitions(partitioner: partitioner, worker: worker) do |partition, _, collector|
- tempfile = Tempfile.new
+ temp_paths = Downloader.download_all(prefix: File.join(prefix, partition.to_s, "/"), concurrency: concurrency)
 
- path = File.join(prefix, partition.to_s, "chunk.json")
- next unless Kraps.driver.exists?(path)
+ temp_paths.each do |temp_path|
+ File.open(temp_path.path) do |stream|
+ stream.each_line do |line|
+ key, value = JSON.parse(line)
 
- Kraps.driver.download(path, tempfile.path)
-
- tempfile.each_line do |line|
- key, value = JSON.parse(line)
-
- collector.call(key, value)
+ collector.call(key, value)
+ end
+ end
  end
  ensure
- tempfile&.close(true)
+ temp_paths&.delete
  end
  end
 
data/lib/kraps/redis_queue.rb ADDED
@@ -0,0 +1,151 @@
+ module Kraps
+ class RedisQueue
+ VISIBILITY_TIMEOUT = 60
+
+ attr_reader :token
+
+ def initialize(redis:, token:, namespace:, ttl:)
+ @redis = redis
+ @token = token
+ @namespace = namespace
+ @ttl = ttl
+ end
+
+ def size
+ @size_script ||= <<~SCRIPT
+ local queue_key, pending_key, status_key, ttl, job = ARGV[1], ARGV[2], ARGV[3], tonumber(ARGV[4]), ARGV[5]
+
+ redis.call('expire', queue_key, ttl)
+ redis.call('expire', pending_key, ttl)
+ redis.call('expire', status_key, ttl)
+
+ return redis.call('llen', queue_key) + redis.call('zcard', pending_key)
+ SCRIPT
+
+ @redis.eval(@size_script, argv: [redis_queue_key, redis_pending_key, redis_status_key, @ttl])
+ end
+
+ def enqueue(payload)
+ @enqueue_script ||= <<~SCRIPT
+ local queue_key, pending_key, status_key, ttl, job = ARGV[1], ARGV[2], ARGV[3], tonumber(ARGV[4]), ARGV[5]
+
+ redis.call('rpush', queue_key, job)
+
+ redis.call('expire', queue_key, ttl)
+ redis.call('expire', pending_key, ttl)
+ redis.call('expire', status_key, ttl)
+ SCRIPT
+
+ @redis.eval(@enqueue_script, argv: [redis_queue_key, redis_pending_key, redis_status_key, @ttl, JSON.generate(payload)])
+ end
+
+ def dequeue
+ @dequeue_script ||= <<~SCRIPT
+ local queue_key, pending_key, status_key, ttl, visibility_timeout = ARGV[1], ARGV[2], ARGV[3], tonumber(ARGV[4]), tonumber(ARGV[5])
+
+ local zitem = redis.call('zrange', pending_key, 0, 0, 'WITHSCORES')
+ local job = zitem[1]
+
+ if not zitem[2] or tonumber(zitem[2]) > tonumber(redis.call('time')[1]) then
+ job = redis.call('lpop', queue_key)
+ end
+
+ redis.call('expire', queue_key, ttl)
+ redis.call('expire', pending_key, ttl)
+ redis.call('expire', status_key, ttl)
+
+ if not job then return nil end
+
+ redis.call('zadd', pending_key, tonumber(redis.call('time')[1]) + visibility_timeout, job)
+ redis.call('expire', pending_key, ttl)
+
+ return job
+ SCRIPT
+
+ job = @redis.eval(@dequeue_script, argv: [redis_queue_key, redis_pending_key, redis_status_key, @ttl, VISIBILITY_TIMEOUT])
+
+ unless job
+ yield(nil)
+ return
+ end
+
+ keep_alive(job) do
+ yield(JSON.parse(job)) if job
+ end
+
+ @remove_script ||= <<~SCRIPT
+ local queue_key, pending_key, status_key, ttl, job = ARGV[1], ARGV[2], ARGV[3], tonumber(ARGV[4]), ARGV[5]
+
+ redis.call('zrem', pending_key, job)
+
+ redis.call('expire', queue_key, ttl)
+ redis.call('expire', pending_key, ttl)
+ redis.call('expire', status_key, ttl)
+ SCRIPT
+
+ @redis.eval(@remove_script, argv: [redis_queue_key, redis_pending_key, redis_status_key, @ttl, job])
+ end
+
+ def stop
+ @stop_script ||= <<~SCRIPT
+ local queue_key, pending_key, status_key, ttl = ARGV[1], ARGV[2], ARGV[3], tonumber(ARGV[4])
+
+ redis.call('hset', status_key, 'stopped', 1)
+
+ redis.call('expire', queue_key, ttl)
+ redis.call('expire', pending_key, ttl)
+ redis.call('expire', status_key, ttl)
+ SCRIPT
+
+ @redis.eval(@stop_script, argv: [redis_queue_key, redis_pending_key, redis_status_key, @ttl])
+ end
+
+ def stopped?
+ @stopped_script ||= <<~SCRIPT
+ local queue_key, pending_key, status_key, ttl = ARGV[1], ARGV[2], ARGV[3], tonumber(ARGV[4])
+
+ redis.call('expire', queue_key, ttl)
+ redis.call('expire', pending_key, ttl)
+ redis.call('expire', status_key, ttl)
+
+ return redis.call('hget', status_key, 'stopped')
+ SCRIPT
+
+ @redis.eval(@stopped_script, argv: [redis_queue_key, redis_pending_key, redis_status_key, @ttl]).to_i == 1
+ end
+
+ private
+
+ def keep_alive(job)
+ @keep_alive_script ||= <<~SCRIPT
+ local queue_key, pending_key, status_key, ttl, job, visibility_timeout = ARGV[1], ARGV[2], ARGV[3], tonumber(ARGV[4]), ARGV[5], tonumber(ARGV[6])
+
+ redis.call('zadd', pending_key, tonumber(redis.call('time')[1]) + visibility_timeout, job)
+
+ redis.call('expire', queue_key, ttl)
+ redis.call('expire', pending_key, ttl)
+ redis.call('expire', status_key, ttl)
+ SCRIPT
+
+ interval = Interval.new(5) do
+ @redis.eval(@keep_alive_script, argv: [redis_queue_key, redis_pending_key, redis_status_key, @ttl, job, VISIBILITY_TIMEOUT])
+ end
+
+ yield
+ ensure
+ interval&.stop
+ end
+
+ def redis_queue_key
+ [@namespace, "kraps", "queue", @token].compact.join(":")
+ end
+
+ def redis_pending_key
+ [@namespace, "kraps", "pending", @token].compact.join(":")
+ end
+
+ def redis_status_key
+ [@namespace, "kraps", "status", @token].compact.join(":")
+ end
+ end
+ end
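In short, `RedisQueue` is a small reliable queue built on redis lists and sorted sets: `enqueue` pushes a JSON payload onto a list, `dequeue` moves a job into a `pending` sorted set scored with a visibility timeout and, via `keep_alive`, re-scores it every 5 seconds while the block runs. Only successful completion removes the job, so jobs of crashed workers become visible again after `VISIBILITY_TIMEOUT` seconds. A usage sketch (editor's illustration; the payload and `process` are hypothetical):

```ruby
queue = Kraps::RedisQueue.new(redis: Redis.new, token: SecureRandom.hex, namespace: nil, ttl: 86_400)

queue.enqueue(part: 0, partition: 4) # runner side

queue.dequeue do |payload|              # worker side
  payload ? process(payload) : sleep(1) # payload is nil when the queue is drained
end
```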
data/lib/kraps/runner.rb CHANGED
@@ -45,48 +45,35 @@ module Kraps
 
  def perform_parallelize
  enum = Enumerator.new do |yielder|
- collector = proc { |item| yielder << item }
+ collector = proc { |item| yielder << { item: item } }
 
  @step.block.call(collector)
  end
 
- with_distributed_job do |distributed_job|
- push_and_wait(distributed_job, enum) do |item, part|
- enqueue(token: distributed_job.token, part: part, item: item)
- end
+ token = push_and_wait(enum: enum)
 
- Frame.new(token: distributed_job.token, partitions: @step.partitions)
- end
+ Frame.new(token: token, partitions: @step.partitions)
  end
 
  def perform_map
- with_distributed_job do |distributed_job|
- push_and_wait(distributed_job, 0...@frame.partitions) do |partition, part|
- enqueue(token: distributed_job.token, part: part, partition: partition)
- end
+ enum = (0...@frame.partitions).map { |partition| { partition: partition } }
+ token = push_and_wait(job_count: @step.jobs, enum: enum)
 
- Frame.new(token: distributed_job.token, partitions: @step.partitions)
- end
+ Frame.new(token: token, partitions: @step.partitions)
  end
 
  def perform_map_partitions
- with_distributed_job do |distributed_job|
- push_and_wait(distributed_job, 0...@frame.partitions) do |partition, part|
- enqueue(token: distributed_job.token, part: part, partition: partition)
- end
+ enum = (0...@frame.partitions).map { |partition| { partition: partition } }
+ token = push_and_wait(job_count: @step.jobs, enum: enum)
 
- Frame.new(token: distributed_job.token, partitions: @step.partitions)
- end
+ Frame.new(token: token, partitions: @step.partitions)
  end
 
  def perform_reduce
- with_distributed_job do |distributed_job|
- push_and_wait(distributed_job, 0...@frame.partitions) do |partition, part|
- enqueue(token: distributed_job.token, part: part, partition: partition)
- end
+ enum = (0...@frame.partitions).map { |partition| { partition: partition } }
+ token = push_and_wait(job_count: @step.jobs, enum: enum)
 
- Frame.new(token: distributed_job.token, partitions: @step.partitions)
- end
+ Frame.new(token: token, partitions: @step.partitions)
  end
 
  def perform_combine
@@ -95,82 +82,64 @@ module Kraps
 
  raise(IncompatibleFrame, "Incompatible number of partitions") if combine_step.partitions != @step.partitions
 
- with_distributed_job do |distributed_job|
- push_and_wait(distributed_job, 0...@frame.partitions) do |partition, part|
- enqueue(token: distributed_job.token, part: part, partition: partition, combine_frame: combine_step.frame.to_h)
- end
-
- Frame.new(token: distributed_job.token, partitions: @step.partitions)
+ enum = (0...@frame.partitions).map do |partition|
+ { partition: partition, combine_frame: combine_step.frame.to_h }
  end
+
+ token = push_and_wait(job_count: @step.jobs, enum: enum)
+
+ Frame.new(token: token, partitions: @step.partitions)
  end
 
  def perform_each_partition
- with_distributed_job do |distributed_job|
- push_and_wait(distributed_job, 0...@frame.partitions) do |partition, part|
- enqueue(token: distributed_job.token, part: part, partition: partition)
- end
+ enum = (0...@frame.partitions).map { |partition| { partition: partition } }
+ push_and_wait(job_count: @step.jobs, enum: enum)
 
- @frame
- end
+ @frame
  end
 
- def enqueue(token:, part:, **rest)
- Kraps.enqueuer.call(
- @step.worker,
- JSON.generate(
- job_index: @job_index,
- step_index: @step_index,
- frame: @frame.to_h,
- token: token,
- part: part,
- klass: @klass,
- args: @args,
- kwargs: @kwargs,
- **rest
- )
- )
- end
+ def push_and_wait(enum:, job_count: nil)
+ redis_queue = RedisQueue.new(redis: Kraps.redis, token: SecureRandom.hex, namespace: Kraps.namespace, ttl: Kraps.job_ttl)
+ progress_bar = build_progress_bar("#{@klass}: job #{@job_index + 1}/#{@jobs.size}, step #{@step_index + 1}/#{@job.steps.size}, #{@step.jobs || "?"} jobs, token #{redis_queue.token}, %a, %c/%C (%p%) => #{@step.action}")
 
- def with_distributed_job
- distributed_job = Kraps.distributed_job_client.build(token: SecureRandom.hex)
+ total = 0
 
- yield(distributed_job)
- rescue Interrupt
- distributed_job&.stop
- raise
- end
+ interval = Interval.new(1) do
+ # The interval is used to continuously update the progress bar even
+ # when push_all is used and to avoid sessions being terminated due
+ # to inactivity etc.
 
- def push_and_wait(distributed_job, enum)
- progress_bar = build_progress_bar("#{@klass}: job #{@job_index + 1}/#{@jobs.size}, step #{@step_index + 1}/#{@job.steps.size}, token #{distributed_job.token}, %a, %c/%C (%p%) => #{@step.action}")
+ progress_bar.total = total
+ progress_bar.progress = [progress_bar.total - redis_queue.size, 0].max
+ end
 
- begin
- total = 0
+ enum.each_with_index do |item, part|
+ total += 1
 
- interval = Interval.new(1) do
- progress_bar.total = total
- end
+ redis_queue.enqueue(item.merge(part: part))
+ end
 
- distributed_job.push_each(enum) do |item, part|
- total += 1
- interval.fire(timeout: 1)
+ (job_count || total).times do
+ break if redis_queue.stopped?
 
- yield(item, part)
- end
- ensure
- interval&.stop
+ Kraps.enqueuer.call(@step.worker, JSON.generate(job_index: @job_index, step_index: @step_index, frame: @frame.to_h, token: redis_queue.token, klass: @klass, args: @args, kwargs: @kwargs))
  end
 
  loop do
- progress_bar.total = distributed_job.total
- progress_bar.progress = progress_bar.total - distributed_job.count
-
- break if distributed_job.finished? || distributed_job.stopped?
+ break if redis_queue.size.zero?
+ break if redis_queue.stopped?
 
  sleep(1)
  end
 
- raise(JobStopped, "The job was stopped") if distributed_job.stopped?
+ raise(JobStopped, "The job was stopped") if redis_queue.stopped?
+
+ interval.fire(timeout: 1)
+
+ redis_queue.token
  ensure
+ redis_queue&.stop
+ interval&.stop
  progress_bar&.stop
  end
 
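Taken together, the runner now drives each step through the redis queue instead of `distributed_job`: it enqueues one queue item per partition, fires at most `step.jobs` worker jobs, and polls until the queue drains. A condensed sketch of that control flow (editor's paraphrase of the diff above, not the gem's verbatim code; `enqueue_worker` is hypothetical):

```ruby
def push_and_wait_sketch(items, job_count: nil)
  queue = Kraps::RedisQueue.new(redis: Kraps.redis, token: SecureRandom.hex, namespace: Kraps.namespace, ttl: Kraps.job_ttl)

  items.each_with_index { |item, part| queue.enqueue(item.merge(part: part)) }

  # Fire at most job_count workers; each worker keeps dequeuing until the
  # queue is empty, so a small job_count throttles the step without
  # leaving queue items unprocessed.
  (job_count || items.size).times { enqueue_worker(queue.token) }

  sleep(1) until queue.size.zero? || queue.stopped?
  queue.token
end
```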
data/lib/kraps/step.rb CHANGED
@@ -1,3 +1,3 @@
  module Kraps
- Step = Struct.new(:action, :partitioner, :partitions, :block, :worker, :before, :frame, :dependency, :options, keyword_init: true)
+ Step = Struct.new(:action, :partitioner, :partitions, :jobs, :block, :worker, :before, :frame, :dependency, :options, keyword_init: true)
  end
data/lib/kraps/version.rb CHANGED
@@ -1,3 +1,3 @@
  module Kraps
- VERSION = "0.6.0"
+ VERSION = "0.8.0"
  end
data/lib/kraps/worker.rb CHANGED
@@ -11,22 +11,22 @@ module Kraps
  end
 
  def call(retries: 3)
- return if distributed_job.stopped?
+ return if redis_queue.stopped?
 
  raise(InvalidAction, "Invalid action #{step.action}") unless Actions::ALL.include?(step.action)
 
- with_retries(retries) do # TODO: allow to use queue based retries
- step.before&.call
+ dequeue do |payload|
+ with_retries(retries) do # TODO: allow to use queue based retries
+ step.before&.call
 
- send(:"perform_#{step.action}")
-
- distributed_job.done(@args["part"])
+ send(:"perform_#{step.action}", payload)
+ end
  end
  end
 
  private
 
- def perform_parallelize
+ def perform_parallelize(payload)
  implementation = Class.new do
  def map(key)
  yield(key, nil)
@@ -34,19 +34,19 @@ module Kraps
  end
 
  mapper = MapReduce::Mapper.new(implementation.new, partitioner: partitioner, memory_limit: @memory_limit)
- mapper.map(@args["item"])
+ mapper.map(payload["item"])
 
  mapper.shuffle(chunk_limit: @chunk_limit) do |partitions|
  Parallelizer.each(partitions.to_a, @concurrency) do |partition, path|
  File.open(path) do |stream|
- Kraps.driver.store(Kraps.driver.with_prefix("#{@args["token"]}/#{partition}/chunk.#{@args["part"]}.json"), stream)
+ Kraps.driver.store(Kraps.driver.with_prefix("#{@args["token"]}/#{partition}/chunk.#{payload["part"]}.json"), stream)
  end
  end
  end
  end
 
- def perform_map
- temp_paths = download_all(token: @args["frame"]["token"], partition: @args["partition"])
+ def perform_map(payload)
+ temp_paths = download_all(token: @args["frame"]["token"], partition: payload["partition"])
 
  current_step = step
 
@@ -78,7 +78,7 @@ module Kraps
  mapper.shuffle(chunk_limit: @chunk_limit) do |partitions|
  Parallelizer.each(partitions.to_a, @concurrency) do |partition, path|
  File.open(path) do |stream|
- Kraps.driver.store(Kraps.driver.with_prefix("#{@args["token"]}/#{partition}/chunk.#{@args["part"]}.json"), stream)
+ Kraps.driver.store(Kraps.driver.with_prefix("#{@args["token"]}/#{partition}/chunk.#{payload["partition"]}.json"), stream)
  end
  end
  end
@@ -86,11 +86,11 @@ module Kraps
  temp_paths&.delete
  end
 
- def perform_map_partitions
- temp_paths = download_all(token: @args["frame"]["token"], partition: @args["partition"])
+ def perform_map_partitions(payload)
+ temp_paths = download_all(token: @args["frame"]["token"], partition: payload["partition"])
 
  current_step = step
- current_partition = @args["partition"]
+ current_partition = payload["partition"]
 
  implementation = Object.new
  implementation.define_singleton_method(:map) do |enum, &block|
@@ -111,7 +111,7 @@ module Kraps
  mapper.shuffle(chunk_limit: @chunk_limit) do |partitions|
  Parallelizer.each(partitions.to_a, @concurrency) do |partition, path|
  File.open(path) do |stream|
- Kraps.driver.store(Kraps.driver.with_prefix("#{@args["token"]}/#{partition}/chunk.#{@args["part"]}.json"), stream)
+ Kraps.driver.store(Kraps.driver.with_prefix("#{@args["token"]}/#{partition}/chunk.#{payload["partition"]}.json"), stream)
  end
  end
  end
@@ -119,7 +119,7 @@ module Kraps
  temp_paths&.delete
  end
 
- def perform_reduce
+ def perform_reduce(payload)
  current_step = step
 
  implementation = Object.new
@@ -129,7 +129,7 @@ module Kraps
 
  reducer = MapReduce::Reducer.new(implementation)
 
- Parallelizer.each(Kraps.driver.list(prefix: Kraps.driver.with_prefix("#{@args["frame"]["token"]}/#{@args["partition"]}/")), @concurrency) do |file|
+ Parallelizer.each(Kraps.driver.list(prefix: Kraps.driver.with_prefix("#{@args["frame"]["token"]}/#{payload["partition"]}/")), @concurrency) do |file|
  Kraps.driver.download(file, reducer.add_chunk)
  end
 
@@ -139,14 +139,14 @@ module Kraps
  tempfile.puts(JSON.generate([key, value]))
  end
 
- Kraps.driver.store(Kraps.driver.with_prefix("#{@args["token"]}/#{@args["partition"]}/chunk.#{@args["part"]}.json"), tempfile.tap(&:rewind))
+ Kraps.driver.store(Kraps.driver.with_prefix("#{@args["token"]}/#{payload["partition"]}/chunk.#{payload["partition"]}.json"), tempfile.tap(&:rewind))
  ensure
  tempfile&.close(true)
  end
 
- def perform_combine
- temp_paths1 = download_all(token: @args["frame"]["token"], partition: @args["partition"])
- temp_paths2 = download_all(token: @args["combine_frame"]["token"], partition: @args["partition"])
+ def perform_combine(payload)
+ temp_paths1 = download_all(token: @args["frame"]["token"], partition: payload["partition"])
+ temp_paths2 = download_all(token: payload["combine_frame"]["token"], partition: payload["partition"])
 
  enum1 = k_way_merge(temp_paths1.each.to_a, chunk_limit: @chunk_limit)
  enum2 = k_way_merge(temp_paths2.each.to_a, chunk_limit: @chunk_limit)
@@ -157,7 +157,7 @@ module Kraps
  implementation = Object.new
  implementation.define_singleton_method(:map) do |&block|
  combine_method.call(enum1, enum2) do |key, value1, value2|
- block.call(key, current_step.block.call(key, value1, value2))
+ current_step.block.call(key, value1, value2, block)
  end
  end
 
@@ -167,7 +167,7 @@ module Kraps
  mapper.shuffle(chunk_limit: @chunk_limit) do |partitions|
  Parallelizer.each(partitions.to_a, @concurrency) do |partition, path|
  File.open(path) do |stream|
- Kraps.driver.store(Kraps.driver.with_prefix("#{@args["token"]}/#{partition}/chunk.#{@args["part"]}.json"), stream)
+ Kraps.driver.store(Kraps.driver.with_prefix("#{@args["token"]}/#{partition}/chunk.#{payload["partition"]}.json"), stream)
  end
  end
  end
@@ -213,10 +213,10 @@ module Kraps
  end
  end
 
- def perform_each_partition
+ def perform_each_partition(payload)
  temp_paths = TempPaths.new
 
- files = Kraps.driver.list(prefix: Kraps.driver.with_prefix("#{@args["frame"]["token"]}/#{@args["partition"]}/")).sort
+ files = Kraps.driver.list(prefix: Kraps.driver.with_prefix("#{@args["frame"]["token"]}/#{payload["partition"]}/")).sort
 
  temp_paths_index = files.each_with_object({}) do |file, hash|
  hash[file] = temp_paths.add
@@ -226,7 +226,7 @@ module Kraps
  Kraps.driver.download(file, temp_paths_index[file].path)
  end
 
- step.block.call(@args["partition"], k_way_merge(temp_paths.each.to_a, chunk_limit: @chunk_limit))
+ step.block.call(payload["partition"], k_way_merge(temp_paths.each.to_a, chunk_limit: @chunk_limit))
  ensure
  temp_paths&.delete
  end
@@ -237,11 +237,11 @@ module Kraps
  begin
  yield
  rescue Kraps::Error
- distributed_job.stop
+ redis_queue.stop
  raise
  rescue StandardError => e
  if retries >= num_retries
- distributed_job.stop
+ redis_queue.stop
  raise
  end
 
@@ -254,20 +254,23 @@ module Kraps
  end
  end
 
- def download_all(token:, partition:)
- temp_paths = TempPaths.new
-
- files = Kraps.driver.list(prefix: Kraps.driver.with_prefix("#{token}/#{partition}/")).sort
+ def dequeue
+ loop do
+ break if redis_queue.stopped?
+ break if redis_queue.size.zero?
 
- temp_paths_index = files.each_with_object({}) do |file, hash|
- hash[file] = temp_paths.add
+ redis_queue.dequeue do |payload|
+ payload ? yield(payload) : sleep(1)
+ end
  end
+ end
 
- Parallelizer.each(files, @concurrency) do |file|
- Kraps.driver.download(file, temp_paths_index[file].path)
- end
+ def redis_queue
+ @redis_queue ||= RedisQueue.new(redis: Kraps.redis, token: @args["token"], namespace: Kraps.namespace, ttl: Kraps.job_ttl)
+ end
 
- temp_paths
+ def download_all(token:, partition:)
+ Downloader.download_all(prefix: Kraps.driver.with_prefix("#{token}/#{partition}/"), concurrency: @concurrency)
  end
 
  def jobs
@@ -301,9 +304,5 @@ module Kraps
  def partitioner
  @partitioner ||= proc { |key| step.partitioner.call(key, step.partitions) }
  end
-
- def distributed_job
- @distributed_job ||= Kraps.distributed_job_client.build(token: @args["token"])
- end
  end
  end
data/lib/kraps.rb CHANGED
@@ -1,4 +1,3 @@
- require "distributed_job"
  require "ruby-progressbar"
  require "ruby-progressbar/outputs/null"
  require "map_reduce"
@@ -9,6 +8,7 @@ require_relative "kraps/drivers"
  require_relative "kraps/actions"
  require_relative "kraps/parallelizer"
  require_relative "kraps/hash_partitioner"
+ require_relative "kraps/redis_queue"
  require_relative "kraps/temp_path"
  require_relative "kraps/temp_paths"
  require_relative "kraps/timeout_queue"
@@ -19,6 +19,7 @@ require_relative "kraps/runner"
  require_relative "kraps/step"
  require_relative "kraps/frame"
  require_relative "kraps/worker"
+ require_relative "kraps/downloader"
 
  module Kraps
  class Error < StandardError; end
@@ -27,9 +28,11 @@ module Kraps
  class JobStopped < Error; end
  class IncompatibleFrame < Error; end
 
- def self.configure(driver:, redis: Redis.new, namespace: nil, job_ttl: 24 * 60 * 60, show_progress: true, enqueuer: ->(worker, json) { worker.perform_async(json) })
+ def self.configure(driver:, redis: Redis.new, namespace: nil, job_ttl: 4 * 24 * 60 * 60, show_progress: true, enqueuer: ->(worker, json) { worker.perform_async(json) })
  @driver = driver
- @distributed_job_client = DistributedJob::Client.new(redis: redis, namespace: namespace, default_ttl: job_ttl)
+ @redis = redis
+ @namespace = namespace
+ @job_ttl = job_ttl.to_i
  @show_progress = show_progress
  @enqueuer = enqueuer
  end
@@ -38,8 +41,16 @@ module Kraps
  @driver
  end
 
- def self.distributed_job_client
- @distributed_job_client
+ def self.redis
+ @redis
+ end
+
+ def self.namespace
+ @namespace
+ end
+
+ def self.job_ttl
+ @job_ttl
  end
 
  def self.show_progress?
metadata CHANGED
@@ -1,14 +1,14 @@
  --- !ruby/object:Gem::Specification
  name: kraps
  version: !ruby/object:Gem::Version
- version: 0.6.0
+ version: 0.8.0
  platform: ruby
  authors:
  - Benjamin Vetter
  autorequire:
  bindir: exe
  cert_chain: []
- date: 2022-11-16 00:00:00.000000000 Z
+ date: 2023-02-13 00:00:00.000000000 Z
  dependencies:
  - !ruby/object:Gem::Dependency
  name: attachie
@@ -24,20 +24,6 @@ dependencies:
  - - ">="
  - !ruby/object:Gem::Version
  version: '0'
- - !ruby/object:Gem::Dependency
- name: distributed_job
- requirement: !ruby/object:Gem::Requirement
- requirements:
- - - ">="
- - !ruby/object:Gem::Version
- version: '0'
- type: :runtime
- prerelease: false
- version_requirements: !ruby/object:Gem::Requirement
- requirements:
- - - ">="
- - !ruby/object:Gem::Version
- version: '0'
  - !ruby/object:Gem::Dependency
  name: map-reduce-ruby
  requirement: !ruby/object:Gem::Requirement
@@ -142,6 +128,7 @@ files:
  - docker-compose.yml
  - lib/kraps.rb
  - lib/kraps/actions.rb
+ - lib/kraps/downloader.rb
  - lib/kraps/drivers.rb
  - lib/kraps/frame.rb
  - lib/kraps/hash_partitioner.rb
@@ -149,6 +136,7 @@ files:
  - lib/kraps/job.rb
  - lib/kraps/job_resolver.rb
  - lib/kraps/parallelizer.rb
+ - lib/kraps/redis_queue.rb
  - lib/kraps/runner.rb
  - lib/kraps/step.rb
  - lib/kraps/temp_path.rb