kraps 0.5.0 → 0.7.0

checksums.yaml CHANGED
@@ -1,7 +1,7 @@
1
1
  ---
2
2
  SHA256:
3
- metadata.gz: 921ae08326c96216136418861b88af7f11bce519c924cd1813216165f7f02690
4
- data.tar.gz: '0913d31d3caeea0be664bc714e9d0da58227f515c047be31359e96040bc0c141'
3
+ metadata.gz: 19635ced3e745d44313ed3bc416ef73eb555134c591725f4dab7b38208e21393
4
+ data.tar.gz: bb7f679c7e2cd053744d1c7d857629de1912a305880d0538bdab418de3861ba1
5
5
  SHA512:
6
- metadata.gz: d8e43e5229fc310019801e62a2e278470a1eb37b50e4aca27b9c64edb6666115f0f25c7a7375790516e2726fcf10980cdac1523c54dde8d3527a39fd919a2a5a
7
- data.tar.gz: 30b1a9edcdd4f7ff476bfa4c070aef31debd727500e27a08b59f1df2663362c60e3cc3a3c860455d568abd994bb56a216f7eedf8baea6cc06ca73b1d0bdf9a07
6
+ metadata.gz: 91273ba54ea33c6d5cb1b4f335ad8039c35601953fdf1e6b9b2ac3117ceb25d81e9be569e5c8b70deda22b53a4c72f04d60ffb4b251badf5c5a64d13d399f36c
7
+ data.tar.gz: 4e563257fcba0c9f457b363da4f43d000ee239a5179e2f965071bb3df27e362cdf7bc9950a1624520cb862acfb0fde89e96b0a4987d79a93524230c4b84619cd
data/.rubocop.yml CHANGED
@@ -80,3 +80,6 @@ Style/WordArray:
80
80
 
81
81
  Style/RedundantEach:
82
82
  Enabled: false
83
+
84
+ Lint/NonLocalExitFromIterator:
85
+ Enabled: false
data/CHANGELOG.md CHANGED
@@ -1,5 +1,19 @@
1
1
  # CHANGELOG
2
2
 
3
+ ## v0.7.0
4
+
5
+ * Added a `jobs` option to the actions to limit the concurrency
6
+ when e.g. accessing external data stores and to avoid overloading
7
+ them
8
+ * Added a queue using redis for the jobs to avoid starving workers
9
+ * Removed `distributed_job` dependency
10
+
11
+ ## v0.6.0
12
+
13
+ * Added `map_partitions`
14
+ * Added `combine`
15
+ * Added `dump` and `load`
16
+
3
17
  ## v0.5.0
4
18
 
5
19
  * Added a `before` option to specify a callable to run before
data/Gemfile.lock CHANGED
@@ -1,9 +1,8 @@
1
1
  PATH
2
2
  remote: .
3
3
  specs:
4
- kraps (0.5.0)
4
+ kraps (0.7.0)
5
5
  attachie
6
- distributed_job
7
6
  map-reduce-ruby (>= 3.0.0)
8
7
  redis
9
8
  ruby-progressbar
@@ -41,8 +40,6 @@ GEM
41
40
  concurrent-ruby (1.1.10)
42
41
  connection_pool (2.3.0)
43
42
  diff-lcs (1.5.0)
44
- distributed_job (3.1.0)
45
- redis (>= 4.1.0)
46
43
  i18n (1.12.0)
47
44
  concurrent-ruby (~> 1.0)
48
45
  jmespath (1.6.1)
@@ -62,7 +59,7 @@ GEM
62
59
  rake (13.0.6)
63
60
  redis (5.0.5)
64
61
  redis-client (>= 0.9.0)
65
- redis-client (0.11.1)
62
+ redis-client (0.11.2)
66
63
  connection_pool
67
64
  regexp_parser (2.6.0)
68
65
  rexml (3.2.5)
data/README.md CHANGED
@@ -3,11 +3,12 @@
3
3
  **Easily process big data in ruby**
4
4
 
5
5
  Kraps allows to process and perform calculations on very large datasets in
6
- parallel using a map/reduce framework and runs on a background job framework
7
- you already have. You just need some space on your filesystem, S3 as a storage
8
- layer with temporary lifecycle policy enabled, the already mentioned background
9
- job framework (like sidekiq, shoryuken, etc) and redis to keep track of the
10
- progress. Most things you most likely already have in place anyways.
6
+ parallel using a map/reduce framework similar to [spark](https://spark.apache.org/),
7
+ but runs on a background job framework you already have. You just need some
8
+ space on your filesystem, S3 as a storage layer with temporary lifecycle policy
9
+ enabled, the already mentioned background job framework (like sidekiq,
10
+ shoryuken, etc) and redis to keep track of the progress. Most things you most
11
+ likely already have in place anyways.
11
12
 
12
13
  ## Installation
13
14
 
@@ -29,7 +30,7 @@ Kraps.configure(
29
30
  driver: Kraps::Drivers::S3Driver.new(s3_client: Aws::S3::Client.new("..."), bucket: "some-bucket", prefix: "temp/kraps/"),
30
31
  redis: Redis.new,
31
32
  namespace: "my-application", # An optional namespace to be used for redis keys, default: nil
32
- job_ttl: 24.hours, # Job information in redis will automatically be removed after this amount of time, default: 24 hours
33
+ job_ttl: 7.days, # Job information in redis will automatically be removed after this amount of time, default: 4 days
33
34
  show_progress: true # Whether or not to show the progress in the terminal when executing jobs, default: true
34
35
  enqueuer: ->(worker, json) { worker.perform_async(json) } # Allows to customize the enqueueing of worker jobs
35
36
  )
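
The `enqueuer` is what hands the serialized payload to your background job framework. A rough sketch of customizing it, assuming sidekiq is used (the queue name `kraps` is just an example):

```ruby
Kraps.configure(
  driver: Kraps::Drivers::S3Driver.new(s3_client: Aws::S3::Client.new("..."), bucket: "some-bucket", prefix: "temp/kraps/"),
  redis: Redis.new,
  # Push Kraps jobs to a dedicated queue instead of the worker's default queue
  enqueuer: ->(worker, json) { worker.set(queue: "kraps").perform_async(json) }
)
```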
@@ -115,13 +116,13 @@ be able to give 300-400 megabytes to Kraps then, but now divide this by 10 and
115
116
  specify a `memory_limit` of around `30.megabytes`, better less. The
116
117
  `memory_limit` affects how much chunks will be written to disk depending on the
117
118
  data size you are processing and how big these chunks are. The smaller the
118
- value, the more chunks and the more chunks, the more runs Kraps need to merge
119
- the chunks. It can affect the performance The `chunk_limit` ensures that only
120
- the specified amount of chunks are processed in a single run. A run basically
121
- means: it takes up to `chunk_limit` chunks, reduces them and pushes the result
122
- as a new chunk to the list of chunks to process. Thus, if your number of file
123
- descriptors is unlimited, you want to set it to a higher number to avoid the
124
- overhead of multiple runs. `concurrency` tells Kraps how much threads to use to
119
+ value, the more chunks. The more chunks, the more runs Kraps needs to merge
120
+ the chunks. The `chunk_limit` ensures that only the specified amount of chunks
121
+ are processed in a single run. A run basically means: it takes up to
122
+ `chunk_limit` chunks, reduces them and pushes the result as a new chunk to the
123
+ list of chunks to process. Thus, if your number of file descriptors is
124
+ unlimited, you want to set it to a higher number to avoid the overhead of
125
+ multiple runs. `concurrency` tells Kraps how many threads to use to
125
126
  concurrently upload/download files from the storage layer. Finally, `retries`
126
127
  specifies how often Kraps should retry the job step in case of errors. Kraps
127
128
  will sleep for 5 seconds between those retries. Please note that it's not yet
@@ -130,7 +131,6 @@ Kraps. Please note, however, that `parallelize` is not covered by `retries`
130
131
  yet, as the block passed to `parallelize` is executed by the runner, not the
131
132
  workers.
132
133
 
133
-
134
134
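
A minimal sketch of how these options fit together in a worker, assuming sidekiq and purely illustrative values for the limits:

```ruby
class MyKrapsWorker
  include Sidekiq::Worker

  def perform(json)
    # memory_limit, chunk_limit and concurrency are example values, see above
    Kraps::Worker.new(json, memory_limit: 32.megabytes, chunk_limit: 64, concurrency: 8).call(retries: 3)
  end
end
```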
  Now, executing your job is super easy:
135
135
 
136
136
  ```ruby
@@ -182,11 +182,11 @@ https://github.com/mrkamel/map-reduce-ruby/#limitations-for-keys
182
182
  ## Storage
183
183
 
184
184
  Kraps stores temporary results of steps in a storage layer. Currently, only S3
185
- is supported besides a in memory driver used for testing purposes. Please be
185
+ is supported besides an in-memory driver used for testing purposes. Please be
186
186
  aware that Kraps does not clean up any files from the storage layer, as it
187
- would be a safe thing to do in case of errors anyways. Instead, Kraps relies on
188
- lifecycle features of modern object storage systems. Therefore, it is recommend
189
- to e.g. configure a lifecycle policy to delete any files after e.g. 7 days
187
+ would not be a safe thing to do in case of errors anyways. Instead, Kraps
188
+ relies on lifecycle features of modern object storage systems. Therefore, it is
189
+ required to configure a lifecycle policy to delete any files after e.g. 7 days
190
190
  either for a whole bucket or for a certain prefix like e.g. `temp/` and tell
191
191
  Kraps about the prefix to use (e.g. `temp/kraps/`).
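
A hedged sketch of such a lifecycle rule using the aws-sdk-s3 gem (bucket name, prefix and the 7 day expiry are example values; the same rule can be created via the AWS console, terraform, etc.):

```ruby
Aws::S3::Client.new.put_bucket_lifecycle_configuration(
  bucket: "some-bucket",
  lifecycle_configuration: {
    rules: [
      {
        id: "expire-kraps-temp-files",
        status: "Enabled",
        filter: { prefix: "temp/" },  # only expire objects below the prefix
        expiration: { days: 7 }       # delete them after 7 days
      }
    ]
  }
)
```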
192
192
 
@@ -220,7 +220,7 @@ items are used as keys and the values are set to `nil`.
220
220
  * `map`: Maps the key value pairs to other key value pairs
221
221
 
222
222
  ```ruby
223
- job.map(partitions: 128, partitioner: partitioner, worker: MyKrapsWorker) do |key, value, collector|
223
+ job.map(partitions: 128, partitioner: partitioner, worker: MyKrapsWorker, jobs: 8) do |key, value, collector|
224
224
  collector.call("changed #{key}", "changed #{value}")
225
225
  end
226
226
  ```
@@ -229,10 +229,31 @@ The block gets each key-value pair passed and the `collector` block can be
229
229
  called as often as neccessary. This is also the reason why `map` can not simply
230
230
  return the new key-value pair, but the `collector` must be used instead.
231
231
 
232
+ The `jobs` argument can be useful when you need to access an external data
233
+ source, like a relational database, and you want to limit the number of workers
234
+ accessing the store concurrently to avoid overloading it. If you don't specify
235
+ it, it defaults to the number of partitions. It is recommended to only
236
+ use it for steps where you need to throttle the concurrency, because it will of
237
+ course slow down the processing. The `jobs` argument only applies to the
238
+ current step. The following steps don't inherit the argument, but reset it.
239
+
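
As an illustration (the `ExternalStore` lookup is made up for the example), only the step talking to the external store is throttled, while the following step falls back to one job per partition:

```ruby
# At most 4 workers hit the external store at the same time in this step
job = job.map(jobs: 4) do |key, value, collector|
  collector.call(key, ExternalStore.lookup(value)) # hypothetical external lookup
end

# The reduce step does not inherit jobs: 4
job = job.reduce do |_key, value1, value2|
  value1 + value2
end
```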
240
+ * `map_partitions`: Maps the key value pairs to other key value pairs, but the
241
+ block receives all data of each partition as an enumerable, sorted by key.
242
+ Please be aware that you should not call `to_a` or similar on the enumerable.
243
+ Prefer `map` over `map_partitions` when possible.
244
+
245
+ ```ruby
246
+ job.map_partitions(partitions: 128, partitioner: partitioner, worker: MyKrapsWorker, jobs: 8) do |pairs, collector|
247
+ pairs.each do |key, value|
248
+ collector.call("changed #{key}", "changed #{value}")
249
+ end
250
+ end
251
+ ```
252
+
232
253
  * `reduce`: Reduces the values of pairs having the same key
233
254
 
234
255
  ```ruby
235
- job.reduce(worker: MyKrapsWorker) do |key, value1, value2|
256
+ job.reduce(worker: MyKrapsWorker, jobs: 8) do |key, value1, value2|
236
257
  value1 + value2
237
258
  end
238
259
  ```
@@ -245,26 +266,61 @@ The `key` itself is also passed to the block for the case that you need to
245
266
  customize the reduce calculation according to the value of the key. However,
246
267
  most of the time, this is not neccessary and the key can simply be ignored.
247
268
 
269
+ * `combine`: Combines the results of 2 jobs by combining every key available
270
+ in the current job result with the corresponding key from the passed job
271
+ result. When the passed job result does not have the corresponding key,
272
+ `nil` will be passed to the block. Keys which are only available in the
273
+ passed job result are completely omitted.
274
+
275
+ ```ruby
276
+ job.combine(other_job, worker: MyKrapsWorker, jobs: 8) do |key, value1, value2|
277
+ (value1 || {}).merge(value2 || {})
278
+ end
279
+ ```
280
+
281
+ Please note that the keys, partitioners and the number of partitions must match
282
+ for the jobs to be combined. Further note that the results of `other_job` must
283
+ be reduced, meaning that every key must be unique. Finally, `other_job` does
284
+ not necessarily need to be listed in the array of jobs returned by the `call` method,
285
+ since Kraps detects the dependency on its own.
286
+
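
A small worked example of these semantics (the values are invented for illustration):

```ruby
# current job (reduced):  { "a" => 1, "b" => 2 }
# other_job   (reduced):  { "a" => 10, "c" => 30 }
#
# the combine block is called with:
#   ("a", 1, 10)
#   ("b", 2, nil)  # keys missing from other_job yield nil
#   "c" is omitted, because it only exists in other_job
```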
248
287
  * `repartition`: Used to change the partitioning
249
288
 
250
289
  ```ruby
251
- job.repartition(partitions: 128, partitioner: partitioner, worker: MyKrapsWorker)
290
+ job.repartition(partitions: 128, partitioner: partitioner, worker: MyKrapsWorker, jobs: 8)
252
291
  ```
253
292
 
254
293
  Repartitions all data into the specified number of partitions and using the
255
294
  specified partitioner.
256
295
 
257
296
  * `each_partition`: Passes the partition number and all data of each partition
258
- as a lazy enumerable
297
+ as an enumerable, sorted by key. Please be aware that you should not call
298
+ `to_a` or similar on the enumerable.
259
299
 
260
300
  ```ruby
261
- job.each_partition do |partition, pairs|
301
+ job.each_partition(jobs: 8) do |partition, pairs|
262
302
  pairs.each do |key, value|
263
303
  # ...
264
304
  end
265
305
  end
266
306
  ```
267
307
 
308
+ * `dump`: Stores all current data per partition under the specified prefix
309
+
310
+ ```ruby
311
+ job.dump(prefix: "path/to/dump", worker: MyKrapsWorker)
312
+ ```
313
+
314
+ It creates a folder for every partition and stores one or more chunks in there.
315
+
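
Assuming a prefix of `path/to/dump`, the resulting layout on the storage layer looks roughly like this (one folder per partition):

```
path/to/dump/0/chunk.json
path/to/dump/1/chunk.json
path/to/dump/2/chunk.json
...
```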
316
+ * `load`: Loads the previously dumped data
317
+
318
+ ```ruby
319
+ job.load(prefix: "path/to/dump", partitions: 32, partitioner: Kraps::HashPartitioner.new, worker: MyKrapsWorker)
320
+ ```
321
+
322
+ The number of partitions and the partitioner must be specified.
323
+
268
324
  Please note that every API method accepts a `before` callable:
269
325
 
270
326
  ```ruby
@@ -326,13 +382,58 @@ When you execute the job, Kraps will execute the jobs one after another and as
326
382
  the jobs build up on each other, Kraps will execute the steps shared by both
327
383
  jobs only once.
328
384
 
385
+ ## Testing
386
+
387
+ Kraps ships with an in-memory fake driver for storage, which you can use for
388
+ testing purposes instead of the s3 driver:
389
+
390
+ ```ruby
391
+ Kraps.configure(
392
+ driver: Kraps::Drivers::FakeDriver.new(bucket: "kraps"),
393
+ # ...
394
+ )
+ ```
395
+
396
+ This is of course much faster than using s3 or some s3 compatible service.
397
+ Moreover, when testing large Kraps jobs you may want to test intermediate
398
+ steps. You can use `#dump` for this purpose and test that the data dumped is
399
+ correct.
400
+
401
+ ```ruby
402
+ job = job.dump(prefix: "path/to/dump")
403
+ ```
404
+
405
+ and in your tests do
406
+
407
+ ```ruby
408
+ Kraps.driver.value("path/to/dump/0/chunk.json") # => data of partition 0
409
+ Kraps.driver.value("path/to/dump/1/chunk.json") # => data of partition 1
410
+ # ...
411
+ ```
412
+
413
+ The data is stored in lines; each line is a JSON encoded array of key and
414
+ value.
415
+
416
+ ```ruby
417
+ data = Kraps.driver.value("path/to/dump/0/chunk.json").lines.map do |line|
418
+ JSON.parse(line) # => [key, value]
419
+ end
420
+ ```
421
+
422
+ The API of the driver is:
423
+
424
+ * `store(name, data_or_io, options = {})`: Stores `data_or_io` as `name`
425
+ * `list(prefix: nil)`: Lists all objects or all objects matching the `prefix`
426
+ * `value(name)`: Returns the object content of `name`
427
+ * `download(name, path)`: Downloads the object `name` to `path` in your
428
+ filesystem
429
+ * `exists?(name)`: Returns `true`/`false`
430
+ * `flush`: Removes all objects from the fake storage
431
+
329
432
  ## Dependencies
330
433
 
331
434
  Kraps is built on top of
332
435
  [map-reduce-ruby](https://github.com/mrkamel/map-reduce-ruby) for the
333
436
  map/reduce framework,
334
- [distributed_job](https://github.com/mrkamel/distributed_job)
335
- to keep track of the job/step status,
336
437
  [attachie](https://github.com/mrkamel/attachie) to interact with the storage
337
438
  layer (s3),
338
439
  [ruby-progressbar](https://github.com/jfelchner/ruby-progressbar) to
data/docker-compose.yml CHANGED
@@ -1,6 +1,6 @@
1
1
  version: '2'
2
2
  services:
3
- elasticsearch:
3
+ redis:
4
4
  image: redis
5
5
  ports:
6
6
  - 6379:6379
data/lib/kraps/actions.rb CHANGED
@@ -3,7 +3,9 @@ module Kraps
3
3
  ALL = [
4
4
  PARALLELIZE = "parallelize",
5
5
  MAP = "map",
6
+ MAP_PARTITIONS = "map_partitions",
6
7
  REDUCE = "reduce",
8
+ COMBINE = "combine",
7
9
  EACH_PARTITION = "each_partition"
8
10
  ]
9
11
  end
data/lib/kraps/drivers.rb CHANGED
@@ -8,6 +8,26 @@ module Kraps
8
8
  def with_prefix(path)
9
9
  File.join(*[@prefix, path].compact)
10
10
  end
11
+
12
+ def list(prefix: nil)
13
+ driver.list(bucket, prefix: prefix)
14
+ end
15
+
16
+ def value(name)
17
+ driver.value(name, bucket)
18
+ end
19
+
20
+ def download(name, path)
21
+ driver.download(name, bucket, path)
22
+ end
23
+
24
+ def exists?(name)
25
+ driver.exists?(name, bucket)
26
+ end
27
+
28
+ def store(name, data_or_io, options = {})
29
+ driver.store(name, data_or_io, bucket, options)
30
+ end
11
31
  end
12
32
 
13
33
  class S3Driver
@@ -32,6 +52,10 @@ module Kraps
32
52
  @bucket = bucket
33
53
  @prefix = prefix
34
54
  end
55
+
56
+ def flush
57
+ driver.flush
58
+ end
35
59
  end
36
60
  end
37
61
  end
data/lib/kraps/job.rb CHANGED
@@ -27,7 +27,7 @@ module Kraps
27
27
  end
28
28
  end
29
29
 
30
- def map(partitions: nil, partitioner: nil, worker: @worker, before: nil, &block)
30
+ def map(partitions: nil, partitioner: nil, jobs: nil, worker: @worker, before: nil, &block)
31
31
  fresh.tap do |job|
32
32
  job.instance_eval do
33
33
  @partitions = partitions if partitions
@@ -35,6 +35,7 @@ module Kraps
35
35
 
36
36
  @steps << Step.new(
37
37
  action: Actions::MAP,
38
+ jobs: [jobs, @partitions].compact.min,
38
39
  partitions: @partitions,
39
40
  partitioner: @partitioner,
40
41
  worker: worker,
@@ -45,11 +46,31 @@ module Kraps
45
46
  end
46
47
  end
47
48
 
48
- def reduce(worker: @worker, before: nil, &block)
49
+ def map_partitions(partitions: nil, partitioner: nil, jobs: nil, worker: @worker, before: nil, &block)
50
+ fresh.tap do |job|
51
+ job.instance_eval do
52
+ @partitions = partitions if partitions
53
+ @partitioner = partitioner if partitioner
54
+
55
+ @steps << Step.new(
56
+ action: Actions::MAP_PARTITIONS,
57
+ jobs: [jobs, @partitions].compact.min,
58
+ partitions: @partitions,
59
+ partitioner: @partitioner,
60
+ worker: worker,
61
+ before: before,
62
+ block: block
63
+ )
64
+ end
65
+ end
66
+ end
67
+
68
+ def reduce(jobs: nil, worker: @worker, before: nil, &block)
49
69
  fresh.tap do |job|
50
70
  job.instance_eval do
51
71
  @steps << Step.new(
52
72
  action: Actions::REDUCE,
73
+ jobs: [jobs, @partitions].compact.min,
53
74
  partitions: @partitions,
54
75
  partitioner: @partitioner,
55
76
  worker: worker,
@@ -60,11 +81,30 @@ module Kraps
60
81
  end
61
82
  end
62
83
 
63
- def each_partition(worker: @worker, before: nil, &block)
84
+ def combine(other_job, jobs: nil, worker: @worker, before: nil, &block)
85
+ fresh.tap do |job|
86
+ job.instance_eval do
87
+ @steps << Step.new(
88
+ action: Actions::COMBINE,
89
+ jobs: [jobs, @partitions].compact.min,
90
+ partitions: @partitions,
91
+ partitioner: @partitioner,
92
+ worker: worker,
93
+ before: before,
94
+ block: block,
95
+ dependency: other_job,
96
+ options: { combine_step_index: other_job.steps.size - 1 }
97
+ )
98
+ end
99
+ end
100
+ end
101
+
102
+ def each_partition(jobs: nil, worker: @worker, before: nil, &block)
64
103
  fresh.tap do |job|
65
104
  job.instance_eval do
66
105
  @steps << Step.new(
67
106
  action: Actions::EACH_PARTITION,
107
+ jobs: [jobs, @partitions].compact.min,
68
108
  partitions: @partitions,
69
109
  partitioner: @partitioner,
70
110
  worker: worker,
@@ -75,12 +115,51 @@ module Kraps
75
115
  end
76
116
  end
77
117
 
78
- def repartition(partitions:, partitioner: nil, worker: @worker, before: nil)
79
- map(partitions: partitions, partitioner: partitioner, worker: worker, before: before) do |key, value, collector|
118
+ def repartition(partitions:, jobs: nil, partitioner: nil, worker: @worker, before: nil)
119
+ map(jobs: jobs, partitions: partitions, partitioner: partitioner, worker: worker, before: before) do |key, value, collector|
80
120
  collector.call(key, value)
81
121
  end
82
122
  end
83
123
 
124
+ def dump(prefix:, worker: @worker)
125
+ each_partition(worker: worker) do |partition, pairs|
126
+ tempfile = Tempfile.new
127
+
128
+ pairs.each do |pair|
129
+ tempfile.puts(JSON.generate(pair))
130
+ end
131
+
132
+ Kraps.driver.store(File.join(prefix, partition.to_s, "chunk.json"), tempfile.tap(&:rewind))
133
+ ensure
134
+ tempfile&.close(true)
135
+ end
136
+ end
137
+
138
+ def load(prefix:, partitions:, partitioner:, worker: @worker)
139
+ job = parallelize(partitions: partitions, partitioner: proc { |key, _| key }, worker: worker) do |collector|
140
+ (0...partitions).each do |partition|
141
+ collector.call(partition)
142
+ end
143
+ end
144
+
145
+ job.map_partitions(partitioner: partitioner, worker: worker) do |partition, _, collector|
146
+ tempfile = Tempfile.new
147
+
148
+ path = File.join(prefix, partition.to_s, "chunk.json")
149
+ next unless Kraps.driver.exists?(path)
150
+
151
+ Kraps.driver.download(path, tempfile.path)
152
+
153
+ tempfile.each_line do |line|
154
+ key, value = JSON.parse(line)
155
+
156
+ collector.call(key, value)
157
+ end
158
+ ensure
159
+ tempfile&.close(true)
160
+ end
161
+ end
162
+
84
163
  def fresh
85
164
  dup.tap do |job|
86
165
  job.instance_variable_set(:@steps, @steps.dup)
@@ -0,0 +1,13 @@
1
+ module Kraps
2
+ class JobResolver
3
+ def call(jobs)
4
+ resolve_dependencies(Array(jobs)).uniq
5
+ end
6
+
7
+ private
8
+
9
+ def resolve_dependencies(jobs)
10
+ jobs.map { |job| [resolve_dependencies(job.steps.map(&:dependency).compact), job] }.flatten
11
+ end
12
+ end
13
+ end
@@ -0,0 +1,151 @@
1
+ module Kraps
2
+ class RedisQueue
3
+ VISIBILITY_TIMEOUT = 60
4
+
5
+ attr_reader :token
6
+
7
+ def initialize(redis:, token:, namespace:, ttl:)
8
+ @redis = redis
9
+ @token = token
10
+ @namespace = namespace
11
+ @ttl = ttl
12
+ end
13
+
14
+ def size
15
+ @size_script ||= <<~SCRIPT
16
+ local queue_key, pending_key, status_key, ttl, job = ARGV[1], ARGV[2], ARGV[3], tonumber(ARGV[4]), ARGV[5]
17
+
18
+ redis.call('expire', queue_key, ttl)
19
+ redis.call('expire', pending_key, ttl)
20
+ redis.call('expire', status_key, ttl)
21
+
22
+ return redis.call('llen', queue_key) + redis.call('zcard', pending_key)
23
+ SCRIPT
24
+
25
+ @redis.eval(@size_script, argv: [redis_queue_key, redis_pending_key, redis_status_key, @ttl])
26
+ end
27
+
28
+ def enqueue(payload)
29
+ @enqueue_script ||= <<~SCRIPT
30
+ local queue_key, pending_key, status_key, ttl, job = ARGV[1], ARGV[2], ARGV[3], tonumber(ARGV[4]), ARGV[5]
31
+
32
+ redis.call('rpush', queue_key, job)
33
+
34
+ redis.call('expire', queue_key, ttl)
35
+ redis.call('expire', pending_key, ttl)
36
+ redis.call('expire', status_key, ttl)
37
+ SCRIPT
38
+
39
+ @redis.eval(@enqueue_script, argv: [redis_queue_key, redis_pending_key, redis_status_key, @ttl, JSON.generate(payload)])
40
+ end
41
+
42
+ def dequeue
43
+ @dequeue_script ||= <<~SCRIPT
44
+ local queue_key, pending_key, status_key, ttl, visibility_timeout = ARGV[1], ARGV[2], ARGV[3], tonumber(ARGV[4]), tonumber(ARGV[5])
45
+
46
+ local zitem = redis.call('zrange', pending_key, 0, 0, 'WITHSCORES')
47
+ local job = zitem[1]
48
+
49
+ if not zitem[2] or tonumber(zitem[2]) > tonumber(redis.call('time')[1]) then
50
+ job = redis.call('lpop', queue_key)
51
+ end
52
+
53
+ redis.call('expire', queue_key, ttl)
54
+ redis.call('expire', pending_key, ttl)
55
+ redis.call('expire', status_key, ttl)
56
+
57
+ if not job then return nil end
58
+
59
+ redis.call('zadd', pending_key, tonumber(redis.call('time')[1]) + visibility_timeout, job)
60
+ redis.call('expire', pending_key, ttl)
61
+
62
+ return job
63
+ SCRIPT
64
+
65
+ job = @redis.eval(@dequeue_script, argv: [redis_queue_key, redis_pending_key, redis_status_key, @ttl, VISIBILITY_TIMEOUT])
66
+
67
+ unless job
68
+ yield(nil)
69
+ return
70
+ end
71
+
72
+ keep_alive(job) do
73
+ yield(JSON.parse(job)) if job
74
+ end
75
+
76
+ @remove_script ||= <<~SCRIPT
77
+ local queue_key, pending_key, status_key, ttl, job = ARGV[1], ARGV[2], ARGV[3], tonumber(ARGV[4]), ARGV[5]
78
+
79
+ redis.call('zrem', pending_key, job)
80
+
81
+ redis.call('expire', queue_key, ttl)
82
+ redis.call('expire', pending_key, ttl)
83
+ redis.call('expire', status_key, ttl)
84
+ SCRIPT
85
+
86
+ @redis.eval(@remove_script, argv: [redis_queue_key, redis_pending_key, redis_status_key, @ttl, job])
87
+ end
88
+
89
+ def stop
90
+ @stop_script ||= <<~SCRIPT
91
+ local queue_key, pending_key, status_key, ttl = ARGV[1], ARGV[2], ARGV[3], tonumber(ARGV[4])
92
+
93
+ redis.call('hset', status_key, 'stopped', 1)
94
+
95
+ redis.call('expire', queue_key, ttl)
96
+ redis.call('expire', pending_key, ttl)
97
+ redis.call('expire', status_key, ttl)
98
+ SCRIPT
99
+
100
+ @redis.eval(@stop_script, argv: [redis_queue_key, redis_pending_key, redis_status_key, @ttl])
101
+ end
102
+
103
+ def stopped?
104
+ @stopped_script ||= <<~SCRIPT
105
+ local queue_key, pending_key, status_key, ttl = ARGV[1], ARGV[2], ARGV[3], tonumber(ARGV[4])
106
+
107
+ redis.call('expire', queue_key, ttl)
108
+ redis.call('expire', pending_key, ttl)
109
+ redis.call('expire', status_key, ttl)
110
+
111
+ return redis.call('hget', status_key, 'stopped')
112
+ SCRIPT
113
+
114
+ @redis.eval(@stopped_script, argv: [redis_queue_key, redis_pending_key, redis_status_key, @ttl]).to_i == 1
115
+ end
116
+
117
+ private
118
+
119
+ def keep_alive(job)
120
+ @keep_alive_script ||= <<~SCRIPT
121
+ local queue_key, pending_key, status_key, ttl, job, visibility_timeout = ARGV[1], ARGV[2], ARGV[3], tonumber(ARGV[4]), ARGV[5], tonumber(ARGV[6])
122
+
123
+ redis.call('zadd', pending_key, tonumber(redis.call('time')[1]) + visibility_timeout, job)
124
+
125
+ redis.call('expire', queue_key, ttl)
126
+ redis.call('expire', pending_key, ttl)
127
+ redis.call('expire', status_key, ttl)
128
+ SCRIPT
129
+
130
+ interval = Interval.new(5) do
131
+ @redis.eval(@keep_alive_script, argv: [redis_queue_key, redis_pending_key, redis_status_key, @ttl, job, VISIBILITY_TIMEOUT])
132
+ end
133
+
134
+ yield
135
+ ensure
136
+ interval&.stop
137
+ end
138
+
139
+ def redis_queue_key
140
+ [@namespace, "kraps", "queue", @token].compact.join(":")
141
+ end
142
+
143
+ def redis_pending_key
144
+ [@namespace, "kraps", "pending", @token].compact.join(":")
145
+ end
146
+
147
+ def redis_status_key
148
+ [@namespace, "kraps", "status", @token].compact.join(":")
149
+ end
150
+ end
151
+ end
data/lib/kraps/runner.rb CHANGED
@@ -5,7 +5,7 @@ module Kraps
5
5
  end
6
6
 
7
7
  def call(*args, **kwargs)
8
- Array(@klass.new.call(*args, **kwargs)).tap do |jobs|
8
+ JobResolver.new.call(@klass.new.call(*args, **kwargs)).tap do |jobs|
9
9
  jobs.each_with_index do |job, job_index|
10
10
  job.steps.each_with_index.inject(nil) do |frame, (_, step_index)|
11
11
  StepRunner.new(
@@ -45,107 +45,101 @@ module Kraps
45
45
 
46
46
  def perform_parallelize
47
47
  enum = Enumerator.new do |yielder|
48
- collector = proc { |item| yielder << item }
48
+ collector = proc { |item| yielder << { item: item } }
49
49
 
50
50
  @step.block.call(collector)
51
51
  end
52
52
 
53
- with_distributed_job do |distributed_job|
54
- push_and_wait(distributed_job, enum) do |item, part|
55
- enqueue(token: distributed_job.token, part: part, item: item)
56
- end
53
+ token = push_and_wait(enum: enum)
57
54
 
58
- Frame.new(token: distributed_job.token, partitions: @step.partitions)
59
- end
55
+ Frame.new(token: token, partitions: @step.partitions)
60
56
  end
61
57
 
62
58
  def perform_map
63
- with_distributed_job do |distributed_job|
64
- push_and_wait(distributed_job, 0...@frame.partitions) do |partition, part|
65
- enqueue(token: distributed_job.token, part: part, partition: partition)
66
- end
59
+ enum = (0...@frame.partitions).map { |partition| { partition: partition } }
60
+ token = push_and_wait(job_count: @step.jobs, enum: enum)
67
61
 
68
- Frame.new(token: distributed_job.token, partitions: @step.partitions)
69
- end
62
+ Frame.new(token: token, partitions: @step.partitions)
63
+ end
64
+
65
+ def perform_map_partitions
66
+ enum = (0...@frame.partitions).map { |partition| { partition: partition } }
67
+ token = push_and_wait(job_count: @step.jobs, enum: enum)
68
+
69
+ Frame.new(token: token, partitions: @step.partitions)
70
70
  end
71
71
 
72
72
  def perform_reduce
73
- with_distributed_job do |distributed_job|
74
- push_and_wait(distributed_job, 0...@frame.partitions) do |partition, part|
75
- enqueue(token: distributed_job.token, part: part, partition: partition)
76
- end
73
+ enum = (0...@frame.partitions).map { |partition| { partition: partition } }
74
+ token = push_and_wait(job_count: @step.jobs, enum: enum)
77
75
 
78
- Frame.new(token: distributed_job.token, partitions: @step.partitions)
79
- end
76
+ Frame.new(token: token, partitions: @step.partitions)
80
77
  end
81
78
 
82
- def perform_each_partition
83
- with_distributed_job do |distributed_job|
84
- push_and_wait(distributed_job, 0...@frame.partitions) do |partition, part|
85
- enqueue(token: distributed_job.token, part: part, partition: partition)
86
- end
79
+ def perform_combine
80
+ combine_job = @step.dependency
81
+ combine_step = combine_job.steps[@step.options[:combine_step_index]]
82
+
83
+ raise(IncompatibleFrame, "Incompatible number of partitions") if combine_step.partitions != @step.partitions
87
84
 
88
- @frame
85
+ enum = (0...@frame.partitions).map do |partition|
86
+ { partition: partition, combine_frame: combine_step.frame.to_h }
89
87
  end
90
- end
91
88
 
92
- def enqueue(token:, part:, **rest)
93
- Kraps.enqueuer.call(
94
- @step.worker,
95
- JSON.generate(
96
- job_index: @job_index,
97
- step_index: @step_index,
98
- frame: @frame.to_h,
99
- token: token,
100
- part: part,
101
- klass: @klass,
102
- args: @args,
103
- kwargs: @kwargs,
104
- **rest
105
- )
106
- )
89
+ token = push_and_wait(job_count: @step.jobs, enum: enum)
90
+
91
+ Frame.new(token: token, partitions: @step.partitions)
107
92
  end
108
93
 
109
- def with_distributed_job
110
- distributed_job = Kraps.distributed_job_client.build(token: SecureRandom.hex)
94
+ def perform_each_partition
95
+ enum = (0...@frame.partitions).map { |partition| { partition: partition } }
96
+ push_and_wait(job_count: @step.jobs, enum: enum)
111
97
 
112
- yield(distributed_job)
113
- rescue Interrupt
114
- distributed_job&.stop
115
- raise
98
+ @frame
116
99
  end
117
100
 
118
- def push_and_wait(distributed_job, enum)
119
- progress_bar = build_progress_bar("#{@klass}: job #{@job_index + 1}/#{@jobs.size}, step #{@step_index + 1}/#{@job.steps.size}, token #{distributed_job.token}, %a, %c/%C (%p%) => #{@step.action}")
101
+ def push_and_wait(enum:, job_count: nil)
102
+ redis_queue = RedisQueue.new(redis: Kraps.redis, token: SecureRandom.hex, namespace: Kraps.namespace, ttl: Kraps.job_ttl)
103
+ progress_bar = build_progress_bar("#{@klass}: job #{@job_index + 1}/#{@jobs.size}, step #{@step_index + 1}/#{@job.steps.size}, token #{redis_queue.token}, %a, %c/%C (%p%) => #{@step.action}")
120
104
 
121
- begin
122
- total = 0
105
+ total = 0
123
106
 
124
- interval = Interval.new(1) do
125
- progress_bar.total = total
126
- end
107
+ interval = Interval.new(1) do
108
+ # The interval is used to continuously update the progress bar even
109
+ # when push_all is used and to avoid sessions being terminated due
110
+ # to inactivity etc
127
111
 
128
- distributed_job.push_each(enum) do |item, part|
129
- total += 1
130
- interval.fire(timeout: 1)
112
+ progress_bar.total = total
113
+ progress_bar.progress = [progress_bar.total - redis_queue.size, 0].max
114
+ end
131
115
 
132
- yield(item, part)
133
- end
134
- ensure
135
- interval&.stop
116
+ enum.each_with_index do |item, part|
117
+ total += 1
118
+
119
+ redis_queue.enqueue(item.merge(part: part))
136
120
  end
137
121
 
138
- loop do
139
- progress_bar.total = distributed_job.total
140
- progress_bar.progress = progress_bar.total - distributed_job.count
122
+ (job_count || total).times do
123
+ break if redis_queue.stopped?
124
+
125
+ Kraps.enqueuer.call(@step.worker, JSON.generate(job_index: @job_index, step_index: @step_index, frame: @frame.to_h, token: redis_queue.token, klass: @klass, args: @args, kwargs: @kwargs))
126
+ end
141
127
 
142
- break if distributed_job.finished? || distributed_job.stopped?
128
+ loop do
129
+ break if redis_queue.size.zero?
130
+ break if redis_queue.stopped?
143
131
 
144
132
  sleep(1)
145
133
  end
146
134
 
147
- raise(JobStopped, "The job was stopped") if distributed_job.stopped?
135
+ raise(JobStopped, "The job was stopped") if redis_queue.stopped?
136
+
137
+ interval.fire(timeout: 1)
138
+
139
+ redis_queue.token
148
140
  ensure
141
+ redis_queue&.stop
142
+ interval&.stop
149
143
  progress_bar&.stop
150
144
  end
151
145
 
data/lib/kraps/step.rb CHANGED
@@ -1,3 +1,3 @@
1
1
  module Kraps
2
- Step = Struct.new(:action, :partitioner, :partitions, :block, :worker, :before, :frame, keyword_init: true)
2
+ Step = Struct.new(:action, :partitioner, :partitions, :jobs, :block, :worker, :before, :frame, :dependency, :options, keyword_init: true)
3
3
  end
@@ -1,29 +1,3 @@
1
1
  module Kraps
2
- class TempPath
3
- attr_reader :path
4
-
5
- def initialize(prefix: nil, suffix: nil)
6
- @path = File.join(Dir.tmpdir, [prefix, SecureRandom.hex[0, 16], Process.pid, suffix].compact.join("."))
7
-
8
- File.open(@path, File::CREAT | File::EXCL) {}
9
-
10
- ObjectSpace.define_finalizer(self, self.class.finalize(@path))
11
-
12
- return unless block_given?
13
-
14
- begin
15
- yield
16
- ensure
17
- unlink
18
- end
19
- end
20
-
21
- def unlink
22
- FileUtils.rm_f(@path)
23
- end
24
-
25
- def self.finalize(path)
26
- proc { FileUtils.rm_f(path) }
27
- end
28
- end
2
+ TempPath = MapReduce::TempPath
29
3
  end
@@ -17,9 +17,9 @@ module Kraps
17
17
  end
18
18
  end
19
19
 
20
- def unlink
20
+ def delete
21
21
  synchronize do
22
- @temp_paths.each(&:unlink)
22
+ @temp_paths.each(&:delete)
23
23
  end
24
24
  end
25
25
 
data/lib/kraps/version.rb CHANGED
@@ -1,3 +1,3 @@
1
1
  module Kraps
2
- VERSION = "0.5.0"
2
+ VERSION = "0.7.0"
3
3
  end
data/lib/kraps/worker.rb CHANGED
@@ -1,29 +1,32 @@
1
1
  module Kraps
2
2
  class Worker
3
- def initialize(json, memory_limit:, chunk_limit:, concurrency:)
3
+ include MapReduce::Mergeable
4
+
5
+ def initialize(json, memory_limit:, chunk_limit:, concurrency:, logger: Logger.new("/dev/null"))
4
6
  @args = JSON.parse(json)
5
7
  @memory_limit = memory_limit
6
8
  @chunk_limit = chunk_limit
7
9
  @concurrency = concurrency
10
+ @logger = logger
8
11
  end
9
12
 
10
13
  def call(retries: 3)
11
- return if distributed_job.stopped?
14
+ return if redis_queue.stopped?
12
15
 
13
16
  raise(InvalidAction, "Invalid action #{step.action}") unless Actions::ALL.include?(step.action)
14
17
 
15
- with_retries(retries) do # TODO: allow to use queue based retries
16
- step.before&.call
17
-
18
- send(:"perform_#{step.action}")
18
+ dequeue do |payload|
19
+ with_retries(retries) do # TODO: allow to use queue based retries
20
+ step.before&.call
19
21
 
20
- distributed_job.done(@args["part"])
22
+ send(:"perform_#{step.action}", payload)
23
+ end
21
24
  end
22
25
  end
23
26
 
24
27
  private
25
28
 
26
- def perform_parallelize
29
+ def perform_parallelize(payload)
27
30
  implementation = Class.new do
28
31
  def map(key)
29
32
  yield(key, nil)
@@ -31,29 +34,19 @@ module Kraps
31
34
  end
32
35
 
33
36
  mapper = MapReduce::Mapper.new(implementation.new, partitioner: partitioner, memory_limit: @memory_limit)
34
- mapper.map(@args["item"])
37
+ mapper.map(payload["item"])
35
38
 
36
39
  mapper.shuffle(chunk_limit: @chunk_limit) do |partitions|
37
40
  Parallelizer.each(partitions.to_a, @concurrency) do |partition, path|
38
41
  File.open(path) do |stream|
39
- Kraps.driver.driver.store(Kraps.driver.with_prefix("#{@args["token"]}/#{partition}/chunk.#{@args["part"]}.json"), stream, Kraps.driver.bucket)
42
+ Kraps.driver.store(Kraps.driver.with_prefix("#{@args["token"]}/#{partition}/chunk.#{payload["part"]}.json"), stream)
40
43
  end
41
44
  end
42
45
  end
43
46
  end
44
47
 
45
- def perform_map
46
- temp_paths = TempPaths.new
47
-
48
- files = Kraps.driver.driver.list(Kraps.driver.bucket, prefix: Kraps.driver.with_prefix("#{@args["frame"]["token"]}/#{@args["partition"]}/")).sort
49
-
50
- temp_paths_index = files.each_with_object({}) do |file, hash|
51
- hash[file] = temp_paths.add
52
- end
53
-
54
- Parallelizer.each(files, @concurrency) do |file|
55
- Kraps.driver.driver.download(file, Kraps.driver.bucket, temp_paths_index[file].path)
56
- end
48
+ def perform_map(payload)
49
+ temp_paths = download_all(token: @args["frame"]["token"], partition: payload["partition"])
57
50
 
58
51
  current_step = step
59
52
 
@@ -85,17 +78,48 @@ module Kraps
85
78
  mapper.shuffle(chunk_limit: @chunk_limit) do |partitions|
86
79
  Parallelizer.each(partitions.to_a, @concurrency) do |partition, path|
87
80
  File.open(path) do |stream|
88
- Kraps.driver.driver.store(
89
- Kraps.driver.with_prefix("#{@args["token"]}/#{partition}/chunk.#{@args["part"]}.json"), stream, Kraps.driver.bucket
90
- )
81
+ Kraps.driver.store(Kraps.driver.with_prefix("#{@args["token"]}/#{partition}/chunk.#{payload["partition"]}.json"), stream)
91
82
  end
92
83
  end
93
84
  end
94
85
  ensure
95
- temp_paths&.unlink
86
+ temp_paths&.delete
96
87
  end
97
88
 
98
- def perform_reduce
89
+ def perform_map_partitions(payload)
90
+ temp_paths = download_all(token: @args["frame"]["token"], partition: payload["partition"])
91
+
92
+ current_step = step
93
+ current_partition = payload["partition"]
94
+
95
+ implementation = Object.new
96
+ implementation.define_singleton_method(:map) do |enum, &block|
97
+ current_step.block.call(current_partition, enum, block)
98
+ end
99
+
100
+ subsequent_step = next_step
101
+
102
+ if subsequent_step&.action == Actions::REDUCE
103
+ implementation.define_singleton_method(:reduce) do |key, value1, value2|
104
+ subsequent_step.block.call(key, value1, value2)
105
+ end
106
+ end
107
+
108
+ mapper = MapReduce::Mapper.new(implementation, partitioner: partitioner, memory_limit: @memory_limit)
109
+ mapper.map(k_way_merge(temp_paths.each.to_a, chunk_limit: @chunk_limit))
110
+
111
+ mapper.shuffle(chunk_limit: @chunk_limit) do |partitions|
112
+ Parallelizer.each(partitions.to_a, @concurrency) do |partition, path|
113
+ File.open(path) do |stream|
114
+ Kraps.driver.store(Kraps.driver.with_prefix("#{@args["token"]}/#{partition}/chunk.#{payload["partition"]}.json"), stream)
115
+ end
116
+ end
117
+ end
118
+ ensure
119
+ temp_paths&.delete
120
+ end
121
+
122
+ def perform_reduce(payload)
99
123
  current_step = step
100
124
 
101
125
  implementation = Object.new
@@ -105,8 +129,8 @@ module Kraps
105
129
 
106
130
  reducer = MapReduce::Reducer.new(implementation)
107
131
 
108
- Parallelizer.each(Kraps.driver.driver.list(Kraps.driver.bucket, prefix: Kraps.driver.with_prefix("#{@args["frame"]["token"]}/#{@args["partition"]}/")), @concurrency) do |file|
109
- Kraps.driver.driver.download(file, Kraps.driver.bucket, reducer.add_chunk)
132
+ Parallelizer.each(Kraps.driver.list(prefix: Kraps.driver.with_prefix("#{@args["frame"]["token"]}/#{payload["partition"]}/")), @concurrency) do |file|
133
+ Kraps.driver.download(file, reducer.add_chunk)
110
134
  end
111
135
 
112
136
  tempfile = Tempfile.new
@@ -115,35 +139,96 @@ module Kraps
115
139
  tempfile.puts(JSON.generate([key, value]))
116
140
  end
117
141
 
118
- Kraps.driver.driver.store(Kraps.driver.with_prefix("#{@args["token"]}/#{@args["partition"]}/chunk.#{@args["part"]}.json"), tempfile.tap(&:rewind), Kraps.driver.bucket)
142
+ Kraps.driver.store(Kraps.driver.with_prefix("#{@args["token"]}/#{payload["partition"]}/chunk.#{payload["partition"]}.json"), tempfile.tap(&:rewind))
119
143
  ensure
120
144
  tempfile&.close(true)
121
145
  end
122
146
 
123
- def perform_each_partition
124
- temp_paths = TempPaths.new
147
+ def perform_combine(payload)
148
+ temp_paths1 = download_all(token: @args["frame"]["token"], partition: payload["partition"])
149
+ temp_paths2 = download_all(token: payload["combine_frame"]["token"], partition: payload["partition"])
125
150
 
126
- files = Kraps.driver.driver.list(Kraps.driver.bucket, prefix: Kraps.driver.with_prefix("#{@args["frame"]["token"]}/#{@args["partition"]}/")).sort
151
+ enum1 = k_way_merge(temp_paths1.each.to_a, chunk_limit: @chunk_limit)
152
+ enum2 = k_way_merge(temp_paths2.each.to_a, chunk_limit: @chunk_limit)
127
153
 
128
- temp_paths_index = files.each_with_object({}) do |file, hash|
129
- hash[file] = temp_paths.add
154
+ combine_method = method(:combine)
155
+ current_step = step
156
+
157
+ implementation = Object.new
158
+ implementation.define_singleton_method(:map) do |&block|
159
+ combine_method.call(enum1, enum2) do |key, value1, value2|
160
+ block.call(key, current_step.block.call(key, value1, value2))
161
+ end
130
162
  end
131
163
 
132
- Parallelizer.each(files, @concurrency) do |file|
133
- Kraps.driver.driver.download(file, Kraps.driver.bucket, temp_paths_index[file].path)
164
+ mapper = MapReduce::Mapper.new(implementation, partitioner: partitioner, memory_limit: @memory_limit)
165
+ mapper.map
166
+
167
+ mapper.shuffle(chunk_limit: @chunk_limit) do |partitions|
168
+ Parallelizer.each(partitions.to_a, @concurrency) do |partition, path|
169
+ File.open(path) do |stream|
170
+ Kraps.driver.store(Kraps.driver.with_prefix("#{@args["token"]}/#{partition}/chunk.#{payload["partition"]}.json"), stream)
171
+ end
172
+ end
134
173
  end
174
+ ensure
175
+ temp_paths1&.delete
176
+ temp_paths2&.delete
177
+ end
135
178
 
136
- enum = Enumerator::Lazy.new(temp_paths) do |yielder, temp_path|
137
- File.open(temp_path.path) do |stream|
138
- stream.each_line do |line|
139
- yielder << JSON.parse(line)
179
+ def combine(enum1, enum2)
180
+ current1 = begin; enum1.next; rescue StopIteration; nil; end
181
+ current2 = begin; enum2.next; rescue StopIteration; nil; end
182
+
183
+ loop do
184
+ return if current1.nil? && current2.nil?
185
+ return if current1.nil?
186
+
187
+ if current2.nil?
188
+ yield(current1[0], current1[1], nil)
189
+
190
+ current1 = begin; enum1.next; rescue StopIteration; nil; end
191
+ elsif current1[0] == current2[0]
192
+ loop do
193
+ yield(current1[0], current1[1], current2[1])
194
+
195
+ current1 = begin; enum1.next; rescue StopIteration; nil; end
196
+
197
+ break if current1.nil?
198
+ break if current1[0] != current2[0]
199
+ end
200
+
201
+ current2 = begin; enum2.next; rescue StopIteration; nil; end
202
+ else
203
+ res = current1[0] <=> current2[0]
204
+
205
+ if res < 0
206
+ yield(current1[0], current1[1], nil)
207
+
208
+ current1 = begin; enum1.next; rescue StopIteration; nil; end
209
+ else
210
+ current2 = begin; enum2.next; rescue StopIteration; nil; end
140
211
  end
141
212
  end
142
213
  end
214
+ end
215
+
216
+ def perform_each_partition(payload)
217
+ temp_paths = TempPaths.new
218
+
219
+ files = Kraps.driver.list(prefix: Kraps.driver.with_prefix("#{@args["frame"]["token"]}/#{payload["partition"]}/")).sort
220
+
221
+ temp_paths_index = files.each_with_object({}) do |file, hash|
222
+ hash[file] = temp_paths.add
223
+ end
143
224
 
144
- step.block.call(@args["partition"], enum)
225
+ Parallelizer.each(files, @concurrency) do |file|
226
+ Kraps.driver.download(file, temp_paths_index[file].path)
227
+ end
228
+
229
+ step.block.call(payload["partition"], k_way_merge(temp_paths.each.to_a, chunk_limit: @chunk_limit))
145
230
  ensure
146
- temp_paths&.unlink
231
+ temp_paths&.delete
147
232
  end
148
233
 
149
234
  def with_retries(num_retries)
@@ -152,14 +237,16 @@ module Kraps
152
237
  begin
153
238
  yield
154
239
  rescue Kraps::Error
155
- distributed_job.stop
240
+ redis_queue.stop
156
241
  raise
157
- rescue StandardError
242
+ rescue StandardError => e
158
243
  if retries >= num_retries
159
- distributed_job.stop
244
+ redis_queue.stop
160
245
  raise
161
246
  end
162
247
 
248
+ @logger.error(e)
249
+
163
250
  sleep(5)
164
251
  retries += 1
165
252
 
@@ -167,8 +254,39 @@ module Kraps
167
254
  end
168
255
  end
169
256
 
257
+ def dequeue
258
+ loop do
259
+ break if redis_queue.stopped?
260
+ break if redis_queue.size.zero?
261
+
262
+ redis_queue.dequeue do |payload|
263
+ payload ? yield(payload) : sleep(1)
264
+ end
265
+ end
266
+ end
267
+
268
+ def redis_queue
269
+ @redis_queue ||= RedisQueue.new(redis: Kraps.redis, token: @args["token"], namespace: Kraps.namespace, ttl: Kraps.job_ttl)
270
+ end
271
+
272
+ def download_all(token:, partition:)
273
+ temp_paths = TempPaths.new
274
+
275
+ files = Kraps.driver.list(prefix: Kraps.driver.with_prefix("#{token}/#{partition}/")).sort
276
+
277
+ temp_paths_index = files.each_with_object({}) do |file, hash|
278
+ hash[file] = temp_paths.add
279
+ end
280
+
281
+ Parallelizer.each(files, @concurrency) do |file|
282
+ Kraps.driver.download(file, temp_paths_index[file].path)
283
+ end
284
+
285
+ temp_paths
286
+ end
287
+
170
288
  def jobs
171
- @jobs ||= Array(@args["klass"].constantize.new.call(*@args["args"], **@args["kwargs"].transform_keys(&:to_sym)))
289
+ @jobs ||= JobResolver.new.call(@args["klass"].constantize.new.call(*@args["args"], **@args["kwargs"].transform_keys(&:to_sym)))
172
290
  end
173
291
 
174
292
  def job
@@ -198,9 +316,5 @@ module Kraps
198
316
  def partitioner
199
317
  @partitioner ||= proc { |key| step.partitioner.call(key, step.partitions) }
200
318
  end
201
-
202
- def distributed_job
203
- @distributed_job ||= Kraps.distributed_job_client.build(token: @args["token"])
204
- end
205
319
  end
206
320
  end
data/lib/kraps.rb CHANGED
@@ -1,32 +1,37 @@
1
+ require "ruby-progressbar"
2
+ require "ruby-progressbar/outputs/null"
3
+ require "map_reduce"
4
+ require "redis"
5
+
1
6
  require_relative "kraps/version"
2
7
  require_relative "kraps/drivers"
3
8
  require_relative "kraps/actions"
4
9
  require_relative "kraps/parallelizer"
5
10
  require_relative "kraps/hash_partitioner"
11
+ require_relative "kraps/redis_queue"
6
12
  require_relative "kraps/temp_path"
7
13
  require_relative "kraps/temp_paths"
8
14
  require_relative "kraps/timeout_queue"
9
15
  require_relative "kraps/interval"
10
16
  require_relative "kraps/job"
17
+ require_relative "kraps/job_resolver"
11
18
  require_relative "kraps/runner"
12
19
  require_relative "kraps/step"
13
20
  require_relative "kraps/frame"
14
21
  require_relative "kraps/worker"
15
- require "distributed_job"
16
- require "ruby-progressbar"
17
- require "ruby-progressbar/outputs/null"
18
- require "map_reduce"
19
- require "redis"
20
22
 
21
23
  module Kraps
22
24
  class Error < StandardError; end
23
25
  class InvalidAction < Error; end
24
26
  class InvalidStep < Error; end
25
27
  class JobStopped < Error; end
28
+ class IncompatibleFrame < Error; end
26
29
 
27
- def self.configure(driver:, redis: Redis.new, namespace: nil, job_ttl: 24 * 60 * 60, show_progress: true, enqueuer: ->(worker, json) { worker.perform_async(json) })
30
+ def self.configure(driver:, redis: Redis.new, namespace: nil, job_ttl: 4 * 24 * 60 * 60, show_progress: true, enqueuer: ->(worker, json) { worker.perform_async(json) })
28
31
  @driver = driver
29
- @distributed_job_client = DistributedJob::Client.new(redis: redis, namespace: namespace, default_ttl: job_ttl)
32
+ @redis = redis
33
+ @namespace = namespace
34
+ @job_ttl = job_ttl.to_i
30
35
  @show_progress = show_progress
31
36
  @enqueuer = enqueuer
32
37
  end
@@ -35,8 +40,16 @@ module Kraps
35
40
  @driver
36
41
  end
37
42
 
38
- def self.distributed_job_client
39
- @distributed_job_client
43
+ def self.redis
44
+ @redis
45
+ end
46
+
47
+ def self.namespace
48
+ @namespace
49
+ end
50
+
51
+ def self.job_ttl
52
+ @job_ttl
40
53
  end
41
54
 
42
55
  def self.show_progress?
metadata CHANGED
@@ -1,14 +1,14 @@
1
1
  --- !ruby/object:Gem::Specification
2
2
  name: kraps
3
3
  version: !ruby/object:Gem::Version
4
- version: 0.5.0
4
+ version: 0.7.0
5
5
  platform: ruby
6
6
  authors:
7
7
  - Benjamin Vetter
8
8
  autorequire:
9
9
  bindir: exe
10
10
  cert_chain: []
11
- date: 2022-11-10 00:00:00.000000000 Z
11
+ date: 2022-12-02 00:00:00.000000000 Z
12
12
  dependencies:
13
13
  - !ruby/object:Gem::Dependency
14
14
  name: attachie
@@ -24,20 +24,6 @@ dependencies:
24
24
  - - ">="
25
25
  - !ruby/object:Gem::Version
26
26
  version: '0'
27
- - !ruby/object:Gem::Dependency
28
- name: distributed_job
29
- requirement: !ruby/object:Gem::Requirement
30
- requirements:
31
- - - ">="
32
- - !ruby/object:Gem::Version
33
- version: '0'
34
- type: :runtime
35
- prerelease: false
36
- version_requirements: !ruby/object:Gem::Requirement
37
- requirements:
38
- - - ">="
39
- - !ruby/object:Gem::Version
40
- version: '0'
41
27
  - !ruby/object:Gem::Dependency
42
28
  name: map-reduce-ruby
43
29
  requirement: !ruby/object:Gem::Requirement
@@ -147,7 +133,9 @@ files:
147
133
  - lib/kraps/hash_partitioner.rb
148
134
  - lib/kraps/interval.rb
149
135
  - lib/kraps/job.rb
136
+ - lib/kraps/job_resolver.rb
150
137
  - lib/kraps/parallelizer.rb
138
+ - lib/kraps/redis_queue.rb
151
139
  - lib/kraps/runner.rb
152
140
  - lib/kraps/step.rb
153
141
  - lib/kraps/temp_path.rb