kraps 0.5.0 → 0.6.0
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- checksums.yaml +4 -4
- data/CHANGELOG.md +6 -0
- data/Gemfile.lock +1 -1
- data/README.md +112 -18
- data/lib/kraps/actions.rb +2 -0
- data/lib/kraps/drivers.rb +24 -0
- data/lib/kraps/job.rb +74 -0
- data/lib/kraps/job_resolver.rb +13 -0
- data/lib/kraps/runner.rb +26 -1
- data/lib/kraps/step.rb +1 -1
- data/lib/kraps/temp_path.rb +1 -27
- data/lib/kraps/temp_paths.rb +2 -2
- data/lib/kraps/version.rb +1 -1
- data/lib/kraps/worker.rb +137 -34
- data/lib/kraps.rb +8 -5
- metadata +3 -2
checksums.yaml
CHANGED
@@ -1,7 +1,7 @@
 ---
 SHA256:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: 02ba582478178273300e77d5ddd18a8568fd682b0c53a444f1c5e7b756f9fd9a
+  data.tar.gz: 9e1698a67c252512a2f277ca1a6e079e063f8f1e6c13500245aabd969cca122d
 SHA512:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: 2d1b3bd10d1048c64804ddf86069c0757247b581edea0f38330189a10d35ed096970d008533a167ddaf97c16479425e3baa2bb56f5a8888255ecfd12b911a168
+  data.tar.gz: 045263d6aa920cef97a162fcbfd41f239087c459a8c86bd2112d3ce4524c48c6e65ceaac8993f8ea7fb9089a8877db2863e39644888c8bd884df0fa95241277d
data/CHANGELOG.md
CHANGED
data/Gemfile.lock
CHANGED
data/README.md
CHANGED
@@ -3,11 +3,12 @@
 **Easily process big data in ruby**
 
 Kraps allows to process and perform calculations on very large datasets in
-parallel using a map/reduce framework
-you already have. You just need some
-layer with temporary lifecycle policy
-job framework (like sidekiq,
-progress. Most things you most
+parallel using a map/reduce framework similar to [spark](https://spark.apache.org/),
+but runs on a background job framework you already have. You just need some
+space on your filesystem, S3 as a storage layer with temporary lifecycle policy
+enabled, the already mentioned background job framework (like sidekiq,
+shoryuken, etc) and redis to keep track of the progress. Most things you most
+likely already have in place anyways.
 
 ## Installation
 
@@ -115,13 +116,13 @@ be able to give 300-400 megabytes to Kraps then, but now divide this by 10 and
 specify a `memory_limit` of around `30.megabytes`, better less. The
 `memory_limit` affects how much chunks will be written to disk depending on the
 data size you are processing and how big these chunks are. The smaller the
-value, the more chunks
-the chunks.
-
-
-
-
-
+value, the more chunks. The more chunks, the more runs Kraps need to merge
+the chunks. The `chunk_limit` ensures that only the specified amount of chunks
+are processed in a single run. A run basically means: it takes up to
+`chunk_limit` chunks, reduces them and pushes the result as a new chunk to the
+list of chunks to process. Thus, if your number of file descriptors is
+unlimited, you want to set it to a higher number to avoid the overhead of
+multiple runs. `concurrency` tells Kraps how much threads to use to
 concurrently upload/download files from the storage layer. Finally, `retries`
 specifies how often Kraps should retry the job step in case of errors. Kraps
 will sleep for 5 seconds between those retries. Please note that it's not yet
@@ -130,7 +131,6 @@ Kraps. Please note, however, that `parallelize` is not covered by `retries`
 yet, as the block passed to `parallelize` is executed by the runner, not the
 workers.
 
-
 Now, executing your job is super easy:
 
 ```ruby
@@ -182,11 +182,11 @@ https://github.com/mrkamel/map-reduce-ruby/#limitations-for-keys
 ## Storage
 
 Kraps stores temporary results of steps in a storage layer. Currently, only S3
-is supported besides a in
+is supported besides a in-memory driver used for testing purposes. Please be
 aware that Kraps does not clean up any files from the storage layer, as it
-would be a safe thing to do in case of errors anyways. Instead, Kraps
-lifecycle features of modern object storage systems. Therefore, it is
-to
+would not be a safe thing to do in case of errors anyways. Instead, Kraps
+relies on lifecycle features of modern object storage systems. Therefore, it is
+required to configure a lifecycle policy to delete any files after e.g. 7 days
 either for a whole bucket or for a certain prefix like e.g. `temp/` and tell
 Kraps about the prefix to use (e.g. `temp/kraps/`).
 
@@ -229,6 +229,19 @@ The block gets each key-value pair passed and the `collector` block can be
 called as often as neccessary. This is also the reason why `map` can not simply
 return the new key-value pair, but the `collector` must be used instead.
 
+* `map_partitions`: Maps the key value pairs to other key value pairs, but the
+  block receives all data of each partition as an enumerable and sorted by key.
+  Please be aware that you should not call `to_a` or similar on the enumerable.
+  Prefer `map` over `map_partitions` when possible.
+
+```ruby
+job.map_partitions(partitions: 128, partitioner: partitioner, worker: MyKrapsWorker) do |pairs, collector|
+  pairs.each do |key, value|
+    collector.call("changed #{key}", "changed #{value}")
+  end
+end
+```
+
 * `reduce`: Reduces the values of pairs having the same key
 
 ```ruby
@@ -245,6 +258,24 @@ The `key` itself is also passed to the block for the case that you need to
 customize the reduce calculation according to the value of the key. However,
 most of the time, this is not neccessary and the key can simply be ignored.
 
+* `combine`: Combines the results of 2 jobs by combining every key available
+  in the current job result with the corresponding key from the passed job
+  result. When the passed job result does not have the corresponding key,
+  `nil` will be passed to the block. Keys which are only available in the
+  passed job result are completely omitted.
+
+```ruby
+job.combine(other_job, worker: MyKrapsWorker) do |key, value1, value2|
+  (value1 || {}).merge(value2 || {})
+end
+```
+
+Please note that the keys, partitioners and the number of partitions must match
+for the jobs to be combined. Further note that the results of `other_job` must
+be reduced, meaning that every key must be unique. Finally, `other_job` must
+not neccessarily be listed in the array of jobs returned by the `call` method,
+since Kraps detects the dependency on its own.
+
 * `repartition`: Used to change the partitioning
 
 ```ruby
@@ -255,7 +286,8 @@ Repartitions all data into the specified number of partitions and using the
 specified partitioner.
 
 * `each_partition`: Passes the partition number and all data of each partition
-  as
+  as an enumerable and sorted by key. Please be aware that you should not call
+  `to_a` or similar on the enumerable.
 
 ```ruby
 job.each_partition do |partition, pairs|
@@ -265,6 +297,22 @@ job.each_partition do |partition, pairs|
 end
 ```
 
+* `dump`: Store all current data per partition under the specified prefix
+
+```ruby
+job.dump(prefix: "path/to/dump", worker: MyKrapsWorker)
+```
+
+It creates a folder for every partition and stores one or more chunks in there.
+
+* `load`: Loads the previously dumped data
+
+```ruby
+job.load(prefix: "path/to/dump", partitions: 32, partitioner: Kraps::HashPartitioner.new, worker: MyKrapsWorker)
+```
+
+The number of partitions and the partitioner must be specified.
+
 Please note that every API method accepts a `before` callable:
 
 ```ruby
@@ -326,6 +374,52 @@ When you execute the job, Kraps will execute the jobs one after another and as
 the jobs build up on each other, Kraps will execute the steps shared by both
 jobs only once.
 
+## Testing
+
+Kraps ships with an in-memory fake driver for storage, which you can use for
+testing purposes instead of the s3 driver:
+
+```ruby
+Kraps.configure(
+  driver: Kraps::Drivers::FakeDriver.new(bucket: "kraps"),
+  # ...
+)
+```
+
+This is of course much faster than using s3 or some s3 compatible service.
+Moreover, when testing large Kraps jobs you maybe want to test intermediate
+steps. You can use `#dump` for this purpose and test that the data dumped is
+correct.
+
+```ruby
+job = job.dump(prefix: "path/to/dump")
+```
+
+and in your tests do
+
+```ruby
+Kraps.driver.value("path/to/dump/0/chunk.json") # => data of partition 0
+Kraps.driver.value("path/to/dump/1/chunk.json") # => data of partition 1
+# ...
+```
+
+The data is stored in lines, each line is a json encoded array of key and
+value.
+
+```ruby
+data = Kraps.driver.value("path/to/dump/0/chunk.json").lines.map do |line|
+  JSON.parse(line) # => [key, value]
+end
+```
+
+The API of the driver is:
+
+* `store(name, data_or_io, options = {})`: Stores `data_or_io` as `name`
+* `list(prefix: nil)`: Lists all objects or all objects matching the `prefix`
+* `value(name)`: Returns the object content of `name`
+* `download(name, path)`: Downloads the object `name` to `path` in your
+  filesystem
+* `exists?(name)`: Returns `true`/`false`
+* `flush`: Removes all objects from the fake storage
+
 ## Dependencies
 
 Kraps is built on top of
data/lib/kraps/actions.rb
CHANGED
data/lib/kraps/drivers.rb
CHANGED
@@ -8,6 +8,26 @@ module Kraps
       def with_prefix(path)
         File.join(*[@prefix, path].compact)
       end
+
+      def list(prefix: nil)
+        driver.list(bucket, prefix: prefix)
+      end
+
+      def value(name)
+        driver.value(name, bucket)
+      end
+
+      def download(name, path)
+        driver.download(name, bucket, path)
+      end
+
+      def exists?(name)
+        driver.exists?(name, bucket)
+      end
+
+      def store(name, data_or_io, options = {})
+        driver.store(name, data_or_io, bucket, options)
+      end
     end
 
     class S3Driver
@@ -32,6 +52,10 @@ module Kraps
         @bucket = bucket
         @prefix = prefix
       end
+
+      def flush
+        driver.flush
+      end
     end
   end
 end
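
The new `list`/`value`/`download`/`exists?`/`store` wrappers mean callers no longer need to pass the bucket or reach through the nested `driver.driver`. A usage sketch against the fake driver (the key names are illustrative, and `list`'s exact return type is an assumption):

```ruby
# Assumes Kraps was configured with the fake driver, e.g. in a test suite.
Kraps.configure(driver: Kraps::Drivers::FakeDriver.new(bucket: "kraps"))

key = Kraps.driver.with_prefix("token/0/chunk.0.json")

Kraps.driver.store(key, "[\"some key\",1]\n")
Kraps.driver.exists?(key)             # => true
Kraps.driver.value(key)               # => "[\"some key\",1]\n"
Kraps.driver.list(prefix: "token/0/") # => enumerable containing "token/0/chunk.0.json"
```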
data/lib/kraps/job.rb
CHANGED
@@ -45,6 +45,24 @@ module Kraps
       end
     end
 
+    def map_partitions(partitions: nil, partitioner: nil, worker: @worker, before: nil, &block)
+      fresh.tap do |job|
+        job.instance_eval do
+          @partitions = partitions if partitions
+          @partitioner = partitioner if partitioner
+
+          @steps << Step.new(
+            action: Actions::MAP_PARTITIONS,
+            partitions: @partitions,
+            partitioner: @partitioner,
+            worker: worker,
+            before: before,
+            block: block
+          )
+        end
+      end
+    end
+
     def reduce(worker: @worker, before: nil, &block)
       fresh.tap do |job|
         job.instance_eval do
@@ -60,6 +78,23 @@ module Kraps
       end
     end
 
+    def combine(other_job, worker: @worker, before: nil, &block)
+      fresh.tap do |job|
+        job.instance_eval do
+          @steps << Step.new(
+            action: Actions::COMBINE,
+            partitions: @partitions,
+            partitioner: @partitioner,
+            worker: worker,
+            before: before,
+            block: block,
+            dependency: other_job,
+            options: { combine_step_index: other_job.steps.size - 1 }
+          )
+        end
+      end
+    end
+
     def each_partition(worker: @worker, before: nil, &block)
       fresh.tap do |job|
         job.instance_eval do
@@ -81,6 +116,45 @@ module Kraps
       end
     end
 
+    def dump(prefix:, worker: @worker)
+      each_partition(worker: worker) do |partition, pairs|
+        tempfile = Tempfile.new
+
+        pairs.each do |pair|
+          tempfile.puts(JSON.generate(pair))
+        end
+
+        Kraps.driver.store(File.join(prefix, partition.to_s, "chunk.json"), tempfile.tap(&:rewind))
+      ensure
+        tempfile&.close(true)
+      end
+    end
+
+    def load(prefix:, partitions:, partitioner:, worker: @worker)
+      job = parallelize(partitions: partitions, partitioner: proc { |key, _| key }, worker: worker) do |collector|
+        (0...partitions).each do |partition|
+          collector.call(partition)
+        end
+      end
+
+      job.map_partitions(partitioner: partitioner, worker: worker) do |partition, _, collector|
+        tempfile = Tempfile.new
+
+        path = File.join(prefix, partition.to_s, "chunk.json")
+        next unless Kraps.driver.exists?(path)
+
+        Kraps.driver.download(path, tempfile.path)
+
+        tempfile.each_line do |line|
+          key, value = JSON.parse(line)
+
+          collector.call(key, value)
+        end
+      ensure
+        tempfile&.close(true)
+      end
+    end
+
     def fresh
       dup.tap do |job|
         job.instance_variable_set(:@steps, @steps.dup)
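
As the new methods above show, `dump` writes each partition as a `chunk.json` of JSON lines under the prefix, and `load` re-reads those chunks by parallelizing over the partition indices and streaming each file back through the collector. A round-trip sketch (the prefix, partition count and `MyKrapsWorker` are illustrative):

```ruby
# In one job: persist the current state of the data.
job = job.dump(prefix: "temp/dumps/step1", worker: MyKrapsWorker)

# In a later job: restore it; partitions and partitioner must be given explicitly.
job = Kraps::Job.new(worker: MyKrapsWorker)
job = job.load(prefix: "temp/dumps/step1", partitions: 32, partitioner: Kraps::HashPartitioner.new, worker: MyKrapsWorker)
```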
data/lib/kraps/job_resolver.rb
ADDED

@@ -0,0 +1,13 @@
+module Kraps
+  class JobResolver
+    def call(jobs)
+      resolve_dependencies(Array(jobs)).uniq
+    end
+
+    private
+
+    def resolve_dependencies(jobs)
+      jobs.map { |job| [resolve_dependencies(job.steps.map(&:dependency).compact), job] }.flatten
+    end
+  end
+end
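
`JobResolver` recursively walks each job's steps, prepends any `dependency` recorded by `combine` ahead of the dependent job, and deduplicates, so the runner executes dependencies first. A behavior sketch with two hypothetical jobs:

```ruby
# job_b.combine(job_a) { ... } records job_a as a step dependency, so:
Kraps::JobResolver.new.call([job_b])
# => [job_a, job_b]  (dependency first, duplicates removed via #uniq)
```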
data/lib/kraps/runner.rb
CHANGED
@@ -5,7 +5,7 @@ module Kraps
     end
 
     def call(*args, **kwargs)
-
+      JobResolver.new.call(@klass.new.call(*args, **kwargs)).tap do |jobs|
         jobs.each_with_index do |job, job_index|
           job.steps.each_with_index.inject(nil) do |frame, (_, step_index)|
             StepRunner.new(
@@ -69,6 +69,16 @@ module Kraps
       end
     end
 
+    def perform_map_partitions
+      with_distributed_job do |distributed_job|
+        push_and_wait(distributed_job, 0...@frame.partitions) do |partition, part|
+          enqueue(token: distributed_job.token, part: part, partition: partition)
+        end
+
+        Frame.new(token: distributed_job.token, partitions: @step.partitions)
+      end
+    end
+
     def perform_reduce
       with_distributed_job do |distributed_job|
         push_and_wait(distributed_job, 0...@frame.partitions) do |partition, part|
@@ -79,6 +89,21 @@ module Kraps
       end
     end
 
+    def perform_combine
+      combine_job = @step.dependency
+      combine_step = combine_job.steps[@step.options[:combine_step_index]]
+
+      raise(IncompatibleFrame, "Incompatible number of partitions") if combine_step.partitions != @step.partitions
+
+      with_distributed_job do |distributed_job|
+        push_and_wait(distributed_job, 0...@frame.partitions) do |partition, part|
+          enqueue(token: distributed_job.token, part: part, partition: partition, combine_frame: combine_step.frame.to_h)
+        end
+
+        Frame.new(token: distributed_job.token, partitions: @step.partitions)
+      end
+    end
+
     def perform_each_partition
       with_distributed_job do |distributed_job|
         push_and_wait(distributed_job, 0...@frame.partitions) do |partition, part|
data/lib/kraps/step.rb
CHANGED
data/lib/kraps/temp_path.rb
CHANGED
@@ -1,29 +1,3 @@
 module Kraps
-  class TempPath
-    attr_reader :path
-
-    def initialize(prefix: nil, suffix: nil)
-      @path = File.join(Dir.tmpdir, [prefix, SecureRandom.hex[0, 16], Process.pid, suffix].compact.join("."))
-
-      File.open(@path, File::CREAT | File::EXCL) {}
-
-      ObjectSpace.define_finalizer(self, self.class.finalize(@path))
-
-      return unless block_given?
-
-      begin
-        yield
-      ensure
-        unlink
-      end
-    end
-
-    def unlink
-      FileUtils.rm_f(@path)
-    end
-
-    def self.finalize(path)
-      proc { FileUtils.rm_f(path) }
-    end
-  end
+  TempPath = MapReduce::TempPath
 end
data/lib/kraps/temp_paths.rb
CHANGED
data/lib/kraps/version.rb
CHANGED
data/lib/kraps/worker.rb
CHANGED
@@ -1,10 +1,13 @@
 module Kraps
   class Worker
-
+    include MapReduce::Mergeable
+
+    def initialize(json, memory_limit:, chunk_limit:, concurrency:, logger: Logger.new("/dev/null"))
       @args = JSON.parse(json)
       @memory_limit = memory_limit
       @chunk_limit = chunk_limit
       @concurrency = concurrency
+      @logger = logger
     end
 
     def call(retries: 3)
@@ -36,24 +39,14 @@
       mapper.shuffle(chunk_limit: @chunk_limit) do |partitions|
         Parallelizer.each(partitions.to_a, @concurrency) do |partition, path|
           File.open(path) do |stream|
-            Kraps.driver.
+            Kraps.driver.store(Kraps.driver.with_prefix("#{@args["token"]}/#{partition}/chunk.#{@args["part"]}.json"), stream)
           end
         end
       end
     end
 
     def perform_map
-      temp_paths =
-
-      files = Kraps.driver.driver.list(Kraps.driver.bucket, prefix: Kraps.driver.with_prefix("#{@args["frame"]["token"]}/#{@args["partition"]}/")).sort
-
-      temp_paths_index = files.each_with_object({}) do |file, hash|
-        hash[file] = temp_paths.add
-      end
-
-      Parallelizer.each(files, @concurrency) do |file|
-        Kraps.driver.driver.download(file, Kraps.driver.bucket, temp_paths_index[file].path)
-      end
+      temp_paths = download_all(token: @args["frame"]["token"], partition: @args["partition"])
 
       current_step = step
 
@@ -85,14 +78,45 @@ module Kraps
       mapper.shuffle(chunk_limit: @chunk_limit) do |partitions|
         Parallelizer.each(partitions.to_a, @concurrency) do |partition, path|
           File.open(path) do |stream|
-            Kraps.driver.driver.
-
-
+            Kraps.driver.store(Kraps.driver.with_prefix("#{@args["token"]}/#{partition}/chunk.#{@args["part"]}.json"), stream)
+          end
+        end
+      end
+    ensure
+      temp_paths&.delete
+    end
+
+    def perform_map_partitions
+      temp_paths = download_all(token: @args["frame"]["token"], partition: @args["partition"])
+
+      current_step = step
+      current_partition = @args["partition"]
+
+      implementation = Object.new
+      implementation.define_singleton_method(:map) do |enum, &block|
+        current_step.block.call(current_partition, enum, block)
+      end
+
+      subsequent_step = next_step
+
+      if subsequent_step&.action == Actions::REDUCE
+        implementation.define_singleton_method(:reduce) do |key, value1, value2|
+          subsequent_step.block.call(key, value1, value2)
+        end
+      end
+
+      mapper = MapReduce::Mapper.new(implementation, partitioner: partitioner, memory_limit: @memory_limit)
+      mapper.map(k_way_merge(temp_paths.each.to_a, chunk_limit: @chunk_limit))
+
+      mapper.shuffle(chunk_limit: @chunk_limit) do |partitions|
+        Parallelizer.each(partitions.to_a, @concurrency) do |partition, path|
+          File.open(path) do |stream|
+            Kraps.driver.store(Kraps.driver.with_prefix("#{@args["token"]}/#{partition}/chunk.#{@args["part"]}.json"), stream)
           end
         end
       end
     ensure
-      temp_paths&.
+      temp_paths&.delete
     end
 
     def perform_reduce
@@ -105,8 +129,8 @@ module Kraps
 
       reducer = MapReduce::Reducer.new(implementation)
 
-      Parallelizer.each(Kraps.driver.
-      Kraps.driver.
+      Parallelizer.each(Kraps.driver.list(prefix: Kraps.driver.with_prefix("#{@args["frame"]["token"]}/#{@args["partition"]}/")), @concurrency) do |file|
+        Kraps.driver.download(file, reducer.add_chunk)
       end
 
       tempfile = Tempfile.new
@@ -115,35 +139,96 @@ module Kraps
         tempfile.puts(JSON.generate([key, value]))
       end
 
-      Kraps.driver.
+      Kraps.driver.store(Kraps.driver.with_prefix("#{@args["token"]}/#{@args["partition"]}/chunk.#{@args["part"]}.json"), tempfile.tap(&:rewind))
     ensure
       tempfile&.close(true)
     end
 
+    def perform_combine
+      temp_paths1 = download_all(token: @args["frame"]["token"], partition: @args["partition"])
+      temp_paths2 = download_all(token: @args["combine_frame"]["token"], partition: @args["partition"])
+
+      enum1 = k_way_merge(temp_paths1.each.to_a, chunk_limit: @chunk_limit)
+      enum2 = k_way_merge(temp_paths2.each.to_a, chunk_limit: @chunk_limit)
+
+      combine_method = method(:combine)
+      current_step = step
+
+      implementation = Object.new
+      implementation.define_singleton_method(:map) do |&block|
+        combine_method.call(enum1, enum2) do |key, value1, value2|
+          block.call(key, current_step.block.call(key, value1, value2))
+        end
+      end
+
+      mapper = MapReduce::Mapper.new(implementation, partitioner: partitioner, memory_limit: @memory_limit)
+      mapper.map
+
+      mapper.shuffle(chunk_limit: @chunk_limit) do |partitions|
+        Parallelizer.each(partitions.to_a, @concurrency) do |partition, path|
+          File.open(path) do |stream|
+            Kraps.driver.store(Kraps.driver.with_prefix("#{@args["token"]}/#{partition}/chunk.#{@args["part"]}.json"), stream)
+          end
+        end
+      end
+    ensure
+      temp_paths1&.delete
+      temp_paths2&.delete
+    end
+
+    def combine(enum1, enum2)
+      current1 = begin; enum1.next; rescue StopIteration; nil; end
+      current2 = begin; enum2.next; rescue StopIteration; nil; end
+
+      loop do
+        return if current1.nil? && current2.nil?
+        return if current1.nil?
+
+        if current2.nil?
+          yield(current1[0], current1[1], nil)
+
+          current1 = begin; enum1.next; rescue StopIteration; nil; end
+        elsif current1[0] == current2[0]
+          loop do
+            yield(current1[0], current1[1], current2[1])
+
+            current1 = begin; enum1.next; rescue StopIteration; nil; end
+
+            break if current1.nil?
+            break if current1[0] != current2[0]
+          end
+
+          current2 = begin; enum2.next; rescue StopIteration; nil; end
+        else
+          res = current1[0] <=> current2[0]
+
+          if res < 0
+            yield(current1[0], current1[1], nil)
+
+            current1 = begin; enum1.next; rescue StopIteration; nil; end
+          else
+            current2 = begin; enum2.next; rescue StopIteration; nil; end
+          end
+        end
+      end
+    end
+
     def perform_each_partition
       temp_paths = TempPaths.new
 
-      files = Kraps.driver.
+      files = Kraps.driver.list(prefix: Kraps.driver.with_prefix("#{@args["frame"]["token"]}/#{@args["partition"]}/")).sort
 
       temp_paths_index = files.each_with_object({}) do |file, hash|
        hash[file] = temp_paths.add
       end
 
       Parallelizer.each(files, @concurrency) do |file|
-        Kraps.driver.
+        Kraps.driver.download(file, temp_paths_index[file].path)
      end
 
-
-      File.open(temp_path.path) do |stream|
-        stream.each_line do |line|
-          yielder << JSON.parse(line)
-        end
-      end
-      end
-
-      step.block.call(@args["partition"], enum)
+      step.block.call(@args["partition"], k_way_merge(temp_paths.each.to_a, chunk_limit: @chunk_limit))
     ensure
-      temp_paths&.
+      temp_paths&.delete
     end
 
     def with_retries(num_retries)
@@ -154,12 +239,14 @@ module Kraps
     rescue Kraps::Error
       distributed_job.stop
       raise
-    rescue StandardError
+    rescue StandardError => e
       if retries >= num_retries
         distributed_job.stop
         raise
       end
 
+      @logger.error(e)
+
       sleep(5)
       retries += 1
 
@@ -167,8 +254,24 @@ module Kraps
       end
     end
 
+    def download_all(token:, partition:)
+      temp_paths = TempPaths.new
+
+      files = Kraps.driver.list(prefix: Kraps.driver.with_prefix("#{token}/#{partition}/")).sort
+
+      temp_paths_index = files.each_with_object({}) do |file, hash|
+        hash[file] = temp_paths.add
+      end
+
+      Parallelizer.each(files, @concurrency) do |file|
+        Kraps.driver.download(file, temp_paths_index[file].path)
+      end
+
+      temp_paths
+    end
+
     def jobs
-      @jobs ||=
+      @jobs ||= JobResolver.new.call(@args["klass"].constantize.new.call(*@args["args"], **@args["kwargs"].transform_keys(&:to_sym)))
     end
 
     def job
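
The new `Worker#combine` above is a classic merge join over two key-sorted streams: matching keys yield both values, keys only in the first stream yield `nil` as the second value, and keys only in the second stream are dropped. A standalone behavior sketch, assuming plain array enumerators in place of the k-way-merged chunk streams and reaching the method via `allocate`/`send` purely for illustration:

```ruby
enum1 = [["a", 1], ["b", 2], ["b", 3], ["d", 4]].each
enum2 = [["b", 10], ["c", 20]].each

worker = Kraps::Worker.allocate # skip initialize; we only need #combine here
worker.send(:combine, enum1, enum2) { |key, v1, v2| p [key, v1, v2] }
# Prints:
#   ["a", 1, nil]   key only in enum1 => nil for the second value
#   ["b", 2, 10]    duplicate keys in enum1 each see the single reduced value
#   ["b", 3, 10]
#   ["d", 4, nil]
# "c" never appears: keys only present in enum2 are omitted.
```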
data/lib/kraps.rb
CHANGED
@@ -1,3 +1,9 @@
+require "distributed_job"
+require "ruby-progressbar"
+require "ruby-progressbar/outputs/null"
+require "map_reduce"
+require "redis"
+
 require_relative "kraps/version"
 require_relative "kraps/drivers"
 require_relative "kraps/actions"
@@ -8,21 +14,18 @@ require_relative "kraps/temp_paths"
 require_relative "kraps/timeout_queue"
 require_relative "kraps/interval"
 require_relative "kraps/job"
+require_relative "kraps/job_resolver"
 require_relative "kraps/runner"
 require_relative "kraps/step"
 require_relative "kraps/frame"
 require_relative "kraps/worker"
-require "distributed_job"
-require "ruby-progressbar"
-require "ruby-progressbar/outputs/null"
-require "map_reduce"
-require "redis"
 
 module Kraps
   class Error < StandardError; end
   class InvalidAction < Error; end
   class InvalidStep < Error; end
   class JobStopped < Error; end
+  class IncompatibleFrame < Error; end
 
   def self.configure(driver:, redis: Redis.new, namespace: nil, job_ttl: 24 * 60 * 60, show_progress: true, enqueuer: ->(worker, json) { worker.perform_async(json) })
     @driver = driver
metadata
CHANGED
@@ -1,14 +1,14 @@
 --- !ruby/object:Gem::Specification
 name: kraps
 version: !ruby/object:Gem::Version
-  version: 0.
+  version: 0.6.0
 platform: ruby
 authors:
 - Benjamin Vetter
 autorequire:
 bindir: exe
 cert_chain: []
-date: 2022-11-
+date: 2022-11-16 00:00:00.000000000 Z
 dependencies:
 - !ruby/object:Gem::Dependency
   name: attachie
@@ -147,6 +147,7 @@ files:
 - lib/kraps/hash_partitioner.rb
 - lib/kraps/interval.rb
 - lib/kraps/job.rb
+- lib/kraps/job_resolver.rb
 - lib/kraps/parallelizer.rb
 - lib/kraps/runner.rb
 - lib/kraps/step.rb