kraps 0.7.0 → 0.9.0

checksums.yaml CHANGED
@@ -1,7 +1,7 @@
  ---
  SHA256:
-   metadata.gz: 19635ced3e745d44313ed3bc416ef73eb555134c591725f4dab7b38208e21393
-   data.tar.gz: bb7f679c7e2cd053744d1c7d857629de1912a305880d0538bdab418de3861ba1
+   metadata.gz: 8b24f67ff2122dc82d372eb421c379ac42f415674999958f42f71a2cdbee1a33
+   data.tar.gz: d8297482eb38a30cb8ff6a7761a544833e3795a8f3a045343f6a10c370d886ca
  SHA512:
-   metadata.gz: 91273ba54ea33c6d5cb1b4f335ad8039c35601953fdf1e6b9b2ac3117ceb25d81e9be569e5c8b70deda22b53a4c72f04d60ffb4b251badf5c5a64d13d399f36c
-   data.tar.gz: 4e563257fcba0c9f457b363da4f43d000ee239a5179e2f965071bb3df27e362cdf7bc9950a1624520cb862acfb0fde89e96b0a4987d79a93524230c4b84619cd
+   metadata.gz: 5543d1a8af8fa12007d38f00d9aa515eb1edf254d8bbc2aa8c133c0101dbe7ccaef5f0330929b6b84456a81eec7e5744091deace2d16ed2826f14cd56432db8f
+   data.tar.gz: 14992d608157562da3af98207681a66cf0a9a9566861fe79c949ccdec9db6cb344bfb070147fc820eac70b076c15db373896f80b5bd8d5f8714f9ce75d7eb7c8
data/.rubocop.yml CHANGED
@@ -16,6 +16,12 @@ Lint/UnreachableLoop:
  Metrics/BlockLength:
    Enabled: false
 
+ Style/HashEachMethods:
+   Enabled: false
+
+ Style/ZeroLengthPredicate:
+   Enabled: false
+
  Gemspec/RequiredRubyVersion:
    Enabled: false
 
data/CHANGELOG.md CHANGED
@@ -1,5 +1,41 @@
  # CHANGELOG
 
+ ## v0.9.0
+
+ * Arguments are no longer passed to the `call` method, but to the
+   initializer instead
+
+ Before:
+
+ ```ruby
+ class MyJob
+   def call(arg1, arg2)
+     # ...
+   end
+ end
+ ```
+
+ After:
+
+ ```ruby
+ class MyJob
+   def initialize(arg1, arg2)
+     @arg1 = arg1
+     @arg2 = arg2
+   end
+
+   def call
+     # ...
+   end
+ end
+ ```
+
+ ## v0.8.0
+
+ * Use number of partitions of previous step for `jobs` option by default
+ * Changed `combine` to receive a `collector`
+ * Added mandatory `concurrency` argument to `load`
+
  ## v0.7.0
 
  * Added a `jobs` option to the actions to limit the concurrency
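Note on the v0.9.0 entry: a minimal sketch of what the change means for job classes. The `run_job` helper and `DateRangeJob` class are hypothetical stand-ins, not part of Kraps; the helper only mirrors the new calling convention, where arguments go to `new` and `call` takes none.

```ruby
# Hypothetical stand-in for the runner: arguments are passed to the
# initializer, and `call` is invoked without any arguments.
def run_job(klass, *args, **kwargs)
  klass.new(*args, **kwargs).call
end

# Hypothetical job class written in the v0.9.0 style.
class DateRangeJob
  def initialize(start_date:, end_date:)
    @start_date = start_date
    @end_date = end_date
  end

  def call
    "#{@start_date}..#{@end_date}"
  end
end

result = run_job(DateRangeJob, start_date: "2018-01-01", end_date: "2018-01-03")
```

A class that still defines `call(arg1, arg2)` and no matching `initialize` would raise an `ArgumentError` under this convention, which is why the changelog lists the change as breaking.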
data/Gemfile CHANGED
@@ -3,6 +3,7 @@ source "https://rubygems.org"
  # Specify your gem's dependencies in kraps.gemspec
  gemspec
 
- gem "rake", "~> 13.0"
-
- gem "rspec", "~> 3.0"
+ gem "bundler"
+ gem "rake"
+ gem "rspec"
+ gem "rubocop"
data/README.md CHANGED
@@ -38,16 +38,21 @@ Kraps.configure(
 
  Afterwards, create a job class, which tells Kraps what your job should do.
  Therefore, you create some class with a `call` method, and optionally some
- arguments. Let's create a simple job, which reads search log files to analyze
- how often search queries have been searched:
+ arguments passed to its initializer. Let's create a simple job, which reads
+ search log files to analyze how often search queries have been searched:
 
  ```ruby
  class SearchLogCounter
-   def call(start_date:, end_date:)
+   def initialize(start_date:, end_date:)
+     @start_date = start_date
+     @end_date = end_date
+   end
+
+   def call
      job = Kraps::Job.new(worker: MyKrapsWorker)
 
      job = job.parallelize(partitions: 128) do |collector|
-       (Date.parse(start_date)..Date.parse(end_date)).each do |date|
+       (Date.parse(@start_date)..Date.parse(@end_date)).each do |date|
          collector.call(date.to_s)
        end
      end
@@ -214,6 +219,10 @@ job.parallelize(partitions: 128, partitioner: partitioner, worker: MyKrapsWorker
  end
  ```
 
+ Please note, that `parallelize` itself is not parallelized but rather
+ parallelizes the data you feed into Kraps within `parallelize` by splitting it
+ into the number of `partitions` specified.
+
  The block must use the collector to feed Kraps with individual items. The
  items are used as keys and the values are set to `nil`.
 
@@ -232,10 +241,11 @@ return the new key-value pair, but the `collector` must be used instead.
  The `jobs` argument can be useful when you need to access an external data
  source, like a relational database and you want to limit the number of workers
  accessing the store concurrently to avoid overloading it. If you don't specify
- it, it will be identical to the number of partitions. It is recommended to only
- use it for steps where you need to throttle the concurrency, because it will of
- course slow down the processing. The `jobs` argument only applies to the
- current step. The following steps don't inherit the argument, but reset it.
+ it, it will be identical to the number of partitions of the previous step. It
+ is recommended to only use it for steps where you need to throttle the
+ concurrency, because it will of course slow down the processing. The `jobs`
+ argument only applies to the current step. The following steps don't inherit
+ the argument, but reset it.
 
  * `map_partitions`: Maps the key value pairs to other key value pairs, but the
  block receives all data of each partition as an enumerable and sorted by key.
@@ -273,8 +283,8 @@ most of the time, this is not neccessary and the key can simply be ignored.
  passed job result are completely omitted.
 
  ```ruby
- job.combine(other_job, worker: MyKrapsWorker, jobs: 8) do |key, value1, value2|
-   (value1 || {}).merge(value2 || {})
+ job.combine(other_job, worker: MyKrapsWorker, jobs: 8) do |key, value1, value2, collector|
+   collector.call(key, (value1 || {}).merge(value2 || {}))
  end
  ```
 
@@ -316,10 +326,12 @@ It creates a folder for every partition and stores one or more chunks in there.
  * `load`: Loads the previously dumped data
 
  ```ruby
- job.load(prefix: "path/to/dump", partitions: 32, partitioner: Kraps::HashPartitioner.new, worker: MyKrapsWorker)
+ job.load(prefix: "path/to/dump", partitions: 32, concurrency: 8, partitioner: Kraps::HashPartitioner.new, worker: MyKrapsWorker)
  ```
 
- The number of partitions and the partitioner must be specified.
+ The number of partitions, the partitioner and concurrency must be specified.
+ The concurrency specifies the number of threads used for downloading chunks in
+ parallel.
 
  Please note that every API method accepts a `before` callable:
 
@@ -342,11 +354,16 @@ of searches made:
 
  ```ruby
  class SearchLogCounter
-   def call(start_date:, end_date:)
+   def initialize(start_date:, end_date:)
+     @start_date = start_date
+     @end_date = end_date
+   end
+
+   def call
      count_job = Kraps::Job.new(worker: SomeBackgroundWorker)
 
      count_job = count_job.parallelize(partitions: 128) do |collector|
-       (Date.parse(start_date)..Date.parse(end_date)).each do |date|
+       (Date.parse(@start_date)..Date.parse(@end_date)).each do |date|
          collector.call(date.to_s)
        end
      end
data/lib/kraps/downloader.rb ADDED
@@ -0,0 +1,19 @@
+ module Kraps
+   class Downloader
+     def self.download_all(prefix:, concurrency:)
+       temp_paths = TempPaths.new
+
+       files = Kraps.driver.list(prefix: prefix).sort
+
+       temp_paths_index = files.each_with_object({}) do |file, hash|
+         hash[file] = temp_paths.add
+       end
+
+       Parallelizer.each(files, concurrency) do |file|
+         Kraps.driver.download(file, temp_paths_index[file].path)
+       end
+
+       temp_paths
+     end
+   end
+ end
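Note on the extracted `Downloader`: the pattern is to allocate one temp path per remote file up front, in sorted order, and then fill them concurrently, so the result order does not depend on which thread finishes first. A minimal sketch of that pattern using only the standard library follows. `each_in_parallel` is a hypothetical stand-in for `Parallelizer.each`, `Tempfile` stands in for `TempPaths`, and the `download` callable stands in for `Kraps.driver.download`; none of this is the Kraps implementation.

```ruby
require "tempfile"

# Hypothetical stand-in for Parallelizer.each: runs the block for every item
# using at most `concurrency` threads.
def each_in_parallel(items, concurrency)
  queue = Queue.new
  items.each { |item| queue << item }
  queue.close # pop returns nil once the closed queue is drained

  workers = Array.new([concurrency, 1].max) do
    Thread.new do
      while (item = queue.pop)
        yield item
      end
    end
  end

  workers.each(&:join)
end

# `download` is an assumed callable taking (file, local_path).
def download_all(files, concurrency:, download:)
  # Allocate the temp files first, in sorted order, so the returned list is
  # deterministic regardless of thread scheduling.
  index = files.sort.each_with_object({}) { |file, hash| hash[file] = Tempfile.new }

  each_in_parallel(index.keys, concurrency) do |file|
    download.call(file, index[file].path)
  end

  index.values
end

remote = { "b.json" => "2", "a.json" => "1", "c.json" => "3" }
fake_download = ->(file, path) { File.write(path, remote.fetch(file)) }

tempfiles = download_all(remote.keys, concurrency: 2, download: fake_download)
contents = tempfiles.map { |tempfile| File.read(tempfile.path) }
tempfiles.each(&:close!)
```

Since both the worker and `load` now go through the same class method, the `concurrency` value has to be passed in explicitly instead of being read from `@concurrency`.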
data/lib/kraps/job.rb CHANGED
@@ -30,12 +30,14 @@ module Kraps
      def map(partitions: nil, partitioner: nil, jobs: nil, worker: @worker, before: nil, &block)
        fresh.tap do |job|
          job.instance_eval do
+           jobs = [jobs, @partitions].compact.min
+
            @partitions = partitions if partitions
            @partitioner = partitioner if partitioner
 
            @steps << Step.new(
              action: Actions::MAP,
-             jobs: [jobs, @partitions].compact.min,
+             jobs: jobs,
              partitions: @partitions,
              partitioner: @partitioner,
              worker: worker,
@@ -49,12 +51,14 @@ module Kraps
      def map_partitions(partitions: nil, partitioner: nil, jobs: nil, worker: @worker, before: nil, &block)
        fresh.tap do |job|
          job.instance_eval do
+           jobs = [jobs, @partitions].compact.min
+
            @partitions = partitions if partitions
            @partitioner = partitioner if partitioner
 
            @steps << Step.new(
              action: Actions::MAP_PARTITIONS,
-             jobs: [jobs, @partitions].compact.min,
+             jobs: jobs,
              partitions: @partitions,
              partitioner: @partitioner,
              worker: worker,
@@ -135,7 +139,7 @@ module Kraps
        end
      end
 
-     def load(prefix:, partitions:, partitioner:, worker: @worker)
+     def load(prefix:, partitions:, partitioner:, concurrency:, worker: @worker)
        job = parallelize(partitions: partitions, partitioner: proc { |key, _| key }, worker: worker) do |collector|
          (0...partitions).each do |partition|
            collector.call(partition)
@@ -143,20 +147,19 @@ module Kraps
        end
 
        job.map_partitions(partitioner: partitioner, worker: worker) do |partition, _, collector|
-         tempfile = Tempfile.new
-
-         path = File.join(prefix, partition.to_s, "chunk.json")
-         next unless Kraps.driver.exists?(path)
+         temp_paths = Downloader.download_all(prefix: File.join(prefix, partition.to_s, "/"), concurrency: concurrency)
 
-         Kraps.driver.download(path, tempfile.path)
+         temp_paths.each do |temp_path|
+           File.open(temp_path.path) do |stream|
+             stream.each_line do |line|
+               key, value = JSON.parse(line)
 
-         tempfile.each_line do |line|
-           key, value = JSON.parse(line)
-
-           collector.call(key, value)
+               collector.call(key, value)
+             end
+           end
          end
        ensure
-         tempfile&.close(true)
+         temp_paths&.delete
        end
      end
 
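Note on the `jobs` default in `map` and `map_partitions`: the cap is now computed before `@partitions` is reassigned, so it reflects the partition count of the previous step. The following sketch isolates that ordering. `MiniJob` is a hypothetical miniature that mutates itself for brevity, whereas the real `Job` works on a fresh copy; only the `[jobs, @partitions].compact.min` expression is taken from the diff.

```ruby
# Hypothetical miniature of the ordering change; not the Kraps Job class.
class MiniJob
  attr_reader :steps

  def initialize(partitions:)
    @partitions = partitions
    @steps = []
  end

  def map(partitions: nil, jobs: nil)
    # Evaluated first, so @partitions still holds the previous step's value.
    jobs = [jobs, @partitions].compact.min

    @partitions = partitions if partitions
    @steps << { jobs: jobs, partitions: @partitions }
    self
  end
end

job = MiniJob.new(partitions: 128)
job.map(partitions: 8) # no jobs given: capped by the previous 128 partitions
job.map(jobs: 4)       # explicit value below the previous 8 partitions is kept
job.map(jobs: 64)      # explicit value above the previous 8 partitions is capped

job_counts = job.steps.map { |step| step[:jobs] }
```

With the 0.7.0 ordering, the first call would have evaluated `[nil, 8].compact.min` and been limited to 8 jobs, even though the step still reads 128 input partitions.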
data/lib/kraps/runner.rb CHANGED
@@ -5,7 +5,7 @@ module Kraps
      end
 
      def call(*args, **kwargs)
-       JobResolver.new.call(@klass.new.call(*args, **kwargs)).tap do |jobs|
+       JobResolver.new.call(@klass.new(*args, **kwargs).call).tap do |jobs|
          jobs.each_with_index do |job, job_index|
            job.steps.each_with_index.inject(nil) do |frame, (_, step_index)|
              StepRunner.new(
@@ -100,7 +100,7 @@ module Kraps
 
        def push_and_wait(enum:, job_count: nil)
          redis_queue = RedisQueue.new(redis: Kraps.redis, token: SecureRandom.hex, namespace: Kraps.namespace, ttl: Kraps.job_ttl)
-         progress_bar = build_progress_bar("#{@klass}: job #{@job_index + 1}/#{@jobs.size}, step #{@step_index + 1}/#{@job.steps.size}, token #{redis_queue.token}, %a, %c/%C (%p%) => #{@step.action}")
+         progress_bar = build_progress_bar("#{@klass}: job #{@job_index + 1}/#{@jobs.size}, step #{@step_index + 1}/#{@job.steps.size}, #{@step.jobs || "?"} jobs, token #{redis_queue.token}, %a, %c/%C (%p%) => #{@step.action}")
 
          total = 0
 
data/lib/kraps/version.rb CHANGED
@@ -1,3 +1,3 @@
  module Kraps
-   VERSION = "0.7.0"
+   VERSION = "0.9.0"
  end
data/lib/kraps/worker.rb CHANGED
@@ -157,7 +157,7 @@ module Kraps
      implementation = Object.new
      implementation.define_singleton_method(:map) do |&block|
        combine_method.call(enum1, enum2) do |key, value1, value2|
-         block.call(key, current_step.block.call(key, value1, value2))
+         current_step.block.call(key, value1, value2, block)
        end
      end
 
@@ -270,23 +270,11 @@ module Kraps
      end
 
      def download_all(token:, partition:)
-       temp_paths = TempPaths.new
-
-       files = Kraps.driver.list(prefix: Kraps.driver.with_prefix("#{token}/#{partition}/")).sort
-
-       temp_paths_index = files.each_with_object({}) do |file, hash|
-         hash[file] = temp_paths.add
-       end
-
-       Parallelizer.each(files, @concurrency) do |file|
-         Kraps.driver.download(file, temp_paths_index[file].path)
-       end
-
-       temp_paths
+       Downloader.download_all(prefix: Kraps.driver.with_prefix("#{token}/#{partition}/"), concurrency: @concurrency)
      end
 
      def jobs
-       @jobs ||= JobResolver.new.call(@args["klass"].constantize.new.call(*@args["args"], **@args["kwargs"].transform_keys(&:to_sym)))
+       @jobs ||= JobResolver.new.call(@args["klass"].constantize.new(*@args["args"], **@args["kwargs"].transform_keys(&:to_sym)).call)
      end
 
      def job
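Note on the `combine` change in the worker: the step block now receives the collector as a fourth argument instead of having its return value collected, so it can emit zero, one or several pairs per key. A minimal sketch of that calling convention follows. The `combine` helper here is a hypothetical in-memory stand-in that walks the keys of the first hash only; it is not the Kraps implementation.

```ruby
# Hypothetical in-memory stand-in for a combine step over two hashes.
def combine(left, right, &block)
  results = []
  collector = ->(key, value) { results << [key, value] }

  left.each do |key, value1|
    # The block decides what to emit; its return value is ignored.
    block.call(key, value1, right[key], collector)
  end

  results
end

searches = { "ruby" => { searches: 3 }, "rust" => { searches: 1 } }
clicks = { "ruby" => { clicks: 2 } }

combined = combine(searches, clicks) do |key, value1, value2, collector|
  next if value2.nil? # emit nothing for keys missing from the other side

  collector.call(key, value1.merge(value2))
end
```

Under the 0.7.0 convention the block's return value was always collected, so skipping a key in this way was not possible from within the block.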
data/lib/kraps.rb CHANGED
@@ -19,6 +19,7 @@ require_relative "kraps/runner"
  require_relative "kraps/step"
  require_relative "kraps/frame"
  require_relative "kraps/worker"
+ require_relative "kraps/downloader"
 
  module Kraps
    class Error < StandardError; end
metadata CHANGED
@@ -1,14 +1,14 @@
  --- !ruby/object:Gem::Specification
  name: kraps
  version: !ruby/object:Gem::Version
-   version: 0.7.0
+   version: 0.9.0
  platform: ruby
  authors:
  - Benjamin Vetter
  autorequire:
  bindir: exe
  cert_chain: []
- date: 2022-12-02 00:00:00.000000000 Z
+ date: 2024-03-13 00:00:00.000000000 Z
  dependencies:
  - !ruby/object:Gem::Dependency
    name: attachie
@@ -66,48 +66,6 @@ dependencies:
      - - ">="
        - !ruby/object:Gem::Version
          version: '0'
- - !ruby/object:Gem::Dependency
-   name: bundler
-   requirement: !ruby/object:Gem::Requirement
-     requirements:
-     - - ">="
-       - !ruby/object:Gem::Version
-         version: '0'
-   type: :development
-   prerelease: false
-   version_requirements: !ruby/object:Gem::Requirement
-     requirements:
-     - - ">="
-       - !ruby/object:Gem::Version
-         version: '0'
- - !ruby/object:Gem::Dependency
-   name: rspec
-   requirement: !ruby/object:Gem::Requirement
-     requirements:
-     - - ">="
-       - !ruby/object:Gem::Version
-         version: '0'
-   type: :development
-   prerelease: false
-   version_requirements: !ruby/object:Gem::Requirement
-     requirements:
-     - - ">="
-       - !ruby/object:Gem::Version
-         version: '0'
- - !ruby/object:Gem::Dependency
-   name: rubocop
-   requirement: !ruby/object:Gem::Requirement
-     requirements:
-     - - ">="
-       - !ruby/object:Gem::Version
-         version: '0'
-   type: :development
-   prerelease: false
-   version_requirements: !ruby/object:Gem::Requirement
-     requirements:
-     - - ">="
-       - !ruby/object:Gem::Version
-         version: '0'
  description: Kraps allows to process and perform calculations on very large datasets
    in parallel
  email:
@@ -121,13 +79,13 @@ files:
  - CHANGELOG.md
  - CODE_OF_CONDUCT.md
  - Gemfile
- - Gemfile.lock
  - LICENSE.txt
  - README.md
  - Rakefile
  - docker-compose.yml
  - lib/kraps.rb
  - lib/kraps/actions.rb
+ - lib/kraps/downloader.rb
  - lib/kraps/drivers.rb
  - lib/kraps/frame.rb
  - lib/kraps/hash_partitioner.rb
data/Gemfile.lock DELETED
@@ -1,108 +0,0 @@
- PATH
-   remote: .
-   specs:
-     kraps (0.7.0)
-       attachie
-       map-reduce-ruby (>= 3.0.0)
-       redis
-       ruby-progressbar
-
- GEM
-   remote: https://rubygems.org/
-   specs:
-     activesupport (7.0.4)
-       concurrent-ruby (~> 1.0, >= 1.0.2)
-       i18n (>= 1.6, < 2)
-       minitest (>= 5.1)
-       tzinfo (~> 2.0)
-     ast (2.4.2)
-     attachie (1.2.0)
-       activesupport
-       aws-sdk-s3
-       connection_pool
-       mime-types
-     aws-eventstream (1.2.0)
-     aws-partitions (1.657.0)
-     aws-sdk-core (3.166.0)
-       aws-eventstream (~> 1, >= 1.0.2)
-       aws-partitions (~> 1, >= 1.651.0)
-       aws-sigv4 (~> 1.5)
-       jmespath (~> 1, >= 1.6.1)
-     aws-sdk-kms (1.59.0)
-       aws-sdk-core (~> 3, >= 3.165.0)
-       aws-sigv4 (~> 1.1)
-     aws-sdk-s3 (1.117.1)
-       aws-sdk-core (~> 3, >= 3.165.0)
-       aws-sdk-kms (~> 1)
-       aws-sigv4 (~> 1.4)
-     aws-sigv4 (1.5.2)
-       aws-eventstream (~> 1, >= 1.0.2)
-     concurrent-ruby (1.1.10)
-     connection_pool (2.3.0)
-     diff-lcs (1.5.0)
-     i18n (1.12.0)
-       concurrent-ruby (~> 1.0)
-     jmespath (1.6.1)
-     json (2.6.2)
-     lazy_priority_queue (0.1.1)
-     map-reduce-ruby (3.0.0)
-       json
-       lazy_priority_queue
-     mime-types (3.4.1)
-       mime-types-data (~> 3.2015)
-     mime-types-data (3.2022.0105)
-     minitest (5.16.3)
-     parallel (1.22.1)
-     parser (3.1.2.1)
-       ast (~> 2.4.1)
-     rainbow (3.1.1)
-     rake (13.0.6)
-     redis (5.0.5)
-       redis-client (>= 0.9.0)
-     redis-client (0.11.2)
-       connection_pool
-     regexp_parser (2.6.0)
-     rexml (3.2.5)
-     rspec (3.12.0)
-       rspec-core (~> 3.12.0)
-       rspec-expectations (~> 3.12.0)
-       rspec-mocks (~> 3.12.0)
-     rspec-core (3.12.0)
-       rspec-support (~> 3.12.0)
-     rspec-expectations (3.12.0)
-       diff-lcs (>= 1.2.0, < 2.0)
-       rspec-support (~> 3.12.0)
-     rspec-mocks (3.12.0)
-       diff-lcs (>= 1.2.0, < 2.0)
-       rspec-support (~> 3.12.0)
-     rspec-support (3.12.0)
-     rubocop (1.38.0)
-       json (~> 2.3)
-       parallel (~> 1.10)
-       parser (>= 3.1.2.1)
-       rainbow (>= 2.2.2, < 4.0)
-       regexp_parser (>= 1.8, < 3.0)
-       rexml (>= 3.2.5, < 4.0)
-       rubocop-ast (>= 1.23.0, < 2.0)
-       ruby-progressbar (~> 1.7)
-       unicode-display_width (>= 1.4.0, < 3.0)
-     rubocop-ast (1.23.0)
-       parser (>= 3.1.1.0)
-     ruby-progressbar (1.11.0)
-     tzinfo (2.0.5)
-       concurrent-ruby (~> 1.0)
-     unicode-display_width (2.3.0)
-
- PLATFORMS
-   ruby
-   x86_64-linux
-
- DEPENDENCIES
-   bundler
-   kraps!
-   rake (~> 13.0)
-   rspec (~> 3.0)
-   rubocop
-
- BUNDLED WITH
-    2.3.24