kraps 0.7.0 → 0.9.0

checksums.yaml CHANGED
@@ -1,7 +1,7 @@
1
1
  ---
2
2
  SHA256:
3
- metadata.gz: 19635ced3e745d44313ed3bc416ef73eb555134c591725f4dab7b38208e21393
4
- data.tar.gz: bb7f679c7e2cd053744d1c7d857629de1912a305880d0538bdab418de3861ba1
3
+ metadata.gz: 8b24f67ff2122dc82d372eb421c379ac42f415674999958f42f71a2cdbee1a33
4
+ data.tar.gz: d8297482eb38a30cb8ff6a7761a544833e3795a8f3a045343f6a10c370d886ca
5
5
  SHA512:
6
- metadata.gz: 91273ba54ea33c6d5cb1b4f335ad8039c35601953fdf1e6b9b2ac3117ceb25d81e9be569e5c8b70deda22b53a4c72f04d60ffb4b251badf5c5a64d13d399f36c
7
- data.tar.gz: 4e563257fcba0c9f457b363da4f43d000ee239a5179e2f965071bb3df27e362cdf7bc9950a1624520cb862acfb0fde89e96b0a4987d79a93524230c4b84619cd
6
+ metadata.gz: 5543d1a8af8fa12007d38f00d9aa515eb1edf254d8bbc2aa8c133c0101dbe7ccaef5f0330929b6b84456a81eec7e5744091deace2d16ed2826f14cd56432db8f
7
+ data.tar.gz: 14992d608157562da3af98207681a66cf0a9a9566861fe79c949ccdec9db6cb344bfb070147fc820eac70b076c15db373896f80b5bd8d5f8714f9ce75d7eb7c8
data/.rubocop.yml CHANGED
@@ -16,6 +16,12 @@ Lint/UnreachableLoop:
16
16
  Metrics/BlockLength:
17
17
  Enabled: false
18
18
 
19
+ Style/HashEachMethods:
20
+ Enabled: false
21
+
22
+ Style/ZeroLengthPredicate:
23
+ Enabled: false
24
+
19
25
  Gemspec/RequiredRubyVersion:
20
26
  Enabled: false
21
27
 
data/CHANGELOG.md CHANGED
@@ -1,5 +1,41 @@
1
1
  # CHANGELOG
2
2
 
3
+ ## v0.9.0
4
+
5
+ * Arguments are no longer passed to the `call` method, but to the
6
+ initializer instead
7
+
8
+ Before:
9
+
10
+ ```ruby
11
+ class MyJob
12
+ def call(arg1, arg2)
13
+ # ...
14
+ end
15
+ end
16
+ ```
17
+
18
+ After:
19
+
20
+ ```ruby
21
+ class MyJob
22
+ def initialize(arg1, arg2)
23
+ @arg1 = arg1
24
+ @arg2 = arg2
25
+ end
26
+
27
+ def call
28
+ # ...
29
+ end
30
+ end
31
+ ```
32
+
33
+ ## v0.8.0
34
+
35
+ * Use number of partitions of previous step for `jobs` option by default
36
+ * Changed `combine` to receive a `collector`
37
+ * Added mandatory `concurrency` argument to `load`
38
+
3
39
  ## v0.7.0
4
40
 
5
41
  * Added a `jobs` option to the actions to limit the concurrency
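
The v0.9.0 entry changes where job arguments arrive. The following is a minimal sketch of the new calling convention. `invoke` is a hypothetical stand-in (not part of Kraps) that mimics the `klass.new(*args, **kwargs).call` pattern visible in the runner and worker diffs, and both job classes are illustrative only:

```ruby
# Hypothetical job class following the v0.9.0 convention: arguments go to
# the initializer and `call` takes none.
class MyJob
  def initialize(arg1, arg2)
    @arg1 = arg1
    @arg2 = arg2
  end

  def call
    # A real job would build and return Kraps::Job objects here. This
    # stand-in only returns the stored arguments.
    [@arg1, @arg2]
  end
end

# Hypothetical job class taking keyword arguments instead.
class MyKeywordJob
  def initialize(start_date:, end_date:)
    @start_date = start_date
    @end_date = end_date
  end

  def call
    "#{@start_date}..#{@end_date}"
  end
end

# Stand-in for the invocation pattern: the arguments are handed to `new`
# and `call` is invoked without any.
def invoke(klass, *args, **kwargs)
  klass.new(*args, **kwargs).call
end

positional_result = invoke(MyJob, "a", "b")
keyword_result = invoke(MyKeywordJob, start_date: "2024-01-01", end_date: "2024-01-31")
```

Jobs written for 0.7.0 or 0.8.0 that still declare `call(arg1, arg2)` would raise an `ArgumentError` under this pattern, since `new` forwards the arguments to an `initialize` that does not accept them.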
data/Gemfile CHANGED
@@ -3,6 +3,7 @@ source "https://rubygems.org"
3
3
  # Specify your gem's dependencies in kraps.gemspec
4
4
  gemspec
5
5
 
6
- gem "rake", "~> 13.0"
7
-
8
- gem "rspec", "~> 3.0"
6
+ gem "bundler"
7
+ gem "rake"
8
+ gem "rspec"
9
+ gem "rubocop"
data/README.md CHANGED
@@ -38,16 +38,21 @@ Kraps.configure(
38
38
 
39
39
  Afterwards, create a job class, which tells Kraps what your job should do.
40
40
  Therefore, you create some class with a `call` method, and optionally some
41
- arguments. Let's create a simple job, which reads search log files to analyze
42
- how often search queries have been searched:
41
+ arguments passed to its initializer. Let's create a simple job, which reads
42
+ search log files to analyze how often search queries have been searched:
43
43
 
44
44
  ```ruby
45
45
  class SearchLogCounter
46
- def call(start_date:, end_date:)
46
+ def initialize(start_date:, end_date:)
47
+ @start_date = start_date
48
+ @end_date = end_date
49
+ end
50
+
51
+ def call
47
52
  job = Kraps::Job.new(worker: MyKrapsWorker)
48
53
 
49
54
  job = job.parallelize(partitions: 128) do |collector|
50
- (Date.parse(start_date)..Date.parse(end_date)).each do |date|
55
+ (Date.parse(@start_date)..Date.parse(@end_date)).each do |date|
51
56
  collector.call(date.to_s)
52
57
  end
53
58
  end
@@ -214,6 +219,10 @@ job.parallelize(partitions: 128, partitioner: partitioner, worker: MyKrapsWorker
214
219
  end
215
220
  ```
216
221
 
222
+ Please note that `parallelize` itself does not run in parallel. Instead, it
223
+ parallelizes the data you feed into Kraps within its block by splitting it
224
+ into the number of `partitions` specified.
225
+
217
226
  The block must use the collector to feed Kraps with individual items. The
218
227
  items are used as keys and the values are set to `nil`.
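
As a rough illustration of that note, the sketch below runs a single, sequential producer block and then splits the collected items into partitions. `collect_items` and the CRC32-based partitioner are assumptions made for this example. Kraps' own partitioner may hash keys differently. Only the `(key, num_partitions)` call signature is taken from the `proc { |key, _| key }` partitioner in the job.rb diff.

```ruby
require "date"
require "zlib"

# Hypothetical partitioner: maps a key to one of `num_partitions` buckets.
partitioner = proc { |key, num_partitions| Zlib.crc32(key.to_s) % num_partitions }

# Stand-in for the `parallelize` block: one sequential producer feeding
# items through a collector.
def collect_items
  items = []
  collector = proc { |item| items << item }
  yield collector
  items
end

items = collect_items do |collector|
  (Date.new(2024, 1, 1)..Date.new(2024, 1, 10)).each do |date|
    collector.call(date.to_s)
  end
end

# Only after collection are the items distributed over the partitions.
# The partitions are what later steps can process in parallel.
partitions = items.group_by { |item| partitioner.call(item, 4) }
```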
219
228
 
@@ -232,10 +241,11 @@ return the new key-value pair, but the `collector` must be used instead.
232
241
  The `jobs` argument can be useful when you need to access an external data
233
242
  source, like a relational database and you want to limit the number of workers
234
243
  accessing the store concurrently to avoid overloading it. If you don't specify
235
- it, it will be identical to the number of partitions. It is recommended to only
236
- use it for steps where you need to throttle the concurrency, because it will of
237
- course slow down the processing. The `jobs` argument only applies to the
238
- current step. The following steps don't inherit the argument, but reset it.
244
+ it, it will be identical to the number of partitions of the previous step. It
245
+ is recommended to only use it for steps where you need to throttle the
246
+ concurrency, because it will of course slow down the processing. The `jobs`
247
+ argument only applies to the current step. The following steps don't inherit
248
+ the argument, but reset it.
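
The default can be sketched with the expression `[jobs, @partitions].compact.min` from the job.rb diff, which since 0.8.0 is evaluated before `@partitions` is updated for the new step. `effective_jobs` is a hypothetical helper name used only for this example:

```ruby
# Mirrors `[jobs, @partitions].compact.min`, where the second argument is
# the partition count of the previous step.
def effective_jobs(jobs, previous_partitions)
  [jobs, previous_partitions].compact.min
end

previous_partitions = 128

default_jobs = effective_jobs(nil, previous_partitions)  # no `jobs` given => 128
throttled_jobs = effective_jobs(8, previous_partitions)  # explicit throttle => 8
capped_jobs = effective_jobs(256, previous_partitions)   # capped at partitions => 128
```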
239
249
 
240
250
  * `map_partitions`: Maps the key value pairs to other key value pairs, but the
241
251
  block receives all data of each partition as an enumerable and sorted by key.
@@ -273,8 +283,8 @@ most of the time, this is not necessary and the key can simply be ignored.
273
283
  passed job result are completely omitted.
274
284
 
275
285
  ```ruby
276
- job.combine(other_job, worker: MyKrapsWorker, jobs: 8) do |key, value1, value2|
277
- (value1 || {}).merge(value2 || {})
286
+ job.combine(other_job, worker: MyKrapsWorker, jobs: 8) do |key, value1, value2, collector|
287
+ collector.call(key, (value1 || {}).merge(value2 || {}))
278
288
  end
279
289
  ```
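
The following is a simplified sketch of the collector-based `combine` block. The `combine` method here is a hypothetical stand-in that uses plain hashes instead of Kraps' partitioned data. It assumes, as the surrounding README text suggests, that keys found only in the passed job result are omitted and that a missing value is passed as `nil`:

```ruby
# Stand-in for the combine step: iterates over the keys of the current
# result and looks up the matching value in the other result.
def combine(current, other)
  results = []
  collector = proc { |key, value| results << [key, value] }

  current.each do |key, value1|
    # `other[key]` is nil when the other result has no such key.
    yield key, value1, other[key], collector
  end

  results
end

current = { "a" => { x: 1 }, "b" => { y: 2 } }
other = { "a" => { z: 3 }, "c" => { w: 4 } }

combined = combine(current, other) do |key, value1, value2, collector|
  collector.call(key, (value1 || {}).merge(value2 || {}))
end
```

Because the block no longer returns the new value but calls the collector, it can presumably emit no pair at all or several pairs for one key, which matches how the other actions use their collector.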
280
290
 
@@ -316,10 +326,12 @@ It creates a folder for every partition and stores one or more chunks in there.
316
326
  * `load`: Loads the previously dumped data
317
327
 
318
328
  ```ruby
319
- job.load(prefix: "path/to/dump", partitions: 32, partitioner: Kraps::HashPartitioner.new, worker: MyKrapsWorker)
329
+ job.load(prefix: "path/to/dump", partitions: 32, concurrency: 8, partitioner: Kraps::HashPartitioner.new, worker: MyKrapsWorker)
320
330
  ```
321
331
 
322
- The number of partitions and the partitioner must be specified.
332
+ The number of partitions, the partitioner and concurrency must be specified.
333
+ The concurrency specifies the number of threads used for downloading chunks in
334
+ parallel.
323
335
 
324
336
  Please note that every API method accepts a `before` callable:
325
337
 
@@ -342,11 +354,16 @@ of searches made:
342
354
 
343
355
  ```ruby
344
356
  class SearchLogCounter
345
- def call(start_date:, end_date:)
357
+ def initialize(start_date:, end_date:)
358
+ @start_date = start_date
359
+ @end_date = end_date
360
+ end
361
+
362
+ def call
346
363
  count_job = Kraps::Job.new(worker: SomeBackgroundWorker)
347
364
 
348
365
  count_job = count_job.parallelize(partitions: 128) do |collector|
349
- (Date.parse(start_date)..Date.parse(end_date)).each do |date|
366
+ (Date.parse(@start_date)..Date.parse(@end_date)).each do |date|
350
367
  collector.call(date.to_s)
351
368
  end
352
369
  end
data/lib/kraps/downloader.rb ADDED
@@ -0,0 +1,19 @@
1
+ module Kraps
2
+ class Downloader
3
+ def self.download_all(prefix:, concurrency:)
4
+ temp_paths = TempPaths.new
5
+
6
+ files = Kraps.driver.list(prefix: prefix).sort
7
+
8
+ temp_paths_index = files.each_with_object({}) do |file, hash|
9
+ hash[file] = temp_paths.add
10
+ end
11
+
12
+ Parallelizer.each(files, concurrency) do |file|
13
+ Kraps.driver.download(file, temp_paths_index[file].path)
14
+ end
15
+
16
+ temp_paths
17
+ end
18
+ end
19
+ end
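
The bounded-concurrency download can be sketched with plain threads. `parallel_each` is a hypothetical stand-in for `Parallelizer.each(files, concurrency)`, whose real implementation is not part of this diff, and the file names and local paths are made up for illustration. As in `Downloader.download_all`, the file-to-path index is built before any thread starts, so the threads only read from it:

```ruby
# Runs the block for every item using at most `concurrency` threads.
def parallel_each(items, concurrency)
  queue = Queue.new
  items.each { |item| queue << item }
  queue.close # after closing, `pop` returns nil once the queue is empty

  workers = Array.new(concurrency) do
    Thread.new do
      while (item = queue.pop)
        yield item
      end
    end
  end

  # `join` re-raises an exception raised inside a worker thread.
  workers.each(&:join)
end

files = %w[chunk-c chunk-a chunk-b].sort

# Built up front so the worker threads never mutate shared state.
local_paths = files.each_with_object({}) do |file, hash|
  hash[file] = "local/#{file}"
end

downloads = Queue.new # thread-safe collection of finished "downloads"

parallel_each(files, 2) do |file|
  # A real implementation would download `file` to `local_paths[file]`.
  downloads << [file, local_paths[file]]
end

completed = Array.new(downloads.size) { downloads.pop }.sort
```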
data/lib/kraps/job.rb CHANGED
@@ -30,12 +30,14 @@ module Kraps
30
30
  def map(partitions: nil, partitioner: nil, jobs: nil, worker: @worker, before: nil, &block)
31
31
  fresh.tap do |job|
32
32
  job.instance_eval do
33
+ jobs = [jobs, @partitions].compact.min
34
+
33
35
  @partitions = partitions if partitions
34
36
  @partitioner = partitioner if partitioner
35
37
 
36
38
  @steps << Step.new(
37
39
  action: Actions::MAP,
38
- jobs: [jobs, @partitions].compact.min,
40
+ jobs: jobs,
39
41
  partitions: @partitions,
40
42
  partitioner: @partitioner,
41
43
  worker: worker,
@@ -49,12 +51,14 @@ module Kraps
49
51
  def map_partitions(partitions: nil, partitioner: nil, jobs: nil, worker: @worker, before: nil, &block)
50
52
  fresh.tap do |job|
51
53
  job.instance_eval do
54
+ jobs = [jobs, @partitions].compact.min
55
+
52
56
  @partitions = partitions if partitions
53
57
  @partitioner = partitioner if partitioner
54
58
 
55
59
  @steps << Step.new(
56
60
  action: Actions::MAP_PARTITIONS,
57
- jobs: [jobs, @partitions].compact.min,
61
+ jobs: jobs,
58
62
  partitions: @partitions,
59
63
  partitioner: @partitioner,
60
64
  worker: worker,
@@ -135,7 +139,7 @@ module Kraps
135
139
  end
136
140
  end
137
141
 
138
- def load(prefix:, partitions:, partitioner:, worker: @worker)
142
+ def load(prefix:, partitions:, partitioner:, concurrency:, worker: @worker)
139
143
  job = parallelize(partitions: partitions, partitioner: proc { |key, _| key }, worker: worker) do |collector|
140
144
  (0...partitions).each do |partition|
141
145
  collector.call(partition)
@@ -143,20 +147,19 @@ module Kraps
143
147
  end
144
148
 
145
149
  job.map_partitions(partitioner: partitioner, worker: worker) do |partition, _, collector|
146
- tempfile = Tempfile.new
147
-
148
- path = File.join(prefix, partition.to_s, "chunk.json")
149
- next unless Kraps.driver.exists?(path)
150
+ temp_paths = Downloader.download_all(prefix: File.join(prefix, partition.to_s, "/"), concurrency: concurrency)
150
151
 
151
- Kraps.driver.download(path, tempfile.path)
152
+ temp_paths.each do |temp_path|
153
+ File.open(temp_path.path) do |stream|
154
+ stream.each_line do |line|
155
+ key, value = JSON.parse(line)
152
156
 
153
- tempfile.each_line do |line|
154
- key, value = JSON.parse(line)
155
-
156
- collector.call(key, value)
157
+ collector.call(key, value)
158
+ end
159
+ end
157
160
  end
158
161
  ensure
159
- tempfile&.close(true)
162
+ temp_paths&.delete
160
163
  end
161
164
  end
162
165
 
data/lib/kraps/runner.rb CHANGED
@@ -5,7 +5,7 @@ module Kraps
5
5
  end
6
6
 
7
7
  def call(*args, **kwargs)
8
- JobResolver.new.call(@klass.new.call(*args, **kwargs)).tap do |jobs|
8
+ JobResolver.new.call(@klass.new(*args, **kwargs).call).tap do |jobs|
9
9
  jobs.each_with_index do |job, job_index|
10
10
  job.steps.each_with_index.inject(nil) do |frame, (_, step_index)|
11
11
  StepRunner.new(
@@ -100,7 +100,7 @@ module Kraps
100
100
 
101
101
  def push_and_wait(enum:, job_count: nil)
102
102
  redis_queue = RedisQueue.new(redis: Kraps.redis, token: SecureRandom.hex, namespace: Kraps.namespace, ttl: Kraps.job_ttl)
103
- progress_bar = build_progress_bar("#{@klass}: job #{@job_index + 1}/#{@jobs.size}, step #{@step_index + 1}/#{@job.steps.size}, token #{redis_queue.token}, %a, %c/%C (%p%) => #{@step.action}")
103
+ progress_bar = build_progress_bar("#{@klass}: job #{@job_index + 1}/#{@jobs.size}, step #{@step_index + 1}/#{@job.steps.size}, #{@step.jobs || "?"} jobs, token #{redis_queue.token}, %a, %c/%C (%p%) => #{@step.action}")
104
104
 
105
105
  total = 0
106
106
 
data/lib/kraps/version.rb CHANGED
@@ -1,3 +1,3 @@
1
1
  module Kraps
2
- VERSION = "0.7.0"
2
+ VERSION = "0.9.0"
3
3
  end
data/lib/kraps/worker.rb CHANGED
@@ -157,7 +157,7 @@ module Kraps
157
157
  implementation = Object.new
158
158
  implementation.define_singleton_method(:map) do |&block|
159
159
  combine_method.call(enum1, enum2) do |key, value1, value2|
160
- block.call(key, current_step.block.call(key, value1, value2))
160
+ current_step.block.call(key, value1, value2, block)
161
161
  end
162
162
  end
163
163
 
@@ -270,23 +270,11 @@ module Kraps
270
270
  end
271
271
 
272
272
  def download_all(token:, partition:)
273
- temp_paths = TempPaths.new
274
-
275
- files = Kraps.driver.list(prefix: Kraps.driver.with_prefix("#{token}/#{partition}/")).sort
276
-
277
- temp_paths_index = files.each_with_object({}) do |file, hash|
278
- hash[file] = temp_paths.add
279
- end
280
-
281
- Parallelizer.each(files, @concurrency) do |file|
282
- Kraps.driver.download(file, temp_paths_index[file].path)
283
- end
284
-
285
- temp_paths
273
+ Downloader.download_all(prefix: Kraps.driver.with_prefix("#{token}/#{partition}/"), concurrency: @concurrency)
286
274
  end
287
275
 
288
276
  def jobs
289
- @jobs ||= JobResolver.new.call(@args["klass"].constantize.new.call(*@args["args"], **@args["kwargs"].transform_keys(&:to_sym)))
277
+ @jobs ||= JobResolver.new.call(@args["klass"].constantize.new(*@args["args"], **@args["kwargs"].transform_keys(&:to_sym)).call)
290
278
  end
291
279
 
292
280
  def job
data/lib/kraps.rb CHANGED
@@ -19,6 +19,7 @@ require_relative "kraps/runner"
19
19
  require_relative "kraps/step"
20
20
  require_relative "kraps/frame"
21
21
  require_relative "kraps/worker"
22
+ require_relative "kraps/downloader"
22
23
 
23
24
  module Kraps
24
25
  class Error < StandardError; end
metadata CHANGED
@@ -1,14 +1,14 @@
1
1
  --- !ruby/object:Gem::Specification
2
2
  name: kraps
3
3
  version: !ruby/object:Gem::Version
4
- version: 0.7.0
4
+ version: 0.9.0
5
5
  platform: ruby
6
6
  authors:
7
7
  - Benjamin Vetter
8
8
  autorequire:
9
9
  bindir: exe
10
10
  cert_chain: []
11
- date: 2022-12-02 00:00:00.000000000 Z
11
+ date: 2024-03-13 00:00:00.000000000 Z
12
12
  dependencies:
13
13
  - !ruby/object:Gem::Dependency
14
14
  name: attachie
@@ -66,48 +66,6 @@ dependencies:
66
66
  - - ">="
67
67
  - !ruby/object:Gem::Version
68
68
  version: '0'
69
- - !ruby/object:Gem::Dependency
70
- name: bundler
71
- requirement: !ruby/object:Gem::Requirement
72
- requirements:
73
- - - ">="
74
- - !ruby/object:Gem::Version
75
- version: '0'
76
- type: :development
77
- prerelease: false
78
- version_requirements: !ruby/object:Gem::Requirement
79
- requirements:
80
- - - ">="
81
- - !ruby/object:Gem::Version
82
- version: '0'
83
- - !ruby/object:Gem::Dependency
84
- name: rspec
85
- requirement: !ruby/object:Gem::Requirement
86
- requirements:
87
- - - ">="
88
- - !ruby/object:Gem::Version
89
- version: '0'
90
- type: :development
91
- prerelease: false
92
- version_requirements: !ruby/object:Gem::Requirement
93
- requirements:
94
- - - ">="
95
- - !ruby/object:Gem::Version
96
- version: '0'
97
- - !ruby/object:Gem::Dependency
98
- name: rubocop
99
- requirement: !ruby/object:Gem::Requirement
100
- requirements:
101
- - - ">="
102
- - !ruby/object:Gem::Version
103
- version: '0'
104
- type: :development
105
- prerelease: false
106
- version_requirements: !ruby/object:Gem::Requirement
107
- requirements:
108
- - - ">="
109
- - !ruby/object:Gem::Version
110
- version: '0'
111
69
  description: Kraps allows to process and perform calculations on very large datasets
112
70
  in parallel
113
71
  email:
@@ -121,13 +79,13 @@ files:
121
79
  - CHANGELOG.md
122
80
  - CODE_OF_CONDUCT.md
123
81
  - Gemfile
124
- - Gemfile.lock
125
82
  - LICENSE.txt
126
83
  - README.md
127
84
  - Rakefile
128
85
  - docker-compose.yml
129
86
  - lib/kraps.rb
130
87
  - lib/kraps/actions.rb
88
+ - lib/kraps/downloader.rb
131
89
  - lib/kraps/drivers.rb
132
90
  - lib/kraps/frame.rb
133
91
  - lib/kraps/hash_partitioner.rb
data/Gemfile.lock DELETED
@@ -1,108 +0,0 @@
1
- PATH
2
- remote: .
3
- specs:
4
- kraps (0.7.0)
5
- attachie
6
- map-reduce-ruby (>= 3.0.0)
7
- redis
8
- ruby-progressbar
9
-
10
- GEM
11
- remote: https://rubygems.org/
12
- specs:
13
- activesupport (7.0.4)
14
- concurrent-ruby (~> 1.0, >= 1.0.2)
15
- i18n (>= 1.6, < 2)
16
- minitest (>= 5.1)
17
- tzinfo (~> 2.0)
18
- ast (2.4.2)
19
- attachie (1.2.0)
20
- activesupport
21
- aws-sdk-s3
22
- connection_pool
23
- mime-types
24
- aws-eventstream (1.2.0)
25
- aws-partitions (1.657.0)
26
- aws-sdk-core (3.166.0)
27
- aws-eventstream (~> 1, >= 1.0.2)
28
- aws-partitions (~> 1, >= 1.651.0)
29
- aws-sigv4 (~> 1.5)
30
- jmespath (~> 1, >= 1.6.1)
31
- aws-sdk-kms (1.59.0)
32
- aws-sdk-core (~> 3, >= 3.165.0)
33
- aws-sigv4 (~> 1.1)
34
- aws-sdk-s3 (1.117.1)
35
- aws-sdk-core (~> 3, >= 3.165.0)
36
- aws-sdk-kms (~> 1)
37
- aws-sigv4 (~> 1.4)
38
- aws-sigv4 (1.5.2)
39
- aws-eventstream (~> 1, >= 1.0.2)
40
- concurrent-ruby (1.1.10)
41
- connection_pool (2.3.0)
42
- diff-lcs (1.5.0)
43
- i18n (1.12.0)
44
- concurrent-ruby (~> 1.0)
45
- jmespath (1.6.1)
46
- json (2.6.2)
47
- lazy_priority_queue (0.1.1)
48
- map-reduce-ruby (3.0.0)
49
- json
50
- lazy_priority_queue
51
- mime-types (3.4.1)
52
- mime-types-data (~> 3.2015)
53
- mime-types-data (3.2022.0105)
54
- minitest (5.16.3)
55
- parallel (1.22.1)
56
- parser (3.1.2.1)
57
- ast (~> 2.4.1)
58
- rainbow (3.1.1)
59
- rake (13.0.6)
60
- redis (5.0.5)
61
- redis-client (>= 0.9.0)
62
- redis-client (0.11.2)
63
- connection_pool
64
- regexp_parser (2.6.0)
65
- rexml (3.2.5)
66
- rspec (3.12.0)
67
- rspec-core (~> 3.12.0)
68
- rspec-expectations (~> 3.12.0)
69
- rspec-mocks (~> 3.12.0)
70
- rspec-core (3.12.0)
71
- rspec-support (~> 3.12.0)
72
- rspec-expectations (3.12.0)
73
- diff-lcs (>= 1.2.0, < 2.0)
74
- rspec-support (~> 3.12.0)
75
- rspec-mocks (3.12.0)
76
- diff-lcs (>= 1.2.0, < 2.0)
77
- rspec-support (~> 3.12.0)
78
- rspec-support (3.12.0)
79
- rubocop (1.38.0)
80
- json (~> 2.3)
81
- parallel (~> 1.10)
82
- parser (>= 3.1.2.1)
83
- rainbow (>= 2.2.2, < 4.0)
84
- regexp_parser (>= 1.8, < 3.0)
85
- rexml (>= 3.2.5, < 4.0)
86
- rubocop-ast (>= 1.23.0, < 2.0)
87
- ruby-progressbar (~> 1.7)
88
- unicode-display_width (>= 1.4.0, < 3.0)
89
- rubocop-ast (1.23.0)
90
- parser (>= 3.1.1.0)
91
- ruby-progressbar (1.11.0)
92
- tzinfo (2.0.5)
93
- concurrent-ruby (~> 1.0)
94
- unicode-display_width (2.3.0)
95
-
96
- PLATFORMS
97
- ruby
98
- x86_64-linux
99
-
100
- DEPENDENCIES
101
- bundler
102
- kraps!
103
- rake (~> 13.0)
104
- rspec (~> 3.0)
105
- rubocop
106
-
107
- BUNDLED WITH
108
- 2.3.24