kraps 0.2.0 → 0.4.0

checksums.yaml CHANGED
@@ -1,7 +1,7 @@
 ---
 SHA256:
-  metadata.gz: 57d06d20f406e72e26424dcba5c2af296a1551085113371c2cae9894f18e72ff
-  data.tar.gz: a25c8c3440cdd26eeb9b32655e77556e4be4966d47bd3e290d0a702c3b4cde9f
+  metadata.gz: 15d08cf8952d4e5a083a6a4f9791fd16e9e2dbf67c1c71326f3af840c0c72eb8
+  data.tar.gz: c6542584846c54e5897b7b59ef40e8ec282ee27521b2d5ff39551eb02755882d
 SHA512:
-  metadata.gz: 655e2d0f525e136b72e87231a8d359eda952b7e3b30a8cd38b25f7b4fba27baebedacced0a00071915166325b621437869026c6785ab7353eea928ca736bc2a7
-  data.tar.gz: 0d450b3032a6809d2a5300e18e3cfc9a948ca5f589f1161823d9923084939d249282d28bf0619c3b9ace10dc742118dfcf5895c898e2d9fe7c55f98d00e683b4
+  metadata.gz: 21d1ef7a132edacf54e0b2df12b8d085af84ec1ed1cd019d258e43aba4cffbecdeada9b2b7f4baeefec4b59d115eb3e38400da94a3d7961ab19bbbb7dd2cf58c
+  data.tar.gz: fde066e9fdc5f9df7e95be43142cb04a7a1c5279decb277f1d815db508c87d2c04be46ea9559069c8a2c9539ee2eaa949a2fe2fdc3bf862937f9211cdfd8fbd5
data/CHANGELOG.md CHANGED
@@ -1,5 +1,15 @@
 # CHANGELOG
 
+## v0.4.0
+
+* Pre-reduce in a map step when the subsequent step is a
+  reduce step
+
+## v0.3.0
+
+* Changed partitioners to receive the number of partitions
+  as second parameter
+
 ## v0.2.0
 
 * Updated map-reduce-ruby to allow concurrent uploads
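
For context, here is a sketch of the kind of pipeline the v0.4.0 pre-reduce targets, assuming the job DSL from the gem's README (worker class name, file name and block signatures are illustrative):

```ruby
job = Kraps::Job.new(worker: MyKrapsWorker)

# A map step directly followed by a reduce step: since v0.4.0 the worker
# already combines values per key while mapping (pre-reduce), so fewer
# intermediate key/value pairs have to be written to the chunk files.
job = job.parallelize(partitions: 8) { |collector| collector.call("words.txt") }

job = job.map do |path, _value, collector|
  File.foreach(path) { |line| line.split.each { |word| collector.call(word, 1) } }
end

job = job.reduce { |_word, count1, count2| count1 + count2 }
```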
data/Gemfile.lock CHANGED
@@ -1,7 +1,7 @@
 PATH
   remote: .
   specs:
-    kraps (0.2.0)
+    kraps (0.4.0)
       attachie
       distributed_job
       map-reduce-ruby (>= 3.0.0)
@@ -23,7 +23,7 @@ GEM
       connection_pool
       mime-types
     aws-eventstream (1.2.0)
-    aws-partitions (1.654.0)
+    aws-partitions (1.657.0)
     aws-sdk-core (3.166.0)
       aws-eventstream (~> 1, >= 1.0.2)
       aws-partitions (~> 1, >= 1.651.0)
@@ -62,7 +62,7 @@ GEM
     rake (13.0.6)
     redis (5.0.5)
       redis-client (>= 0.9.0)
-    redis-client (0.11.0)
+    redis-client (0.11.1)
       connection_pool
     regexp_parser (2.6.0)
     rexml (3.2.5)
data/README.md CHANGED
@@ -95,28 +95,41 @@ class MyKrapsWorker
   include Sidekiq::Worker
 
   def perform(json)
-    Kraps::Worker.new(json, memory_limit: 128.megabytes, chunk_limit: 64, concurrency: 8).call(retries: 3)
+    Kraps::Worker.new(json, memory_limit: 16.megabytes, chunk_limit: 64, concurrency: 8).call(retries: 3)
   end
 end
 ```
 
 The `json` argument is automatically enqueued by Kraps and contains everything
 it needs to know about the job and step to execute. The `memory_limit` tells
-Kraps how much memory it is allowed to allocate for temporary chunks, etc. This
-value depends on the memory size of your container/server and how much worker
-threads your background queue spawns. Let's say your container/server has 2
-gigabytes of memory and your background framework spawns 5 threads.
-Theoretically, you might be able to give 300-400 megabytes to Kraps then. The
-`chunk_limit` ensures that only the specified amount of chunks are processed in
-a single run. A run basically means: it takes up to `chunk_limit` chunks,
-reduces them and pushes the result as a new chunk to the list of chunks to
-process. Thus, if your number of file descriptors is unlimited, you want to set
-it to a higher number to avoid the overhead of multiple runs. `concurrency`
-tells Kraps how much threads to use to concurrently upload/download files from
-the storage layer. Finally, `retries` specifies how often Kraps should retry
-the job step in case of errors. Kraps will sleep for 5 seconds between those
-retries. Please note that it's not yet possible to use the retry mechanism of
-your background job framework with Kraps.
+Kraps how much memory it is allowed to allocate for temporary chunks. More
+concretely, it tells Kraps how big the file size of a temporary chunk can grow
+in memory before Kraps must write it to disk. However, Ruby of course allocates
+much more memory for a chunk than the raw file size of the chunk. As a rule of
+thumb, it allocates 10 times more memory. Still, choosing a value for
+`memory_limit` depends on the memory size of your container/server, how many
+worker threads your background queue spawns and how much memory your workers
+need besides Kraps. Let's say your container/server has 2 gigabytes of memory
+and your background framework spawns 5 threads. Theoretically, you might be
+able to give 300-400 megabytes to Kraps then, but now divide this by 10 and
+specify a `memory_limit` of around `30.megabytes`, better less. The
+`memory_limit` affects how many chunks will be written to disk, depending on
+the size of the data you are processing and how big these chunks are. The
+smaller the value, the more chunks there will be, and the more chunks there
+are, the more runs Kraps needs to merge them, which can affect performance.
+The `chunk_limit` ensures that only the specified number of chunks are
+processed in a single run. A run basically means: it takes up to `chunk_limit`
+chunks, reduces them and pushes the result as a new chunk to the list of
+chunks to process. Thus, if your number of file descriptors is unlimited, you
+want to set it to a higher number to avoid the overhead of multiple runs.
+`concurrency` tells Kraps how many threads to use to concurrently
+upload/download files from the storage layer. Finally, `retries` specifies how
+often Kraps should retry the job step in case of errors. Kraps will sleep for
+5 seconds between those retries. Please note that it's not yet possible to use
+the retry mechanism of your background job framework with Kraps. Also note
+that `parallelize` is not covered by `retries` yet, as the block passed to
+`parallelize` is executed by the runner, not the workers.
+
 
 Now, executing your job is super easy:
 
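To make the sizing advice above concrete, here is a hedged sketch of a worker tuned for the 2 gigabyte / 5 thread example; the values are assumptions for illustration, not recommendations:

```ruby
class MyKrapsWorker
  include Sidekiq::Worker

  def perform(json)
    # 2 GB container / 5 threads ~= 300-400 MB per thread in theory; divided
    # by ~10 for Ruby's per-chunk allocation overhead => ~30 MB, better less.
    Kraps::Worker.new(json, memory_limit: 30.megabytes, chunk_limit: 64, concurrency: 8).call(retries: 3)
  end
end
```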
@@ -143,17 +156,18 @@ split. Kraps assigns every `key` to a partition, either using a custom
 `partitioner` or the default built in hash partitioner. The hash partitioner
 simply calculates a hash of your key modulo the number of partitions and the
 resulting partition number is the partition where the respective key is
-assigned to. A partitioner is a callable which gets the key as argument and
-returns a partition number. The built in hash partitioner looks similar to this
-one:
+assigned to. A partitioner is a callable which gets the key and the number of
+partitions as arguments and returns a partition number. The built-in hash
+partitioner looks similar to this one:
 
 ```ruby
-partitioner = proc { |key| Digest::SHA1.hexdigest(key.inspect)[0..4].to_i(16) % 128 } # 128 partitions
+partitioner = proc { |key, num_partitions| Digest::SHA1.hexdigest(key.inspect)[0..4].to_i(16) % num_partitions }
 ```
 
 Please note, it's important that the partitioner and the specified number of
 partitions stays in sync. When you use a custom partitioner, please make sure
-that the partitioner operates on the same number of partitions you specify.
+that the partitioner correctly returns a partition number in the range of
+`0...num_partitions`.
 
 ## Datatypes
 
data/lib/kraps/hash_partitioner.rb ADDED
@@ -0,0 +1,7 @@
+module Kraps
+  class HashPartitioner
+    def call(key, num_partitions)
+      Digest::SHA1.hexdigest(JSON.generate(key))[0..4].to_i(16) % num_partitions
+    end
+  end
+end
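
A small usage sketch of the new class; the explicit partition count reflects the two-argument contract described in the README section above:

```ruby
require "kraps"

partitioner = Kraps::HashPartitioner.new

# Deterministically assigns a key to a partition in 0...num_partitions:
partitioner.call("some key", 128) # => an integer in 0...128
```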
data/lib/kraps/job.rb CHANGED
@@ -6,10 +6,10 @@ module Kraps
       @worker = worker
       @steps = []
       @partitions = 0
-      @partitioner = MapReduce::HashPartitioner.new(@partitions)
+      @partitioner = HashPartitioner.new
     end
 
-    def parallelize(partitions:, partitioner: MapReduce::HashPartitioner.new(partitions), worker: @worker, &block)
+    def parallelize(partitions:, partitioner: HashPartitioner.new, worker: @worker, &block)
       fresh.tap do |job|
         job.instance_eval do
           @partitions = partitions
@@ -24,7 +24,7 @@ module Kraps
       fresh.tap do |job|
         job.instance_eval do
           @partitions = partitions if partitions
-          @partitioner = partitioner || MapReduce::HashPartitioner.new(partitions) if partitioner || partitions
+          @partitioner = partitioner if partitioner
 
           @steps << Step.new(action: Actions::MAP, args: { partitions: @partitions, partitioner: @partitioner, worker: worker }, block: block)
         end
data/lib/kraps/version.rb CHANGED
@@ -1,3 +1,3 @@
 module Kraps
-  VERSION = "0.2.0"
+  VERSION = "0.4.0"
 end
data/lib/kraps/worker.rb CHANGED
@@ -60,6 +60,14 @@ module Kraps
          current_step.block.call(key, value, block)
        end
 
+       subsequent_step = next_step
+
+       if subsequent_step&.action == Actions::REDUCE
+         implementation.define_singleton_method(:reduce) do |key, value1, value2|
+           subsequent_step.block.call(key, value1, value2)
+         end
+       end
+
        mapper = MapReduce::Mapper.new(implementation, partitioner: partitioner, memory_limit: @memory_limit)
 
        temp_paths.each do |temp_path|
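
Giving the mapper implementation a `reduce` method is what enables the pre-reduce: map-reduce-ruby can then combine values of equal keys before chunks are written. In isolation, `define_singleton_method` works like this (plain Ruby, not Kraps API):

```ruby
implementation = Object.new

# Attach a reduce method to this one object only, here a combiner
# that sums per-key counts:
implementation.define_singleton_method(:reduce) do |_key, value1, value2|
  value1 + value2
end

implementation.reduce("word", 2, 3) # => 5
```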
@@ -143,15 +151,16 @@ module Kraps
        yield
      rescue Kraps::Error
        distributed_job.stop
+       raise
      rescue StandardError
-       sleep(5)
-       retries += 1
-
        if retries >= num_retries
          distributed_job.stop
          raise
        end
 
+       sleep(5)
+       retries += 1
+
        retry
      end
    end
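
The reordering matters: previously the worker slept for 5 seconds and consumed a retry before checking the limit, so the final failure paid for one extra sleep, and `Kraps::Error` was swallowed instead of re-raised. A minimal standalone sketch of the corrected control flow (not the Kraps API):

```ruby
# Check the retry budget before sleeping, so the final failure
# is raised without an extra 5 second delay.
def with_retries(num_retries)
  retries = 0

  begin
    yield
  rescue StandardError
    raise if retries >= num_retries

    sleep(5)
    retries += 1
    retry
  end
end
```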
@@ -180,8 +189,12 @@ module Kraps
        end
      end
 
+     def next_step
+       @next_step ||= steps[@args["step_index"] + 1]
+     end
+
      def partitioner
-       @partitioner ||= step.args[:partitioner]
+       @partitioner ||= proc { |key| step.args[:partitioner].call(key, step.args[:partitions]) }
      end
 
      def distributed_job
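
The new `partitioner` method adapts the two-argument partitioners introduced in v0.3.0 to the single-argument callable that `MapReduce::Mapper` receives, by fixing the step's partition count in a closure. The same adapter pattern in isolation (a sketch):

```ruby
require "digest"

# A v0.3.0-style two-argument partitioner:
two_arg = proc { |key, num_partitions| Digest::SHA1.hexdigest(key.to_s)[0..4].to_i(16) % num_partitions }

# Fix the partition count in a closure to get the one-argument
# callable the mapper expects:
num_partitions = 128
one_arg = proc { |key| two_arg.call(key, num_partitions) }

one_arg.call("some key") # => an integer in 0...128
```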
data/lib/kraps.rb CHANGED
@@ -2,6 +2,7 @@ require_relative "kraps/version"
 require_relative "kraps/drivers"
 require_relative "kraps/actions"
 require_relative "kraps/parallelizer"
+require_relative "kraps/hash_partitioner"
 require_relative "kraps/temp_path"
 require_relative "kraps/temp_paths"
 require_relative "kraps/timeout_queue"
metadata CHANGED
@@ -1,14 +1,14 @@
 --- !ruby/object:Gem::Specification
 name: kraps
 version: !ruby/object:Gem::Version
-  version: 0.2.0
+  version: 0.4.0
 platform: ruby
 authors:
 - Benjamin Vetter
 autorequire:
 bindir: exe
 cert_chain: []
-date: 2022-11-01 00:00:00.000000000 Z
+date: 2022-11-09 00:00:00.000000000 Z
 dependencies:
 - !ruby/object:Gem::Dependency
   name: attachie
@@ -144,6 +144,7 @@ files:
 - lib/kraps/actions.rb
 - lib/kraps/drivers.rb
 - lib/kraps/frame.rb
+- lib/kraps/hash_partitioner.rb
 - lib/kraps/interval.rb
 - lib/kraps/job.rb
 - lib/kraps/parallelizer.rb