kraps 0.2.0 → 0.4.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
checksums.yaml CHANGED
@@ -1,7 +1,7 @@
 ---
 SHA256:
-  metadata.gz: 57d06d20f406e72e26424dcba5c2af296a1551085113371c2cae9894f18e72ff
-  data.tar.gz: a25c8c3440cdd26eeb9b32655e77556e4be4966d47bd3e290d0a702c3b4cde9f
+  metadata.gz: 15d08cf8952d4e5a083a6a4f9791fd16e9e2dbf67c1c71326f3af840c0c72eb8
+  data.tar.gz: c6542584846c54e5897b7b59ef40e8ec282ee27521b2d5ff39551eb02755882d
 SHA512:
-  metadata.gz: 655e2d0f525e136b72e87231a8d359eda952b7e3b30a8cd38b25f7b4fba27baebedacced0a00071915166325b621437869026c6785ab7353eea928ca736bc2a7
-  data.tar.gz: 0d450b3032a6809d2a5300e18e3cfc9a948ca5f589f1161823d9923084939d249282d28bf0619c3b9ace10dc742118dfcf5895c898e2d9fe7c55f98d00e683b4
+  metadata.gz: 21d1ef7a132edacf54e0b2df12b8d085af84ec1ed1cd019d258e43aba4cffbecdeada9b2b7f4baeefec4b59d115eb3e38400da94a3d7961ab19bbbb7dd2cf58c
+  data.tar.gz: fde066e9fdc5f9df7e95be43142cb04a7a1c5279decb277f1d815db508c87d2c04be46ea9559069c8a2c9539ee2eaa949a2fe2fdc3bf862937f9211cdfd8fbd5
data/CHANGELOG.md CHANGED
@@ -1,5 +1,15 @@
 # CHANGELOG
 
+## v0.4.0
+
+* Pre-reduce in a map step when the subsequent step is a
+  reduce step
+
+## v0.3.0
+
+* Changed partitioners to receive the number of partitions
+  as second parameter
+
 ## v0.2.0
 
 * Updated map-reduce-ruby to allow concurrent uploads
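
The v0.4.0 entry is easiest to see outside of Kraps: when the step after a map is a reduce, values that share a key can already be merged while mapping, so fewer pairs have to be written and shuffled. A minimal sketch of that combiner idea in plain Ruby (the word-count data is invented for illustration, not taken from the gem):

```ruby
# Without pre-reduction, the map step would emit one pair per word. With it,
# pairs sharing a key are reduced (here: summed) before being written out,
# shrinking the data that reaches the reduce step.
pairs = [["a", 1], ["b", 1], ["a", 1]]

pre_reduced = pairs.each_with_object(Hash.new(0)) do |(key, value), acc|
  acc[key] += value # the subsequent step's reduce block, applied early
end

pre_reduced # => {"a"=>2, "b"=>1}, i.e. two pairs instead of three
```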
data/Gemfile.lock CHANGED
@@ -1,7 +1,7 @@
 PATH
   remote: .
   specs:
-    kraps (0.2.0)
+    kraps (0.4.0)
       attachie
       distributed_job
       map-reduce-ruby (>= 3.0.0)
@@ -23,7 +23,7 @@ GEM
       connection_pool
       mime-types
     aws-eventstream (1.2.0)
-    aws-partitions (1.654.0)
+    aws-partitions (1.657.0)
     aws-sdk-core (3.166.0)
       aws-eventstream (~> 1, >= 1.0.2)
       aws-partitions (~> 1, >= 1.651.0)
@@ -62,7 +62,7 @@ GEM
     rake (13.0.6)
     redis (5.0.5)
       redis-client (>= 0.9.0)
-    redis-client (0.11.0)
+    redis-client (0.11.1)
       connection_pool
     regexp_parser (2.6.0)
     rexml (3.2.5)
data/README.md CHANGED
@@ -95,28 +95,41 @@ class MyKrapsWorker
   include Sidekiq::Worker
 
   def perform(json)
-    Kraps::Worker.new(json, memory_limit: 128.megabytes, chunk_limit: 64, concurrency: 8).call(retries: 3)
+    Kraps::Worker.new(json, memory_limit: 16.megabytes, chunk_limit: 64, concurrency: 8).call(retries: 3)
   end
 end
 ```
 
 The `json` argument is automatically enqueued by Kraps and contains everything
 it needs to know about the job and step to execute. The `memory_limit` tells
-Kraps how much memory it is allowed to allocate for temporary chunks, etc. This
-value depends on the memory size of your container/server and how much worker
-threads your background queue spawns. Let's say your container/server has 2
-gigabytes of memory and your background framework spawns 5 threads.
-Theoretically, you might be able to give 300-400 megabytes to Kraps then. The
-`chunk_limit` ensures that only the specified amount of chunks are processed in
-a single run. A run basically means: it takes up to `chunk_limit` chunks,
-reduces them and pushes the result as a new chunk to the list of chunks to
-process. Thus, if your number of file descriptors is unlimited, you want to set
-it to a higher number to avoid the overhead of multiple runs. `concurrency`
-tells Kraps how much threads to use to concurrently upload/download files from
-the storage layer. Finally, `retries` specifies how often Kraps should retry
-the job step in case of errors. Kraps will sleep for 5 seconds between those
-retries. Please note that it's not yet possible to use the retry mechanism of
-your background job framework with Kraps.
+Kraps how much memory it is allowed to allocate for temporary chunks. More
+concretely, it tells Kraps how big the file size of a temporary chunk can grow
+in memory before Kraps must write it to disk. However, Ruby of course
+allocates much more memory for a chunk than the raw file size of the chunk. As
+a rule of thumb, it allocates 10 times more memory. Still, choosing a value
+for `memory_limit` depends on the memory size of your container/server, how
+many worker threads your background queue spawns and how much memory your
+workers need besides Kraps. Let's say your container/server has 2 gigabytes of
+memory and your background framework spawns 5 threads. Theoretically, you
+might be able to give 300-400 megabytes to Kraps then, but now divide this by
+10 and specify a `memory_limit` of around `30.megabytes`, or better, less. The
+`memory_limit` affects how many chunks will be written to disk, depending on
+the size of the data you are processing and how big these chunks are. The
+smaller the value, the more chunks there will be, and the more chunks, the
+more runs Kraps needs to merge them, which can affect performance. The
+`chunk_limit` ensures that only the specified number of chunks is processed in
+a single run. A run basically means: it takes up to `chunk_limit` chunks,
+reduces them and pushes the result as a new chunk to the list of chunks to
+process. Thus, if your number of file descriptors is unlimited, you want to
+set it to a higher number to avoid the overhead of multiple runs.
+`concurrency` tells Kraps how many threads to use to concurrently
+upload/download files from the storage layer. Finally, `retries` specifies how
+often Kraps should retry the job step in case of errors. Kraps will sleep for
+5 seconds between those retries. Please note that it's not yet possible to use
+the retry mechanism of your background job framework with Kraps. Also note
+that `parallelize` is not covered by `retries` yet, as the block passed to
+`parallelize` is executed by the runner, not the workers.
+
 
 Now, executing your job is super easy:
 
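
To make the sizing rule of thumb concrete, here is the arithmetic from the paragraph above as a hedged Ruby sketch (the numbers are the paragraph's illustrative values; only `memory_limit` is a real Kraps parameter):

```ruby
# Worked example of the memory_limit heuristic described above.
container_memory = 2048 # megabytes available on the container/server
worker_threads   = 5    # threads spawned by the background framework

per_thread = container_memory / worker_threads # ~400 megabytes per thread

# Ruby allocates roughly 10x the raw chunk size, so divide by 10 and
# leave headroom for whatever else your workers allocate:
memory_limit = per_thread / 10 # ~40; the README suggests ~30 megabytes or less
```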
@@ -143,17 +156,18 @@ split. Kraps assigns every `key` to a partition, either using a custom
 `partitioner` or the default built in hash partitioner. The hash partitioner
 simply calculates a hash of your key modulo the number of partitions and the
 resulting partition number is the partition where the respective key is
-assigned to. A partitioner is a callable which gets the key as argument and
-returns a partition number. The built in hash partitioner looks similar to this
-one:
+assigned to. A partitioner is a callable which gets the key and the number of
+partitions as arguments and returns a partition number. The built in hash
+partitioner looks similar to this one:
 
 ```ruby
-partitioner = proc { |key| Digest::SHA1.hexdigest(key.inspect)[0..4].to_i(16) % 128 } # 128 partitions
+partitioner = proc { |key, num_partitions| Digest::SHA1.hexdigest(key.inspect)[0..4].to_i(16) % num_partitions }
 ```
 
 Please note, it's important that the partitioner and the specified number of
 partitions stay in sync. When you use a custom partitioner, please make sure
-that the partitioner operates on the same number of partitions you specify.
+that the partitioner correctly returns a partition number in the range of
+`0...num_partitions`.
 
 ## Datatypes
 
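
To illustrate the `0...num_partitions` contract from the hunk above, here is a hedged sketch of a custom partitioner (the first-byte routing scheme is invented for this example and is not part of Kraps):

```ruby
# A custom partitioner honoring the 0...num_partitions contract: keys are
# routed by their first byte, so keys with the same leading character always
# land in the same partition (illustrative scheme only).
partitioner = proc do |key, num_partitions|
  str = key.to_s
  str.empty? ? 0 : str.bytes.first % num_partitions
end

partitioner.call("users/42", 32) # => always within 0...32
```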
data/lib/kraps/hash_partitioner.rb ADDED
@@ -0,0 +1,7 @@
+module Kraps
+  class HashPartitioner
+    def call(key, num_partitions)
+      Digest::SHA1.hexdigest(JSON.generate(key))[0..4].to_i(16) % num_partitions
+    end
+  end
+end
data/lib/kraps/job.rb CHANGED
@@ -6,10 +6,10 @@ module Kraps
       @worker = worker
       @steps = []
       @partitions = 0
-      @partitioner = MapReduce::HashPartitioner.new(@partitions)
+      @partitioner = HashPartitioner.new
     end
 
-    def parallelize(partitions:, partitioner: MapReduce::HashPartitioner.new(partitions), worker: @worker, &block)
+    def parallelize(partitions:, partitioner: HashPartitioner.new, worker: @worker, &block)
       fresh.tap do |job|
         job.instance_eval do
           @partitions = partitions
@@ -24,7 +24,7 @@ module Kraps
       fresh.tap do |job|
         job.instance_eval do
           @partitions = partitions if partitions
-          @partitioner = partitioner || MapReduce::HashPartitioner.new(partitions) if partitioner || partitions
+          @partitioner = partitioner if partitioner
 
           @steps << Step.new(action: Actions::MAP, args: { partitions: @partitions, partitioner: @partitioner, worker: worker }, block: block)
         end
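
Given the new `parallelize` signature above, a custom two-argument partitioner can be passed per job. A hedged usage sketch follows; the `TenantPartitioner` class, the key format and the collector block are assumptions for illustration (see the Kraps README for the actual block semantics):

```ruby
require "digest"

# Hypothetical: keep all keys of one tenant in the same partition by hashing
# only the tenant prefix of a "tenant:record" style key.
class TenantPartitioner
  def call(key, num_partitions)
    tenant = key.to_s.split(":").first.to_s
    Digest::SHA1.hexdigest(tenant)[0..4].to_i(16) % num_partitions
  end
end

# Somewhere inside a Kraps job definition, assuming `job` is the Kraps::Job
# instance handed to your job's #call method:
job = job.parallelize(partitions: 64, partitioner: TenantPartitioner.new) do |collector|
  # emit items here; note this block runs on the runner, not the workers
end
```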
data/lib/kraps/version.rb CHANGED
@@ -1,3 +1,3 @@
 module Kraps
-  VERSION = "0.2.0"
+  VERSION = "0.4.0"
 end
data/lib/kraps/worker.rb CHANGED
@@ -60,6 +60,14 @@ module Kraps
         current_step.block.call(key, value, block)
       end
 
+      subsequent_step = next_step
+
+      if subsequent_step&.action == Actions::REDUCE
+        implementation.define_singleton_method(:reduce) do |key, value1, value2|
+          subsequent_step.block.call(key, value1, value2)
+        end
+      end
+
       mapper = MapReduce::Mapper.new(implementation, partitioner: partitioner, memory_limit: @memory_limit)
 
       temp_paths.each do |temp_path|
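
The hunk above bolts a `reduce` method onto the per-step implementation object only when the next step is a reduce, which is how the pre-reduce feature from the changelog is wired in. A standalone illustration of the `define_singleton_method` pattern (names and the summing block are invented for this sketch):

```ruby
# Attach a reduce method to a plain object so a consumer that duck-types on
# #reduce can pre-combine values for equal keys.
implementation = Object.new

reduce_block = proc { |key, value1, value2| value1 + value2 } # e.g. summing counts

implementation.define_singleton_method(:reduce) do |key, value1, value2|
  reduce_block.call(key, value1, value2)
end

implementation.reduce("word", 2, 3) # => 5
```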
@@ -143,15 +151,16 @@
         yield
       rescue Kraps::Error
         distributed_job.stop
+        raise
       rescue StandardError
-        sleep(5)
-        retries += 1
-
         if retries >= num_retries
           distributed_job.stop
           raise
         end
 
+        sleep(5)
+        retries += 1
+
         retry
       end
     end
@@ -180,8 +189,12 @@
       end
     end
 
+    def next_step
+      @next_step ||= steps[@args["step_index"] + 1]
+    end
+
     def partitioner
-      @partitioner ||= step.args[:partitioner]
+      @partitioner ||= proc { |key| step.args[:partitioner].call(key, step.args[:partitions]) }
    end
 
     def distributed_job
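
The `partitioner` change above is a small adapter: the step's partition count is curried into the new two-argument Kraps partitioner, yielding the one-argument callable that the underlying map-reduce machinery keeps calling with just a key. The same idea in isolation (the hash scheme and the count of 32 are illustrative):

```ruby
require "digest"

# Curry the partition count into a two-argument partitioner to obtain a
# one-argument callable.
two_arg = proc { |key, num_partitions| Digest::SHA1.hexdigest(key.to_s)[0..4].to_i(16) % num_partitions }
one_arg = proc { |key| two_arg.call(key, 32) } # 32 = the step's partition count

one_arg.call("some key") # => a number in 0...32
```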
data/lib/kraps.rb CHANGED
@@ -2,6 +2,7 @@ require_relative "kraps/version"
 require_relative "kraps/drivers"
 require_relative "kraps/actions"
 require_relative "kraps/parallelizer"
+require_relative "kraps/hash_partitioner"
 require_relative "kraps/temp_path"
 require_relative "kraps/temp_paths"
 require_relative "kraps/timeout_queue"
metadata CHANGED
@@ -1,14 +1,14 @@
 --- !ruby/object:Gem::Specification
 name: kraps
 version: !ruby/object:Gem::Version
-  version: 0.2.0
+  version: 0.4.0
 platform: ruby
 authors:
 - Benjamin Vetter
 autorequire:
 bindir: exe
 cert_chain: []
-date: 2022-11-01 00:00:00.000000000 Z
+date: 2022-11-09 00:00:00.000000000 Z
 dependencies:
 - !ruby/object:Gem::Dependency
   name: attachie
@@ -144,6 +144,7 @@ files:
 - lib/kraps/actions.rb
 - lib/kraps/drivers.rb
 - lib/kraps/frame.rb
+- lib/kraps/hash_partitioner.rb
 - lib/kraps/interval.rb
 - lib/kraps/job.rb
 - lib/kraps/parallelizer.rb