map-reduce-ruby 3.0.0 → 3.0.1

checksums.yaml CHANGED
@@ -1,7 +1,7 @@
  ---
  SHA256:
- metadata.gz: a794ca527320099e74e49e3f3abe1a19a6fc6df186eb04a822c926af5ae4985c
- data.tar.gz: f8e1b77d7e3fbcf7171aa1fce94273a21fef45d9d76f8768b07ba450c3d51bdb
+ metadata.gz: f16a0ed8420d8b3867cdb41be00498d367b7b9e0c1b4af05a8230b3b32762794
+ data.tar.gz: 2c02d6dcb2819ccd741498d500e9f645673daf65bee9181ff74e1e73a11ebaae
  SHA512:
- metadata.gz: 79c9b21cd9fd8c586fc61b0b9cc9dd3ddee039fdce2c355965e06309a17436aff28932b68647b0b4ebe90e59208dee0dd4ecf7f83dc06a5216c6da4dea3eb00f
- data.tar.gz: 22bae4b877451ee66fdcdcd0ff0861caf65a41cab5e3b77c03370a4bb172c5d381f7623ab2f91516875e060c54eec3a8ac9063edeffe9e167a9a3390e25f8e62
+ metadata.gz: 47534a4ebc188caaa1da33b02302f2cc80c7fa7a6128a56aad4a6073cfcf63071b0258cc3702ba3d7ff9a9f66978552717756594db3aabac75017f8817019580
+ data.tar.gz: f292f4450acb8ff62507d52c8324d8861d1f91d2615b397cefbfbe9148127bd036dd19535d1958c17118d25de29e04863f4a6e38a03dcb257ca91e8177c128d7
data/CHANGELOG.md CHANGED
@@ -1,5 +1,9 @@
  # CHANGELOG
 
+ ## v3.0.1
+
+ * Fix cleanup in `MapReduce::Mapper#shuffle`
+
  ## v3.0.0
 
  * [BREAKING] `MapReduce::Mapper#shuffle` now yields a hash of (partition, path)
data/Gemfile.lock CHANGED
@@ -1,7 +1,7 @@
  PATH
  remote: .
  specs:
- map-reduce-ruby (3.0.0)
+ map-reduce-ruby (3.0.1)
  json
  lazy_priority_queue
 
data/README.md CHANGED
@@ -54,7 +54,7 @@ Next, we need some worker code to run the mapping part:
  ```ruby
  class WordCountMapper
  def perform(job_id, mapper_id, url)
- mapper = MapReduce::Mapper.new(WordCounter.new, partitioner: MapReduce::HashPartitioner.new(16), memory_limit: 100.megabytes)
+ mapper = MapReduce::Mapper.new(WordCounter.new, partitioner: MapReduce::HashPartitioner.new(16), memory_limit: 10.megabytes)
  mapper.map(url)
 
  mapper.shuffle(chunk_limit: 64) do |partitions|
@@ -68,8 +68,8 @@ end
  ```
 
  Please note that `MapReduce::HashPartitioner.new(16)` states that we want to
- split the dataset into 16 partitions (i.e. 0, 1, ... 15). Finally, we need some
- worker code to run the reduce part:
+ split the dataset into 16 partitions (i.e. 0, 1, ... 15). Finally, we need
+ some worker code to run the reduce part:
 
  ```ruby
  class WordCountReducer
@@ -146,9 +146,22 @@ workings of MapReduce. Of course, feel free to check the code as well.
 
  `MapReduce::Mapper#map` calls your `map` implementation and adds each yielded
  key-value pair to an internal buffer up until the memory limit is reached.
- When the memory limit is reached, the buffer is sorted by key and fed through
- your `reduce` implementation already, as this can greatly reduce the amount of
- data already. The result is written to a tempfile. This proceeds up until all
+ More concretely, it specifies how big the file size of a temporary chunk can
+ grow in memory up until it must be written to disk. However, ruby of course
+ allocates much more memory for a chunk than the raw file size of the chunk. As
+ a rule of thumb, it allocates 10 times more memory. Still, choosing a value for
+ `memory_size` depends on the memory size of your container/server, how much
+ worker threads your background queue spawns and how much memory your workers
+ need besides map/reduce. Let's say your container/server has 2 gigabytes of
+ memory and your background framework spawns 5 threads. Theoretically, you might
+ be able to give 300-400 megabytes, but now divide this by 10 and specify a
+ `memory_limit` of around `30.megabytes`, better less. The `memory_limit`
+ affects how much chunks will be written to disk depending on the data size you
+ are processing and how big these chunks are. The smaller the value, the more
+ chunks and the more chunks, the more runs are needed to merge the chunks. When
+ the memory limit is reached, the buffer is sorted by key and fed through your
+ `reduce` implementation already, as this can greatly reduce the amount of data
+ already. The result is written to a tempfile. This proceeds up until all
  key-value pairs are yielded. `MapReduce::Mapper#shuffle` then reads the first
  key-value pair of all already sorted chunk tempfiles and adds them to a
  priority queue using a binomial heap, such that with every `pop` operation on
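
The `memory_limit` rule of thumb added to the README in this hunk can be turned into arithmetic. The following is only a minimal sketch: `suggested_memory_limit`, its parameters, and the 512 megabytes reserved for everything besides map/reduce are hypothetical and not part of map-reduce-ruby. Only the factor of 10 and the 2 gigabytes / 5 threads scenario come from the README text.

```ruby
# Hypothetical helper, not part of map-reduce-ruby. It applies the README's
# rule of thumb: Ruby needs roughly 10 times the raw chunk size in memory.
MEGABYTE = 1024 * 1024

def suggested_memory_limit(total_memory:, threads:, reserved:, overhead_factor: 10)
  # Memory each worker thread could use in theory, after setting aside what
  # the process needs besides map/reduce (the reserved amount is an assumption).
  per_thread = (total_memory - reserved) / threads

  # Divide by the allocation overhead to get the raw chunk size to configure.
  per_thread / overhead_factor
end

# README scenario: 2 gigabytes of memory, 5 worker threads.
limit = suggested_memory_limit(
  total_memory: 2 * 1024 * MEGABYTE,
  threads: 5,
  reserved: 512 * MEGABYTE # assumed value, chosen for illustration only
)

puts limit / MEGABYTE # prints 30
```

The result is a plain byte count, like the `16 * 1024 * 1024` default in the constructor, so it can be passed as `memory_limit:` without Active Support's `megabytes` helper. The README recommends treating such a number as an upper bound ("better less").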
@@ -16,9 +16,9 @@ module MapReduce
  # bytes.
  #
  # @example
- # MapReduce::Mapper.new(MyImplementation.new, partitioner: HashPartitioner.new(16), memory_limit: 100.megabytes)
+ # MapReduce::Mapper.new(MyImplementation.new, partitioner: HashPartitioner.new(16), memory_limit: 16.megabytes)
 
- def initialize(implementation, partitioner: HashPartitioner.new(32), memory_limit: 100 * 1024 * 1024)
+ def initialize(implementation, partitioner: HashPartitioner.new(32), memory_limit: 16 * 1024 * 1024)
  super()
 
  @implementation = implementation
@@ -86,7 +86,7 @@ module MapReduce
 
  yield(partitions.transform_values(&:path))
  ensure
- partitions.each_value(&:delete)
+ partitions&.each_value(&:delete)
 
  @chunks.each(&:delete)
  @chunks = []
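
The switch to `partitions&.each_value(&:delete)` presumably guards the `ensure` block against `partitions` still being `nil`, which happens when an error is raised before the variable is assigned. The hunk does not show the rest of the method, so the following is only a sketch of that general Ruby behavior. `FakeTempfile`, `shuffle_sketch` and `outcome` are hypothetical names and do not mirror the gem's code.

```ruby
# Stand-in for a tempfile; it only records whether delete was called.
class FakeTempfile
  attr_reader :deleted

  def delete
    @deleted = true
  end
end

# Hypothetical method shaped like the hunk above: the body may fail before
# `partitions` is assigned, and cleanup runs in `ensure` either way.
def shuffle_sketch(safe:, fail_early:)
  raise ArgumentError, "setup failed" if fail_early

  partitions = { 0 => FakeTempfile.new }
  partitions.keys
ensure
  # `partitions` is nil here if the body raised before the assignment ran.
  if safe
    partitions&.each_value(&:delete) # skips cleanup when there is nothing to clean
  else
    partitions.each_value(&:delete)  # raises NoMethodError on nil
  end
end

# Returns the class of the error that reaches the caller.
def outcome(safe:)
  shuffle_sketch(safe: safe, fail_early: true)
rescue StandardError => e
  e.class
end

unsafe_error = outcome(safe: false) # NoMethodError raised inside ensure
safe_error = outcome(safe: true)    # the original ArgumentError
```

Without the safe navigation operator, the `NoMethodError` raised inside `ensure` replaces the original exception, so the caller never sees the real cause of the failure.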
@@ -18,8 +18,7 @@ module MapReduce
  super()
 
  @implementation = implementation
-
- @temp_paths ||= []
+ @temp_paths = []
  end
 
  # Adds a chunk from the mapper-phase to the reducer by registering a
@@ -1,3 +1,3 @@
  module MapReduce
- VERSION = "3.0.0"
+ VERSION = "3.0.1"
  end
metadata CHANGED
@@ -1,14 +1,14 @@
  --- !ruby/object:Gem::Specification
  name: map-reduce-ruby
  version: !ruby/object:Gem::Version
- version: 3.0.0
+ version: 3.0.1
  platform: ruby
  authors:
  - Benjamin Vetter
  autorequire:
  bindir: exe
  cert_chain: []
- date: 2022-11-01 00:00:00.000000000 Z
+ date: 2022-11-18 00:00:00.000000000 Z
  dependencies:
  - !ruby/object:Gem::Dependency
  name: rspec