map-reduce-ruby 3.0.0 → 3.0.1

checksums.yaml CHANGED
@@ -1,7 +1,7 @@
  ---
  SHA256:
-   metadata.gz: a794ca527320099e74e49e3f3abe1a19a6fc6df186eb04a822c926af5ae4985c
-   data.tar.gz: f8e1b77d7e3fbcf7171aa1fce94273a21fef45d9d76f8768b07ba450c3d51bdb
+   metadata.gz: f16a0ed8420d8b3867cdb41be00498d367b7b9e0c1b4af05a8230b3b32762794
+   data.tar.gz: 2c02d6dcb2819ccd741498d500e9f645673daf65bee9181ff74e1e73a11ebaae
  SHA512:
-   metadata.gz: 79c9b21cd9fd8c586fc61b0b9cc9dd3ddee039fdce2c355965e06309a17436aff28932b68647b0b4ebe90e59208dee0dd4ecf7f83dc06a5216c6da4dea3eb00f
-   data.tar.gz: 22bae4b877451ee66fdcdcd0ff0861caf65a41cab5e3b77c03370a4bb172c5d381f7623ab2f91516875e060c54eec3a8ac9063edeffe9e167a9a3390e25f8e62
+   metadata.gz: 47534a4ebc188caaa1da33b02302f2cc80c7fa7a6128a56aad4a6073cfcf63071b0258cc3702ba3d7ff9a9f66978552717756594db3aabac75017f8817019580
+   data.tar.gz: f292f4450acb8ff62507d52c8324d8861d1f91d2615b397cefbfbe9148127bd036dd19535d1958c17118d25de29e04863f4a6e38a03dcb257ca91e8177c128d7
data/CHANGELOG.md CHANGED
@@ -1,5 +1,9 @@
  # CHANGELOG
  
+ ## v3.0.1
+
+ * Fix cleanup in `MapReduce::Mapper#shuffle`
+
  ## v3.0.0
  
  * [BREAKING] `MapReduce::Mapper#shuffle` now yields a hash of (partition, path)
data/Gemfile.lock CHANGED
@@ -1,7 +1,7 @@
  PATH
    remote: .
    specs:
-     map-reduce-ruby (3.0.0)
+     map-reduce-ruby (3.0.1)
        json
        lazy_priority_queue
  
data/README.md CHANGED
@@ -54,7 +54,7 @@ Next, we need some worker code to run the mapping part:
  ```ruby
  class WordCountMapper
    def perform(job_id, mapper_id, url)
-     mapper = MapReduce::Mapper.new(WordCounter.new, partitioner: MapReduce::HashPartitioner.new(16), memory_limit: 100.megabytes)
+     mapper = MapReduce::Mapper.new(WordCounter.new, partitioner: MapReduce::HashPartitioner.new(16), memory_limit: 10.megabytes)
      mapper.map(url)
  
      mapper.shuffle(chunk_limit: 64) do |partitions|
@@ -68,8 +68,8 @@ end
  ```
  
  Please note that `MapReduce::HashPartitioner.new(16)` states that we want to
- split the dataset into 16 partitions (i.e. 0, 1, ... 15). Finally, we need some
- worker code to run the reduce part:
+ split the dataset into 16 partitions (i.e. 0, 1, ... 15). Finally, we need
+ some worker code to run the reduce part:
  
  ```ruby
  class WordCountReducer
@@ -146,9 +146,22 @@ workings of MapReduce. Of course, feel free to check the code as well.
  
  `MapReduce::Mapper#map` calls your `map` implementation and adds each yielded
  key-value pair to an internal buffer up until the memory limit is reached.
- When the memory limit is reached, the buffer is sorted by key and fed through
- your `reduce` implementation already, as this can greatly reduce the amount of
- data already. The result is written to a tempfile. This proceeds up until all
+ More concretely, it specifies how big the file size of a temporary chunk can
+ grow in memory before it must be written to disk. However, Ruby of course
+ allocates much more memory for a chunk than the raw file size of the chunk. As
+ a rule of thumb, it allocates 10 times more memory. Still, choosing a value for
+ `memory_limit` depends on the memory size of your container/server, how many
+ worker threads your background queue spawns and how much memory your workers
+ need besides map/reduce. Let's say your container/server has 2 gigabytes of
+ memory and your background framework spawns 5 threads. Theoretically, you might
+ be able to give 300-400 megabytes, but now divide this by 10 and specify a
+ `memory_limit` of around `30.megabytes`, better less. The `memory_limit`
+ affects how many chunks will be written to disk depending on the data size you
+ are processing and how big these chunks are. The smaller the value, the more
+ chunks and the more chunks, the more runs are needed to merge the chunks. When
+ the memory limit is reached, the buffer is sorted by key and fed through your
+ `reduce` implementation already, as this can greatly reduce the amount of data
+ already. The result is written to a tempfile. This proceeds up until all
  key-value pairs are yielded. `MapReduce::Mapper#shuffle` then reads the first
  key-value pair of all already sorted chunk tempfiles and adds them to a
  priority queue using a binomial heap, such that with every `pop` operation on
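The sizing rule of thumb in the added README text can be written out as a small calculation. This is only a sketch: `suggested_memory_limit` and its parameters are hypothetical names and not part of map-reduce-ruby, the factor of 10 is the README's rule of thumb, and the 100 megabytes reserved per thread for other work is an assumed figure chosen to land inside the README's "300-400 megabytes" range.

```ruby
# Hypothetical helper illustrating the README's rule of thumb.
# It is not part of the map-reduce-ruby gem.
MEGABYTE = 1024 * 1024

# total_memory:            memory of the container/server in bytes
# threads:                 worker threads spawned by the background framework
# other_memory_per_thread: bytes each worker needs besides map/reduce (assumed)
# overhead_factor:         Ruby memory used per byte of chunk file size
#                          (10 is the README's rule of thumb)
def suggested_memory_limit(total_memory:, threads:, other_memory_per_thread: 0, overhead_factor: 10)
  per_thread = total_memory / threads - other_memory_per_thread
  raise ArgumentError, "no memory left for map/reduce" unless per_thread.positive?

  per_thread / overhead_factor
end

# 2 gigabytes, 5 threads, and an assumed 100 megabytes per thread for other work.
limit = suggested_memory_limit(
  total_memory: 2048 * MEGABYTE,
  threads: 5,
  other_memory_per_thread: 100 * MEGABYTE
)

puts "memory_limit of about #{limit / MEGABYTE} megabytes"
```

With these inputs the result is roughly 30 megabytes, which matches the README's suggestion; per the README, picking a somewhat smaller value is the safer choice.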
@@ -16,9 +16,9 @@ module MapReduce
      # bytes.
      #
      # @example
-     # MapReduce::Mapper.new(MyImplementation.new, partitioner: HashPartitioner.new(16), memory_limit: 100.megabytes)
+     # MapReduce::Mapper.new(MyImplementation.new, partitioner: HashPartitioner.new(16), memory_limit: 16.megabytes)
  
-     def initialize(implementation, partitioner: HashPartitioner.new(32), memory_limit: 100 * 1024 * 1024)
+     def initialize(implementation, partitioner: HashPartitioner.new(32), memory_limit: 16 * 1024 * 1024)
        super()
  
        @implementation = implementation
@@ -86,7 +86,7 @@ module MapReduce
  
        yield(partitions.transform_values(&:path))
      ensure
-       partitions.each_value(&:delete)
+       partitions&.each_value(&:delete)
  
        @chunks.each(&:delete)
        @chunks = []
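The switch to `partitions&.each_value` in the `ensure` clause matters whenever an error is raised before `partitions` has been assigned: the local variable is then `nil`, and calling a method on it raises a `NoMethodError` that replaces the original exception. The sketch below shows that general Ruby behaviour. `run_with_cleanup` and `error_class_for` are hypothetical stand-ins and not the gem's code, and the assumption that this is the failure the changelog entry refers to is an inference from the diff.

```ruby
# Hypothetical stand-in for a method that cleans up in `ensure`.
# It is not the gem's implementation.
def run_with_cleanup(safe:)
  yield # may raise before `partitions` is assigned

  partitions = { 0 => [] }
  partitions.keys
ensure
  if safe
    # Safe navigation skips the call when `partitions` is still nil.
    partitions&.each_value(&:clear)
  else
    # Raises NoMethodError when `partitions` is nil, hiding the original error.
    partitions.each_value(&:clear)
  end
end

# Returns the class of the exception that finally escapes the method.
def error_class_for(safe:)
  run_with_cleanup(safe: safe) { raise ArgumentError, "original failure" }
rescue StandardError => e
  e.class
end

puts error_class_for(safe: false) # the cleanup error wins
puts error_class_for(safe: true)  # the original error survives
```

Without the guard the caller sees the cleanup failure instead of the real cause; with it the original `ArgumentError` propagates unchanged.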
@@ -18,8 +18,7 @@ module MapReduce
        super()
  
        @implementation = implementation
-
-       @temp_paths ||= []
+       @temp_paths = []
      end
  
      # Adds a chunk from the mapper-phase to the reducer by registering a
@@ -1,3 +1,3 @@
  module MapReduce
-   VERSION = "3.0.0"
+   VERSION = "3.0.1"
  end
metadata CHANGED
@@ -1,14 +1,14 @@
  --- !ruby/object:Gem::Specification
  name: map-reduce-ruby
  version: !ruby/object:Gem::Version
-   version: 3.0.0
+   version: 3.0.1
  platform: ruby
  authors:
  - Benjamin Vetter
  autorequire:
  bindir: exe
  cert_chain: []
- date: 2022-11-01 00:00:00.000000000 Z
+ date: 2022-11-18 00:00:00.000000000 Z
  dependencies:
  - !ruby/object:Gem::Dependency
    name: rspec