map-reduce-ruby 3.0.0 → 3.0.1 (RubyGems version diff, Mend)

Files changed (8)

checksums.yaml CHANGED

@@ -1,7 +1,7 @@
 ---
 SHA256:
-  metadata.gz: a794ca527320099e74e49e3f3abe1a19a6fc6df186eb04a822c926af5ae4985c
-  data.tar.gz: f8e1b77d7e3fbcf7171aa1fce94273a21fef45d9d76f8768b07ba450c3d51bdb
+  metadata.gz: f16a0ed8420d8b3867cdb41be00498d367b7b9e0c1b4af05a8230b3b32762794
+  data.tar.gz: 2c02d6dcb2819ccd741498d500e9f645673daf65bee9181ff74e1e73a11ebaae
 SHA512:
-  metadata.gz: 79c9b21cd9fd8c586fc61b0b9cc9dd3ddee039fdce2c355965e06309a17436aff28932b68647b0b4ebe90e59208dee0dd4ecf7f83dc06a5216c6da4dea3eb00f
-  data.tar.gz: 22bae4b877451ee66fdcdcd0ff0861caf65a41cab5e3b77c03370a4bb172c5d381f7623ab2f91516875e060c54eec3a8ac9063edeffe9e167a9a3390e25f8e62
+  metadata.gz: 47534a4ebc188caaa1da33b02302f2cc80c7fa7a6128a56aad4a6073cfcf63071b0258cc3702ba3d7ff9a9f66978552717756594db3aabac75017f8817019580
+  data.tar.gz: f292f4450acb8ff62507d52c8324d8861d1f91d2615b397cefbfbe9148127bd036dd19535d1958c17118d25de29e04863f4a6e38a03dcb257ca91e8177c128d7

data/CHANGELOG.md CHANGED

@@ -1,5 +1,9 @@
 # CHANGELOG
+## v3.0.1
+* Fix cleanup in `MapReduce::Mapper#shuffle`
 ## v3.0.0
 * [BREAKING] `MapReduce::Mapper#shuffle` now yields a hash of (partition, path)

data/Gemfile.lock CHANGED

@@ -1,7 +1,7 @@
 PATH
   remote: .
   specs:
-    map-reduce-ruby (3.0.0)
+    map-reduce-ruby (3.0.1)
       json
       lazy_priority_queue

data/README.md CHANGED

@@ -54,7 +54,7 @@ Next, we need some worker code to run the mapping part:
 ```ruby
 class WordCountMapper
   def perform(job_id, mapper_id, url)
-    mapper = MapReduce::Mapper.new(WordCounter.new, partitioner: MapReduce::HashPartitioner.new(16), memory_limit: 100.megabytes)
+    mapper = MapReduce::Mapper.new(WordCounter.new, partitioner: MapReduce::HashPartitioner.new(16), memory_limit: 10.megabytes)
     mapper.map(url)
     mapper.shuffle(chunk_limit: 64) do |partitions|
@@ -68,8 +68,8 @@ end
 ```
 Please note that `MapReduce::HashPartitioner.new(16)` states that we want to
-split the dataset into 16 partitions (i.e. 0, 1, ... 15). Finally, we need some
-worker code to run the reduce part:
+split the dataset into 16 partitions (i.e. 0, 1, ... 15). Finally, we need
+some worker code to run the reduce part:
 ```ruby
 class WordCountReducer
@@ -146,9 +146,22 @@ workings of MapReduce. Of course, feel free to check the code as well.
 `MapReduce::Mapper#map` calls your `map` implementation and adds each yielded
 key-value pair to an internal buffer up until the memory limit is reached.
-When the memory limit is reached, the buffer is sorted by key and fed through
-your `reduce` implementation already, as this can greatly reduce the amount of
-data already. The result is written to a tempfile. This proceeds up until all
+More concretely, it specifies how big the file size of a temporary chunk can
+grow in memory up until it must be written to disk. However, ruby of course
+allocates much more memory for a chunk than the raw file size of the chunk. As
+a rule of thumb, it allocates 10 times more memory. Still, choosing a value for
+`memory_size` depends on the memory size of your container/server, how much
+worker threads your background queue spawns and how much memory your workers
+need besides map/reduce. Let's say your container/server has 2 gigabytes of
+memory and your background framework spawns 5 threads. Theoretically, you might
+be able to give 300-400 megabytes, but now divide this by 10 and specify a
+`memory_limit` of around `30.megabytes`, better less. The `memory_limit`
+affects how much chunks will be written to disk depending on the data size you
+are processing and how big these chunks are. The smaller the value, the more
+chunks and the more chunks, the more runs are needed to merge the chunks. When
+the memory limit is reached, the buffer is sorted by key and fed through your
+`reduce` implementation already, as this can greatly reduce the amount of data
+already. The result is written to a tempfile. This proceeds up until all
 key-value pairs are yielded. `MapReduce::Mapper#shuffle` then reads the first
 key-value pair of all already sorted chunk tempfiles and adds them to a
 priority queue using a binomial heap, such that with every `pop` operation on
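
Reviewer note: the sizing rule of thumb added to the README above can be worked through as plain arithmetic. The following is a minimal sketch, not part of map-reduce-ruby. The helper name `suggested_memory_limit` and the 512 MB reserved for everything other than map/reduce are assumptions made for illustration; the factor of 10 is the README's own rule of thumb.

```ruby
# Hypothetical helper, not part of the map-reduce-ruby API.
MEGABYTE = 1024 * 1024

# total_bytes:     memory of the container/server
# threads:         worker threads spawned by the background framework
# reserved_bytes:  assumed memory needed by the workers besides map/reduce
# overhead_factor: README rule of thumb (Ruby allocates roughly 10x the raw chunk size)
def suggested_memory_limit(total_bytes:, threads:, reserved_bytes: 0, overhead_factor: 10)
  per_thread = (total_bytes - reserved_bytes) / threads
  per_thread / overhead_factor
end

# README scenario: 2 gigabytes of memory, 5 threads.
limit = suggested_memory_limit(
  total_bytes: 2048 * MEGABYTE,
  threads: 5,
  reserved_bytes: 512 * MEGABYTE # assumption, chosen to land in the README's 300-400 MB range
)

puts limit / MEGABYTE # prints 30
```

Under these assumptions each thread gets about 307 MB, which lands on roughly 30 megabytes after dividing by 10. The README suggests going lower still, which fits the new 16 MB default in `mapper.rb`.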

data/lib/map_reduce/mapper.rb CHANGED

@@ -16,9 +16,9 @@ module MapReduce
     #   bytes.
     #
     # @example
-    #  MapReduce::Mapper.new(MyImplementation.new, partitioner: HashPartitioner.new(16), memory_limit: 100.megabytes)
+    #  MapReduce::Mapper.new(MyImplementation.new, partitioner: HashPartitioner.new(16), memory_limit: 16.megabytes)
-    def initialize(implementation, partitioner: HashPartitioner.new(32), memory_limit: 100 * 1024 * 1024)
+    def initialize(implementation, partitioner: HashPartitioner.new(32), memory_limit: 16 * 1024 * 1024)
       super()
       @implementation = implementation
@@ -86,7 +86,7 @@ module MapReduce
         yield(partitions.transform_values(&:path))
       ensure
-        partitions.each_value(&:delete)
+        partitions&.each_value(&:delete)
         @chunks.each(&:delete)
         @chunks = []
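
Reviewer note: the `&.` change in the `ensure` block presumably guards against `partitions` still being `nil` when an error is raised before it is assigned. The diff does not show the surrounding method body, so the following is only a reduced sketch of the pattern with hypothetical names (`build_partitions`, `shuffle_sketch`, `error_class_for`), not the gem's code.

```ruby
# Hypothetical stand-in for whatever builds the partition tempfiles.
def build_partitions
  raise IOError, "simulated failure while merging chunks"
end

# Reduced version of the shuffle/ensure pattern. When build_partitions raises,
# the local `partitions` is never assigned and is therefore nil in `ensure`.
def shuffle_sketch(guarded:)
  partitions = build_partitions
  yield partitions
ensure
  if guarded
    partitions&.each_value(&:delete) # 3.0.1 style: skipped when partitions is nil
  else
    partitions.each_value(&:delete)  # 3.0.0 style: raises NoMethodError on nil
  end
end

# Returns the class of the error that reaches the caller.
def error_class_for(guarded:)
  shuffle_sketch(guarded: guarded) { |partitions| partitions }
  nil
rescue StandardError => e
  e.class
end

puts error_class_for(guarded: false) # prints NoMethodError
puts error_class_for(guarded: true)  # prints IOError
```

Without the guard, the `NoMethodError` raised inside `ensure` replaces the original exception, so the caller never sees the real cause. With the guard, the original `IOError` propagates and the remaining cleanup still runs.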

data/lib/map_reduce/reducer.rb CHANGED

@@ -18,8 +18,7 @@ module MapReduce
       super()
       @implementation = implementation
-      @temp_paths ||= []
+      @temp_paths = []
     end
     # Adds a chunk from the mapper-phase to the reducer by registering a
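
Reviewer note: on a freshly allocated object `@temp_paths` is unset, so `||=` and `=` behave the same in `initialize` and the change reads mainly as a clarity fix. The two only differ if `initialize` runs again on the same object. The following is a small sketch using hypothetical classes (`ConditionalInit`, `PlainInit`), not the gem's `Reducer`.

```ruby
# Hypothetical classes illustrating the two assignment styles.
class ConditionalInit
  attr_reader :temp_paths

  def initialize
    @temp_paths ||= [] # keeps an existing array if one is already set
  end
end

class PlainInit
  attr_reader :temp_paths

  def initialize
    @temp_paths = [] # always starts from an empty array
  end
end

conditional = ConditionalInit.new
conditional.temp_paths << "chunk.json"
conditional.send(:initialize) # re-run initialize on the same object
p conditional.temp_paths      # prints ["chunk.json"]

plain = PlainInit.new
plain.temp_paths << "chunk.json"
plain.send(:initialize)
p plain.temp_paths            # prints []
```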

data/lib/map_reduce/version.rb CHANGED

@@ -1,3 +1,3 @@
 module MapReduce
-  VERSION = "3.0.0"
+  VERSION = "3.0.1"
 end

metadata CHANGED

@@ -1,14 +1,14 @@
 --- !ruby/object:Gem::Specification
 name: map-reduce-ruby
 version: !ruby/object:Gem::Version
-  version: 3.0.0
+  version: 3.0.1
 platform: ruby
 authors:
 - Benjamin Vetter
 autorequire:
 bindir: exe
 cert_chain: []
-date: 2022-11-01 00:00:00.000000000 Z
+date: 2022-11-18 00:00:00.000000000 Z
 dependencies:
 - !ruby/object:Gem::Dependency
   name: rspec