map-reduce-ruby 3.0.0 → 3.0.1
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- checksums.yaml +4 -4
- data/CHANGELOG.md +4 -0
- data/Gemfile.lock +1 -1
- data/README.md +19 -6
- data/lib/map_reduce/mapper.rb +3 -3
- data/lib/map_reduce/reducer.rb +1 -2
- data/lib/map_reduce/version.rb +1 -1
- metadata +2 -2
checksums.yaml
CHANGED

```diff
@@ -1,7 +1,7 @@
 ---
 SHA256:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: f16a0ed8420d8b3867cdb41be00498d367b7b9e0c1b4af05a8230b3b32762794
+  data.tar.gz: 2c02d6dcb2819ccd741498d500e9f645673daf65bee9181ff74e1e73a11ebaae
 SHA512:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: 47534a4ebc188caaa1da33b02302f2cc80c7fa7a6128a56aad4a6073cfcf63071b0258cc3702ba3d7ff9a9f66978552717756594db3aabac75017f8817019580
+  data.tar.gz: f292f4450acb8ff62507d52c8324d8861d1f91d2615b397cefbfbe9148127bd036dd19535d1958c17118d25de29e04863f4a6e38a03dcb257ca91e8177c128d7
```
data/CHANGELOG.md
CHANGED
data/Gemfile.lock
CHANGED
data/README.md
CHANGED

````diff
@@ -54,7 +54,7 @@ Next, we need some worker code to run the mapping part:
 ```ruby
 class WordCountMapper
   def perform(job_id, mapper_id, url)
-    mapper = MapReduce::Mapper.new(WordCounter.new, partitioner: MapReduce::HashPartitioner.new(16), memory_limit:
+    mapper = MapReduce::Mapper.new(WordCounter.new, partitioner: MapReduce::HashPartitioner.new(16), memory_limit: 10.megabytes)
     mapper.map(url)
 
     mapper.shuffle(chunk_limit: 64) do |partitions|
@@ -68,8 +68,8 @@ end
 ```
 
 Please note that `MapReduce::HashPartitioner.new(16)` states that we want to
-split the dataset into 16 partitions (i.e. 0, 1, ... 15). Finally, we need
-worker code to run the reduce part:
+split the dataset into 16 partitions (i.e. 0, 1, ... 15). Finally, we need
+some worker code to run the reduce part:
 
 ```ruby
 class WordCountReducer
@@ -146,9 +146,22 @@ workings of MapReduce. Of course, feel free to check the code as well.
 
 `MapReduce::Mapper#map` calls your `map` implementation and adds each yielded
 key-value pair to an internal buffer up until the memory limit is reached.
-
-
-
+More concretely, it specifies how big the file size of a temporary chunk can
+grow in memory up until it must be written to disk. However, ruby of course
+allocates much more memory for a chunk than the raw file size of the chunk. As
+a rule of thumb, it allocates 10 times more memory. Still, choosing a value for
+`memory_size` depends on the memory size of your container/server, how much
+worker threads your background queue spawns and how much memory your workers
+need besides map/reduce. Let's say your container/server has 2 gigabytes of
+memory and your background framework spawns 5 threads. Theoretically, you might
+be able to give 300-400 megabytes, but now divide this by 10 and specify a
+`memory_limit` of around `30.megabytes`, better less. The `memory_limit`
+affects how much chunks will be written to disk depending on the data size you
+are processing and how big these chunks are. The smaller the value, the more
+chunks and the more chunks, the more runs are needed to merge the chunks. When
+the memory limit is reached, the buffer is sorted by key and fed through your
+`reduce` implementation already, as this can greatly reduce the amount of data
+already. The result is written to a tempfile. This proceeds up until all
 key-value pairs are yielded. `MapReduce::Mapper#shuffle` then reads the first
 key-value pair of all already sorted chunk tempfiles and adds them to a
 priority queue using a binomial heap, such that with every `pop` operation on
````
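The sizing advice added to the README above can be sanity-checked with plain arithmetic. A minimal sketch of that calculation in plain Ruby (the 2-gigabyte container, 5 threads, 300-megabyte share and 10x overhead factor are the README's example numbers, not measurements; `MEGABYTE` is a local helper, not part of the gem):

```ruby
MEGABYTE = 1024 * 1024

container_memory = 2048 * MEGABYTE # total memory of the container/server
worker_threads   = 5               # threads spawned by the background queue
overhead_factor  = 10              # rule of thumb: Ruby allocates ~10x the raw chunk size

# The share we can realistically hand to each thread while leaving headroom
# for the rest of the worker, per the README's 300-400 megabyte estimate:
per_thread = 300 * MEGABYTE

# Divide by the allocation overhead to get a safe chunk file size:
memory_limit = per_thread / overhead_factor

puts memory_limit / MEGABYTE # => 30
```

This lands at the `30.megabytes` (or less) that the README suggests for that setup.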
data/lib/map_reduce/mapper.rb
CHANGED

```diff
@@ -16,9 +16,9 @@ module MapReduce
    # bytes.
    #
    # @example
-   #   MapReduce::Mapper.new(MyImplementation.new, partitioner: HashPartitioner.new(16), memory_limit:
+   #   MapReduce::Mapper.new(MyImplementation.new, partitioner: HashPartitioner.new(16), memory_limit: 16.megabytes)
 
-   def initialize(implementation, partitioner: HashPartitioner.new(32), memory_limit:
+   def initialize(implementation, partitioner: HashPartitioner.new(32), memory_limit: 16 * 1024 * 1024)
      super()
 
      @implementation = implementation
@@ -86,7 +86,7 @@ module MapReduce
 
      yield(partitions.transform_values(&:path))
    ensure
-     partitions
+     partitions&.each_value(&:delete)
 
      @chunks.each(&:delete)
      @chunks = []
```
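The `ensure` fix in the second hunk guards against `partitions` still being `nil` when an exception is raised before it is assigned: without the safe navigation operator (`&.`), the cleanup itself would raise `NoMethodError` and mask the original error. A hypothetical standalone sketch of the pattern (the method and its names are illustrative, not the gem's actual code):

```ruby
require "tempfile"

# If anything raises before `partitions` is assigned, the ensure block runs
# with `partitions` still nil; `partitions&.each_value` then skips cleanup
# instead of raising NoMethodError, so the original exception propagates.
def with_partitions(fail_early: false)
  raise "boom" if fail_early # partitions is still nil at this point

  partitions = { 0 => Tempfile.new, 1 => Tempfile.new }
  yield partitions.transform_values(&:path)
ensure
  partitions&.each_value(&:close!) # deletes the tempfiles; no-op when nil
end
```

With `fail_early: true`, the caller sees the original `"boom"` error rather than a `NoMethodError` raised from the cleanup.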
data/lib/map_reduce/reducer.rb
CHANGED
data/lib/map_reduce/version.rb
CHANGED
metadata
CHANGED

```diff
@@ -1,14 +1,14 @@
 --- !ruby/object:Gem::Specification
 name: map-reduce-ruby
 version: !ruby/object:Gem::Version
-  version: 3.0.0
+  version: 3.0.1
 platform: ruby
 authors:
 - Benjamin Vetter
 autorequire:
 bindir: exe
 cert_chain: []
-date: 2022-11-
+date: 2022-11-18 00:00:00.000000000 Z
 dependencies:
 - !ruby/object:Gem::Dependency
   name: rspec
```