map-reduce-ruby 3.0.0 → 3.0.1
- checksums.yaml +4 -4
- data/CHANGELOG.md +4 -0
- data/Gemfile.lock +1 -1
- data/README.md +19 -6
- data/lib/map_reduce/mapper.rb +3 -3
- data/lib/map_reduce/reducer.rb +1 -2
- data/lib/map_reduce/version.rb +1 -1
- metadata +2 -2
checksums.yaml
CHANGED
@@ -1,7 +1,7 @@
 ---
 SHA256:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: f16a0ed8420d8b3867cdb41be00498d367b7b9e0c1b4af05a8230b3b32762794
+  data.tar.gz: 2c02d6dcb2819ccd741498d500e9f645673daf65bee9181ff74e1e73a11ebaae
 SHA512:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: 47534a4ebc188caaa1da33b02302f2cc80c7fa7a6128a56aad4a6073cfcf63071b0258cc3702ba3d7ff9a9f66978552717756594db3aabac75017f8817019580
+  data.tar.gz: f292f4450acb8ff62507d52c8324d8861d1f91d2615b397cefbfbe9148127bd036dd19535d1958c17118d25de29e04863f4a6e38a03dcb257ca91e8177c128d7
data/CHANGELOG.md
CHANGED
data/Gemfile.lock
CHANGED
data/README.md
CHANGED
@@ -54,7 +54,7 @@ Next, we need some worker code to run the mapping part:
 ```ruby
 class WordCountMapper
   def perform(job_id, mapper_id, url)
-    mapper = MapReduce::Mapper.new(WordCounter.new, partitioner: MapReduce::HashPartitioner.new(16), memory_limit:
+    mapper = MapReduce::Mapper.new(WordCounter.new, partitioner: MapReduce::HashPartitioner.new(16), memory_limit: 10.megabytes)
     mapper.map(url)
 
     mapper.shuffle(chunk_limit: 64) do |partitions|
@@ -68,8 +68,8 @@ end
 ```
 
 Please note that `MapReduce::HashPartitioner.new(16)` states that we want to
-split the dataset into 16 partitions (i.e. 0, 1, ... 15). Finally, we need
-worker code to run the reduce part:
+split the dataset into 16 partitions (i.e. 0, 1, ... 15). Finally, we need
+some worker code to run the reduce part:
 
 ```ruby
 class WordCountReducer
@@ -146,9 +146,22 @@ workings of MapReduce. Of course, feel free to check the code as well.
 
 `MapReduce::Mapper#map` calls your `map` implementation and adds each yielded
 key-value pair to an internal buffer up until the memory limit is reached.
-
-
-
+More concretely, it specifies how big the file size of a temporary chunk can
+grow in memory up until it must be written to disk. However, ruby of course
+allocates much more memory for a chunk than the raw file size of the chunk. As
+a rule of thumb, it allocates 10 times more memory. Still, choosing a value for
+`memory_size` depends on the memory size of your container/server, how much
+worker threads your background queue spawns and how much memory your workers
+need besides map/reduce. Let's say your container/server has 2 gigabytes of
+memory and your background framework spawns 5 threads. Theoretically, you might
+be able to give 300-400 megabytes, but now divide this by 10 and specify a
+`memory_limit` of around `30.megabytes`, better less. The `memory_limit`
+affects how much chunks will be written to disk depending on the data size you
+are processing and how big these chunks are. The smaller the value, the more
+chunks and the more chunks, the more runs are needed to merge the chunks. When
+the memory limit is reached, the buffer is sorted by key and fed through your
+`reduce` implementation already, as this can greatly reduce the amount of data
+already. The result is written to a tempfile. This proceeds up until all
 key-value pairs are yielded. `MapReduce::Mapper#shuffle` then reads the first
 key-value pair of all already sorted chunk tempfiles and adds them to a
 priority queue using a binomial heap, such that with every `pop` operation on
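The shuffle step described in the README text above appears to be a k-way merge over already sorted chunks. The following is a minimal sketch of that idea and not the gem's implementation: in-memory arrays stand in for the sorted chunk tempfiles, and a linear scan for the smallest pending key replaces the binomial-heap priority queue the README mentions. The name `merge_sorted_chunks` is hypothetical.

```ruby
# Hypothetical illustration of merging sorted chunks; not part of map-reduce-ruby.
# Each chunk is an array of [key, value] pairs that is already sorted by key.
def merge_sorted_chunks(chunks)
  # positions[i] is the index of the next unread pair in chunks[i].
  positions = Array.new(chunks.size, 0)
  merged = []

  loop do
    # Indices of the chunks that still have unread pairs.
    pending = chunks.each_index.select { |i| positions[i] < chunks[i].size }
    break if pending.empty?

    # Linear scan for the smallest current key. A heap-based priority queue
    # would find it in O(log k) instead of O(k) for k chunks.
    smallest = pending.min_by { |i| chunks[i][positions[i]][0] }
    merged << chunks[smallest][positions[smallest]]
    positions[smallest] += 1
  end

  merged
end

chunks = [
  [["apple", 2], ["cherry", 1]],
  [["banana", 3], ["cherry", 4]]
]

merge_sorted_chunks(chunks).each { |key, value| puts "#{key}: #{value}" }
```

Pairs with equal keys end up next to each other in the merged output, which is what lets a later reduce step combine them in a single pass.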
data/lib/map_reduce/mapper.rb
CHANGED
@@ -16,9 +16,9 @@ module MapReduce
     # bytes.
     #
     # @example
-    # MapReduce::Mapper.new(MyImplementation.new, partitioner: HashPartitioner.new(16), memory_limit:
+    # MapReduce::Mapper.new(MyImplementation.new, partitioner: HashPartitioner.new(16), memory_limit: 16.megabytes)
 
-    def initialize(implementation, partitioner: HashPartitioner.new(32), memory_limit:
+    def initialize(implementation, partitioner: HashPartitioner.new(32), memory_limit: 16 * 1024 * 1024)
       super()
 
       @implementation = implementation
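The default shown above, `16 * 1024 * 1024` bytes, can be compared against the README's rule of thumb: take the memory available per worker thread and divide it by roughly 10. Below is a minimal sketch of that arithmetic in plain Ruby. The helper `suggested_memory_limit`, its `reserved_fraction` parameter and the 25% reserve are hypothetical choices for illustration and are not part of map-reduce-ruby.

```ruby
# Hypothetical helper illustrating the README's rule of thumb; not part of the gem.
MEGABYTE = 1024 * 1024

# total_bytes:       memory of the container/server
# worker_threads:    threads spawned by the background queue
# reserved_fraction: assumed share each thread needs besides map/reduce
# overhead_factor:   Ruby allocates roughly 10x the raw chunk size (rule of thumb)
def suggested_memory_limit(total_bytes:, worker_threads:, reserved_fraction: 0.25, overhead_factor: 10)
  per_thread = total_bytes / worker_threads
  usable = (per_thread * (1 - reserved_fraction)).floor
  usable / overhead_factor
end

# The README's example: 2 gigabytes of memory and 5 worker threads.
limit = suggested_memory_limit(total_bytes: 2048 * MEGABYTE, worker_threads: 5)
puts "suggested memory_limit: about #{limit / MEGABYTE} megabytes"
```

With these assumed inputs the result is close to the README's suggestion of around `30.megabytes`. Literals such as `16.megabytes` rely on ActiveSupport's numeric extensions, so code without ActiveSupport needs the explicit multiplication used in the default.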
@@ -86,7 +86,7 @@ module MapReduce
 
       yield(partitions.transform_values(&:path))
     ensure
-      partitions
+      partitions&.each_value(&:delete)
 
       @chunks.each(&:delete)
       @chunks = []
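The added line `partitions&.each_value(&:delete)` uses safe navigation, which matters if `partitions` is still `nil` when the `ensure` block runs, for example because an error was raised before the variable was assigned. The sketch below illustrates that Ruby behavior with a hypothetical `shuffle_sketch` method and a single `Tempfile`. It is not the gem's code.

```ruby
require "tempfile"

# Hypothetical stand-in for a shuffle-like method; not the gem's implementation.
def shuffle_sketch(fail_early: false)
  raise "simulated failure before partitions exist" if fail_early

  partitions = { 0 => Tempfile.new("partition") }
  yield(partitions.transform_values(&:path))
ensure
  # partitions is nil if the raise above fired before the assignment.
  # &. skips the call in that case, so the original error is not masked
  # by a NoMethodError raised from the cleanup code.
  partitions&.each_value(&:delete)
end

seen_paths = nil
shuffle_sketch { |paths| seen_paths = paths }
puts File.exist?(seen_paths[0]) # the tempfile is removed once the block returns

begin
  shuffle_sketch(fail_early: true) { |paths| paths }
rescue RuntimeError => e
  puts e.message
end
```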
data/lib/map_reduce/reducer.rb
CHANGED
data/lib/map_reduce/version.rb
CHANGED
metadata
CHANGED
@@ -1,14 +1,14 @@
 --- !ruby/object:Gem::Specification
 name: map-reduce-ruby
 version: !ruby/object:Gem::Version
-  version: 3.0.
+  version: 3.0.1
 platform: ruby
 authors:
 - Benjamin Vetter
 autorequire:
 bindir: exe
 cert_chain: []
-date: 2022-11-
+date: 2022-11-18 00:00:00.000000000 Z
 dependencies:
 - !ruby/object:Gem::Dependency
   name: rspec