map-reduce-ruby 2.1.1 → 3.0.1
- checksums.yaml +4 -4
- data/.rubocop.yml +4 -1
- data/CHANGELOG.md +19 -0
- data/Gemfile.lock +1 -1
- data/README.md +30 -11
- data/lib/map_reduce/mapper.rb +59 -35
- data/lib/map_reduce/mergeable.rb +54 -12
- data/lib/map_reduce/reducer.rb +16 -21
- data/lib/map_reduce/version.rb +1 -1
- data/lib/map_reduce.rb +4 -1
- metadata +2 -2
checksums.yaml
CHANGED
@@ -1,7 +1,7 @@
 ---
 SHA256:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: f16a0ed8420d8b3867cdb41be00498d367b7b9e0c1b4af05a8230b3b32762794
+  data.tar.gz: 2c02d6dcb2819ccd741498d500e9f645673daf65bee9181ff74e1e73a11ebaae
 SHA512:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: 47534a4ebc188caaa1da33b02302f2cc80c7fa7a6128a56aad4a6073cfcf63071b0258cc3702ba3d7ff9a9f66978552717756594db3aabac75017f8817019580
+  data.tar.gz: f292f4450acb8ff62507d52c8324d8861d1f91d2615b397cefbfbe9148127bd036dd19535d1958c17118d25de29e04863f4a6e38a03dcb257ca91e8177c128d7
data/.rubocop.yml
CHANGED
@@ -45,7 +45,7 @@ Style/StringLiteralsInInterpolation:
   EnforcedStyle: double_quotes
 
 Layout/LineLength:
-  Max:
+  Max: 250
 
 Style/FrozenStringLiteralComment:
   EnforcedStyle: never
@@ -55,3 +55,6 @@ Style/ObjectThen:
 
 Gemspec/RequireMFA:
   Enabled: false
+
+Style/HashTransformValues:
+  Enabled: false
data/CHANGELOG.md
CHANGED
@@ -1,5 +1,24 @@
 # CHANGELOG
 
+## v3.0.1
+
+* Fix cleanup in `MapReduce::Mapper#shuffle`
+
+## v3.0.0
+
+* [BREAKING] `MapReduce::Mapper#shuffle` now yields a hash of (partition, path)
+  pairs, which e.g. allows uploading the files in parallel
+* [BREAKING] `MapReduce::Mapper#shuffle` now requires a `chunk_limit`. This
+  allows further limiting the maximum number of open file descriptors
+* [BREAKING] `MapReduce::Mapper#shuffle` no longer returns an `Enumerator` when
+  no block is given
+* [BREAKING] `MapReduce::Reducer::InvalidChunkLimit` is now
+  `MapReduce::InvalidChunkLimit` and inherits from `MapReduce::Error`, the
+  base class for all errors
+* `MapReduce::Mapper#shuffle` no longer keeps all partition files open.
+  Instead, it writes them one after another to strictly reduce the number of
+  open file descriptors.
+
 ## v2.1.1
 
 * Fix in `MapReduce::Mapper` when no `reduce` implementation is given
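Taken together, the v3.0.0 entries above change the shape of the `MapReduce::Mapper#shuffle` call. A minimal sketch of the new call, assuming a `mapper` already populated via `map`; `bucket` and the object key layout are illustrative S3-style assumptions, not part of the gem:

```ruby
# Sketch only: `bucket` is an assumed S3-style client.
mapper.shuffle(chunk_limit: 64) do |partitions|
  # partitions is a hash of (partition, path) pairs, so the
  # uploads can be parallelized if desired
  partitions.each do |partition, path|
    bucket.object("partitions/#{partition}/chunk.json").put(body: File.open(path))
  end
end
```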
data/Gemfile.lock
CHANGED
data/README.md
CHANGED
@@ -54,20 +54,22 @@ Next, we need some worker code to run the mapping part:
 ```ruby
 class WordCountMapper
   def perform(job_id, mapper_id, url)
-    mapper = MapReduce::Mapper.new(WordCounter.new, partitioner: MapReduce::HashPartitioner.new(16), memory_limit:
+    mapper = MapReduce::Mapper.new(WordCounter.new, partitioner: MapReduce::HashPartitioner.new(16), memory_limit: 10.megabytes)
     mapper.map(url)
 
-    mapper.shuffle do |
-
-
+    mapper.shuffle(chunk_limit: 64) do |partitions|
+      partitions.each do |partition, path|
+        # store content of the tempfile located at path e.g. on s3:
+        bucket.object("map_reduce/jobs/#{job_id}/partitions/#{partition}/chunk.#{mapper_id}.json").put(body: File.open(path))
+      end
     end
   end
 end
 ```
 
 Please note that `MapReduce::HashPartitioner.new(16)` states that we want to
-split the dataset into 16 partitions (i.e. 0, 1, ... 15). Finally, we need
-worker code to run the reduce part:
+split the dataset into 16 partitions (i.e. 0, 1, ... 15). Finally, we need
+some worker code to run the reduce part:
 
 ```ruby
 class WordCountReducer
@@ -144,9 +146,22 @@ workings of MapReduce. Of course, feel free to check the code as well.
 
 `MapReduce::Mapper#map` calls your `map` implementation and adds each yielded
 key-value pair to an internal buffer up until the memory limit is reached.
-
-
-
+More concretely, `memory_limit` specifies how big the file size of a temporary
+chunk can grow in memory before it must be written to disk. However, Ruby of
+course allocates much more memory for a chunk than the raw file size of the
+chunk. As a rule of thumb, it allocates 10 times more memory. Still, choosing
+a value for `memory_limit` depends on the memory size of your container/server,
+how many worker threads your background queue spawns and how much memory your
+workers need besides map/reduce. Let's say your container/server has 2
+gigabytes of memory and your background framework spawns 5 threads.
+Theoretically, you might be able to give each 300-400 megabytes, but now
+divide this by 10 and specify a `memory_limit` of around `30.megabytes`,
+better less. The `memory_limit` affects how many chunks will be written to
+disk, depending on the size of the data you are processing and how big these
+chunks are. The smaller the value, the more chunks; and the more chunks, the
+more runs are needed to merge them. When the memory limit is reached, the
+buffer is sorted by key and already fed through your `reduce` implementation,
+as this can greatly reduce the amount of data. The result is written to a
+tempfile. This proceeds up until all
 key-value pairs are yielded. `MapReduce::Mapper#shuffle` then reads the first
 key-value pair of all already sorted chunk tempfiles and adds them to a
 priority queue using a binomial heap, such that with every `pop` operation on
@@ -205,6 +220,10 @@ interface of callables, could even be expressed as a simple one-liner:
 MyPartitioner = proc { |key| Digest::SHA1.hexdigest(JSON.generate(key))[0..4].to_i(16) % 8 }
 ```
 
+## Semantic Versioning
+
+MapReduce is using Semantic Versioning: [SemVer](http://semver.org/)
+
 ## Development
 
 After checking out the repo, run `bin/setup` to install dependencies. Then, run
@@ -224,5 +243,5 @@ https://github.com/mrkamel/map-reduce-ruby
 
 ## License
 
-The gem is available as open source under the terms of the
-License](https://opensource.org/licenses/MIT).
+The gem is available as open source under the terms of the
+[MIT License](https://opensource.org/licenses/MIT).
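The sizing rule of thumb in the README hunk above can be made concrete with a little arithmetic. A sketch assuming the README's example numbers (2 GB container, 5 worker threads, ~10x allocation overhead); the variable names are illustrative:

```ruby
# Rough memory_limit sizing per the README's rule of thumb.
container_memory = 2 * 1024**3 # 2 gigabytes (assumed)
worker_threads   = 5           # background queue threads (assumed)

per_thread   = container_memory / worker_threads # ~400 megabytes per thread
memory_limit = per_thread / 10                   # ruby allocates ~10x the raw chunk size

puts memory_limit / 1024**2 # => 40, so pick ~30.megabytes or less for headroom
```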
data/lib/map_reduce/mapper.rb
CHANGED
@@ -6,8 +6,6 @@ module MapReduce
     include Reduceable
     include MonitorMixin
 
-    attr_reader :partitions
-
     # Initializes a new mapper.
     #
     # @param implementation Your map-reduce implementation, i.e. an object
@@ -18,9 +16,9 @@ module MapReduce
     #   bytes.
     #
     # @example
-    #   MapReduce::Mapper.new(MyImplementation.new, partitioner: HashPartitioner.new(16), memory_limit:
+    #   MapReduce::Mapper.new(MyImplementation.new, partitioner: HashPartitioner.new(16), memory_limit: 16.megabytes)
 
-    def initialize(implementation, partitioner: HashPartitioner.new(32), memory_limit:
+    def initialize(implementation, partitioner: HashPartitioner.new(32), memory_limit: 16 * 1024 * 1024)
       super()
 
       @implementation = implementation
@@ -45,9 +43,11 @@ module MapReduce
     def map(*args, **kwargs)
       @implementation.map(*args, **kwargs) do |new_key, new_value|
         synchronize do
-          @
+          partition = @partitioner.call(new_key)
+          item = [[partition, new_key], new_value]
 
-          @
+          @buffer.push(item)
+          @buffer_size += JSON.generate(item).bytesize
 
           write_chunk if @buffer_size >= @memory_limit
         end
@@ -55,62 +55,86 @@ module MapReduce
     end
 
     # Performs a k-way-merge of the sorted chunks written to tempfiles while
-    # already reducing the result using your map-reduce implementation
-    # splitting the dataset into partitions. Finally yields
-    #
+    # already reducing the result using your map-reduce implementation (if
+    # available) and splitting the dataset into partitions. Finally yields a
+    # hash of (partition, path) pairs containing the data for the partitions
+    # in tempfiles.
+    #
+    # @param chunk_limit [Integer] The maximum number of files to process
+    #   at the same time. Most useful when you run on a system where the
+    #   number of open file descriptors is limited. If your number of file
+    #   descriptors is unlimited, you want to set it to a higher number to
+    #   avoid the overhead of multiple runs.
     #
     # @example
-    #   mapper.shuffle do |
-    #
+    #   mapper.shuffle do |partitions|
+    #     partitions.each do |partition, path|
+    #       # store data e.g. on s3
+    #     end
     #   end
 
-    def shuffle(
-
+    def shuffle(chunk_limit:)
+      raise(InvalidChunkLimit, "Chunk limit must be >= 2") unless chunk_limit >= 2
 
-
+      begin
+        write_chunk if @buffer_size > 0
 
-
+        chunk = k_way_merge(@chunks, chunk_limit: chunk_limit)
+        chunk = reduce_chunk(chunk, @implementation) if @implementation.respond_to?(:reduce)
 
-
-      chunk = reduce_chunk(chunk, @implementation) if @implementation.respond_to?(:reduce)
+        partitions = split_chunk(chunk)
 
-
-
+        yield(partitions.transform_values(&:path))
+      ensure
+        partitions&.each_value(&:delete)
 
-
+        @chunks.each(&:delete)
+        @chunks = []
       end
 
-
-
+      nil
+    end
+
+    private
+
+    def split_chunk(chunk)
+      res = {}
+      current_partition = nil
+      file = nil
 
-
-
+      chunk.each do |((new_partition, key), value)|
+        if new_partition != current_partition
+          file&.close
+
+          current_partition = new_partition
+          temp_path = TempPath.new
+          res[new_partition] = temp_path
+          file = File.open(temp_path.path, "w+")
+        end
+
+        file.puts(JSON.generate([key, value]))
       end
 
-
+      file&.close
 
-
+      res
     end
 
-    private
-
     def write_chunk
-
+      temp_path = TempPath.new
 
       @buffer.sort_by!(&:first)
 
       chunk = @buffer
       chunk = reduce_chunk(chunk, @implementation) if @implementation.respond_to?(:reduce)
 
-
-
+      File.open(temp_path.path, "w+") do |file|
+        chunk.each do |pair|
+          file.puts JSON.generate(pair)
+        end
       end
 
-
-
-      @chunks.push(tempfile)
+      @chunks.push(temp_path)
 
       @buffer_size = 0
       @buffer = []
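For reference, a self-contained usage sketch of the mapper code shown above; `WordCounter` and the input string are illustrative assumptions in the spirit of the README's word-count example:

```ruby
require "map_reduce"

# Illustrative implementation: counts words yielded by #map.
class WordCounter
  def map(text)
    text.split.each { |word| yield(word, 1) }
  end

  def reduce(_key, count1, count2)
    count1 + count2
  end
end

mapper = MapReduce::Mapper.new(WordCounter.new, partitioner: MapReduce::HashPartitioner.new(4), memory_limit: 16 * 1024 * 1024)
mapper.map("the quick brown fox jumps over the lazy dog")

mapper.shuffle(chunk_limit: 8) do |partitions|
  partitions.each do |partition, path|
    # The tempfiles are deleted once the block returns (see the ensure
    # clause in the diff above), so consume them inside the block.
    puts "partition #{partition}: #{File.read(path)}"
  end
end
```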
data/lib/map_reduce/mergeable.rb
CHANGED
@@ -5,20 +5,62 @@ module MapReduce
   module Mergeable
     private
 
-    # Performs the k-way-merge of the passed files
-    # a binomial heap. The content of the passed
-    # starts by reading one item of each file and
-    # queue. Afterwards, it continously pops an item
-    # and reads a new item from the file the popped
-    # read item to the queue. This continues up
-    # have been read. This guarantees that the
-    # sorted without having all items in-memory.
+    # Performs the k-way-merge of the passed files referenced by the temp paths
+    # using a priority queue backed by a binomial heap. The content of the
+    # passed files needs to be sorted. It starts by reading one item of each
+    # file and adding it to the priority queue. Afterwards, it continuously
+    # pops an item from the queue, yields it and reads a new item from the file
+    # the popped item belongs to, adding the read item to the queue. This
+    # continues up until all items from the files have been read. This
+    # guarantees that the yielded key-value pairs are sorted without having all
+    # items in-memory.
     #
-    # @param
-    #   content of the files must be sorted.
+    # @param temp_paths [TempPath] The files referenced by the temp paths to
+    #   run the k-way-merge for. The content of the files must be sorted.
+    # @param chunk_limit [Integer] The maximum number of files to process
+    #   at the same time. Most useful when you run on a system where the
+    #   number of open file descriptors is limited. If your number of file
+    #   descriptors is unlimited, you want to set it to a higher number to
+    #   avoid the overhead of multiple runs.
 
-    def k_way_merge(
-      return enum_for(
+    def k_way_merge(temp_paths, chunk_limit:, &block)
+      return enum_for(__method__, temp_paths, chunk_limit: chunk_limit) unless block_given?
+
+      dupped_temp_paths = temp_paths.dup
+      additional_temp_paths = []
+
+      while dupped_temp_paths.size > chunk_limit
+        temp_path_out = TempPath.new
+
+        File.open(temp_path_out.path, "w+") do |file|
+          files = dupped_temp_paths.shift(chunk_limit).map { |temp_path| File.open(temp_path.path, "r") }
+
+          k_way_merge!(files) do |pair|
+            file.puts(JSON.generate(pair))
+          end
+
+          files.each(&:close)
+        end
+
+        dupped_temp_paths.push(temp_path_out)
+        additional_temp_paths.push(temp_path_out)
+      end
+
+      files = dupped_temp_paths.map { |temp_path| File.open(temp_path.path, "r") }
+      k_way_merge!(files, &block)
+      files.each(&:close)
+
+      nil
+    ensure
+      additional_temp_paths&.each(&:delete)
+    end
+
+    # Performs the actual k-way-merge of the specified files.
+    #
+    # @param files [IO, Tempfile] The files to run the k-way-merge for.
+    #   The content of the files must be sorted.
+
+    def k_way_merge!(files)
+      return enum_for(__method__, files) unless block_given?
 
       if files.size == 1
         files.first.each_line do |line|
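The doc comment above describes the merge procedure; here is a dependency-free toy sketch of the same idea. A sorted array stands in for the binomial-heap priority queue, and file contents are JSON lines as in the gem; the method name is illustrative:

```ruby
require "json"

# Toy k-way-merge over already-sorted files of JSON lines. A real
# implementation pops from a heap; re-sorting the queue each round is
# just the simplest stand-in.
def k_way_merge_sketch(files)
  queue = []

  # Seed the queue with one item per file.
  files.each_with_index do |file, index|
    line = file.gets
    queue.push([JSON.parse(line), index]) if line
  end

  until queue.empty?
    queue.sort_by!(&:first)
    item, index = queue.shift
    yield item

    # Refill from the file the popped item came from.
    line = files[index].gets
    queue.push([JSON.parse(line), index]) if line
  end
end
```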
data/lib/map_reduce/reducer.rb
CHANGED
@@ -6,8 +6,6 @@ module MapReduce
     include Reduceable
     include MonitorMixin
 
-    class InvalidChunkLimit < StandardError; end
-
     # Initializes a new reducer.
     #
     # @param implementation Your map-reduce implementation, i.e. an object
@@ -20,8 +18,7 @@ module MapReduce
       super()
 
       @implementation = implementation
-
-      @temp_paths ||= []
+      @temp_paths = []
     end
 
     # Adds a chunk from the mapper-phase to the reducer by registering a
@@ -70,38 +67,36 @@ module MapReduce
     #   end
 
     def reduce(chunk_limit:, &block)
-      return enum_for(
+      return enum_for(__method__, chunk_limit: chunk_limit) unless block_given?
 
       raise(InvalidChunkLimit, "Chunk limit must be >= 2") unless chunk_limit >= 2
 
       begin
         loop do
           slice = @temp_paths.shift(chunk_limit)
-          files = slice.select { |temp_path| File.exist?(temp_path.path) }
-                       .map { |temp_path| File.open(temp_path.path, "r") }
-
-          begin
-            if @temp_paths.empty?
-              reduce_chunk(k_way_merge(files), @implementation).each do |pair|
-                block.call(pair)
-              end
 
-
+          if @temp_paths.empty?
+            reduce_chunk(k_way_merge(slice, chunk_limit: chunk_limit), @implementation).each do |pair|
+              block.call(pair)
             end
 
-
-
-
-
+            return
+          end
+
+          File.open(add_chunk, "w+") do |file|
+            reduce_chunk(k_way_merge(slice, chunk_limit: chunk_limit), @implementation).each do |pair|
+              file.puts JSON.generate(pair)
            end
-          ensure
-            files.each(&:close)
-            slice.each(&:delete)
          end
+        ensure
+          slice&.each(&:delete)
        end
      ensure
        @temp_paths.each(&:delete)
+        @temp_paths = []
      end
+
+      nil
    end
  end
end
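A matching usage sketch for the reducer side of the diff above; the local chunk files are assumed to have been produced by a mapper's shuffle (e.g. downloaded from s3 first), and `WordCounter` is the illustrative implementation from the mapper sketch:

```ruby
reducer = MapReduce::Reducer.new(WordCounter.new)

# Register each mapper chunk for this partition; add_chunk registers and
# returns the path of a tempfile to write the chunk's data to.
["chunk.0.json", "chunk.1.json"].each do |source| # assumed local files
  File.write(reducer.add_chunk, File.read(source))
end

reducer.reduce(chunk_limit: 32) do |key, value|
  puts "#{key}: #{value}"
end
```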
data/lib/map_reduce/version.rb
CHANGED
data/lib/map_reduce.rb
CHANGED
metadata
CHANGED
@@ -1,14 +1,14 @@
 --- !ruby/object:Gem::Specification
 name: map-reduce-ruby
 version: !ruby/object:Gem::Version
-  version: 2.1.1
+  version: 3.0.1
 platform: ruby
 authors:
 - Benjamin Vetter
 autorequire:
 bindir: exe
 cert_chain: []
-date: 2022-
+date: 2022-11-18 00:00:00.000000000 Z
 dependencies:
 - !ruby/object:Gem::Dependency
   name: rspec