map-reduce-ruby 2.1.1 → 3.0.1

checksums.yaml CHANGED
@@ -1,7 +1,7 @@
 ---
 SHA256:
-  metadata.gz: 5545309a188291db41e8f5fc24af45a8d983c5084f2233d735cab309921c928c
-  data.tar.gz: 779a839704ace3780a304bc7295c8b1f27e834253ee0544e9ff6ae21eda93753
+  metadata.gz: f16a0ed8420d8b3867cdb41be00498d367b7b9e0c1b4af05a8230b3b32762794
+  data.tar.gz: 2c02d6dcb2819ccd741498d500e9f645673daf65bee9181ff74e1e73a11ebaae
 SHA512:
-  metadata.gz: eb577347b7e5c09dd34166e814fc0c50180b6f036fad84de3857dc93d28242be81637d5b0c7f19ea7a846c659eca0c12fcbae01cdac37c4f2bf50c6d9f8f27f6
-  data.tar.gz: 704b5d6a140583099c53902ceaab5af7b45c41c004f159fc215a942a4749063bf0990c2d60cde15af9663da726bab03a1258e085c0e789470150ab96caf895f7
+  metadata.gz: 47534a4ebc188caaa1da33b02302f2cc80c7fa7a6128a56aad4a6073cfcf63071b0258cc3702ba3d7ff9a9f66978552717756594db3aabac75017f8817019580
+  data.tar.gz: f292f4450acb8ff62507d52c8324d8861d1f91d2615b397cefbfbe9148127bd036dd19535d1958c17118d25de29e04863f4a6e38a03dcb257ca91e8177c128d7
data/.rubocop.yml CHANGED
@@ -45,7 +45,7 @@ Style/StringLiteralsInInterpolation:
   EnforcedStyle: double_quotes
 
 Layout/LineLength:
-  Max: 120
+  Max: 250
 
 Style/FrozenStringLiteralComment:
   EnforcedStyle: never
@@ -55,3 +55,6 @@ Style/ObjectThen:
 
 Gemspec/RequireMFA:
   Enabled: false
+
+Style/HashTransformValues:
+  Enabled: false
data/CHANGELOG.md CHANGED
@@ -1,5 +1,24 @@
 # CHANGELOG
 
+## v3.0.1
+
+* Fix cleanup in `MapReduce::Mapper#shuffle`
+
+## v3.0.0
+
+* [BREAKING] `MapReduce::Mapper#shuffle` now yields a hash of (partition, path)
+  pairs, which e.g. allows uploading the files in parallel
+* [BREAKING] `MapReduce::Mapper#shuffle` now requires a `chunk_limit`. This
+  allows further limiting the maximum number of open file descriptors
+* [BREAKING] `MapReduce::Mapper#shuffle` no longer returns an `Enumerator` when
+  no block is given
+* [BREAKING] `MapReduce::Reducer::InvalidChunkLimit` is now
+  `MapReduce::InvalidChunkLimit` and inherits from `MapReduce::Error`, the base
+  class for all errors
+* `MapReduce::Mapper#shuffle` no longer keeps all partition files open.
+  Instead, it writes them one after another to strictly limit the number of
+  open file descriptors.
+
 ## v2.1.1
 
 * Fix in `MapReduce::Mapper` when no `reduce` implementation is given
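
To make the breaking `shuffle` change concrete, here is a migration sketch. It assumes the `WordCounter` implementation from the README below; `upload` is a hypothetical helper, not part of the gem:

```ruby
mapper = MapReduce::Mapper.new(WordCounter.new, partitioner: MapReduce::HashPartitioner.new(16))
mapper.map("http://example.com/some/document")

# v2.x yielded each partition together with an open tempfile:
#
#   mapper.shuffle do |partition, tempfile|
#     upload(partition, tempfile) # hypothetical helper
#   end

# v3.x requires a chunk_limit and yields one hash of (partition, path)
# pairs, so the files can e.g. be uploaded in parallel:
mapper.shuffle(chunk_limit: 64) do |partitions|
  partitions.map do |partition, path|
    Thread.new { upload(partition, File.open(path)) } # hypothetical helper
  end.each(&:join)
end
```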
data/Gemfile.lock CHANGED
@@ -1,7 +1,7 @@
 PATH
   remote: .
   specs:
-    map-reduce-ruby (2.1.1)
+    map-reduce-ruby (3.0.1)
       json
       lazy_priority_queue
 
data/README.md CHANGED
@@ -54,20 +54,22 @@ Next, we need some worker code to run the mapping part:
 ```ruby
 class WordCountMapper
   def perform(job_id, mapper_id, url)
-    mapper = MapReduce::Mapper.new(WordCounter.new, partitioner: MapReduce::HashPartitioner.new(16), memory_limit: 100.megabytes)
+    mapper = MapReduce::Mapper.new(WordCounter.new, partitioner: MapReduce::HashPartitioner.new(16), memory_limit: 10.megabytes)
     mapper.map(url)
 
-    mapper.shuffle do |partition, tempfile|
-      # store content of tempfile e.g. on s3:
-      bucket.object("map_reduce/jobs/#{job_id}/partitions/#{partition}/chunk.#{mapper_id}.json").put(body: tempfile)
+    mapper.shuffle(chunk_limit: 64) do |partitions|
+      partitions.each do |partition, path|
+        # store content of the tempfile located at path e.g. on s3:
+        bucket.object("map_reduce/jobs/#{job_id}/partitions/#{partition}/chunk.#{mapper_id}.json").put(body: File.open(path))
+      end
     end
   end
 end
 ```
 
 Please note that `MapReduce::HashPartitioner.new(16)` states that we want to
-split the dataset into 16 partitions (i.e. 0, 1, ... 15). Finally, we need some
-worker code to run the reduce part:
+split the dataset into 16 partitions (i.e. 0, 1, ... 15). Finally, we need
+some worker code to run the reduce part:
 
 ```ruby
 class WordCountReducer
@@ -144,9 +146,22 @@ workings of MapReduce. Of course, feel free to check the code as well.
 
 `MapReduce::Mapper#map` calls your `map` implementation and adds each yielded
 key-value pair to an internal buffer up until the memory limit is reached.
-When the memory limit is reached, the buffer is sorted by key and fed through
-your `reduce` implementation already, as this can greatly reduce the amount of
-data already. The result is written to a tempfile. This proceeds up until all
+More concretely, it specifies how big the file size of a temporary chunk may
+grow in memory before it must be written to disk. However, Ruby allocates
+much more memory for a chunk than the raw file size of the chunk. As a rule
+of thumb, it allocates 10 times more memory. Still, choosing a value for
+`memory_limit` depends on the memory size of your container/server, how many
+worker threads your background queue spawns and how much memory your workers
+need besides map/reduce. Let's say your container/server has 2 gigabytes of
+memory and your background framework spawns 5 threads. Theoretically, you
+might be able to give each thread 300-400 megabytes, but now divide this by
+10 and specify a `memory_limit` of around `30.megabytes`, or better, less.
+The `memory_limit` affects how many chunks will be written to disk, depending
+on the size of the data you are processing and how big those chunks are. The
+smaller the value, the more chunks; and the more chunks, the more runs are
+needed to merge them. When the memory limit is reached, the buffer is sorted
+by key and fed through your `reduce` implementation right away, as this can
+greatly reduce the amount of data. The result is written to a tempfile. This proceeds up until all
 key-value pairs are yielded. `MapReduce::Mapper#shuffle` then reads the first
 key-value pair of all already sorted chunk tempfiles and adds them to a
 priority queue using a binomial heap, such that with every `pop` operation on
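
As a rough worked example of the sizing rule described above (all numbers are assumptions for illustration, not gem defaults):

```ruby
total_memory = 2 * 1024**3   # container/server memory: 2 gigabytes
threads      = 5             # worker threads spawned by the background queue
per_thread   = 350 * 1024**2 # roughly 300-400 megabytes theoretically available per thread
ruby_factor  = 10            # rule of thumb: Ruby allocates ~10x the raw chunk size

memory_limit = per_thread / ruby_factor # => 36_700_160, i.e. ~35 megabytes; round down to be safe
```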
@@ -205,6 +220,10 @@ interface of callables, could even be expressed as a simple one-liner:
 MyPartitioner = proc { |key| Digest::SHA1.hexdigest(JSON.generate(key))[0..4].to_i(16) % 8 }
 ```
 
+## Semantic Versioning
+
+MapReduce uses Semantic Versioning: [SemVer](http://semver.org/)
+
 ## Development
 
 After checking out the repo, run `bin/setup` to install dependencies. Then, run
@@ -224,5 +243,5 @@ https://github.com/mrkamel/map-reduce-ruby
 
 ## License
 
-The gem is available as open source under the terms of the [MIT
-License](https://opensource.org/licenses/MIT).
+The gem is available as open source under the terms of the
+[MIT License](https://opensource.org/licenses/MIT).
data/lib/map_reduce/mapper.rb CHANGED
@@ -6,8 +6,6 @@ module MapReduce
     include Reduceable
     include MonitorMixin
 
-    attr_reader :partitions
-
     # Initializes a new mapper.
     #
     # @param implementation Your map-reduce implementation, i.e. an object
@@ -18,9 +16,9 @@ module MapReduce
     #   bytes.
     #
     # @example
-    #   MapReduce::Mapper.new(MyImplementation.new, partitioner: HashPartitioner.new(16), memory_limit: 100.megabytes)
+    #   MapReduce::Mapper.new(MyImplementation.new, partitioner: HashPartitioner.new(16), memory_limit: 16.megabytes)
 
-    def initialize(implementation, partitioner: HashPartitioner.new(32), memory_limit: 100 * 1024 * 1024)
+    def initialize(implementation, partitioner: HashPartitioner.new(32), memory_limit: 16 * 1024 * 1024)
       super()
 
       @implementation = implementation
@@ -45,9 +43,11 @@ module MapReduce
     def map(*args, **kwargs)
       @implementation.map(*args, **kwargs) do |new_key, new_value|
         synchronize do
-          @buffer.push([new_key, new_value])
+          partition = @partitioner.call(new_key)
+          item = [[partition, new_key], new_value]
 
-          @buffer_size += JSON.generate([new_key, new_value]).bytesize
+          @buffer.push(item)
+          @buffer_size += JSON.generate(item).bytesize
 
           write_chunk if @buffer_size >= @memory_limit
         end
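
The change above makes the mapper partition each pair at `map` time and key the buffer by `[partition, key]`, so a sorted buffer groups items by partition first. A standalone sketch of this bookkeeping; the modulo lambda stands in for the gem's `HashPartitioner`:

```ruby
require "json"

partitioner = ->(key) { key.hash % 8 } # stand-in for MapReduce::HashPartitioner.new(8)

buffer = []
buffer_size = 0

[["apple", 1], ["pear", 1], ["apple", 1]].each do |key, value|
  item = [[partitioner.call(key), key], value]

  buffer.push(item)
  buffer_size += JSON.generate(item).bytesize # serialized size, not Ruby's real allocation
end

# sorting now orders by partition first, then key, which is exactly what
# the shuffle phase needs to write the partition files one after another
buffer.sort_by!(&:first)
```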
@@ -55,62 +55,86 @@ module MapReduce
     end
 
     # Performs a k-way-merge of the sorted chunks written to tempfiles while
-    # already reducing the result using your map-reduce implementation and
-    # splitting the dataset into partitions. Finally yields each partition with
-    # the tempfile containing the data of the partition.
+    # already reducing the result using your map-reduce implementation (if
+    # available) and splitting the dataset into partitions. Finally yields a
+    # hash of (partition, path) pairs containing the data for the partitions
+    # in tempfiles.
+    #
+    # @param chunk_limit [Integer] The maximum number of files to process
+    #   at the same time. Most useful when you run on a system where the
+    #   number of open file descriptors is limited. If your number of file
+    #   descriptors is unlimited, you may want to set it to a higher number
+    #   to avoid the overhead of multiple runs.
     #
     # @example
-    #   mapper.shuffle do |partition, tempfile|
-    #     # store data e.g. on s3
+    #   mapper.shuffle(chunk_limit: 64) do |partitions|
+    #     partitions.each do |partition, path|
+    #       # store data e.g. on s3
+    #     end
     #   end
 
-    def shuffle(&block)
-      return enum_for(:shuffle) unless block_given?
+    def shuffle(chunk_limit:)
+      raise(InvalidChunkLimit, "Chunk limit must be >= 2") unless chunk_limit >= 2
 
-      write_chunk if @buffer_size > 0
+      begin
+        write_chunk if @buffer_size > 0
 
-      partitions = {}
+        chunk = k_way_merge(@chunks, chunk_limit: chunk_limit)
+        chunk = reduce_chunk(chunk, @implementation) if @implementation.respond_to?(:reduce)
 
-      chunk = k_way_merge(@chunks)
-      chunk = reduce_chunk(chunk, @implementation) if @implementation.respond_to?(:reduce)
+        partitions = split_chunk(chunk)
 
-      chunk.each do |pair|
-        partition = @partitioner.call(pair[0])
+        yield(partitions.transform_values(&:path))
+      ensure
+        partitions&.each_value(&:delete)
 
-        (partitions[partition] ||= Tempfile.new).puts(JSON.generate(pair))
+        @chunks.each(&:delete)
+        @chunks = []
       end
 
-      @chunks.each { |tempfile| tempfile.close(true) }
-      @chunks = []
+      nil
+    end
+
+    private
 
-      partitions.each_value(&:rewind)
+    def split_chunk(chunk)
+      res = {}
+      current_partition = nil
+      file = nil
 
-      partitions.each do |partition, tempfile|
-        block.call(partition, tempfile)
+      chunk.each do |((new_partition, key), value)|
+        if new_partition != current_partition
+          file&.close
+
+          current_partition = new_partition
+          temp_path = TempPath.new
+          res[new_partition] = temp_path
+          file = File.open(temp_path.path, "w+")
+        end
+
+        file.puts(JSON.generate([key, value]))
       end
 
-      partitions.each_value { |tempfile| tempfile.close(true) }
+      file&.close
 
-      nil
+      res
     end
 
-    private
-
     def write_chunk
-      tempfile = Tempfile.new
+      temp_path = TempPath.new
 
       @buffer.sort_by!(&:first)
 
       chunk = @buffer
       chunk = reduce_chunk(chunk, @implementation) if @implementation.respond_to?(:reduce)
 
-      chunk.each do |pair|
-        tempfile.puts JSON.generate(pair)
+      File.open(temp_path.path, "w+") do |file|
+        chunk.each do |pair|
+          file.puts JSON.generate(pair)
+        end
       end
 
-      tempfile.rewind
-
-      @chunks.push(tempfile)
+      @chunks.push(temp_path)
 
       @buffer_size = 0
       @buffer = []
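
Note that `shuffle` now deletes the partition tempfiles (and the chunk files) in its `ensure` block, so a caller must fully consume or copy each file before the block returns. A minimal sketch, with a made-up destination directory:

```ruby
require "fileutils"

mapper.shuffle(chunk_limit: 64) do |partitions|
  partitions.each do |partition, path|
    # copy while the block is still running; once it returns,
    # the tempfile at path is deleted by the ensure block above
    FileUtils.cp(path, "/var/data/partitions/#{partition}.json")
  end
end
```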
data/lib/map_reduce/mergeable.rb CHANGED
@@ -5,20 +5,62 @@ module MapReduce
   module Mergeable
     private
 
-    # Performs the k-way-merge of the passed files using a priority queue using
-    # a binomial heap. The content of the passed files needs to be sorted. It
-    # starts by reading one item of each file and adding it to the priority
-    # queue. Afterwards, it continously pops an item from the queue, yields it
-    # and reads a new item from the file the popped item belongs to, adding the
-    # read item to the queue. This continues up until all items from the files
-    # have been read. This guarantees that the yielded key-value pairs are
-    # sorted without having all items in-memory.
+    # Performs the k-way-merge of the passed files referenced by the temp paths
+    # using a priority queue based on a binomial heap. The content of the passed
+    # files needs to be sorted. It starts by reading one item of each file and
+    # adding it to the priority queue. Afterwards, it continuously pops an item
+    # from the queue, yields it and reads a new item from the file the popped
+    # item belongs to, adding the read item to the queue. This continues up
+    # until all items from the files have been read. This guarantees that the
+    # yielded key-value pairs are sorted without having all items in-memory.
     #
-    # @param files [IO, Tempfile] The files to run the k-way-merge for. The
-    #   content of the files must be sorted.
+    # @param temp_paths [TempPath] The files referenced by the temp paths to
+    #   run the k-way-merge for. The content of the files must be sorted.
+    # @param chunk_limit [Integer] The maximum number of files to process
+    #   at the same time. Most useful when you run on a system where the
+    #   number of open file descriptors is limited. If your number of file
+    #   descriptors is unlimited, you may want to set it to a higher number
+    #   to avoid the overhead of multiple runs.
 
-    def k_way_merge(files)
-      return enum_for(:k_way_merge, files) unless block_given?
+    def k_way_merge(temp_paths, chunk_limit:, &block)
+      return enum_for(__method__, temp_paths, chunk_limit: chunk_limit) unless block_given?
+
+      dupped_temp_paths = temp_paths.dup
+      additional_temp_paths = []
+
+      while dupped_temp_paths.size > chunk_limit
+        temp_path_out = TempPath.new
+
+        File.open(temp_path_out.path, "w+") do |file|
+          files = dupped_temp_paths.shift(chunk_limit).map { |temp_path| File.open(temp_path.path, "r") }
+
+          k_way_merge!(files) do |pair|
+            file.puts(JSON.generate(pair))
+          end
+
+          files.each(&:close)
+        end
+
+        dupped_temp_paths.push(temp_path_out)
+        additional_temp_paths.push(temp_path_out)
+      end
+
+      files = dupped_temp_paths.map { |temp_path| File.open(temp_path.path, "r") }
+      k_way_merge!(files, &block)
+      files.each(&:close)
+
+      nil
+    ensure
+      additional_temp_paths&.each(&:delete)
+    end
+
+    # Performs the actual k-way-merge of the specified files.
+    #
+    # @param files [IO, Tempfile] The files to run the k-way-merge for.
+    #   The content of the files must be sorted.
+
+    def k_way_merge!(files)
+      return enum_for(__method__, files) unless block_given?
 
       if files.size == 1
         files.first.each_line do |line|
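
For intuition, here is a dependency-free toy version of the k-way-merge the comment above describes, merging sorted arrays instead of files. The real implementation pops from a binomial heap (via the `lazy_priority_queue` gem) instead of scanning with `min_by`:

```ruby
# Keep one "head" item per source, repeatedly emit the smallest and refill
# from the source it came from, just like the gem does with sorted files.
def toy_k_way_merge(sources)
  enums = sources.map(&:each)
  heads = {}

  enums.each_with_index do |enum, index|
    heads[index] = enum.next
  rescue StopIteration
    # empty source, nothing to add
  end

  until heads.empty?
    index, value = heads.min_by { |_, item| item }
    yield value

    begin
      heads[index] = enums[index].next
    rescue StopIteration
      heads.delete(index)
    end
  end
end

toy_k_way_merge([[1, 4, 9], [2, 3, 8], [5, 6, 7]]) { |value| print value, " " }
# prints: 1 2 3 4 5 6 7 8 9
```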
data/lib/map_reduce/reducer.rb CHANGED
@@ -6,8 +6,6 @@ module MapReduce
     include Reduceable
     include MonitorMixin
 
-    class InvalidChunkLimit < StandardError; end
-
     # Initializes a new reducer.
     #
     # @param implementation Your map-reduce implementation, i.e. an object
@@ -20,8 +18,7 @@ module MapReduce
       super()
 
       @implementation = implementation
-
-      @temp_paths ||= []
+      @temp_paths = []
     end
 
     # Adds a chunk from the mapper-phase to the reducer by registering a
@@ -70,38 +67,36 @@ module MapReduce
     #   end
 
     def reduce(chunk_limit:, &block)
-      return enum_for(:reduce, chunk_limit: chunk_limit) unless block_given?
+      return enum_for(__method__, chunk_limit: chunk_limit) unless block_given?
 
       raise(InvalidChunkLimit, "Chunk limit must be >= 2") unless chunk_limit >= 2
 
       begin
         loop do
           slice = @temp_paths.shift(chunk_limit)
-          files = slice.select { |temp_path| File.exist?(temp_path.path) }
-                       .map { |temp_path| File.open(temp_path.path, "r") }
-
-          begin
-            if @temp_paths.empty?
-              reduce_chunk(k_way_merge(files), @implementation).each do |pair|
-                block.call(pair)
-              end
 
-              return
+          if @temp_paths.empty?
+            reduce_chunk(k_way_merge(slice, chunk_limit: chunk_limit), @implementation).each do |pair|
+              block.call(pair)
             end
 
-            File.open(add_chunk, "w") do |file|
-              reduce_chunk(k_way_merge(files), @implementation).each do |pair|
-                file.puts JSON.generate(pair)
-              end
+            return
+          end
+
+          File.open(add_chunk, "w+") do |file|
+            reduce_chunk(k_way_merge(slice, chunk_limit: chunk_limit), @implementation).each do |pair|
+              file.puts JSON.generate(pair)
             end
-          ensure
-            files.each(&:close)
-            slice.each(&:delete)
           end
+        ensure
+          slice&.each(&:delete)
         end
       ensure
         @temp_paths.each(&:delete)
+        @temp_paths = []
      end
+
+      nil
    end
  end
end
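
Usage of the reducer is unchanged apart from the error class move; a hedged sketch, with the chunk download left abstract (`downloaded_chunks` is illustrative, and `add_chunk` returning a writable path is inferred from the diff above):

```ruby
reducer = MapReduce::Reducer.new(WordCounter.new)

# add_chunk registers a tempfile and returns its path; fill it with one
# sorted chunk produced by a mapper, e.g. downloaded from s3
downloaded_chunks.each do |chunk_body|
  File.write(reducer.add_chunk, chunk_body)
end

# at most chunk_limit chunk files are merged per run
reducer.reduce(chunk_limit: 32) do |key, value|
  puts "#{key}: #{value}"
end
```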
data/lib/map_reduce/version.rb CHANGED
@@ -1,3 +1,3 @@
 module MapReduce
-  VERSION = "2.1.1"
+  VERSION = "3.0.1"
 end
data/lib/map_reduce.rb CHANGED
@@ -13,4 +13,7 @@ require "map_reduce/hash_partitioner"
 require "map_reduce/mapper"
 require "map_reduce/reducer"
 
-module MapReduce; end
+module MapReduce
+  class Error < StandardError; end
+  class InvalidChunkLimit < Error; end
+end
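
With `MapReduce::Error` as the shared base class, callers can now rescue all library errors uniformly; a minimal sketch:

```ruby
begin
  mapper.shuffle(chunk_limit: 1) { |partitions| } # invalid: chunk_limit must be >= 2
rescue MapReduce::InvalidChunkLimit
  # handle the specific error, e.g. retry with a higher limit
rescue MapReduce::Error
  # catches any other error raised by the gem
  raise
end
```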
metadata CHANGED
@@ -1,14 +1,14 @@
 --- !ruby/object:Gem::Specification
 name: map-reduce-ruby
 version: !ruby/object:Gem::Version
-  version: 2.1.1
+  version: 3.0.1
 platform: ruby
 authors:
 - Benjamin Vetter
 autorequire:
 bindir: exe
 cert_chain: []
-date: 2022-10-24 00:00:00.000000000 Z
+date: 2022-11-18 00:00:00.000000000 Z
 dependencies:
 - !ruby/object:Gem::Dependency
   name: rspec