map-reduce-ruby 1.0.0 → 2.1.0

Sign up to get free protection for your applications and to get access to all the features.
checksums.yaml CHANGED
@@ -1,7 +1,7 @@
1
1
  ---
2
2
  SHA256:
3
- metadata.gz: 19ac70d25d6622d30398997cae0a4d6730ebc13f8bbc31df896c5acf480923c4
4
- data.tar.gz: ede2a8f736053a268175a7948324a15954475526d0589bf47416347b57a50740
3
+ metadata.gz: 4f3eeb7739b733f1abdf325ddfcf5a4c42fb257edd837952781246fcda9f48fb
4
+ data.tar.gz: f40eb08a341fc522c8f7043b74ac1c47fb00af491a107f51a6a76ca67f389629
5
5
  SHA512:
6
- metadata.gz: 650125350b5f166ccc0fb1346c77dfa9f17471ac9ae9290b4cbc4731a5baec872a810551e0b15af064b37f73c7e103ab6281d48cd839f26d4aab6780d5cd66d1
7
- data.tar.gz: 70d9f8668143cb507537e369ad0ea79d9bb9f2c33469dffccd5952116f09758a7a880831d81aa4e3b67bc548b4b36aafe321a8d553946d8734fb78c11cd1686e
6
+ metadata.gz: 6e91d7a1f55f0b89d333b317ecedc7978ca29a709dfad403906f6f2835b121a0a49afb29d03f55d8d7c8aa5c1b70628400404b7bf4d590046ae2e66ed48d7abc
7
+ data.tar.gz: cc4524aec895d935b70548163e5818f8725c5a05985e55f9b8d9330c34491968bb3ec54f02466ac0748eeb130f47a40d07baefa2627b04ed10dbac1c9638b1af
data/.rubocop.yml CHANGED
@@ -49,3 +49,9 @@ Layout/LineLength:
49
49
 
50
50
  Style/FrozenStringLiteralComment:
51
51
  EnforcedStyle: never
52
+
53
+ Style/ObjectThen:
54
+ Enabled: false
55
+
56
+ Gemspec/RequireMFA:
57
+ Enabled: false
data/CHANGELOG.md CHANGED
@@ -1 +1,32 @@
1
1
  # CHANGELOG
2
+
3
+ ## v2.1.0
4
+
5
+ * Do not reduce in `MapReduce::Mapper` when no `reduce` implementation is given
6
+
7
+ ## v2.0.0
8
+
9
+ * [BREAKING] Keys are no longer automatically converted to json before using
10
+ them for sorting
11
+ * This allows to have proper semantic sort order for numeric keys in addition
12
+ to just the clustering of keys
13
+ * Examples of valid keys: `"key"`, `["foo", 1.0]`, `["foo", ["bar"]]`
14
+ * Examples of problematic keys: `nil`, `true`, `["foo", nil]`, `{ "foo" => "bar" }`
15
+ * For migration purposes it is recommended to convert your keys to and from
16
+ json manually if you have complex keys using `JSON.generate`/`JSON.parse`:
17
+
18
+ ```ruby
19
+ class WordCounter
20
+ def map(url)
21
+ HTTP.get(url).to_s.split.each do |word|
22
+ yield(JSON.generate("key" => word), 1) # if you use a hash for the key
23
+ end
24
+ end
25
+
26
+ def reduce(json_key, count1, count2)
27
+ key = JSON.parse(json_key) # if you want to access the original key
28
+
29
+ count1 + count2
30
+ end
31
+ end
32
+ ```
data/Gemfile.lock CHANGED
@@ -1,7 +1,7 @@
1
1
  PATH
2
2
  remote: .
3
3
  specs:
4
- map-reduce-ruby (1.0.0)
4
+ map-reduce-ruby (2.1.0)
5
5
  json
6
6
  lazy_priority_queue
7
7
 
@@ -9,41 +9,42 @@ GEM
9
9
  remote: https://rubygems.org/
10
10
  specs:
11
11
  ast (2.4.2)
12
- diff-lcs (1.4.4)
13
- json (2.5.1)
12
+ diff-lcs (1.5.0)
13
+ json (2.6.2)
14
14
  lazy_priority_queue (0.1.1)
15
- parallel (1.20.1)
16
- parser (3.0.0.0)
15
+ parallel (1.22.1)
16
+ parser (3.1.2.1)
17
17
  ast (~> 2.4.1)
18
- rainbow (3.0.0)
19
- regexp_parser (2.0.3)
20
- rexml (3.2.4)
21
- rspec (3.10.0)
22
- rspec-core (~> 3.10.0)
23
- rspec-expectations (~> 3.10.0)
24
- rspec-mocks (~> 3.10.0)
25
- rspec-core (3.10.1)
26
- rspec-support (~> 3.10.0)
27
- rspec-expectations (3.10.1)
18
+ rainbow (3.1.1)
19
+ regexp_parser (2.5.0)
20
+ rexml (3.2.5)
21
+ rspec (3.11.0)
22
+ rspec-core (~> 3.11.0)
23
+ rspec-expectations (~> 3.11.0)
24
+ rspec-mocks (~> 3.11.0)
25
+ rspec-core (3.11.0)
26
+ rspec-support (~> 3.11.0)
27
+ rspec-expectations (3.11.1)
28
28
  diff-lcs (>= 1.2.0, < 2.0)
29
- rspec-support (~> 3.10.0)
30
- rspec-mocks (3.10.1)
29
+ rspec-support (~> 3.11.0)
30
+ rspec-mocks (3.11.1)
31
31
  diff-lcs (>= 1.2.0, < 2.0)
32
- rspec-support (~> 3.10.0)
33
- rspec-support (3.10.1)
34
- rubocop (0.93.1)
32
+ rspec-support (~> 3.11.0)
33
+ rspec-support (3.11.1)
34
+ rubocop (1.36.0)
35
+ json (~> 2.3)
35
36
  parallel (~> 1.10)
36
- parser (>= 2.7.1.5)
37
+ parser (>= 3.1.2.1)
37
38
  rainbow (>= 2.2.2, < 4.0)
38
- regexp_parser (>= 1.8)
39
- rexml
40
- rubocop-ast (>= 0.6.0)
39
+ regexp_parser (>= 1.8, < 3.0)
40
+ rexml (>= 3.2.5, < 4.0)
41
+ rubocop-ast (>= 1.20.1, < 2.0)
41
42
  ruby-progressbar (~> 1.7)
42
- unicode-display_width (>= 1.4.0, < 2.0)
43
- rubocop-ast (1.4.1)
44
- parser (>= 2.7.1.5)
43
+ unicode-display_width (>= 1.4.0, < 3.0)
44
+ rubocop-ast (1.21.0)
45
+ parser (>= 3.1.1.0)
45
46
  ruby-progressbar (1.11.0)
46
- unicode-display_width (1.7.0)
47
+ unicode-display_width (2.3.0)
47
48
 
48
49
  PLATFORMS
49
50
  ruby
data/README.md CHANGED
@@ -7,8 +7,7 @@ than memory map-reduce jobs by using your local disk and some arbitrary storage
7
7
  layer like s3. You can specify how much memory you are willing to offer and
8
8
  MapReduce will use its buffers accordingly. Finally, you can use your already
9
9
  existing background job system like `sidekiq` or one of its various
10
- alternatives. Finally, your keys and values can be everything that can be
11
- serialized as json.
10
+ alternatives.
12
11
 
13
12
  ## Installation
14
13
 
@@ -30,9 +29,7 @@ Or install it yourself as:
30
29
 
31
30
  Any map-reduce job consists of an implementation of your `map` function, your
32
31
  `reduce` function and worker code. So let's start with an implementation for a
33
- word count map-reduce task which fetches txt documents from the web. Please
34
- note that your keys and values can be everything that can be serialized as
35
- json, but nothing else.
32
+ word count map-reduce task which fetches txt documents from the web.
36
33
 
37
34
  ```ruby
38
35
  class WordCounter
@@ -68,8 +65,8 @@ class WordCountMapper
68
65
  end
69
66
  ```
70
67
 
71
- Please note that `MapReduce::HashPartitioner.new(16)` states that we want split
72
- the dataset into 16 partitions (i.e. 0, 1, ... 15). Finally, we need some
68
+ Please note that `MapReduce::HashPartitioner.new(16)` states that we want to
69
+ split the dataset into 16 partitions (i.e. 0, 1, ... 15). Finally, we need some
73
70
  worker code to run the reduce part:
74
71
 
75
72
  ```ruby
@@ -120,6 +117,26 @@ mappers are finished.
120
117
 
121
118
  That's it.
122
119
 
120
+ ## Limitations for Keys
121
+
122
+ You have to make sure that your keys are properly sortable in ruby. Please
123
+ note:
124
+
125
+ ```ruby
126
+ "key" < nil # comparison of String with nil failed (ArgumentError)
127
+
128
+ false < true # undefined method `<' for false:FalseClass (NoMethodError)
129
+
130
+ 1 > "key" # comparison of Integer with String failed (ArgumentError
131
+
132
+ { "key" => "value1" } < { "key" => "value2" } #=> false
133
+ { "key" => "value1" } > { "key" => "value2" } #=> false
134
+ { "key" => "value1" } <=> { "key" => "value2" } #=> nil
135
+ ```
136
+
137
+ For those reasons, it is recommended to only use strings, numbers and arrays or
138
+ a combination of those.
139
+
123
140
  ## Internals
124
141
 
125
142
  To fully understand the performance details, the following outlines the inner
@@ -71,7 +71,10 @@ module MapReduce
71
71
 
72
72
  partitions = {}
73
73
 
74
- reduce_chunk(k_way_merge(@chunks), @implementation).each do |pair|
74
+ chunk = k_way_merge(@chunks)
75
+ chunk = reduce_chunk(chunk, @implementation) if @implementation.respond_to?(:reduce)
76
+
77
+ chunk.each do |pair|
75
78
  partition = @partitioner.call(pair[0])
76
79
 
77
80
  (partitions[partition] ||= Tempfile.new).puts(JSON.generate(pair))
@@ -96,7 +99,7 @@ module MapReduce
96
99
  def write_chunk
97
100
  tempfile = Tempfile.new
98
101
 
99
- @buffer.sort_by! { |item| JSON.generate(item.first) }
102
+ @buffer.sort_by!(&:first)
100
103
 
101
104
  reduce_chunk(@buffer, @implementation).each do |pair|
102
105
  tempfile.puts JSON.generate(pair)
@@ -20,6 +20,16 @@ module MapReduce
20
20
  def k_way_merge(files)
21
21
  return enum_for(:k_way_merge, files) unless block_given?
22
22
 
23
+ if files.size == 1
24
+ files.first.each_line do |line|
25
+ yield(JSON.parse(line))
26
+ end
27
+
28
+ files.each(&:rewind)
29
+
30
+ return
31
+ end
32
+
23
33
  queue = PriorityQueue.new
24
34
 
25
35
  files.each_with_index do |file, index|
@@ -29,7 +39,7 @@ module MapReduce
29
39
 
30
40
  key, value = JSON.parse(line)
31
41
 
32
- queue.push([key, value, index], JSON.generate(key))
42
+ queue.push([key, value, index], key)
33
43
  end
34
44
 
35
45
  loop do
@@ -45,7 +55,7 @@ module MapReduce
45
55
 
46
56
  key, value = JSON.parse(line)
47
57
 
48
- queue.push([key, value, index], JSON.generate(key))
58
+ queue.push([key, value, index], key)
49
59
  end
50
60
 
51
61
  files.each(&:rewind)
@@ -1,4 +1,26 @@
1
1
  module MapReduce
2
+ # Since LazyPriorityQueue is using <= and >=, but not <=>, it does not
3
+ # support sorting array keys. Therefore we wrap the keys in SortKey, which
4
+ # provides those operators. See https://bugs.ruby-lang.org/issues/5574
5
+
6
+ class SortKey
7
+ include Comparable
8
+
9
+ attr_reader :object
10
+
11
+ def initialize(object)
12
+ @object = object
13
+ end
14
+
15
+ def <=>(other)
16
+ res = object <=> other.object
17
+
18
+ raise(ArgumentError, "Unable to compare #{@object.inspect} with #{other.object.inspect}") if res.nil?
19
+
20
+ res
21
+ end
22
+ end
23
+
2
24
  # The MapReduce::PriorityQueue implements a min priority queue using a
3
25
  # binomial heap.
4
26
 
@@ -25,7 +47,7 @@ module MapReduce
25
47
  # priority_queue.push("some object", "some key")
26
48
 
27
49
  def push(object, key)
28
- @queue.push([@sequence_number, object], key)
50
+ @queue.push([@sequence_number, object], SortKey.new(key))
29
51
 
30
52
  @sequence_number += 1
31
53
  end
@@ -1,3 +1,3 @@
1
1
  module MapReduce
2
- VERSION = "1.0.0"
2
+ VERSION = "2.1.0"
3
3
  end
@@ -7,7 +7,7 @@ Gem::Specification.new do |spec|
7
7
  spec.email = ["vetter@flakks.com"]
8
8
 
9
9
  spec.summary = "The easiest way to write distributed, larger than memory map-reduce jobs"
10
- spec.description = "The MapReduce gem is the easiest way to write custom, distributed, larger "\
10
+ spec.description = "The MapReduce gem is the easiest way to write custom, distributed, larger " \
11
11
  "than memory map-reduce jobs"
12
12
  spec.homepage = "https://github.com/mrkamel/map-reduce-ruby"
13
13
  spec.license = "MIT"
metadata CHANGED
@@ -1,14 +1,14 @@
1
1
  --- !ruby/object:Gem::Specification
2
2
  name: map-reduce-ruby
3
3
  version: !ruby/object:Gem::Version
4
- version: 1.0.0
4
+ version: 2.1.0
5
5
  platform: ruby
6
6
  authors:
7
7
  - Benjamin Vetter
8
- autorequire:
8
+ autorequire:
9
9
  bindir: exe
10
10
  cert_chain: []
11
- date: 2021-07-05 00:00:00.000000000 Z
11
+ date: 2022-10-24 00:00:00.000000000 Z
12
12
  dependencies:
13
13
  - !ruby/object:Gem::Dependency
14
14
  name: rspec
@@ -104,7 +104,7 @@ metadata:
104
104
  homepage_uri: https://github.com/mrkamel/map-reduce-ruby
105
105
  source_code_uri: https://github.com/mrkamel/map-reduce-ruby
106
106
  changelog_uri: https://github.com/mrkamel/map-reduce/blob/master/CHANGELOG.md
107
- post_install_message:
107
+ post_install_message:
108
108
  rdoc_options: []
109
109
  require_paths:
110
110
  - lib
@@ -119,8 +119,8 @@ required_rubygems_version: !ruby/object:Gem::Requirement
119
119
  - !ruby/object:Gem::Version
120
120
  version: '0'
121
121
  requirements: []
122
- rubygems_version: 3.0.3
123
- signing_key:
122
+ rubygems_version: 3.3.3
123
+ signing_key:
124
124
  specification_version: 4
125
125
  summary: The easiest way to write distributed, larger than memory map-reduce jobs
126
126
  test_files: []