map_reduce 0.0.1.alpha4 → 0.0.1.alpha5

data/README.md CHANGED
@@ -18,9 +18,183 @@ Or install it yourself as:
18
18
 
19
19
  $ gem install mapreduce
20
20
 
21
+ ## Introduction
22
+
23
+ MapReduce has three entities:
24
+
25
+ * Master
26
+ * Mapper
27
+ * Reducer
28
+
29
+ A Manager may be added later to synchronize Reducers.
30
+
31
+ ### Master
32
+
33
+ Master is a process that accepts data emitted by Mappers, sorts it, and sends the grouped data to Reducers. One Master can serve multiple tasks (multiple Mapper clusters).
34
+
35
+ To run a Master you can specify the following options:
36
+
37
+ * TCP/IPC/Unix socket address to bind (use TCP if you need to work over the network); Mappers and Reducers will connect to this address (default is `tcp://127.0.0.1:5555`)
38
+ * Log folder to store temporary logs with received data (default is `/tmp/map_reduce`); make sure the process has read/write access to this folder
39
+ * Delimiter between key and value (default is `\t`); set your own delimiter if a TAB may appear in your keys
40
+
41
+ You can also define callback blocks, which is useful for collecting stats from the Master:
42
+
43
+ * `after_map` - executed after the Master receives emitted data
44
+ * `after_reduce` - executed after the Master sends data to a Reducer
45
+
46
+ Both blocks receive `|key, value, task_name|` (for `after_reduce` the value is the grouped array of values).
47
+
48
+ A simple Master:
49
+
50
+ ```ruby
51
+ require 'map_reduce'
52
+ # Default params
53
+ master = MapReduce::Master.new
54
+ # Same as
55
+ master = MapReduce::Master.new socket: "tcp://127.0.0.1:5555",
56
+ log_folder: "/tmp/map_reduce",
57
+ delimiter: "\t"
58
+
59
+ # Define some logging after map and reduce
60
+ master.after_map do |key, value, task|
61
+ puts "Task: #{task}, received key: #{key}"
62
+ end
63
+
64
+ master.after_reduce do |key, values, task|
65
+ puts "Task: #{task}, for key: #{key} was sended #{values.size} items"
66
+ end
67
+
68
+ # Run Master
69
+ master.run
70
+ ```
71
+
72
+ ### Mapper
73
+
74
+ Mapper emits data to Masters. It could read a log, a database, or answer phone calls; all a Mapper needs to know is how to connect to the Masters and it is ready to go. You can also choose the mode you want to work in: the Mapper works asynchronously, but you decide whether to write callbacks (pure EventMachine) or wrap calls in Fibers (em-synchrony, for example); a small `:em` sketch follows the example below. You can also pass a task name if the Masters serve many tasks.
75
+
76
+ * masters - an array of all available Masters' socket addresses
77
+ * type - `:em` or `:sync` (`:em` is default)
78
+ * task - task name, default is `nil`
79
+
80
+ For example, say we have a web application (a shop) and we want to find out which goods people view together.
81
+
82
+ (Let's suppose the web server runs under EventMachine and each request is spawned in a Fiber.)
83
+
84
+ ```ruby
85
+ # Define your mapper somewhere
86
+ require 'map_reduce'
87
+
88
+ @mapper = MapReduce::Mapper.new type: :sync,
89
+ masters: ["tcp://192.168.1.1:5555", "tcp://192.168.1.2:5555"],
90
+ task: "goods"
91
+
92
+ # And use it in your Web App
93
+ get "/good/:id" do
94
+ @good = Good.find(params[:id])
95
+ # Send current user's id and good's id
96
+ @mapper.map(current_user.id, @good.id)
97
+ haml :good
98
+ end
99
+ ```
100
+
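+ The example above uses the Fiber-wrapped `:sync` mode. Below is a minimal sketch of the callback-based `:em` mode; the sample ids are made up, and per the Mapper source the block simply receives the Master's raw reply (e.g. `["ok"]`):
+
+ ```ruby
+ require 'map_reduce'
+
+ mapper = MapReduce::Mapper.new type: :em,
+ masters: ["tcp://192.168.1.1:5555", "tcp://192.168.1.2:5555"],
+ task: "goods"
+
+ EM.run do
+ # In :em mode map does not block; the block is invoked with the Master's reply
+ mapper.map("user_42", "good_7") do |reply|
+ puts "Master replied: #{reply.inspect}"
+ end
+ end
+ ```
+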
101
+ The Mapper also has a `wait_for_all` method. If you are not mapping data continuously and need to know when all Mappers have finished, call this method.
102
+
103
+ ```ruby
104
+ rand(1000000).times do |i|
105
+ mapper.map(i, 1)
106
+ end
107
+ mapper.wait_for_all
108
+ ```
109
+
110
+ You will be blocked until every Master reports that mapping has finished. Then you can start reducing the data, for example.
111
+
112
+ ### Reducer
113
+
114
+ Reducer receives grouped data from Masters. In our shop example the Reducer will receive, for every user, all the goods that user visited, so you can run some ML algorithms, append the data to an existing GoodsGraph, or do whatever science you like.
115
+
116
+ Like the Mapper, the Reducer needs the Masters' socket addresses, the connection type and, if needed, the task name (if the Mapper emits data under a named task, the Reducer must specify it as well).
117
+
118
+ ```ruby
119
+ require 'em-synchrony'
120
+ require 'map_reduce'
121
+ # initialize one
122
+ reducer = MapReduce::Reducer.new type: :sync,
123
+ masters: ["tcp://192.168.1.1:5555", "tcp://192.168.1.2:5555"],
124
+ task: "goods"
125
+
126
+ # Let the Masters collect some data between reduces, sleeping for a while between iterations
127
+ EM.synchrony do
128
+ while true
129
+ reducer.reduce do |key, values|
130
+ # You can do some magic here
131
+ puts "User: #{key}, visited #{values} today"
132
+ end
133
+ EM::Synchrony.sleep(60 * 60 * 3)
134
+ end
135
+ end
136
+ ```
137
+
21
138
  ## Usage
22
139
 
23
- TODO
140
+ So, generally you need to specify two things:
141
+
142
+ * What to map
143
+ * How to reduce
144
+
145
+ And implement them with the given primitives.
146
+
147
+ Perhaps the simplest example is counting page visits (video views, track listens) for each article. When you have millions of visits, incrementing a counter in an RDBMS for every visit can be a very expensive operation, so updating it once or twice per day is often a good choice. Say we have a bunch of logs of the form `article_id, user_id, timestamp` on each frontend and we need to count visits for each article and increment the counter in the database.
148
+
149
+
150
+ So on each server you could run a Master, a Mapper and a Reducer.
151
+
152
+ You could even combine the Mapper and the Reducer in one process, because the Reducer should fire right after the map phase has finished.
153
+
154
+ ```ruby
155
+ # master.rb
156
+ require 'map_reduce'
157
+
158
+ MapReduce::Master.new(socket: "tcp://#{current_ip}:5555").run
159
+ ```
160
+
161
+ ```ruby
162
+ # map_reducer.rb
163
+ require 'map_reduce'
164
+ require 'em-synchrony'
165
+
166
+ @mapper = MapReduce::Mapper.new masters: [ ... ], type: :sync
167
+ @reducer = MapReduce::Reducer.new masters: [ ... ], type: :sync
168
+
169
+ EM.synchrony do
170
+ # Run the job every 12 hours
171
+ EM::Synchrony.add_periodic_timer(60*60*12) do
172
+ File.foreach("/path/to/log") do |line|
173
+ article_id, user_id, timestamp = line.chomp.split(", ")
174
+ @mapper.map(article_id, 1)
175
+ end
176
+
177
+ @mapper.wait_for_all
178
+
179
+ @reducer.reduce do |key, values|
180
+ # How many times the article was visited
181
+ count = values.size
182
+ # Let's increment this value
183
+ Article.increment(id: key, visits: count)
184
+ end
185
+ end
186
+ end
187
+ ```
188
+
189
+ And run them:
190
+
191
+ $ ruby master.rb
192
+ $ ruby map_reducer.rb
193
+
194
+
195
+ ## Summary
196
+
197
+ This is a fairly simple implementation of map reduce: it doesn't solve synchronization, lost connectivity, or Master/Mapper/Reducer failure problems. Those are entirely up to the developer. And there is Hadoop for really big map reduce problems.
24
198
 
25
199
  ## Contributing
26
200
 
@@ -28,4 +202,4 @@ TODO
28
202
  2. Create your feature branch (`git checkout -b my-new-feature`)
29
203
  3. Commit your changes (`git commit -am 'Add some feature'`)
30
204
  4. Push to the branch (`git push origin my-new-feature`)
31
- 5. Create new Pull Request
205
+ 5. Create new Pull Request
data/lib/map_reduce.rb CHANGED
@@ -17,5 +17,11 @@ module MapReduce
17
17
  end
18
18
  end
19
19
 
20
+ require File.expand_path("../map_reduce/exceptions", __FILE__)
21
+ require File.expand_path("../map_reduce/socket/req_fiber", __FILE__)
22
+ require File.expand_path("../map_reduce/map_log", __FILE__)
23
+ require File.expand_path("../map_reduce/reduce_log", __FILE__)
24
+ require File.expand_path("../map_reduce/socket/master", __FILE__)
20
25
  require File.expand_path("../map_reduce/master", __FILE__)
21
- require File.expand_path("../map_reduce/worker", __FILE__)
26
+ require File.expand_path("../map_reduce/mapper", __FILE__)
27
+ require File.expand_path("../map_reduce/reducer", __FILE__)
@@ -0,0 +1,4 @@
1
+ module MapReduce::Exceptions
2
+ class BlankKey < StandardError; end
3
+ class BlankMasters < StandardError; end
4
+ end
@@ -0,0 +1,45 @@
1
+ module MapReduce
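+ # Buffers emitted key/value lines for a single task and flushes them to a uniquely named log file once the in-memory buffer grows past MAX_BUFFER_SIZE.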
2
+ class MapLog
3
+ MAX_BUFFER_SIZE = 2 ** 20
4
+
5
+ def initialize(log_folder, task)
6
+ @log_folder = log_folder
7
+ @task = task
8
+ @log = ""
9
+ @log_size = 0
10
+ end
11
+
12
+ def <<(str)
13
+ @log_size += str.size
14
+ @log << str << "\n"
15
+ flush if @log_size >= MAX_BUFFER_SIZE
16
+ end
17
+
18
+ def flush
19
+ unless @log.empty?
20
+ log_file << @log
21
+ log_file.flush
22
+ end
23
+ end
24
+
25
+ def reset
26
+ flush
27
+ if @log_file
28
+ fn = File.path(@log_file)
29
+ @log_file.close
30
+ @log_file = nil
31
+ fn
32
+ end
33
+ end
34
+
35
+ def log_file
36
+ @log_file ||= begin
37
+ begin
38
+ fn = File.join(@log_folder, "map_#{@task}_#{Time.now.to_i}_#{rand(1000)}.log")
39
+ end while File.exist?(fn)
40
+ FileUtils.mkdir_p(@log_folder)
41
+ File.open(fn, "a")
42
+ end
43
+ end
44
+ end
45
+ end
@@ -0,0 +1,72 @@
1
+ module MapReduce
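+ # Mapper connects to one or more Masters over REQ sockets and emits key/value pairs, routing each key to a Master by its MD5 hash.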
2
+ class Mapper
3
+ def initialize(opts = {})
4
+ @masters = opts[:masters] || [::MapReduce::DEFAULT_SOCKET]
5
+ @connection_type = opts[:type] || :em
6
+ @task_name = opts[:task]
7
+ end
8
+
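+ # Send a single key/value pair to the Master responsible for this key.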
9
+ def emit(key, value, &blk)
10
+ raise MapReduce::Exceptions::BlankKey, "Key can't be nil" if key.nil?
11
+
12
+ sock = pick_master(key)
13
+ sock.send_request(["map", key, value, @task_name], &blk)
14
+ end
15
+ alias :map :emit
16
+
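+ # Ask every Master whether all mappers for the task have finished; poll again after a second until each Master replies "ok".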
17
+ def wait_for_all(&blk)
18
+ finished = Hash[sockets.map{ |s| [s, false] }]
19
+ sockets.each do |sock|
20
+ sock.send_request(["map_finished", @task_name]) do |message|
21
+ finished[sock] = message[0] == "ok"
22
+ if finished.all?{ |k,v| v }
23
+ if block_given?
24
+ blk.call
25
+ else
26
+ return
27
+ end
28
+ else
29
+ after(1) do
30
+ wait_for_all(&blk)
31
+ end
32
+ end
33
+ end
34
+ end
35
+ end
36
+
37
+ private
38
+
39
+ def after(sec)
40
+ klass = if @connection_type == :sync
41
+ EM::Synchrony
42
+ else
43
+ EM
44
+ end
45
+
46
+ klass.add_timer(sec) do
47
+ yield
48
+ end
49
+ end
50
+
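+ # Hash the key so the same key always goes to the same Master.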
51
+ def pick_master(key)
52
+ num = Digest::MD5.hexdigest(key.to_s).to_i(16) % sockets.size
53
+ sockets[num]
54
+ end
55
+
56
+ def sockets
57
+ @sockets ||= begin
58
+ klass = if @connection_type == :sync
59
+ EM::Protocols::Zmq2::ReqFiber
60
+ else
61
+ EM::Protocols::Zmq2::ReqCb
62
+ end
63
+
64
+ @masters.map do |sock|
65
+ s = klass.new
66
+ s.connect(sock)
67
+ s
68
+ end
69
+ end
70
+ end
71
+ end
72
+ end
@@ -2,179 +2,121 @@ require File.expand_path("../socket/master", __FILE__)
2
2
 
3
3
  module MapReduce
4
4
  class Master
5
- # How often data will be flushed to disk
6
- FLUSH_TIMEOUT = 1
7
- # How many lines should be parsed by one iteration of grouping
8
- GROUP_LINES = 100
9
- # How many seconds should we sleep if grouping is going faster then reducing
10
- GROUP_TIMEOUT = 1
11
- # How many keys should be stored before timeout happend
12
- GROUP_MAX = 10_000
13
-
14
- # Valid options:
15
- # * socket - socket address to bind
16
- # default is 'ipc:///dev/shm/master.sock'
17
- # * log_folder - folder to store recieved MAP data
18
- # default is '/tmp/mapreduce/'
19
- # * workers - count of workers that will emit data.
20
- # default is :auto,
21
- # but in small jobs it is better to define in explicitly,
22
- # because if one worker will stop before others start
23
- # master will decide that map job is done and will start reducing
24
- # * delimiter - master log stores data like "key{delimiter}values"
25
- # so to prevent collisions you can specify your own uniq delimiter
26
- # default is a pipe "|"
27
- #
28
5
  def initialize(opts = {})
29
- # Socket addr to bind
30
- @socket_addr = opts[:socket] || ::MapReduce::DEFAULT_SOCKET
31
- # Folder to write logs
32
- @log_folder = opts[:log_folder] || "/tmp/mapreduce/"
33
- # How many MapReduce workers will emit data
34
- @workers = opts[:workers] || 1
35
- # Delimiter to store key/value pairs in log
36
- @delimiter = opts[:delimiter] || "|"
37
-
38
- @log = []
39
- @data = []
40
- @workers_envelopes = {}
41
- @log_filename = File.join(@log_folder, "master-#{Process.pid}.log")
42
- @sorted_log_filename = File.join(@log_folder, "master-#{Process.pid}_sorted.log")
43
-
44
- FileUtils.mkdir_p(@log_folder)
45
- FileUtils.touch(@log_filename)
46
- end
47
-
48
- # Start Eventloop
49
- #
6
+ @socket_addr = opts[:socket] || ::MapReduce::DEFAULT_SOCKET
7
+ @log_folder = opts[:log_folder] || "/tmp/map_reduce"
8
+ @delimiter = opts[:delimiter] || "\t"
9
+
10
+ @tasks = {}
11
+ end
12
+
50
13
  def run
51
14
  EM.run do
52
- # Init socket
53
- master_socket
54
-
55
- # Init flushing timer
56
- flush
15
+ socket
57
16
  end
58
17
  end
59
18
 
60
- # Stop Eventloop
61
- #
62
19
  def stop
63
20
  EM.stop
64
21
  end
65
22
 
66
- # Store data in log array till flush
67
- #
68
- def map(key, message)
69
- @log << "#{key}#{@delimiter}#{message}"
70
- end
71
-
72
- # Send data back to worker.
73
- # Last item in data is last unfinished session,
74
- # so till the end of file reading we don't send it
75
- #
76
- def reduce(envelope)
77
- if @data.size >= 2
78
- data = @data.shift
79
- data = data.flatten
80
- master_socket.send_reply(data, envelope)
81
- elsif @reduce_stop
82
- data = @data.shift
83
- data = data.flatten if data
84
- master_socket.send_reply(data, envelope)
23
+ def after_map(&blk)
24
+ @after_map = blk
25
+ end
26
+
27
+ def after_reduce(&blk)
28
+ @after_reduce = blk
29
+ end
30
+
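+ # Dispatch an incoming request from a Mapper or Reducer by its message type.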
31
+ def recieve_msg(message, envelope)
32
+ case message[0]
33
+ when "map"
34
+ store_map(message, envelope)
35
+ when "map_finished"
36
+ all_finished?(message, envelope)
37
+ when "reduce"
38
+ send_reduce(message, envelope)
85
39
  else
86
- EM.add_timer(1) do
87
- reduce(envelope)
88
- end
40
+ MapReduce.logger.error("Wrong message type: #{message[0]}")
89
41
  end
90
42
  end
91
43
 
92
- # Openning log file for read/write
93
- #
94
- def log_file
95
- @log_file ||= begin
96
- File.open(@log_filename, "w+")
97
- end
44
+ private
45
+
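+ # Append an emitted key/value pair to the task's map log and acknowledge the Mapper.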
46
+ def store_map(message, envelope)
47
+ status, key, value, task = message
48
+ map_log(task) << "#{key}#{@delimiter}#{value}"
49
+ ok(envelope)
50
+ register(task, envelope, "mapper", status)
51
+
52
+ @after_map.call(key, value, task) if @after_map
98
53
  end
99
54
 
100
- # Openning sorted log for reading
101
- #
102
- def sorted_log_file
103
- @sorted_log_file ||= begin
104
- File.open(@sorted_log_filename, "r")
55
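+ # Send the next grouped (key, values) pair for the task back to a Reducer.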
+ def send_reduce(message, envelope)
56
+ status, task = message
57
+
58
+ data = if @tasks.fetch(task, {}).fetch("reducer", {}).fetch(envelope[0], nil) == "reduce"
59
+ reduce_log(task).get_data
60
+ else
61
+ reduce_log(task, true).get_data
105
62
  end
106
- end
107
63
 
108
- # Flushing data to disk once per FLUSH_TIMEOUT seconds
109
- #
110
- def flush
111
- if @log.any?
112
- log_file << @log*"\n" << "\n"
113
- log_file.flush
114
- @log.clear
64
+ reply(data, envelope)
65
+
66
+ if data
67
+ register(task, envelope, "reducer", status)
68
+ else
69
+ register(task, envelope, "reducer", "reduce_finished")
115
70
  end
116
71
 
117
- EM.add_timer(FLUSH_TIMEOUT) do
118
- flush
72
+ @after_reduce.call(data[0], data[1], task) if data && @after_reduce
73
+ end
74
+
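+ # Reply "ok" only once every registered Mapper for the task has reported the same finished status.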
75
+ def all_finished?(message, envelope)
76
+ status, task = message
77
+ register(task, envelope, "mapper", status)
78
+ if @tasks[task]["mapper"].all?{ |k,v| v == status }
79
+ ok(envelope)
80
+ else
81
+ no(envelope)
119
82
  end
120
83
  end
121
84
 
122
- # Sorting log.
123
- # Linux sort is the fastest way to sort big file.
124
- # Deleting original log after sort.
125
- #
126
- def sort
127
- `sort #{@log_filename} -o #{@sorted_log_filename}`
128
- FileUtils.rm(@log_filename)
129
- @log_file = nil
85
+ def map_log(task)
86
+ @map_log ||= {}
87
+ @map_log[task] ||= MapReduce::MapLog.new(@log_folder, task)
130
88
  end
131
89
 
132
- # Start reducing part.
133
- # First, flushing rest of log to disk.
134
- # Then sort data.
135
- # Then start to read/group data
136
- #
137
- def reduce!
138
- flush
139
- sort
90
+ def reduce_log(task, force = false)
91
+ @reduce_log ||= {}
92
+ log = @reduce_log[task] ||= MapReduce::ReduceLog.new(map_log(task), @delimiter)
93
+ @reduce_log[task].force if force
94
+ log
95
+ end
140
96
 
141
- iter = sorted_log_file.each_line
142
- group iter
97
+ def ok(envelope)
98
+ reply(["ok"], envelope)
143
99
  end
144
100
 
145
- # Reading sorted data and grouping by key.
146
- # If queue (@data) is growing faster then workers grad data we pause reading file.
147
- #
148
- def group(iter)
149
- if @data.size >= GROUP_MAX
150
- EM.add_timer(GROUP_TIMEOUT){ group(iter) }
151
- else
152
- GROUP_LINES.times do
153
- line = iter.next.chomp
154
- key, msg = line.split(@delimiter)
155
-
156
- last = @data.last
157
- if last && last[0] == key
158
- last[1] << msg
159
- else
160
- @data << [key, [msg]]
161
- end
162
- end
163
-
164
- EM.next_tick{ group(iter) }
165
- end
166
- rescue StopIteration => e
167
- FileUtils.rm(@sorted_log_filename)
168
- @sorted_log_file = nil
169
- @reduce_stop = true
170
- end
171
-
172
- # Initializing and binding socket
173
- #
174
- def master_socket
175
- @master_socket ||= begin
176
- sock = MapReduce::Socket::Master.new self, @workers
177
- sock.bind @socket_addr
101
+ def no(envelope)
102
+ reply(["not ok"], envelope)
103
+ end
104
+
105
+ def reply(resp, envelope)
106
+ socket.send_reply(resp, envelope)
107
+ end
108
+
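+ # Remember the latest status reported by each Mapper/Reducer (identified by its envelope) for the given task.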
109
+ def register(task, envelope, type, status)
110
+ @tasks[task] ||= {}
111
+ @tasks[task][type] ||= {}
112
+ @tasks[task][type][envelope[0]] = status
113
+ end
114
+
115
+ def socket
116
+ @socket ||= begin
117
+ master = self
118
+ sock = MapReduce::Socket::Master.new(self)
119
+ sock.bind(@socket_addr)
178
120
  sock
179
121
  end
180
122
  end