bud 0.0.2

Files changed (62)
  1. data/LICENSE +9 -0
  2. data/README +30 -0
  3. data/bin/budplot +134 -0
  4. data/bin/budvis +201 -0
  5. data/bin/rebl +4 -0
  6. data/docs/README.md +13 -0
  7. data/docs/bfs.md +379 -0
  8. data/docs/bfs.raw +251 -0
  9. data/docs/bfs_arch.png +0 -0
  10. data/docs/bloom-loop.png +0 -0
  11. data/docs/bust.md +83 -0
  12. data/docs/cheat.md +291 -0
  13. data/docs/deploy.md +96 -0
  14. data/docs/diffs +181 -0
  15. data/docs/getstarted.md +296 -0
  16. data/docs/intro.md +36 -0
  17. data/docs/modules.md +112 -0
  18. data/docs/operational.md +96 -0
  19. data/docs/rebl.md +99 -0
  20. data/docs/ruby_hooks.md +19 -0
  21. data/docs/visualizations.md +75 -0
  22. data/examples/README +1 -0
  23. data/examples/basics/hello.rb +12 -0
  24. data/examples/basics/out +1103 -0
  25. data/examples/basics/out.new +856 -0
  26. data/examples/basics/paths.rb +51 -0
  27. data/examples/bust/README.md +9 -0
  28. data/examples/bust/bustclient-example.rb +23 -0
  29. data/examples/bust/bustinspector.html +135 -0
  30. data/examples/bust/bustserver-example.rb +18 -0
  31. data/examples/chat/README.md +9 -0
  32. data/examples/chat/chat.rb +45 -0
  33. data/examples/chat/chat_protocol.rb +8 -0
  34. data/examples/chat/chat_server.rb +29 -0
  35. data/examples/deploy/tokenring-ec2.rb +26 -0
  36. data/examples/deploy/tokenring-local.rb +17 -0
  37. data/examples/deploy/tokenring.rb +39 -0
  38. data/lib/bud/aggs.rb +126 -0
  39. data/lib/bud/bud_meta.rb +185 -0
  40. data/lib/bud/bust/bust.rb +126 -0
  41. data/lib/bud/bust/client/idempotence.rb +10 -0
  42. data/lib/bud/bust/client/restclient.rb +49 -0
  43. data/lib/bud/collections.rb +937 -0
  44. data/lib/bud/depanalysis.rb +44 -0
  45. data/lib/bud/deploy/countatomicdelivery.rb +50 -0
  46. data/lib/bud/deploy/deployer.rb +67 -0
  47. data/lib/bud/deploy/ec2deploy.rb +200 -0
  48. data/lib/bud/deploy/localdeploy.rb +41 -0
  49. data/lib/bud/errors.rb +15 -0
  50. data/lib/bud/graphs.rb +405 -0
  51. data/lib/bud/joins.rb +300 -0
  52. data/lib/bud/rebl.rb +314 -0
  53. data/lib/bud/rewrite.rb +523 -0
  54. data/lib/bud/rtrace.rb +27 -0
  55. data/lib/bud/server.rb +43 -0
  56. data/lib/bud/state.rb +108 -0
  57. data/lib/bud/storage/tokyocabinet.rb +170 -0
  58. data/lib/bud/storage/zookeeper.rb +178 -0
  59. data/lib/bud/stratify.rb +83 -0
  60. data/lib/bud/viz.rb +65 -0
  61. data/lib/bud.rb +797 -0
  62. metadata +330 -0
data/docs/bfs.md ADDED
@@ -0,0 +1,379 @@
1
+ # BFS: A distributed file system in Bloom
2
+
3
+ In this document we'll use what we've learned to build a piece of systems software using Bloom. The libraries that ship with BUD provide many of the building blocks we'll need to create a distributed,
4
+ ``chunked'' file system in the style of the Google File System (GFS):
5
+
6
+ * a [key-value store](https://github.com/bloom-lang/bud-sandbox/blob/master/kvs/kvs.rb) (KVS)
7
+ * [nonce generation](https://github.com/bloom-lang/bud-sandbox/blob/master/ordering/nonce.rb)
8
+ * a [heartbeat protocol](https://github.com/bloom-lang/bud-sandbox/blob/master/heartbeat/heartbeat.rb)
9
+
10
+ ## High-level architecture
11
+
12
+ ![BFS Architecture](bfs_arch.png?raw=true)
13
+
14
+ BFS implements a chunked, distributed file system (mostly) in the Bloom
15
+ language. BFS is architecturally based on [BOOM-FS](http://db.cs.berkeley.edu/papers/eurosys10-boom.pdf), which is itself based on
16
+ the Google File System (GFS). As in GFS, a single master node manages
17
+ file system metadata, while data blocks are replicated and stored on a large
18
+ number of storage nodes. Writing or reading data involves a multi-step
19
+ protocol in which clients interact with the master, retrieving metadata and
20
+ possibly changing state, then interact with storage nodes to read or write
21
+ chunks. Background jobs running on the master will contact storage nodes to
22
+ orchestrate chunk migrations, during which storage nodes communicate with
23
+ other storage nodes. As in BOOM-FS, the communication protocols and the data
24
+ channel used for bulk data transfer between clients and datanodes and between
25
+ datanodes are written outside Bloom (in Ruby).
26
+
27
+ ## [Basic File System](https://github.com/bloom-lang/bud-sandbox/blob/master/bfs/fs_master.rb)
28
+
29
+ Before we worry about any of the details of distribution, we need to implement the basic file system metadata operations: _create_, _remove_, _mkdir_ and _ls_.
30
+ There are many choices for how to implement these operations, and it makes sense to keep them separate from the (largely orthogonal) distributed file system logic.
31
+ That way, it will be possible later to choose a different implementation of the metadata operations without impacting the rest of the system.
32
+ Another benefit of modularizing the metadata logic is that it can be independently tested and debugged. We want to get the core of the file system
33
+ working correctly before we even send a whisper over the network, let alone add any complex features.
34
+
35
+ ### Protocol
36
+
37
+ module FSProtocol
38
+ state do
39
+ interface input, :fsls, [:reqid, :path]
40
+ interface input, :fscreate, [] => [:reqid, :name, :path, :data]
41
+ interface input, :fsmkdir, [] => [:reqid, :name, :path]
42
+ interface input, :fsrm, [] => [:reqid, :name, :path]
43
+ interface output, :fsret, [:reqid, :status, :data]
44
+ end
45
+ end
46
+
47
+ We create an input interface for each of the operations, and a single output interface for the return value of any operation: given a request id, __status__ is a boolean
+ indicating whether the request succeeded, and __data__ may contain return values (e.g., _fsls_ should return an array containing the directory contents).
49
+
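+ For concreteness, two illustrative __fsret__ tuples might look like this (request ids and listing contents are invented):
+
+ # a successful _fsls_ on a directory with two entries
+ [12345, true, ["alice.txt", "subdir"]]
+ # a failed request: the given path does not exist
+ [12346, false, nil]
+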
50
+ ### Implementation
51
+
52
+ We already have a library that provides an updateable flat namespace: the key-value store. We can easily implement the tree structure of a file system over a key-value store
53
+ in the following way:
54
+
55
+ 1. keys are paths
56
+ 2. directories have arrays containing child entries (base names)
57
+ 3. file values are their contents (see the sketch below)
58
+
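+ As a toy illustration of this layout (paths and contents invented), the KVS for a tiny tree might hold:
+
+ {
+   "/"              => ["tmp", "home"],    # directories map to arrays of child base names
+   "/tmp"           => ["notes.txt"],
+   "/tmp/notes.txt" => "hello, bloom"      # files map to their contents -- the strawman in (3)
+ }
+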
59
+ <!--- (**JMH**: I find it a bit confusing how you toggle from the discussion above to this naive file-storage design here. Can you warn us a bit more clearly that this is a starting point focused on metadata, with (3) being a strawman for data storage that is intended to be overriden later?)
60
+ --->
61
+ Note that (3) is a strawman: it will cease to apply when we implement chunked storage later. It is tempting, however, to support (3) so that the resulting program is a working
62
+ standalone file system.
63
+
64
+ We begin our implementation of a KVS-backed metadata system in the following way:
65
+
66
+
67
+ module KVSFS
68
+ include FSProtocol
69
+ include BasicKVS
70
+ include TimestepNonce
71
+
72
+ If we wanted to replicate the master node's metadata we could consider mixing in a replicated KVS implementation instead of __BasicKVS__ -- but more on that later.
73
+
74
+ ### Directory Listing
75
+
76
+ The directory listing operation is implemented by a simple block of Bloom statements:
77
+
78
+ kvget <= fsls { |l| [l.reqid, l.path] }
79
+ fsret <= (kvget_response * fsls).pairs(:reqid => :reqid) { |r, i| [r.reqid, true, r.value] }
80
+ fsret <= fsls do |l|
81
+ unless kvget_response.map{ |r| r.reqid}.include? l.reqid
82
+ [l.reqid, false, nil]
83
+ end
84
+ end
85
+
86
+ If we get a __fsls__ request, we probe the key-value store for the requested path by projecting _reqid_, _path_ from the __fsls__ tuple into __kvget__. If the given path
87
+ is a key, __kvget_response__ will contain a tuple with the same _reqid_, and the join on the second line will succeed. In this case, we insert the value
88
+ associated with that key into __fsret__. Otherwise, the third rule will fire, inserting a failure tuple into __fsret__.
89
+
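+ To make the dataflow concrete, here is a hypothetical single-timestep trace; all values are invented, and the __kvget_response__ field order is assumed:
+
+ # fsls            {[7, "/tmp"]}                 -- incoming request
+ # kvget           {[7, "/tmp"]}                 -- probe derived by the first rule
+ # kvget_response  {[7, "/tmp", ["notes.txt"]]}  -- the KVS knows the key
+ # fsret           {[7, true, ["notes.txt"]]}    -- the join on reqid fires the second rule
+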
90
+
91
+ ### Mutation
92
+
93
+ The logic for file and directory creation and deletion follows a similar pattern with regard to the parent directory:
94
+
95
+ check_parent_exists <= fscreate { |c| [c.reqid, c.name, c.path, :create, c.data] }
96
+ check_parent_exists <= fsmkdir { |m| [m.reqid, m.name, m.path, :mkdir, nil] }
97
+ check_parent_exists <= fsrm { |m| [m.reqid, m.name, m.path, :rm, nil] }
98
+
99
+ kvget <= check_parent_exists { |c| [c.reqid, c.path] }
100
+ fsret <= check_parent_exists do |c|
101
+ unless kvget_response.map{ |r| r.reqid}.include? c.reqid
102
+ puts "not found #{c.path}" or [c.reqid, false, "parent path #{c.path} for #{c.name} does not exist"]
103
+ end
104
+ end
105
+
106
+
107
+ Unlike a directory listing, however, these operations change the state of the file system. In general, any state change will involve
108
+ carrying out two mutating operations on the key-value store atomically:
109
+
110
+ 1. update the value (child array) associated with the parent directory entry
111
+ 2. update the key-value pair associated with the object in question (a file or directory being created or destroyed).
112
+
113
+ The following Bloom code carries this out:
114
+
115
+ temp :dir_exists <= (check_parent_exists * kvget_response * nonce).combos([check_parent_exists.reqid, kvget_response.reqid])
116
+ fsret <= dir_exists do |c, r, n|
117
+ if c.mtype == :rm
118
+ unless can_remove.map{|can| can.orig_reqid}.include? c.reqid
119
+ [c.reqid, false, "directory #{} not empty"]
120
+ end
121
+ end
122
+ end
123
+
124
+ # update dir entry
125
+ # note that it is unnecessary to ensure that a file is created before its corresponding
126
+ # directory entry, as both inserts into :kvput below will co-occur in the same timestep.
127
+ kvput <= dir_exists do |c, r, n|
128
+ if c.mtype == :rm
129
+ if can_remove.map{|can| can.orig_reqid}.include? c.reqid
130
+ [ip_port, c.path, n.ident, r.value.clone.reject{|item| item == c.name}]
131
+ end
132
+ else
133
+ [ip_port, c.path, n.ident, r.value.clone.push(c.name)]
134
+ end
135
+ end
136
+
137
+ kvput <= dir_exists do |c, r, n|
138
+ case c.mtype
139
+ when :mkdir
140
+ [ip_port, terminate_with_slash(c.path) + c.name, c.reqid, []]
141
+ when :create
142
+ [ip_port, terminate_with_slash(c.path) + c.name, c.reqid, "LEAF"]
143
+ end
144
+ end
145
+
146
+
147
+ <!--- (**JMH**: This next sounds awkward. You *do* take care: by using <= and understanding the atomicity of timesteps in Bloom. I think what you mean to say is that Bloom's atomic timestep model makes this easy compared to ... something.)
148
+ Note that we need not take any particular care to ensure that the two inserts into __kvput__ occur together atomically. Because both statements use the synchronous
149
+ -->
150
+ Note that because both inserts into the __kvput__ collection use the synchronous operator (`<=`), we know that they will occur together in the same fixpoint computation or not at all.
151
+ Therefore we need not be concerned with explicitly sequencing the operations (e.g., ensuring that the directory entry is created _after_ the file entry) to deal with concurrency:
152
+ there can be no visible state of the database in which only one of the operations has succeeded.
153
+
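+ For example, an `fsmkdir` of `bar` under `/tmp` would merge both of the following facts into __kvput__ in a single timestep (address, request id, and nonce value invented):
+
+ # update the parent's child array, using the nonce ident (99) as its request id
+ ["127.0.0.1:12345", "/tmp", 99, ["notes.txt", "bar"]]
+ # create the new directory's own entry, using the original request id (7)
+ ["127.0.0.1:12345", "/tmp/bar", 7, []]
+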
154
+ If the request is a deletion, we need some additional logic to enforce the constraint that only an empty directory may be removed:
155
+
156
+
157
+ check_is_empty <= (fsrm * nonce).pairs {|m, n| [n.ident, m.reqid, terminate_with_slash(m.path) + m.name] }
158
+ kvget <= check_is_empty {|c| [c.reqid, c.name] }
159
+ can_remove <= (kvget_response * check_is_empty).pairs([kvget_response.reqid, check_is_empty.reqid]) do |r, c|
160
+ [c.reqid, c.orig_reqid, c.name] if r.value.length == 0
161
+ end
162
+ # delete entry -- if an 'rm' request,
163
+ kvdel <= dir_exists do |c, r, n|
164
+ if can_remove.map{|can| can.orig_reqid}.include? c.reqid
165
+ [terminate_with_slash(c.path) + c.name, c.reqid]
166
+ end
167
+ end
168
+
169
+
170
+ Recall that when we created KVSFS we mixed in __TimestepNonce__, one of the nonce libraries. While we were able to use the _reqid_ field from the input operation as a unique identifier
171
+ for one of our KVS operations, we need a fresh, unique request id for the second KVS operation in the atomic pair described above. By joining __nonce__, we get
172
+ an identifier that is unique to this timestep.
173
+
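+ For intuition, __TimestepNonce__ can be pictured roughly as the sketch below. This is an illustrative reconstruction, not the actual bud-sandbox source; only the collection name and its `ident` field are taken from the usage above, and the identifier scheme is assumed.
+
+ module TimestepNonceSketch
+   state do
+     scratch :nonce, [:ident]
+   end
+   bloom do
+     # one fresh identifier per timestep, derived here from the local
+     # timestep counter and address (scheme assumed for illustration)
+     nonce <= [[[budtime, ip_port].hash]]
+   end
+ end
+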
174
+
175
+ ## [File Chunking](https://github.com/bloom-lang/bud-sandbox/blob/master/bfs/chunking.rb)
176
+
177
+ Now that we have a module providing a basic file system, we can extend it to support chunked storage of file contents. The metadata master will contain, in addition to the KVS
178
+ structure for directory information, a relation mapping a set of chunk identifiers to each file:
179
+
180
+ table :chunk, [:chunkid, :file, :siz]
181
+
182
+ and a relation associating each chunk with the set of datanodes that host a replica of it:
183
+
184
+ table :chunk_cache, [:node, :chunkid, :time]
185
+
186
+ The __chunk_cache__ relation (defined in __HBMaster__) is soft state, kept up to date by heartbeat messages from datanodes (described in the next section).
188
+
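+ As an illustration (identifiers, addresses, sizes, and timestamps invented), these relations might contain rows such as:
+
+ # chunk:       [:chunkid, :file, :siz]
+ [521, "/tmp/notes.txt", 67108864]
+ # chunk_cache: [:node, :chunkid, :time]
+ ["127.0.0.1:53001", 521, 1300842123.5]
+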
189
+ To support chunked storage, we add a few metadata operations to those already defined by FSProtocol:
190
+
191
+ module ChunkedFSProtocol
192
+ include FSProtocol
193
+
194
+ state do
195
+ interface :input, :fschunklist, [:reqid, :file]
196
+ interface :input, :fschunklocations, [:reqid, :chunkid]
197
+ interface :input, :fsaddchunk, [:reqid, :file]
198
+ # note that no output interface is defined.
199
+ # we use :fsret (defined in FSProtocol) for output.
200
+ end
201
+ end
202
+
203
+ * __fschunklist__ returns the set of chunks belonging to a given file.
204
+ * __fschunklocations__ returns the set of datanodes in possession of a given chunk.
205
+ * __fsaddchunk__ returns a new chunkid for appending to an existing file, guaranteed to be higher than any existing chunkids for that file, and a list of candidate datanodes that can store a replica of the new chunk.
206
+
207
+ We continue to use __fsret__ for return values.
208
+
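+ For example, a successful __fsaddchunk__ exchange might look like this (ids and addresses invented; the reply shape follows the rules shown under "Add chunk" below):
+
+ # fsaddchunk  {[31, "/tmp/notes.txt"]}
+ # fsret       {[31, true, [522, ["127.0.0.1:53001", "127.0.0.1:53002", "127.0.0.1:53003"]]]}
+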
209
+ ### Lookups
210
+
211
+ Lines 34-44 follow a pattern similar to the one we saw in the basic FS: whenever we get a __fschunklist__ or __fsaddchunk__ request, we must first ensure that the given file
212
+ exists, and error out if not. If it does, and the operation was __fschunklist__, we join the metadata relation __chunk__ and return the set of chunks owned
213
+ by the given (existent) file:
214
+
215
+ chunk_buffer <= (fschunklist * kvget_response * chunk).combos([fschunklist.reqid, kvget_response.reqid], [fschunklist.file, chunk.file]) { |l, r, c| [l.reqid, c.chunkid] }
216
+ chunk_buffer2 <= chunk_buffer.group([chunk_buffer.reqid], accum(chunk_buffer.chunkid))
217
+ fsret <= chunk_buffer2 { |c| [c.reqid, true, c.chunklist] }
218
+
219
+ ### Add chunk
220
+
221
+ If it was a __fsaddchunk__ request, we need to generate a unique id for a new chunk and return a list of target datanodes. We reuse __TimestepNonce__ to do the former, and join a relation
222
+ called __available__ that is exported by __HBMaster__ (described in the next section) for the latter:
223
+
224
+ temp :minted_chunk <= (kvget_response * fsaddchunk * available * nonce).combos(kvget_response.reqid => fsaddchunk.reqid) {|r| r if last_heartbeat.length >= REP_FACTOR}
225
+ chunk <= minted_chunk { |r, a, v, n| [n.ident, a.file, 0]}
226
+ fsret <= minted_chunk { |r, a, v, n| [r.reqid, true, [n.ident, v.pref_list.slice(0, (REP_FACTOR + 2))]]}
227
+ fsret <= (kvget_response * fsaddchunk).pairs(:reqid => :reqid) do |r, a|
228
+ if available.empty? or available.first.pref_list.length < REP_FACTOR
229
+ [r.reqid, false, "datanode set cannot satisfy REP_FACTOR = #{REP_FACTOR} with [#{available.first.nil? ? "NIL" : available.first.pref_list.inspect}]"]
230
+ end
231
+ end
232
+
233
+ Finally, if it was a __fschunklocations__ request, we have another possible error scenario, because the nodes associated with chunks are a part of our soft state. Even if the file
234
+ exists, it may not be the case that we have fresh information in our cache about what datanodes own a replica of the given chunk:
235
+
236
+ fsret <= fschunklocations do |l|
237
+ unless chunk_cache_alive.map{|c| c.chunkid}.include? l.chunkid
238
+ [l.reqid, false, "no datanodes found for #{l.chunkid} in cc, now #{chunk_cache_alive.length}, with last_hb #{last_heartbeat.length}"]
239
+ end
240
+ end
241
+
242
+ Otherwise, __chunk_cache__ has information about the given chunk, which we may return to the client:
243
+
244
+ temp :chunkjoin <= (fschunklocations * chunk_cache_alive).pairs(:chunkid => :chunkid)
245
+ host_buffer <= chunkjoin {|l, c| [l.reqid, c.node] }
246
+ host_buffer2 <= host_buffer.group([host_buffer.reqid], accum(host_buffer.host))
247
+ fsret <= host_buffer2 {|c| [c.reqid, true, c.hostlist] }
248
+
249
+
250
+ ## Datanodes and Heartbeats
251
+
252
+ ### [Datanode](https://github.com/bloom-lang/bud-sandbox/blob/master/bfs/datanode.rb)
253
+
254
+ A datanode runs both Bud code (to support the heartbeat and control protocols) and pure Ruby (to support the data transfer protocol). A datanode's main job is keeping the master
255
+ aware of its existence and its state, and participating when necessary in data pipelines to read or write chunk data to and from its local storage.
256
+
257
+ module BFSDatanode
258
+ include HeartbeatAgent
259
+ include StaticMembership
260
+ include TimestepNonce
261
+ include BFSHBProtocol
262
+
263
+ By mixing in HeartbeatAgent, the datanode includes the machinery necessary to regularly send status messages to the master. __HeartbeatAgent__ provides an input interface
264
+ called __payload__ that allows an agent to optionally include additional information in heartbeat messages: in our case, we wish to include state deltas which ensure that
265
+ the master has an accurate view of the set of chunks owned by the datanode.
266
+
267
+ When a datanode is constructed, it takes a port at which the embedded data protocol server will listen, and starts the server in the background:
268
+
269
+ @dp_server = DataProtocolServer.new(dataport)
270
+ return_address <+ [["localhost:#{dataport}"]]
271
+
272
+ At regular intervals, a datanode polls its local chunk directory (which is independently written to by the data protocol):
273
+
274
+ dir_contents <= hb_timer.flat_map do |t|
275
+ dir = Dir.new("#{DATADIR}/#{@data_port}")
276
+ files = dir.to_a.map{|d| d.to_i unless d =~ /^\./}.uniq!
277
+ dir.close
278
+ files.map {|f| [f, Time.parse(t.val).to_f]}
279
+ end
280
+
281
+ We update the payload that we send to the master if our recent poll found files that we don't believe the master knows about:
282
+
283
+
284
+ to_payload <= (dir_contents * nonce).pairs do |c, n|
285
+ unless server_knows.map{|s| s.file}.include? c.file
286
+ #puts "BCAST #{c.file}; server doesn't know" or [n.ident, c.file, c.time]
287
+ [n.ident, c.file, c.time]
288
+ else
289
+ #puts "server knows about #{server_knows.length} files"
290
+ end
291
+ end
292
+
293
+ Our view of what the master ``knows'' reflects our local cache of acknowledgement messages from the master. This logic is defined in __HBMaster__.
294
+
295
+ ### [Heartbeat master logic](https://github.com/bloom-lang/bud-sandbox/blob/master/bfs/hb_master.rb)
296
+
297
+ On the master side of heartbeats, we always send an ack when we get a heartbeat, so that the datanode doesn't need to keep resending its
298
+ payload of local chunks:
299
+
300
+ hb_ack <~ last_heartbeat.map do |l|
301
+ [l.sender, l.payload[0]] unless l.payload[1] == [nil]
302
+ end
303
+
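+ As read on the master side here, a heartbeat __payload__ can be pictured as a pair `[ident, chunk_list]`; for example (values invented):
+
+ # payload[0]: an identifier for this report; payload[1]: the chunk ids the datanode holds
+ [99, [521, 522, 530]]
+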
304
+ At the same time, we use the Ruby `flat_map` method to flatten the array of chunks in the heartbeat payload into a set of tuples, which we
305
+ associate with the heartbeating datanode and the time of receipt in __chunk_cache__:
306
+
307
+ chunk_cache <= join([master_duty_cycle, last_heartbeat]).flat_map do |d, l|
308
+ unless l.payload[1].nil?
309
+ l.payload[1].map do |pay|
310
+ [l.peer, pay, Time.parse(d.val).to_f]
311
+ end
312
+ end
313
+ end
314
+
315
+ We periodically garbage-collect this cache, removing entries for datanodes from which we have not received a heartbeat within a configurable amount of time.
316
+ __last_heartbeat__ is an output interface provided by the __HeartbeatAgent__ module, and contains the most recent, non-stale heartbeat contents:
317
+
318
+ chunk_cache <- join([master_duty_cycle, chunk_cache]).map do |t, c|
319
+ c unless last_heartbeat.map{|h| h.peer}.include? c.node
320
+ end
321
+
322
+
323
+ ## [BFS Client](https://github.com/bloom-lang/bud-sandbox/blob/master/bfs/bfs_client.rb)
324
+
325
+ One of the most complicated parts of the basic GFS design is the client component. To minimize load on the centralized master, we take it off the critical
326
+ path of file transfers. The client therefore needs to pick up this work.
327
+
328
+ We won't spend too much time on the details of the client code, as it is nearly all _plain old Ruby_. The basic idea is as follows (a sketch of the append flow appears after the list):
329
+
330
+ 1. Pure metadata operations
331
+ * _mkdir_, _create_, _ls_, _rm_
332
+ * Send the request to the master and inform the caller of the status.
333
+ * If _ls_, return the directory listing to the caller.
334
+ 2. Append
335
+ * Send a __fsaddchunk__ request to the master, which should return a new chunkid and a list of datanodes.
336
+ * Read a chunk worth of data from the input stream.
337
+ * Connect to the first datanode in the list. Send a header containing the chunkid and the remaining datanodes.
338
+ * Stream the file contents. The target datanode will then ``play client'' and continue the pipeline to the next datanode, and so on.
339
+ 3. Read
340
+ * Send a __getchunks__ request to the master for the given file. It should return the list of chunks owned by the file.
341
+ * For each chunk,
342
+ * Send a __fschunklocations__ request to the master, which should return a list of datanodes in possession of the chunk (returning a list allows the client to perform retries without more communication with the master, should some of the datanodes fail).
343
+ * Connect to a datanode from the list and stream the chunk to a local buffer.
344
+ * As chunks become available, stream them to the caller.
345
+
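+ The append step, for instance, might look roughly like the sketch below. Everything here is hypothetical glue: `add_chunk`, `CHUNK_SIZE`, and the pipe-delimited header are stand-ins for the real helpers in bfs_client.rb and data_protocol.rb, not their actual API.
+
+ require 'socket'
+
+ CHUNK_SIZE = 64 * 1024 * 1024  # assumed chunk size, for the sketch only
+
+ def append_sketch(master_client, file, io)
+   # 1. ask the master for a fresh chunkid and candidate datanodes (the fsaddchunk round trip)
+   chunkid, datanodes = master_client.add_chunk(file)
+   # 2. read one chunk's worth of data from the input stream
+   data = io.read(CHUNK_SIZE)
+   # 3. connect to the first datanode and send a toy header naming the chunk and the rest of the pipeline
+   head, *rest = datanodes
+   host, port = head.split(":")
+   sock = TCPSocket.new(host, port.to_i)
+   sock.puts([chunkid, rest.join(",")].join("|"))
+   # 4. stream the chunk; the receiving datanode plays client toward the next node in rest
+   sock.write(data)
+   sock.close
+ end
+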
346
+
347
+ ## [Data transfer protocol](https://github.com/bloom-lang/bud-sandbox/blob/master/bfs/data_protocol.rb)
348
+
349
+ This module comprises a set of support functions for the bulk data transfer protocol whose use is described in the previous section.
350
+ Because it is _plain old Ruby_, it is not as interesting as the other modules. It provides the following (a rough sketch of the datanode server loop appears after this list):
351
+
352
+ * The TCP server code that runs at each datanode, which parses headers and writes stream data to the local FS (these files are later detected by the directory poll).
353
+ * Client API calls to connect to datanodes and stream data. Datanodes also use this protocol to pipeline chunks to downstream datanodes.
354
+ * Master API code invoked by a background process to replicate chunks from datanodes to other datanodes, when the replication factor for a chunk is too low.
355
+
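+ To give a flavor of the first bullet, a drastically simplified server loop might look like the sketch below; the pipe-delimited framing matches the toy header in the append sketch above (not BFS's real wire format), and `datadir` is an assumed storage directory.
+
+ require 'socket'
+
+ def run_toy_datanode_server(port, datadir)
+   server = TCPServer.new(port)
+   loop do
+     Thread.new(server.accept) do |sock|
+       # toy header: "chunkid|comma,separated,downstream,datanodes"
+       chunkid, _downstream = sock.gets.chomp.split("|", 2)
+       data = sock.read
+       # write the chunk where the directory poll will later pick it up
+       File.binwrite(File.join(datadir, chunkid), data)
+       # a real datanode would now pipeline the chunk to the next node in _downstream
+       sock.close
+     end
+   end
+ end
+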
356
+ ## [Master background process](https://github.com/bloom-lang/bud-sandbox/blob/master/bfs/background.rb)
357
+
358
+ So far, we have implemented the BFS master as a strictly reactive system: when clients make requests, it queries and possibly updates local state.
359
+ To maintain the durability requirement that `REP_FACTOR` copies of every chunk are stored on distinct nodes, the master must be an active system
360
+ that maintains a near-consistent view of global state, and takes steps to correct violated requirements.
361
+
362
+ __chunk_cache__ is the master's view of datanode state, maintained as described by collecting and pruning heartbeat messages.
363
+
364
+ cc_demand <= (bg_timer * chunk_cache_alive).rights
365
+ cc_demand <= (bg_timer * last_heartbeat).pairs {|b, h| [h.peer, nil, nil]}
366
+ chunk_cnts_chunk <= cc_demand.group([cc_demand.chunkid], count(cc_demand.node))
367
+ chunk_cnts_host <= cc_demand.group([cc_demand.node], count(cc_demand.chunkid))
368
+
369
+ After defining some helper aggregates (__chunk_cnts_chunk__, the replica count per chunk, and __chunk_cnts_host__, the fill factor per datanode), we identify chunks whose replica count has fallen below `REP_FACTOR`, along with the nodes that could host new replicas:
370
+
371
+ lowchunks <= chunk_cnts_chunk { |c| [c.chunkid] if c.replicas < REP_FACTOR and !c.chunkid.nil?}
372
+
373
+ # nodes in possession of such chunks
374
+ sources <= (cc_demand * lowchunks).pairs(:chunkid => :chunkid) {|a, b| [a.chunkid, a.node]}
375
+ # nodes not in possession of such chunks, and their fill factor
376
+ candidate_nodes <= (chunk_cnts_host * lowchunks).pairs do |c, p|
377
+ unless chunk_cache_alive.map{|a| a.node if a.chunkid == p.chunkid}.include? c.host
378
+ [p.chunkid, c.host, c.chunks]
379
+ ### I am autogenerated. Please do not edit me.
data/docs/bfs.raw ADDED
data/docs/bfs_arch.png ADDED
Binary file