bud 0.0.2

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (62)
  1. data/LICENSE +9 -0
  2. data/README +30 -0
  3. data/bin/budplot +134 -0
  4. data/bin/budvis +201 -0
  5. data/bin/rebl +4 -0
  6. data/docs/README.md +13 -0
  7. data/docs/bfs.md +379 -0
  8. data/docs/bfs.raw +251 -0
  9. data/docs/bfs_arch.png +0 -0
  10. data/docs/bloom-loop.png +0 -0
  11. data/docs/bust.md +83 -0
  12. data/docs/cheat.md +291 -0
  13. data/docs/deploy.md +96 -0
  14. data/docs/diffs +181 -0
  15. data/docs/getstarted.md +296 -0
  16. data/docs/intro.md +36 -0
  17. data/docs/modules.md +112 -0
  18. data/docs/operational.md +96 -0
  19. data/docs/rebl.md +99 -0
  20. data/docs/ruby_hooks.md +19 -0
  21. data/docs/visualizations.md +75 -0
  22. data/examples/README +1 -0
  23. data/examples/basics/hello.rb +12 -0
  24. data/examples/basics/out +1103 -0
  25. data/examples/basics/out.new +856 -0
  26. data/examples/basics/paths.rb +51 -0
  27. data/examples/bust/README.md +9 -0
  28. data/examples/bust/bustclient-example.rb +23 -0
  29. data/examples/bust/bustinspector.html +135 -0
  30. data/examples/bust/bustserver-example.rb +18 -0
  31. data/examples/chat/README.md +9 -0
  32. data/examples/chat/chat.rb +45 -0
  33. data/examples/chat/chat_protocol.rb +8 -0
  34. data/examples/chat/chat_server.rb +29 -0
  35. data/examples/deploy/tokenring-ec2.rb +26 -0
  36. data/examples/deploy/tokenring-local.rb +17 -0
  37. data/examples/deploy/tokenring.rb +39 -0
  38. data/lib/bud/aggs.rb +126 -0
  39. data/lib/bud/bud_meta.rb +185 -0
  40. data/lib/bud/bust/bust.rb +126 -0
  41. data/lib/bud/bust/client/idempotence.rb +10 -0
  42. data/lib/bud/bust/client/restclient.rb +49 -0
  43. data/lib/bud/collections.rb +937 -0
  44. data/lib/bud/depanalysis.rb +44 -0
  45. data/lib/bud/deploy/countatomicdelivery.rb +50 -0
  46. data/lib/bud/deploy/deployer.rb +67 -0
  47. data/lib/bud/deploy/ec2deploy.rb +200 -0
  48. data/lib/bud/deploy/localdeploy.rb +41 -0
  49. data/lib/bud/errors.rb +15 -0
  50. data/lib/bud/graphs.rb +405 -0
  51. data/lib/bud/joins.rb +300 -0
  52. data/lib/bud/rebl.rb +314 -0
  53. data/lib/bud/rewrite.rb +523 -0
  54. data/lib/bud/rtrace.rb +27 -0
  55. data/lib/bud/server.rb +43 -0
  56. data/lib/bud/state.rb +108 -0
  57. data/lib/bud/storage/tokyocabinet.rb +170 -0
  58. data/lib/bud/storage/zookeeper.rb +178 -0
  59. data/lib/bud/stratify.rb +83 -0
  60. data/lib/bud/viz.rb +65 -0
  61. data/lib/bud.rb +797 -0
  62. metadata +330 -0
data/docs/bfs.md ADDED
@@ -0,0 +1,379 @@
1
+ # BFS: A distributed file system in Bloom
2
+
3
+ In this document we'll use what we've learned to build a piece of systems software using Bloom. The libraries that ship with BUD provide many of the building blocks we'll need to create a distributed,
4
+ ``chunked'' file system in the style of the Google File System (GFS):
5
+
6
+ * a [key-value store](https://github.com/bloom-lang/bud-sandbox/blob/master/kvs/kvs.rb) (KVS)
7
+ * [nonce generation](https://github.com/bloom-lang/bud-sandbox/blob/master/ordering/nonce.rb)
8
+ * a [heartbeat protocol](https://github.com/bloom-lang/bud-sandbox/blob/master/heartbeat/heartbeat.rb)
9
+
10
+ ## High-level architecture
11
+
12
+ ![BFS Architecture](bfs_arch.png?raw=true)
13
+
14
+ BFS implements a chunked, distributed file system (mostly) in the Bloom
15
+ language. BFS is architecturally based on [BOOM-FS](http://db.cs.berkeley.edu/papers/eurosys10-boom.pdf), which is itself based on
16
+ the Google File System (GFS). As in GFS, a single master node manages
17
+ file system metadata, while data blocks are replicated and stored on a large
18
+ number of storage nodes. Writing or reading data involves a multi-step
19
+ protocol in which clients interact with the master, retrieving metadata and
20
+ possibly changing state, then interact with storage nodes to read or write
21
+ chunks. Background jobs running on the master will contact storage nodes to
22
+ orchestrate chunk migrations, during which storage nodes communicate with
23
+ other storage nodes. As in BOOM-FS, the communication protocols and the data
24
+ channel used for bulk data transfer between clients and datanodes and between
25
+ datanodes are written outside Bloom (in Ruby).
26
+
27
+ ## [Basic File System](https://github.com/bloom-lang/bud-sandbox/blob/master/bfs/fs_master.rb)
28
+
29
+ Before we worry about any of the details of distribution, we need to implement the basic file system metadata operations: _create_, _remove_, _mkdir_ and _ls_.
30
+ There are many choices for how to implement these operations, and it makes sense to keep them separate from the (largely orthogonal) distributed file system logic.
31
+ That way, it will be possible later to choose a different implementation of the metadata operations without impacting the rest of the system.
32
+ Another benefit of modularizing the metadata logic is that it can be independently tested and debugged. We want to get the core of the file system
33
+ working correctly before we even send a whisper over the network, let alone add any complex features.
34
+
35
+ ### Protocol
36
+
37
+ module FSProtocol
38
+ state do
39
+ interface input, :fsls, [:reqid, :path]
40
+ interface input, :fscreate, [] => [:reqid, :name, :path, :data]
41
+ interface input, :fsmkdir, [] => [:reqid, :name, :path]
42
+ interface input, :fsrm, [] => [:reqid, :name, :path]
43
+ interface output, :fsret, [:reqid, :status, :data]
44
+ end
45
+ end
46
+
47
+ We create an input interface for each of the operations, and a single output interface for the return values of any operation: given a request id, __status__ is a boolean
48
+ indicating whether the request succeeded, and __data__ may contain return values (e.g., _fsls_ should return an array containing the directory contents).
49
+
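To make the calling convention concrete, a module that mixes in __FSProtocol__ could issue an `ls /` request and capture the reply with a pair of rules like the following (an illustrative sketch, not part of the BFS sources; `replies` is a hypothetical table declared as `table :replies, [:reqid, :status, :data]`):

    # illustrative only: request a listing of the root directory under request id 1,
    # and remember any replies that arrive on the shared output interface
    fsls <= [[1, '/']]
    replies <= fsret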
50
+ ### Implementation
51
+
52
+ We already have a library that provides an updateable flat namespace: the key-value store. We can easily implement the tree structure of a file system over a key-value store
53
+ in the following way:
54
+
55
+ 1. keys are paths
56
+ 2. directories have arrays containing child entries (base names)
57
+ 3. file values are their contents
58
+
59
+ <!--- (**JMH**: I find it a bit confusing how you toggle from the discussion above to this naive file-storage design here. Can you warn us a bit more clearly that this is a starting point focused on metadata, with (3) being a strawman for data storage that is intended to be overriden later?)
60
+ --->
61
+ Note that (3) is a strawman: it will cease to apply when we implement chunked storage later. It is tempting, however, to support (3) so that the resulting program is a working
62
+ standalone file system.
63
+
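To make the mapping concrete, here is a sketch (illustrative keys and values, not taken from the sources) of what the KVS might hold after creating a directory `/usr` and an empty file `/usr/notes`:

    # key             value
    # "/"          => ["usr"]       # root directory: array of child base names
    # "/usr"       => ["notes"]     # directory entry for /usr
    # "/usr/notes" => "LEAF"        # file entry; real contents move elsewhere once chunking is added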
64
+ We begin our implementation of a KVS-backed metadata system in the following way:
65
+
66
+
67
+ module KVSFS
68
+ include FSProtocol
69
+ include BasicKVS
70
+ include TimestepNonce
71
+
72
+ If we wanted to replicate the master node's metadata we could consider mixing in a replicated KVS implementation instead of __BasicKVS__ -- but more on that later.
73
+
74
+ ### Directory Listing
75
+
76
+ The directory listing operation is implemented by a simple block of Bloom statements:
77
+
78
+ kvget <= fsls { |l| [l.reqid, l.path] }
79
+ fsret <= (kvget_response * fsls).pairs(:reqid => :reqid) { |r, i| [r.reqid, true, r.value] }
80
+ fsret <= fsls do |l|
81
+ unless kvget_response.map{ |r| r.reqid}.include? l.reqid
82
+ [l.reqid, false, nil]
83
+ end
84
+ end
85
+
86
+ If we get a __fsls__ request, we probe the key-value store for the requested path by projecting _reqid_, _path_ from the __fsls__ tuple into __kvget__. If the given path
87
+ is a key, __kvget_response__ will contain a tuple with the same _reqid_, and the join on the second line will succeed. In this case, we insert the value
88
+ associated with that key into __fsret__. Otherwise, the third rule will fire, inserting a failure tuple into __fsret__.
89
+
90
+
91
+ ### Mutation
92
+
93
+ The logic for file and directory creation and deletion follows a similar pattern with regard to the parent directory:
94
+
95
+ check_parent_exists <= fscreate { |c| [c.reqid, c.name, c.path, :create, c.data] }
96
+ check_parent_exists <= fsmkdir { |m| [m.reqid, m.name, m.path, :mkdir, nil] }
97
+ check_parent_exists <= fsrm { |m| [m.reqid, m.name, m.path, :rm, nil] }
98
+
99
+ kvget <= check_parent_exists { |c| [c.reqid, c.path] }
100
+ fsret <= check_parent_exists do |c|
101
+ unless kvget_response.map{ |r| r.reqid}.include? c.reqid
102
+ puts "not found #{c.path}" or [c.reqid, false, "parent path #{c.path} for #{c.name} does not exist"]
103
+ end
104
+ end
105
+
106
+
107
+ Unlike a directory listing, however, these operations change the state of the file system. In general, any state change will involve
108
+ carrying out two mutating operations to the key-value store atomically:
109
+
110
+ 1. update the value (child array) associated with the parent directory entry
111
+ 2. update the key-value pair associated with the object in question (a file or directory being created or destroyed).
112
+
113
+ The following Bloom code carries this out:
114
+
115
+ temp :dir_exists <= (check_parent_exists * kvget_response * nonce).combos([check_parent_exists.reqid, kvget_response.reqid])
116
+ fsret <= dir_exists do |c, r, n|
117
+ if c.mtype == :rm
118
+ unless can_remove.map{|can| can.orig_reqid}.include? c.reqid
119
+ [c.reqid, false, "directory #{} not empty"]
120
+ end
121
+ end
122
+ end
123
+
124
+ # update dir entry
125
+ # note that it is unnecessary to ensure that a file is created before its corresponding
126
+ # directory entry, as both inserts into :kvput below will co-occur in the same timestep.
127
+ kvput <= dir_exists do |c, r, n|
128
+ if c.mtype == :rm
129
+ if can_remove.map{|can| can.orig_reqid}.include? c.reqid
130
+ [ip_port, c.path, n.ident, r.value.clone.reject{|item| item == c.name}]
131
+ end
132
+ else
133
+ [ip_port, c.path, n.ident, r.value.clone.push(c.name)]
134
+ end
135
+ end
136
+
137
+ kvput <= dir_exists do |c, r, n|
138
+ case c.mtype
139
+ when :mkdir
140
+ [ip_port, terminate_with_slash(c.path) + c.name, c.reqid, []]
141
+ when :create
142
+ [ip_port, terminate_with_slash(c.path) + c.name, c.reqid, "LEAF"]
143
+ end
144
+ end
145
+
146
+
147
+ <!--- (**JMH**: This next sounds awkward. You *do* take care: by using <= and understanding the atomicity of timesteps in Bloom. I think what you mean to say is that Bloom's atomic timestep model makes this easy compared to ... something.)
148
+ Note that we need not take any particular care to ensure that the two inserts into __kvput__ occur together atomically. Because both statements use the synchronous
149
+ -->
150
+ Note that because both inserts into the __kvput__ collection use the synchronous operator (`<=`), we know that they will occur together in the same fixpoint computation or not at all.
151
+ Therefore we need not be concerned with explicitly sequencing the operations (e.g., ensuring that the directory entry is created _after_ the file entry) to deal with concurrency:
152
+ there can be no visible state of the database in which only one of the operations has succeeded.
153
+
154
+ If the request is a deletion, we need some additional logic to enforce the constraint that only an empty directory may be removed:
155
+
156
+
157
+ check_is_empty <= (fsrm * nonce).pairs {|m, n| [n.ident, m.reqid, terminate_with_slash(m.path) + m.name] }
158
+ kvget <= check_is_empty {|c| [c.reqid, c.name] }
159
+ can_remove <= (kvget_response * check_is_empty).pairs([kvget_response.reqid, check_is_empty.reqid]) do |r, c|
160
+ [c.reqid, c.orig_reqid, c.name] if r.value.length == 0
161
+ end
162
+ # delete entry -- if an 'rm' request,
163
+ kvdel <= dir_exists do |c, r, n|
164
+ if can_remove.map{|can| can.orig_reqid}.include? c.reqid
165
+ [terminate_with_slash(c.path) + c.name, c.reqid]
166
+ end
167
+ end
168
+
169
+
170
+ Recall that when we created KVSFS we mixed in __TimestepNonce__, one of the nonce libraries. While we were able to use the _reqid_ field from the input operation as a unique identifier
171
+ for one of our KVS operations, we need a fresh, unique request id for the second KVS operation in the atomic pair described above. By joining __nonce__, we get
172
+ an identifier that is unique to this timestep.
173
+
174
+
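For intuition, very little machinery is needed to produce such a nonce. A rough sketch of the idea (an assumption about the approach, not the shipped __TimestepNonce__; it presumes Bud exposes the current logical timestep, e.g. as `budtime`) is a scratch collection that holds exactly one fresh identifier per tick:

    # state:  scratch :nonce, [:ident]
    # one fresh identifier per timestep; scratch collections empty out between ticks
    nonce <= [[budtime]]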
175
+ ## [File Chunking](https://github.com/bloom-lang/bud-sandbox/blob/master/bfs/chunking.rb)
176
+
177
+ Now that we have a module providing a basic file system, we can extend it to support chunked storage of file contents. The metadata master will contain, in addition to the KVS
178
+ structure for directory information, a relation mapping a set of chunk identifiers to each file
179
+
180
+ table :chunk, [:chunkid, :file, :siz]
181
+
182
+ and a relation associating each chunk with the set of datanodes that host a replica of it.
183
+
184
+ table :chunk_cache, [:node, :chunkid, :time]
185
+
186
+ (**JMH**: ambiguous reference ahead "these latter")
187
+ The latter (defined in __HBMaster__) is soft-state, kept up to date by heartbeat messages from datanodes (described in the next section).
188
+
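As a purely illustrative example, a file `/logs/a` stored as two chunks, each currently replicated on two datanodes, might be reflected in these relations as follows (made-up chunkids, addresses and receipt times):

    # chunk:       [101, "/logs/a", 64], [102, "/logs/a", 17]
    # chunk_cache: ["dn1:10001", 101, t1], ["dn2:10002", 101, t1],
    #              ["dn2:10002", 102, t2], ["dn3:10003", 102, t2]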
189
+ To support chunked storage, we add a few metadata operations to those already defined by FSProtocol:
190
+
191
+ module ChunkedFSProtocol
192
+ include FSProtocol
193
+
194
+ state do
195
+ interface :input, :fschunklist, [:reqid, :file]
196
+ interface :input, :fschunklocations, [:reqid, :chunkid]
197
+ interface :input, :fsaddchunk, [:reqid, :file]
198
+ # note that no output interface is defined.
199
+ # we use :fsret (defined in FSProtocol) for output.
200
+ end
201
+ end
202
+
203
+ * __fschunklist__ returns the set of chunks belonging to a given file.
204
+ * __fschunklocations__ returns the set of datanodes in possession of a given chunk.
205
+ * __fsaddchunk__ returns a new chunkid for appending to an existing file, guaranteed to be higher than any existing chunkids for that file, and a list of candidate datanodes that can store a replica of the new chunk.
206
+
207
+ We continue to use __fsret__ for return values.
208
+
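The request and reply shapes are easiest to see by example (illustrative request ids, chunkids and datanode addresses; the reply payloads follow from the rules shown in the next two subsections):

    # fschunklist      [42, "/logs/a"]  ->  fsret [42, true, [101, 102]]
    # fsaddchunk       [43, "/logs/a"]  ->  fsret [43, true, [103, ["dn1:10001", "dn2:10002"]]]
    # fschunklocations [44, 101]        ->  fsret [44, true, ["dn1:10001", "dn2:10002"]]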
209
+ ### Lookups
210
+
211
+ Lines 34-44 follow a similar pattern to what we saw in the basic FS: whenever we get a __fschunklist__ or __fsaddchunk__ request, we must first ensure that the given file
212
+ exists, and error out if not. If it does, and the operation was __fschunklist__, we join the metadata relation __chunk__ and return the set of chunks owned
213
+ by the given (existent) file:
214
+
215
+ chunk_buffer <= (fschunklist * kvget_response * chunk).combos([fschunklist.reqid, kvget_response.reqid], [fschunklist.file, chunk.file]) { |l, r, c| [l.reqid, c.chunkid] }
216
+ chunk_buffer2 <= chunk_buffer.group([chunk_buffer.reqid], accum(chunk_buffer.chunkid))
217
+ fsret <= chunk_buffer2 { |c| [c.reqid, true, c.chunklist] }
218
+
219
+ ### Add chunk
220
+
221
+ If it was a __fsaddchunk__ request, we need to generate a unique id for a new chunk and return a list of target datanodes. We reuse __TimestepNonce__ to do the former, and join a relation
222
+ called __available__ that is exported by __HBMaster__ (described in the next section) for the latter:
223
+
224
+ temp :minted_chunk <= (kvget_response * fsaddchunk * available * nonce).combos(kvget_response.reqid => fsaddchunk.reqid) {|r| r if last_heartbeat.length >= REP_FACTOR}
225
+ chunk <= minted_chunk { |r, a, v, n| [n.ident, a.file, 0]}
226
+ fsret <= minted_chunk { |r, a, v, n| [r.reqid, true, [n.ident, v.pref_list.slice(0, (REP_FACTOR + 2))]]}
227
+ fsret <= (kvget_response * fsaddchunk).pairs(:reqid => :reqid) do |r, a|
228
+ if available.empty? or available.first.pref_list.length < REP_FACTOR
229
+ [r.reqid, false, "datanode set cannot satisfy REP_FACTOR = #{REP_FACTOR} with [#{available.first.nil? ? "NIL" : available.first.pref_list.inspect}]"]
230
+ end
231
+ end
232
+
233
+ Finally, if it was a __fschunklocations__ request, we have another possible error scenario, because the nodes associated with chunks are part of our soft state. Even if the file
234
+ exists, it may not be the case that we have fresh information in our cache about what datanodes own a replica of the given chunk:
235
+
236
+ fsret <= fschunklocations do |l|
237
+ unless chunk_cache_alive.map{|c| c.chunkid}.include? l.chunkid
238
+ [l.reqid, false, "no datanodes found for #{l.chunkid} in cc, now #{chunk_cache_alive.length}, with last_hb #{last_heartbeat.length}"]
239
+ end
240
+ end
241
+
242
+ Otherwise, __chunk_cache__ has information about the given chunk, which we may return to the client:
243
+
244
+ temp :chunkjoin <= (fschunklocations * chunk_cache_alive).pairs(:chunkid => :chunkid)
245
+ host_buffer <= chunkjoin {|l, c| [l.reqid, c.node] }
246
+ host_buffer2 <= host_buffer.group([host_buffer.reqid], accum(host_buffer.host))
247
+ fsret <= host_buffer2 {|c| [c.reqid, true, c.hostlist] }
248
+
249
+
250
+ ## Datanodes and Heartbeats
251
+
252
+ ### [Datanode](https://github.com/bloom-lang/bud-sandbox/blob/master/bfs/datanode.rb)
253
+
254
+ A datanode runs both Bud code (to support the heartbeat and control protocols) and pure Ruby (to support the data transfer protocol). A datanode's main job is keeping the master
255
+ aware of its existence and its state, and participating when necessary in data pipelines to read or write chunk data to and from its local storage.
256
+
257
+ module BFSDatanode
258
+ include HeartbeatAgent
259
+ include StaticMembership
260
+ include TimestepNonce
261
+ include BFSHBProtocol
262
+
263
+ By mixing in HeartbeatAgent, the datanode includes the machinery necessary to regularly send status messages to the master. __HeartbeatAgent__ provides an input interface
264
+ called __payload__ that allows an agent to optionally include additional information in heartbeat messages: in our case, we wish to include state deltas which ensure that
265
+ the master has an accurate view of the set of chunks owned by the datanode.
266
+
267
+ When a datanode is constructed, it takes a port at which the embedded data protocol server will listen, and starts the server in the background:
268
+
269
+ @dp_server = DataProtocolServer.new(dataport)
270
+ return_address <+ [["localhost:#{dataport}"]]
271
+
272
+ At regular intervals, a datanode polls its local chunk directory (which is independently written to by the data protocol):
273
+
274
+ dir_contents <= hb_timer.flat_map do |t|
275
+ dir = Dir.new("#{DATADIR}/#{@data_port}")
276
+ files = dir.to_a.map{|d| d.to_i unless d =~ /^\./}.uniq!
277
+ dir.close
278
+ files.map {|f| [f, Time.parse(t.val).to_f]}
279
+ end
280
+
281
+ We update the payload that we send to the master if our recent poll found files that we don't believe the master knows about:
282
+
283
+
284
+ to_payload <= (dir_contents * nonce).pairs do |c, n|
285
+ unless server_knows.map{|s| s.file}.include? c.file
286
+ #puts "BCAST #{c.file}; server doesn't know" or [n.ident, c.file, c.time]
287
+ [n.ident, c.file, c.time]
288
+ else
289
+ #puts "server knows about #{server_knows.length} files"
290
+ end
291
+ end
292
+
293
+ Our view of what the master ``knows'' about reflects our local cache of acknowledgement messages from the master. This logic is defined in __HBMaster__.
294
+
295
+ ### [Heartbeat master logic](https://github.com/bloom-lang/bud-sandbox/blob/master/bfs/hb_master.rb)
296
+
297
+ On the master side of heartbeats, we always send an ack when we get a heartbeat, so that the datanode doesn't need to keep resending its
298
+ payload of local chunks:
299
+
300
+ hb_ack <~ last_heartbeat.map do |l|
301
+ [l.sender, l.payload[0]] unless l.payload[1] == [nil]
302
+ end
303
+
304
+ At the same time, we use the Ruby _flat_map_ method to flatten the array of chunks in the heartbeat payload into a set of tuples, which we
305
+ associate with the heartbeating datanode and the time of receipt in __chunk_cache__:
306
+
307
+ chunk_cache <= join([master_duty_cycle, last_heartbeat]).flat_map do |d, l|
308
+ unless l.payload[1].nil?
309
+ l.payload[1].map do |pay|
310
+ [l.peer, pay, Time.parse(d.val).to_f]
311
+ end
312
+ end
313
+ end
314
+
315
+ We periodically garbage-collect this cache, removing entries for datanodes from which we have not received a heartbeat in a configurable amount of time.
316
+ __last_heartbeat__ is an output interface provided by the __HeartbeatAgent__ module, and contains the most recent, non-stale heartbeat contents:
317
+
318
+ chunk_cache <- join([master_duty_cycle, chunk_cache]).map do |t, c|
319
+ c unless last_heartbeat.map{|h| h.peer}.include? c.node
320
+ end
321
+
322
+
323
+ ## [BFS Client](https://github.com/bloom-lang/bud-sandbox/blob/master/bfs/bfs_client.rb)
324
+
325
+ One of the most complicated parts of the basic GFS design is the client component. To minimize load on the centralized master, we take it off the critical
326
+ path of file transfers. The client therefore needs to pick up this work.
327
+
328
+ We won't spend too much time on the details of the client code, as it is nearly all _plain old Ruby_. The basic idea is:
329
+
330
+ 1. Pure metadata operations
331
+ * _mkdir_, _create_, _ls_, _rm_
332
+ * Send the request to the master and inform the caller of the status.
333
+ * If _ls_, return the directory listing to the caller.
334
+ 2. Append
335
+ * Send a __fsaddchunk__ request to the master, which should return a new chunkid and a list of datanodes.
336
+ * Read a chunk worth of data from the input stream.
337
+ * Connect to the first datanode in the list. Send a header containing the chunkid and the remaining datanodes.
338
+ * Stream the file contents. The target datanode will then ``play client'' and continue the pipeline to the next datanode, and so on.
339
+ 3. Read
340
+ * Send a __getchunks__ request to the master for the given file. It should return the list of chunks owned by the file.
341
+ * For each chunk,
342
+ * Send a __fschunklocations__ request to the master, which should return a list of datanodes in possession of the chunk (returning a list allows the client to perform retries without more communication with the master, should some of the datanodes fail).
343
+ * Connect to a datanode from the list and stream the chunk to a local buffer.
344
+ * As chunks become available, stream them to the caller.
345
+
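For example, the append path described above might be driven by plain Ruby along these lines (a sketch only; `request_master` and `send_to_pipeline` are hypothetical helpers standing in for the real master RPC and data-protocol calls):

    def append_chunk(master, file, io, chunk_size)
      # ask the master for a fresh chunkid and a preference list of datanodes
      status, payload = request_master(master, :fsaddchunk, file)   # hypothetical helper
      raise "append failed: #{payload}" unless status
      chunkid, datanodes = payload
      data = io.read(chunk_size)            # one chunk's worth of data from the input stream
      head, *rest = datanodes
      # the first datanode then "plays client" and forwards to the rest of the pipeline
      send_to_pipeline(head, chunkid, rest, data)                   # hypothetical helper
    end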
346
+
347
+ ## [Data transfer protocol](https://github.com/bloom-lang/bud-sandbox/blob/master/bfs/data_protocol.rb)
348
+
349
+ The data transfer protocol comprises a set of support functions for the bulk data transfers whose use is described in the previous section.
350
+ Because it is _plain old Ruby_ it is not as interesting as the other modules. It provides:
351
+
352
+ * The TCP server code that runs at each datanode, which parses headers and writes stream data to the local FS (these files are later detected by the directory poll).
353
+ * Client API calls to connect to datanodes and stream data. Datanodes also use this protocol to pipeline chunks to downstream datanodes.
354
+ * Master API code invoked by a background process to replicate chunks from datanodes to other datanodes, when the replication factor for a chunk is too low.
355
+
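A minimal datanode-side server could look roughly like the following (a sketch under an assumed header format and directory layout, not the shipped `DataProtocolServer`):

    require 'socket'

    # Accept one connection at a time, read a marshalled [chunkid, remaining_nodes]
    # header followed by the chunk bytes, and write the chunk into the local chunk
    # directory, where the periodic directory poll will later pick it up.
    def serve_chunks(port, datadir)
      server = TCPServer.new(port)
      loop do
        sock = server.accept
        chunkid, remaining = Marshal.load(sock)   # assumed header encoding
        data = sock.read                          # chunk payload until EOF
        File.open(File.join(datadir, chunkid.to_s), 'w') { |f| f.write(data) }
        # a fuller version would now forward the chunk to remaining.first, if any
        sock.close
      end
    end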
356
+ ## [Master background process](https://github.com/bloom-lang/bud-sandbox/blob/master/bfs/background.rb)
357
+
358
+ So far, we have implemented the BFS master as a strictly reactive system: when clients make requests, it queries and possibly updates local state.
359
+ To maintain the durability requirement that `REP_FACTOR` copies of every chunk are stored on distinct nodes, the master must be an active system
360
+ that maintains a near-consistent view of global state, and takes steps to correct violated requirements.
361
+
362
+ __chunk_cache__ is the master's view of datanode state, maintained, as described above, by collecting and pruning heartbeat messages.
363
+
364
+ cc_demand <= (bg_timer * chunk_cache_alive).rights
365
+ cc_demand <= (bg_timer * last_heartbeat).pairs {|b, h| [h.peer, nil, nil]}
366
+ chunk_cnts_chunk <= cc_demand.group([cc_demand.chunkid], count(cc_demand.node))
367
+ chunk_cnts_host <= cc_demand.group([cc_demand.node], count(cc_demand.chunkid))
368
+
369
+ After defining some helper aggregates (__chunk_cnts_chunk__, the replica count by chunk, and __chunk_cnts_host__, the per-datanode fill factor), we identify under-replicated chunks, the nodes that hold them, and candidate nodes to receive new copies:
370
+
371
+ lowchunks <= chunk_cnts_chunk { |c| [c.chunkid] if c.replicas < REP_FACTOR and !c.chunkid.nil?}
372
+
373
+ # nodes in possession of such chunks
374
+ sources <= (cc_demand * lowchunks).pairs(:chunkid => :chunkid) {|a, b| [a.chunkid, a.node]}
375
+ # nodes not in possession of such chunks, and their fill factor
376
+ candidate_nodes <= (chunk_cnts_host * lowchunks).pairs do |c, p|
377
+ unless chunk_cache_alive.map{|a| a.node if a.chunkid == p.chunkid}.include? c.host
378
+ [p.chunkid, c.host, c.chunks]
379
+ ### I am autogenerated. Please do not edit me.
data/docs/bfs.raw ADDED
@@ -0,0 +1,251 @@
1
+ # BFS: A distributed file system in Bloom
2
+
3
+ In this document we'll use what we've learned to build a piece of systems software using Bloom. The libraries that ship with BUD provide many of the building blocks we'll need to create a distributed,
4
+ ``chunked'' file system in the style of the Google File System (GFS):
5
+
6
+ * a [key-value store](https://github.com/bloom-lang/bud-sandbox/blob/master/kvs/kvs.rb) (KVS)
7
+ * [nonce generation](https://github.com/bloom-lang/bud-sandbox/blob/master/ordering/nonce.rb)
8
+ * a [heartbeat protocol](https://github.com/bloom-lang/bud-sandbox/blob/master/heartbeat/heartbeat.rb)
9
+
10
+ ## High-level architecture
11
+
12
+ ![BFS Architecture](bfs_arch.png?raw=true)
13
+
14
+ BFS implements a chunked, distributed file system (mostly) in the Bloom
15
+ language. BFS is architecturally based on [BOOM-FS](http://db.cs.berkeley.edu/papers/eurosys10-boom.pdf), which is itself based on
16
+ the Google File System (GFS). As in GFS, a single master node manages
17
+ file system metadata, while data blocks are replicated and stored on a large
18
+ number of storage nodes. Writing or reading data involves a multi-step
19
+ protocol in which clients interact with the master, retrieving metadata and
20
+ possibly changing state, then interact with storage nodes to read or write
21
+ chunks. Background jobs running on the master will contact storage nodes to
22
+ orchestrate chunk migrations, during which storage nodes communicate with
23
+ other storage nodes. As in BOOM-FS, the communication protocols and the data
24
+ channel used for bulk data transfer between clients and datanodes and between
25
+ datanodes are written outside Bloom (in Ruby).
26
+
27
+ ## [Basic File System](https://github.com/bloom-lang/bud-sandbox/blob/master/bfs/fs_master.rb)
28
+
29
+ Before we worry about any of the details of distribution, we need to implement the basic file system metadata operations: _create_, _remove_, _mkdir_ and _ls_.
30
+ There are many choices for how to implement these operations, and it makes sense to keep them separate from the (largely orthogonal) distributed file system logic.
31
+ That way, it will be possible later to choose a different implementation of the metadata operations without impacting the rest of the system.
32
+ Another benefit of modularizing the metadata logic is that it can be independently tested and debugged. We want to get the core of the file system
33
+ working correctly before we even send a whisper over the network, let alone add any complex features.
34
+
35
+ ### Protocol
36
+
37
+ ==https://github.com/bloom-lang/bud-sandbox/raw/master/bfs/fs_master.rb|12-20
38
+
39
+ We create an input interface for each of the operations, and a single output interface for the return values of any operation: given a request id, __status__ is a boolean
40
+ indicating whether the request succeeded, and __data__ may contain return values (e.g., _fsls_ should return an array containing the directory contents).
41
+
42
+ ### Implementation
43
+
44
+ We already have a library that provides an updateable flat namespace: the key-value store. We can easily implement the tree structure of a file system over a key-value store
45
+ in the following way:
46
+
47
+ 1. keys are paths
48
+ 2. directories have arrays containing child entries (base names)
49
+ 3. file values are their contents
50
+
51
+ <!--- (**JMH**: I find it a bit confusing how you toggle from the discussion above to this naive file-storage design here. Can you warn us a bit more clearly that this is a starting point focused on metadata, with (3) being a strawman for data storage that is intended to be overriden later?)
52
+ --->
53
+ Note that (3) is a strawman: it will cease to apply when we implement chunked storage later. It is tempting, however, to support (3) so that the resulting program is a working
54
+ standalone file system.
55
+
56
+ We begin our implementation of a KVS-backed metadata system in the following way:
57
+
58
+
59
+ ==https://github.com/bloom-lang/bud-sandbox/raw/master/bfs/fs_master.rb|33-36
60
+
61
+ If we wanted to replicate the master node's metadata we could consider mixing in a replicated KVS implementation instead of __BasicKVS__ -- but more on that later.
62
+
63
+ ### Directory Listing
64
+
65
+ The directory listing operation is implemented by a simple block of Bloom statements:
66
+
67
+ ==https://github.com/bloom-lang/bud-sandbox/raw/master/bfs/fs_master.rb|51-57
68
+
69
+ If we get a __fsls__ request, we probe the key-value store for the requested path by projecting _reqid_, _path_ from the __fsls__ tuple into __kvget__. If the given path
70
+ is a key, __kvget_response__ will contain a tuple with the same _reqid_, and the join on the second line will succeed. In this case, we insert the value
71
+ associated with that key into __fsret__. Otherwise, the third rule will fire, inserting a failure tuple into __fsret__.
72
+
73
+
74
+ ### Mutation
75
+
76
+ The logic for file and directory creation and deletion follows a similar pattern with regard to the parent directory:
77
+
78
+ ==https://github.com/bloom-lang/bud-sandbox/raw/master/bfs/fs_master.rb|61-71
79
+
80
+ Unlike a directory listing, however, these operations change the state of the file system. In general, any state change will involve
81
+ carrying out two mutating operations to the key-value store atomically:
82
+
83
+ 1. update the value (child array) associated with the parent directory entry
84
+ 2. update the key-value pair associated with the object in question (a file or directory being created or destroyed).
85
+
86
+ The following Bloom code carries this out:
87
+
88
+ ==https://github.com/bloom-lang/bud-sandbox/raw/master/bfs/fs_master.rb|73-73
89
+ ==https://github.com/bloom-lang/bud-sandbox/raw/master/bfs/fs_master.rb|80-108
90
+
91
+
92
+ <!--- (**JMH**: This next sounds awkward. You *do* take care: by using <= and understanding the atomicity of timesteps in Bloom. I think what you mean to say is that Bloom's atomic timestep model makes this easy compared to ... something.)
93
+ Note that we need not take any particular care to ensure that the two inserts into __kvput__ occur together atomically. Because both statements use the synchronous
94
+ -->
95
+ Note that because both inserts into the __kvput__ collection use the synchronous operator (`<=`), we know that they will occur together in the same fixpoint computation or not at all.
96
+ Therefore we need not be concerned with explicitly sequencing the operations (e.g., ensuring that the directory entry is created _after_ the file entry) to deal with concurrency:
97
+ there can be no visible state of the database in which only one of the operations has succeeded.
98
+
99
+ If the request is a deletion, we need some additional logic to enforce the constraint that only an empty directory may be removed:
100
+
101
+
102
+ ==https://github.com/bloom-lang/bud-sandbox/raw/master/bfs/fs_master.rb|74-78
103
+ ==https://github.com/bloom-lang/bud-sandbox/raw/master/bfs/fs_master.rb|110-115
104
+
105
+
106
+ Recall that when we created KVSFS we mixed in __TimestepNonce__, one of the nonce libraries. While we were able to use the _reqid_ field from the input operation as a unique identifier
107
+ for one of our KVS operations, we need a fresh, unique request id for the second KVS operation in the atomic pair described above. By joining __nonce__, we get
108
+ an identifier that is unique to this timestep.
109
+
110
+
111
+ ## [File Chunking](https://github.com/bloom-lang/bud-sandbox/blob/master/bfs/chunking.rb)
112
+
113
+ Now that we have a module providing a basic file system, we can extend it to support chunked storage of file contents. The metadata master will contain, in addition to the KVS
114
+ structure for directory information, a relation mapping a set of chunk identifiers to each file
115
+
116
+ ==https://github.com/bloom-lang/bud-sandbox/raw/master/bfs/chunking.rb|26-26
117
+
118
+ and a relation associating each chunk with the set of datanodes that host a replica of it.
119
+
120
+ ==https://github.com/bloom-lang/bud-sandbox/raw/5c7734912e900c28087e39b3424a1e0191e13704/bfs/hb_master.rb|12-12
121
+
122
+ (**JMH**: ambiguous reference ahead "these latter")
123
+ The latter (defined in __HBMaster__) is soft-state, kept up to date by heartbeat messages from datanodes (described in the next section).
124
+
125
+ To support chunked storage, we add a few metadata operations to those already defined by FSProtocol:
126
+
127
+ ==https://github.com/bloom-lang/bud-sandbox/raw/master/bfs/chunking.rb|6-16
128
+
129
+ * __fschunklist__ returns the set of chunks belonging to a given file.
130
+ * __fschunklocations__ returns the set of datanodes in possession of a given chunk.
131
+ * __fsaddchunk__ returns a new chunkid for appending to an existing file, guaranteed to be higher than any existing chunkids for that file, and a list of candidate datanodes that can store a replica of the new chunk.
132
+
133
+ We continue to use __fsret__ for return values.
134
+
135
+ ### Lookups
136
+
137
+ Lines 34-44 follow a similar pattern to what we saw in the basic FS: whenever we get a __fschunklist__ or __fsaddchunk__ request, we must first ensure that the given file
138
+ exists, and error out if not. If it does, and the operation was __fschunklist__, we join the metadata relation __chunk__ and return the set of chunks owned
139
+ by the given (existent) file:
140
+
141
+ ==https://github.com/bloom-lang/bud-sandbox/raw/master/bfs/chunking.rb|47-49
142
+
143
+ ### Add chunk
144
+
145
+ If it was a __fsaddchunk__ request, we need to generate a unique id for a new chunk and return a list of target datanodes. We reuse __TimestepNonce__ to do the former, and join a relation
146
+ called __available__ that is exported by __HBMaster__ (described in the next section) for the latter:
147
+
148
+ ==https://github.com/bloom-lang/bud-sandbox/raw/master/bfs/chunking.rb|69-76
149
+
150
+ Finally, if it was a __fschunklocations__ request, we have another possible error scenario, because the nodes associated with chunks are part of our soft state. Even if the file
151
+ exists, it may not be the case that we have fresh information in our cache about what datanodes own a replica of the given chunk:
152
+
153
+ ==https://github.com/bloom-lang/bud-sandbox/raw/master/bfs/chunking.rb|54-58
154
+
155
+ Otherwise, __chunk_cache__ has information about the given chunk, which we may return to the client:
156
+
157
+ ==https://github.com/bloom-lang/bud-sandbox/raw/master/bfs/chunking.rb|61-64
158
+
159
+
160
+ ## Datanodes and Heartbeats
161
+
162
+ ### [Datanode](https://github.com/bloom-lang/bud-sandbox/blob/master/bfs/datanode.rb)
163
+
164
+ A datanode runs both Bud code (to support the heartbeat and control protocols) and pure Ruby (to support the data transfer protocol). A datanode's main job is keeping the master
165
+ aware of its existence and its state, and participating when necessary in data pipelines to read or write chunk data to and from its local storage.
166
+
167
+ ==https://github.com/bloom-lang/bud-sandbox/raw/master/bfs/datanode.rb|11-15
168
+
169
+ By mixing in HeartbeatAgent, the datanode includes the machinery necessary to regularly send status messages to the master. __HeartbeatAgent__ provides an input interface
170
+ called __payload__ that allows an agent to optionally include additional information in heartbeat messages: in our case, we wish to include state deltas which ensure that
171
+ the master has an accurate view of the set of chunks owned by the datanode.
172
+
173
+ When a datanode is constructed, it takes a port at which the embedded data protocol server will listen, and starts the server in the background:
174
+
175
+ ==https://github.com/bloom-lang/bud-sandbox/raw/master/bfs/datanode.rb|61-62
176
+
177
+ At regular intervals, a datanode polls its local chunk directory (which is independently written to by the data protocol):
178
+
179
+ ==https://github.com/bloom-lang/bud-sandbox/raw/master/bfs/datanode.rb|26-31
180
+
181
+ We update the payload that we send to the master if our recent poll found files that we don't believe the master knows about:
182
+
183
+
184
+ ==https://github.com/bloom-lang/bud-sandbox/raw/master/bfs/datanode.rb|33-40
185
+
186
+ Our view of what the master ``knows'' about reflects our local cache of acknowledgement messages from the master. This logic is defined in __HBMaster__.
187
+
188
+ ### [Heartbeat master logic](https://github.com/bloom-lang/bud-sandbox/blob/master/bfs/hb_master.rb)
189
+
190
+ On the master side of heartbeats, we always send an ack when we get a heartbeat, so that the datanode doesn't need to keep resending its
191
+ payload of local chunks:
192
+
193
+ ==https://github.com/bloom-lang/bud-sandbox/raw/5c7734912e900c28087e39b3424a1e0191e13704/bfs/hb_master.rb|30-32
194
+
195
+ At the same time, we use the Ruby _flat_map_ method to flatten the array of chunks in the heartbeat payload into a set of tuples, which we
196
+ associate with the heartbeating datanode and the time of receipt in __chunk_cache__:
197
+
198
+ ==https://github.com/bloom-lang/bud-sandbox/raw/5c7734912e900c28087e39b3424a1e0191e13704/bfs/hb_master.rb|22-28
199
+
200
+ We periodically garbage-collect this cache, removing entries for datanodes from which we have not received a heartbeat in a configurable amount of time.
201
+ __last_heartbeat__ is an output interface provided by the __HeartbeatAgent__ module, and contains the most recent, non-stale heartbeat contents:
202
+
203
+ ==https://github.com/bloom-lang/bud-sandbox/raw/5c7734912e900c28087e39b3424a1e0191e13704/bfs/hb_master.rb|34-36
204
+
205
+
206
+ ## [BFS Client](https://github.com/bloom-lang/bud-sandbox/blob/master/bfs/bfs_client.rb)
207
+
208
+ One of the most complicated parts of the basic GFS design is the client component. To minimize load on the centralized master, we take it off the critical
209
+ path of file transfers. The client therefore needs to pick up this work.
210
+
211
+ We won't spend too much time on the details of the client code, as it is nearly all _plain old Ruby_. The basic idea is:
212
+
213
+ 1. Pure metadata operations
214
+ * _mkdir_, _create_, _ls_, _rm_
215
+ * Send the request to the master and inform the caller of the status.
216
+ * If _ls_, return the directory listing to the caller.
217
+ 2. Append
218
+ * Send a __fsaddchunk__ request to the master, which should return a new chunkid and a list of datanodes.
219
+ * Read a chunk worth of data from the input stream.
220
+ * Connect to the first datanode in the list. Send a header containing the chunkid and the remaining datanodes.
221
+ * Stream the file contents. The target datanode will then ``play client'' and continue the pipeline to the next datanode, and so on.
222
+ 3. Read
223
+ * Send a __getchunks__ request to the master for the given file. It should return the list of chunks owned by the file.
224
+ * For each chunk,
225
+ * Send a __fschunklocations__ request to the master, which should return a list of datanodes in possession of the chunk (returning a list allows the client to perform retries without more communication with the master, should some of the datanodes fail).
226
+ * Connect to a datanode from the list and stream the chunk to a local buffer.
227
+ * As chunks become available, stream them to the caller.
228
+
229
+
230
+ ## [Data transfer protocol](https://github.com/bloom-lang/bud-sandbox/blob/master/bfs/data_protocol.rb)
231
+
232
+ The data transfer protocol comprises a set of support functions for the bulk data transfers whose use is described in the previous section.
233
+ Because it is _plain old Ruby_ it is not as interesting as the other modules. It provides:
234
+
235
+ * The TCP server code that runs at each datanode, which parses headers and writes stream data to the local FS (these files are later detected by the directory poll).
236
+ * Client API calls to connect to datanodes and stream data. Datanodes also use this protocol to pipeline chunks to downstream datanodes.
237
+ * Master API code invoked by a background process to replicate chunks from datanodes to other datanodes, when the replication factor for a chunk is too low.
238
+
239
+ ## [Master background process](https://github.com/bloom-lang/bud-sandbox/blob/master/bfs/background.rb)
240
+
241
+ So far, we have implemented the BFS master as a strictly reactive system: when clients make requests, it queries and possibly updates local state.
242
+ To maintain the durability requirement that `REP_FACTOR` copies of every chunk are stored on distinct nodes, the master must be an active system
243
+ that maintains a near-consistent view of global state, and takes steps to correct violated requirements.
244
+
245
+ __chunk_cache__ is the master's view of datanode state, maintained, as described above, by collecting and pruning heartbeat messages.
246
+
247
+ ==https://github.com/bloom-lang/bud-sandbox/raw/master/bfs/background.rb|24-27
248
+
249
+ After defining some helper aggregates (__chunk_cnts_chunk__, the replica count by chunk, and __chunk_cnts_host__, the per-datanode fill factor), we identify under-replicated chunks, the nodes that hold them, and candidate nodes to receive new copies:
250
+
251
+ ==https://github.com/bloom-lang/bud-sandbox/raw/master/bfs/background.rb|29-36
data/docs/bfs_arch.png ADDED
Binary file