bud 0.0.2
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- data/LICENSE +9 -0
- data/README +30 -0
- data/bin/budplot +134 -0
- data/bin/budvis +201 -0
- data/bin/rebl +4 -0
- data/docs/README.md +13 -0
- data/docs/bfs.md +379 -0
- data/docs/bfs.raw +251 -0
- data/docs/bfs_arch.png +0 -0
- data/docs/bloom-loop.png +0 -0
- data/docs/bust.md +83 -0
- data/docs/cheat.md +291 -0
- data/docs/deploy.md +96 -0
- data/docs/diffs +181 -0
- data/docs/getstarted.md +296 -0
- data/docs/intro.md +36 -0
- data/docs/modules.md +112 -0
- data/docs/operational.md +96 -0
- data/docs/rebl.md +99 -0
- data/docs/ruby_hooks.md +19 -0
- data/docs/visualizations.md +75 -0
- data/examples/README +1 -0
- data/examples/basics/hello.rb +12 -0
- data/examples/basics/out +1103 -0
- data/examples/basics/out.new +856 -0
- data/examples/basics/paths.rb +51 -0
- data/examples/bust/README.md +9 -0
- data/examples/bust/bustclient-example.rb +23 -0
- data/examples/bust/bustinspector.html +135 -0
- data/examples/bust/bustserver-example.rb +18 -0
- data/examples/chat/README.md +9 -0
- data/examples/chat/chat.rb +45 -0
- data/examples/chat/chat_protocol.rb +8 -0
- data/examples/chat/chat_server.rb +29 -0
- data/examples/deploy/tokenring-ec2.rb +26 -0
- data/examples/deploy/tokenring-local.rb +17 -0
- data/examples/deploy/tokenring.rb +39 -0
- data/lib/bud/aggs.rb +126 -0
- data/lib/bud/bud_meta.rb +185 -0
- data/lib/bud/bust/bust.rb +126 -0
- data/lib/bud/bust/client/idempotence.rb +10 -0
- data/lib/bud/bust/client/restclient.rb +49 -0
- data/lib/bud/collections.rb +937 -0
- data/lib/bud/depanalysis.rb +44 -0
- data/lib/bud/deploy/countatomicdelivery.rb +50 -0
- data/lib/bud/deploy/deployer.rb +67 -0
- data/lib/bud/deploy/ec2deploy.rb +200 -0
- data/lib/bud/deploy/localdeploy.rb +41 -0
- data/lib/bud/errors.rb +15 -0
- data/lib/bud/graphs.rb +405 -0
- data/lib/bud/joins.rb +300 -0
- data/lib/bud/rebl.rb +314 -0
- data/lib/bud/rewrite.rb +523 -0
- data/lib/bud/rtrace.rb +27 -0
- data/lib/bud/server.rb +43 -0
- data/lib/bud/state.rb +108 -0
- data/lib/bud/storage/tokyocabinet.rb +170 -0
- data/lib/bud/storage/zookeeper.rb +178 -0
- data/lib/bud/stratify.rb +83 -0
- data/lib/bud/viz.rb +65 -0
- data/lib/bud.rb +797 -0
- metadata +330 -0
data/docs/bfs.md
ADDED
@@ -0,0 +1,379 @@
# BFS: A distributed file system in Bloom

In this document we'll use what we've learned to build a piece of systems software using Bloom. The libraries that ship with BUD provide many of the building blocks we'll need to create a distributed,
``chunked'' file system in the style of the Google File System (GFS):

* a [key-value store](https://github.com/bloom-lang/bud-sandbox/blob/master/kvs/kvs.rb) (KVS)
* [nonce generation](https://github.com/bloom-lang/bud-sandbox/blob/master/ordering/nonce.rb)
* a [heartbeat protocol](https://github.com/bloom-lang/bud-sandbox/blob/master/heartbeat/heartbeat.rb)

## High-level architecture



BFS implements a chunked, distributed file system (mostly) in the Bloom
language. BFS is architecturally based on [BOOM-FS](http://db.cs.berkeley.edu/papers/eurosys10-boom.pdf), which is itself based on
the Google File System (GFS). As in GFS, a single master node manages
file system metadata, while data blocks are replicated and stored on a large
number of storage nodes. Writing or reading data involves a multi-step
protocol in which clients interact with the master, retrieving metadata and
possibly changing state, then interact with storage nodes to read or write
chunks. Background jobs running on the master will contact storage nodes to
orchestrate chunk migrations, during which storage nodes communicate with
other storage nodes. As in BOOM-FS, the communication protocols and the data
channel used for bulk data transfer between clients and datanodes and between
datanodes are written outside Bloom (in Ruby).

## [Basic File System](https://github.com/bloom-lang/bud-sandbox/blob/master/bfs/fs_master.rb)

Before we worry about any of the details of distribution, we need to implement the basic file system metadata operations: _create_, _remove_, _mkdir_ and _ls_.
There are many choices for how to implement these operations, and it makes sense to keep them separate from the (largely orthogonal) distributed file system logic.
That way, it will be possible later to choose a different implementation of the metadata operations without impacting the rest of the system.
Another benefit of modularizing the metadata logic is that it can be independently tested and debugged. We want to get the core of the file system
working correctly before we even send a whisper over the network, let alone add any complex features.

### Protocol

    module FSProtocol
      state do
        interface input, :fsls, [:reqid, :path]
        interface input, :fscreate, [] => [:reqid, :name, :path, :data]
        interface input, :fsmkdir, [] => [:reqid, :name, :path]
        interface input, :fsrm, [] => [:reqid, :name, :path]
        interface output, :fsret, [:reqid, :status, :data]
      end
    end

We create an input interface for each of the operations, and a single output interface for the return for any operation: given a request id, __status__ is a boolean
indicating whether the request succeeded, and __data__ may contain return values (e.g., _fsls_ should return an array containing the directory contents).
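
For instance (illustrative values only, not output captured from the real system), a successful _fsls_ on a directory with two children might produce an __fsret__ tuple like:

    # [reqid, status, data]
    [42, true, ["hello.txt", "subdir"]]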

### Implementation

We already have a library that provides an updateable flat namespace: the key-value store. We can easily implement the tree structure of a file system over a key-value store
in the following way:

1. keys are paths
2. directories have arrays containing child entries (base names)
3. file values are their contents

<!--- (**JMH**: I find it a bit confusing how you toggle from the discussion above to this naive file-storage design here. Can you warn us a bit more clearly that this is a starting point focused on metadata, with (3) being a strawman for data storage that is intended to be overriden later?)
--->
Note that (3) is a strawman: it will cease to apply when we implement chunked storage later. It is tempting, however, to support (3) so that the resulting program is a working
standalone file system.
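
To make the scheme concrete, here is a purely illustrative snapshot (a plain Ruby literal, not code from the BFS repository; paths and contents are invented) of what the KVS holds for a tiny tree with one directory and one file:

    # Hypothetical KVS contents under the mapping above
    {
      "/"              => ["tmp"],            # rules 1 and 2: key is a path, value lists child base names
      "/tmp"           => ["hello.txt"],      # directory entry
      "/tmp/hello.txt" => "hello, bloom\n"    # rule 3 (the strawman): a file's value is its contents
    }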

We begin our implementation of a KVS-backed metadata system in the following way:

    module KVSFS
      include FSProtocol
      include BasicKVS
      include TimestepNonce

If we wanted to replicate the master node's metadata we could consider mixing in a replicated KVS implementation instead of __BasicKVS__ -- but more on that later.

### Directory Listing

The directory listing operation is implemented by a simple block of Bloom statements:

    kvget <= fsls { |l| [l.reqid, l.path] }
    fsret <= (kvget_response * fsls).pairs(:reqid => :reqid) { |r, i| [r.reqid, true, r.value] }
    fsret <= fsls do |l|
      unless kvget_response.map{ |r| r.reqid}.include? l.reqid
        [l.reqid, false, nil]
      end
    end

If we get a __fsls__ request, we probe the key-value store for the requested path by projecting _reqid_, _path_ from the __fsls__ tuple into __kvget__. If the given path
is a key, __kvget_response__ will contain a tuple with the same _reqid_, and the join on the second line will succeed. In this case, we insert the value
associated with that key into __fsret__. Otherwise, the third rule will fire, inserting a failure tuple into __fsret__.


### Mutation

The logic for file and directory creation and deletion follows a similar pattern with regard to the parent directory:

    check_parent_exists <= fscreate { |c| [c.reqid, c.name, c.path, :create, c.data] }
    check_parent_exists <= fsmkdir { |m| [m.reqid, m.name, m.path, :mkdir, nil] }
    check_parent_exists <= fsrm { |m| [m.reqid, m.name, m.path, :rm, nil] }

    kvget <= check_parent_exists { |c| [c.reqid, c.path] }
    fsret <= check_parent_exists do |c|
      unless kvget_response.map{ |r| r.reqid}.include? c.reqid
        puts "not found #{c.path}" or [c.reqid, false, "parent path #{c.path} for #{c.name} does not exist"]
      end
    end


Unlike a directory listing, however, these operations change the state of the file system. In general, any state change will involve
carrying out two mutating operations to the key-value store atomically:

1. update the value (child array) associated with the parent directory entry
2. update the key-value pair associated with the object in question (a file or directory being created or destroyed).

The following Bloom code carries this out:

    temp :dir_exists <= (check_parent_exists * kvget_response * nonce).combos([check_parent_exists.reqid, kvget_response.reqid])
    fsret <= dir_exists do |c, r, n|
      if c.mtype == :rm
        unless can_remove.map{|can| can.orig_reqid}.include? c.reqid
          [c.reqid, false, "directory #{} not empty"]
        end
      end
    end

    # update dir entry
    # note that it is unnecessary to ensure that a file is created before its corresponding
    # directory entry, as both inserts into :kvput below will co-occur in the same timestep.
    kvput <= dir_exists do |c, r, n|
      if c.mtype == :rm
        if can_remove.map{|can| can.orig_reqid}.include? c.reqid
          [ip_port, c.path, n.ident, r.value.clone.reject{|item| item == c.name}]
        end
      else
        [ip_port, c.path, n.ident, r.value.clone.push(c.name)]
      end
    end

    kvput <= dir_exists do |c, r, n|
      case c.mtype
        when :mkdir
          [ip_port, terminate_with_slash(c.path) + c.name, c.reqid, []]
        when :create
          [ip_port, terminate_with_slash(c.path) + c.name, c.reqid, "LEAF"]
      end
    end


<!--- (**JMH**: This next sounds awkward. You *do* take care: by using <= and understanding the atomicity of timesteps in Bloom. I think what you mean to say is that Bloom's atomic timestep model makes this easy compared to ... something.)
Note that we need not take any particular care to ensure that the two inserts into __kvput__ occur together atomically. Because both statements use the synchronous
-->
Note that because both inserts into the __kvput__ collection use the synchronous operator (`<=`), we know that they will occur together in the same fixpoint computation or not at all.
Therefore we need not be concerned with explicitly sequencing the operations (e.g., ensuring that the directory entry is created _after_ the file entry) to deal with concurrency:
there can be no visible state of the database in which only one of the operations has succeeded.

If the request is a deletion, we need some additional logic to enforce the constraint that only an empty directory may be removed:

    check_is_empty <= (fsrm * nonce).pairs {|m, n| [n.ident, m.reqid, terminate_with_slash(m.path) + m.name] }
    kvget <= check_is_empty {|c| [c.reqid, c.name] }
    can_remove <= (kvget_response * check_is_empty).pairs([kvget_response.reqid, check_is_empty.reqid]) do |r, c|
      [c.reqid, c.orig_reqid, c.name] if r.value.length == 0
    end
    # delete entry -- if an 'rm' request,
    kvdel <= dir_exists do |c, r, n|
      if can_remove.map{|can| can.orig_reqid}.include? c.reqid
        [terminate_with_slash(c.path) + c.name, c.reqid]
      end
    end


Recall that when we created KVSFS we mixed in __TimestepNonce__, one of the nonce libraries. While we were able to use the _reqid_ field from the input operation as a unique identifier
for one of our KVS operations, we need a fresh, unique request id for the second KVS operation in the atomic pair described above. By joining __nonce__, we get
an identifier that is unique to this timestep.
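
As a minimal illustration of this pattern (a sketch only: the module and collection names `StampDemo`, `req_in` and `req_stamped` are invented, but the join mirrors the rules above), any module that mixes in __TimestepNonce__ can stamp tuples with a per-timestep identifier by joining against __nonce__:

    module StampDemo
      include TimestepNonce   # exports the nonce collection, whose ident field is fresh each timestep

      state do
        interface input, :req_in, [:payload]
        interface output, :req_stamped, [:ident, :payload]
      end

      bloom do
        # pair every request seen in this timestep with the timestep-unique ident
        req_stamped <= (req_in * nonce).pairs { |r, n| [n.ident, r.payload] }
      end
    end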

## [File Chunking](https://github.com/bloom-lang/bud-sandbox/blob/master/bfs/chunking.rb)

Now that we have a module providing a basic file system, we can extend it to support chunked storage of file contents. The metadata master will contain, in addition to the KVS
structure for directory information, a relation mapping a set of chunk identifiers to each file

    table :chunk, [:chunkid, :file, :siz]

and relations associating a chunk with a set of datanodes that host a replica of the chunk.

    table :chunk_cache, [:node, :chunkid, :time]

(**JMH**: ambiguous reference ahead "these latter")
The latter (defined in __HBMaster__) is soft-state, kept up to date by heartbeat messages from datanodes (described in the next section).

To support chunked storage, we add a few metadata operations to those already defined by FSProtocol:

    module ChunkedFSProtocol
      include FSProtocol

      state do
        interface :input, :fschunklist, [:reqid, :file]
        interface :input, :fschunklocations, [:reqid, :chunkid]
        interface :input, :fsaddchunk, [:reqid, :file]
        # note that no output interface is defined.
        # we use :fsret (defined in FSProtocol) for output.
      end
    end

* __fschunklist__ returns the set of chunks belonging to a given file.
* __fschunklocations__ returns the set of datanodes in possession of a given chunk.
* __fsaddchunk__ returns a new chunkid for appending to an existing file, guaranteed to be higher than any existing chunkids for that file, and a list of candidate datanodes that can store a replica of the new chunk.

We continue to use __fsret__ for return values.

### Lookups

Lines 34-44 follow a similar pattern to what we saw in the basic FS: whenever we get a __fschunklist__ or __fsaddchunk__ request, we must first ensure that the given file
exists, and error out if not. If it does, and the operation was __fschunklist__, we join the metadata relation __chunk__ and return the set of chunks owned
by the given (existent) file:

    chunk_buffer <= (fschunklist * kvget_response * chunk).combos([fschunklist.reqid, kvget_response.reqid], [fschunklist.file, chunk.file]) { |l, r, c| [l.reqid, c.chunkid] }
    chunk_buffer2 <= chunk_buffer.group([chunk_buffer.reqid], accum(chunk_buffer.chunkid))
    fsret <= chunk_buffer2 { |c| [c.reqid, true, c.chunklist] }

### Add chunk

If it was a __fsaddchunk__ request, we need to generate a unique id for a new chunk and return a list of target datanodes. We reuse __TimestepNonce__ to do the former, and join a relation
called __available__ that is exported by __HBMaster__ (described in the next section) for the latter:

    temp :minted_chunk <= (kvget_response * fsaddchunk * available * nonce).combos(kvget_response.reqid => fsaddchunk.reqid) {|r| r if last_heartbeat.length >= REP_FACTOR}
    chunk <= minted_chunk { |r, a, v, n| [n.ident, a.file, 0]}
    fsret <= minted_chunk { |r, a, v, n| [r.reqid, true, [n.ident, v.pref_list.slice(0, (REP_FACTOR + 2))]]}
    fsret <= (kvget_response * fsaddchunk).pairs(:reqid => :reqid) do |r, a|
      if available.empty? or available.first.pref_list.length < REP_FACTOR
        [r.reqid, false, "datanode set cannot satisfy REP_FACTOR = #{REP_FACTOR} with [#{available.first.nil? ? "NIL" : available.first.pref_list.inspect}]"]
      end
    end

Finally, if it was a __fschunklocations__ request, we have another possible error scenario, because the nodes associated with chunks are a part of our soft state. Even if the file
exists, it may not be the case that we have fresh information in our cache about what datanodes own a replica of the given chunk:

    fsret <= fschunklocations do |l|
      unless chunk_cache_alive.map{|c| c.chunkid}.include? l.chunkid
        [l.reqid, false, "no datanodes found for #{l.chunkid} in cc, now #{chunk_cache_alive.length}, with last_hb #{last_heartbeat.length}"]
      end
    end

Otherwise, __chunk_cache__ has information about the given chunk, which we may return to the client:

    temp :chunkjoin <= (fschunklocations * chunk_cache_alive).pairs(:chunkid => :chunkid)
    host_buffer <= chunkjoin {|l, c| [l.reqid, c.node] }
    host_buffer2 <= host_buffer.group([host_buffer.reqid], accum(host_buffer.host))
    fsret <= host_buffer2 {|c| [c.reqid, true, c.hostlist] }

## Datanodes and Heartbeats

### [Datanode](https://github.com/bloom-lang/bud-sandbox/blob/master/bfs/datanode.rb)

A datanode runs both Bud code (to support the heartbeat and control protocols) and pure Ruby (to support the data transfer protocol). A datanode's main job is keeping the master
aware of its existence and its state, and participating when necessary in data pipelines to read or write chunk data to and from its local storage.

    module BFSDatanode
      include HeartbeatAgent
      include StaticMembership
      include TimestepNonce
      include BFSHBProtocol

By mixing in HeartbeatAgent, the datanode includes the machinery necessary to regularly send status messages to the master. __HeartbeatAgent__ provides an input interface
called __payload__ that allows an agent to optionally include additional information in heartbeat messages: in our case, we wish to include state deltas which ensure that
the master has an accurate view of the set of chunks owned by the datanode.

When a datanode is constructed, it takes a port at which the embedded data protocol server will listen, and starts the server in the background:

    @dp_server = DataProtocolServer.new(dataport)
    return_address <+ [["localhost:#{dataport}"]]

At regular intervals, a datanode polls its local chunk directory (which is independently written to by the data protocol):

    dir_contents <= hb_timer.flat_map do |t|
      dir = Dir.new("#{DATADIR}/#{@data_port}")
      files = dir.to_a.map{|d| d.to_i unless d =~ /^\./}.uniq!
      dir.close
      files.map {|f| [f, Time.parse(t.val).to_f]}
    end

We update the payload that we send to the master if our recent poll found files that we don't believe the master knows about:

    to_payload <= (dir_contents * nonce).pairs do |c, n|
      unless server_knows.map{|s| s.file}.include? c.file
        #puts "BCAST #{c.file}; server doesn't know" or [n.ident, c.file, c.time]
        [n.ident, c.file, c.time]
      else
        #puts "server knows about #{server_knows.length} files"
      end
    end

Our view of what the master ``knows'' about reflects our local cache of acknowledgement messages from the master. This logic is defined in __HBMaster__.

### [Heartbeat master logic](https://github.com/bloom-lang/bud-sandbox/blob/master/bfs/hb_master.rb)

On the master side of heartbeats, we always send an ack when we get a heartbeat, so that the datanode doesn't need to keep resending its
payload of local chunks:

    hb_ack <~ last_heartbeat.map do |l|
      [l.sender, l.payload[0]] unless l.payload[1] == [nil]
    end

At the same time, we use the Ruby _flat_map_ method to flatten the array of chunks in the heartbeat payload into a set of tuples, which we
associate with the heartbeating datanode and the time of receipt in __chunk_cache__:

    chunk_cache <= join([master_duty_cycle, last_heartbeat]).flat_map do |d, l|
      unless l.payload[1].nil?
        l.payload[1].map do |pay|
          [l.peer, pay, Time.parse(d.val).to_f]
        end
      end
    end

We periodically garbage-collect this cache, removing entries for datanodes from whom we have not received a heartbeat in a configurable amount of time.
__last_heartbeat__ is an output interface provided by the __HeartbeatAgent__ module, and contains the most recent, non-stale heartbeat contents:

    chunk_cache <- join([master_duty_cycle, chunk_cache]).map do |t, c|
      c unless last_heartbeat.map{|h| h.peer}.include? c.node
    end


## [BFS Client](https://github.com/bloom-lang/bud-sandbox/blob/master/bfs/bfs_client.rb)

One of the most complicated parts of the basic GFS design is the client component. To minimize load on the centralized master, we take it off the critical
path of file transfers. The client therefore needs to pick up this work.

We won't spend too much time on the details of the client code, as it is nearly all _plain old Ruby_. The basic idea is (a sketch of the append path follows the list):

1. Pure metadata operations
    * _mkdir_, _create_, _ls_, _rm_
    * Send the request to the master and inform the caller of the status.
    * If _ls_, return the directory listing to the caller.
2. Append
    * Send a __fsaddchunk__ request to the master, which should return a new chunkid and a list of datanodes.
    * Read a chunk's worth of data from the input stream.
    * Connect to the first datanode in the list. Send a header containing the chunkid and the remaining datanodes.
    * Stream the file contents. The target datanode will then ``play client'' and continue the pipeline to the next datanode, and so on.
3. Read
    * Send a __getchunks__ request to the master for the given file. It should return the list of chunks owned by the file.
    * For each chunk,
        * Send a __fschunklocations__ request to the master, which should return a list of datanodes in possession of the chunk (returning a list allows the client to perform retries without more communication with the master, should some of the datanodes fail).
        * Connect to a datanode from the list and stream the chunk to a local buffer.
    * As chunks become available, stream them to the caller.
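
To make the append path concrete, here is a hedged sketch in plain Ruby (the method name, header format and `CHUNKSIZE` constant are invented for illustration; the real logic lives in `bfs_client.rb` and `data_protocol.rb`):

    require 'socket'
    require 'json'

    CHUNKSIZE = 64 * 1024 * 1024   # assumed chunk size, for illustration only

    # Push one chunk's worth of data from io into the pipeline rooted at the first
    # datanode; the remaining datanodes ride along in the header so that the first
    # node can ``play client'' and forward the chunk downstream.
    def append_chunk(chunkid, datanodes, io)
      data = io.read(CHUNKSIZE)
      head, *rest = datanodes                      # e.g. ["host1:5001", "host2:5001"]
      host, port = head.split(":")
      sock = TCPSocket.new(host, port.to_i)
      sock.puts({ "chunkid" => chunkid, "downstream" => rest }.to_json)
      sock.write(data)
      sock.close
    end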

## [Data transfer protocol](https://github.com/bloom-lang/bud-sandbox/blob/master/bfs/data_protocol.rb)

The data transfer protocol comprises a set of support functions for the bulk data transfers whose use is described in the previous section.
Because it is _plain old Ruby_ it is not as interesting as the other modules. It provides:

* The TCP server code that runs at each datanode, which parses headers and writes stream data to the local FS (these files are later detected by the directory poll); a sketch of this loop follows the list.
* Client API calls to connect to datanodes and stream data. Datanodes also use this protocol to pipeline chunks to downstream datanodes.
* Master API code invoked by a background process to replicate chunks from datanodes to other datanodes, when the replication factor for a chunk is too low.
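
The datanode-side receive loop can be pictured with a similarly hedged sketch (plain Ruby again; the header format and directory layout are invented, see `data_protocol.rb` for the real implementation):

    require 'socket'
    require 'json'
    require 'fileutils'

    # Accept pipelined chunk writes: read a JSON header, then spill the chunk body
    # into the local data directory, where the periodic directory poll will find it.
    def serve_chunks(port, datadir)
      server = TCPServer.new(port)
      loop do
        Thread.new(server.accept) do |sock|
          header = JSON.parse(sock.gets)   # e.g. { "chunkid" => 7, "downstream" => [...] }
          data   = sock.read
          FileUtils.mkdir_p(datadir)
          File.binwrite(File.join(datadir, header["chunkid"].to_s), data)
          # a real datanode would now forward the chunk to header["downstream"]
          sock.close
        end
      end
    end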

## [Master background process](https://github.com/bloom-lang/bud-sandbox/blob/master/bfs/background.rb)

So far, we have implemented the BFS master as a strictly reactive system: when clients make requests, it queries and possibly updates local state.
To maintain the durability requirement that `REP_FACTOR` copies of every chunk are stored on distinct nodes, the master must be an active system
that maintains a near-consistent view of global state, and takes steps to correct violated requirements.

__chunk_cache__ is the master's view of datanode state, maintained as described above by collecting and pruning heartbeat messages.

    cc_demand <= (bg_timer * chunk_cache_alive).rights
    cc_demand <= (bg_timer * last_heartbeat).pairs {|b, h| [h.peer, nil, nil]}
    chunk_cnts_chunk <= cc_demand.group([cc_demand.chunkid], count(cc_demand.node))
    chunk_cnts_host <= cc_demand.group([cc_demand.node], count(cc_demand.chunkid))

After defining some helper aggregates (__chunk_cnts_chunk__, the replica count by chunk, and __chunk_cnts_host__, the datanode fill factor), we can identify under-replicated chunks:

    lowchunks <= chunk_cnts_chunk { |c| [c.chunkid] if c.replicas < REP_FACTOR and !c.chunkid.nil?}

    # nodes in possession of such chunks
    sources <= (cc_demand * lowchunks).pairs(:chunkid => :chunkid) {|a, b| [a.chunkid, a.node]}
    # nodes not in possession of such chunks, and their fill factor
    candidate_nodes <= (chunk_cnts_host * lowchunks).pairs do |c, p|
      unless chunk_cache_alive.map{|a| a.node if a.chunkid == p.chunkid}.include? c.host
        [p.chunkid, c.host, c.chunks]

### I am autogenerated. Please do not edit me.
data/docs/bfs.raw
ADDED
@@ -0,0 +1,251 @@
# BFS: A distributed file system in Bloom

In this document we'll use what we've learned to build a piece of systems software using Bloom. The libraries that ship with BUD provide many of the building blocks we'll need to create a distributed,
``chunked'' file system in the style of the Google File System (GFS):

* a [key-value store](https://github.com/bloom-lang/bud-sandbox/blob/master/kvs/kvs.rb) (KVS)
* [nonce generation](https://github.com/bloom-lang/bud-sandbox/blob/master/ordering/nonce.rb)
* a [heartbeat protocol](https://github.com/bloom-lang/bud-sandbox/blob/master/heartbeat/heartbeat.rb)

## High-level architecture



BFS implements a chunked, distributed file system (mostly) in the Bloom
language. BFS is architecturally based on [BOOM-FS](http://db.cs.berkeley.edu/papers/eurosys10-boom.pdf), which is itself based on
the Google File System (GFS). As in GFS, a single master node manages
file system metadata, while data blocks are replicated and stored on a large
number of storage nodes. Writing or reading data involves a multi-step
protocol in which clients interact with the master, retrieving metadata and
possibly changing state, then interact with storage nodes to read or write
chunks. Background jobs running on the master will contact storage nodes to
orchestrate chunk migrations, during which storage nodes communicate with
other storage nodes. As in BOOM-FS, the communication protocols and the data
channel used for bulk data transfer between clients and datanodes and between
datanodes are written outside Bloom (in Ruby).

## [Basic File System](https://github.com/bloom-lang/bud-sandbox/blob/master/bfs/fs_master.rb)

Before we worry about any of the details of distribution, we need to implement the basic file system metadata operations: _create_, _remove_, _mkdir_ and _ls_.
There are many choices for how to implement these operations, and it makes sense to keep them separate from the (largely orthogonal) distributed file system logic.
That way, it will be possible later to choose a different implementation of the metadata operations without impacting the rest of the system.
Another benefit of modularizing the metadata logic is that it can be independently tested and debugged. We want to get the core of the file system
working correctly before we even send a whisper over the network, let alone add any complex features.

### Protocol

==https://github.com/bloom-lang/bud-sandbox/raw/master/bfs/fs_master.rb|12-20

We create an input interface for each of the operations, and a single output interface for the return for any operation: given a request id, __status__ is a boolean
indicating whether the request succeeded, and __data__ may contain return values (e.g., _fsls_ should return an array containing the directory contents).

### Implementation

We already have a library that provides an updateable flat namespace: the key-value store. We can easily implement the tree structure of a file system over a key-value store
in the following way:

1. keys are paths
2. directories have arrays containing child entries (base names)
3. file values are their contents

<!--- (**JMH**: I find it a bit confusing how you toggle from the discussion above to this naive file-storage design here. Can you warn us a bit more clearly that this is a starting point focused on metadata, with (3) being a strawman for data storage that is intended to be overriden later?)
--->
Note that (3) is a strawman: it will cease to apply when we implement chunked storage later. It is tempting, however, to support (3) so that the resulting program is a working
standalone file system.

We begin our implementation of a KVS-backed metadata system in the following way:


==https://github.com/bloom-lang/bud-sandbox/raw/master/bfs/fs_master.rb|33-36

If we wanted to replicate the master node's metadata we could consider mixing in a replicated KVS implementation instead of __BasicKVS__ -- but more on that later.

### Directory Listing

The directory listing operation is implemented by a simple block of Bloom statements:

==https://github.com/bloom-lang/bud-sandbox/raw/master/bfs/fs_master.rb|51-57

If we get a __fsls__ request, we probe the key-value store for the requested path by projecting _reqid_, _path_ from the __fsls__ tuple into __kvget__. If the given path
is a key, __kvget_response__ will contain a tuple with the same _reqid_, and the join on the second line will succeed. In this case, we insert the value
associated with that key into __fsret__. Otherwise, the third rule will fire, inserting a failure tuple into __fsret__.

### Mutation

The logic for file and directory creation and deletion follows a similar pattern with regard to the parent directory:

==https://github.com/bloom-lang/bud-sandbox/raw/master/bfs/fs_master.rb|61-71

Unlike a directory listing, however, these operations change the state of the file system. In general, any state change will involve
carrying out two mutating operations to the key-value store atomically:

1. update the value (child array) associated with the parent directory entry
2. update the key-value pair associated with the object in question (a file or directory being created or destroyed).

The following Bloom code carries this out:

==https://github.com/bloom-lang/bud-sandbox/raw/master/bfs/fs_master.rb|73-73
==https://github.com/bloom-lang/bud-sandbox/raw/master/bfs/fs_master.rb|80-108


<!--- (**JMH**: This next sounds awkward. You *do* take care: by using <= and understanding the atomicity of timesteps in Bloom. I think what you mean to say is that Bloom's atomic timestep model makes this easy compared to ... something.)
Note that we need not take any particular care to ensure that the two inserts into __kvput__ occur together atomically. Because both statements use the synchronous
-->
Note that because both inserts into the __kvput__ collection use the synchronous operator (`<=`), we know that they will occur together in the same fixpoint computation or not at all.
Therefore we need not be concerned with explicitly sequencing the operations (e.g., ensuring that the directory entry is created _after_ the file entry) to deal with concurrency:
there can be no visible state of the database in which only one of the operations has succeeded.

If the request is a deletion, we need some additional logic to enforce the constraint that only an empty directory may be removed:


==https://github.com/bloom-lang/bud-sandbox/raw/master/bfs/fs_master.rb|74-78
==https://github.com/bloom-lang/bud-sandbox/raw/master/bfs/fs_master.rb|110-115


Recall that when we created KVSFS we mixed in __TimestepNonce__, one of the nonce libraries. While we were able to use the _reqid_ field from the input operation as a unique identifier
for one of our KVS operations, we need a fresh, unique request id for the second KVS operation in the atomic pair described above. By joining __nonce__, we get
an identifier that is unique to this timestep.


## [File Chunking](https://github.com/bloom-lang/bud-sandbox/blob/master/bfs/chunking.rb)

Now that we have a module providing a basic file system, we can extend it to support chunked storage of file contents. The metadata master will contain, in addition to the KVS
structure for directory information, a relation mapping a set of chunk identifiers to each file

==https://github.com/bloom-lang/bud-sandbox/raw/master/bfs/chunking.rb|26-26

and relations associating a chunk with a set of datanodes that host a replica of the chunk.

==https://github.com/bloom-lang/bud-sandbox/raw/5c7734912e900c28087e39b3424a1e0191e13704/bfs/hb_master.rb|12-12

(**JMH**: ambiguous reference ahead "these latter")
The latter (defined in __HBMaster__) is soft-state, kept up to date by heartbeat messages from datanodes (described in the next section).

To support chunked storage, we add a few metadata operations to those already defined by FSProtocol:

==https://github.com/bloom-lang/bud-sandbox/raw/master/bfs/chunking.rb|6-16

* __fschunklist__ returns the set of chunks belonging to a given file.
* __fschunklocations__ returns the set of datanodes in possession of a given chunk.
* __fsaddchunk__ returns a new chunkid for appending to an existing file, guaranteed to be higher than any existing chunkids for that file, and a list of candidate datanodes that can store a replica of the new chunk.

We continue to use __fsret__ for return values.

### Lookups

Lines 34-44 follow a similar pattern to what we saw in the basic FS: whenever we get a __fschunklist__ or __fsaddchunk__ request, we must first ensure that the given file
exists, and error out if not. If it does, and the operation was __fschunklist__, we join the metadata relation __chunk__ and return the set of chunks owned
by the given (existent) file:

==https://github.com/bloom-lang/bud-sandbox/raw/master/bfs/chunking.rb|47-49

### Add chunk

If it was a __fsaddchunk__ request, we need to generate a unique id for a new chunk and return a list of target datanodes. We reuse __TimestepNonce__ to do the former, and join a relation
called __available__ that is exported by __HBMaster__ (described in the next section) for the latter:

==https://github.com/bloom-lang/bud-sandbox/raw/master/bfs/chunking.rb|69-76

Finally, if it was a __fschunklocations__ request, we have another possible error scenario, because the nodes associated with chunks are a part of our soft state. Even if the file
exists, it may not be the case that we have fresh information in our cache about what datanodes own a replica of the given chunk:

==https://github.com/bloom-lang/bud-sandbox/raw/master/bfs/chunking.rb|54-58

Otherwise, __chunk_cache__ has information about the given chunk, which we may return to the client:

==https://github.com/bloom-lang/bud-sandbox/raw/master/bfs/chunking.rb|61-64


## Datanodes and Heartbeats

### [Datanode](https://github.com/bloom-lang/bud-sandbox/blob/master/bfs/datanode.rb)

A datanode runs both Bud code (to support the heartbeat and control protocols) and pure Ruby (to support the data transfer protocol). A datanode's main job is keeping the master
aware of its existence and its state, and participating when necessary in data pipelines to read or write chunk data to and from its local storage.

==https://github.com/bloom-lang/bud-sandbox/raw/master/bfs/datanode.rb|11-15

By mixing in HeartbeatAgent, the datanode includes the machinery necessary to regularly send status messages to the master. __HeartbeatAgent__ provides an input interface
called __payload__ that allows an agent to optionally include additional information in heartbeat messages: in our case, we wish to include state deltas which ensure that
the master has an accurate view of the set of chunks owned by the datanode.

When a datanode is constructed, it takes a port at which the embedded data protocol server will listen, and starts the server in the background:

==https://github.com/bloom-lang/bud-sandbox/raw/master/bfs/datanode.rb|61-62

At regular intervals, a datanode polls its local chunk directory (which is independently written to by the data protocol):

==https://github.com/bloom-lang/bud-sandbox/raw/master/bfs/datanode.rb|26-31

We update the payload that we send to the master if our recent poll found files that we don't believe the master knows about:


==https://github.com/bloom-lang/bud-sandbox/raw/master/bfs/datanode.rb|33-40

Our view of what the master ``knows'' about reflects our local cache of acknowledgement messages from the master. This logic is defined in __HBMaster__.

### [Heartbeat master logic](https://github.com/bloom-lang/bud-sandbox/blob/master/bfs/hb_master.rb)

On the master side of heartbeats, we always send an ack when we get a heartbeat, so that the datanode doesn't need to keep resending its
payload of local chunks:

==https://github.com/bloom-lang/bud-sandbox/raw/5c7734912e900c28087e39b3424a1e0191e13704/bfs/hb_master.rb|30-32

At the same time, we use the Ruby _flat_map_ method to flatten the array of chunks in the heartbeat payload into a set of tuples, which we
associate with the heartbeating datanode and the time of receipt in __chunk_cache__:

==https://github.com/bloom-lang/bud-sandbox/raw/5c7734912e900c28087e39b3424a1e0191e13704/bfs/hb_master.rb|22-28

We periodically garbage-collect this cache, removing entries for datanodes from whom we have not received a heartbeat in a configurable amount of time.
__last_heartbeat__ is an output interface provided by the __HeartbeatAgent__ module, and contains the most recent, non-stale heartbeat contents:

==https://github.com/bloom-lang/bud-sandbox/raw/5c7734912e900c28087e39b3424a1e0191e13704/bfs/hb_master.rb|34-36


## [BFS Client](https://github.com/bloom-lang/bud-sandbox/blob/master/bfs/bfs_client.rb)

One of the most complicated parts of the basic GFS design is the client component. To minimize load on the centralized master, we take it off the critical
path of file transfers. The client therefore needs to pick up this work.

We won't spend too much time on the details of the client code, as it is nearly all _plain old Ruby_. The basic idea is:

1. Pure metadata operations
    * _mkdir_, _create_, _ls_, _rm_
    * Send the request to the master and inform the caller of the status.
    * If _ls_, return the directory listing to the caller.
2. Append
    * Send a __fsaddchunk__ request to the master, which should return a new chunkid and a list of datanodes.
    * Read a chunk's worth of data from the input stream.
    * Connect to the first datanode in the list. Send a header containing the chunkid and the remaining datanodes.
    * Stream the file contents. The target datanode will then ``play client'' and continue the pipeline to the next datanode, and so on.
3. Read
    * Send a __getchunks__ request to the master for the given file. It should return the list of chunks owned by the file.
    * For each chunk,
        * Send a __fschunklocations__ request to the master, which should return a list of datanodes in possession of the chunk (returning a list allows the client to perform retries without more communication with the master, should some of the datanodes fail).
        * Connect to a datanode from the list and stream the chunk to a local buffer.
    * As chunks become available, stream them to the caller.


## [Data transfer protocol](https://github.com/bloom-lang/bud-sandbox/blob/master/bfs/data_protocol.rb)

The data transfer protocol comprises a set of support functions for the bulk data transfers whose use is described in the previous section.
Because it is _plain old Ruby_ it is not as interesting as the other modules. It provides:

* The TCP server code that runs at each datanode, which parses headers and writes stream data to the local FS (these files are later detected by the directory poll).
* Client API calls to connect to datanodes and stream data. Datanodes also use this protocol to pipeline chunks to downstream datanodes.
* Master API code invoked by a background process to replicate chunks from datanodes to other datanodes, when the replication factor for a chunk is too low.

## [Master background process](https://github.com/bloom-lang/bud-sandbox/blob/master/bfs/background.rb)

So far, we have implemented the BFS master as a strictly reactive system: when clients make requests, it queries and possibly updates local state.
To maintain the durability requirement that `REP_FACTOR` copies of every chunk are stored on distinct nodes, the master must be an active system
that maintains a near-consistent view of global state, and takes steps to correct violated requirements.

__chunk_cache__ is the master's view of datanode state, maintained as described above by collecting and pruning heartbeat messages.

==https://github.com/bloom-lang/bud-sandbox/raw/master/bfs/background.rb|24-27

After defining some helper aggregates (__chunk_cnts_chunk__, the replica count by chunk, and __chunk_cnts_host__, the datanode fill factor), we can identify under-replicated chunks:

==https://github.com/bloom-lang/bud-sandbox/raw/master/bfs/background.rb|29-36
data/docs/bfs_arch.png
ADDED
Binary file
data/docs/bloom-loop.png
ADDED
Binary file