bud 0.0.2
- data/LICENSE +9 -0
- data/README +30 -0
- data/bin/budplot +134 -0
- data/bin/budvis +201 -0
- data/bin/rebl +4 -0
- data/docs/README.md +13 -0
- data/docs/bfs.md +379 -0
- data/docs/bfs.raw +251 -0
- data/docs/bfs_arch.png +0 -0
- data/docs/bloom-loop.png +0 -0
- data/docs/bust.md +83 -0
- data/docs/cheat.md +291 -0
- data/docs/deploy.md +96 -0
- data/docs/diffs +181 -0
- data/docs/getstarted.md +296 -0
- data/docs/intro.md +36 -0
- data/docs/modules.md +112 -0
- data/docs/operational.md +96 -0
- data/docs/rebl.md +99 -0
- data/docs/ruby_hooks.md +19 -0
- data/docs/visualizations.md +75 -0
- data/examples/README +1 -0
- data/examples/basics/hello.rb +12 -0
- data/examples/basics/out +1103 -0
- data/examples/basics/out.new +856 -0
- data/examples/basics/paths.rb +51 -0
- data/examples/bust/README.md +9 -0
- data/examples/bust/bustclient-example.rb +23 -0
- data/examples/bust/bustinspector.html +135 -0
- data/examples/bust/bustserver-example.rb +18 -0
- data/examples/chat/README.md +9 -0
- data/examples/chat/chat.rb +45 -0
- data/examples/chat/chat_protocol.rb +8 -0
- data/examples/chat/chat_server.rb +29 -0
- data/examples/deploy/tokenring-ec2.rb +26 -0
- data/examples/deploy/tokenring-local.rb +17 -0
- data/examples/deploy/tokenring.rb +39 -0
- data/lib/bud/aggs.rb +126 -0
- data/lib/bud/bud_meta.rb +185 -0
- data/lib/bud/bust/bust.rb +126 -0
- data/lib/bud/bust/client/idempotence.rb +10 -0
- data/lib/bud/bust/client/restclient.rb +49 -0
- data/lib/bud/collections.rb +937 -0
- data/lib/bud/depanalysis.rb +44 -0
- data/lib/bud/deploy/countatomicdelivery.rb +50 -0
- data/lib/bud/deploy/deployer.rb +67 -0
- data/lib/bud/deploy/ec2deploy.rb +200 -0
- data/lib/bud/deploy/localdeploy.rb +41 -0
- data/lib/bud/errors.rb +15 -0
- data/lib/bud/graphs.rb +405 -0
- data/lib/bud/joins.rb +300 -0
- data/lib/bud/rebl.rb +314 -0
- data/lib/bud/rewrite.rb +523 -0
- data/lib/bud/rtrace.rb +27 -0
- data/lib/bud/server.rb +43 -0
- data/lib/bud/state.rb +108 -0
- data/lib/bud/storage/tokyocabinet.rb +170 -0
- data/lib/bud/storage/zookeeper.rb +178 -0
- data/lib/bud/stratify.rb +83 -0
- data/lib/bud/viz.rb +65 -0
- data/lib/bud.rb +797 -0
- metadata +330 -0
data/docs/bfs.md
ADDED
@@ -0,0 +1,379 @@
# BFS: A distributed file system in Bloom

In this document we'll use what we've learned to build a piece of systems software using Bloom. The libraries that ship with BUD provide many of the building blocks we'll need to create a distributed,
"chunked" file system in the style of the Google File System (GFS):

* a [key-value store](https://github.com/bloom-lang/bud-sandbox/blob/master/kvs/kvs.rb) (KVS)
* [nonce generation](https://github.com/bloom-lang/bud-sandbox/blob/master/ordering/nonce.rb)
* a [heartbeat protocol](https://github.com/bloom-lang/bud-sandbox/blob/master/heartbeat/heartbeat.rb)

## High-level architecture

![BFS Architecture](bfs_arch.png?raw=true)

BFS implements a chunked, distributed file system (mostly) in the Bloom
language. BFS is architecturally based on [BOOM-FS](http://db.cs.berkeley.edu/papers/eurosys10-boom.pdf), which is itself based on
the Google File System (GFS). As in GFS, a single master node manages
file system metadata, while data blocks are replicated and stored on a large
number of storage nodes. Writing or reading data involves a multi-step
protocol in which clients interact with the master, retrieving metadata and
possibly changing state, then interact with storage nodes to read or write
chunks. Background jobs running on the master contact storage nodes to
orchestrate chunk migrations, during which storage nodes communicate with
other storage nodes. As in BOOM-FS, the communication protocols and the data
channel used for bulk data transfer between clients and datanodes and between
datanodes are written outside Bloom (in Ruby).

## [Basic File System](https://github.com/bloom-lang/bud-sandbox/blob/master/bfs/fs_master.rb)

Before we worry about any of the details of distribution, we need to implement the basic file system metadata operations: _create_, _remove_, _mkdir_ and _ls_.
There are many choices for how to implement these operations, and it makes sense to keep them separate from the (largely orthogonal) distributed file system logic.
That way, it will be possible later to choose a different implementation of the metadata operations without impacting the rest of the system.
Another benefit of modularizing the metadata logic is that it can be independently tested and debugged. We want to get the core of the file system
working correctly before we even send a whisper over the network, let alone add any complex features.

### Protocol

    module FSProtocol
      state do
        interface input, :fsls, [:reqid, :path]
        interface input, :fscreate, [] => [:reqid, :name, :path, :data]
        interface input, :fsmkdir, [] => [:reqid, :name, :path]
        interface input, :fsrm, [] => [:reqid, :name, :path]
        interface output, :fsret, [:reqid, :status, :data]
      end
    end

We create an input interface for each of the operations, and a single output interface for the return value of any operation: given a request id, __status__ is a boolean
indicating whether the request succeeded, and __data__ may contain return values (e.g., _fsls_ should return an array containing the directory contents).

### Implementation

We already have a library that provides an updateable flat namespace: the key-value store. We can easily implement the tree structure of a file system over a key-value store
in the following way:

1. keys are paths
2. directories have arrays containing child entries (base names)
3. file values are their contents

<!--- (**JMH**: I find it a bit confusing how you toggle from the discussion above to this naive file-storage design here. Can you warn us a bit more clearly that this is a starting point focused on metadata, with (3) being a strawman for data storage that is intended to be overriden later?)
--->
Note that (3) is a strawman: it will cease to apply when we implement chunked storage later. It is tempting, however, to support (3) so that the resulting program is a working
standalone file system.
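
To make the mapping concrete, here is a hypothetical snapshot of the KVS for a tiny
directory tree (the paths and contents are invented for illustration):

    # Hypothetical KVS contents: directory keys map to arrays of child base names,
    # file keys map to the file's contents (the strawman rule (3) above).
    snapshot = {
      "/"            => ["docs", "tmp"],   # root directory entry
      "/docs"        => ["readme"],        # child base names only
      "/docs/readme" => "hello bloom",     # file value is its content
      "/tmp"         => []                 # empty directory
    }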

We begin our implementation of a KVS-backed metadata system in the following way:

    module KVSFS
      include FSProtocol
      include BasicKVS
      include TimestepNonce

If we wanted to replicate the master node's metadata we could consider mixing in a replicated KVS implementation instead of __BasicKVS__ -- but more on that later.

### Directory Listing

The directory listing operation is implemented by a simple block of Bloom statements:

    kvget <= fsls { |l| [l.reqid, l.path] }
    fsret <= (kvget_response * fsls).pairs(:reqid => :reqid) { |r, i| [r.reqid, true, r.value] }
    fsret <= fsls do |l|
      unless kvget_response.map{ |r| r.reqid}.include? l.reqid
        [l.reqid, false, nil]
      end
    end

If we get a __fsls__ request, we probe the key-value store for the requested path by projecting _reqid_ and _path_ from the __fsls__ tuple into __kvget__. If the given path
is a key, __kvget_response__ will contain a tuple with the same _reqid_, and the join on the second line will succeed. In this case, we insert the value
associated with that key into __fsret__. Otherwise, the third rule will fire, inserting a failure tuple into __fsret__.
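
To see how the three rules interact, here is a hypothetical trace of a single request. The
tuple values are invented, and the exact shape of __kvget_response__ is an assumption (the
rules above rely only on its _reqid_ and _value_ fields):

    # Hypothetical trace of one successful fsls request.
    #
    #   fsls           <= [[42, "/docs"]]                 # client asks to list /docs
    #   kvget          <= [[42, "/docs"]]                 # projected from the fsls tuple
    #   kvget_response => [[42, "/docs", ["readme"]]]     # assumed shape: [reqid, key, value]
    #   fsret          => [[42, true, ["readme"]]]        # join on :reqid succeeded
    #
    # Had "/docs" not been a key, kvget_response would contain no tuple for reqid 42
    # and the third rule would emit [[42, false, nil]] instead.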

### Mutation

File and directory creation and deletion follow similar logic with regard to the parent directory:

    check_parent_exists <= fscreate { |c| [c.reqid, c.name, c.path, :create, c.data] }
    check_parent_exists <= fsmkdir { |m| [m.reqid, m.name, m.path, :mkdir, nil] }
    check_parent_exists <= fsrm { |m| [m.reqid, m.name, m.path, :rm, nil] }

    kvget <= check_parent_exists { |c| [c.reqid, c.path] }
    fsret <= check_parent_exists do |c|
      unless kvget_response.map{ |r| r.reqid}.include? c.reqid
        puts "not found #{c.path}" or [c.reqid, false, "parent path #{c.path} for #{c.name} does not exist"]
      end
    end

Unlike a directory listing, however, these operations change the state of the file system. In general, any state change will involve
carrying out two mutating operations on the key-value store atomically:

1. update the value (child array) associated with the parent directory entry
2. update the key-value pair associated with the object in question (a file or directory being created or destroyed).

The following Bloom code carries this out:

    temp :dir_exists <= (check_parent_exists * kvget_response * nonce).combos([check_parent_exists.reqid, kvget_response.reqid])
    fsret <= dir_exists do |c, r, n|
      if c.mtype == :rm
        unless can_remove.map{|can| can.orig_reqid}.include? c.reqid
          [c.reqid, false, "directory #{} not empty"]
        end
      end
    end

    # update dir entry
    # note that it is unnecessary to ensure that a file is created before its corresponding
    # directory entry, as both inserts into :kvput below will co-occur in the same timestep.
    kvput <= dir_exists do |c, r, n|
      if c.mtype == :rm
        if can_remove.map{|can| can.orig_reqid}.include? c.reqid
          [ip_port, c.path, n.ident, r.value.clone.reject{|item| item == c.name}]
        end
      else
        [ip_port, c.path, n.ident, r.value.clone.push(c.name)]
      end
    end

    kvput <= dir_exists do |c, r, n|
      case c.mtype
      when :mkdir
        [ip_port, terminate_with_slash(c.path) + c.name, c.reqid, []]
      when :create
        [ip_port, terminate_with_slash(c.path) + c.name, c.reqid, "LEAF"]
      end
    end

<!--- (**JMH**: This next sounds awkward. You *do* take care: by using <= and understanding the atomicity of timesteps in Bloom. I think what you mean to say is that Bloom's atomic timestep model makes this easy compared to ... something.)
Note that we need not take any particular care to ensure that the two inserts into __kvput__ occur together atomically. Because both statements use the synchronous
-->
Note that because both inserts into the __kvput__ collection use the synchronous operator (`<=`), we know that they will occur together in the same fixpoint computation or not at all.
Therefore we need not be concerned with explicitly sequencing the operations (e.g., ensuring that the directory entry is created _after_ the file entry) to deal with concurrency:
there can be no visible state of the database in which only one of the operations has succeeded.

If the request is a deletion, we need some additional logic to enforce the constraint that only an empty directory may be removed:

    check_is_empty <= (fsrm * nonce).pairs {|m, n| [n.ident, m.reqid, terminate_with_slash(m.path) + m.name] }
    kvget <= check_is_empty {|c| [c.reqid, c.name] }
    can_remove <= (kvget_response * check_is_empty).pairs([kvget_response.reqid, check_is_empty.reqid]) do |r, c|
      [c.reqid, c.orig_reqid, c.name] if r.value.length == 0
    end
    # delete entry -- if an 'rm' request,
    kvdel <= dir_exists do |c, r, n|
      if can_remove.map{|can| can.orig_reqid}.include? c.reqid
        [terminate_with_slash(c.path) + c.name, c.reqid]
      end
    end

Recall that when we created KVSFS we mixed in __TimestepNonce__, one of the nonce libraries. While we were able to use the _reqid_ field from the input operation as a unique identifier
for one of our KVS operations, we need a fresh, unique request id for the second KVS operation in the atomic pair described above. By joining __nonce__, we get
an identifier that is unique to this timestep.
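
For intuition, a timestep-scoped nonce can be as small as a collection that emits one fresh
identifier per timestep. The following is a minimal sketch under that reading, not the actual
`ordering/nonce.rb` module; the module name and the way the identifier is derived are assumptions:

    # Hypothetical stand-in for TimestepNonce: one row per timestep, with an
    # identifier derived from the logical clock and the node's address.
    module SimpleTimestepNonce
      state do
        scratch :nonce, [] => [:ident]
      end

      bloom do
        nonce <= [[[budtime, ip_port].hash]]
      end
    end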

## [File Chunking](https://github.com/bloom-lang/bud-sandbox/blob/master/bfs/chunking.rb)

Now that we have a module providing a basic file system, we can extend it to support chunked storage of file contents. The metadata master will contain, in addition to the KVS
structure for directory information, a relation mapping a set of chunk identifiers to each file:

    table :chunk, [:chunkid, :file, :siz]

and a relation associating each chunk with the set of datanodes that host a replica of it:

    table :chunk_cache, [:node, :chunkid, :time]

<!--- (**JMH**: ambiguous reference ahead "these latter") --->
The latter (defined in __HBMaster__) is soft state, kept up to date by heartbeat messages from datanodes (described in the next section).

To support chunked storage, we add a few metadata operations to those already defined by FSProtocol:

    module ChunkedFSProtocol
      include FSProtocol

      state do
        interface :input, :fschunklist, [:reqid, :file]
        interface :input, :fschunklocations, [:reqid, :chunkid]
        interface :input, :fsaddchunk, [:reqid, :file]
        # note that no output interface is defined.
        # we use :fsret (defined in FSProtocol) for output.
      end
    end

* __fschunklist__ returns the set of chunks belonging to a given file.
* __fschunklocations__ returns the set of datanodes in possession of a given chunk.
* __fsaddchunk__ returns a new chunkid for appending to an existing file, guaranteed to be higher than any existing chunkids for that file, and a list of candidate datanodes that can store a replica of the new chunk.

We continue to use __fsret__ for return values.
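
To make these interfaces concrete, here are hypothetical request and response tuples (the ids,
paths and addresses are invented; the response payloads follow the shapes produced by the rules
in the following subsections):

    # Hypothetical request/response tuples for the chunk interfaces.
    #
    #   fschunklist      <= [[7, "/logs/a"]]
    #   fsret            => [[7, true, [101, 102, 103]]]                    # chunkids of the file
    #
    #   fschunklocations <= [[8, 102]]
    #   fsret            => [[8, true, ["dn1:53111", "dn2:53112"]]]         # datanodes holding chunk 102
    #
    #   fsaddchunk       <= [[9, "/logs/a"]]
    #   fsret            => [[9, true, [104, ["dn3:53113", "dn1:53111"]]]]  # new chunkid + candidate datanodes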

### Lookups

Lines 34-44 follow a similar pattern to what we saw in the basic FS: whenever we get a __fschunklist__ or __fsaddchunk__ request, we must first ensure that the given file
exists, and error out if not. If it does, and the operation was __fschunklist__, we join the metadata relation __chunk__ and return the set of chunks owned
by the given (existent) file:

    chunk_buffer <= (fschunklist * kvget_response * chunk).combos([fschunklist.reqid, kvget_response.reqid], [fschunklist.file, chunk.file]) { |l, r, c| [l.reqid, c.chunkid] }
    chunk_buffer2 <= chunk_buffer.group([chunk_buffer.reqid], accum(chunk_buffer.chunkid))
    fsret <= chunk_buffer2 { |c| [c.reqid, true, c.chunklist] }

### Add chunk

If it was a __fsaddchunk__ request, we need to generate a unique id for a new chunk and return a list of target datanodes. We reuse __TimestepNonce__ to do the former, and join a relation
called __available__ that is exported by __HBMaster__ (described in the next section) for the latter:

    temp :minted_chunk <= (kvget_response * fsaddchunk * available * nonce).combos(kvget_response.reqid => fsaddchunk.reqid) {|r| r if last_heartbeat.length >= REP_FACTOR}
    chunk <= minted_chunk { |r, a, v, n| [n.ident, a.file, 0]}
    fsret <= minted_chunk { |r, a, v, n| [r.reqid, true, [n.ident, v.pref_list.slice(0, (REP_FACTOR + 2))]]}
    fsret <= (kvget_response * fsaddchunk).pairs(:reqid => :reqid) do |r, a|
      if available.empty? or available.first.pref_list.length < REP_FACTOR
        [r.reqid, false, "datanode set cannot satisfy REP_FACTOR = #{REP_FACTOR} with [#{available.first.nil? ? "NIL" : available.first.pref_list.inspect}]"]
      end
    end

Finally, if it was a __fschunklocations__ request, we have another possible error scenario, because the nodes associated with chunks are part of our soft state. Even if the file
exists, it may not be the case that we have fresh information in our cache about which datanodes own a replica of the given chunk:

    fsret <= fschunklocations do |l|
      unless chunk_cache_alive.map{|c| c.chunkid}.include? l.chunkid
        [l.reqid, false, "no datanodes found for #{l.chunkid} in cc, now #{chunk_cache_alive.length}, with last_hb #{last_heartbeat.length}"]
      end
    end

Otherwise, __chunk_cache__ has information about the given chunk, which we may return to the client:

    temp :chunkjoin <= (fschunklocations * chunk_cache_alive).pairs(:chunkid => :chunkid)
    host_buffer <= chunkjoin {|l, c| [l.reqid, c.node] }
    host_buffer2 <= host_buffer.group([host_buffer.reqid], accum(host_buffer.host))
    fsret <= host_buffer2 {|c| [c.reqid, true, c.hostlist] }

## Datanodes and Heartbeats

### [Datanode](https://github.com/bloom-lang/bud-sandbox/blob/master/bfs/datanode.rb)

A datanode runs both Bud code (to support the heartbeat and control protocols) and pure Ruby (to support the data transfer protocol). A datanode's main job is keeping the master
aware of its existence and its state, and participating when necessary in data pipelines to read or write chunk data to and from its local storage.

    module BFSDatanode
      include HeartbeatAgent
      include StaticMembership
      include TimestepNonce
      include BFSHBProtocol

By mixing in HeartbeatAgent, the datanode includes the machinery necessary to regularly send status messages to the master. __HeartbeatAgent__ provides an input interface
called __payload__ that allows an agent to optionally include additional information in heartbeat messages: in our case, we wish to include state deltas which ensure that
the master has an accurate view of the set of chunks owned by the datanode.

When a datanode is constructed, it takes a port at which the embedded data protocol server will listen, and starts the server in the background:

    @dp_server = DataProtocolServer.new(dataport)
    return_address <+ [["localhost:#{dataport}"]]

At regular intervals, a datanode polls its local chunk directory (which is independently written to by the data protocol):

    dir_contents <= hb_timer.flat_map do |t|
      dir = Dir.new("#{DATADIR}/#{@data_port}")
      files = dir.to_a.map{|d| d.to_i unless d =~ /^\./}.uniq!
      dir.close
      files.map {|f| [f, Time.parse(t.val).to_f]}
    end

We update the payload that we send to the master if our recent poll found files that we don't believe the master knows about:

    to_payload <= (dir_contents * nonce).pairs do |c, n|
      unless server_knows.map{|s| s.file}.include? c.file
        #puts "BCAST #{c.file}; server doesn't know" or [n.ident, c.file, c.time]
        [n.ident, c.file, c.time]
      else
        #puts "server knows about #{server_knows.length} files"
      end
    end

Our view of what the master "knows" about reflects our local cache of acknowledgement messages from the master. This logic is defined in __HBMaster__.

### [Heartbeat master logic](https://github.com/bloom-lang/bud-sandbox/blob/master/bfs/hb_master.rb)

On the master side of heartbeats, we always send an ack when we get a heartbeat, so that the datanode doesn't need to keep resending its
payload of local chunks:

    hb_ack <~ last_heartbeat.map do |l|
      [l.sender, l.payload[0]] unless l.payload[1] == [nil]
    end

At the same time, we use the Ruby _flat_map_ method to flatten the array of chunks in the heartbeat payload into a set of tuples, which we
associate with the heartbeating datanode and the time of receipt in __chunk_cache__:

    chunk_cache <= join([master_duty_cycle, last_heartbeat]).flat_map do |d, l|
      unless l.payload[1].nil?
        l.payload[1].map do |pay|
          [l.peer, pay, Time.parse(d.val).to_f]
        end
      end
    end

We periodically garbage-collect this cache, removing entries for datanodes from whom we have not received a heartbeat in a configurable amount of time.
__last_heartbeat__ is an output interface provided by the __HeartbeatAgent__ module, and contains the most recent, non-stale heartbeat contents:

    chunk_cache <- join([master_duty_cycle, chunk_cache]).map do |t, c|
      c unless last_heartbeat.map{|h| h.peer}.include? c.node
    end

## [BFS Client](https://github.com/bloom-lang/bud-sandbox/blob/master/bfs/bfs_client.rb)

One of the most complicated parts of the basic GFS design is the client component. To minimize load on the centralized master, we take it off the critical
path of file transfers. The client therefore needs to pick up this work.

We won't spend too much time on the details of the client code, as it is nearly all _plain old Ruby_. The basic idea is:

1. Pure metadata operations
    * _mkdir_, _create_, _ls_, _rm_
    * Send the request to the master and inform the caller of the status.
    * If _ls_, return the directory listing to the caller.
2. Append (see the sketch after this list)
    * Send a __fsaddchunk__ request to the master, which should return a new chunkid and a list of datanodes.
    * Read a chunk's worth of data from the input stream.
    * Connect to the first datanode in the list. Send a header containing the chunkid and the remaining datanodes.
    * Stream the file contents. The target datanode will then "play client" and continue the pipeline to the next datanode, and so on.
3. Read
    * Send a __getchunks__ request to the master for the given file. It should return the list of chunks owned by the file.
    * For each chunk,
        * Send a __fschunklocations__ request to the master, which should return a list of datanodes in possession of the chunk (returning a list allows the client to perform retries without more communication with the master, should some of the datanodes fail).
        * Connect to a datanode from the list and stream the chunk to a local buffer.
    * As chunks become available, stream them to the caller.
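
To make the append path concrete, here is a minimal sketch in plain Ruby. The helper names
(`request_addchunk`, `send_stream`) and the `CHUNKSIZE` constant are hypothetical stand-ins,
not the actual bfs_client.rb API:

    # Hypothetical sketch of the append path described above
    # (request_addchunk and send_stream are invented helpers).
    CHUNKSIZE = 64 * 1024 * 1024   # assumed chunk size

    def append_chunk(master, file, io)
      # 1. ask the master for a fresh chunkid and a list of candidate datanodes (fsaddchunk)
      chunkid, datanodes = request_addchunk(master, file)
      # 2. read one chunk's worth of data from the input stream
      data = io.read(CHUNKSIZE)
      # 3. connect to the first datanode; it pipelines the stream to the rest
      head, *rest = datanodes
      send_stream(head, {:chunkid => chunkid, :pipeline => rest}, data)
    end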

## [Data transfer protocol](https://github.com/bloom-lang/bud-sandbox/blob/master/bfs/data_protocol.rb)

The data transfer module comprises a set of support functions for the bulk data transfer whose use is described in the previous section.
Because it is _plain old Ruby_ it is not as interesting as the other modules. It provides:

* The TCP server code that runs at each datanode, which parses headers and writes stream data to the local FS (these files are later detected by the directory poll); a rough sketch follows this list.
* Client API calls to connect to datanodes and stream data. Datanodes also use this protocol to pipeline chunks to downstream datanodes.
* Master API code invoked by a background process to replicate chunks from datanodes to other datanodes, when the replication factor for a chunk is too low.
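
For orientation, the datanode-side receive loop can be pictured roughly as follows. The framing
(a one-line JSON header followed by the chunk body) and the helper name `serve_chunks` are
assumptions for illustration, not the actual data_protocol.rb implementation:

    require 'socket'
    require 'json'

    # Hypothetical datanode receive loop: accept a connection, parse a header,
    # write the streamed chunk body to the local chunk directory.
    def serve_chunks(port, datadir)
      server = TCPServer.new(port)
      loop do
        Thread.new(server.accept) do |sock|
          header = JSON.parse(sock.gets)      # e.g. {"chunkid" => 104, "pipeline" => [...]}
          File.open(File.join(datadir, header["chunkid"].to_s), "w") do |f|
            IO.copy_stream(sock, f)           # stream the chunk body to the local FS
          end
          sock.close
          # forwarding to header["pipeline"] (the downstream datanodes) omitted
        end
      end
    end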

## [Master background process](https://github.com/bloom-lang/bud-sandbox/blob/master/bfs/background.rb)

So far, we have implemented the BFS master as a strictly reactive system: when clients make requests, it queries and possibly updates local state.
To maintain the durability requirement that `REP_FACTOR` copies of every chunk are stored on distinct nodes, the master must be an active system
that maintains a near-consistent view of global state, and takes steps to correct violated requirements.

__chunk_cache__ is the master's view of datanode state, maintained as described above by collecting and pruning heartbeat messages:

    cc_demand <= (bg_timer * chunk_cache_alive).rights
    cc_demand <= (bg_timer * last_heartbeat).pairs {|b, h| [h.peer, nil, nil]}
    chunk_cnts_chunk <= cc_demand.group([cc_demand.chunkid], count(cc_demand.node))
    chunk_cnts_host <= cc_demand.group([cc_demand.node], count(cc_demand.chunkid))

After defining some helper aggregates (__chunk_cnts_chunk__, the replica count by chunk, and __chunk_cnts_host__, the datanode fill factor), we collect the chunks with fewer than `REP_FACTOR` replicas:

    lowchunks <= chunk_cnts_chunk { |c| [c.chunkid] if c.replicas < REP_FACTOR and !c.chunkid.nil?}

    # nodes in possession of such chunks
    sources <= (cc_demand * lowchunks).pairs(:chunkid => :chunkid) {|a, b| [a.chunkid, a.node]}
    # nodes not in possession of such chunks, and their fill factor
    candidate_nodes <= (chunk_cnts_host * lowchunks).pairs do |c, p|
      unless chunk_cache_alive.map{|a| a.node if a.chunkid == p.chunkid}.include? c.host
        [p.chunkid, c.host, c.chunks]

### I am autogenerated. Please do not edit me.
data/docs/bfs.raw
ADDED
@@ -0,0 +1,251 @@
# BFS: A distributed file system in Bloom

In this document we'll use what we've learned to build a piece of systems software using Bloom. The libraries that ship with BUD provide many of the building blocks we'll need to create a distributed,
"chunked" file system in the style of the Google File System (GFS):

* a [key-value store](https://github.com/bloom-lang/bud-sandbox/blob/master/kvs/kvs.rb) (KVS)
* [nonce generation](https://github.com/bloom-lang/bud-sandbox/blob/master/ordering/nonce.rb)
* a [heartbeat protocol](https://github.com/bloom-lang/bud-sandbox/blob/master/heartbeat/heartbeat.rb)

## High-level architecture

![BFS Architecture](bfs_arch.png?raw=true)

BFS implements a chunked, distributed file system (mostly) in the Bloom
language. BFS is architecturally based on [BOOM-FS](http://db.cs.berkeley.edu/papers/eurosys10-boom.pdf), which is itself based on
the Google File System (GFS). As in GFS, a single master node manages
file system metadata, while data blocks are replicated and stored on a large
number of storage nodes. Writing or reading data involves a multi-step
protocol in which clients interact with the master, retrieving metadata and
possibly changing state, then interact with storage nodes to read or write
chunks. Background jobs running on the master contact storage nodes to
orchestrate chunk migrations, during which storage nodes communicate with
other storage nodes. As in BOOM-FS, the communication protocols and the data
channel used for bulk data transfer between clients and datanodes and between
datanodes are written outside Bloom (in Ruby).

## [Basic File System](https://github.com/bloom-lang/bud-sandbox/blob/master/bfs/fs_master.rb)

Before we worry about any of the details of distribution, we need to implement the basic file system metadata operations: _create_, _remove_, _mkdir_ and _ls_.
There are many choices for how to implement these operations, and it makes sense to keep them separate from the (largely orthogonal) distributed file system logic.
That way, it will be possible later to choose a different implementation of the metadata operations without impacting the rest of the system.
Another benefit of modularizing the metadata logic is that it can be independently tested and debugged. We want to get the core of the file system
working correctly before we even send a whisper over the network, let alone add any complex features.

### Protocol

==https://github.com/bloom-lang/bud-sandbox/raw/master/bfs/fs_master.rb|12-20

We create an input interface for each of the operations, and a single output interface for the return value of any operation: given a request id, __status__ is a boolean
indicating whether the request succeeded, and __data__ may contain return values (e.g., _fsls_ should return an array containing the directory contents).

### Implementation

We already have a library that provides an updateable flat namespace: the key-value store. We can easily implement the tree structure of a file system over a key-value store
in the following way:

1. keys are paths
2. directories have arrays containing child entries (base names)
3. file values are their contents

<!--- (**JMH**: I find it a bit confusing how you toggle from the discussion above to this naive file-storage design here. Can you warn us a bit more clearly that this is a starting point focused on metadata, with (3) being a strawman for data storage that is intended to be overriden later?)
--->
Note that (3) is a strawman: it will cease to apply when we implement chunked storage later. It is tempting, however, to support (3) so that the resulting program is a working
standalone file system.

We begin our implementation of a KVS-backed metadata system in the following way:

==https://github.com/bloom-lang/bud-sandbox/raw/master/bfs/fs_master.rb|33-36

If we wanted to replicate the master node's metadata we could consider mixing in a replicated KVS implementation instead of __BasicKVS__ -- but more on that later.

### Directory Listing

The directory listing operation is implemented by a simple block of Bloom statements:

==https://github.com/bloom-lang/bud-sandbox/raw/master/bfs/fs_master.rb|51-57

If we get a __fsls__ request, we probe the key-value store for the requested path by projecting _reqid_ and _path_ from the __fsls__ tuple into __kvget__. If the given path
is a key, __kvget_response__ will contain a tuple with the same _reqid_, and the join on the second line will succeed. In this case, we insert the value
associated with that key into __fsret__. Otherwise, the third rule will fire, inserting a failure tuple into __fsret__.

### Mutation

File and directory creation and deletion follow similar logic with regard to the parent directory:

==https://github.com/bloom-lang/bud-sandbox/raw/master/bfs/fs_master.rb|61-71

Unlike a directory listing, however, these operations change the state of the file system. In general, any state change will involve
carrying out two mutating operations on the key-value store atomically:

1. update the value (child array) associated with the parent directory entry
2. update the key-value pair associated with the object in question (a file or directory being created or destroyed).

The following Bloom code carries this out:

==https://github.com/bloom-lang/bud-sandbox/raw/master/bfs/fs_master.rb|73-73
==https://github.com/bloom-lang/bud-sandbox/raw/master/bfs/fs_master.rb|80-108

<!--- (**JMH**: This next sounds awkward. You *do* take care: by using <= and understanding the atomicity of timesteps in Bloom. I think what you mean to say is that Bloom's atomic timestep model makes this easy compared to ... something.)
Note that we need not take any particular care to ensure that the two inserts into __kvput__ occur together atomically. Because both statements use the synchronous
-->
Note that because both inserts into the __kvput__ collection use the synchronous operator (`<=`), we know that they will occur together in the same fixpoint computation or not at all.
Therefore we need not be concerned with explicitly sequencing the operations (e.g., ensuring that the directory entry is created _after_ the file entry) to deal with concurrency:
there can be no visible state of the database in which only one of the operations has succeeded.

If the request is a deletion, we need some additional logic to enforce the constraint that only an empty directory may be removed:

==https://github.com/bloom-lang/bud-sandbox/raw/master/bfs/fs_master.rb|74-78
==https://github.com/bloom-lang/bud-sandbox/raw/master/bfs/fs_master.rb|110-115

Recall that when we created KVSFS we mixed in __TimestepNonce__, one of the nonce libraries. While we were able to use the _reqid_ field from the input operation as a unique identifier
for one of our KVS operations, we need a fresh, unique request id for the second KVS operation in the atomic pair described above. By joining __nonce__, we get
an identifier that is unique to this timestep.

## [File Chunking](https://github.com/bloom-lang/bud-sandbox/blob/master/bfs/chunking.rb)

Now that we have a module providing a basic file system, we can extend it to support chunked storage of file contents. The metadata master will contain, in addition to the KVS
structure for directory information, a relation mapping a set of chunk identifiers to each file:

==https://github.com/bloom-lang/bud-sandbox/raw/master/bfs/chunking.rb|26-26

and a relation associating each chunk with the set of datanodes that host a replica of it:

==https://github.com/bloom-lang/bud-sandbox/raw/5c7734912e900c28087e39b3424a1e0191e13704/bfs/hb_master.rb|12-12

<!--- (**JMH**: ambiguous reference ahead "these latter") --->
The latter (defined in __HBMaster__) is soft state, kept up to date by heartbeat messages from datanodes (described in the next section).

To support chunked storage, we add a few metadata operations to those already defined by FSProtocol:

==https://github.com/bloom-lang/bud-sandbox/raw/master/bfs/chunking.rb|6-16

* __fschunklist__ returns the set of chunks belonging to a given file.
* __fschunklocations__ returns the set of datanodes in possession of a given chunk.
* __fsaddchunk__ returns a new chunkid for appending to an existing file, guaranteed to be higher than any existing chunkids for that file, and a list of candidate datanodes that can store a replica of the new chunk.

We continue to use __fsret__ for return values.

### Lookups

Lines 34-44 follow a similar pattern to what we saw in the basic FS: whenever we get a __fschunklist__ or __fsaddchunk__ request, we must first ensure that the given file
exists, and error out if not. If it does, and the operation was __fschunklist__, we join the metadata relation __chunk__ and return the set of chunks owned
by the given (existent) file:

==https://github.com/bloom-lang/bud-sandbox/raw/master/bfs/chunking.rb|47-49

### Add chunk

If it was a __fsaddchunk__ request, we need to generate a unique id for a new chunk and return a list of target datanodes. We reuse __TimestepNonce__ to do the former, and join a relation
called __available__ that is exported by __HBMaster__ (described in the next section) for the latter:

==https://github.com/bloom-lang/bud-sandbox/raw/master/bfs/chunking.rb|69-76

Finally, if it was a __fschunklocations__ request, we have another possible error scenario, because the nodes associated with chunks are part of our soft state. Even if the file
exists, it may not be the case that we have fresh information in our cache about which datanodes own a replica of the given chunk:

==https://github.com/bloom-lang/bud-sandbox/raw/master/bfs/chunking.rb|54-58

Otherwise, __chunk_cache__ has information about the given chunk, which we may return to the client:

==https://github.com/bloom-lang/bud-sandbox/raw/master/bfs/chunking.rb|61-64

## Datanodes and Heartbeats

### [Datanode](https://github.com/bloom-lang/bud-sandbox/blob/master/bfs/datanode.rb)

A datanode runs both Bud code (to support the heartbeat and control protocols) and pure Ruby (to support the data transfer protocol). A datanode's main job is keeping the master
aware of its existence and its state, and participating when necessary in data pipelines to read or write chunk data to and from its local storage.

==https://github.com/bloom-lang/bud-sandbox/raw/master/bfs/datanode.rb|11-15

By mixing in HeartbeatAgent, the datanode includes the machinery necessary to regularly send status messages to the master. __HeartbeatAgent__ provides an input interface
called __payload__ that allows an agent to optionally include additional information in heartbeat messages: in our case, we wish to include state deltas which ensure that
the master has an accurate view of the set of chunks owned by the datanode.

When a datanode is constructed, it takes a port at which the embedded data protocol server will listen, and starts the server in the background:

==https://github.com/bloom-lang/bud-sandbox/raw/master/bfs/datanode.rb|61-62

At regular intervals, a datanode polls its local chunk directory (which is independently written to by the data protocol):

==https://github.com/bloom-lang/bud-sandbox/raw/master/bfs/datanode.rb|26-31

We update the payload that we send to the master if our recent poll found files that we don't believe the master knows about:

==https://github.com/bloom-lang/bud-sandbox/raw/master/bfs/datanode.rb|33-40

Our view of what the master "knows" about reflects our local cache of acknowledgement messages from the master. This logic is defined in __HBMaster__.

### [Heartbeat master logic](https://github.com/bloom-lang/bud-sandbox/blob/master/bfs/hb_master.rb)

On the master side of heartbeats, we always send an ack when we get a heartbeat, so that the datanode doesn't need to keep resending its
payload of local chunks:

==https://github.com/bloom-lang/bud-sandbox/raw/5c7734912e900c28087e39b3424a1e0191e13704/bfs/hb_master.rb|30-32

At the same time, we use the Ruby _flat_map_ method to flatten the array of chunks in the heartbeat payload into a set of tuples, which we
associate with the heartbeating datanode and the time of receipt in __chunk_cache__:

==https://github.com/bloom-lang/bud-sandbox/raw/5c7734912e900c28087e39b3424a1e0191e13704/bfs/hb_master.rb|22-28

We periodically garbage-collect this cache, removing entries for datanodes from whom we have not received a heartbeat in a configurable amount of time.
__last_heartbeat__ is an output interface provided by the __HeartbeatAgent__ module, and contains the most recent, non-stale heartbeat contents:

==https://github.com/bloom-lang/bud-sandbox/raw/5c7734912e900c28087e39b3424a1e0191e13704/bfs/hb_master.rb|34-36

## [BFS Client](https://github.com/bloom-lang/bud-sandbox/blob/master/bfs/bfs_client.rb)

One of the most complicated parts of the basic GFS design is the client component. To minimize load on the centralized master, we take it off the critical
path of file transfers. The client therefore needs to pick up this work.

We won't spend too much time on the details of the client code, as it is nearly all _plain old Ruby_. The basic idea is:

1. Pure metadata operations
    * _mkdir_, _create_, _ls_, _rm_
    * Send the request to the master and inform the caller of the status.
    * If _ls_, return the directory listing to the caller.
2. Append
    * Send a __fsaddchunk__ request to the master, which should return a new chunkid and a list of datanodes.
    * Read a chunk's worth of data from the input stream.
    * Connect to the first datanode in the list. Send a header containing the chunkid and the remaining datanodes.
    * Stream the file contents. The target datanode will then "play client" and continue the pipeline to the next datanode, and so on.
3. Read
    * Send a __getchunks__ request to the master for the given file. It should return the list of chunks owned by the file.
    * For each chunk,
        * Send a __fschunklocations__ request to the master, which should return a list of datanodes in possession of the chunk (returning a list allows the client to perform retries without more communication with the master, should some of the datanodes fail).
        * Connect to a datanode from the list and stream the chunk to a local buffer.
    * As chunks become available, stream them to the caller.

## [Data transfer protocol](https://github.com/bloom-lang/bud-sandbox/blob/master/bfs/data_protocol.rb)

The data transfer module comprises a set of support functions for the bulk data transfer whose use is described in the previous section.
Because it is _plain old Ruby_ it is not as interesting as the other modules. It provides:

* The TCP server code that runs at each datanode, which parses headers and writes stream data to the local FS (these files are later detected by the directory poll).
* Client API calls to connect to datanodes and stream data. Datanodes also use this protocol to pipeline chunks to downstream datanodes.
* Master API code invoked by a background process to replicate chunks from datanodes to other datanodes, when the replication factor for a chunk is too low.

## [Master background process](https://github.com/bloom-lang/bud-sandbox/blob/master/bfs/background.rb)

So far, we have implemented the BFS master as a strictly reactive system: when clients make requests, it queries and possibly updates local state.
To maintain the durability requirement that `REP_FACTOR` copies of every chunk are stored on distinct nodes, the master must be an active system
that maintains a near-consistent view of global state, and takes steps to correct violated requirements.

__chunk_cache__ is the master's view of datanode state, maintained as described above by collecting and pruning heartbeat messages:

==https://github.com/bloom-lang/bud-sandbox/raw/master/bfs/background.rb|24-27

After defining some helper aggregates (__chunk_cnts_chunk__, the replica count by chunk, and __chunk_cnts_host__, the datanode fill factor), we collect the chunks with fewer than `REP_FACTOR` replicas:

==https://github.com/bloom-lang/bud-sandbox/raw/master/bfs/background.rb|29-36
data/docs/bfs_arch.png
ADDED
Binary file
data/docs/bloom-loop.png
ADDED
Binary file