adhd 0.0.1 → 0.1.0

@@ -1,9 +1,68 @@
1
1
  = adhd
2
2
 
3
- Description goes here.
3
+ Adhd is an asynchronous, distributed hard drive. Actually we're not sure if it's really asynchronous or even what asynchronicity would mean in the context of a hard drive, but it is definitely distributed. Adhd is essentially a management layer (written using Ruby and eventmachine) which controls clusters of CouchDB databases to replicate files across disparate machines. Unlike most clustering storage solutions, adhd assumes that machines in the cluster may have different amounts of bandwidth available and is designed to work both inside and outside the data centre.
4
+
5
+ == Installation
6
+
7
+ Don't use this software yet. It is experimental and may eat your mother.
8
+
9
+ Having said that, <pre>sudo gem install adhd</pre> may do the job.
10
+
11
+ == How it works
12
+
13
+ === External API
14
+
15
+ The user issues an HTTP PUT request to an ADHD URL. This request can go to any machine in the ADHD cluster. A file is attached to the body of the request, and ADHD stores the file on a number of different servers. At a later time, the user can issue a GET request to the same URL and get back their file.
16
+
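As a rough illustration of the external API described above, the following sketch builds a PUT and a GET request for the same URL using Ruby's standard Net::HTTP. The `/adhd/` path prefix matches the pattern checked in `adhd_rest_server.rb`; the host, the port (assumed here to be CouchDB's 5984 plus one) and the helper names `build_put`, `build_get` and `send_to_node` are assumptions made for this example only.

```ruby
require 'net/http'
require 'uri'

# Hypothetical node address: assumes CouchDB on port 5984, with the
# ADHD REST server listening one port above it.
ADHD_NODE = URI.parse('http://localhost:5985')

# Build (but do not send) a PUT request that stores +body+ under +filename+.
def build_put(filename, body, mime_type)
  req = Net::HTTP::Put.new("/adhd/#{filename}")
  req['Content-Type'] = mime_type
  req.body = body
  req
end

# Build the matching GET request for the same URL.
def build_get(filename)
  Net::HTTP::Get.new("/adhd/#{filename}")
end

# Send a request to the node. This needs a running ADHD node, so it is
# defined here but not called.
def send_to_node(req)
  Net::HTTP.start(ADHD_NODE.host, ADHD_NODE.port) { |http| http.request(req) }
end

put = build_put('photo.jpg', 'fake image bytes', 'image/jpeg')
get = build_get('photo.jpg')
puts "#{put.method} #{put.path}"
puts "#{get.method} #{get.path}"
```

Any node in the cluster should be able to accept either request, so `ADHD_NODE` could point at whichever machine is closest.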
17
+
18
+ === Under the Hood
19
+
20
+ ==== Overview
21
+
22
+ Multiple server nodes are part of an ADHD network. Each node has a full copy of two CouchDB databases: the node management database, and the shard range database.
23
+
24
+ The node management database keeps a record of all existing nodes, their location (IP), and current status. This database is replicated across all nodes in the cluster; any node is in theory capable of becoming a management node if the current management nodes become unavailable.
25
+
26
+ The shard range database is also replicated everywhere. It allows any node to find out which storage nodes are responsible for holding a particular piece of content. Management nodes are in charge of maintaining the shard database and pushing changes to non-management nodes.
27
+
28
+ Storage nodes hold one or more shards in addition to the management and shard range databases. These shards contain the content and some metadata. A shard consists of a CouchDB database which holds Couch documents - one attached file per document.
29
+
30
+ ==== PUT requests
31
+
32
+ A PUT request reaches any ADHD node. The first line of the request is parsed, and the ADHD node figures out the name, MIME type and content length of the file based on the incoming HTTP request headers. Based on the MD5 hash of the filename, we extract a 20-byte internal ID string for the file and use this string to figure out which shard range the file will be assigned to.
33
+
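A minimal sketch of the ID step above, assuming the internal ID is the full hexadecimal MD5 digest of the filename (which is what `adhd_rest_server.rb` computes with `MD5.new($1).to_s`; the sketch makes no attempt to produce the 20-byte form mentioned above). How IDs map onto shard ranges is not shown in this diff, so the `SHARD_RANGES` table, its names and the first-hex-digit rule are purely hypothetical.

```ruby
require 'digest/md5'

# Internal ID: hex MD5 digest of the filename.
def internal_id(filename)
  Digest::MD5.hexdigest(filename)
end

# Hypothetical shard ranges, each covering an inclusive span of
# first hex digits of the internal ID.
SHARD_RANGES = [
  { name: 'sh_0_to_3', first: '0', last: '3' },
  { name: 'sh_4_to_7', first: '4', last: '7' },
  { name: 'sh_8_to_b', first: '8', last: 'b' },
  { name: 'sh_c_to_f', first: 'c', last: 'f' }
].freeze

# Find the shard range whose span contains the first digit of the ID.
def shard_range_for(id)
  SHARD_RANGES.find { |range| (range[:first]..range[:last]).cover?(id[0]) }
end

id = internal_id('photo.jpg')
puts "#{id} -> #{shard_range_for(id)[:name]}"
```

Because the ID depends only on the filename, every node computes the same shard range for the same file without consulting any other node.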
34
+ Once we have a shard range, we ask the shard range database which node(s) are in charge of storage for the given shard and are currently available. One of these nodes is chosen (starting with a master node and falling back to other nodes as necessary). The file metadata is written as a CouchDB document to the chosen node, followed by the file as an attachment.
35
+
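The node choice described above (prefer a master, fall back to any other available node) could be sketched as follows. The `Node` struct, its fields and `choose_node` are hypothetical names for this example; the real lookup goes through the shard range database.

```ruby
# Hypothetical record for a storage node responsible for a shard.
Node = Struct.new(:name, :master, :available)

# Pick an available master if there is one, otherwise the first
# available non-master node. Returns nil when no node is available.
def choose_node(nodes)
  candidates = nodes.select(&:available)
  candidates.find(&:master) || candidates.first
end

nodes = [
  Node.new('node_a', true,  false), # master, currently down
  Node.new('node_b', false, true),
  Node.new('node_c', false, true)
]
chosen = choose_node(nodes)
puts chosen ? chosen.name : 'no node available'
```

With the master marked unavailable, the sketch falls back to `node_b`, the first available node in the list.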
36
+ ==== GET requests
37
+
38
+ The GET request is basically the same as the PUT request, except that the file is retrieved from the CouchDB shard instead of being created on a shard.
39
+
40
+ ==== Replication
41
+
42
+ Replication happens differently for different databases. The node management database and the shard range database are periodically synced to and from the management nodes. The content shards are replicated when they get updated (i.e. when a file is added to a node which is responsible for a given shard, the file will be replicated to all other nodes which are responsible for the same shard).
43
+
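For the content shards, one way to picture the "replicate on update" step is as a set of CouchDB replication requests, one per peer node holding the same shard. CouchDB accepts a JSON body with `source` and `target` fields POSTed to `/_replicate`; whether adhd uses this endpoint directly or goes through a client library is not visible in this diff, and the database name and node URLs below are invented for the example.

```ruby
require 'json'

# Build one CouchDB replication request body per peer node.
# Each body would be POSTed as JSON to the source server's /_replicate.
def replication_bodies(shard_db, source_url, peer_urls)
  peer_urls.map do |peer_url|
    JSON.generate({
      'source' => "#{source_url}/#{shard_db}",
      'target' => "#{peer_url}/#{shard_db}"
    })
  end
end

bodies = replication_bodies(
  'node1_sh_c_to_f_content',                               # hypothetical shard database
  'http://192.168.1.10:5984',                              # node that received the file
  ['http://192.168.1.11:5984', 'http://192.168.1.12:5984'] # other nodes for this shard
)
bodies.each { |body| puts body }
```

The two admin databases would follow the same pattern, only triggered by a timer rather than by a write.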
44
+ == Rationale
45
+
46
+ We can have multiple redundant nodes storing the same files, getting rid of the need for backups. Not every node needs to store every file, so we can scale up storage capacity by adding more nodes. The design also provides load balancing as it means that we can serve files from multiple nodes.
47
+
48
+ Because all admin information is shared by every node, the unavailability of any node does not jeopardize the operation of the system. As soon as a node becomes unavailable, any other node can perform its functions (although we will lose storage capacity throughout the cluster if we lose multiple large storage nodes).
49
+
50
+ === Some vapourware objectives
51
+
52
+ The design also allows for different capabilities in storage nodes. Shards can be assigned based on specific properties (available bandwidth, available processing power, etc) so that we can store files in the most efficient possible way. It might be desirable, for example, to put all video files on machines with fast processors for doing video encoding, whereas audio and photos wouldn't need any post-processing and could go somewhere else. Or we could store new and popular content in shard ranges which are on fast servers with high throughput, while putting less popular content on servers with less bandwidth (and lower storage costs). Ideally we would like to get to a point where low-demand files can be stored on relatively low-bandwidth home or office network connections and still be available in the cluster.
53
+
54
+
55
+ == Fictional use cases (no one has ever used this software in real life)
56
+
57
+ Wikipedia: photos of Cheryl Cole and Robbie Williams are unfortunately wildly popular at present and will be requested often. Photos of the millions of singers with more talent and smaller marketing budgets will not be requested very often, but it's still good if they are in the cluster. If by some stroke of good luck Kevin Quain suddenly gets the recognition he deserves, his photo will be shifted from the "dial-up" node class to something with more stomp.
58
+
59
+ Archive.org: those sex-ed videos and Bert the turtle from the Prelinger Archive are awesome and everybody wants to watch them, so they should be on beefy video-storage nodes with high bandwidth. Documentaries on the yellow-bellied sapsucker may be requested only once in a while.
60
+
61
+ BBC News: material going up on the site right now will need a lot of bandwidth, but by next week most of this media will be out of the news cycle and can be consigned to a set of servers with much less bandwidth. By next year, it is unlikely that this media will be viewed even a few times a day, so it could be smart to trade storage costs for speed and put old media on cheaper, lower-bandwidth boxes with huge hard drives.
62
+
4
63
 
5
64
  == Note on Patches/Pull Requests
6
-
65
+
7
66
  * Fork the project.
8
67
  * Make your feature addition or bug fix.
9
68
  * Add tests for it. This is important so I don't break it in a
@@ -16,3 +75,4 @@ Description goes here.
16
75
  == Copyright
17
76
 
18
77
  Copyright (c) 2009 dave@netbook. See LICENSE for details.
78
+
data/VERSION CHANGED
@@ -1 +1 @@
1
- 0.0.1
1
+ 0.1.0
@@ -5,15 +5,14 @@
5
5
 
6
6
  Gem::Specification.new do |s|
7
7
  s.name = %q{adhd}
8
- s.version = "0.0.1"
8
+ s.version = "0.1.0"
9
9
 
10
10
  s.required_rubygems_version = Gem::Requirement.new(">= 0") if s.respond_to? :required_rubygems_version=
11
11
  s.authors = ["dave.hrycyszyn@headlondon.com"]
12
- s.date = %q{2009-12-19}
13
- s.default_executable = %q{adhd}
12
+ s.date = %q{2009-12-27}
14
13
  s.description = %q{More to say when something works! Do not bother installing this! }
15
14
  s.email = %q{dave.hrycyszyn@headlondon.com}
16
- s.executables = ["adhd"]
15
+ s.executables = ["adhd", "adhd_cleanup"]
17
16
  s.extra_rdoc_files = [
18
17
  "LICENSE",
19
18
  "README.rdoc"
@@ -27,11 +26,15 @@ Gem::Specification.new do |s|
27
26
  "VERSION",
28
27
  "adhd.gemspec",
29
28
  "bin/adhd",
29
+ "bin/adhd_cleanup",
30
30
  "doc/adhd.xmi",
31
- "lib/adhd.rb",
31
+ "lib/adhd/adhd_rest_server.rb",
32
32
  "lib/adhd/config.yml",
33
- "lib/adhd/models.rb",
34
- "lib/adhd/node.rb",
33
+ "lib/adhd/models/content_doc.rb",
34
+ "lib/adhd/models/content_shard.rb",
35
+ "lib/adhd/models/node_doc.rb",
36
+ "lib/adhd/models/shard_range.rb",
37
+ "lib/adhd/node_manager.rb",
35
38
  "lib/adhd/reactor.rb",
36
39
  "lib/ext/hash_to_openstruct.rb",
37
40
  "lib/public/images/img01.jpg",
@@ -53,7 +56,6 @@ Gem::Specification.new do |s|
53
56
  "lib/public/style.css",
54
57
  "lib/views/index.erb",
55
58
  "lib/views/layout.erb",
56
- "models.rb",
57
59
  "test/helper.rb",
58
60
  "test/test_adhd.rb"
59
61
  ]
data/bin/adhd CHANGED
@@ -1,33 +1,48 @@
1
1
  #!/usr/bin/env ruby
2
2
  require File.dirname(__FILE__) + '/../lib/ext/hash_to_openstruct'
3
- require File.dirname(__FILE__) + '/../lib/adhd/node'
3
+ require File.dirname(__FILE__) + '/../lib/adhd/node_manager'
4
4
  require File.dirname(__FILE__) + '/../lib/adhd/reactor'
5
+ require File.dirname(__FILE__) + '/../lib/adhd/adhd_rest_server'
5
6
 
6
7
  require 'optparse'
7
8
  require 'ftools'
8
9
  require 'yaml'
9
- require 'socket'
10
+ # require 'socket'
10
11
 
11
12
  def parse_config(file)
12
13
  @config = YAML.load_openstruct(File.read(file))
13
14
  end
14
15
 
15
- @command = ARGV.shift
16
+ # @command = ARGV.shift
16
17
 
17
- options = {}
18
+ #options = {}
18
19
 
19
20
  opts = OptionParser.new do |opts|
20
- opts.on("-C", "--config C", "YAML config file") do |n|
21
+ opts.on("-c", "--config C", "YAML config file") do |n|
22
+ puts "Parsing config file #{n}"
21
23
  parse_config(n)
22
24
  end
23
25
  end
24
26
 
25
27
  opts.parse! ARGV
26
28
 
27
- @node = Adhd::Node.new(@config)
29
+ @node_manager = Adhd::NodeManager.new(@config)
30
+
31
+
32
+ # Start the Thin server within the reactor loop
28
33
 
29
34
  EM.run {
30
- puts "Starting EventMachine reactor loop..."
31
- EM.connect @config.node_url, @config.couchdb_server_port, Adhd::DbUpdateReactor, @node
35
+ # puts "Starting EventMachine reactor loop..."
36
+ # EM.connect @config.node_url, @config.couchdb_server_port, Adhd::DbUpdateNotifier, @node_manager
37
+ timer = EventMachine::PeriodicTimer.new(5) do
38
+ # puts "Sync Admin"
39
+ @node_manager.sync_admin
40
+ @node_manager.run
41
+ end
42
+
43
+ # Start the server
44
+ EventMachine::start_server @config.node_url, @config.couchdb_server_port + 1, AdhdRESTServer, @node_manager
45
+
46
+
32
47
  }
33
48
 
@@ -0,0 +1,57 @@
1
+ #!/usr/bin/env ruby
2
+
3
+ # Usage: adhd_cleanup -u COUCHDB_URL -n NODENAME [-d]
4
+ # Lists the databases associated with the given adhd node; with -d, deletes them
5
+
6
+ require 'rubygems'
7
+ require 'couchrest'
8
+ require 'optparse'
9
+
10
+ node_names = []
11
+ delete_all = false
12
+ db_url = ""
13
+
14
+ opts = OptionParser.new do |opts|
15
+ opts.on("-u", "--url U", "CouchDB server URL") do |u|
16
+ puts "Database URL is #{u}"
17
+ # parse_config(n)
18
+ db_url = u
19
+ end
20
+
21
+
22
+ opts.on("-n", "--node N", "node name") do |n|
23
+ puts "Deleting databases of node #{n}"
24
+ # parse_config(n)
25
+ node_names << n
26
+ end
27
+
28
+ opts.on("-d", "--delete", "Delete databases") do |n|
29
+ delete_all = true
30
+ end
31
+ end
32
+
33
+ opts.parse! ARGV
34
+
35
+ if delete_all
36
+ puts "DELETE all databases in #{db_url}"
37
+ else
38
+ puts "LIST all databases in #{db_url}"
39
+ end
40
+
41
+ puts node_names.join(', ')
42
+
43
+
44
+ # Find all databases we want
45
+ server = CouchRest::Server.new(db_url)
46
+ server.databases.each do |db_name|
47
+ node_names.each do |node_name|
48
+ if db_name.split('_')[0] == node_name
49
+ puts db_name
50
+ if delete_all
51
+ db = server.database(db_name)
52
+ db.delete!
53
+ end
54
+ end
55
+ end
56
+
57
+ end
@@ -0,0 +1,229 @@
1
+ require 'eventmachine'
2
+ require 'uri'
3
+ require 'net/http'
4
+ require 'webrick'
5
+
6
+ module ProxyToServer
7
+ # This implements the connection that proxies an incoming file to the
8
+ # respective CouchDB instance, as an attachment.
9
+
10
+ def initialize our_client_conn, init_request
11
+ @our_client_conn = our_client_conn
12
+ @init_request = init_request
13
+ our_client_conn.proxy_conn = self
14
+ end
15
+
16
+ def post_init
17
+ # We have opened a connection to the DB server, so now it is time
18
+ # to send the initial Couch request, using HTTP 1.0.
19
+ puts "Send request: #{@init_request}"
20
+ send_data @init_request
21
+ end
22
+
23
+ def receive_data data
24
+ @our_client_conn.proxy_receive_data data
25
+ end
26
+
27
+ def unbind
28
+ @our_client_conn.proxy_unbind
29
+ end
30
+
31
+ end
32
+
33
+ module AdhdRESTServer
34
+ attr_accessor :proxy_conn
35
+
36
+ def initialize node_manager
37
+ @node_manager = node_manager
38
+ @buffer = ""
39
+ @status = :header
40
+ end
41
+
42
+ #def post_init
43
+ # puts "-- someone connected to the echo server!"
44
+ #end
45
+
46
+ # The format we are expecting:
47
+ #
48
+ # PUT somedatabase/document/attachment?rev=123 HTTP/1.0
49
+ # Content-Length: 245
50
+ # Content-Type: image/jpeg
51
+ #
52
+ # <JPEG data>
53
+
54
+ def receive_data data
55
+
56
+ # First we get all the headers in to find out which resource
57
+ # we are looking for.
58
+
59
+ if @status == :header
60
+ @buffer += data
61
+
62
+ if @buffer =~ /\r\n\r\n/
63
+
64
+ # Detected end of headers
65
+ header_data = @buffer[0...($~.begin(0))]
66
+
67
+ # Try the webrick parser
68
+ @req = WEBrick::HTTPRequest.new(WEBrick::Config::HTTP)
69
+
70
+ StringIO.open(header_data, 'rb') do |socket|
71
+ @req.parse(socket)
72
+ end
73
+
74
+ # The rest of the incoming connection
75
+ @buffer = @buffer[($~.end(0))..-1]
76
+
77
+ # Compute the ID of the sought resource
78
+ if @req.path =~ /\/adhd\/(.*)/
79
+ @req.header["Filename"] = $1
80
+ @req.header["ID"] = MD5.new($1).to_s
81
+ else
82
+ # Throw an error
83
+ end
84
+
85
+ # Change the status once headers are found
86
+ @status = :find_node
87
+ else
88
+ # Avoid DoS via buffer filling
89
+ close_connection if @buffer.length > 1000
90
+ end
91
+
92
+ end
93
+
94
+ # Now we have the headers, but maybe not the full body, and we are looking
95
+ # for the right node in our network to handle the call.
96
+ if @status == :find_node
97
+ pause # We want to tell the remote host to wait a bit
98
+ # This would allow us to defer the execution of the calls to find
99
+ # the right nodes, and extract the doc.
100
+
101
+ # TODO: We need to push all the chit-chat with the remote servers to
102
+ # a deferrable object, or some other connection, so as not to block.
103
+ # Right now we are blocking and it sucks.
104
+
105
+ # Now get or write the document associated with this file
106
+ if @req.request_method == "GET"
107
+
108
+ @our_doc = @node_manager.srdb.get_doc_directly(@req.header["ID"])
109
+
110
+ # TODO: handle errors if file does not exist
111
+ if @our_doc[:ok]
112
+ @status = :get
113
+ handle_get
114
+ else
115
+ send_data "Problem"
116
+ end
117
+ end
118
+
119
+ if @req.request_method == "PUT"
120
+ # Define a Doc with the data so far
121
+ @our_doc = ContentDoc.new
122
+
123
+ @our_doc._id = @req.header["ID"]
124
+ @our_doc.internal_id = @req.header["ID"]
125
+ @our_doc.size_bytes = @req.content_length
126
+ @our_doc.filename = @req.header["Filename"]
127
+ @our_doc.mime_type = @req.content_type
128
+
129
+ # Write to the right node
130
+ @our_doc = @node_manager.srdb.write_doc_directly(@our_doc)
131
+
132
+ # TODO: if an error is returned here, we cannot execute the query
133
+ if @our_doc[:ok]
134
+ @status = :put
135
+ handle_put
136
+ else
137
+ send_data "Problem"
138
+ end
139
+ end
140
+
141
+ # Now send the reply as an HTTP/1.0 response
142
+
143
+ # HTTP/1.0 200 OK
144
+ # Date: Fri, 08 Aug 2003 08:12:31 GMT
145
+ # Server: Apache/1.3.27 (Unix)
146
+ # MIME-version: 1.0
147
+ # Last-Modified: Fri, 01 Aug 2003 12:45:26 GMT
148
+ # Content-Type: text/html
149
+ # Content-Length: 2345
150
+ # ** a blank line *
151
+ # <HTML> ...
152
+
153
+
154
+
155
+ # response = @our_doc.to_s
156
+ #
157
+ # send_data "HTTP/1.0 200 OK\r\n"
158
+ # send_data "Content-Type: text/plain\r\n"
159
+ # send_data "Content-Length: #{response.length}\r\n"
160
+ # send_data "\r\n"
161
+ # send_data response
162
+ #
163
+ # # Close the connection
164
+ # close_connection_after_writing
165
+
166
+ end
167
+
168
+ # We have the header and the node, and now we execute the request
169
+ if @status == :execute_request
170
+
171
+ end
172
+
173
+ end
174
+
175
+ def handle_get
176
+ resume
177
+ # We need to connect to the right server and build a header
178
+ server_uri = URI.parse(@our_doc[:db].server.uri)
179
+ server_addr = server_uri.host
180
+ server_port = server_uri.port
181
+
182
+ docid = @our_doc[:doc]._id
183
+ dbname = @our_doc[:db].name
184
+ request = "GET /#{dbname}/#{docid}/#{@our_doc[:doc].filename} HTTP/1.0\r\n\r\n"
185
+ #send_data request
186
+ #close_connection_after_writing
187
+ puts "Connect to #{server_addr} port #{server_port}"
188
+ conn = EM::connect server_addr, server_port, ProxyToServer, self, request
189
+ EM::enable_proxy proxy_conn, self, 1024
190
+ end
191
+
192
+ def proxy_unbind
193
+ # Our connection to the CouchDB has just been torn down
194
+ close_connection_after_writing
195
+ end
196
+
197
+ def proxy_receive_data data
198
+ # Response to a PUT request only
199
+ send_data data
200
+ end
201
+
202
+
203
+ def handle_put
204
+ resume
205
+
206
+ # We need to connect to the right server and build a header
207
+ server_uri = URI.parse(@our_doc[:db].server.uri)
208
+ server_addr = server_uri.host
209
+ server_port = server_uri.port
210
+
211
+ docid = @our_doc[:doc]._id
212
+ dbname = @our_doc[:db].name
213
+ request = "PUT /#{dbname}/#{docid}/#{@our_doc[:doc].filename}?rev=#{@our_doc[:doc]["_rev"]} HTTP/1.0\r\n"
214
+ request += "Content-Type: #{@our_doc[:doc].mime_type}\r\n"
215
+ request += "Content-Length: #{@our_doc[:doc].size_bytes}\r\n"
216
+ request += "\r\n"
217
+ request += @buffer
218
+ #send_data request
219
+ #close_connection_after_writing
220
+ puts "Connect to #{server_addr} port #{server_port}"
221
+ conn = EM::connect server_addr, server_port, ProxyToServer, self, request
222
+ EM::enable_proxy self, proxy_conn, 1024
223
+ end
224
+
225
+ def unbind
226
+ puts "-- someone disconnected from the echo server!"
227
+ end
228
+ end
229
+