lwac 0.2.0

Files changed (45)
  1. checksums.yaml +7 -0
  2. data/LICENSE +70 -0
  3. data/README.md +31 -0
  4. data/bin/lwac +132 -0
  5. data/client_config.md +71 -0
  6. data/concepts.md +70 -0
  7. data/config_docs.md +40 -0
  8. data/doc/compile.rb +52 -0
  9. data/doc/template.rhtml +145 -0
  10. data/example_config/client.jv.yml +33 -0
  11. data/example_config/client.yml +34 -0
  12. data/example_config/export.yml +70 -0
  13. data/example_config/import.yml +19 -0
  14. data/example_config/server.yml +97 -0
  15. data/export_config.md +448 -0
  16. data/import_config.md +29 -0
  17. data/index.md +49 -0
  18. data/install.md +29 -0
  19. data/lib/lwac.rb +17 -0
  20. data/lib/lwac/client.rb +354 -0
  21. data/lib/lwac/client/file_cache.rb +160 -0
  22. data/lib/lwac/client/storage.rb +69 -0
  23. data/lib/lwac/export.rb +362 -0
  24. data/lib/lwac/export/format.rb +310 -0
  25. data/lib/lwac/export/key_value_format.rb +132 -0
  26. data/lib/lwac/export/resources.rb +82 -0
  27. data/lib/lwac/import.rb +152 -0
  28. data/lib/lwac/server.rb +294 -0
  29. data/lib/lwac/server/consistency_manager.rb +265 -0
  30. data/lib/lwac/server/db_conn.rb +376 -0
  31. data/lib/lwac/server/storage_manager.rb +290 -0
  32. data/lib/lwac/shared/data_types.rb +283 -0
  33. data/lib/lwac/shared/identity.rb +44 -0
  34. data/lib/lwac/shared/launch_tools.rb +87 -0
  35. data/lib/lwac/shared/multilog.rb +158 -0
  36. data/lib/lwac/shared/serialiser.rb +86 -0
  37. data/limits.md +114 -0
  38. data/log_config.md +30 -0
  39. data/monitoring.md +13 -0
  40. data/resources/schemata/mysql/links.sql +7 -0
  41. data/resources/schemata/sqlite/links.sql +5 -0
  42. data/server_config.md +242 -0
  43. data/tools.md +89 -0
  44. data/workflows.md +39 -0
  45. metadata +140 -0
@@ -0,0 +1,29 @@
1
+ Import Tool Configuration
2
+ =========================
3
+ The import tool is responsible for creating (or adding links into) the server's corpus. Its configuration file is thus very simple---all options for string processing and storage remain the preserve of the [server's config file](server_config.html).
4
+
5
+
6
+ Config
7
+ ------
8
+ The import tool uses the server configuration to access a corpus, and loads it as if it were a server.
9
+
10
+ * `server_config` --- The path to the server configuration file. The importer will use this for storage properties and string sanitisation settings.
11
+
12
+ Other than that, its configuration options apply to output and file handling:
13
+
14
+ * `schemata_path` --- The path where `.sql` files may be found for creating a database. Leave this blank to use defaults that come bundled with LWAC.
15
+ * `create_db` --- Boolean. Set to `true` to make the import script attempt to create the database if it doesn't exist.
16
+ * `notify` --- How many lines should be imported before the UI updates and tells the user.
17
+
18
+ For example:
19
+
20
+     :server_config: example_config/server.yml
21
+     :schemata_path: # use defaults
22
+     :notify: 12345
23
+     :create_db: true
24
+
25
+
26
+ Logging
27
+ -------
28
+ The logging system is the same for client, server, import, and export tools and shares a configuration with them. For details, see [configuring logging](log_config.html).
29
+
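The example options above use symbol keys (note the leading colons), which Ruby's standard YAML library turns into a symbol-keyed hash. The following is a minimal sketch of reading such a file. It assumes the importer reads its configuration with plain `YAML.load`, which this page does not confirm, and the `'(bundled defaults)'` label is purely illustrative.

```ruby
require 'yaml'

# The example import configuration from the documentation above.
config_text = <<~CONFIG
  :server_config: example_config/server.yml
  :schemata_path: # use defaults
  :notify: 12345
  :create_db: true
CONFIG

# Keys written with a leading colon are parsed as Ruby symbols.
config = YAML.load(config_text)

# A key with no value (only a comment) parses as nil, which is how
# "leave this blank to use defaults" appears to the program.
schemata = config[:schemata_path] || '(bundled defaults)'

puts "Server config: #{config[:server_config]}"
puts "Schemata path: #{schemata}"
puts "Notify every:  #{config[:notify]} lines"
puts "Create DB:     #{config[:create_db]}"
```

Because a blank `schemata_path` arrives as `nil`, any code consuming the hash has to supply the fallback itself, as the `||` above does.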
@@ -0,0 +1,49 @@
1
+ Index
2
+ =====
3
+ Welcome to the LWAC user guide. The purpose of this document is to explain some of the design concepts used, provide recipes for common tasks, and explain some of the technical capacities/limitations of LWAC.
4
+
5
+ Installation and Dependencies
6
+ -----------------------------
7
+ LWAC's dependencies and installation are documented [here](install.html).
8
+
9
+ Concepts
10
+ --------
11
+ The [concepts](concepts.html) used in the model are explained here.
12
+
13
+ Tools
14
+ -----
15
+ LWAC consists of a number of [tools](tools.html), and these are explained individually here.
16
+
17
+ Workflows
18
+ ---------
19
+ LWAC's [workflow](workflows.html) was designed around longitudinal sampling. Some common methods are explained here.
20
+
21
+ Process Monitoring
22
+ ------------------
23
+ LWAC is designed for longitudinal sampling, and as such a process monitor is a good idea to ensure that you are notified in case of any problems. [Monitoring and maintenance](monitoring.html) has its own section.
24
+
25
+ Limits and Performance
26
+ ----------------------
27
+ See some rough indications of what [limits LWAC's performance](limits.html).
28
+
29
+ Bugs and Suggestions
30
+ --------------------
31
+ Please get in touch by reporting bugs to [my issue tracker](http://stephenwattam.com/issues/), or [email me](http://stephenwattam.com/contact/). Please do this if you notice something---I relish the opportunity to fix it.
32
+
33
+ If you can write code, feel free to submit a patch or contact me for your own branch in the git repository. There's a space in the authors listing for your name :-)
34
+
35
+ License and Usage Conditions
36
+ ----------------------------
37
+ LWAC is licensed under a [Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported License](http://creativecommons.org/licenses/by-nc-sa/3.0/), a copy of which may be found in the `LICENSE` file in the root of the distribution.
38
+
39
+ In addition to the Creative Commons terms above, I wish to mandate the following conditions: You must...
40
+
41
+ * Not profit from directly selling the code (including as part of something else);
42
+ * Not profit from selling corpora built using LWAC. (Note that I will probably grant permission if contacted; I merely wish to stop people turning it into a commercial service);
43
+ * Credit LWAC and provide a link to [http://stephenwattam.com/project/LWAC/](http://stephenwattam.com/project/LWAC/) or [ucrel.lancs.ac.uk/LWAC/](http://ucrel.lancs.ac.uk/LWAC/) in any publications (You can also cite the paper pending for WaC8, once it's published);
44
+ * Not use it to DDOS things.
45
+
46
+ Other than that, knock yourself out. The code's fairly clean, and fairly well documented.
47
+
48
+ It's worth noting here that the data you download from the web is probably copyright to some third party who wouldn't necessarily consent to its distribution. Laws on the use of web data vary wildly and are a grey area in many jurisdictions, so _caveat emptor_ (or _caveat corpus aedificator_).
49
+
@@ -0,0 +1,29 @@
1
+ Installation
2
+ ============
3
+ LWAC is distributed as a gem, which should manage include paths and dependencies. It relies on:
4
+
5
+ * [Ruby](http://www.ruby-lang.org/en/) v1.9.3 or above
6
+ * [libcurl](http://curl.haxx.se/libcurl/) (client only)
7
+ * [sqlite3](http://www.sqlite.org/) or [mysql](http://www.mysql.com/) or [mariaDB](https://mariadb.org/) (server only)
8
+ * Some supporting gems:
9
+     * simplerpc
10
+     * sqlite3 or mysql2 (server only)
11
+     * curb (client only)
12
+
13
+ To install, simply run:
14
+
15
+     $ gem install lwac
16
+
17
+ to install the latest version, and then:
18
+
19
+ * If you are only ever going to run the client, `gem install curb`
20
+ * If you are only ever going to run the server, `gem install sqlite3` _or_ `gem install mysql2`
21
+ * If you're going to run both clients and servers, install both of the above.
22
+
23
+ Git
24
+ ---
25
+ If you wish to run LWAC straight from the git repository, you can do so by adding a single item to Ruby's `$LOAD_PATH`. To do this, run:
26
+
27
+     ruby -I ./lib ./bin/lwac [commands]
28
+
29
+ This is particularly useful when modifying the code. If you make some modifications, please don't hesitate to get in touch and I'll do my best to integrate them upstream.
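The role-specific gems listed above (`curb` for the client, `sqlite3` or `mysql2` for the server) can be checked for before launching anything. The sketch below is not part of LWAC; the helper names `gem_installed?` and `runnable_roles` are hypothetical, and it only uses RubyGems' own `Gem::Specification.find_by_name` lookup.

```ruby
require 'rubygems'

# Returns true if a gem with the given name is installed.
# find_by_name raises a Gem::LoadError subclass when it is not.
def gem_installed?(name)
  Gem::Specification.find_by_name(name)
  true
rescue Gem::LoadError
  false
end

# Works out which LWAC roles this machine has the optional gems for,
# following the install notes: curb for clients, and sqlite3 or mysql2
# for servers.
def runnable_roles
  roles = []
  roles << :client if gem_installed?('curb')
  roles << :server if gem_installed?('sqlite3') || gem_installed?('mysql2')
  roles
end

puts "Optional gems allow these roles: #{runnable_roles.inspect}"
```

An empty list simply means none of the optional gems were found, in which case the `gem install` commands above apply.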
@@ -0,0 +1,17 @@
1
+
2
+ # LWAC namespace
3
+ module LWAC
4
+ # Overall LWAC version, as in the git tags
5
+ VERSION = '0.2.0'
6
+ # Date of last significant edit
7
+ DATE = '20-07-13'
8
+
9
+ # Authors
10
+ AUTHORS = [
11
+ {:name => "Stephen Wattam", :contact => "http://stephenwattam.com"},
12
+ #{:name => "", :contact => ""} # Add yourself here (and in the gemspec) if you contribute to LWAC
13
+ ]
14
+
15
+ # Location of resources
16
+ RESOURCE_DIR = File.join( File.dirname( File.expand_path(__FILE__) ), '../resources/')
17
+ end
@@ -0,0 +1,354 @@
1
+ require 'lwac/client/file_cache'
2
+ require 'lwac/client/storage'
3
+
4
+ require 'lwac/shared/multilog'
5
+ require 'lwac/shared/identity'
6
+ require 'lwac/shared/serialiser'
7
+
8
+ require 'timeout'
9
+ require 'digest/md5'
10
+
11
+
12
+ require 'simplerpc/client'
13
+ require 'lwac/shared/data_types' # for serialisation
14
+ require 'blat'
15
+
16
+ require 'curb'
17
+
18
+ module LWAC
19
+
20
+ class DownloadClient
21
+
22
+ # Construct a new DownloadClient
23
+ def initialize(config)
24
+ # Save the config
25
+ @config = config
26
+
27
+ # Generate a unique identifier for this host
28
+ @uuid = generate_uuid
29
+
30
+ # Fire up el RPC client...
31
+ @rpc_client = SimpleRPC::Client.new(@config[:server])
32
+
33
+ # Don't RPC again until...
34
+ @rpc_delay = Time.now
35
+
36
+ # Construct a new multi-curl thingy
37
+ $log.info "Starting download engine..."
38
+ @dl = Blat::Queue.new(@config[:client][:simultaneous_workers])
39
+
40
+ # List of links pending, cached.
41
+ @links = []
42
+ @cache = new_cache
43
+ @cache_bytes = 0
44
+
45
+ # Mutexes for link access
46
+ @link_mx = Mutex.new
47
+ @cache_mx = Mutex.new
48
+
49
+ # Don't try to acquire more data until...
50
+ @checkout_delay = Time.now
51
+
52
+ # Start the log with UUID info.
53
+ $log.info "Client started with UUID: #{@uuid}"
54
+
55
+ # ping for helpfulness
56
+ ping
57
+ end
58
+
59
+ # Poll the server and download from the web, maintaining throughput
60
+ # to the web by downloading batches of links.
61
+ def work
62
+
63
+ loop do
64
+ @dl.perform do
65
+
66
+ sleep(@config[:client][:monitor_rate])
67
+
68
+ # Keep the download queue topped up
69
+ while @dl.request_count < @config[:client][:simultaneous_workers] && (new_link = get_curl)
70
+ @dl.add(new_link)
71
+ end
72
+
73
+ # Read things safely using a mutex
74
+ link_len = @link_mx.synchronize { @links.length }
75
+ active_requests = @dl.request_count
76
+ cache_len, bytes = @cache_mx.synchronize { [@cache.length, @cache_bytes] }
77
+
78
+ # Print nice progress output for folks
79
+ if @config[:client][:announce_progress] && (link_len > 0 || cache_len > 0 || active_requests > 0)
80
+ progress_mb = bytes.to_f / 1024 / 1024
81
+ limit_mb = @config[:client][:cache_limit].to_f / 1024 / 1024
82
+ pc_progress = (progress_mb / limit_mb) * 100
83
+
84
+ str = "#{progress_bar(@config[:client][:cache_limit], bytes)} #{pc_progress.round}%"
85
+ str += " #{progress_mb.round(2)}/#{limit_mb.round(2)}MB"
86
+ str += " (#{link_len} pend, #{active_requests} active, #{cache_len} done)"
87
+ str += " #{(link_len == 0 && Time.now < @checkout_delay) ? "[waiting #{(@checkout_delay - Time.now).round}s]" : ''}"
88
+
89
+ $log.info(str)
90
+ end
91
+
92
+ # Run out of links
93
+ if link_len <= 0 && Time.now > @checkout_delay
94
+ acquire_links
95
+ end
96
+
97
+ # Downloaded enough data already
98
+ if (@dl.idle? || bytes > @config[:client][:cache_limit]) && cache_len > 0
99
+ # (@pool.cache_size.to_f / 1024.0 / 1024.0) > @config[:client][:cache_limit] then
100
+ # Send completed points back to the server
101
+ send_cache
102
+ end
103
+ end
104
+
105
+ $log.debug "Downloader is idle."
106
+
107
+ end
108
+
109
+ rescue SignalException => se
110
+ $log.fatal "Caught signal - #{se}"
111
+ ensure
112
+ # Cancel web requests
113
+ cancelled = @dl.request_count
114
+ if @dl.request_count > 0
115
+ $log.info "Cancelling #{@dl.request_count} web requests..."
116
+ @dl.cancel
117
+ else
118
+ $log.info "No web requests active."
119
+ end
120
+
121
+ # Tell the server we're dying
122
+ if @links.length > 0 || @cache.length > 0 || @dl.request_count > 0
123
+ $log.info "Releasing lock on approx. #{@links.length + @cache.length + cancelled} links..."
124
+ rpc(5) do |s|
125
+ s.cancel(LWAC::VERSION, @uuid)
126
+ end
127
+ else
128
+ $log.info "No links to clean up."
129
+ end
130
+
131
+ # Quit
132
+ $log.info "Done. Client has closed cleanly."
133
+ end
134
+
135
+ private
136
+
137
+ # Pings the server to test RPC methods
138
+ def ping
139
+ $log.info "Pinging server..."
140
+ nonce = Random.rand(82349849)
141
+ reply = rpc(1) do |s|
142
+ s.ping(LWAC::VERSION, @uuid, nonce)
143
+ end
144
+
145
+ if nonce == reply
146
+ $log.info "Your network setup seems to work, that's good news :-)"
147
+ else
148
+ $log.warn "Failed to ping server! Please check your network properties."
149
+ end
150
+ end
151
+
152
+ # Returns a Curl::Easy object for downloading
153
+ def get_curl
154
+ link = @link_mx.synchronize { @links.pop }
155
+
156
+ return nil unless link
157
+
158
+ # Construct new curl from the link
159
+ curl = Curl::Easy.new(link[:link].uri)
160
+
161
+ # configure curl using config
162
+ link[:config][:curl_workers].each do |k, v|
163
+ if v.is_a?(Array)
164
+ curl.send(k.to_s + '=', *v)
165
+ else
166
+ curl.send(k.to_s + '=', v)
167
+ end
168
+ end
169
+
170
+ # Set completion handler
171
+ curl.on_complete do |res|
172
+ datapoint = nil
173
+ begin
174
+ datapoint = LWAC::DataPoint.from_request(link[:config], link[:link], res, @uuid, nil) # TODO: set error if needed.
175
+ rescue StandardError => e
176
+ $log.error "Error during request standardisation: #{e}"
177
+ $log.debug "#{e.backtrace.join("\n")}"
178
+
179
+ # Insert error if the above failed
180
+ datapoint = LWAC::DataPoint.new(link[:link], {}, '', '', {}, @uuid, e) if !datapoint
181
+ ensure
182
+ @cache_mx.synchronize do
183
+ @cache[link[:link].id] = datapoint
184
+ @cache_bytes += res.downloaded_bytes
185
+ end
186
+ end
187
+ $log.debug "Link #{link[:link].id} downloaded."
188
+ end
189
+
190
+ $log.debug "Link #{link[:link].id} sent for download."
191
+
192
+ # Return curl
193
+ return curl
194
+ end
195
+
196
+ # Acquire links from the server.
197
+ def acquire_links
198
+
199
+ $log.info "Requesting #{@config[:client][:batch_capacity]} links..."
200
+
201
+ loop do
202
+ ret = rpc do |s|
203
+ s.check_out(LWAC::VERSION, @uuid, @config[:client][:batch_capacity])
204
+ end
205
+
206
+ # If the server tells us to back off, so do.
207
+ if ret.is_a?(Integer)
208
+ $log.info "Server says to ask again at #{Time.now + ret}"
209
+ @checkout_delay = Time.now + [ret, @config[:network][:maximum_reconnect_time]].min
210
+ return
211
+ elsif ret.class == Array && ret.length == 2
212
+
213
+ # Load the worker config into the list
214
+ policy, links = ret
215
+ @link_mx.synchronize do
216
+ links.each do |l|
217
+ @links << {:link => l, :config => policy}
218
+ end
219
+ end
220
+
221
+ $log.info "Received #{links.length}/#{@config[:client][:batch_capacity]} links from server."
222
+ return links.length
223
+ else
224
+ $log.warn "Received unrecognised return from server of type: #{ret.class}. Retrying..."
225
+ $log.debug "Server said: '#{ret}'"
226
+ end
227
+ end
228
+ end
229
+
230
+
231
+ def send_cache
232
+
233
+ # Create a new cache and keep the old one for uploading
234
+ cache_to_send, bytes_to_send = nil, nil
235
+ @cache_mx.synchronize do
236
+ # Retain handles to old ones
237
+ cache_to_send = @cache
238
+ bytes_to_send = @cache_bytes
239
+
240
+ # And create new ones
241
+ @cache = new_cache
242
+ @cache_bytes = 0
243
+ end
244
+
245
+ while cache_to_send.length > 0 do
246
+
247
+ # Take them out of the pstore up to a given size limit, then send that chunk
248
+ pending = []
249
+ pending_size = 0.0
250
+ $log.debug "Counting data for upload..."
251
+ while(cache_to_send.length > 0 and pending_size < @config[:client][:check_in_size]) do
252
+ key = cache_to_send.keys[0]
253
+ dp = cache_to_send[key]
254
+ pending_size += dp.response_properties[:downloaded_bytes].to_f
255
+ pending << dp
256
+ cache_to_send.delete_from_index(key)
257
+ end
258
+
259
+ # send datapoints
260
+ $log.info "Sending #{pending.length} datapoints (~#{(pending_size.to_f / 1024 / 1024).round(2)}MB) to server..."
261
+ ret = rpc do |s|
262
+ s.check_in(LWAC::VERSION, @uuid, pending)
263
+ end
264
+ if ret.is_a?(Array)
265
+ $log.info "Done (server reported #{ret[0]} failures)"
266
+ $log.info "Server reports work rate as #{ret[1].to_f.round(2)} links/s" if ret[1]
267
+ else
268
+ $log.warn "Server returned something unexpected when checking in."
269
+ end
270
+ end
271
+
272
+ # Here cache_to_send.length == 0, so wipe the cache
273
+ cache_to_send.delete_all
274
+ end
275
+
276
+ # Yields to perform RPC tasks with a backoff
277
+ def rpc(retries = -1)
278
+ $log.debug "Connecting to server #{@rpc_client.hostname}:#{@rpc_client.port}..."
279
+ # TODO: update this to match new SimpleRPC exception format (as soon as one is implemented)
280
+
281
+ failed = true
282
+ ret = nil
283
+ rpc_delay_increment = @config[:network][:minimum_reconnect_time]
284
+ while (retries -= 1) != -1 && failed do
285
+
286
+ # Delay until the point we were asked to
287
+ if @rpc_delay > Time.now
288
+ $log.info "Rate limit: delaying for #{(@rpc_delay - Time.now).round}s until #{@rpc_delay}..."
289
+ sleep((@rpc_delay - Time.now) + 0.1)
290
+ end
291
+
292
+ begin
293
+
294
+ ret = yield(@rpc_client.get_proxy)
295
+ failed = false
296
+
297
+ # This looks funny, and is, but I double-catch in order
298
+ # to handle remote exceptions, which extend exception but not
299
+ # standarderror
300
+ rescue SignalException => se
301
+ raise se
302
+ rescue Exception => e
303
+
304
+ if e.is_a?(SimpleRPC::RemoteException)
305
+ $log.error "Server reported error: #{e}"
306
+ else
307
+ $log.error "Local error during RPC call: #{e}"
308
+ end
309
+
310
+ $log.debug "#{e.backtrace.join("\n")}"
311
+ failed = true
312
+
313
+ $log.warn "#{retries} retries remaining before disconnection..." if retries > 0
314
+
315
+ # Delay longer on failure
316
+ @rpc_delay = Time.now + [rpc_delay_increment, @config[:network][:maximum_reconnect_time]].min
317
+ rpc_delay_increment += @config[:network][:connect_failure_penalty]
318
+ end
319
+ end
320
+
321
+ return ret
322
+ end
323
+
324
+ # Create this client's ID.
325
+ # Must be persistent across instances, but not across machines.
326
+ def generate_uuid
327
+ require 'socket'
328
+ # TODO: make this ID based more on IP address, and/or make it more readable
329
+ @config[:client][:uuid_salt] + "_" + Digest::MD5.hexdigest("#{Socket.gethostname}#{@config[:client][:uuid_salt]}").to_s
330
+ end
331
+
332
+ # Get a new cache to replace the one in the pool
333
+ def new_cache
334
+ # Create cache in a random filename in the dir specified
335
+ filename = nil
336
+ filename = File.join(@config[:client][:cache_dir], rand.hash.abs.to_s) if @config[:client][:cache_dir]
337
+
338
+ $log.debug "Creating cache #{(@config[:client][:cache_dir] == nil) ? "in RAM" : "at #{filename}"}..."
339
+ return Store.new(filename)
340
+ end
341
+
342
+ # Returns a string progress bar for use in output
343
+ def progress_bar(total, progress, length=25)
344
+ bar_len = ((progress.to_f / total.to_f) * length).round
345
+ str = '['
346
+ str += '=' * [bar_len, length].min
347
+ str[-1] = '>' if bar_len > length
348
+ str += ' ' * (length - [bar_len, length].min)
349
+ str += ']'
350
+ end
351
+
352
+ end
353
+
354
+ end
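The `rpc` method above backs off linearly after each failed call: the wait starts at `minimum_reconnect_time`, grows by `connect_failure_penalty` per failure, and is capped at `maximum_reconnect_time`. The sketch below reproduces only that arithmetic so the schedule is easy to see. The function name `backoff_delays` and the numbers passed to it are illustrative assumptions, not values from LWAC's example configuration.

```ruby
# Computes the delay (in seconds) applied after each consecutive failure,
# mirroring the increment-and-cap logic in DownloadClient#rpc.
def backoff_delays(minimum, penalty, maximum, failures)
  increment = minimum
  Array.new(failures) do
    delay = [increment, maximum].min # never wait longer than the cap
    increment += penalty             # next failure waits longer
    delay
  end
end

# Hypothetical settings: start at 1s, add 5s per failure, cap at 20s.
p backoff_delays(1, 5, 20, 6)
# => [1, 6, 11, 16, 20, 20]
```

In the client the increment is a local variable, so the schedule restarts from the minimum on every new `rpc` call rather than carrying over between calls.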