lwac 0.2.0

Files changed (45)
  1. checksums.yaml +7 -0
  2. data/LICENSE +70 -0
  3. data/README.md +31 -0
  4. data/bin/lwac +132 -0
  5. data/client_config.md +71 -0
  6. data/concepts.md +70 -0
  7. data/config_docs.md +40 -0
  8. data/doc/compile.rb +52 -0
  9. data/doc/template.rhtml +145 -0
  10. data/example_config/client.jv.yml +33 -0
  11. data/example_config/client.yml +34 -0
  12. data/example_config/export.yml +70 -0
  13. data/example_config/import.yml +19 -0
  14. data/example_config/server.yml +97 -0
  15. data/export_config.md +448 -0
  16. data/import_config.md +29 -0
  17. data/index.md +49 -0
  18. data/install.md +29 -0
  19. data/lib/lwac.rb +17 -0
  20. data/lib/lwac/client.rb +354 -0
  21. data/lib/lwac/client/file_cache.rb +160 -0
  22. data/lib/lwac/client/storage.rb +69 -0
  23. data/lib/lwac/export.rb +362 -0
  24. data/lib/lwac/export/format.rb +310 -0
  25. data/lib/lwac/export/key_value_format.rb +132 -0
  26. data/lib/lwac/export/resources.rb +82 -0
  27. data/lib/lwac/import.rb +152 -0
  28. data/lib/lwac/server.rb +294 -0
  29. data/lib/lwac/server/consistency_manager.rb +265 -0
  30. data/lib/lwac/server/db_conn.rb +376 -0
  31. data/lib/lwac/server/storage_manager.rb +290 -0
  32. data/lib/lwac/shared/data_types.rb +283 -0
  33. data/lib/lwac/shared/identity.rb +44 -0
  34. data/lib/lwac/shared/launch_tools.rb +87 -0
  35. data/lib/lwac/shared/multilog.rb +158 -0
  36. data/lib/lwac/shared/serialiser.rb +86 -0
  37. data/limits.md +114 -0
  38. data/log_config.md +30 -0
  39. data/monitoring.md +13 -0
  40. data/resources/schemata/mysql/links.sql +7 -0
  41. data/resources/schemata/sqlite/links.sql +5 -0
  42. data/server_config.md +242 -0
  43. data/tools.md +89 -0
  44. data/workflows.md +39 -0
  45. metadata +140 -0
@@ -0,0 +1,86 @@
1
+
2
+ require 'yaml'
3
+
4
+
5
+ module LWAC
6
+
7
+
8
+
9
+ # This class wraps four possible serialisation systems, providing a common interface to them all.
10
+ #
11
+ # This is largely ripped from SimpleRPC, with some minor modifications.
12
+ class Serialiser
13
+
14
+ SUPPORTED_METHODS = %w{marshal json msgpack yaml}.map{|s| s.to_sym}
15
+
16
+ # Create a new Serialiser with the given method. Optionally provide a binding to have
17
+ # the serialisation method execute within another context, i.e. for it to pick up
18
+ # on various libraries and classes (though this will impact performance somewhat).
19
+ #
20
+ # Supported methods are:
21
+ #
22
+ # :marshal:: Use ruby's Marshal system. A good mix of speed and generality.
23
+ # :yaml:: Use YAML. Very slow but very general
24
+ # :msgpack:: Use MessagePack gem. Very fast but not very general (limited data format support)
25
+ # :json:: Use JSON. Also slow, but better for interoperability than YAML.
26
+ #
27
+ def initialize(method = :marshal, binding=nil)
28
+ @method = method
29
+ @binding = binding
30
+ raise "Unrecognised serialisation method" if not SUPPORTED_METHODS.include?(method)
31
+
32
+ # Require prerequisites and handle msgpack not installed-iness.
33
+ case method
34
+ when :msgpack
35
+ begin
36
+ gem "msgpack", "~> 0.5"
37
+ rescue Gem::LoadError => e
38
+ $stderr.puts "The :msgpack serialisation method requires the MessagePack gem (msgpack)."
39
+ $stderr.puts "Please install it or use another serialisation method."
40
+ raise e
41
+ end
42
+ require 'msgpack'
43
+ @cls = MessagePack
44
+ when :yaml
45
+ require 'yaml'
46
+ @cls = YAML
47
+ when :json
48
+ require 'json'
49
+ @cls = JSON
50
+ else
51
+ # marshal is always available
52
+ @cls = Marshal
53
+ end
54
+ end
55
+
56
+ # Serialise to a string
57
+ def dump(obj)
58
+ return eval("#{@cls.to_s}.dump(obj)", @binding) if @binding
59
+ return @cls.send(:dump, obj)
60
+ end
61
+
62
+ # Deserialise from a string
63
+ def load(bits)
64
+ return eval("#{@cls.to_s}.load(bits)", @binding) if @binding
65
+ return @cls.send(:load, bits)
66
+ end
67
+
68
+ # Load an object from disk
69
+ def load_file(fn)
70
+ return File.open(fn, 'r'){ |f| Marshal.load(f) } if @method == :marshal
71
+ return YAML.load_file(fn) if @method == :yaml
72
+ return File.open(fn, 'r'){ |f| MessagePack.load( f ) } if @method == :msgpack # efficientify me
73
+ return File.open(fn, 'r'){ |f| JSON.load( f.read ) } if @method == :json
74
+ end
75
+
76
+ # Write an object to disk
77
+ def dump_file(obj, fn)
78
+ return File.open(fn, 'w'){ |f| Marshal.dump(obj, f) } if @method == :marshal
79
+ return File.open(fn, 'w'){ |f| YAML.dump(obj, f) } if @method == :yaml
80
+ return File.open(fn, 'w'){ |f| MessagePack.dump(obj, f ) } if @method == :msgpack
81
+ return File.open(fn, 'w'){ |f| JSON.dump( obj, f) } if @method == :json
82
+ end
83
+ end
84
+
85
+
86
+ end
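As a rough illustration of the common `dump`/`load` interface that the class relies on, the sketch below round-trips a small record through three of the wrapped standard-library modules. It does not load LWAC itself; `round_trip` and the sample record are hypothetical names introduced only for this example.

```ruby
require 'json'
require 'yaml'

# Hypothetical helper: serialise an object with the given module and read it back.
# Marshal, JSON and YAML all respond to .dump and .load, which is what makes a
# thin wrapper class possible.
def round_trip(cls, obj)
  cls.load(cls.dump(obj))
end

# Illustrative record using only strings and integers, which all three formats keep intact.
record = { 'id' => 1, 'uri' => 'http://example.com/' }

marshal_copy = round_trip(Marshal, record)
json_copy    = round_trip(JSON, record)
yaml_copy    = round_trip(YAML, record)
```

Richer objects (symbols, custom classes) survive `Marshal` but are altered or rejected by the text formats, which is the generality trade-off the comments above describe.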
@@ -0,0 +1,114 @@
1
+ Limits and Performance
2
+ ======================
3
+ LWAC was designed to maximise throughput to the web, and as such easily stretches certain system resources. Extracting the best performance requires knowledge of some of the limits of your underlying system, as well as the architecture of LWAC. This guide should cover each of the points, and explain which need attention for which conditions.
4
+
5
+ Throughput
6
+ ----------
7
+ The system is capable of downloading around 2.5 million pages per hour per client when resource speed is not an issue. This slows to roughly 100,000 per hour per client when using real-world (year-old) URL lists. This equals roughly 10-100 million words per hour in practice.
8
+
9
+ Most speed issues are caused by the slow response of servers, for which parallelism is the only practical solution. It's worth noting that the tool can download at around 20Mbps/client in a sustained manner---this most certainly breaks netiquette and may be sufficient to overload some hosts if you have many links pointing to the same servers.
10
+
11
+ Network
12
+ -------
13
+ LWAC transfers large batches of data between its client and server tools, as well as to the web. To ensure reliability, these transfers occur at different times, but both are subject to various limitations.
14
+
15
+
16
+ ### Client-Server Communication
17
+
18
+ #### Batch Size
19
+ Increasing the size of a batch will have a number of effects.
20
+
21
+ 1. More data will be queued up in RAM on both the client and the server. The server will, at most, hold `cache_size * number_of_clients` links in memory. The client will store at most its batch size. Generally this is not an issue, as any modern system can store millions of Link objects in memory.
22
+ 2. Transfer between client and server will lock out other clients until complete, so smaller batches allow for better client load-balancing.
23
+
24
+ In my experience, sizes of 1000-10,000 are suitable for smaller corpora, and/or low client specifications. Even if your web pages are small, there is overhead in managing the cache prior to sending it to the server, so batch sizes in excess of around 20,000 start slowing down (changing the client cache policy can help this).
25
+
26
+
27
+ #### Client Competition
28
+ Clients compete with one another for server time, and follow this algorithm to do so:
29
+
30
+ 1. Start worker servers
31
+ 2. Maintain work:
32
+ * If the link pool is empty, connect to the server to get more links
33
+ * When N MB of data have been downloaded, contact the server to upload
34
+ 3. When the server has no links to give, wait and back off
35
+
36
+ The server can support an unlimited number of clients, but beyond a point they will start locking one another out and efficiency drops off. This point is highly dependent on:
37
+
38
+ * Network speeds
39
+ * The batch size, concurrency and other [client settings](client_config.html)
40
+
41
+ Since the server guarantees data consistency in the case of clients disconnecting, I recommend connecting progressively more clients until the point of maximum efficiency is reached for your setup.
42
+
43
+
44
+ ### Web connection
45
+ LWAC places significant stress on the connection to the web, and can trigger things such as DDoS protection and traffic shaping schemes. Steps should be taken to avoid this where possible, such as placing clients on different parts of the network.
46
+
47
+
48
+ #### Proxies/single points of failure
49
+ Proxies can be configured in the curl settings (`client_policy` section of the [server config](server_config.html)). However, they are subject to a number of effects that are generally undesirable for sampling (such as caching and header rewriting), and present a single point of failure which handles all of the stress.
50
+
51
+ Unfortunately there is currently no method to set a different proxy on each client (this is due in a later version).
52
+
53
+ #### DNS
54
+ DNS lookup is considered an integral part of fetching web data, and is thus repeated every time a client downloads a link. If this places undue stress on one's DNS provider, a local cache can vastly reduce outgoing traffic with relatively little risk of damaging the quality of resultant data.
55
+
56
+
57
+
58
+ File System
59
+ -----------
60
+ The filesystem is used extensively in LWAC both as backing store and as a cache. Many filesystems, especially those on virtual servers, impose significant performance overheads when dealing with small files or certain types of access. It is important that LWAC is adjusted if this is the case.
61
+
62
+ ### Server
63
+ The server's use of the filesystem is twofold:
64
+
65
+ * Corpus backing store
66
+ * Metadata database (read only beyond initial import)
67
+
68
+ #### Max Files per Dir
69
+ Many filesystems impose a fairly low limit on the number of files that will fit in one directory. Since one file per datapoint is used in a corpus, they may be spanned over many directories. See the [server config](server_config.html) for the relevant configuration properties.
70
+
71
+ #### Corpus Dir Size
72
+ Over time the corpus directory will grow very large. It's possible to copy all but the current sample out of this directory whilst the server is running, though a better policy might simply be to place the corpus on a large disk or RAID volume.
73
+
74
+
75
+ #### SQLite3 performance (pragmas)
76
+ SQLite3 is largely retained in RAM during use, but its disk access can be controlled through the use of `PRAGMA` statements. If disk or memory usage is a particular problem, these can be adjusted to specify new limits, as mentioned in the [server config](server_config.html).
77
+
78
+ ### Client
79
+ The client uses disks infrequently or never, depending on the configuration used, using a single cache file:
80
+
81
+ * Datapoint cache
82
+
83
+ #### Open Socket Limit
84
+ Most operating systems impose limits on the number of sockets that may be open at any one time. Since the client is heavily multithreaded, it is capable of exceeding these limits with relative ease.
85
+
86
+ On unix systems, the limit can be read using the command `ulimit -a`, where it is typically listed as the number of file descriptors allowed (minus one for the cache file and one for each log).
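The same limit can be read from within Ruby, which may help when choosing a thread count. This is a minimal sketch, assuming a unix-like system; `usable_sockets` is a hypothetical helper, not part of LWAC, and it only subtracts the cache file and logs mentioned above.

```ruby
# Read the soft and hard limits on open file descriptors for this process.
soft_limit, hard_limit = Process.getrlimit(Process::RLIMIT_NOFILE)

# Hypothetical helper: descriptors left for sockets after reserving one for the
# cache file and one for each log that writes to a file.
def usable_sockets(descriptor_limit, log_count)
  descriptor_limit - 1 - log_count
end

puts "Soft limit: #{soft_limit}, hard limit: #{hard_limit}"
puts "Descriptors left for sockets with two logs: #{usable_sockets(soft_limit, 2)}"
```

Other descriptors held by the process (standard streams, the server connection) are not counted here, so the real figure will be somewhat lower.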
87
+
88
+ #### Cache Filesize
89
+ The file caching system is based on a single file, which will grow to the size of the sum of all data downloaded in one batch. In testing with HTML data, this generally means about 10MB for every thousand links.
90
+
91
+ The client's cache system uses repeated `fseek` calls to look up data. If your filesystem is very bad at seeking, it may be wiser to use a memory cache instead.
92
+
93
+ Memory Usage
94
+ ------------
95
+ LWAC was designed for large samples, and as such its memory usage is minimal, static (O(1) complexity w.r.t. total corpus size), and configurable.
96
+
97
+ ### Server
98
+ The server requires enough memory to store:
99
+
100
+ * Lists of failed links (which accumulate if clients drop out during a batch, but are soon re-used). A thousand links uses under 100KB of storage in RAM.
101
+ * Datapoints currently being checked in (see `check_in_rate` in the [client config](client_config.html) to set this in MB)
102
+ * SQLite3 or MySQL cache (see above)
103
+
104
+ This means that the server should always be using less than a few hundred megabytes of RAM, and much of that is ruby/libsqlite3/libcurl.
105
+
106
+ ### Client
107
+ The client typically uses more RAM:
108
+
109
+ * Lists of links to download (max will be the batch size).
110
+ * Data downloaded from the web if using a memory cache (set `cache_file` in the [client config](client_config.html) to use a disk cache, or set `max_body_size` in the [server config](server_config.html))
111
+ * Data being accumulated for upload to the server (see `check_in_rate` in the [client config](client_config.html))
112
+ * Working data for a large number of download threads (at most equal to the number of threads multiplied by the `max_body_size`)
113
+
114
+ A client with a large batch size (tens of thousands of links), downloading large files (such as PDFs), and using a memory cache, may use gigabytes of RAM. The same client with a disk cache will use only a couple of hundred MB, mostly comprising ruby, libcurl, and marilyn/eventmachine.
@@ -0,0 +1,30 @@
1
+ Logger Configuration
2
+ ====================
3
+ LWAC uses a more flexible extension of ruby's standard log libraries and outputs in a fairly standard format. Each tool's config file contains its own logging section, though they share a common format.
4
+
5
+ Configuration
6
+ -------------
7
+
8
+ * `progname` --- The name of the program to output in the log file (printed on each line)
9
+ * `logs{}` --- A hash containing all other logs that a user wishes to use. The `LOG_NAME` below is arbitrary and used to refer to the log in summary output.
10
+ * `logs/LOG_NAME/dev` --- The device to use for this log. This can either be a filename, or STDOUT/STDERR.
11
+ * `logs/LOG_NAME/level` --- The level to log at. This is one of the symbols `:debug`, `:warn`, `:info`, `:error`, or `:fatal`
12
+
13
+
14
+ Sample Config
15
+ -------------
16
+ The configuration below outputs three logs, one to stdout for basic information whilst the program runs, one with errors only to a file, and another with basic progress to another file.
17
+
18
+ :logging:
19
+ :progname: Server
20
+ :logs:
21
+ :default:
22
+ :dev: STDOUT
23
+ :level: :info
24
+ :errors:
25
+ :dev: 'logs/server.err'
26
+ :level: :warn
27
+ :file_log:
28
+ :dev: 'logs/server.log'
29
+ :level: :info
30
+
@@ -0,0 +1,13 @@
1
+ Monitoring/Maintenance
2
+ ======================
3
+ LWAC should require little maintenance beyond initial deployment. It is heavily network-dependent, and as such its libraries should be kept up-to-date for security purposes.
4
+
5
+
6
+ Disk Space
7
+ ----------
8
+ As the corpus grows it may be necessary to move data off the server's working disk. Any samples that are not currently open may be moved whilst the server is running---this basically means all but the last entry in the corpus.
9
+
10
+ Process Monitoring
11
+ ------------------
12
+ LWAC is as prone to failure as any other long-running process, and should ideally be monitored during its sampling runs, especially if they last many months. Many tools are available for this (such as [monit](http://mmonit.com/monit/), [Ubic](https://github.com/berekuk/Ubic), [God](http://godrb.com/) or [bluepill](https://github.com/arya/bluepill)), and any should suffice in monitoring both client and server processes.
13
+
@@ -0,0 +1,7 @@
1
+ -- Describe LINKS
2
+ CREATE TABLE links (
3
+ id INT NOT NULL AUTO_INCREMENT,
4
+ uri VARCHAR(1024) NOT NULL,
5
+ PRIMARY KEY ( id )
6
+ );
7
+
@@ -0,0 +1,5 @@
1
+ -- Describe LINKS
2
+ CREATE TABLE links (
3
+ "id" INTEGER PRIMARY KEY,
4
+ "uri" TEXT NOT NULL
5
+ );
@@ -0,0 +1,242 @@
1
+ Server Configuration
2
+ ====================
3
+ The server's configuration file is a valid ruby Hash object, expressed in YAML. As such, it starts with a single line containing three dashes, and follows a key-value structure throughout. It may loosely be separated into a number of sections, forming the top level of this tree.
4
+
5
+ For help interpreting this document, see [Reading Config Documentation](config_docs.html)
6
+
7
+ Storage
8
+ -------
9
+ Storage is defined by the `/storage/` key, and contains details on the corpus and its metadata database.
10
+
11
+ ### Corpus Details
12
+
13
+ * `root` --- The root directory of the corpus, relative to the server binary.
14
+ * `state_file` --- The name of the file where server state will be stored. This contains a list of incomplete links for the current sample.
15
+ * `sample_subdir` --- The name of the directory within the corpus where samples will be stored.
16
+ * `sample_filename` --- The filename where summary details on a particular sample are stored.
17
+ * `files_per_dir` --- How many files to store in each directory below the `sample_subdir`. This is set to avoid overloading filesystems that have finite inode tables.
18
+ * `serialiser` --- The serialisation method used to write to disk. Supported methods are `:marshal`, `:yaml` or `:json`. `:marshal` is fastest and recommended unless you desperately need to access the corpus using languages other than ruby.
19
+
20
+ For example:
21
+
22
+ :storage:
23
+ :root: corpus
24
+ :state_file: state
25
+ :sample_subdir: samples
26
+ :sample_filename: sample
27
+ :files_per_dir: 1000
28
+ :serialiser: :marshal
29
+ :database:
30
+ ...
31
+
32
+
33
+ ### Database Details
34
+ The database is configured in `/storage/database` and consists of two main blocks:
35
+
36
+ * `engine` --- Either `:sqlite` for the SQLite3 database engine or `:mysql` for mysql. You must install the appropriate dependency for the engine you select.
37
+ * `engine_conf{}` --- Configuration parameters for a given engine. See below for examples of each.
38
+ * `table` --- The table name where links to be downloaded are stored
39
+ * `fields` --- Contains information on the links table's fields
40
+ * `fields/id` --- The field name containing the link ID
41
+ * `fields/uri` --- The field name containing the URI to request from a remote server.
42
+
43
+ For example:
44
+
45
+ :table: links
46
+ :fields:
47
+ :id: id
48
+ :uri: uri
49
+ :engine: :mysql
50
+ :engine_conf:
51
+ ...
52
+
53
+
54
+ #### SQLite3
55
+ The SQLite3 engine is rather heavily optimised for read speed from the database, and is recommended if you want speed or have a smaller corpus. Its configuration parameters are thus:
56
+
57
+ * `database/transaction_limit` --- How many queries to run per transaction. Larger numbers speed up access at the expense of data security.
58
+ * `database/pragma{}` --- A key-value list of pragma statements to configure the SQLite3 database. These configure the database, and take the form of a list of key-value strings. A full list of SQLite3 pragma statements is available on [their website](http://www.sqlite.org/pragma.html)
59
+ * `transaction_limit` --- The number of calls to make per transaction. May provide a minor speed increase if large, but most database access is read only anyway.
60
+ * `filename` --- The position of the database file, relative to `pwd`
61
+
62
+ For example:
63
+
64
+ :engine: :sqlite
65
+ :engine_conf:
66
+ :filename: corpus/links.db
67
+ :transaction_limit: 100
68
+ :pragma: # Custom pragmas. See SQLite's docs.
69
+ "locking_mode": "EXCLUSIVE" # Do not allow others to access the db when the server is running
70
+ "cache_size": 20000 # Allow a large cache
71
+ "synchronous": 0 # Asynchronous operations speed things up a lot
72
+ "temp_store": 2 # Use temp storage
73
+
74
+ #### MySQL
75
+ The MySQL engine's configuration parameters are largely defined by the gem. Full documentation is available on the [github page](https://github.com/brianmario/mysql2), and common parameters are listed below:
76
+
77
+ * `username` --- The username to log in with
78
+ * `password` --- The password to use when connecting to the mysql server
79
+ * `host` --- The hostname at which the mysql server is listening (omit this if using a socket)
80
+ * `port` --- The port on which the mysql server is listening (omit this if using a socket)
81
+ * `socket` --- The filepath of a socket over which to talk to the mysql server
82
+ * `database` --- The name of the database where the links table will be stored
83
+ * `encoding` --- The character encoding to use. 'utf8' is _strongly_ recommended.
84
+
85
+ For example:
86
+
87
+ :engine: :mysql
88
+ :engine_conf: # Options from https://github.com/brianmario/mysql2
89
+ :username: lwac
90
+ :password: lwacpass
91
+ # :host: localhost
92
+ # :port: 3345
93
+ :socket: /var/run/mysqld/mysqld.sock
94
+ :database: lwac
95
+ :encoding: 'utf8'
96
+ :read_timeout: 10 #seconds
97
+ :write_timeout: 10 #seconds
98
+ :connect_timeout: 10 #seconds
99
+ :reconnect: true #/false
100
+ :local_infile: true #/false
101
+
102
+
103
+ Sampling Policy
104
+ ---------------
105
+ The sampling policy used by the server is defined by three parameters, *count*, *duration* and *alignment*.
106
+
107
+ * `sample_limit` --- Sample at most this number of samples. Note that the IDs start from 0, so the last sample will have the ID `sample_limit - 1`. The server will quit when trying to open the `n+1`th sample.
108
+ * `sample_time` --- A sample's *duration* is the minimum time a sample may take. For example, if this is a daily sample, it should be set to 86400 (the number of seconds in a day) in order to sample once at midnight each day.
109
+ * `sample_alignment` --- A sample's *alignment* defines whereabouts within each sample time the sample may start. Using the above example, setting this to 7200 would cause samples to begin at 2am each day.
110
+
111
+ Note that if a sample takes more than `sample_time` to run, it will overlap and cancel the next sample. This is to prevent misalignment with the other datapoints in the sample, and is preferable in many analyses to a series of messily-timed run-on samples. The code that selects the samples thus follows this algorithm:
112
+
113
+ * Round down to the last valid sample time (`time = Time.at(((Time.now.to_i / @config[:sample_time]).floor * @config[:sample_time]) + @config[:sample_alignment])`)
114
+ * While `time < Time.now`
115
+ * increment the prospective time by the `sample_time`
116
+
117
+ If you wish to edit the sample computation algorithm, it resides in `lib/server/consistency_manager.rb`, under the method `compute_next_sample_time`.
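The rounding and increment steps listed above can be sketched as a standalone function. This is an illustration only, not the body of `compute_next_sample_time`; `next_sample_time` is a hypothetical name, and the current time is passed in as an argument so that the result is reproducible.

```ruby
# Hypothetical stand-in for the server's scheduling logic.
# now              - the current Time
# sample_time      - seconds between sample start times
# sample_alignment - offset in seconds within each sample_time window
def next_sample_time(now, sample_time, sample_alignment)
  # Round down to the start of the current window, then add the alignment.
  # Integer division already floors for non-negative values.
  time = Time.at((now.to_i / sample_time) * sample_time + sample_alignment)

  # If that moment has already passed, step forward one window at a time.
  time += sample_time while time < now
  time
end

# Daily samples, aligned two hours into each window.
upcoming = next_sample_time(Time.at(100_000), 86_400, 7_200)
puts upcoming.to_i
```

Because the calculation works on seconds since the Unix epoch, the windows in this sketch are aligned to UTC rather than to local time.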
118
+
119
+ For example:
120
+
121
+ :sampling_policy:
122
+ :sample_limit: 0
123
+ :sample_time: 60
124
+ :sample_alignment: 0
125
+
126
+
127
+
128
+ Client Policy
129
+ -------------
130
+ This section describes the properties each client must inherit from its server. It describes things such as how the client appears to external websites, and how it normalises and packages its data before upload to the server.
131
+
132
+ * `dry_run` --- Boolean. Set to `true` to disable web access on the client, so that it samples empty datapoints.
133
+ * `max_body_size` --- Stop downloading when this number of bytes have been downloaded. Used to prevent aberrantly large files from filling RAM or disk storage.
134
+ * `fix_encoding` --- Boolean. Should the encoding be normalised to `target_encoding`?
135
+ * `target_encoding` --- The name of an encoding to normalise to. Default is 'UTF-8', but anything supported by Ruby's String#encode method will work
136
+ * `encoding_options` --- Options hash passed to String#encode, may include:
137
+ * `encoding_options\invalid` --- If value is `:replace`, replaces invalid byte sequences with the `:replace` char
138
+ * `encoding_options\undef` --- If value is `:replace` , replaces undefined chars with the `:replace` char
139
+ * `encoding_options\replace` --- The char to use in replacement, defaults to uFFFD for unicode and '?' for other targets
140
+ * `encoding_options\fallback{}` --- A key-value table of characters to replace
141
+ * `encoding_options\xml` --- Either `:text` or `:attr`. If `:text`, replaces things with hex entities, if `:attr`, it also quotes the entities "&amp;quot;"
142
+ * `encoding_options\cr_newline` --- Boolean. Replaces LF(\n) with CR(\r) if true
143
+ * `encoding_options\crlf_newline` --- Boolean. Replaces LF(\n) with CRLF(\r\n) if true
144
+ * `encoding_options\universal_newline` --- Boolean. Replaces CRLF(\r\n) and CR(\r) with LF(\n) if true
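To show what these options do when handed to `String#encode`, here is a minimal sketch. `normalise_body` and `ENCODING_OPTIONS` are hypothetical names for illustration, and the input is treated as raw binary data, which is an assumption about how a downloaded body might arrive rather than a description of the client's code.

```ruby
# Options mirroring the encoding_options keys described above.
ENCODING_OPTIONS = {
  invalid: :replace,       # replace invalid byte sequences
  undef: :replace,         # replace characters with no equivalent in the target
  universal_newline: true  # convert CRLF and CR to LF
}.freeze

# Hypothetical helper: normalise a body to the target encoding.
def normalise_body(body, target_encoding, options)
  body.encode(target_encoding, **options)
end

# A binary string containing a Windows line ending and a byte above 0x7F.
raw = "line one\r\nline two \xFF".b
clean = normalise_body(raw, 'UTF-8', ENCODING_OPTIONS)
puts clean.inspect
```

With no explicit `:replace` character, Ruby falls back to U+FFFD for Unicode targets, matching the default described in the list.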
145
+
146
+ Clients use cURL to download links, using the `curb` library, and may thus be configured with custom request parameters and other options. The options below are applied by setting properties on the cURL object, and as such anything may be provided as a key that is in the [curb documentation](https://rubygems.org/gems/curb), such as 'verbose' or control over SSL. By default, SSL options are overridden to accept connections without verifying certificates.
147
+
148
+ * `curl_workers{}` --- Defines properties of the cURL workers that contact web servers.
149
+ * `max_redirects` --- How many HTTP redirects should be followed before giving up?
150
+ * `useragent` --- The user agent to show to the remote server
151
+ * `follow_location` --- Should the agent follow location headers?
152
+ * `timeout` --- Overall timeout, in seconds, for the whole request.
153
+ * `connect_timeout` --- TCP connect timeout
154
+ * `dns_cache_timeout` --- How long, in seconds, cURL keeps DNS lookups in its cache
155
+
156
+ MIME type handling can be controlled in a rudimentary manner to prevent superfluous saving of binary data. This is controlled using the `Content-type` field of the response headers, and takes the form of a whitelist or blacklist based on regular expressions. Any link that is 'denied' has its body content wiped and a flag set in its datapoint metadata, but otherwise remains intact. It is configured by the structure called `:mimes`.
157
+
158
+ * `mimes{}` --- Defines mime-type acceptance handling
159
+ * `policy` --- Either `:whitelist` to only accept the items matching the list, or `:blacklist` to only decline the items on the list.
160
+ * `ignore_case` --- Should the regexp matching be case-insensitive?
161
+ * `list[]` --- A list of regular expressions. If one matches, depending on white/blacklist configuration, the body will be blanked.
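The whitelist/blacklist decision can be sketched as below. This is an assumption about how such a filter might be written, not the client's actual code; `body_allowed?` is a hypothetical name, and the patterns are given as strings the way they would appear in the YAML list.

```ruby
# Hypothetical helper: decide whether a response body should be kept.
# content_type - value of the Content-type response header (may be nil)
# policy       - :whitelist or :blacklist
# patterns     - list of regular expression source strings
# ignore_case  - whether matching is case-insensitive
def body_allowed?(content_type, policy, patterns, ignore_case)
  flags = ignore_case ? Regexp::IGNORECASE : 0
  matched = patterns.any? { |pattern| Regexp.new(pattern, flags).match?(content_type.to_s) }

  # A whitelist keeps only matching types; a blacklist keeps everything else.
  policy == :whitelist ? matched : !matched
end

text_only = ['^text\/?.*$']
puts body_allowed?('text/html; charset=utf-8', :whitelist, text_only, true)
puts body_allowed?('application/pdf', :whitelist, text_only, true)
```

In this sketch a missing header is treated as an empty string, so it never matches a whitelist pattern that requires some text.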
162
+
163
+ For example:
164
+
165
+ :client_policy:
166
+ :dry_run: false
167
+ :fix_encoding: true
168
+ :target_encoding: UTF-8
169
+ :encoding_options:
170
+ :invalid: :replace
171
+ :undef: :replace
172
+ #:replace: '?'
173
+ #:fallback:
174
+ #'from': 'to'
175
+ #'from2': 'to2'
176
+ #:xml: :attr
177
+ #:cr_newline: true
178
+ #:crlf_newline:
179
+ :universal_newline: true
180
+ :max_body_size: 20971520
181
+ :mimes:
182
+ :policy: :whitelist
183
+ :ignore_case: true
184
+ :list:
185
+ - ^text\/?.*$ # text-only mimes
186
+ #- ^.+$ # anything with a valid content-type
187
+ :curl_workers:
188
+ :max_redirects: 5
189
+ :useragent: ! '"Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US; rv:1.9.2.11) Gecko/20101012 Firefox/3.6.11"'
190
+ :enable_cookies: true
191
+ # :headers: "Header: String"
192
+ :verbose: false
193
+ :follow_location: true
194
+ :timeout: 60
195
+ :connect_timeout: 10
196
+ :dns_cache_timeout: 10
197
+ :ftp_response_timeout: 10
198
+
199
+ Client Management
200
+ -----------------
201
+ The server's responsibilities for managing clients as they connect and process work mean that it must present meaningful time limits on these connections (lest clients crash or misbehave). These settings govern the rate at which clients are presumed to work, before the server starts re-assigning their work to other clients.
202
+
203
+
204
+ The client is given only a finite time to complete its work before the server will assume it has died and re-assign its links elsewhere. This is controlled by two parameters---one for clients that have not been seen before, and one that modifies the dynamic timeout computed by the server.
205
+
206
+ * `time_per_link` --- How long, in seconds, a new client is given per link to download a batch. Clients typically download fairly fast, so this should be quite low (below 5).
207
+ * `dynamic_time_overestimate` --- How much to multiply the client's last performance by when computing a timeout, i.e. a value of "1.2" will give 20% overhead.
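One way to read the two parameters together is sketched below. The exact formula the server uses is not given here, so this is an assumption: `batch_timeout` is a hypothetical name, and the client's last performance is modelled as seconds taken per link.

```ruby
# Hypothetical helper: how long a client is allowed for a batch, in seconds.
# batch_size            - number of links handed to the client
# time_per_link         - allowance per link for a client not seen before
# overestimate          - multiplier applied to a known client's last performance
# last_seconds_per_link - measured rate from the client's previous batch, if any
def batch_timeout(batch_size, time_per_link, overestimate, last_seconds_per_link = nil)
  per_link = if last_seconds_per_link
               last_seconds_per_link * overestimate
             else
               time_per_link
             end
  batch_size * per_link
end

puts batch_timeout(1000, 5, 1.3)       # new client
puts batch_timeout(1000, 5, 1.3, 0.5)  # client that averaged 0.5s per link last time
```

Under this reading, a fast client quickly earns a much tighter deadline than the static allowance, so its links are re-assigned sooner if it stalls.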
208
+
209
+ The next two parameters define how long clients are told to wait when they contact the server but find no work available (for example, no sample is open right now).
210
+
211
+ * `empty_client_backoff` --- If no links are available but a sample is open, tell clients to retry after this time.
212
+ * `delay_overestimate` --- When a sample is closed, clients are told to wait until after the sample opening time. A small amount is added to this to prevent clients from hitting the final seconds before the sample is open. This should be less than `empty_client_backoff` for it to make any difference. I recommend below 10 seconds.
213
+
214
+ For example:
215
+
216
+ :client_management:
217
+ :time_per_link: 5
218
+ :dynamic_time_overestimate: 1.3
219
+ :empty_client_backoff: 60
220
+ :delay_overestimate: 10
221
+
222
+
223
+ Server
224
+ ------
225
+ These settings govern the network properties of the server, as used for data transfer to and from clients. Settings here are passed to [SimpleRPC](http://stephenwattam.com/projects/simplerpc/), for which full documentation is available in the ruby gem. Only the salient configuration options are listed here.
226
+
227
+ * `hostname` --- The hostname or IP address of an interface on which to listen
228
+ * `port` --- The port on which to listen for this interface
229
+ * `password` --- Optional. The password to use for auth (must match client config)
230
+ * `secret` --- Optional. The encryption key to use when sending the password (must match client config)
231
+
232
+ For example:
233
+
234
+ :server:
235
+ :hostname:
236
+ :port: 27401
237
+ :password: lwacpass
238
+ :secret: egrniognhre89n34ifnui4n8gf490
239
+
240
+ Logging
241
+ -------
242
+ The logging system is the same for client, server, and export tools and shares a configuration with them. For details, see [configuring logging](log_config.html)