lwac 0.2.0
- checksums.yaml +7 -0
- data/LICENSE +70 -0
- data/README.md +31 -0
- data/bin/lwac +132 -0
- data/client_config.md +71 -0
- data/concepts.md +70 -0
- data/config_docs.md +40 -0
- data/doc/compile.rb +52 -0
- data/doc/template.rhtml +145 -0
- data/example_config/client.jv.yml +33 -0
- data/example_config/client.yml +34 -0
- data/example_config/export.yml +70 -0
- data/example_config/import.yml +19 -0
- data/example_config/server.yml +97 -0
- data/export_config.md +448 -0
- data/import_config.md +29 -0
- data/index.md +49 -0
- data/install.md +29 -0
- data/lib/lwac.rb +17 -0
- data/lib/lwac/client.rb +354 -0
- data/lib/lwac/client/file_cache.rb +160 -0
- data/lib/lwac/client/storage.rb +69 -0
- data/lib/lwac/export.rb +362 -0
- data/lib/lwac/export/format.rb +310 -0
- data/lib/lwac/export/key_value_format.rb +132 -0
- data/lib/lwac/export/resources.rb +82 -0
- data/lib/lwac/import.rb +152 -0
- data/lib/lwac/server.rb +294 -0
- data/lib/lwac/server/consistency_manager.rb +265 -0
- data/lib/lwac/server/db_conn.rb +376 -0
- data/lib/lwac/server/storage_manager.rb +290 -0
- data/lib/lwac/shared/data_types.rb +283 -0
- data/lib/lwac/shared/identity.rb +44 -0
- data/lib/lwac/shared/launch_tools.rb +87 -0
- data/lib/lwac/shared/multilog.rb +158 -0
- data/lib/lwac/shared/serialiser.rb +86 -0
- data/limits.md +114 -0
- data/log_config.md +30 -0
- data/monitoring.md +13 -0
- data/resources/schemata/mysql/links.sql +7 -0
- data/resources/schemata/sqlite/links.sql +5 -0
- data/server_config.md +242 -0
- data/tools.md +89 -0
- data/workflows.md +39 -0
- metadata +140 -0
data/lib/lwac/shared/serialiser.rb
ADDED
@@ -0,0 +1,86 @@

require 'yaml'


module LWAC


  # This class wraps four possible serialisation systems, providing a common interface to them all.
  #
  # This is largely ripped from SimpleRPC, with some minor modifications.
  class Serialiser

    SUPPORTED_METHODS = %w{marshal json msgpack yaml}.map { |s| s.to_sym }

    # Create a new Serialiser with the given method.  Optionally provide a binding to have
    # the serialisation method execute within another context, i.e. for it to pick up
    # on various libraries and classes (though this will impact performance somewhat).
    #
    # Supported methods are:
    #
    # :marshal:: Use ruby's Marshal system.  A good mix of speed and generality.
    # :yaml:: Use YAML.  Very slow but very general.
    # :msgpack:: Use the MessagePack gem.  Very fast but not very general (limited data format support).
    # :json:: Use JSON.  Also slow, but better for interoperability than YAML.
    #
    def initialize(method = :marshal, binding = nil)
      @method  = method
      @binding = binding
      raise "Unrecognised serialisation method" if not SUPPORTED_METHODS.include?(method)

      # Require prerequisites and handle msgpack not being installed.
      case method
      when :msgpack
        begin
          gem "msgpack", "~> 0.5"
        rescue Gem::LoadError => e
          $stderr.puts "The :msgpack serialisation method requires the MessagePack gem (msgpack)."
          $stderr.puts "Please install it or use another serialisation method."
          raise e
        end
        require 'msgpack'
        @cls = MessagePack
      when :yaml
        require 'yaml'
        @cls = YAML
      when :json
        require 'json'
        @cls = JSON
      else
        # Marshal is always available
        @cls = Marshal
      end
    end

    # Serialise to a string
    def dump(obj)
      return eval("#{@cls.to_s}.dump(obj)", @binding) if @binding
      return @cls.send(:dump, obj)
    end

    # Deserialise from a string
    def load(bits)
      return eval("#{@cls.to_s}.load(bits)", @binding) if @binding
      return @cls.send(:load, bits)
    end

    # Load an object from disk
    def load_file(fn)
      return File.open(fn, 'r') { |f| Marshal.load(f) }     if @method == :marshal
      return YAML.load_file(fn)                             if @method == :yaml
      return File.open(fn, 'r') { |f| MessagePack.load(f) } if @method == :msgpack # efficientify me
      return File.open(fn, 'r') { |f| JSON.load(f.read) }   if @method == :json
    end

    # Write an object to disk
    def dump_file(obj, fn)
      return File.open(fn, 'w') { |f| Marshal.dump(obj, f) }     if @method == :marshal
      return File.open(fn, 'w') { |f| YAML.dump(obj, f) }        if @method == :yaml
      return File.open(fn, 'w') { |f| MessagePack.dump(obj, f) } if @method == :msgpack
      return File.open(fn, 'w') { |f| JSON.dump(obj, f) }        if @method == :json
    end
  end

end
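As a quick round-trip check of the common dump/load interface this class wraps, the underlying backends can be exercised directly (a standalone sketch using the stdlib serialisers, not LWAC's API):

```ruby
require 'json'

obj = { 'id' => 42, 'uri' => 'http://example.com' }

# Marshal and JSON both expose the dump/load pair that Serialiser
# dispatches to via @cls.send(...)
[Marshal, JSON].each do |cls|
  str = cls.send(:dump, obj)
  raise 'round-trip failed' unless cls.send(:load, str) == obj
end
puts 'round-trips ok'
```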
data/limits.md
ADDED
@@ -0,0 +1,114 @@

Limits and Performance
======================
LWAC was designed to maximise throughput to the web, and as such easily stretches certain system resources. Extracting the best performance requires knowledge of some of the limits of your underlying system, as well as of the architecture of LWAC. This guide covers each of these points, and explains which need attention under which conditions.

Throughput
----------
The system is capable of downloading around 2.5 million pages per hour per client when resource speed is not an issue. This slows to roughly 100,000 per hour per client when using real-world (year-old) URL lists. This equals roughly 10-100 million words per hour in practice.

Most speed issues are caused by the slow response of servers, for which parallelism is the only practical solution. It's worth noting that the tool can download at around 20Mbps per client in a sustained manner---this most certainly breaks netiquette, and may be sufficient to overload some hosts if you have many links pointing to the same servers.

Network
-------
LWAC transfers large batches of data between its client and server tools, as well as to and from the web. To ensure reliability, these transfers occur at different times; however, both are subject to various limitations.


### Client-Server Communication

#### Batch Size
Increasing the size of a batch will have a number of effects:

1. More data will be queued up in RAM on both the client and the server. The server will hold at most `cache_size * number_of_clients` links in memory. The client will store at most its batch size. Generally this is not an issue, as any modern system can store millions of Link objects in memory.
2. Transfer between client and server will lock out other clients until complete, so smaller batches allow for better client load-balancing.

In my experience, sizes of 1,000-10,000 are suitable for smaller corpora and/or low client specifications. Even if your web pages are small, there is overhead in managing the cache prior to sending it to the server, so batch sizes in excess of around 20,000 start slowing down (changing the client cache policy can help with this).


#### Client Competition
Clients compete with one another for server time, and follow this algorithm to do so:

1. Start workers
2. Maintain work:
   * If the link pool is empty, connect to the server to get more links
   * When N MB of data have been downloaded, contact the server to upload
3. When the server has no links to give, wait and back off

The server can support an unlimited number of clients, but beyond a point they will start locking one another out and efficiency drops off. This point is highly dependent on:

* Network speeds
* The batch size, concurrency, and other [client settings](client_config.html)

Since the server guarantees data consistency when clients disconnect, I recommend connecting progressively more clients until the point of maximum efficiency is reached for your setup.


### Web connection
LWAC places significant stress on the connection to the web, and can trigger things such as DDOS protection and traffic-shaping schemes. Steps should be taken to avoid this where possible, such as placing clients on different parts of the network.


#### Proxies/single points of failure
Proxies can be configured in the curl settings (`client_policy` section of the [server config](server_config.html)); however, they are subject to a number of effects that are generally undesirable for sampling (such as caching and header rewriting), and present a single point of failure which handles all of the stress.

Unfortunately there is currently no method to set a different proxy on each client (this is due in a later version).

#### DNS
DNS lookup is considered an integral part of fetching web data, and is thus repeated every time a client downloads a link. If this places undue stress on your DNS provider, a local cache can vastly reduce outgoing traffic with relatively little risk of damaging the quality of the resultant data.



File System
-----------
The filesystem is used extensively in LWAC, both as backing store and as a cache. Many filesystems, especially those on virtual servers, impose significant performance overheads when dealing with small files or certain types of access. It is important that LWAC is adjusted if this is the case.

### Server
The server's use of the filesystem is twofold:

* Corpus backing store
* Metadata database (read-only beyond initial import)

#### Max Files per Dir
Many filesystems impose a fairly low limit on the number of files that will fit in one directory. Since one file per datapoint is used in a corpus, they may be spanned over many directories. See the [server config](server_config.html) for the relevant configuration properties.
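One common way to span files over subdirectories is integer division on the datapoint ID. This is an illustrative sketch only (the path layout below is hypothetical; LWAC's actual scheme is governed by the server config):

```ruby
# Bucket datapoint files by ID so that no directory ever holds
# more than files_per_dir entries.
files_per_dir = 1000
id            = 123_456
subdir        = id / files_per_dir         # integer division picks the bucket
path          = format('samples/%d/%d', subdir, id)
puts path  # => samples/123/123456
```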

#### Corpus Dir Size
Over time the corpus directory will grow very large. It's possible to copy all but the current sample out of this directory whilst the server is running, though a better policy might simply be to place the corpus on a large disk or RAID volume.


#### SQLite3 performance (pragmas)
SQLite3 is largely retained in RAM during use; however, its disk access can be controlled through the use of `PRAGMA` statements. If disk or memory usage is a particular problem, these can be adjusted to specify new limits, as mentioned in the [server config](server_config.html).

### Client
The client uses disks infrequently or never, depending on the configuration used, keeping a single cache file:

* Datapoint cache

#### Open Socket Limit
Most operating systems impose limits on the number of sockets that may be open at any one time. Since the client is heavily multithreaded, it is capable of exceeding these limits with relative ease.

On unix systems, the limit can be read using the command `ulimit -a`, where it is typically listed as the number of file descriptors allowed (minus one for the cache file and one for each log).
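The same figure is visible from Ruby itself via `Process.getrlimit` (a quick standalone check; `:NOFILE` is the file-descriptor resource):

```ruby
# The soft limit is what currently applies to the process; the hard
# limit is the ceiling an unprivileged process may raise it to.
soft, hard = Process.getrlimit(:NOFILE)
puts "fd soft limit: #{soft}, hard limit: #{hard}"
```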

#### Cache Filesize
The file caching system is based on a single file, which will grow to the size of the sum of all data downloaded in one batch. In testing with HTML data, this generally means about 10MB for every thousand links.

The client's cache system uses repeated `fseek` calls to look up data. If your filesystem is very bad at seeking, it may be wiser to use a memory cache instead.
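The seek-based lookup pattern works like this in miniature (a standalone sketch, not the client's actual cache code):

```ruby
require 'tempfile'

# Write three fixed-width 'records', then seek straight to the second
# instead of reading the file from the start.
cache = Tempfile.new('lwac_cache')
cache.write('AAAABBBBCCCC')
cache.flush

cache.seek(4)            # byte offset of record 1 (0-indexed)
record = cache.read(4)
puts record              # => BBBB
cache.close!
```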

Memory Usage
------------
LWAC was designed for large samples, and as such its memory usage is minimal, static (O(1) complexity w.r.t. total corpus size), and configurable.

### Server
The server requires enough memory to store:

* Lists of failed links (which accumulate if clients drop out during a batch, but are soon re-used). A thousand links use under 100KB of RAM.
* Datapoints currently being checked in (see `check_in_rate` in the [client config](client_config.html) to set this in MB)
* The SQLite3 or MySQL cache (see above)

This means that the server should always be using less than a few hundred megabytes of RAM, much of which is ruby/libsqlite3/libcurl.

### Client
The client typically uses more RAM:

* Lists of links to download (at most the batch size).
* Data downloaded from the web, if using a memory cache (set `cache_file` in the [client config](client_config.html) to use a disk cache, or set `max_body_size` in the [server config](server_config.html))
* Data being accumulated for upload to the server (see `check_in_rate` in the [client config](client_config.html))
* Working data for a large number of download threads (at most equal to the number of threads multiplied by the `max_body_size`)

A client with a large batch size (tens of thousands of links), downloading large files (such as PDFs), and using a memory cache may use gigabytes of RAM. The same client with a disk cache will use only a couple of hundred MB, mostly comprising ruby, libcurl, and marilyn/eventmachine.
data/log_config.md
ADDED
@@ -0,0 +1,30 @@

Logger Configuration
====================
LWAC uses a flexible extension of ruby's standard log library and outputs in a fairly standard format. Each tool's config file contains its own logging section, though they share a common format.

Configuration
-------------

* `progname` --- The name of the program to output in the log file (printed on each line)
* `logs{}` --- A hash containing all logs that the user wishes to use. The `LOG_NAME` below is arbitrary, and is used to refer to the log in summary output.
* `logs/LOG_NAME/dev` --- The device to use for this log. This can be either a filename, or STDOUT/STDERR.
* `logs/LOG_NAME/level` --- The level to log at. This is one of the symbols `:debug`, `:info`, `:warn`, `:error`, or `:fatal`


Sample Config
-------------
The configuration below outputs three logs: one to stdout for basic information whilst the program runs, one with errors only to a file, and another with basic progress to another file.

    :logging:
      :progname: Server
      :logs:
        :default:
          :dev: STDOUT
          :level: :info
        :errors:
          :dev: 'logs/server.err'
          :level: :warn
        :file_log:
          :dev: 'logs/server.log'
          :level: :info
data/monitoring.md
ADDED
@@ -0,0 +1,13 @@

Monitoring/Maintenance
======================
LWAC should require little maintenance beyond initial deployment. It is heavily network-dependent, and as such its libraries should be kept up-to-date for security purposes.


Disk Space
----------
As the corpus grows, it may be necessary to move data off the server's working disk. Any samples that are not currently open may be moved whilst the server is running---in practice, this means all but the last entry in the corpus.

Process Monitoring
------------------
LWAC is as prone to failure as any other long-running process, and should ideally be monitored during its sampling runs, especially if they last many months. Many tools are available for this (such as [monit](http://mmonit.com/monit/), [Ubic](https://github.com/berekuk/Ubic), [God](http://godrb.com/) or [bluepill](https://github.com/arya/bluepill)), and any should suffice for monitoring both client and server processes.
data/server_config.md
ADDED
@@ -0,0 +1,242 @@

Server Configuration
====================
The server's configuration file is a valid ruby Hash object, expressed in YAML. As such, it starts with a single line containing three dashes, and follows a key-value structure throughout. It may loosely be separated into a number of sections, forming the top level of this tree.

For help interpreting this document, see [Reading Config Documentation](config_docs.html)

Storage
-------
Storage is defined by the `/storage/` key, and contains details of the corpus and its metadata database.

### Corpus Details

* `root` --- The root directory of the corpus, relative to the server binary.
* `state_file` --- The name of the file where server state will be stored. This contains a list of incomplete links for the current sample.
* `sample_subdir` --- The name of the directory within the corpus where samples will be stored.
* `sample_filename` --- The filename where summary details of a particular sample are stored.
* `files_per_dir` --- How many files to store in each directory below the `sample_subdir`. This is set to avoid overloading filesystems that have finite inode tables.
* `serialiser` --- The serialisation method used to write to disk. Supported methods are `:marshal`, `:yaml` or `:json`. `:marshal` is fastest, and recommended unless you desperately need to access the corpus using languages other than ruby.

For example:

    :storage:
      :root: corpus
      :state_file: state
      :sample_subdir: samples
      :sample_filename: sample
      :files_per_dir: 1000
      :serialiser: :marshal
      :database:
        ...


### Database Details
The database is configured in `/storage/database`, and consists of two main blocks (`engine` and `engine_conf{}`) plus some field metadata:

* `engine` --- Either `:sqlite` for the SQLite3 database engine, or `:mysql` for MySQL. You must install the appropriate dependency for the engine you select.
* `engine_conf{}` --- Configuration parameters for the given engine. See below for examples of each.
* `table` --- The table name where links to be downloaded are stored
* `fields` --- Contains information on the links table's fields
* `fields/id` --- The field name containing the link ID
* `fields/uri` --- The field name containing the URI to request from a remote server.

For example:

    :table: links
    :fields:
      :id: id
      :uri: uri
    :engine: :mysql
    :engine_conf:
      ...


#### SQLite3
The SQLite3 engine is rather heavily optimised for read speed from the database, and is recommended if you want speed or have a smaller corpus. Its configuration parameters are:

* `filename` --- The location of the database file, relative to `pwd`
* `transaction_limit` --- How many queries to run per transaction. Larger numbers may provide a minor speed increase at the expense of data security, though most database access is read-only anyway.
* `pragma{}` --- A key-value list of pragma statements used to configure the SQLite3 database, taking the form of key-value strings. A full list of SQLite3 pragma statements is available on [their website](http://www.sqlite.org/pragma.html)

For example:

    :engine: :sqlite
    :engine_conf:
      :filename: corpus/links.db
      :transaction_limit: 100
      :pragma:   # Custom pragmas.  See SQLite's docs.
        "locking_mode": "EXCLUSIVE"   # Do not allow others to access the db when the server is running
        "cache_size": 20000           # Allow a large cache
        "synchronous": 0              # Asynchronous operations speed things up a lot
        "temp_store": 2               # Use temp storage

#### MySQL
The MySQL engine's configuration parameters are largely defined by the mysql2 gem. Full documentation is available on its [github page](https://github.com/brianmario/mysql2), and common parameters are listed below:

* `username` --- The username to log in with
* `password` --- The password to use when connecting to the mysql server
* `host` --- The hostname at which the mysql server is listening (omit this if using a socket)
* `port` --- The port on which the mysql server is listening (omit this if using a socket)
* `socket` --- The filepath of a socket over which to talk to the mysql server
* `database` --- The name of the database where the links table will be stored
* `encoding` --- The character encoding to use. 'utf8' is _strongly_ recommended.

For example:

    :engine: :mysql
    :engine_conf:   # Options from https://github.com/brianmario/mysql2
      :username: lwac
      :password: lwacpass
      # :host: localhost
      # :port: 3345
      :socket: /var/run/mysqld/mysqld.sock
      :database: lwac
      :encoding: 'utf8'
      :read_timeout: 10      # seconds
      :write_timeout: 10     # seconds
      :connect_timeout: 10   # seconds
      :reconnect: true       # /false
      :local_infile: true    # /false


Sampling Policy
---------------
The sampling policy used by the server is defined by three parameters: *count*, *duration* and *alignment*.

* `sample_limit` --- Collect at most this number of samples. Note that IDs start from 0, so the last sample will have the ID `sample_limit - 1`. The server will quit when trying to open the `n+1`th sample.
* `sample_time` --- A sample's *duration* is the minimum time a sample may take. For example, for a daily sample this should be set to 86400 (the number of seconds in a day) in order to sample once at midnight each day.
* `sample_alignment` --- A sample's *alignment* defines whereabouts within each sample period the sample may start. Using the above example, setting this to 7200 would cause samples to begin at 2am each day.

Note that if a sample takes more than `sample_time` to run, it will overlap and cancel the next sample. This prevents misalignment with the other datapoints in the sample, and is preferable in many analyses to a series of messily-timed run-on samples. The code that selects the samples thus follows this algorithm:

* Round down to the last valid sample time (`time = Time.at(((Time.now.to_i / @config[:sample_time]).floor * @config[:sample_time]) + @config[:sample_alignment])`)
* While `time < Time.now`:
  * increment the prospective time by the `sample_time`

If you wish to edit the sample computation algorithm, it resides in `lib/server/consistency_manager.rb`, under the method `compute_next_sample_time`.
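The steps above can be sketched as a standalone method (a simplified illustration of the algorithm as described, not the actual code in `consistency_manager.rb`):

```ruby
# Sketch of the sample-time algorithm above.  config expects
# :sample_time and :sample_alignment, both in seconds.
def compute_next_sample_time(config, now = Time.now)
  # Round down to the last aligned sample boundary
  time = Time.at(((now.to_i / config[:sample_time]).floor * config[:sample_time]) +
                 config[:sample_alignment])
  # Walk forward until the prospective time passes "now"
  time += config[:sample_time] while time < now
  time
end

# Hourly samples aligned to ten minutes past the hour:
puts compute_next_sample_time({ sample_time: 3600, sample_alignment: 600 })
```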

For example:

    :sampling_policy:
      :sample_limit: 0
      :sample_time: 60
      :sample_alignment: 0



Client Policy
-------------
This section describes the properties each client must inherit from its server. It covers things such as how the client appears to external websites, and how it normalises and packages its data before upload to the server.

* `dry_run` --- Boolean. Set to `true` to disable web access on the client, so that it samples empty datapoints.
* `max_body_size` --- Stop downloading when this number of bytes has been downloaded. Used to prevent aberrantly large files from filling RAM or disk storage.
* `fix_encoding` --- Boolean. Should the encoding be normalised to `target_encoding`?
* `target_encoding` --- The name of an encoding to normalise to. The default is 'UTF-8', but anything supported by Ruby's String#encode method will work.
* `encoding_options` --- An options hash passed to String#encode, which may include:
  * `encoding_options/invalid` --- If the value is `:replace`, replaces invalid byte sequences with the `:replace` char
  * `encoding_options/undef` --- If the value is `:replace`, replaces undefined chars with the `:replace` char
  * `encoding_options/replace` --- The char to use in replacement. Defaults to U+FFFD for unicode targets, and '?' for others
  * `encoding_options/fallback{}` --- A key-value table of characters to replace
  * `encoding_options/xml` --- Either `:text` or `:attr`. If `:text`, replaces unsupported characters with hex entities; if `:attr`, it also quotes the entities ("&quot;")
  * `encoding_options/cr_newline` --- Boolean. Replaces LF(\n) with CR(\r) if true
  * `encoding_options/crlf_newline` --- Boolean. Replaces LF(\n) with CRLF(\r\n) if true
  * `encoding_options/universal_newline` --- Boolean. Replaces CRLF(\r\n) and CR(\r) with LF(\n) if true
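For illustration, these options map directly onto Ruby's `String#encode` (a standalone sketch, not LWAC code):

```ruby
# undef: :replace swaps characters the target encoding lacks for the
# configured replacement char:
clean = 'café'.encode('US-ASCII', undef: :replace, replace: '?')
puts clean   # => caf?

# universal_newline folds CRLF and CR into plain LF:
text = "one\r\ntwo\rthree".encode(universal_newline: true)
puts text.include?("\r")   # => false
```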

Clients use cURL to download links, via the `curb` library, and may thus be configured with custom request parameters and other options. The options below are applied by setting properties on the cURL object, and as such anything in the [curb documentation](https://rubygems.org/gems/curb) may be provided as a key, such as 'verbose' or control over SSL. By default, SSL options are overridden to accept connections without verifying certificates.

* `curl_workers{}` --- Defines properties of the cURL workers that contact web servers.
  * `max_redirects` --- How many HTTP redirects should be followed before giving up?
  * `useragent` --- The user agent to present to the remote server
  * `follow_location` --- Should the agent follow location headers?
  * `timeout` --- Overall timeout, in seconds, for the whole request.
  * `connect_timeout` --- TCP connect timeout
  * `dns_cache_timeout` --- DNS lookup timeout

MIME type handling can be controlled in a rudimentary manner to prevent the superfluous saving of binary data. This is controlled using the `Content-Type` field of the response headers, and takes the form of a whitelist or blacklist based on regular expressions. Any link that is 'denied' has its body content wiped and a flag set in its datapoint metadata, but otherwise remains intact. It is configured by the structure called `:mimes`.

* `mimes{}` --- Defines mime-type acceptance handling
  * `policy` --- Either `:whitelist` to accept only the items matching the list, or `:blacklist` to decline only the items on the list.
  * `ignore_case` --- Should the regexp matching be case-insensitive?
  * `list[]` --- A list of regular expressions. If one matches then, depending on the white/blacklist configuration, the body will be blanked.
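As an illustration of how such a whitelist behaves (a hypothetical sketch, not LWAC's internal code), using the `^text\/?.*$` pattern from the example below:

```ruby
# Whitelist check: only text/* content types are kept, case-insensitively
# (mirroring :ignore_case => true).
list  = [/^text\/?.*$/i]
allow = ->(content_type) { list.any? { |re| re.match?(content_type) } }

puts allow.call('text/html; charset=utf-8') ? 'kept' : 'blanked'   # => kept
puts allow.call('application/pdf') ? 'kept' : 'blanked'            # => blanked
```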

For example:

    :client_policy:
      :dry_run: false
      :fix_encoding: true
      :target_encoding: UTF-8
      :encoding_options:
        :invalid: :replace
        :undef: :replace
        #:replace: '?'
        #:fallback:
        #  'from': 'to'
        #  'from2': 'to2'
        #:xml: :attr
        #:cr_newline: true
        #:crlf_newline:
        :universal_newline: true
      :max_body_size: 20971520
      :mimes:
        :policy: :whitelist
        :ignore_case: true
        :list:
          - ^text\/?.*$   # text-only mimes
          #- ^.+$         # anything with a valid content-type
      :curl_workers:
        :max_redirects: 5
        :useragent: ! '"Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US; rv:1.9.2.11) Gecko/20101012 Firefox/3.6.11"'
        :enable_cookies: true
        # :headers: "Header: String"
        :verbose: false
        :follow_location: true
        :timeout: 60
        :connect_timeout: 10
        :dns_cache_timeout: 10
        :ftp_response_timeout: 10

Client Management
-----------------
The server's responsibility for managing clients as they connect and process work means that it must impose meaningful time limits on these connections (lest clients crash or misbehave). These settings govern the rate at which clients are presumed to work, before the server starts re-assigning their work to other clients.


The client is given only a finite time to complete its work before the server will assume it has died and re-assign its links elsewhere. This is controlled by two parameters---one for clients that have not been seen before, and one that modifies the dynamic timeout computed by the server.

* `time_per_link` --- How long, in seconds, a new client is given per link to download a batch. Clients typically download fairly fast, so this should be quite low (below 5).
* `dynamic_time_overestimate` --- How much to multiply the client's last performance by when computing a timeout, i.e. a value of "1.2" will give 20% overhead.
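As a rough illustration of how these two parameters might interact (an assumed formula for illustration only; the server's actual computation may differ):

```ruby
# A new client gets the static per-link allowance; a previously seen
# client gets its own measured per-link time, scaled by the overestimate.
def batch_timeout(batch_size, time_per_link, overestimate, last_per_link = nil)
  per_link = last_per_link ? last_per_link * overestimate : time_per_link
  batch_size * per_link
end

puts batch_timeout(1000, 5, 1.3)       # new client: 5000 seconds
puts batch_timeout(1000, 5, 1.3, 10)   # known, slow client: 13000.0 seconds
```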

These next two parameters define how long clients are told to wait when they contact the server but find no work available (for example, when no sample is open right now).

* `empty_client_backoff` --- If no links are available but a sample is open, tell clients to retry after this time.
* `delay_overestimate` --- When a sample is closed, clients are told to wait until after the sample opening time. A small amount is added to this to prevent clients from hitting the final seconds before the sample is open. This should be less than `empty_client_backoff` for it to make any difference. I recommend below 10 seconds.

For example:

    :client_management:
      :time_per_link: 5
      :dynamic_time_overestimate: 1.3
      :empty_client_backoff: 60
      :delay_overestimate: 10


Server
------
These settings govern the network properties of the server, as used for data transfer to and from clients. Settings here are passed to [SimpleRPC](http://stephenwattam.com/projects/simplerpc/), for which full documentation is available in the ruby gem. Only the salient configuration options are listed here.

* `hostname` --- The hostname or IP address of an interface on which to listen
* `port` --- The port on which to listen on this interface
* `password` --- Optional. The password to use for auth (must match the client config)
* `secret` --- Optional. The encryption key to use when sending the password (must match the client config)

For example:

    :server:
      :hostname:
      :port: 27401
      :password: lwacpass
      :secret: egrniognhre89n34ifnui4n8gf490

Logging
-------
The logging system is the same for the client, server, and export tools, and shares a configuration format with them. For details, see [configuring logging](log_config.html)