lwac 0.2.0

Files changed (45)
  1. checksums.yaml +7 -0
  2. data/LICENSE +70 -0
  3. data/README.md +31 -0
  4. data/bin/lwac +132 -0
  5. data/client_config.md +71 -0
  6. data/concepts.md +70 -0
  7. data/config_docs.md +40 -0
  8. data/doc/compile.rb +52 -0
  9. data/doc/template.rhtml +145 -0
  10. data/example_config/client.jv.yml +33 -0
  11. data/example_config/client.yml +34 -0
  12. data/example_config/export.yml +70 -0
  13. data/example_config/import.yml +19 -0
  14. data/example_config/server.yml +97 -0
  15. data/export_config.md +448 -0
  16. data/import_config.md +29 -0
  17. data/index.md +49 -0
  18. data/install.md +29 -0
  19. data/lib/lwac.rb +17 -0
  20. data/lib/lwac/client.rb +354 -0
  21. data/lib/lwac/client/file_cache.rb +160 -0
  22. data/lib/lwac/client/storage.rb +69 -0
  23. data/lib/lwac/export.rb +362 -0
  24. data/lib/lwac/export/format.rb +310 -0
  25. data/lib/lwac/export/key_value_format.rb +132 -0
  26. data/lib/lwac/export/resources.rb +82 -0
  27. data/lib/lwac/import.rb +152 -0
  28. data/lib/lwac/server.rb +294 -0
  29. data/lib/lwac/server/consistency_manager.rb +265 -0
  30. data/lib/lwac/server/db_conn.rb +376 -0
  31. data/lib/lwac/server/storage_manager.rb +290 -0
  32. data/lib/lwac/shared/data_types.rb +283 -0
  33. data/lib/lwac/shared/identity.rb +44 -0
  34. data/lib/lwac/shared/launch_tools.rb +87 -0
  35. data/lib/lwac/shared/multilog.rb +158 -0
  36. data/lib/lwac/shared/serialiser.rb +86 -0
  37. data/limits.md +114 -0
  38. data/log_config.md +30 -0
  39. data/monitoring.md +13 -0
  40. data/resources/schemata/mysql/links.sql +7 -0
  41. data/resources/schemata/sqlite/links.sql +5 -0
  42. data/server_config.md +242 -0
  43. data/tools.md +89 -0
  44. data/workflows.md +39 -0
  45. metadata +140 -0
@@ -0,0 +1,89 @@
1
+ Tools
2
+ =====
3
+ LWAC's workflow is based around a simple import-download-export system, and as such there are three major tools in the distribution: import, download (split into server and client), and export.
4
+
5
+ Install
6
+ -------
7
+ Installation (and testing) is documented [elsewhere](install.html).
8
+
9
+ Import
10
+ ------
11
+ The import tool is responsible for creating a metadata database, and importing links into it. It does not create the whole corpus directory structure (this is handled by the download server), but will construct requisite SQL tables for handling sampling.
12
+
13
+ ### Usage
14
+
15
+ The import script is run with:
16
+
17
+ lwac import config.yml LINKFILE
18
+
19
+ where:
20
+
21
+ * `LINKFILE` is a one-link-per-line list of hyperlinks to use for this sample (assumed UTF-8).
22
+
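The `LINKFILE` format described above (one link per line, UTF-8) could be produced with a small script along these lines. This is only a sketch: `write_linkfile` is a hypothetical helper, not part of LWAC, and the decisions to drop blank lines, remove duplicates, and keep only absolute `http`/`https` URIs are assumptions rather than requirements stated in the documentation.

```ruby
require "uri"

# Hypothetical helper (not part of LWAC): writes links to a
# one-link-per-line, UTF-8 encoded file and returns how many were written.
def write_linkfile(path, links)
  # Assumption: surrounding whitespace, blank entries and duplicates are unwanted.
  cleaned = links.map(&:strip).reject(&:empty?).uniq

  # Assumption: only absolute http(s) URIs are useful for a web sample.
  valid = cleaned.select do |link|
    begin
      uri = URI.parse(link)
      uri.is_a?(URI::HTTP) && !uri.host.nil?
    rescue URI::InvalidURIError
      false
    end
  end

  File.open(path, "w:UTF-8") do |file|
    valid.each { |link| file.puts(link) }
  end
  valid.size
end
```

The resulting file would then be passed as the `LINKFILE` argument to `lwac import`.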
23
+ ### Configuration
24
+ The import tool has a much shorter [config file](import_config.html) than the other tools, and relies heavily on the server config to manage its storage system.
25
+
26
+ Download
27
+ --------
28
+ The download phase is controlled by a single system, split into server and client. The roles of the server are to:
29
+
30
+ * Manage access to metadata and backing store (main corpus) in an atomic fashion
31
+ * Enforce sampling policy
32
+ * Manage client connections and download attempts
33
+
34
+ and it is thus configured with knowledge of the limitations of the backing store, properties of the metadata database, and network access to the client.
35
+
36
+ The client is tasked with:
37
+
38
+ * Connecting to a relevant server and asking for work
39
+ * Connecting to external servers to download data
40
+ * Uploading data to a storage server
41
+
42
+ and thus is configured with network access properties for both the server and external HTTP servers, and rate/batch limits that should be tuned for the machine on which it is run.
43
+
44
+ Each server supports an unlimited number of clients; however, their access to the corpus is regulated through a competition model: whilst one is connected, the others are told to wait.
45
+
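The competition model described above can be illustrated with a minimal sketch. It is not LWAC's actual server code: the `CorpusGate` class and the `:granted`/`:wait` replies are hypothetical names chosen for illustration, and the real server's protocol may differ.

```ruby
# Hypothetical sketch of a competition model: at most one client holds
# access to the corpus at a time, and all others are told to wait.
class CorpusGate
  def initialize
    @lock = Mutex.new   # protects @holder against concurrent requests
    @holder = nil       # id of the client currently connected, if any
  end

  # Returns :granted if the client holds (or now acquires) the corpus,
  # and :wait if another client is currently connected.
  def request(client_id)
    @lock.synchronize do
      if @holder.nil? || @holder == client_id
        @holder = client_id
        :granted
      else
        :wait
      end
    end
  end

  # Releases the corpus, but only if the caller is the current holder.
  def release(client_id)
    @lock.synchronize do
      @holder = nil if @holder == client_id
    end
  end
end
```

A waiting client would simply ask again later; once the current holder releases the gate, the next request succeeds.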
46
+ ### Server
47
+
48
+ #### Usage
49
+ To run the server, simply provide it with a path to a config file:
50
+
51
+ lwac server config/server.yml
52
+
53
+
54
+ #### Configuration
55
+ The server has the following prerequisites:
56
+
57
+ * A metadata database must be created using the import tool
58
+ * This database must be placed within a directory to which the server has write access. This will form the root of the corpus
59
+
60
+ For configuration options, see the detailed writeup on the [server configuration page](server_config.html).
61
+
62
+
63
+ ### Client
64
+
65
+ #### Usage
66
+ To run the client, simply provide it with a path to a relevant config file:
67
+
68
+ lwac client config/client.yml
69
+
70
+ #### Configuration
71
+ The client is also managed exclusively by its config file. See more detail on the [client configuration page](client_config.html).
72
+
73
+
74
+
75
+ Export
76
+ ------
77
+ The export tool is used to reformat information from the metadata and backing store into CSV files for simple processing with tools such as R. It is heavily based on a "filter and transform" model, where small code snippets are used to select and then present data in a useful form. This approach keeps simple tasks simple without limiting the power and complexity of the selection rules.
78
+
79
+ ### Usage
80
+ To run the export tool, simply provide it with a path pointing to a relevant config file:
81
+
82
+ lwac export config/export.yml
83
+
84
+ Note that the export tool uses the server configuration file for corpus access, and thus needs to be able to read that file as well.
85
+
86
+ ### Configuration
87
+ The export tool is influenced by both its own config file and that of the server. Of most interest is the [export configuration page](export_config.html).
88
+
89
+
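The "filter and transform" model described in the Export section of tools.md can be sketched in plain Ruby. This is an illustration only and does not use LWAC's export configuration format or its data types: the `Record` struct, its fields, and the `to_csv` helper are all hypothetical.

```ruby
require "csv"

# Hypothetical record type standing in for one downloaded datapoint.
Record = Struct.new(:uri, :status, :body_length)

# Selects records with `filter`, then builds one CSV column per entry in
# `transforms` (a Hash of column name => lambda taking a record).
def to_csv(records, filter, transforms)
  CSV.generate do |csv|
    csv << transforms.keys
    records.select { |record| filter.call(record) }.each do |record|
      csv << transforms.values.map { |transform| transform.call(record) }
    end
  end
end
```

For example, a filter of `->(r) { r.status == 200 }` combined with transforms for the URI and the body size in kilobytes would yield a two-column CSV containing only the successful downloads.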
@@ -0,0 +1,39 @@
1
+ LWAC Workflows
2
+ ==============
3
+ LWAC functions as a data acquisition system only; however, it's still fairly flexible in how it is deployed. This page outlines a simple setup for downloading a URL list, and covers what to edit/configure/run when.
4
+
5
+ URI Selection
6
+ -------------
7
+ Stage one is to settle on which URIs should be sampled. These should then be placed in a file in one-line-per-URI format for use with the import script. Don't import them yet, since the import script uses the server configuration to govern its storage format.
8
+
9
+ Server Configuration
10
+ --------------------
11
+ On the server machine, a directory should be created that will hold the corpus. This should be accessible to the server process (with write permissions), and should be on a filesystem that can handle many small files efficiently. Place the metadata database in this corpus.
12
+
13
+ Next, configure the [server's configuration file](server_config.html) such that it is suited to the position of the corpus directory and the limits of the filesystem, network, and host machine.
14
+
15
+
16
+ URI Import
17
+ -----------
18
+ The URI list from earlier should then be imported into an empty metadata database using the [import tool](config_docs.html). This will create a SQLite3 database and a corpus with space for sample summaries and datapoint information.
19
+
20
+
21
+ Client Configuration
22
+ --------------------
23
+ Each client that is to do the download work must also be [configured](client_config.html) to point to the server, and should be tweaked to match the capacities of its host (RAM, disk size, etc).
24
+
25
+ Data Collection
26
+ ---------------
27
+ Clients will continually attempt to contact the server as long as they are running, so the order in which clients and servers are started is of no consequence.
28
+
29
+ Summary statistics are output by the server regarding overall performance, including the number of links downloaded, progress on individual samples, etc. Inspecting the logs of a running server should provide enough information on overall download progress. It's also possible to export data from an 'active' corpus, though the database can be configured to disallow this (exclusive locking).
30
+
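The retry behaviour described above, where clients keep trying to contact the server so that start order does not matter, can be sketched as a simple loop. This is not LWAC's client code: `connect_with_retry`, its parameters, and the injected `connect` callable are hypothetical and stand in for whatever the real client does.

```ruby
# Hypothetical sketch: keep calling `connect` until it succeeds.
# `connect` is any callable that raises on failure and returns a
# connection (or any value) on success.
def connect_with_retry(connect, delay: 1.0, max_attempts: nil, sleeper: ->(s) { sleep(s) })
  attempts = 0
  begin
    attempts += 1
    connect.call
  rescue StandardError
    # Give up only if a limit was set and has been reached.
    raise if max_attempts && attempts >= max_attempts
    sleeper.call(delay)
    retry
  end
end
```

With `max_attempts` left as `nil` the loop retries indefinitely, which matches the idea that a client may be started before its server.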
31
+
32
+ Operationalisation
33
+ ------------------
34
+ This is possible in one of two ways. The first is to use the server's storage libraries to write custom export code in Ruby. The second, and easier, is to use the export tool provided.
35
+
36
+ If using the export tool, it must be [configured](export_config.html) to extract variables of interest from the corpus. This configuration will vary for each server and study.
37
+
38
+ Once you've exported data, import it into some kind of analysis tool and do science with it :-)
39
+
metadata ADDED
@@ -0,0 +1,140 @@
1
+ --- !ruby/object:Gem::Specification
2
+ name: lwac
3
+ version: !ruby/object:Gem::Version
4
+ version: 0.2.0
5
+ platform: ruby
6
+ authors:
7
+ - Stephen Wattam
8
+ autorequire:
9
+ bindir: bin
10
+ cert_chain: []
11
+ date: 2013-04-22 00:00:00.000000000 Z
12
+ dependencies:
13
+ - !ruby/object:Gem::Dependency
14
+ name: simplerpc
15
+ requirement: !ruby/object:Gem::Requirement
16
+ requirements:
17
+ - - ~>
18
+ - !ruby/object:Gem::Version
19
+ version: '0.2'
20
+ type: :runtime
21
+ prerelease: false
22
+ version_requirements: !ruby/object:Gem::Requirement
23
+ requirements:
24
+ - - ~>
25
+ - !ruby/object:Gem::Version
26
+ version: '0.2'
27
+ - !ruby/object:Gem::Dependency
28
+ name: blat
29
+ requirement: !ruby/object:Gem::Requirement
30
+ requirements:
31
+ - - ~>
32
+ - !ruby/object:Gem::Version
33
+ version: '0.1'
34
+ type: :runtime
35
+ prerelease: false
36
+ version_requirements: !ruby/object:Gem::Requirement
37
+ requirements:
38
+ - - ~>
39
+ - !ruby/object:Gem::Version
40
+ version: '0.1'
41
+ description: A tool to construct longitudinal corpora from web data
42
+ email: stephenwattam@gmail.com
43
+ executables:
44
+ - lwac
45
+ extensions: []
46
+ extra_rdoc_files:
47
+ - limits.md
48
+ - index.md
49
+ - config_docs.md
50
+ - client_config.md
51
+ - log_config.md
52
+ - monitoring.md
53
+ - import_config.md
54
+ - install.md
55
+ - workflows.md
56
+ - export_config.md
57
+ - README.md
58
+ - concepts.md
59
+ - tools.md
60
+ - server_config.md
61
+ files:
62
+ - LICENSE
63
+ - resources/schemata/mysql/links.sql
64
+ - resources/schemata/sqlite/links.sql
65
+ - doc/compile.rb
66
+ - doc/template.rhtml
67
+ - example_config/client.yml
68
+ - example_config/export.yml
69
+ - example_config/import.yml
70
+ - example_config/client.jv.yml
71
+ - example_config/server.yml
72
+ - lib/lwac/export/resources.rb
73
+ - lib/lwac/export/format.rb
74
+ - lib/lwac/export/key_value_format.rb
75
+ - lib/lwac/shared/multilog.rb
76
+ - lib/lwac/shared/launch_tools.rb
77
+ - lib/lwac/shared/identity.rb
78
+ - lib/lwac/shared/serialiser.rb
79
+ - lib/lwac/shared/data_types.rb
80
+ - lib/lwac/client/file_cache.rb
81
+ - lib/lwac/client/storage.rb
82
+ - lib/lwac/server/consistency_manager.rb
83
+ - lib/lwac/server/storage_manager.rb
84
+ - lib/lwac/server/db_conn.rb
85
+ - lib/lwac/client.rb
86
+ - lib/lwac/server.rb
87
+ - lib/lwac/import.rb
88
+ - lib/lwac/export.rb
89
+ - lib/lwac.rb
90
+ - limits.md
91
+ - index.md
92
+ - config_docs.md
93
+ - client_config.md
94
+ - log_config.md
95
+ - monitoring.md
96
+ - import_config.md
97
+ - install.md
98
+ - workflows.md
99
+ - export_config.md
100
+ - README.md
101
+ - concepts.md
102
+ - tools.md
103
+ - server_config.md
104
+ - bin/lwac
105
+ homepage: http://stephenwattam.com/projects/LWAC
106
+ licenses:
107
+ - CC-BY-NC-SA 3.0
108
+ metadata: {}
109
+ post_install_message: |+
110
+ Thanks for installing LWAC.
111
+
112
+ Optional Dependencies
113
+ ---------------------
114
+ - mysql2 ~> 0.3 (server)
115
+ - sqlite3 ~> 1.3 (server)
116
+ - curb ~> 0.8 (client)
117
+
118
+ The server/export/import tools REQUIRE either mysql2 or sqlite3.
119
+ The client REQUIRES curb.
120
+
121
+ rdoc_options: []
122
+ require_paths:
123
+ - lib
124
+ required_ruby_version: !ruby/object:Gem::Requirement
125
+ requirements:
126
+ - - '>='
127
+ - !ruby/object:Gem::Version
128
+ version: '2.0'
129
+ required_rubygems_version: !ruby/object:Gem::Requirement
130
+ requirements:
131
+ - - '>='
132
+ - !ruby/object:Gem::Version
133
+ version: '0'
134
+ requirements: []
135
+ rubyforge_project:
136
+ rubygems_version: 2.0.2
137
+ signing_key:
138
+ specification_version: 4
139
+ summary: Longitudinal Web-as-Corpus sampling
140
+ test_files: []
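The version constraints in the metadata above (`~> 0.2` for simplerpc, `~> 0.1` for blat, and `>= 2.0` for Ruby itself) can be checked with RubyGems' own `Gem::Requirement` class. The version numbers tested below are arbitrary examples chosen for illustration, not releases known to exist.

```ruby
require "rubygems"

# "~> 0.2" is the pessimistic operator: at least 0.2, but below 1.0.
simplerpc = Gem::Requirement.new("~> 0.2")
simplerpc_ok      = simplerpc.satisfied_by?(Gem::Version.new("0.2.0"))  # true
simplerpc_minor   = simplerpc.satisfied_by?(Gem::Version.new("0.9.1"))  # true
simplerpc_major   = simplerpc.satisfied_by?(Gem::Version.new("1.0.0"))  # false
simplerpc_too_old = simplerpc.satisfied_by?(Gem::Version.new("0.1.9"))  # false

# The gemspec requires Ruby 2.0 or newer.
ruby_requirement = Gem::Requirement.new(">= 2.0")
ruby_19_ok = ruby_requirement.satisfied_by?(Gem::Version.new("1.9.3"))  # false
ruby_21_ok = ruby_requirement.satisfied_by?(Gem::Version.new("2.1.0"))  # true
```

Note that mysql2, sqlite3 and curb appear only in the post-install message, so RubyGems will not resolve them automatically; they must be installed separately.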