lwac 0.2.0

Files changed (45)

checksums.yaml +7 -0
data/LICENSE +70 -0
data/README.md +31 -0
data/bin/lwac +132 -0
data/client_config.md +71 -0
data/concepts.md +70 -0
data/config_docs.md +40 -0
data/doc/compile.rb +52 -0
data/doc/template.rhtml +145 -0
data/example_config/client.jv.yml +33 -0
data/example_config/client.yml +34 -0
data/example_config/export.yml +70 -0
data/example_config/import.yml +19 -0
data/example_config/server.yml +97 -0
data/export_config.md +448 -0
data/import_config.md +29 -0
data/index.md +49 -0
data/install.md +29 -0
data/lib/lwac.rb +17 -0
data/lib/lwac/client.rb +354 -0
data/lib/lwac/client/file_cache.rb +160 -0
data/lib/lwac/client/storage.rb +69 -0
data/lib/lwac/export.rb +362 -0
data/lib/lwac/export/format.rb +310 -0
data/lib/lwac/export/key_value_format.rb +132 -0
data/lib/lwac/export/resources.rb +82 -0
data/lib/lwac/import.rb +152 -0
data/lib/lwac/server.rb +294 -0
data/lib/lwac/server/consistency_manager.rb +265 -0
data/lib/lwac/server/db_conn.rb +376 -0
data/lib/lwac/server/storage_manager.rb +290 -0
data/lib/lwac/shared/data_types.rb +283 -0
data/lib/lwac/shared/identity.rb +44 -0
data/lib/lwac/shared/launch_tools.rb +87 -0
data/lib/lwac/shared/multilog.rb +158 -0
data/lib/lwac/shared/serialiser.rb +86 -0
data/limits.md +114 -0
data/log_config.md +30 -0
data/monitoring.md +13 -0
data/resources/schemata/mysql/links.sql +7 -0
data/resources/schemata/sqlite/links.sql +5 -0
data/server_config.md +242 -0
data/tools.md +89 -0
data/workflows.md +39 -0
metadata +140 -0

data/tools.md ADDED

@@ -0,0 +1,89 @@
+Tools
+=====
+LWAC's workflow is based around a simple import-download-export system, and as such there are three major tools in the distribution: import, download (split into server and client), and export.
+Install
+-------
+Installation (and testing) is documented [elsewhere](install.html).
+Import
+------
+The import tool is responsible for creating a metadata database, and importing links into it.  It does not create the whole corpus directory structure (this is handled by the download server), but will construct requisite SQL tables for handling sampling.
+### Usage
+The import script is run with:
+    lwac import config.yml LINKFILE
+where:
+ * `LINKFILE` is a one-link-per-line list of hyperlinks to use for this sample (assumed UTF-8), and;
+ * `config.yml` is the import tool's config file, described under Configuration below.
+### Configuration
+The import tool has a much shorter [config file](import_config.html) than others, and relies heavily on the server config to manage its storage system.
+Download
+--------
+The download phase is controlled by a single system, split into server and client.  The roles of the server are to:
+ * Manage access to metadata and backing store (main corpus) in an atomic fashion
+ * Enforce sampling policy
+ * Manage client connections and download attempts
+and it is thus configured with knowledge of the limitations of the backing store, properties of the metadata database, and network access to the client.
+The client is tasked with:
+ * Connecting to a relevant server and asking for work
+ * Connecting to external servers to download data
+ * Uploading data to a storage server
+and thus is configured with network access properties for both the server and external HTTP servers, and rate/batch limits that should be tuned for the machine on which it is run.
+Each server supports an unlimited number of clients; however, their access to the corpus is regulated through a competition model: whilst one client is connected, the others are told to wait.
+### Server
+#### Usage
+To run the server, simply provide it with a path to a config file:
+    lwac server config/server.yml
+#### Configuration
+The server has the following prerequisites:
+ * A metadata database must be created using the import tool
+ * This database must be placed within a directory to which the server has write access.  This will form the root of the corpus
+For the full set of configuration options, see the [server configuration page](server_config.html).
+### Client
+#### Usage
+To run the client, simply provide it with a path to a relevant config file:
+    lwac client config/client.yml
+#### Configuration
+The client is also managed exclusively by its config file.  See more detail on the [client configuration page](client_config.html).
+Export
+------
+The export tool is used to reformat information from the metadata and backing store into CSV files for simple processing with tools such as R.  It is heavily based on a "filter and transform" model, where small code snippets are used to select and then present data in a useful form.  This approach keeps basic tasks easy without limiting the power and complexity of the selection rules.
+### Usage
+To run the export tool, simply provide it with a path pointing to a relevant config file:
+    lwac export config/export.yml
+Note that the export tool uses the server configuration file for corpus access, so it must be able to read that file as well.
+### Configuration
+The export tool is influenced by both its own config file and that of the server.  Of most interest is the [export configuration page](export_config.html).
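
Note on data/tools.md: the following is a minimal Ruby sketch, not taken from the gem, of how the inputs described above might be prepared. It assumes only what the document states: a `LINKFILE` holds one link per line in UTF-8, and each tool is invoked as `lwac <tool> <config> [LINKFILE]`. The helper names `write_linkfile` and `lwac_command` are hypothetical and are not part of LWAC.

```ruby
# Hypothetical helper (not part of LWAC): write a LINKFILE in the format
# tools.md describes, i.e. one link per line, encoded as UTF-8.
# Blank entries and exact duplicates are dropped; this is an assumption
# about what is convenient, not a documented LWAC requirement.
def write_linkfile(links, path)
  cleaned = links.map { |link| link.to_s.strip }.reject(&:empty?).uniq
  File.open(path, "w:UTF-8") do |file|
    cleaned.each { |link| file.puts(link) }
  end
  cleaned.length
end

# Hypothetical helper (not part of LWAC): build the argument vector for one
# of the four invocations shown in tools.md, suitable for Kernel#system.
def lwac_command(tool, config, *extra)
  unless %w[import server client export].include?(tool)
    raise ArgumentError, "unknown tool: #{tool}"
  end
  ["lwac", tool, config, *extra]
end
```

For example, `system(*lwac_command("import", "config.yml", "links.txt"))` would run the import step, assuming the `lwac` executable is on the `PATH`.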

data/workflows.md ADDED

@@ -0,0 +1,39 @@
+LWAC Workflows
+==============
+LWAC functions as a data acquisition system only; however, it is still fairly flexible in how it is deployed.  This page outlines a simple setup for downloading a URL list, and covers what to edit, configure, and run at each stage.
+URI Selection
+-------------
+Stage one is to settle on which URIs should be sampled.  These should then be placed in a file in one-line-per-URI format for use with the import script.  Don't import them yet, since the import script uses the server configuration to govern its storage format.
+Server Configuration
+--------------------
+On the server machine, a directory should be created that will hold the corpus.  This should be accessible to the server process (with write permissions), and should be on a filesystem that can handle many small files efficiently.  Place the metadata database in this corpus.
+Next, configure the [server's configuration file](server_config.html) such that it is suited to the position of the corpus directory and the limits of the filesystem, network, and host machine.
+URI Import
+-----------
+The URI list from earlier should then be imported into an empty metadata database using the [import tool](config_docs.html).  This will create a SQLite3 database and a corpus with space for sample summaries and datapoint information.
+Client Configuration
+--------------------
+Each client that is to do the download work must also be [configured](client_config.html) to point to the server, and should be tweaked to match the capacities of its host (RAM, disk size, etc).
+Data Collection
+---------------
+Clients will continually attempt to contact the server as long as they are running, so the order in which clients and servers are started is of no consequence.
+Summary statistics are output by the server regarding overall performance, including the number of links downloaded, progress on individual samples, etc.  Inspecting the logs of a running server should provide enough information on overall download progress.  Data can also be exported from an 'active' corpus, unless the database has been configured to disallow this (exclusive locking).
+Operationalisation
+------------------
+This is possible in one of two ways.  The first is to use the server's storage libraries to write custom export code in Ruby.  The second, and easier, is to use the export tool provided.
+If using the export tool, it must be [configured](export_config.html) to extract variables of interest from the corpus.  This configuration will vary for each server and study.
+Once you've exported data, import it into some kind of analysis tool and do science with it :-)
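
Note on data/workflows.md: the checks implied by the URI Selection and Server Configuration stages can be sketched in Ruby as below. This is an illustration under stated assumptions, not LWAC code: the function names are hypothetical, and restricting the list to `http`/`https` URIs with a host is an assumption made here for the example, not a rule taken from the documentation.

```ruby
require "uri"

# Hypothetical preflight check (not part of LWAC): the corpus directory
# should exist and be writable by the process that will run the server.
def corpus_dir_ready?(path)
  File.directory?(path) && File.writable?(path)
end

# Hypothetical preflight check (not part of LWAC): split the lines of a
# one-URI-per-line file into [usable, rejected]. Blank lines are ignored.
# Treating only http/https URIs with a host as usable is an assumption.
def partition_uris(lines)
  lines.map(&:strip).reject(&:empty?).partition do |line|
    begin
      uri = URI.parse(line)
      uri.is_a?(URI::HTTP) && !uri.host.nil?
    rescue URI::InvalidURIError
      false
    end
  end
end
```

Running both checks before the import stage gives an early warning about an unwritable corpus directory or malformed entries in the URI list.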

metadata ADDED

@@ -0,0 +1,140 @@
+--- !ruby/object:Gem::Specification
+name: lwac
+version: !ruby/object:Gem::Version
+  version: 0.2.0
+platform: ruby
+authors:
+- Stephen Wattam
+autorequire:
+bindir: bin
+cert_chain: []
+date: 2013-04-22 00:00:00.000000000 Z
+dependencies:
+- !ruby/object:Gem::Dependency
+  name: simplerpc
+  requirement: !ruby/object:Gem::Requirement
+    requirements:
+    - - ~>
+      - !ruby/object:Gem::Version
+        version: '0.2'
+  type: :runtime
+  prerelease: false
+  version_requirements: !ruby/object:Gem::Requirement
+    requirements:
+    - - ~>
+      - !ruby/object:Gem::Version
+        version: '0.2'
+- !ruby/object:Gem::Dependency
+  name: blat
+  requirement: !ruby/object:Gem::Requirement
+    requirements:
+    - - ~>
+      - !ruby/object:Gem::Version
+        version: '0.1'
+  type: :runtime
+  prerelease: false
+  version_requirements: !ruby/object:Gem::Requirement
+    requirements:
+    - - ~>
+      - !ruby/object:Gem::Version
+        version: '0.1'
+description: A tool to construct longitudinal corpora from web data
+email: stephenwattam@gmail.com
+executables:
+- lwac
+extensions: []
+extra_rdoc_files:
+- limits.md
+- index.md
+- config_docs.md
+- client_config.md
+- log_config.md
+- monitoring.md
+- import_config.md
+- install.md
+- workflows.md
+- export_config.md
+- README.md
+- concepts.md
+- tools.md
+- server_config.md
+files:
+- LICENSE
+- resources/schemata/mysql/links.sql
+- resources/schemata/sqlite/links.sql
+- doc/compile.rb
+- doc/template.rhtml
+- example_config/client.yml
+- example_config/export.yml
+- example_config/import.yml
+- example_config/client.jv.yml
+- example_config/server.yml
+- lib/lwac/export/resources.rb
+- lib/lwac/export/format.rb
+- lib/lwac/export/key_value_format.rb
+- lib/lwac/shared/multilog.rb
+- lib/lwac/shared/launch_tools.rb
+- lib/lwac/shared/identity.rb
+- lib/lwac/shared/serialiser.rb
+- lib/lwac/shared/data_types.rb
+- lib/lwac/client/file_cache.rb
+- lib/lwac/client/storage.rb
+- lib/lwac/server/consistency_manager.rb
+- lib/lwac/server/storage_manager.rb
+- lib/lwac/server/db_conn.rb
+- lib/lwac/client.rb
+- lib/lwac/server.rb
+- lib/lwac/import.rb
+- lib/lwac/export.rb
+- lib/lwac.rb
+- limits.md
+- index.md
+- config_docs.md
+- client_config.md
+- log_config.md
+- monitoring.md
+- import_config.md
+- install.md
+- workflows.md
+- export_config.md
+- README.md
+- concepts.md
+- tools.md
+- server_config.md
+- bin/lwac
+homepage: http://stephenwattam.com/projects/LWAC
+licenses:
+- CC-BY-NC-SA 3.0
+metadata: {}
+post_install_message: |+
+  Thanks for installing LWAC.
+  Optional Dependencies
+  ---------------------
+   - mysql2 ~> 0.3 (server)
+   - sqlite3 ~> 1.3 (server)
+   - curb ~> 0.8 (client)
+  The server/export/import tools REQUIRE either mysql2 or sqlite3.
+  The client REQUIRES curb.
+rdoc_options: []
+require_paths:
+- lib
+required_ruby_version: !ruby/object:Gem::Requirement
+  requirements:
+  - - '>='
+    - !ruby/object:Gem::Version
+      version: '2.0'
+required_rubygems_version: !ruby/object:Gem::Requirement
+  requirements:
+  - - '>='
+    - !ruby/object:Gem::Version
+      version: '0'
+requirements: []
+rubyforge_project:
+rubygems_version: 2.0.2
+signing_key:
+specification_version: 4
+summary: Longitudinal Web-as-Corpus sampling
+test_files: []
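
Note on the gem metadata: every dependency above uses the pessimistic operator (`~>`). The Ruby sketch below uses RubyGems' own `Gem::Requirement` and `Gem::Version` classes to show which versions such a constraint admits. The constraint strings are copied from the gemspec and post-install message; the `REQUIREMENTS` table and the `satisfies?` helper are hypothetical names introduced for illustration.

```ruby
require "rubygems"

# Constraints as listed in the gemspec (runtime dependencies) and in the
# post-install message (optional dependencies).
REQUIREMENTS = {
  "simplerpc" => "~> 0.2",
  "blat"      => "~> 0.1",
  "mysql2"    => "~> 0.3",
  "sqlite3"   => "~> 1.3",
  "curb"      => "~> 0.8"
}.freeze

# Hypothetical helper: does the given version string satisfy the constraint
# recorded for gem_name? Raises KeyError for a gem not in the table.
def satisfies?(gem_name, version)
  constraint = REQUIREMENTS.fetch(gem_name)
  Gem::Requirement.new(constraint).satisfied_by?(Gem::Version.new(version))
end
```

With a two-segment constraint such as `~> 0.2`, any version from 0.2 up to, but not including, 1.0 is accepted, so `satisfies?("simplerpc", "0.9.0")` is true while `satisfies?("simplerpc", "1.0.0")` is false.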