lwac 0.2.0

Files changed (45)
  1. checksums.yaml +7 -0
  2. data/LICENSE +70 -0
  3. data/README.md +31 -0
  4. data/bin/lwac +132 -0
  5. data/client_config.md +71 -0
  6. data/concepts.md +70 -0
  7. data/config_docs.md +40 -0
  8. data/doc/compile.rb +52 -0
  9. data/doc/template.rhtml +145 -0
  10. data/example_config/client.jv.yml +33 -0
  11. data/example_config/client.yml +34 -0
  12. data/example_config/export.yml +70 -0
  13. data/example_config/import.yml +19 -0
  14. data/example_config/server.yml +97 -0
  15. data/export_config.md +448 -0
  16. data/import_config.md +29 -0
  17. data/index.md +49 -0
  18. data/install.md +29 -0
  19. data/lib/lwac.rb +17 -0
  20. data/lib/lwac/client.rb +354 -0
  21. data/lib/lwac/client/file_cache.rb +160 -0
  22. data/lib/lwac/client/storage.rb +69 -0
  23. data/lib/lwac/export.rb +362 -0
  24. data/lib/lwac/export/format.rb +310 -0
  25. data/lib/lwac/export/key_value_format.rb +132 -0
  26. data/lib/lwac/export/resources.rb +82 -0
  27. data/lib/lwac/import.rb +152 -0
  28. data/lib/lwac/server.rb +294 -0
  29. data/lib/lwac/server/consistency_manager.rb +265 -0
  30. data/lib/lwac/server/db_conn.rb +376 -0
  31. data/lib/lwac/server/storage_manager.rb +290 -0
  32. data/lib/lwac/shared/data_types.rb +283 -0
  33. data/lib/lwac/shared/identity.rb +44 -0
  34. data/lib/lwac/shared/launch_tools.rb +87 -0
  35. data/lib/lwac/shared/multilog.rb +158 -0
  36. data/lib/lwac/shared/serialiser.rb +86 -0
  37. data/limits.md +114 -0
  38. data/log_config.md +30 -0
  39. data/monitoring.md +13 -0
  40. data/resources/schemata/mysql/links.sql +7 -0
  41. data/resources/schemata/sqlite/links.sql +5 -0
  42. data/server_config.md +242 -0
  43. data/tools.md +89 -0
  44. data/workflows.md +39 -0
  45. metadata +140 -0
@@ -0,0 +1,7 @@
1
+ ---
2
+ SHA1:
3
+ metadata.gz: 39b62b54afaf2a83e20c9679e9565aa97e103053
4
+ data.tar.gz: 0d0aa8384bead19ef07dde68cefda240079f12c8
5
+ SHA512:
6
+ metadata.gz: f89aa00d904cd2fa06d14eb703393a824f683cb2fd13a1425ec0a74836e8ba0395a4f2cb1cb4308749c1b5841de15cabfc628e6922fc58d5b0360cc89729d554
7
+ data.tar.gz: 4df133f013e0b0fe85ed911b32944b0edd633faf48077c98b3f10720cce57480e600f6bcb945e32861b76a2fc64377913402e4fd0b9f505a99d991fff3c47931
data/LICENSE ADDED
@@ -0,0 +1,70 @@
1
+ LICENSE
2
+ -------
3
+ LWAC is governed by a creative-commons noncommercial sharealike license, v3, plus some extensions. In addition to the below CC license, I apply the conditions that you must:
4
+
5
+ * Not profit from directly selling the code (including as part of something else).
6
+ * Not profit from selling corpora built using LWAC. (Note that I will probably grant permission if contacted, I merely wish to stop people turning it into a commercial service)
7
+ * Credit LWAC and provide a link to http://stephenwattam.com/project/LWAC/ or http://ucrel.lancs.ac.uk/LWAC/ in any publications (You can also cite the paper pending for WaC8, once it's published).
8
+ * Don't use it to DDOS things.
9
+
10
+ The text of the CC license is below. It is also available from http://creativecommons.org/licenses/by-nc-sa/3.0/legalcode and is nicely explained at http://creativecommons.org/licenses/by-nc-sa/3.0/ .
11
+
12
+ License
13
+
14
+ THE WORK (AS DEFINED BELOW) IS PROVIDED UNDER THE TERMS OF THIS CREATIVE COMMONS PUBLIC LICENSE ("CCPL" OR "LICENSE"). THE WORK IS PROTECTED BY COPYRIGHT AND/OR OTHER APPLICABLE LAW. ANY USE OF THE WORK OTHER THAN AS AUTHORIZED UNDER THIS LICENSE OR COPYRIGHT LAW IS PROHIBITED.
15
+
16
+ BY EXERCISING ANY RIGHTS TO THE WORK PROVIDED HERE, YOU ACCEPT AND AGREE TO BE BOUND BY THE TERMS OF THIS LICENSE. TO THE EXTENT THIS LICENSE MAY BE CONSIDERED TO BE A CONTRACT, THE LICENSOR GRANTS YOU THE RIGHTS CONTAINED HERE IN CONSIDERATION OF YOUR ACCEPTANCE OF SUCH TERMS AND CONDITIONS.
17
+
18
+ 1. Definitions
19
+
20
+ "Adaptation" means a work based upon the Work, or upon the Work and other pre-existing works, such as a translation, adaptation, derivative work, arrangement of music or other alterations of a literary or artistic work, or phonogram or performance and includes cinematographic adaptations or any other form in which the Work may be recast, transformed, or adapted including in any form recognizably derived from the original, except that a work that constitutes a Collection will not be considered an Adaptation for the purpose of this License. For the avoidance of doubt, where the Work is a musical work, performance or phonogram, the synchronization of the Work in timed-relation with a moving image ("synching") will be considered an Adaptation for the purpose of this License.
21
+ "Collection" means a collection of literary or artistic works, such as encyclopedias and anthologies, or performances, phonograms or broadcasts, or other works or subject matter other than works listed in Section 1(g) below, which, by reason of the selection and arrangement of their contents, constitute intellectual creations, in which the Work is included in its entirety in unmodified form along with one or more other contributions, each constituting separate and independent works in themselves, which together are assembled into a collective whole. A work that constitutes a Collection will not be considered an Adaptation (as defined above) for the purposes of this License.
22
+ "Distribute" means to make available to the public the original and copies of the Work or Adaptation, as appropriate, through sale or other transfer of ownership.
23
+ "License Elements" means the following high-level license attributes as selected by Licensor and indicated in the title of this License: Attribution, Noncommercial, ShareAlike.
24
+ "Licensor" means the individual, individuals, entity or entities that offer(s) the Work under the terms of this License.
25
+ "Original Author" means, in the case of a literary or artistic work, the individual, individuals, entity or entities who created the Work or if no individual or entity can be identified, the publisher; and in addition (i) in the case of a performance the actors, singers, musicians, dancers, and other persons who act, sing, deliver, declaim, play in, interpret or otherwise perform literary or artistic works or expressions of folklore; (ii) in the case of a phonogram the producer being the person or legal entity who first fixes the sounds of a performance or other sounds; and, (iii) in the case of broadcasts, the organization that transmits the broadcast.
26
+ "Work" means the literary and/or artistic work offered under the terms of this License including without limitation any production in the literary, scientific and artistic domain, whatever may be the mode or form of its expression including digital form, such as a book, pamphlet and other writing; a lecture, address, sermon or other work of the same nature; a dramatic or dramatico-musical work; a choreographic work or entertainment in dumb show; a musical composition with or without words; a cinematographic work to which are assimilated works expressed by a process analogous to cinematography; a work of drawing, painting, architecture, sculpture, engraving or lithography; a photographic work to which are assimilated works expressed by a process analogous to photography; a work of applied art; an illustration, map, plan, sketch or three-dimensional work relative to geography, topography, architecture or science; a performance; a broadcast; a phonogram; a compilation of data to the extent it is protected as a copyrightable work; or a work performed by a variety or circus performer to the extent it is not otherwise considered a literary or artistic work.
27
+ "You" means an individual or entity exercising rights under this License who has not previously violated the terms of this License with respect to the Work, or who has received express permission from the Licensor to exercise rights under this License despite a previous violation.
28
+ "Publicly Perform" means to perform public recitations of the Work and to communicate to the public those public recitations, by any means or process, including by wire or wireless means or public digital performances; to make available to the public Works in such a way that members of the public may access these Works from a place and at a place individually chosen by them; to perform the Work to the public by any means or process and the communication to the public of the performances of the Work, including by public digital performance; to broadcast and rebroadcast the Work by any means including signs, sounds or images.
29
+ "Reproduce" means to make copies of the Work by any means including without limitation by sound or visual recordings and the right of fixation and reproducing fixations of the Work, including storage of a protected performance or phonogram in digital form or other electronic medium.
30
+ 2. Fair Dealing Rights. Nothing in this License is intended to reduce, limit, or restrict any uses free from copyright or rights arising from limitations or exceptions that are provided for in connection with the copyright protection under copyright law or other applicable laws.
31
+
32
+ 3. License Grant. Subject to the terms and conditions of this License, Licensor hereby grants You a worldwide, royalty-free, non-exclusive, perpetual (for the duration of the applicable copyright) license to exercise the rights in the Work as stated below:
33
+
34
+ to Reproduce the Work, to incorporate the Work into one or more Collections, and to Reproduce the Work as incorporated in the Collections;
35
+ to create and Reproduce Adaptations provided that any such Adaptation, including any translation in any medium, takes reasonable steps to clearly label, demarcate or otherwise identify that changes were made to the original Work. For example, a translation could be marked "The original work was translated from English to Spanish," or a modification could indicate "The original work has been modified.";
36
+ to Distribute and Publicly Perform the Work including as incorporated in Collections; and,
37
+ to Distribute and Publicly Perform Adaptations.
38
+ The above rights may be exercised in all media and formats whether now known or hereafter devised. The above rights include the right to make such modifications as are technically necessary to exercise the rights in other media and formats. Subject to Section 8(f), all rights not expressly granted by Licensor are hereby reserved, including but not limited to the rights described in Section 4(e).
39
+
40
+ 4. Restrictions. The license granted in Section 3 above is expressly made subject to and limited by the following restrictions:
41
+
42
+ You may Distribute or Publicly Perform the Work only under the terms of this License. You must include a copy of, or the Uniform Resource Identifier (URI) for, this License with every copy of the Work You Distribute or Publicly Perform. You may not offer or impose any terms on the Work that restrict the terms of this License or the ability of the recipient of the Work to exercise the rights granted to that recipient under the terms of the License. You may not sublicense the Work. You must keep intact all notices that refer to this License and to the disclaimer of warranties with every copy of the Work You Distribute or Publicly Perform. When You Distribute or Publicly Perform the Work, You may not impose any effective technological measures on the Work that restrict the ability of a recipient of the Work from You to exercise the rights granted to that recipient under the terms of the License. This Section 4(a) applies to the Work as incorporated in a Collection, but this does not require the Collection apart from the Work itself to be made subject to the terms of this License. If You create a Collection, upon notice from any Licensor You must, to the extent practicable, remove from the Collection any credit as required by Section 4(d), as requested. If You create an Adaptation, upon notice from any Licensor You must, to the extent practicable, remove from the Adaptation any credit as required by Section 4(d), as requested.
43
+ You may Distribute or Publicly Perform an Adaptation only under: (i) the terms of this License; (ii) a later version of this License with the same License Elements as this License; (iii) a Creative Commons jurisdiction license (either this or a later license version) that contains the same License Elements as this License (e.g., Attribution-NonCommercial-ShareAlike 3.0 US) ("Applicable License"). You must include a copy of, or the URI, for Applicable License with every copy of each Adaptation You Distribute or Publicly Perform. You may not offer or impose any terms on the Adaptation that restrict the terms of the Applicable License or the ability of the recipient of the Adaptation to exercise the rights granted to that recipient under the terms of the Applicable License. You must keep intact all notices that refer to the Applicable License and to the disclaimer of warranties with every copy of the Work as included in the Adaptation You Distribute or Publicly Perform. When You Distribute or Publicly Perform the Adaptation, You may not impose any effective technological measures on the Adaptation that restrict the ability of a recipient of the Adaptation from You to exercise the rights granted to that recipient under the terms of the Applicable License. This Section 4(b) applies to the Adaptation as incorporated in a Collection, but this does not require the Collection apart from the Adaptation itself to be made subject to the terms of the Applicable License.
44
+ You may not exercise any of the rights granted to You in Section 3 above in any manner that is primarily intended for or directed toward commercial advantage or private monetary compensation. The exchange of the Work for other copyrighted works by means of digital file-sharing or otherwise shall not be considered to be intended for or directed toward commercial advantage or private monetary compensation, provided there is no payment of any monetary compensation in con-nection with the exchange of copyrighted works.
45
+ If You Distribute, or Publicly Perform the Work or any Adaptations or Collections, You must, unless a request has been made pursuant to Section 4(a), keep intact all copyright notices for the Work and provide, reasonable to the medium or means You are utilizing: (i) the name of the Original Author (or pseudonym, if applicable) if supplied, and/or if the Original Author and/or Licensor designate another party or parties (e.g., a sponsor institute, publishing entity, journal) for attribution ("Attribution Parties") in Licensor's copyright notice, terms of service or by other reasonable means, the name of such party or parties; (ii) the title of the Work if supplied; (iii) to the extent reasonably practicable, the URI, if any, that Licensor specifies to be associated with the Work, unless such URI does not refer to the copyright notice or licensing information for the Work; and, (iv) consistent with Section 3(b), in the case of an Adaptation, a credit identifying the use of the Work in the Adaptation (e.g., "French translation of the Work by Original Author," or "Screenplay based on original Work by Original Author"). The credit required by this Section 4(d) may be implemented in any reasonable manner; provided, however, that in the case of a Adaptation or Collection, at a minimum such credit will appear, if a credit for all contributing authors of the Adaptation or Collection appears, then as part of these credits and in a manner at least as prominent as the credits for the other contributing authors. 
For the avoidance of doubt, You may only use the credit required by this Section for the purpose of attribution in the manner set out above and, by exercising Your rights under this License, You may not implicitly or explicitly assert or imply any connection with, sponsorship or endorsement by the Original Author, Licensor and/or Attribution Parties, as appropriate, of You or Your use of the Work, without the separate, express prior written permission of the Original Author, Licensor and/or Attribution Parties.
46
+ For the avoidance of doubt:
47
+
48
+ Non-waivable Compulsory License Schemes. In those jurisdictions in which the right to collect royalties through any statutory or compulsory licensing scheme cannot be waived, the Licensor reserves the exclusive right to collect such royalties for any exercise by You of the rights granted under this License;
49
+ Waivable Compulsory License Schemes. In those jurisdictions in which the right to collect royalties through any statutory or compulsory licensing scheme can be waived, the Licensor reserves the exclusive right to collect such royalties for any exercise by You of the rights granted under this License if Your exercise of such rights is for a purpose or use which is otherwise than noncommercial as permitted under Section 4(c) and otherwise waives the right to collect royalties through any statutory or compulsory licensing scheme; and,
50
+ Voluntary License Schemes. The Licensor reserves the right to collect royalties, whether individually or, in the event that the Licensor is a member of a collecting society that administers voluntary licensing schemes, via that society, from any exercise by You of the rights granted under this License that is for a purpose or use which is otherwise than noncommercial as permitted under Section 4(c).
51
+ Except as otherwise agreed in writing by the Licensor or as may be otherwise permitted by applicable law, if You Reproduce, Distribute or Publicly Perform the Work either by itself or as part of any Adaptations or Collections, You must not distort, mutilate, modify or take other derogatory action in relation to the Work which would be prejudicial to the Original Author's honor or reputation. Licensor agrees that in those jurisdictions (e.g. Japan), in which any exercise of the right granted in Section 3(b) of this License (the right to make Adaptations) would be deemed to be a distortion, mutilation, modification or other derogatory action prejudicial to the Original Author's honor and reputation, the Licensor will waive or not assert, as appropriate, this Section, to the fullest extent permitted by the applicable national law, to enable You to reasonably exercise Your right under Section 3(b) of this License (right to make Adaptations) but not otherwise.
52
+ 5. Representations, Warranties and Disclaimer
53
+
54
+ UNLESS OTHERWISE MUTUALLY AGREED TO BY THE PARTIES IN WRITING AND TO THE FULLEST EXTENT PERMITTED BY APPLICABLE LAW, LICENSOR OFFERS THE WORK AS-IS AND MAKES NO REPRESENTATIONS OR WARRANTIES OF ANY KIND CONCERNING THE WORK, EXPRESS, IMPLIED, STATUTORY OR OTHERWISE, INCLUDING, WITHOUT LIMITATION, WARRANTIES OF TITLE, MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE, NONINFRINGEMENT, OR THE ABSENCE OF LATENT OR OTHER DEFECTS, ACCURACY, OR THE PRESENCE OF ABSENCE OF ERRORS, WHETHER OR NOT DISCOVERABLE. SOME JURISDICTIONS DO NOT ALLOW THE EXCLUSION OF IMPLIED WARRANTIES, SO THIS EXCLUSION MAY NOT APPLY TO YOU.
55
+
56
+ 6. Limitation on Liability. EXCEPT TO THE EXTENT REQUIRED BY APPLICABLE LAW, IN NO EVENT WILL LICENSOR BE LIABLE TO YOU ON ANY LEGAL THEORY FOR ANY SPECIAL, INCIDENTAL, CONSEQUENTIAL, PUNITIVE OR EXEMPLARY DAMAGES ARISING OUT OF THIS LICENSE OR THE USE OF THE WORK, EVEN IF LICENSOR HAS BEEN ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.
57
+
58
+ 7. Termination
59
+
60
+ This License and the rights granted hereunder will terminate automatically upon any breach by You of the terms of this License. Individuals or entities who have received Adaptations or Collections from You under this License, however, will not have their licenses terminated provided such individuals or entities remain in full compliance with those licenses. Sections 1, 2, 5, 6, 7, and 8 will survive any termination of this License.
61
+ Subject to the above terms and conditions, the license granted here is perpetual (for the duration of the applicable copyright in the Work). Notwithstanding the above, Licensor reserves the right to release the Work under different license terms or to stop distributing the Work at any time; provided, however that any such election will not serve to withdraw this License (or any other license that has been, or is required to be, granted under the terms of this License), and this License will continue in full force and effect unless terminated as stated above.
62
+ 8. Miscellaneous
63
+
64
+ Each time You Distribute or Publicly Perform the Work or a Collection, the Licensor offers to the recipient a license to the Work on the same terms and conditions as the license granted to You under this License.
65
+ Each time You Distribute or Publicly Perform an Adaptation, Licensor offers to the recipient a license to the original Work on the same terms and conditions as the license granted to You under this License.
66
+ If any provision of this License is invalid or unenforceable under applicable law, it shall not affect the validity or enforceability of the remainder of the terms of this License, and without further action by the parties to this agreement, such provision shall be reformed to the minimum extent necessary to make such provision valid and enforceable.
67
+ No term or provision of this License shall be deemed waived and no breach consented to unless such waiver or consent shall be in writing and signed by the party to be charged with such waiver or consent.
68
+ This License constitutes the entire agreement between the parties with respect to the Work licensed here. There are no understandings, agreements or representations with respect to the Work not specified here. Licensor shall not be bound by any additional provisions that may appear in any communication from You. This License may not be modified without the mutual written agreement of the Licensor and You.
69
+ The rights granted under, and the subject matter referenced, in this License were drafted utilizing the terminology of the Berne Convention for the Protection of Literary and Artistic Works (as amended on September 28, 1979), the Rome Convention of 1961, the WIPO Copyright Treaty of 1996, the WIPO Performances and Phonograms Treaty of 1996 and the Universal Copyright Convention (as revised on July 24, 1971). These rights and subject matter take effect in the relevant jurisdiction in which the License terms are sought to be enforced according to the corresponding provisions of the implementation of those treaty provisions in the applicable national law. If the standard suite of rights granted under applicable copyright law includes additional rights not granted under this License, such additional rights are deemed to be included in the License; this License is not intended to restrict the license of any rights under applicable law.
70
+
@@ -0,0 +1,31 @@
1
+ LWAC Downloader
2
+ ===============
3
+ The system comprises two parts: a server which manages sample consistency and data storage, and a client which makes requests to remote resources and reports the results.
4
+
5
+ Clients and servers are not persistently connected during this process: a client will connect to a server, receive a job, disconnect, execute it, then reconnect to return the results. Multiple clients thus compete for the single connection, and compete to consume links from the server. Clients are expected to return to a server repeatedly until told to wait. They will then back off until more work is available.
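The connect, disconnect, work, reconnect cycle described above can be sketched roughly as follows. This is an illustration only: `FakeServer`, `check_out`, `check_in` and `run_client` are hypothetical names standing in for the real RPC interface, and the back-off sleep is omitted.

```ruby
# Hypothetical stand-in for the server: hands out batches of links until none remain.
class FakeServer
  attr_reader :results

  def initialize(links, batch_size)
    @pending = links.dup
    @batch_size = batch_size
    @results = []
  end

  # Check out up to batch_size links; an empty batch means "wait".
  def check_out
    @pending.shift(@batch_size)
  end

  # Accept the results of a finished batch.
  def check_in(results)
    @results.concat(results)
  end
end

# One client: take a job, work while disconnected, then return the results.
def run_client(server, max_idle_polls: 1)
  idle = 0
  while idle < max_idle_polls
    batch = server.check_out          # first short-lived connection
    if batch.empty?
      idle += 1                       # told to wait; a real client would back off here
      next
    end
    results = batch.map { |link| { link: link, body: "payload for #{link}" } }
    server.check_in(results)          # second short-lived connection
  end
end

server = FakeServer.new(%w[http://a.example http://b.example http://c.example], 2)
run_client(server)
puts server.results.length
```

Because the client holds no connection while it works, several clients can share one server by taking turns in this way.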
6
+
7
+ Time is not necessarily expected to be synchronised between server and client, but we recommend the use of NTP to ensure that times reported in the results of clients are reliable.
8
+
9
+
10
+ Dependencies
11
+ ------------
12
+ The client and server components have slightly different dependencies: the client need not perform any database lookups or complex storage operations, but it must be capable of performing HTTP and FTP requests.
13
+
14
+ ### Common dependencies
15
+
16
+ * Ruby 1.9.1+ (String#encode is required)
17
+ * simplerpc
18
+
19
+ ### Client
20
+
21
+ * cURL Ruby bindings (gem install -r curb)
22
+ * The 'gethostname' syscall
23
+
24
+ ### Server
25
+
26
+ * SQLite3 and Ruby bindings (gem install -r sqlite3)
27
+
28
+
29
+ Configuring
30
+ -----------
31
+ Please see the user documentation at docs/User for more information, as well as the sample documentation in configs/
@@ -0,0 +1,132 @@
1
+ #!/usr/bin/env ruby
2
+
3
+ # -----------------------------------------------------------------------------
4
+ # Load the launch tools and check gem prerequisites
5
+ require 'lwac'
6
+ require 'lwac/shared/launch_tools'
7
+ require 'lwac/shared/identity'
8
+
9
+ # Load config using launch_tools
10
+ tool, config = LWAC.load_config
11
+
12
+ # Announce version
13
+ LWAC::Identity::announce_version
14
+
15
+ # Summarise logs
16
+ $log.summarise_logging
17
+
18
+ # -----------------------------------------------------------------------------
19
+ # do stuff
20
+
21
+ case tool
22
+ # ---------------------------------------------------------------------------
23
+ # Server
24
+ # ---------------------------------------------------------------------------
25
+ when :server
26
+ require 'lwac/server'
27
+ require 'simplerpc/server'
28
+
29
+ # Fire up the server
30
+ server = LWAC::DownloadServer.new(config)
31
+ service = LWAC::DownloadService.new(server)
32
+
33
+ # construct the rpc handler
34
+ rpc = SimpleRPC::Server.new( service, config[:server] )
35
+
36
+ # listen
37
+ $log.info "Starting server on #{config[:server][:hostname]}:#{config[:server][:port]}"
38
+ loop{
39
+ begin
40
+ rpc.listen
41
+ rescue StandardError => e
42
+ $log.error "Error: #{e}"
43
+ $log.debug "#{e.backtrace.join("\n")}"
44
+ rescue SignalException => e
45
+ $log.fatal "Caught Signal: #{e}"
46
+ # $log.debug "#{e.backtrace.join("\n")}"
47
+ break
48
+ end
49
+
50
+ $log.info "Restarting server after a short delay..."
51
+ sleep(1)
52
+ }
53
+
54
+ # Ensure we exit cleanly after EM's done
55
+ server.close
56
+
57
+
58
+
59
+ # ---------------------------------------------------------------------------
60
+ # Client
61
+ # ---------------------------------------------------------------------------
62
+ when :client
63
+ require 'lwac/client'
64
+
65
+ begin
66
+ # Start the client going
67
+ dc = LWAC::DownloadClient.new(config)
68
+
69
+ # download
70
+ dc.work
71
+ rescue StandardError => e
72
+ $log.error "Error: #{e}"
73
+ $log.debug "#{e.backtrace.join("\n")}"
74
+ end
75
+
76
+
77
+ # ---------------------------------------------------------------------------
78
+ # Import Tool
79
+ # ---------------------------------------------------------------------------
80
+ when :import
81
+ file = nil
82
+ if ARGV[2] and File.exist?(ARGV[2]) and File.readable?(ARGV[2]) then
83
+ file = ARGV[2]
84
+ else
85
+ $log.fatal "Cannot read file to import: #{ARGV[2]}" if ARGV[2]
86
+ $log.fatal "Please provide a file to import!" if not ARGV[2]
87
+ exit(1)
88
+ end
89
+
90
+
91
+ require 'lwac/import'
92
+
93
+ begin
94
+ im = LWAC::Importer.new(config)
95
+ im.import(file)
96
+ rescue StandardError => e
97
+ $log.fatal "#{e.to_s}"
98
+ $log.debug "#{e.backtrace.join("\n")}"
99
+ end
100
+
101
+
102
+
103
+ # ---------------------------------------------------------------------------
104
+ # Export Tool
105
+ # ---------------------------------------------------------------------------
106
+ when :export
107
+ require 'lwac/export'
108
+
109
+ begin
110
+ # Construct the exporter object and load stuff from disk
111
+ ex = LWAC::Exporter.new(config)
112
+
113
+ # Dump stuff back to disk
114
+ ex.export
115
+
116
+ rescue StandardError => e
117
+ $log.error "Error: #{e}"
118
+ $log.debug "#{e.backtrace.join("\n")}"
119
+ end
120
+
121
+
122
+ # ---------------------------------------------------------------------------
123
+ # Else...
124
+ # ---------------------------------------------------------------------------
125
+ else
126
+ $log.error "Unknown tool: #{tool}."
127
+ $log.error "The code should also never reach this statement unless something is wrong."
128
+ exit(1)
129
+ end
130
+
131
+
132
+ $log.info "Goodbye"
@@ -0,0 +1,71 @@
1
+ Client Configuration
2
+ ====================
3
+ The client requests small batches of links from the server, downloads them from the web, and uploads the results back to the server for storage. Its config file is therefore concerned with network access, both to the server and to the web.
4
+
5
+
6
+
7
+ Server
8
+ ------
9
+ This section defines which server to connect to. This section contains configuration for [SimpleRPC](http://stephenwattam.com/projects/simplerpc/), and supports all features it does. Only the salient ones are documented here.
10
+
11
+ * `hostname` --- The IP address or hostname at which to contact the server
12
+ * `port` --- The port to use when contacting the server
13
+ * `password` --- Optional. The password to use for auth (must match server config)
14
+ * `secret` --- Optional. The encryption key to use when sending the password (must match server config)
15
+
16
+ For example:
17
+
18
+ :server:
19
+ :hostname: "127.0.0.1"
20
+ :port: 27401
21
+ :password: lwacpass
22
+ :secret: egrniognhre89n34ifnui4n8gf490
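The leading colons in the keys mean that Ruby's YAML parser reads them as symbols, which is why the launcher can index the config as `config[:server][:hostname]`. A minimal sketch, assuming the nesting shown (indentation is reconstructed here) and using `YAML.safe_load` rather than whatever loader LWAC itself uses:

```ruby
require 'yaml'

# Reconstructed example config; the indentation is an assumption.
text = <<~CONFIG
  :server:
    :hostname: "127.0.0.1"
    :port: 27401
    :password: lwacpass
CONFIG

# Symbols must be explicitly permitted when loading safely.
config = YAML.safe_load(text, permitted_classes: [Symbol])

puts config[:server][:hostname]
puts config[:server][:port]
```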
23
+
24
+ Network
25
+ -------
26
+ This section governs the manner in which clients attempt to contact the server, notably aggressiveness of retrying and polling for jobs. Clients implement a linear backoff system to ensure they do not over-compete for server resources when failing to perform transactions.
27
+
28
+ * `connect_timeout` --- How long we should give the socket to respond when connecting to the server
29
+ * `minimum_reconnect_time` --- The minimum time we should wait before reconnecting
30
+ * `maximum_reconnect_time` --- The maximum time we should wait before reconnecting, approached gradually from the minimum
31
+ * `connect_failure_penalty` --- The delay to add to the backoff time upon each failure, up to the `maximum_reconnect_time`
32
+
33
+ For example:
34
+
35
+ :network:
36
+ :connect_timeout: 20
37
+ :minimum_reconnect_time: 1
38
+ :maximum_reconnect_time: 240
39
+ :connect_failure_penalty: 3
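One way to read the linear backoff is sketched below. The helper `reconnect_delay` is hypothetical and the exact formula used by the client is an assumption: start at the minimum, add the penalty for each consecutive failure, and cap at the maximum.

```ruby
# Hypothetical helper: delay before the next reconnect after `failures`
# consecutive failures, under the assumed linear formula.
def reconnect_delay(failures, network)
  delay = network[:minimum_reconnect_time] +
          failures * network[:connect_failure_penalty]
  [delay, network[:maximum_reconnect_time]].min
end

network = {
  connect_timeout: 20,
  minimum_reconnect_time: 1,
  maximum_reconnect_time: 240,
  connect_failure_penalty: 3
}

delays = (0..3).map { |n| reconnect_delay(n, network) }
puts delays.inspect
puts reconnect_delay(100, network)
```

With the example values, the delay grows by 3 per failure and never exceeds 240.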
40
+
41
+ Client
42
+ ------
43
+ The client's limitations as a system are described here, as well as a way of identifying multiple clients run from a single host.
44
+
45
+ Clients check out batches of links, process them, then check in smaller batches (since the datapoints now have a large payload). The ratio of these sizes should be tuned in accordance with the filesizes being uploaded on a regular basis, and the degree of data security one wishes to ensure.
46
+
47
+ * `announce_progress` --- Boolean. Set to true to print worker status to the screen every half second during operation.
48
+ * `uuid_salt` --- A human-readable string to prepend the client UUID with. Each client computes its ID from the hostname, and this is a way of making the IDs more human-readable (as well as running multiple clients on the same host).
49
+ * `batch_capacity` --- How many links to check out and download in one batch. The client will receive up to this number of links to download each time it contacts the server.
50
+ * `check_in_size` --- How many datapoints to upload at once, in MB. Set to the `cache_limit` to make uploads go fastest, or below it to split them.
51
+ * `cache_limit` --- The approximate size of the cache used by the client (in bytes). After downloading this amount of data, the cache will be swapped out and uploaded in chunks to the server.
52
+ * `cache_dir` --- A directory to create file caches in. Reduces client RAM requirements, as the cache will store web data before upload. If you wish to use memory instead, leave this blank. At most two caches will be active at any one time, meaning memory limits will be:
53
+ * If using memory caching, `2 * cache_limit + simultaneous_workers * max_body_size`
54
+ * If using disk caching, `check_in_size + simultaneous_workers * max_body_size`
55
+ * `simultaneous_workers` --- The number of workers to run in the same pool. Given favourable network conditions, this many connections to websites will be open at once, so choose this number with the limitations of your kernel and netiquette in mind (especially if you have many links pointing at the same servers). Within each client, links are downloaded from servers by a series of workers, which consume links from the pending pool. This allows very high degrees of parallelism (beyond the point at which the kernel will start dropping connections) with relatively little overhead.
56
+
57
+ For example:
58
+
59
+ :client:
60
+ :announce_progress: true
61
+ :monitor_rate: 0.5
62
+ :uuid_salt: "LOCAL"
63
+ :batch_capacity: 1000
64
+ :cache_limit: 209715200
65
+ :check_in_size: 209715200
66
+ :simultaneous_workers: 200
67
+ :cache_dir: # nil to use RAM cache
68
+
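The two memory limits given for `cache_dir` can be worked through with a short sketch. The method name `client_memory_bound` is hypothetical, and the `max_body_size` value used here (1 MiB) is an assumption for illustration only, since that setting is configured elsewhere.

```ruby
# Hypothetical helper (not part of LWAC): evaluates the approximate memory
# bound formulas quoted above for RAM and disk caching.
def client_memory_bound(cache_limit:, check_in_size:, simultaneous_workers:,
                        max_body_size:, cache_dir: nil)
  # Each worker may hold up to one response body in memory at a time.
  worker_buffers = simultaneous_workers * max_body_size

  if cache_dir.nil? || cache_dir.to_s.empty?
    # RAM caching: at most two caches are active at once.
    2 * cache_limit + worker_buffers
  else
    # Disk caching: only the chunk being checked in is held in memory.
    check_in_size + worker_buffers
  end
end

settings = {
  cache_limit: 209_715_200,
  check_in_size: 209_715_200,
  simultaneous_workers: 200,
  max_body_size: 1_048_576 # assumed value, not taken from the example above
}

puts "RAM cache:  #{client_memory_bound(**settings)} bytes"
puts "Disk cache: #{client_memory_bound(**settings, cache_dir: '/tmp/lwac')} bytes"
```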
69
+ Logging
70
+ -------
71
+ The logging system is the same for all tools and shares a configuration format. For details, see [configuring logging](log_config.html).
@@ -0,0 +1,70 @@
1
+ LWAC Concepts
2
+ =============
3
+ This document describes the format and purpose of the corpus around which LWAC is based, and serves to describe how one would go about operationalising the data therein.
4
+
5
+
6
+ Overview
7
+ --------
8
+ LWAC is based around a central longitudinal corpus, stored in an arbitrary directory (as defined in the server config) in a serialised object format (JSON, YAML, or ruby's native binary format). This means the process of sampling is thus:
9
+
10
+ 1. Define a population of links
11
+ 2. Sample them with as small a time differential as possible
12
+ 3. Wait until the next sample time
13
+ 4. Go to step 2
14
+
15
+ A `link` with web data attached is known as a `datapoint` in this documentation. A `sample` is one cross-sectional attempt to download all links.
16
+
17
+ Theoretically, this forms three levels at which we may access the data:
18
+
19
+ 1. Server level, describing the whole sample (all links, samples, and datapoints, a conventional longitudinal sample);
20
+ 2. Sample level, describing one attempt to download links (a conventional cross-sectional sample containing many datapoints);
21
+ 3. Datapoint level, describing one attempt to download a single link.
22
+
23
+ Simply, a server has many samples, each of which has many datapoints. Samples are temporally homogeneous to the greatest extent possible, and datapoints refer to the same URI (for their id). Since the cumulative time taken to download each sample introduces some drift, links are downloaded as intensively as possible.
24
+
25
+ The corpus, as stored on disk, consists of two types of storage:
26
+
27
+ 1. Metadata, stored in an SQLite database, which contains a table simply listing the links along with a unique ID.
28
+ 2. Corpus data, stored in a flatfile structure as serialised ruby DataPoint objects.
29
+
30
+ The format of the corpus is described in greater detail in the rest of this document.
31
+
32
+
33
+ The Corpus
34
+ ----------
35
+
36
+ ### Structure
37
+ The corpus itself is structured as a root directory, containing a specific structure:
38
+
39
+ root/
40
+ root/database.db
41
+ root/state
42
+ root/files/sample_id/sample
43
+ root/files/sample_id/1/2/3/456
44
+
45
+
46
+ The corpus includes:
47
+
48
+ * The metadata database (if using SQLite3)
49
+ * The state of the current sample. This is stored as a serialised ruby object so that a sample may be resumed later if the server is stopped.
50
+ * A list of sample ID folders containing:
51
+ * A file describing the properties of this sample as a serialised ruby Sample object
52
+ * A structure of directories describing link IDs, each of which has up to N files within it (as defined in the server config). This structure uses the first characters of the ID to nest directories in order to avoid filesystem limits on inode size and speed up random access, i.e.:
53
+ 0/1/1
54
+ 0/1/2
55
+ 0/1/3
56
+ 0/2/1
57
+ 0/2/2
58
+ 0/2/3
59
+ etc.
60
+
61
+
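The nesting scheme can be illustrated with a short sketch that reproduces the `root/files/sample_id/1/2/3/456` layout shown above. The method name `datapoint_path` and the `depth` parameter are hypothetical, and the real server may split IDs differently, so treat this as an illustration of the idea rather than the actual storage code.

```ruby
# Hypothetical helper (not part of LWAC): maps a link ID to a nested path by
# turning its leading characters into directories.
# Assumes the ID has more characters than `depth`.
def datapoint_path(root, sample_id, link_id, depth: 3)
  id = link_id.to_s
  nested = id[0, depth].chars # leading characters, one directory each
  rest = id[depth..-1].to_s   # remaining characters name the file
  File.join(root, "files", sample_id.to_s, *nested, rest)
end

puts datapoint_path("root", 7, 123456)
```

Spreading files across nested directories in this way keeps any single directory small, which is what makes random access faster.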
62
+ ### File Formats
63
+ Each of the files within a corpus, with the exception of the metadata database, is a serialised ruby object, as defined in `lib/lwac/shared/data_types.rb`. These objects are serialised using one of three formats:
64
+
65
+ * `:marshal` --- Ruby's native binary format is very fast but cannot realistically be read from other languages
66
+ * `:json` --- JSON is widely used but slow
67
+ * `:yaml` --- YAML is also readable by other languages, but is slow (around 60 times slower than `:marshal`)
68
+
69
+ Note that a corpus written using one serialisation system will be unreadable by a server using another. I highly recommend using `:marshal` and using the export tool to extract your data into a more workable format later.
70
+
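To see why the three formats are not interchangeable on disk, the sketch below round-trips one record through each using only Ruby's standard library. The record is a plain hash invented for illustration, not one of LWAC's own data types, and the constant names are hypothetical.

```ruby
require 'json'
require 'yaml'

# Illustrative record only; LWAC's real objects are defined in its data types file.
RECORD = { "id" => 123456, "uri" => "http://example.com/" }.freeze

# One dump/load pair per serialisation format.
SERIALISERS = {
  marshal: { dump: ->(obj) { Marshal.dump(obj) },   load: ->(str) { Marshal.load(str) } },
  json:    { dump: ->(obj) { JSON.generate(obj) },  load: ->(str) { JSON.parse(str) } },
  yaml:    { dump: ->(obj) { YAML.dump(obj) },      load: ->(str) { YAML.load(str) } }
}.freeze

SERIALISERS.each do |format, s|
  data = s[:dump].call(RECORD)
  intact = s[:load].call(data) == RECORD
  puts "#{format}: #{data.bytesize} bytes, round trip intact: #{intact}"
end
```

Each format reads back its own output, but the bytes written differ between formats, which is why a server must be configured with the same format the corpus was written in.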
@@ -0,0 +1,40 @@
1
+ Reading Configuration Documentation
2
+ ===================================
3
+ Configuration files in LWAC are valid YAML files, and typically follow a hash structure. As such they resemble large trees, with occasional lists and many small textual elements.
4
+
5
+ A first place to look when interpreting these will be [the YAML overview at Wikipedia](http://en.wikipedia.org/wiki/YAML), which will help familiarise you with the format. The rest of this document is about how I refer to keys and values in this documentation.
6
+
7
+ Keys
8
+ ----
9
+ YAML keys will be specified as paths from root, loosely following XPath notation: `/key1/key2/key3` will denote
10
+
11
+ ---
12
+ :key1:
13
+ :key2:
14
+ :key3: value
15
+
16
+
17
+ Where YAML files contain lists, aspect specifiers will be used in a similar manner to C-like languages (zero-based), i.e. `/key1/key2[2]` refers to `orange`:
18
+
19
+ ---
20
+ :key1:
21
+ :key2:
22
+ - apple
23
+ - banana
24
+ - orange
25
+
26
+ Where YAML keys contain hashes that must be retained as simple key-value pairs (e.g. for SQLite pragma settings), curly braces will be used as aspect specifiers, i.e. `/key1/key2{orange}` will refer to `bob`:
27
+
28
+ ---
29
+ :key1:
30
+ :key2:
31
+ apple: adolf
32
+ banana: martin
33
+ orange: bob
34
+
35
+
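The path notation can be made concrete with a small resolver. This is a sketch for reading the documentation only: the method name `resolve` is hypothetical and LWAC itself does not necessarily parse paths this way. It assumes path segments are symbol keys, list specifiers are zero-based integers, and curly-brace specifiers are plain string keys, as in the examples above.

```ruby
# Hypothetical helper (not part of LWAC): looks up a documentation-style
# path such as "/key1/key2[2]" or "/key1/key2{orange}" in a nested hash.
def resolve(config, path)
  path.split("/").reject(&:empty?).reduce(config) do |node, segment|
    # Split each segment into a key plus an optional [index] or {name} specifier.
    m = segment.match(/\A([^\[\{]+)(?:\[(\d+)\]|\{([^}]+)\})?\z/)
    raise ArgumentError, "cannot parse segment: #{segment}" unless m

    value = node.fetch(m[1].to_sym) # path segments are ruby symbols
    if m[2]
      value.fetch(m[2].to_i)        # zero-based list index
    elsif m[3]
      value.fetch(m[3])             # plain string key in a simple hash
    else
      value
    end
  end
end

list_config = { key1: { key2: ["apple", "banana", "orange"] } }
hash_config = { key1: { key2: { "apple" => "adolf", "banana" => "martin", "orange" => "bob" } } }

puts resolve(list_config, "/key1/key2[2]")
puts resolve(hash_config, "/key1/key2{orange}")
```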
36
+ Symbols and Data Formats
37
+ ------------------------
38
+ Most, though not all, keys in LWAC configuration files are ruby symbols, and thus are prefixed with a colon (:). Those that are specifically not symbols are noted as such in the text, as it is normally relevant to their use.
39
+
40
+ Booleans and other binary options are also noted in the text. It's worth noting that ruby considers `nil` to be false in boolean tests.
@@ -0,0 +1,52 @@
1
+ #!/usr/bin/env ruby
2
+
3
+ #
4
+ # Compiles all nearby markdown into HTML, and pops it in ./html_docs
5
+ #
6
+
7
+ require 'markdown'
8
+ require 'fileutils'
9
+ require 'erb'
10
+
11
+ input_dir = "./user/*"
12
+ output_dir = "./html_docs"
13
+ TEMPLATE = "template.rhtml"
14
+
15
+ # for version
16
+ require '../lib/lwac.rb'
17
+
18
+
19
+
20
+ if File.exist?(output_dir) then
21
+ $stderr.puts "Output directory exists (#{output_dir}) --- please delete and run again."
22
+ exit(1)
23
+ end
24
+
25
+
26
+ def template(version, filename, content, pages)
27
+ return ERB.new(File.read(TEMPLATE)).result(binding)
28
+ end
29
+
30
+ # create output dir
31
+ FileUtils.mkdir_p(output_dir)
32
+
33
+ # create list of pages
34
+ pages = Dir.glob(input_dir).to_a.delete_if{|f| File.extname(f)[1..-1] != "md" or File.directory?(f)}.map{|f| File.basename(f).to_s[0..-(File.extname(f).length + 1)]}
35
+
36
+ puts "LIST: #{pages.join(', ')}"
37
+
38
+ Dir.glob(input_dir){|f|
39
+ if not File.directory?(f) and File.extname(f) == ".md" then
40
+ puts "Compiling #{f}..."
41
+
42
+ File.open(File.join(output_dir, File.basename(f)[0..-(File.extname(f).length + 1)] + ".html"), 'w'){|of|
43
+ of.write( template(LWAC::VERSION, File.basename(f), Markdown.new(File.read(f)).to_html, pages ))
44
+ }
45
+ elsif f != $0 and File.basename(f) != File.basename(output_dir)
46
+ puts "Copying #{f}..."
47
+ FileUtils.cp_r(f, File.join(output_dir, File.basename(f)))
48
+ end
49
+ }
50
+
51
+ puts "Done."
52
+