unbreakable 0.0.1

data/.gitignore ADDED
@@ -0,0 +1,6 @@
1
+ *.gem
2
+ .bundle
3
+ .yardoc
4
+ Gemfile.lock
5
+ doc/*
6
+ pkg/*
data/Gemfile ADDED
@@ -0,0 +1,4 @@
1
+ source "http://rubygems.org"
2
+
3
+ # Specify your gem's dependencies in unbreakable.gemspec
4
+ gemspec
data/LICENSE ADDED
@@ -0,0 +1,20 @@
1
+ Copyright (c) 2011 Open North Inc.
2
+
3
+ Permission is hereby granted, free of charge, to any person obtaining
4
+ a copy of this software and associated documentation files (the
5
+ "Software"), to deal in the Software without restriction, including
6
+ without limitation the rights to use, copy, modify, merge, publish,
7
+ distribute, sublicense, and/or sell copies of the Software, and to
8
+ permit persons to whom the Software is furnished to do so, subject to
9
+ the following conditions:
10
+
11
+ The above copyright notice and this permission notice shall be
12
+ included in all copies or substantial portions of the Software.
13
+
14
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
15
+ EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
16
+ MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
17
+ NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE
18
+ LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION
19
+ OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION
20
+ WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
data/README.md ADDED
@@ -0,0 +1,60 @@
1
+ # Unbreakable
2
+
3
+ Unbreakable is a Ruby gem that abstracts and bulletproofs common web scraping tasks. It forces a separation of concerns for maximum flexibility. Loose coupling allows for easier modification and re-use of component parts.
4
+
5
+ # Installation
6
+
7
+     gem install unbreakable
8
+
9
+ # What's the problem?
10
+
11
+ A common web scraping project involves four steps. As an illustrative example, we'll scrape the language with the most articles on Wikipedia using standard command-line tools:
12
+
13
+ 1. Retrieve some raw HTML
14
+
15
+         # Download the list of Wikipedias
16
+         curl -s -o in.html http://s23.org/wikistats/wikipedias_html
17
+
18
+ 1. Process the raw HTML into a machine-readable format
19
+
20
+         # Extract the language with the most articles
21
+         grep '><td class="number">1<' in.html | sed 's/.*e">\([^<]*\).*/\1/' > out.txt
22
+
23
+ 1. Release the data to the community through an API and/or as a download
24
+
25
+         # Upload the machine-readable data to a public server
26
+         curl http://pastie.org/pastes -F "paste[parser_id]=6" -F "paste[authorization]=burger" \
27
+           -F "paste[body]=`cat out.txt`" -s -o /dev/null -L -w "%{url_effective}"
28
+
29
+ 1. Use the data as you like
30
+
31
+         echo "The most popular language is `curl -s http://pastie.org/pastes/2487244/download`."
32
+
33
+ In most web scraping projects, at least one step is tightly coupled to another, making modification or re-use of individual steps by the community difficult. It is especially common for authors to tailor the workflow to their specific use of the data. The coupling produces esoteric code, with the domain logic of the author's use case slipping into the otherwise generic code for retrieving and processing data. Because the scrapers are embedded in a larger project, they are often undiscoverable.
34
+
35
+ Furthermore, how the first two steps store data may be incompatible with some environments. If the processor code stores data in a database, but you prefer flat files for your use case, you may have to do a long refactor.
36
+
37
+ # What's the solution?
38
+
39
+ Web scraping projects should write standalone downloaders, processors, APIs and apps.
40
+
41
+ Retrieving should be separate from processing, if only to avoid hammering remote servers while developing or tweaking a processor. This separation also allows the community to develop multiple processors of the same raw data without duplication of effort.
42
+
43
+ Standalone components are easier for the community to discover, modify and re-use, as they do not need to concern themselves with the other parts of the workflow or expose themselves to the use case of the original author.
44
+
45
+ The code for retrieving and processing data should delegate the persistence of data to a storage layer. The community can then develop various, swappable storage adapters and will not be bound to any single solution.
46
+
47
+ Unbreakable helps you write standalone downloaders and processors and provides an extensible persistence layer.
48
+
49
+ # Getting started
50
+
51
+ For now, the best way to learn how to use this gem is to read the documentation.
52
+
53
+     rake yard
54
+     open doc/index.html
55
+
56
+ # Bugs? Questions?
57
+
58
+ Unbreakable's main repository is on GitHub: [http://github.com/opennorth/unbreakable](http://github.com/opennorth/unbreakable), where your contributions, forks, bug reports, feature requests, and feedback are greatly welcomed.
59
+
60
+ Copyright (c) 2011 Open North Inc., released under the MIT license
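As a rough illustration of the separation the README argues for, the following sketch keeps retrieval, processing, and storage in separate objects. All class names here (`MemoryStore`, `Downloader`, `Processor`) are hypothetical and are not part of Unbreakable's API. The raw HTML is supplied by a lambda rather than fetched over the network, so the example stays self-contained.

```ruby
# Hypothetical sketch: none of these classes belong to Unbreakable.

# A storage adapter only has to respond to #write and #read, so an
# in-memory hash could later be swapped for flat files or a database.
class MemoryStore
  def initialize
    @records = {}
  end

  def write(key, value)
    @records[key] = value
  end

  def read(key)
    @records.fetch(key)
  end
end

# Retrieves raw content and hands it to a store; knows nothing about parsing.
class Downloader
  def initialize(store, source)
    @store = store
    @source = source # any object responding to #call that returns raw content
  end

  def retrieve(key)
    @store.write(key, @source.call)
  end
end

# Reads raw content from one store and writes structured data to another.
class Processor
  def initialize(raw_store, data_store)
    @raw_store = raw_store
    @data_store = data_store
  end

  def process(key)
    html = @raw_store.read(key)
    @data_store.write(key, :title => html[/<title>(.*?)<\/title>/m, 1])
  end
end

raw = MemoryStore.new
data = MemoryStore.new
Downloader.new(raw, lambda { '<html><title>Example</title></html>' }).retrieve('index.html')
Processor.new(raw, data).process('index.html')
p data.read('index.html')
```

Because the downloader and the processor only share a store, the processor can be rerun or replaced without downloading anything again.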
data/Rakefile ADDED
@@ -0,0 +1,16 @@
1
+ require 'bundler'
2
+ Bundler::GemHelper.install_tasks
3
+
4
+ require 'rspec/core/rake_task'
5
+ RSpec::Core::RakeTask.new(:spec)
6
+
7
+ task :default => :spec
8
+
9
+ begin
10
+ require 'yard'
11
+ YARD::Rake::YardocTask.new
12
+ rescue LoadError
13
+ task :yard do
14
+ abort 'YARD is not available. In order to run yard, you must: gem install yard'
15
+ end
16
+ end
data/USAGE ADDED
@@ -0,0 +1 @@
1
+ See README.md for full usage details.
data/lib/unbreakable.rb ADDED
@@ -0,0 +1,66 @@
1
+ require 'dragonfly'
2
+
3
+ # When using this gem, you'll start by defining a {Scraper}, with methods for
4
+ # retrieving and processing data. The data will be stored in {DataStorage};
5
+ # this gem currently provides only a {DataStorage::FileDataStore FileDataStore}.
6
+ # You may enhance a datastore with {Decorators} and {Observers}: for example,
7
+ # a {Decorators::Timeout Timeout} decorator to retry on timeout with exponential
8
+ # backoff and a {Observers::Log Log} observer which logs retrieval progress.
9
+ # Of course, you must also define a {Processors::Transform Processor} to turn
10
+ # your raw data into machine-readable data.
11
+ #
12
+ # A skeleton scraper:
13
+ #
14
+ # require 'unbreakable'
15
+ #
16
+ # class MyScraper < Unbreakable::Scraper
17
+ # def retrieve
18
+ # # download all the documents
19
+ # end
20
+ # def processable
21
+ # # return a list of documents to process
22
+ # end
23
+ # end
24
+ #
25
+ # class MyProcessor < Unbreakable::Processors::Transform
26
+ # def perform(temp_object)
27
+ # # return the transformed record as a hash, array, etc.
28
+ # end
29
+ # def persist(temp_object, arg)
30
+ # # store the hash/array/etc. in Mongo, MySQL, YAML, etc.
31
+ # end
32
+ # end
33
+ #
34
+ # scraper = MyScraper.new
35
+ # scraper.processor.register MyProcessor
36
+ # scraper.configure do |c|
37
+ # # configure the scraper
38
+ # end
39
+ # scraper.run(ARGV)
40
+ #
41
+ # Every scraper script can run as a command-line script. Try it!
42
+ #
43
+ # ruby myscraper.rb
44
+ module Unbreakable
45
+ autoload :Scraper, 'unbreakable/scraper'
46
+
47
+ module Processors
48
+ autoload :Transform, 'unbreakable/processors/transform'
49
+ end
50
+
51
+ module Observers
52
+ autoload :Observer, 'unbreakable/observers/observer'
53
+ autoload :Log, 'unbreakable/observers/log'
54
+ end
55
+
56
+ module Decorators
57
+ autoload :Timeout, 'unbreakable/decorators/timeout'
58
+ end
59
+
60
+ module DataStorage
61
+ autoload :FileDataStore, 'unbreakable/data_storage/file_data_store'
62
+ end
63
+
64
+ class UnbreakableError < StandardError; end
65
+ class InvalidRemoteFile < UnbreakableError; end
66
+ end
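The two error classes at the end of this file form a small hierarchy, so a caller can rescue `UnbreakableError` and catch `InvalidRemoteFile` as well. The sketch below copies only those two class definitions so that it runs on its own. The `fetch_or_report` helper and the error message are hypothetical and are not part of the gem.

```ruby
# The two error classes are copied from the file above so this runs standalone.
module Unbreakable
  class UnbreakableError < StandardError; end
  class InvalidRemoteFile < UnbreakableError; end
end

# Hypothetical helper: runs the block and turns any library error into a string.
# Errors outside the Unbreakable hierarchy are not rescued.
def fetch_or_report
  yield
rescue Unbreakable::UnbreakableError => e
  "#{e.class.name}: #{e.message}"
end

puts fetch_or_report { raise Unbreakable::InvalidRemoteFile, 'empty response' }
```

Rescuing the base class keeps calling code stable if more specific error classes are added later.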
data/lib/unbreakable/data_storage/file_data_store.rb ADDED
@@ -0,0 +1,139 @@
1
+ require 'observer'
2
+
3
+ module Unbreakable
4
+ module DataStorage
5
+ # Stores files to the filesystem. To configure:
6
+ #
7
+ # scraper.configure do |c|
8
+ # c.datastore = Unbreakable::DataStorage::FileDataStore.new(scraper,
9
+ # :decorators => [:timeout], # optional
10
+ # :observers => [:log]) # optional
11
+ # c.datastore.root_path = '/path/dir' # default '/var/tmp/unbreakable'
12
+ # c.datastore.store_meta = true # default false
13
+ # end
14
+ class FileDataStore < Dragonfly::DataStorage::FileDataStore
15
+ include Observable
16
+ include Dragonfly::Loggable
17
+
18
+ # Decorators should be able to add configuration variables.
19
+ public_class_method :configurable_attr
20
+
21
+ # Configure the datastore to overwrite files upon repeated download.
22
+ #
23
+ # scraper.configure do |c|
24
+ # c.datastore.clobber = true # default false
25
+ # end
26
+ #
27
+ # @return [Boolean, Proc, lambda] whether to overwrite files upon repeated
28
+ # download
29
+ configurable_attr :clobber, false
30
+
31
+ # @param [Dragonfly::App] app
32
+ # @param [Hash] opts
33
+ # @option opts [Module, Symbol, Array<Module, Symbol>] :decorators
34
+ # a module, the name of a decorator module, or an array of such
35
+ # @option opts [Class, Symbol, Array<Class, Symbol>] :observers
36
+ # a class, the name of an observer class, or an array of such
37
+ def initialize(app, opts = {})
38
+ use_same_log_as(app)
39
+ use_as_fallback_config(app)
40
+ if opts[:decorators]
41
+ Array(opts[:decorators]).each do |decorator|
42
+ extend Symbol === decorator ? Unbreakable::Decorators.const_get(decorator.capitalize) : decorator
43
+ end
44
+ end
45
+ if opts[:observers]
46
+ Array(opts[:observers]).each do |observer|
47
+ add_observer Symbol === observer ? Unbreakable::Observers.const_get(observer.capitalize).new(self) : observer.new(self)
48
+ end
49
+ end
50
+ end
51
+
52
+ # Stores a record in the datastore. This method does lazy evaluation of
53
+ # the record's contents, e.g.:
54
+ #
55
+ # defer_store(:path => 'index.html') do
56
+ # open('http://www.example.com/').read
57
+ # end
58
+ #
59
+ # The +open+ method is called only if the record hasn't already been
60
+ # downloaded or if the datastore has been configured to overwrite files
61
+ # upon repeated download.
62
+ #
63
+ # @param [Hash] opts
64
+ # @option opts [Hash] :meta any file metadata, e.g. bitrate
65
+ # @option opts [String] :path the relative path at which to store the file
66
+ # @param [Proc] block a block that yields the contents of the file
67
+ # @raise [Dragonfly::DataStorage::UnableToStore] if permission is denied
68
+ # @return [String] the relative path to the file
69
+ # @see Dragonfly::DataStorage::FileDataStore#store
70
+ def defer_store(opts = {}, &block)
71
+ meta = opts[:meta] || {}
72
+ relative_path = if opts[:path]
73
+ opts[:path]
74
+ else
75
+ filename = meta[:name] || 'file'
76
+ relative_path_for(filename)
77
+ end
78
+
79
+ changed
80
+ if empty?(relative_path) or clobber?(relative_path)
81
+ begin
82
+ path = absolute(relative_path)
83
+ prepare_path(path)
84
+ string = yield_block(relative_path, &block)
85
+ Dragonfly::TempObject.new(string).to_file(path).close
86
+ store_meta_data(path, meta) if store_meta
87
+ notify_observers :store, relative_path, string
88
+ relative(path)
89
+ rescue InvalidRemoteFile => e
90
+ log.error e.message
91
+ rescue Errno::EACCES => e
92
+ raise Dragonfly::DataStorage::UnableToStore, e.message
93
+ end
94
+ else
95
+ notify_observers :skip, relative_path
96
+ end
97
+ end
98
+
99
+ # Returns all filenames matching a pattern, if given.
100
+ # @param [String] pattern a glob pattern to match filenames with
101
+ # @return [Array<String>] an array of matching filenames
102
+ def records(pattern = nil)
103
+ if pattern
104
+ Dir[File.join(root_path, '**', pattern)]
105
+ else
106
+ Dir[File.join(root_path, '**', '*')]
107
+ end.map do |absolute_path|
108
+ relative absolute_path
109
+ end
110
+ end
111
+
112
+ private
113
+
114
+ # @param [String] relative_path the relative path to the file
115
+ # @return [Boolean] whether the file is empty or non-existent
116
+ def empty?(relative_path)
117
+ path = absolute(relative_path)
118
+ !File.exist?(path) || File.size(path).zero?
119
+ end
120
+
121
+ # @param [String] relative_path the relative path to the file
122
+ # @return [Boolean] whether to overwrite any existing file
123
+ def clobber?(relative_path)
124
+ if clobber.respond_to? :call
125
+ clobber.call(relative_path)
126
+ else
127
+ !!clobber
128
+ end
129
+ end
130
+
131
+ # Yields a block.
132
+ # @param [String] relative_path the relative path to the file
133
+ # @return [String] the contents of the file
134
+ def yield_block(relative_path)
135
+ yield
136
+ end
137
+ end
138
+ end
139
+ end
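The lazy evaluation described for `defer_store` can be sketched without Dragonfly. The `LazyFileStore` class below is a hypothetical stand-in written for illustration, not the gem's `FileDataStore`: it calls the block only when the target file is missing or empty, or when `clobber` allows overwriting. Returning `:store` or `:skip` is an assumption made here to keep the example observable.

```ruby
require 'fileutils'
require 'tmpdir'

# Hypothetical stand-in for a datastore; not Unbreakable's FileDataStore.
class LazyFileStore
  # true, false, or any object responding to #call(relative_path)
  attr_accessor :clobber

  def initialize(root_path, clobber = false)
    @root_path = root_path
    @clobber = clobber
  end

  # Writes the block's return value to relative_path, but only calls the
  # block when the file is missing, empty, or may be overwritten.
  # Returns :store if the file was written and :skip otherwise.
  def defer_store(relative_path)
    path = File.join(@root_path, relative_path)
    if empty?(path) || clobber?(relative_path)
      FileUtils.mkdir_p(File.dirname(path))
      File.open(path, 'w') { |file| file.write(yield) }
      :store
    else
      :skip
    end
  end

  private

  def empty?(path)
    !File.exist?(path) || File.size(path).zero?
  end

  def clobber?(relative_path)
    clobber.respond_to?(:call) ? clobber.call(relative_path) : !!clobber
  end
end

Dir.mktmpdir do |dir|
  store = LazyFileStore.new(dir)
  downloads = 0
  2.times { store.defer_store('index.html') { downloads += 1; '<html></html>' } }
  puts downloads # the second call is skipped, so the block ran once
end
```

Passing the content as a block is what makes the skip cheap: an expensive download inside the block never starts when the file is already cached.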
data/lib/unbreakable/decorators/timeout.rb ADDED
@@ -0,0 +1,47 @@
1
+ require 'timeout'
2
+
3
+ module Unbreakable
4
+ module Decorators
5
+ # Catches timeouts and retries with exponential backoff. To configure:
6
+ #
7
+ # scraper.configure do |c|
8
+ # c.datastore.retry_limit = 5 # the maximum number of retries
9
+ # c.datastore.timeout_length = 60 # the timeout length
10
+ # end
11
+ #
12
+ module Timeout
13
+ # @param [Object] obj the object being extended with this module
14
+ def self.extended(obj)
15
+ obj.class.instance_eval do
16
+ configurable_attr :retry_limit, 5
17
+ configurable_attr :timeout_length, 60
18
+ end
19
+ end
20
+
21
+ private
22
+
23
+ # (see DataStorage::FileDataStore#yield_block)
24
+ def yield_block(relative_path)
25
+ retry_attempt = 0
26
+ begin
27
+ retry_attempt += 1
28
+ ::Timeout::timeout(timeout_length) do
29
+ super
30
+ end
31
+ rescue ::Timeout::Error
32
+ if retry_attempt < retry_limit
33
+ log.warn "Timeout on #{relative_path}, retrying in #{retry_delay(retry_attempt)} (#{retry_attempt}/#{retry_limit})"
34
+ sleep retry_delay(retry_attempt)
35
+ retry
36
+ else
37
+ log.error "Timeout on #{relative_path}, skipping"
38
+ end
39
+ end
40
+ end
41
+
42
+ def retry_delay(retry_attempt)
43
+ 2 ** retry_attempt
44
+ end
45
+ end
46
+ end
47
+ end
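The retry loop in this decorator follows a common pattern: run the work under a timeout, and on each timeout wait `2 ** attempt` seconds before trying again, up to a limit. A minimal standalone sketch of that pattern is shown here. The `with_backoff` helper and its `sleeper` parameter are hypothetical (the latter exists only so the waiting can be stubbed out), and the timeouts are simulated by raising `Timeout::Error` directly.

```ruby
require 'timeout'

# Hypothetical helper, not part of Unbreakable. Runs the block under a
# timeout and retries with exponential backoff (2, 4, 8, ... seconds).
# Returns the block's value, or nil once retry_limit attempts have timed out.
def with_backoff(retry_limit = 5, timeout_length = 60, sleeper = Kernel.method(:sleep))
  attempt = 0
  begin
    attempt += 1
    Timeout.timeout(timeout_length) { yield attempt }
  rescue Timeout::Error
    if attempt < retry_limit
      sleeper.call(2 ** attempt)
      retry
    end
    nil
  end
end

delays = []
result = with_backoff(5, 60, lambda { |seconds| delays << seconds }) do |attempt|
  raise Timeout::Error, 'simulated timeout' if attempt < 3
  "succeeded on attempt #{attempt}"
end
puts result
p delays
```

With the simulated failures above, the block succeeds on the third attempt after recorded delays of 2 and 4 seconds.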
data/lib/unbreakable/observers/log.rb ADDED
@@ -0,0 +1,19 @@
1
+ module Unbreakable
2
+ module Observers
3
+ # Logs debug messages when files are stored or skipped if the observed
4
+ # object has a +#log+ method.
5
+ class Log < Observer
6
+ # (see Observer#update)
7
+ def update(method, *args)
8
+ if observed.respond_to? :log
9
+ case method
10
+ when :store
11
+ observed.log.debug "Store #{args.first}"
12
+ when :skip
13
+ observed.log.debug "Skip #{args.first}"
14
+ end
15
+ end
16
+ end
17
+ end
18
+ end
19
+ end
data/lib/unbreakable/observers/observer.rb ADDED
@@ -0,0 +1,27 @@
1
+ module Unbreakable
2
+ module Observers
3
+ # Abstract class for observers following the Ruby
4
+ # {http://ruby-doc.org/stdlib/libdoc/observer/rdoc/index.html stdlib}
5
+ # implementation of the _Observer_ object-oriented design pattern. See
6
+ # {Unbreakable::Observers::Log} for an example.
7
+ #
8
+ # The following instance methods must be implemented in sub-classes:
9
+ #
10
+ # * +update+
11
+ class Observer
12
+ attr_reader :observed
13
+
14
+ # @param observed the observed object
15
+ def initialize(observed)
16
+ @observed = observed
17
+ end
18
+
19
+ # @param [Symbol] method the method called on the observed object
20
+ # @param [Array] args the arguments to the method
21
+ # @return [void]
22
+ def update(method, *args)
23
+ raise NotImplementedError
24
+ end
25
+ end
26
+ end
27
+ end
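For readers unfamiliar with the stdlib `Observable` module that these observers rely on, here is a small sketch of the mechanism. `Downloads` and `Recorder` are hypothetical classes written for this example. The sketch assumes the `observer` library is available (it ships with Ruby, as a bundled gem in recent versions).

```ruby
require 'observer'

# Hypothetical observed object: announces each stored or skipped path.
class Downloads
  include Observable

  def store(path)
    changed # without this call, notify_observers does nothing
    notify_observers :store, path
  end

  def skip(path)
    changed
    notify_observers :skip, path
  end
end

# Hypothetical observer: Observable calls #update with the notification arguments.
class Recorder
  attr_reader :events

  def initialize
    @events = []
  end

  def update(method, *args)
    @events << [method, *args]
  end
end

downloads = Downloads.new
recorder = Recorder.new
downloads.add_observer(recorder)
downloads.store('index.html')
downloads.skip('index.html')
p recorder.events
```

The observed object never refers to `Recorder` by name, which is what lets logging or progress reporting be added without touching the storage code.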
data/lib/unbreakable/processors/transform.rb ADDED
@@ -0,0 +1,60 @@
1
+ module Unbreakable
2
+ module Processors
3
+ # You may implement a transform process by subclassing this class:
4
+ #
5
+ # require 'nokogiri'
6
+ # class MyProcessor < Unbreakable::Processors::Transform
7
+ # # Extracts the page title from an HTML page.
8
+ # def perform(temp_object)
9
+ # Nokogiri::HTML(temp_object.data).at_css('title').text
10
+ # end
11
+ #
12
+ # # Saves the page title to an external database.
13
+ # def persist(temp_object, arg)
14
+ # MyModel.create(:title => arg)
15
+ # end
16
+ # end
17
+ # scraper.processor.register MyProcessor
18
+ #
19
+ # The following instance methods must be implemented in sub-classes:
20
+ #
21
+ # * +perform+
22
+ # * +persist+
23
+ #
24
+ # You may also override +transform+, which calls +perform+ and +persist+ in
25
+ # the default implementation, but you probably won't have to.
26
+ class Transform
27
+ include Dragonfly::Configurable
28
+ include Dragonfly::Loggable
29
+
30
+ # +#transform+ must be defined on the subclass for Dragonfly to see it.
31
+ # @param [Class] subclass a subclass
32
+ def self.inherited(subclass)
33
+ subclass.class_eval do
34
+ # @param [Dragonfly::TempObject] temp_object
35
+ # @return [Dragonfly::TempObject] the same object
36
+ def transform(temp_object)
37
+ persist temp_object, perform(temp_object)
38
+ temp_object
39
+ end
40
+ end
41
+ end
42
+
43
+ private
44
+
45
+ # Transforms a record.
46
+ # @param [Dragonfly::TempObject] temp_object
47
+ # @return [Hash] the transformed record
48
+ def perform(temp_object)
49
+ raise NotImplementedError
50
+ end
51
+
52
+ # Persists a transformed record.
53
+ # @param [Dragonfly::TempObject] temp_object
54
+ # @param arg a transformed record
55
+ def persist(temp_object, arg)
56
+ raise NotImplementedError
57
+ end
58
+ end
59
+ end
60
+ end
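The `inherited` hook in this file defines `transform` directly on each subclass instead of on the base class. The sketch below reproduces that technique in isolation. `BaseTransform` and `TitleTransform` are hypothetical names, and the input is a plain string rather than a Dragonfly object.

```ruby
# Hypothetical base class illustrating the hook; not Unbreakable's Transform.
class BaseTransform
  # Ruby calls this hook each time a subclass is created.
  def self.inherited(subclass)
    super
    subclass.class_eval do
      # Defined on the subclass itself rather than inherited from the base.
      def transform(input)
        persist input, perform(input)
        input
      end
    end
  end
end

class TitleTransform < BaseTransform
  attr_reader :saved

  def perform(input)
    input[/<title>(.*?)<\/title>/m, 1]
  end

  def persist(input, arg)
    (@saved ||= []) << arg
  end
end

processor = TitleTransform.new
processor.transform('<title>Example</title>')
p processor.saved
p TitleTransform.instance_methods(false).include?(:transform)
p BaseTransform.instance_methods(false).include?(:transform)
```

The last two lines show the point of the hook: the method is listed among the subclass's own instance methods and not among the base class's, which matters to any framework that only inspects methods defined on the registered class.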
data/lib/unbreakable/scraper.rb ADDED
@@ -0,0 +1,200 @@
1
+ require 'forwardable'
2
+ require 'optparse'
3
+ require 'securerandom'
4
+
5
+ require 'active_support/inflector/methods'
6
+
7
+ module Unbreakable
8
+ # You may implement a scraper by subclassing this class:
9
+ #
10
+ # require 'open-uri'
11
+ # class MyScraper < Unbreakable::Scraper
12
+ # # Stores the contents of +http://www.example.com/+ in +index.html+.
13
+ # def retrieve
14
+ # store(:path => 'index.html'){ open('http://www.example.com/').read }
15
+ # end
16
+ #
17
+ # # Processes +index.html+.
18
+ # def process
19
+ # fetch('index.html').process(:transform).apply
20
+ # end
21
+ #
22
+ # # Alternatively, you can just set the files to fetch, which will be
23
+ # # processed using a +:transform+ processor which you must implement.
24
+ # def processable
25
+ # ['index.html']
26
+ # end
27
+ # end
28
+ #
29
+ # To configure:
30
+ #
31
+ # scraper.configure do |c|
32
+ # c.datastore = MyDataStore.new # default Unbreakable::DataStorage::FileDataStore.new(scraper)
33
+ # c.log = Logger.new('/path/to/file') # default Logger.new(STDOUT)
34
+ # c.datastore.store_meta = true # default false
35
+ # end
36
+ #
37
+ # The following instance methods must be implemented in sub-classes:
38
+ #
39
+ # * +retrieve+
40
+ # * +process+ or +processable+
41
+ class Scraper
42
+ extend Forwardable
43
+
44
+ def_delegators :@app, :add_child_configurable, :configure, :datastore,
45
+ :fetch, :log, :processor
46
+
47
+ # Initializes a Dragonfly app for storage and processing.
48
+ def initialize
49
+ @app = Dragonfly[SecureRandom.hex.to_sym]
50
+ # defaults to Logger.new('/var/tmp/dragonfly.log')
51
+ @app.log = Logger.new(STDOUT)
52
+ # defaults to Dragonfly::DataStorage::FileDataStore.new
53
+ @app.datastore = Unbreakable::DataStorage::FileDataStore.new(self)
54
+ # defaults to '/var/tmp/dragonfly'
55
+ @app.datastore.root_path = '/var/tmp/unbreakable'
56
+ # defaults to true
57
+ @app.datastore.store_meta = false
58
+ end
59
+
60
+ # Returns an option parser.
61
+ # @return [OptionParser] an option parser
62
+ def opts
63
+ if @opts.nil?
64
+ @opts = OptionParser.new
65
+ @opts.banner = <<-eos
66
+ usage: #{@opts.program_name} [options] <command> [<args>]
67
+
68
+ The most commonly used commands are:
69
+ retrieve Cache remote files to the datastore for later processing
70
+ process Process cached files into machine-readable data
71
+ config Print the current configuration
72
+ eos
73
+
74
+ @opts.separator ''
75
+ @opts.separator 'Specific options:'
76
+ extract_configuration @app
77
+
78
+ @opts.separator ''
79
+ @opts.separator 'General options:'
80
+ @opts.on_tail('-h', '--help', 'Display this screen') do
81
+ puts @opts
82
+ exit
83
+ end
84
+ end
85
+ @opts
86
+ end
87
+
88
+ # Runs the command. Most often run from a command-line script as:
89
+ #
90
+ # scraper.run(ARGV)
91
+ #
92
+ # @param [Array] args command-line arguments
93
+ # @note Only call this method once per scraper instance.
94
+ def run(args)
95
+ opts.parse!(args)
96
+ command = args.shift
97
+ case command
98
+ when 'retrieve'
99
+ retrieve
100
+ when 'process'
101
+ process
102
+ when 'config'
103
+ print_configuration @app
104
+ when nil
105
+ puts opts
106
+ else
107
+ opts.abort "'#{command}' is not a #{opts.program_name} command. See '#{opts.program_name} --help'."
108
+ end
109
+ end
110
+
111
+ # Stores a record in the datastore.
112
+ # @param [Hash] opts options to pass to the datastore
113
+ # @param [Proc] block a block that yields the contents of the file
114
+ def store(opts = {}, &block)
115
+ datastore.defer_store(opts, &block)
116
+ end
117
+
118
+ # Parses a JSON, HTML, XML, or YAML file.
119
+ # @param [String, Dragonfly::TempObject] temp_object_or_uid a +TempObject+ or record ID
120
+ # @param [String] encoding a file encoding
121
+ # @return the parsed document, either a Ruby object or a +Nokogiri+ document
122
+ # @raise [LoadError] if the {http://nokogiri.org/ nokogiri} gem is
123
+ # unavailable for parsing an HTML or XML file
124
+ def parse(temp_object_or_uid, encoding = 'utf-8')
125
+ temp_object = temp_object_or_uid.is_a?(Dragonfly::TempObject) ? temp_object_or_uid : fetch(temp_object_or_uid)
126
+ string = temp_object.data
127
+ case File.extname temp_object.path
128
+ when '.json'
129
+ begin
130
+ require 'yajl'
131
+ Yajl::Parser.parse string
132
+ rescue LoadError
133
+ require 'json'
134
+ JSON.parse string
135
+ end
136
+ when '.html'
137
+ require 'nokogiri'
138
+ Nokogiri::HTML string, nil, encoding
139
+ when '.xml'
140
+ require 'nokogiri'
141
+ Nokogiri::XML string, nil, encoding
142
+ when '.yml', '.yaml'
143
+ require 'yaml'
144
+ YAML.load string
145
+ else
146
+ string
147
+ end
148
+ end
149
+
150
+ # Caches remote files to the datastore for later processing.
151
+ def retrieve
152
+ raise NotImplementedError
153
+ end
154
+
155
+ # Processes cached files into machine-readable data.
156
+ def process
157
+ processable.each do |record|
158
+ fetch(record).process(:transform).apply
159
+ end
160
+ end
161
+
162
+ # Returns a list of record IDs to process.
163
+ # @return [Array<String>] a list of record IDs to process
164
+ def processable
165
+ raise NotImplementedError
166
+ end
167
+
168
+ private
169
+
170
+ # @param [#configuration] object
171
+ def extract_configuration(object)
172
+ object.default_configuration.merge(object.configuration).each do |key,value|
173
+ if true === value or false === value
174
+ @opts.on("--[no-]#{key}", "default #{value.inspect}") do |x|
175
+ object.send "#{key}=", x
176
+ end
177
+ elsif String === value or Integer === value
178
+ @opts.on("--#{key} ARG", "default #{value.inspect}") do |x|
179
+ object.send "#{key}=", x
180
+ end
181
+ elsif object != value and value.respond_to? :configuration
182
+ extract_configuration value
183
+ end
184
+ end
185
+ end
186
+
187
+ # @param [#configuration] object
188
+ def print_configuration(object, indent = 0)
189
+ indentation = ' ' * indent
190
+ puts "#{indentation}#{object.class.name}:"
191
+ object.default_configuration.merge(object.configuration).each do |key,value|
192
+ if true === value or false === value or String === value or Integer === value
193
+ puts " #{indentation}#{key.to_s.ljust 25 - indent}#{value.inspect}"
194
+ elsif object != value and value.respond_to? :configuration
195
+ print_configuration value, indent + 2
196
+ end
197
+ end
198
+ end
199
+ end
200
+ end
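`extract_configuration` turns configuration values into command-line switches: booleans become `--[no-]key` flags and strings or numbers become `--key ARG`. The following sketch shows the same idea against a plain hash using only `OptionParser`. The `build_parser` helper and the sample keys are hypothetical, and unlike the code above it asks `OptionParser` to coerce integer arguments, which is an assumption about the desired behaviour rather than something the gem does.

```ruby
require 'optparse'

# Hypothetical helper: builds a parser whose switches mirror a configuration hash.
def build_parser(config)
  parser = OptionParser.new
  config.each do |key, value|
    case value
    when true, false
      parser.on("--[no-]#{key}", "default #{value.inspect}") { |x| config[key] = x }
    when Integer
      # Passing Integer makes OptionParser convert the argument from a string.
      parser.on("--#{key} ARG", Integer, "default #{value.inspect}") { |x| config[key] = x }
    when String
      parser.on("--#{key} ARG", "default #{value.inspect}") { |x| config[key] = x }
    end
  end
  parser
end

config = { :clobber => false, :retry_limit => 5, :root_path => '/var/tmp/unbreakable' }
args = ['--clobber', '--retry_limit', '3', 'retrieve']
build_parser(config).parse!(args)
p config
p args
```

After parsing, `config[:clobber]` is `true`, `config[:retry_limit]` is the integer `3`, and `args` holds only the remaining command, which a dispatcher like `run` can then act on.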
data/lib/unbreakable/version.rb ADDED
@@ -0,0 +1,3 @@
1
+ module Unbreakable
2
+ VERSION = "0.0.1"
3
+ end
data/spec/spec.opts ADDED
@@ -0,0 +1,5 @@
1
+ --colour
2
+ --format nested
3
+ --loadby mtime
4
+ --reverse
5
+ --backtrace
data/spec/spec_helper.rb ADDED
@@ -0,0 +1,3 @@
1
+ require 'rubygems'
2
+ require 'rspec'
3
+ require File.dirname(__FILE__) + '/../lib/unbreakable'
data/spec/unbreakable_spec.rb ADDED
@@ -0,0 +1,5 @@
1
+ require File.expand_path(File.dirname(__FILE__) + '/spec_helper')
2
+
3
+ module Unbreakable
4
+ # TODO
5
+ end
data/unbreakable.gemspec ADDED
@@ -0,0 +1,25 @@
1
+ # -*- encoding: utf-8 -*-
2
+ $:.push File.expand_path("../lib", __FILE__)
3
+ require "unbreakable/version"
4
+
5
+ Gem::Specification.new do |s|
6
+ s.name = "unbreakable"
7
+ s.version = Unbreakable::VERSION
8
+ s.platform = Gem::Platform::RUBY
9
+ s.authors = ["Open North"]
10
+ s.email = ["info@opennorth.ca"]
11
+ s.homepage = "http://github.com/opennorth/unbreakable"
12
+ s.summary = %q{Make your scrapers unbreakable™}
13
+ s.description = %q{Abstracts and bulletproofs common scraping tasks.}
14
+
15
+ s.rubyforge_project = "unbreakable"
16
+
17
+ s.files = `git ls-files`.split("\n")
18
+ s.test_files = `git ls-files -- {test,spec,features}/*`.split("\n")
19
+ s.executables = `git ls-files -- bin/*`.split("\n").map{ |f| File.basename(f) }
20
+ s.require_paths = ["lib"]
21
+
22
+ s.add_runtime_dependency('activesupport', '~> 3.1.0')
23
+ s.add_runtime_dependency('dragonfly', '~> 0.9.5')
24
+ s.add_development_dependency('rspec', '~> 2.6.0')
25
+ end
metadata ADDED
@@ -0,0 +1,99 @@
1
+ --- !ruby/object:Gem::Specification
2
+ name: unbreakable
3
+ version: !ruby/object:Gem::Version
4
+ version: 0.0.1
5
+ prerelease:
6
+ platform: ruby
7
+ authors:
8
+ - Open North
9
+ autorequire:
10
+ bindir: bin
11
+ cert_chain: []
12
+ date: 2011-09-07 00:00:00.000000000Z
13
+ dependencies:
14
+ - !ruby/object:Gem::Dependency
15
+ name: activesupport
16
+ requirement: &70281322437780 !ruby/object:Gem::Requirement
17
+ none: false
18
+ requirements:
19
+ - - ~>
20
+ - !ruby/object:Gem::Version
21
+ version: 3.1.0
22
+ type: :runtime
23
+ prerelease: false
24
+ version_requirements: *70281322437780
25
+ - !ruby/object:Gem::Dependency
26
+ name: dragonfly
27
+ requirement: &70281322437140 !ruby/object:Gem::Requirement
28
+ none: false
29
+ requirements:
30
+ - - ~>
31
+ - !ruby/object:Gem::Version
32
+ version: 0.9.5
33
+ type: :runtime
34
+ prerelease: false
35
+ version_requirements: *70281322437140
36
+ - !ruby/object:Gem::Dependency
37
+ name: rspec
38
+ requirement: &70281322436620 !ruby/object:Gem::Requirement
39
+ none: false
40
+ requirements:
41
+ - - ~>
42
+ - !ruby/object:Gem::Version
43
+ version: 2.6.0
44
+ type: :development
45
+ prerelease: false
46
+ version_requirements: *70281322436620
47
+ description: Abstracts and bulletproofs common scraping tasks.
48
+ email:
49
+ - info@opennorth.ca
50
+ executables: []
51
+ extensions: []
52
+ extra_rdoc_files: []
53
+ files:
54
+ - .gitignore
55
+ - Gemfile
56
+ - LICENSE
57
+ - README.md
58
+ - Rakefile
59
+ - USAGE
60
+ - lib/unbreakable.rb
61
+ - lib/unbreakable/data_storage/file_data_store.rb
62
+ - lib/unbreakable/decorators/timeout.rb
63
+ - lib/unbreakable/observers/log.rb
64
+ - lib/unbreakable/observers/observer.rb
65
+ - lib/unbreakable/processors/transform.rb
66
+ - lib/unbreakable/scraper.rb
67
+ - lib/unbreakable/version.rb
68
+ - spec/spec.opts
69
+ - spec/spec_helper.rb
70
+ - spec/unbreakable_spec.rb
71
+ - unbreakable.gemspec
72
+ homepage: http://github.com/opennorth/unbreakable
73
+ licenses: []
74
+ post_install_message:
75
+ rdoc_options: []
76
+ require_paths:
77
+ - lib
78
+ required_ruby_version: !ruby/object:Gem::Requirement
79
+ none: false
80
+ requirements:
81
+ - - ! '>='
82
+ - !ruby/object:Gem::Version
83
+ version: '0'
84
+ required_rubygems_version: !ruby/object:Gem::Requirement
85
+ none: false
86
+ requirements:
87
+ - - ! '>='
88
+ - !ruby/object:Gem::Version
89
+ version: '0'
90
+ requirements: []
91
+ rubyforge_project: unbreakable
92
+ rubygems_version: 1.8.6
93
+ signing_key:
94
+ specification_version: 3
95
+ summary: Make your scrapers unbreakable™
96
+ test_files:
97
+ - spec/spec.opts
98
+ - spec/spec_helper.rb
99
+ - spec/unbreakable_spec.rb