unbreakable 0.0.1
- data/.gitignore +6 -0
- data/Gemfile +4 -0
- data/LICENSE +20 -0
- data/README.md +60 -0
- data/Rakefile +16 -0
- data/USAGE +1 -0
- data/lib/unbreakable.rb +66 -0
- data/lib/unbreakable/data_storage/file_data_store.rb +139 -0
- data/lib/unbreakable/decorators/timeout.rb +47 -0
- data/lib/unbreakable/observers/log.rb +19 -0
- data/lib/unbreakable/observers/observer.rb +27 -0
- data/lib/unbreakable/processors/transform.rb +60 -0
- data/lib/unbreakable/scraper.rb +200 -0
- data/lib/unbreakable/version.rb +3 -0
- data/spec/spec.opts +5 -0
- data/spec/spec_helper.rb +3 -0
- data/spec/unbreakable_spec.rb +5 -0
- data/unbreakable.gemspec +25 -0
- metadata +99 -0
data/.gitignore
ADDED
data/Gemfile
ADDED
data/LICENSE
ADDED
@@ -0,0 +1,20 @@
|
|
1
|
+
Copyright (c) 2011 Open North Inc.
|
2
|
+
|
3
|
+
Permission is hereby granted, free of charge, to any person obtaining
|
4
|
+
a copy of this software and associated documentation files (the
|
5
|
+
"Software"), to deal in the Software without restriction, including
|
6
|
+
without limitation the rights to use, copy, modify, merge, publish,
|
7
|
+
distribute, sublicense, and/or sell copies of the Software, and to
|
8
|
+
permit persons to whom the Software is furnished to do so, subject to
|
9
|
+
the following conditions:
|
10
|
+
|
11
|
+
The above copyright notice and this permission notice shall be
|
12
|
+
included in all copies or substantial portions of the Software.
|
13
|
+
|
14
|
+
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
|
15
|
+
EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
|
16
|
+
MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
|
17
|
+
NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE
|
18
|
+
LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION
|
19
|
+
OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION
|
20
|
+
WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
|
data/README.md
ADDED
@@ -0,0 +1,60 @@
|
|
1
|
+
# Unbreakable
|
2
|
+
|
3
|
+
Unbreakable is a Ruby gem that abstracts and bulletproofs common web scraping tasks. It forces a separation of concerns for maximum flexibility. Loose coupling allows for easier modification and re-use of component parts.
|
4
|
+
|
5
|
+
# Installation
|
6
|
+
|
7
|
+
gem install unbreakable
|
8
|
+
|
9
|
+
# What's the problem?
|
10
|
+
|
11
|
+
A common web scraping project involves four steps. As an illustrative example, we'll scrape the language with the most articles on Wikipedia using standard command-line tools:
|
12
|
+
|
13
|
+
1. Retrieve some raw HTML
|
14
|
+
|
15
|
+
# Download the list of Wikipedias
|
16
|
+
curl -s -o in.html http://s23.org/wikistats/wikipedias_html
|
17
|
+
|
18
|
+
1. Process the raw HTML into a machine-readable format
|
19
|
+
|
20
|
+
# Extract the language with the most articles
|
21
|
+
grep '><td class="number">1<' in.html | sed 's/.*e">\([^<]*\).*/\1/' > out.txt
|
22
|
+
|
23
|
+
1. Release the data to the community through an API and/or as a download
|
24
|
+
|
25
|
+
# Upload the machine-readable data to a public server
|
26
|
+
curl http://pastie.org/pastes -F "paste[parser_id]=6" -F "paste[authorization]=burger" \
|
27
|
+
-F "paste[body]=`cat out.txt`" -s -o /dev/null -L -w "%{url_effective}"
|
28
|
+
|
29
|
+
1. Use the data as you like
|
30
|
+
|
31
|
+
echo "The most popular language is `curl -s http://pastie.org/pastes/2487244/download`."
|
32
|
+
|
33
|
+
In most web scraping projects, at least one step is tightly coupled to another, making modification or re-use of individual steps by the community difficult. It is especially common for authors to tailor the workflow to their specific use of the data. The coupling produces esoteric code, with the domain logic of the author's use case slipping into the otherwise generic code for retrieving and processing data. Because the scrapers are embedded in a larger project, they are often undiscoverable.
|
34
|
+
|
35
|
+
Furthermore, how the first two steps store data may be incompatible with some environments. If the processor code stores data in a database, but you prefer flat files for your use case, you may have to do a long refactor.
|
36
|
+
|
37
|
+
# What's the solution?
|
38
|
+
|
39
|
+
Web scraping projects should write standalone downloaders, processors, APIs and apps.
|
40
|
+
|
41
|
+
Retrieving should be separate from processing, if only to avoid hammering remote servers while developing or tweaking a processor. This separation also allows the community to develop multiple processors of the same raw data without duplication of effort.
|
42
|
+
|
43
|
+
Standalone components are easier for the community to discover, modify and re-use, as they do not need to concern themselves with the other parts of the workflow or expose themselves to the use case of the original author.
|
44
|
+
|
45
|
+
The code for retrieving and processing data should delegate the persistence of data to a storage layer. The community can then develop various, swappable storage adapters and will not be bound to any single solution.
|
46
|
+
|
47
|
+
Unbreakable helps you write standalone downloaders and processors and provides an extensible persistence layer.
|
48
|
+
|
49
|
+
# Getting started
|
50
|
+
|
51
|
+
For now, the best way to learn how to use this gem is to read the documentation.
|
52
|
+
|
53
|
+
rake yard
|
54
|
+
open doc/index.html
|
55
|
+
|
56
|
+
# Bugs? Questions?
|
57
|
+
|
58
|
+
Unbreakable's main repository is on GitHub: [http://github.com/opennorth/unbreakable](http://github.com/opennorth/unbreakable), where your contributions, forks, bug reports, feature requests, and feedback are greatly welcomed.
|
59
|
+
|
60
|
+
Copyright (c) 2011 Open North Inc., released under the MIT license
|
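The separation the README argues for can be sketched in plain Ruby. This is not Unbreakable's API: `MemoryStore`, `FileStore`, `retrieve`, and `process` are hypothetical names chosen for illustration, and a lambda stands in for the network call. The point is only that retrieval and processing both talk to a small storage interface, so either adapter can be swapped in without touching the other code.

```ruby
require 'fileutils'
require 'tmpdir'

# Hypothetical in-memory adapter, handy while developing a processor.
class MemoryStore
  def initialize
    @records = {}
  end

  def write(key, data)
    @records[key] = data
  end

  def read(key)
    @records.fetch(key)
  end

  def keys
    @records.keys.sort
  end
end

# Hypothetical flat-file adapter exposing the same three methods.
class FileStore
  def initialize(root)
    @root = root
  end

  def write(key, data)
    path = File.join(@root, key)
    FileUtils.mkdir_p(File.dirname(path))
    File.write(path, data)
  end

  def read(key)
    File.read(File.join(@root, key))
  end

  def keys
    Dir.glob('**/*', base: @root).select { |key| File.file?(File.join(@root, key)) }.sort
  end
end

# Retrieval: obtains raw documents and hands them to whichever store it is given.
def retrieve(store, sources)
  sources.each { |key, fetcher| store.write(key, fetcher.call) }
end

# Processing: reads only from the store, so it can be re-run without
# contacting the remote server again.
def process(store)
  store.keys.map { |key| [key, store.read(key)[%r{<title>(.*?)</title>}m, 1]] }.to_h
end

# A lambda stands in for the network call so the sketch runs offline.
sources = { 'index.html' => -> { '<html><title>Example</title></html>' } }

[MemoryStore.new, FileStore.new(Dir.mktmpdir)].each do |store|
  retrieve(store, sources)
  puts "#{store.class}: #{process(store).inspect}"
end
```

Both adapters yield the same processed result, which is the property that lets a community swap storage without refactoring the downloader or the processor.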
data/Rakefile
ADDED
@@ -0,0 +1,16 @@
|
|
1
|
+
require 'bundler'
|
2
|
+
Bundler::GemHelper.install_tasks
|
3
|
+
|
4
|
+
require 'rspec/core/rake_task'
|
5
|
+
RSpec::Core::RakeTask.new(:spec)
|
6
|
+
|
7
|
+
task :default => :spec
|
8
|
+
|
9
|
+
begin
|
10
|
+
require 'yard'
|
11
|
+
YARD::Rake::YardocTask.new
|
12
|
+
rescue LoadError
|
13
|
+
task :yard do
|
14
|
+
abort 'YARD is not available. In order to run yard, you must: gem install yard'
|
15
|
+
end
|
16
|
+
end
|
data/USAGE
ADDED
@@ -0,0 +1 @@
|
|
1
|
+
See README.md for full usage details.
|
data/lib/unbreakable.rb
ADDED
@@ -0,0 +1,66 @@
|
|
1
|
+
require 'dragonfly'
|
2
|
+
|
3
|
+
# When using this gem, you'll start by defining a {Scraper}, with methods for
|
4
|
+
# retrieving and processing data. The data will be stored in {DataStorage};
|
5
|
+
# this gem currently provides only a {DataStorage::FileDataStore FileDataStore}.
|
6
|
+
# You may enhance a datastore with {Decorators} and {Observers}: for example,
|
7
|
+
# a {Decorators::Timeout Timeout} decorator to retry on timeout with exponential
|
8
|
+
# backoff and an {Observers::Log Log} observer that logs retrieval progress.
|
9
|
+
# Of course, you must also define a {Processors::Transform Processor} to turn
|
10
|
+
# your raw data into machine-readable data.
|
11
|
+
#
|
12
|
+
# A skeleton scraper:
|
13
|
+
#
|
14
|
+
# require 'unbreakable'
|
15
|
+
#
|
16
|
+
# class MyScraper < Unbreakable::Scraper
|
17
|
+
# def retrieve
|
18
|
+
# # download all the documents
|
19
|
+
# end
|
20
|
+
# def processable
|
21
|
+
# # return a list of documents to process
|
22
|
+
# end
|
23
|
+
# end
|
24
|
+
#
|
25
|
+
# class MyProcessor < Unbreakable::Processors::Transform
|
26
|
+
# def perform(temp_object)
|
27
|
+
# # return the transformed record as a hash, array, etc.
|
28
|
+
# end
|
29
|
+
# def persist(temp_object, arg)
|
30
|
+
# # store the hash/array/etc. in Mongo, MySQL, YAML, etc.
|
31
|
+
# end
|
32
|
+
# end
|
33
|
+
#
|
34
|
+
# scraper = MyScraper.new
|
35
|
+
# scraper.processor.register MyProcessor
|
36
|
+
# scraper.configure do |c|
|
37
|
+
# # configure the scraper
|
38
|
+
# end
|
39
|
+
# scraper.run(ARGV)
|
40
|
+
#
|
41
|
+
# Every scraper script can run as a command-line script. Try it!
|
42
|
+
#
|
43
|
+
# ruby myscraper.rb
|
44
|
+
module Unbreakable
|
45
|
+
autoload :Scraper, 'unbreakable/scraper'
|
46
|
+
|
47
|
+
module Processors
|
48
|
+
autoload :Transform, 'unbreakable/processors/transform'
|
49
|
+
end
|
50
|
+
|
51
|
+
module Observers
|
52
|
+
autoload :Observer, 'unbreakable/observers/observer'
|
53
|
+
autoload :Log, 'unbreakable/observers/log'
|
54
|
+
end
|
55
|
+
|
56
|
+
module Decorators
|
57
|
+
autoload :Timeout, 'unbreakable/decorators/timeout'
|
58
|
+
end
|
59
|
+
|
60
|
+
module DataStorage
|
61
|
+
autoload :FileDataStore, 'unbreakable/data_storage/file_data_store'
|
62
|
+
end
|
63
|
+
|
64
|
+
class UnbreakableError < StandardError; end
|
65
|
+
class InvalidRemoteFile < UnbreakableError; end
|
66
|
+
end
|
data/lib/unbreakable/data_storage/file_data_store.rb
ADDED
@@ -0,0 +1,139 @@
|
|
1
|
+
require 'observer'
|
2
|
+
|
3
|
+
module Unbreakable
|
4
|
+
module DataStorage
|
5
|
+
# Stores files to the filesystem. To configure:
|
6
|
+
#
|
7
|
+
# scraper.configure do |c|
|
8
|
+
# c.datastore = Unbreakable::DataStorage::FileDataStore.new(scraper,
|
9
|
+
# :decorators => [:timeout], # optional
|
10
|
+
# :observers => [:log]) # optional
|
11
|
+
# c.datastore.root_path = '/path/dir' # default '/var/tmp/unbreakable'
|
12
|
+
# c.datastore.store_meta = true # default false
|
13
|
+
# end
|
14
|
+
class FileDataStore < Dragonfly::DataStorage::FileDataStore
|
15
|
+
include Observable
|
16
|
+
include Dragonfly::Loggable
|
17
|
+
|
18
|
+
# Decorators should be able to add configuration variables.
|
19
|
+
public_class_method :configurable_attr
|
20
|
+
|
21
|
+
# Configure the datastore to overwrite files upon repeated download.
|
22
|
+
#
|
23
|
+
# scraper.configure do |c|
|
24
|
+
# c.datastore.clobber = true # default false
|
25
|
+
# end
|
26
|
+
#
|
27
|
+
# @return [Boolean, Proc, lambda] whether to overwrite files upon repeated
|
28
|
+
# download
|
29
|
+
configurable_attr :clobber, false
|
30
|
+
|
31
|
+
# @param [Dragonfly::App] app
|
32
|
+
# @param [Hash] opts
|
33
|
+
# @option opts [Module, Symbol, Array<Module, Symbol>] :decorators
|
34
|
+
# a module, the name of a decorator module, or an array of such
|
35
|
+
# @option opts [Class, Symbol, Array<Class, Symbol>] :observers
|
36
|
+
# a class, the name of an observer class, or an array of such
|
37
|
+
def initialize(app, opts = {})
|
38
|
+
use_same_log_as(app)
|
39
|
+
use_as_fallback_config(app)
|
40
|
+
if opts[:decorators]
|
41
|
+
Array(opts[:decorators]).each do |decorator|
|
42
|
+
extend Symbol === decorator ? Unbreakable::Decorators.const_get(decorator.capitalize) : decorator
|
43
|
+
end
|
44
|
+
end
|
45
|
+
if opts[:observers]
|
46
|
+
Array(opts[:observers]).each do |observer|
|
47
|
+
add_observer Symbol === observer ? Unbreakable::Observers.const_get(observer.capitalize).new(self) : observer.new(self)
|
48
|
+
end
|
49
|
+
end
|
50
|
+
end
|
51
|
+
|
52
|
+
# Stores a record in the datastore. This method does lazy evaluation of
|
53
|
+
# the record's contents, e.g.:
|
54
|
+
#
|
55
|
+
# defer_store(:path => 'index.html') do
|
56
|
+
# open('http://www.example.com/').read
|
57
|
+
# end
|
58
|
+
#
|
59
|
+
# The +open+ method is called only if the record hasn't already been
|
60
|
+
# downloaded or if the datastore has been configured to overwrite files
|
61
|
+
# upon repeated download.
|
62
|
+
#
|
63
|
+
# @param [Hash] opts
|
64
|
+
# @option opts [Hash] :meta any file metadata, e.g. bitrate
|
65
|
+
# @option opts [String] :path the relative path at which to store the file
|
66
|
+
# @param [Proc] block a block that yields the contents of the file
|
67
|
+
# @raise [Dragonfly::DataStorage::UnableToStore] if permission is denied
|
68
|
+
# @return [String] the relative path to the file
|
69
|
+
# @see Dragonfly::DataStorage::FileDataStore#store
|
70
|
+
def defer_store(opts = {}, &block)
|
71
|
+
meta = opts[:meta] || {}
|
72
|
+
relative_path = if opts[:path]
|
73
|
+
opts[:path]
|
74
|
+
else
|
75
|
+
filename = meta[:name] || 'file'
|
76
|
+
relative_path = relative_path_for(filename)
|
77
|
+
end
|
78
|
+
|
79
|
+
changed
|
80
|
+
if empty?(relative_path) or clobber?(relative_path)
|
81
|
+
begin
|
82
|
+
path = absolute(relative_path)
|
83
|
+
prepare_path(path)
|
84
|
+
string = yield_block(relative_path, &block)
|
85
|
+
Dragonfly::TempObject.new(string).to_file(path).close
|
86
|
+
store_meta_data(path, meta) if store_meta
|
87
|
+
notify_observers :store, relative_path, string
|
88
|
+
relative(path)
|
89
|
+
rescue InvalidRemoteFile => e
|
90
|
+
log.error e.message
|
91
|
+
rescue Errno::EACCES => e
|
92
|
+
raise Dragonfly::DataStorage::UnableToStore, e.message
|
93
|
+
end
|
94
|
+
else
|
95
|
+
notify_observers :skip, relative_path
|
96
|
+
end
|
97
|
+
end
|
98
|
+
|
99
|
+
# Returns all filenames matching a pattern, if given.
|
100
|
+
# @param [String, Regexp] pattern a pattern to match filenames with
|
101
|
+
# @return [Array<String>] an array of matching filenames
|
102
|
+
def records(pattern = nil)
|
103
|
+
if pattern
|
104
|
+
Dir[File.join(root_path, '**', pattern)]
|
105
|
+
else
|
106
|
+
Dir[File.join(root_path, '**', '*')]
|
107
|
+
end.map do |absolute_path|
|
108
|
+
relative absolute_path
|
109
|
+
end
|
110
|
+
end
|
111
|
+
|
112
|
+
private
|
113
|
+
|
114
|
+
# @param [String] relative_path the relative path to the file
|
115
|
+
# @return [Boolean] whether the file is empty or non-existent
|
116
|
+
def empty?(relative_path)
|
117
|
+
path = absolute(relative_path)
|
118
|
+
!File.exist?(path) || File.size(path).zero?
|
119
|
+
end
|
120
|
+
|
121
|
+
# @param [String] relative_path the relative path to the file
|
122
|
+
# @return [Boolean] whether to overwrite any existing file
|
123
|
+
def clobber?(relative_path)
|
124
|
+
if clobber.respond_to? :call
|
125
|
+
clobber.call(relative_path)
|
126
|
+
else
|
127
|
+
!!clobber
|
128
|
+
end
|
129
|
+
end
|
130
|
+
|
131
|
+
# Yields a block.
|
132
|
+
# @param [String] relative_path the relative path to the file
|
133
|
+
# @return [String] the contents of the file
|
134
|
+
def yield_block(relative_path)
|
135
|
+
yield
|
136
|
+
end
|
137
|
+
end
|
138
|
+
end
|
139
|
+
end
|
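The lazy evaluation described for `defer_store` can be illustrated without Dragonfly. The sketch below is an assumption-laden stand-in, not the gem's implementation: `LazyFileStore` is a hypothetical class, it returns `:store` or `:skip` instead of notifying observers, and it omits metadata and error handling. It keeps the two decisions from the source: a missing or empty file is always written, and `clobber` may be a boolean or a callable.

```ruby
require 'fileutils'
require 'tmpdir'

# Hypothetical stand-in for a datastore that evaluates its block lazily.
class LazyFileStore
  attr_accessor :clobber

  def initialize(root, clobber = false)
    @root = root
    @clobber = clobber
  end

  # Writes the block's return value to +relative_path+ unless a non-empty
  # file already exists there and clobbering is off. The block is only
  # called when a write is actually going to happen.
  def defer_store(relative_path)
    path = File.join(@root, relative_path)
    if !File.exist?(path) || File.size(path).zero? || clobber?(relative_path)
      FileUtils.mkdir_p(File.dirname(path))
      File.write(path, yield)
      :store
    else
      :skip
    end
  end

  private

  # +clobber+ may be a boolean or a callable that decides per path.
  def clobber?(relative_path)
    clobber.respond_to?(:call) ? clobber.call(relative_path) : !!clobber
  end
end

calls = 0
Dir.mktmpdir do |dir|
  store = LazyFileStore.new(dir)
  first  = store.defer_store('index.html') { calls += 1; '<html></html>' }
  second = store.defer_store('index.html') { calls += 1; '<html></html>' }
  puts [first, second, calls].inspect # prints [:store, :skip, 1]
end
```

Because the block is skipped on the second call, an expensive download placed inside it would not be repeated.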
data/lib/unbreakable/decorators/timeout.rb
ADDED
@@ -0,0 +1,47 @@
|
|
1
|
+
require 'timeout'
|
2
|
+
|
3
|
+
module Unbreakable
|
4
|
+
module Decorators
|
5
|
+
# Catches timeouts and retries with exponential backoff. To configure:
|
6
|
+
#
|
7
|
+
# scraper.configure do |c|
|
8
|
+
# c.datastore.retry_limit = 5 # the maximum number of retries
|
9
|
+
# c.datastore.timeout_length = 60 # the timeout length
|
10
|
+
# end
|
11
|
+
#
|
12
|
+
module Timeout
|
13
|
+
# @param obj the object being extended
|
14
|
+
def self.extended(obj)
|
15
|
+
obj.class.instance_eval do
|
16
|
+
configurable_attr :retry_limit, 5
|
17
|
+
configurable_attr :timeout_length, 60
|
18
|
+
end
|
19
|
+
end
|
20
|
+
|
21
|
+
private
|
22
|
+
|
23
|
+
# (see DataStorage::FileDataStore#yield_block)
|
24
|
+
def yield_block(relative_path)
|
25
|
+
retry_attempt = 0
|
26
|
+
begin
|
27
|
+
retry_attempt += 1
|
28
|
+
::Timeout::timeout(timeout_length) do
|
29
|
+
super
|
30
|
+
end
|
31
|
+
rescue ::Timeout::Error
|
32
|
+
if retry_attempt < retry_limit
|
33
|
+
log.warn "Timeout on #{relative_path}, retrying in #{retry_delay(retry_attempt)} (#{retry_attempt}/#{retry_limit})"
|
34
|
+
sleep retry_delay(retry_attempt)
|
35
|
+
retry
|
36
|
+
else
|
37
|
+
log.error "Timeout on #{relative_path}, skipping"
|
38
|
+
end
|
39
|
+
end
|
40
|
+
end
|
41
|
+
|
42
|
+
def retry_delay(retry_attempt)
|
43
|
+
2 ** retry_attempt
|
44
|
+
end
|
45
|
+
end
|
46
|
+
end
|
47
|
+
end
|
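The retry-with-backoff idea in this decorator can be tried in isolation. The following is a minimal sketch under stated assumptions, not the decorator itself: `with_retries` is a hypothetical helper, the `sleeper` parameter is added so the delays can be observed without waiting, and on exhaustion it re-raises rather than logging and skipping as the source does. The delay formula `2 ** attempt` matches the source.

```ruby
require 'timeout'

# Hypothetical helper: runs the block under a timeout and retries with
# exponentially growing delays (2, 4, 8, ... seconds). +sleeper+ is
# injectable so the delays can be recorded or skipped in tests.
def with_retries(retry_limit: 5, timeout_length: 60, sleeper: method(:sleep))
  attempt = 0
  begin
    attempt += 1
    Timeout.timeout(timeout_length) { yield attempt }
  rescue Timeout::Error
    raise if attempt >= retry_limit # give up after the last allowed attempt
    sleeper.call(2**attempt)
    retry
  end
end

# Simulate two timeouts followed by a success, recording the delays
# instead of sleeping.
delays = []
result = with_retries(sleeper: ->(seconds) { delays << seconds }) do |attempt|
  raise Timeout::Error, 'simulated' if attempt < 3
  "ok after #{attempt} attempts"
end
puts result         # prints ok after 3 attempts
puts delays.inspect # prints [2, 4]
```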
data/lib/unbreakable/observers/log.rb
ADDED
@@ -0,0 +1,19 @@
|
|
1
|
+
module Unbreakable
|
2
|
+
module Observers
|
3
|
+
# Logs debug messages when files are stored or skipped if the observed
|
4
|
+
# object has a +#log+ method.
|
5
|
+
class Log < Observer
|
6
|
+
# (see Observer#update)
|
7
|
+
def update(method, *args)
|
8
|
+
if observed.respond_to? :log
|
9
|
+
case method
|
10
|
+
when :store
|
11
|
+
observed.log.debug "Store #{args.first}"
|
12
|
+
when :skip
|
13
|
+
observed.log.debug "Skip #{args.first}"
|
14
|
+
end
|
15
|
+
end
|
16
|
+
end
|
17
|
+
end
|
18
|
+
end
|
19
|
+
end
|
data/lib/unbreakable/observers/observer.rb
ADDED
@@ -0,0 +1,27 @@
|
|
1
|
+
module Unbreakable
|
2
|
+
module Observers
|
3
|
+
# Abstract class for observers following the Ruby
|
4
|
+
# {http://ruby-doc.org/stdlib/libdoc/observer/rdoc/index.html stdlib}
|
5
|
+
# implementation of the _Observer_ object-oriented design pattern. See
|
6
|
+
# {Unbreakable::Observers::Log} for an example.
|
7
|
+
#
|
8
|
+
# The following instance methods must be implemented in sub-classes:
|
9
|
+
#
|
10
|
+
# * +update+
|
11
|
+
class Observer
|
12
|
+
attr_reader :observed
|
13
|
+
|
14
|
+
# @param observed the observed object
|
15
|
+
def initialize(observed)
|
16
|
+
@observed = observed
|
17
|
+
end
|
18
|
+
|
19
|
+
# @param [Symbol] method the method called on the observed object
|
20
|
+
# @param [Array] args the arguments to the method
|
21
|
+
# @return [void]
|
22
|
+
def update(method, *args)
|
23
|
+
raise NotImplementedError
|
24
|
+
end
|
25
|
+
end
|
26
|
+
end
|
27
|
+
end
|
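How an observer receives `:store` and `:skip` events through Ruby's `Observable` module can be shown with a small standalone sketch. `Downloader` and `Recorder` are hypothetical classes invented for this example. The only parts taken from the source are the `update(method, *args)` signature and the need to call `changed` before `notify_observers`.

```ruby
require 'observer'

# Hypothetical observed object that reports :store and :skip events.
class Downloader
  include Observable

  def fetch(path, cached)
    changed # without this, notify_observers does nothing
    notify_observers(cached ? :skip : :store, path)
  end
end

# Collects events instead of logging them, so the flow is easy to inspect.
class Recorder
  attr_reader :events

  def initialize
    @events = []
  end

  # Same signature as Unbreakable::Observers::Observer#update.
  def update(method, *args)
    @events << [method, *args]
  end
end

downloader = Downloader.new
recorder = Recorder.new
downloader.add_observer(recorder)
downloader.fetch('index.html', false)
downloader.fetch('index.html', true)
puts recorder.events.inspect
```

A logging observer would replace the array append with a call to a logger, as `Observers::Log` does.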
data/lib/unbreakable/processors/transform.rb
ADDED
@@ -0,0 +1,60 @@
|
|
1
|
+
module Unbreakable
|
2
|
+
module Processors
|
3
|
+
# You may implement a transform process by subclassing this class:
|
4
|
+
#
|
5
|
+
# require 'nokogiri'
|
6
|
+
# class MyProcessor < Unbreakable::Processors::Transform
|
7
|
+
# # Extracts the page title from an HTML page.
|
8
|
+
# def perform(temp_object)
|
9
|
+
# Nokogiri::HTML(temp_object.data).at_css('title')
|
10
|
+
# end
|
11
|
+
#
|
12
|
+
# # Saves the page title to an external database.
|
13
|
+
# def persist(temp_object, arg)
|
14
|
+
# MyModel.create(:title => arg)
|
15
|
+
# end
|
16
|
+
# end
|
17
|
+
# MyScraper.processor.register MyProcessor
|
18
|
+
#
|
19
|
+
# The following instance methods must be implemented in sub-classes:
|
20
|
+
#
|
21
|
+
# * +perform+
|
22
|
+
# * +persist+
|
23
|
+
#
|
24
|
+
# You may also override +transform+, which calls +perform+ and +persist+ in
|
25
|
+
# the default implementation, but you probably won't have to.
|
26
|
+
class Transform
|
27
|
+
include Dragonfly::Configurable
|
28
|
+
include Dragonfly::Loggable
|
29
|
+
|
30
|
+
# +#transform+ must be defined on the subclass for Dragonfly to see it.
|
31
|
+
# @param [Class] subclass a subclass
|
32
|
+
def self.inherited(subclass)
|
33
|
+
subclass.class_eval do
|
34
|
+
# @param [Dragonfly::TempObject] temp_object
|
35
|
+
# @return [Dragonfly::TempObject] the same object
|
36
|
+
def transform(temp_object)
|
37
|
+
persist temp_object, perform(temp_object)
|
38
|
+
temp_object
|
39
|
+
end
|
40
|
+
end
|
41
|
+
end
|
42
|
+
|
43
|
+
private
|
44
|
+
|
45
|
+
# Transforms a record.
|
46
|
+
# @param [Dragonfly::TempObject] temp_object
|
47
|
+
# @return [Hash] the transformed record
|
48
|
+
def perform(temp_object)
|
49
|
+
raise NotImplementedError
|
50
|
+
end
|
51
|
+
|
52
|
+
# Persists a transformed record.
|
53
|
+
# @param [Dragonfly::TempObject] temp_object
|
54
|
+
# @param arg a transformed record
|
55
|
+
def persist(temp_object, arg)
|
56
|
+
raise NotImplementedError
|
57
|
+
end
|
58
|
+
end
|
59
|
+
end
|
60
|
+
end
|
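The `inherited` hook above defines `transform` on each subclass rather than on the base class. A minimal sketch of that technique, independent of Dragonfly, is shown below. `BaseTransform` and `TitleTransform` are hypothetical names, and plain strings stand in for `Dragonfly::TempObject`.

```ruby
# Hypothetical base class mirroring the perform/persist split.
class BaseTransform
  # Runs when a subclass is created and defines #transform directly on it,
  # so the method appears in the subclass's own method table.
  def self.inherited(subclass)
    super
    subclass.class_eval do
      def transform(record)
        persist(record, perform(record))
        record
      end
    end
  end

  private

  def perform(record)
    raise NotImplementedError
  end

  def persist(record, result)
    raise NotImplementedError
  end
end

# Hypothetical processor: extracts a page title and keeps it in memory.
class TitleTransform < BaseTransform
  attr_reader :saved

  def perform(record)
    record[%r{<title>(.*?)</title>}m, 1]
  end

  def persist(record, result)
    (@saved ||= []) << result
  end
end

processor = TitleTransform.new
processor.transform('<html><title>Example</title></html>')
puts processor.saved.inspect
puts TitleTransform.instance_methods(false).include?(:transform) # prints true
```

The last line shows why the hook is used: a library that inspects only the methods a class defines itself will find `transform` on the subclass.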
data/lib/unbreakable/scraper.rb
ADDED
@@ -0,0 +1,200 @@
|
|
1
|
+
require 'forwardable'
|
2
|
+
require 'optparse'
|
3
|
+
require 'securerandom'
|
4
|
+
|
5
|
+
require 'active_support/inflector/methods'
|
6
|
+
|
7
|
+
module Unbreakable
|
8
|
+
# You may implement a scraper by subclassing this class:
|
9
|
+
#
|
10
|
+
# require 'open-uri'
|
11
|
+
# class MyScraper < Unbreakable::Scraper
|
12
|
+
# # Stores the contents of +http://www.example.com/+ in +index.html+.
|
13
|
+
# def retrieve
|
14
|
+
# store(:path => 'index.html'){ open('http://www.example.com/').read }
|
15
|
+
# end
|
16
|
+
#
|
17
|
+
# # Processes +index.html+.
|
18
|
+
# def process
|
19
|
+
# fetch('index.html').process(:transform).apply
|
20
|
+
# end
|
21
|
+
#
|
22
|
+
# # Alternatively, you can just set the files to fetch, which will be
|
23
|
+
# # processed using a +:transform+ processor which you must implement.
|
24
|
+
# def processable
|
25
|
+
# ['index.html']
|
26
|
+
# end
|
27
|
+
# end
|
28
|
+
#
|
29
|
+
# To configure:
|
30
|
+
#
|
31
|
+
# scraper.configure do |c|
|
32
|
+
# c.datastore = MyDataStore.new # default Unbreakable::DataStorage::FileDataStore.new(scraper)
|
33
|
+
# c.log = Logger.new('/path/to/file') # default Logger.new(STDOUT)
|
34
|
+
# c.datastore.store_meta = true # default false
|
35
|
+
# end
|
36
|
+
#
|
37
|
+
# The following instance methods must be implemented in sub-classes:
|
38
|
+
#
|
39
|
+
# * +retrieve+
|
40
|
+
# * +process+ or +processable+
|
41
|
+
class Scraper
|
42
|
+
extend Forwardable
|
43
|
+
|
44
|
+
def_delegators :@app, :add_child_configurable, :configure, :datastore,
|
45
|
+
:fetch, :log, :processor
|
46
|
+
|
47
|
+
# Initializes a Dragonfly app for storage and processing.
|
48
|
+
def initialize
|
49
|
+
@app = Dragonfly[SecureRandom.hex.to_sym]
|
50
|
+
# defaults to Logger.new('/var/tmp/dragonfly.log')
|
51
|
+
@app.log = Logger.new(STDOUT)
|
52
|
+
# defaults to Dragonfly::DataStorage::FileDataStore.new
|
53
|
+
@app.datastore = Unbreakable::DataStorage::FileDataStore.new(self)
|
54
|
+
# defaults to '/var/tmp/dragonfly'
|
55
|
+
@app.datastore.root_path = '/var/tmp/unbreakable'
|
56
|
+
# defaults to true
|
57
|
+
@app.datastore.store_meta = false
|
58
|
+
end
|
59
|
+
|
60
|
+
# Returns an option parser.
|
61
|
+
# @return [OptionParser] an option parser
|
62
|
+
def opts
|
63
|
+
if @opts.nil?
|
64
|
+
@opts = OptionParser.new
|
65
|
+
@opts.banner = <<-eos
|
66
|
+
usage: #{@opts.program_name} [options] <command> [<args>]
|
67
|
+
|
68
|
+
The most commonly used commands are:
|
69
|
+
retrieve Cache remote files to the datastore for later processing
|
70
|
+
process Process cached files into machine-readable data
|
71
|
+
config Print the current configuration
|
72
|
+
eos
|
73
|
+
|
74
|
+
@opts.separator ''
|
75
|
+
@opts.separator 'Specific options:'
|
76
|
+
extract_configuration @app
|
77
|
+
|
78
|
+
@opts.separator ''
|
79
|
+
@opts.separator 'General options:'
|
80
|
+
@opts.on_tail('-h', '--help', 'Display this screen') do
|
81
|
+
puts @opts
|
82
|
+
exit
|
83
|
+
end
|
84
|
+
end
|
85
|
+
@opts
|
86
|
+
end
|
87
|
+
|
88
|
+
# Runs the command. Most often run from a command-line script as:
|
89
|
+
#
|
90
|
+
# scraper.run(ARGV)
|
91
|
+
#
|
92
|
+
# @param [Array] args command-line arguments
|
93
|
+
# @note Only call this method once per scraper instance.
|
94
|
+
def run(args)
|
95
|
+
opts.parse!(args)
|
96
|
+
command = args.shift
|
97
|
+
case command
|
98
|
+
when 'retrieve'
|
99
|
+
retrieve
|
100
|
+
when 'process'
|
101
|
+
process
|
102
|
+
when 'config'
|
103
|
+
print_configuration @app
|
104
|
+
when nil
|
105
|
+
puts opts
|
106
|
+
else
|
107
|
+
opts.abort "'#{command}' is not a #{opts.program_name} command. See '#{opts.program_name} --help'."
|
108
|
+
end
|
109
|
+
end
|
110
|
+
|
111
|
+
# Stores a record in the datastore.
|
112
|
+
# @param [Hash] opts options to pass to the datastore
|
113
|
+
# @param [Proc] block a block that yields the contents of the file
|
114
|
+
def store(opts = {}, &block)
|
115
|
+
datastore.defer_store(opts, &block)
|
116
|
+
end
|
117
|
+
|
118
|
+
# Parses a JSON, HTML, XML, or YAML file.
|
119
|
+
# @param [String, Dragonfly::TempObject] temp_object_or_uid a +TempObject+ or record ID
|
120
|
+
# @param [String] encoding a file encoding
|
121
|
+
# @return the parsed document, either a Ruby or +Nokogiri+ type
|
122
|
+
# @raise [LoadError] if the {http://nokogiri.org/ nokogiri} gem is
|
123
|
+
# unavailable for parsing an HTML or XML file
|
124
|
+
def parse(temp_object_or_uid, encoding = 'utf-8')
|
125
|
+
temp_object = temp_object_or_uid.is_a?(Dragonfly::TempObject) ? temp_object_or_uid : fetch(temp_object_or_uid)
|
126
|
+
string = temp_object.data
|
127
|
+
case File.extname temp_object.path
|
128
|
+
when '.json'
|
129
|
+
begin
|
130
|
+
require 'yajl'
|
131
|
+
Yajl::Parser.parse string
|
132
|
+
rescue LoadError
|
133
|
+
require 'json'
|
134
|
+
JSON.parse string
|
135
|
+
end
|
136
|
+
when '.html'
|
137
|
+
require 'nokogiri'
|
138
|
+
Nokogiri::HTML string, nil, encoding
|
139
|
+
when '.xml'
|
140
|
+
require 'nokogiri'
|
141
|
+
Nokogiri::XML string, nil, encoding
|
142
|
+
when '.yml', '.yaml'
|
143
|
+
require 'yaml'
|
144
|
+
YAML.load string
|
145
|
+
else
|
146
|
+
string
|
147
|
+
end
|
148
|
+
end
|
149
|
+
|
150
|
+
# Caches remote files to the datastore for later processing.
|
151
|
+
def retrieve
|
152
|
+
raise NotImplementedError
|
153
|
+
end
|
154
|
+
|
155
|
+
# Processes cached files into machine-readable data.
|
156
|
+
def process
|
157
|
+
processable.each do |record|
|
158
|
+
fetch(record).process(:transform).apply
|
159
|
+
end
|
160
|
+
end
|
161
|
+
|
162
|
+
# Returns a list of record IDs to process.
|
163
|
+
# @return [Array<String>] a list of record IDs to process
|
164
|
+
def processable
|
165
|
+
raise NotImplementedError
|
166
|
+
end
|
167
|
+
|
168
|
+
private
|
169
|
+
|
170
|
+
# @param [#configuration] object
|
171
|
+
def extract_configuration(object)
|
172
|
+
object.default_configuration.merge(object.configuration).each do |key,value|
|
173
|
+
if true === value or false === value
|
174
|
+
@opts.on("--[no-]#{key}", "default #{value.inspect}") do |x|
|
175
|
+
object.send "#{key}=", x
|
176
|
+
end
|
177
|
+
elsif String === value or Integer === value
|
178
|
+
@opts.on("--#{key} ARG", "default #{value.inspect}") do |x|
|
179
|
+
object.send "#{key}=", x
|
180
|
+
end
|
181
|
+
elsif object != value and value.respond_to? :configuration
|
182
|
+
extract_configuration value
|
183
|
+
end
|
184
|
+
end
|
185
|
+
end
|
186
|
+
|
187
|
+
# @param [#configuration] object
|
188
|
+
def print_configuration(object, indent = 0)
|
189
|
+
indentation = ' ' * indent
|
190
|
+
puts "#{indentation}#{object.class.name}:"
|
191
|
+
object.default_configuration.merge(object.configuration).each do |key,value|
|
192
|
+
if true === value or false === value or String === value or Integer === value
|
193
|
+
puts " #{indentation}#{key.to_s.ljust 25 - indent}#{value.inspect}"
|
194
|
+
elsif object != value and value.respond_to? :configuration
|
195
|
+
print_configuration value, indent + 2
|
196
|
+
end
|
197
|
+
end
|
198
|
+
end
|
199
|
+
end
|
200
|
+
end
|
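The way `extract_configuration` turns configuration values into command-line flags, and the way `run` then takes the first remaining argument as the command, can be sketched with `OptionParser` alone. `parser_for` and the `config` hash below are hypothetical. Unlike the source, which passes string arguments through unchanged, this sketch asks `OptionParser` to coerce integer options.

```ruby
require 'optparse'

# Hypothetical helper: builds one flag per configuration key, choosing the
# flag style from the type of the default value.
def parser_for(config)
  OptionParser.new do |opts|
    config.each do |key, value|
      case value
      when true, false
        # Booleans become --key / --no-key switches.
        opts.on("--[no-]#{key}", "default #{value.inspect}") { |x| config[key] = x }
      when Integer
        opts.on("--#{key} ARG", Integer, "default #{value.inspect}") { |x| config[key] = x }
      when String
        opts.on("--#{key} ARG", "default #{value.inspect}") { |x| config[key] = x }
      end
    end
  end
end

config = { clobber: false, retry_limit: 5, root_path: '/var/tmp/unbreakable' }
args = %w[--clobber --retry_limit 3 retrieve]

# parse! removes the recognized options, leaving the command behind.
parser_for(config).parse!(args)
command = args.shift

puts command            # prints retrieve
puts config[:clobber]   # prints true
puts config[:retry_limit] # prints 3
```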
data/spec/spec.opts
ADDED
data/spec/spec_helper.rb
ADDED
data/unbreakable.gemspec
ADDED
@@ -0,0 +1,25 @@
|
|
1
|
+
# -*- encoding: utf-8 -*-
|
2
|
+
$:.push File.expand_path("../lib", __FILE__)
|
3
|
+
require "unbreakable/version"
|
4
|
+
|
5
|
+
Gem::Specification.new do |s|
|
6
|
+
s.name = "unbreakable"
|
7
|
+
s.version = Unbreakable::VERSION
|
8
|
+
s.platform = Gem::Platform::RUBY
|
9
|
+
s.authors = ["Open North"]
|
10
|
+
s.email = ["info@opennorth.ca"]
|
11
|
+
s.homepage = "http://github.com/opennorth/unbreakable"
|
12
|
+
s.summary = %q{Make your scrapers unbreakable™}
|
13
|
+
s.description = %q{Abstracts and bulletproofs common scraping tasks.}
|
14
|
+
|
15
|
+
s.rubyforge_project = "unbreakable"
|
16
|
+
|
17
|
+
s.files = `git ls-files`.split("\n")
|
18
|
+
s.test_files = `git ls-files -- {test,spec,features}/*`.split("\n")
|
19
|
+
s.executables = `git ls-files -- bin/*`.split("\n").map{ |f| File.basename(f) }
|
20
|
+
s.require_paths = ["lib"]
|
21
|
+
|
22
|
+
s.add_runtime_dependency('activesupport', '~> 3.1.0')
|
23
|
+
s.add_runtime_dependency('dragonfly', '~> 0.9.5')
|
24
|
+
s.add_development_dependency('rspec', '~> 2.6.0')
|
25
|
+
end
|
metadata
ADDED
@@ -0,0 +1,99 @@
|
|
1
|
+
--- !ruby/object:Gem::Specification
|
2
|
+
name: unbreakable
|
3
|
+
version: !ruby/object:Gem::Version
|
4
|
+
version: 0.0.1
|
5
|
+
prerelease:
|
6
|
+
platform: ruby
|
7
|
+
authors:
|
8
|
+
- Open North
|
9
|
+
autorequire:
|
10
|
+
bindir: bin
|
11
|
+
cert_chain: []
|
12
|
+
date: 2011-09-07 00:00:00.000000000Z
|
13
|
+
dependencies:
|
14
|
+
- !ruby/object:Gem::Dependency
|
15
|
+
name: activesupport
|
16
|
+
requirement: &70281322437780 !ruby/object:Gem::Requirement
|
17
|
+
none: false
|
18
|
+
requirements:
|
19
|
+
- - ~>
|
20
|
+
- !ruby/object:Gem::Version
|
21
|
+
version: 3.1.0
|
22
|
+
type: :runtime
|
23
|
+
prerelease: false
|
24
|
+
version_requirements: *70281322437780
|
25
|
+
- !ruby/object:Gem::Dependency
|
26
|
+
name: dragonfly
|
27
|
+
requirement: &70281322437140 !ruby/object:Gem::Requirement
|
28
|
+
none: false
|
29
|
+
requirements:
|
30
|
+
- - ~>
|
31
|
+
- !ruby/object:Gem::Version
|
32
|
+
version: 0.9.5
|
33
|
+
type: :runtime
|
34
|
+
prerelease: false
|
35
|
+
version_requirements: *70281322437140
|
36
|
+
- !ruby/object:Gem::Dependency
|
37
|
+
name: rspec
|
38
|
+
requirement: &70281322436620 !ruby/object:Gem::Requirement
|
39
|
+
none: false
|
40
|
+
requirements:
|
41
|
+
- - ~>
|
42
|
+
- !ruby/object:Gem::Version
|
43
|
+
version: 2.6.0
|
44
|
+
type: :development
|
45
|
+
prerelease: false
|
46
|
+
version_requirements: *70281322436620
|
47
|
+
description: Abstracts and bulletproofs common scraping tasks.
|
48
|
+
email:
|
49
|
+
- info@opennorth.ca
|
50
|
+
executables: []
|
51
|
+
extensions: []
|
52
|
+
extra_rdoc_files: []
|
53
|
+
files:
|
54
|
+
- .gitignore
|
55
|
+
- Gemfile
|
56
|
+
- LICENSE
|
57
|
+
- README.md
|
58
|
+
- Rakefile
|
59
|
+
- USAGE
|
60
|
+
- lib/unbreakable.rb
|
61
|
+
- lib/unbreakable/data_storage/file_data_store.rb
|
62
|
+
- lib/unbreakable/decorators/timeout.rb
|
63
|
+
- lib/unbreakable/observers/log.rb
|
64
|
+
- lib/unbreakable/observers/observer.rb
|
65
|
+
- lib/unbreakable/processors/transform.rb
|
66
|
+
- lib/unbreakable/scraper.rb
|
67
|
+
- lib/unbreakable/version.rb
|
68
|
+
- spec/spec.opts
|
69
|
+
- spec/spec_helper.rb
|
70
|
+
- spec/unbreakable_spec.rb
|
71
|
+
- unbreakable.gemspec
|
72
|
+
homepage: http://github.com/opennorth/unbreakable
|
73
|
+
licenses: []
|
74
|
+
post_install_message:
|
75
|
+
rdoc_options: []
|
76
|
+
require_paths:
|
77
|
+
- lib
|
78
|
+
required_ruby_version: !ruby/object:Gem::Requirement
|
79
|
+
none: false
|
80
|
+
requirements:
|
81
|
+
- - ! '>='
|
82
|
+
- !ruby/object:Gem::Version
|
83
|
+
version: '0'
|
84
|
+
required_rubygems_version: !ruby/object:Gem::Requirement
|
85
|
+
none: false
|
86
|
+
requirements:
|
87
|
+
- - ! '>='
|
88
|
+
- !ruby/object:Gem::Version
|
89
|
+
version: '0'
|
90
|
+
requirements: []
|
91
|
+
rubyforge_project: unbreakable
|
92
|
+
rubygems_version: 1.8.6
|
93
|
+
signing_key:
|
94
|
+
specification_version: 3
|
95
|
+
summary: Make your scrapers unbreakable™
|
96
|
+
test_files:
|
97
|
+
- spec/spec.opts
|
98
|
+
- spec/spec_helper.rb
|
99
|
+
- spec/unbreakable_spec.rb
|