grell 1.3.0
- checksums.yaml +7 -0
- data/.rspec +2 -0
- data/CHANGELOG.md +32 -0
- data/Gemfile +3 -0
- data/LICENSE.txt +22 -0
- data/README.md +78 -0
- data/Rakefile +2 -0
- data/grell.gemspec +30 -0
- data/lib/grell.rb +10 -0
- data/lib/grell/capybara_driver.rb +34 -0
- data/lib/grell/crawler.rb +44 -0
- data/lib/grell/grell_logger.rb +10 -0
- data/lib/grell/page.rb +231 -0
- data/lib/grell/page_collection.rb +46 -0
- data/lib/grell/rawpage.rb +37 -0
- data/lib/grell/reader.rb +15 -0
- data/lib/grell/version.rb +3 -0
- data/spec/lib/crawler_spec.rb +108 -0
- data/spec/lib/page_collection_spec.rb +149 -0
- data/spec/lib/page_spec.rb +284 -0
- data/spec/lib/reader_spec.rb +43 -0
- data/spec/spec_helper.rb +64 -0
- metadata +196 -0
checksums.yaml
ADDED
@@ -0,0 +1,7 @@
+---
+SHA1:
+  metadata.gz: 85e9d00d79051e54ba1c5361e7f1edfd8abf8af6
+  data.tar.gz: 25b1f2db2ef87e61294158843fe6e396ed15b046
+SHA512:
+  metadata.gz: b00755884c96ddf04e6954d5761903e28cbd46207f45ba5a97f1f1b727b70aad2361386a35e3c96ecb8b65089f1412ac4d7a1bdc8a18d70b87009198883205c3
+  data.tar.gz: 927714b4cd7e7b520d75755e993eff343cebd06a8c36993bcd81abb7453e1ace497e05766f0b8d1ae4bed2e7c00447a1ded267c20028d6d118f32d661c36685e
data/.rspec
ADDED
data/CHANGELOG.md
ADDED
@@ -0,0 +1,32 @@
+* Version 1.3
+The Crawler object allows you to provide an external logger object.
+Clearer semantics when an error happens: special headers are returned so the user can inspect the error.
+
+Caveats:
+- The 'debug' option in the crawler no longer has any effect. Provide an external logger with 'logger' instead.
+- The error keys provided in the headers by Grell have changed from 'grell_status' to 'grellStatus'.
+- The 'visited' property in the page was never supposed to be accessible. Use 'visited?' instead.
+
+* Version 1.2.1
+Solved bug: URLs are case insensitive.
+
+* Version 1.2
+Grell will now consider two links to point to the same page only when the whole URL is exactly the same.
+Previous versions considered two links to be the same whenever they shared the path.
+
+* Version 1.1.2
+Solved a bug where links in the head were added as if they were normal links in the body.
+
+* Version 1.1.1
+Solved a bug with the new data-href functionality.
+
+* Version 1.1
+Solved a problem with a randomly failing spec.
+Search for elements with 'href' or 'data-href' to find links.
+
+* Version 1.0.1
+Rescuing Javascript errors.
+
+* Version 1.0
+Initial implementation.
+Basic support for crawling pages.
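Version 1.3's external logger is wired in through the Crawler constructor, which falls back to a STDOUT logger when none is given (see lib/grell/crawler.rb below). A minimal sketch, assuming a hypothetical grell.log file target:

```ruby
require 'grell'
require 'logger'

# 'grell.log' is a hypothetical target; any Logger-compatible object works
crawler = Grell::Crawler.new(logger: Logger.new('grell.log'))
```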
data/Gemfile
ADDED
data/LICENSE.txt
ADDED
@@ -0,0 +1,22 @@
+Copyright (c) 2015 Medidata Solutions Worldwide
+
+MIT License
+
+Permission is hereby granted, free of charge, to any person obtaining
+a copy of this software and associated documentation files (the
+"Software"), to deal in the Software without restriction, including
+without limitation the rights to use, copy, modify, merge, publish,
+distribute, sublicense, and/or sell copies of the Software, and to
+permit persons to whom the Software is furnished to do so, subject to
+the following conditions:
+
+The above copyright notice and this permission notice shall be
+included in all copies or substantial portions of the Software.
+
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE
+LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION
+OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION
+WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
data/README.md
ADDED
@@ -0,0 +1,78 @@
+# Grell
+
+Grell is a generic web crawler written in Ruby.
+It can be used to gather data, test pages in a given domain, etc.
+
+## Installation
+
+Add this line to your application's Gemfile:
+
+```ruby
+gem 'grell'
+```
+
+And then execute:
+
+    $ bundle
+
+Or install it yourself as:
+
+    $ gem install grell
+
+Grell uses PhantomJS; you will need to download and install it on your
+system. Check http://phantomjs.org/ for instructions.
+Grell has been tested with PhantomJS v1.9.x.
+
+## Usage
+
+### Crawling an entire site
+
+The main entry point of the library is Grell::Crawler#start_crawling.
+Grell will yield to your code with each page it finds:
+
+```ruby
+require 'grell'
+
+crawler = Grell::Crawler.new
+crawler.start_crawling('http://www.google.com') do |page|
+  # Grell will keep calling this block with each unique page it finds
+  puts "yes we crawled #{page.url}"
+  puts "status: #{page.status}"
+  puts "headers: #{page.headers}"
+  puts "body: #{page.body}"
+  puts "We crawled it at #{page.timestamp}"
+  puts "We found #{page.links.size} links"
+  puts "page id and parent_id #{page.id}, #{page.parent_id}"
+end
+```
+
+Grell keeps a list of pages previously crawled and does not visit the same page twice.
+This list is indexed by the complete URL, including query parameters.
+
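For example (hypothetical URLs), http://example.com/list?page=1 and http://example.com/list?page=2 share a path but differ in their query strings, so Grell counts them as two distinct pages and visits both.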
+### Pages' id
+
+Each page has a unique id, accessed through the property 'id'. Each page also stores the id of the page from which it was found, accessed through the property 'parent_id'.
+The page object generated by accessing the first URL passed to start_crawling (the root) has a 'parent_id' equal to 'nil' and an 'id' equal to 0.
+Using this information it is possible to construct a directed graph.
+
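A minimal sketch of building that graph, assuming a hypothetical http://example.com site: collect a (parent_id, id) edge for every page yielded by the crawl.

```ruby
require 'grell'

# edges[parent_id] lists the ids of pages first discovered from that page
edges = Hash.new { |hash, key| hash[key] = [] }

crawler = Grell::Crawler.new
crawler.start_crawling('http://example.com') do |page|
  edges[page.parent_id] << page.id unless page.parent_id.nil?
end

# edges now describes a directed graph rooted at the page with id 0
```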
+### Errors
+
+When there is an error in the page, or an internal error in the crawler (Javascript crashed the browser, etc.), Grell will return a page with status 404 whose headers have the following keys:
+- grellStatus: 'Error'
+- errorClass: The class of the error which broke this page.
+- errorMessage: A descriptive message with the information Grell could gather about the error.
+
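A sketch of checking those keys inside the crawl block; the symbol keys below match the ErroredPage#headers implementation in lib/grell/page.rb, and the URL is hypothetical.

```ruby
crawler.start_crawling('http://example.com') do |page|
  headers = page.headers
  if headers[:grellStatus] == 'Error'
    puts "#{page.url} failed with #{headers[:errorClass]}: #{headers[:errorMessage]}"
  end
end
```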
+## Tests
+
+Run the tests with:
+
+```
+bundle exec rake ci
+```
+
+## Contributors
+
+Grell is (c) Medidata Solutions Worldwide and owned by its major contributors:
+* [Teruhide Hoshikawa](https://github.com/thoshikawa-mdsol)
+* [Jordi Polo Carres](https://github.com/jcarres-mdsol)
data/Rakefile
ADDED
data/grell.gemspec
ADDED
@@ -0,0 +1,30 @@
+# coding: utf-8
+lib = File.expand_path('../lib', __FILE__)
+$LOAD_PATH.unshift(lib) unless $LOAD_PATH.include?(lib)
+require 'grell/version'
+
+Gem::Specification.new do |spec|
+  spec.name          = "grell"
+  spec.version       = Grell::VERSION
+  spec.authors       = ["Jordi Polo Carres"]
+  spec.email         = ["jcarres@mdsol.com"]
+  spec.summary       = %q{Ruby web crawler}
+  spec.description   = %q{Ruby web crawler using PhantomJS}
+  spec.homepage      = "https://github.com/mdsol/grell"
+
+  spec.files         = `git ls-files -z`.split("\x0")
+  spec.executables   = spec.files.grep(%r{^bin/}) { |f| File.basename(f) }
+  spec.test_files    = spec.files.grep(%r{^(test|spec|features)/})
+  spec.require_paths = ["lib"]
+
+  spec.add_dependency 'capybara', '~> 2.2'
+  spec.add_dependency 'poltergeist', '~> 1.5'
+
+  spec.add_development_dependency "bundler", "~> 1.6"
+  spec.add_development_dependency "byebug", "~> 4.0"
+  spec.add_development_dependency "kender"
+  spec.add_development_dependency "rake"
+  spec.add_development_dependency "webmock", '~> 1.18'
+  spec.add_development_dependency 'rspec', '~> 3.0'
+  spec.add_development_dependency 'puffing-billy', '~> 0.5'
+end
data/lib/grell.rb
ADDED
data/lib/grell/capybara_driver.rb
ADDED
@@ -0,0 +1,34 @@
+
+module Grell
+
+  # The driver for Capybara. It uses Poltergeist to control PhantomJS
+  class CapybaraDriver
+    include Capybara::DSL
+
+    USER_AGENT = "Mozilla/5.0 (Grell Crawler)"
+
+    def self.setup(options)
+      new.setup_capybara unless options[:external_driver]
+    end
+
+    def setup_capybara
+      Capybara.register_driver :poltergeist_crawler do |app|
+        Capybara::Poltergeist::Driver.new(app, {
+          js_errors: false,
+          inspector: false,
+          phantomjs_logger: open('/dev/null'),
+          phantomjs_options: ['--debug=no', '--load-images=no', '--ignore-ssl-errors=yes', '--ssl-protocol=TLSv1']
+        })
+      end
+
+      Capybara.default_wait_time = 3
+      Capybara.run_server = false
+      Capybara.default_driver = :poltergeist_crawler
+      page.driver.headers = {
+        "DNT" => 1,
+        "User-Agent" => USER_AGENT
+      }
+    end
+  end
+
+end
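CapybaraDriver.setup skips all of the above when options[:external_driver] is set, so a caller can register its own Capybara driver first. A sketch under that assumption (the :my_poltergeist name is hypothetical):

```ruby
require 'capybara/poltergeist'
require 'grell'

# Register and select a custom driver before creating the crawler
Capybara.register_driver :my_poltergeist do |app|
  Capybara::Poltergeist::Driver.new(app, js_errors: false)
end
Capybara.default_driver = :my_poltergeist

# external_driver: true tells Grell not to register its own driver
crawler = Grell::Crawler.new(external_driver: true)
```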
data/lib/grell/crawler.rb
ADDED
@@ -0,0 +1,44 @@
+
+module Grell
+
+  # This is the class that starts and controls the crawling
+  class Crawler
+    attr_reader :collection
+
+    def initialize(options = {})
+      CapybaraDriver.setup(options)
+
+      if options[:logger]
+        Grell.logger = options[:logger]
+      else
+        Grell.logger = Logger.new(STDOUT)
+      end
+
+      @collection = PageCollection.new
+    end
+
+    def start_crawling(url, &block)
+      Grell.logger.info "GRELL Started crawling"
+      @collection = PageCollection.new
+      @collection.create_page(url, nil)
+      while !@collection.discovered_pages.empty?
+        crawl(@collection.next_page, block)
+      end
+      Grell.logger.info "GRELL finished crawling"
+    end
+
+    def crawl(site, block)
+      Grell.logger.info "Visiting #{site.url}, visited_links: #{@collection.visited_pages.size}, discovered #{@collection.discovered_pages.size}"
+      site.navigate
+
+      block.call(site) if block
+
+      site.links.each do |url|
+        @collection.create_page(url, site.id)
+      end
+    end
+
+  end
+
+end
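Since start_crawling rebuilds the PageCollection on every call, one Crawler instance can be reused across sites; a small sketch with hypothetical URLs:

```ruby
crawler = Grell::Crawler.new

%w[http://example.com http://example.org].each do |site|
  # Each call starts from a fresh PageCollection, so visited pages do not leak between sites
  crawler.start_crawling(site) { |page| puts "#{site}: #{page.url}" }
end
```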
data/lib/grell/page.rb
ADDED
@@ -0,0 +1,231 @@
+require 'forwardable'
+
+module Grell
+  # This class contains the logic related to working with each page we crawl. It is also the interface we use
+  # to access the information of each page.
+  # This information comes from the result private classes below.
+  class Page
+    extend Forwardable
+
+    WAIT_TIME = 10
+    WAIT_INTERVAL = 0.5
+
+    attr_reader :url, :timestamp, :id, :parent_id, :rawpage
+    # Most of the interesting information accessed through this class is accessed by the methods below
+    def_delegators :@result_page, :headers, :body, :status, :links, :has_selector?, :host, :visited?
+
+    def initialize(url, id, parent_id)
+      @rawpage = RawPage.new
+      @url = url
+      @id = id
+      @parent_id = parent_id
+      @timestamp = nil
+      @result_page = UnvisitedPage.new
+    end
+
+    def navigate
+      # We wait a maximum of WAIT_TIME seconds to get an HTML page. We try our best to work around inconsistencies in Poltergeist
+      Reader.wait_for(-> { @rawpage.navigate(url) }, WAIT_TIME, WAIT_INTERVAL) do
+        @rawpage.status && !@rawpage.headers.empty? &&
+          @rawpage.headers["Content-Type"] && @rawpage.headers["Content-Type"].include?('text/html').equal?(true)
+      end
+      @result_page = VisitedPage.new(@rawpage)
+      @timestamp = Time.now
+    rescue Capybara::Poltergeist::JavascriptError => e
+      unavailable_page(404, e)
+    rescue Capybara::Poltergeist::BrowserError => e # This may happen internally in Poltergeist; they claim it is a bug.
+      unavailable_page(404, e)
+    rescue URI::InvalidURIError => e # An invalid URL means we report an error
+      unavailable_page(404, e)
+    rescue Capybara::Poltergeist::TimeoutError => e # Poltergeist has its own timeout, similar to Chrome's.
+      unavailable_page(404, e)
+    rescue Capybara::Poltergeist::StatusFailError => e
+      unavailable_page(404, e)
+    end
+
+    private
+
+    def unavailable_page(status, exception)
+      Grell.logger.warn "The page with the URL #{@url} was not available. Exception #{exception}"
+      @result_page = ErroredPage.new(status, exception)
+      @timestamp = Time.now
+    end
+
+    # Private class.
+    # This is a result page when it has not been visited yet. Essentially empty of information.
+    #
+    class UnvisitedPage
+      def status
+        nil
+      end
+
+      def body
+        ''
+      end
+
+      def headers
+        { grellStatus: 'NotVisited' }
+      end
+
+      def links
+        []
+      end
+
+      def host
+        ''
+      end
+
+      def visited?
+        false
+      end
+
+      def has_selector?(selector)
+        false
+      end
+    end
+
+    # Private class.
+    # This is a result page when some error happened. It provides some information about the error.
+    #
+    class ErroredPage
+      def initialize(error_code, exception)
+        @error_code = error_code
+        @exception = exception
+      end
+
+      def status
+        @error_code
+      end
+
+      def body
+        ''
+      end
+
+      def headers
+        message = begin
+          @exception.message
+        rescue StandardError
+          "Error message can not be accessed" # Poltergeist may try to access a nil object when building the message
+        end
+
+        {
+          grellStatus: 'Error',
+          errorClass: @exception.class.to_s,
+          errorMessage: message
+        }
+      end
+
+      def links
+        []
+      end
+
+      def host
+        ''
+      end
+
+      def visited?
+        true
+      end
+
+      def has_selector?(selector)
+        false
+      end
+    end
+
+    # Private class.
+    # This is a result page when we successfully got some information back after visiting the page.
+    # It delegates most of the information to the @rawpage Capybara page, but any transformation or logic is here.
+    #
+    class VisitedPage
+      def initialize(rawpage)
+        @rawpage = rawpage
+      end
+
+      def status
+        @rawpage.status
+      end
+
+      def body
+        @rawpage.body
+      end
+
+      def headers
+        @rawpage.headers
+      rescue Capybara::Poltergeist::BrowserError => e # This may happen internally in Poltergeist; they claim it is a bug.
+        {
+          grellStatus: 'Error',
+          errorClass: e.class.to_s,
+          errorMessage: e.message
+        }
+      end
+
+      def links
+        @links ||= all_links
+      end
+
+      def host
+        @rawpage.host
+      end
+
+      def visited?
+        true
+      end
+
+      def has_selector?(selector)
+        @rawpage.has_selector?(selector)
+      end
+
+      private
+
+      def all_links
+        # <link> can only be used in the <head>, as per: https://developer.mozilla.org/en/docs/Web/HTML/Element/link
+        anchors_in_body = @rawpage.all_anchors.reject { |anchor| anchor.tag_name == 'link' }
+
+        unique_links = anchors_in_body.map do |anchor|
+          anchor['href'] || anchor['data-href']
+        end.compact
+
+        unique_links.map { |link| link_to_url(link) }.uniq.compact
+
+      rescue Capybara::Poltergeist::ObsoleteNode
+        Grell.logger.warn "We found an obsolete node in #{@url}. Ignoring all links"
+        # Sometimes Javascript and timing may screw this up; we lose these links.
+        # TODO: Can we do something more intelligent here?
+        []
+      end
+
+      # We only accept links on this same host that start with a path.
+      # Anything else maps to nil.
+      def link_to_url(link)
+        uri = URI.parse(link)
+        if uri.absolute?
+          if uri.host != URI.parse(host).host
+            Grell.logger.debug "GRELL does not follow links to external hosts: #{link}"
+            nil
+          else
+            link # Absolute link to our own host
+          end
+        else
+          if uri.path.nil?
+            Grell.logger.debug "GRELL does not follow links without a path: #{uri}"
+            nil
+          elsif uri.path.start_with?('/')
+            host + link # convert to a full URL
+          else # for links like href="google.com" the browser would go to http://google.com, as if it were "http://#{link}"
+            Grell.logger.debug "GRELL badly formatted link: #{link}, assuming external"
+            nil
+          end
+        end
+
+      rescue URI::InvalidURIError # We will have invalid links propagating till we navigate to them
+        link
+      end
+    end
+
+  end
+
+end
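Because Page delegates headers, body, status, links, host, visited? and has_selector? to its result page, a crawl block can query page content directly. A sketch, assuming a hypothetical CSS selector:

```ruby
crawler = Grell::Crawler.new
crawler.start_crawling('http://example.com') do |page|
  # For visited pages, has_selector? is answered by the underlying Capybara page
  puts "#{page.url} has a signup form" if page.has_selector?('form#signup')
end
```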