grell 1.3.0 → 1.3.1

checksums.yaml CHANGED
@@ -1,7 +1,7 @@
 ---
 SHA1:
-  metadata.gz: 85e9d00d79051e54ba1c5361e7f1edfd8abf8af6
-  data.tar.gz: 25b1f2db2ef87e61294158843fe6e396ed15b046
+  metadata.gz: eb66e60267eec85ffc191a0aa22313e6b2c7ea81
+  data.tar.gz: b00d7a5d2a56ff0a24157a7c0a4b34c22adbb7de
 SHA512:
-  metadata.gz: b00755884c96ddf04e6954d5761903e28cbd46207f45ba5a97f1f1b727b70aad2361386a35e3c96ecb8b65089f1412ac4d7a1bdc8a18d70b87009198883205c3
-  data.tar.gz: 927714b4cd7e7b520d75755e993eff343cebd06a8c36993bcd81abb7453e1ace497e05766f0b8d1ae4bed2e7c00447a1ded267c20028d6d118f32d661c36685e
+  metadata.gz: 38f9699d3431a4189457a7da6a8768bd87c3f8c0203fb85bbacdd4e360a32a8736088128ee7f1a56d40156ab407a8b35b9beffd0decb286674ad77acb920c75c
+  data.tar.gz: 0f5ded5c4c18cb85fd23f6a256ccb26e23804f01e61ec1cb7ab6a22242c32b0d272f4f9871fe10e2c347b3d5abd63578fa64fb9d52addc23936b77ca2a200bd6
data/CHANGELOG.md CHANGED
@@ -1,3 +1,7 @@
+* Version 1.3.1
+  Added whitelisting and blacklisting
+  Better info in gemspec
+
 * Version 1.3
   The Crawler object allows you to provide an external logger object.
   Clearer semantics when an error happens, special headers are returned so the user can inspect the error
data/README.md CHANGED
@@ -51,6 +51,54 @@ end
 Grell keeps a list of pages previously crawled and does not visit the same page twice.
 This list is indexed by the complete url, including query parameters.
 
+### Selecting links to follow
+
+By default, Grell follows every link it finds that points to the site
+you are crawling. It never follows links that lead outside your site.
+If you want to further limit the number of links crawled, you can use
+whitelisting, blacklisting or manual filtering.
+
+#### Whitelisting
+
+```ruby
+require 'grell'
+
+crawler = Grell::Crawler.new
+crawler.whitelist([/games\/.*/, '/fun'])
+crawler.start_crawling('http://www.google.com')
+```
+
+Here Grell will only follow links to games and to '/fun', ignoring all
+other links. You can provide a regexp, a string (a link is whitelisted
+if any part of it matches the string), or an array mixing regexps and
+strings, as the sketch below shows.
+
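+All of these forms should work (a hypothetical sketch; the patterns are
+illustrative, and each call replaces any previous whitelist):
+
+```ruby
+crawler.whitelist('/fun')                 # a string: any link containing '/fun' survives
+crawler.whitelist(/games\/.*/)            # a single regexp
+crawler.whitelist([/games\/.*/, '/fun'])  # an array mixing regexps and strings
+```
+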
+#### Blacklisting
+
+```ruby
+require 'grell'
+
+crawler = Grell::Crawler.new
+crawler.blacklist(/games\/.*/)
+crawler.start_crawling('http://www.google.com')
+```
+
+This works like whitelisting, but inverted: here Grell will follow every
+other link on the site, that is, every link which does not go to /games/...
+
+If you call both whitelist and blacklist, both filters apply: a link
+has to fulfill both conditions to survive. If you call neither, all
+links on the site will be crawled. Think of these methods as filters;
+see the sketch below.
+
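+A minimal sketch combining both filters (the site and paths are hypothetical):
+
+```ruby
+require 'grell'
+
+crawler = Grell::Crawler.new
+crawler.whitelist(/games\/.*/)    # only links matching /games/ survive...
+crawler.blacklist('/games/demo')  # ...except those containing '/games/demo'
+crawler.start_crawling('http://www.example.com')
+```
+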
+#### Manual link filtering
+
+If you have a more complex use case, you can modify the list of links
+manually.
+Grell yields each page to your block before adding that page's links to
+the list of links to visit. Inside the block you can therefore add links
+to or delete links from "page.links" to control which pages Grell
+visits next, as in the sketch below.
+
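+A minimal sketch (the url and paths are illustrative; only "page.links"
+comes from Grell itself):
+
+```ruby
+require 'grell'
+
+crawler = Grell::Crawler.new
+crawler.start_crawling('http://www.example.com') do |page|
+  # Drop any discovered link we never want to visit...
+  page.links.delete_if { |link| link.include?('/logout') }
+  # ...and queue an extra url Grell did not discover on this page.
+  page.links << 'http://www.example.com/hidden.html'
+end
+```
+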
 ### Pages' id
 
 Each page has a unique id, accessed by the property 'id'. Also each page stores the id of the page from which we found this page, accessed by the property 'parent_id'.
@@ -64,6 +112,11 @@ When there is an error in the page or an internal error in the crawler (Javascri
 - errorClass: The class of the error which broke this page.
 - errorMessage: A descriptive message with the information Grell could gather about the error.
 
+### Logging
+You can pass your own logger to Grell. For example, in a Rails app:
+```ruby
+crawler = Grell::Crawler.new(logger: Rails.logger)
+```
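+
+Outside Rails, any standard library Logger should also work (a sketch):
+
+```ruby
+require 'logger'
+
+crawler = Grell::Crawler.new(logger: Logger.new(STDOUT))
+```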
 
 ## Tests
 
data/grell.gemspec CHANGED
@@ -6,24 +6,28 @@ require 'grell/version'
 Gem::Specification.new do |spec|
   spec.name          = "grell"
   spec.version       = Grell::VERSION
+  spec.platform      = Gem::Platform::RUBY
   spec.authors       = ["Jordi Polo Carres"]
   spec.email         = ["jcarres@mdsol.com"]
   spec.summary       = %q{Ruby web crawler}
   spec.description   = %q{Ruby web crawler using PhantomJS}
   spec.homepage      = "https://github.com/mdsol/grell"
+  spec.license       = 'MIT'
 
   spec.files         = `git ls-files -z`.split("\x0")
   spec.executables   = spec.files.grep(%r{^bin/}) { |f| File.basename(f) }
   spec.test_files    = spec.files.grep(%r{^(test|spec|features)/})
   spec.require_paths = ["lib"]
 
+  spec.required_ruby_version = '>= 1.9.3'
+
   spec.add_dependency 'capybara', '~> 2.2'
   spec.add_dependency 'poltergeist', '~> 1.5'
 
   spec.add_development_dependency "bundler", "~> 1.6"
   spec.add_development_dependency "byebug", "~> 4.0"
-  spec.add_development_dependency "kender"
-  spec.add_development_dependency "rake"
+  spec.add_development_dependency "kender", '~> 0.2'
+  spec.add_development_dependency "rake", '~> 10.0'
   spec.add_development_dependency "webmock", '~> 1.18'
   spec.add_development_dependency 'rspec', '~> 3.0'
   spec.add_development_dependency 'puffing-billy', '~> 0.5'
data/lib/grell/crawler.rb CHANGED
@@ -17,6 +17,14 @@ module Grell
       @collection = PageCollection.new
     end
 
+    def whitelist(list)
+      @whitelist_regexp = Regexp.union(list)
+    end
+
+    def blacklist(list)
+      @blacklist_regexp = Regexp.union(list)
+    end
+
 
     def start_crawling(url, &block)
       Grell.logger.info "GRELL Started crawling"
@@ -32,6 +40,8 @@ module Grell
       Grell.logger.info "Visiting #{site.url}, visited_links: #{@collection.visited_pages.size}, discovered #{@collection.discovered_pages.size}"
       site.navigate
 
+      filter!(site.links)
+
       block.call(site) if block
 
       site.links.each do |url|
@@ -39,6 +49,12 @@ module Grell
       end
     end
 
+    private
+    def filter!(links)
+      links.select! { |link| link =~ @whitelist_regexp } if @whitelist_regexp
+      links.delete_if { |link| link =~ @blacklist_regexp } if @blacklist_regexp
+    end
+
   end
 
 end
data/lib/grell/page_collection.rb CHANGED
@@ -1,4 +1,8 @@
 module Grell
+  # Keeps a record of all the pages crawled.
+  # When a new url is found it is added to this collection, which makes sure it is unique.
+  # A new page starts out among the discovered pages; once it has been navigated to, it
+  # becomes part of the visited pages.
   class PageCollection
     attr_reader :collection
 
data/lib/grell/reader.rb CHANGED
@@ -1,4 +1,7 @@
 module Grell
+  # A tooling class: it waits at most max_waiting for an action to finish. If the action has
+  # not finished by then, it continues anyway.
+  # The maximum wait may be long, but we return as soon as the action finishes.
   class Reader
     def self.wait_for(action, max_waiting, sleeping_time)
       time_start = Time.now
data/lib/grell/version.rb CHANGED
@@ -1,3 +1,3 @@
 module Grell
-  VERSION = "1.3.0"
+  VERSION = "1.3.1"
 end
data/spec/lib/crawler_spec.rb CHANGED
@@ -5,7 +5,7 @@ RSpec.describe Grell::Crawler do
   let(:page) { Grell::Page.new(url, page_id, parent_page_id) }
   let(:host) { "http://www.example.com" }
   let(:url) { "http://www.example.com/test" }
-  let(:crawler) { Grell::Crawler.new(external_driver: true) }
+  let(:crawler) { Grell::Crawler.new(logger: Logger.new(nil), external_driver: true) }
   let(:body) { 'body' }
 
   before do
@@ -64,24 +64,144 @@ RSpec.describe Grell::Crawler do
     end
   end
 
+  shared_examples_for 'visits all available pages' do
+    it 'visits all the pages' do
+      crawler.start_crawling(url)
+      expect(crawler.collection.visited_pages.size).to eq(visited_pages_count)
+    end
+    it 'has no more pages to discover' do
+      crawler.start_crawling(url)
+      expect(crawler.collection.discovered_pages.size).to eq(0)
+    end
+
+    it 'visits exactly the expected pages' do
+      crawler.start_crawling(url)
+      expect(crawler.collection.visited_pages.map(&:url)).
+        to eq(visited_pages)
+    end
+  end
+
   context 'the url has no links' do
     let(:body) do
       "<html><head></head><body>
       Hello world!
       </body></html>"
     end
+    let(:visited_pages_count) { 1 }
+    let(:visited_pages) { ['http://www.example.com/test'] }
+
+    it_behaves_like 'visits all available pages'
+  end
+
+  context 'the url has several links' do
+    let(:body) do
+      "<html><head></head><body>
+      <a href=\"/trusmis.html\">trusmis</a>
+      <a href=\"/help.html\">help</a>
+      Hello world!
+      </body></html>"
+    end
     before do
-      crawler.start_crawling(url)
+      proxy.stub('http://www.example.com/trusmis.html').and_return(body: 'body', code: 200)
+      proxy.stub('http://www.example.com/help.html').and_return(body: 'body', code: 200)
     end
-    it 'visits all the pages' do
-      expect(crawler.collection.visited_pages.size).to eq(1)
+    let(:visited_pages_count) { 3 }
+    let(:visited_pages) do
+      ['http://www.example.com/test', 'http://www.example.com/trusmis.html', 'http://www.example.com/help.html']
     end
-    it 'has no more pages to discover' do
-      expect(crawler.collection.discovered_pages.size).to eq(0)
+
+    it_behaves_like 'visits all available pages'
+  end
+
+  describe '#whitelist' do
+    let(:body) do
+      "<html><head></head><body>
+      <a href=\"/trusmis.html\">trusmis</a>
+      <a href=\"/help.html\">help</a>
+      Hello world!
+      </body></html>"
+    end
+
+    before do
+      proxy.stub('http://www.example.com/trusmis.html').and_return(body: 'body', code: 200)
+      proxy.stub('http://www.example.com/help.html').and_return(body: 'body', code: 200)
+    end
+
+    context 'using a single string' do
+      before do
+        crawler.whitelist('/trusmis.html')
+      end
+      let(:visited_pages_count) { 2 } # my own page + trusmis
+      let(:visited_pages) do
+        ['http://www.example.com/test', 'http://www.example.com/trusmis.html']
+      end
+
+      it_behaves_like 'visits all available pages'
+    end
+
+    context 'using an array of strings' do
+      before do
+        crawler.whitelist(['/trusmis.html', '/nothere', 'another.html'])
+      end
+      let(:visited_pages_count) { 2 }
+      let(:visited_pages) do
+        ['http://www.example.com/test', 'http://www.example.com/trusmis.html']
+      end
+
+      it_behaves_like 'visits all available pages'
+    end
+
+    context 'using a regexp' do
+      before do
+        crawler.whitelist(/\/trusmis\.html/)
+      end
+      let(:visited_pages_count) { 2 }
+      let(:visited_pages) do
+        ['http://www.example.com/test', 'http://www.example.com/trusmis.html']
+      end
+
+      it_behaves_like 'visits all available pages'
+    end
+
+    context 'using an array of regexps' do
+      before do
+        crawler.whitelist([/\/trusmis\.html/])
+      end
+      let(:visited_pages_count) { 2 }
+      let(:visited_pages) do
+        ['http://www.example.com/test', 'http://www.example.com/trusmis.html']
+      end
+
+      it_behaves_like 'visits all available pages'
+    end
+
+    context 'using an empty array' do
+      before do
+        crawler.whitelist([])
+      end
+      let(:visited_pages_count) { 1 } # my own page only
+      let(:visited_pages) do
+        ['http://www.example.com/test']
+      end
+
+      it_behaves_like 'visits all available pages'
+    end
+
+    context 'adding all links to the whitelist' do
+      before do
+        crawler.whitelist(['/trusmis', '/help'])
+      end
+      let(:visited_pages_count) { 3 } # all links
+      let(:visited_pages) do
+        ['http://www.example.com/test', 'http://www.example.com/trusmis.html', 'http://www.example.com/help.html']
+      end
+
+      it_behaves_like 'visits all available pages'
     end
   end
 
-  context 'the url has several links' do
+
+  describe '#blacklist' do
     let(:body) do
       "<html><head></head><body>
       <a href=\"/trusmis.html\">trusmis</a>
@@ -89,18 +209,124 @@ RSpec.describe Grell::Crawler do
       Hello world!
       </body></html>"
     end
+
     before do
       proxy.stub('http://www.example.com/trusmis.html').and_return(body: 'body', code: 200)
       proxy.stub('http://www.example.com/help.html').and_return(body: 'body', code: 200)
     end
 
-    it 'visits all the pages' do
-      crawler.start_crawling(url)
-      expect(crawler.collection.visited_pages.size).to eq(3)
+    context 'using a single string' do
+      before do
+        crawler.blacklist('/trusmis.html')
+      end
+      let(:visited_pages_count) { 2 }
+      let(:visited_pages) do
+        ['http://www.example.com/test', 'http://www.example.com/help.html']
+      end
+
+      it_behaves_like 'visits all available pages'
     end
-    it 'has no more pages to discover' do
-      crawler.start_crawling(url)
-      expect(crawler.collection.discovered_pages.size).to eq(0)
+
+    context 'using an array of strings' do
+      before do
+        crawler.blacklist(['/trusmis.html', '/nothere', 'another.html'])
+      end
+      let(:visited_pages_count) { 2 }
+      let(:visited_pages) do
+        ['http://www.example.com/test', 'http://www.example.com/help.html']
+      end
+
+      it_behaves_like 'visits all available pages'
+    end
+
+    context 'using a regexp' do
+      before do
+        crawler.blacklist(/\/trusmis\.html/)
+      end
+      let(:visited_pages_count) { 2 }
+      let(:visited_pages) do
+        ['http://www.example.com/test', 'http://www.example.com/help.html']
+      end
+
+      it_behaves_like 'visits all available pages'
+    end
+
+    context 'using an array of regexps' do
+      before do
+        crawler.blacklist([/\/trusmis\.html/])
+      end
+      let(:visited_pages_count) { 2 }
+      let(:visited_pages) do
+        ['http://www.example.com/test', 'http://www.example.com/help.html']
+      end
+
+      it_behaves_like 'visits all available pages'
+    end
+
+    context 'using an empty array' do
+      before do
+        crawler.blacklist([])
+      end
+      let(:visited_pages_count) { 3 } # all links
+      let(:visited_pages) do
+        ['http://www.example.com/test', 'http://www.example.com/trusmis.html', 'http://www.example.com/help.html']
+      end
+
+      it_behaves_like 'visits all available pages'
+    end
+
+    context 'adding all links to the blacklist' do
+      before do
+        crawler.blacklist(['/trusmis', '/help'])
+      end
+      let(:visited_pages_count) { 1 }
+      let(:visited_pages) do
+        ['http://www.example.com/test']
+      end
+
+      it_behaves_like 'visits all available pages'
+    end
+  end
+
+
+  describe 'Whitelisting and blacklisting' do
+    let(:body) do
+      "<html><head></head><body>
+      <a href=\"/trusmis.html\">trusmis</a>
+      <a href=\"/help.html\">help</a>
+      Hello world!
+      </body></html>"
+    end
+
+    before do
+      proxy.stub('http://www.example.com/trusmis.html').and_return(body: 'body', code: 200)
+      proxy.stub('http://www.example.com/help.html').and_return(body: 'body', code: 200)
+    end
+
+    context 'we blacklist the only whitelisted page' do
+      before do
+        crawler.whitelist('/trusmis.html')
+        crawler.blacklist('/trusmis.html')
+      end
+      let(:visited_pages_count) { 1 }
+      let(:visited_pages) do
+        ['http://www.example.com/test']
+      end
+
+      it_behaves_like 'visits all available pages'
+    end
+
+    context 'we blacklist none of the whitelisted pages' do
+      before do
+        crawler.whitelist('/trusmis.html')
+        crawler.blacklist('/raistlin.html')
+      end
+      let(:visited_pages_count) { 2 }
+      let(:visited_pages) do
+        ['http://www.example.com/test', 'http://www.example.com/trusmis.html']
+      end
+
+      it_behaves_like 'visits all available pages'
     end
   end
 
data/spec/lib/page_spec.rb CHANGED
@@ -9,6 +9,7 @@ RSpec.describe Grell::Page do
   let(:now) { Time.now }
   before do
     allow(Time).to receive(:now).and_return(now)
+    Grell.logger = Logger.new(nil) # avoids noise in rspec output
   end
 
   it "gives access to the url" do
metadata CHANGED
@@ -1,14 +1,14 @@
 --- !ruby/object:Gem::Specification
 name: grell
 version: !ruby/object:Gem::Version
-  version: 1.3.0
+  version: 1.3.1
 platform: ruby
 authors:
 - Jordi Polo Carres
 autorequire:
 bindir: bin
 cert_chain: []
-date: 2015-05-07 00:00:00.000000000 Z
+date: 2015-05-11 00:00:00.000000000 Z
 dependencies:
 - !ruby/object:Gem::Dependency
   name: capybara
@@ -70,30 +70,30 @@ dependencies:
   name: kender
   requirement: !ruby/object:Gem::Requirement
     requirements:
-    - - ">="
+    - - "~>"
       - !ruby/object:Gem::Version
-        version: '0'
+        version: '0.2'
   type: :development
   prerelease: false
   version_requirements: !ruby/object:Gem::Requirement
     requirements:
-    - - ">="
+    - - "~>"
       - !ruby/object:Gem::Version
-        version: '0'
+        version: '0.2'
 - !ruby/object:Gem::Dependency
   name: rake
   requirement: !ruby/object:Gem::Requirement
     requirements:
-    - - ">="
+    - - "~>"
       - !ruby/object:Gem::Version
-        version: '0'
+        version: '10.0'
   type: :development
   prerelease: false
   version_requirements: !ruby/object:Gem::Requirement
     requirements:
-    - - ">="
+    - - "~>"
       - !ruby/object:Gem::Version
-        version: '0'
+        version: '10.0'
 - !ruby/object:Gem::Dependency
   name: webmock
   requirement: !ruby/object:Gem::Requirement
@@ -165,7 +165,8 @@ files:
 - spec/lib/reader_spec.rb
 - spec/spec_helper.rb
 homepage: https://github.com/mdsol/grell
-licenses: []
+licenses:
+- MIT
 metadata: {}
 post_install_message:
 rdoc_options: []
@@ -175,7 +176,7 @@ required_ruby_version: !ruby/object:Gem::Requirement
 requirements:
 - - ">="
   - !ruby/object:Gem::Version
-    version: '0'
+    version: 1.9.3
 required_rubygems_version: !ruby/object:Gem::Requirement
   requirements:
   - - ">="