spiderkit 0.1.1 → 0.1.2

data/README.md CHANGED
@@ -37,12 +37,13 @@ Using Spiderkit, you implement your spiders and bots around the idea of a visit
  * Once an object is iterated over, it's removed from the queue
  * Once an object is iterated over, its string value is added to an already-visited list, at which point you can't add it again.
  * The queue will stop once it's empty, and optionally execute a final Proc that you pass to it
- * The queue will not fetch web pages or anything else on it's own - that's part of what *you* implement.
+ * The queue will not fetch web pages or anything else on its own - that's part of what *you* implement.
+
+ Since you need to implement page fetching on your own (using any of a number of high quality gems or libraries), you'll also need to implement the associated error checking, network timeout handling, and sanity checking that's involved. If you handle redirects by pushing them onto the queue, however, then you'll at least get a little help where redirect loops are concerned.
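For illustration, here is a minimal sketch of what that fetch-and-dispatch loop might look like, using Ruby's standard net/http. The broad error handling is deliberate, and extract_links is a hypothetical helper standing in for whatever link extraction you implement; none of this is part of Spiderkit itself.

```ruby
require 'net/http'
require 'uri'

mybot = Spider::VisitQueue.new
mybot.push_front('http://someurl.com')

mybot.visit_each do |url|
  begin
    response = Net::HTTP.get_response(URI(url))

    case response
    when Net::HTTPRedirection
      # push the redirect target back onto the queue; the already-visited
      # list then gives some protection against redirect loops
      mybot.push_front(response['location'])
    when Net::HTTPSuccess
      # extract_links is a hypothetical helper you would write yourself
      mybot.push_back(extract_links(response.body))
    end
  rescue StandardError => e
    # network errors, timeouts, and bad responses are yours to handle;
    # a broad rescue is used here purely for illustration
    warn "failed to fetch #{url}: #{e.message}"
  end
end
```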
 
  A basic example:
 
  ```ruby
-
  mybot = Spider::VisitQueue.new
  mybot.push_front('http://someurl.com')
 
@@ -58,17 +59,15 @@ A slightly fancier example:
 
  ```ruby
  #download robots.txt as variable txt
- #user agent is "my-bot/1.0"
+ #user agent for robots.txt is "my-bot"
 
  finalizer = Proc.new { puts "done"}
-
- mybot = Spider::VisitQueue.new(txt, "my-bot/1.0", finalizer)
+ mybot = Spider::VisitQueue.new(txt, "my-bot", finalizer)
  ```
 
  As urls are fetched and added to the queue, any links already visited will be dropped transparently. You have the option to push objects to either the front or rear of the queue at any time. If you do push to the front of the queue while iterating over it, the things you push will be the next items visited, and vice versa if you push to the back:
 
  ```ruby
-
  mybot.visit_each do |url|
    #these will be visited next
    mybot.push_front(nexturls)
@@ -76,19 +75,26 @@ mybot.visit_each do |url|
    #these will be visited last
    mybot.push_back(lasturls)
  end
+ ```
+
+ The already visited list is implemented as a Bloom filter, so you should be able to spider even fairly large domains (and there are quite a few out there) without re-visiting pages. You can get a count of pages you've already visited at any time with the visit_count method.
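For example (a trivial sketch; the fetching itself is omitted):

```ruby
mybot.visit_each do |url|
  # ...fetch and process the page...
  puts "#{mybot.visit_count} pages visited so far"
end
```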
 
+ If you need to clear the visited list at any point, use the clear_visited method:
+
+ ```ruby
+ mybot.visit_each do |url|
+   mybot.clear_visited
+ end
  ```
 
- The already visited list is implemented as a Bloom filter, so you should be able to spider even fairly large domains (and there are quite a few out there) without re-visiting pages.
+ After which you can push urls onto the queue regardless of whether you visited them before clearing. However, once those urls have been visited again, the queue will refuse them just as before. Note also that the count of visited pages will *not* reset.
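A short sketch of that behavior, assuming a freshly created queue (the counts shown are illustrative):

```ruby
mybot = Spider::VisitQueue.new
mybot.push_back('http://someurl.com')
mybot.visit_each { |url| }             # visited once, visit_count is 1

mybot.push_back('http://someurl.com')  # dropped - already visited
mybot.clear_visited
mybot.push_back('http://someurl.com')  # accepted again after clearing
mybot.visit_each { |url| }             # visited again, visit_count is now 2
```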
 
  Finally, you can forcefully stop spidering at any point:
 
  ```ruby
-
  mybot.visit_each do |url|
    mybot.stop
  end
-
  ```
 
  The finalizer, if any, will still be executed after stopping iteration.
@@ -120,6 +126,25 @@ mybot.visit_each |url|
  end
  ```
 
+ If you don't pass an agent string, then the parser will take its configuration from the default agent specified in the robots.txt. If you want your bot to respond to directives for a given user agent, just pass the agent to either the queue when you create it, or the parser:
+
+ ```ruby
+ #visit queue that will respond to any robots.txt
+ #with User-agent: mybot in them
+ mybot = Spider::VisitQueue.new(txt, 'mybot')
+
+ #same thing as a standalone parser
+ myparser = Spider::ExclusionParser.new(txt, 'mybot')
+ ```
+
+ Note that the user agent string passed to your exclusion parser and the user agent string sent along with HTTP requests are not necessarily one and the same, although the user agent contained in robots.txt will usually be a subset of the HTTP user agent.
+
+ For example:
+
+ Googlebot/2.1 (+http://www.google.com/bot.html)
+
+ should respond to "googlebot" in robots.txt. By convention, bots and spiders usually have the word 'bot' somewhere in their user agent strings.
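To make the distinction concrete, here is a sketch in which the robots.txt token and the HTTP user agent differ. The full agent string and the net/http request handling are illustrative, not part of Spiderkit; txt is the downloaded robots.txt from the earlier examples.

```ruby
require 'net/http'
require 'uri'

# short token matched against User-agent lines in robots.txt
robots = Spider::ExclusionParser.new(txt, 'mybot')

# full string sent along with each HTTP request
uri = URI('http://someurl.com/page')
request = Net::HTTP::Get.new(uri)
request['User-Agent'] = 'mybot/1.0 (+http://example.com/bot.html)'
response = Net::HTTP.start(uri.host, uri.port) { |http| http.request(request) }
```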
+
  ## Wait Time
 
  Ideally a bot should wait for some period of time in between requests to avoid crashing websites (less likely) or being blacklisted (more likely). A WaitTime class is provided that encapsulates this waiting logic, and logic to respond to rate limit codes and the "crawl-delay" directives found in some robots.txt files. Times are in seconds.
@@ -127,7 +152,6 @@ Ideally a bot should wait for some period of time in between requests to avoid c
  You can create it standalone, or get it from an exclusion parser:
 
  ```ruby
-
  #download a robots.txt with a crawl-delay 40
 
  robots_txt = Spider::ExclusionParser.new(txt)
@@ -147,7 +171,6 @@ delay.reduce_wait
 
  #after one call to back_off and one call to reduce_wait
  delay.value => 160
-
  ```
 
 By default a WaitTime will specify an initial value of 2 seconds. You can pass a value to new to set the wait in seconds, although values larger than the maximum allowed (3 minutes / 180 seconds) will be clamped to that maximum.
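For instance, assuming the class is Spider::WaitTime and that new accepts the initial wait in seconds (as the text above suggests):

```ruby
delay = Spider::WaitTime.new        # default initial value
delay.value => 2

delay = Spider::WaitTime.new(300)   # larger than the 180 second ceiling
delay.value => 180
```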
data/lib/queue.rb CHANGED
@@ -15,10 +15,10 @@ module Spider
  attr_accessor :robot_txt
 
  def initialize(robots=nil, agent=nil, finish=nil)
- @visited = BloomFilter.new(size: 10_000, error_rate: 0.001)
  @robot_txt = ExclusionParser.new(robots, agent) if robots
  @finalize = finish
  @visit_count = 0
+ clear_visited
  @pending = []
  end
 
@@ -57,6 +57,10 @@ module Spider
  def stop
  raise IterationExit
  end
+
+ def clear_visited
+ @visited = BloomFilter.new(size: 10_000, error_rate: 0.001)
+ end
 
  private
 
@@ -71,7 +75,7 @@ module Spider
  urls.compact!
 
  urls.each do |url|
- unless @visited.include?(url)
+ unless @visited.include?(url) || @pending.include?(url)
  yield url
  end
  end
data/lib/version.rb CHANGED
@@ -3,5 +3,5 @@
  # License:: MIT
 
  module Spider
- VERSION = "0.1.1"
+ VERSION = "0.1.2"
  end
@@ -31,7 +31,6 @@ module Spider
  expect(order).to eq %w(one two divider)
  end
 
-
  it 'should allow appending to the back of the queue' do
  @queue.push_back('one')
  @queue.push_back('two')
@@ -86,6 +85,13 @@ module Spider
  expect(@queue.empty?).to be true
  end
 
+ it 'should not insert urls already in the pending queue' do
+ @queue.push_back(%w(one two three))
+ expect(@queue.size).to eq 4
+ @queue.push_back(%w(one two three))
+ expect(@queue.size).to eq 4
+ end
+
  it 'should track number of urls visited' do
  expect(@queue.visit_count).to eq 0
  @queue.push_back(%w(one two three four))
@@ -124,6 +130,25 @@ REND
  expect(queue.visit_count).to eq 1
  expect(flag).to be true
  end
+
+ it 'should allow you to clear the visited list' do
+ @queue.push_front(%w(one two three))
+ order = get_visit_order(@queue)
+ expect(order).to eq %w(three two one divider)
+ expect(@queue.visit_count).to eq 4
+
+ @queue.push_front(%w(one two three))
+ order = get_visit_order(@queue)
+ expect(order).to be_empty
+ expect(@queue.visit_count).to eq 4
+
+ @queue.clear_visited
+ @queue.push_front(%w(one two three))
+ order = get_visit_order(@queue)
+ expect(order).to eq %w(three two one)
+ expect(@queue.visit_count).to eq 7
+ end
+
  end
 
  end
metadata CHANGED
@@ -1,7 +1,7 @@
  --- !ruby/object:Gem::Specification
  name: spiderkit
  version: !ruby/object:Gem::Version
- version: 0.1.1
+ version: 0.1.2
  prerelease:
  platform: ruby
  authors:
@@ -9,7 +9,7 @@ authors:
  autorequire:
  bindir: bin
  cert_chain: []
- date: 2016-07-10 00:00:00.000000000 Z
+ date: 2016-07-15 00:00:00.000000000 Z
  dependencies:
  - !ruby/object:Gem::Dependency
    name: bundler