spiderkit 0.1.1 → 0.1.2
- data/README.md +34 -11
- data/lib/queue.rb +6 -2
- data/lib/version.rb +1 -1
- data/spec/visit_queue_spec.rb +26 -1
- metadata +2 -2
data/README.md
CHANGED
@@ -37,12 +37,13 @@ Using Spiderkit, you implement your spiders and bots around the idea of a visit
 * Once an object is iterated over, it's removed from the queue
 * Once an object is iterated over, it's string value is added to an already-visited list, at which point you can't add it again.
 * The queue will stop once it's empty, and optionally execute a final Proc that you pass to it
-* The queue will not fetch web pages or anything else on it's own - that's part of what *you* implement.
+* The queue will not fetch web pages or anything else on it's own - that's part of what *you* implement.
+
+Since you need to implement page fetching on your own (using any of a number of high quality gems or libararies), you'll also need to implement the associated error checking, network timeout handling, and sanity checking that's involved. If you handle redirects by pushing them onto the queue, however, then you'll at least get a little help where redirect loops are concerned.
 
 A basic example:
 
 ```ruby
-
 mybot = Spider::VisitQueue.new
 mybot.push_front('http://someurl.com')
 
@@ -58,17 +59,15 @@ A slightly fancier example:
 
 ```ruby
 #download robots.txt as variable txt
-#user agent is "my-bot
+#user agent for robots.txt is "my-bot"
 
 finalizer = Proc.new { puts "done"}
-
-mybot = Spider::VisitQueue.new(txt, "my-bot/1.0", finalizer)
+mybot = Spider::VisitQueue.new(txt, "my-bot", finalizer)
 ```
 
 As urls are fetched and added to the queue, any links already visited will be dropped transparently. You have the option to push objects to either the front or rear of the queue at any time. If you do push to the front of the queue while iterating over it, the things you push will be the next items visited, and vice versa if you push to the back:
 
 ```ruby
-
 mybot.visit_each do |url|
   #these will be visited next
   mybot.push_front(nexturls)
@@ -76,19 +75,26 @@ mybot.visit_each do |url|
   #these will be visited last
   mybot.push_back(lasturls)
 end
+```
+
+The already visited list is implemented as a Bloom filter, so you should be able to spider even fairly large domains (and there are quite a few out there) without re-visiting pages. You can get a count of pages you've already visited at any time with the visit_count method.
 
+If you need to clear the visited list at any point, use the clear_visited method:
+
+```ruby
+mybot.visit_each do |url|
+  mybot.clear_visited
+end
 ```
 
-
+After which you can push urls onto the queue regardless of if you visited them before clearing. However, the queue will still refuse to visit them once you've done so again. Note also that the count of visited pages will *not* reset.
 
 Finally, you can forcefully stop spidering at any point:
 
 ```ruby
-
 mybot.visit_each do |url|
   mybot.stop
 end
-
 ```
 
 The finalizer, if any, will still be executed after stopping iteration.
@@ -120,6 +126,25 @@ mybot.visit_each |url|
 end
 ```
 
+If you don't pass an agent string, then the parser will take it's configuration from the default agent specified in the robots.txt. If you want your bot to respond to directives for a given user agent, just pass the agent to either the queue when you create it, or the parser:
+
+```ruby
+#visit queue that will respond to any robots.txt
+#with User-agent: mybot in them
+mybot = Spider::VisitQueue(txt, 'mybot')
+
+#same thing as a standalone parser
+myparser = Spider::ExclusionParser.new(txt, 'mybot')
+```
+
+Note that user agent string passed in to your exclusion parser and the user agent string sent along with HTTP requests are not necessarily one and the same, although the user agent contained in robots.txt will usually be a subset of the HTTP user agent.
+
+For example:
+
+Googlebot/2.1 (+http://www.google.com/bot.html)
+
+should respond to "googlebot" in robots.txt. By convention, bots and spiders usually have the name 'bot' somewhere in their user agent strings.
+
 ## Wait Time
 
 Ideally a bot should wait for some period of time in between requests to avoid crashing websites (less likely) or being blacklisted (more likely). A WaitTime class is provided that encapsulates this waiting logic, and logic to respond to rate limit codes and the "crawl-delay" directives found in some robots.txt files. Times are in seconds.
@@ -127,7 +152,6 @@ Ideally a bot should wait for some period of time in between requests to avoid c
 You can create it standalone, or get it from an exclusion parser:
 
 ```ruby
-
 #download a robots.txt with a crawl-delay 40
 
 robots_txt = Spider::ExclusionParser.new(txt)
@@ -147,7 +171,6 @@ delay.reduce_wait
 
 #after one call to back_off and one call to reduce_wait
 delay.value => 160
-
 ```
 
 By default a WaitTime will specify an initial value of 2 seconds. You can pass a value to new to specify the wait seconds, although values larger than the max allowable value will be set to the max allowable value (3 minutes / 180 seconds).
data/lib/queue.rb
CHANGED
@@ -15,10 +15,10 @@ module Spider
     attr_accessor :robot_txt
 
     def initialize(robots=nil, agent=nil, finish=nil)
-      @visited = BloomFilter.new(size: 10_000, error_rate: 0.001)
       @robot_txt = ExclusionParser.new(robots, agent) if robots
       @finalize = finish
       @visit_count = 0
+      clear_visited
       @pending = []
     end
 
@@ -57,6 +57,10 @@ module Spider
     def stop
       raise IterationExit
     end
+
+    def clear_visited
+      @visited = BloomFilter.new(size: 10_000, error_rate: 0.001)
+    end
 
     private
 
@@ -71,7 +75,7 @@ module Spider
       urls.compact!
 
       urls.each do |url|
-        unless @visited.include?(url)
+        unless @visited.include?(url) || @pending.include?(url)
           yield url
         end
       end
data/lib/version.rb
CHANGED
data/spec/visit_queue_spec.rb
CHANGED
@@ -31,7 +31,6 @@ module Spider
       expect(order).to eq %w(one two divider)
     end
 
-
     it 'should allow appending to the back of the queue' do
       @queue.push_back('one')
       @queue.push_back('two')
@@ -86,6 +85,13 @@ module Spider
       expect(@queue.empty?).to be true
     end
 
+    it 'should not insert urls already in the pending queue' do
+      @queue.push_back(%w(one two three))
+      expect(@queue.size).to eq 4
+      @queue.push_back(%w(one two three))
+      expect(@queue.size).to eq 4
+    end
+
     it 'should track number of urls visited' do
       expect(@queue.visit_count).to eq 0
       @queue.push_back(%w(one two three four))
@@ -124,6 +130,25 @@ REND
       expect(queue.visit_count).to eq 1
       expect(flag).to be true
     end
+
+    it 'should allow you to clear the visited list' do
+      @queue.push_front(%w(one two three))
+      order = get_visit_order(@queue)
+      expect(order).to eq %w(three two one divider)
+      expect(@queue.visit_count).to eq 4
+
+      @queue.push_front(%w(one two three))
+      order = get_visit_order(@queue)
+      expect(order).to be_empty
+      expect(@queue.visit_count).to eq 4
+
+      @queue.clear_visited
+      @queue.push_front(%w(one two three))
+      order = get_visit_order(@queue)
+      expect(order).to eq %w(three two one)
+      expect(@queue.visit_count).to eq 7
+    end
+
   end
 
 end
metadata
CHANGED
@@ -1,7 +1,7 @@
 --- !ruby/object:Gem::Specification
 name: spiderkit
 version: !ruby/object:Gem::Version
-  version: 0.1.
+  version: 0.1.2
 prerelease:
 platform: ruby
 authors:
@@ -9,7 +9,7 @@ authors:
 autorequire:
 bindir: bin
 cert_chain: []
-date: 2016-07-
+date: 2016-07-15 00:00:00.000000000 Z
 dependencies:
 - !ruby/object:Gem::Dependency
   name: bundler