spiderkit 0.2.0 → 0.2.1

checksums.yaml CHANGED
@@ -1,7 +1,7 @@
 ---
 SHA1:
-  metadata.gz: 5d3b757d2369c3d6c9520e8f4b78f60f79fdfd95
-  data.tar.gz: 562120125d785f938f37b102a53e90dc78dc56cc
+  metadata.gz: be49878f0fe9fc2947133b602a98e78070e82cee
+  data.tar.gz: 70045c213e2dc966cdcceb7b4d66b68aa953b09d
 SHA512:
-  metadata.gz: 391d8705750fe6e738152384439da6b512dc29471796c37bd48f1d90bd2b3909fc1a1c6257c1a95baff42533d662140f7cf6a999a3d9cedca17cdd8cb1f43c06
-  data.tar.gz: a243700c54bf0eecaf16a879d76e201c7debcb93094bed3ba32bea47f9615c321ba5f297451955cead398e3fda3f7ecd2892673eea87102339e0ca75c3df34de
+  metadata.gz: bd5e451c53232047a24c050161dde45356706d80ea230fa793375dae97e0f1c629b6adf21919b4572f534f210cf91b4fe73a0ca4edaf3b8a99c78f852cb25cb3
+  data.tar.gz: a6bd3b7be26ad0d56f1f408ef5778734a9753b4b599a74e5968b7b4d620e7781fa9d9b4dd7e2b42f87a43ed98b7393a672505a193e6f080d617fc3a5cf98945d
data/README.md CHANGED
@@ -44,13 +44,14 @@ Since you need to implement page fetching on your own (using any of a number of
 A basic example:
 
 ```ruby
+require 'open-uri'
+
 mybot = Spider::VisitQueue.new
 mybot.push_front('http://someurl.com')
 
 mybot.visit_each do |url|
-  #fetch the url
+  data = open(url).read
   #pull out the links as linklist
-
   mybot.push_back(linklist)
 end
 ```
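
Read as a whole, the updated example amounts to roughly the following runnable sketch. Spiderkit leaves link extraction to the caller, so the Nokogiri-based `linklist` step is only an illustrative assumption, not part of the gem:

```ruby
# Assembled from the updated README text above; a sketch, not part of the gem.
require 'open-uri'
require 'nokogiri'  # assumption: any HTML parser could fill the link-extraction step

mybot = Spider::VisitQueue.new
mybot.push_front('http://someurl.com')

mybot.visit_each do |url|
  data = open(url).read

  # pull out the links as linklist (hypothetical extraction step)
  linklist = Nokogiri::HTML(data).css('a[href]').map { |a| a['href'] }

  mybot.push_back(linklist)
end
```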
@@ -104,7 +105,9 @@ The finalizer, if any, will still be executed after stopping iteration.
 Spiderkit also includes a robots.txt parser that can either work standalone, or be passed as an argument to the visit queue. If passed as an argument, urls that are excluded by the robots.txt will be dropped transparently.
 
 ```
-# fetch robots.txt as variable txt
+require 'open-uri'
+
+txt = open('http://somesite.com/robots.txt').read
 
 # create a stand alone parser
 robots_txt = Spider::ExclusionParser.new(txt)
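
The same paragraph describes handing the parser to the visit queue so that excluded urls are dropped transparently. A minimal sketch of that wiring; the constructor argument is an assumption, since this diff only shows the standalone parser:

```ruby
# Transparent-filtering sketch; the constructor argument is an assumption,
# as this diff only shows the standalone parser.
require 'open-uri'

txt = open('http://somesite.com/robots.txt').read
robots_txt = Spider::ExclusionParser.new(txt)

mybot = Spider::VisitQueue.new(robots_txt)  # assumed: parser accepted at construction
mybot.push_front('http://somesite.com/')

mybot.visit_each do |url|
  # urls excluded by robots.txt never reach this block
  puts "visiting #{url}"
end
```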
@@ -150,9 +153,8 @@ should respond to "googlebot" in robots.txt. By convention, bots and spiders us
 ```ruby
 require 'open-uri'
 
-status = 0
-data = open('http://wikipedia.org/robots.txt') { |f| status = f.status }
-mybot.robot_txt = Spider::ExclusionParser.new(data.read, 'mybot', status)
+data = open('http://wikipedia.org/robots.txt')
+mybot.robot_txt = Spider::ExclusionParser.new(data.read, 'mybot', data.status)
 ```
 
 Finally, as a sanity check / to avoid DoS honeypots with malicious robots.txt files, the exclusion parser will process a maximum of one thousand non-whitespace lines before stopping.
@@ -164,7 +166,10 @@ Ideally a bot should wait for some period of time in between requests to avoid c
 You can create it standalone, or get it from an exclusion parser:
 
 ```ruby
+require 'open-uri'
+
 # download a robots.txt with a crawl-delay 40
+txt = open('http://crawldelay40seconds.com/robots.txt').read
 
 robots_txt = Spider::ExclusionParser.new(txt)
 delay = robots_txt.wait_time
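
Honoring the crawl delay means applying the value returned by `wait_time` between fetches. A minimal sketch, assuming `wait_time` yields the Crawl-delay in seconds (its exact return type is not shown in this diff):

```ruby
# Politeness loop built on the example above; treating the returned delay as
# a number of seconds is an assumption made for illustration.
require 'open-uri'

txt = open('http://crawldelay40seconds.com/robots.txt').read
robots_txt = Spider::ExclusionParser.new(txt)
delay = robots_txt.wait_time

mybot = Spider::VisitQueue.new
mybot.push_front('http://crawldelay40seconds.com/')

mybot.visit_each do |url|
  data = open(url).read
  # ...extract links and push_back here...
  sleep(delay)  # pause between requests to respect the advertised crawl delay
end
```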
@@ -209,14 +214,14 @@ Spider::VisitRecorder.record!
 mybot.visit_each do |url|
 
   data = Spider::VisitRecorder.recall(url) do
-    text = ''
     puts "fetch #{url}"
-    open(url) do |f|
-      text = f.read
-      # doing this is only necessary if you want to
-      # save this information in the recording
-      text.http_status = f.status.first.to_i
-    end
+
+    data = open(url)
+    text = data.read
+
+    # doing this is only necessary if you want to
+    # save this information in the recording
+    text.http_status = data.status.first.to_i
 
     text
   end
data/lib/queue.rb CHANGED
@@ -2,7 +2,7 @@
 # Copyright:: Copyright (c) 2016 Robert Dormer
 # License:: MIT
 
-require 'bloom-filter'
+require 'bloomer'
 require 'exclusion'
 
 module Spider
@@ -27,7 +27,7 @@ module Spider
         url = @pending.pop
         next unless url_okay(url)
         yield url.clone if block_given?
-        @visited.insert(url)
+        @visited.add(url)
         @visit_count += 1
       end
     rescue IterationExit
@@ -46,7 +46,7 @@ module Spider
 
     def mark(urls)
       urls = [urls] unless urls.is_a? Array
-      urls.each { |u| @visited.insert(u) }
+      urls.each { |u| @visited.add(u) }
     end
 
     def size
@@ -62,7 +62,7 @@ module Spider
     end
 
     def clear_visited
-      @visited = BloomFilter.new(size: 10_000, error_rate: 0.001)
+      @visited = Bloomer.new(10_000, 0.001)
     end
 
     def url_okay(url)
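
The queue.rb hunks above are the behavioral core of this release: the visited-url set moves from the bloom-filter gem to bloomer, whose constructor and insertion method differ slightly. A side-by-side sketch; the `include?` membership check does not appear in this diff but both gems provide it:

```ruby
require 'bloomer'

# Old dependency (bloom-filter), as removed above:
#   require 'bloom-filter'
#   visited = BloomFilter.new(size: 10_000, error_rate: 0.001)
#   visited.insert('http://someurl.com')

# New dependency (bloomer), as added above:
visited = Bloomer.new(10_000, 0.001)     # expected capacity, false-positive rate
visited.add('http://someurl.com')        # was BloomFilter#insert
visited.include?('http://someurl.com')   # => true (membership check; not shown in this diff)
```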
data/lib/version.rb CHANGED
@@ -3,5 +3,5 @@
 # License:: MIT
 
 module Spider
-  VERSION = "0.2.0"
+  VERSION = "0.2.1"
 end
data/spiderkit.gemspec CHANGED
@@ -23,5 +23,5 @@ Gem::Specification.new do |spec|
   spec.add_development_dependency "rspec", "~> 3.4.0"
   spec.add_development_dependency "rake"
 
-  spec.add_dependency "bloom-filter", "~> 0.2.0"
+  spec.add_dependency "bloomer", "~> 0.0.5"
 end
metadata CHANGED
@@ -1,14 +1,14 @@
 --- !ruby/object:Gem::Specification
 name: spiderkit
 version: !ruby/object:Gem::Version
-  version: 0.2.0
+  version: 0.2.1
 platform: ruby
 authors:
 - Robert Dormer
 autorequire:
 bindir: bin
 cert_chain: []
-date: 2016-08-06 00:00:00.000000000 Z
+date: 2016-12-23 00:00:00.000000000 Z
 dependencies:
 - !ruby/object:Gem::Dependency
   name: bundler
@@ -53,19 +53,19 @@ dependencies:
     - !ruby/object:Gem::Version
       version: '0'
 - !ruby/object:Gem::Dependency
-  name: bloom-filter
+  name: bloomer
   requirement: !ruby/object:Gem::Requirement
     requirements:
     - - "~>"
       - !ruby/object:Gem::Version
-        version: 0.2.0
+        version: 0.0.5
   type: :runtime
   prerelease: false
   version_requirements: !ruby/object:Gem::Requirement
     requirements:
     - - "~>"
       - !ruby/object:Gem::Version
-        version: 0.2.0
+        version: 0.0.5
 description: Spiderkit library for basic spiders and bots
 email:
 - rdormer@gmail.com