spider 0.3.0 → 0.4.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
@@ -1 +1 @@
- Wed, 31 Oct 2007 23:51:58 -0400
+ Fri, 02 Nov 2007 17:20:02 -0400
@@ -56,7 +56,7 @@
  </tr>
  <tr class="top-aligned-row">
  <td><strong>Last Update:</strong></td>
- <td>Wed Oct 31 23:26:17 -0400 2007</td>
+ <td>Fri Nov 02 17:19:47 -0400 2007</td>
  </tr>
  </table>
  </div>
@@ -74,44 +74,118 @@
  Ruby. It handles the robots.txt, scraping, collecting, and looping so that
  you can just handle the data.
  </p>
- <h2>Usage</h2>
+ <h2>Examples</h2>
+ <h3>Crawl the Web, loading each page in turn, until you run out of memory</h3>
  <pre>
+ require 'spider'
+ Spider.start_at('http://mike-burns.com/') {}
+ </pre>
+ <h3>To handle erroneous responses</h3>
+ <pre>
+ require 'spider'
+ Spider.start_at('http://mike-burns.com/') do |s|
+ s.on :failure do |a_url, resp, prior_url|
+ puts &quot;URL failed: #{a_url}&quot;
+ puts &quot; linked from #{prior_url}&quot;
+ end
+ end
+ </pre>
+ <h3>Or handle successful responses</h3>
+ <pre>
+ require 'spider'
+ Spider.start_at('http://mike-burns.com/') do |s|
+ s.on :success do |a_url, resp, prior_url|
+ puts &quot;#{a_url}: #{resp.code}&quot;
+ puts resp.body
+ puts
+ end
+ end
+ </pre>
+ <h3>Limit to just one domain</h3>
+ <pre>
+ require 'spider'
  Spider.start_at('http://mike-burns.com/') do |s|
- # Limit the pages to just this domain.
  s.add_url_check do |a_url|
  a_url =~ %r{^http://mike-burns.com.*}
  end
-
- # Handle 404s.
- s.on 404 do |a_url, resp, prior_url|
- puts &quot;URL not found: #{a_url}&quot;
+ end
+ </pre>
+ <h3>Pass headers to some requests</h3>
+ <pre>
+ require 'spider'
+ Spider.start_at('http://mike-burns.com/') do |s|
+ s.setup do |a_url|
+ if a_url =~ %r{^http://.*wikipedia.*}
+ headers['User-Agent'] = &quot;Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)&quot;
+ end
  end
+ end
+ </pre>
+ <h3>Use memcached to track cycles</h3>
+ <pre>
+ require 'spider'
+ require 'spider/included_in_memcached'
+ SERVERS = ['10.0.10.2:11211','10.0.10.3:11211','10.0.10.4:11211']
+ Spider.start_at('http://mike-burns.com/') do |s|
+ s.check_already_seen_with IncludedInMemcached.new(SERVERS)
+ end
+ </pre>
+ <h3>Track cycles with a custom object</h3>
+ <pre>
+ require 'spider'
 
- # Handle 2xx.
- s.on :success do |a_url, resp, prior_url|
- puts &quot;body: #{resp.body}&quot;
+ class ExpireLinks &lt; Hash
+ def &lt;&lt;(v)
+ self[v] = Time.now
+ end
+ def include?(v)
+ self[v] &amp;&amp; (self[v] + 86400) &gt;= Time.now
  end
+ end
 
- # Handle everything.
- s.on :every do |a_url, resp, prior_url|
- puts &quot;URL returned anything: #{a_url} with this code #{resp.code}&quot;
+ Spider.start_at('http://mike-burns.com/') do |s|
+ s.check_already_seen_with ExpireLinks.new
+ end
+ </pre>
+ <h3>Create a URL graph</h3>
+ <pre>
+ require 'spider'
+ nodes = {}
+ Spider.start_at('http://mike-burns.com/') do |s|
+ s.add_url_check {|a_url| a_url =~ %r{^http://mike-burns.com.*} }
+
+ s.on(:every) do |a_url, resp, prior_url|
+ nodes[prior_url] ||= []
+ nodes[prior_url] &lt;&lt; a_url
+ end
+ end
+ </pre>
+ <h3>Use a proxy</h3>
+ <pre>
+ require 'net/http_configuration'
+ require 'spider'
+ http_conf = Net::HTTP::Configuration.new(:proxy_host =&gt; '7proxies.org',
+ :proxy_port =&gt; 8881)
+ http_conf.apply do
+ Spider.start_at('http://img.4chan.org/b/') do |s|
+ s.on(:success) do |a_url, resp, prior_url|
+ File.open(a_url.gsub('/',':'),'w') do |f|
+ f.write(resp.body)
+ end
+ end
  end
  end
  </pre>
- <h2>Requirements</h2>
- <p>
- This library uses `robot_rules&#8217; (included), `open-uri&#8217;, and
- `uri&#8217;. Any modern Ruby should work; if yours doesn&#8216;t, let me
- know so I can update this with your version number.
- </p>
  <h2>Author</h2>
  <p>
  Mike Burns <a href="http://mike-burns.com">mike-burns.com</a>
  mike@mike-burns.com
  </p>
  <p>
- With help from Matt Horan and John Nagro. With `robot_rules&#8217; from
- James Edward Gray II via <a
+ Help from Matt Horan and John Nagro.
+ </p>
+ <p>
+ With `robot_rules&#8217; from James Edward Gray II via <a
  href="http://blade.nagaokaut.ac.jp/cgi-bin/scat.rb/ruby/ruby-talk/177589">blade.nagaokaut.ac.jp/cgi-bin/scat.rb/ruby/ruby-talk/177589</a>
  </p>
 
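The Examples hunk above replaces the old Usage block, whose removed lines show that `on' also accepts a specific HTTP status code (`s.on 404 ...') alongside the symbolic :failure, :success, and :every forms. A minimal sketch combining both forms, reusing only calls that appear in the diff above:

    require 'spider'

    Spider.start_at('http://mike-burns.com/') do |s|
      # Numeric form, as in the removed "Handle 404s" example.
      s.on 404 do |a_url, resp, prior_url|
        puts "URL not found: #{a_url} (linked from #{prior_url})"
      end

      # Symbolic form, as in the "Create a URL graph" example.
      s.on :every do |a_url, resp, prior_url|
        puts "#{a_url} returned #{resp.code}"
      end
    end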
@@ -5,10 +5,10 @@
 
  <html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
  <head>
- <title>Module: Net</title>
+ <title>File: included_in_memcached.rb</title>
  <meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1" />
  <meta http-equiv="Content-Script-Type" content="text/javascript" />
- <link rel="stylesheet" href=".././rdoc-style.css" type="text/css" media="screen" />
+ <link rel="stylesheet" href="../.././rdoc-style.css" type="text/css" media="screen" />
  <script type="text/javascript">
  // <![CDATA[
 
@@ -46,20 +46,20 @@
 
 
 
- <div id="classHeader">
- <table class="header-table">
- <tr class="top-aligned-row">
- <td><strong>Module</strong></td>
- <td class="class-name-in-header">Net</td>
- </tr>
- <tr class="top-aligned-row">
- <td><strong>In:</strong></td>
- <td>
- </td>
- </tr>
-
- </table>
- </div>
+ <div id="fileHeader">
+ <h1>included_in_memcached.rb</h1>
+ <table class="header-table">
+ <tr class="top-aligned-row">
+ <td><strong>Path:</strong></td>
+ <td>lib/included_in_memcached.rb
+ </td>
+ </tr>
+ <tr class="top-aligned-row">
+ <td><strong>Last Update:</strong></td>
+ <td>Fri Nov 02 15:04:14 -0400 2007</td>
+ </tr>
+ </table>
+ </div>
  <!-- banner header -->
 
  <div id="bodyContent">
@@ -69,6 +69,13 @@
  <div id="contextContent">
 
 
+ <div id="requires-list">
+ <h3 class="section-bar">Required files</h3>
+
+ <div class="name-list">
+ memcache&nbsp;&nbsp;
+ </div>
+ </div>
 
  </div>
 
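This hunk records that included_in_memcached.rb requires only `memcache', the Ruby memcache-client gem. For orientation, here is a minimal sketch of the duck type that `check_already_seen_with' expects (an object answering << and include?), backed by that gem; the class name and body are illustrative, inferred from the examples above rather than copied from the gem's source:

    require 'memcache'

    # Hypothetical cycle tracker: stores visited URLs in memcached.
    # Spider only ever calls << and include? on it.
    class MemcachedSeenList
      def initialize(servers)
        @cache = MemCache.new(servers)
      end

      # Record a URL as visited.
      def <<(url)
        @cache.set(url.to_s, true)
      end

      # True if this URL was already visited.
      def include?(url)
        !@cache.get(url.to_s).nil?
      end
    end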
@@ -0,0 +1,118 @@
+ <?xml version="1.0" encoding="iso-8859-1"?>
+ <!DOCTYPE html
+ PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
+ "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
+
+ <html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
+ <head>
+ <title>File: spider_instance.rb</title>
+ <meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1" />
+ <meta http-equiv="Content-Script-Type" content="text/javascript" />
+ <link rel="stylesheet" href="../.././rdoc-style.css" type="text/css" media="screen" />
+ <script type="text/javascript">
+ // <![CDATA[
+
+ function popupCode( url ) {
+ window.open(url, "Code", "resizable=yes,scrollbars=yes,toolbar=no,status=no,height=150,width=400")
+ }
+
+ function toggleCode( id ) {
+ if ( document.getElementById )
+ elem = document.getElementById( id );
+ else if ( document.all )
+ elem = eval( "document.all." + id );
+ else
+ return false;
+
+ elemStyle = elem.style;
+
+ if ( elemStyle.display != "block" ) {
+ elemStyle.display = "block"
+ } else {
+ elemStyle.display = "none"
+ }
+
+ return true;
+ }
+
+ // Make codeblocks hidden by default
+ document.writeln( "<style type=\"text/css\">div.method-source-code { display: none }</style>" )
+
+ // ]]>
+ </script>
+
+ </head>
+ <body>
+
+
+
+ <div id="fileHeader">
+ <h1>spider_instance.rb</h1>
+ <table class="header-table">
+ <tr class="top-aligned-row">
+ <td><strong>Path:</strong></td>
+ <td>lib/spider_instance.rb
+ </td>
+ </tr>
+ <tr class="top-aligned-row">
+ <td><strong>Last Update:</strong></td>
+ <td>Fri Nov 02 17:05:49 -0400 2007</td>
+ </tr>
+ </table>
+ </div>
+ <!-- banner header -->
+
+ <div id="bodyContent">
+
+
+
+ <div id="contextContent">
+
+ <div id="description">
+ <p>
+ Copyright 2007 Mike Burns
+ </p>
+
+ </div>
+
+ <div id="requires-list">
+ <h3 class="section-bar">Required files</h3>
+
+ <div class="name-list">
+ robot_rules&nbsp;&nbsp;
+ open-uri&nbsp;&nbsp;
+ uri&nbsp;&nbsp;
+ net/http&nbsp;&nbsp;
+ net/https&nbsp;&nbsp;
+ </div>
+ </div>
+
+ </div>
+
+
+ </div>
+
+
+ <!-- if includes -->
+
+ <div id="section">
+
+
+
+
+
+
+
+
+ <!-- if method_list -->
+
+
+ </div>
+
+
+ <div id="validator-badges">
+ <p><small><a href="http://validator.w3.org/check/referer">[Validate]</a></small></p>
+ </div>
+
+ </body>
+ </html>
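The new spider_instance.rb page lists `robot_rules' among its required files; this is the robots.txt parser credited to James Edward Gray II in the README hunks. A hedged sketch of how such a parser is typically driven, assuming the RobotRules interface from the cited ruby-talk post (new takes a user-agent string, parse takes a URL plus the fetched robots.txt body, allowed? takes a URL):

    require 'open-uri'
    require 'robot_rules'  # bundled with spider, per the README

    # Assumed interface, per the ruby-talk post cited in the README.
    rules = RobotRules.new('Ruby Spider 1.0')
    robots_url = 'http://mike-burns.com/robots.txt'
    rules.parse(robots_url, URI.parse(robots_url).read)

    puts rules.allowed?('http://mike-burns.com/')  # false if robots.txt disallows it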
@@ -56,7 +56,7 @@
  </tr>
  <tr class="top-aligned-row">
  <td><strong>Last Update:</strong></td>
- <td>Wed Oct 31 23:25:57 -0400 2007</td>
+ <td>Fri Nov 02 12:32:39 -0400 2007</td>
  </tr>
  </table>
  </div>
@@ -74,60 +74,123 @@ Copyright 2007 Mike Burns <a href="../../classes/Spider.html">Spider</a>, a
  Web spidering library for Ruby. It handles the robots.txt, scraping,
  collecting, and looping so that you can just handle the data.
  </p>
- <h2>Usage</h2>
+ <h2>Examples</h2>
+ <h3>Crawl the Web, loading each page in turn, until you run out of memory</h3>
  <pre>
+ require 'spider'
+ Spider.start_at('http://mike-burns.com/') {}
+ </pre>
+ <h3>To handle erroneous responses</h3>
+ <pre>
+ require 'spider'
+ Spider.start_at('http://mike-burns.com/') do |s|
+ s.on :failure do |a_url, resp, prior_url|
+ puts &quot;URL failed: #{a_url}&quot;
+ puts &quot; linked from #{prior_url}&quot;
+ end
+ end
+ </pre>
+ <h3>Or handle successful responses</h3>
+ <pre>
+ require 'spider'
+ Spider.start_at('http://mike-burns.com/') do |s|
+ s.on :success do |a_url, resp, prior_url|
+ puts &quot;#{a_url}: #{resp.code}&quot;
+ puts resp.body
+ puts
+ end
+ end
+ </pre>
+ <h3>Limit to just one domain</h3>
+ <pre>
+ require 'spider'
  Spider.start_at('http://mike-burns.com/') do |s|
- # Limit the pages to just this domain.
  s.add_url_check do |a_url|
  a_url =~ %r{^http://mike-burns.com.*}
  end
-
- # Handle 404s.
- s.on 404 do |a_url, resp, prior_url|
- puts &quot;URL not found: #{a_url}&quot;
+ end
+ </pre>
+ <h3>Pass headers to some requests</h3>
+ <pre>
+ require 'spider'
+ Spider.start_at('http://mike-burns.com/') do |s|
+ s.setup do |a_url|
+ if a_url =~ %r{^http://.*wikipedia.*}
+ headers['User-Agent'] = &quot;Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)&quot;
+ end
  end
+ end
+ </pre>
+ <h3>Use memcached to track cycles</h3>
+ <pre>
+ require 'spider'
+ require 'spider/included_in_memcached'
+ SERVERS = ['10.0.10.2:11211','10.0.10.3:11211','10.0.10.4:11211']
+ Spider.start_at('http://mike-burns.com/') do |s|
+ s.check_already_seen_with IncludedInMemcached.new(SERVERS)
+ end
+ </pre>
+ <h3>Track cycles with a custom object</h3>
+ <pre>
+ require 'spider'
 
- # Handle 2xx.
- s.on :success do |a_url, resp, prior_url|
- puts &quot;body: #{resp.body}&quot;
+ class ExpireLinks &lt; Hash
+ def &lt;&lt;(v)
+ self[v] = Time.now
+ end
+ def include?(v)
+ self[v] &amp;&amp; (self[v] + 86400) &gt;= Time.now
  end
+ end
+
+ Spider.start_at('http://mike-burns.com/') do |s|
+ s.check_already_seen_with ExpireLinks.new
+ end
+ </pre>
+ <h3>Create a URL graph</h3>
+ <pre>
+ require 'spider'
+ nodes = {}
+ Spider.start_at('http://mike-burns.com/') do |s|
+ s.add_url_check {|a_url| a_url =~ %r{^http://mike-burns.com.*} }
 
- # Handle everything.
- s.on :every do |a_url, resp, prior_url|
- puts &quot;URL returned anything: #{a_url} with this code #{resp.code}&quot;
+ s.on(:every) do |a_url, resp, prior_url|
+ nodes[prior_url] ||= []
+ nodes[prior_url] &lt;&lt; a_url
+ end
+ end
+ </pre>
+ <h3>Use a proxy</h3>
+ <pre>
+ require 'net/http_configuration'
+ require 'spider'
+ http_conf = Net::HTTP::Configuration.new(:proxy_host =&gt; '7proxies.org',
+ :proxy_port =&gt; 8881)
+ http_conf.apply do
+ Spider.start_at('http://img.4chan.org/b/') do |s|
+ s.on(:success) do |a_url, resp, prior_url|
+ File.open(a_url.gsub('/',':'),'w') do |f|
+ f.write(resp.body)
+ end
+ end
  end
  end
  </pre>
- <h2>Requirements</h2>
- <p>
- This library uses `robot_rules&#8217; (included), `open-uri&#8217;, and
- `uri&#8217;. Any modern Ruby should work; if yours doesn&#8216;t, let me
- know so I can update this with your version number.
- </p>
  <h2>Author</h2>
  <p>
  Mike Burns <a href="http://mike-burns.com">mike-burns.com</a>
  mike@mike-burns.com
  </p>
  <p>
- With help from Matt Horan and John Nagro. With `robot_rules&#8217; from
- James Edward Gray II via <a
+ Help from Matt Horan and John Nagro.
+ </p>
+ <p>
+ With `robot_rules&#8217; from James Edward Gray II via <a
  href="http://blade.nagaokaut.ac.jp/cgi-bin/scat.rb/ruby/ruby-talk/177589">blade.nagaokaut.ac.jp/cgi-bin/scat.rb/ruby/ruby-talk/177589</a>
  </p>
 
  </div>
 
- <div id="requires-list">
- <h3 class="section-bar">Required files</h3>
-
- <div class="name-list">
- robot_rules&nbsp;&nbsp;
- open-uri&nbsp;&nbsp;
- uri&nbsp;&nbsp;
- net/http&nbsp;&nbsp;
- net/https&nbsp;&nbsp;
- </div>
- </div>
 
  </div>
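The "Create a URL graph" example in both README hunks leaves `nodes' as a hash mapping each prior_url to the URLs found on it. A short follow-on sketch, using nothing beyond that example, that prints the collected edges once the crawl returns:

    # After Spider.start_at(...) finishes, dump each edge of the URL graph.
    nodes.each do |from, links|
      links.each { |to| puts "#{from} -> #{to}" }
    end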