spider 0.1.0

Files changed (7)
  1. data/CHANGES +4 -0
  2. data/LICENSE +339 -0
  3. data/README +37 -0
  4. data/lib/robot_rules.rb +75 -0
  5. data/lib/spider.rb +126 -0
  6. data/spider.gemspec +17 -0
  7. metadata +52 -0
data/CHANGES ADDED
@@ -0,0 +1,4 @@
+ 2007-03-28:
+ * Change the tail recursion to a `while' loop, to please Ruby.
+ * Documentation.
+ * Initial release: version 0.1.0 .
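The first changelog entry describes a common transformation: a tail-recursive crawl rewritten as a loop, since MRI does not eliminate tail calls by default. The following is only a sketch of that idea, not the gem's code; `LINKS`, `crawl_recursive`, and `crawl_iterative` are hypothetical names, and the tiny in-memory link graph stands in for real HTTP fetches.

```ruby
# Hypothetical link graph standing in for the Web; not part of the gem.
LINKS = {
  'a' => ['b', 'c'],
  'b' => ['c'],
  'c' => ['a'],
}.freeze

# Tail-recursive form: each call handles one batch of URLs, then calls
# itself with the next batch. Without tail-call elimination, every batch
# adds a stack frame, so a long crawl can exhaust the stack.
def crawl_recursive(batch, seen = [])
  fresh = batch.uniq - seen
  return seen if fresh.empty?
  crawl_recursive(fresh.flat_map { |u| LINKS.fetch(u, []) }, seen + fresh)
end

# Iterative form: the same state (current batch, URLs seen so far) lives in
# local variables, and the loop ends when a batch yields nothing new.
def crawl_iterative(batch)
  seen = []
  until (fresh = batch.uniq - seen).empty?
    seen += fresh
    batch = fresh.flat_map { |u| LINKS.fetch(u, []) }
  end
  seen
end
```

Both forms visit the same URLs in the same order; only the iterative one keeps the stack depth constant regardless of how many batches the crawl takes.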
data/LICENSE ADDED
@@ -0,0 +1,339 @@
+ GNU GENERAL PUBLIC LICENSE
+ Version 2, June 1991
+
+ Copyright (C) 1989, 1991 Free Software Foundation, Inc.,
+ 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA
+ Everyone is permitted to copy and distribute verbatim copies
+ of this license document, but changing it is not allowed.
+
+ Preamble
+
+ The licenses for most software are designed to take away your
+ freedom to share and change it. By contrast, the GNU General Public
+ License is intended to guarantee your freedom to share and change free
+ software--to make sure the software is free for all its users. This
+ General Public License applies to most of the Free Software
+ Foundation's software and to any other program whose authors commit to
+ using it. (Some other Free Software Foundation software is covered by
+ the GNU Lesser General Public License instead.) You can apply it to
+ your programs, too.
+
+ When we speak of free software, we are referring to freedom, not
+ price. Our General Public Licenses are designed to make sure that you
+ have the freedom to distribute copies of free software (and charge for
+ this service if you wish), that you receive source code or can get it
+ if you want it, that you can change the software or use pieces of it
+ in new free programs; and that you know you can do these things.
+
+ To protect your rights, we need to make restrictions that forbid
+ anyone to deny you these rights or to ask you to surrender the rights.
+ These restrictions translate to certain responsibilities for you if you
+ distribute copies of the software, or if you modify it.
+
+ For example, if you distribute copies of such a program, whether
+ gratis or for a fee, you must give the recipients all the rights that
+ you have. You must make sure that they, too, receive or can get the
+ source code. And you must show them these terms so they know their
+ rights.
+
+ We protect your rights with two steps: (1) copyright the software, and
+ (2) offer you this license which gives you legal permission to copy,
+ distribute and/or modify the software.
+
+ Also, for each author's protection and ours, we want to make certain
+ that everyone understands that there is no warranty for this free
+ software. If the software is modified by someone else and passed on, we
+ want its recipients to know that what they have is not the original, so
+ that any problems introduced by others will not reflect on the original
+ authors' reputations.
+
+ Finally, any free program is threatened constantly by software
+ patents. We wish to avoid the danger that redistributors of a free
+ program will individually obtain patent licenses, in effect making the
+ program proprietary. To prevent this, we have made it clear that any
+ patent must be licensed for everyone's free use or not licensed at all.
+
+ The precise terms and conditions for copying, distribution and
+ modification follow.
+
+ GNU GENERAL PUBLIC LICENSE
+ TERMS AND CONDITIONS FOR COPYING, DISTRIBUTION AND MODIFICATION
+
+ 0. This License applies to any program or other work which contains
+ a notice placed by the copyright holder saying it may be distributed
+ under the terms of this General Public License. The "Program", below,
+ refers to any such program or work, and a "work based on the Program"
+ means either the Program or any derivative work under copyright law:
+ that is to say, a work containing the Program or a portion of it,
+ either verbatim or with modifications and/or translated into another
+ language. (Hereinafter, translation is included without limitation in
+ the term "modification".) Each licensee is addressed as "you".
+
+ Activities other than copying, distribution and modification are not
+ covered by this License; they are outside its scope. The act of
+ running the Program is not restricted, and the output from the Program
+ is covered only if its contents constitute a work based on the
+ Program (independent of having been made by running the Program).
+ Whether that is true depends on what the Program does.
+
+ 1. You may copy and distribute verbatim copies of the Program's
+ source code as you receive it, in any medium, provided that you
+ conspicuously and appropriately publish on each copy an appropriate
+ copyright notice and disclaimer of warranty; keep intact all the
+ notices that refer to this License and to the absence of any warranty;
+ and give any other recipients of the Program a copy of this License
+ along with the Program.
+
+ You may charge a fee for the physical act of transferring a copy, and
+ you may at your option offer warranty protection in exchange for a fee.
+
+ 2. You may modify your copy or copies of the Program or any portion
+ of it, thus forming a work based on the Program, and copy and
+ distribute such modifications or work under the terms of Section 1
+ above, provided that you also meet all of these conditions:
+
+ a) You must cause the modified files to carry prominent notices
+ stating that you changed the files and the date of any change.
+
+ b) You must cause any work that you distribute or publish, that in
+ whole or in part contains or is derived from the Program or any
+ part thereof, to be licensed as a whole at no charge to all third
+ parties under the terms of this License.
+
+ c) If the modified program normally reads commands interactively
+ when run, you must cause it, when started running for such
+ interactive use in the most ordinary way, to print or display an
+ announcement including an appropriate copyright notice and a
+ notice that there is no warranty (or else, saying that you provide
+ a warranty) and that users may redistribute the program under
+ these conditions, and telling the user how to view a copy of this
+ License. (Exception: if the Program itself is interactive but
+ does not normally print such an announcement, your work based on
+ the Program is not required to print an announcement.)
+
+ These requirements apply to the modified work as a whole. If
+ identifiable sections of that work are not derived from the Program,
+ and can be reasonably considered independent and separate works in
+ themselves, then this License, and its terms, do not apply to those
+ sections when you distribute them as separate works. But when you
+ distribute the same sections as part of a whole which is a work based
+ on the Program, the distribution of the whole must be on the terms of
+ this License, whose permissions for other licensees extend to the
+ entire whole, and thus to each and every part regardless of who wrote it.
+
+ Thus, it is not the intent of this section to claim rights or contest
+ your rights to work written entirely by you; rather, the intent is to
+ exercise the right to control the distribution of derivative or
+ collective works based on the Program.
+
+ In addition, mere aggregation of another work not based on the Program
+ with the Program (or with a work based on the Program) on a volume of
+ a storage or distribution medium does not bring the other work under
+ the scope of this License.
+
+ 3. You may copy and distribute the Program (or a work based on it,
+ under Section 2) in object code or executable form under the terms of
+ Sections 1 and 2 above provided that you also do one of the following:
+
+ a) Accompany it with the complete corresponding machine-readable
+ source code, which must be distributed under the terms of Sections
+ 1 and 2 above on a medium customarily used for software interchange; or,
+
+ b) Accompany it with a written offer, valid for at least three
+ years, to give any third party, for a charge no more than your
+ cost of physically performing source distribution, a complete
+ machine-readable copy of the corresponding source code, to be
+ distributed under the terms of Sections 1 and 2 above on a medium
+ customarily used for software interchange; or,
+
+ c) Accompany it with the information you received as to the offer
+ to distribute corresponding source code. (This alternative is
+ allowed only for noncommercial distribution and only if you
+ received the program in object code or executable form with such
+ an offer, in accord with Subsection b above.)
+
+ The source code for a work means the preferred form of the work for
+ making modifications to it. For an executable work, complete source
+ code means all the source code for all modules it contains, plus any
+ associated interface definition files, plus the scripts used to
+ control compilation and installation of the executable. However, as a
+ special exception, the source code distributed need not include
+ anything that is normally distributed (in either source or binary
+ form) with the major components (compiler, kernel, and so on) of the
+ operating system on which the executable runs, unless that component
+ itself accompanies the executable.
+
+ If distribution of executable or object code is made by offering
+ access to copy from a designated place, then offering equivalent
+ access to copy the source code from the same place counts as
+ distribution of the source code, even though third parties are not
+ compelled to copy the source along with the object code.
+
+ 4. You may not copy, modify, sublicense, or distribute the Program
+ except as expressly provided under this License. Any attempt
+ otherwise to copy, modify, sublicense or distribute the Program is
+ void, and will automatically terminate your rights under this License.
+ However, parties who have received copies, or rights, from you under
+ this License will not have their licenses terminated so long as such
+ parties remain in full compliance.
+
+ 5. You are not required to accept this License, since you have not
+ signed it. However, nothing else grants you permission to modify or
+ distribute the Program or its derivative works. These actions are
+ prohibited by law if you do not accept this License. Therefore, by
+ modifying or distributing the Program (or any work based on the
+ Program), you indicate your acceptance of this License to do so, and
+ all its terms and conditions for copying, distributing or modifying
+ the Program or works based on it.
+
+ 6. Each time you redistribute the Program (or any work based on the
+ Program), the recipient automatically receives a license from the
+ original licensor to copy, distribute or modify the Program subject to
+ these terms and conditions. You may not impose any further
+ restrictions on the recipients' exercise of the rights granted herein.
+ You are not responsible for enforcing compliance by third parties to
+ this License.
+
+ 7. If, as a consequence of a court judgment or allegation of patent
+ infringement or for any other reason (not limited to patent issues),
+ conditions are imposed on you (whether by court order, agreement or
+ otherwise) that contradict the conditions of this License, they do not
+ excuse you from the conditions of this License. If you cannot
+ distribute so as to satisfy simultaneously your obligations under this
+ License and any other pertinent obligations, then as a consequence you
+ may not distribute the Program at all. For example, if a patent
+ license would not permit royalty-free redistribution of the Program by
+ all those who receive copies directly or indirectly through you, then
+ the only way you could satisfy both it and this License would be to
+ refrain entirely from distribution of the Program.
+
+ If any portion of this section is held invalid or unenforceable under
+ any particular circumstance, the balance of the section is intended to
+ apply and the section as a whole is intended to apply in other
+ circumstances.
+
+ It is not the purpose of this section to induce you to infringe any
+ patents or other property right claims or to contest validity of any
+ such claims; this section has the sole purpose of protecting the
+ integrity of the free software distribution system, which is
+ implemented by public license practices. Many people have made
+ generous contributions to the wide range of software distributed
+ through that system in reliance on consistent application of that
+ system; it is up to the author/donor to decide if he or she is willing
+ to distribute software through any other system and a licensee cannot
+ impose that choice.
+
+ This section is intended to make thoroughly clear what is believed to
+ be a consequence of the rest of this License.
+
+ 8. If the distribution and/or use of the Program is restricted in
+ certain countries either by patents or by copyrighted interfaces, the
+ original copyright holder who places the Program under this License
+ may add an explicit geographical distribution limitation excluding
+ those countries, so that distribution is permitted only in or among
+ countries not thus excluded. In such case, this License incorporates
+ the limitation as if written in the body of this License.
+
+ 9. The Free Software Foundation may publish revised and/or new versions
+ of the General Public License from time to time. Such new versions will
+ be similar in spirit to the present version, but may differ in detail to
+ address new problems or concerns.
+
+ Each version is given a distinguishing version number. If the Program
+ specifies a version number of this License which applies to it and "any
+ later version", you have the option of following the terms and conditions
+ either of that version or of any later version published by the Free
+ Software Foundation. If the Program does not specify a version number of
+ this License, you may choose any version ever published by the Free Software
+ Foundation.
+
+ 10. If you wish to incorporate parts of the Program into other free
+ programs whose distribution conditions are different, write to the author
+ to ask for permission. For software which is copyrighted by the Free
+ Software Foundation, write to the Free Software Foundation; we sometimes
+ make exceptions for this. Our decision will be guided by the two goals
+ of preserving the free status of all derivatives of our free software and
+ of promoting the sharing and reuse of software generally.
+
+ NO WARRANTY
+
+ 11. BECAUSE THE PROGRAM IS LICENSED FREE OF CHARGE, THERE IS NO WARRANTY
+ FOR THE PROGRAM, TO THE EXTENT PERMITTED BY APPLICABLE LAW. EXCEPT WHEN
+ OTHERWISE STATED IN WRITING THE COPYRIGHT HOLDERS AND/OR OTHER PARTIES
+ PROVIDE THE PROGRAM "AS IS" WITHOUT WARRANTY OF ANY KIND, EITHER EXPRESSED
+ OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF
+ MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE. THE ENTIRE RISK AS
+ TO THE QUALITY AND PERFORMANCE OF THE PROGRAM IS WITH YOU. SHOULD THE
+ PROGRAM PROVE DEFECTIVE, YOU ASSUME THE COST OF ALL NECESSARY SERVICING,
+ REPAIR OR CORRECTION.
+
+ 12. IN NO EVENT UNLESS REQUIRED BY APPLICABLE LAW OR AGREED TO IN WRITING
+ WILL ANY COPYRIGHT HOLDER, OR ANY OTHER PARTY WHO MAY MODIFY AND/OR
+ REDISTRIBUTE THE PROGRAM AS PERMITTED ABOVE, BE LIABLE TO YOU FOR DAMAGES,
+ INCLUDING ANY GENERAL, SPECIAL, INCIDENTAL OR CONSEQUENTIAL DAMAGES ARISING
+ OUT OF THE USE OR INABILITY TO USE THE PROGRAM (INCLUDING BUT NOT LIMITED
+ TO LOSS OF DATA OR DATA BEING RENDERED INACCURATE OR LOSSES SUSTAINED BY
+ YOU OR THIRD PARTIES OR A FAILURE OF THE PROGRAM TO OPERATE WITH ANY OTHER
+ PROGRAMS), EVEN IF SUCH HOLDER OR OTHER PARTY HAS BEEN ADVISED OF THE
+ POSSIBILITY OF SUCH DAMAGES.
+
+ END OF TERMS AND CONDITIONS
+
+ How to Apply These Terms to Your New Programs
+
+ If you develop a new program, and you want it to be of the greatest
+ possible use to the public, the best way to achieve this is to make it
+ free software which everyone can redistribute and change under these terms.
+
+ To do so, attach the following notices to the program. It is safest
+ to attach them to the start of each source file to most effectively
+ convey the exclusion of warranty; and each file should have at least
+ the "copyright" line and a pointer to where the full notice is found.
+
+ <one line to give the program's name and a brief idea of what it does.>
+ Copyright (C) <year> <name of author>
+
+ This program is free software; you can redistribute it and/or modify
+ it under the terms of the GNU General Public License as published by
+ the Free Software Foundation; either version 2 of the License, or
+ (at your option) any later version.
+
+ This program is distributed in the hope that it will be useful,
+ but WITHOUT ANY WARRANTY; without even the implied warranty of
+ MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ GNU General Public License for more details.
+
+ You should have received a copy of the GNU General Public License along
+ with this program; if not, write to the Free Software Foundation, Inc.,
+ 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA.
+
+ Also add information on how to contact you by electronic and paper mail.
+
+ If the program is interactive, make it output a short notice like this
+ when it starts in an interactive mode:
+
+ Gnomovision version 69, Copyright (C) year name of author
+ Gnomovision comes with ABSOLUTELY NO WARRANTY; for details type `show w'.
+ This is free software, and you are welcome to redistribute it
+ under certain conditions; type `show c' for details.
+
+ The hypothetical commands `show w' and `show c' should show the appropriate
+ parts of the General Public License. Of course, the commands you use may
+ be called something other than `show w' and `show c'; they could even be
+ mouse-clicks or menu items--whatever suits your program.
+
+ You should also get your employer (if you work as a programmer) or your
+ school, if any, to sign a "copyright disclaimer" for the program, if
+ necessary. Here is a sample; alter the names:
+
+ Yoyodyne, Inc., hereby disclaims all copyright interest in the program
+ `Gnomovision' (which makes passes at compilers) written by James Hacker.
+
+ <signature of Ty Coon>, 1 April 1989
+ Ty Coon, President of Vice
+
+ This General Public License does not permit incorporating your program into
+ proprietary programs. If your program is a subroutine library, you may
+ consider it more useful to permit linking proprietary applications with the
+ library. If this is what you want to do, use the GNU Lesser General
+ Public License instead of this License.
data/README ADDED
@@ -0,0 +1,37 @@
+ Spider, a Web spidering library for Ruby. It handles the robots.txt,
+ scraping, collecting, and looping so that you can just handle the data.
+
+ == Requirements ==
+
+ This library uses `robot_rules' (included), `open-uri', and `uri'. Any modern
+ Ruby should work; if yours doesn't, let me know so I can update this with your
+ version number.
+
+ == Usage ==
+
+ One function: `spider'. It takes a list of seed URLs and a block; this block is
+ passed each URL and its Web page. This function never returns, ideally.
+
+   spider : [String] (String String -> a) -> omega
+
+ Examples:
+
+   require 'spider'
+   include Spider
+
+   spider(['http://yahoo.com']) do |a_url, web_page|
+     puts "At #{a_url}"
+   end
+
+   spider(['http://mike-burns.com','http://matthoran.com']) do |u, page|
+     # assumes `scrape_images' and `store_image!' functions.
+     scrape_images(page).each { |img| store_image!(img) }
+   end
+
+ == Author ==
+
+ Mike Burns http://mike-burns.com mike@mike-burns.com
+
+ With help from Matt Horan.
+ With `robot_rules' from James Edward Gray II via
+ http://blade.nagaokaut.ac.jp/cgi-bin/scat.rb/ruby/ruby-talk/177589
data/lib/robot_rules.rb ADDED
@@ -0,0 +1,75 @@
+ # Created by James Edward Gray II on 2006-01-31.
+ # Copyright 2006 Gray Productions. All rights reserved.
+
+ require "uri"
+
+ # Based on Perl's WWW::RobotRules module, by Gisle Aas.
+ class RobotRules
+   def initialize( user_agent )
+     @user_agent = user_agent.scan(/\S+/).first.sub(%r{/.*}, "").downcase
+     @rules = Hash.new { |rules, rule| rules[rule] = Array.new }
+   end
+
+   def parse( text_uri, robots_data )
+     uri = URI.parse(text_uri)
+     location = "#{uri.host}:#{uri.port}"
+     @rules.delete(location)
+
+     rules = robots_data.split(/[\015\012]+/).map do |rule|
+       rule.sub(/\s*#.*$/, "")
+     end
+     anon_rules = Array.new
+     my_rules = Array.new
+     current = anon_rules
+     rules.each do |rule|
+       case rule
+       when /^\s*User-Agent\s*:\s*(.+?)\s*$/i
+         break unless my_rules.empty?
+
+         current = if $1 == "*"
+           anon_rules
+         elsif $1.downcase.index(@user_agent)
+           my_rules
+         else
+           nil
+         end
+       when /^\s*Disallow\s*:\s*(.*?)\s*$/i
+         next if current.nil?
+
+         if $1.empty?
+           current << nil
+         else
+           disallow = URI.parse($1)
+
+           next unless disallow.scheme.nil? or disallow.scheme ==
+             uri.scheme
+           next unless disallow.port.nil? or disallow.port == uri.port
+           next unless disallow.host.nil? or
+             disallow.host.downcase == uri.host.downcase
+
+           disallow = disallow.path
+           disallow = "/" if disallow.empty?
+           disallow = "/#{disallow}" unless disallow[0] == ?/
+
+           current << disallow
+         end
+       end
+     end
+
+     @rules[location] = if my_rules.empty?
+       anon_rules.compact
+     else
+       my_rules.compact
+     end
+   end
+
+   def allowed?( text_uri )
+     uri = URI.parse(text_uri)
+     location = "#{uri.host}:#{uri.port}"
+     path = uri.path
+
+     return true unless %w{http https}.include?(uri.scheme)
+
+     not @rules[location].any? { |rule| path.index(rule) == 0 }
+   end
+ end
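The core of `RobotRules` is a prefix test: a path is refused when it starts with any `Disallow` value collected for the matching user agent. The sketch below isolates that idea for the wildcard agent only. It is a simplified stand-in rather than the class above: `disallowed_prefixes`, `path_allowed?`, and the sample `ROBOTS_TXT` are hypothetical, and per-agent sections, hosts, and ports are deliberately ignored.

```ruby
# Hypothetical helper, not the gem's RobotRules class: collect the Disallow
# paths listed under "User-agent: *" in a robots.txt body.
def disallowed_prefixes(robots_txt)
  applies = false
  robots_txt.split(/[\r\n]+/).each_with_object([]) do |line, prefixes|
    line = line.sub(/\s*#.*$/, '') # drop trailing comments
    case line
    when /^\s*User-Agent\s*:\s*(.+?)\s*$/i
      applies = ($1 == '*')
    when /^\s*Disallow\s*:\s*(.*?)\s*$/i
      # An empty Disallow means "nothing is disallowed", so it adds no prefix.
      prefixes << $1 if applies && !$1.empty?
    end
  end
end

# A path is allowed when no collected prefix matches its beginning.
def path_allowed?(path, prefixes)
  prefixes.none? { |prefix| path.start_with?(prefix) }
end

# Sample input for illustration only.
ROBOTS_TXT = <<~TXT
  User-agent: *
  Disallow: /private/   # keep crawlers out
  Disallow:

  User-agent: otherbot
  Disallow: /
TXT

prefixes = disallowed_prefixes(ROBOTS_TXT)
puts path_allowed?('/private/notes.html', prefixes)
puts path_allowed?('/public/index.html', prefixes)
```

The `otherbot` section is skipped because this sketch only honours the wildcard agent; the class in the diff additionally prefers rules addressed to its own user agent when they exist.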
data/lib/spider.rb ADDED
@@ -0,0 +1,126 @@
+ # Copyright 2007 Mike Burns
+
+ # This program is free software; you can redistribute it and/or modify
+ # it under the terms of the GNU General Public License as published by
+ # the Free Software Foundation; either version 2 of the License, or
+ # (at your option) any later version.
+ #
+ # This program is distributed in the hope that it will be useful,
+ # but WITHOUT ANY WARRANTY; without even the implied warranty of
+ # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ # GNU General Public License for more details.
+ #
+ # You should have received a copy of the GNU General Public License
+ # along with this program; if not, write to the Free Software
+ # Foundation, Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA
+
+ require 'robot_rules'
+ require 'open-uri'
+ require 'uri'
+
+ # A spidering library for Ruby. Handles robots.txt, scraping, finding more
+ # links, and doing it all over again.
+ module Spider
+
+   # [String] (String String -> a) -> omega
+   # The only function worth calling. Takes a list of seed URLs and a block.
+   # This block is passed each URL and its Web page.
+   #
+   # Examples:
+   #
+   #   spider(['http://yahoo.com']) do |a_url, web_page|
+   #     puts "At #{a_url}"
+   #   end
+   #
+   #   spider(['http://mike-burns.com','http://matthoran.com']) do |u, page|
+   #     scrape_images(page).each { |img| store_image!(img) }
+   #   end
+   def spider(urls)
+     rules = RobotRules.new('Ruby Spider 1.0')
+     next_urls = (urls.is_a?(Array) ? urls : [urls])
+     seen = []
+     robots_seen = []
+     # This used to be (tail) recursive, but Ruby doesn't optimize that.
+     # I have no idea if this iterative version is correct, but it seems it.
+     begin
+       next_urls = next_urls.map do |a_url|
+         [a_url, (URI.parse(a_url) rescue nil)]
+       end.select do |a_url, parsed_url|
+         !parsed_url.nil? && !seen.include?(a_url) &&
+           allowed?(a_url, parsed_url, rules, robots_seen)
+       end.map do |a_url, parsed_url|
+         scrape_links(a_url, parsed_url) do |a_url,web_page|
+           seen << a_url
+           yield(a_url,web_page)
+         end
+       end.flatten.map { |a_url,parsed_url| a_url }
+     end while !next_urls.empty?
+   end
+
+   private
+
+   # True if the robots.txt for that URL allows access to it.
+   def allowed?(a_url, parsed_url, rules, robots_seen) # :nodoc:
+     u = "#{parsed_url.scheme}://#{parsed_url.host}:#{parsed_url.port}/robots.txt"
+     begin
+       unless robots_seen.include?(u)
+         open(u, 'User-Agent' => 'Ruby Spider',
+           'Accept' => 'text/html,text/xml,application/xml,text/plain') do |url|
+           rules.parse(u, url.read)
+         end
+         robots_seen << u
+       end
+       rules.allowed?(a_url)
+     rescue OpenURI::HTTPError
+       true
+     rescue Timeout::Error # to keep it from crashing
+       false
+     rescue
+       false
+     end
+   end
+
+   # Produce all the links on the page.
+   def scrape_links(a_url, parsed_url) # :nodoc:
+     begin
+       open(a_url, 'User-Agent' => 'Ruby Spider',
+         'Accept' => 'text/html,text/xml,application/xml,text/plain') do |data|
+         web_page = data.read
+         base_url = (web_page.scan(/base\s+href="(.*?)"/i).flatten +
+                     [a_url[0,a_url.rindex('/')]])[0]
+         links = web_page.scan(/href="(.*?)"/i).flatten.map do |link|
+           begin
+             parsed_link = URI.parse(link)
+             if parsed_link.fragment == '#'
+               nil
+             else
+               case parsed_link.scheme
+               when 'http'
+                 link
+               when nil
+                 u = URI.parse(base_url)
+                 if link[0].chr == '/'
+                   "#{u.scheme}://#{u.host}:#{u.port}#{link}"
+                 else
+                   "#{u.scheme}://#{u.host}:#{u.port}/#{u.path}/#{link}"
+                 end
+               else
+                 nil
+               end
+             end
+           rescue
+             nil
+           end
+         end.reject{|link|link.nil?}
+
+         yield(a_url,web_page)
+         links
+       end
+     rescue Timeout::Error # to keep it from crashing
+       []
+     rescue
+       []
+     end
+   end
+
+ end
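`scrape_links` turns relative `href` values into absolute URLs by concatenating scheme, host, port, and path by hand. For comparison, here is a sketch of the same step using `URI.join` from Ruby's standard library. This is a swapped-in technique, not what the gem does; `absolute_links` and `SAMPLE_HTML` are hypothetical names, and the regex-based extraction is kept only to mirror the original.

```ruby
require 'uri'

# Hypothetical helper, not part of the gem: extract href values with the same
# regex scrape_links uses, then resolve each one against the page URL.
def absolute_links(page_url, html)
  html.scan(/href="(.*?)"/i).flatten.map do |link|
    begin
      resolved = URI.join(page_url, link)
      # Skip anything that is not a Web link (mailto:, javascript:, ...).
      next unless %w[http https].include?(resolved.scheme)
      resolved.fragment = nil # "#top" points into the same page
      resolved.to_s
    rescue URI::Error
      nil # unparsable href: drop it rather than abort the crawl
    end
  end.compact.uniq
end

# Sample input for illustration only.
SAMPLE_HTML = '<a href="/about">About</a> <a href="post.html#top">Post</a> ' \
              '<a href="mailto:me@example.com">Mail</a> ' \
              '<a href="http://example.org/">Elsewhere</a>'

puts absolute_links('http://example.com/blog/index.html', SAMPLE_HTML)
```

Resolving against the full page URL lets the library handle `..` segments and directory-relative paths, which the hand-built strings in the diff do not normalize.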
data/spider.gemspec ADDED
@@ -0,0 +1,17 @@
+ require 'rubygems'
+
+ spec = Gem::Specification.new do |s|
+   s.author = 'Mike Burns'
+   s.email = 'mike@mike-burns.com'
+   s.has_rdoc = true
+   ###s.homepage = ''
+   s.name = 'spider'
+   s.summary = 'A Web spidering library'
+   s.files = Dir['**/*'].delete_if { |f| f =~ /(cvs|gem|svn)$/i }
+   s.require_path = 'lib'
+   s.description = <<-EOF
+     A Web spidering library: handles robots.txt, scraping, finding more
+     links, and doing it all over again.
+   EOF
+   s.version = '0.1.0'
+ end
metadata ADDED
@@ -0,0 +1,52 @@
+ --- !ruby/object:Gem::Specification
+ rubygems_version: 0.9.2
+ specification_version: 1
+ name: spider
+ version: !ruby/object:Gem::Version
+   version: 0.1.0
+ date: 2007-03-30 00:00:00 -04:00
+ summary: A Web spidering library
+ require_paths:
+ - lib
+ email: mike@mike-burns.com
+ homepage:
+ rubyforge_project:
+ description: "A Web spidering library: handles robots.txt, scraping, finding more links, and doing it all over again."
+ autorequire:
+ default_executable:
+ bindir: bin
+ has_rdoc: true
+ required_ruby_version: !ruby/object:Gem::Version::Requirement
+   requirements:
+   - - ">"
+     - !ruby/object:Gem::Version
+       version: 0.0.0
+   version:
+ platform: ruby
+ signing_key:
+ cert_chain:
+ post_install_message:
+ authors:
+ - Mike Burns
+ files:
+ - spider.gemspec
+ - README
+ - CHANGES
+ - LICENSE
+ - lib
+ - lib/spider.rb
+ - lib/robot_rules.rb
+ test_files: []
+
+ rdoc_options: []
+
+ extra_rdoc_files: []
+
+ executables: []
+
+ extensions: []
+
+ requirements: []
+
+ dependencies: []
+