regexp_crawler 0.9.1
- data/LICENSE +20 -0
- data/README.textile +114 -0
- data/Rakefile +22 -0
- data/VERSION +1 -0
- data/example/github_projects.rb +19 -0
- data/init.rb +0 -0
- data/lib/regexp_crawler.rb +27 -0
- data/lib/regexp_crawler/crawler.rb +89 -0
- data/lib/regexp_crawler/http.rb +9 -0
- data/regexp_crawler.gemspec +58 -0
- data/spec/regexp_crawler_spec.rb +122 -0
- data/spec/resources/complex.html +11 -0
- data/spec/resources/nested1.html +12 -0
- data/spec/resources/nested2.html +13 -0
- data/spec/resources/nested21.html +12 -0
- data/spec/resources/simple.html +12 -0
- data/spec/spec.opts +8 -0
- data/spec/spec_helper.rb +8 -0
- metadata +74 -0
data/LICENSE
ADDED
@@ -0,0 +1,20 @@
Copyright (c) 2009 Richard Huang (flyerhzm@gmail.com)

Permission is hereby granted, free of charge, to any person obtaining
a copy of this software and associated documentation files (the
"Software"), to deal in the Software without restriction, including
without limitation the rights to use, copy, modify, merge, publish,
distribute, sublicense, and/or sell copies of the Software, and to
permit persons to whom the Software is furnished to do so, subject to
the following conditions:

The above copyright notice and this permission notice shall be
included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE
LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION
OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION
WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
data/README.textile
ADDED
@@ -0,0 +1,114 @@
h1. RegexpCrawler

RegexpCrawler is a crawler that uses regular expressions to extract data from websites. It is easy to use and needs very little code if you are familiar with regular expressions.

**************************************************************************

h2. Install

<pre><code>
gem install regexp_crawler
</code></pre>

**************************************************************************

h2. Usage

It's really easy to use, sometimes just one line.

<pre><code>
RegexpCrawler::Crawler.new(options).start
</code></pre>

options is a hash:
* <code>:start_page</code>, mandatory, a string defining the website url where the crawler starts
* <code>:continue_regexp</code>, optional, a regexp defining which website urls the crawler continues to crawl; it is applied with String#scan and the first non-nil capture of each match is used
* <code>:capture_regexp</code>, mandatory, a regexp defining what content the crawler captures; it is applied with Regexp#match and all group captures are collected
* <code>:named_captures</code>, mandatory, a string array defining the names of the captured groups according to :capture_regexp
* <code>:model</code>, optional if :save_method is defined, a string naming the result's model class
* <code>:save_method</code>, optional if :model is defined, a proc defining how to save a crawled result; the proc accepts two parameters, the crawled result of one page and the crawled url
* <code>:headers</code>, optional, a hash of http headers
* <code>:encoding</code>, optional, a string naming the encoding of the crawled pages; results are converted to utf8
* <code>:need_parse</code>, optional, a proc deciding whether a page should be parsed by the regexp; the proc accepts two parameters, the crawled website uri and the response body of the crawled page
* <code>:logger</code>, optional, true for logging to STDOUT, or a Logger object for logging to that logger

If the crawler defines :model but no :save_method, RegexpCrawler::Crawler#start returns an array of results, such as
<pre><code>
[{:model_name => {:attr_name => 'attr_value'}, :page => 'website url'}, {:model_name => {:attr_name => 'attr_value'}, :page => 'another website url'}]
</code></pre>

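For example, a crawler given only <code>:model</code> collects results itself. A minimal sketch of that variant (the url and regexp below are placeholders, not taken from this gem):

<pre><code>
require 'rubygems'
require 'regexp_crawler'

# hypothetical page; adjust :start_page and :capture_regexp to the site you crawl
results = RegexpCrawler::Crawler.new(
  :start_page => 'http://example.com/post.html',
  :capture_regexp => %r{<h1>(.*?)</h1>.*<div class="body">(.*?)</div>}m,
  :named_captures => ['title', 'body'],
  :model => 'post'
).start
# => [{:post => {:title => '...', :body => '...'}, :page => 'http://example.com/post.html'}]
</code></pre>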

**************************************************************************

h2. Example

A script to synchronize your github projects, excluding forked projects; please check <code>example/github_projects.rb</code>

<pre><code>
require 'rubygems'
require 'regexp_crawler'

crawler = RegexpCrawler::Crawler.new(
  :start_page => "http://github.com/flyerhzm",
  :continue_regexp => %r{<div class="title"><b><a href="(/flyerhzm/.*?)">}m,
  :capture_regexp => %r{<a href="http://github.com/flyerhzm/[^/"]*?(?:/tree)?">(.*?)</a>.*<span id="repository_description".*?>(.*?)</span>.*(<div class="(?:wikistyle|plain)">.*?</div>)</div>}m,
  :named_captures => ['title', 'description', 'body'],
  :save_method => Proc.new do |result, page|
    puts '============================='
    puts page
    puts result[:title]
    puts result[:description]
    puts result[:body][0..100] + "..."
  end,
  :need_parse => Proc.new do |page, response_body|
    page =~ %r{http://github.com/flyerhzm/\w+} && !response_body.index(/Fork of.*?<a href=".*?">/)
  end)
crawler.start
</code></pre>

The results are as follows:
<pre><code>
=============================
http://github.com/flyerhzm/bullet/tree/master
bullet
A rails plugin/gem to kill N+1 queries and unused eager loading
<div class="wikistyle"><h1>Bullet</h1>
<p>The Bullet plugin/gem is designed to help you increase your...
=============================
http://github.com/flyerhzm/regexp_crawler/tree/master
regexp_crawler
A crawler which use regular expression to catch data.
<div class="wikistyle"><h1>RegexpCrawler</h1>
<p>RegexpCrawler is a crawler which use regex expressi...
=============================
http://github.com/flyerhzm/sitemap/tree/master
sitemap
This plugin will generate a sitemap.xml from sitemap.rb whose format is very similar to routes.rb
<div class="wikistyle"><h1>Sitemap</h1>
<p>This plugin will generate a sitemap.xml or sitemap.xml.gz ...
=============================
http://github.com/flyerhzm/visual_partial/tree/master
visual_partial
This plugin provides a way that you can see all the partial pages rendered. So it can prevent you from using partial page too much, which hurts the performance.
<div class="wikistyle"><h1>VisualPartial</h1>
<p>This plugin provides a way that you can see all the ...
=============================
http://github.com/flyerhzm/chinese_regions/tree/master
chinese_regions
provides all chinese regions, cities and districts
<div class="wikistyle"><h1>ChineseRegions</h1>
<p>Provides all chinese regions, cities and districts<...
=============================
http://github.com/flyerhzm/chinese_permalink/tree/master
chinese_permalink
This plugin adds a capability for ar model to create a seo permalink with your chinese text. It will translate your chinese text to english url based on google translate.
<div class="wikistyle"><h1>ChinesePermalink</h1>
<p>This plugin adds a capability for ar model to cre...
=============================
http://github.com/flyerhzm/codelinestatistics/tree/master
codelinestatistics
The code line statistics takes files and directories from GUI, counts the total files, total sizes of files, total lines, lines of codes, lines of comments and lines of blanks in the files, displays the results and can also export results to html file.
<div class="plain"><pre>codelinestatistics README file:

----------------------------------------
Wha...
</code></pre>
data/Rakefile
ADDED
@@ -0,0 +1,22 @@
require 'rake'
require 'rake/rdoctask'
require 'spec/rake/spectask'
require 'jeweler'

desc "Run all specs in spec directory"
Spec::Rake::SpecTask.new(:spec) do |t|
  t.spec_files = FileList['spec/**/*_spec.rb']
  t.rcov = true
  t.rcov_opts = ['--exclude', 'spec,config,Library,usr/lib/ruby']
  t.rcov_dir = File.join(File.dirname(__FILE__), "tmp")
end

Jeweler::Tasks.new do |gemspec|
  gemspec.name = "regexp_crawler"
  gemspec.summary = "RegexpCrawler is a Ruby library for crawling data from websites using regular expressions."
  gemspec.description = "RegexpCrawler is a Ruby library for crawling data from websites using regular expressions."
  gemspec.email = "flyerhzm@gmail.com"
  gemspec.homepage = ""
  gemspec.authors = ["Richard Huang"]
  gemspec.files.exclude '.gitignore'
end
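A note on how these tasks would typically be invoked (an assumption based on the requires above: jeweler and rspec 1.x are installed):

# From the project root:
#   rake spec      # runs spec/**/*_spec.rb with rcov coverage written to tmp/
#   rake gemspec   # regenerates regexp_crawler.gemspec from the Jeweler::Tasks block above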
data/VERSION
ADDED
@@ -0,0 +1 @@
0.9.1
data/example/github_projects.rb
ADDED
@@ -0,0 +1,19 @@
require 'rubygems'
require 'regexp_crawler'

crawler = RegexpCrawler::Crawler.new(
  :start_page => "http://github.com/flyerhzm",
  :continue_regexp => %r{<div class="title"><b><a href="(/flyerhzm/.*?)">}m,
  :capture_regexp => %r{<a href="http://github.com/flyerhzm/[^/"]*?(?:/tree)?">(.*?)</a>.*<span id="repository_description".*?>(.*?)</span>.*(<div class="(?:wikistyle|plain)">.*?</div>)</div>}m,
  :named_captures => ['title', 'description', 'body'],
  :save_method => Proc.new do |result, page|
    puts '============================='
    puts page
    puts result[:title]
    puts result[:description]
    puts result[:body][0..100] + "..."
  end,
  :need_parse => Proc.new do |page, response_body|
    page =~ %r{http://github.com/flyerhzm/\w+} && !response_body.index(/Fork of.*?<a href=".*?">/)
  end)
crawler.start
data/init.rb
ADDED
File without changes
data/lib/regexp_crawler.rb
ADDED
@@ -0,0 +1,27 @@
require 'net/http'
require 'uri'
require 'iconv'
require 'logger'
require 'regexp_crawler/http'
require 'regexp_crawler/crawler'

module RegexpCrawler

  def self.included(base)
    base.extend ClassMethods
  end

  module ClassMethods
    def regexp_crawler(options)
      @crawlers ||= []
      @crawlers << Crawler.new(options)
    end

    def start_crawl
      @crawlers.each do |crawler|
        crawler.start
      end
    end
  end

end
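The mixin above is a small class-level API: regexp_crawler registers a crawler and start_crawl runs every registered one. An illustrative sketch of a class using it; the class name, url and regexp are placeholders, not part of this gem:

require 'rubygems'
require 'regexp_crawler'

class PostSync
  include RegexpCrawler   # adds regexp_crawler and start_crawl class methods

  # each call registers one crawler; several calls queue several crawlers
  regexp_crawler(
    :start_page     => 'http://example.com/posts.html',
    :capture_regexp => %r{<h1>(.*?)</h1>}m,
    :named_captures => ['title'],
    :save_method    => Proc.new { |result, page| puts "#{page}: #{result[:title]}" }
  )
end

PostSync.start_crawl   # runs every registered crawler in turn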
data/lib/regexp_crawler/crawler.rb
ADDED
@@ -0,0 +1,89 @@
module RegexpCrawler
  class Crawler
    attr_accessor :start_page, :continue_regexp, :named_captures, :model, :save_method, :headers, :encoding, :need_parse

    def initialize(options = {})
      @start_page = options[:start_page]
      @continue_regexp = options[:continue_regexp]
      @capture_regexp = options[:capture_regexp]
      @named_captures = options[:named_captures]
      @model = options[:model]
      @save_method = options[:save_method]
      @headers = options[:headers]
      @encoding = options[:encoding]
      @need_parse = options[:need_parse]
      @logger = options[:logger] == true ? Logger.new(STDOUT) : options[:logger]
    end

    def capture_regexp=(regexp)
      @capture_regexp = Regexp.new(regexp.source, regexp.options | Regexp::MULTILINE)
    end

    def start
      @results = []
      @captured_pages = []
      @pages = [URI.parse(@start_page)]
      while !@pages.empty? and !@stop
        uri = @pages.shift
        @captured_pages << uri
        parse_page(uri)
      end
      @results
    end

    private
    def parse_page(uri)
      @logger.debug "crawling page: #{uri.to_s}" if @logger
      response = Net::HTTP.get_response_with_headers(uri, @headers)
      parse_response(response, uri)
    end

    # resolve a continue link that may be absolute, host-relative or page-relative
    def continue_uri(uri, page)
      if page =~ /^#{uri.scheme}/
        URI.parse(page)
      elsif page =~ /^\//
        URI.join(uri.scheme + '://' + uri.host, page)
      else
        URI.parse(uri.to_s.split('/')[0..-2].join('/') + '/' + page)
      end
    end

    def parse_response(response, uri)
      response_body = encoding.nil? ? response.body : Iconv.iconv("UTF-8//IGNORE", "#{encoding}//IGNORE", response.body).first
      if response.is_a? Net::HTTPSuccess
        @logger.debug "crawling success: #{uri.to_s}" if @logger
        if continue_regexp
          # queue every link matched by continue_regexp that has not been visited yet
          response_body.scan(continue_regexp).each do |page|
            @logger.debug "continue_page: #{page}" if @logger
            page = page.compact.first if page.is_a? Array
            continue_uri = continue_uri(uri, page)
            @pages << continue_uri unless @captured_pages.include?(continue_uri) or @pages.include?(continue_uri)
          end
        end
        if @need_parse.nil? or @need_parse.call(uri.to_s, response_body)
          md = @capture_regexp.match(response_body)
          if md
            @logger.debug "response body captured" if @logger
            captures = md.captures
            result = {}
            captures.each_index do |i|
              result[named_captures[i].to_sym] = captures[i]
            end
            if @save_method
              # a save_method returning false stops the whole crawl
              ret = @save_method.call(result, uri.to_s)
              @stop = true if ret == false
            else
              @results << {@model.downcase.to_sym => result, :page => uri.to_s}
            end
          end
        end
      elsif response.is_a? Net::HTTPRedirection
        @logger.debug "crawling redirect: #{response['location']}" if @logger
        parse_page(URI.parse(response['location']))
      else
        @logger.debug "crawling nothing: #{uri.to_s}" if @logger
        # do nothing
      end
    end
  end
end
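One behaviour of parse_response above worth calling out: when :save_method returns false, @stop is set and the crawl loop in #start ends. An illustrative sketch (urls and regexps are placeholders, not from this gem):

require 'rubygems'
require 'regexp_crawler'

pages_handled = 0
crawler = RegexpCrawler::Crawler.new(
  :start_page      => 'http://example.com/',
  :continue_regexp => %r{<a href="(/page\d+\.html)">}m,
  :capture_regexp  => %r{<title>(.*?)</title>}m,
  :named_captures  => ['title'],
  :save_method     => Proc.new do |result, page|
    pages_handled += 1
    false if pages_handled >= 10   # returning false tells the crawler to stop
  end
)
crawler.start   # ends after handling 10 pages, even if more links are still queued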
data/regexp_crawler.gemspec
ADDED
@@ -0,0 +1,58 @@
# Generated by jeweler
# DO NOT EDIT THIS FILE
# Instead, edit Jeweler::Tasks in Rakefile, and run `rake gemspec`
# -*- encoding: utf-8 -*-

Gem::Specification.new do |s|
  s.name = %q{regexp_crawler}
  s.version = "0.9.1"

  s.required_rubygems_version = Gem::Requirement.new(">= 0") if s.respond_to? :required_rubygems_version=
  s.authors = ["Richard Huang"]
  s.date = %q{2009-09-14}
  s.description = %q{RegexpCrawler is a Ruby library for crawling data from websites using regular expressions.}
  s.email = %q{flyerhzm@gmail.com}
  s.extra_rdoc_files = [
    "LICENSE",
    "README.textile"
  ]
  s.files = [
    "LICENSE",
    "README.textile",
    "Rakefile",
    "VERSION",
    "example/github_projects.rb",
    "init.rb",
    "lib/regexp_crawler.rb",
    "lib/regexp_crawler/crawler.rb",
    "lib/regexp_crawler/http.rb",
    "regexp_crawler.gemspec",
    "spec/regexp_crawler_spec.rb",
    "spec/resources/complex.html",
    "spec/resources/nested1.html",
    "spec/resources/nested2.html",
    "spec/resources/nested21.html",
    "spec/resources/simple.html",
    "spec/spec.opts",
    "spec/spec_helper.rb"
  ]
  s.homepage = %q{}
  s.rdoc_options = ["--charset=UTF-8"]
  s.require_paths = ["lib"]
  s.rubygems_version = %q{1.3.5}
  s.summary = %q{RegexpCrawler is a Ruby library for crawling data from websites using regular expressions.}
  s.test_files = [
    "spec/regexp_crawler_spec.rb",
    "spec/spec_helper.rb"
  ]

  if s.respond_to? :specification_version then
    current_version = Gem::Specification::CURRENT_SPECIFICATION_VERSION
    s.specification_version = 3

    if Gem::Version.new(Gem::RubyGemsVersion) >= Gem::Version.new('1.2.0') then
    else
    end
  else
  end
end
data/spec/regexp_crawler_spec.rb
ADDED
@@ -0,0 +1,122 @@
require File.expand_path(File.dirname(__FILE__) + "/spec_helper.rb")

describe RegexpCrawler::Crawler do
  context '#simple html' do
    it 'should parse data according to regexp' do
      success_page('/resources/simple.html', 'http://simple.com/')

      crawl = RegexpCrawler::Crawler.new(:start_page => 'http://simple.com/', :capture_regexp => %r{<div class="title">(.*?)</div>.*<div class="date">(.*?)</div>.*<div class="body">(.*?)</div>}m, :named_captures => ['title', 'date', 'body'], :model => 'post', :logger => true)
      results = crawl.start
      results.size.should == 1
      results.first[:post][:title].should == 'test'
    end

    it 'should redirect' do
      redirect_page('http://redirect.com/', 'http://simple.com/')
      success_page('/resources/simple.html', 'http://simple.com/')
    end
  end

  context '#complex html' do
    before(:each) do
      success_page('/resources/complex.html', 'http://complex.com/')
      success_page('/resources/nested1.html', 'http://complex.com/nested1.html')
      success_page('/resources/nested2.html', 'http://complex.com/nested2.html')
    end

    it 'should parse data according to regexp' do
      crawl = RegexpCrawler::Crawler.new
      crawl.start_page = 'http://complex.com/'
      crawl.continue_regexp = %r{(?:http://complex.com)?/nested\d.html}
      crawl.capture_regexp = %r{<div class="title">(.*?)</div>.*<div class="date">(.*?)</div>.*<div class="body">(.*?)</div>}m
      crawl.named_captures = ['title', 'date', 'body']
      crawl.model = 'post'
      results = crawl.start
      results.size.should == 2
      results.first[:post][:title].should == 'nested1'
      results.last[:post][:title].should == 'nested2'
    end

    it 'should parse nested of nested data' do
      success_page('/resources/nested21.html', 'http://complex.com/nested21.html')
      crawl = RegexpCrawler::Crawler.new
      crawl.start_page = 'http://complex.com/'
      crawl.continue_regexp = %r{(?:http://complex.com)?/?nested\d+.html}
      crawl.capture_regexp = %r{<div class="title">(.*?)</div>.*<div class="date">(.*?)</div>.*<div class="body">(.*?)</div>}m
      crawl.named_captures = ['title', 'date', 'body']
      crawl.model = 'post'
      results = crawl.start
      results.size.should == 3
      results.first[:post][:title].should == 'nested1'
      results.last[:post][:title].should == 'nested21'
    end

    it "should save by myself" do
      crawl = RegexpCrawler::Crawler.new
      crawl.start_page = 'http://complex.com/'
      crawl.continue_regexp = %r{(?:http://complex.com)?/nested\d.html}
      crawl.capture_regexp = %r{<div class="title">(.*?)</div>.*<div class="date">(.*?)</div>.*<div class="body">(.*?)</div>}m
      crawl.named_captures = ['title', 'date', 'body']
      my_results = []
      crawl.save_method = Proc.new {|result, page| my_results << result}
      results = crawl.start
      results.size.should == 0
      my_results.size.should == 2
    end

    it "should stop parse" do
      crawl = RegexpCrawler::Crawler.new
      crawl.start_page = 'http://complex.com/'
      crawl.continue_regexp = %r{(?:http://complex.com)?/nested\d.html}
      crawl.capture_regexp = %r{<div class="title">(.*?)</div>.*<div class="date">(.*?)</div>.*<div class="body">(.*?)</div>}m
      crawl.named_captures = ['title', 'date', 'body']
      stop_page = "http://complex.com/nested1.html"
      parse_pages = []
      crawl.save_method = Proc.new do |result, page|
        if page == stop_page
          false
        else
          parse_pages << page
        end
      end
      results = crawl.start
      parse_pages.size.should == 0
    end

    it 'should parse skip nested2.html' do
      success_page('/resources/nested21.html', 'http://complex.com/nested21.html')
      crawl = RegexpCrawler::Crawler.new
      crawl.start_page = 'http://complex.com/'
      crawl.continue_regexp = %r{(?:http://complex.com)?/?nested\d+.html}
      crawl.capture_regexp = %r{<div class="title">(.*?)</div>.*<div class="date">(.*?)</div>.*<div class="body">(.*?)</div>}m
      crawl.named_captures = ['title', 'date', 'body']
      crawl.model = 'post'
      crawl.need_parse = Proc.new do |page, response_body|
        if response_body.index('nested2 test html')
          false
        else
          true
        end
      end
      results = crawl.start
      results.size.should == 2
      results.first[:post][:title].should == 'nested1'
      results.last[:post][:title].should == 'nested21'
    end
  end

  def success_page(local_path, remote_path)
    path = File.expand_path(File.dirname(__FILE__) + local_path)
    content = File.read(path)
    http = mock(Net::HTTPSuccess)
    http.stubs(:is_a?).with(Net::HTTPSuccess).returns(true)
    http.stubs(:body).returns(content)
    Net::HTTP.expects(:get_response_with_headers).times(1).with(URI.parse(remote_path), nil).returns(http)
  end

  def redirect_page(remote_path, redirect_path)
    http = mock(Net::HTTPRedirection)
    http.stubs(:is_a?).with(Net::HTTPRedirection).returns(true)
    Net::HTTP.expects(:get_response_with_headers).times(1).with(URI.parse(remote_path), nil).returns(http)
  end
end
data/spec/resources/nested2.html
ADDED
@@ -0,0 +1,13 @@
<html>
  <head>
    <title>nested2 test html</title>
  </head>
  <body>
    <div>
      <div class="title">nested2</div>
      <div class="date">2008/10/10</div>
      <div class="body"><p class="content">nested2</p></div>
      <a href="nested21.html">nested21</a>
    </div>
  </body>
</html>
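As a side note, this fixture is what the spec's capture_regexp runs against; a quick sketch (not part of the gem) of what the three captures yield for this file:

html = File.read('spec/resources/nested2.html')
md = %r{<div class="title">(.*?)</div>.*<div class="date">(.*?)</div>.*<div class="body">(.*?)</div>}m.match(html)
md.captures
# => ["nested2", "2008/10/10", "<p class=\"content\">nested2</p>"]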
data/spec/spec.opts
ADDED
data/spec/spec_helper.rb
ADDED
@@ -0,0 +1,8 @@
require 'rubygems'
require 'spec/autorun'
require 'date'
require 'mocha'

require File.join(File.dirname(__FILE__), '/../lib/regexp_crawler.rb')
require File.join(File.dirname(__FILE__), '/../lib/regexp_crawler/crawler.rb')
require File.join(File.dirname(__FILE__), '/../lib/regexp_crawler/http.rb')
metadata
ADDED
@@ -0,0 +1,74 @@
--- !ruby/object:Gem::Specification
name: regexp_crawler
version: !ruby/object:Gem::Version
  version: 0.9.1
platform: ruby
authors:
- Richard Huang
autorequire:
bindir: bin
cert_chain: []

date: 2009-09-14 00:00:00 +08:00
default_executable:
dependencies: []

description: RegexpCrawler is a Ruby library for crawling data from websites using regular expressions.
email: flyerhzm@gmail.com
executables: []

extensions: []

extra_rdoc_files:
- LICENSE
- README.textile
files:
- LICENSE
- README.textile
- Rakefile
- VERSION
- example/github_projects.rb
- init.rb
- lib/regexp_crawler.rb
- lib/regexp_crawler/crawler.rb
- lib/regexp_crawler/http.rb
- regexp_crawler.gemspec
- spec/regexp_crawler_spec.rb
- spec/resources/complex.html
- spec/resources/nested1.html
- spec/resources/nested2.html
- spec/resources/nested21.html
- spec/resources/simple.html
- spec/spec.opts
- spec/spec_helper.rb
has_rdoc: true
homepage: ""
licenses: []

post_install_message:
rdoc_options:
- --charset=UTF-8
require_paths:
- lib
required_ruby_version: !ruby/object:Gem::Requirement
  requirements:
  - - ">="
    - !ruby/object:Gem::Version
      version: "0"
  version:
required_rubygems_version: !ruby/object:Gem::Requirement
  requirements:
  - - ">="
    - !ruby/object:Gem::Version
      version: "0"
  version:
requirements: []

rubyforge_project:
rubygems_version: 1.3.5
signing_key:
specification_version: 3
summary: RegexpCrawler is a Ruby library for crawling data from websites using regular expressions.
test_files:
- spec/regexp_crawler_spec.rb
- spec/spec_helper.rb