regexp_crawler 0.9.1

data/LICENSE ADDED
@@ -0,0 +1,20 @@
+ Copyright (c) 2009 Richard Huang (flyerhzm@gmail.com)
+
+ Permission is hereby granted, free of charge, to any person obtaining
+ a copy of this software and associated documentation files (the
+ "Software"), to deal in the Software without restriction, including
+ without limitation the rights to use, copy, modify, merge, publish,
+ distribute, sublicense, and/or sell copies of the Software, and to
+ permit persons to whom the Software is furnished to do so, subject to
+ the following conditions:
+
+ The above copyright notice and this permission notice shall be
+ included in all copies or substantial portions of the Software.
+
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE
+ LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION
+ OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION
+ WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
data/README.textile ADDED
@@ -0,0 +1,114 @@
+ h1. RegexpCrawler
+
+ RegexpCrawler is a crawler that uses regular expressions to extract data from websites. It is easy to use and requires little code if you are familiar with regular expressions.
+
+ **************************************************************************
+
+ h2. Install
+
+ <pre><code>
+ gem install regexp_crawler
+ </code></pre>
+
+ **************************************************************************
+
+ h2. Usage
+
+ It's really easy to use, sometimes just one line.
+
+ <pre><code>
+ RegexpCrawler::Crawler.new(options).start
+ </code></pre>
+
+ <code>options</code> is a hash:
+ * <code>:start_page</code>, mandatory, a string defining the website URL where the crawler starts
+ * <code>:continue_regexp</code>, optional, a regexp defining which URLs the crawler continues to crawl; each page is scanned with String#scan and the first non-nil capture of each match is used
+ * <code>:capture_regexp</code>, mandatory, a regexp defining what content the crawler captures; each page is parsed with Regexp#match and all group captures are collected
+ * <code>:named_captures</code>, mandatory, an array of strings naming the captured groups of <code>:capture_regexp</code>
+ * <code>:model</code>, optional if <code>:save_method</code> is defined, a string naming the model class of the results
+ * <code>:save_method</code>, optional if <code>:model</code> is defined, a proc defining how to save a crawled result; it accepts two parameters, the crawled result of one page and the crawled URL, and returning <code>false</code> from it stops the crawler
+ * <code>:headers</code>, optional, a hash of HTTP headers
+ * <code>:encoding</code>, optional, a string naming the encoding of the crawled pages; results are converted to UTF-8
+ * <code>:need_parse</code>, optional, a proc deciding whether a page should be parsed by <code>:capture_regexp</code>; it accepts two parameters, the crawled page URI and the response body
+ * <code>:logger</code>, optional, <code>true</code> to log to STDOUT, or a Logger object to log to that logger
+
+ If the crawler defines <code>:model</code> but no <code>:save_method</code>, <code>RegexpCrawler::Crawler#start</code> returns an array of results, such as:
+ <pre><code>
+ [{:model_name => {:attr_name => 'attr_value'}, :page => 'website url'}, {:model_name => {:attr_name => 'attr_value'}, :page => 'another website url'}]
+ </code></pre>
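+
+ For example, a minimal sketch that collects page titles without a <code>:save_method</code> (the URL and regexp here are hypothetical placeholders):
+ <pre><code>
+ require 'rubygems'
+ require 'regexp_crawler'
+
+ # crawl a single page and capture its <title>; with no :save_method,
+ # #start returns the results as an array
+ results = RegexpCrawler::Crawler.new(
+   :start_page => 'http://example.com/',
+   :capture_regexp => %r{<title>(.*?)</title>}m,
+   :named_captures => ['title'],
+   :model => 'item'
+ ).start
+
+ results.each do |result|
+   puts "#{result[:page]}: #{result[:item][:title]}"
+ end
+ </code></pre>
+
+ The gem also ships a class-level mixin (see <code>lib/regexp_crawler.rb</code>) for registering several crawlers on one class; a minimal sketch, assuming a hypothetical <code>MySite</code> class:
+ <pre><code>
+ class MySite
+   include RegexpCrawler   # extends MySite with regexp_crawler / start_crawl
+
+   regexp_crawler(
+     :start_page => 'http://example.com/',
+     :capture_regexp => %r{<title>(.*?)</title>}m,
+     :named_captures => ['title'],
+     :save_method => Proc.new { |result, page| puts result[:title] }
+   )
+ end
+
+ MySite.start_crawl
+ </code></pre>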
+
+ **************************************************************************
+
+ h2. Example
+
+ A script that synchronizes your GitHub projects, excluding forked ones; see <code>example/github_projects.rb</code>:
+
+ <pre><code>
+ require 'rubygems'
+ require 'regexp_crawler'
+
+ crawler = RegexpCrawler::Crawler.new(
+   :start_page => "http://github.com/flyerhzm",
+   :continue_regexp => %r{<div class="title"><b><a href="(/flyerhzm/.*?)">}m,
+   :capture_regexp => %r{<a href="http://github.com/flyerhzm/[^/"]*?(?:/tree)?">(.*?)</a>.*<span id="repository_description".*?>(.*?)</span>.*(<div class="(?:wikistyle|plain)">.*?</div>)</div>}m,
+   :named_captures => ['title', 'description', 'body'],
+   :save_method => Proc.new do |result, page|
+     puts '============================='
+     puts page
+     puts result[:title]
+     puts result[:description]
+     puts result[:body][0..100] + "..."
+   end,
+   :need_parse => Proc.new do |page, response_body|
+     page =~ %r{http://github.com/flyerhzm/\w+} && !response_body.index(/Fork of.*?<a href=".*?">/)
+   end)
+ crawler.start
+ </code></pre>
+
+ The results are as follows:
+ <pre><code>
+ =============================
+ http://github.com/flyerhzm/bullet/tree/master
+ bullet
+ A rails plugin/gem to kill N+1 queries and unused eager loading
+ <div class="wikistyle"><h1>Bullet</h1>
+ <p>The Bullet plugin/gem is designed to help you increase your...
+ =============================
+ http://github.com/flyerhzm/regexp_crawler/tree/master
+ regexp_crawler
+ A crawler which use regular expression to catch data.
+ <div class="wikistyle"><h1>RegexpCrawler</h1>
+ <p>RegexpCrawler is a crawler which use regex expressi...
+ =============================
+ http://github.com/flyerhzm/sitemap/tree/master
+ sitemap
+ This plugin will generate a sitemap.xml from sitemap.rb whose format is very similar to routes.rb
+ <div class="wikistyle"><h1>Sitemap</h1>
+ <p>This plugin will generate a sitemap.xml or sitemap.xml.gz ...
+ =============================
+ http://github.com/flyerhzm/visual_partial/tree/master
+ visual_partial
+ This plugin provides a way that you can see all the partial pages rendered. So it can prevent you from using partial page too much, which hurts the performance.
+ <div class="wikistyle"><h1>VisualPartial</h1>
+ <p>This plugin provides a way that you can see all the ...
+ =============================
+ http://github.com/flyerhzm/chinese_regions/tree/master
+ chinese_regions
+ provides all chinese regions, cities and districts
+ <div class="wikistyle"><h1>ChineseRegions</h1>
+ <p>Provides all chinese regions, cities and districts<...
+ =============================
+ http://github.com/flyerhzm/chinese_permalink/tree/master
+ chinese_permalink
+ This plugin adds a capability for ar model to create a seo permalink with your chinese text. It will translate your chinese text to english url based on google translate.
+ <div class="wikistyle"><h1>ChinesePermalink</h1>
+ <p>This plugin adds a capability for ar model to cre...
+ =============================
+ http://github.com/flyerhzm/codelinestatistics/tree/master
+ codelinestatistics
+ The code line statistics takes files and directories from GUI, counts the total files, total sizes of files, total lines, lines of codes, lines of comments and lines of blanks in the files, displays the results and can also export results to html file.
+ <div class="plain"><pre>codelinestatistics README file:
+
+ ----------------------------------------
+ Wha...
+ </code></pre>
data/Rakefile ADDED
@@ -0,0 +1,22 @@
+ require 'rake'
+ require 'rake/rdoctask'
+ require 'spec/rake/spectask'
+ require 'jeweler'
+
+ desc "Run all specs in spec directory"
+ Spec::Rake::SpecTask.new(:spec) do |t|
+   t.spec_files = FileList['spec/**/*_spec.rb']
+   t.rcov = true
+   t.rcov_opts = ['--exclude', 'spec,config,Library,usr/lib/ruby']
+   t.rcov_dir = File.join(File.dirname(__FILE__), "tmp")
+ end
+
+ Jeweler::Tasks.new do |gemspec|
+   gemspec.name = "regexp_crawler"
+   gemspec.summary = "RegexpCrawler is a Ruby library for crawling data from websites using regular expressions."
+   gemspec.description = "RegexpCrawler is a Ruby library for crawling data from websites using regular expressions."
+   gemspec.email = "flyerhzm@gmail.com"
+   gemspec.homepage = ""
+   gemspec.authors = ["Richard Huang"]
+   gemspec.files.exclude '.gitignore'
+ end
data/VERSION ADDED
@@ -0,0 +1 @@
+ 0.9.1
data/example/github_projects.rb ADDED
@@ -0,0 +1,19 @@
+ require 'rubygems'
+ require 'regexp_crawler'
+
+ crawler = RegexpCrawler::Crawler.new(
+   :start_page => "http://github.com/flyerhzm",
+   :continue_regexp => %r{<div class="title"><b><a href="(/flyerhzm/.*?)">}m,
+   :capture_regexp => %r{<a href="http://github.com/flyerhzm/[^/"]*?(?:/tree)?">(.*?)</a>.*<span id="repository_description".*?>(.*?)</span>.*(<div class="(?:wikistyle|plain)">.*?</div>)</div>}m,
+   :named_captures => ['title', 'description', 'body'],
+   :save_method => Proc.new do |result, page|
+     puts '============================='
+     puts page
+     puts result[:title]
+     puts result[:description]
+     puts result[:body][0..100] + "..."
+   end,
+   :need_parse => Proc.new do |page, response_body|
+     page =~ %r{http://github.com/flyerhzm/\w+} && !response_body.index(/Fork of.*?<a href=".*?">/)
+   end)
+ crawler.start
data/init.rb ADDED
File without changes
data/lib/regexp_crawler.rb ADDED
@@ -0,0 +1,27 @@
+ require 'net/http'
+ require 'uri'
+ require 'iconv'
+ require 'logger'
+ require 'regexp_crawler/http'
+ require 'regexp_crawler/crawler'
+
+ module RegexpCrawler
+
+   def self.included(base)
+     base.extend ClassMethods
+   end
+
+   module ClassMethods
+     def regexp_crawler(options)
+       @crawlers ||= []
+       @crawlers << Crawler.new(options)
+     end
+
+     def start_crawl
+       @crawlers.each do |crawler|
+         crawler.start
+       end
+     end
+   end
+
+ end
data/lib/regexp_crawler/crawler.rb ADDED
@@ -0,0 +1,89 @@
+ module RegexpCrawler
+   class Crawler
+     attr_accessor :start_page, :continue_regexp, :named_captures, :model, :save_method, :headers, :encoding, :need_parse
+
+     def initialize(options = {})
+       @start_page = options[:start_page]
+       @continue_regexp = options[:continue_regexp]
+       @capture_regexp = options[:capture_regexp]
+       @named_captures = options[:named_captures]
+       @model = options[:model]
+       @save_method = options[:save_method]
+       @headers = options[:headers]
+       @encoding = options[:encoding]
+       @need_parse = options[:need_parse]
+       @logger = options[:logger] == true ? Logger.new(STDOUT) : options[:logger]
+     end
+
+     def capture_regexp=(regexp)
+       @capture_regexp = Regexp.new(regexp.source, regexp.options | Regexp::MULTILINE)
+     end
+
+     def start
+       @results = []
+       @captured_pages = []
+       @pages = [URI.parse(@start_page)]
+       while !@pages.empty? and !@stop
+         uri = @pages.shift
+         @captured_pages << uri
+         parse_page(uri)
+       end
+       @results
+     end
+
+     private
+     def parse_page(uri)
+       @logger.debug "crawling page: #{uri.to_s}" if @logger
+       response = Net::HTTP.get_response_with_headers(uri, @headers)
+       parse_response(response, uri)
+     end
+
+     def continue_uri(uri, page)
+       if page =~ /^#{uri.scheme}/
+         URI.parse(page)
+       elsif page =~ /^\//
+         URI.join(uri.scheme + '://' + uri.host, page)
+       else
+         URI.parse(uri.to_s.split('/')[0..-2].join('/') + '/' + page)
+       end
+     end
+
+     def parse_response(response, uri)
+       response_body = encoding.nil? ? response.body : Iconv.iconv("UTF-8//IGNORE", "#{encoding}//IGNORE", response.body).first
+       if response.is_a? Net::HTTPSuccess
+         @logger.debug "crawling success: #{uri.to_s}" if @logger
+         if continue_regexp
+           response_body.scan(continue_regexp).each do |page|
+             @logger.debug "continue_page: #{page}" if @logger
+             page = page.compact.first if page.is_a? Array
+             continue_uri = continue_uri(uri, page)
+             @pages << continue_uri unless @captured_pages.include?(continue_uri) or @pages.include?(continue_uri)
+           end
+         end
+         if @need_parse.nil? or @need_parse.call(uri.to_s, response_body)
+           md = @capture_regexp.match(response_body)
+           if md
+             @logger.debug "response body captured" if @logger
+             captures = md.captures
+             result = {}
+             captures.each_index do |i|
+               result[named_captures[i].to_sym] = captures[i]
+             end
+             if @save_method
+               ret = @save_method.call(result, uri.to_s)
+               @stop = true if ret == false
+             else
+               @results << {@model.downcase.to_sym => result, :page => uri.to_s}
+             end
+           end
+         end
+       elsif response.is_a? Net::HTTPRedirection
+         @logger.debug "crawling redirect: #{response['location']}" if @logger
+         parse_page(URI.parse(response['location']))
+       else
+         @logger.debug "crawling nothing: #{uri.to_s}" if @logger
+         # do nothing
+       end
+     end
+   end
+ end
data/lib/regexp_crawler/http.rb ADDED
@@ -0,0 +1,9 @@
+ module Net
+   class HTTP
+     def HTTP.get_response_with_headers(uri, headers)
+       response = start(uri.host, uri.port) do |http|
+         http.get(uri.request_uri, headers)
+       end
+     end
+   end
+ end
data/regexp_crawler.gemspec ADDED
@@ -0,0 +1,58 @@
+ # Generated by jeweler
+ # DO NOT EDIT THIS FILE
+ # Instead, edit Jeweler::Tasks in Rakefile, and run `rake gemspec`
+ # -*- encoding: utf-8 -*-
+
+ Gem::Specification.new do |s|
+   s.name = %q{regexp_crawler}
+   s.version = "0.9.1"
+
+   s.required_rubygems_version = Gem::Requirement.new(">= 0") if s.respond_to? :required_rubygems_version=
+   s.authors = ["Richard Huang"]
+   s.date = %q{2009-09-14}
+   s.description = %q{RegexpCrawler is a Ruby library for crawling data from websites using regular expressions.}
+   s.email = %q{flyerhzm@gmail.com}
+   s.extra_rdoc_files = [
+     "LICENSE",
+     "README.textile"
+   ]
+   s.files = [
+     "LICENSE",
+     "README.textile",
+     "Rakefile",
+     "VERSION",
+     "example/github_projects.rb",
+     "init.rb",
+     "lib/regexp_crawler.rb",
+     "lib/regexp_crawler/crawler.rb",
+     "lib/regexp_crawler/http.rb",
+     "regexp_crawler.gemspec",
+     "spec/regexp_crawler_spec.rb",
+     "spec/resources/complex.html",
+     "spec/resources/nested1.html",
+     "spec/resources/nested2.html",
+     "spec/resources/nested21.html",
+     "spec/resources/simple.html",
+     "spec/spec.opts",
+     "spec/spec_helper.rb"
+   ]
+   s.homepage = %q{}
+   s.rdoc_options = ["--charset=UTF-8"]
+   s.require_paths = ["lib"]
+   s.rubygems_version = %q{1.3.5}
+   s.summary = %q{RegexpCrawler is a Ruby library for crawling data from websites using regular expressions.}
+   s.test_files = [
+     "spec/regexp_crawler_spec.rb",
+     "spec/spec_helper.rb"
+   ]
+
+   if s.respond_to? :specification_version then
+     current_version = Gem::Specification::CURRENT_SPECIFICATION_VERSION
+     s.specification_version = 3
+
+     if Gem::Version.new(Gem::RubyGemsVersion) >= Gem::Version.new('1.2.0') then
+     else
+     end
+   else
+   end
+ end
data/spec/regexp_crawler_spec.rb ADDED
@@ -0,0 +1,122 @@
+ require File.expand_path(File.dirname(__FILE__) + "/spec_helper.rb")
+
+ describe RegexpCrawler::Crawler do
+   context '#simple html' do
+     it 'should parse data according to regexp' do
+       success_page('/resources/simple.html', 'http://simple.com/')
+
+       crawl = RegexpCrawler::Crawler.new(:start_page => 'http://simple.com/', :capture_regexp => %r{<div class="title">(.*?)</div>.*<div class="date">(.*?)</div>.*<div class="body">(.*?)</div>}m, :named_captures => ['title', 'date', 'body'], :model => 'post', :logger => true)
+       results = crawl.start
+       results.size.should == 1
+       results.first[:post][:title].should == 'test'
+     end
+
+     it 'should redirect' do
+       redirect_page('http://redirect.com/', 'http://simple.com/')
+       success_page('/resources/simple.html', 'http://simple.com/')
+     end
+   end
+
+   context '#complex html' do
+     before(:each) do
+       success_page('/resources/complex.html', 'http://complex.com/')
+       success_page('/resources/nested1.html', 'http://complex.com/nested1.html')
+       success_page('/resources/nested2.html', 'http://complex.com/nested2.html')
+     end
+
+     it 'should parse data according to regexp' do
+       crawl = RegexpCrawler::Crawler.new
+       crawl.start_page = 'http://complex.com/'
+       crawl.continue_regexp = %r{(?:http://complex.com)?/nested\d.html}
+       crawl.capture_regexp = %r{<div class="title">(.*?)</div>.*<div class="date">(.*?)</div>.*<div class="body">(.*?)</div>}m
+       crawl.named_captures = ['title', 'date', 'body']
+       crawl.model = 'post'
+       results = crawl.start
+       results.size.should == 2
+       results.first[:post][:title].should == 'nested1'
+       results.last[:post][:title].should == 'nested2'
+     end
+
+     it 'should parse data from nested pages of nested pages' do
+       success_page('/resources/nested21.html', 'http://complex.com/nested21.html')
+       crawl = RegexpCrawler::Crawler.new
+       crawl.start_page = 'http://complex.com/'
+       crawl.continue_regexp = %r{(?:http://complex.com)?/?nested\d+.html}
+       crawl.capture_regexp = %r{<div class="title">(.*?)</div>.*<div class="date">(.*?)</div>.*<div class="body">(.*?)</div>}m
+       crawl.named_captures = ['title', 'date', 'body']
+       crawl.model = 'post'
+       results = crawl.start
+       results.size.should == 3
+       results.first[:post][:title].should == 'nested1'
+       results.last[:post][:title].should == 'nested21'
+     end
+
+     it "should save results with a custom save_method" do
+       crawl = RegexpCrawler::Crawler.new
+       crawl.start_page = 'http://complex.com/'
+       crawl.continue_regexp = %r{(?:http://complex.com)?/nested\d.html}
+       crawl.capture_regexp = %r{<div class="title">(.*?)</div>.*<div class="date">(.*?)</div>.*<div class="body">(.*?)</div>}m
+       crawl.named_captures = ['title', 'date', 'body']
+       my_results = []
+       crawl.save_method = Proc.new {|result, page| my_results << result}
+       results = crawl.start
+       results.size.should == 0
+       my_results.size.should == 2
+     end
+
+     it "should stop crawling when save_method returns false" do
+       crawl = RegexpCrawler::Crawler.new
+       crawl.start_page = 'http://complex.com/'
+       crawl.continue_regexp = %r{(?:http://complex.com)?/nested\d.html}
+       crawl.capture_regexp = %r{<div class="title">(.*?)</div>.*<div class="date">(.*?)</div>.*<div class="body">(.*?)</div>}m
+       crawl.named_captures = ['title', 'date', 'body']
+       stop_page = "http://complex.com/nested1.html"
+       parse_pages = []
+       crawl.save_method = Proc.new do |result, page|
+         if page == stop_page
+           false
+         else
+           parse_pages << page
+         end
+       end
+       results = crawl.start
+       parse_pages.size.should == 0
+     end
+
+     it 'should skip parsing nested2.html' do
+       success_page('/resources/nested21.html', 'http://complex.com/nested21.html')
+       crawl = RegexpCrawler::Crawler.new
+       crawl.start_page = 'http://complex.com/'
+       crawl.continue_regexp = %r{(?:http://complex.com)?/?nested\d+.html}
+       crawl.capture_regexp = %r{<div class="title">(.*?)</div>.*<div class="date">(.*?)</div>.*<div class="body">(.*?)</div>}m
+       crawl.named_captures = ['title', 'date', 'body']
+       crawl.model = 'post'
+       crawl.need_parse = Proc.new do |page, response_body|
+         if response_body.index('nested2 test html')
+           false
+         else
+           true
+         end
+       end
+       results = crawl.start
+       results.size.should == 2
+       results.first[:post][:title].should == 'nested1'
+       results.last[:post][:title].should == 'nested21'
+     end
+   end
+
+   def success_page(local_path, remote_path)
+     path = File.expand_path(File.dirname(__FILE__) + local_path)
+     content = File.read(path)
+     http = mock(Net::HTTPSuccess)
+     http.stubs(:is_a?).with(Net::HTTPSuccess).returns(true)
+     http.stubs(:body).returns(content)
+     Net::HTTP.expects(:get_response_with_headers).times(1).with(URI.parse(remote_path), nil).returns(http)
+   end
+
+   def redirect_page(remote_path, redirect_path)
+     http = mock(Net::HTTPRedirection)
+     http.stubs(:is_a?).with(Net::HTTPRedirection).returns(true)
+     Net::HTTP.expects(:get_response_with_headers).times(1).with(URI.parse(remote_path), nil).returns(http)
+   end
+ end
data/spec/resources/complex.html ADDED
@@ -0,0 +1,11 @@
+ <html>
+   <head>
+     <title>complex test html</title>
+   </head>
+   <body>
+     <div>
+       <a href="/nested1.html">nested1</a>
+       <a href="http://complex.com/nested2.html">nested2</a>
+     </div>
+   </body>
+ </html>
data/spec/resources/nested1.html ADDED
@@ -0,0 +1,12 @@
+ <html>
+   <head>
+     <title>nested1 test html</title>
+   </head>
+   <body>
+     <div>
+       <div class="title">nested1</div>
+       <div class="date">2008/10/10</div>
+       <div class="body"><p class="content">nested1</p></div>
+     </div>
+   </body>
+ </html>
data/spec/resources/nested2.html ADDED
@@ -0,0 +1,13 @@
+ <html>
+   <head>
+     <title>nested2 test html</title>
+   </head>
+   <body>
+     <div>
+       <div class="title">nested2</div>
+       <div class="date">2008/10/10</div>
+       <div class="body"><p class="content">nested2</p></div>
+       <a href="nested21.html">nested21</a>
+     </div>
+   </body>
+ </html>
data/spec/resources/nested21.html ADDED
@@ -0,0 +1,12 @@
+ <html>
+   <head>
+     <title>nested21 test html</title>
+   </head>
+   <body>
+     <div>
+       <div class="title">nested21</div>
+       <div class="date">2008/11/11</div>
+       <div class="body"><p class="content">nested21</p></div>
+     </div>
+   </body>
+ </html>
data/spec/resources/simple.html ADDED
@@ -0,0 +1,12 @@
+ <html>
+   <head>
+     <title>simple test html</title>
+   </head>
+   <body>
+     <div>
+       <div class="title">test</div>
+       <div class="date">2008/09/10</div>
+       <div class="body"><p class="content">test</p></div>
+     </div>
+   </body>
+ </html>
data/spec/spec.opts ADDED
@@ -0,0 +1,8 @@
+ --colour
+ --format
+ specdoc
+ --reverse
+ --timeout
+ 20
+ --loadby
+ mtime
data/spec/spec_helper.rb ADDED
@@ -0,0 +1,8 @@
+ require 'rubygems'
+ require 'spec/autorun'
+ require 'date'
+ require 'mocha'
+
+ require File.join(File.dirname(__FILE__), '/../lib/regexp_crawler.rb')
+ require File.join(File.dirname(__FILE__), '/../lib/regexp_crawler/crawler.rb')
+ require File.join(File.dirname(__FILE__), '/../lib/regexp_crawler/http.rb')
metadata ADDED
@@ -0,0 +1,74 @@
+ --- !ruby/object:Gem::Specification
+ name: regexp_crawler
+ version: !ruby/object:Gem::Version
+   version: 0.9.1
+ platform: ruby
+ authors:
+ - Richard Huang
+ autorequire:
+ bindir: bin
+ cert_chain: []
+
+ date: 2009-09-14 00:00:00 +08:00
+ default_executable:
+ dependencies: []
+
+ description: RegexpCrawler is a Ruby library for crawling data from websites using regular expressions.
+ email: flyerhzm@gmail.com
+ executables: []
+
+ extensions: []
+
+ extra_rdoc_files:
+ - LICENSE
+ - README.textile
+ files:
+ - LICENSE
+ - README.textile
+ - Rakefile
+ - VERSION
+ - example/github_projects.rb
+ - init.rb
+ - lib/regexp_crawler.rb
+ - lib/regexp_crawler/crawler.rb
+ - lib/regexp_crawler/http.rb
+ - regexp_crawler.gemspec
+ - spec/regexp_crawler_spec.rb
+ - spec/resources/complex.html
+ - spec/resources/nested1.html
+ - spec/resources/nested2.html
+ - spec/resources/nested21.html
+ - spec/resources/simple.html
+ - spec/spec.opts
+ - spec/spec_helper.rb
+ has_rdoc: true
+ homepage: ""
+ licenses: []
+
+ post_install_message:
+ rdoc_options:
+ - --charset=UTF-8
+ require_paths:
+ - lib
+ required_ruby_version: !ruby/object:Gem::Requirement
+   requirements:
+   - - ">="
+     - !ruby/object:Gem::Version
+       version: "0"
+   version:
+ required_rubygems_version: !ruby/object:Gem::Requirement
+   requirements:
+   - - ">="
+     - !ruby/object:Gem::Version
+       version: "0"
+   version:
+ requirements: []
+
+ rubyforge_project:
+ rubygems_version: 1.3.5
+ signing_key:
+ specification_version: 3
+ summary: RegexpCrawler is a Ruby library for crawling data from websites using regular expressions.
+ test_files:
+ - spec/regexp_crawler_spec.rb
+ - spec/spec_helper.rb