flyerhzm-regexp_crawler 0.8.0 → 0.8.1

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
data/README.textile CHANGED
@@ -1,6 +1,6 @@
  h1. RegexpCrawler
 
- RegexpCrawler is a crawler which use regrex expression to catch data.
+ RegexpCrawler is a crawler which uses regular expressions to capture data from websites. It is easy to use and requires little code if you are familiar with regular expressions.
 
  **************************************************************************
 
@@ -15,9 +15,100 @@ gem install flyerhzm-regexp_crawler
  h2. Usage
 
+ It's really easy to use; sometimes one line is enough.
+
+ <pre><code>
+ RegexpCrawler::Crawler.new(options).start
+ </code></pre>
+
+ options is a hash:
+ * <code>:start_page</code>, mandatory, a string defining the URL where the crawler starts
+ * <code>:continue_regexp</code>, optional, a regexp defining which URLs the crawler continues to crawl; it is applied with String#scan and the first non-nil capture of each match is used
+ * <code>:capture_regexp</code>, mandatory, a regexp defining which contents the crawler captures; it is applied with Regexp#match and all group captures are collected
+ * <code>:named_captures</code>, mandatory, a string array giving the names of the captured groups in :capture_regexp
+ * <code>:model</code>, optional if :save_method is defined, a string naming the result's model class
+ * <code>:save_method</code>, optional if :model is defined, a proc defining how to save a crawled result; the proc accepts two parameters, the first being one page's crawled result and the second the crawled URL
+ * <code>:headers</code>, optional, a hash of HTTP headers
+ * <code>:encoding</code>, optional, a string naming the encoding of the crawled pages; results are converted to UTF-8
+ * <code>:need_parse</code>, optional, a proc deciding whether a page should be parsed by the regexp; the proc accepts two parameters, the first being the crawled website URI and the second the response body of the crawled page
+
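The scan/match behavior described above can be sketched in plain Ruby. This sketch does not use the gem itself; the sample HTML and patterns are invented for illustration:

```ruby
# Plain-Ruby sketch of the two regexp roles described above; the sample
# body and patterns are made up for illustration.
body = '<a href="/flyerhzm/bullet/tree">bullet</a><h2>Title</h2><div>Body text</div>'

# :continue_regexp is applied with String#scan; the first non-nil capture
# of each match yields the next paths to crawl.
continue_regexp = %r{href="(/flyerhzm/.*?/tree)"}
next_paths = body.scan(continue_regexp).map { |groups| groups.compact.first }
# next_paths == ["/flyerhzm/bullet/tree"]

# :capture_regexp is applied with Regexp#match; the group captures are
# paired with :named_captures to build one result hash.
capture_regexp = %r{<h2>(.*?)</h2>.*?<div>(.*?)</div>}m
named_captures = ['title', 'body']
md = capture_regexp.match(body)
result = Hash[named_captures.map(&:to_sym).zip(md.captures)]
# result == {:title => "Title", :body => "Body text"}
```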
+ If the crawler defines :model but no :save_method, RegexpCrawler::Crawler#start returns an array of results, such as
+ <pre><code>
+ [{:model_name => {:attr_name => 'attr_value'}, :page => 'website url'}, {:model_name => {:attr_name => 'attr_value'}, :page => 'another website url'}]
+ </code></pre>
+
+ **************************************************************************
+
+ h2. Example
+
+ a script to synchronize your github projects, excluding forked projects
+
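That returned array can be iterated directly. A minimal sketch, following the shape shown above; the :post model name and the URLs here are invented for illustration:

```ruby
# Sketch of consuming Crawler#start's return value when :model (here a
# hypothetical 'post') is given and :save_method is not; data is invented.
results = [
  { :post => { :title => 'First post' },  :page => 'http://example.com/posts/1' },
  { :post => { :title => 'Second post' }, :page => 'http://example.com/posts/2' }
]

titles = results.map do |result|
  attrs = result[:post]  # the captured attributes, keyed by the model name
  "#{result[:page]} => #{attrs[:title]}"
end
# titles == ["http://example.com/posts/1 => First post",
#            "http://example.com/posts/2 => Second post"]
```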
+ <pre><code>
+ require 'rubygems'
+ require 'regexp_crawler'
+
+ crawler = RegexpCrawler::Crawler.new(
+   :start_page => "http://github.com/flyerhzm",
+   :continue_regexp => %r{<div class="title"><b><a href="(/flyerhzm/.*?/tree)">}m,
+   :capture_regexp => %r{<a href="http://github.com/flyerhzm/.*?/tree">(.*?)</a>.*<span id="repository_description".*?>(.*?)</span>.*(<div class="(?:wikistyle|plain)">.*?</div>)</div>}m,
+   :named_captures => ['title', 'description', 'body'],
+   :save_method => Proc.new do |result, page|
+     puts '============================='
+     puts page
+     puts result[:title]
+     puts result[:description]
+     puts result[:body][0..100] + "..."
+   end,
+   :need_parse => Proc.new do |page, response_body|
+     !response_body.index "Fork of"
+   end)
+ crawler.start
+ </code></pre>
+
+ The results are as follows:
  <pre><code>
- >> crawler = RegexpCrawler::Crawler.new(:start_page => "http://www.tijee.com/tags/64-google-face-questions/posts", :continue_regexp => %r{"(/posts/\d+-[^#]*?)"}, :capture_regexp => %r{<h2 class='title'><a.*?>(.*?)</a></h2>.*?<div class='body'>(.*?)</div>}m, :named_captures => ['title', 'body'], :model => 'post')
- >> crawler.start
+ =============================
+ http://github.com/flyerhzm/bullet/tree/master
+ bullet
+ A rails plugin/gem to kill N+1 queries and unused eager loading
+ <div class="wikistyle"><h1>Bullet</h1>
+ <p>The Bullet plugin/gem is designed to help you increase your...
+ =============================
+ http://github.com/flyerhzm/regexp_crawler/tree/master
+ regexp_crawler
+ A crawler which use regrex expression to catch data.
+ <div class="wikistyle"><h1>RegexpCrawler</h1>
+ <p>RegexpCrawler is a crawler which use regrex expressi...
+ =============================
+ http://github.com/flyerhzm/sitemap/tree/master
+ sitemap
+ This plugin will generate a sitemap.xml from sitemap.rb whose format is very similar to routes.rb
+ <div class="wikistyle"><h1>Sitemap</h1>
+ <p>This plugin will generate a sitemap.xml or sitemap.xml.gz ...
+ =============================
+ http://github.com/flyerhzm/visual_partial/tree/master
+ visual_partial
+ This plugin provides a way that you can see all the partial pages rendered. So it can prevent you from using partial page too much, which hurts the performance.
+ <div class="wikistyle"><h1>VisualPartial</h1>
+ <p>This plugin provides a way that you can see all the ...
+ =============================
+ http://github.com/flyerhzm/chinese_regions/tree/master
+ chinese_regions
+ provides all chinese regions, cities and districts
+ <div class="wikistyle"><h1>ChineseRegions</h1>
+ <p>Provides all chinese regions, cities and districts<...
+ =============================
+ http://github.com/flyerhzm/chinese_permalink/tree/master
+ chinese_permalink
+ This plugin adds a capability for ar model to create a seo permalink with your chinese text. It will translate your chinese text to english url based on google translate.
+ <div class="wikistyle"><h1>ChinesePermalink</h1>
+ <p>This plugin adds a capability for ar model to cre...
+ =============================
+ http://github.com/flyerhzm/codelinestatistics/tree/master
+ codelinestatistics
+ The code line statistics takes files and directories from GUI, counts the total files, total sizes of files, total lines, lines of codes, lines of comments and lines of blanks in the files, displays the results and can also export results to html file.
+ <div class="plain"><pre>codelinestatistics README file:
 
- =>[{:page=>"http://www.tijee.com/posts/327-google-face-questions-many-companies-will-ask-oh", :post=>{:title=>"Google面试题(很多公司都会问的哦)", :body=>"\n内容摘要:几星期前,一个朋友接受..."}}, {:page=>"http://www.tijee.com/posts/328-java-surface-together-with-the-google-test", :post=>{:title=>"google的一道JAVA面试题", :body=>"\n内容摘要:有一个整数n,写一个函数f(n..."}}]
+ ----------------------------------------
+ Wha...
  </code></pre>
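The :need_parse proc in the example above filters out forks because String#index returns nil (falsy) when the substring is absent. A minimal standalone check, with invented URLs and page bodies:

```ruby
# Standalone check of the fork filter used in the example's :need_parse
# proc: response_body.index("Fork of") is nil for original projects, so
# the negation keeps them and skips forks. URLs and bodies are invented.
need_parse = Proc.new { |page, response_body| !response_body.index("Fork of") }

keep = need_parse.call('http://example.com/original', '<p>An original project</p>')
skip = need_parse.call('http://example.com/fork', '<p>Fork of someone/repo</p>')
# keep == true, skip == false
```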
data/VERSION CHANGED
@@ -1 +1 @@
- 0.8.0
+ 0.8.1
@@ -56,7 +56,7 @@ module RegexpCrawler
  @pages << continue_uri unless @captured_pages.include?(continue_uri) or @pages.include?(continue_uri)
  end
  end
- if @need_parse.nil? or @need_parse.call(uri, response_body)
+ if @need_parse.nil? or @need_parse.call(uri.to_s, response_body)
  md = @capture_regexp.match(response_body)
  if md
  captures = md.captures
@@ -5,11 +5,11 @@
 
  Gem::Specification.new do |s|
  s.name = %q{regexp_crawler}
- s.version = "0.8.0"
+ s.version = "0.8.1"
 
  s.required_rubygems_version = Gem::Requirement.new(">= 0") if s.respond_to? :required_rubygems_version=
  s.authors = ["Richard Huang"]
- s.date = %q{2009-09-01}
+ s.date = %q{2009-09-12}
  s.description = %q{RegexpCrawler is a Ruby library for crawl data from website using regular expression.}
  s.email = %q{flyerhzm@gmail.com}
  s.extra_rdoc_files = [
metadata CHANGED
@@ -1,7 +1,7 @@
  --- !ruby/object:Gem::Specification
  name: flyerhzm-regexp_crawler
  version: !ruby/object:Gem::Version
- version: 0.8.0
+ version: 0.8.1
  platform: ruby
  authors:
  - Richard Huang
@@ -9,7 +9,7 @@ autorequire:
  bindir: bin
  cert_chain: []
 
- date: 2009-09-01 00:00:00 -07:00
+ date: 2009-09-12 00:00:00 -07:00
  default_executable:
  dependencies: []
 
@@ -43,7 +43,6 @@ files:
  - spec/spec_helper.rb
  has_rdoc: false
  homepage: ""
- licenses:
  post_install_message:
  rdoc_options:
  - --charset=UTF-8
@@ -64,7 +63,7 @@ required_rubygems_version: !ruby/object:Gem::Requirement
  requirements: []
 
  rubyforge_project:
- rubygems_version: 1.3.5
+ rubygems_version: 1.2.0
  signing_key:
  specification_version: 3
  summary: RegexpCrawler is a Ruby library for crawl data from website using regular expression.