spidey 0.0.3 → 0.0.4

data/README.md CHANGED
@@ -1,32 +1,46 @@
 Spidey
 ======
 
-Spidey provides a bare-bones framework for crawling and scraping web sites.
+Spidey provides a bare-bones framework for crawling and scraping web sites. Its goal is to keep boilerplate scraping logic out of your code.
 
 
 Example
 -------
 
-This [non-working] example _spider_ crawls the ebay.com home page, follows links to auction pages, and finally records a few scraped item details as a _result_.
+This example _spider_ crawls an eBay page, follows links to category pages, continues to auction detail pages, and finally records a few scraped item details as a _result_.
 
-    class EbaySpider < Spidey::AbstractSpider
-      handle "http://www.ebay.com", :process_home
+    class EbayPetSuppliesSpider < Spidey::AbstractSpider
+      handle "http://pet-supplies.shop.ebay.com", :process_home
 
       def process_home(page, default_data = {})
-        page.links_with(href: /auction\.aspx/).each do |link|
-          handle resolve_url(link.href, page), :process_auction, auction_title: link.text
+        page.search("#AllCats a[role=menuitem]").each do |a|
+          handle resolve_url(a.attr('href'), page), :process_category, category: a.text.strip
+        end
+      end
+
+      def process_category(page, default_data = {})
+        page.search("#ResultSetItems table.li td.dtl a").each do |a|
+          handle resolve_url(a.attr('href'), page), :process_auction, default_data.merge(title: a.text.strip)
         end
       end
 
       def process_auction(page, default_data = {})
-        record default_data.merge(sale_price: page.search('.sale_price').text)
+        image_el = page.search('div.vi-ipic1 img').first
+        price_el = page.search('span[itemprop=price]').first
+        record default_data.merge(
+          image_url: (image_el.attr('src') if image_el),
+          price: price_el.text.strip
+        )
       end
+
     end
 
-    spider = EbaySpider.new verbose: true
+    spider = EbayPetSuppliesSpider.new verbose: true
     spider.crawl max_urls: 100
+
+    spider.results # => [{category: "Aquarium & Fish", title: "5 Gal. Fish Tank"...
 
-Implement a _spider_ class extending `Spidey::AbstractSpider` for each target site. The class can declare starting URLs with class-level calls to `handle`. Spidey invokes each of the methods specified in those calls, passing in the resulting `page` (a [Mechanize](http://mechanize.rubyforge.org/) [Page](http://mechanize.rubyforge.org/Mechanize/Page.html) object) and, optionally, some scraped data. The methods can do whatever processing of the page is necessary, calling `handle` with additional URLs to crawl and/or `record` with scraped results.
+Implement a _spider_ class extending `Spidey::AbstractSpider` for each target site. The class can declare starting URLs by calling `handle` at the class level. Spidey invokes each of the methods specified in those calls, passing in the resulting `page` (a [Mechanize](http://mechanize.rubyforge.org/) [Page](http://mechanize.rubyforge.org/Mechanize/Page.html) object) and, optionally, some scraped data. The methods can do whatever processing of the page is necessary, calling `handle` with additional URLs to crawl and/or `record` with scraped results.
 
 
 Storage Strategies
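
As an aside to the paragraph the new README adds above: a minimal spider under this `handle`/`record` API might look like the following. This is a sketch only; the class name, URL, and CSS selector are hypothetical, not part of the gem.

    require 'spidey'

    # Hypothetical single-page spider: fetch one page, record one result per
    # matching element. Only handle, record, crawl, and results come from
    # Spidey; the URL and ".quote" selector are illustrative.
    class QuotesSpider < Spidey::AbstractSpider
      handle "http://example.com/quotes", :process_quotes

      def process_quotes(page, default_data = {})
        page.search(".quote").each do |el|
          record default_data.merge(text: el.text.strip)
        end
      end
    end

    spider = QuotesSpider.new verbose: true
    spider.crawl max_urls: 10
    spider.results.each { |result| puts result[:text] }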
@@ -38,12 +52,17 @@ By default, the lists of URLs being crawled, results scraped, and errors encount
     spider.results # => [{auction_title: "...", sale_price: "..."}, ...]
     spider.errors # => [{url: "...", handler: :process_home, error: FooException}, ...]
 
-Add the [spidey-mongo](https://github.com/joeyAghion/spidey-mongo) gem and include `Spidey::Strategies::Mongo` in your spider to instead use MongoDB to persist these data. [See the docs](https://github.com/joeyAghion/spidey-mongo) for more information.
+Add the [spidey-mongo](https://github.com/joeyAghion/spidey-mongo) gem and include `Spidey::Strategies::Mongo` in your spider to instead use MongoDB to persist these data. [See the docs](https://github.com/joeyAghion/spidey-mongo) for more information. Or, you can implement your own strategy by overriding the appropriate methods from `AbstractSpider`.
+
+
+Contributing
+------------
+
+Spidey is very much a work in progress. Pull requests welcome.
 
 
 To Do
 -----
-* Add working examples
 * Spidey works well for crawling public web pages, but since little effort is undertaken to preserve the crawler's state across requests, it works less well when particular cookies or sequences of form submissions are required. [Mechanize](http://mechanize.rubyforge.org/) supports this quite well, though, so Spidey could grow in that direction.
 
 
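On the README's new note about implementing your own storage strategy: the in-memory defaults live behind a few methods on `AbstractSpider`, so a strategy is just a module that overrides them. A rough sketch, assuming (by analogy with spidey-mongo) that `record` is one of the overridable methods; the signature is inferred, not documented:

    require 'spidey'
    require 'json'

    # Hypothetical strategy persisting results to a JSON-lines file instead
    # of the default in-memory array. The record(data) override point follows
    # the pattern spidey-mongo uses; treat it as an assumption.
    module Spidey
      module Strategies
        module JsonFile
          def record(data)
            File.open("results.jsonl", "a") { |f| f.puts data.to_json }
          end
        end
      end
    end

    class EbayPetSuppliesSpider < Spidey::AbstractSpider
      include Spidey::Strategies::JsonFile
      # ... handlers as in the README example above ...
    end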
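And for context on the remaining To Do item: the cookie and form handling Mechanize already supports, which Spidey could grow to wrap, looks roughly like this. The URL and form field names below are made up for illustration.

    require 'mechanize'

    agent = Mechanize.new
    # Hypothetical login sequence; a real spider would need Spidey to reuse
    # this agent (and its cookie jar) across subsequent requests.
    login_page = agent.get("http://example.com/login")
    form = login_page.form_with(action: /login/)
    form.field_with(name: "username").value = "me"
    form.field_with(name: "password").value = "secret"
    agent.submit(form) # the agent now carries the session cookie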
data/examples/ebay_pet_supplies_spider.rb ADDED
@@ -0,0 +1,25 @@
+class EbayPetSuppliesSpider < Spidey::AbstractSpider
+  handle "http://pet-supplies.shop.ebay.com", :process_home
+
+  def process_home(page, default_data = {})
+    page.search("#AllCats a[role=menuitem]").each do |a|
+      handle resolve_url(a.attr('href'), page), :process_category, category: a.text.strip
+    end
+  end
+
+  def process_category(page, default_data = {})
+    page.search("#ResultSetItems table.li td.dtl a").each do |a|
+      handle resolve_url(a.attr('href'), page), :process_auction, default_data.merge(title: a.text.strip)
+    end
+  end
+
+  def process_auction(page, default_data = {})
+    image_el = page.search('div.vi-ipic1 img').first
+    price_el = page.search('span[itemprop=price]').first
+    record default_data.merge(
+      image_url: (image_el.attr('src') if image_el),
+      price: price_el.text.strip
+    )
+  end
+
+end
data/lib/spidey/version.rb CHANGED
@@ -1,3 +1,3 @@
 module Spidey
-  VERSION = "0.0.3"
+  VERSION = "0.0.4"
 end
data/spidey.gemspec CHANGED
@@ -21,6 +21,7 @@ Gem::Specification.new do |s|
 
   s.add_development_dependency "rake"
   s.add_development_dependency "rspec"
+  s.add_development_dependency "ruby-debug19"
 
   s.add_runtime_dependency "mechanize"
 end
metadata CHANGED
@@ -1,7 +1,7 @@
1
1
  --- !ruby/object:Gem::Specification
2
2
  name: spidey
3
3
  version: !ruby/object:Gem::Version
4
- version: 0.0.3
4
+ version: 0.0.4
5
5
  prerelease:
6
6
  platform: ruby
7
7
  authors:
@@ -9,11 +9,11 @@ authors:
9
9
  autorequire:
10
10
  bindir: bin
11
11
  cert_chain: []
12
- date: 2012-06-27 00:00:00.000000000Z
12
+ date: 2012-12-21 00:00:00.000000000 Z
13
13
  dependencies:
14
14
  - !ruby/object:Gem::Dependency
15
15
  name: rake
16
- requirement: &70359848145740 !ruby/object:Gem::Requirement
16
+ requirement: !ruby/object:Gem::Requirement
17
17
  none: false
18
18
  requirements:
19
19
  - - ! '>='
@@ -21,10 +21,31 @@ dependencies:
21
21
  version: '0'
22
22
  type: :development
23
23
  prerelease: false
24
- version_requirements: *70359848145740
24
+ version_requirements: !ruby/object:Gem::Requirement
25
+ none: false
26
+ requirements:
27
+ - - ! '>='
28
+ - !ruby/object:Gem::Version
29
+ version: '0'
25
30
  - !ruby/object:Gem::Dependency
26
31
  name: rspec
27
- requirement: &70359848144760 !ruby/object:Gem::Requirement
32
+ requirement: !ruby/object:Gem::Requirement
33
+ none: false
34
+ requirements:
35
+ - - ! '>='
36
+ - !ruby/object:Gem::Version
37
+ version: '0'
38
+ type: :development
39
+ prerelease: false
40
+ version_requirements: !ruby/object:Gem::Requirement
41
+ none: false
42
+ requirements:
43
+ - - ! '>='
44
+ - !ruby/object:Gem::Version
45
+ version: '0'
46
+ - !ruby/object:Gem::Dependency
47
+ name: ruby-debug19
48
+ requirement: !ruby/object:Gem::Requirement
28
49
  none: false
29
50
  requirements:
30
51
  - - ! '>='
@@ -32,10 +53,15 @@ dependencies:
32
53
  version: '0'
33
54
  type: :development
34
55
  prerelease: false
35
- version_requirements: *70359848144760
56
+ version_requirements: !ruby/object:Gem::Requirement
57
+ none: false
58
+ requirements:
59
+ - - ! '>='
60
+ - !ruby/object:Gem::Version
61
+ version: '0'
36
62
  - !ruby/object:Gem::Dependency
37
63
  name: mechanize
38
- requirement: &70359848143880 !ruby/object:Gem::Requirement
64
+ requirement: !ruby/object:Gem::Requirement
39
65
  none: false
40
66
  requirements:
41
67
  - - ! '>='
@@ -43,7 +69,12 @@ dependencies:
43
69
  version: '0'
44
70
  type: :runtime
45
71
  prerelease: false
46
- version_requirements: *70359848143880
72
+ version_requirements: !ruby/object:Gem::Requirement
73
+ none: false
74
+ requirements:
75
+ - - ! '>='
76
+ - !ruby/object:Gem::Version
77
+ version: '0'
47
78
  description: A loose framework for crawling and scraping web sites.
48
79
  email:
49
80
  - joey@aghion.com
@@ -56,6 +87,7 @@ files:
56
87
  - LICENSE.txt
57
88
  - README.md
58
89
  - Rakefile
90
+ - examples/ebay_pet_supplies_spider.rb
59
91
  - lib/spidey.rb
60
92
  - lib/spidey/abstract_spider.rb
61
93
  - lib/spidey/version.rb
@@ -77,7 +109,7 @@ required_ruby_version: !ruby/object:Gem::Requirement
77
109
  version: '0'
78
110
  segments:
79
111
  - 0
80
- hash: -1311415898264218872
112
+ hash: -3162500508741796001
81
113
  required_rubygems_version: !ruby/object:Gem::Requirement
82
114
  none: false
83
115
  requirements:
@@ -86,10 +118,10 @@ required_rubygems_version: !ruby/object:Gem::Requirement
86
118
  version: '0'
87
119
  segments:
88
120
  - 0
89
- hash: -1311415898264218872
121
+ hash: -3162500508741796001
90
122
  requirements: []
91
123
  rubyforge_project: spidey
92
- rubygems_version: 1.8.10
124
+ rubygems_version: 1.8.24
93
125
  signing_key:
94
126
  specification_version: 3
95
127
  summary: A loose framework for crawling and scraping web sites.