proto 0.0.5 → 0.0.6
- data/README.md +18 -3
- data/lib/proto/scraper.rb +3 -3
- data/lib/proto/version.rb +1 -1
- metadata +2 -2
data/README.md
CHANGED
@@ -6,9 +6,9 @@ It is the evolution of [another project](https://github.com/kcurtin/scrape_sourc

 Proto is meant to be lightweight and flexible, the objects you get back inherit from OpenStruct. New methods can be dynamically added to the objects, you won't ever get method_missing errors, and you can access the data in a bunch of different ways. Check out the documentation for more info: [OpenStruct](http://www.ruby-doc.org/stdlib-1.9.3/libdoc/ostruct/rdoc/OpenStruct.html)

-## Usage
+## Usage

-
+####Scraping a single page

 ```ruby
 proto = Proto::Scraper.new('http://twitter.com/kcurtin')
@@ -20,7 +20,7 @@ proto.inspect
 #=> #<Proto::Scraper:0x007fc6fb852860 @doc=#<Nokogiri::HTML::Document:0x3fe37d0b1634...>
 ```

-
+```.fetch``` method accepts a constant name and a hash as arguments:
 ```ruby
 tweets = proto.fetch('Tweet', {:name => 'strong.fullname',
                               :content => 'p.js-tweet-text',
@@ -36,6 +36,21 @@ tweets.inspect
 #=> [#<Proto::Tweet name="Kevin Curtin", content="@cawebs06 just a tad over my head... You guys are smart :)", created_at="11h">,
 #<Proto::Tweet name="Kevin Curtin", content="@garybernhardt awesome, thanks. any plans to be in nyc soon? @FlatironSchool would love to have you stop by. we love DAS", created_at="12h">...]
 ```
+####Scraping multiple pages using an index page
+
+```ruby
+#index page url
+obj = Proto::Scraper.new('http://jobs.rubynow.com/')
+#selector for the a tags with the links you want to visit
+obj.collect_urls('ul.jobs li h2 a:first')
+#attributes and selectors you want
+jobs = obj.fetch( { :title => 'h2#headline',
+                    :company => 'h2#headline a',
+                    :location => 'h3#location',
+                    :type => 'strong:last',
+                    :description => 'div#info' }
+)
+```

 OpenStruct features:

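The README text above says the returned objects inherit from OpenStruct. A minimal sketch of what that implies, using only the standard library; the `Tweet` class here is a hypothetical stand-in defined for illustration, not the class Proto generates:

```ruby
require 'ostruct'

# Hypothetical stand-in for a scraped record. Proto builds its own classes;
# this only demonstrates the OpenStruct behavior they would inherit.
class Tweet < OpenStruct; end

tweet = Tweet.new(:name => 'Kevin Curtin', :created_at => '11h')

by_method = tweet.name          # reader method
by_key    = tweet[:created_at]  # hash-style access
missing   = tweet.location      # unset attribute returns nil, no NoMethodError

tweet.location = 'NYC'          # new attribute added on the fly
as_hash = tweet.to_h            # plain Hash of all attributes
```

Reading an attribute that was never set returns `nil`, which is convenient for loosely structured scraped data but also means a typo in an attribute name fails silently.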
data/lib/proto/scraper.rb
CHANGED
@@ -7,9 +7,9 @@ module Proto
       @doc = Nokogiri::HTML(open(url))
     end

-    def collect_urls(selector)
+    def collect_urls(base_url=self.url, selector)
       @url_collection = doc.css(selector).map do |link|
-        "#{
+        "#{base_url}#{link['href']}"
       end
     end

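The new `collect_urls` signature puts an optional parameter before a required one, which Ruby allows: a single argument binds to `selector` and `base_url` falls back to its default. The sketch below illustrates that binding with a hypothetical `UrlCollector` class that takes a plain array of hrefs instead of a Nokogiri document; the hrefs and the `example.com` URL are made up for illustration.

```ruby
require 'uri'

# Hypothetical stand-in for Proto::Scraper, reduced to the URL-building step.
class UrlCollector
  attr_reader :url

  def initialize(url)
    @url = url
  end

  # Optional leading parameter, required trailing parameter.
  # With one argument, Ruby assigns it to hrefs and uses the default base_url.
  def collect_urls(base_url = url, hrefs)
    hrefs.map { |href| "#{base_url}#{href}" }
  end
end

collector = UrlCollector.new('http://jobs.rubynow.com')

one_arg  = collector.collect_urls(['/jobs/1', '/jobs/2'])
two_args = collector.collect_urls('http://example.com', ['/a'])

# Plain concatenation doubles the slash when the base ends in "/" and the
# href starts with "/". URI.join from the standard library resolves that.
joined = URI.join('http://jobs.rubynow.com/', '/jobs/1').to_s
```

Whether the real scraper needs `URI.join` depends on how the target site writes its `href` values; it is shown here only as one way to handle a trailing slash on the base URL.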
@@ -27,7 +27,7 @@ module Proto
     private

     def scrape_multiple_pages(attributes)
-      url_collection.
+      url_collection.map do |url|
         gather_data(url, attributes)
       end
     end
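The removed line in this hunk is truncated, so the method it originally called is not visible. As a general illustration of why `map` suits this method, the sketch below contrasts it with `each`, which is only an assumed point of comparison. The `gather_data` defined here is a hypothetical stub, not Proto's implementation, and the URLs are placeholders.

```ruby
# Hypothetical stub: the real gather_data fetches and parses a page.
def gather_data(url, attributes)
  { :url => url, :fields => attributes.keys }
end

url_collection = ['http://example.com/1', 'http://example.com/2']
attributes     = { :title => 'h2#headline' }

# each returns the receiver, so the block results are discarded.
with_each = url_collection.each { |url| gather_data(url, attributes) }

# map returns a new array holding one block result per URL.
with_map = url_collection.map { |url| gather_data(url, attributes) }
```

Because a Ruby method returns the value of its last expression, ending `scrape_multiple_pages` with `map` hands the caller one result per collected URL.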
data/lib/proto/version.rb
CHANGED
metadata
CHANGED
@@ -1,7 +1,7 @@
 --- !ruby/object:Gem::Specification
 name: proto
 version: !ruby/object:Gem::Version
-  version: 0.0.
+  version: 0.0.6
   prerelease:
 platform: ruby
 authors:
@@ -9,7 +9,7 @@ authors:
 autorequire:
 bindir: bin
 cert_chain: []
-date: 2012-
+date: 2012-12-05 00:00:00.000000000 Z
 dependencies:
 - !ruby/object:Gem::Dependency
   name: rspec