proto 0.0.5 → 0.0.6
- data/README.md +18 -3
- data/lib/proto/scraper.rb +3 -3
- data/lib/proto/version.rb +1 -1
- metadata +2 -2
data/README.md
CHANGED
@@ -6,9 +6,9 @@ It is the evolution of [another project](https://github.com/kcurtin/scrape_sourc
 
 Proto is meant to be lightweight and flexible, the objects you get back inherit from OpenStruct. New methods can be dynamically added to the objects, you won't ever get method_missing errors, and you can access the data in a bunch of different ways. Check out the documentation for more info: [OpenStruct](http://www.ruby-doc.org/stdlib-1.9.3/libdoc/ostruct/rdoc/OpenStruct.html)
 
-## Usage
+## Usage
 
-
+####Scraping a single page
 
 ```ruby
 proto = Proto::Scraper.new('http://twitter.com/kcurtin')
@@ -20,7 +20,7 @@ proto.inspect
 #=> #<Proto::Scraper:0x007fc6fb852860 @doc=#<Nokogiri::HTML::Document:0x3fe37d0b1634...>
 ```
 
-
+```.fetch``` method accepts a constant name and a hash as arguments:
 ```ruby
 tweets = proto.fetch('Tweet', {:name => 'strong.fullname',
 :content => 'p.js-tweet-text',
@@ -36,6 +36,21 @@ tweets.inspect
 #=> [#<Proto::Tweet name="Kevin Curtin", content="@cawebs06 just a tad over my head... You guys are smart :)", created_at="11h">,
 #<Proto::Tweet name="Kevin Curtin", content="@garybernhardt awesome, thanks. any plans to be in nyc soon? @FlatironSchool would love to have you stop by. we love DAS", created_at="12h">...]
 ```
+####Scraping multiple pages using an index page
+
+```ruby
+#index page url
+obj = Proto::Scraper.new('http://jobs.rubynow.com/')
+#selector for the a tags with the links you want to visit
+obj.collect_urls('ul.jobs li h2 a:first')
+#attributes and selectors you want
+jobs = obj.fetch( { :title => 'h2#headline',
+:company => 'h2#headline a',
+:location => 'h3#location',
+:type => 'strong:last',
+:description => 'div#info' }
+)
+```
 
 OpenStruct features:
 
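Note on the README text in the diff above, which says the returned objects inherit from OpenStruct: a minimal sketch of what that implies, using only Ruby's standard `ostruct` library. The `Tweet` class here is a hypothetical stand-in, not the class Proto itself builds, and the attribute values are illustrative.

```ruby
require 'ostruct'

# Hypothetical stand-in for a class that inherits from OpenStruct.
class Tweet < OpenStruct; end

tweet = Tweet.new(:name => 'Kevin Curtin', :created_at => '11h')

by_method = tweet.name          # reader-style access
by_key    = tweet[:created_at]  # hash-style access
missing   = tweet.retweets      # unknown attribute: nil, no NoMethodError

tweet.retweets = 3              # a new attribute can be added at any time
as_hash = tweet.to_h            # plain Hash of all attributes set so far
```

Unknown attributes come back as `nil` rather than raising, which appears to be what the README means by never getting method_missing errors.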
data/lib/proto/scraper.rb
CHANGED
@@ -7,9 +7,9 @@ module Proto
 @doc = Nokogiri::HTML(open(url))
 end
 
-def collect_urls(selector)
+def collect_urls(base_url=self.url, selector)
 @url_collection = doc.css(selector).map do |link|
-"#{
+"#{base_url}#{link['href']}"
 end
 end
 
@@ -27,7 +27,7 @@ module Proto
 private
 
 def scrape_multiple_pages(attributes)
-url_collection.
+url_collection.map do |url|
 gather_data(url, attributes)
 end
 end
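Note on the 0.0.6 signature `collect_urls(base_url=self.url, selector)` in the diff above: it places an optional parameter before a required one. Ruby (1.9 and later) fills required parameters first, so a call with a single argument still binds that argument to `selector`. The sketch below illustrates only that binding. `SketchScraper` and its `hrefs` array are hypothetical stand-ins so the example runs without Nokogiri or network access, and it assumes the real class exposes a `url` reader, as the default expression suggests.

```ruby
# Hypothetical stand-in for Proto::Scraper; not the gem's implementation.
class SketchScraper
  attr_reader :url

  def initialize(url, hrefs)
    @url = url
    @hrefs = hrefs # stands in for the href values doc.css(selector) would yield
  end

  # Same parameter shape as the 0.0.6 method: optional first, required last.
  # The selector is accepted but unused here, since there is no document to query.
  def collect_urls(base_url = self.url, selector)
    @hrefs.map { |href| "#{base_url}#{href}" }
  end
end

scraper = SketchScraper.new('http://example.com', ['/jobs/1', '/jobs/2'])

# One argument: it is taken as the selector, base_url falls back to scraper.url.
default_base = scraper.collect_urls('ul.jobs li h2 a:first')

# Two arguments: the first overrides the base URL.
custom_base = scraper.collect_urls('http://other.example', 'ul.jobs li h2 a:first')
```

The URLs are built by plain string interpolation, so a base URL ending in `/` joined with an href starting with `/` would produce a double slash.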
data/lib/proto/version.rb
CHANGED
metadata
CHANGED
@@ -1,7 +1,7 @@
 --- !ruby/object:Gem::Specification
 name: proto
 version: !ruby/object:Gem::Version
-version: 0.0.
+version: 0.0.6
 prerelease:
 platform: ruby
 authors:
@@ -9,7 +9,7 @@ authors:
 autorequire:
 bindir: bin
 cert_chain: []
-date: 2012-
+date: 2012-12-05 00:00:00.000000000 Z
 dependencies:
 - !ruby/object:Gem::Dependency
 name: rspec