sagrone_scraper 0.0.3 → 0.1.0
- checksums.yaml +4 -4
- data/CHANGELOG.md +16 -0
- data/Guardfile +1 -1
- data/README.md +68 -69
- data/lib/sagrone_scraper.rb +2 -33
- data/lib/sagrone_scraper/agent.rb +4 -19
- data/lib/sagrone_scraper/base.rb +63 -0
- data/lib/sagrone_scraper/collection.rb +38 -0
- data/lib/sagrone_scraper/version.rb +1 -1
- data/spec/sagrone_scraper/agent_spec.rb +10 -19
- data/spec/sagrone_scraper/base_spec.rb +162 -0
- data/spec/sagrone_scraper/collection_spec.rb +80 -0
- data/spec/sagrone_scraper_spec.rb +1 -74
- data/spec/support/test_scrapers/twitter_scraper.rb +23 -0
- metadata +10 -7
- data/lib/sagrone_scraper/parser.rb +0 -42
- data/spec/sagrone_scraper/parser_spec.rb +0 -83
- data/spec/support/test_parsers/twitter_parser.rb +0 -17
checksums.yaml
CHANGED
@@ -1,7 +1,7 @@
 ---
 SHA1:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: 7b4420d9fd6c018746bbdb2ca17ea53402ca3857
+  data.tar.gz: 7ff6f5cdb840b214ad7eaf8f7ccb23e2efe9ff24
 SHA512:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: b051e3982a7959f19cb25b29129905cd26e5fbc2529a012a0d8b091c5dfdc78d985c79b8aacb6c8142e2c030aed107d91a69d94aba8f6b2c16c57eaeb0f284cb
+  data.tar.gz: b404f338c87a266e9bcc03705327830be76d73018e5ff789017890cb77d7b0b0157daff1dbace000d542be0fd272a7912a75a018d88154da3e165c25e8e4bea5
data/CHANGELOG.md
CHANGED
@@ -1,5 +1,21 @@
 ### HEAD
 
+### 0.1.0
+
+- simplify library entities:
+  - `SagroneScraper::Parser` is now `SagroneScraper::Base`
+  - `SagroneScraper::Base` is the base class for creating new scrapers
+  - `SagroneScraper.registered_parsers` is now `SagroneScraper.registered_scrapers`
+  - `SagroneScraper.register_parser` is now `SagroneScraper.register_scraper`
+  - `SagroneScraper::Base#parse_page!` renamed to `SagroneScraper::Base#scrape_page!`
+  - `SagroneScraper::Base.can_parse?` renamed to `SagroneScraper::Base.can_scrape?`
+  - `SagroneScraper::Base` _private_ instance methods are not used for extracting data, only _public_ instance methods are.
+- `SagroneScraper::Collection`:
+  - registers newly created scrapers automatically
+  - knows how to scrape a page from a generic URL (if there is a valid scraper for that URL)
+- `SagroneScraper::Base` takes exactly one of the `url` or `page` options
+- `SagroneScraper::Agent` takes only a `url` option
+
 ### 0.0.3
 
 - add `SagroneScraper::Parser.can_parse?(url)` class method, which must be implemented in subclasses
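The API renames in this release can be captured as an old-to-new mapping; a minimal sketch (the `RENAMES` constant is hypothetical, drawn only from the changelog entries above):

```ruby
# Hypothetical old-name => new-name map, drawn from the 0.1.0 changelog.
RENAMES = {
  'SagroneScraper::Parser'            => 'SagroneScraper::Base',
  'SagroneScraper.registered_parsers' => 'SagroneScraper.registered_scrapers',
  'SagroneScraper.register_parser'    => 'SagroneScraper.register_scraper',
  'SagroneScraper::Base#parse_page!'  => 'SagroneScraper::Base#scrape_page!',
  'SagroneScraper::Base.can_parse?'   => 'SagroneScraper::Base.can_scrape?'
}.freeze

# Look up where an old API name went:
RENAMES['SagroneScraper::Parser']
# => "SagroneScraper::Base"
```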
data/Guardfile
CHANGED
@@ -8,7 +8,7 @@ guard :rspec, cmd: "bundle exec rspec" do
   rspec = dsl.rspec
   watch(rspec.spec_files)
   watch(%r{^spec/(.+)_helper\.rb$}) { "spec" }
-  watch(%r{^spec/
+  watch(%r{^spec/support/(.+)$}) { "spec" }
 
   # Library files
   watch(%r{^lib/(.+)\.rb$}) { |m| "spec/#{m[1]}_spec.rb" }
data/README.md
CHANGED
@@ -11,8 +11,12 @@ Simple library to scrap web pages. Bellow you will find information on [how to u
 - [Basic Usage](#basic-usage)
 - [Modules](#modules)
   + [`SagroneScraper::Agent`](#sagronescraperagent)
-  + [`SagroneScraper::
+  + [`SagroneScraper::Base`](#sagronescraperbase)
+    * [Create a scraper class](#create-a-scraper-class)
+    * [Instantiate the scraper](#instantiate-the-scraper)
+    * [Scrape the page](#scrape-the-page)
+    * [Extract the data](#extract-the-data)
+  + [`SagroneScraper::Collection`](#sagronescrapercollection)
 
 ## Installation
 
@@ -30,115 +34,110 @@ Or install it yourself as:
 
 ## Basic Usage
 
-
+In order to _scrape a web page_ you will need to:
 
-
-
-
-
-The agent is responsible for scraping a web page from a URL. Here is how you can create an `agent`:
+1. [create a new scraper class](#create-a-scraper-class) by inheriting from `SagroneScraper::Base`, and
+2. [instantiate it with a `url` or `page`](#instantiate-the-scraper)
+3. then you can use the scraper instance to [scrape the page](#scrape-the-page) and [extract structured data](#extract-the-data)
 
-
+More information at the [`SagroneScraper::Base`](#sagronescraperbase) module.
 
-
-require 'sagrone_scraper'
+## Modules
 
-
-agent.page
-# => Mechanize::Page
+### `SagroneScraper::Agent`
 
-
-# => "Javascript User Group Milano #milanojs"
-```
+The agent is responsible for obtaining a page, `Mechanize::Page`, from a URL. Here is how you can create an `agent`:
 
-
+```ruby
+require 'sagrone_scraper'
 
-
-
+agent = SagroneScraper::Agent.new(url: 'https://twitter.com/Milano_JS')
+agent.page
+# => Mechanize::Page
 
-
-
-
+agent.page.at('.ProfileHeaderCard-bio').text
+# => "Javascript User Group Milano #milanojs"
+```
 
-
-agent.url
-# => "https://twitter.com/Milano_JS"
+### `SagroneScraper::Base`
 
-
-# => "Milan, Italy"
-```
+Here we define a `TwitterScraper`, by inheriting from the `SagroneScraper::Base` class.
 
-
+The _scraper_ is responsible for extracting structured data from a _page_ or a _url_. The _page_ can be obtained by the [_agent_](#sagronescraperagent).
 
-
+_Public_ instance methods will be used to extract data, whereas _private_ instance methods will be ignored (seen as helper methods). Most importantly, the `self.can_scrape?(url)` class method ensures that only a known subset of pages can be scraped for data.
 
-
+#### Create a scraper class
 
 ```ruby
 require 'sagrone_scraper'
 
-
-class TwitterParser < SagroneScraper::Parser
+class TwitterScraper < SagroneScraper::Base
   TWITTER_PROFILE_URL = /^https?:\/\/twitter.com\/(\w)+\/?$/i
 
-  def self.
-    url.match(TWITTER_PROFILE_URL)
+  def self.can_scrape?(url)
+    url.match(TWITTER_PROFILE_URL) ? true : false
   end
 
+  # Public instance methods are used for data extraction.
+
   def bio
-
+    text_at('.ProfileHeaderCard-bio')
   end
 
   def location
-
+    text_at('.ProfileHeaderCard-locationText')
+  end
+
+  private
+
+  # Private instance methods are not used for data extraction.
+
+  def text_at(selector)
+    page.at(selector).text if page.at(selector)
   end
 end
+```
+
+#### Instantiate the scraper
+
+```ruby
+# Instantiate the scraper with a "url".
+scraper = TwitterScraper.new(url: 'https://twitter.com/Milano_JS')
 
-#
+# Instantiate the scraper with a "page" (Mechanize::Page).
 agent = SagroneScraper::Agent.new(url: 'https://twitter.com/Milano_JS')
+scraper = TwitterScraper.new(page: agent.page)
+```
+
+#### Scrape the page
+
+```ruby
+scraper.scrape_page!
+```
 
-
-parser = TwitterParser.new(page: agent.page)
+#### Extract the data
 
-
-
-parser.attributes
+```ruby
+scraper.attributes
 # => {bio: "Javascript User Group Milano #milanojs", location: "Milan, Italy"}
 ```
 
-
+### `SagroneScraper::Collection`
 
 This is the simplest way to scrape a web page:
 
 ```ruby
 require 'sagrone_scraper'
 
-# 1)
-class TwitterParser < SagroneScraper::Parser
-  TWITTER_PROFILE_URL = /^https?:\/\/twitter.com\/(\w)+\/?$/i
-
-  def self.can_parse?(url)
-    url.match(TWITTER_PROFILE_URL)
-  end
-
-  def bio
-    page.at('.ProfileHeaderCard-bio').text
-  end
-
-  def location
-    page.at('.ProfileHeaderCard-locationText').text
-  end
-end
-
-# 2) We register the parser.
-SagroneScraper.register_parser('TwitterParser')
+# 1) Define a scraper. For example, the TwitterScraper above.
 
-#
-SagroneScraper.
-# => ['
+# 2) Newly created scrapers will be registered.
+SagroneScraper::Collection.registered_scrapers
+# => ['TwitterScraper']
 
-#
-SagroneScraper.scrape(url: 'https://twitter.com/Milano_JS')
+# 3) Here we use the collection to scrape data at a URL.
+SagroneScraper::Collection.scrape(url: 'https://twitter.com/Milano_JS')
 # => {bio: "Javascript User Group Milano #milanojs", location: "Milan, Italy"}
 ```
data/lib/sagrone_scraper.rb
CHANGED
@@ -1,41 +1,10 @@
 require "sagrone_scraper/version"
 require "sagrone_scraper/agent"
-require "sagrone_scraper/
+require "sagrone_scraper/base"
+require "sagrone_scraper/collection"
 
 module SagroneScraper
-  Error = Class.new(RuntimeError)
-
   def self.version
     VERSION
   end
-
-  def self.registered_parsers
-    @registered_parsers ||= []
-  end
-
-  def self.register_parser(name)
-    return if registered_parsers.include?(name)
-
-    parser_class = Object.const_get(name)
-    raise Error.new("Expected parser to be a SagroneScraper::Parser.") unless parser_class.ancestors.include?(SagroneScraper::Parser)
-
-    registered_parsers.push(name)
-  end
-
-  def self.scrape(options)
-    url = options.fetch(:url) do
-      raise Error.new('Option "url" must be provided.')
-    end
-
-    parser_class = registered_parsers
-      .map { |parser_name| Object.const_get(parser_name) }
-      .find { |parser_class| parser_class.can_parse?(url) }
-
-    raise Error.new("No registed parser can parse URL #{url}") unless parser_class
-
-    agent = SagroneScraper::Agent.new(url: url)
-    parser = parser_class.new(page: agent.page)
-    parser.parse_page!
-    parser.attributes
-  end
 end
data/lib/sagrone_scraper/agent.rb
CHANGED
@@ -9,12 +9,10 @@ module SagroneScraper
     attr_reader :url, :page
 
     def initialize(options = {})
-
-
-
-
-      @url ||= page_url
-      @page ||= http_client.get(url)
+      @url = options.fetch(:url) do
+        raise Error.new('Option "url" must be provided')
+      end
+      @page = http_client.get(url)
     rescue StandardError => error
       raise Error.new(error.message)
     end
@@ -29,18 +27,5 @@ module SagroneScraper
       agent.max_history = 0
     end
   end
-
-    private
-
-    def page_url
-      @page.uri.to_s
-    end
-
-    def exactly_one_of(options)
-      url_present = !!options[:url]
-      page_present = !!options[:page]
-
-      (url_present && !page_present) || (!url_present && page_present)
-    end
   end
 end
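The new `initialize` relies on `Hash#fetch` with a block: the block runs only when the key is absent, which turns a "required option" check into one expression. A standalone sketch of the pattern (the `require_url` helper and `Error` class are hypothetical stand-ins, no gem needed):

```ruby
# Hypothetical stand-ins mirroring Agent#initialize's options.fetch usage.
Error = Class.new(RuntimeError)

def require_url(options)
  # The block is evaluated only when :url is missing from the hash.
  options.fetch(:url) do
    raise Error, 'Option "url" must be provided'
  end
end

require_url(url: 'http://example.com')  # => "http://example.com"
# require_url({})                       # raises Error
```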
data/lib/sagrone_scraper/base.rb
ADDED
@@ -0,0 +1,63 @@
+require 'mechanize'
+require 'sagrone_scraper'
+
+module SagroneScraper
+  class Base
+    Error = Class.new(RuntimeError)
+
+    attr_reader :page, :url, :attributes
+
+    def initialize(options = {})
+      raise Error.new('Exactly one option must be provided: "url" or "page"') unless exactly_one_of(options)
+
+      @url, @page = options[:url], options[:page]
+
+      @url ||= page_url
+      @page ||= Agent.new(url: url).page
+      @attributes = {}
+    end
+
+    def scrape_page!
+      return unless self.class.can_scrape?(page_url)
+
+      self.class.method_names.each do |name|
+        attributes[name] = send(name)
+      end
+      nil
+    end
+
+    def self.can_scrape?(url)
+      class_with_method = "#{self}.can_scrape?(url)"
+      raise NotImplementedError.new("Expected #{class_with_method} to be implemented.")
+    end
+
+    def self.should_ignore_method?(name)
+      private_method_defined?(name)
+    end
+
+    private
+
+    def exactly_one_of(options)
+      url_present = !!options[:url]
+      page_present = !!options[:page]
+
+      (url_present && !page_present) || (!url_present && page_present)
+    end
+
+    def page_url
+      @page.uri.to_s
+    end
+
+    def self.method_names
+      @method_names ||= []
+    end
+
+    def self.method_added(name)
+      method_names.push(name) unless should_ignore_method?(name)
+    end
+
+    def self.inherited(klass)
+      SagroneScraper::Collection.register_scraper(klass.name)
+    end
+  end
+end
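The hook pair at the bottom of `Base` is the heart of the design: Ruby calls `method_added` each time an instance method is defined, so public methods self-register as extraction attributes while private helpers are skipped via `private_method_defined?`. A standalone sketch of that mechanism (the `ExtractorBase`/`ProfileExtractor` classes are hypothetical, no Mechanize needed):

```ruby
class ExtractorBase
  def self.method_names
    @method_names ||= []
  end

  def self.should_ignore_method?(name)
    private_method_defined?(name)
  end

  # Ruby invokes this hook after each instance method definition;
  # methods defined after `private` are already private here, so
  # they are filtered out of the extraction list.
  def self.method_added(name)
    method_names.push(name) unless should_ignore_method?(name)
  end
end

class ProfileExtractor < ExtractorBase
  def bio
    'Javascript User Group Milano #milanojs'
  end

  def location
    'Milan, Italy'
  end

  private

  def text_at(selector)
    nil
  end
end

ProfileExtractor.method_names
# => [:bio, :location]
```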
data/lib/sagrone_scraper/collection.rb
ADDED
@@ -0,0 +1,38 @@
+require 'sagrone_scraper'
+
+module SagroneScraper
+  module Collection
+    Error = Class.new(RuntimeError)
+
+    def self.registered_scrapers
+      @registered_scrapers ||= []
+    end
+
+    def self.register_scraper(name)
+      return if registered_scrapers.include?(name)
+
+      scraper_class = Object.const_get(name)
+      raise Error.new("Expected scraper to be a SagroneScraper::Base.") unless scraper_class.ancestors.include?(SagroneScraper::Base)
+
+      registered_scrapers.push(name)
+      nil
+    end
+
+    def self.scrape(options)
+      url = options.fetch(:url) do
+        raise Error.new('Option "url" must be provided.')
+      end
+
+      scraper_class = registered_scrapers
+        .map { |scraper_name| Object.const_get(scraper_name) }
+        .find { |a_scraper_class| a_scraper_class.can_scrape?(url) }
+
+      raise Error.new("No registed scraper can scrape URL #{url}") unless scraper_class
+
+      agent = SagroneScraper::Agent.new(url: url)
+      scraper = scraper_class.new(page: agent.page)
+      scraper.scrape_page!
+      scraper.attributes
+    end
+  end
+end
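`Collection.scrape` resolves a URL by mapping the registered class names back to constants and taking the first class whose `can_scrape?` accepts the URL. A standalone sketch of that dispatch (the `TwitterLike`/`GithubLike` classes and `REGISTRY` are hypothetical, no gem or network needed):

```ruby
class TwitterLike
  def self.can_scrape?(url)
    url.start_with?('https://twitter.com/')
  end
end

class GithubLike
  def self.can_scrape?(url)
    url.start_with?('https://github.com/')
  end
end

# Scrapers are registered by name, as strings.
REGISTRY = %w(TwitterLike GithubLike)

def find_scraper(url)
  REGISTRY
    .map  { |name| Object.const_get(name) }   # name -> class
    .find { |klass| klass.can_scrape?(url) }  # first match wins
end

find_scraper('https://github.com/rails')  # => GithubLike
find_scraper('https://example.com')       # => nil
```

Registering names rather than class objects is what lets the later `const_get` raise a clear `NameError` for unknown scrapers.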
data/spec/sagrone_scraper/agent_spec.rb
CHANGED
@@ -12,7 +12,7 @@ RSpec.describe SagroneScraper::Agent do
     it { expect(described_class::AGENT_ALIASES).to eq(user_agent_aliases) }
   end
 
-  describe '.http_client' do
+  describe 'self.http_client' do
     subject { described_class.http_client }
 
     it { should be_a(Mechanize) }
@@ -21,7 +21,7 @@ RSpec.describe SagroneScraper::Agent do
   end
 
   describe '#initialize' do
-    describe 'should require
+    describe 'should require `url` option' do
      before do
        stub_request_for('http://example.com', 'www.example.com')
      end
@@ -30,30 +30,21 @@ RSpec.describe SagroneScraper::Agent do
        expect {
          described_class.new
        }.to raise_error(SagroneScraper::Agent::Error,
-          '
-      end
-
-      it 'when both options are present' do
-        page = Mechanize.new.get('http://example.com')
-
-        expect {
-          described_class.new(url: 'http://example.com', page: page)
-        }.to raise_error(SagroneScraper::Agent::Error,
-          'Exactly one option must be provided: "url" or "page"')
+          'Option "url" must be provided')
      end
    end
 
-    describe 'with
+    describe 'with url option' do
      before do
        stub_request_for('http://example.com', 'www.example.com')
      end
 
-      let(:
-      let(:agent) { described_class.new(page: page) }
+      let(:agent) { described_class.new(url: 'http://example.com') }
 
      it { expect { agent }.to_not raise_error }
      it { expect(agent.page).to be }
-      it { expect(agent.
+      it { expect(agent.page).to be_a(Mechanize::Page) }
+      it { expect(agent.url).to eq 'http://example.com' }
    end
 
    describe 'with invalid URL' do
@@ -62,7 +53,7 @@ RSpec.describe SagroneScraper::Agent do
      it 'should require URL is absolute' do
        @invalid_url = 'not-a-url'
 
-        expect { agent }.to raise_error(
+        expect { agent }.to raise_error(described_class::Error,
          'absolute URL needed (not not-a-url)')
      end
 
@@ -70,7 +61,7 @@ RSpec.describe SagroneScraper::Agent do
        @invalid_url = 'http://'
 
        webmock_allow do
-          expect { agent }.to raise_error(
+          expect { agent }.to raise_error(described_class::Error,
            /bad URI\(absolute but no path\)/)
        end
      end
@@ -79,7 +70,7 @@ RSpec.describe SagroneScraper::Agent do
        @invalid_url = 'http://example'
 
        webmock_allow do
-          expect { agent }.to raise_error(
+          expect { agent }.to raise_error(described_class::Error,
            /getaddrinfo/)
        end
      end
data/spec/sagrone_scraper/base_spec.rb
ADDED
@@ -0,0 +1,162 @@
+require 'spec_helper'
+require 'sagrone_scraper/base'
+
+RSpec.describe SagroneScraper::Base do
+  describe '#initialize' do
+    before do
+      stub_request_for('http://example.com', 'www.example.com')
+    end
+
+    describe 'should require exactly one of `url` or `page` option' do
+      it 'when options is empty' do
+        expect {
+          described_class.new
+        }.to raise_error(described_class::Error,
+          'Exactly one option must be provided: "url" or "page"')
+      end
+
+      it 'when both options are present' do
+        page = Mechanize.new.get('http://example.com')
+
+        expect {
+          described_class.new(url: 'http://example.com', page: page)
+        }.to raise_error(described_class::Error,
+          'Exactly one option must be provided: "url" or "page"')
+      end
+    end
+
+    describe 'with page option' do
+      let(:page) { Mechanize.new.get('http://example.com') }
+      let(:agent) { described_class.new(page: page) }
+
+      it { expect { agent }.to_not raise_error }
+      it { expect(agent.page).to be }
+      it { expect(agent.url).to eq 'http://example.com/' }
+    end
+
+    describe 'with url option' do
+      let(:agent) { described_class.new(url: 'http://example.com') }
+
+      it { expect { agent }.to_not raise_error }
+      it { expect(agent.page).to be }
+      it { expect(agent.url).to eq 'http://example.com' }
+    end
+  end
+
+  describe 'instance methods' do
+    let(:page) { Mechanize::Page.new }
+    let(:scraper) { described_class.new(page: page) }
+
+    describe '#page' do
+      it { expect(scraper.page).to be_a(Mechanize::Page) }
+    end
+
+    describe '#url' do
+      it { expect(scraper.url).to eq page.uri.to_s }
+    end
+
+    describe '#scrape_page!' do
+      it do
+        expect {
+          scraper.scrape_page!
+        }.to raise_error(NotImplementedError, "Expected #{described_class}.can_scrape?(url) to be implemented.")
+      end
+    end
+
+    describe '#attributes' do
+      it { expect(scraper.attributes).to be_empty }
+    end
+  end
+
+  describe 'class methods' do
+    describe 'self.can_scrape?(url)' do
+      it 'must be implemented in subclasses' do
+        expect {
+          described_class.can_scrape?('url')
+        }.to raise_error(NotImplementedError, "Expected #{described_class}.can_scrape?(url) to be implemented.")
+      end
+    end
+  end
+
+  describe 'example TwitterScraper' do
+    before do
+      stub_request_for('https://twitter.com/Milano_JS', 'twitter.com:Milano_JS')
+    end
+
+    let(:page) { Mechanize.new.get('https://twitter.com/Milano_JS') }
+    let(:twitter_scraper) { TwitterScraper.new(page: page) }
+    let(:expected_attributes) do
+      {
+        bio: "Javascript User Group Milano #milanojs",
+        location: "Milan, Italy"
+      }
+    end
+
+    describe '#initialize' do
+      it 'should be a SagroneScraper::Base' do
+        expect(twitter_scraper).to be_a(described_class)
+      end
+    end
+
+    describe 'instance methods' do
+      describe '#scrape_page!' do
+        it 'should succeed' do
+          expect { twitter_scraper.scrape_page! }.to_not raise_error
+        end
+
+        it 'should have attributes present after scraping' do
+          twitter_scraper.scrape_page!
+
+          expect(twitter_scraper.attributes).to_not be_empty
+          expect(twitter_scraper.attributes).to eq expected_attributes
+        end
+
+        it 'should have correct attributes even if scraping is done multiple times' do
+          twitter_scraper.scrape_page!
+          twitter_scraper.scrape_page!
+          twitter_scraper.scrape_page!
+
+          expect(twitter_scraper.attributes).to_not be_empty
+          expect(twitter_scraper.attributes).to eq expected_attributes
+        end
+      end
+    end
+
+    describe 'class methods' do
+      describe 'self.can_scrape?(url)' do
+        it 'should be true for scrapable URLs' do
+          can_scrape = TwitterScraper.can_scrape?('https://twitter.com/Milano_JS')
+
+          expect(can_scrape).to eq(true)
+        end
+
+        it 'should be false for unknown URLs' do
+          can_scrape = TwitterScraper.can_scrape?('https://www.facebook.com/milanojavascript')
+
+          expect(can_scrape).to eq(false)
+        end
+      end
+
+      describe 'self.should_ignore_method?(name)' do
+        let(:private_methods) { %w(text_at) }
+        let(:public_methods) { %w(bio location) }
+
+        it 'ignores private methods' do
+          private_methods.each do |private_method|
+            ignored = TwitterScraper.should_ignore_method?(private_method)
+
+            expect(ignored).to eq(true)
+          end
+        end
+
+        it 'allows public methods' do
+          public_methods.each do |public_method|
+            ignored = TwitterScraper.should_ignore_method?(public_method)
+
+            expect(ignored).to eq(false)
+          end
+        end
+      end
+    end
+  end
+end
data/spec/sagrone_scraper/collection_spec.rb
ADDED
@@ -0,0 +1,80 @@
+require 'spec_helper'
+require 'sagrone_scraper/collection'
+
+RSpec.describe SagroneScraper::Collection do
+  context 'scrapers registered' do
+    before do
+      described_class.registered_scrapers.clear
+    end
+
+    describe 'self.registered_scrapers' do
+      it { expect(described_class.registered_scrapers).to be_empty }
+      it { expect(described_class.registered_scrapers).to be_a(Array) }
+    end
+
+    describe 'self.register_scraper(name)' do
+      it 'should add a new scraper class to registered scrapers automatically' do
+        class ScraperOne < SagroneScraper::Base ; end
+        class ScraperTwo < SagroneScraper::Base ; end
+
+        expect(described_class.registered_scrapers).to include('ScraperOne')
+        expect(described_class.registered_scrapers).to include('ScraperTwo')
+        expect(described_class.registered_scrapers.size).to eq(2)
+      end
+
+      it 'should check scraper name is an existing constant' do
+        expect {
+          described_class.register_scraper('Unknown')
+        }.to raise_error(NameError, 'uninitialized constant Unknown')
+      end
+
+      it 'should check scraper class inherits from SagroneScraper::Base' do
+        NotScraper = Class.new
+
+        expect {
+          described_class.register_scraper('NotScraper')
+        }.to raise_error(described_class::Error, 'Expected scraper to be a SagroneScraper::Base.')
+      end
+
+      it 'should register multiple scrapers only once' do
+        class TestScraper < SagroneScraper::Base ; end
+
+        described_class.register_scraper('TestScraper')
+        described_class.register_scraper('TestScraper')
+
+        expect(described_class.registered_scrapers).to include('TestScraper')
+        expect(described_class.registered_scrapers.size).to eq 1
+      end
+    end
+  end
+
+  describe 'self.scrape' do
+    before do
+      described_class.registered_scrapers.clear
+      described_class.register_scraper('TwitterScraper')
+
+      stub_request_for('https://twitter.com/Milano_JS', 'twitter.com:Milano_JS')
+    end
+
+    it 'should require `url` option' do
+      expect {
+        described_class.scrape({})
+      }.to raise_error(described_class::Error, 'Option "url" must be provided.')
+    end
+
+    it 'should scrape URL if registered scraper knows how to scrape it' do
+      expected_attributes = {
+        bio: "Javascript User Group Milano #milanojs",
+        location: "Milan, Italy"
+      }
+
+      expect(described_class.scrape(url: 'https://twitter.com/Milano_JS')).to eq(expected_attributes)
+    end
+
+    it 'should raise an error if no registered scraper can scrape the URL' do
+      expect {
+        described_class.scrape(url: 'https://twitter.com/Milano_JS/media')
+      }.to raise_error(described_class::Error, "No registed scraper can scrape URL https://twitter.com/Milano_JS/media")
+    end
+  end
+end
data/spec/sagrone_scraper_spec.rb
CHANGED
@@ -2,80 +2,7 @@ require 'spec_helper'
 require 'sagrone_scraper'
 
 RSpec.describe SagroneScraper do
-  describe '.version' do
+  describe 'self.version' do
     it { expect(SagroneScraper.version).to be_a(String) }
   end
-
-  context 'parsers registered' do
-    before do
-      described_class.registered_parsers.clear
-    end
-
-    describe '.registered_parsers' do
-      it { expect(described_class.registered_parsers).to be_empty }
-      it { expect(described_class.registered_parsers).to be_a(Array) }
-    end
-
-    describe '.register_parser(name)' do
-      TestParser = Class.new(SagroneScraper::Parser)
-      NotParser = Class.new
-
-      it 'should check parser name is an existing constant' do
-        expect {
-          described_class.register_parser('Unknown')
-        }.to raise_error(NameError, 'uninitialized constant Unknown')
-      end
-
-      it 'should check parser class inherits from SagroneScraper::Parser' do
-        expect {
-          described_class.register_parser('NotParser')
-        }.to raise_error(SagroneScraper::Error, 'Expected parser to be a SagroneScraper::Parser.')
-      end
-
-      it 'after adding a "parser" should have it registered' do
-        described_class.register_parser('TestParser')
-
-        expect(described_class.registered_parsers).to include('TestParser')
-        expect(described_class.registered_parsers.size).to eq 1
-      end
-
-      it 'adding same "parser" multiple times should register it once' do
-        described_class.register_parser('TestParser')
-        described_class.register_parser('TestParser')
-
-        expect(described_class.registered_parsers).to include('TestParser')
-        expect(described_class.registered_parsers.size).to eq 1
-      end
-    end
-  end
-
-  describe '.scrape' do
-    before do
-      SagroneScraper.registered_parsers.clear
-      SagroneScraper.register_parser('TwitterParser')
-
-      stub_request_for('https://twitter.com/Milano_JS', 'twitter.com:Milano_JS')
-    end
-
-    it 'should `url` option' do
-      expect {
-        described_class.scrape({})
-      }.to raise_error(SagroneScraper::Error, 'Option "url" must be provided.')
-    end
-
-    it 'should scrape URL if registered parser knows how to parse it' do
-      expected_attributes = {
-        bio: "Javascript User Group Milano #milanojs",
-        location: "Milan, Italy"
-      }
-
-      expect(described_class.scrape(url: 'https://twitter.com/Milano_JS')).to eq(expected_attributes)
-    end
-
-    it 'should return raise error if no registered paser can parse the URL' do
-      expect {
-        described_class.scrape(url: 'https://twitter.com/Milano_JS/media')
-      }.to raise_error(SagroneScraper::Error, "No registed parser can parse URL https://twitter.com/Milano_JS/media")
-    end
-  end
 end
data/spec/support/test_scrapers/twitter_scraper.rb
ADDED
@@ -0,0 +1,23 @@
+require 'sagrone_scraper/base'
+
+class TwitterScraper < SagroneScraper::Base
+  TWITTER_PROFILE_URL = /^https?:\/\/twitter.com\/(\w)+\/?$/i
+
+  def self.can_scrape?(url)
+    url.match(TWITTER_PROFILE_URL) ? true : false
+  end
+
+  def bio
+    text_at('.ProfileHeaderCard-bio')
+  end
+
+  def location
+    text_at('.ProfileHeaderCard-locationText')
+  end
+
+  private
+
+  def text_at(selector)
+    page.at(selector).text if page.at(selector)
+  end
+end
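The `text_at` helper guards against selectors that match nothing: `page.at` returns `nil` on a miss, so calling `.text` unguarded would raise `NoMethodError`. A standalone sketch with a tiny stand-in page object (the `FakePage`/`Node` classes are hypothetical, no Mechanize needed):

```ruby
# Hypothetical stand-ins for Mechanize's page and element objects.
Node = Struct.new(:text)

class FakePage
  def initialize(nodes)
    @nodes = nodes
  end

  # Returns the node for a selector, or nil when nothing matches.
  def at(selector)
    @nodes[selector]
  end
end

# Nil-safe text extraction, as in TwitterScraper#text_at.
def text_at(page, selector)
  page.at(selector).text if page.at(selector)
end

page = FakePage.new('.bio' => Node.new('Javascript User Group Milano'))
text_at(page, '.bio')      # => "Javascript User Group Milano"
text_at(page, '.missing')  # => nil
```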
metadata
CHANGED
@@ -1,14 +1,14 @@
 --- !ruby/object:Gem::Specification
 name: sagrone_scraper
 version: !ruby/object:Gem::Version
-  version: 0.0
+  version: 0.1.0
 platform: ruby
 authors:
 - Marius Colacioiu
 autorequire:
 bindir: bin
 cert_chain: []
-date: 2015-
+date: 2015-04-09 00:00:00.000000000 Z
 dependencies:
 - !ruby/object:Gem::Dependency
   name: mechanize
@@ -113,17 +113,19 @@ files:
 - Rakefile
 - lib/sagrone_scraper.rb
 - lib/sagrone_scraper/agent.rb
-- lib/sagrone_scraper/
+- lib/sagrone_scraper/base.rb
+- lib/sagrone_scraper/collection.rb
 - lib/sagrone_scraper/version.rb
 - sagrone_scraper.gemspec
 - spec/sagrone_scraper/agent_spec.rb
-- spec/sagrone_scraper/
+- spec/sagrone_scraper/base_spec.rb
+- spec/sagrone_scraper/collection_spec.rb
 - spec/sagrone_scraper_spec.rb
 - spec/spec_helper.rb
 - spec/stub_helper.rb
-- spec/support/test_parsers/twitter_parser.rb
 - spec/support/test_responses/twitter.com:Milano_JS
 - spec/support/test_responses/www.example.com
+- spec/support/test_scrapers/twitter_scraper.rb
 homepage: ''
 licenses:
 - Apache License 2.0
@@ -150,10 +152,11 @@ specification_version: 4
 summary: Sagrone Ruby Scraper.
 test_files:
 - spec/sagrone_scraper/agent_spec.rb
-- spec/sagrone_scraper/
+- spec/sagrone_scraper/base_spec.rb
+- spec/sagrone_scraper/collection_spec.rb
 - spec/sagrone_scraper_spec.rb
 - spec/spec_helper.rb
 - spec/stub_helper.rb
-- spec/support/test_parsers/twitter_parser.rb
 - spec/support/test_responses/twitter.com:Milano_JS
 - spec/support/test_responses/www.example.com
+- spec/support/test_scrapers/twitter_scraper.rb
data/lib/sagrone_scraper/parser.rb
DELETED
@@ -1,42 +0,0 @@
-require 'mechanize'
-
-module SagroneScraper
-  class Parser
-    Error = Class.new(RuntimeError)
-
-    attr_reader :page, :page_url, :attributes
-
-    def initialize(options = {})
-      @page = options.fetch(:page) do
-        raise Error.new('Option "page" must be provided.')
-      end
-      @page_url = @page.uri.to_s
-      @attributes = {}
-    end
-
-    def parse_page!
-      return unless self.class.can_parse?(page_url)
-
-      self.class.method_names.each do |name|
-        attributes[name] = send(name)
-      end
-      nil
-    end
-
-    def self.can_parse?(url)
-      class_with_method = "#{self}.can_parse?(url)"
-      raise NotImplementedError.new("Expected #{class_with_method} to be implemented.")
-    end
-
-    private
-
-    def self.method_names
-      @method_names ||= []
-    end
-
-    def self.method_added(name)
-      puts "added #{name} to #{self}"
-      method_names.push(name)
-    end
-  end
-end
data/spec/sagrone_scraper/parser_spec.rb
DELETED
@@ -1,83 +0,0 @@
-require 'spec_helper'
-require 'sagrone_scraper/parser'
-
-RSpec.describe SagroneScraper::Parser do
-  describe '#initialize' do
-    it 'requires a "page" option' do
-      expect {
-        described_class.new
-      }.to raise_error(SagroneScraper::Parser::Error, 'Option "page" must be provided.')
-    end
-  end
-
-  describe 'instance methods' do
-    let(:page) { Mechanize::Page.new }
-    let(:parser) { described_class.new(page: page) }
-
-    describe '#page' do
-      it { expect(parser.page).to be_a(Mechanize::Page) }
-    end
-
-    describe '#page_url' do
-      it { expect(parser.page_url).to be }
-      it { expect(parser.page_url).to eq page.uri.to_s }
-    end
-
-    describe '#parse_page!' do
-      it do
-        expect {
-          parser.parse_page!
-        }.to raise_error(NotImplementedError, "Expected #{described_class}.can_parse?(url) to be implemented.")
-      end
-    end
-
-    describe '#attributes' do
-      it { expect(parser.attributes).to be_empty }
-    end
-  end
-
-  describe 'class methods' do
-    describe '.can_parse?(url)' do
-      it do
-        expect {
-          described_class.can_parse?('url')
-        }.to raise_error(NotImplementedError, "Expected #{described_class}.can_parse?(url) to be implemented.")
-      end
-    end
-  end
-
-  describe 'create custom TwitterParser from SagroneScraper::Parser' do
-    before do
-      stub_request_for('https://twitter.com/Milano_JS', 'twitter.com:Milano_JS')
-    end
-
-    let(:page) { Mechanize.new.get('https://twitter.com/Milano_JS') }
-    let(:twitter_parser) { TwitterParser.new(page: page) }
-    let(:expected_attributes) do
-      {
-        bio: "Javascript User Group Milano #milanojs",
-        location: "Milan, Italy"
-      }
-    end
-
-    describe 'should be able to parse page without errors' do
-      it { expect { twitter_parser.parse_page! }.to_not raise_error }
-    end
-
-    it 'should have attributes present after parsing' do
-      twitter_parser.parse_page!
-
-      expect(twitter_parser.attributes).to_not be_empty
-      expect(twitter_parser.attributes).to eq expected_attributes
-    end
-
-    it 'should have correct attributes event if parsing is done multiple times' do
-      twitter_parser.parse_page!
-      twitter_parser.parse_page!
-      twitter_parser.parse_page!
-
-      expect(twitter_parser.attributes).to_not be_empty
-      expect(twitter_parser.attributes).to eq expected_attributes
-    end
-  end
-end
data/spec/support/test_parsers/twitter_parser.rb
DELETED
@@ -1,17 +0,0 @@
-require 'sagrone_scraper/parser'
-
-class TwitterParser < SagroneScraper::Parser
-  TWITTER_PROFILE_URL = /^https?:\/\/twitter.com\/(\w)+\/?$/i
-
-  def self.can_parse?(url)
-    url.match(TWITTER_PROFILE_URL)
-  end
-
-  def bio
-    page.at('.ProfileHeaderCard-bio').text
-  end
-
-  def location
-    page.at('.ProfileHeaderCard-locationText').text
-  end
-end