sagrone_scraper 0.0.3 → 0.1.0

checksums.yaml CHANGED
@@ -1,7 +1,7 @@
 ---
 SHA1:
-  metadata.gz: 80b3c30080aba0c8b1da8cfdcdefbb8e6ef527e1
-  data.tar.gz: c9992757e44377ed3081089f0348cdf1535e8a8e
+  metadata.gz: 7b4420d9fd6c018746bbdb2ca17ea53402ca3857
+  data.tar.gz: 7ff6f5cdb840b214ad7eaf8f7ccb23e2efe9ff24
 SHA512:
-  metadata.gz: 97477b7732ec3485aa7ba5ef2c7cb16ac130d6b1a1f6ee8b57deb5cf53fb6ae50bafdec13d1c575dbc382a748583a3e70b0da55660bd7065daaebc1479d466c4
-  data.tar.gz: fc8762b63b3429dcbd004bbdc39c6ef5bd7351e8b483ec8185c79b985ab5b2091601005a533270fc511e16d329f31a20616e1ea068f2b7ce7c3a0b3928087c5f
+  metadata.gz: b051e3982a7959f19cb25b29129905cd26e5fbc2529a012a0d8b091c5dfdc78d985c79b8aacb6c8142e2c030aed107d91a69d94aba8f6b2c16c57eaeb0f284cb
+  data.tar.gz: b404f338c87a266e9bcc03705327830be76d73018e5ff789017890cb77d7b0b0157daff1dbace000d542be0fd272a7912a75a018d88154da3e165c25e8e4bea5
data/CHANGELOG.md CHANGED
@@ -1,5 +1,21 @@
 ### HEAD
 
+### 0.1.0
+
+- simplify library entities:
+  - `SagroneScraper::Parser` is now `SagroneScraper::Base`
+  - `SagroneScraper::Base` is the base class for creating new scrapers
+  - `SagroneScraper.registered_parsers` is now `SagroneScraper::Collection.registered_scrapers`
+  - `SagroneScraper.register_parser` is now `SagroneScraper::Collection.register_scraper`
+  - `SagroneScraper::Parser#parse_page!` is renamed to `SagroneScraper::Base#scrape_page!`
+  - `SagroneScraper::Parser.can_parse?` is renamed to `SagroneScraper::Base.can_scrape?`
+  - `SagroneScraper::Base` _private_ instance methods are not used for extracting data; only _public_ instance methods are
+- `SagroneScraper::Collection`:
+  - registers newly created scrapers automatically
+  - knows how to scrape a page from a generic URL (if there is a valid scraper for that URL)
+- `SagroneScraper::Base` takes exactly one of the `url` or `page` options
+- `SagroneScraper::Agent` takes only a `url` option
+
 ### 0.0.3
 
 - add `SagroneScraper::Parser.can_parse?(url)` class method, which must be implemented in subclasses
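
For readers upgrading from 0.0.3, a minimal before/after sketch of the renames listed in the changelog above (editorial illustration, not part of the diff; `TWITTER_PROFILE_URL` stands in for any URL pattern):

```ruby
# 0.0.3: define a parser and register it by hand.
class TwitterParser < SagroneScraper::Parser
  def self.can_parse?(url)
    url.match(TWITTER_PROFILE_URL)
  end
end
SagroneScraper.register_parser('TwitterParser')
attributes = SagroneScraper.scrape(url: 'https://twitter.com/Milano_JS')

# 0.1.0: inherit from Base (registration is automatic), scrape via Collection.
class TwitterScraper < SagroneScraper::Base
  def self.can_scrape?(url)
    url.match(TWITTER_PROFILE_URL) ? true : false
  end
end
attributes = SagroneScraper::Collection.scrape(url: 'https://twitter.com/Milano_JS')
```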
data/Guardfile CHANGED
@@ -8,7 +8,7 @@ guard :rspec, cmd: "bundle exec rspec" do
   rspec = dsl.rspec
   watch(rspec.spec_files)
   watch(%r{^spec/(.+)_helper\.rb$}) { "spec" }
-  watch(%r{^spec/test_responses/(.+)$}) { "spec" }
+  watch(%r{^spec/support/(.+)$}) { "spec" }
 
   # Library files
   watch(%r{^lib/(.+)\.rb$}) { |m| "spec/#{m[1]}_spec.rb" }
data/README.md CHANGED
@@ -11,8 +11,12 @@ Simple library to scrap web pages. Bellow you will find information on [how to u
 - [Basic Usage](#basic-usage)
 - [Modules](#modules)
   + [`SagroneScraper::Agent`](#sagronescraperagent)
-  + [`SagroneScraper::Parser`](#sagronescraperparser)
-  + [`SagroneScraper.scrape`](#sagronescraperscrape)
+  + [`SagroneScraper::Base`](#sagronescraperbase)
+    * [Create a scraper class](#create-a-scraper-class)
+    * [Instantiate the scraper](#instantiate-the-scraper)
+    * [Scrape the page](#scrape-the-page)
+    * [Extract the data](#extract-the-data)
+  + [`SagroneScraper::Collection`](#sagronescrapercollection)
 
 ## Installation
 
@@ -30,115 +34,110 @@ Or install it yourself as:
 
 ## Basic Usage
 
-Comming soon...
+In order to _scrape a web page_ you will need to:
 
-## Modules
-
-#### `SagroneScraper::Agent`
-
-The agent is responsible for scraping a web page from a URL. Here is how you can create an `agent`:
+1. [create a new scraper class](#create-a-scraper-class) by inheriting from `SagroneScraper::Base`,
+2. [instantiate it with a `url` or `page`](#instantiate-the-scraper), and
+3. use the scraper instance to [scrape the page](#scrape-the-page) and [extract structured data](#extract-the-data)
 
-1. one way is to pass it a `url` option
+More information is available under [`SagroneScraper::Base`](#sagronescraperbase) below.
 
-```ruby
-require 'sagrone_scraper'
+## Modules
 
-agent = SagroneScraper::Agent.new(url: 'https://twitter.com/Milano_JS')
-agent.page
-# => Mechanize::Page
+### `SagroneScraper::Agent`
 
-agent.page.at('.ProfileHeaderCard-bio').text
-# => "Javascript User Group Milano #milanojs"
-```
+The agent is responsible for obtaining a page, `Mechanize::Page`, from a URL. Here is how you can create an `agent`:
 
-2. another way is to pass a `page` option (`Mechanize::Page`)
+```ruby
+require 'sagrone_scraper'
 
-```ruby
-require 'sagrone_scraper'
+agent = SagroneScraper::Agent.new(url: 'https://twitter.com/Milano_JS')
+agent.page
+# => Mechanize::Page
 
-mechanize_agent = Mechanize.new { |agent| agent.user_agent_alias = 'Linux Firefox' }
-page = mechanize_agent.get('https://twitter.com/Milano_JS')
-# => Mechanize::Page
+agent.page.at('.ProfileHeaderCard-bio').text
+# => "Javascript User Group Milano #milanojs"
+```
 
-agent = SagroneScraper::Agent.new(page: page)
-agent.url
-# => "https://twitter.com/Milano_JS"
+### `SagroneScraper::Base`
 
-agent.page.at('.ProfileHeaderCard-locationText').text
-# => "Milan, Italy"
-```
+Here we define a `TwitterScraper` by inheriting from the `SagroneScraper::Base` class.
 
-#### `SagroneScraper::Parser`
+The _scraper_ is responsible for extracting structured data from a _page_ or a _url_. The _page_ can be obtained by the [_agent_](#sagronescraperagent).
 
-The _parser_ is responsible for extracting structured data from a _page_. The page can be obtained by the _agent_.
+_Public_ instance methods will be used to extract data, whereas _private_ instance methods will be ignored (seen as helper methods). Most importantly, the `self.can_scrape?(url)` class method ensures that only a known subset of pages can be scraped for data.
 
-Example usage:
+#### Create a scraper class
 
 ```ruby
 require 'sagrone_scraper'
 
-# 1) First define a custom parser, for example twitter.
-class TwitterParser < SagroneScraper::Parser
+class TwitterScraper < SagroneScraper::Base
   TWITTER_PROFILE_URL = /^https?:\/\/twitter.com\/(\w)+\/?$/i
 
-  def self.can_parse?(url)
-    url.match(TWITTER_PROFILE_URL)
+  def self.can_scrape?(url)
+    url.match(TWITTER_PROFILE_URL) ? true : false
   end
 
+  # Public instance methods are used for data extraction.
+
   def bio
-    page.at('.ProfileHeaderCard-bio').text
+    text_at('.ProfileHeaderCard-bio')
   end
 
   def location
-    page.at('.ProfileHeaderCard-locationText').text
+    text_at('.ProfileHeaderCard-locationText')
+  end
+
+  private
+
+  # Private instance methods are not used for data extraction.
+
+  def text_at(selector)
+    page.at(selector).text if page.at(selector)
   end
 end
+```
+
+#### Instantiate the scraper
+
+```ruby
+# Instantiate the scraper with a "url".
+scraper = TwitterScraper.new(url: 'https://twitter.com/Milano_JS')
 
-# 2) Create an agent scraper, which will give us the page to parse.
+# Instantiate the scraper with a "page" (Mechanize::Page).
 agent = SagroneScraper::Agent.new(url: 'https://twitter.com/Milano_JS')
+scraper = TwitterScraper.new(page: agent.page)
+```
+
+#### Scrape the page
+
+```ruby
+scraper.scrape_page!
+```
 
-# 3) Instantiate the parser.
-parser = TwitterParser.new(page: agent.page)
+#### Extract the data
 
-# 4) Parse page and extract attributes.
-parser.parse_page!
-parser.attributes
+```ruby
+scraper.attributes
 # => {bio: "Javascript User Group Milano #milanojs", location: "Milan, Italy"}
 ```
 
-#### `SagroneScraper.scrape`
+### `SagroneScraper::Collection`
 
 This is the simplest way to scrape a web page:
 
 ```ruby
 require 'sagrone_scraper'
 
-# 1) First we define a custom parser, for example twitter.
-class TwitterParser < SagroneScraper::Parser
-  TWITTER_PROFILE_URL = /^https?:\/\/twitter.com\/(\w)+\/?$/i
-
-  def self.can_parse?(url)
-    url.match(TWITTER_PROFILE_URL)
-  end
-
-  def bio
-    page.at('.ProfileHeaderCard-bio').text
-  end
-
-  def location
-    page.at('.ProfileHeaderCard-locationText').text
-  end
-end
-
-# 2) We register the parser.
-SagroneScraper.register_parser('TwitterParser')
+# 1) Define a scraper. For example, the TwitterScraper above.
 
-# 3) We can query for registered parsers.
-SagroneScraper.registered_parsers
-# => ['TwitterParser']
+# 2) Newly created scrapers are registered automatically.
+SagroneScraper::Collection.registered_scrapers
+# => ['TwitterScraper']
 
-# 4) We can now scrape twitter profile URLs.
-SagroneScraper.scrape(url: 'https://twitter.com/Milano_JS')
+# 3) Here we use the collection to scrape data at a URL.
+SagroneScraper::Collection.scrape(url: 'https://twitter.com/Milano_JS')
 # => {bio: "Javascript User Group Milano #milanojs", location: "Milan, Italy"}
 ```
 
data/lib/sagrone_scraper.rb CHANGED
@@ -1,41 +1,10 @@
 require "sagrone_scraper/version"
 require "sagrone_scraper/agent"
-require "sagrone_scraper/parser"
+require "sagrone_scraper/base"
+require "sagrone_scraper/collection"
 
 module SagroneScraper
-  Error = Class.new(RuntimeError)
-
   def self.version
     VERSION
   end
-
-  def self.registered_parsers
-    @registered_parsers ||= []
-  end
-
-  def self.register_parser(name)
-    return if registered_parsers.include?(name)
-
-    parser_class = Object.const_get(name)
-    raise Error.new("Expected parser to be a SagroneScraper::Parser.") unless parser_class.ancestors.include?(SagroneScraper::Parser)
-
-    registered_parsers.push(name)
-  end
-
-  def self.scrape(options)
-    url = options.fetch(:url) do
-      raise Error.new('Option "url" must be provided.')
-    end
-
-    parser_class = registered_parsers
-      .map { |parser_name| Object.const_get(parser_name) }
-      .find { |parser_class| parser_class.can_parse?(url) }
-
-    raise Error.new("No registed parser can parse URL #{url}") unless parser_class
-
-    agent = SagroneScraper::Agent.new(url: url)
-    parser = parser_class.new(page: agent.page)
-    parser.parse_page!
-    parser.attributes
-  end
 end
data/lib/sagrone_scraper/agent.rb CHANGED
@@ -9,12 +9,10 @@ module SagroneScraper
     attr_reader :url, :page
 
     def initialize(options = {})
-      raise Error.new('Exactly one option must be provided: "url" or "page"') unless exactly_one_of(options)
-
-      @url, @page = options[:url], options[:page]
-
-      @url ||= page_url
-      @page ||= http_client.get(url)
+      @url = options.fetch(:url) do
+        raise Error.new('Option "url" must be provided')
+      end
+      @page = http_client.get(url)
     rescue StandardError => error
       raise Error.new(error.message)
     end
@@ -29,18 +27,5 @@ module SagroneScraper
         agent.max_history = 0
       end
     end
-
-    private
-
-    def page_url
-      @page.uri.to_s
-    end
-
-    def exactly_one_of(options)
-      url_present = !!options[:url]
-      page_present = !!options[:page]
-
-      (url_present && !page_present) || (!url_present && page_present)
-    end
   end
 end
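
A short usage sketch of the narrowed `Agent` API after this change (editorial illustration, grounded in the code and specs in this diff):

```ruby
require 'sagrone_scraper'

# The agent now accepts only a `url` option; passing a prefetched `page`
# moved to SagroneScraper::Base.
agent = SagroneScraper::Agent.new(url: 'http://example.com')
agent.page
# => Mechanize::Page

# Omitting `url` raises an agent error:
SagroneScraper::Agent.new
# => SagroneScraper::Agent::Error: Option "url" must be provided
```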
data/lib/sagrone_scraper/base.rb ADDED
@@ -0,0 +1,63 @@
+require 'mechanize'
+require 'sagrone_scraper'
+
+module SagroneScraper
+  class Base
+    Error = Class.new(RuntimeError)
+
+    attr_reader :page, :url, :attributes
+
+    def initialize(options = {})
+      raise Error.new('Exactly one option must be provided: "url" or "page"') unless exactly_one_of(options)
+
+      @url, @page = options[:url], options[:page]
+
+      @url ||= page_url
+      @page ||= Agent.new(url: url).page
+      @attributes = {}
+    end
+
+    def scrape_page!
+      return unless self.class.can_scrape?(page_url)
+
+      self.class.method_names.each do |name|
+        attributes[name] = send(name)
+      end
+      nil
+    end
+
+    def self.can_scrape?(url)
+      class_with_method = "#{self}.can_scrape?(url)"
+      raise NotImplementedError.new("Expected #{class_with_method} to be implemented.")
+    end
+
+    def self.should_ignore_method?(name)
+      private_method_defined?(name)
+    end
+
+    private
+
+    def exactly_one_of(options)
+      url_present = !!options[:url]
+      page_present = !!options[:page]
+
+      (url_present && !page_present) || (!url_present && page_present)
+    end
+
+    def page_url
+      @page.uri.to_s
+    end
+
+    def self.method_names
+      @method_names ||= []
+    end
+
+    def self.method_added(name)
+      method_names.push(name) unless should_ignore_method?(name)
+    end
+
+    def self.inherited(klass)
+      SagroneScraper::Collection.register_scraper(klass.name)
+    end
+  end
+end
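
To make the extraction mechanism concrete: a sketch (hypothetical `ExampleScraper`, `title`, and `cleanup`, not part of the gem) of how the `method_added` hook collects public instance methods while `should_ignore_method?` skips private ones:

```ruby
class ExampleScraper < SagroneScraper::Base
  # `inherited` fires here and auto-registers 'ExampleScraper'
  # with SagroneScraper::Collection.

  def self.can_scrape?(url)
    url.start_with?('http://example.com')  # class method: not collected
  end

  def title                                # public: collected by method_added
    page.at('title').text
  end

  private

  def cleanup(text)                        # private: skipped by should_ignore_method?
    text.strip
  end
end

ExampleScraper.method_names
# => [:title]
```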
data/lib/sagrone_scraper/collection.rb ADDED
@@ -0,0 +1,38 @@
+require 'sagrone_scraper'
+
+module SagroneScraper
+  module Collection
+    Error = Class.new(RuntimeError)
+
+    def self.registered_scrapers
+      @registered_scrapers ||= []
+    end
+
+    def self.register_scraper(name)
+      return if registered_scrapers.include?(name)
+
+      scraper_class = Object.const_get(name)
+      raise Error.new("Expected scraper to be a SagroneScraper::Base.") unless scraper_class.ancestors.include?(SagroneScraper::Base)
+
+      registered_scrapers.push(name)
+      nil
+    end
+
+    def self.scrape(options)
+      url = options.fetch(:url) do
+        raise Error.new('Option "url" must be provided.')
+      end
+
+      scraper_class = registered_scrapers
+        .map { |scraper_name| Object.const_get(scraper_name) }
+        .find { |a_scraper_class| a_scraper_class.can_scrape?(url) }
+
+      raise Error.new("No registered scraper can scrape URL #{url}") unless scraper_class
+
+      agent = SagroneScraper::Agent.new(url: url)
+      scraper = scraper_class.new(page: agent.page)
+      scraper.scrape_page!
+      scraper.attributes
+    end
+  end
+end
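
And a brief sketch of the dispatch behaviour above (illustrative; both URLs come from the gem's own specs): `scrape` picks the first registered scraper whose `can_scrape?` accepts the URL, and raises otherwise:

```ruby
# A registered TwitterScraper matches profile URLs:
SagroneScraper::Collection.scrape(url: 'https://twitter.com/Milano_JS')
# => {bio: "Javascript User Group Milano #milanojs", location: "Milan, Italy"}

# No registered scraper matches a media URL:
SagroneScraper::Collection.scrape(url: 'https://twitter.com/Milano_JS/media')
# raises SagroneScraper::Collection::Error,
#   "No registered scraper can scrape URL https://twitter.com/Milano_JS/media"
```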
data/lib/sagrone_scraper/version.rb CHANGED
@@ -1,3 +1,3 @@
 module SagroneScraper
-  VERSION = "0.0.3"
+  VERSION = "0.1.0"
 end
data/spec/sagrone_scraper/agent_spec.rb CHANGED
@@ -12,7 +12,7 @@ RSpec.describe SagroneScraper::Agent do
     it { expect(described_class::AGENT_ALIASES).to eq(user_agent_aliases) }
   end
 
-  describe '.http_client' do
+  describe 'self.http_client' do
     subject { described_class.http_client }
 
     it { should be_a(Mechanize) }
@@ -21,7 +21,7 @@ RSpec.describe SagroneScraper::Agent do
   end
 
   describe '#initialize' do
-    describe 'should require exactly one of `url` or `page` option' do
+    describe 'should require `url` option' do
       before do
         stub_request_for('http://example.com', 'www.example.com')
       end
@@ -30,30 +30,21 @@ RSpec.describe SagroneScraper::Agent do
        expect {
          described_class.new
        }.to raise_error(SagroneScraper::Agent::Error,
-          'Exactly one option must be provided: "url" or "page"')
-      end
-
-      it 'when both options are present' do
-        page = Mechanize.new.get('http://example.com')
-
-        expect {
-          described_class.new(url: 'http://example.com', page: page)
-        }.to raise_error(SagroneScraper::Agent::Error,
-          'Exactly one option must be provided: "url" or "page"')
+          'Option "url" must be provided')
       end
     end
 
-    describe 'with page option' do
+    describe 'with url option' do
      before do
        stub_request_for('http://example.com', 'www.example.com')
      end
 
-      let(:page) { Mechanize.new.get('http://example.com') }
-      let(:agent) { described_class.new(page: page) }
+      let(:agent) { described_class.new(url: 'http://example.com') }
 
      it { expect { agent }.to_not raise_error }
      it { expect(agent.page).to be }
-      it { expect(agent.url).to eq 'http://example.com/' }
+      it { expect(agent.page).to be_a(Mechanize::Page) }
+      it { expect(agent.url).to eq 'http://example.com' }
    end
 
    describe 'with invalid URL' do
@@ -62,7 +53,7 @@ RSpec.describe SagroneScraper::Agent do
      it 'should require URL is absolute' do
        @invalid_url = 'not-a-url'
 
-        expect { agent }.to raise_error(SagroneScraper::Agent::Error,
+        expect { agent }.to raise_error(described_class::Error,
          'absolute URL needed (not not-a-url)')
      end
 
@@ -70,7 +61,7 @@ RSpec.describe SagroneScraper::Agent do
        @invalid_url = 'http://'
 
        webmock_allow do
-          expect { agent }.to raise_error(SagroneScraper::Agent::Error,
+          expect { agent }.to raise_error(described_class::Error,
            /bad URI\(absolute but no path\)/)
        end
      end
@@ -79,7 +70,7 @@ RSpec.describe SagroneScraper::Agent do
        @invalid_url = 'http://example'
 
        webmock_allow do
-          expect { agent }.to raise_error(SagroneScraper::Agent::Error,
+          expect { agent }.to raise_error(described_class::Error,
            /getaddrinfo/)
        end
      end
data/spec/sagrone_scraper/base_spec.rb ADDED
@@ -0,0 +1,162 @@
+require 'spec_helper'
+require 'sagrone_scraper/base'
+
+RSpec.describe SagroneScraper::Base do
+  describe '#initialize' do
+    before do
+      stub_request_for('http://example.com', 'www.example.com')
+    end
+
+    describe 'should require exactly one of `url` or `page` option' do
+      it 'when options is empty' do
+        expect {
+          described_class.new
+        }.to raise_error(described_class::Error,
+          'Exactly one option must be provided: "url" or "page"')
+      end
+
+      it 'when both options are present' do
+        page = Mechanize.new.get('http://example.com')
+
+        expect {
+          described_class.new(url: 'http://example.com', page: page)
+        }.to raise_error(described_class::Error,
+          'Exactly one option must be provided: "url" or "page"')
+      end
+    end
+
+    describe 'with page option' do
+      let(:page) { Mechanize.new.get('http://example.com') }
+      let(:agent) { described_class.new(page: page) }
+
+      it { expect { agent }.to_not raise_error }
+      it { expect(agent.page).to be }
+      it { expect(agent.url).to eq 'http://example.com/' }
+    end
+
+    describe 'with url option' do
+      let(:agent) { described_class.new(url: 'http://example.com') }
+
+      it { expect { agent }.to_not raise_error }
+      it { expect(agent.page).to be }
+      it { expect(agent.url).to eq 'http://example.com' }
+    end
+  end
+
+  describe 'instance methods' do
+    let(:page) { Mechanize::Page.new }
+    let(:scraper) { described_class.new(page: page) }
+
+    describe '#page' do
+      it { expect(scraper.page).to be_a(Mechanize::Page) }
+    end
+
+    describe '#url' do
+      it { expect(scraper.url).to eq page.uri.to_s }
+    end
+
+    describe '#scrape_page!' do
+      it do
+        expect {
+          scraper.scrape_page!
+        }.to raise_error(NotImplementedError, "Expected #{described_class}.can_scrape?(url) to be implemented.")
+      end
+    end
+
+    describe '#attributes' do
+      it { expect(scraper.attributes).to be_empty }
+    end
+  end
+
+  describe 'class methods' do
+    describe 'self.can_scrape?(url)' do
+      it 'must be implemented in subclasses' do
+        expect {
+          described_class.can_scrape?('url')
+        }.to raise_error(NotImplementedError, "Expected #{described_class}.can_scrape?(url) to be implemented.")
+      end
+    end
+  end
+
+  describe 'example TwitterScraper' do
+    before do
+      stub_request_for('https://twitter.com/Milano_JS', 'twitter.com:Milano_JS')
+    end
+
+    let(:page) { Mechanize.new.get('https://twitter.com/Milano_JS') }
+    let(:twitter_scraper) { TwitterScraper.new(page: page) }
+    let(:expected_attributes) do
+      {
+        bio: "Javascript User Group Milano #milanojs",
+        location: "Milan, Italy"
+      }
+    end
+
+    describe '#initialize' do
+      it 'should be a SagroneScraper::Base' do
+        expect(twitter_scraper).to be_a(described_class)
+      end
+    end
+
+    describe 'instance methods' do
+      describe '#scrape_page!' do
+        it 'should succeed' do
+          expect { twitter_scraper.scrape_page! }.to_not raise_error
+        end
+
+        it 'should have attributes present after scraping' do
+          twitter_scraper.scrape_page!
+
+          expect(twitter_scraper.attributes).to_not be_empty
+          expect(twitter_scraper.attributes).to eq expected_attributes
+        end
+
+        it 'should have correct attributes even if scraping is done multiple times' do
+          twitter_scraper.scrape_page!
+          twitter_scraper.scrape_page!
+          twitter_scraper.scrape_page!
+
+          expect(twitter_scraper.attributes).to_not be_empty
+          expect(twitter_scraper.attributes).to eq expected_attributes
+        end
+      end
+    end
+
+    describe 'class methods' do
+      describe 'self.can_scrape?(url)' do
+        it 'should be true for scrapable URLs' do
+          can_scrape = TwitterScraper.can_scrape?('https://twitter.com/Milano_JS')
+
+          expect(can_scrape).to eq(true)
+        end
+
+        it 'should be false for unknown URLs' do
+          can_scrape = TwitterScraper.can_scrape?('https://www.facebook.com/milanojavascript')
+
+          expect(can_scrape).to eq(false)
+        end
+      end
+
+      describe 'self.should_ignore_method?(name)' do
+        let(:private_methods) { %w(text_at) }
+        let(:public_methods) { %w(bio location) }
+
+        it 'ignores private methods' do
+          private_methods.each do |private_method|
+            ignored = TwitterScraper.should_ignore_method?(private_method)
+
+            expect(ignored).to eq(true)
+          end
+        end
+
+        it 'allows public methods' do
+          public_methods.each do |public_method|
+            ignored = TwitterScraper.should_ignore_method?(public_method)
+
+            expect(ignored).to eq(false)
+          end
+        end
+      end
+    end
+  end
+end
data/spec/sagrone_scraper/collection_spec.rb ADDED
@@ -0,0 +1,80 @@
+require 'spec_helper'
+require 'sagrone_scraper/collection'
+
+RSpec.describe SagroneScraper::Collection do
+  context 'scrapers registered' do
+    before do
+      described_class.registered_scrapers.clear
+    end
+
+    describe 'self.registered_scrapers' do
+      it { expect(described_class.registered_scrapers).to be_empty }
+      it { expect(described_class.registered_scrapers).to be_a(Array) }
+    end
+
+    describe 'self.register_scraper(name)' do
+      it 'should add a new scraper class to registered scrapers automatically' do
+        class ScraperOne < SagroneScraper::Base ; end
+        class ScraperTwo < SagroneScraper::Base ; end
+
+        expect(described_class.registered_scrapers).to include('ScraperOne')
+        expect(described_class.registered_scrapers).to include('ScraperTwo')
+        expect(described_class.registered_scrapers.size).to eq(2)
+      end
+
+      it 'should check scraper name is an existing constant' do
+        expect {
+          described_class.register_scraper('Unknown')
+        }.to raise_error(NameError, 'uninitialized constant Unknown')
+      end
+
+      it 'should check scraper class inherits from SagroneScraper::Base' do
+        NotScraper = Class.new
+
+        expect {
+          described_class.register_scraper('NotScraper')
+        }.to raise_error(described_class::Error, 'Expected scraper to be a SagroneScraper::Base.')
+      end
+
+      it 'should register multiple scrapers only once' do
+        class TestScraper < SagroneScraper::Base ; end
+
+        described_class.register_scraper('TestScraper')
+        described_class.register_scraper('TestScraper')
+
+        expect(described_class.registered_scrapers).to include('TestScraper')
+        expect(described_class.registered_scrapers.size).to eq 1
+      end
+    end
+  end
+
+  describe 'self.scrape' do
+    before do
+      described_class.registered_scrapers.clear
+      described_class.register_scraper('TwitterScraper')
+
+      stub_request_for('https://twitter.com/Milano_JS', 'twitter.com:Milano_JS')
+    end
+
+    it 'should require `url` option' do
+      expect {
+        described_class.scrape({})
+      }.to raise_error(described_class::Error, 'Option "url" must be provided.')
+    end
+
+    it 'should scrape URL if registered scraper knows how to scrape it' do
+      expected_attributes = {
+        bio: "Javascript User Group Milano #milanojs",
+        location: "Milan, Italy"
+      }
+
+      expect(described_class.scrape(url: 'https://twitter.com/Milano_JS')).to eq(expected_attributes)
+    end
+
+    it 'should raise an error if no registered scraper can scrape the URL' do
+      expect {
+        described_class.scrape(url: 'https://twitter.com/Milano_JS/media')
+      }.to raise_error(described_class::Error, "No registered scraper can scrape URL https://twitter.com/Milano_JS/media")
+    end
+  end
+end
data/spec/sagrone_scraper_spec.rb CHANGED
@@ -2,80 +2,7 @@ require 'spec_helper'
 require 'sagrone_scraper'
 
 RSpec.describe SagroneScraper do
-  describe '.version' do
+  describe 'self.version' do
     it { expect(SagroneScraper.version).to be_a(String) }
   end
-
-  context 'parsers registered' do
-    before do
-      described_class.registered_parsers.clear
-    end
-
-    describe '.registered_parsers' do
-      it { expect(described_class.registered_parsers).to be_empty }
-      it { expect(described_class.registered_parsers).to be_a(Array) }
-    end
-
-    describe '.register_parser(name)' do
-      TestParser = Class.new(SagroneScraper::Parser)
-      NotParser = Class.new
-
-      it 'should check parser name is an existing constant' do
-        expect {
-          described_class.register_parser('Unknown')
-        }.to raise_error(NameError, 'uninitialized constant Unknown')
-      end
-
-      it 'should check parser class inherits from SagroneScraper::Parser' do
-        expect {
-          described_class.register_parser('NotParser')
-        }.to raise_error(SagroneScraper::Error, 'Expected parser to be a SagroneScraper::Parser.')
-      end
-
-      it 'after adding a "parser" should have it registered' do
-        described_class.register_parser('TestParser')
-
-        expect(described_class.registered_parsers).to include('TestParser')
-        expect(described_class.registered_parsers.size).to eq 1
-      end
-
-      it 'adding same "parser" multiple times should register it once' do
-        described_class.register_parser('TestParser')
-        described_class.register_parser('TestParser')
-
-        expect(described_class.registered_parsers).to include('TestParser')
-        expect(described_class.registered_parsers.size).to eq 1
-      end
-    end
-  end
-
-  describe '.scrape' do
-    before do
-      SagroneScraper.registered_parsers.clear
-      SagroneScraper.register_parser('TwitterParser')
-
-      stub_request_for('https://twitter.com/Milano_JS', 'twitter.com:Milano_JS')
-    end
-
-    it 'should `url` option' do
-      expect {
-        described_class.scrape({})
-      }.to raise_error(SagroneScraper::Error, 'Option "url" must be provided.')
-    end
-
-    it 'should scrape URL if registered parser knows how to parse it' do
-      expected_attributes = {
-        bio: "Javascript User Group Milano #milanojs",
-        location: "Milan, Italy"
-      }
-
-      expect(described_class.scrape(url: 'https://twitter.com/Milano_JS')).to eq(expected_attributes)
-    end
-
-    it 'should return raise error if no registered paser can parse the URL' do
-      expect {
-        described_class.scrape(url: 'https://twitter.com/Milano_JS/media')
-      }.to raise_error(SagroneScraper::Error, "No registed parser can parse URL https://twitter.com/Milano_JS/media")
-    end
-  end
 end
data/spec/support/test_scrapers/twitter_scraper.rb ADDED
@@ -0,0 +1,23 @@
+require 'sagrone_scraper/base'
+
+class TwitterScraper < SagroneScraper::Base
+  TWITTER_PROFILE_URL = /^https?:\/\/twitter.com\/(\w)+\/?$/i
+
+  def self.can_scrape?(url)
+    url.match(TWITTER_PROFILE_URL) ? true : false
+  end
+
+  def bio
+    text_at('.ProfileHeaderCard-bio')
+  end
+
+  def location
+    text_at('.ProfileHeaderCard-locationText')
+  end
+
+  private
+
+  def text_at(selector)
+    page.at(selector).text if page.at(selector)
+  end
+end
metadata CHANGED
@@ -1,14 +1,14 @@
 --- !ruby/object:Gem::Specification
 name: sagrone_scraper
 version: !ruby/object:Gem::Version
-  version: 0.0.3
+  version: 0.1.0
 platform: ruby
 authors:
 - Marius Colacioiu
 autorequire:
 bindir: bin
 cert_chain: []
-date: 2015-03-10 00:00:00.000000000 Z
+date: 2015-04-09 00:00:00.000000000 Z
 dependencies:
 - !ruby/object:Gem::Dependency
   name: mechanize
@@ -113,17 +113,19 @@ files:
 - Rakefile
 - lib/sagrone_scraper.rb
 - lib/sagrone_scraper/agent.rb
-- lib/sagrone_scraper/parser.rb
+- lib/sagrone_scraper/base.rb
+- lib/sagrone_scraper/collection.rb
 - lib/sagrone_scraper/version.rb
 - sagrone_scraper.gemspec
 - spec/sagrone_scraper/agent_spec.rb
-- spec/sagrone_scraper/parser_spec.rb
+- spec/sagrone_scraper/base_spec.rb
+- spec/sagrone_scraper/collection_spec.rb
 - spec/sagrone_scraper_spec.rb
 - spec/spec_helper.rb
 - spec/stub_helper.rb
-- spec/support/test_parsers/twitter_parser.rb
 - spec/support/test_responses/twitter.com:Milano_JS
 - spec/support/test_responses/www.example.com
+- spec/support/test_scrapers/twitter_scraper.rb
 homepage: ''
 licenses:
 - Apache License 2.0
@@ -150,10 +152,11 @@ specification_version: 4
 summary: Sagrone Ruby Scraper.
 test_files:
 - spec/sagrone_scraper/agent_spec.rb
-- spec/sagrone_scraper/parser_spec.rb
+- spec/sagrone_scraper/base_spec.rb
+- spec/sagrone_scraper/collection_spec.rb
 - spec/sagrone_scraper_spec.rb
 - spec/spec_helper.rb
 - spec/stub_helper.rb
-- spec/support/test_parsers/twitter_parser.rb
 - spec/support/test_responses/twitter.com:Milano_JS
 - spec/support/test_responses/www.example.com
+- spec/support/test_scrapers/twitter_scraper.rb
data/lib/sagrone_scraper/parser.rb DELETED
@@ -1,42 +0,0 @@
-require 'mechanize'
-
-module SagroneScraper
-  class Parser
-    Error = Class.new(RuntimeError)
-
-    attr_reader :page, :page_url, :attributes
-
-    def initialize(options = {})
-      @page = options.fetch(:page) do
-        raise Error.new('Option "page" must be provided.')
-      end
-      @page_url = @page.uri.to_s
-      @attributes = {}
-    end
-
-    def parse_page!
-      return unless self.class.can_parse?(page_url)
-
-      self.class.method_names.each do |name|
-        attributes[name] = send(name)
-      end
-      nil
-    end
-
-    def self.can_parse?(url)
-      class_with_method = "#{self}.can_parse?(url)"
-      raise NotImplementedError.new("Expected #{class_with_method} to be implemented.")
-    end
-
-    private
-
-    def self.method_names
-      @method_names ||= []
-    end
-
-    def self.method_added(name)
-      puts "added #{name} to #{self}"
-      method_names.push(name)
-    end
-  end
-end
data/spec/sagrone_scraper/parser_spec.rb DELETED
@@ -1,83 +0,0 @@
-require 'spec_helper'
-require 'sagrone_scraper/parser'
-
-RSpec.describe SagroneScraper::Parser do
-  describe '#initialize' do
-    it 'requires a "page" option' do
-      expect {
-        described_class.new
-      }.to raise_error(SagroneScraper::Parser::Error, 'Option "page" must be provided.')
-    end
-  end
-
-  describe 'instance methods' do
-    let(:page) { Mechanize::Page.new }
-    let(:parser) { described_class.new(page: page) }
-
-    describe '#page' do
-      it { expect(parser.page).to be_a(Mechanize::Page) }
-    end
-
-    describe '#page_url' do
-      it { expect(parser.page_url).to be }
-      it { expect(parser.page_url).to eq page.uri.to_s }
-    end
-
-    describe '#parse_page!' do
-      it do
-        expect {
-          parser.parse_page!
-        }.to raise_error(NotImplementedError, "Expected #{described_class}.can_parse?(url) to be implemented.")
-      end
-    end
-
-    describe '#attributes' do
-      it { expect(parser.attributes).to be_empty }
-    end
-  end
-
-  describe 'class methods' do
-    describe '.can_parse?(url)' do
-      it do
-        expect {
-          described_class.can_parse?('url')
-        }.to raise_error(NotImplementedError, "Expected #{described_class}.can_parse?(url) to be implemented.")
-      end
-    end
-  end
-
-  describe 'create custom TwitterParser from SagroneScraper::Parser' do
-    before do
-      stub_request_for('https://twitter.com/Milano_JS', 'twitter.com:Milano_JS')
-    end
-
-    let(:page) { Mechanize.new.get('https://twitter.com/Milano_JS') }
-    let(:twitter_parser) { TwitterParser.new(page: page) }
-    let(:expected_attributes) do
-      {
-        bio: "Javascript User Group Milano #milanojs",
-        location: "Milan, Italy"
-      }
-    end
-
-    describe 'should be able to parse page without errors' do
-      it { expect { twitter_parser.parse_page! }.to_not raise_error }
-    end
-
-    it 'should have attributes present after parsing' do
-      twitter_parser.parse_page!
-
-      expect(twitter_parser.attributes).to_not be_empty
-      expect(twitter_parser.attributes).to eq expected_attributes
-    end
-
-    it 'should have correct attributes event if parsing is done multiple times' do
-      twitter_parser.parse_page!
-      twitter_parser.parse_page!
-      twitter_parser.parse_page!
-
-      expect(twitter_parser.attributes).to_not be_empty
-      expect(twitter_parser.attributes).to eq expected_attributes
-    end
-  end
-end
data/spec/support/test_parsers/twitter_parser.rb DELETED
@@ -1,17 +0,0 @@
-require 'sagrone_scraper/parser'
-
-class TwitterParser < SagroneScraper::Parser
-  TWITTER_PROFILE_URL = /^https?:\/\/twitter.com\/(\w)+\/?$/i
-
-  def self.can_parse?(url)
-    url.match(TWITTER_PROFILE_URL)
-  end
-
-  def bio
-    page.at('.ProfileHeaderCard-bio').text
-  end
-
-  def location
-    page.at('.ProfileHeaderCard-locationText').text
-  end
-end