sagrone_scraper 0.0.3 → 0.1.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
checksums.yaml CHANGED
@@ -1,7 +1,7 @@
  ---
  SHA1:
- metadata.gz: 80b3c30080aba0c8b1da8cfdcdefbb8e6ef527e1
- data.tar.gz: c9992757e44377ed3081089f0348cdf1535e8a8e
+ metadata.gz: 7b4420d9fd6c018746bbdb2ca17ea53402ca3857
+ data.tar.gz: 7ff6f5cdb840b214ad7eaf8f7ccb23e2efe9ff24
  SHA512:
- metadata.gz: 97477b7732ec3485aa7ba5ef2c7cb16ac130d6b1a1f6ee8b57deb5cf53fb6ae50bafdec13d1c575dbc382a748583a3e70b0da55660bd7065daaebc1479d466c4
- data.tar.gz: fc8762b63b3429dcbd004bbdc39c6ef5bd7351e8b483ec8185c79b985ab5b2091601005a533270fc511e16d329f31a20616e1ea068f2b7ce7c3a0b3928087c5f
+ metadata.gz: b051e3982a7959f19cb25b29129905cd26e5fbc2529a012a0d8b091c5dfdc78d985c79b8aacb6c8142e2c030aed107d91a69d94aba8f6b2c16c57eaeb0f284cb
+ data.tar.gz: b404f338c87a266e9bcc03705327830be76d73018e5ff789017890cb77d7b0b0157daff1dbace000d542be0fd272a7912a75a018d88154da3e165c25e8e4bea5
CHANGELOG.md CHANGED
@@ -1,5 +1,21 @@
  ### HEAD
 
+ ### 0.1.0
+
+ - simplify library entities:
+   - `SagroneScraper::Parser` is now `SagroneScraper::Base`
+   - `SagroneScraper::Base` is the base class for creating new scrapers
+   - `SagroneScraper.registered_parsers` is now `SagroneScraper.registered_scrapers`
+   - `SagroneScraper.register_parser` is now `SagroneScraper.register_scraper`
+   - `SagroneScraper::Base.parse_page!` renamed to `SagroneScraper::Base.scrape_page!`
+   - `SagroneScraper::Base.can_parse?` renamed to `SagroneScraper::Base.can_scrape?`
+   - `SagroneScraper::Base` _private_ instance methods are not used for extracting data, only _public_ instance methods are
+ - `SagroneScraper::Collection`:
+   - registers newly created scrapers automatically
+   - knows how to scrape a page from a generic URL (if there is a valid scraper for that URL)
+ - `SagroneScraper::Base` takes exactly one of the `url` or `page` options
+ - `SagroneScraper::Agent` takes only a `url` option
+
  ### 0.0.3
 
  - add `SagroneScraper::Parser.can_parse?(url)` class method, which must be implemented in subclasses
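Taken together, the renames map the 0.0.3 parser API onto the 0.1.0 scraper API one-for-one. A minimal sketch of the new surface, using a hypothetical `ExampleScraper` (any subclass of `SagroneScraper::Base` behaves the same way):

```ruby
require 'sagrone_scraper'

# Subclassing Base replaces subclassing Parser; the inherited hook
# registers the new class with the Collection automatically.
class ExampleScraper < SagroneScraper::Base
  PROFILE_URL = /^https?:\/\/twitter.com\/(\w)+\/?$/i

  # was: self.can_parse?(url)
  def self.can_scrape?(url)
    url.match(PROFILE_URL) ? true : false
  end

  # Public instance methods define the extracted attributes.
  def bio
    node = page.at('.ProfileHeaderCard-bio')
    node && node.text
  end
end

SagroneScraper::Collection.registered_scrapers  # was: SagroneScraper.registered_parsers
# => ["ExampleScraper"]

scraper = ExampleScraper.new(url: 'https://twitter.com/Milano_JS')
scraper.scrape_page!                            # was: parser.parse_page!
scraper.attributes
# => {bio: "..."}
```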
data/Guardfile CHANGED
@@ -8,7 +8,7 @@ guard :rspec, cmd: "bundle exec rspec" do
    rspec = dsl.rspec
    watch(rspec.spec_files)
    watch(%r{^spec/(.+)_helper\.rb$}) { "spec" }
-   watch(%r{^spec/test_responses/(.+)$}) { "spec" }
+   watch(%r{^spec/support/(.+)$}) { "spec" }
 
    # Library files
    watch(%r{^lib/(.+)\.rb$}) { |m| "spec/#{m[1]}_spec.rb" }
data/README.md CHANGED
@@ -11,8 +11,12 @@ Simple library to scrap web pages. Bellow you will find information on [how to u
  - [Basic Usage](#basic-usage)
  - [Modules](#modules)
    + [`SagroneScraper::Agent`](#sagronescraperagent)
-   + [`SagroneScraper::Parser`](#sagronescraperparser)
-   + [`SagroneScraper.scrape`](#sagronescraperscrape)
+   + [`SagroneScraper::Base`](#sagronescraperbase)
+     * [Create a scraper class](#create-a-scraper-class)
+     * [Instantiate the scraper](#instantiate-the-scraper)
+     * [Scrape the page](#scrape-the-page)
+     * [Extract the data](#extract-the-data)
+   + [`SagroneScraper::Collection`](#sagronescrapercollection)
 
  ## Installation
 
@@ -30,115 +34,110 @@
 
  ## Basic Usage
 
- Comming soon...
+ In order to _scrape a web page_ you will need to:
 
- ## Modules
-
- #### `SagroneScraper::Agent`
-
- The agent is responsible for scraping a web page from a URL. Here is how you can create an `agent`:
+ 1. [create a new scraper class](#create-a-scraper-class) by inheriting from `SagroneScraper::Base`, and
+ 2. [instantiate it with a `url` or `page`](#instantiate-the-scraper)
+ 3. then you can use the scraper instance to [scrape the page](#scrape-the-page) and [extract structured data](#extract-the-data)
 
- 1. one way is to pass it a `url` option
+ More information is available in the [`SagroneScraper::Base`](#sagronescraperbase) section.
 
- ```ruby
- require 'sagrone_scraper'
+ ## Modules
 
- agent = SagroneScraper::Agent.new(url: 'https://twitter.com/Milano_JS')
- agent.page
- # => Mechanize::Page
+ ### `SagroneScraper::Agent`
 
- agent.page.at('.ProfileHeaderCard-bio').text
- # => "Javascript User Group Milano #milanojs"
- ```
+ The agent is responsible for obtaining a page (`Mechanize::Page`) from a URL. Here is how you can create an `agent`:
 
- 2. another way is to pass a `page` option (`Mechanize::Page`)
+ ```ruby
+ require 'sagrone_scraper'
 
- ```ruby
- require 'sagrone_scraper'
+ agent = SagroneScraper::Agent.new(url: 'https://twitter.com/Milano_JS')
+ agent.page
+ # => Mechanize::Page
 
- mechanize_agent = Mechanize.new { |agent| agent.user_agent_alias = 'Linux Firefox' }
- page = mechanize_agent.get('https://twitter.com/Milano_JS')
- # => Mechanize::Page
+ agent.page.at('.ProfileHeaderCard-bio').text
+ # => "Javascript User Group Milano #milanojs"
+ ```
 
- agent = SagroneScraper::Agent.new(page: page)
- agent.url
- # => "https://twitter.com/Milano_JS"
+ ### `SagroneScraper::Base`
 
- agent.page.at('.ProfileHeaderCard-locationText').text
- # => "Milan, Italy"
- ```
+ Here we define a `TwitterScraper` by inheriting from the `SagroneScraper::Base` class.
 
- #### `SagroneScraper::Parser`
+ The _scraper_ is responsible for extracting structured data from a _page_ or a _url_. The _page_ can be obtained by the [_agent_](#sagronescraperagent).
 
- The _parser_ is responsible for extracting structured data from a _page_. The page can be obtained by the _agent_.
+ _Public_ instance methods will be used to extract data, whereas _private_ instance methods will be ignored (seen as helper methods). Most importantly, the `self.can_scrape?(url)` class method ensures that only a known subset of pages can be scraped for data.
 
- Example usage:
+ #### Create a scraper class
 
  ```ruby
  require 'sagrone_scraper'
 
- # 1) First define a custom parser, for example twitter.
- class TwitterParser < SagroneScraper::Parser
+ class TwitterScraper < SagroneScraper::Base
    TWITTER_PROFILE_URL = /^https?:\/\/twitter.com\/(\w)+\/?$/i
 
-   def self.can_parse?(url)
-     url.match(TWITTER_PROFILE_URL)
+   def self.can_scrape?(url)
+     url.match(TWITTER_PROFILE_URL) ? true : false
    end
 
+   # Public instance methods are used for data extraction.
+
    def bio
-     page.at('.ProfileHeaderCard-bio').text
+     text_at('.ProfileHeaderCard-bio')
    end
 
    def location
-     page.at('.ProfileHeaderCard-locationText').text
+     text_at('.ProfileHeaderCard-locationText')
+   end
+
+   private
+
+   # Private instance methods are not used for data extraction.
+
+   def text_at(selector)
+     page.at(selector).text if page.at(selector)
    end
  end
+ ```
+
+ #### Instantiate the scraper
+
+ ```ruby
+ # Instantiate the scraper with a "url".
+ scraper = TwitterScraper.new(url: 'https://twitter.com/Milano_JS')
 
- # 2) Create an agent scraper, which will give us the page to parse.
+ # Instantiate the scraper with a "page" (Mechanize::Page).
  agent = SagroneScraper::Agent.new(url: 'https://twitter.com/Milano_JS')
+ scraper = TwitterScraper.new(page: agent.page)
+ ```
+
+ #### Scrape the page
+
+ ```ruby
+ scraper.scrape_page!
+ ```
 
- # 3) Instantiate the parser.
- parser = TwitterParser.new(page: agent.page)
+ #### Extract the data
 
- # 4) Parse page and extract attributes.
- parser.parse_page!
- parser.attributes
+ ```ruby
+ scraper.attributes
  # => {bio: "Javascript User Group Milano #milanojs", location: "Milan, Italy"}
  ```
 
- #### `SagroneScraper.scrape`
+ ### `SagroneScraper::Collection`
 
  This is the simplest way to scrape a web page:
 
  ```ruby
  require 'sagrone_scraper'
 
- # 1) First we define a custom parser, for example twitter.
- class TwitterParser < SagroneScraper::Parser
-   TWITTER_PROFILE_URL = /^https?:\/\/twitter.com\/(\w)+\/?$/i
-
-   def self.can_parse?(url)
-     url.match(TWITTER_PROFILE_URL)
-   end
-
-   def bio
-     page.at('.ProfileHeaderCard-bio').text
-   end
-
-   def location
-     page.at('.ProfileHeaderCard-locationText').text
-   end
- end
-
- # 2) We register the parser.
- SagroneScraper.register_parser('TwitterParser')
+ # 1) Define a scraper. For example, the TwitterScraper above.
 
- # 3) We can query for registered parsers.
- SagroneScraper.registered_parsers
- # => ['TwitterParser']
+ # 2) Newly created scrapers are registered automatically.
+ SagroneScraper::Collection.registered_scrapers
+ # => ['TwitterScraper']
 
- # 4) We can now scrape twitter profile URLs.
- SagroneScraper.scrape(url: 'https://twitter.com/Milano_JS')
+ # 3) Here we use the collection to scrape data at a URL.
+ SagroneScraper::Collection.scrape(url: 'https://twitter.com/Milano_JS')
  # => {bio: "Javascript User Group Milano #milanojs", location: "Milan, Italy"}
  ```
 
lib/sagrone_scraper.rb CHANGED
@@ -1,41 +1,10 @@
  require "sagrone_scraper/version"
  require "sagrone_scraper/agent"
- require "sagrone_scraper/parser"
+ require "sagrone_scraper/base"
+ require "sagrone_scraper/collection"
 
  module SagroneScraper
-   Error = Class.new(RuntimeError)
-
    def self.version
      VERSION
    end
-
-   def self.registered_parsers
-     @registered_parsers ||= []
-   end
-
-   def self.register_parser(name)
-     return if registered_parsers.include?(name)
-
-     parser_class = Object.const_get(name)
-     raise Error.new("Expected parser to be a SagroneScraper::Parser.") unless parser_class.ancestors.include?(SagroneScraper::Parser)
-
-     registered_parsers.push(name)
-   end
-
-   def self.scrape(options)
-     url = options.fetch(:url) do
-       raise Error.new('Option "url" must be provided.')
-     end
-
-     parser_class = registered_parsers
-       .map { |parser_name| Object.const_get(parser_name) }
-       .find { |parser_class| parser_class.can_parse?(url) }
-
-     raise Error.new("No registed parser can parse URL #{url}") unless parser_class
-
-     agent = SagroneScraper::Agent.new(url: url)
-     parser = parser_class.new(page: agent.page)
-     parser.parse_page!
-     parser.attributes
-   end
  end
lib/sagrone_scraper/agent.rb CHANGED
@@ -9,12 +9,10 @@ module SagroneScraper
      attr_reader :url, :page
 
      def initialize(options = {})
-       raise Error.new('Exactly one option must be provided: "url" or "page"') unless exactly_one_of(options)
-
-       @url, @page = options[:url], options[:page]
-
-       @url ||= page_url
-       @page ||= http_client.get(url)
+       @url = options.fetch(:url) do
+         raise Error.new('Option "url" must be provided')
+       end
+       @page = http_client.get(url)
      rescue StandardError => error
        raise Error.new(error.message)
      end
@@ -29,18 +27,5 @@ module SagroneScraper
          agent.max_history = 0
        end
      end
-
-     private
-
-     def page_url
-       @page.uri.to_s
-     end
-
-     def exactly_one_of(options)
-       url_present = !!options[:url]
-       page_present = !!options[:page]
-
-       (url_present && !page_present) || (!url_present && page_present)
-     end
    end
  end
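The agent's constructor is now strictly URL-based: it fetches the page eagerly and wraps any underlying failure in `Agent::Error`. A short sketch of the narrowed interface (the URLs are illustrative):

```ruby
require 'sagrone_scraper/agent'

agent = SagroneScraper::Agent.new(url: 'https://twitter.com/Milano_JS')
agent.page  # => Mechanize::Page, fetched in the constructor via http_client

# Omitting :url (including the old page-based construction) now raises:
SagroneScraper::Agent.new
# => SagroneScraper::Agent::Error: Option "url" must be provided

# URI and network errors surface as Agent::Error too, via the rescue clause:
SagroneScraper::Agent.new(url: 'not-a-url')
# => SagroneScraper::Agent::Error: absolute URL needed (not not-a-url)
```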
lib/sagrone_scraper/base.rb ADDED
@@ -0,0 +1,63 @@
+ require 'mechanize'
+ require 'sagrone_scraper'
+
+ module SagroneScraper
+   class Base
+     Error = Class.new(RuntimeError)
+
+     attr_reader :page, :url, :attributes
+
+     def initialize(options = {})
+       raise Error.new('Exactly one option must be provided: "url" or "page"') unless exactly_one_of(options)
+
+       @url, @page = options[:url], options[:page]
+
+       @url ||= page_url
+       @page ||= Agent.new(url: url).page
+       @attributes = {}
+     end
+
+     def scrape_page!
+       return unless self.class.can_scrape?(page_url)
+
+       self.class.method_names.each do |name|
+         attributes[name] = send(name)
+       end
+       nil
+     end
+
+     def self.can_scrape?(url)
+       class_with_method = "#{self}.can_scrape?(url)"
+       raise NotImplementedError.new("Expected #{class_with_method} to be implemented.")
+     end
+
+     def self.should_ignore_method?(name)
+       private_method_defined?(name)
+     end
+
+     private
+
+     def exactly_one_of(options)
+       url_present = !!options[:url]
+       page_present = !!options[:page]
+
+       (url_present && !page_present) || (!url_present && page_present)
+     end
+
+     def page_url
+       @page.uri.to_s
+     end
+
+     def self.method_names
+       @method_names ||= []
+     end
+
+     def self.method_added(name)
+       method_names.push(name) unless should_ignore_method?(name)
+     end
+
+     def self.inherited(klass)
+       SagroneScraper::Collection.register_scraper(klass.name)
+     end
+   end
+ end
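The `method_added` hook is what turns public instance methods into the extraction schema: every instance method defined on a subclass is recorded unless `should_ignore_method?` sees it as private. A small sketch of that bookkeeping, using a hypothetical `DemoScraper`:

```ruby
require 'sagrone_scraper'

class DemoScraper < SagroneScraper::Base
  def self.can_scrape?(url)
    true  # accept any URL, for illustration only
  end

  # Public: recorded by self.method_added, so scrape_page! will call it.
  def title
    page.title
  end

  private

  # Private: filtered out by should_ignore_method?, treated as a helper.
  def helper
    'not extracted'
  end
end

DemoScraper.method_names  # => [:title]
# (method_names is a singleton method, so the `private` keyword in Base
# does not hide it; `private` only affects instance methods.)
```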
lib/sagrone_scraper/collection.rb ADDED
@@ -0,0 +1,38 @@
+ require 'sagrone_scraper'
+
+ module SagroneScraper
+   module Collection
+     Error = Class.new(RuntimeError)
+
+     def self.registered_scrapers
+       @registered_scrapers ||= []
+     end
+
+     def self.register_scraper(name)
+       return if registered_scrapers.include?(name)
+
+       scraper_class = Object.const_get(name)
+       raise Error.new("Expected scraper to be a SagroneScraper::Base.") unless scraper_class.ancestors.include?(SagroneScraper::Base)
+
+       registered_scrapers.push(name)
+       nil
+     end
+
+     def self.scrape(options)
+       url = options.fetch(:url) do
+         raise Error.new('Option "url" must be provided.')
+       end
+
+       scraper_class = registered_scrapers
+         .map { |scraper_name| Object.const_get(scraper_name) }
+         .find { |a_scraper_class| a_scraper_class.can_scrape?(url) }
+
+       raise Error.new("No registed scraper can scrape URL #{url}") unless scraper_class
+
+       agent = SagroneScraper::Agent.new(url: url)
+       scraper = scraper_class.new(page: agent.page)
+       scraper.scrape_page!
+       scraper.attributes
+     end
+   end
+ end
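`Collection.scrape` is a lookup-then-delegate: it finds the first registered scraper whose `can_scrape?(url)` is true, builds an `Agent` for the URL, and returns the scraper's attributes. A sketch of the happy path and the two error branches, assuming the `TwitterScraper` from the specs is loaded:

```ruby
require 'sagrone_scraper'

# Happy path: a registered scraper matches the URL.
SagroneScraper::Collection.scrape(url: 'https://twitter.com/Milano_JS')
# => {bio: "Javascript User Group Milano #milanojs", location: "Milan, Italy"}

# A missing :url option raises Collection::Error.
SagroneScraper::Collection.scrape({})
# => SagroneScraper::Collection::Error: Option "url" must be provided.

# A URL that no registered scraper accepts raises Collection::Error as well.
SagroneScraper::Collection.scrape(url: 'https://twitter.com/Milano_JS/media')
# => SagroneScraper::Collection::Error (no registered scraper accepts this URL)
```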
lib/sagrone_scraper/version.rb CHANGED
@@ -1,3 +1,3 @@
  module SagroneScraper
-   VERSION = "0.0.3"
+   VERSION = "0.1.0"
  end
spec/sagrone_scraper/agent_spec.rb CHANGED
@@ -12,7 +12,7 @@ RSpec.describe SagroneScraper::Agent do
      it { expect(described_class::AGENT_ALIASES).to eq(user_agent_aliases) }
    end
 
-   describe '.http_client' do
+   describe 'self.http_client' do
      subject { described_class.http_client }
 
      it { should be_a(Mechanize) }
@@ -21,7 +21,7 @@ RSpec.describe SagroneScraper::Agent do
    end
 
    describe '#initialize' do
-     describe 'should require exactly one of `url` or `page` option' do
+     describe 'should require `url` option' do
        before do
          stub_request_for('http://example.com', 'www.example.com')
        end
@@ -30,30 +30,21 @@ RSpec.describe SagroneScraper::Agent do
        expect {
          described_class.new
        }.to raise_error(SagroneScraper::Agent::Error,
-         'Exactly one option must be provided: "url" or "page"')
-       end
-
-       it 'when both options are present' do
-         page = Mechanize.new.get('http://example.com')
-
-         expect {
-           described_class.new(url: 'http://example.com', page: page)
-         }.to raise_error(SagroneScraper::Agent::Error,
-           'Exactly one option must be provided: "url" or "page"')
+         'Option "url" must be provided')
      end
    end
 
-   describe 'with page option' do
+   describe 'with url option' do
      before do
        stub_request_for('http://example.com', 'www.example.com')
      end
 
-     let(:page) { Mechanize.new.get('http://example.com') }
-     let(:agent) { described_class.new(page: page) }
+     let(:agent) { described_class.new(url: 'http://example.com') }
 
      it { expect { agent }.to_not raise_error }
      it { expect(agent.page).to be }
-     it { expect(agent.url).to eq 'http://example.com/' }
+     it { expect(agent.page).to be_a(Mechanize::Page) }
+     it { expect(agent.url).to eq 'http://example.com' }
    end
 
    describe 'with invalid URL' do
@@ -62,7 +53,7 @@ RSpec.describe SagroneScraper::Agent do
      it 'should require URL is absolute' do
        @invalid_url = 'not-a-url'
 
-       expect { agent }.to raise_error(SagroneScraper::Agent::Error,
+       expect { agent }.to raise_error(described_class::Error,
          'absolute URL needed (not not-a-url)')
      end
 
@@ -70,7 +61,7 @@ RSpec.describe SagroneScraper::Agent do
      @invalid_url = 'http://'
 
      webmock_allow do
-       expect { agent }.to raise_error(SagroneScraper::Agent::Error,
+       expect { agent }.to raise_error(described_class::Error,
          /bad URI\(absolute but no path\)/)
      end
    end
@@ -79,7 +70,7 @@ RSpec.describe SagroneScraper::Agent do
      @invalid_url = 'http://example'
 
      webmock_allow do
-       expect { agent }.to raise_error(SagroneScraper::Agent::Error,
+       expect { agent }.to raise_error(described_class::Error,
          /getaddrinfo/)
      end
    end
spec/sagrone_scraper/base_spec.rb ADDED
@@ -0,0 +1,162 @@
+ require 'spec_helper'
+ require 'sagrone_scraper/base'
+
+ RSpec.describe SagroneScraper::Base do
+   describe '#initialize' do
+     before do
+       stub_request_for('http://example.com', 'www.example.com')
+     end
+
+     describe 'should require exactly one of `url` or `page` option' do
+       it 'when options is empty' do
+         expect {
+           described_class.new
+         }.to raise_error(described_class::Error,
+           'Exactly one option must be provided: "url" or "page"')
+       end
+
+       it 'when both options are present' do
+         page = Mechanize.new.get('http://example.com')
+
+         expect {
+           described_class.new(url: 'http://example.com', page: page)
+         }.to raise_error(described_class::Error,
+           'Exactly one option must be provided: "url" or "page"')
+       end
+     end
+
+     describe 'with page option' do
+       let(:page) { Mechanize.new.get('http://example.com') }
+       let(:agent) { described_class.new(page: page) }
+
+       it { expect { agent }.to_not raise_error }
+       it { expect(agent.page).to be }
+       it { expect(agent.url).to eq 'http://example.com/' }
+     end
+
+     describe 'with url option' do
+       let(:agent) { described_class.new(url: 'http://example.com') }
+
+       it { expect { agent }.to_not raise_error }
+       it { expect(agent.page).to be }
+       it { expect(agent.url).to eq 'http://example.com' }
+     end
+   end
+
+   describe 'instance methods' do
+     let(:page) { Mechanize::Page.new }
+     let(:scraper) { described_class.new(page: page) }
+
+     describe '#page' do
+       it { expect(scraper.page).to be_a(Mechanize::Page) }
+     end
+
+     describe '#url' do
+       it { expect(scraper.url).to eq page.uri.to_s }
+     end
+
+     describe '#scrape_page!' do
+       it do
+         expect {
+           scraper.scrape_page!
+         }.to raise_error(NotImplementedError, "Expected #{described_class}.can_scrape?(url) to be implemented.")
+       end
+     end
+
+     describe '#attributes' do
+       it { expect(scraper.attributes).to be_empty }
+     end
+   end
+
+   describe 'class methods' do
+     describe 'self.can_scrape?(url)' do
+       it 'must be implemented in subclasses' do
+         expect {
+           described_class.can_scrape?('url')
+         }.to raise_error(NotImplementedError, "Expected #{described_class}.can_scrape?(url) to be implemented.")
+       end
+     end
+   end
+
+   describe 'example TwitterScraper' do
+     before do
+       stub_request_for('https://twitter.com/Milano_JS', 'twitter.com:Milano_JS')
+     end
+
+     let(:page) { Mechanize.new.get('https://twitter.com/Milano_JS') }
+     let(:twitter_scraper) { TwitterScraper.new(page: page) }
+     let(:expected_attributes) do
+       {
+         bio: "Javascript User Group Milano #milanojs",
+         location: "Milan, Italy"
+       }
+     end
+
+     describe '#initialize' do
+       it 'should be a SagroneScraper::Base' do
+         expect(twitter_scraper).to be_a(described_class)
+       end
+     end
+
+     describe 'instance methods' do
+       describe '#scrape_page!' do
+         it 'should succeed' do
+           expect { twitter_scraper.scrape_page! }.to_not raise_error
+         end
+
+         it 'should have attributes present after scraping' do
+           twitter_scraper.scrape_page!
+
+           expect(twitter_scraper.attributes).to_not be_empty
+           expect(twitter_scraper.attributes).to eq expected_attributes
+         end
+
+         it 'should have correct attributes even if scraping is done multiple times' do
+           twitter_scraper.scrape_page!
+           twitter_scraper.scrape_page!
+           twitter_scraper.scrape_page!
+
+           expect(twitter_scraper.attributes).to_not be_empty
+           expect(twitter_scraper.attributes).to eq expected_attributes
+         end
+       end
+     end
+
+     describe 'class methods' do
+       describe 'self.can_scrape?(url)' do
+         it 'should be true for scrapable URLs' do
+           can_scrape = TwitterScraper.can_scrape?('https://twitter.com/Milano_JS')
+
+           expect(can_scrape).to eq(true)
+         end
+
+         it 'should be false for unknown URLs' do
+           can_scrape = TwitterScraper.can_scrape?('https://www.facebook.com/milanojavascript')
+
+           expect(can_scrape).to eq(false)
+         end
+       end
+
+       describe 'self.should_ignore_method?(name)' do
+         let(:private_methods) { %w(text_at) }
+         let(:public_methods) { %w(bio location) }
+
+         it 'ignores private methods' do
+           private_methods.each do |private_method|
+             ignored = TwitterScraper.should_ignore_method?(private_method)
+
+             expect(ignored).to eq(true)
+           end
+         end
+
+         it 'allows public methods' do
+           public_methods.each do |public_method|
+             ignored = TwitterScraper.should_ignore_method?(public_method)
+
+             expect(ignored).to eq(false)
+           end
+         end
+       end
+     end
+   end
+ end
spec/sagrone_scraper/collection_spec.rb ADDED
@@ -0,0 +1,80 @@
+ require 'spec_helper'
+ require 'sagrone_scraper/collection'
+
+ RSpec.describe SagroneScraper::Collection do
+   context 'scrapers registered' do
+     before do
+       described_class.registered_scrapers.clear
+     end
+
+     describe 'self.registered_scrapers' do
+       it { expect(described_class.registered_scrapers).to be_empty }
+       it { expect(described_class.registered_scrapers).to be_a(Array) }
+     end
+
+     describe 'self.register_scraper(name)' do
+       it 'should add a new scraper class to registered scrapers automatically' do
+         class ScraperOne < SagroneScraper::Base ; end
+         class ScraperTwo < SagroneScraper::Base ; end
+
+         expect(described_class.registered_scrapers).to include('ScraperOne')
+         expect(described_class.registered_scrapers).to include('ScraperTwo')
+         expect(described_class.registered_scrapers.size).to eq(2)
+       end
+
+       it 'should check scraper name is an existing constant' do
+         expect {
+           described_class.register_scraper('Unknown')
+         }.to raise_error(NameError, 'uninitialized constant Unknown')
+       end
+
+       it 'should check scraper class inherits from SagroneScraper::Base' do
+         NotScraper = Class.new
+
+         expect {
+           described_class.register_scraper('NotScraper')
+         }.to raise_error(described_class::Error, 'Expected scraper to be a SagroneScraper::Base.')
+       end
+
+       it 'should register multiple scrapers only once' do
+         class TestScraper < SagroneScraper::Base ; end
+
+         described_class.register_scraper('TestScraper')
+         described_class.register_scraper('TestScraper')
+
+         expect(described_class.registered_scrapers).to include('TestScraper')
+         expect(described_class.registered_scrapers.size).to eq 1
+       end
+     end
+   end
+
+   describe 'self.scrape' do
+     before do
+       described_class.registered_scrapers.clear
+       described_class.register_scraper('TwitterScraper')
+
+       stub_request_for('https://twitter.com/Milano_JS', 'twitter.com:Milano_JS')
+     end
+
+     it 'should require `url` option' do
+       expect {
+         described_class.scrape({})
+       }.to raise_error(described_class::Error, 'Option "url" must be provided.')
+     end
+
+     it 'should scrape URL if registered scraper knows how to scrape it' do
+       expected_attributes = {
+         bio: "Javascript User Group Milano #milanojs",
+         location: "Milan, Italy"
+       }
+
+       expect(described_class.scrape(url: 'https://twitter.com/Milano_JS')).to eq(expected_attributes)
+     end
+
+     it 'should raise error if no registered scraper can scrape the URL' do
+       expect {
+         described_class.scrape(url: 'https://twitter.com/Milano_JS/media')
+       }.to raise_error(described_class::Error, "No registed scraper can scrape URL https://twitter.com/Milano_JS/media")
+     end
+   end
+ end
spec/sagrone_scraper_spec.rb CHANGED
@@ -2,80 +2,7 @@ require 'spec_helper'
  require 'sagrone_scraper'
 
  RSpec.describe SagroneScraper do
-   describe '.version' do
+   describe 'self.version' do
      it { expect(SagroneScraper.version).to be_a(String) }
    end
-
-   context 'parsers registered' do
-     before do
-       described_class.registered_parsers.clear
-     end
-
-     describe '.registered_parsers' do
-       it { expect(described_class.registered_parsers).to be_empty }
-       it { expect(described_class.registered_parsers).to be_a(Array) }
-     end
-
-     describe '.register_parser(name)' do
-       TestParser = Class.new(SagroneScraper::Parser)
-       NotParser = Class.new
-
-       it 'should check parser name is an existing constant' do
-         expect {
-           described_class.register_parser('Unknown')
-         }.to raise_error(NameError, 'uninitialized constant Unknown')
-       end
-
-       it 'should check parser class inherits from SagroneScraper::Parser' do
-         expect {
-           described_class.register_parser('NotParser')
-         }.to raise_error(SagroneScraper::Error, 'Expected parser to be a SagroneScraper::Parser.')
-       end
-
-       it 'after adding a "parser" should have it registered' do
-         described_class.register_parser('TestParser')
-
-         expect(described_class.registered_parsers).to include('TestParser')
-         expect(described_class.registered_parsers.size).to eq 1
-       end
-
-       it 'adding same "parser" multiple times should register it once' do
-         described_class.register_parser('TestParser')
-         described_class.register_parser('TestParser')
-
-         expect(described_class.registered_parsers).to include('TestParser')
-         expect(described_class.registered_parsers.size).to eq 1
-       end
-     end
-   end
-
-   describe '.scrape' do
-     before do
-       SagroneScraper.registered_parsers.clear
-       SagroneScraper.register_parser('TwitterParser')
-
-       stub_request_for('https://twitter.com/Milano_JS', 'twitter.com:Milano_JS')
-     end
-
-     it 'should `url` option' do
-       expect {
-         described_class.scrape({})
-       }.to raise_error(SagroneScraper::Error, 'Option "url" must be provided.')
-     end
-
-     it 'should scrape URL if registered parser knows how to parse it' do
-       expected_attributes = {
-         bio: "Javascript User Group Milano #milanojs",
-         location: "Milan, Italy"
-       }
-
-       expect(described_class.scrape(url: 'https://twitter.com/Milano_JS')).to eq(expected_attributes)
-     end
-
-     it 'should return raise error if no registered paser can parse the URL' do
-       expect {
-         described_class.scrape(url: 'https://twitter.com/Milano_JS/media')
-       }.to raise_error(SagroneScraper::Error, "No registed parser can parse URL https://twitter.com/Milano_JS/media")
-     end
-   end
  end
spec/support/test_scrapers/twitter_scraper.rb ADDED
@@ -0,0 +1,23 @@
+ require 'sagrone_scraper/base'
+
+ class TwitterScraper < SagroneScraper::Base
+   TWITTER_PROFILE_URL = /^https?:\/\/twitter.com\/(\w)+\/?$/i
+
+   def self.can_scrape?(url)
+     url.match(TWITTER_PROFILE_URL) ? true : false
+   end
+
+   def bio
+     text_at('.ProfileHeaderCard-bio')
+   end
+
+   def location
+     text_at('.ProfileHeaderCard-locationText')
+   end
+
+   private
+
+   def text_at(selector)
+     page.at(selector).text if page.at(selector)
+   end
+ end
metadata CHANGED
@@ -1,14 +1,14 @@
  --- !ruby/object:Gem::Specification
  name: sagrone_scraper
  version: !ruby/object:Gem::Version
-   version: 0.0.3
+   version: 0.1.0
  platform: ruby
  authors:
  - Marius Colacioiu
  autorequire:
  bindir: bin
  cert_chain: []
- date: 2015-03-10 00:00:00.000000000 Z
+ date: 2015-04-09 00:00:00.000000000 Z
  dependencies:
  - !ruby/object:Gem::Dependency
    name: mechanize
@@ -113,17 +113,19 @@ files:
  - Rakefile
  - lib/sagrone_scraper.rb
  - lib/sagrone_scraper/agent.rb
- - lib/sagrone_scraper/parser.rb
+ - lib/sagrone_scraper/base.rb
+ - lib/sagrone_scraper/collection.rb
  - lib/sagrone_scraper/version.rb
  - sagrone_scraper.gemspec
  - spec/sagrone_scraper/agent_spec.rb
- - spec/sagrone_scraper/parser_spec.rb
+ - spec/sagrone_scraper/base_spec.rb
+ - spec/sagrone_scraper/collection_spec.rb
  - spec/sagrone_scraper_spec.rb
  - spec/spec_helper.rb
  - spec/stub_helper.rb
- - spec/support/test_parsers/twitter_parser.rb
  - spec/support/test_responses/twitter.com:Milano_JS
  - spec/support/test_responses/www.example.com
+ - spec/support/test_scrapers/twitter_scraper.rb
  homepage: ''
  licenses:
  - Apache License 2.0
@@ -150,10 +152,11 @@ specification_version: 4
  summary: Sagrone Ruby Scraper.
  test_files:
  - spec/sagrone_scraper/agent_spec.rb
- - spec/sagrone_scraper/parser_spec.rb
+ - spec/sagrone_scraper/base_spec.rb
+ - spec/sagrone_scraper/collection_spec.rb
  - spec/sagrone_scraper_spec.rb
  - spec/spec_helper.rb
  - spec/stub_helper.rb
- - spec/support/test_parsers/twitter_parser.rb
  - spec/support/test_responses/twitter.com:Milano_JS
  - spec/support/test_responses/www.example.com
+ - spec/support/test_scrapers/twitter_scraper.rb
lib/sagrone_scraper/parser.rb DELETED
@@ -1,42 +0,0 @@
- require 'mechanize'
-
- module SagroneScraper
-   class Parser
-     Error = Class.new(RuntimeError)
-
-     attr_reader :page, :page_url, :attributes
-
-     def initialize(options = {})
-       @page = options.fetch(:page) do
-         raise Error.new('Option "page" must be provided.')
-       end
-       @page_url = @page.uri.to_s
-       @attributes = {}
-     end
-
-     def parse_page!
-       return unless self.class.can_parse?(page_url)
-
-       self.class.method_names.each do |name|
-         attributes[name] = send(name)
-       end
-       nil
-     end
-
-     def self.can_parse?(url)
-       class_with_method = "#{self}.can_parse?(url)"
-       raise NotImplementedError.new("Expected #{class_with_method} to be implemented.")
-     end
-
-     private
-
-     def self.method_names
-       @method_names ||= []
-     end
-
-     def self.method_added(name)
-       puts "added #{name} to #{self}"
-       method_names.push(name)
-     end
-   end
- end
spec/sagrone_scraper/parser_spec.rb DELETED
@@ -1,83 +0,0 @@
- require 'spec_helper'
- require 'sagrone_scraper/parser'
-
- RSpec.describe SagroneScraper::Parser do
-   describe '#initialize' do
-     it 'requires a "page" option' do
-       expect {
-         described_class.new
-       }.to raise_error(SagroneScraper::Parser::Error, 'Option "page" must be provided.')
-     end
-   end
-
-   describe 'instance methods' do
-     let(:page) { Mechanize::Page.new }
-     let(:parser) { described_class.new(page: page) }
-
-     describe '#page' do
-       it { expect(parser.page).to be_a(Mechanize::Page) }
-     end
-
-     describe '#page_url' do
-       it { expect(parser.page_url).to be }
-       it { expect(parser.page_url).to eq page.uri.to_s }
-     end
-
-     describe '#parse_page!' do
-       it do
-         expect {
-           parser.parse_page!
-         }.to raise_error(NotImplementedError, "Expected #{described_class}.can_parse?(url) to be implemented.")
-       end
-     end
-
-     describe '#attributes' do
-       it { expect(parser.attributes).to be_empty }
-     end
-   end
-
-   describe 'class methods' do
-     describe '.can_parse?(url)' do
-       it do
-         expect {
-           described_class.can_parse?('url')
-         }.to raise_error(NotImplementedError, "Expected #{described_class}.can_parse?(url) to be implemented.")
-       end
-     end
-   end
-
-   describe 'create custom TwitterParser from SagroneScraper::Parser' do
-     before do
-       stub_request_for('https://twitter.com/Milano_JS', 'twitter.com:Milano_JS')
-     end
-
-     let(:page) { Mechanize.new.get('https://twitter.com/Milano_JS') }
-     let(:twitter_parser) { TwitterParser.new(page: page) }
-     let(:expected_attributes) do
-       {
-         bio: "Javascript User Group Milano #milanojs",
-         location: "Milan, Italy"
-       }
-     end
-
-     describe 'should be able to parse page without errors' do
-       it { expect { twitter_parser.parse_page! }.to_not raise_error }
-     end
-
-     it 'should have attributes present after parsing' do
-       twitter_parser.parse_page!
-
-       expect(twitter_parser.attributes).to_not be_empty
-       expect(twitter_parser.attributes).to eq expected_attributes
-     end
-
-     it 'should have correct attributes event if parsing is done multiple times' do
-       twitter_parser.parse_page!
-       twitter_parser.parse_page!
-       twitter_parser.parse_page!
-
-       expect(twitter_parser.attributes).to_not be_empty
-       expect(twitter_parser.attributes).to eq expected_attributes
-     end
-   end
- end
spec/support/test_parsers/twitter_parser.rb DELETED
@@ -1,17 +0,0 @@
- require 'sagrone_scraper/parser'
-
- class TwitterParser < SagroneScraper::Parser
-   TWITTER_PROFILE_URL = /^https?:\/\/twitter.com\/(\w)+\/?$/i
-
-   def self.can_parse?(url)
-     url.match(TWITTER_PROFILE_URL)
-   end
-
-   def bio
-     page.at('.ProfileHeaderCard-bio').text
-   end
-
-   def location
-     page.at('.ProfileHeaderCard-locationText').text
-   end
- end