sagrone_scraper 0.0.2 → 0.0.3

checksums.yaml CHANGED
@@ -1,7 +1,7 @@
 ---
 SHA1:
-  metadata.gz: 82fc9ba674d9d3398b5f596d513fbb8eeb8abe3b
-  data.tar.gz: a8acac43dc318b6dcad951d57d3e1ce59478c67d
+  metadata.gz: 80b3c30080aba0c8b1da8cfdcdefbb8e6ef527e1
+  data.tar.gz: c9992757e44377ed3081089f0348cdf1535e8a8e
 SHA512:
-  metadata.gz: 0dc5041f027f685ac241fcf0e103d3d8a6fa225002cdedd65a2f072fafb43904d3514a7cabad9a83bb0d3f02a3f7241c2c8dbeffdb48f133439f308319462870
-  data.tar.gz: 15644f0da27c4cb3f2452ac11958261acc7d0bd5e3e0539a7d7817b47f5c5096ea3f0d107b2f456f34f6fe8273fb7742128cd72f5c142c8416a521262ffb8e09
+  metadata.gz: 97477b7732ec3485aa7ba5ef2c7cb16ac130d6b1a1f6ee8b57deb5cf53fb6ae50bafdec13d1c575dbc382a748583a3e70b0da55660bd7065daaebc1479d466c4
+  data.tar.gz: fc8762b63b3429dcbd004bbdc39c6ef5bd7351e8b483ec8185c79b985ab5b2091601005a533270fc511e16d329f31a20616e1ea068f2b7ce7c3a0b3928087c5f
data/CHANGELOG.md CHANGED
@@ -1,5 +1,10 @@
 ### HEAD
 
+### 0.0.3
+
+- add `SagroneScraper::Parser.can_parse?(url)` class method, which must be implemented in subclasses
+- add `SagroneScraper` logic to _scrape_ a URL based on a set of _registered parsers_
+
 ### 0.0.2
 
 - add `SagroneScraper::Parser`
data/README.md CHANGED
@@ -3,7 +3,7 @@
 [![Gem Version](https://badge.fury.io/rb/sagrone_scraper.svg)](http://badge.fury.io/rb/sagrone_scraper)
 [![Build Status](https://travis-ci.org/Sagrone/scraper.svg?branch=master)](https://travis-ci.org/Sagrone/scraper)
 
-Simple library to scrap web pages. Bellow you will find information on [how to use it](#usage).
+Simple library to scrap web pages. Bellow you will find information on [how to use it](#basic-usage).
 
 ## Table of Contents
 
@@ -12,6 +12,7 @@ Simple library to scrap web pages. Bellow you will find information on [how to u
 - [Modules](#modules)
   + [`SagroneScraper::Agent`](#sagronescraperagent)
   + [`SagroneScraper::Parser`](#sagronescraperparser)
+  + [`SagroneScraper.scrape`](#sagronescraperscrape)
 
 ## Installation
 
@@ -40,7 +41,7 @@ The agent is responsible for scraping a web page from a URL. Here is how you can
 1. one way is to pass it a `url` option
 
 ```ruby
-require 'sagrone_scraper/agent'
+require 'sagrone_scraper'
 
 agent = SagroneScraper::Agent.new(url: 'https://twitter.com/Milano_JS')
 agent.page
@@ -53,7 +54,7 @@ The agent is responsible for scraping a web page from a URL. Here is how you can
 2. another way is to pass a `page` option (`Mechanize::Page`)
 
 ```ruby
-require 'sagrone_scraper/agent'
+require 'sagrone_scraper'
 
 mechanize_agent = Mechanize.new { |agent| agent.user_agent_alias = 'Linux Firefox' }
 page = mechanize_agent.get('https://twitter.com/Milano_JS')
@@ -74,11 +75,16 @@ The _parser_ is responsible for extracting structured data from a _page_. The pa
 Example usage:
 
 ```ruby
-require 'sagrone_scraper/agent'
-require 'sagrone_scraper/parser'
+require 'sagrone_scraper'
 
 # 1) First define a custom parser, for example twitter.
 class TwitterParser < SagroneScraper::Parser
+  TWITTER_PROFILE_URL = /^https?:\/\/twitter.com\/(\w)+\/?$/i
+
+  def self.can_parse?(url)
+    url.match(TWITTER_PROFILE_URL)
+  end
+
   def bio
     page.at('.ProfileHeaderCard-bio').text
   end
@@ -100,6 +106,42 @@ parser.attributes
 # => {bio: "Javascript User Group Milano #milanojs", location: "Milan, Italy"}
 ```
 
+#### `SagroneScraper.scrape`
+
+This is the simplest way to scrape a web page:
+
+```ruby
+require 'sagrone_scraper'
+
+# 1) First we define a custom parser, for example twitter.
+class TwitterParser < SagroneScraper::Parser
+  TWITTER_PROFILE_URL = /^https?:\/\/twitter.com\/(\w)+\/?$/i
+
+  def self.can_parse?(url)
+    url.match(TWITTER_PROFILE_URL)
+  end
+
+  def bio
+    page.at('.ProfileHeaderCard-bio').text
+  end
+
+  def location
+    page.at('.ProfileHeaderCard-locationText').text
+  end
+end
+
+# 2) We register the parser.
+SagroneScraper.register_parser('TwitterParser')
+
+# 3) We can query for registered parsers.
+SagroneScraper.registered_parsers
+# => ['TwitterParser']
+
+# 4) We can now scrape twitter profile URLs.
+SagroneScraper.scrape(url: 'https://twitter.com/Milano_JS')
+# => {bio: "Javascript User Group Milano #milanojs", location: "Milan, Italy"}
+```
+
 ## Contributing
 
 1. Fork it ( https://github.com/[my-github-username]/sagrone_scraper/fork )
data/lib/sagrone_scraper.rb CHANGED
@@ -1,7 +1,41 @@
 require "sagrone_scraper/version"
+require "sagrone_scraper/agent"
+require "sagrone_scraper/parser"
 
 module SagroneScraper
+  Error = Class.new(RuntimeError)
+
   def self.version
     VERSION
   end
+
+  def self.registered_parsers
+    @registered_parsers ||= []
+  end
+
+  def self.register_parser(name)
+    return if registered_parsers.include?(name)
+
+    parser_class = Object.const_get(name)
+    raise Error.new("Expected parser to be a SagroneScraper::Parser.") unless parser_class.ancestors.include?(SagroneScraper::Parser)
+
+    registered_parsers.push(name)
+  end
+
+  def self.scrape(options)
+    url = options.fetch(:url) do
+      raise Error.new('Option "url" must be provided.')
+    end
+
+    parser_class = registered_parsers
+      .map { |parser_name| Object.const_get(parser_name) }
+      .find { |parser_class| parser_class.can_parse?(url) }
+
+    raise Error.new("No registed parser can parse URL #{url}") unless parser_class
+
+    agent = SagroneScraper::Agent.new(url: url)
+    parser = parser_class.new(page: agent.page)
+    parser.parse_page!
+    parser.attributes
+  end
 end
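The registration-and-dispatch flow introduced in this file can be sketched in isolation. The following is a minimal stand-in (not the gem itself: `MiniScraper` and `ProfileParser` are hypothetical names, and the Mechanize page-fetching step is omitted) showing how a URL is routed to the first registered parser whose `can_parse?` accepts it:

```ruby
# Minimal stand-in for the parser registry added in 0.0.3.
# Parsers register by name; a URL is dispatched to the first
# registered class whose can_parse? accepts it.
module MiniScraper
  Error = Class.new(RuntimeError)

  def self.registered_parsers
    @registered_parsers ||= []
  end

  def self.register_parser(name)
    # Registering the same name twice is a no-op.
    return if registered_parsers.include?(name)
    registered_parsers.push(name)
  end

  def self.parser_for(url)
    parser = registered_parsers
      .map  { |name| Object.const_get(name) }
      .find { |klass| klass.can_parse?(url) }
    raise Error, "No registered parser can parse URL #{url}" unless parser
    parser
  end
end

class ProfileParser
  PROFILE_URL = /\Ahttps?:\/\/twitter\.com\/\w+\/?\z/i

  def self.can_parse?(url)
    url.match(PROFILE_URL)
  end
end

MiniScraper.register_parser('ProfileParser')
MiniScraper.register_parser('ProfileParser')  # duplicate, ignored

puts MiniScraper.registered_parsers.size                      # => 1
puts MiniScraper.parser_for('https://twitter.com/Milano_JS')  # => ProfileParser
```

Storing class *names* (strings) rather than class objects, as the gem does, defers constant lookup to dispatch time via `Object.const_get`.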
data/lib/sagrone_scraper/parser.rb CHANGED
@@ -4,22 +4,30 @@ module SagroneScraper
   class Parser
     Error = Class.new(RuntimeError)
 
-    attr_reader :page, :attributes
+    attr_reader :page, :page_url, :attributes
 
     def initialize(options = {})
       @page = options.fetch(:page) do
         raise Error.new('Option "page" must be provided.')
       end
+      @page_url = @page.uri.to_s
       @attributes = {}
     end
 
     def parse_page!
+      return unless self.class.can_parse?(page_url)
+
       self.class.method_names.each do |name|
         attributes[name] = send(name)
       end
       nil
     end
 
+    def self.can_parse?(url)
+      class_with_method = "#{self}.can_parse?(url)"
+      raise NotImplementedError.new("Expected #{class_with_method} to be implemented.")
+    end
+
     private
 
     def self.method_names
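The base `can_parse?` above is a template method: it raises `NotImplementedError` until a subclass overrides it. A self-contained sketch of the same contract (stand-in classes `BaseParser` and `GoodParser`, not the gem's):

```ruby
# Template-method pattern: the base class defines the contract,
# subclasses must supply the implementation.
class BaseParser
  def self.can_parse?(url)
    raise NotImplementedError, "Expected #{self}.can_parse?(url) to be implemented."
  end
end

class GoodParser < BaseParser
  def self.can_parse?(url)
    url.start_with?('https://example.com/')
  end
end

begin
  BaseParser.can_parse?('https://example.com/')
rescue NotImplementedError => e
  puts e.message  # => Expected BaseParser.can_parse?(url) to be implemented.
end

puts GoodParser.can_parse?('https://example.com/page')  # => true
```

Because `#{self}` is interpolated in a class method, the error message names the concrete class that failed to override the method, which is exactly what the specs below assert against.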
data/lib/sagrone_scraper/version.rb CHANGED
@@ -1,3 +1,3 @@
 module SagroneScraper
-  VERSION = "0.0.2"
+  VERSION = "0.0.3"
 end
data/spec/sagrone_scraper/agent_spec.rb CHANGED
@@ -27,8 +27,10 @@ RSpec.describe SagroneScraper::Agent do
     end
 
     it 'when options is empty' do
-      expect { described_class.new }.to raise_error(SagroneScraper::Agent::Error,
-        /Exactly one option must be provided: "url" or "page"/)
+      expect {
+        described_class.new
+      }.to raise_error(SagroneScraper::Agent::Error,
+        'Exactly one option must be provided: "url" or "page"')
     end
 
     it 'when both options are present' do
@@ -37,7 +39,7 @@ RSpec.describe SagroneScraper::Agent do
       expect {
         described_class.new(url: 'http://example.com', page: page)
       }.to raise_error(SagroneScraper::Agent::Error,
-        /Exactly one option must be provided: "url" or "page"/)
+        'Exactly one option must be provided: "url" or "page"')
     end
   end
 
@@ -61,7 +63,7 @@ RSpec.describe SagroneScraper::Agent do
       @invalid_url = 'not-a-url'
 
       expect { agent }.to raise_error(SagroneScraper::Agent::Error,
-        /absolute URL needed \(not not-a-url\)/)
+        'absolute URL needed (not not-a-url)')
     end
 
     it 'should require absolute path' do
data/spec/sagrone_scraper/parser_spec.rb CHANGED
@@ -4,7 +4,9 @@ require 'sagrone_scraper/parser'
 RSpec.describe SagroneScraper::Parser do
   describe '#initialize' do
     it 'requires a "page" option' do
-      expect { described_class.new }.to raise_error(SagroneScraper::Parser::Error, /Option "page" must be provided./)
+      expect {
+        described_class.new
+      }.to raise_error(SagroneScraper::Parser::Error, 'Option "page" must be provided.')
     end
   end
 
@@ -16,8 +18,17 @@ RSpec.describe SagroneScraper::Parser do
     it { expect(parser.page).to be_a(Mechanize::Page) }
   end
 
+  describe '#page_url' do
+    it { expect(parser.page_url).to be }
+    it { expect(parser.page_url).to eq page.uri.to_s }
+  end
+
   describe '#parse_page!' do
-    it { expect(parser.parse_page!).to eq nil }
+    it do
+      expect {
+        parser.parse_page!
+      }.to raise_error(NotImplementedError, "Expected #{described_class}.can_parse?(url) to be implemented.")
+    end
   end
 
   describe '#attributes' do
@@ -25,17 +36,17 @@ RSpec.describe SagroneScraper::Parser do
     end
   end
 
-  describe 'create custom TwitterParser from SagroneScraper::Parser' do
-    class TwitterParser < SagroneScraper::Parser
-      def bio
-        page.at('.ProfileHeaderCard-bio').text
-      end
-
-      def location
-        page.at('.ProfileHeaderCard-locationText').text
+  describe 'class methods' do
+    describe '.can_parse?(url)' do
+      it do
+        expect {
+          described_class.can_parse?('url')
+        }.to raise_error(NotImplementedError, "Expected #{described_class}.can_parse?(url) to be implemented.")
       end
     end
+  end
 
+  describe 'create custom TwitterParser from SagroneScraper::Parser' do
     before do
       stub_request_for('https://twitter.com/Milano_JS', 'twitter.com:Milano_JS')
     end
data/spec/sagrone_scraper_spec.rb CHANGED
@@ -5,4 +5,77 @@ RSpec.describe SagroneScraper do
   describe '.version' do
     it { expect(SagroneScraper.version).to be_a(String) }
   end
+
+  context 'parsers registered' do
+    before do
+      described_class.registered_parsers.clear
+    end
+
+    describe '.registered_parsers' do
+      it { expect(described_class.registered_parsers).to be_empty }
+      it { expect(described_class.registered_parsers).to be_a(Array) }
+    end
+
+    describe '.register_parser(name)' do
+      TestParser = Class.new(SagroneScraper::Parser)
+      NotParser = Class.new
+
+      it 'should check parser name is an existing constant' do
+        expect {
+          described_class.register_parser('Unknown')
+        }.to raise_error(NameError, 'uninitialized constant Unknown')
+      end
+
+      it 'should check parser class inherits from SagroneScraper::Parser' do
+        expect {
+          described_class.register_parser('NotParser')
+        }.to raise_error(SagroneScraper::Error, 'Expected parser to be a SagroneScraper::Parser.')
+      end
+
+      it 'after adding a "parser" should have it registered' do
+        described_class.register_parser('TestParser')
+
+        expect(described_class.registered_parsers).to include('TestParser')
+        expect(described_class.registered_parsers.size).to eq 1
+      end
+
+      it 'adding same "parser" multiple times should register it once' do
+        described_class.register_parser('TestParser')
+        described_class.register_parser('TestParser')
+
+        expect(described_class.registered_parsers).to include('TestParser')
+        expect(described_class.registered_parsers.size).to eq 1
+      end
+    end
+  end
+
+  describe '.scrape' do
+    before do
+      SagroneScraper.registered_parsers.clear
+      SagroneScraper.register_parser('TwitterParser')
+
+      stub_request_for('https://twitter.com/Milano_JS', 'twitter.com:Milano_JS')
+    end
+
+    it 'should `url` option' do
+      expect {
+        described_class.scrape({})
+      }.to raise_error(SagroneScraper::Error, 'Option "url" must be provided.')
+    end
+
+    it 'should scrape URL if registered parser knows how to parse it' do
+      expected_attributes = {
+        bio: "Javascript User Group Milano #milanojs",
+        location: "Milan, Italy"
+      }
+
+      expect(described_class.scrape(url: 'https://twitter.com/Milano_JS')).to eq(expected_attributes)
+    end
+
+    it 'should return raise error if no registered paser can parse the URL' do
+      expect {
+        described_class.scrape(url: 'https://twitter.com/Milano_JS/media')
+      }.to raise_error(SagroneScraper::Error, "No registed parser can parse URL https://twitter.com/Milano_JS/media")
+    end
+  end
 end
data/spec/spec_helper.rb CHANGED
@@ -1,5 +1,7 @@
 require 'stub_helper'
 
+Dir["./spec/support/**/*.rb"].sort.each { |file| require file }
+
 RSpec.configure do |config|
   config.include(StubHelper)
 
data/spec/stub_helper.rb CHANGED
@@ -17,6 +17,6 @@ module StubHelper
   end
 
   def get_response_file(name)
-    IO.read(File.join('spec/test_responses', "#{name}"))
+    IO.read(File.join('spec/support/test_responses', "#{name}"))
   end
 end
data/spec/support/test_parsers/twitter_parser.rb ADDED
@@ -0,0 +1,17 @@
+require 'sagrone_scraper/parser'
+
+class TwitterParser < SagroneScraper::Parser
+  TWITTER_PROFILE_URL = /^https?:\/\/twitter.com\/(\w)+\/?$/i
+
+  def self.can_parse?(url)
+    url.match(TWITTER_PROFILE_URL)
+  end
+
+  def bio
+    page.at('.ProfileHeaderCard-bio').text
+  end
+
+  def location
+    page.at('.ProfileHeaderCard-locationText').text
+  end
+end
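The `TWITTER_PROFILE_URL` pattern in this test parser accepts profile URLs but rejects sub-pages, which is why scraping `https://twitter.com/Milano_JS/media` falls through to the "no registered parser" error in the specs. The same regex can be checked in isolation (note the dot in `twitter.com` is unescaped, so it matches any character):

```ruby
# The exact pattern used by the test parser, checked standalone.
TWITTER_PROFILE_URL = /^https?:\/\/twitter.com\/(\w)+\/?$/i

puts !!TWITTER_PROFILE_URL.match('https://twitter.com/Milano_JS')        # => true
puts !!TWITTER_PROFILE_URL.match('https://twitter.com/Milano_JS/media')  # => false
puts !!TWITTER_PROFILE_URL.match('https://twitterXcom/foo')              # => true (unescaped dot)
```

A stricter pattern would escape the dot (`twitter\.com`) and anchor with `\A`/`\z` rather than `^`/`$`, since the latter match line boundaries in Ruby.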
metadata CHANGED
@@ -1,14 +1,14 @@
 --- !ruby/object:Gem::Specification
 name: sagrone_scraper
 version: !ruby/object:Gem::Version
-  version: 0.0.2
+  version: 0.0.3
 platform: ruby
 authors:
 - Marius Colacioiu
 autorequire:
 bindir: bin
 cert_chain: []
-date: 2015-03-06 00:00:00.000000000 Z
+date: 2015-03-10 00:00:00.000000000 Z
 dependencies:
 - !ruby/object:Gem::Dependency
   name: mechanize
@@ -121,8 +121,9 @@ files:
 - spec/sagrone_scraper_spec.rb
 - spec/spec_helper.rb
 - spec/stub_helper.rb
-- spec/test_responses/twitter.com:Milano_JS
-- spec/test_responses/www.example.com
+- spec/support/test_parsers/twitter_parser.rb
+- spec/support/test_responses/twitter.com:Milano_JS
+- spec/support/test_responses/www.example.com
 homepage: ''
 licenses:
 - Apache License 2.0
@@ -153,5 +154,6 @@ test_files:
 - spec/sagrone_scraper_spec.rb
 - spec/spec_helper.rb
 - spec/stub_helper.rb
-- spec/test_responses/twitter.com:Milano_JS
-- spec/test_responses/www.example.com
+- spec/support/test_parsers/twitter_parser.rb
+- spec/support/test_responses/twitter.com:Milano_JS
+- spec/support/test_responses/www.example.com