sagrone_scraper 0.0.3 → 0.1.0
- checksums.yaml +4 -4
- data/CHANGELOG.md +16 -0
- data/Guardfile +1 -1
- data/README.md +68 -69
- data/lib/sagrone_scraper.rb +2 -33
- data/lib/sagrone_scraper/agent.rb +4 -19
- data/lib/sagrone_scraper/base.rb +63 -0
- data/lib/sagrone_scraper/collection.rb +38 -0
- data/lib/sagrone_scraper/version.rb +1 -1
- data/spec/sagrone_scraper/agent_spec.rb +10 -19
- data/spec/sagrone_scraper/base_spec.rb +162 -0
- data/spec/sagrone_scraper/collection_spec.rb +80 -0
- data/spec/sagrone_scraper_spec.rb +1 -74
- data/spec/support/test_scrapers/twitter_scraper.rb +23 -0
- metadata +10 -7
- data/lib/sagrone_scraper/parser.rb +0 -42
- data/spec/sagrone_scraper/parser_spec.rb +0 -83
- data/spec/support/test_parsers/twitter_parser.rb +0 -17
checksums.yaml
CHANGED
@@ -1,7 +1,7 @@
 ---
 SHA1:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: 7b4420d9fd6c018746bbdb2ca17ea53402ca3857
+  data.tar.gz: 7ff6f5cdb840b214ad7eaf8f7ccb23e2efe9ff24
 SHA512:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: b051e3982a7959f19cb25b29129905cd26e5fbc2529a012a0d8b091c5dfdc78d985c79b8aacb6c8142e2c030aed107d91a69d94aba8f6b2c16c57eaeb0f284cb
+  data.tar.gz: b404f338c87a266e9bcc03705327830be76d73018e5ff789017890cb77d7b0b0157daff1dbace000d542be0fd272a7912a75a018d88154da3e165c25e8e4bea5
data/CHANGELOG.md
CHANGED
@@ -1,5 +1,21 @@
 ### HEAD
 
+### 0.1.0
+
+- simplify library entities:
+  - `SagroneScraper::Parser` is now `SagroneScraper::Base`
+  - `SagroneScraper::Base` is the base class for creating new scrapers
+  - `SagroneScraper.registered_parsers` is now `SagroneScraper.registered_scrapers`
+  - `SagroneScraper.register_parser` is now `SagroneScraper.register_scraper`
+  - `SagroneScraper::Base#parse_page!` renamed to `SagroneScraper::Base#scrape_page!`
+  - `SagroneScraper::Base.can_parse?` renamed to `SagroneScraper::Base.can_scrape?`
+  - `SagroneScraper::Base` _private_ instance methods are not used for extracting data, only _public_ instance methods are.
+- `SagroneScraper::Collection`:
+  - registers newly created scrapers automatically
+  - knows how to scrape a page from a generic URL (if there is a valid scraper for that URL)
+- `SagroneScraper::Base` takes exactly one of the `url` or `page` options
+- `SagroneScraper::Agent` takes only a `url` option
+
 ### 0.0.3
 
 - add `SagroneScraper::Parser.can_parse?(url)` class method, which must be implemented in subclasses
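The API renames in this release can be captured as an old-to-new mapping; a minimal sketch (the `RENAMES` constant is hypothetical, drawn only from the changelog entries above):

```ruby
# Hypothetical old-name => new-name map, drawn from the 0.1.0 changelog.
RENAMES = {
  'SagroneScraper::Parser'            => 'SagroneScraper::Base',
  'SagroneScraper.registered_parsers' => 'SagroneScraper.registered_scrapers',
  'SagroneScraper.register_parser'    => 'SagroneScraper.register_scraper',
  'SagroneScraper::Base#parse_page!'  => 'SagroneScraper::Base#scrape_page!',
  'SagroneScraper::Base.can_parse?'   => 'SagroneScraper::Base.can_scrape?'
}.freeze

# Look up where an old API name went:
RENAMES['SagroneScraper::Parser']
# => "SagroneScraper::Base"
```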
data/Guardfile
CHANGED
@@ -8,7 +8,7 @@ guard :rspec, cmd: "bundle exec rspec" do
   rspec = dsl.rspec
   watch(rspec.spec_files)
   watch(%r{^spec/(.+)_helper\.rb$}) { "spec" }
-  watch(%r{^spec/
+  watch(%r{^spec/support/(.+)$}) { "spec" }
 
   # Library files
   watch(%r{^lib/(.+)\.rb$}) { |m| "spec/#{m[1]}_spec.rb" }
data/README.md
CHANGED
@@ -11,8 +11,12 @@ Simple library to scrap web pages. Bellow you will find information on [how to u
 - [Basic Usage](#basic-usage)
 - [Modules](#modules)
   + [`SagroneScraper::Agent`](#sagronescraperagent)
-  + [`SagroneScraper::
+  + [`SagroneScraper::Base`](#sagronescraperbase)
+    * [Create a scraper class](#create-a-scraper-class)
+    * [Instantiate the scraper](#instantiate-the-scraper)
+    * [Scrape the page](#scrape-the-page)
+    * [Extract the data](#extract-the-data)
+  + [`SagroneScraper::Collection`](#sagronescrapercollection)
 
 ## Installation
 
@@ -30,115 +34,110 @@ Or install it yourself as:
 
 ## Basic Usage
 
-
+In order to _scrape a web page_ you will need to:
 
-
-
-
-
-The agent is responsible for scraping a web page from a URL. Here is how you can create an `agent`:
+1. [create a new scraper class](#create-a-scraper-class) by inheriting from `SagroneScraper::Base`, and
+2. [instantiate it with a `url` or `page`](#instantiate-the-scraper)
+3. then you can use the scraper instance to [scrape the page](#scrape-the-page) and [extract structured data](#extract-the-data)
 
-
+More information at the [`SagroneScraper::Base`](#sagronescraperbase) module.
 
-
-require 'sagrone_scraper'
+## Modules
 
-
-agent.page
-# => Mechanize::Page
+### `SagroneScraper::Agent`
 
-
-# => "Javascript User Group Milano #milanojs"
-```
+The agent is responsible for obtaining a page, `Mechanize::Page`, from a URL. Here is how you can create an `agent`:
 
-
+```ruby
+require 'sagrone_scraper'
 
-
-
+agent = SagroneScraper::Agent.new(url: 'https://twitter.com/Milano_JS')
+agent.page
+# => Mechanize::Page
 
-
-
-
+agent.page.at('.ProfileHeaderCard-bio').text
+# => "Javascript User Group Milano #milanojs"
+```
 
-
-agent.url
-# => "https://twitter.com/Milano_JS"
+### `SagroneScraper::Base`
 
-
-# => "Milan, Italy"
-```
+Here we define a `TwitterScraper`, by inheriting from the `SagroneScraper::Base` class.
 
-
+The _scraper_ is responsible for extracting structured data from a _page_ or a _url_. The _page_ can be obtained by the [_agent_](#sagronescraperagent).
 
-
+_Public_ instance methods will be used to extract data, whereas _private_ instance methods will be ignored (seen as helper methods). Most importantly, the `self.can_scrape?(url)` class method ensures that only a known subset of pages can be scraped for data.
 
-
+#### Create a scraper class
 
 ```ruby
 require 'sagrone_scraper'
 
-
-class TwitterParser < SagroneScraper::Parser
+class TwitterScraper < SagroneScraper::Base
   TWITTER_PROFILE_URL = /^https?:\/\/twitter.com\/(\w)+\/?$/i
 
-  def self.
-    url.match(TWITTER_PROFILE_URL)
+  def self.can_scrape?(url)
+    url.match(TWITTER_PROFILE_URL) ? true : false
   end
 
+  # Public instance methods are used for data extraction.
+
   def bio
-
+    text_at('.ProfileHeaderCard-bio')
   end
 
   def location
-
+    text_at('.ProfileHeaderCard-locationText')
+  end
+
+  private
+
+  # Private instance methods are not used for data extraction.
+
+  def text_at(selector)
+    page.at(selector).text if page.at(selector)
   end
 end
+```
+
+#### Instantiate the scraper
+
+```ruby
+# Instantiate the scraper with a "url".
+scraper = TwitterScraper.new(url: 'https://twitter.com/Milano_JS')
 
-#
+# Instantiate the scraper with a "page" (Mechanize::Page).
 agent = SagroneScraper::Agent.new(url: 'https://twitter.com/Milano_JS')
+scraper = TwitterScraper.new(page: agent.page)
+```
+
+#### Scrape the page
+
+```ruby
+scraper.scrape_page!
+```
 
-
-parser = TwitterParser.new(page: agent.page)
+#### Extract the data
 
-
-
-parser.attributes
+```ruby
+scraper.attributes
 # => {bio: "Javascript User Group Milano #milanojs", location: "Milan, Italy"}
 ```
 
-
+### `SagroneScraper::Collection`
 
 This is the simplest way to scrape a web page:
 
 ```ruby
 require 'sagrone_scraper'
 
-# 1)
-class TwitterParser < SagroneScraper::Parser
-  TWITTER_PROFILE_URL = /^https?:\/\/twitter.com\/(\w)+\/?$/i
-
-  def self.can_parse?(url)
-    url.match(TWITTER_PROFILE_URL)
-  end
-
-  def bio
-    page.at('.ProfileHeaderCard-bio').text
-  end
-
-  def location
-    page.at('.ProfileHeaderCard-locationText').text
-  end
-end
-
-# 2) We register the parser.
-SagroneScraper.register_parser('TwitterParser')
+# 1) Define a scraper. For example, the TwitterScraper above.
 
-#
-SagroneScraper.
-# => ['
+# 2) Newly created scrapers will be registered.
+SagroneScraper::Collection.registered_scrapers
+# => ['TwitterScraper']
 
-#
-SagroneScraper.scrape(url: 'https://twitter.com/Milano_JS')
+# 3) Here we use the collection to scrape data at a URL.
+SagroneScraper::Collection.scrape(url: 'https://twitter.com/Milano_JS')
 # => {bio: "Javascript User Group Milano #milanojs", location: "Milan, Italy"}
 ```
data/lib/sagrone_scraper.rb
CHANGED
@@ -1,41 +1,10 @@
 require "sagrone_scraper/version"
 require "sagrone_scraper/agent"
-require "sagrone_scraper/
+require "sagrone_scraper/base"
+require "sagrone_scraper/collection"
 
 module SagroneScraper
-  Error = Class.new(RuntimeError)
-
   def self.version
     VERSION
   end
-
-  def self.registered_parsers
-    @registered_parsers ||= []
-  end
-
-  def self.register_parser(name)
-    return if registered_parsers.include?(name)
-
-    parser_class = Object.const_get(name)
-    raise Error.new("Expected parser to be a SagroneScraper::Parser.") unless parser_class.ancestors.include?(SagroneScraper::Parser)
-
-    registered_parsers.push(name)
-  end
-
-  def self.scrape(options)
-    url = options.fetch(:url) do
-      raise Error.new('Option "url" must be provided.')
-    end
-
-    parser_class = registered_parsers
-      .map { |parser_name| Object.const_get(parser_name) }
-      .find { |parser_class| parser_class.can_parse?(url) }
-
-    raise Error.new("No registed parser can parse URL #{url}") unless parser_class
-
-    agent = SagroneScraper::Agent.new(url: url)
-    parser = parser_class.new(page: agent.page)
-    parser.parse_page!
-    parser.attributes
-  end
 end
data/lib/sagrone_scraper/agent.rb
CHANGED
@@ -9,12 +9,10 @@ module SagroneScraper
     attr_reader :url, :page
 
     def initialize(options = {})
-
-
-
-
-      @url ||= page_url
-      @page ||= http_client.get(url)
+      @url = options.fetch(:url) do
+        raise Error.new('Option "url" must be provided')
+      end
+      @page = http_client.get(url)
     rescue StandardError => error
       raise Error.new(error.message)
     end
@@ -29,18 +27,5 @@ module SagroneScraper
       agent.max_history = 0
     end
   end
-
-    private
-
-    def page_url
-      @page.uri.to_s
-    end
-
-    def exactly_one_of(options)
-      url_present = !!options[:url]
-      page_present = !!options[:page]
-
-      (url_present && !page_present) || (!url_present && page_present)
-    end
   end
 end
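The new `initialize` relies on `Hash#fetch` with a block: the block runs only when the key is absent, which turns a "required option" check into one expression. A standalone sketch of the pattern (the `require_url` helper and `Error` class are hypothetical stand-ins, no gem needed):

```ruby
# Hypothetical stand-ins mirroring Agent#initialize's options.fetch usage.
Error = Class.new(RuntimeError)

def require_url(options)
  # The block is evaluated only when :url is missing from the hash.
  options.fetch(:url) do
    raise Error, 'Option "url" must be provided'
  end
end

require_url(url: 'http://example.com')  # => "http://example.com"
# require_url({})                       # raises Error
```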
data/lib/sagrone_scraper/base.rb
ADDED
@@ -0,0 +1,63 @@
+require 'mechanize'
+require 'sagrone_scraper'
+
+module SagroneScraper
+  class Base
+    Error = Class.new(RuntimeError)
+
+    attr_reader :page, :url, :attributes
+
+    def initialize(options = {})
+      raise Error.new('Exactly one option must be provided: "url" or "page"') unless exactly_one_of(options)
+
+      @url, @page = options[:url], options[:page]
+
+      @url ||= page_url
+      @page ||= Agent.new(url: url).page
+      @attributes = {}
+    end
+
+    def scrape_page!
+      return unless self.class.can_scrape?(page_url)
+
+      self.class.method_names.each do |name|
+        attributes[name] = send(name)
+      end
+      nil
+    end
+
+    def self.can_scrape?(url)
+      class_with_method = "#{self}.can_scrape?(url)"
+      raise NotImplementedError.new("Expected #{class_with_method} to be implemented.")
+    end
+
+    def self.should_ignore_method?(name)
+      private_method_defined?(name)
+    end
+
+    private
+
+    def exactly_one_of(options)
+      url_present = !!options[:url]
+      page_present = !!options[:page]
+
+      (url_present && !page_present) || (!url_present && page_present)
+    end
+
+    def page_url
+      @page.uri.to_s
+    end
+
+    def self.method_names
+      @method_names ||= []
+    end
+
+    def self.method_added(name)
+      method_names.push(name) unless should_ignore_method?(name)
+    end
+
+    def self.inherited(klass)
+      SagroneScraper::Collection.register_scraper(klass.name)
+    end
+  end
+end
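The hook pair at the bottom of `Base` is the heart of the design: Ruby calls `method_added` each time an instance method is defined, so public methods self-register as extraction attributes while private helpers are skipped via `private_method_defined?`. A standalone sketch of that mechanism (the `ExtractorBase`/`ProfileExtractor` classes are hypothetical, no Mechanize needed):

```ruby
class ExtractorBase
  def self.method_names
    @method_names ||= []
  end

  def self.should_ignore_method?(name)
    private_method_defined?(name)
  end

  # Ruby invokes this hook after each instance method definition;
  # methods defined after `private` are already private here, so
  # they are filtered out of the extraction list.
  def self.method_added(name)
    method_names.push(name) unless should_ignore_method?(name)
  end
end

class ProfileExtractor < ExtractorBase
  def bio
    'Javascript User Group Milano #milanojs'
  end

  def location
    'Milan, Italy'
  end

  private

  def text_at(selector)
    nil
  end
end

ProfileExtractor.method_names
# => [:bio, :location]
```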
data/lib/sagrone_scraper/collection.rb
ADDED
@@ -0,0 +1,38 @@
+require 'sagrone_scraper'
+
+module SagroneScraper
+  module Collection
+    Error = Class.new(RuntimeError)
+
+    def self.registered_scrapers
+      @registered_scrapers ||= []
+    end
+
+    def self.register_scraper(name)
+      return if registered_scrapers.include?(name)
+
+      scraper_class = Object.const_get(name)
+      raise Error.new("Expected scraper to be a SagroneScraper::Base.") unless scraper_class.ancestors.include?(SagroneScraper::Base)
+
+      registered_scrapers.push(name)
+      nil
+    end
+
+    def self.scrape(options)
+      url = options.fetch(:url) do
+        raise Error.new('Option "url" must be provided.')
+      end
+
+      scraper_class = registered_scrapers
+        .map { |scraper_name| Object.const_get(scraper_name) }
+        .find { |a_scraper_class| a_scraper_class.can_scrape?(url) }
+
+      raise Error.new("No registed scraper can scrape URL #{url}") unless scraper_class
+
+      agent = SagroneScraper::Agent.new(url: url)
+      scraper = scraper_class.new(page: agent.page)
+      scraper.scrape_page!
+      scraper.attributes
+    end
+  end
+end
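`Collection.scrape` resolves a URL by mapping the registered class names back to constants and taking the first class whose `can_scrape?` accepts the URL. A standalone sketch of that dispatch (the `TwitterLike`/`GithubLike` classes and `REGISTRY` are hypothetical, no gem or network needed):

```ruby
class TwitterLike
  def self.can_scrape?(url)
    url.start_with?('https://twitter.com/')
  end
end

class GithubLike
  def self.can_scrape?(url)
    url.start_with?('https://github.com/')
  end
end

# Scrapers are registered by name, as strings.
REGISTRY = %w(TwitterLike GithubLike)

def find_scraper(url)
  REGISTRY
    .map  { |name| Object.const_get(name) }   # name -> class
    .find { |klass| klass.can_scrape?(url) }  # first match wins
end

find_scraper('https://github.com/rails')  # => GithubLike
find_scraper('https://example.com')       # => nil
```

Registering names rather than class objects is what lets the later `const_get` raise a clear `NameError` for unknown scrapers.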
data/spec/sagrone_scraper/agent_spec.rb
CHANGED
@@ -12,7 +12,7 @@ RSpec.describe SagroneScraper::Agent do
     it { expect(described_class::AGENT_ALIASES).to eq(user_agent_aliases) }
   end
 
-  describe '.http_client' do
+  describe 'self.http_client' do
     subject { described_class.http_client }
 
     it { should be_a(Mechanize) }
@@ -21,7 +21,7 @@ RSpec.describe SagroneScraper::Agent do
   end
 
   describe '#initialize' do
-    describe 'should require
+    describe 'should require `url` option' do
      before do
        stub_request_for('http://example.com', 'www.example.com')
      end
@@ -30,30 +30,21 @@ RSpec.describe SagroneScraper::Agent do
        expect {
          described_class.new
        }.to raise_error(SagroneScraper::Agent::Error,
-          '
-      end
-
-      it 'when both options are present' do
-        page = Mechanize.new.get('http://example.com')
-
-        expect {
-          described_class.new(url: 'http://example.com', page: page)
-        }.to raise_error(SagroneScraper::Agent::Error,
-          'Exactly one option must be provided: "url" or "page"')
+          'Option "url" must be provided')
      end
    end
 
-    describe 'with
+    describe 'with url option' do
      before do
        stub_request_for('http://example.com', 'www.example.com')
      end
 
-      let(:
-      let(:agent) { described_class.new(page: page) }
+      let(:agent) { described_class.new(url: 'http://example.com') }
 
      it { expect { agent }.to_not raise_error }
      it { expect(agent.page).to be }
-      it { expect(agent.
+      it { expect(agent.page).to be_a(Mechanize::Page) }
+      it { expect(agent.url).to eq 'http://example.com' }
    end
 
    describe 'with invalid URL' do
@@ -62,7 +53,7 @@ RSpec.describe SagroneScraper::Agent do
      it 'should require URL is absolute' do
        @invalid_url = 'not-a-url'
 
-        expect { agent }.to raise_error(
+        expect { agent }.to raise_error(described_class::Error,
          'absolute URL needed (not not-a-url)')
      end
 
@@ -70,7 +61,7 @@ RSpec.describe SagroneScraper::Agent do
        @invalid_url = 'http://'
 
        webmock_allow do
-          expect { agent }.to raise_error(
+          expect { agent }.to raise_error(described_class::Error,
            /bad URI\(absolute but no path\)/)
        end
      end
@@ -79,7 +70,7 @@ RSpec.describe SagroneScraper::Agent do
        @invalid_url = 'http://example'
 
        webmock_allow do
-          expect { agent }.to raise_error(
+          expect { agent }.to raise_error(described_class::Error,
            /getaddrinfo/)
        end
      end
data/spec/sagrone_scraper/base_spec.rb
ADDED
@@ -0,0 +1,162 @@
+require 'spec_helper'
+require 'sagrone_scraper/base'
+
+RSpec.describe SagroneScraper::Base do
+  describe '#initialize' do
+    before do
+      stub_request_for('http://example.com', 'www.example.com')
+    end
+
+    describe 'should require exactly one of `url` or `page` option' do
+      it 'when options is empty' do
+        expect {
+          described_class.new
+        }.to raise_error(described_class::Error,
+          'Exactly one option must be provided: "url" or "page"')
+      end
+
+      it 'when both options are present' do
+        page = Mechanize.new.get('http://example.com')
+
+        expect {
+          described_class.new(url: 'http://example.com', page: page)
+        }.to raise_error(described_class::Error,
+          'Exactly one option must be provided: "url" or "page"')
+      end
+    end
+
+    describe 'with page option' do
+      let(:page) { Mechanize.new.get('http://example.com') }
+      let(:agent) { described_class.new(page: page) }
+
+      it { expect { agent }.to_not raise_error }
+      it { expect(agent.page).to be }
+      it { expect(agent.url).to eq 'http://example.com/' }
+    end
+
+    describe 'with url option' do
+      let(:agent) { described_class.new(url: 'http://example.com') }
+
+      it { expect { agent }.to_not raise_error }
+      it { expect(agent.page).to be }
+      it { expect(agent.url).to eq 'http://example.com' }
+    end
+  end
+
+  describe 'instance methods' do
+    let(:page) { Mechanize::Page.new }
+    let(:scraper) { described_class.new(page: page) }
+
+    describe '#page' do
+      it { expect(scraper.page).to be_a(Mechanize::Page) }
+    end
+
+    describe '#url' do
+      it { expect(scraper.url).to eq page.uri.to_s }
+    end
+
+    describe '#scrape_page!' do
+      it do
+        expect {
+          scraper.scrape_page!
+        }.to raise_error(NotImplementedError, "Expected #{described_class}.can_scrape?(url) to be implemented.")
+      end
+    end
+
+    describe '#attributes' do
+      it { expect(scraper.attributes).to be_empty }
+    end
+  end
+
+  describe 'class methods' do
+    describe 'self.can_scrape?(url)' do
+      it 'must be implemented in subclasses' do
+        expect {
+          described_class.can_scrape?('url')
+        }.to raise_error(NotImplementedError, "Expected #{described_class}.can_scrape?(url) to be implemented.")
+      end
+    end
+  end
+
+  describe 'example TwitterScraper' do
+    before do
+      stub_request_for('https://twitter.com/Milano_JS', 'twitter.com:Milano_JS')
+    end
+
+    let(:page) { Mechanize.new.get('https://twitter.com/Milano_JS') }
+    let(:twitter_scraper) { TwitterScraper.new(page: page) }
+    let(:expected_attributes) do
+      {
+        bio: "Javascript User Group Milano #milanojs",
+        location: "Milan, Italy"
+      }
+    end
+
+    describe '#initialize' do
+      it 'should be a SagroneScraper::Base' do
+        expect(twitter_scraper).to be_a(described_class)
+      end
+    end
+
+    describe 'instance methods' do
+      describe '#scrape_page!' do
+        it 'should succeed' do
+          expect { twitter_scraper.scrape_page! }.to_not raise_error
+        end
+
+        it 'should have attributes present after scraping' do
+          twitter_scraper.scrape_page!
+
+          expect(twitter_scraper.attributes).to_not be_empty
+          expect(twitter_scraper.attributes).to eq expected_attributes
+        end
+
+        it 'should have correct attributes even if scraping is done multiple times' do
+          twitter_scraper.scrape_page!
+          twitter_scraper.scrape_page!
+          twitter_scraper.scrape_page!
+
+          expect(twitter_scraper.attributes).to_not be_empty
+          expect(twitter_scraper.attributes).to eq expected_attributes
+        end
+      end
+    end
+
+    describe 'class methods' do
+      describe 'self.can_scrape?(url)' do
+        it 'should be true for scrapable URLs' do
+          can_scrape = TwitterScraper.can_scrape?('https://twitter.com/Milano_JS')
+
+          expect(can_scrape).to eq(true)
+        end
+
+        it 'should be false for unknown URLs' do
+          can_scrape = TwitterScraper.can_scrape?('https://www.facebook.com/milanojavascript')
+
+          expect(can_scrape).to eq(false)
+        end
+      end
+
+      describe 'self.should_ignore_method?(name)' do
+        let(:private_methods) { %w(text_at) }
+        let(:public_methods) { %w(bio location) }
+
+        it 'ignores private methods' do
+          private_methods.each do |private_method|
+            ignored = TwitterScraper.should_ignore_method?(private_method)
+
+            expect(ignored).to eq(true)
+          end
+        end
+
+        it 'allows public methods' do
+          public_methods.each do |public_method|
+            ignored = TwitterScraper.should_ignore_method?(public_method)
+
+            expect(ignored).to eq(false)
+          end
+        end
+      end
+    end
+  end
+end
data/spec/sagrone_scraper/collection_spec.rb
ADDED
@@ -0,0 +1,80 @@
+require 'spec_helper'
+require 'sagrone_scraper/collection'
+
+RSpec.describe SagroneScraper::Collection do
+  context 'scrapers registered' do
+    before do
+      described_class.registered_scrapers.clear
+    end
+
+    describe 'self.registered_scrapers' do
+      it { expect(described_class.registered_scrapers).to be_empty }
+      it { expect(described_class.registered_scrapers).to be_a(Array) }
+    end
+
+    describe 'self.register_scraper(name)' do
+      it 'should add a new scraper class to registered scrapers automatically' do
+        class ScraperOne < SagroneScraper::Base ; end
+        class ScraperTwo < SagroneScraper::Base ; end
+
+        expect(described_class.registered_scrapers).to include('ScraperOne')
+        expect(described_class.registered_scrapers).to include('ScraperTwo')
+        expect(described_class.registered_scrapers.size).to eq(2)
+      end
+
+      it 'should check scraper name is an existing constant' do
+        expect {
+          described_class.register_scraper('Unknown')
+        }.to raise_error(NameError, 'uninitialized constant Unknown')
+      end
+
+      it 'should check scraper class inherits from SagroneScraper::Base' do
+        NotScraper = Class.new
+
+        expect {
+          described_class.register_scraper('NotScraper')
+        }.to raise_error(described_class::Error, 'Expected scraper to be a SagroneScraper::Base.')
+      end
+
+      it 'should register multiple scrapers only once' do
+        class TestScraper < SagroneScraper::Base ; end
+
+        described_class.register_scraper('TestScraper')
+        described_class.register_scraper('TestScraper')
+
+        expect(described_class.registered_scrapers).to include('TestScraper')
+        expect(described_class.registered_scrapers.size).to eq 1
+      end
+    end
+  end
+
+  describe 'self.scrape' do
+    before do
+      described_class.registered_scrapers.clear
+      described_class.register_scraper('TwitterScraper')
+
+      stub_request_for('https://twitter.com/Milano_JS', 'twitter.com:Milano_JS')
+    end
+
+    it 'should require `url` option' do
+      expect {
+        described_class.scrape({})
+      }.to raise_error(described_class::Error, 'Option "url" must be provided.')
+    end
+
+    it 'should scrape URL if registered scraper knows how to scrape it' do
+      expected_attributes = {
+        bio: "Javascript User Group Milano #milanojs",
+        location: "Milan, Italy"
+      }
+
+      expect(described_class.scrape(url: 'https://twitter.com/Milano_JS')).to eq(expected_attributes)
+    end
+
+    it 'should raise an error if no registered scraper can scrape the URL' do
+      expect {
+        described_class.scrape(url: 'https://twitter.com/Milano_JS/media')
+      }.to raise_error(described_class::Error, "No registed scraper can scrape URL https://twitter.com/Milano_JS/media")
+    end
+  end
+end
data/spec/sagrone_scraper_spec.rb
CHANGED
@@ -2,80 +2,7 @@ require 'spec_helper'
 require 'sagrone_scraper'
 
 RSpec.describe SagroneScraper do
-  describe '.version' do
+  describe 'self.version' do
     it { expect(SagroneScraper.version).to be_a(String) }
   end
-
-  context 'parsers registered' do
-    before do
-      described_class.registered_parsers.clear
-    end
-
-    describe '.registered_parsers' do
-      it { expect(described_class.registered_parsers).to be_empty }
-      it { expect(described_class.registered_parsers).to be_a(Array) }
-    end
-
-    describe '.register_parser(name)' do
-      TestParser = Class.new(SagroneScraper::Parser)
-      NotParser = Class.new
-
-      it 'should check parser name is an existing constant' do
-        expect {
-          described_class.register_parser('Unknown')
-        }.to raise_error(NameError, 'uninitialized constant Unknown')
-      end
-
-      it 'should check parser class inherits from SagroneScraper::Parser' do
-        expect {
-          described_class.register_parser('NotParser')
-        }.to raise_error(SagroneScraper::Error, 'Expected parser to be a SagroneScraper::Parser.')
-      end
-
-      it 'after adding a "parser" should have it registered' do
-        described_class.register_parser('TestParser')
-
-        expect(described_class.registered_parsers).to include('TestParser')
-        expect(described_class.registered_parsers.size).to eq 1
-      end
-
-      it 'adding same "parser" multiple times should register it once' do
-        described_class.register_parser('TestParser')
-        described_class.register_parser('TestParser')
-
-        expect(described_class.registered_parsers).to include('TestParser')
-        expect(described_class.registered_parsers.size).to eq 1
-      end
-    end
-  end
-
-  describe '.scrape' do
-    before do
-      SagroneScraper.registered_parsers.clear
-      SagroneScraper.register_parser('TwitterParser')
-
-      stub_request_for('https://twitter.com/Milano_JS', 'twitter.com:Milano_JS')
-    end
-
-    it 'should `url` option' do
-      expect {
-        described_class.scrape({})
-      }.to raise_error(SagroneScraper::Error, 'Option "url" must be provided.')
-    end
-
-    it 'should scrape URL if registered parser knows how to parse it' do
-      expected_attributes = {
-        bio: "Javascript User Group Milano #milanojs",
-        location: "Milan, Italy"
-      }
-
-      expect(described_class.scrape(url: 'https://twitter.com/Milano_JS')).to eq(expected_attributes)
-    end
-
-    it 'should return raise error if no registered paser can parse the URL' do
-      expect {
-        described_class.scrape(url: 'https://twitter.com/Milano_JS/media')
-      }.to raise_error(SagroneScraper::Error, "No registed parser can parse URL https://twitter.com/Milano_JS/media")
-    end
-  end
 end
data/spec/support/test_scrapers/twitter_scraper.rb
ADDED
@@ -0,0 +1,23 @@
+require 'sagrone_scraper/base'
+
+class TwitterScraper < SagroneScraper::Base
+  TWITTER_PROFILE_URL = /^https?:\/\/twitter.com\/(\w)+\/?$/i
+
+  def self.can_scrape?(url)
+    url.match(TWITTER_PROFILE_URL) ? true : false
+  end
+
+  def bio
+    text_at('.ProfileHeaderCard-bio')
+  end
+
+  def location
+    text_at('.ProfileHeaderCard-locationText')
+  end
+
+  private
+
+  def text_at(selector)
+    page.at(selector).text if page.at(selector)
+  end
+end
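The `text_at` helper guards against selectors that match nothing: `page.at` returns `nil` on a miss, so calling `.text` unguarded would raise `NoMethodError`. A standalone sketch with a tiny stand-in page object (the `FakePage`/`Node` classes are hypothetical, no Mechanize needed):

```ruby
# Hypothetical stand-ins for Mechanize's page and element objects.
Node = Struct.new(:text)

class FakePage
  def initialize(nodes)
    @nodes = nodes
  end

  # Returns the node for a selector, or nil when nothing matches.
  def at(selector)
    @nodes[selector]
  end
end

# Nil-safe text extraction, as in TwitterScraper#text_at.
def text_at(page, selector)
  page.at(selector).text if page.at(selector)
end

page = FakePage.new('.bio' => Node.new('Javascript User Group Milano'))
text_at(page, '.bio')      # => "Javascript User Group Milano"
text_at(page, '.missing')  # => nil
```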
metadata
CHANGED
@@ -1,14 +1,14 @@
 --- !ruby/object:Gem::Specification
 name: sagrone_scraper
 version: !ruby/object:Gem::Version
-  version: 0.0
+  version: 0.1.0
 platform: ruby
 authors:
 - Marius Colacioiu
 autorequire:
 bindir: bin
 cert_chain: []
-date: 2015-
+date: 2015-04-09 00:00:00.000000000 Z
 dependencies:
 - !ruby/object:Gem::Dependency
   name: mechanize
@@ -113,17 +113,19 @@ files:
 - Rakefile
 - lib/sagrone_scraper.rb
 - lib/sagrone_scraper/agent.rb
-- lib/sagrone_scraper/
+- lib/sagrone_scraper/base.rb
+- lib/sagrone_scraper/collection.rb
 - lib/sagrone_scraper/version.rb
 - sagrone_scraper.gemspec
 - spec/sagrone_scraper/agent_spec.rb
-- spec/sagrone_scraper/
+- spec/sagrone_scraper/base_spec.rb
+- spec/sagrone_scraper/collection_spec.rb
 - spec/sagrone_scraper_spec.rb
 - spec/spec_helper.rb
 - spec/stub_helper.rb
-- spec/support/test_parsers/twitter_parser.rb
 - spec/support/test_responses/twitter.com:Milano_JS
 - spec/support/test_responses/www.example.com
+- spec/support/test_scrapers/twitter_scraper.rb
 homepage: ''
 licenses:
 - Apache License 2.0
@@ -150,10 +152,11 @@ specification_version: 4
 summary: Sagrone Ruby Scraper.
 test_files:
 - spec/sagrone_scraper/agent_spec.rb
-- spec/sagrone_scraper/
+- spec/sagrone_scraper/base_spec.rb
+- spec/sagrone_scraper/collection_spec.rb
 - spec/sagrone_scraper_spec.rb
 - spec/spec_helper.rb
 - spec/stub_helper.rb
-- spec/support/test_parsers/twitter_parser.rb
 - spec/support/test_responses/twitter.com:Milano_JS
 - spec/support/test_responses/www.example.com
+- spec/support/test_scrapers/twitter_scraper.rb
data/lib/sagrone_scraper/parser.rb
DELETED
@@ -1,42 +0,0 @@
-require 'mechanize'
-
-module SagroneScraper
-  class Parser
-    Error = Class.new(RuntimeError)
-
-    attr_reader :page, :page_url, :attributes
-
-    def initialize(options = {})
-      @page = options.fetch(:page) do
-        raise Error.new('Option "page" must be provided.')
-      end
-      @page_url = @page.uri.to_s
-      @attributes = {}
-    end
-
-    def parse_page!
-      return unless self.class.can_parse?(page_url)
-
-      self.class.method_names.each do |name|
-        attributes[name] = send(name)
-      end
-      nil
-    end
-
-    def self.can_parse?(url)
-      class_with_method = "#{self}.can_parse?(url)"
-      raise NotImplementedError.new("Expected #{class_with_method} to be implemented.")
-    end
-
-    private
-
-    def self.method_names
-      @method_names ||= []
-    end
-
-    def self.method_added(name)
-      puts "added #{name} to #{self}"
-      method_names.push(name)
-    end
-  end
-end
data/spec/sagrone_scraper/parser_spec.rb
DELETED
@@ -1,83 +0,0 @@
-require 'spec_helper'
-require 'sagrone_scraper/parser'
-
-RSpec.describe SagroneScraper::Parser do
-  describe '#initialize' do
-    it 'requires a "page" option' do
-      expect {
-        described_class.new
-      }.to raise_error(SagroneScraper::Parser::Error, 'Option "page" must be provided.')
-    end
-  end
-
-  describe 'instance methods' do
-    let(:page) { Mechanize::Page.new }
-    let(:parser) { described_class.new(page: page) }
-
-    describe '#page' do
-      it { expect(parser.page).to be_a(Mechanize::Page) }
-    end
-
-    describe '#page_url' do
-      it { expect(parser.page_url).to be }
-      it { expect(parser.page_url).to eq page.uri.to_s }
-    end
-
-    describe '#parse_page!' do
-      it do
-        expect {
-          parser.parse_page!
-        }.to raise_error(NotImplementedError, "Expected #{described_class}.can_parse?(url) to be implemented.")
-      end
-    end
-
-    describe '#attributes' do
-      it { expect(parser.attributes).to be_empty }
-    end
-  end
-
-  describe 'class methods' do
-    describe '.can_parse?(url)' do
-      it do
-        expect {
-          described_class.can_parse?('url')
-        }.to raise_error(NotImplementedError, "Expected #{described_class}.can_parse?(url) to be implemented.")
-      end
-    end
-  end
-
-  describe 'create custom TwitterParser from SagroneScraper::Parser' do
-    before do
-      stub_request_for('https://twitter.com/Milano_JS', 'twitter.com:Milano_JS')
-    end
-
-    let(:page) { Mechanize.new.get('https://twitter.com/Milano_JS') }
-    let(:twitter_parser) { TwitterParser.new(page: page) }
-    let(:expected_attributes) do
-      {
-        bio: "Javascript User Group Milano #milanojs",
-        location: "Milan, Italy"
-      }
-    end
-
-    describe 'should be able to parse page without errors' do
-      it { expect { twitter_parser.parse_page! }.to_not raise_error }
-    end
-
-    it 'should have attributes present after parsing' do
-      twitter_parser.parse_page!
-
-      expect(twitter_parser.attributes).to_not be_empty
-      expect(twitter_parser.attributes).to eq expected_attributes
-    end
-
-    it 'should have correct attributes event if parsing is done multiple times' do
-      twitter_parser.parse_page!
-      twitter_parser.parse_page!
-      twitter_parser.parse_page!
-
-      expect(twitter_parser.attributes).to_not be_empty
-      expect(twitter_parser.attributes).to eq expected_attributes
-    end
-  end
-end
data/spec/support/test_parsers/twitter_parser.rb
DELETED
@@ -1,17 +0,0 @@
-require 'sagrone_scraper/parser'
-
-class TwitterParser < SagroneScraper::Parser
-  TWITTER_PROFILE_URL = /^https?:\/\/twitter.com\/(\w)+\/?$/i
-
-  def self.can_parse?(url)
-    url.match(TWITTER_PROFILE_URL)
-  end
-
-  def bio
-    page.at('.ProfileHeaderCard-bio').text
-  end
-
-  def location
-    page.at('.ProfileHeaderCard-locationText').text
-  end
-end