sagrone_scraper 0.0.3 → 0.1.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
checksums.yaml CHANGED
@@ -1,7 +1,7 @@
  ---
  SHA1:
- metadata.gz: 80b3c30080aba0c8b1da8cfdcdefbb8e6ef527e1
- data.tar.gz: c9992757e44377ed3081089f0348cdf1535e8a8e
+ metadata.gz: 7b4420d9fd6c018746bbdb2ca17ea53402ca3857
+ data.tar.gz: 7ff6f5cdb840b214ad7eaf8f7ccb23e2efe9ff24
  SHA512:
- metadata.gz: 97477b7732ec3485aa7ba5ef2c7cb16ac130d6b1a1f6ee8b57deb5cf53fb6ae50bafdec13d1c575dbc382a748583a3e70b0da55660bd7065daaebc1479d466c4
- data.tar.gz: fc8762b63b3429dcbd004bbdc39c6ef5bd7351e8b483ec8185c79b985ab5b2091601005a533270fc511e16d329f31a20616e1ea068f2b7ce7c3a0b3928087c5f
+ metadata.gz: b051e3982a7959f19cb25b29129905cd26e5fbc2529a012a0d8b091c5dfdc78d985c79b8aacb6c8142e2c030aed107d91a69d94aba8f6b2c16c57eaeb0f284cb
+ data.tar.gz: b404f338c87a266e9bcc03705327830be76d73018e5ff789017890cb77d7b0b0157daff1dbace000d542be0fd272a7912a75a018d88154da3e165c25e8e4bea5
CHANGELOG.md CHANGED
@@ -1,5 +1,21 @@
  ### HEAD
 
+ ### 0.1.0
+
+ - simplify library entities:
+   - `SagroneScraper::Parser` is now `SagroneScraper::Base`
+   - `SagroneScraper::Base` is the base class for creating new scrapers
+   - `SagroneScraper.registered_parsers` is now `SagroneScraper.registered_scrapers`
+   - `SagroneScraper.register_parser` is now `SagroneScraper.register_scraper`
+   - `SagroneScraper::Base.parse_page!` renamed to `SagroneScraper::Base.scrape_page!`
+   - `SagroneScraper::Base.can_parse?` renamed to `SagroneScraper::Base.can_scrape?`
+   - `SagroneScraper::Base` _private_ instance methods are not used for extracting data, only _public_ instance methods are
+ - `SagroneScraper::Collection`:
+   - registers newly created scrapers automatically
+   - knows how to scrape a page from a generic URL (if there is a valid scraper for that URL)
+ - `SagroneScraper::Base` takes exactly one of the `url` or `page` options
+ - `SagroneScraper::Agent` takes only a `url` option
+
  ### 0.0.3
 
  - add `SagroneScraper::Parser.can_parse?(url)` class method, which must be implemented in subclasses
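Taken together, the renames map the 0.0.3 parser API onto the 0.1.0 scraper API one-for-one. A minimal sketch of the new surface, using a hypothetical `ExampleScraper` (any subclass of `SagroneScraper::Base` behaves the same way):

```ruby
require 'sagrone_scraper'

# Subclassing Base replaces subclassing Parser; the inherited hook
# registers the new class with the Collection automatically.
class ExampleScraper < SagroneScraper::Base
  PROFILE_URL = /^https?:\/\/twitter.com\/(\w)+\/?$/i

  # was: self.can_parse?(url)
  def self.can_scrape?(url)
    url.match(PROFILE_URL) ? true : false
  end

  # Public instance methods define the extracted attributes.
  def bio
    node = page.at('.ProfileHeaderCard-bio')
    node && node.text
  end
end

SagroneScraper::Collection.registered_scrapers  # was: SagroneScraper.registered_parsers
# => ["ExampleScraper"]

scraper = ExampleScraper.new(url: 'https://twitter.com/Milano_JS')
scraper.scrape_page!                            # was: parser.parse_page!
scraper.attributes
# => {bio: "..."}
```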
data/Guardfile CHANGED
@@ -8,7 +8,7 @@ guard :rspec, cmd: "bundle exec rspec" do
    rspec = dsl.rspec
    watch(rspec.spec_files)
    watch(%r{^spec/(.+)_helper\.rb$}) { "spec" }
-   watch(%r{^spec/test_responses/(.+)$}) { "spec" }
+   watch(%r{^spec/support/(.+)$}) { "spec" }
 
    # Library files
    watch(%r{^lib/(.+)\.rb$}) { |m| "spec/#{m[1]}_spec.rb" }
data/README.md CHANGED
@@ -11,8 +11,12 @@ Simple library to scrap web pages. Bellow you will find information on [how to u
  - [Basic Usage](#basic-usage)
  - [Modules](#modules)
    + [`SagroneScraper::Agent`](#sagronescraperagent)
-   + [`SagroneScraper::Parser`](#sagronescraperparser)
-   + [`SagroneScraper.scrape`](#sagronescraperscrape)
+   + [`SagroneScraper::Base`](#sagronescraperbase)
+     * [Create a scraper class](#create-a-scraper-class)
+     * [Instantiate the scraper](#instantiate-the-scraper)
+     * [Scrape the page](#scrape-the-page)
+     * [Extract the data](#extract-the-data)
+   + [`SagroneScraper::Collection`](#sagronescrapercollection)
 
  ## Installation
 
@@ -30,115 +34,110 @@
 
  ## Basic Usage
 
- Comming soon...
+ In order to _scrape a web page_ you will need to:
 
- ## Modules
-
- #### `SagroneScraper::Agent`
-
- The agent is responsible for scraping a web page from a URL. Here is how you can create an `agent`:
+ 1. [create a new scraper class](#create-a-scraper-class) by inheriting from `SagroneScraper::Base`, and
+ 2. [instantiate it with a `url` or `page`](#instantiate-the-scraper)
+ 3. then you can use the scraper instance to [scrape the page](#scrape-the-page) and [extract structured data](#extract-the-data)
 
- 1. one way is to pass it a `url` option
+ More information is available in the [`SagroneScraper::Base`](#sagronescraperbase) section.
 
- ```ruby
- require 'sagrone_scraper'
+ ## Modules
 
- agent = SagroneScraper::Agent.new(url: 'https://twitter.com/Milano_JS')
- agent.page
- # => Mechanize::Page
+ ### `SagroneScraper::Agent`
 
- agent.page.at('.ProfileHeaderCard-bio').text
- # => "Javascript User Group Milano #milanojs"
- ```
+ The agent is responsible for obtaining a page (`Mechanize::Page`) from a URL. Here is how you can create an `agent`:
 
- 2. another way is to pass a `page` option (`Mechanize::Page`)
+ ```ruby
+ require 'sagrone_scraper'
 
- ```ruby
- require 'sagrone_scraper'
+ agent = SagroneScraper::Agent.new(url: 'https://twitter.com/Milano_JS')
+ agent.page
+ # => Mechanize::Page
 
- mechanize_agent = Mechanize.new { |agent| agent.user_agent_alias = 'Linux Firefox' }
- page = mechanize_agent.get('https://twitter.com/Milano_JS')
- # => Mechanize::Page
+ agent.page.at('.ProfileHeaderCard-bio').text
+ # => "Javascript User Group Milano #milanojs"
+ ```
 
- agent = SagroneScraper::Agent.new(page: page)
- agent.url
- # => "https://twitter.com/Milano_JS"
+ ### `SagroneScraper::Base`
 
- agent.page.at('.ProfileHeaderCard-locationText').text
- # => "Milan, Italy"
- ```
+ Here we define a `TwitterScraper` by inheriting from the `SagroneScraper::Base` class.
 
- #### `SagroneScraper::Parser`
+ The _scraper_ is responsible for extracting structured data from a _page_ or a _url_. The _page_ can be obtained by the [_agent_](#sagronescraperagent).
 
- The _parser_ is responsible for extracting structured data from a _page_. The page can be obtained by the _agent_.
+ _Public_ instance methods will be used to extract data, whereas _private_ instance methods will be ignored (seen as helper methods). Most importantly, the `self.can_scrape?(url)` class method ensures that only a known subset of pages can be scraped for data.
 
- Example usage:
+ #### Create a scraper class
 
  ```ruby
  require 'sagrone_scraper'
 
- # 1) First define a custom parser, for example twitter.
- class TwitterParser < SagroneScraper::Parser
+ class TwitterScraper < SagroneScraper::Base
    TWITTER_PROFILE_URL = /^https?:\/\/twitter.com\/(\w)+\/?$/i
 
-   def self.can_parse?(url)
-     url.match(TWITTER_PROFILE_URL)
+   def self.can_scrape?(url)
+     url.match(TWITTER_PROFILE_URL) ? true : false
    end
 
+   # Public instance methods are used for data extraction.
+
    def bio
-     page.at('.ProfileHeaderCard-bio').text
+     text_at('.ProfileHeaderCard-bio')
    end
 
    def location
-     page.at('.ProfileHeaderCard-locationText').text
+     text_at('.ProfileHeaderCard-locationText')
+   end
+
+   private
+
+   # Private instance methods are not used for data extraction.
+
+   def text_at(selector)
+     page.at(selector).text if page.at(selector)
    end
  end
+ ```
+
+ #### Instantiate the scraper
+
+ ```ruby
+ # Instantiate the scraper with a "url".
+ scraper = TwitterScraper.new(url: 'https://twitter.com/Milano_JS')
 
- # 2) Create an agent scraper, which will give us the page to parse.
+ # Instantiate the scraper with a "page" (Mechanize::Page).
  agent = SagroneScraper::Agent.new(url: 'https://twitter.com/Milano_JS')
+ scraper = TwitterScraper.new(page: agent.page)
+ ```
+
+ #### Scrape the page
+
+ ```ruby
+ scraper.scrape_page!
+ ```
 
- # 3) Instantiate the parser.
- parser = TwitterParser.new(page: agent.page)
+ #### Extract the data
 
- # 4) Parse page and extract attributes.
- parser.parse_page!
- parser.attributes
+ ```ruby
+ scraper.attributes
  # => {bio: "Javascript User Group Milano #milanojs", location: "Milan, Italy"}
  ```
 
- #### `SagroneScraper.scrape`
+ ### `SagroneScraper::Collection`
 
  This is the simplest way to scrape a web page:
 
  ```ruby
  require 'sagrone_scraper'
 
- # 1) First we define a custom parser, for example twitter.
- class TwitterParser < SagroneScraper::Parser
-   TWITTER_PROFILE_URL = /^https?:\/\/twitter.com\/(\w)+\/?$/i
-
-   def self.can_parse?(url)
-     url.match(TWITTER_PROFILE_URL)
-   end
-
-   def bio
-     page.at('.ProfileHeaderCard-bio').text
-   end
-
-   def location
-     page.at('.ProfileHeaderCard-locationText').text
-   end
- end
-
- # 2) We register the parser.
- SagroneScraper.register_parser('TwitterParser')
+ # 1) Define a scraper. For example, the TwitterScraper above.
 
- # 3) We can query for registered parsers.
- SagroneScraper.registered_parsers
- # => ['TwitterParser']
+ # 2) Newly created scrapers are registered automatically.
+ SagroneScraper::Collection.registered_scrapers
+ # => ['TwitterScraper']
 
- # 4) We can now scrape twitter profile URLs.
- SagroneScraper.scrape(url: 'https://twitter.com/Milano_JS')
+ # 3) Here we use the collection to scrape data at a URL.
+ SagroneScraper::Collection.scrape(url: 'https://twitter.com/Milano_JS')
  # => {bio: "Javascript User Group Milano #milanojs", location: "Milan, Italy"}
  ```
 
lib/sagrone_scraper.rb CHANGED
@@ -1,41 +1,10 @@
  require "sagrone_scraper/version"
  require "sagrone_scraper/agent"
- require "sagrone_scraper/parser"
+ require "sagrone_scraper/base"
+ require "sagrone_scraper/collection"
 
  module SagroneScraper
-   Error = Class.new(RuntimeError)
-
    def self.version
      VERSION
    end
-
-   def self.registered_parsers
-     @registered_parsers ||= []
-   end
-
-   def self.register_parser(name)
-     return if registered_parsers.include?(name)
-
-     parser_class = Object.const_get(name)
-     raise Error.new("Expected parser to be a SagroneScraper::Parser.") unless parser_class.ancestors.include?(SagroneScraper::Parser)
-
-     registered_parsers.push(name)
-   end
-
-   def self.scrape(options)
-     url = options.fetch(:url) do
-       raise Error.new('Option "url" must be provided.')
-     end
-
-     parser_class = registered_parsers
-       .map { |parser_name| Object.const_get(parser_name) }
-       .find { |parser_class| parser_class.can_parse?(url) }
-
-     raise Error.new("No registed parser can parse URL #{url}") unless parser_class
-
-     agent = SagroneScraper::Agent.new(url: url)
-     parser = parser_class.new(page: agent.page)
-     parser.parse_page!
-     parser.attributes
-   end
  end
lib/sagrone_scraper/agent.rb CHANGED
@@ -9,12 +9,10 @@ module SagroneScraper
      attr_reader :url, :page
 
      def initialize(options = {})
-       raise Error.new('Exactly one option must be provided: "url" or "page"') unless exactly_one_of(options)
-
-       @url, @page = options[:url], options[:page]
-
-       @url ||= page_url
-       @page ||= http_client.get(url)
+       @url = options.fetch(:url) do
+         raise Error.new('Option "url" must be provided')
+       end
+       @page = http_client.get(url)
      rescue StandardError => error
        raise Error.new(error.message)
      end
@@ -29,18 +27,5 @@ module SagroneScraper
          agent.max_history = 0
        end
      end
-
-     private
-
-     def page_url
-       @page.uri.to_s
-     end
-
-     def exactly_one_of(options)
-       url_present = !!options[:url]
-       page_present = !!options[:page]
-
-       (url_present && !page_present) || (!url_present && page_present)
-     end
    end
  end
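The agent's constructor is now strictly URL-based: it fetches the page eagerly and wraps any underlying failure in `Agent::Error`. A short sketch of the narrowed interface (the URLs are illustrative):

```ruby
require 'sagrone_scraper/agent'

agent = SagroneScraper::Agent.new(url: 'https://twitter.com/Milano_JS')
agent.page  # => Mechanize::Page, fetched in the constructor via http_client

# Omitting :url (including the old page-based construction) now raises:
SagroneScraper::Agent.new
# => SagroneScraper::Agent::Error: Option "url" must be provided

# URI and network errors surface as Agent::Error too, via the rescue clause:
SagroneScraper::Agent.new(url: 'not-a-url')
# => SagroneScraper::Agent::Error: absolute URL needed (not not-a-url)
```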
lib/sagrone_scraper/base.rb ADDED
@@ -0,0 +1,63 @@
+ require 'mechanize'
+ require 'sagrone_scraper'
+
+ module SagroneScraper
+   class Base
+     Error = Class.new(RuntimeError)
+
+     attr_reader :page, :url, :attributes
+
+     def initialize(options = {})
+       raise Error.new('Exactly one option must be provided: "url" or "page"') unless exactly_one_of(options)
+
+       @url, @page = options[:url], options[:page]
+
+       @url ||= page_url
+       @page ||= Agent.new(url: url).page
+       @attributes = {}
+     end
+
+     def scrape_page!
+       return unless self.class.can_scrape?(page_url)
+
+       self.class.method_names.each do |name|
+         attributes[name] = send(name)
+       end
+       nil
+     end
+
+     def self.can_scrape?(url)
+       class_with_method = "#{self}.can_scrape?(url)"
+       raise NotImplementedError.new("Expected #{class_with_method} to be implemented.")
+     end
+
+     def self.should_ignore_method?(name)
+       private_method_defined?(name)
+     end
+
+     private
+
+     def exactly_one_of(options)
+       url_present = !!options[:url]
+       page_present = !!options[:page]
+
+       (url_present && !page_present) || (!url_present && page_present)
+     end
+
+     def page_url
+       @page.uri.to_s
+     end
+
+     def self.method_names
+       @method_names ||= []
+     end
+
+     def self.method_added(name)
+       method_names.push(name) unless should_ignore_method?(name)
+     end
+
+     def self.inherited(klass)
+       SagroneScraper::Collection.register_scraper(klass.name)
+     end
+   end
+ end
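The `method_added` hook is what turns public instance methods into the extraction schema: every instance method defined on a subclass is recorded unless `should_ignore_method?` sees it as private. A small sketch of that bookkeeping, using a hypothetical `DemoScraper`:

```ruby
require 'sagrone_scraper'

class DemoScraper < SagroneScraper::Base
  def self.can_scrape?(url)
    true  # accept any URL, for illustration only
  end

  # Public: recorded by self.method_added, so scrape_page! will call it.
  def title
    page.title
  end

  private

  # Private: filtered out by should_ignore_method?, treated as a helper.
  def helper
    'not extracted'
  end
end

DemoScraper.method_names  # => [:title]
# (method_names is a singleton method, so the `private` keyword in Base
# does not hide it; `private` only affects instance methods.)
```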
lib/sagrone_scraper/collection.rb ADDED
@@ -0,0 +1,38 @@
+ require 'sagrone_scraper'
+
+ module SagroneScraper
+   module Collection
+     Error = Class.new(RuntimeError)
+
+     def self.registered_scrapers
+       @registered_scrapers ||= []
+     end
+
+     def self.register_scraper(name)
+       return if registered_scrapers.include?(name)
+
+       scraper_class = Object.const_get(name)
+       raise Error.new("Expected scraper to be a SagroneScraper::Base.") unless scraper_class.ancestors.include?(SagroneScraper::Base)
+
+       registered_scrapers.push(name)
+       nil
+     end
+
+     def self.scrape(options)
+       url = options.fetch(:url) do
+         raise Error.new('Option "url" must be provided.')
+       end
+
+       scraper_class = registered_scrapers
+         .map { |scraper_name| Object.const_get(scraper_name) }
+         .find { |a_scraper_class| a_scraper_class.can_scrape?(url) }
+
+       raise Error.new("No registed scraper can scrape URL #{url}") unless scraper_class
+
+       agent = SagroneScraper::Agent.new(url: url)
+       scraper = scraper_class.new(page: agent.page)
+       scraper.scrape_page!
+       scraper.attributes
+     end
+   end
+ end
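`Collection.scrape` is a lookup-then-delegate: it finds the first registered scraper whose `can_scrape?(url)` is true, builds an `Agent` for the URL, and returns the scraper's attributes. A sketch of the happy path and the two error branches, assuming the `TwitterScraper` from the specs is loaded:

```ruby
require 'sagrone_scraper'

# Happy path: a registered scraper matches the URL.
SagroneScraper::Collection.scrape(url: 'https://twitter.com/Milano_JS')
# => {bio: "Javascript User Group Milano #milanojs", location: "Milan, Italy"}

# A missing :url option raises Collection::Error.
SagroneScraper::Collection.scrape({})
# => SagroneScraper::Collection::Error: Option "url" must be provided.

# A URL that no registered scraper accepts raises Collection::Error as well.
SagroneScraper::Collection.scrape(url: 'https://twitter.com/Milano_JS/media')
# => SagroneScraper::Collection::Error (no registered scraper accepts this URL)
```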
lib/sagrone_scraper/version.rb CHANGED
@@ -1,3 +1,3 @@
  module SagroneScraper
-   VERSION = "0.0.3"
+   VERSION = "0.1.0"
  end
spec/sagrone_scraper/agent_spec.rb CHANGED
@@ -12,7 +12,7 @@ RSpec.describe SagroneScraper::Agent do
      it { expect(described_class::AGENT_ALIASES).to eq(user_agent_aliases) }
    end
 
-   describe '.http_client' do
+   describe 'self.http_client' do
      subject { described_class.http_client }
 
      it { should be_a(Mechanize) }
@@ -21,7 +21,7 @@ RSpec.describe SagroneScraper::Agent do
    end
 
    describe '#initialize' do
-     describe 'should require exactly one of `url` or `page` option' do
+     describe 'should require `url` option' do
        before do
          stub_request_for('http://example.com', 'www.example.com')
        end
@@ -30,30 +30,21 @@ RSpec.describe SagroneScraper::Agent do
        expect {
          described_class.new
        }.to raise_error(SagroneScraper::Agent::Error,
-         'Exactly one option must be provided: "url" or "page"')
-       end
-
-       it 'when both options are present' do
-         page = Mechanize.new.get('http://example.com')
-
-         expect {
-           described_class.new(url: 'http://example.com', page: page)
-         }.to raise_error(SagroneScraper::Agent::Error,
-           'Exactly one option must be provided: "url" or "page"')
+         'Option "url" must be provided')
      end
    end
 
-   describe 'with page option' do
+   describe 'with url option' do
      before do
        stub_request_for('http://example.com', 'www.example.com')
      end
 
-     let(:page) { Mechanize.new.get('http://example.com') }
-     let(:agent) { described_class.new(page: page) }
+     let(:agent) { described_class.new(url: 'http://example.com') }
 
      it { expect { agent }.to_not raise_error }
      it { expect(agent.page).to be }
-     it { expect(agent.url).to eq 'http://example.com/' }
+     it { expect(agent.page).to be_a(Mechanize::Page) }
+     it { expect(agent.url).to eq 'http://example.com' }
    end
 
    describe 'with invalid URL' do
@@ -62,7 +53,7 @@ RSpec.describe SagroneScraper::Agent do
      it 'should require URL is absolute' do
        @invalid_url = 'not-a-url'
 
-       expect { agent }.to raise_error(SagroneScraper::Agent::Error,
+       expect { agent }.to raise_error(described_class::Error,
          'absolute URL needed (not not-a-url)')
      end
 
@@ -70,7 +61,7 @@ RSpec.describe SagroneScraper::Agent do
      @invalid_url = 'http://'
 
      webmock_allow do
-       expect { agent }.to raise_error(SagroneScraper::Agent::Error,
+       expect { agent }.to raise_error(described_class::Error,
          /bad URI\(absolute but no path\)/)
      end
    end
@@ -79,7 +70,7 @@ RSpec.describe SagroneScraper::Agent do
      @invalid_url = 'http://example'
 
      webmock_allow do
-       expect { agent }.to raise_error(SagroneScraper::Agent::Error,
+       expect { agent }.to raise_error(described_class::Error,
          /getaddrinfo/)
      end
    end
spec/sagrone_scraper/base_spec.rb ADDED
@@ -0,0 +1,162 @@
+ require 'spec_helper'
+ require 'sagrone_scraper/base'
+
+ RSpec.describe SagroneScraper::Base do
+   describe '#initialize' do
+     before do
+       stub_request_for('http://example.com', 'www.example.com')
+     end
+
+     describe 'should require exactly one of `url` or `page` option' do
+       it 'when options is empty' do
+         expect {
+           described_class.new
+         }.to raise_error(described_class::Error,
+           'Exactly one option must be provided: "url" or "page"')
+       end
+
+       it 'when both options are present' do
+         page = Mechanize.new.get('http://example.com')
+
+         expect {
+           described_class.new(url: 'http://example.com', page: page)
+         }.to raise_error(described_class::Error,
+           'Exactly one option must be provided: "url" or "page"')
+       end
+     end
+
+     describe 'with page option' do
+       let(:page) { Mechanize.new.get('http://example.com') }
+       let(:agent) { described_class.new(page: page) }
+
+       it { expect { agent }.to_not raise_error }
+       it { expect(agent.page).to be }
+       it { expect(agent.url).to eq 'http://example.com/' }
+     end
+
+     describe 'with url option' do
+       let(:agent) { described_class.new(url: 'http://example.com') }
+
+       it { expect { agent }.to_not raise_error }
+       it { expect(agent.page).to be }
+       it { expect(agent.url).to eq 'http://example.com' }
+     end
+   end
+
+   describe 'instance methods' do
+     let(:page) { Mechanize::Page.new }
+     let(:scraper) { described_class.new(page: page) }
+
+     describe '#page' do
+       it { expect(scraper.page).to be_a(Mechanize::Page) }
+     end
+
+     describe '#url' do
+       it { expect(scraper.url).to eq page.uri.to_s }
+     end
+
+     describe '#scrape_page!' do
+       it do
+         expect {
+           scraper.scrape_page!
+         }.to raise_error(NotImplementedError, "Expected #{described_class}.can_scrape?(url) to be implemented.")
+       end
+     end
+
+     describe '#attributes' do
+       it { expect(scraper.attributes).to be_empty }
+     end
+   end
+
+   describe 'class methods' do
+     describe 'self.can_scrape?(url)' do
+       it 'must be implemented in subclasses' do
+         expect {
+           described_class.can_scrape?('url')
+         }.to raise_error(NotImplementedError, "Expected #{described_class}.can_scrape?(url) to be implemented.")
+       end
+     end
+   end
+
+   describe 'example TwitterScraper' do
+     before do
+       stub_request_for('https://twitter.com/Milano_JS', 'twitter.com:Milano_JS')
+     end
+
+     let(:page) { Mechanize.new.get('https://twitter.com/Milano_JS') }
+     let(:twitter_scraper) { TwitterScraper.new(page: page) }
+     let(:expected_attributes) do
+       {
+         bio: "Javascript User Group Milano #milanojs",
+         location: "Milan, Italy"
+       }
+     end
+
+     describe '#initialize' do
+       it 'should be a SagroneScraper::Base' do
+         expect(twitter_scraper).to be_a(described_class)
+       end
+     end
+
+     describe 'instance methods' do
+       describe '#scrape_page!' do
+         it 'should succeed' do
+           expect { twitter_scraper.scrape_page! }.to_not raise_error
+         end
+
+         it 'should have attributes present after scraping' do
+           twitter_scraper.scrape_page!
+
+           expect(twitter_scraper.attributes).to_not be_empty
+           expect(twitter_scraper.attributes).to eq expected_attributes
+         end
+
+         it 'should have correct attributes even if scraping is done multiple times' do
+           twitter_scraper.scrape_page!
+           twitter_scraper.scrape_page!
+           twitter_scraper.scrape_page!
+
+           expect(twitter_scraper.attributes).to_not be_empty
+           expect(twitter_scraper.attributes).to eq expected_attributes
+         end
+       end
+     end
+
+     describe 'class methods' do
+       describe 'self.can_scrape?(url)' do
+         it 'should be true for scrapable URLs' do
+           can_scrape = TwitterScraper.can_scrape?('https://twitter.com/Milano_JS')
+
+           expect(can_scrape).to eq(true)
+         end
+
+         it 'should be false for unknown URLs' do
+           can_scrape = TwitterScraper.can_scrape?('https://www.facebook.com/milanojavascript')
+
+           expect(can_scrape).to eq(false)
+         end
+       end
+
+       describe 'self.should_ignore_method?(name)' do
+         let(:private_methods) { %w(text_at) }
+         let(:public_methods) { %w(bio location) }
+
+         it 'ignores private methods' do
+           private_methods.each do |private_method|
+             ignored = TwitterScraper.should_ignore_method?(private_method)
+
+             expect(ignored).to eq(true)
+           end
+         end
+
+         it 'allows public methods' do
+           public_methods.each do |public_method|
+             ignored = TwitterScraper.should_ignore_method?(public_method)
+
+             expect(ignored).to eq(false)
+           end
+         end
+       end
+     end
+   end
+ end
spec/sagrone_scraper/collection_spec.rb ADDED
@@ -0,0 +1,80 @@
+ require 'spec_helper'
+ require 'sagrone_scraper/collection'
+
+ RSpec.describe SagroneScraper::Collection do
+   context 'scrapers registered' do
+     before do
+       described_class.registered_scrapers.clear
+     end
+
+     describe 'self.registered_scrapers' do
+       it { expect(described_class.registered_scrapers).to be_empty }
+       it { expect(described_class.registered_scrapers).to be_a(Array) }
+     end
+
+     describe 'self.register_scraper(name)' do
+       it 'should add a new scraper class to registered scrapers automatically' do
+         class ScraperOne < SagroneScraper::Base ; end
+         class ScraperTwo < SagroneScraper::Base ; end
+
+         expect(described_class.registered_scrapers).to include('ScraperOne')
+         expect(described_class.registered_scrapers).to include('ScraperTwo')
+         expect(described_class.registered_scrapers.size).to eq(2)
+       end
+
+       it 'should check scraper name is an existing constant' do
+         expect {
+           described_class.register_scraper('Unknown')
+         }.to raise_error(NameError, 'uninitialized constant Unknown')
+       end
+
+       it 'should check scraper class inherits from SagroneScraper::Base' do
+         NotScraper = Class.new
+
+         expect {
+           described_class.register_scraper('NotScraper')
+         }.to raise_error(described_class::Error, 'Expected scraper to be a SagroneScraper::Base.')
+       end
+
+       it 'should register multiple scrapers only once' do
+         class TestScraper < SagroneScraper::Base ; end
+
+         described_class.register_scraper('TestScraper')
+         described_class.register_scraper('TestScraper')
+
+         expect(described_class.registered_scrapers).to include('TestScraper')
+         expect(described_class.registered_scrapers.size).to eq 1
+       end
+     end
+   end
+
+   describe 'self.scrape' do
+     before do
+       described_class.registered_scrapers.clear
+       described_class.register_scraper('TwitterScraper')
+
+       stub_request_for('https://twitter.com/Milano_JS', 'twitter.com:Milano_JS')
+     end
+
+     it 'should require `url` option' do
+       expect {
+         described_class.scrape({})
+       }.to raise_error(described_class::Error, 'Option "url" must be provided.')
+     end
+
+     it 'should scrape URL if registered scraper knows how to scrape it' do
+       expected_attributes = {
+         bio: "Javascript User Group Milano #milanojs",
+         location: "Milan, Italy"
+       }
+
+       expect(described_class.scrape(url: 'https://twitter.com/Milano_JS')).to eq(expected_attributes)
+     end
+
+     it 'should raise error if no registered scraper can scrape the URL' do
+       expect {
+         described_class.scrape(url: 'https://twitter.com/Milano_JS/media')
+       }.to raise_error(described_class::Error, "No registed scraper can scrape URL https://twitter.com/Milano_JS/media")
+     end
+   end
+ end
spec/sagrone_scraper_spec.rb CHANGED
@@ -2,80 +2,7 @@ require 'spec_helper'
  require 'sagrone_scraper'
 
  RSpec.describe SagroneScraper do
-   describe '.version' do
+   describe 'self.version' do
      it { expect(SagroneScraper.version).to be_a(String) }
    end
-
-   context 'parsers registered' do
-     before do
-       described_class.registered_parsers.clear
-     end
-
-     describe '.registered_parsers' do
-       it { expect(described_class.registered_parsers).to be_empty }
-       it { expect(described_class.registered_parsers).to be_a(Array) }
-     end
-
-     describe '.register_parser(name)' do
-       TestParser = Class.new(SagroneScraper::Parser)
-       NotParser = Class.new
-
-       it 'should check parser name is an existing constant' do
-         expect {
-           described_class.register_parser('Unknown')
-         }.to raise_error(NameError, 'uninitialized constant Unknown')
-       end
-
-       it 'should check parser class inherits from SagroneScraper::Parser' do
-         expect {
-           described_class.register_parser('NotParser')
-         }.to raise_error(SagroneScraper::Error, 'Expected parser to be a SagroneScraper::Parser.')
-       end
-
-       it 'after adding a "parser" should have it registered' do
-         described_class.register_parser('TestParser')
-
-         expect(described_class.registered_parsers).to include('TestParser')
-         expect(described_class.registered_parsers.size).to eq 1
-       end
-
-       it 'adding same "parser" multiple times should register it once' do
-         described_class.register_parser('TestParser')
-         described_class.register_parser('TestParser')
-
-         expect(described_class.registered_parsers).to include('TestParser')
-         expect(described_class.registered_parsers.size).to eq 1
-       end
-     end
-   end
-
-   describe '.scrape' do
-     before do
-       SagroneScraper.registered_parsers.clear
-       SagroneScraper.register_parser('TwitterParser')
-
-       stub_request_for('https://twitter.com/Milano_JS', 'twitter.com:Milano_JS')
-     end
-
-     it 'should `url` option' do
-       expect {
-         described_class.scrape({})
-       }.to raise_error(SagroneScraper::Error, 'Option "url" must be provided.')
-     end
-
-     it 'should scrape URL if registered parser knows how to parse it' do
-       expected_attributes = {
-         bio: "Javascript User Group Milano #milanojs",
-         location: "Milan, Italy"
-       }
-
-       expect(described_class.scrape(url: 'https://twitter.com/Milano_JS')).to eq(expected_attributes)
-     end
-
-     it 'should return raise error if no registered paser can parse the URL' do
-       expect {
-         described_class.scrape(url: 'https://twitter.com/Milano_JS/media')
-       }.to raise_error(SagroneScraper::Error, "No registed parser can parse URL https://twitter.com/Milano_JS/media")
-     end
-   end
  end
spec/support/test_scrapers/twitter_scraper.rb ADDED
@@ -0,0 +1,23 @@
+ require 'sagrone_scraper/base'
+
+ class TwitterScraper < SagroneScraper::Base
+   TWITTER_PROFILE_URL = /^https?:\/\/twitter.com\/(\w)+\/?$/i
+
+   def self.can_scrape?(url)
+     url.match(TWITTER_PROFILE_URL) ? true : false
+   end
+
+   def bio
+     text_at('.ProfileHeaderCard-bio')
+   end
+
+   def location
+     text_at('.ProfileHeaderCard-locationText')
+   end
+
+   private
+
+   def text_at(selector)
+     page.at(selector).text if page.at(selector)
+   end
+ end
metadata CHANGED
@@ -1,14 +1,14 @@
  --- !ruby/object:Gem::Specification
  name: sagrone_scraper
  version: !ruby/object:Gem::Version
-   version: 0.0.3
+   version: 0.1.0
  platform: ruby
  authors:
  - Marius Colacioiu
  autorequire:
  bindir: bin
  cert_chain: []
- date: 2015-03-10 00:00:00.000000000 Z
+ date: 2015-04-09 00:00:00.000000000 Z
  dependencies:
  - !ruby/object:Gem::Dependency
    name: mechanize
@@ -113,17 +113,19 @@ files:
  - Rakefile
  - lib/sagrone_scraper.rb
  - lib/sagrone_scraper/agent.rb
- - lib/sagrone_scraper/parser.rb
+ - lib/sagrone_scraper/base.rb
+ - lib/sagrone_scraper/collection.rb
  - lib/sagrone_scraper/version.rb
  - sagrone_scraper.gemspec
  - spec/sagrone_scraper/agent_spec.rb
- - spec/sagrone_scraper/parser_spec.rb
+ - spec/sagrone_scraper/base_spec.rb
+ - spec/sagrone_scraper/collection_spec.rb
  - spec/sagrone_scraper_spec.rb
  - spec/spec_helper.rb
  - spec/stub_helper.rb
- - spec/support/test_parsers/twitter_parser.rb
  - spec/support/test_responses/twitter.com:Milano_JS
  - spec/support/test_responses/www.example.com
+ - spec/support/test_scrapers/twitter_scraper.rb
  homepage: ''
  licenses:
  - Apache License 2.0
@@ -150,10 +152,11 @@ specification_version: 4
  summary: Sagrone Ruby Scraper.
  test_files:
  - spec/sagrone_scraper/agent_spec.rb
- - spec/sagrone_scraper/parser_spec.rb
+ - spec/sagrone_scraper/base_spec.rb
+ - spec/sagrone_scraper/collection_spec.rb
  - spec/sagrone_scraper_spec.rb
  - spec/spec_helper.rb
  - spec/stub_helper.rb
- - spec/support/test_parsers/twitter_parser.rb
  - spec/support/test_responses/twitter.com:Milano_JS
  - spec/support/test_responses/www.example.com
+ - spec/support/test_scrapers/twitter_scraper.rb
lib/sagrone_scraper/parser.rb DELETED
@@ -1,42 +0,0 @@
- require 'mechanize'
-
- module SagroneScraper
-   class Parser
-     Error = Class.new(RuntimeError)
-
-     attr_reader :page, :page_url, :attributes
-
-     def initialize(options = {})
-       @page = options.fetch(:page) do
-         raise Error.new('Option "page" must be provided.')
-       end
-       @page_url = @page.uri.to_s
-       @attributes = {}
-     end
-
-     def parse_page!
-       return unless self.class.can_parse?(page_url)
-
-       self.class.method_names.each do |name|
-         attributes[name] = send(name)
-       end
-       nil
-     end
-
-     def self.can_parse?(url)
-       class_with_method = "#{self}.can_parse?(url)"
-       raise NotImplementedError.new("Expected #{class_with_method} to be implemented.")
-     end
-
-     private
-
-     def self.method_names
-       @method_names ||= []
-     end
-
-     def self.method_added(name)
-       puts "added #{name} to #{self}"
-       method_names.push(name)
-     end
-   end
- end
spec/sagrone_scraper/parser_spec.rb DELETED
@@ -1,83 +0,0 @@
- require 'spec_helper'
- require 'sagrone_scraper/parser'
-
- RSpec.describe SagroneScraper::Parser do
-   describe '#initialize' do
-     it 'requires a "page" option' do
-       expect {
-         described_class.new
-       }.to raise_error(SagroneScraper::Parser::Error, 'Option "page" must be provided.')
-     end
-   end
-
-   describe 'instance methods' do
-     let(:page) { Mechanize::Page.new }
-     let(:parser) { described_class.new(page: page) }
-
-     describe '#page' do
-       it { expect(parser.page).to be_a(Mechanize::Page) }
-     end
-
-     describe '#page_url' do
-       it { expect(parser.page_url).to be }
-       it { expect(parser.page_url).to eq page.uri.to_s }
-     end
-
-     describe '#parse_page!' do
-       it do
-         expect {
-           parser.parse_page!
-         }.to raise_error(NotImplementedError, "Expected #{described_class}.can_parse?(url) to be implemented.")
-       end
-     end
-
-     describe '#attributes' do
-       it { expect(parser.attributes).to be_empty }
-     end
-   end
-
-   describe 'class methods' do
-     describe '.can_parse?(url)' do
-       it do
-         expect {
-           described_class.can_parse?('url')
-         }.to raise_error(NotImplementedError, "Expected #{described_class}.can_parse?(url) to be implemented.")
-       end
-     end
-   end
-
-   describe 'create custom TwitterParser from SagroneScraper::Parser' do
-     before do
-       stub_request_for('https://twitter.com/Milano_JS', 'twitter.com:Milano_JS')
-     end
-
-     let(:page) { Mechanize.new.get('https://twitter.com/Milano_JS') }
-     let(:twitter_parser) { TwitterParser.new(page: page) }
-     let(:expected_attributes) do
-       {
-         bio: "Javascript User Group Milano #milanojs",
-         location: "Milan, Italy"
-       }
-     end
-
-     describe 'should be able to parse page without errors' do
-       it { expect { twitter_parser.parse_page! }.to_not raise_error }
-     end
-
-     it 'should have attributes present after parsing' do
-       twitter_parser.parse_page!
-
-       expect(twitter_parser.attributes).to_not be_empty
-       expect(twitter_parser.attributes).to eq expected_attributes
-     end
-
-     it 'should have correct attributes event if parsing is done multiple times' do
-       twitter_parser.parse_page!
-       twitter_parser.parse_page!
-       twitter_parser.parse_page!
-
-       expect(twitter_parser.attributes).to_not be_empty
-       expect(twitter_parser.attributes).to eq expected_attributes
-     end
-   end
- end
spec/support/test_parsers/twitter_parser.rb DELETED
@@ -1,17 +0,0 @@
- require 'sagrone_scraper/parser'
-
- class TwitterParser < SagroneScraper::Parser
-   TWITTER_PROFILE_URL = /^https?:\/\/twitter.com\/(\w)+\/?$/i
-
-   def self.can_parse?(url)
-     url.match(TWITTER_PROFILE_URL)
-   end
-
-   def bio
-     page.at('.ProfileHeaderCard-bio').text
-   end
-
-   def location
-     page.at('.ProfileHeaderCard-locationText').text
-   end
- end