RubyGems - scrapifier - Versions diffs - 0.0.1 → 0.0.2 - Mend

scrapifier 0.0.1 → 0.0.2


Files changed (9)

checksums.yaml +4 -4
data/.travis.yml +4 -0
data/README.md +88 -83
data/lib/scrapifier/methods.rb +25 -35
data/lib/scrapifier/support.rb +152 -128
data/lib/scrapifier/version.rb +1 -1
data/spec/factories/{uris.rb → samples.rb} +0 -0
data/spec/spec_helper.rb +1 -1
metadata +5 -4

checksums.yaml CHANGED

@@ -1,7 +1,7 @@
 ---
 SHA1:
-  metadata.gz: d645984640446b98bdc5bdf71972c991b361b75e
-  data.tar.gz: 9bd772bf8ab26ab4dda602fa69eee8c12ae45e39
+  metadata.gz: aa0d714c01fc436bf4f90f4baaf50e98bcb9197a
+  data.tar.gz: b9337a1a690c0a7f0c57b237f327eec80d9bbdda
 SHA512:
-  metadata.gz: ff5dd829fd8e41af883fccd65ade03bb44d968f71a5457a0f2ceeb3afb0e389f71b44c0b559b352793293f4675cd9556c6e0bbab9799b637f1c4e6e7bdbb61ee
-  data.tar.gz: 3b974c372000f5d4f795f32074bd4da64e4fd910e8623f8446169c18b6880b1967aee105bf5df4e9de36fd054697c567727d4a8eee48def87f51b93b4117f556
+  metadata.gz: 75baa6ed1838759bd6c0ebc8a453120b6c2cfe8f484f4396a5401a1d1acd66eebb0377423750a647ca9417e0fb8f4677ba688d40a4bc7e642cee653a6a76131a
+  data.tar.gz: a77ab52807dbcf3a9846226641a00b210324e2d64db0d18324597be380e576faea9c65eb97b00ea0a98e9de6359329f3b295f168fd001b59a4b9b8390ef3badc

data/.travis.yml ADDED

@@ -0,0 +1,4 @@
+language: ruby
+rvm:
+  - 2.0.0
+  - 1.9.3

data/README.md CHANGED

@@ -1,83 +1,88 @@
-# Scrapifier
-It's a Ruby gem that brings a very simple way to extract meta information from URIs using the screen scraping technique.
-## Installation
-Compatible with Ruby 1.9.3+
-Add this line to your application's Gemfile:
-    gem 'scrapifier'
-And then execute:
-    $ bundle
-Or install it yourself as:
-    $ gem install scrapifier
-## Usage
-The method finds an URI in the String and gets some meta information from it, like the page's title, description, images and the URI. All the data is returned in a well-formatted Hash.
-#### Default usage.
-``` ruby
-'Wow! What an awesome site: http://adtangerine.com!'.scrapify
-#=> {
-#   title:       "AdTangerine | Advertising Platform for Social Media",
-#   description: "AdTangerine is an advertising platform that uses the tangerine as a virtual currency for advertisers and publishers in order to share content on social networks.",
-#   images:      ["http://adtangerine.com/assets/logo_adt_og.png", "http://adtangerine.com/assets/logo_adt_og.png", "http://s3-us-west-2.amazonaws.com/adtangerine-prod/users/avatars/000/000/834/thumb/275747_1118382211_1929809351_n.jpg", "http://adtangerine.com/assets/foobar.gif"],
-#   uri:         "http://adtangerine.com"
-# }
-```
-#### Allow only certain image types.
-``` ruby
-'Wow! What an awesome site: http://adtangerine.com!'.scrapify images: :jpg
-#=> {
-#   title:       "AdTangerine | Advertising Platform for Social Media",
-#   description: "AdTangerine is an advertising platform that uses the tangerine as a virtual currency for advertisers and publishers in order to share content on social networks.",
-#   images:      ["http://s3-us-west-2.amazonaws.com/adtangerine-prod/users/avatars/000/000/834/thumb/275747_1118382211_1929809351_n.jpg"],
-#   uri:         "http://adtangerine.com"
-# }
-'Wow! What an awesome site: http://adtangerine.com!'.scrapify images: [:png, :gif]
-#=> {
-#   title:       "AdTangerine | Advertising Platform for Social Media",
-#   description: "AdTangerine is an advertising platform that uses the tangerine as a virtual currency for advertisers and publishers in order to share content on social networks.",
-#   images:      ["http://adtangerine.com/assets/logo_adt_og.png", "http://adtangerine.com/assets/logo_adt_og.png", "http://adtangerine.com/assets/foobar.gif"],
-#   uri:         "http://adtangerine.com"
-# }
-```
-#### Choose which URI you want it to be scraped.
-``` ruby
-'Check out: http://adtangerine.com and www.twitflink.com'.scrapify which: 1
-#=> {
-#   title:       "TwitFlink | Find a link!",
-#   description: "TwitFlink is a very simple searching tool that allows people to find out links tweeted by any user from Twitter.",
-#   images:      ["http://www.twitflink.com//assets/tf_logo.png", "http://twitflink.com/assets/tf_logo.png"],
-#   uri:         "http://www.twitflink.com"
-# }
-'Check out: http://adtangerine.com and www.twitflink.com'.scrapify({ which: 0, images: :gif })
-#=> {
-#   title:       "AdTangerine | Advertising Platform for Social Media",
-#   description: "AdTangerine is an advertising platform that uses the tangerine as a virtual currency for advertisers and publishers in order to share content on social networks.",
-#   images:      ["http://adtangerine.com/assets/foobar.gif"],
-#   uri:         "http://adtangerine.com"
-# }
-```
-## Contributing
-1. Fork it
-2. Create your feature branch (`git checkout -b my-new-feature`)
-3. Commit your changes (`git commit -am 'Add some feature'`)
-4. Push to the branch (`git push origin my-new-feature`)
-5. Create new Pull Request
+# Scrapifier
+[![Build Status](https://travis-ci.org/tiagopog/scrapifier.svg?branch=master)](https://travis-ci.org/tiagopog/scrapifier)
+[![Code Climate](https://codeclimate.com/github/tiagopog/scrapifier.png)](https://codeclimate.com/github/tiagopog/scrapifier)
+[![Dependency Status](https://gemnasium.com/tiagopog/scrapifier.svg)](https://gemnasium.com/tiagopog/scrapifier)
+[![Gem Version](https://badge.fury.io/rb/scrapifier.svg)](http://badge.fury.io/rb/scrapifier)
+It's a Ruby gem that brings a very simple way to extract meta information from URIs using the screen scraping technique.
+## Installation
+Compatible with Ruby 1.9.3+
+Add this line to your application's Gemfile:
+    gem 'scrapifier'
+And then execute:
+    $ bundle
+Or install it yourself as:
+    $ gem install scrapifier
+## Usage
+The method finds an URI in the String and gets some meta information from it, like the page's title, description, images and the URI. All the data is returned in a well-formatted Hash.
+#### Default usage.
+``` ruby
+'Wow! What an awesome site: http://adtangerine.com!'.scrapify
+#=> {
+#   title:       "AdTangerine | Advertising Platform for Social Media",
+#   description: "AdTangerine is an advertising platform that uses the tangerine as a virtual currency for advertisers and publishers in order to share content on social networks.",
+#   images:      ["http://adtangerine.com/assets/logo_adt_og.png", "http://adtangerine.com/assets/logo_adt_og.png", "http://s3-us-west-2.amazonaws.com/adtangerine-prod/users/avatars/000/000/834/thumb/275747_1118382211_1929809351_n.jpg", "http://adtangerine.com/assets/foobar.gif"],
+#   uri:         "http://adtangerine.com"
+# }
+```
+#### Allow only certain image types.
+``` ruby
+'Wow! What an awesome site: http://adtangerine.com!'.scrapify images: :jpg
+#=> {
+#   title:       "AdTangerine | Advertising Platform for Social Media",
+#   description: "AdTangerine is an advertising platform that uses the tangerine as a virtual currency for advertisers and publishers in order to share content on social networks.",
+#   images:      ["http://s3-us-west-2.amazonaws.com/adtangerine-prod/users/avatars/000/000/834/thumb/275747_1118382211_1929809351_n.jpg"],
+#   uri:         "http://adtangerine.com"
+# }
+'Wow! What an awesome site: http://adtangerine.com!'.scrapify images: [:png, :gif]
+#=> {
+#   title:       "AdTangerine | Advertising Platform for Social Media",
+#   description: "AdTangerine is an advertising platform that uses the tangerine as a virtual currency for advertisers and publishers in order to share content on social networks.",
+#   images:      ["http://adtangerine.com/assets/logo_adt_og.png", "http://adtangerine.com/assets/logo_adt_og.png", "http://adtangerine.com/assets/foobar.gif"],
+#   uri:         "http://adtangerine.com"
+# }
+```
+#### Choose which URI you want it to be scraped.
+``` ruby
+'Check out: http://adtangerine.com and www.twitflink.com'.scrapify which: 1
+#=> {
+#   title:       "TwitFlink | Find a link!",
+#   description: "TwitFlink is a very simple searching tool that allows people to find out links tweeted by any user from Twitter.",
+#   images:      ["http://www.twitflink.com//assets/tf_logo.png", "http://twitflink.com/assets/tf_logo.png"],
+#   uri:         "http://www.twitflink.com"
+# }
+'Check out: http://adtangerine.com and www.twitflink.com'.scrapify({ which: 0, images: :gif })
+#=> {
+#   title:       "AdTangerine | Advertising Platform for Social Media",
+#   description: "AdTangerine is an advertising platform that uses the tangerine as a virtual currency for advertisers and publishers in order to share content on social networks.",
+#   images:      ["http://adtangerine.com/assets/foobar.gif"],
+#   uri:         "http://adtangerine.com"
+# }
+```
+## Contributing
+1. Fork it
+2. Create your feature branch (`git checkout -b my-new-feature`)
+3. Commit your changes (`git commit -am 'Add some feature'`)
+4. Push to the branch (`git push origin my-new-feature`)
+5. Create new Pull Request

data/lib/scrapifier/methods.rb CHANGED

@@ -4,67 +4,57 @@ require 'open-uri'
 require 'scrapifier/support'
 module Scrapifier
+  # Methods which will be included into the String class.
   module Methods
     include Scrapifier::Support
-    # Gets meta data from an URI using the screen scraping technique.
-    #
+    # Get metadata from an URI using the screen scraping technique.
+    #
     # Example:
     #   >> 'Wow! What an awesome site: http://adtangerine.com!'.scrapify
     #   => {
     #        :title => "AdTangerine | Advertising Platform for Social Media",
-    #        :description => "AdTangerine is an advertising platform that uses the tangerine as a virtual currency...",
-    #        :images => ["http://adtangerine.com/assets/logo_adt_og.png", "http://adtangerine.com/assets/logo_adt_og.png],
+    #        :description => "AdTangerine is an advertising platform that...",
+    #        :images => [
+    #          "http://adtangerine.com/assets/logo_adt_og.png",
+    #          "http://adtangerine.com/assets/logo_adt_og.png
+    #        ],
     #        :uri => "http://adtangerine.com"
     #      }
     # Arguments:
     #   options: (Hash)
-    #     - which: (Integer) Indicates which URI in the String will be used. It starts from 0 to N.
-    #     - images: (Symbol or Array) Indicates the image extensions which are allowed to be returned as result.
+    #     - which: (Integer)
+    #         Which URI in the String will be used. It starts from 0 to N.
+    #     - images: (Symbol or Array)
+    #         Image extensions which are allowed to be returned as result.
     def scrapify(options = {})
-      meta, uri = {}, find_uri(options[:which])
-      begin
-        if uri.nil?
-          raise
-        elsif uri =~ sf_regex(:image)
-          uri = (sf_check_img_ext(uri, options[:images])[0] rescue [])
-          raise if uri.empty?
-          [:title, :description, :uri, :images].each { |key| meta[key] = uri }
-        else
-          doc          = Nokogiri::HTML(open(uri).read)
-          doc.encoding = 'utf-8'
-          [:title, :description].each do |key|
-            meta[key] = (doc.xpath(sf_paths[key])[0].text rescue '-')
-          end
+      uri, meta = find_uri(options[:which]), {}
+      return meta if uri.nil?
-          meta[:images] = sf_fix_imgs(doc.xpath(sf_paths[:image]), uri, options[:images])
-          meta[:uri]    = uri
-        end
-      rescue
-        meta = {}
+      if !(uri =~ sf_regex(:image))
+        meta = sf_eval_uri(uri, options[:images])
+      elsif !sf_check_img_ext(uri, options[:images]).empty?
+        [:title, :description, :uri, :images].each { |k| meta[k] = uri }
       end
       meta
     end
-    # Looks for URIs in the String.
-    #
+    # Find URIs in the String.
+    #
     # Example:
     #   >> 'Wow! What an awesome site: http://adtangerine.com!'.find_uri
     #   => 'http://adtangerine.com'
-    #   >> 'Wow! What an awesome sites: http://adtangerine.com and www.twitflink.com'.find_uri 1
+    #   >> 'Very cool: http://adtangerine.com and www.twitflink.com'.find_uri 1
     #   => 'www.twitflink.com'
     # Arguments:
     #   which: (Integer)
     #     - Which URI in the String: first (0), second (1) and so on.
     def find_uri(which = 0)
-      which ||= which.to_i
-      which = self.scan(sf_regex(:uri))[which][0] rescue nil
-      (which.nil? or which =~ sf_regex(:protocol)) ? which : 'http://' << which
+      which = scan(sf_regex(:uri))[which.to_i][0]
+      which =~ sf_regex(:protocol) ? which : "http://#{which}"
+    rescue NoMethodError
+      nil
     end
   end
 end
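
The reworked `find_uri` in this hunk swaps the inline `rescue nil` modifier for a method-level `rescue NoMethodError`. Below is a minimal standalone sketch of that control flow. It is an illustration under assumptions, not the gem's code: `URI_PATTERN` and `PROTOCOL_PATTERN` are simplified stand-ins for what `sf_regex(:uri)` and `sf_regex(:protocol)` return, and `find_uri_sketch` is a hypothetical name.

```ruby
# Simplified stand-in patterns (assumptions, NOT the gem's actual regexes).
URI_PATTERN      = /((?:https?:\/\/|www\.)[a-z0-9\-.]+\.[a-z]+)/i
PROTOCOL_PATTERN = /\A(?:ht|f)tps?:\/\//i

# Hypothetical free-function version of String#find_uri from the diff.
# text:  the String to search.
# which: zero-based index of the URI to return (nil is treated as 0).
def find_uri_sketch(text, which = 0)
  # scan returns one single-element Array per match; an out-of-range
  # index yields nil, and nil[0] raises NoMethodError.
  match = text.scan(URI_PATTERN)[which.to_i][0]
  # Prepend a default protocol when the match has none.
  match =~ PROTOCOL_PATTERN ? match : "http://#{match}"
rescue NoMethodError
  # No URI at that position: signal it with nil, as the diff does.
  nil
end

sample = 'Check out: http://adtangerine.com and www.twitflink.com'
find_uri_sketch(sample)     # => "http://adtangerine.com"
find_uri_sketch(sample, 1)  # => "http://www.twitflink.com"
find_uri_sketch(sample, 5)  # => nil
```

Returning `nil` for a missing URI is what allows the new `scrapify` to exit early with an empty Hash through `return meta if uri.nil?`, instead of relying on the bare `raise` and blanket `rescue` of 0.0.1.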

data/lib/scrapifier/support.rb CHANGED

@@ -1,144 +1,168 @@
 module Scrapifier
   module Support
-    private
-      # Filters images returning those with the allowed extentions.
-      #
-      # Example:
-      #   >> sf_check_img_ext('http://source.com/image.gif', :jpg)
-      #   => []
-      #   >> sf_check_img_ext(['http://source.com/image.gif', 'http://source.com/image.jpg'], [:jpg, :png])
-      #   => ['http://source.com/image.jpg']
-      # Arguments:
-      #   images: (String or Array)
-      #     - Images which will be checked.
-      #   allowed: (String, Symbol or Array)
-      #     - Allowed types of image extension.
-      def sf_check_img_ext(images, allowed = [])
-        allowed ||= []
-        if images.is_a?(String)
-          images = images.split
-        elsif !images.is_a?(Array)
-          images = []
-        end
-        images.select { |i| i =~ sf_regex(:image, allowed) }
-      end
+    module_function
-      # Selects regexes for URIs, protocols and image extensions.
-      #
-      # Example:
-      #   >> sf_regex(:uri)
-      #   => /\b((((ht|f)tp[s]?:\/\/)|([a-z0-9]+\.))+(?<!@)([a-z0-9\_\-]+)(\.[a-z]+)+([\?\/\:][a-z0-9_=%&@\?\.\/\-\:\#\(\)]+)?\/?)/i,
-      #   >> sf_regex(:image, :jpg)
-      #   => /(^http{1}[s]?:\/\/([w]{3}\.)?.+\.(jpg)(\?.+)?$)/i
-      # Arguments:
-      #   type: (Symbol or String)
-      #     - Regex type.
-      #   args: (*)
-      #     - Anything.
+    # Evaluate the URI's HTML document and get its metadata.
+    #
+    # Example:
+    #   >> eval_uri('http://adtangerine.com', [:png])
+    #   => {
+    #        :title => "AdTangerine | Advertising Platform for Social Media",
+    #        :description => "AdTangerine is an advertising platform that...",
+    #        :images => [
+    #          "http://adtangerine.com/assets/logo_adt_og.png",
+    #          "http://adtangerine.com/assets/logo_adt_og.png
+    #        ],
+    #        :uri => "http://adtangerine.com"
+    #      }
+    # Arguments:
+    #   uri: (String)
+    #     - URI.
+    #   imgs: (Array)
+    #     - Allowed type of images.
+    def sf_eval_uri(uri, imgs = [])
+      doc = Nokogiri::HTML(open(uri).read)
+      doc.encoding, meta = 'utf-8', { uri: uri }
-      def sf_regex(type, *args)
-        type = type.to_sym unless type.is_a? Symbol
-        if type == :image
-          sf_img_regex args.flatten
-        else
-          regexes = {
-            uri:      /\b((((ht|f)tp[s]?:\/\/)|([a-z0-9]+\.))+(?<!@)([a-z0-9\_\-]+)(\.[a-z]+)+([\?\/\:][a-z0-9_=%&@\?\.\/\-\:\#\(\)]+)?\/?)/i,
-            protocol: /((ht|f)tp[s]?)/i
-          }
-          regexes[type]
-        end
-      end
+      %i(title description).each { |k| meta[k] = (doc.xpath(sf_paths[k])[0].text rescue '-') }
+      meta[:images] = sf_fix_imgs(doc.xpath(sf_paths[:image]), uri, imgs)
+      meta
+    rescue SocketError
+      {}
+    end
-      # Builds image regexes according to the required extensions.
-      #
-      # Example:
-      #   >> sf_img_regex
-      #   => /(^http{1}[s]?:\/\/([w]{3}\.)?.+\.(jpg|jpeg|png|gif)(\?.+)?$)/i
-      #   >> sf_img_regex([:jpg, :png])
-      #   => /(^http{1}[s]?:\/\/([w]{3}\.)?.+\.(jpg|png)(\?.+)?$)/i
-      # Arguments:
-      #   exts: (Array)
-      #     - Image extensions which will be included in the regex.
-      def sf_img_regex(exts = [])
-        exts = [exts].flatten unless exts.is_a?(Array)
-        if exts.nil? or exts.empty?
-          exts = %w(jpg jpeg png gif)
-        elsif exts.include?(:jpg) and !exts.include?(:jpeg)
-          exts.push :jpeg
-        end
-        eval "/(^http{1}[s]?:\\/\\/([w]{3}\\.)?.+\\.(#{exts.join('|')})(\\?.+)?$)/i"
+    # Filter images returning those with the allowed extentions.
+    #
+    # Example:
+    #   >> sf_check_img_ext('http://source.com/image.gif', :jpg)
+    #   => []
+    #   >> sf_check_img_ext(['http://source.com/image.gif', 'http://source.com/image.jpg'], [:jpg, :png])
+    #   => ['http://source.com/image.jpg']
+    # Arguments:
+    #   images: (String or Array)
+    #     - Images which will be checked.
+    #   allowed: (String, Symbol or Array)
+    #     - Allowed types of image extension.
+    def sf_check_img_ext(images, allowed = [])
+      allowed ||= []
+      if images.is_a?(String)
+        images = images.split
+      elsif !images.is_a?(Array)
+        images = []
       end
+      images.select { |i| i =~ sf_regex(:image, allowed) }
+    end
-      # Collection of paths used to get content from HTML tags via Node#xpath method.
-      # See more: http://nokogiri.org/tutorials/searching_a_xml_html_document.html
-      #
-      # Example:
-      #   >> sf_paths[:title]
-      #   => '//meta[@property = "og:title"]/@content | //meta[@name = "title"]/@content | //meta[@name = "Title"]/@content | //title | //h1'
-      def sf_paths
-        {
-          title:       '//meta[@property = "og:title"]/@content | //meta[@name = "title"]/@content | //meta[@name = "Title"]/@content | //title | //h1',
-          description: '//meta[@property = "og:description"]/@content | //meta[@name = "description"]/@content | //meta[@name = "Description"]/@content | //h1 | //h3 | //p | //span | //font',
-          image:       '//meta[@property = "og:image"]/@content | //link[@rel = "image_src"]/@href | //meta[@itemprop = "image"]/@content | //div[@id = "logo"]/img/@src | //a[@id = "logo"]/img/@src | //div[@class = "logo"]/img/@src | //a[@class = "logo"]/img/@src | //a//img[@width]/@src | //img[@width]/@src | //a//img[@height]/@src | //img[@height]/@src | //a//img/@src | //span//img/@src'
+    # Select regexes for URIs, protocols and image extensions.
+    #
+    # Example:
+    #   >> sf_regex(:uri)
+    #   => /\b((((ht|f)tp[s]?:\/\/)|([a-z0-9]+\.))+(?<!@)([a-z0-9\_\-]+)(\.[a-z]+)+([\?\/\:][a-z0-9_=%&@\?\.\/\-\:\#\(\)]+)?\/?)/i,
+    #   >> sf_regex(:image, :jpg)
+    #   => /(^http{1}[s]?:\/\/([w]{3}\.)?.+\.(jpg)(\?.+)?$)/i
+    # Arguments:
+    #   type: (Symbol or String)
+    #     - Regex type.
+    #   args: (*)
+    #     - Anything.
+    def sf_regex(type, *args)
+      type = type.to_sym unless type.is_a? Symbol
+      if type == :image
+        sf_img_regex args.flatten
+      else
+        regexes = {
+          uri:      /\b((((ht|f)tp[s]?:\/\/)|([a-z0-9]+\.))+(?<!@)([a-z0-9\_\-]+)(\.[a-z]+)+([\?\/\:][a-z0-9_=%&@\?\.\/\-\:\#\(\)]+)?\/?)/i,
+          protocol: /((ht|f)tp[s]?)/i
         }
+        regexes[type]
       end
+    end
-      # Checks and returns only the valid image URIs.
-      #
-      # Example:
-      #   >>  sf_fix_imgs(['http://adtangerine.com/image.png', '/assets/image.jpg'], 'http://adtangerine.com', :jpg)
-      #   => ['http://adtangerine/assets/image.jpg']
-      # Arguments:
-      #   imgs: (Array)
-      #     - Image URIs got from the HTML doc.
-      #   uri: (String)
-      #     - Used as basis to the URIs that don't have any protocol/domain set.
-      #   exts: (Symbol or Array)
-      #     -  Allowed image extesntions.
+    # Build image regexes according to the required extensions.
+    #
+    # Example:
+    #   >> sf_img_regex
+    #   => /(^http{1}[s]?:\/\/([w]{3}\.)?.+\.(jpg|jpeg|png|gif)(\?.+)?$)/i
+    #   >> sf_img_regex([:jpg, :png])
+    #   => /(^http{1}[s]?:\/\/([w]{3}\.)?.+\.(jpg|png)(\?.+)?$)/i
+    # Arguments:
+    #   exts: (Array)
+    #     - Image extensions which will be included in the regex.
+    def sf_img_regex(exts = [])
+      exts = [exts].flatten unless exts.is_a?(Array)
+      if exts.nil? or exts.empty?
+        exts = %w(jpg jpeg png gif)
+      elsif exts.include?(:jpg) and !exts.include?(:jpeg)
+        exts.push :jpeg
+      end
+      eval "/(^http{1}[s]?:\\/\\/([w]{3}\\.)?.+\\.(#{exts.join('|')})(\\?.+)?$)/i"
+    end
-      def sf_fix_imgs(imgs, uri, exts = [])
-        sf_check_img_ext(imgs.map do |img|
-          img = img.to_s
-          img = sf_fix_protocol(img, sf_domain(uri)) unless img =~ sf_regex(:protocol)
-          img if (img =~ sf_regex(:image))
-        end.compact, exts)
-      end
+    # Collection of paths used to get content from HTML tags via Node#xpath method.
+    # See more: http://nokogiri.org/tutorials/searching_a_xml_html_document.html
+    #
+    # Example:
+    #   >> sf_paths[:title]
+    #   => '//meta[@property = "og:title"]/@content | //meta[@name = "title"]/@content | //meta[@name = "Title"]/@content | //title | //h1'
+    def sf_paths
+      {
+        title:       '//meta[@property = "og:title"]/@content | //meta[@name = "title"]/@content | //meta[@name = "Title"]/@content | //title | //h1',
+        description: '//meta[@property = "og:description"]/@content | //meta[@name = "description"]/@content | //meta[@name = "Description"]/@content | //h1 | //h3 | //p | //span | //font',
+        image:       '//meta[@property = "og:image"]/@content | //link[@rel = "image_src"]/@href | //meta[@itemprop = "image"]/@content | //div[@id = "logo"]/img/@src | //a[@id = "logo"]/img/@src | //div[@class = "logo"]/img/@src | //a[@class = "logo"]/img/@src | //a//img[@width]/@src | //img[@width]/@src | //a//img[@height]/@src | //img[@height]/@src | //a//img/@src | //span//img/@src'
+      }
+    end
-      # Fixes image URIs that doesn't present protocol/domain.
-      #
-      # Example:
-      #   >> sf_fix_protocol('/assets/image.jpg', 'http://adtangerine.com')
-      #   => 'http://adtangerine/assets/image.jpg'
-      #   >> sf_fix_protocol('//s.ytimg.com/yts/img/youtub_img.png', 'https://youtube.com')
-      #   => 'https://s.ytimg.com/yts/img/youtub_img.png'
-      # Arguments:
-      #   path: (String)
-      #     - URI path having no protocol/domain set.
-      #   domain: (String)
-      #     - Domain that will be prepended into the path.
+    # Check and return only the valid image URIs.
+    #
+    # Example:
+    #   >>  sf_fix_imgs(['http://adtangerine.com/image.png', '/assets/image.jpg'], 'http://adtangerine.com', :jpg)
+    #   => ['http://adtangerine/assets/image.jpg']
+    # Arguments:
+    #   imgs: (Array)
+    #     - Image URIs got from the HTML doc.
+    #   uri: (String)
+    #     - Used as basis to the URIs that don't have any protocol/domain set.
+    #   exts: (Symbol or Array)
+    #     -  Allowed image extesntions.
+    def sf_fix_imgs(imgs, uri, exts = [])
+      sf_check_img_ext(imgs.map do |img|
+        img = img.to_s
+        img = sf_fix_protocol(img, sf_domain(uri)) unless img =~ sf_regex(:protocol)
+        img if (img =~ sf_regex(:image))
+      end.compact, exts)
+    end
-      def sf_fix_protocol(path, domain)
-        if path =~ /^\/\/[^\/]+/
-          'http:' << path
-        else
-           "http://#{domain}#{'/' unless path =~ /^\/[^\/]+/}#{path}"
-        end
-      end
+    # Fix image URIs that don't have a protocol/domain set.
+    #
+    # Example:
+    #   >> sf_fix_protocol('/assets/image.jpg', 'http://adtangerine.com')
+    #   => 'http://adtangerine/assets/image.jpg'
+    #   >> sf_fix_protocol('//s.ytimg.com/yts/img/youtub_img.png', 'https://youtube.com')
+    #   => 'https://s.ytimg.com/yts/img/youtub_img.png'
+    # Arguments:
+    #   path: (String)
+    #     - URI path having no protocol/domain set.
+    #   domain: (String)
+    #     - Domain that will be prepended into the path.
+    def sf_fix_protocol(path, domain)
+      if path =~ /^\/\/[^\/]+/
+        'http:' << path
+      else
+         "http://#{domain}#{'/' unless path =~ /^\/[^\/]+/}#{path}"
+      end
+    end
-      # Returns the domain from an URI
-      #
-      # Example:
-      #   >> sf_domain('http://adtangerine.com')
-      #   => 'adtangerine.com'
-      # Arguments:
-      #   uri: (String)
-      #     - URI.
-      def sf_domain(uri)
-        (uri.split('/')[2] rescue '')
-      end
+    # Return the URI domain.
+    #
+    # Example:
+    #   >> sf_domain('http://adtangerine.com')
+    #   => 'adtangerine.com'
+    # Arguments:
+    #   uri: (String)
+    #     - URI.
+    def sf_domain(uri)
+      (uri.split('/')[2] rescue '')
+    end
   end
 end
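
Two of the helpers in this hunk are pure string functions, so their behaviour can be checked without Nokogiri or network access. The sketch below is a hedged illustration, not the gem's implementation: `img_regex_sketch` builds the image pattern by regex interpolation instead of the `eval` call used in `sf_img_regex`, and `fix_protocol_sketch` mirrors the branches of `sf_fix_protocol`. Both names and the `DEFAULT_EXTS` constant are hypothetical. Reading the method bodies as shown, two of the doc-comment examples look out of step with the code: `[:jpg, :png]` also gains `jpeg`, and protocol-relative paths always receive `http:` rather than the page's own protocol.

```ruby
# Hypothetical default list, mirroring %w(jpg jpeg png gif) in the diff.
DEFAULT_EXTS = %w(jpg jpeg png gif).freeze

# Eval-free variant of sf_img_regex (an assumption, not the gem's code).
# exts: Symbol, String, Array of either, or nil.
def img_regex_sketch(exts = [])
  exts = Array(exts).map(&:to_s)
  exts = DEFAULT_EXTS.dup if exts.empty?
  # As in the diff, asking for jpg implicitly allows jpeg as well.
  exts << 'jpeg' if exts.include?('jpg') && !exts.include?('jpeg')
  # Escape each extension so it is matched literally.
  alternatives = exts.map { |ext| Regexp.escape(ext) }.join('|')
  %r{^https?://(?:www\.)?.+\.(?:#{alternatives})(?:\?.+)?$}i
end

# Mirror of the two branches in sf_fix_protocol.
# path:   image path lacking a protocol and/or domain.
# domain: bare host name, e.g. the value sf_domain returns.
def fix_protocol_sketch(path, domain)
  if path =~ %r{^//[^/]+}
    # Protocol-relative URI: only the scheme is missing.
    "http:#{path}"
  else
    # Relative path: add the host, plus a slash when the path has none.
    separator = path =~ %r{^/[^/]+} ? '' : '/'
    "http://#{domain}#{separator}#{path}"
  end
end

img_regex_sketch(:jpg).match?('http://source.com/image.jpeg')  # => true
img_regex_sketch(:jpg).match?('http://source.com/image.gif')   # => false
fix_protocol_sketch('/assets/image.jpg', 'adtangerine.com')
# => "http://adtangerine.com/assets/image.jpg"
fix_protocol_sketch('//s.ytimg.com/yts/img/youtub_img.png', 'youtube.com')
# => "http://s.ytimg.com/yts/img/youtub_img.png"
```

Interpolating into a regex literal keeps user-supplied extensions out of `eval`, which is the main design difference from the method in the diff. Note that `fix_protocol_sketch` expects a bare host, matching how `sf_fix_imgs` passes `sf_domain(uri)` rather than a full URI.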

data/lib/scrapifier/version.rb CHANGED

@@ -1,3 +1,3 @@
 module Scrapifier
-  VERSION = '0.0.1'
+  VERSION = '0.0.2'
 end

data/spec/factories/{uris.rb → samples.rb} RENAMED

File without changes

data/spec/spec_helper.rb CHANGED

@@ -1,5 +1,5 @@
 require 'rubygems'
 require 'bundler/setup'
 require 'scrapifier'
-require 'factories/uris'
+require 'factories/samples'

metadata CHANGED

@@ -1,14 +1,14 @@
 --- !ruby/object:Gem::Specification
 name: scrapifier
 version: !ruby/object:Gem::Version
-  version: 0.0.1
+  version: 0.0.2
 platform: ruby
 authors:
 - Tiago Guedes
 autorequire:
 bindir: bin
 cert_chain: []
-date: 2014-04-07 00:00:00.000000000 Z
+date: 2014-04-30 00:00:00.000000000 Z
 dependencies:
 - !ruby/object:Gem::Dependency
   name: nokogiri
@@ -75,6 +75,7 @@ extensions: []
 extra_rdoc_files: []
 files:
 - ".gitignore"
+- ".travis.yml"
 - Gemfile
 - LICENSE.txt
 - README.md
@@ -84,7 +85,7 @@ files:
 - lib/scrapifier/support.rb
 - lib/scrapifier/version.rb
 - scrapifier.gemspec
-- spec/factories/uris.rb
+- spec/factories/samples.rb
 - spec/scrapifier_spec.rb
 - spec/spec_helper.rb
 homepage: https://github.com/tiagopog/scrapifier
@@ -112,6 +113,6 @@ signing_key:
 specification_version: 4
 summary: Extends the Ruby String class with a screen scraping method.
 test_files:
-- spec/factories/uris.rb
+- spec/factories/samples.rb
 - spec/scrapifier_spec.rb
 - spec/spec_helper.rb