RubyGems - scrapifier - Versions diffs - 0.0.1 → 0.0.2 - Mend

scrapifier 0.0.1 → 0.0.2

Files changed (9)

checksums.yaml +4 -4
data/.travis.yml +4 -0
data/README.md +88 -83
data/lib/scrapifier/methods.rb +25 -35
data/lib/scrapifier/support.rb +152 -128
data/lib/scrapifier/version.rb +1 -1
data/spec/factories/{uris.rb → samples.rb} +0 -0
data/spec/spec_helper.rb +1 -1
metadata +5 -4

checksums.yaml CHANGED

@@ -1,7 +1,7 @@
 ---
 SHA1:
-  metadata.gz: d645984640446b98bdc5bdf71972c991b361b75e
-  data.tar.gz: 9bd772bf8ab26ab4dda602fa69eee8c12ae45e39
+  metadata.gz: aa0d714c01fc436bf4f90f4baaf50e98bcb9197a
+  data.tar.gz: b9337a1a690c0a7f0c57b237f327eec80d9bbdda
 SHA512:
-  metadata.gz: ff5dd829fd8e41af883fccd65ade03bb44d968f71a5457a0f2ceeb3afb0e389f71b44c0b559b352793293f4675cd9556c6e0bbab9799b637f1c4e6e7bdbb61ee
-  data.tar.gz: 3b974c372000f5d4f795f32074bd4da64e4fd910e8623f8446169c18b6880b1967aee105bf5df4e9de36fd054697c567727d4a8eee48def87f51b93b4117f556
+  metadata.gz: 75baa6ed1838759bd6c0ebc8a453120b6c2cfe8f484f4396a5401a1d1acd66eebb0377423750a647ca9417e0fb8f4677ba688d40a4bc7e642cee653a6a76131a
+  data.tar.gz: a77ab52807dbcf3a9846226641a00b210324e2d64db0d18324597be380e576faea9c65eb97b00ea0a98e9de6359329f3b295f168fd001b59a4b9b8390ef3badc
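
Note on the hunk above: checksums.yaml records a SHA1 and a SHA512 digest for each of metadata.gz and data.tar.gz, so all four values change whenever the packaged contents change. A minimal sketch of how such digests could be computed and compared, using Ruby's standard `digest` library. The helper names `archive_digests` and `digests_match?` and the sample bytes are illustrative assumptions, not part of the gem or of RubyGems.

```ruby
require 'digest'

# Hypothetical helper: returns the two digests that checksums.yaml records
# for one archive member, given that member's raw bytes.
def archive_digests(bytes)
  {
    'SHA1'   => Digest::SHA1.hexdigest(bytes),
    'SHA512' => Digest::SHA512.hexdigest(bytes)
  }
end

# Hypothetical helper: compares freshly computed digests with recorded ones.
def digests_match?(bytes, recorded)
  archive_digests(bytes) == recorded
end

# Stand-in data; a real check would read the bytes of data.tar.gz instead.
sample   = 'stand-in for the bytes of data.tar.gz'
recorded = archive_digests(sample)

puts digests_match?(sample, recorded)        # unchanged bytes
puts digests_match?(sample + 'x', recorded)  # altered bytes
```

A single changed byte yields different digests, which is why both entries differ between 0.0.1 and 0.0.2.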

data/.travis.yml ADDED

@@ -0,0 +1,4 @@
+language: ruby
+rvm:
+  - 2.0.0
+  - 1.9.3

data/README.md CHANGED

@@ -1,83 +1,88 @@
-# Scrapifier
-It's a Ruby gem that brings a very simple way to extract meta information from URIs using the screen scraping technique.
-## Installation
-Compatible with Ruby 1.9.3+
-Add this line to your application's Gemfile:
-    gem 'scrapifier'
-And then execute:
-    $ bundle
-Or install it yourself as:
-    $ gem install scrapifier
-## Usage
-The method finds an URI in the String and gets some meta information from it, like the page's title, description, images and the URI. All the data is returned in a well-formatted Hash.
-#### Default usage.
-``` ruby
-'Wow! What an awesome site: http://adtangerine.com!'.scrapify
-#=> {
-#   title:       "AdTangerine | Advertising Platform for Social Media",
-#   description: "AdTangerine is an advertising platform that uses the tangerine as a virtual currency for advertisers and publishers in order to share content on social networks.",
-#   images:      ["http://adtangerine.com/assets/logo_adt_og.png", "http://adtangerine.com/assets/logo_adt_og.png", "http://s3-us-west-2.amazonaws.com/adtangerine-prod/users/avatars/000/000/834/thumb/275747_1118382211_1929809351_n.jpg", "http://adtangerine.com/assets/foobar.gif"],
-#   uri:         "http://adtangerine.com"
-# }
-```
-#### Allow only certain image types.
-``` ruby
-'Wow! What an awesome site: http://adtangerine.com!'.scrapify images: :jpg
-#=> {
-#   title:       "AdTangerine | Advertising Platform for Social Media",
-#   description: "AdTangerine is an advertising platform that uses the tangerine as a virtual currency for advertisers and publishers in order to share content on social networks.",
-#   images:      ["http://s3-us-west-2.amazonaws.com/adtangerine-prod/users/avatars/000/000/834/thumb/275747_1118382211_1929809351_n.jpg"],
-#   uri:         "http://adtangerine.com"
-# }
-'Wow! What an awesome site: http://adtangerine.com!'.scrapify images: [:png, :gif]
-#=> {
-#   title:       "AdTangerine | Advertising Platform for Social Media",
-#   description: "AdTangerine is an advertising platform that uses the tangerine as a virtual currency for advertisers and publishers in order to share content on social networks.",
-#   images:      ["http://adtangerine.com/assets/logo_adt_og.png", "http://adtangerine.com/assets/logo_adt_og.png", "http://adtangerine.com/assets/foobar.gif"],
-#   uri:         "http://adtangerine.com"
-# }
-```
-#### Choose which URI you want it to be scraped.
-``` ruby
-'Check out: http://adtangerine.com and www.twitflink.com'.scrapify which: 1
-#=> {
-#   title:       "TwitFlink | Find a link!",
-#   description: "TwitFlink is a very simple searching tool that allows people to find out links tweeted by any user from Twitter.",
-#   images:      ["http://www.twitflink.com//assets/tf_logo.png", "http://twitflink.com/assets/tf_logo.png"],
-#   uri:         "http://www.twitflink.com"
-# }
-'Check out: http://adtangerine.com and www.twitflink.com'.scrapify({ which: 0, images: :gif })
-#=> {
-#   title:       "AdTangerine | Advertising Platform for Social Media",
-#   description: "AdTangerine is an advertising platform that uses the tangerine as a virtual currency for advertisers and publishers in order to share content on social networks.",
-#   images:      ["http://adtangerine.com/assets/foobar.gif"],
-#   uri:         "http://adtangerine.com"
-# }
-```
-## Contributing
-1. Fork it
-2. Create your feature branch (`git checkout -b my-new-feature`)
-3. Commit your changes (`git commit -am 'Add some feature'`)
-4. Push to the branch (`git push origin my-new-feature`)
-5. Create new Pull Request
+# Scrapifier
+[![Build Status](https://travis-ci.org/tiagopog/scrapifier.svg?branch=master)](https://travis-ci.org/tiagopog/scrapifier)
+[![Code Climate](https://codeclimate.com/github/tiagopog/scrapifier.png)](https://codeclimate.com/github/tiagopog/scrapifier)
+[![Dependency Status](https://gemnasium.com/tiagopog/scrapifier.svg)](https://gemnasium.com/tiagopog/scrapifier)
+[![Gem Version](https://badge.fury.io/rb/scrapifier.svg)](http://badge.fury.io/rb/scrapifier)
+It's a Ruby gem that brings a very simple way to extract meta information from URIs using the screen scraping technique.
+## Installation
+Compatible with Ruby 1.9.3+
+Add this line to your application's Gemfile:
+    gem 'scrapifier'
+And then execute:
+    $ bundle
+Or install it yourself as:
+    $ gem install scrapifier
+## Usage
+The method finds an URI in the String and gets some meta information from it, like the page's title, description, images and the URI. All the data is returned in a well-formatted Hash.
+#### Default usage.
+``` ruby
+'Wow! What an awesome site: http://adtangerine.com!'.scrapify
+#=> {
+#   title:       "AdTangerine | Advertising Platform for Social Media",
+#   description: "AdTangerine is an advertising platform that uses the tangerine as a virtual currency for advertisers and publishers in order to share content on social networks.",
+#   images:      ["http://adtangerine.com/assets/logo_adt_og.png", "http://adtangerine.com/assets/logo_adt_og.png", "http://s3-us-west-2.amazonaws.com/adtangerine-prod/users/avatars/000/000/834/thumb/275747_1118382211_1929809351_n.jpg", "http://adtangerine.com/assets/foobar.gif"],
+#   uri:         "http://adtangerine.com"
+# }
+```
+#### Allow only certain image types.
+``` ruby
+'Wow! What an awesome site: http://adtangerine.com!'.scrapify images: :jpg
+#=> {
+#   title:       "AdTangerine | Advertising Platform for Social Media",
+#   description: "AdTangerine is an advertising platform that uses the tangerine as a virtual currency for advertisers and publishers in order to share content on social networks.",
+#   images:      ["http://s3-us-west-2.amazonaws.com/adtangerine-prod/users/avatars/000/000/834/thumb/275747_1118382211_1929809351_n.jpg"],
+#   uri:         "http://adtangerine.com"
+# }
+'Wow! What an awesome site: http://adtangerine.com!'.scrapify images: [:png, :gif]
+#=> {
+#   title:       "AdTangerine | Advertising Platform for Social Media",
+#   description: "AdTangerine is an advertising platform that uses the tangerine as a virtual currency for advertisers and publishers in order to share content on social networks.",
+#   images:      ["http://adtangerine.com/assets/logo_adt_og.png", "http://adtangerine.com/assets/logo_adt_og.png", "http://adtangerine.com/assets/foobar.gif"],
+#   uri:         "http://adtangerine.com"
+# }
+```
+#### Choose which URI you want it to be scraped.
+``` ruby
+'Check out: http://adtangerine.com and www.twitflink.com'.scrapify which: 1
+#=> {
+#   title:       "TwitFlink | Find a link!",
+#   description: "TwitFlink is a very simple searching tool that allows people to find out links tweeted by any user from Twitter.",
+#   images:      ["http://www.twitflink.com//assets/tf_logo.png", "http://twitflink.com/assets/tf_logo.png"],
+#   uri:         "http://www.twitflink.com"
+# }
+'Check out: http://adtangerine.com and www.twitflink.com'.scrapify({ which: 0, images: :gif })
+#=> {
+#   title:       "AdTangerine | Advertising Platform for Social Media",
+#   description: "AdTangerine is an advertising platform that uses the tangerine as a virtual currency for advertisers and publishers in order to share content on social networks.",
+#   images:      ["http://adtangerine.com/assets/foobar.gif"],
+#   uri:         "http://adtangerine.com"
+# }
+```
+## Contributing
+1. Fork it
+2. Create your feature branch (`git checkout -b my-new-feature`)
+3. Commit your changes (`git commit -am 'Add some feature'`)
+4. Push to the branch (`git push origin my-new-feature`)
+5. Create new Pull Request

data/lib/scrapifier/methods.rb CHANGED

@@ -4,67 +4,57 @@ require 'open-uri'
 require 'scrapifier/support'
 module Scrapifier
+  # Methods which will be included into the String class.
   module Methods
     include Scrapifier::Support
-    # Gets meta data from an URI using the screen scraping technique.
-    #
+    # Get metadata from an URI using the screen scraping technique.
+    #
     # Example:
     #   >> 'Wow! What an awesome site: http://adtangerine.com!'.scrapify
     #   => {
     #        :title => "AdTangerine | Advertising Platform for Social Media",
-    #        :description => "AdTangerine is an advertising platform that uses the tangerine as a virtual currency...",
-    #        :images => ["http://adtangerine.com/assets/logo_adt_og.png", "http://adtangerine.com/assets/logo_adt_og.png],
+    #        :description => "AdTangerine is an advertising platform that...",
+    #        :images => [
+    #          "http://adtangerine.com/assets/logo_adt_og.png",
+    #          "http://adtangerine.com/assets/logo_adt_og.png
+    #        ],
     #        :uri => "http://adtangerine.com"
     #      }
     # Arguments:
     #   options: (Hash)
-    #     - which: (Integer) Indicates which URI in the String will be used. It starts from 0 to N.
-    #     - images: (Symbol or Array) Indicates the image extensions which are allowed to be returned as result.
+    #     - which: (Integer)
+    #         Which URI in the String will be used. It starts from 0 to N.
+    #     - images: (Symbol or Array)
+    #         Image extensions which are allowed to be returned as result.
     def scrapify(options = {})
-      meta, uri = {}, find_uri(options[:which])
-      begin
-        if uri.nil?
-          raise
-        elsif uri =~ sf_regex(:image)
-          uri = (sf_check_img_ext(uri, options[:images])[0] rescue [])
-          raise if uri.empty?
-          [:title, :description, :uri, :images].each { |key| meta[key] = uri }
-        else
-          doc          = Nokogiri::HTML(open(uri).read)
-          doc.encoding = 'utf-8'
-          [:title, :description].each do |key|
-            meta[key] = (doc.xpath(sf_paths[key])[0].text rescue '-')
-          end
+      uri, meta = find_uri(options[:which]), {}
+      return meta if uri.nil?
-          meta[:images] = sf_fix_imgs(doc.xpath(sf_paths[:image]), uri, options[:images])
-          meta[:uri]    = uri
-        end
-      rescue
-        meta = {}
+      if !(uri =~ sf_regex(:image))
+        meta = sf_eval_uri(uri, options[:images])
+      elsif !sf_check_img_ext(uri, options[:images]).empty?
+        [:title, :description, :uri, :images].each { |k| meta[k] = uri }
       end
       meta
     end
-    # Looks for URIs in the String.
-    #
+    # Find URIs in the String.
+    #
     # Example:
     #   >> 'Wow! What an awesome site: http://adtangerine.com!'.find_uri
     #   => 'http://adtangerine.com'
-    #   >> 'Wow! What an awesome sites: http://adtangerine.com and www.twitflink.com'.find_uri 1
+    #   >> 'Very cool: http://adtangerine.com and www.twitflink.com'.find_uri 1
     #   => 'www.twitflink.com'
     # Arguments:
     #   which: (Integer)
     #     - Which URI in the String: first (0), second (1) and so on.
     def find_uri(which = 0)
-      which ||= which.to_i
-      which = self.scan(sf_regex(:uri))[which][0] rescue nil
-      (which.nil? or which =~ sf_regex(:protocol)) ? which : 'http://' << which
+      which = scan(sf_regex(:uri))[which.to_i][0]
+      which =~ sf_regex(:protocol) ? which : "http://#{which}"
+    rescue NoMethodError
+      nil
     end
   end
 end
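
Note on the `find_uri` change above: the 0.0.2 version indexes the `scan` result directly and relies on a method-level `rescue NoMethodError` to return `nil` when the requested URI does not exist. The sketch below illustrates that control flow in isolation. It is a standalone function rather than a String method, and `URI_PATTERN` and `PROTOCOL_PATTERN` are simplified stand-ins assumed for this example; the gem's own `sf_regex(:uri)` and `sf_regex(:protocol)` are more elaborate.

```ruby
# Simplified stand-in patterns (assumptions for this sketch only).
URI_PATTERN      = %r{((?:https?://|www\.)[a-z0-9\-.]+\.[a-z]+)}i
PROTOCOL_PATTERN = %r{\Ahttps?://}i

# Hypothetical standalone variant of find_uri.
def find_uri_sketch(text, which = 0)
  # scan returns one array per match because the pattern has a capture group.
  # An out-of-range index yields nil, and calling [0] on nil raises
  # NoMethodError, which the method-level rescue turns into nil.
  found = text.scan(URI_PATTERN)[which.to_i][0]
  found =~ PROTOCOL_PATTERN ? found : "http://#{found}"
rescue NoMethodError
  nil
end

puts find_uri_sketch('Very cool: http://adtangerine.com and www.twitflink.com', 1)
puts find_uri_sketch('Very cool: http://adtangerine.com', 3).inspect
```

Because `nil.to_i` is `0`, passing `nil` as `which` selects the first URI, which is how `scrapify` can forward `options[:which]` unchanged.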

data/lib/scrapifier/support.rb CHANGED

@@ -1,144 +1,168 @@
 module Scrapifier
   module Support
-    private
-      # Filters images returning those with the allowed extentions.
-      #
-      # Example:
-      #   >> sf_check_img_ext('http://source.com/image.gif', :jpg)
-      #   => []
-      #   >> sf_check_img_ext(['http://source.com/image.gif', 'http://source.com/image.jpg'], [:jpg, :png])
-      #   => ['http://source.com/image.jpg']
-      # Arguments:
-      #   images: (String or Array)
-      #     - Images which will be checked.
-      #   allowed: (String, Symbol or Array)
-      #     - Allowed types of image extension.
-      def sf_check_img_ext(images, allowed = [])
-        allowed ||= []
-        if images.is_a?(String)
-          images = images.split
-        elsif !images.is_a?(Array)
-          images = []
-        end
-        images.select { |i| i =~ sf_regex(:image, allowed) }
-      end
+    module_function
-      # Selects regexes for URIs, protocols and image extensions.
-      #
-      # Example:
-      #   >> sf_regex(:uri)
-      #   => /\b((((ht|f)tp[s]?:\/\/)|([a-z0-9]+\.))+(?<!@)([a-z0-9\_\-]+)(\.[a-z]+)+([\?\/\:][a-z0-9_=%&@\?\.\/\-\:\#\(\)]+)?\/?)/i,
-      #   >> sf_regex(:image, :jpg)
-      #   => /(^http{1}[s]?:\/\/([w]{3}\.)?.+\.(jpg)(\?.+)?$)/i
-      # Arguments:
-      #   type: (Symbol or String)
-      #     - Regex type.
-      #   args: (*)
-      #     - Anything.
+    # Evaluate the URI's HTML document and get its metadata.
+    #
+    # Example:
+    #   >> eval_uri('http://adtangerine.com', [:png])
+    #   => {
+    #        :title => "AdTangerine | Advertising Platform for Social Media",
+    #        :description => "AdTangerine is an advertising platform that...",
+    #        :images => [
+    #          "http://adtangerine.com/assets/logo_adt_og.png",
+    #          "http://adtangerine.com/assets/logo_adt_og.png
+    #        ],
+    #        :uri => "http://adtangerine.com"
+    #      }
+    # Arguments:
+    #   uri: (String)
+    #     - URI.
+    #   imgs: (Array)
+    #     - Allowed type of images.
+    def sf_eval_uri(uri, imgs = [])
+      doc = Nokogiri::HTML(open(uri).read)
+      doc.encoding, meta = 'utf-8', { uri: uri }
-      def sf_regex(type, *args)
-        type = type.to_sym unless type.is_a? Symbol
-        if type == :image
-          sf_img_regex args.flatten
-        else
-          regexes = {
-            uri:      /\b((((ht|f)tp[s]?:\/\/)|([a-z0-9]+\.))+(?<!@)([a-z0-9\_\-]+)(\.[a-z]+)+([\?\/\:][a-z0-9_=%&@\?\.\/\-\:\#\(\)]+)?\/?)/i,
-            protocol: /((ht|f)tp[s]?)/i
-          }
-          regexes[type]
-        end
-      end
+      %i(title description).each { |k| meta[k] = (doc.xpath(sf_paths[k])[0].text rescue '-') }
+      meta[:images] = sf_fix_imgs(doc.xpath(sf_paths[:image]), uri, imgs)
+      meta
+    rescue SocketError
+      {}
+    end
-      # Builds image regexes according to the required extensions.
-      #
-      # Example:
-      #   >> sf_img_regex
-      #   => /(^http{1}[s]?:\/\/([w]{3}\.)?.+\.(jpg|jpeg|png|gif)(\?.+)?$)/i
-      #   >> sf_img_regex([:jpg, :png])
-      #   => /(^http{1}[s]?:\/\/([w]{3}\.)?.+\.(jpg|png)(\?.+)?$)/i
-      # Arguments:
-      #   exts: (Array)
-      #     - Image extensions which will be included in the regex.
-      def sf_img_regex(exts = [])
-        exts = [exts].flatten unless exts.is_a?(Array)
-        if exts.nil? or exts.empty?
-          exts = %w(jpg jpeg png gif)
-        elsif exts.include?(:jpg) and !exts.include?(:jpeg)
-          exts.push :jpeg
-        end
-        eval "/(^http{1}[s]?:\\/\\/([w]{3}\\.)?.+\\.(#{exts.join('|')})(\\?.+)?$)/i"
+    # Filter images returning those with the allowed extentions.
+    #
+    # Example:
+    #   >> sf_check_img_ext('http://source.com/image.gif', :jpg)
+    #   => []
+    #   >> sf_check_img_ext(['http://source.com/image.gif', 'http://source.com/image.jpg'], [:jpg, :png])
+    #   => ['http://source.com/image.jpg']
+    # Arguments:
+    #   images: (String or Array)
+    #     - Images which will be checked.
+    #   allowed: (String, Symbol or Array)
+    #     - Allowed types of image extension.
+    def sf_check_img_ext(images, allowed = [])
+      allowed ||= []
+      if images.is_a?(String)
+        images = images.split
+      elsif !images.is_a?(Array)
+        images = []
       end
+      images.select { |i| i =~ sf_regex(:image, allowed) }
+    end
-      # Collection of paths used to get content from HTML tags via Node#xpath method.
-      # See more: http://nokogiri.org/tutorials/searching_a_xml_html_document.html
-      #
-      # Example:
-      #   >> sf_paths[:title]
-      #   => '//meta[@property = "og:title"]/@content | //meta[@name = "title"]/@content | //meta[@name = "Title"]/@content | //title | //h1'
-      def sf_paths
-        {
-          title:       '//meta[@property = "og:title"]/@content | //meta[@name = "title"]/@content | //meta[@name = "Title"]/@content | //title | //h1',
-          description: '//meta[@property = "og:description"]/@content | //meta[@name = "description"]/@content | //meta[@name = "Description"]/@content | //h1 | //h3 | //p | //span | //font',
-          image:       '//meta[@property = "og:image"]/@content | //link[@rel = "image_src"]/@href | //meta[@itemprop = "image"]/@content | //div[@id = "logo"]/img/@src | //a[@id = "logo"]/img/@src | //div[@class = "logo"]/img/@src | //a[@class = "logo"]/img/@src | //a//img[@width]/@src | //img[@width]/@src | //a//img[@height]/@src | //img[@height]/@src | //a//img/@src | //span//img/@src'
+    # Select regexes for URIs, protocols and image extensions.
+    #
+    # Example:
+    #   >> sf_regex(:uri)
+    #   => /\b((((ht|f)tp[s]?:\/\/)|([a-z0-9]+\.))+(?<!@)([a-z0-9\_\-]+)(\.[a-z]+)+([\?\/\:][a-z0-9_=%&@\?\.\/\-\:\#\(\)]+)?\/?)/i,
+    #   >> sf_regex(:image, :jpg)
+    #   => /(^http{1}[s]?:\/\/([w]{3}\.)?.+\.(jpg)(\?.+)?$)/i
+    # Arguments:
+    #   type: (Symbol or String)
+    #     - Regex type.
+    #   args: (*)
+    #     - Anything.
+    def sf_regex(type, *args)
+      type = type.to_sym unless type.is_a? Symbol
+      if type == :image
+        sf_img_regex args.flatten
+      else
+        regexes = {
+          uri:      /\b((((ht|f)tp[s]?:\/\/)|([a-z0-9]+\.))+(?<!@)([a-z0-9\_\-]+)(\.[a-z]+)+([\?\/\:][a-z0-9_=%&@\?\.\/\-\:\#\(\)]+)?\/?)/i,
+          protocol: /((ht|f)tp[s]?)/i
         }
+        regexes[type]
       end
+    end
-      # Checks and returns only the valid image URIs.
-      #
-      # Example:
-      #   >>  sf_fix_imgs(['http://adtangerine.com/image.png', '/assets/image.jpg'], 'http://adtangerine.com', :jpg)
-      #   => ['http://adtangerine/assets/image.jpg']
-      # Arguments:
-      #   imgs: (Array)
-      #     - Image URIs got from the HTML doc.
-      #   uri: (String)
-      #     - Used as basis to the URIs that don't have any protocol/domain set.
-      #   exts: (Symbol or Array)
-      #     -  Allowed image extesntions.
+    # Build image regexes according to the required extensions.
+    #
+    # Example:
+    #   >> sf_img_regex
+    #   => /(^http{1}[s]?:\/\/([w]{3}\.)?.+\.(jpg|jpeg|png|gif)(\?.+)?$)/i
+    #   >> sf_img_regex([:jpg, :png])
+    #   => /(^http{1}[s]?:\/\/([w]{3}\.)?.+\.(jpg|png)(\?.+)?$)/i
+    # Arguments:
+    #   exts: (Array)
+    #     - Image extensions which will be included in the regex.
+    def sf_img_regex(exts = [])
+      exts = [exts].flatten unless exts.is_a?(Array)
+      if exts.nil? or exts.empty?
+        exts = %w(jpg jpeg png gif)
+      elsif exts.include?(:jpg) and !exts.include?(:jpeg)
+        exts.push :jpeg
+      end
+      eval "/(^http{1}[s]?:\\/\\/([w]{3}\\.)?.+\\.(#{exts.join('|')})(\\?.+)?$)/i"
+    end
-      def sf_fix_imgs(imgs, uri, exts = [])
-        sf_check_img_ext(imgs.map do |img|
-          img = img.to_s
-          img = sf_fix_protocol(img, sf_domain(uri)) unless img =~ sf_regex(:protocol)
-          img if (img =~ sf_regex(:image))
-        end.compact, exts)
-      end
+    # Collection of paths used to get content from HTML tags via Node#xpath method.
+    # See more: http://nokogiri.org/tutorials/searching_a_xml_html_document.html
+    #
+    # Example:
+    #   >> sf_paths[:title]
+    #   => '//meta[@property = "og:title"]/@content | //meta[@name = "title"]/@content | //meta[@name = "Title"]/@content | //title | //h1'
+    def sf_paths
+      {
+        title:       '//meta[@property = "og:title"]/@content | //meta[@name = "title"]/@content | //meta[@name = "Title"]/@content | //title | //h1',
+        description: '//meta[@property = "og:description"]/@content | //meta[@name = "description"]/@content | //meta[@name = "Description"]/@content | //h1 | //h3 | //p | //span | //font',
+        image:       '//meta[@property = "og:image"]/@content | //link[@rel = "image_src"]/@href | //meta[@itemprop = "image"]/@content | //div[@id = "logo"]/img/@src | //a[@id = "logo"]/img/@src | //div[@class = "logo"]/img/@src | //a[@class = "logo"]/img/@src | //a//img[@width]/@src | //img[@width]/@src | //a//img[@height]/@src | //img[@height]/@src | //a//img/@src | //span//img/@src'
+      }
+    end
-      # Fixes image URIs that doesn't present protocol/domain.
-      #
-      # Example:
-      #   >> sf_fix_protocol('/assets/image.jpg', 'http://adtangerine.com')
-      #   => 'http://adtangerine/assets/image.jpg'
-      #   >> sf_fix_protocol('//s.ytimg.com/yts/img/youtub_img.png', 'https://youtube.com')
-      #   => 'https://s.ytimg.com/yts/img/youtub_img.png'
-      # Arguments:
-      #   path: (String)
-      #     - URI path having no protocol/domain set.
-      #   domain: (String)
-      #     - Domain that will be prepended into the path.
+    # Check and return only the valid image URIs.
+    #
+    # Example:
+    #   >>  sf_fix_imgs(['http://adtangerine.com/image.png', '/assets/image.jpg'], 'http://adtangerine.com', :jpg)
+    #   => ['http://adtangerine/assets/image.jpg']
+    # Arguments:
+    #   imgs: (Array)
+    #     - Image URIs got from the HTML doc.
+    #   uri: (String)
+    #     - Used as basis to the URIs that don't have any protocol/domain set.
+    #   exts: (Symbol or Array)
+    #     -  Allowed image extesntions.
+    def sf_fix_imgs(imgs, uri, exts = [])
+      sf_check_img_ext(imgs.map do |img|
+        img = img.to_s
+        img = sf_fix_protocol(img, sf_domain(uri)) unless img =~ sf_regex(:protocol)
+        img if (img =~ sf_regex(:image))
+      end.compact, exts)
+    end
-      def sf_fix_protocol(path, domain)
-        if path =~ /^\/\/[^\/]+/
-          'http:' << path
-        else
-           "http://#{domain}#{'/' unless path =~ /^\/[^\/]+/}#{path}"
-        end
-      end
+    # Fix image URIs that don't have a protocol/domain set.
+    #
+    # Example:
+    #   >> sf_fix_protocol('/assets/image.jpg', 'http://adtangerine.com')
+    #   => 'http://adtangerine/assets/image.jpg'
+    #   >> sf_fix_protocol('//s.ytimg.com/yts/img/youtub_img.png', 'https://youtube.com')
+    #   => 'https://s.ytimg.com/yts/img/youtub_img.png'
+    # Arguments:
+    #   path: (String)
+    #     - URI path having no protocol/domain set.
+    #   domain: (String)
+    #     - Domain that will be prepended into the path.
+    def sf_fix_protocol(path, domain)
+      if path =~ /^\/\/[^\/]+/
+        'http:' << path
+      else
+         "http://#{domain}#{'/' unless path =~ /^\/[^\/]+/}#{path}"
+      end
+    end
-      # Returns the domain from an URI
-      #
-      # Example:
-      #   >> sf_domain('http://adtangerine.com')
-      #   => 'adtangerine.com'
-      # Arguments:
-      #   uri: (String)
-      #     - URI.
-      def sf_domain(uri)
-        (uri.split('/')[2] rescue '')
-      end
+    # Return the URI domain.
+    #
+    # Example:
+    #   >> sf_domain('http://adtangerine.com')
+    #   => 'adtangerine.com'
+    # Arguments:
+    #   uri: (String)
+    #     - URI.
+    def sf_domain(uri)
+      (uri.split('/')[2] rescue '')
+    end
   end
 end
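
Note on the `private` → `module_function` change above: in Ruby, methods defined after a bare `module_function` become public methods on the module itself and private instance methods in any class that includes the module. The sketch below shows that behavior with made-up names (`SupportSketch`, `domain_of`, `Page`); it is an illustration of the language feature, not code from the gem.

```ruby
# Hypothetical module standing in for Scrapifier::Support.
module SupportSketch
  module_function

  # Returns the host part of a URI string, or '' when there is none.
  def domain_of(uri)
    uri.split('/')[2].to_s
  end
end

# Hypothetical including class standing in for String.
class Page
  include SupportSketch

  def initialize(uri)
    @uri = uri
  end

  def host
    domain_of(@uri) # calls the private instance copy
  end
end

page = Page.new('http://adtangerine.com')

puts SupportSketch.domain_of('http://adtangerine.com/assets') # module-level call
puts page.host                                                 # call through the mixin
puts page.respond_to?(:domain_of)                              # helper is not public
```

Under this reading, the helpers stay hidden from callers of the including class, as they were with `private`, while also becoming callable directly on the module.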

data/lib/scrapifier/version.rb CHANGED

@@ -1,3 +1,3 @@
 module Scrapifier
-  VERSION = '0.0.1'
+  VERSION = '0.0.2'
 end

data/spec/factories/{uris.rb → samples.rb} RENAMED

File without changes

data/spec/spec_helper.rb CHANGED

@@ -1,5 +1,5 @@
 require 'rubygems'
 require 'bundler/setup'
 require 'scrapifier'
-require 'factories/uris'
+require 'factories/samples'

metadata CHANGED

@@ -1,14 +1,14 @@
 --- !ruby/object:Gem::Specification
 name: scrapifier
 version: !ruby/object:Gem::Version
-  version: 0.0.1
+  version: 0.0.2
 platform: ruby
 authors:
 - Tiago Guedes
 autorequire:
 bindir: bin
 cert_chain: []
-date: 2014-04-07 00:00:00.000000000 Z
+date: 2014-04-30 00:00:00.000000000 Z
 dependencies:
 - !ruby/object:Gem::Dependency
   name: nokogiri
@@ -75,6 +75,7 @@ extensions: []
 extra_rdoc_files: []
 files:
 - ".gitignore"
+- ".travis.yml"
 - Gemfile
 - LICENSE.txt
 - README.md
@@ -84,7 +85,7 @@ files:
 - lib/scrapifier/support.rb
 - lib/scrapifier/version.rb
 - scrapifier.gemspec
-- spec/factories/uris.rb
+- spec/factories/samples.rb
 - spec/scrapifier_spec.rb
 - spec/spec_helper.rb
 homepage: https://github.com/tiagopog/scrapifier
@@ -112,6 +113,6 @@ signing_key:
 specification_version: 4
 summary: Extends the Ruby String class with a screen scraping method.
 test_files:
-- spec/factories/uris.rb
+- spec/factories/samples.rb
 - spec/scrapifier_spec.rb
 - spec/spec_helper.rb