metainspector 4.0.0 → 4.1.0
- checksums.yaml +4 -4
- data/README.md +193 -133
- data/examples/basic_scraping.rb +16 -5
- data/examples/link_checker.rb +22 -21
- data/examples/spider.rb +1 -1
- data/lib/meta_inspector/document.rb +6 -2
- data/lib/meta_inspector/url.rb +9 -1
- data/lib/meta_inspector/version.rb +1 -1
- data/meta_inspector.gemspec +3 -3
- data/spec/document_spec.rb +10 -0
- data/spec/spec_helper.rb +5 -0
- metadata +10 -10
checksums.yaml
CHANGED

@@ -1,7 +1,7 @@
 ---
 SHA1:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: 694ffa2f1b0080c05335ccc14abe02a874a6562f
+  data.tar.gz: 8ef1ff54d9cf15ab21225bf64f2827033dc3b409
 SHA512:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: 92ab014d16a8c6ad1332db4dac9e103c1ab114b3f04c823cc2bc140c43ddb038ae3c77d46bb273b6f6651b80bc3a8bde519ae1cc89ca436bcb1510707c03b888
+  data.tar.gz: d9fa07214d680a5af7b1a2d57f1e8481f85dfbe09f76c45978d5fa346c7c7d64614442c305ef4c4a7d3f1dc3613f467a873431abba9c1db60157265249f7dc07
data/README.md
CHANGED

@@ -1,4 +1,4 @@
-# MetaInspector [![Build Status](https://secure.travis-ci.org/jaimeiniesta/metainspector.png)](http://travis-ci.org/jaimeiniesta/metainspector) [![Dependency Status](https://gemnasium.com/jaimeiniesta/metainspector.png)](https://gemnasium.com/jaimeiniesta/metainspector)
+# MetaInspector [![Build Status](https://secure.travis-ci.org/jaimeiniesta/metainspector.png)](http://travis-ci.org/jaimeiniesta/metainspector) [![Dependency Status](https://gemnasium.com/jaimeiniesta/metainspector.png)](https://gemnasium.com/jaimeiniesta/metainspector) [![Code Climate](https://codeclimate.com/github/jaimeiniesta/metainspector/badges/gpa.svg)](https://codeclimate.com/github/jaimeiniesta/metainspector)
 
 MetaInspector is a gem for web scraping purposes.
 
@@ -40,11 +40,15 @@ Also, we've introduced a new feature:
 
 Install the gem from RubyGems:
 
-
+```bash
+gem install metainspector
+```
 
 If you're using it on a Rails application, just add it to your Gemfile and run `bundle install`
 
-
+```ruby
+gem 'metainspector'
+```
 
 This gem is tested on Ruby versions 2.0.0 and 2.1.3.
 
@@ -52,15 +56,21 @@ This gem is tested on Ruby versions 2.0.0 and 2.1.3.
 
 Initialize a MetaInspector instance for an URL, like this:
 
-
+```ruby
+page = MetaInspector.new('http://sitevalidator.com')
+```
 
 If you don't include the scheme on the URL, http:// will be used by default:
 
-
+```ruby
+page = MetaInspector.new('sitevalidator.com')
+```
 
 You can also include the html which will be used as the document to scrape:
 
-
+```ruby
+page = MetaInspector.new("http://sitevalidator.com", :document => "<html><head><title>Hello From Passed Html</title><a href='/hello'>Hello link</a></head><body></body></html>")
+```
 
 ## Accessing response status and headers
 
@@ -75,124 +85,138 @@ page.response.headers # { "server"=>"nginx", "content-type"=>"text/html; charset
 
 You can see the scraped data like this:
 
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
+```ruby
+page.url                 # URL of the page
+page.scheme              # Scheme of the page (http, https)
+page.host                # Hostname of the page (like, sitevalidator.com, without the scheme)
+page.root_url            # Root url (scheme + host, like http://sitevalidator.com/)
+page.title               # title of the page, as string
+page.links.raw           # every link found, unprocessed
+page.links.all           # every link found on the page as an absolute URL
+page.links.http          # every HTTP link found
+page.links.non_http      # every non-HTTP link found
+page.links.internal      # every internal link found on the page as an absolute URL
+page.links.external      # every external link found on the page as an absolute URL
+page.meta['keywords']    # meta keywords, as string
+page.meta['description'] # meta description, as string
+page.description         # returns the meta description, or the first long paragraph if no meta description is found
+page.images              # enumerable collection, with every img found on the page as an absolute URL
+page.images.best         # Most relevant image, if defined with the og:image or twitter:image metatags. Fallback to the first page.images array element
+page.images.favicon      # absolute URL to the favicon
+page.feed                # Get rss or atom links in meta data fields as array
+page.charset             # UTF-8
+page.content_type        # content-type returned by the server when the url was requested
+```
 
 ## Meta tags
 
 When it comes to meta tags, you have several options:
 
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
+```ruby
+page.meta_tags  # Gives you all the meta tags by type:
+                # (meta name, meta http-equiv, meta property and meta charset)
+                # As meta tags can be repeated (in the case of 'og:image', for example),
+                # the values returned will be arrays
+                #
+                # For example:
+                #
+                # {
+                    'name' => {
+                      'keywords' => ['one, two, three'],
+                      'description' => ['the description'],
+                      'author' => ['Joe Sample'],
+                      'robots' => ['index,follow'],
+                      'revisit' => ['15 days'],
+                      'dc.date.issued' => ['2011-09-15']
+                    },
+
+                    'http-equiv' => {
+                      'content-type' => ['text/html; charset=UTF-8'],
+                      'content-style-type' => ['text/css']
+                    },
+
+                    'property' => {
+                      'og:title' => ['An OG title'],
+                      'og:type' => ['website'],
+                      'og:url' => ['http://example.com/meta-tags'],
+                      'og:image' => ['http://example.com/rock.jpg',
+                                     'http://example.com/rock2.jpg',
+                                     'http://example.com/rock3.jpg'],
+                      'og:image:width' => ['300'],
+                      'og:image:height' => ['300', '1000']
+                    },
+
+                    'charset' => ['UTF-8']
+                  }
+```
 
 As this method returns a hash, you can also take only the key that you need, like in:
 
-
-
-
-
-
-
-
-
-
-
-
+```ruby
+page.meta_tags['property'] # Returns:
+# {
+#   'og:title' => ['An OG title'],
+#   'og:type' => ['website'],
+#   'og:url' => ['http://example.com/meta-tags'],
+#   'og:image' => ['http://example.com/rock.jpg',
+#                  'http://example.com/rock2.jpg',
+#                  'http://example.com/rock3.jpg'],
+#   'og:image:width' => ['300'],
+#   'og:image:height' => ['300', '1000']
+# }
+```
 
 In most cases you will only be interested in the first occurrence of a meta tag, so you can
 use the singular form of that method:
 
-
-
-
-
-
-
-
-
-
+```ruby
+page.meta_tag['name'] # Returns:
+# {
+#   'keywords' => 'one, two, three',
+#   'description' => 'the description',
+#   'author' => 'Joe Sample',
+#   'robots' => 'index,follow',
+#   'revisit' => '15 days',
+#   'dc.date.issued' => '2011-09-15'
+# }
+```
 
 Or, as this is also a hash:
 
-
+```ruby
+page.meta_tag['name']['keywords'] # Returns 'one, two, three'
+```
 
 And finally, you can use the shorter `meta` method that will merge the different keys so you have
 a simpler hash:
 
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
+```ruby
+page.meta # Returns:
+#
+# {
+#   'keywords' => 'one, two, three',
+#   'description' => 'the description',
+#   'author' => 'Joe Sample',
+#   'robots' => 'index,follow',
+#   'revisit' => '15 days',
+#   'dc.date.issued' => '2011-09-15',
+#   'content-type' => 'text/html; charset=UTF-8',
+#   'content-style-type' => 'text/css',
+#   'og:title' => 'An OG title',
+#   'og:type' => 'website',
+#   'og:url' => 'http://example.com/meta-tags',
+#   'og:image' => 'http://example.com/rock.jpg',
+#   'og:image:width' => '300',
+#   'og:image:height' => '300',
+#   'charset' => 'UTF-8'
+# }
+```
 
 This way, you can get most meta tags just like that:
 
-
+```ruby
+page.meta['author'] # Returns "Joe Sample"
+```
 
 Please be aware that all keys are converted to downcase, so it's `'dc.date.issued'` and not `'DC.date.issued'`.
 
@@ -200,16 +224,22 @@ Please be aware that all keys are converted to downcase, so it's `'dc.date.issue
 
 You can also access most of the scraped data as a hash:
 
-
-
+```ruby
+page.to_hash # { "url" => "http://sitevalidator.com",
+               "title" => "MarkupValidator :: site-wide markup validation tool", ... }
+```
 
 The original document is accessible from:
 
-
+```ruby
+page.to_s # A String with the contents of the HTML document
+```
 
 And the full scraped document is accessible from:
 
-
+```ruby
+page.parsed # Nokogiri doc that you can use to get any element from the page
+```
 
 ## Options
 
@@ -252,36 +282,64 @@ By default, MetaInspector will follow redirects (up to a limit of 10).
 
 If you want to disallow redirects, you can do it like this:
 
-
+```ruby
+page = MetaInspector.new('facebook.com', :allow_redirections => false)
+```
 
 ### Headers
 
 By default, the following headers are set:
 
-
+```ruby
+{'User-Agent' => "MetaInspector/#{MetaInspector::VERSION} (+https://github.com/jaimeiniesta/metainspector)"}
+```
 
 If you want to set custom headers then use the `headers` option:
 
-
-
+```ruby
+# Set the User-Agent header
+page = MetaInspector.new('example.com', :headers => {'User-Agent' => 'My custom User-Agent'})
+```
 
 ### HTML Content Only
 
 MetaInspector will try to parse all URLs by default. If you want to raise an exception when trying to parse a non-html URL (one that has a content-type different than text/html), you can state it like this:
 
-
+```ruby
+page = MetaInspector.new('sitevalidator.com', :html_content_only => true)
+```
 
 This is useful when using MetaInspector on web spidering. Although on the initial URL you'll probably have an HTML URL, following links you may find yourself trying to parse non-html URLs.
 
-
-
-
+```ruby
+page = MetaInspector.new('http://example.com/image.png')
+page.content_type # "image/png"
+page.description  # will return a garbled string
+
+page = MetaInspector.new('http://example.com/image.png', :html_content_only => true)
+page.content_type # "image/png"
+page.description  # raises an exception
+```
+
+### URL Normalization
+
+By default, URLs are normalized using the Addressable gem. For example:
+
+```ruby
+# Normalization will add a default scheme and a trailing slash...
+page = MetaInspector.new('sitevalidator.com')
+page.url # http://sitevalidator.com/
+
+# ...and it will also convert international characters
+page = MetaInspector.new('http://www.詹姆斯.com')
+page.url # http://www.xn--8ws00zhy3a.com/
+```
 
-
-page.content_type # "image/png"
-page.description # raises an exception
+While this is generally useful, it can be [tricky](https://github.com/sporkmonger/addressable/issues/182) [sometimes](https://github.com/sporkmonger/addressable/issues/160).
 
-
+You can disable URL normalization by passing the `normalize_url: false` option.
+
+## Exception Handling
 
 By default, MetaInspector will raise the exceptions found. We think that this is the safest default: in case the URL you're trying to scrape is unreachable, you should clearly be notified, and treat the exception as needed in your app.
 
@@ -295,27 +353,29 @@ You should avoid using the `:store` option, or use it wisely, as silencing error
 
 You can find some sample scripts on the `examples` folder, including a basic scraping and a spider that will follow external links using a queue. What follows is an example of use from irb:
 
-
-
-
+```ruby
+$ irb
+>> require 'metainspector'
+=> true
 
-
-
+>> page = MetaInspector.new('http://sitevalidator.com')
+=> #<MetaInspector:0x11330c0 @url="http://sitevalidator.com">
 
-
-
+>> page.title
+=> "MarkupValidator :: site-wide markup validation tool"
 
-
-
+>> page.meta['description']
+=> "Site-wide markup validation tool. Validate the markup of your whole site with just one click."
 
-
-
+>> page.meta['keywords']
+=> "html, markup, validation, validator, tool, w3c, development, standards, free"
 
-
-
+>> page.links.size
+=> 15
 
-
-
+>> page.links[4]
+=> "/plans-and-pricing"
+```
 
 ## ZOMG Fork! Thank you!
 
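The `meta` method described in the README changes above merges the grouped `meta_tags` hash into one flat hash, keeping only the first occurrence of each tag. A stdlib-only sketch of that merge, on hypothetical sample data shaped like the README examples (this is an illustration, not MetaInspector's actual implementation):

```ruby
# Hypothetical grouped meta_tags, shaped like the README examples
meta_tags = {
  'name'     => { 'keywords' => ['one, two, three'], 'author' => ['Joe Sample'] },
  'property' => { 'og:image' => ['http://example.com/rock.jpg',
                                 'http://example.com/rock2.jpg'] },
  'charset'  => ['UTF-8']
}

# Flatten each group, keeping only the first occurrence of every tag
meta = {}
meta_tags.each do |group, tags|
  if tags.is_a?(Hash)
    tags.each { |tag, values| meta[tag] = values.first }
  else
    meta[group] = tags.first # 'charset' maps straight to an array
  end
end

puts meta['og:image'] # => "http://example.com/rock.jpg"
puts meta['author']   # => "Joe Sample"
```

Repeated tags like `og:image` collapse to their first value, which matches what the README documents for `page.meta`.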
data/examples/basic_scraping.rb
CHANGED

@@ -11,14 +11,25 @@ url = ARGV[0] || (puts "Enter an url"; gets.strip)
 
 page = MetaInspector.new(url)
 
-puts "
-puts "
+puts "\nScraping #{page.url} returned these results:"
+puts "\nTITLE: #{page.title}"
 puts "META DESCRIPTION: #{page.meta['description']}"
 puts "META KEYWORDS: #{page.meta['keywords']}"
-
-page.links.
+
+puts "\n#{page.links.internal.size} internal links found..."
+page.links.internal.each do |link|
+  puts " ==> #{link}"
+end
+
+puts "\n#{page.links.external.size} external links found..."
+page.links.external.each do |link|
+  puts " ==> #{link}"
+end
+
+puts "\n#{page.links.non_http.size} non-http links found..."
+page.links.non_http.each do |link|
   puts " ==> #{link}"
 end
 
-puts "
+puts "\nto_hash..."
 puts page.to_hash
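The updated example script above relies on `page.links.internal`, `page.links.external` and `page.links.non_http`. A rough stdlib-only sketch of that grouping on made-up links (MetaInspector's own classification logic may differ in detail):

```ruby
require 'uri'

links = ['http://example.com/about', 'https://other.com/page', 'mailto:joe@example.com']
host  = 'example.com' # hypothetical host of the scraped page

# Split off non-http schemes first, then split http links into internal
# vs. external by comparing their host with the page's host
http, non_http     = links.partition { |l| l =~ %r{\Ahttps?://}i }
internal, external = http.partition { |l| URI.parse(l).host == host }

puts "#{internal.size} internal, #{external.size} external, #{non_http.size} non-http"
# => 1 internal, 1 external, 1 non-http
```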
data/examples/link_checker.rb
CHANGED

@@ -7,7 +7,6 @@
 require 'metainspector'
 
 class BrokenLinkChecker
-  attr_reader :broken
 
   def initialize(url)
     @url = url
@@ -33,32 +32,26 @@ class BrokenLinkChecker
   private
 
   def check
-    #
-
+    # Resolves redirections of initial URL before placing it on the queue
+    @queue.push(MetaInspector.new(@url).url)
 
-
-
-
-    while @queue.any?
-      url = @queue.pop
-
-      page = MetaInspector.new(url, :warn_level => :store)
+    process_next_on_queue while @queue.any?
+  end
 
-
-
-    page.links.select {|l| l =~ /^http(s)?:\/\//i}.each do |link|
-      check_status(link, page.url)
-    end
-  end
+  def process_next_on_queue
+    page = MetaInspector.new(@queue.pop, :warn_level => :store)
 
-
+    page.links.all.select {|l| l =~ /^http(s)?:\/\//i}.each do |link|
+      check_status(link, page.url)
+    end if page.ok?
 
-
-      @queue.push(link) unless @visited.include?(link) || @broken.include?(link) || @queue.include?(link)
-    end
+    @visited.push(page.url)
 
-
+    page.links.internal.each do |link|
+      @queue.push(link) if should_be_enqueued?(link)
     end
+
+    show_stats
   end
 
   # Checks the response status of the linked_url and stores it on the ok or broken collections
@@ -78,6 +71,14 @@ class BrokenLinkChecker
     end
   end
 
+  def should_be_enqueued?(url)
+    !(@visited.include?(url) || @broken.include?(url) || @queue.include?(url))
+  end
+
+  def show_stats
+    puts "#{'%3s' % @visited.size} pages visited, #{'%3s' % @queue.size} pages on queue, #{'%2s' % @broken.size} broken links"
+  end
+
   # A page is reachable if its response status is less than 400
   # In the case of exceptions, like timeouts or server connection errors,
   # we consider it unreachable
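The refactor above extracts the enqueue condition into `should_be_enqueued?`. A minimal stand-alone sketch of that guard, with made-up URLs in plain arrays standing in for the checker's state:

```ruby
# Crawl state, as three plain collections (hypothetical sample data)
@visited = ['http://example.com/']
@broken  = ['http://example.com/404']
@queue   = ['http://example.com/next']

# A link is enqueued only if it is not already visited, known broken, or queued
def should_be_enqueued?(url)
  !(@visited.include?(url) || @broken.include?(url) || @queue.include?(url))
end

puts should_be_enqueued?('http://example.com/')    # false, already visited
puts should_be_enqueued?('http://example.com/new') # true
```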
data/lib/meta_inspector/document.rb
CHANGED

@@ -17,6 +17,7 @@ module MetaInspector
   # * warn_level: what to do when encountering exceptions.
   #               Can be :warn, :raise or nil
   # * headers: object containing custom headers for the request
+  # * normalize_url: true by default
   def initialize(initial_url, options = {})
     options = defaults.merge(options)
     @connection_timeout = options[:connection_timeout]
@@ -28,7 +29,9 @@ module MetaInspector
     @headers = options[:headers]
     @warn_level = options[:warn_level]
     @exception_log = options[:exception_log] || MetaInspector::ExceptionLog.new(warn_level: warn_level)
-    @
+    @normalize_url = options[:normalize_url]
+    @url = MetaInspector::URL.new(initial_url, exception_log: @exception_log,
+                                               normalize: @normalize_url)
     @request = MetaInspector::Request.new(@url, allow_redirections: @allow_redirections,
                                                 connection_timeout: @connection_timeout,
                                                 read_timeout: @read_timeout,
@@ -77,7 +80,8 @@ module MetaInspector
       :html_content_only => false,
      :warn_level => :raise,
      :headers => { 'User-Agent' => default_user_agent },
-      :allow_redirections => true
+      :allow_redirections => true,
+      :normalize_url => true }
     end
 
     def default_user_agent
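The new `:normalize_url` option above rides on the same `defaults.merge(options)` pattern the initializer already uses, so it is `true` unless a caller overrides it. A hypothetical stand-alone illustration of that pattern (constant and method names here are made up):

```ruby
# Caller options are merged over the defaults, mirroring the diff above:
# an option keeps its default value unless explicitly passed in
DEFAULTS = { :allow_redirections => true, :normalize_url => true }

def effective_options(options = {})
  DEFAULTS.merge(options)
end

puts effective_options[:normalize_url]                          # => true
puts effective_options(:normalize_url => false)[:normalize_url] # => false
```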
data/lib/meta_inspector/url.rb
CHANGED

@@ -7,7 +7,10 @@ module MetaInspector
     include MetaInspector::Exceptionable
 
     def initialize(initial_url, options = {})
+      options = defaults.merge(options)
+
       @exception_log = options[:exception_log]
+      @normalize = options[:normalize]
 
       self.url = initial_url
     end
@@ -25,7 +28,8 @@ module MetaInspector
     end
 
     def url=(new_url)
-
+      url = with_default_scheme(new_url)
+      @url = @normalize ? normalized(url) : url
     end
 
     # Converts a protocol-relative url to its full form,
@@ -50,6 +54,10 @@ module MetaInspector
 
     private
 
+    def defaults
+      { :normalize => true }
+    end
+
     # Adds 'http' as default scheme, if there is none
     def with_default_scheme(url)
       parsed(url) && parsed(url).scheme.nil? ? 'http://' + url : url
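The rewritten `url=` above first applies a default scheme, then normalizes only when `@normalize` is set. A sketch of that two-step logic, using the standard library's `URI#normalize` as a stand-in for the Addressable normalization the gem actually performs (so the exact normalization rules differ):

```ruby
require 'uri'

# Add 'http://' when no scheme is present, then normalize conditionally,
# mirroring the with_default_scheme + normalized steps in url.rb
def build_url(raw, normalize: true)
  with_scheme = URI.parse(raw).scheme.nil? ? "http://#{raw}" : raw
  normalize ? URI.parse(with_scheme).normalize.to_s : with_scheme
end

puts build_url('sitevalidator.com')                   # => "http://sitevalidator.com/"
puts build_url('sitevalidator.com', normalize: false) # => "http://sitevalidator.com"
```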
data/meta_inspector.gemspec
CHANGED

@@ -3,8 +3,8 @@ require File.expand_path('../lib/meta_inspector/version', __FILE__)
 Gem::Specification.new do |gem|
   gem.authors     = ["Jaime Iniesta"]
   gem.email       = ["jaimeiniesta@gmail.com"]
-  gem.description = %q{MetaInspector lets you scrape a web page and get its
-  gem.summary     = %q{MetaInspector is a ruby gem for web scraping purposes, that returns
+  gem.description = %q{MetaInspector lets you scrape a web page and get its links, images, texts, meta tags...}
+  gem.summary     = %q{MetaInspector is a ruby gem for web scraping purposes, that returns metadata from a given URL}
   gem.homepage    = "http://jaimeiniesta.github.io/metainspector/"
   gem.license     = "MIT"
@@ -23,7 +23,7 @@ Gem::Specification.new do |gem|
   gem.add_development_dependency 'rspec', '2.14.1'
   gem.add_development_dependency 'fakeweb', '1.3.0'
   gem.add_development_dependency 'webmock'
-  gem.add_development_dependency 'awesome_print'
+  gem.add_development_dependency 'awesome_print'
   gem.add_development_dependency 'rake', '~> 10.1.0'
   gem.add_development_dependency 'pry'
   gem.add_development_dependency 'guard'
data/spec/document_spec.rb
CHANGED

@@ -171,4 +171,14 @@ describe MetaInspector::Document do
       MetaInspector::Document.new(url, headers: headers)
     end
   end
+
+  describe 'url normalization' do
+    it 'should normalize by default' do
+      MetaInspector.new('http://example.com/%EF%BD%9E').url.should == 'http://example.com/~'
+    end
+
+    it 'should not normalize if the normalize_url option is false' do
+      MetaInspector.new('http://example.com/%EF%BD%9E', normalize_url: false).url.should == 'http://example.com/%EF%BD%9E'
+    end
+  end
 end
data/spec/spec_helper.rb
CHANGED

@@ -79,3 +79,8 @@ FakeWeb.register_uri(:get, "https://www.facebook.com/", :response => fixture
 # https://unsafe-facebook.com => http://unsafe-facebook.com
 FakeWeb.register_uri(:get, "https://unsafe-facebook.com/", :response => fixture_file("unsafe_https.facebook.com.response"))
 FakeWeb.register_uri(:get, "http://unsafe-facebook.com/", :response => fixture_file("unsafe_facebook.com.response"))
+
+# These examples are used to test URL normalization
+FakeWeb.register_uri(:get, "http://example.com/%EF%BD%9E", :response => fixture_file("example.response"))
+FakeWeb.register_uri(:get, "http://example.com/~", :response => fixture_file("example.response"))
+
metadata
CHANGED

@@ -1,14 +1,14 @@
 --- !ruby/object:Gem::Specification
 name: metainspector
 version: !ruby/object:Gem::Version
-  version: 4.
+  version: 4.1.0
 platform: ruby
 authors:
 - Jaime Iniesta
 autorequire:
 bindir: bin
 cert_chain: []
-date:
+date: 2015-01-15 00:00:00.000000000 Z
 dependencies:
 - !ruby/object:Gem::Dependency
   name: nokogiri
@@ -126,16 +126,16 @@ dependencies:
   name: awesome_print
   requirement: !ruby/object:Gem::Requirement
     requirements:
-    - - "
+    - - ">="
      - !ruby/object:Gem::Version
-      version:
+        version: '0'
   type: :development
   prerelease: false
   version_requirements: !ruby/object:Gem::Requirement
     requirements:
-    - - "
+    - - ">="
      - !ruby/object:Gem::Version
-      version:
+        version: '0'
 - !ruby/object:Gem::Dependency
   name: rake
   requirement: !ruby/object:Gem::Requirement
@@ -206,8 +206,8 @@ dependencies:
     - - ">="
      - !ruby/object:Gem::Version
        version: '0'
-description: MetaInspector lets you scrape a web page and get its
-
+description: MetaInspector lets you scrape a web page and get its links, images, texts,
+  meta tags...
 email:
 - jaimeiniesta@gmail.com
 executables: []
@@ -309,6 +309,6 @@ rubyforge_project:
 rubygems_version: 2.2.2
 signing_key:
 specification_version: 4
-summary: MetaInspector is a ruby gem for web scraping purposes, that returns
-
+summary: MetaInspector is a ruby gem for web scraping purposes, that returns metadata
+  from a given URL
 test_files: []