metainspector 4.4.2 → 4.5.0

This diff shows the changes between the two published versions of the package as they appear in the public registry. It is provided for informational purposes only.

Files changed (18)

checksums.yaml +4 -4
data/CHANGELOG.md +66 -0
data/README.md +43 -62
data/lib/meta_inspector.rb +1 -0
data/lib/meta_inspector/document.rb +4 -2
data/lib/meta_inspector/parser.rb +7 -5
data/lib/meta_inspector/parsers/head_links.rb +40 -0
data/lib/meta_inspector/parsers/images.rb +23 -14
data/lib/meta_inspector/parsers/links.rb +1 -14
data/lib/meta_inspector/url.rb +20 -6
data/lib/meta_inspector/version.rb +1 -1
data/spec/fixtures/head_links.response +34 -0
data/spec/fixtures/protocol_relative.response +5 -0
data/spec/meta_inspector/head_links_spec.rb +42 -0
data/spec/meta_inspector/images_spec.rb +60 -0
data/spec/spec_helper.rb +4 -0
data/spec/url_spec.rb +41 -0
metadata +6 -2

checksums.yaml CHANGED

@@ -1,7 +1,7 @@
 ---
 SHA1:
-  metadata.gz: 55d69ca7b0d4a656349f395a906afe70ca816f5d
-  data.tar.gz: 3e16c396163b6171ecc136f906f726de8ab59cd4
+  metadata.gz: d8b2f4cf8526bd14a55d879334ff9bf14c95180f
+  data.tar.gz: 0d39ceedb495d19a761fd7f6bcfdae767d1e1c26
 SHA512:
-  metadata.gz: a55b1cb2c32dcd1f8a020b28b0de9b4ce017ad20571ca1f5d1b35a26e0611e1a6faecc680dcb76fd8bad102841781c23aabd8298647a4dfafb3b8c54ea0369ac
-  data.tar.gz: b9272f589ff465a43736dcb7c02c85c376312d2cd7d46427588d966ac396fb820bce392db8f3708b6304b87fd5584004ca44457cd3b61c7afa8349e94649adaf
+  metadata.gz: b9b8a345bb8f935bfe5a5fb74d4e86a92893c8d44066f87cdffbe029fc5746841c290c366fd94fc2a84edb73edbbf43c491189f6b82e754f4bc0c494eaed6591
+  data.tar.gz: f559c11756c34406d5083a58c8cd80fdf5665fd79f4a204fdd22eee9eed6bbb65553c3a0b47f16962623ef4c4903adb74b6ac503893b0cb6ea66ee84513e85d4

data/CHANGELOG.md ADDED

@@ -0,0 +1,66 @@
+# MetaInpector Changelog
+## Changes in 4.5
+* The Document API now includes access to head/link elements
+    * `page.head_links` returns an array of hashes of all head/links.
+    * `page.stylesheets` returns head/links where rel='stylesheet'
+    * `page.canonicals` returns head/links where rel='canonical'
+* The URL API can remove common tracking parameters from the querystring
+    * `url.tracked?` will tell you if the url contains known tracking parameters
+    * `url.untracked_url` will return the url with known tracking parameters removed
+    * `url.untrack!` will remove the tracking parameters from the url
+* The images API has been extended:
+    * `page.images.with_size` returns a sorted array (by descending area) of [image_url, width, height]
+## Changes in 4.4
+The default headers now include `'Accept-Encoding' => 'identity'` to minimize trouble with servers that respond with malformed compressed responses, [as explained here](https://github.com/lostisland/faraday/issues/337).
+## Changes in 4.3
+* The Document API has been extended with one new method `page.best_title` that returns the longest text available from a selection of candidates.
+* `to_hash` now includes `scheme`, `host`, `root_url`, `best_title` and `description`.
+## Changes in 4.2
+* The images API has been extended, with two new methods:
+  * `page.images.owner_suggested` returns the OG or Twitter image, or `nil` if neither are present.
+  * `page.images.largest` returns the largest image found in the page. This uses the HTML height and width attributes as well as the [fastimage](https://github.com/sdsykes/fastimage) gem to return the largest image on the page that has a ratio squarer than 1:10 or 10:1. This usually provides a good alternative to the OG or Twitter images if they are not supplied.
+* The criteria for `page.images.best` has changed slightly, we'll now return the largest image instead of the first image if no owner-suggested image is found.
+## Changes in 4.1
+* Introduces the `:normalize_url` option, which allows to disable URL normalization.
+## Changes in 4.0
+* The links API has been changed, now instead of `page.links`, `page.internal_links` and `page.external_links` we have:
+```ruby
+page.links.raw      # Returns all links found, unprocessed
+page.links.all      # Returns all links found, unrelavitized and absolutified
+page.links.http     # Returns all HTTP links found
+page.links.non_http # Returns all non-HTTP links found
+page.links.internal # Returns all internal HTTP links found
+page.links.external # Returns all external HTTP links found
+```
+* The images API has been changed, now instead of `page.image` we have `page.images.best`, and instead of `page.favicon` we have `page.images.favicon`.
+* Now `page.image` will return the first image in `page.images` if no OG or Twitter image found, instead of returning `nil`.
+* You can now specify 2 different timeouts, `connection_timeout` and `read_timeout`, instead of the previous single `timeout`.
+## Changes in 3.0
+* The redirect API has been changed, now the `:allow_redirections` option will expect only a boolean, which by default is `true`. That is, no more specifying `:safe`, `:unsafe` or `:all`.
+* We've dropped support for Ruby < 2.
+Also, we've introduced a new feature:
+* Persist cookies across redirects. Now MetaInspector will include the received cookies when following redirects. This fixes some cases where a redirect would fail, sometimes caught in a redirection loop.

data/README.md CHANGED

@@ -8,56 +8,6 @@ You give it an URL, and it lets you easily get its title, links, images, charset
 You can try MetaInspector live at this little demo: [https://metainspectordemo.herokuapp.com](https://metainspectordemo.herokuapp.com)
-## Changes in 4.4
-The default headers now include `'Accept-Encoding' => 'identity'` to minimize trouble with servers that respond with malformed compressed responses, [as explained here](https://github.com/lostisland/faraday/issues/337).
-## Changes in 4.3
-* The Document API has been extended with one new method `page.best_title` that returns the longest text available from a selection of candidates.
-* `to_hash` now includes `scheme`, `host`, `root_url`, `best_title` and `description`.
-## Changes in 4.2
-* The images API has been extended, with two new methods:
-  * `page.images.owner_suggested` returns the OG or Twitter image, or `nil` if neither are present.
-  * `page.images.largest` returns the largest image found in the page. This uses the HTML height and width attributes as well as the [fastimage](https://github.com/sdsykes/fastimage) gem to return the largest image on the page that has a ratio squarer than 1:10 or 10:1. This usually provides a good alternative to the OG or Twitter images if they are not supplied.
-* The criteria for `page.images.best` has changed slightly, we'll now return the largest image instead of the first image if no owner-suggested image is found.
-## Changes in 4.1
-* Introduces the `:normalize_url` option, which allows to disable URL normalization.
-## Changes in 4.0
-* The links API has been changed, now instead of `page.links`, `page.internal_links` and `page.external_links` we have:
-```ruby
-page.links.raw      # Returns all links found, unprocessed
-page.links.all      # Returns all links found, unrelavitized and absolutified
-page.links.http     # Returns all HTTP links found
-page.links.non_http # Returns all non-HTTP links found
-page.links.internal # Returns all internal HTTP links found
-page.links.external # Returns all external HTTP links found
-```
-* The images API has been changed, now instead of `page.image` we have `page.images.best`, and instead of `page.favicon` we have `page.images.favicon`.
-* Now `page.image` will return the first image in `page.images` if no OG or Twitter image found, instead of returning `nil`.
-* You can now specify 2 different timeouts, `connection_timeout` and `read_timeout`, instead of the previous single `timeout`.
-## Changes in 3.0
-* The redirect API has been changed, now the `:allow_redirections` option will expect only a boolean, which by default is `true`. That is, no more specifying `:safe`, `:unsafe` or `:all`.
-* We've dropped support for Ruby < 2.
-Also, we've introduced a new feature:
-* Persist cookies across redirects. Now MetaInspector will include the received cookies when following redirects. This fixes some cases where a redirect would fail, sometimes caught in a redirection loop.
 ## Installation
 Install the gem from RubyGems:
@@ -91,47 +41,72 @@ page = MetaInspector.new('sitevalidator.com')
 You can also include the html which will be used as the document to scrape:
 ```ruby
-page = MetaInspector.new("http://sitevalidator.com", :document => "<html><head><title>Hello From Passed Html</title><a href='/hello'>Hello link</a></head><body></body></html>")
+page = MetaInspector.new("http://sitevalidator.com",
+                         :document => "<html>...</html>")
 ```
-## Accessing response status and headers
+## Accessing response
 You can check the status and headers from the response like this:
 ```ruby
 page.response.status  # 200
-page.response.headers # { "server"=>"nginx", "content-type"=>"text/html; charset=utf-8", "cache-control"=>"must-revalidate, private, max-age=0", ... }
+page.response.headers # { "server"=>"nginx", "content-type"=>"text/html; charset=utf-8",
+                      #   "cache-control"=>"must-revalidate, private, max-age=0", ... }
 ```
 ## Accessing scraped data
-You can see the scraped data like this:
+### URL
 ```ruby
 page.url                 # URL of the page
+page.tracked?            # returns true if the url contains known tracking parameters
+page.untracked_url       # returns the url with the known tracking parameters removed
+page.untrack!            # removes the known tracking parameters from the url
 page.scheme              # Scheme of the page (http, https)
 page.host                # Hostname of the page (like, sitevalidator.com, without the scheme)
 page.root_url            # Root url (scheme + host, like http://sitevalidator.com/)
+```
+### Head links
+```ruby
+page.head_links          # an array of hashes of all head/links
+page.stylesheets         # an array of hashes of all head/links where rel='stylesheet'
+page.canonicals          # an array of hashes of all head/links where rel='canonical'
+page.feed                # Get rss or atom links in meta data fields as array
+```
+### Texts
+```ruby
 page.title               # title of the page from the head section, as string
 page.best_title          # best title of the page, from a selection of candidates
+page.description         # returns the meta description, or the first long paragraph if no meta description is found
+```
+### Links
+```ruby
 page.links.raw           # every link found, unprocessed
 page.links.all           # every link found on the page as an absolute URL
 page.links.http          # every HTTP link found
 page.links.non_http      # every non-HTTP link found
 page.links.internal      # every internal link found on the page as an absolute URL
 page.links.external      # every external link found on the page as an absolute URL
-page.meta['keywords']    # meta keywords, as string
-page.meta['description'] # meta description, as string
-page.description         # returns the meta description, or the first long paragraph if no meta description is found
+```
+### Images
+```ruby
 page.images              # enumerable collection, with every img found on the page as an absolute URL
+page.images.with_size    # a sorted array (by descending area) of [image_url, width, height]
 page.images.best         # Most relevant image, if defined with the og:image or twitter:image metatags. Fallback to the first page.images array element
 page.images.favicon      # absolute URL to the favicon
-page.feed                # Get rss or atom links in meta data fields as array
-page.charset             # UTF-8
-page.content_type        # content-type returned by the server when the url was requested
 ```
-## Meta tags
+### Meta tags
 When it comes to meta tags, you have several options:
@@ -243,6 +218,13 @@ page.meta['author']     # Returns "Joe Sample"
 Please be aware that all keys are converted to downcase, so it's `'dc.date.issued'` and not `'DC.date.issued'`.
+### Misc
+```ruby
+page.charset             # UTF-8
+page.content_type        # content-type returned by the server when the url was requested
+```
 ## Other representations
 You can also access most of the scraped data as a hash:
@@ -422,7 +404,6 @@ You're more than welcome to fork this project and send pull requests. Just remem
 * Create a topic branch for your changes.
 * Add specs.
 * Keep your fake responses as small as possible. For each change in `spec/fixtures`, a comment should be included explaining why it's needed.
-* Update `version.rb`, following the [semantic versioning convention](http://semver.org/).
 * Update `README.md` if needed (for example, when you're adding or changing a feature).
 Thanks to all the contributors:

data/lib/meta_inspector.rb CHANGED

@@ -7,6 +7,7 @@ require File.expand_path(File.join(File.dirname(__FILE__), 'meta_inspector/parse
 require File.expand_path(File.join(File.dirname(__FILE__), 'meta_inspector/parsers/base'))
 require File.expand_path(File.join(File.dirname(__FILE__), 'meta_inspector/parsers/images'))
 require File.expand_path(File.join(File.dirname(__FILE__), 'meta_inspector/parsers/links'))
+require File.expand_path(File.join(File.dirname(__FILE__), 'meta_inspector/parsers/head_links'))
 require File.expand_path(File.join(File.dirname(__FILE__), 'meta_inspector/parsers/meta_tags'))
 require File.expand_path(File.join(File.dirname(__FILE__), 'meta_inspector/parsers/texts'))
 require File.expand_path(File.join(File.dirname(__FILE__), 'meta_inspector/document'))

data/lib/meta_inspector/document.rb CHANGED

@@ -44,14 +44,16 @@ module MetaInspector
     end
     extend Forwardable
-    delegate [:url, :scheme, :host, :root_url]        => :@url
+    delegate [:url, :scheme, :host, :root_url,
+              :tracked?, :untracked_url, :untrack!]   => :@url
     delegate [:content_type, :response]               => :@request
     delegate [:parsed, :title, :best_title,
               :description, :links,
               :images, :feed, :charset, :meta_tags,
-              :meta_tag, :meta, :favicon]             => :@parser
+              :meta_tag, :meta, :favicon,
+              :head_links, :stylesheets, :canonicals] => :@parser
     # Returns all document data as a nested Hash
     def to_hash
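
The `delegate [...] => :@url` lines above rely on Ruby's standard `Forwardable` module. As a minimal sketch of that pattern, the following stand-alone example forwards two methods to an instance variable. The `SimpleUrl` and `SimpleDocument` classes, and the naive `utm_` check, are hypothetical stand-ins for illustration, not the gem's classes.

```ruby
require 'forwardable'

# Hypothetical stand-in for the object that holds the URL.
class SimpleUrl
  attr_reader :url

  def initialize(url)
    @url = url
  end

  # Naive check used only for this sketch.
  def tracked?
    @url.include?('utm_')
  end
end

# Hypothetical stand-in for a document that forwards calls to its URL object.
class SimpleDocument
  extend Forwardable

  # Calls to #url and #tracked? are forwarded to the @url instance variable.
  delegate [:url, :tracked?] => :@url

  def initialize(url)
    @url = SimpleUrl.new(url)
  end
end

doc = SimpleDocument.new('http://example.com/foo?utm_source=1234')
puts doc.url
puts doc.tracked?
```

Adding a method name to the delegated list is enough to expose it on the outer object, which is how the diff surfaces `tracked?`, `untracked_url` and `untrack!` on the document.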

data/lib/meta_inspector/parser.rb CHANGED

@@ -13,6 +13,7 @@ module MetaInspector
     def initialize(document, options = {})
       @document        = document
       @exception_log   = options[:exception_log]
+      @head_links_parser = MetaInspector::Parsers::HeadLinksParser.new(self)
       @meta_tag_parser = MetaInspector::Parsers::MetaTagsParser.new(self)
       @links_parser    = MetaInspector::Parsers::LinksParser.new(self)
       @download_images = options[:download_images]
@@ -21,11 +22,12 @@ module MetaInspector
     end
     extend Forwardable
-    delegate [:url, :scheme, :host]                   => :@document
-    delegate [:meta_tags, :meta_tag, :meta, :charset] => :@meta_tag_parser
-    delegate [:links, :feed, :base_url]               => :@links_parser
-    delegate :images                                  => :@images_parser
-    delegate [:title, :best_title, :description]      => :@texts_parser
+    delegate [:url, :scheme, :host]                          => :@document
+    delegate [:meta_tags, :meta_tag, :meta, :charset]        => :@meta_tag_parser
+    delegate [:head_links, :stylesheets, :canonicals, :feed] => :@head_links_parser
+    delegate [:links, :base_url]                             => :@links_parser
+    delegate :images                                         => :@images_parser
+    delegate [:title, :best_title, :description]             => :@texts_parser
     # Returns the whole parsed document
     def parsed

data/lib/meta_inspector/parsers/head_links.rb ADDED

@@ -0,0 +1,40 @@
+module MetaInspector
+  module Parsers
+    class HeadLinksParser < Base
+      delegate [:parsed, :base_url] => :@main_parser
+      def head_links
+        @head_links ||= parsed.css('head link').map do |tag|
+          Hash[
+            tag.attributes.keys.map do |key|
+              keysym = key.to_sym
+              val = tag.attributes[key].value
+              val = URL.absolutify(val, base_url) if keysym == :href
+              [keysym, val]
+            end
+          ]
+        end
+      end
+      def stylesheets
+        @stylesheets ||= head_links.select { |hl| hl[:rel] == 'stylesheet' }
+      end
+      def canonicals
+        @canonicals ||= head_links.select { |hl| hl[:rel] == 'canonical' }
+      end
+      # Returns the parsed document meta rss link
+      def feed
+        @feed ||= (parsed_feed('rss') || parsed_feed('atom'))
+      end
+      private
+      def parsed_feed(format)
+        feed = parsed.search("//link[@type='application/#{format}+xml']").first
+        feed ? URL.absolutify(feed.attributes['href'].value, base_url) : nil
+      end
+    end
+  end
+end
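
A minimal sketch of the filtering that `stylesheets` and `canonicals` perform, using plain hashes in place of parsed document nodes. The `HEAD_LINKS` sample data and the `links_with_rel` helper are hypothetical names introduced for illustration and are not part of the gem.

```ruby
# Sample data shaped like the hashes that head_links builds:
# attribute names become symbol keys, values stay strings.
HEAD_LINKS = [
  { rel: 'canonical',     href: 'http://example.com/canonical-from-head' },
  { rel: 'stylesheet',    href: 'http://example.com/stylesheets/screen.css' },
  { rel: 'shortcut icon', href: 'http://example.com/favicon.ico', type: 'image/x-icon' },
  { rel: 'stylesheet',    href: 'http://foo/print.css', media: 'print' }
].freeze

# Hypothetical helper: keeps only links whose rel matches exactly,
# the same comparison used by stylesheets and canonicals in the diff.
def links_with_rel(links, rel)
  links.select { |link| link[:rel] == rel }
end

stylesheets = links_with_rel(HEAD_LINKS, 'stylesheet')
canonicals  = links_with_rel(HEAD_LINKS, 'canonical')

puts stylesheets.map { |link| link[:href] }
puts canonicals.first[:href]
```

Because the comparison is an exact string match, a link with a compound value such as `rel="alternate stylesheet"` would not be selected by this kind of filter.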

data/lib/meta_inspector/parsers/images.rb CHANGED

@@ -32,28 +32,37 @@ module MetaInspector
         URL.absolutify(suggested_img, base_url) if suggested_img
       end
-      # Returns the largest image from the image collection,
-      # filtered for images that are more square than 10:1 or 1:10
-      def largest()
-        @larget_image ||= begin
+      # Returns an array of [img_url, width, height] sorted by image area (width * height)
+      def with_size
+        @with_size ||= begin
           img_nodes = parsed.search('//img').select{ |img_node| img_node['src'] }
-          sizes = img_nodes.map { |img_node| [URL.absolutify(img_node['src'], base_url), img_node['width'], img_node['height']] }
-          sizes.uniq! { |url, width, height| url }
+          imgs_with_size = img_nodes.map { |img_node| [URL.absolutify(img_node['src'], base_url), img_node['width'], img_node['height']] }
+          imgs_with_size.uniq! { |url, width, height| url }
           if @download_images
-            sizes.map! do |url, width, height|
+            imgs_with_size.map! do |url, width, height|
               width, height = FastImage.size(url) if width.nil? || height.nil?
-              [url, width, height]
+              [url, width.to_i, height.to_i]
             end
           else
-            sizes.map! do |url, width, height|
+            imgs_with_size.map! do |url, width, height|
               width, height = [0, 0] if width.nil? || height.nil?
-              [url, width, height]
+              [url, width.to_i, height.to_i]
             end
           end
-          sizes.map! { |url, width, height| [url, width.to_i * height.to_i, width.to_f / height.to_f] }
-          sizes.keep_if { |url, area, ratio| ratio > 0.1 && ratio < 10 }
-          sizes.sort_by! { |url, area, ratio| -area }
-          url, area, ratio = sizes.first
+          imgs_with_size.sort_by { |url, width, height| -(width.to_i * height.to_i) }
+        end
+      end
+      # Returns the largest image from the image collection,
+      # filtered for images that are more square than 10:1 or 1:10
+      def largest
+        @largest_image ||= begin
+          imgs_with_size = with_size.dup
+          imgs_with_size.keep_if do |url, width, height|
+            ratio = width.to_f / height.to_f
+            ratio > 0.1 && ratio < 10
+          end
+          url, width, height = imgs_with_size.first
           url
         end
       end
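
The split between `with_size` and `largest` can be sketched on plain arrays: first sort by descending area, then pick the first entry whose width-to-height ratio lies strictly between 0.1 and 10. The sample `images` data and the `squarish?` helper are hypothetical, chosen to mirror the spec fixtures rather than taken from the gem.

```ruby
# Each entry mirrors the [url, width, height] triples returned by with_size.
images = [
  ['http://example.com/smaller',    10,  10],
  ['http://example.com/too_wide',   100, 10],
  ['http://example.com/largest',    100, 100],
  ['http://example.com/too_narrow', 10,  100]
]

# Sort by descending area (width * height), as with_size does.
by_area = images.sort_by { |_url, width, height| -(width * height) }

# Hypothetical helper: true when the ratio is strictly between 0.1 and 10,
# the same bounds that largest applies.
def squarish?(width, height)
  ratio = width.to_f / height.to_f
  ratio > 0.1 && ratio < 10
end

# The first entry that passes the ratio filter is the largest usable image.
largest_url = by_area.find { |_url, width, height| squarish?(width, height) }&.first

puts by_area.first.first
puts largest_url
```

Entries with an unknown size of 0 by 0 produce a ratio of `NaN`, which fails both comparisons, so they drop out of the `largest` candidates while still appearing at the end of the sorted list.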

data/lib/meta_inspector/parsers/links.rb CHANGED

@@ -14,8 +14,7 @@ module MetaInspector
       # Returns all links found, unrelavitized and absolutified
       def all
-        @all ||= raw.map { |link| URL.absolutify(URL.unrelativize(link, scheme), base_url) }
-                    .compact.uniq
+        @all ||= raw.map { |link| URL.absolutify(link, base_url) }.compact.uniq
       end
       # Returns all HTTP links found
@@ -44,11 +43,6 @@ module MetaInspector
           'non_http' => non_http }
       end
-      # Returns the parsed document meta rss link
-      def feed
-        @feed ||= (parsed_feed('rss') || parsed_feed('atom'))
-      end
       # Returns the base url to absolutify relative links.
       # This can be the one set on a <base> tag,
       # or the url of the document if no <base> tag was found.
@@ -56,13 +50,6 @@ module MetaInspector
         base_href || url
       end
-      private
-      def parsed_feed(format)
-        feed = parsed.search("//link[@type='application/#{format}+xml']").first
-        feed ? URL.absolutify(feed.attributes['href'].value, base_url) : nil
-      end
       # Returns the value of the href attribute on the <base /> tag, if exists
       def base_href
         parsed.search('base').first.attributes['href'].value rescue nil
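
With `unrelativize` gone, protocol-relative links such as `//example2.com/...` must take their scheme from the base URL during absolutification. The sketch below shows that resolution rule using the standard library's `URI.join`. It illustrates the behaviour only and is not the gem's `absolutify`; the `resolve` helper is a hypothetical name.

```ruby
require 'uri'

# Hypothetical helper: resolves a link against the URL of the page
# that contains it, following standard relative-reference resolution.
def resolve(link, base_url)
  URI.join(base_url, link).to_s
end

# A protocol-relative link inherits the scheme of the base URL.
puts resolve('//example2.com/stylesheets/screen.css', 'http://example.com/head_links')
puts resolve('//example2.com/stylesheets/screen.css', 'https://example.com/head_links')

# A path-only link inherits both scheme and host.
puts resolve('/stylesheets/screen.css', 'https://example.com/head_links')
```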

data/lib/meta_inspector/url.rb CHANGED

@@ -27,21 +27,35 @@ module MetaInspector
       "#{scheme}://#{host}/"
     end
+    WELL_KNOWN_TRACKING_PARAMS = %w( utm_source utm_medium utm_term utm_content utm_campaign )
+    def tracked?
+      u = parsed(url)
+      found_tracking_params = WELL_KNOWN_TRACKING_PARAMS & u.query_values.keys
+      return found_tracking_params.any?
+    end
+    def untracked_url
+      u = parsed(url)
+      u.query_values = u.query_values.delete_if { |key, _| WELL_KNOWN_TRACKING_PARAMS.include? key }
+      u.to_s
+    end
+    def untrack!
+      self.url = untracked_url
+    end
     def url=(new_url)
       url  = with_default_scheme(new_url)
       @url = @normalize ? normalized(url) : url
     end
-    # Converts a protocol-relative url to its full form,
-    # depending on the scheme of the page that contains it
-    def self.unrelativize(url, scheme)
-      url =~ /^\/\// ? "#{scheme}://#{url[2..-1]}" : url
-    end
     # Converts a relative URL to an absolute URL, like:
     #   "/faq" => "http://example.com/faq"
     # Respecting already absolute URLs like the ones starting with
     #   http:, ftp:, telnet:, mailto:, javascript: ...
+    # Protocol-relative URLs are also resolved to use the same
+    # schema as the base_url
     def self.absolutify(url, base_url)
       if url =~ /^\w*\:/i
         MetaInspector::URL.new(url).url
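
The tracking-parameter logic added above can be approximated with the standard library alone. The following is a sketch under that assumption: it uses `URI` rather than the gem's own URL parsing, and the top-level `tracked?` and `untracked_url` helpers are hypothetical functions that take the URL as an argument.

```ruby
require 'uri'

# Same list as WELL_KNOWN_TRACKING_PARAMS in the diff.
TRACKING_PARAMS = %w[utm_source utm_medium utm_term utm_content utm_campaign].freeze

# Hypothetical helper: true if the query string has any known tracking key.
# A URL without a query string is treated as untracked.
def tracked?(url)
  pairs = URI.decode_www_form(URI.parse(url).query || '')
  pairs.any? { |key, _value| TRACKING_PARAMS.include?(key) }
end

# Hypothetical helper: returns the URL with known tracking keys removed,
# dropping the "?" entirely when nothing is left.
def untracked_url(url)
  uri   = URI.parse(url)
  pairs = URI.decode_www_form(uri.query || '')
  kept  = pairs.reject { |key, _value| TRACKING_PARAMS.include?(key) }
  uri.query = kept.empty? ? nil : URI.encode_www_form(kept)
  uri.to_s
end

puts tracked?('http://example.com/foo?not_utm_thing=bar&utm_source=1234')
puts untracked_url('http://example.com/foo?not_utm_thing=bar&utm_source=1234')
```

The sketch explicitly treats a URL with no query string as untracked. Whether the gem's methods handle that case is not shown in this diff, and the added specs only exercise URLs that carry a query string.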

data/lib/meta_inspector/version.rb CHANGED

@@ -1,3 +1,3 @@
 module MetaInspector
-  VERSION = "4.4.2"
+  VERSION = '4.5.0'
 end

data/spec/fixtures/head_links.response ADDED

@@ -0,0 +1,34 @@
+HTTP/1.1 200 OK
+Server: nginx/0.7.67
+Date: Fri, 18 Nov 2011 21:46:46 GMT
+Content-Type: text/html
+Connection: keep-alive
+Last-Modified: Mon, 14 Nov 2011 16:53:18 GMT
+Content-Length: 4987
+X-Varnish: 2000423390
+Age: 0
+Via: 1.1 varnish
+<html>
+  <head>
+    <title>An example page</title>
+    <link
+        rel="canonical"
+        href="http://example.com/canonical-from-head"
+    />
+    <link rel="stylesheet" href="/stylesheets/screen.css">
+    <link rel="stylesheet" href="//example2.com/stylesheets/screen.css">
+    <link rel="shortcut icon" href="/favicon.ico" type="image/x-icon" />
+    <link rel="shorturl" href="http://gu.com/p/32v5a" />
+    <link
+        rel="stylesheet"
+        type="text/css"
+        href="http://foo/print.css"
+        media="print"
+        class="contrast"
+    />
+  </head>
+  <body>
+    <h1>Hello World</h1>
+  </body>
+</html>

data/spec/fixtures/protocol_relative.response CHANGED

@@ -12,6 +12,8 @@ Accept-Ranges: bytes
 <head>
   <meta charset="utf-8" />
   <title>Protocol-relative URLs</title>
+  <meta property="og:image" content="//static-secure.guim.co.uk/sys-images/Guardian/Pix/pictures/2011/8/8/1312810126887/gu_192x115.jpg"/>
+  <link rel="shortcut icon" href="//static-secure.guim.co.uk/sys-images/favicon.ico" type="image/x-icon" />
 </head>
 <body>
   <p>Internal links</p>
@@ -22,5 +24,8 @@ Accept-Ranges: bytes
   <p>External links</p>
   <a href="http://google.com">External: normal link</a>
   <a href="//yahoo.com">External: protocol-relative link</a>
+  <p>Images</p>
+  <img src="//example.com/image.jpg" />
 </body>
 </html>

data/spec/meta_inspector/head_links_spec.rb ADDED

@@ -0,0 +1,42 @@
+require 'spec_helper'
+describe MetaInspector do
+  describe "head_links" do
+    let(:page) { MetaInspector.new('http://example.com/head_links') }
+    let(:page_https) { MetaInspector.new('https://example.com/head_links') }
+    it "#head_links" do
+      expect(page.head_links).to eq([
+                                        {rel: 'canonical', href: 'http://example.com/canonical-from-head'},
+                                        {rel: 'stylesheet', href: 'http://example.com/stylesheets/screen.css'},
+                                        {rel: 'stylesheet', href: 'http://example2.com/stylesheets/screen.css'},
+                                        {rel: 'shortcut icon', href: 'http://example.com/favicon.ico', type: 'image/x-icon'},
+                                        {rel: 'shorturl', href: 'http://gu.com/p/32v5a'},
+                                        {rel: 'stylesheet', type: 'text/css', href: 'http://foo/print.css', media: 'print', class: 'contrast'}
+                                    ])
+    end
+    it "#stylesheets" do
+      expect(page.stylesheets).to eq([
+                                         {rel: 'stylesheet', href: 'http://example.com/stylesheets/screen.css'},
+                                         {rel: 'stylesheet', href: 'http://example2.com/stylesheets/screen.css'},
+                                         {rel: 'stylesheet', type: 'text/css', href: 'http://foo/print.css', media: 'print', class: 'contrast'}
+                                     ])
+      expect(page_https.stylesheets).to eq([
+                                         {rel: 'stylesheet', href: 'https://example.com/stylesheets/screen.css'},
+                                         {rel: 'stylesheet', href: 'https://example2.com/stylesheets/screen.css'},
+                                         {rel: 'stylesheet', type: 'text/css', href: 'http://foo/print.css', media: 'print', class: 'contrast'}
+                                     ])
+    end
+    it "#canonical" do
+      expect(page.canonicals).to eq([
+                                        {rel: 'canonical', href: 'http://example.com/canonical-from-head'}
+                                    ])
+    end
+  end
+end

data/spec/meta_inspector/images_spec.rb CHANGED

@@ -123,6 +123,44 @@ describe MetaInspector do
     end
   end
+  describe "images.with_size" do
+    it "should return sorted by area array of [img_url, width, height] using html sizes" do
+      page = MetaInspector.new('http://example.com/largest_image_in_html')
+      expect(page.images.with_size).to eq([
+        ["http://example.com/largest", 100, 100],
+        ["http://example.com/too_narrow", 10, 100],
+        ["http://example.com/too_wide", 100, 10],
+        ["http://example.com/smaller", 10, 10],
+        ["http://example.com/smallest", 1, 1]
+      ])
+    end
+    it "should return sorted by area array of [img_url, width, height] using actual image sizes" do
+      page = MetaInspector.new('http://example.com/largest_image_using_image_size')
+      expect(page.images.with_size).to eq([
+        ["http://example.com/100x100", 100, 100],
+        ["http://example.com/10x100", 10, 100],
+        ["http://example.com/100x10", 100, 10],
+        ["http://example.com/10x10", 10, 10],
+        ["http://example.com/1x1", 1, 1]
+      ])
+    end
+    it "should return sorted by area array of [img_url, width, height] without downloading images" do
+      page = MetaInspector.new('http://example.com/largest_image_using_image_size', download_images: false)
+      expect(page.images.with_size).to eq([
+        ["http://example.com/10x100", 10, 100],
+        ["http://example.com/100x10", 100, 10],
+        ["http://example.com/1x1", 1, 1],
+        ["http://example.com/10x10", 0, 0],
+        ["http://example.com/100x100", 0, 0]
+      ])
+    end
+  end
   describe "images.largest" do
     it "should find the largest image on the page using html sizes" do
       page = MetaInspector.new('http://example.com/largest_image_in_html')
@@ -174,4 +212,26 @@ describe MetaInspector do
       expect(page.images.favicon).to eq(nil)
     end
   end
+  describe 'protocol-relative' do
+    before(:each) do
+      @m_http   = MetaInspector.new('http://protocol-relative.com')
+      @m_https  = MetaInspector.new('https://protocol-relative.com')
+    end
+    it 'should unrelativize images' do
+      expect(@m_http.images.to_a).to eq(['http://example.com/image.jpg'])
+      expect(@m_https.images.to_a).to eq(['https://example.com/image.jpg'])
+    end
+    it 'should unrelativize owner suggested image' do
+      expect(@m_http.images.owner_suggested).to eq('http://static-secure.guim.co.uk/sys-images/Guardian/Pix/pictures/2011/8/8/1312810126887/gu_192x115.jpg')
+      expect(@m_https.images.owner_suggested).to eq('https://static-secure.guim.co.uk/sys-images/Guardian/Pix/pictures/2011/8/8/1312810126887/gu_192x115.jpg')
+    end
+    it 'should unrelativize favicon' do
+      expect(@m_http.images.favicon).to eq('http://static-secure.guim.co.uk/sys-images/favicon.ico')
+      expect(@m_https.images.favicon).to eq('https://static-secure.guim.co.uk/sys-images/favicon.ico')
+    end
+  end
 end

data/spec/spec_helper.rb CHANGED

@@ -41,6 +41,10 @@ FakeWeb.register_uri(:get, "http://example.com/10x10", :response => fixture_file
 FakeWeb.register_uri(:get, "http://example.com/100x100", :response => fixture_file("100x100.jpg.response"))
 FakeWeb.register_uri(:get, "http://www.24-horas.mx/mexico-firma-acuerdo-bilateral-automotriz-con-argentina/", :response => fixture_file("relative_og_image.response"))
+#Used to test canonical URLs in head
+FakeWeb.register_uri(:get, "http://example.com/head_links", :response => fixture_file("head_links.response"))
+FakeWeb.register_uri(:get, "https://example.com/head_links", :response => fixture_file("head_links.response"))
 # Used to test best_title logic
 FakeWeb.register_uri(:get, "http://example.com/title_in_head", :response => fixture_file("title_in_head.response"))
 FakeWeb.register_uri(:get, "http://example.com/title_in_body", :response => fixture_file("title_in_body.response"))

data/spec/url_spec.rb CHANGED

@@ -36,6 +36,47 @@ describe MetaInspector::URL do
     expect(MetaInspector::URL.new('http://example.com/faqs').root_url).to   eq('http://example.com/')
   end
+  it "should return an untracked url" do
+    expect(MetaInspector::URL.new('http://example.com/foo?not_utm_thing=bar&utm_source=1234').untracked_url).to eq('http://example.com/foo?not_utm_thing=bar')
+    expect(MetaInspector::URL.new('http://example.com/foo?not_utm_thing=bar&utm_medium=1234').untracked_url).to eq('http://example.com/foo?not_utm_thing=bar')
+    expect(MetaInspector::URL.new('http://example.com/foo?not_utm_thing=bar&utm_term=1234').untracked_url).to eq('http://example.com/foo?not_utm_thing=bar')
+    expect(MetaInspector::URL.new('http://example.com/foo?not_utm_thing=bar&utm_content=1234').untracked_url).to eq('http://example.com/foo?not_utm_thing=bar')
+    expect(MetaInspector::URL.new('http://example.com/foo?not_utm_thing=bar&utm_campaign=1234').untracked_url).to eq('http://example.com/foo?not_utm_thing=bar')
+    expect(MetaInspector::URL.new('http://example.com/foo?not_utm_thing=bar&utm_source=1234&utm_medium=5678&utm_term=4321&utm_content=9876&utm_campaign=5436').untracked_url).to eq('http://example.com/foo?not_utm_thing=bar')
+  end
+  it "should remove tracking parameters from url" do
+    tracked_urls = ['http://example.com/foo?not_utm_thing=bar&utm_source=1234',
+                    'http://example.com/foo?not_utm_thing=bar&utm_medium=1234',
+                    'http://example.com/foo?not_utm_thing=bar&utm_term=1234',
+                    'http://example.com/foo?not_utm_thing=bar&utm_content=1234',
+                    'http://example.com/foo?not_utm_thing=bar&utm_campaign=1234',
+                    'http://example.com/foo?not_utm_thing=bar&utm_source=1234&utm_medium=5678&utm_term=4321&utm_content=9876&utm_campaign=5436'
+    ]
+    tracked_urls.each do |tracked_url|
+      url = MetaInspector::URL.new(tracked_url)
+      url.untrack!
+      expect(url.url).to eq('http://example.com/foo?not_utm_thing=bar')
+    end
+  end
+  it "should say if the url is tracked" do
+    expect(MetaInspector::URL.new('http://example.com/foo?not_utm_thing=bar&utm_source=1234').tracked?).to be true
+    expect(MetaInspector::URL.new('http://example.com/foo?not_utm_thing=bar&utm_medium=1234').tracked?).to be true
+    expect(MetaInspector::URL.new('http://example.com/foo?not_utm_thing=bar&utm_term=1234').tracked?).to be true
+    expect(MetaInspector::URL.new('http://example.com/foo?not_utm_thing=bar&utm_content=1234').tracked?).to be true
+    expect(MetaInspector::URL.new('http://example.com/foo?not_utm_thing=bar&utm_campaign=1234').tracked?).to be true
+    expect(MetaInspector::URL.new('http://example.com/foo?not_utm_thing=bar&utm_source=1234&utm_medium=5678&utm_term=4321&utm_content=9876&utm_campaign=5436').tracked?).to be true
+    expect(MetaInspector::URL.new('http://example.com/foo?not_utm_thing=bar&not_utm_source=1234').tracked?).to be false
+    expect(MetaInspector::URL.new('http://example.com/foo?not_utm_thing=bar&not_utm_medium=1234').tracked?).to be false
+    expect(MetaInspector::URL.new('http://example.com/foo?not_utm_thing=bar&not_utm_term=1234').tracked?).to be false
+    expect(MetaInspector::URL.new('http://example.com/foo?not_utm_thing=bar&not_utm_content=1234').tracked?).to be false
+    expect(MetaInspector::URL.new('http://example.com/foo?not_utm_thing=bar&not_utm_campaign=1234').tracked?).to be false
+  end
   describe "url=" do
     it "should update the url" do
       url = MetaInspector::URL.new('http://first.com/')

metadata CHANGED

@@ -1,14 +1,14 @@
 --- !ruby/object:Gem::Specification
 name: metainspector
 version: !ruby/object:Gem::Version
-  version: 4.4.2
+  version: 4.5.0
 platform: ruby
 authors:
 - Jaime Iniesta
 autorequire:
 bindir: bin
 cert_chain: []
-date: 2015-04-30 00:00:00.000000000 Z
+date: 2015-05-29 00:00:00.000000000 Z
 dependencies:
 - !ruby/object:Gem::Dependency
   name: nokogiri
@@ -232,6 +232,7 @@ files:
 - ".rspec.example"
 - ".rubocop.yml.example"
 - ".travis.yml"
+- CHANGELOG.md
 - Gemfile
 - Guardfile
 - MIT-LICENSE
@@ -247,6 +248,7 @@ files:
 - lib/meta_inspector/exceptionable.rb
 - lib/meta_inspector/parser.rb
 - lib/meta_inspector/parsers/base.rb
+- lib/meta_inspector/parsers/head_links.rb
 - lib/meta_inspector/parsers/images.rb
 - lib/meta_inspector/parsers/links.rb
 - lib/meta_inspector/parsers/meta_tags.rb
@@ -270,6 +272,7 @@ files:
 - spec/fixtures/example.response
 - spec/fixtures/facebook.com.response
 - spec/fixtures/guardian.co.uk.response
+- spec/fixtures/head_links.response
 - spec/fixtures/https.facebook.com.response
 - spec/fixtures/international.response
 - spec/fixtures/invalid_href.response
@@ -305,6 +308,7 @@ files:
 - spec/fixtures/wordpress_site.response
 - spec/fixtures/youtube.response
 - spec/fixtures/youtube_short_title.response
+- spec/meta_inspector/head_links_spec.rb
 - spec/meta_inspector/images_spec.rb
 - spec/meta_inspector/links_spec.rb
 - spec/meta_inspector/meta_inspector_spec.rb