metainspector 4.4.2 → 4.5.0

checksums.yaml CHANGED
@@ -1,7 +1,7 @@
1
1
  ---
2
2
  SHA1:
3
- metadata.gz: 55d69ca7b0d4a656349f395a906afe70ca816f5d
4
- data.tar.gz: 3e16c396163b6171ecc136f906f726de8ab59cd4
3
+ metadata.gz: d8b2f4cf8526bd14a55d879334ff9bf14c95180f
4
+ data.tar.gz: 0d39ceedb495d19a761fd7f6bcfdae767d1e1c26
5
5
  SHA512:
6
- metadata.gz: a55b1cb2c32dcd1f8a020b28b0de9b4ce017ad20571ca1f5d1b35a26e0611e1a6faecc680dcb76fd8bad102841781c23aabd8298647a4dfafb3b8c54ea0369ac
7
- data.tar.gz: b9272f589ff465a43736dcb7c02c85c376312d2cd7d46427588d966ac396fb820bce392db8f3708b6304b87fd5584004ca44457cd3b61c7afa8349e94649adaf
6
+ metadata.gz: b9b8a345bb8f935bfe5a5fb74d4e86a92893c8d44066f87cdffbe029fc5746841c290c366fd94fc2a84edb73edbbf43c491189f6b82e754f4bc0c494eaed6591
7
+ data.tar.gz: f559c11756c34406d5083a58c8cd80fdf5665fd79f4a204fdd22eee9eed6bbb65553c3a0b47f16962623ef4c4903adb74b6ac503893b0cb6ea66ee84513e85d4
data/CHANGELOG.md ADDED
@@ -0,0 +1,66 @@
1
+ # MetaInspector Changelog
2
+
3
+ ## Changes in 4.5
4
+
5
+ * The Document API now includes access to head/link elements
6
+ * `page.head_links` returns an array of hashes of all head/links.
7
+ * `page.stylesheets` returns head/links where rel='stylesheet'
8
+ * `page.canonicals` returns head/links where rel='canonical'
9
+
10
+ * The URL API can remove common tracking parameters from the querystring
11
+ * `url.tracked?` will tell you if the url contains known tracking parameters
12
+ * `url.untracked_url` will return the url with known tracking parameters removed
13
+ * `url.untrack!` will remove the tracking parameters from the url
14
+
15
+ * The images API has been extended:
16
+ * `page.images.with_size` returns a sorted array (by descending area) of [image_url, width, height]
17
+
18
+ ## Changes in 4.4
19
+
20
+ The default headers now include `'Accept-Encoding' => 'identity'` to minimize trouble with servers that respond with malformed compressed responses, [as explained here](https://github.com/lostisland/faraday/issues/337).
21
+
22
+ ## Changes in 4.3
23
+
24
+ * The Document API has been extended with one new method `page.best_title` that returns the longest text available from a selection of candidates.
25
+ * `to_hash` now includes `scheme`, `host`, `root_url`, `best_title` and `description`.
26
+
27
+ ## Changes in 4.2
28
+
29
+ * The images API has been extended, with two new methods:
30
+
31
+ * `page.images.owner_suggested` returns the OG or Twitter image, or `nil` if neither is present.
32
+ * `page.images.largest` returns the largest image found in the page. This uses the HTML height and width attributes as well as the [fastimage](https://github.com/sdsykes/fastimage) gem to return the largest image on the page that has a ratio squarer than 1:10 or 10:1. This usually provides a good alternative to the OG or Twitter images if they are not supplied.
33
+
34
+ * The criteria for `page.images.best` have changed slightly: we now return the largest image instead of the first image if no owner-suggested image is found.
35
+
36
+ ## Changes in 4.1
37
+
38
+ * Introduces the `:normalize_url` option, which lets you disable URL normalization.
39
+
40
+ ## Changes in 4.0
41
+
42
+ * The links API has changed. Instead of `page.links`, `page.internal_links` and `page.external_links`, we now have:
43
+
44
+ ```ruby
45
+ page.links.raw # Returns all links found, unprocessed
46
+ page.links.all # Returns all links found, unrelativized and absolutified
47
+ page.links.http # Returns all HTTP links found
48
+ page.links.non_http # Returns all non-HTTP links found
49
+ page.links.internal # Returns all internal HTTP links found
50
+ page.links.external # Returns all external HTTP links found
51
+ ```
52
+
53
+ * The images API has been changed, now instead of `page.image` we have `page.images.best`, and instead of `page.favicon` we have `page.images.favicon`.
54
+
55
+ * Now `page.image` will return the first image in `page.images` if no OG or Twitter image is found, instead of returning `nil`.
56
+
57
+ * You can now specify 2 different timeouts, `connection_timeout` and `read_timeout`, instead of the previous single `timeout`.
58
+
59
+ ## Changes in 3.0
60
+
61
+ * The redirect API has been changed, now the `:allow_redirections` option will expect only a boolean, which by default is `true`. That is, no more specifying `:safe`, `:unsafe` or `:all`.
62
+ * We've dropped support for Ruby < 2.
63
+
64
+ Also, we've introduced a new feature:
65
+
66
+ * Persist cookies across redirects. Now MetaInspector will include the received cookies when following redirects. This fixes some cases where a redirect would fail, sometimes caught in a redirection loop.
data/README.md CHANGED
@@ -8,56 +8,6 @@ You give it an URL, and it lets you easily get its title, links, images, charset
8
8
 
9
9
  You can try MetaInspector live at this little demo: [https://metainspectordemo.herokuapp.com](https://metainspectordemo.herokuapp.com)
10
10
 
11
- ## Changes in 4.4
12
-
13
- The default headers now include `'Accept-Encoding' => 'identity'` to minimize trouble with servers that respond with malformed compressed responses, [as explained here](https://github.com/lostisland/faraday/issues/337).
14
-
15
- ## Changes in 4.3
16
-
17
- * The Document API has been extended with one new method `page.best_title` that returns the longest text available from a selection of candidates.
18
- * `to_hash` now includes `scheme`, `host`, `root_url`, `best_title` and `description`.
19
-
20
- ## Changes in 4.2
21
-
22
- * The images API has been extended, with two new methods:
23
-
24
- * `page.images.owner_suggested` returns the OG or Twitter image, or `nil` if neither are present.
25
- * `page.images.largest` returns the largest image found in the page. This uses the HTML height and width attributes as well as the [fastimage](https://github.com/sdsykes/fastimage) gem to return the largest image on the page that has a ratio squarer than 1:10 or 10:1. This usually provides a good alternative to the OG or Twitter images if they are not supplied.
26
-
27
- * The criteria for `page.images.best` has changed slightly, we'll now return the largest image instead of the first image if no owner-suggested image is found.
28
-
29
- ## Changes in 4.1
30
-
31
- * Introduces the `:normalize_url` option, which allows to disable URL normalization.
32
-
33
- ## Changes in 4.0
34
-
35
- * The links API has been changed, now instead of `page.links`, `page.internal_links` and `page.external_links` we have:
36
-
37
- ```ruby
38
- page.links.raw # Returns all links found, unprocessed
39
- page.links.all # Returns all links found, unrelavitized and absolutified
40
- page.links.http # Returns all HTTP links found
41
- page.links.non_http # Returns all non-HTTP links found
42
- page.links.internal # Returns all internal HTTP links found
43
- page.links.external # Returns all external HTTP links found
44
- ```
45
-
46
- * The images API has been changed, now instead of `page.image` we have `page.images.best`, and instead of `page.favicon` we have `page.images.favicon`.
47
-
48
- * Now `page.image` will return the first image in `page.images` if no OG or Twitter image found, instead of returning `nil`.
49
-
50
- * You can now specify 2 different timeouts, `connection_timeout` and `read_timeout`, instead of the previous single `timeout`.
51
-
52
- ## Changes in 3.0
53
-
54
- * The redirect API has been changed, now the `:allow_redirections` option will expect only a boolean, which by default is `true`. That is, no more specifying `:safe`, `:unsafe` or `:all`.
55
- * We've dropped support for Ruby < 2.
56
-
57
- Also, we've introduced a new feature:
58
-
59
- * Persist cookies across redirects. Now MetaInspector will include the received cookies when following redirects. This fixes some cases where a redirect would fail, sometimes caught in a redirection loop.
60
-
61
11
  ## Installation
62
12
 
63
13
  Install the gem from RubyGems:
@@ -91,47 +41,72 @@ page = MetaInspector.new('sitevalidator.com')
91
41
  You can also include the html which will be used as the document to scrape:
92
42
 
93
43
  ```ruby
94
- page = MetaInspector.new("http://sitevalidator.com", :document => "<html><head><title>Hello From Passed Html</title><a href='/hello'>Hello link</a></head><body></body></html>")
44
+ page = MetaInspector.new("http://sitevalidator.com",
45
+ :document => "<html>...</html>")
95
46
  ```
96
47
 
97
- ## Accessing response status and headers
48
+ ## Accessing response
98
49
 
99
50
  You can check the status and headers from the response like this:
100
51
 
101
52
  ```ruby
102
53
  page.response.status # 200
103
- page.response.headers # { "server"=>"nginx", "content-type"=>"text/html; charset=utf-8", "cache-control"=>"must-revalidate, private, max-age=0", ... }
54
+ page.response.headers # { "server"=>"nginx", "content-type"=>"text/html; charset=utf-8",
55
+ # "cache-control"=>"must-revalidate, private, max-age=0", ... }
104
56
  ```
105
57
 
106
58
  ## Accessing scraped data
107
59
 
108
- You can see the scraped data like this:
60
+ ### URL
109
61
 
110
62
  ```ruby
111
63
  page.url # URL of the page
64
+ page.tracked? # returns true if the url contains known tracking parameters
65
+ page.untracked_url # returns the url with the known tracking parameters removed
66
+ page.untrack! # removes the known tracking parameters from the url
112
67
  page.scheme # Scheme of the page (http, https)
113
68
  page.host # Hostname of the page (like, sitevalidator.com, without the scheme)
114
69
  page.root_url # Root url (scheme + host, like http://sitevalidator.com/)
70
+ ```
71
+
72
+ ### Head links
73
+
74
+ ```ruby
75
+ page.head_links # an array of hashes of all head/links
76
+ page.stylesheets # an array of hashes of all head/links where rel='stylesheet'
77
+ page.canonicals # an array of hashes of all head/links where rel='canonical'
78
+ page.feed # Get rss or atom links in meta data fields as array
79
+ ```
80
+
81
+ ### Texts
82
+
83
+ ```ruby
115
84
  page.title # title of the page from the head section, as string
116
85
  page.best_title # best title of the page, from a selection of candidates
86
+ page.description # returns the meta description, or the first long paragraph if no meta description is found
87
+ ```
88
+
89
+ ### Links
90
+
91
+ ```ruby
117
92
  page.links.raw # every link found, unprocessed
118
93
  page.links.all # every link found on the page as an absolute URL
119
94
  page.links.http # every HTTP link found
120
95
  page.links.non_http # every non-HTTP link found
121
96
  page.links.internal # every internal link found on the page as an absolute URL
122
97
  page.links.external # every external link found on the page as an absolute URL
123
- page.meta['keywords'] # meta keywords, as string
124
- page.meta['description'] # meta description, as string
125
- page.description # returns the meta description, or the first long paragraph if no meta description is found
98
+ ```
99
+
100
+ ### Images
101
+
102
+ ```ruby
126
103
  page.images # enumerable collection, with every img found on the page as an absolute URL
104
+ page.images.with_size # a sorted array (by descending area) of [image_url, width, height]
127
105
  page.images.best # Most relevant image, if defined with the og:image or twitter:image metatags. Fallback to the first page.images array element
128
106
  page.images.favicon # absolute URL to the favicon
129
- page.feed # Get rss or atom links in meta data fields as array
130
- page.charset # UTF-8
131
- page.content_type # content-type returned by the server when the url was requested
132
107
  ```
133
108
 
134
- ## Meta tags
109
+ ### Meta tags
135
110
 
136
111
  When it comes to meta tags, you have several options:
137
112
 
@@ -243,6 +218,13 @@ page.meta['author'] # Returns "Joe Sample"
243
218
 
244
219
  Please be aware that all keys are converted to downcase, so it's `'dc.date.issued'` and not `'DC.date.issued'`.
245
220
 
221
+ ### Misc
222
+
223
+ ```ruby
224
+ page.charset # UTF-8
225
+ page.content_type # content-type returned by the server when the url was requested
226
+ ```
227
+
246
228
  ## Other representations
247
229
 
248
230
  You can also access most of the scraped data as a hash:
@@ -422,7 +404,6 @@ You're more than welcome to fork this project and send pull requests. Just remem
422
404
  * Create a topic branch for your changes.
423
405
  * Add specs.
424
406
  * Keep your fake responses as small as possible. For each change in `spec/fixtures`, a comment should be included explaining why it's needed.
425
- * Update `version.rb`, following the [semantic versioning convention](http://semver.org/).
426
407
  * Update `README.md` if needed (for example, when you're adding or changing a feature).
427
408
 
428
409
  Thanks to all the contributors:
@@ -7,6 +7,7 @@ require File.expand_path(File.join(File.dirname(__FILE__), 'meta_inspector/parse
7
7
  require File.expand_path(File.join(File.dirname(__FILE__), 'meta_inspector/parsers/base'))
8
8
  require File.expand_path(File.join(File.dirname(__FILE__), 'meta_inspector/parsers/images'))
9
9
  require File.expand_path(File.join(File.dirname(__FILE__), 'meta_inspector/parsers/links'))
10
+ require File.expand_path(File.join(File.dirname(__FILE__), 'meta_inspector/parsers/head_links'))
10
11
  require File.expand_path(File.join(File.dirname(__FILE__), 'meta_inspector/parsers/meta_tags'))
11
12
  require File.expand_path(File.join(File.dirname(__FILE__), 'meta_inspector/parsers/texts'))
12
13
  require File.expand_path(File.join(File.dirname(__FILE__), 'meta_inspector/document'))
@@ -44,14 +44,16 @@ module MetaInspector
44
44
  end
45
45
 
46
46
  extend Forwardable
47
- delegate [:url, :scheme, :host, :root_url] => :@url
47
+ delegate [:url, :scheme, :host, :root_url,
48
+ :tracked?, :untracked_url, :untrack!] => :@url
48
49
 
49
50
  delegate [:content_type, :response] => :@request
50
51
 
51
52
  delegate [:parsed, :title, :best_title,
52
53
  :description, :links,
53
54
  :images, :feed, :charset, :meta_tags,
54
- :meta_tag, :meta, :favicon] => :@parser
55
+ :meta_tag, :meta, :favicon,
56
+ :head_links, :stylesheets, :canonicals] => :@parser
55
57
 
56
58
  # Returns all document data as a nested Hash
57
59
  def to_hash
@@ -13,6 +13,7 @@ module MetaInspector
13
13
  def initialize(document, options = {})
14
14
  @document = document
15
15
  @exception_log = options[:exception_log]
16
+ @head_links_parser = MetaInspector::Parsers::HeadLinksParser.new(self)
16
17
  @meta_tag_parser = MetaInspector::Parsers::MetaTagsParser.new(self)
17
18
  @links_parser = MetaInspector::Parsers::LinksParser.new(self)
18
19
  @download_images = options[:download_images]
@@ -21,11 +22,12 @@ module MetaInspector
21
22
  end
22
23
 
23
24
  extend Forwardable
24
- delegate [:url, :scheme, :host] => :@document
25
- delegate [:meta_tags, :meta_tag, :meta, :charset] => :@meta_tag_parser
26
- delegate [:links, :feed, :base_url] => :@links_parser
27
- delegate :images => :@images_parser
28
- delegate [:title, :best_title, :description] => :@texts_parser
25
+ delegate [:url, :scheme, :host] => :@document
26
+ delegate [:meta_tags, :meta_tag, :meta, :charset] => :@meta_tag_parser
27
+ delegate [:head_links, :stylesheets, :canonicals, :feed] => :@head_links_parser
28
+ delegate [:links, :base_url] => :@links_parser
29
+ delegate :images => :@images_parser
30
+ delegate [:title, :best_title, :description] => :@texts_parser
29
31
 
30
32
  # Returns the whole parsed document
31
33
  def parsed
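The two hunks above only extend existing `Forwardable` delegation lists, so the new methods reach the public `page` object without any wrapper code. A minimal sketch of that pattern follows, assuming stand-in classes: `TinyURL` and `TinyDocument` are hypothetical names used for illustration and are not part of the gem, and the `tracked?` body is a deliberately naive placeholder.

```ruby
require 'forwardable'

# Hypothetical minimal URL object, standing in for MetaInspector::URL.
class TinyURL
  attr_reader :url

  def initialize(url)
    @url = url
  end

  # Naive placeholder check, only here to have a predicate to delegate.
  def tracked?
    url.include?('utm_')
  end
end

# Hypothetical document class that forwards calls to its @url object,
# using the same hash form of `delegate` that appears in the diff.
class TinyDocument
  extend Forwardable
  delegate [:url, :tracked?] => :@url

  def initialize(url)
    @url = TinyURL.new(url)
  end
end
```

Calling `TinyDocument.new('http://example.com/?utm_source=x').tracked?` is forwarded to the wrapped `TinyURL` instance, which is why adding a symbol to the list is enough to expose a new method.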
@@ -0,0 +1,40 @@
1
+ module MetaInspector
2
+ module Parsers
3
+ class HeadLinksParser < Base
4
+ delegate [:parsed, :base_url] => :@main_parser
5
+
6
+ def head_links
7
+ @head_links ||= parsed.css('head link').map do |tag|
8
+ Hash[
9
+ tag.attributes.keys.map do |key|
10
+ keysym = key.to_sym
11
+ val = tag.attributes[key].value
12
+ val = URL.absolutify(val, base_url) if keysym == :href
13
+ [keysym, val]
14
+ end
15
+ ]
16
+ end
17
+ end
18
+
19
+ def stylesheets
20
+ @stylesheets ||= head_links.select { |hl| hl[:rel] == 'stylesheet' }
21
+ end
22
+
23
+ def canonicals
24
+ @canonicals ||= head_links.select { |hl| hl[:rel] == 'canonical' }
25
+ end
26
+
27
+ # Returns the parsed document meta rss link
28
+ def feed
29
+ @feed ||= (parsed_feed('rss') || parsed_feed('atom'))
30
+ end
31
+
32
+ private
33
+
34
+ def parsed_feed(format)
35
+ feed = parsed.search("//link[@type='application/#{format}+xml']").first
36
+ feed ? URL.absolutify(feed.attributes['href'].value, base_url) : nil
37
+ end
38
+ end
39
+ end
40
+ end
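The parser above reduces to two steps: turn each `<link>` tag into an attribute hash with an absolute `href`, then filter by `rel`. A minimal sketch of that idea follows, assuming plain hashes in place of Nokogiri nodes and the standard library's `URI.join` in place of MetaInspector's own `URL.absolutify`. `RAW_HEAD_LINKS`, `absolutified_head_links` and `links_with_rel` are hypothetical names, not part of the gem.

```ruby
require 'uri'

# Hypothetical stand-in for the attributes of each <link> tag in <head>;
# in the gem these come from Nokogiri nodes matched by 'head link'.
RAW_HEAD_LINKS = [
  { rel: 'canonical',  href: 'http://example.com/canonical-from-head' },
  { rel: 'stylesheet', href: '/stylesheets/screen.css' },
  { rel: 'stylesheet', href: '//example2.com/stylesheets/screen.css' }
].freeze

# Resolve every :href against base_url, leaving other attributes untouched.
# URI.join is an assumption here; the gem uses its own URL.absolutify.
def absolutified_head_links(raw_links, base_url)
  raw_links.map do |attrs|
    attrs.map { |key, val| [key, key == :href ? URI.join(base_url, val).to_s : val] }.to_h
  end
end

# Keep only the links whose rel attribute equals the given value.
def links_with_rel(links, rel)
  links.select { |link| link[:rel] == rel }
end

links = absolutified_head_links(RAW_HEAD_LINKS, 'https://example.com/head_links')
stylesheets = links_with_rel(links, 'stylesheet')
canonicals  = links_with_rel(links, 'canonical')
```

With an `https` base URL, the protocol-relative stylesheet picks up the `https` scheme, which mirrors what the `page_https` expectation in the spec checks for.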
@@ -32,28 +32,37 @@ module MetaInspector
32
32
  URL.absolutify(suggested_img, base_url) if suggested_img
33
33
  end
34
34
 
35
- # Returns the largest image from the image collection,
36
- # filtered for images that are more square than 10:1 or 1:10
37
- def largest()
38
- @larget_image ||= begin
35
+ # Returns an array of [img_url, width, height] sorted by image area (width * height)
36
+ def with_size
37
+ @with_size ||= begin
39
38
  img_nodes = parsed.search('//img').select{ |img_node| img_node['src'] }
40
- sizes = img_nodes.map { |img_node| [URL.absolutify(img_node['src'], base_url), img_node['width'], img_node['height']] }
41
- sizes.uniq! { |url, width, height| url }
39
+ imgs_with_size = img_nodes.map { |img_node| [URL.absolutify(img_node['src'], base_url), img_node['width'], img_node['height']] }
40
+ imgs_with_size.uniq! { |url, width, height| url }
42
41
  if @download_images
43
- sizes.map! do |url, width, height|
42
+ imgs_with_size.map! do |url, width, height|
44
43
  width, height = FastImage.size(url) if width.nil? || height.nil?
45
- [url, width, height]
44
+ [url, width.to_i, height.to_i]
46
45
  end
47
46
  else
48
- sizes.map! do |url, width, height|
47
+ imgs_with_size.map! do |url, width, height|
49
48
  width, height = [0, 0] if width.nil? || height.nil?
50
- [url, width, height]
49
+ [url, width.to_i, height.to_i]
51
50
  end
52
51
  end
53
- sizes.map! { |url, width, height| [url, width.to_i * height.to_i, width.to_f / height.to_f] }
54
- sizes.keep_if { |url, area, ratio| ratio > 0.1 && ratio < 10 }
55
- sizes.sort_by! { |url, area, ratio| -area }
56
- url, area, ratio = sizes.first
52
+ imgs_with_size.sort_by { |url, width, height| -(width.to_i * height.to_i) }
53
+ end
54
+ end
55
+
56
+ # Returns the largest image from the image collection,
57
+ # filtered for images that are more square than 10:1 or 1:10
58
+ def largest
59
+ @largest_image ||= begin
60
+ imgs_with_size = with_size.dup
61
+ imgs_with_size.keep_if do |url, width, height|
62
+ ratio = width.to_f / height.to_f
63
+ ratio > 0.1 && ratio < 10
64
+ end
65
+ url, width, height = imgs_with_size.first
57
66
  url
58
67
  end
59
68
  end
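The refactoring above splits the old `largest` into a sorting step (`with_size`) and a filtering step. The sketch below illustrates that split on plain data, assuming the sizes are already known integers, so it skips the HTML parsing and the FastImage download. `IMAGES`, `sorted_by_area` and `largest_url` are hypothetical names chosen for this example.

```ruby
# Hypothetical [url, width, height] triples, shaped like with_size output.
IMAGES = [
  ['http://example.com/smaller',    10,  10],
  ['http://example.com/too_wide',   100, 10],
  ['http://example.com/largest',    100, 100],
  ['http://example.com/too_narrow', 10,  100]
].freeze

# Sort by descending area (width * height), as with_size does.
def sorted_by_area(images)
  images.sort_by { |_url, width, height| -(width * height) }
end

# Take the first image whose width/height ratio lies strictly
# between 1:10 and 10:1, as largest does after sorting.
def largest_url(images)
  candidate = sorted_by_area(images).find do |_url, width, height|
    ratio = width.to_f / height
    ratio > 0.1 && ratio < 10
  end
  candidate && candidate[0]
end
```

Because both bounds are strict, an image of exactly 100x10 or 10x100 is rejected, which is consistent with the `too_wide` and `too_narrow` fixture names in the spec.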
@@ -14,8 +14,7 @@ module MetaInspector
14
14
 
15
15
  # Returns all links found, unrelativized and absolutified
16
16
  def all
17
- @all ||= raw.map { |link| URL.absolutify(URL.unrelativize(link, scheme), base_url) }
18
- .compact.uniq
17
+ @all ||= raw.map { |link| URL.absolutify(link, base_url) }.compact.uniq
19
18
  end
20
19
 
21
20
  # Returns all HTTP links found
@@ -44,11 +43,6 @@ module MetaInspector
44
43
  'non_http' => non_http }
45
44
  end
46
45
 
47
- # Returns the parsed document meta rss link
48
- def feed
49
- @feed ||= (parsed_feed('rss') || parsed_feed('atom'))
50
- end
51
-
52
46
  # Returns the base url to absolutify relative links.
53
47
  # This can be the one set on a <base> tag,
54
48
  # or the url of the document if no <base> tag was found.
@@ -56,13 +50,6 @@ module MetaInspector
56
50
  base_href || url
57
51
  end
58
52
 
59
- private
60
-
61
- def parsed_feed(format)
62
- feed = parsed.search("//link[@type='application/#{format}+xml']").first
63
- feed ? URL.absolutify(feed.attributes['href'].value, base_url) : nil
64
- end
65
-
66
53
  # Returns the value of the href attribute on the <base /> tag, if exists
67
54
  def base_href
68
55
  parsed.search('base').first.attributes['href'].value rescue nil
@@ -27,21 +27,35 @@ module MetaInspector
27
27
  "#{scheme}://#{host}/"
28
28
  end
29
29
 
30
+ WELL_KNOWN_TRACKING_PARAMS = %w( utm_source utm_medium utm_term utm_content utm_campaign )
31
+
32
+ def tracked?
33
+ u = parsed(url)
34
+ found_tracking_params = WELL_KNOWN_TRACKING_PARAMS & u.query_values.keys
35
+ return found_tracking_params.any?
36
+ end
37
+
38
+ def untracked_url
39
+ u = parsed(url)
40
+ u.query_values = u.query_values.delete_if { |key, _| WELL_KNOWN_TRACKING_PARAMS.include? key }
41
+ u.to_s
42
+ end
43
+
44
+ def untrack!
45
+ self.url = untracked_url
46
+ end
47
+
30
48
  def url=(new_url)
31
49
  url = with_default_scheme(new_url)
32
50
  @url = @normalize ? normalized(url) : url
33
51
  end
34
52
 
35
- # Converts a protocol-relative url to its full form,
36
- # depending on the scheme of the page that contains it
37
- def self.unrelativize(url, scheme)
38
- url =~ /^\/\// ? "#{scheme}://#{url[2..-1]}" : url
39
- end
40
-
41
53
  # Converts a relative URL to an absolute URL, like:
42
54
  # "/faq" => "http://example.com/faq"
43
55
  # Respecting already absolute URLs like the ones starting with
44
56
  # http:, ftp:, telnet:, mailto:, javascript: ...
57
+ # Protocol-relative URLs are also resolved to use the same
58
+ # scheme as the base_url
45
59
  def self.absolutify(url, base_url)
46
60
  if url =~ /^\w*\:/i
47
61
  MetaInspector::URL.new(url).url
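The tracking helpers in this hunk boil down to comparing query-string keys against a fixed list. Below is a minimal sketch of the same idea using only the standard library's `URI`, not the gem's `parsed(url)` helper or its `query_values` accessor; `TRACKING_PARAMS`, `query_pairs`, `tracked?` and `untracked_url` here are free-standing, hypothetical functions written for this example.

```ruby
require 'uri'

# Same parameter names as WELL_KNOWN_TRACKING_PARAMS in the diff.
TRACKING_PARAMS = %w[utm_source utm_medium utm_term utm_content utm_campaign].freeze

# Returns the query string as [key, value] pairs; empty when there is no query.
def query_pairs(url)
  URI.decode_www_form(URI.parse(url).query || '')
end

# True if any query key is a known tracking parameter.
def tracked?(url)
  query_pairs(url).any? { |key, _| TRACKING_PARAMS.include?(key) }
end

# Returns the URL with known tracking parameters removed.
# The query is dropped entirely when nothing is left, to avoid a trailing "?".
def untracked_url(url)
  uri = URI.parse(url)
  kept = query_pairs(url).reject { |key, _| TRACKING_PARAMS.include?(key) }
  uri.query = kept.empty? ? nil : URI.encode_www_form(kept)
  uri.to_s
end
```

This sketch also covers URLs that have no query string at all. Whether the gem's version handles that case depends on what `query_values` returns for such URLs, which this diff does not show.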
@@ -1,3 +1,3 @@
1
1
  module MetaInspector
2
- VERSION = "4.4.2"
2
+ VERSION = '4.5.0'
3
3
  end
@@ -0,0 +1,34 @@
1
+ HTTP/1.1 200 OK
2
+ Server: nginx/0.7.67
3
+ Date: Fri, 18 Nov 2011 21:46:46 GMT
4
+ Content-Type: text/html
5
+ Connection: keep-alive
6
+ Last-Modified: Mon, 14 Nov 2011 16:53:18 GMT
7
+ Content-Length: 4987
8
+ X-Varnish: 2000423390
9
+ Age: 0
10
+ Via: 1.1 varnish
11
+
12
+ <html>
13
+ <head>
14
+ <title>An example page</title>
15
+ <link
16
+ rel="canonical"
17
+ href="http://example.com/canonical-from-head"
18
+ />
19
+ <link rel="stylesheet" href="/stylesheets/screen.css">
20
+ <link rel="stylesheet" href="//example2.com/stylesheets/screen.css">
21
+ <link rel="shortcut icon" href="/favicon.ico" type="image/x-icon" />
22
+ <link rel="shorturl" href="http://gu.com/p/32v5a" />
23
+ <link
24
+ rel="stylesheet"
25
+ type="text/css"
26
+ href="http://foo/print.css"
27
+ media="print"
28
+ class="contrast"
29
+ />
30
+ </head>
31
+ <body>
32
+ <h1>Hello World</h1>
33
+ </body>
34
+ </html>
@@ -12,6 +12,8 @@ Accept-Ranges: bytes
12
12
  <head>
13
13
  <meta charset="utf-8" />
14
14
  <title>Protocol-relative URLs</title>
15
+ <meta property="og:image" content="//static-secure.guim.co.uk/sys-images/Guardian/Pix/pictures/2011/8/8/1312810126887/gu_192x115.jpg"/>
16
+ <link rel="shortcut icon" href="//static-secure.guim.co.uk/sys-images/favicon.ico" type="image/x-icon" />
15
17
  </head>
16
18
  <body>
17
19
  <p>Internal links</p>
@@ -22,5 +24,8 @@ Accept-Ranges: bytes
22
24
  <p>External links</p>
23
25
  <a href="http://google.com">External: normal link</a>
24
26
  <a href="//yahoo.com">External: protocol-relative link</a>
27
+
28
+ <p>Images</p>
29
+ <img src="//example.com/image.jpg" />
25
30
  </body>
26
31
  </html>
@@ -0,0 +1,42 @@
1
+ require 'spec_helper'
2
+
3
+ describe MetaInspector do
4
+
5
+ describe "head_links" do
6
+ let(:page) { MetaInspector.new('http://example.com/head_links') }
7
+ let(:page_https) { MetaInspector.new('https://example.com/head_links') }
8
+
9
+ it "#head_links" do
10
+ expect(page.head_links).to eq([
11
+ {rel: 'canonical', href: 'http://example.com/canonical-from-head'},
12
+ {rel: 'stylesheet', href: 'http://example.com/stylesheets/screen.css'},
13
+ {rel: 'stylesheet', href: 'http://example2.com/stylesheets/screen.css'},
14
+ {rel: 'shortcut icon', href: 'http://example.com/favicon.ico', type: 'image/x-icon'},
15
+ {rel: 'shorturl', href: 'http://gu.com/p/32v5a'},
16
+ {rel: 'stylesheet', type: 'text/css', href: 'http://foo/print.css', media: 'print', class: 'contrast'}
17
+ ])
18
+ end
19
+
20
+ it "#stylesheets" do
21
+ expect(page.stylesheets).to eq([
22
+ {rel: 'stylesheet', href: 'http://example.com/stylesheets/screen.css'},
23
+ {rel: 'stylesheet', href: 'http://example2.com/stylesheets/screen.css'},
24
+ {rel: 'stylesheet', type: 'text/css', href: 'http://foo/print.css', media: 'print', class: 'contrast'}
25
+ ])
26
+
27
+ expect(page_https.stylesheets).to eq([
28
+ {rel: 'stylesheet', href: 'https://example.com/stylesheets/screen.css'},
29
+ {rel: 'stylesheet', href: 'https://example2.com/stylesheets/screen.css'},
30
+ {rel: 'stylesheet', type: 'text/css', href: 'http://foo/print.css', media: 'print', class: 'contrast'}
31
+ ])
32
+ end
33
+
34
+ it "#canonical" do
35
+ expect(page.canonicals).to eq([
36
+ {rel: 'canonical', href: 'http://example.com/canonical-from-head'}
37
+ ])
38
+ end
39
+
40
+ end
41
+
42
+ end
@@ -123,6 +123,44 @@ describe MetaInspector do
123
123
  end
124
124
  end
125
125
 
126
+ describe "images.with_size" do
127
+ it "should return sorted by area array of [img_url, width, height] using html sizes" do
128
+ page = MetaInspector.new('http://example.com/largest_image_in_html')
129
+
130
+ expect(page.images.with_size).to eq([
131
+ ["http://example.com/largest", 100, 100],
132
+ ["http://example.com/too_narrow", 10, 100],
133
+ ["http://example.com/too_wide", 100, 10],
134
+ ["http://example.com/smaller", 10, 10],
135
+ ["http://example.com/smallest", 1, 1]
136
+ ])
137
+ end
138
+
139
+ it "should return sorted by area array of [img_url, width, height] using actual image sizes" do
140
+ page = MetaInspector.new('http://example.com/largest_image_using_image_size')
141
+
142
+ expect(page.images.with_size).to eq([
143
+ ["http://example.com/100x100", 100, 100],
144
+ ["http://example.com/10x100", 10, 100],
145
+ ["http://example.com/100x10", 100, 10],
146
+ ["http://example.com/10x10", 10, 10],
147
+ ["http://example.com/1x1", 1, 1]
148
+ ])
149
+ end
150
+
151
+ it "should return sorted by area array of [img_url, width, height] without downloading images" do
152
+ page = MetaInspector.new('http://example.com/largest_image_using_image_size', download_images: false)
153
+
154
+ expect(page.images.with_size).to eq([
155
+ ["http://example.com/10x100", 10, 100],
156
+ ["http://example.com/100x10", 100, 10],
157
+ ["http://example.com/1x1", 1, 1],
158
+ ["http://example.com/10x10", 0, 0],
159
+ ["http://example.com/100x100", 0, 0]
160
+ ])
161
+ end
162
+ end
163
+
126
164
  describe "images.largest" do
127
165
  it "should find the largest image on the page using html sizes" do
128
166
  page = MetaInspector.new('http://example.com/largest_image_in_html')
@@ -174,4 +212,26 @@ describe MetaInspector do
174
212
  expect(page.images.favicon).to eq(nil)
175
213
  end
176
214
  end
215
+
216
+ describe 'protocol-relative' do
217
+ before(:each) do
218
+ @m_http = MetaInspector.new('http://protocol-relative.com')
219
+ @m_https = MetaInspector.new('https://protocol-relative.com')
220
+ end
221
+
222
+ it 'should unrelativize images' do
223
+ expect(@m_http.images.to_a).to eq(['http://example.com/image.jpg'])
224
+ expect(@m_https.images.to_a).to eq(['https://example.com/image.jpg'])
225
+ end
226
+
227
+ it 'should unrelativize owner suggested image' do
228
+ expect(@m_http.images.owner_suggested).to eq('http://static-secure.guim.co.uk/sys-images/Guardian/Pix/pictures/2011/8/8/1312810126887/gu_192x115.jpg')
229
+ expect(@m_https.images.owner_suggested).to eq('https://static-secure.guim.co.uk/sys-images/Guardian/Pix/pictures/2011/8/8/1312810126887/gu_192x115.jpg')
230
+ end
231
+
232
+ it 'should unrelativize favicon' do
233
+ expect(@m_http.images.favicon).to eq('http://static-secure.guim.co.uk/sys-images/favicon.ico')
234
+ expect(@m_https.images.favicon).to eq('https://static-secure.guim.co.uk/sys-images/favicon.ico')
235
+ end
236
+ end
177
237
  end
data/spec/spec_helper.rb CHANGED
@@ -41,6 +41,10 @@ FakeWeb.register_uri(:get, "http://example.com/10x10", :response => fixture_file
41
41
  FakeWeb.register_uri(:get, "http://example.com/100x100", :response => fixture_file("100x100.jpg.response"))
42
42
  FakeWeb.register_uri(:get, "http://www.24-horas.mx/mexico-firma-acuerdo-bilateral-automotriz-con-argentina/", :response => fixture_file("relative_og_image.response"))
43
43
 
44
+ # Used to test canonical URLs in head
45
+ FakeWeb.register_uri(:get, "http://example.com/head_links", :response => fixture_file("head_links.response"))
46
+ FakeWeb.register_uri(:get, "https://example.com/head_links", :response => fixture_file("head_links.response"))
47
+
44
48
  # Used to test best_title logic
45
49
  FakeWeb.register_uri(:get, "http://example.com/title_in_head", :response => fixture_file("title_in_head.response"))
46
50
  FakeWeb.register_uri(:get, "http://example.com/title_in_body", :response => fixture_file("title_in_body.response"))
data/spec/url_spec.rb CHANGED
@@ -36,6 +36,47 @@ describe MetaInspector::URL do
36
36
  expect(MetaInspector::URL.new('http://example.com/faqs').root_url).to eq('http://example.com/')
37
37
  end
38
38
 
39
+ it "should return an untracked url" do
40
+ expect(MetaInspector::URL.new('http://example.com/foo?not_utm_thing=bar&utm_source=1234').untracked_url).to eq('http://example.com/foo?not_utm_thing=bar')
41
+ expect(MetaInspector::URL.new('http://example.com/foo?not_utm_thing=bar&utm_medium=1234').untracked_url).to eq('http://example.com/foo?not_utm_thing=bar')
42
+ expect(MetaInspector::URL.new('http://example.com/foo?not_utm_thing=bar&utm_term=1234').untracked_url).to eq('http://example.com/foo?not_utm_thing=bar')
43
+ expect(MetaInspector::URL.new('http://example.com/foo?not_utm_thing=bar&utm_content=1234').untracked_url).to eq('http://example.com/foo?not_utm_thing=bar')
44
+ expect(MetaInspector::URL.new('http://example.com/foo?not_utm_thing=bar&utm_campaign=1234').untracked_url).to eq('http://example.com/foo?not_utm_thing=bar')
45
+ expect(MetaInspector::URL.new('http://example.com/foo?not_utm_thing=bar&utm_source=1234&utm_medium=5678&utm_term=4321&utm_content=9876&utm_campaign=5436').untracked_url).to eq('http://example.com/foo?not_utm_thing=bar')
46
+ end
47
+
48
+ it "should remove tracking parameters from url" do
49
+
50
+ tracked_urls = ['http://example.com/foo?not_utm_thing=bar&utm_source=1234',
51
+ 'http://example.com/foo?not_utm_thing=bar&utm_medium=1234',
52
+ 'http://example.com/foo?not_utm_thing=bar&utm_term=1234',
53
+ 'http://example.com/foo?not_utm_thing=bar&utm_content=1234',
54
+ 'http://example.com/foo?not_utm_thing=bar&utm_campaign=1234',
55
+ 'http://example.com/foo?not_utm_thing=bar&utm_source=1234&utm_medium=5678&utm_term=4321&utm_content=9876&utm_campaign=5436'
56
+ ]
57
+
58
+ tracked_urls.each do |tracked_url|
59
+ url = MetaInspector::URL.new(tracked_url)
60
+ url.untrack!
61
+ expect(url.url).to eq('http://example.com/foo?not_utm_thing=bar')
62
+ end
63
+ end
64
+
65
+ it "should say if the url is tracked" do
66
+ expect(MetaInspector::URL.new('http://example.com/foo?not_utm_thing=bar&utm_source=1234').tracked?).to be true
67
+ expect(MetaInspector::URL.new('http://example.com/foo?not_utm_thing=bar&utm_medium=1234').tracked?).to be true
68
+ expect(MetaInspector::URL.new('http://example.com/foo?not_utm_thing=bar&utm_term=1234').tracked?).to be true
69
+ expect(MetaInspector::URL.new('http://example.com/foo?not_utm_thing=bar&utm_content=1234').tracked?).to be true
70
+ expect(MetaInspector::URL.new('http://example.com/foo?not_utm_thing=bar&utm_campaign=1234').tracked?).to be true
71
+ expect(MetaInspector::URL.new('http://example.com/foo?not_utm_thing=bar&utm_source=1234&utm_medium=5678&utm_term=4321&utm_content=9876&utm_campaign=5436').tracked?).to be true
72
+
73
+ expect(MetaInspector::URL.new('http://example.com/foo?not_utm_thing=bar&not_utm_source=1234').tracked?).to be false
74
+ expect(MetaInspector::URL.new('http://example.com/foo?not_utm_thing=bar&not_utm_medium=1234').tracked?).to be false
75
+ expect(MetaInspector::URL.new('http://example.com/foo?not_utm_thing=bar&not_utm_term=1234').tracked?).to be false
76
+ expect(MetaInspector::URL.new('http://example.com/foo?not_utm_thing=bar&not_utm_content=1234').tracked?).to be false
77
+ expect(MetaInspector::URL.new('http://example.com/foo?not_utm_thing=bar&not_utm_campaign=1234').tracked?).to be false
78
+ end
79
+
39
80
  describe "url=" do
40
81
  it "should update the url" do
41
82
  url = MetaInspector::URL.new('http://first.com/')
metadata CHANGED
@@ -1,14 +1,14 @@
1
1
  --- !ruby/object:Gem::Specification
2
2
  name: metainspector
3
3
  version: !ruby/object:Gem::Version
4
- version: 4.4.2
4
+ version: 4.5.0
5
5
  platform: ruby
6
6
  authors:
7
7
  - Jaime Iniesta
8
8
  autorequire:
9
9
  bindir: bin
10
10
  cert_chain: []
11
- date: 2015-04-30 00:00:00.000000000 Z
11
+ date: 2015-05-29 00:00:00.000000000 Z
12
12
  dependencies:
13
13
  - !ruby/object:Gem::Dependency
14
14
  name: nokogiri
@@ -232,6 +232,7 @@ files:
232
232
  - ".rspec.example"
233
233
  - ".rubocop.yml.example"
234
234
  - ".travis.yml"
235
+ - CHANGELOG.md
235
236
  - Gemfile
236
237
  - Guardfile
237
238
  - MIT-LICENSE
@@ -247,6 +248,7 @@ files:
247
248
  - lib/meta_inspector/exceptionable.rb
248
249
  - lib/meta_inspector/parser.rb
249
250
  - lib/meta_inspector/parsers/base.rb
251
+ - lib/meta_inspector/parsers/head_links.rb
250
252
  - lib/meta_inspector/parsers/images.rb
251
253
  - lib/meta_inspector/parsers/links.rb
252
254
  - lib/meta_inspector/parsers/meta_tags.rb
@@ -270,6 +272,7 @@ files:
270
272
  - spec/fixtures/example.response
271
273
  - spec/fixtures/facebook.com.response
272
274
  - spec/fixtures/guardian.co.uk.response
275
+ - spec/fixtures/head_links.response
273
276
  - spec/fixtures/https.facebook.com.response
274
277
  - spec/fixtures/international.response
275
278
  - spec/fixtures/invalid_href.response
@@ -305,6 +308,7 @@ files:
305
308
  - spec/fixtures/wordpress_site.response
306
309
  - spec/fixtures/youtube.response
307
310
  - spec/fixtures/youtube_short_title.response
311
+ - spec/meta_inspector/head_links_spec.rb
308
312
  - spec/meta_inspector/images_spec.rb
309
313
  - spec/meta_inspector/links_spec.rb
310
314
  - spec/meta_inspector/meta_inspector_spec.rb