metainspector 4.4.2 → 4.5.0

checksums.yaml CHANGED
@@ -1,7 +1,7 @@
  ---
  SHA1:
-   metadata.gz: 55d69ca7b0d4a656349f395a906afe70ca816f5d
-   data.tar.gz: 3e16c396163b6171ecc136f906f726de8ab59cd4
+   metadata.gz: d8b2f4cf8526bd14a55d879334ff9bf14c95180f
+   data.tar.gz: 0d39ceedb495d19a761fd7f6bcfdae767d1e1c26
  SHA512:
-   metadata.gz: a55b1cb2c32dcd1f8a020b28b0de9b4ce017ad20571ca1f5d1b35a26e0611e1a6faecc680dcb76fd8bad102841781c23aabd8298647a4dfafb3b8c54ea0369ac
-   data.tar.gz: b9272f589ff465a43736dcb7c02c85c376312d2cd7d46427588d966ac396fb820bce392db8f3708b6304b87fd5584004ca44457cd3b61c7afa8349e94649adaf
+   metadata.gz: b9b8a345bb8f935bfe5a5fb74d4e86a92893c8d44066f87cdffbe029fc5746841c290c366fd94fc2a84edb73edbbf43c491189f6b82e754f4bc0c494eaed6591
+   data.tar.gz: f559c11756c34406d5083a58c8cd80fdf5665fd79f4a204fdd22eee9eed6bbb65553c3a0b47f16962623ef4c4903adb74b6ac503893b0cb6ea66ee84513e85d4
data/CHANGELOG.md ADDED
@@ -0,0 +1,66 @@
+ # MetaInspector Changelog
+
+ ## Changes in 4.5
+
+ * The Document API now includes access to head/link elements
+   * `page.head_links` returns an array of hashes of all head/links.
+   * `page.stylesheets` returns head/links where rel='stylesheet'
+   * `page.canonicals` returns head/links where rel='canonical'
+
+ * The URL API can remove common tracking parameters from the querystring
+   * `url.tracked?` will tell you if the url contains known tracking parameters
+   * `url.untracked_url` will return the url with known tracking parameters removed
+   * `url.untrack!` will remove the tracking parameters from the url
+
+ * The images API has been extended:
+   * `page.images.with_size` returns a sorted array (by descending area) of [image_url, width, height]
+
+ ## Changes in 4.4
+
+ The default headers now include `'Accept-Encoding' => 'identity'` to minimize trouble with servers that respond with malformed compressed responses, [as explained here](https://github.com/lostisland/faraday/issues/337).
+
+ ## Changes in 4.3
+
+ * The Document API has been extended with one new method `page.best_title` that returns the longest text available from a selection of candidates.
+ * `to_hash` now includes `scheme`, `host`, `root_url`, `best_title` and `description`.
+
+ ## Changes in 4.2
+
+ * The images API has been extended, with two new methods:
+
+   * `page.images.owner_suggested` returns the OG or Twitter image, or `nil` if neither are present.
+   * `page.images.largest` returns the largest image found in the page. This uses the HTML height and width attributes as well as the [fastimage](https://github.com/sdsykes/fastimage) gem to return the largest image on the page that has a ratio squarer than 1:10 or 10:1. This usually provides a good alternative to the OG or Twitter images if they are not supplied.
+
+ * The criteria for `page.images.best` has changed slightly, we'll now return the largest image instead of the first image if no owner-suggested image is found.
+
+ ## Changes in 4.1
+
+ * Introduces the `:normalize_url` option, which allows you to disable URL normalization.
+
+ ## Changes in 4.0
+
+ * The links API has been changed, now instead of `page.links`, `page.internal_links` and `page.external_links` we have:
+
+ ```ruby
+ page.links.raw # Returns all links found, unprocessed
+ page.links.all # Returns all links found, unrelativized and absolutified
+ page.links.http # Returns all HTTP links found
+ page.links.non_http # Returns all non-HTTP links found
+ page.links.internal # Returns all internal HTTP links found
+ page.links.external # Returns all external HTTP links found
+ ```
+
+ * The images API has been changed, now instead of `page.image` we have `page.images.best`, and instead of `page.favicon` we have `page.images.favicon`.
+
+ * Now `page.image` will return the first image in `page.images` if no OG or Twitter image found, instead of returning `nil`.
+
+ * You can now specify 2 different timeouts, `connection_timeout` and `read_timeout`, instead of the previous single `timeout`.
+
+ ## Changes in 3.0
+
+ * The redirect API has been changed, now the `:allow_redirections` option will expect only a boolean, which by default is `true`. That is, no more specifying `:safe`, `:unsafe` or `:all`.
+ * We've dropped support for Ruby < 2.
+
+ Also, we've introduced a new feature:
+
+ * Persist cookies across redirects. Now MetaInspector will include the received cookies when following redirects. This fixes some cases where a redirect would fail, sometimes caught in a redirection loop.
data/README.md CHANGED
@@ -8,56 +8,6 @@ You give it an URL, and it lets you easily get its title, links, images, charset

  You can try MetaInspector live at this little demo: [https://metainspectordemo.herokuapp.com](https://metainspectordemo.herokuapp.com)

- ## Changes in 4.4
-
- The default headers now include `'Accept-Encoding' => 'identity'` to minimize trouble with servers that respond with malformed compressed responses, [as explained here](https://github.com/lostisland/faraday/issues/337).
-
- ## Changes in 4.3
-
- * The Document API has been extended with one new method `page.best_title` that returns the longest text available from a selection of candidates.
- * `to_hash` now includes `scheme`, `host`, `root_url`, `best_title` and `description`.
-
- ## Changes in 4.2
-
- * The images API has been extended, with two new methods:
-
-   * `page.images.owner_suggested` returns the OG or Twitter image, or `nil` if neither are present.
-   * `page.images.largest` returns the largest image found in the page. This uses the HTML height and width attributes as well as the [fastimage](https://github.com/sdsykes/fastimage) gem to return the largest image on the page that has a ratio squarer than 1:10 or 10:1. This usually provides a good alternative to the OG or Twitter images if they are not supplied.
-
- * The criteria for `page.images.best` has changed slightly, we'll now return the largest image instead of the first image if no owner-suggested image is found.
-
- ## Changes in 4.1
-
- * Introduces the `:normalize_url` option, which allows you to disable URL normalization.
-
- ## Changes in 4.0
-
- * The links API has been changed, now instead of `page.links`, `page.internal_links` and `page.external_links` we have:
-
- ```ruby
- page.links.raw # Returns all links found, unprocessed
- page.links.all # Returns all links found, unrelativized and absolutified
- page.links.http # Returns all HTTP links found
- page.links.non_http # Returns all non-HTTP links found
- page.links.internal # Returns all internal HTTP links found
- page.links.external # Returns all external HTTP links found
- ```
-
- * The images API has been changed, now instead of `page.image` we have `page.images.best`, and instead of `page.favicon` we have `page.images.favicon`.
-
- * Now `page.image` will return the first image in `page.images` if no OG or Twitter image found, instead of returning `nil`.
-
- * You can now specify 2 different timeouts, `connection_timeout` and `read_timeout`, instead of the previous single `timeout`.
-
- ## Changes in 3.0
-
- * The redirect API has been changed, now the `:allow_redirections` option will expect only a boolean, which by default is `true`. That is, no more specifying `:safe`, `:unsafe` or `:all`.
- * We've dropped support for Ruby < 2.
-
- Also, we've introduced a new feature:
-
- * Persist cookies across redirects. Now MetaInspector will include the received cookies when following redirects. This fixes some cases where a redirect would fail, sometimes caught in a redirection loop.
-
  ## Installation

  Install the gem from RubyGems:
@@ -91,47 +41,72 @@ page = MetaInspector.new('sitevalidator.com')
  You can also include the html which will be used as the document to scrape:

  ```ruby
- page = MetaInspector.new("http://sitevalidator.com", :document => "<html><head><title>Hello From Passed Html</title><a href='/hello'>Hello link</a></head><body></body></html>")
+ page = MetaInspector.new("http://sitevalidator.com",
+                          :document => "<html>...</html>")
  ```

- ## Accessing response status and headers
+ ## Accessing response

  You can check the status and headers from the response like this:

  ```ruby
  page.response.status # 200
- page.response.headers # { "server"=>"nginx", "content-type"=>"text/html; charset=utf-8", "cache-control"=>"must-revalidate, private, max-age=0", ... }
+ page.response.headers # { "server"=>"nginx", "content-type"=>"text/html; charset=utf-8",
+                       #   "cache-control"=>"must-revalidate, private, max-age=0", ... }
  ```

  ## Accessing scraped data

- You can see the scraped data like this:
+ ### URL

  ```ruby
  page.url # URL of the page
+ page.tracked? # returns true if the url contains known tracking parameters
+ page.untracked_url # returns the url with the known tracking parameters removed
+ page.untrack! # removes the known tracking parameters from the url
  page.scheme # Scheme of the page (http, https)
  page.host # Hostname of the page (like, sitevalidator.com, without the scheme)
  page.root_url # Root url (scheme + host, like http://sitevalidator.com/)
+ ```
+
+ ### Head links
+
+ ```ruby
+ page.head_links # an array of hashes of all head/links
+ page.stylesheets # an array of hashes of all head/links where rel='stylesheet'
+ page.canonicals # an array of hashes of all head/links where rel='canonical'
+ page.feed # Get rss or atom links in meta data fields as array
+ ```
+
+ ### Texts
+
+ ```ruby
  page.title # title of the page from the head section, as string
  page.best_title # best title of the page, from a selection of candidates
+ page.description # returns the meta description, or the first long paragraph if no meta description is found
+ ```
+
+ ### Links
+
+ ```ruby
  page.links.raw # every link found, unprocessed
  page.links.all # every link found on the page as an absolute URL
  page.links.http # every HTTP link found
  page.links.non_http # every non-HTTP link found
  page.links.internal # every internal link found on the page as an absolute URL
  page.links.external # every external link found on the page as an absolute URL
- page.meta['keywords'] # meta keywords, as string
- page.meta['description'] # meta description, as string
- page.description # returns the meta description, or the first long paragraph if no meta description is found
+ ```
+
+ ### Images
+
+ ```ruby
  page.images # enumerable collection, with every img found on the page as an absolute URL
+ page.images.with_size # a sorted array (by descending area) of [image_url, width, height]
  page.images.best # Most relevant image, if defined with the og:image or twitter:image metatags. Fallback to the first page.images array element
  page.images.favicon # absolute URL to the favicon
- page.feed # Get rss or atom links in meta data fields as array
- page.charset # UTF-8
- page.content_type # content-type returned by the server when the url was requested
  ```

- ## Meta tags
+ ### Meta tags

  When it comes to meta tags, you have several options:

@@ -243,6 +218,13 @@ page.meta['author'] # Returns "Joe Sample"

  Please be aware that all keys are converted to downcase, so it's `'dc.date.issued'` and not `'DC.date.issued'`.

+ ### Misc
+
+ ```ruby
+ page.charset # UTF-8
+ page.content_type # content-type returned by the server when the url was requested
+ ```
+
  ## Other representations

  You can also access most of the scraped data as a hash:
@@ -422,7 +404,6 @@ You're more than welcome to fork this project and send pull requests. Just remem
  * Create a topic branch for your changes.
  * Add specs.
  * Keep your fake responses as small as possible. For each change in `spec/fixtures`, a comment should be included explaining why it's needed.
- * Update `version.rb`, following the [semantic versioning convention](http://semver.org/).
  * Update `README.md` if needed (for example, when you're adding or changing a feature).

  Thanks to all the contributors:
@@ -7,6 +7,7 @@ require File.expand_path(File.join(File.dirname(__FILE__), 'meta_inspector/parse
  require File.expand_path(File.join(File.dirname(__FILE__), 'meta_inspector/parsers/base'))
  require File.expand_path(File.join(File.dirname(__FILE__), 'meta_inspector/parsers/images'))
  require File.expand_path(File.join(File.dirname(__FILE__), 'meta_inspector/parsers/links'))
+ require File.expand_path(File.join(File.dirname(__FILE__), 'meta_inspector/parsers/head_links'))
  require File.expand_path(File.join(File.dirname(__FILE__), 'meta_inspector/parsers/meta_tags'))
  require File.expand_path(File.join(File.dirname(__FILE__), 'meta_inspector/parsers/texts'))
  require File.expand_path(File.join(File.dirname(__FILE__), 'meta_inspector/document'))
@@ -44,14 +44,16 @@ module MetaInspector
      end

      extend Forwardable
-     delegate [:url, :scheme, :host, :root_url] => :@url
+     delegate [:url, :scheme, :host, :root_url,
+               :tracked?, :untracked_url, :untrack!] => :@url

      delegate [:content_type, :response] => :@request

      delegate [:parsed, :title, :best_title,
                :description, :links,
                :images, :feed, :charset, :meta_tags,
-               :meta_tag, :meta, :favicon] => :@parser
+               :meta_tag, :meta, :favicon,
+               :head_links, :stylesheets, :canonicals] => :@parser

      # Returns all document data as a nested Hash
      def to_hash
@@ -13,6 +13,7 @@ module MetaInspector
      def initialize(document, options = {})
        @document = document
        @exception_log = options[:exception_log]
+       @head_links_parser = MetaInspector::Parsers::HeadLinksParser.new(self)
        @meta_tag_parser = MetaInspector::Parsers::MetaTagsParser.new(self)
        @links_parser = MetaInspector::Parsers::LinksParser.new(self)
        @download_images = options[:download_images]
@@ -21,11 +22,12 @@ module MetaInspector
      end

      extend Forwardable
-     delegate [:url, :scheme, :host]                   => :@document
-     delegate [:meta_tags, :meta_tag, :meta, :charset] => :@meta_tag_parser
-     delegate [:links, :feed, :base_url]               => :@links_parser
-     delegate :images                                  => :@images_parser
-     delegate [:title, :best_title, :description]      => :@texts_parser
+     delegate [:url, :scheme, :host]                          => :@document
+     delegate [:meta_tags, :meta_tag, :meta, :charset]        => :@meta_tag_parser
+     delegate [:head_links, :stylesheets, :canonicals, :feed] => :@head_links_parser
+     delegate [:links, :base_url]                             => :@links_parser
+     delegate :images                                         => :@images_parser
+     delegate [:title, :best_title, :description]             => :@texts_parser

      # Returns the whole parsed document
      def parsed
@@ -0,0 +1,40 @@
+ module MetaInspector
+   module Parsers
+     class HeadLinksParser < Base
+       delegate [:parsed, :base_url] => :@main_parser
+
+       def head_links
+         @head_links ||= parsed.css('head link').map do |tag|
+           Hash[
+             tag.attributes.keys.map do |key|
+               keysym = key.to_sym
+               val = tag.attributes[key].value
+               val = URL.absolutify(val, base_url) if keysym == :href
+               [keysym, val]
+             end
+           ]
+         end
+       end
+
+       def stylesheets
+         @stylesheets ||= head_links.select { |hl| hl[:rel] == 'stylesheet' }
+       end
+
+       def canonicals
+         @canonicals ||= head_links.select { |hl| hl[:rel] == 'canonical' }
+       end
+
+       # Returns the parsed document meta rss link
+       def feed
+         @feed ||= (parsed_feed('rss') || parsed_feed('atom'))
+       end
+
+       private
+
+       def parsed_feed(format)
+         feed = parsed.search("//link[@type='application/#{format}+xml']").first
+         feed ? URL.absolutify(feed.attributes['href'].value, base_url) : nil
+       end
+     end
+   end
+ end
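The `rel` filtering in `stylesheets` and `canonicals` can be illustrated without Nokogiri. The following is a minimal sketch that works on plain hashes shaped like the ones `head_links` returns in the spec; `HEAD_LINKS` and `links_with_rel` are hypothetical names used only for illustration and are not part of MetaInspector.

```ruby
# Sample data shaped like the hashes returned by `head_links`
# (symbol keys, href already absolute). Values are illustrative.
HEAD_LINKS = [
  { rel: 'canonical',     href: 'http://example.com/canonical-from-head' },
  { rel: 'stylesheet',    href: 'http://example.com/stylesheets/screen.css' },
  { rel: 'shortcut icon', href: 'http://example.com/favicon.ico', type: 'image/x-icon' },
  { rel: 'stylesheet',    href: 'http://foo/print.css', media: 'print' }
].freeze

# Hypothetical helper: exact string match on :rel, mirroring the
# `hl[:rel] == 'stylesheet'` comparison used by the parser.
def links_with_rel(head_links, rel)
  head_links.select { |link| link[:rel] == rel }
end

stylesheets = links_with_rel(HEAD_LINKS, 'stylesheet')
canonicals  = links_with_rel(HEAD_LINKS, 'canonical')

puts stylesheets.map { |link| link[:href] }
puts canonicals.first[:href]
```

Because the comparison is an exact match, a link with `rel="shortcut icon"` is not returned when filtering for `'icon'`, and a differently cased value such as `rel="Stylesheet"` would not match `'stylesheet'`.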
@@ -32,28 +32,37 @@ module MetaInspector
          URL.absolutify(suggested_img, base_url) if suggested_img
        end

-       # Returns the largest image from the image collection,
-       # filtered for images that are more square than 10:1 or 1:10
-       def largest()
-         @larget_image ||= begin
+       # Returns an array of [img_url, width, height] sorted by image area (width * height)
+       def with_size
+         @with_size ||= begin
            img_nodes = parsed.search('//img').select{ |img_node| img_node['src'] }
-           sizes = img_nodes.map { |img_node| [URL.absolutify(img_node['src'], base_url), img_node['width'], img_node['height']] }
-           sizes.uniq! { |url, width, height| url }
+           imgs_with_size = img_nodes.map { |img_node| [URL.absolutify(img_node['src'], base_url), img_node['width'], img_node['height']] }
+           imgs_with_size.uniq! { |url, width, height| url }
            if @download_images
-             sizes.map! do |url, width, height|
+             imgs_with_size.map! do |url, width, height|
                width, height = FastImage.size(url) if width.nil? || height.nil?
-               [url, width, height]
+               [url, width.to_i, height.to_i]
              end
            else
-             sizes.map! do |url, width, height|
+             imgs_with_size.map! do |url, width, height|
                width, height = [0, 0] if width.nil? || height.nil?
-               [url, width, height]
+               [url, width.to_i, height.to_i]
              end
            end
-           sizes.map! { |url, width, height| [url, width.to_i * height.to_i, width.to_f / height.to_f] }
-           sizes.keep_if { |url, area, ratio| ratio > 0.1 && ratio < 10 }
-           sizes.sort_by! { |url, area, ratio| -area }
-           url, area, ratio = sizes.first
+           imgs_with_size.sort_by { |url, width, height| -(width.to_i * height.to_i) }
+         end
+       end
+
+       # Returns the largest image from the image collection,
+       # filtered for images that are more square than 10:1 or 1:10
+       def largest
+         @largest_image ||= begin
+           imgs_with_size = with_size.dup
+           imgs_with_size.keep_if do |url, width, height|
+             ratio = width.to_f / height.to_f
+             ratio > 0.1 && ratio < 10
+           end
+           url, width, height = imgs_with_size.first
            url
          end
        end
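The split between `with_size` (sort by area) and `largest` (filter by aspect ratio, then take the first) can be sketched on plain arrays. This is an illustrative sketch only: `IMAGES`, `sort_by_area` and `largest_url` are hypothetical names, and the sample triples stand in for what the gem derives from HTML attributes or FastImage.

```ruby
# Illustrative [url, width, height] triples.
IMAGES = [
  ['http://example.com/smaller',    10,  10],
  ['http://example.com/too_wide',   100, 10],
  ['http://example.com/largest',    100, 100],
  ['http://example.com/unknown',    0,   0],
  ['http://example.com/too_narrow', 10,  100]
].freeze

# Sort by descending area, as `with_size` does.
def sort_by_area(images)
  images.sort_by { |_url, width, height| -(width * height) }
end

# Keep only images whose width/height ratio lies strictly between
# 1:10 and 10:1, then take the first (largest) one, as `largest` does.
def largest_url(images)
  candidates = sort_by_area(images).select do |_url, width, height|
    ratio = width.to_f / height.to_f
    ratio > 0.1 && ratio < 10
  end
  url, _width, _height = candidates.first
  url
end

sorted  = sort_by_area(IMAGES)
largest = largest_url(IMAGES)

puts sorted.first.first
puts largest
```

An image with unknown dimensions ends up as `0 x 0`, so its ratio is `NaN` and it fails both comparisons; it is listed last by `with_size` and never chosen by `largest`. In the diff, `largest` works on `with_size.dup` because `keep_if` mutates its receiver and would otherwise alter the memoized `@with_size` array.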
@@ -14,8 +14,7 @@ module MetaInspector

        # Returns all links found, unrelativized and absolutified
        def all
-         @all ||= raw.map { |link| URL.absolutify(URL.unrelativize(link, scheme), base_url) }
-                     .compact.uniq
+         @all ||= raw.map { |link| URL.absolutify(link, base_url) }.compact.uniq
        end

        # Returns all HTTP links found
@@ -44,11 +43,6 @@ module MetaInspector
            'non_http' => non_http }
        end

-       # Returns the parsed document meta rss link
-       def feed
-         @feed ||= (parsed_feed('rss') || parsed_feed('atom'))
-       end
-
        # Returns the base url to absolutify relative links.
        # This can be the one set on a <base> tag,
        # or the url of the document if no <base> tag was found.
@@ -56,13 +50,6 @@ module MetaInspector
          base_href || url
        end

-       private
-
-       def parsed_feed(format)
-         feed = parsed.search("//link[@type='application/#{format}+xml']").first
-         feed ? URL.absolutify(feed.attributes['href'].value, base_url) : nil
-       end
-
        # Returns the value of the href attribute on the <base /> tag, if exists
        def base_href
          parsed.search('base').first.attributes['href'].value rescue nil
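With the explicit `URL.unrelativize` step gone from `all`, protocol-relative links are left for `URL.absolutify` to resolve against the base URL. The resolution rule itself can be sketched with Ruby's standard `URI.join`. This is a stand-in for illustration, not necessarily the call MetaInspector makes internally, and `resolve` is a hypothetical helper.

```ruby
require 'uri'

# Hypothetical helper: resolve a link against the page's base URL
# using standard relative-reference resolution.
def resolve(link, base_url)
  URI.join(base_url, link).to_s
end

puts resolve('//yahoo.com', 'http://protocol-relative.com/')
puts resolve('//yahoo.com', 'https://protocol-relative.com/')
puts resolve('/faq', 'https://protocol-relative.com/')
puts resolve('http://google.com', 'https://protocol-relative.com/')
```

A link starting with `//` inherits only the scheme of the base URL, a link starting with `/` inherits scheme and host, and an absolute link is returned unchanged. This matches what the new specs expect for the `http` and `https` variants of the same fixture.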
@@ -27,21 +27,35 @@ module MetaInspector
        "#{scheme}://#{host}/"
      end

+     WELL_KNOWN_TRACKING_PARAMS = %w( utm_source utm_medium utm_term utm_content utm_campaign )
+
+     def tracked?
+       u = parsed(url)
+       found_tracking_params = WELL_KNOWN_TRACKING_PARAMS & u.query_values.keys
+       return found_tracking_params.any?
+     end
+
+     def untracked_url
+       u = parsed(url)
+       u.query_values = u.query_values.delete_if { |key, _| WELL_KNOWN_TRACKING_PARAMS.include? key }
+       u.to_s
+     end
+
+     def untrack!
+       self.url = untracked_url
+     end
+
      def url=(new_url)
        url = with_default_scheme(new_url)
        @url = @normalize ? normalized(url) : url
      end

-     # Converts a protocol-relative url to its full form,
-     # depending on the scheme of the page that contains it
-     def self.unrelativize(url, scheme)
-       url =~ /^\/\// ? "#{scheme}://#{url[2..-1]}" : url
-     end
-
      # Converts a relative URL to an absolute URL, like:
      #   "/faq" => "http://example.com/faq"
      # Respecting already absolute URLs like the ones starting with
      #   http:, ftp:, telnet:, mailto:, javascript: ...
+     # Protocol-relative URLs are also resolved to use the same
+     # scheme as the base_url
      def self.absolutify(url, base_url)
        if url =~ /^\w*\:/i
          MetaInspector::URL.new(url).url
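The tracking-parameter logic can be sketched with Ruby's standard library alone. This is an illustration under stated assumptions, not the gem's implementation: `TRACKING_PARAMS`, `query_pairs` and the top-level `tracked?` / `untracked_url` functions are hypothetical stand-ins. The hunk calls `.keys` and `.delete_if` on `u.query_values` directly; if `parsed` returns an `Addressable::URI` (an assumption, since that helper is not shown in this diff), the value would be `nil` for a URL without a query string, so the sketch guards that case explicitly.

```ruby
require 'uri'

# Same parameter names as WELL_KNOWN_TRACKING_PARAMS in the diff.
TRACKING_PARAMS = %w[utm_source utm_medium utm_term utm_content utm_campaign].freeze

# Hypothetical helper: decode the query string into [key, value] pairs.
# Returns an empty array when the URL has no query string.
def query_pairs(url)
  URI.decode_www_form(URI.parse(url).query || '')
end

def tracked?(url)
  query_pairs(url).any? { |key, _value| TRACKING_PARAMS.include?(key) }
end

def untracked_url(url)
  uri  = URI.parse(url)
  kept = query_pairs(url).reject { |key, _value| TRACKING_PARAMS.include?(key) }
  # Drop the query entirely when nothing is left, to avoid a trailing "?".
  uri.query = kept.empty? ? nil : URI.encode_www_form(kept)
  uri.to_s
end

puts tracked?('http://example.com/foo?not_utm_thing=bar&utm_source=1234')
puts untracked_url('http://example.com/foo?not_utm_thing=bar&utm_source=1234')
```

Matching is done on the exact parameter name, so a parameter such as `not_utm_source` is kept, which is the behaviour the new `url_spec.rb` examples check for.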
@@ -1,3 +1,3 @@
  module MetaInspector
-   VERSION = "4.4.2"
+   VERSION = '4.5.0'
  end
@@ -0,0 +1,34 @@
+ HTTP/1.1 200 OK
+ Server: nginx/0.7.67
+ Date: Fri, 18 Nov 2011 21:46:46 GMT
+ Content-Type: text/html
+ Connection: keep-alive
+ Last-Modified: Mon, 14 Nov 2011 16:53:18 GMT
+ Content-Length: 4987
+ X-Varnish: 2000423390
+ Age: 0
+ Via: 1.1 varnish
+
+ <html>
+ <head>
+   <title>An example page</title>
+   <link
+     rel="canonical"
+     href="http://example.com/canonical-from-head"
+   />
+   <link rel="stylesheet" href="/stylesheets/screen.css">
+   <link rel="stylesheet" href="//example2.com/stylesheets/screen.css">
+   <link rel="shortcut icon" href="/favicon.ico" type="image/x-icon" />
+   <link rel="shorturl" href="http://gu.com/p/32v5a" />
+   <link
+     rel="stylesheet"
+     type="text/css"
+     href="http://foo/print.css"
+     media="print"
+     class="contrast"
+   />
+ </head>
+ <body>
+   <h1>Hello World</h1>
+ </body>
+ </html>
@@ -12,6 +12,8 @@ Accept-Ranges: bytes
  <head>
    <meta charset="utf-8" />
    <title>Protocol-relative URLs</title>
+   <meta property="og:image" content="//static-secure.guim.co.uk/sys-images/Guardian/Pix/pictures/2011/8/8/1312810126887/gu_192x115.jpg"/>
+   <link rel="shortcut icon" href="//static-secure.guim.co.uk/sys-images/favicon.ico" type="image/x-icon" />
  </head>
  <body>
    <p>Internal links</p>
@@ -22,5 +24,8 @@ Accept-Ranges: bytes
    <p>External links</p>
    <a href="http://google.com">External: normal link</a>
    <a href="//yahoo.com">External: protocol-relative link</a>
+
+   <p>Images</p>
+   <img src="//example.com/image.jpg" />
  </body>
  </html>
@@ -0,0 +1,42 @@
+ require 'spec_helper'
+
+ describe MetaInspector do
+
+   describe "head_links" do
+     let(:page) { MetaInspector.new('http://example.com/head_links') }
+     let(:page_https) { MetaInspector.new('https://example.com/head_links') }
+
+     it "#head_links" do
+       expect(page.head_links).to eq([
+         {rel: 'canonical', href: 'http://example.com/canonical-from-head'},
+         {rel: 'stylesheet', href: 'http://example.com/stylesheets/screen.css'},
+         {rel: 'stylesheet', href: 'http://example2.com/stylesheets/screen.css'},
+         {rel: 'shortcut icon', href: 'http://example.com/favicon.ico', type: 'image/x-icon'},
+         {rel: 'shorturl', href: 'http://gu.com/p/32v5a'},
+         {rel: 'stylesheet', type: 'text/css', href: 'http://foo/print.css', media: 'print', class: 'contrast'}
+       ])
+     end
+
+     it "#stylesheets" do
+       expect(page.stylesheets).to eq([
+         {rel: 'stylesheet', href: 'http://example.com/stylesheets/screen.css'},
+         {rel: 'stylesheet', href: 'http://example2.com/stylesheets/screen.css'},
+         {rel: 'stylesheet', type: 'text/css', href: 'http://foo/print.css', media: 'print', class: 'contrast'}
+       ])
+
+       expect(page_https.stylesheets).to eq([
+         {rel: 'stylesheet', href: 'https://example.com/stylesheets/screen.css'},
+         {rel: 'stylesheet', href: 'https://example2.com/stylesheets/screen.css'},
+         {rel: 'stylesheet', type: 'text/css', href: 'http://foo/print.css', media: 'print', class: 'contrast'}
+       ])
+     end
+
+     it "#canonical" do
+       expect(page.canonicals).to eq([
+         {rel: 'canonical', href: 'http://example.com/canonical-from-head'}
+       ])
+     end
+
+   end
+
+ end
@@ -123,6 +123,44 @@ describe MetaInspector do
      end
    end

+   describe "images.with_size" do
+     it "should return sorted by area array of [img_url, width, height] using html sizes" do
+       page = MetaInspector.new('http://example.com/largest_image_in_html')
+
+       expect(page.images.with_size).to eq([
+         ["http://example.com/largest", 100, 100],
+         ["http://example.com/too_narrow", 10, 100],
+         ["http://example.com/too_wide", 100, 10],
+         ["http://example.com/smaller", 10, 10],
+         ["http://example.com/smallest", 1, 1]
+       ])
+     end
+
+     it "should return sorted by area array of [img_url, width, height] using actual image sizes" do
+       page = MetaInspector.new('http://example.com/largest_image_using_image_size')
+
+       expect(page.images.with_size).to eq([
+         ["http://example.com/100x100", 100, 100],
+         ["http://example.com/10x100", 10, 100],
+         ["http://example.com/100x10", 100, 10],
+         ["http://example.com/10x10", 10, 10],
+         ["http://example.com/1x1", 1, 1]
+       ])
+     end
+
+     it "should return sorted by area array of [img_url, width, height] without downloading images" do
+       page = MetaInspector.new('http://example.com/largest_image_using_image_size', download_images: false)
+
+       expect(page.images.with_size).to eq([
+         ["http://example.com/10x100", 10, 100],
+         ["http://example.com/100x10", 100, 10],
+         ["http://example.com/1x1", 1, 1],
+         ["http://example.com/10x10", 0, 0],
+         ["http://example.com/100x100", 0, 0]
+       ])
+     end
+   end
+
    describe "images.largest" do
      it "should find the largest image on the page using html sizes" do
        page = MetaInspector.new('http://example.com/largest_image_in_html')
@@ -174,4 +212,26 @@ describe MetaInspector do
        expect(page.images.favicon).to eq(nil)
      end
    end
+
+   describe 'protocol-relative' do
+     before(:each) do
+       @m_http = MetaInspector.new('http://protocol-relative.com')
+       @m_https = MetaInspector.new('https://protocol-relative.com')
+     end
+
+     it 'should unrelativize images' do
+       expect(@m_http.images.to_a).to eq(['http://example.com/image.jpg'])
+       expect(@m_https.images.to_a).to eq(['https://example.com/image.jpg'])
+     end
+
+     it 'should unrelativize owner suggested image' do
+       expect(@m_http.images.owner_suggested).to eq('http://static-secure.guim.co.uk/sys-images/Guardian/Pix/pictures/2011/8/8/1312810126887/gu_192x115.jpg')
+       expect(@m_https.images.owner_suggested).to eq('https://static-secure.guim.co.uk/sys-images/Guardian/Pix/pictures/2011/8/8/1312810126887/gu_192x115.jpg')
+     end
+
+     it 'should unrelativize favicon' do
+       expect(@m_http.images.favicon).to eq('http://static-secure.guim.co.uk/sys-images/favicon.ico')
+       expect(@m_https.images.favicon).to eq('https://static-secure.guim.co.uk/sys-images/favicon.ico')
+     end
+   end
  end
data/spec/spec_helper.rb CHANGED
@@ -41,6 +41,10 @@ FakeWeb.register_uri(:get, "http://example.com/10x10", :response => fixture_file
  FakeWeb.register_uri(:get, "http://example.com/100x100", :response => fixture_file("100x100.jpg.response"))
  FakeWeb.register_uri(:get, "http://www.24-horas.mx/mexico-firma-acuerdo-bilateral-automotriz-con-argentina/", :response => fixture_file("relative_og_image.response"))

+ #Used to test canonical URLs in head
+ FakeWeb.register_uri(:get, "http://example.com/head_links", :response => fixture_file("head_links.response"))
+ FakeWeb.register_uri(:get, "https://example.com/head_links", :response => fixture_file("head_links.response"))
+
  # Used to test best_title logic
  FakeWeb.register_uri(:get, "http://example.com/title_in_head", :response => fixture_file("title_in_head.response"))
  FakeWeb.register_uri(:get, "http://example.com/title_in_body", :response => fixture_file("title_in_body.response"))
data/spec/url_spec.rb CHANGED
@@ -36,6 +36,47 @@ describe MetaInspector::URL do
      expect(MetaInspector::URL.new('http://example.com/faqs').root_url).to eq('http://example.com/')
    end

+   it "should return an untracked url" do
+     expect(MetaInspector::URL.new('http://example.com/foo?not_utm_thing=bar&utm_source=1234').untracked_url).to eq('http://example.com/foo?not_utm_thing=bar')
+     expect(MetaInspector::URL.new('http://example.com/foo?not_utm_thing=bar&utm_medium=1234').untracked_url).to eq('http://example.com/foo?not_utm_thing=bar')
+     expect(MetaInspector::URL.new('http://example.com/foo?not_utm_thing=bar&utm_term=1234').untracked_url).to eq('http://example.com/foo?not_utm_thing=bar')
+     expect(MetaInspector::URL.new('http://example.com/foo?not_utm_thing=bar&utm_content=1234').untracked_url).to eq('http://example.com/foo?not_utm_thing=bar')
+     expect(MetaInspector::URL.new('http://example.com/foo?not_utm_thing=bar&utm_campaign=1234').untracked_url).to eq('http://example.com/foo?not_utm_thing=bar')
+     expect(MetaInspector::URL.new('http://example.com/foo?not_utm_thing=bar&utm_source=1234&utm_medium=5678&utm_term=4321&utm_content=9876&utm_campaign=5436').untracked_url).to eq('http://example.com/foo?not_utm_thing=bar')
+   end
+
+   it "should remove tracking parameters from url" do
+
+     tracked_urls = ['http://example.com/foo?not_utm_thing=bar&utm_source=1234',
+                     'http://example.com/foo?not_utm_thing=bar&utm_medium=1234',
+                     'http://example.com/foo?not_utm_thing=bar&utm_term=1234',
+                     'http://example.com/foo?not_utm_thing=bar&utm_content=1234',
+                     'http://example.com/foo?not_utm_thing=bar&utm_campaign=1234',
+                     'http://example.com/foo?not_utm_thing=bar&utm_source=1234&utm_medium=5678&utm_term=4321&utm_content=9876&utm_campaign=5436'
+                    ]
+
+     tracked_urls.each do |tracked_url|
+       url = MetaInspector::URL.new(tracked_url)
+       url.untrack!
+       expect(url.url).to eq('http://example.com/foo?not_utm_thing=bar')
+     end
+   end
+
+   it "should say if the url is tracked" do
+     expect(MetaInspector::URL.new('http://example.com/foo?not_utm_thing=bar&utm_source=1234').tracked?).to be true
+     expect(MetaInspector::URL.new('http://example.com/foo?not_utm_thing=bar&utm_medium=1234').tracked?).to be true
+     expect(MetaInspector::URL.new('http://example.com/foo?not_utm_thing=bar&utm_term=1234').tracked?).to be true
+     expect(MetaInspector::URL.new('http://example.com/foo?not_utm_thing=bar&utm_content=1234').tracked?).to be true
+     expect(MetaInspector::URL.new('http://example.com/foo?not_utm_thing=bar&utm_campaign=1234').tracked?).to be true
+     expect(MetaInspector::URL.new('http://example.com/foo?not_utm_thing=bar&utm_source=1234&utm_medium=5678&utm_term=4321&utm_content=9876&utm_campaign=5436').tracked?).to be true
+
+     expect(MetaInspector::URL.new('http://example.com/foo?not_utm_thing=bar&not_utm_source=1234').tracked?).to be false
+     expect(MetaInspector::URL.new('http://example.com/foo?not_utm_thing=bar&not_utm_medium=1234').tracked?).to be false
+     expect(MetaInspector::URL.new('http://example.com/foo?not_utm_thing=bar&not_utm_term=1234').tracked?).to be false
+     expect(MetaInspector::URL.new('http://example.com/foo?not_utm_thing=bar&not_utm_content=1234').tracked?).to be false
+     expect(MetaInspector::URL.new('http://example.com/foo?not_utm_thing=bar&not_utm_campaign=1234').tracked?).to be false
+   end
+
    describe "url=" do
      it "should update the url" do
        url = MetaInspector::URL.new('http://first.com/')
metadata CHANGED
@@ -1,14 +1,14 @@
  --- !ruby/object:Gem::Specification
  name: metainspector
  version: !ruby/object:Gem::Version
-   version: 4.4.2
+   version: 4.5.0
  platform: ruby
  authors:
  - Jaime Iniesta
  autorequire:
  bindir: bin
  cert_chain: []
- date: 2015-04-30 00:00:00.000000000 Z
+ date: 2015-05-29 00:00:00.000000000 Z
  dependencies:
  - !ruby/object:Gem::Dependency
    name: nokogiri
@@ -232,6 +232,7 @@ files:
  - ".rspec.example"
  - ".rubocop.yml.example"
  - ".travis.yml"
+ - CHANGELOG.md
  - Gemfile
  - Guardfile
  - MIT-LICENSE
@@ -247,6 +248,7 @@ files:
  - lib/meta_inspector/exceptionable.rb
  - lib/meta_inspector/parser.rb
  - lib/meta_inspector/parsers/base.rb
+ - lib/meta_inspector/parsers/head_links.rb
  - lib/meta_inspector/parsers/images.rb
  - lib/meta_inspector/parsers/links.rb
  - lib/meta_inspector/parsers/meta_tags.rb
@@ -270,6 +272,7 @@ files:
  - spec/fixtures/example.response
  - spec/fixtures/facebook.com.response
  - spec/fixtures/guardian.co.uk.response
+ - spec/fixtures/head_links.response
  - spec/fixtures/https.facebook.com.response
  - spec/fixtures/international.response
  - spec/fixtures/invalid_href.response
@@ -305,6 +308,7 @@ files:
  - spec/fixtures/wordpress_site.response
  - spec/fixtures/youtube.response
  - spec/fixtures/youtube_short_title.response
+ - spec/meta_inspector/head_links_spec.rb
  - spec/meta_inspector/images_spec.rb
  - spec/meta_inspector/links_spec.rb
  - spec/meta_inspector/meta_inspector_spec.rb