metainspector 4.4.2 → 4.5.0
- checksums.yaml +4 -4
- data/CHANGELOG.md +66 -0
- data/README.md +43 -62
- data/lib/meta_inspector.rb +1 -0
- data/lib/meta_inspector/document.rb +4 -2
- data/lib/meta_inspector/parser.rb +7 -5
- data/lib/meta_inspector/parsers/head_links.rb +40 -0
- data/lib/meta_inspector/parsers/images.rb +23 -14
- data/lib/meta_inspector/parsers/links.rb +1 -14
- data/lib/meta_inspector/url.rb +20 -6
- data/lib/meta_inspector/version.rb +1 -1
- data/spec/fixtures/head_links.response +34 -0
- data/spec/fixtures/protocol_relative.response +5 -0
- data/spec/meta_inspector/head_links_spec.rb +42 -0
- data/spec/meta_inspector/images_spec.rb +60 -0
- data/spec/spec_helper.rb +4 -0
- data/spec/url_spec.rb +41 -0
- metadata +6 -2
checksums.yaml
CHANGED
@@ -1,7 +1,7 @@
 ---
 SHA1:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: d8b2f4cf8526bd14a55d879334ff9bf14c95180f
+  data.tar.gz: 0d39ceedb495d19a761fd7f6bcfdae767d1e1c26
 SHA512:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: b9b8a345bb8f935bfe5a5fb74d4e86a92893c8d44066f87cdffbe029fc5746841c290c366fd94fc2a84edb73edbbf43c491189f6b82e754f4bc0c494eaed6591
+  data.tar.gz: f559c11756c34406d5083a58c8cd80fdf5665fd79f4a204fdd22eee9eed6bbb65553c3a0b47f16962623ef4c4903adb74b6ac503893b0cb6ea66ee84513e85d4
data/CHANGELOG.md
ADDED
@@ -0,0 +1,66 @@
+# MetaInspector Changelog
+
+## Changes in 4.5
+
+* The Document API now includes access to head/link elements
+  * `page.head_links` returns an array of hashes of all head/links.
+  * `page.stylesheets` returns head/links where rel='stylesheet'
+  * `page.canonicals` returns head/links where rel='canonical'
+
+* The URL API can remove common tracking parameters from the querystring
+  * `url.tracked?` will tell you if the url contains known tracking parameters
+  * `url.untracked_url` will return the url with known tracking parameters removed
+  * `url.untrack!` will remove the tracking parameters from the url
+
+* The images API has been extended:
+  * `page.images.with_size` returns a sorted array (by descending area) of [image_url, width, height]
+
+## Changes in 4.4
+
+The default headers now include `'Accept-Encoding' => 'identity'` to minimize trouble with servers that respond with malformed compressed responses, [as explained here](https://github.com/lostisland/faraday/issues/337).
+
+## Changes in 4.3
+
+* The Document API has been extended with one new method `page.best_title` that returns the longest text available from a selection of candidates.
+* `to_hash` now includes `scheme`, `host`, `root_url`, `best_title` and `description`.
+
+## Changes in 4.2
+
+* The images API has been extended, with two new methods:
+
+  * `page.images.owner_suggested` returns the OG or Twitter image, or `nil` if neither are present.
+  * `page.images.largest` returns the largest image found in the page. This uses the HTML height and width attributes as well as the [fastimage](https://github.com/sdsykes/fastimage) gem to return the largest image on the page that has a ratio squarer than 1:10 or 10:1. This usually provides a good alternative to the OG or Twitter images if they are not supplied.
+
+* The criteria for `page.images.best` has changed slightly, we'll now return the largest image instead of the first image if no owner-suggested image is found.
+
+## Changes in 4.1
+
+* Introduces the `:normalize_url` option, which allows to disable URL normalization.
+
+## Changes in 4.0
+
+* The links API has been changed, now instead of `page.links`, `page.internal_links` and `page.external_links` we have:
+
+```ruby
+page.links.raw # Returns all links found, unprocessed
+page.links.all # Returns all links found, unrelativized and absolutified
+page.links.http # Returns all HTTP links found
+page.links.non_http # Returns all non-HTTP links found
+page.links.internal # Returns all internal HTTP links found
+page.links.external # Returns all external HTTP links found
+```
+
+* The images API has been changed, now instead of `page.image` we have `page.images.best`, and instead of `page.favicon` we have `page.images.favicon`.
+
+* Now `page.image` will return the first image in `page.images` if no OG or Twitter image found, instead of returning `nil`.
+
+* You can now specify 2 different timeouts, `connection_timeout` and `read_timeout`, instead of the previous single `timeout`.
+
+## Changes in 3.0
+
+* The redirect API has been changed, now the `:allow_redirections` option will expect only a boolean, which by default is `true`. That is, no more specifying `:safe`, `:unsafe` or `:all`.
+* We've dropped support for Ruby < 2.
+
+Also, we've introduced a new feature:
+
+* Persist cookies across redirects. Now MetaInspector will include the received cookies when following redirects. This fixes some cases where a redirect would fail, sometimes caught in a redirection loop.
data/README.md
CHANGED
@@ -8,56 +8,6 @@ You give it an URL, and it lets you easily get its title, links, images, charset
 
 You can try MetaInspector live at this little demo: [https://metainspectordemo.herokuapp.com](https://metainspectordemo.herokuapp.com)
 
-## Changes in 4.4
-
-The default headers now include `'Accept-Encoding' => 'identity'` to minimize trouble with servers that respond with malformed compressed responses, [as explained here](https://github.com/lostisland/faraday/issues/337).
-
-## Changes in 4.3
-
-* The Document API has been extended with one new method `page.best_title` that returns the longest text available from a selection of candidates.
-* `to_hash` now includes `scheme`, `host`, `root_url`, `best_title` and `description`.
-
-## Changes in 4.2
-
-* The images API has been extended, with two new methods:
-
-  * `page.images.owner_suggested` returns the OG or Twitter image, or `nil` if neither are present.
-  * `page.images.largest` returns the largest image found in the page. This uses the HTML height and width attributes as well as the [fastimage](https://github.com/sdsykes/fastimage) gem to return the largest image on the page that has a ratio squarer than 1:10 or 10:1. This usually provides a good alternative to the OG or Twitter images if they are not supplied.
-
-* The criteria for `page.images.best` has changed slightly, we'll now return the largest image instead of the first image if no owner-suggested image is found.
-
-## Changes in 4.1
-
-* Introduces the `:normalize_url` option, which allows to disable URL normalization.
-
-## Changes in 4.0
-
-* The links API has been changed, now instead of `page.links`, `page.internal_links` and `page.external_links` we have:
-
-```ruby
-page.links.raw # Returns all links found, unprocessed
-page.links.all # Returns all links found, unrelativized and absolutified
-page.links.http # Returns all HTTP links found
-page.links.non_http # Returns all non-HTTP links found
-page.links.internal # Returns all internal HTTP links found
-page.links.external # Returns all external HTTP links found
-```
-
-* The images API has been changed, now instead of `page.image` we have `page.images.best`, and instead of `page.favicon` we have `page.images.favicon`.
-
-* Now `page.image` will return the first image in `page.images` if no OG or Twitter image found, instead of returning `nil`.
-
-* You can now specify 2 different timeouts, `connection_timeout` and `read_timeout`, instead of the previous single `timeout`.
-
-## Changes in 3.0
-
-* The redirect API has been changed, now the `:allow_redirections` option will expect only a boolean, which by default is `true`. That is, no more specifying `:safe`, `:unsafe` or `:all`.
-* We've dropped support for Ruby < 2.
-
-Also, we've introduced a new feature:
-
-* Persist cookies across redirects. Now MetaInspector will include the received cookies when following redirects. This fixes some cases where a redirect would fail, sometimes caught in a redirection loop.
-
 ## Installation
 
 Install the gem from RubyGems:
@@ -91,47 +41,72 @@ page = MetaInspector.new('sitevalidator.com')
 You can also include the html which will be used as the document to scrape:
 
 ```ruby
-page = MetaInspector.new("http://sitevalidator.com",
+page = MetaInspector.new("http://sitevalidator.com",
+                         :document => "<html>...</html>")
 ```
 
-## Accessing response
+## Accessing response
 
 You can check the status and headers from the response like this:
 
 ```ruby
 page.response.status # 200
-page.response.headers # { "server"=>"nginx", "content-type"=>"text/html; charset=utf-8",
+page.response.headers # { "server"=>"nginx", "content-type"=>"text/html; charset=utf-8",
+                      # "cache-control"=>"must-revalidate, private, max-age=0", ... }
 ```
 
 ## Accessing scraped data
 
-
+### URL
 
 ```ruby
 page.url # URL of the page
+page.tracked? # returns true if the url contains known tracking parameters
+page.untracked_url # returns the url with the known tracking parameters removed
+page.untrack! # removes the known tracking parameters from the url
 page.scheme # Scheme of the page (http, https)
 page.host # Hostname of the page (like, sitevalidator.com, without the scheme)
 page.root_url # Root url (scheme + host, like http://sitevalidator.com/)
+```
+
+### Head links
+
+```ruby
+page.head_links # an array of hashes of all head/links
+page.stylesheets # an array of hashes of all head/links where rel='stylesheet'
+page.canonicals # an array of hashes of all head/links where rel='canonical'
+page.feed # Get rss or atom links in meta data fields as array
+```
+
+### Texts
+
+```ruby
 page.title # title of the page from the head section, as string
 page.best_title # best title of the page, from a selection of candidates
+page.description # returns the meta description, or the first long paragraph if no meta description is found
+```
+
+### Links
+
+```ruby
 page.links.raw # every link found, unprocessed
 page.links.all # every link found on the page as an absolute URL
 page.links.http # every HTTP link found
 page.links.non_http # every non-HTTP link found
 page.links.internal # every internal link found on the page as an absolute URL
 page.links.external # every external link found on the page as an absolute URL
-
-
-
+```
+
+### Images
+
+```ruby
 page.images # enumerable collection, with every img found on the page as an absolute URL
+page.images.with_size # a sorted array (by descending area) of [image_url, width, height]
 page.images.best # Most relevant image, if defined with the og:image or twitter:image metatags. Fallback to the first page.images array element
 page.images.favicon # absolute URL to the favicon
-page.feed # Get rss or atom links in meta data fields as array
-page.charset # UTF-8
-page.content_type # content-type returned by the server when the url was requested
 ```
 
-
+### Meta tags
 
 When it comes to meta tags, you have several options:
 
@@ -243,6 +218,13 @@ page.meta['author'] # Returns "Joe Sample"
 
 Please be aware that all keys are converted to downcase, so it's `'dc.date.issued'` and not `'DC.date.issued'`.
 
+### Misc
+
+```ruby
+page.charset # UTF-8
+page.content_type # content-type returned by the server when the url was requested
+```
+
 ## Other representations
 
 You can also access most of the scraped data as a hash:
@@ -422,7 +404,6 @@ You're more than welcome to fork this project and send pull requests. Just remem
 * Create a topic branch for your changes.
 * Add specs.
 * Keep your fake responses as small as possible. For each change in `spec/fixtures`, a comment should be included explaining why it's needed.
-* Update `version.rb`, following the [semantic versioning convention](http://semver.org/).
 * Update `README.md` if needed (for example, when you're adding or changing a feature).
 
 Thanks to all the contributors:
data/lib/meta_inspector.rb
CHANGED
@@ -7,6 +7,7 @@ require File.expand_path(File.join(File.dirname(__FILE__), 'meta_inspector/parse
 require File.expand_path(File.join(File.dirname(__FILE__), 'meta_inspector/parsers/base'))
 require File.expand_path(File.join(File.dirname(__FILE__), 'meta_inspector/parsers/images'))
 require File.expand_path(File.join(File.dirname(__FILE__), 'meta_inspector/parsers/links'))
+require File.expand_path(File.join(File.dirname(__FILE__), 'meta_inspector/parsers/head_links'))
 require File.expand_path(File.join(File.dirname(__FILE__), 'meta_inspector/parsers/meta_tags'))
 require File.expand_path(File.join(File.dirname(__FILE__), 'meta_inspector/parsers/texts'))
 require File.expand_path(File.join(File.dirname(__FILE__), 'meta_inspector/document'))
data/lib/meta_inspector/document.rb
CHANGED
@@ -44,14 +44,16 @@ module MetaInspector
     end
 
     extend Forwardable
-    delegate [:url, :scheme, :host, :root_url
+    delegate [:url, :scheme, :host, :root_url,
+              :tracked?, :untracked_url, :untrack!] => :@url
 
     delegate [:content_type, :response] => :@request
 
     delegate [:parsed, :title, :best_title,
               :description, :links,
               :images, :feed, :charset, :meta_tags,
-              :meta_tag, :meta, :favicon
+              :meta_tag, :meta, :favicon,
+              :head_links, :stylesheets, :canonicals] => :@parser
 
     # Returns all document data as a nested Hash
     def to_hash
data/lib/meta_inspector/parser.rb
CHANGED
@@ -13,6 +13,7 @@ module MetaInspector
     def initialize(document, options = {})
       @document = document
       @exception_log = options[:exception_log]
+      @head_links_parser = MetaInspector::Parsers::HeadLinksParser.new(self)
       @meta_tag_parser = MetaInspector::Parsers::MetaTagsParser.new(self)
       @links_parser = MetaInspector::Parsers::LinksParser.new(self)
       @download_images = options[:download_images]
@@ -21,11 +22,12 @@ module MetaInspector
     end
 
     extend Forwardable
-    delegate [:url, :scheme, :host]
-    delegate [:meta_tags, :meta_tag, :meta, :charset]
-    delegate [:
-    delegate :
-    delegate
+    delegate [:url, :scheme, :host] => :@document
+    delegate [:meta_tags, :meta_tag, :meta, :charset] => :@meta_tag_parser
+    delegate [:head_links, :stylesheets, :canonicals, :feed] => :@head_links_parser
+    delegate [:links, :base_url] => :@links_parser
+    delegate :images => :@images_parser
+    delegate [:title, :best_title, :description] => :@texts_parser
 
     # Returns the whole parsed document
     def parsed
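The `delegate [...] => :@target` lines in `document.rb` and `parser.rb` rely on Ruby's standard `Forwardable` module, so the new methods reach the `Document` API without hand-written wrappers. A minimal sketch of that pattern follows; `SketchURL` and `SketchDocument` are hypothetical stand-ins, and the `tracked?` body is a simplified placeholder rather than the gem's logic.

```ruby
require 'forwardable'

# Hypothetical stand-in for MetaInspector::URL; only the method names mirror the diff.
class SketchURL
  attr_reader :url

  def initialize(url)
    @url = url
  end

  # Simplified placeholder check, not the gem's implementation.
  def tracked?
    @url.include?('utm_')
  end
end

# Hypothetical stand-in for MetaInspector::Document.
class SketchDocument
  extend Forwardable

  # A single delegate call forwards several methods to the object held in @url.
  delegate [:url, :tracked?] => :@url

  def initialize(url)
    @url = SketchURL.new(url)
  end
end

doc = SketchDocument.new('http://example.com/?utm_source=newsletter')
puts doc.url
puts doc.tracked?
```

Because the accessor is given as `:@url`, the forwarding target is looked up on each call, so reassigning the instance variable later would redirect the delegated methods as well.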
data/lib/meta_inspector/parsers/head_links.rb
ADDED
@@ -0,0 +1,40 @@
+module MetaInspector
+  module Parsers
+    class HeadLinksParser < Base
+      delegate [:parsed, :base_url] => :@main_parser
+
+      def head_links
+        @head_links ||= parsed.css('head link').map do |tag|
+          Hash[
+            tag.attributes.keys.map do |key|
+              keysym = key.to_sym
+              val = tag.attributes[key].value
+              val = URL.absolutify(val, base_url) if keysym == :href
+              [keysym, val]
+            end
+          ]
+        end
+      end
+
+      def stylesheets
+        @stylesheets ||= head_links.select { |hl| hl[:rel] == 'stylesheet' }
+      end
+
+      def canonicals
+        @canonicals ||= head_links.select { |hl| hl[:rel] == 'canonical' }
+      end
+
+      # Returns the parsed document meta rss link
+      def feed
+        @feed ||= (parsed_feed('rss') || parsed_feed('atom'))
+      end
+
+      private
+
+      def parsed_feed(format)
+        feed = parsed.search("//link[@type='application/#{format}+xml']").first
+        feed ? URL.absolutify(feed.attributes['href'].value, base_url) : nil
+      end
+    end
+  end
+end
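The parser above turns each `<link>` in the head into a hash of its attributes, makes `href` absolute, and then filters by `rel`. The sketch below illustrates only the last two steps on plain hashes. It assumes the attributes are already extracted (`RAW_HEAD_LINKS` is hypothetical sample data modeled on the fixture), and it swaps in `URI.join` from Ruby's standard library where the gem calls `URL.absolutify`, so edge-case behavior may differ.

```ruby
require 'uri'

# Hypothetical sample data: attribute hashes as they might look before absolutifying.
RAW_HEAD_LINKS = [
  { rel: 'canonical',     href: 'http://example.com/canonical-from-head' },
  { rel: 'stylesheet',    href: '/stylesheets/screen.css' },
  { rel: 'stylesheet',    href: '//example2.com/stylesheets/screen.css' },
  { rel: 'shortcut icon', href: '/favicon.ico', type: 'image/x-icon' }
].freeze

# Resolves every href against the base URL; protocol-relative hrefs
# pick up the scheme of the base URL.
def absolutified_head_links(raw_links, base_url)
  raw_links.map do |attrs|
    attrs.merge(href: URI.join(base_url, attrs[:href]).to_s)
  end
end

# Keeps only the links whose rel attribute matches exactly.
def links_with_rel(links, rel)
  links.select { |link| link[:rel] == rel }
end

head_links  = absolutified_head_links(RAW_HEAD_LINKS, 'https://example.com/head_links')
stylesheets = links_with_rel(head_links, 'stylesheet')
stylesheets.each { |link| puts link[:href] }
```

The `rel` comparison is an exact string match, as in the diff, so a multi-valued `rel` such as `shortcut icon` is not matched by either of its individual words.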
data/lib/meta_inspector/parsers/images.rb
CHANGED
@@ -32,28 +32,37 @@ module MetaInspector
         URL.absolutify(suggested_img, base_url) if suggested_img
       end
 
-      # Returns
-
-
-        @larget_image ||= begin
+      # Returns an array of [img_url, width, height] sorted by image area (width * height)
+      def with_size
+        @with_size ||= begin
           img_nodes = parsed.search('//img').select{ |img_node| img_node['src'] }
-
-
+          imgs_with_size = img_nodes.map { |img_node| [URL.absolutify(img_node['src'], base_url), img_node['width'], img_node['height']] }
+          imgs_with_size.uniq! { |url, width, height| url }
           if @download_images
-
+            imgs_with_size.map! do |url, width, height|
               width, height = FastImage.size(url) if width.nil? || height.nil?
-              [url, width, height]
+              [url, width.to_i, height.to_i]
             end
           else
-
+            imgs_with_size.map! do |url, width, height|
               width, height = [0, 0] if width.nil? || height.nil?
-              [url, width, height]
+              [url, width.to_i, height.to_i]
             end
           end
-
-
-
-
+          imgs_with_size.sort_by { |url, width, height| -(width.to_i * height.to_i) }
+        end
+      end
+
+      # Returns the largest image from the image collection,
+      # filtered for images that are more square than 10:1 or 1:10
+      def largest
+        @largest_image ||= begin
+          imgs_with_size = with_size.dup
+          imgs_with_size.keep_if do |url, width, height|
+            ratio = width.to_f / height.to_f
+            ratio > 0.1 && ratio < 10
+          end
+          url, width, height = imgs_with_size.first
           url
         end
       end
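The refactoring above splits the old `largest` into two steps: `with_size` normalizes and sorts the images by area, and `largest` takes the first one with an acceptable aspect ratio. A standalone sketch of that logic on plain arrays follows. It assumes the `[url, width, height]` triples are already extracted and skips the FastImage download branch; `images_with_size`, `largest_image` and the sample data are hypothetical names, not the gem's API.

```ruby
# Normalizes [url, width, height] triples and sorts them by descending area.
# Sizes arrive as HTML attribute strings, or nil when the attribute is missing.
def images_with_size(raw_images)
  raw_images
    .uniq { |url, _width, _height| url }
    .map do |url, width, height|
      width, height = 0, 0 if width.nil? || height.nil?
      [url, width.to_i, height.to_i]
    end
    .sort_by { |_url, width, height| -(width * height) }
end

# Returns the URL of the largest image whose ratio is squarer than 10:1 or 1:10.
def largest_image(raw_images)
  candidate = images_with_size(raw_images).find do |_url, width, height|
    ratio = width.to_f / height # float division: a zero height yields Infinity or NaN
    ratio > 0.1 && ratio < 10
  end
  candidate && candidate.first
end

sample = [
  ['http://example.com/smaller',  '10',   '10'],
  ['http://example.com/too_wide', '1000', '10'],
  ['http://example.com/largest',  '90',   '100'],
  ['http://example.com/unknown',  nil,    nil],
  ['http://example.com/smaller',  '10',   '10']
]

puts images_with_size(sample).inspect
puts largest_image(sample)
```

Images with unknown dimensions end up with an area of zero and a `NaN` ratio, so they sort last and never pass the ratio filter. Note that `sort_by` is not guaranteed to be stable, so the relative order of images with equal area should not be relied on in this sketch.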
data/lib/meta_inspector/parsers/links.rb
CHANGED
@@ -14,8 +14,7 @@ module MetaInspector
 
       # Returns all links found, unrelativized and absolutified
       def all
-        @all ||= raw.map { |link| URL.absolutify(
-          .compact.uniq
+        @all ||= raw.map { |link| URL.absolutify(link, base_url) }.compact.uniq
       end
 
       # Returns all HTTP links found
@@ -44,11 +43,6 @@ module MetaInspector
           'non_http' => non_http }
       end
 
-      # Returns the parsed document meta rss link
-      def feed
-        @feed ||= (parsed_feed('rss') || parsed_feed('atom'))
-      end
-
       # Returns the base url to absolutify relative links.
       # This can be the one set on a <base> tag,
       # or the url of the document if no <base> tag was found.
@@ -56,13 +50,6 @@ module MetaInspector
         base_href || url
       end
 
-      private
-
-      def parsed_feed(format)
-        feed = parsed.search("//link[@type='application/#{format}+xml']").first
-        feed ? URL.absolutify(feed.attributes['href'].value, base_url) : nil
-      end
-
       # Returns the value of the href attribute on the <base /> tag, if exists
       def base_href
         parsed.search('base').first.attributes['href'].value rescue nil
data/lib/meta_inspector/url.rb
CHANGED
@@ -27,21 +27,35 @@ module MetaInspector
       "#{scheme}://#{host}/"
     end
 
+    WELL_KNOWN_TRACKING_PARAMS = %w( utm_source utm_medium utm_term utm_content utm_campaign )
+
+    def tracked?
+      u = parsed(url)
+      found_tracking_params = WELL_KNOWN_TRACKING_PARAMS & u.query_values.keys
+      return found_tracking_params.any?
+    end
+
+    def untracked_url
+      u = parsed(url)
+      u.query_values = u.query_values.delete_if { |key, _| WELL_KNOWN_TRACKING_PARAMS.include? key }
+      u.to_s
+    end
+
+    def untrack!
+      self.url = untracked_url
+    end
+
     def url=(new_url)
       url = with_default_scheme(new_url)
       @url = @normalize ? normalized(url) : url
     end
 
-    # Converts a protocol-relative url to its full form,
-    # depending on the scheme of the page that contains it
-    def self.unrelativize(url, scheme)
-      url =~ /^\/\// ? "#{scheme}://#{url[2..-1]}" : url
-    end
-
     # Converts a relative URL to an absolute URL, like:
     # "/faq" => "http://example.com/faq"
     # Respecting already absolute URLs like the ones starting with
     # http:, ftp:, telnet:, mailto:, javascript: ...
+    # Protocol-relative URLs are also resolved to use the same
+    # scheme as the base_url
     def self.absolutify(url, base_url)
       if url =~ /^\w*\:/i
         MetaInspector::URL.new(url).url
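The tracking-parameter methods above boil down to intersecting the query keys with a fixed list and rebuilding the query string without them. The sketch below shows the same idea as standalone functions. It deliberately uses Ruby's standard `uri` library instead of the URI object the gem parses into, and the function names and the explicit handling of a missing query string are assumptions made for this example.

```ruby
require 'uri'

# Same parameter names as WELL_KNOWN_TRACKING_PARAMS in the diff.
TRACKING_PARAMS = %w(utm_source utm_medium utm_term utm_content utm_campaign).freeze

# True if the query string contains at least one known tracking parameter.
# A URL without a query string is treated as untracked (an assumption of this sketch).
def tracked?(url)
  pairs = URI.decode_www_form(URI.parse(url).query || '')
  pairs.any? { |key, _value| TRACKING_PARAMS.include?(key) }
end

# Returns the URL with the known tracking parameters removed;
# the query string is dropped entirely when nothing else is left.
def untracked_url(url)
  uri   = URI.parse(url)
  pairs = URI.decode_www_form(uri.query || '')
  kept  = pairs.reject { |key, _value| TRACKING_PARAMS.include?(key) }
  uri.query = kept.empty? ? nil : URI.encode_www_form(kept)
  uri.to_s
end

puts tracked?('http://example.com/foo?not_utm_thing=bar&utm_source=1234')
puts untracked_url('http://example.com/foo?not_utm_thing=bar&utm_source=1234')
```

Matching is on the exact parameter name, which is why a key such as `not_utm_source` is left alone, mirroring the negative cases in `url_spec.rb`. How the gem's own `tracked?` behaves for a URL with no query string is not visible in this hunk.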
data/spec/fixtures/head_links.response
ADDED
@@ -0,0 +1,34 @@
+HTTP/1.1 200 OK
+Server: nginx/0.7.67
+Date: Fri, 18 Nov 2011 21:46:46 GMT
+Content-Type: text/html
+Connection: keep-alive
+Last-Modified: Mon, 14 Nov 2011 16:53:18 GMT
+Content-Length: 4987
+X-Varnish: 2000423390
+Age: 0
+Via: 1.1 varnish
+
+<html>
+  <head>
+    <title>An example page</title>
+    <link
+      rel="canonical"
+      href="http://example.com/canonical-from-head"
+    />
+    <link rel="stylesheet" href="/stylesheets/screen.css">
+    <link rel="stylesheet" href="//example2.com/stylesheets/screen.css">
+    <link rel="shortcut icon" href="/favicon.ico" type="image/x-icon" />
+    <link rel="shorturl" href="http://gu.com/p/32v5a" />
+    <link
+      rel="stylesheet"
+      type="text/css"
+      href="http://foo/print.css"
+      media="print"
+      class="contrast"
+    />
+  </head>
+  <body>
+    <h1>Hello World</h1>
+  </body>
+</html>
data/spec/fixtures/protocol_relative.response
CHANGED
@@ -12,6 +12,8 @@ Accept-Ranges: bytes
   <head>
     <meta charset="utf-8" />
     <title>Protocol-relative URLs</title>
+    <meta property="og:image" content="//static-secure.guim.co.uk/sys-images/Guardian/Pix/pictures/2011/8/8/1312810126887/gu_192x115.jpg"/>
+    <link rel="shortcut icon" href="//static-secure.guim.co.uk/sys-images/favicon.ico" type="image/x-icon" />
   </head>
   <body>
     <p>Internal links</p>
@@ -22,5 +24,8 @@ Accept-Ranges: bytes
     <p>External links</p>
     <a href="http://google.com">External: normal link</a>
     <a href="//yahoo.com">External: protocol-relative link</a>
+
+    <p>Images</p>
+    <img src="//example.com/image.jpg" />
   </body>
 </html>
data/spec/meta_inspector/head_links_spec.rb
ADDED
@@ -0,0 +1,42 @@
+require 'spec_helper'
+
+describe MetaInspector do
+
+  describe "head_links" do
+    let(:page) { MetaInspector.new('http://example.com/head_links') }
+    let(:page_https) { MetaInspector.new('https://example.com/head_links') }
+
+    it "#head_links" do
+      expect(page.head_links).to eq([
+        {rel: 'canonical', href: 'http://example.com/canonical-from-head'},
+        {rel: 'stylesheet', href: 'http://example.com/stylesheets/screen.css'},
+        {rel: 'stylesheet', href: 'http://example2.com/stylesheets/screen.css'},
+        {rel: 'shortcut icon', href: 'http://example.com/favicon.ico', type: 'image/x-icon'},
+        {rel: 'shorturl', href: 'http://gu.com/p/32v5a'},
+        {rel: 'stylesheet', type: 'text/css', href: 'http://foo/print.css', media: 'print', class: 'contrast'}
+      ])
+    end
+
+    it "#stylesheets" do
+      expect(page.stylesheets).to eq([
+        {rel: 'stylesheet', href: 'http://example.com/stylesheets/screen.css'},
+        {rel: 'stylesheet', href: 'http://example2.com/stylesheets/screen.css'},
+        {rel: 'stylesheet', type: 'text/css', href: 'http://foo/print.css', media: 'print', class: 'contrast'}
+      ])
+
+      expect(page_https.stylesheets).to eq([
+        {rel: 'stylesheet', href: 'https://example.com/stylesheets/screen.css'},
+        {rel: 'stylesheet', href: 'https://example2.com/stylesheets/screen.css'},
+        {rel: 'stylesheet', type: 'text/css', href: 'http://foo/print.css', media: 'print', class: 'contrast'}
+      ])
+    end
+
+    it "#canonical" do
+      expect(page.canonicals).to eq([
+        {rel: 'canonical', href: 'http://example.com/canonical-from-head'}
+      ])
+    end
+
+  end
+
+end
data/spec/meta_inspector/images_spec.rb
CHANGED
@@ -123,6 +123,44 @@ describe MetaInspector do
     end
   end
 
+  describe "images.with_size" do
+    it "should return sorted by area array of [img_url, width, height] using html sizes" do
+      page = MetaInspector.new('http://example.com/largest_image_in_html')
+
+      expect(page.images.with_size).to eq([
+        ["http://example.com/largest", 100, 100],
+        ["http://example.com/too_narrow", 10, 100],
+        ["http://example.com/too_wide", 100, 10],
+        ["http://example.com/smaller", 10, 10],
+        ["http://example.com/smallest", 1, 1]
+      ])
+    end
+
+    it "should return sorted by area array of [img_url, width, height] using actual image sizes" do
+      page = MetaInspector.new('http://example.com/largest_image_using_image_size')
+
+      expect(page.images.with_size).to eq([
+        ["http://example.com/100x100", 100, 100],
+        ["http://example.com/10x100", 10, 100],
+        ["http://example.com/100x10", 100, 10],
+        ["http://example.com/10x10", 10, 10],
+        ["http://example.com/1x1", 1, 1]
+      ])
+    end
+
+    it "should return sorted by area array of [img_url, width, height] without downloading images" do
+      page = MetaInspector.new('http://example.com/largest_image_using_image_size', download_images: false)
+
+      expect(page.images.with_size).to eq([
+        ["http://example.com/10x100", 10, 100],
+        ["http://example.com/100x10", 100, 10],
+        ["http://example.com/1x1", 1, 1],
+        ["http://example.com/10x10", 0, 0],
+        ["http://example.com/100x100", 0, 0]
+      ])
+    end
+  end
+
   describe "images.largest" do
     it "should find the largest image on the page using html sizes" do
       page = MetaInspector.new('http://example.com/largest_image_in_html')
@@ -174,4 +212,26 @@ describe MetaInspector do
       expect(page.images.favicon).to eq(nil)
     end
   end
+
+  describe 'protocol-relative' do
+    before(:each) do
+      @m_http = MetaInspector.new('http://protocol-relative.com')
+      @m_https = MetaInspector.new('https://protocol-relative.com')
+    end
+
+    it 'should unrelativize images' do
+      expect(@m_http.images.to_a).to eq(['http://example.com/image.jpg'])
+      expect(@m_https.images.to_a).to eq(['https://example.com/image.jpg'])
+    end
+
+    it 'should unrelativize owner suggested image' do
+      expect(@m_http.images.owner_suggested).to eq('http://static-secure.guim.co.uk/sys-images/Guardian/Pix/pictures/2011/8/8/1312810126887/gu_192x115.jpg')
+      expect(@m_https.images.owner_suggested).to eq('https://static-secure.guim.co.uk/sys-images/Guardian/Pix/pictures/2011/8/8/1312810126887/gu_192x115.jpg')
+    end
+
+    it 'should unrelativize favicon' do
+      expect(@m_http.images.favicon).to eq('http://static-secure.guim.co.uk/sys-images/favicon.ico')
+      expect(@m_https.images.favicon).to eq('https://static-secure.guim.co.uk/sys-images/favicon.ico')
+    end
+  end
 end
data/spec/spec_helper.rb
CHANGED
@@ -41,6 +41,10 @@ FakeWeb.register_uri(:get, "http://example.com/10x10", :response => fixture_file
 FakeWeb.register_uri(:get, "http://example.com/100x100", :response => fixture_file("100x100.jpg.response"))
 FakeWeb.register_uri(:get, "http://www.24-horas.mx/mexico-firma-acuerdo-bilateral-automotriz-con-argentina/", :response => fixture_file("relative_og_image.response"))
 
+#Used to test canonical URLs in head
+FakeWeb.register_uri(:get, "http://example.com/head_links", :response => fixture_file("head_links.response"))
+FakeWeb.register_uri(:get, "https://example.com/head_links", :response => fixture_file("head_links.response"))
+
 # Used to test best_title logic
 FakeWeb.register_uri(:get, "http://example.com/title_in_head", :response => fixture_file("title_in_head.response"))
 FakeWeb.register_uri(:get, "http://example.com/title_in_body", :response => fixture_file("title_in_body.response"))
data/spec/url_spec.rb
CHANGED
@@ -36,6 +36,47 @@ describe MetaInspector::URL do
     expect(MetaInspector::URL.new('http://example.com/faqs').root_url).to eq('http://example.com/')
   end
 
+  it "should return an untracked url" do
+    expect(MetaInspector::URL.new('http://example.com/foo?not_utm_thing=bar&utm_source=1234').untracked_url).to eq('http://example.com/foo?not_utm_thing=bar')
+    expect(MetaInspector::URL.new('http://example.com/foo?not_utm_thing=bar&utm_medium=1234').untracked_url).to eq('http://example.com/foo?not_utm_thing=bar')
+    expect(MetaInspector::URL.new('http://example.com/foo?not_utm_thing=bar&utm_term=1234').untracked_url).to eq('http://example.com/foo?not_utm_thing=bar')
+    expect(MetaInspector::URL.new('http://example.com/foo?not_utm_thing=bar&utm_content=1234').untracked_url).to eq('http://example.com/foo?not_utm_thing=bar')
+    expect(MetaInspector::URL.new('http://example.com/foo?not_utm_thing=bar&utm_campaign=1234').untracked_url).to eq('http://example.com/foo?not_utm_thing=bar')
+    expect(MetaInspector::URL.new('http://example.com/foo?not_utm_thing=bar&utm_source=1234&utm_medium=5678&utm_term=4321&utm_content=9876&utm_campaign=5436').untracked_url).to eq('http://example.com/foo?not_utm_thing=bar')
+  end
+
+  it "should remove tracking parameters from url" do
+
+    tracked_urls = ['http://example.com/foo?not_utm_thing=bar&utm_source=1234',
+                    'http://example.com/foo?not_utm_thing=bar&utm_medium=1234',
+                    'http://example.com/foo?not_utm_thing=bar&utm_term=1234',
+                    'http://example.com/foo?not_utm_thing=bar&utm_content=1234',
+                    'http://example.com/foo?not_utm_thing=bar&utm_campaign=1234',
+                    'http://example.com/foo?not_utm_thing=bar&utm_source=1234&utm_medium=5678&utm_term=4321&utm_content=9876&utm_campaign=5436'
+                   ]
+
+    tracked_urls.each do |tracked_url|
+      url = MetaInspector::URL.new(tracked_url)
+      url.untrack!
+      expect(url.url).to eq('http://example.com/foo?not_utm_thing=bar')
+    end
+  end
+
+  it "should say if the url is tracked" do
+    expect(MetaInspector::URL.new('http://example.com/foo?not_utm_thing=bar&utm_source=1234').tracked?).to be true
+    expect(MetaInspector::URL.new('http://example.com/foo?not_utm_thing=bar&utm_medium=1234').tracked?).to be true
+    expect(MetaInspector::URL.new('http://example.com/foo?not_utm_thing=bar&utm_term=1234').tracked?).to be true
+    expect(MetaInspector::URL.new('http://example.com/foo?not_utm_thing=bar&utm_content=1234').tracked?).to be true
+    expect(MetaInspector::URL.new('http://example.com/foo?not_utm_thing=bar&utm_campaign=1234').tracked?).to be true
+    expect(MetaInspector::URL.new('http://example.com/foo?not_utm_thing=bar&utm_source=1234&utm_medium=5678&utm_term=4321&utm_content=9876&utm_campaign=5436').tracked?).to be true
+
+    expect(MetaInspector::URL.new('http://example.com/foo?not_utm_thing=bar&not_utm_source=1234').tracked?).to be false
+    expect(MetaInspector::URL.new('http://example.com/foo?not_utm_thing=bar&not_utm_medium=1234').tracked?).to be false
+    expect(MetaInspector::URL.new('http://example.com/foo?not_utm_thing=bar&not_utm_term=1234').tracked?).to be false
+    expect(MetaInspector::URL.new('http://example.com/foo?not_utm_thing=bar&not_utm_content=1234').tracked?).to be false
+    expect(MetaInspector::URL.new('http://example.com/foo?not_utm_thing=bar&not_utm_campaign=1234').tracked?).to be false
+  end
+
   describe "url=" do
     it "should update the url" do
       url = MetaInspector::URL.new('http://first.com/')
metadata
CHANGED
@@ -1,14 +1,14 @@
 --- !ruby/object:Gem::Specification
 name: metainspector
 version: !ruby/object:Gem::Version
-  version: 4.
+  version: 4.5.0
 platform: ruby
 authors:
 - Jaime Iniesta
 autorequire:
 bindir: bin
 cert_chain: []
-date: 2015-
+date: 2015-05-29 00:00:00.000000000 Z
 dependencies:
 - !ruby/object:Gem::Dependency
   name: nokogiri
@@ -232,6 +232,7 @@ files:
 - ".rspec.example"
 - ".rubocop.yml.example"
 - ".travis.yml"
+- CHANGELOG.md
 - Gemfile
 - Guardfile
 - MIT-LICENSE
@@ -247,6 +248,7 @@ files:
 - lib/meta_inspector/exceptionable.rb
 - lib/meta_inspector/parser.rb
 - lib/meta_inspector/parsers/base.rb
+- lib/meta_inspector/parsers/head_links.rb
 - lib/meta_inspector/parsers/images.rb
 - lib/meta_inspector/parsers/links.rb
 - lib/meta_inspector/parsers/meta_tags.rb
@@ -270,6 +272,7 @@ files:
 - spec/fixtures/example.response
 - spec/fixtures/facebook.com.response
 - spec/fixtures/guardian.co.uk.response
+- spec/fixtures/head_links.response
 - spec/fixtures/https.facebook.com.response
 - spec/fixtures/international.response
 - spec/fixtures/invalid_href.response
@@ -305,6 +308,7 @@ files:
 - spec/fixtures/wordpress_site.response
 - spec/fixtures/youtube.response
 - spec/fixtures/youtube_short_title.response
+- spec/meta_inspector/head_links_spec.rb
 - spec/meta_inspector/images_spec.rb
 - spec/meta_inspector/links_spec.rb
 - spec/meta_inspector/meta_inspector_spec.rb