metainspector 4.4.2 → 4.5.0
- checksums.yaml +4 -4
- data/CHANGELOG.md +66 -0
- data/README.md +43 -62
- data/lib/meta_inspector.rb +1 -0
- data/lib/meta_inspector/document.rb +4 -2
- data/lib/meta_inspector/parser.rb +7 -5
- data/lib/meta_inspector/parsers/head_links.rb +40 -0
- data/lib/meta_inspector/parsers/images.rb +23 -14
- data/lib/meta_inspector/parsers/links.rb +1 -14
- data/lib/meta_inspector/url.rb +20 -6
- data/lib/meta_inspector/version.rb +1 -1
- data/spec/fixtures/head_links.response +34 -0
- data/spec/fixtures/protocol_relative.response +5 -0
- data/spec/meta_inspector/head_links_spec.rb +42 -0
- data/spec/meta_inspector/images_spec.rb +60 -0
- data/spec/spec_helper.rb +4 -0
- data/spec/url_spec.rb +41 -0
- metadata +6 -2
checksums.yaml
CHANGED
@@ -1,7 +1,7 @@
|
|
1
1
|
---
|
2
2
|
SHA1:
|
3
|
-
metadata.gz:
|
4
|
-
data.tar.gz:
|
3
|
+
metadata.gz: d8b2f4cf8526bd14a55d879334ff9bf14c95180f
|
4
|
+
data.tar.gz: 0d39ceedb495d19a761fd7f6bcfdae767d1e1c26
|
5
5
|
SHA512:
|
6
|
-
metadata.gz:
|
7
|
-
data.tar.gz:
|
6
|
+
metadata.gz: b9b8a345bb8f935bfe5a5fb74d4e86a92893c8d44066f87cdffbe029fc5746841c290c366fd94fc2a84edb73edbbf43c491189f6b82e754f4bc0c494eaed6591
|
7
|
+
data.tar.gz: f559c11756c34406d5083a58c8cd80fdf5665fd79f4a204fdd22eee9eed6bbb65553c3a0b47f16962623ef4c4903adb74b6ac503893b0cb6ea66ee84513e85d4
|
data/CHANGELOG.md
ADDED
@@ -0,0 +1,66 @@
|
|
1
|
+
# MetaInpector Changelog
|
2
|
+
|
3
|
+
## Changes in 4.5
|
4
|
+
|
5
|
+
* The Document API now includes access to head/link elements
|
6
|
+
* `page.head_links` returns an array of hashes of all head/links.
|
7
|
+
* `page.stylesheets` returns head/links where rel='stylesheet'
|
8
|
+
* `page.canonicals` returns head/links where rel='canonical'
|
9
|
+
|
10
|
+
* The URL API can remove common tracking parameters from the querystring
|
11
|
+
* `url.tracked?` will tell you if the url contains known tracking parameters
|
12
|
+
* `url.untracked_url` will return the url with known tracking parameters removed
|
13
|
+
* `url.untrack!` will remove the tracking parameters from the url
|
14
|
+
|
15
|
+
* The images API has been extended:
|
16
|
+
* `page.images.with_size` returns a sorted array (by descending area) of [image_url, width, height]
|
17
|
+
|
18
|
+
## Changes in 4.4
|
19
|
+
|
20
|
+
The default headers now include `'Accept-Encoding' => 'identity'` to minimize trouble with servers that respond with malformed compressed responses, [as explained here](https://github.com/lostisland/faraday/issues/337).
|
21
|
+
|
22
|
+
## Changes in 4.3
|
23
|
+
|
24
|
+
* The Document API has been extended with one new method `page.best_title` that returns the longest text available from a selection of candidates.
|
25
|
+
* `to_hash` now includes `scheme`, `host`, `root_url`, `best_title` and `description`.
|
26
|
+
|
27
|
+
## Changes in 4.2
|
28
|
+
|
29
|
+
* The images API has been extended, with two new methods:
|
30
|
+
|
31
|
+
* `page.images.owner_suggested` returns the OG or Twitter image, or `nil` if neither are present.
|
32
|
+
* `page.images.largest` returns the largest image found in the page. This uses the HTML height and width attributes as well as the [fastimage](https://github.com/sdsykes/fastimage) gem to return the largest image on the page that has a ratio squarer than 1:10 or 10:1. This usually provides a good alternative to the OG or Twitter images if they are not supplied.
|
33
|
+
|
34
|
+
* The criteria for `page.images.best` has changed slightly, we'll now return the largest image instead of the first image if no owner-suggested image is found.
|
35
|
+
|
36
|
+
## Changes in 4.1
|
37
|
+
|
38
|
+
* Introduces the `:normalize_url` option, which allows to disable URL normalization.
|
39
|
+
|
40
|
+
## Changes in 4.0
|
41
|
+
|
42
|
+
* The links API has been changed, now instead of `page.links`, `page.internal_links` and `page.external_links` we have:
|
43
|
+
|
44
|
+
```ruby
|
45
|
+
page.links.raw # Returns all links found, unprocessed
|
46
|
+
page.links.all # Returns all links found, unrelavitized and absolutified
|
47
|
+
page.links.http # Returns all HTTP links found
|
48
|
+
page.links.non_http # Returns all non-HTTP links found
|
49
|
+
page.links.internal # Returns all internal HTTP links found
|
50
|
+
page.links.external # Returns all external HTTP links found
|
51
|
+
```
|
52
|
+
|
53
|
+
* The images API has been changed, now instead of `page.image` we have `page.images.best`, and instead of `page.favicon` we have `page.images.favicon`.
|
54
|
+
|
55
|
+
* Now `page.image` will return the first image in `page.images` if no OG or Twitter image found, instead of returning `nil`.
|
56
|
+
|
57
|
+
* You can now specify 2 different timeouts, `connection_timeout` and `read_timeout`, instead of the previous single `timeout`.
|
58
|
+
|
59
|
+
## Changes in 3.0
|
60
|
+
|
61
|
+
* The redirect API has been changed, now the `:allow_redirections` option will expect only a boolean, which by default is `true`. That is, no more specifying `:safe`, `:unsafe` or `:all`.
|
62
|
+
* We've dropped support for Ruby < 2.
|
63
|
+
|
64
|
+
Also, we've introduced a new feature:
|
65
|
+
|
66
|
+
* Persist cookies across redirects. Now MetaInspector will include the received cookies when following redirects. This fixes some cases where a redirect would fail, sometimes caught in a redirection loop.
|
data/README.md
CHANGED
@@ -8,56 +8,6 @@ You give it an URL, and it lets you easily get its title, links, images, charset
|
|
8
8
|
|
9
9
|
You can try MetaInspector live at this little demo: [https://metainspectordemo.herokuapp.com](https://metainspectordemo.herokuapp.com)
|
10
10
|
|
11
|
-
## Changes in 4.4
|
12
|
-
|
13
|
-
The default headers now include `'Accept-Encoding' => 'identity'` to minimize trouble with servers that respond with malformed compressed responses, [as explained here](https://github.com/lostisland/faraday/issues/337).
|
14
|
-
|
15
|
-
## Changes in 4.3
|
16
|
-
|
17
|
-
* The Document API has been extended with one new method `page.best_title` that returns the longest text available from a selection of candidates.
|
18
|
-
* `to_hash` now includes `scheme`, `host`, `root_url`, `best_title` and `description`.
|
19
|
-
|
20
|
-
## Changes in 4.2
|
21
|
-
|
22
|
-
* The images API has been extended, with two new methods:
|
23
|
-
|
24
|
-
* `page.images.owner_suggested` returns the OG or Twitter image, or `nil` if neither are present.
|
25
|
-
* `page.images.largest` returns the largest image found in the page. This uses the HTML height and width attributes as well as the [fastimage](https://github.com/sdsykes/fastimage) gem to return the largest image on the page that has a ratio squarer than 1:10 or 10:1. This usually provides a good alternative to the OG or Twitter images if they are not supplied.
|
26
|
-
|
27
|
-
* The criteria for `page.images.best` has changed slightly, we'll now return the largest image instead of the first image if no owner-suggested image is found.
|
28
|
-
|
29
|
-
## Changes in 4.1
|
30
|
-
|
31
|
-
* Introduces the `:normalize_url` option, which allows to disable URL normalization.
|
32
|
-
|
33
|
-
## Changes in 4.0
|
34
|
-
|
35
|
-
* The links API has been changed, now instead of `page.links`, `page.internal_links` and `page.external_links` we have:
|
36
|
-
|
37
|
-
```ruby
|
38
|
-
page.links.raw # Returns all links found, unprocessed
|
39
|
-
page.links.all # Returns all links found, unrelavitized and absolutified
|
40
|
-
page.links.http # Returns all HTTP links found
|
41
|
-
page.links.non_http # Returns all non-HTTP links found
|
42
|
-
page.links.internal # Returns all internal HTTP links found
|
43
|
-
page.links.external # Returns all external HTTP links found
|
44
|
-
```
|
45
|
-
|
46
|
-
* The images API has been changed, now instead of `page.image` we have `page.images.best`, and instead of `page.favicon` we have `page.images.favicon`.
|
47
|
-
|
48
|
-
* Now `page.image` will return the first image in `page.images` if no OG or Twitter image found, instead of returning `nil`.
|
49
|
-
|
50
|
-
* You can now specify 2 different timeouts, `connection_timeout` and `read_timeout`, instead of the previous single `timeout`.
|
51
|
-
|
52
|
-
## Changes in 3.0
|
53
|
-
|
54
|
-
* The redirect API has been changed, now the `:allow_redirections` option will expect only a boolean, which by default is `true`. That is, no more specifying `:safe`, `:unsafe` or `:all`.
|
55
|
-
* We've dropped support for Ruby < 2.
|
56
|
-
|
57
|
-
Also, we've introduced a new feature:
|
58
|
-
|
59
|
-
* Persist cookies across redirects. Now MetaInspector will include the received cookies when following redirects. This fixes some cases where a redirect would fail, sometimes caught in a redirection loop.
|
60
|
-
|
61
11
|
## Installation
|
62
12
|
|
63
13
|
Install the gem from RubyGems:
|
@@ -91,47 +41,72 @@ page = MetaInspector.new('sitevalidator.com')
|
|
91
41
|
You can also include the html which will be used as the document to scrape:
|
92
42
|
|
93
43
|
```ruby
|
94
|
-
page = MetaInspector.new("http://sitevalidator.com",
|
44
|
+
page = MetaInspector.new("http://sitevalidator.com",
|
45
|
+
:document => "<html>...</html>")
|
95
46
|
```
|
96
47
|
|
97
|
-
## Accessing response
|
48
|
+
## Accessing response
|
98
49
|
|
99
50
|
You can check the status and headers from the response like this:
|
100
51
|
|
101
52
|
```ruby
|
102
53
|
page.response.status # 200
|
103
|
-
page.response.headers # { "server"=>"nginx", "content-type"=>"text/html; charset=utf-8",
|
54
|
+
page.response.headers # { "server"=>"nginx", "content-type"=>"text/html; charset=utf-8",
|
55
|
+
# "cache-control"=>"must-revalidate, private, max-age=0", ... }
|
104
56
|
```
|
105
57
|
|
106
58
|
## Accessing scraped data
|
107
59
|
|
108
|
-
|
60
|
+
### URL
|
109
61
|
|
110
62
|
```ruby
|
111
63
|
page.url # URL of the page
|
64
|
+
page.tracked? # returns true if the url contains known tracking parameters
|
65
|
+
page.untracked_url # returns the url with the known tracking parameters removed
|
66
|
+
page.untrack! # removes the known tracking parameters from the url
|
112
67
|
page.scheme # Scheme of the page (http, https)
|
113
68
|
page.host # Hostname of the page (like, sitevalidator.com, without the scheme)
|
114
69
|
page.root_url # Root url (scheme + host, like http://sitevalidator.com/)
|
70
|
+
```
|
71
|
+
|
72
|
+
### Head links
|
73
|
+
|
74
|
+
```ruby
|
75
|
+
page.head_links # an array of hashes of all head/links
|
76
|
+
page.stylesheets # an array of hashes of all head/links where rel='stylesheet'
|
77
|
+
page.canonicals # an array of hashes of all head/links where rel='canonical'
|
78
|
+
page.feed # Get rss or atom links in meta data fields as array
|
79
|
+
```
|
80
|
+
|
81
|
+
### Texts
|
82
|
+
|
83
|
+
```ruby
|
115
84
|
page.title # title of the page from the head section, as string
|
116
85
|
page.best_title # best title of the page, from a selection of candidates
|
86
|
+
page.description # returns the meta description, or the first long paragraph if no meta description is found
|
87
|
+
```
|
88
|
+
|
89
|
+
### Links
|
90
|
+
|
91
|
+
```ruby
|
117
92
|
page.links.raw # every link found, unprocessed
|
118
93
|
page.links.all # every link found on the page as an absolute URL
|
119
94
|
page.links.http # every HTTP link found
|
120
95
|
page.links.non_http # every non-HTTP link found
|
121
96
|
page.links.internal # every internal link found on the page as an absolute URL
|
122
97
|
page.links.external # every external link found on the page as an absolute URL
|
123
|
-
|
124
|
-
|
125
|
-
|
98
|
+
```
|
99
|
+
|
100
|
+
### Images
|
101
|
+
|
102
|
+
```ruby
|
126
103
|
page.images # enumerable collection, with every img found on the page as an absolute URL
|
104
|
+
page.images.with_size # a sorted array (by descending area) of [image_url, width, height]
|
127
105
|
page.images.best # Most relevant image, if defined with the og:image or twitter:image metatags. Fallback to the first page.images array element
|
128
106
|
page.images.favicon # absolute URL to the favicon
|
129
|
-
page.feed # Get rss or atom links in meta data fields as array
|
130
|
-
page.charset # UTF-8
|
131
|
-
page.content_type # content-type returned by the server when the url was requested
|
132
107
|
```
|
133
108
|
|
134
|
-
|
109
|
+
### Meta tags
|
135
110
|
|
136
111
|
When it comes to meta tags, you have several options:
|
137
112
|
|
@@ -243,6 +218,13 @@ page.meta['author'] # Returns "Joe Sample"
|
|
243
218
|
|
244
219
|
Please be aware that all keys are converted to downcase, so it's `'dc.date.issued'` and not `'DC.date.issued'`.
|
245
220
|
|
221
|
+
### Misc
|
222
|
+
|
223
|
+
```ruby
|
224
|
+
page.charset # UTF-8
|
225
|
+
page.content_type # content-type returned by the server when the url was requested
|
226
|
+
```
|
227
|
+
|
246
228
|
## Other representations
|
247
229
|
|
248
230
|
You can also access most of the scraped data as a hash:
|
@@ -422,7 +404,6 @@ You're more than welcome to fork this project and send pull requests. Just remem
|
|
422
404
|
* Create a topic branch for your changes.
|
423
405
|
* Add specs.
|
424
406
|
* Keep your fake responses as small as possible. For each change in `spec/fixtures`, a comment should be included explaining why it's needed.
|
425
|
-
* Update `version.rb`, following the [semantic versioning convention](http://semver.org/).
|
426
407
|
* Update `README.md` if needed (for example, when you're adding or changing a feature).
|
427
408
|
|
428
409
|
Thanks to all the contributors:
|
data/lib/meta_inspector.rb
CHANGED
@@ -7,6 +7,7 @@ require File.expand_path(File.join(File.dirname(__FILE__), 'meta_inspector/parse
|
|
7
7
|
require File.expand_path(File.join(File.dirname(__FILE__), 'meta_inspector/parsers/base'))
|
8
8
|
require File.expand_path(File.join(File.dirname(__FILE__), 'meta_inspector/parsers/images'))
|
9
9
|
require File.expand_path(File.join(File.dirname(__FILE__), 'meta_inspector/parsers/links'))
|
10
|
+
require File.expand_path(File.join(File.dirname(__FILE__), 'meta_inspector/parsers/head_links'))
|
10
11
|
require File.expand_path(File.join(File.dirname(__FILE__), 'meta_inspector/parsers/meta_tags'))
|
11
12
|
require File.expand_path(File.join(File.dirname(__FILE__), 'meta_inspector/parsers/texts'))
|
12
13
|
require File.expand_path(File.join(File.dirname(__FILE__), 'meta_inspector/document'))
|
data/lib/meta_inspector/document.rb
CHANGED
@@ -44,14 +44,16 @@ module MetaInspector
|
|
44
44
|
end
|
45
45
|
|
46
46
|
extend Forwardable
|
47
|
-
delegate [:url, :scheme, :host, :root_url
|
47
|
+
delegate [:url, :scheme, :host, :root_url,
|
48
|
+
:tracked?, :untracked_url, :untrack!] => :@url
|
48
49
|
|
49
50
|
delegate [:content_type, :response] => :@request
|
50
51
|
|
51
52
|
delegate [:parsed, :title, :best_title,
|
52
53
|
:description, :links,
|
53
54
|
:images, :feed, :charset, :meta_tags,
|
54
|
-
:meta_tag, :meta, :favicon
|
55
|
+
:meta_tag, :meta, :favicon,
|
56
|
+
:head_links, :stylesheets, :canonicals] => :@parser
|
55
57
|
|
56
58
|
# Returns all document data as a nested Hash
|
57
59
|
def to_hash
|
data/lib/meta_inspector/parser.rb
CHANGED
@@ -13,6 +13,7 @@ module MetaInspector
|
|
13
13
|
def initialize(document, options = {})
|
14
14
|
@document = document
|
15
15
|
@exception_log = options[:exception_log]
|
16
|
+
@head_links_parser = MetaInspector::Parsers::HeadLinksParser.new(self)
|
16
17
|
@meta_tag_parser = MetaInspector::Parsers::MetaTagsParser.new(self)
|
17
18
|
@links_parser = MetaInspector::Parsers::LinksParser.new(self)
|
18
19
|
@download_images = options[:download_images]
|
@@ -21,11 +22,12 @@ module MetaInspector
|
|
21
22
|
end
|
22
23
|
|
23
24
|
extend Forwardable
|
24
|
-
delegate [:url, :scheme, :host]
|
25
|
-
delegate [:meta_tags, :meta_tag, :meta, :charset]
|
26
|
-
delegate [:
|
27
|
-
delegate :
|
28
|
-
delegate
|
25
|
+
delegate [:url, :scheme, :host] => :@document
|
26
|
+
delegate [:meta_tags, :meta_tag, :meta, :charset] => :@meta_tag_parser
|
27
|
+
delegate [:head_links, :stylesheets, :canonicals, :feed] => :@head_links_parser
|
28
|
+
delegate [:links, :base_url] => :@links_parser
|
29
|
+
delegate :images => :@images_parser
|
30
|
+
delegate [:title, :best_title, :description] => :@texts_parser
|
29
31
|
|
30
32
|
# Returns the whole parsed document
|
31
33
|
def parsed
|
data/lib/meta_inspector/parsers/head_links.rb
ADDED
@@ -0,0 +1,40 @@
|
|
1
|
+
module MetaInspector
|
2
|
+
module Parsers
|
3
|
+
class HeadLinksParser < Base
|
4
|
+
delegate [:parsed, :base_url] => :@main_parser
|
5
|
+
|
6
|
+
def head_links
|
7
|
+
@head_links ||= parsed.css('head link').map do |tag|
|
8
|
+
Hash[
|
9
|
+
tag.attributes.keys.map do |key|
|
10
|
+
keysym = key.to_sym
|
11
|
+
val = tag.attributes[key].value
|
12
|
+
val = URL.absolutify(val, base_url) if keysym == :href
|
13
|
+
[keysym, val]
|
14
|
+
end
|
15
|
+
]
|
16
|
+
end
|
17
|
+
end
|
18
|
+
|
19
|
+
def stylesheets
|
20
|
+
@stylesheets ||= head_links.select { |hl| hl[:rel] == 'stylesheet' }
|
21
|
+
end
|
22
|
+
|
23
|
+
def canonicals
|
24
|
+
@canonicals ||= head_links.select { |hl| hl[:rel] == 'canonical' }
|
25
|
+
end
|
26
|
+
|
27
|
+
# Returns the parsed document meta rss link
|
28
|
+
def feed
|
29
|
+
@feed ||= (parsed_feed('rss') || parsed_feed('atom'))
|
30
|
+
end
|
31
|
+
|
32
|
+
private
|
33
|
+
|
34
|
+
def parsed_feed(format)
|
35
|
+
feed = parsed.search("//link[@type='application/#{format}+xml']").first
|
36
|
+
feed ? URL.absolutify(feed.attributes['href'].value, base_url) : nil
|
37
|
+
end
|
38
|
+
end
|
39
|
+
end
|
40
|
+
end
|
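A minimal sketch of the hash-building and `rel` filtering that `HeadLinksParser` performs above, written against plain Ruby data so it runs without Nokogiri. The `RAW_LINKS` array and the `links_with_rel` helper are hypothetical stand-ins, not part of the gem, and href absolutification is left out.

```ruby
# Hypothetical stand-in for the attributes reported for each <link> tag
# in <head>: one array of [name, value] pairs per tag.
RAW_LINKS = [
  [['rel', 'canonical'],  ['href', 'http://example.com/canonical-from-head']],
  [['rel', 'stylesheet'], ['href', 'http://example.com/stylesheets/screen.css']],
  [['rel', 'alternate stylesheet'], ['href', 'http://example.com/alt.css']]
].freeze

# Build one symbol-keyed hash per tag, as head_links does with Hash[...].
head_links = RAW_LINKS.map do |attributes|
  Hash[attributes.map { |name, value| [name.to_sym, value] }]
end

# Hypothetical helper mirroring the select blocks in stylesheets/canonicals.
# The comparison is an exact string match on the whole rel value.
def links_with_rel(links, rel)
  links.select { |link| link[:rel] == rel }
end

stylesheets = links_with_rel(head_links, 'stylesheet')
canonicals  = links_with_rel(head_links, 'canonical')
```

Because the match is exact, the tag with `rel="alternate stylesheet"` is not selected by the `'stylesheet'` filter in this sketch; the parser's `hl[:rel] == 'stylesheet'` comparison has the same shape.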
data/lib/meta_inspector/parsers/images.rb
CHANGED
@@ -32,28 +32,37 @@ module MetaInspector
|
|
32
32
|
URL.absolutify(suggested_img, base_url) if suggested_img
|
33
33
|
end
|
34
34
|
|
35
|
-
# Returns
|
36
|
-
|
37
|
-
|
38
|
-
@larget_image ||= begin
|
35
|
+
# Returns an array of [img_url, width, height] sorted by image area (width * height)
|
36
|
+
def with_size
|
37
|
+
@with_size ||= begin
|
39
38
|
img_nodes = parsed.search('//img').select{ |img_node| img_node['src'] }
|
40
|
-
|
41
|
-
|
39
|
+
imgs_with_size = img_nodes.map { |img_node| [URL.absolutify(img_node['src'], base_url), img_node['width'], img_node['height']] }
|
40
|
+
imgs_with_size.uniq! { |url, width, height| url }
|
42
41
|
if @download_images
|
43
|
-
|
42
|
+
imgs_with_size.map! do |url, width, height|
|
44
43
|
width, height = FastImage.size(url) if width.nil? || height.nil?
|
45
|
-
[url, width, height]
|
44
|
+
[url, width.to_i, height.to_i]
|
46
45
|
end
|
47
46
|
else
|
48
|
-
|
47
|
+
imgs_with_size.map! do |url, width, height|
|
49
48
|
width, height = [0, 0] if width.nil? || height.nil?
|
50
|
-
[url, width, height]
|
49
|
+
[url, width.to_i, height.to_i]
|
51
50
|
end
|
52
51
|
end
|
53
|
-
|
54
|
-
|
55
|
-
|
56
|
-
|
52
|
+
imgs_with_size.sort_by { |url, width, height| -(width.to_i * height.to_i) }
|
53
|
+
end
|
54
|
+
end
|
55
|
+
|
56
|
+
# Returns the largest image from the image collection,
|
57
|
+
# filtered for images that are more square than 10:1 or 1:10
|
58
|
+
def largest
|
59
|
+
@largest_image ||= begin
|
60
|
+
imgs_with_size = with_size.dup
|
61
|
+
imgs_with_size.keep_if do |url, width, height|
|
62
|
+
ratio = width.to_f / height.to_f
|
63
|
+
ratio > 0.1 && ratio < 10
|
64
|
+
end
|
65
|
+
url, width, height = imgs_with_size.first
|
57
66
|
url
|
58
67
|
end
|
59
68
|
end
|
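The sort-then-filter logic of `with_size` and `largest` above can be sketched on plain arrays. This is an illustration only: the `IMAGES` data and the `sort_by_area` and `largest_image` helper names are hypothetical, and no images are downloaded (the `FastImage` branch is omitted).

```ruby
# Hypothetical [url, width, height] triples as read from HTML attributes;
# nil marks a missing width or height attribute.
IMAGES = [
  ['http://example.com/small',   '10',  '10'],
  ['http://example.com/banner',  '500', '30'],
  ['http://example.com/photo',   '100', '100'],
  ['http://example.com/unknown', nil,   nil]
].freeze

# Treat images with a missing dimension as 0x0, coerce sizes to integers,
# and sort by descending area (width * height).
def sort_by_area(images)
  sized = images.map do |url, width, height|
    width, height = 0, 0 if width.nil? || height.nil?
    [url, width.to_i, height.to_i]
  end
  sized.sort_by { |_url, width, height| -(width * height) }
end

# Return the URL of the largest image whose width/height ratio lies strictly
# between 1:10 and 10:1. A zero height yields Infinity or NaN, for which
# the comparisons fail, so unsized images drop out.
def largest_image(images)
  candidate = sort_by_area(images).find do |_url, width, height|
    ratio = width.to_f / height.to_f
    ratio > 0.1 && ratio < 10
  end
  candidate && candidate.first
end

sorted = sort_by_area(IMAGES)
best   = largest_image(IMAGES)
```

Here the banner has the largest area but is rejected by the ratio filter, so `best` is the photo. Ruby's `sort_by` is not guaranteed to be stable, so images with equal area may come back in either order.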
data/lib/meta_inspector/parsers/links.rb
CHANGED
@@ -14,8 +14,7 @@ module MetaInspector
|
|
14
14
|
|
15
15
|
# Returns all links found, unrelavitized and absolutified
|
16
16
|
def all
|
17
|
-
@all ||= raw.map { |link| URL.absolutify(
|
18
|
-
.compact.uniq
|
17
|
+
@all ||= raw.map { |link| URL.absolutify(link, base_url) }.compact.uniq
|
19
18
|
end
|
20
19
|
|
21
20
|
# Returns all HTTP links found
|
@@ -44,11 +43,6 @@ module MetaInspector
|
|
44
43
|
'non_http' => non_http }
|
45
44
|
end
|
46
45
|
|
47
|
-
# Returns the parsed document meta rss link
|
48
|
-
def feed
|
49
|
-
@feed ||= (parsed_feed('rss') || parsed_feed('atom'))
|
50
|
-
end
|
51
|
-
|
52
46
|
# Returns the base url to absolutify relative links.
|
53
47
|
# This can be the one set on a <base> tag,
|
54
48
|
# or the url of the document if no <base> tag was found.
|
@@ -56,13 +50,6 @@ module MetaInspector
|
|
56
50
|
base_href || url
|
57
51
|
end
|
58
52
|
|
59
|
-
private
|
60
|
-
|
61
|
-
def parsed_feed(format)
|
62
|
-
feed = parsed.search("//link[@type='application/#{format}+xml']").first
|
63
|
-
feed ? URL.absolutify(feed.attributes['href'].value, base_url) : nil
|
64
|
-
end
|
65
|
-
|
66
53
|
# Returns the value of the href attribute on the <base /> tag, if exists
|
67
54
|
def base_href
|
68
55
|
parsed.search('base').first.attributes['href'].value rescue nil
|
data/lib/meta_inspector/url.rb
CHANGED
@@ -27,21 +27,35 @@ module MetaInspector
|
|
27
27
|
"#{scheme}://#{host}/"
|
28
28
|
end
|
29
29
|
|
30
|
+
WELL_KNOWN_TRACKING_PARAMS = %w( utm_source utm_medium utm_term utm_content utm_campaign )
|
31
|
+
|
32
|
+
def tracked?
|
33
|
+
u = parsed(url)
|
34
|
+
found_tracking_params = WELL_KNOWN_TRACKING_PARAMS & u.query_values.keys
|
35
|
+
return found_tracking_params.any?
|
36
|
+
end
|
37
|
+
|
38
|
+
def untracked_url
|
39
|
+
u = parsed(url)
|
40
|
+
u.query_values = u.query_values.delete_if { |key, _| WELL_KNOWN_TRACKING_PARAMS.include? key }
|
41
|
+
u.to_s
|
42
|
+
end
|
43
|
+
|
44
|
+
def untrack!
|
45
|
+
self.url = untracked_url
|
46
|
+
end
|
47
|
+
|
30
48
|
def url=(new_url)
|
31
49
|
url = with_default_scheme(new_url)
|
32
50
|
@url = @normalize ? normalized(url) : url
|
33
51
|
end
|
34
52
|
|
35
|
-
# Converts a protocol-relative url to its full form,
|
36
|
-
# depending on the scheme of the page that contains it
|
37
|
-
def self.unrelativize(url, scheme)
|
38
|
-
url =~ /^\/\// ? "#{scheme}://#{url[2..-1]}" : url
|
39
|
-
end
|
40
|
-
|
41
53
|
# Converts a relative URL to an absolute URL, like:
|
42
54
|
# "/faq" => "http://example.com/faq"
|
43
55
|
# Respecting already absolute URLs like the ones starting with
|
44
56
|
# http:, ftp:, telnet:, mailto:, javascript: ...
|
57
|
+
# Protocol-relative URLs are also resolved to use the same
|
58
|
+
# schema as the base_url
|
45
59
|
def self.absolutify(url, base_url)
|
46
60
|
if url =~ /^\w*\:/i
|
47
61
|
MetaInspector::URL.new(url).url
|
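For readers who want to try the tracking-parameter logic in isolation, here is a rough equivalent of `tracked?` and `untracked_url` built on Ruby's standard `URI` module instead of the gem's own `parsed` helper. The top-level method names and the `query_pairs` helper are assumptions made for this sketch; the parameter list is copied from `WELL_KNOWN_TRACKING_PARAMS` above.

```ruby
require 'uri'

# Same list as WELL_KNOWN_TRACKING_PARAMS in the diff above.
TRACKING_PARAMS = %w(utm_source utm_medium utm_term utm_content utm_campaign).freeze

# Hypothetical helper: decode the query string into [key, value] pairs.
# A URL without a query string yields an empty array.
def query_pairs(url)
  URI.decode_www_form(URI.parse(url).query || '')
end

# True if any query key is a known tracking parameter.
def tracked?(url)
  query_pairs(url).any? { |key, _| TRACKING_PARAMS.include?(key) }
end

# Return the URL with known tracking parameters removed.
# The query is dropped entirely when nothing else remains.
def untracked_url(url)
  uri  = URI.parse(url)
  kept = query_pairs(url).reject { |key, _| TRACKING_PARAMS.include?(key) }
  uri.query = kept.empty? ? nil : URI.encode_www_form(kept)
  uri.to_s
end
```

This sketch also covers URLs with no query string and URLs whose query consists only of tracking parameters, two cases that the specs shown in this diff do not exercise.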
data/spec/fixtures/head_links.response
ADDED
@@ -0,0 +1,34 @@
|
|
1
|
+
HTTP/1.1 200 OK
|
2
|
+
Server: nginx/0.7.67
|
3
|
+
Date: Fri, 18 Nov 2011 21:46:46 GMT
|
4
|
+
Content-Type: text/html
|
5
|
+
Connection: keep-alive
|
6
|
+
Last-Modified: Mon, 14 Nov 2011 16:53:18 GMT
|
7
|
+
Content-Length: 4987
|
8
|
+
X-Varnish: 2000423390
|
9
|
+
Age: 0
|
10
|
+
Via: 1.1 varnish
|
11
|
+
|
12
|
+
<html>
|
13
|
+
<head>
|
14
|
+
<title>An example page</title>
|
15
|
+
<link
|
16
|
+
rel="canonical"
|
17
|
+
href="http://example.com/canonical-from-head"
|
18
|
+
/>
|
19
|
+
<link rel="stylesheet" href="/stylesheets/screen.css">
|
20
|
+
<link rel="stylesheet" href="//example2.com/stylesheets/screen.css">
|
21
|
+
<link rel="shortcut icon" href="/favicon.ico" type="image/x-icon" />
|
22
|
+
<link rel="shorturl" href="http://gu.com/p/32v5a" />
|
23
|
+
<link
|
24
|
+
rel="stylesheet"
|
25
|
+
type="text/css"
|
26
|
+
href="http://foo/print.css"
|
27
|
+
media="print"
|
28
|
+
class="contrast"
|
29
|
+
/>
|
30
|
+
</head>
|
31
|
+
<body>
|
32
|
+
<h1>Hello World</h1>
|
33
|
+
</body>
|
34
|
+
</html>
|
data/spec/fixtures/protocol_relative.response
CHANGED
@@ -12,6 +12,8 @@ Accept-Ranges: bytes
|
|
12
12
|
<head>
|
13
13
|
<meta charset="utf-8" />
|
14
14
|
<title>Protocol-relative URLs</title>
|
15
|
+
<meta property="og:image" content="//static-secure.guim.co.uk/sys-images/Guardian/Pix/pictures/2011/8/8/1312810126887/gu_192x115.jpg"/>
|
16
|
+
<link rel="shortcut icon" href="//static-secure.guim.co.uk/sys-images/favicon.ico" type="image/x-icon" />
|
15
17
|
</head>
|
16
18
|
<body>
|
17
19
|
<p>Internal links</p>
|
@@ -22,5 +24,8 @@ Accept-Ranges: bytes
|
|
22
24
|
<p>External links</p>
|
23
25
|
<a href="http://google.com">External: normal link</a>
|
24
26
|
<a href="//yahoo.com">External: protocol-relative link</a>
|
27
|
+
|
28
|
+
<p>Images</p>
|
29
|
+
<img src="//example.com/image.jpg" />
|
25
30
|
</body>
|
26
31
|
</html>
|
data/spec/meta_inspector/head_links_spec.rb
ADDED
@@ -0,0 +1,42 @@
|
|
1
|
+
require 'spec_helper'
|
2
|
+
|
3
|
+
describe MetaInspector do
|
4
|
+
|
5
|
+
describe "head_links" do
|
6
|
+
let(:page) { MetaInspector.new('http://example.com/head_links') }
|
7
|
+
let(:page_https) { MetaInspector.new('https://example.com/head_links') }
|
8
|
+
|
9
|
+
it "#head_links" do
|
10
|
+
expect(page.head_links).to eq([
|
11
|
+
{rel: 'canonical', href: 'http://example.com/canonical-from-head'},
|
12
|
+
{rel: 'stylesheet', href: 'http://example.com/stylesheets/screen.css'},
|
13
|
+
{rel: 'stylesheet', href: 'http://example2.com/stylesheets/screen.css'},
|
14
|
+
{rel: 'shortcut icon', href: 'http://example.com/favicon.ico', type: 'image/x-icon'},
|
15
|
+
{rel: 'shorturl', href: 'http://gu.com/p/32v5a'},
|
16
|
+
{rel: 'stylesheet', type: 'text/css', href: 'http://foo/print.css', media: 'print', class: 'contrast'}
|
17
|
+
])
|
18
|
+
end
|
19
|
+
|
20
|
+
it "#stylesheets" do
|
21
|
+
expect(page.stylesheets).to eq([
|
22
|
+
{rel: 'stylesheet', href: 'http://example.com/stylesheets/screen.css'},
|
23
|
+
{rel: 'stylesheet', href: 'http://example2.com/stylesheets/screen.css'},
|
24
|
+
{rel: 'stylesheet', type: 'text/css', href: 'http://foo/print.css', media: 'print', class: 'contrast'}
|
25
|
+
])
|
26
|
+
|
27
|
+
expect(page_https.stylesheets).to eq([
|
28
|
+
{rel: 'stylesheet', href: 'https://example.com/stylesheets/screen.css'},
|
29
|
+
{rel: 'stylesheet', href: 'https://example2.com/stylesheets/screen.css'},
|
30
|
+
{rel: 'stylesheet', type: 'text/css', href: 'http://foo/print.css', media: 'print', class: 'contrast'}
|
31
|
+
])
|
32
|
+
end
|
33
|
+
|
34
|
+
it "#canonical" do
|
35
|
+
expect(page.canonicals).to eq([
|
36
|
+
{rel: 'canonical', href: 'http://example.com/canonical-from-head'}
|
37
|
+
])
|
38
|
+
end
|
39
|
+
|
40
|
+
end
|
41
|
+
|
42
|
+
end
|
data/spec/meta_inspector/images_spec.rb
CHANGED
@@ -123,6 +123,44 @@ describe MetaInspector do
|
|
123
123
|
end
|
124
124
|
end
|
125
125
|
|
126
|
+
describe "images.with_size" do
|
127
|
+
it "should return sorted by area array of [img_url, width, height] using html sizes" do
|
128
|
+
page = MetaInspector.new('http://example.com/largest_image_in_html')
|
129
|
+
|
130
|
+
expect(page.images.with_size).to eq([
|
131
|
+
["http://example.com/largest", 100, 100],
|
132
|
+
["http://example.com/too_narrow", 10, 100],
|
133
|
+
["http://example.com/too_wide", 100, 10],
|
134
|
+
["http://example.com/smaller", 10, 10],
|
135
|
+
["http://example.com/smallest", 1, 1]
|
136
|
+
])
|
137
|
+
end
|
138
|
+
|
139
|
+
it "should return sorted by area array of [img_url, width, height] using actual image sizes" do
|
140
|
+
page = MetaInspector.new('http://example.com/largest_image_using_image_size')
|
141
|
+
|
142
|
+
expect(page.images.with_size).to eq([
|
143
|
+
["http://example.com/100x100", 100, 100],
|
144
|
+
["http://example.com/10x100", 10, 100],
|
145
|
+
["http://example.com/100x10", 100, 10],
|
146
|
+
["http://example.com/10x10", 10, 10],
|
147
|
+
["http://example.com/1x1", 1, 1]
|
148
|
+
])
|
149
|
+
end
|
150
|
+
|
151
|
+
it "should return sorted by area array of [img_url, width, height] without downloading images" do
|
152
|
+
page = MetaInspector.new('http://example.com/largest_image_using_image_size', download_images: false)
|
153
|
+
|
154
|
+
expect(page.images.with_size).to eq([
|
155
|
+
["http://example.com/10x100", 10, 100],
|
156
|
+
["http://example.com/100x10", 100, 10],
|
157
|
+
["http://example.com/1x1", 1, 1],
|
158
|
+
["http://example.com/10x10", 0, 0],
|
159
|
+
["http://example.com/100x100", 0, 0]
|
160
|
+
])
|
161
|
+
end
|
162
|
+
end
|
163
|
+
|
126
164
|
describe "images.largest" do
|
127
165
|
it "should find the largest image on the page using html sizes" do
|
128
166
|
page = MetaInspector.new('http://example.com/largest_image_in_html')
|
@@ -174,4 +212,26 @@ describe MetaInspector do
|
|
174
212
|
expect(page.images.favicon).to eq(nil)
|
175
213
|
end
|
176
214
|
end
|
215
|
+
|
216
|
+
describe 'protocol-relative' do
|
217
|
+
before(:each) do
|
218
|
+
@m_http = MetaInspector.new('http://protocol-relative.com')
|
219
|
+
@m_https = MetaInspector.new('https://protocol-relative.com')
|
220
|
+
end
|
221
|
+
|
222
|
+
it 'should unrelativize images' do
|
223
|
+
expect(@m_http.images.to_a).to eq(['http://example.com/image.jpg'])
|
224
|
+
expect(@m_https.images.to_a).to eq(['https://example.com/image.jpg'])
|
225
|
+
end
|
226
|
+
|
227
|
+
it 'should unrelativize owner suggested image' do
|
228
|
+
expect(@m_http.images.owner_suggested).to eq('http://static-secure.guim.co.uk/sys-images/Guardian/Pix/pictures/2011/8/8/1312810126887/gu_192x115.jpg')
|
229
|
+
expect(@m_https.images.owner_suggested).to eq('https://static-secure.guim.co.uk/sys-images/Guardian/Pix/pictures/2011/8/8/1312810126887/gu_192x115.jpg')
|
230
|
+
end
|
231
|
+
|
232
|
+
it 'should unrelativize favicon' do
|
233
|
+
expect(@m_http.images.favicon).to eq('http://static-secure.guim.co.uk/sys-images/favicon.ico')
|
234
|
+
expect(@m_https.images.favicon).to eq('https://static-secure.guim.co.uk/sys-images/favicon.ico')
|
235
|
+
end
|
236
|
+
end
|
177
237
|
end
|
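The behaviour these protocol-relative specs expect, where a `//host/path` reference takes on the scheme of the page that contains it, can be reproduced with Ruby's standard `URI.join`. This is a sketch of the general technique, not the gem's `URL.absolutify`, whose body is only partly visible in this diff; the `resolve` helper is hypothetical.

```ruby
require 'uri'

# Hypothetical helper: resolve a relative, protocol-relative, or absolute
# reference against the URL of the page that contains it.
def resolve(reference, base_url)
  URI.join(base_url, reference).to_s
end

http_page  = 'http://protocol-relative.com/'
https_page = 'https://protocol-relative.com/'

over_http  = resolve('//example.com/image.jpg', http_page)   # inherits http
over_https = resolve('//example.com/image.jpg', https_page)  # inherits https
relative   = resolve('/favicon.ico', https_page)             # host from the page
absolute   = resolve('http://foo/print.css', https_page)     # left unchanged
```

An already absolute reference keeps its own scheme, which matches the `http://foo/print.css` stylesheet staying on `http` in the `page_https` expectation of the head links spec.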
data/spec/spec_helper.rb
CHANGED
@@ -41,6 +41,10 @@ FakeWeb.register_uri(:get, "http://example.com/10x10", :response => fixture_file
|
|
41
41
|
FakeWeb.register_uri(:get, "http://example.com/100x100", :response => fixture_file("100x100.jpg.response"))
|
42
42
|
FakeWeb.register_uri(:get, "http://www.24-horas.mx/mexico-firma-acuerdo-bilateral-automotriz-con-argentina/", :response => fixture_file("relative_og_image.response"))
|
43
43
|
|
44
|
+
#Used to test canonical URLs in head
|
45
|
+
FakeWeb.register_uri(:get, "http://example.com/head_links", :response => fixture_file("head_links.response"))
|
46
|
+
FakeWeb.register_uri(:get, "https://example.com/head_links", :response => fixture_file("head_links.response"))
|
47
|
+
|
44
48
|
# Used to test best_title logic
|
45
49
|
FakeWeb.register_uri(:get, "http://example.com/title_in_head", :response => fixture_file("title_in_head.response"))
|
46
50
|
FakeWeb.register_uri(:get, "http://example.com/title_in_body", :response => fixture_file("title_in_body.response"))
|
data/spec/url_spec.rb
CHANGED
@@ -36,6 +36,47 @@ describe MetaInspector::URL do
|
|
36
36
|
expect(MetaInspector::URL.new('http://example.com/faqs').root_url).to eq('http://example.com/')
|
37
37
|
end
|
38
38
|
|
39
|
+
it "should return an untracked url" do
|
40
|
+
expect(MetaInspector::URL.new('http://example.com/foo?not_utm_thing=bar&utm_source=1234').untracked_url).to eq('http://example.com/foo?not_utm_thing=bar')
|
41
|
+
expect(MetaInspector::URL.new('http://example.com/foo?not_utm_thing=bar&utm_medium=1234').untracked_url).to eq('http://example.com/foo?not_utm_thing=bar')
|
42
|
+
expect(MetaInspector::URL.new('http://example.com/foo?not_utm_thing=bar&utm_term=1234').untracked_url).to eq('http://example.com/foo?not_utm_thing=bar')
|
43
|
+
expect(MetaInspector::URL.new('http://example.com/foo?not_utm_thing=bar&utm_content=1234').untracked_url).to eq('http://example.com/foo?not_utm_thing=bar')
|
44
|
+
expect(MetaInspector::URL.new('http://example.com/foo?not_utm_thing=bar&utm_campaign=1234').untracked_url).to eq('http://example.com/foo?not_utm_thing=bar')
|
45
|
+
expect(MetaInspector::URL.new('http://example.com/foo?not_utm_thing=bar&utm_source=1234&utm_medium=5678&utm_term=4321&utm_content=9876&utm_campaign=5436').untracked_url).to eq('http://example.com/foo?not_utm_thing=bar')
|
46
|
+
end
|
47
|
+
|
48
|
+
it "should remove tracking parameters from url" do
|
49
|
+
|
50
|
+
tracked_urls = ['http://example.com/foo?not_utm_thing=bar&utm_source=1234',
|
51
|
+
'http://example.com/foo?not_utm_thing=bar&utm_medium=1234',
|
52
|
+
'http://example.com/foo?not_utm_thing=bar&utm_term=1234',
|
53
|
+
'http://example.com/foo?not_utm_thing=bar&utm_content=1234',
|
54
|
+
'http://example.com/foo?not_utm_thing=bar&utm_campaign=1234',
|
55
|
+
'http://example.com/foo?not_utm_thing=bar&utm_source=1234&utm_medium=5678&utm_term=4321&utm_content=9876&utm_campaign=5436'
|
56
|
+
]
|
57
|
+
|
58
|
+
tracked_urls.each do |tracked_url|
|
59
|
+
url = MetaInspector::URL.new(tracked_url)
|
60
|
+
url.untrack!
|
61
|
+
expect(url.url).to eq('http://example.com/foo?not_utm_thing=bar')
|
62
|
+
end
|
63
|
+
end
|
64
|
+
|
65
|
+
it "should say if the url is tracked" do
|
66
|
+
expect(MetaInspector::URL.new('http://example.com/foo?not_utm_thing=bar&utm_source=1234').tracked?).to be true
|
67
|
+
expect(MetaInspector::URL.new('http://example.com/foo?not_utm_thing=bar&utm_medium=1234').tracked?).to be true
|
68
|
+
expect(MetaInspector::URL.new('http://example.com/foo?not_utm_thing=bar&utm_term=1234').tracked?).to be true
|
69
|
+
expect(MetaInspector::URL.new('http://example.com/foo?not_utm_thing=bar&utm_content=1234').tracked?).to be true
|
70
|
+
expect(MetaInspector::URL.new('http://example.com/foo?not_utm_thing=bar&utm_campaign=1234').tracked?).to be true
|
71
|
+
expect(MetaInspector::URL.new('http://example.com/foo?not_utm_thing=bar&utm_source=1234&utm_medium=5678&utm_term=4321&utm_content=9876&utm_campaign=5436').tracked?).to be true
|
72
|
+
|
73
|
+
expect(MetaInspector::URL.new('http://example.com/foo?not_utm_thing=bar&not_utm_source=1234').tracked?).to be false
|
74
|
+
expect(MetaInspector::URL.new('http://example.com/foo?not_utm_thing=bar&not_utm_medium=1234').tracked?).to be false
|
75
|
+
expect(MetaInspector::URL.new('http://example.com/foo?not_utm_thing=bar&not_utm_term=1234').tracked?).to be false
|
76
|
+
expect(MetaInspector::URL.new('http://example.com/foo?not_utm_thing=bar&not_utm_content=1234').tracked?).to be false
|
77
|
+
expect(MetaInspector::URL.new('http://example.com/foo?not_utm_thing=bar&not_utm_campaign=1234').tracked?).to be false
|
78
|
+
end
|
79
|
+
|
39
80
|
describe "url=" do
|
40
81
|
it "should update the url" do
|
41
82
|
url = MetaInspector::URL.new('http://first.com/')
|
metadata
CHANGED
@@ -1,14 +1,14 @@
|
|
1
1
|
--- !ruby/object:Gem::Specification
|
2
2
|
name: metainspector
|
3
3
|
version: !ruby/object:Gem::Version
|
4
|
-
version: 4.
|
4
|
+
version: 4.5.0
|
5
5
|
platform: ruby
|
6
6
|
authors:
|
7
7
|
- Jaime Iniesta
|
8
8
|
autorequire:
|
9
9
|
bindir: bin
|
10
10
|
cert_chain: []
|
11
|
-
date: 2015-
|
11
|
+
date: 2015-05-29 00:00:00.000000000 Z
|
12
12
|
dependencies:
|
13
13
|
- !ruby/object:Gem::Dependency
|
14
14
|
name: nokogiri
|
@@ -232,6 +232,7 @@ files:
|
|
232
232
|
- ".rspec.example"
|
233
233
|
- ".rubocop.yml.example"
|
234
234
|
- ".travis.yml"
|
235
|
+
- CHANGELOG.md
|
235
236
|
- Gemfile
|
236
237
|
- Guardfile
|
237
238
|
- MIT-LICENSE
|
@@ -247,6 +248,7 @@ files:
|
|
247
248
|
- lib/meta_inspector/exceptionable.rb
|
248
249
|
- lib/meta_inspector/parser.rb
|
249
250
|
- lib/meta_inspector/parsers/base.rb
|
251
|
+
- lib/meta_inspector/parsers/head_links.rb
|
250
252
|
- lib/meta_inspector/parsers/images.rb
|
251
253
|
- lib/meta_inspector/parsers/links.rb
|
252
254
|
- lib/meta_inspector/parsers/meta_tags.rb
|
@@ -270,6 +272,7 @@ files:
|
|
270
272
|
- spec/fixtures/example.response
|
271
273
|
- spec/fixtures/facebook.com.response
|
272
274
|
- spec/fixtures/guardian.co.uk.response
|
275
|
+
- spec/fixtures/head_links.response
|
273
276
|
- spec/fixtures/https.facebook.com.response
|
274
277
|
- spec/fixtures/international.response
|
275
278
|
- spec/fixtures/invalid_href.response
|
@@ -305,6 +308,7 @@ files:
|
|
305
308
|
- spec/fixtures/wordpress_site.response
|
306
309
|
- spec/fixtures/youtube.response
|
307
310
|
- spec/fixtures/youtube_short_title.response
|
311
|
+
- spec/meta_inspector/head_links_spec.rb
|
308
312
|
- spec/meta_inspector/images_spec.rb
|
309
313
|
- spec/meta_inspector/links_spec.rb
|
310
314
|
- spec/meta_inspector/meta_inspector_spec.rb
|