metainspector 1.17.3 → 2.0.0
- checksums.yaml +4 -4
- data/README.md +123 -52
- data/lib/meta_inspector.rb +0 -1
- data/lib/meta_inspector/document.rb +5 -4
- data/lib/meta_inspector/parser.rb +38 -69
- data/lib/meta_inspector/version.rb +1 -1
- data/meta_inspector.gemspec +0 -1
- data/spec/document_spec.rb +7 -10
- data/spec/fixtures/meta_tags.response +54 -0
- data/spec/fixtures/youtube.response +1 -1
- data/spec/parser_spec.rb +88 -202
- data/spec/spec_helper.rb +1 -1
- metadata +3 -18
- data/lib/meta_inspector/meta_tags_dynamic_match.rb +0 -18
- data/spec/fixtures/opengraph.response +0 -52
checksums.yaml
CHANGED
@@ -1,7 +1,7 @@
|
|
1
1
|
---
|
2
2
|
SHA1:
|
3
|
-
metadata.gz:
|
4
|
-
data.tar.gz:
|
3
|
+
metadata.gz: f4afd755a0fdc53abb2c3af992bea56021245ec2
|
4
|
+
data.tar.gz: fddcbbb1b151558bf245c9630585553149a93798
|
5
5
|
SHA512:
|
6
|
-
metadata.gz:
|
7
|
-
data.tar.gz:
|
6
|
+
metadata.gz: e3c4cf8afc4de72cf0432cb2c051f3fc38049f1e1e648033c151644d5a3b38a211f829d88cff33f7813ff6d1d18fb945aa6016df9dea286a7b1d8eed37bbc623
|
7
|
+
data.tar.gz: d18e6647d1187ff115a734d5d9ebca9ab6f87eab4b12b91f0541df6db65cb2c2eb5e31b1be298c7540862fdfbdc3b84ca79eb5769401abfd0de2dac2a701d744
|
data/README.md
CHANGED
@@ -1,6 +1,8 @@
|
|
1
1
|
# MetaInspector [![Build Status](https://secure.travis-ci.org/jaimeiniesta/metainspector.png)](http://travis-ci.org/jaimeiniesta/metainspector) [![Dependency Status](https://gemnasium.com/jaimeiniesta/metainspector.png)](https://gemnasium.com/jaimeiniesta/metainspector)
|
2
2
|
|
3
|
-
MetaInspector is a gem for web scraping purposes.
|
3
|
+
MetaInspector is a gem for web scraping purposes.
|
4
|
+
|
5
|
+
You give it an URL, and it lets you easily get its title, links, images, charset, description, keywords, meta tags...
|
4
6
|
|
5
7
|
## See it in action!
|
6
8
|
|
@@ -36,36 +38,124 @@ You can also include the html which will be used as the document to scrape:
|
|
36
38
|
|
37
39
|
Then you can see the scraped data like this:
|
38
40
|
|
39
|
-
page.url
|
40
|
-
page.scheme
|
41
|
-
page.host
|
42
|
-
page.root_url
|
43
|
-
page.title
|
44
|
-
page.links
|
45
|
-
page.internal_links
|
46
|
-
page.external_links
|
47
|
-
page.
|
48
|
-
page.description
|
49
|
-
page.
|
50
|
-
page.image
|
51
|
-
page.images
|
52
|
-
page.feed
|
53
|
-
page.charset
|
54
|
-
page.content_type
|
55
|
-
|
56
|
-
|
57
|
-
|
58
|
-
|
59
|
-
|
60
|
-
page.
|
61
|
-
|
62
|
-
|
63
|
-
|
64
|
-
|
65
|
-
|
66
|
-
|
67
|
-
|
68
|
-
|
41
|
+
page.url # URL of the page
|
42
|
+
page.scheme # Scheme of the page (http, https)
|
43
|
+
page.host # Hostname of the page (like, sitevalidator.com, without the scheme)
|
44
|
+
page.root_url # Root url (scheme + host, like http://sitevalidator.com/)
|
45
|
+
page.title # title of the page, as string
|
46
|
+
page.links # array of strings, with every link found on the page as an absolute URL
|
47
|
+
page.internal_links # array of strings, with every internal link found on the page as an absolute URL
|
48
|
+
page.external_links # array of strings, with every external link found on the page as an absolute URL
|
49
|
+
page.meta['keywords'] # meta keywords, as string
|
50
|
+
page.meta['description'] # meta description, as string
|
51
|
+
page.description # returns the meta description, or the first long paragraph if no meta description is found
|
52
|
+
page.image # Most relevant image, if defined with the og:image meta tag
|
53
|
+
page.images # array of strings, with every img found on the page as an absolute URL
|
54
|
+
page.feed # Get rss or atom links in meta data fields as array
|
55
|
+
page.charset # UTF-8
|
56
|
+
page.content_type # content-type returned by the server when the url was requested
|
57
|
+
|
58
|
+
## Meta tags
|
59
|
+
|
60
|
+
When it comes to meta tags, you have several options:
|
61
|
+
|
62
|
+
page.meta_tags # Gives you all the meta tags by type:
|
63
|
+
# (meta name, meta http-equiv, meta property and meta charset)
|
64
|
+
# As meta tags can be repeated (in the case of 'og:image', for example),
|
65
|
+
# the values returned will be arrays
|
66
|
+
#
|
67
|
+
# For example:
|
68
|
+
#
|
69
|
+
# {
|
70
|
+
'name' => {
|
71
|
+
'keywords' => ['one, two, three'],
|
72
|
+
'description' => ['the description'],
|
73
|
+
'author' => ['Joe Sample'],
|
74
|
+
'robots' => ['index,follow'],
|
75
|
+
'revisit' => ['15 days'],
|
76
|
+
'dc.date.issued' => ['2011-09-15']
|
77
|
+
},
|
78
|
+
|
79
|
+
'http-equiv' => {
|
80
|
+
'content-type' => ['text/html; charset=UTF-8'],
|
81
|
+
'content-style-type' => ['text/css']
|
82
|
+
},
|
83
|
+
|
84
|
+
'property' => {
|
85
|
+
'og:title' => ['An OG title'],
|
86
|
+
'og:type' => ['website'],
|
87
|
+
'og:url' => ['http://example.com/meta-tags'],
|
88
|
+
'og:image' => ['http://example.com/rock.jpg',
|
89
|
+
'http://example.com/rock2.jpg',
|
90
|
+
'http://example.com/rock3.jpg'],
|
91
|
+
'og:image:width' => ['300'],
|
92
|
+
'og:image:height' => ['300', '1000']
|
93
|
+
},
|
94
|
+
|
95
|
+
'charset' => ['UTF-8']
|
96
|
+
}
|
97
|
+
|
98
|
+
As this method returns a hash, you can also take only the key that you need, like in:
|
99
|
+
|
100
|
+
page.meta_tags['property'] # Returns:
|
101
|
+
# {
|
102
|
+
# 'og:title' => ['An OG title'],
|
103
|
+
# 'og:type' => ['website'],
|
104
|
+
# 'og:url' => ['http://example.com/meta-tags'],
|
105
|
+
# 'og:image' => ['http://example.com/rock.jpg',
|
106
|
+
# 'http://example.com/rock2.jpg',
|
107
|
+
# 'http://example.com/rock3.jpg'],
|
108
|
+
# 'og:image:width' => ['300'],
|
109
|
+
# 'og:image:height' => ['300', '1000']
|
110
|
+
# }
|
111
|
+
|
112
|
+
In most cases you will only be interested in the first occurrence of a meta tag, so you can
|
113
|
+
use the singular form of that method:
|
114
|
+
|
115
|
+
page.meta_tag['name'] # Returns:
|
116
|
+
# {
|
117
|
+
# 'keywords' => 'one, two, three',
|
118
|
+
# 'description' => 'the description',
|
119
|
+
# 'author' => 'Joe Sample',
|
120
|
+
# 'robots' => 'index,follow',
|
121
|
+
# 'revisit' => '15 days',
|
122
|
+
# 'dc.date.issued' => '2011-09-15'
|
123
|
+
# }
|
124
|
+
|
125
|
+
Or, as this is also a hash:
|
126
|
+
|
127
|
+
page.meta_tag['name']['keywords'] # Returns 'one, two, three'
|
128
|
+
|
129
|
+
And finally, you can use the shorter `meta` method that will merge the different keys so you have
|
130
|
+
a simpler hash:
|
131
|
+
|
132
|
+
page.meta # Returns:
|
133
|
+
#
|
134
|
+
# {
|
135
|
+
# 'keywords' => 'one, two, three',
|
136
|
+
# 'description' => 'the description',
|
137
|
+
# 'author' => 'Joe Sample',
|
138
|
+
# 'robots' => 'index,follow',
|
139
|
+
# 'revisit' => '15 days',
|
140
|
+
# 'dc.date.issued' => '2011-09-15',
|
141
|
+
# 'content-type' => 'text/html; charset=UTF-8',
|
142
|
+
# 'content-style-type' => 'text/css',
|
143
|
+
# 'og:title' => 'An OG title',
|
144
|
+
# 'og:type' => 'website',
|
145
|
+
# 'og:url' => 'http://example.com/meta-tags',
|
146
|
+
# 'og:image' => 'http://example.com/rock.jpg',
|
147
|
+
# 'og:image:width' => '300',
|
148
|
+
# 'og:image:height' => '300',
|
149
|
+
# 'charset' => 'UTF-8'
|
150
|
+
# }
|
151
|
+
|
152
|
+
This way, you can get most meta tags just like that:
|
153
|
+
|
154
|
+
page.meta['author'] # Returns "Joe Sample"
|
155
|
+
|
156
|
+
Please be aware that all keys are converted to downcase, so it's `'dc.date.issued'` and not `'DC.date.issued'`.
|
157
|
+
|
158
|
+
## Other representations
|
69
159
|
|
70
160
|
You can also access most of the scraped data as a hash:
|
71
161
|
|
@@ -80,25 +170,6 @@ And the full scraped document is accessible from:
|
|
80
170
|
|
81
171
|
page.parsed # Nokogiri doc that you can use it to get any element from the page
|
82
172
|
|
83
|
-
## Opengraph and Twitter card meta tags
|
84
|
-
|
85
|
-
Twitter cards & Open graph tags make it possible for you to attach media experiences to Tweets & Facebook posts. Nowadays most of the content creators add these meta tags to headers to quickly identify content on the page. Sometimes these tags could be nested as well. For example when a site wants to provide information about primary image used on a page it could use
|
86
|
-
|
87
|
-
<meta name="og:image" content="http://www.somedomain.com/assets/images/abc.jpeg">
|
88
|
-
<meta name="og:image:width" content="200">
|
89
|
-
<meta name="twitter:image" value="http://www.somedomain.com/assets/images/abc.jpeg">
|
90
|
-
<meta property="twitter:image:width" value="200">
|
91
|
-
|
92
|
-
Also many sites use name & property, content & value attributes interchangeably. Using MetaInspector accessing this information is as easy as -
|
93
|
-
|
94
|
-
page.meta_og_image
|
95
|
-
page.meta_twitter_image_width
|
96
|
-
|
97
|
-
Note that MetaInspector gives priority to content over value. In other words if there is a tag of the form
|
98
|
-
|
99
|
-
<meta property="og:something" value="100" content="real value">
|
100
|
-
page.meta_og_something #=> "real value"
|
101
|
-
|
102
173
|
## Options
|
103
174
|
|
104
175
|
### Timeout
|
@@ -173,10 +244,10 @@ You can find some sample scripts on the samples folder, including a basic scrapi
|
|
173
244
|
>> page.title
|
174
245
|
=> "MarkupValidator :: site-wide markup validation tool"
|
175
246
|
|
176
|
-
>> page.
|
247
|
+
>> page.meta['description']
|
177
248
|
=> "Site-wide markup validation tool. Validate the markup of your whole site with just one click."
|
178
249
|
|
179
|
-
>> page.
|
250
|
+
>> page.meta['keywords']
|
180
251
|
=> "html, markup, validation, validator, tool, w3c, development, standards, free"
|
181
252
|
|
182
253
|
>> page.links.size
|
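Reviewer sketch for the README change above: the three documented shapes (`meta_tags`, `meta_tag`, `meta`) can be reproduced on a plain hash, which may help when migrating 1.x calls such as `page.meta_og_image` to `page.meta['og:image']`. This is a minimal illustration under stated assumptions, not the gem's code: the sample data is a subset copied from the README example, and `first_values` and `flat_meta` are hypothetical helper names.

```ruby
# Sample data copied from the README example above; no gem or network needed.
META_TAGS = {
  'name'       => { 'keywords' => ['one, two, three'], 'author' => ['Joe Sample'] },
  'http-equiv' => { 'content-type' => ['text/html; charset=UTF-8'] },
  'property'   => { 'og:image' => ['http://example.com/rock.jpg',
                                   'http://example.com/rock2.jpg'] },
  'charset'    => ['UTF-8']
}.freeze

# Hypothetical helper: mirrors the documented meta_tag shape by keeping only
# the first value of every array, recursing into nested hashes.
def first_values(value)
  case value
  when Hash  then value.transform_values { |v| first_values(v) }
  when Array then value.first
  else value
  end
end

# Hypothetical helper: mirrors the documented meta shape by merging the
# per-attribute hashes into one flat hash. On a key clash the later group
# wins (name, then http-equiv, then property, then charset).
def flat_meta(meta_tags)
  single = first_values(meta_tags)
  single['name']
    .merge(single['http-equiv'])
    .merge(single['property'])
    .merge('charset' => single['charset'])
end

meta = flat_meta(META_TAGS)
puts meta['og:image'] # only the first og:image survives the flattening
puts meta['author']
```

Because the flat hash keeps a single value per key, repeated tags such as `og:image` are only fully visible through the array-valued `meta_tags` shape.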
data/lib/meta_inspector.rb
CHANGED
@@ -5,7 +5,6 @@ require File.expand_path(File.join(File.dirname(__FILE__), 'meta_inspector/excep
|
|
5
5
|
require File.expand_path(File.join(File.dirname(__FILE__), 'meta_inspector/exception_log'))
|
6
6
|
require File.expand_path(File.join(File.dirname(__FILE__), 'meta_inspector/request'))
|
7
7
|
require File.expand_path(File.join(File.dirname(__FILE__), 'meta_inspector/url'))
|
8
|
-
require File.expand_path(File.join(File.dirname(__FILE__), 'meta_inspector/meta_tags_dynamic_match'))
|
9
8
|
require File.expand_path(File.join(File.dirname(__FILE__), 'meta_inspector/parser'))
|
10
9
|
require File.expand_path(File.join(File.dirname(__FILE__), 'meta_inspector/document'))
|
11
10
|
require File.expand_path(File.join(File.dirname(__FILE__), 'meta_inspector/deprecations'))
|
data/lib/meta_inspector/document.rb
CHANGED
@@ -39,8 +39,8 @@ module MetaInspector
|
|
39
39
|
extend Forwardable
|
40
40
|
def_delegators :@url, :url, :scheme, :host, :root_url
|
41
41
|
def_delegators :@request, :content_type
|
42
|
-
def_delegators :@parser, :parsed, :
|
43
|
-
:images, :image, :feed, :charset
|
42
|
+
def_delegators :@parser, :parsed, :respond_to?, :title, :description, :links, :internal_links, :external_links,
|
43
|
+
:images, :image, :feed, :charset, :meta_tags, :meta_tag, :meta
|
44
44
|
|
45
45
|
# Returns all document data as a nested Hash
|
46
46
|
def to_hash
|
@@ -53,8 +53,9 @@ module MetaInspector
|
|
53
53
|
'images' => images,
|
54
54
|
'charset' => charset,
|
55
55
|
'feed' => feed,
|
56
|
-
'content_type' => content_type
|
57
|
-
|
56
|
+
'content_type' => content_type,
|
57
|
+
'meta_tags' => meta_tags
|
58
|
+
}
|
58
59
|
end
|
59
60
|
|
60
61
|
# Returns the contents of the document as a string
|
data/lib/meta_inspector/parser.rb
CHANGED
@@ -1,7 +1,6 @@
|
|
1
1
|
# -*- encoding: utf-8 -*-
|
2
2
|
|
3
3
|
require 'nokogiri'
|
4
|
-
require 'hashie/rash'
|
5
4
|
|
6
5
|
module MetaInspector
|
7
6
|
# Parses the document with Nokogiri
|
@@ -12,13 +11,29 @@ module MetaInspector
|
|
12
11
|
options = defaults.merge(options)
|
13
12
|
|
14
13
|
@document = document
|
15
|
-
@data = Hashie::Rash.new
|
16
14
|
@exception_log = options[:exception_log]
|
17
15
|
end
|
18
16
|
|
19
17
|
extend Forwardable
|
20
18
|
def_delegators :@document, :url, :scheme, :host
|
21
19
|
|
20
|
+
def meta_tags
|
21
|
+
{
|
22
|
+
'name' => meta_tags_by('name'),
|
23
|
+
'http-equiv' => meta_tags_by('http-equiv'),
|
24
|
+
'property' => meta_tags_by('property'),
|
25
|
+
'charset' => [charset_from_meta_charset]
|
26
|
+
}
|
27
|
+
end
|
28
|
+
|
29
|
+
def meta_tag
|
30
|
+
convert_each_array_to_first_element_on meta_tags
|
31
|
+
end
|
32
|
+
|
33
|
+
def meta
|
34
|
+
meta_tag['name'].merge(meta_tag['http-equiv']).merge(meta_tag['property']).merge({'charset' => meta_tag['charset']})
|
35
|
+
end
|
36
|
+
|
22
37
|
# Returns the whole parsed document
|
23
38
|
def parsed
|
24
39
|
@parsed ||= Nokogiri::HTML(@document.to_s)
|
@@ -27,11 +42,6 @@ module MetaInspector
|
|
27
42
|
@exception_log << e
|
28
43
|
end
|
29
44
|
|
30
|
-
def to_hash
|
31
|
-
scrape_meta_data
|
32
|
-
@data.to_hash
|
33
|
-
end
|
34
|
-
|
35
45
|
# Returns the parsed document title, from the content of the <title> tag.
|
36
46
|
# This is not the same as the meta_title tag
|
37
47
|
def title
|
@@ -41,7 +51,7 @@ module MetaInspector
|
|
41
51
|
# A description getter that first checks for a meta description and if not present will
|
42
52
|
# guess by looking at the first paragraph with more than 120 characters
|
43
53
|
def description
|
44
|
-
|
54
|
+
meta['description'] || secondary_description
|
45
55
|
end
|
46
56
|
|
47
57
|
# Links found on the page, as absolute URLs
|
@@ -67,8 +77,10 @@ module MetaInspector
|
|
67
77
|
# Returns the parsed image from Facebook's open graph property tags
|
68
78
|
# Most all major websites now define this property and is usually very relevant
|
69
79
|
# See doc at http://developers.facebook.com/docs/opengraph/
|
80
|
+
# If none found, tries with Twitter image
|
81
|
+
# TODO: if not found, try with images.first
|
70
82
|
def image
|
71
|
-
|
83
|
+
meta['og:image'] || meta['twitter:image']
|
72
84
|
end
|
73
85
|
|
74
86
|
# Returns the parsed document meta rss link
|
@@ -83,81 +95,38 @@ module MetaInspector
|
|
83
95
|
@charset ||= (charset_from_meta_charset || charset_from_meta_content_type)
|
84
96
|
end
|
85
97
|
|
86
|
-
def respond_to?(method_name, include_private = false)
|
87
|
-
MetaInspector::MetaTagsDynamicMatch.new(method_name).match? || super
|
88
|
-
end
|
89
|
-
|
90
98
|
private
|
91
99
|
|
92
100
|
def defaults
|
93
101
|
{ exception_log: MetaInspector::ExceptionLog.new }
|
94
102
|
end
|
95
103
|
|
96
|
-
|
97
|
-
|
98
|
-
|
99
|
-
|
100
|
-
|
101
|
-
# with meta http-equiv="...", substituting "_" by "-"
|
102
|
-
def method_missing(method_name)
|
103
|
-
meta_tags_method = MetaInspector::MetaTagsDynamicMatch.new(method_name)
|
104
|
+
def meta_tags_by(attribute)
|
105
|
+
hash = {}
|
106
|
+
parsed.css("meta[@#{attribute}]").map do |tag|
|
107
|
+
name = tag.attributes[attribute].value.downcase rescue nil
|
108
|
+
content = tag.attributes['content'].value rescue nil
|
104
109
|
|
105
|
-
|
106
|
-
|
107
|
-
|
108
|
-
#special treatment for opengraph (og:) and twitter card (twitter:) tags
|
109
|
-
if key =~ /^og_(.*)/
|
110
|
-
key = og_key(key)
|
111
|
-
elsif key =~ /^twitter_(.*)/
|
112
|
-
key.gsub!("_",":")
|
110
|
+
if name && content
|
111
|
+
hash[name] ||= []
|
112
|
+
hash[name] << content
|
113
113
|
end
|
114
|
-
|
115
|
-
scrape_meta_data
|
116
|
-
|
117
|
-
@data.meta.name && (@data.meta.name[key.downcase]) || (@data.meta.property && @data.meta.property[key.downcase])
|
118
|
-
else
|
119
|
-
super
|
120
|
-
end
|
121
|
-
end
|
122
|
-
|
123
|
-
# Not all OG keys can be directly translated to meta tags method names replacing _ by : as they include the _ in the name
|
124
|
-
# This is going to be deprecated and replaced soon by a simpler, clearer method, like page.meta['og:site_name']
|
125
|
-
def og_key(key)
|
126
|
-
case key
|
127
|
-
when "og_site_name"
|
128
|
-
"og:site_name"
|
129
|
-
when "og_image_secure_url"
|
130
|
-
"og:image:secure_url"
|
131
|
-
when "og_video_secure_url"
|
132
|
-
"og:video:secure_url"
|
133
|
-
when "og_audio_secure_url"
|
134
|
-
"og:audio:secure_url"
|
135
|
-
else
|
136
|
-
key.gsub("_", ":")
|
137
114
|
end
|
115
|
+
hash
|
138
116
|
end
|
139
117
|
|
140
|
-
|
141
|
-
|
142
|
-
|
143
|
-
|
144
|
-
|
145
|
-
|
146
|
-
|
118
|
+
def convert_each_array_to_first_element_on(hash)
|
119
|
+
hash.each_pair do |k, v|
|
120
|
+
hash[k] = if v.is_a?(Hash)
|
121
|
+
convert_each_array_to_first_element_on(v)
|
122
|
+
elsif v.is_a?(Array)
|
123
|
+
v.first
|
124
|
+
else
|
125
|
+
v
|
147
126
|
end
|
148
127
|
end
|
149
128
|
end
|
150
129
|
|
151
|
-
# Store meta tag value, looking at meta name or meta property
|
152
|
-
def get_meta_name_or_property(element)
|
153
|
-
name_or_property = element.attributes["name"] ? "name" : (element.attributes["property"] ? "property" : nil)
|
154
|
-
content_or_value = element.attributes["content"] ? "content" : (element.attributes["value"] ? "value" : nil)
|
155
|
-
|
156
|
-
if !name_or_property.nil? && !content_or_value.nil?
|
157
|
-
@data.meta.name[element.attributes[name_or_property].value.downcase] = element.attributes[content_or_value].value
|
158
|
-
end
|
159
|
-
end
|
160
|
-
|
161
130
|
# Look for the first <p> block with 120 characters or more
|
162
131
|
def secondary_description
|
163
132
|
first_long_paragraph = parsed_search('//p[string-length() >= 120]').first
|
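Reviewer sketch for the `meta_tags_by` hunk above: the grouping rule it appears to implement (downcase the key, collect every `content` value into an array, skip tags that lack either piece) can be tried without Nokogiri. The tag list below is a hypothetical stand-in for parsed `<meta>` nodes, and `group_by_attribute` is an illustrative name, not a method of the gem.

```ruby
# Hypothetical stand-in for parsed <meta> elements: each tag is a plain hash
# of attribute name => attribute value (the real code walks Nokogiri nodes).
TAGS = [
  { 'name' => 'DC.date.issued', 'content' => '2011-09-15' },
  { 'property' => 'og:image', 'content' => 'http://example.com/rock.jpg' },
  { 'property' => 'og:image', 'content' => 'http://example.com/rock2.jpg' },
  { 'property' => 'og:something', 'value' => '100' }, # no content attribute
  { 'charset' => 'UTF-8' }
].freeze

# Groups the content of every tag carrying +attribute+ under its downcased key.
# Tags without that attribute, or without a content attribute, are skipped.
def group_by_attribute(tags, attribute)
  tags.each_with_object({}) do |tag, grouped|
    key     = tag[attribute]&.downcase
    content = tag['content']
    next unless key && content

    (grouped[key] ||= []) << content
  end
end

p group_by_attribute(TAGS, 'name')
p group_by_attribute(TAGS, 'property')
```

One consequence worth checking against the real parser: a tag that only carries a `value` attribute, which the 1.x README described as a supported fallback, is dropped by this rule because only `content` is read.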
data/meta_inspector.gemspec
CHANGED
@@ -16,7 +16,6 @@ Gem::Specification.new do |gem|
|
|
16
16
|
gem.version = MetaInspector::VERSION
|
17
17
|
|
18
18
|
gem.add_dependency 'nokogiri', '~> 1.6'
|
19
|
-
gem.add_dependency 'rash', '~> 0.4.0'
|
20
19
|
gem.add_dependency 'open_uri_redirections', '~> 0.1.4'
|
21
20
|
gem.add_dependency 'addressable', '~> 2.3.5'
|
22
21
|
|
data/spec/document_spec.rb
CHANGED
@@ -46,16 +46,13 @@ describe MetaInspector::Document do
|
|
46
46
|
"charset" => "utf-8",
|
47
47
|
"feed" => "http://feeds.feedburner.com/PageRankAlert",
|
48
48
|
"content_type" =>"text/html",
|
49
|
-
"
|
50
|
-
|
51
|
-
|
52
|
-
|
53
|
-
|
54
|
-
|
55
|
-
|
56
|
-
},
|
57
|
-
"property"=>{}
|
58
|
-
}
|
49
|
+
"meta_tags" => { "name" => { "description" => ["Track your PageRank(TM) changes and receive alerts by email"],
|
50
|
+
"keywords" => ["pagerank, seo, optimization, google"], "robots"=>["all,follow"],
|
51
|
+
"csrf-param" => ["authenticity_token"],
|
52
|
+
"csrf-token" => ["iW1/w+R8zrtDkhOlivkLZ793BN04Kr3X/pS+ixObHsE="] },
|
53
|
+
"http-equiv" => {},
|
54
|
+
"property" => {},
|
55
|
+
"charset" => ["utf-8"] }
|
59
56
|
}
|
60
57
|
end
|
61
58
|
|
data/spec/fixtures/meta_tags.response
ADDED
@@ -0,0 +1,54 @@
|
|
1
|
+
HTTP/1.1 200 OK
|
2
|
+
Age: 13
|
3
|
+
Cache-Control: max-age=120
|
4
|
+
Content-Type: text/html
|
5
|
+
Date: Mon, 06 Jan 2014 12:47:42 GMT
|
6
|
+
Expires: Mon, 06 Jan 2014 12:49:28 GMT
|
7
|
+
Server: Apache/2.2.14 (Ubuntu)
|
8
|
+
Vary: Accept-Encoding
|
9
|
+
Via: 1.1 varnish
|
10
|
+
X-Powered-By: PHP/5.3.2-1ubuntu4.22
|
11
|
+
X-Varnish: 1188792404 1188790413
|
12
|
+
Content-Length: 40571
|
13
|
+
Connection: keep-alive
|
14
|
+
|
15
|
+
<!DOCTYPE html>
|
16
|
+
<html xmlns="http://www.w3.org/1999/xhtml" xmlns:fb="http://www.facebook.com/2008/fbml">
|
17
|
+
<head>
|
18
|
+
<!-- meta name examples -->
|
19
|
+
|
20
|
+
<meta name="keywords" content="one, two, three" />
|
21
|
+
<meta name="description" content="the description" />
|
22
|
+
<meta name="author" content="Joe Sample" />
|
23
|
+
<meta name="robots" content="index,follow" />
|
24
|
+
<meta name="revisit" content="15 days" />
|
25
|
+
<meta name="DC.date.issued" content="2011-09-15">
|
26
|
+
|
27
|
+
<!-- meta http-equiv examples -->
|
28
|
+
|
29
|
+
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
|
30
|
+
<meta http-equiv="Content-Style-Type" content="text/css" />
|
31
|
+
|
32
|
+
<!-- meta charset examples -->
|
33
|
+
|
34
|
+
<meta charset="UTF-8" />
|
35
|
+
|
36
|
+
<!-- meta property examples -->
|
37
|
+
|
38
|
+
<meta property="og:title" content="An OG title" />
|
39
|
+
<meta property="og:type" content="website" />
|
40
|
+
<meta property="og:url" content="http://example.com/meta-tags" />
|
41
|
+
|
42
|
+
<!-- meta properties can be repeated, like in this example from http://open.me -->
|
43
|
+
|
44
|
+
<meta property="og:image" content="http://example.com/rock.jpg" />
|
45
|
+
<meta property="og:image:width" content="300" />
|
46
|
+
<meta property="og:image:height" content="300" />
|
47
|
+
<meta property="og:image" content="http://example.com/rock2.jpg" />
|
48
|
+
<meta property="og:image" content="http://example.com/rock3.jpg" />
|
49
|
+
<meta property="og:image:height" content="1000" />
|
50
|
+
</head>
|
51
|
+
<body>
|
52
|
+
<p>A sample page with many types of meta tags</p>
|
53
|
+
</body>
|
54
|
+
</html>
|
data/spec/fixtures/youtube.response
CHANGED
@@ -38,7 +38,7 @@ var yt = yt || {};yt.timing = yt.timing || {};yt.timing.tick = function(label, o
|
|
38
38
|
<meta name="title" content="4. Far Cry 3 - Ubisoft E3 2011 Press Conference HD 1080p">
|
39
39
|
|
40
40
|
|
41
|
-
<meta name="description" content="">
|
41
|
+
<meta name="description" content="This is Youtube">
|
42
42
|
|
43
43
|
|
44
44
|
<meta name="keywords" content="FARCRY, Ubisoft, E3, 2011, Press, Conference, HD, 1080p">
|
data/spec/parser_spec.rb
CHANGED
@@ -21,7 +21,6 @@ describe MetaInspector::Parser do
|
|
21
21
|
it "should find the og image" do
|
22
22
|
@m = MetaInspector::Parser.new(doc 'http://www.theonion.com/articles/apple-claims-new-iphone-only-visible-to-most-loyal,2772/')
|
23
23
|
@m.image.should == "http://o.onionstatic.com/images/articles/article/2772/Apple-Claims-600w-R_jpg_130x110_q85.jpg"
|
24
|
-
@m.meta_og_image.should == "http://o.onionstatic.com/images/articles/article/2772/Apple-Claims-600w-R_jpg_130x110_q85.jpg"
|
25
24
|
end
|
26
25
|
|
27
26
|
it "should find image on youtube" do
|
@@ -71,16 +70,16 @@ describe MetaInspector::Parser do
|
|
71
70
|
@m.feed.should == nil
|
72
71
|
end
|
73
72
|
end
|
73
|
+
end
|
74
74
|
|
75
|
-
|
76
|
-
|
77
|
-
|
78
|
-
|
75
|
+
describe '#description' do
|
76
|
+
it "should find description from meta description" do
|
77
|
+
page = MetaInspector::Parser.new(doc 'http://www.youtube.com/watch?v=iaGSSrp49uc')
|
78
|
+
|
79
|
+
page.description.should == "This is Youtube"
|
79
80
|
end
|
80
|
-
end
|
81
81
|
|
82
|
-
|
83
|
-
it "should find a secondary description" do
|
82
|
+
it "should find a secondary description if no meta description" do
|
84
83
|
@m = MetaInspector::Parser.new(doc 'http://theonion-no-description.com')
|
85
84
|
@m.description.should == "SAN FRANCISCO—In a move expected to revolutionize the mobile device industry, Apple launched its fastest and most powerful iPhone to date Tuesday, an innovative new model that can only be seen by the company's hippest and most dedicated customers. This is secondary text picked up because of a missing meta description."
|
86
85
|
end
|
@@ -267,188 +266,87 @@ describe MetaInspector::Parser do
|
|
267
266
|
end
|
268
267
|
end
|
269
268
|
|
270
|
-
describe '
|
271
|
-
|
272
|
-
|
273
|
-
|
274
|
-
|
275
|
-
|
276
|
-
|
277
|
-
|
278
|
-
|
279
|
-
|
280
|
-
|
281
|
-
|
282
|
-
|
283
|
-
|
284
|
-
|
285
|
-
|
286
|
-
|
287
|
-
|
288
|
-
|
289
|
-
|
290
|
-
|
291
|
-
|
292
|
-
|
293
|
-
|
294
|
-
|
295
|
-
|
296
|
-
|
297
|
-
|
298
|
-
|
299
|
-
|
300
|
-
|
301
|
-
|
302
|
-
end
|
303
|
-
|
304
|
-
it "
|
305
|
-
|
306
|
-
|
307
|
-
|
308
|
-
|
309
|
-
|
310
|
-
|
311
|
-
|
312
|
-
|
313
|
-
|
314
|
-
|
315
|
-
|
316
|
-
|
317
|
-
|
318
|
-
|
319
|
-
|
320
|
-
|
321
|
-
|
322
|
-
|
323
|
-
|
324
|
-
|
325
|
-
|
326
|
-
|
327
|
-
|
328
|
-
|
329
|
-
|
330
|
-
|
331
|
-
|
332
|
-
|
333
|
-
|
334
|
-
|
335
|
-
|
336
|
-
|
337
|
-
|
338
|
-
|
339
|
-
|
340
|
-
|
341
|
-
|
342
|
-
|
343
|
-
|
344
|
-
|
345
|
-
|
346
|
-
|
347
|
-
|
348
|
-
|
349
|
-
|
350
|
-
|
351
|
-
@m.meta_twitter_site.should == "@youtube"
|
352
|
-
end
|
353
|
-
|
354
|
-
it "should find a meta_twitter_player_width" do
|
355
|
-
@m = MetaInspector::Parser.new(doc 'http://www.youtube.com/watch?v=iaGSSrp49uc')
|
356
|
-
@m.meta_twitter_player_width.should == "1920"
|
357
|
-
end
|
358
|
-
|
359
|
-
it "should not find a meta_twitter_dummy" do
|
360
|
-
@m = MetaInspector::Parser.new(doc 'http://www.youtube.com/watch?v=iaGSSrp49uc')
|
361
|
-
@m.meta_twitter_dummy.should == nil
|
362
|
-
end
|
363
|
-
|
364
|
-
describe "opengraph meta tags" do
|
365
|
-
before(:each) do
|
366
|
-
@m = MetaInspector::Parser.new(doc 'http://example.com/opengraph')
|
367
|
-
end
|
368
|
-
|
369
|
-
it "should find a meta og:title" do
|
370
|
-
@m.meta_og_title.should == "An OG title"
|
371
|
-
end
|
372
|
-
|
373
|
-
it "should find a meta og:type" do
|
374
|
-
@m.meta_og_type.should == "website"
|
375
|
-
end
|
376
|
-
|
377
|
-
it "should find a meta og:url" do
|
378
|
-
@m.meta_og_url.should == "http://example.com/opengraph"
|
379
|
-
end
|
380
|
-
|
381
|
-
it "should find a meta og:description" do
|
382
|
-
@m.meta_og_description.should == "Sean Connery found fame and fortune"
|
383
|
-
end
|
384
|
-
|
385
|
-
it "should find a meta og:determiner" do
|
386
|
-
@m.meta_og_determiner.should == "the"
|
387
|
-
end
|
388
|
-
|
389
|
-
it "should find a meta og:locale" do
|
390
|
-
@m.meta_og_locale.should == "en_GB"
|
391
|
-
end
|
392
|
-
|
393
|
-
it "should find a meta og:locale:alternate" do
|
394
|
-
@m.meta_og_locale_alternate.should == "fr_FR"
|
395
|
-
end
|
396
|
-
|
397
|
-
it "should find a meta og:site_name" do
|
398
|
-
@m.meta_og_site_name.should == "IMDb"
|
399
|
-
end
|
400
|
-
|
401
|
-
it "should find a meta og:image" do
|
402
|
-
@m.meta_og_image.should == "http://example.com/ogp.jpg"
|
403
|
-
end
|
404
|
-
|
405
|
-
it "should find a meta og:image:secure_url" do
|
406
|
-
@m.meta_og_image_secure_url.should == "https://secure.example.com/ogp.jpg"
|
407
|
-
end
|
408
|
-
|
409
|
-
it "should find a meta og:image:type" do
|
410
|
-
@m.meta_og_image_type.should == "image/jpeg"
|
411
|
-
end
|
412
|
-
|
413
|
-
it "should find a meta og:image:width" do
|
414
|
-
@m.meta_og_image_width.should == "400"
|
415
|
-
end
|
416
|
-
|
417
|
-
it "should find a meta og:image:height" do
|
418
|
-
@m.meta_og_image_height.should == "300"
|
419
|
-
end
|
420
|
-
|
421
|
-
it "should find a meta og:video" do
|
422
|
-
@m.meta_og_video.should == "http://example.com/movie.swf"
|
423
|
-
end
|
424
|
-
|
425
|
-
it "should find a meta og:video:secure_url" do
|
426
|
-
@m.meta_og_video_secure_url.should == "https://secure.example.com/movie.swf"
|
427
|
-
end
|
428
|
-
|
429
|
-
it "should find a meta og:video:type" do
|
430
|
-
@m.meta_og_video_type.should == "application/x-shockwave-flash"
|
431
|
-
end
|
432
|
-
|
433
|
-
it "should find a meta og:video:width" do
|
434
|
-
@m.meta_og_video_width.should == "400"
|
435
|
-
end
|
436
|
-
|
437
|
-
it "should find a meta og:video:height" do
|
438
|
-
@m.meta_og_video_height.should == "300"
|
439
|
-
end
|
440
|
-
|
441
|
-
it "should find a meta og:audio" do
|
442
|
-
@m.meta_og_audio.should == "http://example.com/sound.mp3"
|
443
|
-
end
|
444
|
-
|
445
|
-
it "should find a meta og:video:secure_url" do
|
446
|
-
@m.meta_og_audio_secure_url.should == "https://secure.example.com/sound.mp3"
|
447
|
-
end
|
448
|
-
|
449
|
-
it "should find a meta og:audio:type" do
|
450
|
-
@m.meta_og_audio_type.should == "audio/mpeg"
|
451
|
-
end
|
269
|
+
describe 'Getting meta tags' do
|
270
|
+
let(:page) { MetaInspector::Parser.new(doc 'http://example.com/meta-tags') }
|
271
|
+
|
272
|
+
it "#meta_tags" do
|
273
|
+
page.meta_tags.should == {
|
274
|
+
'name' => {
|
275
|
+
'keywords' => ['one, two, three'],
|
276
|
+
'description' => ['the description'],
|
277
|
+
'author' => ['Joe Sample'],
|
278
|
+
'robots' => ['index,follow'],
|
279
|
+
'revisit' => ['15 days'],
|
280
|
+
'dc.date.issued' => ['2011-09-15']
|
281
|
+
},
|
282
|
+
|
283
|
+
'http-equiv' => {
|
284
|
+
'content-type' => ['text/html; charset=UTF-8'],
|
285
|
+
'content-style-type' => ['text/css']
|
286
|
+
},
|
287
|
+
|
288
|
+
'property' => {
|
289
|
+
'og:title' => ['An OG title'],
|
290
|
+
'og:type' => ['website'],
|
291
|
+
'og:url' => ['http://example.com/meta-tags'],
|
292
|
+
'og:image' => ['http://example.com/rock.jpg',
|
293
|
+
'http://example.com/rock2.jpg',
|
294
|
+
'http://example.com/rock3.jpg'],
|
295
|
+
'og:image:width' => ['300'],
|
296
|
+
'og:image:height' => ['300', '1000']
|
297
|
+
},
|
298
|
+
|
299
|
+
'charset' => ['UTF-8']
|
300
|
+
}
|
301
|
+
end
|
302
|
+
|
303
|
+
it "#meta_tag" do
|
304
|
+
page.meta_tag.should == {
|
305
|
+
'name' => {
|
306
|
+
'keywords' => 'one, two, three',
|
307
|
+
'description' => 'the description',
|
308
|
+
'author' => 'Joe Sample',
|
309
|
+
'robots' => 'index,follow',
|
310
|
+
'revisit' => '15 days',
|
311
|
+
'dc.date.issued' => '2011-09-15'
|
312
|
+
},
|
313
|
+
|
314
|
+
'http-equiv' => {
|
315
|
+
'content-type' => 'text/html; charset=UTF-8',
|
316
|
+
'content-style-type' => 'text/css'
|
317
|
+
},
|
318
|
+
|
319
|
+
'property' => {
|
320
|
+
'og:title' => 'An OG title',
|
321
|
+
'og:type' => 'website',
|
322
|
+
'og:url' => 'http://example.com/meta-tags',
|
323
|
+
'og:image' => 'http://example.com/rock.jpg',
|
324
|
+
'og:image:width' => '300',
|
325
|
+
'og:image:height' => '300'
|
326
|
+
},
|
327
|
+
|
328
|
+
'charset' => 'UTF-8'
|
329
|
+
}
|
330
|
+
end
|
331
|
+
|
332
|
+
it "#meta" do
|
333
|
+
page.meta.should == {
|
334
|
+
'keywords' => 'one, two, three',
|
335
|
+
'description' => 'the description',
|
336
|
+
'author' => 'Joe Sample',
|
337
|
+
'robots' => 'index,follow',
|
338
|
+
'revisit' => '15 days',
|
339
|
+
'dc.date.issued' => '2011-09-15',
|
340
|
+
'content-type' => 'text/html; charset=UTF-8',
|
341
|
+
'content-style-type' => 'text/css',
|
342
|
+
'og:title' => 'An OG title',
|
343
|
+
'og:type' => 'website',
|
344
|
+
'og:url' => 'http://example.com/meta-tags',
|
345
|
+
'og:image' => 'http://example.com/rock.jpg',
|
346
|
+
'og:image:width' => '300',
|
347
|
+
'og:image:height' => '300',
|
348
|
+
'charset' => 'UTF-8'
|
349
|
+
}
|
452
350
|
end
|
453
351
|
end
|
454
352
|
|
@@ -469,18 +367,6 @@ describe MetaInspector::Parser do
|
|
469
367
|
end
|
470
368
|
end
|
471
369
|
|
472
|
-
describe 'to_hash' do
|
473
|
-
it "should return a hash with all the values set" do
|
474
|
-
@m = MetaInspector::Parser.new(doc 'http://pagerankalert.com')
|
475
|
-
@m.to_hash.should == { "meta" => { "name" => { "description" => "Track your PageRank(TM) changes and receive alerts by email",
|
476
|
-
"keywords" => "pagerank, seo, optimization, google",
|
477
|
-
"robots" => "all,follow",
|
478
|
-
"csrf_param" => "authenticity_token",
|
479
|
-
"csrf_token" => "iW1/w+R8zrtDkhOlivkLZ793BN04Kr3X/pS+ixObHsE="},
|
480
|
-
"property"=>{}}}
|
481
|
-
end
|
482
|
-
end
|
483
|
-
|
484
370
|
private
|
485
371
|
|
486
372
|
def doc(url, options = {})
|
data/spec/spec_helper.rb
CHANGED
@@ -41,7 +41,7 @@ FakeWeb.register_uri(:get, "http://charset002.com", :response => fixture_file("c
|
|
41
41
|
FakeWeb.register_uri(:get, "http://www.inkthemes.com/", :response => fixture_file("wordpress_site.response"))
|
42
42
|
FakeWeb.register_uri(:get, "http://pagerankalert.com/image.png", :body => "Image", :content_type => "image/png")
|
43
43
|
FakeWeb.register_uri(:get, "http://pagerankalert.com/file.tar.gz", :body => "Image", :content_type => "application/x-gzip")
|
44
|
-
FakeWeb.register_uri(:get, "http://example.com/
|
44
|
+
FakeWeb.register_uri(:get, "http://example.com/meta-tags", :response => fixture_file("meta_tags.response"))
|
45
45
|
|
46
46
|
# These examples are used to test relative links
|
47
47
|
FakeWeb.register_uri(:get, "http://relative.com/", :response => fixture_file("relative_links.response"))
|
metadata
CHANGED
@@ -1,14 +1,14 @@
 --- !ruby/object:Gem::Specification
 name: metainspector
 version: !ruby/object:Gem::Version
-  version:
+  version: 2.0.0
 platform: ruby
 authors:
 - Jaime Iniesta
 autorequire:
 bindir: bin
 cert_chain: []
-date: 2014-01-
+date: 2014-01-20 00:00:00.000000000 Z
 dependencies:
 - !ruby/object:Gem::Dependency
   name: nokogiri
@@ -24,20 +24,6 @@ dependencies:
     - - ~>
       - !ruby/object:Gem::Version
         version: '1.6'
-- !ruby/object:Gem::Dependency
-  name: rash
-  requirement: !ruby/object:Gem::Requirement
-    requirements:
-    - - ~>
-      - !ruby/object:Gem::Version
-        version: 0.4.0
-  type: :runtime
-  prerelease: false
-  version_requirements: !ruby/object:Gem::Requirement
-    requirements:
-    - - ~>
-      - !ruby/object:Gem::Version
-        version: 0.4.0
 - !ruby/object:Gem::Dependency
   name: open_uri_redirections
   requirement: !ruby/object:Gem::Requirement
@@ -142,7 +128,6 @@ files:
 - lib/meta_inspector/document.rb
 - lib/meta_inspector/exception_log.rb
 - lib/meta_inspector/exceptionable.rb
-- lib/meta_inspector/meta_tags_dynamic_match.rb
 - lib/meta_inspector/parser.rb
 - lib/meta_inspector/request.rb
 - lib/meta_inspector/url.rb
@@ -167,8 +152,8 @@ files:
 - spec/fixtures/iteh.at.response
 - spec/fixtures/malformed_href.response
 - spec/fixtures/markupvalidator_faqs.response
+- spec/fixtures/meta_tags.response
 - spec/fixtures/nonhttp.response
-- spec/fixtures/opengraph.response
 - spec/fixtures/pagerankalert.com.response
 - spec/fixtures/protocol_relative.response
 - spec/fixtures/relative_links.response
data/lib/meta_inspector/meta_tags_dynamic_match.rb
DELETED
@@ -1,18 +0,0 @@
-module MetaInspector
-
-  # Encapsulates matching for method_missing and respond_to? for meta tags methods
-  class MetaTagsDynamicMatch
-    attr_reader :meta_tag
-
-    def initialize(method_name)
-      if method_name.to_s =~ /^meta_(.+)/
-        @meta_tag = $1
-      end
-    end
-
-    def match?
-      @meta_tag
-    end
-
-  end
-end
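Reviewer note: the comment in the removed file says the class backed `method_missing` and `respond_to?` for `meta_*` methods. The sketch below shows one way such a matcher could be wired into a host class. The matcher body is copied from the removed lines (without the `MetaInspector` namespace), but `TinyParser`, its `@meta` hash, and the use of `respond_to_missing?` (Ruby's usual hook behind `respond_to?`) are assumptions made for illustration, not the gem's actual 1.x parser code.

```ruby
# Matcher as it appears in the removed file: remembers the part of the
# method name that follows the "meta_" prefix, or nil when there is none.
class MetaTagsDynamicMatch
  attr_reader :meta_tag

  def initialize(method_name)
    if method_name.to_s =~ /^meta_(.+)/
      @meta_tag = $1
    end
  end

  def match?
    @meta_tag
  end
end

# Hypothetical host class (an assumption, not part of the gem): resolves
# meta_* calls against a plain hash of meta tag names to contents.
class TinyParser
  def initialize(meta)
    @meta = meta
  end

  # Answer meta_<name> calls from the hash; anything else falls through
  # to the default behaviour and raises NoMethodError.
  def method_missing(method_name, *args, &block)
    match = MetaTagsDynamicMatch.new(method_name)
    return @meta[match.meta_tag] if match.match?
    super
  end

  # Keep respond_to? consistent with method_missing.
  def respond_to_missing?(method_name, include_private = false)
    !!MetaTagsDynamicMatch.new(method_name).match? || super
  end
end

parser = TinyParser.new("description" => "Track your PageRank(TM) changes")
puts parser.meta_description
puts parser.respond_to?(:meta_keywords)
```

With this wiring, any `meta_`-prefixed name is accepted, so an unknown tag returns `nil` rather than raising.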
data/spec/fixtures/opengraph.response
DELETED
@@ -1,52 +0,0 @@
-HTTP/1.1 200 OK
-Age: 13
-Cache-Control: max-age=120
-Content-Type: text/html
-Date: Mon, 06 Jan 2014 12:47:42 GMT
-Expires: Mon, 06 Jan 2014 12:49:28 GMT
-Server: Apache/2.2.14 (Ubuntu)
-Vary: Accept-Encoding
-Via: 1.1 varnish
-X-Powered-By: PHP/5.3.2-1ubuntu4.22
-X-Varnish: 1188792404 1188790413
-Content-Length: 40571
-Connection: keep-alive
-
-<!DOCTYPE html>
-<html xmlns="http://www.w3.org/1999/xhtml" xmlns:fb="http://www.facebook.com/2008/fbml">
-<head>
-<meta http-equiv="Content-type" content="text/html; charset=utf-8">
-
-<!-- Basic OG Metadata -->
-<meta property="og:title" content="An OG title" />
-<meta property="og:type" content="website" />
-<meta property="og:url" content="http://example.com/opengraph" />
-
-<!-- Optional OG Metadata -->
-<meta property="og:description" content="Sean Connery found fame and fortune" />
-<meta property="og:determiner" content="the" />
-<meta property="og:locale" content="en_GB" />
-<meta property="og:locale:alternate" content="fr_FR" />
-<meta property="og:site_name" content="IMDb" />
-
-<!-- Structured OG Properties -->
-<meta property="og:image" content="http://example.com/ogp.jpg" />
-<meta property="og:image:secure_url" content="https://secure.example.com/ogp.jpg" />
-<meta property="og:image:type" content="image/jpeg" />
-<meta property="og:image:width" content="400" />
-<meta property="og:image:height" content="300" />
-
-<meta property="og:video" content="http://example.com/movie.swf" />
-<meta property="og:video:secure_url" content="https://secure.example.com/movie.swf" />
-<meta property="og:video:type" content="application/x-shockwave-flash" />
-<meta property="og:video:width" content="400" />
-<meta property="og:video:height" content="300" />
-
-<meta property="og:audio" content="http://example.com/sound.mp3" />
-<meta property="og:audio:secure_url" content="https://secure.example.com/sound.mp3" />
-<meta property="og:audio:type" content="audio/mpeg" />
-</head>
-<body>
-<p>A sample page with many Open Graph meta tags</p>
-</body>
-</html>