metainspector 1.17.3 → 2.0.0

checksums.yaml CHANGED
@@ -1,7 +1,7 @@
1
1
  ---
2
2
  SHA1:
3
- metadata.gz: 5c94b07066d8b0080029d5e93808a4388b716575
4
- data.tar.gz: 33135740e3e740e21c4ccc011a44a3466c73a926
3
+ metadata.gz: f4afd755a0fdc53abb2c3af992bea56021245ec2
4
+ data.tar.gz: fddcbbb1b151558bf245c9630585553149a93798
5
5
  SHA512:
6
- metadata.gz: 4c3ffda64efceaaaa9631178751df36a0316d17176ce0788ce6256fbd96ac5347f8fc9c6b7019c3942bcd746a2f8868f78fdaa7d7aab83d4314a6a01dd9522dd
7
- data.tar.gz: 38c3fd01c8c156c82985c6b6972cec1002d90135f4381ab53f9e69e4dd6595bc2b22c8a530c2035703aabc318dfb9f37abf62396502dae12b994183aea091db2
6
+ metadata.gz: e3c4cf8afc4de72cf0432cb2c051f3fc38049f1e1e648033c151644d5a3b38a211f829d88cff33f7813ff6d1d18fb945aa6016df9dea286a7b1d8eed37bbc623
7
+ data.tar.gz: d18e6647d1187ff115a734d5d9ebca9ab6f87eab4b12b91f0541df6db65cb2c2eb5e31b1be298c7540862fdfbdc3b84ca79eb5769401abfd0de2dac2a701d744
data/README.md CHANGED
@@ -1,6 +1,8 @@
1
1
  # MetaInspector [![Build Status](https://secure.travis-ci.org/jaimeiniesta/metainspector.png)](http://travis-ci.org/jaimeiniesta/metainspector) [![Dependency Status](https://gemnasium.com/jaimeiniesta/metainspector.png)](https://gemnasium.com/jaimeiniesta/metainspector)
2
2
 
3
- MetaInspector is a gem for web scraping purposes. You give it an URL, and it lets you easily get its title, links, images, charset, description, keywords, meta tags...
3
+ MetaInspector is a gem for web scraping purposes.
4
+
5
+ You give it a URL, and it lets you easily get its title, links, images, charset, description, keywords, meta tags...
4
6
 
5
7
  ## See it in action!
6
8
 
@@ -36,36 +38,124 @@ You can also include the html which will be used as the document to scrape:
36
38
 
37
39
  Then you can see the scraped data like this:
38
40
 
39
- page.url # URL of the page
40
- page.scheme # Scheme of the page (http, https)
41
- page.host # Hostname of the page (like, sitevalidator.com, without the scheme)
42
- page.root_url # Root url (scheme + host, like http://sitevalidator.com/)
43
- page.title # title of the page, as string
44
- page.links # array of strings, with every link found on the page as an absolute URL
45
- page.internal_links # array of strings, with every internal link found on the page as an absolute URL
46
- page.external_links # array of strings, with every external link found on the page as an absolute URL
47
- page.meta_description # meta description, as string
48
- page.description # returns the meta description, or the first long paragraph if no meta description is found
49
- page.meta_keywords # meta keywords, as string
50
- page.image # Most relevant image, if defined with og:image
51
- page.images # array of strings, with every img found on the page as an absolute URL
52
- page.feed # Get rss or atom links in meta data fields as array
53
- page.charset # UTF-8
54
- page.content_type # content-type returned by the server when the url was requested
55
-
56
- MetaInspector uses dynamic methods for meta_tag discovery, so all these will work, and will be converted to a search of a meta tag by the corresponding name, and return its content attribute
57
-
58
- page.meta_description # <meta name="description" content="..." />
59
- page.meta_keywords # <meta name="keywords" content="..." />
60
- page.meta_robots # <meta name="robots" content="..." />
61
- page.meta_generator # <meta name="generator" content="..." />
62
-
63
- It will also work for the meta tags of the form <meta http-equiv="name" ... />, like the following:
64
-
65
- page.meta_content_language # <meta http-equiv="content-language" content="..." />
66
- page.meta_Content_Type # <meta http-equiv="Content-Type" content="..." />
67
-
68
- Please notice that MetaInspector is case sensitive, so `page.meta_Content_Type` is not the same as `page.meta_content_type`
41
+ page.url # URL of the page
42
+ page.scheme # Scheme of the page (http, https)
43
+ page.host # Hostname of the page (like, sitevalidator.com, without the scheme)
44
+ page.root_url # Root url (scheme + host, like http://sitevalidator.com/)
45
+ page.title # title of the page, as string
46
+ page.links # array of strings, with every link found on the page as an absolute URL
47
+ page.internal_links # array of strings, with every internal link found on the page as an absolute URL
48
+ page.external_links # array of strings, with every external link found on the page as an absolute URL
49
+ page.meta['keywords'] # meta keywords, as string
50
+ page.meta['description'] # meta description, as string
51
+ page.description # returns the meta description, or the first long paragraph if no meta description is found
52
+ page.image # Most relevant image, if defined with the og:image meta tag
53
+ page.images # array of strings, with every img found on the page as an absolute URL
54
+ page.feed # Get rss or atom links in meta data fields as array
55
+ page.charset # UTF-8
56
+ page.content_type # content-type returned by the server when the url was requested
57
+
58
+ ## Meta tags
59
+
60
+ When it comes to meta tags, you have several options:
61
+
62
+ page.meta_tags # Gives you all the meta tags by type:
63
+ # (meta name, meta http-equiv, meta property and meta charset)
64
+ # As meta tags can be repeated (in the case of 'og:image', for example),
65
+ # the values returned will be arrays
66
+ #
67
+ # For example:
68
+ #
69
+ # {
70
+ 'name' => {
71
+ 'keywords' => ['one, two, three'],
72
+ 'description' => ['the description'],
73
+ 'author' => ['Joe Sample'],
74
+ 'robots' => ['index,follow'],
75
+ 'revisit' => ['15 days'],
76
+ 'dc.date.issued' => ['2011-09-15']
77
+ },
78
+
79
+ 'http-equiv' => {
80
+ 'content-type' => ['text/html; charset=UTF-8'],
81
+ 'content-style-type' => ['text/css']
82
+ },
83
+
84
+ 'property' => {
85
+ 'og:title' => ['An OG title'],
86
+ 'og:type' => ['website'],
87
+ 'og:url' => ['http://example.com/meta-tags'],
88
+ 'og:image' => ['http://example.com/rock.jpg',
89
+ 'http://example.com/rock2.jpg',
90
+ 'http://example.com/rock3.jpg'],
91
+ 'og:image:width' => ['300'],
92
+ 'og:image:height' => ['300', '1000']
93
+ },
94
+
95
+ 'charset' => ['UTF-8']
96
+ }
97
+
98
+ As this method returns a hash, you can also take only the key that you need, like in:
99
+
100
+ page.meta_tags['property'] # Returns:
101
+ # {
102
+ # 'og:title' => ['An OG title'],
103
+ # 'og:type' => ['website'],
104
+ # 'og:url' => ['http://example.com/meta-tags'],
105
+ # 'og:image' => ['http://example.com/rock.jpg',
106
+ # 'http://example.com/rock2.jpg',
107
+ # 'http://example.com/rock3.jpg'],
108
+ # 'og:image:width' => ['300'],
109
+ # 'og:image:height' => ['300', '1000']
110
+ # }
111
+
112
+ In most cases you will only be interested in the first occurrence of a meta tag, so you can
113
+ use the singular form of that method:
114
+
115
+ page.meta_tag['name'] # Returns:
116
+ # {
117
+ # 'keywords' => 'one, two, three',
118
+ # 'description' => 'the description',
119
+ # 'author' => 'Joe Sample',
120
+ # 'robots' => 'index,follow',
121
+ # 'revisit' => '15 days',
122
+ # 'dc.date.issued' => '2011-09-15'
123
+ # }
124
+
125
+ Or, as this is also a hash:
126
+
127
+ page.meta_tag['name']['keywords'] # Returns 'one, two, three'
128
+
129
+ And finally, you can use the shorter `meta` method that will merge the different keys so you have
130
+ a simpler hash:
131
+
132
+ page.meta # Returns:
133
+ #
134
+ # {
135
+ # 'keywords' => 'one, two, three',
136
+ # 'description' => 'the description',
137
+ # 'author' => 'Joe Sample',
138
+ # 'robots' => 'index,follow',
139
+ # 'revisit' => '15 days',
140
+ # 'dc.date.issued' => '2011-09-15',
141
+ # 'content-type' => 'text/html; charset=UTF-8',
142
+ # 'content-style-type' => 'text/css',
143
+ # 'og:title' => 'An OG title',
144
+ # 'og:type' => 'website',
145
+ # 'og:url' => 'http://example.com/meta-tags',
146
+ # 'og:image' => 'http://example.com/rock.jpg',
147
+ # 'og:image:width' => '300',
148
+ # 'og:image:height' => '300',
149
+ # 'charset' => 'UTF-8'
150
+ # }
151
+
152
+ This way, you can get most meta tags just like that:
153
+
154
+ page.meta['author'] # Returns "Joe Sample"
155
+
156
+ Please be aware that all keys are converted to lowercase, so it's `'dc.date.issued'` and not `'DC.date.issued'`.
157
+
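Because the keys are stored in lowercase, a lookup that uses the tag's original casing returns `nil`. A minimal sketch of a case-insensitive lookup follows. It uses a plain hash as a stand-in for the value of `page.meta`, so no request is made, and the `meta_value` helper is hypothetical and not part of MetaInspector:

```ruby
# Plain hash standing in for the value returned by page.meta
# (assumption: sample data copied from the README example, no network access).
meta = {
  'author'         => 'Joe Sample',
  'dc.date.issued' => '2011-09-15'
}

# Hypothetical helper, not part of MetaInspector: lowercases the requested
# key before the lookup, so any casing of the tag name finds the value.
def meta_value(meta, key)
  meta[key.to_s.downcase]
end

exact   = meta['DC.date.issued']              # nil, the stored key is lowercase
relaxed = meta_value(meta, 'DC.date.issued')  # found via the lowercased key

puts relaxed
```

A wrapper like this may be convenient when tag names come from user input or from another page's markup.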
158
+ ## Other representations
69
159
 
70
160
  You can also access most of the scraped data as a hash:
71
161
 
@@ -80,25 +170,6 @@ And the full scraped document is accessible from:
80
170
 
81
171
  page.parsed # Nokogiri doc that you can use to get any element from the page
82
172
 
83
- ## Opengraph and Twitter card meta tags
84
-
85
- Twitter cards & Open graph tags make it possible for you to attach media experiences to Tweets & Facebook posts. Nowadays most of the content creators add these meta tags to headers to quickly identify content on the page. Sometimes these tags could be nested as well. For example when a site wants to provide information about primary image used on a page it could use
86
-
87
- <meta name="og:image" content="http://www.somedomain.com/assets/images/abc.jpeg">
88
- <meta name="og:image:width" content="200">
89
- <meta name="twitter:image" value="http://www.somedomain.com/assets/images/abc.jpeg">
90
- <meta property="twitter:image:width" value="200">
91
-
92
- Also many sites use name & property, content & value attributes interchangeably. Using MetaInspector accessing this information is as easy as -
93
-
94
- page.meta_og_image
95
- page.meta_twitter_image_width
96
-
97
- Note that MetaInspector gives priority to content over value. In other words if there is a tag of the form
98
-
99
- <meta property="og:something" value="100" content="real value">
100
- page.meta_og_something #=> "real value"
101
-
102
173
  ## Options
103
174
 
104
175
  ### Timeout
@@ -173,10 +244,10 @@ You can find some sample scripts on the samples folder, including a basic scrapi
173
244
  >> page.title
174
245
  => "MarkupValidator :: site-wide markup validation tool"
175
246
 
176
- >> page.meta_description
247
+ >> page.meta['description']
177
248
  => "Site-wide markup validation tool. Validate the markup of your whole site with just one click."
178
249
 
179
- >> page.meta_keywords
250
+ >> page.meta['keywords']
180
251
  => "html, markup, validation, validator, tool, w3c, development, standards, free"
181
252
 
182
253
  >> page.links.size
@@ -5,7 +5,6 @@ require File.expand_path(File.join(File.dirname(__FILE__), 'meta_inspector/excep
5
5
  require File.expand_path(File.join(File.dirname(__FILE__), 'meta_inspector/exception_log'))
6
6
  require File.expand_path(File.join(File.dirname(__FILE__), 'meta_inspector/request'))
7
7
  require File.expand_path(File.join(File.dirname(__FILE__), 'meta_inspector/url'))
8
- require File.expand_path(File.join(File.dirname(__FILE__), 'meta_inspector/meta_tags_dynamic_match'))
9
8
  require File.expand_path(File.join(File.dirname(__FILE__), 'meta_inspector/parser'))
10
9
  require File.expand_path(File.join(File.dirname(__FILE__), 'meta_inspector/document'))
11
10
  require File.expand_path(File.join(File.dirname(__FILE__), 'meta_inspector/deprecations'))
@@ -39,8 +39,8 @@ module MetaInspector
39
39
  extend Forwardable
40
40
  def_delegators :@url, :url, :scheme, :host, :root_url
41
41
  def_delegators :@request, :content_type
42
- def_delegators :@parser, :parsed, :method_missing, :respond_to?, :title, :description, :links, :internal_links, :external_links,
43
- :images, :image, :feed, :charset
42
+ def_delegators :@parser, :parsed, :respond_to?, :title, :description, :links, :internal_links, :external_links,
43
+ :images, :image, :feed, :charset, :meta_tags, :meta_tag, :meta
44
44
 
45
45
  # Returns all document data as a nested Hash
46
46
  def to_hash
@@ -53,8 +53,9 @@ module MetaInspector
53
53
  'images' => images,
54
54
  'charset' => charset,
55
55
  'feed' => feed,
56
- 'content_type' => content_type
57
- }.merge @parser.to_hash
56
+ 'content_type' => content_type,
57
+ 'meta_tags' => meta_tags
58
+ }
58
59
  end
59
60
 
60
61
  # Returns the contents of the document as a string
@@ -1,7 +1,6 @@
1
1
  # -*- encoding: utf-8 -*-
2
2
 
3
3
  require 'nokogiri'
4
- require 'hashie/rash'
5
4
 
6
5
  module MetaInspector
7
6
  # Parses the document with Nokogiri
@@ -12,13 +11,29 @@ module MetaInspector
12
11
  options = defaults.merge(options)
13
12
 
14
13
  @document = document
15
- @data = Hashie::Rash.new
16
14
  @exception_log = options[:exception_log]
17
15
  end
18
16
 
19
17
  extend Forwardable
20
18
  def_delegators :@document, :url, :scheme, :host
21
19
 
20
+ def meta_tags
21
+ {
22
+ 'name' => meta_tags_by('name'),
23
+ 'http-equiv' => meta_tags_by('http-equiv'),
24
+ 'property' => meta_tags_by('property'),
25
+ 'charset' => [charset_from_meta_charset]
26
+ }
27
+ end
28
+
29
+ def meta_tag
30
+ convert_each_array_to_first_element_on meta_tags
31
+ end
32
+
33
+ def meta
34
+ meta_tag['name'].merge(meta_tag['http-equiv']).merge(meta_tag['property']).merge({'charset' => meta_tag['charset']})
35
+ end
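The `meta` method above chains `Hash#merge`, so when the same key appears under more than one attribute type, the value from the later group wins (`charset` over `property` over `http-equiv` over `name`). A small sketch illustrates this. The `meta_tag` hash below is made-up sample data shaped like the return value of `meta_tag`, not output from the gem:

```ruby
# Hypothetical sample data shaped like the return value of Parser#meta_tag.
meta_tag = {
  'name'       => { 'description' => 'from meta name', 'author' => 'Joe Sample' },
  'http-equiv' => { 'content-type' => 'text/html; charset=UTF-8' },
  'property'   => { 'description' => 'from meta property' },
  'charset'    => 'UTF-8'
}

# Same merge chain as the meta method in the diff: later hashes override
# earlier ones when a key is present in both.
meta = meta_tag['name']
         .merge(meta_tag['http-equiv'])
         .merge(meta_tag['property'])
         .merge('charset' => meta_tag['charset'])

puts meta['description']
```

If the `name` variant of a duplicated key is the one needed, `meta_tag['name']` still holds it, since `merge` returns a new hash and leaves its receiver untouched.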
36
+
22
37
  # Returns the whole parsed document
23
38
  def parsed
24
39
  @parsed ||= Nokogiri::HTML(@document.to_s)
@@ -27,11 +42,6 @@ module MetaInspector
27
42
  @exception_log << e
28
43
  end
29
44
 
30
- def to_hash
31
- scrape_meta_data
32
- @data.to_hash
33
- end
34
-
35
45
  # Returns the parsed document title, from the content of the <title> tag.
36
46
  # This is not the same as the meta_title tag
37
47
  def title
@@ -41,7 +51,7 @@ module MetaInspector
41
51
  # A description getter that first checks for a meta description and if not present will
42
52
  # guess by looking at the first paragraph with more than 120 characters
43
53
  def description
44
- meta_description || secondary_description
54
+ meta['description'] || secondary_description
45
55
  end
46
56
 
47
57
  # Links found on the page, as absolute URLs
@@ -67,8 +77,10 @@ module MetaInspector
67
77
  # Returns the parsed image from Facebook's open graph property tags
68
78
  # Most all major websites now define this property and is usually very relevant
69
79
  # See doc at http://developers.facebook.com/docs/opengraph/
80
+ # If none found, tries with Twitter image
81
+ # TODO: if not found, try with images.first
70
82
  def image
71
- meta_og_image || meta_twitter_image
83
+ meta['og:image'] || meta['twitter:image']
72
84
  end
73
85
 
74
86
  # Returns the parsed document meta rss link
@@ -83,81 +95,38 @@ module MetaInspector
83
95
  @charset ||= (charset_from_meta_charset || charset_from_meta_content_type)
84
96
  end
85
97
 
86
- def respond_to?(method_name, include_private = false)
87
- MetaInspector::MetaTagsDynamicMatch.new(method_name).match? || super
88
- end
89
-
90
98
  private
91
99
 
92
100
  def defaults
93
101
  { exception_log: MetaInspector::ExceptionLog.new }
94
102
  end
95
103
 
96
- # Scrapers for all meta_tags in the form of "meta_name" are automatically defined. This has been tested for
97
- # meta name: keywords, description, robots, generator
98
- # meta http-equiv: content-language, Content-Type
99
- #
100
- # It will first try with meta name="..." and if nothing found,
101
- # with meta http-equiv="...", substituting "_" by "-"
102
- def method_missing(method_name)
103
- meta_tags_method = MetaInspector::MetaTagsDynamicMatch.new(method_name)
104
+ def meta_tags_by(attribute)
105
+ hash = {}
106
+ parsed.css("meta[@#{attribute}]").map do |tag|
107
+ name = tag.attributes[attribute].value.downcase rescue nil
108
+ content = tag.attributes['content'].value rescue nil
104
109
 
105
- if meta_tags_method.match?
106
- key = meta_tags_method.meta_tag
107
-
108
- #special treatment for opengraph (og:) and twitter card (twitter:) tags
109
- if key =~ /^og_(.*)/
110
- key = og_key(key)
111
- elsif key =~ /^twitter_(.*)/
112
- key.gsub!("_",":")
110
+ if name && content
111
+ hash[name] ||= []
112
+ hash[name] << content
113
113
  end
114
-
115
- scrape_meta_data
116
-
117
- @data.meta.name && (@data.meta.name[key.downcase]) || (@data.meta.property && @data.meta.property[key.downcase])
118
- else
119
- super
120
- end
121
- end
122
-
123
- # Not all OG keys can be directly translated to meta tags method names replacing _ by : as they include the _ in the name
124
- # This is going to be deprecated and replaced soon by a simpler, clearer method, like page.meta['og:site_name']
125
- def og_key(key)
126
- case key
127
- when "og_site_name"
128
- "og:site_name"
129
- when "og_image_secure_url"
130
- "og:image:secure_url"
131
- when "og_video_secure_url"
132
- "og:video:secure_url"
133
- when "og_audio_secure_url"
134
- "og:audio:secure_url"
135
- else
136
- key.gsub("_", ":")
137
114
  end
115
+ hash
138
116
  end
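The grouping step in `meta_tags_by` can be tried without Nokogiri. The sketch below assumes the tags have already been reduced to `[name, content]` pairs. `group_meta_pairs` is a hypothetical stand-in, not a method of the gem:

```ruby
# Hypothetical stand-in for meta_tags_by: takes [name, content] pairs
# instead of Nokogiri nodes and groups the contents by lowercased name.
def group_meta_pairs(pairs)
  pairs.each_with_object({}) do |(name, content), hash|
    next if name.nil? || content.nil? # skip tags missing either attribute
    (hash[name.downcase] ||= []) << content
  end
end

pairs = [
  ['og:image',        'http://example.com/rock.jpg'],
  ['og:image:height', '300'],
  ['og:image',        'http://example.com/rock2.jpg'],
  ['OG:Image',        'http://example.com/rock3.jpg'], # casing differs on purpose
  ['og:video',        nil]                             # no content attribute
]

grouped = group_meta_pairs(pairs)
puts grouped['og:image'].length
```

Repeated tags keep their document order inside each array, which is why the singular `meta_tag` form returns the first occurrence.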
139
117
 
140
- # Scrapes all meta tags found
141
- def scrape_meta_data
142
- unless @data.meta
143
- @data.meta!.name!
144
- @data.meta!.property!
145
- parsed_search("//meta").each do |element|
146
- get_meta_name_or_property(element)
118
+ def convert_each_array_to_first_element_on(hash)
119
+ hash.each_pair do |k, v|
120
+ hash[k] = if v.is_a?(Hash)
121
+ convert_each_array_to_first_element_on(v)
122
+ elsif v.is_a?(Array)
123
+ v.first
124
+ else
125
+ v
147
126
  end
148
127
  end
149
128
  end
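Note that `convert_each_array_to_first_element_on` assigns into the hash it is iterating over, so it modifies its argument and returns that same object. This appears safe here because `meta_tags` builds a fresh hash on every call. A standalone copy of the method, run against a small sample hash, shows the effect:

```ruby
# Standalone copy of the method from the diff, for experimentation.
def convert_each_array_to_first_element_on(hash)
  hash.each_pair do |k, v|
    hash[k] = if v.is_a?(Hash)
                convert_each_array_to_first_element_on(v)
              elsif v.is_a?(Array)
                v.first
              else
                v
              end
  end
end

# Sample data in the shape returned by meta_tags.
tags = {
  'property' => { 'og:image:height' => ['300', '1000'] },
  'charset'  => ['UTF-8']
}

result = convert_each_array_to_first_element_on(tags)

puts result.equal?(tags) # the argument itself was converted in place
```

Callers that want to keep the array form should therefore call `meta_tags` again rather than reuse a hash that was already passed through this method.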
150
129
 
151
- # Store meta tag value, looking at meta name or meta property
152
- def get_meta_name_or_property(element)
153
- name_or_property = element.attributes["name"] ? "name" : (element.attributes["property"] ? "property" : nil)
154
- content_or_value = element.attributes["content"] ? "content" : (element.attributes["value"] ? "value" : nil)
155
-
156
- if !name_or_property.nil? && !content_or_value.nil?
157
- @data.meta.name[element.attributes[name_or_property].value.downcase] = element.attributes[content_or_value].value
158
- end
159
- end
160
-
161
130
  # Look for the first <p> block with 120 characters or more
162
131
  def secondary_description
163
132
  first_long_paragraph = parsed_search('//p[string-length() >= 120]').first
@@ -1,5 +1,5 @@
1
1
  # -*- encoding: utf-8 -*-
2
2
 
3
3
  module MetaInspector
4
- VERSION = "1.17.3"
4
+ VERSION = "2.0.0"
5
5
  end
@@ -16,7 +16,6 @@ Gem::Specification.new do |gem|
16
16
  gem.version = MetaInspector::VERSION
17
17
 
18
18
  gem.add_dependency 'nokogiri', '~> 1.6'
19
- gem.add_dependency 'rash', '~> 0.4.0'
20
19
  gem.add_dependency 'open_uri_redirections', '~> 0.1.4'
21
20
  gem.add_dependency 'addressable', '~> 2.3.5'
22
21
 
@@ -46,16 +46,13 @@ describe MetaInspector::Document do
46
46
  "charset" => "utf-8",
47
47
  "feed" => "http://feeds.feedburner.com/PageRankAlert",
48
48
  "content_type" =>"text/html",
49
- "meta" => {
50
- "name" => {
51
- "description"=> "Track your PageRank(TM) changes and receive alerts by email",
52
- "keywords" => "pagerank, seo, optimization, google",
53
- "robots" => "all,follow",
54
- "csrf_param" => "authenticity_token",
55
- "csrf_token" => "iW1/w+R8zrtDkhOlivkLZ793BN04Kr3X/pS+ixObHsE="
56
- },
57
- "property"=>{}
58
- }
49
+ "meta_tags" => { "name" => { "description" => ["Track your PageRank(TM) changes and receive alerts by email"],
50
+ "keywords" => ["pagerank, seo, optimization, google"], "robots"=>["all,follow"],
51
+ "csrf-param" => ["authenticity_token"],
52
+ "csrf-token" => ["iW1/w+R8zrtDkhOlivkLZ793BN04Kr3X/pS+ixObHsE="] },
53
+ "http-equiv" => {},
54
+ "property" => {},
55
+ "charset" => ["utf-8"] }
59
56
  }
60
57
  end
61
58
 
@@ -0,0 +1,54 @@
1
+ HTTP/1.1 200 OK
2
+ Age: 13
3
+ Cache-Control: max-age=120
4
+ Content-Type: text/html
5
+ Date: Mon, 06 Jan 2014 12:47:42 GMT
6
+ Expires: Mon, 06 Jan 2014 12:49:28 GMT
7
+ Server: Apache/2.2.14 (Ubuntu)
8
+ Vary: Accept-Encoding
9
+ Via: 1.1 varnish
10
+ X-Powered-By: PHP/5.3.2-1ubuntu4.22
11
+ X-Varnish: 1188792404 1188790413
12
+ Content-Length: 40571
13
+ Connection: keep-alive
14
+
15
+ <!DOCTYPE html>
16
+ <html xmlns="http://www.w3.org/1999/xhtml" xmlns:fb="http://www.facebook.com/2008/fbml">
17
+ <head>
18
+ <!-- meta name examples -->
19
+
20
+ <meta name="keywords" content="one, two, three" />
21
+ <meta name="description" content="the description" />
22
+ <meta name="author" content="Joe Sample" />
23
+ <meta name="robots" content="index,follow" />
24
+ <meta name="revisit" content="15 days" />
25
+ <meta name="DC.date.issued" content="2011-09-15">
26
+
27
+ <!-- meta http-equiv examples -->
28
+
29
+ <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
30
+ <meta http-equiv="Content-Style-Type" content="text/css" />
31
+
32
+ <!-- meta charset examples -->
33
+
34
+ <meta charset="UTF-8" />
35
+
36
+ <!-- meta property examples -->
37
+
38
+ <meta property="og:title" content="An OG title" />
39
+ <meta property="og:type" content="website" />
40
+ <meta property="og:url" content="http://example.com/meta-tags" />
41
+
42
+ <!-- meta properties can be repeated, like in this example from http://open.me -->
43
+
44
+ <meta property="og:image" content="http://example.com/rock.jpg" />
45
+ <meta property="og:image:width" content="300" />
46
+ <meta property="og:image:height" content="300" />
47
+ <meta property="og:image" content="http://example.com/rock2.jpg" />
48
+ <meta property="og:image" content="http://example.com/rock3.jpg" />
49
+ <meta property="og:image:height" content="1000" />
50
+ </head>
51
+ <body>
52
+ <p>A sample page with many types of meta tags</p>
53
+ </body>
54
+ </html>
@@ -38,7 +38,7 @@ var yt = yt || {};yt.timing = yt.timing || {};yt.timing.tick = function(label, o
38
38
  <meta name="title" content="4. Far Cry 3 - Ubisoft E3 2011 Press Conference HD 1080p">
39
39
 
40
40
 
41
- <meta name="description" content="">
41
+ <meta name="description" content="This is Youtube">
42
42
 
43
43
 
44
44
  <meta name="keywords" content="FARCRY, Ubisoft, E3, 2011, Press, Conference, HD, 1080p">
data/spec/parser_spec.rb CHANGED
@@ -21,7 +21,6 @@ describe MetaInspector::Parser do
21
21
  it "should find the og image" do
22
22
  @m = MetaInspector::Parser.new(doc 'http://www.theonion.com/articles/apple-claims-new-iphone-only-visible-to-most-loyal,2772/')
23
23
  @m.image.should == "http://o.onionstatic.com/images/articles/article/2772/Apple-Claims-600w-R_jpg_130x110_q85.jpg"
24
- @m.meta_og_image.should == "http://o.onionstatic.com/images/articles/article/2772/Apple-Claims-600w-R_jpg_130x110_q85.jpg"
25
24
  end
26
25
 
27
26
  it "should find image on youtube" do
@@ -71,16 +70,16 @@ describe MetaInspector::Parser do
71
70
  @m.feed.should == nil
72
71
  end
73
72
  end
73
+ end
74
74
 
75
- describe "get description" do
76
- it "should find description on youtube" do
77
- MetaInspector::Parser.new(doc 'http://www.youtube.com/watch?v=iaGSSrp49uc').description.should == ""
78
- end
75
+ describe '#description' do
76
+ it "should find description from meta description" do
77
+ page = MetaInspector::Parser.new(doc 'http://www.youtube.com/watch?v=iaGSSrp49uc')
78
+
79
+ page.description.should == "This is Youtube"
79
80
  end
80
- end
81
81
 
82
- describe 'Page with missing meta description' do
83
- it "should find a secondary description" do
82
+ it "should find a secondary description if no meta description" do
84
83
  @m = MetaInspector::Parser.new(doc 'http://theonion-no-description.com')
85
84
  @m.description.should == "SAN FRANCISCO—In a move expected to revolutionize the mobile device industry, Apple launched its fastest and most powerful iPhone to date Tuesday, an innovative new model that can only be seen by the company's hippest and most dedicated customers. This is secondary text picked up because of a missing meta description."
86
85
  end
@@ -267,188 +266,87 @@ describe MetaInspector::Parser do
267
266
  end
268
267
  end
269
268
 
270
- describe 'respond_to? for meta tags ghost methods' do
271
- before(:each) do
272
- @m = MetaInspector.new('http://pagerankalert.com')
273
- end
274
-
275
- it "should return true for meta tags as string" do
276
- @m.respond_to?("meta_robots").should be_true
277
- end
278
-
279
- it "should return true for meta tags as symbols" do
280
- @m.respond_to?(:meta_robots).should be_true
281
- end
282
-
283
- it "should return true for meta_twitter_site as string" do
284
- @m = MetaInspector.new('http://www.youtube.com/watch?v=iaGSSrp49uc')
285
- @m.respond_to?("meta_twitter_site").should be_true
286
- end
287
-
288
- it "should return true for meta_twitter_site as symbol" do
289
- @m = MetaInspector.new('http://www.youtube.com/watch?v=iaGSSrp49uc')
290
- @m.respond_to?(:meta_twitter_player_width).should be_true
291
- end
292
- end
293
-
294
- describe 'respond_to? for not implemented methods' do
295
-
296
- before(:each) do
297
- @m = MetaInspector.new('http://pagerankalert.com')
298
- end
299
-
300
- it "should return false when method name passed as string" do
301
- @m.respond_to?("method_not_implemented").should be_false
302
- end
303
-
304
- it "should return false when method name passed as symbols" do
305
- @m = MetaInspector.new('http://www.youtube.com/watch?v=iaGSSrp49uc')
306
- @m.respond_to?(:method_not_implemented).should be_false
307
- end
308
- end
309
-
310
- describe 'Getting meta tags by ghost methods' do
311
- before(:each) do
312
- @m = MetaInspector::Parser.new(doc 'http://pagerankalert.com')
313
- end
314
-
315
- it "should get the robots meta tag" do
316
- @m.meta_robots.should == 'all,follow'
317
- end
318
-
319
- it "should get the robots meta tag" do
320
- @m.meta_RoBoTs.should == 'all,follow'
321
- end
322
-
323
- it "should get the description meta tag" do
324
- @m.meta_description.should == 'Track your PageRank(TM) changes and receive alerts by email'
325
- end
326
-
327
- it "should get the keywords meta tag" do
328
- @m.meta_keywords.should == "pagerank, seo, optimization, google"
329
- end
330
-
331
- it "should get the content-language meta tag" do
332
- pending "mocks"
333
- @m.meta_content_language.should == "en"
334
- end
335
-
336
- it "should get the Csrf_pAram meta tag" do
337
- @m.meta_Csrf_pAram.should == "authenticity_token"
338
- end
339
-
340
- it "should return nil for nonfound meta_tags" do
341
- @m.meta_lollypop.should == nil
342
- end
343
-
344
- it "should get the generator meta tag" do
345
- @m = MetaInspector::Parser.new(doc 'http://www.inkthemes.com/')
346
- @m.meta_generator.should == 'WordPress 3.4.2'
347
- end
348
-
349
- it "should find a meta_twitter_site" do
350
- @m = MetaInspector::Parser.new(doc 'http://www.youtube.com/watch?v=iaGSSrp49uc')
351
- @m.meta_twitter_site.should == "@youtube"
352
- end
353
-
354
- it "should find a meta_twitter_player_width" do
355
- @m = MetaInspector::Parser.new(doc 'http://www.youtube.com/watch?v=iaGSSrp49uc')
356
- @m.meta_twitter_player_width.should == "1920"
357
- end
358
-
359
- it "should not find a meta_twitter_dummy" do
360
- @m = MetaInspector::Parser.new(doc 'http://www.youtube.com/watch?v=iaGSSrp49uc')
361
- @m.meta_twitter_dummy.should == nil
362
- end
363
-
364
- describe "opengraph meta tags" do
365
- before(:each) do
366
- @m = MetaInspector::Parser.new(doc 'http://example.com/opengraph')
367
- end
368
-
369
- it "should find a meta og:title" do
370
- @m.meta_og_title.should == "An OG title"
371
- end
372
-
373
- it "should find a meta og:type" do
374
- @m.meta_og_type.should == "website"
375
- end
376
-
377
- it "should find a meta og:url" do
378
- @m.meta_og_url.should == "http://example.com/opengraph"
379
- end
380
-
381
- it "should find a meta og:description" do
382
- @m.meta_og_description.should == "Sean Connery found fame and fortune"
383
- end
384
-
385
- it "should find a meta og:determiner" do
386
- @m.meta_og_determiner.should == "the"
387
- end
388
-
389
- it "should find a meta og:locale" do
390
- @m.meta_og_locale.should == "en_GB"
391
- end
392
-
393
- it "should find a meta og:locale:alternate" do
394
- @m.meta_og_locale_alternate.should == "fr_FR"
395
- end
396
-
397
- it "should find a meta og:site_name" do
398
- @m.meta_og_site_name.should == "IMDb"
399
- end
400
-
401
- it "should find a meta og:image" do
402
- @m.meta_og_image.should == "http://example.com/ogp.jpg"
403
- end
404
-
405
- it "should find a meta og:image:secure_url" do
406
- @m.meta_og_image_secure_url.should == "https://secure.example.com/ogp.jpg"
407
- end
408
-
409
- it "should find a meta og:image:type" do
410
- @m.meta_og_image_type.should == "image/jpeg"
411
- end
412
-
413
- it "should find a meta og:image:width" do
414
- @m.meta_og_image_width.should == "400"
415
- end
416
-
417
- it "should find a meta og:image:height" do
418
- @m.meta_og_image_height.should == "300"
419
- end
420
-
421
- it "should find a meta og:video" do
422
- @m.meta_og_video.should == "http://example.com/movie.swf"
423
- end
424
-
425
- it "should find a meta og:video:secure_url" do
426
- @m.meta_og_video_secure_url.should == "https://secure.example.com/movie.swf"
427
- end
428
-
429
- it "should find a meta og:video:type" do
430
- @m.meta_og_video_type.should == "application/x-shockwave-flash"
431
- end
432
-
433
- it "should find a meta og:video:width" do
434
- @m.meta_og_video_width.should == "400"
435
- end
436
-
437
- it "should find a meta og:video:height" do
438
- @m.meta_og_video_height.should == "300"
439
- end
440
-
441
- it "should find a meta og:audio" do
442
- @m.meta_og_audio.should == "http://example.com/sound.mp3"
443
- end
444
-
445
- it "should find a meta og:video:secure_url" do
446
- @m.meta_og_audio_secure_url.should == "https://secure.example.com/sound.mp3"
447
- end
448
-
449
- it "should find a meta og:audio:type" do
450
- @m.meta_og_audio_type.should == "audio/mpeg"
451
- end
269
+ describe 'Getting meta tags' do
270
+ let(:page) { MetaInspector::Parser.new(doc 'http://example.com/meta-tags') }
271
+
272
+ it "#meta_tags" do
273
+ page.meta_tags.should == {
274
+ 'name' => {
275
+ 'keywords' => ['one, two, three'],
276
+ 'description' => ['the description'],
277
+ 'author' => ['Joe Sample'],
278
+ 'robots' => ['index,follow'],
279
+          'revisit' => ['15 days'],
+          'dc.date.issued' => ['2011-09-15']
+        },
+
+        'http-equiv' => {
+          'content-type' => ['text/html; charset=UTF-8'],
+          'content-style-type' => ['text/css']
+        },
+
+        'property' => {
+          'og:title' => ['An OG title'],
+          'og:type' => ['website'],
+          'og:url' => ['http://example.com/meta-tags'],
+          'og:image' => ['http://example.com/rock.jpg',
+                         'http://example.com/rock2.jpg',
+                         'http://example.com/rock3.jpg'],
+          'og:image:width' => ['300'],
+          'og:image:height' => ['300', '1000']
+        },
+
+        'charset' => ['UTF-8']
+      }
+    end
+
+    it "#meta_tag" do
+      page.meta_tag.should == {
+        'name' => {
+          'keywords' => 'one, two, three',
+          'description' => 'the description',
+          'author' => 'Joe Sample',
+          'robots' => 'index,follow',
+          'revisit' => '15 days',
+          'dc.date.issued' => '2011-09-15'
+        },
+
+        'http-equiv' => {
+          'content-type' => 'text/html; charset=UTF-8',
+          'content-style-type' => 'text/css'
+        },
+
+        'property' => {
+          'og:title' => 'An OG title',
+          'og:type' => 'website',
+          'og:url' => 'http://example.com/meta-tags',
+          'og:image' => 'http://example.com/rock.jpg',
+          'og:image:width' => '300',
+          'og:image:height' => '300'
+        },
+
+        'charset' => 'UTF-8'
+      }
+    end
+
+    it "#meta" do
+      page.meta.should == {
+        'keywords' => 'one, two, three',
+        'description' => 'the description',
+        'author' => 'Joe Sample',
+        'robots' => 'index,follow',
+        'revisit' => '15 days',
+        'dc.date.issued' => '2011-09-15',
+        'content-type' => 'text/html; charset=UTF-8',
+        'content-style-type' => 'text/css',
+        'og:title' => 'An OG title',
+        'og:type' => 'website',
+        'og:url' => 'http://example.com/meta-tags',
+        'og:image' => 'http://example.com/rock.jpg',
+        'og:image:width' => '300',
+        'og:image:height' => '300',
+        'charset' => 'UTF-8'
+      }
     end
   end
 
@@ -469,18 +367,6 @@ describe MetaInspector::Parser do
     end
   end
 
-  describe 'to_hash' do
-    it "should return a hash with all the values set" do
-      @m = MetaInspector::Parser.new(doc 'http://pagerankalert.com')
-      @m.to_hash.should == { "meta" => { "name" => { "description" => "Track your PageRank(TM) changes and receive alerts by email",
-                                                     "keywords" => "pagerank, seo, optimization, google",
-                                                     "robots" => "all,follow",
-                                                     "csrf_param" => "authenticity_token",
-                                                     "csrf_token" => "iW1/w+R8zrtDkhOlivkLZ793BN04Kr3X/pS+ixObHsE="},
-                                         "property"=>{}}}
-    end
-  end
-
   private
 
   def doc(url, options = {})
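The three spec expectations above suggest a simple relationship: `meta_tags` keeps every value as an array, `meta_tag` keeps only the first value of each array, and `meta` merges the groups into one flat hash. The sketch below reproduces that relationship on an abridged copy of the fixture data. The helper names `first_values` and `flatten_meta` are hypothetical and are not taken from MetaInspector's source; this is an illustration of the data shapes, not the gem's implementation.

```ruby
# Abridged sample shaped like the meta_tags hash in the spec above.
META_TAGS = {
  'name'       => { 'keywords' => ['one, two, three'], 'author' => ['Joe Sample'] },
  'http-equiv' => { 'content-type' => ['text/html; charset=UTF-8'] },
  'property'   => { 'og:image' => ['http://example.com/rock.jpg',
                                   'http://example.com/rock2.jpg'],
                    'og:image:height' => ['300', '1000'] },
  'charset'    => ['UTF-8']
}

# Hypothetical helper: keep only the first value of every array,
# which is the shape the "#meta_tag" spec expects.
def first_values(meta_tags)
  meta_tags.each_with_object({}) do |(group, value), result|
    result[group] =
      if value.is_a?(Hash)
        value.each_with_object({}) { |(key, values), h| h[key] = values.first }
      else
        value.first
      end
  end
end

# Hypothetical helper: merge the nested groups into a single flat hash,
# which is the shape the "#meta" spec expects.
def flatten_meta(meta_tag)
  meta_tag.each_with_object({}) do |(group, value), result|
    if value.is_a?(Hash)
      result.merge!(value)
    else
      result[group] = value
    end
  end
end

meta_tag = first_values(META_TAGS)
meta     = flatten_meta(meta_tag)
```

With this sample, `meta_tag['property']['og:image:height']` and `meta['og:image:height']` both hold the first of the two declared heights, matching the `'300'` seen in the specs.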
data/spec/spec_helper.rb CHANGED
@@ -41,7 +41,7 @@ FakeWeb.register_uri(:get, "http://charset002.com", :response => fixture_file("c
 FakeWeb.register_uri(:get, "http://www.inkthemes.com/", :response => fixture_file("wordpress_site.response"))
 FakeWeb.register_uri(:get, "http://pagerankalert.com/image.png", :body => "Image", :content_type => "image/png")
 FakeWeb.register_uri(:get, "http://pagerankalert.com/file.tar.gz", :body => "Image", :content_type => "application/x-gzip")
-FakeWeb.register_uri(:get, "http://example.com/opengraph", :response => fixture_file("opengraph.response"))
+FakeWeb.register_uri(:get, "http://example.com/meta-tags", :response => fixture_file("meta_tags.response"))
 
 # These examples are used to test relative links
 FakeWeb.register_uri(:get, "http://relative.com/", :response => fixture_file("relative_links.response"))
metadata CHANGED
@@ -1,14 +1,14 @@
 --- !ruby/object:Gem::Specification
 name: metainspector
 version: !ruby/object:Gem::Version
-  version: 1.17.3
+  version: 2.0.0
 platform: ruby
 authors:
 - Jaime Iniesta
 autorequire:
 bindir: bin
 cert_chain: []
-date: 2014-01-09 00:00:00.000000000 Z
+date: 2014-01-20 00:00:00.000000000 Z
 dependencies:
 - !ruby/object:Gem::Dependency
   name: nokogiri
@@ -24,20 +24,6 @@ dependencies:
     - - ~>
       - !ruby/object:Gem::Version
         version: '1.6'
-- !ruby/object:Gem::Dependency
-  name: rash
-  requirement: !ruby/object:Gem::Requirement
-    requirements:
-    - - ~>
-      - !ruby/object:Gem::Version
-        version: 0.4.0
-  type: :runtime
-  prerelease: false
-  version_requirements: !ruby/object:Gem::Requirement
-    requirements:
-    - - ~>
-      - !ruby/object:Gem::Version
-        version: 0.4.0
 - !ruby/object:Gem::Dependency
   name: open_uri_redirections
   requirement: !ruby/object:Gem::Requirement
@@ -142,7 +128,6 @@ files:
 - lib/meta_inspector/document.rb
 - lib/meta_inspector/exception_log.rb
 - lib/meta_inspector/exceptionable.rb
-- lib/meta_inspector/meta_tags_dynamic_match.rb
 - lib/meta_inspector/parser.rb
 - lib/meta_inspector/request.rb
 - lib/meta_inspector/url.rb
@@ -167,8 +152,8 @@ files:
 - spec/fixtures/iteh.at.response
 - spec/fixtures/malformed_href.response
 - spec/fixtures/markupvalidator_faqs.response
+- spec/fixtures/meta_tags.response
 - spec/fixtures/nonhttp.response
-- spec/fixtures/opengraph.response
 - spec/fixtures/pagerankalert.com.response
 - spec/fixtures/protocol_relative.response
 - spec/fixtures/relative_links.response
@@ -1,18 +0,0 @@
-module MetaInspector
-
-  # Encapsulates matching for method_missing and respond_to? for meta tags methods
-  class MetaTagsDynamicMatch
-    attr_reader :meta_tag
-
-    def initialize(method_name)
-      if method_name.to_s =~ /^meta_(.+)/
-        @meta_tag = $1
-      end
-    end
-
-    def match?
-      @meta_tag
-    end
-
-  end
-end
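The class removed above only decides whether a method name looks like `meta_<something>`; its comment says it served `method_missing` and `respond_to?`. The sketch below shows one way such a matcher can back dynamic accessors. `DynamicMatch` and `FakePage` are hypothetical stand-ins written for this illustration, not MetaInspector code, and the sketch uses `respond_to_missing?` as a swapped-in, commonly recommended hook rather than overriding `respond_to?` directly.

```ruby
# Hypothetical stand-in for the removed matcher class.
class DynamicMatch
  attr_reader :meta_tag

  def initialize(method_name)
    # Capture whatever follows the "meta_" prefix, if the name has one.
    @meta_tag = $1 if method_name.to_s =~ /^meta_(.+)/
  end

  def match?
    !@meta_tag.nil?
  end
end

# Hypothetical page object holding a flat hash of meta values.
class FakePage
  def initialize(meta)
    @meta = meta
  end

  # Answer meta_* calls from the hash; defer everything else to Ruby.
  def method_missing(name, *args, &block)
    match = DynamicMatch.new(name)
    match.match? ? @meta[match.meta_tag] : super
  end

  # Keep respond_to? consistent with method_missing.
  def respond_to_missing?(name, include_private = false)
    DynamicMatch.new(name).match? || super
  end
end

page = FakePage.new('description' => 'the description')
page.meta_description            # looked up dynamically in the hash
page.respond_to?(:meta_keywords) # true for any meta_* name
```

The specs added in this release read values through plain hashes such as `page.meta`, which needs no dynamic dispatch of this kind.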
@@ -1,52 +0,0 @@
-HTTP/1.1 200 OK
-Age: 13
-Cache-Control: max-age=120
-Content-Type: text/html
-Date: Mon, 06 Jan 2014 12:47:42 GMT
-Expires: Mon, 06 Jan 2014 12:49:28 GMT
-Server: Apache/2.2.14 (Ubuntu)
-Vary: Accept-Encoding
-Via: 1.1 varnish
-X-Powered-By: PHP/5.3.2-1ubuntu4.22
-X-Varnish: 1188792404 1188790413
-Content-Length: 40571
-Connection: keep-alive
-
-<!DOCTYPE html>
-<html xmlns="http://www.w3.org/1999/xhtml" xmlns:fb="http://www.facebook.com/2008/fbml">
-  <head>
-    <meta http-equiv="Content-type" content="text/html; charset=utf-8">
-
-    <!-- Basic OG Metadata -->
-    <meta property="og:title" content="An OG title" />
-    <meta property="og:type" content="website" />
-    <meta property="og:url" content="http://example.com/opengraph" />
-
-    <!-- Optional OG Metadata -->
-    <meta property="og:description" content="Sean Connery found fame and fortune" />
-    <meta property="og:determiner" content="the" />
-    <meta property="og:locale" content="en_GB" />
-    <meta property="og:locale:alternate" content="fr_FR" />
-    <meta property="og:site_name" content="IMDb" />
-
-    <!-- Structured OG Properties -->
-    <meta property="og:image" content="http://example.com/ogp.jpg" />
-    <meta property="og:image:secure_url" content="https://secure.example.com/ogp.jpg" />
-    <meta property="og:image:type" content="image/jpeg" />
-    <meta property="og:image:width" content="400" />
-    <meta property="og:image:height" content="300" />
-
-    <meta property="og:video" content="http://example.com/movie.swf" />
-    <meta property="og:video:secure_url" content="https://secure.example.com/movie.swf" />
-    <meta property="og:video:type" content="application/x-shockwave-flash" />
-    <meta property="og:video:width" content="400" />
-    <meta property="og:video:height" content="300" />
-
-    <meta property="og:audio" content="http://example.com/sound.mp3" />
-    <meta property="og:audio:secure_url" content="https://secure.example.com/sound.mp3" />
-    <meta property="og:audio:type" content="audio/mpeg" />
-  </head>
-  <body>
-    <p>A sample page with many Open Graph meta tags</p>
-  </body>
-</html>