metainspector 1.12.0 → 1.12.1

data/README.md ADDED
@@ -0,0 +1,169 @@
+ # MetaInspector [![Build Status](https://secure.travis-ci.org/jaimeiniesta/metainspector.png)](http://travis-ci.org/jaimeiniesta/metainspector) [![Dependency Status](https://gemnasium.com/jaimeiniesta/metainspector.png)](https://gemnasium.com/jaimeiniesta/metainspector)
+
+ MetaInspector is a gem for web scraping purposes. You give it a URL, and it lets you easily get its title, links, images, charset, description, keywords, meta tags...
+
+ ## See it in action!
+
+ You can try MetaInspector live at this little demo: [https://metainspectordemo.herokuapp.com](https://metainspectordemo.herokuapp.com)
+
+ ## Installation
+
+ Install the gem from RubyGems:
+
+ gem install metainspector
+
+ If you're using it in a Rails application, just add it to your Gemfile and run `bundle install`:
+
+ gem 'metainspector'
+
+ This gem is tested on Ruby versions 1.8.7, 1.9.2 and 1.9.3.
+
+ ## Usage
+
+ Initialize a MetaInspector instance for a URL, like this:
+
+ page = MetaInspector.new('http://markupvalidator.com')
+
+ If you don't include the scheme in the URL, http:// will be used by default:
+
+ page = MetaInspector.new('markupvalidator.com')
+
+ ## Accessing scraped data
+
+ You can then access the scraped data like this:
+
+ page.url # URL of the page
+ page.scheme # Scheme of the page (http, https)
+ page.host # Hostname of the page (like markupvalidator.com, without the scheme)
+ page.root_url # Root url (scheme + host, like http://markupvalidator.com/)
+ page.title # title of the page, as a string
+ page.links # array of strings, with every link found on the page as an absolute URL
+ page.internal_links # array of strings, with every internal link found on the page as an absolute URL
+ page.external_links # array of strings, with every external link found on the page as an absolute URL
+ page.meta_description # meta description, as a string
+ page.description # returns the meta description, or the first long paragraph if no meta description is found
+ page.meta_keywords # meta keywords, as a string
+ page.image # Most relevant image, if defined with og:image
+ page.images # array of strings, with every img found on the page as an absolute URL
+ page.feed # Get the rss or atom feed link found in the meta data fields
+ page.meta_og_title # opengraph title
+ page.meta_og_image # opengraph image
+ page.charset # UTF-8
+ page.content_type # content-type returned by the server when the url was requested
+
+ MetaInspector uses dynamic methods for meta_tag discovery, so all of these will work; each is converted into a search for a meta tag with the corresponding name, returning its content attribute:
+
+ page.meta_description # <meta name="description" content="..." />
+ page.meta_keywords # <meta name="keywords" content="..." />
+ page.meta_robots # <meta name="robots" content="..." />
+ page.meta_generator # <meta name="generator" content="..." />
+
+ It will also work for meta tags of the form <meta http-equiv="name" ... />, like the following:
+
+ page.meta_content_language # <meta http-equiv="content-language" content="..." />
+ page.meta_Content_Type # <meta http-equiv="Content-Type" content="..." />
+
+ Please note that MetaInspector is case sensitive, so `page.meta_Content_Type` is not the same as `page.meta_content_type`.
+
+ You can also access most of the scraped data as a hash:
+
+ page.to_hash # { "url" => "http://markupvalidator.com",
+ "title" => "MarkupValidator :: site-wide markup validation tool", ... }
+
+ The full scraped document is accessible from:
+
+ page.document # Nokogiri doc that you can use to get any element from the page
+
+ ## Options
+
+ ### Timeout
+
+ By default, MetaInspector times out after 20 seconds of waiting for a page to respond.
+ You can set a different timeout with a second parameter, like this:
+
+ page = MetaInspector.new('markupvalidator.com', :timeout => 5) # 5 seconds timeout
+
+ ### Redirections
+
+ MetaInspector allows safe redirects from http to https (for example, [http://github.com](http://github.com) => [https://github.com](https://github.com)) by default. With the option `:allow_safe_redirections => false`, it will throw exceptions on such redirects.
+
+ page = MetaInspector.new('facebook.com', :allow_safe_redirections => false)
+
+ To enable unsafe redirects from https to http (like [https://example.com](https://example.com) => [http://example.com](http://example.com)) you can pass the option `:allow_unsafe_redirections => true`. If this option is not specified or is false, an exception is thrown on such redirects.
+
+ page = MetaInspector.new('facebook.com', :allow_unsafe_redirections => true)
+
+ ### HTML Content Only
+
+ MetaInspector will try to parse all URLs by default. If you want to raise an error when trying to parse a non-html URL (one with a content-type other than text/html), you can state it like this:
+
+ page = MetaInspector.new('markupvalidator.com', :html_content_only => true)
+
+ This is useful when using MetaInspector for web spidering. Although the initial URL will probably be an HTML page, following links you may find yourself trying to parse non-html URLs.
+
+ page = MetaInspector.new('http://example.com/image.png')
+ page.title # returns ""
+ page.content_type # "image/png"
+ page.ok? # true
+
+ page = MetaInspector.new('http://example.com/image.png', :html_content_only => true)
+ page.title # returns nil
+ page.content_type # "image/png"
+ page.ok? # false
+ page.errors.first # "Scraping exception: The url provided contains image/png content instead of text/html content"
+
+ ## Error handling
+
+ You can check if the page has been successfully parsed with:
+
+ page.ok? # Will return true if everything looks OK
+
+ In case there have been any errors, you can check them with:
+
+ page.errors # Will return an array with the error messages
+
+ If you also want to see the errors on the console, you can initialize MetaInspector with the verbose option, like this:
+
+ page = MetaInspector.new('http://example.com', :verbose => true)
+
+ ## Examples
+
+ You can find some sample scripts in the samples folder, including a basic scraper and a spider that follows links using a queue. What follows is an example of use from irb:
+
+ $ irb
+ >> require 'metainspector'
+ => true
+
+ >> page = MetaInspector.new('http://markupvalidator.com')
+ => #<MetaInspector:0x11330c0 @url="http://markupvalidator.com">
+
+ >> page.title
+ => "MarkupValidator :: site-wide markup validation tool"
+
+ >> page.meta_description
+ => "Site-wide markup validation tool. Validate the markup of your whole site with just one click."
+
+ >> page.meta_keywords
+ => "html, markup, validation, validator, tool, w3c, development, standards, free"
+
+ >> page.links.size
+ => 15
+
+ >> page.links[4]
+ => "/plans-and-pricing"
+
+ >> page.document.class
+ => String
+
+ >> page.parsed_document.class
+ => Nokogiri::HTML::Document
+
+ ## ZOMG Fork! Thank you!
+
+ You're welcome to fork this project and send pull requests. Just remember to include specs.
+
+ Thanks to all the contributors:
+
+ [https://github.com/jaimeiniesta/metainspector/graphs/contributors](https://github.com/jaimeiniesta/metainspector/graphs/contributors)
+
+ Copyright (c) 2009-2012 Jaime Iniesta, released under the MIT license
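The dynamic `meta_*` methods described in the README above can be modeled with plain Ruby `method_missing`. This is a hypothetical, simplified sketch for illustration (MetaSketch and its hash-based tag store are inventions of this note, not the gem's actual implementation); like the gem, it does not downcase the method name, so lookups stay case sensitive:

```ruby
# Minimal model of dynamic meta_* lookup: page.meta_foo becomes a
# lookup of the "foo" meta tag's content. Hypothetical sketch only.
class MetaSketch
  def initialize(meta_tags)
    @meta_tags = meta_tags # e.g. { "description" => "...", "keywords" => "..." }
  end

  def method_missing(name, *args)
    key = name.to_s
    if key.start_with?("meta_")
      @meta_tags[key.sub("meta_", "")] # nil when the tag is absent
    else
      super
    end
  end

  def respond_to_missing?(name, include_private = false)
    name.to_s.start_with?("meta_") || super
  end
end

page = MetaSketch.new("description" => "A demo page", "keywords" => "demo, test")
page.meta_description # => "A demo page"
page.meta_robots      # => nil (no such tag on the page)
```

Defining `respond_to_missing?` alongside `method_missing` keeps `respond_to?` truthful, which is the idiomatic pairing for this technique.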
@@ -72,13 +72,9 @@ module MetaInspector
  meta_og_image
  end
 
- # Returns the parsed document meta rss links
+ # Returns the parsed document meta rss link
  def feed
- @feed ||= parsed_document.xpath("//link").select{ |link|
- link.attributes["type"] && link.attributes["type"].value =~ /(atom|rss)/
- }.map { |link|
- absolutify_url(link.attributes["href"].value)
- }.first rescue nil
+ @feed ||= (parsed_feed('rss') || parsed_feed('atom'))
  end
 
  # Returns the charset from the meta tags, looking for it in the following order:
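The refactored `feed` above prefers an rss link and falls back to atom via `||`. A rough stand-alone model of that precedence (here `find_feed` and its hash argument stand in for the gem's `parsed_feed`, which runs one XPath query per format):

```ruby
# rss wins over atom; nil when the page declares no feed at all.
# Hypothetical stand-in for parsed_feed('rss') || parsed_feed('atom').
def find_feed(links)
  links['application/rss+xml'] || links['application/atom+xml']
end

find_feed('application/atom+xml' => 'http://example.com/feed/') # => "http://example.com/feed/"
find_feed({})                                                   # => nil
```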
@@ -133,13 +129,6 @@ module MetaInspector
  errors.empty?
  end
 
- ##### DEPRECATIONS ####
- def parsed?
- warn "the parsed? method has been deprecated, please use ok? instead"
- !@parsed_document.nil?
- end
- ##### DEPRECATIONS ####
-
  private
 
  def defaults
@@ -190,29 +179,36 @@ module MetaInspector
  @data.meta!.name!
  @data.meta!.property!
  parsed_document.xpath("//meta").each do |element|
- if element.attributes["content"]
- if element.attributes["name"]
- @data.meta.name[element.attributes["name"].value.downcase] = element.attributes["content"].value
- end
-
- if element.attributes["property"]
- @data.meta.property[element.attributes["property"].value.downcase] = element.attributes["content"].value
- end
- end
+ get_meta_name_or_property(element)
  end
  end
  end
 
+ # Store meta tag value, looking at meta name or meta property
+ def get_meta_name_or_property(element)
+ if element.attributes["content"]
+ type = element.attributes["name"] ? "name" : (element.attributes["property"] ? "property" : nil)
+
+ @data.meta.name[element.attributes[type].value.downcase] = element.attributes["content"].value if type
+ end
+ end
+
+ def parsed_feed(format)
+ feed = parsed_document.search("//link[@type='application/#{format}+xml']").first
+ feed ? absolutify_url(feed.attributes['href'].value) : nil
+ end
+
  def parsed_links
- @parsed_links ||= parsed_document.search("//a") \
- .map {|link| link.attributes["href"] \
- .to_s.strip}.uniq rescue []
+ @parsed_links ||= cleanup_nokogiri_values(parsed_document.search("//a/@href"))
  end
 
  def parsed_images
- @parsed_images ||= parsed_document.search('//img') \
- .reject{|i| (i.attributes['src'].nil? || i.attributes['src'].value.empty?) } \
- .map{ |i| i.attributes['src'].value }.uniq
+ @parsed_images ||= cleanup_nokogiri_values(parsed_document.search('//img/@src'))
+ end
+
+ # Takes a nokogiri search result, strips the values, rejects the empty ones, and removes duplicates
+ def cleanup_nokogiri_values(results)
+ results.map { |a| a.value.strip }.reject { |s| s.empty? }.uniq
  end
 
  # Stores the error for later inspection
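The new `cleanup_nokogiri_values` above maps attribute nodes to stripped values, drops empties, and de-duplicates. The same pipeline can be sketched over plain strings (a simplified stand-in for Nokogiri attribute nodes, with `cleanup_values` as a hypothetical name):

```ruby
# Strip each value, discard empties, keep first occurrence of duplicates.
def cleanup_values(values)
  values.map { |v| v.strip }.reject { |v| v.empty? }.uniq
end

cleanup_values([" /home ", "/about", "", "/about", "  "])
# => ["/home", "/about"]
```

Searching `//a/@href` instead of `//a` also means nodes without the attribute never appear in the result, which is why the old `rescue []` guard could be dropped.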
@@ -250,7 +246,8 @@ module MetaInspector
 
  # Look for the first <p> block with 120 characters or more
  def secondary_description
- (p = parsed_document.search('//p').map(&:text).select{ |p| p.length > 120 }.first).nil? ? '' : p
+ first_long_paragraph = parsed_document.search('//p[string-length() >= 120]').first
+ first_long_paragraph ? first_long_paragraph.text : ''
  end
 
  def charset_from_meta_charset
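The rewrite above pushes the length filter into XPath (`string-length() >= 120`) instead of mapping every paragraph's text in Ruby. The selection logic itself, sketched over plain strings rather than parsed `<p>` nodes (`first_long_paragraph` here is a hypothetical helper name, not the gem's API):

```ruby
# Return the first paragraph of at least min_length characters, or '' if none.
def first_long_paragraph(paragraphs, min_length = 120)
  paragraphs.find { |p| p.length >= min_length } || ''
end
```

Note the boundary change: the old Ruby filter used `> 120`, while the new XPath uses `>= 120`, so a paragraph of exactly 120 characters now qualifies.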
@@ -1,5 +1,5 @@
  # -*- encoding: utf-8 -*-
 
  module MetaInspector
- VERSION = "1.12.0"
+ VERSION = "1.12.1"
  end
@@ -17,8 +17,8 @@ Gem::Specification.new do |gem|
  gem.add_dependency 'nokogiri', '~> 1.5'
  gem.add_dependency 'rash', '0.3.2'
 
- gem.add_development_dependency 'rspec', '2.11.0'
+ gem.add_development_dependency 'rspec', '2.12.0'
  gem.add_development_dependency 'fakeweb', '1.3.0'
- gem.add_development_dependency 'awesome_print', '1.0.2'
- gem.add_development_dependency 'rake', '0.9.2.2'
+ gem.add_development_dependency 'awesome_print', '1.1.0'
+ gem.add_development_dependency 'rake', '10.0.2'
  end
data/samples/spider.rb CHANGED
@@ -6,7 +6,7 @@ require 'meta_inspector'
  q = Queue.new
  visited_links=[]
 
- puts "Enter a valid http url to spider it following external links"
+ puts "Enter a valid http url to spider it following internal links"
  url = gets.strip
 
  page = MetaInspector.new(url)
@@ -20,9 +20,9 @@ while q.size > 0
  puts "TITLE: #{page.title}"
  puts "META DESCRIPTION: #{page.meta_description}"
  puts "META KEYWORDS: #{page.meta_keywords}"
- puts "LINKS: #{page.links.size}"
- page.links.each do |link|
- if link[0..6] == 'http://' && !visited_links.include?(link)
+ puts "LINKS: #{page.internal_links.size}"
+ page.internal_links.each do |link|
+ if !visited_links.include?(link)
  q.push(link)
  end
  end
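The spider change above swaps `page.links` plus a `http://` prefix check for `page.internal_links`. Its queueing logic, reduced to a self-contained sketch with no network access (`spider` and `link_map` are inventions of this note; `link_map` stands in for `MetaInspector.new(url).internal_links`):

```ruby
# Breadth-first traversal over internal links with a visited list.
# Queue (Thread::Queue) is available without a require on current Rubies.
def spider(start_url, link_map)
  q = Queue.new
  visited_links = []
  q.push(start_url)
  until q.empty?
    url = q.pop
    next if visited_links.include?(url)
    visited_links << url
    (link_map[url] || []).each do |link|
      q.push(link) unless visited_links.include?(link)
    end
  end
  visited_links
end

link_map = {
  'http://a.com/'    => ['http://a.com/one', 'http://a.com/two'],
  'http://a.com/one' => ['http://a.com/'],
  'http://a.com/two' => []
}
spider('http://a.com/', link_map)
# => ["http://a.com/", "http://a.com/one", "http://a.com/two"]
```

The `next if visited` guard at pop time covers the case where two pages enqueue the same unvisited link before either copy is processed.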
@@ -89,14 +89,21 @@ describe MetaInspector do
  @m.document.class.should == String
  end
 
- it "should get rss feed" do
- @m = MetaInspector.new('http://www.iteh.at')
- @m.feed.should == 'http://www.iteh.at/de/rss/'
- end
+ describe "Feed" do
+ it "should get rss feed" do
+ @m = MetaInspector.new('http://www.iteh.at')
+ @m.feed.should == 'http://www.iteh.at/de/rss/'
+ end
 
- it "should get atom feed" do
- @m = MetaInspector.new('http://www.tea-tron.com/jbravo/blog/')
- @m.feed.should == 'http://www.tea-tron.com/jbravo/blog/feed/'
+ it "should get atom feed" do
+ @m = MetaInspector.new('http://www.tea-tron.com/jbravo/blog/')
+ @m.feed.should == 'http://www.tea-tron.com/jbravo/blog/feed/'
+ end
+
+ it "should return nil if no feed found" do
+ @m = MetaInspector.new('http://www.alazan.com')
+ @m.feed.should == nil
+ end
  end
 
  describe "get description" do
data/spec/spec_helper.rb CHANGED
@@ -32,7 +32,7 @@ FakeWeb.register_uri(:get, "http://example.com/invalid_href", :response => fixtu
  FakeWeb.register_uri(:get, "http://www.youtube.com/watch?v=iaGSSrp49uc", :response => fixture_file("youtube.response"))
  FakeWeb.register_uri(:get, "http://markupvalidator.com/faqs", :response => fixture_file("markupvalidator_faqs.response"))
  FakeWeb.register_uri(:get, "https://twitter.com/markupvalidator", :response => fixture_file("twitter_markupvalidator.response"))
- FakeWeb.register_uri(:get, "https://example.com/empty", :response => fixture_file("empty_page.response"))
+ FakeWeb.register_uri(:get, "http://example.com/empty", :response => fixture_file("empty_page.response"))
  FakeWeb.register_uri(:get, "http://international.com", :response => fixture_file("international.response"))
  FakeWeb.register_uri(:get, "http://charset000.com", :response => fixture_file("charset_000.response"))
  FakeWeb.register_uri(:get, "http://charset001.com", :response => fixture_file("charset_001.response"))
metadata CHANGED
@@ -1,13 +1,13 @@
  --- !ruby/object:Gem::Specification
  name: metainspector
  version: !ruby/object:Gem::Version
- hash: 39
+ hash: 37
  prerelease:
  segments:
  - 1
  - 12
- - 0
- version: 1.12.0
+ - 1
+ version: 1.12.1
  platform: ruby
  authors:
  - Jaime Iniesta
@@ -15,10 +15,12 @@ autorequire:
  bindir: bin
  cert_chain: []
 
- date: 2012-12-01 00:00:00 Z
+ date: 2012-12-03 00:00:00 Z
  dependencies:
  - !ruby/object:Gem::Dependency
- version_requirements: &id001 !ruby/object:Gem::Requirement
+ name: nokogiri
+ prerelease: false
+ requirement: &id001 !ruby/object:Gem::Requirement
  none: false
  requirements:
  - - ~>
@@ -28,12 +30,12 @@ dependencies:
  - 1
  - 5
  version: "1.5"
- prerelease: false
  type: :runtime
- name: nokogiri
- requirement: *id001
+ version_requirements: *id001
  - !ruby/object:Gem::Dependency
- version_requirements: &id002 !ruby/object:Gem::Requirement
+ name: rash
+ prerelease: false
+ requirement: &id002 !ruby/object:Gem::Requirement
  none: false
  requirements:
  - - "="
@@ -44,28 +46,28 @@ dependencies:
  - 3
  - 2
  version: 0.3.2
- prerelease: false
  type: :runtime
- name: rash
- requirement: *id002
+ version_requirements: *id002
  - !ruby/object:Gem::Dependency
- version_requirements: &id003 !ruby/object:Gem::Requirement
+ name: rspec
+ prerelease: false
+ requirement: &id003 !ruby/object:Gem::Requirement
  none: false
  requirements:
  - - "="
  - !ruby/object:Gem::Version
- hash: 35
+ hash: 63
  segments:
  - 2
- - 11
+ - 12
  - 0
- version: 2.11.0
- prerelease: false
+ version: 2.12.0
  type: :development
- name: rspec
- requirement: *id003
+ version_requirements: *id003
  - !ruby/object:Gem::Dependency
- version_requirements: &id004 !ruby/object:Gem::Requirement
+ name: fakeweb
+ prerelease: false
+ requirement: &id004 !ruby/object:Gem::Requirement
  none: false
  requirements:
  - - "="
@@ -76,12 +78,12 @@ dependencies:
  - 3
  - 0
  version: 1.3.0
- prerelease: false
  type: :development
- name: fakeweb
- requirement: *id004
+ version_requirements: *id004
  - !ruby/object:Gem::Dependency
- version_requirements: &id005 !ruby/object:Gem::Requirement
+ name: awesome_print
+ prerelease: false
+ requirement: &id005 !ruby/object:Gem::Requirement
  none: false
  requirements:
  - - "="
@@ -89,30 +91,27 @@ dependencies:
  hash: 19
  segments:
  - 1
+ - 1
  - 0
- - 2
- version: 1.0.2
- prerelease: false
+ version: 1.1.0
  type: :development
- name: awesome_print
- requirement: *id005
+ version_requirements: *id005
  - !ruby/object:Gem::Dependency
- version_requirements: &id006 !ruby/object:Gem::Requirement
+ name: rake
+ prerelease: false
+ requirement: &id006 !ruby/object:Gem::Requirement
  none: false
  requirements:
  - - "="
  - !ruby/object:Gem::Version
- hash: 11
+ hash: 75
  segments:
+ - 10
  - 0
- - 9
  - 2
- - 2
- version: 0.9.2.2
- prerelease: false
+ version: 10.0.2
  type: :development
- name: rake
- requirement: *id006
+ version_requirements: *id006
  description: MetaInspector lets you scrape a web page and get its title, charset, link and meta tags
  email:
  - jaimeiniesta@gmail.com
@@ -128,7 +127,7 @@ files:
  - .travis.yml
  - Gemfile
  - MIT-LICENSE
- - README.rdoc
+ - README.md
  - Rakefile
  - lib/meta_inspector.rb
  - lib/meta_inspector/open_uri.rb
data/README.rdoc DELETED
@@ -1,152 +0,0 @@
- = MetaInspector {<img src="https://secure.travis-ci.org/jaimeiniesta/metainspector.png?branch=master" />}[http://travis-ci.org/jaimeiniesta/metainspector] {<img src="https://codeclimate.com/badge.png" />}[https://codeclimate.com/github/jaimeiniesta/metainspector]
-
- MetaInspector is a gem for web scraping purposes. You give it an URL, and it lets you easily get its title, links, images, charset, description, keywords, meta tags...
-
- = See it in action!
-
- You can try MetaInspector live at this little demo: https://metainspectordemo.herokuapp.com
-
- = Installation
-
- Install the gem from RubyGems:
-
- gem install metainspector
-
- This gem is tested on Ruby versions 1.8.7, 1.9.2 and 1.9.3.
-
- = Usage
-
- Initialize a scraper instance for an URL, like this:
-
- page = MetaInspector::Scraper.new('http://markupvalidator.com')
-
- or, for short, a convenience alias is also available:
-
- page = MetaInspector.new('http://markupvalidator.com')
-
- If you don't include the scheme on the URL, http:// will be used
- by defaul:
-
- page = MetaInspector.new('markupvalidator.com')
-
- By default, MetaInspector times out after 20 seconds of waiting for a page to respond.
- You can set a different timeout with a second parameter, like this:
-
- page = MetaInspector.new('markupvalidator.com', :timeout => 5) # this would wait just 5 seconds to timeout
-
- MetaInspector will try to parse all URLs by default. If you want to parse only those URLs that have text/html as content-type you can specify it like this:
-
- page = MetaInspector.new('markupvalidator.com', :html_content_only => true)
-
- MetaInspector allows safe redirects from http to https sites by default. Passing allow_safe_redirections as false will throw exceptions on such redirects.
-
- page = MetaInspector.new('facebook.com', :allow_safe_redirections => false)
-
- To enable unsafe redirects from https to http sites you can pass allow_unsafe_redirections as true. If this option is not specified or is false an exception is thrown on such redirects.
-
- page = MetaInspector.new('facebook.com', :allow_unsafe_redirections => true)
-
- Then you can see the scraped data like this:
-
- page.url # URL of the page
- page.scheme # Scheme of the page (http, https)
- page.host # Hostname of the page (like, markupvalidator.com, without the scheme)
- page.root_url # Root url (scheme + host, like http://markupvalidator.com/)
- page.title # title of the page, as string
- page.links # array of strings, with every link found on the page as an absolute URL
- page.internal_links # array of strings, with every internal link found on the page as an absolute URL
- page.external_links # array of strings, with every external link found on the page as an absolute URL
- page.meta_description # meta description, as string
- page.description # returns the meta description, or the first long paragraph if no meta description is found
- page.meta_keywords # meta keywords, as string
- page.image # Most relevant image, if defined with og:image
- page.images # array of strings, with every img found on the page as an absolute URL
- page.feed # Get rss or atom links in meta data fields as array
- page.meta_og_title # opengraph title
- page.meta_og_image # opengraph image
- page.charset # UTF-8
- page.content_type # content-type returned by the server when the url was requested
-
- MetaInspector uses dynamic methods for meta_tag discovery, so all these will work, and will be converted to a search of a meta tag by the corresponding name, and return its content attribute
-
- page.meta_description # <meta name="description" content="..." />
- page.meta_keywords # <meta name="keywords" content="..." />
- page.meta_robots # <meta name="robots" content="..." />
- page.meta_generator # <meta name="generator" content="..." />
-
- It will also work for the meta tags of the form <meta http-equiv="name" ... />, like the following:
-
- page.meta_content_language # <meta http-equiv="content-language" content="..." />
- page.meta_Content_Type # <meta http-equiv="Content-Type" content="..." />
-
- Please notice that MetaInspector is case sensitive, so page.meta_Content_Type is not the same as page.meta_content_type
-
- You can also access most of the scraped data as a hash:
-
- page.to_hash # { "url"=>"http://markupvalidator.com", "title" => "MarkupValidator :: site-wide markup validation tool", ... }
-
- The full scraped document if accessible from:
-
- page.document # Nokogiri doc that you can use it to get any element from the page
-
- = Errors handling
-
- You can check if the page has been succesfully parsed with:
-
- page.ok? # Will return true if everything looks OK
-
- In case there have been any errors, you can check them with:
-
- page.errors # Will return an array with the error messages
-
- If you also want to see the errors on console, you can initialize MetaInspector with the verbose option like that:
-
- page = MetaInspector.new('http://example.com', :verbose => true)
-
- = Examples
-
- You can find some sample scripts on the samples folder, including a basic scraping and a spider that will follow external links using a queue. What follows is an example of use from irb:
-
- $ irb
- >> require 'metainspector'
- => true
-
- >> page = MetaInspector.new('http://markupvalidator.com')
- => #<MetaInspector:0x11330c0 @url="http://markupvalidator.com">
-
- >> page.title
- => "MarkupValidator :: site-wide markup validation tool"
-
- >> page.meta_description
- => "Site-wide markup validation tool. Validate the markup of your whole site with just one click."
-
- >> page.meta_keywords
- => "html, markup, validation, validator, tool, w3c, development, standards, free"
-
- >> page.links.size
- => 15
-
- >> page.links[4]
- => "/plans-and-pricing"
-
- >> page.document.class
- => String
-
- >> page.parsed_document.class
- => Nokogiri::HTML::Document
-
- = ZOMG Fork! Thank you!
-
- You're welcome to fork this project and send pull requests. Just remember to include specs.
-
- Thanks to all the contributors:
-
- https://github.com/jaimeiniesta/metainspector/graphs/contributors
-
- = To Do
-
- * Get page.base_dir from the URL
- * If keywords seem to be separated by blank spaces, replace them with commas
- * Autodiscover all available meta tags
-
- Copyright (c) 2009-2012 Jaime Iniesta, released under the MIT license