metainspector 1.12.0 → 1.12.1

data/README.md ADDED
@@ -0,0 +1,169 @@
+ # MetaInspector [![Build Status](https://secure.travis-ci.org/jaimeiniesta/metainspector.png)](http://travis-ci.org/jaimeiniesta/metainspector) [![Dependency Status](https://gemnasium.com/jaimeiniesta/metainspector.png)](https://gemnasium.com/jaimeiniesta/metainspector)
+
+ MetaInspector is a gem for web scraping purposes. You give it a URL, and it lets you easily get its title, links, images, charset, description, keywords, meta tags...
+
+ ## See it in action!
+
+ You can try MetaInspector live at this little demo: [https://metainspectordemo.herokuapp.com](https://metainspectordemo.herokuapp.com)
+
+ ## Installation
+
+ Install the gem from RubyGems:
+
+ gem install metainspector
+
+ If you're using it in a Rails application, just add it to your Gemfile and run `bundle install`:
+
+ gem 'metainspector'
+
+ This gem is tested on Ruby versions 1.8.7, 1.9.2 and 1.9.3.
+
+ ## Usage
+
+ Initialize a MetaInspector instance for a URL, like this:
+
+ page = MetaInspector.new('http://markupvalidator.com')
+
+ If you don't include the scheme in the URL, http:// will be used by default:
+
+ page = MetaInspector.new('markupvalidator.com')
+
+ ## Accessing scraped data
+
+ Then you can see the scraped data like this:
+
+ page.url # URL of the page
+ page.scheme # Scheme of the page (http, https)
+ page.host # Hostname of the page (like markupvalidator.com, without the scheme)
+ page.root_url # Root url (scheme + host, like http://markupvalidator.com/)
+ page.title # title of the page, as a string
+ page.links # array of strings, with every link found on the page as an absolute URL
+ page.internal_links # array of strings, with every internal link found on the page as an absolute URL
+ page.external_links # array of strings, with every external link found on the page as an absolute URL
+ page.meta_description # meta description, as a string
+ page.description # returns the meta description, or the first long paragraph if no meta description is found
+ page.meta_keywords # meta keywords, as a string
+ page.image # Most relevant image, if defined with og:image
+ page.images # array of strings, with every img found on the page as an absolute URL
+ page.feed # Gets the first rss or atom feed link found in the meta data fields
+ page.meta_og_title # opengraph title
+ page.meta_og_image # opengraph image
+ page.charset # UTF-8
+ page.content_type # content-type returned by the server when the url was requested
+
+ MetaInspector uses dynamic methods for meta tag discovery, so all of these will work, being converted to a search for a meta tag with the corresponding name, returning its content attribute:
+
+ page.meta_description # <meta name="description" content="..." />
+ page.meta_keywords # <meta name="keywords" content="..." />
+ page.meta_robots # <meta name="robots" content="..." />
+ page.meta_generator # <meta name="generator" content="..." />
+
+ It will also work for meta tags of the form <meta http-equiv="name" ... />, like the following:
+
+ page.meta_content_language # <meta http-equiv="content-language" content="..." />
+ page.meta_Content_Type # <meta http-equiv="Content-Type" content="..." />
+
+ Please note that MetaInspector is case sensitive, so `page.meta_Content_Type` is not the same as `page.meta_content_type`.
+
+ You can also access most of the scraped data as a hash:
+
+ page.to_hash # { "url" => "http://markupvalidator.com",
+ "title" => "MarkupValidator :: site-wide markup validation tool", ... }
+
+ The full scraped document is accessible from:
+
+ page.document # Nokogiri doc that you can use to get any element from the page
+
+ ## Options
+
+ ### Timeout
+
+ By default, MetaInspector times out after 20 seconds of waiting for a page to respond.
+ You can set a different timeout with a second parameter, like this:
+
+ page = MetaInspector.new('markupvalidator.com', :timeout => 5) # 5 seconds timeout
+
+ ### Redirections
+
+ MetaInspector allows safe redirects from http to https (for example, [http://github.com](http://github.com) => [https://github.com](https://github.com)) by default. With the option `:allow_safe_redirections => false`, it will throw exceptions on such redirects.
+
+ page = MetaInspector.new('facebook.com', :allow_safe_redirections => false)
+
+ To enable unsafe redirects from https to http (like [https://example.com](https://example.com) => [http://example.com](http://example.com)) you can pass the option `:allow_unsafe_redirections => true`. If this option is not specified or is false, an exception is thrown on such redirects.
+
+ page = MetaInspector.new('facebook.com', :allow_unsafe_redirections => true)
+
+ ### HTML Content Only
+
+ MetaInspector will try to parse all URLs by default. If you want to raise an error when trying to parse a non-HTML URL (one with a content-type other than text/html), you can state it like this:
+
+ page = MetaInspector.new('markupvalidator.com', :html_content_only => true)
+
+ This is useful when using MetaInspector for web spidering. Although the initial URL will probably be an HTML page, while following links you may find yourself trying to parse non-HTML URLs.
+
+ page = MetaInspector.new('http://example.com/image.png')
+ page.title # returns ""
+ page.content_type # "image/png"
+ page.ok? # true
+
+ page = MetaInspector.new('http://example.com/image.png', :html_content_only => true)
+ page.title # returns nil
+ page.content_type # "image/png"
+ page.ok? # false
+ page.errors.first # "Scraping exception: The url provided contains image/png content instead of text/html content"
+
+ ## Error handling
+
+ You can check if the page has been successfully parsed with:
+
+ page.ok? # Will return true if everything looks OK
+
+ In case there have been any errors, you can check them with:
+
+ page.errors # Will return an array with the error messages
+
+ If you also want to see the errors on the console, you can initialize MetaInspector with the verbose option, like this:
+
+ page = MetaInspector.new('http://example.com', :verbose => true)
+
+ ## Examples
+
+ You can find some sample scripts in the samples folder, including a basic scraper and a spider that will follow internal links using a queue. What follows is an example of use from irb:
+
+ $ irb
+ >> require 'metainspector'
+ => true
+
+ >> page = MetaInspector.new('http://markupvalidator.com')
+ => #<MetaInspector:0x11330c0 @url="http://markupvalidator.com">
+
+ >> page.title
+ => "MarkupValidator :: site-wide markup validation tool"
+
+ >> page.meta_description
+ => "Site-wide markup validation tool. Validate the markup of your whole site with just one click."
+
+ >> page.meta_keywords
+ => "html, markup, validation, validator, tool, w3c, development, standards, free"
+
+ >> page.links.size
+ => 15
+
+ >> page.links[4]
+ => "/plans-and-pricing"
+
+ >> page.document.class
+ => String
+
+ >> page.parsed_document.class
+ => Nokogiri::HTML::Document
+
+ ## ZOMG Fork! Thank you!
+
+ You're welcome to fork this project and send pull requests. Just remember to include specs.
+
+ Thanks to all the contributors:
+
+ [https://github.com/jaimeiniesta/metainspector/graphs/contributors](https://github.com/jaimeiniesta/metainspector/graphs/contributors)
+
+ Copyright (c) 2009-2012 Jaime Iniesta, released under the MIT license
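The dynamic `meta_*` accessors described in the README can be sketched with Ruby's `method_missing`. This is a minimal, hedged illustration over a plain hash, not the gem's actual internals; `MetaTagBag` and its data are hypothetical names invented for the example:

```ruby
# A toy stand-in for a scraped page's meta tags. Calling meta_foo looks up
# the content attribute of the meta tag named "foo". Lookups are case
# sensitive, mirroring the README's note about meta_Content_Type.
class MetaTagBag
  def initialize(meta) # meta: { "description" => "...", "keywords" => "..." }
    @meta = meta
  end

  # meta_description -> @meta["description"], meta_keywords -> @meta["keywords"], etc.
  def method_missing(name, *args)
    key = name.to_s
    return super unless key.start_with?('meta_')
    @meta[key.sub('meta_', '')]
  end

  def respond_to_missing?(name, include_private = false)
    name.to_s.start_with?('meta_') || super
  end
end

page = MetaTagBag.new('description' => 'A demo page', 'keywords' => 'ruby, scraping')
puts page.meta_description # => "A demo page"
```

Unknown tags simply return nil, which matches the forgiving behaviour a scraper wants when a page omits a tag.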
@@ -72,13 +72,9 @@ module MetaInspector
  meta_og_image
  end
 
- # Returns the parsed document meta rss links
+ # Returns the parsed document meta rss link
  def feed
- @feed ||= parsed_document.xpath("//link").select{ |link|
- link.attributes["type"] && link.attributes["type"].value =~ /(atom|rss)/
- }.map { |link|
- absolutify_url(link.attributes["href"].value)
- }.first rescue nil
+ @feed ||= (parsed_feed('rss') || parsed_feed('atom'))
  end
 
  # Returns the charset from the meta tags, looking for it in the following order:
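The refactored `feed` method above tries an RSS link first and falls back to Atom. A hedged sketch of that precedence, with the parsed document stubbed as an array of `[type, href]` pairs instead of a Nokogiri document (`find_feed` and `LINK_TAGS` are illustrative names only):

```ruby
# Stand-in for a page's <link> tags: [type, href] pairs.
LINK_TAGS = [
  ['text/css',             '/assets/style.css'],
  ['application/atom+xml', '/blog/feed.atom']
]

# Mirrors parsed_feed(format): href of the first link whose type is
# application/<format>+xml, or nil when there is none.
def find_feed(links, format)
  link = links.find { |type, _| type == "application/#{format}+xml" }
  link ? link[1] : nil
end

# Mirrors feed: RSS wins over Atom when both are present.
def feed(links)
  find_feed(links, 'rss') || find_feed(links, 'atom')
end

puts feed(LINK_TAGS) # => "/blog/feed.atom"
```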
@@ -133,13 +129,6 @@ module MetaInspector
  errors.empty?
  end
 
- ##### DEPRECATIONS ####
- def parsed?
- warn "the parsed? method has been deprecated, please use ok? instead"
- !@parsed_document.nil?
- end
- ##### DEPRECATIONS ####
-
  private
 
  def defaults
@@ -190,29 +179,36 @@ module MetaInspector
  @data.meta!.name!
  @data.meta!.property!
  parsed_document.xpath("//meta").each do |element|
- if element.attributes["content"]
- if element.attributes["name"]
- @data.meta.name[element.attributes["name"].value.downcase] = element.attributes["content"].value
- end
-
- if element.attributes["property"]
- @data.meta.property[element.attributes["property"].value.downcase] = element.attributes["content"].value
- end
- end
+ get_meta_name_or_property(element)
  end
  end
  end
 
+ # Store meta tag value, looking at meta name or meta property
+ def get_meta_name_or_property(element)
+ if element.attributes["content"]
+ type = element.attributes["name"] ? "name" : (element.attributes["property"] ? "property" : nil)
+
+ @data.meta.name[element.attributes[type].value.downcase] = element.attributes["content"].value if type
+ end
+ end
+
+ def parsed_feed(format)
+ feed = parsed_document.search("//link[@type='application/#{format}+xml']").first
+ feed ? absolutify_url(feed.attributes['href'].value) : nil
+ end
+
  def parsed_links
- @parsed_links ||= parsed_document.search("//a") \
- .map {|link| link.attributes["href"] \
- .to_s.strip}.uniq rescue []
+ @parsed_links ||= cleanup_nokogiri_values(parsed_document.search("//a/@href"))
  end
 
  def parsed_images
- @parsed_images ||= parsed_document.search('//img') \
- .reject{|i| (i.attributes['src'].nil? || i.attributes['src'].value.empty?) } \
- .map{ |i| i.attributes['src'].value }.uniq
+ @parsed_images ||= cleanup_nokogiri_values(parsed_document.search('//img/@src'))
+ end
+
+ # Takes a nokogiri search result, strips the values, rejects the empty ones, and removes duplicates
+ def cleanup_nokogiri_values(results)
+ results.map { |a| a.value.strip }.reject { |s| s.empty? }.uniq
  end
 
  # Stores the error for later inspection
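The `cleanup_nokogiri_values` helper introduced above boils down to strip, reject-empty, dedupe. A sketch of the same pipeline on plain strings, so no Nokogiri is needed (`clean_values` is an illustrative name, not the gem's API):

```ruby
# Same pipeline as cleanup_nokogiri_values, applied to raw strings rather
# than Nokogiri attribute nodes: strip whitespace, drop empties, dedupe.
def clean_values(values)
  values.map { |v| v.strip }.reject { |v| v.empty? }.uniq
end

hrefs = [' /about ', '/about', '', "  \n", '/contact']
p clean_values(hrefs) # => ["/about", "/contact"]
```

Searching for `//a/@href` and `//img/@src` directly, then cleaning the attribute values in one place, is what lets the refactor drop the old `rescue`-guarded attribute juggling.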
@@ -250,7 +246,8 @@ module MetaInspector
 
  # Look for the first <p> block with 120 characters or more
  def secondary_description
- (p = parsed_document.search('//p').map(&:text).select{ |p| p.length > 120 }.first).nil? ? '' : p
+ first_long_paragraph = parsed_document.search('//p[string-length() >= 120]').first
+ first_long_paragraph ? first_long_paragraph.text : ''
  end
 
  def charset_from_meta_charset
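The rewritten `secondary_description` pushes the length check into XPath (`string-length() >= 120`), so Ruby only sees the first match. The selection logic itself, sketched in plain Ruby over an array of paragraph texts (`first_long_paragraph_of` is a hypothetical helper name):

```ruby
# Return the first paragraph with 120+ characters, or '' when none
# qualifies, matching the behaviour of the refactored secondary_description.
def first_long_paragraph_of(paragraphs)
  paragraphs.find { |p| p.length >= 120 } || ''
end

paragraphs = ['Too short.', 'x' * 150]
puts first_long_paragraph_of(paragraphs).length # => 150
```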
@@ -1,5 +1,5 @@
  # -*- encoding: utf-8 -*-
 
  module MetaInspector
- VERSION = "1.12.0"
+ VERSION = "1.12.1"
  end
@@ -17,8 +17,8 @@ Gem::Specification.new do |gem|
  gem.add_dependency 'nokogiri', '~> 1.5'
  gem.add_dependency 'rash', '0.3.2'
 
- gem.add_development_dependency 'rspec', '2.11.0'
+ gem.add_development_dependency 'rspec', '2.12.0'
  gem.add_development_dependency 'fakeweb', '1.3.0'
- gem.add_development_dependency 'awesome_print', '1.0.2'
- gem.add_development_dependency 'rake', '0.9.2.2'
+ gem.add_development_dependency 'awesome_print', '1.1.0'
+ gem.add_development_dependency 'rake', '10.0.2'
  end
data/samples/spider.rb CHANGED
@@ -6,7 +6,7 @@ require 'meta_inspector'
  q = Queue.new
  visited_links=[]
 
- puts "Enter a valid http url to spider it following external links"
+ puts "Enter a valid http url to spider it following internal links"
  url = gets.strip
 
  page = MetaInspector.new(url)
@@ -20,9 +20,9 @@ while q.size > 0
  puts "TITLE: #{page.title}"
  puts "META DESCRIPTION: #{page.meta_description}"
  puts "META KEYWORDS: #{page.meta_keywords}"
- puts "LINKS: #{page.links.size}"
- page.links.each do |link|
- if link[0..6] == 'http://' && !visited_links.include?(link)
+ puts "LINKS: #{page.internal_links.size}"
+ page.internal_links.each do |link|
+ if !visited_links.include?(link)
  q.push(link)
  end
  end
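The spider sample above is a classic queue-plus-visited-set crawl, now over internal links only. A runnable sketch with the network stubbed out as a hash from URL to its internal links (`SITE` and `crawl` are illustrative; the real sample fetches each page with MetaInspector instead):

```ruby
require 'set'

# Stub of a site's internal link graph: url => internal links on that page.
SITE = {
  '/'        => ['/about', '/contact'],
  '/about'   => ['/', '/contact'],
  '/contact' => []
}

# Breadth-first crawl: pop a URL, skip it if already seen, otherwise mark it
# visited and enqueue its not-yet-visited internal links.
def crawl(start, site)
  queue   = [start]
  visited = Set.new
  until queue.empty?
    url = queue.shift
    next if visited.include?(url)
    visited << url
    site.fetch(url, []).each { |link| queue << link unless visited.include?(link) }
  end
  visited.to_a
end

p crawl('/', SITE) # => ["/", "/about", "/contact"]
```

Checking the visited set both when popping and when pushing keeps the queue from ballooning on densely linked pages.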
@@ -89,14 +89,21 @@ describe MetaInspector do
  @m.document.class.should == String
  end
 
- it "should get rss feed" do
- @m = MetaInspector.new('http://www.iteh.at')
- @m.feed.should == 'http://www.iteh.at/de/rss/'
- end
+ describe "Feed" do
+ it "should get rss feed" do
+ @m = MetaInspector.new('http://www.iteh.at')
+ @m.feed.should == 'http://www.iteh.at/de/rss/'
+ end
 
- it "should get atom feed" do
- @m = MetaInspector.new('http://www.tea-tron.com/jbravo/blog/')
- @m.feed.should == 'http://www.tea-tron.com/jbravo/blog/feed/'
+ it "should get atom feed" do
+ @m = MetaInspector.new('http://www.tea-tron.com/jbravo/blog/')
+ @m.feed.should == 'http://www.tea-tron.com/jbravo/blog/feed/'
+ end
+
+ it "should return nil if no feed found" do
+ @m = MetaInspector.new('http://www.alazan.com')
+ @m.feed.should == nil
+ end
  end
 
  describe "get description" do
data/spec/spec_helper.rb CHANGED
@@ -32,7 +32,7 @@ FakeWeb.register_uri(:get, "http://example.com/invalid_href", :response => fixtu
  FakeWeb.register_uri(:get, "http://www.youtube.com/watch?v=iaGSSrp49uc", :response => fixture_file("youtube.response"))
  FakeWeb.register_uri(:get, "http://markupvalidator.com/faqs", :response => fixture_file("markupvalidator_faqs.response"))
  FakeWeb.register_uri(:get, "https://twitter.com/markupvalidator", :response => fixture_file("twitter_markupvalidator.response"))
- FakeWeb.register_uri(:get, "https://example.com/empty", :response => fixture_file("empty_page.response"))
+ FakeWeb.register_uri(:get, "http://example.com/empty", :response => fixture_file("empty_page.response"))
  FakeWeb.register_uri(:get, "http://international.com", :response => fixture_file("international.response"))
  FakeWeb.register_uri(:get, "http://charset000.com", :response => fixture_file("charset_000.response"))
  FakeWeb.register_uri(:get, "http://charset001.com", :response => fixture_file("charset_001.response"))
metadata CHANGED
@@ -1,13 +1,13 @@
  --- !ruby/object:Gem::Specification
  name: metainspector
  version: !ruby/object:Gem::Version
- hash: 39
+ hash: 37
  prerelease:
  segments:
  - 1
  - 12
- - 0
- version: 1.12.0
+ - 1
+ version: 1.12.1
  platform: ruby
  authors:
  - Jaime Iniesta
@@ -15,10 +15,12 @@ autorequire:
  bindir: bin
  cert_chain: []
 
- date: 2012-12-01 00:00:00 Z
+ date: 2012-12-03 00:00:00 Z
  dependencies:
  - !ruby/object:Gem::Dependency
- version_requirements: &id001 !ruby/object:Gem::Requirement
+ name: nokogiri
+ prerelease: false
+ requirement: &id001 !ruby/object:Gem::Requirement
  none: false
  requirements:
  - - ~>
@@ -28,12 +30,12 @@ dependencies:
  - 1
  - 5
  version: "1.5"
- prerelease: false
  type: :runtime
- name: nokogiri
- requirement: *id001
+ version_requirements: *id001
  - !ruby/object:Gem::Dependency
- version_requirements: &id002 !ruby/object:Gem::Requirement
+ name: rash
+ prerelease: false
+ requirement: &id002 !ruby/object:Gem::Requirement
  none: false
  requirements:
  - - "="
@@ -44,28 +46,28 @@ dependencies:
  - 3
  - 2
  version: 0.3.2
- prerelease: false
  type: :runtime
- name: rash
- requirement: *id002
+ version_requirements: *id002
  - !ruby/object:Gem::Dependency
- version_requirements: &id003 !ruby/object:Gem::Requirement
+ name: rspec
+ prerelease: false
+ requirement: &id003 !ruby/object:Gem::Requirement
  none: false
  requirements:
  - - "="
  - !ruby/object:Gem::Version
- hash: 35
+ hash: 63
  segments:
  - 2
- - 11
+ - 12
  - 0
- version: 2.11.0
- prerelease: false
+ version: 2.12.0
  type: :development
- name: rspec
- requirement: *id003
+ version_requirements: *id003
  - !ruby/object:Gem::Dependency
- version_requirements: &id004 !ruby/object:Gem::Requirement
+ name: fakeweb
+ prerelease: false
+ requirement: &id004 !ruby/object:Gem::Requirement
  none: false
  requirements:
  - - "="
@@ -76,12 +78,12 @@ dependencies:
  - 3
  - 0
  version: 1.3.0
- prerelease: false
  type: :development
- name: fakeweb
- requirement: *id004
+ version_requirements: *id004
  - !ruby/object:Gem::Dependency
- version_requirements: &id005 !ruby/object:Gem::Requirement
+ name: awesome_print
+ prerelease: false
+ requirement: &id005 !ruby/object:Gem::Requirement
  none: false
  requirements:
  - - "="
@@ -89,30 +91,27 @@ dependencies:
  hash: 19
  segments:
  - 1
+ - 1
  - 0
- - 2
- version: 1.0.2
- prerelease: false
+ version: 1.1.0
  type: :development
- name: awesome_print
- requirement: *id005
+ version_requirements: *id005
  - !ruby/object:Gem::Dependency
- version_requirements: &id006 !ruby/object:Gem::Requirement
+ name: rake
+ prerelease: false
+ requirement: &id006 !ruby/object:Gem::Requirement
  none: false
  requirements:
  - - "="
  - !ruby/object:Gem::Version
- hash: 11
+ hash: 75
  segments:
+ - 10
  - 0
- - 9
  - 2
- - 2
- version: 0.9.2.2
- prerelease: false
+ version: 10.0.2
  type: :development
- name: rake
- requirement: *id006
+ version_requirements: *id006
  description: MetaInspector lets you scrape a web page and get its title, charset, link and meta tags
  email:
  - jaimeiniesta@gmail.com
@@ -128,7 +127,7 @@ files:
  - .travis.yml
  - Gemfile
  - MIT-LICENSE
- - README.rdoc
+ - README.md
  - Rakefile
  - lib/meta_inspector.rb
  - lib/meta_inspector/open_uri.rb
data/README.rdoc DELETED
@@ -1,152 +0,0 @@
- = MetaInspector {<img src="https://secure.travis-ci.org/jaimeiniesta/metainspector.png?branch=master" />}[http://travis-ci.org/jaimeiniesta/metainspector] {<img src="https://codeclimate.com/badge.png" />}[https://codeclimate.com/github/jaimeiniesta/metainspector]
-
- MetaInspector is a gem for web scraping purposes. You give it an URL, and it lets you easily get its title, links, images, charset, description, keywords, meta tags...
-
- = See it in action!
-
- You can try MetaInspector live at this little demo: https://metainspectordemo.herokuapp.com
-
- = Installation
-
- Install the gem from RubyGems:
-
- gem install metainspector
-
- This gem is tested on Ruby versions 1.8.7, 1.9.2 and 1.9.3.
-
- = Usage
-
- Initialize a scraper instance for an URL, like this:
-
- page = MetaInspector::Scraper.new('http://markupvalidator.com')
-
- or, for short, a convenience alias is also available:
-
- page = MetaInspector.new('http://markupvalidator.com')
-
- If you don't include the scheme on the URL, http:// will be used
- by defaul:
-
- page = MetaInspector.new('markupvalidator.com')
-
- By default, MetaInspector times out after 20 seconds of waiting for a page to respond.
- You can set a different timeout with a second parameter, like this:
-
- page = MetaInspector.new('markupvalidator.com', :timeout => 5) # this would wait just 5 seconds to timeout
-
- MetaInspector will try to parse all URLs by default. If you want to parse only those URLs that have text/html as content-type you can specify it like this:
-
- page = MetaInspector.new('markupvalidator.com', :html_content_only => true)
-
- MetaInspector allows safe redirects from http to https sites by default. Passing allow_safe_redirections as false will throw exceptions on such redirects.
-
- page = MetaInspector.new('facebook.com', :allow_safe_redirections => false)
-
- To enable unsafe redirects from https to http sites you can pass allow_unsafe_redirections as true. If this option is not specified or is false an exception is thrown on such redirects.
-
- page = MetaInspector.new('facebook.com', :allow_unsafe_redirections => true)
-
- Then you can see the scraped data like this:
-
- page.url # URL of the page
- page.scheme # Scheme of the page (http, https)
- page.host # Hostname of the page (like, markupvalidator.com, without the scheme)
- page.root_url # Root url (scheme + host, like http://markupvalidator.com/)
- page.title # title of the page, as string
- page.links # array of strings, with every link found on the page as an absolute URL
- page.internal_links # array of strings, with every internal link found on the page as an absolute URL
- page.external_links # array of strings, with every external link found on the page as an absolute URL
- page.meta_description # meta description, as string
- page.description # returns the meta description, or the first long paragraph if no meta description is found
- page.meta_keywords # meta keywords, as string
- page.image # Most relevant image, if defined with og:image
- page.images # array of strings, with every img found on the page as an absolute URL
- page.feed # Get rss or atom links in meta data fields as array
- page.meta_og_title # opengraph title
- page.meta_og_image # opengraph image
- page.charset # UTF-8
- page.content_type # content-type returned by the server when the url was requested
-
- MetaInspector uses dynamic methods for meta_tag discovery, so all these will work, and will be converted to a search of a meta tag by the corresponding name, and return its content attribute
-
- page.meta_description # <meta name="description" content="..." />
- page.meta_keywords # <meta name="keywords" content="..." />
- page.meta_robots # <meta name="robots" content="..." />
- page.meta_generator # <meta name="generator" content="..." />
-
- It will also work for the meta tags of the form <meta http-equiv="name" ... />, like the following:
-
- page.meta_content_language # <meta http-equiv="content-language" content="..." />
- page.meta_Content_Type # <meta http-equiv="Content-Type" content="..." />
-
- Please notice that MetaInspector is case sensitive, so page.meta_Content_Type is not the same as page.meta_content_type
-
- You can also access most of the scraped data as a hash:
-
- page.to_hash # { "url"=>"http://markupvalidator.com", "title" => "MarkupValidator :: site-wide markup validation tool", ... }
-
- The full scraped document if accessible from:
-
- page.document # Nokogiri doc that you can use it to get any element from the page
-
- = Errors handling
-
- You can check if the page has been succesfully parsed with:
-
- page.ok? # Will return true if everything looks OK
-
- In case there have been any errors, you can check them with:
-
- page.errors # Will return an array with the error messages
-
- If you also want to see the errors on console, you can initialize MetaInspector with the verbose option like that:
-
- page = MetaInspector.new('http://example.com', :verbose => true)
-
- = Examples
-
- You can find some sample scripts on the samples folder, including a basic scraping and a spider that will follow external links using a queue. What follows is an example of use from irb:
-
- $ irb
- >> require 'metainspector'
- => true
-
- >> page = MetaInspector.new('http://markupvalidator.com')
- => #<MetaInspector:0x11330c0 @url="http://markupvalidator.com">
-
- >> page.title
- => "MarkupValidator :: site-wide markup validation tool"
-
- >> page.meta_description
- => "Site-wide markup validation tool. Validate the markup of your whole site with just one click."
-
- >> page.meta_keywords
- => "html, markup, validation, validator, tool, w3c, development, standards, free"
-
- >> page.links.size
- => 15
-
- >> page.links[4]
- => "/plans-and-pricing"
-
- >> page.document.class
- => String
-
- >> page.parsed_document.class
- => Nokogiri::HTML::Document
-
- = ZOMG Fork! Thank you!
-
- You're welcome to fork this project and send pull requests. Just remember to include specs.
-
- Thanks to all the contributors:
-
- https://github.com/jaimeiniesta/metainspector/graphs/contributors
-
- = To Do
-
- * Get page.base_dir from the URL
- * If keywords seem to be separated by blank spaces, replace them with commas
- * Autodiscover all available meta tags
-
- Copyright (c) 2009-2012 Jaime Iniesta, released under the MIT license