metainspector 1.12.0 → 1.12.1
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- data/README.md +169 -0
- data/lib/meta_inspector/scraper.rb +26 -29
- data/lib/meta_inspector/version.rb +1 -1
- data/meta_inspector.gemspec +3 -3
- data/samples/spider.rb +4 -4
- data/spec/metainspector_spec.rb +14 -7
- data/spec/spec_helper.rb +1 -1
- metadata +37 -38
- data/README.rdoc +0 -152
data/README.md
ADDED
@@ -0,0 +1,169 @@
+ # MetaInspector [](http://travis-ci.org/jaimeiniesta/metainspector) [](https://gemnasium.com/jaimeiniesta/metainspector)
+
+ MetaInspector is a gem for web scraping purposes. You give it a URL, and it lets you easily get its title, links, images, charset, description, keywords, meta tags...
+
+ ## See it in action!
+
+ You can try MetaInspector live at this little demo: [https://metainspectordemo.herokuapp.com](https://metainspectordemo.herokuapp.com)
+
+ ## Installation
+
+ Install the gem from RubyGems:
+
+     gem install metainspector
+
+ If you're using it on a Rails application, just add it to your Gemfile and run `bundle install`:
+
+     gem 'metainspector'
+
+ This gem is tested on Ruby versions 1.8.7, 1.9.2 and 1.9.3.
+
+ ## Usage
+
+ Initialize a MetaInspector instance for a URL, like this:
+
+     page = MetaInspector.new('http://markupvalidator.com')
+
+ If you don't include the scheme on the URL, http:// will be used by default:
+
+     page = MetaInspector.new('markupvalidator.com')
+
+ ## Accessing scraped data
+
+ You can then access the scraped data like this:
+
+     page.url               # URL of the page
+     page.scheme            # Scheme of the page (http, https)
+     page.host              # Hostname of the page (like markupvalidator.com, without the scheme)
+     page.root_url          # Root url (scheme + host, like http://markupvalidator.com/)
+     page.title             # Title of the page, as a string
+     page.links             # Array of strings, with every link found on the page as an absolute URL
+     page.internal_links    # Array of strings, with every internal link found on the page as an absolute URL
+     page.external_links    # Array of strings, with every external link found on the page as an absolute URL
+     page.meta_description  # Meta description, as a string
+     page.description       # Returns the meta description, or the first long paragraph if no meta description is found
+     page.meta_keywords     # Meta keywords, as a string
+     page.image             # Most relevant image, if defined with og:image
+     page.images            # Array of strings, with every img found on the page as an absolute URL
+     page.feed              # RSS or Atom feed link declared in the page's meta data, as a string (nil if none found)
+     page.meta_og_title     # Opengraph title
+     page.meta_og_image     # Opengraph image
+     page.charset           # Charset of the page, like "UTF-8"
+     page.content_type      # Content-type returned by the server when the url was requested
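As the accessor list notes, links and images are returned as absolute URLs even when the page markup uses relative ones. A minimal, stdlib-only sketch of that resolution step, using Ruby's `URI.join` (the `absolutify` helper name is illustrative, not part of MetaInspector's API):

```ruby
require 'uri'

# Hypothetical helper: resolve a (possibly relative) href against the page URL.
def absolutify(page_url, href)
  URI.join(page_url, href).to_s
end

absolutify('http://markupvalidator.com/faqs', '/plans-and-pricing')
# => "http://markupvalidator.com/plans-and-pricing"
```

`URI.join` handles root-relative, document-relative, and already-absolute hrefs uniformly, which is why it is a common basis for this kind of helper.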
+ MetaInspector uses dynamic methods for meta tag discovery: any `meta_*` call is converted into a search for the meta tag with the corresponding name, and returns its content attribute. So all of these work:
+
+     page.meta_description       # <meta name="description" content="..." />
+     page.meta_keywords          # <meta name="keywords" content="..." />
+     page.meta_robots            # <meta name="robots" content="..." />
+     page.meta_generator         # <meta name="generator" content="..." />
+
+ It also works for meta tags of the form <meta http-equiv="name" ... />, like the following:
+
+     page.meta_content_language  # <meta http-equiv="content-language" content="..." />
+     page.meta_Content_Type      # <meta http-equiv="Content-Type" content="..." />
+
+ Please note that MetaInspector is case sensitive, so `page.meta_Content_Type` is not the same as `page.meta_content_type`.
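This kind of dynamic lookup is typically built on Ruby's `method_missing`. The sketch below is an illustrative approximation of the technique, not MetaInspector's actual implementation (the `MetaTags` class and its underscore-to-hyphen mapping are assumptions for the example):

```ruby
class MetaTags
  def initialize(tags)
    @tags = tags # e.g. { "description" => "...", "Content-Type" => "..." }
  end

  # meta_description looks up the "description" tag's content.
  # Underscores map to hyphens and case is preserved, which is why
  # meta_Content_Type and meta_content_type are different lookups.
  def method_missing(name, *args)
    if name.to_s =~ /\Ameta_(.+)\z/
      @tags[$1.tr('_', '-')]
    else
      super
    end
  end
end

page = MetaTags.new('description' => 'A demo page', 'Content-Type' => 'text/html')
page.meta_description  # => "A demo page"
page.meta_Content_Type # => "text/html"
```

Because the lookup key keeps the caller's capitalization, case sensitivity falls out of the design for free.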
+ You can also access most of the scraped data as a hash:
+
+     page.to_hash  # { "url"   => "http://markupvalidator.com",
+                   #   "title" => "MarkupValidator :: site-wide markup validation tool", ... }
+
+ The full scraped document is accessible from:
+
+     page.document         # Raw document, as a string
+     page.parsed_document  # Nokogiri doc that you can use to get any element from the page
+
+ ## Options
+
+ ### Timeout
+
+ By default, MetaInspector times out after 20 seconds of waiting for a page to respond.
+ You can set a different timeout with a second parameter, like this:
+
+     page = MetaInspector.new('markupvalidator.com', :timeout => 5) # 5 seconds timeout
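A timeout like this is usually enforced with Ruby's standard `Timeout` module. The following is a sketch of the general mechanism only (the `fetch_with_timeout` wrapper is hypothetical, not MetaInspector's code, which also records the failure in `page.errors`):

```ruby
require 'timeout'

# Hypothetical fetch wrapper: give up after `seconds`, returning nil on timeout.
def fetch_with_timeout(seconds)
  Timeout.timeout(seconds) { yield }
rescue Timeout::Error
  nil
end

fetch_with_timeout(5) { "response body" }            # => "response body"
fetch_with_timeout(0.01) { sleep 0.1; "too slow" }   # => nil
```

Rescuing `Timeout::Error` instead of letting it propagate is what lets the scraper report a clean error rather than crash mid-spider.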
+ ### Redirections
+
+ By default, MetaInspector allows safe redirects from http to https (for example, [http://github.com](http://github.com) => [https://github.com](https://github.com)). With the option `:allow_safe_redirections => false`, it will throw exceptions on such redirects:
+
+     page = MetaInspector.new('facebook.com', :allow_safe_redirections => false)
+
+ To enable unsafe redirects from https to http (like [https://example.com](https://example.com) => [http://example.com](http://example.com)), you can pass the option `:allow_unsafe_redirections => true`. If this option is not specified or is false, an exception is thrown on such redirects:
+
+     page = MetaInspector.new('facebook.com', :allow_unsafe_redirections => true)
+
+ ### HTML Content Only
+
+ MetaInspector will try to parse all URLs by default. If you want to raise an error when trying to parse a non-HTML URL (one with a content-type other than text/html), you can state it like this:
+
+     page = MetaInspector.new('markupvalidator.com', :html_content_only => true)
+
+ This is useful when using MetaInspector for web spidering. Although the initial URL will probably be an HTML page, following links you may find yourself trying to parse non-HTML URLs:
+
+     page = MetaInspector.new('http://example.com/image.png')
+     page.title         # returns ""
+     page.content_type  # "image/png"
+     page.ok?           # true
+
+     page = MetaInspector.new('http://example.com/image.png', :html_content_only => true)
+     page.title         # returns nil
+     page.content_type  # "image/png"
+     page.ok?           # false
+     page.errors.first  # "Scraping exception: The url provided contains image/png content instead of text/html content"
+
+ ## Error handling
+
+ You can check if the page has been successfully parsed with:
+
+     page.ok?  # Will return true if everything looks OK
+
+ In case there have been any errors, you can check them with:
+
+     page.errors  # Will return an array with the error messages
+
+ If you also want to see the errors on the console, you can initialize MetaInspector with the verbose option, like this:
+
+     page = MetaInspector.new('http://example.com', :verbose => true)
+
+ ## Examples
+
+ You can find some sample scripts in the samples folder, including a basic scraper and a spider that will follow internal links using a queue. What follows is an example of use from irb:
+
+     $ irb
+     >> require 'metainspector'
+     => true
+
+     >> page = MetaInspector.new('http://markupvalidator.com')
+     => #<MetaInspector:0x11330c0 @url="http://markupvalidator.com">
+
+     >> page.title
+     => "MarkupValidator :: site-wide markup validation tool"
+
+     >> page.meta_description
+     => "Site-wide markup validation tool. Validate the markup of your whole site with just one click."
+
+     >> page.meta_keywords
+     => "html, markup, validation, validator, tool, w3c, development, standards, free"
+
+     >> page.links.size
+     => 15
+
+     >> page.links[4]
+     => "/plans-and-pricing"
+
+     >> page.document.class
+     => String
+
+     >> page.parsed_document.class
+     => Nokogiri::HTML::Document
+
+ ## ZOMG Fork! Thank you!
+
+ You're welcome to fork this project and send pull requests. Just remember to include specs.
+
+ Thanks to all the contributors:
+
+ [https://github.com/jaimeiniesta/metainspector/graphs/contributors](https://github.com/jaimeiniesta/metainspector/graphs/contributors)
+
+ Copyright (c) 2009-2012 Jaime Iniesta, released under the MIT license
data/lib/meta_inspector/scraper.rb
CHANGED
@@ -72,13 +72,9 @@ module MetaInspector
        meta_og_image
      end

-     # Returns the parsed document meta rss
+     # Returns the parsed document meta rss link
      def feed
-       @feed ||= parsed_document.xpath("//link").select { |link|
-         link.attributes["type"] && link.attributes["type"].value =~ /(atom|rss)/
-       }.map { |link|
-         absolutify_url(link.attributes["href"].value)
-       }.first rescue nil
+       @feed ||= (parsed_feed('rss') || parsed_feed('atom'))
      end

      # Returns the charset from the meta tags, looking for it in the following order:
@@ -133,13 +129,6 @@ module MetaInspector
        errors.empty?
      end

-     ##### DEPRECATIONS ####
-     def parsed?
-       warn "the parsed? method has been deprecated, please use ok? instead"
-       !@parsed_document.nil?
-     end
-     ##### DEPRECATIONS ####
-
      private

      def defaults
@@ -190,29 +179,36 @@ module MetaInspector
        @data.meta!.name!
        @data.meta!.property!
        parsed_document.xpath("//meta").each do |element|
-
-         if element.attributes["name"]
-           @data.meta.name[element.attributes["name"].value.downcase] = element.attributes["content"].value
-         end
-
-         if element.attributes["property"]
-           @data.meta.property[element.attributes["property"].value.downcase] = element.attributes["content"].value
-         end
+         get_meta_name_or_property(element)
        end
      end
    end

+   # Store meta tag value, looking at meta name or meta property
+   def get_meta_name_or_property(element)
+     if element.attributes["content"]
+       type = element.attributes["name"] ? "name" : (element.attributes["property"] ? "property" : nil)
+
+       @data.meta.name[element.attributes[type].value.downcase] = element.attributes["content"].value if type
+     end
+   end
+
+   def parsed_feed(format)
+     feed = parsed_document.search("//link[@type='application/#{format}+xml']").first
+     feed ? absolutify_url(feed.attributes['href'].value) : nil
+   end
+
    def parsed_links
-     @parsed_links ||= parsed_document.search("//a")
-       .map {|link| link.attributes["href"] \
-       .to_s.strip}.uniq rescue []
+     @parsed_links ||= cleanup_nokogiri_values(parsed_document.search("//a/@href"))
    end

    def parsed_images
-     @parsed_images ||= parsed_document.search('//img')
+     @parsed_images ||= cleanup_nokogiri_values(parsed_document.search('//img/@src'))
+   end
+
+   # Takes a nokogiri search result, strips the values, rejects the empty ones, and removes duplicates
+   def cleanup_nokogiri_values(results)
+     results.map { |a| a.value.strip }.reject { |s| s.empty? }.uniq
    end

    # Stores the error for later inspection
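The new `cleanup_nokogiri_values` helper is, at heart, a strip / reject-empty / uniq pipeline. Detached from Nokogiri, the same chain over plain attribute strings behaves like this (a stdlib-only illustration; `cleanup_values` is a hypothetical stand-in name):

```ruby
# The same strip / reject-empty / uniq chain, applied to raw attribute strings.
def cleanup_values(values)
  values.map { |v| v.strip }.reject { |v| v.empty? }.uniq
end

hrefs = ['  /about ', '/about', '', '   ', '/contact']
cleanup_values(hrefs) # => ["/about", "/contact"]
```

Deduplicating after stripping matters: `'  /about '` and `'/about'` only collapse into one entry once the whitespace is gone.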
@@ -250,7 +246,8 @@ module MetaInspector

      # Look for the first <p> block with 120 characters or more
      def secondary_description
+       first_long_paragraph = parsed_document.search('//p[string-length() >= 120]').first
+       first_long_paragraph ? first_long_paragraph.text : ''
      end

      def charset_from_meta_charset
data/meta_inspector.gemspec
CHANGED
@@ -17,8 +17,8 @@ Gem::Specification.new do |gem|
    gem.add_dependency 'nokogiri', '~> 1.5'
    gem.add_dependency 'rash', '0.3.2'

-   gem.add_development_dependency 'rspec', '2.
+   gem.add_development_dependency 'rspec', '2.12.0'
    gem.add_development_dependency 'fakeweb', '1.3.0'
-   gem.add_development_dependency 'awesome_print', '1.0.2'
-   gem.add_development_dependency 'rake', '0.9.2.2'
+   gem.add_development_dependency 'awesome_print', '1.1.0'
+   gem.add_development_dependency 'rake', '10.0.2'
  end
data/samples/spider.rb
CHANGED
@@ -6,7 +6,7 @@ require 'meta_inspector'
  q = Queue.new
  visited_links=[]

- puts "Enter a valid http url to spider it following
+ puts "Enter a valid http url to spider it following internal links"
  url = gets.strip

  page = MetaInspector.new(url)

@@ -20,9 +20,9 @@ while q.size > 0
    puts "TITLE: #{page.title}"
    puts "META DESCRIPTION: #{page.meta_description}"
    puts "META KEYWORDS: #{page.meta_keywords}"
-   puts "LINKS: #{page.
-   page.
-   if
+   puts "LINKS: #{page.internal_links.size}"
+   page.internal_links.each do |link|
+     if !visited_links.include?(link)
        q.push(link)
      end
    end
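The spider sample's core loop — pop a URL, collect its internal links, enqueue only unvisited ones — can be sketched offline with a stubbed link lookup standing in for `MetaInspector#internal_links` (the `LINKS` graph below is hypothetical data; no HTTP is involved):

```ruby
# Stubbed site graph standing in for MetaInspector#internal_links.
LINKS = {
  'http://example.com/'  => ['http://example.com/a', 'http://example.com/b'],
  'http://example.com/a' => ['http://example.com/'],
  'http://example.com/b' => []
}

q = Queue.new
visited_links = []
q.push('http://example.com/')

while q.size > 0
  url = q.pop
  next if visited_links.include?(url)
  visited_links << url
  LINKS.fetch(url, []).each do |link|
    q.push(link) unless visited_links.include?(link)
  end
end

visited_links # => each of the three pages, visited exactly once
```

The `visited_links` check on both pop and push is what keeps cyclic links (page `a` linking back to the root) from looping forever.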
data/spec/metainspector_spec.rb
CHANGED
@@ -89,14 +89,21 @@ describe MetaInspector do
      @m.document.class.should == String
    end

+   describe "Feed" do
+     it "should get rss feed" do
+       @m = MetaInspector.new('http://www.iteh.at')
+       @m.feed.should == 'http://www.iteh.at/de/rss/'
+     end

+     it "should get atom feed" do
+       @m = MetaInspector.new('http://www.tea-tron.com/jbravo/blog/')
+       @m.feed.should == 'http://www.tea-tron.com/jbravo/blog/feed/'
+     end
+
+     it "should return nil if no feed found" do
+       @m = MetaInspector.new('http://www.alazan.com')
+       @m.feed.should == nil
+     end
    end

    describe "get description" do
data/spec/spec_helper.rb
CHANGED
@@ -32,7 +32,7 @@ FakeWeb.register_uri(:get, "http://example.com/invalid_href", :response => fixtu
  FakeWeb.register_uri(:get, "http://www.youtube.com/watch?v=iaGSSrp49uc", :response => fixture_file("youtube.response"))
  FakeWeb.register_uri(:get, "http://markupvalidator.com/faqs", :response => fixture_file("markupvalidator_faqs.response"))
  FakeWeb.register_uri(:get, "https://twitter.com/markupvalidator", :response => fixture_file("twitter_markupvalidator.response"))
- FakeWeb.register_uri(:get, "
+ FakeWeb.register_uri(:get, "http://example.com/empty", :response => fixture_file("empty_page.response"))
  FakeWeb.register_uri(:get, "http://international.com", :response => fixture_file("international.response"))
  FakeWeb.register_uri(:get, "http://charset000.com", :response => fixture_file("charset_000.response"))
  FakeWeb.register_uri(:get, "http://charset001.com", :response => fixture_file("charset_001.response"))
metadata
CHANGED
@@ -1,13 +1,13 @@
  --- !ruby/object:Gem::Specification
  name: metainspector
  version: !ruby/object:Gem::Version
-   hash:
+   hash: 37
    prerelease:
    segments:
    - 1
    - 12
-   - 0
-   version: 1.12.0
+   - 1
+   version: 1.12.1
  platform: ruby
  authors:
  - Jaime Iniesta
@@ -15,10 +15,12 @@ autorequire:
  bindir: bin
  cert_chain: []

- date: 2012-12-
+ date: 2012-12-03 00:00:00 Z
  dependencies:
  - !ruby/object:Gem::Dependency
-
+   name: nokogiri
+   prerelease: false
+   requirement: &id001 !ruby/object:Gem::Requirement
    none: false
    requirements:
    - - ~>
@@ -28,12 +30,12 @@ dependencies:
    - 1
    - 5
    version: "1.5"
-   prerelease: false
    type: :runtime
-
-   requirement: *id001
+   version_requirements: *id001
  - !ruby/object:Gem::Dependency
-
+   name: rash
+   prerelease: false
+   requirement: &id002 !ruby/object:Gem::Requirement
    none: false
    requirements:
    - - "="
@@ -44,28 +46,28 @@ dependencies:
    - 3
    - 2
    version: 0.3.2
-   prerelease: false
    type: :runtime
-
-   requirement: *id002
+   version_requirements: *id002
  - !ruby/object:Gem::Dependency
-
+   name: rspec
+   prerelease: false
+   requirement: &id003 !ruby/object:Gem::Requirement
    none: false
    requirements:
    - - "="
    - !ruby/object:Gem::Version
-     hash:
+     hash: 63
      segments:
      - 2
-     -
+     - 12
      - 0
-     version: 2.
-     prerelease: false
+     version: 2.12.0
    type: :development
-
-   requirement: *id003
+   version_requirements: *id003
  - !ruby/object:Gem::Dependency
-
+   name: fakeweb
+   prerelease: false
+   requirement: &id004 !ruby/object:Gem::Requirement
    none: false
    requirements:
    - - "="
@@ -76,12 +78,12 @@ dependencies:
    - 3
    - 0
    version: 1.3.0
-   prerelease: false
    type: :development
-
-   requirement: *id004
+   version_requirements: *id004
  - !ruby/object:Gem::Dependency
-
+   name: awesome_print
+   prerelease: false
+   requirement: &id005 !ruby/object:Gem::Requirement
    none: false
    requirements:
    - - "="
@@ -89,30 +91,27 @@ dependencies:
      hash: 19
      segments:
      - 1
+     - 1
      - 0
-     - 2
-     version: 1.0.2
-     prerelease: false
+     version: 1.1.0
    type: :development
-
-   requirement: *id005
+   version_requirements: *id005
  - !ruby/object:Gem::Dependency
-
+   name: rake
+   prerelease: false
+   requirement: &id006 !ruby/object:Gem::Requirement
    none: false
    requirements:
    - - "="
    - !ruby/object:Gem::Version
-     hash:
+     hash: 75
      segments:
+     - 10
      - 0
-     - 9
-     - 2
      - 2
-     version: 0.9.2.2
-     prerelease: false
+     version: 10.0.2
    type: :development
-
-   requirement: *id006
+   version_requirements: *id006
  description: MetaInspector lets you scrape a web page and get its title, charset, link and meta tags
  email:
  - jaimeiniesta@gmail.com
@@ -128,7 +127,7 @@ files:
  - .travis.yml
  - Gemfile
  - MIT-LICENSE
- - README.rdoc
+ - README.md
  - Rakefile
  - lib/meta_inspector.rb
  - lib/meta_inspector/open_uri.rb
data/README.rdoc
DELETED
@@ -1,152 +0,0 @@
- = MetaInspector {<img src="https://secure.travis-ci.org/jaimeiniesta/metainspector.png?branch=master" />}[http://travis-ci.org/jaimeiniesta/metainspector] {<img src="https://codeclimate.com/badge.png" />}[https://codeclimate.com/github/jaimeiniesta/metainspector]
-
- MetaInspector is a gem for web scraping purposes. You give it an URL, and it lets you easily get its title, links, images, charset, description, keywords, meta tags...
-
- = See it in action!
-
- You can try MetaInspector live at this little demo: https://metainspectordemo.herokuapp.com
-
- = Installation
-
- Install the gem from RubyGems:
-
-   gem install metainspector
-
- This gem is tested on Ruby versions 1.8.7, 1.9.2 and 1.9.3.
-
- = Usage
-
- Initialize a scraper instance for an URL, like this:
-
-   page = MetaInspector::Scraper.new('http://markupvalidator.com')
-
- or, for short, a convenience alias is also available:
-
-   page = MetaInspector.new('http://markupvalidator.com')
-
- If you don't include the scheme on the URL, http:// will be used
- by defaul:
-
-   page = MetaInspector.new('markupvalidator.com')
-
- By default, MetaInspector times out after 20 seconds of waiting for a page to respond.
- You can set a different timeout with a second parameter, like this:
-
-   page = MetaInspector.new('markupvalidator.com', :timeout => 5) # this would wait just 5 seconds to timeout
-
- MetaInspector will try to parse all URLs by default. If you want to parse only those URLs that have text/html as content-type you can specify it like this:
-
-   page = MetaInspector.new('markupvalidator.com', :html_content_only => true)
-
- MetaInspector allows safe redirects from http to https sites by default. Passing allow_safe_redirections as false will throw exceptions on such redirects.
-
-   page = MetaInspector.new('facebook.com', :allow_safe_redirections => false)
-
- To enable unsafe redirects from https to http sites you can pass allow_unsafe_redirections as true. If this option is not specified or is false an exception is thrown on such redirects.
-
-   page = MetaInspector.new('facebook.com', :allow_unsafe_redirections => true)
-
- Then you can see the scraped data like this:
-
-   page.url               # URL of the page
-   page.scheme            # Scheme of the page (http, https)
-   page.host              # Hostname of the page (like, markupvalidator.com, without the scheme)
-   page.root_url          # Root url (scheme + host, like http://markupvalidator.com/)
-   page.title             # title of the page, as string
-   page.links             # array of strings, with every link found on the page as an absolute URL
-   page.internal_links    # array of strings, with every internal link found on the page as an absolute URL
-   page.external_links    # array of strings, with every external link found on the page as an absolute URL
-   page.meta_description  # meta description, as string
-   page.description       # returns the meta description, or the first long paragraph if no meta description is found
-   page.meta_keywords     # meta keywords, as string
-   page.image             # Most relevant image, if defined with og:image
-   page.images            # array of strings, with every img found on the page as an absolute URL
-   page.feed              # Get rss or atom links in meta data fields as array
-   page.meta_og_title     # opengraph title
-   page.meta_og_image     # opengraph image
-   page.charset           # UTF-8
-   page.content_type      # content-type returned by the server when the url was requested
-
- MetaInspector uses dynamic methods for meta_tag discovery, so all these will work, and will be converted to a search of a meta tag by the corresponding name, and return its content attribute
-
-   page.meta_description  # <meta name="description" content="..." />
-   page.meta_keywords     # <meta name="keywords" content="..." />
-   page.meta_robots       # <meta name="robots" content="..." />
-   page.meta_generator    # <meta name="generator" content="..." />
-
- It will also work for the meta tags of the form <meta http-equiv="name" ... />, like the following:
-
-   page.meta_content_language  # <meta http-equiv="content-language" content="..." />
-   page.meta_Content_Type      # <meta http-equiv="Content-Type" content="..." />
-
- Please notice that MetaInspector is case sensitive, so page.meta_Content_Type is not the same as page.meta_content_type
-
- You can also access most of the scraped data as a hash:
-
-   page.to_hash # { "url"=>"http://markupvalidator.com", "title" => "MarkupValidator :: site-wide markup validation tool", ... }
-
- The full scraped document if accessible from:
-
-   page.document # Nokogiri doc that you can use it to get any element from the page
-
- = Errors handling
-
- You can check if the page has been succesfully parsed with:
-
-   page.ok? # Will return true if everything looks OK
-
- In case there have been any errors, you can check them with:
-
-   page.errors # Will return an array with the error messages
-
- If you also want to see the errors on console, you can initialize MetaInspector with the verbose option like that:
-
-   page = MetaInspector.new('http://example.com', :verbose => true)
-
- = Examples
-
- You can find some sample scripts on the samples folder, including a basic scraping and a spider that will follow external links using a queue. What follows is an example of use from irb:
-
-   $ irb
-   >> require 'metainspector'
-   => true
-
-   >> page = MetaInspector.new('http://markupvalidator.com')
-   => #<MetaInspector:0x11330c0 @url="http://markupvalidator.com">
-
-   >> page.title
-   => "MarkupValidator :: site-wide markup validation tool"
-
-   >> page.meta_description
-   => "Site-wide markup validation tool. Validate the markup of your whole site with just one click."
-
-   >> page.meta_keywords
-   => "html, markup, validation, validator, tool, w3c, development, standards, free"
-
-   >> page.links.size
-   => 15
-
-   >> page.links[4]
-   => "/plans-and-pricing"
-
-   >> page.document.class
-   => String
-
-   >> page.parsed_document.class
-   => Nokogiri::HTML::Document
-
- = ZOMG Fork! Thank you!
-
- You're welcome to fork this project and send pull requests. Just remember to include specs.
-
- Thanks to all the contributors:
-
- https://github.com/jaimeiniesta/metainspector/graphs/contributors
-
- = To Do
-
- * Get page.base_dir from the URL
- * If keywords seem to be separated by blank spaces, replace them with commas
- * Autodiscover all available meta tags
-
- Copyright (c) 2009-2012 Jaime Iniesta, released under the MIT license