metainspector 1.17.2 → 1.17.3
- checksums.yaml +4 -4
- data/README.md +10 -16
- data/lib/meta_inspector/parser.rb +22 -1
- data/lib/meta_inspector/version.rb +1 -1
- data/spec/fixtures/opengraph.response +52 -0
- data/spec/parser_spec.rb +89 -14
- data/spec/spec_helper.rb +1 -0
- metadata +4 -3
checksums.yaml
CHANGED
@@ -1,7 +1,7 @@
 ---
 SHA1:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: 5c94b07066d8b0080029d5e93808a4388b716575
+  data.tar.gz: 33135740e3e740e21c4ccc011a44a3466c73a926
 SHA512:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: 4c3ffda64efceaaaa9631178751df36a0316d17176ce0788ce6256fbd96ac5347f8fc9c6b7019c3942bcd746a2f8868f78fdaa7d7aab83d4314a6a01dd9522dd
+  data.tar.gz: 38c3fd01c8c156c82985c6b6972cec1002d90135f4381ab53f9e69e4dd6595bc2b22c8a530c2035703aabc318dfb9f37abf62396502dae12b994183aea091db2
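The checksums.yaml entries above are SHA1 and SHA512 hex digests of the two archive members, `metadata.gz` and `data.tar.gz`. As a rough sketch of how such values could be recomputed and compared, assuming the member's bytes have already been read into a string (for example with `File.binread`), something like the following would do. The helper names `checksums_for` and `checksums_match?` are hypothetical and not part of RubyGems.

```ruby
require 'digest/sha1'
require 'digest/sha2'

# Compute the two digests that checksums.yaml records for one archive member.
# `bytes` is the raw content of the member (hypothetical input).
def checksums_for(bytes)
  {
    "SHA1"   => Digest::SHA1.hexdigest(bytes),
    "SHA512" => Digest::SHA512.hexdigest(bytes)
  }
end

# Compare freshly computed digests against the expected values.
def checksums_match?(bytes, expected)
  checksums_for(bytes) == expected
end

sums = checksums_for("abc")
puts sums["SHA1"]          # a9993e364706816aba3e25717850c26c9cd0d89d
puts sums["SHA512"].length # 128
```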
data/README.md
CHANGED
@@ -22,15 +22,15 @@ This gem is tested on Ruby versions 1.9.2, 1.9.3 and 2.0.0.
 
 Initialize a MetaInspector instance for an URL, like this:
 
-    page = MetaInspector.new('http://
+    page = MetaInspector.new('http://sitevalidator.com')
 
 If you don't include the scheme on the URL, http:// will be used by default:
 
-    page = MetaInspector.new('
+    page = MetaInspector.new('sitevalidator.com')
 
 You can also include the html which will be used as the document to scrape:
 
-    page = MetaInspector.new("http://
+    page = MetaInspector.new("http://sitevalidator.com", :document => "<html><head><title>Hello From Passed Html</title><a href='/hello'>Hello link</a></head><body></body></html>")
 
 ## Accessing scraped data
 
@@ -38,8 +38,8 @@ Then you can see the scraped data like this:
 
     page.url # URL of the page
     page.scheme # Scheme of the page (http, https)
-    page.host # Hostname of the page (like,
-    page.root_url # Root url (scheme + host, like http://
+    page.host # Hostname of the page (like, sitevalidator.com, without the scheme)
+    page.root_url # Root url (scheme + host, like http://sitevalidator.com/)
     page.title # title of the page, as string
     page.links # array of strings, with every link found on the page as an absolute URL
     page.internal_links # array of strings, with every internal link found on the page as an absolute URL
@@ -69,7 +69,7 @@ Please notice that MetaInspector is case sensitive, so `page.meta_Content_Type`
 
 You can also access most of the scraped data as a hash:
 
-    page.to_hash # { "url" => "http://
+    page.to_hash # { "url" => "http://sitevalidator.com",
                       "title" => "MarkupValidator :: site-wide markup validation tool", ... }
 
 The original document is accessible from:
@@ -106,7 +106,7 @@ Note that MetaInspector gives priority to content over value. In other words if
 By default, MetaInspector times out after 20 seconds of waiting for a page to respond.
 You can set a different timeout with a second parameter, like this:
 
-    page = MetaInspector.new('
+    page = MetaInspector.new('sitevalidator.com', :timeout => 5) # 5 seconds timeout
 
 ### Redirections
 
@@ -124,7 +124,7 @@ However, you can tell MetaInspector to allow these redirections with the option
 
 MetaInspector will try to parse all URLs by default. If you want to raise an exception when trying to parse a non-html URL (one that has a content-type different than text/html), you can state it like this:
 
-    page = MetaInspector.new('
+    page = MetaInspector.new('sitevalidator.com', :html_content_only => true)
 
 This is useful when using MetaInspector on web spidering. Although on the initial URL you'll probably have an HTML URL, following links you may find yourself trying to parse non-html URLs.
 
@@ -167,8 +167,8 @@ You can find some sample scripts on the samples folder, including a basic scrapi
     >> require 'metainspector'
     => true
 
-    >> page = MetaInspector.new('http://
-    => #<MetaInspector:0x11330c0 @url="http://
+    >> page = MetaInspector.new('http://sitevalidator.com')
+    => #<MetaInspector:0x11330c0 @url="http://sitevalidator.com">
 
     >> page.title
     => "MarkupValidator :: site-wide markup validation tool"
@@ -185,12 +185,6 @@ You can find some sample scripts on the samples folder, including a basic scrapi
     >> page.links[4]
     => "/plans-and-pricing"
 
-    >> page.document.class
-    => String
-
-    >> page.parsed_document.class
-    => Nokogiri::HTML::Document
-
 ## ZOMG Fork! Thank you!
 
 You're welcome to fork this project and send pull requests. Just remember to include specs.
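The README text in this diff says that `http://` is assumed when the URL has no scheme, and that `root_url` is the scheme plus host. A minimal sketch of that behaviour using only Ruby's standard `URI` library is shown here. The helpers `with_default_scheme` and `root_url_for` are hypothetical names for illustration and are not MetaInspector's own implementation.

```ruby
require 'uri'

# Prepend http:// when the given string carries no scheme (hypothetical helper).
def with_default_scheme(url)
  url =~ %r{\A[a-z][a-z0-9+.-]*://}i ? url : "http://#{url}"
end

# Root URL as described in the README: scheme + host, with a trailing slash.
def root_url_for(url)
  uri = URI.parse(with_default_scheme(url))
  "#{uri.scheme}://#{uri.host}/"
end

puts with_default_scheme('sitevalidator.com')            # http://sitevalidator.com
puts root_url_for('sitevalidator.com/plans-and-pricing') # http://sitevalidator.com/
```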
data/lib/meta_inspector/parser.rb
CHANGED
@@ -106,7 +106,11 @@ module MetaInspector
         key = meta_tags_method.meta_tag
 
         #special treatment for opengraph (og:) and twitter card (twitter:) tags
-
+        if key =~ /^og_(.*)/
+          key = og_key(key)
+        elsif key =~ /^twitter_(.*)/
+          key.gsub!("_",":")
+        end
 
         scrape_meta_data
 
@@ -116,6 +120,23 @@ module MetaInspector
       end
     end
 
+    # Not all OG keys can be directly translated to meta tags method names replacing _ by : as they include the _ in the name
+    # This is going to be deprecated and replaced soon by a simpler, clearer method, like page.meta['og:site_name']
+    def og_key(key)
+      case key
+      when "og_site_name"
+        "og:site_name"
+      when "og_image_secure_url"
+        "og:image:secure_url"
+      when "og_video_secure_url"
+        "og:video:secure_url"
+      when "og_audio_secure_url"
+        "og:audio:secure_url"
+      else
+        key.gsub("_", ":")
+      end
+    end
+
     # Scrapes all meta tags found
     def scrape_meta_data
       unless @data.meta
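The key translation added to parser.rb above can be tried in isolation. The sketch below restates it as a standalone function with a lookup table instead of a `case` statement. The names `OG_SPECIAL_KEYS` and `translate_meta_key` are hypothetical, and this is an illustrative variant rather than the gem's code.

```ruby
# OG properties whose own names contain an underscore, so a blanket
# "_" -> ":" replacement would produce the wrong property name.
OG_SPECIAL_KEYS = {
  "og_site_name"        => "og:site_name",
  "og_image_secure_url" => "og:image:secure_url",
  "og_video_secure_url" => "og:video:secure_url",
  "og_audio_secure_url" => "og:audio:secure_url"
}.freeze

# Translate a method-style key (underscores) into a meta tag name (colons).
def translate_meta_key(key)
  if key.start_with?("og_")
    OG_SPECIAL_KEYS.fetch(key) { key.gsub("_", ":") }
  elsif key.start_with?("twitter_")
    key.gsub("_", ":")
  else
    key # other meta tags are left untouched
  end
end

puts translate_meta_key("og_title")       # og:title
puts translate_meta_key("og_site_name")   # og:site_name
puts translate_meta_key("og_image_width") # og:image:width
puts translate_meta_key("twitter_site")   # twitter:site
```

Without the special cases, `og_site_name` would become `og:site:name`, which matches no tag in the fixture.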
data/spec/fixtures/opengraph.response
ADDED
@@ -0,0 +1,52 @@
+HTTP/1.1 200 OK
+Age: 13
+Cache-Control: max-age=120
+Content-Type: text/html
+Date: Mon, 06 Jan 2014 12:47:42 GMT
+Expires: Mon, 06 Jan 2014 12:49:28 GMT
+Server: Apache/2.2.14 (Ubuntu)
+Vary: Accept-Encoding
+Via: 1.1 varnish
+X-Powered-By: PHP/5.3.2-1ubuntu4.22
+X-Varnish: 1188792404 1188790413
+Content-Length: 40571
+Connection: keep-alive
+
+<!DOCTYPE html>
+<html xmlns="http://www.w3.org/1999/xhtml" xmlns:fb="http://www.facebook.com/2008/fbml">
+<head>
+<meta http-equiv="Content-type" content="text/html; charset=utf-8">
+
+<!-- Basic OG Metadata -->
+<meta property="og:title" content="An OG title" />
+<meta property="og:type" content="website" />
+<meta property="og:url" content="http://example.com/opengraph" />
+
+<!-- Optional OG Metadata -->
+<meta property="og:description" content="Sean Connery found fame and fortune" />
+<meta property="og:determiner" content="the" />
+<meta property="og:locale" content="en_GB" />
+<meta property="og:locale:alternate" content="fr_FR" />
+<meta property="og:site_name" content="IMDb" />
+
+<!-- Structured OG Properties -->
+<meta property="og:image" content="http://example.com/ogp.jpg" />
+<meta property="og:image:secure_url" content="https://secure.example.com/ogp.jpg" />
+<meta property="og:image:type" content="image/jpeg" />
+<meta property="og:image:width" content="400" />
+<meta property="og:image:height" content="300" />
+
+<meta property="og:video" content="http://example.com/movie.swf" />
+<meta property="og:video:secure_url" content="https://secure.example.com/movie.swf" />
+<meta property="og:video:type" content="application/x-shockwave-flash" />
+<meta property="og:video:width" content="400" />
+<meta property="og:video:height" content="300" />
+
+<meta property="og:audio" content="http://example.com/sound.mp3" />
+<meta property="og:audio:secure_url" content="https://secure.example.com/sound.mp3" />
+<meta property="og:audio:type" content="audio/mpeg" />
+</head>
+<body>
+<p>A sample page with many Open Graph meta tags</p>
+</body>
+</html>
data/spec/parser_spec.rb
CHANGED
@@ -292,7 +292,7 @@ describe MetaInspector::Parser do
   end
 
   describe 'respond_to? for not implemented methods' do
-
+
     before(:each) do
       @m = MetaInspector.new('http://pagerankalert.com')
     end
@@ -346,16 +346,6 @@ describe MetaInspector::Parser do
       @m.meta_generator.should == 'WordPress 3.4.2'
     end
 
-    it "should find a meta_og_title" do
-      @m = MetaInspector::Parser.new(doc 'http://www.theonion.com/articles/apple-claims-new-iphone-only-visible-to-most-loyal,2772/')
-      @m.meta_og_title.should == "Apple Claims New iPhone Only Visible To Most Loyal Of Customers"
-    end
-
-    it "should not find a meta_og_something" do
-      @m = MetaInspector::Parser.new(doc 'http://www.theonion.com/articles/apple-claims-new-iphone-only-visible-to-most-loyal,2772/')
-      @m.meta_og_something.should == nil
-    end
-
     it "should find a meta_twitter_site" do
       @m = MetaInspector::Parser.new(doc 'http://www.youtube.com/watch?v=iaGSSrp49uc')
       @m.meta_twitter_site.should == "@youtube"
@@ -371,9 +361,94 @@ describe MetaInspector::Parser do
       @m.meta_twitter_dummy.should == nil
     end
 
-
-
-
+    describe "opengraph meta tags" do
+      before(:each) do
+        @m = MetaInspector::Parser.new(doc 'http://example.com/opengraph')
+      end
+
+      it "should find a meta og:title" do
+        @m.meta_og_title.should == "An OG title"
+      end
+
+      it "should find a meta og:type" do
+        @m.meta_og_type.should == "website"
+      end
+
+      it "should find a meta og:url" do
+        @m.meta_og_url.should == "http://example.com/opengraph"
+      end
+
+      it "should find a meta og:description" do
+        @m.meta_og_description.should == "Sean Connery found fame and fortune"
+      end
+
+      it "should find a meta og:determiner" do
+        @m.meta_og_determiner.should == "the"
+      end
+
+      it "should find a meta og:locale" do
+        @m.meta_og_locale.should == "en_GB"
+      end
+
+      it "should find a meta og:locale:alternate" do
+        @m.meta_og_locale_alternate.should == "fr_FR"
+      end
+
+      it "should find a meta og:site_name" do
+        @m.meta_og_site_name.should == "IMDb"
+      end
+
+      it "should find a meta og:image" do
+        @m.meta_og_image.should == "http://example.com/ogp.jpg"
+      end
+
+      it "should find a meta og:image:secure_url" do
+        @m.meta_og_image_secure_url.should == "https://secure.example.com/ogp.jpg"
+      end
+
+      it "should find a meta og:image:type" do
+        @m.meta_og_image_type.should == "image/jpeg"
+      end
+
+      it "should find a meta og:image:width" do
+        @m.meta_og_image_width.should == "400"
+      end
+
+      it "should find a meta og:image:height" do
+        @m.meta_og_image_height.should == "300"
+      end
+
+      it "should find a meta og:video" do
+        @m.meta_og_video.should == "http://example.com/movie.swf"
+      end
+
+      it "should find a meta og:video:secure_url" do
+        @m.meta_og_video_secure_url.should == "https://secure.example.com/movie.swf"
+      end
+
+      it "should find a meta og:video:type" do
+        @m.meta_og_video_type.should == "application/x-shockwave-flash"
+      end
+
+      it "should find a meta og:video:width" do
+        @m.meta_og_video_width.should == "400"
+      end
+
+      it "should find a meta og:video:height" do
+        @m.meta_og_video_height.should == "300"
+      end
+
+      it "should find a meta og:audio" do
+        @m.meta_og_audio.should == "http://example.com/sound.mp3"
+      end
+
+      it "should find a meta og:video:secure_url" do
+        @m.meta_og_audio_secure_url.should == "https://secure.example.com/sound.mp3"
+      end
+
+      it "should find a meta og:audio:type" do
+        @m.meta_og_audio_type.should == "audio/mpeg"
+      end
+    end
   end
 end
 
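The specs above call dynamic readers such as `meta_og_site_name`. As a rough illustration of how such readers can be resolved through `method_missing`, the sketch below looks a property up in a plain Hash by comparing names with colons turned into underscores. `TinyMeta` is a hypothetical stand-in, and the reverse lookup is a different technique from the gem's `og_key` translation, not MetaInspector's parser.

```ruby
# Hypothetical stand-in for a parsed page: meta properties live in a Hash.
class TinyMeta
  def initialize(tags)
    @tags = tags
  end

  # Resolve meta_<name> calls: find the stored property whose name, with
  # ":" replaced by "_", equals <name>. Returns nil when nothing matches.
  # If two properties collapse to the same name, the first one wins.
  def method_missing(name, *args)
    match = name.to_s.match(/\Ameta_(.+)\z/)
    return super unless match
    pair = @tags.find { |prop, _| prop.tr(":", "_") == match[1] }
    pair && pair.last
  end

  def respond_to_missing?(name, include_private = false)
    name.to_s.start_with?("meta_") || super
  end
end

page = TinyMeta.new(
  "og:title"            => "An OG title",
  "og:site_name"        => "IMDb",
  "og:image:secure_url" => "https://secure.example.com/ogp.jpg"
)
puts page.meta_og_site_name        # IMDb
puts page.meta_og_image_secure_url # https://secure.example.com/ogp.jpg
```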
data/spec/spec_helper.rb
CHANGED
@@ -41,6 +41,7 @@ FakeWeb.register_uri(:get, "http://charset002.com", :response => fixture_file("c
 FakeWeb.register_uri(:get, "http://www.inkthemes.com/", :response => fixture_file("wordpress_site.response"))
 FakeWeb.register_uri(:get, "http://pagerankalert.com/image.png", :body => "Image", :content_type => "image/png")
 FakeWeb.register_uri(:get, "http://pagerankalert.com/file.tar.gz", :body => "Image", :content_type => "application/x-gzip")
+FakeWeb.register_uri(:get, "http://example.com/opengraph", :response => fixture_file("opengraph.response"))
 
 # These examples are used to test relative links
 FakeWeb.register_uri(:get, "http://relative.com/", :response => fixture_file("relative_links.response"))
metadata
CHANGED
@@ -1,14 +1,14 @@
 --- !ruby/object:Gem::Specification
 name: metainspector
 version: !ruby/object:Gem::Version
-  version: 1.17.
+  version: 1.17.3
 platform: ruby
 authors:
 - Jaime Iniesta
 autorequire:
 bindir: bin
 cert_chain: []
-date:
+date: 2014-01-09 00:00:00.000000000 Z
 dependencies:
 - !ruby/object:Gem::Dependency
   name: nokogiri
@@ -168,6 +168,7 @@ files:
 - spec/fixtures/malformed_href.response
 - spec/fixtures/markupvalidator_faqs.response
 - spec/fixtures/nonhttp.response
+- spec/fixtures/opengraph.response
 - spec/fixtures/pagerankalert.com.response
 - spec/fixtures/protocol_relative.response
 - spec/fixtures/relative_links.response
@@ -206,7 +207,7 @@ required_rubygems_version: !ruby/object:Gem::Requirement
       version: '0'
 requirements: []
 rubyforge_project:
-rubygems_version: 2.
+rubygems_version: 2.0.6
 signing_key:
 specification_version: 4
 summary: MetaInspector is a ruby gem for web scraping purposes, that returns a hash