metainspector 1.17.2 → 1.17.3
- checksums.yaml +4 -4
- data/README.md +10 -16
- data/lib/meta_inspector/parser.rb +22 -1
- data/lib/meta_inspector/version.rb +1 -1
- data/spec/fixtures/opengraph.response +52 -0
- data/spec/parser_spec.rb +89 -14
- data/spec/spec_helper.rb +1 -0
- metadata +4 -3
checksums.yaml
CHANGED
@@ -1,7 +1,7 @@
 ---
 SHA1:
-metadata.gz:
-data.tar.gz:
+metadata.gz: 5c94b07066d8b0080029d5e93808a4388b716575
+data.tar.gz: 33135740e3e740e21c4ccc011a44a3466c73a926
 SHA512:
-metadata.gz:
-data.tar.gz:
+metadata.gz: 4c3ffda64efceaaaa9631178751df36a0316d17176ce0788ce6256fbd96ac5347f8fc9c6b7019c3942bcd746a2f8868f78fdaa7d7aab83d4314a6a01dd9522dd
+data.tar.gz: 38c3fd01c8c156c82985c6b6972cec1002d90135f4381ab53f9e69e4dd6595bc2b22c8a530c2035703aabc318dfb9f37abf62396502dae12b994183aea091db2
data/README.md
CHANGED
@@ -22,15 +22,15 @@ This gem is tested on Ruby versions 1.9.2, 1.9.3 and 2.0.0.

 Initialize a MetaInspector instance for an URL, like this:

-page = MetaInspector.new('http://
+page = MetaInspector.new('http://sitevalidator.com')

 If you don't include the scheme on the URL, http:// will be used by default:

-page = MetaInspector.new('
+page = MetaInspector.new('sitevalidator.com')

 You can also include the html which will be used as the document to scrape:

-page = MetaInspector.new("http://
+page = MetaInspector.new("http://sitevalidator.com", :document => "<html><head><title>Hello From Passed Html</title><a href='/hello'>Hello link</a></head><body></body></html>")

 ## Accessing scraped data

@@ -38,8 +38,8 @@ Then you can see the scraped data like this:

 page.url # URL of the page
 page.scheme # Scheme of the page (http, https)
-page.host # Hostname of the page (like,
-page.root_url # Root url (scheme + host, like http://
+page.host # Hostname of the page (like, sitevalidator.com, without the scheme)
+page.root_url # Root url (scheme + host, like http://sitevalidator.com/)
 page.title # title of the page, as string
 page.links # array of strings, with every link found on the page as an absolute URL
 page.internal_links # array of strings, with every internal link found on the page as an absolute URL
@@ -69,7 +69,7 @@ Please notice that MetaInspector is case sensitive, so `page.meta_Content_Type`

 You can also access most of the scraped data as a hash:

-page.to_hash # { "url" => "http://
+page.to_hash # { "url" => "http://sitevalidator.com",
 "title" => "MarkupValidator :: site-wide markup validation tool", ... }

 The original document is accessible from:
@@ -106,7 +106,7 @@ Note that MetaInspector gives priority to content over value. In other words if
 By default, MetaInspector times out after 20 seconds of waiting for a page to respond.
 You can set a different timeout with a second parameter, like this:

-page = MetaInspector.new('
+page = MetaInspector.new('sitevalidator.com', :timeout => 5) # 5 seconds timeout

 ### Redirections

@@ -124,7 +124,7 @@ However, you can tell MetaInspector to allow these redirections with the option

 MetaInspector will try to parse all URLs by default. If you want to raise an exception when trying to parse a non-html URL (one that has a content-type different than text/html), you can state it like this:

-page = MetaInspector.new('
+page = MetaInspector.new('sitevalidator.com', :html_content_only => true)

 This is useful when using MetaInspector on web spidering. Although on the initial URL you'll probably have an HTML URL, following links you may find yourself trying to parse non-html URLs.

@@ -167,8 +167,8 @@ You can find some sample scripts on the samples folder, including a basic scrapi
 >> require 'metainspector'
 => true

->> page = MetaInspector.new('http://
-=> #<MetaInspector:0x11330c0 @url="http://
+>> page = MetaInspector.new('http://sitevalidator.com')
+=> #<MetaInspector:0x11330c0 @url="http://sitevalidator.com">

 >> page.title
 => "MarkupValidator :: site-wide markup validation tool"
@@ -185,12 +185,6 @@ You can find some sample scripts on the samples folder, including a basic scrapi
 >> page.links[4]
 => "/plans-and-pricing"

->> page.document.class
-=> String
-
->> page.parsed_document.class
-=> Nokogiri::HTML::Document
-
 ## ZOMG Fork! Thank you!

 You're welcome to fork this project and send pull requests. Just remember to include specs.
data/lib/meta_inspector/parser.rb
CHANGED
@@ -106,7 +106,11 @@ module MetaInspector
 key = meta_tags_method.meta_tag

 #special treatment for opengraph (og:) and twitter card (twitter:) tags
-
+if key =~ /^og_(.*)/
+key = og_key(key)
+elsif key =~ /^twitter_(.*)/
+key.gsub!("_",":")
+end

 scrape_meta_data

@@ -116,6 +120,23 @@ module MetaInspector
 end
 end

+# Not all OG keys can be directly translated to meta tags method names replacing _ by : as they include the _ in the name
+# This is going to be deprecated and replaced soon by a simpler, clearer method, like page.meta['og:site_name']
+def og_key(key)
+case key
+when "og_site_name"
+"og:site_name"
+when "og_image_secure_url"
+"og:image:secure_url"
+when "og_video_secure_url"
+"og:video:secure_url"
+when "og_audio_secure_url"
+"og:audio:secure_url"
+else
+key.gsub("_", ":")
+end
+end
+
 # Scrapes all meta tags found
 def scrape_meta_data
 unless @data.meta
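The key translation in the hunk above can be tried in isolation. The following is a minimal standalone sketch, not the gem's code: `OG_SPECIAL_KEYS` and `meta_property_for` are hypothetical names introduced only for this illustration, and the lookup table simply restates the four special cases listed in `og_key`.

```ruby
# Standalone sketch of the og_/twitter_ key translation from the hunk above.
# OG_SPECIAL_KEYS and meta_property_for are hypothetical names used only
# for illustration; they are not part of MetaInspector.
OG_SPECIAL_KEYS = {
  "og_site_name"        => "og:site_name",
  "og_image_secure_url" => "og:image:secure_url",
  "og_video_secure_url" => "og:video:secure_url",
  "og_audio_secure_url" => "og:audio:secure_url"
}.freeze

def meta_property_for(key)
  if key.start_with?("og_")
    # Properties that contain a literal underscore need an explicit mapping;
    # every other og_ key just swaps underscores for colons.
    OG_SPECIAL_KEYS.fetch(key) { key.gsub("_", ":") }
  elsif key.start_with?("twitter_")
    key.gsub("_", ":")
  else
    key
  end
end

puts meta_property_for("og_title")            # og:title
puts meta_property_for("og_site_name")        # og:site_name
puts meta_property_for("og_locale_alternate") # og:locale:alternate
puts meta_property_for("twitter_site")        # twitter:site
```

A plain underscore-to-colon substitution would turn `og_site_name` into `og:site:name`, which is why the special cases are needed.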
data/spec/fixtures/opengraph.response
ADDED
@@ -0,0 +1,52 @@
+HTTP/1.1 200 OK
+Age: 13
+Cache-Control: max-age=120
+Content-Type: text/html
+Date: Mon, 06 Jan 2014 12:47:42 GMT
+Expires: Mon, 06 Jan 2014 12:49:28 GMT
+Server: Apache/2.2.14 (Ubuntu)
+Vary: Accept-Encoding
+Via: 1.1 varnish
+X-Powered-By: PHP/5.3.2-1ubuntu4.22
+X-Varnish: 1188792404 1188790413
+Content-Length: 40571
+Connection: keep-alive
+
+<!DOCTYPE html>
+<html xmlns="http://www.w3.org/1999/xhtml" xmlns:fb="http://www.facebook.com/2008/fbml">
+<head>
+<meta http-equiv="Content-type" content="text/html; charset=utf-8">
+
+<!-- Basic OG Metadata -->
+<meta property="og:title" content="An OG title" />
+<meta property="og:type" content="website" />
+<meta property="og:url" content="http://example.com/opengraph" />
+
+<!-- Optional OG Metadata -->
+<meta property="og:description" content="Sean Connery found fame and fortune" />
+<meta property="og:determiner" content="the" />
+<meta property="og:locale" content="en_GB" />
+<meta property="og:locale:alternate" content="fr_FR" />
+<meta property="og:site_name" content="IMDb" />
+
+<!-- Structured OG Properties -->
+<meta property="og:image" content="http://example.com/ogp.jpg" />
+<meta property="og:image:secure_url" content="https://secure.example.com/ogp.jpg" />
+<meta property="og:image:type" content="image/jpeg" />
+<meta property="og:image:width" content="400" />
+<meta property="og:image:height" content="300" />
+
+<meta property="og:video" content="http://example.com/movie.swf" />
+<meta property="og:video:secure_url" content="https://secure.example.com/movie.swf" />
+<meta property="og:video:type" content="application/x-shockwave-flash" />
+<meta property="og:video:width" content="400" />
+<meta property="og:video:height" content="300" />
+
+<meta property="og:audio" content="http://example.com/sound.mp3" />
+<meta property="og:audio:secure_url" content="https://secure.example.com/sound.mp3" />
+<meta property="og:audio:type" content="audio/mpeg" />
+</head>
+<body>
+<p>A sample page with many Open Graph meta tags</p>
+</body>
+</html>
data/spec/parser_spec.rb
CHANGED
@@ -292,7 +292,7 @@ describe MetaInspector::Parser do
 end

 describe 'respond_to? for not implemented methods' do
-
+
 before(:each) do
 @m = MetaInspector.new('http://pagerankalert.com')
 end
@@ -346,16 +346,6 @@ describe MetaInspector::Parser do
 @m.meta_generator.should == 'WordPress 3.4.2'
 end

-it "should find a meta_og_title" do
-@m = MetaInspector::Parser.new(doc 'http://www.theonion.com/articles/apple-claims-new-iphone-only-visible-to-most-loyal,2772/')
-@m.meta_og_title.should == "Apple Claims New iPhone Only Visible To Most Loyal Of Customers"
-end
-
-it "should not find a meta_og_something" do
-@m = MetaInspector::Parser.new(doc 'http://www.theonion.com/articles/apple-claims-new-iphone-only-visible-to-most-loyal,2772/')
-@m.meta_og_something.should == nil
-end
-
 it "should find a meta_twitter_site" do
 @m = MetaInspector::Parser.new(doc 'http://www.youtube.com/watch?v=iaGSSrp49uc')
 @m.meta_twitter_site.should == "@youtube"
@@ -371,9 +361,94 @@ describe MetaInspector::Parser do
 @m.meta_twitter_dummy.should == nil
 end

-
-
-
+describe "opengraph meta tags" do
+before(:each) do
+@m = MetaInspector::Parser.new(doc 'http://example.com/opengraph')
+end
+
+it "should find a meta og:title" do
+@m.meta_og_title.should == "An OG title"
+end
+
+it "should find a meta og:type" do
+@m.meta_og_type.should == "website"
+end
+
+it "should find a meta og:url" do
+@m.meta_og_url.should == "http://example.com/opengraph"
+end
+
+it "should find a meta og:description" do
+@m.meta_og_description.should == "Sean Connery found fame and fortune"
+end
+
+it "should find a meta og:determiner" do
+@m.meta_og_determiner.should == "the"
+end
+
+it "should find a meta og:locale" do
+@m.meta_og_locale.should == "en_GB"
+end
+
+it "should find a meta og:locale:alternate" do
+@m.meta_og_locale_alternate.should == "fr_FR"
+end
+
+it "should find a meta og:site_name" do
+@m.meta_og_site_name.should == "IMDb"
+end
+
+it "should find a meta og:image" do
+@m.meta_og_image.should == "http://example.com/ogp.jpg"
+end
+
+it "should find a meta og:image:secure_url" do
+@m.meta_og_image_secure_url.should == "https://secure.example.com/ogp.jpg"
+end
+
+it "should find a meta og:image:type" do
+@m.meta_og_image_type.should == "image/jpeg"
+end
+
+it "should find a meta og:image:width" do
+@m.meta_og_image_width.should == "400"
+end
+
+it "should find a meta og:image:height" do
+@m.meta_og_image_height.should == "300"
+end
+
+it "should find a meta og:video" do
+@m.meta_og_video.should == "http://example.com/movie.swf"
+end
+
+it "should find a meta og:video:secure_url" do
+@m.meta_og_video_secure_url.should == "https://secure.example.com/movie.swf"
+end
+
+it "should find a meta og:video:type" do
+@m.meta_og_video_type.should == "application/x-shockwave-flash"
+end
+
+it "should find a meta og:video:width" do
+@m.meta_og_video_width.should == "400"
+end
+
+it "should find a meta og:video:height" do
+@m.meta_og_video_height.should == "300"
+end
+
+it "should find a meta og:audio" do
+@m.meta_og_audio.should == "http://example.com/sound.mp3"
+end
+
+it "should find a meta og:video:secure_url" do
+@m.meta_og_audio_secure_url.should == "https://secure.example.com/sound.mp3"
+end
+
+it "should find a meta og:audio:type" do
+@m.meta_og_audio_type.should == "audio/mpeg"
+end
 end
 end

data/spec/spec_helper.rb
CHANGED
@@ -41,6 +41,7 @@ FakeWeb.register_uri(:get, "http://charset002.com", :response => fixture_file("c
 FakeWeb.register_uri(:get, "http://www.inkthemes.com/", :response => fixture_file("wordpress_site.response"))
 FakeWeb.register_uri(:get, "http://pagerankalert.com/image.png", :body => "Image", :content_type => "image/png")
 FakeWeb.register_uri(:get, "http://pagerankalert.com/file.tar.gz", :body => "Image", :content_type => "application/x-gzip")
+FakeWeb.register_uri(:get, "http://example.com/opengraph", :response => fixture_file("opengraph.response"))

 # These examples are used to test relative links
 FakeWeb.register_uri(:get, "http://relative.com/", :response => fixture_file("relative_links.response"))
metadata
CHANGED
@@ -1,14 +1,14 @@
 --- !ruby/object:Gem::Specification
 name: metainspector
 version: !ruby/object:Gem::Version
-version: 1.17.
+version: 1.17.3
 platform: ruby
 authors:
 - Jaime Iniesta
 autorequire:
 bindir: bin
 cert_chain: []
-date:
+date: 2014-01-09 00:00:00.000000000 Z
 dependencies:
 - !ruby/object:Gem::Dependency
 name: nokogiri
@@ -168,6 +168,7 @@ files:
 - spec/fixtures/malformed_href.response
 - spec/fixtures/markupvalidator_faqs.response
 - spec/fixtures/nonhttp.response
+- spec/fixtures/opengraph.response
 - spec/fixtures/pagerankalert.com.response
 - spec/fixtures/protocol_relative.response
 - spec/fixtures/relative_links.response
@@ -206,7 +207,7 @@ required_rubygems_version: !ruby/object:Gem::Requirement
 version: '0'
 requirements: []
 rubyforge_project:
-rubygems_version: 2.
+rubygems_version: 2.0.6
 signing_key:
 specification_version: 4
 summary: MetaInspector is a ruby gem for web scraping purposes, that returns a hash