metainspector 1.17.2 → 1.17.3

checksums.yaml CHANGED
@@ -1,7 +1,7 @@
  ---
  SHA1:
- metadata.gz: 6d1795a40750fba9db895d1ab321f54d94349a38
- data.tar.gz: 0648ba0471516d9a6e33a9439972619d51580ff7
+ metadata.gz: 5c94b07066d8b0080029d5e93808a4388b716575
+ data.tar.gz: 33135740e3e740e21c4ccc011a44a3466c73a926
  SHA512:
- metadata.gz: a4c8859c52e9f424cf08c72b8f3515dede0f678ac87bcc2331a8e543fa1a839912c0fd85f2b5c56069990cd270dd5db8c2761ac2a3dce6d5929ba906b3692856
- data.tar.gz: 61217d3d51e81e892f6750f8d24379ea75a80d9e7d07c2408d6c762786e5c088a14996674ed2cb070a5e532c9286e406e55832398c5bb842f4bd51497fcc1674
+ metadata.gz: 4c3ffda64efceaaaa9631178751df36a0316d17176ce0788ce6256fbd96ac5347f8fc9c6b7019c3942bcd746a2f8868f78fdaa7d7aab83d4314a6a01dd9522dd
+ data.tar.gz: 38c3fd01c8c156c82985c6b6972cec1002d90135f4381ab53f9e69e4dd6595bc2b22c8a530c2035703aabc318dfb9f37abf62396502dae12b994183aea091db2
data/README.md CHANGED
@@ -22,15 +22,15 @@ This gem is tested on Ruby versions 1.9.2, 1.9.3 and 2.0.0.
 
  Initialize a MetaInspector instance for an URL, like this:
 
- page = MetaInspector.new('http://markupvalidator.com')
+ page = MetaInspector.new('http://sitevalidator.com')
 
  If you don't include the scheme on the URL, http:// will be used by default:
 
- page = MetaInspector.new('markupvalidator.com')
+ page = MetaInspector.new('sitevalidator.com')
 
  You can also include the html which will be used as the document to scrape:
 
- page = MetaInspector.new("http://markupvalidator.com", :document => "<html><head><title>Hello From Passed Html</title><a href='/hello'>Hello link</a></head><body></body></html>")
+ page = MetaInspector.new("http://sitevalidator.com", :document => "<html><head><title>Hello From Passed Html</title><a href='/hello'>Hello link</a></head><body></body></html>")
 
  ## Accessing scraped data
 
@@ -38,8 +38,8 @@ Then you can see the scraped data like this:
 
  page.url # URL of the page
  page.scheme # Scheme of the page (http, https)
- page.host # Hostname of the page (like, markupvalidator.com, without the scheme)
- page.root_url # Root url (scheme + host, like http://markupvalidator.com/)
+ page.host # Hostname of the page (like, sitevalidator.com, without the scheme)
+ page.root_url # Root url (scheme + host, like http://sitevalidator.com/)
  page.title # title of the page, as string
  page.links # array of strings, with every link found on the page as an absolute URL
  page.internal_links # array of strings, with every internal link found on the page as an absolute URL
@@ -69,7 +69,7 @@ Please notice that MetaInspector is case sensitive, so `page.meta_Content_Type`
 
  You can also access most of the scraped data as a hash:
 
- page.to_hash # { "url" => "http://markupvalidator.com",
+ page.to_hash # { "url" => "http://sitevalidator.com",
  "title" => "MarkupValidator :: site-wide markup validation tool", ... }
 
  The original document is accessible from:
@@ -106,7 +106,7 @@ Note that MetaInspector gives priority to content over value. In other words if
  By default, MetaInspector times out after 20 seconds of waiting for a page to respond.
  You can set a different timeout with a second parameter, like this:
 
- page = MetaInspector.new('markupvalidator.com', :timeout => 5) # 5 seconds timeout
+ page = MetaInspector.new('sitevalidator.com', :timeout => 5) # 5 seconds timeout
 
  ### Redirections
 
@@ -124,7 +124,7 @@ However, you can tell MetaInspector to allow these redirections with the option
 
  MetaInspector will try to parse all URLs by default. If you want to raise an exception when trying to parse a non-html URL (one that has a content-type different than text/html), you can state it like this:
 
- page = MetaInspector.new('markupvalidator.com', :html_content_only => true)
+ page = MetaInspector.new('sitevalidator.com', :html_content_only => true)
 
  This is useful when using MetaInspector on web spidering. Although on the initial URL you'll probably have an HTML URL, following links you may find yourself trying to parse non-html URLs.
 
@@ -167,8 +167,8 @@ You can find some sample scripts on the samples folder, including a basic scrapi
  >> require 'metainspector'
  => true
 
- >> page = MetaInspector.new('http://markupvalidator.com')
- => #<MetaInspector:0x11330c0 @url="http://markupvalidator.com">
+ >> page = MetaInspector.new('http://sitevalidator.com')
+ => #<MetaInspector:0x11330c0 @url="http://sitevalidator.com">
 
  >> page.title
  => "MarkupValidator :: site-wide markup validation tool"
@@ -185,12 +185,6 @@ You can find some sample scripts on the samples folder, including a basic scrapi
  >> page.links[4]
  => "/plans-and-pricing"
 
- >> page.document.class
- => String
-
- >> page.parsed_document.class
- => Nokogiri::HTML::Document
-
  ## ZOMG Fork! Thank you!
 
  You're welcome to fork this project and send pull requests. Just remember to include specs.
@@ -106,7 +106,11 @@ module MetaInspector
  key = meta_tags_method.meta_tag
 
  #special treatment for opengraph (og:) and twitter card (twitter:) tags
- key.gsub!("_",":") if key =~ /^og_(.*)/ || key =~ /^twitter_(.*)/
+ if key =~ /^og_(.*)/
+ key = og_key(key)
+ elsif key =~ /^twitter_(.*)/
+ key.gsub!("_",":")
+ end
 
  scrape_meta_data
 
@@ -116,6 +120,23 @@ module MetaInspector
  end
  end
 
+ # Not all OG keys can be directly translated to meta tags method names replacing _ by : as they include the _ in the name
+ # This is going to be deprecated and replaced soon by a simpler, clearer method, like page.meta['og:site_name']
+ def og_key(key)
+ case key
+ when "og_site_name"
+ "og:site_name"
+ when "og_image_secure_url"
+ "og:image:secure_url"
+ when "og_video_secure_url"
+ "og:video:secure_url"
+ when "og_audio_secure_url"
+ "og:audio:secure_url"
+ else
+ key.gsub("_", ":")
+ end
+ end
+
  # Scrapes all meta tags found
  def scrape_meta_data
  unless @data.meta
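The `og_key` hunk above exists because a blanket `_` to `:` substitution corrupts Open Graph property names that themselves contain an underscore. The following stand-alone sketch illustrates the same idea with a lookup table instead of a `case` statement; `OG_SPECIAL_KEYS` and `og_property_for` are hypothetical names chosen for illustration and are not part of the gem.

```ruby
# Hypothetical, stand-alone illustration of the key translation shown in the
# og_key hunk. The constant and method names are assumptions, not gem API.
OG_SPECIAL_KEYS = {
  "og_site_name"        => "og:site_name",
  "og_image_secure_url" => "og:image:secure_url",
  "og_video_secure_url" => "og:video:secure_url",
  "og_audio_secure_url" => "og:audio:secure_url"
}.freeze

# Returns the Open Graph property name for a method-style key.
# Keys listed in the table keep their inner underscore; every other
# key has each "_" replaced by ":".
def og_property_for(key)
  OG_SPECIAL_KEYS.fetch(key) { key.gsub("_", ":") }
end

# The blanket substitution that the old one-liner applied to every og_ key:
naive = "og_site_name".gsub("_", ":")

puts naive                                  # the property name no page uses
puts og_property_for("og_site_name")        # table hit, underscore preserved
puts og_property_for("og_video_width")      # falls through to the substitution
```

A table keeps the exceptions in one place, but it has the same limitation as the `case` in the diff: any future property with an underscore must be listed explicitly.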
@@ -1,5 +1,5 @@
  # -*- encoding: utf-8 -*-
 
  module MetaInspector
- VERSION = "1.17.2"
+ VERSION = "1.17.3"
  end
data/spec/fixtures/opengraph.response ADDED
@@ -0,0 +1,52 @@
+ HTTP/1.1 200 OK
+ Age: 13
+ Cache-Control: max-age=120
+ Content-Type: text/html
+ Date: Mon, 06 Jan 2014 12:47:42 GMT
+ Expires: Mon, 06 Jan 2014 12:49:28 GMT
+ Server: Apache/2.2.14 (Ubuntu)
+ Vary: Accept-Encoding
+ Via: 1.1 varnish
+ X-Powered-By: PHP/5.3.2-1ubuntu4.22
+ X-Varnish: 1188792404 1188790413
+ Content-Length: 40571
+ Connection: keep-alive
+
+ <!DOCTYPE html>
+ <html xmlns="http://www.w3.org/1999/xhtml" xmlns:fb="http://www.facebook.com/2008/fbml">
+ <head>
+ <meta http-equiv="Content-type" content="text/html; charset=utf-8">
+
+ <!-- Basic OG Metadata -->
+ <meta property="og:title" content="An OG title" />
+ <meta property="og:type" content="website" />
+ <meta property="og:url" content="http://example.com/opengraph" />
+
+ <!-- Optional OG Metadata -->
+ <meta property="og:description" content="Sean Connery found fame and fortune" />
+ <meta property="og:determiner" content="the" />
+ <meta property="og:locale" content="en_GB" />
+ <meta property="og:locale:alternate" content="fr_FR" />
+ <meta property="og:site_name" content="IMDb" />
+
+ <!-- Structured OG Properties -->
+ <meta property="og:image" content="http://example.com/ogp.jpg" />
+ <meta property="og:image:secure_url" content="https://secure.example.com/ogp.jpg" />
+ <meta property="og:image:type" content="image/jpeg" />
+ <meta property="og:image:width" content="400" />
+ <meta property="og:image:height" content="300" />
+
+ <meta property="og:video" content="http://example.com/movie.swf" />
+ <meta property="og:video:secure_url" content="https://secure.example.com/movie.swf" />
+ <meta property="og:video:type" content="application/x-shockwave-flash" />
+ <meta property="og:video:width" content="400" />
+ <meta property="og:video:height" content="300" />
+
+ <meta property="og:audio" content="http://example.com/sound.mp3" />
+ <meta property="og:audio:secure_url" content="https://secure.example.com/sound.mp3" />
+ <meta property="og:audio:type" content="audio/mpeg" />
+ </head>
+ <body>
+ <p>A sample page with many Open Graph meta tags</p>
+ </body>
+ </html>
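To make the property names in the fixture above easier to see at a glance, here is a rough sketch that collects `og:` properties from a three-line excerpt of its `<head>`. It uses a plain regular expression purely for illustration, on the assumption that every tag is written as `property="..." content="..."` in that order; the README hunks in this diff show that the gem itself parses documents with Nokogiri, so this is not how the gem does it. `OG_META` and `og_tags` are hypothetical names.

```ruby
# Excerpt of the fixture's <head>, copied from the added file.
html = <<-HTML
<meta property="og:title" content="An OG title" />
<meta property="og:site_name" content="IMDb" />
<meta property="og:image:secure_url" content="https://secure.example.com/ogp.jpg" />
HTML

# Assumption: attributes appear as property="..." followed by content="...".
# A real HTML parser should be preferred for anything beyond a demonstration.
OG_META = /<meta\s+property="(og:[^"]+)"\s+content="([^"]*)"/

# String#scan returns [property, content] pairs; Hash[] turns them into a hash.
og_tags = Hash[html.scan(OG_META)]

og_tags.each { |property, content| puts "#{property} => #{content}" }
```

The keys `og:site_name` and `og:image:secure_url` both contain an underscore, which is the reason a method name such as `meta_og_site_name` cannot be mapped back by replacing every `_` with `:`.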
data/spec/parser_spec.rb CHANGED
@@ -292,7 +292,7 @@ describe MetaInspector::Parser do
  end
 
  describe 'respond_to? for not implemented methods' do
-
+
  before(:each) do
  @m = MetaInspector.new('http://pagerankalert.com')
  end
@@ -346,16 +346,6 @@ describe MetaInspector::Parser do
  @m.meta_generator.should == 'WordPress 3.4.2'
  end
 
- it "should find a meta_og_title" do
- @m = MetaInspector::Parser.new(doc 'http://www.theonion.com/articles/apple-claims-new-iphone-only-visible-to-most-loyal,2772/')
- @m.meta_og_title.should == "Apple Claims New iPhone Only Visible To Most Loyal Of Customers"
- end
-
- it "should not find a meta_og_something" do
- @m = MetaInspector::Parser.new(doc 'http://www.theonion.com/articles/apple-claims-new-iphone-only-visible-to-most-loyal,2772/')
- @m.meta_og_something.should == nil
- end
-
  it "should find a meta_twitter_site" do
  @m = MetaInspector::Parser.new(doc 'http://www.youtube.com/watch?v=iaGSSrp49uc')
  @m.meta_twitter_site.should == "@youtube"
@@ -371,9 +361,94 @@ describe MetaInspector::Parser do
  @m.meta_twitter_dummy.should == nil
  end
 
- it "should find a meta_og_video_width" do
- @m = MetaInspector::Parser.new(doc 'http://www.youtube.com/watch?v=iaGSSrp49uc')
- @m.meta_og_video_width.should == "1920"
+ describe "opengraph meta tags" do
+ before(:each) do
+ @m = MetaInspector::Parser.new(doc 'http://example.com/opengraph')
+ end
+
+ it "should find a meta og:title" do
+ @m.meta_og_title.should == "An OG title"
+ end
+
+ it "should find a meta og:type" do
+ @m.meta_og_type.should == "website"
+ end
+
+ it "should find a meta og:url" do
+ @m.meta_og_url.should == "http://example.com/opengraph"
+ end
+
+ it "should find a meta og:description" do
+ @m.meta_og_description.should == "Sean Connery found fame and fortune"
+ end
+
+ it "should find a meta og:determiner" do
+ @m.meta_og_determiner.should == "the"
+ end
+
+ it "should find a meta og:locale" do
+ @m.meta_og_locale.should == "en_GB"
+ end
+
+ it "should find a meta og:locale:alternate" do
+ @m.meta_og_locale_alternate.should == "fr_FR"
+ end
+
+ it "should find a meta og:site_name" do
+ @m.meta_og_site_name.should == "IMDb"
+ end
+
+ it "should find a meta og:image" do
+ @m.meta_og_image.should == "http://example.com/ogp.jpg"
+ end
+
+ it "should find a meta og:image:secure_url" do
+ @m.meta_og_image_secure_url.should == "https://secure.example.com/ogp.jpg"
+ end
+
+ it "should find a meta og:image:type" do
+ @m.meta_og_image_type.should == "image/jpeg"
+ end
+
+ it "should find a meta og:image:width" do
+ @m.meta_og_image_width.should == "400"
+ end
+
+ it "should find a meta og:image:height" do
+ @m.meta_og_image_height.should == "300"
+ end
+
+ it "should find a meta og:video" do
+ @m.meta_og_video.should == "http://example.com/movie.swf"
+ end
+
+ it "should find a meta og:video:secure_url" do
+ @m.meta_og_video_secure_url.should == "https://secure.example.com/movie.swf"
+ end
+
+ it "should find a meta og:video:type" do
+ @m.meta_og_video_type.should == "application/x-shockwave-flash"
+ end
+
+ it "should find a meta og:video:width" do
+ @m.meta_og_video_width.should == "400"
+ end
+
+ it "should find a meta og:video:height" do
+ @m.meta_og_video_height.should == "300"
+ end
+
+ it "should find a meta og:audio" do
+ @m.meta_og_audio.should == "http://example.com/sound.mp3"
+ end
+
+ it "should find a meta og:audio:secure_url" do
+ @m.meta_og_audio_secure_url.should == "https://secure.example.com/sound.mp3"
+ end
+
+ it "should find a meta og:audio:type" do
+ @m.meta_og_audio_type.should == "audio/mpeg"
+ end
  end
  end
 
data/spec/spec_helper.rb CHANGED
@@ -41,6 +41,7 @@ FakeWeb.register_uri(:get, "http://charset002.com", :response => fixture_file("c
  FakeWeb.register_uri(:get, "http://www.inkthemes.com/", :response => fixture_file("wordpress_site.response"))
  FakeWeb.register_uri(:get, "http://pagerankalert.com/image.png", :body => "Image", :content_type => "image/png")
  FakeWeb.register_uri(:get, "http://pagerankalert.com/file.tar.gz", :body => "Image", :content_type => "application/x-gzip")
+ FakeWeb.register_uri(:get, "http://example.com/opengraph", :response => fixture_file("opengraph.response"))
 
  # These examples are used to test relative links
  FakeWeb.register_uri(:get, "http://relative.com/", :response => fixture_file("relative_links.response"))
metadata CHANGED
@@ -1,14 +1,14 @@
  --- !ruby/object:Gem::Specification
  name: metainspector
  version: !ruby/object:Gem::Version
- version: 1.17.2
+ version: 1.17.3
  platform: ruby
  authors:
  - Jaime Iniesta
  autorequire:
  bindir: bin
  cert_chain: []
- date: 2013-11-21 00:00:00.000000000 Z
+ date: 2014-01-09 00:00:00.000000000 Z
  dependencies:
  - !ruby/object:Gem::Dependency
  name: nokogiri
@@ -168,6 +168,7 @@ files:
  - spec/fixtures/malformed_href.response
  - spec/fixtures/markupvalidator_faqs.response
  - spec/fixtures/nonhttp.response
+ - spec/fixtures/opengraph.response
  - spec/fixtures/pagerankalert.com.response
  - spec/fixtures/protocol_relative.response
  - spec/fixtures/relative_links.response
@@ -206,7 +207,7 @@ required_rubygems_version: !ruby/object:Gem::Requirement
  version: '0'
  requirements: []
  rubyforge_project:
- rubygems_version: 2.1.3
+ rubygems_version: 2.0.6
  signing_key:
  specification_version: 4
  summary: MetaInspector is a ruby gem for web scraping purposes, that returns a hash