metainspector 1.0.2 → 1.2.0
- data/.gitignore +5 -0
- data/Gemfile +4 -0
- data/MIT-LICENSE +20 -0
- data/README.rdoc +90 -0
- data/Rakefile +2 -20
- data/lib/meta_inspector.rb +12 -0
- data/lib/meta_inspector/scraper.rb +81 -0
- data/lib/meta_inspector/version.rb +5 -0
- data/lib/metainspector.rb +2 -46
- data/meta_inspector.gemspec +26 -0
- data/samples/basic_scraping.rb +17 -0
- data/samples/spider.rb +28 -0
- data/spec/metainspector_spec.rb +77 -0
- data/spec/spec_helper.rb +4 -0
- metadata +109 -58
- data/History.txt +0 -21
- data/Manifest.txt +0 -7
- data/README.txt +0 -62
- data/bin/metainspector +0 -0
- data/test/test_metainspector.rb +0 -1
data/Gemfile
ADDED
data/MIT-LICENSE
ADDED
@@ -0,0 +1,20 @@
+Copyright (c) 2009-2011 Jaime Iniesta
+
+Permission is hereby granted, free of charge, to any person obtaining
+a copy of this software and associated documentation files (the
+"Software"), to deal in the Software without restriction, including
+without limitation the rights to use, copy, modify, merge, publish,
+distribute, sublicense, and/or sell copies of the Software, and to
+permit persons to whom the Software is furnished to do so, subject to
+the following conditions:
+
+The above copyright notice and this permission notice shall be
+included in all copies or substantial portions of the Software.
+
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE
+LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION
+OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION
+WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
data/README.rdoc
ADDED
@@ -0,0 +1,90 @@
+= MetaInspector
+
+MetaInspector is a gem for web scraping purposes. You give it an URL, and it lets you easily get its title, links, and meta tags.
+
+= Installation
+
+Install the gem from RubyGems:
+
+  gem install metainspector
+
+= Usage
+
+Initialize a scraper instance for an URL, like this:
+
+  page = MetaInspector::Scraper.new('http://pagerankalert.com')
+
+or, for short, a convenience alias is also available:
+
+  page = MetaInspector.new('http://pagerankalert.com')
+
+Then you can see the scraped data like this:
+
+  page.address # URL of the page
+  page.title # title of the page, as string
+  page.links # array of strings, with every link found on the page
+  page.meta_description # meta description, as string
+  page.meta_keywords # meta keywords, as string
+
+MetaInspector uses dynamic methods for meta_tag discovery, so all these will work, and will be converted to a search of a meta tag by the corresponding name, and return its content attribute
+
+  page.meta_description # <meta name="description" content="..." />
+  page.meta_keywords # <meta name="keywords" content="..." />
+  page.meta_robots # <meta name="robots" content="..." />
+  page.meta_generator # <meta name="generator" content="..." />
+
+It will also work for the meta tags of the form <meta http-equiv="name" ... />, like the following:
+
+  page.meta_content_language # <meta http-equiv="content-language" content="..." />
+  page.meta_Content_Type # <meta http-equiv="Content-Type" content="..." />
+
+Please notice that MetaInspector is case sensitive, so page.meta_Content_Type is not the same as page.meta_content_type
+
+The full scraped document if accessible from:
+
+  page.document # Nokogiri doc that you can use it to get any element from the page
+
+= Examples
+
+You can find some sample scripts on the samples folder, including a basic scraping and a spider that will follow external links using a queue. What follows is an example of use from irb:
+
+  $ irb
+  >> require 'metainspector'
+  => true
+
+  >> page = MetaInspector.new('http://pagerankalert.com')
+  => #<MetaInspector:0x11330c0 @document=nil, @links=nil, @address="http://pagerankalert.com", @description=nil, @keywords=nil, @title=nil>
+
+  >> page.title
+  => "PageRankAlert.com :: Track your PageRank changes"
+
+  >> page.meta_description
+  => "Track your PageRank(TM) changes and receive alerts by email"
+
+  >> page.meta_keywords
+  => "pagerank, seo, optimization, google"
+
+  >> page.links.size
+  => 8
+
+  >> page.links[5]
+  => "http://pagerankalert.posterous.com"
+
+  >> page.document.class
+  => String
+
+  >> page.parsed_document.class
+  => Nokogiri::HTML::Document
+
+= To Do
+
+* Get page.base_dir from the address
+* Distinguish between external and internal links, returning page.links for all of them as found, page.external_links and page.internal_links converted to absolute URLs
+* Return array of images in page as absolute URLs
+* Be able to set a timeout in seconds
+* If keywords seem to be separated by blank spaces, replace them with commas
+* Mocks
+* Check content type, process only HTML pages, don't try to scrape TAR files like http://ftp.ruby-lang.org/pub/ruby/ruby-1.9.1-p129.tar.bz2 or video files like http://isabel.dit.upm.es/component/option,com_docman/task,doc_download/gid,831/Itemid,74/
+* Get most important image querying Facebook
+
+Copyright (c) 2009-2011 Jaime Iniesta, released under the MIT license
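Note on the README's To Do list: one item asks for page.internal_links and page.external_links converted to absolute URLs. A minimal sketch of how that could look, assuming only Ruby's standard uri library. The helper name classify_links is hypothetical and is not part of MetaInspector 1.2.0.

```ruby
require 'uri'

# Hypothetical helper (not in the gem): resolves raw href values against the
# page address and splits the HTTP(S) ones into internal and external links.
def classify_links(address, hrefs)
  base = URI.parse(address)

  absolute = hrefs.map do |href|
    begin
      URI.join(base, href)
    rescue URI::InvalidURIError
      nil # skip hrefs that cannot be parsed as URIs
    end
  end.compact

  # URI::HTTPS is a subclass of URI::HTTP, so this keeps both schemes
  # and drops mailto:, javascript: and similar links.
  http_links = absolute.select { |uri| uri.is_a?(URI::HTTP) }
  internal, external = http_links.partition { |uri| uri.host == base.host }

  { internal: internal.map(&:to_s), external: external.map(&:to_s) }
end

result = classify_links('http://pagerankalert.com/',
                        ['/about', 'faq.html', 'http://pagerankalert.posterous.com'])
puts result[:internal].inspect
puts result[:external].inspect
```

Comparing hosts exactly treats www.pagerankalert.com and pagerankalert.com as different sites; whether that is desirable is a design choice the To Do item leaves open.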
data/Rakefile
CHANGED
@@ -1,20 +1,2 @@
-
-
-require 'rubygems'
-require 'hoe'
-require 'open-uri'
-require 'hpricot'
-require './lib/metainspector.rb'
-
-Hoe.new('metainspector', MetaInspector::VERSION) do |p|
-  p.rubyforge_name = 'metainspector'
-  p.author = 'Jaime Iniesta'
-  p.email = 'jaimeiniesta@gmail.com'
-  p.summary = 'Ruby gem for web scraping purposes. It scrapes a given URL, and returns you a hash with data from it like for example the title, meta description, meta keywords, an array with all the links, all the images in it, etc.'
-  p.description = p.paragraphs_of('README.txt', 2..5).join("\n\n")
-  p.url = p.paragraphs_of('README.txt', 0).first.split(/\n/)[1..-1]
-  p.changes = p.paragraphs_of('History.txt', 0..1).join("\n\n")
-  p.extra_deps << "hpricot"
-end
-
-# vim: syntax=Ruby
+require 'bundler'
+Bundler::GemHelper.install_tasks
data/lib/meta_inspector/scraper.rb
ADDED
@@ -0,0 +1,81 @@
+# -*- encoding: utf-8 -*-
+
+require 'open-uri'
+require 'rubygems'
+require 'nokogiri'
+require 'charguess'
+require 'iconv'
+
+# MetaInspector provides an easy way to scrape web pages and get its elements
+module MetaInspector
+  class Scraper
+    attr_reader :address
+
+    # Initializes a new instance of MetaInspector, setting the URL address to the one given
+    # TODO: validate address as http URL, dont initialize it if wrong format
+    def initialize(address)
+      @address = address
+
+      @document = @title = @description = @keywords = @links = nil
+    end
+
+    # Returns the parsed document title, from the content of the <title> tag.
+    # This is not the same as the meta_tite tag
+    def title
+      @title ||= parsed_document.css('title').inner_html rescue nil
+    end
+
+    # Returns the parsed document links
+    def links
+      @links ||= parsed_document.search("//a").map {|link| link.attributes["href"].to_s.strip} rescue nil
+    end
+
+    # Returns the charset
+    # TODO: We should trust the charset expressed on the Content-Type meta tag
+    # and only guess it if none given
+    def charset
+      @charset ||= CharGuess.guess(document).downcase
+    end
+
+    # Returns the whole parsed document
+    def parsed_document
+      @parsed_document ||= Nokogiri::HTML(document)
+
+    rescue
+      warn 'An exception occurred while trying to scrape the page!'
+    end
+
+    # Returns the original, unparsed document
+    def document
+      @document ||= open(@address).read
+
+    rescue SocketError
+      warn 'MetaInspector exception: The url provided does not exist or is temporarily unavailable (socket error)'
+      @scraped = false
+    rescue TimeoutError
+      warn 'Timeout!!!'
+    rescue
+      warn 'An exception occurred while trying to fetch the page!'
+    end
+
+    # Scrapers for all meta_tags in the form of "meta_name" are automatically defined. This has been tested for
+    # meta name: keywords, description, robots, generator
+    # meta http-equiv: content-language, Content-Type
+    #
+    # It will first try with meta name="..." and if nothing found,
+    # with meta http-equiv="...", substituting "_" by "-"
+    # TODO: this should be case unsensitive, so meta_robots gets the results from the HTML for robots, Robots, ROBOTS...
+    # TODO: cache results on instance variables, using ||=
+    # TODO: define respond_to? to return true on the meta_name methods
+    def method_missing(method_name)
+      if method_name.to_s =~ /^meta_(.*)/
+        content = parsed_document.css("meta[@name='#{$1}']").first['content'] rescue nil
+        content = parsed_document.css("meta[@http-equiv='#{$1.gsub("_", "-")}']").first['content'] rescue nil if content.nil?
+
+        content
+      else
+        super
+      end
+    end
+  end
+end
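Note on the method_missing TODOs in scraper.rb: two of them ask for cached results and for respond_to? to answer true for the meta_name methods. A minimal sketch of that pattern follows. It assumes a hypothetical stand-in class, MetaTagLookup, that reads meta tags from plain Hashes instead of a Nokogiri document, so it is not the gem's implementation.

```ruby
# Hypothetical stand-in for MetaInspector::Scraper: meta tags are supplied as
# Hashes rather than parsed from HTML, to keep the sketch dependency-free.
class MetaTagLookup
  META_METHOD = /\Ameta_(.+)\z/

  def initialize(names, http_equivs)
    @names = names             # e.g. { 'robots' => 'all,follow' }
    @http_equivs = http_equivs # e.g. { 'Content-Type' => 'text/html' }
    @cache = {}
  end

  # Handles meta_xxx calls: looks up name="xxx" first, then falls back to
  # http-equiv with "_" replaced by "-". Results (including nil) are cached.
  def method_missing(method_name, *args, &block)
    key = method_name.to_s[META_METHOD, 1]
    return super unless key && args.empty?

    @cache.fetch(key) do
      @cache[key] = @names.fetch(key) { @http_equivs[key.tr('_', '-')] }
    end
  end

  # Keeps respond_to? consistent with what method_missing accepts.
  def respond_to_missing?(method_name, include_private = false)
    META_METHOD.match?(method_name.to_s) || super
  end
end

page = MetaTagLookup.new({ 'robots' => 'all,follow' },
                         { 'Content-Type' => 'text/html; charset=utf-8' })
puts page.meta_robots
puts page.meta_Content_Type
puts page.respond_to?(:meta_robots)
```

Defining respond_to_missing? rather than overriding respond_to? directly is the usual idiom on Ruby 1.9.2 and later, since it also makes method(:meta_robots) work.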
data/lib/metainspector.rb
CHANGED
@@ -1,47 +1,3 @@
-
-require 'open-uri'
-require 'rubygems'
-require 'hpricot'
+# -*- encoding: utf-8 -*-
 
-
-
-Hpricot.buffer_size = 300000
-
-  def self.scrape(url)
-    doc = Hpricot(open(url))
-
-    # Searching title...
-    if doc.at('title')
-      title = doc.at('title').inner_html
-    else
-      title = ""
-    end
-
-    # Searching meta description...
-    if doc.at("meta[@name='description']")
-      description = doc.at("meta[@name='description']")['content']
-    else
-      description = ""
-    end
-
-    # Searching meta keywords...
-    if doc.at("meta[@name='keywords']")
-      keywords = doc.at("meta[@name='keywords']")['content']
-    else
-      keywords = ""
-    end
-
-    # Searching links...
-    links = []
-    doc.search("//a").each do |link|
-      links << link.attributes["href"] if (!link.attributes["href"].nil?)
-    end
-
-    # Returning all data...
-    {'ok' => true, 'title' => title, 'description' => description, 'keywords' => keywords, 'links' => links}
-
-  rescue SocketError
-    puts 'MetaInspector exception: The url provided does not exist or is temporarily unavailable (socket error)'
-    {'ok' => false, 'title' => nil, 'description' => nil, 'keywords' => nil, 'links' => nil}
-  end
-end
+require 'meta_inspector'
data/meta_inspector.gemspec
ADDED
@@ -0,0 +1,26 @@
+# -*- encoding: utf-8 -*-
+$:.push File.expand_path("../lib", __FILE__)
+require "meta_inspector/version"
+
+Gem::Specification.new do |s|
+  s.name = "metainspector"
+  s.version = MetaInspector::VERSION
+  s.platform = Gem::Platform::RUBY
+  s.authors = ["Jaime Iniesta"]
+  s.email = ["jaimeiniesta@gmail.com"]
+  s.homepage = "https://rubygems.org/gems/metainspector"
+  s.summary = %q{MetaInspector is a ruby gem for web scraping purposes, that returns a hash with metadata from a given URL}
+  s.description = %q{MetaInspector lets you scrape a web page and get its title, charset, link and meta tags}
+
+  s.rubyforge_project = "MetaInspector"
+
+  s.files = `git ls-files`.split("\n")
+  s.test_files = `git ls-files -- {test,spec,features}/*`.split("\n")
+  s.executables = `git ls-files -- bin/*`.split("\n").map{ |f| File.basename(f) }
+  s.require_paths = ["lib"]
+
+  s.add_dependency 'nokogiri', '1.4.4'
+  s.add_dependency 'charguess', '1.3.20110226181011'
+
+  s.add_development_dependency 'rspec', '2.5.0'
+end
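Note on the gemspec's dependency lines: a bare version string such as '1.4.4' is an exact pin, which the metadata diff confirms by recording the requirement as "=". The sketch below uses RubyGems' own Gem::Requirement and Gem::Version classes to show what that pin accepts. The '~> 1.4.4' constraint is only a hypothetical alternative for comparison and does not appear in this release.

```ruby
require 'rubygems'

exact       = Gem::Requirement.new('= 1.4.4')  # what '1.4.4' in the gemspec means
pessimistic = Gem::Requirement.new('~> 1.4.4') # hypothetical alternative: >= 1.4.4, < 1.5

patch_release = Gem::Version.new('1.4.5')
minor_release = Gem::Version.new('1.5.0')

puts exact.satisfied_by?(patch_release)       # false: the pin rejects patch updates
puts pessimistic.satisfied_by?(patch_release) # true
puts pessimistic.satisfied_by?(minor_release) # false
```

An exact pin makes installs reproducible but forces every application that uses the gem onto the same nokogiri version.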
data/samples/basic_scraping.rb
ADDED
@@ -0,0 +1,17 @@
+# Some basic MetaInspector samples
+
+require_relative '../lib/meta_inspector.rb'
+
+puts "Enter a valid http address to scrape it"
+address = gets.strip
+page = MetaInspector.new(address)
+puts "...please wait while scraping the page..."
+
+puts "Scraping #{page.address} returned these results:"
+puts "TITLE: #{page.title}"
+puts "META DESCRIPTION: #{page.meta_description}"
+puts "META KEYWORDS: #{page.meta_keywords}"
+puts "#{page.links.size} links found..."
+page.links.each do |link|
+  puts " ==> #{link}"
+end
data/samples/spider.rb
ADDED
@@ -0,0 +1,28 @@
+# A basic spider that will follow links on an infinite loop
+require_relative '../lib/meta_inspector.rb'
+
+q = Queue.new
+visited_links=[]
+
+puts "Enter a valid http address to spider it following external links"
+address = gets.strip
+
+page = MetaInspector.new(address)
+q.push(address)
+
+while q.size > 0
+  visited_links << address = q.pop
+  page = MetaInspector.new(address)
+  puts "Spidering #{page.address}"
+
+  puts "TITLE: #{page.title}"
+  puts "META DESCRIPTION: #{page.meta_description}"
+  puts "META KEYWORDS: #{page.meta_keywords}"
+  puts "LINKS: #{page.links.size}"
+  page.links.each do |link|
+    if link[0..6] == 'http://' && !visited_links.include?(link)
+      q.push(link)
+    end
+  end
+  puts "#{visited_links.size} pages visited, #{q.size} pages on queue\n\n"
+end
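Note on samples/spider.rb: the sample only checks visited_links before queueing, so a link found on two pages before either is visited is queued twice and later fetched twice. A minimal sketch of a breadth-first crawl that records links when they are queued, with a page limit so the loop ends. The crawl function and its link_source parameter are hypothetical names, and the link graph is canned data so the sketch runs offline.

```ruby
require 'set'

# link_source is any callable that maps an address to the array of links
# found on that page. Returns the addresses in the order they were visited.
def crawl(start, link_source, max_pages = 100)
  queue   = [start]
  seen    = Set.new([start]) # everything ever queued, not only visited pages
  visited = []

  until queue.empty? || visited.size >= max_pages
    address = queue.shift
    visited << address

    link_source.call(address).each do |link|
      next unless link.start_with?('http://', 'https://')
      # Set#add? returns nil when the link was already seen
      queue << link if seen.add?(link)
    end
  end

  visited
end

# Canned link graph standing in for real pages
graph = {
  'http://a.example' => ['http://b.example', 'http://c.example', '/local'],
  'http://b.example' => ['http://a.example', 'http://c.example'],
  'http://c.example' => ['http://b.example']
}

puts crawl('http://a.example', ->(address) { graph.fetch(address, []) }).inspect
```

With the gem loaded, link_source could plausibly be ->(address) { MetaInspector.new(address).links || [] }; the || [] guards against Scraper#links returning nil after a rescued error.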
data/spec/metainspector_spec.rb
ADDED
@@ -0,0 +1,77 @@
+# -*- encoding: utf-8 -*-
+
+require File.join(File.dirname(__FILE__), "/spec_helper")
+
+describe MetaInspector do
+
+  context 'Doing a basic scrape' do
+    before(:each) do
+      @m = MetaInspector.new('http://pagerankalert.com')
+    end
+
+    it "should get the title" do
+      @m.title.should == 'PageRankAlert.com :: Track your PageRank changes'
+    end
+
+    it "should get the links" do
+      @m.links.size.should == 8
+    end
+
+    it "should have a Nokogiri::HTML::Document as parsed_document" do
+      @m.parsed_document.class.should == Nokogiri::HTML::Document
+    end
+
+    it "should have a String as document" do
+      @m.document.class.should == String
+    end
+  end
+
+  context 'Getting meta tags by ghost methods' do
+    before(:each) do
+      @m = MetaInspector.new('http://pagerankalert.com')
+    end
+
+    it "should get the robots meta tag" do
+      @m.meta_robots.should == 'all,follow'
+    end
+
+    it "should get the description meta tag" do
+      @m.meta_description.should == 'Track your PageRank(TM) changes and receive alerts by email'
+    end
+
+    it "should get the keywords meta tag" do
+      @m.meta_keywords.should == "pagerank, seo, optimization, google"
+    end
+
+    it "should get the content-language meta tag" do
+      pending "mocks"
+      @m.meta_content_language.should == "en"
+    end
+
+    it "should get the Content-Type meta tag" do
+      pending "mocks"
+      @m.meta_Content_Type.should == "text/html; charset=utf-8"
+    end
+
+    it "should get the generator meta tag" do
+      pending "mocks"
+      @m.meta_generator.should == 'WordPress 2.8.4'
+    end
+
+    it "should return nil for nonfound meta_tags" do
+      @m.meta_lollypop.should == nil
+    end
+  end
+
+  context 'Charset detection' do
+    it "should detect windows-1252 charset" do
+      @m = MetaInspector.new('http://www.alazan.com')
+      @m.charset.should == "windows-1252"
+    end
+
+    it "should detect utf-8 charset" do
+      @m = MetaInspector.new('http://www.pagerankalert.com')
+      @m.charset.should == "utf-8"
+    end
+  end
+end
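Note on the spec: every example fetches a live site, and three examples are marked pending "mocks". One way to make such checks repeatable is to pass the fetch step in as a callable, so a test can supply canned HTML. The sketch below is an assumption about how that could be arranged. TinyScraper is a hypothetical, simplified stand-in that extracts the title with a regular expression; it is not MetaInspector::Scraper, which parses with Nokogiri.

```ruby
# Hypothetical stand-in class: the fetcher is injected so that tests can
# return canned HTML instead of opening a network connection.
class TinyScraper
  def initialize(address, fetcher)
    @address = address
    @fetcher = fetcher
  end

  # Fetches the raw document once and memoizes it
  def document
    @document ||= @fetcher.call(@address)
  end

  # A regular expression is enough for this sketch; it is not a real parser
  def title
    document[%r{<title>(.*?)</title>}im, 1]
  end
end

CANNED_PAGES = {
  'http://pagerankalert.com' =>
    '<html><head><title>PageRankAlert.com :: Track your PageRank changes</title></head></html>'
}

page = TinyScraper.new('http://pagerankalert.com', ->(url) { CANNED_PAGES.fetch(url) })
puts page.title
```

The same effect could be reached without changing a constructor by stubbing the document method in the spec; which approach fits the gem better is left open here.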
data/spec/spec_helper.rb
ADDED
metadata
CHANGED
@@ -1,72 +1,123 @@
 --- !ruby/object:Gem::Specification
-rubygems_version: 0.9.2
-specification_version: 1
 name: metainspector
 version: !ruby/object:Gem::Version
-
-
-
-
--
-
-homepage: " by Jaime Iniesta"
-rubyforge_project: metainspector
-description: "== FEATURES/PROBLEMS: * Scrape a given URL and return data from its HTML == SYNOPSIS: # Require all gems and libs needed... require 'rubygems' require 'open-uri' require 'hpricot' require 'metainspector' # Scrape an URL... page_data = MetaInspector.scrape(url)"
-autorequire:
-default_executable:
-bindir: bin
-has_rdoc: true
-required_ruby_version: !ruby/object:Gem::Version::Requirement
-  requirements:
-  - - ">"
-    - !ruby/object:Gem::Version
-      version: 0.0.0
-  version:
+  prerelease: false
+  segments:
+  - 1
+  - 2
+  - 0
+  version: 1.2.0
 platform: ruby
-signing_key:
-cert_chain:
-post_install_message:
 authors:
 - Jaime Iniesta
-
-
-
-- README.txt
-- Rakefile
-- bin/metainspector
-- lib/metainspector.rb
-- test/test_metainspector.rb
-test_files:
-- test/test_metainspector.rb
-rdoc_options:
-- --main
-- README.txt
-extra_rdoc_files:
-- History.txt
-- Manifest.txt
-- README.txt
-executables:
-- metainspector
-extensions: []
-
-requirements: []
+autorequire:
+bindir: bin
+cert_chain: []
 
+date: 2011-05-05 00:00:00 +02:00
+default_executable:
 dependencies:
 - !ruby/object:Gem::Dependency
-  name:
-
-
+  name: nokogiri
+  prerelease: false
+  requirement: &id001 !ruby/object:Gem::Requirement
+    none: false
+    requirements:
+    - - "="
+      - !ruby/object:Gem::Version
+        segments:
+        - 1
+        - 4
+        - 4
+        version: 1.4.4
+  type: :runtime
+  version_requirements: *id001
+- !ruby/object:Gem::Dependency
+  name: charguess
+  prerelease: false
+  requirement: &id002 !ruby/object:Gem::Requirement
+    none: false
     requirements:
-    - - "
+    - - "="
       - !ruby/object:Gem::Version
-
-
+        segments:
+        - 1
+        - 3
+        - 20110226181011
+        version: 1.3.20110226181011
+  type: :runtime
+  version_requirements: *id002
 - !ruby/object:Gem::Dependency
-  name:
-
-
+  name: rspec
+  prerelease: false
+  requirement: &id003 !ruby/object:Gem::Requirement
+    none: false
     requirements:
-    - - "
+    - - "="
       - !ruby/object:Gem::Version
-
-
+        segments:
+        - 2
+        - 5
+        - 0
+        version: 2.5.0
+  type: :development
+  version_requirements: *id003
+description: MetaInspector lets you scrape a web page and get its title, charset, link and meta tags
+email:
+- jaimeiniesta@gmail.com
+executables: []
+
+extensions: []
+
+extra_rdoc_files: []
+
+files:
+- .gitignore
+- Gemfile
+- MIT-LICENSE
+- README.rdoc
+- Rakefile
+- lib/meta_inspector.rb
+- lib/meta_inspector/scraper.rb
+- lib/meta_inspector/version.rb
+- lib/metainspector.rb
+- meta_inspector.gemspec
+- samples/basic_scraping.rb
+- samples/spider.rb
+- spec/metainspector_spec.rb
+- spec/spec_helper.rb
+has_rdoc: true
+homepage: https://rubygems.org/gems/metainspector
+licenses: []
+
+post_install_message:
+rdoc_options: []
+
+require_paths:
+- lib
+required_ruby_version: !ruby/object:Gem::Requirement
+  none: false
+  requirements:
+  - - ">="
+    - !ruby/object:Gem::Version
+      segments:
+      - 0
+      version: "0"
+required_rubygems_version: !ruby/object:Gem::Requirement
+  none: false
+  requirements:
+  - - ">="
+    - !ruby/object:Gem::Version
+      segments:
+      - 0
+      version: "0"
+requirements: []
+
+rubyforge_project: MetaInspector
+rubygems_version: 1.3.7
+signing_key:
+specification_version: 3
+summary: MetaInspector is a ruby gem for web scraping purposes, that returns a hash with metadata from a given URL
+test_files:
+- spec/metainspector_spec.rb
+- spec/spec_helper.rb
data/History.txt
DELETED
@@ -1,21 +0,0 @@
-== 1.0.2 / 2007-12-10
-
-* Open-uri, Rubygems and Hpricot required at the MetaInspector class, so you won't need to require them. Just require metainspector and they will be included along.
-
-* Rescue in case of socket error. If the URL does not exist or is unreachable, it will catch the SocketError exception and return 'ok' => false.
-
-* Added hpricot as extra dependency so it will be automatically installed when you install the metainspector gem.
-
-* Misc code cleanup... "if !doc.at('title').nil?" is the same as "if doc.at('title')"
-
-* Thanks to David Calavera (http://thinkincode.net) and Juan Alvarez (http://ruby.reboot.com.mx/) for their comments and contributions to this release.
-
-== 1.0.1 / 2007-12-06
-
-* Added some info at README.txt, translated all methods to English
-
-== 1.0.0 / 2007-12-06
-
-* MetaInspector is born!
-* Birthday!
-
data/Manifest.txt
DELETED
data/README.txt
DELETED
@@ -1,62 +0,0 @@
-metainspector
-by Jaime Iniesta
-http://metainspector.rubyforge.org/
-
-== DESCRIPTION:
-
-Ruby gem for web scraping purposes. It scrapes a given URL, and returns you a hash with data from it like for example the title, meta description, meta keywords, an array with all the links, all the images in it, etc.
-
-== FEATURES/PROBLEMS:
-
-* Scrape a given URL and return data from its HTML
-
-== SYNOPSIS:
-
-  # Require all gems and libs needed...
-  require 'rubygems'
-  require 'open-uri'
-  require 'hpricot'
-  require 'metainspector'
-
-  # Scrape an URL...
-  page_data = MetaInspector.scrape(url)
-
-  # See extracted data...
-  page_data['title']
-  page_data['description']
-  page_data['keywords']
-  page_data['links']
-
-== REQUIREMENTS:
-
-* open-uri
-* hpricot
-
-== INSTALL:
-
-* sudo gem install metainspector
-
-== LICENSE:
-
-(The MIT License)
-
-Copyright (c) 2007 Jaime Iniesta
-
-Permission is hereby granted, free of charge, to any person obtaining
-a copy of this software and associated documentation files (the
-'Software'), to deal in the Software without restriction, including
-without limitation the rights to use, copy, modify, merge, publish,
-distribute, sublicense, and/or sell copies of the Software, and to
-permit persons to whom the Software is furnished to do so, subject to
-the following conditions:
-
-The above copyright notice and this permission notice shall be
-included in all copies or substantial portions of the Software.
-
-THE SOFTWARE IS PROVIDED 'AS IS', WITHOUT WARRANTY OF ANY KIND,
-EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
-MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.
-IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY
-CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT,
-TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE
-SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
data/bin/metainspector
DELETED
File without changes
data/test/test_metainspector.rb
DELETED
@@ -1 +0,0 @@
-