jaimeiniesta-metainspector 1.1.1 → 1.1.2

data/CHANGELOG.rdoc CHANGED
@@ -1,3 +1,7 @@
+ = 1.1.2
+ === 19th May, 2009
+ * Using nokogiri instead of hpricot
+
  = 1.1.1
  === 14th May, 2009
  * Simplified scrape method; metadata that is not found is left as nil, to be able to distinguish between an element that was not found and one that was found but empty.
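The headline change in 1.1.2 is the parser swap. As a rough sketch of what that means for callers (both call styles are taken verbatim from the diffs below; the sample HTML string is invented for illustration):

  require 'rubygems'
  require 'nokogiri'

  html = "<html><head><title>Hello</title></head><body></body></html>"

  # 1.1.1 (hpricot):
  #   doc   = Hpricot(html)
  #   title = doc.at('title').inner_html.strip if doc.at('title')

  # 1.1.2 (nokogiri):
  doc   = Nokogiri::HTML(html)
  title = doc.css('title').inner_html rescue nil
  puts title  # => "Hello"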
data/README.rdoc CHANGED
@@ -2,6 +2,22 @@

  MetaInspector is a gem for web scraping purposes. You give it a URL, and it returns metadata from it.

+ = Dependencies
+
+ MetaInspector uses the nokogiri gem to parse HTML. You can install it from github.
+
+ Run the following if you haven't already:
+
+   gem sources -a http://gems.github.com
+
+ Then install the gem:
+
+   sudo gem install tenderlove-nokogiri
+
+ If you're on Ubuntu, you might need to install these packages before installing nokogiri:
+
+   sudo aptitude install libxslt-dev libxml2 libxml2-dev
+
  = Installation

  Run the following if you haven't already:
@@ -76,7 +92,7 @@ You can find some sample scripts on the samples folder, including a basic scrapi
  => #<File:/var/folders/X8/X8TBsDiWGYuMKzrB3bhWTU+++TI/-Tmp-/open-uri.6656.0>

  >> page.scraped_doc.class
- => Hpricot::Doc
+ => Nokogiri::HTML::Document

  >> page.scraped?
  => true
@@ -98,10 +114,17 @@ You can find some sample scripts on the samples folder, including a basic scrapi

  = To Do

- * Mocks
- * Check content type, process only HTML pages (i.e., don't try to scrape http://ftp.ruby-lang.org/pub/ruby/ruby-1.9.1-p129.tar.bz2)
- * Return array of images in page
+ * Get page.base_dir from the address
+ * Distinguish between external and internal links, returning page.links for all of them as found, page.external_links and page.internal_links converted to absolute URLs
+ * Return array of images in page as absolute URLs
  * Return contents of meta robots tag
- * Consider using nokogiri instead of hpricot
+ * Be able to set a timeout in seconds
+ * Recover from Timeout exception
+ * Recover from Errno::ECONNREFUSED
+ * If keywords seem to be separated by blank spaces, replace them with commas
+ * Mocks
+ * Check content type, process only HTML pages:
+ ** Don't try to scrape http://ftp.ruby-lang.org/pub/ruby/ruby-1.9.1-p129.tar.bz2
+ ** Don't try to scrape http://isabel.dit.upm.es/component/option,com_docman/task,doc_download/gid,831/Itemid,74/

  Copyright (c) 2009 Jaime Iniesta, released under the MIT license
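For orientation before the code diffs: a minimal usage sketch consistent with the irb session quoted above (the URL is a placeholder; every accessor shown comes from the attr_reader list in the diff below):

  require 'rubygems'
  require 'metainspector'

  page = MetaInspector.new('http://example.com')
  page.scrape!

  page.title        # <title> contents, or nil if not found
  page.description  # meta description content, or nil
  page.keywords     # meta keywords content, or nil
  page.links        # array of href values found on the page
  page.scraped?     # => true if scraping succeeded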
data/lib/metainspector.rb CHANGED
@@ -1,12 +1,10 @@
  require 'open-uri'
  require 'rubygems'
- require 'hpricot'
+ require 'nokogiri'

  # MetaInspector provides an easy way to scrape web pages and get their elements
  class MetaInspector
-   VERSION = '1.1.1'
-
-   Hpricot.buffer_size = 300000
+   VERSION = '1.1.2'

    attr_reader :address, :title, :description, :keywords, :links, :full_doc, :scraped_doc

@@ -28,21 +26,21 @@ class MetaInspector
28
26
  # Visit web page, get its contents, and parse it
29
27
  def scrape!
30
28
  @full_doc = open(@address)
31
- @scraped_doc = Hpricot(@full_doc)
29
+ @scraped_doc = Nokogiri::HTML(@full_doc)
32
30
 
33
31
  # Searching title...
34
- @title = @scraped_doc.at('title').inner_html.strip if @scraped_doc.at('title')
32
+ @title = @scraped_doc.css('title').inner_html rescue nil
35
33
 
36
34
  # Searching meta description...
37
- @description = @scraped_doc.at("meta[@name='description']")['content'].strip if @scraped_doc.at("meta[@name='description']")
35
+ @description = @scraped_doc.css("meta[@name='description']").first['content'] rescue nil
38
36
 
39
37
  # Searching meta keywords...
40
- @keywords = @scraped_doc.at("meta[@name='keywords']")['content'].strip if @scraped_doc.at("meta[@name='keywords']")
38
+ @keywords = @scraped_doc.css("meta[@name='keywords']").first['content'] rescue nil
41
39
 
42
40
  # Searching links...
43
41
  @links = []
44
42
  @scraped_doc.search("//a").each do |link|
45
- @links << link.attributes["href"].strip if (!link.attributes["href"].nil?)
43
+ @links << link.attributes["href"].to_s.strip
46
44
  end
47
45
 
48
46
  # Mark scraping as success
@@ -51,6 +49,10 @@ class MetaInspector
    rescue SocketError
      puts 'MetaInspector exception: The url provided does not exist or is temporarily unavailable (socket error)'
      @scraped = false
+   rescue TimeoutError
+     puts 'Timeout!!!'
+   rescue
+     puts 'An exception occurred while trying to scrape the page!'
    end

    # Syntactic sugar
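Two behavioral notes on this file. First, the new "rescue nil" guards keep the 1.1.1 promise of leaving metadata that was not found as nil. A standalone sketch of why (the empty HTML string is invented; the selector call is copied from the diff above):

  require 'rubygems'
  require 'nokogiri'

  doc = Nokogiri::HTML('<html><head></head><body></body></html>')

  # css(...) returns an empty NodeSet here, so .first is nil, and
  # nil['content'] raises NoMethodError -- which the modifier
  # "rescue nil" turns into a plain nil result:
  description = doc.css("meta[@name='description']").first['content'] rescue nil
  description  # => nil

Second, the new bare rescue clause catches any StandardError raised while scraping but only prints a message; unlike the SocketError branch, it does not set @scraped = false.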
@@ -1,7 +1,7 @@
  Gem::Specification.new do |s|
    s.name = "metainspector"
-   s.version = "1.1.1"
-   s.date = "2009-05-14"
+   s.version = "1.1.2"
+   s.date = "2009-05-19"
    s.summary = "Ruby gem for web scraping"
    s.email = "jaimeiniesta@gmail.com"
    s.homepage = "http://github.com/jaimeiniesta/metainspector/tree/master"
@@ -12,5 +12,5 @@ Gem::Specification.new do |s|
    s.test_files = []
    s.rdoc_options = []
    s.extra_rdoc_files = []
-   s.add_dependency("hpricot", ["> 0.5"])
+   s.add_dependency("nokogiri", ["> 1.2"])
  end
data/samples/spider.rb CHANGED
@@ -6,6 +6,7 @@ visited_links=[]

  puts "Enter a valid http address to spider it following external links"
  address = gets.strip
+
  page = MetaInspector.new(address)
  q.push(address)

@@ -14,9 +15,13 @@ while q.size > 0
    page.address=address
    puts "Spidering #{page.address}"
    page.scrape!
+
    puts "TITLE: #{page.title}"
+   puts "DESCRIPTION: #{page.description}"
+   puts "KEYWORDS: #{page.keywords}"
+   puts "LINKS: #{page.links.size}"
    page.links.each do |link|
-     if link[0..6].downcase == 'http://' && !visited_links.include?(link)
+     if link[0..6] == 'http://' && !visited_links.include?(link)
        q.push(link)
      end
    end
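One subtle change in this sample: dropping .downcase makes the scheme check case-sensitive, so a link written as "HTTP://..." is no longer queued. A tiny sketch of the filter in isolation (the link values are invented):

  links = ['http://example.com/a', 'HTTP://EXAMPLE.COM/B', '/relative/path']
  visited_links = []
  q = []

  links.each do |link|
    # 1.1.2 behavior: exact lowercase scheme match only
    q.push(link) if link[0..6] == 'http://' && !visited_links.include?(link)
  end

  q  # => ["http://example.com/a"]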
@@ -31,7 +31,7 @@ class TestMetaInspector < Test::Unit::TestCase
    assert_equal m.links.size, 31
    assert_equal m.links[30], 'http://www.nuvio.cz/'
    assert_equal m.full_doc.class, Tempfile
-   assert_equal m.scraped_doc.class, Hpricot::Doc
+   assert_equal m.scraped_doc.class, Nokogiri::HTML::Document
  end

  # Test changing the address resets the state of the instance
metadata CHANGED
@@ -1,7 +1,7 @@
  --- !ruby/object:Gem::Specification
  name: jaimeiniesta-metainspector
  version: !ruby/object:Gem::Version
-   version: 1.1.1
+   version: 1.1.2
  platform: ruby
  authors:
  - Jaime Iniesta
@@ -9,18 +9,18 @@ autorequire:
  bindir: bin
  cert_chain: []

- date: 2009-05-14 00:00:00 -07:00
+ date: 2009-05-19 00:00:00 -07:00
  default_executable:
  dependencies:
  - !ruby/object:Gem::Dependency
-   name: hpricot
+   name: nokogiri
    type: :runtime
    version_requirement:
    version_requirements: !ruby/object:Gem::Requirement
      requirements:
      - - ">"
        - !ruby/object:Gem::Version
-         version: "0.5"
+         version: "1.2"
      version:
  description: MetaInspector is a ruby gem for web scraping purposes, that returns a hash with metadata from a given URL
  email: jaimeiniesta@gmail.com