jaimeiniesta-metainspector 1.1.2 → 1.1.3
- data/CHANGELOG.rdoc +7 -0
- data/README.rdoc +18 -46
- data/lib/metainspector.rb +27 -32
- data/metainspector.gemspec +10 -3
- data/samples/basic_scraping.rb +0 -1
- data/samples/spider.rb +0 -1
- data/test/test_metainspector.rb +20 -36
- metadata +3 -4
data/CHANGELOG.rdoc
CHANGED
@@ -1,6 +1,13 @@
+= 1.1.3
+=== 22nd May, 2009
+* Simplified code: now there's no need to call page.scrape!, just initialize it and go directly to page.address, page.title, page.description, page.keywords or page.links; the page will be scraped on the fly
+* Removed page.scraped?, page.scrape!, page.full_doc and page.scraped_doc
+* Added page.document, which returns the whole document scraped with Nokogiri
+
 = 1.1.2
 === 19th May, 2009
 * Using nokogiri instead of hpricot
+* Recover from exceptions
 
 = 1.1.1
 === 14th May, 2009
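The lazy-scraping idea behind 1.1.3 can be sketched with a small stand-in class. LazyPage below is hypothetical (it is not the gem's class), and the HTTP fetch is replaced by a counter so the sketch runs offline; the point is that nothing happens at initialization and the first accessor call does the work.

```ruby
# Sketch of the 1.1.3 lazy-scraping idea: no explicit scrape! call;
# the page is "fetched" on the first accessor call and memoized.
# LazyPage is a hypothetical stand-in for MetaInspector.
class LazyPage
  attr_reader :address, :fetch_count

  def initialize(address)
    @address = address
    @fetch_count = 0
    @title = nil
  end

  # Scrapes on the fly: the first call does the work, later calls reuse it.
  def title
    @title ||= fetch_title
  end

  private

  # Stand-in for the real open(@address) + Nokogiri parse.
  def fetch_title
    @fetch_count += 1
    "Title of #{@address}"
  end
end

page = LazyPage.new('http://example.com')
3.times { page.title }
puts page.fetch_count # => 1
```

Because the accessor memoizes with `||=`, repeated reads cost nothing after the first one, which is why 1.1.3 could drop `scrape!` and `scraped?` entirely.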
data/README.rdoc
CHANGED
@@ -34,11 +34,7 @@ Initialize a MetaInspector instance with an URL like this:
 
 page = MetaInspector.new('http://pagerankalert.com')
 
-
-
-page.scrape!
-
-Once scraped, you can see the returned data like this:
+Once scraped, you can see the scraped data like this:
 
 page.address # URL of the page
 page.title # title of the page, as string
@@ -46,17 +42,15 @@ Once scraped, you can see the returned data like this:
 page.keywords # meta keywords, as string
 page.links # array of strings, with every link found on the page
 
-You can see if the scraping process went ok checking what page.scrape! returns (true or false), or checking the page.scraped? method, which returns false if no successfull scraping has been finished since the last address change.
 You can also change the address of the page to be scraped using the address= setter, like this:
 
 page.address="http://jaimeiniesta.com"
 
-Doing so resets the state of the MetaInspector instance to the initial state (not scraped yet, cleared stored meta data).
+Doing so resets the state of the MetaInspector instance to the initial state (not scraped yet, cleared stored meta data). The page will be re-scraped when you consult any of its metadata again.
 
-The full
+The full scraped document is accessible from:
 
-page.
-page.scraped_doc # Hpricot doc that you can use it to get any element from the page
+page.document # Nokogiri doc that you can use to get any element from the page
 
 = Examples
 
@@ -65,52 +59,33 @@ You can find some sample scripts on the samples folder, including a basic scrapi
 $ irb
 >> require 'metainspector'
 => true
-
+
 >> page = MetaInspector.new('http://pagerankalert.com')
-=> #<MetaInspector:
-
-
->> page.scrape!
-=> true
-
+=> #<MetaInspector:0x11330c0 @document=nil, @links=nil, @address="http://pagerankalert.com", @description=nil, @keywords=nil, @title=nil>
+
 >> page.title
 => "PageRankAlert.com :: Track your pagerank changes"
-
+
 >> page.description
 => "Track your PageRank(TM) changes and receive alert by email"
-
+
 >> page.keywords
 => "pagerank, seo, optimization, google"
-
+
 >> page.links.size
 => 31
-
+
 >> page.links[30]
 => "http://www.nuvio.cz/"
-
->> page.
-=> #<File:/var/folders/X8/X8TBsDiWGYuMKzrB3bhWTU+++TI/-Tmp-/open-uri.6656.0>
-
->> page.scraped_doc.class
+
+>> page.document.class
 => Nokogiri::HTML::Document
-
->> page.scraped?
-=> true
-
+
 >> page.address="http://jaimeiniesta.com"
 => "http://jaimeiniesta.com"
-
->> page.scraped?
-=> false
-
->> page.scrape!
-=> true
-
->> page.scraped?
-=> true
-
+
 >> page.title
-=> "ruby on rails freelance developer
+=> "ruby on rails freelance developer -- Jaime Iniesta"
 
 = To Do
 
@@ -119,12 +94,9 @@ You can find some sample scripts on the samples folder, including a basic scrapi
 * Return array of images in page as absolute URLs
 * Return contents of meta robots tag
 * Be able to set a timeout in seconds
-*
-* Recover from Errno::ECONNREFUSED
+* Detect charset
 * If keywords seem to be separated by blank spaces, replace them with commas
 * Mocks
-* Check content type, process only HTML
-** Don't try to scrape http://ftp.ruby-lang.org/pub/ruby/ruby-1.9.1-p129.tar.bz2
-** Don't try to scrape http://isabel.dit.upm.es/component/option,com_docman/task,doc_download/gid,831/Itemid,74/
+* Check content type, process only HTML pages, don't try to scrape TAR files like http://ftp.ruby-lang.org/pub/ruby/ruby-1.9.1-p129.tar.bz2 or video files like http://isabel.dit.upm.es/component/option,com_docman/task,doc_download/gid,831/Itemid,74/
 
 Copyright (c) 2009 Jaime Iniesta, released under the MIT license
data/lib/metainspector.rb
CHANGED
@@ -4,18 +4,16 @@ require 'nokogiri'
 
 # MetaInspector provides an easy way to scrape web pages and get its elements
 class MetaInspector
-  VERSION = '1.1.2'
+  VERSION = '1.1.3'
 
-  attr_reader :address
+  attr_reader :address
 
   # Initializes a new instance of MetaInspector, setting the URL address to the one given
   # TODO: validate address as http URL, dont initialize it if wrong format
   def initialize(address)
     @address = address
-    @scraped = false
 
-    @
-    @links = []
+    @document = @title = @description = @keywords = @links = nil
   end
 
   # Setter for address. Initializes the whole state as the address is being changed.
@@ -23,29 +21,30 @@ class MetaInspector
     initialize(address)
   end
 
-  #
-  def
-    @
-
-
-
-
-
-
-
-
-    @keywords
-
-
-
-
-
-
+  # Returns the parsed document title
+  def title
+    @title ||= document.css('title').inner_html rescue nil
+  end
+
+  # Returns the parsed document meta description
+  def description
+    @description ||= document.css("meta[@name='description']").first['content'] rescue nil
+  end
+
+  # Returns the parsed document meta keywords
+  def keywords
+    @keywords ||= document.css("meta[@name='keywords']").first['content'] rescue nil
+  end
+
+  # Returns the parsed document links
+  def links
+    @links ||= document.search("//a").map {|link| link.attributes["href"].to_s.strip} rescue nil
+  end
+
+  # Returns the whole parsed document
+  def document
+    @document ||= Nokogiri::HTML(open(@address))
 
-    # Mark scraping as success
-    @scraped = true
-
   rescue SocketError
     puts 'MetaInspector exception: The url provided does not exist or is temporarily unavailable (socket error)'
     @scraped = false
@@ -54,9 +53,5 @@ class MetaInspector
   rescue
     puts 'An exception occurred while trying to scrape the page!'
   end
-
-  # Syntactic sugar
-  def scraped?
-    @scraped
-  end
+
 end
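The accessors added above all follow one pattern: memoize with `||=` and turn a failed lookup into nil with a rescue modifier, so a page missing a meta tag yields nil instead of raising. A minimal offline sketch of that pattern, where PageLike and its Hash-backed "document" are hypothetical stand-ins for the real Nokogiri-backed methods:

```ruby
# Sketch of the accessor pattern: memoize with ||= and map a failed
# lookup to nil via a rescue modifier. PageLike is a hypothetical
# stand-in; a Hash plays the role of the parsed document.
class PageLike
  def initialize(meta)
    @meta = meta
  end

  def description
    # Hash#fetch raises KeyError when the key is missing; the rescue
    # modifier maps that to nil, mirroring `... rescue nil` in the gem.
    @description ||= (@meta.fetch('description') rescue nil)
  end
end

with_meta    = PageLike.new('description' => 'Track your PageRank changes')
without_meta = PageLike.new({})

puts with_meta.description            # => Track your PageRank changes
puts without_meta.description.inspect # => nil
```

One trade-off of this pattern: since nil is falsy in Ruby, an attribute that is genuinely absent is looked up again on every call, because `||=` never sees a cached value.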
data/metainspector.gemspec
CHANGED
@@ -1,6 +1,6 @@
 Gem::Specification.new do |s|
   s.name = "metainspector"
-  s.version = "1.1.2"
+  s.version = "1.1.3"
   s.date = "2009-05-19"
   s.summary = "Ruby gem for web scraping"
   s.email = "jaimeiniesta@gmail.com"
@@ -8,8 +8,15 @@ Gem::Specification.new do |s|
   s.description = "MetaInspector is a ruby gem for web scraping purposes, that returns a hash with metadata from a given URL"
   s.has_rdoc = false
   s.authors = ["Jaime Iniesta"]
-  s.files = [
-
+  s.files = [
+    "README.rdoc",
+    "CHANGELOG.rdoc",
+    "MIT-LICENSE",
+    "metainspector.gemspec",
+    "lib/metainspector.rb",
+    "samples/basic_scraping.rb",
+    "samples/spider.rb"]
+  s.test_files = ["test/test_metainspector.rb"]
   s.rdoc_options = []
   s.extra_rdoc_files = []
   s.add_dependency("nokogiri", ["> 1.2"])
data/samples/basic_scraping.rb
CHANGED
data/samples/spider.rb
CHANGED
data/test/test_metainspector.rb
CHANGED
@@ -3,58 +3,42 @@ require '../lib/metainspector.rb'
 
 class TestMetaInspector < Test::Unit::TestCase
   # TODO: mock tests
-
-  # Test we can initialize a new instance, setting its address, and initial state
-  # is not scraped and every meta data value set to nil
   # TODO: validate URL format, only http and https allowed
-
-    m = MetaInspector.new('http://pagerankalert.com')
-    assert_equal m.address, 'http://pagerankalert.com'
-    assert_equal m.scraped?, false
-    assert_nil m.title
-    assert_nil m.description
-    assert_nil m.keywords
-    assert_equal m.links.size, 0
-    assert_nil m.full_doc
-    assert_nil m.scraped_doc
-  end
+  # TODO: check timeouts
 
   # Test scraping an URL, marking it as scraped and setting meta data values
-
-  def test_scrape!
+  def test_scrape
     m = MetaInspector.new('http://pagerankalert.com')
-    assert m.scrape!
-    assert m.scraped?
     assert_equal m.title, 'PageRankAlert.com :: Track your pagerank changes'
     assert_equal m.description, 'Track your PageRank(TM) changes and receive alert by email'
     assert_equal m.keywords, 'pagerank, seo, optimization, google'
     assert_equal m.links.size, 31
     assert_equal m.links[30], 'http://www.nuvio.cz/'
-    assert_equal m.
-    assert_equal m.scraped_doc.class, Nokogiri::HTML::Document
+    assert_equal m.document.class, Nokogiri::HTML::Document
   end
 
-  # Test changing the address resets the state of the instance
+  # Test changing the address resets the state of the instance so it causes a new scraping
   def test_address_setter
     m = MetaInspector.new('http://pagerankalert.com')
     assert_equal m.address, 'http://pagerankalert.com'
-    m.
-
-
-
-
-    assert_not_nil m.links
-    assert_not_nil m.full_doc
-    assert_not_nil m.scraped_doc
+    title_1 = m.title
+    description_1 = m.description
+    keywords_1 = m.keywords
+    links_1 = m.links
+    document_1 = m.document
 
     m.address = 'http://jaimeiniesta.com'
     assert_equal m.address, 'http://jaimeiniesta.com'
-
-
-
-
-
-
-
+    title_2 = m.title
+    description_2 = m.description
+    keywords_2 = m.keywords
+    links_2 = m.links
+    document_2 = m.document
+
+    assert_not_equal title_1, title_2
+    assert_not_equal description_1, description_2
+    assert_not_equal keywords_1, keywords_2
+    assert_not_equal links_1, links_2
+    assert_not_equal document_1, document_2
   end
 end
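The rewritten tests above still hit the network, which is what the "mock tests" TODO is about. A hedged offline sketch of the same address-setter assertion: FakePage is a hypothetical stand-in whose titles come from a Hash instead of an HTTP request, and its `address=` reuses the gem's reset-by-initialize trick.

```ruby
# Sketch of the "mock tests" TODO: run the address-setter check offline.
# FakePage is a hypothetical stand-in for MetaInspector; TITLES plays
# the role of the network.
TITLES = {
  'http://pagerankalert.com' => 'PageRankAlert.com :: Track your pagerank changes',
  'http://jaimeiniesta.com'  => 'ruby on rails freelance developer -- Jaime Iniesta'
}

class FakePage
  attr_reader :address

  def initialize(address)
    @address = address
    @title = nil
  end

  # Changing the address resets all memoized state, as in the gem.
  def address=(address)
    initialize(address)
  end

  def title
    @title ||= TITLES[@address]
  end
end

page = FakePage.new('http://pagerankalert.com')
title_1 = page.title
page.address = 'http://jaimeiniesta.com'
title_2 = page.title
puts title_1 == title_2 # => false
```

Calling `initialize` from `address=` works because a private method may be called with an implicit receiver; it is exactly how the gem's setter resets its state.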
metadata
CHANGED
@@ -1,7 +1,7 @@
 --- !ruby/object:Gem::Specification
 name: jaimeiniesta-metainspector
 version: !ruby/object:Gem::Version
-  version: 1.1.2
+  version: 1.1.3
 platform: ruby
 authors:
 - Jaime Iniesta
@@ -36,7 +36,6 @@ files:
 - MIT-LICENSE
 - metainspector.gemspec
 - lib/metainspector.rb
-- test/test_metainspector.rb
 - samples/basic_scraping.rb
 - samples/spider.rb
 has_rdoc: false
@@ -65,5 +64,5 @@ rubygems_version: 1.2.0
 signing_key:
 specification_version: 2
 summary: Ruby gem for web scraping
-test_files:
-
+test_files:
+- test/test_metainspector.rb