scrubyt 0.2.6 → 0.2.8
- data/CHANGELOG +59 -12
- data/Rakefile +2 -2
- data/lib/scrubyt.rb +24 -6
- data/lib/scrubyt/core/navigation/fetch_action.rb +91 -56
- data/lib/scrubyt/core/navigation/navigation_actions.rb +32 -22
- data/lib/scrubyt/core/scraping/constraint.rb +53 -57
- data/lib/scrubyt/core/scraping/constraint_adder.rb +15 -38
- data/lib/scrubyt/core/scraping/filters/attribute_filter.rb +17 -0
- data/lib/scrubyt/core/scraping/filters/base_filter.rb +111 -0
- data/lib/scrubyt/core/scraping/filters/detail_page_filter.rb +14 -0
- data/lib/scrubyt/core/scraping/filters/download_filter.rb +49 -0
- data/lib/scrubyt/core/scraping/filters/html_subtree_filter.rb +7 -0
- data/lib/scrubyt/core/scraping/filters/regexp_filter.rb +17 -0
- data/lib/scrubyt/core/scraping/filters/tree_filter.rb +121 -0
- data/lib/scrubyt/core/scraping/pattern.rb +292 -157
- data/lib/scrubyt/core/scraping/result_indexer.rb +51 -47
- data/lib/scrubyt/core/shared/evaluation_context.rb +3 -42
- data/lib/scrubyt/core/shared/extractor.rb +122 -163
- data/lib/scrubyt/output/export.rb +59 -174
- data/lib/scrubyt/output/post_processor.rb +4 -3
- data/lib/scrubyt/output/result.rb +8 -9
- data/lib/scrubyt/output/result_dumper.rb +81 -42
- data/lib/scrubyt/utils/compound_example_lookup.rb +11 -11
- data/lib/scrubyt/utils/ruby_extensions.rb +113 -0
- data/lib/scrubyt/utils/shared_utils.rb +39 -26
- data/lib/scrubyt/utils/simple_example_lookup.rb +6 -6
- data/lib/scrubyt/utils/xpathutils.rb +31 -30
- data/test/unittests/constraint_test.rb +11 -7
- data/test/unittests/extractor_test.rb +6 -6
- data/test/unittests/filter_test.rb +66 -66
- metadata +22 -15
- data/lib/scrubyt/core/scraping/filter.rb +0 -201
data/CHANGELOG
CHANGED
@@ -1,29 +1,76 @@
 = scRUBYt! Changelog
 
-== 0.2.6
+== 0.2.7
+=== 15th April, 2007
+
+=<tt>changes:</tt>
+
+[NEW] download pattern: download the file pointed to by the
+      parent pattern
+[NEW] checking checkboxes
+[NEW] basic authentication support
+[NEW] default values for missing elements
+[NEW] possibility to resolve relative paths against a custom url
+[NEW] first simple version of to_csv and to_hash
+[NEW] complete rewrite of the exporting system (Credit: Neelance)
+[NEW] first version of smart regular expressions: they are constructed
+      from examples, just as the XPaths are (Credit: Neelance)
+[NEW] Possibility to click the n-th link
+[FIX] Clicking on links using scRUBYt's advanced example lookup
+[NEW] Forcing writing text of non-leaf nodes with :write_text => true
+[NEW] Possibility to set custom user-agent; Specified default user agent
+      as Microsoft IE6
+[FIX] Fixed crawling to detail pages in case of leaving the
+      original site (Credit: Michael Mazour)
+[FIX] fixing the '//' problem - if the relative url contained two
+      slashes, the fetching failed
+[FIX] scrubyt assumed that documents have a list of nested elements
+      (Credit: Rick Bradley)
+[FIX] crawling to detail pages works also if the parent pattern is
+      a string pattern
+[FIX] shortcut url fixed again
+[FIX] regexp pattern fixed in case its parent was a string
+[FIX] refactoring the core classes, lots of bugfixes and stabilization
+
+== 0.2.6
 === 22nd March, 2007
 
-The mission of this release was to add even more powerful features,
-
+The mission of this release was to add even more powerful features,
+like crawling to detail pages or compound example specification,
+as well as fixing the most frequently popping-up bugs. Scraping
+of concrete sites is more and more frequently the cause for new
+features and bugfixes, which in my opinion means that the
+framework is beginning to make sense: from a shiny toy which
+looks cool and everybody wants to play with, it is moving
+towards a tool which you reach for if you seriously want
+to scrape a site.
+
+The new stuff in this release is 99% scraping related - if
+you are looking for new features in the navigation part,
+probably the next version will be for you, where I will
+concentrate more on adding new widgets and possibilities
+to the navigation process. FireWatir integration is very
+close, too - perhaps already the next release will
+support FireWatir navigation!
 
 =<tt>changes:</tt>
 * [NEW] Automatically crawling to and extracting from detail pages
 * [NEW] Compound example specification: So far the example of a pattern had to be a string.
   Now it can be a hash as well, like {:contains => /\d\d-\d/, :begins_with => 'Telephone'}
 * [NEW] More sophisticated example specification: Possible to use regexp as well, and need not
-  (but still possible of course) to specify the whole content of the node - nodes that
+  (but still possible of course) to specify the whole content of the node - nodes that
   contain the string/match the regexp will be returned, too
 * [NEW] Possibility to force writing text in case of non-leaf nodes
-* [NEW] Crawling to the next page now possible via image links as well
+* [NEW] Crawling to the next page now possible via image links as well
 * [NEW] Possibility to define examples for any pattern (before it did not make sense for ancestors)
 * [NEW] Implementation of crawling to the next page with different methods
-* [NEW] Heuristics: if something ends with _url, it is a shortcut for:
+* [NEW] Heuristics: if something ends with _url, it is a shortcut for:
   some_url 'href', :type => :attribute
 * [FIX] Crawling to the next page (the broken google example): if the next
-  link text is not an <a>, traverse down until the <a> is found; if it is
+  link text is not an <a>, traverse down until the <a> is found; if it is
   still not found, traverse up until it is found
 * [FIX] Crawling to next pages does not break if the next link is greyed out
-  (or otherwise present but has no href attribute (Credit:
+  (or otherwise present but has no href attribute (Credit: Robert Au)
 * [FIX] DRY-ed next link lookup - it should be much more robust now as it uses the 'standard' example lookup
 * [NEW] Correct exporting of detail page extractors
 * [NEW] Added more powerful XPath regexp (Credit: Karol Hosiawa)
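For orientation, a minimal extractor sketch exercising a few of the 0.2.7 additions above; the site, credentials and link text are hypothetical, and only the option keys and action names come from this diff:

    require 'rubygems'
    require 'scrubyt'

    # hypothetical URL and credentials; option keys match the new FetchAction#fetch below
    data = Scrubyt::Extractor.define do
      fetch 'http://www.example.com/members/list.html',
            :basic_auth => 'user@secret',   # [NEW] basic authentication support
            :user_agent => 'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)'
      click_link 'Next', 1                  # [NEW] click the n-th (here: second) matching link
    end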
@@ -105,13 +152,13 @@ This is a preview release before the first real public release, 0.2.0. Basically
 * Enhanced heuristics for example text detection
 * First version of algorithm to remove dupes resulting from multiple examples
 * empty XML leaf nodes are not written
-* new examples
+* new examples
 * TONS of bugfixes
 
 = 0.1
 === 15th January, 2007
 
-First pre-alpha (non-public) release
+First pre-alpha (non-public) release
 This release was made more for myself (to try and test rubyforge, gems, etc) rather than for the community at this time.
 
 Fairly nice set of features, but still need a lot of testing and stabilizing before it will be really usable.
@@ -201,13 +248,13 @@ This is a preview release before the first real public release, 0.2.0. Basically
 * Enhanced heuristics for example text detection
 * First version of algorithm to remove dupes resulting from multiple examples
 * empty XML leaf nodes are not written
-* new examples
+* new examples
 * TONS of bugfixes
 
 = 0.1
 === 15th January, 2007
 
-First pre-alpha (non-public) release
+First pre-alpha (non-public) release
 This release was made more for myself (to try and test rubyforge, gems, etc) rather than for the community at this time.
 
 Fairly nice set of features, but still need a lot of testing and stabilizing before it will be really usable.
data/Rakefile
CHANGED
@@ -18,7 +18,7 @@ task "cleanup_readme" => ["rdoc"]
 
 gem_spec = Gem::Specification.new do |s|
   s.name = 'scrubyt'
-  s.version = '0.2.6'
+  s.version = '0.2.8'
   s.summary = 'A powerful Web-scraping framework'
   s.description = %{scRUBYt! is an easy to learn and use, yet powerful and effective web scraping framework. It's most interesting part is a Web-scraping DSL built on HPricot and WWW::Mechanize, which allows to navigate to the page of interest, then extract and query data records with a few lines of code. It is hard to describe scRUBYt! in a few sentences - you have to see it for yourself!}
   # Files containing Test::Unit test cases.
@@ -82,7 +82,7 @@ Rake::GemPackageTask.new(gem_spec) do |pkg|
   pkg.need_tar = false
 end
 
-Rake::PackageTask.new('scrubyt-examples', '0.2.6') do |pkg|
+Rake::PackageTask.new('scrubyt-examples', '0.2.8') do |pkg|
   pkg.need_zip = true
   pkg.need_tar = true
   pkg.package_files.include("examples/**/*")
data/lib/scrubyt.rb
CHANGED
@@ -1,3 +1,19 @@
+#ruby core
+require 'open-uri'
+
+#gems
+require 'rubygems'
+require 'mechanize'
+require 'hpricot'
+require 'parse_tree'
+require 'ruby2ruby'
+
+#scrubyt
+require 'scrubyt/utils/ruby_extensions.rb'
+require 'scrubyt/utils/xpathutils.rb'
+require 'scrubyt/utils/shared_utils.rb'
+require 'scrubyt/utils/simple_example_lookup.rb'
+require 'scrubyt/utils/compound_example_lookup.rb'
 require 'scrubyt/core/scraping/constraint_adder.rb'
 require 'scrubyt/core/scraping/constraint.rb'
 require 'scrubyt/core/scraping/result_indexer.rb'
@@ -5,16 +21,18 @@ require 'scrubyt/core/scraping/pre_filter_document.rb'
 require 'scrubyt/core/scraping/compound_example.rb'
 require 'scrubyt/output/export.rb'
 require 'scrubyt/core/shared/extractor.rb'
-require 'scrubyt/core/scraping/filter.rb'
+require 'scrubyt/core/scraping/filters/base_filter.rb'
+require 'scrubyt/core/scraping/filters/attribute_filter.rb'
+require 'scrubyt/core/scraping/filters/detail_page_filter.rb'
+require 'scrubyt/core/scraping/filters/download_filter.rb'
+require 'scrubyt/core/scraping/filters/html_subtree_filter.rb'
+require 'scrubyt/core/scraping/filters/regexp_filter.rb'
+require 'scrubyt/core/scraping/filters/tree_filter.rb'
 require 'scrubyt/core/scraping/pattern.rb'
 require 'scrubyt/output/result_dumper.rb'
 require 'scrubyt/output/result.rb'
-require 'scrubyt/utils/xpathutils.rb'
 require 'scrubyt/output/post_processor.rb'
 require 'scrubyt/core/navigation/navigation_actions.rb'
 require 'scrubyt/core/navigation/fetch_action.rb'
 require 'scrubyt/core/shared/evaluation_context.rb'
-require 'scrubyt/core/shared/u_r_i_builder.rb'
-require 'scrubyt/utils/shared_utils.rb'
-require 'scrubyt/utils/simple_example_lookup.rb'
-require 'scrubyt/utils/compound_example_lookup.rb'
+require 'scrubyt/core/shared/u_r_i_builder.rb'
data/lib/scrubyt/core/navigation/fetch_action.rb
CHANGED
@@ -2,38 +2,46 @@ module Scrubyt
   ##
   #=<tt>Fetching pages (and related functionality)</tt>
   #
-  #Since lot of things are happening during (and before)
+  #Since lot of things are happening during (and before)
   #the fetching of a document, I decided to move out fetching related
   #functionality to a separate class - so if you are looking for anything
   #which is loading a document (even by submitting a form or clicking a link)
   #and related things like setting a proxy etc. you should find it here.
   class FetchAction
     def initialize
-      @@current_doc_url = nil
+      @@current_doc_url = nil
       @@current_doc_protocol = nil
       @@base_dir = nil
       @@host_name = nil
-      @@agent = WWW::Mechanize.new
+      @@agent = WWW::Mechanize.new
+      @@history = []
     end
-
+
     ##
     #Action to fetch a document (either a file or a http address)
-    #
+    #
     #*parameters*
     #
     #_doc_url_ - the url or file name to fetch
-    def self.fetch(doc_url,
-
-
+    def self.fetch(doc_url, *args)
+      #Refactor this crap!!! with option_accessor stuff
+      proxy = args[0][:proxy]
+      mechanize_doc = args[0][:mechanize_doc]
+      resolve = args[0][:resolve] || :full
+      basic_auth = args[0][:basic_auth]
+      user_agent = args[0][:user_agent] || "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)"
+      #Refactor this whole stuff as well!!! It looks awful...
+      parse_and_set_proxy(proxy) if proxy
+      set_user_agent(user_agent)
+      parse_and_set_basic_auth(basic_auth) if basic_auth
+      if !mechanize_doc
         @@current_doc_url = doc_url
         @@current_doc_protocol = determine_protocol
         handle_relative_path(doc_url)
-        handle_relative_url(doc_url)
-
+        handle_relative_url(doc_url,resolve)
        puts "[ACTION] fetching document: #{@@current_doc_url}"
        if @@current_doc_protocol != 'file'
-          @@mechanize_doc = @@agent.get(@@current_doc_url)
-          store_host_name(doc_url)
+          @@mechanize_doc = @@agent.get(@@current_doc_url)
        end
      else
        @@current_doc_url = doc_url
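The old positional parameters are replaced by a trailing options hash which fetch reads via args[0]. A sketch of a direct call (URL and proxy hypothetical; the keys are exactly the ones read above):

    Scrubyt::FetchAction.fetch('http://www.example.com/index.html',
                               :proxy      => 'localhost:8080',
                               :resolve    => :full,
                               :user_agent => 'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)')

Note that at least one option must be passed, since args[0] is dereferenced unconditionally; the internal callers below (submit and click_link) always supply :mechanize_doc.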
@@ -44,60 +52,75 @@ module Scrubyt
        @@hpricot_doc = Hpricot(PreFilterDocument.br_to_newline(open(@@current_doc_url).read))
      else
        @@hpricot_doc = Hpricot(PreFilterDocument.br_to_newline(@@mechanize_doc.body))
+        store_host_name(self.get_current_doc_url) # in case we're on a new host
      end
    end
-
+
    ##
-    #Submit the last form;
+    #Submit the last form;
    def self.submit(current_form, button=nil)
      puts '[ACTION] submitting form...'
-      if button == nil
+      if button == nil
        result_page = @@agent.submit(current_form)
      else
        result_page = @@agent.submit(current_form, button)
      end
      @@current_doc_url = result_page.uri.to_s
      puts "[ACTION] fetched #{@@current_doc_url}"
-      fetch(@@current_doc_url,
+      fetch(@@current_doc_url, :mechanize_doc => result_page)
    end
-
+
    ##
-    #Click the link specified by the text
-    def self.click_link(
-
-
-
+    #Click the link specified by the text
+    def self.click_link(link_spec,index = 0)
+      print "[ACTION] clicking link specified by: "; p link_spec
+      if link_spec.is_a? Hash
+        clicked_elem = CompoundExampleLookup.find_node_from_compund_example(@@hpricot_doc, link_spec, false, index)
+      else
+        clicked_elem = SimpleExampleLookup.find_node_from_text(@@hpricot_doc, link_spec, false, index)
+      end
+      clicked_elem = XPathUtils.find_nearest_node_with_attribute(clicked_elem, 'href')
+      result_page = @@agent.click(clicked_elem)
      @@current_doc_url = result_page.uri.to_s
      puts "[ACTION] fetched #{@@current_doc_url}"
-      fetch(@@current_doc_url,
-    end
-
+      fetch(@@current_doc_url, :mechanize_doc => result_page)
+    end
+
    ##
    # At any given point, the current document can be queried with this method; Typically used
-    # when the navigation is over and the result document is passed to the wrapper
+    # when the navigation is over and the result document is passed to the wrapper
    def self.get_current_doc_url
      @@current_doc_url
    end
-
+
    def self.get_mechanize_doc
      @@mechanize_doc
    end
-
+
    def self.get_hpricot_doc
      @@hpricot_doc
    end
-
+
    def self.get_host_name
-      @@host_name
+      @@host_name
    end
-
+
    def self.restore_host_name
+      return if @@current_doc_protocol == 'file'
      @@host_name = @@original_host_name
-    end
-
+    end
+
+    def self.store_page
+      @@history.push @@hpricot_doc
+    end
+
+    def self.restore_page
+      @@hpricot_doc = @@history.pop
+    end
+
    def self.determine_protocol
      old_protocol = @@current_doc_protocol
-      new_protocol = case @@current_doc_url
+      new_protocol = case @@current_doc_url
      when /^https/
        'https'
      when /^http/
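Two additions in this hunk are worth spelling out: click_link now takes either a text example or a compound-example hash, plus a 0-based index selecting the n-th match (the changelog's "click the n-th link"), and store_page/restore_page keep a simple stack of Hpricot documents, presumably for crawling to detail pages and back. A sketch with hypothetical link texts:

    Scrubyt::FetchAction.click_link('Next')                  # first link whose text matches 'Next'
    Scrubyt::FetchAction.click_link('Next', 2)               # the third matching link (index is 0-based)
    Scrubyt::FetchAction.click_link(:begins_with => 'More')  # compound example lookup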
@@ -110,10 +133,9 @@ private
      return 'http' if ((old_protocol == 'http') && new_protocol == 'file')
      return 'https' if ((old_protocol == 'https') && new_protocol == 'file')
      new_protocol
-    end
-
+    end
+
    def self.parse_and_set_proxy(proxy)
-      proxy = proxy[:proxy]
      if proxy.downcase == 'localhost'
        @@host = 'localhost'
        @@port = proxy.split(':').last
@@ -130,34 +152,47 @@ private
      puts "[ACTION] Setting proxy: host=<#{@@host}>, port=<#{@@port}>"
      @@agent.set_proxy(@@host, @@port)
    end
-
+
+    def self.parse_and_set_basic_auth(basic_auth)
+      login, pass = basic_auth.split('@')
+      puts "[ACTION] Basic authentication: login=<#{login}>, pass=<#{pass}>"
+      @@agent.basic_auth(login, pass)
+    end
+
+    def self.set_user_agent(user_agent)
+      #puts "[ACTION] Setting user-agent to #{user_agent}"
+      @@agent.user_agent = user_agent
+    end
+
    def self.handle_relative_path(doc_url)
      if @@base_dir == nil
        @@base_dir = doc_url.scan(/.+\//)[0] if @@current_doc_protocol == 'file'
-      else
+      else
        @@current_doc_url = ((@@base_dir + doc_url) if doc_url !~ /#{@@base_dir}/)
      end
    end
-
-    def self.handle_relative_url(doc_url)
-      return if doc_url =~ /^http/
-      if @@host_name != nil
-        #p doc_url
-        #p @@host_name
-        if doc_url !~ /#{@@host_name}/
-          @@current_doc_url = (@@host_name + doc_url)
-          #remove duplicate parts, like /blogs/en/blogs/en
-          @@current_doc_url = @@current_doc_url.split('/').uniq.reject{|x| x == ""}.join('/')
-          @@current_doc_url.sub!('http:/', 'http://')
-        end
-      end
-    end
-
+
    def self.store_host_name(doc_url)
      @@host_name = 'http://' + @@mechanize_doc.uri.to_s.scan(/http:\/\/(.+\/)+/).flatten[0] if @@current_doc_protocol == 'http'
      @@host_name = 'https://' + @@mechanize_doc.uri.to_s.scan(/https:\/\/(.+\/)+/).flatten[0] if @@current_doc_protocol == 'https'
      @@host_name = doc_url if @@host_name == nil
-      @@
-
+      @@host_name = @@host_name[0..-2] if @@host_name[-1].chr == '/'
+      @@original_host_name ||= @@host_name
+    end #end of method store_host_name
+
+    def self.handle_relative_url(doc_url, resolve)
+      return if doc_url =~ /^http/
+      case resolve
+      when :full
+        @@current_doc_url = (@@host_name + doc_url) if ( @@host_name != nil && (doc_url !~ /#{@@host_name}/))
+        @@current_doc_url = @@current_doc_url.split('/').uniq.join('/')
+      when :host
+        base_host_name = @@host_name.scan(/(http.+?\/\/.+?)\//)[0][0]
+        @@current_doc_url = base_host_name + doc_url
+      else
+        #custom resilving
+        @@current_doc_url = resolve + doc_url
+      end
+    end #end of function handle_relative_url
  end #end of class FetchAction
end #end of module Scrubyt
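The rewritten handle_relative_url gives three resolution modes for relative URLs: :full (the default) prepends the stored host name and drops duplicated path segments - the '//' and '/blogs/en/blogs/en' problems from the changelog; :host prepends only the scheme-and-host part of the stored host name; anything else is used as a literal prefix. Sketched against a hypothetical stored host name:

    # assuming @@host_name == 'http://www.example.com/blogs/en' (hypothetical)
    fetch '/blogs/en/article1.html'                                # :full -> http://www.example.com/blogs/en/article1.html
    fetch '/news.html', :resolve => :host                          # -> http://www.example.com/news.html
    fetch 'page2.html', :resolve => 'http://example.org/archive/'  # -> http://example.org/archive/page2.html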
data/lib/scrubyt/core/navigation/navigation_actions.rb
CHANGED
@@ -8,25 +8,26 @@ module Scrubyt
  class NavigationActions
    #These are reserved keywords - they can not be the name of any pattern
    #since they are reserved for describing the navigation
-    KEYWORDS = ['fetch',
+    KEYWORDS = ['fetch',
                'fill_textfield',
-                'fill_textarea',
+                'fill_textarea',
                'submit',
                'click_link',
-                'select_option',
+                'select_option',
+                'check_checkbox',
                'end']
-
+
    def initialize
      @@current_form = nil
      FetchAction.new
    end
-
+
    ##
    #Action to fill a textfield with a query string
    #
    ##*parameters*
    #
-    #_textfield_name_ - the name of the textfield (e.g. the name of the google search
+    #_textfield_name_ - the name of the textfield (e.g. the name of the google search
    #textfield is 'q'
    #
    #_query_string_ - the string that should be entered into the textfield
@@ -34,15 +35,15 @@ module Scrubyt
      lookup_form_for_tag('input','textfield',textfield_name,query_string)
      eval("@@current_form['#{textfield_name}'] = '#{query_string}'")
    end
-
+
    ##
    #Action to fill a textarea with text
    def self.fill_textarea(textarea_name, text)
      lookup_form_for_tag('textarea','textarea',textarea_name,text)
      eval("@@current_form['#{textarea_name}'] = '#{text}'")
    end
-
-    ##
+
+    ##
    #Action for selecting an option from a dropdown box
    def self.select_option(selectlist_name, option)
      lookup_form_for_tag('select','select list',selectlist_name,option)
@@ -51,13 +52,19 @@ module Scrubyt
      searched_option.click
    end
 
+    def self.check_checkbox(checkbox_name)
+      puts checkbox_name
+      lookup_form_for_tag('input','checkbox',checkbox_name, '')
+      @@current_form.checkboxes.name(checkbox_name).check
+    end
+
    ##
    #Fetch the document
-    def self.fetch(
-      FetchAction.fetch(
+    def self.fetch(*args)
+      FetchAction.fetch(*args)
    end
    ##
-    #Submit the current form (delegate it to NavigationActions)
+    #Submit the current form (delegate it to NavigationActions)
    def self.submit(index=nil)
      if index == nil
        FetchAction.submit(@@current_form)
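check_checkbox follows the same pattern as the other form actions: lookup_form_for_tag locates the form owning the named input, then the box is ticked through WWW::Mechanize's checkbox API. Inside an extractor this reads (site and field names hypothetical):

    Scrubyt::Extractor.define do
      fetch          'http://www.example.com/advanced_search.html'
      fill_textfield 'q', 'scrubyt'
      check_checkbox 'include_archives'   # new in 0.2.7
      submit 0                            # submit with the form's first button
    end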
@@ -65,39 +72,42 @@ module Scrubyt
        FetchAction.submit(@@current_form, @@current_form.buttons[index])
      end
    end
-
+
    ##
    #Click the link specified by the text ((delegate it to NavigationActions)
-    def self.click_link(
-      FetchAction.click_link(
+    def self.click_link(link_spec,index=0)
+      FetchAction.click_link(link_spec,index)
    end
-
+
    def self.get_hpricot_doc
      FetchAction.get_hpricot_doc
    end
-
+
    def self.get_current_doc_url
      FetchAction.get_current_doc_url
    end
-
+
+    def self.get_host_name
+      FetchAction.get_host_name
+    end
+
private
    def self.lookup_form_for_tag(tag,widget_name,name_attribute,query_string)
      puts "[ACTION] typing #{query_string} into the #{widget_name} named '#{name_attribute}'"
      widget = (FetchAction.get_hpricot_doc/"#{tag}[@name=#{name_attribute}]").map()[0]
      form_tag = Scrubyt::XPathUtils.traverse_up_until_name(widget, 'form')
-      find_form_based_on_tag(form_tag, ['name', 'id', 'action'])
+      find_form_based_on_tag(form_tag, ['name', 'id', 'action'])
    end
-
+
    def self.find_form_based_on_tag(tag, possible_attrs)
      lookup_attribute_name = nil
      lookup_attribute_value = nil
-
+
      possible_attrs.each { |a|
        lookup_attribute_name = a
        lookup_attribute_value = tag.attributes[a]
        break if lookup_attribute_value != nil
      }
-
      i = 0
      loop do
        @@current_form = FetchAction.get_mechanize_doc.forms[i]