scrubyt 0.2.0 → 0.2.3

data/CHANGELOG CHANGED
@@ -1,4 +1,103 @@
- = scRUBYt! changelog
+ = scRUBYt! Changelog
+
+ == 0.2.3
+ === 20th February, 2007
+
+ Thanks to the feedback from all of you, I managed to find a lot of bugs as well as write up a nice feature request list. The bugs are mostly fixed, and some shiny new features have been added. Stability was also improved by adding new tests and thoroughly refactoring the whole code.
+ Support for sites requiring login, submitting forms with a button click, filling text areas, dealing with variable-size results, smart handling of attribute lookup, https, custom proxy settings and tons of bugfixes make this release capable of doing much, much more than was possible in 0.2.0.
+ I have also added some shiny new examples - scraping reddit, del.icio.us, rubyforge login and automatic wordpress commenting, for example.
+
+ =<tt>Changes:</tt>
+ * [FIX] Cookies (and other stuff) are now taken into consideration
+ * [NEW] select_indices feature. Example:
+
+     table do
+       (row '1').select_indices(:last)
+     end
+
+   This will select only the last row; it is also possible to specify
+   a Range, an array of indices, or other constants like :first,
+   :every_odd etc. More to come in the future!
+ * [FIX] digg.com next page problem fixed
+ * [FIX] Fetching of https sites
+ * [FIX] Next page worked incorrectly when given an absolute path
+ * [FIX] Exporting when the pattern parameters are parenthesized
+ * [NEW] Possibility to submit forms by clicking a button
+ * [NEW] Added new unit test suite: pattern_test
+ * [NEW] Possibility to set a proxy for fetching the input document
+ * [NEW] Added possibility to choose an option from a selection list (Credit: Zaheed Haque)
+ * [FIX] Image pattern example lookup fix
+ * [NEW] Possibility to prefilter the document before passing it to Hpricot (Credit: Demitrious Kelly)
+ * [FIX] Corrected gem dependencies (Credit: Tim Fletcher)
+ * [FIX] Remove duplicates only if there are multiple examples present
+ * [NEW] New examples: wordpress comment (Credit: Zaheed Haque), rubyforge login, del.icio.us, reddit and more
+ * [FIX] If there is no scraper defined, exit with a message rather than raise an exception
+ * [NEW] Smart handling of attribute lookup: try to look up the attribute in the parent, but if it is not there, traverse up until it is found (this is useful e.g. if an image is inside a span and the span is inside an <a>)
+
+ == 0.2.0
+ === 30th January, 2007
+
+ The first ever public release, 0.2.0 is out! I would say the feature set is impressive, though the reliability still needs to be improved, and the whole thing needs to be tested, tested and tested thoroughly. This is not yet a release which you just pull out of the box and which works under any circumstances - however, the major bugs are fixed and the whole thing is in a good-enough(TM) state, I guess.
+
+ =<tt>Changes:</tt>
+
+ * better form detection heuristics
+ * report message if there are absolutely no results
+ * lots of bugfixes
+ * fixed amazon_data.books[0].item[0].title[0] style output access
+   and implemented it correctly in case of crawling as well
+ * /body/div/h3 not detected as XPath
+ * crawling problem (improved heuristics of url joining)
+ * fixed blackbox test runner - no more platform dependent code
+ * fixed exporting bug: swapped exported XPaths in the case of no example present
+ * fixed exporting bug: capturing \W (non-word character) after the pattern name; this way we can distinguish pattern names where one
+   name is a substring of the other
+ * evaluation stops if the example was not found - but not in the case
+   of next page link lookup
+ * google_data[0].link[0].url[0] style result lookup now works in the
+   case of more documents, too
+ * tons of other bugfixes
+ * overall stability fixes
+ * more blackbox tests
+ * more examples
+
+
+ == 0.1.9
+ === 28th January, 2007
+
+ This is a preview release before the first real public release, 0.2.0. Basically everything planned for 0.2.0 is in; now a testing phase (with light bugfixing :-) will follow, then 0.2.0 will be released.
+
+ =<tt>Changes:</tt>
+
+ * Possibility to specify multiple examples (hence a pattern can have more filters)
+ * Enhanced heuristics for example text detection
+ * First version of the algorithm to remove dupes resulting from multiple examples
+ * Empty XML leaf nodes are not written
+ * New examples
+ * TONS of bugfixes
+
+ == 0.1
+ === 15th January, 2007
+
+ First pre-alpha (non-public) release.
+ This release was made more for myself (to try and test rubyforge, gems, etc.) than for the community at this point.
+
+ Fairly nice set of features, but it still needs a lot of testing and stabilizing before it will be really usable.
+
+ * Navigation:
+   * fetching pages
+   * clicking links
+   * filling input fields
+   * submitting forms
+   * automatically passing the document on to the scraping step
+   * both file and http:// support
+   * automatic crawling
+
+ * Scraping:
+   * fairly powerful DSL to describe the full scraping process
+   * automatic navigation with WWW::Mechanize
+   * automatic scraping through examples with Hpricot
+   * automatic recursive scraping through the next button
 
  == 0.2.0
  === 30th January, 2007
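
Taken together, the 0.2.3 entries above add navigation (login, button clicks, text areas) and result-indexing features to the extraction DSL. Below is a minimal sketch of how they might combine; the Scrubyt::Extractor.define entry point is assumed from the gem's structure, and the URL, the 'q' field and the record/title pattern names are made-up placeholders - only fetch, fill_textfield, submit and select_indices themselves appear in the changelog:

    require 'rubygems'
    require 'scrubyt'

    # Navigation: fetch the page, fill the (hypothetical) search box, submit;
    # since 0.2.3, submit(0) would click the form's first button instead.
    data = Scrubyt::Extractor.define do
      fetch          'http://www.example.com/search'
      fill_textfield 'q', 'ruby'
      submit

      # Scraping: keep only the first match of the (hypothetical) pattern;
      # a Range, an array of indices, :last or :every_odd would also work here.
      record do
        (title 'Some example title').select_indices(:first)
      end
    end

    puts data.to_xml   # assumes the result supports XML dumping via result_dumper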
data/Rakefile CHANGED
@@ -18,7 +18,7 @@ task "cleanup_readme" => ["rdoc"]
 
  gem_spec = Gem::Specification.new do |s|
    s.name = 'scrubyt'
-   s.version = '0.2.0'
+   s.version = '0.2.3'
    s.summary = 'A powerful Web-scraping framework'
    s.description = %{scRUBYt! is an easy to learn and use, yet powerful and effective web scraping framework. It's most interesting part is a Web-scraping DSL built on HPricot and WWW::Mechanize, which allows to navigate to the page of interest, then extract and query data records with a few lines of code. It is hard to describe scRUBYt! in a few sentences - you have to see it for yourself!}
    # Files containing Test::Unit test cases.
@@ -28,6 +28,8 @@ gem_spec = Gem::Specification.new do |s|
    s.author = 'Peter Szinek'
    s.email = 'peter@rubyrailways.com'
    s.homepage = 'http://www.scrubyt.org'
+   s.add_dependency('hpricot', '>= 0.5')
+   s.add_dependency('mechanize', '>= 0.6.3')
    s.has_rdoc = 'true'
  end
 
@@ -80,7 +82,7 @@ Rake::GemPackageTask.new(gem_spec) do |pkg|
    pkg.need_tar = false
  end
 
- Rake::PackageTask.new('scrubyt-examples', '0.2.0') do |pkg|
+ Rake::PackageTask.new('scrubyt-examples', '0.2.3') do |pkg|
    pkg.need_zip = true
    pkg.need_tar = true
    pkg.package_files.include("examples/**/*")
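
With the two add_dependency lines above, RubyGems enforces compatible library versions when the gem is activated; a minimal consumer-side sketch:

    require 'rubygems'

    gem 'scrubyt', '= 0.2.3'   # also resolves hpricot >= 0.5 and mechanize >= 0.6.3
    require 'scrubyt'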
data/lib/scrubyt.rb CHANGED
@@ -1,10 +1,15 @@
- require 'scrubyt/constraint_adder.rb'
- require 'scrubyt/constraint.rb'
- require 'scrubyt/export.rb'
- require 'scrubyt/extractor.rb'
- require 'scrubyt/filter.rb'
- require 'scrubyt/pattern.rb'
- require 'scrubyt/result_dumper.rb'
- require 'scrubyt/result.rb'
- require 'scrubyt/xpathutils.rb'
- require 'scrubyt/post_processor.rb'
+ require 'scrubyt/core/scraping/constraint_adder.rb'
+ require 'scrubyt/core/scraping/constraint.rb'
+ require 'scrubyt/core/scraping/result_indexer.rb'
+ require 'scrubyt/core/scraping/pre_filter_document.rb'
+ require 'scrubyt/output/export.rb'
+ require 'scrubyt/core/shared/extractor.rb'
+ require 'scrubyt/core/scraping/filter.rb'
+ require 'scrubyt/core/scraping/pattern.rb'
+ require 'scrubyt/output/result_dumper.rb'
+ require 'scrubyt/output/result.rb'
+ require 'scrubyt/utils/xpathutils.rb'
+ require 'scrubyt/output/post_processor.rb'
+ require 'scrubyt/core/navigation/navigation_actions.rb'
+ require 'scrubyt/core/navigation/fetch_action.rb'
+ require 'scrubyt/core/shared/evaluation_context.rb'
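
For orientation, the reorganized require list maps onto the following source layout (directory and file names are taken directly from the require paths above; the lib/ prefix assumes the conventional RubyGems load path):

    lib/scrubyt/
      core/
        navigation/  fetch_action.rb, navigation_actions.rb
        scraping/    constraint.rb, constraint_adder.rb, filter.rb, pattern.rb,
                     pre_filter_document.rb, result_indexer.rb
        shared/      evaluation_context.rb, extractor.rb
      output/        export.rb, post_processor.rb, result.rb, result_dumper.rb
      utils/         xpathutils.rb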
data/lib/scrubyt/core/navigation/fetch_action.rb ADDED
@@ -0,0 +1,152 @@
+ module Scrubyt
+   ##
+   #=<tt>Fetching pages (and related functionality)</tt>
+   #
+   #Since a lot of things happen during (and before) the fetching
+   #of a document, I decided to move fetching-related functionality
+   #out to a separate class - so if you are looking for anything
+   #that loads a document (even by submitting a form or clicking a link),
+   #or for related things like setting a proxy, you should find it here.
+   class FetchAction
+     def initialize
+       @@current_doc_url = nil
+       @@current_doc_protocol = nil
+       @@base_dir = nil
+       @@host_name = nil
+       @@agent = WWW::Mechanize.new
+     end
+
+     ##
+     #Action to fetch a document (either a file or a http address)
+     #
+     #*parameters*
+     #
+     #_doc_url_ - the url or file name to fetch
+     def self.fetch(doc_url, proxy=nil, mechanize_doc=nil)
+       parse_and_set_proxy(proxy) if proxy
+       if mechanize_doc == nil
+         @@current_doc_url = doc_url
+         @@current_doc_protocol = determine_protocol
+         handle_relative_path(doc_url)
+         handle_relative_url(doc_url)
+
+         puts "[ACTION] fetching document: #{@@current_doc_url}"
+         if @@current_doc_protocol != 'file'
+           @@mechanize_doc = @@agent.get(@@current_doc_url)
+           store_host_name(doc_url)
+         end
+       else
+         @@current_doc_url = doc_url
+         @@mechanize_doc = mechanize_doc
+         @@current_doc_protocol = determine_protocol
+       end
+       if @@current_doc_protocol == 'file'
+         @@hpricot_doc = Hpricot(PreFilterDocument.br_to_newline(open(@@current_doc_url).read))
+       else
+         @@hpricot_doc = Hpricot(PreFilterDocument.br_to_newline(@@mechanize_doc.body))
+       end
+     end
+
+     ##
+     #Submit the last form
+     def self.submit(current_form, button=nil)
+       puts '[ACTION] submitting form...'
+       if button == nil
+         result_page = @@agent.submit(current_form)
+       else
+         result_page = @@agent.submit(current_form, button)
+       end
+       @@current_doc_url = result_page.uri.to_s
+       puts "[ACTION] fetched #{@@current_doc_url}"
+       fetch(@@current_doc_url, nil, result_page)
+     end
+
+     ##
+     #Click the link specified by the text
+     def self.click_link(link_text)
+       puts "[ACTION] clicking link: #{link_text}"
+       link = @@mechanize_doc.links.text(/^#{Regexp.escape(link_text)}$/)
+       result_page = @@agent.click(link)
+       @@current_doc_url = result_page.uri.to_s
+       puts "[ACTION] fetched #{@@current_doc_url}"
+       fetch(@@current_doc_url, nil, result_page)
+     end
+
+     ##
+     #At any given point, the current document can be queried with this method; typically used
+     #when the navigation is over and the result document is passed to the wrapper
+     def self.get_current_doc_url
+       @@current_doc_url
+     end
+
+     def self.get_mechanize_doc
+       @@mechanize_doc
+     end
+
+     def self.get_hpricot_doc
+       @@hpricot_doc
+     end
+     private
+     def self.determine_protocol
+       old_protocol = @@current_doc_protocol
+       new_protocol = case @@current_doc_url
+                      when /^https/
+                        'https'
+                      when /^http/
+                        'http'
+                      when /^www/
+                        'http'
+                      else
+                        'file'
+                      end
+       return 'http' if old_protocol == 'http' && new_protocol == 'file'
+       return 'https' if old_protocol == 'https' && new_protocol == 'file'
+       new_protocol
+     end
+
+     def self.parse_and_set_proxy(proxy)
+       proxy = proxy[:proxy]
+       if proxy.downcase == 'localhost'
+         @@host = 'localhost'
+         @@port = proxy.split(':').last
+       else
+         parts = proxy.split(':')
+         @@port = parts.delete_at(-1)
+         @@host = parts.join(':')
+         if @@host == nil || @@port == nil
+           puts "Invalid proxy specification..."
+           puts "neither host nor port can be nil!"
+           exit
+         end
+       end
+       puts "[ACTION] Setting proxy: host=<#{@@host}>, port=<#{@@port}>"
+       @@agent.set_proxy(@@host, @@port)
+     end
+
+     def self.handle_relative_path(doc_url)
+       if @@base_dir == nil
+         @@base_dir = doc_url.scan(/.+\//)[0] if @@current_doc_protocol == 'file'
+       else
+         @@current_doc_url = ((@@base_dir + doc_url) if doc_url !~ /#{@@base_dir}/)
+       end
+     end
+
+     def self.handle_relative_url(doc_url)
+       return if doc_url =~ /^http/
+       if @@host_name != nil
+         if doc_url !~ /#{@@host_name}/
+           @@current_doc_url = (@@host_name + doc_url)
+           #remove duplicate parts, like /blogs/en/blogs/en
+           @@current_doc_url = @@current_doc_url.split('/').uniq.reject{|x| x == ""}.join('/')
+           @@current_doc_url.sub!('http:/', 'http://')
+         end
+       end
+     end
+
+     def self.store_host_name(doc_url)
+       @@host_name = 'http://' + @@mechanize_doc.uri.to_s.scan(/http:\/\/(.+\/)+/).flatten[0] if @@current_doc_protocol == 'http'
+       @@host_name = 'https://' + @@mechanize_doc.uri.to_s.scan(/https:\/\/(.+\/)+/).flatten[0] if @@current_doc_protocol == 'https'
+       @@host_name = doc_url if @@host_name == nil
+     end #end of function store_host_name
+   end #end of class FetchAction
+ end #end of module Scrubyt
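
FetchAction is normally driven indirectly through the navigation DSL, but since its state lives in class variables it can also be exercised directly. A minimal sketch, assuming the gem is installed; the URL and proxy host are made-up placeholders, and the {:proxy => 'host:port'} hash mirrors what parse_and_set_proxy expects:

    require 'rubygems'
    require 'scrubyt'

    Scrubyt::FetchAction.new   # sets up the shared WWW::Mechanize agent

    # The optional second argument is the proxy specification
    Scrubyt::FetchAction.fetch('http://www.example.com',
                               :proxy => 'myproxy.example.com:8080')

    doc = Scrubyt::FetchAction.get_hpricot_doc      # Hpricot document of the page
    puts Scrubyt::FetchAction.get_current_doc_url   # URL after any redirects
    puts (doc/'title').inner_html                   # query it like any Hpricot doc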
data/lib/scrubyt/core/navigation/navigation_actions.rb ADDED
@@ -0,0 +1,106 @@
+ module Scrubyt
+   ##
+   #=<tt>Describing actions which interact with the page</tt>
+   #
+   #This class contains all the actions that are used to navigate on web pages;
+   #first of all *fetch*, for downloading the pages - then various actions
+   #like filling textfields, submitting forms, clicking links and more
+   class NavigationActions
+     #These are reserved keywords - they can not be the name of any pattern
+     #since they are reserved for describing the navigation
+     KEYWORDS = ['fetch',
+                 'fill_textfield',
+                 'fill_textarea',
+                 'submit',
+                 'click_link',
+                 'select_option',
+                 'end']
+
+     def initialize
+       @@current_form = nil
+       FetchAction.new
+     end
+
+     ##
+     #Action to fill a textfield with a query string
+     #
+     #*parameters*
+     #
+     #_textfield_name_ - the name of the textfield (e.g. the name of the google search
+     #textfield is 'q')
+     #
+     #_query_string_ - the string that should be entered into the textfield
+     def self.fill_textfield(textfield_name, query_string)
+       lookup_form_for_tag('input','textfield',textfield_name,query_string)
+       eval("@@current_form['#{textfield_name}'] = '#{query_string}'")
+     end
+
+     ##
+     #Action to fill a textarea with text
+     def self.fill_textarea(textarea_name, text)
+       lookup_form_for_tag('textarea','textarea',textarea_name,text)
+       eval("@@current_form['#{textarea_name}'] = '#{text}'")
+     end
+
+     ##
+     #Action for selecting an option from a dropdown box
+     def self.select_option(selectlist_name, option)
+       lookup_form_for_tag('select','select list',selectlist_name,option)
+       select_list = @@current_form.fields.find {|f| f.name == selectlist_name}
+       searched_option = select_list.options.find{|f| f.text == option}
+       searched_option.click
+     end
+
+     ##
+     #Fetch the document
+     def self.fetch(doc_url, mechanize_doc=nil)
+       FetchAction.fetch(doc_url, nil, mechanize_doc)
+     end
+     ##
+     #Submit the current form (delegates to FetchAction)
+     def self.submit(index=nil)
+       if index == nil
+         FetchAction.submit(@@current_form)
+       else
+         FetchAction.submit(@@current_form, @@current_form.buttons[index])
+       end
+     end
+
+     ##
+     #Click the link specified by the text (delegates to FetchAction)
+     def self.click_link(link_text)
+       FetchAction.click_link(link_text)
+     end
+
+     def self.get_hpricot_doc
+       FetchAction.get_hpricot_doc
+     end
+
+     private
+     def self.lookup_form_for_tag(tag,widget_name,name_attribute,query_string)
+       puts "[ACTION] typing #{query_string} into the #{widget_name} named '#{name_attribute}'"
+       widget = (FetchAction.get_hpricot_doc/"#{tag}[@name=#{name_attribute}]").map()[0]
+       form_tag = Scrubyt::XPathUtils.traverse_up_until_name(widget, 'form')
+       find_form_based_on_tag(form_tag, ['name', 'id', 'action'])
+     end
+
+     def self.find_form_based_on_tag(tag, possible_attrs)
+       lookup_attribute_name = nil
+       lookup_attribute_value = nil
+
+       possible_attrs.each { |a|
+         lookup_attribute_name = a
+         lookup_attribute_value = tag.attributes[a]
+         break if lookup_attribute_value != nil
+       }
+
+       i = 0
+       loop do
+         @@current_form = FetchAction.get_mechanize_doc.forms[i]
+         return nil if @@current_form == nil
+         break if @@current_form.form_node.attributes[lookup_attribute_name] == lookup_attribute_value
+         i += 1
+       end
+     end #find_form_based_on_tag
+   end #end of class NavigationActions
+ end #end of module Scrubyt
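
A sketch of NavigationActions driving a login-style flow, in the spirit of the rubyforge login example mentioned in the changelog; the URL and field names are illustrative placeholders, not a real form:

    require 'rubygems'
    require 'scrubyt'

    Scrubyt::NavigationActions.new   # resets the current form and the agent
    Scrubyt::NavigationActions.fetch('http://www.example.com/login')
    Scrubyt::NavigationActions.fill_textfield('username', 'my_name')
    Scrubyt::NavigationActions.fill_textfield('password', 'secret')
    Scrubyt::NavigationActions.submit                 # or submit(0) to click a button
    doc = Scrubyt::NavigationActions.get_hpricot_doc  # page after login

These class methods are the same ones the extraction DSL dispatches to; note that any pattern name clashing with the KEYWORDS list above is treated as navigation rather than scraping.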