scrubyt 0.1.9 → 0.2.0

data/CHANGELOG CHANGED
@@ -1,16 +1,47 @@
1
+ = scRUBYt! changelog
2
+
3
+ == 0.2.0
4
+ === 30th January, 2007
5
+
6
+ The first ever public release, 0.2.0, is out! I would say the feature set is impressive, though the reliability still needs to be improved, and the whole thing needs to be tested, tested and tested thoroughly. This is not yet a release which you just pull out of the box and it works under any circumstances - however, the major bugs are fixed and everything is in a good-enough(TM) state, I guess.
7
+
8
+ =<tt>Changes</tt>:
9
+
10
+ * better form detection heuristics
11
+ * report message if there are absolutely no results
12
+ * lots of bugfixes
13
+ * fixed amazon_data.books[0].item[0].title[0] style output access
14
+ and implemented it correctly in the case of crawling as well
15
+ * /body/div/h3 not detected as XPath
16
+ * crawling problem (improved heuristics of url joining)
17
+ * fixed blackbox test runner - no more platform dependent code
18
+ * fixed exporting bug: swapped exported XPaths in the case of no example present
19
+ * fixed exporting bug: capturing \W (non-word character) after the pattern name; this way we can distinguish pattern names where one
20
+ name is a substring of the other
21
+ * Evaluation stops if the example was not found - but not in the case
22
+ of next page link lookup
23
+ * google_data[0].link[0].url[0] style result lookup now works in the
24
+ case of more documents, too (see the sketch after this list)
25
+ * tons of other bugfixes
26
+ * overall stability fixes
27
+ * more blackbox tests
28
+ * more examples
29
30
+
31
+
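+ A minimal, hypothetical sketch of this indexed result lookup (the google_data extractor and its link/url patterns are made-up names; only the access style itself comes from the entries above):
+
+   # The leading index selects the result document (crawled page) and defaults to 0;
+   # the following indices select pattern instances within that document.
+   google_data[0].link[0].url[0]   # first page, first link instance, its first url
+   google_data[1].link[0].url[0]   # the same lookup on the second result page
+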
1
32
  = 0.1.9
2
33
  === 28th January, 2007
3
34
 
4
35
  This is a preview release before the first real public release, 0.2.0. Basically everything planned for 0.2.0 is in, now a testing phase (with light bugfixing :-) will follow, then 0.2.0 will be released.
5
36
 
6
- changes:
37
+ =<tt>Changes</tt>:
7
38
 
8
- - Possibility to specify multiple examples (hence a pattern can have more filters)
9
- - Enhanced heuristics for example text detection
10
- - First version of algorithm to remove dupes resulting from multiple examples
11
- - empty XML leaf nodes are not written
12
- - new examples
13
- - TONS of bugfixes
39
+ * Possibility to specify multiple examples (hence a pattern can have more filters)
40
+ * Enhanced heuristics for example text detection
41
+ * First version of algorithm to remove dupes resulting from multiple examples
42
+ * empty XML leaf nodes are not written
43
+ * new examples
44
+ * TONS of bugfixes
14
45
 
15
46
  = 0.1
16
47
  === 15th January, 2007
@@ -20,15 +51,18 @@ This release was made more for myself (to try and test rubyforge, gems, etc) rat
20
51
 
21
52
  Fairly nice set of features, but still need a lot of testing and stabilizing before it will be really usable.
22
53
 
23
- Navigation:
24
- fetching pages
25
- clicking links
26
- filling input fields
27
- submitting forms
28
-
29
- Scraping:
30
- - Fairly powerful DSL to describe the full scraping process
31
- - Automatic navigation with WWW::Mechanize
32
- - Automatic scraping through examples with Hpricot
33
- - automatic recursive scraping through the next button
54
+ * Navigation:
55
+ * fetching pages
56
+ * clicking links
57
+ * filling input fields
58
+ * submitting forms
59
+ * automatically passing the document to the scraping
60
+ * both files and http:// support
61
+ * automatic crawling
62
+
63
+ * Scraping:
64
+ * Fairly powerful DSL to describe the full scraping process
65
+ * Automatic navigation with WWW::Mechanize
66
+ * Automatic scraping through examples with Hpricot
67
+ * automatic recursive scraping through the next button
34
68
 
data/README CHANGED
@@ -1,70 +1,99 @@
1
- ============================================
2
- scRUBYt! - Hpricot and Mechanize on steroids
3
- ============================================
1
+ = scRUBYt! - Hpricot and Mechanize on steroids
4
2
 
5
- A simple to learn and use, yet very powerful web extraction framework written in Ruby. Navigate through the Web, Extract, query, transform and save relevant data from the Web page of interest by the concise and easy to use DSL provided by scRUBYt!.
3
+ A simple to learn and use, yet very powerful web extraction framework written in Ruby. Navigate through the Web, then extract, query, transform and save relevant data from the Web page of your interest with the concise and easy-to-use DSL.
6
4
 
7
- =============================================
8
- Why do we need one more web-scraping toolkit?
9
- =============================================
5
+ Do you think that Mechanize and Hpricot are powerful libraries? You're right, they are, indeed - hats off to their authors: without these libs scRUBYt! could not exist now! I have been wondering whether their functionality could be enhanced still further - so I took these two powerful ingredients, threw in a handful of smart heuristics, wrapped them in a chunky DSL coating and sprinkled the whole thing with lots of convention over configuration(tm) goodies - and ... enter scRUBYt! - decide for yourself.
10
6
 
11
- After all, we have HPricot, and Rubyful soup, and Mechanize, and scrAPI, and ARIEL and ...
12
- Well, because scRUBYt! is different. It has entirely different philosophy, underlying techniques, use cases - shortly it should be used in different situations with different requirements than the previosly mentioned ones.
7
+ = Wait... why do we need one more web-scraping toolkit?
8
+
9
+ After all, we have HPricot, and Rubyful-soup, and Mechanize, and scrAPI, and ARIEL and scrapes and ...
10
+ Well, because scRUBYt! is different. It has an entirely different philosophy, underlying techniques, theoretical background, use cases, todo list, real-life scenarios etc. - in short, it should be used in different situations with different requirements than the previously mentioned ones.
13
11
 
14
12
  If you need something quick and/or would like to have maximal control over the scraping process, I recommend Hpricot. Mechanize shines when it comes to interaction with Web pages. Since scRUBYt! is operating based on XPaths, sometimes you will choose scrAPI because CSS selectors will better suit your needs. The list goes on and on, boiling down to the good old mantra: use the right tool for the right job!
15
13
 
16
- I hope there will be times when you will want to experiment with Pandora's box and reach after the power of scRUBYt! :-)
14
+ I hope there will also be times when you want to experiment with Pandora's box and reach for the power of scRUBYt! :-)
15
+
16
+ = Sounds fine - show me an example!
17
+
18
+ Let's apply the "show don't tell" principle. Okay, here we go:
17
19
 
18
- ========================================
19
- OK, OK, I believe you, what should I do?
20
- ========================================
20
+ <tt>ebay_data = Scrubyt::Extractor.define do</tt>
21
21
 
22
- Useful adresses
22
+ fetch 'http://www.ebay.com/'
23
+ fill_textfield 'satitle', 'ipod'
24
+ submit
25
+ click_link 'Apple iPod'
26
+
27
+ record do
28
+ item_name 'APPLE NEW IPOD MINI 6GB MP3 PLAYER SILVER'
29
+ price '$71.99'
30
+ end
31
+ next_page 'Next >', :limit => 5
23
32
 
24
- scrubyt.rubyforge.org
25
- rubyrailways.com (some theory)
26
- future: public extractor repository
33
+ <tt>end</tt>
27
34
 
28
- ==============
29
- How to install
30
- ==============
35
+ output:
31
36
 
32
- Dependencies:
37
+ <tt><root></tt>
38
+ <record>
39
+ <item_name>APPLE IPOD NANO 4GB - PINK - MP3 PLAYER</item_name>
40
+ <price>$149.95</price>
41
+ </record>
42
+ <record>
43
+ <item_name>APPLE IPOD 30GB BLACK VIDEO/PHOTO/MP3 PLAYER</item_name>
44
+ <price>$172.50</price>
45
+ </record>
46
+ <record>
47
+ <item_name>NEW APPLE IPOD NANO 4GB PINK MP3 PLAYER</item_name>
48
+ <price>$171.06</price>
49
+ </record>
50
+ <!-- another 200+ results -->
51
+ <tt></root></tt>
33
52
 
34
- Ruby 1.8.4 (or higher)
35
- Hpricot 0.4.84
36
- Mechanize 0.6.3 (or higher)
53
+ This was a relatively beginner-level example (scRUBYt! knows a lot more than this and there are much more complicated extractors than the above one) - yet it did a lot of things automagically. First of all,
54
+ it automatically loaded the page of interest (by going to ebay.com, automatically searching for ipods
55
+ and narrowing down the results by clicking on 'Apple iPod'), then it extracted *all* the items that
56
+ looked like the specified example (which, by the way, also describes what the output structure should look like) - on the first 5 result pages. Not so bad for about 10 lines of code, eh?
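+
+ Besides the XML output, the extracted data can also be reached from Ruby with the indexed lookup style mentioned in the CHANGELOG - a rough sketch, reusing the pattern names from the example above (the exact return values may differ):
+
+   ebay_data[0].record[0].item_name[0]   # first result page, first record, its first item_name
+   ebay_data[0].record[0].price[0]       # ... and the corresponding price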
37
57
 
38
- I assume you have Ruby any Rubygems installed. To install Mechanize, just run
58
+ = OK, OK, I believe you, what should I do?
39
59
 
40
- sudo gem install mechanize
60
+ You can find everything you will need at these addresses (or if not, I doubt you will find it elsewhere...). See the next section about installation, and after installing be sure to check out these URLs:
41
61
 
42
- Hpricot (until 0.5 comes out) is a little bit tougher nut to crack, since you need a special version, not the latest stable.
62
+ * <a href='http://www.rubyrailways.com'>rubyrailways.com</a> - for some theory; if you would like to take a sneak peek at web scraping in general and/or you would like to understand what's going on under the hood, check out <a href='http://www.rubyrailways.com/data-extraction-for-web-20-screen-scraping-in-rubyrails'>this article about web-scraping</a>!
63
+ * <a href='http://scrubyt.org'>http://scrubyt.org</a> - your source of tutorials, howtos, news etc.
64
+ * <a href='http://scrubyt.rubyforge.org'>scrubyt.rubyforge.org</a> - for an up-to-date, online Rdoc
65
+ * <a href='http://projects.rubyforge.org/scrubyt'>projects.rubyforge.org/scrubyt</a> - for developer info, including open and closed bugs, files etc.
66
+ * projects.rubyforge.org/scrubyt/files... - a fair amount (and still growing with every release) of examples, showcasing the features of scRUBYt!
67
+ * planned: public extractor repository - hopefully (after people realize how great this package is :-)) scRUBYt! will have a community, and people will upload their extractors for whatever reason
43
68
 
44
- After this, you are ready to install Hpricot 0.4.86 (if there is no 86, choose the next, e.g. 88 - ). Run
69
+ If you still can't find something here, drop a mail to the guys at scrubyt@/NO-SPAM/scrubyt.org!
45
70
 
46
- gem install hpricot --source code.whytheluckystiff.net
71
+ = How to install
47
72
 
48
- and choose the correct version (e.g. 0.4.86)
73
+ scRUBYt! requires these packages to be installed:
49
74
 
50
- To test whether everything is working, from the svn directory launch
75
+ * Ruby 1.8.4
76
+ * Hpricot 0.5
77
+ * Mechanize 0.6.3
51
78
 
52
- rake fulltest
79
+ I assume you have Ruby and RubyGems installed. To install WWW::Mechanize 0.6.3 or higher, just run
53
80
 
54
- You should see 0 errors...
81
+ <tt>sudo gem install mechanize</tt>
55
82
 
56
- =============================
57
- Additional installation notes
58
- =============================
83
+ Hpricot 0.5 is just hot off the frying pan - perfect timing, _why! - install it with
59
84
 
60
- [1]
61
- you will have to install ragel (dependency of HPricot) with something like
85
+ <tt>sudo gem install hpricot</tt>
62
86
 
63
- sudo apt-get install ragel
87
+ Once all the dependencies (Mechanize and Hpricot) are up and running, you can install scrubyt with
64
88
 
65
- depending on your distro (the above works for debian based stuff).
89
+ <tt>sudo gem install scrubyt</tt>
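+
+ To quickly check that the gem and its dependencies load - just a hypothetical sanity check, not an official step - you can run
+
+ <tt>ruby -e "require 'rubygems'; require 'scrubyt'; puts 'scRUBYt! loaded'"</tt>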
66
90
 
91
+ If you encounter any problems, drop a mail to the guys at scrubyt@/NO-SPAM/scrubyt.org!
67
92
 
93
+ = Author
68
94
 
95
+ Copyright (c) 2006 by Peter Szinek (peter@/NO-SPAM/rubyrailways.com)
69
96
 
97
+ = Copyright
70
98
 
99
+ This library is distributed under the GPL. Please see the LICENSE file.
data/Rakefile CHANGED
@@ -1,6 +1,7 @@
1
1
  require 'rake/rdoctask'
2
2
  require 'rake/testtask'
3
3
  require 'rake/gempackagetask'
4
+ require 'rake/packagetask'
4
5
 
5
6
  ###################################################
6
7
  # Dependencies
@@ -8,6 +9,8 @@ require 'rake/gempackagetask'
8
9
 
9
10
  task "default" => ["test"]
10
11
  task "fulltest" => ["test", "blackbox"]
12
+ task "generate_rdoc" => ["cleanup_readme"]
13
+ task "cleanup_readme" => ["rdoc"]
11
14
 
12
15
  ###################################################
13
16
  # Gem specification
@@ -15,13 +18,13 @@ task "fulltest" => ["test", "blackbox"]
15
18
 
16
19
  gem_spec = Gem::Specification.new do |s|
17
20
  s.name = 'scrubyt'
18
- s.version = '0.1.9'
21
+ s.version = '0.2.0'
19
22
  s.summary = 'A powerful Web-scraping framework'
20
23
  s.description = %{scRUBYt! is an easy to learn and use, yet powerful and effective web scraping framework. Its most interesting part is a Web-scraping DSL built on Hpricot and WWW::Mechanize, which allows you to navigate to the page of interest, then extract and query data records with a few lines of code. It is hard to describe scRUBYt! in a few sentences - you have to see it for yourself!}
21
24
  # Files containing Test::Unit test cases.
22
25
  s.test_files = FileList['test/unittests/**/*']
23
26
  # List of other files to be included.
24
- s.files = FileList['README', 'COPYING', 'CHANGELOG', 'Rakefile', 'lib/**/*.rb']
27
+ s.files = FileList['COPYING', 'README', 'CHANGELOG', 'Rakefile', 'lib/**/*.rb']
25
28
  s.author = 'Peter Szinek'
26
29
  s.email = 'peter@rubyrailways.com'
27
30
  s.homepage = 'http://www.scrubyt.org'
@@ -32,14 +35,14 @@ end
32
35
  # Tasks
33
36
  ###################################################
34
37
 
35
- Rake::RDocTask.new do |rdoc|
36
- files = ['lib/**/*.rb', 'README']
37
- rdoc.rdoc_files.add(files)
38
- rdoc.main = "README" # page to start on
39
- rdoc.title = "Scrubyt Documentation"
40
- rdoc.template = "resources/allison/allison.rb"
41
- rdoc.rdoc_dir = 'doc' # rdoc output folder
42
- rdoc.options << '--line-numbers' << '--inline-source'
38
+ Rake::RDocTask.new do |generate_rdoc|
39
+ files = ['lib/**/*.rb', 'README', 'CHANGELOG']
40
+ generate_rdoc.rdoc_files.add(files)
41
+ generate_rdoc.main = "README" # page to start on
42
+ generate_rdoc.title = "Scrubyt Documentation"
43
+ generate_rdoc.template = "resources/allison/allison.rb"
44
+ generate_rdoc.rdoc_dir = 'doc' # rdoc output folder
45
+ generate_rdoc.options << '--line-numbers' << '--inline-source'
43
46
  end
44
47
 
45
48
  Rake::TestTask.new do |test|
@@ -50,7 +53,35 @@ task "blackbox" do
50
53
  ruby "test/blackbox/run_blackbox_tests.rb"
51
54
  end
52
55
 
56
+ task "cleanup_readme" do
57
+ puts "Cleaning up README..."
58
+ readme_in = open('./doc/files/README.html')
59
+ content = readme_in.read
60
+ content.sub!('<h1 id="item_name">File: README</h1>','')
61
+ content.sub!('<h1>Description</h1>','')
62
+ readme_in.close
63
+ open('./doc/files/README.html', 'w') {|f| f.write(content)}
64
+ #OK, this is ugly as hell and as non-DRY as possible, but
65
+ #I don't have time to deal with it right now
66
+ puts "Cleaning up CHANGELOG..."
67
+ readme_in = open('./doc/files/CHANGELOG.html')
68
+ content = readme_in.read
69
+ content.sub!('<h1 id="item_name">File: CHANGELOG</h1>','')
70
+ content.sub!('<h1>Description</h1>','')
71
+ readme_in.close
72
+ open('./doc/files/CHANGELOG.html', 'w') {|f| f.write(content)}
73
+ end
74
+
75
+ task "generate_rdoc" do
76
+ end
77
+
53
78
  Rake::GemPackageTask.new(gem_spec) do |pkg|
54
79
  pkg.need_zip = false
55
80
  pkg.need_tar = false
56
- end
81
+ end
82
+
83
+ Rake::PackageTask.new('scrubyt-examples', '0.2.0') do |pkg|
84
+ pkg.need_zip = true
85
+ pkg.need_tar = true
86
+ pkg.package_files.include("examples/**/*")
87
+ end
@@ -1,5 +1,5 @@
1
1
  #require File.join(File.dirname(__FILE__), 'pattern.rb')
2
-
2
+
3
3
  module Scrubyt
4
4
  # =<tt>exporting previously defined extractors</tt>
5
5
  class Export
@@ -142,14 +142,14 @@ private
142
142
  @name_to_xpath_map = {}
143
143
  create_name_to_xpath_map(pattern)
144
144
  #Replace the examples which are quoted with " and '
145
- @name_to_xpath_map.each do |name, xpaths|
145
+ @name_to_xpath_map.each do |name, xpaths|
146
146
  replace_example_with_xpath(name, xpaths, %q{"})
147
147
  replace_example_with_xpath(name, xpaths, %q{'})
148
148
  end
149
149
  #Finally, add XPaths to pattern which had no example at the beginning (the XPath was
150
150
  #generated from the child patterns
151
151
  @name_to_xpath_map.each do |name, xpaths|
152
- xpaths.each do |xpath|
152
+ xpaths.reverse.each do |xpath|
153
153
  comma = @full_definition.scan(Regexp.new("P.#{name}(.+)$"))[0][0].sub('do'){}.strip == '' ? '' : ','
154
154
  if (@full_definition.scan(Regexp.new("P.#{name}(.+)$"))[0][0]).include?('{')
155
155
  @full_definition.sub!("P.#{name}") {"P.#{name}('#{xpath}')"}
@@ -180,7 +180,7 @@ private
180
180
 
181
181
  def self.replace_example_with_xpath(name, xpaths, left_delimiter, right_delimiter=left_delimiter)
182
182
  return if name=='root'
183
- full_line = @full_definition.scan(Regexp.new("P.#{name}(.+)$"))[0][0]
183
+ full_line = @full_definition.scan(/P.#{name}\W(.+)$/)[0][0]
184
184
  examples = full_line.split(",")
185
185
  examples.reject! {|exa| exa.strip!; exa[0..0] != %q{"} && exa[0..0] != %q{'} }
186
186
  all_xpaths = ""
@@ -46,6 +46,7 @@ module Scrubyt
46
46
  end
47
47
  ensure_all_postconditions(root_pattern)
48
48
  PostProcessor.remove_multiple_filter_duplicates(root_pattern)
49
+ PostProcessor.report_if_no_results(root_pattern)
49
50
  #Return the root pattern
50
51
  root_pattern
51
52
  end
@@ -121,21 +122,28 @@ module Scrubyt
121
122
  @@current_doc_url = ((@@base_dir + doc_url) if doc_url !~ /#{@@base_dir}/)
122
123
  end
123
124
 
124
- if @@host_name != nil
125
+ if @@host_name != nil
125
126
  if doc_url !~ /#{@@host_name}/
126
- @@current_doc_url = (@@host_name + doc_url)
127
- @@current_doc_url.gsub!(/([^:])\/\//) {"#{$1}/"}
127
+ @@current_doc_url = (@@host_name + doc_url)
128
+ #remove duplicate parts, like /blogs/en/blogs/en
129
+ @@current_doc_url = @@current_doc_url.split('/').uniq.reject{|x| x == ""}.join('/')
130
+ @@current_doc_url.sub!('http:/', 'http://')
128
131
  end
129
132
  end
130
133
  puts "[ACTION] fetching document: #{@@current_doc_url}"
131
- @@mechanize_doc = @@agent.get(@@current_doc_url) if @@current_doc_protocol == :http
134
+ if @@current_doc_protocol == :http
135
+
136
+ @@mechanize_doc = @@agent.get(@@current_doc_url)
137
+ @@host_name = 'http://' + @@mechanize_doc.uri.to_s.scan(/http:\/\/(.+\/)+/).flatten[0]
138
+ @@host_name = doc_url if @@host_name == nil
139
+ end
132
140
  else
133
141
  @@current_doc_url = doc_url
134
142
  @@mechanize_doc = mechanize_doc
135
143
  @@host_name = 'http://' + @@mechanize_doc.uri.to_s.scan(/http:\/\/(.+\/)+/).flatten[0]
136
144
  @@host_name = doc_url if @@host_name == nil
137
145
  end
138
- @@hpricot_doc = Hpricot(open(@@current_doc_url))#.to_original_html
146
+ @@hpricot_doc = Hpricot(open(@@current_doc_url))
139
147
  end
140
148
 
141
149
  ##
@@ -150,23 +158,56 @@ module Scrubyt
150
158
  def self.fill_textfield(textfield_name, query_string)
151
159
  puts "[ACTION] typing #{query_string} into the textfield named '#{textfield_name}'"
152
160
  textfield = (@@hpricot_doc/"input[@name=#{textfield_name}]").map()[0]
153
- formname = Scrubyt::XPathUtils.traverse_up_until_name(textfield, 'form').attributes['name']
154
- @@current_form = @@mechanize_doc.forms.with.name(formname).first
161
+ form_tag = Scrubyt::XPathUtils.traverse_up_until_name(textfield, 'form')
162
+ #Refactor this code, it's a total mess
163
+ formname = form_tag.attributes['name']
164
+ if formname == nil
165
+ id_string = form_tag.attributes['id']
166
+ if id_string == nil
167
+ action_string = form_tag.attributes['action']
168
+ if action_string == nil
169
+ #If even this fails, do it with a button
170
+ else
171
+ puts "Finding from action"
172
+ puts action_string
173
+ find_form_with_attribute('action', action_string)
174
+ end
175
+ else
176
+ puts "Finding from id"
177
+ find_form_with_attribute('id', id_string)
178
+ end
179
+ else
180
+ puts "Finding from name"
181
+ @@current_form = @@mechanize_doc.forms.with.name(formname).first
182
+ end
183
+
155
184
  eval("@@current_form['#{textfield_name}'] = '#{query_string}'")
156
185
  end
157
186
 
187
+ def self.find_form_with_attribute(attr, expected_value)
188
+ puts "attr: #{attr}"
189
+ i = 0
190
+ loop do
191
+ @@current_form = @@mechanize_doc.forms[i]
192
+ print "current a: "
193
+ puts @@current_form.form_node.attributes[attr]
194
+ return nil if @@current_form == nil
195
+ break if @@current_form.form_node.attributes[attr] == expected_value
196
+ i+= 1
197
+ end
198
+ end
199
+
158
200
  #Submit the last form;
159
201
  def self.submit
160
202
  puts '[ACTION] submitting form...'
161
203
  result_page = @@agent.submit(@@current_form)#, @@current_form.buttons.first)
162
204
  @@current_doc_url = result_page.uri.to_s
205
+ puts "[ACTION] fetched #{@@current_doc_url}"
163
206
  fetch(@@current_doc_url, result_page)
164
207
  end
165
208
 
166
209
  def self.click_link(link_text)
167
210
  puts "[ACTION] clicking link: #{link_text}"
168
- #puts /^#{Regexp.escape(link_text)}$/
169
- #p /^#{Regexp.escape(link_text)}$/
170
211
  link = @@mechanize_doc.links.text(/^#{Regexp.escape(link_text)}$/)
171
212
  result_page = @@agent.click(link)
172
213
  @@current_doc_url = result_page.uri.to_s
@@ -53,8 +53,10 @@ module Scrubyt
53
53
  @parent_pattern = parent_pattern
54
54
  #If the example type is not explicitly defined in the pattern definition,
55
55
  #try to determine it automatically from the example
56
- @example_type = (args[0] == nil ? Filter.determine_example_type(example) :
57
- args[0][:example_type])
56
+ #@example_type = (args[0] == nil ? Filter.determine_example_type(example) :
57
+ # args[0][:example_type])
58
+ #TODOOOOO correct this!
59
+ @example_type = Filter.determine_example_type(example)
58
60
  @sink = [] #output of a filter
59
61
  @source = [] #input of a filter
60
62
  @example = example
@@ -67,14 +69,13 @@ module Scrubyt
67
69
  #Evaluate this filter. This method should not be called directly - as the pattern hierarchy
68
70
  #is evaluated, every pattern evaluates its filters and then they call this method
69
71
  def evaluate(source)
70
- @parent_pattern.root_pattern.already_evaluated_sources ||= {}
71
72
  case @parent_pattern.type
72
73
  when Scrubyt::Pattern::PATTERN_TYPE_TREE
73
74
  result = source/@xpath
74
75
  result.class == Hpricot::Elements ? result.map : [result]
75
76
  when Scrubyt::Pattern::PATTERN_TYPE_ATTRIBUTE
76
77
  [source.attributes[@example]]
77
- when Scrubyt::Pattern::PATTERN_TYPE_REGEXP
78
+ when Scrubyt::Pattern::PATTERN_TYPE_REGEXP
78
79
  source.inner_text.scan(@example).flatten
79
80
  end
80
81
  end
@@ -87,10 +88,9 @@ module Scrubyt
87
88
  when EXAMPLE_TYPE_XPATH
88
89
  @xpath = @example
89
90
  when EXAMPLE_TYPE_STRING
90
- @temp_sink = XPathUtils.find_node_from_text( @parent_pattern.root_pattern.filters[0].source[0], @example )
91
+ @temp_sink = XPathUtils.find_node_from_text( @parent_pattern.root_pattern.filters[0].source[0], @example, false )
91
92
  @xpath = @parent_pattern.generalize ? XPathUtils.generate_XPath(@temp_sink, nil, false) :
92
93
  XPathUtils.generate_XPath(@temp_sink, nil, true)
93
- puts @xpath
94
94
  when EXAMPLE_TYPE_CHILDREN
95
95
  current_example_index = 0
96
96
  loop do
@@ -148,7 +148,7 @@ private
148
148
  EXAMPLE_TYPE_CHILDREN
149
149
  when /\.(jpg|png|gif|jpeg)$/
150
150
  EXAMPLE_TYPE_IMAGE
151
- when /^\/{1,2}[a-z]+(\[\d+\])?(\/{1,2}[a-z]+(\[\d+\])?)*$/
151
+ when /^\/{1,2}[a-z]+\d?(\[\d+\])?(\/{1,2}[a-z]+\d?(\[\d+\])?)*$/
152
152
  (example.include? '/' || example.include?('[')) ? EXAMPLE_TYPE_XPATH : EXAMPLE_TYPE_STRING
153
153
  else
154
154
  EXAMPLE_TYPE_STRING
@@ -43,7 +43,7 @@ module Scrubyt
43
43
  attr_accessor :name, :output_type, :generalize, :children, :filters, :parent,
44
44
  :last_result, :result, :root_pattern, :example, :block_count,
45
45
  :next_page, :limit, :extractor, :extracted_docs,
46
- :examples, :parent_of_leaf
46
+ :examples, :parent_of_leaf, :document_index
47
47
  attr_reader :type, :generalize_set, :next_page_url
48
48
 
49
49
  def initialize (name, *args)
@@ -56,6 +56,7 @@ module Scrubyt
56
56
  @@instance_count = Hash.new(0)
57
57
  @evaluated_examples = []
58
58
  @next_page = nil
59
+ @document_index = 0
59
60
  if @examples == nil
60
61
  filters << Scrubyt::Filter.new(self) #create a default filter
61
62
  else
@@ -74,6 +75,7 @@ module Scrubyt
74
75
  #Grab any examples that are defined!
75
76
  look_for_examples(args)
76
77
  args.each do |arg|
78
+ next if !arg.is_a? Hash
77
79
  arg.each do |k,v|
78
80
  #Set only the setable fields
79
81
  if SETTABLE_FIELDS.include? k.to_s
@@ -92,7 +94,6 @@ module Scrubyt
92
94
  #default settings - the user can override them, but if she did not do so,
93
95
  #we will setup some meaningful defaults
94
96
  @type ||= PATTERN_TYPE_TREE
95
- @type = PATTERN_TYPE_REGEXP if @example.instance_of? Regexp
96
97
  @output_type ||= OUTPUT_TYPE_MODEL
97
98
  #don't generalize by default
98
99
  @generalize ||= false
@@ -127,11 +128,20 @@ module Scrubyt
127
128
  # camera_data.item[1].item_name[0]
128
129
  #
129
130
  #possible. The method Pattern::method missing handles the 'item', 'item_name' etc.
130
- #parts, while the indexing ([1], [0]) is handled by this function
131
+ #parts, while the indexing ([1], [0]) is handled by this function.
132
+ #If you would like to select a different document than the first one (which is
133
+ #the default), you should use the form:
134
+ #
135
+ # camera_data[1].item[1].item_name[0]
131
136
  def [](index)
132
- return nil if (@result.lookup(@parent.last_result)) == nil
133
- @last_result = @result.lookup(@parent.last_result)[index]
134
- self
137
+ if @name == 'root'
138
+ @root_pattern.document_index = index
139
+ else
140
+ @parent.last_result = @parent.last_result[@root_pattern.document_index] if @parent.last_result.is_a? Array
141
+ return nil if (@result.lookup(@parent.last_result)) == nil
142
+ @last_result = @result.lookup(@parent.last_result)[index]
143
+ end
144
+ self
135
145
  end
136
146
 
137
147
  ##
@@ -217,9 +227,6 @@ module Scrubyt
217
227
  sorted_result = r.reject {|e| !result.keys.include? e}
218
228
  add_result(filter, source, sorted_result)
219
229
  else
220
- if ( (xe = @result.lookup(source)) != nil )
221
- #puts "ha"; p xe
222
- end
223
230
  add_result(filter, source, r)
224
231
  end#end of constraint check
225
232
  end#end of source iteration
@@ -246,6 +253,7 @@ private
246
253
  end
247
254
  end
248
255
  elsif (args[0].is_a? Regexp)
256
+ @examples = args.select {|e| e.is_a? Regexp}
249
257
  #Check if all the String parameters are really the first
250
258
  #parameters
251
259
  args[0..@examples.size].each do |example|
@@ -253,6 +261,7 @@ private
253
261
  puts 'FATAL: Problem with example specification'
254
262
  end
255
263
  end
264
+ @type = PATTERN_TYPE_REGEXP
256
265
  end
257
266
  end
258
267
 
@@ -299,7 +308,7 @@ private
299
308
  end
300
309
 
301
310
  def generate_next_page_link(example)
302
- node = XPathUtils.find_node_from_text(@root_pattern.filters[0].source[0], example)
311
+ node = XPathUtils.find_node_from_text(@root_pattern.filters[0].source[0], example, true)
303
312
  return nil if node == nil
304
313
  node.attributes['href'].gsub('&amp;') {'&'}
305
314
  end # end of method generate_next_page_link
@@ -18,6 +18,21 @@ module Scrubyt
18
18
  remove_multiple_filter_duplicates_intern(pattern) if pattern.parent_of_leaf
19
19
  pattern.children.each {|child| remove_multiple_filter_duplicates(child)}
20
20
  end
21
+
22
+ ##
23
+ #Issue an error report if the document did not extract anything.
24
+ #Probably this is because the structure of the page changed or
25
+ #because of some rather nasty bug - in any case, something wrong
26
+ #is going on, and we need to inform the user about this!
27
+ def self.report_if_no_results(root_pattern)
28
+ results_found = false
29
+ root_pattern.children.each {|child| return if (child.result.childmap.size > 0)}
30
+ puts
31
+ puts "!!!!!! WARNING: The extractor did not find any result instances"
32
+ puts "Most probably this is wrong. Check your extractor and if you are"
33
+ puts "sure it should work, report a bug!"
34
+ puts
35
+ end
21
36
 
22
37
  private
23
38
  def self.remove_multiple_filter_duplicates_intern(pattern)
@@ -1,4 +1,5 @@
1
1
  require 'rexml/document'
2
+ require 'rexml/xpath'
2
3
 
3
4
  module Scrubyt
4
5
  ##
@@ -16,7 +17,7 @@ module Scrubyt
16
17
  to_xml_recursive(pattern, root)
17
18
  end
18
19
  remove_empty_leaves(doc)
19
- doc
20
+ @@last_doc = doc
20
21
  end
21
22
 
22
23
  def self.remove_empty_leaves(node)
@@ -80,11 +81,22 @@ private
80
81
  end
81
82
  end
82
83
 
83
- def self.print_statistics_recursive(pattern, depth)
84
+ def self.print_old_sta(pattern, depth)
84
85
  puts((' ' * "#{depth}".to_i) + "#{pattern.name} extracted #{pattern.get_instance_count[pattern.name]} instances.") if pattern.name != 'root'
85
86
  pattern.children.each do |child|
86
87
  print_statistics_recursive(child, depth + 4)
88
+ end
89
+ end
90
+
91
+ def self.print_statistics_recursive(pattern, depth)
92
+ if pattern.name != 'root'
93
+ count = REXML::XPath.match(@@last_doc, "//#{pattern.name}").size
94
+ puts((' ' * "#{depth}".to_i) + "#{pattern.name} extracted #{count} instances.")
87
95
  end
96
+
97
+ pattern.children.each do |child|
98
+ print_statistics_recursive(child, depth + 4)
99
+ end
88
100
  end#end of method print_statistics_recursive
89
101
  end #end of class ResultDumper
90
102
  end #end of module Scrubyt
@@ -21,21 +21,23 @@ module Scrubyt
21
21
  # <a>Bon <b>nuit</b>, monsieur!</a>
22
22
  #
23
23
  #In this case, <a>'s text is considered to be "Bon nuit, monsieur"
24
- def self.find_node_from_text(doc, text)
24
+ def self.find_node_from_text(doc, text, next_link)
25
25
  @node = nil
26
26
  @found = false
27
27
  self.traverse_for_full_text(doc,text)
28
28
  self.lowest_possible_node_with_text(@node, text) if @node != nil
29
- #$Logger.warn("Node for example #{text} Not found!") if (@found == false)
30
29
  if (@found == false)
31
30
  #Fallback to per node text lookup
32
31
  self.traverse_for_node_text(doc,text)
33
- if (@found == false)
34
- puts "FATAL: Node for example #{text} Not found!"
35
- puts "Please make sure your specified the example properly"
32
+ if (@found == false)
33
+ return nil if next_link
34
+ puts "!" * 65
35
+ puts "!!!!!! FATAL: Node for example #{text} Not found! !!!!!!"
36
+ puts "!!!!!! Please make sure you specified the example properly !!!!!!"
37
+ puts "!" * 65
38
+ exit
36
39
  end
37
40
  end
38
- p @node
39
41
  @node
40
42
  end
41
43
 
@@ -135,7 +137,7 @@ module Scrubyt
135
137
  #_index_ - there might be more images with the same src on the page -
136
138
  #most typically the user will need the 0th - but if this is not the
137
139
  #case, there is the possibility to override this
138
- def self.find_image(doc, example, index=1)
140
+ def self.find_image(doc, example, index=0)
139
141
  (doc/"img[@src='#{example}']")[index]
140
142
  end
141
143
 
@@ -22,7 +22,15 @@ class FilterTest < Test::Unit::TestCase
22
22
  Scrubyt::Filter::EXAMPLE_TYPE_IMAGE)
23
23
  #Test XPaths
24
24
  assert_equal(Scrubyt::Filter.determine_example_type('/p/img'),
25
+ Scrubyt::Filter::EXAMPLE_TYPE_XPATH)
26
+ assert_equal(Scrubyt::Filter.determine_example_type('/p/h3'),
27
+ Scrubyt::Filter::EXAMPLE_TYPE_XPATH)
28
+ assert_equal(Scrubyt::Filter.determine_example_type('/p/h3/a/h2'),
29
+ Scrubyt::Filter::EXAMPLE_TYPE_XPATH)
30
+ assert_equal(Scrubyt::Filter.determine_example_type('/h2'),
25
31
  Scrubyt::Filter::EXAMPLE_TYPE_XPATH)
32
+ assert_equal(Scrubyt::Filter.determine_example_type('/h1/h3'),
33
+ Scrubyt::Filter::EXAMPLE_TYPE_XPATH)
26
34
  assert_equal(Scrubyt::Filter.determine_example_type('/p'),
27
35
  Scrubyt::Filter::EXAMPLE_TYPE_XPATH)
28
36
  assert_equal(Scrubyt::Filter.determine_example_type('//p'),
@@ -55,14 +55,14 @@ class XPathUtilsTest < Test::Unit::TestCase
55
55
  end
56
56
 
57
57
  def test_find_node_from_text
58
- elem = Scrubyt::XPathUtils.find_node_from_text(@doc1,"fff")
58
+ elem = Scrubyt::XPathUtils.find_node_from_text(@doc1,"fff", false)
59
59
  assert_instance_of(Hpricot::Elem, elem)
60
60
  assert_equal(elem, @f)
61
61
 
62
- elem = Scrubyt::XPathUtils.find_node_from_text(@doc1,"dddd")
62
+ elem = Scrubyt::XPathUtils.find_node_from_text(@doc1,"dddd", false)
63
63
  assert_equal(elem, @d)
64
64
 
65
- elem = Scrubyt::XPathUtils.find_node_from_text(@doc1,"rrr")
65
+ elem = Scrubyt::XPathUtils.find_node_from_text(@doc1,"rrr", false)
66
66
  assert_equal(elem, @r)
67
67
 
68
68
  end
metadata CHANGED
@@ -3,8 +3,8 @@ rubygems_version: 0.9.0
3
3
  specification_version: 1
4
4
  name: scrubyt
5
5
  version: !ruby/object:Gem::Version
6
- version: 0.1.9
7
- date: 2007-01-28 00:00:00 +01:00
6
+ version: 0.2.0
7
+ date: 2007-02-04 00:00:00 +01:00
8
8
  summary: A powerful Web-scraping framework
9
9
  require_paths:
10
10
  - lib
@@ -29,29 +29,29 @@ post_install_message:
29
29
  authors:
30
30
  - Peter Szinek
31
31
  files:
32
- - README
33
32
  - COPYING
33
+ - README
34
34
  - CHANGELOG
35
35
  - Rakefile
36
36
  - lib/scrubyt.rb
37
- - lib/scrubyt/constraint_adder.rb
38
37
  - lib/scrubyt/constraint.rb
39
- - lib/scrubyt/result_dumper.rb
40
- - lib/scrubyt/export.rb
41
- - lib/scrubyt/extractor.rb
42
- - lib/scrubyt/filter.rb
43
38
  - lib/scrubyt/pattern.rb
44
39
  - lib/scrubyt/result.rb
40
+ - lib/scrubyt/export.rb
41
+ - lib/scrubyt/constraint_adder.rb
45
42
  - lib/scrubyt/post_processor.rb
43
+ - lib/scrubyt/filter.rb
46
44
  - lib/scrubyt/xpathutils.rb
45
+ - lib/scrubyt/result_dumper.rb
46
+ - lib/scrubyt/extractor.rb
47
47
  test_files:
48
48
  - test/unittests/input
49
+ - test/unittests/constraint_test.rb
49
50
  - test/unittests/filter_test.rb
50
- - test/unittests/extractor_test.rb
51
51
  - test/unittests/xpathutils_test.rb
52
- - test/unittests/constraint_test.rb
53
- - test/unittests/input/constraint_test.html
52
+ - test/unittests/extractor_test.rb
54
53
  - test/unittests/input/test.html
54
+ - test/unittests/input/constraint_test.html
55
55
  rdoc_options: []
56
56
 
57
57
  extra_rdoc_files: []