scrubyt 0.2.3 → 0.2.6

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (22)

data/CHANGELOG +30 -0
data/Rakefile +2 -2
data/lib/scrubyt.rb +5 -0
data/lib/scrubyt/core/navigation/fetch_action.rb +13 -2
data/lib/scrubyt/core/navigation/navigation_actions.rb +4 -0
data/lib/scrubyt/core/scraping/compound_example.rb +30 -0
data/lib/scrubyt/core/scraping/filter.rb +35 -11
data/lib/scrubyt/core/scraping/pattern.rb +29 -22
data/lib/scrubyt/core/scraping/result_indexer.rb +2 -0
data/lib/scrubyt/core/shared/evaluation_context.rb +44 -22
data/lib/scrubyt/core/shared/extractor.rb +111 -15
data/lib/scrubyt/core/shared/u_r_i_builder.rb +67 -0
data/lib/scrubyt/output/export.rb +69 -22
data/lib/scrubyt/output/result.rb +1 -0
data/lib/scrubyt/output/result_dumper.rb +26 -7
data/lib/scrubyt/utils/compound_example_lookup.rb +50 -0
data/lib/scrubyt/utils/shared_utils.rb +45 -0
data/lib/scrubyt/utils/simple_example_lookup.rb +23 -0
data/lib/scrubyt/utils/xpathutils.rb +43 -92
data/test/unittests/simple_example_lookup_test.rb +68 -0
data/test/unittests/xpathutils_test.rb +0 -13
metadata +9 -3

data/CHANGELOG CHANGED

@@ -1,5 +1,35 @@
 = scRUBYt! Changelog
+== 0.2.5
+=== 22nd March, 2007
+The mission of this release was to add even more powerful features, like crawling to detail pages or compound example specification, as well as fixing the most frequently popping-up bugs. Scraping of concrete sites is more and more frequently the cause for new features and bugfixes, which in my opinion means that the framework is beginning to make sense: from a shiny toy which looks cool and everybody wants to play with, it is moving towards a tool which you reach for if you seriously want to scrape a site.
+The new stuff in this release is 99% scraping related - if you are looking for new features in the navigation part, probably the next version will be for you, where I will concentrate more on adding new widgets and possibilities to the navigation process. FireWatir integration is very close, too - perhaps already the next release will contain FireWatir, or in the worst case the one after that.
+=<tt>changes:</tt>
+* [NEW] Automatically crawling to and extracting from detail pages
+* [NEW] Compound example specification: So far the example of a pattern had to be a string.
+        Now it can be a hash as well, like {:contains => /\d\d-\d/, :begins_with => 'Telephone'}
+* [NEW] More sophisticated example specification: Possible to use regexp as well, and need not
+        (but still possible of course) to specify the whole content of the node - nodes that
+        contain the string/match the regexp will be returned, too
+* [NEW] Possibility to force writing text in case of non-leaf nodes
+* [NEW] Crawling to the next page now possible via image links as well
+* [NEW] Possibility to define examples for any pattern (before it did not make sense for ancestors)
+* [NEW] Implementation of crawling to the next page with different methods
+* [NEW] Heuristics: if something ends with _url, it is a shortcut for:
+        some_url 'href', :type => :attribute
+* [FIX] Crawling to the next page (the broken google example): if the next
+        link text is not an <a>, traverse down until the <a> is found; if it is
+        still not found, traverse up until it is found
+* [FIX] Crawling to next pages does not break if the next link is greyed out
+        (or is otherwise present but has no href attribute). Credit: sorry, I could not find it in the comments :(
+* [FIX] DRY-ed next link lookup - it should be much more robust now as it uses the 'standard' example lookup
+* [NEW] Correct exporting of detail page extractors
+* [NEW] Added more powerful XPath regexp (Credit: Karol Hosiawa)
+* [NEW] New examples for the new features
+* [FIX] Tons of bugfixes, new blackbox and unit tests, refactoring and stabilization
 == 0.2.3
 === 20th February, 2007

data/Rakefile CHANGED

@@ -18,7 +18,7 @@ task "cleanup_readme" => ["rdoc"]
 gem_spec = Gem::Specification.new do |s|
   s.name = 'scrubyt'
-  s.version = '0.2.3'
+  s.version = '0.2.6'
   s.summary = 'A powerful Web-scraping framework'
   s.description = %{scRUBYt! is an easy to learn and use, yet powerful and effective web scraping framework. It's most interesting part is a Web-scraping DSL built on HPricot and WWW::Mechanize, which allows to navigate to the page of interest, then extract and query data records with a few lines of code. It is hard to describe scRUBYt! in a few sentences - you have to see it for yourself!}
   # Files containing Test::Unit test cases.
@@ -82,7 +82,7 @@ Rake::GemPackageTask.new(gem_spec) do |pkg|
   pkg.need_tar = false
 end
-Rake::PackageTask.new('scrubyt-examples', '0.2.3') do |pkg|
+Rake::PackageTask.new('scrubyt-examples', '0.2.6') do |pkg|
   pkg.need_zip = true
   pkg.need_tar = true
   pkg.package_files.include("examples/**/*")

data/lib/scrubyt.rb CHANGED

@@ -2,6 +2,7 @@ require 'scrubyt/core/scraping/constraint_adder.rb'
 require 'scrubyt/core/scraping/constraint.rb'
 require 'scrubyt/core/scraping/result_indexer.rb'
 require 'scrubyt/core/scraping/pre_filter_document.rb'
+require 'scrubyt/core/scraping/compound_example.rb'
 require 'scrubyt/output/export.rb'
 require 'scrubyt/core/shared/extractor.rb'
 require 'scrubyt/core/scraping/filter.rb'
@@ -13,3 +14,7 @@ require 'scrubyt/output/post_processor.rb'
 require 'scrubyt/core/navigation/navigation_actions.rb'
 require 'scrubyt/core/navigation/fetch_action.rb'
 require 'scrubyt/core/shared/evaluation_context.rb'
+require 'scrubyt/core/shared/u_r_i_builder.rb'
+require 'scrubyt/utils/shared_utils.rb'
+require 'scrubyt/utils/simple_example_lookup.rb'
+require 'scrubyt/utils/compound_example_lookup.rb'

data/lib/scrubyt/core/navigation/fetch_action.rb CHANGED

@@ -85,7 +85,15 @@ module Scrubyt
     def self.get_hpricot_doc
       @@hpricot_doc
-    end
+    end
+    def self.get_host_name
+      @@host_name
+    end
+    def self.restore_host_name
+      @@host_name = @@original_host_name
+    end
 private
     def self.determine_protocol
       old_protocol = @@current_doc_protocol
@@ -134,6 +142,8 @@ private
     def self.handle_relative_url(doc_url)
       return if doc_url =~ /^http/
       if @@host_name != nil
+        #p doc_url
+        #p @@host_name
         if doc_url !~ /#{@@host_name}/
           @@current_doc_url = (@@host_name + doc_url)
           #remove duplicate parts, like /blogs/en/blogs/en
@@ -146,7 +156,8 @@ private
     def self.store_host_name(doc_url)
       @@host_name = 'http://' + @@mechanize_doc.uri.to_s.scan(/http:\/\/(.+\/)+/).flatten[0] if @@current_doc_protocol == 'http'
       @@host_name = 'https://' + @@mechanize_doc.uri.to_s.scan(/https:\/\/(.+\/)+/).flatten[0] if @@current_doc_protocol == 'https'
-      @@host_name = doc_url if @@host_name == nil
+      @@host_name = doc_url if @@host_name == nil
+      @@original_host_name ||= @@host_name
     end #end of function store_host_name
   end #end of class FetchAction
 end #end of module Scrubyt
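
The `restore_host_name` / `@@original_host_name ||= @@host_name` pair added above appears to remember the first host name seen so that it can be put back later (for instance after visiting a detail page). A minimal standalone sketch of that pattern follows; the class name `HostNameMemo` and its method names are hypothetical and are not part of scRUBYt!.

```ruby
# Hypothetical stand-in for the host-name bookkeeping in FetchAction.
class HostNameMemo
  # Record the current host; only the very first value is kept as "original",
  # because ||= assigns only while the variable is still unset (or nil).
  def self.store(host_name)
    @@host_name = host_name
    @@original_host_name ||= @@host_name
  end

  # Put the first recorded host name back.
  def self.restore
    @@host_name = @@original_host_name
  end

  def self.host_name
    @@host_name
  end
end

HostNameMemo.store('http://example.com/')
HostNameMemo.store('http://detail.example.com/')
host_before_restore = HostNameMemo.host_name
HostNameMemo.restore
host_after_restore = HostNameMemo.host_name
```

After the two `store` calls the current host is the second one; `restore` brings back the first.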

data/lib/scrubyt/core/navigation/navigation_actions.rb CHANGED

@@ -75,6 +75,10 @@ module Scrubyt
     def self.get_hpricot_doc
       FetchAction.get_hpricot_doc
     end
+    def self.get_current_doc_url
+      FetchAction.get_current_doc_url
+    end
 private
     def self.lookup_form_for_tag(tag,widget_name,name_attribute,query_string)

data/lib/scrubyt/core/scraping/compound_example.rb ADDED

@@ -0,0 +1,30 @@
+module Scrubyt
+  ##
+  #=<tt>Represents a compound example</tt>
+  #
+  #There are two types of string examples in scRUBYt! right now:
+  #the simple example and the compound example. The simple example
+  #is specified by a string, and a compound example is specified with
+  #:contains, :begins_with and :ends_with descriptors - which can be
+  #both regexps or strings
+  class CompoundExample
+    DESCRIPTORS = [:contains, :begins_with, :ends_with]
+    attr_accessor :descriptor_hash
+    def initialize(descriptor_hash)
+      @descriptor_hash = descriptor_hash
+    end
+    ##
+    #Is the hash passed to this function a compound example descriptor hash?
+    #Need to decide this when parsing pattern parameters
+    def self.compound_example?(hash)
+      hash.each do |k,v|
+        return false if !DESCRIPTORS.include? k
+      end
+      true
+    end# end of method
+  end# #end of class CompoundExample
+end# end of module Scrubyt
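
To illustrate the descriptor check in `CompoundExample.compound_example?`, here is a minimal standalone sketch. The first method gives the same answer as the loop in the diff; the second, `matches_compound_example?`, is a hypothetical matcher written only for illustration and is not how scRUBYt! performs the lookup (that code lives in `compound_example_lookup.rb`, which is not shown in this section).

```ruby
# Mirrors CompoundExample::DESCRIPTORS from the diff above.
DESCRIPTORS = [:contains, :begins_with, :ends_with]

# True when every key of the hash is a known descriptor.
# Note that an empty hash also counts as a compound example here,
# just as it does with the each/return-false loop in the diff.
def compound_example?(hash)
  hash.keys.all? { |key| DESCRIPTORS.include?(key) }
end

# Hypothetical helper (an assumption, not scRUBYt! code): does a text
# satisfy every descriptor? Values may be strings or regexps.
def matches_compound_example?(text, descriptors)
  descriptors.all? do |kind, value|
    case kind
    when :contains
      value.is_a?(Regexp) ? !(text =~ value).nil? : text.include?(value)
    when :begins_with
      value.is_a?(Regexp) ? !(text =~ /\A#{value}/).nil? : text.start_with?(value)
    when :ends_with
      value.is_a?(Regexp) ? !(text =~ /#{value}\z/).nil? : text.end_with?(value)
    else
      false
    end
  end
end

example = { :contains => /\d\d-\d/, :begins_with => 'Telephone' }
is_compound     = compound_example?(example)
is_not_compound = compound_example?({ :generalize => true })
phone_matches   = matches_compound_example?('Telephone: 55-1234', example)
fax_matches     = matches_compound_example?('Fax: 55-1234', example)
```

The example hash is the one quoted in the CHANGELOG entry; a hash with any other key (such as `:generalize`) is treated as ordinary pattern options rather than as an example.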

data/lib/scrubyt/core/scraping/filter.rb CHANGED

@@ -45,6 +45,8 @@ module Scrubyt
     EXAMPLE_TYPE_CHILDREN = 3
     #Regexp example, like /\d+@*\d+[a-z]/
     EXAMPLE_TYPE_REGEXP = 4
+    #Compound example, like :contains => 'goodies'
+    EXAMPLE_TYPE_COMPOUND = 5
     attr_accessor :example_type, :parent_pattern, :temp_sink,
                   :constraints, :xpath, :regexp, :example, :source, :sink
@@ -62,7 +64,7 @@ module Scrubyt
       @example = example
       @xpath = nil #The xpath to evaluate this filter
       #temp sinks are used for the initial run when determining the XPaths for examples;
-      @temp_sink = nil
+      #@temp_sink = nil
       @constraints = [] #list of constraints
     end
@@ -75,38 +77,44 @@ module Scrubyt
           #puts "Evaluating #{@parent_pattern.name} with #{@xpath}"
           result.class == Hpricot::Elements ? result.map : [result]
         when Scrubyt::Pattern::PATTERN_TYPE_ATTRIBUTE
+          puts "Evaluating: #{@parent_pattern.name}"
           attribute_value = [source.attributes[@example]]
           return attribute_value if attribute_value[0]
           @@attribute_in_parent = nil
           Filter.traverse_up_until_attribute_found(source.parent, @example)
-          @@attribute_in_parent
+          @@attribute_in_parent
         when Scrubyt::Pattern::PATTERN_TYPE_REGEXP
           source.inner_text.scan(@example).flatten
+        when Scrubyt::Pattern::PATTERN_TYPE_DETAIL
+          #p @parent_pattern.name
+          result = @parent_pattern.evaluation_context.extractor.evaluate_subextractor(
+            XPathUtils.find_nearest_node_with_attribute(source, 'href').attributes['href'],
+            @parent_pattern)
       end
     end
     #For all the tree patterns, generate an XPath based on the given example
     #Also this method should not be called directly; It is automatically called for every tree
     #pattern directly after wrapper definition
-    def generate_XPath_for_example
+    def generate_XPath_for_example(next_page_example=false)
+      #puts "generating example for: #{@parent_pattern.name}"
+      #puts @example_type
       case @example_type
         when EXAMPLE_TYPE_XPATH
           @xpath = @example
         when EXAMPLE_TYPE_STRING
-          @temp_sink = XPathUtils.find_node_from_text( @parent_pattern.evaluation_context.root_pattern.filters[0].source[0],
-	                                                   @example,
-						                               false )
+          @temp_sink = SimpleExampleLookup.find_node_from_text( @parent_pattern.evaluation_context.root_pattern.filters[0].source[0],
+	                                                            @example,
+						                                        next_page_example )
           @xpath = @parent_pattern.generalize ? XPathUtils.generate_XPath(@temp_sink, nil, false) :
                                                  XPathUtils.generate_XPath(@temp_sink, nil, true)
-        when EXAMPLE_TYPE_CHILDREN
+        when EXAMPLE_TYPE_CHILDREN
           current_example_index = 0
           loop do
             all_child_temp_sinks = []
             @parent_pattern.children.each do |child_pattern|
               all_child_temp_sinks << child_pattern.filters[current_example_index].temp_sink
             end
             result = all_child_temp_sinks.pop
             if all_child_temp_sinks.empty?
               result = result.parent
@@ -122,7 +130,8 @@ module Scrubyt
             end
             @parent_pattern.filters[current_example_index].xpath = xpath
             @parent_pattern.filters[current_example_index].temp_sink = result
-            @parent_pattern.children.each do |child_pattern|
+            @parent_pattern.children.each do |child_pattern|
+                  next if child_pattern.type == Scrubyt::Pattern::PATTERN_TYPE_DETAIL
                   child_pattern.filters[current_example_index].xpath =
                     child_pattern.generalize ? XPathUtils.generate_generalized_relative_XPath(child_pattern.filters[current_example_index].temp_sink, result) :
                                                XPathUtils.generate_relative_XPath(child_pattern.filters[current_example_index].temp_sink, result)
@@ -137,8 +146,20 @@ module Scrubyt
         when EXAMPLE_TYPE_IMAGE
           @temp_sink = XPathUtils.find_image(@parent_pattern.evaluation_context.root_pattern.filters[0].source[0], @example)
           @xpath = XPathUtils.generate_XPath(@temp_sink, nil, false)
+        when EXAMPLE_TYPE_COMPOUND
+          @temp_sink = CompoundExampleLookup.find_node_from_compund_example( @parent_pattern.evaluation_context.root_pattern.filters[0].source[0],
+	                                                                         @example,
+						                                                     next_page_example )
+          @xpath = @parent_pattern.generalize ? XPathUtils.generate_XPath(@temp_sink, nil, false) :
+                                                 XPathUtils.generate_XPath(@temp_sink, nil, true)
       end
     end
+    def setup_relative_XPaths
+      return if !@parent_pattern.parent.parent
+      parent_filter = @parent_pattern.parent.filters[@parent_pattern.filters.index(self)]
+      @xpath = XPathUtils.generate_relative_XPath_from_XPaths(parent_filter.xpath, @xpath) if (@xpath =~ /^\/html/)
+    end
     #Dispatcher method to add constraints; of course, as with any method_missing, this method
     #should not be called directly
@@ -160,13 +181,16 @@ private
     def self.determine_example_type(example)
       if example.instance_of? Regexp
         EXAMPLE_TYPE_REGEXP
+      elsif example.instance_of? Hash
+        EXAMPLE_TYPE_COMPOUND
       else
         case example
           when nil
             EXAMPLE_TYPE_CHILDREN
           when /\.(jpg|png|gif|jpeg)$/
             EXAMPLE_TYPE_IMAGE
-          when /^\/{1,2}[a-z]+\d?(\[\d+\])?(\/{1,2}[a-z]+\d?(\[\d+\])?)*$/
+          when
+/^\/{1,2}[a-z]+[0-9]?(\[[0-9]+\])?(\/{1,2}[a-z()]+[0-9]?(\[[0-9]+\])?)*$/
             (example.include? '/' || example.include?('[')) ? EXAMPLE_TYPE_XPATH : EXAMPLE_TYPE_STRING
           else
             EXAMPLE_TYPE_STRING
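
The classification order in `determine_example_type` can be sketched on its own as follows. This is a simplified illustration that returns symbols instead of the `EXAMPLE_TYPE_*` constants; the method name `example_type` is hypothetical. The XPath-like regexp is copied from the new version in the diff; since it only matches strings that start with `/`, the sketch returns `:xpath` directly for that branch.

```ruby
# Regexp copied from the 0.2.6 side of the diff; the added "()" in the
# character class lets steps such as text() be recognised.
XPATH_LIKE = /^\/{1,2}[a-z]+[0-9]?(\[[0-9]+\])?(\/{1,2}[a-z()]+[0-9]?(\[[0-9]+\])?)*$/

# Hypothetical, simplified classifier following the same order of checks.
def example_type(example)
  return :regexp   if example.instance_of?(Regexp)
  return :compound if example.instance_of?(Hash)
  case example
  when nil                     then :children
  when /\.(jpg|png|gif|jpeg)$/ then :image
  when XPATH_LIKE              then :xpath
  else                              :string
  end
end

types = [
  example_type(/\d+/),
  example_type({ :contains => 'goodies' }),
  example_type(nil),
  example_type('logo.png'),
  example_type('//table/tr/td[2]'),
  example_type('//a/text()'),
  example_type('Canon EOS 350D')
]
```

The `'//a/text()'` case is one that the previous regexp (with `[a-z]+\d?` in the repeated group) would have classified as a plain string.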

data/lib/scrubyt/core/scraping/pattern.rb CHANGED

@@ -17,13 +17,15 @@ module Scrubyt
     #Type of the pattern;
     # a root pattern represents a (surprise!) root pattern
-    PATTERN_TYPE_ROOT = 0
+    PATTERN_TYPE_ROOT = 0x00
     # a tree pattern represents a HTML region
-    PATTERN_TYPE_TREE = 1
+    PATTERN_TYPE_TREE = 0x01
     # represents an attribute of the node extracted by the parent pattern
-    PATTERN_TYPE_ATTRIBUTE = 2
+    PATTERN_TYPE_ATTRIBUTE = 0x02
     # represents a pattern which filters its output with a regexp
-    PATTERN_TYPE_REGEXP = 3
+    PATTERN_TYPE_REGEXP = 0x03
+    # represents a pattern which crawls to the detail page and extracts information from there
+    PATTERN_TYPE_DETAIL = 0x04
     #The pattern can be either a model pattern (in this case it is
     #written to the output) or a temp pattern (in this case it is skipped)
@@ -31,20 +33,21 @@ module Scrubyt
     #is considered to be a model pattern
     #Model pattern are shown in the output
-    OUTPUT_TYPE_MODEL = 0
+    OUTPUT_TYPE_MODEL = 0x10
     #Temp patterns are skipped in the output (their ancestors are appended to the parent
     #of the pattrern which was skipped
-    OUTPUT_TYPE_TEMP = 1
+    OUTPUT_TYPE_TEMP = 0x11
     #These fields can be set upon wrapper creation - i.e. a field which is public but not contained here can be accessed
     #from outside, but not set as a result of wrapper construction
-    SETTABLE_FIELDS = ['generalize', 'type', 'output_type', 'example']
+    SETTABLE_FIELDS = ['generalize', 'type', 'output_type', 'write_text']
     attr_accessor :name, :output_type, :generalize, :children, :filters, :parent,
-                  :last_result, :result, :example, :limit,
-                  :examples, :parent_of_leaf, :evaluation_context,
-                  :indices_to_extract, :evaluation_context
-    attr_reader :type, :generalize_set, :next_page_url, :result_indexer
+                  :last_result, :result, :limit,
+                  :examples, :parent_of_leaf, :evaluation_context, :type,
+                  :indices_to_extract, :evaluation_context, :referenced_extractor,
+                  :referenced_pattern, :write_text
+    attr_reader   :generalize_set, :next_page_url, :result_indexer
     def initialize (name, *args)
       @name = name                #name of the pattern
@@ -70,7 +73,7 @@ module Scrubyt
       #Grab any examples that are defined!
       look_for_examples(args)
       args.each do |arg|
-        next if !arg.is_a? Hash
+        next if !arg.is_a? Hash
         arg.each do |k,v|
           #Set only the setable fields
           if SETTABLE_FIELDS.include? k.to_s
@@ -107,16 +110,16 @@ module Scrubyt
     # camera_data.item[1].item_name[0]
     def method_missing(method_name, *args, &block)
       case method_name.to_s
-      when 'select_indices'
-        @result_indexer = Scrubyt::ResultIndexer.new(*args)
-        self
-      when /^to_/
-        Scrubyt::ResultDumper.send(method_name.to_s, self)
-      when /^ensure_/
-        Scrubyt::ConstraintAdder.send(method_name, self, *args)
-      else
-        @children.each { |child| return child if child.name == method_name.to_s }
-        nil
+        when 'select_indices'
+          @result_indexer = Scrubyt::ResultIndexer.new(*args)
+          self
+        when /^to_/
+          Scrubyt::ResultDumper.send(method_name.to_s, self)
+        when /^ensure_/
+          Scrubyt::ConstraintAdder.send(method_name, self, *args)
+        else
+          @children.each { |child| return child if child.name == method_name.to_s }
+          nil
       end
     end
@@ -226,7 +229,11 @@ private
           end
         end
         @type = PATTERN_TYPE_REGEXP
+      elsif (args[0].is_a? Hash)
+        @examples = (args.select {|e| e.is_a? Hash}).select {|e| CompoundExample.compound_example?(e)}
+        @examples = nil if @examples == []
       end
     end
     def add_result(filter, source, results)

data/lib/scrubyt/core/scraping/result_indexer.rb CHANGED

@@ -37,6 +37,8 @@ module Scrubyt
               (0..ary.size).each {|i| to_keep << i if (i % 2 == 0)}
             when :every_third
               (0..ary.size).each {|i| to_keep << i if (i % 3 == 0)}
+            when :every_fourth
+              (0..ary.size).each {|i| to_keep << i if (i % 4 == 0)}
           end
         end
       }
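
The `:every_second`, `:every_third` and `:every_fourth` branches all keep the indices divisible by a fixed step. A minimal sketch of that selection follows; the helper name `every_nth_indices` is hypothetical. It uses an exclusive range (`0...ary.size`), whereas the diff iterates over `0..ary.size`, which also visits the index one past the end of the array.

```ruby
# Hypothetical helper: indices of every n-th element, starting with the first.
def every_nth_indices(ary, n)
  (0...ary.size).select { |i| i % n == 0 }
end

items = %w[a b c d e f g h i]
fourth_indices = every_nth_indices(items, 4)
fourth_items   = fourth_indices.map { |i| items[i] }
```

For the nine-element array above this keeps indices 0, 4 and 8.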

data/lib/scrubyt/core/shared/evaluation_context.rb CHANGED

@@ -13,8 +13,8 @@ module Scrubyt
   #two classes need to communicate frequently as well as share different information
   #and this is accomplished through EvaluationContext.
   class EvaluationContext
-    attr_accessor :root_pattern, :next_page, :document_index, :block_count,
-                  :extractor, :limit
+    attr_accessor :root_pattern, :document_index, :block_count,
+                  :extractor, :uri_builder
     def initialize
       @root_pattern = nil
@@ -26,9 +26,11 @@ module Scrubyt
     ##
     #Crawl to a new page. This function should not be called from the outside - it is automatically called
-    #if the next_page is defined
-    def crawl_to_new_page
-      temp_document = generate_next_page_link(@next_page)
+    #if the next_page pattern is defined
+    def crawl_to_new_page(root_pattern, uri_builder)
+      temp_document = uri_builder.next_page_example ?
+                        generate_next_page_link(uri_builder) :
+                        uri_builder.generate_next_uri
       return nil if temp_document == nil
       clear_sources_and_sinks(@root_pattern)
       @extractor.fetch(temp_document)
@@ -41,9 +43,9 @@ module Scrubyt
     def attach_current_document
       doc = @extractor.get_hpricot_doc
       @root_pattern.filters[0].source << doc
-      @root_pattern.filters[0].sink << doc
+      @root_pattern.filters[0].sink << doc
       @root_pattern.last_result ||= []
-      @root_pattern.last_result << doc
+      @root_pattern.last_result << doc
       @root_pattern.result.add_result(@root_pattern.filters[0].source,
                                       @root_pattern.filters[0].sink)
     end
@@ -54,6 +56,7 @@ module Scrubyt
       get_root_pattern(nil)
       mark_leaf_parents(@root_pattern)
       generate_examples(@root_pattern)
+      check_for_multipe_examples(@root_pattern)
     end
     ##
@@ -67,24 +70,22 @@ module Scrubyt
       pattern.children.each {|child| clear_sources_and_sinks child}
     end
-    def generate_next_page_link(example)
-      node = XPathUtils.find_node_from_text(@root_pattern.filters[0].source[0], example, true)
-      return nil if node == nil
+    def generate_next_page_link(uri_builder)
+      uri_builder.next_page_pattern.filters[0].generate_XPath_for_example(true)
+      xpath = uri_builder.next_page_pattern.filters[0].xpath
+      node = (@extractor.get_hpricot_doc/xpath).map.last
+      node = XPathUtils.find_nearest_node_with_attribute(node, 'href')
+      return nil if node == nil || node.attributes['href'] == nil
       node.attributes['href'].gsub('&amp;') {'&'}
     end
-    def mark_leaf_parents(pattern)
-      pattern.children.each { |child|
-        pattern.parent_of_leaf = true if child.children.size == 0
-      }
-      pattern.children.each { |child| mark_leaf_parents(child) }
+    def setup_uri_builder(pattern,args)
+      if args[0] =~ /^http.+/
+        args.insert(0, @extractor.get_current_doc_url) if args[1] !~ /^http.+/
+      end
+      @uri_builder = URIBuilder.new(pattern,args)
     end
-    def generate_examples(pattern)
-      pattern.children.each {|child_pattern| generate_examples(child_pattern) }
-      pattern.filters.each { |filter| filter.generate_XPath_for_example } if pattern.type == Pattern::PATTERN_TYPE_TREE
-    end
     def get_root_pattern(pattern)
       if @root_pattern == nil
         while (pattern.parent != nil)
@@ -92,6 +93,27 @@ module Scrubyt
         end
         @root_pattern = pattern
       end
-    end #end of function
+    end
+private
+    def mark_leaf_parents(pattern)
+      pattern.children.each { |child|
+        pattern.parent_of_leaf = true if child.children.size == 0
+      }
+      pattern.children.each { |child| mark_leaf_parents(child) }
+    end
+    ##
+    #Check the tree and turn all the XPaths for the examples (but the topmost one)
+    #into relative ones
+    def check_for_multipe_examples(pattern)
+      pattern.children.each {|child_pattern| check_for_multipe_examples(child_pattern) }
+      pattern.filters.each { |filter| filter.setup_relative_XPaths } if pattern.type == Pattern::PATTERN_TYPE_TREE
+    end
+    def generate_examples(pattern)
+      pattern.children.each {|child_pattern| generate_examples(child_pattern) }
+      pattern.filters.each { |filter| filter.generate_XPath_for_example(false) } if pattern.type == Pattern::PATTERN_TYPE_TREE
+    end #end of function generate_examples
   end #end of class EvaluationContext
 end #end of module Scrubyt
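
The last two lines of the new `generate_next_page_link` guard against a missing link or a missing `href` (the greyed-out "next" link case from the CHANGELOG) and then unescape `&amp;`. A minimal sketch of just that step follows, using a plain Hash as a stand-in for a node's attributes; the method name `next_page_href` is hypothetical.

```ruby
# Hypothetical helper: returns the usable href, or nil when there is no
# link to follow (no attributes at all, or no href among them).
def next_page_href(attributes)
  return nil if attributes.nil? || attributes['href'].nil?
  # Same unescaping as in the diff: HTML-escaped ampersands become plain ones.
  attributes['href'].gsub('&amp;') { '&' }
end

active_link  = next_page_href({ 'href' => '/search?q=ruby&amp;start=10' })
greyed_link  = next_page_href({ 'class' => 'disabled' })
missing_node = next_page_href(nil)
```

A `nil` result corresponds to the `return nil` path in `crawl_to_new_page`, where crawling stops.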