scrapi 1.1.2

Sign up to get free protection for your applications and to get access to all the features.
@@ -0,0 +1,22 @@
1
+ Version 1.1.2 (August 13, 2006)
2
+
3
+ * Changed: Allows multiple :not pseudo classes to be used with the same
4
+ element (meaning, select where none of the negators match).
5
+ * Fixed: first-of-type, last-of-type.
6
+
7
+ Version 1.1.1 (August 8, 2006)
8
+
9
+ * Added: select() method to each element, that selects from that element.
10
+ * Fixed: Inheritance bug resulting in infinite loop. Credit: Andrew Turner
11
+
12
+ Version 1.1.0 (July 26, 2006)
13
+
14
+ * Added: CSS 3 pseudo classes. nth-child, first-child, not, empty, etc.
15
+ * Added: Quoted attribute values.
16
+ * Added: Gem.
17
+ * Fixed: Group selectors not parsing correctly.
18
+ * Fixed: Case sensitive (shouldn't be).
19
+
20
+ Version 1.0.0 (July 11, 2006)
21
+
22
+ * First release.
@@ -0,0 +1,20 @@
1
+ Copyright (c) 2006 Assaf Arkin
2
+
3
+ Permission is hereby granted, free of charge, to any person obtaining
4
+ a copy of this software and associated documentation files (the
5
+ "Software"), to deal in the Software without restriction, including
6
+ without limitation the rights to use, copy, modify, merge, publish,
7
+ distribute, sublicense, and/or sell copies of the Software, and to
8
+ permit persons to whom the Software is furnished to do so, subject to
9
+ the following conditions:
10
+
11
+ The above copyright notice and this permission notice shall be
12
+ included in all copies or substantial portions of the Software.
13
+
14
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
15
+ EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
16
+ MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
17
+ NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE
18
+ LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION
19
+ OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION
20
+ WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
data/README ADDED
@@ -0,0 +1,88 @@
1
+ == ScrAPI toolkit for Ruby
2
+
3
+ A framework for writing scrapers using CSS selectors and simple
4
+ select => extract => store processing rules.
5
+
6
+ Here's an example that scrapes auctions from eBay:
7
+
8
+ ebay_auction = Scraper.define do
9
+ process "h3.ens>a", :description=>:text,
10
+ :url=>"@href"
11
+ process "td.ebcPr>span", :price=>:text
12
+ process "div.ebPicture >a>img", :image=>"@src"
13
+
14
+ result :description, :url, :price, :image
15
+ end
16
+
17
+ ebay = Scraper.define do
18
+ array :auctions
19
+
20
+ process "table.ebItemlist tr.single",
21
+ :auctions => ebay_auction
22
+
23
+ result :auctions
24
+ end
25
+
26
+ And using the scraper:
27
+
28
+ auctions = ebay.scrape(html)
29
+
30
+ # No. of auctions found
31
+ puts auctions.size
32
+
33
+ # First auction:
34
+ auction = auctions[0]
35
+ puts auction.description
36
+ puts auction.url
37
+
38
+
39
+ To get the latest source code with regular updates:
40
+
41
+ svn co http://labnotes.org/svn/public/ruby/scrapi
42
+
43
+
44
+ == Using TIDY
45
+
46
+ By default scrAPI uses Tidy to cleanup the HTML.
47
+
48
+ You need to install the Tidy Gem for Ruby:
49
+ gem install tidy
50
+
51
+ And the Tidy binary libraries, available here:
52
+
53
+ http://tidy.sourceforge.net/
54
+
55
+ By default scrAPI looks for the Tidy DLL (Windows) or shared library (Linux) in the directory lib/tidy. That's one place to put the Tidy library.
56
+
57
+ Alternatively, just point Tidy to the library with:
58
+
59
+ Tidy.path = "...."
60
+
61
+ On Linux this would probably be:
62
+
63
+ Tidy.path = "/usr/local/lib/libtidy.so"
64
+
65
+ On OS/X this would probably be:
66
+
67
+ Tidy.path = "/usr/lib/libtidy.dylib"
68
+
69
+ For testing purposes, you can also use the built-in HTML parser. It's useful for testing and getting to grips with scrAPI, but it doesn't deal well with broken HTML. So for testing only:
70
+
71
+ Scraper::Base.parser :html_parser
72
+
73
+
74
+ == License
75
+
76
+ Copyright (c) 2006 Assaf Arkin, under Creative Commons Attribution and/or MIT License
77
+
78
+ Developed for http://co.mments.com
79
+
80
+ Code and documentation: http://labnotes.org
81
+
82
+ HTML cleanup and good hygiene by Tidy, Copyright (c) 1998-2003 World Wide Web Consortium.
83
+ License at http://tidy.sourceforge.net/license.html
84
+
85
+ HTML DOM extracted from Rails, Copyright (c) 2004 David Heinemeier Hansson. Under MIT license.
86
+
87
+ HTML parser by Takahiro Maebashi and Katsuyuki Komatsu, Ruby license.
88
+ http://www.jin.gr.jp/~nahi/Ruby/html-parser/README.html
@@ -0,0 +1,67 @@
1
+ require "benchmark"
2
+ require "rubygems"
3
+ Gem::manage_gems
4
+ require "rake"
5
+ require "rake/testtask"
6
+ require "rake/rdoctask"
7
+ require "rake/gempackagetask"
8
+
9
+
10
+
11
+ desc "Generate documentation"
12
+ Rake::RDocTask.new(:rdoc) do |rdoc|
13
+ rdoc.rdoc_dir = "rdoc"
14
+ rdoc.title = "Scraper"
15
+ rdoc.options << "--line-numbers"
16
+ rdoc.options << "--inline-source"
17
+ rdoc.rdoc_files.include("README")
18
+ rdoc.rdoc_files.include("lib/**/*.rb")
19
+ end
20
+
21
+
22
+ desc "Run all tests"
23
+ Rake::TestTask.new(:test) do |test|
24
+ test.libs << "lib"
25
+ test.pattern = "test/**/*_test.rb"
26
+ test.verbose = true
27
+ end
28
+
29
+
30
+ desc "Package as a Gem"
31
+ gem_spec = Gem::Specification.new do |spec|
32
+
33
+ version = nil
34
+ File.readlines("CHANGELOG").each do |line|
35
+ if line =~ /Version (\d+\.\d+\.\d+)/
36
+ version = $1
37
+ break
38
+ end
39
+ end
40
+ raise RuntimeError, "Can't find version number in changelog" unless version
41
+
42
+ spec.name = "scrapi"
43
+ spec.version = version
44
+ spec.summary = "scrAPI toolkit for Ruby"
45
+ spec.description = <<-EOF
46
+ A framework for writing scrapers using CSS selectors and simple
47
+ select => extract => store processing rules.
48
+ EOF
49
+ spec.author = "Assaf Arkin"
50
+ spec.email = "assaf.arkin@gmail.com"
51
+ spec.homepage = "http://labnotes.org/"
52
+
53
+ spec.files = FileList["{test,lib}/**/*", "README", "CHANGELOG", "Rakefile", "MIT-LICENSE"].to_a
54
+ spec.require_path = "lib"
55
+ spec.autorequire = "scrapi.rb"
56
+ spec.requirements << "Tidy"
57
+ spec.add_dependency "tidy", ">=1.1.0"
58
+ spec.has_rdoc = true
59
+ spec.rdoc_options << "--main" << "README" << "--title" << "scrAPI toolkit for Ruby" << "--line-numbers"
60
+ spec.extra_rdoc_files = ["README"]
61
+ spec.rubyforge_project = "scrapi"
62
+ end
63
+
64
+ gem = Rake::GemPackageTask.new(gem_spec) do |pkg|
65
+ pkg.need_tar = true
66
+ pkg.need_zip = true
67
+ end
@@ -0,0 +1,64 @@
1
+ require File.dirname(__FILE__) + '/tokenizer'
2
+ require File.dirname(__FILE__) + '/node'
3
+
4
+ module HTML #:nodoc:
5
+
6
+ # A top-level HTML document. You give it a body of text, and it will parse that
7
+ # text into a tree of nodes.
8
+ class Document #:nodoc:
9
+
10
+ # The root of the parsed document.
11
+ attr_reader :root
12
+
13
+ # Create a new Document from the given text.
14
+ def initialize(text, strict=false, xml=false)
15
+ tokenizer = Tokenizer.new(text)
16
+ @root = Node.new(nil)
17
+ node_stack = [ @root ]
18
+ while token = tokenizer.next
19
+ node = Node.parse(node_stack.last, tokenizer.line, tokenizer.position, token, strict)
20
+
21
+ node_stack.last.children << node unless node.tag? && node.closing == :close
22
+ if node.tag?
23
+ if node_stack.length > 1 && node.closing == :close
24
+ if node_stack.last.name == node.name
25
+ node_stack.pop
26
+ else
27
+ open_start = node_stack.last.position - 20
28
+ open_start = 0 if open_start < 0
29
+ close_start = node.position - 20
30
+ close_start = 0 if close_start < 0
31
+ msg = <<EOF.strip
32
+ ignoring attempt to close #{node_stack.last.name} with #{node.name}
33
+ opened at byte #{node_stack.last.position}, line #{node_stack.last.line}
34
+ closed at byte #{node.position}, line #{node.line}
35
+ attributes at open: #{node_stack.last.attributes.inspect}
36
+ text around open: #{text[open_start,40].inspect}
37
+ text around close: #{text[close_start,40].inspect}
38
+ EOF
39
+ strict ? raise(msg) : warn(msg)
40
+ end
41
+ elsif !node.childless?(xml) && node.closing != :close
42
+ node_stack.push node
43
+ end
44
+ end
45
+ end
46
+ end
47
+
48
+ # Search the tree for (and return) the first node that matches the given
49
+ # conditions. The conditions are interpreted differently for different node
50
+ # types, see HTML::Text#find and HTML::Tag#find.
51
+ def find(conditions)
52
+ @root.find(conditions)
53
+ end
54
+
55
+ # Search the tree for (and return) all nodes that match the given
56
+ # conditions. The conditions are interpreted differently for different node
57
+ # types, see HTML::Text#find and HTML::Tag#find.
58
+ def find_all(conditions)
59
+ @root.find_all(conditions)
60
+ end
61
+
62
+ end
63
+
64
+ end
@@ -0,0 +1,407 @@
1
+ module HTML #:nodoc:
2
+
3
+ # A parser for SGML, using the derived class as static DTD.
4
+
5
+ class SGMLParser
6
+
7
+ # Regular expressions used for parsing:
8
+ Interesting = /[&<]/
9
+ Incomplete = Regexp.compile('&([a-zA-Z][a-zA-Z0-9]*|#[0-9]*)?|' +
10
+ '<([a-zA-Z][^<>]*|/([a-zA-Z][^<>]*)?|' +
11
+ '![^<>]*)?')
12
+
13
+ Entityref = /&([a-zA-Z][-.a-zA-Z0-9]*)[^-.a-zA-Z0-9]/
14
+ Charref = /&#([0-9]+)[^0-9]/
15
+
16
+ Starttagopen = /<[>a-zA-Z]/
17
+ Endtagopen = /<\/[<>a-zA-Z]/
18
+ # Assaf: fixed to allow tag to close itself (XHTML)
19
+ Endbracket = /<|>|\/>/
20
+ Special = /<![^<>]*>/
21
+ Commentopen = /<!--/
22
+ Commentclose = /--[ \t\n]*>/
23
+ Tagfind = /[a-zA-Z][a-zA-Z0-9.-]*/
24
+ # Assaf: / is no longer part of allowed attribute value
25
+ Attrfind = Regexp.compile('[\s,]*([a-zA-Z_][a-zA-Z_0-9.-]*)' +
26
+ '(\s*=\s*' +
27
+ "('[^']*'" +
28
+ '|"[^"]*"' +
29
+ '|[-~a-zA-Z0-9,.:+*%?!()_#=]*))?')
30
+
31
+ Entitydefs =
32
+ {'lt'=>'<', 'gt'=>'>', 'amp'=>'&', 'quot'=>'"', 'apos'=>'\''}
33
+
34
+ def initialize(verbose=false)
35
+ @verbose = verbose
36
+ reset
37
+ end
38
+
39
+ def reset
40
+ @rawdata = ''
41
+ @stack = []
42
+ @lasttag = '???'
43
+ @nomoretags = false
44
+ @literal = false
45
+ end
46
+
47
+ def has_context(gi)
48
+ @stack.include? gi
49
+ end
50
+
51
+ def setnomoretags
52
+ @nomoretags = true
53
+ @literal = true
54
+ end
55
+
56
+ def setliteral(*args)
57
+ @literal = true
58
+ end
59
+
60
+ def feed(data)
61
+ @rawdata << data
62
+ goahead(false)
63
+ end
64
+
65
+ def close
66
+ goahead(true)
67
+ end
68
+
69
+ def goahead(_end)
70
+ rawdata = @rawdata
71
+ i = 0
72
+ n = rawdata.length
73
+ while i < n
74
+ if @nomoretags
75
+ handle_data(rawdata[i..(n-1)])
76
+ i = n
77
+ break
78
+ end
79
+ j = rawdata.index(Interesting, i)
80
+ j = n unless j
81
+ if i < j
82
+ handle_data(rawdata[i..(j-1)])
83
+ end
84
+ i = j
85
+ break if (i == n)
86
+ if rawdata[i] == ?< #
87
+ if rawdata.index(Starttagopen, i) == i
88
+ if @literal
89
+ handle_data(rawdata[i, 1])
90
+ i += 1
91
+ next
92
+ end
93
+ k = parse_starttag(i)
94
+ break unless k
95
+ i = k
96
+ next
97
+ end
98
+ if rawdata.index(Endtagopen, i) == i
99
+ k = parse_endtag(i)
100
+ break unless k
101
+ i = k
102
+ @literal = false
103
+ next
104
+ end
105
+ if rawdata.index(Commentopen, i) == i
106
+ if @literal
107
+ handle_data(rawdata[i,1])
108
+ i += 1
109
+ next
110
+ end
111
+ k = parse_comment(i)
112
+ break unless k
113
+ i += k
114
+ next
115
+ end
116
+ if rawdata.index(Special, i) == i
117
+ if @literal
118
+ handle_data(rawdata[i, 1])
119
+ i += 1
120
+ next
121
+ end
122
+ k = parse_special(i)
123
+ break unless k
124
+ i += k
125
+ next
126
+ end
127
+ elsif rawdata[i] == ?& #
128
+ if rawdata.index(Charref, i) == i
129
+ i += $&.length
130
+ handle_charref($1)
131
+ i -= 1 unless rawdata[i-1] == ?;
132
+ next
133
+ end
134
+ if rawdata.index(Entityref, i) == i
135
+ i += $&.length
136
+ handle_entityref($1)
137
+ i -= 1 unless rawdata[i-1] == ?;
138
+ next
139
+ end
140
+ else
141
+ raise RuntimeError, 'neither < nor & ??'
142
+ end
143
+ # We get here only if incomplete matches but
144
+ # nothing else
145
+ match = rawdata.index(Incomplete, i)
146
+ unless match == i
147
+ handle_data(rawdata[i, 1])
148
+ i += 1
149
+ next
150
+ end
151
+ j = match + $&.length
152
+ break if j == n # Really incomplete
153
+ handle_data(rawdata[i..(j-1)])
154
+ i = j
155
+ end
156
+ # end while
157
+ if _end and i < n
158
+ handle_data(@rawdata[i..(n-1)])
159
+ i = n
160
+ end
161
+ @rawdata = rawdata[i..-1]
162
+ end
163
+
164
+ def parse_comment(i)
165
+ rawdata = @rawdata
166
+ if rawdata[i, 4] != '<!--'
167
+ raise RuntimeError, 'unexpected call to handle_comment'
168
+ end
169
+ match = rawdata.index(Commentclose, i)
170
+ return nil unless match
171
+ matched_length = $&.length
172
+ j = match
173
+ handle_comment(rawdata[i+4..(j-1)])
174
+ j = match + matched_length
175
+ return j-i
176
+ end
177
+
178
+ def parse_starttag(i)
179
+ rawdata = @rawdata
180
+ j = rawdata.index(Endbracket, i + 1)
181
+ return nil unless j
182
+ attrs = []
183
+ if rawdata[i+1] == ?> #
184
+ # SGML shorthand: <> == <last open tag seen>
185
+ k = j
186
+ tag = @lasttag
187
+ else
188
+ match = rawdata.index(Tagfind, i + 1)
189
+ unless match
190
+ raise RuntimeError, 'unexpected call to parse_starttag'
191
+ end
192
+ k = i + 1 + ($&.length)
193
+ tag = $&.downcase
194
+ @lasttag = tag
195
+ end
196
+ while k < j
197
+ # Assaf: fixed to allow tag to close itself (XHTML)
198
+ break unless idx = rawdata.index(Attrfind, k) and idx < j
199
+ matched_length = $&.length
200
+ attrname, rest, attrvalue = $1, $2, $3
201
+ if not rest
202
+ attrvalue = '' # was: = attrname
203
+ # Assaf: fixed to handle double quoted attribute values properly
204
+ elsif (attrvalue[0] == ?' && attrvalue[-1] == ?') or
205
+ (attrvalue[0] == ?" && attrvalue[-1] == ?")
206
+ attrvalue = attrvalue[1..-2]
207
+ end
208
+ attrs << [attrname.downcase, attrvalue]
209
+ k += matched_length
210
+ end
211
+ # Assaf: fixed to allow tag to close itself (XHTML)
212
+ if rawdata[j,2] == '/>'
213
+ j += 2
214
+ finish_starttag(tag, attrs)
215
+ finish_endtag(tag)
216
+ else
217
+ if rawdata[j] == ?> #
218
+ j += 1
219
+ end
220
+ finish_starttag(tag, attrs)
221
+ end
222
+ return j
223
+ end
224
+
225
+ def parse_endtag(i)
226
+ rawdata = @rawdata
227
+ j = rawdata.index(Endbracket, i + 1)
228
+ return nil unless j
229
+ tag = (rawdata[i+2..j-1].strip).downcase
230
+ if rawdata[j] == ?> #
231
+ j += 1
232
+ end
233
+ finish_endtag(tag)
234
+ return j
235
+ end
236
+
237
+ def finish_starttag(tag, attrs)
238
+ method = 'start_' + tag
239
+ if self.respond_to?(method)
240
+ @stack << tag
241
+ handle_starttag(tag, method, attrs)
242
+ return 1
243
+ else
244
+ method = 'do_' + tag
245
+ if self.respond_to?(method)
246
+ handle_starttag(tag, method, attrs)
247
+ return 0
248
+ else
249
+ unknown_starttag(tag, attrs)
250
+ return -1
251
+ end
252
+ end
253
+ end
254
+
255
+ def finish_endtag(tag)
256
+ if tag == ''
257
+ found = @stack.length - 1
258
+ if found < 0
259
+ unknown_endtag(tag)
260
+ return
261
+ end
262
+ else
263
+ unless @stack.include? tag
264
+ method = 'end_' + tag
265
+ unless self.respond_to?(method)
266
+ unknown_endtag(tag)
267
+ end
268
+ return
269
+ end
270
+ found = @stack.index(tag) #or @stack.length
271
+ end
272
+ while @stack.length > found
273
+ tag = @stack[-1]
274
+ method = 'end_' + tag
275
+ if respond_to?(method)
276
+ handle_endtag(tag, method)
277
+ else
278
+ unknown_endtag(tag)
279
+ end
280
+ @stack.pop
281
+ end
282
+ end
283
+
284
+ def parse_special(i)
285
+ rawdata = @rawdata
286
+ match = rawdata.index(Endbracket, i+1)
287
+ return nil unless match
288
+ matched_length = $&.length
289
+ handle_special(rawdata[i+1..(match-1)])
290
+ return match - i + matched_length
291
+ end
292
+
293
+ def handle_starttag(tag, method, attrs)
294
+ self.send(method, attrs)
295
+ end
296
+
297
+ def handle_endtag(tag, method)
298
+ self.send(method)
299
+ end
300
+
301
+ def report_unbalanced(tag)
302
+ if @verbose
303
+ print '*** Unbalanced </' + tag + '>', "\n"
304
+ print '*** Stack:', self.stack, "\n"
305
+ end
306
+ end
307
+
308
+ def handle_charref(name)
309
+ n = Integer(name) rescue -1
310
+ if !(0 <= n && n <= 255)
311
+ unknown_charref(name)
312
+ return
313
+ end
314
+ handle_data(n.chr)
315
+ end
316
+
317
+ def handle_entityref(name)
318
+ table = Entitydefs
319
+ if table.include?(name)
320
+ handle_data(table[name])
321
+ else
322
+ unknown_entityref(name)
323
+ return
324
+ end
325
+ end
326
+
327
+ def handle_data(data)
328
+ end
329
+
330
+ def handle_comment(data)
331
+ end
332
+
333
+ def handle_special(data)
334
+ end
335
+
336
+ def unknown_starttag(tag, attrs)
337
+ end
338
+ def unknown_endtag(tag)
339
+ end
340
+ def unknown_charref(ref)
341
+ end
342
+ def unknown_entityref(ref)
343
+ end
344
+
345
+ end
346
+
347
+
348
+ # (X)HTML parser.
349
+ #
350
+ # Parses a String and returns an REXML::Document with the (X)HTML content.
351
+ #
352
+ # For example:
353
+ # html = "<p>paragraph</p>"
354
+ # parser = HTMLParser.new(html)
355
+ # puts parser.document
356
+ #
357
+ # Requires a patched version of SGMLParser.
358
+ class HTMLParser < SGMLParser
359
+
360
+ attr :document
361
+
362
+ def self.parse(html)
363
+ parser = HTMLParser.new
364
+ parser.feed(html)
365
+ parser.document
366
+ end
367
+
368
+ def initialize()
369
+ super
370
+ @document = HTML::Document.new("")
371
+ @current = @document.root
372
+ end
373
+
374
+ def handle_data(data)
375
+ @current.children << HTML::Text.new(@current, 0, 0, data)
376
+ end
377
+
378
+ def handle_comment(data)
379
+ end
380
+
381
+ def handle_special(data)
382
+ end
383
+
384
+ def unknown_starttag(tag, attrs)
385
+ attrs = attrs.inject({}) do |hash, attr|
386
+ hash[attr[0].downcase] = attr[1]
387
+ hash
388
+ end
389
+ element = HTML::Tag.new(@current || @document, 0, 0, tag.downcase, attrs, true)
390
+ @current.children << element
391
+ @current = element
392
+ end
393
+
394
+ def unknown_endtag(tag)
395
+ @current = @current.parent if @current.parent
396
+ end
397
+
398
+ def unknown_charref(ref)
399
+ end
400
+
401
+ def unknown_entityref(ref)
402
+ @current.children << HTML::Text.new(@current, 0, 0, "&amp;#{ref}&lt;")
403
+ end
404
+
405
+ end
406
+
407
+ end