scrubyt 0.1.9 → 0.2.0
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- data/CHANGELOG +52 -18
- data/README +69 -40
- data/Rakefile +42 -11
- data/lib/scrubyt/export.rb +4 -4
- data/lib/scrubyt/extractor.rb +50 -9
- data/lib/scrubyt/filter.rb +7 -7
- data/lib/scrubyt/pattern.rb +19 -10
- data/lib/scrubyt/post_processor.rb +15 -0
- data/lib/scrubyt/result_dumper.rb +14 -2
- data/lib/scrubyt/xpathutils.rb +9 -7
- data/test/unittests/filter_test.rb +8 -0
- data/test/unittests/xpathutils_test.rb +3 -3
- metadata +11 -11
    
    data/CHANGELOG
    CHANGED

@@ -1,16 +1,47 @@
+= scRUBYt! changelog
+
+== 0.2.0
+=== 30th January, 2007
+
+The first ever public release, 0.2.0, is out! I would say the feature set is impressive, though the reliability still needs to be improved, and the whole thing needs to be tested, tested and tested thoroughly. This is not yet the release which you just pull out of the box and which works under any circumstances - however, the major bugs are fixed and the whole stuff is in a good-enough(TM) state, I guess.
+
+=<tt>Changes</tt>:
+
+* better form detection heuristics
+* report message if there are absolutely no results
+* lots of bugfixes
+  * fixed amazon_data.books[0].item[0].title[0] style output access
+    and implemented it correctly in the case of crawling as well
+  * /body/div/h3 not detected as XPath
+  * crawling problem (improved heuristics of URL joining)
+  * fixed blackbox test runner - no more platform-dependent code
+  * fixed exporting bug: swapped exported XPaths in the case of no example present
+  * fixed exporting bug: capturing \W (non-word character) after the pattern name; this way we can distinguish pattern names where one name is a substring of the other
+  * evaluation stops if the example was not found - but not in the case of next page link lookup
+  * google_data[0].link[0].url[0] style result lookup now works in the case of more documents, too
+  * tons of other bugfixes
+  * overall stability fixes
+* more blackbox tests
+* more examples
+* overall stability fixes
+
 = 0.1.9
 === 28th January, 2007
 
 This is a preview release before the first real public release, 0.2.0. Basically everything planned for 0.2.0 is in; now a testing phase (with light bugfixing :-) will follow, then 0.2.0 will be released.
 
+=<tt>Changes</tt>:
+
+* Possibility to specify multiple examples (hence a pattern can have more filters)
+* Enhanced heuristics for example text detection
+* First version of the algorithm to remove dupes resulting from multiple examples
+* empty XML leaf nodes are not written
+* new examples
+* TONS of bugfixes
 
 = 0.1
 === 15th January, 2007
@@ -20,15 +51,18 @@ This release was made more for myself (to try and test rubyforge, gems, etc) rat
 
 Fairly nice set of features, but it still needs a lot of testing and stabilizing before it will be really usable.
 
-Navigation:
-  fetching pages
-  clicking links
-  filling input fields
-  submitting forms
+* Navigation:
+  * fetching pages
+  * clicking links
+  * filling input fields
+  * submitting forms
+  * automatically passing the document to the scraping
+  * both file and http:// support
+  * automatic crawling
+
+* Scraping:
+  * Fairly powerful DSL to describe the full scraping process
+  * Automatic navigation with WWW::Mechanize
+  * Automatic scraping through examples with Hpricot
+  * automatic recursive scraping through the next button
 
    
    data/README
    CHANGED

@@ -1,70 +1,99 @@
-scRUBYt! - Hpricot and Mechanize on steroids
-============================================
+= scRUBYt! - Hpricot and Mechanize on steroids
 
-A simple to learn and use, yet very powerful web extraction framework written in Ruby. Navigate through the Web, Extract, query, transform and save relevant data from the Web page of interest by the concise and easy to use DSL
+A simple to learn and use, yet very powerful web extraction framework written in Ruby. Navigate through the Web; extract, query, transform and save relevant data from the Web page of your interest with the concise and easy to use DSL.
 
-Why do we need one more web-scraping toolkit?
-=============================================
+Do you think that Mechanize and Hpricot are powerful libraries? You're right, they are, indeed - hats off to their authors: without these libs scRUBYt! could not exist now! I have been wondering whether their functionality could be enhanced still further - so I took these two powerful ingredients, threw in a handful of smart heuristics, wrapped them in a chunky DSL coating and sprinkled the whole stuff with lots of convention-over-configuration(tm) goodies - and ... enter scRUBYt! and decide for yourself.
 
+= Wait... why do we need one more web-scraping toolkit?
+
+After all, we have HPricot, and Rubyful-soup, and Mechanize, and scrAPI, and ARIEL and scrapes and ...
+Well, because scRUBYt! is different. It has an entirely different philosophy, underlying techniques, theoretical background, use cases, todo list, real-life scenarios etc. - in short, it should be used in different situations, with different requirements, than the previously mentioned ones.
 
 If you need something quick and/or would like to have maximal control over the scraping process, I recommend HPricot. Mechanize shines when it comes to interaction with Web pages. Since scRUBYt! is operating based on XPaths, sometimes you will choose scrAPI because CSS selectors will better suit your needs. The list goes on and on, boiling down to the good old mantra: use the right tool for the right job!
 
-I hope there will be times when you will want to experiment with Pandora's box and reach after the power of scRUBYt! :-)
+I hope there will also be times when you will want to experiment with Pandora's box and reach for the power of scRUBYt! :-)
+
+= Sounds fine - show me an example!
+
+Let's apply the "show, don't tell" principle. Okay, here we go:
 
+<tt>ebay_data = Scrubyt::Extractor.define do</tt>
+
+  fetch 'http://www.ebay.com/'
+  fill_textfield 'satitle', 'ipod'
+  submit
+  click_link 'Apple iPod'
+
+  record do
+    item_name 'APPLE NEW IPOD MINI 6GB MP3 PLAYER SILVER'
+    price '$71.99'
+  end
+  next_page 'Next >', :limit => 5
+
+<tt>end</tt>
+
+output:
+
+<tt><root></tt>
+    <record>
+      <item_name>APPLE IPOD NANO 4GB - PINK - MP3 PLAYER</item_name>
+      <price>$149.95</price>
+    </record>
+    <record>
+      <item_name>APPLE IPOD 30GB BLACK VIDEO/PHOTO/MP3 PLAYER</item_name>
+      <price>$172.50</price>
+    </record>
+    <record>
+      <item_name>NEW APPLE IPOD NANO 4GB PINK MP3 PLAYER</item_name>
+      <price>$171.06</price>
+    </record>
+    <!-- another 200+ results -->
+<tt></root></tt>
+
+This was a relatively beginner-level example (scRUBYt! knows a lot more than this, and there are much more complicated extractors than the above one) - yet it did a lot of things automagically. First of all, it automatically loaded the page of interest (by going to ebay.com, automatically searching for ipods and narrowing down the results by clicking on 'Apple iPod'), then it extracted *all* the items that looked like the specified example (which, by the way, also described how the output structure should look) - on the first 5 result pages. Not so bad for about 10 lines of code, eh?
 
-OK, OK, I believe you, what should I do?
-========================================
+= OK, OK, I believe you, what should I do?
 
-rubyrailways.com (some theory)
-future: public extractor repository
+You can find everything you will need at these addresses (or if not, I doubt you will find it elsewhere...). See the next section about installation, and after installing be sure to check out these URLs:
 
+* <a href='http://www.rubyrailways.com'>rubyrailways.com</a> - for some theory; if you would like to take a sneak peek at web scraping in general and/or you would like to understand what's going on under the hood, check out <a href='http://www.rubyrailways.com/data-extraction-for-web-20-screen-scraping-in-rubyrails'>this article about web-scraping</a>!
+* <a href='http://scrubyt.org'>http://scrubyt.org</a> - your source of tutorials, howtos, news etc.
+* <a href='http://scrubyt.rubyforge.org'>scrubyt.rubyforge.org</a> - for an up-to-date, online RDoc
+* <a href='http://projects.rubyforge.org/scrubyt'>projects.rubyforge.org/scrubyt</a> - for developer info, including open and closed bugs, files etc.
+* projects.rubyforge.org/scrubyt/files... - a fair amount (still growing with every release) of examples, showcasing the features of scRUBYt!
+* planned: public extractor repository - hopefully (after people realize how great this package is :-)) scRUBYt! will have a community, and people will upload their extractors for whatever reason
+
+If you still can't find something here, drop a mail to the guys at scrubyt@/NO-SPAM/scrubyt.org!
 
-How to install
-==============
+= How to install
 
+scRUBYt! requires these packages to be installed:
+
+* Ruby 1.8.4
+* Hpricot 0.5
+* Mechanize 0.6.3
+
+I assume you have Ruby and RubyGems installed. To install WWW::Mechanize 0.6.3 or higher, just run
+
+<tt>sudo gem install mechanize</tt>
+
+Hpricot 0.5 is just hot off the frying pan - perfect timing, _why! - install it with
+
+<tt>sudo gem install hpricot</tt>
+
+Once all the dependencies (Mechanize and Hpricot) are up and running, you can install scRUBYt! with
+
+<tt>sudo gem install scrubyt</tt>
+
+If you encounter any problems, drop a mail to the guys at scrubyt@/NO-SPAM/scrubyt.org!
 
-Additional installation notes
-=============================
-
-you will have to install ragel (dependency of HPricot) with something like
-
+= Author
+
+Copyright (c) 2006 by Peter Szinek (peter@/NO-SPAM/rubyrailways.com)
+
+= Copyright
+
+This library is distributed under the GPL. Please see the LICENSE file.
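The README's claim that scRUBYt! "extracted all the items that looked like the specified example" is the heart of the tool. This is not scRUBYt!'s actual algorithm, but a minimal toy sketch of the idea using only Ruby's stdlib REXML: find the node whose text equals the example, take its XPath, strip the positional indices to generalize it, and collect everything the generalized path matches. The HTML snippet and helper name are invented for illustration.

```ruby
require 'rexml/document'

# Find the element whose text content exactly matches the example string.
def node_for_example(doc, example)
  REXML::XPath.match(doc, '//*').find { |e| e.text.to_s.strip == example }
end

html = <<-HTML
<root><table>
<tr><td>APPLE NEW IPOD MINI 6GB MP3 PLAYER SILVER</td><td>$71.99</td></tr>
<tr><td>APPLE IPOD 30GB BLACK VIDEO/PHOTO/MP3 PLAYER</td><td>$172.50</td></tr>
</table></root>
HTML

doc  = REXML::Document.new(html)
node = node_for_example(doc, 'APPLE NEW IPOD MINI 6GB MP3 PLAYER SILVER')

# Generalize the example's XPath by dropping positional indices
# (e.g. "/root/table/tr[1]/td[1]" becomes "/root/table/tr/td"),
# then collect every node the generalized path matches.
generalized = node.xpath.gsub(/\[\d+\]/, '')
all_items   = REXML::XPath.match(doc, generalized).map { |e| e.text }
```

With this sketch, `all_items` contains the texts of every cell shaped like the example, including ones the user never mentioned, which mirrors how the eBay extractor above returns records beyond the one it was shown.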
    
    data/Rakefile
    CHANGED

@@ -1,6 +1,7 @@
 require 'rake/rdoctask'
 require 'rake/testtask'
 require 'rake/gempackagetask'
+require 'rake/packagetask'
 
 ###################################################
 # Dependencies
@@ -8,6 +9,8 @@ require 'rake/gempackagetask'
 
 task "default" => ["test"]
 task "fulltest" => ["test", "blackbox"]
+task "generate_rdoc" => ["cleanup_readme"]
+task "cleanup_readme" => ["rdoc"]
 
 ###################################################
 # Gem specification
@@ -15,13 +18,13 @@ task "fulltest" => ["test", "blackbox"]
 
 gem_spec = Gem::Specification.new do |s|
   s.name = 'scrubyt'
-  s.version = '0.
+  s.version = '0.2.0'
   s.summary = 'A powerful Web-scraping framework'
   s.description = %{scRUBYt! is an easy to learn and use, yet powerful and effective web scraping framework. Its most interesting part is a Web-scraping DSL built on HPricot and WWW::Mechanize, which allows you to navigate to the page of interest, then extract and query data records with a few lines of code. It is hard to describe scRUBYt! in a few sentences - you have to see it for yourself!}
   # Files containing Test::Unit test cases.
   s.test_files = FileList['test/unittests/**/*']
   # List of other files to be included.
-  s.files = FileList['
+  s.files = FileList['COPYING', 'README', 'CHANGELOG', 'Rakefile', 'lib/**/*.rb']
   s.author = 'Peter Szinek'
   s.email = 'peter@rubyrailways.com'
   s.homepage = 'http://www.scrubyt.org'
@@ -32,14 +35,14 @@ end
 # Tasks
 ###################################################
 
-Rake::RDocTask.new do |
-     files = ['lib/**/*.rb', 'README']
+Rake::RDocTask.new do |generate_rdoc|
+     files = ['lib/**/*.rb', 'README', 'CHANGELOG']
+     generate_rdoc.rdoc_files.add(files)
+     generate_rdoc.main = "README" # page to start on
+     generate_rdoc.title = "Scrubyt Documentation"
+     generate_rdoc.template = "resources/allison/allison.rb"
+     generate_rdoc.rdoc_dir = 'doc' # rdoc output folder
+     generate_rdoc.options << '--line-numbers' << '--inline-source'
 end
 
 Rake::TestTask.new do |test|
@@ -50,7 +53,35 @@ task "blackbox" do
   ruby "test/blackbox/run_blackbox_tests.rb"
 end
 
+task "cleanup_readme" do
+  puts "Cleaning up README..."
+  readme_in = open('./doc/files/README.html')
+  content = readme_in.read
+  content.sub!('<h1 id="item_name">File: README</h1>','')
+  content.sub!('<h1>Description</h1>','')
+  readme_in.close
+  open('./doc/files/README.html', 'w') {|f| f.write(content)}
+  #OK, this is ugly as hell and as non-DRY as possible, but
+  #I don't have time to deal with it right now
+  puts "Cleaning up CHANGELOG..."
+  readme_in = open('./doc/files/CHANGELOG.html')
+  content = readme_in.read
+  content.sub!('<h1 id="item_name">File: CHANGELOG</h1>','')
+  content.sub!('<h1>Description</h1>','')
+  readme_in.close
+  open('./doc/files/CHANGELOG.html', 'w') {|f| f.write(content)}
+end
+
+task "generate_rdoc" do
+end
+
 Rake::GemPackageTask.new(gem_spec) do |pkg|
   pkg.need_zip = false
   pkg.need_tar = false
 end
+
+Rake::PackageTask.new('scrubyt-examples', '0.2.0') do |pkg|
+  pkg.need_zip = true
+  pkg.need_tar = true
+  pkg.package_files.include("examples/**/*")
+end
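The new "cleanup_readme" task above flags itself as "non-DRY": the same read/strip/write dance is copy-pasted for README and CHANGELOG. A minimal sketch of how the duplicated stripping could be factored into one helper (the helper name is invented; the stripped `<h1>` markup and file names are the ones from the Rakefile):

```ruby
# Strip the RDoc-generated "File: <name>" and "Description" headings
# from a generated HTML page's content.
def strip_generated_headings(html, page_name)
  html.sub(%{<h1 id="item_name">File: #{page_name}</h1>}, '')
      .sub('<h1>Description</h1>', '')
end

# Hypothetical usage inside the Rake task - one loop instead of two
# copy-pasted blocks:
#
#   %w(README CHANGELOG).each do |page|
#     path = "./doc/files/#{page}.html"
#     File.write(path, strip_generated_headings(File.read(path), page))
#   end
```

Because `String#sub` only replaces the first occurrence, this matches the behavior of the `content.sub!` calls in the task body.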
    
    data/lib/scrubyt/export.rb
    CHANGED

@@ -1,5 +1,5 @@
 #require File.join(File.dirname(__FILE__), 'pattern.rb')
 
 module Scrubyt
   # =<tt>exporting previously defined extractors</tt>
   class Export
@@ -142,14 +142,14 @@ private
     @name_to_xpath_map = {}
     create_name_to_xpath_map(pattern)
     #Replace the examples which are quoted with " and '
     @name_to_xpath_map.each do |name, xpaths|
       replace_example_with_xpath(name, xpaths, %q{"})
       replace_example_with_xpath(name, xpaths, %q{'})
     end
     #Finally, add XPaths to patterns which had no example at the beginning (the XPath was
     #generated from the child patterns)
     @name_to_xpath_map.each do |name, xpaths|
-      xpaths.each do |xpath|
+      xpaths.reverse.each do |xpath|
         comma = @full_definition.scan(Regexp.new("P.#{name}(.+)$"))[0][0].sub('do'){}.strip == '' ? '' : ','
         if (@full_definition.scan(Regexp.new("P.#{name}(.+)$"))[0][0]).include?('{')
           @full_definition.sub!("P.#{name}") {"P.#{name}('#{xpath}')"}
@@ -180,7 +180,7 @@ private
 
   def self.replace_example_with_xpath(name, xpaths, left_delimiter, right_delimiter=left_delimiter)
     return if name=='root'
-    full_line = @full_definition.scan(
+    full_line = @full_definition.scan(/P.#{name}\W(.+)$/)[0][0]
     examples = full_line.split(",")
     examples.reject! {|exa| exa.strip!;  exa[0..0] != %q{"} && exa[0..0] != %q{'} }
     all_xpaths = ""
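The `\W` added to the scan regex above is the exporting fix the changelog mentions: without it, a pattern name that is a prefix of another (e.g. `item` vs `item_name` - hypothetical names for illustration) matches both definition lines. A small standalone demo of the difference:

```ruby
# A toy extractor definition; the pattern name 'item' is a prefix of
# 'item_name', which is exactly the case the fix targets.
definition = <<-DEF
  P.item 'APPLE IPOD'
  P.item_name 'MINI 6GB'
DEF

name = 'item'

# Old form: (.+) happily swallows "_name ...", so both lines match.
loose  = definition.scan(/P.#{name}(.+)$/)

# Fixed form: \W demands a non-word character right after the name,
# so 'item_name' (underscore is a word character) no longer matches.
strict = definition.scan(/P.#{name}\W(.+)$/)
```

Here `loose` captures two hits while `strict` captures only the line for `item`, which is why the exported XPath ends up attached to the right pattern.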
    
    data/lib/scrubyt/extractor.rb
    CHANGED

@@ -46,6 +46,7 @@ module Scrubyt
       end
       ensure_all_postconditions(root_pattern)
       PostProcessor.remove_multiple_filter_duplicates(root_pattern)
+      PostProcessor.report_if_no_results(root_pattern)
       #Return the root pattern
       root_pattern
     end
@@ -121,21 +122,28 @@ module Scrubyt
         @@current_doc_url = ((@@base_dir + doc_url) if doc_url !~ /#{@@base_dir}/)
       end
 
       if @@host_name != nil
         if doc_url !~ /#{@@host_name}/
-          @@current_doc_url = (@@host_name + doc_url)
-          ...
+          @@current_doc_url = (@@host_name + doc_url)
+          #remove duplicate parts, like /blogs/en/blogs/en
+          @@current_doc_url = @@current_doc_url.split('/').uniq.reject{|x| x == ""}.join('/')
+          @@current_doc_url.sub!('http:/', 'http://')
         end
       end
       puts "[ACTION] fetching document: #{@@current_doc_url}"
-      ...
+      if @@current_doc_protocol == :http
+        @@mechanize_doc = @@agent.get(@@current_doc_url)
+        @@host_name = 'http://' + @@mechanize_doc.uri.to_s.scan(/http:\/\/(.+\/)+/).flatten[0]
+        @@host_name = doc_url if @@host_name == nil
+      end
     else
       @@current_doc_url = doc_url
       @@mechanize_doc = mechanize_doc
       @@host_name = 'http://' + @@mechanize_doc.uri.to_s.scan(/http:\/\/(.+\/)+/).flatten[0]
       @@host_name = doc_url if @@host_name == nil
     end
-    @@hpricot_doc = Hpricot(open(@@current_doc_url))
+    @@hpricot_doc = Hpricot(open(@@current_doc_url))
   end
 
   ##
@@ -150,23 +158,56 @@ module Scrubyt
   def self.fill_textfield(textfield_name, query_string)
     puts "[ACTION] typing #{query_string} into the textfield named '#{textfield_name}'"
     textfield = (@@hpricot_doc/"input[@name=#{textfield_name}]").map()[0]
-    ...
-    ...
+    form_tag = Scrubyt::XPathUtils.traverse_up_until_name(textfield, 'form')
+    #Refactor this code, it's a total mess
+    formname = form_tag.attributes['name']
+    if formname == nil
+      id_string = form_tag.attributes['id']
+      if id_string == nil
+        action_string = form_tag.attributes['action']
+        if action_string == nil
+          #If even this fails, do it with a button
+        else
+          puts "Finding from action"
+          puts action_string
+          find_form_with_attribute('action', action_string)
+        end
+      else
+        puts "Finding from id"
+        find_form_with_attribute('id', id_string)
+      end
| 179 | 
            +
                else
         | 
| 180 | 
            +
                  puts "Finding from name"
         | 
| 181 | 
            +
                  @@current_form = @@mechanize_doc.forms.with.name(formname).first
         | 
| 182 | 
            +
                end
         | 
| 183 | 
            +
                
         | 
| 155 184 | 
             
                eval("@@current_form['#{textfield_name}'] = '#{query_string}'")
         | 
| 156 185 | 
             
              end
         | 
| 157 186 |  | 
| 187 | 
            +
              def self.find_form_with_attribute(attr, expected_value)
         | 
| 188 | 
            +
                puts "attr: #{attr}"
         | 
| 189 | 
            +
                i = 0
         | 
| 190 | 
            +
                loop do
         | 
| 191 | 
            +
                  @@current_form = @@mechanize_doc.forms[i]
         | 
| 192 | 
            +
                  print "current a: " 
         | 
| 193 | 
            +
                  puts @@current_form.form_node.attributes[attr]
         | 
| 194 | 
            +
                  return nil if @@current_form == nil
         | 
| 195 | 
            +
                  break if @@current_form.form_node.attributes[attr] == expected_value
         | 
| 196 | 
            +
                  i+= 1
         | 
| 197 | 
            +
                end  
         | 
| 198 | 
            +
              end
         | 
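Editor's note: the form-detection heuristic above tries the form's `name`, then its `id`, then its `action`, and the new `find_form_with_attribute` does a linear scan when `name` is absent. A condensed sketch, using plain hashes as stand-ins for the real Mechanize form objects (`locate_form` is a hypothetical helper, not part of scRUBYt!):

```ruby
# Stand-in scan: plain hashes instead of Mechanize form objects.
def find_form_with_attribute(forms, attr, expected_value)
  forms.find { |form| form[attr] == expected_value }  # nil when exhausted
end

# Mirror of the name -> id -> action fallback used to locate the form
# that contains the example text field.
def locate_form(forms, form_attrs)
  %w[name id action].each do |attr|
    value = form_attrs[attr]
    next if value.nil?
    return find_form_with_attribute(forms, attr, value)
  end
  nil
end

forms = [{ 'id' => 'login' }, { 'action' => '/search' }]
locate_form(forms, 'action' => '/search')
# → { 'action' => '/search' }
```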
| 199 | 
            +
              
         | 
| 158 200 | 
             
              #Submit the last form; 
         | 
| 159 201 | 
             
              def self.submit    
         | 
| 160 202 | 
             
                puts '[ACTION] submitting form...'
         | 
| 161 203 | 
             
                result_page = @@agent.submit(@@current_form)#, @@current_form.buttons.first)
         | 
| 162 204 | 
             
                @@current_doc_url = result_page.uri.to_s
         | 
| 205 | 
            +
                puts "[ACTION] fetched #{@@current_doc_url}"
         | 
| 163 206 | 
             
                fetch(@@current_doc_url, result_page)
         | 
| 164 207 | 
             
              end
         | 
| 165 208 |  | 
| 166 209 | 
             
              def self.click_link(link_text)
         | 
| 167 210 | 
             
                puts "[ACTION] clicking link: #{link_text}"
         | 
| 168 | 
            -
                #puts /^#{Regexp.escape(link_text)}$/
         | 
| 169 | 
            -
                #p /^#{Regexp.escape(link_text)}$/
         | 
| 170 211 | 
             
                link = @@mechanize_doc.links.text(/^#{Regexp.escape(link_text)}$/)
         | 
| 171 212 | 
             
                result_page = @@agent.click(link)
         | 
| 172 213 | 
             
                @@current_doc_url = result_page.uri.to_s
         | 
    
        data/lib/scrubyt/filter.rb
    CHANGED
    
    | @@ -53,8 +53,10 @@ module Scrubyt | |
| 53 53 | 
             
                  @parent_pattern = parent_pattern
         | 
| 54 54 | 
             
                  #If the example type is not explicitly defined in the pattern definition,
         | 
| 55 55 | 
             
                  #try to determine it automatically from the example
         | 
| 56 | 
            -
                   | 
| 57 | 
            -
             | 
| 56 | 
            +
                  #@example_type = (args[0] == nil ? Filter.determine_example_type(example) :
         | 
| 57 | 
            +
                  #                                  args[0][:example_type])
         | 
| 58 | 
            +
                  #TODOOOOO correct this!
         | 
| 59 | 
            +
                  @example_type = Filter.determine_example_type(example)
         | 
| 58 60 | 
             
                  @sink = []                  #output of a filter
         | 
| 59 61 | 
             
                  @source = []                #input of a filter                                        
         | 
| 60 62 | 
             
                  @example = example
         | 
| @@ -67,14 +69,13 @@ module Scrubyt | |
| 67 69 | 
             
            #Evaluate this filter. This method should not be called directly - as the pattern hierarchy 
         | 
| 68 70 | 
             
            #is evaluated, every pattern evaluates its filters, and they in turn call this method
         | 
| 69 71 | 
             
                def evaluate(source)
         | 
| 70 | 
            -
                  @parent_pattern.root_pattern.already_evaluated_sources ||= {}
         | 
| 71 72 | 
             
                  case @parent_pattern.type
         | 
| 72 73 | 
             
                    when Scrubyt::Pattern::PATTERN_TYPE_TREE      
         | 
| 73 74 | 
             
                      result = source/@xpath
         | 
| 74 75 | 
             
                      result.class == Hpricot::Elements ? result.map : [result]
         | 
| 75 76 | 
             
                    when Scrubyt::Pattern::PATTERN_TYPE_ATTRIBUTE  
         | 
| 76 77 | 
             
                      [source.attributes[@example]]
         | 
| 77 | 
            -
                    when Scrubyt::Pattern::PATTERN_TYPE_REGEXP | 
| 78 | 
            +
                    when Scrubyt::Pattern::PATTERN_TYPE_REGEXP
         | 
| 78 79 | 
             
                      source.inner_text.scan(@example).flatten
         | 
| 79 80 | 
             
                  end      
         | 
| 80 81 | 
             
                end
         | 
| @@ -87,10 +88,9 @@ module Scrubyt | |
| 87 88 | 
             
                    when EXAMPLE_TYPE_XPATH
         | 
| 88 89 | 
             
                      @xpath = @example
         | 
| 89 90 | 
             
                    when EXAMPLE_TYPE_STRING
         | 
| 90 | 
            -
                      @temp_sink = XPathUtils.find_node_from_text( @parent_pattern.root_pattern.filters[0].source[0], @example )
         | 
| 91 | 
            +
                      @temp_sink = XPathUtils.find_node_from_text( @parent_pattern.root_pattern.filters[0].source[0], @example, false )
         | 
| 91 92 | 
             
                      @xpath = @parent_pattern.generalize ? XPathUtils.generate_XPath(@temp_sink, nil, false) :
         | 
| 92 93 | 
             
                                                             XPathUtils.generate_XPath(@temp_sink, nil, true)
         | 
| 93 | 
            -
                      puts @xpath                                                 
         | 
| 94 94 | 
             
                    when EXAMPLE_TYPE_CHILDREN          
         | 
| 95 95 | 
             
                      current_example_index = 0
         | 
| 96 96 | 
             
                      loop do
         | 
| @@ -148,7 +148,7 @@ private | |
| 148 148 | 
             
                        EXAMPLE_TYPE_CHILDREN
         | 
| 149 149 | 
             
                      when /\.(jpg|png|gif|jpeg)$/
         | 
| 150 150 | 
             
                        EXAMPLE_TYPE_IMAGE
         | 
| 151 | 
            -
                      when /^\/{1,2}[a-z] | 
| 151 | 
            +
                      when /^\/{1,2}[a-z]+\d?(\[\d+\])?(\/{1,2}[a-z]+\d?(\[\d+\])?)*$/
         | 
| 152 152 | 
             
                        (example.include? '/' || example.include?('[')) ? EXAMPLE_TYPE_XPATH : EXAMPLE_TYPE_STRING 
         | 
| 153 153 | 
             
                      else
         | 
| 154 154 | 
             
                        EXAMPLE_TYPE_STRING
         | 
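Editor's note: the tightened regexp above requires every path step to look like a lowercase tag name with an optional trailing digit and an optional `[n]` index, which is exactly what the new cases in filter_test.rb exercise. A quick illustration (the sample inputs below are ours, not from the test suite):

```ruby
# The XPath-detection pattern introduced in determine_example_type.
XPATH_EXAMPLE = /^\/{1,2}[a-z]+\d?(\[\d+\])?(\/{1,2}[a-z]+\d?(\[\d+\])?)*$/

['/p/img', '/p/h3/a/h2', '//table[2]/tr'].each do |example|
  raise "expected XPath: #{example}" unless example =~ XPATH_EXAMPLE
end
raise 'plain text should not match' if 'Next page' =~ XPATH_EXAMPLE
```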
    
        data/lib/scrubyt/pattern.rb
    CHANGED
    
    | @@ -43,7 +43,7 @@ module Scrubyt | |
| 43 43 | 
             
                attr_accessor :name, :output_type, :generalize, :children, :filters, :parent, 
         | 
| 44 44 | 
             
                              :last_result, :result, :root_pattern, :example,  :block_count, 
         | 
| 45 45 | 
             
                              :next_page, :limit, :extractor, :extracted_docs,
         | 
| 46 | 
            -
                              :examples, :parent_of_leaf
         | 
| 46 | 
            +
                              :examples, :parent_of_leaf, :document_index
         | 
| 47 47 | 
             
                attr_reader :type, :generalize_set, :next_page_url
         | 
| 48 48 |  | 
| 49 49 | 
             
                def initialize (name, *args)
         | 
| @@ -56,6 +56,7 @@ module Scrubyt | |
| 56 56 | 
             
                  @@instance_count = Hash.new(0)
         | 
| 57 57 | 
             
                  @evaluated_examples = []
         | 
| 58 58 | 
             
                  @next_page = nil
         | 
| 59 | 
            +
                  @document_index = 0
         | 
| 59 60 | 
             
                  if @examples == nil       
         | 
| 60 61 | 
             
                    filters << Scrubyt::Filter.new(self) #create a default filter
         | 
| 61 62 | 
             
                  else
         | 
| @@ -74,6 +75,7 @@ module Scrubyt | |
| 74 75 | 
             
                  #Grab any examples that are defined!
         | 
| 75 76 | 
             
                  look_for_examples(args)
         | 
| 76 77 | 
             
                  args.each do |arg|
         | 
| 78 | 
            +
                    next if !arg.is_a? Hash
         | 
| 77 79 | 
             
                    arg.each do |k,v|
         | 
| 78 80 | 
             
                      #Set only the setable fields
         | 
| 79 81 | 
             
                      if SETTABLE_FIELDS.include? k.to_s 
         | 
| @@ -92,7 +94,6 @@ module Scrubyt | |
| 92 94 | 
             
                  #default settings - the user can override them, but if she did not do so,
         | 
| 93 95 | 
             
                  #we will setup some meaningful defaults
         | 
| 94 96 | 
             
                  @type ||= PATTERN_TYPE_TREE
         | 
| 95 | 
            -
                  @type = PATTERN_TYPE_REGEXP if @example.instance_of? Regexp
         | 
| 96 97 | 
             
                  @output_type ||= OUTPUT_TYPE_MODEL
         | 
| 97 98 | 
             
                  #don't generalize by default
         | 
| 98 99 | 
             
                  @generalize ||= false
         | 
| @@ -127,11 +128,20 @@ module Scrubyt | |
| 127 128 | 
             
                #    camera_data.item[1].item_name[0]
         | 
| 128 129 | 
             
                #    
         | 
| 129 130 | 
             
                #possible. The method Pattern::method missing handles the 'item', 'item_name' etc.
         | 
| 130 | 
            -
                #parts, while the indexing ([1], [0]) is handled by this function
         | 
| 131 | 
            +
                #parts, while the indexing ([1], [0]) is handled by this function.
         | 
| 132 | 
            +
                #If you would like to select a different document than the first one (which is
         | 
| 133 | 
            +
                #the default), you should use the form:
         | 
| 134 | 
            +
                #
         | 
| 135 | 
            +
                #    camera_data[1].item[1].item_name[0]
         | 
| 131 136 | 
             
                def [](index)
         | 
| 132 | 
            -
                   | 
| 133 | 
            -
             | 
| 134 | 
            -
                   | 
| 137 | 
            +
                  if @name == 'root'
         | 
| 138 | 
            +
                    @root_pattern.document_index = index
         | 
| 139 | 
            +
                  else
         | 
| 140 | 
            +
                    @parent.last_result = @parent.last_result[@root_pattern.document_index] if @parent.last_result.is_a? Array
         | 
| 141 | 
            +
                    return nil if (@result.lookup(@parent.last_result)) == nil
         | 
| 142 | 
            +
                    @last_result = @result.lookup(@parent.last_result)[index]
         | 
| 143 | 
            +
                  end
         | 
| 144 | 
            +
                  self    
         | 
| 135 145 | 
             
                end
         | 
| 136 146 |  | 
| 137 147 | 
             
                ##
         | 
| @@ -217,9 +227,6 @@ module Scrubyt | |
| 217 227 | 
             
                        sorted_result = r.reject {|e| !result.keys.include? e}
         | 
| 218 228 | 
             
                        add_result(filter, source, sorted_result)
         | 
| 219 229 | 
             
                      else
         | 
| 220 | 
            -
                        if ( (xe = @result.lookup(source)) != nil )
         | 
| 221 | 
            -
                          #puts "ha"; p xe
         | 
| 222 | 
            -
                        end          
         | 
| 223 230 | 
             
                        add_result(filter, source, r)
         | 
| 224 231 | 
             
                      end#end of constraint check
         | 
| 225 232 | 
             
                    end#end of source iteration
         | 
| @@ -246,6 +253,7 @@ private | |
| 246 253 | 
             
                      end
         | 
| 247 254 | 
             
                    end
         | 
| 248 255 | 
             
                  elsif (args[0].is_a? Regexp)
         | 
| 256 | 
            +
                    @examples = args.select {|e| e.is_a? Regexp}
         | 
| 249 257 | 
             
                    #Check if all the String parameters are really the first
         | 
| 250 258 | 
             
                    #parameters 
         | 
| 251 259 | 
             
                    args[0..@examples.size].each do |example|
         | 
| @@ -253,6 +261,7 @@ private | |
| 253 261 | 
             
                        puts 'FATAL: Problem with example specification'
         | 
| 254 262 | 
             
                      end
         | 
| 255 263 | 
             
                    end
         | 
| 264 | 
            +
                    @type = PATTERN_TYPE_REGEXP
         | 
| 256 265 | 
             
                  end
         | 
| 257 266 | 
             
                end
         | 
| 258 267 |  | 
| @@ -299,7 +308,7 @@ private | |
| 299 308 | 
             
                end
         | 
| 300 309 |  | 
| 301 310 | 
             
                def generate_next_page_link(example)
         | 
| 302 | 
            -
                  node = XPathUtils.find_node_from_text(@root_pattern.filters[0].source[0], example)
         | 
| 311 | 
            +
                  node = XPathUtils.find_node_from_text(@root_pattern.filters[0].source[0], example, true)
         | 
| 303 312 | 
             
                  return nil if node == nil
         | 
| 304 313 | 
             
                  node.attributes['href'].gsub('&') {'&'}
         | 
| 305 314 | 
             
                end # end of method generate_next_page_link    
         | 
| @@ -18,6 +18,21 @@ module Scrubyt | |
| 18 18 | 
             
                  remove_multiple_filter_duplicates_intern(pattern) if pattern.parent_of_leaf
         | 
| 19 19 | 
             
                  pattern.children.each {|child| remove_multiple_filter_duplicates(child)}
         | 
| 20 20 | 
             
                end
         | 
| 21 | 
            +
                
         | 
| 22 | 
            +
                ##
         | 
| 23 | 
            +
                #Issue an error report if the document did not extract anything.
         | 
| 24 | 
            +
                #Probably this is because the structure of the page changed or 
         | 
| 25 | 
            +
                #because of some rather nasty bug - in any case, something wrong 
         | 
| 26 | 
            +
                #is going on, and we need to inform the user about this!
         | 
| 27 | 
            +
                def self.report_if_no_results(root_pattern)
         | 
| 28 | 
            +
                  results_found = false
         | 
| 29 | 
            +
                  root_pattern.children.each {|child| return if (child.result.childmap.size > 0)}
         | 
| 30 | 
            +
                  puts
         | 
| 31 | 
            +
                  puts "!!!!!! WARNING: The extractor did not find any result instances"
         | 
| 32 | 
            +
                  puts "Most probably this is wrong. Check your extractor and if you are"
         | 
| 33 | 
            +
                  puts "sure it should work, report a bug!"
         | 
| 34 | 
            +
                  puts
         | 
| 35 | 
            +
                end
         | 
| 21 36 |  | 
| 22 37 | 
             
            private  
         | 
| 23 38 | 
             
                def self.remove_multiple_filter_duplicates_intern(pattern)
         | 
| @@ -1,4 +1,5 @@ | |
| 1 1 | 
             
            require 'rexml/document'
         | 
| 2 | 
            +
            require 'rexml/xpath'
         | 
| 2 3 |  | 
| 3 4 | 
             
            module Scrubyt
         | 
| 4 5 | 
             
              ##
         | 
| @@ -16,7 +17,7 @@ module Scrubyt | |
| 16 17 | 
             
                    to_xml_recursive(pattern, root) 
         | 
| 17 18 | 
             
                  end
         | 
| 18 19 | 
             
                  remove_empty_leaves(doc)
         | 
| 19 | 
            -
                  doc
         | 
| 20 | 
            +
                  @@last_doc = doc
         | 
| 20 21 | 
             
                end
         | 
| 21 22 |  | 
| 22 23 | 
             
                def self.remove_empty_leaves(node)
         | 
| @@ -80,11 +81,22 @@ private | |
| 80 81 | 
             
                    end    
         | 
| 81 82 | 
             
                end
         | 
| 82 83 |  | 
| 83 | 
            -
                def self. | 
| 84 | 
            +
                def self.print_old_sta(pattern, depth)
         | 
| 84 85 | 
             
                  puts((' ' * "#{depth}".to_i) +  "#{pattern.name} extracted #{pattern.get_instance_count[pattern.name]} instances.") if pattern.name != 'root'
         | 
| 85 86 | 
             
                  pattern.children.each do |child|
         | 
| 86 87 | 
             
                    print_statistics_recursive(child, depth + 4)
         | 
| 88 | 
            +
                  end    
         | 
| 89 | 
            +
                end
         | 
| 90 | 
            +
                
         | 
| 91 | 
            +
                def self.print_statistics_recursive(pattern, depth)
         | 
| 92 | 
            +
                  if pattern.name != 'root'
         | 
| 93 | 
            +
                    count = REXML::XPath.match(@@last_doc, "//#{pattern.name}").size 
         | 
| 94 | 
            +
                    puts((' ' * "#{depth}".to_i) +  "#{pattern.name} extracted #{count} instances.")
         | 
| 87 95 | 
             
                  end
         | 
| 96 | 
            +
                  
         | 
| 97 | 
            +
                  pattern.children.each do |child|
         | 
| 98 | 
            +
                    print_statistics_recursive(child, depth + 4)
         | 
| 99 | 
            +
                  end      
         | 
| 88 100 | 
             
                end#end of method print_statistics_recursive
         | 
| 89 101 | 
             
              end #end of class ResultDumper      
         | 
| 90 102 | 
             
            end #end of module Scrubyt
         | 
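Editor's note: the rewritten statistics printer counts instances by running an XPath query against the cached result document (`@@last_doc`) instead of the per-pattern counters. A minimal sketch of that counting technique; the document and element names are made up for illustration:

```ruby
require 'rexml/document'
require 'rexml/xpath'

# Count extracted instances the way print_statistics_recursive now does:
# an XPath match over the dumped result document.
doc = REXML::Document.new('<root><item><item_name/></item><item><item_name/></item></root>')
count = REXML::XPath.match(doc, '//item_name').size
puts count
# → 2
```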
    
        data/lib/scrubyt/xpathutils.rb
    CHANGED
    
    | @@ -21,21 +21,23 @@ module Scrubyt | |
| 21 21 | 
             
                # <a>Bon <b>nuit</b>, monsieur!</a>
         | 
| 22 22 | 
             
                #
         | 
| 23 23 | 
             
            #In this case, <a>'s text is considered to be "Bon nuit, monsieur!"
         | 
| 24 | 
            -
                def self.find_node_from_text(doc, text)
         | 
| 24 | 
            +
                def self.find_node_from_text(doc, text, next_link)
         | 
| 25 25 | 
             
                  @node = nil
         | 
| 26 26 | 
             
                  @found = false
         | 
| 27 27 | 
             
                  self.traverse_for_full_text(doc,text)
         | 
| 28 28 | 
             
                  self.lowest_possible_node_with_text(@node, text) if @node != nil
         | 
| 29 | 
            -
                  #$Logger.warn("Node for example #{text} Not found!") if (@found == false)
         | 
| 30 29 | 
             
                  if (@found == false)        
         | 
| 31 30 | 
             
                    #Fallback to per node text lookup
         | 
| 32 31 | 
             
                    self.traverse_for_node_text(doc,text)
         | 
| 33 | 
            -
                    if (@found == false) | 
| 34 | 
            -
                       | 
| 35 | 
            -
                      puts " | 
| 32 | 
            +
                    if (@found == false)
         | 
| 33 | 
            +
                      return nil if next_link
         | 
| 34 | 
            +
                      puts "!" * 65
         | 
| 35 | 
            +
                      puts "!!!!!! FATAL: Node for example #{text} Not found! !!!!!!" 
         | 
| 36 | 
            +
                      puts "!!!!!! Please make sure you specified the example properly !!!!!!"
         | 
| 37 | 
            +
                      puts "!" * 65
         | 
| 38 | 
            +
                      exit
         | 
| 36 39 | 
             
                    end
         | 
| 37 40 | 
             
                  end
         | 
| 38 | 
            -
                  p @node
         | 
| 39 41 | 
             
                  @node
         | 
| 40 42 | 
             
                end
         | 
| 41 43 |  | 
| @@ -135,7 +137,7 @@ module Scrubyt | |
| 135 137 | 
             
                #_index_ - there might be more images with the same src on the page -
         | 
| 136 138 | 
             
                #most typically the user will need the 0th - but if this is not the 
         | 
| 137 139 | 
             
                #case, there is the possibility to override this
         | 
| 138 | 
            -
                def self.find_image(doc, example, index= | 
| 140 | 
            +
                def self.find_image(doc, example, index=0)
         | 
| 139 141 | 
             
                  (doc/"img[@src='#{example}']")[index]
         | 
| 140 142 | 
             
                end
         | 
| 141 143 |  | 
| @@ -22,7 +22,15 @@ class FilterTest < Test::Unit::TestCase | |
| 22 22 | 
             
                                 Scrubyt::Filter::EXAMPLE_TYPE_IMAGE)
         | 
| 23 23 | 
             
                #Test XPaths
         | 
| 24 24 | 
             
                assert_equal(Scrubyt::Filter.determine_example_type('/p/img'), 
         | 
| 25 | 
            +
                             Scrubyt::Filter::EXAMPLE_TYPE_XPATH)                 
         | 
| 26 | 
            +
                assert_equal(Scrubyt::Filter.determine_example_type('/p/h3'), 
         | 
| 27 | 
            +
                             Scrubyt::Filter::EXAMPLE_TYPE_XPATH)                 
         | 
| 28 | 
            +
                assert_equal(Scrubyt::Filter.determine_example_type('/p/h3/a/h2'), 
         | 
| 29 | 
            +
                             Scrubyt::Filter::EXAMPLE_TYPE_XPATH)
         | 
| 30 | 
            +
                assert_equal(Scrubyt::Filter.determine_example_type('/h2'), 
         | 
| 25 31 | 
             
                             Scrubyt::Filter::EXAMPLE_TYPE_XPATH)
         | 
| 32 | 
            +
                assert_equal(Scrubyt::Filter.determine_example_type('/h1/h3'), 
         | 
| 33 | 
            +
                             Scrubyt::Filter::EXAMPLE_TYPE_XPATH)                                 
         | 
| 26 34 | 
             
                assert_equal(Scrubyt::Filter.determine_example_type('/p'), 
         | 
| 27 35 | 
             
                             Scrubyt::Filter::EXAMPLE_TYPE_XPATH)
         | 
| 28 36 | 
             
                assert_equal(Scrubyt::Filter.determine_example_type('//p'), 
         | 
| @@ -55,14 +55,14 @@ class XPathUtilsTest < Test::Unit::TestCase | |
| 55 55 | 
             
              end
         | 
| 56 56 |  | 
| 57 57 | 
             
              def test_find_node_from_text
         | 
| 58 | 
            -
                elem = Scrubyt::XPathUtils.find_node_from_text(@doc1,"fff")
         | 
| 58 | 
            +
                elem = Scrubyt::XPathUtils.find_node_from_text(@doc1,"fff", false)
         | 
| 59 59 | 
             
                assert_instance_of(Hpricot::Elem, elem)
         | 
| 60 60 | 
             
                assert_equal(elem, @f)
         | 
| 61 61 |  | 
| 62 | 
            -
                elem = Scrubyt::XPathUtils.find_node_from_text(@doc1,"dddd")
         | 
| 62 | 
            +
                elem = Scrubyt::XPathUtils.find_node_from_text(@doc1,"dddd", false)
         | 
| 63 63 | 
             
                assert_equal(elem, @d)
         | 
| 64 64 |  | 
| 65 | 
            -
                elem = Scrubyt::XPathUtils.find_node_from_text(@doc1,"rrr")
         | 
| 65 | 
            +
                elem = Scrubyt::XPathUtils.find_node_from_text(@doc1,"rrr", false)
         | 
| 66 66 | 
             
                assert_equal(elem, @r)
         | 
| 67 67 |  | 
| 68 68 | 
             
              end
         | 
    
        metadata
    CHANGED
    
    | @@ -3,8 +3,8 @@ rubygems_version: 0.9.0 | |
| 3 3 | 
             
            specification_version: 1
         | 
| 4 4 | 
             
            name: scrubyt
         | 
| 5 5 | 
             
            version: !ruby/object:Gem::Version 
         | 
| 6 | 
            -
              version: 0. | 
| 7 | 
            -
            date: 2007- | 
| 6 | 
            +
              version: 0.2.0
         | 
| 7 | 
            +
            date: 2007-02-04 00:00:00 +01:00
         | 
| 8 8 | 
             
            summary: A powerful Web-scraping framework
         | 
| 9 9 | 
             
            require_paths: 
         | 
| 10 10 | 
             
            - lib
         | 
| @@ -29,29 +29,29 @@ post_install_message: | |
| 29 29 | 
             
            authors: 
         | 
| 30 30 | 
             
            - Peter Szinek
         | 
| 31 31 | 
             
            files: 
         | 
| 32 | 
            -
            - README
         | 
| 33 32 | 
             
            - COPYING
         | 
| 33 | 
            +
            - README
         | 
| 34 34 | 
             
            - CHANGELOG
         | 
| 35 35 | 
             
            - Rakefile
         | 
| 36 36 | 
             
            - lib/scrubyt.rb
         | 
| 37 | 
            -
            - lib/scrubyt/constraint_adder.rb
         | 
| 38 37 | 
             
            - lib/scrubyt/constraint.rb
         | 
| 39 | 
            -
            - lib/scrubyt/result_dumper.rb
         | 
| 40 | 
            -
            - lib/scrubyt/export.rb
         | 
| 41 | 
            -
            - lib/scrubyt/extractor.rb
         | 
| 42 | 
            -
            - lib/scrubyt/filter.rb
         | 
| 43 38 | 
             
            - lib/scrubyt/pattern.rb
         | 
| 44 39 | 
             
            - lib/scrubyt/result.rb
         | 
| 40 | 
            +
            - lib/scrubyt/export.rb
         | 
| 41 | 
            +
            - lib/scrubyt/constraint_adder.rb
         | 
| 45 42 | 
             
            - lib/scrubyt/post_processor.rb
         | 
| 43 | 
            +
            - lib/scrubyt/filter.rb
         | 
| 46 44 | 
             
            - lib/scrubyt/xpathutils.rb
         | 
| 45 | 
            +
            - lib/scrubyt/result_dumper.rb
         | 
| 46 | 
            +
            - lib/scrubyt/extractor.rb
         | 
| 47 47 | 
             
            test_files: 
         | 
| 48 48 | 
             
            - test/unittests/input
         | 
| 49 | 
            +
            - test/unittests/constraint_test.rb
         | 
| 49 50 | 
             
            - test/unittests/filter_test.rb
         | 
| 50 | 
            -
            - test/unittests/extractor_test.rb
         | 
| 51 51 | 
             
            - test/unittests/xpathutils_test.rb
         | 
| 52 | 
            -
            - test/unittests/ | 
| 53 | 
            -
            - test/unittests/input/constraint_test.html
         | 
| 52 | 
            +
            - test/unittests/extractor_test.rb
         | 
| 54 53 | 
             
            - test/unittests/input/test.html
         | 
| 54 | 
            +
            - test/unittests/input/constraint_test.html
         | 
| 55 55 | 
             
            rdoc_options: []
         | 
| 56 56 |  | 
| 57 57 | 
             
            extra_rdoc_files: []
         |