yasuri 2.0.12 → 3.3.0
- checksums.yaml +5 -5
 - data/.github/workflows/ruby.yml +35 -0
 - data/.gitignore +1 -2
 - data/.ruby-version +1 -0
 - data/.travis.yml +1 -3
 - data/README.md +87 -21
 - data/USAGE.ja.md +368 -120
 - data/USAGE.md +375 -125
 - data/examples/example.rb +79 -0
 - data/examples/github.yml +15 -0
 - data/examples/sample.json +4 -0
 - data/examples/sample.yml +11 -0
 - data/exe/yasuri +5 -0
 - data/lib/yasuri.rb +1 -0
 - data/lib/yasuri/version.rb +1 -1
 - data/lib/yasuri/yasuri.rb +86 -41
 - data/lib/yasuri/yasuri_cli.rb +64 -0
 - data/lib/yasuri/yasuri_links_node.rb +11 -5
 - data/lib/yasuri/yasuri_map_node.rb +40 -0
 - data/lib/yasuri/yasuri_node.rb +37 -2
 - data/lib/yasuri/yasuri_node_generator.rb +16 -11
 - data/lib/yasuri/yasuri_paginate_node.rb +10 -4
 - data/lib/yasuri/yasuri_struct_node.rb +5 -1
 - data/lib/yasuri/yasuri_text_node.rb +9 -2
 - data/spec/cli_resources/tree.json +8 -0
 - data/spec/cli_resources/tree.yml +5 -0
 - data/spec/cli_resources/tree_wrong.json +9 -0
 - data/spec/cli_resources/tree_wrong.yml +6 -0
 - data/spec/spec_helper.rb +4 -9
 - data/spec/yasuri_cli_spec.rb +96 -0
 - data/spec/yasuri_links_node_spec.rb +34 -12
 - data/spec/yasuri_map_spec.rb +75 -0
 - data/spec/yasuri_paginate_node_spec.rb +22 -10
 - data/spec/yasuri_spec.rb +244 -94
 - data/spec/yasuri_struct_node_spec.rb +13 -17
 - data/spec/yasuri_text_node_spec.rb +11 -12
 - data/yasuri.gemspec +5 -3
 - metadata +52 -18
 - data/app.rb +0 -52
 
    
data/USAGE.md CHANGED
@@ -1,27 +1,32 @@
-# Yasuri 
+# Yasuri
 
 ## What is Yasuri
-`Yasuri` is 
+`Yasuri` (鑢) is a library for declarative web scraping and a command line tool for scraping with it.
 
-
+It performs scraping by simply describing the expected result in a simple declarative notation.
 
-Yasuri 
+Yasuri makes it easy to write common scraping operations.
+For example, the following processes can be easily implemented.
 
-
-
-+ 
-+ 
-+ A table that repeatedly appears in a page each, scraping, get as an array.
-+ Of each page provided by the pagination, scraping the only top 3.
-
-You can implement easy by Yasuri.
++ Scrape multiple texts in a page and name them into a Hash
++ Open multiple links in a page and get the result of scraping each page as a Hash
++ Scrape each table that appears repeatedly in the page and get the result as an array
++ Scrape only the first three pages of each page provided by pagination
 
 ## Quick Start
 
+
+#### Install
+```sh
+# for Ruby 2.3.2
+$ gem 'yasuri', '~> 2.0', '>= 2.0.13'
 ```
+or
+```sh
+# for Ruby 3.0.0 or upper
 $ gem install yasuri
 ```
-
+#### Use as library
 ```ruby
 require 'yasuri'
 require 'mechanize'
@@ -32,81 +37,190 @@ root = Yasuri.links_root '//*[@id="menu"]/ul/li/a' do
          text_content '//*[@id="contents"]/p[1]'
        end
 
-
-
+result = root.scrape("http://some.scraping.page.tac42.net/")
+# => [
+#      {"title" => "PageTitle 01", "content" => "Page Contents  01" },
+#      {"title" => "PageTitle 02", "content" => "Page Contents  02" },
+#      ...
+#      {"title" => "PageTitle N",  "content" => "Page Contents  N" }
+#    ]
+```
+
+This example, from the pages of each link that is expressed by the xpath of LinkNode(`links_root`), to scraping the two text that is expressed by the xpath of TextNode(`text_title`,`text_content`).
+
+(in other words, open each links `//*[@id="menu"]/ul/li/a` and, scrape `//*[@id="contents"]/h2` and `//*[@id="contents"]/p[1]`.)
+
 
-
-
-#      {"title" => "PageTitle2", "content" => "Page Contents2" }, ...  ]
+#### Use as CLI tool
+The same thing as above can be executed as a CLI command.
 
+```sh
+$ yasuri scrape "http://some.scraping.page.tac42.net/" -j '
+{
+  "links_root": {
+    "path": "//*[@id=\"menu\"]/ul/li/a",
+    "text_title": "//*[@id=\"contents\"]/h2",
+    "text_content": "//*[@id=\"contents\"]/p[1]"
+    }
+}'
+
+[
+  {"title":"PageTitle 01","content":"Page Contents  01"},
+  {"title":"PageTitle 02","content":"Page Contents  02"},
+  ...,
+  {"title":"PageTitle N","content":"Page Contents  N"}
+]
 ```
-This example, from the pages of each link that is expressed by the xpath of LinkNode(`links_root`), to scraping the two text that is expressed by the xpath of TextNode(`text_title`,`text_content`).
 
-
+The result can be obtained as a string in json format.
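
Since the CLI result is a JSON string, it can be handed to any JSON parser. The following is a minimal sketch using Ruby's standard `json` library. The `cli_output` string is a hand-written stand-in shaped like the sample output in the README, not real scraper output, and the variable names are only illustrative.

```ruby
require 'json'

# Stand-in for what the CLI might print (hand-written sample, not real output).
cli_output = <<~JSON
  [
    {"title":"PageTitle 01","content":"Page Contents  01"},
    {"title":"PageTitle 02","content":"Page Contents  02"}
  ]
JSON

# JSON.parse returns an Array of Hashes with String keys by default.
pages  = JSON.parse(cli_output)
titles = pages.map { |page| page['title'] }
```
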
 
-
+----------------------------
+## Parse Tree
 
-
-2. Start parse with Mechanize agent and first page.
+A parse tree is a tree structure data for declaratively defining the elements to be scraped and the output structure.
 
-
+A parse tree consists of nested `Node`s, each of which has `Type`, `Name`, `Path`, `Children`, and `Options` attributes, and scrapes according to its `Type`. (Note that only `MapNode` does not have `Path`).
 
-
-require 'mechanize'
-require 'yasuri'
+The parse tree is defined in the following format:
 
+```ruby
+# A simple tree consisting of one node
+Yasuri.<Type>_<Name> <Path> [,<Options>]
 
-# 
-
-
-
+# Nested tree
+Yasuri.<Type>_<Name> <Path> [,<Options>] do
+  <Type>_<Name> <Path> [,<Options>] do
+    <Type>_<Name> <Path> [,<Options>]
+    ...
+  end
+end
+```
 
-
-agent = Mechanize.new
-page = agent.get(uri)
+**Example**
 
+```ruby
+# A simple tree consisting of one node
+Yasuri.text_title '/html/head/title', truncate:/^[^,]+/
 
-tree 
+# Nested tree
+Yasuri.links_root '//*[@id="menu"]/ul/li/a' do
+  struct_table './tr' do
+    text_title    './td[1]'
+    text_pub_date './td[2]'
+  end
+end
 ```
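
To make the `<Type>_<Name> <Path> [,<Options>]` notation more concrete, here is a minimal sketch of how a DSL of this shape could be turned into a nested tree with `method_missing` and `instance_eval`. It is not Yasuri's implementation: `SketchNode`, `SketchBuilder`, and the `TYPES` list are hypothetical names, and only the `text`, `struct`, and `links` prefixes seen in the examples are assumed.

```ruby
# Hypothetical sketch, NOT Yasuri's implementation.
SketchNode = Struct.new(:type, :name, :path, :opt, :children)

class SketchBuilder
  # Type prefixes that appear in the README examples; the real library defines its own set.
  TYPES = %w[text struct links].freeze

  attr_reader :nodes

  def initialize
    @nodes = []
  end

  # Interpret a call such as `text_title './td[1]'` as <Type>_<Name> <Path> [,<Options>].
  def method_missing(method_name, path = nil, **opt, &block)
    type, name = method_name.to_s.split('_', 2)
    return super unless TYPES.include?(type) && name

    # A block defines child nodes; evaluate it against a fresh builder.
    children = block ? SketchBuilder.new.tap { |b| b.instance_eval(&block) }.nodes : []
    node = SketchNode.new(type.to_sym, name, path, opt, children)
    @nodes << node
    node
  end

  def respond_to_missing?(method_name, include_private = false)
    type, name = method_name.to_s.split('_', 2)
    (TYPES.include?(type) && !name.nil?) || super
  end
end

root = SketchBuilder.new.instance_eval do
  links_root '//*[@id="menu"]/ul/li/a' do
    struct_table './tr' do
      text_title    './td[1]'
      text_pub_date './td[2]'
    end
  end
end
```

Splitting on the first underscore only keeps names such as `pub_date` intact, which is why `split('_', 2)` is used in the sketch.
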
 
-
+Parsing trees can be defined in Ruby DSL, JSON, or YAML.
+The following is an example of the same parse tree as above, defined in each notation.
+
 
+**Case of defining as Ruby DSL**
 ```ruby
-
-
-
-     "name"     : "title",
-     "path"     : "/html/body/a",
-     "children" : [
-                    { "node" : "text",
-                      "name" : "name",
-                      "path" : "/html/body/p"
-                    }
-                  ]
-   }
-EOJSON
-tree = Yasuri.json2tree(src)
+Yasuri.links_title '/html/body/a' do
+  text_name '/html/body/p'
+end
 ```
 
-
-
-
+**Case of defining as JSON**
+```json
+{
+  "links_title": {
+    "path": "/html/body/a",
+    "text_name": "/html/body/p"
+  }
+}
+```
 
-
+**Case of defining as YAML**
+```yaml
+links_title:
+  path: "/html/body/a"
+  text_name: "/html/body/p"
+```
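
One way to see that the JSON and YAML notations describe the same tree is to load both with Ruby's standard `json` and `yaml` libraries and compare the resulting Hashes. This sketch only checks the plain data structure; it does not call Yasuri, and the variable names are illustrative.

```ruby
require 'json'
require 'yaml'

json_src = <<~JSON
  {
    "links_title": {
      "path": "/html/body/a",
      "text_name": "/html/body/p"
    }
  }
JSON

yaml_src = <<~YAML
  links_title:
    path: "/html/body/a"
    text_name: "/html/body/p"
YAML

from_json = JSON.parse(json_src)       # Hash with String keys
from_yaml = YAML.safe_load(yaml_src)   # Hash with String keys
same_tree = (from_json == from_yaml)
```
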
 
+**Special case of parse tree**
 
+If there is only one element directly under the root, it will return that element directly instead of Hash(Object).
+```json
+{
+  "text_title": "/html/head/title",
+  "text_body": "/html/body"
+}
+# => {"title": "Welcome to yasuri!", "body": "Yasuri is ..."}
+
+{
+  "text_title": "/html/head/title"
+}
+# => Welcome to yasuri!
+```
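
The single-element rule can be pictured as a small post-processing step on the result. The helper below is a hypothetical illustration of that rule in plain Ruby, assuming the result is an ordinary Hash; it is not taken from Yasuri's source.

```ruby
# Hypothetical helper illustrating the rule; not taken from Yasuri's source.
def unwrap_single(result)
  # Anything other than a one-entry Hash is returned unchanged.
  return result unless result.is_a?(Hash) && result.size == 1

  result.values.first
end

both   = unwrap_single({ 'title' => 'Welcome to yasuri!', 'body' => 'Yasuri is ...' })
single = unwrap_single({ 'title' => 'Welcome to yasuri!' })
```
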
+
+
+In json or yaml format, an attribute can directly specify `path` as a value if it doesn't have any child Node. The following two json will have the same parse tree.
+
+```json
+{
+  "text_name": "/html/body/p"
+}
+
+{
+  "text_name": {
+    "path": "/html/body/p"
+  }
+}
+```
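
The equivalence of the two forms can be sketched as a normalization pass that rewrites the shorthand into the explicit `path` form. `expand_shorthand` and `NODE_KEY` are hypothetical names, and the sketch assumes node keys start with one of the type prefixes seen in the examples (`text_`, `struct_`, `links_`); Yasuri's own parser may work differently.

```ruby
# Hypothetical helper, not Yasuri's parser. Assumes node keys start with one of
# the type prefixes seen in the README examples.
NODE_KEY = /\A(text|struct|links)_/.freeze

def expand_shorthand(tree)
  tree.each_with_object({}) do |(key, value), expanded|
    expanded[key] =
      if !key.match?(NODE_KEY) then value                  # "path" and other keys stay untouched
      elsif value.is_a?(Hash) then expand_shorthand(value) # explicit form: recurse into children
      else { 'path' => value }                             # shorthand: the value is the path
      end
  end
end

short = expand_shorthand({ 'text_name' => '/html/body/p' })
long  = expand_shorthand({ 'text_name' => { 'path' => '/html/body/p' } })
```
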
+### Run ParseTree
+Call the `Node#scrape(uri, opt={})` method on the root node of the parse tree.
+
+**Example**
 ```ruby
-
-
+root = Yasuri.links_root '//*[@id="menu"]/ul/li/a' do
+         text_title '//*[@id="contents"]/h2'
+         text_content '//*[@id="contents"]/p[1]'
+       end
 
-
-Yasuri.<Type>_<Name> <Path> [,<Options>] do
-  <Type>_<Name> <Path> [,<Options>] do
-    <Children>
-  end
-end
+result = root.scrape("http://some.scraping.page.tac42.net/", interval_ms: 1000)
 ```
 
++ `uri` is the URI of the page to be scraped.
++ `opt` is options as Hash. The following options are available.
+
+Yasuri uses `Mechanize` internally as an agent to do scraping.
+If you want to specify this instance, call `Node#scrape_with_agent(uri, agent, opt={})`.
+
+```ruby
+require 'logger'
+
+agent = Mechanize.new
+agent.log = Logger.new $stderr
+agent.request_headers = {
+  # ...
+}
+
+result = root.scrape_with_agent(
+  "http://some.scraping.page.tac42.net/",
+  agent,
+  interval_ms: 1000)
+```
+
+### `opt`
+#### `interval_ms`
+Interval [milliseconds] for requesting multiple pages.
+
+If omitted, requests will be made continuously without an interval, but if requests to many pages are expected, it is strongly recommended to specify an interval time to avoid high load on the target host.
+
+#### `retry_count`
+Number of retries when page acquisition fails. If omitted, it will retry 5 times.
+
+#### `symbolize_names`
+If true, returns the keys of the result set as symbols.
+
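
The three options describe fairly generic behavior: a pause between page requests, a bounded number of retries, and symbol keys in the result. The sketch below imitates that behavior in plain Ruby under stated assumptions. `fetch_all` and `symbolize_keys` are hypothetical helpers, the block stands in for the actual page request, and none of this is Yasuri's code.

```ruby
# Hypothetical helper: calls the block once per URI, pausing between requests
# and retrying failed requests up to retry_count times. Not Yasuri's code.
def fetch_all(uris, interval_ms: 0, retry_count: 5)
  uris.each_with_index.map do |uri, index|
    # Wait before every request except the first one.
    sleep(interval_ms / 1000.0) if index.positive? && interval_ms.positive?
    attempts = 0
    begin
      attempts += 1
      yield uri
    rescue StandardError
      raise if attempts > retry_count   # give up after the allowed retries
      retry
    end
  end
end

# Recursively convert Hash keys to symbols (the idea behind symbolize_names).
def symbolize_keys(value)
  case value
  when Hash  then value.to_h { |k, v| [k.to_sym, symbolize_keys(v)] }
  when Array then value.map { |v| symbolize_keys(v) }
  else value
  end
end

# Usage with a fake "request" that fails twice for one URI before succeeding.
calls = Hash.new(0)
pages = fetch_all(%w[a b], interval_ms: 1, retry_count: 2) do |uri|
  calls[uri] += 1
  raise 'temporary failure' if uri == 'b' && calls[uri] < 3
  { 'title' => uri.upcase }
end
symbolized = symbolize_keys(pages)
```

With `retry_count: 2`, a request is attempted at most three times in this sketch: the first try plus two retries.
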
+--------------------------
+## Node
+
+Node is a node or leaf of the parse tree, which has `Type`, `Name`, `Path`, `Children`, and `Options`, and scrapes according to its `Type`. (Note that only `MapNode` does not have `Path`).
+
+
 #### Type
 Type means behavior of Node.
 
@@ -114,17 +228,20 @@ Type means behavior of Node.
 - *Struct*
 - *Links*
 - *Paginate*
+- *Map*
 
-
+See the description of each node for details.
+
+#### Name
 Name is used keys in returned hash.
 
-
+#### Path
| 
       122 
239 
     | 
    
         
             
            Path determine target node by xpath or css selector. It given by Machinize `search`.
         
     | 
| 
       123 
240 
     | 
    
         | 
| 
       124 
     | 
    
         
            -
             
     | 
| 
      
 241 
     | 
    
         
            +
            #### Children
         
     | 
| 
       125 
242 
     | 
    
         
             
            Child nodes. TextNode always has an empty set, because TextNode is a leaf.
         
     | 
| 
       126 
243 
     | 
    
         | 
| 
       127 
     | 
    
         
            -
             
     | 
| 
      
 244 
     | 
    
         
            +
            #### Options
         
     | 
| 
       128 
245 
     | 
    
         
             
            Parse options. They differ for each type. You can get the options and their values with the `opt` method.
         
     | 
| 
       129 
246 
     | 
    
         | 
| 
       130 
247 
     | 
    
         
             
            ```ruby
         
     | 
| 
         @@ -136,10 +253,12 @@ node.opt #=> {:truncate => /^[^,]+/, :proc => nil} 
     | 
|
| 
       136 
253 
     | 
    
         
             
            ## Text Node
         
     | 
| 
       137 
254 
     | 
    
         
             
            TextNode returns the scraped text. This node has to be a leaf.
         
     | 
| 
       138 
255 
     | 
    
         | 
| 
      
 256 
     | 
    
         
            +
             
     | 
| 
      
 257 
     | 
    
         
            +
             
     | 
| 
       139 
258 
     | 
    
         
             
            ### Example
         
     | 
| 
       140 
259 
     | 
    
         | 
| 
       141 
260 
     | 
    
         
             
            ```html
         
     | 
| 
       142 
     | 
    
         
            -
            <!-- http://yasuri.example.net -->
         
     | 
| 
      
 261 
     | 
    
         
            +
            <!-- http://yasuri.example.tac42.net -->
         
     | 
| 
       143 
262 
     | 
    
         
             
            <html>
         
     | 
| 
       144 
263 
     | 
    
         
             
              <head></head>
         
     | 
| 
       145 
264 
     | 
    
         
             
              <body>
         
     | 
| 
         @@ -150,25 +269,24 @@ TextNode return scraped text. This node have to be leaf. 
     | 
|
| 
       150 
269 
     | 
    
         
             
            ```
         
     | 
| 
       151 
270 
     | 
    
         | 
| 
       152 
271 
     | 
    
         
             
            ```ruby
         
     | 
| 
       153 
     | 
    
         
            -
            agent = Mechanize.new
         
     | 
| 
       154 
     | 
    
         
            -
            page = agent.get("http://yasuri.example.net")
         
     | 
| 
       155 
     | 
    
         
            -
             
     | 
| 
       156 
272 
     | 
    
         
             
            p1  = Yasuri.text_title '/html/body/p[1]'
         
     | 
| 
       157 
273 
     | 
    
         
             
            p1t = Yasuri.text_title '/html/body/p[1]', truncate:/^[^,]+/
         
     | 
| 
       158 
     | 
    
         
            -
            p2u = Yasuri.text_title '/html/body/p[ 
     | 
| 
      
 274 
     | 
    
         
            +
            p2u = Yasuri.text_title '/html/body/p[1]', proc: :upcase
         
     | 
| 
       159 
275 
     | 
    
         | 
| 
       160 
     | 
    
         
            -
            p1. 
     | 
| 
       161 
     | 
    
         
            -
            p1t. 
     | 
| 
       162 
     | 
    
         
            -
             
     | 
| 
      
 276 
     | 
    
         
            +
            p1.scrape("http://yasuri.example.tac42.net")   #=> "Hello,World"
         
     | 
| 
      
 277 
     | 
    
         
            +
            p1t.scrape("http://yasuri.example.tac42.net")  #=> "Hello"
         
     | 
| 
      
 278 
     | 
    
         
            +
            p2u.scrape("http://yasuri.example.tac42.net")  #=> "HELLO,WORLD"
         
     | 
| 
       163 
279 
     | 
    
         
             
            ```
         
     | 
| 
       164 
280 
     | 
    
         | 
| 
      
 281 
     | 
    
         
            +
            Note that if you want to scrape multiple elements in the same page at once, use `MapNode`. See the `MapNode` example for details.
         
     | 
| 
      
 282 
     | 
    
         
            +
             
     | 
| 
       165 
283 
     | 
    
         
             
            ### Options
         
     | 
| 
       166 
284 
     | 
    
         
             
            ##### `truncate`
         
     | 
| 
       167 
285 
     | 
    
         
             
            Matches the text against the regexp and truncates it. When the regexp contains a group, only the first matched group is returned.
         
     | 
| 
       168 
286 
     | 
    
         | 
| 
       169 
287 
     | 
    
         
             
            ```ruby
         
     | 
| 
       170 
288 
     | 
    
         
             
            node  = Yasuri.text_example '/html/body/p[1]', truncate:/H(.+)i/
         
     | 
| 
       171 
     | 
    
         
            -
            node. 
     | 
| 
      
 289 
     | 
    
         
            +
            node.scrape(uri)
         
     | 
| 
       172 
290 
     | 
    
         
             
            #=> { "example" => "ello,Yasur" }
         
     | 
| 
       173 
291 
     | 
    
         
             
            ```
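
The `truncate` rule can be sketched in plain Ruby, without Yasuri or network access. `truncate_text` is a hypothetical helper, not part of the Yasuri API, and returning an empty string when nothing matches is an assumption rather than documented behavior.

```ruby
# Hypothetical helper illustrating the `truncate` rule; not Yasuri's API.
def truncate_text(text, pattern)
  match = pattern.match(text)
  return "" unless match   # assumption: no match yields an empty string
  match[1] || match[0]     # first group if the regexp has one, else the whole match
end

truncate_text("Hello,Yasuri", /H(.+)i/)  #=> "ello,Yasur"
truncate_text("Hello,World", /^[^,]+/)   #=> "Hello"
```

The first call keeps only the captured group, while the second has no group and so keeps the whole match.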
         
     | 
| 
       174 
292 
     | 
    
         | 
| 
         @@ -179,21 +297,22 @@ If it is given `truncate` option, apply method after truncated. 
     | 
|
| 
       179 
297 
     | 
    
         | 
| 
       180 
298 
     | 
    
         
             
            ```ruby
         
     | 
| 
       181 
299 
     | 
    
         
             
            node = Yasuri.text_example '/html/body/p[1]', proc: :upcase, truncate:/H(.+)i/
         
     | 
| 
       182 
     | 
    
         
            -
            node. 
     | 
| 
      
 300 
     | 
    
         
            +
            node.scrape(uri)
         
     | 
| 
       183 
301 
     | 
    
         
             
            #=> { "example" => "ELLO,YASUR" }
         
     | 
| 
       184 
302 
     | 
    
         
             
            ```
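
The order in which the two options are applied can be sketched as follows. `apply_text_options` is a hypothetical helper written for illustration, not Yasuri's implementation, and the empty-string result for a failed match is an assumption.

```ruby
# Hypothetical helper: apply `truncate` first, then call the method named by `proc`.
def apply_text_options(text, truncate: nil, proc: nil)
  if truncate
    match = truncate.match(text)
    text = match ? (match[1] || match[0]) : ""  # assumption: no match yields ""
  end
  text = text.public_send(proc) if proc         # e.g. :upcase calls String#upcase
  text
end

apply_text_options("Hello,Yasuri", truncate: /H(.+)i/, proc: :upcase)  #=> "ELLO,YASUR"
```

The order matters here: if `upcase` ran first, the text would become `"HELLO,YASURI"` and `/H(.+)i/` would no longer match, because the pattern expects a lowercase `i`.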
         
     | 
| 
       185 
303 
     | 
    
         | 
| 
       186 
304 
     | 
    
         
             
            ## Struct Node
         
     | 
| 
       187 
305 
     | 
    
         
             
            Struct Node returns structured text.
         
     | 
| 
       188 
306 
     | 
    
         | 
| 
       189 
     | 
    
         
            -
            At first, Struct Node narrow down sub-tags by `Path`. 
     | 
| 
      
 307 
     | 
    
         
            +
            First, Struct Node narrows down the sub-tags by `Path`.
         
     | 
| 
      
 308 
     | 
    
         
            +
            Child nodes parse the narrowed tags, and the struct node returns a hash containing the parsed result.
         
     | 
| 
       190 
309 
     | 
    
         | 
| 
       191 
310 
     | 
    
         
             
            If the Struct Node `Path` matches multiple sub-tags, child nodes parse each sub-tag and the struct node returns an array.
         
     | 
| 
       192 
311 
     | 
    
         | 
| 
       193 
312 
     | 
    
         
             
            ### Example
         
     | 
| 
       194 
313 
     | 
    
         | 
| 
       195 
314 
     | 
    
         
             
            ```html
         
     | 
| 
       196 
     | 
    
         
            -
            <!-- http://yasuri.example.net -->
         
     | 
| 
      
 315 
     | 
    
         
            +
            <!-- http://yasuri.example.tac42.net -->
         
     | 
| 
       197 
316 
     | 
    
         
             
            <html>
         
     | 
| 
       198 
317 
     | 
    
         
             
              <head>
         
     | 
| 
       199 
318 
     | 
    
         
             
                <title>Books</title>
         
     | 
| 
         @@ -234,15 +353,12 @@ If Struct Node `Path` matches multi sub-tags, child nodes parse each sub-tags an 
     | 
|
| 
       234 
353 
     | 
    
         
             
            ```
         
     | 
| 
       235 
354 
     | 
    
         | 
| 
       236 
355 
     | 
    
         
             
            ```ruby
         
     | 
| 
       237 
     | 
    
         
            -
            agent = Mechanize.new
         
     | 
| 
       238 
     | 
    
         
            -
            page = agent.get("http://yasuri.example.net")
         
     | 
| 
       239 
     | 
    
         
            -
             
     | 
| 
       240 
356 
     | 
    
         
             
            node = Yasuri.struct_table '/html/body/table[1]/tr' do
         
     | 
| 
       241 
357 
     | 
    
         
             
              text_title    './td[1]'
         
     | 
| 
       242 
358 
     | 
    
         
             
              text_pub_date './td[2]'
         
     | 
| 
       243 
     | 
    
         
            -
             
     | 
| 
      
 359 
     | 
    
         
            +
            end
         
     | 
| 
       244 
360 
     | 
    
         | 
| 
       245 
     | 
    
         
            -
            node. 
     | 
| 
      
 361 
     | 
    
         
            +
            node.scrape("http://yasuri.example.tac42.net")
         
     | 
| 
       246 
362 
     | 
    
         
             
            #=> [ { "title"    => "The Perfect Insider",
         
     | 
| 
       247 
363 
     | 
    
         
             
            #       "pub_date" => "1996/4/5" },
         
     | 
| 
       248 
364 
     | 
    
         
             
            #     { "title"    => "Doctors in Isolated Room",
         
     | 
| 
         @@ -261,17 +377,14 @@ Struct node can contain not only Text node. 
     | 
|
| 
       261 
377 
     | 
    
         
             
            ### Example
         
     | 
| 
       262 
378 
     | 
    
         | 
| 
       263 
379 
     | 
    
         
             
            ```ruby
         
     | 
| 
       264 
     | 
    
         
            -
            agent = Mechanize.new
         
     | 
| 
       265 
     | 
    
         
            -
            page = agent.get("http://yasuri.example.net")
         
     | 
| 
       266 
     | 
    
         
            -
             
     | 
| 
       267 
380 
     | 
    
         
             
            node = Yasuri.struct_tables '/html/body/table' do
         
     | 
| 
       268 
381 
     | 
    
         
             
              struct_table './tr' do
         
     | 
| 
       269 
382 
     | 
    
         
             
                text_title    './td[1]'
         
     | 
| 
       270 
383 
     | 
    
         
             
                text_pub_date './td[2]'
         
     | 
| 
       271 
384 
     | 
    
         
             
              end
         
     | 
| 
       272 
     | 
    
         
            -
             
     | 
| 
      
 385 
     | 
    
         
            +
            end
         
     | 
| 
       273 
386 
     | 
    
         | 
| 
       274 
     | 
    
         
            -
            node. 
     | 
| 
      
 387 
     | 
    
         
            +
            node.scrape("http://yasuri.example.tac42.net")
         
     | 
| 
       275 
388 
     | 
    
         | 
| 
       276 
389 
     | 
    
         
             
            #=>      [ { "table" => [ { "title"    => "The Perfect Insider",
         
     | 
| 
       277 
390 
     | 
    
         
             
            #                           "pub_date" => "1996/4/5" },
         
     | 
| 
         @@ -304,7 +417,7 @@ Links Node returns parsed text in each linked pages. 
     | 
|
| 
       304 
417 
     | 
    
         | 
| 
       305 
418 
     | 
    
         
             
            ### Example
         
     | 
| 
       306 
419 
     | 
    
         
             
            ```html
         
     | 
| 
       307 
     | 
    
         
            -
            <!-- http://yasuri.example.net -->
         
     | 
| 
      
 420 
     | 
    
         
            +
            <!-- http://yasuri.example.tac42.net -->
         
     | 
| 
       308 
421 
     | 
    
         
             
            <html>
         
     | 
| 
       309 
422 
     | 
    
         
             
              <head><title>Yasuri Test</title></head>
         
     | 
| 
       310 
423 
     | 
    
         
             
              <body>
         
     | 
| 
         @@ -317,7 +430,7 @@ Links Node returns parsed text in each linked pages. 
     | 
|
| 
       317 
430 
     | 
    
         
             
            ```
         
     | 
| 
       318 
431 
     | 
    
         | 
| 
       319 
432 
     | 
    
         
             
            ```html
         
     | 
| 
       320 
     | 
    
         
            -
            <!-- http://yasuri.example.net/child01.html -->
         
     | 
| 
      
 433 
     | 
    
         
            +
            <!-- http://yasuri.example.tac42.net/child01.html -->
         
     | 
| 
       321 
434 
     | 
    
         
             
            <html>
         
     | 
| 
       322 
435 
     | 
    
         
             
              <head><title>Child 01 Test</title></head>
         
     | 
| 
       323 
436 
     | 
    
         
             
              <body>
         
     | 
| 
         @@ -331,7 +444,7 @@ Links Node returns parsed text in each linked pages. 
     | 
|
| 
       331 
444 
     | 
    
         
             
            ```
         
     | 
| 
       332 
445 
     | 
    
         | 
| 
       333 
446 
     | 
    
         
             
            ```html
         
     | 
| 
       334 
     | 
    
         
            -
            <!-- http://yasuri.example.net/child02.html -->
         
     | 
| 
      
 447 
     | 
    
         
            +
            <!-- http://yasuri.example.tac42.net/child02.html -->
         
     | 
| 
       335 
448 
     | 
    
         
             
            <html>
         
     | 
| 
       336 
449 
     | 
    
         
             
              <head><title>Child 02 Test</title></head>
         
     | 
| 
       337 
450 
     | 
    
         
             
              <body>
         
     | 
| 
         @@ -341,7 +454,7 @@ Links Node returns parsed text in each linked pages. 
     | 
|
| 
       341 
454 
     | 
    
         
             
            ```
         
     | 
| 
       342 
455 
     | 
    
         | 
| 
       343 
456 
     | 
    
         
             
            ```html
         
     | 
| 
       344 
     | 
    
         
            -
            <!-- http://yasuri.example.net/child03.html -->
         
     | 
| 
      
 457 
     | 
    
         
            +
            <!-- http://yasuri.example.tac42.net/child03.html -->
         
     | 
| 
       345 
458 
     | 
    
         
             
            <html>
         
     | 
| 
       346 
459 
     | 
    
         
             
              <head><title>Child 03 Test</title></head>
         
     | 
| 
       347 
460 
     | 
    
         
             
              <body>
         
     | 
| 
         @@ -354,20 +467,17 @@ Links Node returns parsed text in each linked pages. 
     | 
|
| 
       354 
467 
     | 
    
         
             
            ```
         
     | 
| 
       355 
468 
     | 
    
         | 
| 
       356 
469 
     | 
    
         
             
            ```ruby
         
     | 
| 
       357 
     | 
    
         
            -
            agent = Mechanize.new
         
     | 
| 
       358 
     | 
    
         
            -
            page = agent.get("http://yasuri.example.net")
         
     | 
| 
       359 
     | 
    
         
            -
             
     | 
| 
       360 
470 
     | 
    
         
             
            node = Yasuri.links_title '/html/body/a' do
         
     | 
| 
       361 
471 
     | 
    
         
             
              text_content '/html/body/p'
         
     | 
| 
       362 
472 
     | 
    
         
             
            end
         
     | 
| 
       363 
473 
     | 
    
         | 
| 
       364 
     | 
    
         
            -
            node. 
     | 
| 
      
 474 
     | 
    
         
            +
            node.scrape("http://yasuri.example.tac42.net")
         
     | 
| 
       365 
475 
     | 
    
         
             
            #=> [ {"content" => "Child 01 page."},
         
     | 
| 
       366 
476 
     | 
    
         
             
            #     {"content" => "Child 02 page."},
         
     | 
| 
       367 
477 
     | 
    
         
             
            #     {"content" => "Child 03 page."}]
         
     | 
| 
       368 
478 
     | 
    
         
             
            ```
         
     | 
| 
       369 
479 
     | 
    
         | 
| 
       370 
     | 
    
         
            -
            At first, Links Node find all links in the page by path. In this case, LinksNode find `/html/body/a` tags in `http://yasuri.example.net`. Then, open href attributes (`./child01.html`, `./child02.html` and `./child03.html`).
         
     | 
| 
      
 480 
     | 
    
         
            +
            First, Links Node finds all links in the page by path. In this case, LinksNode finds the `/html/body/a` tags in `http://yasuri.example.tac42.net`. Then it opens the href attributes (`./child01.html`, `./child02.html` and `./child03.html`).
         
     | 
| 
       371 
481 
     | 
    
         | 
| 
       372 
482 
     | 
    
         
             
            Then, Links Node applies the child nodes to each opened page and returns the results as an array, one entry per page.
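
Relative hrefs such as `./child01.html` have to be resolved against the URL of the page that contains them before they can be opened. The sketch below shows that resolution with Ruby's standard `uri` library. `resolve_links` is a hypothetical helper; this document does not show how Yasuri or Mechanize resolves links internally.

```ruby
require 'uri'

# Resolve each relative href against the URL of the page that contains it.
def resolve_links(base_url, hrefs)
  hrefs.map { |href| URI.join(base_url, href).to_s }
end

puts resolve_links(
  "http://yasuri.example.tac42.net",
  ["./child01.html", "./child02.html", "./child03.html"]
)
```

The resulting URLs correspond to the child pages listed in the HTML examples above.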
         
     | 
| 
       373 
483 
     | 
    
         | 
| 
         @@ -381,7 +491,7 @@ Paginate Node parses and returns each pages that provid by paginate. 
     | 
|
| 
       381 
491 
     | 
    
         
             
            The target page `page01.html` looks like this. `page02.html` to `page04.html` are similar.
         
     | 
| 
       382 
492 
     | 
    
         | 
| 
       383 
493 
     | 
    
         
             
            ```html
         
     | 
| 
       384 
     | 
    
         
            -
            <!-- http://yasuri.example.net/page01.html -->
         
     | 
| 
      
 494 
     | 
    
         
            +
            <!-- http://yasuri.example.tac42.net/page01.html -->
         
     | 
| 
       385 
495 
     | 
    
         
             
            <html>
         
     | 
| 
       386 
496 
     | 
    
         
             
              <head><title>Page01</title></head>
         
     | 
| 
       387 
497 
     | 
    
         
             
              <body>
         
     | 
| 
         @@ -401,21 +511,17 @@ Target page `page01.html` is like this. `page02.html` to `page04.html` are simil 
     | 
|
| 
       401 
511 
     | 
    
         
             
            ```
         
     | 
| 
       402 
512 
     | 
    
         | 
| 
       403 
513 
     | 
    
         
             
            ```ruby
         
     | 
| 
       404 
     | 
    
         
            -
             
     | 
| 
       405 
     | 
    
         
            -
            page = agent.get("http://yasuri.example.net/page01.html")
         
     | 
| 
       406 
     | 
    
         
            -
             
     | 
| 
       407 
     | 
    
         
            -
            node = Yasuri.pages_root "/html/body/nav/span/a[@class='next']" do
         
     | 
| 
      
 514 
     | 
    
         
            +
            node = Yasuri.pages_root "/html/body/nav/span/a[@class='next']" , limit:3 do
         
     | 
| 
       408 
515 
     | 
    
         
             
                     text_content '/html/body/p'
         
     | 
| 
       409 
516 
     | 
    
         
             
                   end
         
     | 
| 
       410 
517 
     | 
    
         | 
| 
       411 
     | 
    
         
            -
            node. 
     | 
| 
       412 
     | 
    
         
            -
            #=> [ {"content" => " 
     | 
| 
       413 
     | 
    
         
            -
             
     | 
| 
       414 
     | 
    
         
            -
             
     | 
| 
       415 
     | 
    
         
            -
                  {"content" => "Pagination04"}]
         
     | 
| 
      
 518 
     | 
    
         
            +
            node.scrape("http://yasuri.example.tac42.net/page01.html")
         
     | 
| 
      
 519 
     | 
    
         
            +
            #=> [ {"content" => "Pagination01"},
         
     | 
| 
      
 520 
     | 
    
         
            +
            #     {"content" => "Pagination02"},
         
     | 
| 
      
 521 
     | 
    
         
            +
            #     {"content" => "Pagination03"}]
         
     | 
| 
       416 
522 
     | 
    
         
             
            ```
         
     | 
| 
       417 
     | 
    
         
            -
             
     | 
| 
       418 
     | 
    
         
            -
             
     | 
| 
      
 523 
     | 
    
         
            +
            Paginate Node requires a link to the next page.
         
     | 
| 
      
 524 
     | 
    
         
            +
            In this case, it is `NextPage` `/html/body/nav/span/a[@class='next']`.
         
     | 
| 
       419 
525 
     | 
    
         | 
| 
       420 
526 
     | 
    
         
             
            ### Options
         
     | 
| 
       421 
527 
     | 
    
         
             
            ##### `limit`
         
     | 
| 
         @@ -425,7 +531,7 @@ Upper limit of open pages in pagination. 
     | 
|
| 
       425 
531 
     | 
    
         
             
            node = Yasuri.pages_root "/html/body/nav/span/a[@class='next']" , limit:2 do
         
     | 
| 
       426 
532 
     | 
    
         
             
                     text_content '/html/body/p'
         
     | 
| 
       427 
533 
     | 
    
         
             
                   end
         
     | 
| 
       428 
     | 
    
         
            -
            node. 
     | 
| 
      
 534 
     | 
    
         
            +
            node.scrape(uri)
         
     | 
| 
       429 
535 
     | 
    
         
             
            #=> [ {"content" => "Pagination01"}, {"content" => "Pagination02"}]
         
     | 
| 
       430 
536 
     | 
    
         
             
            ```
         
     | 
| 
       431 
537 
     | 
    
         
             
            Paginate Node opens at most 2 pages, as given by `limit`. In this case the pagination has 4 pages, but the result Array has 2 texts because `limit:2` was given.
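
The effect of `limit` can be sketched with an in-memory model of the four pages. Everything here is hypothetical (`sample_pages`, `paginate`, and the page contents); it only illustrates following "next" links until the limit is reached, and is not Yasuri's implementation.

```ruby
# Hypothetical in-memory site: each page has content and an optional next page.
def sample_pages
  {
    "page01.html" => { content: "Pagination01", next: "page02.html" },
    "page02.html" => { content: "Pagination02", next: "page03.html" },
    "page03.html" => { content: "Pagination03", next: "page04.html" },
    "page04.html" => { content: "Pagination04", next: nil }
  }
end

# Follow "next" links from `start`, opening at most `limit` pages (nil = no limit).
def paginate(pages, start, limit: nil)
  results = []
  url = start
  while url && (limit.nil? || results.size < limit)
    page = pages.fetch(url)
    results << { "content" => page[:content] }
    url = page[:next]
  end
  results
end

paginate(sample_pages, "page01.html", limit: 2)
#=> [{"content" => "Pagination01"}, {"content" => "Pagination02"}]
```

Without `limit`, the loop stops only when a page has no next link, so all four pages are returned.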
         
     | 
| 
         @@ -434,33 +540,177 @@ Paginate Node open upto 2 given by `limit`. In this situation, pagination has 4 
     | 
|
| 
       434 
540 
     | 
    
         
             
            The `flatten` option expands the results of each page into a single flat array.
         
     | 
| 
       435 
541 
     | 
    
         | 
| 
       436 
542 
     | 
    
         
             
            ```ruby
         
     | 
| 
       437 
     | 
    
         
            -
            agent = Mechanize.new
         
     | 
| 
       438 
     | 
    
         
            -
            page = agent.get("http://yasuri.example.net/page01.html")
         
     | 
| 
       439 
     | 
    
         
            -
             
     | 
| 
       440 
543 
     | 
    
         
             
            node = Yasuri.pages_root "/html/body/nav/span/a[@class='next']" , flatten:true do
         
     | 
| 
       441 
544 
     | 
    
         
             
                     text_title   '/html/head/title'
         
     | 
| 
       442 
545 
     | 
    
         
             
                     text_content '/html/body/p'
         
     | 
| 
       443 
546 
     | 
    
         
             
                   end
         
     | 
| 
       444 
     | 
    
         
            -
            node. 
     | 
| 
      
 547 
     | 
    
         
            +
            node.scrape("http://yasuri.example.tac42.net/page01.html")
         
     | 
| 
       445 
548 
     | 
    
         | 
| 
       446 
549 
     | 
    
         
             
            #=> [ {"title" => "Page01",
         
     | 
| 
       447 
     | 
    
         
            -
             
     | 
| 
       448 
     | 
    
         
            -
             
     | 
| 
       449 
     | 
    
         
            -
             
     | 
| 
       450 
     | 
    
         
            -
             
     | 
| 
       451 
     | 
    
         
            -
             
     | 
| 
      
 550 
     | 
    
         
            +
            #      "content" => "Pagination01"},
         
     | 
| 
      
 551 
     | 
    
         
            +
            #     {"title"   => "Page01",
         
     | 
| 
      
 552 
     | 
    
         
            +
            #      "content" => "Pagination02"},
         
     | 
| 
      
 553 
     | 
    
         
            +
            #     {"title"   => "Page01",
         
     | 
| 
      
 554 
     | 
    
         
            +
            #      "content" => "Pagination03"}]
         
     | 
| 
       452 
555 
     | 
    
         | 
| 
       453 
556 
     | 
    
         | 
| 
       454 
557 
     | 
    
         
             
            node = Yasuri.pages_root "/html/body/nav/span/a[@class='next']" , flatten:true do
         
     | 
| 
       455 
558 
     | 
    
         
             
                    text_title   '/html/head/title'
         
     | 
| 
       456 
559 
     | 
    
         
             
                    text_content '/html/body/p'
         
     | 
| 
       457 
560 
     | 
    
         
             
                  end
         
     | 
| 
       458 
     | 
    
         
            -
            node. 
     | 
| 
      
 561 
     | 
    
         
            +
            node.scrape("http://yasuri.example.tac42.net/page01.html")
         
     | 
| 
       459 
562 
     | 
    
         | 
| 
       460 
563 
     | 
    
         
             
            #=> [ "Page01",
         
     | 
| 
       461 
     | 
    
         
            -
             
     | 
| 
       462 
     | 
    
         
            -
             
     | 
| 
       463 
     | 
    
         
            -
             
     | 
| 
       464 
     | 
    
         
            -
             
     | 
| 
       465 
     | 
    
         
            -
             
     | 
| 
      
 564 
     | 
    
         
            +
            #     "Pagination01",
         
     | 
| 
      
 565 
     | 
    
         
            +
            #     "Page02",
         
     | 
| 
      
 566 
     | 
    
         
            +
            #     "Pagination02",
         
     | 
| 
      
 567 
     | 
    
         
            +
            #     "Page03",
         
     | 
| 
      
 568 
     | 
    
         
            +
            #     "Pagination03"]
         
     | 
| 
      
 569 
     | 
    
         
            +
            ```
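
What `flatten` does to the per-page results can be sketched in plain Ruby. The sample data and the `flatten_results` helper are hypothetical and only mirror the shape of the output shown above; they are not Yasuri's implementation.

```ruby
# Per-page results as a Paginate Node might collect them (hypothetical sample data).
def sample_results
  [
    { "title" => "Page01", "content" => "Pagination01" },
    { "title" => "Page02", "content" => "Pagination02" },
    { "title" => "Page03", "content" => "Pagination03" }
  ]
end

# Expand each page's hash into its values and concatenate them in page order.
def flatten_results(results)
  results.flat_map(&:values)
end

flatten_results(sample_results)
#=> ["Page01", "Pagination01", "Page02", "Pagination02", "Page03", "Pagination03"]
```

The keys are dropped in the flattened form, so it suits cases where only the order of the values matters.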
         
     | 
| 
      
 570 
     | 
    
         
            +
             
     | 
| 
      
 571 
     | 
    
         
            +
            ## Map Node
         
     | 
| 
      
 572 
     | 
    
         
            +
            *MapNode* is a node that summarizes the results of scraping. This node is always a branch node in the parse tree.
         
     | 
| 
      
 573 
     | 
    
         
            +
             
     | 
| 
      
 574 
     | 
    
         
            +
            ### Example
         
     | 
| 
      
 575 
     | 
    
         
            +
             
     | 
| 
      
 576 
     | 
    
         
            +
            ```html
         
     | 
| 
      
 577 
     | 
    
         
            +
            <!-- http://yasuri.example.tac42.net -->
         
     | 
| 
      
 578 
     | 
    
         
            +
            <html>
         
     | 
| 
      
 579 
     | 
    
         
            +
              <head><title>Yasuri Example</title></head>
         
     | 
| 
      
 580 
     | 
    
         
            +
              <body>
         
     | 
| 
      
 581 
     | 
    
         
            +
                <p>Hello,World</p>
         
     | 
| 
      
 582 
     | 
    
         
            +
                <p>Hello,Yasuri</p>
         
     | 
| 
      
 583 
     | 
    
         
            +
              </body>
         
     | 
| 
      
 584 
     | 
    
         
            +
            </html>
         
     | 
| 
      
 585 
     | 
    
         
            +
            ```
         
     | 
| 
      
 586 
     | 
    
         
            +
             
     | 
| 
      
 587 
     | 
    
         
            +
            ```ruby
         
     | 
| 
      
 588 
     | 
    
         
            +
            tree = Yasuri.map_root do
         
     | 
| 
      
 589 
     | 
    
         
            +
              text_title  '/html/head/title'
         
     | 
| 
      
 590 
     | 
    
         
            +
              text_body_p '/html/body/p[1]'
         
     | 
| 
      
 591 
     | 
    
         
            +
            end
         
     | 
| 
      
 592 
     | 
    
         
            +
             
     | 
| 
      
 593 
     | 
    
         
            +
            tree.scrape("http://yasuri.example.tac42.net") #=> { "title" => "Yasuri Example", "body_p" => "Hello,World" }
         
     | 
| 
      
 594 
     | 
    
         
            +
             
     | 
| 
      
 595 
     | 
    
         
            +
             
     | 
| 
      
 596 
     | 
    
         
            +
            tree = Yasuri.map_root do
         
     | 
| 
      
 597 
     | 
    
         
            +
              map_group1 { text_child01  '/html/body/a[1]' }
         
     | 
| 
      
 598 
     | 
    
         
            +
              map_group2 do
         
     | 
| 
      
 599 
     | 
    
         
            +
                text_child01 '/html/body/a[1]'
         
     | 
| 
      
 600 
     | 
    
         
            +
                text_child03 '/html/body/a[3]'
         
     | 
| 
      
 601 
     | 
    
         
            +
              end
         
     | 
| 
      
 602 
     | 
    
         
            +
            end
         
     | 
| 
      
 603 
     | 
    
         
            +
             
     | 
| 
      
 604 
     | 
    
         
            +
            tree.scrape("http://yasuri.example.tac42.net") #=> {
         
     | 
| 
      
 605 
     | 
    
         
            +
            #         "group1" => {
         
     | 
| 
      
 606 
     | 
    
         
            +
            #           "child01" => "child01"
         
     | 
| 
      
 607 
     | 
    
         
            +
            #         },
         
     | 
| 
      
 608 
     | 
    
         
            +
            #         "group2" => {
         
     | 
| 
      
 609 
     | 
    
         
            +
            #           "child01" => "child01",
         
     | 
| 
      
 610 
     | 
    
         
            +
            #           "child03" => "child03"
         
     | 
| 
      
 611 
     | 
    
         
            +
            #         }
         
     | 
| 
      
 612 
     | 
    
         
            +
            # }
         
     | 
| 
      
 613 
     | 
    
         
            +
            ```
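
Because a `map` node returns a nested Hash, the individual values can be read with ordinary Hash methods. The following is a minimal sketch that does not call Yasuri at all: it starts from the result shown in the example above and relies only on Ruby's standard `Hash#dig`. The variable names are illustrative.

```ruby
# Result copied from the map_root example above (no network access needed).
result = {
  "group1" => { "child01" => "child01" },
  "group2" => { "child01" => "child01", "child03" => "child03" }
}

# Hash#dig walks nested keys and returns nil when any key is missing.
first_link = result.dig("group1", "child01")
third_link = result.dig("group2", "child03")
missing    = result.dig("group3", "child01")

puts first_link  # prints child01
puts third_link  # prints child03
p missing        # prints nil
```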
         
     | 
| 
      
 614 
     | 
    
         
            +
             
     | 
| 
      
 615 
     | 
    
         
            +
            ### Options
         
     | 
| 
      
 616 
     | 
    
         
            +
            None.
         
     | 
| 
      
 617 
     | 
    
         
            +
             
     | 
| 
      
 618 
     | 
    
         
            +
             
     | 
| 
      
 619 
     | 
    
         
            +
            -------------------------
         
     | 
| 
      
 620 
     | 
    
         
            +
            ## Usage
         
     | 
| 
      
 621 
     | 
    
         
            +
             
     | 
| 
      
 622 
     | 
    
         
            +
            ### Use as library
         
     | 
| 
      
 623 
     | 
    
         
            +
            When used as a library, the parse tree can be defined with the Ruby DSL or in json or yaml format.
         
     | 
| 
      
 624 
     | 
    
         
            +
             
     | 
| 
      
 625 
     | 
    
         
            +
            ```ruby
         
     | 
| 
      
 626 
     | 
    
         
            +
            require 'yasuri'
         
     | 
| 
      
 627 
     | 
    
         
            +
             
     | 
| 
      
 628 
     | 
    
         
            +
            # 1. Create a parse tree.
         
     | 
| 
      
 629 
     | 
    
         
            +
            # Define with the Ruby DSL
         
     | 
| 
      
 630 
     | 
    
         
            +
            tree = Yasuri.links_title '/html/body/a' do
         
     | 
| 
      
 631 
     | 
    
         
            +
                     text_name '/html/body/p'
         
     | 
| 
      
 632 
     | 
    
         
            +
                   end
         
     | 
| 
      
 633 
     | 
    
         
            +
             
     | 
| 
      
 634 
     | 
    
         
            +
            # Define with JSON
         
     | 
| 
      
 635 
     | 
    
         
            +
            src = <<-EOJSON
         
     | 
| 
      
 636 
     | 
    
         
            +
            {
         
     | 
| 
      
 637 
     | 
    
         
            +
              "links_title": {
         
     | 
| 
      
 638 
     | 
    
         
            +
                "path": "/html/body/a",
         
     | 
| 
      
 639 
     | 
    
         
            +
                "text_name": "/html/body/p"
         
     | 
| 
      
 640 
     | 
    
         
            +
              }
         
     | 
| 
      
 641 
     | 
    
         
            +
            }
         
     | 
| 
      
 642 
     | 
    
         
            +
            EOJSON
         
     | 
| 
      
 643 
     | 
    
         
            +
            tree = Yasuri.json2tree(src)
         
     | 
| 
      
 644 
     | 
    
         
            +
             
     | 
| 
      
 645 
     | 
    
         
            +
             
     | 
| 
      
 646 
     | 
    
         
            +
            # Define with YAML
         
     | 
| 
      
 647 
     | 
    
         
            +
            src = <<-EOYAML
         
     | 
| 
      
 648 
     | 
    
         
            +
            links_title:
         
     | 
| 
      
 649 
     | 
    
         
            +
              path: "/html/body/a"
         
     | 
| 
      
 650 
     | 
    
         
            +
              text_name: "/html/body/p"
         
     | 
| 
      
 651 
     | 
    
         
            +
            EOYAML
         
     | 
| 
      
 652 
     | 
    
         
            +
            tree = Yasuri.yaml2tree(src)
         
     | 
| 
      
 653 
     | 
    
         
            +
             
     | 
| 
      
 654 
     | 
    
         
            +
            # 2. Give the URL to start parsing
         
     | 
| 
      
 655 
     | 
    
         
            +
            tree.scrape(uri)
         
     | 
| 
      
 656 
     | 
    
         
            +
            ```
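
The JSON and YAML definitions are meant to describe the same tree. The sketch below checks this using only Ruby's standard `json` and `yaml` libraries; it does not call Yasuri, and the variable names are illustrative. It assumes the JSON document is well formed, with every key quoted.

```ruby
require 'json'
require 'yaml'

json_src = <<~EOJSON
  {
    "links_title": {
      "path": "/html/body/a",
      "text_name": "/html/body/p"
    }
  }
EOJSON

yaml_src = <<~EOYAML
  links_title:
    path: "/html/body/a"
    text_name: "/html/body/p"
EOYAML

# Both parsers return a Hash with String keys, so the results compare directly.
json_tree = JSON.parse(json_src)
yaml_tree = YAML.safe_load(yaml_src)

same_tree = (json_tree == yaml_tree)
puts same_tree # prints true
```

Comparing the parsed hashes like this is a cheap way to catch a malformed definition before handing it to `Yasuri.json2tree` or `Yasuri.yaml2tree`.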
         
     | 
| 
      
 657 
     | 
    
         
            +
             
     | 
| 
      
 658 
     | 
    
         
            +
            ### Use as CLI tool
         
     | 
| 
      
 659 
     | 
    
         
            +
             
     | 
| 
      
 660 
     | 
    
         
            +
            **Help**
         
     | 
| 
      
 661 
     | 
    
         
            +
            ```sh
         
     | 
| 
      
 662 
     | 
    
         
            +
            $ yasuri help scrape
         
     | 
| 
      
 663 
     | 
    
         
            +
            Usage:
         
     | 
| 
      
 664 
     | 
    
         
            +
              yasuri scrape <URI> [[--file <TREE_FILE>] or [--json <JSON>]]
         
     | 
| 
      
 665 
     | 
    
         
            +
             
     | 
| 
      
 666 
     | 
    
         
            +
            Options:
         
     | 
| 
      
 667 
     | 
    
         
            +
              f, [--file=FILE]  # path to file that written yasuri tree as json or yaml
         
     | 
| 
      
 668 
     | 
    
         
            +
              j, [--json=JSON]  # yasuri tree format json string
         
     | 
| 
      
 669 
     | 
    
         
            +
              i, [--interval=N]  # interval each request [ms]
         
     | 
| 
      
 670 
     | 
    
         
            +
             
     | 
| 
      
 671 
     | 
    
         
            +
            Getting from <URI> and scrape it. with <JSON> or json/yml from <TREE_FILE>. They should be Yasuri's format json or yaml string.
         
     | 
| 
      
 672 
     | 
    
         
            +
            ```
         
     | 
| 
      
 673 
     | 
    
         
            +
             
     | 
| 
      
 674 
     | 
    
         
            +
            In the CLI tool, you can specify the parse tree in either of the following ways.
         
     | 
| 
      
 675 
     | 
    
         
            +
            + `--file`, `-f` : reads the parse tree from a file written in json or yaml format.
         
     | 
| 
      
 676 
     | 
    
         
            +
            + `--json`, `-j` : passes the parse tree directly as a json string.
         
     | 
| 
      
 677 
     | 
    
         
            +
             
     | 
| 
      
 678 
     | 
    
         
            +
             
     | 
| 
      
 679 
     | 
    
         
            +
            **Example of specifying a parse tree as a file**
         
     | 
| 
      
 680 
     | 
    
         
            +
            ```sh
         
     | 
| 
      
 681 
     | 
    
         
            +
            % cat sample.yml
         
     | 
| 
      
 682 
     | 
    
         
            +
            text_title: "/html/head/title"
         
     | 
| 
      
 683 
     | 
    
         
            +
            text_desc: "//*[@id=\"intro\"]/p"
         
     | 
| 
      
 684 
     | 
    
         
            +
             
     | 
| 
      
 685 
     | 
    
         
            +
            % yasuri scrape "https://www.ruby-lang.org/en/" --file sample.yml
         
     | 
| 
      
 686 
     | 
    
         
            +
            {"title":"Ruby Programming Language","desc":"\n    A dynamic, open source programming language with a focus on\n    simplicity and productivity. It has an elegant syntax that is\n    natural to read and easy to write.\n    "}
         
     | 
| 
      
 687 
     | 
    
         
            +
             
     | 
| 
      
 688 
     | 
    
         
            +
            % cat sample.json
         
     | 
| 
      
 689 
     | 
    
         
            +
            {
         
     | 
| 
      
 690 
     | 
    
         
            +
              "text_title": "/html/head/title",
         
     | 
| 
      
 691 
     | 
    
         
            +
              "text_desc": "//*[@id=\"intro\"]/p"
         
     | 
| 
      
 692 
     | 
    
         
            +
            }
         
     | 
| 
      
 693 
     | 
    
         
            +
             
     | 
| 
      
 694 
     | 
    
         
            +
            % yasuri scrape "https://www.ruby-lang.org/en/" --file sample.json
         
     | 
| 
      
 695 
     | 
    
         
            +
            {"title":"Ruby Programming Language","desc":"\n    A dynamic, open source programming language with a focus on\n    simplicity and productivity. It has an elegant syntax that is\n    natural to read and easy to write.\n    "}
         
     | 
| 
       466 
696 
     | 
    
         
             
            ```
         
     | 
| 
      
 697 
     | 
    
         
            +
             
     | 
| 
      
 698 
     | 
    
         
            +
            The file format (json or yaml) is detected automatically.
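
One way such detection could work is to try the stricter format first and fall back to the other. The sketch below is an assumption for illustration, not Yasuri's actual implementation: `parse_tree_source` is a hypothetical helper, and only Ruby's standard `json` and `yaml` libraries are used.

```ruby
require 'json'
require 'yaml'

# Hypothetical helper, not part of Yasuri's API:
# try JSON first, then fall back to YAML when JSON parsing fails.
def parse_tree_source(src)
  [:json, JSON.parse(src)]
rescue JSON::ParserError
  [:yaml, YAML.safe_load(src)]
end

json_format, json_tree = parse_tree_source('{"text_title": "/html/head/title"}')
yaml_format, yaml_tree = parse_tree_source("text_title: \"/html/head/title\"\n")

puts json_format # prints json
puts yaml_format # prints yaml
```

JSON is tried first because most JSON documents also parse as YAML, so the reverse order would rarely report a file as JSON.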
         
     | 
| 
      
 699 
     | 
    
         
            +
             
     | 
| 
      
 700 
     | 
    
         
            +
            **Example of specifying a parse tree directly in json**
         
     | 
| 
      
 701 
     | 
    
         
            +
            ```sh
         
     | 
| 
      
 702 
     | 
    
         
            +
            $ yasuri scrape "https://www.ruby-lang.org/en/" -j '
         
     | 
| 
      
 703 
     | 
    
         
            +
            {
         
     | 
| 
      
 704 
     | 
    
         
            +
              "text_title": "/html/head/title",
         
     | 
| 
      
 705 
     | 
    
         
            +
              "text_desc": "//*[@id=\"intro\"]/p"
         
     | 
| 
      
 706 
     | 
    
         
            +
            }'
         
     | 
| 
      
 707 
     | 
    
         
            +
             
     | 
| 
      
 708 
     | 
    
         
            +
            {"title":"Ruby Programming Language","desc":"\n    A dynamic, open source programming language with a focus on\n    simplicity and productivity. It has an elegant syntax that is\n    natural to read and easy to write.\n    "}
         
     | 
| 
      
 709 
     | 
    
         
            +
            ```
         
     | 
| 
      
 710 
     | 
    
         
            +
             
     | 
| 
      
 711 
     | 
    
         
            +
            #### Other options
         
     | 
| 
      
 712 
     | 
    
         
            +
            + `--interval`, `-i` : The interval, in milliseconds, between requests when multiple pages are fetched.
         
     | 
| 
      
 713 
     | 
    
         
            +
               **Example: Request at 1 second intervals**
         
     | 
| 
      
 714 
     | 
    
         
            +
               ```sh
         
     | 
| 
      
 715 
     | 
    
         
            +
               $ yasuri scrape "https://www.ruby-lang.org/en/" --file sample.yml --interval 1000
         
     | 
| 
      
 716 
     | 
    
         
            +
               ```
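
The effect of a request interval can be sketched in plain Ruby. This is an illustration under stated assumptions, not Yasuri's implementation: `each_with_interval` is a hypothetical helper, the URLs are placeholders, and no network request is made.

```ruby
# Hypothetical helper (not part of Yasuri): call the block once per URL,
# pausing interval_ms milliseconds between consecutive calls.
def each_with_interval(urls, interval_ms)
  urls.each_with_index.map do |url, index|
    sleep(interval_ms / 1000.0) if index.positive?
    yield url
  end
end

urls = [
  "https://example.com/page/1",
  "https://example.com/page/2",
  "https://example.com/page/3"
]

started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
# A real scraper would fetch each URL here; this sketch only records it.
visited = each_with_interval(urls, 10) { |url| url }
elapsed = Process.clock_gettime(Process::CLOCK_MONOTONIC) - started

puts visited.length # prints 3
```

With three URLs and a 10 ms interval the helper sleeps twice, so the whole run takes roughly 20 ms or more.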
         
     |