RubyGems - yasuri - Versions diffs - 0.0.11 → 1.9.11 - Mend

yasuri 0.0.11 → 1.9.11

Files changed (14)

checksums.yaml +4 -4
data/README.md +18 -0
data/USAGE.ja.md +433 -0
data/USAGE.md +431 -0
data/lib/yasuri/version.rb +1 -1
data/lib/yasuri/yasuri.rb +3 -2
data/lib/yasuri/yasuri_node.rb +1 -1
data/lib/yasuri/yasuri_paginate_node.rb +2 -2
data/lib/yasuri/yasuri_text_node.rb +4 -1
data/spec/spec_helper.rb +3 -2
data/spec/yasuri_spec.rb +3 -3
data/spec/yasuri_struct_node_spec.rb +3 -3
data/spec/yasuri_text_node_spec.rb +1 -1
metadata +4 -2

checksums.yaml CHANGED

@@ -1,7 +1,7 @@
 ---
 SHA1:
-  metadata.gz: 2d97799a28c2b4d1991ce32abf50b81bce1cae5f
-  data.tar.gz: d94f15f6056e52f87feb1538964e8a049afb7fcb
+  metadata.gz: 28e6a3903cec3d8036b718a7c5de4fb8df8dfbfa
+  data.tar.gz: fbfb4c2b3a042410a7d05b1604e004bdffd82779
 SHA512:
-  metadata.gz: 193bb19ac39e9ea1ca74b2c44dc8c66840f630dcdcd32797acf28a6add47b9a5d878a9ffe3a482a544eccc75dc0f34aa03a4d199dffd311b6dc134f6471b5d35
-  data.tar.gz: 3a360133ce54adb4bcc16637a53d78d1cbd5402d64c47b4afc8e852918f4ea613b4eea9ab20e8bf8c325cb67bf57c31577fa45ba9c04b98d186aa1a72b913f64
+  metadata.gz: e4312623046ecf451ef261b1d99164b7f79b861ad8d7fe49d7ba8fd2319cd840978ae88235e6fb10b6991cd5131ebec966ec56e3bddece8fbbefd0b53d4a9dfe
+  data.tar.gz: 54639066aa4511309a1f712b8920c81c09a7489e0cab6a91c453d2aeca07d4f1c3bed167ed25746411d5f87fa9cc9fff8c6a32567c4df8d1990f12cb053286e2

data/README.md CHANGED

@@ -2,6 +2,16 @@
 Yasuri (鑢) is an easy web-scraping library for supporting "[Mechanize](https://github.com/sparklemotion/mechanize)".
+Yasuri simplifies tasks that come up frequently in scraping.
+For example,
++ Open each link in a page, scrape each linked page, and get the results as Hashes.
++ Scrape several texts in a page and store them under names in a Hash.
++ Scrape each table that appears repeatedly in a page and get the results as an array.
++ Scrape only the top 3 of the pages provided by pagination.
+You can implement these easily with Yasuri.
 ## Sample
@@ -17,6 +27,14 @@ Add this line to your application's Gemfile:
 gem 'yasuri'
 ```
+or
+```ruby
+# for Ruby 1.9.3 or lower
+gem 'yasuri', '~> 1.9'
+```
 And then execute:
     $ bundle

data/USAGE.ja.md ADDED

@@ -0,0 +1,433 @@
+# Yasuri の使い方
+## Yasuri とは
+Yasuri (鑢) は簡単にWebスクレイピングを行うための、"[Mechanize](https://github.com/sparklemotion/mechanize)" をサポートするライブラリです．
+Yasuriは、スクレイピングにおける、よくある処理を簡単に記述することができます．
+例えば、
++ ページ内の複数のリンクを開いて、各ページをスクレイピングした結果をHashで取得する
++ ページ内の複数のテキストをスクレイピングし、名前をつけてHashにする
++ ページ内に繰り返し出現するテーブルをそれぞれスクレイピングして、配列として取得する
++ ページネーションで提供される各ページのうち、上位3つだけを順にスクレイピングする
+これらを簡単に実装することができます．
+## クイックスタート
+```
+$ gem install yasuri
+```
+```ruby
+require 'yasuri'
+require 'mechanize'
+# Node tree constructing by DSL
+root = Yasuri.links_root '//*[@id="menu"]/ul/li/a' do
+         text_title '//*[@id="contents"]/h2'
+         text_content '//*[@id="contents"]/p[1]'
+       end
+agent = Mechanize.new
+root_page = agent.get("http://some.scraping.page.net/")
+result = root.inject(agent, root_page)
+# => [ {"title" => "PageTitle1", "content" => "Page Contents1" },
+#      {"title" => "PageTitle2", "content" => "Page Contents2" }, ...  ]
+```
+この例では、 LinkNode(`links_root`)の xpath で指定された各リンク先のページから、TextNode(`text_title`,`text_content`) の xpath で指定された2つのテキストをスクレイピングする例です．
+(言い換えると、`//*[@id="menu"]/ul/li/a` で示される各リンクを開いて、`//*[@id="contents"]/h2` と `//*[@id="contents"]/p[1]` で指定されたテキストをスクレイピングします)
+## 基本
+1. パースツリーを作る
+2. Mechanize の agent と対象のページを与えてパースを開始する
+### パースツリーを作る
+```ruby
+require 'mechanize'
+require 'yasuri'
+# 1. パースツリーを作る
+tree = Yasuri.links_title '/html/body/a' do
+         text_name '/html/body/p'
+       end
+# 2. Mechanize の agent と対象のページを与えてパースを開始する
+agent = Mechanize.new
+page = agent.get(uri)
+tree.inject(agent, page)
+```
+ツリーは、DSLまたはjsonで定義することができます．上の例ではDSLで定義しています．
+以下は、jsonで上記と等価な解析ツリーを定義した例です．
+```ruby
+# json で構成する場合
+src = <<-EOJSON
+   { "node"     : "links",
+     "name"     : "title",
+     "path"     : "/html/body/a",
+     "children" : [
+                    { "node" : "text",
+                      "name" : "name",
+                      "path" : "/html/body/p"
+                    }
+                  ]
+   }
+EOJSON
+tree = Yasuri.json2tree(src)
+```
+### Node
+ツリーは入れ子になった *Node* で構成されます．
+Node は `Type`, `Name`, `Path`, `Children`, `Options` を持っています．
+Nodeは以下のフォーマットで定義されます．
+```ruby
+# トップレベル
+Yasuri.<Type>_<Name> <Path> [,<Options>]
+# 入れ子になっている場合
+Yasuri.<Type>_<Name> <Path> [,<Options>] do
+  <Type>_<Name> <Path> [,<Options>] do
+    <Children>
+  end
+end
+```
+#### Type
+*Type* は Nodeの振る舞いを示します．Typeには以下のものがあります．
+- *Text*
+- *Struct*
+- *Links*
+- *Paginate*
+#### Name
+*Name* は 解析結果のHashにおけるキーになります．
+#### Path
+*Path* は xpath あるいは css セレクタによって、HTML上の特定のノードを指定します．
+これは Mechanize の `search` で使用されます．
+#### Children
+入れ子になっているノードの子ノードです．TextNodeはツリーの葉に当たるため、子ノードを持ちません．
+#### Options
+パースのオプションです．オプションはTypeごとに異なります．
+各ノードに対して、`opt`メソッドをコールすることで、利用可能なオプションを取得できます．
+```
+# TextNode の例
+node = Yasuri.text_title '/html/body/p[1]', truncate:/^[^,]+/
+node.opt #=> {:truncate => /^[^,]+/, :proc => nil}
+```
+## Text Node
+*TextNode* はスクレイピングしたテキストを返します．このノードはパースツリーにおいて常に葉です．
+### 例
+```html
+<!-- http://yasuri.example.net -->
+<html>
+  <head></head>
+  <body>
+    <p>Hello,World</p>
+    <p>Hello,Yasuri</p>
+  </body>
+</html>
+```
+```ruby
+agent = Mechanize.new
+page = agent.get("http://yasuri.example.net")
+p1  = Yasuri.text_title '/html/body/p[1]'
+p1t = Yasuri.text_title '/html/body/p[1]', truncate:/^[^,]+/
+p2u = Yasuri.text_title '/html/body/p[2]', proc: :upcase
+p1.inject(agent, page)   #=> { "title" => "Hello,World" }
+p1t.inject(agent, page)  #=> { "title" => "Hello" }
+p2u.inject(agent, page)  #=> { "title" => "HELLO,YASURI" }
+```
+### オプション
+##### `truncate`
+正規表現にマッチした文字列を取り出します．グループを指定した場合、最初にマッチしたグループだけを返します．
+```ruby
+node  = Yasuri.text_example '/html/body/p[2]', truncate:/H(.+)i/
+node.inject(agent, page)
+#=> { "example" => "ello,Yasur" }
+```
+##### `proc`
+取り出した文字列(String)をレシーバーとして、シンボルで指定したメソッドを呼び出します．
+`truncate`オプションを併せて指定している場合、`truncate`した後の文字列に対し、メソッドを呼び出します．
+```ruby
+node = Yasuri.text_example '/html/body/p[2]', proc: :upcase, truncate:/H(.+)i/
+node.inject(agent, page)
+#=> { "example" => "ELLO,YASUR" }
+```
+## Struct Node
+*Struct Node*  は構造化されたHashとしてテキストを返します．
+まず、Struct Node は `Path` によって、HTMLのタグを絞込みます．
+Struct Node の子ノードは、この絞りこまれたタグに対してパースを行い、Struct Node は子ノードの結果を含むHashを返します．
+Struct Node の `Path` が複数のタグにマッチする場合、配列として結果を返します．
+### 例
+```html
+<!-- http://yasuri.example.net -->
+<html>
+  <head>
+    <title>Books</title>
+  </head>
+  <body>
+    <h1>1996</h1>
+    <table>
+      <thead>
+        <tr><th>Title</th> <th>Publication Date</th></tr>
+      </thead>
+      <tr><td>The Perfect Insider</td>      <td>1996/4/5</td></tr>
+      <tr><td>Doctors in Isolated Room</td> <td>1996/7/5</td></tr>
+      <tr><td>Mathematical Goodbye</td>     <td>1996/9/5</td></tr>
+    </table>
+    <h1>1997</h1>
+    <table>
+      <thead>
+        <tr><th>Title</th> <th>Publication Date</th></tr>
+      </thead>
+      <tr><td>Jack the Poetical Private</td> <td>1997/1/5</td></tr>
+      <tr><td>Who Inside</td>                <td>1997/4/5</td></tr>
+      <tr><td>Illusion Acts Like Magic</td>  <td>1997/10/5</td></tr>
+    </table>
+    <h1>1998</h1>
+    <table>
+      <thead>
+        <tr><th>Title</th> <th>Publication Date</th></tr>
+      </thead>
+      <tr><td>Replaceable Summer</td>   <td>1998/1/7</td></tr>
+      <tr><td>Switch Back</td>          <td>1998/4/5</td></tr>
+      <tr><td>Numerical Models</td>     <td>1998/7/5</td></tr>
+      <tr><td>The Perfect Outsider</td> <td>1998/10/5</td></tr>
+    </table>
+  </body>
+</html>
+```
+```ruby
+agent = Mechanize.new
+page = agent.get("http://yasuri.example.net")
+node = Yasuri.struct_table '/html/body/table[1]/tr' do
+  text_title    './td[1]'
+  text_pub_date './td[2]'
+end
+node.inject(agent, page)
+#=> [ { "title"    => "The Perfect Insider",
+#       "pub_date" => "1996/4/5" },
+#     { "title"    => "Doctors in Isolated Room",
+#       "pub_date" => "1996/7/5" },
+#     { "title"    => "Mathematical Goodbye",
+#       "pub_date" => "1996/9/5" }, ]
+```
+Struct Node は xpath `'/html/body/table[1]/tr'` によって、最初の `<table>` から すべての`<tr>` タグを絞り込みます．
+その後、子ノードである2つの TextNode によって、 `<tr>` タグがパースされます．
+この場合は、最初の `<table>` は 3つの `<tr>`タグを持っているため、3つのHashを返します．(`<thead><tr>` は `Path` にマッチしないため4つではないことに注意)
+各HashはTextNodeによってパースされたテキストを含んでいます．
+また以下の例のように、Struct Node は TextNode以外のノードを子ノードとすることができます．
+### 例
+```ruby
+agent = Mechanize.new
+page = agent.get("http://yasuri.example.net")
+node = Yasuri.struct_tables '/html/body/table' do
+  struct_table './tr' do
+    text_title    './td[1]'
+    text_pub_date './td[2]'
+  end
+end
+node.inject(agent, page)
+#=>      [ { "table" => [ { "title"    => "The Perfect Insider",
+#                           "pub_date" => "1996/4/5" },
+#                         { "title"    => "Doctors in Isolated Room",
+#                           "pub_date" => "1996/7/5" },
+#                         { "title"    => "Mathematical Goodbye",
+#                           "pub_date" => "1996/9/5" }]},
+#          { "table" => [ { "title"    => "Jack the Poetical Private",
+#                           "pub_date" => "1997/1/5" },
+#                         { "title"    => "Who Inside",
+#                           "pub_date" => "1997/4/5" },
+#                         { "title"    => "Illusion Acts Like Magic",
+#                           "pub_date" => "1997/10/5" }]},
+#          { "table" => [ { "title"    => "Replaceable Summer",
+#                           "pub_date" => "1998/1/7" },
+#                         { "title"    => "Switch Back",
+#                           "pub_date" => "1998/4/5" },
+#                         { "title"    => "Numerical Models",
+#                           "pub_date" => "1998/7/5" },
+#                         { "title"    => "The Perfect Outsider",
+#                           "pub_date" => "1998/10/5" }]}
+#       ]
+```
+### オプション
+なし
+## Links Node
+Links Node は リンクされた各ページをパースして結果を返します．
+### 例
+```
+<!-- http://yasuri.example.net -->
+<html>
+  <head><title>Yasuri Test</title></head>
+  <body>
+    <p>Hello,Yasuri</p>
+    <a href="./child01.html">child01</a>
+    <a href="./child02.html">child02</a>
+    <a href="./child03.html">child03</a>
+  </body>
+</html>
+```
+```
+<!-- http://yasuri.example.net/child01.html -->
+<html>
+  <head><title>Child 01 Test</title></head>
+  <body>
+    <p>Child 01 page.</p>
+    <ul>
+      <li><a href="./child01_sub.html">Child01_Sub</a></li>
+      <li><a href="./child02_sub.html">Child02_Sub</a></li>
+    </ul>
+  </body>
+</html>
+```
+```
+<!-- http://yasuri.example.net/child02.html -->
+<html>
+  <head><title>Child 02 Test</title></head>
+  <body>
+    <p>Child 02 page.</p>
+  </body>
+</html>
+```
+```
+<!-- http://yasuri.example.net/child03.html -->
+<html>
+  <head><title>Child 03 Test</title></head>
+  <body>
+    <p>Child 03 page.</p>
+    <ul>
+      <li><a href="./child03_sub.html">Child03_Sub</a></li>
+    </ul>
+  </body>
+</html>
+```
+```
+agent = Mechanize.new
+page = agent.get("http://yasuri.example.net")
+node = Yasuri.links_title '/html/body/a' do
+  text_content '/html/body/p'
+end
+node.inject(agent, page)
+#=> [ {"content" => "Child 01 page."},
+#     {"content" => "Child 02 page."},
+#     {"content" => "Child 03 page."}]
+```
+まず、 LinksNode は `Path` にマッチするすべてのリンクを最初のページから探します．
+この例では、LinksNodeは `/html/body/a` にマッチするすべてのタグを `http://yasuri.example.net` から探します．
+次に、見つかったタグのhref属性で指定されたページを開きます．(`./child01.html`, `./child02.html`, `./child03.html`)
+開いた各ページに対して、子ノードによる解析を行います．LinksNodeは 各ページに対するパース結果をHashの配列として返します．
+## Paginate Node
+PaginateNodeは ページネーション(パジネーション, Pagination) でたどることのできる各ページを順にパースします．
+### 例
+この例では、対象のページ `page01.html` はこのようになっているとします．
+`page02.html` から `page04.html` も同様です．
+```html
+<!-- http://yasuri.example.net/page01.html -->
+<html>
+  <head><title>Page01</title></head>
+  <body>
+    <p>Pagination01</p>
+    <nav class='pagination'>
+      <span class='prev'> &laquo; PreviousPage </span>
+      <span class='page'> 1 </span>
+      <span class='page'> <a href="./page02.html">2</a> </span>
+      <span class='page'> <a href="./page03.html">3</a> </span>
+      <span class='page'> <a href="./page04.html">4</a> </span>
+      <span class='next'> <a href="./page02.html" class="next" rel="next">NextPage &raquo;</a> </span>
+    </nav>
+  </body>
+</html>
+```
+```ruby
+agent = Mechanize.new
+page = agent.get("http://yasuri.example.net/page01.html")
+node = Yasuri.pages_root "/html/body/nav/span/a[@class='next']" , limit:3 do
+         text_content '/html/body/p'
+       end
+node.inject(agent, page)
+#=> [ {"content" => "Pagination01"},
+#     {"content" => "Pagination02"},
+#     {"content" => "Pagination03"}]
+```
+PaginateNodeは 次のページ を指すリンクを`Path`として指定する必要があります．
+この例では、`NextPage` (`/html/body/nav/span/a[@class='next']`)が、次のページを指すリンクに該当します．
+### オプション
+##### `limit`
+たどるページ数の上限を指定します．
+```ruby
+node = Yasuri.pages_root "/html/body/nav/span/a[@class='next']" , limit:2 do
+         text_content '/html/body/p'
+       end
+node.inject(agent, page)
+#=> [ {"content" => "Pagination01"}, {"content" => "Pagination02"}]
+```
+この場合、PaginateNode は最大2つまでのページを開いてパースします．ページネーションは4つのページを持っているようですが、`limit:2`が指定されているため、結果の配列には2つの結果のみが含まれています．

data/USAGE.md ADDED

@@ -0,0 +1,431 @@
+# Yasuri Usage
+## What is Yasuri
+Yasuri (鑢) is an easy web-scraping library that supports "[Mechanize](https://github.com/sparklemotion/mechanize)".
+Yasuri simplifies tasks that come up frequently in scraping.
+For example,
++ Open each link in a page, scrape each linked page, and get the results as Hashes.
++ Scrape several texts in a page and store them under names in a Hash.
++ Scrape each table that appears repeatedly in a page and get the results as an array.
++ Scrape only the top 3 of the pages provided by pagination.
+You can implement these easily with Yasuri.
+## Quick Start
+```
+$ gem install yasuri
+```
+```ruby
+require 'yasuri'
+require 'mechanize'
+# Node tree constructing by DSL
+root = Yasuri.links_root '//*[@id="menu"]/ul/li/a' do
+         text_title '//*[@id="contents"]/h2'
+         text_content '//*[@id="contents"]/p[1]'
+       end
+agent = Mechanize.new
+root_page = agent.get("http://some.scraping.page.net/")
+result = root.inject(agent, root_page)
+# => [ {"title" => "PageTitle1", "content" => "Page Contents1" },
+#      {"title" => "PageTitle2", "content" => "Page Contents2" }, ...  ]
+```
+In this example, Yasuri opens the page behind each link matched by the xpath of the LinksNode (`links_root`) and scrapes the two texts matched by the xpaths of the TextNodes (`text_title`, `text_content`).
+(i.e. it opens each link matching `//*[@id="menu"]/ul/li/a` and scrapes `//*[@id="contents"]/h2` and `//*[@id="contents"]/p[1]`.)
+## Basics
+1. Construct a parse tree.
+2. Start parsing with a Mechanize agent and the first page.
+### Construct parse tree
+```ruby
+require 'mechanize'
+require 'yasuri'
+# 1. Construct parse tree.
+tree = Yasuri.links_title '/html/body/a' do
+         text_name '/html/body/p'
+       end
+# 2. Start parse with Mechanize agent and first page.
+agent = Mechanize.new
+page = agent.get(uri)
+tree.inject(agent, page)
+```
+A tree can be defined in 2(+1) ways: with the DSL, with json (or with plain Ruby code). The example above uses the DSL; the following json defines an equivalent tree.
+```ruby
+# Construct by json.
+src = <<-EOJSON
+   { "node"     : "links",
+     "name"     : "title",
+     "path"     : "/html/body/a",
+     "children" : [
+                    { "node" : "text",
+                      "name" : "name",
+                      "path" : "/html/body/p"
+                    }
+                  ]
+   }
+EOJSON
+tree = Yasuri.json2tree(src)
+```
+### Node
+A tree is built from nested *Node*s.
+A Node has a `Type`, `Name`, `Path`, `Children`, and `Options`.
+A Node is defined in the following format.
+```ruby
+# Top Level
+Yasuri.<Type>_<Name> <Path> [,<Options>]
+# Nested
+Yasuri.<Type>_<Name> <Path> [,<Options>] do
+  <Type>_<Name> <Path> [,<Options>] do
+    <Children>
+  end
+end
+```
+#### Type
+Type determines the behavior of a Node. The available types are:
+- *Text*
+- *Struct*
+- *Links*
+- *Paginate*
+#### Name
+Name is used as the key in the returned hash.
+#### Path
+Path selects the target node by xpath or css selector. It is passed to Mechanize's `search`.
+#### Children
+Child nodes. A TextNode never has children, because a TextNode is always a leaf.
+#### Options
+Parse options. They differ for each type. You can get the options and their values with the `opt` method.
+```ruby
+# TextNode Example
+node = Yasuri.text_title '/html/body/p[1]', truncate:/^[^,]+/
+node.opt #=> {:truncate => /^[^,]+/, :proc => nil}
+```
+## Text Node
+A TextNode returns the scraped text. This node must be a leaf.
+### Example
+```html
+<!-- http://yasuri.example.net -->
+<html>
+  <head></head>
+  <body>
+    <p>Hello,World</p>
+    <p>Hello,Yasuri</p>
+  </body>
+</html>
+```
+```ruby
+agent = Mechanize.new
+page = agent.get("http://yasuri.example.net")
+p1  = Yasuri.text_title '/html/body/p[1]'
+p1t = Yasuri.text_title '/html/body/p[1]', truncate:/^[^,]+/
+p2u = Yasuri.text_title '/html/body/p[2]', proc: :upcase
+p1.inject(agent, page)   #=> { "title" => "Hello,World" }
+p1t.inject(agent, page)  #=> { "title" => "Hello" }
+p2u.inject(agent, page)  #=> { "title" => "HELLO,YASURI" }
+```
+### Options
+##### `truncate`
+Matches the text against a regexp and keeps only the matched part. If the regexp contains a group, only the first matched group is returned.
+```ruby
+node  = Yasuri.text_example '/html/body/p[2]', truncate:/H(.+)i/
+node.inject(agent, page)
+#=> { "example" => "ello,Yasur" }
+```
+##### `proc`
+Calls a method on the extracted text (a String). The method is given as a Symbol.
+If the `truncate` option is also given, the method is applied to the truncated text.
+```ruby
+node = Yasuri.text_example '/html/body/p[2]', proc: :upcase, truncate:/H(.+)i/
+node.inject(agent, page)
+#=> { "example" => "ELLO,YASUR" }
+```
+## Struct Node
+A Struct Node returns text as a structured Hash.
+First, a Struct Node narrows down the tags by `Path`. Its child nodes parse the narrowed tags, and the Struct Node returns a hash containing the parsed results.
+If the `Path` of a Struct Node matches multiple tags, the child nodes parse each tag and the Struct Node returns an array.
+### Example
+```html
+<!-- http://yasuri.example.net -->
+<html>
+  <head>
+    <title>Books</title>
+  </head>
+  <body>
+    <h1>1996</h1>
+    <table>
+      <thead>
+        <tr><th>Title</th> <th>Publication Date</th></tr>
+      </thead>
+      <tr><td>The Perfect Insider</td>      <td>1996/4/5</td></tr>
+      <tr><td>Doctors in Isolated Room</td> <td>1996/7/5</td></tr>
+      <tr><td>Mathematical Goodbye</td>     <td>1996/9/5</td></tr>
+    </table>
+    <h1>1997</h1>
+    <table>
+      <thead>
+        <tr><th>Title</th> <th>Publication Date</th></tr>
+      </thead>
+      <tr><td>Jack the Poetical Private</td> <td>1997/1/5</td></tr>
+      <tr><td>Who Inside</td>                <td>1997/4/5</td></tr>
+      <tr><td>Illusion Acts Like Magic</td>  <td>1997/10/5</td></tr>
+    </table>
+    <h1>1998</h1>
+    <table>
+      <thead>
+        <tr><th>Title</th> <th>Publication Date</th></tr>
+      </thead>
+      <tr><td>Replaceable Summer</td>   <td>1998/1/7</td></tr>
+      <tr><td>Switch Back</td>          <td>1998/4/5</td></tr>
+      <tr><td>Numerical Models</td>     <td>1998/7/5</td></tr>
+      <tr><td>The Perfect Outsider</td> <td>1998/10/5</td></tr>
+    </table>
+  </body>
+</html>
+```
+```ruby
+agent = Mechanize.new
+page = agent.get("http://yasuri.example.net")
+node = Yasuri.struct_table '/html/body/table[1]/tr' do
+  text_title    './td[1]'
+  text_pub_date './td[2]'
+end
+node.inject(agent, page)
+#=> [ { "title"    => "The Perfect Insider",
+#       "pub_date" => "1996/4/5" },
+#     { "title"    => "Doctors in Isolated Room",
+#       "pub_date" => "1996/7/5" },
+#     { "title"    => "Mathematical Goodbye",
+#       "pub_date" => "1996/9/5" }, ]
+```
+The Struct Node narrows down to the `<tr>` tags in the first `<table>` by `'/html/body/table[1]/tr'`. Then,
+the `<tr>` tags are parsed by the two child Text Nodes.
+In this case, the first `<table>` contains three matching `<tr>` tags (not four: the `<tr>` inside `<thead>` does not match `Path`), so the Struct Node returns three hashes. Each hash contains the text parsed by the Text Nodes.
+A Struct Node can also have children other than Text Nodes, as in the following example.
+### Example
+```ruby
+agent = Mechanize.new
+page = agent.get("http://yasuri.example.net")
+node = Yasuri.struct_tables '/html/body/table' do
+  struct_table './tr' do
+    text_title    './td[1]'
+    text_pub_date './td[2]'
+  end
+end
+node.inject(agent, page)
+#=>      [ { "table" => [ { "title"    => "The Perfect Insider",
+#                           "pub_date" => "1996/4/5" },
+#                         { "title"    => "Doctors in Isolated Room",
+#                           "pub_date" => "1996/7/5" },
+#                         { "title"    => "Mathematical Goodbye",
+#                           "pub_date" => "1996/9/5" }]},
+#          { "table" => [ { "title"    => "Jack the Poetical Private",
+#                           "pub_date" => "1997/1/5" },
+#                         { "title"    => "Who Inside",
+#                           "pub_date" => "1997/4/5" },
+#                         { "title"    => "Illusion Acts Like Magic",
+#                           "pub_date" => "1997/10/5" }]},
+#          { "table" => [ { "title"    => "Replaceable Summer",
+#                           "pub_date" => "1998/1/7" },
+#                         { "title"    => "Switch Back",
+#                           "pub_date" => "1998/4/5" },
+#                         { "title"    => "Numerical Models",
+#                           "pub_date" => "1998/7/5" },
+#                         { "title"    => "The Perfect Outsider",
+#                           "pub_date" => "1998/10/5" }]}
+#       ]
+```
+### Options
+None.
+## Links Node
+A Links Node returns the parsed result of each linked page.
+### Example
+```html
+<!-- http://yasuri.example.net -->
+<html>
+  <head><title>Yasuri Test</title></head>
+  <body>
+    <p>Hello,Yasuri</p>
+    <a href="./child01.html">child01</a>
+    <a href="./child02.html">child02</a>
+    <a href="./child03.html">child03</a>
+  </body>
+</html>
+```
+```html
+<!-- http://yasuri.example.net/child01.html -->
+<html>
+  <head><title>Child 01 Test</title></head>
+  <body>
+    <p>Child 01 page.</p>
+    <ul>
+      <li><a href="./child01_sub.html">Child01_Sub</a></li>
+      <li><a href="./child02_sub.html">Child02_Sub</a></li>
+    </ul>
+  </body>
+</html>
+```
+```html
+<!-- http://yasuri.example.net/child02.html -->
+<html>
+  <head><title>Child 02 Test</title></head>
+  <body>
+    <p>Child 02 page.</p>
+  </body>
+</html>
+```
+```html
+<!-- http://yasuri.example.net/child03.html -->
+<html>
+  <head><title>Child 03 Test</title></head>
+  <body>
+    <p>Child 03 page.</p>
+    <ul>
+      <li><a href="./child03_sub.html">Child03_Sub</a></li>
+    </ul>
+  </body>
+</html>
+```
+```ruby
+agent = Mechanize.new
+page = agent.get("http://yasuri.example.net")
+node = Yasuri.links_title '/html/body/a' do
+  text_content '/html/body/p'
+end
+node.inject(agent, page)
+#=> [ {"content" => "Child 01 page."},
+#     {"content" => "Child 02 page."},
+#     {"content" => "Child 03 page."}]
+```
+First, the Links Node finds all links in the page that match `Path`. In this case, the LinksNode finds the `/html/body/a` tags in `http://yasuri.example.net`. Then, it opens the pages given by their href attributes (`./child01.html`, `./child02.html` and `./child03.html`).
+Then, the Links Node applies its child nodes to each opened page. The Links Node returns the results for each page as an array of Hashes.
+### Options
+None.
+## Paginate Node
+A Paginate Node parses each page reachable through pagination, in order, and returns the results.
+### Example
+The target page `page01.html` looks like this. `page02.html` to `page04.html` are similar.
+```html
+<!-- http://yasuri.example.net/page01.html -->
+<html>
+  <head><title>Page01</title></head>
+  <body>
+    <p>Pagination01</p>
+    <nav class='pagination'>
+      <span class='prev'> PreviousPage </span>
+      <span class='page'> 1 </span>
+      <span class='page'> <a href="./page02.html">2</a> </span>
+      <span class='page'> <a href="./page03.html">3</a> </span>
+      <span class='page'> <a href="./page04.html">4</a> </span>
+      <span class='next'> <a href="./page02.html" class="next" rel="next"> NextPage </a> </span>
+    </nav>
+  </body>
+</html>
+```
+```ruby
+agent = Mechanize.new
+page = agent.get("http://yasuri.example.net/page01.html")
+node = Yasuri.pages_root "/html/body/nav/span/a[@class='next']" do
+         text_content '/html/body/p'
+       end
+node.inject(agent, page)
+#=> [ {"content" => "Pagination01"},
+#     {"content" => "Pagination02"},
+#     {"content" => "Pagination03"},
+#     {"content" => "Pagination04"}]
+```
+A Paginate Node requires the link to the next page as its `Path`. In this case, it is `NextPage` (`/html/body/nav/span/a[@class='next']`).
+### Options
+##### `limit`
+Upper limit on the number of pages opened while following the pagination.
+```ruby
+node = Yasuri.pages_root "/html/body/nav/span/a[@class='next']" , limit:2 do
+         text_content '/html/body/p'
+       end
+node.inject(agent, page)
+#=> [ {"content" => "Pagination01"}, {"content" => "Pagination02"}]
+```
+The Paginate Node opens at most 2 pages, as given by `limit`. In this situation, the pagination has 4 pages, but the result Array has only 2 entries because `limit:2` is given.

data/lib/yasuri/version.rb CHANGED

@@ -1,3 +1,3 @@
 module Yasuri
-  VERSION = "0.0.11"
+  VERSION = "1.9.11"
 end

data/lib/yasuri/yasuri.rb CHANGED

@@ -37,7 +37,7 @@ module Yasuri
   }
   Node2Text = Text2Node.invert
-  ReservedKeys = %i|node name path children|
+  ReservedKeys = [:node, :name, :path, :children]
   def self.hash2node(node_h)
     node, name, path, children = ReservedKeys.map do |key|
       node_h[key]
@@ -78,7 +78,8 @@ module Yasuri
     json
   end
-  def self.NodeName(name, symbolize_names:false)
+  def self.NodeName(name, hash = {})
+    symbolize_names = hash[:symbolize_names] || false
     symbolize_names ? name.to_sym : name
   end
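
Reviewer note: both changes in this file swap Ruby 2.x-only syntax (the `%i` symbol-array literal and a keyword argument) for forms that Ruby 1.9 also parses, which matches the README's remark that the 1.9 series is for older Rubies. A minimal sketch of the options-hash pattern follows; `node_name` is a hypothetical stand-in written for illustration, not the gem's code.

```ruby
# Hypothetical stand-in for Yasuri.NodeName (illustration only).
# A trailing options hash replaces the keyword argument, so the
# method definition also parses on Ruby 1.9.
def node_name(name, opts = {})
  symbolize_names = opts[:symbolize_names] || false
  symbolize_names ? name.to_sym : name
end

plain      = node_name("title")
symbolized = node_name("title", symbolize_names: true)

puts plain.inspect       # "title"
puts symbolized.inspect  # :title
```

Callers keep the same `symbolize_names: true` call syntax, because Ruby collects trailing `key: value` pairs into a hash when the method takes no keyword parameters.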

data/lib/yasuri/yasuri_node.rb CHANGED

@@ -7,7 +7,7 @@ module Yasuri
   module Node
     attr_reader :url, :xpath, :name, :children
-    def initialize(xpath, name, children = [], opt: {})
+    def initialize(xpath, name, children = [], opt = {})
       @xpath, @name, @children = xpath, name, children
     end

data/lib/yasuri/yasuri_paginate_node.rb CHANGED

@@ -7,9 +7,9 @@ module Yasuri
   class PaginateNode
     include Node
-    def initialize(xpath, name, children = [], limit: nil)
+    def initialize(xpath, name, children = [], hash = {})
       super(xpath, name, children)
-      @limit = limit
+      @limit = hash[:limit]
     end
     def inject(agent, page, opt = {})
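
Reviewer note: the body of `inject` is not part of this hunk, so the following is only a sketch of the `limit` behavior that USAGE.md describes (follow the next-page link until it runs out or `limit` pages have been parsed). The in-memory `PAGES` table and `follow_pagination` are hypothetical and replace real Mechanize page fetches.

```ruby
# Hypothetical in-memory "site": each page has some content and an
# optional link to the next page (nil on the last page).
PAGES = {
  "page01.html" => { content: "Pagination01", next: "page02.html" },
  "page02.html" => { content: "Pagination02", next: "page03.html" },
  "page03.html" => { content: "Pagination03", next: "page04.html" },
  "page04.html" => { content: "Pagination04", next: nil }
}

# Follow "next" links from start, parsing at most opts[:limit] pages.
# With no :limit, every page in the chain is parsed.
def follow_pagination(pages, start, opts = {})
  limit   = opts[:limit]
  results = []
  current = start
  while current && (limit.nil? || results.size < limit)
    page = pages.fetch(current)
    results << { "content" => page[:content] }
    current = page[:next]
  end
  results
end

puts follow_pagination(PAGES, "page01.html", limit: 2).inspect
puts follow_pagination(PAGES, "page01.html").size
```

As in the documented example, the starting page counts toward the limit in this sketch.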

data/lib/yasuri/yasuri_text_node.rb CHANGED

@@ -7,9 +7,12 @@ module Yasuri
   class TextNode
     include Node
-    def initialize(xpath, name, children = [], truncate: nil, proc:nil)
+    def initialize(xpath, name, children = [], hash = {})
       super(xpath, name, children)
+      truncate = hash[:truncate]
+      proc     = hash[:proc]
       truncate = Regexp.new(truncate) if not truncate.nil? # regexp or nil
       @truncate = truncate
       @truncate = Regexp.new(@truncate.to_s) if not @truncate.nil?
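
Reviewer note: the hunk only shows how the `truncate` and `proc` options are read from the hash. How they are applied is documented in USAGE.md: truncate first, returning the first group if the regexp has one, then call the method named by `proc`. The sketch below illustrates that documented order; `truncate_and_apply` is a hypothetical helper, and returning the text unchanged when the regexp does not match is an assumption, since this diff does not show that case.

```ruby
# Hypothetical helper illustrating the documented option order:
# 1. :truncate keeps the first capture group, or the whole match
#    when the regexp has no group.
# 2. :proc names a String method that is called on the result.
def truncate_and_apply(text, opts = {})
  truncate    = opts[:truncate]
  method_name = opts[:proc]

  if truncate
    match = truncate.match(text)
    # Assumption: leave the text unchanged when nothing matches.
    text = (match.captures.first || match[0]) if match
  end

  text = text.send(method_name) if method_name
  text
end

puts truncate_and_apply("Hello,World", truncate: /^[^,]+/)
puts truncate_and_apply("Hello,Yasuri", truncate: /H(.+)i/, proc: :upcase)
```

The two calls mirror the `truncate` and `proc` examples in USAGE.md.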

data/spec/spec_helper.rb CHANGED

@@ -14,8 +14,8 @@ end
 # ENV['CODECLIMATE_REPO_TOKEN'] = "0dc78d33107a7f11f257c0218ac1a37e0073005bb9734f2fd61d0f7e803fc151"
-require "codeclimate-test-reporter"
-CodeClimate::TestReporter.start
+# require "codeclimate-test-reporter"
+# CodeClimate::TestReporter.start
 require 'simplecov'
 require 'coveralls'
@@ -27,6 +27,7 @@ SimpleCov.formatter = SimpleCov::Formatter::MultiFormatter[
 ]
 SimpleCov.start
 $LOAD_PATH.unshift File.expand_path('../../lib', __FILE__)
 require 'yasuri'

data/spec/yasuri_spec.rb CHANGED

@@ -18,7 +18,7 @@ describe 'Yasuri' do
   #############
   describe '.json2tree' do
     it "fail if empty json" do
-      expect { Yasuri.json2tree("{}") }.to raise_error
+      expect { Yasuri.json2tree("{}") }.to raise_error(RuntimeError)
     end
     it "return TextNode" do
@@ -39,7 +39,7 @@ describe 'Yasuri' do
                   "truncate"  : "^[^,]+"
                 }|
       generated = Yasuri.json2tree(src)
-      original  = Yasuri::TextNode.new('/html/body/p[1]', "content", truncate:/^[^,]+/)
+      original  = Yasuri::TextNode.new('/html/body/p[1]', "content", {}, truncate:/^[^,]+/)
       compare_generated_vs_original(generated, original, @index_page)
     end
@@ -153,7 +153,7 @@ describe 'Yasuri' do
     end
     it "return text node with truncate_regexp" do
-      node = Yasuri::TextNode.new("/html/head/title", "title", truncate:/^[^,]+/)
+      node = Yasuri::TextNode.new("/html/head/title", "title", {}, truncate:/^[^,]+/)
       json = Yasuri.tree2json(node)
       expected_str = %q| { "node": "text",
                            "name": "title",

data/spec/yasuri_struct_node_spec.rb CHANGED

@@ -81,7 +81,7 @@ describe 'Yasuri' do
       node = Yasuri::StructNode.new(invalid_xpath, "table", [
         Yasuri::TextNode.new('./td[1]', "title")
       ])
-      expect { node.inject(@agent, @page) }.to raise_error
+      expect { node.inject(@agent, @page) }.to raise_error(Nokogiri::XML::XPath::SyntaxError)
     end
     it 'fail with invalid xpath in children' do
@@ -90,7 +90,7 @@ describe 'Yasuri' do
         Yasuri::TextNode.new(invalid_xpath, "title"),
         Yasuri::TextNode.new('./td[2]', "pub_date"),
       ])
-      expect { node.inject(@agent, @page) }.to raise_error
+      expect { node.inject(@agent, @page) }.to raise_error(Nokogiri::XML::XPath::SyntaxError)
     end
     it 'scrape all tables' do
@@ -126,7 +126,7 @@ describe 'Yasuri' do
         Yasuri::TextNode.new('./td[1]', "title"),
         Yasuri::TextNode.new('./td[2]', "pub_date"),
       ])
-      expected = @table_1996.map{|h| h.map{|k,v| [k.to_sym, v] }.to_h }
+      expected = @table_1996.map{|h| Hash[h.map{|k,v| [k.to_sym, v] }] }
       actual = node.inject(@agent, @page, symbolize_names:true)
       expect(actual).to match expected
     end
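
Reviewer note: the last hunk replaces `Array#to_h`, which only exists from Ruby 2.1, with `Hash[...]`, which builds the same hash from an array of pairs on older Rubies. A small sketch with made-up sample rows (the data below is illustrative, not the spec's fixture):

```ruby
# Sample rows with String keys, shaped like the table fixtures.
rows = [
  { "title" => "The Perfect Insider",      "pub_date" => "1996/4/5" },
  { "title" => "Doctors in Isolated Room", "pub_date" => "1996/7/5" }
]

# Hash[pairs] turns [[key, value], ...] into a Hash, so it can stand
# in for pairs.to_h where Array#to_h is unavailable.
symbolized = rows.map { |h| Hash[h.map { |k, v| [k.to_sym, v] }] }

puts symbolized.first.inspect
```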

data/spec/yasuri_text_node_spec.rb CHANGED

@@ -31,7 +31,7 @@ describe 'Yasuri' do
     it 'fail with invalid xpath' do
       invalid_xpath = '/html/body/no_match_node['
       node = Yasuri::TextNode.new(invalid_xpath, "title")
-      expect { node.inject(@agent, @index_page) }.to raise_error
+      expect { node.inject(@agent, @index_page) }.to raise_error(Nokogiri::XML::XPath::SyntaxError)
     end
     it "can be defined by DSL, return single TextNode title" do

metadata CHANGED

@@ -1,14 +1,14 @@
 --- !ruby/object:Gem::Specification
 name: yasuri
 version: !ruby/object:Gem::Version
-  version: 0.0.11
+  version: 1.9.11
 platform: ruby
 authors:
 - TAC
 autorequire:
 bindir: bin
 cert_chain: []
-date: 2015-03-03 00:00:00.000000000 Z
+date: 2016-11-14 00:00:00.000000000 Z
 dependencies:
 - !ruby/object:Gem::Dependency
   name: bundler
@@ -151,6 +151,8 @@ files:
 - LICENSE
 - README.md
 - Rakefile
+- USAGE.ja.md
+- USAGE.md
 - app.rb
 - lib/yasuri.rb
 - lib/yasuri/version.rb