RubyGems - yasuri - Versions diffs - 3.2.0 → 3.3.0 - Mend

yasuri 3.2.0 → 3.3.0

Files changed (25)

checksums.yaml +4 -4
data/README.md +4 -7
data/USAGE.ja.md +107 -86
data/USAGE.md +106 -87
data/examples/example.rb +79 -0
data/examples/github.yml +15 -0
data/examples/sample.json +4 -0
data/examples/sample.yml +11 -0
data/lib/yasuri/version.rb +1 -1
data/lib/yasuri/yasuri.rb +6 -2
data/lib/yasuri/yasuri_cli.rb +6 -6
data/lib/yasuri/yasuri_links_node.rb +3 -1
data/lib/yasuri/yasuri_map_node.rb +1 -0
data/lib/yasuri/yasuri_node.rb +14 -0
data/lib/yasuri/yasuri_paginate_node.rb +2 -1
data/spec/spec_helper.rb +3 -3
data/spec/yasuri_cli_spec.rb +17 -4
data/spec/yasuri_links_node_spec.rb +24 -10
data/spec/yasuri_map_spec.rb +4 -5
data/spec/yasuri_paginate_node_spec.rb +22 -10
data/spec/yasuri_spec.rb +55 -19
data/spec/yasuri_struct_node_spec.rb +13 -17
data/spec/yasuri_text_node_spec.rb +11 -12
metadata +6 -3
data/app.rb +0 -52

checksums.yaml CHANGED

@@ -1,7 +1,7 @@
 ---
 SHA256:
-  metadata.gz: cd5fc7327c6d09b37771ac1c3ec40db2c052bf49ec9a1627e9ae49e047102856
-  data.tar.gz: a645f1e09ce72b73c54e2055af6fbf81bb145c8823e1d8428bb19c042bbb661d
+  metadata.gz: a7bf438a08fc83fec7e78cb5543577c98f6cc98b4f5fae7b0dd969f2049c0531
+  data.tar.gz: e399c6b57589b7d8ba2e8eff7a1d204fa7f8e676f82f631057e19a9377333060
 SHA512:
-  metadata.gz: 654bd6cfe8012811283b1aa03e0dcc1200ce957ef4641eed2b5fa65956fb974070157b832e42f340d7299031756848c5118a7f43019ff94f088c49974e2304e8
-  data.tar.gz: 5ad07b82672ea2ceebfb8154bb91631c095e9ad8d69f3d62c0bf8d528c4c539fab2597f4112b4212bffe7ad641b30d913686e8e2bfea7dfdbdd9a4468311b6c0
+  metadata.gz: 56f39994972657712cb7d95e5ceaadefca8de41e06c2cd4759363b496d7c8531fad7517f9df99bf2446c144f01c5cd82cbc94146c432d6b5b552f092b975ecd7
+  data.tar.gz: cf74a25615187ecbe5f8ca5f2072679fa9cc1902dfa3bf2190b87e11104f332688cdaee16f4be6cb00f9ed63fa18f2ec8f27cf32b0b27389f81d98229fa212e6
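
Both archives inside the gem were rebuilt for 3.3.0, so every recorded digest changed. As a minimal sketch of how such entries could be checked, assuming the `.gem` has already been unpacked and the archive bytes are at hand (the helper name `verify_checksums` is hypothetical and not part of RubyGems):

```ruby
require 'digest'
require 'yaml'

# Hypothetical helper: compares the SHA256/SHA512 digests recorded in a
# checksums.yaml document against the actual bytes of each archive.
# `files` maps an archive name (e.g. "data.tar.gz") to its raw contents.
# Returns an array of [algorithm, file, matched?] triples.
def verify_checksums(checksums_yaml, files)
  recorded = YAML.safe_load(checksums_yaml)
  algorithms = { 'SHA256' => Digest::SHA256, 'SHA512' => Digest::SHA512 }

  algorithms.flat_map do |name, digest_class|
    (recorded[name] || {}).map do |file, expected|
      # A file missing from `files` is reported as a mismatch.
      actual = files.key?(file) ? digest_class.hexdigest(files[file]) : nil
      [name, file, actual == expected.to_s]
    end
  end
end
```

Any triple whose last element is `false` points at an archive that does not match the digest recorded for it.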

data/README.md CHANGED

@@ -81,12 +81,8 @@ src = <<-EOJSON
 EOJSON
 root = Yasuri.json2tree(src)
 # Execution and getting scraped result
-agent = Mechanize.new
-root_page = agent.get("http://some.scraping.page.tac42.net/")
-result = root.inject(agent, root_page)
+result = root.scrape("http://some.scraping.page.tac42.net/")
 # => [
 #      {"title" => "PageTitle 01", "content" => "Page Contents  01" },
 #      {"title" => "PageTitle 02", "content" => "Page Contents  02" },
@@ -104,8 +100,9 @@ Usage:
   yasuri scrape <URI> [[--file <TREE_FILE>] or [--json <JSON>]]
 Options:
-  f, [--file=FILE]  # path to file that written yasuri tree as json or yaml
-  j, [--json=JSON]  # yasuri tree format json string
+  f, [--file=FILE]   # path to file that written yasuri tree as json or yaml
+  j, [--json=JSON]   # yasuri tree format json string
+  i, [--interval=N]  # interval each request [ms]
 Getting from <URI> and scrape it. with <JSON> or json/yml from <TREE_FILE>. They should be Yasuri's format json or yaml string.
 ```

data/USAGE.ja.md CHANGED

@@ -2,7 +2,8 @@
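 ## What is Yasuri
 Yasuri (鑢) is a library for declarative web scraping, together with a command-line scraping tool built on it.
-Just describe the expected result in a simple declarative notation, and scraping is performed with "[Mechanize](https://github.com/sparklemotion/mechanize)".
+Just describe the expected result in a simple declarative notation to obtain the scraped result.
 Yasuri makes it easy to describe common scraping tasks.
 For example, the following kinds of processing are easy to implement.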
@@ -36,10 +37,7 @@ root = Yasuri.links_root '//*[@id="menu"]/ul/li/a' do
          text_content '//*[@id="contents"]/p[1]'
        end
-agent = Mechanize.new
-root_page = agent.get("http://some.scraping.page.tac42.net/")
-result = root.inject(agent, root_page)
+result = root.scrape("http://some.scraping.page.tac42.net/")
 # => [
 #      {"title" => "PageTitle 01", "content" => "Page Contents  01" },
 #      {"title" => "PageTitle 02", "content" => "Page Contents  02" },
@@ -171,7 +169,51 @@ jsonまたはyaml形式では、子Nodeを持たない場合、`path` を直接
   }
 }
 ```
+### Running the tree
+Call the `Node#scrape(uri, opt={})` method on the root node of the parse tree.
+**Example**
+```ruby
+root = Yasuri.links_root '//*[@id="menu"]/ul/li/a' do
+         text_title '//*[@id="contents"]/h2'
+         text_content '//*[@id="contents"]/p[1]'
+       end
+result = root.scrape("http://some.scraping.page.tac42.net/", interval_ms: 1000)
+```
++ `uri` is the URI of the page to scrape.
++ `opt` specifies options as a Hash. The options listed below are available.
+Yasuri uses `Mechanize` internally as the agent that performs scraping.
+To supply this instance yourself, call `Node#scrape_with_agent(uri, agent, opt={})`.
+```ruby
+require 'logger'
+agent = Mechanize.new
+agent.log = Logger.new $stderr
+agent.request_headers = {
+  # ...
+}
+result = root.scrape_with_agent(
+  "http://some.scraping.page.tac42.net/",
+  agent,
+  interval_ms: 1000)
+```
+### `opt`
+#### `interval_ms`
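+The interval [milliseconds] between requests when requesting multiple pages.
+If omitted, requests are sent back to back with no interval. When requests to many pages are expected, specifying an interval is strongly recommended so that the target host is not put under heavy load.
+#### `retry_count`
+The number of retries when fetching a page fails. If omitted, it retries 5 times.
+#### `symbolize_names`
+When `true`, the keys of the result set are returned as symbols.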
 --------------------------
 ## Node
@@ -216,7 +258,7 @@ node.opt #=> {:truncate => /^[^,]+/, :proc => nil}
 ### Example
 ```html
-<!-- http://yasuri.example.net -->
+<!-- http://yasuri.example.tac42.net -->
 <html>
   <head></head>
   <body>
@@ -227,28 +269,24 @@ node.opt #=> {:truncate => /^[^,]+/, :proc => nil}
 ```
 ```ruby
-agent = Mechanize.new
-page = agent.get("http://yasuri.example.net")
 p1  = Yasuri.text_title '/html/body/p[1]'
 p1t = Yasuri.text_title '/html/body/p[1]', truncate:/^[^,]+/
 p2u = Yasuri.text_title '/html/body/p[1]', proc: :upcase
-p1.inject(agent, page)   #=> "Hello,World"
-p1t.inject(agent, page)  #=> "Hello"
-p2u.inject(agent, page)  #=> "HELLO,WORLD"
+p1.scrape("http://yasuri.example.tac42.net")   #=> "Hello,World"
+p1t.scrape("http://yasuri.example.tac42.net")  #=> "Hello"
+p2u.scrape("http://yasuri.example.tac42.net")  #=> "HELLO,WORLD"
 ```
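 To scrape multiple elements on the same page at once, use `MapNode`. See the `MapNode` example for details.
 ### Options
 ##### `truncate`
 Extracts the string that matches the regular expression. If a group is specified, only the first matched group is returned.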
 ```ruby
 node  = Yasuri.text_example '/html/body/p[1]', truncate:/H(.+)i/
-node.inject(agent, index_page)
+node.scrape(uri)
 #=> { "example" => "ello,Yasur" }
 ```
@@ -259,7 +297,7 @@ node.inject(agent, index_page)
 ```ruby
 node = Yasuri.text_example '/html/body/p[1]', proc: :upcase, truncate:/H(.+)i/
-node.inject(agent, index_page)
+node.scrape(uri)
 #=> { "example" => "ELLO,YASUR" }
 ```
@@ -274,7 +312,7 @@ Struct Node の `Path` が複数のタグにマッチする場合、配列とし
 ### Example
 ```html
-<!-- http://yasuri.example.net -->
+<!-- http://yasuri.example.tac42.net -->
 <html>
   <head>
     <title>Books</title>
@@ -315,15 +353,12 @@ Struct Node の `Path` が複数のタグにマッチする場合、配列とし
 ```
 ```ruby
-agent = Mechanize.new
-page = agent.get("http://yasuri.example.net")
 node = Yasuri.struct_table '/html/body/table[1]/tr' do
   text_title    './td[1]'
   text_pub_date './td[2]'
-])
+end
-node.inject(agent, page)
+node.scrape("http://yasuri.example.tac42.net")
 #=> [ { "title"    => "The Perfect Insider",
 #       "pub_date" => "1996/4/5" },
 #     { "title"    => "Doctors in Isolated Room",
@@ -337,23 +372,19 @@ Struct Node は xpath `'/html/body/table[1]/tr'` によって、最初の `<tabl
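 In this case, the first `<table>` has three `<tr>` tags, so three Hashes are returned. (Note that it is not four, because `<thead><tr>` does not match `Path`.)
 Each Hash contains the text parsed by the TextNode.
 As in the following example, a Struct Node can also have nodes other than TextNode as child nodes.
 ### Example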
 ```ruby
-agent = Mechanize.new
-page = agent.get("http://yasuri.example.net")
 node = Yasuri.strucre_tables '/html/body/table' do
   struct_table './tr' do
     text_title    './td[1]'
     text_pub_date './td[2]'
   end
-])
+end
-node.inject(agent, page)
+node.scrape("http://yasuri.example.tac42.net")
 #=>      [ { "table" => [ { "title"    => "The Perfect Insider",
 #                           "pub_date" => "1996/4/5" },
@@ -385,8 +416,8 @@ node.inject(agent, page)
 Links Node parses each linked page and returns the results.
 ### Example
-```
-<!-- http://yasuri.example.net -->
+```html
+<!-- http://yasuri.example.tac42.net -->
 <html>
   <head><title>Yasuri Test</title></head>
   <body>
@@ -398,8 +429,8 @@ Links Node は リンクされた各ページをパースして結果を返し
 <title>
 ```
-```
-<!-- http://yasuri.example.net/child01.html -->
+```html
+<!-- http://yasuri.example.tac42.net/child01.html -->
 <html>
   <head><title>Child 01 Test</title></head>
   <body>
@@ -412,8 +443,8 @@ Links Node は リンクされた各ページをパースして結果を返し
 <title>
 ```
-```
-<!-- http://yasuri.example.net/child02.html -->
+```html
+<!-- http://yasuri.example.tac42.net/child02.html -->
 <html>
   <head><title>Child 02 Test</title></head>
   <body>
@@ -422,8 +453,8 @@ Links Node は リンクされた各ページをパースして結果を返し
 <title>
 ```
-```
-<!-- http://yasuri.example.net/child03.html -->
+```html
+<!-- http://yasuri.example.tac42.net/child03.html -->
 <html>
   <head><title>Child 03 Test</title></head>
   <body>
@@ -435,22 +466,19 @@ Links Node は リンクされた各ページをパースして結果を返し
 <title>
 ```
-```
-agent = Mechanize.new
-page = agent.get("http://yasuri.example.net")
+```ruby
 node = Yasuri.links_title '/html/body/a' do
   text_content '/html/body/p'
 end
-node.inject(agent, page)
+node.scrape("http://yasuri.example.tac42.net")
 #=> [ {"content" => "Child 01 page."},
       {"content" => "Child 02 page."},
       {"content" => "Child 03 page."}]
 ```
 First, LinksNode looks for all links matching `Path` on the first page.
-In this example, LinksNode looks for all tags matching `/html/body/a` on `http://yasuri.example.net`.
+In this example, LinksNode looks for all tags matching `/html/body/a` on `http://yasuri.example.tac42.net`.
 Next, it opens the pages specified by the href attributes of the tags it found. (`./child01.html`, `./child02.html`, `./child03.html`)
 Each opened page is then parsed by the child nodes. LinksNode returns the parse results for each page as an array of Hashes.
@@ -463,7 +491,7 @@ PaginateNodeは ページネーション(パジネーション, Pagination) で
 `page02.html` through `page04.html` are similar.
 ```html
-<!-- http://yasuri.example.net/page01.html -->
+<!-- http://yasuri.example.tac42.net/page01.html -->
 <html>
   <head><title>Page01</title></head>
   <body>
@@ -483,17 +511,14 @@ PaginateNodeは ページネーション(パジネーション, Pagination) で
 ```
 ```ruby
-agent = Mechanize.new
-page = agent.get("http://yasuri.example.net/page01.html")
 node = Yasuri.pages_root "/html/body/nav/span/a[@class='next']" , limit:3 do
          text_content '/html/body/p'
        end
-node.inject(agent, page)
+node.scrape("http://yasuri.example.tac42.net/page01.html")
 #=> [ {"content" => "Patination01"},
-      {"content" => "Patination02"},
-      {"content" => "Patination03"}]
+#     {"content" => "Patination02"},
+#     {"content" => "Patination03"}]
 ```
 PaginateNode requires the link pointing to the next page to be specified as `Path`.
 In this example, `NextPage` (`/html/body/nav/span/a[@class='next']`) is the link pointing to the next page.
@@ -506,7 +531,7 @@ PaginateNodeは 次のページ を指すリンクを`Path`として指定する
 node = Yasuri.pages_root "/html/body/nav/span/a[@class='next']" , limit:2 do
          text_content '/html/body/p'
        end
-node.inject(agent, page)
+node.scrape(uri)
 #=> [ {"content" => "Pagination01"}, {"content" => "Pagination02"}]
 ```
 In this case, PaginateNode opens and parses at most two pages. The pagination appears to have four pages, but because `limit:2` is specified, the result array contains only two results.
@@ -515,35 +540,32 @@ node.inject(agent, page)
 Flattens the results of each fetched page.
 ```ruby
-agent = Mechanize.new
-page = agent.get("http://yasuri.example.net/page01.html")
 node = Yasuri.pages_root "/html/body/nav/span/a[@class='next']" , flatten:true do
          text_title   '/html/head/title'
          text_content '/html/body/p'
        end
-node.inject(agent, page)
+node.scrape("http://yasuri.example.tac42.net/page01.html")
 #=> [ {"title" => "Page01",
-       "content" => "Patination01"},
-      {"title"   => "Page01",
-       "content" => "Patination02"},
-      {"title"   => "Page01",
-       "content" => "Patination03"}]
+#      "content" => "Patination01"},
+#     {"title"   => "Page01",
+#      "content" => "Patination02"},
+#     {"title"   => "Page01",
+#      "content" => "Patination03"}]
 node = Yasuri.pages_root "/html/body/nav/span/a[@class='next']" , flatten:true do
         text_title   '/html/head/title'
         text_content '/html/body/p'
       end
-node.inject(agent, page)
+node.scrape("http://yasuri.example.tac42.net/page01.html")
 #=> [ "Page01",
-      "Patination01",
-      "Page02",
-      "Patination02",
-      "Page03",
-      "Patination03"]
+#     "Patination01",
+#     "Page02",
+#     "Patination02",
+#     "Page03",
+#     "Patination03"]
 ```
 ## Map Node
@@ -552,7 +574,7 @@ node.inject(agent, page)
 ### Example
 ```html
-<!-- http://yasuri.example.net -->
+<!-- http://yasuri.example.tac42.net -->
 <html>
   <head><title>Yasuri Example</title></head>
   <body>
@@ -563,16 +585,12 @@ node.inject(agent, page)
 ```
 ```ruby
-agent = Mechanize.new
-page = agent.get("http://yasuri.example.net")
 tree = Yasuri.map_root do
   text_title  '/html/head/title'
   text_body_p '/html/body/p[1]'
 end
-tree.inject(agent, page) #=> { "title" => "Yasuri Example", "body_p" => "Hello,World" }
+tree.scrape("http://yasuri.example.tac42.net") #=> { "title" => "Yasuri Example", "body_p" => "Hello,World" }
 tree = Yasuri.map_root do
@@ -583,7 +601,7 @@ tree = Yasuri.map_root do
   end
 end
-tree.inject(agent, page) #=> {
+tree.scrape("http://yasuri.example.tac42.net") #=> {
 #   "group1" => {
 #           "child01" => "child01"
 #         },
@@ -601,15 +619,14 @@ tree.inject(agent, page) #=> {
 -------------------------
 ## Usage
-#### When using as a library
+### Using as a library
 When used as a library, the tree can be defined in DSL, json, or yaml form.
 ```ruby
-require 'mechanize'
 require 'yasuri'
 # 1. Build the parse tree
-# When defining with the DSL
+# Define with the DSL
 tree = Yasuri.links_title '/html/body/a' do
          text_name '/html/body/p'
        end
@@ -634,17 +651,11 @@ links_title:
 EOYAML
 tree = Yasuri.yaml2tree(src)
-# 2. Start parsing by supplying a Mechanize agent and the target page
-agent = Mechanize.new
-page = agent.get(uri)
-tree.inject(agent, page)
+# 2. Start parsing by supplying the URL
+tree.inject(uri)
 ```
-#### When using as a CLI tool
+### Using as a CLI tool
 **Showing help**
 ```sh
@@ -655,13 +666,14 @@ Usage:
 Options:
   f, [--file=FILE]  # path to file that written yasuri tree as json or yaml
   j, [--json=JSON]  # yasuri tree format json string
+  i, [--interval=N]  # interval each request [ms]
 Getting from <URI> and scrape it. with <JSON> or json/yml from <TREE_FILE>. They should be Yasuri's format json or yaml string.
 ```
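 With the CLI tool, specify the parse tree in one of the following two ways.
-+ With the `--file`, `-f` option, load a parse tree written to a file in json or yaml format
-+ With the `--json`, `-j` option, specify the parse tree directly as a string
++ `--file`, `-f` : load a parse tree written to a file in json or yaml format
++ `--json`, `-j` : specify the parse tree directly as a string
 **Example: specifying the parse tree with a file**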
@@ -683,6 +695,8 @@ text_desc: "//*[@id=\"intro\"]/p"
 {"title":"Ruby Programming Language","desc":"\n    A dynamic, open source programming language with a focus on\n    simplicity and productivity. It has an elegant syntax that is\n    natural to read and easy to write.\n    "}
 ```
+Whether the file is written in json or yaml is detected automatically.
 **Example: specifying the parse tree directly as json**
 ```sh
 $ yasuri scrape "https://www.ruby-lang.org/en/" -j '
@@ -693,3 +707,10 @@ $ yasuri scrape "https://www.ruby-lang.org/en/" -j '
 {"title":"Ruby Programming Language","desc":"\n    A dynamic, open source programming language with a focus on\n    simplicity and productivity. It has an elegant syntax that is\n    natural to read and easy to write.\n    "}
 ```
+#### Other options
++ `--interval`, `-i` : the interval [milliseconds] between requests when requesting multiple pages.
+   **Example: requesting at 1-second intervals**
+   ```sh
+   $ yasuri scrape "https://www.ruby-lang.org/en/" --file sample.yml --interval 1000
+   ```
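
The `interval_ms` and `retry_count` options introduced in this diff can be pictured with a small stand-alone sketch. This is not Yasuri's implementation: `fetch_all` and the `fetcher` callable are hypothetical names, and the only behavior taken from the documentation above is a pause between requests and a default of 5 retries when a fetch fails.

```ruby
# Hypothetical sketch, not Yasuri's code: fetch each URI in turn, pausing
# interval_ms milliseconds between requests and retrying failed fetches.
# `fetcher` is any callable that takes a URI and returns the page body.
def fetch_all(uris, fetcher, interval_ms: 0, retry_count: 5)
  uris.each_with_index.map do |uri, index|
    # Pause before every request except the first one.
    sleep(interval_ms / 1000.0) if index.positive? && interval_ms.positive?

    attempts = 0
    begin
      fetcher.call(uri)
    rescue StandardError
      attempts += 1
      # Re-run the fetch until the retry budget is used up, then give up.
      retry if attempts <= retry_count
      raise
    end
  end
end
```

With `retry_count: 5`, a page is attempted at most six times in this sketch (one initial request plus five retries) before the error is re-raised to the caller.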