RubyGems - yasuri - Versions diffs - 3.1.0 → 3.2.0 - Mend

yasuri 3.1.0 → 3.2.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (24)

checksums.yaml +4 -4
data/README.md +54 -24
data/USAGE.ja.md +216 -72
data/USAGE.md +225 -78
data/exe/yasuri +5 -0
data/lib/yasuri.rb +1 -0
data/lib/yasuri/version.rb +1 -1
data/lib/yasuri/yasuri.rb +71 -36
data/lib/yasuri/yasuri_cli.rb +64 -0
data/lib/yasuri/yasuri_links_node.rb +3 -3
data/lib/yasuri/yasuri_map_node.rb +12 -27
data/lib/yasuri/yasuri_node.rb +15 -37
data/lib/yasuri/yasuri_paginate_node.rb +5 -4
data/lib/yasuri/yasuri_struct_node.rb +5 -1
data/lib/yasuri/yasuri_text_node.rb +5 -5
data/spec/cli_resources/tree.json +8 -0
data/spec/cli_resources/tree.yml +5 -0
data/spec/cli_resources/tree_wrong.json +9 -0
data/spec/cli_resources/tree_wrong.yml +6 -0
data/spec/spec_helper.rb +1 -1
data/spec/yasuri_cli_spec.rb +83 -0
data/spec/yasuri_spec.rb +125 -140
data/yasuri.gemspec +3 -1
metadata +31 -4

checksums.yaml CHANGED

@@ -1,7 +1,7 @@
 ---
 SHA256:
-  metadata.gz: f3542a2cc0959a4534520f6104fc2922bdf0dbd368fcd4c149c3d251c2fc2198
-  data.tar.gz: 6fdb960db697e9a4ec1d87f2b83bf0e9914e3c9efe90764536bbee6d68774353
+  metadata.gz: cd5fc7327c6d09b37771ac1c3ec40db2c052bf49ec9a1627e9ae49e047102856
+  data.tar.gz: a645f1e09ce72b73c54e2055af6fbf81bb145c8823e1d8428bb19c042bbb661d
 SHA512:
-  metadata.gz: 9df576243bea289f4c285c46f1bd2137b7b69b79b24e0c657e4ac952114dd7bcf82a5f95cd2dae88c6eac4e3e468273b7dbd6ead9d05ffdc8d25861921702333
-  data.tar.gz: 13f2ae72b3e8fa6d3ef58932daa2acad49f5d4f57c80f34e5215394940fc2305bc016d949760efe9f43ae2b8c3796064a1b0bd9bccf236cfe3789c2c291dfd8b
+  metadata.gz: 654bd6cfe8012811283b1aa03e0dcc1200ce957ef4641eed2b5fa65956fb974070157b832e42f340d7299031756848c5118a7f43019ff94f088c49974e2304e8
+  data.tar.gz: 5ad07b82672ea2ceebfb8154bb91631c095e9ad8d69f3d62c0bf8d528c4c539fab2597f4112b4212bffe7ad641b30d913686e8e2bfea7dfdbdd9a4468311b6c0
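
The hashes above cover the `metadata.gz` and `data.tar.gz` members of the gem archive. As a rough sketch, one of them could be checked locally with Ruby's standard `digest` library. The helper name `sha256_matches?` is an illustrative assumption and is not part of yasuri or RubyGems.

```ruby
require 'digest'

# Hypothetical helper: returns true when the SHA256 hex digest of the
# file at `path` equals `expected_hex` (compared case-insensitively).
def sha256_matches?(path, expected_hex)
  Digest::SHA256.file(path).hexdigest == expected_hex.downcase
end
```

After unpacking the 3.2.0 gem archive, the helper would be called with the path to `data.tar.gz` and the full SHA256 value from the `+` line of this hunk.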

data/README.md CHANGED

@@ -2,7 +2,7 @@
 [![Build Status](https://github.com/tac0x2a/yasuri/actions/workflows/ruby.yml/badge.svg)](https://github.com/tac0x2a/yasuri/actions/workflows/ruby.yml)
 [![Coverage Status](https://coveralls.io/repos/tac0x2a/yasuri/badge.svg?branch=master)](https://coveralls.io/r/tac0x2a/yasuri?branch=master) [![Maintainability](https://api.codeclimate.com/v1/badges/c29480fea1305afe999f/maintainability)](https://codeclimate.com/github/tac0x2a/yasuri/maintainability)
-Yasuri (鑢) is an easy web-scraping library for supporting "[Mechanize](https://github.com/sparklemotion/mechanize)".
+Yasuri (鑢) is an easy web-scraping library for supporting "[Mechanize](https://github.com/sparklemotion/mechanize)", and CLI tool using it.
 Yasuri can reduce frequently repeated processes in scraping.
@@ -33,10 +33,10 @@ or
 ```ruby
 # for Ruby 1.9.3 or lower
-gem 'yasuri', '~> 1.9'
+gem 'yasuri', '~> 2.0', '>= 2.0.13'
 # for Ruby 3.0.0 or lower
-gem 'yasuri', '~> 3.0.1'
+gem 'yasuri', '~> 3.1'
 ```
@@ -49,6 +49,7 @@ Or install it yourself as:
     $ gem install yasuri
 ## Usage
+### Use as library
 ```ruby
 # Node tree constructing by DSL
@@ -60,40 +61,64 @@ root = Yasuri.links_root '//*[@id="menu"]/ul/li/a' do
 # Node tree constructing by YAML
 src = <<-EOYAML
-root:
-  node: links
+links_root:
   path: "//*[@id='menu']/ul/li/a"
-  children:
-    - title:   { node: text, path: "//*[@id='contents']/h2" }
-    - content: { node: text, path: "//*[@id='contents']/p[1]" }
+  text_title: "//*[@id='contents']/h2"
+  text_content: "//*[@id='contents']/p[1]"
 EOYAML
 root = Yasuri.yaml2tree(src)
 # Node tree constructing by JSON
 src = <<-EOJSON
-   { "node"     : "links",
-     "name"     : "root",
-     "path"     : "//*[@id='menu']/ul/li/a",
-     "children" : [
-                    { "node" : "text",
-                      "name" : "title",
-                      "path" : "//*[@id='contents']/h2"
-                    },
-                    { "node" : "text",
-                      "name" : "content",
-                      "path" : "//*[@id='contents']/p[1]"
-                    }
-                  ]
-   }
+{
+  "links_root": {
+    "path": "//*[@id='menu']/ul/li/a",
+    "text_title": "//*[@id='contents']/h2",
+    "text_content": "//*[@id='contents']/p[1]"
+  }
+}
 EOJSON
 root = Yasuri.json2tree(src)
+# Execution and getting scraped result
 agent = Mechanize.new
-root_page = agent.get("http://some.scraping.page.net/")
+root_page = agent.get("http://some.scraping.page.tac42.net/")
 result = root.inject(agent, root_page)
-# => [ {"title" => "PageTitle", "content" => "Page Contents" }, ...  ]
+# => [
+#      {"title" => "PageTitle 01", "content" => "Page Contents  01" },
+#      {"title" => "PageTitle 02", "content" => "Page Contents  02" },
+#      ...
+#      {"title" => "PageTitle N",  "content" => "Page Contents  N" }
+#    ]
+```
+### Use as CLI
+```sh
+# After gem installation..
+$ yasuri help scrape
+Usage:
+  yasuri scrape <URI> [[--file <TREE_FILE>] or [--json <JSON>]]
+Options:
+  f, [--file=FILE]  # path to file that written yasuri tree as json or yaml
+  j, [--json=JSON]  # yasuri tree format json string
+Getting from <URI> and scrape it. with <JSON> or json/yml from <TREE_FILE>. They should be Yasuri's format json or yaml string.
+```
+Example
+```sh
+$ yasuri scrape "https://www.ruby-lang.org/en/" -j '
+{
+  "text_title": "/html/head/title",
+  "text_desc": "//*[@id=\"intro\"]/p"
+}'
+{"title":"Ruby Programming Language","desc":"\n    A dynamic, open source programming language with a focus on\n    simplicity and productivity. It has an elegant syntax that is\n    natural to read and easy to write.\n    "}
 ```
 ## Dev
@@ -108,6 +133,11 @@ $ rake
 $ rspec spec/*spec.rb
 ```
+### Test gem in local
+```sh
+$ gem build yasuri.gemspec
+$ gem install yasuri-*.gem
+```
 ### Release RubyGems
 ```sh
 # Only first time
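
The README hunk above replaces the `node`/`name`/`path`/`children` JSON layout with keys of the form `<type>_<name>`. The sketch below shows one way an existing tree in the removed JSON layout might be rewritten into the new one. `convert_legacy_node` is a hypothetical helper written for this illustration, not a yasuri API, and it assumes legacy nodes carry no options.

```ruby
require 'json'

# Hypothetical helper: converts one legacy node Hash
# ("node"/"name"/"path"/"children") into { "<type>_<name>" => body }.
# Assumption: options on legacy nodes are ignored.
def convert_legacy_node(node)
  key = "#{node.fetch('node')}_#{node.fetch('name')}"
  children = node.fetch('children', [])

  # A node without children can use the path itself as the value.
  return { key => node['path'] } if children.empty?

  body = {}
  body['path'] = node['path'] if node['path']
  children.each { |child| body.merge!(convert_legacy_node(child)) }
  { key => body }
end

# The legacy JSON tree removed from the README in this diff.
legacy = JSON.parse(<<~EOJSON)
  { "node": "links", "name": "root", "path": "//*[@id='menu']/ul/li/a",
    "children": [
      { "node": "text", "name": "title",   "path": "//*[@id='contents']/h2" },
      { "node": "text", "name": "content", "path": "//*[@id='contents']/p[1]" }
    ] }
EOJSON

puts JSON.pretty_generate(convert_legacy_node(legacy))
```

The printed structure has the same keys and values as the `links_root` JSON added in the hunk above.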

data/USAGE.ja.md CHANGED

@@ -1,24 +1,31 @@
-# How to use Yasuri
+# Yasuri
 ## What is Yasuri
-Yasuri (鑢) is a library that supports "[Mechanize](https://github.com/sparklemotion/mechanize)" to make web scraping easy.
+Yasuri (鑢) is a library for declarative web scraping, and a command-line scraping tool built on it.
+You only describe the expected result in a simple declarative notation, and Yasuri runs the scraping with "[Mechanize](https://github.com/sparklemotion/mechanize)".
 Yasuri lets you describe common scraping tasks easily.
-For example,
+For example, the following tasks are easy to implement:
-+ Open multiple links in a page and get the scraped result of each page as a Hash
 + Scrape multiple texts in a page, name them, and collect them into a Hash
++ Open multiple links in a page and get the scraped result of each page as a Hash
 + Scrape each table that appears repeatedly in a page and get the results as an array
-+ Scrape only the top 3 of the pages provided by pagination, in order
-These can be implemented easily.
++ Scrape only the first 3 of the pages provided by pagination
 ## Quick Start
+#### Install
+```sh
+# for Ruby 2.3.2
+$ gem 'yasuri', '~> 2.0', '>= 2.0.13'
 ```
+or
+```sh
+# for Ruby 3.0.0 or upper
 $ gem install yasuri
 ```
+#### Use as library
 ```ruby
 require 'yasuri'
 require 'mechanize'
@@ -30,88 +37,59 @@ root = Yasuri.links_root '//*[@id="menu"]/ul/li/a' do
        end
 agent = Mechanize.new
-root_page = agent.get("http://some.scraping.page.net/")
+root_page = agent.get("http://some.scraping.page.tac42.net/")
 result = root.inject(agent, root_page)
-# => [ {"title" => "PageTitle1", "content" => "Page Contents1" },
-#      {"title" => "PageTitle2", "content" => "Page Contents2" }, ...  ]
+# => [
+#      {"title" => "PageTitle 01", "content" => "Page Contents  01" },
+#      {"title" => "PageTitle 02", "content" => "Page Contents  02" },
+#      ...
+#      {"title" => "PageTitle N",  "content" => "Page Contents  N" }
+#    ]
 ```
 This example scrapes two texts, specified by the xpaths of the TextNodes (`text_title`, `text_content`), from each page linked by the xpath of the LinkNode (`links_root`).
 (In other words, it opens each link matched by `//*[@id="menu"]/ul/li/a` and scrapes the texts specified by `//*[@id="contents"]/h2` and `//*[@id="contents"]/p[1]`.)
-## Basics
-1. Build a parse tree
-2. Give it a Mechanize agent and the target page to start parsing
-### Build a parse tree
-```ruby
-require 'mechanize'
-require 'yasuri'
+#### Use as CLI tool
+The same thing can be run as a CLI command.
+```sh
+$ yasuri scrape "http://some.scraping.page.tac42.net/" -j '
+{
+  "links_root": {
+    "path": "//*[@id=\"menu\"]/ul/li/a",
+    "text_title": "//*[@id=\"contents\"]/h2",
+    "text_content": "//*[@id=\"contents\"]/p[1]"
+    }
+}'
-# 1. Build a parse tree
-tree = Yasuri.links_title '/html/body/a' do
-         text_name '/html/body/p'
-       end
-# 2. Give it a Mechanize agent and the target page to start parsing
-agent = Mechanize.new
-page = agent.get(uri)
-tree.inject(agent, page)
+[
+  {"title":"PageTitle 01","content":"Page Contents  01"},
+  {"title":"PageTitle 02","content":"Page Contents  02"},
+  ...,
+  {"title":"PageTitle N","content":"Page Contents  N"}
+]
 ```
-The tree can be defined in json, yaml, or the DSL. The example above uses the DSL.
-The following defines a parse tree equivalent to the above in json.
+The result is returned as a json string.
-```ruby
-# Building from json
-src = <<-EOJSON
-   { "node"     : "links",
-     "name"     : "title",
-     "path"     : "/html/body/a",
-     "children" : [
-                    { "node" : "text",
-                      "name" : "name",
-                      "path" : "/html/body/p"
-                    }
-                  ]
-   }
-EOJSON
-tree = Yasuri.json2tree(src)
-```
+----------------------------
+## Parse Tree
-```ruby
-# Building from yaml
-src = <<-EOYAML
-title:
-  node: links
-  path: "/html/body/a"
-  children:
-    - name:
-        node: text
-        path: "/html/body/p"
-EOYAML
-tree = Yasuri.yaml2tree(src)
-```
+A parse tree is tree-structured data that declaratively defines the elements to scrape and the structure of the output.
+A parse tree consists of nested Nodes. A Node has the attributes `Type`, `Name`, `Path`, `Children`, and `Options`, and scrapes according to its `Type`. (Only `MapNode` has no `Path`.)
-### Node
-The tree consists of nested *Node*s.
-A Node has `Type`, `Name`, `Path`, `Children`, and `Options`.
-(Only `MapNode` has no `Path`.)
-A Node is defined in the following format.
+A parse tree is defined in the following format.
 ```ruby
+# A simple tree consisting of one node
 Yasuri.<Type>_<Name> <Path> [,<Options>]
-# When nested
+# A nested tree
 Yasuri.<Type>_<Name> <Path> [,<Options>] do
   <Type>_<Name> <Path> [,<Options>] do
     <Type>_<Name> <Path> [,<Options>]
@@ -120,12 +98,13 @@ Yasuri.<Type>_<Name> <Path> [,<Options>] do
 end
 ```
-Example
+**Example**
 ```ruby
+# A simple tree consisting of one node
 Yasuri.text_title '/html/head/title', truncate:/^[^,]+/
-# When nested
+# A nested tree
 Yasuri.links_root '//*[@id="menu"]/ul/li/a' do
   struct_table './tr' do
     text_title    './td[1]'
@@ -135,6 +114,71 @@ end
 ```
+A parse tree can be defined in the Ruby DSL, JSON, or YAML.
+The following examples define the same parse tree in each notation.
+**Defined in the Ruby DSL**
+```ruby
+Yasuri.links_title '/html/body/a' do
+  text_name '/html/body/p'
+end
+```
+**Defined in JSON**
+```json
+{
+  "links_title": {
+    "path": "/html/body/a",
+    "text_name": "/html/body/p"
+  }
+}
+```
+**Defined in YAML**
+```yaml
+links_title:
+  path: "/html/body/a"
+  text_name: "/html/body/p"
+```
+**Special cases of parse trees**
+If there is only one element directly under the root, that element is returned directly instead of a Hash (Object).
+```json
+{
+  "text_title": "/html/head/title",
+  "text_body": "/html/body"
+}
+# => {"title": "Welcome to yasuri!", "body": "Yasuri is ..."}
+{
+  "text_title": "/html/head/title"
+}
+# => Welcome to yasuri!
+```
+In json or yaml, a Node with no child Nodes can take its `path` directly as the value. The following two json documents produce the same parse tree.
+```json
+{
+  "text_name": "/html/body/p"
+}
+{
+  "text_name": {
+    "path": "/html/body/p"
+  }
+}
+```
+--------------------------
+## Node
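+A Node is an element that forms an internal node or a leaf of the parse tree. It has `Type`, `Name`, `Path`, `Children`, and `Options`, and scrapes according to its `Type`. (Only `MapNode` has no `Path`.)
 #### Type
 *Type* indicates the behavior of a Node. The following Types are available.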
@@ -144,6 +188,8 @@ end
 - *Paginate*
 - *Map*
+See the description of each node for details.
 #### Name
 *Name* becomes the key in the Hash of the parse result.
@@ -193,7 +239,8 @@ p1t.inject(agent, page)  #=> "Hello"
 p2u.inject(agent, page)  #=> "HELLO,WORLD"
 ```
-To scrape multiple elements in the same page at once, use `MapNode`.
+To scrape multiple elements in the same page at once, use `MapNode`. See the `MapNode` example for details.
 ### Options
 ##### `truncate`
@@ -549,3 +596,100 @@ tree.inject(agent, page) #=> {
 ### Options
 None
+-------------------------
+## Usage
+#### Using as a library
+When used as a library, the tree can be defined in the DSL, json, or yaml.
+```ruby
+require 'mechanize'
+require 'yasuri'
+# 1. Build a parse tree
+# Defined in the DSL
+tree = Yasuri.links_title '/html/body/a' do
+         text_name '/html/body/p'
+       end
+# Defined in json
+src = <<-EOJSON
+{
+  "links_title": {
+    "path": "/html/body/a",
+    "text_name": "/html/body/p"
+  }
+}
+EOJSON
+tree = Yasuri.json2tree(src)
+# Defined in yaml
+src = <<-EOYAML
+links_title:
+  path: "/html/body/a"
+  text_name: "/html/body/p"
+EOYAML
+tree = Yasuri.yaml2tree(src)
+# 2. Give it a Mechanize agent and the target page to start parsing
+agent = Mechanize.new
+page = agent.get(uri)
+tree.inject(agent, page)
+```
+#### Using as a CLI tool
+**Show help**
+```sh
+$ yasuri help scrape
+Usage:
+  yasuri scrape <URI> [[--file <TREE_FILE>] or [--json <JSON>]]
+Options:
+  f, [--file=FILE]  # path to file that written yasuri tree as json or yaml
+  j, [--json=JSON]  # yasuri tree format json string
+Getting from <URI> and scrape it. with <JSON> or json/yml from <TREE_FILE>. They should be Yasuri's format json or yaml string.
+```
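+With the CLI tool, specify the parse tree in one of the following ways.
++ Use the `--file`, `-f` option to load a parse tree written to a file in json or yaml format
++ Use the `--json`, `-j` option to pass the parse tree directly as a string
+**Example: specifying the parse tree with a file**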
+```sh
+% cat sample.yml
+text_title: "/html/head/title"
+text_desc: "//*[@id=\"intro\"]/p"
+% yasuri scrape "https://www.ruby-lang.org/en/" --file sample.yml
+{"title":"Ruby Programming Language","desc":"\n    A dynamic, open source programming language with a focus on\n    simplicity and productivity. It has an elegant syntax that is\n    natural to read and easy to write.\n    "}
+% cat sample.json
+{
+  "text_title": "/html/head/title",
+  "text_desc": "//*[@id=\"intro\"]/p"
+}
+% yasuri scrape "https://www.ruby-lang.org/en/" --file sample.json
+{"title":"Ruby Programming Language","desc":"\n    A dynamic, open source programming language with a focus on\n    simplicity and productivity. It has an elegant syntax that is\n    natural to read and easy to write.\n    "}
+```
+**Example: specifying the parse tree directly as json**
+```sh
+$ yasuri scrape "https://www.ruby-lang.org/en/" -j '
+{
+  "text_title": "/html/head/title",
+  "text_desc": "//*[@id=\"intro\"]/p"
+}'
+{"title":"Ruby Programming Language","desc":"\n    A dynamic, open source programming language with a focus on\n    simplicity and productivity. It has an elegant syntax that is\n    natural to read and easy to write.\n    "}
+```
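
The trees passed to `--json` and `--file` encode each node as a `<type>_<name>` key whose value is either a path string or a nested object. The sketch below walks such a tree and lists its nodes, which can help when checking a tree file before scraping. `describe_tree` and `NODE_KEY` are hypothetical names for this illustration and do not reflect yasuri's internal parser. The `text`, `struct`, and `links` prefixes appear in this diff; `pages` and `map` are assumed prefixes for the Paginate and Map types.

```ruby
require 'json'

# Assumed key pattern: the type prefix, an underscore, then the name.
NODE_KEY = /\A(text|struct|links|pages|map)_(.+)\z/

# Hypothetical helper: returns [depth, type, name, path] for every node.
def describe_tree(tree, depth = 0)
  tree.flat_map do |key, value|
    match = NODE_KEY.match(key)
    next [] unless match # skips "path" and any other non-node keys

    type = match[1]
    name = match[2]
    if value.is_a?(Hash)
      # Nested form: the path is stored under "path", children beside it.
      [[depth, type, name, value['path']]] + describe_tree(value, depth + 1)
    else
      # Shorthand form: the value itself is the path.
      [[depth, type, name, value]]
    end
  end
end

tree = JSON.parse(<<~EOJSON)
  {
    "links_root": {
      "path": "//*[@id='menu']/ul/li/a",
      "text_title": "//*[@id='contents']/h2",
      "text_content": "//*[@id='contents']/p[1]"
    }
  }
EOJSON

describe_tree(tree).each do |depth, type, name, path|
  puts "#{'  ' * depth}#{type} #{name}: #{path}"
end
```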