yasuri 3.2.0 → 3.3.0

data/USAGE.md CHANGED
@@ -1,7 +1,9 @@
 # Yasuri
 
 ## What is Yasuri
-`Yasuri` (鑢) is a library for declarative web scraping and a command line tool for scraping with it. It performs scraping using "[Mechanize](https://github.com/sparklemotion/mechanize)" by simply describing the expected result in a simple declarative notation.
+`Yasuri` (鑢) is a library for declarative web scraping and a command line tool for scraping with it.
+
+It performs scraping by simply describing the expected result in a simple declarative notation.
 
 Yasuri makes it easy to write common scraping operations.
 For example, the following processes can be easily implemented.
@@ -11,7 +13,6 @@ For example, the following processes can be easily implemented.
 + Scrape each table that appears repeatedly in the page and get the result as an array
 + Scrape only the first three pages of each page provided by pagination
 
-
 ## Quick Start
 
 
@@ -36,10 +37,7 @@ root = Yasuri.links_root '//*[@id="menu"]/ul/li/a' do
   text_content '//*[@id="contents"]/p[1]'
 end
 
-agent = Mechanize.new
-root_page = agent.get("http://some.scraping.page.tac42.net/")
-
-result = root.inject(agent, root_page)
+result = root.scrape("http://some.scraping.page.tac42.net/")
 # => [
 #   {"title" => "PageTitle 01", "content" => "Page Contents 01" },
 #   {"title" => "PageTitle 02", "content" => "Page Contents 02" },
@@ -171,7 +169,51 @@ In json or yaml format, an attribute can directly specify `path` as a value if it
   }
 }
 ```
+### Run ParseTree
+Call the `Node#scrape(uri, opt={})` method on the root node of the parse tree.
+
+**Example**
+```ruby
+root = Yasuri.links_root '//*[@id="menu"]/ul/li/a' do
+  text_title '//*[@id="contents"]/h2'
+  text_content '//*[@id="contents"]/p[1]'
+end
+
+result = root.scrape("http://some.scraping.page.tac42.net/", interval_ms: 1000)
+```
+
++ `uri` is the URI of the page to be scraped.
++ `opt` is an options Hash. The following options are available.
+
+Yasuri internally uses `Mechanize` as its agent for scraping.
+If you want to supply this instance yourself, call `Node#scrape_with_agent(uri, agent, opt={})`.
+
+```ruby
+require 'logger'
+
+agent = Mechanize.new
+agent.log = Logger.new $stderr
+agent.request_headers = {
+  # ...
+}
+
+result = root.scrape_with_agent(
+  "http://some.scraping.page.tac42.net/",
+  agent,
+  interval_ms: 1000)
+```
+
+### `opt`
+#### `interval_ms`
+The interval [milliseconds] between requests when scraping multiple pages.
+
+If omitted, requests are made continuously with no interval. If requests to many pages are expected, it is strongly recommended to specify an interval to avoid placing a high load on the target host.
+
+#### `retry_count`
+The number of retries when fetching a page fails. If omitted, it retries 5 times.
 
+#### `symbolize_names`
+If true, the keys of the result set are returned as symbols.
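*(Editor's note: the two request options above amount to plain rate-limiting and retry behavior. The sketch below is an illustration in plain Ruby, not Yasuri's actual implementation; `fetch_pages` and its `fetch:` callable are made-up names standing in for the real HTTP requests.)*

```ruby
# Illustration only: roughly what `interval_ms` and `retry_count` do.
# `fetch:` stands in for the real HTTP request.
def fetch_pages(uris, fetch:, interval_ms: 0, retry_count: 5)
  uris.each_with_index.map do |uri, i|
    # Wait between requests (but not before the first one).
    sleep(interval_ms / 1000.0) if interval_ms > 0 && i > 0
    attempts = 0
    begin
      fetch.call(uri)
    rescue StandardError
      attempts += 1
      retry if attempts <= retry_count
      raise
    end
  end
end

# A fetcher that fails twice before succeeding still yields a result:
calls = 0
flaky = ->(uri) { calls += 1; raise "timeout" if calls < 3; "body of #{uri}" }
fetch_pages(["page01.html"], fetch: flaky, retry_count: 5)
#=> ["body of page01.html"]
```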
 
 --------------------------
 ## Node
@@ -216,7 +258,7 @@ TextNode returns scraped text. This node has to be a leaf.
 ### Example
 
 ```html
-<!-- http://yasuri.example.net -->
+<!-- http://yasuri.example.tac42.net -->
 <html>
   <head></head>
   <body>
@@ -227,16 +269,13 @@ TextNode returns scraped text. This node has to be a leaf.
 ```
 
 ```ruby
-agent = Mechanize.new
-page = agent.get("http://yasuri.example.net")
-
 p1  = Yasuri.text_title '/html/body/p[1]'
 p1t = Yasuri.text_title '/html/body/p[1]', truncate:/^[^,]+/
 p2u = Yasuri.text_title '/html/body/p[1]', proc: :upcase
 
-p1.inject(agent, page)  #=> "Hello,World"
-p1t.inject(agent, page) #=> "Hello"
-p2u.inject(agent, page) #=> "HELLO,WORLD"
+p1.scrape("http://yasuri.example.tac42.net")  #=> "Hello,World"
+p1t.scrape("http://yasuri.example.tac42.net") #=> "Hello"
+p2u.scrape("http://yasuri.example.tac42.net") #=> "HELLO,WORLD"
 ```
 
 Note that if you want to scrape multiple elements in the same page at once, use `MapNode`. See the `MapNode` example for details.
@@ -247,7 +286,7 @@ Matches the regexp and truncates the text. When you use a group, it returns the first match group.
 
 ```ruby
 node = Yasuri.text_example '/html/body/p[1]', truncate:/H(.+)i/
-node.inject(agent, index_page)
+node.scrape(uri)
 #=> { "example" => "ello,Yasur" }
 ```
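*(Editor's note: the effect of `truncate` can be reproduced with plain-Ruby regexp indexing. This is an illustration of the documented behavior, not the library's code.)*

```ruby
# Without a group, the whole match is kept:
"Hello,World"[/^[^,]+/]       #=> "Hello"

# With a group, the first capture group is returned:
"Hello,Yasuri"[/H(.+)i/, 1]   #=> "ello,Yasur"

# No match yields nil:
"Hello,World"[/\d+/]          #=> nil
```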
 
@@ -258,21 +297,22 @@ If the `truncate` option is also given, the method is applied after truncation.
 
 ```ruby
 node = Yasuri.text_example '/html/body/p[1]', proc: :upcase, truncate:/H(.+)i/
-node.inject(agent, index_page)
+node.scrape(uri)
 #=> { "example" => "ELLO,YASUR" }
 ```
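*(Editor's note: combined with `truncate`, the `proc:` symbol is sent to the already-truncated text. In plain Ruby the equivalent is the following; again an illustration, not the library's code.)*

```ruby
# Truncate first, then send the named method:
truncated = "Hello,Yasuri"[/H(.+)i/, 1]  #=> "ello,Yasur"
truncated.public_send(:upcase)           #=> "ELLO,YASUR"
```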
 
 ## Struct Node
 Struct Node returns structured text.
 
-At first, Struct Node narrow down sub-tags by `Path`. Child nodes parse narrowed tags, and struct node returns hash contains parsed result.
+First, Struct Node narrows down sub-tags by `Path`.
+Child nodes parse the narrowed tags, and the struct node returns a hash containing the parsed results.
 
 If Struct Node's `Path` matches multiple sub-tags, the child nodes parse each sub-tag and the struct node returns an array.
 
 ### Example
 
 ```html
-<!-- http://yasuri.example.net -->
+<!-- http://yasuri.example.tac42.net -->
 <html>
   <head>
     <title>Books</title>
@@ -313,15 +353,12 @@ If Struct Node's `Path` matches multiple sub-tags, the child nodes parse each sub-tag and the struct node returns an array.
 ```
 
 ```ruby
-agent = Mechanize.new
-page = agent.get("http://yasuri.example.net")
-
 node = Yasuri.struct_table '/html/body/table[1]/tr' do
   text_title    './td[1]'
   text_pub_date './td[2]'
-])
+end
 
-node.inject(agent, page)
+node.scrape("http://yasuri.example.tac42.net")
 #=> [ { "title"    => "The Perfect Insider",
 #      "pub_date" => "1996/4/5" },
 #    { "title"    => "Doctors in Isolated Room",
@@ -340,17 +377,14 @@ Struct node can contain not only Text node.
 ### Example
 
 ```ruby
-agent = Mechanize.new
-page = agent.get("http://yasuri.example.net")
-
 node = Yasuri.struct_tables '/html/body/table' do
   struct_table './tr' do
     text_title    './td[1]'
     text_pub_date './td[2]'
   end
-])
+end
 
-node.inject(agent, page)
+node.scrape("http://yasuri.example.tac42.net")
 
 #=> [ { "table" => [ { "title"    => "The Perfect Insider",
 #                      "pub_date" => "1996/4/5" },
@@ -383,7 +417,7 @@ Links Node returns parsed text from each linked page.
 
 ### Example
 ```html
-<!-- http://yasuri.example.net -->
+<!-- http://yasuri.example.tac42.net -->
 <html>
   <head><title>Yasuri Test</title></head>
   <body>
@@ -396,7 +430,7 @@ Links Node returns parsed text from each linked page.
 ```
 
 ```html
-<!-- http://yasuri.example.net/child01.html -->
+<!-- http://yasuri.example.tac42.net/child01.html -->
 <html>
   <head><title>Child 01 Test</title></head>
   <body>
  ```
411
445
 
412
446
  ```html
413
- <!-- http://yasuri.example.net/child02.html -->
447
+ <!-- http://yasuri.example.tac42.net/child02.html -->
414
448
  <html>
415
449
  <head><title>Child 02 Test</title></head>
416
450
  <body>
@@ -420,7 +454,7 @@ Links Node returns parsed text in each linked pages.
420
454
  ```
421
455
 
422
456
  ```html
423
- <!-- http://yasuri.example.net/child03.html -->
457
+ <!-- http://yasuri.example.tac42.net/child03.html -->
424
458
  <html>
425
459
  <head><title>Child 03 Test</title></head>
426
460
  <body>
@@ -433,20 +467,17 @@ Links Node returns parsed text from each linked page.
 ```
 
 ```ruby
-agent = Mechanize.new
-page = agent.get("http://yasuri.example.net")
-
 node = Yasuri.links_title '/html/body/a' do
   text_content '/html/body/p'
 end
 
-node.inject(agent, page)
+node.scrape("http://yasuri.example.tac42.net")
 #=> [ {"content" => "Child 01 page."},
       {"content" => "Child 02 page."},
       {"content" => "Child 03 page."}]
 ```
 
-At first, Links Node find all links in the page by path. In this case, LinksNode find `/html/body/a` tags in `http://yasuri.example.net`. Then, open href attributes (`./child01.html`, `./child02.html` and `./child03.html`).
+First, Links Node finds all links in the page by the given path. In this case, LinksNode finds `/html/body/a` tags in `http://yasuri.example.tac42.net`, then opens their href attributes (`./child01.html`, `./child02.html` and `./child03.html`).
 
 Then Links Node applies its child nodes to each opened page and returns the collected results as an array.
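*(Editor's note: the two steps above — collect hrefs, fetch each page, apply the child nodes — can be modeled in plain Ruby. The `links_node` function, the fake `PAGES` site, and the `fetch:` callable below are illustrative stand-ins, not Yasuri or Mechanize code.)*

```ruby
# A fake site: href => page body.
PAGES = {
  "./child01.html" => "Child 01 page.",
  "./child02.html" => "Child 02 page.",
  "./child03.html" => "Child 03 page.",
}

# Conceptual links node: fetch each linked page and apply every
# named child extraction to its body, collecting results per page.
def links_node(hrefs, fetch:, children:)
  hrefs.map do |href|
    body = fetch.call(href)
    children.transform_values { |extract| extract.call(body) }
  end
end

links_node(PAGES.keys,
           fetch:    ->(href) { PAGES[href] },
           children: { "content" => ->(body) { body } })
#=> [{"content"=>"Child 01 page."}, {"content"=>"Child 02 page."}, {"content"=>"Child 03 page."}]
```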
 
@@ -460,7 +491,7 @@ Paginate Node parses and returns each page provided by pagination.
 Target page `page01.html` is like this. `page02.html` to `page04.html` are similar.
 
 ```html
-<!-- http://yasuri.example.net/page01.html -->
+<!-- http://yasuri.example.tac42.net/page01.html -->
 <html>
   <head><title>Page01</title></head>
   <body>
@@ -480,21 +511,17 @@ Target page `page01.html` is like this. `page02.html` to `page04.html` are similar.
 ```
 
 ```ruby
-agent = Mechanize.new
-page = agent.get("http://yasuri.example.net/page01.html")
-
-node = Yasuri.pages_root "/html/body/nav/span/a[@class='next']" do
+node = Yasuri.pages_root "/html/body/nav/span/a[@class='next']", limit:3 do
   text_content '/html/body/p'
 end
 
-node.inject(agent, page)
-#=> [ {"content" => "Pagination01"},
-      {"content" => "Pagination02"},
-      {"content" => "Pagination03"},
-      {"content" => "Pagination04"}]
+node.scrape("http://yasuri.example.tac42.net/page01.html")
+#=> [ {"content" => "Patination01"},
+#     {"content" => "Patination02"},
+#     {"content" => "Patination03"}]
 ```
+Paginate Node requires a link to the next page.
+In this case, it is `NextPage`, `/html/body/nav/span/a[@class='next']`.
 
 ### Options
 ##### `limit`
@@ -504,7 +531,7 @@ Upper limit of pages to open in pagination.
 node = Yasuri.pages_root "/html/body/nav/span/a[@class='next']", limit:2 do
   text_content '/html/body/p'
 end
-node.inject(agent, page)
+node.scrape(uri)
 #=> [ {"content" => "Pagination01"}, {"content" => "Pagination02"}]
 ```
 Paginate Node opens up to 2 pages, as given by `limit`. In this situation, the pagination has 4 pages, but the result Array has 2 texts because `limit:2` was given.
@@ -513,35 +540,32 @@ Paginate Node opens up to 2 pages, as given by `limit`. In this situation, the pagination has 4 pages,
 The `flatten` option expands each page's results.
 
 ```ruby
-agent = Mechanize.new
-page = agent.get("http://yasuri.example.net/page01.html")
-
 node = Yasuri.pages_root "/html/body/nav/span/a[@class='next']" , flatten:true do
   text_title   '/html/head/title'
   text_content '/html/body/p'
 end
-node.inject(agent, page)
+node.scrape("http://yasuri.example.tac42.net/page01.html")
 
 #=> [ {"title" => "Page01",
-      "content" => "Patination01"},
-     {"title" => "Page01",
-      "content" => "Patination02"},
-     {"title" => "Page01",
-      "content" => "Patination03"}]
+#     "content" => "Patination01"},
+#    {"title" => "Page01",
+#     "content" => "Patination02"},
+#    {"title" => "Page01",
+#     "content" => "Patination03"}]
 
 
 node = Yasuri.pages_root "/html/body/nav/span/a[@class='next']" , flatten:true do
   text_title   '/html/head/title'
   text_content '/html/body/p'
 end
-node.inject(agent, page)
+node.scrape("http://yasuri.example.tac42.net/page01.html")
 
 #=> [ "Page01",
-      "Patination01",
-      "Page02",
-      "Patination02",
-      "Page03",
-      "Patination03"]
+#     "Patination01",
+#     "Page02",
+#     "Patination02",
+#     "Page03",
+#     "Patination03"]
 ```
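*(Editor's note: the difference `flatten: true` makes can be shown with plain Ruby on per-page hashes shaped like the example above — an illustration only, not library code.)*

```ruby
per_page = [
  { "title" => "Page01", "content" => "Patination01" },
  { "title" => "Page02", "content" => "Patination02" },
  { "title" => "Page03", "content" => "Patination03" },
]

# Without flatten, the array of per-page hashes is returned as-is;
# with flatten, each page's values are expanded into one flat array:
per_page.flat_map(&:values)
#=> ["Page01", "Patination01", "Page02", "Patination02", "Page03", "Patination03"]
```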
 
 ## Map Node
@@ -550,7 +574,7 @@ node.inject(agent, page)
 ### Example
 
 ```html
-<!-- http://yasuri.example.net -->
+<!-- http://yasuri.example.tac42.net -->
 <html>
   <head><title>Yasuri Example</title></head>
   <body>
@@ -561,16 +585,12 @@ node.inject(agent, page)
 ```
 
 ```ruby
-agent = Mechanize.new
-page = agent.get("http://yasuri.example.net")
-
-
 tree = Yasuri.map_root do
   text_title  '/html/head/title'
   text_body_p '/html/body/p[1]'
 end
 
-tree.inject(agent, page) #=> { "title" => "Yasuri Example", "body_p" => "Hello,World" }
+tree.scrape("http://yasuri.example.tac42.net") #=> { "title" => "Yasuri Example", "body_p" => "Hello,World" }
 
 
 tree = Yasuri.map_root do
@@ -581,7 +601,7 @@ tree = Yasuri.map_root do
   end
 end
 
-tree.inject(agent, page) #=> {
+tree.scrape("http://yasuri.example.tac42.net") #=> {
 #     "group1" => {
 #       "child01" => "child01"
 #     },
@@ -596,18 +616,15 @@ tree.inject(agent, page) #=> {
 None.
 
 
-
-
 -------------------------
 ## Usage
 
-#### Use as library
+### Use as library
 When used as a library, the tree can be defined in DSL, json, or yaml format.
+
 ```ruby
-require 'mechanize'
 require 'yasuri'
 
-
 # 1. Create a parse tree.
 # Define by Ruby's DSL
 tree = Yasuri.links_title '/html/body/a' do
@@ -634,17 +651,11 @@ links_title:
 EOYAML
 tree = Yasuri.yaml2tree(src)
 
-
-
-# 2. Give the Mechanize agent and the target page to start parsing
-agent = Mechanize.new
-page = agent.get(uri)
-
-
-tree.inject(agent, page)
+# 2. Give the URL to start parsing
+tree.scrape(uri)
 ```
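*(Editor's note: the yaml handed to `Yasuri.yaml2tree` is ordinary YAML, so its structure can be inspected with Ruby's stdlib. The tree below is a hypothetical definition — node names and paths are illustrative, following the format described earlier in this document.)*

```ruby
require 'yaml'

# Hypothetical Yasuri tree definition (illustrative names and paths).
src = <<~EOYAML
  links_title:
    path: "/html/body/a"
    text_content: "/html/body/p"
EOYAML

tree_def = YAML.load(src)
tree_def["links_title"]["path"]  #=> "/html/body/a"
```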
 
-#### Use as CLI tool
+### Use as CLI tool
 
 **Help**
 ```sh
@@ -655,13 +666,14 @@
 Options:
   f, [--file=FILE]        # path to file that written yasuri tree as json or yaml
   j, [--json=JSON]        # yasuri tree format json string
+  i, [--interval=N]       # interval each request [ms]
 
 Getting from <URI> and scrape it. with <JSON> or json/yml from <TREE_FILE>. They should be Yasuri's format json or yaml string.
 ```
 
 In the CLI tool, you can specify the parse tree in either of the following ways.
-+ `--file`, `-f` option to read the parse tree in json or yaml format output to a file.
-+ `--json`, `-j` option to specify the parse tree directly as a string.
++ `--file`, `-f` : read the parse tree from a json- or yaml-format file.
++ `--json`, `-j` : specify the parse tree directly as a string.
 
 
 **Example of specifying a parse tree as a file**
@@ -695,3 +707,10 @@ $ yasuri scrape "https://www.ruby-lang.org/en/" -j '
 
 {"title":"Ruby Programming Language","desc":"\n    A dynamic, open source programming language with a focus on\n    simplicity and productivity. It has an elegant syntax that is\n    natural to read and easy to write.\n    "}
 ```
+
+#### Other options
++ `--interval`, `-i` : The interval [milliseconds] for requesting multiple pages.
+
+**Example: Request at 1-second intervals**
+```sh
+$ yasuri scrape "https://www.ruby-lang.org/en/" --file sample.yml --interval 1000
+```
@@ -0,0 +1,79 @@
+#!/usr/bin/env ruby
+# Author:: TAC (tac@tac42.net)
+
+require 'yasuri'
+uri = "https://github.com/tac0x2a?tab=repositories"
+
+# Node tree constructing by DSL
+root = Yasuri.map_root do
+  text_title '/html/head/title'
+  links_repo '//*[@id="user-repositories-list"]/ul/li/div[1]/div[1]/h3/a' do
+    text_name  '//*[@id="js-repo-pjax-container"]/div[1]/div[1]/div/h1/strong/a'
+    text_desc  '//*[@id="repo-content-pjax-container"]/div/div[2]/div[2]/div/div[1]/div/p', proc: :strip
+    text_stars '//*[@id="js-repo-pjax-container"]/div[1]/div[2]/div[2]/a[1]', proc: :to_i
+    text_forks '//*[@id="js-repo-pjax-container"]/div[1]/div[2]/div[2]/a[2]/span', proc: :to_i
+  end
+end
+
+# Node tree constructing by YAML
+# src = <<-EOYML
+# text_title: /html/head/title
+# links_repo:
+#   path: //*[@id="user-repositories-list"]/ul/li/div[1]/div[1]/h3/a
+#   text_name: //*[@id="js-repo-pjax-container"]/div[1]/div[1]/div/h1/strong/a
+#   text_desc:
+#     path: //*[@id="repo-content-pjax-container"]/div/div[2]/div[2]/div/div[1]/div/p
+#     proc: :strip
+#   text_stars:
+#     path: //*[@id="js-repo-pjax-container"]/div[1]/div[2]/div[2]/a[1]
+#     proc: :to_i
+#   text_forks:
+#     path: //*[@id="js-repo-pjax-container"]/div[1]/div[2]/div[2]/a[2]/span
+#     proc: :to_i
+# EOYML
+# root = Yasuri.yaml2tree(src)
+
+contents = root.scrape(uri, interval_ms: 100)
+# jj contents
+# {
+#   "title": "tac0x2a (TAC) / Repositories · GitHub",
+#   "repo": [
+#     {
+#       "name": "o-namazu",
+#       "desc": "Oh Namazu (Catfish) in datalake",
+#       "stars": 1,
+#       "forks": 0
+#     },
+#     {
+#       "name": "grebe",
+#       "desc": "grebe in datalake",
+#       "stars": 2,
+#       "forks": 0
+#     },
+#     {
+#       "name": "yasuri",
+#       "desc": "Yasuri (鑢) is easy web scraping library.",
+#       "stars": 43,
+#       "forks": 1
+#     },
+#     {
+#       "name": "dotfiles",
+#       "desc": "dotfiles",
+#       "stars": 0,
+#       "forks": 0
+#     }
+#     ...
+#   ]
+# }
+
+# Output as markdown
+puts "# #{contents['title']}"
+contents['repo'].each do |h|
+  puts "-----"
+  puts "## #{h['name']}"
+  puts h['desc']
+  puts ""
+  puts "* Stars: #{h['stars']}"
+  puts "* Forks: #{h['forks']}"
+  puts ""
+end