yasuri 3.1.0 → 3.2.0

data/USAGE.md CHANGED
@@ -1,27 +1,31 @@
1
- # Yasuri Usage
1
+ # Yasuri
2
2
 
3
3
  ## What is Yasuri
4
- `Yasuri` is an easy web-scraping library for supporting "Mechanize".
4
+ `Yasuri` (鑢) is a library for declarative web scraping and a command line tool for scraping with it. It scrapes pages through "[Mechanize](https://github.com/sparklemotion/mechanize)"; you only describe the expected result in a simple declarative notation.
5
5
 
6
- Yasuri (鑢) is an easy web-scraping library for supporting "[Mechanize](https://github.com/sparklemotion/mechanize)".
6
+ Yasuri makes it easy to write common scraping operations.
7
+ For example, the following processes can be easily implemented.
7
8
 
8
- Yasuri can reduce frequently processes in Scraping.
9
+ + Scrape multiple texts in a page and name them into a Hash
10
+ + Open multiple links in a page and get the result of scraping each page as a Hash
11
+ + Scrape each table that appears repeatedly in the page and get the result as an array
12
+ + Scrape only the first three of the pages provided by pagination
9
13
 
10
- For example,
11
-
12
- + Open links in the page, scraping each page, and getting result as Hash.
13
- + Scraping texts in the page, and named result in Hash.
14
- + A table that repeatedly appears in a page each, scraping, get as an array.
15
- + Of each page provided by the pagination, scraping the only top 3.
16
-
17
- You can implement easy by Yasuri.
18
14
 
19
15
  ## Quick Start
20
16
 
17
+
18
+ #### Install
19
+ ```sh
20
+ # for Ruby 2.3.2
21
+ $ gem 'yasuri', '~> 2.0', '>= 2.0.13'
21
22
  ```
23
+ or
24
+ ```sh
25
+ # for Ruby 3.0.0 or later
22
26
  $ gem install yasuri
23
27
  ```
24
-
28
+ #### Use as library
25
29
  ```ruby
26
30
  require 'yasuri'
27
31
  require 'mechanize'
@@ -33,88 +37,59 @@ root = Yasuri.links_root '//*[@id="menu"]/ul/li/a' do
33
37
  end
34
38
 
35
39
  agent = Mechanize.new
36
- root_page = agent.get("http://some.scraping.page.net/")
40
+ root_page = agent.get("http://some.scraping.page.tac42.net/")
37
41
 
38
42
  result = root.inject(agent, root_page)
39
- # => [ {"title" => "PageTitle1", "content" => "Page Contents1" },
40
- # {"title" => "PageTitle2", "content" => "Page Contents2" }, ... ]
41
-
43
+ # => [
44
+ # {"title" => "PageTitle 01", "content" => "Page Contents 01" },
45
+ # {"title" => "PageTitle 02", "content" => "Page Contents 02" },
46
+ # ...
47
+ # {"title" => "PageTitle N", "content" => "Page Contents N" }
48
+ # ]
42
49
  ```
43
- This example, from the pages of each link that is expressed by the xpath of LinkNode(`links_root`), to scraping the two text that is expressed by the xpath of TextNode(`text_title`,`text_content`).
44
-
45
- (i.e. open each links `//*[@id="menu"]/ul/li/a` and, scrape `//*[@id="contents"]/h2` and `//*[@id="contents"]/p[1]`.)
46
-
47
- ## Basics
48
-
49
- 1. Construct parse tree.
50
- 2. Start parse with Mechanize agent and first page.
51
-
52
- ### Construct parse tree
53
-
54
- ```ruby
55
- require 'mechanize'
56
- require 'yasuri'
57
50
 
51
+ This example opens the page behind each link matched by the XPath of the LinkNode (`links_root`) and scrapes the two texts matched by the XPaths of the TextNodes (`text_title`, `text_content`).
58
52
 
59
- # 1. Construct parse tree.
60
- tree = Yasuri.links_title '/html/body/a' do
61
- text_name '/html/body/p'
62
- end
53
+ (In other words, open each link matching `//*[@id="menu"]/ul/li/a`, then scrape `//*[@id="contents"]/h2` and `//*[@id="contents"]/p[1]`.)
63
54
 
64
- # 2. Start parse with Mechanize agent and first page.
65
- agent = Mechanize.new
66
- page = agent.get(uri)
67
55
 
56
+ #### Use as CLI tool
57
+ The same thing as above can be executed as a CLI command.
68
58
 
69
- tree.inject(agent, page)
70
- ```
71
-
72
- Tree is definable by 3(+1) ways, json, yaml, and DSL (or basic ruby code). In above example, DSL.
59
+ ```sh
60
+ $ yasuri scrape "http://some.scraping.page.tac42.net/" -j '
61
+ {
62
+ "links_root": {
63
+ "path": "//*[@id=\"menu\"]/ul/li/a",
64
+ "text_title": "//*[@id=\"contents\"]/h2",
65
+ "text_content": "//*[@id=\"contents\"]/p[1]"
66
+ }
67
+ }'
73
68
 
74
- ```ruby
75
- # Construct by json.
76
- src = <<-EOJSON
77
- { "node" : "links",
78
- "name" : "title",
79
- "path" : "/html/body/a",
80
- "children" : [
81
- { "node" : "text",
82
- "name" : "name",
83
- "path" : "/html/body/p"
84
- }
85
- ]
86
- }
87
- EOJSON
88
- tree = Yasuri.json2tree(src)
69
+ [
70
+ {"title":"PageTitle 01","content":"Page Contents 01"},
71
+ {"title":"PageTitle 02","content":"Page Contents 02"},
72
+ ...,
73
+ {"title":"PageTitle N","content":"Page Contents N"}
74
+ ]
89
75
  ```
90
76
 
91
- ```ruby
92
- # Construct by yaml.
93
- src = <<-EOYAML
94
- title:
95
- node: links
96
- path: "/html/body/a"
97
- children:
98
- - name:
99
- node: text
100
- path: "/html/body/p"
101
- EOYAML
102
- tree = Yasuri.yaml2tree(src)
103
- ```
77
+ The result can be obtained as a string in json format.
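
Because the CLI prints JSON text, a caller can turn it back into Ruby objects with the standard `json` library. The following is a minimal sketch, assuming the command output has already been captured into a string; the literal below is made-up sample data in the shape shown above, not real scraping output.

```ruby
require 'json'

# Hypothetical captured stdout of `yasuri scrape ...`.
# The values are illustrative sample data only.
cli_output = <<~JSON
  [
    {"title":"PageTitle 01","content":"Page Contents 01"},
    {"title":"PageTitle 02","content":"Page Contents 02"}
  ]
JSON

# JSON.parse returns an Array of Hashes with String keys by default.
pages = JSON.parse(cli_output)
titles = pages.map { |page| page["title"] }
puts titles.inspect
```

Passing `symbolize_names: true` to `JSON.parse` would yield Symbol keys instead, which may be more convenient when the result is merged with other Ruby data.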
104
78
 
79
+ ----------------------------
80
+ ## Parse Tree
105
81
 
106
- ### Node
107
- Tree is constructed by nested Nodes.
108
- Node has `Type`, `Name`, `Path`, `Childlen`, and `Options`.
109
- (But only `MapNode` does not have `Path`.)
82
+ A parse tree is tree-structured data that declaratively defines the elements to be scraped and the structure of the output.
110
83
 
111
- Node is defined by this format.
84
+ A parse tree consists of nested `Node`s, each of which has `Type`, `Name`, `Path`, `Children`, and `Options` attributes and scrapes according to its `Type`. (Note that only `MapNode` does not have a `Path`.)
112
85
 
86
+ The parse tree is defined in the following format:
113
87
 
114
88
  ```ruby
89
+ # A simple tree consisting of one node
115
90
  Yasuri.<Type>_<Name> <Path> [,<Options>]
116
91
 
117
- # Nested case
92
+ # Nested tree
118
93
  Yasuri.<Type>_<Name> <Path> [,<Options>] do
119
94
  <Type>_<Name> <Path> [,<Options>] do
120
95
  <Type>_<Name> <Path> [,<Options>]
@@ -123,12 +98,13 @@ Yasuri.<Type>_<Name> <Path> [,<Options>] do
123
98
  end
124
99
  ```
125
100
 
126
- Example
101
+ **Example**
127
102
 
128
103
  ```ruby
104
+ # A simple tree consisting of one node
129
105
  Yasuri.text_title '/html/head/title', truncate:/^[^,]+/
130
106
 
131
- # Nested case
107
+ # Nested tree
132
108
  Yasuri.links_root '//*[@id="menu"]/ul/li/a' do
133
109
  struct_table './tr' do
134
110
  text_title './td[1]'
@@ -137,6 +113,72 @@ Yasuri.links_root '//*[@id="menu"]/ul/li/a' do
137
113
  end
138
114
  ```
139
115
 
116
+ Parse trees can be defined in Ruby DSL, JSON, or YAML.
117
+ The following is an example of the same parse tree as above, defined in each notation.
118
+
119
+
120
+ **Case of defining as Ruby DSL**
121
+ ```ruby
122
+ Yasuri.links_title '/html/body/a' do
123
+ text_name '/html/body/p'
124
+ end
125
+ ```
126
+
127
+ **Case of defining as JSON**
128
+ ```json
129
+ {
130
+ "links_title": {
131
+ "path": "/html/body/a",
132
+ "text_name": "/html/body/p"
133
+ }
134
+ }
135
+ ```
136
+
137
+ **Case of defining as YAML**
138
+ ```yaml
139
+ links_title:
140
+ path: "/html/body/a"
141
+ text_name: "/html/body/p"
142
+ ```
143
+
144
+ **Special case of parse tree**
145
+
146
+ If there is only one element directly under the root, the result is that element itself instead of a Hash (object).
147
+ ```json
148
+ {
149
+ "text_title": "/html/head/title",
150
+ "text_body": "/html/body"
151
+ }
152
+ # => {"title": "Welcome to yasuri!", "body": "Yasuri is ..."}
153
+
154
+ {
155
+ "text_title": "/html/head/title"
156
+ }
157
+ # => Welcome to yasuri!
158
+ ```
159
+
160
+
161
+ In json or yaml format, an attribute can take the `path` directly as its value if the Node has no child Nodes. The following two json documents produce the same parse tree.
162
+
163
+ ```json
164
+ {
165
+ "text_name": "/html/body/p"
166
+ }
167
+
168
+ {
169
+ "text_name": {
170
+ "path": "/html/body/p"
171
+ }
172
+ }
173
+ ```
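
The equivalence of the two forms can be illustrated with a small normalizer. This is only a sketch using the standard `json` library: `expand_shorthand` and `NODE_KEY_PATTERN` are hypothetical names that are not part of Yasuri, and the list of type prefixes is assumed from the node types named in this document.

```ruby
require 'json'

# Keys shaped like "<type>_<name>" are treated as nodes (assumed prefix list).
NODE_KEY_PATTERN = /\A(?:text|struct|links|pages|map)_.+\z/

# Hypothetical helper: rewrite `"text_name": "<path>"` into the expanded
# `"text_name": {"path": "<path>"}` form, recursing into nested nodes.
def expand_shorthand(tree)
  tree.to_h do |key, value|
    if value.is_a?(Hash)
      [key, expand_shorthand(value)]
    elsif value.is_a?(String) && key.match?(NODE_KEY_PATTERN)
      [key, { "path" => value }]
    else
      [key, value] # "path" itself and options are left untouched
    end
  end
end

short = JSON.parse('{ "text_name": "/html/body/p" }')
long  = JSON.parse('{ "text_name": { "path": "/html/body/p" } }')

same_after_expansion = expand_shorthand(short) == long
already_expanded     = expand_shorthand(long) == long
puts same_after_expansion && already_expanded
```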
174
+
175
+
176
+ --------------------------
177
+ ## Node
178
+
179
+ A Node is an inner node or leaf of the parse tree. It has `Type`, `Name`, `Path`, `Children`, and `Options`, and scrapes according to its `Type`. (Note that only `MapNode` does not have a `Path`.)
180
+
181
+
140
182
  #### Type
141
183
  Type means the behavior of a Node.
142
184
 
@@ -146,6 +188,8 @@ Type meen behavior of Node.
146
188
  - *Paginate*
147
189
  - *Map*
148
190
 
191
+ See the description of each node for details.
192
+
149
193
  #### Name
150
194
  Name is used as the key in the returned hash.
151
195
 
@@ -167,6 +211,8 @@ node.opt #=> {:truncate => /^[^,]+/, :proc => nil}
167
211
  ## Text Node
168
212
  TextNode returns the scraped text. This node must be a leaf.
169
213
 
214
+
215
+
170
216
  ### Example
171
217
 
172
218
  ```html
@@ -548,3 +594,104 @@ tree.inject(agent, page) #=> {
548
594
 
549
595
  ### Options
550
596
  None.
597
+
598
+
599
+
600
+
601
+ -------------------------
602
+ ## Usage
603
+
604
+ #### Use as library
605
+ When used as a library, the tree can be defined in DSL, json, or yaml format.
606
+ ```ruby
607
+ require 'mechanize'
608
+ require 'yasuri'
609
+
610
+
611
+ # 1. Create a parse tree.
612
+ # Define by Ruby's DSL
613
+ tree = Yasuri.links_title '/html/body/a' do
614
+ text_name '/html/body/p'
615
+ end
616
+
617
+ # Define by JSON
618
+ src = <<-EOJSON
619
+ {
620
+ "links_title": {
621
+ "path": "/html/body/a",
622
+ "text_name": "/html/body/p"
623
+ }
624
+ }
625
+ EOJSON
626
+ tree = Yasuri.json2tree(src)
627
+
628
+
629
+ # Define by YAML
630
+ src = <<-EOYAML
631
+ links_title:
632
+ path: "/html/body/a"
633
+ text_name: "/html/body/p"
634
+ EOYAML
635
+ tree = Yasuri.yaml2tree(src)
636
+
637
+
638
+
639
+ # 2. Give the Mechanize agent and the target page to start parsing
640
+ agent = Mechanize.new
641
+ page = agent.get(uri)
642
+
643
+
644
+ tree.inject(agent, page)
645
+ ```
646
+
647
+ #### Use as CLI tool
648
+
649
+ **Help**
650
+ ```sh
651
+ $ yasuri help scrape
652
+ Usage:
653
+ yasuri scrape <URI> [[--file <TREE_FILE>] or [--json <JSON>]]
654
+
655
+ Options:
656
+ f, [--file=FILE] # path to file that written yasuri tree as json or yaml
657
+ j, [--json=JSON] # yasuri tree format json string
658
+
659
+ Getting from <URI> and scrape it. with <JSON> or json/yml from <TREE_FILE>. They should be Yasuri's format json or yaml string.
660
+ ```
661
+
662
+ In the CLI tool, you can specify the parse tree in either of the following ways.
663
+ + `--file`, `-f` option to read a parse tree written to a file in json or yaml format.
664
+ + `--json`, `-j` option to specify the parse tree directly as a string.
665
+
666
+
667
+ **Example of specifying a parse tree as a file**
668
+ ```sh
669
+ % cat sample.yml
670
+ text_title: "/html/head/title"
671
+ text_desc: "//*[@id=\"intro\"]/p"
672
+
673
+ % yasuri scrape "https://www.ruby-lang.org/en/" --file sample.yml
674
+ {"title":"Ruby Programming Language","desc":"\n A dynamic, open source programming language with a focus on\n simplicity and productivity. It has an elegant syntax that is\n natural to read and easy to write.\n "}
675
+
676
+ % cat sample.json
677
+ {
678
+ "text_title": "/html/head/title",
679
+ "text_desc": "//*[@id=\"intro\"]/p"
680
+ }
681
+
682
+ % yasuri scrape "https://www.ruby-lang.org/en/" --file sample.json
683
+ {"title":"Ruby Programming Language","desc":"\n A dynamic, open source programming language with a focus on\n simplicity and productivity. It has an elegant syntax that is\n natural to read and easy to write.\n "}
684
+ ```
685
+
686
+ Whether the file is written in json or yaml will be determined automatically.
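
One plausible way to detect the format automatically is to try JSON first and fall back to YAML. The sketch below uses only the standard `json` and `yaml` libraries; `load_tree_source` is a hypothetical helper, and this is an assumption about the approach, not a description of how the Yasuri CLI actually does it.

```ruby
require 'json'
require 'yaml'

# Hypothetical helper: returns the detected format and the parsed Hash.
# JSON is tried first; text that is not valid JSON is parsed as YAML.
def load_tree_source(text)
  [:json, JSON.parse(text)]
rescue JSON::ParserError
  [:yaml, YAML.safe_load(text)]
end

json_src = '{ "text_title": "/html/head/title" }'
yaml_src = "text_title: \"/html/head/title\"\n"

json_format, json_tree = load_tree_source(json_src)
yaml_format, yaml_tree = load_tree_source(yaml_src)

puts json_format.inspect
puts yaml_format.inspect
puts json_tree == yaml_tree
```

Trying JSON first matters because most JSON documents are also accepted by a YAML parser, so the reverse order would report nearly everything as YAML.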
687
+
688
+ **Example of specifying a parse tree directly in json**
689
+ ```sh
690
+ $ yasuri scrape "https://www.ruby-lang.org/en/" -j '
691
+ {
692
+ "text_title": "/html/head/title",
693
+ "text_desc": "//*[@id=\"intro\"]/p"
694
+ }'
695
+
696
+ {"title":"Ruby Programming Language","desc":"\n A dynamic, open source programming language with a focus on\n simplicity and productivity. It has an elegant syntax that is\n natural to read and easy to write.\n "}
697
+ ```
data/exe/yasuri ADDED
@@ -0,0 +1,5 @@
1
+ #!/usr/bin/env ruby
2
+
3
+ require "yasuri"
4
+
5
+ Yasuri::CLI.start
data/lib/yasuri.rb CHANGED
@@ -1,5 +1,6 @@
1
1
  require "yasuri/version"
2
2
  require "yasuri/yasuri"
3
+ require "yasuri/yasuri_cli"
3
4
 
4
5
  module Yasuri
5
6
  # Your code goes here...
@@ -1,3 +1,3 @@
1
1
  module Yasuri
2
- VERSION = "3.1.0"
2
+ VERSION = "3.2.0"
3
3
  end
data/lib/yasuri/yasuri.rb CHANGED
@@ -16,45 +16,29 @@ require_relative 'yasuri_node_generator'
16
16
 
17
17
  module Yasuri
18
18
 
19
+ DefaultRetryCount = 5
20
+
19
21
  def self.json2tree(json_string)
20
- json = JSON.parse(json_string, {symbolize_names: true})
21
- Yasuri.hash2node(json)
22
+ raise RuntimeError if json_string.nil? or json_string.empty?
23
+
24
+ node_hash = JSON.parse(json_string, {symbolize_names: true})
25
+ Yasuri.hash2node(node_hash)
22
26
  end
23
27
 
24
28
  def self.tree2json(node)
29
+ raise RuntimeError if node.nil?
30
+
25
31
  Yasuri.node2hash(node).to_json
26
32
  end
27
33
 
28
34
  def self.yaml2tree(yaml_string)
29
35
  raise RuntimeError if yaml_string.nil? or yaml_string.empty?
30
36
 
31
- yaml = YAML.load(yaml_string)
32
- raise RuntimeError if yaml.keys.size < 1
33
-
34
- root_key, root = yaml.keys.first, yaml.values.first
35
- hash = Yasuri.yaml2tree_sub(root_key, root)
36
-
37
- Yasuri.hash2node(hash)
37
+ node_hash = YAML.load(yaml_string)
38
+ Yasuri.hash2node(node_hash.deep_symbolize_keys)
38
39
  end
39
40
 
40
41
  private
41
- def self.yaml2tree_sub(name, body)
42
- return nil if name.nil? or body.nil?
43
-
44
- new_body = Hash[:name, name]
45
- body.each{|k,v| new_body[k.to_sym] = v}
46
- body = new_body
47
-
48
- return body if body[:children].nil?
49
-
50
- body[:children] = body[:children].map do |c|
51
- k, b = c.keys.first, c.values.first
52
- Yasuri.yaml2tree_sub(k, b)
53
- end
54
-
55
- body
56
- end
57
-
58
42
  def self.method_missing(method_name, pattern=nil, **opt, &block)
59
43
  generated = Yasuri::NodeGenerator.gen(method_name, pattern, **opt, &block)
60
44
  generated || super(method_name, **opt)
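
The `Yasuri.<Type>_<Name>` notation relies on `method_missing`. As a stand-alone illustration of that general technique (not the actual `Yasuri::NodeGenerator` logic, which is not shown in this diff), the sketch below dispatches `<type>_<name>` calls on a hypothetical `TinyDsl` module and returns a plain Hash instead of a node object. The type list is an assumption taken from the prefixes used in this document.

```ruby
# Hypothetical module for illustration only; not part of Yasuri.
module TinyDsl
  TYPES = %w[text struct links pages map].freeze
  PATTERN = /\A(#{TYPES.join('|')})_(.+)\z/

  # Calls such as TinyDsl.text_title('/path', opt: 1) land here because
  # no method with that name is defined.
  def self.method_missing(method_name, path = nil, **opt)
    match = PATTERN.match(method_name.to_s)
    return super unless match # unknown names still raise NoMethodError

    { type: match[1].to_sym, name: match[2], path: path, opt: opt }
  end

  # Keeps respond_to? consistent with method_missing.
  def self.respond_to_missing?(method_name, include_private = false)
    PATTERN.match?(method_name.to_s) || super
  end
end

node = TinyDsl.text_title('/html/head/title', truncate: /^[^,]+/)
puts node[:type].inspect
puts node[:name]
```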
@@ -66,24 +50,64 @@ module Yasuri
66
50
  struct: Yasuri::StructNode,
67
51
  links: Yasuri::LinksNode,
68
52
  pages: Yasuri::PaginateNode,
69
- map: Yasuri::MapNode
53
+ map: Yasuri::MapNode
70
54
  }
71
- Node2Text = Text2Node.invert
72
55
 
73
- ReservedKeys = %i|node name path children|
74
- def self.hash2node(node_h)
75
- node = node_h[:node]
56
+ def self.hash2node(node_hash, node_name = nil, node_type_class = nil)
57
+ raise RuntimeError.new("") if node_name.nil? and node_hash.empty?
58
+
59
+ node_prefixes = Text2Node.keys.freeze
60
+ child_nodes = []
61
+ opt = {}
62
+ path = nil
63
+
64
+ if node_hash.is_a?(String)
65
+ path = node_hash
66
+ else
67
+ node_hash.each do |key, value|
68
+ # is node?
69
+ node_regexps = Text2Node.keys.map do |node_type_sym|
70
+ /^(#{node_type_sym.to_s})_(.+)$/
71
+ end
72
+ node_regexp = node_regexps.find do |node_regexp|
73
+ key =~ node_regexp
74
+ end
75
+
76
+ case key
77
+ when node_regexp
78
+ node_type_sym = $1.to_sym
79
+ child_node_name = $2
80
+ child_node_type = Text2Node[node_type_sym]
81
+ child_nodes << self.hash2node(value, child_node_name, child_node_type)
82
+ when :path
83
+ path = value
84
+ else
85
+ opt[key] = value
86
+ end
87
+ end
88
+ end
89
+
90
+ # If only single node under root, return only the node.
91
+ return child_nodes.first if node_name.nil? and child_nodes.size == 1
92
+
93
+ node = if node_type_class.nil?
94
+ Yasuri::MapNode.new(node_name, child_nodes, **opt)
95
+ else
96
+ node_type_class::new(path, node_name, child_nodes, **opt)
97
+ end
76
98
 
77
- fail "Not found 'node' value in map" if node.nil?
78
- klass = Text2Node[node.to_sym]
79
- klass::hash2node(node_h)
99
+ node
80
100
  end
81
101
 
82
102
  def self.node2hash(node)
83
- node.to_h
103
+ return node.to_h if node.instance_of?(Yasuri::MapNode)
104
+
105
+ {
106
+ "#{node.node_type_str}_#{node.name}" => node.to_h
107
+ }
84
108
  end
85
109
 
86
- def self.NodeName(name, opt)
110
+ def self.node_name(name, opt)
87
111
  symbolize_names = opt[:symbolize_names]
88
112
  symbolize_names ? name.to_sym : name
89
113
  end
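
The key handling in the new `hash2node` above can be sketched without Yasuri itself: each key is either a child node (`<type>_<name>`), the `path`, or an option. In the sketch below, `classify`, `TYPED_KEY`, and `NODE_TYPES` are hypothetical names, the type list is assumed from the prefixes used in this document, and plain Hashes stand in for node objects.

```ruby
# Assumed list of node type prefixes.
NODE_TYPES = %i[text struct links pages map].freeze
TYPED_KEY = /\A(#{NODE_TYPES.join('|')})_(.+)\z/

# Hypothetical helper: split one level of a symbolized tree Hash into
# its path, its child node descriptions, and its remaining options.
def classify(node_hash)
  children = []
  path = nil
  opt = {}

  node_hash.each do |key, value|
    if (match = TYPED_KEY.match(key.to_s))
      children << { type: match[1].to_sym, name: match[2], body: value }
    elsif key == :path
      path = value
    else
      opt[key] = value # anything else is passed through as an option
    end
  end

  { path: path, children: children, opt: opt }
end

parts = classify({ path: "/html/body/a", text_name: "/html/body/p", truncate: "^[^,]+" })
p parts
```

A full tree builder would call itself recursively on each child `body`, as `hash2node` does, and would return the lone child directly when the root has exactly one.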
@@ -101,3 +125,14 @@ module Yasuri
101
125
  end
102
126
  end
103
127
  end
128
+
129
+ class Hash
130
+ def deep_symbolize_keys
131
+ Hash[
132
+ self.map do |k, v|
133
+ v = v.deep_symbolize_keys if v.kind_of?(Hash)
134
+ [k.to_sym, v]
135
+ end
136
+ ]
137
+ end
138
+ end
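
The effect of `deep_symbolize_keys` on YAML input can be reproduced with a plain function, which avoids reopening `Hash`. This is a sketch under that design assumption; `symbolize_keys_deep` is a hypothetical name. Like the patch above, it recurses into nested Hashes only, so Hashes inside Arrays keep their String keys.

```ruby
require 'yaml'

# Hypothetical stand-alone equivalent of the Hash patch above.
def symbolize_keys_deep(hash)
  hash.to_h do |key, value|
    [key.to_sym, value.is_a?(Hash) ? symbolize_keys_deep(value) : value]
  end
end

yaml_src = <<~YAML
  links_title:
    path: "/html/body/a"
    text_name: "/html/body/p"
YAML

# YAML.safe_load yields String keys; the helper converts them to Symbols.
tree_hash = symbolize_keys_deep(YAML.safe_load(yaml_src))
p tree_hash

# Hashes nested inside Arrays are left as they are.
with_array = symbolize_keys_deep({ "a" => [{ "b" => 1 }] })
p with_array
```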