yasuri 3.1.0 → 3.2.0

data/USAGE.md CHANGED
@@ -1,27 +1,31 @@
1
- # Yasuri Usage
1
+ # Yasuri
2
2
 
3
3
  ## What is Yasuri
4
- `Yasuri` is an easy web-scraping library for supporting "Mechanize".
4
+ `Yasuri` (鑢) is a library for declarative web scraping and a command line tool built on it. It scrapes pages with "[Mechanize](https://github.com/sparklemotion/mechanize)"; you only describe the expected result in a simple declarative notation.
5
5
 
6
- Yasuri (鑢) is an easy web-scraping library for supporting "[Mechanize](https://github.com/sparklemotion/mechanize)".
6
+ Yasuri makes it easy to write common scraping operations.
7
+ For example, the following processes can be easily implemented.
7
8
 
8
- Yasuri can reduce frequently processes in Scraping.
9
+ + Scrape multiple texts in a page and name them into a Hash
10
+ + Open multiple links in a page and get the result of scraping each page as a Hash
11
+ + Scrape each table that appears repeatedly in the page and get the result as an array
12
+ + Scrape only the first three pages of each page provided by pagination
9
13
 
10
- For example,
11
-
12
- + Open links in the page, scraping each page, and getting result as Hash.
13
- + Scraping texts in the page, and named result in Hash.
14
- + A table that repeatedly appears in a page each, scraping, get as an array.
15
- + Of each page provided by the pagination, scraping the only top 3.
16
-
17
- You can implement easy by Yasuri.
18
14
 
19
15
  ## Quick Start
20
16
 
17
+
18
+ #### Install
19
+ ```sh
20
+ # for Ruby 2.3.2
21
+ $ gem 'yasuri', '~> 2.0', '>= 2.0.13'
21
22
  ```
23
+ or
24
+ ```sh
25
+ # for Ruby 3.0.0 or later
22
26
  $ gem install yasuri
23
27
  ```
24
-
28
+ #### Use as library
25
29
  ```ruby
26
30
  require 'yasuri'
27
31
  require 'mechanize'
@@ -33,88 +37,59 @@ root = Yasuri.links_root '//*[@id="menu"]/ul/li/a' do
33
37
  end
34
38
 
35
39
  agent = Mechanize.new
36
- root_page = agent.get("http://some.scraping.page.net/")
40
+ root_page = agent.get("http://some.scraping.page.tac42.net/")
37
41
 
38
42
  result = root.inject(agent, root_page)
39
- # => [ {"title" => "PageTitle1", "content" => "Page Contents1" },
40
- # {"title" => "PageTitle2", "content" => "Page Contents2" }, ... ]
41
-
43
+ # => [
44
+ # {"title" => "PageTitle 01", "content" => "Page Contents 01" },
45
+ # {"title" => "PageTitle 02", "content" => "Page Contents 02" },
46
+ # ...
47
+ # {"title" => "PageTitle N", "content" => "Page Contents N" }
48
+ # ]
42
49
  ```
43
- This example, from the pages of each link that is expressed by the xpath of LinkNode(`links_root`), to scraping the two text that is expressed by the xpath of TextNode(`text_title`,`text_content`).
44
-
45
- (i.e. open each links `//*[@id="menu"]/ul/li/a` and, scrape `//*[@id="contents"]/h2` and `//*[@id="contents"]/p[1]`.)
46
-
47
- ## Basics
48
-
49
- 1. Construct parse tree.
50
- 2. Start parse with Mechanize agent and first page.
51
-
52
- ### Construct parse tree
53
-
54
- ```ruby
55
- require 'mechanize'
56
- require 'yasuri'
57
50
 
51
+ This example opens each link matched by the XPath of the LinksNode (`links_root`) and, from each linked page, scrapes the two texts matched by the XPaths of the TextNodes (`text_title`, `text_content`).
58
52
 
59
- # 1. Construct parse tree.
60
- tree = Yasuri.links_title '/html/body/a' do
61
- text_name '/html/body/p'
62
- end
53
+ (In other words, open each link matched by `//*[@id="menu"]/ul/li/a`, then scrape `//*[@id="contents"]/h2` and `//*[@id="contents"]/p[1]` from each page.)
63
54
 
64
- # 2. Start parse with Mechanize agent and first page.
65
- agent = Mechanize.new
66
- page = agent.get(uri)
67
55
 
56
+ #### Use as CLI tool
57
+ The same thing as above can be executed as a CLI command.
68
58
 
69
- tree.inject(agent, page)
70
- ```
71
-
72
- Tree is definable by 3(+1) ways, json, yaml, and DSL (or basic ruby code). In above example, DSL.
59
+ ```sh
60
+ $ yasuri scrape "http://some.scraping.page.tac42.net/" -j '
61
+ {
62
+ "links_root": {
63
+ "path": "//*[@id=\"menu\"]/ul/li/a",
64
+ "text_title": "//*[@id=\"contents\"]/h2",
65
+ "text_content": "//*[@id=\"contents\"]/p[1]"
66
+ }
67
+ }'
73
68
 
74
- ```ruby
75
- # Construct by json.
76
- src = <<-EOJSON
77
- { "node" : "links",
78
- "name" : "title",
79
- "path" : "/html/body/a",
80
- "children" : [
81
- { "node" : "text",
82
- "name" : "name",
83
- "path" : "/html/body/p"
84
- }
85
- ]
86
- }
87
- EOJSON
88
- tree = Yasuri.json2tree(src)
69
+ [
70
+ {"title":"PageTitle 01","content":"Page Contents 01"},
71
+ {"title":"PageTitle 02","content":"Page Contents 02"},
72
+ ...,
73
+ {"title":"PageTitle N","content":"Page Contents N"}
74
+ ]
89
75
  ```
90
76
 
91
- ```ruby
92
- # Construct by yaml.
93
- src = <<-EOYAML
94
- title:
95
- node: links
96
- path: "/html/body/a"
97
- children:
98
- - name:
99
- node: text
100
- path: "/html/body/p"
101
- EOYAML
102
- tree = Yasuri.yaml2tree(src)
103
- ```
77
+ The result can be obtained as a string in json format.
104
78
 
79
+ ----------------------------
80
+ ## Parse Tree
105
81
 
106
- ### Node
107
- Tree is constructed by nested Nodes.
108
- Node has `Type`, `Name`, `Path`, `Childlen`, and `Options`.
109
- (But only `MapNode` does not have `Path`.)
82
+ A parse tree is tree-structured data that declaratively defines the elements to be scraped and the structure of the output.
110
83
 
111
- Node is defined by this format.
84
+ A parse tree consists of nested `Node`s, each of which has `Type`, `Name`, `Path`, `Children`, and `Options` attributes, and scrapes according to its `Type`. (Note that only `MapNode` does not have `Path`.)
112
85
 
86
+ The parse tree is defined in the following format:
113
87
 
114
88
  ```ruby
89
+ # A simple tree consisting of one node
115
90
  Yasuri.<Type>_<Name> <Path> [,<Options>]
116
91
 
117
- # Nested case
92
+ # Nested tree
118
93
  Yasuri.<Type>_<Name> <Path> [,<Options>] do
119
94
  <Type>_<Name> <Path> [,<Options>] do
120
95
  <Type>_<Name> <Path> [,<Options>]
@@ -123,12 +98,13 @@ Yasuri.<Type>_<Name> <Path> [,<Options>] do
123
98
  end
124
99
  ```
125
100
 
126
- Example
101
+ **Example**
127
102
 
128
103
  ```ruby
104
+ # A simple tree consisting of one node
129
105
  Yasuri.text_title '/html/head/title', truncate:/^[^,]+/
130
106
 
131
- # Nested case
107
+ # Nested tree
132
108
  Yasuri.links_root '//*[@id="menu"]/ul/li/a' do
133
109
  struct_table './tr' do
134
110
  text_title './td[1]'
@@ -137,6 +113,72 @@ Yasuri.links_root '//*[@id="menu"]/ul/li/a' do
137
113
  end
138
114
  ```
139
115
 
116
+ Parse trees can be defined in Ruby DSL, JSON, or YAML.
117
+ The following is an example of the same parse tree as above, defined in each notation.
118
+
119
+
120
+ **Case of defining as Ruby DSL**
121
+ ```ruby
122
+ Yasuri.links_title '/html/body/a' do
123
+ text_name '/html/body/p'
124
+ end
125
+ ```
126
+
127
+ **Case of defining as JSON**
128
+ ```json
129
+ {
130
+ "links_title": {
131
+ "path": "/html/body/a",
132
+ "text_name": "/html/body/p"
133
+ }
134
+ }
135
+ ```
136
+
137
+ **Case of defining as YAML**
138
+ ```yaml
139
+ links_title:
140
+ path: "/html/body/a"
141
+ text_name: "/html/body/p"
142
+ ```
143
+
144
+ **Special case of parse tree**
145
+
146
+ If there is only one element directly under the root, it will return that element directly instead of Hash(Object).
147
+ ```json
148
+ {
149
+ "text_title": "/html/head/title",
150
+ "text_body": "/html/body"
151
+ }
152
+ # => {"title": "Welcome to yasuri!", "body": "Yasuri is ..."}
153
+
154
+ {
155
+ "text_title": "/html/head/title"
156
+ }
157
+ # => Welcome to yasuri!
158
+ ```
159
+
160
+
161
+ In json or yaml format, an attribute can take the `path` directly as its value if it does not have any child Node. The following two json documents define the same parse tree.
162
+
163
+ ```json
164
+ {
165
+ "text_name": "/html/body/p"
166
+ }
167
+
168
+ {
169
+ "text_name": {
170
+ "path": "/html/body/p"
171
+ }
172
+ }
173
+ ```
174
+
175
+
176
+ --------------------------
177
+ ## Node
178
+
179
+ Node is a node or leaf of the parse tree, which has `Type`, `Name`, `Path`, `Children`, and `Options`, and scrapes according to its `Type`. (Note that only `MapNode` does not have `Path`.)
180
+
181
+
140
182
  #### Type
141
183
  Type means the behavior of the Node.
142
184
 
@@ -146,6 +188,8 @@ Type meen behavior of Node.
146
188
  - *Paginate*
147
189
  - *Map*
148
190
 
191
+ See the description of each node for details.
192
+
149
193
  #### Name
150
194
  Name is used as the key in the returned hash.
151
195
 
@@ -167,6 +211,8 @@ node.opt #=> {:truncate => /^[^,]+/, :proc => nil}
167
211
  ## Text Node
168
212
  TextNode returns the scraped text. This node has to be a leaf.
169
213
 
214
+
215
+
170
216
  ### Example
171
217
 
172
218
  ```html
@@ -548,3 +594,104 @@ tree.inject(agent, page) #=> {
548
594
 
549
595
  ### Options
550
596
  None.
597
+
598
+
599
+
600
+
601
+ -------------------------
602
+ ## Usage
603
+
604
+ #### Use as library
605
+ When used as a library, the tree can be defined in DSL, json, or yaml format.
606
+ ```ruby
607
+ require 'mechanize'
608
+ require 'yasuri'
609
+
610
+
611
+ # 1. Create a parse tree.
612
+ # Define by Ruby's DSL
613
+ tree = Yasuri.links_title '/html/body/a' do
614
+ text_name '/html/body/p'
615
+ end
616
+
617
+ # Define by JSON
618
+ src = <<-EOJSON
619
+ {
620
+ "links_title": {
621
+ "path": "/html/body/a",
622
+ "text_name": "/html/body/p"
623
+ }
624
+ }
625
+ EOJSON
626
+ tree = Yasuri.json2tree(src)
627
+
628
+
629
+ # Define by YAML
630
+ src = <<-EOYAML
631
+ links_title:
632
+ path: "/html/body/a"
633
+ text_name: "/html/body/p"
634
+ EOYAML
635
+ tree = Yasuri.yaml2tree(src)
636
+
637
+
638
+
639
+ # 2. Give the Mechanize agent and the target page to start parsing
640
+ agent = Mechanize.new
641
+ page = agent.get(uri)
642
+
643
+
644
+ tree.inject(agent, page)
645
+ ```
646
+
647
+ #### Use as CLI tool
648
+
649
+ **Help**
650
+ ```sh
651
+ $ yasuri help scrape
652
+ Usage:
653
+ yasuri scrape <URI> [[--file <TREE_FILE>] or [--json <JSON>]]
654
+
655
+ Options:
656
+ f, [--file=FILE] # path to file that written yasuri tree as json or yaml
657
+ j, [--json=JSON] # yasuri tree format json string
658
+
659
+ Getting from <URI> and scrape it. with <JSON> or json/yml from <TREE_FILE>. They should be Yasuri's format json or yaml string.
660
+ ```
661
+
662
+ In the CLI tool, you can specify the parse tree in either of the following ways.
663
+ + `--file`, `-f` option to read a parse tree that has been written to a file in json or yaml format.
664
+ + `--json`, `-j` option to specify the parse tree directly as a string.
665
+
666
+
667
+ **Example of specifying a parse tree as a file**
668
+ ```sh
669
+ % cat sample.yml
670
+ text_title: "/html/head/title"
671
+ text_desc: "//*[@id=\"intro\"]/p"
672
+
673
+ % yasuri scrape "https://www.ruby-lang.org/en/" --file sample.yml
674
+ {"title":"Ruby Programming Language","desc":"\n A dynamic, open source programming language with a focus on\n simplicity and productivity. It has an elegant syntax that is\n natural to read and easy to write.\n "}
675
+
676
+ % cat sample.json
677
+ {
678
+ "text_title": "/html/head/title",
679
+ "text_desc": "//*[@id=\"intro\"]/p"
680
+ }
681
+
682
+ % yasuri scrape "https://www.ruby-lang.org/en/" --file sample.json
683
+ {"title":"Ruby Programming Language","desc":"\n A dynamic, open source programming language with a focus on\n simplicity and productivity. It has an elegant syntax that is\n natural to read and easy to write.\n "}
684
+ ```
685
+
686
+ Whether the file is written in json or yaml will be determined automatically.
687
+
688
+ **Example of specifying a parse tree directly in json**
689
+ ```sh
690
+ $ yasuri scrape "https://www.ruby-lang.org/en/" -j '
691
+ {
692
+ "text_title": "/html/head/title",
693
+ "text_desc": "//*[@id=\"intro\"]/p"
694
+ }'
695
+
696
+ {"title":"Ruby Programming Language","desc":"\n A dynamic, open source programming language with a focus on\n simplicity and productivity. It has an elegant syntax that is\n natural to read and easy to write.\n "}
697
+ ```
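Note on the notations documented in the USAGE.md changes above: the JSON and YAML forms are meant to describe the same tree, and a string value is shorthand for a hash that only has `path`. The following is a minimal sketch of that equivalence, using only Ruby's standard `json` and `yaml` libraries rather than Yasuri itself. `NODE_KEY` and `expand_shorthand` are hypothetical names introduced for illustration, and the list of type prefixes is an assumption based on the node types named in this diff.

```ruby
require 'json'
require 'yaml'

# The same tree in the two notations shown in USAGE.md.
JSON_SRC = <<~EOJSON
  {
    "links_title": {
      "path": "/html/body/a",
      "text_name": "/html/body/p"
    }
  }
EOJSON

YAML_SRC = <<~EOYAML
  links_title:
    path: "/html/body/a"
    text_name: "/html/body/p"
EOYAML

# Assumed node type prefixes: text, struct, links, pages, map.
NODE_KEY = /\A(?:text|struct|links|pages|map)_.+\z/

# Hypothetical helper (not part of Yasuri): rewrite the
# "path as value" shorthand into the long form with an explicit "path".
def expand_shorthand(tree)
  tree.map do |key, value|
    if key.match?(NODE_KEY)
      value = value.is_a?(Hash) ? expand_shorthand(value) : { "path" => value }
    end
    [key, value]
  end.to_h
end

from_json = JSON.parse(JSON_SRC)
from_yaml = YAML.safe_load(YAML_SRC)

puts from_json == from_yaml                     # both notations give the same Hash
puts JSON.generate(expand_shorthand(from_json)) # text_name now carries an explicit "path"
```

The sketch only compares plain hashes; how Yasuri turns such a hash into nodes is shown in the `hash2node` change in `data/lib/yasuri/yasuri.rb`.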
data/exe/yasuri ADDED
@@ -0,0 +1,5 @@
1
+ #!/usr/bin/env ruby
2
+
3
+ require "yasuri"
4
+
5
+ Yasuri::CLI.start
data/lib/yasuri.rb CHANGED
@@ -1,5 +1,6 @@
1
1
  require "yasuri/version"
2
2
  require "yasuri/yasuri"
3
+ require "yasuri/yasuri_cli"
3
4
 
4
5
  module Yasuri
5
6
  # Your code goes here...
@@ -1,3 +1,3 @@
1
1
  module Yasuri
2
- VERSION = "3.1.0"
2
+ VERSION = "3.2.0"
3
3
  end
data/lib/yasuri/yasuri.rb CHANGED
@@ -16,45 +16,29 @@ require_relative 'yasuri_node_generator'
16
16
 
17
17
  module Yasuri
18
18
 
19
+ DefaultRetryCount = 5
20
+
19
21
  def self.json2tree(json_string)
20
- json = JSON.parse(json_string, {symbolize_names: true})
21
- Yasuri.hash2node(json)
22
+ raise RuntimeError if json_string.nil? or json_string.empty?
23
+
24
+ node_hash = JSON.parse(json_string, {symbolize_names: true})
25
+ Yasuri.hash2node(node_hash)
22
26
  end
23
27
 
24
28
  def self.tree2json(node)
29
+ raise RuntimeError if node.nil?
30
+
25
31
  Yasuri.node2hash(node).to_json
26
32
  end
27
33
 
28
34
  def self.yaml2tree(yaml_string)
29
35
  raise RuntimeError if yaml_string.nil? or yaml_string.empty?
30
36
 
31
- yaml = YAML.load(yaml_string)
32
- raise RuntimeError if yaml.keys.size < 1
33
-
34
- root_key, root = yaml.keys.first, yaml.values.first
35
- hash = Yasuri.yaml2tree_sub(root_key, root)
36
-
37
- Yasuri.hash2node(hash)
37
+ node_hash = YAML.load(yaml_string)
38
+ Yasuri.hash2node(node_hash.deep_symbolize_keys)
38
39
  end
39
40
 
40
41
  private
41
- def self.yaml2tree_sub(name, body)
42
- return nil if name.nil? or body.nil?
43
-
44
- new_body = Hash[:name, name]
45
- body.each{|k,v| new_body[k.to_sym] = v}
46
- body = new_body
47
-
48
- return body if body[:children].nil?
49
-
50
- body[:children] = body[:children].map do |c|
51
- k, b = c.keys.first, c.values.first
52
- Yasuri.yaml2tree_sub(k, b)
53
- end
54
-
55
- body
56
- end
57
-
58
42
  def self.method_missing(method_name, pattern=nil, **opt, &block)
59
43
  generated = Yasuri::NodeGenerator.gen(method_name, pattern, **opt, &block)
60
44
  generated || super(method_name, **opt)
@@ -66,24 +50,64 @@ module Yasuri
66
50
  struct: Yasuri::StructNode,
67
51
  links: Yasuri::LinksNode,
68
52
  pages: Yasuri::PaginateNode,
69
- map: Yasuri::MapNode
53
+ map: Yasuri::MapNode
70
54
  }
71
- Node2Text = Text2Node.invert
72
55
 
73
- ReservedKeys = %i|node name path children|
74
- def self.hash2node(node_h)
75
- node = node_h[:node]
56
+ def self.hash2node(node_hash, node_name = nil, node_type_class = nil)
57
+ raise RuntimeError.new("") if node_name.nil? and node_hash.empty?
58
+
59
+ node_prefixes = Text2Node.keys.freeze
60
+ child_nodes = []
61
+ opt = {}
62
+ path = nil
63
+
64
+ if node_hash.is_a?(String)
65
+ path = node_hash
66
+ else
67
+ node_hash.each do |key, value|
68
+ # is node?
69
+ node_regexps = Text2Node.keys.map do |node_type_sym|
70
+ /^(#{node_type_sym.to_s})_(.+)$/
71
+ end
72
+ node_regexp = node_regexps.find do |node_regexp|
73
+ key =~ node_regexp
74
+ end
75
+
76
+ case key
77
+ when node_regexp
78
+ node_type_sym = $1.to_sym
79
+ child_node_name = $2
80
+ child_node_type = Text2Node[node_type_sym]
81
+ child_nodes << self.hash2node(value, child_node_name, child_node_type)
82
+ when :path
83
+ path = value
84
+ else
85
+ opt[key] = value
86
+ end
87
+ end
88
+ end
89
+
90
+ # If only single node under root, return only the node.
91
+ return child_nodes.first if node_name.nil? and child_nodes.size == 1
92
+
93
+ node = if node_type_class.nil?
94
+ Yasuri::MapNode.new(node_name, child_nodes, **opt)
95
+ else
96
+ node_type_class::new(path, node_name, child_nodes, **opt)
97
+ end
76
98
 
77
- fail "Not found 'node' value in map" if node.nil?
78
- klass = Text2Node[node.to_sym]
79
- klass::hash2node(node_h)
99
+ node
80
100
  end
81
101
 
82
102
  def self.node2hash(node)
83
- node.to_h
103
+ return node.to_h if node.instance_of?(Yasuri::MapNode)
104
+
105
+ {
106
+ "#{node.node_type_str}_#{node.name}" => node.to_h
107
+ }
84
108
  end
85
109
 
86
- def self.NodeName(name, opt)
110
+ def self.node_name(name, opt)
87
111
  symbolize_names = opt[:symbolize_names]
88
112
  symbolize_names ? name.to_sym : name
89
113
  end
@@ -101,3 +125,14 @@ module Yasuri
101
125
  end
102
126
  end
103
127
  end
128
+
129
+ class Hash
130
+ def deep_symbolize_keys
131
+ Hash[
132
+ self.map do |k, v|
133
+ v = v.deep_symbolize_keys if v.kind_of?(Hash)
134
+ [k.to_sym, v]
135
+ end
136
+ ]
137
+ end
138
+ end
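Note on the new `hash2node` and `deep_symbolize_keys` above: each key of the parsed hash is treated as a child node (`<type>_<name>`), as the `path`, or as an option. The sketch below reproduces that classification with the standard library only, so it can be run without Yasuri. `NODE_TYPES`, `deep_symbolize`, and `classify_key` are hypothetical names for illustration; `deep_symbolize` is written as a plain function instead of reopening `Hash`, and the type list is assumed to mirror the keys of `Text2Node`.

```ruby
require 'yaml'

# Assumed to mirror Text2Node.keys in the diff.
NODE_TYPES = %i[text struct links pages map].freeze

# Hypothetical stand-in for Hash#deep_symbolize_keys, without monkey patching.
def deep_symbolize(hash)
  hash.map { |k, v| [k.to_sym, v.is_a?(Hash) ? deep_symbolize(v) : v] }.to_h
end

# Hypothetical helper: classify one key the way the `case key` branch does.
def classify_key(key)
  if (m = /\A(#{NODE_TYPES.join('|')})_(.+)\z/.match(key.to_s))
    [:node, m[1].to_sym, m[2]] # child node: type and name
  elsif key == :path
    [:path]
  else
    [:option]
  end
end

TREE_YAML = <<~EOYAML
  links_title:
    path: "/html/body/a"
    text_name: "/html/body/p"
    truncate: "^[^,]+"
EOYAML

tree = deep_symbolize(YAML.safe_load(TREE_YAML))
tree[:links_title].each_key do |key|
  p [key, classify_key(key)]
end
```

Symbolizing first matters because the `when :path` branch compares against a Symbol, while `YAML.load` returns String keys; `JSON.parse` gets the same effect from `symbolize_names: true`.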