yasuri 3.0.0 → 3.1.0
- checksums.yaml +4 -4
- data/README.md +15 -1
- data/USAGE.ja.md +80 -10
- data/USAGE.md +80 -11
- data/lib/yasuri/version.rb +1 -1
- data/lib/yasuri/yasuri.rb +9 -35
- data/lib/yasuri/yasuri_links_node.rb +6 -2
- data/lib/yasuri/yasuri_map_node.rb +54 -0
- data/lib/yasuri/yasuri_node.rb +44 -1
- data/lib/yasuri/yasuri_node_generator.rb +13 -6
- data/lib/yasuri/yasuri_paginate_node.rb +4 -0
- data/lib/yasuri/yasuri_text_node.rb +4 -0
- data/spec/yasuri_map_spec.rb +76 -0
- data/spec/yasuri_spec.rb +46 -0
- metadata +5 -2
checksums.yaml
CHANGED
@@ -1,7 +1,7 @@
|
|
1
1
|
---
|
2
2
|
SHA256:
|
3
|
-
metadata.gz:
|
4
|
-
data.tar.gz:
|
3
|
+
metadata.gz: f3542a2cc0959a4534520f6104fc2922bdf0dbd368fcd4c149c3d251c2fc2198
|
4
|
+
data.tar.gz: 6fdb960db697e9a4ec1d87f2b83bf0e9914e3c9efe90764536bbee6d68774353
|
5
5
|
SHA512:
|
6
|
-
metadata.gz:
|
7
|
-
data.tar.gz:
|
6
|
+
metadata.gz: 9df576243bea289f4c285c46f1bd2137b7b69b79b24e0c657e4ac952114dd7bcf82a5f95cd2dae88c6eac4e3e468273b7dbd6ead9d05ffdc8d25861921702333
|
7
|
+
data.tar.gz: 13f2ae72b3e8fa6d3ef58932daa2acad49f5d4f57c80f34e5215394940fc2305bc016d949760efe9f43ae2b8c3796064a1b0bd9bccf236cfe3789c2c291dfd8b
|
data/README.md
CHANGED
@@ -1,5 +1,6 @@
|
|
1
1
|
# Yasuri
|
2
|
-
[![Build Status](https://
|
2
|
+
[![Build Status](https://github.com/tac0x2a/yasuri/actions/workflows/ruby.yml/badge.svg)](https://github.com/tac0x2a/yasuri/actions/workflows/ruby.yml)
|
3
|
+
[![Coverage Status](https://coveralls.io/repos/tac0x2a/yasuri/badge.svg?branch=master)](https://coveralls.io/r/tac0x2a/yasuri?branch=master) [![Maintainability](https://api.codeclimate.com/v1/badges/c29480fea1305afe999f/maintainability)](https://codeclimate.com/github/tac0x2a/yasuri/maintainability)
|
3
4
|
|
4
5
|
Yasuri (鑢) is an easy web-scraping library for supporting "[Mechanize](https://github.com/sparklemotion/mechanize)".
|
5
6
|
|
@@ -33,6 +34,9 @@ or
|
|
33
34
|
```ruby
|
34
35
|
# for Ruby 1.9.3 or lower
|
35
36
|
gem 'yasuri', '~> 1.9'
|
37
|
+
|
38
|
+
# for Ruby 3.0.0 or lower
|
39
|
+
gem 'yasuri', '~> 3.0.1'
|
36
40
|
```
|
37
41
|
|
38
42
|
|
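The `~>` operator in the Gemfile lines above is RubyGems' pessimistic version constraint. As a quick check of what `'~> 3.0.1'` admits, the sketch below uses `Gem::Requirement` and `Gem::Version` from RubyGems itself. The candidate version numbers are illustrative only and are not a list of actual yasuri releases.

```ruby
require 'rubygems'

# '~> 3.0.1' is shorthand for ">= 3.0.1 and < 3.1".
requirement = Gem::Requirement.new('~> 3.0.1')

# Illustrative candidates (assumed for this example, not real release data).
candidates = %w[3.0.0 3.0.1 3.0.9 3.1.0]

results = candidates.to_h do |v|
  [v, requirement.satisfied_by?(Gem::Version.new(v))]
end

results.each { |version, ok| puts "#{version}: #{ok}" }
```

Under this constraint a Gemfile pinned to `'~> 3.0.1'` would not pick up 3.1.0, so it would not get the `MapNode` introduced in this release.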
@@ -104,6 +108,16 @@ $ rake
|
|
104
108
|
$ rspec spec/*spec.rb
|
105
109
|
```
|
106
110
|
|
111
|
+
### Release RubyGems
|
112
|
+
```sh
|
113
|
+
# Only first time
|
114
|
+
$ curl -u <user_name> https://rubygems.org/api/v1/api_key.yaml > ~/.gem/credentials
|
115
|
+
$ chmod 0600 ~/.gem/credentials
|
116
|
+
|
117
|
+
$ nano lib/yasuri/version.rb # edit gem version
|
118
|
+
$ rake release
|
119
|
+
```
|
120
|
+
|
107
121
|
## Contributing
|
108
122
|
|
109
123
|
1. Fork it ( https://github.com/tac0x2a/yasuri/fork )
|
data/USAGE.ja.md
CHANGED
@@ -104,21 +104,37 @@ tree = Yasuri.yaml2tree(src)
|
|
104
104
|
### Node
|
105
105
|
A tree is made up of nested *Node*s.
|
106
106
|
A Node has `Type`, `Name`, `Path`, `Childlen`, and `Options`.
|
107
|
+
(The exception is `MapNode`, which has no `Path`.)
|
107
108
|
|
108
109
|
A Node is defined in the following format.
|
109
110
|
|
110
111
|
```ruby
|
111
|
-
# Top level
|
112
112
|
Yasuri.<Type>_<Name> <Path> [,<Options>]
|
113
113
|
|
114
114
|
# Nested case
|
115
115
|
Yasuri.<Type>_<Name> <Path> [,<Options>] do
|
116
116
|
<Type>_<Name> <Path> [,<Options>] do
|
117
|
-
<
|
117
|
+
<Type>_<Name> <Path> [,<Options>]
|
118
|
+
...
|
118
119
|
end
|
119
120
|
end
|
120
121
|
```
|
121
122
|
|
123
|
+
Example
|
124
|
+
|
125
|
+
```ruby
|
126
|
+
Yasuri.text_title '/html/head/title', truncate:/^[^,]+/
|
127
|
+
|
128
|
+
# Nested case
|
129
|
+
Yasuri.links_root '//*[@id="menu"]/ul/li/a' do
|
130
|
+
struct_table './tr' do
|
131
|
+
text_title './td[1]'
|
132
|
+
text_pub_date './td[2]'
|
133
|
+
end
|
134
|
+
end
|
135
|
+
```
|
136
|
+
|
137
|
+
|
122
138
|
#### Type
|
123
139
|
*Type* specifies the behavior of a Node. The following Types are available.
|
124
140
|
|
@@ -126,18 +142,19 @@ end
|
|
126
142
|
- *Struct*
|
127
143
|
- *Links*
|
128
144
|
- *Paginate*
|
145
|
+
- *Map*
|
129
146
|
|
130
|
-
|
147
|
+
#### Name
|
131
148
|
*Name* is the key under which the result appears in the returned Hash.
|
132
149
|
|
133
|
-
|
150
|
+
#### Path
|
134
151
|
*Path* selects a specific node in the HTML by XPath or CSS selector.
|
135
152
|
It is passed to Mechanize's `search`.
|
136
153
|
|
137
|
-
|
154
|
+
#### Childlen
|
138
155
|
The child nodes of a nested node. A TextNode is a leaf of the tree, so it has no children.
|
139
156
|
|
140
|
-
|
157
|
+
#### Options
|
141
158
|
Parse options. The available options differ for each Type.
|
142
159
|
Calling the `opt` method on a node returns its available options.
|
143
160
|
|
@@ -169,13 +186,15 @@ page = agent.get("http://yasuri.example.net")
|
|
169
186
|
|
170
187
|
p1 = Yasuri.text_title '/html/body/p[1]'
|
171
188
|
p1t = Yasuri.text_title '/html/body/p[1]', truncate:/^[^,]+/
|
172
|
-
p2u = Yasuri.text_title '/html/body/p[
|
189
|
+
p2u = Yasuri.text_title '/html/body/p[1]', proc: :upcase
|
173
190
|
|
174
|
-
p1.inject(agent, page) #=>
|
175
|
-
p1t.inject(agent, page) #=>
|
176
|
-
|
191
|
+
p1.inject(agent, page) #=> "Hello,World"
|
192
|
+
p1t.inject(agent, page) #=> "Hello"
|
193
|
+
p2u.inject(agent, page) #=> "HELLO,WORLD"
|
177
194
|
```
|
178
195
|
|
196
|
+
To scrape several elements of the same page at once, use `MapNode`.
|
197
|
+
|
179
198
|
### Options
|
180
199
|
##### `truncate`
|
181
200
|
Extracts the string that matches the regular expression. If the pattern contains groups, only the first matched group is returned.
|
@@ -479,3 +498,54 @@ node.inject(agent, page)
|
|
479
498
|
"Page03",
|
480
499
|
"Patination03"]
|
481
500
|
```
|
501
|
+
|
502
|
+
## Map Node
|
503
|
+
*MapNode* is a node that gathers scraping results together. It is always a branch node in the parse tree.
|
504
|
+
|
505
|
+
### Example
|
506
|
+
|
507
|
+
```html
|
508
|
+
<!-- http://yasuri.example.net -->
|
509
|
+
<html>
|
510
|
+
<head><title>Yasuri Example</title></head>
|
511
|
+
<body>
|
512
|
+
<p>Hello,World</p>
|
513
|
+
<p>Hello,Yasuri</p>
|
514
|
+
</body>
|
515
|
+
</html>
|
516
|
+
```
|
517
|
+
|
518
|
+
```ruby
|
519
|
+
agent = Mechanize.new
|
520
|
+
page = agent.get("http://yasuri.example.net")
|
521
|
+
|
522
|
+
|
523
|
+
tree = Yasuri.map_root do
|
524
|
+
text_title '/html/head/title'
|
525
|
+
text_body_p '/html/body/p[1]'
|
526
|
+
end
|
527
|
+
|
528
|
+
tree.inject(agent, page) #=> { "title" => "Yasuri Example", "body_p" => "Hello,World" }
|
529
|
+
|
530
|
+
|
531
|
+
tree = Yasuri.map_root do
|
532
|
+
map_group1 { text_child01 '/html/body/a[1]' }
|
533
|
+
map_group2 do
|
534
|
+
text_child01 '/html/body/a[1]'
|
535
|
+
text_child03 '/html/body/a[3]'
|
536
|
+
end
|
537
|
+
end
|
538
|
+
|
539
|
+
tree.inject(agent, page) #=> {
|
540
|
+
# "group1" => {
|
541
|
+
# "child01" => "child01"
|
542
|
+
# },
|
543
|
+
# "group2" => {
|
544
|
+
# "child01" => "child01",
|
545
|
+
# "child03" => "child03"
|
546
|
+
# }
|
547
|
+
# }
|
548
|
+
```
|
549
|
+
|
550
|
+
### Options
|
551
|
+
None.
|
data/USAGE.md
CHANGED
@@ -106,18 +106,33 @@ tree = Yasuri.yaml2tree(src)
|
|
106
106
|
### Node
|
107
107
|
A tree is constructed from nested Nodes.
|
108
108
|
Node has `Type`, `Name`, `Path`, `Childlen`, and `Options`.
|
109
|
+
(The exception is `MapNode`, which does not have a `Path`.)
|
109
110
|
|
110
111
|
A Node is defined in the following format.
|
111
112
|
|
112
113
|
|
113
114
|
```ruby
|
114
|
-
# Top Level
|
115
115
|
Yasuri.<Type>_<Name> <Path> [,<Options>]
|
116
116
|
|
117
|
-
# Nested
|
117
|
+
# Nested case
|
118
118
|
Yasuri.<Type>_<Name> <Path> [,<Options>] do
|
119
119
|
<Type>_<Name> <Path> [,<Options>] do
|
120
|
-
<
|
120
|
+
<Type>_<Name> <Path> [,<Options>]
|
121
|
+
...
|
122
|
+
end
|
123
|
+
end
|
124
|
+
```
|
125
|
+
|
126
|
+
Example
|
127
|
+
|
128
|
+
```ruby
|
129
|
+
Yasuri.text_title '/html/head/title', truncate:/^[^,]+/
|
130
|
+
|
131
|
+
# Nested case
|
132
|
+
Yasuri.links_root '//*[@id="menu"]/ul/li/a' do
|
133
|
+
struct_table './tr' do
|
134
|
+
text_title './td[1]'
|
135
|
+
text_pub_date './td[2]'
|
121
136
|
end
|
122
137
|
end
|
123
138
|
```
|
@@ -129,17 +144,18 @@ Type meen behavior of Node.
|
|
129
144
|
- *Struct*
|
130
145
|
- *Links*
|
131
146
|
- *Paginate*
|
147
|
+
- *Map*
|
132
148
|
|
133
|
-
|
149
|
+
#### Name
|
134
150
|
Name is used as the key in the returned hash.
|
135
151
|
|
136
|
-
|
152
|
+
#### Path
|
137
153
|
Path specifies the target node by XPath or CSS selector. It is passed to Mechanize's `search`.
|
138
154
|
|
139
|
-
|
155
|
+
#### Childlen
|
140
156
|
Child nodes. A TextNode always has an empty set of children, because a TextNode is a leaf.
|
141
157
|
|
142
|
-
|
158
|
+
#### Options
|
143
159
|
Parse options. They differ for each type. You can get the options and their values with the `opt` method.
|
144
160
|
|
145
161
|
```ruby
|
@@ -170,13 +186,15 @@ page = agent.get("http://yasuri.example.net")
|
|
170
186
|
|
171
187
|
p1 = Yasuri.text_title '/html/body/p[1]'
|
172
188
|
p1t = Yasuri.text_title '/html/body/p[1]', truncate:/^[^,]+/
|
173
|
-
p2u = Yasuri.text_title '/html/body/p[
|
189
|
+
p2u = Yasuri.text_title '/html/body/p[1]', proc: :upcase
|
174
190
|
|
175
|
-
p1.inject(agent, page) #=>
|
176
|
-
p1t.inject(agent, page) #=>
|
177
|
-
|
191
|
+
p1.inject(agent, page) #=> "Hello,World"
|
192
|
+
p1t.inject(agent, page) #=> "Hello"
|
193
|
+
p2u.inject(agent, page) #=> "HELLO,WORLD"
|
178
194
|
```
|
179
195
|
|
196
|
+
Note that if you want to scrape multiple elements in the same page at once, use `MapNode`. See the `MapNode` example for details.
|
197
|
+
|
180
198
|
### Options
|
181
199
|
##### `truncate`
|
182
200
|
Matches the text against a regexp and keeps only the matching part. When the regexp contains a group, only the first matched group is returned.
|
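The `truncate` description and the `proc: :upcase` example above can be illustrated without a live page. The following is a minimal sketch of the documented behavior, not Yasuri's own implementation: `apply_text_options` is a hypothetical helper, and returning an empty string when nothing matches is an assumption.

```ruby
# Hypothetical helper that mimics the documented options on a plain String.
# It is not part of Yasuri's API.
def apply_text_options(text, truncate: nil, proc: nil)
  if truncate
    m = truncate.match(text)
    # Prefer the first capture group when the pattern has one, otherwise
    # use the whole match. An unmatched pattern yields "" (an assumption).
    text = m ? (m[1] || m[0]) : ""
  end
  # `proc` names a String method to call on the result, e.g. :upcase.
  text = text.public_send(proc) if proc
  text
end

plain     = apply_text_options("Hello,World")
truncated = apply_text_options("Hello,World", truncate: /^[^,]+/)
upcased   = apply_text_options("Hello,World", proc: :upcase)
grouped   = apply_text_options("Hello,World", truncate: /H(.+)o,/)

puts plain      # Hello,World
puts truncated  # Hello
puts upcased    # HELLO,WORLD
puts grouped    # ell
```

The last call shows the group rule: the pattern matches `Hello,` as a whole, but only the captured `ell` is kept.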
@@ -479,3 +497,54 @@ node.inject(agent, page)
|
|
479
497
|
"Page03",
|
480
498
|
"Patination03"]
|
481
499
|
```
|
500
|
+
|
501
|
+
## Map Node
|
502
|
+
*MapNode* is a node that summarizes the results of scraping. This node is always a branch node in the parse tree.
|
503
|
+
|
504
|
+
### Example
|
505
|
+
|
506
|
+
```html
|
507
|
+
<!-- http://yasuri.example.net -->
|
508
|
+
<html>
|
509
|
+
<head><title>Yasuri Example</title></head>
|
510
|
+
<body>
|
511
|
+
<p>Hello,World</p>
|
512
|
+
<p>Hello,Yasuri</p>
|
513
|
+
</body>
|
514
|
+
</html>
|
515
|
+
```
|
516
|
+
|
517
|
+
```ruby
|
518
|
+
agent = Mechanize.new
|
519
|
+
page = agent.get("http://yasuri.example.net")
|
520
|
+
|
521
|
+
|
522
|
+
tree = Yasuri.map_root do
|
523
|
+
text_title '/html/head/title'
|
524
|
+
text_body_p '/html/body/p[1]'
|
525
|
+
end
|
526
|
+
|
527
|
+
tree.inject(agent, page) #=> { "title" => "Yasuri Example", "body_p" => "Hello,World" }
|
528
|
+
|
529
|
+
|
530
|
+
tree = Yasuri.map_root do
|
531
|
+
map_group1 { text_child01 '/html/body/a[1]' }
|
532
|
+
map_group2 do
|
533
|
+
text_child01 '/html/body/a[1]'
|
534
|
+
text_child03 '/html/body/a[3]'
|
535
|
+
end
|
536
|
+
end
|
537
|
+
|
538
|
+
tree.inject(agent, page) #=> {
|
539
|
+
# "group1" => {
|
540
|
+
# "child01" => "child01"
|
541
|
+
# },
|
542
|
+
# "group2" => {
|
543
|
+
# "child01" => "child01",
|
544
|
+
# "child03" => "child03"
|
545
|
+
# }
|
546
|
+
# }
|
547
|
+
```
|
548
|
+
|
549
|
+
### Options
|
550
|
+
None.
|
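The Map Node section above describes a branch node that collects the results of its children into one hash. The sketch below reproduces that shape with stand-in classes so it runs without Mechanize or a network connection. `FakeTextNode` and `SketchMapNode` are hypothetical names used only for illustration, and the canned values replace real page lookups.

```ruby
# Stand-in for a leaf node: returns a canned value instead of querying a page.
FakeTextNode = Struct.new(:name, :value) do
  def inject(_agent, _page, _opt = {})
    value
  end
end

# Mirrors the shape of MapNode#inject shown in this release: every child is
# evaluated against the same page and the results are keyed by child name.
class SketchMapNode
  attr_reader :name, :children

  def initialize(name, children)
    @name = name
    @children = children
  end

  def inject(agent, page, opt = {})
    children.to_h { |node| [node.name, node.inject(agent, page, opt)] }
  end
end

tree = SketchMapNode.new("root", [
  FakeTextNode.new("title", "Yasuri Example"),
  SketchMapNode.new("group1", [FakeTextNode.new("body_p", "Hello,World")])
])

# No agent or page is needed because the leaves are canned.
result = tree.inject(nil, nil)
p result
```

Because a map node is itself a valid child, nesting produces nested hashes. The name of the outermost node (`"root"` here) does not appear in the result.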
data/lib/yasuri/version.rb
CHANGED
data/lib/yasuri/yasuri.rb
CHANGED
@@ -11,6 +11,7 @@ require_relative 'yasuri_text_node'
|
|
11
11
|
require_relative 'yasuri_struct_node'
|
12
12
|
require_relative 'yasuri_paginate_node'
|
13
13
|
require_relative 'yasuri_links_node'
|
14
|
+
require_relative 'yasuri_map_node'
|
14
15
|
require_relative 'yasuri_node_generator'
|
15
16
|
|
16
17
|
module Yasuri
|
@@ -54,9 +55,9 @@ module Yasuri
|
|
54
55
|
body
|
55
56
|
end
|
56
57
|
|
57
|
-
def self.method_missing(
|
58
|
-
generated = Yasuri::NodeGenerator.gen(
|
59
|
-
generated || super(
|
58
|
+
def self.method_missing(method_name, pattern=nil, **opt, &block)
|
59
|
+
generated = Yasuri::NodeGenerator.gen(method_name, pattern, **opt, &block)
|
60
|
+
generated || super(method_name, **opt)
|
60
61
|
end
|
61
62
|
|
62
63
|
private
|
@@ -64,49 +65,22 @@ module Yasuri
|
|
64
65
|
text: Yasuri::TextNode,
|
65
66
|
struct: Yasuri::StructNode,
|
66
67
|
links: Yasuri::LinksNode,
|
67
|
-
pages: Yasuri::PaginateNode
|
68
|
+
pages: Yasuri::PaginateNode,
|
69
|
+
map: Yasuri::MapNode
|
68
70
|
}
|
69
71
|
Node2Text = Text2Node.invert
|
70
72
|
|
71
73
|
ReservedKeys = %i|node name path children|
|
72
74
|
def self.hash2node(node_h)
|
73
|
-
node
|
74
|
-
node_h[key]
|
75
|
-
end
|
76
|
-
children ||= []
|
75
|
+
node = node_h[:node]
|
77
76
|
|
78
77
|
fail "Not found 'node' value in map" if node.nil?
|
79
|
-
fail "Not found 'name' value in map" if name.nil?
|
80
|
-
fail "Not found 'path' value in map" if path.nil?
|
81
|
-
|
82
|
-
childnodes = children.map{|c| Yasuri.hash2node(c) }
|
83
|
-
ReservedKeys.each{|key| node_h.delete(key)}
|
84
|
-
opt = node_h
|
85
|
-
|
86
78
|
klass = Text2Node[node.to_sym]
|
87
|
-
|
88
|
-
klass.new(path, name, childnodes, **opt)
|
79
|
+
klass::hash2node(node_h)
|
89
80
|
end
|
90
81
|
|
91
82
|
def self.node2hash(node)
|
92
|
-
|
93
|
-
return json if node.nil?
|
94
|
-
|
95
|
-
klass = node.class
|
96
|
-
klass_str = Node2Text[klass]
|
97
|
-
|
98
|
-
json["node"] = klass_str
|
99
|
-
json["name"] = node.name
|
100
|
-
json["path"] = node.xpath
|
101
|
-
|
102
|
-
children = node.children.map{|c| Yasuri.node2hash(c)}
|
103
|
-
json["children"] = children if not children.empty?
|
104
|
-
|
105
|
-
node.opts.each do |key,value|
|
106
|
-
json[key] = value if not value.nil?
|
107
|
-
end
|
108
|
-
|
109
|
-
json
|
83
|
+
node.to_h
|
110
84
|
end
|
111
85
|
|
112
86
|
def self.NodeName(name, opt)
|
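The `yasuri.rb` change above replaces the inline construction in `hash2node` with a table lookup followed by a call to the node class's own `hash2node`. A minimal sketch of that dispatch pattern follows. The `Sketch` module and its classes are hypothetical stand-ins, and the explicit error for an unknown type is an addition of this example rather than something shown in the diff.

```ruby
module Sketch
  # Hypothetical base class standing in for Yasuri's node types.
  class BaseNode
    attr_reader :name

    def initialize(name)
      @name = name
    end

    # Each class rebuilds itself from a hash, so the dispatcher needs no
    # per-type knowledge of constructor arguments.
    def self.hash2node(node_h)
      new(node_h.fetch(:name))
    end
  end

  class TextNode < BaseNode; end
  class MapNode < BaseNode; end

  # Type name => class, plus the inverted table for serialization.
  TEXT2NODE = { text: TextNode, map: MapNode }.freeze
  NODE2TEXT = TEXT2NODE.invert.freeze

  def self.hash2node(node_h)
    type = node_h[:node]
    raise ArgumentError, "Not found 'node' value in map" if type.nil?

    klass = TEXT2NODE.fetch(type.to_sym) do
      raise ArgumentError, "Unknown node type '#{type}'"
    end
    klass.hash2node(node_h)
  end
end

node = Sketch.hash2node({ node: "map", name: "root" })
type_name = Sketch::NODE2TEXT[node.class]

puts node.class  # Sketch::MapNode
puts type_name   # map
```

Adding a node type under this layout means adding one table entry and one class, which is how `map: Yasuri::MapNode` enters the table in this release.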
data/lib/yasuri/yasuri_map_node.rb
ADDED
@@ -0,0 +1,54 @@
|
|
1
|
+
|
2
|
+
module Yasuri
|
3
|
+
class MapNode
|
4
|
+
attr_reader :name, :children
|
5
|
+
|
6
|
+
def initialize(name, children, opt: {})
|
7
|
+
@name = name
|
8
|
+
@children = children
|
9
|
+
@opt = opt
|
10
|
+
end
|
11
|
+
|
12
|
+
def inject(agent, page, opt = {}, element = page)
|
13
|
+
child_results_kv = @children.map do |node|
|
14
|
+
[node.name, node.inject(agent, page, opt)]
|
15
|
+
end
|
16
|
+
Hash[child_results_kv]
|
17
|
+
end
|
18
|
+
|
19
|
+
def opts
|
20
|
+
{}
|
21
|
+
end
|
22
|
+
|
23
|
+
def to_h
|
24
|
+
h = {}
|
25
|
+
h["node"] = "map"
|
26
|
+
h["name"] = self.name
|
27
|
+
h["children"] = self.children.map{|c| c.to_h} if not children.empty?
|
28
|
+
|
29
|
+
self.opts.each do |key,value|
|
30
|
+
h[key] = value if not value.nil?
|
31
|
+
end
|
32
|
+
|
33
|
+
h
|
34
|
+
end
|
35
|
+
|
36
|
+
def self.hash2node(node_h)
|
37
|
+
reservedKeys = %i|node name children|
|
38
|
+
|
39
|
+
node, name, children = reservedKeys.map do |key|
|
40
|
+
node_h[key]
|
41
|
+
end
|
42
|
+
|
43
|
+
fail "Not found 'name' value in map" if name.nil?
|
44
|
+
fail "Not found 'children' value in map" if children.nil?
|
45
|
+
children ||= []
|
46
|
+
|
47
|
+
childnodes = children.map{|c| Yasuri.hash2node(c) }
|
48
|
+
reservedKeys.each{|key| node_h.delete(key)}
|
49
|
+
opt = node_h
|
50
|
+
|
51
|
+
self.new(name, childnodes, **opt)
|
52
|
+
end
|
53
|
+
end
|
54
|
+
end
|
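`MapNode#to_h` and `MapNode.hash2node` above are the two halves of a serialization round trip: `to_h` writes string keys and `hash2node` reads symbol keys. The sketch below shows that round trip through JSON with a single hypothetical class, `SketchNode`. The `path` field is left out for brevity, and parsing with `symbolize_names: true` is an assumption about how the symbol keys arise.

```ruby
require 'json'

# Hypothetical node with just enough state to show the round trip.
class SketchNode
  attr_reader :type, :name, :children

  def initialize(type, name, children = [])
    @type = type
    @name = name
    @children = children
  end

  # Serialize with string keys. "children" is omitted when empty.
  def to_h
    h = { "node" => type, "name" => name }
    h["children"] = children.map(&:to_h) unless children.empty?
    h
  end

  # Rebuild from a hash with symbol keys without mutating the input.
  def self.from_h(node_h)
    kids = (node_h[:children] || []).map { |c| from_h(c) }
    new(node_h.fetch(:node), node_h.fetch(:name), kids)
  end
end

tree = SketchNode.new("map", "parent", [
  SketchNode.new("text", "content01"),
  SketchNode.new("text", "content02")
])

json = JSON.generate(tree.to_h)
restored = SketchNode.from_h(JSON.parse(json, symbolize_names: true))
round_trip_ok = restored.to_h == tree.to_h

puts json
puts round_trip_ok
```

One design difference from the code in the diff: `from_h` reads the keys it needs and leaves the caller's hash untouched, whereas the released `hash2node` deletes the reserved keys from `node_h` and treats the remainder as options.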
data/lib/yasuri/yasuri_node.rb
CHANGED
@@ -12,10 +12,53 @@ module Yasuri
|
|
12
12
|
end
|
13
13
|
|
14
14
|
def inject(agent, page, opt = {}, element = page)
|
15
|
-
fail "#{Kernel.__method__} is not implemented."
|
15
|
+
fail "#{Kernel.__method__} is not implemented in included class."
|
16
16
|
end
|
17
|
+
|
17
18
|
def opts
|
18
19
|
{}
|
19
20
|
end
|
21
|
+
|
22
|
+
def to_h
|
23
|
+
h = {}
|
24
|
+
h["node"] = self.node_type_str
|
25
|
+
h["name"] = self.name
|
26
|
+
h["path"] = self.xpath
|
27
|
+
h["children"] = self.children.map{|c| c.to_h} if not children.empty?
|
28
|
+
|
29
|
+
self.opts.each do |key,value|
|
30
|
+
h[key] = value if not value.nil?
|
31
|
+
end
|
32
|
+
|
33
|
+
h
|
34
|
+
end
|
35
|
+
|
36
|
+
module ClassMethods
|
37
|
+
def hash2node(node_h)
|
38
|
+
reservedKeys = %i|node name path children|
|
39
|
+
|
40
|
+
node, name, path, children = reservedKeys.map do |key|
|
41
|
+
node_h[key]
|
42
|
+
end
|
43
|
+
|
44
|
+
fail "Not found 'name' value in map" if name.nil?
|
45
|
+
fail "Not found 'path' value in map" if path.nil?
|
46
|
+
children ||= []
|
47
|
+
|
48
|
+
childnodes = children.map{|c| Yasuri.hash2node(c) }
|
49
|
+
reservedKeys.each{|key| node_h.delete(key)}
|
50
|
+
opt = node_h
|
51
|
+
|
52
|
+
self.new(path, name, childnodes, **opt)
|
53
|
+
end
|
54
|
+
|
55
|
+
def node_type_str
|
56
|
+
fail "#{Kernel.__method__} is not implemented in included class."
|
57
|
+
end
|
58
|
+
end
|
59
|
+
|
60
|
+
def self.included(base)
|
61
|
+
base.extend(ClassMethods)
|
62
|
+
end
|
20
63
|
end
|
21
64
|
end
|
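The `ClassMethods` module and the `self.included` hook above are a common Ruby idiom: `include` only adds instance methods, so the hook calls `extend` on the including class to add class-level methods as well. A minimal sketch of the idiom, using the hypothetical names `Describable` and `SketchTextNode`:

```ruby
module Describable
  # Instance-level behavior, added by `include`.
  def describe
    "#{self.class.node_type_str}:#{name}"
  end

  # Class-level behavior, added by the `included` hook below.
  module ClassMethods
    def node_type_str
      raise NotImplementedError, "#{__method__} is not implemented in #{self}"
    end
  end

  # Called by Ruby whenever a class includes this module.
  def self.included(base)
    base.extend(ClassMethods)
  end
end

class SketchTextNode
  include Describable

  attr_reader :name

  def initialize(name)
    @name = name
  end

  # Overrides the placeholder supplied by ClassMethods.
  def self.node_type_str
    "text"
  end
end

description = SketchTextNode.new("title").describe
puts description  # text:title
```

A class that includes the module without defining `node_type_str` still responds to it, but the placeholder raises, which points at the missing override.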
data/lib/yasuri/yasuri_node_generator.rb
CHANGED
@@ -6,6 +6,7 @@ require_relative 'yasuri_text_node'
|
|
6
6
|
require_relative 'yasuri_struct_node'
|
7
7
|
require_relative 'yasuri_links_node'
|
8
8
|
require_relative 'yasuri_paginate_node'
|
9
|
+
require_relative 'yasuri_map_node'
|
9
10
|
|
10
11
|
module Yasuri
|
11
12
|
class NodeGenerator
|
@@ -15,27 +16,33 @@ module Yasuri
|
|
15
16
|
@nodes
|
16
17
|
end
|
17
18
|
|
18
|
-
def method_missing(name, pattern, **args, &block)
|
19
|
+
def method_missing(name, pattern=nil, **args, &block)
|
19
20
|
node = NodeGenerator.gen(name, pattern, **args, &block)
|
20
21
|
raise "Undefined Node Name '#{name}'" if node == nil
|
21
22
|
@nodes << node
|
22
23
|
end
|
23
24
|
|
24
|
-
def self.gen(
|
25
|
+
def self.gen(method_name, xpath, **opt, &block)
|
25
26
|
children = Yasuri::NodeGenerator.new.gen_recursive(&block) if block_given?
|
26
27
|
|
27
|
-
case
|
28
|
+
case method_name
|
28
29
|
when /^text_(.+)$/
|
29
|
-
|
30
|
+
# Todo raise error xpath is not valid
|
31
|
+
Yasuri::TextNode.new(xpath, $1, children || [], **opt)
|
30
32
|
when /^struct_(.+)$/
|
33
|
+
# Todo raise error xpath is not valid
|
31
34
|
Yasuri::StructNode.new(xpath, $1, children || [], **opt)
|
32
35
|
when /^links_(.+)$/
|
33
|
-
|
36
|
+
# Todo raise error xpath is not valid
|
37
|
+
Yasuri::LinksNode.new(xpath, $1, children || [], **opt)
|
34
38
|
when /^pages_(.+)$/
|
39
|
+
# Todo raise error xpath is not valid
|
35
40
|
Yasuri::PaginateNode.new(xpath, $1, children || [], **opt)
|
41
|
+
when /^map_(.+)$/
|
42
|
+
Yasuri::MapNode.new($1, children, **opt)
|
36
43
|
else
|
37
44
|
nil
|
38
45
|
end
|
39
|
-
end # of self.gen(
|
46
|
+
end # of self.gen(method_name, xpath, **opt, &block)
|
40
47
|
end # of class NodeGenerator
|
41
48
|
end
|
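The generator above turns method names such as `text_title` or `map_group1` into nodes through `method_missing`, and the new `pattern=nil` default is what lets `map_*` be called without a path. The sketch below shows the same technique in a self-contained form. `SketchBuilder` is a hypothetical class that builds plain hashes instead of Yasuri nodes, and the path check for text nodes is an addition of this example (the diff only carries a `Todo` comment there).

```ruby
class SketchBuilder
  attr_reader :nodes

  def initialize
    @nodes = []
  end

  # Evaluate the block with a fresh builder as `self` and return its nodes.
  def self.build(&block)
    builder = new
    builder.instance_eval(&block)
    builder.nodes
  end

  # The path is optional so that map_* can be written without one.
  def method_missing(method_name, path = nil, **opt, &block)
    children = block ? SketchBuilder.build(&block) : []

    case method_name.to_s
    when /\Atext_(.+)\z/
      raise ArgumentError, "text node '#{$1}' needs a path" if path.nil?
      @nodes << { node: :text, name: $1, path: path, **opt }
    when /\Amap_(.+)\z/
      @nodes << { node: :map, name: $1, children: children }
    else
      super
    end
  end

  def respond_to_missing?(method_name, include_private = false)
    method_name.to_s.match?(/\A(text|map)_.+\z/) || super
  end
end

nodes = SketchBuilder.build do
  text_title '/html/head/title'
  map_group1 { text_child01 '/html/body/a[1]' }
end

p nodes
```

Defaulting `children` to an empty array also covers a `map_*` call written without a block.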
data/spec/yasuri_map_spec.rb
ADDED
@@ -0,0 +1,76 @@
|
|
1
|
+
require_relative 'spec_helper'
|
2
|
+
|
3
|
+
describe 'Yasuri' do
|
4
|
+
include_context 'httpserver'
|
5
|
+
|
6
|
+
before do
|
7
|
+
@agent = Mechanize.new
|
8
|
+
@index_page = @agent.get(uri)
|
9
|
+
end
|
10
|
+
|
11
|
+
describe '::MapNode' do
|
12
|
+
it "multi scrape in single page" do
|
13
|
+
map = Yasuri.map_sample do
|
14
|
+
text_title '/html/head/title'
|
15
|
+
text_body_p '/html/body/p[1]'
|
16
|
+
end
|
17
|
+
actual = map.inject(@agent, @index_page)
|
18
|
+
|
19
|
+
expected = {
|
20
|
+
"title" => "Yasuri Test",
|
21
|
+
"body_p" => "Hello,Yasuri"
|
22
|
+
}
|
23
|
+
expect(actual).to include expected
|
24
|
+
end
|
25
|
+
|
26
|
+
it "nested multi scrape in single page" do
|
27
|
+
map = Yasuri.map_sample do
|
28
|
+
map_group1 { text_child01 '/html/body/a[1]' }
|
29
|
+
map_group2 do
|
30
|
+
text_child01 '/html/body/a[1]'
|
31
|
+
text_child03 '/html/body/a[3]'
|
32
|
+
end
|
33
|
+
end
|
34
|
+
actual = map.inject(@agent, @index_page)
|
35
|
+
|
36
|
+
expected = {
|
37
|
+
"group1" => {
|
38
|
+
"child01" => "child01"
|
39
|
+
},
|
40
|
+
"group2" => {
|
41
|
+
"child01" => "child01",
|
42
|
+
"child03" => "child03"
|
43
|
+
}
|
44
|
+
}
|
45
|
+
expect(actual).to include expected
|
46
|
+
end
|
47
|
+
|
48
|
+
it "scrape with links node" do
|
49
|
+
map = Yasuri.map_sample do
|
50
|
+
map_group1 do
|
51
|
+
links_a '/html/body/a' do
|
52
|
+
text_content '/html/body/p'
|
53
|
+
end
|
54
|
+
text_child01 '/html/body/a[1]'
|
55
|
+
end
|
56
|
+
map_group2 do
|
57
|
+
text_child03 '/html/body/a[3]'
|
58
|
+
end
|
59
|
+
end
|
60
|
+
actual = map.inject(@agent, @index_page)
|
61
|
+
|
62
|
+
expected = {
|
63
|
+
"group1" => {
|
64
|
+
"a" => [
|
65
|
+
{"content" => "Child 01 page."},
|
66
|
+
{"content" => "Child 02 page."},
|
67
|
+
{"content" => "Child 03 page."},
|
68
|
+
],
|
69
|
+
"child01" => "child01"
|
70
|
+
},
|
71
|
+
"group2" => { "child03" => "child03" }
|
72
|
+
}
|
73
|
+
expect(actual).to include expected
|
74
|
+
end
|
75
|
+
end
|
76
|
+
end
|
data/spec/yasuri_spec.rb
CHANGED
@@ -126,6 +126,27 @@ EOB
|
|
126
126
|
compare_generated_vs_original(generated, original, @index_page)
|
127
127
|
end
|
128
128
|
|
129
|
+
it "return MapNode with TextNodes" do
|
130
|
+
src = %q| { "node" : "map",
|
131
|
+
"name" : "parent",
|
132
|
+
"children" : [
|
133
|
+
{ "node" : "text",
|
134
|
+
"name" : "content01",
|
135
|
+
"path" : "/html/body/p[1]"
|
136
|
+
},
|
137
|
+
{ "node" : "text",
|
138
|
+
"name" : "content02",
|
139
|
+
"path" : "/html/body/p[2]"
|
140
|
+
}
|
141
|
+
]
|
142
|
+
}|
|
143
|
+
generated = Yasuri.json2tree(src)
|
144
|
+
original = Yasuri::MapNode.new('parent', [
|
145
|
+
Yasuri::TextNode.new('/html/body/p[1]', "content01"),
|
146
|
+
Yasuri::TextNode.new('/html/body/p[2]', "content02"),
|
147
|
+
])
|
148
|
+
compare_generated_vs_original(generated, original, @index_page)
|
149
|
+
end
|
129
150
|
|
130
151
|
it "return LinksNode/TextNode" do
|
131
152
|
src = %q| { "node" : "links",
|
@@ -248,6 +269,31 @@ EOB
|
|
248
269
|
expect(actual).to match expected
|
249
270
|
end
|
250
271
|
|
272
|
+
it "return map node with text nodes" do
|
273
|
+
tree = Yasuri::MapNode.new('parent', [
|
274
|
+
Yasuri::TextNode.new('/html/body/p[1]', "content01"),
|
275
|
+
Yasuri::TextNode.new('/html/body/p[2]', "content02"),
|
276
|
+
])
|
277
|
+
actual_json = Yasuri.tree2json(tree)
|
278
|
+
|
279
|
+
expected_json = %q| { "node" : "map",
|
280
|
+
"name" : "parent",
|
281
|
+
"children" : [
|
282
|
+
{ "node" : "text",
|
283
|
+
"name" : "content01",
|
284
|
+
"path" : "/html/body/p[1]"
|
285
|
+
},
|
286
|
+
{ "node" : "text",
|
287
|
+
"name" : "content02",
|
288
|
+
"path" : "/html/body/p[2]"
|
289
|
+
}
|
290
|
+
]
|
291
|
+
}|
|
292
|
+
expected = Yasuri.tree2json(Yasuri.json2tree(expected_json))
|
293
|
+
actual = Yasuri.tree2json(Yasuri.json2tree(actual_json))
|
294
|
+
expect(actual).to match expected
|
295
|
+
end
|
296
|
+
|
251
297
|
it "return LinksNode/TextNode" do
|
252
298
|
tree = Yasuri::LinksNode.new('/html/body/a', "root", [
|
253
299
|
Yasuri::TextNode.new('/html/body/p', "content"),
|
metadata
CHANGED
@@ -1,14 +1,14 @@
|
|
1
1
|
--- !ruby/object:Gem::Specification
|
2
2
|
name: yasuri
|
3
3
|
version: !ruby/object:Gem::Version
|
4
|
-
version: 3.
|
4
|
+
version: 3.1.0
|
5
5
|
platform: ruby
|
6
6
|
authors:
|
7
7
|
- TAC
|
8
8
|
autorequire:
|
9
9
|
bindir: bin
|
10
10
|
cert_chain: []
|
11
|
-
date: 2021-03-
|
11
|
+
date: 2021-03-21 00:00:00.000000000 Z
|
12
12
|
dependencies:
|
13
13
|
- !ruby/object:Gem::Dependency
|
14
14
|
name: bundler
|
@@ -160,6 +160,7 @@ files:
|
|
160
160
|
- lib/yasuri/version.rb
|
161
161
|
- lib/yasuri/yasuri.rb
|
162
162
|
- lib/yasuri/yasuri_links_node.rb
|
163
|
+
- lib/yasuri/yasuri_map_node.rb
|
163
164
|
- lib/yasuri/yasuri_node.rb
|
164
165
|
- lib/yasuri/yasuri_node_generator.rb
|
165
166
|
- lib/yasuri/yasuri_paginate_node.rb
|
@@ -181,6 +182,7 @@ files:
|
|
181
182
|
- spec/servers/httpserver.rb
|
182
183
|
- spec/spec_helper.rb
|
183
184
|
- spec/yasuri_links_node_spec.rb
|
185
|
+
- spec/yasuri_map_spec.rb
|
184
186
|
- spec/yasuri_node_spec.rb
|
185
187
|
- spec/yasuri_paginate_node_spec.rb
|
186
188
|
- spec/yasuri_spec.rb
|
@@ -227,6 +229,7 @@ test_files:
|
|
227
229
|
- spec/servers/httpserver.rb
|
228
230
|
- spec/spec_helper.rb
|
229
231
|
- spec/yasuri_links_node_spec.rb
|
232
|
+
- spec/yasuri_map_spec.rb
|
230
233
|
- spec/yasuri_node_spec.rb
|
231
234
|
- spec/yasuri_paginate_node_spec.rb
|
232
235
|
- spec/yasuri_spec.rb
|