yasuri 0.0.11 → 1.9.11
- checksums.yaml +4 -4
- data/README.md +18 -0
- data/USAGE.ja.md +433 -0
- data/USAGE.md +431 -0
- data/lib/yasuri/version.rb +1 -1
- data/lib/yasuri/yasuri.rb +3 -2
- data/lib/yasuri/yasuri_node.rb +1 -1
- data/lib/yasuri/yasuri_paginate_node.rb +2 -2
- data/lib/yasuri/yasuri_text_node.rb +4 -1
- data/spec/spec_helper.rb +3 -2
- data/spec/yasuri_spec.rb +3 -3
- data/spec/yasuri_struct_node_spec.rb +3 -3
- data/spec/yasuri_text_node_spec.rb +1 -1
- metadata +4 -2
checksums.yaml
CHANGED
@@ -1,7 +1,7 @@
|
|
1
1
|
---
|
2
2
|
SHA1:
|
3
|
-
metadata.gz:
|
4
|
-
data.tar.gz:
|
3
|
+
metadata.gz: 28e6a3903cec3d8036b718a7c5de4fb8df8dfbfa
|
4
|
+
data.tar.gz: fbfb4c2b3a042410a7d05b1604e004bdffd82779
|
5
5
|
SHA512:
|
6
|
-
metadata.gz:
|
7
|
-
data.tar.gz:
|
6
|
+
metadata.gz: e4312623046ecf451ef261b1d99164b7f79b861ad8d7fe49d7ba8fd2319cd840978ae88235e6fb10b6991cd5131ebec966ec56e3bddece8fbbefd0b53d4a9dfe
|
7
|
+
data.tar.gz: 54639066aa4511309a1f712b8920c81c09a7489e0cab6a91c453d2aeca07d4f1c3bed167ed25746411d5f87fa9cc9fff8c6a32567c4df8d1990f12cb053286e2
|
data/README.md
CHANGED
@@ -2,6 +2,16 @@
|
|
2
2
|
|
3
3
|
Yasuri (鑢) is an easy web-scraping library for supporting "[Mechanize](https://github.com/sparklemotion/mechanize)".
|
4
4
|
|
5
|
+
Yasuri simplifies the tasks that come up frequently in scraping.
|
6
|
+
|
7
|
+
For example,
|
8
|
+
|
9
|
+
+ Open each link in a page, scrape the linked pages, and get the results as a Hash.
|
10
|
+
+ Scrape several texts in a page and name each result in a Hash.
|
11
|
+
+ Scrape each table that appears repeatedly in a page and get the results as an Array.
|
12
|
+
+ Scrape only the first 3 of the pages provided by pagination.
|
13
|
+
|
14
|
+
Yasuri makes these easy to implement.
|
5
15
|
|
6
16
|
## Sample
|
7
17
|
|
@@ -17,6 +27,14 @@ Add this line to your application's Gemfile:
|
|
17
27
|
gem 'yasuri'
|
18
28
|
```
|
19
29
|
|
30
|
+
or
|
31
|
+
|
32
|
+
```ruby
|
33
|
+
# for Ruby 1.9.3 or lower
|
34
|
+
gem 'yasuri', '~> 1.9'
|
35
|
+
```
|
36
|
+
|
37
|
+
|
20
38
|
And then execute:
|
21
39
|
|
22
40
|
$ bundle
|
data/USAGE.ja.md
ADDED
@@ -0,0 +1,433 @@
|
|
1
|
+
# Yasuri Usage
|
2
|
+
|
3
|
+
## What is Yasuri
|
4
|
+
Yasuri (鑢) is a library for easy web scraping that supports "[Mechanize](https://github.com/sparklemotion/mechanize)".
|
5
|
+
|
6
|
+
Yasuri lets you write the common steps of scraping concisely.
|
7
|
+
For example,
|
8
|
+
|
9
|
+
+ Open multiple links in a page and get the result of scraping each linked page as a Hash
|
10
|
+
+ Scrape multiple texts in a page and name them in a Hash
|
11
|
+
+ Scrape each table that appears repeatedly in a page and get them as an Array
|
12
|
+
+ Scrape, in order, only the first 3 of the pages provided by pagination
|
13
|
+
|
14
|
+
Yasuri makes these easy to implement.
|
15
|
+
|
16
|
+
## Quick Start
|
17
|
+
|
18
|
+
```
|
19
|
+
$ gem install yasuri
|
20
|
+
```
|
21
|
+
|
22
|
+
```ruby
|
23
|
+
require 'yasuri'
|
24
|
+
require 'mechanize'
|
25
|
+
|
26
|
+
# Node tree constructing by DSL
|
27
|
+
root = Yasuri.links_root '//*[@id="menu"]/ul/li/a' do
|
28
|
+
text_title '//*[@id="contents"]/h2'
|
29
|
+
text_content '//*[@id="contents"]/p[1]'
|
30
|
+
end
|
31
|
+
|
32
|
+
agent = Mechanize.new
|
33
|
+
root_page = agent.get("http://some.scraping.page.net/")
|
34
|
+
|
35
|
+
result = root.inject(agent, root_page)
|
36
|
+
# => [ {"title" => "PageTitle1", "content" => "Page Contents1" },
|
37
|
+
# {"title" => "PageTitle2", "content" => "Page Contents2" }, ... ]
|
38
|
+
|
39
|
+
```
|
40
|
+
In this example, two texts specified by the xpaths of the TextNodes (`text_title`, `text_content`) are scraped from each page linked by the xpath of the LinkNode (`links_root`).
|
41
|
+
|
42
|
+
(In other words, it opens each link matched by `//*[@id="menu"]/ul/li/a` and scrapes the text specified by `//*[@id="contents"]/h2` and `//*[@id="contents"]/p[1]`.)
|
43
|
+
|
44
|
+
## Basics
|
45
|
+
|
46
|
+
1. Construct a parse tree
|
47
|
+
2. Start parsing by passing a Mechanize agent and the target page
|
48
|
+
|
49
|
+
|
50
|
+
### Construct a parse tree
|
51
|
+
|
52
|
+
```ruby
|
53
|
+
require 'mechanize'
|
54
|
+
require 'yasuri'
|
55
|
+
|
56
|
+
|
57
|
+
# 1. Construct a parse tree
|
58
|
+
tree = Yasuri.links_title '/html/body/a' do
|
59
|
+
text_name '/html/body/p'
|
60
|
+
end
|
61
|
+
|
62
|
+
# 2. Start parsing by passing a Mechanize agent and the target page
|
63
|
+
agent = Mechanize.new
|
64
|
+
page = agent.get(uri)
|
65
|
+
|
66
|
+
|
67
|
+
tree.inject(agent, page)
|
68
|
+
```
|
69
|
+
|
70
|
+
A tree can be defined with the DSL or with JSON. The example above uses the DSL.
|
71
|
+
The following example defines an equivalent parse tree with JSON.
|
72
|
+
|
73
|
+
```ruby
|
74
|
+
# Constructing with JSON
|
75
|
+
src = <<-EOJSON
|
76
|
+
{ "node" : "links",
|
77
|
+
"name" : "title",
|
78
|
+
"path" : "/html/body/a",
|
79
|
+
"children" : [
|
80
|
+
{ "node" : "text",
|
81
|
+
"name" : "name",
|
82
|
+
"path" : "/html/body/p"
|
83
|
+
}
|
84
|
+
]
|
85
|
+
}
|
86
|
+
EOJSON
|
87
|
+
tree = Yasuri.json2tree(src)
|
88
|
+
```
|
89
|
+
|
90
|
+
|
91
|
+
### Node
|
92
|
+
A tree is made up of nested *Node*s.
|
93
|
+
A Node has `Type`, `Name`, `Path`, `Children`, and `Options`.
|
94
|
+
|
95
|
+
A Node is defined in the following format.
|
96
|
+
|
97
|
+
```ruby
|
98
|
+
# Top level
|
99
|
+
Yasuri.<Type>_<Name> <Path> [,<Options>]
|
100
|
+
|
101
|
+
# Nested
|
102
|
+
Yasuri.<Type>_<Name> <Path> [,<Options>] do
|
103
|
+
<Type>_<Name> <Path> [,<Options>] do
|
104
|
+
<Children>
|
105
|
+
end
|
106
|
+
end
|
107
|
+
```
|
108
|
+
|
109
|
+
#### Type
|
110
|
+
*Type* indicates the behavior of a Node. The following Types are available.
|
111
|
+
|
112
|
+
- *Text*
|
113
|
+
- *Struct*
|
114
|
+
- *Links*
|
115
|
+
- *Paginate*
|
116
|
+
|
117
|
+
### Name
|
118
|
+
*Name* becomes the key in the Hash of parse results.
|
119
|
+
|
120
|
+
### Path
|
121
|
+
*Path* specifies a particular node in the HTML by xpath or css selector.
|
122
|
+
It is used with Mechanize's `search`.
|
123
|
+
|
124
|
+
### Children
|
125
|
+
The child nodes of a nested node. A TextNode is a leaf of the tree, so it has no child nodes.
|
126
|
+
|
127
|
+
### Options
|
128
|
+
Parse options. The options differ for each Type.
|
129
|
+
You can get the available options by calling the `opt` method on each node.
|
130
|
+
|
131
|
+
```
|
132
|
+
# TextNode example
|
133
|
+
node = Yasuri.text_title '/html/body/p[1]', truncate:/^[^,]+/
|
134
|
+
node.opt #=> {:truncate => /^[^,]+/, :proc => nil}
|
135
|
+
```
|
136
|
+
|
137
|
+
## Text Node
|
138
|
+
*TextNode* returns the scraped text. This node is always a leaf in the parse tree.
|
139
|
+
|
140
|
+
### Example
|
141
|
+
|
142
|
+
```html
|
143
|
+
<!-- http://yasuri.example.net -->
|
144
|
+
<html>
|
145
|
+
<head></head>
|
146
|
+
<body>
|
147
|
+
<p>Hello,World</p>
|
148
|
+
<p>Hello,Yasuri</p>
|
149
|
+
</body>
|
150
|
+
</html>
|
151
|
+
```
|
152
|
+
|
153
|
+
```ruby
|
154
|
+
agent = Mechanize.new
|
155
|
+
page = agent.get("http://yasuri.example.net")
|
156
|
+
|
157
|
+
p1 = Yasuri.text_title '/html/body/p[1]'
|
158
|
+
p1t = Yasuri.text_title '/html/body/p[1]', truncate:/^[^,]+/
|
159
|
+
p2u = Yasuri.text_title '/html/body/p[2]', proc: :upcase
|
160
|
+
|
161
|
+
p1.inject(agent, page) #=> { "title" => "Hello,World" }
|
162
|
+
p1t.inject(agent, page) #=> { "title" => "Hello" }
|
163
|
+
p2u.inject(agent, page) #=> { "title" => "HELLO,YASURI" }
|
164
|
+
```
|
165
|
+
|
166
|
+
### Options
|
167
|
+
##### `truncate`
|
168
|
+
Extracts the string that matches the regular expression. If a group is specified, only the first matched group is returned.
|
169
|
+
|
170
|
+
```ruby
|
171
|
+
node = Yasuri.text_example '/html/body/p[1]', truncate:/H(.+)i/
|
172
|
+
node.inject(agent, index_page)
|
173
|
+
#=> { "example" => "ello,Yasur" }
|
174
|
+
```
|
175
|
+
|
176
|
+
|
177
|
+
##### `proc`
|
178
|
+
Calls the method specified by a Symbol, with the extracted String as the receiver.
|
179
|
+
If the `truncate` option is also specified, the method is called on the string after `truncate` has been applied.
|
180
|
+
|
181
|
+
```ruby
|
182
|
+
node = Yasuri.text_example '/html/body/p[1]', proc: :upcase, truncate:/H(.+)i/
|
183
|
+
node.inject(agent, index_page)
|
184
|
+
#=> { "example" => "ELLO,YASUR" }
|
185
|
+
```
|
186
|
+
|
187
|
+
## Struct Node
|
188
|
+
*Struct Node* returns text as a structured Hash.
|
189
|
+
|
190
|
+
First, a Struct Node narrows down the HTML tags by `Path`.
|
191
|
+
The child nodes of the Struct Node parse these narrowed-down tags, and the Struct Node returns a Hash containing the child nodes' results.
|
192
|
+
|
193
|
+
If the `Path` of a Struct Node matches multiple tags, the results are returned as an Array.
|
194
|
+
|
195
|
+
### Example
|
196
|
+
|
197
|
+
```html
|
198
|
+
<!-- http://yasuri.example.net -->
|
199
|
+
<html>
|
200
|
+
<head>
|
201
|
+
<title>Books</title>
|
202
|
+
</head>
|
203
|
+
<body>
|
204
|
+
<h1>1996</h1>
|
205
|
+
<table>
|
206
|
+
<thead>
|
207
|
+
<tr><th>Title</th> <th>Publication Date</th></tr>
|
208
|
+
</thead>
|
209
|
+
<tr><td>The Perfect Insider</td> <td>1996/4/5</td></tr>
|
210
|
+
<tr><td>Doctors in Isolated Room</td> <td>1996/7/5</td></tr>
|
211
|
+
<tr><td>Mathematical Goodbye</td> <td>1996/9/5</td></tr>
|
212
|
+
</table>
|
213
|
+
|
214
|
+
<h1>1997</h1>
|
215
|
+
<table>
|
216
|
+
<thead>
|
217
|
+
<tr><th>Title</th> <th>Publication Date</th></tr>
|
218
|
+
</thead>
|
219
|
+
<tr><td>Jack the Poetical Private</td> <td>1997/1/5</td></tr>
|
220
|
+
<tr><td>Who Inside</td> <td>1997/4/5</td></tr>
|
221
|
+
<tr><td>Illusion Acts Like Magic</td> <td>1997/10/5</td></tr>
|
222
|
+
</table>
|
223
|
+
|
224
|
+
<h1>1998</h1>
|
225
|
+
<table>
|
226
|
+
<thead>
|
227
|
+
<tr><th>Title</th> <th>Publication Date</th></tr>
|
228
|
+
</thead>
|
229
|
+
<tr><td>Replaceable Summer</td> <td>1998/1/7</td></tr>
|
230
|
+
<tr><td>Switch Back</td> <td>1998/4/5</td></tr>
|
231
|
+
<tr><td>Numerical Models</td> <td>1998/7/5</td></tr>
|
232
|
+
<tr><td>The Perfect Outsider</td> <td>1998/10/5</td></tr>
|
233
|
+
</table>
|
234
|
+
</body>
|
235
|
+
</html>
|
236
|
+
```
|
237
|
+
|
238
|
+
```ruby
|
239
|
+
agent = Mechanize.new
|
240
|
+
page = agent.get("http://yasuri.example.net")
|
241
|
+
|
242
|
+
node = Yasuri.struct_table '/html/body/table[1]/tr' do
|
243
|
+
text_title './td[1]'
|
244
|
+
text_pub_date './td[2]'
|
245
|
+
end
|
246
|
+
|
247
|
+
node.inject(agent, page)
|
248
|
+
#=> [ { "title" => "The Perfect Insider",
|
249
|
+
# "pub_date" => "1996/4/5" },
|
250
|
+
# { "title" => "Doctors in Isolated Room",
|
251
|
+
# "pub_date" => "1996/7/5" },
|
252
|
+
# { "title" => "Mathematical Goodbye",
|
253
|
+
# "pub_date" => "1996/9/5" }, ]
|
254
|
+
```
|
255
|
+
|
256
|
+
The Struct Node narrows down to all `<tr>` tags in the first `<table>` by the xpath `'/html/body/table[1]/tr'`.
|
257
|
+
Then the two child TextNodes parse the `<tr>` tags.
|
258
|
+
In this case, the first `<table>` has three matching `<tr>` tags, so three Hashes are returned. (Note that it is not four, because `<thead><tr>` does not match `Path`.)
|
259
|
+
Each Hash contains the text parsed by the TextNodes.
|
260
|
+
|
261
|
+
|
262
|
+
As in the following example, a Struct Node can also have nodes other than TextNode as child nodes.
|
263
|
+
|
264
|
+
### Example
|
265
|
+
|
266
|
+
```ruby
|
267
|
+
agent = Mechanize.new
|
268
|
+
page = agent.get("http://yasuri.example.net")
|
269
|
+
|
270
|
+
node = Yasuri.struct_tables '/html/body/table' do
|
271
|
+
struct_table './tr' do
|
272
|
+
text_title './td[1]'
|
273
|
+
text_pub_date './td[2]'
|
274
|
+
end
|
275
|
+
end
|
276
|
+
|
277
|
+
node.inject(agent, page)
|
278
|
+
|
279
|
+
#=> [ { "table" => [ { "title" => "The Perfect Insider",
|
280
|
+
# "pub_date" => "1996/4/5" },
|
281
|
+
# { "title" => "Doctors in Isolated Room",
|
282
|
+
# "pub_date" => "1996/7/5" },
|
283
|
+
# { "title" => "Mathematical Goodbye",
|
284
|
+
# "pub_date" => "1996/9/5" }]},
|
285
|
+
# { "table" => [ { "title" => "Jack the Poetical Private",
|
286
|
+
# "pub_date" => "1997/1/5" },
|
287
|
+
# { "title" => "Who Inside",
|
288
|
+
# "pub_date" => "1997/4/5" },
|
289
|
+
# { "title" => "Illusion Acts Like Magic",
|
290
|
+
# "pub_date" => "1997/10/5" }]},
|
291
|
+
# { "table" => [ { "title" => "Replaceable Summer",
|
292
|
+
# "pub_date" => "1998/1/7" },
|
293
|
+
# { "title" => "Switch Back",
|
294
|
+
# "pub_date" => "1998/4/5" },
|
295
|
+
# { "title" => "Numerical Models",
|
296
|
+
# "pub_date" => "1998/7/5" },
|
297
|
+
# { "title" => "The Perfect Outsider",
|
298
|
+
# "pub_date" => "1998/10/5" }]}
|
299
|
+
# ]
|
300
|
+
```
|
301
|
+
|
302
|
+
### Options
|
303
|
+
None.
|
304
|
+
|
305
|
+
## Links Node
|
306
|
+
Links Node parses each linked page and returns the results.
|
307
|
+
|
308
|
+
### Example
|
309
|
+
```
|
310
|
+
<!-- http://yasuri.example.net -->
|
311
|
+
<html>
|
312
|
+
<head><title>Yasuri Test</title></head>
|
313
|
+
<body>
|
314
|
+
<p>Hello,Yasuri</p>
|
315
|
+
<a href="./child01.html">child01</a>
|
316
|
+
<a href="./child02.html">child02</a>
|
317
|
+
<a href="./child03.html">child03</a>
|
318
|
+
</body>
|
319
|
+
</html>
|
320
|
+
```
|
321
|
+
|
322
|
+
```
|
323
|
+
<!-- http://yasuri.example.net/child01.html -->
|
324
|
+
<html>
|
325
|
+
<head><title>Child 01 Test</title></head>
|
326
|
+
<body>
|
327
|
+
<p>Child 01 page.</p>
|
328
|
+
<ul>
|
329
|
+
<li><a href="./child01_sub.html">Child01_Sub</a></li>
|
330
|
+
<li><a href="./child02_sub.html">Child02_Sub</a></li>
|
331
|
+
</ul>
|
332
|
+
</body>
|
333
|
+
</html>
|
334
|
+
```
|
335
|
+
|
336
|
+
```
|
337
|
+
<!-- http://yasuri.example.net/child02.html -->
|
338
|
+
<html>
|
339
|
+
<head><title>Child 02 Test</title></head>
|
340
|
+
<body>
|
341
|
+
<p>Child 02 page.</p>
|
342
|
+
</body>
|
343
|
+
</html>
|
344
|
+
```
|
345
|
+
|
346
|
+
```
|
347
|
+
<!-- http://yasuri.example.net/child03.html -->
|
348
|
+
<html>
|
349
|
+
<head><title>Child 03 Test</title></head>
|
350
|
+
<body>
|
351
|
+
<p>Child 03 page.</p>
|
352
|
+
<ul>
|
353
|
+
<li><a href="./child03_sub.html">Child03_Sub</a></li>
|
354
|
+
</ul>
|
355
|
+
</body>
|
356
|
+
</html>
|
357
|
+
```
|
358
|
+
|
359
|
+
```
|
360
|
+
agent = Mechanize.new
|
361
|
+
page = agent.get("http://yasuri.example.net")
|
362
|
+
|
363
|
+
node = Yasuri.links_title '/html/body/a' do
|
364
|
+
text_content '/html/body/p'
|
365
|
+
end
|
366
|
+
|
367
|
+
node.inject(agent, page)
|
368
|
+
#=> [ {"content" => "Child 01 page."},
|
369
|
+
#     {"content" => "Child 02 page."},
|
370
|
+
#     {"content" => "Child 03 page."}]
|
371
|
+
```
|
372
|
+
|
373
|
+
First, the LinksNode looks for all links matching `Path` in the first page.
|
374
|
+
In this example, the LinksNode looks for all tags matching `/html/body/a` in `http://yasuri.example.net`.
|
375
|
+
Next, it opens the pages specified by the href attributes of the tags it found. (`./child01.html`, `./child02.html`, `./child03.html`)
|
376
|
+
|
377
|
+
The child nodes then parse each opened page. The LinksNode returns the parse results for each page as an Array of Hashes.
|
378
|
+
|
379
|
+
## Paginate Node
|
380
|
+
PaginateNode parses, in order, each page that can be reached through pagination.
|
381
|
+
|
382
|
+
### Example
|
383
|
+
In this example, assume the target page `page01.html` looks like this.
|
384
|
+
`page02.html` through `page04.html` are similar.
|
385
|
+
|
386
|
+
```html
|
387
|
+
<!-- http://yasuri.example.net/page01.html -->
|
388
|
+
<html>
|
389
|
+
<head><title>Page01</title></head>
|
390
|
+
<body>
|
391
|
+
<p>Pagination01</p>
|
392
|
+
|
393
|
+
<nav class='pagination'>
|
394
|
+
<span class='prev'> « PreviousPage </span>
|
395
|
+
<span class='page'> 1 </span>
|
396
|
+
<span class='page'> <a href="./page02.html">2</a> </span>
|
397
|
+
<span class='page'> <a href="./page03.html">3</a> </span>
|
398
|
+
<span class='page'> <a href="./page04.html">4</a> </span>
|
399
|
+
<span class='next'> <a href="./page02.html" class="next" rel="next">NextPage »</a> </span>
|
400
|
+
</nav>
|
401
|
+
|
402
|
+
</body>
|
403
|
+
</html>
|
404
|
+
```
|
405
|
+
|
406
|
+
```ruby
|
407
|
+
agent = Mechanize.new
|
408
|
+
page = agent.get("http://yasuri.example.net/page01.html")
|
409
|
+
|
410
|
+
node = Yasuri.pages_root "/html/body/nav/span/a[@class='next']" , limit:3 do
|
411
|
+
text_content '/html/body/p'
|
412
|
+
end
|
413
|
+
|
414
|
+
node.inject(agent, page)
|
415
|
+
#=> [ {"content" => "Pagination01"},
|
416
|
+
#     {"content" => "Pagination02"},
|
417
|
+
#     {"content" => "Pagination03"}]
|
418
|
+
```
|
419
|
+
PaginateNode requires a link pointing to the next page to be specified as `Path`.
|
420
|
+
In this example, `NextPage` (`/html/body/nav/span/a[@class='next']`) is the link pointing to the next page.
|
421
|
+
|
422
|
+
### Options
|
423
|
+
##### `limit`
|
424
|
+
Specifies the upper limit on the number of pages to follow.
|
425
|
+
|
426
|
+
```ruby
|
427
|
+
node = Yasuri.pages_root "/html/body/nav/span/a[@class='next']" , limit:2 do
|
428
|
+
text_content '/html/body/p'
|
429
|
+
end
|
430
|
+
node.inject(agent, page)
|
431
|
+
#=> [ {"content" => "Pagination01"}, {"content" => "Pagination02"}]
|
432
|
+
```
|
433
|
+
In this case, the PaginateNode opens and parses at most 2 pages. The pagination appears to have 4 pages, but because `limit:2` is specified, the result Array contains only 2 results.
|
data/USAGE.md
ADDED
@@ -0,0 +1,431 @@
|
|
1
|
+
# Yasuri Usage
|
2
|
+
|
3
|
+
## What is Yasuri
|
4
|
+
`Yasuri` is an easy web-scraping library for supporting "Mechanize".
|
5
|
+
|
6
|
+
Yasuri (鑢) is an easy web-scraping library for supporting "[Mechanize](https://github.com/sparklemotion/mechanize)".
|
7
|
+
|
8
|
+
Yasuri simplifies the tasks that come up frequently in scraping.
|
9
|
+
|
10
|
+
For example,
|
11
|
+
|
12
|
+
+ Open each link in a page, scrape the linked pages, and get the results as a Hash.
|
13
|
+
+ Scrape several texts in a page and name each result in a Hash.
|
14
|
+
+ Scrape each table that appears repeatedly in a page and get the results as an Array.
|
15
|
+
+ Scrape only the first 3 of the pages provided by pagination.
|
16
|
+
|
17
|
+
Yasuri makes these easy to implement.
|
18
|
+
|
19
|
+
## Quick Start
|
20
|
+
|
21
|
+
```
|
22
|
+
$ gem install yasuri
|
23
|
+
```
|
24
|
+
|
25
|
+
```ruby
|
26
|
+
require 'yasuri'
|
27
|
+
require 'mechanize'
|
28
|
+
|
29
|
+
# Node tree constructing by DSL
|
30
|
+
root = Yasuri.links_root '//*[@id="menu"]/ul/li/a' do
|
31
|
+
text_title '//*[@id="contents"]/h2'
|
32
|
+
text_content '//*[@id="contents"]/p[1]'
|
33
|
+
end
|
34
|
+
|
35
|
+
agent = Mechanize.new
|
36
|
+
root_page = agent.get("http://some.scraping.page.net/")
|
37
|
+
|
38
|
+
result = root.inject(agent, root_page)
|
39
|
+
# => [ {"title" => "PageTitle1", "content" => "Page Contents1" },
|
40
|
+
# {"title" => "PageTitle2", "content" => "Page Contents2" }, ... ]
|
41
|
+
|
42
|
+
```
|
43
|
+
This example opens the page behind each link matched by the xpath of the LinkNode (`links_root`) and scrapes the two texts matched by the xpaths of the TextNodes (`text_title`, `text_content`).
|
44
|
+
|
45
|
+
(i.e. open each link matched by `//*[@id="menu"]/ul/li/a`, then scrape `//*[@id="contents"]/h2` and `//*[@id="contents"]/p[1]`.)
|
46
|
+
|
47
|
+
## Basics
|
48
|
+
|
49
|
+
1. Construct a parse tree.
|
50
|
+
2. Start parsing with a Mechanize agent and the first page.
|
51
|
+
|
52
|
+
### Construct parse tree
|
53
|
+
|
54
|
+
```ruby
|
55
|
+
require 'mechanize'
|
56
|
+
require 'yasuri'
|
57
|
+
|
58
|
+
|
59
|
+
# 1. Construct parse tree.
|
60
|
+
tree = Yasuri.links_title '/html/body/a' do
|
61
|
+
text_name '/html/body/p'
|
62
|
+
end
|
63
|
+
|
64
|
+
# 2. Start parse with Mechanize agent and first page.
|
65
|
+
agent = Mechanize.new
|
66
|
+
page = agent.get(uri)
|
67
|
+
|
68
|
+
|
69
|
+
tree.inject(agent, page)
|
70
|
+
```
|
71
|
+
|
72
|
+
A tree can be defined in 2 (+1) ways: with the DSL or with JSON (or with plain Ruby code). The example above uses the DSL.
|
73
|
+
|
74
|
+
```ruby
|
75
|
+
# Construct by json.
|
76
|
+
src = <<-EOJSON
|
77
|
+
{ "node" : "links",
|
78
|
+
"name" : "title",
|
79
|
+
"path" : "/html/body/a",
|
80
|
+
"children" : [
|
81
|
+
{ "node" : "text",
|
82
|
+
"name" : "name",
|
83
|
+
"path" : "/html/body/p"
|
84
|
+
}
|
85
|
+
]
|
86
|
+
}
|
87
|
+
EOJSON
|
88
|
+
tree = Yasuri.json2tree(src)
|
89
|
+
```
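To make the JSON form more concrete, here is a minimal sketch of how such a definition could be split into its fields using only Ruby's standard `json` library. The key list mirrors the `ReservedKeys` constant that appears in `lib/yasuri/yasuri.rb` in this diff. The helper name `split_node_definition` and the treatment of leftover keys as options are assumptions made for illustration, not Yasuri's actual implementation.

```ruby
require 'json'

# Same key list as the ReservedKeys constant in lib/yasuri/yasuri.rb.
RESERVED_KEYS = [:node, :name, :path, :children]

# Hypothetical helper (not part of Yasuri): split a JSON node definition
# into its reserved fields and whatever keys are left over.
def split_node_definition(json_src)
  node_h = JSON.parse(json_src, symbolize_names: true)
  node, name, path, children = RESERVED_KEYS.map { |key| node_h[key] }
  # Assumption: non-reserved keys (e.g. "truncate") act as node options.
  options = node_h.reject { |key, _| RESERVED_KEYS.include?(key) }
  { node: node, name: name, path: path, children: children || [], options: options }
end

src = '{ "node": "text", "name": "content", "path": "/html/body/p[1]", "truncate": "^[^,]+" }'
p split_node_definition(src)
```

In practice you would pass the JSON string straight to `Yasuri.json2tree`; the sketch only shows which parts of the definition correspond to `Type`, `Name`, `Path`, and `Children`.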
|
90
|
+
|
91
|
+
### Node
|
92
|
+
A tree is built from nested Nodes.
|
93
|
+
A Node has `Type`, `Name`, `Path`, `Children`, and `Options`.
|
94
|
+
|
95
|
+
A Node is defined in the following format.
|
96
|
+
|
97
|
+
|
98
|
+
```ruby
|
99
|
+
# Top Level
|
100
|
+
Yasuri.<Type>_<Name> <Path> [,<Options>]
|
101
|
+
|
102
|
+
# Nested
|
103
|
+
Yasuri.<Type>_<Name> <Path> [,<Options>] do
|
104
|
+
<Type>_<Name> <Path> [,<Options>] do
|
105
|
+
<Children>
|
106
|
+
end
|
107
|
+
end
|
108
|
+
```
|
109
|
+
|
110
|
+
#### Type
|
111
|
+
Type determines the behavior of a Node.
|
112
|
+
|
113
|
+
- *Text*
|
114
|
+
- *Struct*
|
115
|
+
- *Links*
|
116
|
+
- *Paginate*
|
117
|
+
|
118
|
+
### Name
|
119
|
+
Name is used as the key in the returned Hash.
|
120
|
+
|
121
|
+
### Path
|
122
|
+
Path selects the target node by xpath or css selector. It is passed to Mechanize's `search`.
|
123
|
+
|
124
|
+
### Children
|
125
|
+
Child nodes. A TextNode always has an empty set of children, because it is a leaf.
|
126
|
+
|
127
|
+
### Options
|
128
|
+
Parse options. They differ for each Type. You can get a node's options and their values with the `opt` method.
|
129
|
+
|
130
|
+
```ruby
|
131
|
+
# TextNode example
|
132
|
+
node = Yasuri.text_title '/html/body/p[1]', truncate:/^[^,]+/
|
133
|
+
node.opt #=> {:truncate => /^[^,]+/, :proc => nil}
|
134
|
+
```
|
135
|
+
|
136
|
+
## Text Node
|
137
|
+
TextNode returns the scraped text. This node must be a leaf.
|
138
|
+
|
139
|
+
### Example
|
140
|
+
|
141
|
+
```html
|
142
|
+
<!-- http://yasuri.example.net -->
|
143
|
+
<html>
|
144
|
+
<head></head>
|
145
|
+
<body>
|
146
|
+
<p>Hello,World</p>
|
147
|
+
<p>Hello,Yasuri</p>
|
148
|
+
</body>
|
149
|
+
</html>
|
150
|
+
```
|
151
|
+
|
152
|
+
```ruby
|
153
|
+
agent = Mechanize.new
|
154
|
+
page = agent.get("http://yasuri.example.net")
|
155
|
+
|
156
|
+
p1 = Yasuri.text_title '/html/body/p[1]'
|
157
|
+
p1t = Yasuri.text_title '/html/body/p[1]', truncate:/^[^,]+/
|
158
|
+
p2u = Yasuri.text_title '/html/body/p[2]', proc: :upcase
|
159
|
+
|
160
|
+
p1.inject(agent, page) #=> { "title" => "Hello,World" }
|
161
|
+
p1t.inject(agent, page) #=> { "title" => "Hello" }
|
162
|
+
p2u.inject(agent, page) #=> { "title" => "HELLO,YASURI" }
|
163
|
+
```
|
164
|
+
|
165
|
+
### Options
|
166
|
+
##### `truncate`
|
167
|
+
Matches the text against a regexp and keeps only the matched part. If the regexp contains a group, only the first matched group is returned.
|
168
|
+
|
169
|
+
```ruby
|
170
|
+
node = Yasuri.text_example '/html/body/p[1]', truncate:/H(.+)i/
|
171
|
+
node.inject(agent, index_page)
|
172
|
+
#=> { "example" => "ello,Yasur" }
|
173
|
+
```
|
174
|
+
|
175
|
+
|
176
|
+
##### `proc`
|
177
|
+
Applies a method to the text. The method is given as a Symbol.
|
178
|
+
If the `truncate` option is also given, the method is applied after truncation.
|
179
|
+
|
180
|
+
```ruby
|
181
|
+
node = Yasuri.text_example '/html/body/p[1]', proc: :upcase, truncate:/H(.+)i/
|
182
|
+
node.inject(agent, index_page)
|
183
|
+
#=> { "example" => "ELLO,YASUR" }
|
184
|
+
```
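The interaction of `truncate` and `proc` described above can be sketched on a plain String, without Mechanize. This is an illustration only: `apply_text_options` is a hypothetical helper, not part of Yasuri, and returning the text unchanged when the regexp does not match is an assumption.

```ruby
# Hypothetical helper, not part of Yasuri's API: mimics the documented
# behaviour of the `truncate` and `proc` options on a plain String.
def apply_text_options(text, opts = {})
  regexp = opts[:truncate]
  if regexp
    match = Regexp.new(regexp).match(text)
    # Assumption: leave the text unchanged when nothing matches.
    # With a group, keep the first group; otherwise keep the whole match.
    text = match ? (match[1] || match[0]) : text
  end

  method_name = opts[:proc]
  # `proc` runs after `truncate`, so it sees the shortened text.
  text = text.send(method_name) if method_name
  text
end

apply_text_options("Hello,Yasuri", truncate: /H(.+)i/)                 #=> "ello,Yasur"
apply_text_options("Hello,Yasuri", truncate: /H(.+)i/, proc: :upcase)  #=> "ELLO,YASUR"
```

Applying `truncate` first keeps the regexp independent of whatever transformation `proc` performs, which matches the order stated in the text.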
|
185
|
+
|
186
|
+
## Struct Node
|
187
|
+
Struct Node returns structured text.
|
188
|
+
|
189
|
+
First, a Struct Node narrows down the sub-tags by `Path`. Its child nodes parse the narrowed-down tags, and the Struct Node returns a Hash containing the parsed results.
|
190
|
+
|
191
|
+
If the Struct Node's `Path` matches multiple sub-tags, the child nodes parse each of them and the Struct Node returns an Array.
|
192
|
+
|
193
|
+
### Example
|
194
|
+
|
195
|
+
```html
|
196
|
+
<!-- http://yasuri.example.net -->
|
197
|
+
<html>
|
198
|
+
<head>
|
199
|
+
<title>Books</title>
|
200
|
+
</head>
|
201
|
+
<body>
|
202
|
+
<h1>1996</h1>
|
203
|
+
<table>
|
204
|
+
<thead>
|
205
|
+
<tr><th>Title</th> <th>Publication Date</th></tr>
|
206
|
+
</thead>
|
207
|
+
<tr><td>The Perfect Insider</td> <td>1996/4/5</td></tr>
|
208
|
+
<tr><td>Doctors in Isolated Room</td> <td>1996/7/5</td></tr>
|
209
|
+
<tr><td>Mathematical Goodbye</td> <td>1996/9/5</td></tr>
|
210
|
+
</table>
|
211
|
+
|
212
|
+
<h1>1997</h1>
|
213
|
+
<table>
|
214
|
+
<thead>
|
215
|
+
<tr><th>Title</th> <th>Publication Date</th></tr>
|
216
|
+
</thead>
|
217
|
+
<tr><td>Jack the Poetical Private</td> <td>1997/1/5</td></tr>
|
218
|
+
<tr><td>Who Inside</td> <td>1997/4/5</td></tr>
|
219
|
+
<tr><td>Illusion Acts Like Magic</td> <td>1997/10/5</td></tr>
|
220
|
+
</table>
|
221
|
+
|
222
|
+
<h1>1998</h1>
|
223
|
+
<table>
|
224
|
+
<thead>
|
225
|
+
<tr><th>Title</th> <th>Publication Date</th></tr>
|
226
|
+
</thead>
|
227
|
+
<tr><td>Replaceable Summer</td> <td>1998/1/7</td></tr>
|
228
|
+
<tr><td>Switch Back</td> <td>1998/4/5</td></tr>
|
229
|
+
<tr><td>Numerical Models</td> <td>1998/7/5</td></tr>
|
230
|
+
<tr><td>The Perfect Outsider</td> <td>1998/10/5</td></tr>
|
231
|
+
</table>
|
232
|
+
</body>
|
233
|
+
</html>
|
234
|
+
```
|
235
|
+
|
236
|
+
```ruby
|
237
|
+
agent = Mechanize.new
|
238
|
+
page = agent.get("http://yasuri.example.net")
|
239
|
+
|
240
|
+
node = Yasuri.struct_table '/html/body/table[1]/tr' do
|
241
|
+
text_title './td[1]'
|
242
|
+
text_pub_date './td[2]'
|
243
|
+
end
|
244
|
+
|
245
|
+
node.inject(agent, page)
|
246
|
+
#=> [ { "title" => "The Perfect Insider",
|
247
|
+
# "pub_date" => "1996/4/5" },
|
248
|
+
# { "title" => "Doctors in Isolated Room",
|
249
|
+
# "pub_date" => "1996/7/5" },
|
250
|
+
# { "title" => "Mathematical Goodbye",
|
251
|
+
# "pub_date" => "1996/9/5" }, ]
|
252
|
+
```
|
253
|
+
|
254
|
+
Struct Node narrows down to the `<tr>` tags in the first `<table>` with `'/html/body/table[1]/tr'`. Then,
|
255
|
+
the two child Text Nodes of the Struct Node parse each `<tr>` tag.
|
256
|
+
|
257
|
+
In this case, the first `<table>` contains three matching `<tr>` tags (not four, because `<thead><tr>` does not match `Path`), so the Struct Node returns three Hashes. Each Hash contains the text parsed by the Text Nodes.
|
258
|
+
|
259
|
+
A Struct Node's children are not limited to Text Nodes.
|
260
|
+
|
261
|
+
### Example
|
262
|
+
|
263
|
+
```ruby
|
264
|
+
agent = Mechanize.new
|
265
|
+
page = agent.get("http://yasuri.example.net")
|
266
|
+
|
267
|
+
node = Yasuri.struct_tables '/html/body/table' do
|
268
|
+
struct_table './tr' do
|
269
|
+
text_title './td[1]'
|
270
|
+
text_pub_date './td[2]'
|
271
|
+
end
|
272
|
+
end
|
273
|
+
|
274
|
+
node.inject(agent, page)
|
275
|
+
|
276
|
+
#=> [ { "table" => [ { "title" => "The Perfect Insider",
|
277
|
+
# "pub_date" => "1996/4/5" },
|
278
|
+
# { "title" => "Doctors in Isolated Room",
|
279
|
+
# "pub_date" => "1996/7/5" },
|
280
|
+
# { "title" => "Mathematical Goodbye",
|
281
|
+
# "pub_date" => "1996/9/5" }]},
|
282
|
+
# { "table" => [ { "title" => "Jack the Poetical Private",
|
283
|
+
# "pub_date" => "1997/1/5" },
|
284
|
+
# { "title" => "Who Inside",
|
285
|
+
# "pub_date" => "1997/4/5" },
|
286
|
+
# { "title" => "Illusion Acts Like Magic",
|
287
|
+
# "pub_date" => "1997/10/5" }]},
|
288
|
+
# { "table" => [ { "title" => "Replaceable Summer",
|
289
|
+
# "pub_date" => "1998/1/7" },
|
290
|
+
# { "title" => "Switch Back",
|
291
|
+
# "pub_date" => "1998/4/5" },
|
292
|
+
# { "title" => "Numerical Models",
|
293
|
+
# "pub_date" => "1998/7/5" },
|
294
|
+
# { "title" => "The Perfect Outsider",
|
295
|
+
# "pub_date" => "1998/10/5" }]}
|
296
|
+
# ]
|
297
|
+
```
|
298
|
+
|
299
|
+
### Options
|
300
|
+
None.
|
301
|
+
|
302
|
+
## Links Node
|
303
|
+
Links Node returns the parsed text of each linked page.
|
304
|
+
|
305
|
+
### Example
|
306
|
+
```html
|
307
|
+
<!-- http://yasuri.example.net -->
|
308
|
+
<html>
|
309
|
+
<head><title>Yasuri Test</title></head>
|
310
|
+
<body>
|
311
|
+
<p>Hello,Yasuri</p>
|
312
|
+
<a href="./child01.html">child01</a>
|
313
|
+
<a href="./child02.html">child02</a>
|
314
|
+
<a href="./child03.html">child03</a>
|
315
|
+
</body>
|
316
|
+
</html>
|
317
|
+
```
|
318
|
+
|
319
|
+
```html
|
320
|
+
<!-- http://yasuri.example.net/child01.html -->
|
321
|
+
<html>
|
322
|
+
<head><title>Child 01 Test</title></head>
|
323
|
+
<body>
|
324
|
+
<p>Child 01 page.</p>
|
325
|
+
<ul>
|
326
|
+
<li><a href="./child01_sub.html">Child01_Sub</a></li>
|
327
|
+
<li><a href="./child02_sub.html">Child02_Sub</a></li>
|
328
|
+
</ul>
|
329
|
+
</body>
|
330
|
+
</html>
|
331
|
+
```
|
332
|
+
|
333
|
+
```html
|
334
|
+
<!-- http://yasuri.example.net/child02.html -->
|
335
|
+
<html>
|
336
|
+
<head><title>Child 02 Test</title></head>
|
337
|
+
<body>
|
338
|
+
<p>Child 02 page.</p>
|
339
|
+
</body>
|
340
|
+
</html>
|
341
|
+
```
|
342
|
+
|
343
|
+
```html
|
344
|
+
<!-- http://yasuri.example.net/child03.html -->
|
345
|
+
<html>
|
346
|
+
<head><title>Child 03 Test</title></head>
|
347
|
+
<body>
|
348
|
+
<p>Child 03 page.</p>
|
349
|
+
<ul>
|
350
|
+
<li><a href="./child03_sub.html">Child03_Sub</a></li>
|
351
|
+
</ul>
|
352
|
+
</body>
|
353
|
+
</html>
|
354
|
+
```
|
355
|
+
|
356
|
+
```ruby
|
357
|
+
agent = Mechanize.new
|
358
|
+
page = agent.get("http://yasuri.example.net")
|
359
|
+
|
360
|
+
node = Yasuri.links_title '/html/body/a' do
|
361
|
+
text_content '/html/body/p'
|
362
|
+
end
|
363
|
+
|
364
|
+
node.inject(agent, page)
|
365
|
+
#=> [ {"content" => "Child 01 page."},
|
366
|
+
#     {"content" => "Child 02 page."},
|
367
|
+
#     {"content" => "Child 03 page."}]
|
368
|
+
```
|
369
|
+
|
370
|
+
First, the Links Node finds all links in the page that match `Path`. In this case, it finds the `/html/body/a` tags in `http://yasuri.example.net`. Then it opens the pages given by their href attributes (`./child01.html`, `./child02.html` and `./child03.html`).
|
371
|
+
|
372
|
+
Then the Links Node applies its child nodes to each opened page and returns the results for each page as an Array.
|
373
|
+
|
374
|
+
### Options
|
375
|
+
None.
|
376
|
+
|
377
|
+
## Paginate Node
|
378
|
+
Paginate Node parses each page provided by pagination and returns the results.
|
379
|
+
|
380
|
+
### Example
|
381
|
+
The target page `page01.html` looks like this. `page02.html` through `page04.html` are similar.
|
382
|
+
|
383
|
+
```html
|
384
|
+
<!-- http://yasuri.example.net/page01.html -->
|
385
|
+
<html>
|
386
|
+
<head><title>Page01</title></head>
|
387
|
+
<body>
|
388
|
+
<p>Pagination01</p>
|
389
|
+
|
390
|
+
<nav class='pagination'>
|
391
|
+
<span class='prev'> PreviousPage </span>
|
392
|
+
<span class='page'> 1 </span>
|
393
|
+
<span class='page'> <a href="./page02.html">2</a> </span>
|
394
|
+
<span class='page'> <a href="./page03.html">3</a> </span>
|
395
|
+
<span class='page'> <a href="./page04.html">4</a> </span>
|
396
|
+
<span class='next'> <a href="./page02.html" class="next" rel="next"> NextPage </a> </span>
|
397
|
+
</nav>
|
398
|
+
|
399
|
+
</body>
|
400
|
+
</html>
|
401
|
+
```
|
402
|
+
|
403
|
+
```ruby
|
404
|
+
agent = Mechanize.new
|
405
|
+
page = agent.get("http://yasuri.example.net/page01.html")
|
406
|
+
|
407
|
+
node = Yasuri.pages_root "/html/body/nav/span/a[@class='next']" do
|
408
|
+
text_content '/html/body/p'
|
409
|
+
end
|
410
|
+
|
411
|
+
node.inject(agent, page)
|
412
|
+
#=> [ {"content" => "Pagination01"},
|
413
|
+
#     {"content" => "Pagination02"},
|
414
|
+
#     {"content" => "Pagination03"},
|
415
|
+
#     {"content" => "Pagination04"}]
|
416
|
+
```
|
417
|
+
|
418
|
+
Paginate Node requires a link to the next page. In this case, it is the `NextPage` link, `/html/body/nav/span/a[@class='next']`.
|
419
|
+
|
420
|
+
### Options
|
421
|
+
##### `limit`
|
422
|
+
The upper limit on the number of pages opened during pagination.
|
423
|
+
|
424
|
+
```ruby
|
425
|
+
node = Yasuri.pages_root "/html/body/nav/span/a[@class='next']" , limit:2 do
|
426
|
+
text_content '/html/body/p'
|
427
|
+
end
|
428
|
+
node.inject(agent, page)
|
429
|
+
#=> [ {"content" => "Pagination01"}, {"content" => "Pagination02"}]
|
430
|
+
```
|
431
|
+
Paginate Node opens at most the number of pages given by `limit`, here 2. The pagination has 4 pages, but the result Array has only 2 texts because `limit:2` was given.
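The effect of `limit` can be sketched with an in-memory stand-in for the four pages, so that no network access is needed. Everything here is hypothetical: the `PAGES` table and the `paginate` method only imitate the documented behaviour of following the "next" link, and are not how Yasuri is implemented.

```ruby
# Hypothetical in-memory "site": each page has some text and the key of
# the page its "next" link points to (nil on the last page).
PAGES = {
  "page01" => { text: "Pagination01", next: "page02" },
  "page02" => { text: "Pagination02", next: "page03" },
  "page03" => { text: "Pagination03", next: "page04" },
  "page04" => { text: "Pagination04", next: nil }
}

# Follow the "next" links from `start`, collecting one result per page,
# and stop once `limit` pages have been opened (no limit when nil).
def paginate(pages, start, limit = nil)
  results = []
  current = start
  while current && (limit.nil? || results.size < limit)
    page = pages.fetch(current)
    results << { "content" => page[:text] }
    current = page[:next]
  end
  results
end

paginate(PAGES, "page01", 2)
#=> [{"content" => "Pagination01"}, {"content" => "Pagination02"}]
```

Without a limit the loop ends when a page has no "next" link, which corresponds to the four-element result in the earlier example.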
|
data/lib/yasuri/version.rb
CHANGED
data/lib/yasuri/yasuri.rb
CHANGED
@@ -37,7 +37,7 @@ module Yasuri
|
|
37
37
|
}
|
38
38
|
Node2Text = Text2Node.invert
|
39
39
|
|
40
|
-
ReservedKeys =
|
40
|
+
ReservedKeys = [:node, :name, :path, :children]
|
41
41
|
def self.hash2node(node_h)
|
42
42
|
node, name, path, children = ReservedKeys.map do |key|
|
43
43
|
node_h[key]
|
@@ -78,7 +78,8 @@ module Yasuri
|
|
78
78
|
json
|
79
79
|
end
|
80
80
|
|
81
|
-
def self.NodeName(name,
|
81
|
+
def self.NodeName(name, hash = {})
|
82
|
+
symbolize_names = hash[:symbolize_names] || false
|
82
83
|
symbolize_names ? name.to_sym : name
|
83
84
|
end
|
84
85
|
|
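The added lines in this hunk read `symbolize_names` from a trailing options Hash with a default of `false`. A standalone sketch of that pattern follows, using a hypothetical module name `OptionsSketch`. The likely motivation is compatibility with Ruby 1.9, which has no keyword arguments (they arrived in Ruby 2.0); that is an inference from the README note about Ruby 1.9.3, not something the diff states.

```ruby
# Hypothetical module for illustration; the method body mirrors the
# added lines of Yasuri.NodeName shown in the hunk above.
module OptionsSketch
  def self.node_name(name, hash = {})
    # Missing key => nil => falls back to false.
    symbolize_names = hash[:symbolize_names] || false
    symbolize_names ? name.to_sym : name
  end
end

OptionsSketch.node_name("title")                         #=> "title"
OptionsSketch.node_name("title", symbolize_names: true)  #=> :title
```

Call sites look the same as with keyword arguments, because Ruby collects trailing `key: value` pairs into a Hash for the last positional parameter.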
data/lib/yasuri/yasuri_node.rb
CHANGED
@@ -7,9 +7,9 @@ module Yasuri
|
|
7
7
|
class PaginateNode
|
8
8
|
include Node
|
9
9
|
|
10
|
-
def initialize(xpath, name, children = [],
|
10
|
+
def initialize(xpath, name, children = [], hash = {})
|
11
11
|
super(xpath, name, children)
|
12
|
-
@limit = limit
|
12
|
+
@limit = hash[:limit]
|
13
13
|
end
|
14
14
|
|
15
15
|
def inject(agent, page, opt = {})
|
@@ -7,9 +7,12 @@ module Yasuri
    class TextNode
      include Node

-     def initialize(xpath, name, children = [],
+     def initialize(xpath, name, children = [], hash = {})
        super(xpath, name, children)

+       truncate = hash[:truncate]
+       proc = hash[:proc]
+
        truncate = Regexp.new(truncate) if not truncate.nil? # regexp or nil
        @truncate = truncate
        @truncate = Regexp.new(@truncate.to_s) if not @truncate.nil?
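The `Regexp.new` calls in this hunk accept either a String or a Regexp, so `truncate` can be given in either form. A minimal sketch of that normalization, written outside of Yasuri: the helper name `to_regexp` and the sample text are assumptions, and applying the pattern with `String#[]` is only one plausible use, not necessarily what `TextNode` does.

```ruby
# Hypothetical helper mirroring the "regexp or nil" normalization:
# nil stays nil, a String or Regexp becomes a Regexp.
def to_regexp(truncate)
  truncate.nil? ? nil : Regexp.new(truncate)
end

from_string = to_regexp("^[^,]+")  # pattern given as a String (e.g. from JSON)
from_regexp = to_regexp(/^[^,]+/)  # pattern given as a Regexp literal
no_pattern  = to_regexp(nil)       # option omitted

sample = "Hello,World"             # made-up text
puts sample[from_string]           # prints Hello
```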
data/spec/spec_helper.rb
CHANGED
@@ -14,8 +14,8 @@ end


  # ENV['CODECLIMATE_REPO_TOKEN'] = "0dc78d33107a7f11f257c0218ac1a37e0073005bb9734f2fd61d0f7e803fc151"
- require "codeclimate-test-reporter"
- CodeClimate::TestReporter.start
+ # require "codeclimate-test-reporter"
+ # CodeClimate::TestReporter.start

  require 'simplecov'
  require 'coveralls'
@@ -27,6 +27,7 @@ SimpleCov.formatter = SimpleCov::Formatter::MultiFormatter[
  ]
  SimpleCov.start

+
  $LOAD_PATH.unshift File.expand_path('../../lib', __FILE__)
  require 'yasuri'
data/spec/yasuri_spec.rb
CHANGED
@@ -18,7 +18,7 @@ describe 'Yasuri' do
    #############
    describe '.json2tree' do
      it "fail if empty json" do
-       expect { Yasuri.json2tree("{}") }.to raise_error
+       expect { Yasuri.json2tree("{}") }.to raise_error(RuntimeError)
      end

      it "return TextNode" do
@@ -39,7 +39,7 @@ describe 'Yasuri' do
                  "truncate" : "^[^,]+"
                }|
        generated = Yasuri.json2tree(src)
-       original = Yasuri::TextNode.new('/html/body/p[1]', "content", truncate:/^[^,]+/)
+       original = Yasuri::TextNode.new('/html/body/p[1]', "content", {}, truncate:/^[^,]+/)
        compare_generated_vs_original(generated, original, @index_page)
      end
@@ -153,7 +153,7 @@ describe 'Yasuri' do
      end

      it "return text node with truncate_regexp" do
-       node = Yasuri::TextNode.new("/html/head/title", "title", truncate:/^[^,]+/)
+       node = Yasuri::TextNode.new("/html/head/title", "title", {}, truncate:/^[^,]+/)
        json = Yasuri.tree2json(node)
        expected_str = %q| { "node": "text",
                             "name": "title",
@@ -81,7 +81,7 @@ describe 'Yasuri' do
        node = Yasuri::StructNode.new(invalid_xpath, "table", [
          Yasuri::TextNode.new('./td[1]', "title")
        ])
-       expect { node.inject(@agent, @page) }.to raise_error
+       expect { node.inject(@agent, @page) }.to raise_error(Nokogiri::XML::XPath::SyntaxError)
      end

      it 'fail with invalid xpath in children' do
@@ -90,7 +90,7 @@ describe 'Yasuri' do
          Yasuri::TextNode.new(invalid_xpath, "title"),
          Yasuri::TextNode.new('./td[2]', "pub_date"),
        ])
-       expect { node.inject(@agent, @page) }.to raise_error
+       expect { node.inject(@agent, @page) }.to raise_error(Nokogiri::XML::XPath::SyntaxError)
      end

      it 'scrape all tables' do
@@ -126,7 +126,7 @@ describe 'Yasuri' do
          Yasuri::TextNode.new('./td[1]', "title"),
          Yasuri::TextNode.new('./td[2]', "pub_date"),
        ])
-       expected = @table_1996.map{|h| h.map{|k,v| [k.to_sym, v] }
+       expected = @table_1996.map{|h| Hash[h.map{|k,v| [k.to_sym, v] }] }
        actual = node.inject(@agent, @page, symbolize_names:true)
        expect(actual).to match expected
      end
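The added line wraps the mapped key/value pairs in `Hash[...]`. A small sketch of what that expression produces, using a made-up row: the keys mirror the spec, while the values and the variable names are invented for illustration.

```ruby
# Made-up stand-in for one row of @table_1996 (String keys).
rows = [{ "title" => "Sample Title", "pub_date" => "1996-01-01" }]

# Hash#map yields [key, value] pairs; Hash[...] turns the resulting
# array of pairs back into a Hash, here with Symbol keys.
symbolized = rows.map { |h| Hash[h.map { |k, v| [k.to_sym, v] }] }

puts symbolized.first.keys.inspect  # prints [:title, :pub_date]
```

The result has the same shape as the output of `inject` with `symbolize_names:true`, which is what the spec compares against.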
@@ -31,7 +31,7 @@ describe 'Yasuri' do
    it 'fail with invalid xpath' do
      invalid_xpath = '/html/body/no_match_node['
      node = Yasuri::TextNode.new(invalid_xpath, "title")
-     expect { node.inject(@agent, @index_page) }.to raise_error
+     expect { node.inject(@agent, @index_page) }.to raise_error(Nokogiri::XML::XPath::SyntaxError)
    end

    it "can be defined by DSL, return single TextNode title" do
metadata
CHANGED
@@ -1,14 +1,14 @@
  --- !ruby/object:Gem::Specification
  name: yasuri
  version: !ruby/object:Gem::Version
-   version:
+   version: 1.9.11
  platform: ruby
  authors:
  - TAC
  autorequire:
  bindir: bin
  cert_chain: []
- date:
+ date: 2016-11-14 00:00:00.000000000 Z
  dependencies:
  - !ruby/object:Gem::Dependency
    name: bundler
@@ -151,6 +151,8 @@ files:
  - LICENSE
  - README.md
  - Rakefile
+ - USAGE.ja.md
+ - USAGE.md
  - app.rb
  - lib/yasuri.rb
  - lib/yasuri/version.rb