yasuri 0.0.11 → 1.9.11
- checksums.yaml +4 -4
- data/README.md +18 -0
- data/USAGE.ja.md +433 -0
- data/USAGE.md +431 -0
- data/lib/yasuri/version.rb +1 -1
- data/lib/yasuri/yasuri.rb +3 -2
- data/lib/yasuri/yasuri_node.rb +1 -1
- data/lib/yasuri/yasuri_paginate_node.rb +2 -2
- data/lib/yasuri/yasuri_text_node.rb +4 -1
- data/spec/spec_helper.rb +3 -2
- data/spec/yasuri_spec.rb +3 -3
- data/spec/yasuri_struct_node_spec.rb +3 -3
- data/spec/yasuri_text_node_spec.rb +1 -1
- metadata +4 -2
checksums.yaml
CHANGED
@@ -1,7 +1,7 @@
|
|
1
1
|
---
|
2
2
|
SHA1:
|
3
|
-
metadata.gz:
|
4
|
-
data.tar.gz:
|
3
|
+
metadata.gz: 28e6a3903cec3d8036b718a7c5de4fb8df8dfbfa
|
4
|
+
data.tar.gz: fbfb4c2b3a042410a7d05b1604e004bdffd82779
|
5
5
|
SHA512:
|
6
|
-
metadata.gz:
|
7
|
-
data.tar.gz:
|
6
|
+
metadata.gz: e4312623046ecf451ef261b1d99164b7f79b861ad8d7fe49d7ba8fd2319cd840978ae88235e6fb10b6991cd5131ebec966ec56e3bddece8fbbefd0b53d4a9dfe
|
7
|
+
data.tar.gz: 54639066aa4511309a1f712b8920c81c09a7489e0cab6a91c453d2aeca07d4f1c3bed167ed25746411d5f87fa9cc9fff8c6a32567c4df8d1990f12cb053286e2
|
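The SHA1 and SHA512 entries in `checksums.yaml` are hex digests of the two archives packed inside the `.gem` file. As a rough sketch of how such values could be recomputed with Ruby's standard `digest` library (the helper name `archive_digests` and the sample bytes are assumptions for illustration, not part of RubyGems):

```ruby
require 'digest/sha1'
require 'digest/sha2'

# Hypothetical helper: returns the two digest kinds recorded in
# checksums.yaml for the raw bytes of one archive.
def archive_digests(bytes)
  {
    'SHA1'   => Digest::SHA1.hexdigest(bytes),
    'SHA512' => Digest::SHA512.hexdigest(bytes)
  }
end

# Sample input only; a real check would read metadata.gz or data.tar.gz.
digests = archive_digests('example bytes, not a real archive')
puts digests['SHA1'].length    # 40 hex characters
puts digests['SHA512'].length  # 128 hex characters
```

Comparing the recomputed strings with the values listed in the diff is one way to confirm that an unpacked gem matches the published one.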
data/README.md
CHANGED
@@ -2,6 +2,16 @@
|
|
2
2
|
|
3
3
|
Yasuri (鑢) is an easy web-scraping library for supporting "[Mechanize](https://github.com/sparklemotion/mechanize)".
|
4
4
|
|
5
|
+
Yasuri reduces the boilerplate for processes that come up frequently in scraping.
|
6
|
+
|
7
|
+
For example,
|
8
|
+
|
9
|
+
+ Open the links in a page, scrape each linked page, and get the results as a Hash.
|
10
|
+
+ Scrape several texts in a page and return them as a Hash with named keys.
|
11
|
+
+ Scrape each table that appears repeatedly in a page and get the results as an Array.
|
12
|
+
+ Scrape only the first 3 of the pages provided by pagination.
|
13
|
+
|
14
|
+
Yasuri makes these easy to implement.
|
5
15
|
|
6
16
|
## Sample
|
7
17
|
|
@@ -17,6 +27,14 @@ Add this line to your application's Gemfile:
|
|
17
27
|
gem 'yasuri'
|
18
28
|
```
|
19
29
|
|
30
|
+
or
|
31
|
+
|
32
|
+
```ruby
|
33
|
+
# for Ruby 1.9.3 or lower
|
34
|
+
gem 'yasuri', '~> 1.9'
|
35
|
+
```
|
36
|
+
|
37
|
+
|
20
38
|
And then execute:
|
21
39
|
|
22
40
|
$ bundle
|
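The `~> 1.9` constraint added to the README's Gemfile snippet is RubyGems' pessimistic version operator: it accepts 1.9 and later but stops short of 2.0. A small sketch using the `Gem::Requirement` and `Gem::Version` classes that ship with RubyGems; the list of versions tried here is only an illustration:

```ruby
require 'rubygems'

# '~> 1.9' behaves like '>= 1.9' combined with '< 2.0'.
requirement = Gem::Requirement.new('~> 1.9')

results = %w[0.0.11 1.9.11 2.0.0].map do |version|
  [version, requirement.satisfied_by?(Gem::Version.new(version))]
end

results.each { |version, ok| puts "#{version}: #{ok}" }
# 0.0.11: false
# 1.9.11: true
# 2.0.0: false
```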
data/USAGE.ja.md
ADDED
@@ -0,0 +1,433 @@
|
|
1
|
+
# Yasuri Usage
|
2
|
+
|
3
|
+
## What is Yasuri
|
4
|
+
Yasuri (鑢) is a library that supports "[Mechanize](https://github.com/sparklemotion/mechanize)" to make web scraping easy.
|
5
|
+
|
6
|
+
Yasuri lets you write the common processes in scraping concisely.
|
7
|
+
For example,
|
8
|
+
|
9
|
+
+ Open multiple links in a page and get the result of scraping each page as a Hash
|
10
|
+
+ Scrape multiple texts in a page and put them in a Hash under names
|
11
|
+
+ Scrape each of the tables that appear repeatedly in a page and get them as an array
|
12
|
+
+ Scrape, in order, only the first 3 of the pages provided by pagination
|
13
|
+
|
14
|
+
Yasuri makes these easy to implement.
|
15
|
+
|
16
|
+
## Quick Start
|
17
|
+
|
18
|
+
```
|
19
|
+
$ gem install yasuri
|
20
|
+
```
|
21
|
+
|
22
|
+
```ruby
|
23
|
+
require 'yasuri'
|
24
|
+
require 'mechanize'
|
25
|
+
|
26
|
+
# Node tree constructing by DSL
|
27
|
+
root = Yasuri.links_root '//*[@id="menu"]/ul/li/a' do
|
28
|
+
text_title '//*[@id="contents"]/h2'
|
29
|
+
text_content '//*[@id="contents"]/p[1]'
|
30
|
+
end
|
31
|
+
|
32
|
+
agent = Mechanize.new
|
33
|
+
root_page = agent.get("http://some.scraping.page.net/")
|
34
|
+
|
35
|
+
result = root.inject(agent, root_page)
|
36
|
+
# => [ {"title" => "PageTitle1", "content" => "Page Contents1" },
|
37
|
+
# {"title" => "PageTitle2", "content" => "Page Contents2" }, ... ]
|
38
|
+
|
39
|
+
```
|
40
|
+
In this example, Yasuri opens the page behind each link specified by the xpath of the LinkNode (`links_root`) and scrapes the two texts specified by the xpaths of the TextNodes (`text_title`, `text_content`).
|
41
|
+
|
42
|
+
(In other words, it opens each link indicated by `//*[@id="menu"]/ul/li/a` and scrapes the texts specified by `//*[@id="contents"]/h2` and `//*[@id="contents"]/p[1]`.)
|
43
|
+
|
44
|
+
## Basics
|
45
|
+
|
46
|
+
1. Construct a parse tree
|
47
|
+
2. Start parsing by giving it a Mechanize agent and the target page
|
48
|
+
|
49
|
+
|
50
|
+
### Construct a parse tree
|
51
|
+
|
52
|
+
```ruby
|
53
|
+
require 'mechanize'
|
54
|
+
require 'yasuri'
|
55
|
+
|
56
|
+
|
57
|
+
# 1. Construct a parse tree
|
58
|
+
tree = Yasuri.links_title '/html/body/a' do
|
59
|
+
text_name '/html/body/p'
|
60
|
+
end
|
61
|
+
|
62
|
+
# 2. Start parsing by giving it a Mechanize agent and the target page
|
63
|
+
agent = Mechanize.new
|
64
|
+
page = agent.get(uri)
|
65
|
+
|
66
|
+
|
67
|
+
tree.inject(agent, page)
|
68
|
+
```
|
69
|
+
|
70
|
+
The tree can be defined with the DSL or with json. The example above uses the DSL.
|
71
|
+
The following example defines a parse tree equivalent to the one above in json.
|
72
|
+
|
73
|
+
```ruby
|
74
|
+
# Constructing from json
|
75
|
+
src = <<-EOJSON
|
76
|
+
{ "node" : "links",
|
77
|
+
"name" : "title",
|
78
|
+
"path" : "/html/body/a",
|
79
|
+
"children" : [
|
80
|
+
{ "node" : "text",
|
81
|
+
"name" : "name",
|
82
|
+
"path" : "/html/body/p"
|
83
|
+
}
|
84
|
+
]
|
85
|
+
}
|
86
|
+
EOJSON
|
87
|
+
tree = Yasuri.json2tree(src)
|
88
|
+
```
|
89
|
+
|
90
|
+
|
91
|
+
### Node
|
92
|
+
The tree is made up of nested *Node*s.
|
93
|
+
A Node has `Type`, `Name`, `Path`, `Children`, and `Options`.
|
94
|
+
|
95
|
+
A Node is defined in the following format.
|
96
|
+
|
97
|
+
```ruby
|
98
|
+
# Top level
|
99
|
+
Yasuri.<Type>_<Name> <Path> [,<Options>]
|
100
|
+
|
101
|
+
# Nested
|
102
|
+
Yasuri.<Type>_<Name> <Path> [,<Options>] do
|
103
|
+
<Type>_<Name> <Path> [,<Options>] do
|
104
|
+
<Children>
|
105
|
+
end
|
106
|
+
end
|
107
|
+
```
|
108
|
+
|
109
|
+
#### Type
|
110
|
+
*Type* indicates the behavior of a Node. The following Types are available.
|
111
|
+
|
112
|
+
- *Text*
|
113
|
+
- *Struct*
|
114
|
+
- *Links*
|
115
|
+
- *Paginate*
|
116
|
+
|
117
|
+
### Name
|
118
|
+
*Name* becomes the key in the resulting Hash.
|
119
|
+
|
120
|
+
### Path
|
121
|
+
*Path* specifies a particular node in the HTML by xpath or css selector.
|
122
|
+
It is used with Mechanize's `search`.
|
123
|
+
|
124
|
+
### Children
|
125
|
+
The child nodes of a nested node. A TextNode is a leaf of the tree, so it has no child nodes.
|
126
|
+
|
127
|
+
### Options
|
128
|
+
Parse options. The options differ for each Type.
|
129
|
+
You can get the available options by calling the `opt` method on a node.
|
130
|
+
|
131
|
+
```
|
132
|
+
# TextNode example
|
133
|
+
node = Yasuri.text_title '/html/body/p[1]', truncate:/^[^,]+/
|
134
|
+
node.opt #=> {:truncate => /^[^,]+/, :proc => nil}
|
135
|
+
```
|
136
|
+
|
137
|
+
## Text Node
|
138
|
+
*TextNode* returns the scraped text. This node is always a leaf in the parse tree.
|
139
|
+
|
140
|
+
### Example
|
141
|
+
|
142
|
+
```html
|
143
|
+
<!-- http://yasuri.example.net -->
|
144
|
+
<html>
|
145
|
+
<head></head>
|
146
|
+
<body>
|
147
|
+
<p>Hello,World</p>
|
148
|
+
<p>Hello,Yasuri</p>
|
149
|
+
</body>
|
150
|
+
</html>
|
151
|
+
```
|
152
|
+
|
153
|
+
```ruby
|
154
|
+
agent = Mechanize.new
|
155
|
+
page = agent.get("http://yasuri.example.net")
|
156
|
+
|
157
|
+
p1 = Yasuri.text_title '/html/body/p[1]'
|
158
|
+
p1t = Yasuri.text_title '/html/body/p[1]', truncate:/^[^,]+/
|
159
|
+
p2u = Yasuri.text_title '/html/body/p[2]', proc: :upcase
|
160
|
+
|
161
|
+
p1.inject(agent, page) #=> { "title" => "Hello,World" }
|
162
|
+
p1t.inject(agent, page) #=> { "title" => "Hello" }
|
163
|
+
p2u.inject(agent, page) #=> { "title" => "HELLO,YASURI" }
|
164
|
+
```
|
165
|
+
|
166
|
+
### Options
|
167
|
+
##### `truncate`
|
168
|
+
Extracts the string that matches the regular expression. If a group is specified, only the first matched group is returned.
|
169
|
+
|
170
|
+
```ruby
|
171
|
+
node = Yasuri.text_example '/html/body/p[1]', truncate:/H(.+)i/
|
172
|
+
node.inject(agent, index_page)
|
173
|
+
#=> { "example" => "ello,Yasur" }
|
174
|
+
```
|
175
|
+
|
176
|
+
|
177
|
+
##### `proc`
|
178
|
+
Calls the method specified by a Symbol with the extracted string (String) as the receiver.
|
179
|
+
If the `truncate` option is also specified, the method is called on the string after `truncate` is applied.
|
180
|
+
|
181
|
+
```ruby
|
182
|
+
node = Yasuri.text_example '/html/body/p[1]', proc: :upcase, truncate:/H(.+)i/
|
183
|
+
node.inject(agent, index_page)
|
184
|
+
#=> { "example" => "ELLO,YASUR" }
|
185
|
+
```
|
186
|
+
|
187
|
+
## Struct Node
|
188
|
+
*Struct Node* returns text as a structured Hash.
|
189
|
+
|
190
|
+
First, a Struct Node narrows down the HTML tags by `Path`.
|
191
|
+
The child nodes of the Struct Node parse these narrowed-down tags, and the Struct Node returns a Hash containing the child nodes' results.
|
192
|
+
|
193
|
+
If the `Path` of a Struct Node matches multiple tags, the results are returned as an array.
|
194
|
+
|
195
|
+
### Example
|
196
|
+
|
197
|
+
```html
|
198
|
+
<!-- http://yasuri.example.net -->
|
199
|
+
<html>
|
200
|
+
<head>
|
201
|
+
<title>Books</title>
|
202
|
+
</head>
|
203
|
+
<body>
|
204
|
+
<h1>1996</h1>
|
205
|
+
<table>
|
206
|
+
<thead>
|
207
|
+
<tr><th>Title</th> <th>Publication Date</th></tr>
|
208
|
+
</thead>
|
209
|
+
<tr><td>The Perfect Insider</td> <td>1996/4/5</td></tr>
|
210
|
+
<tr><td>Doctors in Isolated Room</td> <td>1996/7/5</td></tr>
|
211
|
+
<tr><td>Mathematical Goodbye</td> <td>1996/9/5</td></tr>
|
212
|
+
</table>
|
213
|
+
|
214
|
+
<h1>1997</h1>
|
215
|
+
<table>
|
216
|
+
<thead>
|
217
|
+
<tr><th>Title</th> <th>Publication Date</th></tr>
|
218
|
+
</thead>
|
219
|
+
<tr><td>Jack the Poetical Private</td> <td>1997/1/5</td></tr>
|
220
|
+
<tr><td>Who Inside</td> <td>1997/4/5</td></tr>
|
221
|
+
<tr><td>Illusion Acts Like Magic</td> <td>1997/10/5</td></tr>
|
222
|
+
</table>
|
223
|
+
|
224
|
+
<h1>1998</h1>
|
225
|
+
<table>
|
226
|
+
<thead>
|
227
|
+
<tr><th>Title</th> <th>Publication Date</th></tr>
|
228
|
+
</thead>
|
229
|
+
<tr><td>Replaceable Summer</td> <td>1998/1/7</td></tr>
|
230
|
+
<tr><td>Switch Back</td> <td>1998/4/5</td></tr>
|
231
|
+
<tr><td>Numerical Models</td> <td>1998/7/5</td></tr>
|
232
|
+
<tr><td>The Perfect Outsider</td> <td>1998/10/5</td></tr>
|
233
|
+
</table>
|
234
|
+
</body>
|
235
|
+
</html>
|
236
|
+
```
|
237
|
+
|
238
|
+
```ruby
|
239
|
+
agent = Mechanize.new
|
240
|
+
page = agent.get("http://yasuri.example.net")
|
241
|
+
|
242
|
+
node = Yasuri.struct_table '/html/body/table[1]/tr' do
|
243
|
+
text_title './td[1]'
|
244
|
+
text_pub_date './td[2]'
|
245
|
+
end
|
246
|
+
|
247
|
+
node.inject(agent, page)
|
248
|
+
#=> [ { "title" => "The Perfect Insider",
|
249
|
+
# "pub_date" => "1996/4/5" },
|
250
|
+
# { "title" => "Doctors in Isolated Room",
|
251
|
+
# "pub_date" => "1996/7/5" },
|
252
|
+
# { "title" => "Mathematical Goodbye",
|
253
|
+
# "pub_date" => "1996/9/5" }, ]
|
254
|
+
```
|
255
|
+
|
256
|
+
The Struct Node narrows down all `<tr>` tags in the first `<table>` by the xpath `'/html/body/table[1]/tr'`.
|
257
|
+
Then the `<tr>` tags are parsed by the two child TextNodes.
|
258
|
+
In this case the first `<table>` has three `<tr>` tags, so three Hashes are returned. (Note that it is not four, because `<thead><tr>` does not match `Path`.)
|
259
|
+
Each Hash contains the text parsed by the TextNodes.
|
260
|
+
|
261
|
+
|
262
|
+
As in the following example, a Struct Node can also have nodes other than TextNode as child nodes.
|
263
|
+
|
264
|
+
### Example
|
265
|
+
|
266
|
+
```ruby
|
267
|
+
agent = Mechanize.new
|
268
|
+
page = agent.get("http://yasuri.example.net")
|
269
|
+
|
270
|
+
node = Yasuri.struct_tables '/html/body/table' do
|
271
|
+
struct_table './tr' do
|
272
|
+
text_title './td[1]'
|
273
|
+
text_pub_date './td[2]'
|
274
|
+
end
|
275
|
+
end
|
276
|
+
|
277
|
+
node.inject(agent, page)
|
278
|
+
|
279
|
+
#=> [ { "table" => [ { "title" => "The Perfect Insider",
|
280
|
+
# "pub_date" => "1996/4/5" },
|
281
|
+
# { "title" => "Doctors in Isolated Room",
|
282
|
+
# "pub_date" => "1996/7/5" },
|
283
|
+
# { "title" => "Mathematical Goodbye",
|
284
|
+
# "pub_date" => "1996/9/5" }]},
|
285
|
+
# { "table" => [ { "title" => "Jack the Poetical Private",
|
286
|
+
# "pub_date" => "1997/1/5" },
|
287
|
+
# { "title" => "Who Inside",
|
288
|
+
# "pub_date" => "1997/4/5" },
|
289
|
+
# { "title" => "Illusion Acts Like Magic",
|
290
|
+
# "pub_date" => "1997/10/5" }]},
|
291
|
+
# { "table" => [ { "title" => "Replaceable Summer",
|
292
|
+
# "pub_date" => "1998/1/7" },
|
293
|
+
# { "title" => "Switch Back",
|
294
|
+
# "pub_date" => "1998/4/5" },
|
295
|
+
# { "title" => "Numerical Models",
|
296
|
+
# "pub_date" => "1998/7/5" },
|
297
|
+
# { "title" => "The Perfect Outsider",
|
298
|
+
# "pub_date" => "1998/10/5" }]}
|
299
|
+
# ]
|
300
|
+
```
|
301
|
+
|
302
|
+
### Options
|
303
|
+
None.
|
304
|
+
|
305
|
+
## Links Node
|
306
|
+
A Links Node parses each linked page and returns the results.
|
307
|
+
|
308
|
+
### Example
|
309
|
+
```
|
310
|
+
<!-- http://yasuri.example.net -->
|
311
|
+
<html>
|
312
|
+
<head><title>Yasuri Test</title></head>
|
313
|
+
<body>
|
314
|
+
<p>Hello,Yasuri</p>
|
315
|
+
<a href="./child01.html">child01</a>
|
316
|
+
<a href="./child02.html">child02</a>
|
317
|
+
<a href="./child03.html">child03</a>
|
318
|
+
</body>
|
319
|
+
</html>
|
320
|
+
```
|
321
|
+
|
322
|
+
```
|
323
|
+
<!-- http://yasuri.example.net/child01.html -->
|
324
|
+
<html>
|
325
|
+
<head><title>Child 01 Test</title></head>
|
326
|
+
<body>
|
327
|
+
<p>Child 01 page.</p>
|
328
|
+
<ul>
|
329
|
+
<li><a href="./child01_sub.html">Child01_Sub</a></li>
|
330
|
+
<li><a href="./child02_sub.html">Child02_Sub</a></li>
|
331
|
+
</ul>
|
332
|
+
</body>
|
333
|
+
</html>
|
334
|
+
```
|
335
|
+
|
336
|
+
```
|
337
|
+
<!-- http://yasuri.example.net/child02.html -->
|
338
|
+
<html>
|
339
|
+
<head><title>Child 02 Test</title></head>
|
340
|
+
<body>
|
341
|
+
<p>Child 02 page.</p>
|
342
|
+
</body>
|
343
|
+
</html>
|
344
|
+
```
|
345
|
+
|
346
|
+
```
|
347
|
+
<!-- http://yasuri.example.net/child03.html -->
|
348
|
+
<html>
|
349
|
+
<head><title>Child 03 Test</title></head>
|
350
|
+
<body>
|
351
|
+
<p>Child 03 page.</p>
|
352
|
+
<ul>
|
353
|
+
<li><a href="./child03_sub.html">Child03_Sub</a></li>
|
354
|
+
</ul>
|
355
|
+
</body>
|
356
|
+
</html>
|
357
|
+
```
|
358
|
+
|
359
|
+
```
|
360
|
+
agent = Mechanize.new
|
361
|
+
page = agent.get("http://yasuri.example.net")
|
362
|
+
|
363
|
+
node = Yasuri.links_title '/html/body/a' do
|
364
|
+
text_content '/html/body/p'
|
365
|
+
end
|
366
|
+
|
367
|
+
node.inject(agent, page)
|
368
|
+
#=> [ {"content" => "Child 01 page."},
|
369
|
+
# {"content" => "Child 02 page."},
|
370
|
+
# {"content" => "Child 03 page."}]
|
371
|
+
```
|
372
|
+
|
373
|
+
First, the LinksNode looks for all links matching `Path` in the first page.
|
374
|
+
In this example, the LinksNode looks for all tags matching `/html/body/a` in `http://yasuri.example.net`.
|
375
|
+
Next, it opens the pages specified by the href attributes of the tags it found (`./child01.html`, `./child02.html`, `./child03.html`).
|
376
|
+
|
377
|
+
Each opened page is parsed by the child nodes. The LinksNode returns the parse results for each page as an array of Hashes.
|
378
|
+
|
379
|
+
## Paginate Node
|
380
|
+
A PaginateNode parses, in order, each page that can be reached through pagination.
|
381
|
+
|
382
|
+
### Example
|
383
|
+
In this example, assume the target page `page01.html` looks like this.
|
384
|
+
`page02.html` through `page04.html` are similar.
|
385
|
+
|
386
|
+
```html
|
387
|
+
<!-- http://yasuri.example.net/page01.html -->
|
388
|
+
<html>
|
389
|
+
<head><title>Page01</title></head>
|
390
|
+
<body>
|
391
|
+
<p>Pagination01</p>
|
392
|
+
|
393
|
+
<nav class='pagination'>
|
394
|
+
<span class='prev'> « PreviousPage </span>
|
395
|
+
<span class='page'> 1 </span>
|
396
|
+
<span class='page'> <a href="./page02.html">2</a> </span>
|
397
|
+
<span class='page'> <a href="./page03.html">3</a> </span>
|
398
|
+
<span class='page'> <a href="./page04.html">4</a> </span>
|
399
|
+
<span class='next'> <a href="./page02.html" class="next" rel="next">NextPage »</a> </span>
|
400
|
+
</nav>
|
401
|
+
|
402
|
+
</body>
|
403
|
+
</html>
|
404
|
+
```
|
405
|
+
|
406
|
+
```ruby
|
407
|
+
agent = Mechanize.new
|
408
|
+
page = agent.get("http://yasuri.example.net/page01.html")
|
409
|
+
|
410
|
+
node = Yasuri.pages_root "/html/body/nav/span/a[@class='next']" , limit:3 do
|
411
|
+
text_content '/html/body/p'
|
412
|
+
end
|
413
|
+
|
414
|
+
node.inject(agent, page)
|
415
|
+
#=> [ {"content" => "Pagination01"},
|
416
|
+
# {"content" => "Pagination02"},
|
417
|
+
# {"content" => "Pagination03"}]
|
418
|
+
```
|
419
|
+
A PaginateNode requires the link that points to the next page to be given as its `Path`.
|
420
|
+
In this example, `NextPage` (`/html/body/nav/span/a[@class='next']`) is the link that points to the next page.
|
421
|
+
|
422
|
+
### Options
|
423
|
+
##### `limit`
|
424
|
+
Specifies the upper limit on the number of pages to follow.
|
425
|
+
|
426
|
+
```ruby
|
427
|
+
node = Yasuri.pages_root "/html/body/nav/span/a[@class='next']" , limit:2 do
|
428
|
+
text_content '/html/body/p'
|
429
|
+
end
|
430
|
+
node.inject(agent, page)
|
431
|
+
#=> [ {"content" => "Pagination01"}, {"content" => "Pagination02"}]
|
432
|
+
```
|
433
|
+
In this case, the PaginateNode opens and parses at most 2 pages. The pagination appears to have 4 pages, but because `limit:2` is specified, the resulting array contains only 2 results.
|
data/USAGE.md
ADDED
@@ -0,0 +1,431 @@
|
|
1
|
+
# Yasuri Usage
|
2
|
+
|
3
|
+
## What is Yasuri
|
4
|
+
`Yasuri` is an easy web-scraping library for supporting "Mechanize".
|
5
|
+
|
6
|
+
Yasuri (鑢) is an easy web-scraping library for supporting "[Mechanize](https://github.com/sparklemotion/mechanize)".
|
7
|
+
|
8
|
+
Yasuri reduces the boilerplate for processes that come up frequently in scraping.
|
9
|
+
|
10
|
+
For example,
|
11
|
+
|
12
|
+
+ Open the links in a page, scrape each linked page, and get the results as a Hash.
|
13
|
+
+ Scrape several texts in a page and return them as a Hash with named keys.
|
14
|
+
+ Scrape each table that appears repeatedly in a page and get the results as an Array.
|
15
|
+
+ Scrape only the first 3 of the pages provided by pagination.
|
16
|
+
|
17
|
+
Yasuri makes these easy to implement.
|
18
|
+
|
19
|
+
## Quick Start
|
20
|
+
|
21
|
+
```
|
22
|
+
$ gem install yasuri
|
23
|
+
```
|
24
|
+
|
25
|
+
```ruby
|
26
|
+
require 'yasuri'
|
27
|
+
require 'mechanize'
|
28
|
+
|
29
|
+
# Node tree constructing by DSL
|
30
|
+
root = Yasuri.links_root '//*[@id="menu"]/ul/li/a' do
|
31
|
+
text_title '//*[@id="contents"]/h2'
|
32
|
+
text_content '//*[@id="contents"]/p[1]'
|
33
|
+
end
|
34
|
+
|
35
|
+
agent = Mechanize.new
|
36
|
+
root_page = agent.get("http://some.scraping.page.net/")
|
37
|
+
|
38
|
+
result = root.inject(agent, root_page)
|
39
|
+
# => [ {"title" => "PageTitle1", "content" => "Page Contents1" },
|
40
|
+
# {"title" => "PageTitle2", "content" => "Page Contents2" }, ... ]
|
41
|
+
|
42
|
+
```
|
43
|
+
In this example, Yasuri opens the page behind each link matched by the xpath of the LinkNode (`links_root`) and scrapes the two texts matched by the xpaths of the TextNodes (`text_title`, `text_content`).
|
44
|
+
|
45
|
+
(i.e. open each link matched by `//*[@id="menu"]/ul/li/a`, then scrape `//*[@id="contents"]/h2` and `//*[@id="contents"]/p[1]`.)
|
46
|
+
|
47
|
+
## Basics
|
48
|
+
|
49
|
+
1. Construct parse tree.
|
50
|
+
2. Start parse with Mechanize agent and first page.
|
51
|
+
|
52
|
+
### Construct parse tree
|
53
|
+
|
54
|
+
```ruby
|
55
|
+
require 'mechanize'
|
56
|
+
require 'yasuri'
|
57
|
+
|
58
|
+
|
59
|
+
# 1. Construct parse tree.
|
60
|
+
tree = Yasuri.links_title '/html/body/a' do
|
61
|
+
text_name '/html/body/p'
|
62
|
+
end
|
63
|
+
|
64
|
+
# 2. Start parse with Mechanize agent and first page.
|
65
|
+
agent = Mechanize.new
|
66
|
+
page = agent.get(uri)
|
67
|
+
|
68
|
+
|
69
|
+
tree.inject(agent, page)
|
70
|
+
```
|
71
|
+
|
72
|
+
A tree can be defined in 2 (+1) ways: the DSL and json (and plain Ruby code). The example above uses the DSL.
|
73
|
+
|
74
|
+
```ruby
|
75
|
+
# Construct by json.
|
76
|
+
src = <<-EOJSON
|
77
|
+
{ "node" : "links",
|
78
|
+
"name" : "title",
|
79
|
+
"path" : "/html/body/a",
|
80
|
+
"children" : [
|
81
|
+
{ "node" : "text",
|
82
|
+
"name" : "name",
|
83
|
+
"path" : "/html/body/p"
|
84
|
+
}
|
85
|
+
]
|
86
|
+
}
|
87
|
+
EOJSON
|
88
|
+
tree = Yasuri.json2tree(src)
|
89
|
+
```
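To make the correspondence between the json form and the DSL form more concrete, here is a small sketch that parses the same json with Ruby's standard `json` library and prints one DSL-like line per node. The `describe` helper is hypothetical and only walks the keys shown in the example; it does not reproduce what `Yasuri.json2tree` does internally.

```ruby
require 'json'

src = <<-EOJSON
{ "node"     : "links",
  "name"     : "title",
  "path"     : "/html/body/a",
  "children" : [
    { "node" : "text",
      "name" : "name",
      "path" : "/html/body/p"
    }
  ]
}
EOJSON

# Hypothetical walker: builds "<type>_<name> <path>" for each node,
# depth-first, indenting children by two spaces per level.
def describe(node_h, depth = 0)
  line = "#{'  ' * depth}#{node_h['node']}_#{node_h['name']} #{node_h['path']}"
  children = node_h['children'] || []
  [line] + children.flat_map { |child| describe(child, depth + 1) }
end

lines = describe(JSON.parse(src))
puts lines
# links_title /html/body/a
#   text_name /html/body/p
```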
|
90
|
+
|
91
|
+
### Node
|
92
|
+
A tree is constructed from nested Nodes.
|
93
|
+
A Node has `Type`, `Name`, `Path`, `Children`, and `Options`.
|
94
|
+
|
95
|
+
A Node is defined in this format.
|
96
|
+
|
97
|
+
|
98
|
+
```ruby
|
99
|
+
# Top Level
|
100
|
+
Yasuri.<Type>_<Name> <Path> [,<Options>]
|
101
|
+
|
102
|
+
# Nested
|
103
|
+
Yasuri.<Type>_<Name> <Path> [,<Options>] do
|
104
|
+
<Type>_<Name> <Path> [,<Options>] do
|
105
|
+
<Children>
|
106
|
+
end
|
107
|
+
end
|
108
|
+
```
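The `<Type>_<Name>` convention means that a single method name carries both the node type and the key used in the result. The sketch below shows one way such a name could be split; the `PREFIXES` list and the `split_dsl_name` helper are assumptions made for illustration (the prefixes are the ones that appear in this document's examples), not Yasuri's actual implementation.

```ruby
# Hypothetical list of DSL prefixes, taken from the examples in this document.
PREFIXES = %w[text struct links pages].freeze

# Splits a DSL method name at the first underscore into [type, name].
# Everything after the first underscore is kept as the name, so
# :text_pub_date yields the name "pub_date".
def split_dsl_name(method_name)
  type, name = method_name.to_s.split('_', 2)
  raise ArgumentError, "unknown node type: #{type}" unless PREFIXES.include?(type)
  [type, name]
end

p split_dsl_name(:text_pub_date)  # ["text", "pub_date"]
p split_dsl_name(:pages_root)     # ["pages", "root"]
```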
|
109
|
+
|
110
|
+
#### Type
|
111
|
+
Type means the behavior of a Node.
|
112
|
+
|
113
|
+
- *Text*
|
114
|
+
- *Struct*
|
115
|
+
- *Links*
|
116
|
+
- *Paginate*
|
117
|
+
|
118
|
+
### Name
|
119
|
+
Name is used as the key in the returned hash.
|
120
|
+
|
121
|
+
### Path
|
122
|
+
Path determines the target node by xpath or css selector. It is passed to Mechanize `search`.
|
123
|
+
|
124
|
+
### Children
|
125
|
+
Child nodes. A TextNode always has an empty set, because a TextNode is a leaf.
|
126
|
+
|
127
|
+
### Options
|
128
|
+
Parse options. They differ for each type. You can get the options and their values with the `opt` method.
|
129
|
+
|
130
|
+
```ruby
|
131
|
+
# TextNode Example
|
132
|
+
node = Yasuri.text_title '/html/body/p[1]', truncate:/^[^,]+/
|
133
|
+
node.opt #=> {:truncate => /^[^,]+/, :proc => nil}
|
134
|
+
```
|
135
|
+
|
136
|
+
## Text Node
|
137
|
+
A TextNode returns the scraped text. This node has to be a leaf.
|
138
|
+
|
139
|
+
### Example
|
140
|
+
|
141
|
+
```html
|
142
|
+
<!-- http://yasuri.example.net -->
|
143
|
+
<html>
|
144
|
+
<head></head>
|
145
|
+
<body>
|
146
|
+
<p>Hello,World</p>
|
147
|
+
<p>Hello,Yasuri</p>
|
148
|
+
</body>
|
149
|
+
</html>
|
150
|
+
```
|
151
|
+
|
152
|
+
```ruby
|
153
|
+
agent = Mechanize.new
|
154
|
+
page = agent.get("http://yasuri.example.net")
|
155
|
+
|
156
|
+
p1 = Yasuri.text_title '/html/body/p[1]'
|
157
|
+
p1t = Yasuri.text_title '/html/body/p[1]', truncate:/^[^,]+/
|
158
|
+
p2u = Yasuri.text_title '/html/body/p[2]', proc: :upcase
|
159
|
+
|
160
|
+
p1.inject(agent, page) #=> { "title" => "Hello,World" }
|
161
|
+
p1t.inject(agent, page) #=> { "title" => "Hello" }
|
162
|
+
p2u.inject(agent, page) #=> { "title" => "HELLO,YASURI" }
|
163
|
+
```
|
164
|
+
|
165
|
+
### Options
|
166
|
+
##### `truncate`
|
167
|
+
Matches the text against a regexp and truncates it to the match. When you use a group, only the first matched group is returned.
|
168
|
+
|
169
|
+
```ruby
|
170
|
+
node = Yasuri.text_example '/html/body/p[1]', truncate:/H(.+)i/
|
171
|
+
node.inject(agent, index_page)
|
172
|
+
#=> { "example" => "ello,Yasur" }
|
173
|
+
```
|
174
|
+
|
175
|
+
|
176
|
+
##### `proc`
|
177
|
+
Applies a method to the text. The method is given as a Symbol.
|
178
|
+
If the `truncate` option is also given, the method is applied after truncation.
|
179
|
+
|
180
|
+
```ruby
|
181
|
+
node = Yasuri.text_example '/html/body/p[1]', proc: :upcase, truncate:/H(.+)i/
|
182
|
+
node.inject(agent, index_page)
|
183
|
+
#=> { "example" => "ELLO,YASUR" }
|
184
|
+
```
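The interaction of `truncate` and `proc` described above can be tried without a network connection. The following is a minimal sketch in plain Ruby; `process_text` is a hypothetical stand-in for the post-processing step, and returning the text unchanged when the regexp does not match is an assumption of this sketch rather than documented behavior.

```ruby
# Hypothetical stand-in for the TextNode post-processing step.
# opts may contain :truncate (Regexp or String) and :proc (Symbol).
def process_text(text, opts = {})
  truncate = opts[:truncate]
  if truncate
    match = Regexp.new(truncate).match(text)
    # Prefer the first capture group; fall back to the whole match.
    # Assumption: text is left unchanged when nothing matches.
    text = match[1] || match[0] if match
  end
  method_name = opts[:proc]
  text = text.public_send(method_name) if method_name
  text
end

p process_text('Hello,World', truncate: /^[^,]+/)
# "Hello"
p process_text('Hello,Yasuri', truncate: /H(.+)i/)
# "ello,Yasur"
p process_text('Hello,Yasuri', proc: :upcase, truncate: /H(.+)i/)
# "ELLO,YASUR"
```

Truncation runs before the method call, which is why the last result is the upcased capture group rather than the upcased full text.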
|
185
|
+
|
186
|
+
## Struct Node
|
187
|
+
A Struct Node returns structured text.
|
188
|
+
|
189
|
+
First, a Struct Node narrows down sub-tags by `Path`. The child nodes parse the narrowed tags, and the Struct Node returns a hash containing the parsed results.
|
190
|
+
|
191
|
+
If the Struct Node `Path` matches multiple sub-tags, the child nodes parse each sub-tag and the Struct Node returns an array.
|
192
|
+
|
193
|
+
### Example
|
194
|
+
|
195
|
+
```html
|
196
|
+
<!-- http://yasuri.example.net -->
|
197
|
+
<html>
|
198
|
+
<head>
|
199
|
+
<title>Books</title>
|
200
|
+
</head>
|
201
|
+
<body>
|
202
|
+
<h1>1996</h1>
|
203
|
+
<table>
|
204
|
+
<thead>
|
205
|
+
<tr><th>Title</th> <th>Publication Date</th></tr>
|
206
|
+
</thead>
|
207
|
+
<tr><td>The Perfect Insider</td> <td>1996/4/5</td></tr>
|
208
|
+
<tr><td>Doctors in Isolated Room</td> <td>1996/7/5</td></tr>
|
209
|
+
<tr><td>Mathematical Goodbye</td> <td>1996/9/5</td></tr>
|
210
|
+
</table>
|
211
|
+
|
212
|
+
<h1>1997</h1>
|
213
|
+
<table>
|
214
|
+
<thead>
|
215
|
+
<tr><th>Title</th> <th>Publication Date</th></tr>
|
216
|
+
</thead>
|
217
|
+
<tr><td>Jack the Poetical Private</td> <td>1997/1/5</td></tr>
|
218
|
+
<tr><td>Who Inside</td> <td>1997/4/5</td></tr>
|
219
|
+
<tr><td>Illusion Acts Like Magic</td> <td>1997/10/5</td></tr>
|
220
|
+
</table>
|
221
|
+
|
222
|
+
<h1>1998</h1>
|
223
|
+
<table>
|
224
|
+
<thead>
|
225
|
+
<tr><th>Title</th> <th>Publication Date</th></tr>
|
226
|
+
</thead>
|
227
|
+
<tr><td>Replaceable Summer</td> <td>1998/1/7</td></tr>
|
228
|
+
<tr><td>Switch Back</td> <td>1998/4/5</td></tr>
|
229
|
+
<tr><td>Numerical Models</td> <td>1998/7/5</td></tr>
|
230
|
+
<tr><td>The Perfect Outsider</td> <td>1998/10/5</td></tr>
|
231
|
+
</table>
|
232
|
+
</body>
|
233
|
+
</html>
|
234
|
+
```
|
235
|
+
|
236
|
+
```ruby
|
237
|
+
agent = Mechanize.new
|
238
|
+
page = agent.get("http://yasuri.example.net")
|
239
|
+
|
240
|
+
node = Yasuri.struct_table '/html/body/table[1]/tr' do
|
241
|
+
text_title './td[1]'
|
242
|
+
text_pub_date './td[2]'
|
243
|
+
end
|
244
|
+
|
245
|
+
node.inject(agent, page)
|
246
|
+
#=> [ { "title" => "The Perfect Insider",
|
247
|
+
# "pub_date" => "1996/4/5" },
|
248
|
+
# { "title" => "Doctors in Isolated Room",
|
249
|
+
# "pub_date" => "1996/7/5" },
|
250
|
+
# { "title" => "Mathematical Goodbye",
|
251
|
+
# "pub_date" => "1996/9/5" }, ]
|
252
|
+
```
|
253
|
+
|
254
|
+
The Struct Node narrows down the `<tr>` tags in the first `<table>` by `'/html/body/table[1]/tr'`. Then
|
255
|
+
the two child nodes of the Struct Node parse those `<tr>` tags.
|
256
|
+
|
257
|
+
In this case, the first `<table>` contains three `<tr>` tags (not four: `<thead><tr>` does not match `Path`), so the Struct Node returns three hashes. Each hash contains the text parsed by the Text Nodes.
|
258
|
+
|
259
|
+
A Struct Node can contain nodes other than Text Nodes.
|
260
|
+
|
261
|
+
### Example
|
262
|
+
|
263
|
+
```ruby
|
264
|
+
agent = Mechanize.new
|
265
|
+
page = agent.get("http://yasuri.example.net")
|
266
|
+
|
267
|
+
node = Yasuri.struct_tables '/html/body/table' do
|
268
|
+
struct_table './tr' do
|
269
|
+
text_title './td[1]'
|
270
|
+
text_pub_date './td[2]'
|
271
|
+
end
|
272
|
+
end
|
273
|
+
|
274
|
+
node.inject(agent, page)
|
275
|
+
|
276
|
+
#=> [ { "table" => [ { "title" => "The Perfect Insider",
|
277
|
+
# "pub_date" => "1996/4/5" },
|
278
|
+
# { "title" => "Doctors in Isolated Room",
|
279
|
+
# "pub_date" => "1996/7/5" },
|
280
|
+
# { "title" => "Mathematical Goodbye",
|
281
|
+
# "pub_date" => "1996/9/5" }]},
|
282
|
+
# { "table" => [ { "title" => "Jack the Poetical Private",
|
283
|
+
# "pub_date" => "1997/1/5" },
|
284
|
+
# { "title" => "Who Inside",
|
285
|
+
# "pub_date" => "1997/4/5" },
|
286
|
+
# { "title" => "Illusion Acts Like Magic",
|
287
|
+
# "pub_date" => "1997/10/5" }]},
|
288
|
+
# { "table" => [ { "title" => "Replaceable Summer",
|
289
|
+
# "pub_date" => "1998/1/7" },
|
290
|
+
# { "title" => "Switch Back",
|
291
|
+
# "pub_date" => "1998/4/5" },
|
292
|
+
# { "title" => "Numerical Models",
|
293
|
+
# "pub_date" => "1998/7/5" },
|
294
|
+
# { "title" => "The Perfect Outsider",
|
295
|
+
# "pub_date" => "1998/10/5" }]}
|
296
|
+
# ]
|
297
|
+
```
|
298
|
+
|
299
|
+
### Options
|
300
|
+
None.
|
301
|
+
|
302
|
+
## Links Node
|
303
|
+
A Links Node returns the parsed text of each linked page.
|
304
|
+
|
305
|
+
### Example
|
306
|
+
```html
|
307
|
+
<!-- http://yasuri.example.net -->
|
308
|
+
<html>
|
309
|
+
<head><title>Yasuri Test</title></head>
|
310
|
+
<body>
|
311
|
+
<p>Hello,Yasuri</p>
|
312
|
+
<a href="./child01.html">child01</a>
|
313
|
+
<a href="./child02.html">child02</a>
|
314
|
+
<a href="./child03.html">child03</a>
|
315
|
+
</body>
|
316
|
+
</html>
|
317
|
+
```
|
318
|
+
|
319
|
+
```html
|
320
|
+
<!-- http://yasuri.example.net/child01.html -->
|
321
|
+
<html>
|
322
|
+
<head><title>Child 01 Test</title></head>
|
323
|
+
<body>
|
324
|
+
<p>Child 01 page.</p>
|
325
|
+
<ul>
|
326
|
+
<li><a href="./child01_sub.html">Child01_Sub</a></li>
|
327
|
+
<li><a href="./child02_sub.html">Child02_Sub</a></li>
|
328
|
+
</ul>
|
329
|
+
</body>
|
330
|
+
</html>
|
331
|
+
```
|
332
|
+
|
333
|
+
```html
|
334
|
+
<!-- http://yasuri.example.net/child02.html -->
|
335
|
+
<html>
|
336
|
+
<head><title>Child 02 Test</title></head>
|
337
|
+
<body>
|
338
|
+
<p>Child 02 page.</p>
|
339
|
+
</body>
|
340
|
+
</html>
|
341
|
+
```
|
342
|
+
|
343
|
+
```html
|
344
|
+
<!-- http://yasuri.example.net/child03.html -->
|
345
|
+
<html>
|
346
|
+
<head><title>Child 03 Test</title></head>
|
347
|
+
<body>
|
348
|
+
<p>Child 03 page.</p>
|
349
|
+
<ul>
|
350
|
+
<li><a href="./child03_sub.html">Child03_Sub</a></li>
|
351
|
+
</ul>
|
352
|
+
</body>
|
353
|
+
</html>
|
354
|
+
```
|
355
|
+
|
356
|
+
```ruby
|
357
|
+
agent = Mechanize.new
|
358
|
+
page = agent.get("http://yasuri.example.net")
|
359
|
+
|
360
|
+
node = Yasuri.links_title '/html/body/a' do
|
361
|
+
text_content '/html/body/p'
|
362
|
+
end
|
363
|
+
|
364
|
+
node.inject(agent, page)
|
365
|
+
#=> [ {"content" => "Child 01 page."},
|
366
|
+
# {"content" => "Child 02 page."},
|
367
|
+
# {"content" => "Child 03 page."}]
|
368
|
+
```
|
369
|
+
|
370
|
+
First, the Links Node finds all links in the page by its path. In this case, the LinksNode finds the `/html/body/a` tags in `http://yasuri.example.net`. Then it opens the pages given by their href attributes (`./child01.html`, `./child02.html` and `./child03.html`).
|
371
|
+
|
372
|
+
Then, the Links Node applies its child nodes to each opened page and returns the results for each page as an array.
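The two steps above (collect the links, then parse each linked page) can be sketched without Mechanize by modelling the site as a Hash. The `SITE` data and the `follow_links` helper below are hypothetical and only mirror the example pages; real pages would be fetched over HTTP and searched by xpath.

```ruby
# Hypothetical in-memory site: page name => links on the page and its <p> text.
SITE = {
  'index'   => { links: %w[child01 child02 child03], text: 'Hello,Yasuri' },
  'child01' => { links: %w[child01_sub child02_sub], text: 'Child 01 page.' },
  'child02' => { links: [],                          text: 'Child 02 page.' },
  'child03' => { links: %w[child03_sub],             text: 'Child 03 page.' }
}.freeze

# Step 1: take every link found on the start page.
# Step 2: open each linked page and collect its text under "content".
def follow_links(site, start)
  site.fetch(start)[:links].map do |name|
    { 'content' => site.fetch(name)[:text] }
  end
end

result = follow_links(SITE, 'index')
p result
# [{"content"=>"Child 01 page."}, {"content"=>"Child 02 page."}, {"content"=>"Child 03 page."}]
```

Only the links on the start page are followed; the `_sub` links on the child pages are not opened, matching the result shown in the example.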
|
373
|
+
|
374
|
+
### Options
|
375
|
+
None.
|
376
|
+
|
377
|
+
## Paginate Node
|
378
|
+
A Paginate Node parses and returns each of the pages provided by pagination.
|
379
|
+
|
380
|
+
### Example
|
381
|
+
The target page `page01.html` looks like this. `page02.html` to `page04.html` are similar.
|
382
|
+
|
383
|
+
```html
|
384
|
+
<!-- http://yasuri.example.net/page01.html -->
|
385
|
+
<html>
|
386
|
+
<head><title>Page01</title></head>
|
387
|
+
<body>
|
388
|
+
<p>Pagination01</p>
|
389
|
+
|
390
|
+
<nav class='pagination'>
|
391
|
+
<span class='prev'> PreviousPage </span>
|
392
|
+
<span class='page'> 1 </span>
|
393
|
+
<span class='page'> <a href="./page02.html">2</a> </span>
|
394
|
+
<span class='page'> <a href="./page03.html">3</a> </span>
|
395
|
+
<span class='page'> <a href="./page04.html">4</a> </span>
|
396
|
+
<span class='next'> <a href="./page02.html" class="next" rel="next"> NextPage </a> </span>
|
397
|
+
</nav>
|
398
|
+
|
399
|
+
</body>
|
400
|
+
</html>
|
401
|
+
```
|
402
|
+
|
403
|
+
```ruby
|
404
|
+
agent = Mechanize.new
|
405
|
+
page = agent.get("http://yasuri.example.net/page01.html")
|
406
|
+
|
407
|
+
node = Yasuri.pages_root "/html/body/nav/span/a[@class='next']" do
|
408
|
+
text_content '/html/body/p'
|
409
|
+
end
|
410
|
+
|
411
|
+
node.inject(agent, page)
|
412
|
+
#=> [ {"content" => "Pagination01"},
|
413
|
+
# {"content" => "Pagination02"},
|
414
|
+
# {"content" => "Pagination03"},
|
415
|
+
# {"content" => "Pagination04"}]
|
416
|
+
```
|
417
|
+
|
418
|
+
A Paginate Node requires a link to the next page. In this case, it is `NextPage` (`/html/body/nav/span/a[@class='next']`).
|
419
|
+
|
420
|
+
### Options
|
421
|
+
##### `limit`
|
422
|
+
Upper limit on the number of pages opened in pagination.
|
423
|
+
|
424
|
+
```ruby
|
425
|
+
node = Yasuri.pages_root "/html/body/nav/span/a[@class='next']" , limit:2 do
|
426
|
+
text_content '/html/body/p'
|
427
|
+
end
|
428
|
+
node.inject(agent, page)
|
429
|
+
#=> [ {"content" => "Pagination01"}, {"content" => "Pagination02"}]
|
430
|
+
```
|
431
|
+
The Paginate Node opens at most the 2 pages given by `limit`. In this situation the pagination has 4 pages, but the result Array has 2 texts because `limit:2` was given.
|
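The effect of `limit` on the Paginate Node in USAGE.md can be illustrated with a loop that follows "next" links until it runs out of pages or reaches the limit. This is a sketch over an in-memory Hash; the `PAGES` data and the `paginate` helper are hypothetical and stand in for fetching pages with Mechanize.

```ruby
# Hypothetical in-memory pagination: each page has its text and a next link.
PAGES = {
  'page01' => { text: 'Pagination01', next: 'page02' },
  'page02' => { text: 'Pagination02', next: 'page03' },
  'page03' => { text: 'Pagination03', next: 'page04' },
  'page04' => { text: 'Pagination04', next: nil }
}.freeze

# Follows the next link from page to page, collecting one result per page.
# Stops when there is no next page or when opts[:limit] results exist.
def paginate(pages, start, opts = {})
  limit = opts[:limit]
  results = []
  name = start
  while name && (limit.nil? || results.size < limit)
    page = pages.fetch(name)
    results << { 'content' => page[:text] }
    name = page[:next]
  end
  results
end

p paginate(PAGES, 'page01').size
# 4
p paginate(PAGES, 'page01', limit: 2)
# [{"content"=>"Pagination01"}, {"content"=>"Pagination02"}]
```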
data/lib/yasuri/version.rb
CHANGED
data/lib/yasuri/yasuri.rb
CHANGED
@@ -37,7 +37,7 @@ module Yasuri
|
|
37
37
|
}
|
38
38
|
Node2Text = Text2Node.invert
|
39
39
|
|
40
|
-
ReservedKeys =
|
40
|
+
ReservedKeys = [:node, :name, :path, :children]
|
41
41
|
def self.hash2node(node_h)
|
42
42
|
node, name, path, children = ReservedKeys.map do |key|
|
43
43
|
node_h[key]
|
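The `ReservedKeys.map` line in this hunk pulls the four structural fields out of a node hash in one destructuring assignment. A standalone sketch of that pattern follows; the sample `node_h` and the idea of treating the remaining keys as options are assumptions for illustration, since the rest of `hash2node` is not shown in this diff.

```ruby
RESERVED_KEYS = [:node, :name, :path, :children].freeze

# Sample node hash (hypothetical); :truncate is not a reserved key.
node_h = { node: 'text', name: 'content', path: '/html/body/p[1]', truncate: '^[^,]+' }

# One value per reserved key, in order; missing keys yield nil.
node, name, path, children = RESERVED_KEYS.map { |key| node_h[key] }

# Assumption of this sketch: whatever is left over is treated as options.
options = node_h.reject { |key, _value| RESERVED_KEYS.include?(key) }

p [node, name, path, children]
# ["text", "content", "/html/body/p[1]", nil]
p options
# {:truncate=>"^[^,]+"} (Ruby 3.4 and later print {truncate: "^[^,]+"})
```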
@@ -78,7 +78,8 @@ module Yasuri
|
|
78
78
|
json
|
79
79
|
end
|
80
80
|
|
81
|
-
def self.NodeName(name,
|
81
|
+
def self.NodeName(name, hash = {})
|
82
|
+
symbolize_names = hash[:symbolize_names] || false
|
82
83
|
symbolize_names ? name.to_sym : name
|
83
84
|
end
|
84
85
|
|
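The new `NodeName` signature takes a trailing options Hash with a default of `{}`. The removed line is truncated in this diff, so the old signature is not reproduced here. A self-contained sketch of the new form, wrapped in a hypothetical module `NodeNameSketch` so it can run without the gem:

```ruby
# Hypothetical wrapper module; the method body follows the added lines above.
module NodeNameSketch
  def self.node_name(name, hash = {})
    # A missing :symbolize_names key falls back to false.
    symbolize_names = hash[:symbolize_names] || false
    symbolize_names ? name.to_sym : name
  end
end

p NodeNameSketch.node_name('title')                         # "title"
p NodeNameSketch.node_name('title', symbolize_names: true)  # :title
```

Call sites that pass `symbolize_names: true` look the same as they would with a keyword argument, which keeps the change small for callers.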
data/lib/yasuri/yasuri_node.rb
CHANGED
@@ -7,9 +7,9 @@ module Yasuri
|
|
7
7
|
class PaginateNode
|
8
8
|
include Node
|
9
9
|
|
10
|
-
def initialize(xpath, name, children = [],
|
10
|
+
def initialize(xpath, name, children = [], hash = {})
|
11
11
|
super(xpath, name, children)
|
12
|
-
@limit = limit
|
12
|
+
@limit = hash[:limit]
|
13
13
|
end
|
14
14
|
|
15
15
|
def inject(agent, page, opt = {})
|
@@ -7,9 +7,12 @@ module Yasuri
|
|
7
7
|
class TextNode
|
8
8
|
include Node
|
9
9
|
|
10
|
-
def initialize(xpath, name, children = [],
|
10
|
+
def initialize(xpath, name, children = [], hash = {})
|
11
11
|
super(xpath, name, children)
|
12
12
|
|
13
|
+
truncate = hash[:truncate]
|
14
|
+
proc = hash[:proc]
|
15
|
+
|
13
16
|
truncate = Regexp.new(truncate) if not truncate.nil? # regexp or nil
|
14
17
|
@truncate = truncate
|
15
18
|
@truncate = Regexp.new(@truncate.to_s) if not @truncate.nil?
|
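Because `children` and `hash` are both optional positional parameters, an options Hash passed as the third argument binds to `children`, not to `hash`. That is presumably why the spec changes further down pass a placeholder third argument before `truncate:`. The sketch below uses a hypothetical `SketchNode` class that only mirrors the constructor signature; it is not `Yasuri::TextNode`.

```ruby
# Hypothetical class mirroring the new constructor signature.
class SketchNode
  attr_reader :children, :truncate

  def initialize(xpath, name, children = [], hash = {})
    @children = children
    truncate = hash[:truncate]
    # Normalize to a Regexp (or keep nil), as the diffed code does.
    @truncate = truncate.nil? ? nil : Regexp.new(truncate.to_s)
  end
end

# Options passed as the third argument land in +children+.
wrong = SketchNode.new("/html/head/title", "title", truncate: /^[^,]+/)
wrong.truncate
#=> nil

# With a placeholder for +children+, the options reach +hash+.
right = SketchNode.new("/html/head/title", "title", [], truncate: /^[^,]+/)
"Yasuri, Test"[right.truncate]
#=> "Yasuri"
```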
data/spec/spec_helper.rb
CHANGED
@@ -14,8 +14,8 @@ end
|
|
14
14
|
|
15
15
|
|
16
16
|
# ENV['CODECLIMATE_REPO_TOKEN'] = "0dc78d33107a7f11f257c0218ac1a37e0073005bb9734f2fd61d0f7e803fc151"
|
17
|
-
require "codeclimate-test-reporter"
|
18
|
-
CodeClimate::TestReporter.start
|
17
|
+
# require "codeclimate-test-reporter"
|
18
|
+
# CodeClimate::TestReporter.start
|
19
19
|
|
20
20
|
require 'simplecov'
|
21
21
|
require 'coveralls'
|
@@ -27,6 +27,7 @@ SimpleCov.formatter = SimpleCov::Formatter::MultiFormatter[
|
|
27
27
|
]
|
28
28
|
SimpleCov.start
|
29
29
|
|
30
|
+
|
30
31
|
$LOAD_PATH.unshift File.expand_path('../../lib', __FILE__)
|
31
32
|
require 'yasuri'
|
32
33
|
|
data/spec/yasuri_spec.rb
CHANGED
@@ -18,7 +18,7 @@ describe 'Yasuri' do
|
|
18
18
|
#############
|
19
19
|
describe '.json2tree' do
|
20
20
|
it "fail if empty json" do
|
21
|
-
expect { Yasuri.json2tree("{}") }.to raise_error
|
21
|
+
expect { Yasuri.json2tree("{}") }.to raise_error(RuntimeError)
|
22
22
|
end
|
23
23
|
|
24
24
|
it "return TextNode" do
|
@@ -39,7 +39,7 @@ describe 'Yasuri' do
|
|
39
39
|
"truncate" : "^[^,]+"
|
40
40
|
}|
|
41
41
|
generated = Yasuri.json2tree(src)
|
42
|
-
original = Yasuri::TextNode.new('/html/body/p[1]', "content", truncate:/^[^,]+/)
|
42
|
+
original = Yasuri::TextNode.new('/html/body/p[1]', "content", {}, truncate:/^[^,]+/)
|
43
43
|
compare_generated_vs_original(generated, original, @index_page)
|
44
44
|
end
|
45
45
|
|
@@ -153,7 +153,7 @@ describe 'Yasuri' do
|
|
153
153
|
end
|
154
154
|
|
155
155
|
it "return text node with truncate_regexp" do
|
156
|
-
node = Yasuri::TextNode.new("/html/head/title", "title", truncate:/^[^,]+/)
|
156
|
+
node = Yasuri::TextNode.new("/html/head/title", "title", {}, truncate:/^[^,]+/)
|
157
157
|
json = Yasuri.tree2json(node)
|
158
158
|
expected_str = %q| { "node": "text",
|
159
159
|
"name": "title",
|
@@ -81,7 +81,7 @@ describe 'Yasuri' do
|
|
81
81
|
node = Yasuri::StructNode.new(invalid_xpath, "table", [
|
82
82
|
Yasuri::TextNode.new('./td[1]', "title")
|
83
83
|
])
|
84
|
-
expect { node.inject(@agent, @page) }.to raise_error
|
84
|
+
expect { node.inject(@agent, @page) }.to raise_error(Nokogiri::XML::XPath::SyntaxError)
|
85
85
|
end
|
86
86
|
|
87
87
|
it 'fail with invalid xpath in children' do
|
@@ -90,7 +90,7 @@ describe 'Yasuri' do
|
|
90
90
|
Yasuri::TextNode.new(invalid_xpath, "title"),
|
91
91
|
Yasuri::TextNode.new('./td[2]', "pub_date"),
|
92
92
|
])
|
93
|
-
expect { node.inject(@agent, @page) }.to raise_error
|
93
|
+
expect { node.inject(@agent, @page) }.to raise_error(Nokogiri::XML::XPath::SyntaxError)
|
94
94
|
end
|
95
95
|
|
96
96
|
it 'scrape all tables' do
|
@@ -126,7 +126,7 @@ describe 'Yasuri' do
|
|
126
126
|
Yasuri::TextNode.new('./td[1]', "title"),
|
127
127
|
Yasuri::TextNode.new('./td[2]', "pub_date"),
|
128
128
|
])
|
129
|
-
expected = @table_1996.map{|h| h.map{|k,v| [k.to_sym, v] }
|
129
|
+
expected = @table_1996.map{|h| Hash[h.map{|k,v| [k.to_sym, v] }] }
|
130
130
|
actual = node.inject(@agent, @page, symbolize_names:true)
|
131
131
|
expect(actual).to match expected
|
132
132
|
end
|
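The corrected spec line wraps the inner `map` in `Hash[...]`. Calling `map` on a Hash returns an Array of pairs, so without the wrapper the expected value would not be an Array of Hashes. A minimal sketch with hypothetical sample data (the real `@table_1996` fixture is not shown in this diff):

```ruby
# Hypothetical row; the real fixture contents are not part of this diff.
table = [{ "title" => "A", "pub_date" => "B" }]

# Hash#map yields an Array of [key, value] pairs, not a Hash.
pairs = table.first.map { |k, v| [k.to_sym, v] }
#=> [[:title, "A"], [:pub_date, "B"]]

# Hash[] rebuilds a Hash from the pairs, now with Symbol keys.
expected = table.map { |h| Hash[h.map { |k, v| [k.to_sym, v] }] }
#=> [{:title => "A", :pub_date => "B"}]
```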
@@ -31,7 +31,7 @@ describe 'Yasuri' do
|
|
31
31
|
it 'fail with invalid xpath' do
|
32
32
|
invalid_xpath = '/html/body/no_match_node['
|
33
33
|
node = Yasuri::TextNode.new(invalid_xpath, "title")
|
34
|
-
expect { node.inject(@agent, @index_page) }.to raise_error
|
34
|
+
expect { node.inject(@agent, @index_page) }.to raise_error(Nokogiri::XML::XPath::SyntaxError)
|
35
35
|
end
|
36
36
|
|
37
37
|
it "can be defined by DSL, return single TextNode title" do
|
metadata
CHANGED
@@ -1,14 +1,14 @@
|
|
1
1
|
--- !ruby/object:Gem::Specification
|
2
2
|
name: yasuri
|
3
3
|
version: !ruby/object:Gem::Version
|
4
|
-
version:
|
4
|
+
version: 1.9.11
|
5
5
|
platform: ruby
|
6
6
|
authors:
|
7
7
|
- TAC
|
8
8
|
autorequire:
|
9
9
|
bindir: bin
|
10
10
|
cert_chain: []
|
11
|
-
date:
|
11
|
+
date: 2016-11-14 00:00:00.000000000 Z
|
12
12
|
dependencies:
|
13
13
|
- !ruby/object:Gem::Dependency
|
14
14
|
name: bundler
|
@@ -151,6 +151,8 @@ files:
|
|
151
151
|
- LICENSE
|
152
152
|
- README.md
|
153
153
|
- Rakefile
|
154
|
+
- USAGE.ja.md
|
155
|
+
- USAGE.md
|
154
156
|
- app.rb
|
155
157
|
- lib/yasuri.rb
|
156
158
|
- lib/yasuri/version.rb
|