yasuri 0.0.11 → 1.9.11

checksums.yaml CHANGED
@@ -1,7 +1,7 @@
1
1
  ---
2
2
  SHA1:
3
- metadata.gz: 2d97799a28c2b4d1991ce32abf50b81bce1cae5f
4
- data.tar.gz: d94f15f6056e52f87feb1538964e8a049afb7fcb
3
+ metadata.gz: 28e6a3903cec3d8036b718a7c5de4fb8df8dfbfa
4
+ data.tar.gz: fbfb4c2b3a042410a7d05b1604e004bdffd82779
5
5
  SHA512:
6
- metadata.gz: 193bb19ac39e9ea1ca74b2c44dc8c66840f630dcdcd32797acf28a6add47b9a5d878a9ffe3a482a544eccc75dc0f34aa03a4d199dffd311b6dc134f6471b5d35
7
- data.tar.gz: 3a360133ce54adb4bcc16637a53d78d1cbd5402d64c47b4afc8e852918f4ea613b4eea9ab20e8bf8c325cb67bf57c31577fa45ba9c04b98d186aa1a72b913f64
6
+ metadata.gz: e4312623046ecf451ef261b1d99164b7f79b861ad8d7fe49d7ba8fd2319cd840978ae88235e6fb10b6991cd5131ebec966ec56e3bddece8fbbefd0b53d4a9dfe
7
+ data.tar.gz: 54639066aa4511309a1f712b8920c81c09a7489e0cab6a91c453d2aeca07d4f1c3bed167ed25746411d5f87fa9cc9fff8c6a32567c4df8d1990f12cb053286e2
data/README.md CHANGED
@@ -2,6 +2,16 @@
2
2
 
3
3
  Yasuri (鑢) is an easy web-scraping library for supporting "[Mechanize](https://github.com/sparklemotion/mechanize)".
4
4
 
5
+ Yasuri simplifies processing that comes up frequently in scraping.
6
+
7
+ For example,
8
+
9
+ + Open the links in a page, scrape each linked page, and get the results as a Hash.
10
+ + Scrape several texts in a page and name each result in a Hash.
11
+ + Scrape each table that appears repeatedly in a page and get the results as an array.
12
+ + Scrape only the top 3 of the pages provided by pagination.
13
+
14
+ You can implement these easily with Yasuri.
5
15
 
6
16
  ## Sample
7
17
 
@@ -17,6 +27,14 @@ Add this line to your application's Gemfile:
17
27
  gem 'yasuri'
18
28
  ```
19
29
 
30
+ or
31
+
32
+ ```ruby
33
+ # for Ruby 1.9.3 or lower
34
+ gem 'yasuri', '~> 1.9'
35
+ ```
36
+
37
+
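As a rough illustration of what the `'~> 1.9'` constraint in the added Gemfile line selects, the sketch below checks a few versions with RubyGems' own `Gem::Requirement`. `1.9.11` and `0.0.11` are the versions named in this diff; `2.0.0` is a hypothetical probe chosen only to show the upper boundary.

```ruby
require 'rubygems'

# '~> 1.9' is the pessimistic operator: at least 1.9 and below 2.0.
REQUIREMENT = Gem::Requirement.new('~> 1.9')

# 2.0.0 is a made-up version used only to probe the upper bound.
%w[0.0.11 1.9.11 2.0.0].each do |version|
  matched = REQUIREMENT.satisfied_by?(Gem::Version.new(version))
  puts "#{version}: #{matched}"
end
```

Under this constraint Bundler would keep resolving to the 1.9.x line rather than to any release numbered 2.0 or higher.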
20
38
  And then execute:
21
39
 
22
40
  $ bundle
data/USAGE.ja.md ADDED
@@ -0,0 +1,433 @@
1
+ # How to use Yasuri
2
+
3
+ ## What is Yasuri
4
+ Yasuri (鑢) is a library that supports "[Mechanize](https://github.com/sparklemotion/mechanize)" to make web scraping easy.
5
+
6
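+ Yasuri makes it easy to write the processing that comes up frequently in scraping.
7
+ For example,
8
+
9
+ + Open several links in a page and get the result of scraping each linked page as a Hash
10
+ + Scrape several texts in a page, name them, and put them in a Hash
11
+ + Scrape each table that appears repeatedly in a page and get the results as an array
12
+ + Scrape, in order, only the top 3 of the pages provided by pagination
13
+
14
+ These can be implemented easily.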
15
+
16
+ ## Quick Start
17
+
18
+ ```
19
+ $ gem install yasuri
20
+ ```
21
+
22
+ ```ruby
23
+ require 'yasuri'
24
+ require 'mechanize'
25
+
26
+ # Construct the node tree with the DSL
27
+ root = Yasuri.links_root '//*[@id="menu"]/ul/li/a' do
28
+ text_title '//*[@id="contents"]/h2'
29
+ text_content '//*[@id="contents"]/p[1]'
30
+ end
31
+
32
+ agent = Mechanize.new
33
+ root_page = agent.get("http://some.scraping.page.net/")
34
+
35
+ result = root.inject(agent, root_page)
36
+ # => [ {"title" => "PageTitle1", "content" => "Page Contents1" },
37
+ # {"title" => "PageTitle2", "content" => "Page Contents2" }, ... ]
38
+
39
+ ```
40
+ This example scrapes, from each page linked by the xpath of the LinkNode (`links_root`), the two texts specified by the xpaths of the TextNodes (`text_title`, `text_content`).
41
+
42
+ (In other words, it opens each link matched by `//*[@id="menu"]/ul/li/a` and scrapes the texts specified by `//*[@id="contents"]/h2` and `//*[@id="contents"]/p[1]`.)
43
+
44
+ ## Basics
45
+
46
+ 1. Construct the parse tree
47
+ 2. Start parsing by giving it a Mechanize agent and the target page
48
+
49
+
50
+ ### Construct the parse tree
51
+
52
+ ```ruby
53
+ require 'mechanize'
54
+ require 'yasuri'
55
+
56
+
57
+ # 1. Construct the parse tree
58
+ tree = Yasuri.links_title '/html/body/a' do
59
+ text_name '/html/body/p'
60
+ end
61
+
62
+ # 2. Start parsing by giving it a Mechanize agent and the target page
63
+ agent = Mechanize.new
64
+ page = agent.get(uri)
65
+
66
+
67
+ tree.inject(agent, page)
68
+ ```
69
+
70
+ The tree can be defined with the DSL or with json. The example above uses the DSL.
71
+ The following example defines a parse tree equivalent to the above with json.
72
+
73
+ ```ruby
74
+ # When constructing with json
75
+ src = <<-EOJSON
76
+ { "node" : "links",
77
+ "name" : "title",
78
+ "path" : "/html/body/a",
79
+ "children" : [
80
+ { "node" : "text",
81
+ "name" : "name",
82
+ "path" : "/html/body/p"
83
+ }
84
+ ]
85
+ }
86
+ EOJSON
87
+ tree = Yasuri.json2tree(src)
88
+ ```
89
+
90
+
91
+ ### Node
92
+ The tree is made up of nested *Node*s.
93
+ A Node has `Type`, `Name`, `Path`, `Children`, and `Options`.
94
+
95
+ A Node is defined in the following format.
96
+
97
+ ```ruby
98
+ # Top level
99
+ Yasuri.<Type>_<Name> <Path> [,<Options>]
100
+
101
+ # Nested
102
+ Yasuri.<Type>_<Name> <Path> [,<Options>] do
103
+ <Type>_<Name> <Path> [,<Options>] do
104
+ <Children>
105
+ end
106
+ end
107
+ ```
108
+
109
+ #### Type
110
+ *Type* indicates the behavior of a Node. The following types are available.
111
+
112
+ - *Text*
113
+ - *Struct*
114
+ - *Links*
115
+ - *Paginate*
116
+
117
+ ### Name
118
+ *Name* becomes the key in the Hash of parse results.
119
+
120
+ ### Path
121
+ *Path* specifies a particular node in the HTML by xpath or css selector.
122
+ It is used by Mechanize `search`.
123
+
124
+ ### Children
125
+ The child nodes of a nested node. A TextNode has no child nodes because it is a leaf of the tree.
126
+
127
+ ### Options
128
+ Parse options. The options differ for each Type.
129
+ Calling the `opt` method on a node returns the available options.
130
+
131
+ ```
132
+ # TextNode example
133
+ node = Yasuri.text_title '/html/body/p[1]', truncate:/^[^,]+/
134
+ node.opt #=> {:truncate => /^[^,]+/, :proc => nil}
135
+ ```
136
+
137
+ ## Text Node
138
+ *TextNode* returns the scraped text. This node is always a leaf in the parse tree.
139
+
140
+ ### Example
141
+
142
+ ```html
143
+ <!-- http://yasuri.example.net -->
144
+ <html>
145
+ <head></head>
146
+ <body>
147
+ <p>Hello,World</p>
148
+ <p>Hello,Yasuri</p>
149
+ </body>
150
+ </html>
151
+ ```
152
+
153
+ ```ruby
154
+ agent = Mechanize.new
155
+ page = agent.get("http://yasuri.example.net")
156
+
157
+ p1 = Yasuri.text_title '/html/body/p[1]'
158
+ p1t = Yasuri.text_title '/html/body/p[1]', truncate:/^[^,]+/
159
+ p2u = Yasuri.text_title '/html/body/p[2]', proc: :upcase
160
+
161
+ p1.inject(agent, page) #=> { "title" => "Hello,World" }
162
+ p1t.inject(agent, page) #=> { "title" => "Hello" }
163
+ p2u.inject(agent, page) #=> { "title" => "HELLO,YASURI" }
164
+ ```
165
+
166
+ ### Options
167
+ ##### `truncate`
168
+ Extracts the string that matches the regular expression. If a group is specified, only the first matched group is returned.
169
+
170
+ ```ruby
171
+ node = Yasuri.text_example '/html/body/p[1]', truncate:/H(.+)i/
172
+ node.inject(agent, index_page)
173
+ #=> { "example" => "ello,Yasur" }
174
+ ```
175
+
176
+
177
+ ##### `proc`
178
+ Calls the method specified by the symbol, with the extracted string (String) as the receiver.
179
+ If the `truncate` option is also specified, the method is called on the string after `truncate` has been applied.
180
+
181
+ ```ruby
182
+ node = Yasuri.text_example '/html/body/p[1]', proc: :upcase, truncate:/H(.+)i/
183
+ node.inject(agent, index_page)
184
+ #=> { "example" => "ELLO,YASUR" }
185
+ ```
186
+
187
+ ## Struct Node
188
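+ *Struct Node* returns text as a structured Hash.
189
+
190
+ First, the Struct Node narrows down the HTML tags by `Path`.
191
+ The child nodes of the Struct Node parse these narrowed-down tags, and the Struct Node returns a Hash containing the results of the child nodes.
192
+
193
+ If the `Path` of the Struct Node matches multiple tags, it returns the results as an array.
194
+
195
+ ### Example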
196
+
197
+ ```html
198
+ <!-- http://yasuri.example.net -->
199
+ <html>
200
+ <head>
201
+ <title>Books</title>
202
+ </head>
203
+ <body>
204
+ <h1>1996</h1>
205
+ <table>
206
+ <thead>
207
+ <tr><th>Title</th> <th>Publication Date</th></tr>
208
+ </thead>
209
+ <tr><td>The Perfect Insider</td> <td>1996/4/5</td></tr>
210
+ <tr><td>Doctors in Isolated Room</td> <td>1996/7/5</td></tr>
211
+ <tr><td>Mathematical Goodbye</td> <td>1996/9/5</td></tr>
212
+ </table>
213
+
214
+ <h1>1997</h1>
215
+ <table>
216
+ <thead>
217
+ <tr><th>Title</th> <th>Publication Date</th></tr>
218
+ </thead>
219
+ <tr><td>Jack the Poetical Private</td> <td>1997/1/5</td></tr>
220
+ <tr><td>Who Inside</td> <td>1997/4/5</td></tr>
221
+ <tr><td>Illusion Acts Like Magic</td> <td>1997/10/5</td></tr>
222
+ </table>
223
+
224
+ <h1>1998</h1>
225
+ <table>
226
+ <thead>
227
+ <tr><th>Title</th> <th>Publication Date</th></tr>
228
+ </thead>
229
+ <tr><td>Replaceable Summer</td> <td>1998/1/7</td></tr>
230
+ <tr><td>Switch Back</td> <td>1998/4/5</td></tr>
231
+ <tr><td>Numerical Models</td> <td>1998/7/5</td></tr>
232
+ <tr><td>The Perfect Outsider</td> <td>1998/10/5</td></tr>
233
+ </table>
234
+ </body>
235
+ </html>
236
+ ```
237
+
238
+ ```ruby
239
+ agent = Mechanize.new
240
+ page = agent.get("http://yasuri.example.net")
241
+
242
+ node = Yasuri.struct_table '/html/body/table[1]/tr' do
243
+ text_title './td[1]'
244
+ text_pub_date './td[2]'
245
+ end
246
+
247
+ node.inject(agent, page)
248
+ #=> [ { "title" => "The Perfect Insider",
249
+ # "pub_date" => "1996/4/5" },
250
+ # { "title" => "Doctors in Isolated Room",
251
+ # "pub_date" => "1996/7/5" },
252
+ # { "title" => "Mathematical Goodbye",
253
+ # "pub_date" => "1996/9/5" }, ]
254
+ ```
255
+
256
+ The Struct Node narrows down all `<tr>` tags in the first `<table>` by the xpath `'/html/body/table[1]/tr'`.
257
+ Then the `<tr>` tags are parsed by the two child TextNodes.
258
+ In this case the first `<table>` has three `<tr>` tags, so three Hashes are returned. (Note that it is not four, because `<thead><tr>` does not match `Path`.)
259
+ Each Hash contains the text parsed by the TextNodes.
260
+
261
+
262
+ As in the following example, a Struct Node can also have nodes other than TextNode as child nodes.
263
+
264
+ ### Example
265
+
266
+ ```ruby
267
+ agent = Mechanize.new
268
+ page = agent.get("http://yasuri.example.net")
269
+
270
+ node = Yasuri.struct_tables '/html/body/table' do
271
+ struct_table './tr' do
272
+ text_title './td[1]'
273
+ text_pub_date './td[2]'
274
+ end
275
+ end
276
+
277
+ node.inject(agent, page)
278
+
279
+ #=> [ { "table" => [ { "title" => "The Perfect Insider",
280
+ # "pub_date" => "1996/4/5" },
281
+ # { "title" => "Doctors in Isolated Room",
282
+ # "pub_date" => "1996/7/5" },
283
+ # { "title" => "Mathematical Goodbye",
284
+ # "pub_date" => "1996/9/5" }]},
285
+ # { "table" => [ { "title" => "Jack the Poetical Private",
286
+ # "pub_date" => "1997/1/5" },
287
+ # { "title" => "Who Inside",
288
+ # "pub_date" => "1997/4/5" },
289
+ # { "title" => "Illusion Acts Like Magic",
290
+ # "pub_date" => "1997/10/5" }]},
291
+ # { "table" => [ { "title" => "Replaceable Summer",
292
+ # "pub_date" => "1998/1/7" },
293
+ # { "title" => "Switch Back",
294
+ # "pub_date" => "1998/4/5" },
295
+ # { "title" => "Numerical Models",
296
+ # "pub_date" => "1998/7/5" },
297
+ # { "title" => "The Perfect Outsider",
298
+ # "pub_date" => "1998/10/5" }]}
299
+ # ]
300
+ ```
301
+
302
+ ### Options
303
+ None
304
+
305
+ ## Links Node
306
+ A Links Node parses each linked page and returns the results.
307
+
308
+ ### Example
309
+ ```
310
+ <!-- http://yasuri.example.net -->
311
+ <html>
312
+ <head><title>Yasuri Test</title></head>
313
+ <body>
314
+ <p>Hello,Yasuri</p>
315
+ <a href="./child01.html">child01</a>
316
+ <a href="./child02.html">child02</a>
317
+ <a href="./child03.html">child03</a>
318
+ </body>
319
+ </html>
320
+ ```
321
+
322
+ ```
323
+ <!-- http://yasuri.example.net/child01.html -->
324
+ <html>
325
+ <head><title>Child 01 Test</title></head>
326
+ <body>
327
+ <p>Child 01 page.</p>
328
+ <ul>
329
+ <li><a href="./child01_sub.html">Child01_Sub</a></li>
330
+ <li><a href="./child02_sub.html">Child02_Sub</a></li>
331
+ </ul>
332
+ </body>
333
+ </html>
334
+ ```
335
+
336
+ ```
337
+ <!-- http://yasuri.example.net/child02.html -->
338
+ <html>
339
+ <head><title>Child 02 Test</title></head>
340
+ <body>
341
+ <p>Child 02 page.</p>
342
+ </body>
343
+ </html>
344
+ ```
345
+
346
+ ```
347
+ <!-- http://yasuri.example.net/child03.html -->
348
+ <html>
349
+ <head><title>Child 03 Test</title></head>
350
+ <body>
351
+ <p>Child 03 page.</p>
352
+ <ul>
353
+ <li><a href="./child03_sub.html">Child03_Sub</a></li>
354
+ </ul>
355
+ </body>
356
+ </html>
357
+ ```
358
+
359
+ ```
360
+ agent = Mechanize.new
361
+ page = agent.get("http://yasuri.example.net")
362
+
363
+ node = Yasuri.links_title '/html/body/a' do
364
+ text_content '/html/body/p'
365
+ end
366
+
367
+ node.inject(agent, page)
368
+ #=> [ {"content" => "Child 01 page."},
369
+ # {"content" => "Child 02 page."},
370
+ # {"content" => "Child 03 page."}]
371
+ ```
372
+
373
+ First, the LinksNode finds all links matching `Path` in the first page.
374
+ In this example, the LinksNode finds all tags matching `/html/body/a` in `http://yasuri.example.net`.
375
+ Next, it opens the pages specified by the href attributes of the tags it found (`./child01.html`, `./child02.html`, `./child03.html`).
376
+
377
+ The child nodes then parse each opened page. The LinksNode returns the parse results for each page as an array of Hashes.
378
+
379
+ ## Paginate Node
380
+ A PaginateNode parses, in order, each page that can be reached through pagination.
381
+
382
+ ### Example
383
+ In this example, the target page `page01.html` is assumed to look like this.
384
+ `page02.html` through `page04.html` are similar.
385
+
386
+ ```html
387
+ <!-- http://yasuri.example.net/page01.html -->
388
+ <html>
389
+ <head><title>Page01</title></head>
390
+ <body>
391
+ <p>Pagination01</p>
392
+
393
+ <nav class='pagination'>
394
+ <span class='prev'> &laquo; PreviousPage </span>
395
+ <span class='page'> 1 </span>
396
+ <span class='page'> <a href="./page02.html">2</a> </span>
397
+ <span class='page'> <a href="./page03.html">3</a> </span>
398
+ <span class='page'> <a href="./page04.html">4</a> </span>
399
+ <span class='next'> <a href="./page02.html" class="next" rel="next">NextPage &raquo;</a> </span>
400
+ </nav>
401
+
402
+ </body>
403
+ </html>
404
+ ```
405
+
406
+ ```ruby
407
+ agent = Mechanize.new
408
+ page = agent.get("http://yasuri.example.net/page01.html")
409
+
410
+ node = Yasuri.pages_root "/html/body/nav/span/a[@class='next']" , limit:3 do
411
+ text_content '/html/body/p'
412
+ end
413
+
414
+ node.inject(agent, page)
415
+ #=> [ {"content" => "Pagination01"},
416
+ # {"content" => "Pagination02"},
417
+ # {"content" => "Pagination03"}]
418
+ ```
419
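+ A PaginateNode needs a link pointing to the next page to be specified as `Path`.
420
+ In this example, `NextPage` (`/html/body/nav/span/a[@class='next']`) is the link pointing to the next page.
421
+
422
+ ### Options
423
+ ##### `limit`
424
+ Specifies the upper limit on the number of pages to follow.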
425
+
426
+ ```ruby
427
+ node = Yasuri.pages_root "/html/body/nav/span/a[@class='next']" , limit:2 do
428
+ text_content '/html/body/p'
429
+ end
430
+ node.inject(agent, page)
431
+ #=> [ {"content" => "Pagination01"}, {"content" => "Pagination02"}]
432
+ ```
433
+ In this case, the PaginateNode opens and parses at most 2 pages. The pagination appears to have 4 pages, but because `limit:2` is specified, the result array contains only 2 results.
data/USAGE.md ADDED
@@ -0,0 +1,431 @@
1
+ # Yasuri Usage
2
+
3
+ ## What is Yasuri
4
+ `Yasuri` is an easy web-scraping library for supporting "Mechanize".
5
+
6
+ Yasuri (鑢) is an easy web-scraping library for supporting "[Mechanize](https://github.com/sparklemotion/mechanize)".
7
+
8
+ Yasuri simplifies processing that comes up frequently in scraping.
9
+
10
+ For example,
11
+
12
+ + Open the links in a page, scrape each linked page, and get the results as a Hash.
13
+ + Scrape several texts in a page and name each result in a Hash.
14
+ + Scrape each table that appears repeatedly in a page and get the results as an array.
15
+ + Scrape only the top 3 of the pages provided by pagination.
16
+
17
+ You can implement these easily with Yasuri.
18
+
19
+ ## Quick Start
20
+
21
+ ```
22
+ $ gem install yasuri
23
+ ```
24
+
25
+ ```ruby
26
+ require 'yasuri'
27
+ require 'mechanize'
28
+
29
+ # Construct the node tree with the DSL
30
+ root = Yasuri.links_root '//*[@id="menu"]/ul/li/a' do
31
+ text_title '//*[@id="contents"]/h2'
32
+ text_content '//*[@id="contents"]/p[1]'
33
+ end
34
+
35
+ agent = Mechanize.new
36
+ root_page = agent.get("http://some.scraping.page.net/")
37
+
38
+ result = root.inject(agent, root_page)
39
+ # => [ {"title" => "PageTitle1", "content" => "Page Contents1" },
40
+ # {"title" => "PageTitle2", "content" => "Page Contents2" }, ... ]
41
+
42
+ ```
43
+ This example scrapes two texts, expressed by the xpaths of the TextNodes (`text_title`, `text_content`), from each page linked by the xpath of the LinkNode (`links_root`).
44
+
45
+ (i.e. it opens each link matched by `//*[@id="menu"]/ul/li/a` and scrapes `//*[@id="contents"]/h2` and `//*[@id="contents"]/p[1]`.)
46
+
47
+ ## Basics
48
+
49
+ 1. Construct the parse tree.
50
+ 2. Start parsing with a Mechanize agent and the first page.
51
+
52
+ ### Construct the parse tree
53
+
54
+ ```ruby
55
+ require 'mechanize'
56
+ require 'yasuri'
57
+
58
+
59
+ # 1. Construct parse tree.
60
+ tree = Yasuri.links_title '/html/body/a' do
61
+ text_name '/html/body/p'
62
+ end
63
+
64
+ # 2. Start parse with Mechanize agent and first page.
65
+ agent = Mechanize.new
66
+ page = agent.get(uri)
67
+
68
+
69
+ tree.inject(agent, page)
70
+ ```
71
+
72
+ A tree can be defined in 2 (+1) ways: with the DSL or with json (or with plain Ruby code). The example above uses the DSL; the following defines an equivalent tree with json.
73
+
74
+ ```ruby
75
+ # Construct by json.
76
+ src = <<-EOJSON
77
+ { "node" : "links",
78
+ "name" : "title",
79
+ "path" : "/html/body/a",
80
+ "children" : [
81
+ { "node" : "text",
82
+ "name" : "name",
83
+ "path" : "/html/body/p"
84
+ }
85
+ ]
86
+ }
87
+ EOJSON
88
+ tree = Yasuri.json2tree(src)
89
+ ```
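Because the json form is plain data, it can be inspected with the standard `json` library before it is handed to `Yasuri.json2tree`. The sketch below does only that; `describe_tree` is a hypothetical helper that is not part of Yasuri, and it simply renders each entry in the `<Type>_<Name> <Path>` shape used by the DSL.

```ruby
require 'json'

# The same definition as in the USAGE example.
SRC = <<-EOJSON
{ "node"     : "links",
  "name"     : "title",
  "path"     : "/html/body/a",
  "children" : [
    { "node" : "text",
      "name" : "name",
      "path" : "/html/body/p"
    }
  ]
}
EOJSON

# Hypothetical helper (not a Yasuri API): list every node in the
# definition as "<node>_<name> <path>", indented by nesting depth.
def describe_tree(node_h, depth = 0)
  line = "#{'  ' * depth}#{node_h['node']}_#{node_h['name']} #{node_h['path']}"
  children = node_h.fetch('children', [])
  [line] + children.flat_map { |child| describe_tree(child, depth + 1) }
end

puts describe_tree(JSON.parse(SRC))
```

The two lines it prints correspond to the `links_title` and `text_name` calls of the DSL version.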
90
+
91
+ ### Node
92
+ A tree is constructed from nested Nodes.
93
+ A Node has `Type`, `Name`, `Path`, `Children`, and `Options`.
94
+
95
+ A Node is defined in the following format.
96
+
97
+
98
+ ```ruby
99
+ # Top Level
100
+ Yasuri.<Type>_<Name> <Path> [,<Options>]
101
+
102
+ # Nested
103
+ Yasuri.<Type>_<Name> <Path> [,<Options>] do
104
+ <Type>_<Name> <Path> [,<Options>] do
105
+ <Children>
106
+ end
107
+ end
108
+ ```
109
+
110
+ #### Type
111
+ Type means the behavior of a Node. The following types are available.
112
+
113
+ - *Text*
114
+ - *Struct*
115
+ - *Links*
116
+ - *Paginate*
117
+
118
+ ### Name
119
+ Name is used as the key in the returned hash.
120
+
121
+ ### Path
122
+ Path determines the target node by xpath or css selector. It is passed to Mechanize `search`.
123
+
124
+ ### Children
125
+ Child nodes. A TextNode never has children, because a TextNode is always a leaf.
126
+
127
+ ### Options
128
+ Parse options. They differ for each type. You can get the options and their values with the `opt` method.
129
+
130
+ ```ruby
131
+ # TextNode example
132
+ node = Yasuri.text_title '/html/body/p[1]', truncate:/^[^,]+/
133
+ node.opt #=> {:truncate => /^[^,]+/, :proc => nil}
134
+ ```
135
+
136
+ ## Text Node
137
+ A TextNode returns the scraped text. This node has to be a leaf.
138
+
139
+ ### Example
140
+
141
+ ```html
142
+ <!-- http://yasuri.example.net -->
143
+ <html>
144
+ <head></head>
145
+ <body>
146
+ <p>Hello,World</p>
147
+ <p>Hello,Yasuri</p>
148
+ </body>
149
+ </html>
150
+ ```
151
+
152
+ ```ruby
153
+ agent = Mechanize.new
154
+ page = agent.get("http://yasuri.example.net")
155
+
156
+ p1 = Yasuri.text_title '/html/body/p[1]'
157
+ p1t = Yasuri.text_title '/html/body/p[1]', truncate:/^[^,]+/
158
+ p2u = Yasuri.text_title '/html/body/p[2]', proc: :upcase
159
+
160
+ p1.inject(agent, page) #=> { "title" => "Hello,World" }
161
+ p1t.inject(agent, page) #=> { "title" => "Hello" }
162
+ p2u.inject(agent, page) #=> { "title" => "HELLO,YASURI" }
163
+ ```
164
+
165
+ ### Options
166
+ ##### `truncate`
167
+ Matches the text against a regexp and truncates it to the matched part. When the regexp contains a group, only the first matched group is returned.
168
+
169
+ ```ruby
170
+ node = Yasuri.text_example '/html/body/p[1]', truncate:/H(.+)i/
171
+ node.inject(agent, index_page)
172
+ #=> { "example" => "ello,Yasur" }
173
+ ```
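The group rule described for `truncate` can be mimicked with plain Ruby regexps, without Yasuri or Mechanize. This is a minimal sketch and not Yasuri's implementation: `truncate_text` is a hypothetical name, and returning an empty string when nothing matches is an assumption made here.

```ruby
# Hypothetical stand-in for the `truncate` option: return the first
# capture group when the pattern has one, otherwise the whole match.
# Returning "" when nothing matches is an assumption of this sketch.
def truncate_text(text, pattern)
  match = pattern.match(text)
  return "" if match.nil?
  match[1] || match[0]
end

puts truncate_text("Hello,Yasuri", /H(.+)i/)  # prints "ello,Yasur"
puts truncate_text("Hello,World", /^[^,]+/)   # prints "Hello"
```

The first call has a group, so only the group is kept; the second has none, so the whole match is returned.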
174
+
175
+
176
+ ##### `proc`
177
+ Applies a method to the text. The method is given as a Symbol.
178
+ If the `truncate` option is also given, the method is applied after truncation.
179
+
180
+ ```ruby
181
+ node = Yasuri.text_example '/html/body/p[1]', proc: :upcase, truncate:/H(.+)i/
182
+ node.inject(agent, index_page)
183
+ #=> { "example" => "ELLO,YASUR" }
184
+ ```
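The ordering stated for `proc` (truncate first, then call the method named by the Symbol) can be sketched in plain Ruby. `extract_text` is a hypothetical helper and not Yasuri's code; `public_send` is used here as one ordinary way to call a method by Symbol.

```ruby
# Hypothetical stand-in for combining `truncate` and `proc`:
# first cut the text with the regexp, then send the Symbol to the result.
def extract_text(text, options = {})
  if options[:truncate]
    match = options[:truncate].match(text)
    # Assumption of this sketch: an unmatched pattern yields "".
    text = match ? (match[1] || match[0]) : ""
  end
  text = text.public_send(options[:proc]) if options[:proc]
  text
end

puts extract_text("Hello,Yasuri", proc: :upcase, truncate: /H(.+)i/)  # prints "ELLO,YASUR"
```

The order in which the two options are written in the call does not matter; truncation is always applied before the method.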
185
+
186
+ ## Struct Node
187
+ A Struct Node returns structured text.
188
+
189
+ First, the Struct Node narrows down the sub-tags by `Path`. The child nodes parse the narrowed-down tags, and the Struct Node returns a hash containing the parsed results.
190
+
191
+ If the `Path` of a Struct Node matches multiple sub-tags, the child nodes parse each of them and the Struct Node returns an array.
192
+
193
+ ### Example
194
+
195
+ ```html
196
+ <!-- http://yasuri.example.net -->
197
+ <html>
198
+ <head>
199
+ <title>Books</title>
200
+ </head>
201
+ <body>
202
+ <h1>1996</h1>
203
+ <table>
204
+ <thead>
205
+ <tr><th>Title</th> <th>Publication Date</th></tr>
206
+ </thead>
207
+ <tr><td>The Perfect Insider</td> <td>1996/4/5</td></tr>
208
+ <tr><td>Doctors in Isolated Room</td> <td>1996/7/5</td></tr>
209
+ <tr><td>Mathematical Goodbye</td> <td>1996/9/5</td></tr>
210
+ </table>
211
+
212
+ <h1>1997</h1>
213
+ <table>
214
+ <thead>
215
+ <tr><th>Title</th> <th>Publication Date</th></tr>
216
+ </thead>
217
+ <tr><td>Jack the Poetical Private</td> <td>1997/1/5</td></tr>
218
+ <tr><td>Who Inside</td> <td>1997/4/5</td></tr>
219
+ <tr><td>Illusion Acts Like Magic</td> <td>1997/10/5</td></tr>
220
+ </table>
221
+
222
+ <h1>1998</h1>
223
+ <table>
224
+ <thead>
225
+ <tr><th>Title</th> <th>Publication Date</th></tr>
226
+ </thead>
227
+ <tr><td>Replaceable Summer</td> <td>1998/1/7</td></tr>
228
+ <tr><td>Switch Back</td> <td>1998/4/5</td></tr>
229
+ <tr><td>Numerical Models</td> <td>1998/7/5</td></tr>
230
+ <tr><td>The Perfect Outsider</td> <td>1998/10/5</td></tr>
231
+ </table>
232
+ </body>
233
+ </html>
234
+ ```
235
+
236
+ ```ruby
237
+ agent = Mechanize.new
238
+ page = agent.get("http://yasuri.example.net")
239
+
240
+ node = Yasuri.struct_table '/html/body/table[1]/tr' do
241
+ text_title './td[1]'
242
+ text_pub_date './td[2]'
243
+ end
244
+
245
+ node.inject(agent, page)
246
+ #=> [ { "title" => "The Perfect Insider",
247
+ # "pub_date" => "1996/4/5" },
248
+ # { "title" => "Doctors in Isolated Room",
249
+ # "pub_date" => "1996/7/5" },
250
+ # { "title" => "Mathematical Goodbye",
251
+ # "pub_date" => "1996/9/5" }, ]
252
+ ```
253
+
254
+ The StructNode narrows down the `<tr>` tags in the first `<table>` by `'/html/body/table[1]/tr'`. Then,
255
+ the `<tr>` tags are parsed by the two child nodes of the Struct Node.
256
+
257
+ In this case, the first `<table>` contains three matching `<tr>` tags (not four: `<thead><tr>` does not match `Path`), so the Struct Node returns three hashes. Each hash contains the text parsed by the Text Nodes.
258
+
259
+ A Struct Node can also contain nodes other than Text Nodes.
260
+
261
+ ### Example
262
+
263
+ ```ruby
264
+ agent = Mechanize.new
265
+ page = agent.get("http://yasuri.example.net")
266
+
267
+ node = Yasuri.struct_tables '/html/body/table' do
268
+ struct_table './tr' do
269
+ text_title './td[1]'
270
+ text_pub_date './td[2]'
271
+ end
272
+ end
273
+
274
+ node.inject(agent, page)
275
+
276
+ #=> [ { "table" => [ { "title" => "The Perfect Insider",
277
+ # "pub_date" => "1996/4/5" },
278
+ # { "title" => "Doctors in Isolated Room",
279
+ # "pub_date" => "1996/7/5" },
280
+ # { "title" => "Mathematical Goodbye",
281
+ # "pub_date" => "1996/9/5" }]},
282
+ # { "table" => [ { "title" => "Jack the Poetical Private",
283
+ # "pub_date" => "1997/1/5" },
284
+ # { "title" => "Who Inside",
285
+ # "pub_date" => "1997/4/5" },
286
+ # { "title" => "Illusion Acts Like Magic",
287
+ # "pub_date" => "1997/10/5" }]},
288
+ # { "table" => [ { "title" => "Replaceable Summer",
289
+ # "pub_date" => "1998/1/7" },
290
+ # { "title" => "Switch Back",
291
+ # "pub_date" => "1998/4/5" },
292
+ # { "title" => "Numerical Models",
293
+ # "pub_date" => "1998/7/5" },
294
+ # { "title" => "The Perfect Outsider",
295
+ # "pub_date" => "1998/10/5" }]}
296
+ # ]
297
+ ```
298
+
299
+ ### Options
300
+ None.
301
+
302
+ ## Links Node
303
+ A Links Node returns the parsed text of each linked page.
304
+
305
+ ### Example
306
+ ```html
307
+ <!-- http://yasuri.example.net -->
308
+ <html>
309
+ <head><title>Yasuri Test</title></head>
310
+ <body>
311
+ <p>Hello,Yasuri</p>
312
+ <a href="./child01.html">child01</a>
313
+ <a href="./child02.html">child02</a>
314
+ <a href="./child03.html">child03</a>
315
+ </body>
316
+ </html>
317
+ ```
318
+
319
+ ```html
320
+ <!-- http://yasuri.example.net/child01.html -->
321
+ <html>
322
+ <head><title>Child 01 Test</title></head>
323
+ <body>
324
+ <p>Child 01 page.</p>
325
+ <ul>
326
+ <li><a href="./child01_sub.html">Child01_Sub</a></li>
327
+ <li><a href="./child02_sub.html">Child02_Sub</a></li>
328
+ </ul>
329
+ </body>
330
+ </html>
331
+ ```
332
+
333
+ ```html
334
+ <!-- http://yasuri.example.net/child02.html -->
335
+ <html>
336
+ <head><title>Child 02 Test</title></head>
337
+ <body>
338
+ <p>Child 02 page.</p>
339
+ </body>
340
+ </html>
341
+ ```
342
+
343
+ ```html
344
+ <!-- http://yasuri.example.net/child03.html -->
345
+ <html>
346
+ <head><title>Child 03 Test</title></head>
347
+ <body>
348
+ <p>Child 03 page.</p>
349
+ <ul>
350
+ <li><a href="./child03_sub.html">Child03_Sub</a></li>
351
+ </ul>
352
+ </body>
353
+ </html>
354
+ ```
355
+
356
+ ```ruby
357
+ agent = Mechanize.new
358
+ page = agent.get("http://yasuri.example.net")
359
+
360
+ node = Yasuri.links_title '/html/body/a' do
361
+ text_content '/html/body/p'
362
+ end
363
+
364
+ node.inject(agent, page)
365
+ #=> [ {"content" => "Child 01 page."},
366
+ # {"content" => "Child 02 page."},
367
+ # {"content" => "Child 03 page."}]
368
+ ```
369
+
370
+ First, the Links Node finds all links in the page by `Path`. In this case, the LinksNode finds the `/html/body/a` tags in `http://yasuri.example.net`. Then it opens the pages given by their href attributes (`./child01.html`, `./child02.html` and `./child03.html`).
371
+
372
+ Then the Links Node applies its child nodes to each opened page and returns the results for each page as an array.
373
+
374
+ ### Options
375
+ None.
376
+
377
+ ## Paginate Node
378
+ A Paginate Node parses each page provided by pagination and returns the results.
379
+
380
+ ### Example
381
+ The target page `page01.html` looks like this. `page02.html` to `page04.html` are similar.
382
+
383
+ ```html
384
+ <!-- http://yasuri.example.net/page01.html -->
385
+ <html>
386
+ <head><title>Page01</title></head>
387
+ <body>
388
+ <p>Pagination01</p>
389
+
390
+ <nav class='pagination'>
391
+ <span class='prev'> PreviousPage </span>
392
+ <span class='page'> 1 </span>
393
+ <span class='page'> <a href="./page02.html">2</a> </span>
394
+ <span class='page'> <a href="./page03.html">3</a> </span>
395
+ <span class='page'> <a href="./page04.html">4</a> </span>
396
+ <span class='next'> <a href="./page02.html" class="next" rel="next"> NextPage </a> </span>
397
+ </nav>
398
+
399
+ </body>
400
+ </html>
401
+ ```
402
+
403
+ ```ruby
404
+ agent = Mechanize.new
405
+ page = agent.get("http://yasuri.example.net/page01.html")
406
+
407
+ node = Yasuri.pages_root "/html/body/nav/span/a[@class='next']" do
408
+ text_content '/html/body/p'
409
+ end
410
+
411
+ node.inject(agent, page)
412
+ #=> [ {"content" => "Pagination01"},
413
+ # {"content" => "Pagination02"},
414
+ # {"content" => "Pagination03"},
415
+ # {"content" => "Pagination04"}]
416
+ ```
417
+
418
+ A Paginate Node requires a link to the next page as its `Path`. In this case, it is the `NextPage` link (`/html/body/nav/span/a[@class='next']`).
419
+
420
+ ### Options
421
+ ##### `limit`
422
+ The upper limit on the number of pages opened while following the pagination.
423
+
424
+ ```ruby
425
+ node = Yasuri.pages_root "/html/body/nav/span/a[@class='next']" , limit:2 do
426
+ text_content '/html/body/p'
427
+ end
428
+ node.inject(agent, page)
429
+ #=> [ {"content" => "Pagination01"}, {"content" => "Pagination02"}]
430
+ ```
431
+ The Paginate Node opens at most the 2 pages allowed by `limit`. In this situation the pagination has 4 pages, but the result Array has 2 texts because `limit:2` was given.
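The effect of `limit` can be pictured as a loop that follows the next-page link until either the link runs out or the limit is reached. The sketch below is not Yasuri code: it replaces Mechanize with a made-up in-memory table of pages, and `PAGES`, `follow_pages` and the `:next_page` field are all hypothetical names.

```ruby
# A made-up in-memory "site": each page has some text and the name of the
# page its NextPage link points to (nil on the last page).
PAGES = {
  "page01.html" => { text: "Pagination01", next_page: "page02.html" },
  "page02.html" => { text: "Pagination02", next_page: "page03.html" },
  "page03.html" => { text: "Pagination03", next_page: "page04.html" },
  "page04.html" => { text: "Pagination04", next_page: nil },
}.freeze

# Visit pages in order, starting at `start`, and stop after `limit`
# pages (no limit when it is nil) or when there is no next page.
def follow_pages(pages, start, limit = nil)
  results = []
  current = start
  while current && (limit.nil? || results.size < limit)
    page = pages.fetch(current)
    results << { "content" => page[:text] }
    current = page[:next_page]
  end
  results
end

p follow_pages(PAGES, "page01.html", 2)
p follow_pages(PAGES, "page01.html")
```

With a limit of 2 the loop stops after the second page even though two more pages remain; without a limit it visits all four.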
data/lib/yasuri/version.rb CHANGED
@@ -1,3 +1,3 @@
1
1
  module Yasuri
2
- VERSION = "0.0.11"
2
+ VERSION = "1.9.11"
3
3
  end
data/lib/yasuri/yasuri.rb CHANGED
@@ -37,7 +37,7 @@ module Yasuri
37
37
  }
38
38
  Node2Text = Text2Node.invert
39
39
 
40
- ReservedKeys = %i|node name path children|
40
+ ReservedKeys = [:node, :name, :path, :children]
41
41
  def self.hash2node(node_h)
42
42
  node, name, path, children = ReservedKeys.map do |key|
43
43
  node_h[key]
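The `%i` symbol-array literal was added in Ruby 2.0, so replacing it with an explicit array keeps the constant loadable on older interpreters, which seems to be the intent given the README note about Ruby 1.9.3. The sketch below shows how such a key list can split a node definition; `split_node_hash` is a hypothetical helper, and treating the non-reserved entries as options is an assumption rather than something this diff shows.

```ruby
RESERVED_KEYS = [:node, :name, :path, :children]

# Hypothetical helper: split a node definition into its reserved fields
# and the remaining entries, which this sketch treats as options.
def split_node_hash(node_h)
  reserved = RESERVED_KEYS.map { |key| node_h[key] }
  options  = node_h.reject { |key, _value| RESERVED_KEYS.include?(key) }
  [reserved, options]
end

reserved, options = split_node_hash(
  node: "text", name: "title", path: "/html/body/p[1]", truncate: "^[^,]+"
)
p reserved
p options
```

A missing reserved key, such as `:children` here, simply comes back as `nil`.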
@@ -78,7 +78,8 @@ module Yasuri
78
78
  json
79
79
  end
80
80
 
81
- def self.NodeName(name, symbolize_names:false)
81
+ def self.NodeName(name, hash = {})
82
+ symbolize_names = hash[:symbolize_names] || false
82
83
  symbolize_names ? name.to_sym : name
83
84
  end
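Keyword arguments arrived in Ruby 2.0, so the hunk above swaps them for a trailing options hash. A standalone sketch of the same pattern follows; `node_name` is a hypothetical lower-case copy used only for illustration.

```ruby
# Options-hash version of a method that used to take a keyword argument.
def node_name(name, hash = {})
  symbolize_names = hash[:symbolize_names] || false
  symbolize_names ? name.to_sym : name
end

p node_name("title")                         # "title"
p node_name("title", symbolize_names: true)  # :title
p node_name("title", symbolise_names: true)  # "title" (misspelled key is ignored)
```

One trade-off of this style is visible in the last call: a misspelled option is silently ignored, whereas an unknown keyword argument would raise `ArgumentError`.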
84
85
 
@@ -7,7 +7,7 @@ module Yasuri
7
7
  module Node
8
8
  attr_reader :url, :xpath, :name, :children
9
9
 
10
- def initialize(xpath, name, children = [], opt: {})
10
+ def initialize(xpath, name, children = [], opt = {})
11
11
  @xpath, @name, @children = xpath, name, children
12
12
  end
13
13
 
@@ -7,9 +7,9 @@ module Yasuri
7
7
  class PaginateNode
8
8
  include Node
9
9
 
10
- def initialize(xpath, name, children = [], limit: nil)
10
+ def initialize(xpath, name, children = [], hash = {})
11
11
  super(xpath, name, children)
12
- @limit = limit
12
+ @limit = hash[:limit]
13
13
  end
14
14
 
15
15
  def inject(agent, page, opt = {})
@@ -7,9 +7,12 @@ module Yasuri
7
7
  class TextNode
8
8
  include Node
9
9
 
10
- def initialize(xpath, name, children = [], truncate: nil, proc:nil)
10
+ def initialize(xpath, name, children = [], hash = {})
11
11
  super(xpath, name, children)
12
12
 
13
+ truncate = hash[:truncate]
14
+ proc = hash[:proc]
15
+
13
16
  truncate = Regexp.new(truncate) if not truncate.nil? # regexp or nil
14
17
  @truncate = truncate
15
18
  @truncate = Regexp.new(@truncate.to_s) if not @truncate.nil?
data/spec/spec_helper.rb CHANGED
@@ -14,8 +14,8 @@ end
14
14
 
15
15
 
16
16
  # ENV['CODECLIMATE_REPO_TOKEN'] = "0dc78d33107a7f11f257c0218ac1a37e0073005bb9734f2fd61d0f7e803fc151"
17
- require "codeclimate-test-reporter"
18
- CodeClimate::TestReporter.start
17
+ # require "codeclimate-test-reporter"
18
+ # CodeClimate::TestReporter.start
19
19
 
20
20
  require 'simplecov'
21
21
  require 'coveralls'
@@ -27,6 +27,7 @@ SimpleCov.formatter = SimpleCov::Formatter::MultiFormatter[
27
27
  ]
28
28
  SimpleCov.start
29
29
 
30
+
30
31
  $LOAD_PATH.unshift File.expand_path('../../lib', __FILE__)
31
32
  require 'yasuri'
32
33
 
data/spec/yasuri_spec.rb CHANGED
@@ -18,7 +18,7 @@ describe 'Yasuri' do
18
18
  #############
19
19
  describe '.json2tree' do
20
20
  it "fail if empty json" do
21
- expect { Yasuri.json2tree("{}") }.to raise_error
21
+ expect { Yasuri.json2tree("{}") }.to raise_error(RuntimeError)
22
22
  end
23
23
 
24
24
  it "return TextNode" do
@@ -39,7 +39,7 @@ describe 'Yasuri' do
39
39
  "truncate" : "^[^,]+"
40
40
  }|
41
41
  generated = Yasuri.json2tree(src)
42
- original = Yasuri::TextNode.new('/html/body/p[1]', "content", truncate:/^[^,]+/)
42
+ original = Yasuri::TextNode.new('/html/body/p[1]', "content", {}, truncate:/^[^,]+/)
43
43
  compare_generated_vs_original(generated, original, @index_page)
44
44
  end
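The extra positional argument added in this spec follows from the new constructor shape `(xpath, name, children = [], hash = {})`: a trailing braceless hash fills the first free optional parameter, which is `children`. The sketch below demonstrates that with a hypothetical `SketchNode` class that copies only the parameter list; it passes `[]` where the spec passes `{}`.

```ruby
# Hypothetical class mirroring only the constructor shape in the diff.
class SketchNode
  attr_reader :children, :options

  def initialize(xpath, name, children = [], hash = {})
    @xpath, @name, @children, @options = xpath, name, children, hash
  end
end

without_children = SketchNode.new('/html/body/p[1]', "content", truncate: /^[^,]+/)
with_children    = SketchNode.new('/html/body/p[1]', "content", [], truncate: /^[^,]+/)

p without_children.children  # the option hash ended up here
p without_children.options   # empty: the option was not received
p with_children.options      # holds :truncate
```

So once keyword arguments are replaced by an options hash, callers that want options have to pass the children argument explicitly.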
45
45
 
@@ -153,7 +153,7 @@ describe 'Yasuri' do
153
153
  end
154
154
 
155
155
  it "return text node with truncate_regexp" do
156
- node = Yasuri::TextNode.new("/html/head/title", "title", truncate:/^[^,]+/)
156
+ node = Yasuri::TextNode.new("/html/head/title", "title", {}, truncate:/^[^,]+/)
157
157
  json = Yasuri.tree2json(node)
158
158
  expected_str = %q| { "node": "text",
159
159
  "name": "title",
@@ -81,7 +81,7 @@ describe 'Yasuri' do
81
81
  node = Yasuri::StructNode.new(invalid_xpath, "table", [
82
82
  Yasuri::TextNode.new('./td[1]', "title")
83
83
  ])
84
- expect { node.inject(@agent, @page) }.to raise_error
84
+ expect { node.inject(@agent, @page) }.to raise_error(Nokogiri::XML::XPath::SyntaxError)
85
85
  end
86
86
 
87
87
  it 'fail with invalid xpath in children' do
@@ -90,7 +90,7 @@ describe 'Yasuri' do
90
90
  Yasuri::TextNode.new(invalid_xpath, "title"),
91
91
  Yasuri::TextNode.new('./td[2]', "pub_date"),
92
92
  ])
93
- expect { node.inject(@agent, @page) }.to raise_error
93
+ expect { node.inject(@agent, @page) }.to raise_error(Nokogiri::XML::XPath::SyntaxError)
94
94
  end
95
95
 
96
96
  it 'scrape all tables' do
@@ -126,7 +126,7 @@ describe 'Yasuri' do
126
126
  Yasuri::TextNode.new('./td[1]', "title"),
127
127
  Yasuri::TextNode.new('./td[2]', "pub_date"),
128
128
  ])
129
- expected = @table_1996.map{|h| h.map{|k,v| [k.to_sym, v] }.to_h }
129
+ expected = @table_1996.map{|h| Hash[h.map{|k,v| [k.to_sym, v] }] }
130
130
  actual = node.inject(@agent, @page, symbolize_names:true)
131
131
  expect(actual).to match expected
132
132
  end
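`Array#to_h` was added in Ruby 2.1, while `Hash[...]` applied to an array of pairs also works on older versions, which is presumably why the spec was rewritten this way. A small sketch of the replacement expression follows; `ROWS` is sample data borrowed from the USAGE example, not the spec's actual `@table_1996` fixture, and `symbolize_rows` is a hypothetical name.

```ruby
# Sample rows borrowed from the USAGE example (not the spec's fixture).
ROWS = [
  { "title" => "The Perfect Insider",      "pub_date" => "1996/4/5" },
  { "title" => "Doctors in Isolated Room", "pub_date" => "1996/7/5" },
].freeze

# Same expression as the new spec line: build each Hash from
# [key, value] pairs with Hash[] instead of Array#to_h.
def symbolize_rows(rows)
  rows.map { |h| Hash[h.map { |k, v| [k.to_sym, v] }] }
end

p symbolize_rows(ROWS)
```

Each resulting hash has the same values as its source row, keyed by Symbols instead of Strings.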
@@ -31,7 +31,7 @@ describe 'Yasuri' do
31
31
  it 'fail with invalid xpath' do
32
32
  invalid_xpath = '/html/body/no_match_node['
33
33
  node = Yasuri::TextNode.new(invalid_xpath, "title")
34
- expect { node.inject(@agent, @index_page) }.to raise_error
34
+ expect { node.inject(@agent, @index_page) }.to raise_error(Nokogiri::XML::XPath::SyntaxError)
35
35
  end
36
36
 
37
37
  it "can be defined by DSL, return single TextNode title" do
metadata CHANGED
@@ -1,14 +1,14 @@
1
1
  --- !ruby/object:Gem::Specification
2
2
  name: yasuri
3
3
  version: !ruby/object:Gem::Version
4
- version: 0.0.11
4
+ version: 1.9.11
5
5
  platform: ruby
6
6
  authors:
7
7
  - TAC
8
8
  autorequire:
9
9
  bindir: bin
10
10
  cert_chain: []
11
- date: 2015-03-03 00:00:00.000000000 Z
11
+ date: 2016-11-14 00:00:00.000000000 Z
12
12
  dependencies:
13
13
  - !ruby/object:Gem::Dependency
14
14
  name: bundler
@@ -151,6 +151,8 @@ files:
151
151
  - LICENSE
152
152
  - README.md
153
153
  - Rakefile
154
+ - USAGE.ja.md
155
+ - USAGE.md
154
156
  - app.rb
155
157
  - lib/yasuri.rb
156
158
  - lib/yasuri/version.rb