yasuri 0.0.11 → 1.9.11

checksums.yaml CHANGED
@@ -1,7 +1,7 @@
1
1
  ---
2
2
  SHA1:
3
- metadata.gz: 2d97799a28c2b4d1991ce32abf50b81bce1cae5f
4
- data.tar.gz: d94f15f6056e52f87feb1538964e8a049afb7fcb
3
+ metadata.gz: 28e6a3903cec3d8036b718a7c5de4fb8df8dfbfa
4
+ data.tar.gz: fbfb4c2b3a042410a7d05b1604e004bdffd82779
5
5
  SHA512:
6
- metadata.gz: 193bb19ac39e9ea1ca74b2c44dc8c66840f630dcdcd32797acf28a6add47b9a5d878a9ffe3a482a544eccc75dc0f34aa03a4d199dffd311b6dc134f6471b5d35
7
- data.tar.gz: 3a360133ce54adb4bcc16637a53d78d1cbd5402d64c47b4afc8e852918f4ea613b4eea9ab20e8bf8c325cb67bf57c31577fa45ba9c04b98d186aa1a72b913f64
6
+ metadata.gz: e4312623046ecf451ef261b1d99164b7f79b861ad8d7fe49d7ba8fd2319cd840978ae88235e6fb10b6991cd5131ebec966ec56e3bddece8fbbefd0b53d4a9dfe
7
+ data.tar.gz: 54639066aa4511309a1f712b8920c81c09a7489e0cab6a91c453d2aeca07d4f1c3bed167ed25746411d5f87fa9cc9fff8c6a32567c4df8d1990f12cb053286e2
data/README.md CHANGED
@@ -2,6 +2,16 @@
2
2
 
3
3
  Yasuri (鑢) is an easy web-scraping library for supporting "[Mechanize](https://github.com/sparklemotion/mechanize)".
4
4
 
5
+ Yasuri simplifies the processes that come up frequently in scraping.
6
+
7
+ For example,
8
+
9
+ + Open the links in a page, scrape each linked page, and get the results as a Hash.
10
+ + Scrape several texts in a page and return them in a Hash under named keys.
11
+ + Scrape each of the tables that appear repeatedly in a page and get the results as an array.
12
+ + Of the pages provided by pagination, scrape only the first 3 in order.
13
+
14
+ You can implement all of these easily with Yasuri.
5
15
 
6
16
  ## Sample
7
17
 
@@ -17,6 +27,14 @@ Add this line to your application's Gemfile:
17
27
  gem 'yasuri'
18
28
  ```
19
29
 
30
+ or
31
+
32
+ ```ruby
33
+ # for Ruby 1.9.3 or lower
34
+ gem 'yasuri', '~> 1.9'
35
+ ```
36
+
37
+
20
38
  And then execute:
21
39
 
22
40
  $ bundle
data/USAGE.ja.md ADDED
@@ -0,0 +1,433 @@
1
+ # How to Use Yasuri
2
+
3
+ ## What is Yasuri
4
+ Yasuri (鑢) is a library that supports "[Mechanize](https://github.com/sparklemotion/mechanize)" to make web scraping easy.
5
+
6
+ Yasuri lets you describe common scraping tasks concisely.
7
+ For example,
8
+
9
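+ + Open the links in a page, scrape each linked page, and get the results as a Hash
10
+ + Scrape several texts in a page and put them in a Hash under named keys
11
+ + Scrape each of the tables that appear repeatedly in a page and get the results as an array
12
+ + Of the pages provided by pagination, scrape only the first 3 in order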
13
+
14
+ All of these can be implemented easily.
15
+
16
+ ## Quick Start
17
+
18
+ ```
19
+ $ gem install yasuri
20
+ ```
21
+
22
+ ```ruby
23
+ require 'yasuri'
24
+ require 'mechanize'
25
+
26
+ # Construct the node tree with the DSL
27
+ root = Yasuri.links_root '//*[@id="menu"]/ul/li/a' do
28
+ text_title '//*[@id="contents"]/h2'
29
+ text_content '//*[@id="contents"]/p[1]'
30
+ end
31
+
32
+ agent = Mechanize.new
33
+ root_page = agent.get("http://some.scraping.page.net/")
34
+
35
+ result = root.inject(agent, root_page)
36
+ # => [ {"title" => "PageTitle1", "content" => "Page Contents1" },
37
+ # {"title" => "PageTitle2", "content" => "Page Contents2" }, ... ]
38
+
39
+ ```
40
+ In this example, the two texts specified by the xpath of the TextNodes (`text_title`, `text_content`) are scraped from each page linked by the xpath of the LinkNode (`links_root`).
41
+
42
+ (In other words, it opens each link matched by `//*[@id="menu"]/ul/li/a` and scrapes the text specified by `//*[@id="contents"]/h2` and `//*[@id="contents"]/p[1]`.)
43
+
44
+ ## Basics
45
+
46
+ 1. Construct a parse tree
47
+ 2. Start parsing by giving it a Mechanize agent and the target page
48
+
49
+
50
+ ### Construct a parse tree
51
+
52
+ ```ruby
53
+ require 'mechanize'
54
+ require 'yasuri'
55
+
56
+
57
+ # 1. Construct a parse tree
58
+ tree = Yasuri.links_title '/html/body/a' do
59
+ text_name '/html/body/p'
60
+ end
61
+
62
+ # 2. Start parsing by giving it a Mechanize agent and the target page
63
+ agent = Mechanize.new
64
+ page = agent.get(uri)
65
+
66
+
67
+ tree.inject(agent, page)
68
+ ```
69
+
70
+ A tree can be defined with the DSL or with JSON. The example above uses the DSL.
71
+ The following example defines an equivalent parse tree in JSON.
72
+
73
+ ```ruby
74
+ # Constructing from JSON
75
+ src = <<-EOJSON
76
+ { "node" : "links",
77
+ "name" : "title",
78
+ "path" : "/html/body/a",
79
+ "children" : [
80
+ { "node" : "text",
81
+ "name" : "name",
82
+ "path" : "/html/body/p"
83
+ }
84
+ ]
85
+ }
86
+ EOJSON
87
+ tree = Yasuri.json2tree(src)
88
+ ```
89
+
90
+
91
+ ### Node
92
+ A tree is made of nested *Node*s.
93
+ A Node has a `Type`, `Name`, `Path`, `Children`, and `Options`.
94
+
95
+ A Node is defined in the following format.
96
+
97
+ ```ruby
98
+ # Top level
99
+ Yasuri.<Type>_<Name> <Path> [,<Options>]
100
+
101
+ # Nested
102
+ Yasuri.<Type>_<Name> <Path> [,<Options>] do
103
+ <Type>_<Name> <Path> [,<Options>] do
104
+ <Children>
105
+ end
106
+ end
107
+ ```
108
+
109
+ #### Type
110
+ *Type* indicates the behavior of a Node. The following Types are available.
111
+
112
+ - *Text*
113
+ - *Struct*
114
+ - *Links*
115
+ - *Paginate*
116
+
117
+ ### Name
118
+ *Name* becomes the key in the resulting Hash.
119
+
120
+ ### Path
121
+ *Path* specifies particular nodes in the HTML by xpath or css selector.
122
+ It is passed to Mechanize's `search`.
123
+
124
+ ### Children
125
+ The child nodes of a nested node. A TextNode is a leaf of the tree, so it has no child nodes.
126
+
127
+ ### Options
128
+ Parse options. The available options differ by Type.
129
+ Calling the `opt` method on a node returns its available options.
130
+
131
+ ```
132
+ # TextNode example
133
+ node = Yasuri.text_title '/html/body/p[1]', truncate:/^[^,]+/
134
+ node.opt #=> {:truncate => /^[^,]+/, :proc => nil}
135
+ ```
136
+
137
+ ## Text Node
138
+ *TextNode* returns the scraped text. This node is always a leaf in the parse tree.
139
+
140
+ ### Example
141
+
142
+ ```html
143
+ <!-- http://yasuri.example.net -->
144
+ <html>
145
+ <head></head>
146
+ <body>
147
+ <p>Hello,World</p>
148
+ <p>Hello,Yasuri</p>
149
+ </body>
150
+ </html>
151
+ ```
152
+
153
+ ```ruby
154
+ agent = Mechanize.new
155
+ page = agent.get("http://yasuri.example.net")
156
+
157
+ p1 = Yasuri.text_title '/html/body/p[1]'
158
+ p1t = Yasuri.text_title '/html/body/p[1]', truncate:/^[^,]+/
159
+ p2u = Yasuri.text_title '/html/body/p[2]', proc: :upcase
160
+
161
+ p1.inject(agent, page) #=> { "title" => "Hello,World" }
162
+ p1t.inject(agent, page) #=> { "title" => "Hello" }
163
+ p2u.inject(agent, page) #=> { "title" => "HELLO,YASURI" }
164
+ ```
165
+
166
+ ### Options
167
+ ##### `truncate`
168
+ Extracts the string that matches the regular expression. If the pattern contains groups, only the first matched group is returned.
169
+
170
+ ```ruby
171
+ node = Yasuri.text_example '/html/body/p[1]', truncate:/H(.+)i/
172
+ node.inject(agent, index_page)
173
+ #=> { "example" => "ello,Yasur" }
174
+ ```
175
+
176
+
177
+ ##### `proc`
178
+ Calls the method named by the given symbol, with the extracted String as the receiver.
179
+ If the `truncate` option is also given, the method is called on the string after `truncate` has been applied.
180
+
181
+ ```ruby
182
+ node = Yasuri.text_example '/html/body/p[1]', proc: :upcase, truncate:/H(.+)i/
183
+ node.inject(agent, index_page)
184
+ #=> { "example" => "ELLO,YASUR" }
185
+ ```
186
+
187
+ ## Struct Node
188
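+ *Struct Node* returns text as a structured Hash.
189
+
190
+ First, a Struct Node narrows down the HTML tags by its `Path`.
191
+ The child nodes of the Struct Node parse these narrowed-down tags, and the Struct Node returns a Hash containing the child nodes' results.
192
+
193
+ If the `Path` of a Struct Node matches multiple tags, the results are returned as an array.
194
+
195
+ ### Example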
196
+
197
+ ```html
198
+ <!-- http://yasuri.example.net -->
199
+ <html>
200
+ <head>
201
+ <title>Books</title>
202
+ </head>
203
+ <body>
204
+ <h1>1996</h1>
205
+ <table>
206
+ <thead>
207
+ <tr><th>Title</th> <th>Publication Date</th></tr>
208
+ </thead>
209
+ <tr><td>The Perfect Insider</td> <td>1996/4/5</td></tr>
210
+ <tr><td>Doctors in Isolated Room</td> <td>1996/7/5</td></tr>
211
+ <tr><td>Mathematical Goodbye</td> <td>1996/9/5</td></tr>
212
+ </table>
213
+
214
+ <h1>1997</h1>
215
+ <table>
216
+ <thead>
217
+ <tr><th>Title</th> <th>Publication Date</th></tr>
218
+ </thead>
219
+ <tr><td>Jack the Poetical Private</td> <td>1997/1/5</td></tr>
220
+ <tr><td>Who Inside</td> <td>1997/4/5</td></tr>
221
+ <tr><td>Illusion Acts Like Magic</td> <td>1997/10/5</td></tr>
222
+ </table>
223
+
224
+ <h1>1998</h1>
225
+ <table>
226
+ <thead>
227
+ <tr><th>Title</th> <th>Publication Date</th></tr>
228
+ </thead>
229
+ <tr><td>Replaceable Summer</td> <td>1998/1/7</td></tr>
230
+ <tr><td>Switch Back</td> <td>1998/4/5</td></tr>
231
+ <tr><td>Numerical Models</td> <td>1998/7/5</td></tr>
232
+ <tr><td>The Perfect Outsider</td> <td>1998/10/5</td></tr>
233
+ </table>
234
+ </body>
235
+ </html>
236
+ ```
237
+
238
+ ```ruby
239
+ agent = Mechanize.new
240
+ page = agent.get("http://yasuri.example.net")
241
+
242
+ node = Yasuri.struct_table '/html/body/table[1]/tr' do
243
+ text_title './td[1]'
244
+ text_pub_date './td[2]'
245
+ end
246
+
247
+ node.inject(agent, page)
248
+ #=> [ { "title" => "The Perfect Insider",
249
+ # "pub_date" => "1996/4/5" },
250
+ # { "title" => "Doctors in Isolated Room",
251
+ # "pub_date" => "1996/7/5" },
252
+ # { "title" => "Mathematical Goodbye",
253
+ # "pub_date" => "1996/9/5" }, ]
254
+ ```
255
+
256
+ The Struct Node uses the xpath `'/html/body/table[1]/tr'` to narrow down all `<tr>` tags in the first `<table>`.
257
+ The `<tr>` tags are then parsed by its two child TextNodes.
258
+ In this case the first `<table>` has three matching `<tr>` tags, so three Hashes are returned. (Note that it is not four, because `<thead><tr>` does not match the `Path`.)
259
+ Each Hash contains the text parsed by the TextNodes.
260
+
261
+
262
+ As in the following example, a Struct Node can also have nodes other than TextNodes as its children.
263
+
264
+ ### Example
265
+
266
+ ```ruby
267
+ agent = Mechanize.new
268
+ page = agent.get("http://yasuri.example.net")
269
+
270
+ node = Yasuri.struct_tables '/html/body/table' do
271
+ struct_table './tr' do
272
+ text_title './td[1]'
273
+ text_pub_date './td[2]'
274
+ end
275
+ end
276
+
277
+ node.inject(agent, page)
278
+
279
+ #=> [ { "table" => [ { "title" => "The Perfect Insider",
280
+ # "pub_date" => "1996/4/5" },
281
+ # { "title" => "Doctors in Isolated Room",
282
+ # "pub_date" => "1996/7/5" },
283
+ # { "title" => "Mathematical Goodbye",
284
+ # "pub_date" => "1996/9/5" }]},
285
+ # { "table" => [ { "title" => "Jack the Poetical Private",
286
+ # "pub_date" => "1997/1/5" },
287
+ # { "title" => "Who Inside",
288
+ # "pub_date" => "1997/4/5" },
289
+ # { "title" => "Illusion Acts Like Magic",
290
+ # "pub_date" => "1997/10/5" }]},
291
+ # { "table" => [ { "title" => "Replaceable Summer",
292
+ # "pub_date" => "1998/1/7" },
293
+ # { "title" => "Switch Back",
294
+ # "pub_date" => "1998/4/5" },
295
+ # { "title" => "Numerical Models",
296
+ # "pub_date" => "1998/7/5" },
297
+ # { "title" => "The Perfect Outsider",
298
+ # "pub_date" => "1998/10/5" }]}
299
+ # ]
300
+ ```
301
+
302
+ ### Options
303
+ None
304
+
305
+ ## Links Node
306
+ Links Node parses each linked page and returns the results.
307
+
308
+ ### Example
309
+ ```
310
+ <!-- http://yasuri.example.net -->
311
+ <html>
312
+ <head><title>Yasuri Test</title></head>
313
+ <body>
314
+ <p>Hello,Yasuri</p>
315
+ <a href="./child01.html">child01</a>
316
+ <a href="./child02.html">child02</a>
317
+ <a href="./child03.html">child03</a>
318
+ </body>
319
+ </html>
320
+ ```
321
+
322
+ ```
323
+ <!-- http://yasuri.example.net/child01.html -->
324
+ <html>
325
+ <head><title>Child 01 Test</title></head>
326
+ <body>
327
+ <p>Child 01 page.</p>
328
+ <ul>
329
+ <li><a href="./child01_sub.html">Child01_Sub</a></li>
330
+ <li><a href="./child02_sub.html">Child02_Sub</a></li>
331
+ </ul>
332
+ </body>
333
+ </html>
334
+ ```
335
+
336
+ ```
337
+ <!-- http://yasuri.example.net/child02.html -->
338
+ <html>
339
+ <head><title>Child 02 Test</title></head>
340
+ <body>
341
+ <p>Child 02 page.</p>
342
+ </body>
343
+ </html>
344
+ ```
345
+
346
+ ```
347
+ <!-- http://yasuri.example.net/child03.html -->
348
+ <html>
349
+ <head><title>Child 03 Test</title></head>
350
+ <body>
351
+ <p>Child 03 page.</p>
352
+ <ul>
353
+ <li><a href="./child03_sub.html">Child03_Sub</a></li>
354
+ </ul>
355
+ </body>
356
+ </html>
357
+ ```
358
+
359
+ ```
360
+ agent = Mechanize.new
361
+ page = agent.get("http://yasuri.example.net")
362
+
363
+ node = Yasuri.links_title '/html/body/a' do
364
+ text_content '/html/body/p'
365
+ end
366
+
367
+ node.inject(agent, page)
368
+ #=> [ {"content" => "Child 01 page."},
369
+ #     {"content" => "Child 02 page."},
370
+ #     {"content" => "Child 03 page."}]
371
+ ```
372
+
373
+ First, LinksNode finds all links matching its `Path` in the first page.
374
+ In this example, LinksNode finds all tags matching `/html/body/a` in `http://yasuri.example.net`.
375
+ Next, it opens the pages specified by the href attributes of the tags it found (`./child01.html`, `./child02.html`, `./child03.html`).
376
+
377
+ Each opened page is then parsed by the child nodes. LinksNode returns the parse results for each page as an array of Hashes.
378
+
379
+ ## Paginate Node
380
+ PaginateNode parses, in order, each page that can be reached by following the pagination.
381
+
382
+ ### Example
383
+ In this example, the target page `page01.html` looks like the following.
384
+ `page02.html` through `page04.html` are similar.
385
+
386
+ ```html
387
+ <!-- http://yasuri.example.net/page01.html -->
388
+ <html>
389
+ <head><title>Page01</title></head>
390
+ <body>
391
+ <p>Pagination01</p>
392
+
393
+ <nav class='pagination'>
394
+ <span class='prev'> &laquo; PreviousPage </span>
395
+ <span class='page'> 1 </span>
396
+ <span class='page'> <a href="./page02.html">2</a> </span>
397
+ <span class='page'> <a href="./page03.html">3</a> </span>
398
+ <span class='page'> <a href="./page04.html">4</a> </span>
399
+ <span class='next'> <a href="./page02.html" class="next" rel="next">NextPage &raquo;</a> </span>
400
+ </nav>
401
+
402
+ </body>
403
+ </html>
404
+ ```
405
+
406
+ ```ruby
407
+ agent = Mechanize.new
408
+ page = agent.get("http://yasuri.example.net/page01.html")
409
+
410
+ node = Yasuri.pages_root "/html/body/nav/span/a[@class='next']" , limit:3 do
411
+ text_content '/html/body/p'
412
+ end
413
+
414
+ node.inject(agent, page)
415
+ #=> [ {"content" => "Pagination01"},
416
+ #     {"content" => "Pagination02"},
417
+ #     {"content" => "Pagination03"}]
418
+ ```
419
+ PaginateNode requires the link that points to the next page to be given as its `Path`.
420
+ In this example, `NextPage` (`/html/body/nav/span/a[@class='next']`) is the link that points to the next page.
421
+
422
+ ### Options
423
+ ##### `limit`
424
+ Specifies the upper limit on the number of pages to follow.
425
+
426
+ ```ruby
427
+ node = Yasuri.pages_root "/html/body/nav/span/a[@class='next']" , limit:2 do
428
+ text_content '/html/body/p'
429
+ end
430
+ node.inject(agent, page)
431
+ #=> [ {"content" => "Pagination01"}, {"content" => "Pagination02"}]
432
+ ```
433
+ In this case, PaginateNode opens and parses at most 2 pages. The pagination appears to have 4 pages, but because `limit:2` is given, the result array contains only 2 results.
data/USAGE.md ADDED
@@ -0,0 +1,431 @@
1
+ # Yasuri Usage
2
+
3
+ ## What is Yasuri
6
+ Yasuri (鑢) is a library that supports "[Mechanize](https://github.com/sparklemotion/mechanize)" to make web scraping easy.
7
+
8
+ Yasuri simplifies the processes that come up frequently in scraping.
9
+
10
+ For example,
11
+
12
+ + Open the links in a page, scrape each linked page, and get the results as a Hash.
13
+ + Scrape several texts in a page and return them in a Hash under named keys.
14
+ + Scrape each of the tables that appear repeatedly in a page and get the results as an array.
15
+ + Of the pages provided by pagination, scrape only the first 3 in order.
16
+
17
+ You can implement all of these easily with Yasuri.
18
+
19
+ ## Quick Start
20
+
21
+ ```
22
+ $ gem install yasuri
23
+ ```
24
+
25
+ ```ruby
26
+ require 'yasuri'
27
+ require 'mechanize'
28
+
29
+ # Construct the node tree with the DSL
30
+ root = Yasuri.links_root '//*[@id="menu"]/ul/li/a' do
31
+ text_title '//*[@id="contents"]/h2'
32
+ text_content '//*[@id="contents"]/p[1]'
33
+ end
34
+
35
+ agent = Mechanize.new
36
+ root_page = agent.get("http://some.scraping.page.net/")
37
+
38
+ result = root.inject(agent, root_page)
39
+ # => [ {"title" => "PageTitle1", "content" => "Page Contents1" },
40
+ # {"title" => "PageTitle2", "content" => "Page Contents2" }, ... ]
41
+
42
+ ```
43
+ In this example, the two texts specified by the xpath of the TextNodes (`text_title`, `text_content`) are scraped from each page linked by the xpath of the LinkNode (`links_root`).
44
+
45
+ (i.e. open each link matched by `//*[@id="menu"]/ul/li/a`, then scrape `//*[@id="contents"]/h2` and `//*[@id="contents"]/p[1]`.)
46
+
47
+ ## Basics
48
+
49
+ 1. Construct a parse tree.
50
+ 2. Start parsing with a Mechanize agent and the first page.
51
+
52
+ ### Construct a parse tree
53
+
54
+ ```ruby
55
+ require 'mechanize'
56
+ require 'yasuri'
57
+
58
+
59
+ # 1. Construct a parse tree.
60
+ tree = Yasuri.links_title '/html/body/a' do
61
+ text_name '/html/body/p'
62
+ end
63
+
64
+ # 2. Start parsing with a Mechanize agent and the first page.
65
+ agent = Mechanize.new
66
+ page = agent.get(uri)
67
+
68
+
69
+ tree.inject(agent, page)
70
+ ```
71
+
72
+ A tree can be defined in 2 (+1) ways: with the DSL, with JSON (or with plain Ruby code). The example above uses the DSL; the following example defines an equivalent tree in JSON.
73
+
74
+ ```ruby
75
+ # Construct from JSON.
76
+ src = <<-EOJSON
77
+ { "node" : "links",
78
+ "name" : "title",
79
+ "path" : "/html/body/a",
80
+ "children" : [
81
+ { "node" : "text",
82
+ "name" : "name",
83
+ "path" : "/html/body/p"
84
+ }
85
+ ]
86
+ }
87
+ EOJSON
88
+ tree = Yasuri.json2tree(src)
89
+ ```
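As a rough illustration of the JSON form, the following sketch walks the same definition using only Ruby's standard `json` library. `TREE_JSON` and `describe_tree` are hypothetical names introduced here for illustration; they are not part of Yasuri, and the helper only describes the nesting rather than building real nodes.

```ruby
require 'json'

# Same tree definition as in the JSON example.
TREE_JSON = <<-EOJSON
{ "node"     : "links",
  "name"     : "title",
  "path"     : "/html/body/a",
  "children" : [
    { "node" : "text",
      "name" : "name",
      "path" : "/html/body/p"
    }
  ]
}
EOJSON

# Hypothetical helper (not part of Yasuri): returns one description line per
# node, indenting each child under its parent.
def describe_tree(node_h, depth = 0)
  line = "#{'  ' * depth}#{node_h['node']}:#{node_h['name']} (#{node_h['path']})"
  children = node_h.fetch('children', [])
  [line] + children.flat_map { |child| describe_tree(child, depth + 1) }
end

puts describe_tree(JSON.parse(TREE_JSON))
```

Each JSON object carries the same `node`, `name`, `path` and `children` keys that the diff lists as `ReservedKeys`, so the nesting in the JSON mirrors the nesting of `do ... end` blocks in the DSL.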
90
+
91
+ ### Node
92
+ A tree is constructed from nested Nodes.
93
+ A Node has a `Type`, `Name`, `Path`, `Children`, and `Options`.
94
+
95
+ A Node is defined in this format.
96
+
97
+
98
+ ```ruby
99
+ # Top Level
100
+ Yasuri.<Type>_<Name> <Path> [,<Options>]
101
+
102
+ # Nested
103
+ Yasuri.<Type>_<Name> <Path> [,<Options>] do
104
+ <Type>_<Name> <Path> [,<Options>] do
105
+ <Children>
106
+ end
107
+ end
108
+ ```
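To make the `<Type>_<Name>` naming convention concrete, here is a minimal sketch of how such a method name could be split into its two parts. `NODE_TYPES` and `split_dsl_name` are hypothetical and only illustrate the convention; this is not a claim about how Yasuri resolves DSL method names internally. The `pages` prefix is taken from the `pages_root` examples in this document.

```ruby
# DSL prefixes that appear in this document's examples.
NODE_TYPES = %w[text struct links pages]

# Hypothetical helper (not Yasuri's API): splits a DSL method name such as
# :text_pub_date into its <Type> and <Name> parts at the first underscore.
def split_dsl_name(method_name)
  type, name = method_name.to_s.split('_', 2)
  raise ArgumentError, "unknown node type: #{type}" unless NODE_TYPES.include?(type)
  [type, name]
end

p split_dsl_name(:text_pub_date)
p split_dsl_name(:links_root)
```

Splitting at the first underscore only is what lets a name such as `pub_date` keep its own underscore.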
109
+
110
+ #### Type
111
+ Type determines the behavior of a Node.
112
+
113
+ - *Text*
114
+ - *Struct*
115
+ - *Links*
116
+ - *Paginate*
117
+
118
+ ### Name
119
+ Name is used as the key in the returned hash.
120
+
121
+ ### Path
122
+ Path determines the target nodes by xpath or css selector. It is passed to Mechanize's `search`.
123
+
124
+ ### Children
125
+ Child nodes. A TextNode always has an empty set of children, because a TextNode is a leaf.
126
+
127
+ ### Options
128
+ Parse options. They differ for each type. You can get the options and their values with the `opt` method.
129
+
130
+ ```ruby
131
+ # TextNode example
132
+ node = Yasuri.text_title '/html/body/p[1]', truncate:/^[^,]+/
133
+ node.opt #=> {:truncate => /^[^,]+/, :proc => nil}
134
+ ```
135
+
136
+ ## Text Node
137
+ TextNode returns the scraped text. This node must be a leaf.
138
+
139
+ ### Example
140
+
141
+ ```html
142
+ <!-- http://yasuri.example.net -->
143
+ <html>
144
+ <head></head>
145
+ <body>
146
+ <p>Hello,World</p>
147
+ <p>Hello,Yasuri</p>
148
+ </body>
149
+ </html>
150
+ ```
151
+
152
+ ```ruby
153
+ agent = Mechanize.new
154
+ page = agent.get("http://yasuri.example.net")
155
+
156
+ p1 = Yasuri.text_title '/html/body/p[1]'
157
+ p1t = Yasuri.text_title '/html/body/p[1]', truncate:/^[^,]+/
158
+ p2u = Yasuri.text_title '/html/body/p[2]', proc: :upcase
159
+
160
+ p1.inject(agent, page) #=> { "title" => "Hello,World" }
161
+ p1t.inject(agent, page) #=> { "title" => "Hello" }
162
+ p2u.inject(agent, page) #=> { "title" => "HELLO,YASURI" }
163
+ ```
164
+
165
+ ### Options
166
+ ##### `truncate`
167
+ Matches the text against a regexp and keeps only the matched part. If the regexp contains a group, only the first matched group is returned.
168
+
169
+ ```ruby
170
+ node = Yasuri.text_example '/html/body/p[1]', truncate:/H(.+)i/
171
+ node.inject(agent, index_page)
172
+ #=> { "example" => "ello,Yasur" }
173
+ ```
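The group-versus-whole-match rule can be sketched in plain Ruby. `truncate_text` is a hypothetical stand-in for the `truncate` option, not Yasuri's implementation; in particular, returning the text unchanged when nothing matches is an assumption made for this sketch.

```ruby
# Hypothetical stand-in for the `truncate` option: returns the first capture
# group when the pattern has one, otherwise the whole match.
def truncate_text(text, pattern)
  match = pattern.match(text)
  return text if match.nil?   # assumption: unmatched text passes through
  match[1] || match[0]
end

p truncate_text("Hello,Yasuri", /H(.+)i/)   # pattern with a group
p truncate_text("Hello,World", /^[^,]+/)    # pattern without a group
```

With the grouped pattern only the text between `H` and the last `i` is kept; with the ungrouped pattern everything before the first comma is kept.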
174
+
175
+
176
+ ##### `proc`
177
+ Applies a method to the text. The method is given as a Symbol.
178
+ If the `truncate` option is also given, the method is applied after truncation.
179
+
180
+ ```ruby
181
+ node = Yasuri.text_example '/html/body/p[1]', proc: :upcase, truncate:/H(.+)i/
182
+ node.inject(agent, index_page)
183
+ #=> { "example" => "ELLO,YASUR" }
184
+ ```
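The ordering described here (truncate first, then call the method) can be sketched as follows. `apply_text_options` is a hypothetical helper written for this illustration and takes a plain options hash; it is not the library's code.

```ruby
# Hypothetical helper: applies :truncate first, then sends the method named
# by :proc to the truncated string.
def apply_text_options(text, options = {})
  pattern = options[:truncate]
  if pattern
    match = Regexp.new(pattern).match(text)
    text = match[1] || match[0] if match
  end
  method_name = options[:proc]
  text = text.send(method_name) if method_name
  text
end

p apply_text_options("Hello,Yasuri", :proc => :upcase, :truncate => /H(.+)i/)
p apply_text_options("Hello,Yasuri", :proc => :upcase)
```

Because the method is sent to the already truncated string, the order in which the two options are written in the call does not matter.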
185
+
186
+ ## Struct Node
187
+ Struct Node returns structured text.
188
+
189
+ First, Struct Node narrows down the sub-tags by `Path`. The child nodes parse the narrowed-down tags, and the Struct Node returns a hash containing the parsed results.
190
+
191
+ If the Struct Node `Path` matches multiple sub-tags, the child nodes parse each of them and the Struct Node returns an array.
192
+
193
+ ### Example
194
+
195
+ ```html
196
+ <!-- http://yasuri.example.net -->
197
+ <html>
198
+ <head>
199
+ <title>Books</title>
200
+ </head>
201
+ <body>
202
+ <h1>1996</h1>
203
+ <table>
204
+ <thead>
205
+ <tr><th>Title</th> <th>Publication Date</th></tr>
206
+ </thead>
207
+ <tr><td>The Perfect Insider</td> <td>1996/4/5</td></tr>
208
+ <tr><td>Doctors in Isolated Room</td> <td>1996/7/5</td></tr>
209
+ <tr><td>Mathematical Goodbye</td> <td>1996/9/5</td></tr>
210
+ </table>
211
+
212
+ <h1>1997</h1>
213
+ <table>
214
+ <thead>
215
+ <tr><th>Title</th> <th>Publication Date</th></tr>
216
+ </thead>
217
+ <tr><td>Jack the Poetical Private</td> <td>1997/1/5</td></tr>
218
+ <tr><td>Who Inside</td> <td>1997/4/5</td></tr>
219
+ <tr><td>Illusion Acts Like Magic</td> <td>1997/10/5</td></tr>
220
+ </table>
221
+
222
+ <h1>1998</h1>
223
+ <table>
224
+ <thead>
225
+ <tr><th>Title</th> <th>Publication Date</th></tr>
226
+ </thead>
227
+ <tr><td>Replaceable Summer</td> <td>1998/1/7</td></tr>
228
+ <tr><td>Switch Back</td> <td>1998/4/5</td></tr>
229
+ <tr><td>Numerical Models</td> <td>1998/7/5</td></tr>
230
+ <tr><td>The Perfect Outsider</td> <td>1998/10/5</td></tr>
231
+ </table>
232
+ </body>
233
+ </html>
234
+ ```
235
+
236
+ ```ruby
237
+ agent = Mechanize.new
238
+ page = agent.get("http://yasuri.example.net")
239
+
240
+ node = Yasuri.struct_table '/html/body/table[1]/tr' do
241
+ text_title './td[1]'
242
+ text_pub_date './td[2]'
243
+ end
244
+
245
+ node.inject(agent, page)
246
+ #=> [ { "title" => "The Perfect Insider",
247
+ # "pub_date" => "1996/4/5" },
248
+ # { "title" => "Doctors in Isolated Room",
249
+ # "pub_date" => "1996/7/5" },
250
+ # { "title" => "Mathematical Goodbye",
251
+ # "pub_date" => "1996/9/5" }, ]
252
+ ```
253
+
254
+ StructNode narrows down the `<tr>` tags in the first `<table>` by `'/html/body/table[1]/tr'`.
255
+ The `<tr>` tags are then parsed by the Struct node's two child nodes.
256
+
257
+ In this case, the first `<table>` contains three matching `<tr>` tags (not four, because `<thead><tr>` does not match the `Path`), so the Struct node returns three hashes. Each hash contains the text parsed by the Text nodes.
258
+
259
+ A Struct node can also contain nodes other than Text nodes.
260
+
261
+ ### Example
262
+
263
+ ```ruby
264
+ agent = Mechanize.new
265
+ page = agent.get("http://yasuri.example.net")
266
+
267
+ node = Yasuri.struct_tables '/html/body/table' do
268
+ struct_table './tr' do
269
+ text_title './td[1]'
270
+ text_pub_date './td[2]'
271
+ end
272
+ end
273
+
274
+ node.inject(agent, page)
275
+
276
+ #=> [ { "table" => [ { "title" => "The Perfect Insider",
277
+ # "pub_date" => "1996/4/5" },
278
+ # { "title" => "Doctors in Isolated Room",
279
+ # "pub_date" => "1996/7/5" },
280
+ # { "title" => "Mathematical Goodbye",
281
+ # "pub_date" => "1996/9/5" }]},
282
+ # { "table" => [ { "title" => "Jack the Poetical Private",
283
+ # "pub_date" => "1997/1/5" },
284
+ # { "title" => "Who Inside",
285
+ # "pub_date" => "1997/4/5" },
286
+ # { "title" => "Illusion Acts Like Magic",
287
+ # "pub_date" => "1997/10/5" }]},
288
+ # { "table" => [ { "title" => "Replaceable Summer",
289
+ # "pub_date" => "1998/1/7" },
290
+ # { "title" => "Switch Back",
291
+ # "pub_date" => "1998/4/5" },
292
+ # { "title" => "Numerical Models",
293
+ # "pub_date" => "1998/7/5" },
294
+ # { "title" => "The Perfect Outsider",
295
+ # "pub_date" => "1998/10/5" }]}
296
+ # ]
297
+ ```
298
+
299
+ ### Options
300
+ None.
301
+
302
+ ## Links Node
303
+ Links Node returns the parsed text of each linked page.
304
+
305
+ ### Example
306
+ ```html
307
+ <!-- http://yasuri.example.net -->
308
+ <html>
309
+ <head><title>Yasuri Test</title></head>
310
+ <body>
311
+ <p>Hello,Yasuri</p>
312
+ <a href="./child01.html">child01</a>
313
+ <a href="./child02.html">child02</a>
314
+ <a href="./child03.html">child03</a>
315
+ </body>
316
+ </html>
317
+ ```
318
+
319
+ ```html
320
+ <!-- http://yasuri.example.net/child01.html -->
321
+ <html>
322
+ <head><title>Child 01 Test</title></head>
323
+ <body>
324
+ <p>Child 01 page.</p>
325
+ <ul>
326
+ <li><a href="./child01_sub.html">Child01_Sub</a></li>
327
+ <li><a href="./child02_sub.html">Child02_Sub</a></li>
328
+ </ul>
329
+ </body>
330
+ </html>
331
+ ```
332
+
333
+ ```html
334
+ <!-- http://yasuri.example.net/child02.html -->
335
+ <html>
336
+ <head><title>Child 02 Test</title></head>
337
+ <body>
338
+ <p>Child 02 page.</p>
339
+ </body>
340
+ </html>
341
+ ```
342
+
343
+ ```html
344
+ <!-- http://yasuri.example.net/child03.html -->
345
+ <html>
346
+ <head><title>Child 03 Test</title></head>
347
+ <body>
348
+ <p>Child 03 page.</p>
349
+ <ul>
350
+ <li><a href="./child03_sub.html">Child03_Sub</a></li>
351
+ </ul>
352
+ </body>
353
+ </html>
354
+ ```
355
+
356
+ ```ruby
357
+ agent = Mechanize.new
358
+ page = agent.get("http://yasuri.example.net")
359
+
360
+ node = Yasuri.links_title '/html/body/a' do
361
+ text_content '/html/body/p'
362
+ end
363
+
364
+ node.inject(agent, page)
365
+ #=> [ {"content" => "Child 01 page."},
366
+ #     {"content" => "Child 02 page."},
367
+ #     {"content" => "Child 03 page."}]
368
+ ```
369
+
370
+ First, Links Node finds all links in the page by its path. In this case, LinksNode finds the `/html/body/a` tags in `http://yasuri.example.net`. Then it opens the pages given by their href attributes (`./child01.html`, `./child02.html` and `./child03.html`).
371
+
372
+ Next, Links Node applies its child nodes to each opened page, and returns the result for each page as an array of hashes.
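The relative `href` values in the example resolve against the URL of the page that contains them. A minimal sketch with Ruby's standard `uri` library shows the resulting absolute URLs; `resolve_links` is a hypothetical helper, and the actual page fetching (done by Mechanize in Yasuri) is left out.

```ruby
require 'uri'

BASE_URL = "http://yasuri.example.net"
HREFS    = ["./child01.html", "./child02.html", "./child03.html"]

# Hypothetical helper: turns the href values found on a page into absolute
# URLs relative to that page's own URL.
def resolve_links(base_url, hrefs)
  hrefs.map { |href| URI.join(base_url, href).to_s }
end

puts resolve_links(BASE_URL, HREFS)
```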
373
+
374
+ ### Options
375
+ None.
376
+
377
+ ## Paginate Node
378
+ Paginate Node parses each page provided by pagination, in order, and returns the results.
379
+
380
+ ### Example
381
+ The target page `page01.html` looks like this. `page02.html` to `page04.html` are similar.
382
+
383
+ ```html
384
+ <!-- http://yasuri.example.net/page01.html -->
385
+ <html>
386
+ <head><title>Page01</title></head>
387
+ <body>
388
+ <p>Pagination01</p>
389
+
390
+ <nav class='pagination'>
391
+ <span class='prev'> PreviousPage </span>
392
+ <span class='page'> 1 </span>
393
+ <span class='page'> <a href="./page02.html">2</a> </span>
394
+ <span class='page'> <a href="./page03.html">3</a> </span>
395
+ <span class='page'> <a href="./page04.html">4</a> </span>
396
+ <span class='next'> <a href="./page02.html" class="next" rel="next"> NextPage </a> </span>
397
+ </nav>
398
+
399
+ </body>
400
+ </html>
401
+ ```
402
+
403
+ ```ruby
404
+ agent = Mechanize.new
405
+ page = agent.get("http://yasuri.example.net/page01.html")
406
+
407
+ node = Yasuri.pages_root "/html/body/nav/span/a[@class='next']" do
408
+ text_content '/html/body/p'
409
+ end
410
+
411
+ node.inject(agent, page)
412
+ #=> [ {"content" => "Pagination01"},
413
+ #     {"content" => "Pagination02"},
414
+ #     {"content" => "Pagination03"},
415
+ #     {"content" => "Pagination04"}]
416
+ ```
417
+
418
+ Paginate Node requires the link to the next page as its `Path`. In this case, it is `NextPage` (`/html/body/nav/span/a[@class='next']`).
419
+
420
+ ### Options
421
+ ##### `limit`
422
+ Upper limit on the number of pages opened while following the pagination.
423
+
424
+ ```ruby
425
+ node = Yasuri.pages_root "/html/body/nav/span/a[@class='next']" , limit:2 do
426
+ text_content '/html/body/p'
427
+ end
428
+ node.inject(agent, page)
429
+ #=> [ {"content" => "Pagination01"}, {"content" => "Pagination02"}]
430
+ ```
431
+ Paginate Node opens at most 2 pages, as given by `limit`. In this case the pagination has 4 pages, but the result Array has only 2 texts because `limit:2` was given.
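The effect of `limit` can be sketched without any network access. `PAGES` is a hypothetical in-memory stand-in for the four example pages and `follow_pages` is a hypothetical helper; this illustrates the follow-the-next-link loop under those assumptions and is not Yasuri's implementation.

```ruby
# Hypothetical in-memory "site": each page has its text and the name of the
# page its NextPage link points to (nil on the last page).
PAGES = {
  "page01.html" => { :text => "Pagination01", :next => "page02.html" },
  "page02.html" => { :text => "Pagination02", :next => "page03.html" },
  "page03.html" => { :text => "Pagination03", :next => "page04.html" },
  "page04.html" => { :text => "Pagination04", :next => nil }
}

# Follows the next links from `start`, stopping when there is no next page
# or when `limit` pages have been parsed (no limit when `limit` is nil).
def follow_pages(start, limit = nil)
  results = []
  current = start
  while current && (limit.nil? || results.size < limit)
    page = PAGES.fetch(current)
    results << { "content" => page[:text] }
    current = page[:next]
  end
  results
end

p follow_pages("page01.html", 2)
p follow_pages("page01.html").size
```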
data/lib/yasuri/version.rb CHANGED
@@ -1,3 +1,3 @@
1
1
  module Yasuri
2
- VERSION = "0.0.11"
2
+ VERSION = "1.9.11"
3
3
  end
data/lib/yasuri/yasuri.rb CHANGED
@@ -37,7 +37,7 @@ module Yasuri
37
37
  }
38
38
  Node2Text = Text2Node.invert
39
39
 
40
- ReservedKeys = %i|node name path children|
40
+ ReservedKeys = [:node, :name, :path, :children]
41
41
  def self.hash2node(node_h)
42
42
  node, name, path, children = ReservedKeys.map do |key|
43
43
  node_h[key]
@@ -78,7 +78,8 @@ module Yasuri
78
78
  json
79
79
  end
80
80
 
81
- def self.NodeName(name, symbolize_names:false)
81
+ def self.NodeName(name, hash = {})
82
+ symbolize_names = hash[:symbolize_names] || false
82
83
  symbolize_names ? name.to_sym : name
83
84
  end
84
85
 
@@ -7,7 +7,7 @@ module Yasuri
7
7
  module Node
8
8
  attr_reader :url, :xpath, :name, :children
9
9
 
10
- def initialize(xpath, name, children = [], opt: {})
10
+ def initialize(xpath, name, children = [], opt = {})
11
11
  @xpath, @name, @children = xpath, name, children
12
12
  end
13
13
 
@@ -7,9 +7,9 @@ module Yasuri
7
7
  class PaginateNode
8
8
  include Node
9
9
 
10
- def initialize(xpath, name, children = [], limit: nil)
10
+ def initialize(xpath, name, children = [], hash = {})
11
11
  super(xpath, name, children)
12
- @limit = limit
12
+ @limit = hash[:limit]
13
13
  end
14
14
 
15
15
  def inject(agent, page, opt = {})
@@ -7,9 +7,12 @@ module Yasuri
7
7
  class TextNode
8
8
  include Node
9
9
 
10
- def initialize(xpath, name, children = [], truncate: nil, proc:nil)
10
+ def initialize(xpath, name, children = [], hash = {})
11
11
  super(xpath, name, children)
12
12
 
13
+ truncate = hash[:truncate]
14
+ proc = hash[:proc]
15
+
13
16
  truncate = Regexp.new(truncate) if not truncate.nil? # regexp or nil
14
17
  @truncate = truncate
15
18
  @truncate = Regexp.new(@truncate.to_s) if not @truncate.nil?
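The hunks above replace keyword arguments, which Ruby gained in 2.0, with a trailing options hash, which also works on Ruby 1.9. The sketch below shows that pattern on a hypothetical `SketchNode` class (not one of Yasuri's classes), including one consequence: the optional `children` argument has to be passed explicitly before the options.

```ruby
# Hypothetical class showing the trailing-options-hash pattern.
class SketchNode
  attr_reader :children, :limit

  def initialize(xpath, name, children = [], hash = {})
    @xpath, @name, @children = xpath, name, children
    @limit = hash[:limit]
  end
end

# Options are read from the hash when children are given explicitly.
with_children = SketchNode.new("/html/body/a", "root", [], :limit => 3)
p with_children.limit

# Without the explicit children argument, the hash lands in `children`.
without_children = SketchNode.new("/html/body/a", "root", :limit => 3)
p without_children.limit
p without_children.children
```

This is consistent with the spec changes in this diff, which pass `{}` as the third argument before `truncate:`.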
data/spec/spec_helper.rb CHANGED
@@ -14,8 +14,8 @@ end
14
14
 
15
15
 
16
16
  # ENV['CODECLIMATE_REPO_TOKEN'] = "0dc78d33107a7f11f257c0218ac1a37e0073005bb9734f2fd61d0f7e803fc151"
17
- require "codeclimate-test-reporter"
18
- CodeClimate::TestReporter.start
17
+ # require "codeclimate-test-reporter"
18
+ # CodeClimate::TestReporter.start
19
19
 
20
20
  require 'simplecov'
21
21
  require 'coveralls'
@@ -27,6 +27,7 @@ SimpleCov.formatter = SimpleCov::Formatter::MultiFormatter[
27
27
  ]
28
28
  SimpleCov.start
29
29
 
30
+
30
31
  $LOAD_PATH.unshift File.expand_path('../../lib', __FILE__)
31
32
  require 'yasuri'
32
33
 
data/spec/yasuri_spec.rb CHANGED
@@ -18,7 +18,7 @@ describe 'Yasuri' do
18
18
  #############
19
19
  describe '.json2tree' do
20
20
  it "fail if empty json" do
21
- expect { Yasuri.json2tree("{}") }.to raise_error
21
+ expect { Yasuri.json2tree("{}") }.to raise_error(RuntimeError)
22
22
  end
23
23
 
24
24
  it "return TextNode" do
@@ -39,7 +39,7 @@ describe 'Yasuri' do
39
39
  "truncate" : "^[^,]+"
40
40
  }|
41
41
  generated = Yasuri.json2tree(src)
42
- original = Yasuri::TextNode.new('/html/body/p[1]', "content", truncate:/^[^,]+/)
42
+ original = Yasuri::TextNode.new('/html/body/p[1]', "content", {}, truncate:/^[^,]+/)
43
43
  compare_generated_vs_original(generated, original, @index_page)
44
44
  end
45
45
 
@@ -153,7 +153,7 @@ describe 'Yasuri' do
153
153
  end
154
154
 
155
155
  it "return text node with truncate_regexp" do
156
- node = Yasuri::TextNode.new("/html/head/title", "title", truncate:/^[^,]+/)
156
+ node = Yasuri::TextNode.new("/html/head/title", "title", {}, truncate:/^[^,]+/)
157
157
  json = Yasuri.tree2json(node)
158
158
  expected_str = %q| { "node": "text",
159
159
  "name": "title",
@@ -81,7 +81,7 @@ describe 'Yasuri' do
81
81
  node = Yasuri::StructNode.new(invalid_xpath, "table", [
82
82
  Yasuri::TextNode.new('./td[1]', "title")
83
83
  ])
84
- expect { node.inject(@agent, @page) }.to raise_error
84
+ expect { node.inject(@agent, @page) }.to raise_error(Nokogiri::XML::XPath::SyntaxError)
85
85
  end
86
86
 
87
87
  it 'fail with invalid xpath in children' do
@@ -90,7 +90,7 @@ describe 'Yasuri' do
90
90
  Yasuri::TextNode.new(invalid_xpath, "title"),
91
91
  Yasuri::TextNode.new('./td[2]', "pub_date"),
92
92
  ])
93
- expect { node.inject(@agent, @page) }.to raise_error
93
+ expect { node.inject(@agent, @page) }.to raise_error(Nokogiri::XML::XPath::SyntaxError)
94
94
  end
95
95
 
96
96
  it 'scrape all tables' do
@@ -126,7 +126,7 @@ describe 'Yasuri' do
126
126
  Yasuri::TextNode.new('./td[1]', "title"),
127
127
  Yasuri::TextNode.new('./td[2]', "pub_date"),
128
128
  ])
129
- expected = @table_1996.map{|h| h.map{|k,v| [k.to_sym, v] }.to_h }
129
+ expected = @table_1996.map{|h| Hash[h.map{|k,v| [k.to_sym, v] }] }
130
130
  actual = node.inject(@agent, @page, symbolize_names:true)
131
131
  expect(actual).to match expected
132
132
  end
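The change from `.to_h` to `Hash[...]` in this hunk matters because `Array#to_h` was only added in Ruby 2.1, while `Hash[]` with an array of pairs also works on Ruby 1.9. A small sketch, using a hypothetical `symbolize_rows` helper and one row from the example table:

```ruby
# Hypothetical helper: converts the string keys of each row to symbols
# using Hash[], which accepts an array of [key, value] pairs.
def symbolize_rows(rows)
  rows.map { |h| Hash[h.map { |k, v| [k.to_sym, v] }] }
end

rows = [{ "title" => "The Perfect Insider", "pub_date" => "1996/4/5" }]
p symbolize_rows(rows)
```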
@@ -31,7 +31,7 @@ describe 'Yasuri' do
31
31
  it 'fail with invalid xpath' do
32
32
  invalid_xpath = '/html/body/no_match_node['
33
33
  node = Yasuri::TextNode.new(invalid_xpath, "title")
34
- expect { node.inject(@agent, @index_page) }.to raise_error
34
+ expect { node.inject(@agent, @index_page) }.to raise_error(Nokogiri::XML::XPath::SyntaxError)
35
35
  end
36
36
 
37
37
  it "can be defined by DSL, return single TextNode title" do
metadata CHANGED
@@ -1,14 +1,14 @@
1
1
  --- !ruby/object:Gem::Specification
2
2
  name: yasuri
3
3
  version: !ruby/object:Gem::Version
4
- version: 0.0.11
4
+ version: 1.9.11
5
5
  platform: ruby
6
6
  authors:
7
7
  - TAC
8
8
  autorequire:
9
9
  bindir: bin
10
10
  cert_chain: []
11
- date: 2015-03-03 00:00:00.000000000 Z
11
+ date: 2016-11-14 00:00:00.000000000 Z
12
12
  dependencies:
13
13
  - !ruby/object:Gem::Dependency
14
14
  name: bundler
@@ -151,6 +151,8 @@ files:
151
151
  - LICENSE
152
152
  - README.md
153
153
  - Rakefile
154
+ - USAGE.ja.md
155
+ - USAGE.md
154
156
  - app.rb
155
157
  - lib/yasuri.rb
156
158
  - lib/yasuri/version.rb