yasuri 2.0.13 → 3.3.1

data/USAGE.md CHANGED
@@ -1,27 +1,32 @@
- # Yasuri Usage
+ # Yasuri

  ## What is Yasuri
- `Yasuri` is an easy web-scraping library for supporting "Mechanize".
+ `Yasuri` (鑢) is a library for declarative web scraping, and a command-line tool built on it.

- Yasuri (鑢) is an easy web-scraping library for supporting "[Mechanize](https://github.com/sparklemotion/mechanize)".
+ It scrapes pages by simply describing the expected result in a simple declarative notation.

- Yasuri can reduce frequently processes in Scraping.
+ Yasuri makes it easy to write common scraping operations.
+ For example, the following can be implemented easily:

- For example,
-
- + Open links in the page, scraping each page, and getting result as Hash.
- + Scraping texts in the page, and named result in Hash.
- + A table that repeatedly appears in a page each, scraping, get as an array.
- + Of each page provided by the pagination, scraping the only top 3.
-
- You can implement easy by Yasuri.
+ + Scrape multiple texts in a page and name them into a Hash
+ + Open multiple links in a page and get the result of scraping each page as a Hash
+ + Scrape each table that appears repeatedly in a page and get the results as an array
+ + Scrape only the first three pages provided by pagination

  ## Quick Start

+
+ #### Install
+ ```sh
+ # for Ruby 2.3.2
+ $ gem 'yasuri', '~> 2.0', '>= 2.0.13'
  ```
+ or
+ ```sh
+ # for Ruby 3.0.0 or later
  $ gem install yasuri
  ```
-
+ #### Use as library
  ```ruby
  require 'yasuri'
  require 'mechanize'
@@ -32,96 +37,190 @@ root = Yasuri.links_root '//*[@id="menu"]/ul/li/a' do
    text_content '//*[@id="contents"]/p[1]'
  end

- agent = Mechanize.new
- root_page = agent.get("http://some.scraping.page.net/")
-
- result = root.inject(agent, root_page)
- # => [ {"title" => "PageTitle1", "content" => "Page Contents1" },
- #      {"title" => "PageTitle2", "content" => "Page Contents2" }, ... ]
-
+ result = root.scrape("http://some.scraping.page.tac42.net/")
+ # => [
+ #      {"title" => "PageTitle 01", "content" => "Page Contents 01" },
+ #      {"title" => "PageTitle 02", "content" => "Page Contents 02" },
+ #      ...
+ #      {"title" => "PageTitle N", "content" => "Page Contents N" }
+ #    ]
  ```
+
  This example opens each page linked from the XPath of the LinkNode (`links_root`) and scrapes the two texts located by the XPaths of the TextNodes (`text_title`, `text_content`).

- (i.e. open each links `//*[@id="menu"]/ul/li/a` and, scrape `//*[@id="contents"]/h2` and `//*[@id="contents"]/p[1]`.)
+ (In other words, it opens each link `//*[@id="menu"]/ul/li/a` and scrapes `//*[@id="contents"]/h2` and `//*[@id="contents"]/p[1]` from each page.)

- ## Basics

- 1. Construct parse tree.
- 2. Start parse with Mechanize agent and first page.
+ #### Use as CLI tool
+ The same thing as above can be executed as a CLI command.

- ### Construct parse tree
+ ```sh
+ $ yasuri scrape "http://some.scraping.page.tac42.net/" -j '
+ {
+   "links_root": {
+     "path": "//*[@id=\"menu\"]/ul/li/a",
+     "text_title": "//*[@id=\"contents\"]/h2",
+     "text_content": "//*[@id=\"contents\"]/p[1]"
+   }
+ }'

- ```ruby
- require 'mechanize'
- require 'yasuri'
+ [
+   {"title":"PageTitle 01","content":"Page Contents 01"},
+   {"title":"PageTitle 02","content":"Page Contents 02"},
+   ...,
+   {"title":"PageTitle N","content":"Page Contents N"}
+ ]
+ ```

+ The result is returned as a JSON-formatted string.

- # 1. Construct parse tree.
- tree = Yasuri.links_title '/html/body/a' do
-   text_name '/html/body/p'
- end
+ ----------------------------
+ ## Parse Tree

- # 2. Start parse with Mechanize agent and first page.
- agent = Mechanize.new
- page = agent.get(uri)
+ A parse tree is tree-structured data that declaratively defines the elements to be scraped and the structure of the output.

+ A parse tree consists of nested `Node`s, each of which has `Type`, `Name`, `Path`, `Children`, and `Options` attributes and scrapes according to its `Type`. (Note that only `MapNode` has no `Path`.)

- tree.inject(agent, page)
+ The parse tree is defined in the following format:
+
+ ```ruby
+ # A simple tree consisting of one node
+ Yasuri.<Type>_<Name> <Path> [,<Options>]
+
+ # Nested tree
+ Yasuri.<Type>_<Name> <Path> [,<Options>] do
+   <Type>_<Name> <Path> [,<Options>] do
+     <Type>_<Name> <Path> [,<Options>]
+     ...
+   end
+ end
  ```
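The `<Type>_<Name>` convention packs both the node type and the result key into one method name. As an aside, the split can be pictured in plain Ruby (`split_node_name` is a hypothetical helper for illustration, not yasuri's own code): everything before the first underscore is the type, the rest is the name.

```ruby
# Illustrative only: split a DSL method name such as :text_pub_date into
# the node type (before the first underscore) and the result name (the rest).
def split_node_name(method_name)
  type, name = method_name.to_s.split('_', 2)
  { type: type, name: name }
end

split_node_name(:text_title)    # type: "text",  name: "title"
split_node_name(:text_pub_date) # type: "text",  name: "pub_date"
split_node_name(:links_root)    # type: "links", name: "root"
```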

- Tree is definable by 3(+1) ways, json, yaml, and DSL (or basic ruby code). In above example, DSL.
+ **Example**

  ```ruby
- # Construct by json.
- src = <<-EOJSON
- { "node"     : "links",
-   "name"     : "title",
-   "path"     : "/html/body/a",
-   "children" : [
-     { "node" : "text",
-       "name" : "name",
-       "path" : "/html/body/p"
-     }
-   ]
- }
- EOJSON
- tree = Yasuri.json2tree(src)
+ # A simple tree consisting of one node
+ Yasuri.text_title '/html/head/title', truncate:/^[^,]+/
+
+ # Nested tree
+ Yasuri.links_root '//*[@id="menu"]/ul/li/a' do
+   struct_table './tr' do
+     text_title './td[1]'
+     text_pub_date './td[2]'
+   end
+ end
  ```

+ Parse trees can be defined in Ruby DSL, JSON, or YAML.
+ The following shows the same parse tree defined in each of these notations.
+
+
+ **Case of defining as Ruby DSL**
  ```ruby
- # Construct by yaml.
- src = <<-EOYAML
- title:
-   node: links
+ Yasuri.links_title '/html/body/a' do
+   text_name '/html/body/p'
+ end
+ ```
+
+ **Case of defining as JSON**
+ ```json
+ {
+   "links_title": {
+     "path": "/html/body/a",
+     "text_name": "/html/body/p"
+   }
+ }
+ ```
+
+ **Case of defining as YAML**
+ ```yaml
+ links_title:
    path: "/html/body/a"
- children:
-   - name:
-     node: text
-     path: "/html/body/p"
- EOYAML
- tree = Yasuri.yaml2tree(src)
+   text_name: "/html/body/p"
+ ```
+
+ **Special case of parse tree**
+
+ If there is only one element directly under the root, that element is returned directly instead of a Hash (Object).
+ ```json
+ {
+   "text_title": "/html/head/title",
+   "text_body": "/html/body"
+ }
+ # => {"title": "Welcome to yasuri!", "body": "Yasuri is ..."}
+
+ {
+   "text_title": "/html/head/title"
+ }
+ # => Welcome to yasuri!
  ```


- ### Node
- Tree is constructed by nested Nodes.
- Node has `Type`, `Name`, `Path`, `Childlen`, and `Options`.
+ In JSON or YAML format, an attribute can specify `path` directly as its value if the node has no child Nodes. The following two JSON documents define the same parse tree.

- Node is defined by this format.
+ ```json
+ {
+   "text_name": "/html/body/p"
+ }

+ {
+   "text_name": {
+     "path": "/html/body/p"
+   }
+ }
+ ```
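This shorthand/longhand equivalence can be illustrated by a small normalization step that expands a bare path string into the explicit `path` form (a sketch of the idea, not yasuri's implementation; `expand_shorthand` is a hypothetical helper):

```ruby
require 'json'

# Illustrative only: expand the shorthand form, where a node's value is a
# bare path string, into the explicit form with a "path" attribute.
# The "path" attribute itself is left untouched.
def expand_shorthand(tree)
  tree.to_h do |key, value|
    next [key, value] if key == 'path'
    [key, value.is_a?(String) ? { 'path' => value } : expand_shorthand(value)]
  end
end

short = JSON.parse('{"text_name": "/html/body/p"}')
expand_shorthand(short) # => {"text_name"=>{"path"=>"/html/body/p"}}
```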
+ ### Run ParseTree
+ Call the `Node#scrape(uri, opt={})` method on the root node of the parse tree.

+ **Example**
  ```ruby
- # Top Level
- Yasuri.<Type>_<Name> <Path> [,<Options>]
+ root = Yasuri.links_root '//*[@id="menu"]/ul/li/a' do
+   text_title '//*[@id="contents"]/h2'
+   text_content '//*[@id="contents"]/p[1]'
+ end

- # Nested
- Yasuri.<Type>_<Name> <Path> [,<Options>] do
-   <Type>_<Name> <Path> [,<Options>] do
-     <Children>
-   end
- end
+ result = root.scrape("http://some.scraping.page.tac42.net/", interval_ms: 1000)
+ ```
+
+ + `uri` is the URI of the page to be scraped.
+ + `opt` is a Hash of options. The available options are listed below.
+
+ Yasuri internally uses `Mechanize` as the agent that performs the scraping.
+ If you want to supply this instance yourself, call `Node#scrape_with_agent(uri, agent, opt={})`.
+
+ ```ruby
+ require 'logger'
+
+ agent = Mechanize.new
+ agent.log = Logger.new $stderr
+ agent.request_headers = {
+   # ...
+ }
+
+ result = root.scrape_with_agent(
+   "http://some.scraping.page.tac42.net/",
+   agent,
+   interval_ms: 1000)
  ```

+ ### `opt`
+ #### `interval_ms`
+ The interval [milliseconds] between requests when multiple pages are fetched.
+
+ If omitted, requests are made continuously with no interval. When requests to many pages are expected, specifying an interval is strongly recommended to avoid putting a high load on the target host.
+
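The effect of `interval_ms` is simply a pause between successive requests. A self-contained sketch of that throttling (the block here is a stand-in for the real page request, not a yasuri API):

```ruby
# Illustrative only: visit each URI, sleeping interval_ms between requests.
def fetch_all(uris, interval_ms: 0)
  uris.each_with_index.map do |uri, i|
    sleep(interval_ms / 1000.0) if i.positive? && interval_ms.positive?
    yield(uri)
  end
end

bodies = fetch_all(%w[/page01 /page02 /page03], interval_ms: 10) { |uri| "body of #{uri}" }
```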
+ #### `retry_count`
+ The number of retries when fetching a page fails. If omitted, Yasuri retries 5 times.
+
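The retry behavior can be sketched as a plain retry loop (illustrative only; the real retries happen inside yasuri's request handling):

```ruby
# Illustrative only: run the block, retrying up to retry_count more times
# on failure before re-raising the last error.
def with_retry(retry_count: 5)
  attempts = 0
  begin
    yield
  rescue StandardError
    attempts += 1
    retry if attempts <= retry_count
    raise
  end
end
```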
+ #### `symbolize_names`
+ If true, the keys of the result are returned as symbols.
+
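What `symbolize_names` does to a result can be pictured as a deep key conversion over the returned structure (an illustrative sketch, not yasuri's implementation):

```ruby
# Illustrative only: recursively convert the string keys of a scraped
# result to symbols.
def deep_symbolize(obj)
  case obj
  when Hash  then obj.to_h { |k, v| [k.to_sym, deep_symbolize(v)] }
  when Array then obj.map { |v| deep_symbolize(v) }
  else obj
  end
end

deep_symbolize([{ "title" => "Page01" }]) # => [{ title: "Page01" }]
```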
+ --------------------------
+ ## Node
+
+ A Node is a branch or leaf of the parse tree. It has `Type`, `Name`, `Path`, `Children`, and `Options` attributes, and scrapes according to its `Type`. (Note that only `MapNode` has no `Path`.)
+
+
  #### Type
  Type determines the behavior of the Node.

@@ -129,17 +228,20 @@ Type meen behavior of Node.
  - *Struct*
  - *Links*
  - *Paginate*
+ - *Map*
+
+ See the description of each node for details.

- ### Name
+ #### Name
  Name is used as the key in the returned hash.

- ### Path
+ #### Path
  Path selects the target node by XPath or CSS selector. It is passed to Mechanize's `search`.

- ### Childlen
+ #### Children
  Child nodes. A TextNode always has an empty set of children, because it is a leaf.

- ### Options
+ #### Options
  Parse options. They differ per node type. You can get the options and their values with the `opt` method.

  ```ruby
@@ -151,10 +253,12 @@ node.opt #=> {:truncate => /^[^,]+/, :proc => nil}
  ## Text Node
  TextNode returns the scraped text. This node must be a leaf.

+
+
  ### Example

  ```html
- <!-- http://yasuri.example.net -->
+ <!-- http://yasuri.example.tac42.net -->
  <html>
  <head></head>
  <body>
@@ -165,25 +269,24 @@ TextNode return scraped text. This node have to be leaf.
  ```

  ```ruby
- agent = Mechanize.new
- page = agent.get("http://yasuri.example.net")
-
  p1 = Yasuri.text_title '/html/body/p[1]'
  p1t = Yasuri.text_title '/html/body/p[1]', truncate:/^[^,]+/
- p2u = Yasuri.text_title '/html/body/p[2]', proc: :upcase
+ p2u = Yasuri.text_title '/html/body/p[1]', proc: :upcase

- p1.inject(agent, page)  #=> { "title" => "Hello,World" }
- p1t.inject(agent, page) #=> { "title" => "Hello" }
- node.inject(agent, page) #=> { "title" => "HELLO,YASURI" }
+ p1.scrape("http://yasuri.example.tac42.net")  #=> "Hello,World"
+ p1t.scrape("http://yasuri.example.tac42.net") #=> "Hello"
+ p2u.scrape("http://yasuri.example.tac42.net") #=> "HELLO,WORLD"
  ```

+ Note that if you want to scrape multiple elements in the same page at once, use `MapNode`. See the `MapNode` example for details.
+
  ### Options
  ##### `truncate`
  Truncates the text to the part matched by the regexp. If the regexp contains groups, only the first matched group is returned.

  ```ruby
  node = Yasuri.text_example '/html/body/p[1]', truncate:/H(.+)i/
- node.inject(agent, index_page)
+ node.scrape(uri)
  #=> { "example" => "ello,Yasur" }
  ```

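For reference, the combined `truncate`/`proc` semantics can be reproduced in a few lines of plain Ruby (a sketch of the documented behavior, not yasuri's code; how a non-matching `truncate` is handled here is an assumption):

```ruby
# Illustrative only: truncate keeps the regexp match (or its first captured
# group, if any); proc then sends the given method name to the text.
def apply_text_options(text, truncate: nil, proc: nil)
  if truncate
    m = truncate.match(text)
    text = m ? (m[1] || m[0]) : text # non-match handling is an assumption
  end
  text = text.send(proc) if proc
  text
end

apply_text_options("Hello,World", truncate: /^[^,]+/)                 # => "Hello"
apply_text_options("Hello,Yasuri", truncate: /H(.+)i/, proc: :upcase) # => "ELLO,YASUR"
```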
@@ -194,21 +297,22 @@ If it is given `truncate` option, apply method after truncated.

  ```ruby
  node = Yasuri.text_example '/html/body/p[1]', proc: :upcase, truncate:/H(.+)i/
- node.inject(agent, index_page)
+ node.scrape(uri)
  #=> { "example" => "ELLO,YASUR" }
  ```

  ## Struct Node
  Struct Node returns structured text.

- At first, Struct Node narrow down sub-tags by `Path`. Child nodes parse narrowed tags, and struct node returns hash contains parsed result.
+ First, the Struct Node narrows down the sub-tags by `Path`.
+ Child nodes then parse the narrowed tags, and the Struct Node returns a hash containing the parsed results.

  If the Struct Node `Path` matches multiple sub-tags, the child nodes parse each sub-tag and the Struct Node returns an array.

  ### Example

  ```html
- <!-- http://yasuri.example.net -->
+ <!-- http://yasuri.example.tac42.net -->
  <html>
  <head>
    <title>Books</title>
@@ -249,15 +353,12 @@ If Struct Node `Path` matches multi sub-tags, child nodes parse each sub-tags an
  ```

  ```ruby
- agent = Mechanize.new
- page = agent.get("http://yasuri.example.net")
-
  node = Yasuri.struct_table '/html/body/table[1]/tr' do
    text_title './td[1]'
    text_pub_date './td[2]'
- ])
+ end

- node.inject(agent, page)
+ node.scrape("http://yasuri.example.tac42.net")
  #=> [ { "title"    => "The Perfect Insider",
  #       "pub_date" => "1996/4/5" },
  #     { "title"    => "Doctors in Isolated Room",
@@ -276,17 +377,14 @@ Struct node can contain not only Text node.
  ### Example

  ```ruby
- agent = Mechanize.new
- page = agent.get("http://yasuri.example.net")
-
  node = Yasuri.struct_tables '/html/body/table' do
    struct_table './tr' do
      text_title './td[1]'
      text_pub_date './td[2]'
    end
- ])
+ end

- node.inject(agent, page)
+ node.scrape("http://yasuri.example.tac42.net")

  #=> [ { "table" => [ { "title"    => "The Perfect Insider",
  #                      "pub_date" => "1996/4/5" },
@@ -319,7 +417,7 @@ Links Node returns parsed text in each linked pages.

  ### Example
  ```html
- <!-- http://yasuri.example.net -->
+ <!-- http://yasuri.example.tac42.net -->
  <html>
  <head><title>Yasuri Test</title></head>
  <body>
@@ -332,7 +430,7 @@ Links Node returns parsed text in each linked pages.
  ```

  ```html
- <!-- http://yasuri.example.net/child01.html -->
+ <!-- http://yasuri.example.tac42.net/child01.html -->
  <html>
  <head><title>Child 01 Test</title></head>
  <body>
@@ -346,7 +444,7 @@ Links Node returns parsed text in each linked pages.
  ```

  ```html
- <!-- http://yasuri.example.net/child02.html -->
+ <!-- http://yasuri.example.tac42.net/child02.html -->
  <html>
  <head><title>Child 02 Test</title></head>
  <body>
@@ -356,7 +454,7 @@ Links Node returns parsed text in each linked pages.
  ```

  ```html
- <!-- http://yasuri.example.net/child03.html -->
+ <!-- http://yasuri.example.tac42.net/child03.html -->
  <html>
  <head><title>Child 03 Test</title></head>
  <body>
@@ -369,20 +467,17 @@ Links Node returns parsed text in each linked pages.
  ```

  ```ruby
- agent = Mechanize.new
- page = agent.get("http://yasuri.example.net")
-
  node = Yasuri.links_title '/html/body/a' do
    text_content '/html/body/p'
  end

- node.inject(agent, page)
+ node.scrape("http://yasuri.example.tac42.net")
  #=> [ {"content" => "Child 01 page."},
  #     {"content" => "Child 02 page."},
  #     {"content" => "Child 03 page."}]
  ```

- At first, Links Node find all links in the page by path. In this case, LinksNode find `/html/body/a` tags in `http://yasuri.example.net`. Then, open href attributes (`./child01.html`, `./child02.html` and `./child03.html`).
+ First, the Links Node finds all links in the page that match the path. In this case, it finds the `/html/body/a` tags in `http://yasuri.example.tac42.net`, then opens their href attributes (`./child01.html`, `./child02.html`, and `./child03.html`).

  The Links Node then applies its child nodes to each opened page and returns the results of each page as an array.

@@ -396,7 +491,7 @@ Paginate Node parses and returns each pages that provid by paginate.
  The target page `page01.html` looks like this. `page02.html` through `page04.html` are similar.

  ```html
- <!-- http://yasuri.example.net/page01.html -->
+ <!-- http://yasuri.example.tac42.net/page01.html -->
  <html>
  <head><title>Page01</title></head>
  <body>
@@ -416,21 +511,17 @@ Target page `page01.html` is like this. `page02.html` to `page04.html` are simil
  ```

  ```ruby
- agent = Mechanize.new
- page = agent.get("http://yasuri.example.net/page01.html")
-
- node = Yasuri.pages_root "/html/body/nav/span/a[@class='next']" do
+ node = Yasuri.pages_root "/html/body/nav/span/a[@class='next']", limit:3 do
    text_content '/html/body/p'
  end

- node.inject(agent, page)
- #=> [ {"content" => "Pagination01"},
-       {"content" => "Pagination02"},
-       {"content" => "Pagination03"},
-       {"content" => "Pagination04"}]
+ node.scrape("http://yasuri.example.tac42.net/page01.html")
+ #=> [ {"content" => "Patination01"},
+ #     {"content" => "Patination02"},
+ #     {"content" => "Patination03"}]
  ```
-
+ The Paginate Node requires a link to the next page.
+ In this case, it is the `NextPage` link `/html/body/nav/span/a[@class='next']`.

  ### Options
  ##### `limit`
@@ -440,7 +531,7 @@ Upper limit of open pages in pagination.
  node = Yasuri.pages_root "/html/body/nav/span/a[@class='next']", limit:2 do
    text_content '/html/body/p'
  end
- node.inject(agent, page)
+ node.scrape(uri)
  #=> [ {"content" => "Patination01"}, {"content" => "Patination02"}]
  ```
  The Paginate Node opens at most 2 pages, as given by `limit`. In this situation the pagination has 4 pages, but the result array has 2 entries because `limit:2` was given.
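The `limit` behavior is essentially "follow the next-page link at most N times". A self-contained sketch with a fake chain of pages (no network or yasuri involved; the page names and contents are made up to mirror the example above):

```ruby
# Illustrative only: walk a chain of pages via their :next reference,
# visiting at most `limit` pages, and collect each page's content.
def paginate(first, pages, limit:)
  results = []
  current = first
  while current && results.size < limit
    results << { "content" => pages[current][:content] }
    current = pages[current][:next]
  end
  results
end

PAGES = {
  "page01" => { content: "Patination01", next: "page02" },
  "page02" => { content: "Patination02", next: "page03" },
  "page03" => { content: "Patination03", next: "page04" },
  "page04" => { content: "Patination04", next: nil }
}.freeze

paginate("page01", PAGES, limit: 2)
# => [{"content"=>"Patination01"}, {"content"=>"Patination02"}]
```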
@@ -449,33 +540,177 @@ Paginate Node open upto 2 given by `limit`. In this situation, pagination has 4
  The `flatten` option expands each page's results.

  ```ruby
- agent = Mechanize.new
- page = agent.get("http://yasuri.example.net/page01.html")
-
  node = Yasuri.pages_root "/html/body/nav/span/a[@class='next']" do
    text_title '/html/head/title'
    text_content '/html/body/p'
  end
- node.inject(agent, page)
+ node.scrape("http://yasuri.example.tac42.net/page01.html")

  #=> [ {"title" => "Page01",
- "content" => "Patination01"},
- {"title" => "Page01",
- "content" => "Patination02"},
- {"title" => "Page01",
- "content" => "Patination03"}]
+ #       "content" => "Patination01"},
+ #     {"title" => "Page01",
+ #       "content" => "Patination02"},
+ #     {"title" => "Page01",
+ #       "content" => "Patination03"}]


  node = Yasuri.pages_root "/html/body/nav/span/a[@class='next']", flatten:true do
    text_title '/html/head/title'
    text_content '/html/body/p'
  end
- node.inject(agent, page)
+ node.scrape("http://yasuri.example.tac42.net/page01.html")

  #=> [ "Page01",
- "Patination01",
- "Page02",
- "Patination02",
- "Page03",
- "Patination03"]
+ #     "Patination01",
+ #     "Page02",
+ #     "Patination02",
+ #     "Page03",
+ #     "Patination03"]
  ```
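The difference `flatten` makes can be shown on plain data: by default each page contributes one Hash, while `flatten` splices every scraped value into a single flat array (an illustrative sketch, not yasuri's code):

```ruby
# Illustrative only: per-page results, as produced without flatten.
per_page = [
  { "title" => "Page01", "content" => "Patination01" },
  { "title" => "Page02", "content" => "Patination02" }
]

# With flatten, the values of every page are concatenated in order.
flattened = per_page.flat_map(&:values)
# => ["Page01", "Patination01", "Page02", "Patination02"]
```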
+
+ ## Map Node
+ *MapNode* is a node that groups the results of scraping. This node is always a branch node in the parse tree.
+
+ ### Example
+
+ ```html
+ <!-- http://yasuri.example.tac42.net -->
+ <html>
+ <head><title>Yasuri Example</title></head>
+ <body>
+   <p>Hello,World</p>
+   <p>Hello,Yasuri</p>
+ </body>
+ </html>
+ ```
+
+ ```ruby
+ tree = Yasuri.map_root do
+   text_title '/html/head/title'
+   text_body_p '/html/body/p[1]'
+ end
+
+ tree.scrape("http://yasuri.example.tac42.net") #=> { "title" => "Yasuri Example", "body_p" => "Hello,World" }
+
+
+ tree = Yasuri.map_root do
+   map_group1 { text_child01 '/html/body/a[1]' }
+   map_group2 do
+     text_child01 '/html/body/a[1]'
+     text_child03 '/html/body/a[3]'
+   end
+ end
+
+ tree.scrape("http://yasuri.example.tac42.net") #=> {
+ #   "group1" => {
+ #     "child01" => "child01"
+ #   },
+ #   "group2" => {
+ #     "child01" => "child01",
+ #     "child03" => "child03"
+ #   }
+ # }
+ ```
+
+ ### Options
+ None.
+
+
+ -------------------------
+ ## Usage
+
+ ### Use as library
+ When used as a library, the tree can be defined in DSL, JSON, or YAML format.
+
+ ```ruby
+ require 'yasuri'
+
+ # 1. Create a parse tree.
+ # Define by Ruby DSL
+ tree = Yasuri.links_title '/html/body/a' do
+   text_name '/html/body/p'
+ end
+
+ # Define by JSON
+ src = <<-EOJSON
+ {
+   "links_title": {
+     "path": "/html/body/a",
+     "text_name": "/html/body/p"
+   }
+ }
+ EOJSON
+ tree = Yasuri.json2tree(src)
+
+
+ # Define by YAML
+ src = <<-EOYAML
+ links_title:
+   path: "/html/body/a"
+   text_name: "/html/body/p"
+ EOYAML
+ tree = Yasuri.yaml2tree(src)
+
+ # 2. Give the URL to start parsing
+ tree.scrape(uri)
+ ```
+
+ ### Use as CLI tool
+
+ **Help**
+ ```sh
+ $ yasuri help scrape
+ Usage:
+   yasuri scrape <URI> [[--file <TREE_FILE>] or [--json <JSON>]]
+
+ Options:
+   f, [--file=FILE]   # path to a file containing a yasuri tree as json or yaml
+   j, [--json=JSON]   # yasuri tree as a json string
+   i, [--interval=N]  # interval between requests [ms]
+
+ Fetches <URI> and scrapes it with <JSON>, or with the json/yml read from <TREE_FILE>. Either must be a json or yaml string in Yasuri's format.
+ ```
+
+ In the CLI tool, you can specify the parse tree in either of the following ways:
+ + `--file`, `-f` : read a parse tree written to a file in json or yaml format.
+ + `--json`, `-j` : specify the parse tree directly as a string.
+
+
+ **Example of specifying a parse tree as a file**
+ ```sh
+ % cat sample.yml
+ text_title: "/html/head/title"
+ text_desc: "//*[@id=\"intro\"]/p"
+
+ % yasuri scrape "https://www.ruby-lang.org/en/" --file sample.yml
+ {"title":"Ruby Programming Language","desc":"\n    A dynamic, open source programming language with a focus on\n    simplicity and productivity. It has an elegant syntax that is\n    natural to read and easy to write.\n    "}
+
+ % cat sample.json
+ {
+   "text_title": "/html/head/title",
+   "text_desc": "//*[@id=\"intro\"]/p"
+ }
+
+ % yasuri scrape "https://www.ruby-lang.org/en/" --file sample.json
+ {"title":"Ruby Programming Language","desc":"\n    A dynamic, open source programming language with a focus on\n    simplicity and productivity. It has an elegant syntax that is\n    natural to read and easy to write.\n    "}
+ ```
+
+ Whether the file is written in json or yaml is determined automatically.
+
+ **Example of specifying a parse tree directly in json**
+ ```sh
+ $ yasuri scrape "https://www.ruby-lang.org/en/" -j '
+ {
+   "text_title": "/html/head/title",
+   "text_desc": "//*[@id=\"intro\"]/p"
+ }'
+
+ {"title":"Ruby Programming Language","desc":"\n    A dynamic, open source programming language with a focus on\n    simplicity and productivity. It has an elegant syntax that is\n    natural to read and easy to write.\n    "}
+ ```
+
+ #### Other options
+ + `--interval`, `-i` : the interval [milliseconds] between requests when multiple pages are fetched.
+
+ **Example: Request at 1 second intervals**
+ ```sh
+ $ yasuri scrape "https://www.ruby-lang.org/en/" --file sample.yml --interval 1000
+ ```