yasuri 2.0.12 → 3.3.0

data/USAGE.md CHANGED
@@ -1,27 +1,32 @@
- # Yasuri Usage
+ # Yasuri

  ## What is Yasuri
- `Yasuri` is an easy web-scraping library for supporting "Mechanize".
+ `Yasuri` (鑢) is a library for declarative web scraping and a command-line tool that uses it.

- Yasuri (鑢) is an easy web-scraping library for supporting "[Mechanize](https://github.com/sparklemotion/mechanize)".
+ It performs scraping by simply describing the expected result in a simple declarative notation.

- Yasuri can reduce frequently processes in Scraping.
+ Yasuri makes it easy to write common scraping operations.
+ For example, the following can be implemented with little effort:

- For example,
-
- + Open links in the page, scraping each page, and getting result as Hash.
- + Scraping texts in the page, and named result in Hash.
- + A table that repeatedly appears in a page each, scraping, get as an array.
- + Of each page provided by the pagination, scraping the only top 3.
-
- You can implement easy by Yasuri.
+ + Scrape multiple texts in a page and name them into a Hash
+ + Open multiple links in a page and get the result of scraping each page as a Hash
+ + Scrape each table that appears repeatedly in the page and get the results as an array
+ + Scrape only the first three pages provided by pagination

  ## Quick Start

+
+ #### Install
+ ```sh
+ # for Ruby 2.3.2, add to your Gemfile:
+ gem 'yasuri', '~> 2.0', '>= 2.0.13'
  ```
+ or
+ ```sh
+ # for Ruby 3.0.0 or later
  $ gem install yasuri
  ```
-
+ #### Use as library
  ```ruby
  require 'yasuri'
  require 'mechanize'
@@ -32,81 +37,190 @@ root = Yasuri.links_root '//*[@id="menu"]/ul/li/a' do
  text_content '//*[@id="contents"]/p[1]'
  end

- agent = Mechanize.new
- root_page = agent.get("http://some.scraping.page.net/")
+ result = root.scrape("http://some.scraping.page.tac42.net/")
+ # => [
+ #      {"title" => "PageTitle 01", "content" => "Page Contents 01" },
+ #      {"title" => "PageTitle 02", "content" => "Page Contents 02" },
+ #      ...
+ #      {"title" => "PageTitle N",  "content" => "Page Contents N" }
+ #    ]
+ ```
+
+ In this example, each page linked from the elements matched by the XPath of the LinkNode (`links_root`) is opened, and the two texts specified by the XPaths of the TextNodes (`text_title`, `text_content`) are scraped from it.
+
+ (In other words: open each link `//*[@id="menu"]/ul/li/a`, and scrape `//*[@id="contents"]/h2` and `//*[@id="contents"]/p[1]` on each page.)
+

- result = root.inject(agent, root_page)
- # => [ {"title" => "PageTitle1", "content" => "Page Contents1" },
- #      {"title" => "PageTitle2", "content" => "Page Contents2" }, ... ]
+ #### Use as CLI tool
+ The same thing as above can be executed as a CLI command.

+ ```sh
+ $ yasuri scrape "http://some.scraping.page.tac42.net/" -j '
+ {
+   "links_root": {
+     "path": "//*[@id=\"menu\"]/ul/li/a",
+     "text_title": "//*[@id=\"contents\"]/h2",
+     "text_content": "//*[@id=\"contents\"]/p[1]"
+   }
+ }'
+
+ [
+   {"title":"PageTitle 01","content":"Page Contents 01"},
+   {"title":"PageTitle 02","content":"Page Contents 02"},
+   ...,
+   {"title":"PageTitle N","content":"Page Contents N"}
+ ]
  ```
- This example, from the pages of each link that is expressed by the xpath of LinkNode(`links_root`), to scraping the two text that is expressed by the xpath of TextNode(`text_title`,`text_content`).

- (i.e. open each links `//*[@id="menu"]/ul/li/a` and, scrape `//*[@id="contents"]/h2` and `//*[@id="contents"]/p[1]`.)
+ The result is returned as a JSON-formatted string.

- ## Basics
+ ----------------------------
+ ## Parse Tree

- 1. Construct parse tree.
- 2. Start parse with Mechanize agent and first page.
+ A parse tree is tree-structured data that declaratively defines the elements to be scraped and the structure of the output.

- ### Construct parse tree
+ A parse tree consists of nested `Node`s; each has `Type`, `Name`, `Path`, `Children`, and `Options` attributes, and scrapes according to its `Type`. (Note that only `MapNode` has no `Path`.)

- ```ruby
- require 'mechanize'
- require 'yasuri'
+ The parse tree is defined in the following format:

+ ```ruby
+ # A simple tree consisting of one node
+ Yasuri.<Type>_<Name> <Path> [,<Options>]

- # 1. Construct parse tree.
- tree = Yasuri.links_title '/html/body/a' do
-   text_name '/html/body/p'
- end
+ # Nested tree
+ Yasuri.<Type>_<Name> <Path> [,<Options>] do
+   <Type>_<Name> <Path> [,<Options>] do
+     <Type>_<Name> <Path> [,<Options>]
+     ...
+   end
+ end
+ ```

- # 2. Start parse with Mechanize agent and first page.
- agent = Mechanize.new
- page = agent.get(uri)
+ **Example**

+ ```ruby
+ # A simple tree consisting of one node
+ Yasuri.text_title '/html/head/title', truncate:/^[^,]+/

- tree.inject(agent, page)
+ # Nested tree
+ Yasuri.links_root '//*[@id="menu"]/ul/li/a' do
+   struct_table './tr' do
+     text_title './td[1]'
+     text_pub_date './td[2]'
+   end
+ end
  ```

- Tree is definable by 2(+1) ways, DSL and json (and basic ruby code). In above example, DSL.
+ Parse trees can be defined in Ruby DSL, JSON, or YAML.
+ The following is an example of the same parse tree as above, defined in each notation.
+

+ **Case of defining as Ruby DSL**
  ```ruby
- # Construct by json.
- src = <<-EOJSON
- { "node" : "links",
-   "name" : "title",
-   "path" : "/html/body/a",
-   "children" : [
-     { "node" : "text",
-       "name" : "name",
-       "path" : "/html/body/p"
-     }
-   ]
- }
- EOJSON
- tree = Yasuri.json2tree(src)
+ Yasuri.links_title '/html/body/a' do
+   text_name '/html/body/p'
+ end
  ```

- ### Node
- Tree is constructed by nested Nodes.
- Node has `Type`, `Name`, `Path`, `Childlen`, and `Options`.
+ **Case of defining as JSON**
+ ```json
+ {
+   "links_title": {
+     "path": "/html/body/a",
+     "text_name": "/html/body/p"
+   }
+ }
+ ```

- Node is defined by this format.
+ **Case of defining as YAML**
+ ```yaml
+ links_title:
+   path: "/html/body/a"
+   text_name: "/html/body/p"
+ ```

+ **Special case of the parse tree**

+ If there is only one element directly under the root, that element is returned directly instead of a Hash (Object).
+ ```json
+ {
+   "text_title": "/html/head/title",
+   "text_body": "/html/body"
+ }
+ # => {"title": "Welcome to yasuri!", "body": "Yasuri is ..."}
+
+ {
+   "text_title": "/html/head/title"
+ }
+ # => Welcome to yasuri!
+ ```
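The unwrapping rule described above can be sketched in plain Ruby. This is an assumption-level illustration based on the examples, not yasuri's actual code; `unwrap_single` is a hypothetical helper:

```ruby
# If the result Hash has exactly one entry directly under the root,
# return its value alone; otherwise return the Hash unchanged.
def unwrap_single(result)
  result.is_a?(Hash) && result.size == 1 ? result.values.first : result
end

unwrap_single({ "title" => "Welcome to yasuri!" })  # => "Welcome to yasuri!"
unwrap_single({ "title" => "t", "body" => "b" })    # => {"title"=>"t", "body"=>"b"}
```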
+
+
+ In JSON or YAML format, an attribute can specify `path` directly as its value if the node has no child nodes. The following two JSON definitions produce the same parse tree.
+
+ ```json
+ {
+   "text_name": "/html/body/p"
+ }
+
+ {
+   "text_name": {
+     "path": "/html/body/p"
+   }
+ }
+ ```
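That equivalence can be checked with the `json` stdlib alone; the `normalize` helper below is hypothetical, sketching the assumed shorthand rule rather than yasuri's implementation:

```ruby
require 'json'

# A node given as a bare String is assumed to be shorthand
# for { "path" => <string> }.
def normalize(node)
  node.is_a?(String) ? { "path" => node } : node
end

short = JSON.parse('{ "text_name": "/html/body/p" }')
long  = JSON.parse('{ "text_name": { "path": "/html/body/p" } }')

normalize(short["text_name"]) == long["text_name"]  # => true
```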
+ ### Run Parse Tree
+ Call the `Node#scrape(uri, opt = {})` method on the root node of the parse tree.
+
+ **Example**
  ```ruby
- # Top Level
- Yasuri.<Type>_<Name> <Path> [,<Options>]
+ root = Yasuri.links_root '//*[@id="menu"]/ul/li/a' do
+   text_title '//*[@id="contents"]/h2'
+   text_content '//*[@id="contents"]/p[1]'
+ end

- # Nested
- Yasuri.<Type>_<Name> <Path> [,<Options>] do
-   <Type>_<Name> <Path> [,<Options>] do
-     <Children>
-   end
- end
+ result = root.scrape("http://some.scraping.page.tac42.net/", interval_ms: 1000)
  ```

+ + `uri` is the URI of the page to be scraped.
+ + `opt` is an options Hash. The following options are available.
+
+ Yasuri uses `Mechanize` internally as the agent that performs the scraping.
+ If you want to supply your own agent instance, call `Node#scrape_with_agent(uri, agent, opt = {})`.
+
+ ```ruby
+ require 'logger'
+
+ agent = Mechanize.new
+ agent.log = Logger.new $stderr
+ agent.request_headers = {
+   # ...
+ }
+
+ result = root.scrape_with_agent(
+   "http://some.scraping.page.tac42.net/",
+   agent,
+   interval_ms: 1000)
+ ```
+
+ ### `opt`
+ #### `interval_ms`
+ The interval [milliseconds] between requests when fetching multiple pages.
+
+ If omitted, requests are made continuously without an interval. If requests to many pages are expected, it is strongly recommended to specify an interval to avoid putting a high load on the target host.
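As a rough illustration of what an inter-request interval means, here is a plain-Ruby sketch; `each_with_interval` is a hypothetical helper, not part of yasuri:

```ruby
# Yields each URI in turn, sleeping interval_ms milliseconds between
# consecutive requests (no sleep before the first one).
def each_with_interval(uris, interval_ms: 0)
  uris.each_with_index.map do |uri, i|
    sleep(interval_ms / 1000.0) if i.positive? && interval_ms.positive?
    yield uri
  end
end

pages = each_with_interval(%w[/page01 /page02], interval_ms: 100) { |u| "fetched #{u}" }
# => ["fetched /page01", "fetched /page02"]
```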
+
+ #### `retry_count`
+ The number of retries when fetching a page fails. If omitted, it retries 5 times.
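The retry behavior can be sketched like this; the shape is an assumption, and `fetch_with_retry` is a hypothetical helper rather than yasuri's code:

```ruby
# Retries the block up to retry_count times after the initial attempt,
# re-raising the error once the retries are exhausted.
def fetch_with_retry(uri, retry_count: 5)
  attempts = 0
  begin
    yield uri
  rescue StandardError
    attempts += 1
    retry if attempts <= retry_count
    raise
  end
end
```

Under this sketch, `retry_count: 5` means a failing page is attempted 6 times in total before the error propagates.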
+
+ #### `symbolize_names`
+ If true, the keys of the result are returned as Symbols.
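What symbolized keys look like can be sketched in plain Ruby; this is illustrative only, and `symbolize_keys` is a hypothetical helper:

```ruby
# Recursively converts the String keys of a result into Symbols,
# mirroring what symbolize_names: true produces.
def symbolize_keys(obj)
  case obj
  when Hash  then obj.each_with_object({}) { |(k, v), h| h[k.to_sym] = symbolize_keys(v) }
  when Array then obj.map { |e| symbolize_keys(e) }
  else obj
  end
end

symbolize_keys({ "title" => "Page01", "links" => [{ "name" => "a" }] })
# => {:title=>"Page01", :links=>[{:name=>"a"}]}
```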
+
+ --------------------------
+ ## Node
+
+ A Node is a branch or leaf of the parse tree. It has `Type`, `Name`, `Path`, `Children`, and `Options`, and scrapes according to its `Type`. (Note that only `MapNode` has no `Path`.)
+
+
  #### Type
  Type means the behavior of the Node.

@@ -114,17 +228,20 @@ Type meen behavior of Node.
  - *Struct*
  - *Links*
  - *Paginate*
+ - *Map*

- ### Name
+ See the description of each node for details.
+
+ #### Name
  Name is used as the key in the returned hash.

- ### Path
+ #### Path
  Path selects the target node by XPath or CSS selector. It is passed to Mechanize `search`.

- ### Childlen
+ #### Children
  Child nodes. A TextNode always has an empty set, because a TextNode is a leaf.

- ### Options
+ #### Options
  Parse options. They differ for each type. You can get the options and their values with the `opt` method.

  ```ruby
@@ -136,10 +253,12 @@ node.opt #=> {:truncate => /^[^,]+/, :proc => nil}
  ## Text Node
  TextNode returns the scraped text. This node must be a leaf.

+
+
  ### Example

  ```html
- <!-- http://yasuri.example.net -->
+ <!-- http://yasuri.example.tac42.net -->
  <html>
  <head></head>
  <body>
@@ -150,25 +269,24 @@ TextNode return scraped text. This node have to be leaf.
  ```

  ```ruby
- agent = Mechanize.new
- page = agent.get("http://yasuri.example.net")
-
  p1 = Yasuri.text_title '/html/body/p[1]'
  p1t = Yasuri.text_title '/html/body/p[1]', truncate:/^[^,]+/
- p2u = Yasuri.text_title '/html/body/p[2]', proc: :upcase
+ p2u = Yasuri.text_title '/html/body/p[1]', proc: :upcase

- p1.inject(agent, page) #=> { "title" => "Hello,World" }
- p1t.inject(agent, page) #=> { "title" => "Hello" }
- node.inject(agent, page) #=> { "title" => "HELLO,YASURI" }
+ p1.scrape("http://yasuri.example.tac42.net")  #=> "Hello,World"
+ p1t.scrape("http://yasuri.example.tac42.net") #=> "Hello"
+ p2u.scrape("http://yasuri.example.tac42.net") #=> "HELLO,WORLD"
  ```

+ Note that if you want to scrape multiple elements in the same page at once, use `MapNode`. See the `MapNode` example for details.
+
  ### Options
  ##### `truncate`
  Truncates the text by matching it against the regexp. If the regexp contains groups, only the first matched group is returned.

  ```ruby
  node = Yasuri.text_example '/html/body/p[1]', truncate:/H(.+)i/
- node.inject(agent, index_page)
+ node.scrape(uri)
  #=> { "example" => "ello,Yasur" }
  ```
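The truncate behavior above can be sketched in plain Ruby. This is an illustrative approximation, not yasuri's actual implementation; `truncate_text` is a hypothetical helper:

```ruby
# Match the text against the regexp and keep the first capture group
# if one exists, otherwise the whole match. Returns nil on no match.
def truncate_text(text, pattern)
  m = pattern.match(text)
  return nil unless m
  m[1] || m[0]
end

truncate_text("Hello,Yasuri", /H(.+)i/)  # => "ello,Yasur"
truncate_text("Hello,World", /^[^,]+/)   # => "Hello"
```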

@@ -179,21 +297,22 @@ If it is given `truncate` option, apply method after truncated.

  ```ruby
  node = Yasuri.text_example '/html/body/p[1]', proc: :upcase, truncate:/H(.+)i/
- node.inject(agent, index_page)
+ node.scrape(uri)
  #=> { "example" => "ELLO,YASUR" }
  ```
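The ordering (truncate first, then the `proc` method) can be checked in plain Ruby; this is an assumption-level sketch using only String and Regexp, not yasuri's code:

```ruby
# Truncate first: String#[] with a regexp and a capture index returns
# the first capture group, mirroring the truncate option.
text = "Hello,Yasuri"
truncated = text[/H(.+)i/, 1]  # => "ello,Yasur"

# Then the proc symbol (:upcase) is sent to the truncated text.
truncated.send(:upcase)        # => "ELLO,YASUR"
```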

  ## Struct Node
  Struct Node returns structured text.

- At first, Struct Node narrow down sub-tags by `Path`. Child nodes parse narrowed tags, and struct node returns hash contains parsed result.
+ First, the Struct Node narrows down the sub-tags by `Path`.
+ The child nodes parse the narrowed tags, and the Struct Node returns a hash containing the parsed results.

  If the Struct Node `Path` matches multiple sub-tags, the child nodes parse each sub-tag and the Struct Node returns an array.

  ### Example

  ```html
- <!-- http://yasuri.example.net -->
+ <!-- http://yasuri.example.tac42.net -->
  <html>
  <head>
  <title>Books</title>
@@ -234,15 +353,12 @@ If Struct Node `Path` matches multi sub-tags, child nodes parse each sub-tags an
  ```

  ```ruby
- agent = Mechanize.new
- page = agent.get("http://yasuri.example.net")
-
  node = Yasuri.struct_table '/html/body/table[1]/tr' do
    text_title './td[1]'
    text_pub_date './td[2]'
- ])
+ end

- node.inject(agent, page)
+ node.scrape("http://yasuri.example.tac42.net")
  #=> [ { "title"    => "The Perfect Insider",
  #       "pub_date" => "1996/4/5" },
  #     { "title"    => "Doctors in Isolated Room",
@@ -261,17 +377,14 @@ Struct node can contain not only Text node.
  ### Example

  ```ruby
- agent = Mechanize.new
- page = agent.get("http://yasuri.example.net")
-
  node = Yasuri.struct_tables '/html/body/table' do
    struct_table './tr' do
      text_title './td[1]'
      text_pub_date './td[2]'
    end
- ])
+ end

- node.inject(agent, page)
+ node.scrape("http://yasuri.example.tac42.net")

  #=> [ { "table" => [ { "title"    => "The Perfect Insider",
  #                      "pub_date" => "1996/4/5" },
@@ -304,7 +417,7 @@ Links Node returns parsed text in each linked pages.

  ### Example
  ```html
- <!-- http://yasuri.example.net -->
+ <!-- http://yasuri.example.tac42.net -->
  <html>
  <head><title>Yasuri Test</title></head>
  <body>
@@ -317,7 +430,7 @@ Links Node returns parsed text in each linked pages.
  ```

  ```html
- <!-- http://yasuri.example.net/child01.html -->
+ <!-- http://yasuri.example.tac42.net/child01.html -->
  <html>
  <head><title>Child 01 Test</title></head>
  <body>
@@ -331,7 +444,7 @@ Links Node returns parsed text in each linked pages.
  ```

  ```html
- <!-- http://yasuri.example.net/child02.html -->
+ <!-- http://yasuri.example.tac42.net/child02.html -->
  <html>
  <head><title>Child 02 Test</title></head>
  <body>
@@ -341,7 +454,7 @@ Links Node returns parsed text in each linked pages.
  ```

  ```html
- <!-- http://yasuri.example.net/child03.html -->
+ <!-- http://yasuri.example.tac42.net/child03.html -->
  <html>
  <head><title>Child 03 Test</title></head>
  <body>
@@ -354,20 +467,17 @@ Links Node returns parsed text in each linked pages.
  ```

  ```ruby
- agent = Mechanize.new
- page = agent.get("http://yasuri.example.net")
-
  node = Yasuri.links_title '/html/body/a' do
    text_content '/html/body/p'
  end

- node.inject(agent, page)
+ node.scrape("http://yasuri.example.tac42.net")
  #=> [ {"content" => "Child 01 page."},
        {"content" => "Child 02 page."},
        {"content" => "Child 03 page."}]
  ```

- At first, Links Node find all links in the page by path. In this case, LinksNode find `/html/body/a` tags in `http://yasuri.example.net`. Then, open href attributes (`./child01.html`, `./child02.html` and `./child03.html`).
+ First, the Links Node finds all links in the page by `Path`. In this case, the LinksNode finds the `/html/body/a` tags in `http://yasuri.example.tac42.net`. Then it opens their href attributes (`./child01.html`, `./child02.html` and `./child03.html`).

  Then the Links Node applies its child nodes to each opened page, and returns the results of each page as an array.

@@ -381,7 +491,7 @@ Paginate Node parses and returns each pages that provid by paginate.
  The target page `page01.html` looks like this; `page02.html` through `page04.html` are similar.

  ```html
- <!-- http://yasuri.example.net/page01.html -->
+ <!-- http://yasuri.example.tac42.net/page01.html -->
  <html>
  <head><title>Page01</title></head>
  <body>
@@ -401,21 +511,17 @@ Target page `page01.html` is like this. `page02.html` to `page04.html` are simil
  ```

  ```ruby
- agent = Mechanize.new
- page = agent.get("http://yasuri.example.net/page01.html")
-
- node = Yasuri.pages_root "/html/body/nav/span/a[@class='next']" do
+ node = Yasuri.pages_root "/html/body/nav/span/a[@class='next']", limit:3 do
    text_content '/html/body/p'
  end

- node.inject(agent, page)
- #=> [ {"content" => "Pagination01"},
-       {"content" => "Pagination02"},
-       {"content" => "Pagination03"},
-       {"content" => "Pagination04"}]
+ node.scrape("http://yasuri.example.tac42.net/page01.html")
+ #=> [ {"content" => "Patination01"},
+ #     {"content" => "Patination02"},
+ #     {"content" => "Patination03"}]
  ```
-
+ The Paginate Node requires a link to the next page.
+ In this case, it is the `NextPage` link `/html/body/nav/span/a[@class='next']`.

  ### Options
  ##### `limit`
@@ -425,7 +531,7 @@ Upper limit of open pages in pagination.
  node = Yasuri.pages_root "/html/body/nav/span/a[@class='next']", limit:2 do
    text_content '/html/body/p'
  end
- node.inject(agent, page)
+ node.scrape(uri)
  #=> [ {"content" => "Pagination01"}, {"content" => "Pagination02"}]
  ```
  The Paginate Node opens up to 2 pages, as given by `limit`. In this situation the pagination has 4 pages, but the result array has 2 entries because `limit:2` was given.
@@ -434,33 +540,177 @@ Paginate Node open upto 2 given by `limit`. In this situation, pagination has 4
  The `flatten` option expands the results of each page.

  ```ruby
- agent = Mechanize.new
- page = agent.get("http://yasuri.example.net/page01.html")
-
  node = Yasuri.pages_root "/html/body/nav/span/a[@class='next']", flatten:false do
    text_title '/html/head/title'
    text_content '/html/body/p'
  end
- node.inject(agent, page)
+ node.scrape("http://yasuri.example.tac42.net/page01.html")

  #=> [ {"title" => "Page01",
- "content" => "Patination01"},
- {"title" => "Page01",
- "content" => "Patination02"},
- {"title" => "Page01",
- "content" => "Patination03"}]
+ #      "content" => "Patination01"},
+ #     {"title" => "Page02",
+ #      "content" => "Patination02"},
+ #     {"title" => "Page03",
+ #      "content" => "Patination03"}]


  node = Yasuri.pages_root "/html/body/nav/span/a[@class='next']", flatten:true do
    text_title '/html/head/title'
    text_content '/html/body/p'
  end
- node.inject(agent, page)
+ node.scrape("http://yasuri.example.tac42.net/page01.html")

  #=> [ "Page01",
- "Patination01",
- "Page02",
- "Patination02",
- "Page03",
- "Patination03"]
+ #     "Patination01",
+ #     "Page02",
+ #     "Patination02",
+ #     "Page03",
+ #     "Patination03"]
  ```
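The difference can be sketched with plain Ruby arrays; this is an assumption-level illustration of the flatten semantics, not yasuri's code:

```ruby
# Per-page results as returned without flatten: one Hash per page.
per_page = [
  { "title" => "Page01", "content" => "Patination01" },
  { "title" => "Page02", "content" => "Patination02" },
  { "title" => "Page03", "content" => "Patination03" }
]

# With flatten, each page's values are spliced into one flat array.
flattened = per_page.flat_map(&:values)
# => ["Page01", "Patination01", "Page02", "Patination02", "Page03", "Patination03"]
```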
+
+ ## Map Node
+ *MapNode* is a node that groups the results of its child nodes. This node is always a branch node in the parse tree.
+
+ ### Example
+
+ ```html
+ <!-- http://yasuri.example.tac42.net -->
+ <html>
+   <head><title>Yasuri Example</title></head>
+   <body>
+     <p>Hello,World</p>
+     <p>Hello,Yasuri</p>
+   </body>
+ </html>
+ ```
+
+ ```ruby
+ tree = Yasuri.map_root do
+   text_title '/html/head/title'
+   text_body_p '/html/body/p[1]'
+ end
+
+ tree.scrape("http://yasuri.example.tac42.net") #=> { "title" => "Yasuri Example", "body_p" => "Hello,World" }
+
+
+ tree = Yasuri.map_root do
+   map_group1 { text_child01 '/html/body/a[1]' }
+   map_group2 do
+     text_child01 '/html/body/a[1]'
+     text_child03 '/html/body/a[3]'
+   end
+ end
+
+ tree.scrape("http://yasuri.example.tac42.net") #=> {
+ #   "group1" => {
+ #     "child01" => "child01"
+ #   },
+ #   "group2" => {
+ #     "child01" => "child01",
+ #     "child03" => "child03"
+ #   }
+ # }
+ ```
+
+ ### Options
+ None.
+
+
+ -------------------------
+ ## Usage
+
+ ### Use as library
+ When used as a library, the tree can be defined in DSL, JSON, or YAML format.
+
+ ```ruby
+ require 'yasuri'
+
+ # 1. Create a parse tree.
+ # Define by Ruby DSL
+ tree = Yasuri.links_title '/html/body/a' do
+   text_name '/html/body/p'
+ end
+
+ # Define by JSON
+ src = <<-EOJSON
+ {
+   "links_title": {
+     "path": "/html/body/a",
+     "text_name": "/html/body/p"
+   }
+ }
+ EOJSON
+ tree = Yasuri.json2tree(src)
+
+
+ # Define by YAML
+ src = <<-EOYAML
+ links_title:
+   path: "/html/body/a"
+   text_name: "/html/body/p"
+ EOYAML
+ tree = Yasuri.yaml2tree(src)
+
+ # 2. Give the URL to start parsing
+ tree.scrape(uri)
+ ```
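As a stdlib-only sanity check (no yasuri required), the JSON and YAML sources above describe the same structure once parsed:

```ruby
require 'json'
require 'yaml'

json_src = <<-EOJSON
{
  "links_title": {
    "path": "/html/body/a",
    "text_name": "/html/body/p"
  }
}
EOJSON

yaml_src = <<-EOYAML
links_title:
  path: "/html/body/a"
  text_name: "/html/body/p"
EOYAML

JSON.parse(json_src) == YAML.safe_load(yaml_src)  # => true
```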
+
+ ### Use as CLI tool
+
+ **Help**
+ ```sh
+ $ yasuri help scrape
+ Usage:
+   yasuri scrape <URI> [[--file <TREE_FILE>] or [--json <JSON>]]
+
+ Options:
+   f, [--file=FILE]      # path to file that written yasuri tree as json or yaml
+   j, [--json=JSON]      # yasuri tree format json string
+   i, [--interval=N]     # interval each request [ms]
+
+ Getting from <URI> and scrape it. with <JSON> or json/yml from <TREE_FILE>. They should be Yasuri's format json or yaml string.
+ ```
+
+ In the CLI tool, you can specify the parse tree in either of the following ways:
+ + `--file`, `-f` : read the parse tree from a JSON or YAML file.
+ + `--json`, `-j` : specify the parse tree directly as a string.
+
+
+ **Example of specifying a parse tree as a file**
+ ```sh
+ % cat sample.yml
+ text_title: "/html/head/title"
+ text_desc: "//*[@id=\"intro\"]/p"
+
+ % yasuri scrape "https://www.ruby-lang.org/en/" --file sample.yml
+ {"title":"Ruby Programming Language","desc":"\n    A dynamic, open source programming language with a focus on\n    simplicity and productivity. It has an elegant syntax that is\n    natural to read and easy to write.\n    "}
+
+ % cat sample.json
+ {
+   "text_title": "/html/head/title",
+   "text_desc": "//*[@id=\"intro\"]/p"
+ }
+
+ % yasuri scrape "https://www.ruby-lang.org/en/" --file sample.json
+ {"title":"Ruby Programming Language","desc":"\n    A dynamic, open source programming language with a focus on\n    simplicity and productivity. It has an elegant syntax that is\n    natural to read and easy to write.\n    "}
  ```
+
+ Whether the file is written in JSON or YAML is determined automatically.
+
+ **Example of specifying a parse tree directly as JSON**
+ ```sh
+ $ yasuri scrape "https://www.ruby-lang.org/en/" -j '
+ {
+   "text_title": "/html/head/title",
+   "text_desc": "//*[@id=\"intro\"]/p"
+ }'
+
+ {"title":"Ruby Programming Language","desc":"\n    A dynamic, open source programming language with a focus on\n    simplicity and productivity. It has an elegant syntax that is\n    natural to read and easy to write.\n    "}
+ ```
+
+ #### Other options
+ + `--interval`, `-i` : The interval [milliseconds] between requests when fetching multiple pages.
+
+ **Example: request at 1-second intervals**
+ ```sh
+ $ yasuri scrape "https://www.ruby-lang.org/en/" --file sample.yml --interval 1000
+ ```