yasuri 3.0.0 → 3.3.2
- checksums.yaml +4 -4
- data/.github/workflows/ruby.yml +1 -1
- data/.rubocop.yml +49 -0
- data/.rubocop_todo.yml +0 -0
- data/README.md +70 -27
- data/Rakefile +1 -1
- data/USAGE.ja.md +366 -131
- data/USAGE.md +371 -136
- data/examples/example.rb +78 -0
- data/examples/github.yml +15 -0
- data/examples/sample.json +4 -0
- data/examples/sample.yml +11 -0
- data/exe/yasuri +5 -0
- data/lib/yasuri.rb +1 -0
- data/lib/yasuri/version.rb +1 -1
- data/lib/yasuri/yasuri.rb +96 -76
- data/lib/yasuri/yasuri_cli.rb +78 -0
- data/lib/yasuri/yasuri_links_node.rb +10 -6
- data/lib/yasuri/yasuri_map_node.rb +40 -0
- data/lib/yasuri/yasuri_node.rb +36 -4
- data/lib/yasuri/yasuri_node_generator.rb +14 -9
- data/lib/yasuri/yasuri_paginate_node.rb +26 -16
- data/lib/yasuri/yasuri_struct_node.rb +6 -4
- data/lib/yasuri/yasuri_text_node.rb +9 -7
- data/spec/cli_resources/tree.json +8 -0
- data/spec/cli_resources/tree.yml +5 -0
- data/spec/cli_resources/tree_wrong.json +9 -0
- data/spec/cli_resources/tree_wrong.yml +6 -0
- data/spec/servers/httpserver.rb +0 -2
- data/spec/spec_helper.rb +4 -6
- data/spec/yasuri_cli_spec.rb +114 -0
- data/spec/yasuri_links_node_spec.rb +82 -58
- data/spec/yasuri_map_spec.rb +71 -0
- data/spec/yasuri_paginate_node_spec.rb +99 -88
- data/spec/yasuri_spec.rb +196 -138
- data/spec/yasuri_struct_node_spec.rb +120 -100
- data/spec/yasuri_text_node_spec.rb +22 -32
- data/yasuri.gemspec +29 -22
- metadata +105 -15
- data/app.rb +0 -52
- data/spec/yasuri_node_spec.rb +0 -11
data/USAGE.md
CHANGED
@@ -1,27 +1,32 @@
|
|
1
|
-
# Yasuri
|
1
|
+
# Yasuri
|
2
2
|
|
3
3
|
## What is Yasuri
|
4
|
-
`Yasuri` is
|
4
|
+
`Yasuri` (鑢) is a library for declarative web scraping and a command line tool for scraping with it.
|
5
5
|
|
6
|
-
|
6
|
+
It performs scraping by simply describing the expected result in a simple declarative notation.
|
7
7
|
|
8
|
-
Yasuri
|
8
|
+
Yasuri makes it easy to write common scraping operations.
|
9
|
+
For example, the following processes can be easily implemented.
|
9
10
|
|
10
|
-
|
11
|
-
|
12
|
-
+
|
13
|
-
+
|
14
|
-
+ A table that repeatedly appears in a page each, scraping, get as an array.
|
15
|
-
+ Of each page provided by the pagination, scraping the only top 3.
|
16
|
-
|
17
|
-
You can implement easy by Yasuri.
|
11
|
+
+ Scrape multiple texts in a page and name them into a Hash
|
12
|
+
+ Open multiple links in a page and get the result of scraping each page as a Hash
|
13
|
+
+ Scrape each table that appears repeatedly in the page and get the result as an array
|
14
|
+
+ Scrape only the first three pages of each page provided by pagination
|
18
15
|
|
19
16
|
## Quick Start
|
20
17
|
|
18
|
+
|
19
|
+
#### Install
|
20
|
+
```sh
|
21
|
+
# for Ruby 2.3.2
|
22
|
+
$ gem 'yasuri', '~> 2.0', '>= 2.0.13'
|
21
23
|
```
|
24
|
+
or
|
25
|
+
```sh
|
26
|
+
# for Ruby 3.0.0 or later
|
22
27
|
$ gem install yasuri
|
23
28
|
```
|
24
|
-
|
29
|
+
#### Use as library
|
25
30
|
```ruby
|
26
31
|
require 'yasuri'
|
27
32
|
require 'mechanize'
|
@@ -32,96 +37,190 @@ root = Yasuri.links_root '//*[@id="menu"]/ul/li/a' do
|
|
32
37
|
text_content '//*[@id="contents"]/p[1]'
|
33
38
|
end
|
34
39
|
|
35
|
-
|
36
|
-
|
37
|
-
|
38
|
-
|
39
|
-
#
|
40
|
-
# {"title" => "
|
41
|
-
|
40
|
+
result = root.scrape("http://some.scraping.page.tac42.net/")
|
41
|
+
# => [
|
42
|
+
# {"title" => "PageTitle 01", "content" => "Page Contents 01" },
|
43
|
+
# {"title" => "PageTitle 02", "content" => "Page Contents 02" },
|
44
|
+
# ...
|
45
|
+
# {"title" => "PageTitle N", "content" => "Page Contents N" }
|
46
|
+
# ]
|
42
47
|
```
|
48
|
+
|
43
49
|
This example opens the page behind each link matched by the xpath of the LinksNode (`links_root`) and scrapes the two texts matched by the xpaths of the TextNodes (`text_title`, `text_content`).
|
44
50
|
|
45
|
-
(
|
51
|
+
(In other words, open each link matched by `//*[@id="menu"]/ul/li/a` and scrape `//*[@id="contents"]/h2` and `//*[@id="contents"]/p[1]`.)
|
46
52
|
|
47
|
-
## Basics
|
48
53
|
|
49
|
-
|
50
|
-
|
54
|
+
#### Use as CLI tool
|
55
|
+
The same thing as above can be executed as a CLI command.
|
51
56
|
|
52
|
-
|
57
|
+
```sh
|
58
|
+
$ yasuri scrape "http://some.scraping.page.tac42.net/" -j '
|
59
|
+
{
|
60
|
+
"links_root": {
|
61
|
+
"path": "//*[@id=\"menu\"]/ul/li/a",
|
62
|
+
"text_title": "//*[@id=\"contents\"]/h2",
|
63
|
+
"text_content": "//*[@id=\"contents\"]/p[1]"
|
64
|
+
}
|
65
|
+
}'
|
53
66
|
|
54
|
-
|
55
|
-
|
56
|
-
|
67
|
+
[
|
68
|
+
{"title":"PageTitle 01","content":"Page Contents 01"},
|
69
|
+
{"title":"PageTitle 02","content":"Page Contents 02"},
|
70
|
+
...,
|
71
|
+
{"title":"PageTitle N","content":"Page Contents N"}
|
72
|
+
]
|
73
|
+
```
|
57
74
|
|
75
|
+
The result can be obtained as a string in json format.
|
58
76
|
|
59
|
-
|
60
|
-
|
61
|
-
text_name '/html/body/p'
|
62
|
-
end
|
77
|
+
----------------------------
|
78
|
+
## Parse Tree
|
63
79
|
|
64
|
-
|
65
|
-
agent = Mechanize.new
|
66
|
-
page = agent.get(uri)
|
80
|
+
A parse tree is tree-structured data that declaratively defines the elements to be scraped and the output structure.
|
67
81
|
|
82
|
+
A parse tree consists of nested `Node`s, each of which has `Type`, `Name`, `Path`, `Children`, and `Options` attributes, and scrapes according to its `Type`. (Note that only `MapNode` does not have `Path`.)
|
68
83
|
|
69
|
-
tree
|
84
|
+
The parse tree is defined in the following format:
|
85
|
+
|
86
|
+
```ruby
|
87
|
+
# A simple tree consisting of one node
|
88
|
+
Yasuri.<Type>_<Name> <Path> [,<Options>]
|
89
|
+
|
90
|
+
# Nested tree
|
91
|
+
Yasuri.<Type>_<Name> <Path> [,<Options>] do
|
92
|
+
<Type>_<Name> <Path> [,<Options>] do
|
93
|
+
<Type>_<Name> <Path> [,<Options>]
|
94
|
+
...
|
95
|
+
end
|
96
|
+
end
|
70
97
|
```
|
71
98
|
|
72
|
-
|
99
|
+
**Example**
|
73
100
|
|
74
101
|
```ruby
|
75
|
-
#
|
76
|
-
|
77
|
-
|
78
|
-
|
79
|
-
|
80
|
-
|
81
|
-
|
82
|
-
|
83
|
-
|
84
|
-
|
85
|
-
]
|
86
|
-
}
|
87
|
-
EOJSON
|
88
|
-
tree = Yasuri.json2tree(src)
|
102
|
+
# A simple tree consisting of one node
|
103
|
+
Yasuri.text_title '/html/head/title', truncate:/^[^,]+/
|
104
|
+
|
105
|
+
# Nested tree
|
106
|
+
Yasuri.links_root '//*[@id="menu"]/ul/li/a' do
|
107
|
+
struct_table './tr' do
|
108
|
+
text_title './td[1]'
|
109
|
+
text_pub_date './td[2]'
|
110
|
+
end
|
111
|
+
end
|
89
112
|
```
|
90
113
|
|
114
|
+
Parse trees can be defined in Ruby DSL, JSON, or YAML.
|
115
|
+
The following is an example of the same parse tree as above, defined in each notation.
|
116
|
+
|
117
|
+
|
118
|
+
**Case of defining as Ruby DSL**
|
91
119
|
```ruby
|
92
|
-
|
93
|
-
|
94
|
-
|
95
|
-
|
120
|
+
Yasuri.links_title '/html/body/a' do
|
121
|
+
text_name '/html/body/p'
|
122
|
+
end
|
123
|
+
```
|
124
|
+
|
125
|
+
**Case of defining as JSON**
|
126
|
+
```json
|
127
|
+
{
|
128
|
+
"links_title": {
|
129
|
+
"path": "/html/body/a",
|
130
|
+
"text_name": "/html/body/p"
|
131
|
+
}
|
132
|
+
}
|
133
|
+
```
|
134
|
+
|
135
|
+
**Case of defining as YAML**
|
136
|
+
```yaml
|
137
|
+
links_title:
|
96
138
|
path: "/html/body/a"
|
97
|
-
|
98
|
-
|
99
|
-
|
100
|
-
|
101
|
-
|
102
|
-
|
139
|
+
text_name: "/html/body/p"
|
140
|
+
```
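As a rough check that the JSON and YAML notations describe the same structure, the sketch below parses both with Ruby's standard `json` and `yaml` libraries and compares the resulting Hashes. It does not call Yasuri itself, and the variable names are illustrative only.

```ruby
require 'json'
require 'yaml'

# The JSON and YAML notations of the example tree, as plain strings.
json_src = <<~'EOJSON'
  {
    "links_title": {
      "path": "/html/body/a",
      "text_name": "/html/body/p"
    }
  }
EOJSON

yaml_src = <<~'EOYAML'
  links_title:
    path: "/html/body/a"
    text_name: "/html/body/p"
EOYAML

# Both notations parse to the same nested Hash with String keys.
json_tree = JSON.parse(json_src)
yaml_tree = YAML.safe_load(yaml_src)

puts json_tree == yaml_tree
```

Because both sources reduce to the same Hash, either one can be handed to the matching loader (`Yasuri.json2tree` or `Yasuri.yaml2tree`).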
|
141
|
+
|
142
|
+
**Special case of parse tree**
|
143
|
+
|
144
|
+
If there is only one element directly under the root, that element is returned directly instead of a Hash (Object).
|
145
|
+
```json
|
146
|
+
{
|
147
|
+
"text_title": "/html/head/title",
|
148
|
+
"text_body": "/html/body"
|
149
|
+
}
|
150
|
+
# => {"title": "Welcome to yasuri!", "body": "Yasuri is ..."}
|
151
|
+
|
152
|
+
{
|
153
|
+
"text_title": "/html/head/title"
|
154
|
+
}
|
155
|
+
# => Welcome to yasuri!
|
103
156
|
```
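The single-element rule can be pictured with a few lines of plain Ruby. This is only a sketch of the described behavior; `unwrap_single` is a hypothetical helper, not a Yasuri method.

```ruby
# Hypothetical helper, not part of Yasuri: a result with exactly one key
# is replaced by its only value; anything else is returned unchanged.
def unwrap_single(result)
  result.size == 1 ? result.values.first : result
end

two_elements = { "title" => "Welcome to yasuri!", "body" => "Yasuri is ..." }
one_element  = { "title" => "Welcome to yasuri!" }

p unwrap_single(two_elements)
p unwrap_single(one_element)
```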
|
104
157
|
|
105
158
|
|
106
|
-
|
107
|
-
Tree is constructed by nested Nodes.
|
108
|
-
Node has `Type`, `Name`, `Path`, `Childlen`, and `Options`.
|
159
|
+
In json or yaml format, an attribute can take its `path` directly as the value if it has no child Node. The following two json documents produce the same parse tree.
|
109
160
|
|
110
|
-
|
161
|
+
```json
|
162
|
+
{
|
163
|
+
"text_name": "/html/body/p"
|
164
|
+
}
|
111
165
|
|
166
|
+
{
|
167
|
+
"text_name": {
|
168
|
+
"path": "/html/body/p"
|
169
|
+
}
|
170
|
+
}
|
171
|
+
```
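One way to see why the two forms are equivalent is to expand the shorthand into the long form and compare. The sketch below does this on plain Hashes; `expand_shorthand` is a hypothetical helper written for this illustration, and it ignores node options for brevity.

```ruby
require 'json'

# Hypothetical helper, not Yasuri's code: expand "name": "<path>" into
# "name": { "path": "<path>" }, leaving existing "path" entries untouched.
def expand_shorthand(tree)
  tree.each_with_object({}) do |(key, value), expanded|
    expanded[key] =
      if value.is_a?(Hash)
        expand_shorthand(value)
      elsif key == 'path'
        value
      else
        { 'path' => value }
      end
  end
end

short = JSON.parse('{ "text_name": "/html/body/p" }')
long  = JSON.parse('{ "text_name": { "path": "/html/body/p" } }')

puts expand_shorthand(short) == expand_shorthand(long)
```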
|
172
|
+
### Run ParseTree
|
173
|
+
Call the `Node#scrape(uri, opt={})` method on the root node of the parse tree.
|
112
174
|
|
175
|
+
**Example**
|
113
176
|
```ruby
|
114
|
-
|
115
|
-
|
177
|
+
root = Yasuri.links_root '//*[@id="menu"]/ul/li/a' do
|
178
|
+
text_title '//*[@id="contents"]/h2'
|
179
|
+
text_content '//*[@id="contents"]/p[1]'
|
180
|
+
end
|
116
181
|
|
117
|
-
|
118
|
-
|
119
|
-
|
120
|
-
|
121
|
-
|
122
|
-
|
182
|
+
result = root.scrape("http://some.scraping.page.tac42.net/", interval_ms: 1000)
|
183
|
+
```
|
184
|
+
|
185
|
+
+ `uri` is the URI of the page to be scraped.
|
186
|
+
+ `opt` is a Hash of options. The available options are described in the `opt` section.
|
187
|
+
|
188
|
+
Yasuri uses `Mechanize` internally as an agent to do scraping.
|
189
|
+
To supply your own `Mechanize` instance, call `Node#scrape_with_agent(uri, agent, opt={})`.
|
190
|
+
|
191
|
+
```ruby
|
192
|
+
require 'logger'
|
193
|
+
|
194
|
+
agent = Mechanize.new
|
195
|
+
agent.log = Logger.new $stderr
|
196
|
+
agent.request_headers = {
|
197
|
+
# ...
|
198
|
+
}
|
199
|
+
|
200
|
+
result = root.scrape_with_agent(
|
201
|
+
"http://some.scraping.page.tac42.net/",
|
202
|
+
agent,
|
203
|
+
interval_ms: 1000)
|
123
204
|
```
|
124
205
|
|
206
|
+
### `opt`
|
207
|
+
#### `interval_ms`
|
208
|
+
Interval [milliseconds] for requesting multiple pages.
|
209
|
+
|
210
|
+
If omitted, requests will be made continuously without an interval, but if requests to many pages are expected, it is strongly recommended to specify an interval time to avoid high load on the target host.
|
211
|
+
|
212
|
+
#### `retry_count`
|
213
|
+
Number of retries when page acquisition fails. If omitted, it will retry 5 times.
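To make the retry semantics concrete, here is a minimal sketch of a retry loop with an optional pause between attempts. It is an assumption about the general technique, not Yasuri's actual implementation; `with_retries` and its block are hypothetical.

```ruby
# Hypothetical helper, not Yasuri's implementation: run the block and, on
# failure, retry up to `retry_count` more times, pausing `interval_ms`
# milliseconds between attempts.
def with_retries(retry_count: 5, interval_ms: 0)
  attempts = 0
  begin
    attempts += 1
    yield attempts
  rescue StandardError
    raise if attempts > retry_count
    sleep(interval_ms / 1000.0)
    retry
  end
end

calls = 0
result = with_retries(retry_count: 5) do |attempt|
  calls += 1
  raise 'temporary failure' if attempt < 3 # fail twice, then succeed
  'page body'
end

puts "#{result} after #{calls} attempts"
```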
|
214
|
+
|
215
|
+
#### `symbolize_names`
|
216
|
+
If true, returns the keys of the result set as symbols.
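The effect on the result can be pictured with plain Ruby. The sketch below converts a String-keyed result into the Symbol-keyed shape one would expect from `symbolize_names: true`; it does not call Yasuri, and the sample data is taken from the Quick Start output.

```ruby
# Shape of a result with the default String keys.
string_keyed = { "title" => "PageTitle 01", "content" => "Page Contents 01" }

# Shape one would expect with symbolize_names: true (keys become Symbols).
symbol_keyed = string_keyed.transform_keys(&:to_sym)

p symbol_keyed
```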
|
217
|
+
|
218
|
+
--------------------------
|
219
|
+
## Node
|
220
|
+
|
221
|
+
A Node is a branch or leaf of the parse tree. It has `Type`, `Name`, `Path`, `Children`, and `Options`, and scrapes according to its `Type`. (Note that only `MapNode` does not have `Path`.)
|
222
|
+
|
223
|
+
|
125
224
|
#### Type
|
126
225
|
Type determines the behavior of the Node.
|
127
226
|
|
@@ -129,17 +228,20 @@ Type meen behavior of Node.
|
|
129
228
|
- *Struct*
|
130
229
|
- *Links*
|
131
230
|
- *Paginate*
|
231
|
+
- *Map*
|
232
|
+
|
233
|
+
See the description of each node for details.
|
132
234
|
|
133
|
-
|
235
|
+
#### Name
|
134
236
|
Name is used as the key in the returned hash.
|
135
237
|
|
136
|
-
|
238
|
+
#### Path
|
137
239
|
Path selects the target node by XPath or CSS selector. It is passed to Mechanize's `search`.
|
138
240
|
|
139
|
-
|
241
|
+
#### Children
|
140
242
|
Child nodes. A TextNode always has an empty set, because a TextNode is a leaf.
|
141
243
|
|
142
|
-
|
244
|
+
#### Options
|
143
245
|
Parse options. They differ for each type. You can get the options and their values with the `opt` method.
|
144
246
|
|
145
247
|
```ruby
|
@@ -151,10 +253,12 @@ node.opt #=> {:truncate => /^[^,]+/, :proc => nil}
|
|
151
253
|
## Text Node
|
152
254
|
TextNode returns scraped text. This node must be a leaf.
|
153
255
|
|
256
|
+
|
257
|
+
|
154
258
|
### Example
|
155
259
|
|
156
260
|
```html
|
157
|
-
<!-- http://yasuri.example.net -->
|
261
|
+
<!-- http://yasuri.example.tac42.net -->
|
158
262
|
<html>
|
159
263
|
<head></head>
|
160
264
|
<body>
|
@@ -165,25 +269,24 @@ TextNode return scraped text. This node have to be leaf.
|
|
165
269
|
```
|
166
270
|
|
167
271
|
```ruby
|
168
|
-
agent = Mechanize.new
|
169
|
-
page = agent.get("http://yasuri.example.net")
|
170
|
-
|
171
272
|
p1 = Yasuri.text_title '/html/body/p[1]'
|
172
273
|
p1t = Yasuri.text_title '/html/body/p[1]', truncate:/^[^,]+/
|
173
|
-
p2u = Yasuri.text_title '/html/body/p[
|
274
|
+
p2u = Yasuri.text_title '/html/body/p[1]', proc: :upcase
|
174
275
|
|
175
|
-
p1.
|
176
|
-
p1t.
|
177
|
-
|
276
|
+
p1.scrape("http://yasuri.example.tac42.net") #=> "Hello,World"
|
277
|
+
p1t.scrape("http://yasuri.example.tac42.net") #=> "Hello"
|
278
|
+
p2u.scrape("http://yasuri.example.tac42.net") #=> "HELLO,WORLD"
|
178
279
|
```
|
179
280
|
|
281
|
+
Note that if you want to scrape multiple elements in the same page at once, use `MapNode`. See the `MapNode` example for details.
|
282
|
+
|
180
283
|
### Options
|
181
284
|
##### `truncate`
|
182
285
|
Matches the text against a regexp and truncates it to the match. If the regexp contains a group, only the first matched group is returned.
|
183
286
|
|
184
287
|
```ruby
|
185
288
|
node = Yasuri.text_example '/html/body/p[1]', truncate:/H(.+)i/
|
186
|
-
node.
|
289
|
+
node.scrape(uri)
|
187
290
|
#=> { "example" => "ello,Yasur" }
|
188
291
|
```
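The truncation rule can be reproduced with Ruby's own `Regexp` API. The helper below is a hypothetical sketch of the described behavior, not Yasuri's code; in particular, returning the text unchanged when nothing matches is an assumption.

```ruby
# Hypothetical helper, not Yasuri's code: keep the first capture group if the
# pattern has one, otherwise keep the whole match.
def truncate_text(text, pattern)
  match = pattern.match(text)
  return text if match.nil? # assumption: unmatched text passes through
  match.captures.first || match[0]
end

puts truncate_text("Hello,Yasuri", /H(.+)i/) # pattern with a group
puts truncate_text("Hello,World", /^[^,]+/)  # pattern without a group
```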
|
189
292
|
|
@@ -194,21 +297,22 @@ If it is given `truncate` option, apply method after truncated.
|
|
194
297
|
|
195
298
|
```ruby
|
196
299
|
node = Yasuri.text_example '/html/body/p[1]', proc: :upcase, truncate:/H(.+)i/
|
197
|
-
node.
|
300
|
+
node.scrape(uri)
|
198
301
|
#=> { "example" => "ELLO,YASUR" }
|
199
302
|
```
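The order of operations (truncate first, then apply the method named by `proc`) can be sketched in plain Ruby. This only mirrors the described behavior and does not use Yasuri.

```ruby
text = "Hello,Yasuri"

# Step 1: truncate to the first capture group of the regexp.
truncated = text[/H(.+)i/, 1]

# Step 2: call the method named by the `proc` option on the truncated text.
result = truncated.public_send(:upcase)

puts result
```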
|
200
303
|
|
201
304
|
## Struct Node
|
202
305
|
Struct Node returns structured text.
|
203
306
|
|
204
|
-
At first, Struct Node narrow down sub-tags by `Path`.
|
307
|
+
First, Struct Node narrows down the sub-tags by `Path`.
|
308
|
+
The child nodes parse the narrowed tags, and the struct node returns a hash containing the parsed results.
|
205
309
|
|
206
310
|
If the Struct Node `Path` matches multiple sub-tags, the child nodes parse each sub-tag and the struct node returns an array.
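The shape of the result can be pictured without any HTML at all. In the sketch below, `rows` is hypothetical sample data standing in for the matched sub-tags, and `child_names` stands in for the names of the child nodes; neither comes from Yasuri.

```ruby
# Hypothetical sample rows standing in for the sub-tags matched by `Path`.
rows = [
  ["Sample Book A", "2000/1/1"],
  ["Sample Book B", "2000/2/2"]
]

# Names of the child nodes that parse each cell.
child_names = ["title", "pub_date"]

# Each matched sub-tag becomes one Hash; several matches become an Array.
parsed = rows.map { |cells| child_names.zip(cells).to_h }

p parsed
```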
|
207
311
|
|
208
312
|
### Example
|
209
313
|
|
210
314
|
```html
|
211
|
-
<!-- http://yasuri.example.net -->
|
315
|
+
<!-- http://yasuri.example.tac42.net -->
|
212
316
|
<html>
|
213
317
|
<head>
|
214
318
|
<title>Books</title>
|
@@ -249,15 +353,12 @@ If Struct Node `Path` matches multi sub-tags, child nodes parse each sub-tags an
|
|
249
353
|
```
|
250
354
|
|
251
355
|
```ruby
|
252
|
-
agent = Mechanize.new
|
253
|
-
page = agent.get("http://yasuri.example.net")
|
254
|
-
|
255
356
|
node = Yasuri.struct_table '/html/body/table[1]/tr' do
|
256
357
|
text_title './td[1]'
|
257
358
|
text_pub_date './td[2]'
|
258
|
-
|
359
|
+
end
|
259
360
|
|
260
|
-
node.
|
361
|
+
node.scrape("http://yasuri.example.tac42.net")
|
261
362
|
#=> [ { "title" => "The Perfect Insider",
|
262
363
|
# "pub_date" => "1996/4/5" },
|
263
364
|
# { "title" => "Doctors in Isolated Room",
|
@@ -276,17 +377,14 @@ Struct node can contain not only Text node.
|
|
276
377
|
### Example
|
277
378
|
|
278
379
|
```ruby
|
279
|
-
agent = Mechanize.new
|
280
|
-
page = agent.get("http://yasuri.example.net")
|
281
|
-
|
282
380
|
node = Yasuri.struct_tables '/html/body/table' do
|
283
381
|
struct_table './tr' do
|
284
382
|
text_title './td[1]'
|
285
383
|
text_pub_date './td[2]'
|
286
384
|
end
|
287
|
-
|
385
|
+
end
|
288
386
|
|
289
|
-
node.
|
387
|
+
node.scrape("http://yasuri.example.tac42.net")
|
290
388
|
|
291
389
|
#=> [ { "table" => [ { "title" => "The Perfect Insider",
|
292
390
|
# "pub_date" => "1996/4/5" },
|
@@ -319,7 +417,7 @@ Links Node returns parsed text in each linked pages.
|
|
319
417
|
|
320
418
|
### Example
|
321
419
|
```html
|
322
|
-
<!-- http://yasuri.example.net -->
|
420
|
+
<!-- http://yasuri.example.tac42.net -->
|
323
421
|
<html>
|
324
422
|
<head><title>Yasuri Test</title></head>
|
325
423
|
<body>
|
@@ -332,7 +430,7 @@ Links Node returns parsed text in each linked pages.
|
|
332
430
|
```
|
333
431
|
|
334
432
|
```html
|
335
|
-
<!-- http://yasuri.example.net/child01.html -->
|
433
|
+
<!-- http://yasuri.example.tac42.net/child01.html -->
|
336
434
|
<html>
|
337
435
|
<head><title>Child 01 Test</title></head>
|
338
436
|
<body>
|
@@ -346,7 +444,7 @@ Links Node returns parsed text in each linked pages.
|
|
346
444
|
```
|
347
445
|
|
348
446
|
```html
|
349
|
-
<!-- http://yasuri.example.net/child02.html -->
|
447
|
+
<!-- http://yasuri.example.tac42.net/child02.html -->
|
350
448
|
<html>
|
351
449
|
<head><title>Child 02 Test</title></head>
|
352
450
|
<body>
|
@@ -356,7 +454,7 @@ Links Node returns parsed text in each linked pages.
|
|
356
454
|
```
|
357
455
|
|
358
456
|
```html
|
359
|
-
<!-- http://yasuri.example.net/child03.html -->
|
457
|
+
<!-- http://yasuri.example.tac42.net/child03.html -->
|
360
458
|
<html>
|
361
459
|
<head><title>Child 03 Test</title></head>
|
362
460
|
<body>
|
@@ -369,20 +467,17 @@ Links Node returns parsed text in each linked pages.
|
|
369
467
|
```
|
370
468
|
|
371
469
|
```ruby
|
372
|
-
agent = Mechanize.new
|
373
|
-
page = agent.get("http://yasuri.example.net")
|
374
|
-
|
375
470
|
node = Yasuri.links_title '/html/body/a' do
|
376
471
|
text_content '/html/body/p'
|
377
472
|
end
|
378
473
|
|
379
|
-
node.
|
474
|
+
node.scrape("http://yasuri.example.tac42.net")
|
380
475
|
#=> [ {"content" => "Child 01 page."},
|
381
476
|
#     {"content" => "Child 02 page."},
|
382
477
|
#     {"content" => "Child 03 page."}]
|
383
478
|
```
|
384
479
|
|
385
|
-
At first, Links Node find all links in the page by path. In this case, LinksNode find `/html/body/a` tags in `http://yasuri.example.net`. Then, open href attributes (`./child01.html`, `./child02.html` and `./child03.html`).
|
480
|
+
First, Links Node finds all links in the page by path. In this case, LinksNode finds the `/html/body/a` tags in `http://yasuri.example.tac42.net`. Then it opens their href attributes (`./child01.html`, `./child02.html` and `./child03.html`).
|
386
481
|
|
387
482
|
Links Node then applies its child nodes to each opened page and returns the results as an array.
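The relative hrefs are opened relative to the page they were found on. As a sketch of that resolution step only, the standard `uri` library can join each href onto the base URL; how Yasuri or Mechanize performs this internally is not shown here.

```ruby
require 'uri'

base  = "http://yasuri.example.tac42.net/"
hrefs = ["./child01.html", "./child02.html", "./child03.html"]

# Resolve each relative href against the page that contained the link.
absolute = hrefs.map { |href| URI.join(base, href).to_s }

puts absolute
```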
|
388
483
|
|
@@ -396,7 +491,7 @@ Paginate Node parses and returns each pages that provid by paginate.
|
|
396
491
|
The target page `page01.html` looks like this. `page02.html` to `page04.html` are similar.
|
397
492
|
|
398
493
|
```html
|
399
|
-
<!-- http://yasuri.example.net/page01.html -->
|
494
|
+
<!-- http://yasuri.example.tac42.net/page01.html -->
|
400
495
|
<html>
|
401
496
|
<head><title>Page01</title></head>
|
402
497
|
<body>
|
@@ -416,21 +511,17 @@ Target page `page01.html` is like this. `page02.html` to `page04.html` are simil
|
|
416
511
|
```
|
417
512
|
|
418
513
|
```ruby
|
419
|
-
|
420
|
-
page = agent.get("http://yasuri.example.net/page01.html")
|
421
|
-
|
422
|
-
node = Yasuri.pages_root "/html/body/nav/span/a[@class='next']" do
|
514
|
+
node = Yasuri.pages_root "/html/body/nav/span/a[@class='next']" , limit:3 do
|
423
515
|
text_content '/html/body/p'
|
424
516
|
end
|
425
517
|
|
426
|
-
node.
|
427
|
-
#=> [ {"content" => "
|
428
|
-
|
429
|
-
|
430
|
-
{"content" => "Pagination04"}]
|
518
|
+
node.scrape("http://yasuri.example.tac42.net/page01.html")
|
519
|
+
#=> [ {"content" => "Pagination01"},
|
520
|
+
# {"content" => "Pagination02"},
|
521
|
+
# {"content" => "Pagination03"}]
|
431
522
|
```
|
432
|
-
|
433
|
-
|
523
|
+
Paginate Node requires a link to the next page.
|
524
|
+
In this case, it is `NextPage` `/html/body/nav/span/a[@class='next']`.
|
434
525
|
|
435
526
|
### Options
|
436
527
|
##### `limit`
|
@@ -440,7 +531,7 @@ Upper limit of open pages in pagination.
|
|
440
531
|
node = Yasuri.pages_root "/html/body/nav/span/a[@class='next']" , limit:2 do
|
441
532
|
text_content '/html/body/p'
|
442
533
|
end
|
443
|
-
node.
|
534
|
+
node.scrape(uri)
|
444
535
|
#=> [ {"content" => "Pagination01"}, {"content" => "Pagination02"}]
|
445
536
|
```
|
446
537
|
Paginate Node opens at most the number of pages given by `limit`. Here the pagination has 4 pages, but the result Array has 2 texts because `limit:2` is given.
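The effect of `limit` can be sketched as a loop that follows "next" links until it runs out of pages or reaches the limit. The in-memory `PAGES` table and the `paginate` helper below are hypothetical stand-ins for the four example pages, not Yasuri's implementation.

```ruby
# Hypothetical in-memory stand-in for page01.html to page04.html.
PAGES = {
  "page01.html" => { content: "Pagination01", next_page: "page02.html" },
  "page02.html" => { content: "Pagination02", next_page: "page03.html" },
  "page03.html" => { content: "Pagination03", next_page: "page04.html" },
  "page04.html" => { content: "Pagination04", next_page: nil }
}.freeze

# Follow the next-page link until there is none, or until `limit` pages are read.
def paginate(start, limit: nil)
  results = []
  current = start
  while current && (limit.nil? || results.size < limit)
    page = PAGES.fetch(current)
    results << { "content" => page[:content] }
    current = page[:next_page]
  end
  results
end

p paginate("page01.html", limit: 2)
```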
|
@@ -449,33 +540,177 @@ Paginate Node open upto 2 given by `limit`. In this situation, pagination has 4
|
|
449
540
|
The `flatten` option expands the results of each page into a single flat array.
|
450
541
|
|
451
542
|
```ruby
|
452
|
-
agent = Mechanize.new
|
453
|
-
page = agent.get("http://yasuri.example.net/page01.html")
|
454
|
-
|
455
543
|
node = Yasuri.pages_root "/html/body/nav/span/a[@class='next']" , flatten:false do
|
456
544
|
text_title '/html/head/title'
|
457
545
|
text_content '/html/body/p'
|
458
546
|
end
|
459
|
-
node.
|
547
|
+
node.scrape("http://yasuri.example.tac42.net/page01.html")
|
460
548
|
|
461
549
|
#=> [ {"title" => "Page01",
|
462
|
-
|
463
|
-
|
464
|
-
|
465
|
-
|
466
|
-
|
550
|
+
# "content" => "Pagination01"},
|
551
|
+
# {"title" => "Page02",
|
552
|
+
# "content" => "Pagination02"},
|
553
|
+
# {"title" => "Page03",
|
554
|
+
# "content" => "Pagination03"}]
|
467
555
|
|
468
556
|
|
469
557
|
node = Yasuri.pages_root "/html/body/nav/span/a[@class='next']" , flatten:true do
|
470
558
|
text_title '/html/head/title'
|
471
559
|
text_content '/html/body/p'
|
472
560
|
end
|
473
|
-
node.
|
561
|
+
node.scrape("http://yasuri.example.tac42.net/page01.html")
|
474
562
|
|
475
563
|
#=> [ "Page01",
|
476
|
-
|
477
|
-
|
478
|
-
|
479
|
-
|
480
|
-
|
564
|
+
# "Pagination01",
|
565
|
+
# "Page02",
|
566
|
+
# "Pagination02",
|
567
|
+
# "Page03",
|
568
|
+
# "Pagination03"]
|
481
569
|
```
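The relationship between the two result shapes can be expressed in one line of plain Ruby. The sketch below starts from per-page Hashes and flattens them by hand; it illustrates the output shape only and does not use Yasuri.

```ruby
# Per-page results, one Hash per page.
per_page = [
  { "title" => "Page01", "content" => "Pagination01" },
  { "title" => "Page02", "content" => "Pagination02" },
  { "title" => "Page03", "content" => "Pagination03" }
]

# Flattened shape: the values of every page, in order, in a single Array.
flattened = per_page.flat_map(&:values)

p flattened
```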
|
570
|
+
|
571
|
+
## Map Node
|
572
|
+
*MapNode* is a node that summarizes the results of scraping. This node is always a branch node in the parse tree.
|
573
|
+
|
574
|
+
### Example
|
575
|
+
|
576
|
+
```html
|
577
|
+
<!-- http://yasuri.example.tac42.net -->
|
578
|
+
<html>
|
579
|
+
<head><title>Yasuri Example</title></head>
|
580
|
+
<body>
|
581
|
+
<p>Hello,World</p>
|
582
|
+
<p>Hello,Yasuri</p>
|
583
|
+
</body>
|
584
|
+
</html>
|
585
|
+
```
|
586
|
+
|
587
|
+
```ruby
|
588
|
+
tree = Yasuri.map_root do
|
589
|
+
text_title '/html/head/title'
|
590
|
+
text_body_p '/html/body/p[1]'
|
591
|
+
end
|
592
|
+
|
593
|
+
tree.scrape("http://yasuri.example.tac42.net") #=> { "title" => "Yasuri Example", "body_p" => "Hello,World" }
|
594
|
+
|
595
|
+
|
596
|
+
tree = Yasuri.map_root do
|
597
|
+
map_group1 { text_child01 '/html/body/a[1]' }
|
598
|
+
map_group2 do
|
599
|
+
text_child01 '/html/body/a[1]'
|
600
|
+
text_child03 '/html/body/a[3]'
|
601
|
+
end
|
602
|
+
end
|
603
|
+
|
604
|
+
tree.scrape("http://yasuri.example.tac42.net") #=> {
|
605
|
+
# "group1" => {
|
606
|
+
# "child01" => "child01"
|
607
|
+
# },
|
608
|
+
# "group2" => {
|
609
|
+
# "child01" => "child01",
|
610
|
+
# "child03" => "child03"
|
611
|
+
# }
|
612
|
+
# }
|
613
|
+
```
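How a MapNode gathers its children into one Hash can be sketched with a small recursive function. Here `page` is a hypothetical lookup table standing in for the fetched document, and `collect` is an illustrative helper, not part of Yasuri.

```ruby
# Hypothetical stand-in for a fetched page: XPath => text found there.
page = {
  "/html/head/title" => "Yasuri Example",
  "/html/body/p[1]"  => "Hello,World"
}

# Hypothetical helper: a Hash value is a nested group, a String is a path.
def collect(children, page)
  children.to_h do |name, child|
    [name, child.is_a?(Hash) ? collect(child, page) : page[child]]
  end
end

tree   = { "title" => "/html/head/title", "body_p" => "/html/body/p[1]" }
nested = { "group1" => { "title" => "/html/head/title" } }

p collect(tree, page)
p collect(nested, page)
```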
|
614
|
+
|
615
|
+
### Options
|
616
|
+
None.
|
617
|
+
|
618
|
+
|
619
|
+
-------------------------
|
620
|
+
## Usage
|
621
|
+
|
622
|
+
### Use as library
|
623
|
+
When used as a library, the tree can be defined in DSL, json, or yaml format.
|
624
|
+
|
625
|
+
```ruby
|
626
|
+
require 'yasuri'
|
627
|
+
|
628
|
+
# 1. Create a parse tree.
|
629
|
+
# Define by Ruby's DSL
|
630
|
+
tree = Yasuri.links_title '/html/body/a' do
|
631
|
+
text_name '/html/body/p'
|
632
|
+
end
|
633
|
+
|
634
|
+
# Define by JSON
|
635
|
+
src = <<-EOJSON
|
636
|
+
{
|
637
|
+
"links_title": {
|
638
|
+
"path": "/html/body/a",
|
639
|
+
"text_name": "/html/body/p"
|
640
|
+
}
|
641
|
+
}
|
642
|
+
EOJSON
|
643
|
+
tree = Yasuri.json2tree(src)
|
644
|
+
|
645
|
+
|
646
|
+
# Define by YAML
|
647
|
+
src = <<-EOYAML
|
648
|
+
links_title:
|
649
|
+
path: "/html/body/a"
|
650
|
+
text_name: "/html/body/p"
|
651
|
+
EOYAML
|
652
|
+
tree = Yasuri.yaml2tree(src)
|
653
|
+
|
654
|
+
# 2. Give the URL to start parsing
|
655
|
+
tree.scrape(uri)
|
656
|
+
```
|
657
|
+
|
658
|
+
### Use as CLI tool
|
659
|
+
|
660
|
+
**Help**
|
661
|
+
```sh
|
662
|
+
$ yasuri help scrape
|
663
|
+
Usage:
|
664
|
+
yasuri scrape <URI> [[--file <TREE_FILE>] or [--json <JSON>]]
|
665
|
+
|
666
|
+
Options:
|
667
|
+
f, [--file=FILE] # path to file that written yasuri tree as json or yaml
|
668
|
+
j, [--json=JSON] # yasuri tree format json string
|
669
|
+
i, [--interval=N] # interval each request [ms]
|
670
|
+
|
671
|
+
Getting from <URI> and scrape it. with <JSON> or json/yml from <TREE_FILE>. They should be Yasuri's format json or yaml string.
|
672
|
+
```
|
673
|
+
|
674
|
+
In the CLI tool, you can specify the parse tree in either of the following ways.
|
675
|
+
+ `--file`, `-f` : reads the parse tree from a file written in json or yaml format.
|
676
|
+
+ `--json`, `-j` : option to specify the parse tree directly as a string.
|
677
|
+
|
678
|
+
|
679
|
+
**Example of specifying a parse tree as a file**
|
680
|
+
```sh
|
681
|
+
% cat sample.yml
|
682
|
+
text_title: "/html/head/title"
|
683
|
+
text_desc: "//*[@id=\"intro\"]/p"
|
684
|
+
|
685
|
+
% yasuri scrape "https://www.ruby-lang.org/en/" --file sample.yml
|
686
|
+
{"title":"Ruby Programming Language","desc":"\n A dynamic, open source programming language with a focus on\n simplicity and productivity. It has an elegant syntax that is\n natural to read and easy to write.\n "}
|
687
|
+
|
688
|
+
% cat sample.json
|
689
|
+
{
|
690
|
+
"text_title": "/html/head/title",
|
691
|
+
"text_desc": "//*[@id=\"intro\"]/p"
|
692
|
+
}
|
693
|
+
|
694
|
+
% yasuri scrape "https://www.ruby-lang.org/en/" --file sample.json
|
695
|
+
{"title":"Ruby Programming Language","desc":"\n A dynamic, open source programming language with a focus on\n simplicity and productivity. It has an elegant syntax that is\n natural to read and easy to write.\n "}
|
696
|
+
```
|
697
|
+
|
698
|
+
Whether the file is written in json or yaml will be determined automatically.
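One plausible way to detect the format is to try JSON first and fall back to YAML. The sketch below shows that approach with the standard libraries; it is an assumption about the technique, not a description of how the `yasuri` command actually decides, and `parse_tree_source` is a hypothetical name.

```ruby
require 'json'
require 'yaml'

# Hypothetical helper: try JSON first, fall back to YAML if that fails.
def parse_tree_source(src)
  JSON.parse(src)
rescue JSON::ParserError
  YAML.safe_load(src)
end

json_src = <<~'EOJSON'
  {
    "text_title": "/html/head/title",
    "text_desc": "//*[@id=\"intro\"]/p"
  }
EOJSON

yaml_src = <<~'EOYAML'
  text_title: "/html/head/title"
  text_desc: "//*[@id=\"intro\"]/p"
EOYAML

puts parse_tree_source(json_src) == parse_tree_source(yaml_src)
```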
|
699
|
+
|
700
|
+
**Example of specifying a parse tree directly in json**
|
701
|
+
```sh
|
702
|
+
$ yasuri scrape "https://www.ruby-lang.org/en/" -j '
|
703
|
+
{
|
704
|
+
"text_title": "/html/head/title",
|
705
|
+
"text_desc": "//*[@id=\"intro\"]/p"
|
706
|
+
}'
|
707
|
+
|
708
|
+
{"title":"Ruby Programming Language","desc":"\n A dynamic, open source programming language with a focus on\n simplicity and productivity. It has an elegant syntax that is\n natural to read and easy to write.\n "}
|
709
|
+
```
|
710
|
+
|
711
|
+
#### Other options
|
712
|
+
+ `--interval`, `-i` : The interval [milliseconds] for requesting multiple pages.
|
713
|
+
**Example: Request at 1 second intervals**
|
714
|
+
```sh
|
715
|
+
$ yasuri scrape "https://www.ruby-lang.org/en/" --file sample.yml --interval 1000
|
716
|
+
```
|