yasuri 2.0.12 → 3.3.0
- checksums.yaml +5 -5
- data/.github/workflows/ruby.yml +35 -0
- data/.gitignore +1 -2
- data/.ruby-version +1 -0
- data/.travis.yml +1 -3
- data/README.md +87 -21
- data/USAGE.ja.md +368 -120
- data/USAGE.md +375 -125
- data/examples/example.rb +79 -0
- data/examples/github.yml +15 -0
- data/examples/sample.json +4 -0
- data/examples/sample.yml +11 -0
- data/exe/yasuri +5 -0
- data/lib/yasuri.rb +1 -0
- data/lib/yasuri/version.rb +1 -1
- data/lib/yasuri/yasuri.rb +86 -41
- data/lib/yasuri/yasuri_cli.rb +64 -0
- data/lib/yasuri/yasuri_links_node.rb +11 -5
- data/lib/yasuri/yasuri_map_node.rb +40 -0
- data/lib/yasuri/yasuri_node.rb +37 -2
- data/lib/yasuri/yasuri_node_generator.rb +16 -11
- data/lib/yasuri/yasuri_paginate_node.rb +10 -4
- data/lib/yasuri/yasuri_struct_node.rb +5 -1
- data/lib/yasuri/yasuri_text_node.rb +9 -2
- data/spec/cli_resources/tree.json +8 -0
- data/spec/cli_resources/tree.yml +5 -0
- data/spec/cli_resources/tree_wrong.json +9 -0
- data/spec/cli_resources/tree_wrong.yml +6 -0
- data/spec/spec_helper.rb +4 -9
- data/spec/yasuri_cli_spec.rb +96 -0
- data/spec/yasuri_links_node_spec.rb +34 -12
- data/spec/yasuri_map_spec.rb +75 -0
- data/spec/yasuri_paginate_node_spec.rb +22 -10
- data/spec/yasuri_spec.rb +244 -94
- data/spec/yasuri_struct_node_spec.rb +13 -17
- data/spec/yasuri_text_node_spec.rb +11 -12
- data/yasuri.gemspec +5 -3
- metadata +52 -18
- data/app.rb +0 -52
data/USAGE.md
CHANGED
@@ -1,27 +1,32 @@
-# Yasuri
+# Yasuri

 ## What is Yasuri
-`Yasuri` is
+`Yasuri` (鑢) is a library for declarative web scraping and a command line tool for scraping with it.

-
+It performs scraping by simply describing the expected result in a simple declarative notation.

-Yasuri
+Yasuri makes it easy to write common scraping operations.
+For example, the following processes can be easily implemented.

-
-
-+
-+
-+ A table that repeatedly appears in a page each, scraping, get as an array.
-+ Of each page provided by the pagination, scraping the only top 3.
-
-You can implement easy by Yasuri.
++ Scrape multiple texts in a page and name them into a Hash
++ Open multiple links in a page and get the result of scraping each page as a Hash
++ Scrape each table that appears repeatedly in the page and get the result as an array
++ Scrape only the first three pages of each page provided by pagination

 ## Quick Start

+
+#### Install
+```sh
+# for Ruby 2.3.2
+$ gem 'yasuri', '~> 2.0', '>= 2.0.13'
 ```
+or
+```sh
+# for Ruby 3.0.0 or upper
 $ gem install yasuri
 ```
-
+#### Use as library
 ```ruby
 require 'yasuri'
 require 'machinize'
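The Install block in the new Quick Start above mixes a shell prompt with Bundler syntax: `$ gem 'yasuri', '~> 2.0', '>= 2.0.13'` is a Gemfile line, not a shell command, while `$ gem install yasuri` is the CLI form. A minimal sketch of the Gemfile variant — the `~> 3.3` constraint is an assumption taken from this release's version number, not from the diff itself:

```ruby
# Gemfile — pick the constraint that matches your Ruby
gem 'yasuri', '~> 2.0', '>= 2.0.13' # legacy Ruby 2.3 line, as shown above
gem 'yasuri', '~> 3.3'              # assumed constraint for the 3.3.0 release this diff introduces
```

The unchanged `require 'machinize'` context line also looks like a long-standing typo for the `mechanize` gem that Yasuri drives internally (the removed lines and the later `scrape_with_agent` section both refer to Mechanize); requiring `yasuri` alone is normally enough, since it pulls Mechanize in itself.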
@@ -32,81 +37,190 @@ root = Yasuri.links_root '//*[@id="menu"]/ul/li/a' do
   text_content '//*[@id="contents"]/p[1]'
 end

-
-
+result = root.scrape("http://some.scraping.page.tac42.net/")
+# => [
+#   {"title" => "PageTitle 01", "content" => "Page Contents 01" },
+#   {"title" => "PageTitle 02", "content" => "Page Contents 02" },
+#   ...
+#   {"title" => "PageTitle N", "content" => "Page Contents N" }
+# ]
+```
+
+This example, from the pages of each link that is expressed by the xpath of LinkNode(`links_root`), to scraping the two text that is expressed by the xpath of TextNode(`text_title`,`text_content`).
+
+(in other words, open each links `//*[@id="menu"]/ul/li/a` and, scrape `//*[@id="contents"]/h2` and `//*[@id="contents"]/p[1]`.)
+

-
-
-# {"title" => "PageTitle2", "content" => "Page Contents2" }, ... ]
+#### Use as CLI tool
+The same thing as above can be executed as a CLI command.

+```sh
+$ yasuri scrape "http://some.scraping.page.tac42.net/" -j '
+{
+  "links_root": {
+    "path": "//*[@id=\"menu\"]/ul/li/a",
+    "text_title": "//*[@id=\"contents\"]/h2",
+    "text_content": "//*[@id=\"contents\"]/p[1]"
+  }
+}'
+
+[
+  {"title":"PageTitle 01","content":"Page Contents 01"},
+  {"title":"PageTitle 02","content":"Page Contents 02"},
+  ...,
+  {"title":"PageTitle N","content":"Page Contents N"}
+]
 ```
-This example, from the pages of each link that is expressed by the xpath of LinkNode(`links_root`), to scraping the two text that is expressed by the xpath of TextNode(`text_title`,`text_content`).

-
+The result can be obtained as a string in json format.

-
+----------------------------
+## Parse Tree

-
-2. Start parse with Mechanize agent and first page.
+A parse tree is a tree structure data for declaratively defining the elements to be scraped and the output structure.

-
+A parse tree consists of nested `Node`s, each of which has `Type`, `Name`, `Path`, `Childlen`, and `Options` attributes, and scrapes according to its `Type`. (Note that only `MapNode` does not have `Path`).

-
-require 'mechanize'
-require 'yasuri'
+The parse tree is defined in the following format:

+```ruby
+# A simple tree consisting of one node
+Yasuri.<Type>_<Name> <Path> [,<Options>]

-#
-
-
-
+# Nested tree
+Yasuri.<Type>_<Name> <Path> [,<Options>] do
+  <Type>_<Name> <Path> [,<Options>] do
+    <Type>_<Name> <Path> [,<Options>]
+    ...
+  end
+end
+```

-
-agent = Mechanize.new
-page = agent.get(uri)
+**Example**

+```ruby
+# A simple tree consisting of one node
+Yasuri.text_title '/html/head/title', truncate:/^[^,]+/

-tree
+# Nested tree
+Yasuri.links_root '//*[@id="menu"]/ul/li/a' do
+  struct_table './tr' do
+    text_title './td[1]'
+    text_pub_date './td[2]'
+  end
+end
 ```

-
+Parsing trees can be defined in Ruby DSL, JSON, or YAML.
+The following is an example of the same parse tree as above, defined in each notation.
+

+**Case of defining as Ruby DSL**
 ```ruby
-
-
-
-"name" : "title",
-"path" : "/html/body/a",
-"children" : [
-{ "node" : "text",
-"name" : "name",
-"path" : "/html/body/p"
-}
-]
-}
-EOJSON
-tree = Yasuri.json2tree(src)
+Yasuri.links_title '/html/body/a' do
+  text_name '/html/body/p'
+end
 ```

-
-
-
+**Case of defining as JSON**
+```json
+{
+  links_title": {
+    "path": "/html/body/a",
+    "text_name": "/html/body/p"
+  }
+}
+```

-
+**Case of defining as YAML**
+```yaml
+links_title:
+  path: "/html/body/a"
+  text_name: "/html/body/p"
+```

+**Special case of purse tree**

+If there is only one element directly under the root, it will return that element directly instead of Hash(Object).
+```json
+{
+  "text_title": "/html/head/title",
+  "text_body": "/html/body",
+}
+# => {"title": "Welcome to yasuri!", "body": "Yasuri is ..."}
+
+{
+  "text_title": "/html/head/title"}
+}
+# => Welcome to yasuri!
+```
+
+
+In json or yaml format, a attribute can directly specify `path` as a value if it doesn't have any child Node. The following two json will have the same parse tree.
+
+```json
+{
+  "text_name": "/html/body/p"
+}
+
+{
+  "text_name": {
+    "path": "/html/body/p"
+  }
+}
+```
+### Run ParseTree
+Call the `Node#scrape(uri, opt={})` method on the root node of the parse tree.
+
+**Example**
 ```ruby
-
-
+root = Yasuri.links_root '//*[@id="menu"]/ul/li/a' do
+  text_title '//*[@id="contents"]/h2'
+  text_content '//*[@id="contents"]/p[1]'
+end

-
-Yasuri.<Type>_<Name> <Path> [,<Options>] do
-  <Type>_<Name> <Path> [,<Options>] do
-    <Children>
-  end
-end
+result = root.scrape("http://some.scraping.page.tac42.net/", interval_ms: 1000)
 ```

++ `uri` is the URI of the page to be scraped.
++ `opt` is options as Hash. The following options are available.
+
+Yasuri uses `Mechanize` internally as an agent to do scraping.
+If you want to specify this instance, call `Node#scrape_with_agent(uri, agent, opt={})`.
+
+```ruby
+require 'logger'
+
+agent = Mechanize.new
+agent.log = Logger.new $stderr
+agent.request_headers = {
+  # ...
+}
+
+result = root.scrape_with_agent(
+  "http://some.scraping.page.tac42.net/",
+  agent,
+  interval_ms: 1000)
+```
+
+### `opt`
+#### `interval_ms`
+Interval [milliseconds] for requesting multiple pages.
+
+If omitted, requests will be made continuously without an interval, but if requests to many pages are expected, it is strongly recommended to specify an interval time to avoid high load on the target host.
+
+#### `retry_count`
+Number of retries when page acquisition fails. If omitted, it will retry 5 times.
+
+#### `symbolize_names`
+If true, returns the keys of the result set as symbols.
+
+--------------------------
+## Node
+
+Node is a node or leaf of the parse tree, which has `Type`, `Name`, `Path`, `Childlen`, and `Options`, and scrapes according to its `Type`. (Note that only `MapNode` does not have `Path`).
+
+
 #### Type
 Type meen behavior of Node.

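The `### opt` section added in the hunk above documents `interval_ms`, `retry_count`, and `symbolize_names`, but only `interval_ms` appears in code. A sketch of passing all three through the same options Hash, reusing the docs' placeholder URL — combining them is assumed to be valid since `opt` is a single Hash, and the default of 5 retries is taken from the text above, not verified against the implementation:

```ruby
require 'yasuri'

root = Yasuri.links_root '//*[@id="menu"]/ul/li/a' do
  text_title '//*[@id="contents"]/h2'
end

# All scrape options travel in the one opt Hash described above.
result = root.scrape(
  "http://some.scraping.page.tac42.net/",
  interval_ms: 500,        # wait 500 ms between page requests
  retry_count: 3,          # retry a failed fetch up to 3 times instead of the documented default of 5
  symbolize_names: true    # result keys come back as :title instead of "title"
)
# => [ {title: "PageTitle 01"}, {title: "PageTitle 02"}, ... ]
```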
@@ -114,17 +228,20 @@ Type meen behavior of Node.
 - *Struct*
 - *Links*
 - *Paginate*
+- *Map*

-
+See the description of each node for details.
+
+#### Name
 Name is used keys in returned hash.

-
+#### Path
 Path determine target node by xpath or css selector. It given by Machinize `search`.

-
+#### Childlen
 Child nodes. TextNode has always empty set, because TextNode is leaf.

-
+#### Options
 Parse options. It different in each types. You can get options and values by `opt` method.

 ```ruby
@@ -136,10 +253,12 @@ node.opt #=> {:truncate => /^[^,]+/, :proc => nil}
 ## Text Node
 TextNode return scraped text. This node have to be leaf.

+
+
 ### Example

 ```html
-<!-- http://yasuri.example.net -->
+<!-- http://yasuri.example.tac42.net -->
 <html>
   <head></head>
   <body>
@@ -150,25 +269,24 @@ TextNode return scraped text. This node have to be leaf.
 ```

 ```ruby
-agent = Mechanize.new
-page = agent.get("http://yasuri.example.net")
-
 p1  = Yasuri.text_title '/html/body/p[1]'
 p1t = Yasuri.text_title '/html/body/p[1]', truncate:/^[^,]+/
-p2u = Yasuri.text_title '/html/body/p[
+p2u = Yasuri.text_title '/html/body/p[1]', proc: :upcase

-p1.
-p1t.
-
+p1.scrape("http://yasuri.example.tac42.net")  #=> "Hello,World"
+p1t.scrape("http://yasuri.example.tac42.net") #=> "Hello"
+p2u.scrape("http://yasuri.example.tac42.net") #=> "HELLO,WORLD"
 ```

+Note that if you want to scrape multiple elements in the same page at once, use `MapNode`. See the `MapNode` example for details.
+
 ### Options
 ##### `truncate`
 Match to regexp, and truncate text. When you use group, it will return first matched group only.

 ```ruby
 node = Yasuri.text_example '/html/body/p[1]', truncate:/H(.+)i/
-node.
+node.scrape(uri)
 #=> { "example" => "ello,Yasur" }
 ```

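The note added at the end of the hunk above points to `MapNode` when several texts are wanted from one page. A sketch of the three TextNode variants from this example folded into one tree, so the page is fetched once instead of three times — it assumes the example page `http://yasuri.example.tac42.net` used throughout this file and the `MapNode` behaviour described later in the diff:

```ruby
require 'yasuri'

# One MapNode tree instead of three separate TextNode scrapes of the same page.
tree = Yasuri.map_root do
  text_title   '/html/body/p[1]'
  text_title_t '/html/body/p[1]', truncate: /^[^,]+/
  text_title_u '/html/body/p[1]', proc: :upcase
end

tree.scrape("http://yasuri.example.tac42.net")
# => { "title" => "Hello,World", "title_t" => "Hello", "title_u" => "HELLO,WORLD" }
```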
@@ -179,21 +297,22 @@ If it is given `truncate` option, apply method after truncated.

 ```ruby
 node = Yasuri.text_example '/html/body/p[1]', proc: :upcase, truncate:/H(.+)i/
-node.
+node.scrape(uri)
 #=> { "example" => "ELLO,YASUR" }
 ```

 ## Struct Node
 Struct Node return structured text.

-At first, Struct Node narrow down sub-tags by `Path`.
+At first, Struct Node narrow down sub-tags by `Path`.
+Child nodes parse narrowed tags, and struct node returns hash contains parsed result.

 If Struct Node `Path` matches multi sub-tags, child nodes parse each sub-tags and struct node returns array.

 ### Example

 ```html
-<!-- http://yasuri.example.net -->
+<!-- http://yasuri.example.tac42.net -->
 <html>
   <head>
     <title>Books</title>
@@ -234,15 +353,12 @@ If Struct Node `Path` matches multi sub-tags, child nodes parse each sub-tags an
 ```

 ```ruby
-agent = Mechanize.new
-page = agent.get("http://yasuri.example.net")
-
 node = Yasuri.struct_table '/html/body/table[1]/tr' do
   text_title    './td[1]'
   text_pub_date './td[2]'
-
+end

-node.
+node.scrape("http://yasuri.example.tac42.net")
 #=> [ { "title"    => "The Perfect Insider",
 #      "pub_date" => "1996/4/5" },
 #    { "title"    => "Doctors in Isolated Room",
@@ -261,17 +377,14 @@ Struct node can contain not only Text node.
 ### Example

 ```ruby
-agent = Mechanize.new
-page = agent.get("http://yasuri.example.net")
-
 node = Yasuri.strucre_tables '/html/body/table' do
   struct_table './tr' do
     text_title    './td[1]'
     text_pub_date './td[2]'
   end
-
+end

-node.
+node.scrape("http://yasuri.example.tac42.net")

 #=> [ { "table" => [ { "title"    => "The Perfect Insider",
 #                      "pub_date" => "1996/4/5" },
@@ -304,7 +417,7 @@ Links Node returns parsed text in each linked pages.

 ### Example
 ```html
-<!-- http://yasuri.example.net -->
+<!-- http://yasuri.example.tac42.net -->
 <html>
   <head><title>Yasuri Test</title></head>
   <body>
@@ -317,7 +430,7 @@ Links Node returns parsed text in each linked pages.
 ```

 ```html
-<!-- http://yasuri.example.net/child01.html -->
+<!-- http://yasuri.example.tac42.net/child01.html -->
 <html>
   <head><title>Child 01 Test</title></head>
   <body>
@@ -331,7 +444,7 @@ Links Node returns parsed text in each linked pages.
 ```

 ```html
-<!-- http://yasuri.example.net/child02.html -->
+<!-- http://yasuri.example.tac42.net/child02.html -->
 <html>
   <head><title>Child 02 Test</title></head>
   <body>
@@ -341,7 +454,7 @@ Links Node returns parsed text in each linked pages.
 ```

 ```html
-<!-- http://yasuri.example.net/child03.html -->
+<!-- http://yasuri.example.tac42.net/child03.html -->
 <html>
   <head><title>Child 03 Test</title></head>
   <body>
@@ -354,20 +467,17 @@ Links Node returns parsed text in each linked pages.
 ```

 ```ruby
-agent = Mechanize.new
-page = agent.get("http://yasuri.example.net")
-
 node = Yasuri.links_title '/html/body/a' do
   text_content '/html/body/p'
 end

-node.
+node.scrape("http://yasuri.example.tac42.net")
 #=> [ {"content" => "Child 01 page."},
       {"content" => "Child 02 page."},
       {"content" => "Child 03 page."}]
 ```

-At first, Links Node find all links in the page by path. In this case, LinksNode find `/html/body/a` tags in `http://yasuri.example.net`. Then, open href attributes (`./child01.html`, `./child02.html` and `./child03.html`).
+At first, Links Node find all links in the page by path. In this case, LinksNode find `/html/body/a` tags in `http://yasuri.example.tac42.net`. Then, open href attributes (`./child01.html`, `./child02.html` and `./child03.html`).

 Then, Links Node and apply child nodes. Links Node will return applied result of each page as array.

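Just as a Struct Node "can contain not only Text node", a Links Node's children are not limited to Text Nodes either. A sketch that follows each link above and scrapes a table on every linked page — the node names `links_books`/`struct_rows` and the table layout on the child pages are hypothetical, chosen to mirror the Struct Node example earlier in this file:

```ruby
require 'yasuri'

# Follow every /html/body/a link, then scrape a 2-column table on each linked page (assumed layout).
node = Yasuri.links_books '/html/body/a' do
  struct_rows '/html/body/table[1]/tr' do
    text_title    './td[1]'
    text_pub_date './td[2]'
  end
end

node.scrape("http://yasuri.example.tac42.net")
# => [ {"rows" => [ {"title" => "...", "pub_date" => "..."}, ... ]},  # one Hash per linked page
#      ... ]
```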
@@ -381,7 +491,7 @@ Paginate Node parses and returns each pages that provid by paginate.
 Target page `page01.html` is like this. `page02.html` to `page04.html` are similarly.

 ```html
-<!-- http://yasuri.example.net/page01.html -->
+<!-- http://yasuri.example.tac42.net/page01.html -->
 <html>
   <head><title>Page01</title></head>
   <body>
@@ -401,21 +511,17 @@ Target page `page01.html` is like this. `page02.html` to `page04.html` are simil
 ```

 ```ruby
-
-page = agent.get("http://yasuri.example.net/page01.html")
-
-node = Yasuri.pages_root "/html/body/nav/span/a[@class='next']" do
+node = Yasuri.pages_root "/html/body/nav/span/a[@class='next']" , limit:3 do
   text_content '/html/body/p'
 end

-node.
-#=> [ {"content" => "
-
-
-{"content" => "Pagination04"}]
+node.scrape("http://yasuri.example.tac42.net/page01.html")
+#=> [ {"content" => "Patination01"},
+#     {"content" => "Patination02"},
+#     {"content" => "Patination03"}]
 ```
-
-
+Paginate Node require link for next page.
+In this case, it is `NextPage` `/html/body/nav/span/a[@class='next']`.

 ### Options
 ##### `limit`
@@ -425,7 +531,7 @@ Upper limit of open pages in pagination.
 node = Yasuri.pages_root "/html/body/nav/span/a[@class='next']" , limit:2 do
   text_content '/html/body/p'
 end
-node.
+node.scrape(uri)
 #=> [ {"content" => "Pagination01"}, {"content" => "Pagination02"}]
 ```
 Paginate Node open upto 2 given by `limit`. In this situation, pagination has 4 pages, but result Array has 2 texts because given `limit:2`.
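Because a Paginate Node opens one request per page, the `limit` node option pairs naturally with the `interval_ms` scrape option documented under `opt`. A sketch using the pagination pages from this section — combining the two is an assumption based on `limit` being set on the node and `interval_ms` on the `scrape` call:

```ruby
require 'yasuri'

node = Yasuri.pages_root "/html/body/nav/span/a[@class='next']", limit: 2 do
  text_content '/html/body/p'
end

# limit caps how many pages are opened; interval_ms spaces the requests 1 s apart.
node.scrape("http://yasuri.example.tac42.net/page01.html", interval_ms: 1000)
# => [ {"content" => "Pagination01"}, {"content" => "Pagination02"} ]
```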
@@ -434,33 +540,177 @@ Paginate Node open upto 2 given by `limit`. In this situation, pagination has 4
 `flatten` option expands each page results.

 ```ruby
-agent = Mechanize.new
-page = agent.get("http://yasuri.example.net/page01.html")
-
 node = Yasuri.pages_root "/html/body/nav/span/a[@class='next']" , flatten:true do
   text_title   '/html/head/title'
   text_content '/html/body/p'
 end
-node.
+node.scrape("http://yasuri.example.tac42.net/page01.html")

 #=> [ {"title"   => "Page01",
-
-
-
-
-
+#      "content" => "Patination01"},
+#     {"title"   => "Page01",
+#      "content" => "Patination02"},
+#     {"title"   => "Page01",
+#      "content" => "Patination03"}]


 node = Yasuri.pages_root "/html/body/nav/span/a[@class='next']" , flatten:true do
   text_title   '/html/head/title'
   text_content '/html/body/p'
 end
-node.
+node.scrape("http://yasuri.example.tac42.net/page01.html")

 #=> [ "Page01",
-
-
-
-
-
+#     "Patination01",
+#     "Page02",
+#     "Patination02",
+#     "Page03",
+#     "Patination03"]
+```
+
+## Map Node
+*MapNode* is a node that summarizes the results of scraping. This node is always a branch node in the parse tree.
+
+### Example
+
+```html
+<!-- http://yasuri.example.tac42.net -->
+<html>
+  <head><title>Yasuri Example</title></head>
+  <body>
+    <p>Hello,World</p>
+    <p>Hello,Yasuri</p>
+  </body>
+</html>
+```
+
+```ruby
+tree = Yasuri.map_root do
+  text_title  '/html/head/title'
+  text_body_p '/html/body/p[1]'
+end
+
+tree.scrape("http://yasuri.example.tac42.net") #=> { "title" => "Yasuri Example", "body_p" => "Hello,World" }
+
+
+tree = Yasuri.map_root do
+  map_group1 { text_child01 '/html/body/a[1]' }
+  map_group2 do
+    text_child01 '/html/body/a[1]'
+    text_child03 '/html/body/a[3]'
+  end
+end
+
+tree.scrape("http://yasuri.example.tac42.net") #=> {
+#   "group1" => {
+#     "child01" => "child01"
+#   },
+#   "group2" => {
+#     "child01" => "child01",
+#     "child03" => "child03"
+#   }
+# }
+```
+
+### Options
+None.
+
+
+-------------------------
+## Usage
+
+### Use as library
+When used as a library, the tree can be defined in DSL, json, or yaml format.
+
+```ruby
+require 'yasuri'
+
+# 1. Create a parse tree.
+# Define by Ruby's DSL
+tree = Yasuri.links_title '/html/body/a' do
+  text_name '/html/body/p'
+end
+
+# Define by JSON
+src = <<-EOJSON
+{
+  links_title": {
+    "path": "/html/body/a",
+    "text_name": "/html/body/p"
+  }
+}
+EOJSON
+tree = Yasuri.json2tree(src)
+
+
+# Define by YAML
+src = <<-EOYAML
+links_title:
+  path: "/html/body/a"
+  text_name: "/html/body/p"
+EOYAML
+tree = Yasuri.yaml2tree(src)
+
+# 2. Give the URL to start parsing
+tree.inject(uri)
+```
+
+### Use as CLI tool
+
+**Help**
+```sh
+$ yasuri help scrape
+Usage:
+  yasuri scrape <URI> [[--file <TREE_FILE>] or [--json <JSON>]]
+
+Options:
+  f, [--file=FILE]      # path to file that written yasuri tree as json or yaml
+  j, [--json=JSON]      # yasuri tree format json string
+  i, [--interval=N]     # interval each request [ms]
+
+Getting from <URI> and scrape it. with <JSON> or json/yml from <TREE_FILE>. They should be Yasuri's format json or yaml string.
+```
+
+In the CLI tool, you can specify the parse tree in either of the following ways.
++ `--file`, `-f` : option to read the parse tree in json or yaml format output to a file.
++ `--json`, `-j` : option to specify the parse tree directly as a string.
+
+
+**Example of specifying a parse tree as a file**
+```sh
+% cat sample.yml
+text_title: "/html/head/title"
+text_desc: "//*[@id=\"intro\"]/p"
+
+% yasuri scrape "https://www.ruby-lang.org/en/" --file sample.yml
+{"title":"Ruby Programming Language","desc":"\n A dynamic, open source programming language with a focus on\n simplicity and productivity. It has an elegant syntax that is\n natural to read and easy to write.\n "}
+
+% cat sample.json
+{
+  "text_title": "/html/head/title",
+  "text_desc": "//*[@id=\"intro\"]/p"
+}
+
+% yasuri scrape "https://www.ruby-lang.org/en/" --file sample.json
+{"title":"Ruby Programming Language","desc":"\n A dynamic, open source programming language with a focus on\n simplicity and productivity. It has an elegant syntax that is\n natural to read and easy to write.\n "}
 ```
+
+Whether the file is written in json or yaml will be determined automatically.
+
+**Example of specifying a parse tree directly in json**
+```sh
+$ yasuri scrape "https://www.ruby-lang.org/en/" -j '
+{
+  "text_title": "/html/head/title",
+  "text_desc": "//*[@id=\"intro\"]/p"
+}'
+
+{"title":"Ruby Programming Language","desc":"\n A dynamic, open source programming language with a focus on\n simplicity and productivity. It has an elegant syntax that is\n natural to read and easy to write.\n "}
+```
+
+#### Other options
++ `--interval`, `-i` : The interval [milliseconds] for requesting multiple pages.
+**Example: Request at 1 second intervals**
+```sh
+$ yasuri scrape "https://www.ruby-lang.org/en/" --file sample.yml --interval 1000
+```
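The `Use as library` block in the hunk above ends with `tree.inject(uri)`, while every other example in the new file calls `Node#scrape`; the `inject` call reads like a leftover from the pre-3.x API. A sketch of the same flow with `yaml2tree` and `scrape`, reusing the tree from the CLI examples (the printed description is abbreviated here):

```ruby
require 'yasuri'

# The same tree the CLI examples pass via --file / --json, as a YAML string.
src = <<~'EOYAML'
  text_title: "/html/head/title"
  text_desc: "//*[@id=\"intro\"]/p"
EOYAML

tree = Yasuri.yaml2tree(src)

tree.scrape("https://www.ruby-lang.org/en/")
# => { "title" => "Ruby Programming Language", "desc" => "..." }
```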
|