yasuri 2.0.11 → 3.2.0
- checksums.yaml +5 -5
- data/.github/workflows/ruby.yml +35 -0
- data/.gitignore +1 -2
- data/.ruby-version +1 -0
- data/.travis.yml +1 -3
- data/README.md +88 -19
- data/USAGE.ja.md +325 -63
- data/USAGE.md +335 -69
- data/exe/yasuri +5 -0
- data/lib/yasuri.rb +1 -0
- data/lib/yasuri/version.rb +1 -1
- data/lib/yasuri/yasuri.rb +80 -39
- data/lib/yasuri/yasuri_cli.rb +64 -0
- data/lib/yasuri/yasuri_links_node.rb +10 -6
- data/lib/yasuri/yasuri_map_node.rb +39 -0
- data/lib/yasuri/yasuri_node.rb +24 -3
- data/lib/yasuri/yasuri_node_generator.rb +16 -11
- data/lib/yasuri/yasuri_paginate_node.rb +18 -6
- data/lib/yasuri/yasuri_struct_node.rb +8 -4
- data/lib/yasuri/yasuri_text_node.rb +11 -4
- data/spec/cli_resources/tree.json +8 -0
- data/spec/cli_resources/tree.yml +5 -0
- data/spec/cli_resources/tree_wrong.json +9 -0
- data/spec/cli_resources/tree_wrong.yml +6 -0
- data/spec/htdocs/struct/structual_links.html +30 -0
- data/spec/htdocs/{structual_text.html → struct/structual_text.html} +0 -0
- data/spec/spec_helper.rb +1 -6
- data/spec/yasuri_cli_spec.rb +83 -0
- data/spec/yasuri_links_node_spec.rb +12 -4
- data/spec/yasuri_map_spec.rb +76 -0
- data/spec/yasuri_paginate_node_spec.rb +43 -0
- data/spec/yasuri_spec.rb +199 -84
- data/spec/yasuri_struct_node_spec.rb +42 -1
- data/yasuri.gemspec +5 -3
- metadata +52 -19
data/USAGE.md
CHANGED
@@ -1,27 +1,31 @@
|
|
1
|
-
# Yasuri
|
1
|
+
# Yasuri
|
2
2
|
|
3
3
|
## What is Yasuri
|
4
|
-
`Yasuri` is
|
4
|
+
`Yasuri` (鑢) is a library for declarative web scraping and a command line tool built on it. You describe the expected result in a simple declarative notation, and Yasuri performs the scraping using "[Mechanize](https://github.com/sparklemotion/mechanize)".
|
5
5
|
|
6
|
-
Yasuri
|
6
|
+
Yasuri makes it easy to write common scraping operations.
|
7
|
+
For example, the following processes can be easily implemented.
|
7
8
|
|
8
|
-
|
9
|
+
+ Scrape multiple texts in a page and name them into a Hash
|
10
|
+
+ Open multiple links in a page and get the result of scraping each page as a Hash
|
11
|
+
+ Scrape each table that appears repeatedly in the page and get the result as an array
|
12
|
+
+ Scrape only the first three pages of each page provided by pagination
|
9
13
|
|
10
|
-
For example,
|
11
|
-
|
12
|
-
+ Open links in the page, scraping each page, and getting result as Hash.
|
13
|
-
+ Scraping texts in the page, and named result in Hash.
|
14
|
-
+ A table that repeatedly appears in a page each, scraping, get as an array.
|
15
|
-
+ Of each page provided by the pagination, scraping the only top 3.
|
16
|
-
|
17
|
-
You can implement easy by Yasuri.
|
18
14
|
|
19
15
|
## Quick Start
|
20
16
|
|
17
|
+
|
18
|
+
#### Install
|
19
|
+
```sh
|
20
|
+
# for Ruby 2.3.2
|
21
|
+
$ gem 'yasuri', '~> 2.0', '>= 2.0.13'
|
21
22
|
```
|
23
|
+
or
|
24
|
+
```sh
|
25
|
+
# for Ruby 3.0.0 or later
|
22
26
|
$ gem install yasuri
|
23
27
|
```
|
24
|
-
|
28
|
+
#### Use as library
|
25
29
|
```ruby
|
26
30
|
require 'yasuri'
|
27
31
|
require 'mechanize'
|
@@ -33,80 +37,148 @@ root = Yasuri.links_root '//*[@id="menu"]/ul/li/a' do
|
|
33
37
|
end
|
34
38
|
|
35
39
|
agent = Mechanize.new
|
36
|
-
root_page = agent.get("http://some.scraping.page.net/")
|
40
|
+
root_page = agent.get("http://some.scraping.page.tac42.net/")
|
37
41
|
|
38
42
|
result = root.inject(agent, root_page)
|
39
|
-
# => [
|
40
|
-
# {"title" => "
|
41
|
-
|
43
|
+
# => [
|
44
|
+
# {"title" => "PageTitle 01", "content" => "Page Contents 01" },
|
45
|
+
# {"title" => "PageTitle 02", "content" => "Page Contents 02" },
|
46
|
+
# ...
|
47
|
+
# {"title" => "PageTitle N", "content" => "Page Contents N" }
|
48
|
+
# ]
|
42
49
|
```
|
50
|
+
|
43
51
|
This example opens the page behind each link matched by the XPath of the links node (`links_root`) and scrapes the two texts matched by the XPaths of the TextNodes (`text_title`, `text_content`).
|
44
52
|
|
45
|
-
(
|
53
|
+
(In other words, open each link matched by `//*[@id="menu"]/ul/li/a` and scrape `//*[@id="contents"]/h2` and `//*[@id="contents"]/p[1]`.)
|
46
54
|
|
47
|
-
## Basics
|
48
55
|
|
49
|
-
|
50
|
-
|
56
|
+
#### Use as CLI tool
|
57
|
+
The same thing as above can be executed as a CLI command.
|
51
58
|
|
52
|
-
|
59
|
+
```sh
|
60
|
+
$ yasuri scrape "http://some.scraping.page.tac42.net/" -j '
|
61
|
+
{
|
62
|
+
"links_root": {
|
63
|
+
"path": "//*[@id=\"menu\"]/ul/li/a",
|
64
|
+
"text_title": "//*[@id=\"contents\"]/h2",
|
65
|
+
"text_content": "//*[@id=\"contents\"]/p[1]"
|
66
|
+
}
|
67
|
+
}'
|
53
68
|
|
54
|
-
|
55
|
-
|
56
|
-
|
69
|
+
[
|
70
|
+
{"title":"PageTitle 01","content":"Page Contents 01"},
|
71
|
+
{"title":"PageTitle 02","content":"Page Contents 02"},
|
72
|
+
...,
|
73
|
+
{"title":"PageTitle N","content":"Page Contents N"}
|
74
|
+
]
|
75
|
+
```
|
57
76
|
|
77
|
+
The result is returned as a string in JSON format.
|
58
78
|
|
59
|
-
|
60
|
-
|
61
|
-
text_name '/html/body/p'
|
62
|
-
end
|
79
|
+
----------------------------
|
80
|
+
## Parse Tree
|
63
81
|
|
64
|
-
|
65
|
-
agent = Mechanize.new
|
66
|
-
page = agent.get(uri)
|
82
|
+
A parse tree is tree-structured data that declaratively defines the elements to be scraped and the structure of the output.
|
67
83
|
|
84
|
+
A parse tree consists of nested `Node`s, each of which has `Type`, `Name`, `Path`, `Children`, and `Options` attributes, and scrapes according to its `Type`. (Note that `MapNode` is the only node without a `Path`.)
|
68
85
|
|
69
|
-
tree
|
86
|
+
The parse tree is defined in the following format:
|
87
|
+
|
88
|
+
```ruby
|
89
|
+
# A simple tree consisting of one node
|
90
|
+
Yasuri.<Type>_<Name> <Path> [,<Options>]
|
91
|
+
|
92
|
+
# Nested tree
|
93
|
+
Yasuri.<Type>_<Name> <Path> [,<Options>] do
|
94
|
+
<Type>_<Name> <Path> [,<Options>] do
|
95
|
+
<Type>_<Name> <Path> [,<Options>]
|
96
|
+
...
|
97
|
+
end
|
98
|
+
end
|
70
99
|
```
|
71
100
|
|
72
|
-
|
101
|
+
**Example**
|
73
102
|
|
74
103
|
```ruby
|
75
|
-
#
|
76
|
-
|
77
|
-
{ "node" : "links",
|
78
|
-
"name" : "title",
|
79
|
-
"path" : "/html/body/a",
|
80
|
-
"children" : [
|
81
|
-
{ "node" : "text",
|
82
|
-
"name" : "name",
|
83
|
-
"path" : "/html/body/p"
|
84
|
-
}
|
85
|
-
]
|
86
|
-
}
|
87
|
-
EOJSON
|
88
|
-
tree = Yasuri.json2tree(src)
|
89
|
-
```
|
104
|
+
# A simple tree consisting of one node
|
105
|
+
Yasuri.text_title '/html/head/title', truncate:/^[^,]+/
|
90
106
|
|
91
|
-
|
92
|
-
|
93
|
-
|
107
|
+
# Nested tree
|
108
|
+
Yasuri.links_root '//*[@id="menu"]/ul/li/a' do
|
109
|
+
struct_table './tr' do
|
110
|
+
text_title './td[1]'
|
111
|
+
text_pub_date './td[2]'
|
112
|
+
end
|
113
|
+
end
|
114
|
+
```
|
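The DSL method names above pack the node type and the node name into a single identifier. As a rough illustration only, the hypothetical helper below (`split_node_key` is not part of Yasuri) separates the two parts at the first underscore, assuming the type itself never contains an underscore.

```ruby
# Hypothetical helper, not part of Yasuri: split a "<Type>_<Name>" identifier
# at the first underscore. Assumes the type contains no underscore.
def split_node_key(key)
  type, name = key.to_s.split('_', 2)
  { type: type, name: name }
end

p split_node_key(:text_pub_date)  # type is "text", name is "pub_date"
p split_node_key(:links_root)     # type is "links", name is "root"
```

The name part is what ends up as the key in the scraped result, which is why `text_pub_date` yields a `"pub_date"` entry.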
94
115
|
|
95
|
-
|
116
|
+
Parse trees can be defined in Ruby DSL, JSON, or YAML.
|
117
|
+
The following is an example of the same parse tree as above, defined in each notation.
|
96
118
|
|
97
119
|
|
120
|
+
**Defining in the Ruby DSL**
|
98
121
|
```ruby
|
99
|
-
|
100
|
-
|
101
|
-
|
102
|
-
# Nested
|
103
|
-
Yasuri.<Type>_<Name> <Path> [,<Options>] do
|
104
|
-
<Type>_<Name> <Path> [,<Options>] do
|
105
|
-
<Children>
|
106
|
-
end
|
122
|
+
Yasuri.links_title '/html/body/a' do
|
123
|
+
text_name '/html/body/p'
|
107
124
|
end
|
108
125
|
```
|
109
126
|
|
127
|
+
**Defining in JSON**
|
128
|
+
```json
|
129
|
+
{
|
130
|
+
"links_title": {
|
131
|
+
"path": "/html/body/a",
|
132
|
+
"text_name": "/html/body/p"
|
133
|
+
}
|
134
|
+
}
|
135
|
+
```
|
136
|
+
|
137
|
+
**Defining in YAML**
|
138
|
+
```yaml
|
139
|
+
links_title:
|
140
|
+
path: "/html/body/a"
|
141
|
+
text_name: "/html/body/p"
|
142
|
+
```
|
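The JSON and YAML notations carry the same data. A minimal check using only Ruby's standard `json` and `yaml` libraries (it does not call Yasuri itself) shows that both sources parse to the same Hash:

```ruby
require 'json'
require 'yaml'

json_src = <<~EOJSON
  {
    "links_title": {
      "path": "/html/body/a",
      "text_name": "/html/body/p"
    }
  }
EOJSON

yaml_src = <<~EOYAML
  links_title:
    path: "/html/body/a"
    text_name: "/html/body/p"
EOYAML

from_json = JSON.parse(json_src)
from_yaml = YAML.safe_load(yaml_src)

# Both notations describe the same nested Hash.
puts from_json == from_yaml  # prints "true"
```

Because the parsed data is identical, choosing between JSON and YAML is purely a matter of which is more convenient to write.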
143
|
+
|
144
|
+
**Special case of parse tree**
|
145
|
+
|
146
|
+
If there is only one element directly under the root, the result is that element itself rather than a Hash (Object).
|
147
|
+
```json
|
148
|
+
{
|
149
|
+
"text_title": "/html/head/title",
|
150
|
+
"text_body": "/html/body"
|
151
|
+
}
|
152
|
+
# => {"title": "Welcome to yasuri!", "body": "Yasuri is ..."}
|
153
|
+
|
154
|
+
{
|
155
|
+
"text_title": "/html/head/title"
|
156
|
+
}
|
157
|
+
# => Welcome to yasuri!
|
158
|
+
```
|
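The single-element rule can be pictured with a small stand-in. This is only a sketch of the behaviour described above, not Yasuri's code; `unwrap_single` is a hypothetical name and the sample hashes are made up to match the example.

```ruby
# Hypothetical helper mimicking the rule: a result with exactly one entry
# is returned as the bare value, anything else stays a Hash.
def unwrap_single(result)
  result.size == 1 ? result.values.first : result
end

two_entries = { "title" => "Welcome to yasuri!", "body" => "Yasuri is ..." }
one_entry   = { "title" => "Welcome to yasuri!" }

p unwrap_single(two_entries)  # still a Hash with "title" and "body"
p unwrap_single(one_entry)    # the bare String "Welcome to yasuri!"
```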
159
|
+
|
160
|
+
|
161
|
+
In JSON or YAML format, an attribute can take the `path` directly as its value if it has no child nodes. The following two JSON documents produce the same parse tree.
|
162
|
+
|
163
|
+
```json
|
164
|
+
{
|
165
|
+
"text_name": "/html/body/p"
|
166
|
+
}
|
167
|
+
|
168
|
+
{
|
169
|
+
"text_name": {
|
170
|
+
"path": "/html/body/p"
|
171
|
+
}
|
172
|
+
}
|
173
|
+
```
|
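One way to think about the shorthand is as a normalisation step applied before the tree is built. The sketch below is an assumption about how such a step could look, not Yasuri's implementation; `expand_shorthand` is a hypothetical helper and it only handles top-level entries.

```ruby
require 'json'

# Hypothetical normaliser: rewrite top-level "key": "<path>" entries
# into the longer "key": {"path": "<path>"} form.
def expand_shorthand(tree)
  tree.each_with_object({}) do |(key, value), expanded|
    expanded[key] = value.is_a?(String) ? { "path" => value } : value
  end
end

short = JSON.parse('{ "text_name": "/html/body/p" }')
long  = JSON.parse('{ "text_name": { "path": "/html/body/p" } }')

puts expand_shorthand(short) == long  # prints "true"
puts expand_shorthand(long) == long   # prints "true" (already expanded)
```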
174
|
+
|
175
|
+
|
176
|
+
--------------------------
|
177
|
+
## Node
|
178
|
+
|
179
|
+
A Node is a branch or leaf of the parse tree. It has `Type`, `Name`, `Path`, `Children`, and `Options`, and scrapes according to its `Type`. (Note that `MapNode` is the only node without a `Path`.)
|
180
|
+
|
181
|
+
|
110
182
|
#### Type
|
111
183
|
Type determines the behavior of a Node.
|
112
184
|
|
@@ -114,17 +186,20 @@ Type meen behavior of Node.
|
|
114
186
|
- *Struct*
|
115
187
|
- *Links*
|
116
188
|
- *Paginate*
|
189
|
+
- *Map*
|
190
|
+
|
191
|
+
See the description of each node for details.
|
117
192
|
|
118
|
-
|
193
|
+
#### Name
|
119
194
|
Name is used as the key in the returned hash.
|
120
195
|
|
121
|
-
|
196
|
+
#### Path
|
122
197
|
Path selects the target node by XPath or CSS selector. It is passed to Mechanize `search`.
|
123
198
|
|
124
|
-
|
199
|
+
#### Children
|
125
200
|
Child nodes. A TextNode always has an empty set, because a TextNode is a leaf.
|
126
201
|
|
127
|
-
|
202
|
+
#### Options
|
128
203
|
Parse options. They differ for each type. You can get the options and their values with the `opt` method.
|
129
204
|
|
130
205
|
```ruby
|
@@ -136,6 +211,8 @@ node.opt #=> {:truncate => /^[^,]+/, :proc => nil}
|
|
136
211
|
## Text Node
|
137
212
|
TextNode returns the scraped text. This node must be a leaf.
|
138
213
|
|
214
|
+
|
215
|
+
|
139
216
|
### Example
|
140
217
|
|
141
218
|
```html
|
@@ -155,13 +232,15 @@ page = agent.get("http://yasuri.example.net")
|
|
155
232
|
|
156
233
|
p1 = Yasuri.text_title '/html/body/p[1]'
|
157
234
|
p1t = Yasuri.text_title '/html/body/p[1]', truncate:/^[^,]+/
|
158
|
-
p2u = Yasuri.text_title '/html/body/p[
|
235
|
+
p2u = Yasuri.text_title '/html/body/p[1]', proc: :upcase
|
159
236
|
|
160
|
-
p1.inject(agent, page) #=>
|
161
|
-
p1t.inject(agent, page) #=>
|
162
|
-
|
237
|
+
p1.inject(agent, page) #=> "Hello,World"
|
238
|
+
p1t.inject(agent, page) #=> "Hello"
|
239
|
+
p2u.inject(agent, page) #=> "HELLO,WORLD"
|
163
240
|
```
|
164
241
|
|
242
|
+
Note that if you want to scrape multiple elements in the same page at once, use `MapNode`. See the `MapNode` example for details.
|
243
|
+
|
165
244
|
### Options
|
166
245
|
##### `truncate`
|
167
246
|
Matches the text against the regexp and truncates it to the matched part. When the regexp contains a group, only the first matched group is returned.
|
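The effect of `truncate` can be approximated with a plain `Regexp` match. The helper below is a hypothetical stand-in, not Yasuri's implementation, and its handling of non-matching text (returned unchanged) is an assumption of this sketch.

```ruby
# Hypothetical stand-in for the `truncate` option.
# Assumption: text that does not match is returned unchanged.
def truncate_text(text, pattern)
  match = pattern.match(text)
  return text if match.nil?

  # Prefer the first capture group; fall back to the whole match.
  match[1] || match[0]
end

puts truncate_text("Hello,World", /^[^,]+/)        # prints "Hello"
puts truncate_text("Hello,World", /^[^,]+,(.+)$/)  # prints "World"
```

The second call shows the group rule: the whole pattern matches `Hello,World`, but only the captured part after the comma is kept.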
@@ -429,3 +508,190 @@ node.inject(agent, page)
|
|
429
508
|
#=> [ {"content" => "Pagination01"}, {"content" => "Pagination02"}]
|
430
509
|
```
|
431
510
|
The Paginate node opens at most the number of pages given by `limit`, here 2. In this example the pagination has 4 pages, but the resulting Array has 2 texts because `limit:2` was given.
|
511
|
+
|
512
|
+
##### `flatten`
|
513
|
+
The `flatten` option expands the results of each page into a single flat array.
|
514
|
+
|
515
|
+
```ruby
|
516
|
+
agent = Mechanize.new
|
517
|
+
page = agent.get("http://yasuri.example.net/page01.html")
|
518
|
+
|
519
|
+
node = Yasuri.pages_root "/html/body/nav/span/a[@class='next']" , flatten:false do
|
520
|
+
text_title '/html/head/title'
|
521
|
+
text_content '/html/body/p'
|
522
|
+
end
|
523
|
+
node.inject(agent, page)
|
524
|
+
|
525
|
+
#=> [ {"title" => "Page01",
|
526
|
+
"content" => "Pagination01"},
|
527
|
+
{"title" => "Page01",
|
528
|
+
"content" => "Pagination02"},
|
529
|
+
{"title" => "Page01",
|
530
|
+
"content" => "Pagination03"}]
|
531
|
+
|
532
|
+
|
533
|
+
node = Yasuri.pages_root "/html/body/nav/span/a[@class='next']" , flatten:true do
|
534
|
+
text_title '/html/head/title'
|
535
|
+
text_content '/html/body/p'
|
536
|
+
end
|
537
|
+
node.inject(agent, page)
|
538
|
+
|
539
|
+
#=> [ "Page01",
|
540
|
+
"Pagination01",
|
541
|
+
"Page02",
|
542
|
+
"Pagination02",
|
543
|
+
"Page03",
|
544
|
+
"Pagination03"]
|
545
|
+
```
|
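The difference between the two result shapes can be reproduced with plain Ruby. This sketch does not call Yasuri; the per-page hashes are assumed sample data, and `flat_map` is used only to imitate what the flattened output looks like.

```ruby
# Assumed per-page results, one Hash per page.
pages = [
  { "title" => "Page01", "content" => "Pagination01" },
  { "title" => "Page02", "content" => "Pagination02" },
  { "title" => "Page03", "content" => "Pagination03" }
]

# Flattened shape: the values of every page, in order, in one Array.
flat = pages.flat_map(&:values)

p flat
```

Note that flattening drops the keys, so it suits cases where only the ordered values matter.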
546
|
+
|
547
|
+
## Map Node
|
548
|
+
*MapNode* is a node that summarizes the results of scraping. This node is always a branch node in the parse tree.
|
549
|
+
|
550
|
+
### Example
|
551
|
+
|
552
|
+
```html
|
553
|
+
<!-- http://yasuri.example.net -->
|
554
|
+
<html>
|
555
|
+
<head><title>Yasuri Example</title></head>
|
556
|
+
<body>
|
557
|
+
<p>Hello,World</p>
|
558
|
+
<p>Hello,Yasuri</p>
|
559
|
+
</body>
|
560
|
+
</html>
|
561
|
+
```
|
562
|
+
|
563
|
+
```ruby
|
564
|
+
agent = Mechanize.new
|
565
|
+
page = agent.get("http://yasuri.example.net")
|
566
|
+
|
567
|
+
|
568
|
+
tree = Yasuri.map_root do
|
569
|
+
text_title '/html/head/title'
|
570
|
+
text_body_p '/html/body/p[1]'
|
571
|
+
end
|
572
|
+
|
573
|
+
tree.inject(agent, page) #=> { "title" => "Yasuri Example", "body_p" => "Hello,World" }
|
574
|
+
|
575
|
+
|
576
|
+
tree = Yasuri.map_root do
|
577
|
+
map_group1 { text_child01 '/html/body/a[1]' }
|
578
|
+
map_group2 do
|
579
|
+
text_child01 '/html/body/a[1]'
|
580
|
+
text_child03 '/html/body/a[3]'
|
581
|
+
end
|
582
|
+
end
|
583
|
+
|
584
|
+
tree.inject(agent, page) #=> {
|
585
|
+
# "group1" => {
|
586
|
+
# "child01" => "child01"
|
587
|
+
# },
|
588
|
+
# "group2" => {
|
589
|
+
# "child01" => "child01",
|
590
|
+
# "child03" => "child03"
|
591
|
+
# }
|
592
|
+
# }
|
593
|
+
```
|
594
|
+
|
595
|
+
### Options
|
596
|
+
None.
|
597
|
+
|
598
|
+
|
599
|
+
|
600
|
+
|
601
|
+
-------------------------
|
602
|
+
## Usage
|
603
|
+
|
604
|
+
#### Use as library
|
605
|
+
When used as a library, the tree can be defined in DSL, json, or yaml format.
|
606
|
+
```ruby
|
607
|
+
require 'mechanize'
|
608
|
+
require 'yasuri'
|
609
|
+
|
610
|
+
|
611
|
+
# 1. Create a parse tree.
|
612
|
+
# Define by Ruby's DSL
|
613
|
+
tree = Yasuri.links_title '/html/body/a' do
|
614
|
+
text_name '/html/body/p'
|
615
|
+
end
|
616
|
+
|
617
|
+
# Define by JSON
|
618
|
+
src = <<-EOJSON
|
619
|
+
{
|
620
|
+
"links_title": {
|
621
|
+
"path": "/html/body/a",
|
622
|
+
"text_name": "/html/body/p"
|
623
|
+
}
|
624
|
+
}
|
625
|
+
EOJSON
|
626
|
+
tree = Yasuri.json2tree(src)
|
627
|
+
|
628
|
+
|
629
|
+
# Define by YAML
|
630
|
+
src = <<-EOYAML
|
631
|
+
links_title:
|
632
|
+
path: "/html/body/a"
|
633
|
+
text_name: "/html/body/p"
|
634
|
+
EOYAML
|
635
|
+
tree = Yasuri.yaml2tree(src)
|
636
|
+
|
637
|
+
|
638
|
+
|
639
|
+
# 2. Give the Mechanize agent and the target page to start parsing
|
640
|
+
agent = Mechanize.new
|
641
|
+
page = agent.get(uri)
|
642
|
+
|
643
|
+
|
644
|
+
tree.inject(agent, page)
|
645
|
+
```
|
646
|
+
|
647
|
+
#### Use as CLI tool
|
648
|
+
|
649
|
+
**Help**
|
650
|
+
```sh
|
651
|
+
$ yasuri help scrape
|
652
|
+
Usage:
|
653
|
+
yasuri scrape <URI> [[--file <TREE_FILE>] or [--json <JSON>]]
|
654
|
+
|
655
|
+
Options:
|
656
|
+
f, [--file=FILE] # path to file that written yasuri tree as json or yaml
|
657
|
+
j, [--json=JSON] # yasuri tree format json string
|
658
|
+
|
659
|
+
Getting from <URI> and scrape it. with <JSON> or json/yml from <TREE_FILE>. They should be Yasuri's format json or yaml string.
|
660
|
+
```
|
661
|
+
|
662
|
+
In the CLI tool, you can specify the parse tree in either of the following ways.
|
663
|
+
+ `--file`, `-f`: read the parse tree from a file written in JSON or YAML format.
|
664
|
+
+ `--json`, `-j`: specify the parse tree directly as a JSON string.
|
665
|
+
|
666
|
+
|
667
|
+
**Example of specifying a parse tree as a file**
|
668
|
+
```sh
|
669
|
+
% cat sample.yml
|
670
|
+
text_title: "/html/head/title"
|
671
|
+
text_desc: "//*[@id=\"intro\"]/p"
|
672
|
+
|
673
|
+
% yasuri scrape "https://www.ruby-lang.org/en/" --file sample.yml
|
674
|
+
{"title":"Ruby Programming Language","desc":"\n A dynamic, open source programming language with a focus on\n simplicity and productivity. It has an elegant syntax that is\n natural to read and easy to write.\n "}
|
675
|
+
|
676
|
+
% cat sample.json
|
677
|
+
{
|
678
|
+
"text_title": "/html/head/title",
|
679
|
+
"text_desc": "//*[@id=\"intro\"]/p"
|
680
|
+
}
|
681
|
+
|
682
|
+
% yasuri scrape "https://www.ruby-lang.org/en/" --file sample.json
|
683
|
+
{"title":"Ruby Programming Language","desc":"\n A dynamic, open source programming language with a focus on\n simplicity and productivity. It has an elegant syntax that is\n natural to read and easy to write.\n "}
|
684
|
+
```
|
685
|
+
|
686
|
+
Whether the file is written in JSON or YAML is determined automatically.
|
687
|
+
|
688
|
+
**Example of specifying a parse tree directly in json**
|
689
|
+
```sh
|
690
|
+
$ yasuri scrape "https://www.ruby-lang.org/en/" -j '
|
691
|
+
{
|
692
|
+
"text_title": "/html/head/title",
|
693
|
+
"text_desc": "//*[@id=\"intro\"]/p"
|
694
|
+
}'
|
695
|
+
|
696
|
+
{"title":"Ruby Programming Language","desc":"\n A dynamic, open source programming language with a focus on\n simplicity and productivity. It has an elegant syntax that is\n natural to read and easy to write.\n "}
|
697
|
+
```
|
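When the parse tree contains quoted XPath attributes, generating the tree file from Ruby avoids hand-escaping. The following is a small sketch using only the standard library; the output location (the system temp directory) is an arbitrary choice for illustration, and the printed command uses the `--file` option described above.

```ruby
require 'json'
require 'tmpdir'

# Same tree as the sample above, written as a Ruby Hash.
tree = {
  "text_title" => "/html/head/title",
  "text_desc"  => "//*[@id=\"intro\"]/p"
}

# Write the tree to a file; the temp directory is only an example location.
path = File.join(Dir.tmpdir, "sample.json")
File.write(path, JSON.pretty_generate(tree))

# Print the command that would scrape with this tree file.
puts "yasuri scrape \"https://www.ruby-lang.org/en/\" --file #{path}"
```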