yasuri 2.0.11 → 3.2.0
- checksums.yaml +5 -5
- data/.github/workflows/ruby.yml +35 -0
- data/.gitignore +1 -2
- data/.ruby-version +1 -0
- data/.travis.yml +1 -3
- data/README.md +88 -19
- data/USAGE.ja.md +325 -63
- data/USAGE.md +335 -69
- data/exe/yasuri +5 -0
- data/lib/yasuri.rb +1 -0
- data/lib/yasuri/version.rb +1 -1
- data/lib/yasuri/yasuri.rb +80 -39
- data/lib/yasuri/yasuri_cli.rb +64 -0
- data/lib/yasuri/yasuri_links_node.rb +10 -6
- data/lib/yasuri/yasuri_map_node.rb +39 -0
- data/lib/yasuri/yasuri_node.rb +24 -3
- data/lib/yasuri/yasuri_node_generator.rb +16 -11
- data/lib/yasuri/yasuri_paginate_node.rb +18 -6
- data/lib/yasuri/yasuri_struct_node.rb +8 -4
- data/lib/yasuri/yasuri_text_node.rb +11 -4
- data/spec/cli_resources/tree.json +8 -0
- data/spec/cli_resources/tree.yml +5 -0
- data/spec/cli_resources/tree_wrong.json +9 -0
- data/spec/cli_resources/tree_wrong.yml +6 -0
- data/spec/htdocs/struct/structual_links.html +30 -0
- data/spec/htdocs/{structual_text.html → struct/structual_text.html} +0 -0
- data/spec/spec_helper.rb +1 -6
- data/spec/yasuri_cli_spec.rb +83 -0
- data/spec/yasuri_links_node_spec.rb +12 -4
- data/spec/yasuri_map_spec.rb +76 -0
- data/spec/yasuri_paginate_node_spec.rb +43 -0
- data/spec/yasuri_spec.rb +199 -84
- data/spec/yasuri_struct_node_spec.rb +42 -1
- data/yasuri.gemspec +5 -3
- metadata +52 -19
data/USAGE.md
CHANGED
@@ -1,27 +1,31 @@
|
|
1
|
-
# Yasuri
|
1
|
+
# Yasuri
|
2
2
|
|
3
3
|
## What is Yasuri
|
4
|
-
`Yasuri` is
|
4
|
+
`Yasuri` (鑢) is a library for declarative web scraping and a command line tool built on it. It scrapes with "[Mechanize](https://github.com/sparklemotion/mechanize)"; you only describe the expected result in a simple declarative notation.
|
5
5
|
|
6
|
-
Yasuri
|
6
|
+
Yasuri makes it easy to write common scraping operations.
|
7
|
+
For example, the following processes can be easily implemented.
|
7
8
|
|
8
|
-
|
9
|
+
+ Scrape multiple texts in a page and name them into a Hash
|
10
|
+
+ Open multiple links in a page and get the result of scraping each page as a Hash
|
11
|
+
+ Scrape each table that appears repeatedly in the page and get the result as an array
|
12
|
+
+ Scrape only the first three of the pages provided by pagination
|
9
13
|
|
10
|
-
For example,
|
11
|
-
|
12
|
-
+ Open links in the page, scraping each page, and getting result as Hash.
|
13
|
-
+ Scraping texts in the page, and named result in Hash.
|
14
|
-
+ A table that repeatedly appears in a page each, scraping, get as an array.
|
15
|
-
+ Of each page provided by the pagination, scraping the only top 3.
|
16
|
-
|
17
|
-
You can implement easy by Yasuri.
|
18
14
|
|
19
15
|
## Quick Start
|
20
16
|
|
17
|
+
|
18
|
+
#### Install
|
19
|
+
```sh
|
20
|
+
# for Ruby 2.3.2
|
21
|
+
$ gem 'yasuri', '~> 2.0', '>= 2.0.13'
|
21
22
|
```
|
23
|
+
or
|
24
|
+
```sh
|
25
|
+
# for Ruby 3.0.0 or later
|
22
26
|
$ gem install yasuri
|
23
27
|
```
|
24
|
-
|
28
|
+
#### Use as library
|
25
29
|
```ruby
|
26
30
|
require 'yasuri'
|
27
31
|
require 'mechanize'
|
@@ -33,80 +37,148 @@ root = Yasuri.links_root '//*[@id="menu"]/ul/li/a' do
|
|
33
37
|
end
|
34
38
|
|
35
39
|
agent = Mechanize.new
|
36
|
-
root_page = agent.get("http://some.scraping.page.net/")
|
40
|
+
root_page = agent.get("http://some.scraping.page.tac42.net/")
|
37
41
|
|
38
42
|
result = root.inject(agent, root_page)
|
39
|
-
# => [
|
40
|
-
# {"title" => "
|
41
|
-
|
43
|
+
# => [
|
44
|
+
# {"title" => "PageTitle 01", "content" => "Page Contents 01" },
|
45
|
+
# {"title" => "PageTitle 02", "content" => "Page Contents 02" },
|
46
|
+
# ...
|
47
|
+
# {"title" => "PageTitle N", "content" => "Page Contents N" }
|
48
|
+
# ]
|
42
49
|
```
|
50
|
+
|
43
51
|
This example opens the page behind each link matched by the XPath of the LinkNode (`links_root`) and scrapes the two texts matched by the XPaths of the TextNodes (`text_title`, `text_content`).
|
44
52
|
|
45
|
-
(
|
53
|
+
(In other words, open each link matched by `//*[@id="menu"]/ul/li/a`, then scrape `//*[@id="contents"]/h2` and `//*[@id="contents"]/p[1]`.)
|
46
54
|
|
47
|
-
## Basics
|
48
55
|
|
49
|
-
|
50
|
-
|
56
|
+
#### Use as CLI tool
|
57
|
+
The same thing as above can be executed as a CLI command.
|
51
58
|
|
52
|
-
|
59
|
+
```sh
|
60
|
+
$ yasuri scrape "http://some.scraping.page.tac42.net/" -j '
|
61
|
+
{
|
62
|
+
"links_root": {
|
63
|
+
"path": "//*[@id=\"menu\"]/ul/li/a",
|
64
|
+
"text_title": "//*[@id=\"contents\"]/h2",
|
65
|
+
"text_content": "//*[@id=\"contents\"]/p[1]"
|
66
|
+
}
|
67
|
+
}'
|
53
68
|
|
54
|
-
|
55
|
-
|
56
|
-
|
69
|
+
[
|
70
|
+
{"title":"PageTitle 01","content":"Page Contents 01"},
|
71
|
+
{"title":"PageTitle 02","content":"Page Contents 02"},
|
72
|
+
...,
|
73
|
+
{"title":"PageTitle N","content":"Page Contents N"}
|
74
|
+
]
|
75
|
+
```
|
57
76
|
|
77
|
+
The result is returned as a string in JSON format.
|
58
78
|
|
59
|
-
|
60
|
-
|
61
|
-
text_name '/html/body/p'
|
62
|
-
end
|
79
|
+
----------------------------
|
80
|
+
## Parse Tree
|
63
81
|
|
64
|
-
|
65
|
-
agent = Mechanize.new
|
66
|
-
page = agent.get(uri)
|
82
|
+
A parse tree is tree-structured data that declaratively defines the elements to be scraped and the structure of the output.
|
67
83
|
|
84
|
+
A parse tree consists of nested `Node`s, each of which has `Type`, `Name`, `Path`, `Children`, and `Options` attributes and scrapes according to its `Type`. (Note that `MapNode` is the only node that does not have a `Path`.)
|
68
85
|
|
69
|
-
tree
|
86
|
+
The parse tree is defined in the following format:
|
87
|
+
|
88
|
+
```ruby
|
89
|
+
# A simple tree consisting of one node
|
90
|
+
Yasuri.<Type>_<Name> <Path> [,<Options>]
|
91
|
+
|
92
|
+
# Nested tree
|
93
|
+
Yasuri.<Type>_<Name> <Path> [,<Options>] do
|
94
|
+
<Type>_<Name> <Path> [,<Options>] do
|
95
|
+
<Type>_<Name> <Path> [,<Options>]
|
96
|
+
...
|
97
|
+
end
|
98
|
+
end
|
70
99
|
```
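To make the `<Type>_<Name>` naming rule concrete, here is a minimal sketch in plain Ruby. The helper `split_node_key` is hypothetical and not part of Yasuri; it assumes that the type is the text before the first underscore and that everything after it is the name.

```ruby
# Hypothetical helper (not part of Yasuri's API): split a "<Type>_<Name>"
# key into its two parts.
# Assumption: the type is everything before the first underscore; the rest,
# including any further underscores, is the name.
def split_node_key(key)
  key.to_s.split('_', 2)
end

p split_node_key('text_title')     # prints ["text", "title"]
p split_node_key('text_pub_date')  # prints ["text", "pub_date"]
p split_node_key('links_root')     # prints ["links", "root"]
```

Limiting the split to two parts keeps names such as `pub_date` intact.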
|
71
100
|
|
72
|
-
|
101
|
+
**Example**
|
73
102
|
|
74
103
|
```ruby
|
75
|
-
#
|
76
|
-
|
77
|
-
{ "node" : "links",
|
78
|
-
"name" : "title",
|
79
|
-
"path" : "/html/body/a",
|
80
|
-
"children" : [
|
81
|
-
{ "node" : "text",
|
82
|
-
"name" : "name",
|
83
|
-
"path" : "/html/body/p"
|
84
|
-
}
|
85
|
-
]
|
86
|
-
}
|
87
|
-
EOJSON
|
88
|
-
tree = Yasuri.json2tree(src)
|
89
|
-
```
|
104
|
+
# A simple tree consisting of one node
|
105
|
+
Yasuri.text_title '/html/head/title', truncate:/^[^,]+/
|
90
106
|
|
91
|
-
|
92
|
-
|
93
|
-
|
107
|
+
# Nested tree
|
108
|
+
Yasuri.links_root '//*[@id="menu"]/ul/li/a' do
|
109
|
+
struct_table './tr' do
|
110
|
+
text_title './td[1]'
|
111
|
+
text_pub_date './td[2]'
|
112
|
+
end
|
113
|
+
end
|
114
|
+
```
|
94
115
|
|
95
|
-
|
116
|
+
Parse trees can be defined in Ruby DSL, JSON, or YAML.
|
117
|
+
The following is an example of the same parse tree as above, defined in each notation.
|
96
118
|
|
97
119
|
|
120
|
+
**Case of defining as Ruby DSL**
|
98
121
|
```ruby
|
99
|
-
|
100
|
-
|
101
|
-
|
102
|
-
# Nested
|
103
|
-
Yasuri.<Type>_<Name> <Path> [,<Options>] do
|
104
|
-
<Type>_<Name> <Path> [,<Options>] do
|
105
|
-
<Children>
|
106
|
-
end
|
122
|
+
Yasuri.links_title '/html/body/a' do
|
123
|
+
text_name '/html/body/p'
|
107
124
|
end
|
108
125
|
```
|
109
126
|
|
127
|
+
**Case of defining as JSON**
|
128
|
+
```json
|
129
|
+
{
|
130
|
+
"links_title": {
|
131
|
+
"path": "/html/body/a",
|
132
|
+
"text_name": "/html/body/p"
|
133
|
+
}
|
134
|
+
}
|
135
|
+
```
|
136
|
+
|
137
|
+
**Case of defining as YAML**
|
138
|
+
```yaml
|
139
|
+
links_title:
|
140
|
+
path: "/html/body/a"
|
141
|
+
text_name: "/html/body/p"
|
142
|
+
```
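As a quick check that the JSON and YAML notations carry the same data, the following sketch parses both with Ruby's standard `json` and `yaml` libraries and compares the results. It does not call Yasuri at all; the variable names are illustrative only.

```ruby
require 'json'
require 'yaml'

# The same tree written in the two text notations.
json_src = <<~JSON
  {
    "links_title": {
      "path": "/html/body/a",
      "text_name": "/html/body/p"
    }
  }
JSON

yaml_src = <<~YAML
  links_title:
    path: "/html/body/a"
    text_name: "/html/body/p"
YAML

from_json = JSON.parse(json_src)
from_yaml = YAML.safe_load(yaml_src)

# Both notations decode to the same nested Hash.
puts from_json == from_yaml  # prints true
```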
|
143
|
+
|
144
|
+
**Special case of parse tree**
|
145
|
+
|
146
|
+
If there is only one element directly under the root, that element is returned directly instead of being wrapped in a Hash (Object).
|
147
|
+
```json
|
148
|
+
{
|
149
|
+
"text_title": "/html/head/title",
|
150
|
+
"text_body": "/html/body"
|
151
|
+
}
|
152
|
+
# => {"title": "Welcome to yasuri!", "body": "Yasuri is ..."}
|
153
|
+
|
154
|
+
{
|
155
|
+
"text_title": "/html/head/title"
|
156
|
+
}
|
157
|
+
# => Welcome to yasuri!
|
158
|
+
```
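The two result shapes above can be mimicked in plain Ruby. This is only a sketch of the documented behavior, not Yasuri's implementation; `unwrap_single` is a hypothetical helper.

```ruby
# Hypothetical helper (not Yasuri's code): return the only value directly
# when the result Hash has exactly one entry, otherwise return the Hash.
def unwrap_single(result)
  result.size == 1 ? result.values.first : result
end

two = { 'title' => 'Welcome to yasuri!', 'body' => 'Yasuri is ...' }
one = { 'title' => 'Welcome to yasuri!' }

p unwrap_single(two) == two   # prints true (the Hash is kept as is)
p unwrap_single(one)          # prints "Welcome to yasuri!"
```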
|
159
|
+
|
160
|
+
|
161
|
+
In JSON or YAML format, an attribute can take the `path` directly as its value if it has no child Node. The following two JSON documents produce the same parse tree.
|
162
|
+
|
163
|
+
```json
|
164
|
+
{
|
165
|
+
"text_name": "/html/body/p"
|
166
|
+
}
|
167
|
+
|
168
|
+
{
|
169
|
+
"text_name": {
|
170
|
+
"path": "/html/body/p"
|
171
|
+
}
|
172
|
+
}
|
173
|
+
```
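One way to picture this equivalence is to expand the short form into the long form before comparing. The sketch below uses only the standard `json` library; `normalize_node` is a hypothetical helper and does not reflect how Yasuri parses trees internally.

```ruby
require 'json'

# Hypothetical helper (not part of Yasuri): expand the shorthand
# "key": "<path>" into the explicit form "key": { "path": "<path>" }.
def normalize_node(value)
  value.is_a?(String) ? { 'path' => value } : value
end

short = JSON.parse('{ "text_name": "/html/body/p" }')
long  = JSON.parse('{ "text_name": { "path": "/html/body/p" } }')

normalized_short = short.transform_values { |v| normalize_node(v) }
normalized_long  = long.transform_values { |v| normalize_node(v) }

puts normalized_short == normalized_long  # prints true
```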
|
174
|
+
|
175
|
+
|
176
|
+
--------------------------
|
177
|
+
## Node
|
178
|
+
|
179
|
+
A Node is a branch or leaf of the parse tree. It has `Type`, `Name`, `Path`, `Children`, and `Options`, and scrapes according to its `Type`. (Note that `MapNode` is the only node that does not have a `Path`.)
|
180
|
+
|
181
|
+
|
110
182
|
#### Type
|
111
183
|
Type determines the behavior of a Node.
|
112
184
|
|
@@ -114,17 +186,20 @@ Type meen behavior of Node.
|
|
114
186
|
- *Struct*
|
115
187
|
- *Links*
|
116
188
|
- *Paginate*
|
189
|
+
- *Map*
|
190
|
+
|
191
|
+
See the description of each node for details.
|
117
192
|
|
118
|
-
|
193
|
+
#### Name
|
119
194
|
Name is used as the key in the returned Hash.
|
120
195
|
|
121
|
-
|
196
|
+
#### Path
|
122
197
|
Path determines the target node by XPath or CSS selector. It is passed to Mechanize's `search`.
|
123
198
|
|
124
|
-
|
199
|
+
#### Children
|
125
200
|
Child nodes. A TextNode always has an empty set, because a TextNode is a leaf.
|
126
201
|
|
127
|
-
|
202
|
+
#### Options
|
128
203
|
Parse options. They differ for each type. You can get the options and their values with the `opt` method.
|
129
204
|
|
130
205
|
```ruby
|
@@ -136,6 +211,8 @@ node.opt #=> {:truncate => /^[^,]+/, :proc => nil}
|
|
136
211
|
## Text Node
|
137
212
|
TextNode returns the scraped text. This node has to be a leaf.
|
138
213
|
|
214
|
+
|
215
|
+
|
139
216
|
### Example
|
140
217
|
|
141
218
|
```html
|
@@ -155,13 +232,15 @@ page = agent.get("http://yasuri.example.net")
|
|
155
232
|
|
156
233
|
p1 = Yasuri.text_title '/html/body/p[1]'
|
157
234
|
p1t = Yasuri.text_title '/html/body/p[1]', truncate:/^[^,]+/
|
158
|
-
p2u = Yasuri.text_title '/html/body/p[
|
235
|
+
p2u = Yasuri.text_title '/html/body/p[1]', proc: :upcase
|
159
236
|
|
160
|
-
p1.inject(agent, page) #=>
|
161
|
-
p1t.inject(agent, page) #=>
|
162
|
-
|
237
|
+
p1.inject(agent, page) #=> "Hello,World"
|
238
|
+
p1t.inject(agent, page) #=> "Hello"
|
239
|
+
p2u.inject(agent, page) #=> "HELLO,WORLD"
|
163
240
|
```
|
164
241
|
|
242
|
+
Note that if you want to scrape multiple elements in the same page at once, use `MapNode`. See the `MapNode` example for details.
|
243
|
+
|
165
244
|
### Options
|
166
245
|
##### `truncate`
|
167
246
|
Matches the text against a regexp and truncates the text to the match. When the regexp contains a group, only the first matched group is returned.
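The effect of `truncate` can be approximated in plain Ruby. This is a sketch of the behavior described here, not Yasuri's source; `truncate_text` is a hypothetical helper, and returning an empty string when nothing matches is an assumption.

```ruby
# Hypothetical stand-in for the documented `truncate` behavior.
def truncate_text(text, regexp)
  match = regexp.match(text)
  return '' if match.nil?   # assumption: no match yields an empty string
  match[1] || match[0]      # first group if there is one, else the whole match
end

p truncate_text('Hello,World', /^[^,]+/)  # prints "Hello"
p truncate_text('Hello,World', /,(.+)$/)  # prints "World"
```

Without a group the whole match is kept; with a group only the first captured part survives.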
|
@@ -429,3 +508,190 @@ node.inject(agent, page)
|
|
429
508
|
#=> [ {"content" => "Pagination01"}, {"content" => "Pagination02"}]
|
430
509
|
```
|
431
510
|
Paginate Node opens up to the number of pages given by `limit`. Here the pagination has 4 pages, but the result Array has only 2 texts because `limit:2` is given.
|
511
|
+
|
512
|
+
##### `flatten`
|
513
|
+
The `flatten` option expands the results of each page into a single flat Array.
|
514
|
+
|
515
|
+
```ruby
|
516
|
+
agent = Mechanize.new
|
517
|
+
page = agent.get("http://yasuri.example.net/page01.html")
|
518
|
+
|
519
|
+
node = Yasuri.pages_root "/html/body/nav/span/a[@class='next']" , flatten:true do
|
520
|
+
text_title '/html/head/title'
|
521
|
+
text_content '/html/body/p'
|
522
|
+
end
|
523
|
+
node.inject(agent, page)
|
524
|
+
|
525
|
+
#=> [ {"title" => "Page01",
|
526
|
+
"content" => "Pagination01"},
|
527
|
+
{"title" => "Page01",
|
528
|
+
"content" => "Pagination02"},
|
529
|
+
{"title" => "Page01",
|
530
|
+
"content" => "Pagination03"}]
|
531
|
+
|
532
|
+
|
533
|
+
node = Yasuri.pages_root "/html/body/nav/span/a[@class='next']" , flatten:true do
|
534
|
+
text_title '/html/head/title'
|
535
|
+
text_content '/html/body/p'
|
536
|
+
end
|
537
|
+
node.inject(agent, page)
|
538
|
+
|
539
|
+
#=> [ "Page01",
|
540
|
+
"Pagination01",
|
541
|
+
"Page02",
|
542
|
+
"Pagination02",
|
543
|
+
"Page03",
|
544
|
+
"Pagination03"]
|
545
|
+
```
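The relationship between the per-page results and the flattened result can be sketched in plain Ruby. This assumes, based on the flat output shown above, that flattening concatenates the values of each page's Hash in order; it is an illustration, not Yasuri's implementation.

```ruby
# Per-page results as they would look without flattening (illustrative data).
per_page = [
  { 'title' => 'Page01', 'content' => 'Pagination01' },
  { 'title' => 'Page02', 'content' => 'Pagination02' },
  { 'title' => 'Page03', 'content' => 'Pagination03' }
]

# Assumption: flattening keeps only the values, page by page, in order.
flattened = per_page.flat_map(&:values)

p flattened
# prints ["Page01", "Pagination01", "Page02", "Pagination02", "Page03", "Pagination03"]
```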
|
546
|
+
|
547
|
+
## Map Node
|
548
|
+
*MapNode* is a node that summarizes the results of scraping. This node is always a branch node in the parse tree.
|
549
|
+
|
550
|
+
### Example
|
551
|
+
|
552
|
+
```html
|
553
|
+
<!-- http://yasuri.example.net -->
|
554
|
+
<html>
|
555
|
+
<head><title>Yasuri Example</title></head>
|
556
|
+
<body>
|
557
|
+
<p>Hello,World</p>
|
558
|
+
<p>Hello,Yasuri</p>
|
559
|
+
</body>
|
560
|
+
</html>
|
561
|
+
```
|
562
|
+
|
563
|
+
```ruby
|
564
|
+
agent = Mechanize.new
|
565
|
+
page = agent.get("http://yasuri.example.net")
|
566
|
+
|
567
|
+
|
568
|
+
tree = Yasuri.map_root do
|
569
|
+
text_title '/html/head/title'
|
570
|
+
text_body_p '/html/body/p[1]'
|
571
|
+
end
|
572
|
+
|
573
|
+
tree.inject(agent, page) #=> { "title" => "Yasuri Example", "body_p" => "Hello,World" }
|
574
|
+
|
575
|
+
|
576
|
+
tree = Yasuri.map_root do
|
577
|
+
map_group1 { text_child01 '/html/body/a[1]' }
|
578
|
+
map_group2 do
|
579
|
+
text_child01 '/html/body/a[1]'
|
580
|
+
text_child03 '/html/body/a[3]'
|
581
|
+
end
|
582
|
+
end
|
583
|
+
|
584
|
+
tree.inject(agent, page) #=> {
|
585
|
+
# "group1" => {
|
586
|
+
# "child01" => "child01"
|
587
|
+
# },
|
588
|
+
# "group2" => {
|
589
|
+
# "child01" => "child01",
|
590
|
+
# "child03" => "child03"
|
591
|
+
# }
|
592
|
+
# }
|
593
|
+
```
|
594
|
+
|
595
|
+
### Options
|
596
|
+
None.
|
597
|
+
|
598
|
+
|
599
|
+
|
600
|
+
|
601
|
+
-------------------------
|
602
|
+
## Usage
|
603
|
+
|
604
|
+
#### Use as library
|
605
|
+
When used as a library, the tree can be defined in DSL, json, or yaml format.
|
606
|
+
```ruby
|
607
|
+
require 'mechanize'
|
608
|
+
require 'yasuri'
|
609
|
+
|
610
|
+
|
611
|
+
# 1. Create a parse tree.
|
612
|
+
# Define by Ruby's DSL
|
613
|
+
tree = Yasuri.links_title '/html/body/a' do
|
614
|
+
text_name '/html/body/p'
|
615
|
+
end
|
616
|
+
|
617
|
+
# Define by JSON
|
618
|
+
src = <<-EOJSON
|
619
|
+
{
|
620
|
+
"links_title": {
|
621
|
+
"path": "/html/body/a",
|
622
|
+
"text_name": "/html/body/p"
|
623
|
+
}
|
624
|
+
}
|
625
|
+
EOJSON
|
626
|
+
tree = Yasuri.json2tree(src)
|
627
|
+
|
628
|
+
|
629
|
+
# Define by YAML
|
630
|
+
src = <<-EOYAML
|
631
|
+
links_title:
|
632
|
+
path: "/html/body/a"
|
633
|
+
text_name: "/html/body/p"
|
634
|
+
EOYAML
|
635
|
+
tree = Yasuri.yaml2tree(src)
|
636
|
+
|
637
|
+
|
638
|
+
|
639
|
+
# 2. Give the Mechanize agent and the target page to start parsing
|
640
|
+
agent = Mechanize.new
|
641
|
+
page = agent.get(uri)
|
642
|
+
|
643
|
+
|
644
|
+
tree.inject(agent, page)
|
645
|
+
```
|
646
|
+
|
647
|
+
#### Use as CLI tool
|
648
|
+
|
649
|
+
**Help**
|
650
|
+
```sh
|
651
|
+
$ yasuri help scrape
|
652
|
+
Usage:
|
653
|
+
yasuri scrape <URI> [[--file <TREE_FILE>] or [--json <JSON>]]
|
654
|
+
|
655
|
+
Options:
|
656
|
+
f, [--file=FILE] # path to file that written yasuri tree as json or yaml
|
657
|
+
j, [--json=JSON] # yasuri tree format json string
|
658
|
+
|
659
|
+
Getting from <URI> and scrape it. with <JSON> or json/yml from <TREE_FILE>. They should be Yasuri's format json or yaml string.
|
660
|
+
```
|
661
|
+
|
662
|
+
In the CLI tool, you can specify the parse tree in either of the following ways.
|
663
|
+
+ `--file`, `-f` option to read the parse tree from a file written in json or yaml format.
|
664
|
+
+ `--json`, `-j` option to specify the parse tree directly as a string.
|
665
|
+
|
666
|
+
|
667
|
+
**Example of specifying a parse tree as a file**
|
668
|
+
```sh
|
669
|
+
% cat sample.yml
|
670
|
+
text_title: "/html/head/title"
|
671
|
+
text_desc: "//*[@id=\"intro\"]/p"
|
672
|
+
|
673
|
+
% yasuri scrape "https://www.ruby-lang.org/en/" --file sample.yml
|
674
|
+
{"title":"Ruby Programming Language","desc":"\n A dynamic, open source programming language with a focus on\n simplicity and productivity. It has an elegant syntax that is\n natural to read and easy to write.\n "}
|
675
|
+
|
676
|
+
% cat sample.json
|
677
|
+
{
|
678
|
+
"text_title": "/html/head/title",
|
679
|
+
"text_desc": "//*[@id=\"intro\"]/p"
|
680
|
+
}
|
681
|
+
|
682
|
+
% yasuri scrape "https://www.ruby-lang.org/en/" --file sample.json
|
683
|
+
{"title":"Ruby Programming Language","desc":"\n A dynamic, open source programming language with a focus on\n simplicity and productivity. It has an elegant syntax that is\n natural to read and easy to write.\n "}
|
684
|
+
```
|
685
|
+
|
686
|
+
Whether the file is written in json or yaml will be determined automatically.
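How the CLI detects the format is not described here. One plausible approach, shown purely as a sketch with the standard `json` and `yaml` libraries, is to try JSON first and fall back to YAML. `parse_tree_source` is a hypothetical helper, not a Yasuri function.

```ruby
require 'json'
require 'yaml'

# Hypothetical helper (not part of Yasuri): decode a tree definition that
# may be written in either JSON or YAML.
# Assumption: JSON is tried first, and YAML is the fallback.
def parse_tree_source(src)
  JSON.parse(src)
rescue JSON::ParserError
  YAML.safe_load(src)
end

yaml_src = "text_title: \"/html/head/title\"\n"
json_src = '{ "text_title": "/html/head/title" }'

puts parse_tree_source(yaml_src) == parse_tree_source(json_src)  # prints true
```

Trying JSON first matters because most JSON documents are also valid YAML, so the reverse order would never reach the JSON parser.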
|
687
|
+
|
688
|
+
**Example of specifying a parse tree directly in json**
|
689
|
+
```sh
|
690
|
+
$ yasuri scrape "https://www.ruby-lang.org/en/" -j '
|
691
|
+
{
|
692
|
+
"text_title": "/html/head/title",
|
693
|
+
"text_desc": "//*[@id=\"intro\"]/p"
|
694
|
+
}'
|
695
|
+
|
696
|
+
{"title":"Ruby Programming Language","desc":"\n A dynamic, open source programming language with a focus on\n simplicity and productivity. It has an elegant syntax that is\n natural to read and easy to write.\n "}
|
697
|
+
```
|