yasuri 2.0.11 → 3.2.0

data/USAGE.md CHANGED
@@ -1,27 +1,31 @@
1
- # Yasuri Usage
1
+ # Yasuri
2
2
 
3
3
  ## What is Yasuri
4
- `Yasuri` is an easy web-scraping library for supporting "Mechanize".
4
+ `Yasuri` (鑢) is a library for declarative web scraping and a command line tool for scraping with it. It scrapes with "[Mechanize](https://github.com/sparklemotion/mechanize)" from a simple declarative description of the expected result.
5
5
 
6
- Yasuri (鑢) is an easy web-scraping library for supporting "[Mechanize](https://github.com/sparklemotion/mechanize)".
6
+ Yasuri makes it easy to write common scraping operations.
7
+ For example, the following processes can be easily implemented.
7
8
 
8
- Yasuri can reduce frequently processes in Scraping.
9
+ + Scrape multiple texts in a page and name them into a Hash
10
+ + Open multiple links in a page and get the result of scraping each page as a Hash
11
+ + Scrape each table that appears repeatedly in the page and get the result as an array
12
+ + Scrape only the first three of the pages provided by pagination
9
13
 
10
- For example,
11
-
12
- + Open links in the page, scraping each page, and getting result as Hash.
13
- + Scraping texts in the page, and named result in Hash.
14
- + A table that repeatedly appears in a page each, scraping, get as an array.
15
- + Of each page provided by the pagination, scraping the only top 3.
16
-
17
- You can implement easy by Yasuri.
18
14
 
19
15
  ## Quick Start
20
16
 
17
+
18
+ #### Install
19
+ ```sh
20
+ # for Ruby 2.3.2
21
+ $ gem 'yasuri', '~> 2.0', '>= 2.0.13'
21
22
  ```
23
+ or
24
+ ```sh
25
+ # for Ruby 3.0.0 or later
22
26
  $ gem install yasuri
23
27
  ```
24
-
28
+ #### Use as library
25
29
  ```ruby
26
30
  require 'yasuri'
27
31
  require 'mechanize'
@@ -33,80 +37,148 @@ root = Yasuri.links_root '//*[@id="menu"]/ul/li/a' do
33
37
  end
34
38
 
35
39
  agent = Mechanize.new
36
- root_page = agent.get("http://some.scraping.page.net/")
40
+ root_page = agent.get("http://some.scraping.page.tac42.net/")
37
41
 
38
42
  result = root.inject(agent, root_page)
39
- # => [ {"title" => "PageTitle1", "content" => "Page Contents1" },
40
- # {"title" => "PageTitle2", "content" => "Page Contents2" }, ... ]
41
-
43
+ # => [
44
+ # {"title" => "PageTitle 01", "content" => "Page Contents 01" },
45
+ # {"title" => "PageTitle 02", "content" => "Page Contents 02" },
46
+ # ...
47
+ # {"title" => "PageTitle N", "content" => "Page Contents N" }
48
+ # ]
42
49
  ```
50
+
43
51
  In this example, Yasuri opens each link matched by the XPath of the LinkNode (`links_root`) and, on each linked page, scrapes the two texts matched by the XPaths of the TextNodes (`text_title` and `text_content`).
44
52
 
45
- (i.e. open each links `//*[@id="menu"]/ul/li/a` and, scrape `//*[@id="contents"]/h2` and `//*[@id="contents"]/p[1]`.)
53
+ (In other words, it opens each link matched by `//*[@id="menu"]/ul/li/a` and scrapes `//*[@id="contents"]/h2` and `//*[@id="contents"]/p[1]` from each linked page.)
46
54
 
47
- ## Basics
48
55
 
49
- 1. Construct parse tree.
50
- 2. Start parse with Mechanize agent and first page.
56
+ #### Use as CLI tool
57
+ The same thing as above can be executed as a CLI command.
51
58
 
52
- ### Construct parse tree
59
+ ```sh
60
+ $ yasuri scrape "http://some.scraping.page.tac42.net/" -j '
61
+ {
62
+ "links_root": {
63
+ "path": "//*[@id=\"menu\"]/ul/li/a",
64
+ "text_title": "//*[@id=\"contents\"]/h2",
65
+ "text_content": "//*[@id=\"contents\"]/p[1]"
66
+ }
67
+ }'
53
68
 
54
- ```ruby
55
- require 'mechanize'
56
- require 'yasuri'
69
+ [
70
+ {"title":"PageTitle 01","content":"Page Contents 01"},
71
+ {"title":"PageTitle 02","content":"Page Contents 02"},
72
+ ...,
73
+ {"title":"PageTitle N","content":"Page Contents N"}
74
+ ]
75
+ ```
57
76
 
77
+ The result can be obtained as a string in json format.
58
78
 
59
- # 1. Construct parse tree.
60
- tree = Yasuri.links_title '/html/body/a' do
61
- text_name '/html/body/p'
62
- end
79
+ ----------------------------
80
+ ## Parse Tree
63
81
 
64
- # 2. Start parse with Mechanize agent and first page.
65
- agent = Mechanize.new
66
- page = agent.get(uri)
82
+ A parse tree is tree-structured data that declaratively defines the elements to be scraped and the structure of the output.
67
83
 
84
+ A parse tree consists of nested `Node`s, each of which has `Type`, `Name`, `Path`, `Children`, and `Options` attributes, and scrapes according to its `Type`. (Note that only `MapNode` does not have `Path`.)
68
85
 
69
- tree.inject(agent, page)
86
+ The parse tree is defined in the following format:
87
+
88
+ ```ruby
89
+ # A simple tree consisting of one node
90
+ Yasuri.<Type>_<Name> <Path> [,<Options>]
91
+
92
+ # Nested tree
93
+ Yasuri.<Type>_<Name> <Path> [,<Options>] do
94
+ <Type>_<Name> <Path> [,<Options>] do
95
+ <Type>_<Name> <Path> [,<Options>]
96
+ ...
97
+ end
98
+ end
70
99
  ```
71
100
 
72
- Tree is definable by 2(+1) ways, DSL and json (and basic ruby code). In above example, DSL.
101
+ **Example**
73
102
 
74
103
  ```ruby
75
- # Construct by json.
76
- src = <<-EOJSON
77
- { "node" : "links",
78
- "name" : "title",
79
- "path" : "/html/body/a",
80
- "children" : [
81
- { "node" : "text",
82
- "name" : "name",
83
- "path" : "/html/body/p"
84
- }
85
- ]
86
- }
87
- EOJSON
88
- tree = Yasuri.json2tree(src)
89
- ```
104
+ # A simple tree consisting of one node
105
+ Yasuri.text_title '/html/head/title', truncate:/^[^,]+/
90
106
 
91
- ### Node
92
- Tree is constructed by nested Nodes.
93
- Node has `Type`, `Name`, `Path`, `Childlen`, and `Options`.
107
+ # Nested tree
108
+ Yasuri.links_root '//*[@id="menu"]/ul/li/a' do
109
+ struct_table './tr' do
110
+ text_title './td[1]'
111
+ text_pub_date './td[2]'
112
+ end
113
+ end
114
+ ```
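
The `<Type>_<Name>` convention can be illustrated without Yasuri itself. The following is a minimal sketch in plain Ruby, not Yasuri's actual implementation: the helper `split_node_method` is hypothetical, and the prefix list is an assumption taken from the method names used in this document (`text_`, `struct_`, `links_`, `pages_`, `map_`).

```ruby
# Hypothetical helper, not part of Yasuri: splits a DSL method name
# such as "text_pub_date" into its node type prefix and node name.
# The prefix list is assumed from the examples in this document.
NODE_PREFIXES = %w[text struct links pages map].freeze

def split_node_method(method_name)
  # Split only on the first underscore so names may contain underscores.
  prefix, name = method_name.to_s.split('_', 2)
  return nil unless NODE_PREFIXES.include?(prefix) && name && !name.empty?

  [prefix, name]
end

p split_node_method(:text_title)     # prints ["text", "title"]
p split_node_method(:text_pub_date)  # prints ["text", "pub_date"]
p split_node_method(:links_root)     # prints ["links", "root"]
p split_node_method(:unknown_thing)  # prints nil
```

The name part is what appears as the key in the resulting Hash, which is why `text_pub_date` yields the key `"pub_date"`.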
94
115
 
95
- Node is defined by this format.
116
+ Parse trees can be defined in Ruby DSL, JSON, or YAML.
117
+ The following is an example of the same parse tree as above, defined in each notation.
96
118
 
97
119
 
120
+ **Case of defining as Ruby DSL**
98
121
  ```ruby
99
- # Top Level
100
- Yasuri.<Type>_<Name> <Path> [,<Options>]
101
-
102
- # Nested
103
- Yasuri.<Type>_<Name> <Path> [,<Options>] do
104
- <Type>_<Name> <Path> [,<Options>] do
105
- <Children>
106
- end
122
+ Yasuri.links_title '/html/body/a' do
123
+ text_name '/html/body/p'
107
124
  end
108
125
  ```
109
126
 
127
+ **Case of defining as JSON**
128
+ ```json
129
+ {
130
+ "links_title": {
131
+ "path": "/html/body/a",
132
+ "text_name": "/html/body/p"
133
+ }
134
+ }
135
+ ```
136
+
137
+ **Case of defining as YAML**
138
+ ```yaml
139
+ links_title:
140
+ path: "/html/body/a"
141
+ text_name: "/html/body/p"
142
+ ```
143
+
144
+ **Special case of parse tree**
145
+
146
+ If there is only one element directly under the root, that element is returned directly instead of a Hash (Object).
147
+ ```json
148
+ {
149
+ "text_title": "/html/head/title",
150
+ "text_body": "/html/body"
151
+ }
152
+ # => {"title": "Welcome to yasuri!", "body": "Yasuri is ..."}
153
+
154
+ {
155
+ "text_title": "/html/head/title"
156
+ }
157
+ # => Welcome to yasuri!
158
+ ```
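
The single-element rule can be pictured with a few lines of plain Ruby. This is only a sketch of the documented behaviour, not Yasuri's code; `unwrap_single` is a hypothetical helper name.

```ruby
# Hypothetical helper: returns the only value of a one-entry Hash,
# and leaves every other result untouched.
def unwrap_single(result)
  return result unless result.is_a?(Hash) && result.size == 1

  result.values.first
end

two = { "title" => "Welcome to yasuri!", "body" => "Yasuri is ..." }
one = { "title" => "Welcome to yasuri!" }

p unwrap_single(two)  # prints the Hash unchanged, with both keys
p unwrap_single(one)  # prints "Welcome to yasuri!"
```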
159
+
160
+
161
+ In json or yaml format, an attribute can directly specify `path` as its value if it does not have any child Node. The following two json documents produce the same parse tree.
162
+
163
+ ```json
164
+ {
165
+ "text_name": "/html/body/p"
166
+ }
167
+
168
+ {
169
+ "text_name": {
170
+ "path": "/html/body/p"
171
+ }
172
+ }
173
+ ```
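
One way to think about the shorthand is as a normalization step that expands a bare string into a Hash with a `path` key. The sketch below uses only Ruby's standard `json` library; `normalize_node` is a hypothetical helper, it handles top-level entries only, and it is not how Yasuri necessarily implements this.

```ruby
require 'json'

# Hypothetical helper: expands the shorthand form (a bare path string)
# into the long form (a Hash with a "path" key).
def normalize_node(value)
  value.is_a?(String) ? { "path" => value } : value
end

short_form = JSON.parse('{ "text_name": "/html/body/p" }')
long_form  = JSON.parse('{ "text_name": { "path": "/html/body/p" } }')

normalized_short = short_form.transform_values { |v| normalize_node(v) }
normalized_long  = long_form.transform_values { |v| normalize_node(v) }

p normalized_short == normalized_long  # prints true
```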
174
+
175
+
176
+ --------------------------
177
+ ## Node
178
+
179
+ A Node is a node or leaf of the parse tree. It has `Type`, `Name`, `Path`, `Children`, and `Options`, and scrapes according to its `Type`. (Note that only `MapNode` does not have `Path`.)
180
+
181
+
110
182
  #### Type
111
183
  Type means the behavior of the Node.
112
184
 
@@ -114,17 +186,20 @@ Type meen behavior of Node.
114
186
  - *Struct*
115
187
  - *Links*
116
188
  - *Paginate*
189
+ - *Map*
190
+
191
+ See the description of each node for details.
117
192
 
118
- ### Name
193
+ #### Name
119
194
  Name is used as the key in the returned hash.
120
195
 
121
- ### Path
196
+ #### Path
122
197
  Path determines the target node by XPath or CSS selector. It is passed to Mechanize's `search`.
123
198
 
124
- ### Childlen
199
+ #### Children
125
200
  Child nodes. A TextNode always has an empty set of children, because a TextNode is a leaf.
126
201
 
127
- ### Options
202
+ #### Options
128
203
  Parse options. They differ for each type. You can get the options and their values with the `opt` method.
129
204
 
130
205
  ```ruby
@@ -136,6 +211,8 @@ node.opt #=> {:truncate => /^[^,]+/, :proc => nil}
136
211
  ## Text Node
137
212
  TextNode returns the scraped text. This node must be a leaf.
138
213
 
214
+
215
+
139
216
  ### Example
140
217
 
141
218
  ```html
@@ -155,13 +232,15 @@ page = agent.get("http://yasuri.example.net")
155
232
 
156
233
  p1 = Yasuri.text_title '/html/body/p[1]'
157
234
  p1t = Yasuri.text_title '/html/body/p[1]', truncate:/^[^,]+/
158
- p2u = Yasuri.text_title '/html/body/p[2]', proc: :upcase
235
+ p2u = Yasuri.text_title '/html/body/p[1]', proc: :upcase
159
236
 
160
- p1.inject(agent, page) #=> { "title" => "Hello,World" }
161
- p1t.inject(agent, page) #=> { "title" => "Hello" }
162
- node.inject(agent, page) #=> { "title" => "HELLO,YASURI" }
237
+ p1.inject(agent, page) #=> "Hello,World"
238
+ p1t.inject(agent, page) #=> "Hello"
239
+ p2u.inject(agent, page) #=> "HELLO,WORLD"
163
240
  ```
164
241
 
242
+ Note that if you want to scrape multiple elements in the same page at once, use `MapNode`. See the `MapNode` example for details.
243
+
165
244
  ### Options
166
245
  ##### `truncate`
167
246
  Matches the text against a regexp and truncates it to the matched part. When the regexp contains a group, only the first matched group is returned.
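
The effect of `truncate` can be reproduced with Ruby's own `Regexp#match`. The helper below is a hypothetical sketch rather than Yasuri's implementation; in particular, returning the text unchanged when nothing matches is an assumption, since this document does not describe that case.

```ruby
# Hypothetical helper mimicking the documented `truncate` behaviour.
def truncate_text(text, pattern)
  match = pattern.match(text)
  # Assumption: leave the text as-is when the pattern does not match.
  return text unless match

  # Prefer the first capture group; fall back to the whole match.
  match.captures.first || match[0]
end

p truncate_text("Hello,World", /^[^,]+/)  # prints "Hello"
p truncate_text("Hello,Yasuri", /H(.+)i/) # prints "ello,Yasur"
```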
@@ -429,3 +508,190 @@ node.inject(agent, page)
429
508
  #=> [ {"content" => "Pagination01"}, {"content" => "Pagination02"}]
430
509
  ```
431
510
  The Paginate Node opens at most the number of pages given by `limit`. In this situation the pagination has 4 pages, but the result Array has 2 texts because `limit:2` is given.
511
+
512
+ ##### `flatten`
513
+ The `flatten` option expands the results of each page into a single flat array.
514
+
515
+ ```ruby
516
+ agent = Mechanize.new
517
+ page = agent.get("http://yasuri.example.net/page01.html")
518
+
519
+ node = Yasuri.pages_root "/html/body/nav/span/a[@class='next']" do
520
+ text_title '/html/head/title'
521
+ text_content '/html/body/p'
522
+ end
523
+ node.inject(agent, page)
524
+
525
+ #=> [ {"title" => "Page01",
526
+ # "content" => "Pagination01"},
527
+ # {"title" => "Page02",
528
+ # "content" => "Pagination02"},
529
+ # {"title" => "Page03",
530
+ # "content" => "Pagination03"}]
531
+
532
+
533
+ node = Yasuri.pages_root "/html/body/nav/span/a[@class='next']" , flatten:true do
534
+ text_title '/html/head/title'
535
+ text_content '/html/body/p'
536
+ end
537
+ node.inject(agent, page)
538
+
539
+ #=> [ "Page01",
540
+ # "Pagination01",
541
+ # "Page02",
542
+ # "Pagination02",
543
+ # "Page03",
544
+ # "Pagination03"]
545
+ ```
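
The relation between the two result shapes can be sketched in plain Ruby. This only illustrates the shape of the data, under the assumption that flattening keeps the values in page order; it does not call Yasuri.

```ruby
# Per-page results as an Array of Hashes (the non-flattened shape).
per_page = [
  { "title" => "Page01", "content" => "Pagination01" },
  { "title" => "Page02", "content" => "Pagination02" },
  { "title" => "Page03", "content" => "Pagination03" }
]

# Flattened shape: the values of every page, in order, in one Array.
flattened = per_page.flat_map(&:values)

p flattened
# prints ["Page01", "Pagination01", "Page02", "Pagination02", "Page03", "Pagination03"]
```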
546
+
547
+ ## Map Node
548
+ *MapNode* is a node that summarizes the results of scraping. This node is always a branch node in the parse tree.
549
+
550
+ ### Example
551
+
552
+ ```html
553
+ <!-- http://yasuri.example.net -->
554
+ <html>
555
+ <head><title>Yasuri Example</title></head>
556
+ <body>
557
+ <p>Hello,World</p>
558
+ <p>Hello,Yasuri</p>
559
+ </body>
560
+ </html>
561
+ ```
562
+
563
+ ```ruby
564
+ agent = Mechanize.new
565
+ page = agent.get("http://yasuri.example.net")
566
+
567
+
568
+ tree = Yasuri.map_root do
569
+ text_title '/html/head/title'
570
+ text_body_p '/html/body/p[1]'
571
+ end
572
+
573
+ tree.inject(agent, page) #=> { "title" => "Yasuri Example", "body_p" => "Hello,World" }
574
+
575
+
576
+ tree = Yasuri.map_root do
577
+ map_group1 { text_child01 '/html/body/a[1]' }
578
+ map_group2 do
579
+ text_child01 '/html/body/a[1]'
580
+ text_child03 '/html/body/a[3]'
581
+ end
582
+ end
583
+
584
+ tree.inject(agent, page) #=> {
585
+ # "group1" => {
586
+ # "child01" => "child01"
587
+ # },
588
+ # "group2" => {
589
+ # "child01" => "child01",
590
+ # "child03" => "child03"
591
+ # }
592
+ # }
593
+ ```
594
+
595
+ ### Options
596
+ None.
597
+
598
+
599
+
600
+
601
+ -------------------------
602
+ ## Usage
603
+
604
+ #### Use as library
605
+ When used as a library, the tree can be defined in DSL, json, or yaml format.
606
+ ```ruby
607
+ require 'mechanize'
608
+ require 'yasuri'
609
+
610
+
611
+ # 1. Create a parse tree.
612
+ # Define by Ruby's DSL
613
+ tree = Yasuri.links_title '/html/body/a' do
614
+ text_name '/html/body/p'
615
+ end
616
+
617
+ # Define by JSON
618
+ src = <<-EOJSON
619
+ {
620
+ "links_title": {
621
+ "path": "/html/body/a",
622
+ "text_name": "/html/body/p"
623
+ }
624
+ }
625
+ EOJSON
626
+ tree = Yasuri.json2tree(src)
627
+
628
+
629
+ # Define by YAML
630
+ src = <<-EOYAML
631
+ links_title:
632
+ path: "/html/body/a"
633
+ text_name: "/html/body/p"
634
+ EOYAML
635
+ tree = Yasuri.yaml2tree(src)
636
+
637
+
638
+
639
+ # 2. Give the Mechanize agent and the target page to start parsing
640
+ agent = Mechanize.new
641
+ page = agent.get(uri)
642
+
643
+
644
+ tree.inject(agent, page)
645
+ ```
646
+
647
+ #### Use as CLI tool
648
+
649
+ **Help**
650
+ ```sh
651
+ $ yasuri help scrape
652
+ Usage:
653
+ yasuri scrape <URI> [[--file <TREE_FILE>] or [--json <JSON>]]
654
+
655
+ Options:
656
+ f, [--file=FILE] # path to file that written yasuri tree as json or yaml
657
+ j, [--json=JSON] # yasuri tree format json string
658
+
659
+ Getting from <URI> and scrape it. with <JSON> or json/yml from <TREE_FILE>. They should be Yasuri's format json or yaml string.
660
+ ```
661
+
662
+ In the CLI tool, you can specify the parse tree in either of the following ways.
663
+ + `--file`, `-f` option to read a parse tree written in json or yaml format from a file.
664
+ + `--json`, `-j` option to specify the parse tree directly as a string.
665
+
666
+
667
+ **Example of specifying a parse tree as a file**
668
+ ```sh
669
+ % cat sample.yml
670
+ text_title: "/html/head/title"
671
+ text_desc: "//*[@id=\"intro\"]/p"
672
+
673
+ % yasuri scrape "https://www.ruby-lang.org/en/" --file sample.yml
674
+ {"title":"Ruby Programming Language","desc":"\n A dynamic, open source programming language with a focus on\n simplicity and productivity. It has an elegant syntax that is\n natural to read and easy to write.\n "}
675
+
676
+ % cat sample.json
677
+ {
678
+ "text_title": "/html/head/title",
679
+ "text_desc": "//*[@id=\"intro\"]/p"
680
+ }
681
+
682
+ % yasuri scrape "https://www.ruby-lang.org/en/" --file sample.json
683
+ {"title":"Ruby Programming Language","desc":"\n A dynamic, open source programming language with a focus on\n simplicity and productivity. It has an elegant syntax that is\n natural to read and easy to write.\n "}
684
+ ```
685
+
686
+ Whether the file is written in json or yaml will be determined automatically.
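
This document does not say how the format is detected. One plausible approach, shown here purely as an assumption, is to try JSON first and fall back to YAML. The sketch uses only Ruby's standard `json` and `yaml` libraries, and `load_tree_source` is a hypothetical helper name.

```ruby
require 'json'
require 'yaml'

# Hypothetical helper: parse as JSON, and fall back to YAML when the
# text is not valid JSON. Not necessarily what the yasuri CLI does.
def load_tree_source(text)
  JSON.parse(text)
rescue JSON::ParserError
  YAML.safe_load(text)
end

yaml_src = <<~'YAML'
  text_title: "/html/head/title"
  text_desc: "//*[@id=\"intro\"]/p"
YAML

json_src = <<~'JSON'
  {
    "text_title": "/html/head/title",
    "text_desc": "//*[@id=\"intro\"]/p"
  }
JSON

p load_tree_source(yaml_src) == load_tree_source(json_src)  # prints true
```

Both inputs end up as the same Ruby Hash, so the rest of a tool built this way would not need to care which format the file used.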
687
+
688
+ **Example of specifying a parse tree directly in json**
689
+ ```sh
690
+ $ yasuri scrape "https://www.ruby-lang.org/en/" -j '
691
+ {
692
+ "text_title": "/html/head/title",
693
+ "text_desc": "//*[@id=\"intro\"]/p"
694
+ }'
695
+
696
+ {"title":"Ruby Programming Language","desc":"\n A dynamic, open source programming language with a focus on\n simplicity and productivity. It has an elegant syntax that is\n natural to read and easy to write.\n "}
697
+ ```