yasuri 2.0.11 → 3.2.0

data/USAGE.md CHANGED
@@ -1,27 +1,31 @@
1
- # Yasuri Usage
1
+ # Yasuri
2
2
 
3
3
  ## What is Yasuri
4
- `Yasuri` is an easy web-scraping library for supporting "Mechanize".
4
+ `Yasuri` (鑢) is a library for declarative web scraping and a command line tool for scraping with it. It performs scraping using "[Mechanize](https://github.com/sparklemotion/mechanize)" by simply describing the expected result in a simple declarative notation.
5
5
 
6
- Yasuri (鑢) is an easy web-scraping library for supporting "[Mechanize](https://github.com/sparklemotion/mechanize)".
6
+ Yasuri makes it easy to write common scraping operations.
7
+ For example, the following processes can be easily implemented.
7
8
 
8
- Yasuri can reduce frequently processes in Scraping.
9
+ + Scrape multiple texts in a page and name them into a Hash
10
+ + Open multiple links in a page and get the result of scraping each page as a Hash
11
+ + Scrape each table that appears repeatedly in the page and get the result as an array
12
+ + Scrape only the first three of the pages provided by pagination
9
13
 
10
- For example,
11
-
12
- + Open links in the page, scraping each page, and getting result as Hash.
13
- + Scraping texts in the page, and named result in Hash.
14
- + A table that repeatedly appears in a page each, scraping, get as an array.
15
- + Of each page provided by the pagination, scraping the only top 3.
16
-
17
- You can implement easy by Yasuri.
18
14
 
19
15
  ## Quick Start
20
16
 
17
+
18
+ #### Install
19
+ ```sh
20
+ # for Ruby 2.3.2
21
+ $ gem 'yasuri', '~> 2.0', '>= 2.0.13'
21
22
  ```
23
+ or
24
+ ```sh
25
+ # for Ruby 3.0.0 or later
22
26
  $ gem install yasuri
23
27
  ```
24
-
28
+ #### Use as library
25
29
  ```ruby
26
30
  require 'yasuri'
27
31
  require 'mechanize'
@@ -33,80 +37,148 @@ root = Yasuri.links_root '//*[@id="menu"]/ul/li/a' do
33
37
  end
34
38
 
35
39
  agent = Mechanize.new
36
- root_page = agent.get("http://some.scraping.page.net/")
40
+ root_page = agent.get("http://some.scraping.page.tac42.net/")
37
41
 
38
42
  result = root.inject(agent, root_page)
39
- # => [ {"title" => "PageTitle1", "content" => "Page Contents1" },
40
- # {"title" => "PageTitle2", "content" => "Page Contents2" }, ... ]
41
-
43
+ # => [
44
+ # {"title" => "PageTitle 01", "content" => "Page Contents 01" },
45
+ # {"title" => "PageTitle 02", "content" => "Page Contents 02" },
46
+ # ...
47
+ # {"title" => "PageTitle N", "content" => "Page Contents N" }
48
+ # ]
42
49
  ```
50
+
43
51
  This example opens each link matched by the XPath of the LinkNode (`links_root`) and, from each linked page, scrapes the two texts matched by the XPaths of the TextNodes (`text_title`, `text_content`).
44
52
 
45
- (i.e. open each links `//*[@id="menu"]/ul/li/a` and, scrape `//*[@id="contents"]/h2` and `//*[@id="contents"]/p[1]`.)
53
+ (In other words, open each link matching `//*[@id="menu"]/ul/li/a`, then scrape `//*[@id="contents"]/h2` and `//*[@id="contents"]/p[1]`.)
46
54
 
47
- ## Basics
48
55
 
49
- 1. Construct parse tree.
50
- 2. Start parse with Mechanize agent and first page.
56
+ #### Use as CLI tool
57
+ The same thing as above can be executed as a CLI command.
51
58
 
52
- ### Construct parse tree
59
+ ```sh
60
+ $ yasuri scrape "http://some.scraping.page.tac42.net/" -j '
61
+ {
62
+ "links_root": {
63
+ "path": "//*[@id=\"menu\"]/ul/li/a",
64
+ "text_title": "//*[@id=\"contents\"]/h2",
65
+ "text_content": "//*[@id=\"contents\"]/p[1]"
66
+ }
67
+ }'
53
68
 
54
- ```ruby
55
- require 'mechanize'
56
- require 'yasuri'
69
+ [
70
+ {"title":"PageTitle 01","content":"Page Contents 01"},
71
+ {"title":"PageTitle 02","content":"Page Contents 02"},
72
+ ...,
73
+ {"title":"PageTitle N","content":"Page Contents N"}
74
+ ]
75
+ ```
57
76
 
77
+ The result can be obtained as a string in json format.
58
78
 
59
- # 1. Construct parse tree.
60
- tree = Yasuri.links_title '/html/body/a' do
61
- text_name '/html/body/p'
62
- end
79
+ ----------------------------
80
+ ## Parse Tree
63
81
 
64
- # 2. Start parse with Mechanize agent and first page.
65
- agent = Mechanize.new
66
- page = agent.get(uri)
82
+ A parse tree is a tree-structured data object that declaratively defines the elements to be scraped and the structure of the output.
67
83
 
84
+ A parse tree consists of nested `Node`s, each of which has `Type`, `Name`, `Path`, `Children`, and `Options` attributes, and scrapes according to its `Type`. (Note that `MapNode` is the only node without a `Path`.)
68
85
 
69
- tree.inject(agent, page)
86
+ The parse tree is defined in the following format:
87
+
88
+ ```ruby
89
+ # A simple tree consisting of one node
90
+ Yasuri.<Type>_<Name> <Path> [,<Options>]
91
+
92
+ # Nested tree
93
+ Yasuri.<Type>_<Name> <Path> [,<Options>] do
94
+ <Type>_<Name> <Path> [,<Options>] do
95
+ <Type>_<Name> <Path> [,<Options>]
96
+ ...
97
+ end
98
+ end
70
99
  ```
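
The `<Type>_<Name>` convention in the format above can be illustrated with a small stand-alone sketch. It is not Yasuri's implementation: `split_node_method` and `NODE_NAME_PATTERN` are hypothetical names, and the list of type prefixes (`text`, `struct`, `links`, `pages`, `map`) is only inferred from the method names that appear in this document.

```ruby
# Hypothetical helper (not part of Yasuri): split a DSL method name such as
# "text_pub_date" into its node type and node name.
# The prefix list is an assumption taken from the examples in this document.
NODE_NAME_PATTERN = /\A(text|struct|links|pages|map)_(.+)\z/

def split_node_method(method_name)
  match = NODE_NAME_PATTERN.match(method_name.to_s)
  return nil unless match # not a recognised <Type>_<Name> method

  { type: match[1], name: match[2] }
end

node = split_node_method(:text_pub_date)
puts "#{node[:type]} / #{node[:name]}"
```

Everything after the first recognised prefix is treated as the name, so a name may itself contain underscores (`pub_date` here).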
71
100
 
72
- Tree is definable by 2(+1) ways, DSL and json (and basic ruby code). In above example, DSL.
101
+ **Example**
73
102
 
74
103
  ```ruby
75
- # Construct by json.
76
- src = <<-EOJSON
77
- { "node" : "links",
78
- "name" : "title",
79
- "path" : "/html/body/a",
80
- "children" : [
81
- { "node" : "text",
82
- "name" : "name",
83
- "path" : "/html/body/p"
84
- }
85
- ]
86
- }
87
- EOJSON
88
- tree = Yasuri.json2tree(src)
89
- ```
104
+ # A simple tree consisting of one node
105
+ Yasuri.text_title '/html/head/title', truncate:/^[^,]+/
90
106
 
91
- ### Node
92
- Tree is constructed by nested Nodes.
93
- Node has `Type`, `Name`, `Path`, `Childlen`, and `Options`.
107
+ # Nested tree
108
+ Yasuri.links_root '//*[@id="menu"]/ul/li/a' do
109
+ struct_table './tr' do
110
+ text_title './td[1]'
111
+ text_pub_date './td[2]'
112
+ end
113
+ end
114
+ ```
94
115
 
95
- Node is defined by this format.
116
+ Parse trees can be defined in Ruby DSL, JSON, or YAML.
117
+ The following shows the same parse tree defined in each of the three notations.
96
118
 
97
119
 
120
+ **Case of defining as Ruby DSL**
98
121
  ```ruby
99
- # Top Level
100
- Yasuri.<Type>_<Name> <Path> [,<Options>]
101
-
102
- # Nested
103
- Yasuri.<Type>_<Name> <Path> [,<Options>] do
104
- <Type>_<Name> <Path> [,<Options>] do
105
- <Children>
106
- end
122
+ Yasuri.links_title '/html/body/a' do
123
+ text_name '/html/body/p'
107
124
  end
108
125
  ```
109
126
 
127
+ **Case of defining as JSON**
128
+ ```json
129
+ {
130
+ "links_title": {
131
+ "path": "/html/body/a",
132
+ "text_name": "/html/body/p"
133
+ }
134
+ }
135
+ ```
136
+
137
+ **Case of defining as YAML**
138
+ ```yaml
139
+ links_title:
140
+ path: "/html/body/a"
141
+ text_name: "/html/body/p"
142
+ ```
143
+
144
+ **Special case of parse tree**
145
+
146
+ If there is only one element directly under the root, the result is that element itself rather than a Hash (Object).
147
+ ```json
148
+ {
149
+ "text_title": "/html/head/title",
150
+ "text_body": "/html/body"
151
+ }
152
+ # => {"title": "Welcome to yasuri!", "body": "Yasuri is ..."}
153
+
154
+ {
155
+ "text_title": "/html/head/title"
156
+ }
157
+ # => Welcome to yasuri!
158
+ ```
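
The single-element rule above can be pictured with a short plain-Ruby sketch. `unwrap_single` is a hypothetical helper written only for illustration; it mimics the documented behaviour and makes no claim about how Yasuri implements it internally.

```ruby
# Hypothetical helper, not part of Yasuri: a result with exactly one
# top-level element is returned as that element, otherwise unchanged.
def unwrap_single(result)
  return result unless result.is_a?(Hash) && result.size == 1

  result.values.first
end

two = { "title" => "Welcome to yasuri!", "body" => "Yasuri is ..." }
one = { "title" => "Welcome to yasuri!" }

puts unwrap_single(two).class # Hash
puts unwrap_single(one)       # Welcome to yasuri!
```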
159
+
160
+
161
+ In JSON or YAML format, an attribute can take the `path` directly as its value if the node has no child nodes. The following two JSON documents produce the same parse tree.
162
+
163
+ ```json
164
+ {
165
+ "text_name": "/html/body/p"
166
+ }
167
+
168
+ {
169
+ "text_name": {
170
+ "path": "/html/body/p"
171
+ }
172
+ }
173
+ ```
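
To make the equivalence of the two forms concrete, here is a minimal sketch that expands the shorthand into the long form using only Ruby's standard `json` library. `expand_shorthand` and `NODE_KEY` are hypothetical names, and the set of node prefixes is an assumption inferred from this document; Yasuri's own loader may work differently.

```ruby
require 'json'

# Assumed node-key prefixes, inferred from the examples in this document.
NODE_KEY = /\A(text|struct|links|pages|map)_/

# Hypothetical helper: rewrite `"text_name": "<path>"` into
# `"text_name": { "path": "<path>" }`, recursing into nested nodes.
def expand_shorthand(tree)
  tree.to_h do |key, value|
    if key.match?(NODE_KEY) && value.is_a?(String)
      [key, { "path" => value }]
    elsif value.is_a?(Hash)
      [key, expand_shorthand(value)]
    else
      [key, value] # "path" itself and any option values stay as they are
    end
  end
end

short = JSON.parse('{ "text_name": "/html/body/p" }')
long  = JSON.parse('{ "text_name": { "path": "/html/body/p" } }')

puts expand_shorthand(short) == expand_shorthand(long) # true
```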
174
+
175
+
176
+ --------------------------
177
+ ## Node
178
+
179
+ A Node is a branch or leaf of the parse tree. It has `Type`, `Name`, `Path`, `Children`, and `Options`, and scrapes according to its `Type`. (Note that `MapNode` is the only node without a `Path`.)
180
+
181
+
110
182
  #### Type
111
183
  Type determines the behavior of the Node.
112
184
 
@@ -114,17 +186,20 @@ Type meen behavior of Node.
114
186
  - *Struct*
115
187
  - *Links*
116
188
  - *Paginate*
189
+ - *Map*
190
+
191
+ See the description of each node for details.
117
192
 
118
- ### Name
193
+ #### Name
119
194
  Name is used as the key in the returned hash.
120
195
 
121
- ### Path
196
+ #### Path
122
197
  Path determines the target node by XPath or CSS selector. It is passed to Mechanize's `search`.
123
198
 
124
- ### Childlen
199
+ #### Children
125
200
  Child nodes. A TextNode always has an empty set of children, because a TextNode is a leaf.
126
201
 
127
- ### Options
202
+ #### Options
128
203
  Parse options. They differ for each type. You can get the options and their values with the `opt` method.
129
204
 
130
205
  ```ruby
@@ -136,6 +211,8 @@ node.opt #=> {:truncate => /^[^,]+/, :proc => nil}
136
211
  ## Text Node
137
212
  A TextNode returns the scraped text. This node must be a leaf.
138
213
 
214
+
215
+
139
216
  ### Example
140
217
 
141
218
  ```html
@@ -155,13 +232,15 @@ page = agent.get("http://yasuri.example.net")
155
232
 
156
233
  p1 = Yasuri.text_title '/html/body/p[1]'
157
234
  p1t = Yasuri.text_title '/html/body/p[1]', truncate:/^[^,]+/
158
- p2u = Yasuri.text_title '/html/body/p[2]', proc: :upcase
235
+ p2u = Yasuri.text_title '/html/body/p[1]', proc: :upcase
159
236
 
160
- p1.inject(agent, page) #=> { "title" => "Hello,World" }
161
- p1t.inject(agent, page) #=> { "title" => "Hello" }
162
- node.inject(agent, page) #=> { "title" => "HELLO,YASURI" }
237
+ p1.inject(agent, page) #=> "Hello,World"
238
+ p1t.inject(agent, page) #=> "Hello"
239
+ p2u.inject(agent, page) #=> "HELLO,WORLD"
163
240
  ```
164
241
 
242
+ Note that if you want to scrape multiple elements in the same page at once, use `MapNode`. See the `MapNode` example for details.
243
+
165
244
  ### Options
166
245
  ##### `truncate`
167
246
  Matches the text against a regexp and truncates it to the matched part. When the regexp contains a group, only the first matched group is returned.
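
The rule described for `truncate` can be approximated in plain Ruby as follows. `truncate_text` is a hypothetical helper, not Yasuri's code; in particular, returning `nil` when nothing matches is an assumption made for this sketch, since the text above does not say what happens in that case.

```ruby
# Hypothetical helper (not part of Yasuri): keep the first captured group
# if the regexp has one, otherwise keep the whole match.
def truncate_text(text, regexp)
  match = regexp.match(text)
  return nil unless match # assumption: behaviour on no match is unspecified

  match[1] || match[0]
end

puts truncate_text("Hello,World", /^[^,]+/)  # Hello
puts truncate_text("Hello,World", /,(.+)$/)  # World
```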
@@ -429,3 +508,190 @@ node.inject(agent, page)
429
508
  #=> [ {"content" => "Pagination01"}, {"content" => "Pagination02"}]
430
509
  ```
431
510
  The Paginate Node opens up to the number of pages given by `limit`, here 2. In this situation the pagination has 4 pages, but the result Array has 2 texts because `limit:2` was given.
511
+
512
+ ##### `flatten`
513
+ The `flatten` option expands the results of each page into a single flat array.
514
+
515
+ ```ruby
516
+ agent = Mechanize.new
517
+ page = agent.get("http://yasuri.example.net/page01.html")
518
+
519
+ node = Yasuri.pages_root "/html/body/nav/span/a[@class='next']" do
520
+ text_title '/html/head/title'
521
+ text_content '/html/body/p'
522
+ end
523
+ node.inject(agent, page)
524
+
525
+ #=> [ {"title" => "Page01",
526
+ "content" => "Pagination01"},
527
+ {"title" => "Page02",
528
+ "content" => "Pagination02"},
529
+ {"title" => "Page03",
530
+ "content" => "Pagination03"}]
531
+
532
+
533
+ node = Yasuri.pages_root "/html/body/nav/span/a[@class='next']" , flatten:true do
534
+ text_title '/html/head/title'
535
+ text_content '/html/body/p'
536
+ end
537
+ node.inject(agent, page)
538
+
539
+ #=> [ "Page01",
540
+ "Pagination01",
541
+ "Page02",
542
+ "Pagination02",
543
+ "Page03",
544
+ "Pagination03"]
545
+ ```
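
The relationship between the two result shapes can be reproduced with plain Ruby, without calling Yasuri at all. The sketch assumes that `flatten` simply concatenates the values of each page's Hash in page order, which is inferred from the example output rather than from the library source.

```ruby
# Per-page results, shaped like the non-flattened output
# (sample data only, no scraping involved).
pages = [
  { "title" => "Page01", "content" => "Pagination01" },
  { "title" => "Page02", "content" => "Pagination02" },
  { "title" => "Page03", "content" => "Pagination03" }
]

# Assumed meaning of flatten: drop the keys and join the values in order.
flattened = pages.flat_map(&:values)

puts flattened.length # 6
puts flattened.first  # Page01
```

Note that the keys (`title`, `content`) are lost in the flat form, so it suits cases where only the ordered values matter.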
546
+
547
+ ## Map Node
548
+ *MapNode* is a node that summarizes the results of scraping. This node is always a branch node in the parse tree.
549
+
550
+ ### Example
551
+
552
+ ```html
553
+ <!-- http://yasuri.example.net -->
554
+ <html>
555
+ <head><title>Yasuri Example</title></head>
556
+ <body>
557
+ <p>Hello,World</p>
558
+ <p>Hello,Yasuri</p>
559
+ </body>
560
+ </html>
561
+ ```
562
+
563
+ ```ruby
564
+ agent = Mechanize.new
565
+ page = agent.get("http://yasuri.example.net")
566
+
567
+
568
+ tree = Yasuri.map_root do
569
+ text_title '/html/head/title'
570
+ text_body_p '/html/body/p[1]'
571
+ end
572
+
573
+ tree.inject(agent, page) #=> { "title" => "Yasuri Example", "body_p" => "Hello,World" }
574
+
575
+
576
+ tree = Yasuri.map_root do
577
+ map_group1 { text_child01 '/html/body/a[1]' }
578
+ map_group2 do
579
+ text_child01 '/html/body/a[1]'
580
+ text_child03 '/html/body/a[3]'
581
+ end
582
+ end
583
+
584
+ tree.inject(agent, page) #=> {
585
+ # "group1" => {
586
+ # "child01" => "child01"
587
+ # },
588
+ # "group2" => {
589
+ # "child01" => "child01",
590
+ # "child03" => "child03"
591
+ # }
592
+ # }
593
+ ```
594
+
595
+ ### Options
596
+ None.
597
+
598
+
599
+
600
+
601
+ -------------------------
602
+ ## Usage
603
+
604
+ #### Use as library
605
+ When used as a library, the tree can be defined in DSL, json, or yaml format.
606
+ ```ruby
607
+ require 'mechanize'
608
+ require 'yasuri'
609
+
610
+
611
+ # 1. Create a parse tree.
612
+ # Define by Ruby's DSL
613
+ tree = Yasuri.links_title '/html/body/a' do
614
+ text_name '/html/body/p'
615
+ end
616
+
617
+ # Define by JSON
618
+ src = <<-EOJSON
619
+ {
620
+ "links_title": {
621
+ "path": "/html/body/a",
622
+ "text_name": "/html/body/p"
623
+ }
624
+ }
625
+ EOJSON
626
+ tree = Yasuri.json2tree(src)
627
+
628
+
629
+ # Define by YAML
630
+ src = <<-EOYAML
631
+ links_title:
632
+ path: "/html/body/a"
633
+ text_name: "/html/body/p"
634
+ EOYAML
635
+ tree = Yasuri.yaml2tree(src)
636
+
637
+
638
+
639
+ # 2. Give the Mechanize agent and the target page to start parsing
640
+ agent = Mechanize.new
641
+ page = agent.get(uri)
642
+
643
+
644
+ tree.inject(agent, page)
645
+ ```
646
+
647
+ #### Use as CLI tool
648
+
649
+ **Help**
650
+ ```sh
651
+ $ yasuri help scrape
652
+ Usage:
653
+ yasuri scrape <URI> [[--file <TREE_FILE>] or [--json <JSON>]]
654
+
655
+ Options:
656
+ f, [--file=FILE] # path to file that written yasuri tree as json or yaml
657
+ j, [--json=JSON] # yasuri tree format json string
658
+
659
+ Getting from <URI> and scrape it. with <JSON> or json/yml from <TREE_FILE>. They should be Yasuri's format json or yaml string.
660
+ ```
661
+
662
+ In the CLI tool, you can specify the parse tree in either of the following ways.
663
+ + The `--file`, `-f` option reads a parse tree written to a file in JSON or YAML format.
664
+ + The `--json`, `-j` option specifies the parse tree directly as a JSON string.
665
+
666
+
667
+ **Example of specifying a parse tree as a file**
668
+ ```sh
669
+ % cat sample.yml
670
+ text_title: "/html/head/title"
671
+ text_desc: "//*[@id=\"intro\"]/p"
672
+
673
+ % yasuri scrape "https://www.ruby-lang.org/en/" --file sample.yml
674
+ {"title":"Ruby Programming Language","desc":"\n A dynamic, open source programming language with a focus on\n simplicity and productivity. It has an elegant syntax that is\n natural to read and easy to write.\n "}
675
+
676
+ % cat sample.json
677
+ {
678
+ "text_title": "/html/head/title",
679
+ "text_desc": "//*[@id=\"intro\"]/p"
680
+ }
681
+
682
+ % yasuri scrape "https://www.ruby-lang.org/en/" --file sample.json
683
+ {"title":"Ruby Programming Language","desc":"\n A dynamic, open source programming language with a focus on\n simplicity and productivity. It has an elegant syntax that is\n natural to read and easy to write.\n "}
684
+ ```
685
+
686
+ Whether the file is written in json or yaml will be determined automatically.
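
One plausible way to detect the format automatically is to try JSON first and fall back to YAML. The sketch below uses only Ruby's standard `json` and `yaml` libraries; `parse_tree_source` is a hypothetical name, and this is not necessarily how the `yasuri` command decides.

```ruby
require 'json'
require 'yaml'

# Hypothetical helper: return the detected format and the parsed Hash.
# Assumption: anything that is not valid JSON is treated as YAML.
def parse_tree_source(src)
  [:json, JSON.parse(src)]
rescue JSON::ParserError
  [:yaml, YAML.safe_load(src)]
end

json_src = '{ "text_title": "/html/head/title" }'
yaml_src = <<~'YAML'
  text_title: "/html/head/title"
YAML

format, tree = parse_tree_source(json_src)
puts "#{format}: #{tree['text_title']}" # json: /html/head/title

format, tree = parse_tree_source(yaml_src)
puts "#{format}: #{tree['text_title']}" # yaml: /html/head/title
```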
687
+
688
+ **Example of specifying a parse tree directly in json**
689
+ ```sh
690
+ $ yasuri scrape "https://www.ruby-lang.org/en/" -j '
691
+ {
692
+ "text_title": "/html/head/title",
693
+ "text_desc": "//*[@id=\"intro\"]/p"
694
+ }'
695
+
696
+ {"title":"Ruby Programming Language","desc":"\n A dynamic, open source programming language with a focus on\n simplicity and productivity. It has an elegant syntax that is\n natural to read and easy to write.\n "}
697
+ ```