yasuri 1.9.11 → 3.0.0

Sign up to get free protection for your applications and to get access to all the features.
checksums.yaml CHANGED
@@ -1,7 +1,7 @@
1
1
  ---
2
- SHA1:
3
- metadata.gz: 28e6a3903cec3d8036b718a7c5de4fb8df8dfbfa
4
- data.tar.gz: fbfb4c2b3a042410a7d05b1604e004bdffd82779
2
+ SHA256:
3
+ metadata.gz: 7f360d6efb02954a5a54e2fc308d0cd0c2e5c129c52eba727fb0dfe4a40ce502
4
+ data.tar.gz: 8d8805a55c7ce16c76eb50945b954ad19327a3a63183eca098dac6ac93d2203b
5
5
  SHA512:
6
- metadata.gz: e4312623046ecf451ef261b1d99164b7f79b861ad8d7fe49d7ba8fd2319cd840978ae88235e6fb10b6991cd5131ebec966ec56e3bddece8fbbefd0b53d4a9dfe
7
- data.tar.gz: 54639066aa4511309a1f712b8920c81c09a7489e0cab6a91c453d2aeca07d4f1c3bed167ed25746411d5f87fa9cc9fff8c6a32567c4df8d1990f12cb053286e2
6
+ metadata.gz: ffe02aee78de5f30f1e583b2aca8c0617324bdbf62d7c64e371e90d139bac8b1d26df23e9725df0b81b946c6a465283f88a7d51945872c56e7be892eac1b5e4e
7
+ data.tar.gz: c8983dc2cd283c7de0d97357d2a8164426ee3e1017e73c498c0676716a1c9ab4c42cc02a836bf7e559877d50ca23df6fa656c0197b5018a4881997e2fb4c57d0
@@ -0,0 +1,35 @@
1
+ # This workflow uses actions that are not certified by GitHub.
2
+ # They are provided by a third-party and are governed by
3
+ # separate terms of service, privacy policy, and support
4
+ # documentation.
5
+ # This workflow will download a prebuilt Ruby version, install dependencies and run tests with Rake
6
+ # For more information see: https://github.com/marketplace/actions/setup-ruby-jruby-and-truffleruby
7
+
8
+ name: Ruby
9
+
10
+ on:
11
+ push:
12
+ branches: [ master ]
13
+ pull_request:
14
+ branches: [ master ]
15
+
16
+ jobs:
17
+ test:
18
+
19
+ runs-on: ubuntu-latest
20
+ strategy:
21
+ matrix:
22
+ ruby-version: ['2.6', '2.7', '3.0']
23
+
24
+ steps:
25
+ - uses: actions/checkout@v2
26
+ - name: Set up Ruby
27
+ # To automatically get bug fixes and new Ruby versions for ruby/setup-ruby,
28
+ # change this to (see https://github.com/ruby/setup-ruby#versioning):
29
+ # uses: ruby/setup-ruby@v1
30
+ uses: ruby/setup-ruby@473e4d8fe5dd94ee328fdfca9f8c9c7afc9dae5e
31
+ with:
32
+ ruby-version: ${{ matrix.ruby-version }}
33
+ bundler-cache: true # runs 'bundle install' and caches installed gems automatically
34
+ - name: Run tests
35
+ run: bundle exec rake
data/.gitignore CHANGED
@@ -66,5 +66,4 @@ tramp
66
66
  # cask packages
67
67
  .cask/
68
68
 
69
- .ruby-version
70
- Gemfile.lock
69
+ Gemfile.lock
data/.ruby-version ADDED
@@ -0,0 +1 @@
1
+ 3.0.0
data/.travis.yml CHANGED
@@ -1,9 +1,7 @@
1
1
  language: ruby
2
- rvm:
3
- - 2.2.0
4
2
  script:
5
3
  - ruby --version
6
4
  - rspec spec
7
5
  addons:
8
6
  code_climate:
9
- repo_token: 0dc78d33107a7f11f257c0218ac1a37e0073005bb9734f2fd61d0f7e803fc151
7
+ repo_token: 0dc78d33107a7f11f257c0218ac1a37e0073005bb9734f2fd61d0f7e803fc151
data/README.md CHANGED
@@ -1,4 +1,5 @@
1
- # Yasuri [![Build Status](https://travis-ci.org/tac0x2a/yasuri.svg?branch=master)](https://travis-ci.org/tac0x2a/yasuri) [![Coverage Status](https://coveralls.io/repos/tac0x2a/yasuri/badge.svg?branch=master)](https://coveralls.io/r/tac0x2a/yasuri?branch=master) [![Code Climate](https://codeclimate.com/github/tac0x2a/yasuri/badges/gpa.svg)](https://codeclimate.com/github/tac0x2a/yasuri)
1
+ # Yasuri
2
+ [![Build Status](https://travis-ci.org/tac0x2a/yasuri.svg?branch=master)](https://travis-ci.org/tac0x2a/yasuri) [![Coverage Status](https://coveralls.io/repos/tac0x2a/yasuri/badge.svg?branch=master)](https://coveralls.io/r/tac0x2a/yasuri?branch=master) [![Maintainability](https://api.codeclimate.com/v1/badges/c29480fea1305afe999f/maintainability)](https://codeclimate.com/github/tac0x2a/yasuri/maintainability)
2
3
 
3
4
  Yasuri (鑢) is an easy web-scraping library for supporting "[Mechanize](https://github.com/sparklemotion/mechanize)".
4
5
 
@@ -52,6 +53,19 @@ root = Yasuri.links_root '//*[@id="menu"]/ul/li/a' do
52
53
  text_content '//*[@id="contents"]/p[1]'
53
54
  end
54
55
 
56
+
57
+ # Node tree constructing by YAML
58
+ src = <<-EOYAML
59
+ root:
60
+ node: links
61
+ path: "//*[@id='menu']/ul/li/a"
62
+ children:
63
+ - title: { node: text, path: "//*[@id='contents']/h2" }
64
+ - content: { node: text, path: "//*[@id='contents']/p[1]" }
65
+ EOYAML
66
+ root = Yasuri.yaml2tree(src)
67
+
68
+
55
69
  # Node tree constructing by JSON
56
70
  src = <<-EOJSON
57
71
  { "node" : "links",
@@ -78,6 +92,17 @@ result = root.inject(agent, root_page)
78
92
  # => [ {"title" => "PageTitle", "content" => "Page Contents" }, ... ]
79
93
  ```
80
94
 
95
+ ## Dev
96
+ ```sh
97
+ $ gem install bundler
98
+ $ bundle install
99
+ ```
100
+ ### Test
101
+ ```sh
102
+ $ rake
103
+ # or
104
+ $ rspec spec/*spec.rb
105
+ ```
81
106
 
82
107
  ## Contributing
83
108
 
data/USAGE.ja.md CHANGED
@@ -67,7 +67,7 @@ page = agent.get(uri)
67
67
  tree.inject(agent, page)
68
68
  ```
69
69
 
70
- ツリーは、DSLまたはjsonで定義することができます.上の例ではDSLで定義しています.
70
+ ツリーは、json,yaml,またはDSLで定義することができます.上の例ではDSLで定義しています.
71
71
  以下は、jsonで上記と等価な解析ツリーを定義した例です.
72
72
 
73
73
  ```ruby
@@ -87,6 +87,19 @@ EOJSON
87
87
  tree = Yasuri.json2tree(src)
88
88
  ```
89
89
 
90
+ ```ruby
91
+ # yaml で構成する場合
92
+ src = <<-EOYAML
93
+ title:
94
+ node: links
95
+ path: "/html/body/a"
96
+ children:
97
+ - name:
98
+ node: text
99
+ path: "/html/body/p"
100
+ EOYAML
101
+ tree = Yasuri.yaml2tree(src)
102
+ ```
90
103
 
91
104
  ### Node
92
105
  ツリーは入れ子になった *Node* で構成されます.
@@ -431,3 +444,38 @@ node.inject(agent, page)
431
444
  #=> [ {"content" => "Pagination01"}, {"content" => "Pagination02"}]
432
445
  ```
433
446
  この場合、PaginateNode は最大2つまでのページを開いてパースします.ページネーションは4つのページを持っているようですが、`limit:2`が指定されているため、結果の配列には2つの結果のみが含まれています.
447
+
448
+ ##### `flatten`
449
+ 取得した各ページの結果を展開します.
450
+
451
+ ```ruby
452
+ agent = Mechanize.new
453
+ page = agent.get("http://yasuri.example.net/page01.html")
454
+
455
+ node = Yasuri.pages_root "/html/body/nav/span/a[@class='next']" , flatten:true do
456
+ text_title '/html/head/title'
457
+ text_content '/html/body/p'
458
+ end
459
+ node.inject(agent, page)
460
+
461
+ #=> [ {"title" => "Page01",
462
+ "content" => "Patination01"},
463
+ {"title" => "Page01",
464
+ "content" => "Patination02"},
465
+ {"title" => "Page01",
466
+ "content" => "Patination03"}]
467
+
468
+
469
+ node = Yasuri.pages_root "/html/body/nav/span/a[@class='next']" , flatten:true do
470
+ text_title '/html/head/title'
471
+ text_content '/html/body/p'
472
+ end
473
+ node.inject(agent, page)
474
+
475
+ #=> [ "Page01",
476
+ "Patination01",
477
+ "Page02",
478
+ "Patination02",
479
+ "Page03",
480
+ "Patination03"]
481
+ ```
data/USAGE.md CHANGED
@@ -69,7 +69,7 @@ page = agent.get(uri)
69
69
  tree.inject(agent, page)
70
70
  ```
71
71
 
72
- Tree is definable by 2(+1) ways, DSL and json (and basic ruby code). In above example, DSL.
72
+ Tree is definable by 3(+1) ways, json, yaml, and DSL (or basic ruby code). In above example, DSL.
73
73
 
74
74
  ```ruby
75
75
  # Construct by json.
@@ -88,6 +88,21 @@ EOJSON
88
88
  tree = Yasuri.json2tree(src)
89
89
  ```
90
90
 
91
+ ```ruby
92
+ # Construct by yaml.
93
+ src = <<-EOYAML
94
+ title:
95
+ node: links
96
+ path: "/html/body/a"
97
+ children:
98
+ - name:
99
+ node: text
100
+ path: "/html/body/p"
101
+ EOYAML
102
+ tree = Yasuri.yaml2tree(src)
103
+ ```
104
+
105
+
91
106
  ### Node
92
107
  Tree is constructed by nested Nodes.
93
108
  Node has `Type`, `Name`, `Path`, `Childlen`, and `Options`.
@@ -429,3 +444,38 @@ node.inject(agent, page)
429
444
  #=> [ {"content" => "Pagination01"}, {"content" => "Pagination02"}]
430
445
  ```
431
446
  Paginate Node open upto 2 given by `limit`. In this situation, pagination has 4 pages, but result Array has 2 texts because given `limit:2`.
447
+
448
+ ##### `flatten`
449
+ `flatten` option expands each page results.
450
+
451
+ ```ruby
452
+ agent = Mechanize.new
453
+ page = agent.get("http://yasuri.example.net/page01.html")
454
+
455
+ node = Yasuri.pages_root "/html/body/nav/span/a[@class='next']" , flatten:true do
456
+ text_title '/html/head/title'
457
+ text_content '/html/body/p'
458
+ end
459
+ node.inject(agent, page)
460
+
461
+ #=> [ {"title" => "Page01",
462
+ "content" => "Patination01"},
463
+ {"title" => "Page01",
464
+ "content" => "Patination02"},
465
+ {"title" => "Page01",
466
+ "content" => "Patination03"}]
467
+
468
+
469
+ node = Yasuri.pages_root "/html/body/nav/span/a[@class='next']" , flatten:true do
470
+ text_title '/html/head/title'
471
+ text_content '/html/body/p'
472
+ end
473
+ node.inject(agent, page)
474
+
475
+ #=> [ "Page01",
476
+ "Patination01",
477
+ "Page02",
478
+ "Patination02",
479
+ "Page03",
480
+ "Patination03"]
481
+ ```
@@ -1,3 +1,3 @@
1
1
  module Yasuri
2
- VERSION = "1.9.11"
2
+ VERSION = "3.0.0"
3
3
  end
data/lib/yasuri/yasuri.rb CHANGED
@@ -4,6 +4,7 @@
4
4
 
5
5
  require 'mechanize'
6
6
  require 'json'
7
+ require 'yaml'
7
8
 
8
9
  require_relative 'yasuri_node'
9
10
  require_relative 'yasuri_text_node'
@@ -23,9 +24,39 @@ module Yasuri
23
24
  Yasuri.node2hash(node).to_json
24
25
  end
25
26
 
26
- def self.method_missing(name, *args, &block)
27
- generated = Yasuri::NodeGenerator.gen(name, *args, &block)
28
- generated || super(name, args)
27
+ def self.yaml2tree(yaml_string)
28
+ raise RuntimeError if yaml_string.nil? or yaml_string.empty?
29
+
30
+ yaml = YAML.load(yaml_string)
31
+ raise RuntimeError if yaml.keys.size < 1
32
+
33
+ root_key, root = yaml.keys.first, yaml.values.first
34
+ hash = Yasuri.yaml2tree_sub(root_key, root)
35
+
36
+ Yasuri.hash2node(hash)
37
+ end
38
+
39
+ private
40
+ def self.yaml2tree_sub(name, body)
41
+ return nil if name.nil? or body.nil?
42
+
43
+ new_body = Hash[:name, name]
44
+ body.each{|k,v| new_body[k.to_sym] = v}
45
+ body = new_body
46
+
47
+ return body if body[:children].nil?
48
+
49
+ body[:children] = body[:children].map do |c|
50
+ k, b = c.keys.first, c.values.first
51
+ Yasuri.yaml2tree_sub(k, b)
52
+ end
53
+
54
+ body
55
+ end
56
+
57
+ def self.method_missing(node_name, pattern, **opt, &block)
58
+ generated = Yasuri::NodeGenerator.gen(node_name, pattern, **opt, &block)
59
+ generated || super(node_name, **opt)
29
60
  end
30
61
 
31
62
  private
@@ -37,16 +68,16 @@ module Yasuri
37
68
  }
38
69
  Node2Text = Text2Node.invert
39
70
 
40
- ReservedKeys = [:node, :name, :path, :children]
71
+ ReservedKeys = %i|node name path children|
41
72
  def self.hash2node(node_h)
42
73
  node, name, path, children = ReservedKeys.map do |key|
43
74
  node_h[key]
44
75
  end
45
76
  children ||= []
46
77
 
47
- fail "Not found 'node' value in json" if node.nil?
48
- fail "Not found 'name' value in json" if name.nil?
49
- fail "Not found 'path' value in json" if path.nil?
78
+ fail "Not found 'node' value in map" if node.nil?
79
+ fail "Not found 'name' value in map" if name.nil?
80
+ fail "Not found 'path' value in map" if path.nil?
50
81
 
51
82
  childnodes = children.map{|c| Yasuri.hash2node(c) }
52
83
  ReservedKeys.each{|key| node_h.delete(key)}
@@ -54,7 +85,7 @@ module Yasuri
54
85
 
55
86
  klass = Text2Node[node.to_sym]
56
87
  fail "Undefined node type #{node}" if klass.nil?
57
- klass.new(path, name, childnodes, opt)
88
+ klass.new(path, name, childnodes, **opt)
58
89
  end
59
90
 
60
91
  def self.node2hash(node)
@@ -78,8 +109,8 @@ module Yasuri
78
109
  json
79
110
  end
80
111
 
81
- def self.NodeName(name, hash = {})
82
- symbolize_names = hash[:symbolize_names] || false
112
+ def self.NodeName(name, opt)
113
+ symbolize_names = opt[:symbolize_names]
83
114
  symbolize_names ? name.to_sym : name
84
115
  end
85
116
 
@@ -6,10 +6,10 @@ require_relative 'yasuri_node'
6
6
  module Yasuri
7
7
  class LinksNode
8
8
  include Node
9
- def inject(agent, page, opt = {})
9
+ def inject(agent, page, opt = {}, element = page)
10
10
  retry_count = opt[:retry_count] || 5
11
11
 
12
- links = page.search(@xpath) || [] # links expected
12
+ links = element.search(@xpath) || [] # links expected
13
13
  links.map do |link|
14
14
  link_button = Mechanize::Page::Link.new(link, agent, page)
15
15
  child_page = Yasuri.with_retry(retry_count) { link_button.click }
@@ -7,11 +7,11 @@ module Yasuri
7
7
  module Node
8
8
  attr_reader :url, :xpath, :name, :children
9
9
 
10
- def initialize(xpath, name, children = [], opt = {})
10
+ def initialize(xpath, name, children = [], opt: {})
11
11
  @xpath, @name, @children = xpath, name, children
12
12
  end
13
13
 
14
- def inject(agent, page, opt = {})
14
+ def inject(agent, page, opt = {}, element = page)
15
15
  fail "#{Kernel.__method__} is not implemented."
16
16
  end
17
17
  def opts
@@ -15,26 +15,24 @@ module Yasuri
15
15
  @nodes
16
16
  end
17
17
 
18
- def method_missing(name, *args, &block)
19
- node = NodeGenerator.gen(name, *args, &block)
18
+ def method_missing(name, pattern, **args, &block)
19
+ node = NodeGenerator.gen(name, pattern, **args, &block)
20
20
  raise "Undefined Node Name '#{name}'" if node == nil
21
21
  @nodes << node
22
22
  end
23
23
 
24
- def self.gen(name, *args, &block)
25
- xpath, opt = *args
26
- opt = [opt].flatten.compact
24
+ def self.gen(name, xpath, **opt, &block)
27
25
  children = Yasuri::NodeGenerator.new.gen_recursive(&block) if block_given?
28
26
 
29
27
  case name
30
28
  when /^text_(.+)$/
31
- Yasuri::TextNode.new(xpath, $1, children || [], *opt)
29
+ Yasuri::TextNode.new(xpath, $1, children || [], **opt)
32
30
  when /^struct_(.+)$/
33
- Yasuri::StructNode.new(xpath, $1, children || [], *opt)
31
+ Yasuri::StructNode.new(xpath, $1, children || [], **opt)
34
32
  when /^links_(.+)$/
35
- Yasuri::LinksNode.new(xpath, $1, children || [], *opt)
33
+ Yasuri::LinksNode.new(xpath, $1, children || [], **opt)
36
34
  when /^pages_(.+)$/
37
- Yasuri::PaginateNode.new(xpath, $1, children || [], *opt)
35
+ Yasuri::PaginateNode.new(xpath, $1, children || [], **opt)
38
36
  else
39
37
  nil
40
38
  end
@@ -7,14 +7,17 @@ module Yasuri
7
7
  class PaginateNode
8
8
  include Node
9
9
 
10
- def initialize(xpath, name, children = [], hash = {})
10
+ def initialize(xpath, name, children = [], limit: nil, flatten: false)
11
11
  super(xpath, name, children)
12
- @limit = hash[:limit]
12
+ @flatten = flatten
13
+ @limit = limit
13
14
  end
14
15
 
15
- def inject(agent, page, opt = {})
16
+ def inject(agent, page, opt = {}, element = page)
16
17
  retry_count = opt[:retry_count] || 5
17
18
 
19
+ raise NotImplementedError.new("PagenateNode inside StructNode, Not Supported") if page != element
20
+
18
21
  child_results = []
19
22
  limit = @limit.nil? ? Float::MAX : @limit
20
23
  while page
@@ -32,10 +35,14 @@ module Yasuri
32
35
  break if (limit -= 1) <= 0
33
36
  end
34
37
 
38
+ if @flatten == true
39
+ return child_results.map{|h| h.values}.flatten
40
+ end
41
+
35
42
  child_results
36
43
  end
37
44
  def opts
38
- {limit:@limit}
45
+ {limit:@limit, flatten:@flatten}
39
46
  end
40
47
  end
41
48
  end
@@ -6,12 +6,12 @@ require_relative 'yasuri_node'
6
6
  module Yasuri
7
7
  class StructNode
8
8
  include Node
9
- def inject(agent, page, opt = {})
10
- sub_tags = page.search(@xpath)
9
+ def inject(agent, page, opt = {}, element = page)
10
+ sub_tags = element.search(@xpath)
11
11
  tree = sub_tags.map do |sub_tag|
12
12
  child_results_kv = @children.map do |child_node|
13
13
  child_name = Yasuri.NodeName(child_node.name, opt)
14
- [child_name, child_node.inject(agent, sub_tag, opt)]
14
+ [child_name, child_node.inject(agent, page, opt, sub_tag)]
15
15
  end
16
16
  Hash[child_results_kv]
17
17
  end
@@ -7,11 +7,11 @@ module Yasuri
7
7
  class TextNode
8
8
  include Node
9
9
 
10
- def initialize(xpath, name, children = [], hash = {})
10
+ def initialize(xpath, name, children = [], **opt)
11
11
  super(xpath, name, children)
12
12
 
13
- truncate = hash[:truncate]
14
- proc = hash[:proc]
13
+ truncate = opt[:truncate]
14
+ proc = opt[:proc]
15
15
 
16
16
  truncate = Regexp.new(truncate) if not truncate.nil? # regexp or nil
17
17
  @truncate = truncate
@@ -21,8 +21,8 @@ module Yasuri
21
21
 
22
22
  end
23
23
 
24
- def inject(agent, page, opt = {})
25
- node = page.search(@xpath)
24
+ def inject(agent, page, opt = {}, element = page)
25
+ node = element.search(@xpath)
26
26
  text = node.text.to_s
27
27
 
28
28
  if @truncate
@@ -0,0 +1,30 @@
1
+ <html>
2
+ <head>
3
+ <title>StructualLinksTest</title>
4
+ </head>
5
+ <body>
6
+
7
+ <table>
8
+ <thead>
9
+ <tr>
10
+ <th>Title</th>
11
+ <th>Links</th>
12
+ </tr>
13
+ </thead>
14
+ <tr>
15
+ <td>Child01,02</td>
16
+ <td><a href="../child01.html">Child01</a></td>
17
+ <td><a href="../child02.html">Child02</a></td>
18
+ <td>../child02.html</td>
19
+ </tr>
20
+
21
+ <tr>
22
+ <td>Child01,02,03</td>
23
+ <td><a href="../child01.html">Child01</a></td>
24
+ <td><a href="../child02.html">Child02</a></td>
25
+ <td><a href="../child03.html">Child03</a></td>
26
+ </tr>
27
+ </table>
28
+
29
+ </body>
30
+ </html>
data/spec/spec_helper.rb CHANGED
@@ -12,11 +12,6 @@ shared_context 'httpserver' do
12
12
  }
13
13
  end
14
14
 
15
-
16
- # ENV['CODECLIMATE_REPO_TOKEN'] = "0dc78d33107a7f11f257c0218ac1a37e0073005bb9734f2fd61d0f7e803fc151"
17
- # require "codeclimate-test-reporter"
18
- # CodeClimate::TestReporter.start
19
-
20
15
  require 'simplecov'
21
16
  require 'coveralls'
22
17
  Coveralls.wear!
@@ -59,10 +59,18 @@ describe 'Yasuri' do
59
59
  ]
60
60
  expect(actual).to match expected
61
61
  end
62
- it 'can be defined by DSL, return single LinkNode title' do
63
- generated = Yasuri.links_title '/html/body/a'
64
- original = Yasuri::LinksNode.new('/html/body/a', "title")
65
- compare_generated_vs_original(generated, original, @index_page)
62
+ it 'can be defined by DSL, return no contains if no child node' do
63
+ root_node = Yasuri.links_title '/html/body/a'
64
+ actual = root_node.inject(@agent, @index_page)
65
+ expected = [{}, {}, {}] # Empty if no child node under links node.
66
+ expect(actual).to match expected
67
+ end
68
+
69
+ it 'can be defined return no contains if no child node' do
70
+ root_node = Yasuri::LinksNode.new('/html/body/a', "title")
71
+ actual = root_node.inject(@agent, @index_page)
72
+ expected = [{}, {}, {}] # Empty if no child node under links node.
73
+ expect(actual).to match expected
66
74
  end
67
75
  it 'can be defined by DSL, return nested contents under link' do
68
76
  generated = Yasuri.links_title '/html/body/a' do
@@ -30,6 +30,49 @@ describe 'Yasuri' do
30
30
  expect(actual).to match expected
31
31
  end
32
32
 
33
+ it "scrape each paginated pages with flatten" do
34
+ root_node = Yasuri::PaginateNode.new("/html/body/nav/span/a[@class='next']", "root", [
35
+ Yasuri::TextNode.new('/html/body/p', "content"),
36
+ Yasuri::StructNode.new('/html/body/nav/span', "span", [
37
+ Yasuri::TextNode.new('./a', "text"),
38
+ ]),
39
+ ], flatten: true)
40
+ actual = root_node.inject(@agent, @page)
41
+ expected = [
42
+ "PaginationTest01",
43
+ {"text"=>""},
44
+ {"text"=>""},
45
+ {"text" => "2"},
46
+ {"text" => "3"},
47
+ {"text" => "4"},
48
+ {"text"=>"NextPage »"},
49
+ "PaginationTest02",
50
+ {"text"=>"« PreviousPage"},
51
+ {"text" => "1"},
52
+ {"text"=>""},
53
+ {"text" => "3"},
54
+ {"text" => "4"},
55
+ {"text"=>"NextPage »"},
56
+ "PaginationTest03",
57
+ {"text"=>"« PreviousPage"},
58
+ {"text" => "1"},
59
+ {"text" => "2"},
60
+ {"text"=>""},
61
+ {"text" => "4"},
62
+ {"text"=>"NextPage »"},
63
+ "PaginationTest04",
64
+ {"text"=>"« PreviousPage"},
65
+ {"text" => "1"},
66
+ {"text" => "2"},
67
+ {"text" => "3"},
68
+ {"text"=>""},
69
+ {"text"=>""},
70
+ ]
71
+
72
+ expect(actual).to match expected
73
+ end
74
+
75
+
33
76
  it "scrape each paginated pages limited" do
34
77
  root_node = Yasuri::PaginateNode.new("/html/body/nav/span/a[@class='next']", "root", [
35
78
  Yasuri::TextNode.new('/html/body/p', "content"),
data/spec/yasuri_spec.rb CHANGED
@@ -13,6 +13,89 @@ describe 'Yasuri' do
13
13
  @index_page = @agent.get(@uri)
14
14
  end
15
15
 
16
+ ############
17
+ # yam2tree #
18
+ ############
19
+ describe '.yaml2tree' do
20
+ it "fail if empty yaml" do
21
+ expect { Yasuri.yaml2tree(nil) }.to raise_error(RuntimeError)
22
+ end
23
+
24
+ it "return text node" do
25
+ src = <<-EOB
26
+ content:
27
+ node: text
28
+ path: "/html/body/p[1]"
29
+ EOB
30
+ generated = Yasuri.yaml2tree(src)
31
+ original = Yasuri::TextNode.new('/html/body/p[1]', "content")
32
+
33
+ compare_generated_vs_original(generated, original, @index_page)
34
+ end
35
+
36
+ it "return text node as symbol" do
37
+ src = <<-EOB
38
+ :content:
39
+ :node: text
40
+ :path: "/html/body/p[1]"
41
+ EOB
42
+ generated = Yasuri.yaml2tree(src)
43
+ original = Yasuri::TextNode.new('/html/body/p[1]', "content")
44
+
45
+ compare_generated_vs_original(generated, original, @index_page)
46
+ end
47
+
48
+ it "return LinksNode/TextNode" do
49
+
50
+ src = <<-EOB
51
+ root:
52
+ node: links
53
+ path: "/html/body/a"
54
+ children:
55
+ - content:
56
+ node: text
57
+ path: "/html/body/p"
58
+ EOB
59
+ generated = Yasuri.yaml2tree(src)
60
+ original = Yasuri::LinksNode.new('/html/body/a', "root", [
61
+ Yasuri::TextNode.new('/html/body/p', "content"),
62
+ ])
63
+
64
+ compare_generated_vs_original(generated, original, @index_page)
65
+ end
66
+
67
+ it "return StructNode/StructNode/[TextNode,TextNode]" do
68
+ src = <<-EOB
69
+ tables:
70
+ node: struct
71
+ path: "/html/body/table"
72
+ children:
73
+ - table:
74
+ node: struct
75
+ path: "./tr"
76
+ children:
77
+ - title:
78
+ node: text
79
+ path: "./td[1]"
80
+ - pub_date:
81
+ node: text
82
+ path: "./td[2]"
83
+ EOB
84
+
85
+ generated = Yasuri.yaml2tree(src)
86
+ original = Yasuri::StructNode.new('/html/body/table', "tables", [
87
+ Yasuri::StructNode.new('./tr', "table", [
88
+ Yasuri::TextNode.new('./td[1]', "title"),
89
+ Yasuri::TextNode.new('./td[2]', "pub_date"),
90
+ ])
91
+ ])
92
+ page = @agent.get(@uri + "/struct/structual_text.html")
93
+ compare_generated_vs_original(generated, original, page)
94
+ end
95
+
96
+ end # end of describe '.yaml2tree'
97
+
98
+
16
99
  #############
17
100
  # json2tree #
18
101
  #############
@@ -39,7 +122,7 @@ describe 'Yasuri' do
39
122
  "truncate" : "^[^,]+"
40
123
  }|
41
124
  generated = Yasuri.json2tree(src)
42
- original = Yasuri::TextNode.new('/html/body/p[1]', "content", {}, truncate:/^[^,]+/)
125
+ original = Yasuri::TextNode.new('/html/body/p[1]', "content", truncate:/^[^,]+/)
43
126
  compare_generated_vs_original(generated, original, @index_page)
44
127
  end
45
128
 
@@ -126,7 +209,7 @@ describe 'Yasuri' do
126
209
  Yasuri::TextNode.new('./td[2]', "pub_date"),
127
210
  ])
128
211
  ])
129
- page = @agent.get(@uri + "/structual_text.html")
212
+ page = @agent.get(@uri + "/struct/structual_text.html")
130
213
  compare_generated_vs_original(generated, original, page)
131
214
  end
132
215
  end
@@ -153,7 +236,7 @@ describe 'Yasuri' do
153
236
  end
154
237
 
155
238
  it "return text node with truncate_regexp" do
156
- node = Yasuri::TextNode.new("/html/head/title", "title", {}, truncate:/^[^,]+/)
239
+ node = Yasuri::TextNode.new("/html/head/title", "title", truncate:/^[^,]+/)
157
240
  json = Yasuri.tree2json(node)
158
241
  expected_str = %q| { "node": "text",
159
242
  "name": "title",
@@ -193,6 +276,7 @@ describe 'Yasuri' do
193
276
  "name" : "root",
194
277
  "path" : "/html/body/nav/span/a[@class='next']",
195
278
  "limit" : 10,
279
+ "flatten" : false,
196
280
  "children" : [ { "node" : "text",
197
281
  "name" : "content",
198
282
  "path" : "/html/body/p"
@@ -12,7 +12,7 @@ describe 'Yasuri' do
12
12
  describe '::StructNode' do
13
13
  before do
14
14
  @agent = Mechanize.new
15
- @page = @agent.get(uri + "/structual_text.html")
15
+ @page = @agent.get(uri + "/struct/structual_text.html")
16
16
 
17
17
  @table_1996 = [
18
18
  { "title" => "The Perfect Insider",
@@ -126,10 +126,51 @@ describe 'Yasuri' do
126
126
  Yasuri::TextNode.new('./td[1]', "title"),
127
127
  Yasuri::TextNode.new('./td[2]', "pub_date"),
128
128
  ])
129
- expected = @table_1996.map{|h| Hash[h.map{|k,v| [k.to_sym, v] }] }
129
+ expected = @table_1996.map{|h| h.map{|k,v| [k.to_sym, v] }.to_h }
130
130
  actual = node.inject(@agent, @page, symbolize_names:true)
131
131
  expect(actual).to match expected
132
132
  end
133
133
 
134
134
  end
135
+
136
+ describe '::StructNode::Links' do
137
+ before do
138
+ @agent = Mechanize.new
139
+ @page = @agent.get(uri + "/struct/structual_links.html")
140
+
141
+ @table = [
142
+ { "title" => "Child01,02",
143
+ "child" => [{"p" => "Child 01 page."}, {"p" => "Child 02 page."}] },
144
+
145
+ { "title" => "Child01,02,03",
146
+ "child" => [{"p" => "Child 01 page."}, {"p" => "Child 02 page."}, {"p" => "Child 03 page."}]}
147
+ ]
148
+ end
149
+
150
+ it 'return child node in links inside struct' do
151
+ node = Yasuri::StructNode.new('/html/body/table/tr', "table", [
152
+ Yasuri::TextNode.new('./td[1]', "title"),
153
+ Yasuri::LinksNode.new('./td/a', "child", [
154
+ Yasuri::TextNode.new('/html/body/p', "p"),
155
+ ])
156
+ ])
157
+ expected = @table
158
+ actual = node.inject(@agent, @page)
159
+ expect(actual).to match expected
160
+ end
161
+ end # descrive
162
+
163
+ describe '::StructNode::Pages' do
164
+ before do
165
+ @agent = Mechanize.new
166
+ @page = @agent.get(uri + "/struct/structual_text.html") #dummy
167
+ end
168
+
169
+ it 'not supported' do
170
+ node = Yasuri::StructNode.new('/html/body/table[1]/tr', "table", [
171
+ Yasuri::PaginateNode.new("/html/body/nav/span/a[@class='next']", "pages", [])
172
+ ])
173
+ expect{ node.inject(@agent, @page) }.to raise_error(NotImplementedError, "PagenateNode inside StructNode, Not Supported")
174
+ end
175
+ end
135
176
  end
data/yasuri.gemspec CHANGED
@@ -18,8 +18,8 @@ Gem::Specification.new do |spec|
18
18
  spec.test_files = spec.files.grep(%r{^(test|spec|features)/})
19
19
  spec.require_paths = ["lib"]
20
20
 
21
- spec.add_development_dependency "bundler", "~> 1.7"
22
- spec.add_development_dependency "rake", "~> 10.0"
21
+ spec.add_development_dependency "bundler"
22
+ spec.add_development_dependency "rake"
23
23
  spec.add_development_dependency "rspec"
24
24
  spec.add_development_dependency "fuubar"
25
25
  spec.add_development_dependency "glint"
metadata CHANGED
@@ -1,43 +1,43 @@
1
1
  --- !ruby/object:Gem::Specification
2
2
  name: yasuri
3
3
  version: !ruby/object:Gem::Version
4
- version: 1.9.11
4
+ version: 3.0.0
5
5
  platform: ruby
6
6
  authors:
7
7
  - TAC
8
- autorequire:
8
+ autorequire:
9
9
  bindir: bin
10
10
  cert_chain: []
11
- date: 2016-11-14 00:00:00.000000000 Z
11
+ date: 2021-03-18 00:00:00.000000000 Z
12
12
  dependencies:
13
13
  - !ruby/object:Gem::Dependency
14
14
  name: bundler
15
15
  requirement: !ruby/object:Gem::Requirement
16
16
  requirements:
17
- - - "~>"
17
+ - - ">="
18
18
  - !ruby/object:Gem::Version
19
- version: '1.7'
19
+ version: '0'
20
20
  type: :development
21
21
  prerelease: false
22
22
  version_requirements: !ruby/object:Gem::Requirement
23
23
  requirements:
24
- - - "~>"
24
+ - - ">="
25
25
  - !ruby/object:Gem::Version
26
- version: '1.7'
26
+ version: '0'
27
27
  - !ruby/object:Gem::Dependency
28
28
  name: rake
29
29
  requirement: !ruby/object:Gem::Requirement
30
30
  requirements:
31
- - - "~>"
31
+ - - ">="
32
32
  - !ruby/object:Gem::Version
33
- version: '10.0'
33
+ version: '0'
34
34
  type: :development
35
35
  prerelease: false
36
36
  version_requirements: !ruby/object:Gem::Requirement
37
37
  requirements:
38
- - - "~>"
38
+ - - ">="
39
39
  - !ruby/object:Gem::Version
40
- version: '10.0'
40
+ version: '0'
41
41
  - !ruby/object:Gem::Dependency
42
42
  name: rspec
43
43
  requirement: !ruby/object:Gem::Requirement
@@ -144,8 +144,10 @@ extensions: []
144
144
  extra_rdoc_files: []
145
145
  files:
146
146
  - ".coveralls.yml"
147
+ - ".github/workflows/ruby.yml"
147
148
  - ".gitignore"
148
149
  - ".rspec"
150
+ - ".ruby-version"
149
151
  - ".travis.yml"
150
152
  - Gemfile
151
153
  - LICENSE
@@ -174,7 +176,8 @@ files:
174
176
  - spec/htdocs/pagination/page02.html
175
177
  - spec/htdocs/pagination/page03.html
176
178
  - spec/htdocs/pagination/page04.html
177
- - spec/htdocs/structual_text.html
179
+ - spec/htdocs/struct/structual_links.html
180
+ - spec/htdocs/struct/structual_text.html
178
181
  - spec/servers/httpserver.rb
179
182
  - spec/spec_helper.rb
180
183
  - spec/yasuri_links_node_spec.rb
@@ -188,7 +191,7 @@ homepage: https://github.com/tac0x2a/yasuri
188
191
  licenses:
189
192
  - MIT
190
193
  metadata: {}
191
- post_install_message:
194
+ post_install_message:
192
195
  rdoc_options: []
193
196
  require_paths:
194
197
  - lib
@@ -203,9 +206,8 @@ required_rubygems_version: !ruby/object:Gem::Requirement
203
206
  - !ruby/object:Gem::Version
204
207
  version: '0'
205
208
  requirements: []
206
- rubyforge_project:
207
- rubygems_version: 2.4.5
208
- signing_key:
209
+ rubygems_version: 3.2.3
210
+ signing_key:
209
211
  specification_version: 4
210
212
  summary: Yasuri is easy scraping library.
211
213
  test_files:
@@ -220,7 +222,8 @@ test_files:
220
222
  - spec/htdocs/pagination/page02.html
221
223
  - spec/htdocs/pagination/page03.html
222
224
  - spec/htdocs/pagination/page04.html
223
- - spec/htdocs/structual_text.html
225
+ - spec/htdocs/struct/structual_links.html
226
+ - spec/htdocs/struct/structual_text.html
224
227
  - spec/servers/httpserver.rb
225
228
  - spec/spec_helper.rb
226
229
  - spec/yasuri_links_node_spec.rb