nikkou 0.0.2 → 0.0.3

data/README.md CHANGED
@@ -1,75 +1,184 @@
  Nikkou
  ======
- description
+ Extract useful data from HTML and XML with ease!

  Description
  -----------

- Nikkou...
+ Nikkou adds additional methods to Nokogiri to make extracting commonly-used data from HTML and XML easier. It lets you transform HTML into structured data very quickly, and it integrates nicely with [Mechanize](https://github.com/sparklemotion/mechanize).

- ### time
+ Method Overview
+ ---------------
+
+ Here's a summary of the methods Nikkou provides (see "Methods" for details):
+
+ ### Formatting
+
+ **parse_text** - Parses the node's text as XML and returns it as a Nokogiri::XML::NodeSet
+
+ **time(options={})** - Intelligently parses the time (relative or absolute) of either the text or a specified attribute; accepts a `time_zone` option
+
+ **url(attribute='href')** - Converts the href (or other specified attribute) into an absolute URL using the document's URI; `<a href="/p/1">Link</a>` yields `http://mysite.com/p/1`
+
+ ### Searching
+
+ **attr_equals(attribute, string)** - Finds nodes where the attribute equals the string
+
+ **attr_includes(attribute, string)** - Finds nodes where the attribute includes the string
+
+ **attr_matches(attribute, pattern)** - Finds nodes where the attribute matches the pattern
+
+ **drill(*methods)** - Nil-safe method chaining
+
+ **find(path)** - Same as `search` but returns the first matched node
+
+ **text_equals(string)** - Finds nodes where the text equals the string
+
+ **text_includes(string)** - Finds nodes where the text includes the string
+
+ **text_matches(pattern)** - Finds nodes where the text matches the pattern
+
+ ## Methods
+
+ ### Formatting
+
+ #### time(options={})

  Returns a Time object (in UTC) by automatically parsing the text or specified attribute of the node.

  ```ruby
  # <a href="/p/1">3 hours ago</a>
- doc.search('a').first.time # 2013-04-16 02:42:34 UTC
+ doc.search('a').first.time
  ```

  ###### Options

- `attribute` - The attribute to parse:
+ `attribute`
+
+ The attribute to parse:

  ```ruby
- # <a href="/p/1" data-published-at="2013-04-16 02:42:34">My link</a>
- doc.search('a').first.time(attribute: 'data-published-at') # 2013-04-16 02:42:34 UTC
+ # <a href="/p/1" data-published-at="2013-05-22 02:42:34">My link</a>
+ doc.search('a').first.time(attribute: 'data-published-at')
  ```

- `time_zone` - The document's time zone (the time will be converted from that to UTC):
+ `time_zone`
+
+ The document's time zone (the time will be converted from that to UTC):

  ```ruby
  # <a href="/p/1">3 hours ago</a>
- doc.search('a').first.time(time_zone: 'America/New_York') # 2013-04-16 06:42:34 UTC
+ doc.search('a').first.time(time_zone: 'America/New_York')
  ```

- #### url
+ #### url(attribute='href')

- Returns an absolute URL; useful for parsing relative hrefs. The document's `uri` needs to be set for Nikkou to know what domain to add to relative hrefs.
+ Returns an absolute URL; useful for parsing relative hrefs. The document's `uri` needs to be set for Nikkou to know what domain to add to relative paths.

  ```ruby
  # <a href="/p/1">My link</a>
  doc.uri = 'http://mysite.com/mypage'
- doc.search('a').first.url # http://mysite.com/p/1
+ doc.search('a').first.url # "http://mysite.com/p/1"
  ```

+ If Mechanize is being used, the `uri` doesn't need to be manually set.
+
  ###### Options

- `attribute` - The attribute to parse:
+ `attribute`
+
+ The attribute to parse:

  ```ruby
  # <a href="/p/1" data-comments-url="/p/1#comments">My Link</a>
  doc.uri = 'http://mysite.com/mypage'
- doc.search('a').first.url('data-comments-url') # http://mysite.com/p/1#comments
+ doc.search('a').first.url('data-comments-url') # "http://mysite.com/p/1#comments"
+ ```
+
+ ### Searching
+
+ #### attr_equals(attribute, string)
+
+ Selects nodes where the specified attribute equals the string.
+
+ ```ruby
+ # <div data-type="news">My Text</div>
+ doc.attr_equals('data-type', 'news').first.text # "My Text"
+ ```
+
+ #### attr_includes(attribute, string)
+
+ Selects nodes where the specified attribute includes the string.
+
+ ```ruby
+ # <div data-type="major-news">My Text</div>
+ doc.attr_includes('data-type', 'news').first.text # "My Text"
  ```

- ### attr_matches(attribute, pattern)
+ #### attr_matches(attribute, pattern)

- Selects nodes with an attribute matching a pattern. The pattern's matches are stored in `Node#matches`.
+ Selects nodes with an attribute matching a pattern. The pattern's matches are available in `Node#matches`.

  ```ruby
  # <span data-tooltip="3 Comments">My Text</span>
- doc.search('span').attr_matches('data-tooltip', /(\d+) comments/i).first.text # My Text
- doc.search('span').attr_matches('data-tooltip', /(\d+) comments/i).first.matches # ["3 Comments", "3"]
+ doc.attr_matches('data-tooltip', /(\d+) comments/i).first.text # "My Text"
+ doc.attr_matches('data-tooltip', /(\d+) comments/i).first.matches # ["3 Comments", "3"]
+ ```
+
+ #### drill(*methods)
+
+ Nil-safe method chaining. Replaces this:
+
+ ```ruby
+ node = doc.find('.count')
+ if node
+   attribute = node.attr('data-count')
+   if attribute
+     return attribute.to_i
+   end
+ end
+ ```
+
+ With this:
+
+ ```ruby
+ return doc.drill([:find, '.count'], [:attr, 'data-count'], :to_i)
+ ```
+
+ #### find(path)
+
+ Same as `search`, but returns the first matched node. Replaces this:
+
+ ```ruby
+ nodes = node.search('h4')
+ if nodes
+   return nodes.first
+ end
+ ```
+
+ With this:
+
+ ```ruby
+ return node.find('h4')
+ ```
+
+ #### text_includes(string)
+
+ Selects nodes where the text includes the string.
+
+ ```ruby
+ # <div data-type="news">My Text</div>
+ doc.text_includes('Text').first.text # "My Text"
  ```

- ### text_matches(attribute, pattern)
+ #### text_matches(pattern)

- Selects nodes with text matching a pattern. The pattern's matches are stored in `Node#matches`.
+ Selects nodes with text matching a pattern. The pattern's matches are available in `Node#matches`.

  ```ruby
  # <a href="/p/1">3 Comments</a>
- doc.search('span').text_matches(/^(\d+) comments$/i).first.attr('href') # "/p/1"
- doc.search('span').text_matches(/^(\d+) comments$/i).first.matches # ["3 Comments", "3"]
+ doc.text_matches(/^(\d+) comments$/i).first.attr('href') # "/p/1"
+ doc.text_matches(/^(\d+) comments$/i).first.matches # ["3 Comments", "3"]
  ```

  License
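The README describes `drill` as nil-safe method chaining but this diff does not show how it is implemented. As a rough illustration of the idea only, the sketch below uses a hypothetical `DrillSketch` mixin and a hypothetical `FakeDoc` class standing in for a parsed document; neither name comes from Nikkou.

```ruby
# Hypothetical mixin sketching nil-safe chaining; not Nikkou's own code.
module DrillSketch
  # Each step is a method name (Symbol) or an array of [method_name, *args].
  # Stops and returns nil as soon as an intermediate result is nil.
  def drill(*steps)
    steps.inject(self) do |receiver, step|
      return nil if receiver.nil?
      receiver.public_send(*Array(step))
    end
  end
end

# Hypothetical stand-in for a document: maps selectors to attribute hashes.
class FakeDoc
  include DrillSketch

  def initialize(nodes)
    @nodes = nodes
  end

  def find(selector)
    @nodes[selector]
  end
end

doc = FakeDoc.new('.count' => { 'data-count' => '12' })

found   = doc.drill([:find, '.count'], [:fetch, 'data-count'], :to_i)
missing = doc.drill([:find, '.missing'], [:fetch, 'data-count'], :to_i)

puts found.inspect   # 12
puts missing.inspect # nil
```

The second call stops at the first step that yields nil, so no `NoMethodError` is raised for the later steps.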
data/Rakefile CHANGED
@@ -16,7 +16,7 @@ RDoc::Task.new(:rdoc) do |rdoc|
    rdoc.rdoc_dir = 'rdoc'
    rdoc.title = 'Nikkou'
    rdoc.options << '--line-numbers'
-   rdoc.rdoc_files.include('README.rdoc')
+   rdoc.rdoc_files.include('README.md')
    rdoc.rdoc_files.include('lib/**/*.rb')
  end

@@ -6,14 +6,68 @@ module Nikkou
  include Nikkou::Findable

  attr_accessor :matches
+
+ def attr_equals(attribute, string)
+   list = []
+   traverse do |node|
+     list << node if node.attr(attribute) == string
+   end
+   ::Nokogiri::XML::NodeSet.new(document, list)
+ end
+
+ def attr_includes(attribute, string)
+   list = []
+   traverse do |node|
+     next if node.attr(attribute).nil?
+     list << node if node.attr(attribute).include?(string)
+   end
+   ::Nokogiri::XML::NodeSet.new(document, list)
+ end
+
+ def attr_matches(attribute, pattern)
+   list = []
+   traverse do |node|
+     next if node.attr(attribute).nil?
+     if node.attr(attribute).match(pattern)
+       node.matches = $~.to_a
+       list << node
+     end
+   end
+   ::Nokogiri::XML::NodeSet.new(document, list)
+ end

- def url(attribute='href')
-   return nil if attr(attribute).nil? || document.nil? || document.uri.nil?
-   href = attr(attribute)
-   return href if href =~ /^https?:\/\//
-   return "http:#{href}" if href.start_with?('//')
-   root_url = "#{document.uri.scheme}://#{document.uri.host}"
-   URI.join(root_url, href).to_s
+ def parse_text
+   parse(text)
+ end
+
+ def text_equals(string)
+   list = []
+   traverse do |node|
+     next if node.is_a?(::Nokogiri::XML::Text)
+     list << node if node.text == string
+   end
+   ::Nokogiri::XML::NodeSet.new(document, list)
+ end
+
+ def text_includes(string)
+   list = []
+   traverse do |node|
+     next if node.is_a?(::Nokogiri::XML::Text)
+     list << node if node.text.include?(string)
+   end
+   ::Nokogiri::XML::NodeSet.new(document, list)
+ end
+
+ def text_matches(pattern)
+   list = []
+   traverse do |node|
+     next if node.is_a?(::Nokogiri::XML::Text)
+     if node.text.match(pattern)
+       node.matches = $~.to_a
+       list << node
+     end
+   end
+   ::Nokogiri::XML::NodeSet.new(document, list)
  end

  def time(options={})
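The three `attr_*` methods added in this hunk differ only in the comparison applied to the attribute value: `==`, `String#include?`, or `String#match`. A minimal sketch of that difference follows, using plain hashes as hypothetical stand-ins for nodes so that it runs without Nokogiri.

```ruby
# Hypothetical stand-ins for nodes: each hash maps attribute names to values.
nodes = [
  { 'data-type' => 'news' },
  { 'data-type' => 'major-news' },
  { 'class' => 'footer' } # no data-type attribute at all
]

# Exact equality: a missing attribute (nil) never equals the string.
equals = nodes.select { |n| n['data-type'] == 'news' }

# Substring test: nil is skipped first, since nil does not respond to include?.
includes = nodes.select do |n|
  !n['data-type'].nil? && n['data-type'].include?('news')
end

# Pattern test: MatchData#to_a gives the full match followed by the captures.
matches = nodes.map { |n| n['data-type'] && n['data-type'].match(/(\w+)-news/) }
               .compact
               .map(&:to_a)

puts equals.length   # 1
puts includes.length # 2
puts matches.inspect # [["major-news", "major"]]
```

A value such as `major-news` is found by the substring and pattern tests but not by the equality test.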
@@ -33,6 +87,16 @@ module Nikkou
    end
    time_zone.local_to_utc(time)
  end
+
+ def url(attribute='href')
+   return nil if attr(attribute).nil?
+   href = attr(attribute)
+   return href if href =~ /^https?:\/\//
+   return "http:#{href}" if href.start_with?('//')
+   return nil if document.nil? || document.uri.nil?
+   root_url = "#{document.uri.scheme}://#{document.uri.host}"
+   URI.join(root_url, href).to_s
+ end
  end
  end
  end
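The rewritten `url` method moves the document-URI check below the absolute and protocol-relative cases, so those hrefs are returned even when no document URI is set. The sketch below mirrors that order of checks in a standalone function; `absolute_url` and its `base` parameter are hypothetical names used here so the example runs with only the standard library.

```ruby
require 'uri'

# Hypothetical helper mirroring the order of checks in the new `url` method.
# `href` is the attribute value; `base` is the document URI as a string, or nil.
def absolute_url(href, base)
  return nil if href.nil?
  return href if href =~ %r{\Ahttps?://}          # already absolute
  return "http:#{href}" if href.start_with?('//') # protocol-relative
  return nil if base.nil?                         # a relative path needs a base
  uri = URI.parse(base)
  URI.join("#{uri.scheme}://#{uri.host}", href).to_s
end

relative  = absolute_url('/p/1', 'http://mysite.com/mypage')
absolute  = absolute_url('http://www.absoluteurl.com/', nil)
protocol  = absolute_url('//cdn.example.com/a.js', nil)
no_base   = absolute_url('/p/1', nil)

puts relative         # http://mysite.com/p/1
puts absolute         # http://www.absoluteurl.com/
puts protocol         # http://cdn.example.com/a.js
puts no_base.inspect  # nil
```

Only the relative path depends on the base; under the old ordering all four calls with a nil base would have returned nil.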
@@ -5,6 +5,14 @@ module Nikkou
  include Nikkou::Drillable
  include Nikkou::Findable

+ def attr_equals(attribute, string)
+   list = select do |node|
+     next false if node.attr(attribute).nil?
+     node.attr(attribute) == string
+   end
+   self.class.new(document, list)
+ end
+
  def attr_includes(attribute, string)
    list = select do |node|
      next false if node.attr(attribute).nil?
@@ -25,8 +33,17 @@ module Nikkou
    self.class.new(document, list)
  end

+ def text_equals(string)
+   list = select do |node|
+     next if node.is_a?(::Nokogiri::XML::Text)
+     node.text == string
+   end
+   self.class.new(document, list)
+ end
+
  def text_includes(string)
    list = select do |node|
+     next if node.is_a?(::Nokogiri::XML::Text)
      node.text.include?(string)
    end
    self.class.new(document, list)
@@ -35,6 +52,7 @@ module Nikkou
  def text_matches(pattern)
    list = []
    each do |node|
+     next if node.is_a?(::Nokogiri::XML::Text)
      if node.text.match(pattern)
        node.matches = $~.to_a
        list << node
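The NodeSet methods in this file filter with `select` blocks, where the choice between `next` and `return` matters: `next` ends only the current iteration and supplies the block's value, while `return` inside a block exits the enclosing method. A minimal sketch with plain arrays follows; the helper names are hypothetical and `nil` stands in for a missing attribute.

```ruby
# Hypothetical helpers over plain arrays; nil stands in for a missing attribute.

def select_with_next(values, wanted)
  values.select do |value|
    next false if value.nil? # skip this element and keep iterating
    value == wanted
  end
end

def select_with_return(values, wanted)
  values.select do |value|
    return false if value.nil? # exits select_with_return itself
    value == wanted
  end
end

values = ['news', nil, 'news']

with_next   = select_with_next(values, 'news')
with_return = select_with_return(values, 'news')

puts with_next.inspect   # ["news", "news"]
puts with_return.inspect # false
```

With `return`, the caller receives `false` instead of a filtered collection as soon as one element lacks the value.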
@@ -1,3 +1,3 @@
  module Nikkou
-   VERSION = "0.0.2"
+   VERSION = '0.0.3'
  end
@@ -1,7 +1,7 @@
  require 'spec_helper'

  describe Nikkou::Drillable do
-   before do
+   before(:all) do
      assets_directory = File.expand_path(File.join(File.dirname(__FILE__), 'files'))
      html_file = File.join(assets_directory, 'test.html')
      @html = Nokogiri::HTML.parse(File.read(html_file))
18
  </ul>
  </div>
  </div>
+ <div class="xml-node">
+   &lt;div class=&quot;xml-encoded-node&quot;&gt;
+     xml encoded node value
+   &lt;/div&gt;
+ </div>
  </body>
  </html>
@@ -1,7 +1,7 @@
  require 'spec_helper'

  describe Nikkou::Findable do
-   before do
+   before(:all) do
      assets_directory = File.expand_path(File.join(File.dirname(__FILE__), 'files'))
      html_file = File.join(assets_directory, 'test.html')
      @html = Nokogiri::HTML.parse(File.read(html_file))
@@ -1,12 +1,19 @@
  require 'spec_helper'

  describe Nokogiri::XML::NodeSet do
-   before do
+   before(:all) do
      assets_directory = File.expand_path(File.join(File.dirname(__FILE__), 'files'))
      html_file = File.join(assets_directory, 'test.html')
      @html = Nokogiri::HTML.parse(File.read(html_file))
    end

+   describe '.attr_equals' do
+     it 'finds nodes' do
+       nodes = @html.search('a').attr_equals('href', 'http://www.ipsum.com/')
+       nodes.first.text.should == 'ipsum'
+     end
+   end
+
    describe '.attr_includes' do
      it 'finds nodes' do
        nodes = @html.search('a').attr_includes('href', 'ipsum.com')
@@ -26,6 +33,20 @@ describe Nokogiri::XML::NodeSet do
      end
    end

+   describe '.text_equals' do
+     it 'finds nodes' do
+       nodes = @html.search('a').text_equals('ipsum')
+       nodes.first.text.should == 'ipsum'
+     end
+   end
+
+   describe '.text_includes' do
+     it 'finds nodes' do
+       nodes = @html.search('a').text_includes('ipsum')
+       nodes.first.text.should == 'ipsum'
+     end
+   end
+
    describe '.text_matches' do
      it 'finds nodes' do
        nodes = @html.search('a').text_matches(/(\d+) comments/)
@@ -37,11 +58,4 @@ describe Nokogiri::XML::NodeSet do
        nodes.first.matches.should == ['12 comments', '12']
      end
    end
-
-   describe '.text_includes' do
-     it 'finds nodes' do
-       nodes = @html.search('a').text_includes('ipsum')
-       nodes.first.text.should == 'ipsum'
-     end
-   end
  end
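The `sets matches` specs expect an array holding the full match followed by each capture group, which is what `MatchData#to_a` returns. A small standalone sketch follows; `matches_for` is a hypothetical helper name, not part of Nikkou.

```ruby
# Hypothetical helper: returns the match array for text, or nil when nothing matches.
def matches_for(text, pattern)
  return nil unless text.match(pattern)
  $~.to_a # $~ holds the MatchData from the last successful match in this method
end

hit  = matches_for('12 comments', /(\d+) comments/)
miss = matches_for('no replies yet', /(\d+) comments/)

puts hit.inspect  # ["12 comments", "12"]
puts miss.inspect # nil
```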
@@ -1,20 +1,77 @@
  require 'spec_helper'

  describe Nokogiri::XML::Node do
-   before do
+   before(:all) do
      assets_directory = File.expand_path(File.join(File.dirname(__FILE__), 'files'))
      html_file = File.join(assets_directory, 'test.html')
      @html = Nokogiri::HTML.parse(File.read(html_file))
      @html.uri = 'http://www.loremipsum.com/page/2'
+
+     # Set the time zone for .time
+     Time.zone = 'Pacific Time (US & Canada)'
    end

-   describe '.url' do
-     it 'reads absolute URLs' do
-       @html.search('a.absolute-url').first.url.should == 'http://www.absoluteurl.com/'
+   describe '.attr_equals' do
+     it 'finds nodes' do
+       nodes = @html.search('body').first.attr_equals('href', 'http://www.ipsum.com/')
+       nodes.first.text.should == 'ipsum'
      end
+   end

-     it 'reads relative URLs' do
-       @html.search('a.relative-url').first.url.should == 'http://www.loremipsum.com/p/1'
+   describe '.attr_includes' do
+     it 'finds nodes' do
+       nodes = @html.search('body').first.attr_includes('href', 'ipsum.com')
+       nodes.first.text.should == 'ipsum'
+     end
+   end
+
+   describe '.attr_matches' do
+     it 'finds nodes' do
+       nodes = @html.search('body').first.attr_matches('href', /(lorem|ipsum)\.com/)
+       nodes.first.text.should == 'ipsum'
+     end
+
+     it 'sets matches' do
+       nodes = @html.search('body').first.attr_matches('href', /(lorem|ipsum)\.com/)
+       nodes.first.matches.should == ['ipsum.com', 'ipsum']
+     end
+   end
+
+   describe '.parse_text' do
+     it 'converts the node\'s text to a node set' do
+       nodes = @html.search('.xml-node').first.parse_text
+       nodes.should be_an_instance_of(Nokogiri::XML::NodeSet)
+     end
+
+     it 'returns a node set that contains the correct content' do
+       nodes = @html.search('.xml-node').first.parse_text
+       nodes.search('.xml-encoded-node').length.should == 1
+     end
+   end
+
+   describe '.text_equals' do
+     it 'finds nodes' do
+       nodes = @html.search('body').first.text_equals('ipsum')
+       nodes.first.text.should == 'ipsum'
+     end
+   end
+
+   describe '.text_includes' do
+     it 'finds nodes' do
+       nodes = @html.search('body').first.text_includes('ipsum')
+       nodes.first.text.should == 'ipsum'
+     end
+   end
+
+   describe '.text_matches' do
+     it 'finds nodes' do
+       nodes = @html.search('body').first.text_matches(/(\d+) comments/)
+       nodes.first.text.should == '12 comments'
+     end
+
+     it 'sets matches' do
+       nodes = @html.search('body').first.text_matches(/(\d+) comments/)
+       nodes.first.matches.should == ['12 comments', '12']
      end
    end

@@ -31,4 +88,14 @@ describe Nokogiri::XML::Node do
        @html.search('.post-published-at').first.time(attribute: 'data-published-at', time_zone: 'America/New_York').to_s.should == '2013-04-01 04:00:00 UTC'
      end
    end
+
+   describe '.url' do
+     it 'reads absolute URLs' do
+       @html.search('a.absolute-url').first.url.should == 'http://www.absoluteurl.com/'
+     end
+
+     it 'reads relative URLs' do
+       @html.search('a.relative-url').first.url.should == 'http://www.loremipsum.com/p/1'
+     end
+   end
  end
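The `time_zone` spec above expects a local time in `America/New_York` to come back as `2013-04-01 04:00:00 UTC`. The library resolves the named zone itself (the diff shows a `time_zone.local_to_utc(time)` call). The sketch below reproduces only the arithmetic, using the standard library and a fixed `-04:00` offset, which is an assumption that holds for New York on that date (daylight time) but not year-round.

```ruby
# Midnight on 2013-04-01 at a fixed -04:00 offset (assumed: New York daylight time).
local = Time.new(2013, 4, 1, 0, 0, 0, '-04:00')

# getutc returns a new Time in UTC without modifying `local`.
utc = local.getutc

puts utc.to_s # 2013-04-01 04:00:00 UTC
```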
@@ -2,7 +2,6 @@ ENV["RAILS_ENV"] ||= 'test'
  require 'rspec'
  require 'nikkou'
- require 'pry'

  RSpec.configure do |config|
    config.color_enabled = true
metadata CHANGED
@@ -1,7 +1,7 @@
  --- !ruby/object:Gem::Specification
  name: nikkou
  version: !ruby/object:Gem::Version
-   version: 0.0.2
+   version: 0.0.3
    prerelease:
  platform: ruby
  authors:
@@ -9,7 +9,7 @@ authors:
  autorequire:
  bindir: bin
  cert_chain: []
- date: 2013-04-23 00:00:00.000000000 Z
+ date: 2013-06-02 00:00:00.000000000 Z
  dependencies:
  - !ruby/object:Gem::Dependency
    name: nokogiri
@@ -91,7 +91,7 @@ dependencies:
      - - ! '>='
        - !ruby/object:Gem::Version
          version: '0'
- description: Utilities for Nokogiri
+ description: Extract useful data from HTML and XML with ease!
  email:
  - tombenner@gmail.com
  executables: []
@@ -141,7 +141,7 @@ rubyforge_project:
  rubygems_version: 1.8.24
  signing_key:
  specification_version: 3
- summary: Utilities for Nokogiri
+ summary: Extract useful data from HTML and XML with ease!
  test_files:
  - spec/drillable_spec.rb
  - spec/files/test.html