nikkou 0.0.2 → 0.0.3

data/README.md CHANGED
@@ -1,75 +1,184 @@
1
1
  Nikkou
2
2
  ======
3
- description
3
+ Extract useful data from HTML and XML with ease!
4
4
 
5
5
  Description
6
6
  -----------
7
7
 
8
- Nikkou...
8
+ Nikkou adds additional methods to Nokogiri to make extracting commonly-used data from HTML and XML easier. It lets you transform HTML into structured data very quickly, and it integrates nicely with [Mechanize](https://github.com/sparklemotion/mechanize).
9
9
 
10
- ### time
10
+ Method Overview
11
+ ---------------
12
+
13
+ Here's a summary of the methods Nikkou provides (see "Methods" for details):
14
+
15
+ ### Formatting
16
+
17
+ **parse_text** - Parses the node's text as XML and returns it as a Nokogiri::XML::NodeSet
18
+
19
+ **time(options={})** - Intelligently parses the time (relative or absolute) of either the text or a specified attribute; accepts a `time_zone` option
20
+
21
+ **url(attribute='href')** - Converts the href (or other specified attribute) into an absolute URL using the document's URI; `<a href="/p/1">Link</a>` yields `http://mysite.com/p/1`
22
+
23
+ ### Searching
24
+
25
+ **attr_equals(attribute, string)** - Finds nodes where the attribute equals the string
26
+
27
+ **attr_includes(attribute, string)** - Finds nodes where the attribute includes the string
28
+
29
+ **attr_matches(attribute, pattern)** - Finds nodes where the attribute matches the pattern
30
+
31
+ **drill(*methods)** - Nil-safe method chaining
32
+
33
+ **find(path)** - Same as `search` but returns the first matched node
34
+
35
+ **text_equals(string)** - Finds nodes where the text equals the string
36
+
37
+ **text_includes(string)** - Finds nodes where the text includes the string
38
+
39
+ **text_matches(pattern)** - Finds nodes where the text matches the pattern
40
+
41
+ ## Methods
42
+
43
+ ### Formatting
44
+
45
+ #### time(options={})
11
46
 
12
47
  Returns a Time object (in UTC) by automatically parsing the text or specified attribute of the node.
13
48
 
14
49
  ```ruby
15
50
  # <a href="/p/1">3 hours ago</a>
16
- doc.search('a').first.time # 2013-04-16 02:42:34 UTC
51
+ doc.search('a').first.time
17
52
  ```
18
53
 
19
54
  ###### Options
20
55
 
21
- `attribute` - The attribute to parse:
56
+ `attribute`
57
+
58
+ The attribute to parse:
22
59
 
23
60
  ```ruby
24
- # <a href="/p/1" data-published-at="2013-04-16 02:42:34">My link</a>
25
- doc.search('a').first.time(attribute: 'data-published-at') # 2013-04-16 02:42:34 UTC
61
+ # <a href="/p/1" data-published-at="2013-05-22 02:42:34">My link</a>
62
+ doc.search('a').first.time(attribute: 'data-published-at')
26
63
  ```
27
64
 
28
- `time_zone` - The document's time zone (the time will be converted from that to UTC):
65
+ `time_zone`
66
+
67
+ The document's time zone (the time will be converted from that to UTC):
29
68
 
30
69
  ```ruby
31
70
  # <a href="/p/1">3 hours ago</a>
32
- doc.search('a').first.time(time_zone: 'America/New_York') # 2013-04-16 06:42:34 UTC
71
+ doc.search('a').first.time(time_zone: 'America/New_York')
33
72
  ```
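Nikkou's own parsing logic is not part of this diff. As a rough illustration of what resolving a relative phrase such as "3 hours ago" involves, here is a minimal plain-Ruby sketch. `parse_relative_time` and `SECONDS_PER_UNIT` are hypothetical names chosen for this example, and the helper handles only a few units; it is not how Nikkou is known to work internally.

```ruby
# Hypothetical helper, not part of Nikkou: resolves phrases such as
# "3 hours ago" against a reference time and returns a UTC Time.
SECONDS_PER_UNIT = {
  'second' => 1,
  'minute' => 60,
  'hour'   => 3_600,
  'day'    => 86_400
}.freeze

def parse_relative_time(text, now = Time.now.utc)
  match = text.match(/(\d+)\s+(second|minute|hour|day)s?\s+ago/i)
  return nil unless match # not a relative phrase this sketch understands

  amount = match[1].to_i
  unit = match[2].downcase
  (now - amount * SECONDS_PER_UNIT[unit]).utc
end

reference = Time.utc(2013, 4, 16, 5, 42, 34)
parse_relative_time('3 hours ago', reference) # => 2013-04-16 02:42:34 UTC
parse_relative_time('My link', reference)     # => nil
```

Passing the reference time explicitly keeps the result reproducible, which is why the examples above can state a fixed output.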
34
73
 
35
- #### url
74
+ #### url(attribute='href')
36
75
 
37
- Returns an absolute URL; useful for parsing relative hrefs. The document's `uri` needs to be set for Nikkou to know what domain to add to relative hrefs.
76
+ Returns an absolute URL; useful for parsing relative hrefs. The document's `uri` needs to be set for Nikkou to know what domain to add to relative paths.
38
77
 
39
78
  ```ruby
40
79
  # <a href="/p/1">My link</a>
41
80
  doc.uri = 'http://mysite.com/mypage'
42
- doc.search('a').first.url # http://mysite.com/p/1
81
+ doc.search('a').first.url # "http://mysite.com/p/1"
43
82
  ```
44
83
 
84
+ If Mechanize is being used, the `uri` doesn't need to be manually set.
85
+
45
86
  ###### Options
46
87
 
47
- `attribute` - The attribute to parse:
88
+ `attribute`
89
+
90
+ The attribute to parse:
48
91
 
49
92
  ```ruby
50
93
  # <a href="/p/1" data-comments-url="/p/1#comments">My Link</a>
51
94
  doc.uri = 'http://mysite.com/mypage'
52
- doc.search('a').first.url('data-comments-url') # http://mysite.com/p/1#comments
95
+ doc.search('a').first.url('data-comments-url') # "http://mysite.com/p/1#comments"
96
+ ```
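The resolution steps can be tried without Nokogiri. The sketch below follows the order of checks in the `url` method as it appears in this release (absolute URL, protocol-relative URL, then a join against the document's scheme and host), using only Ruby's standard `uri` library. `absolute_url` is a hypothetical standalone helper written for this example, not a Nikkou method.

```ruby
require 'uri'

# Hypothetical standalone helper that mirrors the steps of Node#url as
# they appear in this release; `absolute_url` is not a Nikkou method.
def absolute_url(href, document_uri)
  return nil if href.nil?
  return href if href =~ %r{\Ahttps?://}          # already absolute
  return "http:#{href}" if href.start_with?('//') # protocol-relative
  return nil if document_uri.nil?                 # nothing to resolve against

  uri = URI.parse(document_uri)
  URI.join("#{uri.scheme}://#{uri.host}", href).to_s
end

absolute_url('/p/1', 'http://mysite.com/mypage')          # => "http://mysite.com/p/1"
absolute_url('/p/1#comments', 'http://mysite.com/mypage') # => "http://mysite.com/p/1#comments"
absolute_url('//cdn.mysite.com/a.js', nil)                # => "http://cdn.mysite.com/a.js"
absolute_url('/p/1', nil)                                 # => nil
```

Because the base is built from the scheme and host only, the path of the page itself plays no part in the resolution. Checking for absolute and protocol-relative values before checking the document URI means those values are returned even when no `uri` has been set.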
97
+
98
+ ### Searching
99
+
100
+ #### attr_equals(attribute, string)
101
+
102
+ Selects nodes where the specified attribute equals the string.
103
+
104
+ ```ruby
105
+ # <div data-type="news">My Text</div>
106
+ doc.attr_equals('data-type', 'news').first.text # "My Text"
107
+ ```
108
+
109
+ #### attr_includes(attribute, string)
110
+
111
+ Selects nodes where the specified attribute includes the string.
112
+
113
+ ```ruby
114
+ # <div data-type="major-news">My Text</div>
115
+ doc.attr_includes('data-type', 'news').first.text # "My Text"
53
116
  ```
54
117
 
55
- ### attr_matches(attribute, pattern)
118
+ #### attr_matches(attribute, pattern)
56
119
 
57
- Selects nodes with an attribute matching a pattern. The pattern's matches are stored in `Node#matches`.
120
+ Selects nodes with an attribute matching a pattern. The pattern's matches are available in `Node#matches`.
58
121
 
59
122
  ```ruby
60
123
  # <span data-tooltip="3 Comments">My Text</span>
61
- doc.search('span').attr_matches('data-tooltip', /(\d+) comments/i).first.text # My Text
62
- doc.search('span').attr_matches('data-tooltip', /(\d+) comments/i).first.matches # ["3 Comments", "3"]
124
+ doc.attr_matches('data-tooltip', /(\d+) comments/i).first.text # "My Text"
125
+ doc.attr_matches('data-tooltip', /(\d+) comments/i).first.matches # ["3 Comments", "3"]
126
+ ```
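The contents of `matches` follow Ruby's usual `MatchData#to_a` layout: the full match first, then one entry per capture group. A plain-Ruby illustration of that capture step, with no Nokogiri involved (the `tooltip` variable simply stands in for the attribute value):

```ruby
# Plain-Ruby illustration of the capture step; no Nokogiri required.
tooltip = '3 Comments'

if tooltip.match(/(\d+) comments/i)
  matches = $~.to_a # full match first, then each capture group
end

matches         # => ["3 Comments", "3"]
matches[1].to_i # => 3
```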
127
+
128
+ #### drill(*methods)
129
+
130
+ Nil-safe method chaining. Replaces this:
131
+
132
+ ```ruby
133
+ node = doc.find('.count')
134
+ if node
135
+ attribute = node.attr('data-count')
136
+ if attribute
137
+ return attribute.to_i
138
+ end
139
+ end
140
+ ```
141
+
142
+ With this:
143
+
144
+ ```ruby
145
+ return doc.drill([:find, '.count'], [:attr, 'data-count'], :to_i)
146
+ ```
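The idea behind nil-safe chaining can be sketched in a few lines of plain Ruby. The version below is a hypothetical standalone function written for illustration; the `Drillable` module's actual implementation is not shown in this diff and may differ. Each step is either a method name or an array of a method name plus arguments, and the chain stops as soon as an intermediate value is `nil`.

```ruby
# Hypothetical sketch of nil-safe chaining; Nikkou's Drillable module may
# be implemented differently.
def drill(object, *steps)
  steps.reduce(object) do |current, step|
    break nil if current.nil? # stop the chain instead of raising NoMethodError

    method_name, *args = Array(step)
    current.public_send(method_name, *args)
  end
end

data = { count: '42' }
drill(data, [:fetch, :count], :to_i) # => 42
drill(data, [:[], :missing], :to_i)  # => nil
```

In the second call the lookup returns `nil`, so `:to_i` is never sent and the result is `nil` rather than `0`.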
147
+
148
+ #### find(path)
149
+
150
+ Same as `search`, but returns the first matched node. Replaces this:
151
+
152
+ ```ruby
153
+ nodes = node.search('h4')
154
+ if nodes
155
+ return nodes.first
156
+ end
157
+ ```
158
+
159
+ With this:
160
+
161
+ ```ruby
162
+ return node.find('h4')
163
+ ```
164
+
165
+ #### text_includes(string)
166
+
167
+ Selects nodes where the text includes the string.
168
+
169
+ ```ruby
170
+ # <div data-type="news">My Text</div>
171
+ doc.text_includes('Text').first.text # "My Text"
63
172
  ```
64
173
 
65
- ### text_matches(attribute, pattern)
174
+ #### text_matches(pattern)
66
175
 
67
- Selects nodes with text matching a pattern. The pattern's matches are stored in `Node#matches`.
176
+ Selects nodes with text matching a pattern. The pattern's matches are available in `Node#matches`.
68
177
 
69
178
  ```ruby
70
179
  # <a href="/p/1">3 Comments</a>
71
- doc.search('span').text_matches(/^(\d+) comments$/i).first.attr('href') # "/p/1"
72
- doc.search('span').text_matches(/^(\d+) comments$/i).first.matches # ["3 Comments", "3"]
180
+ doc.text_matches(/^(\d+) comments$/i).first.attr('href') # "/p/1"
181
+ doc.text_matches(/^(\d+) comments$/i).first.matches # ["3 Comments", "3"]
73
182
  ```
74
183
 
75
184
  License
data/Rakefile CHANGED
@@ -16,7 +16,7 @@ RDoc::Task.new(:rdoc) do |rdoc|
16
16
  rdoc.rdoc_dir = 'rdoc'
17
17
  rdoc.title = 'Nikkou'
18
18
  rdoc.options << '--line-numbers'
19
- rdoc.rdoc_files.include('README.rdoc')
19
+ rdoc.rdoc_files.include('README.md')
20
20
  rdoc.rdoc_files.include('lib/**/*.rb')
21
21
  end
22
22
 
@@ -6,14 +6,68 @@ module Nikkou
6
6
  include Nikkou::Findable
7
7
 
8
8
  attr_accessor :matches
9
+
10
+ def attr_equals(attribute, string)
11
+ list = []
12
+ traverse do |node|
13
+ list << node if node.attr(attribute) == string
14
+ end
15
+ ::Nokogiri::XML::NodeSet.new(document, list)
16
+ end
17
+
18
+ def attr_includes(attribute, string)
19
+ list = []
20
+ traverse do |node|
21
+ next if node.attr(attribute).nil?
22
+ list << node if node.attr(attribute).include?(string)
23
+ end
24
+ ::Nokogiri::XML::NodeSet.new(document, list)
25
+ end
26
+
27
+ def attr_matches(attribute, pattern)
28
+ list = []
29
+ traverse do |node|
30
+ next if node.attr(attribute).nil?
31
+ if node.attr(attribute).match(pattern)
32
+ node.matches = $~.to_a
33
+ list << node
34
+ end
35
+ end
36
+ ::Nokogiri::XML::NodeSet.new(document, list)
37
+ end
9
38
 
10
- def url(attribute='href')
11
- return nil if attr(attribute).nil? || document.nil? || document.uri.nil?
12
- href = attr(attribute)
13
- return href if href =~ /^https?:\/\//
14
- return "http:#{href}" if href.start_with?('//')
15
- root_url = "#{document.uri.scheme}://#{document.uri.host}"
16
- URI.join(root_url, href).to_s
39
+ def parse_text
40
+ parse(text)
41
+ end
42
+
43
+ def text_equals(string)
44
+ list = []
45
+ traverse do |node|
46
+ next if node.is_a?(::Nokogiri::XML::Text)
47
+ list << node if node.text == string
48
+ end
49
+ ::Nokogiri::XML::NodeSet.new(document, list)
50
+ end
51
+
52
+ def text_includes(string)
53
+ list = []
54
+ traverse do |node|
55
+ next if node.is_a?(::Nokogiri::XML::Text)
56
+ list << node if node.text.include?(string)
57
+ end
58
+ ::Nokogiri::XML::NodeSet.new(document, list)
59
+ end
60
+
61
+ def text_matches(pattern)
62
+ list = []
63
+ traverse do |node|
64
+ next if node.is_a?(::Nokogiri::XML::Text)
65
+ if node.text.match(pattern)
66
+ node.matches = $~.to_a
67
+ list << node
68
+ end
69
+ end
70
+ ::Nokogiri::XML::NodeSet.new(document, list)
17
71
  end
18
72
 
19
73
  def time(options={})
@@ -33,6 +87,16 @@ module Nikkou
33
87
  end
34
88
  time_zone.local_to_utc(time)
35
89
  end
90
+
91
+ def url(attribute='href')
92
+ return nil if attr(attribute).nil?
93
+ href = attr(attribute)
94
+ return href if href =~ /^https?:\/\//
95
+ return "http:#{href}" if href.start_with?('//')
96
+ return nil if document.nil? || document.uri.nil?
97
+ root_url = "#{document.uri.scheme}://#{document.uri.host}"
98
+ URI.join(root_url, href).to_s
99
+ end
36
100
  end
37
101
  end
38
102
  end
@@ -5,6 +5,14 @@ module Nikkou
5
5
  include Nikkou::Drillable
6
6
  include Nikkou::Findable
7
7
 
8
+ def attr_equals(attribute, string)
9
+ list = select do |node|
10
+ next false if node.attr(attribute).nil?
11
+ node.attr(attribute) == string
12
+ end
13
+ self.class.new(document, list)
14
+ end
15
+
8
16
  def attr_includes(attribute, string)
9
17
  list = select do |node|
10
18
  next false if node.attr(attribute).nil?
@@ -25,8 +33,17 @@ module Nikkou
25
33
  self.class.new(document, list)
26
34
  end
27
35
 
36
+ def text_equals(string)
37
+ list = select do |node|
38
+ next if node.is_a?(::Nokogiri::XML::Text)
39
+ node.text == string
40
+ end
41
+ self.class.new(document, list)
42
+ end
43
+
28
44
  def text_includes(string)
29
45
  list = select do |node|
46
+ next if node.is_a?(::Nokogiri::XML::Text)
30
47
  node.text.include?(string)
31
48
  end
32
49
  self.class.new(document, list)
@@ -35,6 +52,7 @@ module Nikkou
35
52
  def text_matches(pattern)
36
53
  list = []
37
54
  each do |node|
55
+ next if node.is_a?(::Nokogiri::XML::Text)
38
56
  if node.text.match(pattern)
39
57
  node.matches = $~.to_a
40
58
  list << node
@@ -1,3 +1,3 @@
1
1
  module Nikkou
2
- VERSION = "0.0.2"
2
+ VERSION = '0.0.3'
3
3
  end
@@ -1,7 +1,7 @@
1
1
  require 'spec_helper'
2
2
 
3
3
  describe Nikkou::Drillable do
4
- before do
4
+ before(:all) do
5
5
  assets_directory = File.expand_path(File.join(File.dirname(__FILE__), 'files'))
6
6
  html_file = File.join(assets_directory, 'test.html')
7
7
  @html = Nokogiri::HTML.parse(File.read(html_file))
@@ -18,5 +18,10 @@
18
18
  </ul>
19
19
  </div>
20
20
  </div>
21
+ <div class="xml-node">
22
+ &lt;div class=&quot;xml-encoded-node&quot;&gt;
23
+ xml encoded node value
24
+ &lt;/div&gt;
25
+ </div>
21
26
  </body>
22
27
  </html>
@@ -1,7 +1,7 @@
1
1
  require 'spec_helper'
2
2
 
3
3
  describe Nikkou::Findable do
4
- before do
4
+ before(:all) do
5
5
  assets_directory = File.expand_path(File.join(File.dirname(__FILE__), 'files'))
6
6
  html_file = File.join(assets_directory, 'test.html')
7
7
  @html = Nokogiri::HTML.parse(File.read(html_file))
@@ -1,12 +1,19 @@
1
1
  require 'spec_helper'
2
2
 
3
3
  describe Nokogiri::XML::NodeSet do
4
- before do
4
+ before(:all) do
5
5
  assets_directory = File.expand_path(File.join(File.dirname(__FILE__), 'files'))
6
6
  html_file = File.join(assets_directory, 'test.html')
7
7
  @html = Nokogiri::HTML.parse(File.read(html_file))
8
8
  end
9
9
 
10
+ describe '.attr_equals' do
11
+ it 'finds nodes' do
12
+ nodes = @html.search('a').attr_equals('href', 'http://www.ipsum.com/')
13
+ nodes.first.text.should == 'ipsum'
14
+ end
15
+ end
16
+
10
17
  describe '.attr_includes' do
11
18
  it 'finds nodes' do
12
19
  nodes = @html.search('a').attr_includes('href', 'ipsum.com')
@@ -26,6 +33,20 @@ describe Nokogiri::XML::NodeSet do
26
33
  end
27
34
  end
28
35
 
36
+ describe '.text_equals' do
37
+ it 'finds nodes' do
38
+ nodes = @html.search('a').text_equals('ipsum')
39
+ nodes.first.text.should == 'ipsum'
40
+ end
41
+ end
42
+
43
+ describe '.text_includes' do
44
+ it 'finds nodes' do
45
+ nodes = @html.search('a').text_includes('ipsum')
46
+ nodes.first.text.should == 'ipsum'
47
+ end
48
+ end
49
+
29
50
  describe '.text_matches' do
30
51
  it 'finds nodes' do
31
52
  nodes = @html.search('a').text_matches(/(\d+) comments/)
@@ -37,11 +58,4 @@ describe Nokogiri::XML::NodeSet do
37
58
  nodes.first.matches.should == ['12 comments', '12']
38
59
  end
39
60
  end
40
-
41
- describe '.text_includes' do
42
- it 'finds nodes' do
43
- nodes = @html.search('a').text_includes('ipsum')
44
- nodes.first.text.should == 'ipsum'
45
- end
46
- end
47
61
  end
@@ -1,20 +1,77 @@
1
1
  require 'spec_helper'
2
2
 
3
3
  describe Nokogiri::XML::Node do
4
- before do
4
+ before(:all) do
5
5
  assets_directory = File.expand_path(File.join(File.dirname(__FILE__), 'files'))
6
6
  html_file = File.join(assets_directory, 'test.html')
7
7
  @html = Nokogiri::HTML.parse(File.read(html_file))
8
8
  @html.uri = 'http://www.loremipsum.com/page/2'
9
+
10
+ # Set the time zone for .time
11
+ Time.zone = 'Pacific Time (US & Canada)'
9
12
  end
10
13
 
11
- describe '.url' do
12
- it 'reads absolute URLs' do
13
- @html.search('a.absolute-url').first.url.should == 'http://www.absoluteurl.com/'
14
+ describe '.attr_equals' do
15
+ it 'finds nodes' do
16
+ nodes = @html.search('body').first.attr_equals('href', 'http://www.ipsum.com/')
17
+ nodes.first.text.should == 'ipsum'
14
18
  end
19
+ end
15
20
 
16
- it 'reads relative URLs' do
17
- @html.search('a.relative-url').first.url.should == 'http://www.loremipsum.com/p/1'
21
+ describe '.attr_includes' do
22
+ it 'finds nodes' do
23
+ nodes = @html.search('body').first.attr_includes('href', 'ipsum.com')
24
+ nodes.first.text.should == 'ipsum'
25
+ end
26
+ end
27
+
28
+ describe '.attr_matches' do
29
+ it 'finds nodes' do
30
+ nodes = @html.search('body').first.attr_matches('href', /(lorem|ipsum)\.com/)
31
+ nodes.first.text.should == 'ipsum'
32
+ end
33
+
34
+ it 'sets matches' do
35
+ nodes = @html.search('body').first.attr_matches('href', /(lorem|ipsum)\.com/)
36
+ nodes.first.matches.should == ['ipsum.com', 'ipsum']
37
+ end
38
+ end
39
+
40
+ describe '.parse_text' do
41
+ it 'converts the node\'s text to a node set' do
42
+ nodes = @html.search('.xml-node').first.parse_text
43
+ nodes.should be_an_instance_of(Nokogiri::XML::NodeSet)
44
+ end
45
+
46
+ it 'returns a node set that contains the correct content' do
47
+ nodes = @html.search('.xml-node').first.parse_text
48
+ nodes.search('.xml-encoded-node').length.should == 1
49
+ end
50
+ end
51
+
52
+ describe '.text_equals' do
53
+ it 'finds nodes' do
54
+ nodes = @html.search('body').first.text_equals('ipsum')
55
+ nodes.first.text.should == 'ipsum'
56
+ end
57
+ end
58
+
59
+ describe '.text_includes' do
60
+ it 'finds nodes' do
61
+ nodes = @html.search('body').first.text_includes('ipsum')
62
+ nodes.first.text.should == 'ipsum'
63
+ end
64
+ end
65
+
66
+ describe '.text_matches' do
67
+ it 'finds nodes' do
68
+ nodes = @html.search('body').first.text_matches(/(\d+) comments/)
69
+ nodes.first.text.should == '12 comments'
70
+ end
71
+
72
+ it 'sets matches' do
73
+ nodes = @html.search('body').first.text_matches(/(\d+) comments/)
74
+ nodes.first.matches.should == ['12 comments', '12']
18
75
  end
19
76
  end
20
77
 
@@ -31,4 +88,14 @@ describe Nokogiri::XML::Node do
31
88
  @html.search('.post-published-at').first.time(attribute: 'data-published-at', time_zone: 'America/New_York').to_s.should == '2013-04-01 04:00:00 UTC'
32
89
  end
33
90
  end
91
+
92
+ describe '.url' do
93
+ it 'reads absolute URLs' do
94
+ @html.search('a.absolute-url').first.url.should == 'http://www.absoluteurl.com/'
95
+ end
96
+
97
+ it 'reads relative URLs' do
98
+ @html.search('a.relative-url').first.url.should == 'http://www.loremipsum.com/p/1'
99
+ end
100
+ end
34
101
  end
@@ -2,7 +2,6 @@ ENV["RAILS_ENV"] ||= 'test'
2
2
 
3
3
  require 'rspec'
4
4
  require 'nikkou'
5
- require 'pry'
6
5
 
7
6
  RSpec.configure do |config|
8
7
  config.color_enabled = true
metadata CHANGED
@@ -1,7 +1,7 @@
1
1
  --- !ruby/object:Gem::Specification
2
2
  name: nikkou
3
3
  version: !ruby/object:Gem::Version
4
- version: 0.0.2
4
+ version: 0.0.3
5
5
  prerelease:
6
6
  platform: ruby
7
7
  authors:
@@ -9,7 +9,7 @@ authors:
9
9
  autorequire:
10
10
  bindir: bin
11
11
  cert_chain: []
12
- date: 2013-04-23 00:00:00.000000000 Z
12
+ date: 2013-06-02 00:00:00.000000000 Z
13
13
  dependencies:
14
14
  - !ruby/object:Gem::Dependency
15
15
  name: nokogiri
@@ -91,7 +91,7 @@ dependencies:
91
91
  - - ! '>='
92
92
  - !ruby/object:Gem::Version
93
93
  version: '0'
94
- description: Utilities for Nokogiri
94
+ description: Extract useful data from HTML and XML with ease!
95
95
  email:
96
96
  - tombenner@gmail.com
97
97
  executables: []
@@ -141,7 +141,7 @@ rubyforge_project:
141
141
  rubygems_version: 1.8.24
142
142
  signing_key:
143
143
  specification_version: 3
144
- summary: Utilities for Nokogiri
144
+ summary: Extract useful data from HTML and XML with ease!
145
145
  test_files:
146
146
  - spec/drillable_spec.rb
147
147
  - spec/files/test.html