feed-normalizer 1.5.1 → 1.5.2
- data/History.txt +48 -48
- data/License.txt +27 -27
- data/Manifest.txt +18 -19
- data/README.txt +63 -63
- data/Rakefile +29 -25
- data/lib/feed-normalizer.rb +149 -149
- data/lib/html-cleaner.rb +181 -190
- data/lib/parsers/rss.rb +110 -95
- data/lib/parsers/simple-rss.rb +138 -137
- data/lib/structures.rb +245 -244
- data/test/data/atom03.xml +128 -127
- data/test/data/atom10.xml +114 -112
- data/test/data/rdf10.xml +1498 -1498
- data/test/data/rss20.xml +64 -63
- data/test/data/rss20diff.xml +59 -59
- data/test/data/rss20diff_short.xml +51 -51
- data/test/test_feednormalizer.rb +265 -267
- data/test/test_htmlcleaner.rb +156 -155
- metadata +99 -63
- data/test/test_all.rb +0 -6
data/History.txt
CHANGED
@@ -2,51 +2,51 @@
 
 * Fix a bug that was breaking the parsing process for certain feeds. [reported by: Patrick Minton]
 
-1.5.0
-
-* Add support for new fields:
-* Atom 0.3: issued is now available through entry.date_published.
-* RSS: feed.skip_hours, feed.skip_days, feed.ttl [joshpeek]
-* All: entry.last_updated, this is an alias to entry.date_published for RSS.
-* Rewrite relative links in content [joshpeek]
-* Handle CDATA sections consistently across all formats. [sam.lown]
-* Prevent SimpleRSS from doing its own escaping. [reported by: paul.stadig, lionel.bouton]
-* Reparse Time classes [reported by: sam.lown]
-
-1.4.0
-
-* Support content:encoded. Accessible via Entry#content.
-* Support categories. Accessible via Entry#categories.
-* Introduces a new parsing feature 'loose parsing'. Use :loose => true
-when parsing if the required output should retain extra data, rather
-than drop it in the interests of 'lowest common denomiator' normalization.
-Currently affects how categories works. See the documentation in
-FeedNormalizer#parse for more details.
-
-1.3.2
-
-* Add support for applicable dublin core elements. (dc:date and dc:creator)
-* Feeds can now be dumped to YAML.
-
-1.3.1
-
-* Small changes to work with hpricot 0.6. This release depends on hpricot 0.6.
-* Reduced the greediness of a regexp that was removing html comments.
-
-1.3.0
-
-* Small changes to work with hpricot 0.5.
-
-1.2.0
-
-* Added HtmlCleaner - sanitizes HTML and removes 'bad' URIs to a level suitable
-for 'safe' display inside a web browser. Can be used as a standalone library,
-or as part of the Feed object. See Feed.clean! for details about cleaning a
-Feed instance. Also see HtmlCleaner and its unit tests. Uses Hpricot.
-* Added Feed-diffing. Differences between two feeds can be displayed using
-Feed.diff. Works nicely with YAML for a readable diff.
-* FeedNormalizer.parse now takes a hash for its arguments.
-* Removed FN::Content.
-* Now uses Hoe!
-
-
+1.5.0
+
+* Add support for new fields:
+* Atom 0.3: issued is now available through entry.date_published.
+* RSS: feed.skip_hours, feed.skip_days, feed.ttl [joshpeek]
+* All: entry.last_updated, this is an alias to entry.date_published for RSS.
+* Rewrite relative links in content [joshpeek]
+* Handle CDATA sections consistently across all formats. [sam.lown]
+* Prevent SimpleRSS from doing its own escaping. [reported by: paul.stadig, lionel.bouton]
+* Reparse Time classes [reported by: sam.lown]
+
+1.4.0
+
+* Support content:encoded. Accessible via Entry#content.
+* Support categories. Accessible via Entry#categories.
+* Introduces a new parsing feature 'loose parsing'. Use :loose => true
+when parsing if the required output should retain extra data, rather
+than drop it in the interests of 'lowest common denomiator' normalization.
+Currently affects how categories works. See the documentation in
+FeedNormalizer#parse for more details.
+
+1.3.2
+
+* Add support for applicable dublin core elements. (dc:date and dc:creator)
+* Feeds can now be dumped to YAML.
+
+1.3.1
+
+* Small changes to work with hpricot 0.6. This release depends on hpricot 0.6.
+* Reduced the greediness of a regexp that was removing html comments.
+
+1.3.0
+
+* Small changes to work with hpricot 0.5.
+
+1.2.0
+
+* Added HtmlCleaner - sanitizes HTML and removes 'bad' URIs to a level suitable
+for 'safe' display inside a web browser. Can be used as a standalone library,
+or as part of the Feed object. See Feed.clean! for details about cleaning a
+Feed instance. Also see HtmlCleaner and its unit tests. Uses Hpricot.
+* Added Feed-diffing. Differences between two feeds can be displayed using
+Feed.diff. Works nicely with YAML for a readable diff.
+* FeedNormalizer.parse now takes a hash for its arguments.
+* Removed FN::Content.
+* Now uses Hoe!
+
+
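The 1.4.0 entry above describes 'loose parsing' as keeping extra data instead of dropping it for lowest-common-denominator output, and says it currently affects categories. A minimal sketch of that distinction follows. The helper name `normalize_categories` and the sample data are hypothetical and are not taken from the gem; the gem's real behaviour is documented in FeedNormalizer#parse.

```ruby
# Hypothetical helper, not part of feed-normalizer: with loose parsing every
# category is kept; otherwise only the first one survives, which matches what
# the most limited parser could return.
def normalize_categories(categories, loose = false)
  list = Array(categories).map { |c| c.to_s.strip }.reject(&:empty?)
  loose ? list : list.first(1)
end

strict_result = normalize_categories(["ruby", "feeds", "atom"])
loose_result  = normalize_categories(["ruby", "feeds", "atom"], true)
```

In this sketch the strict call keeps a single category, while the loose call keeps all three.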
data/License.txt
CHANGED
@@ -1,27 +1,27 @@
-Copyright (c) 2006-2007, Andrew A. Smith
-All rights reserved.
-
-Redistribution and use in source and binary forms, with or without modification,
-are permitted provided that the following conditions are met:
-
-* Redistributions of source code must retain the above copyright notice,
-this list of conditions and the following disclaimer.
-
-* Redistributions in binary form must reproduce the above copyright notice,
-this list of conditions and the following disclaimer in the documentation
-and/or other materials provided with the distribution.
-
-* Neither the name of the copyright owner nor the names of its contributors
-may be used to endorse or promote products derived from this software
-without specific prior written permission.
-
-THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND
-ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
-WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
-DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR
-ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES
-(INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES;
-LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON
-ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
-(INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
-SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+Copyright (c) 2006-2007, Andrew A. Smith
+All rights reserved.
+
+Redistribution and use in source and binary forms, with or without modification,
+are permitted provided that the following conditions are met:
+
+* Redistributions of source code must retain the above copyright notice,
+this list of conditions and the following disclaimer.
+
+* Redistributions in binary form must reproduce the above copyright notice,
+this list of conditions and the following disclaimer in the documentation
+and/or other materials provided with the distribution.
+
+* Neither the name of the copyright owner nor the names of its contributors
+may be used to endorse or promote products derived from this software
+without specific prior written permission.
+
+THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND
+ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
+WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
+DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR
+ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES
+(INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES;
+LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON
+ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+(INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
+SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
data/Manifest.txt
CHANGED
@@ -1,19 +1,18 @@
-History.txt
-License.txt
-Manifest.txt
-Rakefile
-README.txt
-lib/feed-normalizer.rb
-lib/html-cleaner.rb
-lib/parsers/rss.rb
-lib/parsers/simple-rss.rb
-lib/structures.rb
-test/data/atom03.xml
-test/data/atom10.xml
-test/data/rdf10.xml
-test/data/rss20.xml
-test/data/rss20diff.xml
-test/data/rss20diff_short.xml
-test/
-test/
-test/test_htmlcleaner.rb
+History.txt
+License.txt
+Manifest.txt
+Rakefile
+README.txt
+lib/feed-normalizer.rb
+lib/html-cleaner.rb
+lib/parsers/rss.rb
+lib/parsers/simple-rss.rb
+lib/structures.rb
+test/data/atom03.xml
+test/data/atom10.xml
+test/data/rdf10.xml
+test/data/rss20.xml
+test/data/rss20diff.xml
+test/data/rss20diff_short.xml
+test/test_feednormalizer.rb
+test/test_htmlcleaner.rb
data/README.txt
CHANGED
@@ -1,63 +1,63 @@
-== Feed Normalizer
-
-An extensible Ruby wrapper for Atom and RSS parsers.
-
-Feed normalizer wraps various RSS and Atom parsers, and returns a single unified
-object graph, regardless of the underlying feed format.
-
-== Download
-
-* gem install feed-normalizer
-* http://rubyforge.org/projects/feed-normalizer
-* svn co http://feed-normalizer.googlecode.com/svn/trunk
-
-== Usage
-
-  require 'feed-normalizer'
-  require 'open-uri'
-
-  feed = FeedNormalizer::FeedNormalizer.parse open('http://www.iht.com/rss/frontpage.xml')
-
-  feed.title # => "International Herald Tribune"
-  feed.url # => "http://www.iht.com/pages/index.php"
-  feed.entries.first.url # => "http://www.iht.com/articles/2006/10/03/frontpage/web.1003UN.php"
-
-  feed.class # => FeedNormalizer::Feed
-  feed.parser # => "RSS::Parser"
-
-Now read an Atom feed, and the same class is returned, and the same terminology applies:
-
-  feed = FeedNormalizer::FeedNormalizer.parse open('http://www.atomenabled.org/atom.xml')
-
-  feed.title # => "AtomEnabled.org"
-  feed.url # => "http://www.atomenabled.org/atom.xml"
-  feed.entries.first.url # => "http://www.atomenabled.org/2006/09/moving-toward-atom.php"
-
-The feed representation stays the same, even though a different parser was used.
-
-  feed.class # => FeedNormalizer::Feed
-  feed.parser # => "SimpleRSS"
-
-== Cleaning / Sanitizing
-
-  feed.title # => "My Feed > Your Feed"
-  feed.entries.first.content # => "<p x='y'>Hello</p><object></object></html>"
-  feed.clean!
-
-All elements should now be either clean HTML, or HTML escaped strings.
-
-  feed.title # => "My Feed &gt; Your Feed"
-  feed.entries.first.content # => "<p>Hello</p>"
-
-== Extending
-
-Implement a parser wrapper by extending the FeedNormalizer::Parser class and overriding
-the public methods. Also note the helper methods in the root Parser object to make
-mapping of output from the particular parser to the Feed object easier.
-
-See FeedNormalizer::RubyRssParser and FeedNormalizer::SimpleRssParser for examples.
-
-== Authors
-* Andrew A. Smith (andy@tinnedfruit.org)
-
-This library is released under the terms of the BSD License (see the License.txt file for details).
+== Feed Normalizer
+
+An extensible Ruby wrapper for Atom and RSS parsers.
+
+Feed normalizer wraps various RSS and Atom parsers, and returns a single unified
+object graph, regardless of the underlying feed format.
+
+== Download
+
+* gem install feed-normalizer
+* http://rubyforge.org/projects/feed-normalizer
+* svn co http://feed-normalizer.googlecode.com/svn/trunk
+
+== Usage
+
+  require 'feed-normalizer'
+  require 'open-uri'
+
+  feed = FeedNormalizer::FeedNormalizer.parse open('http://www.iht.com/rss/frontpage.xml')
+
+  feed.title # => "International Herald Tribune"
+  feed.url # => "http://www.iht.com/pages/index.php"
+  feed.entries.first.url # => "http://www.iht.com/articles/2006/10/03/frontpage/web.1003UN.php"
+
+  feed.class # => FeedNormalizer::Feed
+  feed.parser # => "RSS::Parser"
+
+Now read an Atom feed, and the same class is returned, and the same terminology applies:
+
+  feed = FeedNormalizer::FeedNormalizer.parse open('http://www.atomenabled.org/atom.xml')
+
+  feed.title # => "AtomEnabled.org"
+  feed.url # => "http://www.atomenabled.org/atom.xml"
+  feed.entries.first.url # => "http://www.atomenabled.org/2006/09/moving-toward-atom.php"
+
+The feed representation stays the same, even though a different parser was used.
+
+  feed.class # => FeedNormalizer::Feed
+  feed.parser # => "SimpleRSS"
+
+== Cleaning / Sanitizing
+
+  feed.title # => "My Feed > Your Feed"
+  feed.entries.first.content # => "<p x='y'>Hello</p><object></object></html>"
+  feed.clean!
+
+All elements should now be either clean HTML, or HTML escaped strings.
+
+  feed.title # => "My Feed &gt; Your Feed"
+  feed.entries.first.content # => "<p>Hello</p>"
+
+== Extending
+
+Implement a parser wrapper by extending the FeedNormalizer::Parser class and overriding
+the public methods. Also note the helper methods in the root Parser object to make
+mapping of output from the particular parser to the Feed object easier.
+
+See FeedNormalizer::RubyRssParser and FeedNormalizer::SimpleRssParser for examples.
+
+== Authors
+* Andrew A. Smith (andy@tinnedfruit.org)
+
+This library is released under the terms of the BSD License (see the License.txt file for details).
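The Extending section of the README mentions helper methods on the root Parser class that map a parser's output onto the Feed object. The sketch below illustrates that kind of field mapping in isolation. `DemoFeed`, `copy_fields!`, and the sample hash are hypothetical names chosen for illustration; this is a simplified stand-in under those assumptions, not the gem's own helper.

```ruby
# Hypothetical stand-in for a normalized feed object (not the gem's Feed class).
DemoFeed = Struct.new(:title, :url, :authors)

# For each destination field, take the first non-empty source field.
# Destinations that respond to #push are appended to; others are assigned.
def copy_fields!(mapping, src, dest)
  mapping.each do |dest_field, src_fields|
    Array(src_fields).each do |src_field|
      value = src.respond_to?(src_field) ? src.send(src_field) : src[src_field]
      next if value.to_s.empty?

      current = dest.send(dest_field)
      if current.respond_to?(:push)
        current.push(value)
      else
        dest.send(:"#{dest_field}=", value)
      end
      break # first non-empty source wins
    end
  end
  dest
end

raw = { :title => "Example Feed", :link => "http://example.com/", :creator => "someone" }
feed = copy_fields!(
  { :title => :title, :url => [:guid, :link], :authors => [:author, :creator] },
  raw,
  DemoFeed.new(nil, nil, [])
)
```

Listing several source fields per destination lets one mapping table cover parsers that expose the same data under different names.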
data/Rakefile
CHANGED
@@ -1,25 +1,29 @@
-require 'hoe'
-
-
-
-
-
-s.
-s.
-s.
-s.
-s.
-s.
-s.
-
-
-
-
-
-
-
-
-
-
-
-
+require 'hoe'
+
+$: << "lib"
+require 'feed-normalizer'
+
+Hoe.spec("feed-normalizer") do |s|
+  s.version = "1.5.2"
+  s.author = "Andrew A. Smith"
+  s.email = "andy@tinnedfruit.org"
+  s.url = "http://github.com/aasmith/feed-normalizer"
+  s.summary = "Extensible Ruby wrapper for Atom and RSS parsers"
+  s.description = s.paragraphs_of('README.txt', 1..2).join("\n\n")
+  s.changes = s.paragraphs_of('History.txt', 0..1).join("\n\n")
+  s.extra_deps << ["simple-rss", ">= 1.1"]
+  s.extra_deps << ["hpricot", ">= 0.6"]
+  s.need_zip = true
+  s.need_tar = false
+end
+
+
+begin
+  require 'rcov/rcovtask'
+  Rcov::RcovTask.new("rcov") do |t|
+    t.test_files = Dir['test/test_all.rb']
+  end
+rescue LoadError
+  nil
+end
+
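The new Rakefile wraps the rcov task in begin/rescue LoadError, so a missing rcov does not stop the file from loading. The same optional-dependency pattern can be sketched on its own as follows. The helper name `optional_require` and the missing library name are hypothetical.

```ruby
# Try to load an optional library and report whether it is available,
# instead of letting LoadError abort the whole file.
def optional_require(name)
  require name
  true
rescue LoadError
  false
end

have_time    = optional_require("time")                     # part of Ruby's standard library
have_missing = optional_require("no_such_library_for_demo") # hypothetical, expected to be absent
```

Tasks that depend on the optional library can then be defined only when the check succeeds.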
data/lib/feed-normalizer.rb
CHANGED
@@ -1,149 +1,149 @@
-require 'structures'
-require 'html-cleaner'
-
-module FeedNormalizer
-
-  # The root parser object. Every parser must extend this object.
-  class Parser
-
-    # Parser being used.
-    def self.parser
-      nil
-    end
-
-    # Parses the given feed, and returns a normalized representation.
-    # Returns nil if the feed could not be parsed.
-    def self.parse(feed, loose)
-      nil
-    end
-
-    # Returns a number to indicate parser priority.
-    # The lower the number, the more likely the parser will be used first,
-    # and vice-versa.
-    def self.priority
-      0
-    end
-
-    protected
-
-    # Some utility methods that can be used by subclasses.
-
-    # sets value, or appends to an existing value
-    def self.map_functions!(mapping, src, dest)
-
-      mapping.each do |dest_function, src_functions|
-        src_functions = [src_functions].flatten # pack into array
-
-        src_functions.each do |src_function|
-          value = if src.respond_to?(src_function)
-            src.send(src_function)
-          elsif src.respond_to?(:has_key?)
-            src[src_function]
-          end
-
-          unless value.to_s.empty?
-            append_or_set!(value, dest, dest_function)
-            break
-          end
-        end
-
-      end
-    end
-
-    def self.append_or_set!(value, object, object_function)
-      if object.send(object_function).respond_to? :push
-        object.send(object_function).push(value)
-      else
-        object.send(:"#{object_function}=", value)
-      end
-    end
-
-    private
-
-    # Callback that ensures that every parser gets registered.
-    def self.inherited(subclass)
-      ParserRegistry.register(subclass)
-    end
-
-  end
-
-
-  # The parser registry keeps a list of current parsers that are available.
-  class ParserRegistry
-
-    @@parsers = []
-
-    def self.register(parser)
-      @@parsers << parser
-    end
-
-    # Returns a list of currently registered parsers, in order of priority.
-    def self.parsers
-      @@parsers.sort_by { |parser| parser.priority }
-    end
-
-  end
-
-
-  class FeedNormalizer
-
-    # Parses the given xml and attempts to return a normalized Feed object.
-    # Setting +force_parser+ to a suitable parser will mean that parser is
-    # used first, and if +try_others+ is false, it is the only parser used,
-    # otherwise all parsers in the ParserRegistry are attempted, in
-    # order of priority.
-    #
-    # ===Available options
-    #
-    # * <tt>:force_parser</tt> - instruct feed-normalizer to try the specified
-    # parser first. Takes a class, such as RubyRssParser, or SimpleRssParser.
-    #
-    # * <tt>:try_others</tt> - +true+ or +false+, defaults to +true+.
-    # If +true+, other parsers will be used as described above. The option
-    # is useful if combined with +force_parser+ to only use a single parser.
-    #
-    # * <tt>:loose</tt> - +true+ or +false+, defaults to +false+.
-    #
-    # Specifies parsing should be done loosely. This means that when
-    # feed-normalizer would usually throw away data in order to meet
-    # the requirement of keeping resulting feed outputs the same regardless
-    # of the underlying parser, the data will instead be kept. This currently
-    # affects the following items:
-    # * <em>Categories:</em> RSS allows for multiple categories per feed item.
-    # * <em>Limitation:</em> SimpleRSS can only return the first category
-    # for an item.
-    # * <em>Result:</em> When loose is true, the extra categories are kept,
-    # of course, only if the parser is not SimpleRSS.
-    def self.parse(xml, opts = {})
-
-      # Get a string ASAP, as multiple read()'s will start returning nil..
-      xml = xml.respond_to?(:read) ? xml.read : xml.to_s
-
-      if opts[:force_parser]
-        result = opts[:force_parser].parse(xml, opts[:loose])
-
-        return result if result
-        return nil if opts[:try_others] == false
-      end
-
-      ParserRegistry.parsers.each do |parser|
-        result = parser.parse(xml, opts[:loose])
-        return result if result
-      end
-
-      # if we got here, no parsers worked.
-      return nil
-    end
-  end
-
-
-  parser_dir = File.dirname(__FILE__) + '/parsers'
-
-  # Load up the parsers
-  Dir.open(parser_dir).each do |fn|
-    next unless fn =~ /[.]rb$/
-    require "parsers/#{fn}"
-  end
-
-end
-
+require 'structures'
+require 'html-cleaner'
+
+module FeedNormalizer
+
+  # The root parser object. Every parser must extend this object.
+  class Parser
+
+    # Parser being used.
+    def self.parser
+      nil
+    end
+
+    # Parses the given feed, and returns a normalized representation.
+    # Returns nil if the feed could not be parsed.
+    def self.parse(feed, loose)
+      nil
+    end
+
+    # Returns a number to indicate parser priority.
+    # The lower the number, the more likely the parser will be used first,
+    # and vice-versa.
+    def self.priority
+      0
+    end
+
+    protected
+
+    # Some utility methods that can be used by subclasses.
+
+    # sets value, or appends to an existing value
+    def self.map_functions!(mapping, src, dest)
+
+      mapping.each do |dest_function, src_functions|
+        src_functions = [src_functions].flatten # pack into array
+
+        src_functions.each do |src_function|
+          value = if src.respond_to?(src_function)
+            src.send(src_function)
+          elsif src.respond_to?(:has_key?)
+            src[src_function]
+          end
+
+          unless value.to_s.empty?
+            append_or_set!(value, dest, dest_function)
+            break
+          end
+        end
+
+      end
+    end
+
+    def self.append_or_set!(value, object, object_function)
+      if object.send(object_function).respond_to? :push
+        object.send(object_function).push(value)
+      else
+        object.send(:"#{object_function}=", value)
+      end
+    end
+
+    private
+
+    # Callback that ensures that every parser gets registered.
+    def self.inherited(subclass)
+      ParserRegistry.register(subclass)
+    end
+
+  end
+
+
+  # The parser registry keeps a list of current parsers that are available.
+  class ParserRegistry
+
+    @@parsers = []
+
+    def self.register(parser)
+      @@parsers << parser
+    end
+
+    # Returns a list of currently registered parsers, in order of priority.
+    def self.parsers
+      @@parsers.sort_by { |parser| parser.priority }
+    end
+
+  end
+
+
+  class FeedNormalizer
+
+    # Parses the given xml and attempts to return a normalized Feed object.
+    # Setting +force_parser+ to a suitable parser will mean that parser is
+    # used first, and if +try_others+ is false, it is the only parser used,
+    # otherwise all parsers in the ParserRegistry are attempted, in
+    # order of priority.
+    #
+    # ===Available options
+    #
+    # * <tt>:force_parser</tt> - instruct feed-normalizer to try the specified
+    # parser first. Takes a class, such as RubyRssParser, or SimpleRssParser.
+    #
+    # * <tt>:try_others</tt> - +true+ or +false+, defaults to +true+.
+    # If +true+, other parsers will be used as described above. The option
+    # is useful if combined with +force_parser+ to only use a single parser.
+    #
+    # * <tt>:loose</tt> - +true+ or +false+, defaults to +false+.
+    #
+    # Specifies parsing should be done loosely. This means that when
+    # feed-normalizer would usually throw away data in order to meet
+    # the requirement of keeping resulting feed outputs the same regardless
+    # of the underlying parser, the data will instead be kept. This currently
+    # affects the following items:
+    # * <em>Categories:</em> RSS allows for multiple categories per feed item.
+    # * <em>Limitation:</em> SimpleRSS can only return the first category
+    # for an item.
+    # * <em>Result:</em> When loose is true, the extra categories are kept,
+    # of course, only if the parser is not SimpleRSS.
+    def self.parse(xml, opts = {})
+
+      # Get a string ASAP, as multiple read()'s will start returning nil..
+      xml = xml.respond_to?(:read) ? xml.read : xml.to_s
+
+      if opts[:force_parser]
+        result = opts[:force_parser].parse(xml, opts[:loose])
+
+        return result if result
+        return nil if opts[:try_others] == false
+      end
+
+      ParserRegistry.parsers.each do |parser|
+        result = parser.parse(xml, opts[:loose])
+        return result if result
+      end
+
+      # if we got here, no parsers worked.
+      return nil
+    end
+  end
+
+
+  parser_dir = File.dirname(__FILE__) + '/parsers'
+
+  # Load up the parsers
+  Dir.open(parser_dir).each do |fn|
+    next unless fn =~ /[.]rb$/
+    require "parsers/#{fn}"
+  end
+
+end
+
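The file above combines three ideas: parsers register themselves through the `inherited` hook, the registry returns them sorted by priority, and `parse` tries a forced parser first before falling back to the rest. The following self-contained sketch reproduces that flow in miniature so it can be run without the gem. Everything in the `Demo` module (class names, priority values, and the toy `parse` logic) is hypothetical and simplified; in particular the `loose` argument is left out.

```ruby
module Demo
  # Keeps every parser class that has been defined so far.
  class Registry
    @parsers = []

    class << self
      def register(parser)
        @parsers << parser
      end

      # Lowest priority number first.
      def parsers
        @parsers.sort_by { |parser| parser.priority }
      end
    end
  end

  class BaseParser
    def self.priority
      0
    end

    # Subclasses return a result, or nil when they cannot handle the input.
    def self.parse(_input)
      nil
    end

    # Ruby calls this hook whenever a subclass is defined.
    def self.inherited(subclass)
      super
      Registry.register(subclass)
    end
  end

  # Hypothetical parser that only accepts input starting with "<rss".
  class StrictParser < BaseParser
    def self.priority
      100
    end

    def self.parse(input)
      input.start_with?("<rss") ? "strict:#{input}" : nil
    end
  end

  # Hypothetical parser that accepts anything.
  class LenientParser < BaseParser
    def self.priority
      900
    end

    def self.parse(input)
      "lenient:#{input}"
    end
  end

  # Try the forced parser first, then fall back to the registry
  # unless :try_others is explicitly false.
  def self.parse(input, opts = {})
    if opts[:force_parser]
      result = opts[:force_parser].parse(input)
      return result if result
      return nil if opts[:try_others] == false
    end

    Registry.parsers.each do |parser|
      result = parser.parse(input)
      return result if result
    end
    nil
  end
end
```

With this sketch, `Demo.parse("<feed/>", :force_parser => Demo::StrictParser, :try_others => false)` returns nil, while the same call without `:try_others` falls through to `LenientParser`. One general Ruby point worth keeping in mind when reading the original: a bare `protected` or `private` keyword does not change the visibility of methods defined with `def self.`, so such class methods remain public.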