feed-normalizer 1.5.1 → 1.5.2
- data/History.txt +48 -48
- data/License.txt +27 -27
- data/Manifest.txt +18 -19
- data/README.txt +63 -63
- data/Rakefile +29 -25
- data/lib/feed-normalizer.rb +149 -149
- data/lib/html-cleaner.rb +181 -190
- data/lib/parsers/rss.rb +110 -95
- data/lib/parsers/simple-rss.rb +138 -137
- data/lib/structures.rb +245 -244
- data/test/data/atom03.xml +128 -127
- data/test/data/atom10.xml +114 -112
- data/test/data/rdf10.xml +1498 -1498
- data/test/data/rss20.xml +64 -63
- data/test/data/rss20diff.xml +59 -59
- data/test/data/rss20diff_short.xml +51 -51
- data/test/test_feednormalizer.rb +265 -267
- data/test/test_htmlcleaner.rb +156 -155
- metadata +99 -63
- data/test/test_all.rb +0 -6
data/History.txt
CHANGED
@@ -2,51 +2,51 @@
 
 * Fix a bug that was breaking the parsing process for certain feeds. [reported by: Patrick Minton]
 
-1.5.0
-
-* Add support for new fields:
-  * Atom 0.3: issued is now available through entry.date_published.
-  * RSS: feed.skip_hours, feed.skip_days, feed.ttl [joshpeek]
-  * All: entry.last_updated, this is an alias to entry.date_published for RSS.
-* Rewrite relative links in content [joshpeek]
-* Handle CDATA sections consistently across all formats. [sam.lown]
-* Prevent SimpleRSS from doing its own escaping. [reported by: paul.stadig, lionel.bouton]
-* Reparse Time classes [reported by: sam.lown]
-
-1.4.0
-
-* Support content:encoded. Accessible via Entry#content.
-* Support categories. Accessible via Entry#categories.
-* Introduces a new parsing feature 'loose parsing'. Use :loose => true
-  when parsing if the required output should retain extra data, rather
-  than drop it in the interests of 'lowest common denomiator' normalization.
-  Currently affects how categories works. See the documentation in
-  FeedNormalizer#parse for more details.
-
-1.3.2
-
-* Add support for applicable dublin core elements. (dc:date and dc:creator)
-* Feeds can now be dumped to YAML.
-
-1.3.1
-
-* Small changes to work with hpricot 0.6. This release depends on hpricot 0.6.
-* Reduced the greediness of a regexp that was removing html comments.
-
-1.3.0
-
-* Small changes to work with hpricot 0.5.
-
-1.2.0
-
-* Added HtmlCleaner - sanitizes HTML and removes 'bad' URIs to a level suitable
-  for 'safe' display inside a web browser. Can be used as a standalone library,
-  or as part of the Feed object. See Feed.clean! for details about cleaning a
-  Feed instance. Also see HtmlCleaner and its unit tests. Uses Hpricot.
-* Added Feed-diffing. Differences between two feeds can be displayed using
-  Feed.diff. Works nicely with YAML for a readable diff.
-* FeedNormalizer.parse now takes a hash for its arguments.
-* Removed FN::Content.
-* Now uses Hoe!
-
-
+1.5.0
+
+* Add support for new fields:
+  * Atom 0.3: issued is now available through entry.date_published.
+  * RSS: feed.skip_hours, feed.skip_days, feed.ttl [joshpeek]
+  * All: entry.last_updated, this is an alias to entry.date_published for RSS.
+* Rewrite relative links in content [joshpeek]
+* Handle CDATA sections consistently across all formats. [sam.lown]
+* Prevent SimpleRSS from doing its own escaping. [reported by: paul.stadig, lionel.bouton]
+* Reparse Time classes [reported by: sam.lown]
+
+1.4.0
+
+* Support content:encoded. Accessible via Entry#content.
+* Support categories. Accessible via Entry#categories.
+* Introduces a new parsing feature 'loose parsing'. Use :loose => true
+  when parsing if the required output should retain extra data, rather
+  than drop it in the interests of 'lowest common denomiator' normalization.
+  Currently affects how categories works. See the documentation in
+  FeedNormalizer#parse for more details.
+
+1.3.2
+
+* Add support for applicable dublin core elements. (dc:date and dc:creator)
+* Feeds can now be dumped to YAML.
+
+1.3.1
+
+* Small changes to work with hpricot 0.6. This release depends on hpricot 0.6.
+* Reduced the greediness of a regexp that was removing html comments.
+
+1.3.0
+
+* Small changes to work with hpricot 0.5.
+
+1.2.0
+
+* Added HtmlCleaner - sanitizes HTML and removes 'bad' URIs to a level suitable
+  for 'safe' display inside a web browser. Can be used as a standalone library,
+  or as part of the Feed object. See Feed.clean! for details about cleaning a
+  Feed instance. Also see HtmlCleaner and its unit tests. Uses Hpricot.
+* Added Feed-diffing. Differences between two feeds can be displayed using
+  Feed.diff. Works nicely with YAML for a readable diff.
+* FeedNormalizer.parse now takes a hash for its arguments.
+* Removed FN::Content.
+* Now uses Hoe!
+
+
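Note on the 1.4.0 'loose parsing' entry: the following is a minimal sketch of the trade-off it describes, not code from the gem. The helper name normalize_categories and its keyword argument are hypothetical, and the sketch assumes that non-loose mode keeps only the first category (the limit the FeedNormalizer#parse documentation attributes to SimpleRSS).

```ruby
# Hypothetical helper, not part of feed-normalizer.
# +raw_categories+ stands for whatever the underlying parser extracted.
def normalize_categories(raw_categories, loose: false)
  # Coerce to an array of non-empty strings.
  categories = Array(raw_categories).map(&:to_s).reject(&:empty?)

  # Assumption: strict mode keeps one category so every parser yields the
  # same shape; loose mode keeps everything the parser found.
  loose ? categories : categories.first(1)
end

strict = normalize_categories(["news", "ruby"])
loose  = normalize_categories(["news", "ruby"], loose: true)
```

Under these assumptions, strict holds a single category and loose holds both; which one a caller wants depends on whether identical output across parsers matters more than completeness.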
data/License.txt
CHANGED
@@ -1,27 +1,27 @@
-Copyright (c) 2006-2007, Andrew A. Smith
-All rights reserved.
-
-Redistribution and use in source and binary forms, with or without modification,
-are permitted provided that the following conditions are met:
-
-* Redistributions of source code must retain the above copyright notice,
-  this list of conditions and the following disclaimer.
-
-* Redistributions in binary form must reproduce the above copyright notice,
-  this list of conditions and the following disclaimer in the documentation
-  and/or other materials provided with the distribution.
-
-* Neither the name of the copyright owner nor the names of its contributors
-  may be used to endorse or promote products derived from this software
-  without specific prior written permission.
-
-THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND
-ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
-WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
-DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR
-ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES
-(INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES;
-LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON
-ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
-(INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
-SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+Copyright (c) 2006-2007, Andrew A. Smith
+All rights reserved.
+
+Redistribution and use in source and binary forms, with or without modification,
+are permitted provided that the following conditions are met:
+
+* Redistributions of source code must retain the above copyright notice,
+  this list of conditions and the following disclaimer.
+
+* Redistributions in binary form must reproduce the above copyright notice,
+  this list of conditions and the following disclaimer in the documentation
+  and/or other materials provided with the distribution.
+
+* Neither the name of the copyright owner nor the names of its contributors
+  may be used to endorse or promote products derived from this software
+  without specific prior written permission.
+
+THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND
+ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
+WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
+DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR
+ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES
+(INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES;
+LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON
+ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+(INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
+SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
data/Manifest.txt
CHANGED
@@ -1,19 +1,18 @@
-History.txt
-License.txt
-Manifest.txt
-Rakefile
-README.txt
-lib/feed-normalizer.rb
-lib/html-cleaner.rb
-lib/parsers/rss.rb
-lib/parsers/simple-rss.rb
-lib/structures.rb
-test/data/atom03.xml
-test/data/atom10.xml
-test/data/rdf10.xml
-test/data/rss20.xml
-test/data/rss20diff.xml
-test/data/rss20diff_short.xml
-test/
-test/
-test/test_htmlcleaner.rb
+History.txt
+License.txt
+Manifest.txt
+Rakefile
+README.txt
+lib/feed-normalizer.rb
+lib/html-cleaner.rb
+lib/parsers/rss.rb
+lib/parsers/simple-rss.rb
+lib/structures.rb
+test/data/atom03.xml
+test/data/atom10.xml
+test/data/rdf10.xml
+test/data/rss20.xml
+test/data/rss20diff.xml
+test/data/rss20diff_short.xml
+test/test_feednormalizer.rb
+test/test_htmlcleaner.rb
data/README.txt
CHANGED
@@ -1,63 +1,63 @@
-== Feed Normalizer
-
-An extensible Ruby wrapper for Atom and RSS parsers.
-
-Feed normalizer wraps various RSS and Atom parsers, and returns a single unified
-object graph, regardless of the underlying feed format.
-
-== Download
-
-* gem install feed-normalizer
-* http://rubyforge.org/projects/feed-normalizer
-* svn co http://feed-normalizer.googlecode.com/svn/trunk
-
-== Usage
-
-  require 'feed-normalizer'
-  require 'open-uri'
-
-  feed = FeedNormalizer::FeedNormalizer.parse open('http://www.iht.com/rss/frontpage.xml')
-
-  feed.title # => "International Herald Tribune"
-  feed.url # => "http://www.iht.com/pages/index.php"
-  feed.entries.first.url # => "http://www.iht.com/articles/2006/10/03/frontpage/web.1003UN.php"
-
-  feed.class # => FeedNormalizer::Feed
-  feed.parser # => "RSS::Parser"
-
-Now read an Atom feed, and the same class is returned, and the same terminology applies:
-
-  feed = FeedNormalizer::FeedNormalizer.parse open('http://www.atomenabled.org/atom.xml')
-
-  feed.title # => "AtomEnabled.org"
-  feed.url # => "http://www.atomenabled.org/atom.xml"
-  feed.entries.first.url # => "http://www.atomenabled.org/2006/09/moving-toward-atom.php"
-
-The feed representation stays the same, even though a different parser was used.
-
-  feed.class # => FeedNormalizer::Feed
-  feed.parser # => "SimpleRSS"
-
-== Cleaning / Sanitizing
-
-  feed.title # => "My Feed > Your Feed"
-  feed.entries.first.content # => "<p x='y'>Hello</p><object></object></html>"
-  feed.clean!
-
-All elements should now be either clean HTML, or HTML escaped strings.
-
-  feed.title # => "My Feed &gt; Your Feed"
-  feed.entries.first.content # => "<p>Hello</p>"
-
-== Extending
-
-Implement a parser wrapper by extending the FeedNormalizer::Parser class and overriding
-the public methods. Also note the helper methods in the root Parser object to make
-mapping of output from the particular parser to the Feed object easier.
-
-See FeedNormalizer::RubyRssParser and FeedNormalizer::SimpleRssParser for examples.
-
-== Authors
-* Andrew A. Smith (andy@tinnedfruit.org)
-
-This library is released under the terms of the BSD License (see the License.txt file for details).
+== Feed Normalizer
+
+An extensible Ruby wrapper for Atom and RSS parsers.
+
+Feed normalizer wraps various RSS and Atom parsers, and returns a single unified
+object graph, regardless of the underlying feed format.
+
+== Download
+
+* gem install feed-normalizer
+* http://rubyforge.org/projects/feed-normalizer
+* svn co http://feed-normalizer.googlecode.com/svn/trunk
+
+== Usage
+
+  require 'feed-normalizer'
+  require 'open-uri'
+
+  feed = FeedNormalizer::FeedNormalizer.parse open('http://www.iht.com/rss/frontpage.xml')
+
+  feed.title # => "International Herald Tribune"
+  feed.url # => "http://www.iht.com/pages/index.php"
+  feed.entries.first.url # => "http://www.iht.com/articles/2006/10/03/frontpage/web.1003UN.php"
+
+  feed.class # => FeedNormalizer::Feed
+  feed.parser # => "RSS::Parser"
+
+Now read an Atom feed, and the same class is returned, and the same terminology applies:
+
+  feed = FeedNormalizer::FeedNormalizer.parse open('http://www.atomenabled.org/atom.xml')
+
+  feed.title # => "AtomEnabled.org"
+  feed.url # => "http://www.atomenabled.org/atom.xml"
+  feed.entries.first.url # => "http://www.atomenabled.org/2006/09/moving-toward-atom.php"
+
+The feed representation stays the same, even though a different parser was used.
+
+  feed.class # => FeedNormalizer::Feed
+  feed.parser # => "SimpleRSS"
+
+== Cleaning / Sanitizing
+
+  feed.title # => "My Feed > Your Feed"
+  feed.entries.first.content # => "<p x='y'>Hello</p><object></object></html>"
+  feed.clean!
+
+All elements should now be either clean HTML, or HTML escaped strings.
+
+  feed.title # => "My Feed &gt; Your Feed"
+  feed.entries.first.content # => "<p>Hello</p>"
+
+== Extending
+
+Implement a parser wrapper by extending the FeedNormalizer::Parser class and overriding
+the public methods. Also note the helper methods in the root Parser object to make
+mapping of output from the particular parser to the Feed object easier.
+
+See FeedNormalizer::RubyRssParser and FeedNormalizer::SimpleRssParser for examples.
+
+== Authors
+* Andrew A. Smith (andy@tinnedfruit.org)
+
+This library is released under the terms of the BSD License (see the License.txt file for details).
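Note on the README's "Extending" section: the mapping helpers it mentions copy values from a parser's output onto the normalized object. The sketch below illustrates that idea in standalone Ruby so it runs without the gem. The names Entry and map_fields! are hypothetical stand-ins, and the field names in the example hash are made up for illustration; this is not the library's own code.

```ruby
# Stand-in for a normalized entry; the real gem defines its own structures.
Entry = Struct.new(:title, :url, :categories)

# For each destination attribute, try the candidate source names in order
# and copy the first non-empty value found.
def map_fields!(mapping, src, dest)
  mapping.each do |dest_name, src_names|
    Array(src_names).each do |src_name|
      # Read from a method if the source has one, else from a hash key.
      value = if src.respond_to?(src_name)
        src.public_send(src_name)
      elsif src.respond_to?(:key?)
        src[src_name]
      end
      next if value.to_s.empty?

      current = dest.public_send(dest_name)
      if current.respond_to?(:push)
        current.push(value)                        # collection: append
      else
        dest.public_send(:"#{dest_name}=", value)  # scalar: assign
      end
      break # first non-empty candidate wins
    end
  end
  dest
end

raw = { title: "", dc_title: "Hello", link: "http://example.com/1", category: "news" }
entry = map_fields!(
  { title: [:title, :dc_title], url: :link, categories: :category },
  raw,
  Entry.new(nil, nil, [])
)
```

Here the empty :title is skipped in favor of :dc_title, and the category is appended because the destination attribute is an array. Listing fallbacks per field is what lets one mapping table cover parsers that name the same data differently.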
data/Rakefile
CHANGED
@@ -1,25 +1,29 @@
-require 'hoe'
-
-
-
-
-
-s.
-s.
-s.
-s.
-s.
-s.
-s.
-
-
-
-
-
-
-
-
-
-
-
-
+require 'hoe'
+
+$: << "lib"
+require 'feed-normalizer'
+
+Hoe.spec("feed-normalizer") do |s|
+  s.version = "1.5.2"
+  s.author = "Andrew A. Smith"
+  s.email = "andy@tinnedfruit.org"
+  s.url = "http://github.com/aasmith/feed-normalizer"
+  s.summary = "Extensible Ruby wrapper for Atom and RSS parsers"
+  s.description = s.paragraphs_of('README.txt', 1..2).join("\n\n")
+  s.changes = s.paragraphs_of('History.txt', 0..1).join("\n\n")
+  s.extra_deps << ["simple-rss", ">= 1.1"]
+  s.extra_deps << ["hpricot", ">= 0.6"]
+  s.need_zip = true
+  s.need_tar = false
+end
+
+
+begin
+  require 'rcov/rcovtask'
+  Rcov::RcovTask.new("rcov") do |t|
+    t.test_files = Dir['test/test_all.rb']
+  end
+rescue LoadError
+  nil
+end
+
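Note on the new Rakefile: the description and changes fields are built by picking paragraphs out of README.txt and History.txt. The sketch below shows roughly what such a paragraph picker does, working on a string rather than a file so it is self-contained. The function name paragraphs_of_text is hypothetical, and the claim that Hoe's paragraphs_of splits on blank lines in this way is an assumption, not something the diff confirms.

```ruby
# Hypothetical stand-in for a paragraph picker; operates on a string.
# Splits on runs of blank lines and returns the requested paragraphs.
def paragraphs_of_text(text, *indexes)
  text.delete("\r").split(/\n\n+/).values_at(*indexes)
end

readme = "== Feed Normalizer\n\n" \
         "An extensible Ruby wrapper for Atom and RSS parsers.\n\n" \
         "Feed normalizer wraps various parsers.\n\n" \
         "== Download\n"

# Paragraph 0 is the heading, so 1..2 selects the two descriptive paragraphs.
description = paragraphs_of_text(readme, 1..2).join("\n\n")
```

One consequence of this approach is that the gem description depends on paragraph positions, so reordering the top of README.txt would silently change it.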
data/lib/feed-normalizer.rb
CHANGED
@@ -1,149 +1,149 @@
-require 'structures'
-require 'html-cleaner'
-
-module FeedNormalizer
-
-  # The root parser object. Every parser must extend this object.
-  class Parser
-
-    # Parser being used.
-    def self.parser
-      nil
-    end
-
-    # Parses the given feed, and returns a normalized representation.
-    # Returns nil if the feed could not be parsed.
-    def self.parse(feed, loose)
-      nil
-    end
-
-    # Returns a number to indicate parser priority.
-    # The lower the number, the more likely the parser will be used first,
-    # and vice-versa.
-    def self.priority
-      0
-    end
-
-    protected
-
-    # Some utility methods that can be used by subclasses.
-
-    # sets value, or appends to an existing value
-    def self.map_functions!(mapping, src, dest)
-
-      mapping.each do |dest_function, src_functions|
-        src_functions = [src_functions].flatten # pack into array
-
-        src_functions.each do |src_function|
-          value = if src.respond_to?(src_function)
-            src.send(src_function)
-          elsif src.respond_to?(:has_key?)
-            src[src_function]
-          end
-
-          unless value.to_s.empty?
-            append_or_set!(value, dest, dest_function)
-            break
-          end
-        end
-
-      end
-    end
-
-    def self.append_or_set!(value, object, object_function)
-      if object.send(object_function).respond_to? :push
-        object.send(object_function).push(value)
-      else
-        object.send(:"#{object_function}=", value)
-      end
-    end
-
-    private
-
-    # Callback that ensures that every parser gets registered.
-    def self.inherited(subclass)
-      ParserRegistry.register(subclass)
-    end
-
-  end
-
-
-  # The parser registry keeps a list of current parsers that are available.
-  class ParserRegistry
-
-    @@parsers = []
-
-    def self.register(parser)
-      @@parsers << parser
-    end
-
-    # Returns a list of currently registered parsers, in order of priority.
-    def self.parsers
-      @@parsers.sort_by { |parser| parser.priority }
-    end
-
-  end
-
-
-  class FeedNormalizer
-
-    # Parses the given xml and attempts to return a normalized Feed object.
-    # Setting +force_parser+ to a suitable parser will mean that parser is
-    # used first, and if +try_others+ is false, it is the only parser used,
-    # otherwise all parsers in the ParserRegistry are attempted, in
-    # order of priority.
-    #
-    # ===Available options
-    #
-    # * <tt>:force_parser</tt> - instruct feed-normalizer to try the specified
-    # parser first. Takes a class, such as RubyRssParser, or SimpleRssParser.
-    #
-    # * <tt>:try_others</tt> - +true+ or +false+, defaults to +true+.
-    # If +true+, other parsers will be used as described above. The option
-    # is useful if combined with +force_parser+ to only use a single parser.
-    #
-    # * <tt>:loose</tt> - +true+ or +false+, defaults to +false+.
-    #
-    # Specifies parsing should be done loosely. This means that when
-    # feed-normalizer would usually throw away data in order to meet
-    # the requirement of keeping resulting feed outputs the same regardless
-    # of the underlying parser, the data will instead be kept. This currently
-    # affects the following items:
-    # * <em>Categories:</em> RSS allows for multiple categories per feed item.
-    # * <em>Limitation:</em> SimpleRSS can only return the first category
-    # for an item.
-    # * <em>Result:</em> When loose is true, the extra categories are kept,
-    # of course, only if the parser is not SimpleRSS.
-    def self.parse(xml, opts = {})
-
-      # Get a string ASAP, as multiple read()'s will start returning nil..
-      xml = xml.respond_to?(:read) ? xml.read : xml.to_s
-
-      if opts[:force_parser]
-        result = opts[:force_parser].parse(xml, opts[:loose])
-
-        return result if result
-        return nil if opts[:try_others] == false
-      end
-
-      ParserRegistry.parsers.each do |parser|
-        result = parser.parse(xml, opts[:loose])
-        return result if result
-      end
-
-      # if we got here, no parsers worked.
-      return nil
-    end
-  end
-
-
-  parser_dir = File.dirname(__FILE__) + '/parsers'
-
-  # Load up the parsers
-  Dir.open(parser_dir).each do |fn|
-    next unless fn =~ /[.]rb$/
-    require "parsers/#{fn}"
-  end
-
-end
-
+require 'structures'
+require 'html-cleaner'
+
+module FeedNormalizer
+
+  # The root parser object. Every parser must extend this object.
+  class Parser
+
+    # Parser being used.
+    def self.parser
+      nil
+    end
+
+    # Parses the given feed, and returns a normalized representation.
+    # Returns nil if the feed could not be parsed.
+    def self.parse(feed, loose)
+      nil
+    end
+
+    # Returns a number to indicate parser priority.
+    # The lower the number, the more likely the parser will be used first,
+    # and vice-versa.
+    def self.priority
+      0
+    end
+
+    protected
+
+    # Some utility methods that can be used by subclasses.
+
+    # sets value, or appends to an existing value
+    def self.map_functions!(mapping, src, dest)
+
+      mapping.each do |dest_function, src_functions|
+        src_functions = [src_functions].flatten # pack into array
+
+        src_functions.each do |src_function|
+          value = if src.respond_to?(src_function)
+            src.send(src_function)
+          elsif src.respond_to?(:has_key?)
+            src[src_function]
+          end
+
+          unless value.to_s.empty?
+            append_or_set!(value, dest, dest_function)
+            break
+          end
+        end
+
+      end
+    end
+
+    def self.append_or_set!(value, object, object_function)
+      if object.send(object_function).respond_to? :push
+        object.send(object_function).push(value)
+      else
+        object.send(:"#{object_function}=", value)
+      end
+    end
+
+    private
+
+    # Callback that ensures that every parser gets registered.
+    def self.inherited(subclass)
+      ParserRegistry.register(subclass)
+    end
+
+  end
+
+
+  # The parser registry keeps a list of current parsers that are available.
+  class ParserRegistry
+
+    @@parsers = []
+
+    def self.register(parser)
+      @@parsers << parser
+    end
+
+    # Returns a list of currently registered parsers, in order of priority.
+    def self.parsers
+      @@parsers.sort_by { |parser| parser.priority }
+    end
+
+  end
+
+
+  class FeedNormalizer
+
+    # Parses the given xml and attempts to return a normalized Feed object.
+    # Setting +force_parser+ to a suitable parser will mean that parser is
+    # used first, and if +try_others+ is false, it is the only parser used,
+    # otherwise all parsers in the ParserRegistry are attempted, in
+    # order of priority.
+    #
+    # ===Available options
+    #
+    # * <tt>:force_parser</tt> - instruct feed-normalizer to try the specified
+    # parser first. Takes a class, such as RubyRssParser, or SimpleRssParser.
+    #
+    # * <tt>:try_others</tt> - +true+ or +false+, defaults to +true+.
+    # If +true+, other parsers will be used as described above. The option
+    # is useful if combined with +force_parser+ to only use a single parser.
+    #
+    # * <tt>:loose</tt> - +true+ or +false+, defaults to +false+.
+    #
+    # Specifies parsing should be done loosely. This means that when
+    # feed-normalizer would usually throw away data in order to meet
+    # the requirement of keeping resulting feed outputs the same regardless
+    # of the underlying parser, the data will instead be kept. This currently
+    # affects the following items:
+    # * <em>Categories:</em> RSS allows for multiple categories per feed item.
+    # * <em>Limitation:</em> SimpleRSS can only return the first category
+    # for an item.
+    # * <em>Result:</em> When loose is true, the extra categories are kept,
+    # of course, only if the parser is not SimpleRSS.
+    def self.parse(xml, opts = {})
+
+      # Get a string ASAP, as multiple read()'s will start returning nil..
+      xml = xml.respond_to?(:read) ? xml.read : xml.to_s
+
+      if opts[:force_parser]
+        result = opts[:force_parser].parse(xml, opts[:loose])
+
+        return result if result
+        return nil if opts[:try_others] == false
+      end
+
+      ParserRegistry.parsers.each do |parser|
+        result = parser.parse(xml, opts[:loose])
+        return result if result
+      end
+
+      # if we got here, no parsers worked.
+      return nil
+    end
+  end
+
+
+  parser_dir = File.dirname(__FILE__) + '/parsers'
+
+  # Load up the parsers
+  Dir.open(parser_dir).each do |fn|
+    next unless fn =~ /[.]rb$/
+    require "parsers/#{fn}"
+  end
+
+end
+
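Note on lib/feed-normalizer.rb: the file combines three mechanisms, namely self-registration through the inherited hook, priority-ordered lookup, and a parse call that falls through parsers until one returns a result. The sketch below reproduces that flow in standalone Ruby so it can be run without the gem. The Sketch module, StrictParser, and LenientParser are hypothetical names, and their string checks are toy stand-ins for real feed parsing.

```ruby
module Sketch
  # Holds every parser class that has been defined so far.
  class Registry
    @parsers = []

    class << self
      def register(parser)
        @parsers << parser
      end

      # Lowest priority number first, mirroring the documented ordering.
      def parsers
        @parsers.sort_by(&:priority)
      end
    end
  end

  class Parser
    def self.priority
      0
    end

    def self.parse(_xml, _loose)
      nil
    end

    # Ruby invokes this hook whenever a subclass is defined.
    def self.inherited(subclass)
      super
      Registry.register(subclass)
    end
  end

  # Defined first but tried last, because of its higher priority number.
  class LenientParser < Parser
    def self.priority
      900
    end

    def self.parse(xml, _loose)
      xml.include?("<") ? { parser: "lenient" } : nil
    end
  end

  class StrictParser < Parser
    def self.priority
      100
    end

    def self.parse(xml, _loose)
      xml.include?("<rss") ? { parser: "strict" } : nil
    end
  end

  # Try the forced parser first (if any), then every registered parser.
  def self.parse(xml, opts = {})
    xml = xml.respond_to?(:read) ? xml.read : xml.to_s

    if (forced = opts[:force_parser])
      result = forced.parse(xml, opts[:loose])
      return result if result
      return nil if opts[:try_others] == false
    end

    Registry.parsers.each do |parser|
      result = parser.parse(xml, opts[:loose])
      return result if result
    end

    nil
  end
end

rss_result    = Sketch.parse("<rss></rss>")
atom_result   = Sketch.parse("<feed></feed>")
forced_result = Sketch.parse("<feed></feed>",
                             force_parser: Sketch::StrictParser,
                             try_others: false)
```

In this sketch the RSS input is handled by the strict parser, the other input falls through to the lenient one, and forcing the strict parser with try_others set to false yields nil instead of falling back. Returning nil rather than raising is what makes the fall-through loop simple, at the cost of hiding why a given parser declined.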