openlogic-feed-normalizer 1.5.3

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
data/.gemtest ADDED
File without changes
data/History.txt ADDED
@@ -0,0 +1,62 @@
+ 1.5.3
+
+ * Fix a stack overflow error that occurred when calling unimplemented
+   methods on Feeds or Entrys on Ruby 1.9.x. For example, calling flatten
+   on an array of Entrys causes to_ary to be sent to each of the Entrys,
+   which would overflow the stack.
+
+ 1.5.2
+ [unknown]
+
+ 1.5.1
+
+ * Fix a bug that was breaking the parsing process for certain feeds. [reported by: Patrick Minton]
+
+ 1.5.0
+
+ * Add support for new fields:
+   * Atom 0.3: issued is now available through entry.date_published.
+   * RSS: feed.skip_hours, feed.skip_days, feed.ttl [joshpeek]
+   * All: entry.last_updated, an alias for entry.date_published on RSS.
+ * Rewrite relative links in content [joshpeek]
+ * Handle CDATA sections consistently across all formats. [sam.lown]
+ * Prevent SimpleRSS from doing its own escaping. [reported by: paul.stadig, lionel.bouton]
+ * Reparse Time classes [reported by: sam.lown]
+
+ 1.4.0
+
+ * Support content:encoded. Accessible via Entry#content.
+ * Support categories. Accessible via Entry#categories.
+ * Introduces a new parsing feature, 'loose parsing'. Use :loose => true
+   when parsing if the resulting output should retain extra data, rather
+   than drop it in the interests of 'lowest common denominator' normalization.
+   Currently affects how categories work. See the documentation in
+   FeedNormalizer#parse for more details.
+
+ 1.3.2
+
+ * Add support for applicable Dublin Core elements. (dc:date and dc:creator)
+ * Feeds can now be dumped to YAML.
+
+ 1.3.1
+
+ * Small changes to work with hpricot 0.6. This release depends on hpricot 0.6.
+ * Reduced the greediness of a regexp that was removing html comments.
+
+ 1.3.0
+
+ * Small changes to work with hpricot 0.5.
+
+ 1.2.0
+
+ * Added HtmlCleaner - sanitizes HTML and removes 'bad' URIs to a level suitable
+   for 'safe' display inside a web browser. Can be used as a standalone library,
+   or as part of the Feed object. See Feed.clean! for details about cleaning a
+   Feed instance. Also see HtmlCleaner and its unit tests. Uses Hpricot.
+ * Added Feed-diffing. Differences between two feeds can be displayed using
+   Feed.diff. Works nicely with YAML for a readable diff.
+ * FeedNormalizer.parse now takes a hash for its arguments.
+ * Removed FN::Content.
+ * Now uses Hoe!
data/License.txt ADDED
@@ -0,0 +1,27 @@
+ Copyright (c) 2006-2007, Andrew A. Smith
+ All rights reserved.
+
+ Redistribution and use in source and binary forms, with or without modification,
+ are permitted provided that the following conditions are met:
+
+ * Redistributions of source code must retain the above copyright notice,
+   this list of conditions and the following disclaimer.
+
+ * Redistributions in binary form must reproduce the above copyright notice,
+   this list of conditions and the following disclaimer in the documentation
+   and/or other materials provided with the distribution.
+
+ * Neither the name of the copyright owner nor the names of its contributors
+   may be used to endorse or promote products derived from this software
+   without specific prior written permission.
+
+ THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND
+ ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
+ WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
+ DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR
+ ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES
+ (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES;
+ LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON
+ ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+ (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
+ SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
data/Manifest.txt ADDED
@@ -0,0 +1,18 @@
+ History.txt
+ License.txt
+ Manifest.txt
+ Rakefile
+ README.txt
+ lib/feed-normalizer.rb
+ lib/html-cleaner.rb
+ lib/parsers/rss.rb
+ lib/parsers/simple-rss.rb
+ lib/structures.rb
+ test/data/atom03.xml
+ test/data/atom10.xml
+ test/data/rdf10.xml
+ test/data/rss20.xml
+ test/data/rss20diff.xml
+ test/data/rss20diff_short.xml
+ test/test_feednormalizer.rb
+ test/test_htmlcleaner.rb
data/README.txt ADDED
@@ -0,0 +1,63 @@
+ == Feed Normalizer
+
+ An extensible Ruby wrapper for Atom and RSS parsers.
+
+ Feed normalizer wraps various RSS and Atom parsers, and returns a single unified
+ object graph, regardless of the underlying feed format.
+
+ == Download
+
+ * gem install openlogic-feed-normalizer
+ * http://rubyforge.org/projects/openlogic-feed-normalizer
+ * http://github.com/toddthomas/feed-normalizer
+
+ == Usage
+
+   require 'feed-normalizer'
+   require 'open-uri'
+
+   feed = FeedNormalizer::FeedNormalizer.parse open('http://www.iht.com/rss/frontpage.xml')
+
+   feed.title # => "International Herald Tribune"
+   feed.url # => "http://www.iht.com/pages/index.php"
+   feed.entries.first.url # => "http://www.iht.com/articles/2006/10/03/frontpage/web.1003UN.php"
+
+   feed.class # => FeedNormalizer::Feed
+   feed.parser # => "RSS::Parser"
+
+ Now read an Atom feed; the same class is returned, and the same terminology applies:
+
+   feed = FeedNormalizer::FeedNormalizer.parse open('http://www.atomenabled.org/atom.xml')
+
+   feed.title # => "AtomEnabled.org"
+   feed.url # => "http://www.atomenabled.org/atom.xml"
+   feed.entries.first.url # => "http://www.atomenabled.org/2006/09/moving-toward-atom.php"
+
+ The feed representation stays the same, even though a different parser was used.
+
+   feed.class # => FeedNormalizer::Feed
+   feed.parser # => "SimpleRSS"
+
+ == Cleaning / Sanitizing
+
+   feed.title # => "My Feed > Your Feed"
+   feed.entries.first.content # => "<p x='y'>Hello</p><object></object></html>"
+   feed.clean!
+
+ All elements should now be either clean HTML, or HTML-escaped strings.
+
+   feed.title # => "My Feed &gt; Your Feed"
+   feed.entries.first.content # => "<p>Hello</p>"
+
+ == Extending
+
+ Implement a parser wrapper by extending the FeedNormalizer::Parser class and overriding
+ the public methods. Also note the helper methods in the root Parser object that make
+ mapping output from a particular parser to the Feed object easier.
+
+ See FeedNormalizer::RubyRssParser and FeedNormalizer::SimpleRssParser for examples,
+ and the sketch below.
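+
+ A minimal sketch of the shape such a wrapper takes (MyParser and its
+ SomeUnderlyingParser backend are illustrative, not part of the library):
+
+   class MyParser < FeedNormalizer::Parser
+     def self.parser
+       SomeUnderlyingParser # hypothetical backend this wrapper delegates to
+     end
+
+     def self.parse(xml, loose = false)
+       feed = FeedNormalizer::Feed.new(self)
+       # Map the backend's output onto feed here (the map_functions! helper
+       # in Parser exists for this); return nil if parsing fails.
+       feed
+     end
+
+     def self.priority
+       50 # lower numbers are tried first
+     end
+   end
+
+ Merely subclassing FeedNormalizer::Parser registers the wrapper, since
+ Parser.inherited adds every subclass to the ParserRegistry.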
+
+ == Authors
+ * Andrew A. Smith (andy@tinnedfruit.org)
+
+ This library is released under the terms of the BSD License (see the License.txt file for details).
data/Rakefile ADDED
@@ -0,0 +1,30 @@
+ require 'rubygems'
+ require 'hoe'
+
+ $: << "lib"
+ require 'feed-normalizer'
+
+ Hoe.spec("openlogic-feed-normalizer") do |s|
+   s.version = "1.5.3"
+   s.developer "Andrew A. Smith", "andy@tinnedfruit.org"
+   s.developer "Todd Thomas", "todd.thomas@openlogic.com"
+   s.url = "http://github.com/toddthomas/feed-normalizer"
+   s.summary = "Extensible Ruby wrapper for Atom and RSS parsers"
+   s.description = s.paragraphs_of('README.txt', 1..2).join("\n\n")
+   s.changes = s.paragraphs_of('History.txt', 0..1).join("\n\n")
+   s.extra_deps << ["simple-rss", ">= 1.1"]
+   s.extra_deps << ["hpricot", ">= 0.6"]
+   s.need_zip = true
+   s.need_tar = false
+ end
+
+ begin
+   require 'rcov/rcovtask'
+   Rcov::RcovTask.new("rcov") do |t|
+     t.test_files = Dir['test/test_all.rb']
+   end
+ rescue LoadError
+   nil
+ end
data/lib/feed-normalizer.rb ADDED
@@ -0,0 +1,149 @@
+ require 'structures'
+ require 'html-cleaner'
+
+ module FeedNormalizer
+
+   # The root parser object. Every parser must extend this object.
+   class Parser
+
+     # Parser being used.
+     def self.parser
+       nil
+     end
+
+     # Parses the given feed, and returns a normalized representation.
+     # Returns nil if the feed could not be parsed.
+     def self.parse(feed, loose)
+       nil
+     end
+
+     # Returns a number to indicate parser priority.
+     # The lower the number, the more likely the parser will be used first,
+     # and vice-versa.
+     def self.priority
+       0
+     end
+
+     protected
+
+     # Some utility methods that can be used by subclasses.
+
+     # Sets a value, or appends to an existing value.
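+     #
+     # Illustrative call (src and dest stand for whatever the caller supplies):
+     #   map_functions!({:title => [:title, :dc_title]}, src, dest)
+     # copies the first non-empty of src.title / src[:dc_title] into
+     # dest.title, pushing instead if dest.title is already an array.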
+     def self.map_functions!(mapping, src, dest)
+
+       mapping.each do |dest_function, src_functions|
+         src_functions = [src_functions].flatten # pack into array
+
+         src_functions.each do |src_function|
+           value = if src.respond_to?(src_function)
+             src.send(src_function)
+           elsif src.respond_to?(:has_key?)
+             src[src_function]
+           end
+
+           unless value.to_s.empty?
+             append_or_set!(value, dest, dest_function)
+             break
+           end
+         end
+
+       end
+     end
+
+     def self.append_or_set!(value, object, object_function)
+       if object.send(object_function).respond_to? :push
+         object.send(object_function).push(value)
+       else
+         object.send(:"#{object_function}=", value)
+       end
+     end
+
+     private
+
+     # Callback that ensures that every parser gets registered.
+     def self.inherited(subclass)
+       ParserRegistry.register(subclass)
+     end
+
+   end
+
+
+   # The parser registry keeps a list of current parsers that are available.
+   class ParserRegistry
+
+     @@parsers = []
+
+     def self.register(parser)
+       @@parsers << parser
+     end
+
+     # Returns a list of currently registered parsers, in order of priority.
+     def self.parsers
+       @@parsers.sort_by { |parser| parser.priority }
+     end
+
+   end
+
+
+   class FeedNormalizer
+
+     # Parses the given xml and attempts to return a normalized Feed object.
+     # Setting +force_parser+ to a suitable parser will mean that parser is
+     # used first, and if +try_others+ is false, it is the only parser used;
+     # otherwise all parsers in the ParserRegistry are attempted, in
+     # order of priority.
+     #
+     # === Available options
+     #
+     # * <tt>:force_parser</tt> - instruct feed-normalizer to try the specified
+     #   parser first. Takes a class, such as RubyRssParser or SimpleRssParser.
+     #
+     # * <tt>:try_others</tt> - +true+ or +false+, defaults to +true+.
+     #   If +true+, other parsers will be used as described above. The option
+     #   is useful when combined with +force_parser+ to only use a single parser.
+     #
+     # * <tt>:loose</tt> - +true+ or +false+, defaults to +false+.
+     #
+     #   Specifies that parsing should be done loosely. This means that when
+     #   feed-normalizer would usually throw away data in order to keep the
+     #   resulting feed output the same regardless of the underlying parser,
+     #   the data will instead be kept. This currently affects the following items:
+     #   * <em>Categories:</em> RSS allows for multiple categories per feed item.
+     #     * <em>Limitation:</em> SimpleRSS can only return the first category
+     #       for an item.
+     #     * <em>Result:</em> When loose is true, the extra categories are kept,
+     #       of course, only if the parser is not SimpleRSS.
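+     #
+     # Example (URL illustrative; RubyRssParser is one of the bundled wrappers):
+     #   xml = open('http://example.com/feed.xml')
+     #   feed = FeedNormalizer::FeedNormalizer.parse(xml,
+     #     :force_parser => FeedNormalizer::RubyRssParser, :try_others => false)
+     #   feed # => a FeedNormalizer::Feed, or nil if parsing failed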
+     def self.parse(xml, opts = {})
+
+       # Get a string ASAP, as multiple read()s will start returning nil.
+       xml = xml.respond_to?(:read) ? xml.read : xml.to_s
+
+       if opts[:force_parser]
+         result = opts[:force_parser].parse(xml, opts[:loose])
+
+         return result if result
+         return nil if opts[:try_others] == false
+       end
+
+       ParserRegistry.parsers.each do |parser|
+         result = parser.parse(xml, opts[:loose])
+         return result if result
+       end
+
+       # If we got here, no parsers worked.
+       return nil
+     end
+   end
+
+
+   parser_dir = File.dirname(__FILE__) + '/parsers'
+
+   # Load up the parsers.
+   Dir.open(parser_dir).each do |fn|
+     next unless fn =~ /[.]rb$/
+     require "parsers/#{fn}"
+   end
+
+ end
data/lib/html-cleaner.rb ADDED
@@ -0,0 +1,181 @@
+ require 'rubygems'
+ require 'hpricot'
+ require 'cgi'
+
+ module FeedNormalizer
+
+   # Various methods for cleaning up HTML and preparing it for safe public
+   # consumption.
+   #
+   # Documents used for reference:
+   # - http://www.w3.org/TR/html4/index/attributes.html
+   # - http://en.wikipedia.org/wiki/List_of_XML_and_HTML_character_entity_references
+   # - http://feedparser.org/docs/html-sanitization.html
+   # - http://code.whytheluckystiff.net/hpricot/wiki
+   class HtmlCleaner
+
+     # Allowed HTML elements.
+     HTML_ELEMENTS = %w(
+       a abbr acronym address area b bdo big blockquote br button caption center
+       cite code col colgroup dd del dfn dir div dl dt em fieldset font h1 h2 h3
+       h4 h5 h6 hr i img ins kbd label legend li map menu ol optgroup p pre q s
+       samp small span strike strong sub sup table tbody td tfoot th thead tr tt
+       u ul var
+     )
+
+     # Allowed attributes.
+     HTML_ATTRS = %w(
+       abbr accept accept-charset accesskey align alt axis border cellpadding
+       cellspacing char charoff charset checked cite class clear cols colspan
+       color compact coords datetime dir disabled for frame headers height href
+       hreflang hspace id ismap label lang longdesc maxlength media method
+       multiple name nohref noshade nowrap readonly rel rev rows rowspan rules
+       scope selected shape size span src start summary tabindex target title
+       type usemap valign value vspace width
+     )
+
+     # Allowed attributes that can contain URIs; extra caution required.
+     # NOTE: this doesn't list *all* URI attrs, just the ones that are allowed.
+     HTML_URI_ATTRS = %w(
+       href src cite usemap longdesc
+     )
+
+     DODGY_URI_SCHEMES = %w(
+       javascript vbscript mocha livescript data
+     )
+
+     class << self
+
+       # Does this:
+       # - Unescape HTML
+       # - Parse HTML into a tree
+       # - Find 'body' if present, and extract the tree inside that tag, otherwise parse the whole tree
+       # - For each tag:
+       #   - remove the tag if not whitelisted
+       #   - escape HTML tag contents
+       #   - remove all attributes not on the whitelist
+       #   - extra-scrub URI attrs; see dodgy_uri?
+       #
+       # Extra (i.e. unmatched) ending tags and comments are removed.
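+       #
+       # Illustrative example: script is not in HTML_ELEMENTS and onclick is
+       # not in HTML_ATTRS, so both are stripped:
+       #   HtmlCleaner.clean("<p onclick='x()'>Hi</p><script>x()</script>")
+       #   # => "<p>Hi</p>"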
+       def clean(str)
+         str = unescapeHTML(str)
+
+         doc = Hpricot(str, :fixup_tags => true)
+         doc = subtree(doc, :body)
+
+         # Get all the tags in the document.
+         # Somewhere near hpricot 0.4.92, "*" started returning all elements,
+         # including text nodes, instead of just tagged elements.
+         tags = (doc/"*").inject([]) { |m,e| m << e.name if(e.respond_to?(:name) && e.name =~ /^\w+$/) ; m }.uniq
+
+         # Remove tags that aren't whitelisted.
+         remove_tags!(doc, tags - HTML_ELEMENTS)
+         remaining_tags = tags & HTML_ELEMENTS
+
+         # Remove attributes that aren't on the whitelist, or are suspicious URLs.
+         (doc/remaining_tags.join(",")).each do |element|
+           next if element.raw_attributes.nil? || element.raw_attributes.empty?
+           element.raw_attributes.reject! do |attr,val|
+             !HTML_ATTRS.include?(attr) || (HTML_URI_ATTRS.include?(attr) && dodgy_uri?(val))
+           end
+
+           element.raw_attributes = element.raw_attributes.build_hash {|a,v| [a, add_entities(v)]}
+         end unless remaining_tags.empty?
+
+         doc.traverse_text do |t|
+           t.swap(add_entities(t.to_html))
+         end
+
+         # Return the tree, without comments. Ugly way of removing comments,
+         # but can't see a way to do this in Hpricot yet.
+         doc.to_s.gsub(/<\!--.*?-->/mi, '')
+       end
+
+       # For all other feed elements:
+       # - Unescape HTML.
+       # - Parse HTML into a tree (taking 'body' as root, if present).
+       # - Take the text out of each tag, and escape HTML.
+       # - Return all text concatenated.
+       def flatten(str)
+         str.gsub!("\n", " ")
+         str = unescapeHTML(str)
+
+         doc = Hpricot(str, :xhtml_strict => true)
+         doc = subtree(doc, :body)
+
+         out = []
+         doc.traverse_text {|t| out << add_entities(t.to_html)}
+
+         return out.join
+       end
+
+       # Returns true if the given string contains a suspicious URL,
+       # i.e. a javascript link.
+       #
+       # This method rejects javascript, vbscript, livescript, mocha and data URLs.
+       # It *could* be refined to only deny dangerous data URLs, however.
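+       #
+       # Illustrative examples:
+       #   HtmlCleaner.dodgy_uri?("javascript:alert(1)")        # => true
+       #   HtmlCleaner.dodgy_uri?("jav&#x0A;ascript:alert(1)")  # => true
+       #   HtmlCleaner.dodgy_uri?("http://example.com/")        # => nil (not dodgy)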
+       def dodgy_uri?(uri)
+         uri = uri.to_s
+
+         # Special case for poorly-formed entities (missing ';'):
+         # if these occur *anywhere* within the string, then throw it out.
+         return true if (uri =~ /&\#(\d+|x[0-9a-f]+)[^;\d]/mi)
+
+         # Unescape the URI as both HTML and URI encodings, then try
+         # each scheme regexp on each result.
+         [unescapeHTML(uri), CGI.unescape(uri)].each do |unesc_uri|
+           DODGY_URI_SCHEMES.each do |scheme|
+
+             regexp = "#{scheme}:".gsub(/./) do |char|
+               "([\000-\037\177\s]*)#{char}"
+             end
+
+             # regexp looks something like
+             # /\A([\000-\037\177\s]*)j([\000-\037\177\s]*)a([\000-\037\177\s]*)v([\000-\037\177\s]*)a([\000-\037\177\s]*)s([\000-\037\177\s]*)c([\000-\037\177\s]*)r([\000-\037\177\s]*)i([\000-\037\177\s]*)p([\000-\037\177\s]*)t([\000-\037\177\s]*):/mi
+             return true if (unesc_uri =~ %r{\A#{regexp}}mi)
+           end
+         end
+
+         nil
+       end
+
+       # Unescapes HTML. If xml is true, also converts XML-only named entities to HTML.
+       def unescapeHTML(str, xml = true)
+         CGI.unescapeHTML(xml ? str.gsub("&apos;", "&#39;") : str)
+       end
+
+       # Adds entities where possible.
+       # Works like CGI.escapeHTML, but will not escape existing entities;
+       # i.e. &#123; will NOT become &amp;#123;
+       #
+       # This method could be improved by adding a whitelist of HTML entities.
+       def add_entities(str)
+         str.to_s.gsub(/\"/n, '&quot;').gsub(/>/n, '&gt;').gsub(/</n, '&lt;').gsub(/&(?!(\#\d+|\#x([0-9a-f]+)|\w{2,8});)/nmi, '&amp;')
+       end
+
+       private
+
+       # Returns everything below element, or just the doc if the element is not present.
+       def subtree(doc, element)
+         doc.at("//#{element}/*") || doc
+       end
+
+       def remove_tags!(doc, tags)
+         (doc/tags.join(",")).remove unless tags.empty?
+       end
+
+     end
+   end
+ end
+
+
+ module Enumerable #:nodoc:
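+   # Builds a hash from an enumerable: the block maps each element to a
+   # [key, value] pair. Illustrative usage:
+   #   %w(a b).build_hash {|s| [s, s.upcase]} # => {"a"=>"A", "b"=>"B"}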
+   def build_hash
+     result = {}
+     self.each do |elt|
+       key, value = yield elt
+       result[key] = value
+     end
+     result
+   end
+ end