jmcnevin-dryopteris 0.1.2

data/README.markdown ADDED
@@ -0,0 +1,97 @@
1
+ Dryopteris
2
+ ==========
3
+
4
+ Dryopteris erythrosora is the Japanese Shield Fern. It also can be used to sanitize HTML to help prevent XSS attacks.
5
+
6
+ * [Dryopteris erythrosora](http://en.wikipedia.org/wiki/Dryopteris_erythrosora)
7
+ * [XSS Attacks](http://en.wikipedia.org/wiki/Cross-site_scripting)
8
+
9
+ Usage
10
+ -----
11
+
12
+ Let's say you run a web site, and you allow people to post HTML snippets.
13
+
14
+ Let's also say some script-kiddie from Norland posts this to your site, in an effort to swipe some credit cards:
15
+
16
+ <SCRIPT SRC=http://ha.ckers.org/xss.js></SCRIPT>
17
+
18
+ Oooh, that could be bad. Here's how to fix it:
19
+
20
+ safe_html_snippet = Dryopteris.sanitize(dangerous_html_snippet)
21
+
22
+ Yeah, it's that easy.
23
+
24
+ In this example, <tt>safe\_html\_snippet</tt> will have all of its __broken markup fixed__ by libxml2, and it will also be completely __sanitized of harmful tags and attributes__. That's twice as clean!
25
+
26
+
27
+ Sanitization Usage
28
+ -----
29
+
30
+ You're still here? OK, let me tell you a little something about the two different methods of sanitizing that Dryopteris offers.
31
+
32
+ ### Fragments
33
+
34
+ The first method is for _html fragments_, which are small snippets of markup such as those used in forum posts, emails and homework assignments.
35
+
36
+ Usage is the same as above:
37
+
38
+ safe_html_snippet = Dryopteris.sanitize(dangerous_html_snippet)
39
+
40
+ Generally speaking, unless you expect to have &lt;html&gt; and &lt;body&gt; tags in your HTML, this is the sanitizing method to use.
41
+
42
+ The only real limitation on this method is that the snippet must be a string object. (Support for IO objects was sacrificed at the altar of fixer-uppery-ness. If you need to sanitize data that's coming from an IO object, either socket or file, check out the next section on __Documents__.)
43
+
44
+ ### Documents
45
+
46
+ Sometimes you need to sanitize an entire HTML document. (Well, maybe not _you_, but other people, certainly.)
47
+
48
+ safe_html_document = Dryopteris.sanitize_document(dangerous_html_document)
49
+
50
+ The returned string will contain exactly one (1) well-formed HTML document, with all broken HTML fixed and all harmful tags and attributes removed.
51
+
52
+ Coolness: <tt>dangerous\_html\_document</tt> can be a string OR an IO object (a file, or a socket, or ...), which makes it particularly easy to sanitize large numbers of docs.
53
+
54
+ Whitewashing Usage
55
+ -----
56
+
57
+ ### Whitewashing Fragments
58
+
59
+ Other times, you may want to remove all styling, attributes and invalid HTML tags. I like to call this "whitewashing", since it's putting a new layer of paint on top of the HTML input to make it look nice.
60
+
61
+ One use case for this feature is to clean up HTML that was cut-and-pasted from Microsoft(tm) Word into a WYSIWYG editor/textarea. Microsoft's editor is famous for injecting all kinds of cruft into its HTML output. Who needs that? Certainly not me.
62
+
63
+ whitewashed_html = Dryopteris.whitewash(ugly_microsoft_html_snippet)
64
+
65
+ Please note that whitewashing implicitly also sanitizes your HTML, as it uses the same HTML tag whitelist as <tt>sanitize()</tt>. Its implementation is:
66
+
67
+ 1. unless the tag is on the whitelist, remove it from the document
68
+ 2. if the tag has an XML namespace on it, remove it from the document
69
+ 3. remove all attributes from the node
70
+
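The three rules above can be sketched in plain Ruby. This is only an illustration under stated assumptions, not the library's code: `FakeNode`, `ALLOWED`, and `whitewash_decision` are hypothetical names, and the whitelist is a tiny stand-in for the real one.

```ruby
# Hypothetical stand-in for a parsed HTML element (not a Nokogiri node).
FakeNode = Struct.new(:name, :attributes, :namespaced)

# Tiny illustrative whitelist; the real one is much longer.
ALLOWED = { "p" => true, "b" => true, "a" => true }

# Returns :remove when the node should be dropped; otherwise strips its
# attributes in place and returns :keep.
def whitewash_decision(node)
  return :remove unless ALLOWED[node.name]  # rule 1: not on the whitelist
  return :remove if node.namespaced         # rule 2: carries an XML namespace
  node.attributes.clear                     # rule 3: drop every attribute
  :keep
end

word_paragraph = FakeNode.new("p", { "class" => "MsoNormal", "style" => "margin:0" }, false)
office_tag     = FakeNode.new("o:p", {}, true)

puts whitewash_decision(word_paragraph)  # keep
puts word_paragraph.attributes.inspect   # {}
puts whitewash_decision(office_tag)      # remove
```

The kept paragraph survives with no attributes at all, which is why whitewashing suits pasted word-processor markup.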
71
+ ### Whitewashing Documents
72
+
73
+ Also note the existence of <tt>whitewash\_document</tt>, which is analogous to <tt>sanitize\_document</tt>.
74
+
75
+ Standing on the Shoulders of Giants
76
+ -----
77
+
78
+ Dryopteris uses [Nokogiri](http://nokogiri.rubyforge.org/) and [libxml2](http://xmlsoft.org/), so it's fast.
79
+
80
+ Dryopteris also takes its tag and tag attribute whitelists and its CSS sanitizer directly from [HTML5](http://code.google.com/p/html5lib/).
81
+
82
+
83
+ Authors
84
+ -----
85
+ * [Bryan Helmkamp](http://www.brynary.com/)
86
+ * [Mike Dalessio](http://mike.daless.io/) ([twitter](http://twitter.com/flavorjones))
87
+
88
+
89
+ Quotes About Dryopteris
90
+ -----
91
+
92
+ > "dryopteris shields you from xss attacks using nokogiri and NY attitude"
93
+ > - [hasmanyjosh](http://blog.hasmanythrough.com/)
94
+
95
+ > "I just wanted to say thank you for your dryopteris plugin. It is by far the best sanitization I've found."
96
+ > - [catalystmediastudios](http://github.com/catalystmediastudios)
97
+
data/VERSION.yml ADDED
@@ -0,0 +1,4 @@
1
+ ---
2
+ :major: 0
3
+ :minor: 0
4
+ :patch: 0
data/lib/dryopteris.rb ADDED
@@ -0,0 +1,12 @@
1
+ $LOAD_PATH.unshift(File.expand_path(File.dirname(__FILE__))) unless $LOAD_PATH.include?(File.expand_path(File.dirname(__FILE__)))
2
+
3
+ require 'rubygems'
4
+ gem 'nokogiri', '>=1.2.4'
5
+ require 'nokogiri'
6
+
7
+ require "dryopteris/whitelist"
8
+ require "dryopteris/sanitize"
9
+
10
+ module Dryopteris
11
+ VERSION = '0.1'
12
+ end
@@ -0,0 +1,46 @@
1
+ require "dryopteris"
2
+
3
+ module Dryopteris
4
+ module RailsExtension
5
+ def self.included(base)
6
+ base.extend(ClassMethods)
7
+
8
+ # sets up default of stripping tags for all fields
9
+ base.class_eval do
10
+ before_save :sanitize_fields
11
+ class_inheritable_reader :dryopteris_options
12
+ end
13
+ end
14
+
15
+ module ClassMethods
16
+ def sanitize_fields(options = {})
17
+ write_inheritable_attribute(:dryopteris_options, {
18
+ :except => (options[:except] || []),
19
+ :allow_tags => (options[:allow_tags] || [])
20
+ })
21
+ end
22
+
23
+ alias_method :sanitize_field, :sanitize_fields
24
+ end
25
+
26
+
27
+ def sanitize_fields
28
+ self.class.columns.each do |column|
29
+ next unless (column.type == :string || column.type == :text)
30
+
31
+ field = column.name.to_sym
32
+ value = self[field]
33
+
34
+ if dryopteris_options && dryopteris_options[:except].include?(field)
35
+ next
36
+ elsif dryopteris_options && dryopteris_options[:allow_tags].include?(field)
37
+ self[field] = Dryopteris.sanitize(value)
38
+ else
39
+ self[field] = Dryopteris.strip_tags(value)
40
+ end
41
+ end
42
+
43
+ end
44
+
45
+ end
46
+ end
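The per-column branch in the callback above can be sketched on its own, without Rails. `action_for` is a hypothetical helper written for this illustration; it only mirrors the `:except` / `:allow_tags` / default ordering and does not touch any model.

```ruby
# Hypothetical helper mirroring the branch in the before_save callback:
# decide what happens to one column given the options hash.
def action_for(field, options)
  return :skip     if options[:except].include?(field)
  return :sanitize if options[:allow_tags].include?(field)
  :strip_tags
end

options = { :except => [:password_hash], :allow_tags => [:body] }

puts action_for(:password_hash, options)  # skip
puts action_for(:body, options)           # sanitize
puts action_for(:title, options)          # strip_tags
```

Because the callback converts each column name with `to_sym` before the lookup, the option lists are assumed here to hold symbols; string names would not match.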
@@ -0,0 +1,175 @@
1
+ require 'cgi'
2
+
3
+ module Dryopteris
4
+
5
+ class << self
6
+ def strip_tags(string_or_io, encoding=nil)
7
+ return nil if string_or_io.nil?
8
+ return "" if string_or_io.respond_to?(:strip) && string_or_io.strip.size == 0
9
+
10
+ doc = Nokogiri::HTML.parse(string_or_io, nil, encoding)
11
+ body_element = doc.at("/html/body")
12
+ return "" if body_element.nil?
13
+ body_element.inner_text
14
+ end
15
+
16
+
17
+ def whitewash(string, encoding=nil)
18
+ return nil if string.nil?
19
+ return "" if string.strip.size == 0
20
+
21
+ string = "<html><body>" + string + "</body></html>"
22
+ doc = Nokogiri::HTML.parse(string, nil, encoding)
23
+ body = doc.xpath("/html/body").first
24
+ return "" if body.nil?
25
+ body.children.each do |node|
26
+ traverse_conditionally_top_down(node, :whitewash_node)
27
+ end
28
+ body.children.map { |x| x.to_xml }.join
29
+ end
30
+
31
+ def whitewash_document(string_or_io, encoding=nil)
32
+ return nil if string_or_io.nil?
33
+ return "" if string_or_io.respond_to?(:strip) && string_or_io.strip.size == 0
34
+
35
+ doc = Nokogiri::HTML.parse(string_or_io, nil, encoding)
36
+ body = doc.xpath("/html/body").first
37
+ return "" if body.nil?
38
+ body.children.each do |node|
39
+ traverse_conditionally_top_down(node, :whitewash_node)
40
+ end
41
+ body.children.map { |x| x.to_xml }.join
42
+ end
43
+
44
+
45
+ def sanitize(string, encoding=nil)
46
+ return nil if string.nil?
47
+ return "" if string.strip.size == 0
48
+
49
+ string = "<html><body>" + string + "</body></html>"
50
+ doc = Nokogiri::HTML.parse(string, nil, encoding)
51
+ body = doc.xpath("/html/body").first
52
+ return "" if body.nil?
53
+ body.children.each do |node|
54
+ traverse_conditionally_top_down(node, :sanitize_node)
55
+ end
56
+ body.children.map { |x| x.to_xml }.join
57
+ end
58
+
59
+ def sanitize_document(string_or_io, encoding=nil)
60
+ return nil if string_or_io.nil?
61
+ return "" if string_or_io.respond_to?(:strip) && string_or_io.strip.size == 0
62
+
63
+ doc = Nokogiri::HTML.parse(string_or_io, nil, encoding)
64
+ elements = doc.xpath("/html/head/*","/html/body/*")
65
+ return "" if (elements.nil? || elements.empty?)
66
+ elements.each do |node|
67
+ traverse_conditionally_top_down(node, :sanitize_node)
68
+ end
69
+ doc.root.to_xml
70
+ end
71
+
72
+ private
73
+
74
+ def traverse_conditionally_top_down(node, method_name)
75
+ return if send(method_name, node)
76
+ node.children.each {|j| traverse_conditionally_top_down(j, method_name)}
77
+ end
78
+
79
+ def remove_tags_from_node(node)
80
+ replacement_killer = Nokogiri::XML::Text.new(node.text, node.document)
81
+ node.add_next_sibling(replacement_killer)
82
+ node.remove
83
+ return true
84
+ end
85
+
86
+ def sanitize_node(node)
87
+ case node.type
88
+ when 1 # Nokogiri::XML::Node::ELEMENT_NODE
89
+ if HashedWhiteList::ALLOWED_ELEMENTS[node.name]
90
+ node.attributes.each do |attr|
91
+ node.remove_attribute(attr.first) unless HashedWhiteList::ALLOWED_ATTRIBUTES[attr.first]
92
+ end
93
+ node.attributes.each do |attr|
94
+ if HashedWhiteList::ATTR_VAL_IS_URI[attr.first]
95
+ # this block lifted nearly verbatim from HTML5 sanitization
96
+ val_unescaped = CGI.unescapeHTML(attr.last.to_s).gsub(/`|[\000-\040\177\s]+|\302[\200-\240]/,'').downcase
97
+ if val_unescaped =~ /^[a-z0-9][-+.a-z0-9]*:/ and HashedWhiteList::ALLOWED_PROTOCOLS[val_unescaped.split(':')[0]].nil?
98
+ node.remove_attribute(attr.first)
99
+ end
100
+ end
101
+ end
102
+ if node.attributes['style']
103
+ node['style'] = sanitize_css(node.attributes['style'])
104
+ end
105
+ return false
106
+ end
107
+ when 3 # Nokogiri::XML::Node::TEXT_NODE
108
+ return false
109
+ when 4 # Nokogiri::XML::Node::CDATA_SECTION_NODE
110
+ return false
111
+ end
112
+ replacement_killer = Nokogiri::XML::Text.new(node.to_s, node.document)
113
+ node.add_next_sibling(replacement_killer)
114
+ node.remove
115
+ return true
116
+ end
117
+
118
+
119
+ def whitewash_node(node)
120
+ case node.type
121
+ when 1 # Nokogiri::XML::Node::ELEMENT_NODE
122
+ if HashedWhiteList::ALLOWED_ELEMENTS[node.name]
123
+ node.attributes.each { |attr| node.remove_attribute(attr.first) }
124
+ has_no_namespaces = true
125
+ begin
126
+ has_no_namespaces = node.namespaces.empty?
127
+ rescue
128
+ # older versions of nokogiri raise an exception when there
129
+ # is a namespace on the node that is not declared with an href.
130
+ # see http://github.com/tenderlove/nokogiri/commit/395d7971304e1489e92c494b9c50609f4b4c4ab0
131
+ has_no_namespaces = false
132
+ end
133
+ return false if has_no_namespaces
134
+ end
135
+ when 3 # Nokogiri::XML::Node::TEXT_NODE
136
+ return false
137
+ when 4 # Nokogiri::XML::Node::CDATA_SECTION_NODE
138
+ return false
139
+ end
140
+ node.remove
141
+ return true
142
+ end
143
+
144
+
145
+ # this is lifted nearly verbatim from html5
146
+ def sanitize_css(style)
147
+ # disallow urls
148
+ style = style.to_s.gsub(/url\s*\(\s*[^\s)]+?\s*\)\s*/, ' ')
149
+
150
+ # gauntlet
151
+ return '' unless style =~ /^([:,;#%.\sa-zA-Z0-9!]|\w-\w|\'[\s\w]+\'|\"[\s\w]+\"|\([\d,\s]+\))*$/
152
+ return '' unless style =~ /^\s*([-\w]+\s*:[^:;]*(;\s*|$))*$/
153
+
154
+ clean = []
155
+ style.scan(/([-\w]+)\s*:\s*([^:;]*)/) do |prop, val|
156
+ next if val.empty?
157
+ prop.downcase!
158
+ if HashedWhiteList::ALLOWED_CSS_PROPERTIES[prop]
159
+ clean << "#{prop}: #{val};"
160
+ elsif %w[background border margin padding].include?(prop.split('-')[0])
161
+ clean << "#{prop}: #{val};" unless val.split().any? do |keyword|
162
+ HashedWhiteList::ALLOWED_CSS_KEYWORDS[keyword].nil? and
163
+ keyword !~ /^(#[0-9a-f]+|rgb\(\d+%?,\d*%?,?\d*%?\)?|\d{0,2}\.?\d{0,2}(cm|em|ex|in|mm|pc|pt|px|%|,|\))?)$/
164
+ end
165
+ elsif HashedWhiteList::ALLOWED_SVG_PROPERTIES[prop]
166
+ clean << "#{prop}: #{val};"
167
+ end
168
+ end
169
+
170
+ style = clean.join(' ')
171
+ end
172
+
173
+ end # self
174
+
175
+ end
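The URI check inside `sanitize_node` above is easier to follow in isolation. The sketch below is a simplified variant, not the library's code: `uri_allowed?` and the three-entry `ALLOWED_PROTOCOLS` hash are assumptions made for illustration, and the cleanup regex omits the non-breaking-space handling of the original.

```ruby
require 'cgi'

# Illustrative subset of the protocol whitelist.
ALLOWED_PROTOCOLS = { "http" => true, "https" => true, "mailto" => true }

def uri_allowed?(raw_value)
  # Decode entities, then drop backticks, control characters and whitespace
  # that could be used to disguise a scheme such as "java\tscript:".
  value = CGI.unescapeHTML(raw_value.to_s).gsub(/[`\x00-\x20\x7f]+/, '').downcase
  # A value without a scheme (for example a relative URL) is accepted.
  return true unless value =~ /\A[a-z0-9][-+.a-z0-9]*:/
  ALLOWED_PROTOCOLS.key?(value.split(':').first)
end

puts uri_allowed?("http://jim.jim/")          # true
puts uri_allowed?("jimbo://jim.jim/")         # false
puts uri_allowed?(" java\tscript:alert(1)")   # false
puts uri_allowed?("/relative/path")           # true
```

Normalizing before matching is the point of the design: the scheme is compared only after the tricks that browsers tolerate have been stripped out.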
@@ -0,0 +1,159 @@
1
+ #
2
+ # HTML whitelist lifted from HTML5 sanitizer code
3
+ # http://code.google.com/p/html5lib/
4
+ #
5
+
6
+ module Dryopteris
7
+ module WhiteList
8
+ # <html5_license>
9
+ #
10
+ # Copyright (c) 2006-2008 The Authors
11
+ #
12
+ # Contributors:
13
+ # James Graham - jg307@cam.ac.uk
14
+ # Anne van Kesteren - annevankesteren@gmail.com
15
+ # Lachlan Hunt - lachlan.hunt@lachy.id.au
16
+ # Matt McDonald - kanashii@kanashii.ca
17
+ # Sam Ruby - rubys@intertwingly.net
18
+ # Ian Hickson (Google) - ian@hixie.ch
19
+ # Thomas Broyer - t.broyer@ltgt.net
20
+ # Jacques Distler - distler@golem.ph.utexas.edu
21
+ # Henri Sivonen - hsivonen@iki.fi
22
+ # The Mozilla Foundation (contributions from Henri Sivonen since 2008)
23
+ #
24
+ # Permission is hereby granted, free of charge, to any person
25
+ # obtaining a copy of this software and associated documentation
26
+ # files (the "Software"), to deal in the Software without
27
+ # restriction, including without limitation the rights to use, copy,
28
+ # modify, merge, publish, distribute, sublicense, and/or sell copies
29
+ # of the Software, and to permit persons to whom the Software is
30
+ # furnished to do so, subject to the following conditions:
31
+ #
32
+ # The above copyright notice and this permission notice shall be
33
+ # included in all copies or substantial portions of the Software.
34
+ #
35
+ # THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
36
+ # EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
37
+ # MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
38
+ # NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT
39
+ # HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY,
40
+ # WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
41
+ # OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER
42
+ # DEALINGS IN THE SOFTWARE.
43
+ #
44
+ # </html5_license>
45
+
46
+ ACCEPTABLE_ELEMENTS = %w[a abbr acronym address area b big blockquote br
47
+ button caption center cite code col colgroup dd del dfn dir div dl dt
48
+ em fieldset font form h1 h2 h3 h4 h5 h6 hr i img input ins kbd label
49
+ legend li map menu ol optgroup option p pre q s samp select small span
50
+ strike strong sub sup table tbody td textarea tfoot th thead tr tt u
51
+ ul var]
52
+
53
+ MATHML_ELEMENTS = %w[maction math merror mfrac mi mmultiscripts mn mo
54
+ mover mpadded mphantom mprescripts mroot mrow mspace msqrt mstyle msub
55
+ msubsup msup mtable mtd mtext mtr munder munderover none]
56
+
57
+ SVG_ELEMENTS = %w[a animate animateColor animateMotion animateTransform
58
+ circle defs desc ellipse font-face font-face-name font-face-src g
59
+ glyph hkern image linearGradient line marker metadata missing-glyph
60
+ mpath path polygon polyline radialGradient rect set stop svg switch
61
+ text title tspan use]
62
+
63
+ ACCEPTABLE_ATTRIBUTES = %w[abbr accept accept-charset accesskey action
64
+ align alt axis border cellpadding cellspacing char charoff charset
65
+ checked cite class clear cols colspan color compact coords datetime
66
+ dir disabled enctype for frame headers height href hreflang hspace id
67
+ ismap label lang longdesc maxlength media method multiple name nohref
68
+ noshade nowrap prompt readonly rel rev rows rowspan rules scope
69
+ selected shape size span src start style summary tabindex target title
70
+ type usemap valign value vspace width xml:lang]
71
+
72
+ MATHML_ATTRIBUTES = %w[actiontype align columnalign columnalign
73
+ columnalign columnlines columnspacing columnspan depth display
74
+ displaystyle equalcolumns equalrows fence fontstyle fontweight frame
75
+ height linethickness lspace mathbackground mathcolor mathvariant
76
+ mathvariant maxsize minsize other rowalign rowalign rowalign rowlines
77
+ rowspacing rowspan rspace scriptlevel selection separator stretchy
78
+ width width xlink:href xlink:show xlink:type xmlns xmlns:xlink]
79
+
80
+ SVG_ATTRIBUTES = %w[accent-height accumulate additive alphabetic
81
+ arabic-form ascent attributeName attributeType baseProfile bbox begin
82
+ by calcMode cap-height class color color-rendering content cx cy d dx
83
+ dy descent display dur end fill fill-rule font-family font-size
84
+ font-stretch font-style font-variant font-weight from fx fy g1 g2
85
+ glyph-name gradientUnits hanging height horiz-adv-x horiz-origin-x id
86
+ ideographic k keyPoints keySplines keyTimes lang marker-end
87
+ marker-mid marker-start markerHeight markerUnits markerWidth
88
+ mathematical max min name offset opacity orient origin
89
+ overline-position overline-thickness panose-1 path pathLength points
90
+ preserveAspectRatio r refX refY repeatCount repeatDur
91
+ requiredExtensions requiredFeatures restart rotate rx ry slope stemh
92
+ stemv stop-color stop-opacity strikethrough-position
93
+ strikethrough-thickness stroke stroke-dasharray stroke-dashoffset
94
+ stroke-linecap stroke-linejoin stroke-miterlimit stroke-opacity
95
+ stroke-width systemLanguage target text-anchor to transform type u1
96
+ u2 underline-position underline-thickness unicode unicode-range
97
+ units-per-em values version viewBox visibility width widths x
98
+ x-height x1 x2 xlink:actuate xlink:arcrole xlink:href xlink:role
99
+ xlink:show xlink:title xlink:type xml:base xml:lang xml:space xmlns
100
+ xmlns:xlink y y1 y2 zoomAndPan]
101
+
102
+ ATTR_VAL_IS_URI = %w[href src cite action longdesc xlink:href xml:base]
103
+
104
+ ACCEPTABLE_CSS_PROPERTIES = %w[azimuth background-color
105
+ border-bottom-color border-collapse border-color border-left-color
106
+ border-right-color border-top-color clear color cursor direction
107
+ display elevation float font font-family font-size font-style
108
+ font-variant font-weight height letter-spacing line-height overflow
109
+ pause pause-after pause-before pitch pitch-range richness speak
110
+ speak-header speak-numeral speak-punctuation speech-rate stress
111
+ text-align text-decoration text-indent unicode-bidi vertical-align
112
+ voice-family volume white-space width]
113
+
114
+ ACCEPTABLE_CSS_KEYWORDS = %w[auto aqua black block blue bold both bottom
115
+ brown center collapse dashed dotted fuchsia gray green !important
116
+ italic left lime maroon medium none navy normal nowrap olive pointer
117
+ purple red right solid silver teal top transparent underline white
118
+ yellow]
119
+
120
+ ACCEPTABLE_SVG_PROPERTIES = %w[fill fill-opacity fill-rule stroke
121
+ stroke-width stroke-linecap stroke-linejoin stroke-opacity]
122
+
123
+ ACCEPTABLE_PROTOCOLS = %w[ed2k ftp http https irc mailto news gopher nntp
124
+ telnet webcal xmpp callto feed urn aim rsync tag ssh sftp rtsp afs]
125
+
126
+ # subclasses may define their own versions of these constants
127
+ ALLOWED_ELEMENTS = ACCEPTABLE_ELEMENTS + MATHML_ELEMENTS + SVG_ELEMENTS
128
+ ALLOWED_ATTRIBUTES = ACCEPTABLE_ATTRIBUTES + MATHML_ATTRIBUTES + SVG_ATTRIBUTES
129
+ ALLOWED_CSS_PROPERTIES = ACCEPTABLE_CSS_PROPERTIES
130
+ ALLOWED_CSS_KEYWORDS = ACCEPTABLE_CSS_KEYWORDS
131
+ ALLOWED_SVG_PROPERTIES = ACCEPTABLE_SVG_PROPERTIES
132
+ ALLOWED_PROTOCOLS = ACCEPTABLE_PROTOCOLS
133
+
134
+ VOID_ELEMENTS = %w[
135
+ base
136
+ link
137
+ meta
138
+ hr
139
+ br
140
+ img
141
+ embed
142
+ param
143
+ area
144
+ col
145
+ input
146
+ ]
147
+ end
148
+
149
+ module HashedWhiteList
150
+ # turn each of the whitelist arrays into a hash for faster lookup
151
+ WhiteList.constants.each do |constant|
152
+ next unless WhiteList.module_eval("#{constant}").is_a?(Array)
153
+ module_eval <<-CODE
154
+ #{constant} = {}
155
+ WhiteList::#{constant}.each { |c| #{constant}[c] = true ; #{constant}[c.downcase] = true }
156
+ CODE
157
+ end
158
+ end
159
+ end
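`HashedWhiteList` above builds its lookup tables with `module_eval`; the same idea can be shown without metaprogramming. In this sketch `to_lookup`, `ALLOWED_ELEMENTS`, and `ELEMENT_LOOKUP` are illustrative names and the element list is a small made-up sample.

```ruby
# Small sample list; the real whitelist arrays are far longer.
ALLOWED_ELEMENTS = %w[a b p linearGradient]

# Index every name, plus its lowercase form, so membership is one hash lookup
# instead of a scan through the array.
def to_lookup(names)
  names.each_with_object({}) do |name, lookup|
    lookup[name] = true
    lookup[name.downcase] = true
  end
end

ELEMENT_LOOKUP = to_lookup(ALLOWED_ELEMENTS)

puts ELEMENT_LOOKUP["p"].inspect               # true
puts ELEMENT_LOOKUP["lineargradient"].inspect  # true
puts ELEMENT_LOOKUP["script"].inspect          # nil
```

Storing the lowercase form as well matters for mixed-case SVG names, since the HTML parser hands back downcased tag names.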
data/test/helper.rb ADDED
@@ -0,0 +1,8 @@
1
+ require 'test/unit'
2
+ require File.expand_path(File.join(File.dirname(__FILE__), "..", "lib", "dryopteris"))
3
+
4
+ if defined? Nokogiri::VERSION_INFO
5
+ puts "=> running with Nokogiri #{Nokogiri::VERSION_INFO.inspect}"
6
+ else
7
+ puts "=> running with Nokogiri #{Nokogiri::VERSION} / libxml #{Nokogiri::LIBXML_PARSER_VERSION}"
8
+ end
@@ -0,0 +1,185 @@
1
+ #
2
+ # these tests taken from the HTML5 sanitization project and modified for use with Dryopteris
3
+ # see the original here: http://code.google.com/p/html5lib/source/browse/ruby/test/test_sanitizer.rb
4
+ #
5
+ # license text at the bottom of this file
6
+ #
7
+ require File.expand_path(File.join(File.dirname(__FILE__), '..', 'helper'))
8
+
9
+ class SanitizeTest < Test::Unit::TestCase
10
+ include Dryopteris
11
+
12
+ def sanitize_html stream
13
+ Dryopteris.sanitize(stream)
14
+ end
15
+
16
+ def sanitize_doc stream
17
+ Dryopteris.sanitize_document(stream)
18
+ end
19
+
20
+ def check_sanitization(input, htmloutput, xhtmloutput, rexmloutput)
21
+ # libxml uses double-quotes, so let's swappo-boppo our quotes before comparing.
22
+ assert_equal htmloutput, sanitize_html(input).gsub(/"/,"'"), input
23
+
24
+ doc = sanitize_doc(input).gsub(/"/,"'")
25
+ assert doc.include?(htmloutput), "#{input}:\n#{doc}\nshould include:\n#{htmloutput}"
26
+ end
27
+
28
+ WhiteList::ALLOWED_ELEMENTS.each do |tag_name|
29
+ define_method "test_should_allow_#{tag_name}_tag" do
30
+ input = "<#{tag_name} title='1'>foo <bad>bar</bad> baz</#{tag_name}>"
31
+ htmloutput = "<#{tag_name.downcase} title='1'>foo &lt;bad&gt;bar&lt;/bad&gt; baz</#{tag_name.downcase}>"
32
+ xhtmloutput = "<#{tag_name} title='1'>foo &lt;bad&gt;bar&lt;/bad&gt; baz</#{tag_name}>"
33
+ rexmloutput = xhtmloutput
34
+
35
+ ##
36
+ ## these special cases are HTML5-tokenizer-dependent.
37
+ ## libxml2 cleans up HTML differently, and I trust that.
38
+ ##
39
+ # if %w[caption colgroup optgroup option tbody td tfoot th thead tr].include?(tag_name)
40
+ # htmloutput = "foo &lt;bad&gt;bar&lt;/bad&gt; baz"
41
+ # xhtmloutput = htmloutput
42
+ # elsif tag_name == 'col'
43
+ # htmloutput = "foo &lt;bad&gt;bar&lt;/bad&gt; baz"
44
+ # xhtmloutput = htmloutput
45
+ # rexmloutput = "<col title='1' />"
46
+ # elsif tag_name == 'table'
47
+ # htmloutput = "foo &lt;bad&gt;bar&lt;/bad&gt;baz<table title='1'> </table>"
48
+ # xhtmloutput = htmloutput
49
+ # elsif tag_name == 'image'
50
+ # htmloutput = "<image title='1'/>foo &lt;bad&gt;bar&lt;/bad&gt; baz"
51
+ # xhtmloutput = htmloutput
52
+ # rexmloutput = "<image title='1'>foo &lt;bad&gt;bar&lt;/bad&gt; baz</image>"
53
+ if WhiteList::VOID_ELEMENTS.include?(tag_name)
54
+ if Nokogiri::LIBXML_VERSION <= "2.6.16"
55
+ htmloutput = "<#{tag_name} title='1'/><p>foo &lt;bad&gt;bar&lt;/bad&gt; baz</p>"
56
+ else
57
+ htmloutput = "<#{tag_name} title='1'/>foo &lt;bad&gt;bar&lt;/bad&gt; baz"
58
+ end
59
+ xhtmloutput = htmloutput
60
+ # htmloutput += '<br/>' if tag_name == 'br'
61
+ rexmloutput = "<#{tag_name} title='1' />"
62
+ end
63
+ check_sanitization(input, htmloutput, xhtmloutput, rexmloutput)
64
+ end
65
+ end
66
+
67
+ ##
68
+ ## libxml2 downcases tag names as it parses, so this is unnecessary.
69
+ ##
70
+ # WhiteList::ALLOWED_ELEMENTS.each do |tag_name|
71
+ # define_method "test_should_forbid_#{tag_name.upcase}_tag" do
72
+ # input = "<#{tag_name.upcase} title='1'>foo <bad>bar</bad> baz</#{tag_name.upcase}>"
73
+ # output = "&lt;#{tag_name.upcase} title=\"1\"&gt;foo &lt;bad&gt;bar&lt;/bad&gt; baz&lt;/#{tag_name.upcase}&gt;"
74
+ # check_sanitization(input, output, output, output)
75
+ # end
76
+ # end
77
+
78
+ WhiteList::ALLOWED_ATTRIBUTES.each do |attribute_name|
79
+ next if attribute_name == 'style'
80
+ next if attribute_name =~ /:/ && Nokogiri::LIBXML_VERSION <= '2.6.16'
81
+ define_method "test_should_allow_#{attribute_name}_attribute" do
82
+ input = "<p #{attribute_name}='foo'>foo <bad>bar</bad> baz</p>"
83
+ output = "<p #{attribute_name}='foo'>foo &lt;bad&gt;bar&lt;/bad&gt; baz</p>"
84
+ htmloutput = "<p #{attribute_name.downcase}='foo'>foo &lt;bad&gt;bar&lt;/bad&gt; baz</p>"
85
+ check_sanitization(input, htmloutput, output, output)
86
+ end
87
+ end
88
+
89
+ ##
90
+ ## libxml2 downcases attributes as it parses, so this is unnecessary.
91
+ ##
92
+ # WhiteList::ALLOWED_ATTRIBUTES.each do |attribute_name|
93
+ # define_method "test_should_forbid_#{attribute_name.upcase}_attribute" do
94
+ # input = "<p #{attribute_name.upcase}='display: none;'>foo <bad>bar</bad> baz</p>"
95
+ # output = "<p>foo &lt;bad&gt;bar&lt;/bad&gt; baz</p>"
96
+ # check_sanitization(input, output, output, output)
97
+ # end
98
+ # end
99
+
100
+ WhiteList::ALLOWED_PROTOCOLS.each do |protocol|
101
+ define_method "test_should_allow_#{protocol}_uris" do
102
+ input = %(<a href="#{protocol}">foo</a>)
103
+ output = "<a href='#{protocol}'>foo</a>"
104
+ check_sanitization(input, output, output, output)
105
+ end
106
+ end
107
+
108
+ WhiteList::ALLOWED_PROTOCOLS.each do |protocol|
109
+ define_method "test_should_allow_uppercase_#{protocol}_uris" do
110
+ input = %(<a href="#{protocol.upcase}">foo</a>)
111
+ output = "<a href='#{protocol.upcase}'>foo</a>"
112
+ check_sanitization(input, output, output, output)
113
+ end
114
+ end
115
+
116
+ if Nokogiri::LIBXML_VERSION > '2.6.16'
117
+ def test_should_handle_astral_plane_characters
118
+ input = "<p>&#x1d4b5; &#x1d538;</p>"
119
+ output = "<p>\360\235\222\265 \360\235\224\270</p>"
120
+ check_sanitization(input, output, output, output)
121
+
122
+ input = "<p><tspan>\360\235\224\270</tspan> a</p>"
123
+ output = "<p><tspan>\360\235\224\270</tspan> a</p>"
124
+ check_sanitization(input, output, output, output)
125
+ end
126
+ end
127
+
128
+ # This affects only NS4. Is it worth fixing?
129
+ # def test_javascript_includes
130
+ # input = %(<div size="&{alert('XSS')}">foo</div>)
131
+ # output = "<div>foo</div>"
132
+ # check_sanitization(input, output, output, output)
133
+ # end
134
+
135
+ #html5_test_files('sanitizer').each do |filename|
136
+ # JSON::parse(open(filename).read).each do |test|
137
+ # define_method "test_#{test['name']}" do
138
+ # check_sanitization(
139
+ # test['input'],
140
+ # test['output'],
141
+ # test['xhtml'] || test['output'],
142
+ # test['rexml'] || test['output']
143
+ # )
144
+ # end
145
+ # end
146
+ #end
147
+ end
148
+
149
+ # <html5_license>
150
+ #
151
+ # Copyright (c) 2006-2008 The Authors
152
+ #
153
+ # Contributors:
154
+ # James Graham - jg307@cam.ac.uk
155
+ # Anne van Kesteren - annevankesteren@gmail.com
156
+ # Lachlan Hunt - lachlan.hunt@lachy.id.au
157
+ # Matt McDonald - kanashii@kanashii.ca
158
+ # Sam Ruby - rubys@intertwingly.net
159
+ # Ian Hickson (Google) - ian@hixie.ch
160
+ # Thomas Broyer - t.broyer@ltgt.net
161
+ # Jacques Distler - distler@golem.ph.utexas.edu
162
+ # Henri Sivonen - hsivonen@iki.fi
163
+ # The Mozilla Foundation (contributions from Henri Sivonen since 2008)
164
+ #
165
+ # Permission is hereby granted, free of charge, to any person
166
+ # obtaining a copy of this software and associated documentation files
167
+ # (the "Software"), to deal in the Software without restriction,
168
+ # including without limitation the rights to use, copy, modify, merge,
169
+ # publish, distribute, sublicense, and/or sell copies of the Software,
170
+ # and to permit persons to whom the Software is furnished to do so,
171
+ # subject to the following conditions:
172
+ #
173
+ # The above copyright notice and this permission notice shall be
174
+ # included in all copies or substantial portions of the Software.
175
+ #
176
+ # THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
177
+ # EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
178
+ # MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
179
+ # NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
180
+ # BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
181
+ # ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
182
+ # CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
183
+ # SOFTWARE.
184
+ #
185
+ # </html5_license>
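The guards in the test file above compare `Nokogiri::LIBXML_VERSION` against `"2.6.16"` as plain strings, which orders versions character by character. That works for the 2.6.x and 2.7.x releases the tests were written against, but it would misorder a release such as 2.10.0. A possible alternative, sketched here with `Gem::Version` from RubyGems (the helper name `old_libxml?` is hypothetical and this is not what the test file does):

```ruby
require 'rubygems'

# Compare version strings component by component rather than by character.
def old_libxml?(version_string)
  Gem::Version.new(version_string) <= Gem::Version.new("2.6.16")
end

puts("2.10.0" <= "2.6.16")   # true  (string comparison, misleading)
puts old_libxml?("2.10.0")   # false
puts old_libxml?("2.6.16")   # true
puts old_libxml?("2.7.3")    # false
```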
@@ -0,0 +1,76 @@
1
+ require File.expand_path(File.join(File.dirname(__FILE__), 'helper'))
2
+
3
+ class TestBasic < Test::Unit::TestCase
4
+
5
+ def test_nil
6
+ assert_nil Dryopteris.sanitize(nil)
7
+ end
8
+
9
+ def test_empty_string
10
+ assert_equal "", Dryopteris.sanitize("")
11
+ end
12
+
13
+ def test_removal_of_illegal_tag
14
+ html = <<-HTML
15
+ following this there should be no jim tag
16
+ <jim>jim</jim>
17
+ was there?
18
+ HTML
19
+ sane = Nokogiri::HTML(Dryopteris.sanitize(html))
20
+ assert sane.xpath("//jim").empty?
21
+ end
22
+
23
+ def test_removal_of_illegal_attribute
24
+ html = "<p class=bar foo=bar abbr=bar />"
25
+ sane = Nokogiri::HTML(Dryopteris.sanitize(html))
26
+ node = sane.xpath("//p").first
27
+ assert node.attributes['class']
28
+ assert node.attributes['abbr']
29
+ assert_nil node.attributes['foo']
30
+ end
31
+
32
+ def test_removal_of_illegal_url_in_href
33
+ html = <<-HTML
34
+ <a href='jimbo://jim.jim/'>this link should have its href removed because of illegal url</a>
35
+ <a href='http://jim.jim/'>this link should be fine</a>
36
+ HTML
37
+ sane = Nokogiri::HTML(Dryopteris.sanitize(html))
38
+ nodes = sane.xpath("//a")
39
+ assert_nil nodes.first.attributes['href']
40
+ assert nodes.last.attributes['href']
41
+ end
42
+
43
+ def test_css_sanitization
44
+ html = "<p style='background-color: url(\"http://foo.com/\") ; background-color: #000 ;' />"
45
+ sane = Nokogiri::HTML(Dryopteris.sanitize(html))
46
+ assert_match(/#000/, sane.inner_html)
47
+ assert_no_match(/foo\.com/, sane.inner_html)
48
+ end
49
+
50
+ def test_fragment_with_no_tags
51
+ assert_equal "This fragment has no tags.", Dryopteris.sanitize("This fragment has no tags.")
52
+ end
53
+
54
+ def test_fragment_in_p_tag
55
+ assert_equal "<p>This fragment is in a p.</p>", Dryopteris.sanitize("<p>This fragment is in a p.</p>")
56
+ end
57
+
58
+ def test_fragment_in_a_nontrivial_p_tag
59
+ assert_equal " \n<p>This fragment is in a p.</p>", Dryopteris.sanitize(" \n<p foo='bar'>This fragment is in a p.</p>")
60
+ end
61
+
62
+ def test_fragment_in_p_tag_plus_stuff
63
+ assert_equal "<p>This fragment is in a p.</p>foo<strong>bar</strong>", Dryopteris.sanitize("<p>This fragment is in a p.</p>foo<strong>bar</strong>")
64
+ end
65
+
66
+ def test_fragment_with_text_nodes_leading_and_trailing
67
+ assert_equal "text<p>fragment</p>text", Dryopteris.sanitize("text<p>fragment</p>text")
68
+ end
69
+
70
+ def test_whitewash_on_fragment
71
+ html = "safe<frameset rows=\"*\"><frame src=\"http://example.com\"></frameset> <b>description</b>"
72
+ whitewashed = Dryopteris.whitewash_document(html)
73
+ assert_equal "<p>safe</p><b>description</b>", whitewashed
74
+ end
75
+
76
+ end
@@ -0,0 +1,40 @@
1
+ require File.expand_path(File.join(File.dirname(__FILE__), 'helper'))
2
+
3
+ class TestStripTags < Test::Unit::TestCase
4
+
5
+ def test_nil
6
+ assert_nil Dryopteris.strip_tags(nil)
7
+ end
8
+
9
+ def test_empty_string
10
+ assert_equal "", Dryopteris.strip_tags("")
11
+ end
12
+
13
+ def test_return_empty_string_when_nothing_left
14
+ assert_equal "", Dryopteris.strip_tags('<script>test</script>')
15
+ end
16
+
17
+ def test_removal_of_all_tags
18
+ html = <<-HTML
19
+ What's up <strong>doc</strong>?
20
+ HTML
21
+ stripped = Dryopteris.strip_tags(html)
22
+ assert_equal "What's up doc?".strip, stripped.strip
23
+ end
24
+
25
+ def test_dont_remove_whitespace
26
+ html = "Foo\nBar"
27
+ assert_equal html, Dryopteris.strip_tags(html)
28
+ end
29
+
30
+ def test_dont_remove_whitespace_between_tags
31
+ html = "<p>Foo</p>\n<p>Bar</p>"
32
+ assert_equal "Foo\nBar", Dryopteris.strip_tags(html)
33
+ end
34
+
35
+ def test_removal_of_entities
36
+ html = "<p>this is &lt; that &quot;&amp;&quot; the other &gt; boo&apos;ya</p>"
37
+ assert_equal 'this is < that "&" the other > boo\'ya', Dryopteris.strip_tags(html)
38
+ end
39
+
40
+ end
metadata ADDED
@@ -0,0 +1,77 @@
1
+ --- !ruby/object:Gem::Specification
2
+ name: jmcnevin-dryopteris
3
+ version: !ruby/object:Gem::Version
4
+ version: 0.1.2
5
+ platform: ruby
6
+ authors:
7
+ - Bryan Helmkamp
8
+ - Mike Dalessio
9
+ autorequire:
10
+ bindir: bin
11
+ cert_chain: []
12
+
13
+ date: 2009-02-10 00:00:00 -08:00
14
+ default_executable:
15
+ dependencies:
16
+ - !ruby/object:Gem::Dependency
17
+ name: nokogiri
18
+ type: :runtime
19
+ version_requirement:
20
+ version_requirements: !ruby/object:Gem::Requirement
21
+ requirements:
22
+ - - ">"
23
+ - !ruby/object:Gem::Version
24
+ version: 0.0.0
25
+ version:
26
+ description: Dryopteris erythrosora is the Japanese Shield Fern. It also can be used to sanitize HTML to help prevent XSS attacks.
27
+ email:
28
+ - bryan@brynary.com
29
+ - mike.dalessio@gmail.com
30
+ executables: []
31
+
32
+ extensions: []
33
+
34
+ extra_rdoc_files: []
35
+
36
+ files:
37
+ - README.markdown
38
+ - VERSION.yml
39
+ - lib/dryopteris
40
+ - lib/dryopteris/rails_extension.rb
41
+ - lib/dryopteris/sanitize.rb
42
+ - lib/dryopteris/whitelist.rb
43
+ - lib/dryopteris.rb
44
+ - test/test_basic.rb
45
+ - test/test_strip_tags.rb
46
+ - test/helper.rb
47
+ - test/html5/test_sanitizer.rb
48
+ has_rdoc: true
49
+ homepage: http://github.com/brynary/dryopteris/tree/master
50
+ licenses:
51
+ post_install_message:
52
+ rdoc_options:
53
+ - --inline-source
54
+ - --charset=UTF-8
55
+ require_paths:
56
+ - lib
57
+ required_ruby_version: !ruby/object:Gem::Requirement
58
+ requirements:
59
+ - - ">="
60
+ - !ruby/object:Gem::Version
61
+ version: "0"
62
+ version:
63
+ required_rubygems_version: !ruby/object:Gem::Requirement
64
+ requirements:
65
+ - - ">="
66
+ - !ruby/object:Gem::Version
67
+ version: "0"
68
+ version:
69
+ requirements: []
70
+
71
+ rubyforge_project:
72
+ rubygems_version: 1.3.5
73
+ signing_key:
74
+ specification_version: 2
75
+ summary: HTML sanitization using Nokogiri
76
+ test_files: []
77
+