RubyGems - jmcnevin-dryopteris - Versions diffs - 0.1.2 - Mend

jmcnevin-dryopteris 0.1.2

Files changed (11)

data/README.markdown +97 -0
data/VERSION.yml +4 -0
data/lib/dryopteris.rb +12 -0
data/lib/dryopteris/rails_extension.rb +46 -0
data/lib/dryopteris/sanitize.rb +175 -0
data/lib/dryopteris/whitelist.rb +159 -0
data/test/helper.rb +8 -0
data/test/html5/test_sanitizer.rb +185 -0
data/test/test_basic.rb +76 -0
data/test/test_strip_tags.rb +40 -0
metadata +77 -0

data/README.markdown ADDED

@@ -0,0 +1,97 @@
+Dryopteris
+==========
+Dryopteris erythrosora is the Japanese Shield Fern. It also can be used to sanitize HTML to help prevent XSS attacks.
+* [Dryopteris erythrosora](http://en.wikipedia.org/wiki/Dryopteris_erythrosora)
+* [XSS Attacks](http://en.wikipedia.org/wiki/Cross-site_scripting)
+Usage
+-----
+Let's say you run a web site, and you allow people to post HTML snippets.
+Let's also say some script-kiddie from Norland posts this to your site, in an effort to swipe some credit cards:
+    <SCRIPT SRC=http://ha.ckers.org/xss.js></SCRIPT>
+Oooh, that could be bad. Here's how to fix it:
+    safe_html_snippet = Dryopteris.sanitize(dangerous_html_snippet)
+Yeah, it's that easy.
+In this example, <tt>safe\_html\_snippet</tt> will have all of its __broken markup fixed__ by libxml2, and it will also be completely __sanitized of harmful tags and attributes__. That's twice as clean!
+Sanitization Usage
+-----
+You're still here? Ok, let me tell you a little something about the two different methods of sanitizing that Dryopteris offers.
+### Fragments
+The first method is for _html fragments_, which are small snippets of markup such as those used in forum posts, emails and homework assignments.
+Usage is the same as above:
+    safe_html_snippet = Dryopteris.sanitize(dangerous_html_snippet)
+Generally speaking, unless you expect to have &lt;html&gt; and &lt;body&gt; tags in your HTML, this is the sanitizing method to use.
+The only real limitation on this method is that the snippet must be a string object. (Support for IO objects was sacrificed at the altar of fixer-uppery-ness. If you need to sanitize data that's coming from an IO object, either socket or file, check out the next section on __Documents__).
+### Documents
+Sometimes you need to sanitize an entire HTML document. (Well, maybe not _you_, but other people, certainly.)
+    safe_html_document = Dryopteris.sanitize_document(dangerous_html_document)
+The returned string will contain exactly one (1) well-formed HTML document, with all broken HTML fixed and all harmful tags and attributes removed.
+Coolness: <tt>dangerous\_html\_document</tt> can be a string OR an IO object (a file, or a socket, or ...). Which makes it particularly easy to sanitize large numbers of docs.
+Whitewashing Usage
+-----
+### Whitewashing Fragments
+Other times, you may want to remove all styling, attributes and invalid HTML tags. I like to call this "whitewashing", since it's putting a new layer of paint on top of the HTML input to make it look nice.
+One use case for this feature is to clean up HTML that was cut-and-pasted from Microsoft(tm) Word into a WYSIWYG editor/textarea. Microsoft's editor is famous for injecting all kinds of cruft into its HTML output. Who needs that? Certainly not me.
+    whitewashed_html = Dryopteris.whitewash(ugly_microsoft_html_snippet)
+Please note that whitewashing implicitly also sanitizes your HTML, as it uses the same HTML tag whitelist as <tt>sanitize()</tt>. Its implementation is:
+ 1. unless the tag is on the whitelist, remove it from the document
+ 2. if the tag has an XML namespace on it, remove it from the document
+ 3. remove all attributes from the node
+### Whitewashing Documents
+Also note the existence of <tt>whitewash\_document</tt>, which is analogous to <tt>sanitize\_document</tt>.
+Standing on the Shoulders of Giants
+-----
+Dryopteris uses [Nokogiri](http://nokogiri.rubyforge.org/) and [libxml2](http://xmlsoft.org/), so it's fast.
+Dryopteris also takes its tag and tag attribute whitelists and its CSS sanitizer directly from [HTML5](http://code.google.com/p/html5lib/).
+Authors
+-----
+* [Bryan Helmkamp](http://www.brynary.com/)
+* [Mike Dalessio](http://mike.daless.io/) ([twitter](http://twitter.com/flavorjones))
+Quotes About Dryopteris
+-----
+> "dryopteris shields you from xss attacks using nokogiri and NY attitude"
+>  - [hasmanyjosh](http://blog.hasmanythrough.com/)
+> "I just wanted to say thank you for your dryopteris plugin. It is by far the best sanitization I've found."
+>  - [catalystmediastudios](http://github.com/catalystmediastudios)
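
The three whitewashing steps listed in the README can be sketched without Nokogiri. The `Node` struct, the two-entry `ALLOWED` table and the `namespaced` flag below are hypothetical stand-ins chosen for illustration; the gem itself walks Nokogiri nodes and consults `HashedWhiteList::ALLOWED_ELEMENTS`, so treat this as a rough model rather than the actual implementation.

```ruby
# Hypothetical toy node type; the gem operates on Nokogiri nodes instead.
Node = Struct.new(:name, :attributes, :children, :namespaced)

# Assumption: a tiny whitelist for the demo, not the gem's real list.
ALLOWED = { "p" => true, "b" => true }

# Returns a whitewashed copy of the node, or nil when the node must be removed.
def whitewash(node)
  return node if node.is_a?(String)        # text nodes are kept as-is
  return nil unless ALLOWED[node.name]     # step 1: tag is not on the whitelist
  return nil if node.namespaced            # step 2: tag carries an XML namespace
  kept = node.children.map { |child| whitewash(child) }.compact
  Node.new(node.name, {}, kept, false)     # step 3: drop every attribute
end

# Serialises the toy tree so the result is easy to inspect.
def render(node)
  return node if node.is_a?(String)
  "<#{node.name}>#{node.children.map { |c| render(c) }.join}</#{node.name}>"
end

tree = Node.new("p", { "style" => "color:red" }, [
  "safe ",
  Node.new("p", {}, ["word cruft"], true),      # namespaced, so removed
  Node.new("script", {}, ["alert(1)"], false),  # not whitelisted, so removed
  Node.new("b", { "class" => "x" }, ["bold"], false)
], false)

html = render(whitewash(tree))
```

With this input, `html` comes out as `<p>safe <b>bold</b></p>`. Rejected nodes disappear together with their contents, which is how the `node.remove` call in `whitewash_node` behaves.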

data/VERSION.yml ADDED

@@ -0,0 +1,4 @@
+---
+:major: 0
+:minor: 0
+:patch: 0

data/lib/dryopteris.rb ADDED

@@ -0,0 +1,12 @@
+$LOAD_PATH.unshift(File.expand_path(File.dirname(__FILE__))) unless $LOAD_PATH.include?(File.expand_path(File.dirname(__FILE__)))
+require 'rubygems'
+gem     'nokogiri', '>=1.2.4'
+require 'nokogiri'
+require "dryopteris/whitelist"
+require "dryopteris/sanitize"
+module Dryopteris
+  VERSION = '0.1'
+end

data/lib/dryopteris/rails_extension.rb ADDED

@@ -0,0 +1,46 @@
+require "dryopteris"
+module Dryopteris
+  module RailsExtension
+    def self.included(base)
+      base.extend(ClassMethods)
+      # sets up default of stripping tags for all fields
+      base.class_eval do
+        before_save :sanitize_fields
+        class_inheritable_reader :dryopteris_options
+      end
+    end
+    module ClassMethods
+      def sanitize_fields(options = {})
+        write_inheritable_attribute(:dryopteris_options, {
+          :except     => (options[:except] || []),
+          :allow_tags => (options[:allow_tags] || [])
+        })
+      end
+      alias_method :sanitize_field, :sanitize_fields
+    end
+    def sanitize_fields
+      self.class.columns.each do |column|
+        next unless (column.type == :string || column.type == :text)
+        field = column.name.to_sym
+        value = self[field]
+        if dryopteris_options && dryopteris_options[:except].include?(field)
+          next
+        elsif dryopteris_options && dryopteris_options[:allow_tags].include?(field)
+          self[field] = Dryopteris.sanitize(value)
+        else
+          self[field] = Dryopteris.strip_tags(value)
+        end
+      end
+    end
+  end
+end
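
The per-field decision made by the `sanitize_fields` callback can be modelled in plain Ruby. In this sketch the `STRIP` and `SANITIZE` lambdas and the `clean_field` helper are hypothetical names that only label which branch was taken; the real extension calls `Dryopteris.strip_tags` and `Dryopteris.sanitize` on ActiveRecord columns.

```ruby
# Hypothetical stand-ins for Dryopteris.strip_tags and Dryopteris.sanitize.
STRIP    = lambda { |value| "stripped(#{value})" }
SANITIZE = lambda { |value| "sanitized(#{value})" }

# Mirrors the branch order of the callback: skip, sanitize, else strip.
def clean_field(field, value, options)
  return value                if options[:except].include?(field)
  return SANITIZE.call(value) if options[:allow_tags].include?(field)
  STRIP.call(value)
end

options = { :except => [:token], :allow_tags => [:body] }
record  = { :token => "<x>", :body => "<b>hi</b>", :title => "<i>t</i>" }

cleaned = record.map { |field, value| [field, clean_field(field, value, options)] }.to_h
```

Here `:token` is left untouched, `:body` takes the sanitize path, and `:title` falls through to the default of stripping all tags. In the extension itself, a model that never calls the class-level `sanitize_fields` has no options set, so every string and text column takes the strip path.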

data/lib/dryopteris/sanitize.rb ADDED

@@ -0,0 +1,175 @@
+require 'cgi'
+module Dryopteris
+  class << self
+    def strip_tags(string_or_io, encoding=nil)
+      return nil if string_or_io.nil?
+      return "" if string_or_io.respond_to?(:strip) && string_or_io.strip.size == 0
+      doc = Nokogiri::HTML.parse(string_or_io, nil, encoding)
+      body_element = doc.at("/html/body")
+      return "" if body_element.nil?
+      body_element.inner_text
+    end
+    def whitewash(string, encoding=nil)
+      return nil if string.nil?
+      return "" if string.strip.size == 0
+      string = "<html><body>" + string + "</body></html>"
+      doc = Nokogiri::HTML.parse(string, nil, encoding)
+      body = doc.xpath("/html/body").first
+      return "" if body.nil?
+      body.children.each do |node|
+        traverse_conditionally_top_down(node, :whitewash_node)
+      end
+      body.children.map { |x| x.to_xml }.join
+    end
+    def whitewash_document(string_or_io, encoding=nil)
+      return nil if string_or_io.nil?
+      return "" if string_or_io.respond_to?(:strip) && string_or_io.strip.size == 0
+      doc = Nokogiri::HTML.parse(string_or_io, nil, encoding)
+      body = doc.xpath("/html/body").first
+      return "" if body.nil?
+      body.children.each do |node|
+        traverse_conditionally_top_down(node, :whitewash_node)
+      end
+      body.children.map { |x| x.to_xml }.join
+    end
+    def sanitize(string, encoding=nil)
+      return nil if string.nil?
+      return "" if string.strip.size == 0
+      string = "<html><body>" + string + "</body></html>"
+      doc = Nokogiri::HTML.parse(string, nil, encoding)
+      body = doc.xpath("/html/body").first
+      return "" if body.nil?
+      body.children.each do |node|
+        traverse_conditionally_top_down(node, :sanitize_node)
+      end
+      body.children.map { |x| x.to_xml }.join
+    end
+    def sanitize_document(string_or_io, encoding=nil)
+      return nil if string_or_io.nil?
+      return "" if string_or_io.respond_to?(:strip) && string_or_io.strip.size == 0
+      doc = Nokogiri::HTML.parse(string_or_io, nil, encoding)
+      elements = doc.xpath("/html/head/*","/html/body/*")
+      return "" if (elements.nil? || elements.empty?)
+      elements.each do |node|
+        traverse_conditionally_top_down(node, :sanitize_node)
+      end
+      doc.root.to_xml
+    end
+    private
+    def traverse_conditionally_top_down(node, method_name)
+      return if send(method_name, node)
+      node.children.each {|j| traverse_conditionally_top_down(j, method_name)}
+    end
+    def remove_tags_from_node(node)
+      replacement_killer = Nokogiri::XML::Text.new(node.text, node.document)
+      node.add_next_sibling(replacement_killer)
+      node.remove
+      return true
+    end
+    def sanitize_node(node)
+      case node.type
+      when 1 # Nokogiri::XML::Node::ELEMENT_NODE
+        if HashedWhiteList::ALLOWED_ELEMENTS[node.name]
+          node.attributes.each do |attr|
+            node.remove_attribute(attr.first) unless HashedWhiteList::ALLOWED_ATTRIBUTES[attr.first]
+          end
+          node.attributes.each do |attr|
+            if HashedWhiteList::ATTR_VAL_IS_URI[attr.first]
+              # this block lifted nearly verbatim from HTML5 sanitization
+              val_unescaped = CGI.unescapeHTML(attr.last.to_s).gsub(/`|[\000-\040\177\s]+|\302[\200-\240]/,'').downcase
+              if val_unescaped =~ /^[a-z0-9][-+.a-z0-9]*:/ and HashedWhiteList::ALLOWED_PROTOCOLS[val_unescaped.split(':')[0]].nil?
+                node.remove_attribute(attr.first)
+              end
+            end
+          end
+          if node.attributes['style']
+            node['style'] = sanitize_css(node.attributes['style'])
+          end
+          return false
+        end
+      when 3 # Nokogiri::XML::Node::TEXT_NODE
+        return false
+      when 4 # Nokogiri::XML::Node::CDATA_SECTION_NODE
+        return false
+      end
+      replacement_killer = Nokogiri::XML::Text.new(node.to_s, node.document)
+      node.add_next_sibling(replacement_killer)
+      node.remove
+      return true
+    end
+    def whitewash_node(node)
+      case node.type
+      when 1 # Nokogiri::XML::Node::ELEMENT_NODE
+        if HashedWhiteList::ALLOWED_ELEMENTS[node.name]
+          node.attributes.each { |attr| node.remove_attribute(attr.first) }
+          has_no_namespaces = true
+          begin
+            has_no_namespaces = node.namespaces.empty?
+          rescue
+            # older versions of nokogiri raise an exception when there
+            # is a namespace on the node that is not declared with an href.
+            # see http://github.com/tenderlove/nokogiri/commit/395d7971304e1489e92c494b9c50609f4b4c4ab0
+            has_no_namespaces = false
+          end
+          return false if has_no_namespaces
+        end
+      when 3 # Nokogiri::XML::Node::TEXT_NODE
+        return false
+      when 4 # Nokogiri::XML::Node::CDATA_SECTION_NODE
+        return false
+      end
+      node.remove
+      return true
+    end
+    #  this is lifted nearly verbatim from html5
+    def sanitize_css(style)
+      # disallow urls
+      style = style.to_s.gsub(/url\s*\(\s*[^\s)]+?\s*\)\s*/, ' ')
+      # gauntlet
+      return '' unless style =~ /^([:,;#%.\sa-zA-Z0-9!]|\w-\w|\'[\s\w]+\'|\"[\s\w]+\"|\([\d,\s]+\))*$/
+      return '' unless style =~ /^\s*([-\w]+\s*:[^:;]*(;\s*|$))*$/
+      clean = []
+      style.scan(/([-\w]+)\s*:\s*([^:;]*)/) do |prop, val|
+        next if val.empty?
+        prop.downcase!
+        if HashedWhiteList::ALLOWED_CSS_PROPERTIES[prop]
+          clean << "#{prop}: #{val};"
+        elsif %w[background border margin padding].include?(prop.split('-')[0])
+          clean << "#{prop}: #{val};" unless val.split().any? do |keyword|
+            HashedWhiteList::ALLOWED_CSS_KEYWORDS[keyword].nil? and
+              keyword !~ /^(#[0-9a-f]+|rgb\(\d+%?,\d*%?,?\d*%?\)?|\d{0,2}\.?\d{0,2}(cm|em|ex|in|mm|pc|pt|px|%|,|\))?)$/
+          end
+        elsif HashedWhiteList::ALLOWED_SVG_PROPERTIES[prop]
+          clean << "#{prop}: #{val};"
+        end
+      end
+      style = clean.join(' ')
+    end
+  end # self
+end
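
The protocol check inside `sanitize_node` is easier to follow in isolation. The sketch below is a simplified approximation, not the gem's code: `uri_allowed?` is a hypothetical helper, `ALLOWED_PROTOCOLS` is a three-entry subset chosen for the demo, and the normalisation regex omits the non-breaking-space branch of the original pattern.

```ruby
require 'cgi'

# Assumption: a small subset of the ACCEPTABLE_PROTOCOLS list.
ALLOWED_PROTOCOLS = { "http" => true, "https" => true, "mailto" => true }

# Returns true when a URI-valued attribute should be kept.
def uri_allowed?(value)
  # Decode HTML entities, then drop backticks, control characters and
  # whitespace so that obfuscated schemes collapse to their plain form.
  normalized = CGI.unescapeHTML(value.to_s).gsub(/[`\x00-\x20\x7f]+/, '').downcase
  # Extract a leading "scheme:" prefix, if there is one.
  scheme = normalized[/\A([a-z0-9][-+.a-z0-9]*):/, 1]
  # Values without a scheme (relative paths, fragments) are kept.
  scheme.nil? || ALLOWED_PROTOCOLS.key?(scheme)
end

uri_allowed?("http://jim.jim/")          # kept
uri_allowed?("jimbo://jim.jim/")         # rejected: scheme not whitelisted
uri_allowed?("java\tscript:alert(1)")    # rejected: the tab is stripped first
```

Normalising before matching is the important design choice: the scheme test runs against the cleaned value, so splitting `javascript:` with whitespace or encoding one of its letters as an entity does not get past the whitelist.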

data/lib/dryopteris/whitelist.rb ADDED

@@ -0,0 +1,159 @@
+#
+#  HTML whitelist lifted from HTML5 sanitizer code
+#    http://code.google.com/p/html5lib/
+#
+module Dryopteris
+  module WhiteList
+    # <html5_license>
+    #
+    #   Copyright (c) 2006-2008 The Authors
+    #
+    #   Contributors:
+    #   James Graham - jg307@cam.ac.uk
+    #   Anne van Kesteren - annevankesteren@gmail.com
+    #   Lachlan Hunt - lachlan.hunt@lachy.id.au
+    #   Matt McDonald - kanashii@kanashii.ca
+    #   Sam Ruby - rubys@intertwingly.net
+    #   Ian Hickson (Google) - ian@hixie.ch
+    #   Thomas Broyer - t.broyer@ltgt.net
+    #   Jacques Distler - distler@golem.ph.utexas.edu
+    #   Henri Sivonen - hsivonen@iki.fi
+    #   The Mozilla Foundation (contributions from Henri Sivonen since 2008)
+    #
+    #   Permission is hereby granted, free of charge, to any person
+    #   obtaining a copy of this software and associated documentation
+    #   files (the "Software"), to deal in the Software without
+    #   restriction, including without limitation the rights to use, copy,
+    #   modify, merge, publish, distribute, sublicense, and/or sell copies
+    #   of the Software, and to permit persons to whom the Software is
+    #   furnished to do so, subject to the following conditions:
+    #
+    #   The above copyright notice and this permission notice shall be
+    #   included in all copies or substantial portions of the Software.
+    #
+    #   THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+    #   EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+    #   MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+    #   NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT
+    #   HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY,
+    #   WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+    #   OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER
+    #   DEALINGS IN THE SOFTWARE.
+    #
+    # </html5_license>
+    ACCEPTABLE_ELEMENTS = %w[a abbr acronym address area b big blockquote br
+      button caption center cite code col colgroup dd del dfn dir div dl dt
+      em fieldset font form h1 h2 h3 h4 h5 h6 hr i img input ins kbd label
+      legend li map menu ol optgroup option p pre q s samp select small span
+      strike strong sub sup table tbody td textarea tfoot th thead tr tt u
+      ul var]
+    MATHML_ELEMENTS = %w[maction math merror mfrac mi mmultiscripts mn mo
+      mover mpadded mphantom mprescripts mroot mrow mspace msqrt mstyle msub
+      msubsup msup mtable mtd mtext mtr munder munderover none]
+    SVG_ELEMENTS = %w[a animate animateColor animateMotion animateTransform
+      circle defs desc ellipse font-face font-face-name font-face-src g
+      glyph hkern image linearGradient line marker metadata missing-glyph
+      mpath path polygon polyline radialGradient rect set stop svg switch
+      text title tspan use]
+    ACCEPTABLE_ATTRIBUTES = %w[abbr accept accept-charset accesskey action
+      align alt axis border cellpadding cellspacing char charoff charset
+      checked cite class clear cols colspan color compact coords datetime
+      dir disabled enctype for frame headers height href hreflang hspace id
+      ismap label lang longdesc maxlength media method multiple name nohref
+      noshade nowrap prompt readonly rel rev rows rowspan rules scope
+      selected shape size span src start style summary tabindex target title
+      type usemap valign value vspace width xml:lang]
+    MATHML_ATTRIBUTES = %w[actiontype align columnalign columnalign
+      columnalign columnlines columnspacing columnspan depth display
+      displaystyle equalcolumns equalrows fence fontstyle fontweight frame
+      height linethickness lspace mathbackground mathcolor mathvariant
+      mathvariant maxsize minsize other rowalign rowalign rowalign rowlines
+      rowspacing rowspan rspace scriptlevel selection separator stretchy
+      width width xlink:href xlink:show xlink:type xmlns xmlns:xlink]
+    SVG_ATTRIBUTES = %w[accent-height accumulate additive alphabetic
+       arabic-form ascent attributeName attributeType baseProfile bbox begin
+       by calcMode cap-height class color color-rendering content cx cy d dx
+       dy descent display dur end fill fill-rule font-family font-size
+       font-stretch font-style font-variant font-weight from fx fy g1 g2
+       glyph-name gradientUnits hanging height horiz-adv-x horiz-origin-x id
+       ideographic k keyPoints keySplines keyTimes lang marker-end
+       marker-mid marker-start markerHeight markerUnits markerWidth
+       mathematical max min name offset opacity orient origin
+       overline-position overline-thickness panose-1 path pathLength points
+       preserveAspectRatio r refX refY repeatCount repeatDur
+       requiredExtensions requiredFeatures restart rotate rx ry slope stemh
+       stemv stop-color stop-opacity strikethrough-position
+       strikethrough-thickness stroke stroke-dasharray stroke-dashoffset
+       stroke-linecap stroke-linejoin stroke-miterlimit stroke-opacity
+       stroke-width systemLanguage target text-anchor to transform type u1
+       u2 underline-position underline-thickness unicode unicode-range
+       units-per-em values version viewBox visibility width widths x
+       x-height x1 x2 xlink:actuate xlink:arcrole xlink:href xlink:role
+       xlink:show xlink:title xlink:type xml:base xml:lang xml:space xmlns
+       xmlns:xlink y y1 y2 zoomAndPan]
+    ATTR_VAL_IS_URI = %w[href src cite action longdesc xlink:href xml:base]
+    ACCEPTABLE_CSS_PROPERTIES = %w[azimuth background-color
+      border-bottom-color border-collapse border-color border-left-color
+      border-right-color border-top-color clear color cursor direction
+      display elevation float font font-family font-size font-style
+      font-variant font-weight height letter-spacing line-height overflow
+      pause pause-after pause-before pitch pitch-range richness speak
+      speak-header speak-numeral speak-punctuation speech-rate stress
+      text-align text-decoration text-indent unicode-bidi vertical-align
+      voice-family volume white-space width]
+    ACCEPTABLE_CSS_KEYWORDS = %w[auto aqua black block blue bold both bottom
+      brown center collapse dashed dotted fuchsia gray green !important
+      italic left lime maroon medium none navy normal nowrap olive pointer
+      purple red right solid silver teal top transparent underline white
+      yellow]
+    ACCEPTABLE_SVG_PROPERTIES = %w[fill fill-opacity fill-rule stroke
+      stroke-width stroke-linecap stroke-linejoin stroke-opacity]
+    ACCEPTABLE_PROTOCOLS = %w[ed2k ftp http https irc mailto news gopher nntp
+      telnet webcal xmpp callto feed urn aim rsync tag ssh sftp rtsp afs]
+    # subclasses may define their own versions of these constants
+    ALLOWED_ELEMENTS = ACCEPTABLE_ELEMENTS + MATHML_ELEMENTS + SVG_ELEMENTS
+    ALLOWED_ATTRIBUTES = ACCEPTABLE_ATTRIBUTES + MATHML_ATTRIBUTES + SVG_ATTRIBUTES
+    ALLOWED_CSS_PROPERTIES = ACCEPTABLE_CSS_PROPERTIES
+    ALLOWED_CSS_KEYWORDS = ACCEPTABLE_CSS_KEYWORDS
+    ALLOWED_SVG_PROPERTIES = ACCEPTABLE_SVG_PROPERTIES
+    ALLOWED_PROTOCOLS = ACCEPTABLE_PROTOCOLS
+    VOID_ELEMENTS = %w[
+      base
+      link
+      meta
+      hr
+      br
+      img
+      embed
+      param
+      area
+      col
+      input
+    ]
+  end
+  module HashedWhiteList
+    #  turn each of the whitelist arrays into a hash for faster lookup
+    WhiteList.constants.each do |constant|
+      next unless WhiteList.module_eval("#{constant}").is_a?(Array)
+      module_eval <<-CODE
+        #{constant} = {}
+        WhiteList::#{constant}.each { |c| #{constant}[c] = true ; #{constant}[c.downcase] = true }
+      CODE
+    end
+  end
+end
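
`HashedWhiteList` builds its lookup tables with `module_eval`. The same idea, indexing each entry under both its original and its downcased spelling, can be sketched without metaprogramming. `ELEMENTS` and `HASHED_ELEMENTS` are hypothetical names, and the list is shortened for illustration.

```ruby
# Assumption: a shortened list standing in for WhiteList::ALLOWED_ELEMENTS.
ELEMENTS = %w[a b linearGradient p]

# Build a hash keyed by every entry in both spellings, so membership
# tests become a single hash lookup instead of an array scan.
HASHED_ELEMENTS = ELEMENTS.each_with_object({}) do |name, index|
  index[name] = true
  index[name.downcase] = true
end

HASHED_ELEMENTS["lineargradient"]  # true, via the downcased key
HASHED_ELEMENTS["script"]          # nil, so the element is not allowed
```

The downcased keys matter because, as the test comments in this diff note, libxml2 downcases tag and attribute names while parsing HTML, so a mixed-case SVG name such as `linearGradient` reaches the lookup in lowercase.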

data/test/helper.rb ADDED

@@ -0,0 +1,8 @@
+require 'test/unit'
+require File.expand_path(File.join(File.dirname(__FILE__), "..", "lib", "dryopteris"))
+if defined? Nokogiri::VERSION_INFO
+  puts "=> running with Nokogiri #{Nokogiri::VERSION_INFO.inspect}"
+else
+  puts "=> running with Nokogiri #{Nokogiri::VERSION} / libxml #{Nokogiri::LIBXML_PARSER_VERSION}"
+end

data/test/html5/test_sanitizer.rb ADDED

@@ -0,0 +1,185 @@
+#
+#  these tests taken from the HTML5 sanitization project and modified for use with Dryopteris
+#  see the original here: http://code.google.com/p/html5lib/source/browse/ruby/test/test_sanitizer.rb
+#
+#  license text at the bottom of this file
+#
+require File.expand_path(File.join(File.dirname(__FILE__), '..', 'helper'))
+class SanitizeTest < Test::Unit::TestCase
+  include Dryopteris
+  def sanitize_html stream
+    Dryopteris.sanitize(stream)
+  end
+  def sanitize_doc stream
+    Dryopteris.sanitize_document(stream)
+  end
+  def check_sanitization(input, htmloutput, xhtmloutput, rexmloutput)
+    #  libxml uses double-quotes, so let's swappo-boppo our quotes before comparing.
+    assert_equal htmloutput, sanitize_html(input).gsub(/"/,"'"), input
+    doc = sanitize_doc(input).gsub(/"/,"'")
+    assert doc.include?(htmloutput), "#{input}:\n#{doc}\nshould include:\n#{htmloutput}"
+  end
+  WhiteList::ALLOWED_ELEMENTS.each do |tag_name|
+    define_method "test_should_allow_#{tag_name}_tag" do
+      input       = "<#{tag_name} title='1'>foo <bad>bar</bad> baz</#{tag_name}>"
+      htmloutput  = "<#{tag_name.downcase} title='1'>foo &lt;bad&gt;bar&lt;/bad&gt; baz</#{tag_name.downcase}>"
+      xhtmloutput = "<#{tag_name} title='1'>foo &lt;bad&gt;bar&lt;/bad&gt; baz</#{tag_name}>"
+      rexmloutput = xhtmloutput
+##
+##  these special cases are HTML5-tokenizer-dependent.
+##  libxml2 cleans up HTML differently, and I trust that.
+##
+#       if %w[caption colgroup optgroup option tbody td tfoot th thead tr].include?(tag_name)
+#         htmloutput = "foo &lt;bad&gt;bar&lt;/bad&gt; baz"
+#         xhtmloutput = htmloutput
+#       elsif tag_name == 'col'
+#         htmloutput = "foo &lt;bad&gt;bar&lt;/bad&gt; baz"
+#         xhtmloutput = htmloutput
+#         rexmloutput = "<col title='1' />"
+#       elsif tag_name == 'table'
+#         htmloutput = "foo &lt;bad&gt;bar&lt;/bad&gt;baz<table title='1'> </table>"
+#         xhtmloutput = htmloutput
+#       elsif tag_name == 'image'
+#         htmloutput = "<image title='1'/>foo &lt;bad&gt;bar&lt;/bad&gt; baz"
+#         xhtmloutput = htmloutput
+#         rexmloutput = "<image title='1'>foo &lt;bad&gt;bar&lt;/bad&gt; baz</image>"
+      if WhiteList::VOID_ELEMENTS.include?(tag_name)
+        if Nokogiri::LIBXML_VERSION <= "2.6.16"
+          htmloutput = "<#{tag_name} title='1'/><p>foo &lt;bad&gt;bar&lt;/bad&gt; baz</p>"
+        else
+          htmloutput = "<#{tag_name} title='1'/>foo &lt;bad&gt;bar&lt;/bad&gt; baz"
+        end
+        xhtmloutput = htmloutput
+#        htmloutput += '<br/>' if tag_name == 'br'
+        rexmloutput =  "<#{tag_name} title='1' />"
+      end
+      check_sanitization(input, htmloutput, xhtmloutput, rexmloutput)
+    end
+  end
+##
+##  libxml2 downcases tag names as it parses, so this is unnecessary.
+##
+#   WhiteList::ALLOWED_ELEMENTS.each do |tag_name|
+#     define_method "test_should_forbid_#{tag_name.upcase}_tag" do
+#       input = "<#{tag_name.upcase} title='1'>foo <bad>bar</bad> baz</#{tag_name.upcase}>"
+#       output = "&lt;#{tag_name.upcase} title=\"1\"&gt;foo &lt;bad&gt;bar&lt;/bad&gt; baz&lt;/#{tag_name.upcase}&gt;"
+#       check_sanitization(input, output, output, output)
+#     end
+#   end
+  WhiteList::ALLOWED_ATTRIBUTES.each do |attribute_name|
+    next if attribute_name == 'style'
+    next if attribute_name =~ /:/ && Nokogiri::LIBXML_VERSION <= '2.6.16'
+    define_method "test_should_allow_#{attribute_name}_attribute" do
+      input = "<p #{attribute_name}='foo'>foo <bad>bar</bad> baz</p>"
+      output = "<p #{attribute_name}='foo'>foo &lt;bad&gt;bar&lt;/bad&gt; baz</p>"
+      htmloutput = "<p #{attribute_name.downcase}='foo'>foo &lt;bad&gt;bar&lt;/bad&gt; baz</p>"
+      check_sanitization(input, htmloutput, output, output)
+    end
+  end
+##
+##  libxml2 downcases attributes as it parses, so this is unnecessary.
+##
+#   WhiteList::ALLOWED_ATTRIBUTES.each do |attribute_name|
+#     define_method "test_should_forbid_#{attribute_name.upcase}_attribute" do
+#       input = "<p #{attribute_name.upcase}='display: none;'>foo <bad>bar</bad> baz</p>"
+#       output =  "<p>foo &lt;bad&gt;bar&lt;/bad&gt; baz</p>"
+#       check_sanitization(input, output, output, output)
+#     end
+#   end
+  WhiteList::ALLOWED_PROTOCOLS.each do |protocol|
+    define_method "test_should_allow_#{protocol}_uris" do
+      input = %(<a href="#{protocol}">foo</a>)
+      output = "<a href='#{protocol}'>foo</a>"
+      check_sanitization(input, output, output, output)
+    end
+  end
+  WhiteList::ALLOWED_PROTOCOLS.each do |protocol|
+    define_method "test_should_allow_uppercase_#{protocol}_uris" do
+      input = %(<a href="#{protocol.upcase}">foo</a>)
+      output = "<a href='#{protocol.upcase}'>foo</a>"
+      check_sanitization(input, output, output, output)
+    end
+  end
+  if Nokogiri::LIBXML_VERSION > '2.6.16'
+    def test_should_handle_astral_plane_characters
+      input = "<p>&#x1d4b5; &#x1d538;</p>"
+      output = "<p>\360\235\222\265 \360\235\224\270</p>"
+      check_sanitization(input, output, output, output)
+      input = "<p><tspan>\360\235\224\270</tspan> a</p>"
+      output = "<p><tspan>\360\235\224\270</tspan> a</p>"
+      check_sanitization(input, output, output, output)
+    end
+  end
+# This affects only NS4. Is it worth fixing?
+#  def test_javascript_includes
+#    input = %(<div size="&{alert('XSS')}">foo</div>)
+#    output = "<div>foo</div>"
+#    check_sanitization(input, output, output, output)
+#  end
+  #html5_test_files('sanitizer').each do |filename|
+  #  JSON::parse(open(filename).read).each do |test|
+  #    define_method "test_#{test['name']}" do
+  #      check_sanitization(
+  #        test['input'],
+  #        test['output'],
+  #        test['xhtml'] || test['output'],
+  #        test['rexml'] || test['output']
+  #      )
+  #    end
+  #  end
+  #end
+end
+# <html5_license>
+#
+# Copyright (c) 2006-2008 The Authors
+#
+# Contributors:
+# James Graham - jg307@cam.ac.uk
+# Anne van Kesteren - annevankesteren@gmail.com
+# Lachlan Hunt - lachlan.hunt@lachy.id.au
+# Matt McDonald - kanashii@kanashii.ca
+# Sam Ruby - rubys@intertwingly.net
+# Ian Hickson (Google) - ian@hixie.ch
+# Thomas Broyer - t.broyer@ltgt.net
+# Jacques Distler - distler@golem.ph.utexas.edu
+# Henri Sivonen - hsivonen@iki.fi
+# The Mozilla Foundation (contributions from Henri Sivonen since 2008)
+#
+# Permission is hereby granted, free of charge, to any person
+# obtaining a copy of this software and associated documentation files
+# (the "Software"), to deal in the Software without restriction,
+# including without limitation the rights to use, copy, modify, merge,
+# publish, distribute, sublicense, and/or sell copies of the Software,
+# and to permit persons to whom the Software is furnished to do so,
+# subject to the following conditions:
+#
+# The above copyright notice and this permission notice shall be
+# included in all copies or substantial portions of the Software.
+#
+# THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+# EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+# MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+# NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+# BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+# ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+# CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+# SOFTWARE.
+#
+# </html5_license>
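
The sanitizer tests gate some expectations on `Nokogiri::LIBXML_VERSION <= "2.6.16"`, which compares the two values as plain strings. The sketch below contrasts that with `Gem::Version`; this is not what the gem does, and the `"2.10.0"` value is a hypothetical version string picked only to show where the two comparisons disagree.

```ruby
require 'rubygems'  # provides Gem::Version; loaded by default on modern Ruby

old_cutoff = "2.6.16"
newer      = "2.10.0"  # hypothetical version string, for illustration only

# String comparison works character by character, so "1" sorts before "6".
lexicographic = newer <= old_cutoff

# Gem::Version compares the numeric segments: 10 is greater than 6.
numeric = Gem::Version.new(newer) <= Gem::Version.new(old_cutoff)
```

Here `lexicographic` is `true` while `numeric` is `false`, so a string-based gate could take the legacy branch on a release whose minor version has two digits.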

data/test/test_basic.rb ADDED

@@ -0,0 +1,76 @@
+require File.expand_path(File.join(File.dirname(__FILE__), 'helper'))
+class TestBasic < Test::Unit::TestCase
+  def test_nil
+    assert_nil Dryopteris.sanitize(nil)
+  end
+  def test_empty_string
+    assert_equal "", Dryopteris.sanitize("")
+  end
+  def test_removal_of_illegal_tag
+    html = <<-HTML
+      following this there should be no jim tag
+      <jim>jim</jim>
+      was there?
+    HTML
+    sane = Nokogiri::HTML(Dryopteris.sanitize(html))
+    assert sane.xpath("//jim").empty?
+  end
+  def test_removal_of_illegal_attribute
+    html = "<p class=bar foo=bar abbr=bar />"
+    sane = Nokogiri::HTML(Dryopteris.sanitize(html))
+    node = sane.xpath("//p").first
+    assert node.attributes['class']
+    assert node.attributes['abbr']
+    assert_nil node.attributes['foo']
+  end
+  def test_removal_of_illegal_url_in_href
+    html = <<-HTML
+      <a href='jimbo://jim.jim/'>this link should have its href removed because of illegal url</a>
+      <a href='http://jim.jim/'>this link should be fine</a>
+    HTML
+    sane = Nokogiri::HTML(Dryopteris.sanitize(html))
+    nodes = sane.xpath("//a")
+    assert_nil nodes.first.attributes['href']
+    assert nodes.last.attributes['href']
+  end
+  def test_css_sanitization
+    html = "<p style='background-color: url(\"http://foo.com/\") ; background-color: #000 ;' />"
+    sane = Nokogiri::HTML(Dryopteris.sanitize(html))
+    assert_match(/#000/, sane.inner_html)
+    assert_no_match(/foo\.com/, sane.inner_html)
+  end
+  def test_fragment_with_no_tags
+    assert_equal "This fragment has no tags.", Dryopteris.sanitize("This fragment has no tags.")
+  end
+  def test_fragment_in_p_tag
+    assert_equal "<p>This fragment is in a p.</p>", Dryopteris.sanitize("<p>This fragment is in a p.</p>")
+  end
+  def test_fragment_in_a_nontrivial_p_tag
+    assert_equal "  \n<p>This fragment is in a p.</p>", Dryopteris.sanitize("  \n<p foo='bar'>This fragment is in a p.</p>")
+  end
+  def test_fragment_in_p_tag_plus_stuff
+    assert_equal "<p>This fragment is in a p.</p>foo<strong>bar</strong>", Dryopteris.sanitize("<p>This fragment is in a p.</p>foo<strong>bar</strong>")
+  end
+  def test_fragment_with_text_nodes_leading_and_trailing
+    assert_equal "text<p>fragment</p>text", Dryopteris.sanitize("text<p>fragment</p>text")
+  end
+  def test_whitewash_on_fragment
+    html = "safe<frameset rows=\"*\"><frame src=\"http://example.com\"></frameset> <b>description</b>"
+    whitewashed = Dryopteris.whitewash_document(html)
+    assert_equal "<p>safe</p><b>description</b>", whitewashed
+  end
+end
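
`test_css_sanitization` exercises the first stage of `sanitize_css`, which blanks out `url(...)` values before anything else runs. The following stdlib-only sketch reuses the two regular expressions from `sanitize.rb` on the test's own style string; it deliberately skips the "gauntlet" checks and the property and keyword whitelists, and the constant and variable names are assumptions made for the example.

```ruby
# Pattern used by sanitize_css to replace url(...) values with a space.
URL_PATTERN = /url\s*\(\s*[^\s)]+?\s*\)\s*/

style = %q{background-color: url("http://foo.com/") ; background-color: #000 ;}

# Stage 1: remove every url(...) occurrence.
stripped = style.gsub(URL_PATTERN, ' ')

# Stage 2: split into property/value pairs and drop declarations whose
# value became empty once the url was removed.
declarations = stripped.scan(/([-\w]+)\s*:\s*([^:;]*)/)
                       .map { |prop, val| [prop.downcase, val.strip] }
                       .reject { |_prop, val| val.empty? }
```

After both stages only the `background-color` declaration with the value `#000` survives, which is why the test can assert that `#000` is present and `foo.com` is gone.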

data/test/test_strip_tags.rb ADDED

@@ -0,0 +1,40 @@
+require File.expand_path(File.join(File.dirname(__FILE__), 'helper'))
+class TestStripTags < Test::Unit::TestCase
+  def test_nil
+    assert_nil Dryopteris.strip_tags(nil)
+  end
+  def test_empty_string
+    assert_equal "", Dryopteris.strip_tags("")
+  end
+  def test_return_empty_string_when_nothing_left
+    assert_equal "", Dryopteris.strip_tags('<script>test</script>')
+  end
+  def test_removal_of_all_tags
+    html = <<-HTML
+      What's up <strong>doc</strong>?
+    HTML
+    stripped = Dryopteris.strip_tags(html)
+    assert_equal "What's up doc?".strip, stripped.strip
+  end
+  def test_dont_remove_whitespace
+    html = "Foo\nBar"
+    assert_equal html, Dryopteris.strip_tags(html)
+  end
+  def test_dont_remove_whitespace_between_tags
+    html = "<p>Foo</p>\n<p>Bar</p>"
+    assert_equal "Foo\nBar", Dryopteris.strip_tags(html)
+  end
+  def test_removal_of_entities
+    html = "<p>this is &lt; that &quot;&amp;&quot; the other &gt; boo&apos;ya</p>"
+    assert_equal 'this is < that "&" the other > boo\'ya', Dryopteris.strip_tags(html)
+  end
+end

metadata ADDED

@@ -0,0 +1,77 @@
+--- !ruby/object:Gem::Specification
+name: jmcnevin-dryopteris
+version: !ruby/object:Gem::Version
+  version: 0.1.2
+platform: ruby
+authors:
+- Bryan Helmkamp
+- Mike Dalessio
+autorequire:
+bindir: bin
+cert_chain: []
+date: 2009-02-10 00:00:00 -08:00
+default_executable:
+dependencies:
+- !ruby/object:Gem::Dependency
+  name: nokogiri
+  type: :runtime
+  version_requirement:
+  version_requirements: !ruby/object:Gem::Requirement
+    requirements:
+    - - ">"
+      - !ruby/object:Gem::Version
+        version: 0.0.0
+    version:
+description: Dryopteris erythrosora is the Japanese Shield Fern. It also can be used to sanitize HTML to help prevent XSS attacks.
+email:
+- bryan@brynary.com
+- mike.dalessio@gmail.com
+executables: []
+extensions: []
+extra_rdoc_files: []
+files:
+- README.markdown
+- VERSION.yml
+- lib/dryopteris
+- lib/dryopteris/rails_extension.rb
+- lib/dryopteris/sanitize.rb
+- lib/dryopteris/whitelist.rb
+- lib/dryopteris.rb
+- test/test_basic.rb
+- test/test_strip_tags.rb
+- test/helper.rb
+- test/html5/test_sanitizer.rb
+has_rdoc: true
+homepage: http://github.com/brynary/dryopteris/tree/master
+licenses:
+post_install_message:
+rdoc_options:
+- --inline-source
+- --charset=UTF-8
+require_paths:
+- lib
+required_ruby_version: !ruby/object:Gem::Requirement
+  requirements:
+  - - ">="
+    - !ruby/object:Gem::Version
+      version: "0"
+  version:
+required_rubygems_version: !ruby/object:Gem::Requirement
+  requirements:
+  - - ">="
+    - !ruby/object:Gem::Version
+      version: "0"
+  version:
+requirements: []
+rubyforge_project:
+rubygems_version: 1.3.5
+signing_key:
+specification_version: 2
+summary: HTML sanitization using Nokogiri
+test_files: []
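
The metadata declares the nokogiri dependency as `> 0.0.0`, while `lib/dryopteris.rb` activates it with `gem 'nokogiri', '>=1.2.4'`. A short sketch with `Gem::Requirement` shows how the two constraints can disagree; the `1.0.0` candidate is a hypothetical installed version used only for illustration.

```ruby
require 'rubygems'  # provides Gem::Requirement and Gem::Version

declared  = Gem::Requirement.new("> 0.0.0")   # constraint in the gem metadata
activated = Gem::Requirement.new(">= 1.2.4")  # constraint in lib/dryopteris.rb
candidate = Gem::Version.new("1.0.0")         # hypothetical installed version

passes_install = declared.satisfied_by?(candidate)
passes_runtime = activated.satisfied_by?(candidate)
```

With these values `passes_install` is `true` and `passes_runtime` is `false`: a version that satisfies the declared dependency at install time could still be rejected when the library is loaded.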