RubyGems - sanitize - Versions diffs - 1.0.8 → 1.1.0 - Mend

sanitize 1.0.8 → 1.1.0

Potentially problematic release.

This version of sanitize might be problematic.

Files changed (6)

data/HISTORY CHANGED

@@ -1,6 +1,13 @@
 Sanitize History
 ================================================================================
+Version 1.1.0 (2009-10-11)
+  * Migrated from Hpricot to Nokogiri. Requires libxml2 >= 2.7.2 [Adam Hooper]
+  * Added an :output config setting to allow the output format to be specified.
+    Supported formats are :xhtml (the default) and :html (which outputs HTML4).
+  * Changed protocol regex to ensure Sanitize doesn't kill URLs with colons in
+    path segments. [Peter Cooper]
 Version 1.0.8 (2009-04-23)
   * Added a workaround for an Hpricot bug that prevents attribute names from
     being downcased in recent versions of Hpricot. This was exploitable to
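
The protocol regex change listed under 1.1.0 can be checked in isolation. The sketch below copies both patterns from the lib/sanitize.rb diff. The constant names `OLD_REGEX_PROTOCOL` and `NEW_REGEX_PROTOCOL`, the helper `protocol_of`, and the sample URLs are illustrative assumptions and are not part of the gem.

```ruby
# Patterns copied from the 1.0.8 and 1.1.0 sides of the lib/sanitize.rb diff.
# The OLD_/NEW_ constant names are hypothetical; the gem names both REGEX_PROTOCOL.
OLD_REGEX_PROTOCOL = /^([^:]*)(?:\:|&#0*58|&#x0*3a)/i
NEW_REGEX_PROTOCOL = /^([A-Za-z0-9\+\-\.\&\;\#\s]*?)(?:\:|&#0*58|&#x0*3a)/i

# Hypothetical helper: returns the captured protocol prefix, or nil when the
# value does not start with something that looks like a protocol.
def protocol_of(regex, value)
  match = regex.match(value.downcase)
  match && match[1]
end

# Sample attribute values (assumptions chosen for illustration).
SAMPLES = [
  'http://example.com/',
  'javascript:alert(1)',
  'javascript&#58;alert(1)',
  '/wiki/Special:Search'
]

SAMPLES.each do |url|
  old_result = protocol_of(OLD_REGEX_PROTOCOL, url)
  new_result = protocol_of(NEW_REGEX_PROTOCOL, url)
  puts "#{url.inspect}: old=#{old_result.inspect} new=#{new_result.inspect}"
end
```

With the old pattern, everything before the first colon is captured, so a relative path with a colon in a later segment is mistaken for a protocol. The new pattern only allows a limited character set before the colon, and `/` is not in that set, so such paths fall through to the `:relative` branch.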

data/README.rdoc CHANGED

@@ -9,21 +9,31 @@ elements, certain attributes within those elements, and even certain URL
 protocols within attributes that contain URLs. Any HTML elements or attributes
 that you don't explicitly allow will be removed.
-Because it's based on Hpricot, a full-fledged HTML parser, rather than a bunch
+Because it's based on Nokogiri, a full-fledged HTML parser, rather than a bunch
 of fragile regular expressions, Sanitize has no trouble dealing with malformed
 or maliciously-formed HTML. When in doubt, Sanitize always errs on the side of
 caution.
 *Author*::    Ryan Grove (mailto:ryan@wonko.com)
-*Version*::   1.0.8 (2009-04-23)
+*Version*::   1.1.0 (2009-10-11)
 *Copyright*:: Copyright (c) 2009 Ryan Grove. All rights reserved.
 *License*::   MIT License (http://opensource.org/licenses/mit-license.php)
 *Website*::   http://github.com/rgrove/sanitize
 == Requires
-* RubyGems
-* Hpricot 0.6+
+* Nokogiri
+* libxml2 >= 2.7.2
+== Installation
+Latest stable release:
+  gem install sanitize
+Latest development version:
+  gem install sanitize -s http://gemcutter.org --prerelease
 == Usage
@@ -141,6 +151,7 @@ include the symbol <code>:relative</code> in the protocol array:
 The following lovely people have contributed to Sanitize in the form of patches
 or ideas that later became code:
+* Peter Cooper <git@peterc.org>
 * Ryan Grove <ryan@wonko.com>
 * Adam Hooper <adam@adamhooper.com>
 * Mutwin Kraus <mutle@blogage.de>

data/lib/sanitize.rb CHANGED

@@ -1,3 +1,4 @@
+# encoding: utf-8
 #--
 # Copyright (c) 2009 Ryan Grove <ryan@wonko.com>
 #
@@ -20,15 +21,8 @@
 # SOFTWARE.
 #++
-# Append this file's directory to the include path if it's not there already.
-$:.unshift(File.dirname(File.expand_path(__FILE__)))
-$:.uniq!
-require 'rubygems'
-gem 'hpricot', '~> 0.8.1'
-require 'hpricot'
+require 'nokogiri'
+require 'sanitize/version'
 require 'sanitize/config'
 require 'sanitize/config/restricted'
 require 'sanitize/config/basic'
@@ -36,24 +30,12 @@ require 'sanitize/config/relaxed'
 class Sanitize
-  # Characters that should be replaced with entities in text nodes.
-  ENTITY_MAP = {
-    '<' => '&lt;',
-    '>' => '&gt;',
-    '"' => '&quot;',
-    "'" => '&#39;'
-  }
-  # Matches an unencoded ampersand that is not part of a valid character entity
-  # reference.
-  REGEX_AMPERSAND = /&(?!(?:[a-z]+[0-9]{0,2}|#[0-9]+|#x[0-9a-f]+);)/i
   # Matches an attribute value that could be treated by a browser as a URL
   # with a protocol prefix, such as "http:" or "javascript:". Any string of zero
   # or more characters followed by a colon is considered a match, even if the
   # colon is encoded as an entity and even if it's an incomplete entity (which
   # IE6 and Opera will still parse).
-  REGEX_PROTOCOL = /^([^:]*)(?:\:|&#0*58|&#x0*3a)/i
+  REGEX_PROTOCOL = /^([A-Za-z0-9\+\-\.\&\;\#\s]*?)(?:\:|&#0*58|&#x0*3a)/i
   #--
   # Instance Methods
@@ -73,78 +55,82 @@ class Sanitize
   # Performs clean in place, returning _html_, or +nil+ if no changes were
   # made.
   def clean!(html)
-    fragment = Hpricot(html)
-    fragment.search('*') do |node|
-      if node.bogusetag? || node.doctype? || node.procins? || node.xmldecl?
-        node.parent.replace_child(node, '')
-        next
-      end
+    fragment = Nokogiri::HTML::DocumentFragment.parse(html)
+    fragment.traverse do |node|
       if node.comment?
-        node.parent.replace_child(node, '') unless @config[:allow_comments]
-      elsif node.elem?
+        node.unlink unless @config[:allow_comments]
+      elsif node.element?
         name = node.name.to_s.downcase
         # Delete any element that isn't in the whitelist.
         unless @config[:elements].include?(name)
-          node.parent.replace_child(node, node.children || '')
+          node.children.each { |n| node.add_previous_sibling(n) }
+          node.unlink
           next
         end
-        node.raw_attributes ||= {}
         attr_whitelist = ((@config[:attributes][name] || []) +
             (@config[:attributes][:all] || [])).uniq
         if attr_whitelist.empty?
           # Delete all attributes from elements with no whitelisted
           # attributes.
-          node.raw_attributes = {}
+          node.attribute_nodes.each { |attr| attr.remove }
         else
           # Delete any attribute that isn't in the whitelist for this element.
-          node.raw_attributes.delete_if do |key, value|
-            !attr_whitelist.include?(key.to_s.downcase)
+          node.attribute_nodes.each do |attr|
+            attr.unlink unless attr_whitelist.include?(attr.name.downcase)
           end
           # Delete remaining attributes that use unacceptable protocols.
           if @config[:protocols].has_key?(name)
             protocol = @config[:protocols][name]
-            node.raw_attributes.delete_if do |key, value|
-              key = key.to_s.downcase
-              next false unless protocol.has_key?(key)
-              next true if value.nil?
+            node.attribute_nodes.each do |attr|
+              attr_name = attr.name.downcase
+              next false unless protocol.has_key?(attr_name)
-              if value.to_s.downcase =~ REGEX_PROTOCOL
-                !protocol[key].include?($1.downcase)
+              del = if attr.value.to_s.downcase =~ REGEX_PROTOCOL
+                !protocol[attr_name].include?($1.downcase)
               else
-                !protocol[key].include?(:relative)
+                !protocol[attr_name].include?(:relative)
               end
+              attr.unlink if del
             end
           end
         end
         # Add required attributes.
         if @config[:add_attributes].has_key?(name)
-          node.raw_attributes.merge!(@config[:add_attributes][name])
-        end
-        # Escape special chars in attribute values.
-        node.raw_attributes.each do |key, value|
-          node.raw_attributes[key] = Sanitize.encode_html(value)
+          @config[:add_attributes][name].each do |key, val|
+            node[key] = val
+          end
         end
+      elsif node.cdata?
+        node.replace(Nokogiri::XML::Text.new(node.text, node.document))
       end
     end
-    # Make one last pass through the fragment and encode all special HTML chars
-    # as entities. This eliminates certain types of maliciously-malformed nested
-    # tags.
-    fragment.search('*') do |node|
-      node.swap(Sanitize.encode_html(node.to_original_html)) if node.text?
+    if @config[:output] == :xhtml
+      output_method = fragment.method(:to_xhtml)
+    elsif @config[:output] == :html
+      output_method = fragment.method(:to_html)
+    else
+      raise Error, "unsupported output format: #{@config[:output]}"
+    end
+    if RUBY_VERSION >= '1.9'
+      # Nokogiri 1.3.3 (and possibly earlier versions) always returns a US-ASCII
+      # string no matter what we ask for. This will be fixed in 1.4.0, but for
+      # now we have to hack around it to prevent errors.
+      result = output_method.call(:encoding => 'utf-8', :indent => 0).force_encoding('utf-8')
+      result.gsub!(">\n", '>')
+    else
+      result = output_method.call(:encoding => 'utf-8', :indent => 0).gsub(">\n", '>')
     end
-    result = fragment.to_s
     return result == html ? nil : html[0, html.length] = result
   end
@@ -166,18 +152,6 @@ class Sanitize
       sanitize = Sanitize.new(config)
       sanitize.clean!(html)
     end
-    # Encodes special HTML characters (<, >, ", ', and &) in _html_ as entity
-    # references and returns the encoded string.
-    def encode_html(html)
-      str = html.dup
-      # Encode special chars.
-      ENTITY_MAP.each {|char, entity| str.gsub!(char, entity) }
-      # Convert unencoded ampersands to entity references.
-      str.gsub(REGEX_AMPERSAND, '&amp;')
-    end
   end
 end
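
The attribute handling in the new `clean!` can be approximated without Nokogiri. The following is a minimal sketch, assuming attributes are modeled as a plain Hash rather than Nokogiri attribute nodes. The helper name `filter_attributes` and the sample config are hypothetical; only the regex and the whitelist/protocol rules are taken from the diff.

```ruby
# Pattern copied from the 1.1.0 side of the lib/sanitize.rb diff.
REGEX_PROTOCOL = /^([A-Za-z0-9\+\-\.\&\;\#\s]*?)(?:\:|&#0*58|&#x0*3a)/i

# Hypothetical stand-in for the per-element attribute logic in clean!.
# Returns a new Hash containing only the attributes that would survive.
def filter_attributes(name, attrs, config)
  # Per-element whitelist plus the :all whitelist, as in the diff.
  whitelist = ((config[:attributes][name] || []) +
      (config[:attributes][:all] || [])).uniq
  protocols = config[:protocols][name] || {}

  attrs.select do |attr_name, value|
    key = attr_name.downcase
    # Drop anything that is not whitelisted for this element.
    next false unless whitelist.include?(key)
    # Attributes without protocol rules are kept as-is.
    next true unless protocols.has_key?(key)

    if value.to_s.downcase =~ REGEX_PROTOCOL
      protocols[key].include?($1.downcase)
    else
      # No protocol prefix: keep only if relative URLs are allowed.
      protocols[key].include?(:relative)
    end
  end
end

# Sample configuration (an assumption for illustration).
config = {
  :attributes => { 'a' => ['href', 'title'], :all => ['class'] },
  :protocols  => { 'a' => { 'href' => ['http', 'https', :relative] } }
}

p filter_attributes('a', { 'href' => 'javascript:alert(1)', 'class' => 'x' }, config)
p filter_attributes('a', { 'href' => '/wiki/Special:Search', 'onclick' => 'evil()' }, config)
```

The first call drops the `href` because `javascript` is not an allowed protocol, and the second keeps the relative `href` while dropping the non-whitelisted `onclick`.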

data/lib/sanitize/config.rb CHANGED

@@ -28,17 +28,21 @@ class Sanitize
       # comments.
       :allow_comments => false,
-      # HTML elements to allow. By default, no elements are allowed (which means
-      # that all HTML will be stripped).
-      :elements => [],
+      # HTML attributes to add to specific elements. By default, no attributes
+      # are added.
+      :add_attributes => {},
       # HTML attributes to allow in specific elements. By default, no attributes
       # are allowed.
       :attributes => {},
-      # HTML attributes to add to specific elements. By default, no attributes
-      # are added.
-      :add_attributes => {},
+      # HTML elements to allow. By default, no elements are allowed (which means
+      # that all HTML will be stripped).
+      :elements => [],
+      # Output format. Supported formats are :html and :xhtml (which is the
+      # default).
+      :output => :xhtml,
       # URL handling protocols to allow in specific attributes. By default, no
       # protocols are allowed. Use :relative in place of a protocol if you want
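
The new `:output` setting selects which serializer `clean!` calls on the parsed fragment. The sketch below illustrates that dispatch in plain Ruby, assuming user settings are merged over the defaults. `DEFAULT_CONFIG` and `serializer_for` are hypothetical names, and `ArgumentError` is swapped in for the `Error` raised in the diff so the example runs on its own.

```ruby
# Defaults mirrored from the 1.1.0 side of the lib/sanitize/config.rb diff.
# The :protocols entry is cut off in the diff and is omitted here.
DEFAULT_CONFIG = {
  :allow_comments => false,
  :add_attributes => {},
  :attributes     => {},
  :elements       => [],
  :output         => :xhtml
}

# Hypothetical helper: returns the name of the serializer method that would
# be looked up on the fragment for the given configuration.
def serializer_for(user_config = {})
  config = DEFAULT_CONFIG.merge(user_config)

  case config[:output]
  when :xhtml then :to_xhtml
  when :html  then :to_html
  else
    # The diff raises Error here; ArgumentError is an assumption for this sketch.
    raise ArgumentError, "unsupported output format: #{config[:output]}"
  end
end

puts serializer_for                     # default configuration
puts serializer_for(:output => :html)   # explicit HTML4 output
```

Any value other than `:xhtml` or `:html` is rejected rather than silently falling back to a default, which matches the behavior shown in the `clean!` diff.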

data/lib/sanitize/version.rb ADDED

@@ -0,0 +1,3 @@
+class Sanitize
+  VERSION = '1.1.0'
+end

metadata CHANGED

@@ -1,7 +1,7 @@
 --- !ruby/object:Gem::Specification
 name: sanitize
 version: !ruby/object:Gem::Version
-  version: 1.0.8
+  version: 1.1.0
 platform: ruby
 authors:
 - Ryan Grove
@@ -9,18 +9,38 @@ autorequire:
 bindir: bin
 cert_chain: []
-date: 2009-04-23 00:00:00 -07:00
+date: 2009-10-11 00:00:00 -07:00
 default_executable:
 dependencies:
 - !ruby/object:Gem::Dependency
-  name: hpricot
+  name: nokogiri
   type: :runtime
   version_requirement:
   version_requirements: !ruby/object:Gem::Requirement
     requirements:
     - - ~>
       - !ruby/object:Gem::Version
-        version: 0.8.1
+        version: 1.3.3
+    version:
+- !ruby/object:Gem::Dependency
+  name: bacon
+  type: :development
+  version_requirement:
+  version_requirements: !ruby/object:Gem::Requirement
+    requirements:
+    - - ~>
+      - !ruby/object:Gem::Version
+        version: 1.1.0
+    version:
+- !ruby/object:Gem::Dependency
+  name: rake
+  type: :development
+  version_requirement:
+  version_requirements: !ruby/object:Gem::Requirement
+    requirements:
+    - - ~>
+      - !ruby/object:Gem::Version
+        version: 0.8.0
     version:
 description:
 email: ryan@wonko.com
@@ -34,13 +54,16 @@ files:
 - HISTORY
 - LICENSE
 - README.rdoc
-- lib/sanitize.rb
-- lib/sanitize/config.rb
 - lib/sanitize/config/basic.rb
 - lib/sanitize/config/relaxed.rb
 - lib/sanitize/config/restricted.rb
-has_rdoc: false
+- lib/sanitize/config.rb
+- lib/sanitize/version.rb
+- lib/sanitize.rb
+has_rdoc: true
 homepage: http://github.com/rgrove/sanitize/
+licenses: []
 post_install_message:
 rdoc_options: []
@@ -60,10 +83,10 @@ required_rubygems_version: !ruby/object:Gem::Requirement
   version:
 requirements: []
-rubyforge_project:
-rubygems_version: 1.2.0
+rubyforge_project: riposte
+rubygems_version: 1.3.5
 signing_key:
-specification_version: 2
+specification_version: 3
 summary: Whitelist-based HTML sanitizer.
 test_files: []