adamh-sanitize 1.0.4.4 → 1.1.0
- data/HISTORY +36 -0
- data/README.rdoc +25 -5
- data/lib/sanitize.rb +77 -64
- metadata +3 -13
    
        data/HISTORY
    CHANGED
    
    | @@ -1,6 +1,42 @@ | |
| 1 1 | 
             
            Sanitize History
         | 
| 2 2 | 
             
            ================================================================================
         | 
| 3 3 |  | 
| 4 | 
            +
            Version 1.1.0
         | 
| 5 | 
            +
              * Migrated from Hpricot to Nokogiri. Requires libxml2 >= 2.7.2 [Adam Hooper]
         | 
| 6 | 
            +
             | 
| 7 | 
            +
            Version 1.0.8.1 (git)
         | 
| 8 | 
            +
              * Changed protocol regex to ensure Sanitize doesn't kill URLs with colons in 
         | 
| 9 | 
            +
                path segments. [Peter Cooper]
         | 
| 10 | 
            +
             | 
| 11 | 
            +
            Version 1.0.8 (2009-04-23)
         | 
| 12 | 
            +
              * Added a workaround for an Hpricot bug that prevents attribute names from
         | 
| 13 | 
            +
                being downcased in recent versions of Hpricot. This was exploitable to
         | 
| 14 | 
            +
                prevent non-whitelisted protocols from being cleaned. [Reported by Ben
         | 
| 15 | 
            +
                Wanicur]
         | 
| 16 | 
            +
             | 
| 17 | 
            +
            Version 1.0.7 (2009-04-11)
         | 
| 18 | 
            +
              * Requires Hpricot 0.8.1+, which is finally compatible with Ruby 1.9.1.
         | 
| 19 | 
            +
              * Fixed a bug that caused named character entities containing digits (like
         | 
| 20 | 
            +
                &sup2;) to be escaped when they shouldn't have been. [Reported by Sebastian
         | 
| 21 | 
            +
                Steinmetz]
         | 
| 22 | 
            +
             | 
| 23 | 
            +
            Version 1.0.6 (2009-02-23)
         | 
| 24 | 
            +
              * Removed htmlentities gem dependency.
         | 
| 25 | 
            +
              * Existing well-formed character entity references in the input string are now
         | 
| 26 | 
            +
                preserved rather than being decoded and re-encoded.
         | 
| 27 | 
            +
              * The ' character is now encoded as &#39; instead of &apos; to prevent
         | 
| 28 | 
            +
                problems in IE6.
         | 
| 29 | 
            +
              * You can now specify the symbol :all in place of an element name in the
         | 
| 30 | 
            +
                attributes config hash to allow certain attributes on all elements. [Thanks
         | 
| 31 | 
            +
                to Mutwin Kraus]
         | 
| 32 | 
            +
             | 
| 33 | 
            +
            Version 1.0.5 (2009-02-05)
         | 
| 34 | 
            +
              * Fixed a bug introduced in version 1.0.3 that prevented non-whitelisted
         | 
| 35 | 
            +
                protocols from being cleaned when relative URLs were allowed. [Reported by
         | 
| 36 | 
            +
                Dev Purkayastha]
         | 
| 37 | 
            +
              * Fixed "undefined method `parent='" exceptions caused by parser changes in
         | 
| 38 | 
            +
                edge Hpricot.
         | 
| 39 | 
            +
             | 
| 4 40 | 
             
            Version 1.0.4 (2009-01-16)
         | 
| 5 41 | 
             
              * Fixed a bug that made it possible to sneak a non-whitelisted element through
         | 
| 6 42 | 
             
                by repeating it several times in a row. All versions of Sanitize prior to
         | 
    
        data/README.rdoc
    CHANGED
    
    | @@ -9,13 +9,13 @@ elements, certain attributes within those elements, and even certain URL | |
| 9 9 | 
             
            protocols within attributes that contain URLs. Any HTML elements or attributes
         | 
| 10 10 | 
             
            that you don't explicitly allow will be removed.
         | 
| 11 11 |  | 
| 12 | 
            -
            Because it's based on  | 
| 12 | 
            +
            Because it's based on nokogiri, a full-fledged HTML parser, rather than a bunch
         | 
| 13 13 | 
             
            of fragile regular expressions, Sanitize has no trouble dealing with malformed
         | 
| 14 14 | 
             
            or maliciously-formed HTML. When in doubt, Sanitize always errs on the side of
         | 
| 15 15 | 
             
            caution.
         | 
| 16 16 |  | 
| 17 17 | 
             
            *Author*::    Ryan Grove (mailto:ryan@wonko.com)
         | 
| 18 | 
            -
            *Version*::   1.0. | 
| 18 | 
            +
            *Version*::   1.0.8 (2009-04-23)
         | 
| 19 19 | 
             
            *Copyright*:: Copyright (c) 2009 Ryan Grove. All rights reserved.
         | 
| 20 20 | 
             
            *License*::   MIT License (http://opensource.org/licenses/mit-license.php)
         | 
| 21 21 | 
             
            *Website*::   http://github.com/rgrove/sanitize
         | 
| @@ -23,8 +23,7 @@ caution. | |
| 23 23 | 
             
            == Requires
         | 
| 24 24 |  | 
| 25 25 | 
             
            * RubyGems
         | 
| 26 | 
            -
            *  | 
| 27 | 
            -
            * HTMLEntities 4.0.0+
         | 
| 26 | 
            +
            * nokogiri
         | 
| 28 27 |  | 
| 29 28 | 
             
            == Usage
         | 
| 30 29 |  | 
| @@ -100,6 +99,14 @@ attributes in lowercase. | |
| 100 99 | 
             
                'img'        => ['alt', 'src', 'title']
         | 
| 101 100 | 
             
              }
         | 
| 102 101 |  | 
| 102 | 
            +
            If you'd like to allow certain attributes on all elements, use the symbol
         | 
| 103 | 
            +
            <code>:all</code> instead of an element name.
         | 
| 104 | 
            +
             | 
| 105 | 
            +
              :attributes => {
         | 
| 106 | 
            +
                :all => ['class'],
         | 
| 107 | 
            +
                'a'  => ['href', 'title']
         | 
| 108 | 
            +
              }
         | 
| 109 | 
            +
             | 
| 103 110 | 
             
            ==== :add_attributes
         | 
| 104 111 |  | 
| 105 112 | 
             
            Attributes to add to specific elements. If the attribute already exists, it will
         | 
| @@ -122,12 +129,25 @@ protocol at all), it will be removed. | |
| 122 129 | 
             
              }
         | 
| 123 130 |  | 
| 124 131 | 
             
            If you'd like to allow the use of relative URLs which don't have a protocol,
         | 
| 125 | 
            -
            include the  | 
| 132 | 
            +
            include the symbol <code>:relative</code> in the protocol array:
         | 
| 126 133 |  | 
| 127 134 | 
             
              :protocols => {
         | 
| 128 135 | 
             
                'a' => {'href' => ['http', 'https', :relative]}
         | 
| 129 136 | 
             
              }
         | 
| 130 137 |  | 
| 138 | 
            +
             | 
| 139 | 
            +
            == Contributors
         | 
| 140 | 
            +
             | 
| 141 | 
            +
            The following lovely people have contributed to Sanitize in the form of patches
         | 
| 142 | 
            +
            or ideas that later became code:
         | 
| 143 | 
            +
             | 
| 144 | 
            +
            * Peter Cooper <git@peterc.org>
         | 
| 145 | 
            +
            * Ryan Grove <ryan@wonko.com>
         | 
| 146 | 
            +
            * Adam Hooper <adam@adamhooper.com>
         | 
| 147 | 
            +
            * Mutwin Kraus <mutle@blogage.de>
         | 
| 148 | 
            +
            * Dev Purkayastha <dev.purkayastha@gmail.com>
         | 
| 149 | 
            +
            * Ben Wanicur <bwanicur@verticalresponse.com>
         | 
| 150 | 
            +
             | 
| 131 151 | 
             
            == License
         | 
| 132 152 |  | 
| 133 153 | 
             
            Copyright (c) 2009 Ryan Grove <ryan@wonko.com>
         | 
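A minimal sketch of how the two README options touched above, <code>:all</code> in the attributes hash and <code>:relative</code> in a protocol array, could be evaluated. It does not call Sanitize or nokogiri. CONFIG, attr_whitelist and protocol_allowed? are hypothetical names introduced only for this illustration, and the pattern is assumed to match REGEX_PROTOCOL in lib/sanitize.rb at this version, with its entity alternatives written out.

```ruby
# Hypothetical, trimmed-down config combining the README snippets;
# it is not one of Sanitize's bundled configs.
CONFIG = {
  :attributes => {
    :all => ['class'],
    'a'  => ['href', 'title']
  },
  :protocols => {
    'a' => {'href' => ['http', 'https', :relative]}
  }
}

# Assumed to mirror REGEX_PROTOCOL from lib/sanitize.rb: a run of
# protocol-like characters followed by a colon, literal or entity-encoded.
REGEX_PROTOCOL = /^([A-Za-z0-9\+\-\.\&\;\#\s]*?)(?:\:|&#0*58|&#x0*3a)/i

# Attributes allowed on +element+: its own list plus the :all list.
def attr_whitelist(config, element)
  ((config[:attributes][element] || []) +
      (config[:attributes][:all] || [])).uniq
end

# True if +value+ may stay on +attr_name+ of +element+. Attributes with
# no protocol rule are left alone; values without a protocol prefix are
# kept only when :relative is listed.
def protocol_allowed?(config, element, attr_name, value)
  allowed = (config[:protocols][element] || {})[attr_name]
  return true if allowed.nil?

  if value.to_s.downcase =~ REGEX_PROTOCOL
    allowed.include?($1.downcase)
  else
    allowed.include?(:relative)
  end
end
```

Under these assumptions, a value such as <code>/docs/a:b</code> is treated as relative because the slash stops the protocol match before the colon, which appears to be the case the 1.0.8.1 HISTORY entry describes.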
    
        data/lib/sanitize.rb
    CHANGED
    
    | @@ -26,11 +26,9 @@ $:.uniq! | |
| 26 26 |  | 
| 27 27 | 
             
            require 'rubygems'
         | 
| 28 28 |  | 
| 29 | 
            -
            gem ' | 
| 30 | 
            -
            gem 'htmlentities',  '~> 4.0.0'
         | 
| 29 | 
            +
            gem 'nokogiri', '~> 1.3.3'
         | 
| 31 30 |  | 
| 32 | 
            -
            require ' | 
| 33 | 
            -
            require 'htmlentities'
         | 
| 31 | 
            +
            require 'nokogiri'
         | 
| 34 32 | 
             
            require 'sanitize/config'
         | 
| 35 33 | 
             
            require 'sanitize/config/restricted'
         | 
| 36 34 | 
             
            require 'sanitize/config/basic'
         | 
| @@ -38,30 +36,24 @@ require 'sanitize/config/relaxed' | |
| 38 36 |  | 
| 39 37 | 
             
            class Sanitize
         | 
| 40 38 |  | 
| 39 | 
            +
              # Characters that should be replaced with entities in text nodes.
         | 
| 40 | 
            +
              ENTITY_MAP = {
         | 
| 41 | 
            +
                '<' => '&lt;',
         | 
| 42 | 
            +
                '>' => '&gt;',
         | 
| 43 | 
            +
                '"' => '&quot;',
         | 
| 44 | 
            +
                "'" => '&#39;'
         | 
| 45 | 
            +
              }
         | 
| 46 | 
            +
             | 
| 47 | 
            +
              # Matches an unencoded ampersand that is not part of a valid character entity
         | 
| 48 | 
            +
              # reference.
         | 
| 49 | 
            +
              REGEX_AMPERSAND = /&(?!(?:[a-z]+[0-9]{0,2}|#[0-9]+|#x[0-9a-f]+);)/i
         | 
| 50 | 
            +
             | 
| 41 51 | 
             
              # Matches an attribute value that could be treated by a browser as a URL
         | 
| 42 | 
            -
              # with a protocol prefix, such as "http:" or "javascript:". Any string of  | 
| 52 | 
            +
              # with a protocol prefix, such as "http:" or "javascript:". Any string of zero
         | 
| 43 53 | 
             
              # or more characters followed by a colon is considered a match, even if the
         | 
| 44 54 | 
             
              # colon is encoded as an entity and even if it's an incomplete entity (which
         | 
| 45 55 | 
             
              # IE6 and Opera will still parse).
         | 
| 46 | 
            -
              REGEX_PROTOCOL = /^([ | 
| 47 | 
            -
             | 
| 48 | 
            -
              #--
         | 
| 49 | 
            -
              # Class Methods
         | 
| 50 | 
            -
              #++
         | 
| 51 | 
            -
             | 
| 52 | 
            -
              # Returns a sanitized copy of _html_, using the settings in _config_ if
         | 
| 53 | 
            -
              # specified.
         | 
| 54 | 
            -
              def self.clean(html, config = {})
         | 
| 55 | 
            -
                sanitize = Sanitize.new(config)
         | 
| 56 | 
            -
                sanitize.clean(html)
         | 
| 57 | 
            -
              end
         | 
| 58 | 
            -
             | 
| 59 | 
            -
              # Performs Sanitize#clean in place, returning _html_, or +nil+ if no changes
         | 
| 60 | 
            -
              # were made.
         | 
| 61 | 
            -
              def self.clean!(html, config = {})
         | 
| 62 | 
            -
                sanitize = Sanitize.new(config)
         | 
| 63 | 
            -
                sanitize.clean!(html)
         | 
| 64 | 
            -
              end
         | 
| 56 | 
            +
              REGEX_PROTOCOL = /^([A-Za-z0-9\+\-\.\&\;\#\s]*?)(?:\:|&#0*58|&#x0*3a)/i
         | 
| 65 57 |  | 
| 66 58 | 
             
              #--
         | 
| 67 59 | 
             
              # Instance Methods
         | 
| @@ -81,77 +73,98 @@ class Sanitize | |
| 81 73 | 
             
              # Performs clean in place, returning _html_, or +nil+ if no changes were
         | 
| 82 74 | 
             
              # made.
         | 
| 83 75 | 
             
              def clean!(html)
         | 
| 84 | 
            -
                fragment =  | 
| 85 | 
            -
             | 
| 86 | 
            -
                fragment.search('*') do |node|
         | 
| 87 | 
            -
                  if node.bogusetag? || node.doctype? || node.procins? || node.xmldecl?
         | 
| 88 | 
            -
                    node.parent.altered!
         | 
| 89 | 
            -
                    node.parent.children[node.parent.children.index(node), 1] = []
         | 
| 90 | 
            -
                    next
         | 
| 91 | 
            -
                  end
         | 
| 76 | 
            +
                fragment = Nokogiri::HTML::DocumentFragment.parse(html)
         | 
| 92 77 |  | 
| 78 | 
            +
                fragment.traverse do |node|
         | 
| 93 79 | 
             
                  if node.comment?
         | 
| 94 | 
            -
                    unless @config[:allow_comments]
         | 
| 95 | 
            -
             | 
| 96 | 
            -
                      node.parent.children[node.parent.children.index(node), 1] = []
         | 
| 97 | 
            -
                    end
         | 
| 98 | 
            -
                  elsif node.elem?
         | 
| 80 | 
            +
                    node.unlink unless @config[:allow_comments]
         | 
| 81 | 
            +
                  elsif node.element?
         | 
| 99 82 | 
             
                    name = node.name.to_s.downcase
         | 
| 100 83 |  | 
| 101 84 | 
             
                    # Delete any element that isn't in the whitelist.
         | 
| 102 85 | 
             
                    unless @config[:elements].include?(name)
         | 
| 103 | 
            -
                      node. | 
| 86 | 
            +
                      node.children.each { |n| node.add_previous_sibling(n) }
         | 
| 87 | 
            +
                      node.unlink
         | 
| 104 88 | 
             
                      next
         | 
| 105 89 | 
             
                    end
         | 
| 106 90 |  | 
| 107 | 
            -
                     | 
| 108 | 
            -
             | 
| 91 | 
            +
                    attr_whitelist = ((@config[:attributes][name] || []) +
         | 
| 92 | 
            +
                        (@config[:attributes][:all] || [])).uniq
         | 
| 93 | 
            +
             | 
| 94 | 
            +
                    if attr_whitelist.empty?
         | 
| 95 | 
            +
                      # Delete all attributes from elements with no whitelisted
         | 
| 96 | 
            +
                      # attributes.
         | 
| 97 | 
            +
                      node.attribute_nodes.each { |attr| attr.remove }
         | 
| 98 | 
            +
                    else
         | 
| 109 99 | 
             
                      # Delete any attribute that isn't in the whitelist for this element.
         | 
| 110 | 
            -
                      node. | 
| 111 | 
            -
                         | 
| 100 | 
            +
                      node.attribute_nodes.each do |attr|
         | 
| 101 | 
            +
                        attr.unlink unless attr_whitelist.include?(attr.name.downcase)
         | 
| 112 102 | 
             
                      end
         | 
| 113 103 |  | 
| 114 104 | 
             
                      # Delete remaining attributes that use unacceptable protocols.
         | 
| 115 105 | 
             
                      if @config[:protocols].has_key?(name)
         | 
| 116 106 | 
             
                        protocol = @config[:protocols][name]
         | 
| 117 107 |  | 
| 118 | 
            -
                        node. | 
| 119 | 
            -
                           | 
| 120 | 
            -
                          next  | 
| 108 | 
            +
                        node.attribute_nodes.each do |attr|
         | 
| 109 | 
            +
                          attr_name = attr.name.downcase
         | 
| 110 | 
            +
                          next false unless protocol.has_key?(attr_name)
         | 
| 121 111 |  | 
| 122 | 
            -
                          if value.to_s.downcase =~ REGEX_PROTOCOL
         | 
| 123 | 
            -
                            !protocol[ | 
| 112 | 
            +
                          del = if attr.value.to_s.downcase =~ REGEX_PROTOCOL
         | 
| 113 | 
            +
                            !protocol[attr_name].include?($1.downcase)
         | 
| 124 114 | 
             
                          else
         | 
| 125 | 
            -
                            !protocol[ | 
| 115 | 
            +
                            !protocol[attr_name].include?(:relative)
         | 
| 126 116 | 
             
                          end
         | 
| 117 | 
            +
             | 
| 118 | 
            +
                          attr.unlink if del
         | 
| 127 119 | 
             
                        end
         | 
| 128 120 | 
             
                      end
         | 
| 129 | 
            -
                    else
         | 
| 130 | 
            -
                      # Delete all attributes from elements with no whitelisted
         | 
| 131 | 
            -
                      # attributes.
         | 
| 132 | 
            -
                      node.raw_attributes = {}
         | 
| 133 121 | 
             
                    end
         | 
| 134 122 |  | 
| 135 123 | 
             
                    # Add required attributes.
         | 
| 136 124 | 
             
                    if @config[:add_attributes].has_key?(name)
         | 
| 137 | 
            -
                       | 
| 125 | 
            +
                      @config[:add_attributes][name].each do |key, val|
         | 
| 126 | 
            +
                        node[key] = val
         | 
| 127 | 
            +
                      end
         | 
| 138 128 | 
             
                    end
         | 
| 129 | 
            +
                  elsif node.cdata?
         | 
| 130 | 
            +
                    node.replace(Nokogiri::XML::Text.new(node.text, node.document))
         | 
| 139 131 | 
             
                  end
         | 
| 140 132 | 
             
                end
         | 
| 141 133 |  | 
| 142 | 
            -
                 | 
| 143 | 
            -
                 | 
| 144 | 
            -
             | 
| 145 | 
            -
                # burning desire to decode all entities.
         | 
| 146 | 
            -
                coder = HTMLEntities.new
         | 
| 134 | 
            +
                result = fragment.to_xhtml(:encoding => 'UTF-8', :indent => 0).gsub(/>\n/, '>')
         | 
| 135 | 
            +
                return result == html ? nil : html[0, html.length] = result
         | 
| 136 | 
            +
              end
         | 
| 147 137 |  | 
| 148 | 
            -
             | 
| 149 | 
            -
             | 
| 150 | 
            -
             | 
| 151 | 
            -
             | 
| 138 | 
            +
              #--
         | 
| 139 | 
            +
              # Class Methods
         | 
| 140 | 
            +
              #++
         | 
| 141 | 
            +
             | 
| 142 | 
            +
              class << self
         | 
| 143 | 
            +
                # Returns a sanitized copy of _html_, using the settings in _config_ if
         | 
| 144 | 
            +
                # specified.
         | 
| 145 | 
            +
                def clean(html, config = {})
         | 
| 146 | 
            +
                  sanitize = Sanitize.new(config)
         | 
| 147 | 
            +
                  sanitize.clean(html)
         | 
| 152 148 | 
             
                end
         | 
| 153 149 |  | 
| 154 | 
            -
                 | 
| 155 | 
            -
                 | 
| 150 | 
            +
                # Performs Sanitize#clean in place, returning _html_, or +nil+ if no changes
         | 
| 151 | 
            +
                # were made.
         | 
| 152 | 
            +
                def clean!(html, config = {})
         | 
| 153 | 
            +
                  sanitize = Sanitize.new(config)
         | 
| 154 | 
            +
                  sanitize.clean!(html)
         | 
| 155 | 
            +
                end
         | 
| 156 | 
            +
             | 
| 157 | 
            +
                # Encodes special HTML characters (<, >, ", ', and &) in _html_ as entity
         | 
| 158 | 
            +
                # references and returns the encoded string.
         | 
| 159 | 
            +
                def encode_html(html)
         | 
| 160 | 
            +
                  str = html.dup
         | 
| 161 | 
            +
             | 
| 162 | 
            +
                  # Encode special chars.
         | 
| 163 | 
            +
                  ENTITY_MAP.each {|char, entity| str.gsub!(char, entity) }
         | 
| 164 | 
            +
             | 
| 165 | 
            +
                  # Convert unencoded ampersands to entity references.
         | 
| 166 | 
            +
                  str.gsub(REGEX_AMPERSAND, '&amp;')
         | 
| 167 | 
            +
                end
         | 
| 156 168 | 
             
              end
         | 
| 169 | 
            +
             | 
| 157 170 | 
             
            end
         | 
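The new encode_html class method can be tried outside the gem. The sketch below is standalone and does not load Sanitize or nokogiri. It assumes the entity values in ENTITY_MAP are the usual <code>&amp;lt;</code>, <code>&amp;gt;</code>, <code>&amp;quot;</code> and <code>&amp;#39;</code> references, and it defines a top-level encode_html purely for illustration rather than as a method on the Sanitize class.

```ruby
# Characters replaced with entity references (values assumed, see above).
ENTITY_MAP = {
  '<' => '&lt;',
  '>' => '&gt;',
  '"' => '&quot;',
  "'" => '&#39;'
}

# Matches an ampersand that does not start a well-formed character
# entity reference (named, decimal, or hexadecimal).
REGEX_AMPERSAND = /&(?!(?:[a-z]+[0-9]{0,2}|#[0-9]+|#x[0-9a-f]+);)/i

# Returns an encoded copy of +html+; the argument itself is not modified.
def encode_html(html)
  str = html.dup

  # Encode the special characters first. The ampersands this introduces
  # all begin valid entity references, so the next step leaves them alone.
  ENTITY_MAP.each { |char, entity| str.gsub!(char, entity) }

  # Encode only the ampersands that are not already part of an entity.
  str.gsub(REGEX_AMPERSAND, '&amp;')
end

puts encode_html('a < b & c')
puts encode_html("it's &copy; 2009")
```

Running the special characters before the ampersand pass is what lets existing references such as <code>&amp;copy;</code> survive unchanged, in line with the 1.0.6 HISTORY entry about preserving well-formed entities.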
    
        metadata
    CHANGED
    
    | @@ -1,7 +1,7 @@ | |
| 1 1 | 
             
            --- !ruby/object:Gem::Specification 
         | 
| 2 2 | 
             
            name: adamh-sanitize
         | 
| 3 3 | 
             
            version: !ruby/object:Gem::Version 
         | 
| 4 | 
            -
              version: 1.0 | 
| 4 | 
            +
              version: 1.1.0
         | 
| 5 5 | 
             
            platform: ruby
         | 
| 6 6 | 
             
            authors: 
         | 
| 7 7 | 
             
            - Ryan Grove
         | 
| @@ -13,24 +13,14 @@ date: 2009-05-16 00:00:00 -07:00 | |
| 13 13 | 
             
            default_executable: 
         | 
| 14 14 | 
             
            dependencies: 
         | 
| 15 15 | 
             
            - !ruby/object:Gem::Dependency 
         | 
| 16 | 
            -
              name:  | 
| 16 | 
            +
              name: nokogiri
         | 
| 17 17 | 
             
              type: :runtime
         | 
| 18 18 | 
             
              version_requirement: 
         | 
| 19 19 | 
             
              version_requirements: !ruby/object:Gem::Requirement 
         | 
| 20 20 | 
             
                requirements: 
         | 
| 21 21 | 
             
                - - ~>
         | 
| 22 22 | 
             
                  - !ruby/object:Gem::Version 
         | 
| 23 | 
            -
                    version:  | 
| 24 | 
            -
                version: 
         | 
| 25 | 
            -
            - !ruby/object:Gem::Dependency 
         | 
| 26 | 
            -
              name: htmlentities
         | 
| 27 | 
            -
              type: :runtime
         | 
| 28 | 
            -
              version_requirement: 
         | 
| 29 | 
            -
              version_requirements: !ruby/object:Gem::Requirement 
         | 
| 30 | 
            -
                requirements: 
         | 
| 31 | 
            -
                - - ~>
         | 
| 32 | 
            -
                  - !ruby/object:Gem::Version 
         | 
| 33 | 
            -
                    version: 4.0.0
         | 
| 23 | 
            +
                    version: 1.3.3
         | 
| 34 24 | 
             
                version: 
         | 
| 35 25 | 
             
            description: 
         | 
| 36 26 | 
             
            email: ryan@wonko.com
         |
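The dependency hunk above replaces the earlier runtime dependencies with a single nokogiri requirement of <code>~> 1.3.3</code>. As a rough illustration of which versions that constraint accepts, the sketch below uses RubyGems' own Gem::Requirement and Gem::Version classes. The helper name nokogiri_version_ok? is hypothetical and is not part of the gem.

```ruby
require 'rubygems'

# The constraint recorded in the gemspec metadata for nokogiri.
NOKOGIRI_REQUIREMENT = Gem::Requirement.new('~> 1.3.3')

# True if +version+ (a string such as "1.3.9") satisfies the constraint.
# "~> 1.3.3" accepts 1.3.3 and later 1.3.x releases, but not 1.4.0.
def nokogiri_version_ok?(version)
  NOKOGIRI_REQUIREMENT.satisfied_by?(Gem::Version.new(version))
end

%w[1.3.2 1.3.3 1.3.9 1.4.0].each do |v|
  puts "#{v}: #{nokogiri_version_ok?(v)}"
end
```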