
htmlentities 3.0.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (9)

data/CHANGES ADDED

@@ -0,0 +1,29 @@
+== 3.0.0 (2006-04-08)
+* Changed licence to MIT due to confusion with previous 'Fair' licence (my
+  intention was to be liberal, not obscure).
+* Moved basic functionality out of String class; for previous behaviour,
+  require 'htmlentities/string'
+* Changed version numbering scheme
+* Now available as a Gem
+== 2.2 (2005-11-07)
+* Important bug fixes -- thanks to Moonwolf
+* Decoding hexadecimal entities now accepts 'f' as a hex digit. (D'oh!)
+* Decimal decoding edge cases addressed
+* Test cases added
+== 2.1 (2005-10-31)
+* Removed some unnecessary code in basic entity encoding
+* Improved handling of encoding: commands are now automatically sorted, so the
+  user doesn't have to worry about their order
+* Now using setup.rb
+* Tests moved to separate file
+== 2.0 (2005-08-23)
+* Added encoding to entities
+* Decoding interface unchanged
+* Fixed a bug with handling high codepoints
+== 1.0 (2005-08-03)
+* Initial release
+* Decoding only

data/COPYING ADDED

@@ -0,0 +1,21 @@
+== Licence (MIT)
+Copyright (c) 2005-2006 Paul Battley
+Permission is hereby granted, free of charge, to any person obtaining a copy
+of this software and associated documentation files (the "Software"), to deal
+in the Software without restriction, including without limitation the rights
+to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+copies of the Software, and to permit persons to whom the Software is
+furnished to do so, subject to the following conditions:
+The above copyright notice and this permission notice shall be included in all
+copies or substantial portions of the Software.
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+SOFTWARE.

data/README ADDED

@@ -0,0 +1,23 @@
+== HTMLEntities
+HTML entity encoding and decoding for Ruby
+The HTMLEntities module facilitates encoding and decoding of
+HTML/XML entities from/to their corresponding UTF-8 codepoints.
+To install (requires root/admin privileges):
+ ruby setup.rb
+Alternatively, you can just use the gem.
+== Licence
+This code is free to use under the terms of the MIT licence. If you'd like to
+negotiate a different licence for a specific use, just contact me -- I'll
+almost certainly permit it.
+== Contact
+Comments are welcome. Send an email to pbattley@gmail.com.

data/lib/htmlentities.rb ADDED

@@ -0,0 +1,167 @@
+#
+# HTML entity encoding and decoding for Ruby
+#
+module HTMLEntities # :nodoc:
+  class InstructionError < RuntimeError
+  end
+  #
+  # MAP is a hash of all the HTML entities I could discover, as taken
+  # from the w3schools page on the subject:
+  # http://www.w3schools.com/html/html_entitiesref.asp
+  # The format is 'entity name' => codepoint where entity name is given
+  # without the surrounding ampersand and semicolon.
+  #
+  MAP = {
+    'quot'      => 34,        'apos'      => 39,        'amp'       => 38,
+    'lt'        => 60,        'gt'        => 62,        'nbsp'      => 160,
+    'iexcl'     => 161,       'curren'    => 164,       'cent'      => 162,
+    'pound'     => 163,       'yen'       => 165,       'brvbar'    => 166,
+    'sect'      => 167,       'uml'       => 168,       'copy'      => 169,
+    'ordf'      => 170,       'laquo'     => 171,       'not'       => 172,
+    'shy'       => 173,       'reg'       => 174,       'trade'     => 8482,
+    'macr'      => 175,       'deg'       => 176,       'plusmn'    => 177,
+    'sup2'      => 178,       'sup3'      => 179,       'acute'     => 180,
+    'micro'     => 181,       'para'      => 182,       'middot'    => 183,
+    'cedil'     => 184,       'sup1'      => 185,       'ordm'      => 186,
+    'raquo'     => 187,       'frac14'    => 188,       'frac12'    => 189,
+    'frac34'    => 190,       'iquest'    => 191,       'times'     => 215,
+    'divide'    => 247,       'Agrave'    => 192,       'Aacute'    => 193,
+    'Acirc'     => 194,       'Atilde'    => 195,       'Auml'      => 196,
+    'Aring'     => 197,       'AElig'     => 198,       'Ccedil'    => 199,
+    'Egrave'    => 200,       'Eacute'    => 201,       'Ecirc'     => 202,
+    'Euml'      => 203,       'Igrave'    => 204,       'Iacute'    => 205,
+    'Icirc'     => 206,       'Iuml'      => 207,       'ETH'       => 208,
+    'Ntilde'    => 209,       'Ograve'    => 210,       'Oacute'    => 211,
+    'Ocirc'     => 212,       'Otilde'    => 213,       'Ouml'      => 214,
+    'Oslash'    => 216,       'Ugrave'    => 217,       'Uacute'    => 218,
+    'Ucirc'     => 219,       'Uuml'      => 220,       'Yacute'    => 221,
+    'THORN'     => 222,       'szlig'     => 223,       'agrave'    => 224,
+    'aacute'    => 225,       'acirc'     => 226,       'atilde'    => 227,
+    'auml'      => 228,       'aring'     => 229,       'aelig'     => 230,
+    'ccedil'    => 231,       'egrave'    => 232,       'eacute'    => 233,
+    'ecirc'     => 234,       'euml'      => 235,       'igrave'    => 236,
+    'iacute'    => 237,       'icirc'     => 238,       'iuml'      => 239,
+    'eth'       => 240,       'ntilde'    => 241,       'ograve'    => 242,
+    'oacute'    => 243,       'ocirc'     => 244,       'otilde'    => 245,
+    'ouml'      => 246,       'oslash'    => 248,       'ugrave'    => 249,
+    'uacute'    => 250,       'ucirc'     => 251,       'uuml'      => 252,
+    'yacute'    => 253,       'thorn'     => 254,       'yuml'      => 255,
+    'OElig'     => 338,       'oelig'     => 339,       'Scaron'    => 352,
+    'scaron'    => 353,       'Yuml'      => 376,       'circ'      => 710,
+    'tilde'     => 732,       'ensp'      => 8194,      'emsp'      => 8195,
+    'thinsp'    => 8201,      'zwnj'      => 8204,      'zwj'       => 8205,
+    'lrm'       => 8206,      'rlm'       => 8207,      'ndash'     => 8211,
+    'mdash'     => 8212,      'lsquo'     => 8216,      'rsquo'     => 8217,
+    'sbquo'     => 8218,      'ldquo'     => 8220,      'rdquo'     => 8221,
+    'bdquo'     => 8222,      'dagger'    => 8224,      'Dagger'    => 8225,
+    'hellip'    => 8230,      'permil'    => 8240,      'lsaquo'    => 8249,
+    'rsaquo'    => 8250,      'euro'      => 8364
+  }
+  MIN_LENGTH = MAP.keys.map{ |a| a.length }.min
+  MAX_LENGTH = MAP.keys.map{ |a| a.length }.max
+  # Precompile the regexp
+  NAMED_ENTITY_REGEXP =
+    /&([a-z]{#{HTMLEntities::MIN_LENGTH},#{HTMLEntities::MAX_LENGTH}});/i
+  # Reverse map for converting characters to named entities
+  REVERSE_MAP = MAP.invert
+  BASIC_ENTITY_REGEXP = /[<>'"&]/
+  UTF8_NON_ASCII_REGEXP = /[\x00-\x1f]|[\xc0-\xfd][\x80-\xbf]+/
+  ENCODE_ENTITIES_COMMAND_ORDER = {
+    :basic => 0,
+    :named => 1,
+    :decimal => 2,
+    :hexadecimal => 3
+  }
+  #
+  # Decode XML and HTML 4.01 entities in a string into their UTF-8
+  # equivalents.  Obviously, if your string is not already in UTF-8, you'd
+  # better convert it before using this method, or the output will be mixed
+  # up.
+  #
+  # Unknown named entities are not converted
+  #
+  def decode_entities(string)
+    return string.gsub(NAMED_ENTITY_REGEXP) {
+      (cp = MAP[$1]) ? [cp].pack('U') : $&
+    }.gsub(/&#([0-9]{1,7});|&#x([0-9a-f]{1,6});/i) {
+      $1 ? [$1.to_i].pack('U') : [$2.to_i(16)].pack('U')
+    }
+  end
+  #
+  # Encode codepoints into their corresponding entities.  Various operations
+  # are possible, and may be specified in order:
+  #
+  # :basic :: Convert the five XML entities ('"<>&)
+  # :named :: Convert non-ASCII characters to their named HTML 4.01 equivalent
+  # :decimal :: Convert non-ASCII characters to decimal entities (e.g. &#1234;)
+  # :hexadecimal :: Convert non-ASCII characters to hexadecimal entities (e.g. &#x12ab;)
+  #
+  # You can specify the commands in any order, but they will be executed in
+  # the order listed above to ensure that entity ampersands are not
+  # clobbered and that named entities are replaced before numeric ones.
+  #
+  # If no instructions are specified, :basic will be used.
+  #
+  # Examples:
+  #   encode_entities(str) - XML-safe
+  #   encode_entities(str, :basic, :decimal) - XML-safe and 7-bit clean
+  #   encode_entities(str, :basic, :named, :decimal) - 7-bit clean, with all
+  #   non-ASCII characters replaced with their named entity where possible, and
+  #   decimal equivalents otherwise.
+  #
+  # Note: It is the program's responsibility to ensure that the string
+  # contains valid UTF-8 before calling this method.
+  #
+  def encode_entities(string, *instructions)
+    output = nil
+    if (instructions.empty?)
+      instructions = [:basic]
+    else
+      instructions = instructions.sort_by { |instruction|
+        ENCODE_ENTITIES_COMMAND_ORDER[instruction] ||
+        (raise InstructionError, "unknown encode_entities command `#{instruction.inspect}'")
+      }
+    end
+    instructions.each do |instruction|
+      case instruction
+      when :basic
+        # Handled as basic ASCII
+        output = (output || string).gsub(BASIC_ENTITY_REGEXP) {
+          # It's safe to use the simpler [0] here because we know
+          # that the basic entities are ASCII.
+          '&' << REVERSE_MAP[$&[0]] << ';'
+        }
+      when :named
+        # Test everything except printable ASCII
+        output = (output || string).gsub(UTF8_NON_ASCII_REGEXP) {
+          cp = $&.unpack('U')[0]
+          (e = REVERSE_MAP[cp]) ?  "&#{e};" : $&
+        }
+      when :decimal
+        output = (output || string).gsub(UTF8_NON_ASCII_REGEXP) {
+          "&##{$&.unpack('U')[0]};"
+        }
+      when :hexadecimal
+        output = (output || string).gsub(UTF8_NON_ASCII_REGEXP) {
+          "&#x#{$&.unpack('U')[0].to_s(16)};"
+        }
+      end
+    end
+    return output
+  end
+  extend self
+end
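
Reviewer note: the two-pass decoding in decode_entities above (named entities first, then decimal and hexadecimal references) can be tried out with the sketch below. It is only an approximation written for current Ruby versions, not the released code: the module name DecodeSketch, the trimmed SAMPLE_MAP and the method name decode are hypothetical. It also makes one deliberate change, labeled in the comments: the name pattern accepts digits, because a pattern limited to [a-z] cannot match names such as frac12.

```ruby
# Minimal sketch of the two-pass decode, assuming a current Ruby version.
module DecodeSketch
  # Hypothetical, trimmed-down stand-in for the MAP constant in the diff.
  SAMPLE_MAP = {
    'amp' => 38, 'lt' => 60, 'gt' => 62, 'quot' => 34,
    'ocirc' => 244, 'frac12' => 189
  }.freeze

  # Length bounds derived the same way as MIN_LENGTH and MAX_LENGTH.
  LENGTHS = SAMPLE_MAP.keys.map(&:length)

  # Assumption: digits are allowed here so that 'frac12' can match;
  # the pattern in the diff uses [a-z] only.
  NAMED = /&([a-z0-9]{#{LENGTHS.min},#{LENGTHS.max}});/i

  # Same numeric pattern as in the diff: decimal or hexadecimal references.
  NUMERIC = /&#([0-9]{1,7});|&#x([0-9a-f]{1,6});/i

  module_function

  def decode(string)
    # Pass 1: replace known names, leave unknown names untouched.
    named_done = string.gsub(NAMED) do
      codepoint = SAMPLE_MAP[$1]
      codepoint ? [codepoint].pack('U') : $&
    end
    # Pass 2: replace numeric references with the UTF-8 character.
    named_done.gsub(NUMERIC) do
      [$1 ? $1.to_i : $2.to_i(16)].pack('U')
    end
  end
end

puts DecodeSketch.decode('&lt;p&gt;')   # prints <p>
puts DecodeSketch.decode('&bogus;')     # prints &bogus; (unknown name kept)
puts DecodeSketch.decode('&amp;amp;')   # prints &amp; (one level decoded)
```

Because each pass is a single non-overlapping gsub, '&amp;amp;' loses only one level of escaping, which matches test_edge_case_decoding in the test file.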

data/lib/htmlentities/string.rb ADDED

@@ -0,0 +1,17 @@
+require 'htmlentities'
+#
+# This library extends the String class with methods to allow encoding and decoding of
+# HTML/XML entities from/to their corresponding UTF-8 codepoints.
+#
+class String
+  def decode_entities
+    return HTMLEntities.decode_entities(self)
+  end
+  def encode_entities(*instructions)
+    return HTMLEntities.encode_entities(self, *instructions)
+  end
+end
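
Reviewer note: the String extension above is a thin delegation layer over a module that exposes its methods through extend self. The sketch below shows that pattern in isolation, assuming a current Ruby version. TinyEntities, escape_basic and tiny_escape_basic are hypothetical names chosen so they cannot clash with the real library, and the hash-based replacement is a simplification, not the library's approach.

```ruby
# Hypothetical stand-in for a module that offers its functions both as
# mixin methods and as module-level methods.
module TinyEntities
  BASIC = {
    '&' => '&amp;', '<' => '&lt;', '>' => '&gt;',
    '"' => '&quot;', "'" => '&apos;'
  }.freeze

  def escape_basic(string)
    # gsub with a hash looks up each matched character in BASIC.
    string.gsub(/[<>'"&]/, BASIC)
  end

  # Makes escape_basic callable as TinyEntities.escape_basic.
  extend self
end

class String
  # Delegates to the module, passing the receiver as the string argument.
  def tiny_escape_basic
    TinyEntities.escape_basic(self)
  end
end

puts TinyEntities.escape_basic('a < b')   # prints a &lt; b
puts '"x"'.tiny_escape_basic              # prints &quot;x&quot;
```

Keeping the logic in the module and the String methods in a separate file means callers opt in to the core-class extension with an explicit require, which is the split the CHANGES entry for 3.0.0 describes.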

data/test/all.rb ADDED

@@ -0,0 +1,3 @@
+Dir[File.dirname(__FILE__)+'/*_test.rb'].each do |test|
+  require test
+end

data/test/entities_test.rb ADDED

@@ -0,0 +1,129 @@
+$: << File.dirname(__FILE__) + '/../lib/'
+require 'htmlentities'
+require 'test/unit'
+$KCODE = 'u'
+class TestHTMLEntities < Test::Unit::TestCase
+  def test_basic_decoding
+    assert_decode('&', '&amp;')
+    assert_decode('<', '&lt;')
+    assert_decode('"', '&quot;')
+  end
+  def test_basic_encoding
+    assert_encode('&amp;', '&', :basic)
+    assert_encode('&quot;', '"')
+    assert_encode('&lt;', '<', :basic)
+    assert_encode('&lt;', '<')
+  end
+  def test_extended_decoding
+    assert_decode('±', '&plusmn;')
+    assert_decode('ð', '&eth;')
+    assert_decode('Œ', '&OElig;')
+    assert_decode('œ', '&oelig;')
+  end
+  def test_extended_encoding
+    assert_encode('&plusmn;', '±', :named)
+    assert_encode('&eth;', 'ð', :named)
+    assert_encode('&OElig;', 'Œ', :named)
+    assert_encode('&oelig;', 'œ', :named)
+  end
+  def test_decimal_decoding
+    assert_decode('“', '&#8220;')
+    assert_decode('…', '&#8230;')
+    assert_decode(' ', '&#32;')
+  end
+  def test_decimal_encoding
+    assert_encode('&#8220;', '“', :decimal)
+    assert_encode('&#8230;', '…', :decimal)
+  end
+  def test_hexadecimal_decoding
+    assert_decode('−', '&#x2212;')
+    assert_decode('—', '&#x2014;')
+    assert_decode('`', '&#x0060;')
+    assert_decode('`', '&#x60;')
+  end
+  def test_hexadecimal_encoding
+    assert_encode('&#x2212;', '−', :hexadecimal)
+    assert_encode('&#x2014;', '—', :hexadecimal)
+  end
+  def test_mixed_decoding
+    # Just a random headline - I needed something with accented letters.
+    assert_decode(
+      'Le tabac pourrait bientôt être banni dans tous les lieux publics en France',
+      'Le tabac pourrait bient&ocirc;t &#234;tre banni dans tous les lieux publics en France'
+    )
+    assert_decode(
+      '"bientôt" & 文字',
+      '&quot;bient&ocirc;t&quot; &amp; &#25991;&#x5b57;'
+    )
+  end
+  def test_mixed_encoding
+    assert_encode(
+      '&quot;bient&ocirc;t&quot; &amp; &#x6587;&#x5b57;',
+      '"bientôt" & 文字', :basic, :named, :hexadecimal
+    )
+    assert_encode(
+      '&quot;bient&ocirc;t&quot; &amp; &#25991;&#23383;',
+      '"bientôt" & 文字', :basic, :named, :decimal
+    )
+  end
+  def test_mixed_encoding_with_sort
+    assert_encode(
+      '&quot;bient&ocirc;t&quot; &amp; &#x6587;&#x5b57;',
+      '"bientôt" & 文字', :named, :hexadecimal, :basic
+    )
+    assert_encode(
+      '&quot;bient&ocirc;t&quot; &amp; &#25991;&#23383;',
+      '"bientôt" & 文字', :decimal, :named, :basic
+    )
+  end
+  def test_detect_illegal_encoding_command
+    assert_raise(HTMLEntities::InstructionError) {
+      HTMLEntities.encode_entities('foo', :bar, :baz)
+    }
+  end
+  def test_edge_case_decoding
+    assert_decode('', '')
+    assert_decode('&bogus;', '&bogus;')
+    assert_decode('&amp;', '&amp;amp;')
+  end
+  def test_edge_case_encoding
+    assert_encode('`', '`')
+    assert_encode(' ', ' ')
+    assert_encode('&amp;amp;', '&amp;')
+    assert_encode('&amp;amp;', '&amp;')
+  end
+  # Faults found and patched by Moonwolf
+  def test_moonwolf_decoding
+    assert_decode("\x2", '&#2;')
+    assert_decode("\xf", '&#xf;')
+  end
+  private
+  def assert_decode(expected, input)
+    assert_equal(expected, HTMLEntities.decode_entities(input))
+  end
+  def assert_encode(expected, input, *args)
+    assert_equal(expected, HTMLEntities.encode_entities(input, *args))
+  end
+end
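
Reviewer note: test_mixed_encoding_with_sort relies on encode_entities sorting its instructions before running them, so that :basic always escapes ampersands before any pass that introduces new ones. The sketch below reproduces that ordering logic for current Ruby versions. EncodeSketch, NAMES and encode are hypothetical names, NAMES is a small subset of the reverse map, and the NON_ASCII pattern is a simplified character-level substitute for the byte-oriented pattern in the diff (it does not cover control characters).

```ruby
# Minimal sketch of the sorted, multi-pass encode, assuming a current Ruby.
module EncodeSketch
  class InstructionError < RuntimeError; end

  # Hypothetical subset of the reverse map (codepoint => entity name).
  NAMES = {
    34 => 'quot', 38 => 'amp', 39 => 'apos',
    60 => 'lt', 62 => 'gt', 244 => 'ocirc'
  }.freeze

  # Same execution order as ENCODE_ENTITIES_COMMAND_ORDER in the diff.
  ORDER = { basic: 0, named: 1, decimal: 2, hexadecimal: 3 }.freeze

  BASIC = /[<>'"&]/
  # Assumption: any character outside 7-bit ASCII.
  NON_ASCII = /[^\x00-\x7F]/

  module_function

  def encode(string, *instructions)
    instructions = [:basic] if instructions.empty?
    # Sorting rejects unknown commands and fixes the execution order.
    sorted = instructions.sort_by do |instruction|
      ORDER[instruction] ||
        raise(InstructionError, "unknown command #{instruction.inspect}")
    end
    # Each pass works on the output of the previous one.
    sorted.inject(string) do |out, instruction|
      case instruction
      when :basic
        out.gsub(BASIC) { "&#{NAMES[$&.ord]};" }
      when :named
        out.gsub(NON_ASCII) { (name = NAMES[$&.ord]) ? "&#{name};" : $& }
      when :decimal
        out.gsub(NON_ASCII) { "&##{$&.ord};" }
      when :hexadecimal
        out.gsub(NON_ASCII) { "&#x#{$&.ord.to_s(16)};" }
      end
    end
  end
end

input = "\"bient\u00f4t\" & \u6587\u5b57"
puts EncodeSketch.encode(input, :decimal, :named, :basic)
# prints &quot;bient&ocirc;t&quot; &amp; &#25991;&#23383;
```

If :decimal ran before :basic, the ampersand in each generated reference would itself be escaped to &amp;, which is the clobbering the sort prevents.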

data/test/string_test.rb ADDED

@@ -0,0 +1,24 @@
+$: << File.dirname(__FILE__) + '/../lib/'
+require 'htmlentities/string'
+require 'test/unit'
+$KCODE = 'u'
+class TestHTMLEntities < Test::Unit::TestCase
+  def test_string_responds_correctly_to_decode_entities
+    assert_equal('±', '&plusmn;'.decode_entities)
+  end
+  def test_string_responds_correctly_to_encode_entities_with_no_parameters
+    assert_equal('&quot;', '"'.encode_entities)
+  end
+  def test_string_responds_correctly_to_encode_entities_with_multiple_parameters
+    assert_equal(
+      '&quot;bient&ocirc;t&quot; &amp; &#x6587;&#x5b57;',
+      '"bientôt" & 文字'.encode_entities(:basic, :named, :hexadecimal)
+    )
+  end
+end

metadata ADDED

@@ -0,0 +1,54 @@
+--- !ruby/object:Gem::Specification
+rubygems_version: 0.8.11
+specification_version: 1
+name: htmlentities
+version: !ruby/object:Gem::Version
+  version: 3.0.0
+date: 2006-04-08 00:00:00 +01:00
+summary: A module for encoding and decoding of HTML/XML entities from/to their corresponding UTF-8 codepoints. Optional String class extension for same.
+require_paths:
+- lib
+email: pbattley@gmail.com
+homepage:
+rubyforge_project:
+description:
+autorequire:
+default_executable:
+bindir: bin
+has_rdoc: true
+required_ruby_version: !ruby/object:Gem::Version::Requirement
+  requirements:
+  - - ">"
+    - !ruby/object:Gem::Version
+      version: 0.0.0
+  version:
+platform: ruby
+signing_key:
+cert_chain:
+authors:
+- Paul Battley
+files:
+- lib/htmlentities.rb
+- lib/htmlentities/string.rb
+- test/all.rb
+- test/entities_test.rb
+- test/string_test.rb
+- README
+- CHANGES
+- COPYING
+test_files:
+- test/all.rb
+rdoc_options: []
+extra_rdoc_files:
+- README
+- CHANGES
+- COPYING
+executables: []
+extensions: []
+requirements: []
+dependencies: []