RubyGems - charesc - Versions diffs - 0.1.0 - Mend

charesc 0.1.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (8)

data/README ADDED

@@ -0,0 +1,57 @@
+== charesc version 0.1.0
+=== Overview
+Many programming languages and data formats provide character
+escapes based on Unicode. This gem, 'charesc', does so for Ruby.
+=== Syntax
+Character Escapes are defined as constants starting with the
+letter 'U', followed by at least four hexadecimal digits.
+Four hexadecimal digits represent characters in the basic
+multilingual plane (BMP) of Unicode/ISO 10646.
+Five hexadecimal digits represent characters in planes 1 to 15.
+Six hexadecimal digits, starting with '10', represent characters
+in plane 16.
+Up to and including four digits, leading zeros are mandatory;
+beyond four digits, they are forbidden. In this respect, the
+syntax is the same as for the U+ notation from the Unicode book.
+=== Usage
+Character escapes can be used inside strings, with the interpolation
+syntax, e.g., "abcd#{U6789u789A}". They can also be used on their
+own, as free-standing constants, e.g., "abcd" + U6789u789A.
+=== Returned Values
+All codepoints including non-characters (e.g. U+FFFF) are available,
+but surrogates (U+D800-U+DFFF) are not available, guaranteeing
+that no ill-formed UTF-8 sequences are produced.
+Character escapes can either be used as individual characters
+(e.g., U6789) or in strings (e.g., U6789U789A). Starting from the
+second 'U', it is possible to use 'u' instead for easier visual
+parsing (e.g., U6789u789A). The hexadecimal characters A-F can
+always also be written lower-case. The value of a character
+escape is never a character (e.g., ?a), always a string.
+=== Character Escapes and Character Encodings
+The charesc gem takes the value of $KCODE into account automatically.
+If $KCODE is set to Shift_JIS or EUC-JP, the character escapes are
+converted to the respective encoding (as far as allowed by these
+encodings). If $KCODE indicates UTF-8 or 'none', character escapes
+return their values in UTF-8.
+By redefining the method charesc_non_utf8_conversion_hook,
+it is possible to change this behavior if necessary.
+=== Future Work
+- Adapt syntax if there is community consensus for something
+  different (warning: discussing syntactic details can become
+  a rathole).
+- Make this part of the standard Ruby distribution, or even
+  better, integrate it into Ruby itself. In the latter case,
+  the syntax can be reconsidered, because we can then e.g.
+  use \u.... or so.
+=== Copyright
+Copyright (c) 2007 Martin J. Du"rst (duerst@it.aoyama.ac.jp)
+Licensed under the same terms as Ruby. Absolutely no warranty.
+(see http://www.ruby-lang.org/en/LICENSE.txt)
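The syntax rules in the README can be tried out without installing the gem. The following is a minimal sketch, not the gem's own code: the names `CHARESC_UNIT`, `CHARESC_NAME` and `valid_charesc?` are hypothetical, and the pattern is only a restatement of the rules above (four digits for the BMP minus the surrogates, five digits for planes 1 to 15, six digits starting with '10' for plane 16).

```ruby
# Hypothetical helper (not part of charesc): checks whether a name follows
# the escape syntax described in the README.
CHARESC_UNIT = /
    [0-9A-CEF][0-9A-F]{3}   # BMP, first digit other than D
  | D[0-7][0-9A-F]{2}       # BMP D000-D7FF, so the surrogates are excluded
  | [1-9A-F][0-9A-F]{4}     # planes 1-15: five digits, no leading zero
  | 10[0-9A-F]{4}           # plane 16: six digits starting with 10
/xi

# The first 'U' must be upper-case; later separators may be 'U' or 'u'.
CHARESC_NAME = /\AU(?:#{CHARESC_UNIT})(?:[Uu](?:#{CHARESC_UNIT}))*\z/

def valid_charesc?(name)
  CHARESC_NAME.match?(name.to_s)
end

puts valid_charesc?('U6789u789A')  # accepted: two BMP escapes
puts valid_charesc?('U0ABCD')      # rejected: leading zero on five digits
puts valid_charesc?('UD800')       # rejected: surrogate
```

Keeping the per-escape pattern separate from the whole-name pattern makes it easier to see which rule rejects a given name.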

data/lib/charesc.rb ADDED

@@ -0,0 +1,39 @@
+# :include: ../README
+class Module
+  alias charesc_old_const_missing const_missing
+  # pretend that constants of the form Uhhhh, with h a hexadecimal digit,
+  # are defined and their value corresponds to the value of the Unicode
+  # character U+hhhh. For details, see the README.
+  def const_missing (const)
+    # Everything except the first 'U' is matched case-insensitively. The
+    # first 'U' is always upper-case: a name that starts with a lower-case
+    # letter is not a constant, so const_missing is never called for it.
+    if const.to_s =~ /^((U(          [0-9ABCEF][0-9A-F]{3} # general BMP
+                           |         D[0-7][0-9A-F]{2}     # excluding surrogates
+                           | [1-9A-F][0-9A-F]{4}           # planes 1-15
+                           | 10      [0-9A-F]{4}           # plane 16
+                          )
+                        )*
+                       )
+                      $/ix
+      unescaped = $1.split(/[Uu]/)[1..-1].collect { |hex| hex.to_i(16) }.pack('U*')
+      # make it work with other built-in encodings
+      return charesc_non_utf8_conversion_hook(unescaped)
+    else
+      charesc_old_const_missing(const)
+    end
+  end
+  # redefine this hook method if you need to handle non-UTF-8
+  # encodings differently
+  def charesc_non_utf8_conversion_hook (unescaped)
+    # NKF options: -W = UTF-8 input, -s / -e = Shift_JIS / EUC-JP output,
+    # -m0 = no MIME decoding
+    if nkf_options = {'SJIS'=>'-WsIm0', 'EUC'=>'-WeIm0'}[$KCODE]
+      require 'nkf' # avoid that for a pure UTF-8 application
+      unescaped = NKF.nkf(nkf_options, unescaped)
+    end
+    return unescaped
+  end
+end
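The `const_missing` technique used in this file can be sketched in a form that runs on current Ruby versions. This is an illustration under stated assumptions, not the gem's implementation: the module name `CharEscDemo` and the constant `ESCAPES` are hypothetical, the hook is limited to one namespace instead of patching `Module` globally, and the `$KCODE`/NKF conversion step is left out.

```ruby
# Hypothetical demo module; the real gem patches Module#const_missing
# globally, whereas this sketch limits the effect to one namespace.
module CharEscDemo
  ESCAPES = /\A(?:U(?:[0-9A-CEF][0-9A-F]{3}|D[0-7][0-9A-F]{2}|[1-9A-F][0-9A-F]{4}|10[0-9A-F]{4}))+\z/i

  def self.const_missing(const)
    name = const.to_s
    # Anything that is not a well-formed escape falls through to the
    # default behaviour, which raises NameError.
    return super unless ESCAPES.match?(name)

    # Split on the U/u separators, drop the empty leading field, and
    # turn the hex code points into a UTF-8 string.
    name.split(/[Uu]/).drop(1).map { |hex| hex.to_i(16) }.pack('U*')
  end
end

puts CharEscDemo::U00FC          # a one-character string
puts CharEscDemo::U6789u789A     # a two-character string
```

Because the lookup goes through `CharEscDemo::`, unrelated missing constants elsewhere in a program are unaffected by the hook.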

data/test/test_euc.rb ADDED

@@ -0,0 +1,32 @@
+# testing charesc gem with EUC-JP
+# Copyright 2007 Martin J. Du"rst (duerst@it.aoyama.ac.jp);
+# available under the same licence as Ruby itself
+# (see http://www.ruby-lang.org/en/LICENSE.txt)
+$:.unshift File.join(File.dirname(__FILE__), "..", "lib")
+require 'charesc'
+$KCODE = 'EUC-JP'
+require 'test/unit'
+class TestEUC < Test::Unit::TestCase
+  def test_euc
+    assert_equal('Yukihiro Matsumoto - 松本行弘',
+      "Yukihiro Matsumoto - #{U677Eu672Cu884Cu5F18}")
+    assert_equal('Matz - まつもと ゆきひろ',
+      "Matz - #{U307Eu3064u3082u3068} #{U3086u304Du3072u308D}")
+    assert_equal("Aoyama Gakuin University - \xC0\xC4\xBB\xB3\xB3\xD8\xB1\xA1\xC2\xE7\xB3\xD8",
+      "Aoyama Gakuin University - #{U9752u5C71u5B66u9662u5927u5B66}")
+    assert_equal('Aoyama Gakuin University - 青山学院大学',
+      "Aoyama Gakuin University - #{U9752u5C71u5B66u9662u5927u5B66}")
+  end
+  def test_mime
+    # make sure MIME is not decoded
+    # MIME header encoding of 青山: =?ISO-2022-JP?B?GyRCQEQ7MxsoQg==?=
+    assert_equal('=?ISO-2022-JP?B?GyRCQEQ7MxsoQg==?=',
+      U003Du003Fu0049u0053u004Fu002Du0032u0030u0032u0032u002Du004Au0050u003Fu0042u003Fu0047u0079u0052u0043u0051u0045u0051u0037u004Du0078u0073u006Fu0051u0067u003Du003Du003Fu003D)
+  end
+end

data/test/test_fail.rb ADDED

@@ -0,0 +1,42 @@
+# testing failure cases for charesc gem
+# Copyright 2007 Martin J. Du"rst (duerst@it.aoyama.ac.jp);
+# available under the same licence as Ruby itself
+# (see http://www.ruby-lang.org/en/LICENSE.txt)
+$:.unshift File.join(File.dirname(__FILE__), "..", "lib")
+require 'charesc'
+require 'test/unit'
+# Tests to make sure that arbitrary nonexisting constants
+# and disallowed cases are handled correctly
+class TestFail < Test::Unit::TestCase
+  def test_foo
+    assert_raise(NameError) { FOO }      # arbitrary constant
+    assert_raise(NameError) { string = "#{BAR}" } # interpolation
+    assert_raise(NameError) { uABCD }    # lower-case 'u'
+    assert_raise(NameError) { UD800 }    # surrogate block
+    assert_raise(NameError) { UDCBA }    # surrogate block
+    assert_raise(NameError) { UDFFF }    # surrogate block
+    assert_raise(NameError) { UD847uDD9A } # surrogate pair
+    assert_raise(NameError) { U0ABCD }   # leading zero
+    assert_raise(NameError) { U00ABCD }  # leading zeros
+    assert_raise(NameError) { Uabc }     # too short
+    assert_raise(NameError) { Uabcdef }  # too long
+    assert_raise(NameError) { U110000 }  # too high
+    assert_not_equal(?a, U0061)          # we return strings, not characters
+    # with leading correct escape
+    assert_raise(NameError) { UABCDFOO }      # arbitrary constant
+    assert_raise(NameError) { string = "#{UABCDBAR}" } # interpolation
+    assert_raise(NameError) { UABCDuD800 }    # surrogate block
+    assert_raise(NameError) { UABCDuDCBA }    # surrogate block
+    assert_raise(NameError) { UABCDuDFFF }    # surrogate block
+    assert_raise(NameError) { UABCDUD847uDD9A } # surrogate pair
+    assert_raise(NameError) { UABCDu0ABCD }   # leading zero
+    assert_raise(NameError) { UABCDu00ABCD }  # leading zeros
+    assert_raise(NameError) { UABCDuabc }     # too short
+    assert_raise(NameError) { UABCDuabcdef }  # too long
+    assert_raise(NameError) { UABCDu110000 }  # too high
+  end
+end
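The surrogate cases above are rejected because a surrogate code point has no well-formed UTF-8 representation. A small sketch of that point, independent of the gem (the helper name `well_formed_utf8?` is hypothetical):

```ruby
# Hypothetical helper: reports whether a byte sequence is well-formed UTF-8.
def well_formed_utf8?(bytes)
  bytes.dup.force_encoding(Encoding::UTF_8).valid_encoding?
end

puts well_formed_utf8?("\xED\x9F\xBF")   # U+D7FF, last code point before the surrogates
puts well_formed_utf8?("\xED\xA0\x80")   # the bytes U+D800 would map to
puts well_formed_utf8?("\xED\xBF\xBF")   # the bytes U+DFFF would map to
puts well_formed_utf8?("\xEE\x80\x80")   # U+E000, first code point after the surrogates
```

The two middle sequences are reported as ill-formed, which is the outcome the gem avoids by raising `NameError` for such escapes.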

data/test/test_mixed.rb ADDED

@@ -0,0 +1,27 @@
+# testing charesc gem with mixed encodings
+# Copyright 2007 Martin J. Du"rst (duerst@it.aoyama.ac.jp);
+# available under the same licence as Ruby itself
+# (see http://www.ruby-lang.org/en/LICENSE.txt)
+$:.unshift File.join(File.dirname(__FILE__), "..", "lib")
+require 'charesc'
+require 'test/unit'
+class TestMixed < Test::Unit::TestCase
+  def test_mixed
+    assert_equal("Aoyama Gakuin University - \xE9\x9D\x92\xE5\xB1\xB1\xE5\xAD\xA6\xE9\x99\xA2\xE5\xA4\xA7\xE5\xAD\xA6",
+      "Aoyama Gakuin University - #{U9752u5C71u5B66u9662u5927u5B66}")
+    assert_equal("Martin D\xC3\xBCrst", "Martin D#{U00FC}rst")
+    $KCODE = 'Shift_JIS'
+    assert_equal("Aoyama Gakuin University - \x90\xc2\x8e\x52\x8a\x77\x89\x40\x91\xe5\x8a\x77",
+      "Aoyama Gakuin University - #{U9752u5C71u5B66u9662u5927u5B66}")
+    $KCODE = 'EUC-JP'
+    assert_equal("Aoyama Gakuin University - \xC0\xC4\xBB\xB3\xB3\xD8\xB1\xA1\xC2\xE7\xB3\xD8",
+      "Aoyama Gakuin University - #{U9752u5C71u5B66u9662u5927u5B66}")
+    $KCODE = 'UTF-8'
+    assert_equal("Aoyama Gakuin University - \xE9\x9D\x92\xE5\xB1\xB1\xE5\xAD\xA6\xE9\x99\xA2\xE5\xA4\xA7\xE5\xAD\xA6",
+      "Aoyama Gakuin University - #{U9752u5C71u5B66u9662u5927u5B66}")
+    assert_equal("Martin D\xC3\xBCrst", "Martin D#{U00FC}rst")
+  end
+end
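The byte sequences expected in this test can be reproduced on Ruby 1.9 or later, where `$KCODE` no longer has any effect. The sketch below is an assumption about how one might check them today: it uses `String#encode` instead of the gem's NKF-based hook, and the helper name `hex` is hypothetical.

```ruby
# Assumption: Ruby 1.9 or later. String#encode stands in for the
# $KCODE + NKF conversion that the gem performs on Ruby 1.8.
utf8 = "\u9752\u5C71\u5B66\u9662\u5927\u5B66"   # same code points as in the tests

sjis = utf8.encode(Encoding::Shift_JIS)
euc  = utf8.encode(Encoding::EUC_JP)

# Hypothetical helper: formats the bytes of a string as upper-case hex.
def hex(str)
  str.bytes.map { |b| format('%02X', b) }.join(' ')
end

puts hex(utf8)
puts hex(sjis)
puts hex(euc)
```

The three lines printed can be compared with the `\x` escapes in the UTF-8, Shift_JIS and EUC-JP assertions above.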

data/test/test_sjis.rb ADDED

@@ -0,0 +1,32 @@
+# testing charesc gem with Shift_JIS
+# Copyright 2007 Martin J. Du"rst (duerst@it.aoyama.ac.jp);
+# available under the same licence as Ruby itself
+# (see http://www.ruby-lang.org/en/LICENSE.txt)
+$:.unshift File.join(File.dirname(__FILE__), "..", "lib")
+require 'charesc'
+$KCODE = 'Shift_JIS'
+require 'test/unit'
+class TestSJIS < Test::Unit::TestCase
+  def test_sjis
+    assert_equal('Yukihiro Matsumoto - 松本行弘',
+      "Yukihiro Matsumoto - #{U677Eu672Cu884Cu5F18}")
+    assert_equal('Matz - まつもと ゆきひろ',
+      "Matz - #{U307Eu3064u3082u3068} #{U3086u304Du3072u308D}")
+    assert_equal("Aoyama Gakuin University - \x90\xc2\x8e\x52\x8a\x77\x89\x40\x91\xe5\x8a\x77",
+      "Aoyama Gakuin University - #{U9752u5C71u5B66u9662u5927u5B66}")
+    assert_equal('Aoyama Gakuin University - 青山学院大学',
+      "Aoyama Gakuin University - #{U9752u5C71u5B66u9662u5927u5B66}")
+  end
+  def test_mime
+    # make sure MIME is not decoded
+    # MIME header encoding of 青山: =?ISO-2022-JP?B?GyRCQEQ7MxsoQg==?=
+    assert_equal('=?ISO-2022-JP?B?GyRCQEQ7MxsoQg==?=',
+      U003Du003Fu0049u0053u004Fu002Du0032u0030u0032u0032u002Du004Au0050u003Fu0042u003Fu0047u0079u0052u0043u0051u0045u0051u0037u004Du0078u0073u006Fu0051u0067u003Du003Du003Fu003D)
+  end
+end

data/test/test_utf8.rb ADDED

@@ -0,0 +1,126 @@
+# testing charesc gem with UTF-8
+# Copyright 2007 Martin J. Du"rst (duerst@it.aoyama.ac.jp);
+# available under the same licence as Ruby itself
+# (see http://www.ruby-lang.org/en/LICENSE.txt)
+$:.unshift File.join(File.dirname(__FILE__), "..", "lib")
+require 'charesc'
+require 'test/unit'
+class TestUTF8 < Test::Unit::TestCase
+  def test_utf8
+    assert_equal('Yukihiro Matsumoto - 松本行弘',
+      "Yukihiro Matsumoto - #{U677Eu672Cu884Cu5F18}")
+    assert_equal('Matz - まつもと ゆきひろ',
+      "Matz - #{U307Eu3064u3082u3068} #{U3086u304Du3072u308D}")
+    assert_equal("Aoyama Gakuin University - \xE9\x9D\x92\xE5\xB1\xB1\xE5\xAD\xA6\xE9\x99\xA2\xE5\xA4\xA7\xE5\xAD\xA6",
+      "Aoyama Gakuin University - #{U9752u5C71u5B66u9662u5927u5B66}")
+    assert_equal('Aoyama Gakuin University - 青山学院大学',
+      "Aoyama Gakuin University - #{U9752u5C71u5B66u9662u5927u5B66}")
+    assert_equal('青山学院大学', U9752u5C71u5B66u9662u5927u5B66)
+    assert_equal("Martin D\xC3\xBCrst", "Martin D#{U00FC}rst")
+    assert_equal('Martin Dürst', "Martin D#{U00FC}rst")
+    assert_equal('ü', U00FC)
+  end
+  def test_syntax_variants
+    # upper/lower case variants
+    assert_equal('松本行弘', U677Eu672Cu884Cu5F18)
+    assert_equal('松本行弘', U677EU672CU884CU5F18)
+    assert_equal('松本行弘', U677eu672cu884cu5f18)
+    assert_equal('松本行弘', U677eU672cU884cU5f18)
+    # all hex digits
+    assert_equal("\xC4\xA3\xE4\x95\xA7\xE8\xA6\xAB\xEC\xB7\xAF", U0123u4567u89ABuCDEF)
+    assert_equal("\xC4\xA3\xE4\x95\xA7\xE8\xA6\xAB\xEC\xB7\xAF", U0123U4567U89ABUCDEF)
+    assert_equal("\xC4\xA3\xE4\x95\xA7\xE8\xA6\xAB\xEC\xB7\xAF", U0123u4567u89abucdef)
+    assert_equal("\xC4\xA3\xE4\x95\xA7\xE8\xA6\xAB\xEC\xB7\xAF", U0123U4567U89abUcdef)
+    assert_equal("\xC4\xA3\xE4\x95\xA7\xE8\xA6\xAB\xEC\xB7\xAF", U0123u4567u89aBuCdEf)
+    assert_equal("\xC4\xA3\xE4\x95\xA7\xE8\xA6\xAB\xEC\xB7\xAF", U0123u4567u89aBUcDEF)
+  end
+  def test_fulton
+    # examples from Hal Fulton's book (second edition), chapter 4
+    # precomposed e'pe'e
+    assert_equal('épée', U00E9u0070u00E9u0065)
+    assert_equal('épée', "#{U00E9u0070u00E9u0065}")
+    assert_equal('épée', "#{U00E9}p#{U00E9}e")
+    assert_equal("\xC3\xA9\x70\xC3\xA9\x65", U00E9u0070u00E9u0065)
+    assert_equal("\xC3\xA9\x70\xC3\xA9\x65", "#{U00E9u0070u00E9u0065}")
+    assert_equal("\xC3\xA9\x70\xC3\xA9\x65", "#{U00E9}p#{U00E9}e")
+    # decomposed e'pe'e
+    assert_equal('épée', U0065u0301u0070u0065u0301u0065)
+    assert_equal('épée', "#{U0065u0301u0070u0065u0301u0065}")
+    assert_equal('épée', "e#{U0301}pe#{U0301}e")
+    assert_equal("\x65\xCC\x81\x70\x65\xCC\x81\x65", U0065u0301u0070u0065u0301u0065)
+    assert_equal("\x65\xCC\x81\x70\x65\xCC\x81\x65", "#{U0065u0301u0070u0065u0301u0065}")
+    assert_equal("\x65\xCC\x81\x70\x65\xCC\x81\x65", "e#{U0301}pe#{U0301}e")
+    # combinations of NFC/D, NFKC/D
+    assert_equal('öffnen', U00F6u0066u0066u006Eu0065u006E)
+    assert_equal("\xC3\xB6ffnen", U00F6u0066u0066u006Eu0065u006E)
+    assert_equal('öffnen', "#{U00F6}ffnen")
+    assert_equal("\xC3\xB6ffnen", "#{U00F6}ffnen")
+    assert_equal('öffnen', U006Fu0308u0066u0066u006Eu0065u006E)
+    assert_equal("\x6F\xCC\x88ffnen", U006Fu0308u0066u0066u006Eu0065u006E)
+    assert_equal('öffnen', "o#{U0308}ffnen")
+    assert_equal("\x6F\xCC\x88ffnen", "o#{U0308}ffnen")
+    assert_equal('öﬀnen', U00F6uFB00u006Eu0065u006E)
+    assert_equal("\xC3\xB6\xEF\xAC\x80nen", U00F6uFB00u006Eu0065u006E)
+    assert_equal('öﬀnen', "#{U00F6uFB00}nen")
+    assert_equal("\xC3\xB6\xEF\xAC\x80nen", "#{U00F6uFB00}nen")
+    assert_equal('öﬀnen', U006Fu0308uFB00u006Eu0065u006E)
+    assert_equal("\x6F\xCC\x88\xEF\xAC\x80nen", U006Fu0308uFB00u006Eu0065u006E)
+    assert_equal('öﬀnen', "o#{U0308uFB00}nen")
+    assert_equal("\x6F\xCC\x88\xEF\xAC\x80nen", "o#{U0308uFB00}nen")
+    # German sharp s (sz)
+    assert_equal('Straße', U0053u0074u0072u0061u00DFu0065)
+    assert_equal("\x53\x74\x72\x61\xC3\x9F\x65", U0053u0074u0072u0061u00DFu0065)
+    assert_equal('Straße', "Stra#{U00DF}e")
+    assert_equal("\x53\x74\x72\x61\xC3\x9F\x65", "Stra#{U00DF}e")
+  end
+  def test_edge_cases
+    # start and end of each outer plane
+    assert_equal("\xF4\x8F\xBF\xBF", U10FFFF)
+    assert_equal("\xF4\x80\x80\x80", U100000)
+    assert_equal("\xF3\xBF\xBF\xBF", UFFFFF)
+    assert_equal("\xF3\xB0\x80\x80", UF0000)
+    assert_equal("\xF3\xAF\xBF\xBF", UEFFFF)
+    assert_equal("\xF3\xA0\x80\x80", UE0000)
+    assert_equal("\xF3\x9F\xBF\xBF", UDFFFF)
+    assert_equal("\xF3\x90\x80\x80", UD0000)
+    assert_equal("\xF3\x8F\xBF\xBF", UCFFFF)
+    assert_equal("\xF3\x80\x80\x80", UC0000)
+    assert_equal("\xF2\xBF\xBF\xBF", UBFFFF)
+    assert_equal("\xF2\xB0\x80\x80", UB0000)
+    assert_equal("\xF2\xAF\xBF\xBF", UAFFFF)
+    assert_equal("\xF2\xA0\x80\x80", UA0000)
+    assert_equal("\xF2\x9F\xBF\xBF", U9FFFF)
+    assert_equal("\xF2\x90\x80\x80", U90000)
+    assert_equal("\xF2\x8F\xBF\xBF", U8FFFF)
+    assert_equal("\xF2\x80\x80\x80", U80000)
+    assert_equal("\xF1\xBF\xBF\xBF", U7FFFF)
+    assert_equal("\xF1\xB0\x80\x80", U70000)
+    assert_equal("\xF1\xAF\xBF\xBF", U6FFFF)
+    assert_equal("\xF1\xA0\x80\x80", U60000)
+    assert_equal("\xF1\x9F\xBF\xBF", U5FFFF)
+    assert_equal("\xF1\x90\x80\x80", U50000)
+    assert_equal("\xF1\x8F\xBF\xBF", U4FFFF)
+    assert_equal("\xF1\x80\x80\x80", U40000)
+    assert_equal("\xF0\xBF\xBF\xBF", U3FFFF)
+    assert_equal("\xF0\xB0\x80\x80", U30000)
+    assert_equal("\xF0\xAF\xBF\xBF", U2FFFF)
+    assert_equal("\xF0\xA0\x80\x80", U20000)
+    assert_equal("\xF0\x9F\xBF\xBF", U1FFFF)
+    assert_equal("\xF0\x90\x80\x80", U10000)
+    # BMP
+    assert_equal("\xEF\xBF\xBF", UFFFF)
+    assert_equal("\xEE\x80\x80", UE000)
+    assert_equal("\xED\x9F\xBF", UD7FF)
+    assert_equal("\xE0\xA0\x80", U0800)
+    assert_equal("\xDF\xBF", U07FF)
+    assert_equal("\xC2\x80", U0080)
+    assert_equal("\x7F", U007F)
+    assert_equal("\x00", U0000)
+  end
+end
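The edge cases in `test_edge_cases` sit at the points where the UTF-8 encoding changes length. A short sketch that prints those boundaries, using only `Array#pack` (the helper name `utf8_hex` is hypothetical and not part of the gem):

```ruby
# Hypothetical helper: UTF-8 bytes of one code point, as upper-case hex.
def utf8_hex(code_point)
  [code_point].pack('U').bytes.map { |b| format('%02X', b) }.join(' ')
end

# Last code point of each UTF-8 length, followed by the first of the next.
[0x007F, 0x0080, 0x07FF, 0x0800, 0xFFFF, 0x10000, 0x10FFFF].each do |cp|
  puts format('U+%04X -> %s', cp, utf8_hex(cp))
end
```

Each printed byte sequence corresponds to one of the `\x` literals asserted in the test.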

metadata ADDED

@@ -0,0 +1,56 @@
+--- !ruby/object:Gem::Specification
+rubygems_version: 0.9.2
+specification_version: 1
+name: charesc
+version: !ruby/object:Gem::Version
+  version: 0.1.0
+date: 2007-06-05 00:00:00 +09:00
+summary: Unicode based character escapes for Ruby; works with UTF-8 as well as Shift_JIS and EUC-JP (within the limits of these encodings)
+require_paths:
+- lib
+email: duerst@it.aoyama.ac.jp
+homepage:
+rubyforge_project:
+description:
+autorequire: charesc
+default_executable:
+bindir: bin
+has_rdoc: true
+required_ruby_version: !ruby/object:Gem::Version::Requirement
+  requirements:
+  - - ">"
+    - !ruby/object:Gem::Version
+      version: 0.0.0
+  version:
+platform: ruby
+signing_key:
+cert_chain:
+post_install_message:
+authors:
+- Martin J. Du"rst
+files:
+- lib/charesc.rb
+- test/test_euc.rb
+- test/test_fail.rb
+- test/test_mixed.rb
+- test/test_sjis.rb
+- test/test_utf8.rb
+- README
+test_files:
+- test/test_utf8.rb
+- test/test_sjis.rb
+- test/test_euc.rb
+- test/test_fail.rb
+- test/test_mixed.rb
+rdoc_options: []
+extra_rdoc_files:
+- README
+executables: []
+extensions: []
+requirements: []
+dependencies: []