utf8_utils 0.0.1 → 1.0.0 (RubyGems)


Files changed (9)

data/README.md CHANGED

@@ -11,7 +11,7 @@ issues with it, I'll probably try patching it into ActiveSupport.
 Here's what happens when you try to access a string with invalid UTF-8 characters in Ruby 1.9:
-    ruby-1.9.1-p378 > "my messed up \x92 string".split(//)
+    ruby-1.9.1-p378 > "my messed up \x92 string".split(//u)
     ArgumentError: invalid byte sequence in UTF-8
             from (irb):3:in `split'
             from (irb):3
@@ -19,7 +19,7 @@ Here's what happens when you try to access a string with invalid UTF-8 character
 ## The Solution
-    ruby-1.9.1-p378 > "my messed up \x92 string".to_utf8_codepoints.tidy_bytes.to_s.split(//u)
+    ruby-1.9.1-p378 > "my messed up \x92 string".to_utf8_chars.tidy_bytes.to_s.split(//u)
      => ["m", "y", " ", "m", "e", "s", "s", "e", "d", " ", "u", "p", " ", "’", " ", "s", "t", "r", "i", "n", "g"]
 Amazing in its brevity and elegance, huh? Ok, maybe not really but if you have
@@ -30,6 +30,25 @@ Note that like ActiveSupport, it naively assumes if you have invalid UTF8
 characters, they are either Windows CP1251 or ISO8859-1. In practice this isn't
 a bad assumption, but may not always work.
+Unlike ActiveSupport, however, the performance of this library is **very** poor
+right now.  Since my intention is for this to be used mostly for very short
+strings, it should, however, be good enough for many kinds of applications.
+How poor is "very poor?" Have a look:
+                               | ACTIVE_SUPPORT | UTF8_UTILS |
+    ----------------------------------------------------------
+    tidy bytes           x2000 |          0.087 |      1.225 |
+    ==========================================================
+    Total                      |          0.087 |      1.225 |
+This will improve quite a bit soon, as I'm pretty well aware of where the
+slowness is coming from. If performance is important for you now though, by all
+means use another library (if you can find one) until I've made a few more
+releases.
 ## Getting it
     gem install utf8_utils
@@ -37,15 +56,16 @@ a bad assumption, but may not always work.
 ## Using it
+    # encoding: utf-8
     require "utf8_utils"
-    # Traverse codepoints
-    "hello-world".to_utf8_codepoints.each_codepoint do |codepoint|
-        puts codepoint.valid?
+    # Iterate over multibyte characters
+    "hello ーチエンジンの日本".to_utf8_chars.each_char do |char|
+        puts char.valid?
      end
      # tidy bytes
-     good_string = bad_string.to_utf8_codepoints.tidy_bytes.to_s
+     good_string = bad_string.to_utf8_chars.tidy_bytes.to_s
 ## API Docs
@@ -53,6 +73,8 @@ a bad assumption, but may not always work.
 ## Credits
-Created by Norman Clarke, with some code <strike>stolen</strike> borrowed from ActiveRecord.
+Created by Norman Clarke. Some code was taken from
+[ActiveRecord](http://github.com/rails/rails/tree/master/activesupport/), as
+indicated in the source code.
 Copyright (c) 2010, released under the MIT license.
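
For readers without the gem installed, the failure the README describes can be reproduced with Ruby's standard library alone. The sketch below is not the gem's implementation: `cp1252_to_utf8` is a hypothetical helper that reinterprets *every* byte as Windows-1252, whereas the gem only rewrites bytes that are invalid as UTF-8. It uses Windows-1252 rather than CP1251 because the fallback table in this diff maps 128 to the bytes of "€" (U+20AC), which is the Windows-1252 assignment.

```ruby
# Hypothetical helper, not part of utf8_utils: reinterpret every byte of the
# input as Windows-1252 and transcode the result to UTF-8.
def cp1252_to_utf8(raw)
  raw.dup.force_encoding("Windows-1252").encode("UTF-8")
end

raw = "my messed up \x92 string"
puts raw.valid_encoding?                 # false: 0x92 cannot start a UTF-8 character
puts cp1252_to_utf8(raw).valid_encoding? # true
puts cp1252_to_utf8(raw)                 # my messed up ’ string
```

Because the helper reinterprets all bytes, it would mangle input that already contains valid multibyte UTF-8, and it may raise for bytes that Windows-1252 leaves unassigned. It is only meant to show where the "’" in the README output comes from.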

data/Rakefile CHANGED

@@ -9,6 +9,15 @@ CLEAN << "pkg" << "doc" << "coverage" << ".yardoc"
 Rake::GemPackageTask.new(eval(File.read("utf8_utils.gemspec"))) { |pkg| }
 Rake::TestTask.new(:test) { |t| t.pattern = "test/**/*_test.rb" }
+begin
+  require "yard"
+  YARD::Rake::YardocTask.new do |t|
+    t.options = ["--output-dir=doc"]
+    t.options << "--files" << "README.md"
+  end
+rescue LoadError
+end
 Rake::RDocTask.new do |r|
   r.rdoc_dir = "doc"
   r.rdoc_files.include "lib/**/*.rb"

data/lib/utf8_utils.rb CHANGED

@@ -1,156 +1,53 @@
-# Wraps a string as an array of bytes and allows some naive cleanup operations as a workaround
-# for Ruby 1.9's crappy encoding support that throws exceptions when attempting to access
-# UTF8 strings with invalid characters.
-module UTF8Utils
-  class Codepoints
-    attr_accessor :chars
-    attr :position
-    include Enumerable
-    CP1251 = {
-      128 => [226, 130, 172],
-      129 => nil,
-      130 => [226, 128, 154],
-      131 => [198, 146],
-      132 => [226, 128, 158],
-      133 => [226, 128, 166],
-      134 => [226, 128, 160],
-      135 => [226, 128, 161],
-      136 => [203, 134],
-      137 => [226, 128, 176],
-      138 => [197, 160],
-      139 => [226, 128, 185],
-      140 => [197, 146],
-      141 => nil,
-      142 => [197, 189],
-      143 => nil,
-      144 => nil,
-      145 => [226, 128, 152],
-      146 => [226, 128, 153],
-      147 => [226, 128, 156],
-      148 => [226, 128, 157],
-      149 => [226, 128, 162],
-      150 => [226, 128, 147],
-      151 => [226, 128, 148],
-      152 => [203, 156],
-      153 => [226, 132, 162],
-      154 => [197, 161],
-      155 => [226, 128, 186],
-      156 => [197, 147],
-      157 => nil,
-      158 => [197, 190],
-      159 => [197, 184]
-    }
-    def initialize(string)
-      @position = 0
-      # 1.8.6's `each_byte` does not return an Enumerable
-      if RUBY_VERSION < "1.8.7"
-        @chars = []
-        string.each_byte { |b| @chars << b }
-      else
-        # Create an array of bytes without raising an ArgumentError in 1.9.x
-        # when the string contains invalid UTF-8 characters
-        @chars = string.each_byte.entries
-      end
-    end
-    # Attempt to clean up malformed characters.
-    def tidy_bytes
-      Codepoints.new(entries.map {|c| c.tidy.to_char}.compact.join)
-    end
-    # Cast to string.
-    def to_s
-      entries.map {|e| e.to_char}.join
-    end
-    private
-    def each(&block)
-      while codepoint = next_codepoint
-        yield codepoint
-      end
-      @position = 0
-    end
-    alias :each_codepoint :each
-    public :each_codepoint
+require File.expand_path("../utf8_utils/byte",  __FILE__)
+require File.expand_path("../utf8_utils/char",  __FILE__)
+require File.expand_path("../utf8_utils/chars", __FILE__)
-    def bytes_to_pull
-      case chars[position]
-      when 0..127 then 1
-      when 128..223 then 2
-      when 224..239 then 3
-      else 4
-      end
-    end
-    def next_codepoint
-      codepoint = Codepoint.new(chars.slice(position, bytes_to_pull))
-      if codepoint.invalid?
-        codepoint = Codepoint.new(chars.slice(position, 1))
-      end
-      @position = position + codepoint.size
-      codepoint unless codepoint.empty?
-    end
-  end
-  class Codepoint < Array
-    # Borrowed from the regexp in ActiveSupport, which in turn had been borrowed from
-    # the Kconv library by Shinji KONO - (also as seen on the W3C site).
-    # See also http://en.wikipedia.org/wiki/UTF-8
-    def valid?
-     if length == 1
-       (0..127) === self[0]
-     elsif length == 2
-       (192..223) === self[0] &&  (128..191) === self[1]
-     elsif length == 3
-       (self[0] == 224 && ((160..191) === self[1] && (128..191) === self[2])) ||
-       ((225..239) === self[0] && (128..191) === self[1] && (128..191) === self[2])
-     elsif length == 4
-       (self[0] == 240 && (144..191) === self[1] && (128..191) === self[2] && (128..191) === self[3]) ||
-       ((241..243) === self[0] && (128..191) === self[1] && (128..191) === self[2] && (128..191) === self[3]) ||
-       (self[0] == 244 && (128..143) === self[1] && (128..191) === self[2] && (128..191) === self[3])
-     end
-    end
-    # Attempt to rescue a valid UTF-8 character from a malformed codepoint. It will first
-    # attempt to convert from CP1251, and if this isn't possible, it prepends a valid leading
-    # byte, treating the character as the last byte in a two-byte codepoint.
-    # Note that much of the logic here is taken from ActiveSupport; the difference is that this
-    # works for Ruby 1.8.6 - 1.9.1.
-    def tidy
-      return self if valid?
-      if Codepoints::CP1251.key? self[0]
-        self.class.new [Codepoints::CP1251[self[0]]]
-      elsif self[0] < 192
-        self.class.new [194, self[0]]
-      else
-        self.class.new [195, self[0] - 64]
-      end
-    end
-    def invalid?
-      !valid?
-    end
+# Wraps a string as an array of bytes and allows some naive cleanup operations
+# as a workaround for Ruby 1.9's crappy encoding support that throws exceptions
+# when attempting to access UTF8 strings with invalid characters.
+module UTF8Utils
-    # Get a character from the bytes.
-    def to_char
-      flatten.pack("C*").unpack("U*").pack("U*")
-    end
+  # CP1251 decimal byte => UTF-8 approximation as an array of bytes
+  CP1251 = {
+    128 => [226, 130, 172],
+    129 => nil,
+    130 => [226, 128, 154],
+    131 => [198, 146],
+    132 => [226, 128, 158],
+    133 => [226, 128, 166],
+    134 => [226, 128, 160],
+    135 => [226, 128, 161],
+    136 => [203, 134],
+    137 => [226, 128, 176],
+    138 => [197, 160],
+    139 => [226, 128, 185],
+    140 => [197, 146],
+    141 => nil,
+    142 => [197, 189],
+    143 => nil,
+    144 => nil,
+    145 => [226, 128, 152],
+    146 => [226, 128, 153],
+    147 => [226, 128, 156],
+    148 => [226, 128, 157],
+    149 => [226, 128, 162],
+    150 => [226, 128, 147],
+    151 => [226, 128, 148],
+    152 => [203, 156],
+    153 => [226, 132, 162],
+    154 => [197, 161],
+    155 => [226, 128, 186],
+    156 => [197, 147],
+    157 => nil,
+    158 => [197, 190],
+    159 => [197, 184]
+  }
-  end
 end
-# Get an array of UTF8 codepoints from a string.
+# Get an array of UTF8 charsfrom a string.
 class String
-  def to_utf8_codepoints
-    UTF8Utils::Codepoints.new self
+  def to_utf8_chars
+    UTF8Utils::Chars.new self
   end
-end
+end

data/lib/utf8_utils/byte.rb ADDED

@@ -0,0 +1,86 @@
+module UTF8Utils
+  # A single UTF-8 byte.
+  class Byte
+    attr_reader :byte
+    def initialize(byte)
+      @byte = byte
+    end
+    def codepoint_mask
+      case leading_1_bits
+      when 0 then 0
+      when 1 then 0b1000_0000
+      when 2 then 0b1100_0000
+      when 3 then 0b1110_0000
+      when 4 then 0b1111_0000
+      end
+    end
+    # Is this a continuation byte?
+    def continuation?
+      leading_1_bits == 1
+    end
+    # How many continuation bytes should follow this byte?
+    def continuations
+      bits = leading_1_bits
+      bits < 2 ? 0 : bits - 1
+    end
+    def invalid?
+      !valid?
+    end
+    # From Wikipedia's entry on UTF-8:
+    #
+    # The UTF-8 encoding is variable-width, with each character represented by 1
+    # to 4 bytes. Each byte has 0–4 leading consecutive 1 bits followed by a zero bit
+    # to indicate its type. N 1 bits indicates the first byte in a N-byte sequence,
+    # with the exception that zero 1 bits indicates a one-byte sequence while one 1
+    # bit indicates a continuation byte in a multi-byte sequence (this was done for
+    # ASCII compatibility).
+    # @see http://en.wikipedia.org/wiki/Utf-8
+    def leading_1_bits
+      nibble = byte >> 4
+      if    nibble < 0b1000 then 0 # single-byte chars
+      elsif nibble < 0b1100 then 1 # continuation byte
+      elsif nibble < 0b1110 then 2 # start of 2-byte char
+      elsif nibble < 0b1111 then 3 # 3-byte char
+      else                       4 # 4-byte char
+      end
+    end
+    # Start of a 2-byte sequence, but code point ≤ 127
+    # @see http://tools.ietf.org/html/rfc3629
+    def overlong?
+      (192..193) === byte
+    end
+    # RFC 3629 reserves 245-253 for the leading bytes of 4-6 byte sequences.
+    # @see http://tools.ietf.org/html/rfc3629
+    def restricted?
+      (245..253) === byte
+    end
+    def to_i
+      byte
+    end
+    # Bytes 254 and  255 are not defined by the original UTF-8 spec.
+    def undefined?
+      (254..255) === byte
+    end
+    def valid?
+      !(overlong? or restricted? or undefined?)
+    end
+    def codepoint_bits
+      byte ^ codepoint_mask
+    end
+  end
+end
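
The bit classification in `byte.rb` can be tried in isolation. The following is a standalone sketch rather than the gem's class: `leading_1_bits` mirrors the nibble comparison shown above, while `MASKS` and `payload_bits` are names introduced here for illustration (they play the role of `codepoint_mask` and `codepoint_bits`).

```ruby
# Standalone sketch (not the gem's Byte class): count the leading 1 bits of a
# byte by looking only at its high nibble, as the diff above does.
def leading_1_bits(byte)
  nibble = byte >> 4
  if    nibble < 0b1000 then 0 # 0xxx xxxx: single-byte character
  elsif nibble < 0b1100 then 1 # 10xx xxxx: continuation byte
  elsif nibble < 0b1110 then 2 # 110x xxxx: starts a 2-byte character
  elsif nibble < 0b1111 then 3 # 1110 xxxx: starts a 3-byte character
  else                       4 # 1111 xxxx: starts a 4-byte character
  end
end

# Hypothetical constant: the marker bits to strip for each bit count.
MASKS = [0, 0b1000_0000, 0b1100_0000, 0b1110_0000, 0b1111_0000].freeze

# XOR with the mask clears the marker bits and leaves the payload bits.
def payload_bits(byte)
  byte ^ MASKS[leading_1_bits(byte)]
end

# "ó" is encoded as 0xC3 0xB3; recombining the payloads gives U+00F3.
lead, continuation = 0xC3, 0xB3
codepoint = (payload_bits(lead) << 6) | payload_bits(continuation)
puts format("U+%04X", codepoint) # U+00F3
```

The five sample bytes used in `test_leading_1_bits` (0, 128, 194, 224, 240) fall into the five branches in order, which is why that test can compare each result against its array index.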

data/lib/utf8_utils/char.rb ADDED

@@ -0,0 +1,52 @@
+module UTF8Utils
+  class Char < Array
+    # Given the first byte, how many bytes long should this character be?
+    def expected_length
+      (first.continuations rescue 0) + 1
+    end
+    # Is the character invalid?
+    def invalid?
+      !valid?
+    end
+    # Attempt to rescue a valid UTF-8 character from a malformed character. It
+    # will first attempt to convert from CP1251, and if this isn't possible, it
+    # prepends a valid leading byte, treating the character as the last byte in
+    # a two-byte character.  Note that much of the logic here is taken from
+    # ActiveSupport; the difference is that this works for Ruby 1.8.6 - 1.9.1.
+    def tidy
+      return self if valid?
+      byte = first.to_i
+      if UTF8Utils::CP1251.key? byte
+        self.class.new [UTF8Utils::CP1251[byte]]
+      elsif byte < 192
+        self.class.new [194, byte]
+      else
+        self.class.new [195, byte - 64]
+      end
+    end
+    # Get a multibyte character from the bytes.
+    def to_s
+      flatten.map {|b| b.to_i }.pack("C*").unpack("U*").pack("U*")
+    end
+    def to_codepoint
+      flatten.map {|b| b.to_i }.pack("C*").unpack("U*")[0]
+    end
+    def valid?
+      return false if length != expected_length
+      each_with_index do |byte, index|
+        return false if byte.invalid?
+        return false if index == 0 and byte.continuation?
+        return false if index > 0 and !byte.continuation?
+      end
+      true
+    end
+  end
+end
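
The three branches of `tidy` can be exercised on single bytes without the surrounding classes. This is a minimal sketch under stated assumptions: `FALLBACK` holds only two entries copied from the table in `lib/utf8_utils.rb`, and `tidy_byte` is a hypothetical helper, not a method of the gem. The last two branches work because a Latin-1 byte in 0x80..0xBF maps to the UTF-8 pair `C2 xx`, and a byte in 0xC0..0xFF maps to `C3 (xx - 0x40)`.

```ruby
# Two entries copied from the table in the diff; the real table covers 128..159.
FALLBACK = { 146 => [226, 128, 153], 147 => [226, 128, 156] }.freeze

# Hypothetical helper: turn one stray byte into a UTF-8 string using the same
# three branches as Char#tidy.
def tidy_byte(byte)
  bytes = if FALLBACK.key?(byte)
            FALLBACK[byte]    # table lookup for the 128..159 range
          elsif byte < 192
            [194, byte]       # Latin-1 0x80..0xBF becomes C2 xx
          else
            [195, byte - 64]  # Latin-1 0xC0..0xFF becomes C3 (xx - 0x40)
          end
  # Same round trip the diff uses to build a UTF-8 string from raw bytes.
  bytes.pack("C*").unpack("U*").pack("U*")
end

puts tidy_byte(0x92) # ’
puts tidy_byte(0xF3) # ó
puts tidy_byte(0xED) # í
```

The last two calls correspond to the "Sim\xF3n Bol\xEDvar" assertion added to the test file in this release.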

data/lib/utf8_utils/chars.rb ADDED

@@ -0,0 +1,59 @@
+module UTF8Utils
+  class Chars
+    attr :bytes
+    attr :position
+    include Enumerable
+    def initialize(string)
+      @position = 0
+      begin
+        # Create an array of bytes without raising an ArgumentError in 1.9.x
+        # when the string contains invalid UTF-8 characters
+        @bytes = string.each_byte.map {|b| Byte.new(b)}
+      rescue LocalJumpError
+        # 1.8.6's `each_byte` does not return an Enumerable
+        @bytes = []
+        string.each_byte { |b| @bytes << Byte.new(b) }
+      end
+    end
+    # Attempt to clean up malformed characters.
+    def tidy_bytes
+      Chars.new(entries.map {|c| c.tidy.to_s}.compact.join)
+    end
+    # Cast to string.
+    def to_s
+      entries.flatten.map {|b| b.to_i }.pack("C*").unpack("U*").pack("U*")
+    end
+    def first
+      entries.first
+    end
+    private
+    def each(&block)
+      while char = next_char
+        yield char
+      end
+      @position = 0
+    end
+    alias :each_char :each
+    public :each_char
+    def next_char
+      return if !bytes[position]
+      char = Char.new(bytes.slice(position, bytes[position].continuations + 1))
+      if char.invalid?
+        char = Char.new(bytes.slice(position, 1))
+      end
+      @position = position + char.size
+      char unless char.empty?
+    end
+  end
+end
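
The walk performed by `next_char` is: read the lead byte, slice as many bytes as it announces, and fall back to a single byte if that slice is not well formed. A simplified sketch on plain integer arrays follows. All three method names are introduced here for illustration, and the validity check deliberately omits the overlong, restricted and undefined byte checks that `Byte#valid?` performs.

```ruby
# How many continuation bytes does this lead byte announce?
def continuations(byte)
  case byte
  when 0xC0..0xDF then 1
  when 0xE0..0xEF then 2
  when 0xF0..0xFF then 3
  else 0
  end
end

# Simplified check: right length, a non-continuation lead byte, and nothing
# but continuation bytes after it.
def well_formed?(group)
  return false unless group.length == continuations(group[0]) + 1
  return false if (0x80..0xBF).cover?(group[0])
  group.drop(1).all? { |b| (0x80..0xBF).cover?(b) }
end

# Group a byte array into characters, isolating bad bytes one at a time.
def split_chars(bytes)
  groups = []
  pos = 0
  while pos < bytes.length
    group = bytes[pos, continuations(bytes[pos]) + 1]
    group = bytes[pos, 1] unless well_formed?(group)
    groups << group
    pos += group.length
  end
  groups
end

p split_chars("a\x92a".bytes) # [[97], [146], [97]]
```

Isolating a bad byte as its own one-element group is what lets a later tidy step repair it without disturbing the valid characters on either side.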

data/lib/utf8_utils/version.rb CHANGED

@@ -1,8 +1,8 @@
 module UTF8Utils
   module Version
-    MAJOR = 0
+    MAJOR = 1
     MINOR = 0
-    TINY  = 1
+    TINY  = 0
     STRING = [MAJOR, MINOR, TINY].join('.')
   end
-end
+end

data/test/utf8_utils_test.rb CHANGED

@@ -1,49 +1,70 @@
 # encoding: utf-8
+require "rubygems"
 require "test/unit"
-require File.join(File.dirname(__FILE__), "..", "lib", "utf8_utils")
+require "mocha"
+require File.expand_path("../../lib/utf8_utils", __FILE__)
-class UTF8CodepointsTest < Test::Unit::TestCase
+module UTF8ByteTest
-  def test_should_pull_one_byte_for_ascii_char
-    assert_equal 1, "a".to_utf8_codepoints.entries[0].length
+  def test_leading_1_bits
+    [0, 128, 194, 224, 240].each_with_index do |n, i|
+      byte = UTF8Utils::Byte.new(n)
+      assert_equal i, byte.leading_1_bits
+    end
   end
-  def test_should_pull_two_bytes_for_latin_char_with_diacritics
-    assert_equal 2, "¡".to_utf8_codepoints.entries[0].length
+  def test_invalid_bytes
+    [192, 193, 245, 255].each do |n|
+      assert !UTF8Utils::Byte.new(n).valid?
+    end
   end
-  def test_should_pull_three_bytes_for_basic_multilingual_char
-    assert_equal 3, "आ".to_utf8_codepoints.entries[0].length
+  def test_continuation
+    assert UTF8Utils::Byte.new(130).continuation?
   end
-  def test_should_pull_four_bytes_for_other_chars
-    u = UTF8Utils::Codepoints.new("")
-    # Editors tend to freak out with chars in this plane, so just stub the
-    # chars field instead. This char is U+10405, DESERET CAPITAL LETTER LONG OO.
-    u.chars = [240, 144, 144, 132]
-    assert_equal 4, u.entries[0].length
+end
+class UTF8UtilsTest < Test::Unit::TestCase
+  include UTF8ByteTest
+  def test_entries_should_be_one_byte_for_ascii_char
+    assert_equal 1, "a".to_utf8_chars.first.length
+  end
+  def test_entries_should_be_two_bytes_for_latin_char_with_diacritics
+    assert_equal 2, "¡".to_utf8_chars.first.length
   end
-  def test_should_detect_valid_codepoints
-    "cañón आ".to_utf8_codepoints.each_codepoint {|c| assert c.valid? }
+  def test_entries_should_be_three_bytes_for_basic_multilingual_char
+    assert_equal 3, "आ".to_utf8_chars.first.length
   end
-  def test_should_detect_invalid_codepoints
-    "\x92".to_utf8_codepoints.each_codepoint {|c| assert c.invalid? }
+  def test_entries_should_be_four_bytes_for_other_chars
+    u = UTF8Utils::Chars.new("")
+    # Editors tend to freak out with chars in this plane, so just stub the
+    # chars field instead. This char is U+10404, DESERET CAPITAL LETTER LONG O.
+    u.stubs(:bytes).returns([240, 144, 144, 132].map { |b| UTF8Utils::Byte.new(b)})
+    assert_equal 4, u.first.length
   end
-  def test_should_split_correctly_with_invalid_codepoints
-    assert_equal 3, "a\x92a".to_utf8_codepoints.entries.length
+  def test_should_detect_valid_chars
+    "cañón आ".to_utf8_chars.each_char {|c| assert c.valid? }
   end
-  def test_should_tidy_bytes
-    assert_equal "a’a", "a\x92a".to_utf8_codepoints.tidy_bytes.to_s
+  def test_should_detect_invalid_chars
+    "\x92".to_utf8_chars.each_char {|c| assert c.invalid? }
   end
-  def test_should_not_screw_up_valid_strings
-    s = File.read(__FILE__)
-    assert_equal s.to_s, s.to_utf8_codepoints.tidy_bytes.to_s
+  def test_should_split_correctly_with_invalid_chars
+    assert_equal 3, "a\x92a".to_utf8_chars.entries.length
+  end
+  def test_should_tidy_bytes
+    assert_equal "a’a", "a\x92a".to_utf8_chars.tidy_bytes.to_s
+    assert_equal "Simón Bolívar", "Sim\xF3n Bol\xEDvar".to_utf8_chars.tidy_bytes.to_s
   end
 end
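
The four-byte test changes its comment from U+10405 to U+10404 while keeping the same stubbed bytes. A quick way to check which code point those bytes encode is to decode them by hand; `decode_4_byte` is a hypothetical helper written for this check and is not part of the gem or its tests.

```ruby
# Hypothetical helper: decode one 4-byte UTF-8 sequence by stripping the
# marker bits and concatenating the payload bits.
def decode_4_byte(bytes)
  b0, b1, b2, b3 = bytes
  ((b0 & 0x07) << 18) | ((b1 & 0x3F) << 12) | ((b2 & 0x3F) << 6) | (b3 & 0x3F)
end

bytes = [240, 144, 144, 132]
puts format("U+%04X", decode_4_byte(bytes)) # U+10404

# Cross-check with Ruby's own UTF-8 decoder.
puts decode_4_byte(bytes) == bytes.pack("C*").unpack("U*").first # true
```

The result agrees with the updated comment in the 1.0.0 test.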

metadata CHANGED

@@ -3,10 +3,10 @@ name: utf8_utils
 version: !ruby/object:Gem::Version
   prerelease: false
   segments:
+  - 1
   - 0
   - 0
-  - 1
-  version: 0.0.1
+  version: 1.0.0
 platform: ruby
 authors:
 - Norman Clarke
@@ -14,10 +14,21 @@ autorequire:
 bindir: bin
 cert_chain: []
-date: 2010-03-25 00:00:00 -03:00
+date: 2010-04-07 00:00:00 -03:00
 default_executable:
-dependencies: []
+dependencies:
+- !ruby/object:Gem::Dependency
+  name: mocha
+  prerelease: false
+  requirement: &id001 !ruby/object:Gem::Requirement
+    requirements:
+    - - ">="
+      - !ruby/object:Gem::Version
+        segments:
+        - 0
+        version: "0"
+  type: :development
+  version_requirements: *id001
 description: Utilities for cleaning up UTF8 strings. Compatible with Ruby 1.8.6 - 1.9.x
 email: norman@njclarke.com
 executables: []
@@ -27,6 +38,9 @@ extensions: []
 extra_rdoc_files: []
 files:
+- lib/utf8_utils/byte.rb
+- lib/utf8_utils/char.rb
+- lib/utf8_utils/chars.rb
 - lib/utf8_utils/version.rb
 - lib/utf8_utils.rb
 - README.md
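
The new `mocha` entry in the metadata is the kind of record RubyGems writes when a gemspec declares a development dependency. The gem's actual `utf8_utils.gemspec` is not part of this diff, so the following is only a guess at the relevant line, built on a throwaway specification object named `SPEC` for illustration.

```ruby
require "rubygems"

# Assumed reconstruction: only the name, version and dependency are set here.
SPEC = Gem::Specification.new do |s|
  s.name    = "utf8_utils"
  s.version = "1.0.0"
  # No version constraint given, which RubyGems records as ">= 0".
  s.add_development_dependency "mocha"
end

dep = SPEC.development_dependencies.first
puts dep.name        # mocha
puts dep.type        # development
puts dep.requirement # >= 0
```

A development dependency is not installed for ordinary users of the gem, which fits its use here: `mocha` is required only by the test file.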