utf8_utils 2.0.0 → 2.0.1

Files changed (5)

data/README.md CHANGED

@@ -11,7 +11,13 @@ access at [its home on Github](github.com/norman/utf8_utils).
 ## The Problem
-Here's what happens when you try to access a string with invalid UTF-8 characters in Ruby 1.9:
+Your application may have to deal with invalid UTF-8 strings that come from
+user input that is copied and pasted from Microsoft Word, and includes
+Windows-encoded "smart quotes," or other characters. This is only one scenario;
+there are many ways your application could receive such input.
+Here's what happens when you try to access a string with invalid UTF-8
+characters in Ruby 1.9:
     ruby-1.9.1-p378 > "my messed up \x92 string".split(//u)
     ArgumentError: invalid byte sequence in UTF-8
@@ -19,24 +25,30 @@ Here's what happens when you try to access a string with invalid UTF-8 character
             from (irb):3
             from /Users/norman/.rvm/rubies/ruby-1.9.1-p378/bin/irb:17:in `<main>'
+Ruby is quite particular about this - accessing the data in the string is
+difficult as almost all string access methods will die with this error.
 ## The Solution
+This library breaks the string down into an array of raw bytes, and cleans up
+the ones that are impossible UTF-8 sequences.
     ruby-1.9.1-p378 > "my messed up \x92 string".tidy_bytes.split(//u)
      => ["m", "y", " ", "m", "e", "s", "s", "e", "d", " ", "u", "p", " ", "’", " ", "s", "t", "r", "i", "n", "g"]
-Note that like ActiveSupport, it naively assumes if you have invalid UTF8
-characters, they are either Windows CP1251 or ISO8859-1. In practice this isn't
-a bad assumption, but may not always work.
+Note that, like ActiveSupport, it naively assumes if you have invalid UTF8
+characters, their encoding is either Windows CP1252 or ISO-8859-1. In practice
+this isn't a bad assumption, but may not always work.
 This library's `tidy_bytes` method is a little less than twice as fast as the
 one provided by ActiveSupport:
                                | ACTIVE_SUPPORT | UTF8_UTILS |
     ----------------------------------------------------------
-    tidy bytes          x20000 |          1.008 |      0.650 |
+    tidy bytes          x20000 |          1.004 |      0.607 |
     ==========================================================
-    Total                      |          1.008 |      0.650 |
+    Total                      |          1.004 |      0.607 |
 ## Getting it
@@ -62,4 +74,4 @@ one provided by ActiveSupport:
 Created by Norman Clarke.
-Copyright (c) 2010, released under the MIT license.
+Copyright (c) 2010, released under the MIT license.
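
The failure described in this README change can be reproduced, and the CP1252 assumption checked, with nothing but Ruby's built-in encoding support. The sketch below is not part of utf8_utils; it assumes Ruby 1.9 or later, and the variable names are illustrative.

```ruby
# Illustrative sketch using only core Ruby (1.9+); not part of utf8_utils.
# 0x92 is a right single quotation mark in Windows CP1252, but on its own it
# is a stray continuation byte in UTF-8.
raw = "my messed up \x92 string".dup.force_encoding("UTF-8")

# The string claims to be UTF-8 but contains an impossible byte sequence.
is_valid = raw.valid_encoding?

# Reinterpret the same bytes as CP1252 and transcode them to real UTF-8.
repaired = raw.dup.force_encoding("Windows-1252").encode("UTF-8")

puts is_valid   # false
puts repaired   # my messed up ’ string
```

This only works when the whole string really is CP1252. The gem's `tidy_bytes` is aimed at the harder case, where valid UTF-8 and stray single bytes are mixed in one string.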

data/lib/utf8_utils.rb CHANGED

@@ -45,49 +45,55 @@ module UTF8Utils
     # naively assumes if you have invalid UTF8 bytes, they are either Windows
     # CP1252 or ISO-8859-1. In practice this isn't a bad assumption, but may not
     # always work.
-    def tidy_bytes
+    #
+    # Passing +true+ will forcibly tidy all bytes, assuming that the string's
+    # encoding is CP1252 or ISO-8859-1.
+    def tidy_bytes(force = false)
-      bytes = unpack("C*")
-      continuation_bytes_expected = 0
+      if force
+        return unpack("C*").map do |b|
+          tidy_byte(b)
+        end.flatten.compact.pack("C*").unpack("U*").pack("U*")
+      end
-      bytes.each_index do |index|
+      bytes = unpack("C*")
+      conts_expected = 0
+      last_lead = 0
-        byte = bytes[index]
+      bytes.each_index do |i|
-        is_continuation_byte = byte[7] == 1 && byte[6] == 0
-        ascii_byte = byte[7] == 0
-        leading_byte = byte[7] == 1 && byte[6] == 1
+        byte          = bytes[i]
+        is_ascii      = byte < 128
+        is_cont       = byte > 127 && byte < 192
+        is_lead       = byte > 191 && byte < 245
+        is_unused     = byte > 240
+        is_restricted = byte > 244
-        if is_continuation_byte
-          if continuation_bytes_expected > 0
-            continuation_bytes_expected = continuation_bytes_expected - 1
-          else
-            # Not expecting a continuation, so clean it
-            bytes[index] = tidy_byte(byte)
-          end
-        # ASCII byte
-        elsif ascii_byte
-          if continuation_bytes_expected > 0
-            # Expected continuation, got ASCII, so clean previous
-            bytes[index - 1] = tidy_byte(bytes[index - 1])
-            continuation_bytes_expected = 0
-          end
-        elsif leading_byte
-          if continuation_bytes_expected > 0
-            # Expected continuation, got leading, so clean previous
-            bytes[index - 1] = tidy_byte(bytes[index - 1])
-            continuation_bytes_expected = 0
+        # Impossible or highly unlikely byte? Clean it.
+        if is_unused || is_restricted
+          bytes[i] = tidy_byte(byte)
+        elsif is_cont
+          # Not expecting continuation byte? Clean up. Otherwise, now expect one less.
+          conts_expected == 0 ? bytes[i] = tidy_byte(byte) : conts_expected -= 1
+        else
+          if conts_expected > 0
+            # Expected continuation, but got ASCII or leading? Clean backwards up to
+            # the leading byte.
+            (1..(i - last_lead)).each {|j| bytes[i - j] = tidy_byte(bytes[i - j])}
+            conts_expected = 0
           end
-          continuation_bytes_expected =
-            if    byte[5] == 0 then 1
-            elsif byte[4] == 0 then 2
-            elsif byte[3] == 0 then 3
+          if is_lead
+            # Final byte is leading? Clean it.
+            if i == bytes.length - 1
+              bytes[i] = tidy_byte(bytes.last)
+            else
+              # Valid leading byte? Expect continuations determined by position of
+              # first zero bit, with max of 3.
+              conts_expected = byte < 224 ? 1 : byte < 240 ? 2 : 3
+              last_lead = i
+            end
           end
         end
-        # Don't allow the string to terminate with a leading byte
-        if leading_byte && index == bytes.length - 1
-          bytes[index] = tidy_byte(bytes.last)
-        end
       end
       bytes.empty? ? "" : bytes.flatten.compact.pack("C*").unpack("U*").pack("U*")
     end
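
The lead-byte arithmetic in the new `tidy_bytes` loop follows from how UTF-8 marks sequence length in the high bits of the first byte. A minimal sketch, assuming a hypothetical helper name `continuations_expected` (the gem computes this inline) and assuming the argument is already known to be a leading byte, shows the same thresholds in isolation:

```ruby
# Hypothetical helper, not part of utf8_utils: returns how many continuation
# bytes (10xxxxxx) should follow a UTF-8 leading byte, using the same
# thresholds as the ternary in the diff.
def continuations_expected(lead_byte)
  if lead_byte < 224      # 110xxxxx: 2-byte sequence
    1
  elsif lead_byte < 240   # 1110xxxx: 3-byte sequence
    2
  else                    # 11110xxx: 4-byte sequence
    3
  end
end

# Compare against real UTF-8 encodings: U+00E9, U+20AC and U+25924 take
# two, three and four bytes respectively.
["\u00E9", "\u20AC", "\u{25924}"].each do |char|
  bytes = char.bytes
  puts format("%-22s lead=0x%02X expects %d continuation byte(s)",
              bytes.inspect, bytes.first, continuations_expected(bytes.first))
end
```

For each sample character the helper's answer equals the encoded length minus one, which is the invariant the loop depends on when it counts down `conts_expected`.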
@@ -100,17 +106,12 @@ module UTF8Utils
     private
     def tidy_byte(byte)
-      if UTF8Utils::CP1252.key? byte
-        UTF8Utils::CP1252[byte]
-      elsif byte < 192
-        [194, byte]
-      else
-        [195, byte - 64]
-      end
+      byte < 160 ? UTF8Utils::CP1252[byte] : byte < 192 ? [194, byte] : [195, byte - 64]
     end
   end
 end
 class String
   include UTF8Utils::StringExt
-end
+end
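
The one-line `tidy_byte` relies on a property of UTF-8: every code point from U+00A0 to U+00FF encodes as two bytes, `194` or `195` followed by one continuation byte. The sketch below, with a hypothetical helper name `latin1_to_utf8_bytes`, checks that arithmetic against Ruby's own encoder. It deliberately leaves out bytes below 160, which the gem looks up in its `UTF8Utils::CP1252` table (not shown in this diff).

```ruby
# Hypothetical helper mirroring the arithmetic in the diff for bytes 160..255.
def latin1_to_utf8_bytes(byte)
  raise ArgumentError, "expected a byte in 160..255" unless (160..255).cover?(byte)
  byte < 192 ? [194, byte] : [195, byte - 64]
end

# [byte].pack("U") asks Ruby to encode the code point as UTF-8, so any
# disagreement with the helper would show up in this list.
mismatches = (160..255).reject do |byte|
  latin1_to_utf8_bytes(byte) == [byte].pack("U").bytes
end

puts mismatches.empty?  # true
```

Because ISO-8859-1 maps each byte in this range to the code point with the same number, the two-branch expression is enough and no lookup table is needed above 159.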

data/lib/utf8_utils/version.rb CHANGED

@@ -2,7 +2,7 @@ module UTF8Utils
   module Version
     MAJOR = 2
     MINOR = 0
-    TINY  = 0
+    TINY  = 1
     STRING = [MAJOR, MINOR, TINY].join('.')
   end
 end

data/test/utf8_utils_test.rb CHANGED

@@ -1,19 +1,66 @@
 # encoding: utf-8
+require "rubygems"
+require "active_support"
 require "test/unit"
 require File.expand_path("../../lib/utf8_utils", __FILE__)
 class UTF8UtilsTest < Test::Unit::TestCase
-  CASES = {
-    "Sim\xF3n Bol\xEDvar" => "Simón Bolívar", # utf-8 leading bytes followed by an ascii char (fix as CP1252)
-    "\xBFhola?" => "¿hola?", # iso-8859-1 inverted question mark
-    "\xFF" => "something"
+  SINGLE_BYTE_CASES = {
+    "\x21" => "!", # Valid ASCII byte, low
+    "\x41" => "A", # Valid ASCII byte, mid
+    "\x7E" => "~", # Valid ASCII byte, high
+    "\x80" => "€",  # Continuation byte, low (CP1252)
+    "\x94" => "”",  # Continuation byte, mid (CP1252)
+    "\x9F" => "Ÿ",  # Continuation byte, high (CP1252)
+    "\xC0" => "À", # Overlong encoding, start of 2-byte sequence, but codepoint < 128
+    "\xC1" => "Á", # Overlong encoding, start of 2-byte sequence, but codepoint < 128
+    "\xC2" => "Â", # Start of 2-byte sequence, low
+    "\xC8" => "È", # Start of 2-byte sequence, mid
+    "\xDF" => "ß", # Start of 2-byte sequence, high
+    "\xE0" => "à", # Start of 3-byte sequence, low
+    "\xE8" => "è", # Start of 3-byte sequence, mid
+    "\xEF" => "ï", # Start of 3-byte sequence, high
+    "\xF0" => "ð", # Start of 4-byte sequence
+    "\xF1" => "ñ",  # Unused byte
+    "\xFF" => "ÿ", # Restricted byte
   }
+  def setup
+    # SINGLE_BYTE_CASES.each do |k, v|
+    #   SINGLE_BYTE_CASES[k] = ActiveSupport::Multibyte::Chars.new(k)
+    # end
+  end
-  def test_tidy_bytes
-    CASES.each do |bad, good|
-      assert_equal good, bad.tidy_bytes
+  def test_should_handle_single_byte_cases
+    SINGLE_BYTE_CASES.each do |bad, good|
+      assert_equal good, bad.tidy_bytes.to_s
+      assert_equal "#{good}#{good}", "#{bad}#{bad}".tidy_bytes
+      assert_equal "#{good}#{good}#{good}", "#{bad}#{bad}#{bad}".tidy_bytes
+      assert_equal "#{good}a", "#{bad}a".tidy_bytes
+      assert_equal "a#{good}a", "a#{bad}a".tidy_bytes
+      assert_equal "a#{good}", "a#{bad}".tidy_bytes
     end
   end
+  def test_should_tidy_leading_byte_followed_by_too_few_continuation_bytes
+    string = "\xF0\xA5\xA4\x21"
+    assert_equal "ð¥¤!", string.tidy_bytes
+  end
+  def test_should_not_modifiy_valid_utf8_unless_forced
+    # Nothing can be done to tidy the bytes here, because it's valid UTF-8.
+    assert_not_equal "ð¥¤¤", "\xF0\xA5\xA4\xA4".tidy_bytes
+    assert_not_equal "Â»", "\xC2\xBB".tidy_bytes
+    assert_equal "ð¥¤¤", "\xF0\xA5\xA4\xA4".tidy_bytes(true)
+    assert_equal "Â»", "\xC2\xBB".tidy_bytes(true)
+  end
+  def test_should_not_tidy_leading_byte_followed_by_too_many_continuation_bytes_unless_forced
+    string = "\xF0\xA5\xA4\xA4\xA4"
+    assert_not_equal "ð¥¤¤¤", string.tidy_bytes
+    assert_equal "ð¥¤¤¤", string.tidy_bytes(true)
+  end
 end
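
The `tidy_bytes(true)` tests rely on the fact that a valid UTF-8 sequence can also be read as a run of CP1252 characters. A rough sketch using only core Ruby transcoding (not the gem's own code path) shows the two readings of the same bytes:

```ruby
# Illustrative only: core Ruby transcoding, not utf8_utils itself.
bytes = [0xC2, 0xBB].pack("C*")

# Read as UTF-8, the two bytes form one valid character: U+00BB.
as_utf8 = bytes.dup.force_encoding("UTF-8")

# Read as CP1252, each byte is its own character: U+00C2 followed by U+00BB.
as_cp1252 = bytes.dup.force_encoding("Windows-1252").encode("UTF-8")

puts as_utf8.valid_encoding?  # true
puts as_utf8.length           # 1
puts as_cp1252.length         # 2
```

This is why the unforced call has to leave `"\xC2\xBB"` alone: nothing in the bytes themselves says the second reading was intended, so only the caller can request it.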

metadata CHANGED

@@ -5,8 +5,8 @@ version: !ruby/object:Gem::Version
   segments:
   - 2
   - 0
-  - 0
-  version: 2.0.0
+  - 1
+  version: 2.0.1
 platform: ruby
 authors:
 - Norman Clarke
@@ -16,19 +16,8 @@ cert_chain: []
 date: 2010-04-08 00:00:00 -03:00
 default_executable:
-dependencies:
-- !ruby/object:Gem::Dependency
-  name: mocha
-  prerelease: false
-  requirement: &id001 !ruby/object:Gem::Requirement
-    requirements:
-    - - ">="
-      - !ruby/object:Gem::Version
-        segments:
-        - 0
-        version: "0"
-  type: :development
-  version_requirements: *id001
+dependencies: []
 description: Utilities for cleaning up UTF8 strings. Compatible with Ruby 1.8.6 - 1.9.x
 email: norman@njclarke.com
 executables: []