RubyGems - utf8_utils - Versions diffs - 0.0.1 - Mend

utf8_utils 0.0.1

Files changed (7)

data/LICENSE ADDED

@@ -0,0 +1,19 @@
+Copyright (c) 2010 Norman Clarke
+Permission is hereby granted, free of charge, to any person obtaining a copy
+of this software and associated documentation files (the "Software"), to deal
+in the Software without restriction, including without limitation the rights
+to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+copies of the Software, and to permit persons to whom the Software is
+furnished to do so, subject to the following conditions:
+The above copyright notice and this permission notice shall be included in all
+copies or substantial portions of the Software.
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+SOFTWARE.

data/README.md ADDED

@@ -0,0 +1,58 @@
+# UTF8 Utils
+This library provides a means of cleaning UTF8 strings with invalid characters.
+It provides functionality very similar to [ActiveSupport's `tidy_bytes`
+method](http://api.rubyonrails.org/classes/ActiveSupport/Multibyte/Chars.html#M000977),
+but works for Ruby 1.8.6 - 1.9.x. Once I sort out any potentially embarrassing
+issues with it, I'll probably try patching it into ActiveSupport.
+## The Problem
+Here's what happens when you try to access a string with invalid UTF-8 characters in Ruby 1.9:
+    ruby-1.9.1-p378 > "my messed up \x92 string".split(//)
+    ArgumentError: invalid byte sequence in UTF-8
+            from (irb):3:in `split'
+            from (irb):3
+            from /Users/norman/.rvm/rubies/ruby-1.9.1-p378/bin/irb:17:in `<main>'
+## The Solution
+    ruby-1.9.1-p378 > "my messed up \x92 string".to_utf8_codepoints.tidy_bytes.to_s.split(//u)
+     => ["m", "y", " ", "m", "e", "s", "s", "e", "d", " ", "u", "p", " ", "’", " ", "s", "t", "r", "i", "n", "g"]
+Amazing in its brevity and elegance, huh? OK, maybe not really, but if you have
+some badly encoded data you need to clean up, it can save you from ripping out
+your hair.
+Note that, like ActiveSupport, it naively assumes that any invalid UTF8
+characters are either Windows CP1252 or ISO8859-1. In practice this isn't
+a bad assumption, but it may not always work.
+## Getting it
+    gem install utf8_utils
+## Using it
+    require "utf8_utils"
+    # Traverse codepoints
+    "hello-world".to_utf8_codepoints.each_codepoint do |codepoint|
+      puts codepoint.valid?
+    end
+    # Tidy bytes
+    good_string = bad_string.to_utf8_codepoints.tidy_bytes.to_s
+## API Docs
+[http://norman.github.com/utf8_utils](http://norman.github.com/utf8_utils)
+## Credits
+Created by Norman Clarke, with some code <strike>stolen</strike> borrowed from ActiveSupport.
+Copyright (c) 2010, released under the MIT license.
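Note on the README's cleanup approach: on Ruby 2.1 and later, a similar fallback can be sketched with the built-in `String#scrub`. This is not how utf8_utils 0.0.1 works (the gem walks the bytes itself so that it also runs on 1.8.6). `tidy_with_scrub` is a hypothetical helper name, and treating every invalid byte as Windows-1252 (the code page in which 0x92 is the right single quote from the README example) is an assumption.

```ruby
# Sketch only: requires Ruby 2.1+ for String#scrub.
# `tidy_with_scrub` is a hypothetical name, not part of utf8_utils.
def tidy_with_scrub(string)
  string.scrub do |bad_bytes|
    # Reinterpret the offending bytes as Windows-1252 and transcode them.
    # Bytes with no Windows-1252 meaning are dropped (replace: "").
    as_cp1252 = bad_bytes.dup.force_encoding("Windows-1252")
    as_cp1252.encode("UTF-8", invalid: :replace, undef: :replace, replace: "")
  end
end

dirty = "my messed up \x92 string"
clean = tidy_with_scrub(dirty)

puts dirty.valid_encoding?
puts clean.valid_encoding?
puts clean
```

Strings that are already valid UTF-8 pass through unchanged, because `scrub` only calls the block for byte sequences it cannot decode.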

data/Rakefile ADDED

@@ -0,0 +1,25 @@
+require "rake"
+require "rake/testtask"
+require "rake/gempackagetask"
+require "rake/rdoctask"
+require "rake/clean"
+CLEAN << "pkg" << "doc" << "coverage" << ".yardoc"
+Rake::GemPackageTask.new(eval(File.read("utf8_utils.gemspec"))) { |pkg| }
+Rake::TestTask.new(:test) { |t| t.pattern = "test/**/*_test.rb" }
+Rake::RDocTask.new do |r|
+  r.rdoc_dir = "doc"
+  r.rdoc_files.include "lib/**/*.rb"
+end
+begin
+  require "rcov/rcovtask"
+  Rcov::RcovTask.new do |r|
+    r.test_files = FileList["test/**/*_test.rb"]
+    r.verbose = true
+    r.rcov_opts << "--exclude gems/*"
+  end
+rescue LoadError
+end

data/lib/utf8_utils/version.rb ADDED

@@ -0,0 +1,8 @@
+module UTF8Utils
+  module Version
+    MAJOR = 0
+    MINOR = 0
+    TINY  = 1
+    STRING = [MAJOR, MINOR, TINY].join('.')
+  end
+end

data/lib/utf8_utils.rb ADDED

@@ -0,0 +1,156 @@
+# Wraps a string as an array of bytes and allows some naive cleanup operations as a workaround
+# for Ruby 1.9's crappy encoding support that throws exceptions when attempting to access
+# UTF8 strings with invalid characters.
+module UTF8Utils
+  class Codepoints
+    attr_accessor :chars
+    attr :position
+    include Enumerable
+    CP1251 = {
+      128 => [226, 130, 172],
+      129 => nil,
+      130 => [226, 128, 154],
+      131 => [198, 146],
+      132 => [226, 128, 158],
+      133 => [226, 128, 166],
+      134 => [226, 128, 160],
+      135 => [226, 128, 161],
+      136 => [203, 134],
+      137 => [226, 128, 176],
+      138 => [197, 160],
+      139 => [226, 128, 185],
+      140 => [197, 146],
+      141 => nil,
+      142 => [197, 189],
+      143 => nil,
+      144 => nil,
+      145 => [226, 128, 152],
+      146 => [226, 128, 153],
+      147 => [226, 128, 156],
+      148 => [226, 128, 157],
+      149 => [226, 128, 162],
+      150 => [226, 128, 147],
+      151 => [226, 128, 148],
+      152 => [203, 156],
+      153 => [226, 132, 162],
+      154 => [197, 161],
+      155 => [226, 128, 186],
+      156 => [197, 147],
+      157 => nil,
+      158 => [197, 190],
+      159 => [197, 184]
+    }
+    def initialize(string)
+      @position = 0
+      # 1.8.6's `each_byte` does not return an Enumerable
+      if RUBY_VERSION < "1.8.7"
+        @chars = []
+        string.each_byte { |b| @chars << b }
+      else
+        # Create an array of bytes without raising an ArgumentError in 1.9.x
+        # when the string contains invalid UTF-8 characters
+        @chars = string.each_byte.entries
+      end
+    end
+    # Attempt to clean up malformed characters.
+    def tidy_bytes
+      Codepoints.new(entries.map {|c| c.tidy.to_char}.compact.join)
+    end
+    # Cast to string.
+    def to_s
+      entries.map {|e| e.to_char}.join
+    end
+    private
+    def each(&block)
+      while codepoint = next_codepoint
+        yield codepoint
+      end
+      @position = 0
+    end
+    alias :each_codepoint :each
+    public :each_codepoint
+    def bytes_to_pull
+      case chars[position]
+      when 0..127 then 1
+      when 128..223 then 2
+      when 224..239 then 3
+      else 4
+      end
+    end
+    def next_codepoint
+      codepoint = Codepoint.new(chars.slice(position, bytes_to_pull))
+      if codepoint.invalid?
+        codepoint = Codepoint.new(chars.slice(position, 1))
+      end
+      @position = position + codepoint.size
+      codepoint unless codepoint.empty?
+    end
+  end
+  class Codepoint < Array
+    # Borrowed from the regexp in ActiveSupport, which in turn had been borrowed from
+    # the Kconv library by Shinji KONO - (also as seen on the W3C site).
+    # See also http://en.wikipedia.org/wiki/UTF-8
+    def valid?
+      if length == 1
+        (0..127) === self[0]
+      elsif length == 2
+        (192..223) === self[0] && (128..191) === self[1]
+      elsif length == 3
+        (self[0] == 224 && ((160..191) === self[1] && (128..191) === self[2])) ||
+        ((225..239) === self[0] && (128..191) === self[1] && (128..191) === self[2])
+      elsif length == 4
+        (self[0] == 240 && (144..191) === self[1] && (128..191) === self[2] && (128..191) === self[3]) ||
+        ((241..243) === self[0] && (128..191) === self[1] && (128..191) === self[2] && (128..191) === self[3]) ||
+        (self[0] == 244 && (128..143) === self[1] && (128..191) === self[2] && (128..191) === self[3])
+      end
+    end
+    # Attempt to rescue a valid UTF-8 character from a malformed codepoint. It will first
+    # attempt to convert from CP1251, and if this isn't possible, it prepends a valid leading
+    # byte, treating the character as the last byte in a two-byte codepoint.
+    # Note that much of the logic here is taken from ActiveSupport; the difference is that this
+    # works for Ruby 1.8.6 - 1.9.1.
+    def tidy
+      return self if valid?
+      if Codepoints::CP1251.key? self[0]
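+        # Undefined slots in the table map to nil; fall back to an empty
+        # codepoint so that `to_char` never tries to pack nil.
+        self.class.new(Codepoints::CP1251[self[0]] || [])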
+      elsif self[0] < 192
+        self.class.new [194, self[0]]
+      else
+        self.class.new [195, self[0] - 64]
+      end
+    end
+    def invalid?
+      !valid?
+    end
+    # Get a character from the bytes.
+    def to_char
+      flatten.pack("C*").unpack("U*").pack("U*")
+    end
+  end
+end
+# Get an array of UTF8 codepoints from a string.
+class String
+  def to_utf8_codepoints
+    UTF8Utils::Codepoints.new self
+  end
+end
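Note on the last two branches of `Codepoint#tidy`: they amount to re-encoding a single ISO-8859-1 byte as a two-byte UTF-8 sequence. A minimal sketch of that arithmetic follows. `latin1_fallback` and `sequence_length` are hypothetical helper names, and `sequence_length` is a slightly stricter variant of `bytes_to_pull` (it treats continuation bytes as length 1 up front instead of relying on the retry in `next_codepoint`).

```ruby
# Sketch only: these helpers are hypothetical and not part of utf8_utils.

# Expected length of a UTF-8 sequence, judged from its first byte.
def sequence_length(lead_byte)
  case lead_byte
  when 0..127   then 1
  when 192..223 then 2
  when 224..239 then 3
  when 240..247 then 4
  else 1 # continuation byte (128..191) or a byte that never starts a sequence
  end
end

# Re-encode one stray ISO-8859-1 byte (128..255) as two UTF-8 bytes.
# Code points U+0080..U+00BF take lead byte 194 and keep the byte as-is;
# U+00C0..U+00FF take lead byte 195 with the byte shifted down by 64.
def latin1_fallback(byte)
  byte < 192 ? [194, byte] : [195, byte - 64]
end

bytes = latin1_fallback(233) # 0xE9 is "é" in ISO-8859-1
char  = bytes.pack("C*").force_encoding("UTF-8")

puts bytes.inspect
puts char
puts sequence_length(bytes.first)
```

The result agrees with the lead-byte table: the first byte of each fallback pair (194 or 195) falls in the 192..223 range, so the repaired character is read back as a two-byte sequence.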

data/test/utf8_utils_test.rb ADDED

@@ -0,0 +1,49 @@
+# encoding: utf-8
+require "test/unit"
+require File.join(File.dirname(__FILE__), "..", "lib", "utf8_utils")
+class UTF8CodepointsTest < Test::Unit::TestCase
+  def test_should_pull_one_byte_for_ascii_char
+    assert_equal 1, "a".to_utf8_codepoints.entries[0].length
+  end
+  def test_should_pull_two_bytes_for_latin_char_with_diacritics
+    assert_equal 2, "¡".to_utf8_codepoints.entries[0].length
+  end
+  def test_should_pull_three_bytes_for_basic_multilingual_char
+    assert_equal 3, "आ".to_utf8_codepoints.entries[0].length
+  end
+  def test_should_pull_four_bytes_for_other_chars
+    u = UTF8Utils::Codepoints.new("")
+    # Editors tend to freak out with chars in this plane, so just stub the
+    # chars field instead. This char is U+10404, DESERET CAPITAL LETTER LONG O.
+    u.chars = [240, 144, 144, 132]
+    assert_equal 4, u.entries[0].length
+  end
+  def test_should_detect_valid_codepoints
+    "cañón आ".to_utf8_codepoints.each_codepoint {|c| assert c.valid? }
+  end
+  def test_should_detect_invalid_codepoints
+    "\x92".to_utf8_codepoints.each_codepoint {|c| assert c.invalid? }
+  end
+  def test_should_split_correctly_with_invalid_codepoints
+    assert_equal 3, "a\x92a".to_utf8_codepoints.entries.length
+  end
+  def test_should_tidy_bytes
+    assert_equal "a’a", "a\x92a".to_utf8_codepoints.tidy_bytes.to_s
+  end
+  def test_should_not_screw_up_valid_strings
+    s = File.read(__FILE__)
+    assert_equal s.to_s, s.to_utf8_codepoints.tidy_bytes.to_s
+  end
+end
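Note on the stubbed four-byte character in `test_should_pull_four_bytes_for_other_chars`: the bytes can be decoded with plain `pack`/`unpack` to check which code point they represent. This is a small verification sketch, not part of the gem's test suite, and the variable names are illustrative.

```ruby
# Decode the stubbed bytes from the test into a Unicode code point.
stubbed_bytes = [240, 144, 144, 132]
codepoint = stubbed_bytes.pack("C*").unpack("U*").first

puts format("U+%04X", codepoint)

# Going the other way: the UTF-8 bytes for U+10405, for comparison.
long_oo_bytes = [0x10405].pack("U").bytes

puts long_oo_bytes.inspect
```

The stubbed bytes decode to U+10404, while U+10405 ends in byte 133 rather than 132. Either sequence is a valid four-byte character, so the length assertion in the test holds in both cases.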

metadata ADDED

@@ -0,0 +1,67 @@
+--- !ruby/object:Gem::Specification
+name: utf8_utils
+version: !ruby/object:Gem::Version
+  prerelease: false
+  segments:
+  - 0
+  - 0
+  - 1
+  version: 0.0.1
+platform: ruby
+authors:
+- Norman Clarke
+autorequire:
+bindir: bin
+cert_chain: []
+date: 2010-03-25 00:00:00 -03:00
+default_executable:
+dependencies: []
+description: Utilities for cleaning up UTF8 strings. Compatible with Ruby 1.8.6 - 1.9.x
+email: norman@njclarke.com
+executables: []
+extensions: []
+extra_rdoc_files: []
+files:
+- lib/utf8_utils/version.rb
+- lib/utf8_utils.rb
+- README.md
+- LICENSE
+- Rakefile
+- test/utf8_utils_test.rb
+has_rdoc: true
+homepage: http://norman.github.com/utf8_utils
+licenses: []
+post_install_message:
+rdoc_options: []
+require_paths:
+- lib
+required_ruby_version: !ruby/object:Gem::Requirement
+  requirements:
+  - - ">="
+    - !ruby/object:Gem::Version
+      segments:
+      - 0
+      version: "0"
+required_rubygems_version: !ruby/object:Gem::Requirement
+  requirements:
+  - - ">="
+    - !ruby/object:Gem::Version
+      segments:
+      - 0
+      version: "0"
+requirements: []
+rubyforge_project: utf8_utils
+rubygems_version: 1.3.6
+signing_key:
+specification_version: 3
+summary: Utilities for cleaning up UTF8 strings.
+test_files:
+- test/utf8_utils_test.rb