RubyGems - utf8_validator - Versions diffs - 0.0.2 → 1.0.0 - Mend

utf8_validator 0.0.2 → 1.0.0

Files changed (6)

data/README.rdoc CHANGED

@@ -18,13 +18,28 @@ The Unicode Consortium:: At http://unicode.org/ for all the information publishe
 Frank Yung-Fong Tang:: For the state machine algorithm.  See: http://unicode.org/mail-arch/unicode-ml/y2003-m02/att-0467/01-The_Algorithm_to_Valide_an_UTF-8_String
 Markus Kuhn:: For invalid test data.  http://www.cl.cam.ac.uk/~mgk25/ucs/examples/UTF-8-test.txt
+== Useful Information
+Other interesting and/or useful information can be found:
+* http://software.hixie.ch/utilities/cgi/unicode-decoder/utf8-decoder
+* http://bjoern.hoehrmann.de/utf-8/decoder/dfa/
 == A Word On Ruby Versions
-It is expected that this validator will be used in Ruby environments prior to 1.9.x. However, nothing prohibits it's use with Ruby 1.9.
+It is expected that this validator will be used in Ruby environments prior to 1.9.x. However, nothing prohibits use with Ruby 1.9.  Tests recognize a 1.9 environment, and insure that 1.9 native behavior matches the validator behavior.
 == Reporting Issues
-Please report issues on the tracker at github.
+Please report issues on the tracker at github:
+* https://github.com/gmallard/utf8_validator/issues
+== Web Based Documentation
+Human readable documentation can be found at:
+* http://gmallard.github.com/utf8_validator
 == Contributing to the utf8_validator gem

data/VERSION CHANGED

@@ -1 +1,2 @@
-0.0.2
+1.0.0

data/lib/validation/validator.rb CHANGED

@@ -77,12 +77,12 @@ class Validator
   #
   #    validator = UTF8::Validator.new                           -> validator
   #    validator.valid_encoding?(string)                         -> true or false
-  #    validator.valid_encoding?(string, true)                   -> true or exception
+  #    validator.valid_encoding?(string, raise_on_error)         -> true or exception
   #
   # Parameters:
   #
   # string::         the string to validate
-  # raise_on_error:: a flag to indicate failure behavior
+  # raise_on_error:: a boolean flag to indicate requested failure behavior
   #
   # When raise_on_error is _true_ and a string fails validation, an
   # error of type #{UTF8::ValidationError} is raised.  The byte in error
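
The calling convention documented in the hunk above (a boolean result by default, an exception when raise_on_error is true) can be sketched without the gem installed. The module below is a hypothetical stand-in: Utf8Sketch and its ValidationError are illustrative names, and the check delegates to Ruby's native String#valid_encoding? rather than to the gem's state machine.

```ruby
# Hypothetical stand-in for the documented calling convention.
# This is NOT the gem's state-machine implementation; it delegates
# to Ruby's native encoding check (Ruby 1.9 and later).
module Utf8Sketch
  # Illustrative error class, named after the one the docs mention.
  class ValidationError < StandardError; end

  # Returns true when the bytes of +string+ are well-formed UTF-8.
  # On failure, returns false, or raises when raise_on_error is true.
  def self.valid_encoding?(string, raise_on_error = false)
    # dup so the caller's string keeps its original encoding tag
    ok = string.dup.force_encoding(Encoding::UTF_8).valid_encoding?
    raise ValidationError, "invalid UTF-8 byte sequence" if !ok && raise_on_error
    ok
  end
end

good = [0x41, 0xc3, 0xb1, 0x42].pack("C*")        # <41 C3 B1 42>
bad  = [0x41, 0xc2, 0xc3, 0xb1, 0x42].pack("C*")  # <41 C2 C3 B1 42>

puts Utf8Sketch.valid_encoding?(good)
puts Utf8Sketch.valid_encoding?(bad)

begin
  Utf8Sketch.valid_encoding?(bad, true)
rescue Utf8Sketch::ValidationError => e
  puts "raised: #{e.message}"
end
```

With the real gem, the equivalent calls would go through UTF8::Validator.new as shown in the comment block above.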

data/test/test_utf8_validator.rb CHANGED

@@ -102,9 +102,7 @@ class TestUtf8Validator < Test::Unit::TestCase
       "\xed\x9f\xbf", # = "\ud7ff"
       "\xee\x80\x80", # = "\ue000"
       "\xef\xbf\xbd", # = "\ufffd"
-#      "\xf4\x8f\xbf\xbf", # = "\U0010ffff" / maybe _should_ fail ??
-#      "\xf4\x90\x80\x80", # = "\ufffd" / maybe  _should_ fail ?? / research
+      "\xf4\x8f\xbf\xbf", # = "\U10ffff" / _should_ this fail ??
     ]
     test_data.each do |string|
       assert @validator.valid_encoding?(string), "boundary conditions: #{string}"
@@ -144,7 +142,7 @@ class TestUtf8Validator < Test::Unit::TestCase
   #--
   # I do not see a need to test UTF-16 surrogate pairs.  They are guaranteed
-  # to alyays fail if the preceding test succeeds.  This is because the
+  # to always fail if the preceding test succeeds.  This is because the
   # preceeding test data values are always the first surrogate of the pair.
   #
   # UTF-16 surrogates are clearly something I do not understand.
@@ -168,10 +166,11 @@ class TestUtf8Validator < Test::Unit::TestCase
   def test0540_not_shortest
     test_data = [
       "\xc0\x80",
+      "\xc1\x80",
+      "\xc0\x30",
+      "\xc1\x30",
       "\xe0\x80\x80",
       "\xf0\x80\x80\x80",
-      "\xf8\x80\x80\x80\x80",
-      "\xfc\x80\x80\x80\x80\x80",
     ]
     test_data.each do |string|
       assert !@validator.valid_encoding?(string), "Not shortest: #{string}"
@@ -271,5 +270,81 @@ class TestUtf8Validator < Test::Unit::TestCase
     end
   end
+  # Boundary conditions
+  def test_0590_boundary_conditions
+    test_data = [
+      "\xf4\x90\x80\x80", # See: http://software.hixie.ch/utilities/cgi/unicode-decoder/utf8-decoder
+    ]
+    test_data.each do |string|
+      assert !@validator.valid_encoding?(string), "boundary conditions: #{string}"
+      assert !string.force_encoding("UTF-8").valid_encoding?, "boundary conditions 19: #{string}"  if RUBY_VERSION =~ /1\.9/
+    end
+  end
+=begin
+The next test is based on examples provided in the Unicode 6.0 specification.
+See pages 91-92.
+From that specification:
+• The UTF-8 code unit sequence <41 C3 B1 42> is well-formed, because it can be
+partitioned into subsequences, all of which match the specification for UTF-8
+in Table 3-7. It consists of the following minimal well-formed code unit subse-
+quences: <41>, <C3 B1>, and <42>.
+• The UTF-8 code unit sequence <41 C2 C3 B1 42> is ill-formed, because it con-
+tains one ill-formed subsequence. There is no subsequence for the C2 byte
+which matches the specification for UTF-8 in Table 3-7. The code unit sequence
+is partitioned into one minimal well-formed code unit subsequence, <41>, fol-
+lowed by one ill-formed code unit subsequence, <C2>, followed by two mini-
+mal well-formed code unit subsequences, <C3 B1> and <42>.
+• In isolation, the UTF-8 code unit sequence <C2 C3> would be ill-formed, but
+in the context of the UTF-8 code unit sequence <41 C2 C3 B1 42>, <C2 C3>
+does not constitute an ill-formed code unit subsequence, because the C3 byte is
+actually the first byte of the minimal well-formed UTF-8 code unit subse-
+quence <C3 B1>. Ill-formed code unit subsequences do not overlap with mini-
+mal well-formed code unit subsequences.
+// Above:
+straight from the Unicode 6.0 spec.  See page 91.
+• As another example, the code unit sequence <C0 80 61 F3> is a Unicode 8-bit
+string, but does not consist of a well-formed UTF-8 code unit sequence. That
+code unit sequence could not result from the specification of the UTF-8 encod-
+ing form and is thus ill-formed. (The same code unit sequence could, of course,
+be well-formed in the context of some other character encoding standard using
+8-bit code units, such as ISO/IEC 8859-1, or vendor code pages.)
+// Above:
+straight from the Unicode 6.0 spec.  See page 92.
+=end
+  # Tests from examples in the Unicode speciication
+  def test_0600_unicode_specs
+    good_data = [
+      "\x41\xc3\xb1\x42",
+    ]
+    good_data.each do |string|
+      assert @validator.valid_encoding?(string), "good unicode specs 01: #{string}"
+      assert string.force_encoding("UTF-8").valid_encoding?,
+        "good unicode specs 01 19: #{string}"  if RUBY_VERSION =~ /1\.9/
+    end
+    bad_data = [
+      "\x41\xc2\xc3\xb1\x42",
+    ]
+    bad_data.each do |string|
+      assert !@validator.valid_encoding?(string), "bad unicode specs 01: #{string}"
+      assert !string.force_encoding("UTF-8").valid_encoding?,
+        "bad unicode specs 01 19: #{string}"  if RUBY_VERSION =~ /1\.9/
+    end
+  end
 end
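
The new tests compare the validator against Ruby 1.9's native behavior. As a rough cross-check, the byte sequences that appear in the test data above can be run through String#valid_encoding? directly. This is a sketch only: the utf8 helper and the CASES table are hypothetical names, and the expected values restate what the tests in this diff assert.

```ruby
# Build a UTF-8-tagged string from raw byte values (hypothetical helper).
def utf8(*bytes)
  bytes.pack("C*").force_encoding(Encoding::UTF_8)
end

# [bytes, expected validity] pairs taken from the test data in this diff.
CASES = [
  [[0x41, 0xc3, 0xb1, 0x42],       true],   # Unicode spec: well-formed
  [[0x41, 0xc2, 0xc3, 0xb1, 0x42], false],  # Unicode spec: stray C2 byte
  [[0xed, 0x9f, 0xbf],             true],   # U+D7FF, just below surrogates
  [[0xee, 0x80, 0x80],             true],   # U+E000, just above surrogates
  [[0xf4, 0x8f, 0xbf, 0xbf],       true],   # U+10FFFF, highest code point
  [[0xf4, 0x90, 0x80, 0x80],       false],  # beyond U+10FFFF
  [[0xc0, 0x80],                   false],  # not shortest form
  [[0xc1, 0x30],                   false],  # C1 is never a valid lead byte
  [[0xe0, 0x80, 0x80],             false],  # not shortest form
  [[0xf0, 0x80, 0x80, 0x80],       false],  # not shortest form
]

CASES.each do |bytes, expected|
  actual = utf8(*bytes).valid_encoding?
  hex = bytes.map { |b| format("%02X", b) }.join(" ")
  puts format("%-16s expected=%-5s actual=%s", hex, expected, actual)
end
```

This also bears on the question left in the test comment: F4 8F BF BF encodes U+10FFFF, the last valid code point, so it belongs with the passing boundary cases, while F4 90 80 80 is the first sequence past that limit.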

data/utf8_validator.gemspec CHANGED

@@ -5,11 +5,11 @@
 Gem::Specification.new do |s|
   s.name = %q{utf8_validator}
-  s.version = "0.0.2"
+  s.version = "1.0.0"
   s.required_rubygems_version = Gem::Requirement.new(">= 0") if s.respond_to? :required_rubygems_version=
   s.authors = ["Guy Allard"]
-  s.date = %q{2011-01-26}
+  s.date = %q{2011-03-21}
   s.description = %q{A State Machine implementation of a UTF-8 Encoding
 Validation algorithm.}
   s.email = %q{allard.guy.m@gmail.com}

metadata CHANGED

@@ -3,10 +3,10 @@ name: utf8_validator
 version: !ruby/object:Gem::Version
   prerelease: false
   segments:
+  - 1
   - 0
   - 0
-  - 2
-  version: 0.0.2
+  version: 1.0.0
 platform: ruby
 authors:
 - Guy Allard
@@ -14,7 +14,7 @@ autorequire:
 bindir: bin
 cert_chain: []
-date: 2011-01-26 00:00:00 -05:00
+date: 2011-03-21 00:00:00 -04:00
 default_executable:
 dependencies:
 - !ruby/object:Gem::Dependency
@@ -101,7 +101,7 @@ required_ruby_version: !ruby/object:Gem::Requirement
   requirements:
   - - ">="
     - !ruby/object:Gem::Version
-      hash: 2884485592009813991
+      hash: -2830837862218191460
       segments:
       - 0
       version: "0"