RubyGems - icu_name - Versions diffs - 0.0.7 → 0.1.0 - Mend

icu_name 0.0.7 → 0.1.0

Files changed (5)

data/README.rdoc CHANGED

@@ -8,7 +8,7 @@ For ruby 1.9.2 and above.
   gem install icu_name
-It depends on active_support and i18n.
+It depends on _active_support_ and _i18n_.
 == Names
@@ -23,84 +23,88 @@ To create a name object, supply both the first and second names separately to th
 Capitalisation, white space and punctuation will all be automatically corrected:
-  robert.name                                    # => 'Robert J. Fischer'
-  robert.rname                                   # => 'Fischer, Robert J.'  (reversed name)
+  robert.name                                             # => 'Robert J. Fischer'
+  robert.rname                                            # => 'Fischer, Robert J.'  (reversed name)
 The input text, without any changes apart from white-space cleanup, is returned by the _original_ method:
-  robert.original                                # => 'robert j FISHER'
+  robert.original                                         # => 'robert j FISHER'
 To avoid ambiguity when either the first or second names consist of multiple words, it is better to
 supply the two separately, if known. However, the full name can be supplied alone to the constructor
 and a guess will be made as to the first and last names.
   bobby = ICU::Name.new(' bobby fischer ')
-  bobby.first                                    # => 'Bobby'
-  bobby.last                                     # => 'Fischer'
+  bobby.first                                             # => 'Bobby'
+  bobby.last                                              # => 'Fischer'
 Names will match even if one is missing middle initials or if a nickname is used for one of the first names.
-  bobby.match('Robert J.', 'Fischer')            # => true
+  bobby.match('Robert J.', 'Fischer')                     # => true
 Note that the class is aware of only common nicknames (e.g. _Bobby_ and _Robert_, _Bill_ and _William_, etc), not all possibilities.
 Supplying the _match_ method with strings is equivalent to instantiating a Name instance with the same
 strings and then matching it. So, for example the following are equivalent:
-  robert.match('R.', 'Fischer')                  # => true
-  robert.match(ICU::Name.new('R.', 'Fischer'))   # => true
+  robert.match('R.', 'Fischer')                           # => true
+  robert.match(ICU::Name.new('R.', 'Fischer'))            # => true
 The inital _R_, for example, matches the first letter of _Robert_. However, nickname matches will not
 always work with initials. In the next example, the initial _R_ does not match the first letter _B_ of the
 nickname _Bobby_.
-  bobby.match('R. J.', 'Fischer')                # => false
+  bobby.match('R. J.', 'Fischer')                         # => false
 Some of the ways last names are canonicalised are illustrated below:
-  ICU::Name.new('John', 'O Reilly').last         # => "O'Reilly"
-  ICU::Name.new('dave', 'mcmanus').last          # => "McManus"
+  ICU::Name.new('John', 'O Reilly').last                  # => "O'Reilly"
+  ICU::Name.new('dave', 'mcmanus').last                   # => "McManus"
 == Characters and Encoding
-The class can only cope with Western European letter characters, including the accented ones in Latin-1.
-It's various accessors (_first_, _last_, _name_, _rname_, _to_s_, _original_) always return strings
-encoded in UTF-8, no matter what the input encoding.
+The class can only cope with Latin characters, including those with diacritics (accents).
+Along with hyphens and single quotes (which represent apostophes) letters in ISO-8859-1
+(e.g. "a", "è", "Ö") and letters outside ISO-8859-1 which are decomposable into a US-ASCII
+character plus one or more diacritics (e.g. "ł" or "Ś") are preserved, while everything
+else is removed.
+  ICU::Name.new('éric', 'PRIÉ').name                      # => "Éric Prié"
+  ICU::Name.new('BARTŁOMIEJ', 'śliwa').name               # => "Bartłomiej Śliwa"
+  ICU::Name.new(' 渡井美代子').name                            # => ""
-  eric = ICU::Name.new('éric', 'PRIÉ')
-  eric.rname                                     # => "Prié, Éric"
-  eric.rname.encoding.name                       # => "UTF-8"
+The various accessors (_first_, _last_, _name_, _rname_, _to_s_, _original_) always return
+strings encoded in UTF-8, no matter what the input encoding.
   eric = ICU::Name.new('éric'.encode("ISO-8859-1"), 'PRIÉ'.force_encoding("ASCII-8BIT"))
-  eric.rname                                     # => "Prié, Éric"
-  eric.rname.encoding.name                       # => "UTF-8"
-  eric.original                                  # => "éric PRIÉ"
-  eric.original.encoding.name                    # => "UTF-8"
+  eric.rname                                              # => "Prié, Éric"
+  eric.rname.encoding.name                                # => "UTF-8"
+  eric.original                                           # => "éric PRIÉ"
+  eric.original.encoding.name                             # => "UTF-8"
-Currently, all characters outside the Latin-1 range are removed as if they wern't there.
+Accented letters can be transliterated into their US-ASCII counterparts by setting the
+_chars_ option, which is available in all accessors. For example:
-  ICU::Name.new('Józef Żabiński').name           # => "Józef Abiski"
-  ICU::Name.new('Bǔ Xiángzhì').name              # => "B. Xiángzhì"
+  eric.rname(:chars => "US-ASCII")                        # => "Prie, Eric"
+  eric.original(:chars => "US-ASCII")                     # => "eric PRIE"
-Accented Latin-1 characters can be transliterated into their ascii counterparts by setting the
-_ascii_ option to a true value.
+Also possible is the preservation of ISO-8859-1 characters, but the transliteration of
+all other accented characters:
-  eric.name(:ascii => true)                      # => "Eric Prie"
+  joe = Name.new('Józef', 'Żabiński')
+  joe.rname                                               # => "Żabiński, Józef"
+  joe.rname(:chars => "ISO-8859-1")                       # => "Zabinski, Józef"
+  joe.rname(:chars => "US-ASCII")                         # => "Zabinski, Jozef"
-This works with all the other accessors and also with the constructor:
+Note that the character encoding of the strings returned is still UTF-8 in all cases.
+The same option also relaxes the need for accented characters to match exactly:
-  eric_ascii = ICU::Name.new('éric', 'PRIÉ', :ascii => true)
-  eric_ascii.name                                # => "Eric Prie"
-  jozef_ascii = ICU::Name.new('Józef', 'Żabiński', :ascii => true).name
-  jozef_ascii.name                               # => "Jozef Zabinski"
-The option also relaxes the need for accented characters to match exactly:
+  eric.match('Eric', 'Prie')                              # => false
+  eric.match('Eric', 'Prie', :chars => "US-ASCII")        # => true
+  joe.match('Józef', 'Zabinski')                          # => false
+  joe.match('Józef', 'Zabinski', :chars => "ISO-8859-1")  # => true
-  eric.match('Éric', 'Prié')                     # => true
-  eric.match('Eric', 'Prie')                     # => false
-  eric.match('Eric', 'Prie', :ascii => true)     # => true
 == Author
 Mark Orr, rating officer for the Irish Chess Union (ICU[http://icu.ie]).

data/lib/icu_name/name.rb CHANGED

@@ -1,3 +1,4 @@
+# encoding: UTF-8
 require 'active_support'
 require 'active_support/inflector/transliterate'
 require 'active_support/core_ext/string/multibyte'
@@ -6,32 +7,28 @@ module ICU
   class Name
     # Construct from one or two strings or any objects that have a to_s method.
-    def initialize(name1='', name2='', opt={})
+    def initialize(name1='', name2='')
       @name1 = Util.to_utf8(name1.to_s)
       @name2 = Util.to_utf8(name2.to_s)
       originalize
-      if opt[:ascii]
-        @name1 = ActiveSupport::Inflector.transliterate(@name1)
-        @name2 = ActiveSupport::Inflector.transliterate(@name2)
-      end
       canonicalize
     end
     # Original text getter.
     def original(opts={})
-      return ActiveSupport::Inflector.transliterate(@original) if opts[:ascii]
+      return transliterate(@original, opts[:chars]) if opts[:chars]
       @original
     end
     # First name getter.
     def first(opts={})
-      return ActiveSupport::Inflector.transliterate(@first) if opts[:ascii]
+      return transliterate(@first, opts[:chars]) if opts[:chars]
       @first
     end
     # Last name getter.
     def last(opts={})
-      return ActiveSupport::Inflector.transliterate(@last) if opts[:ascii]
+      return transliterate(@last, opts[:chars]) if opts[:chars]
       @last
     end
@@ -60,8 +57,8 @@ module ICU
     # Match another name to this object, returning true or false.
     def match(name1='', name2='', opts={})
-      other = Name.new(name1, name2, opts)
-      match_first(first(opts), other.first) && match_last(last(opts), other.last)
+      other = Name.new(name1, name2)
+      match_first(first(opts), other.first(opts)) && match_last(last(opts), other.last(opts))
     end
     # :stopdoc:
@@ -73,6 +70,18 @@ module ICU
       @original.strip!
       @original.gsub!(/\s+/, ' ')
     end
+    # Transliterate characters to ASCII or Latin1.
+    def transliterate(str, chars='US-ASCII')
+      case chars
+      when /^(US-?)?ASCII/i
+        ActiveSupport::Inflector.transliterate(str)
+      when /^(Windows|CP)-?1252|ISO-?8859-?1|Latin(-?1)?$/i
+        str.gsub(/./) { |m| m.ord < 256 ? m : ActiveSupport::Inflector.transliterate(m) }
+      else
+        str.dup
+      end
+    end
     # Canonicalise the first and last names.
     def canonicalize
@@ -106,7 +115,15 @@ module ICU
     # Clean up characters in any name keeping only letters (including accented), hyphens, and single quotes.
     def clean(name)
       name.gsub!(/`/, "'")
-      name.gsub!(/[^-a-zA-Z\u{c0}-\u{d6}\u{d8}-\u{f6}\u{f8}-\u{ff}.'\s]/, '')
+      name.gsub!(/./) do |m|
+        if m.ord < 256
+          # Keep Latin1 accented letters.
+          m.match(/^[-a-zA-Z\u{c0}-\u{d6}\u{d8}-\u{f6}\u{f8}-\u{ff}.'\s]$/) ? m : ''
+        else
+          # Keep ASCII characters with diacritics (e.g. Polish ł and Ś).
+          transliterate(m) == '?' ? '' : m
+        end
+      end
       name.gsub!(/\./, ' ')
       name.gsub!(/\s*-\s*/, '-')
       name.gsub!(/'+/, "'")
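
The two additions to name.rb above (the charset-dependent _transliterate_ helper and the per-character filter in _clean_) can be tried out without the gem using the rough sketch below. It is an illustration under stated assumptions, not the gem's code: `ASCII_MAP`, `to_ascii`, `transliterate_sketch` and `clean_sketch` are hypothetical names, and the small lookup table is only a stand-in for `ActiveSupport::Inflector.transliterate`, covering just the characters used in the README examples.

```ruby
# Hypothetical stand-in for ActiveSupport::Inflector.transliterate:
# a tiny lookup table covering only the characters used in this sketch.
ASCII_MAP = {
  "é" => "e", "É" => "E", "ó" => "o",
  "ł" => "l", "Ł" => "L", "ś" => "s", "Ś" => "S",
  "ń" => "n", "ż" => "z", "Ż" => "Z"
}.freeze

# Map a single character to ASCII. Non-ASCII characters missing from
# the table become "?", mimicking the replacement the diff tests for.
def to_ascii(char)
  return char if char.ord < 128
  ASCII_MAP.fetch(char, "?")
end

# Same dispatch as the diff: the charset name selects how much to transliterate.
def transliterate_sketch(str, chars = "US-ASCII")
  case chars
  when /^(US-?)?ASCII/i
    # Transliterate every character.
    str.gsub(/./) { |m| to_ascii(m) }
  when /^(Windows|CP)-?1252|ISO-?8859-?1|Latin(-?1)?$/i
    # Keep code points below 256 (Latin-1), transliterate the rest.
    str.gsub(/./) { |m| m.ord < 256 ? m : to_ascii(m) }
  else
    # Unrecognised charset name: return an unchanged copy.
    str.dup
  end
end

# Per-character filter modelled on the new body of clean.
def clean_sketch(name)
  name.gsub(/./) do |m|
    if m.ord < 256
      # Keep ASCII and Latin-1 letters, hyphens, dots, quotes and white space.
      m.match?(/\A[-a-zA-Z\u{c0}-\u{d6}\u{d8}-\u{f6}\u{f8}-\u{ff}.'\s]\z/) ? m : ""
    else
      # Keep a character outside Latin-1 only if it has an ASCII counterpart.
      to_ascii(m) == "?" ? "" : m
    end
  end
end

puts transliterate_sketch("Żabiński, Józef", "ISO-8859-1")  # Latin-1 letters survive
puts transliterate_sketch("Żabiński, Józef", "US-ASCII")    # everything is transliterated
puts clean_sketch("Bartłomiej Śliwa")                        # Polish letters are preserved
```

One detail of the pattern as written in the diff: `^` binds only to the first alternative and `$` only to the last, so a value such as "ISO-8859-15" would also appear to select the Latin-1 branch.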

data/lib/icu_name/version.rb CHANGED

@@ -2,6 +2,6 @@
 module ICU
   class Name
-    VERSION = "0.0.7"
+    VERSION = "0.1.0"
   end
 end

data/spec/name_spec.rb CHANGED

@@ -66,29 +66,25 @@ module ICU
       end
       it "characters and encoding" do
-        josef = Name.new('Józef', 'Żabiński')
-        josef.name.should == "Józef Abiski"
-        josef.original.should == "Józef Żabiński"
-        josef.original(:ascii => true).should == "Jozef Zabinski"
-        josef = Name.new('Józef', 'Żabiński', :ascii => true)
-        josef.name.should == "Jozef Zabinski"
-        bu = Name.new('Bǔ Xiángzhì')
-        bu.name.should == "B. Xiángzhì"
-        eric = Name.new('éric', 'PRIÉ')
-        eric.rname.should == "Prié, Éric"
-        eric.rname.encoding.name.should == "UTF-8"
+        ICU::Name.new('éric', 'PRIÉ').name.should == "Éric Prié"
+        ICU::Name.new('BARTŁOMIEJ', 'śliwa').name.should == "Bartłomiej Śliwa"
+        ICU::Name.new(' 渡井美代子').name.should == ""
         eric = Name.new('éric'.encode("ISO-8859-1"), 'PRIÉ'.force_encoding("ASCII-8BIT"))
         eric.rname.should == "Prié, Éric"
         eric.rname.encoding.name.should == "UTF-8"
         eric.original.should == "éric PRIÉ"
-        eric.original(:ascii => true).should == "eric PRIE"
         eric.original.encoding.name.should == "UTF-8"
-        eric.name(:ascii => true).should == "Eric Prie"
-        eric_ascii = Name.new('éric', 'PRIÉ', :ascii => true)
-        eric_ascii.name.should == "Eric Prie"
+        eric.rname(:chars => "US-ASCII").should == "Prie, Eric"
+        eric.original(:chars => "US-ASCII").should == "eric PRIE"
+        joe = Name.new('Józef', 'Żabiński')
+        joe.rname.should == "Żabiński, Józef"
+        joe.rname(:chars => "ISO-8859-1").should == "Zabinski, Józef"
+        joe.rname(:chars => "US-ASCII").should == "Zabinski, Jozef"
         eric.match('Éric', 'Prié').should be_true
         eric.match('Eric', 'Prie').should be_false
-        eric.match('Eric', 'Prie', :ascii => true).should be_true
+        eric.match('Eric', 'Prie', :chars => "US-ASCII").should be_true
+        joe.match('Józef', 'Zabinski').should be_false
+        joe.match('Józef', 'Zabinski', :chars => "ISO-8859-1").should be_true
       end
     end
@@ -244,7 +240,7 @@ module ICU
     context "transliteration" do
       before(:all) do
-        @opt = { :ascii => true }
+        @opt = { :chars => "US-ASCII" }
       end
       it "should be a no-op for names that already ASCII" do
@@ -267,12 +263,6 @@ module ICU
         name.first(@opt).should == 'Eric'
         name.last(@opt).should == 'Prie'
       end
-      it "should work for the constructor as well as accessors" do
-        name = Name.new('Gearóidín', 'Uí Laighléis', @opt)
-        name.first.should == 'Gearoidin'
-        name.last.should == 'Ui Laighleis'
-      end
     end
     context "constuction corner cases" do
@@ -280,7 +270,6 @@ module ICU
         Name.new('Orr').name.should == 'Orr'
         Name.new('Orr').rname.should == 'Orr'
         Name.new('Uí Laighléis').rname.should == 'Laighléis, Uí'
-        Name.new('', 'Uí Laighléis', :ascii => true).last.should == 'Ui Laighleis'
         Name.new('').name.should == ''
         Name.new('').rname.should == ''
         Name.new.name.should == ''
@@ -367,8 +356,8 @@ module ICU
       end
       it "the matching of accented characters can be relaxed" do
-        Name.new('Gearóidín', 'Uí Laighléis').match('Gearoidin', 'Ui Laíghleis', :ascii => true).should be_true
-        Name.new('Èric-K.', 'Cantona').match('E. K.', 'Cantona', :ascii => true).should be_true
+        Name.new('Gearóidín', 'Uí Laighléis').match('Gearoidin', 'Ui Laíghleis', :chars => "US-ASCII").should be_true
+        Name.new('Èric-K.', 'Cantona').match('E. K.', 'Cantona', :chars => "US-ASCII").should be_true
       end
     end
   end
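
The relaxed-matching specs above rely on the same _chars_ option being applied to both names before they are compared, which 0.1.0 does through the accessors rather than the constructor. A minimal sketch of that idea follows. It is not the gem's implementation: `strip_accents` and `relaxed_equal?` are hypothetical helpers, accents are removed via Unicode decomposition instead of ActiveSupport, and any truthy `:chars` value is treated alike for simplicity.

```ruby
# Hypothetical helper: remove diacritics by decomposing to NFD and
# dropping the combining marks (a swapped-in technique, not ActiveSupport).
def strip_accents(str)
  str.unicode_normalize(:nfd).gsub(/\p{Mn}/, "")
end

# Compare two name parts, normalising BOTH sides when :chars is given.
def relaxed_equal?(a, b, opts = {})
  if opts[:chars]
    a = strip_accents(a)
    b = strip_accents(b)
  end
  a == b
end

puts relaxed_equal?("Prié", "Prie")                        # strict comparison
puts relaxed_equal?("Prié", "Prie", :chars => "US-ASCII")  # relaxed comparison
```

Normalising only one side would leave an accented and an unaccented spelling unequal, which is why the new _match_ passes `opts` to the accessors of both name objects.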

metadata CHANGED

@@ -4,9 +4,9 @@ version: !ruby/object:Gem::Version
   prerelease: false
   segments:
   - 0
+  - 1
   - 0
-  - 7
-  version: 0.0.7
+  version: 0.1.0
 platform: ruby
 authors:
 - Mark Orr