icu_name 1.0.16 → 1.1.0

Files changed (7)

data/README.rdoc CHANGED

@@ -1,6 +1,8 @@
 = ICU Tournament
-Canonicalises and matches person names with Western European characters and first and last names.
+Canonicalises and matches person names with Western European characters.
+Note: version 1.1.0 dropped support for characters beyond codepoint 255 and became independent of activesupport and i18n.
 == Installation
@@ -8,14 +10,12 @@ For ruby 1.9.2, 1.9.3, 2.0.0.
   gem install icu_name
-It depends on _active_support_ and _i18n_.
 == Names
 This class exists for two main purposes:
-* to normalise to a common format the different ways names are typed in practice
-* to be able to match two names even if they are not exactly the same
+* to normalise to a common format the different ways Irish person names are typed in practice
+* to be able to match two names even if they are not exactly the same in their original form
 To create a name object, supply both the first and second names separately to the constructor.
@@ -36,7 +36,6 @@ supply the two separately. If the full name is supplied alone to the constructor
 of where the first names end, then the last distinct name is assumed to be the last name.
   bobby = ICU::Name.new(' bobby  fischer ')
   bobby.first                                                 # => 'Bobby'
   bobby.last                                                  # => 'Fischer'
@@ -77,13 +76,11 @@ Some other ways last names are canonicalised are illustrated below:
 == Characters and Encoding
 The class can only cope with Latin characters, including those with diacritics (accents).
-Along with hyphens and single quotes (which represent apostophes) letters in ISO-8859-1
-(e.g. "a", "è", "Ö") and letters outside ISO-8859-1 which are decomposable into a US-ASCII
-character plus one or more diacritics (e.g. "ł" or "Ś") are preserved, while everything
-else is removed.
+Hyphens, single quotes (which represent apostophes) and letters in the ISO-8859-1 range
+(e.g. "a", "è", "Ö") are preserved, while everything else is removed (unsupported).
   ICU::Name.new('éric', 'PRIÉ').name                          # => "Éric Prié"
-  ICU::Name.new('BARTŁOMIEJ', 'śliwa').name                   # => "Bartłomiej Śliwa"
+  ICU::Name.new('BARTŁOMIEJ', 'śliwa').name                   # => "Bartomiej Liwa"
   ICU::Name.new('Սմբատ', 'Լպուտյան').name                     # => ""
 The various accessors (<tt>first</tt>, <tt>last</tt>, <tt>name</tt>, <tt>rname</tt>, <tt>to_s</tt>, <tt>original</tt>) always return
@@ -101,21 +98,11 @@ Accented letters can be transliterated into their US-ASCII counterparts by setti
   eric.rname(:chars => "US-ASCII")                            # => "Prie, Eric"
   eric.original(:chars => "US-ASCII")                         # => "PRIE, eric"
-Also possible is the preservation of ISO-8859-1 characters, but the transliteration of
-all other accented characters:
-  joe = Name.new('Józef', 'Żabiński')
-  joe.rname                                                   # => "Żabiński, Józef"
-  joe.rname(:chars => "ISO-8859-1")                           # => "Zabinski, Józef"
-  joe.rname(:chars => "US-ASCII")                             # => "Zabinski, Jozef"
 Note that the character encoding of the strings returned is still UTF-8 in all cases.
 The same option also relaxes the need for accented characters to match exactly:
   eric.match('Eric', 'Prie')                                  # => false
   eric.match('Eric', 'Prie', :chars => "US-ASCII")            # => true
-  joe.match('Józef', 'Zabinski')                              # => false
-  joe.match('Józef', 'Zabinski', :chars => "ISO-8859-1")      # => true
 == Customization of Alternative Names
@@ -153,7 +140,7 @@ To change alternative name behaviour, you can replace the default alternatives
 with a customized set perhaps stored in a database or a YAML file, as illustrated below:
   data = YAML.load(File open "my_last_name_alternatives.yaml")
-  Name.load_alternatives(:first, data)
+  Name.load_alternatives(:last, data)
   data = YAML.load(File open "my_first_name_alternatives.yaml")
   Name.load_alternatives(:first, data)
@@ -173,8 +160,8 @@ so that now:
   Name.new("Stephen", "Hanly").match("Steven", "Hanly")       # => true
 This kind of rule risks producing false positives - you must judge
-carefully whether that risk is outweighed by the benefits of being
-able to overcome spelling mistakes in the context of your application.
+whether that risk is outweighed by the benefits of being able to overcome
+spelling mistakes in the context of your application.
 Another use is to cater for English and Irish versions of the same name.
 For example, for last names:
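
Note on the README change above: the new "Bartomiej Liwa" result follows from dropping every character above codepoint 255. A minimal standalone sketch of that filter, assuming only the character class shown in the 1.1.0 `clean` method; `KEEP` and `keep_latin1` are hypothetical names, not part of icu_name:

```ruby
# encoding: UTF-8

# Hypothetical helper (not part of icu_name). It keeps only characters
# accepted by the character class used in the 1.1.0 `clean` method and
# silently drops everything else, including letters above codepoint 255.
KEEP = /\A[-a-zA-Z\u{c0}-\u{d6}\u{d8}-\u{f6}\u{f8}-\u{ff}.'\s]\z/

def keep_latin1(str)
  str.each_char.select { |c| c.ord < 256 && c.match(KEEP) }.join
end

keep_latin1("PRIÉ")        # => "PRIÉ"      (É is U+00C9, inside Latin-1)
keep_latin1("BARTŁOMIEJ")  # => "BARTOMIEJ" (Ł is U+0141, dropped)
keep_latin1("śliwa")       # => "liwa"      (ś is U+015B, dropped)
keep_latin1("Սմբատ")       # => ""          (Armenian letters, all dropped)
```

The sketch covers only the filtering step; the gem then downcases and capitalizes each part, which is how "BARTOMIEJ" ends up as "Bartomiej".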

data/lib/icu_name/name.rb CHANGED

@@ -1,7 +1,4 @@
 # encoding: UTF-8
-require 'active_support'
-require 'active_support/inflector/transliterate'
-require 'active_support/core_ext/string/multibyte'
 module ICU
   class Name
@@ -20,7 +17,6 @@ module ICU
       @name2 = Util.to_utf8(name2.to_s)
       originalize
       canonicalize
-      repair
       @first.freeze
       @last.freeze
       @original.freeze
@@ -94,13 +90,10 @@ module ICU
       @original.gsub!(/\s+/, ' ')
     end
-    # Transliterate characters to ASCII or Latin1.
+    # Transliterate characters to ASCII.
     def transliterate(str, chars='US-ASCII')
-      case chars
-      when /^(US-?)?ASCII/i
-        ActiveSupport::Inflector.transliterate(str)
-      when /^(Windows|CP)-?1252|ISO-?8859-?1|Latin(-?1)?$/i
-        str.gsub(/./) { |m| m.ord < 256 ? m : ActiveSupport::Inflector.transliterate(m) }
+      if chars.match(/ASCII/i)
+        Util.transliterate(str)
       else
         str.dup
       end
@@ -139,49 +132,38 @@ module ICU
     def clean(name)
       name.gsub!(/[`‘’′‛]/, "'")
       name.gsub!(/./) do |m|
-        if m.ord < 256
-          # Keep Latin1 accented letters.
-          m.match(/^[-a-zA-Z\u{c0}-\u{d6}\u{d8}-\u{f6}\u{f8}-\u{ff}.'\s]$/) ? m : ''
-        else
-          # Keep ASCII characters with diacritics (e.g. Polish ł and Ś).
-          transliterate(m) == '?' ? '' : m
-        end
+        # Keep only hyphens, normal characters, accented Latin1, full stops, single quotes and spaces.
+        m.ord < 256 && m.match(/\A[-a-zA-Z\u{c0}-\u{d6}\u{d8}-\u{f6}\u{f8}-\u{ff}.'\s]\z/) ? m : ''
       end
       name.gsub!(/\./, ' ')
       name.gsub!(/\s*-\s*/, '-')
       name.gsub!(/'+/, "'")
-      name.strip.mb_chars.downcase.split(/\s+/).map do |n|
+      name.strip!
+      name = Util.downcase(name)
+      name.split(/\s+/).map do |n|
         n.sub!(/^-+/, '')
         n.sub!(/-+$/, '')
         n.split(/-/).map do |p|
-          p.capitalize!
+          Util.capitalize(p)
         end.join('-')
-      end.join(' ').to_s
-    end
-    # Try to ensure the encoding is UTF-8. This wasn't necessary before but some upgrade caused a change
-    # in behaviour. Since UTF-8 and ASCII are compatible encodings, it's probably not necessary to do
-    # this but I like to keep everything in the same encoding.
-    def repair
-      @first.force_encoding('UTF-8') if @first.encoding.name == "US-ASCII"
-      @last.force_encoding('UTF-8')  if @last.encoding.name == "US-ASCII"
+      end.join(' ')
     end
-    # Apply final touches to finish canonicalising a first name mb_chars object, returning a normal string.
+    # Apply final touches to finish canonicalising a first name.
     def finish_first(names)
       names.gsub(/([A-Z\u{c0}-\u{de}])\b/, '\1.')
     end
-    # Apply final touches to finish canonicalising a last name mb_chars object, returning a normal string.
+    # Apply final touches to finish canonicalising a last name.
     def finish_last(names)
-      names.gsub!(/\b([A-Z\u{c0}-\u{de}]')([a-z\u{e0}-\u{ff}])/) { |m| $1 << $2.mb_chars.upcase.to_s }
-      names.gsub!(/\b(Mc)([a-z\u{e0}-\u{ff}])/) { |m| $1 << $2.mb_chars.upcase.to_s }
+      names.gsub!(/\b([A-Z\u{c0}-\u{de}]')([a-z\u{e0}-\u{ff}])/) { |m| $1 + Util.upcase($2) }
+      names.gsub!(/\b(Mc)([a-z\u{e0}-\u{ff}])/) { |m| $1 + Util.upcase($2) }
       names.gsub!(/\bMac([a-z\u{e0}-\u{ff}])/) do |m|
         letter = $1  # capitalize after "Mac" only if the original clearly indicates it
-        upper = letter.mb_chars.upcase.to_s
+        upper = Util.upcase(letter)
         'Mac'.concat(@original.match(/\bMac#{upper}/) ? upper : letter)
       end
-      names.gsub!(/\bO ([A-Z\u{c0}-\u{de}])/) { |m| "O'" << $1 }
+      names.gsub!(/\bO ([A-Z\u{c0}-\u{de}])/) { |m| "O'" + $1 } # O Kelly => "O'Kelly"
       names
     end
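
Note on `finish_last` above: the substitutions can be tried in isolation. A minimal sketch restricted to ASCII letters, assuming the input has already been capitalized word by word; `finish_last_sketch` is a hypothetical name, and the "Mac" rule is left out because it depends on the original, uncanonicalised input:

```ruby
# Hypothetical, ASCII-only approximation of three of the last-name rules.
def finish_last_sketch(name)
  name = name.gsub(/\b([A-Z]')([a-z])/) { $1 + $2.upcase }  # O'kelly  => O'Kelly
  name = name.gsub(/\b(Mc)([a-z])/)     { $1 + $2.upcase }  # Mcdonald => McDonald
  name.gsub(/\bO ([A-Z])/) { "O'" + $1 }                    # O Kelly  => O'Kelly
end

finish_last_sketch("O'kelly")   # => "O'Kelly"
finish_last_sketch("Mcdonald")  # => "McDonald"
finish_last_sketch("O Kelly")   # => "O'Kelly"
```

Using `+` instead of `<<` inside the blocks, as the 1.1.0 code does, builds a new string rather than mutating the capture held in `$1`.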

data/lib/icu_name/util.rb CHANGED

@@ -1,5 +1,12 @@
+# encoding: UTF-8
 module ICU
   module Util
+    LOWER_CHARS      = "àáâãäåæçèéêëìíîïñòóôõöøùúûüýþ"
+    UPPER_CHARS      = "ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÑÒÓÔÕÖØÙÚÛÜÝÞ"
+    ACCENTED_CHARS   = "ÀÁÂÃÄÅÈÉÊËÌÍÎÏÑÒÓÔÕÖÙÚÛÜÝàáâãäåèéêëìíîïñòóôõöùúûüý"
+    UNACCENTED_CHARS = "AAAAAAEEEEIIIINOOOOOUUUUYaaaaaaeeeeiiiinooooouuuuy"
     # Decide if a string is valid UTF-8 or not, returning true or false.
     def self.is_utf8(str)
       dup = str.dup
@@ -15,5 +22,31 @@ module ICU
       dup.force_encoding("Windows-1252") if dup.encoding.name.match(/^(ASCII-8BIT|UTF-8)$/)
       dup.encode("UTF-8")
     end
+    # Upcase a UTF-8 string that might contain accented characters.
+    def self.upcase(str)
+      str = str.upcase
+      return str if str.ascii_only?
+      str.tr(LOWER_CHARS, UPPER_CHARS)
+    end
+    # Downcase a UTF-8 string that might contain accented characters.
+    def self.downcase(str)
+      str = str.downcase
+      return str if str.ascii_only?
+      str.tr(UPPER_CHARS, LOWER_CHARS)
+    end
+    # Capilalize a UTF-8 string that might contain accented characters.
+    def self.capitalize(str)
+      return str.capitalize if str.ascii_only? || !str.match(/\A(.)(.*)\z/)
+      upcase($1) + downcase($2)
+    end
+    # Transliterate Latin-1 accented characters to ASCII.
+    def self.transliterate(str)
+      return str.dup if str.ascii_only?
+      str.tr(ACCENTED_CHARS, UNACCENTED_CHARS)
+    end
   end
 end
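
Note on the new `Util` helpers above: they replace the ActiveSupport calls with `String#tr` lookups over paired character tables. A minimal sketch of the same idea, using shortened tables and hypothetical names (`LOWER`, `UPPER`, `ACCENTED`, `PLAIN`, `sketch_upcase`, `sketch_transliterate`) rather than the gem's own constants:

```ruby
# encoding: UTF-8

# Illustrative tables, deliberately shorter than the constants in util.rb.
# Each pair must line up character by character for String#tr to map correctly.
LOWER    = "àáâäçèéêëíîïñóôöúûü"
UPPER    = "ÀÁÂÄÇÈÉÊËÍÎÏÑÓÔÖÚÛÜ"
ACCENTED = "ÀÁÂÄÈÉÊËÍÎÏÑÓÔÖÚÛÜàáâäèéêëíîïñóôöúûü"
PLAIN    = "AAAAEEEEIIINOOOUUUaaaaeeeeiiinooouuu"

# Upcase ASCII letters with String#upcase, then map any remaining
# lower-case accented letters through the table.
def sketch_upcase(str)
  str.upcase.tr(LOWER, UPPER)
end

# Replace accented letters with their unaccented ASCII counterparts.
def sketch_transliterate(str)
  str.tr(ACCENTED, PLAIN)
end

sketch_upcase("Gearóidín")          # => "GEARÓIDÍN"
sketch_transliterate("Prié, Éric")  # => "Prie, Eric"
```

On the Ruby versions this release targets (1.9.2 to 2.0.0), `String#upcase` only changes ASCII letters, so the `tr` step does the work for accented ones. On Rubies where `upcase` already handles them, the `tr` call finds nothing left to change and the result is the same.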

data/lib/icu_name/version.rb CHANGED

@@ -2,6 +2,6 @@
 module ICU
   class Name
-    VERSION = "1.0.16"
+    VERSION = "1.1.0"
   end
 end

data/spec/name_spec.rb CHANGED

@@ -79,7 +79,7 @@ module ICU
       it "characters and encoding" do
         ICU::Name.new('éric', 'PRIÉ').name.should == "Éric Prié"
-        ICU::Name.new('BARTŁOMIEJ', 'śliwa').name.should == "Bartłomiej Śliwa"
+        ICU::Name.new('BARTŁOMIEJ', 'śliwa').name.should == "Bartomiej Liwa"
         ICU::Name.new('Սմբատ', 'Լպուտյան').name.should == ""
         eric = Name.new('éric'.encode("ISO-8859-1"), 'PRIÉ'.force_encoding("ASCII-8BIT"))
         eric.rname.should == "Prié, Éric"
@@ -88,15 +88,9 @@ module ICU
         eric.original.encoding.name.should == "UTF-8"
         eric.rname(:chars => "US-ASCII").should == "Prie, Eric"
         eric.original(:chars => "US-ASCII").should == "PRIE, eric"
-        joe = Name.new('Józef', 'Żabiński')
-        joe.rname.should == "Żabiński, Józef"
-        joe.rname(:chars => "ISO-8859-1").should == "Zabinski, Józef"
-        joe.rname(:chars => "US-ASCII").should == "Zabinski, Jozef"
         eric.match('Éric', 'Prié').should be_true
         eric.match('Eric', 'Prie').should be_false
         eric.match('Eric', 'Prie', :chars => "US-ASCII").should be_true
-        joe.match('Józef', 'Zabinski').should be_false
-        joe.match('Józef', 'Zabinski', :chars => "ISO-8859-1").should be_true
       end
     end

data/spec/util_spec.rb CHANGED

@@ -4,34 +4,62 @@ require File.expand_path(File.dirname(__FILE__) + '/spec_helper')
 module ICU
   describe Util do
     context "#is_utf8" do
-      it "should recognise US-ASCII as a special case of UTF-8" do
-        Util.is_utf8("Resume".encode("US-ASCII")).should be_true
+      it "recognises some encodings as a special case of UTF-8" do
+        expect(Util.is_utf8("Resume".encode("US-ASCII"))).to be_true
+        expect(Util.is_utf8("Resume".encode("ASCII-8BIT"))).to be_true
+        expect(Util.is_utf8("Resume".encode("BINARY"))).to be_true
       end
-      it "should recognise UTF-8" do
-        Util.is_utf8("Résumé").should be_true
-        Util.is_utf8("δog").should be_true
+      it "recognises UTF-8" do
+        expect(Util.is_utf8("Résumé")).to be_true
+        expect(Util.is_utf8("δog")).to be_true
       end
       it "should recognize other encodings as not being UTF-8" do
-        Util.is_utf8("Résumé".encode("ISO-8859-1")).should be_false
-        Util.is_utf8("€50".encode("Windows-1252")).should be_false
-        Util.is_utf8("ひらがな".encode("Shift_JIS")).should be_false
-        Util.is_utf8("\xa3").should be_false
+        expect(Util.is_utf8("Résumé".encode("ISO-8859-1"))).to be_false
+        expect(Util.is_utf8("€50".encode("Windows-1252"))).to be_false
+        expect(Util.is_utf8("ひらがな".encode("Shift_JIS"))).to be_false
+        expect(Util.is_utf8("\xa3")).to be_false
       end
     end
     context "#to_utf8" do
-      it "should convert to UTF-8" do
-        Util.to_utf8("Resume").should == "Resume"
-        Util.to_utf8("Resume".force_encoding("US-ASCII")).encoding.name.should == "UTF-8"
-        Util.to_utf8("Résumé".encode("ISO-8859-1")).should == "Résumé"
-        Util.to_utf8("Résumé".encode("Windows-1252")).should == "Résumé"
-        Util.to_utf8("€50".encode("Windows-1252")).should == "€50"
-        Util.to_utf8("\xa350".force_encoding("ASCII-8BIT")).should == "£50"
-        Util.to_utf8("\xa350").should == "£50"
-        Util.to_utf8("ひらがな".encode("Shift_JIS")).should == "ひらがな"
+      it "converts to UTF-8" do
+        expect(Util.to_utf8("Resume")).to eq "Resume"
+        expect(Util.to_utf8("Resume".force_encoding("US-ASCII")).encoding.name).to eq "UTF-8"
+        expect(Util.to_utf8("Résumé".encode("ISO-8859-1"))).to eq "Résumé"
+        expect(Util.to_utf8("Résumé".encode("Windows-1252"))).to eq "Résumé"
+        expect(Util.to_utf8("€50".encode("Windows-1252"))).to eq "€50"
+        expect(Util.to_utf8("\xa350".force_encoding("ASCII-8BIT"))).to eq "£50"
+        expect(Util.to_utf8("\xa350")).to eq "£50"
+        expect(Util.to_utf8("ひらがな".encode("Shift_JIS"))).to eq "ひらがな"
+      end
+    end
+    context "#downcase" do
+      it "downcases characters in the Latin-1 range" do
+        expect(Util.downcase("Eric")).to eq "eric"
+        expect(Util.downcase("Éric")).to eq "éric"
+        expect(Util.downcase("ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÑÒÓÔÕÖØÙÚÛÜÝÞ")).to eq "àáâãäåæçèéêëìíîïñòóôõöøùúûüýþ"
+      end
+    end
+    context "#upcase" do
+      it "upcases characters in the Latin-1 range" do
+        expect(Util.upcase("Gearoidin")).to eq "GEAROIDIN"
+        expect(Util.upcase("Gearóidín")).to eq "GEARÓIDÍN"
+        expect(Util.upcase("àáâãäåæçèéêëìíîïñòóôõöøùúûüýþ")).to eq "ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÑÒÓÔÕÖØÙÚÛÜÝÞ"
+      end
+    end
+    context "#capitalize" do
+      it "capitalizes strings that might contain accented characters" do
+        expect(Util.capitalize("gearoidin")).to eq "Gearoidin"
+        expect(Util.capitalize("GEAROIDIN")).to eq "Gearoidin"
+        expect(Util.capitalize("gEAróiDÍn")).to eq "Gearóidín"
+        expect(Util.capitalize("ériC")).to eq "Éric"
+        expect(Util.capitalize("ÉRIc")).to eq "Éric"
       end
     end
   end
-end
+end

metadata CHANGED

@@ -1,7 +1,7 @@
 --- !ruby/object:Gem::Specification
 name: icu_name
 version: !ruby/object:Gem::Version
-  version: 1.0.16
+  version: 1.1.0
   prerelease:
 platform: ruby
 authors:
@@ -9,40 +9,8 @@ authors:
 autorequire:
 bindir: bin
 cert_chain: []
-date: 2013-06-05 00:00:00.000000000 Z
+date: 2013-07-22 00:00:00.000000000 Z
 dependencies:
-- !ruby/object:Gem::Dependency
-  name: activesupport
-  requirement: !ruby/object:Gem::Requirement
-    none: false
-    requirements:
-    - - ! '>='
-      - !ruby/object:Gem::Version
-        version: '0'
-  type: :runtime
-  prerelease: false
-  version_requirements: !ruby/object:Gem::Requirement
-    none: false
-    requirements:
-    - - ! '>='
-      - !ruby/object:Gem::Version
-        version: '0'
-- !ruby/object:Gem::Dependency
-  name: i18n
-  requirement: !ruby/object:Gem::Requirement
-    none: false
-    requirements:
-    - - ! '>='
-      - !ruby/object:Gem::Version
-        version: '0'
-  type: :runtime
-  prerelease: false
-  version_requirements: !ruby/object:Gem::Requirement
-    none: false
-    requirements:
-    - - ! '>='
-      - !ruby/object:Gem::Version
-        version: '0'
 - !ruby/object:Gem::Dependency
   name: bundler
   requirement: !ruby/object:Gem::Requirement
@@ -143,7 +111,7 @@ required_ruby_version: !ruby/object:Gem::Requirement
       version: '0'
       segments:
       - 0
-      hash: -1925954322170323442
+      hash: -2995260603720583581
 required_rubygems_version: !ruby/object:Gem::Requirement
   none: false
   requirements: