RubyGems - icu_name - Versions diffs - 0.0.6 → 0.0.7 - Mend

icu_name 0.0.6 → 0.0.7

Files changed (5)

data/README.rdoc CHANGED

@@ -26,6 +26,10 @@ Capitalisation, white space and punctuation will all be automatically corrected:
   robert.name                                    # => 'Robert J. Fischer'
   robert.rname                                   # => 'Fischer, Robert J.'  (reversed name)
+The input text, without any changes apart from white-space cleanup, is returned by the _original_ method:
+  robert.original                                # => 'robert j FISHER'
 To avoid ambiguity when either the first or second names consist of multiple words, it is better to
 supply the two separately, if known. However, the full name can be supplied alone to the constructor
 and a guess will be made as to the first and last names.
@@ -61,8 +65,8 @@ Some of the ways last names are canonicalised are illustrated below:
 == Characters and Encoding
 The class can only cope with Western European letter characters, including the accented ones in Latin-1.
-It's various accessors (_first_, _last_, _name_, _rname_, _to_s_) always return strings encoded in UTF-8,
-no matter what the input encoding.
+It's various accessors (_first_, _last_, _name_, _rname_, _to_s_, _original_) always return strings
+encoded in UTF-8, no matter what the input encoding.
   eric = ICU::Name.new('éric', 'PRIÉ')
   eric.rname                                     # => "Prié, Éric"
@@ -71,11 +75,13 @@ no matter what the input encoding.
   eric = ICU::Name.new('éric'.encode("ISO-8859-1"), 'PRIÉ'.force_encoding("ASCII-8BIT"))
   eric.rname                                     # => "Prié, Éric"
   eric.rname.encoding.name                       # => "UTF-8"
+  eric.original                                  # => "éric PRIÉ"
+  eric.original.encoding.name                    # => "UTF-8"
 Currently, all characters outside the Latin-1 range are removed as if they wern't there.
-  ICU::Name.new('Józef Żabiński').name           # "Józef Abiski"
-  ICU::Name.new('Bǔ Xiángzhì').name              # "B. Xiángzhì"
+  ICU::Name.new('Józef Żabiński').name           # => "Józef Abiski"
+  ICU::Name.new('Bǔ Xiángzhì').name              # => "B. Xiángzhì"
 Accented Latin-1 characters can be transliterated into their ascii counterparts by setting the
 _ascii_ option to a true value.
@@ -86,6 +92,8 @@ This works with all the other accessors and also with the constructor:
   eric_ascii = ICU::Name.new('éric', 'PRIÉ', :ascii => true)
   eric_ascii.name                                # => "Eric Prie"
+  jozef_ascii = ICU::Name.new('Józef', 'Żabiński', :ascii => true).name
+  jozef_ascii.name                               # => "Jozef Zabinski"
 The option also relaxes the need for accented characters to match exactly:
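
The new `:ascii` constructor example for 'Józef Żabiński' depends on transliteration happening before characters outside Latin-1 are dropped. The sketch below illustrates only that ordering effect. It is not the gem's code: `to_ascii` is a hypothetical stand-in for `ActiveSupport::Inflector.transliterate`, built on Ruby's own `unicode_normalize`, and `strip_non_latin1` is an assumed approximation of the clean-up the README describes.

```ruby
# Hypothetical stand-in for ActiveSupport::Inflector.transliterate:
# decompose accented letters, then drop the combining marks.
def to_ascii(text)
  text.unicode_normalize(:nfd).gsub(/\p{Mn}/, '')
end

# Assumed approximation of the README's rule: characters outside
# the Latin-1 range (above U+00FF) are removed.
def strip_non_latin1(text)
  text.each_char.select { |c| c.ord <= 0xFF }.join
end

name = "J\u00F3zef \u017Babi\u0144ski" # "Józef Żabiński"

# Clean first, transliterate last: the non-Latin-1 letters are already lost.
clean_then_ascii = to_ascii(strip_non_latin1(name))

# Transliterate first, then clean: every letter survives as ASCII.
ascii_then_clean = strip_non_latin1(to_ascii(name))

puts clean_then_ascii # Jozef abiski
puts ascii_then_clean # Jozef Zabinski
```

Under these assumptions, only the second ordering can produce "Jozef Zabinski", which is consistent with the README example added in this version.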

data/lib/icu_name/name.rb CHANGED

@@ -9,11 +9,18 @@ module ICU
     def initialize(name1='', name2='', opt={})
       @name1 = Util.to_utf8(name1.to_s)
       @name2 = Util.to_utf8(name2.to_s)
-      canonicalize
+      originalize
       if opt[:ascii]
-        @first = ActiveSupport::Inflector.transliterate(@first)
-        @last  = ActiveSupport::Inflector.transliterate(@last)
+        @name1 = ActiveSupport::Inflector.transliterate(@name1)
+        @name2 = ActiveSupport::Inflector.transliterate(@name2)
       end
+      canonicalize
+    end
+    # Original text getter.
+    def original(opts={})
+      return ActiveSupport::Inflector.transliterate(@original) if opts[:ascii]
+      @original
     end
     # First name getter.
@@ -60,6 +67,13 @@ module ICU
     # :stopdoc:
     private
+    # Save the original inputs without any cleanup other than whitespace.
+    def originalize
+      @original = "#{@name1} #{@name2}"
+      @original.strip!
+      @original.gsub!(/\s+/, ' ')
+    end
     # Canonicalise the first and last names.
     def canonicalize
       first, last = partition
@@ -70,7 +84,7 @@ module ICU
     # Split one complete name into first and last parts.
     def partition
       if @name2.length == 0
-        # Only one imput so we must split first and last.
+        # Only one input so we must split it into first and last.
         parts = @name1.split(/,/)
         if parts.size > 1
           last  = clean(parts.shift || '')
@@ -78,7 +92,7 @@ module ICU
         else
           parts = clean(@name1).split(/ /)
           last  = parts.pop || ''
-          last  = "#{parts.pop}'#{last}" if parts.size > 1 && parts.last == "O" && !last.match(/^O'/)
+          last  = "#{parts.pop}'#{last}" if parts.size > 1 && parts.last.match(/^O$/i) && !last.match(/^O'/i)  # "O", "Reilly" => "O'Reilly"
           first = parts.join(' ')
         end
       else
@@ -114,6 +128,11 @@ module ICU
     def finish_last(names)
       names.gsub!(/\b([A-Z\u{c0}-\u{de}]')([a-z\u{e0}-\u{ff}])/) { |m| $1 << $2.mb_chars.upcase.to_s }
       names.gsub!(/\b(Mc)([a-z\u{e0}-\u{ff}])/) { |m| $1 << $2.mb_chars.upcase.to_s }
+      names.gsub!(/\bMac([a-z\u{e0}-\u{ff}])/) do |m|
+        letter = $1  # capitalize after "Mac" only if the original clearly indicates it
+        upper = letter.mb_chars.upcase.to_s
+        'Mac'.concat(@original.match(/\bMac#{upper}/) ? upper : letter)
+      end
       names.gsub!(/\bO ([A-Z\u{c0}-\u{de}])/) { |m| "O'" << $1 }
       names
     end
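
The "Mac" rule added to `finish_last` can be read in isolation as follows. This is a minimal sketch, assuming the name has already been capitalised to the form "Macnab" before the rule runs; `finish_mac` is a hypothetical helper, and plain `String#upcase` (Unicode-aware from Ruby 2.4) stands in for the `mb_chars.upcase` call used in the gem.

```ruby
# Capitalise the letter after "Mac" only when the original input
# already showed it capitalised (e.g. "MacNab" but not "macnab").
def finish_mac(name, original)
  name.gsub(/\bMac([a-z\u{e0}-\u{ff}])/) do
    letter = Regexp.last_match(1)
    upper  = letter.upcase
    if original.match?(/\bMac#{Regexp.escape(upper)}/)
      "Mac#{upper}"   # the original clearly indicates a capital
    else
      "Mac#{letter}"  # otherwise leave the letter in lower case
    end
  end
end

puts finish_mac('Macnab',  'Colin MacNab')        # MacNab
puts finish_mac('Macnab',  'colin macnab')        # Macnab
puts finish_mac('Macieja', 'bartlomiej macieja')  # Macieja
```

Consulting the original text is what lets "MacNab" keep its capital while all-lowercase or all-uppercase input falls back to "Macnab", matching the new spec cases.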

data/lib/icu_name/version.rb CHANGED

@@ -2,6 +2,6 @@
 module ICU
   class Name
-    VERSION = "0.0.6"
+    VERSION = "0.0.7"
   end
 end

data/spec/name_spec.rb CHANGED

@@ -5,7 +5,7 @@ module ICU
   describe Name do
     context "public methods" do
       before(:each) do
-        @simple = Name.new('mark j l', 'orr')
+        @simple = Name.new('mark j l', 'ORR')
       end
       it "#first returns the first name(s)" do
@@ -28,6 +28,10 @@ module ICU
         @simple.to_s.should == 'Orr, Mark J. L.'
       end
+      it "#original returns the original data" do
+        @simple.original.should == 'mark j l ORR'
+      end
       it "#match returns true if and only if two names match" do
         @simple.match('mark j l orr').should be_true
         @simple.match('malcolm g l orr').should be_false
@@ -62,18 +66,25 @@ module ICU
       end
       it "characters and encoding" do
-        josef = ICU::Name.new('Józef', 'Żabiński')
+        josef = Name.new('Józef', 'Żabiński')
         josef.name.should == "Józef Abiski"
-        bu = ICU::Name.new('Bǔ Xiángzhì')
+        josef.original.should == "Józef Żabiński"
+        josef.original(:ascii => true).should == "Jozef Zabinski"
+        josef = Name.new('Józef', 'Żabiński', :ascii => true)
+        josef.name.should == "Jozef Zabinski"
+        bu = Name.new('Bǔ Xiángzhì')
         bu.name.should == "B. Xiángzhì"
-        eric = ICU::Name.new('éric', 'PRIÉ')
+        eric = Name.new('éric', 'PRIÉ')
         eric.rname.should == "Prié, Éric"
         eric.rname.encoding.name.should == "UTF-8"
-        eric = ICU::Name.new('éric'.encode("ISO-8859-1"), 'PRIÉ'.force_encoding("ASCII-8BIT"))
+        eric = Name.new('éric'.encode("ISO-8859-1"), 'PRIÉ'.force_encoding("ASCII-8BIT"))
         eric.rname.should == "Prié, Éric"
         eric.rname.encoding.name.should == "UTF-8"
+        eric.original.should == "éric PRIÉ"
+        eric.original(:ascii => true).should == "eric PRIE"
+        eric.original.encoding.name.should == "UTF-8"
         eric.name(:ascii => true).should == "Eric Prie"
-        eric_ascii = ICU::Name.new('éric', 'PRIÉ', :ascii => true)
+        eric_ascii = Name.new('éric', 'PRIÉ', :ascii => true)
         eric_ascii.name.should == "Eric Prie"
         eric.match('Éric', 'Prié').should be_true
         eric.match('Eric', 'Prie').should be_false
@@ -104,9 +115,12 @@ module ICU
       it "should be handled correctly" do
         Name.new('shane', "mccabe").name.should == "Shane McCabe"
         Name.new('shawn', "macdonagh").name.should == "Shawn Macdonagh"
+        Name.new('Colin', "MacNab").name.should == "Colin MacNab"
+        Name.new('colin', "macnab").name.should == "Colin Macnab"
         Name.new('bartlomiej', "macieja").name.should == "Bartlomiej Macieja"
         Name.new('türko', "mcözgür").name.should == "Türko McÖzgür"
         Name.new('TÜRKO', "MACÖZGÜR").name.should == "Türko Macözgür"
+        Name.new('Türko', "MacÖzgür").name.should == "Türko MacÖzgür"
       end
     end
@@ -171,6 +185,14 @@ module ICU
       end
     end
+    context "the original input" do
+      it "should be the original text unaltered except for white space" do
+        Name.new(' Mark   j l   ', ' ORR  ').original.should == 'Mark j l ORR'
+        Name.new('Józef', 'Żabiński').original.should == 'Józef Żabiński'
+        Name.new('Ui  Laigleis,Gearoidin').original.should == 'Ui Laigleis,Gearoidin'
+      end
+    end
     context "encoding" do
       before(:each) do
         @first = 'Gearóidín'

metadata CHANGED

@@ -5,8 +5,8 @@ version: !ruby/object:Gem::Version
   segments:
   - 0
   - 0
-  - 6
-  version: 0.0.6
+  - 7
+  version: 0.0.7
 platform: ruby
 authors:
 - Mark Orr
@@ -14,7 +14,7 @@ autorequire:
 bindir: bin
 cert_chain: []
-date: 2011-01-23 00:00:00 +00:00
+date: 2011-01-24 00:00:00 +00:00
 default_executable:
 dependencies:
 - !ruby/object:Gem::Dependency