icu_name 1.0.16 → 1.1.0

data/README.rdoc CHANGED
@@ -1,6 +1,8 @@
  = ICU Tournament
 
- Canonicalises and matches person names with Western European characters and first and last names.
+ Canonicalises and matches person names with Western European characters.
+
+ Note: version 1.1.0 dropped support for characters beyond codepoint 255 and became independent of activesupport and i18n.
 
  == Installation
 
@@ -8,14 +10,12 @@ For ruby 1.9.2, 1.9.3, 2.0.0.
 
  gem install icu_name
 
- It depends on _active_support_ and _i18n_.
-
  == Names
 
  This class exists for two main purposes:
 
- * to normalise to a common format the different ways names are typed in practice
- * to be able to match two names even if they are not exactly the same
+ * to normalise to a common format the different ways Irish person names are typed in practice
+ * to be able to match two names even if they are not exactly the same in their original form
 
  To create a name object, supply both the first and second names separately to the constructor.
 
@@ -36,7 +36,6 @@ supply the two separately. If the full name is supplied alone to the constructor
  of where the first names end, then the last distinct name is assumed to be the last name.
 
  bobby = ICU::Name.new(' bobby fischer ')
-
  bobby.first # => 'Bobby'
  bobby.last # => 'Fischer'
 
@@ -77,13 +76,11 @@ Some other ways last names are canonicalised are illustrated below:
  == Characters and Encoding
 
  The class can only cope with Latin characters, including those with diacritics (accents).
- Along with hyphens and single quotes (which represent apostophes) letters in ISO-8859-1
- (e.g. "a", "è", "Ö") and letters outside ISO-8859-1 which are decomposable into a US-ASCII
- character plus one or more diacritics (e.g. "ł" or "Ś") are preserved, while everything
- else is removed.
+ Hyphens, single quotes (which represent apostophes) and letters in the ISO-8859-1 range
+ (e.g. "a", "è", "Ö") are preserved, while everything else is removed (unsupported).
 
  ICU::Name.new('éric', 'PRIÉ').name # => "Éric Prié"
- ICU::Name.new('BARTŁOMIEJ', 'śliwa').name # => "Bartłomiej Śliwa"
+ ICU::Name.new('BARTŁOMIEJ', 'śliwa').name # => "Bartomiej Liwa"
  ICU::Name.new('Սմբատ', 'Լպուտյան').name # => ""
 
  The various accessors (<tt>first</tt>, <tt>last</tt>, <tt>name</tt>, <tt>rname</tt>, <tt>to_s</tt>, <tt>original</tt>) always return
@@ -101,21 +98,11 @@ Accented letters can be transliterated into their US-ASCII counterparts by setti
  eric.rname(:chars => "US-ASCII") # => "Prie, Eric"
  eric.original(:chars => "US-ASCII") # => "PRIE, eric"
 
- Also possible is the preservation of ISO-8859-1 characters, but the transliteration of
- all other accented characters:
-
- joe = Name.new('Józef', 'Żabiński')
- joe.rname # => "Żabiński, Józef"
- joe.rname(:chars => "ISO-8859-1") # => "Zabinski, Józef"
- joe.rname(:chars => "US-ASCII") # => "Zabinski, Jozef"
-
  Note that the character encoding of the strings returned is still UTF-8 in all cases.
  The same option also relaxes the need for accented characters to match exactly:
 
  eric.match('Eric', 'Prie') # => false
  eric.match('Eric', 'Prie', :chars => "US-ASCII") # => true
- joe.match('Józef', 'Zabinski') # => false
- joe.match('Józef', 'Zabinski', :chars => "ISO-8859-1") # => true
 
  == Customization of Alternative Names
 
@@ -153,7 +140,7 @@ To change alternative name behaviour, you can replace the default alternatives
  with a customized set perhaps stored in a database or a YAML file, as illustrated below:
 
  data = YAML.load(File open "my_last_name_alternatives.yaml")
- Name.load_alternatives(:first, data)
+ Name.load_alternatives(:last, data)
  data = YAML.load(File open "my_first_name_alternatives.yaml")
  Name.load_alternatives(:first, data)
 
@@ -173,8 +160,8 @@ so that now:
  Name.new("Stephen", "Hanly").match("Steven", "Hanly") # => true
 
  This kind of rule risks producing false positives - you must judge
- carefully whether that risk is outweighed by the benefits of being
- able to overcome spelling mistakes in the context of your application.
+ whether that risk is outweighed by the benefits of being able to overcome
+ spelling mistakes in the context of your application.
 
  Another use is to cater for English and Irish versions of the same name.
  For example, for last names:
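
A note on the "Characters and Encoding" change in the README hunks above: from 1.1.0 any character beyond codepoint 255 is simply dropped, which is why 'BARTŁOMIEJ' loses its "Ł". The following is a minimal sketch of that kind of filter, not the gem's own code. The constant `LATIN1_NAME_CHAR` and the helper `keep_latin1` are hypothetical names introduced here for illustration, and the sketch does no capitalisation or other canonicalisation.

```ruby
# encoding: UTF-8

# Hypothetical helper, not part of icu_name: keeps hyphens, full stops,
# single quotes, whitespace, ASCII letters and Latin-1 accented letters,
# and drops every other character.
LATIN1_NAME_CHAR = /\A[-a-zA-Z\u{c0}-\u{d6}\u{d8}-\u{f6}\u{f8}-\u{ff}.'\s]\z/

def keep_latin1(name)
  name.each_char.select { |c| c.ord < 256 && c =~ LATIN1_NAME_CHAR }.join
end

puts keep_latin1("BARTŁOMIEJ")  # BARTOMIEJ ("Ł" is U+0141, beyond 255)
puts keep_latin1("śliwa")       # liwa
puts keep_latin1("PRIÉ")        # PRIÉ ("É" is U+00C9, inside Latin-1)
puts keep_latin1("Սմբատ")       # prints an empty line
```

Filtering character by character keeps the rule easy to audit: a letter either falls in one of the listed ranges or it is removed.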
data/lib/icu_name/name.rb CHANGED
@@ -1,7 +1,4 @@
  # encoding: UTF-8
- require 'active_support'
- require 'active_support/inflector/transliterate'
- require 'active_support/core_ext/string/multibyte'
 
  module ICU
  class Name
@@ -20,7 +17,6 @@ module ICU
  @name2 = Util.to_utf8(name2.to_s)
  originalize
  canonicalize
- repair
  @first.freeze
  @last.freeze
  @original.freeze
@@ -94,13 +90,10 @@ module ICU
  @original.gsub!(/\s+/, ' ')
  end
 
- # Transliterate characters to ASCII or Latin1.
+ # Transliterate characters to ASCII.
  def transliterate(str, chars='US-ASCII')
- case chars
- when /^(US-?)?ASCII/i
- ActiveSupport::Inflector.transliterate(str)
- when /^(Windows|CP)-?1252|ISO-?8859-?1|Latin(-?1)?$/i
- str.gsub(/./) { |m| m.ord < 256 ? m : ActiveSupport::Inflector.transliterate(m) }
+ if chars.match(/ASCII/i)
+ Util.transliterate(str)
  else
  str.dup
  end
@@ -139,49 +132,38 @@ module ICU
  def clean(name)
  name.gsub!(/[`‘’′‛]/, "'")
  name.gsub!(/./) do |m|
- if m.ord < 256
- # Keep Latin1 accented letters.
- m.match(/^[-a-zA-Z\u{c0}-\u{d6}\u{d8}-\u{f6}\u{f8}-\u{ff}.'\s]$/) ? m : ''
- else
- # Keep ASCII characters with diacritics (e.g. Polish ł and Ś).
- transliterate(m) == '?' ? '' : m
- end
+ # Keep only hyphens, normal characters, accented Latin1, full stops, single quotes and spaces.
+ m.ord < 256 && m.match(/\A[-a-zA-Z\u{c0}-\u{d6}\u{d8}-\u{f6}\u{f8}-\u{ff}.'\s]\z/) ? m : ''
  end
  name.gsub!(/\./, ' ')
  name.gsub!(/\s*-\s*/, '-')
  name.gsub!(/'+/, "'")
- name.strip.mb_chars.downcase.split(/\s+/).map do |n|
+ name.strip!
+ name = Util.downcase(name)
+ name.split(/\s+/).map do |n|
  n.sub!(/^-+/, '')
  n.sub!(/-+$/, '')
  n.split(/-/).map do |p|
- p.capitalize!
+ Util.capitalize(p)
  end.join('-')
- end.join(' ').to_s
- end
-
- # Try to ensure the encoding is UTF-8. This wasn't necessary before but some upgrade caused a change
- # in behaviour. Since UTF-8 and ASCII are compatible encodings, it's probably not necessary to do
- # this but I like to keep everything in the same encoding.
- def repair
- @first.force_encoding('UTF-8') if @first.encoding.name == "US-ASCII"
- @last.force_encoding('UTF-8') if @last.encoding.name == "US-ASCII"
+ end.join(' ')
  end
 
- # Apply final touches to finish canonicalising a first name mb_chars object, returning a normal string.
+ # Apply final touches to finish canonicalising a first name.
  def finish_first(names)
  names.gsub(/([A-Z\u{c0}-\u{de}])\b/, '\1.')
  end
 
- # Apply final touches to finish canonicalising a last name mb_chars object, returning a normal string.
+ # Apply final touches to finish canonicalising a last name.
  def finish_last(names)
- names.gsub!(/\b([A-Z\u{c0}-\u{de}]')([a-z\u{e0}-\u{ff}])/) { |m| $1 << $2.mb_chars.upcase.to_s }
- names.gsub!(/\b(Mc)([a-z\u{e0}-\u{ff}])/) { |m| $1 << $2.mb_chars.upcase.to_s }
+ names.gsub!(/\b([A-Z\u{c0}-\u{de}]')([a-z\u{e0}-\u{ff}])/) { |m| $1 + Util.upcase($2) }
+ names.gsub!(/\b(Mc)([a-z\u{e0}-\u{ff}])/) { |m| $1 + Util.upcase($2) }
  names.gsub!(/\bMac([a-z\u{e0}-\u{ff}])/) do |m|
  letter = $1 # capitalize after "Mac" only if the original clearly indicates it
- upper = letter.mb_chars.upcase.to_s
+ upper = Util.upcase(letter)
  'Mac'.concat(@original.match(/\bMac#{upper}/) ? upper : letter)
  end
- names.gsub!(/\bO ([A-Z\u{c0}-\u{de}])/) { |m| "O'" << $1 }
+ names.gsub!(/\bO ([A-Z\u{c0}-\u{de}])/) { |m| "O'" + $1 } # O Kelly => "O'Kelly"
  names
  end
 
data/lib/icu_name/util.rb CHANGED
@@ -1,5 +1,12 @@
+ # encoding: UTF-8
+
  module ICU
  module Util
+ LOWER_CHARS = "àáâãäåæçèéêëìíîïñòóôõöøùúûüýþ"
+ UPPER_CHARS = "ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÑÒÓÔÕÖØÙÚÛÜÝÞ"
+ ACCENTED_CHARS = "ÀÁÂÃÄÅÈÉÊËÌÍÎÏÑÒÓÔÕÖÙÚÛÜÝàáâãäåèéêëìíîïñòóôõöùúûüý"
+ UNACCENTED_CHARS = "AAAAAAEEEEIIIINOOOOOUUUUYaaaaaaeeeeiiiinooooouuuuy"
+
  # Decide if a string is valid UTF-8 or not, returning true or false.
  def self.is_utf8(str)
  dup = str.dup
@@ -15,5 +22,31 @@ module ICU
  dup.force_encoding("Windows-1252") if dup.encoding.name.match(/^(ASCII-8BIT|UTF-8)$/)
  dup.encode("UTF-8")
  end
+
+ # Upcase a UTF-8 string that might contain accented characters.
+ def self.upcase(str)
+ str = str.upcase
+ return str if str.ascii_only?
+ str.tr(LOWER_CHARS, UPPER_CHARS)
+ end
+
+ # Downcase a UTF-8 string that might contain accented characters.
+ def self.downcase(str)
+ str = str.downcase
+ return str if str.ascii_only?
+ str.tr(UPPER_CHARS, LOWER_CHARS)
+ end
+
+ # Capilalize a UTF-8 string that might contain accented characters.
+ def self.capitalize(str)
+ return str.capitalize if str.ascii_only? || !str.match(/\A(.)(.*)\z/)
+ upcase($1) + downcase($2)
+ end
+
+ # Transliterate Latin-1 accented characters to ASCII.
+ def self.transliterate(str)
+ return str.dup if str.ascii_only?
+ str.tr(ACCENTED_CHARS, UNACCENTED_CHARS)
+ end
  end
  end
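
The transliteration added above replaces ActiveSupport with a fixed `String#tr` table, and the README says the same `:chars => "US-ASCII"` option relaxes matching. Below is a minimal stand-alone sketch of how a table like that can drive an accent-insensitive comparison. It copies the two tables from the diff, but the helper names `to_ascii` and `same_name?` are hypothetical and this is not how `ICU::Name#match` is implemented.

```ruby
# encoding: UTF-8

# Tables copied from the util.rb hunk; position i in ACCENTED maps to
# position i in UNACCENTED.
ACCENTED   = "ÀÁÂÃÄÅÈÉÊËÌÍÎÏÑÒÓÔÕÖÙÚÛÜÝàáâãäåèéêëìíîïñòóôõöùúûüý"
UNACCENTED = "AAAAAAEEEEIIIINOOOOOUUUUYaaaaaaeeeeiiiinooooouuuuy"

# Hypothetical helper: strip accents from Latin-1 letters covered by the table.
def to_ascii(str)
  str.tr(ACCENTED, UNACCENTED)
end

# Hypothetical helper: compare two names, optionally ignoring accents.
def same_name?(a, b, ascii = false)
  ascii ? to_ascii(a) == to_ascii(b) : a == b
end

puts to_ascii("Prié, Éric")              # Prie, Eric
puts same_name?("Prié", "Prie")          # false
puts same_name?("Prié", "Prie", true)    # true
```

Letters that are absent from the table, such as "Ç", "Æ", "Ø" and "Þ", pass through `tr` unchanged, so the result is only guaranteed to be ASCII for input made of the listed characters.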
@@ -2,6 +2,6 @@
 
  module ICU
  class Name
- VERSION = "1.0.16"
+ VERSION = "1.1.0"
  end
  end
data/spec/name_spec.rb CHANGED
@@ -79,7 +79,7 @@ module ICU
 
  it "characters and encoding" do
  ICU::Name.new('éric', 'PRIÉ').name.should == "Éric Prié"
- ICU::Name.new('BARTŁOMIEJ', 'śliwa').name.should == "Bartłomiej Śliwa"
+ ICU::Name.new('BARTŁOMIEJ', 'śliwa').name.should == "Bartomiej Liwa"
  ICU::Name.new('Սմբատ', 'Լպուտյան').name.should == ""
  eric = Name.new('éric'.encode("ISO-8859-1"), 'PRIÉ'.force_encoding("ASCII-8BIT"))
  eric.rname.should == "Prié, Éric"
@@ -88,15 +88,9 @@ module ICU
  eric.original.encoding.name.should == "UTF-8"
  eric.rname(:chars => "US-ASCII").should == "Prie, Eric"
  eric.original(:chars => "US-ASCII").should == "PRIE, eric"
- joe = Name.new('Józef', 'Żabiński')
- joe.rname.should == "Żabiński, Józef"
- joe.rname(:chars => "ISO-8859-1").should == "Zabinski, Józef"
- joe.rname(:chars => "US-ASCII").should == "Zabinski, Jozef"
  eric.match('Éric', 'Prié').should be_true
  eric.match('Eric', 'Prie').should be_false
  eric.match('Eric', 'Prie', :chars => "US-ASCII").should be_true
- joe.match('Józef', 'Zabinski').should be_false
- joe.match('Józef', 'Zabinski', :chars => "ISO-8859-1").should be_true
  end
  end
 
data/spec/util_spec.rb CHANGED
@@ -4,34 +4,62 @@ require File.expand_path(File.dirname(__FILE__) + '/spec_helper')
  module ICU
  describe Util do
  context "#is_utf8" do
- it "should recognise US-ASCII as a special case of UTF-8" do
- Util.is_utf8("Resume".encode("US-ASCII")).should be_true
+ it "recognises some encodings as a special case of UTF-8" do
+ expect(Util.is_utf8("Resume".encode("US-ASCII"))).to be_true
+ expect(Util.is_utf8("Resume".encode("ASCII-8BIT"))).to be_true
+ expect(Util.is_utf8("Resume".encode("BINARY"))).to be_true
  end
 
- it "should recognise UTF-8" do
- Util.is_utf8("Résumé").should be_true
- Util.is_utf8("δog").should be_true
+ it "recognises UTF-8" do
+ expect(Util.is_utf8("Résumé")).to be_true
+ expect(Util.is_utf8("δog")).to be_true
  end
 
  it "should recognize other encodings as not being UTF-8" do
- Util.is_utf8("Résumé".encode("ISO-8859-1")).should be_false
- Util.is_utf8("€50".encode("Windows-1252")).should be_false
- Util.is_utf8("ひらがな".encode("Shift_JIS")).should be_false
- Util.is_utf8("\xa3").should be_false
+ expect(Util.is_utf8("Résumé".encode("ISO-8859-1"))).to be_false
+ expect(Util.is_utf8("€50".encode("Windows-1252"))).to be_false
+ expect(Util.is_utf8("ひらがな".encode("Shift_JIS"))).to be_false
+ expect(Util.is_utf8("\xa3")).to be_false
  end
  end
 
  context "#to_utf8" do
- it "should convert to UTF-8" do
- Util.to_utf8("Resume").should == "Resume"
- Util.to_utf8("Resume".force_encoding("US-ASCII")).encoding.name.should == "UTF-8"
- Util.to_utf8("Résumé".encode("ISO-8859-1")).should == "Résumé"
- Util.to_utf8("Résumé".encode("Windows-1252")).should == "Résumé"
- Util.to_utf8("€50".encode("Windows-1252")).should == "€50"
- Util.to_utf8("\xa350".force_encoding("ASCII-8BIT")).should == "£50"
- Util.to_utf8("\xa350").should == "£50"
- Util.to_utf8("ひらがな".encode("Shift_JIS")).should == "ひらがな"
+ it "converts to UTF-8" do
+ expect(Util.to_utf8("Resume")).to eq "Resume"
+ expect(Util.to_utf8("Resume".force_encoding("US-ASCII")).encoding.name).to eq "UTF-8"
+ expect(Util.to_utf8("Résumé".encode("ISO-8859-1"))).to eq "Résumé"
+ expect(Util.to_utf8("Résumé".encode("Windows-1252"))).to eq "Résumé"
+ expect(Util.to_utf8("€50".encode("Windows-1252"))).to eq "€50"
+ expect(Util.to_utf8("\xa350".force_encoding("ASCII-8BIT"))).to eq "£50"
+ expect(Util.to_utf8("\xa350")).to eq "£50"
+ expect(Util.to_utf8("ひらがな".encode("Shift_JIS"))).to eq "ひらがな"
+ end
+ end
+
+ context "#downcase" do
+ it "downcases characters in the Latin-1 range" do
+ expect(Util.downcase("Eric")).to eq "eric"
+ expect(Util.downcase("Éric")).to eq "éric"
+ expect(Util.downcase("ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÑÒÓÔÕÖØÙÚÛÜÝÞ")).to eq "àáâãäåæçèéêëìíîïñòóôõöøùúûüýþ"
+ end
+ end
+
+ context "#upcase" do
+ it "upcases characters in the Latin-1 range" do
+ expect(Util.upcase("Gearoidin")).to eq "GEAROIDIN"
+ expect(Util.upcase("Gearóidín")).to eq "GEARÓIDÍN"
+ expect(Util.upcase("àáâãäåæçèéêëìíîïñòóôõöøùúûüýþ")).to eq "ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÑÒÓÔÕÖØÙÚÛÜÝÞ"
+ end
+ end
+
+ context "#capitalize" do
+ it "capitalizes strings that might contain accented characters" do
+ expect(Util.capitalize("gearoidin")).to eq "Gearoidin"
+ expect(Util.capitalize("GEAROIDIN")).to eq "Gearoidin"
+ expect(Util.capitalize("gEAróiDÍn")).to eq "Gearóidín"
+ expect(Util.capitalize("ériC")).to eq "Éric"
+ expect(Util.capitalize("ÉRIc")).to eq "Éric"
  end
  end
  end
- end
+ end
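
The new `#downcase`, `#upcase` and `#capitalize` specs above exercise case conversion built on a pair of `String#tr` tables instead of ActiveSupport's `mb_chars`. The sketch below shows the same idea in a self-contained form so it can be tried without the gem. The module name `Latin1Case` is hypothetical; only the two tables are taken from the diff, and the method bodies are a simplified rewrite rather than the gem's code.

```ruby
# encoding: UTF-8

# Hypothetical stand-alone module (not the gem's ICU::Util): ASCII case
# conversion first, then a tr table for accented Latin-1 letters.
module Latin1Case
  LOWER = "àáâãäåæçèéêëìíîïñòóôõöøùúûüýþ"
  UPPER = "ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÑÒÓÔÕÖØÙÚÛÜÝÞ"

  def self.upcase(str)
    str.upcase.tr(LOWER, UPPER)
  end

  def self.downcase(str)
    str.downcase.tr(UPPER, LOWER)
  end

  # Upcase the first character and downcase the rest.
  def self.capitalize(str)
    return str.dup if str.empty?
    upcase(str[0]) + downcase(str[1..-1])
  end
end

puts Latin1Case.upcase("Gearóidín")      # GEARÓIDÍN
puts Latin1Case.downcase("Éric")         # éric
puts Latin1Case.capitalize("gEAróiDÍn")  # Gearóidín
```

On the Ruby versions the README lists (1.9.2 to 2.0.0), `String#upcase` and `String#downcase` only change ASCII letters, which is why the table step is needed. From Ruby 2.4 onwards those methods handle non-ASCII letters themselves, so the `tr` step leaves these examples unchanged there.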
metadata CHANGED
@@ -1,7 +1,7 @@
  --- !ruby/object:Gem::Specification
  name: icu_name
  version: !ruby/object:Gem::Version
- version: 1.0.16
+ version: 1.1.0
  prerelease:
  platform: ruby
  authors:
@@ -9,40 +9,8 @@ authors:
  autorequire:
  bindir: bin
  cert_chain: []
- date: 2013-06-05 00:00:00.000000000 Z
+ date: 2013-07-22 00:00:00.000000000 Z
  dependencies:
- - !ruby/object:Gem::Dependency
- name: activesupport
- requirement: !ruby/object:Gem::Requirement
- none: false
- requirements:
- - - ! '>='
- - !ruby/object:Gem::Version
- version: '0'
- type: :runtime
- prerelease: false
- version_requirements: !ruby/object:Gem::Requirement
- none: false
- requirements:
- - - ! '>='
- - !ruby/object:Gem::Version
- version: '0'
- - !ruby/object:Gem::Dependency
- name: i18n
- requirement: !ruby/object:Gem::Requirement
- none: false
- requirements:
- - - ! '>='
- - !ruby/object:Gem::Version
- version: '0'
- type: :runtime
- prerelease: false
- version_requirements: !ruby/object:Gem::Requirement
- none: false
- requirements:
- - - ! '>='
- - !ruby/object:Gem::Version
- version: '0'
  - !ruby/object:Gem::Dependency
  name: bundler
  requirement: !ruby/object:Gem::Requirement
@@ -143,7 +111,7 @@ required_ruby_version: !ruby/object:Gem::Requirement
  version: '0'
  segments:
  - 0
- hash: -1925954322170323442
+ hash: -2995260603720583581
  required_rubygems_version: !ruby/object:Gem::Requirement
  none: false
  requirements: