icu_name 0.0.7 → 0.1.0
- data/README.rdoc +44 -40
- data/lib/icu_name/name.rb +28 -11
- data/lib/icu_name/version.rb +1 -1
- data/spec/name_spec.rb +15 -26
- metadata +2 -2
data/README.rdoc
CHANGED
@@ -8,7 +8,7 @@ For ruby 1.9.2 and above.
|
|
8
8
|
|
9
9
|
gem install icu_name
|
10
10
|
|
11
|
-
It depends on
|
11
|
+
It depends on _active_support_ and _i18n_.
|
12
12
|
|
13
13
|
== Names
|
14
14
|
|
@@ -23,84 +23,88 @@ To create a name object, supply both the first and second names separately to th
|
|
23
23
|
|
24
24
|
Capitalisation, white space and punctuation will all be automatically corrected:
|
25
25
|
|
26
|
-
robert.name
|
27
|
-
robert.rname
|
26
|
+
robert.name # => 'Robert J. Fischer'
|
27
|
+
robert.rname # => 'Fischer, Robert J.' (reversed name)
|
28
28
|
|
29
29
|
The input text, without any changes apart from white-space cleanup, is returned by the _original_ method:
|
30
30
|
|
31
|
-
robert.original
|
31
|
+
robert.original # => 'robert j FISHER'
|
32
32
|
|
33
33
|
To avoid ambiguity when either the first or second names consist of multiple words, it is better to
|
34
34
|
supply the two separately, if known. However, the full name can be supplied alone to the constructor
|
35
35
|
and a guess will be made as to the first and last names.
|
36
36
|
|
37
37
|
bobby = ICU::Name.new(' bobby fischer ')
|
38
|
-
|
39
|
-
bobby.first
|
40
|
-
bobby.last
|
38
|
+
|
39
|
+
bobby.first # => 'Bobby'
|
40
|
+
bobby.last # => 'Fischer'
|
41
41
|
|
42
42
|
Names will match even if one is missing middle initials or if a nickname is used for one of the first names.
|
43
43
|
|
44
|
-
bobby.match('Robert J.', 'Fischer')
|
44
|
+
bobby.match('Robert J.', 'Fischer') # => true
|
45
45
|
|
46
46
|
Note that the class is aware of only common nicknames (e.g. _Bobby_ and _Robert_, _Bill_ and _William_, etc), not all possibilities.
|
47
47
|
|
48
48
|
Supplying the _match_ method with strings is equivalent to instantiating a Name instance with the same
|
49
49
|
strings and then matching it. So, for example the following are equivalent:
|
50
50
|
|
51
|
-
robert.match('R.', 'Fischer')
|
52
|
-
robert.match(ICU::Name.new('R.', 'Fischer'))
|
51
|
+
robert.match('R.', 'Fischer') # => true
|
52
|
+
robert.match(ICU::Name.new('R.', 'Fischer')) # => true
|
53
53
|
|
54
54
|
The initial _R_, for example, matches the first letter of _Robert_. However, nickname matches will not
|
55
55
|
always work with initials. In the next example, the initial _R_ does not match the first letter _B_ of the
|
56
56
|
nickname _Bobby_.
|
57
57
|
|
58
|
-
bobby.match('R. J.', 'Fischer')
|
58
|
+
bobby.match('R. J.', 'Fischer') # => false
|
59
59
|
|
60
60
|
Some of the ways last names are canonicalised are illustrated below:
|
61
61
|
|
62
|
-
ICU::Name.new('John', 'O Reilly').last
|
63
|
-
ICU::Name.new('dave', 'mcmanus').last
|
62
|
+
ICU::Name.new('John', 'O Reilly').last # => "O'Reilly"
|
63
|
+
ICU::Name.new('dave', 'mcmanus').last # => "McManus"
|
64
64
|
|
65
65
|
== Characters and Encoding
|
66
66
|
|
67
|
-
The class can only cope with
|
68
|
-
|
69
|
-
|
67
|
+
The class can only cope with Latin characters, including those with diacritics (accents).
|
68
|
+
Along with hyphens and single quotes (which represent apostrophes), letters in ISO-8859-1
|
69
|
+
(e.g. "a", "è", "Ö") and letters outside ISO-8859-1 which are decomposable into a US-ASCII
|
70
|
+
character plus one or more diacritics (e.g. "ł" or "Ś") are preserved, while everything
|
71
|
+
else is removed.
|
72
|
+
|
73
|
+
ICU::Name.new('éric', 'PRIÉ').name # => "Éric Prié"
|
74
|
+
ICU::Name.new('BARTŁOMIEJ', 'śliwa').name # => "Bartłomiej Śliwa"
|
75
|
+
ICU::Name.new(' 渡井美代子').name # => ""
|
70
76
|
|
71
|
-
|
72
|
-
|
73
|
-
eric.rname.encoding.name # => "UTF-8"
|
77
|
+
The various accessors (_first_, _last_, _name_, _rname_, _to_s_, _original_) always return
|
78
|
+
strings encoded in UTF-8, no matter what the input encoding.
|
74
79
|
|
75
80
|
eric = ICU::Name.new('éric'.encode("ISO-8859-1"), 'PRIÉ'.force_encoding("ASCII-8BIT"))
|
76
|
-
eric.rname
|
77
|
-
eric.rname.encoding.name
|
78
|
-
eric.original
|
79
|
-
eric.original.encoding.name
|
81
|
+
eric.rname # => "Prié, Éric"
|
82
|
+
eric.rname.encoding.name # => "UTF-8"
|
83
|
+
eric.original # => "éric PRIÉ"
|
84
|
+
eric.original.encoding.name # => "UTF-8"
|
80
85
|
|
81
|
-
|
86
|
+
Accented letters can be transliterated into their US-ASCII counterparts by setting the
|
87
|
+
_chars_ option, which is available in all accessors. For example:
|
82
88
|
|
83
|
-
|
84
|
-
|
89
|
+
eric.rname(:chars => "US-ASCII") # => "Prie, Eric"
|
90
|
+
eric.original(:chars => "US-ASCII") # => "eric PRIE"
|
85
91
|
|
86
|
-
|
87
|
-
|
92
|
+
Also possible is the preservation of ISO-8859-1 characters, but the transliteration of
|
93
|
+
all other accented characters:
|
88
94
|
|
89
|
-
|
95
|
+
joe = Name.new('Józef', 'Żabiński')
|
96
|
+
joe.rname # => "Żabiński, Józef"
|
97
|
+
joe.rname(:chars => "ISO-8859-1") # => "Zabinski, Józef"
|
98
|
+
joe.rname(:chars => "US-ASCII") # => "Zabinski, Jozef"
|
90
99
|
|
91
|
-
|
100
|
+
Note that the character encoding of the strings returned is still UTF-8 in all cases.
|
101
|
+
The same option also relaxes the need for accented characters to match exactly:
|
92
102
|
|
93
|
-
|
94
|
-
|
95
|
-
|
96
|
-
|
97
|
-
|
98
|
-
The option also relaxes the need for accented characters to match exactly:
|
103
|
+
eric.match('Eric', 'Prie') # => false
|
104
|
+
eric.match('Eric', 'Prie', :chars => "US-ASCII") # => true
|
105
|
+
joe.match('Józef', 'Zabinski') # => false
|
106
|
+
joe.match('Józef', 'Zabinski', :chars => "ISO-8859-1") # => true
|
99
107
|
|
100
|
-
eric.match('Éric', 'Prié') # => true
|
101
|
-
eric.match('Eric', 'Prie') # => false
|
102
|
-
eric.match('Eric', 'Prie', :ascii => true) # => true
|
103
|
-
|
104
108
|
== Author
|
105
109
|
|
106
110
|
Mark Orr, rating officer for the Irish Chess Union (ICU[http://icu.ie]).
|
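The _chars_ option documented in the README changes above can be illustrated without the gem installed. The following is only a sketch: `to_ascii` and `NON_DECOMPOSING` are hypothetical stand-ins for `ActiveSupport::Inflector.transliterate` (built on Ruby's standard `String#unicode_normalize`), and the two regular expressions are copied from the `transliterate` method added to name.rb in this release.

```ruby
# Letters that have no Unicode decomposition and so need an explicit mapping.
# (Hypothetical table; ActiveSupport ships a far larger one.)
NON_DECOMPOSING = { "ł" => "l", "Ł" => "L" }.freeze

# Hypothetical stand-in for ActiveSupport::Inflector.transliterate:
# decompose, drop combining marks, then map or replace whatever is still non-ASCII.
def to_ascii(str)
  str.unicode_normalize(:nfd)
     .gsub(/\p{Mn}/, "")
     .gsub(/[^[:ascii:]]/) { |c| NON_DECOMPOSING.fetch(c, "?") }
end

# Same dispatch as the gem's private method: the target character set is
# chosen by name, and the returned string is still encoded in UTF-8.
def transliterate(str, chars = "US-ASCII")
  case chars
  when /^(US-?)?ASCII/i
    to_ascii(str)
  when /^(Windows|CP)-?1252|ISO-?8859-?1|Latin(-?1)?$/i
    # Keep code points below 256 (Latin-1), transliterate the rest.
    str.gsub(/./) { |m| m.ord < 256 ? m : to_ascii(m) }
  else
    str.dup
  end
end

puts transliterate("Żabiński, Józef", "ISO-8859-1") # prints "Zabinski, Józef"
puts transliterate("Żabiński, Józef", "US-ASCII")   # prints "Zabinski, Jozef"
```

Any value of _chars_ that matches neither pattern falls through to the `else` branch and returns an unchanged copy of the string.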
data/lib/icu_name/name.rb
CHANGED
@@ -1,3 +1,4 @@
|
|
1
|
+
# encoding: UTF-8
|
1
2
|
require 'active_support'
|
2
3
|
require 'active_support/inflector/transliterate'
|
3
4
|
require 'active_support/core_ext/string/multibyte'
|
@@ -6,32 +7,28 @@ module ICU
|
|
6
7
|
class Name
|
7
8
|
|
8
9
|
# Construct from one or two strings or any objects that have a to_s method.
|
9
|
-
def initialize(name1='', name2=''
|
10
|
+
def initialize(name1='', name2='')
|
10
11
|
@name1 = Util.to_utf8(name1.to_s)
|
11
12
|
@name2 = Util.to_utf8(name2.to_s)
|
12
13
|
originalize
|
13
|
-
if opt[:ascii]
|
14
|
-
@name1 = ActiveSupport::Inflector.transliterate(@name1)
|
15
|
-
@name2 = ActiveSupport::Inflector.transliterate(@name2)
|
16
|
-
end
|
17
14
|
canonicalize
|
18
15
|
end
|
19
16
|
|
20
17
|
# Original text getter.
|
21
18
|
def original(opts={})
|
22
|
-
return
|
19
|
+
return transliterate(@original, opts[:chars]) if opts[:chars]
|
23
20
|
@original
|
24
21
|
end
|
25
22
|
|
26
23
|
# First name getter.
|
27
24
|
def first(opts={})
|
28
|
-
return
|
25
|
+
return transliterate(@first, opts[:chars]) if opts[:chars]
|
29
26
|
@first
|
30
27
|
end
|
31
28
|
|
32
29
|
# Last name getter.
|
33
30
|
def last(opts={})
|
34
|
-
return
|
31
|
+
return transliterate(@last, opts[:chars]) if opts[:chars]
|
35
32
|
@last
|
36
33
|
end
|
37
34
|
|
@@ -60,8 +57,8 @@ module ICU
|
|
60
57
|
|
61
58
|
# Match another name to this object, returning true or false.
|
62
59
|
def match(name1='', name2='', opts={})
|
63
|
-
other = Name.new(name1, name2
|
64
|
-
match_first(first(opts), other.first) && match_last(last(opts), other.last)
|
60
|
+
other = Name.new(name1, name2)
|
61
|
+
match_first(first(opts), other.first(opts)) && match_last(last(opts), other.last(opts))
|
65
62
|
end
|
66
63
|
|
67
64
|
# :stopdoc:
|
@@ -73,6 +70,18 @@ module ICU
|
|
73
70
|
@original.strip!
|
74
71
|
@original.gsub!(/\s+/, ' ')
|
75
72
|
end
|
73
|
+
|
74
|
+
# Transliterate characters to ASCII or Latin1.
|
75
|
+
def transliterate(str, chars='US-ASCII')
|
76
|
+
case chars
|
77
|
+
when /^(US-?)?ASCII/i
|
78
|
+
ActiveSupport::Inflector.transliterate(str)
|
79
|
+
when /^(Windows|CP)-?1252|ISO-?8859-?1|Latin(-?1)?$/i
|
80
|
+
str.gsub(/./) { |m| m.ord < 256 ? m : ActiveSupport::Inflector.transliterate(m) }
|
81
|
+
else
|
82
|
+
str.dup
|
83
|
+
end
|
84
|
+
end
|
76
85
|
|
77
86
|
# Canonicalise the first and last names.
|
78
87
|
def canonicalize
|
@@ -106,7 +115,15 @@ module ICU
|
|
106
115
|
# Clean up characters in any name keeping only letters (including accented), hyphens, and single quotes.
|
107
116
|
def clean(name)
|
108
117
|
name.gsub!(/`/, "'")
|
109
|
-
name.gsub!(
|
118
|
+
name.gsub!(/./) do |m|
|
119
|
+
if m.ord < 256
|
120
|
+
# Keep Latin1 accented letters.
|
121
|
+
m.match(/^[-a-zA-Z\u{c0}-\u{d6}\u{d8}-\u{f6}\u{f8}-\u{ff}.'\s]$/) ? m : ''
|
122
|
+
else
|
123
|
+
# Keep ASCII characters with diacritics (e.g. Polish ł and Ś).
|
124
|
+
transliterate(m) == '?' ? '' : m
|
125
|
+
end
|
126
|
+
end
|
110
127
|
name.gsub!(/\./, ' ')
|
111
128
|
name.gsub!(/\s*-\s*/, '-')
|
112
129
|
name.gsub!(/'+/, "'")
|
data/lib/icu_name/version.rb
CHANGED
data/spec/name_spec.rb
CHANGED
@@ -66,29 +66,25 @@ module ICU
|
|
66
66
|
end
|
67
67
|
|
68
68
|
it "characters and encoding" do
|
69
|
-
|
70
|
-
|
71
|
-
|
72
|
-
josef.original(:ascii => true).should == "Jozef Zabinski"
|
73
|
-
josef = Name.new('Józef', 'Żabiński', :ascii => true)
|
74
|
-
josef.name.should == "Jozef Zabinski"
|
75
|
-
bu = Name.new('Bǔ Xiángzhì')
|
76
|
-
bu.name.should == "B. Xiángzhì"
|
77
|
-
eric = Name.new('éric', 'PRIÉ')
|
78
|
-
eric.rname.should == "Prié, Éric"
|
79
|
-
eric.rname.encoding.name.should == "UTF-8"
|
69
|
+
ICU::Name.new('éric', 'PRIÉ').name.should == "Éric Prié"
|
70
|
+
ICU::Name.new('BARTŁOMIEJ', 'śliwa').name.should == "Bartłomiej Śliwa"
|
71
|
+
ICU::Name.new(' 渡井美代子').name.should == ""
|
80
72
|
eric = Name.new('éric'.encode("ISO-8859-1"), 'PRIÉ'.force_encoding("ASCII-8BIT"))
|
81
73
|
eric.rname.should == "Prié, Éric"
|
82
74
|
eric.rname.encoding.name.should == "UTF-8"
|
83
75
|
eric.original.should == "éric PRIÉ"
|
84
|
-
eric.original(:ascii => true).should == "eric PRIE"
|
85
76
|
eric.original.encoding.name.should == "UTF-8"
|
86
|
-
eric.
|
87
|
-
|
88
|
-
|
77
|
+
eric.rname(:chars => "US-ASCII").should == "Prie, Eric"
|
78
|
+
eric.original(:chars => "US-ASCII").should == "eric PRIE"
|
79
|
+
joe = Name.new('Józef', 'Żabiński')
|
80
|
+
joe.rname.should == "Żabiński, Józef"
|
81
|
+
joe.rname(:chars => "ISO-8859-1").should == "Zabinski, Józef"
|
82
|
+
joe.rname(:chars => "US-ASCII").should == "Zabinski, Jozef"
|
89
83
|
eric.match('Éric', 'Prié').should be_true
|
90
84
|
eric.match('Eric', 'Prie').should be_false
|
91
|
-
eric.match('Eric', 'Prie', :
|
85
|
+
eric.match('Eric', 'Prie', :chars => "US-ASCII").should be_true
|
86
|
+
joe.match('Józef', 'Zabinski').should be_false
|
87
|
+
joe.match('Józef', 'Zabinski', :chars => "ISO-8859-1").should be_true
|
92
88
|
end
|
93
89
|
end
|
94
90
|
|
@@ -244,7 +240,7 @@ module ICU
|
|
244
240
|
|
245
241
|
context "transliteration" do
|
246
242
|
before(:all) do
|
247
|
-
@opt = { :
|
243
|
+
@opt = { :chars => "US-ASCII" }
|
248
244
|
end
|
249
245
|
|
250
246
|
it "should be a no-op for names that already ASCII" do
|
@@ -267,12 +263,6 @@ module ICU
|
|
267
263
|
name.first(@opt).should == 'Eric'
|
268
264
|
name.last(@opt).should == 'Prie'
|
269
265
|
end
|
270
|
-
|
271
|
-
it "should work for the constructor as well as accessors" do
|
272
|
-
name = Name.new('Gearóidín', 'Uí Laighléis', @opt)
|
273
|
-
name.first.should == 'Gearoidin'
|
274
|
-
name.last.should == 'Ui Laighleis'
|
275
|
-
end
|
276
266
|
end
|
277
267
|
|
278
268
|
context "constuction corner cases" do
|
@@ -280,7 +270,6 @@ module ICU
|
|
280
270
|
Name.new('Orr').name.should == 'Orr'
|
281
271
|
Name.new('Orr').rname.should == 'Orr'
|
282
272
|
Name.new('Uí Laighléis').rname.should == 'Laighléis, Uí'
|
283
|
-
Name.new('', 'Uí Laighléis', :ascii => true).last.should == 'Ui Laighleis'
|
284
273
|
Name.new('').name.should == ''
|
285
274
|
Name.new('').rname.should == ''
|
286
275
|
Name.new.name.should == ''
|
@@ -367,8 +356,8 @@ module ICU
|
|
367
356
|
end
|
368
357
|
|
369
358
|
it "the matching of accented characters can be relaxed" do
|
370
|
-
Name.new('Gearóidín', 'Uí Laighléis').match('Gearoidin', 'Ui Laíghleis', :
|
371
|
-
Name.new('Èric-K.', 'Cantona').match('E. K.', 'Cantona', :
|
359
|
+
Name.new('Gearóidín', 'Uí Laighléis').match('Gearoidin', 'Ui Laíghleis', :chars => "US-ASCII").should be_true
|
360
|
+
Name.new('Èric-K.', 'Cantona').match('E. K.', 'Cantona', :chars => "US-ASCII").should be_true
|
372
361
|
end
|
373
362
|
end
|
374
363
|
end
|
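The relaxed-matching specs above rely on both names being transliterated before they are compared, which is what the new `match` does by passing `opts` to the accessors on each side. A much-simplified sketch of that idea follows. The names `strip_accents` and `names_match?` are hypothetical, the comparison is plain equality with no initials or nickname handling, and accent removal uses Ruby's standard `String#unicode_normalize` rather than ActiveSupport.

```ruby
# Hypothetical helper: decompose and drop combining marks (é -> e, í -> i).
def strip_accents(str)
  str.unicode_normalize(:nfd).gsub(/\p{Mn}/, "")
end

# Compare two [first, last] pairs, optionally ignoring accents on BOTH sides.
def names_match?(mine, theirs, relaxed: false)
  if relaxed
    mine   = mine.map   { |s| strip_accents(s) }
    theirs = theirs.map { |s| strip_accents(s) }
  end
  mine == theirs
end

puts names_match?(["Éric", "Prié"], ["Eric", "Prie"])                 # prints "false"
puts names_match?(["Éric", "Prié"], ["Eric", "Prie"], relaxed: true)  # prints "true"
```

Folding only one side would still fail whenever the other name carries accents in different places, as in the 'Uí Laighléis' versus 'Ui Laíghleis' spec.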