icu_name 0.0.7 → 0.1.0
- data/README.rdoc +44 -40
- data/lib/icu_name/name.rb +28 -11
- data/lib/icu_name/version.rb +1 -1
- data/spec/name_spec.rb +15 -26
- metadata +2 -2
data/README.rdoc
CHANGED
@@ -8,7 +8,7 @@ For ruby 1.9.2 and above.
 
 gem install icu_name
 
-It depends on
+It depends on _active_support_ and _i18n_.
 
 == Names
 
@@ -23,84 +23,88 @@ To create a name object, supply both the first and second names separately to th
 
 Capitalisation, white space and punctuation will all be automatically corrected:
 
-robert.name
-robert.rname
+robert.name # => 'Robert J. Fischer'
+robert.rname # => 'Fischer, Robert J.' (reversed name)
 
 The input text, without any changes apart from white-space cleanup, is returned by the _original_ method:
 
-robert.original
+robert.original # => 'robert j FISHER'
 
 To avoid ambiguity when either the first or second names consist of multiple words, it is better to
 supply the two separately, if known. However, the full name can be supplied alone to the constructor
 and a guess will be made as to the first and last names.
 
 bobby = ICU::Name.new(' bobby fischer ')
-
-bobby.first
-bobby.last
+
+bobby.first # => 'Bobby'
+bobby.last # => 'Fischer'
 
 Names will match even if one is missing middle initials or if a nickname is used for one of the first names.
 
-bobby.match('Robert J.', 'Fischer')
+bobby.match('Robert J.', 'Fischer') # => true
 
 Note that the class is aware of only common nicknames (e.g. _Bobby_ and _Robert_, _Bill_ and _William_, etc), not all possibilities.
 
 Supplying the _match_ method with strings is equivalent to instantiating a Name instance with the same
 strings and then matching it. So, for example the following are equivalent:
 
-robert.match('R.', 'Fischer')
-robert.match(ICU::Name.new('R.', 'Fischer'))
+robert.match('R.', 'Fischer') # => true
+robert.match(ICU::Name.new('R.', 'Fischer')) # => true
 
 The inital _R_, for example, matches the first letter of _Robert_. However, nickname matches will not
 always work with initials. In the next example, the initial _R_ does not match the first letter _B_ of the
 nickname _Bobby_.
 
-bobby.match('R. J.', 'Fischer')
+bobby.match('R. J.', 'Fischer') # => false
 
 Some of the ways last names are canonicalised are illustrated below:
 
-ICU::Name.new('John', 'O Reilly').last
-ICU::Name.new('dave', 'mcmanus').last
+ICU::Name.new('John', 'O Reilly').last # => "O'Reilly"
+ICU::Name.new('dave', 'mcmanus').last # => "McManus"
 
 == Characters and Encoding
 
-The class can only cope with
-
-
+The class can only cope with Latin characters, including those with diacritics (accents).
+Along with hyphens and single quotes (which represent apostophes) letters in ISO-8859-1
+(e.g. "a", "è", "Ö") and letters outside ISO-8859-1 which are decomposable into a US-ASCII
+character plus one or more diacritics (e.g. "ł" or "Ś") are preserved, while everything
+else is removed.
+
+ICU::Name.new('éric', 'PRIÉ').name # => "Éric Prié"
+ICU::Name.new('BARTŁOMIEJ', 'śliwa').name # => "Bartłomiej Śliwa"
+ICU::Name.new(' 渡井美代子').name # => ""
 
-
-
-eric.rname.encoding.name # => "UTF-8"
+The various accessors (_first_, _last_, _name_, _rname_, _to_s_, _original_) always return
+strings encoded in UTF-8, no matter what the input encoding.
 
 eric = ICU::Name.new('éric'.encode("ISO-8859-1"), 'PRIÉ'.force_encoding("ASCII-8BIT"))
-eric.rname
-eric.rname.encoding.name
-eric.original
-eric.original.encoding.name
+eric.rname # => "Prié, Éric"
+eric.rname.encoding.name # => "UTF-8"
+eric.original # => "éric PRIÉ"
+eric.original.encoding.name # => "UTF-8"
 
-
+Accented letters can be transliterated into their US-ASCII counterparts by setting the
+_chars_ option, which is available in all accessors. For example:
 
-
-
+eric.rname(:chars => "US-ASCII") # => "Prie, Eric"
+eric.original(:chars => "US-ASCII") # => "eric PRIE"
 
-
-
+Also possible is the preservation of ISO-8859-1 characters, but the transliteration of
+all other accented characters:
 
-
+joe = Name.new('Józef', 'Żabiński')
+joe.rname # => "Żabiński, Józef"
+joe.rname(:chars => "ISO-8859-1") # => "Zabinski, Józef"
+joe.rname(:chars => "US-ASCII") # => "Zabinski, Jozef"
 
-
+Note that the character encoding of the strings returned is still UTF-8 in all cases.
+The same option also relaxes the need for accented characters to match exactly:
 
-
-
-
-
-
-The option also relaxes the need for accented characters to match exactly:
+eric.match('Eric', 'Prie') # => false
+eric.match('Eric', 'Prie', :chars => "US-ASCII") # => true
+joe.match('Józef', 'Zabinski') # => false
+joe.match('Józef', 'Zabinski', :chars => "ISO-8859-1") # => true
 
-eric.match('Éric', 'Prié') # => true
-eric.match('Eric', 'Prie') # => false
-eric.match('Eric', 'Prie', :ascii => true) # => true
-
 == Author
 
 Mark Orr, rating officer for the Irish Chess Union (ICU[http://icu.ie]).
data/lib/icu_name/name.rb
CHANGED
@@ -1,3 +1,4 @@
+# encoding: UTF-8
 require 'active_support'
 require 'active_support/inflector/transliterate'
 require 'active_support/core_ext/string/multibyte'
@@ -6,32 +7,28 @@ module ICU
 class Name
 
 # Construct from one or two strings or any objects that have a to_s method.
-def initialize(name1='', name2=''
+def initialize(name1='', name2='')
 @name1 = Util.to_utf8(name1.to_s)
 @name2 = Util.to_utf8(name2.to_s)
 originalize
-if opt[:ascii]
-@name1 = ActiveSupport::Inflector.transliterate(@name1)
-@name2 = ActiveSupport::Inflector.transliterate(@name2)
-end
 canonicalize
 end
 
 # Original text getter.
 def original(opts={})
-return
+return transliterate(@original, opts[:chars]) if opts[:chars]
 @original
 end
 
 # First name getter.
 def first(opts={})
-return
+return transliterate(@first, opts[:chars]) if opts[:chars]
 @first
 end
 
 # Last name getter.
 def last(opts={})
-return
+return transliterate(@last, opts[:chars]) if opts[:chars]
 @last
 end
 
@@ -60,8 +57,8 @@ module ICU
 
 # Match another name to this object, returning true or false.
 def match(name1='', name2='', opts={})
-other = Name.new(name1, name2
-match_first(first(opts), other.first) && match_last(last(opts), other.last)
+other = Name.new(name1, name2)
+match_first(first(opts), other.first(opts)) && match_last(last(opts), other.last(opts))
 end
 
 # :stopdoc:
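Reviewer's sketch (not part of the gem): in the new version of match above, the same opts are passed to the accessors of both names, so with :chars set to "US-ASCII" an accent on either side no longer blocks a match. The snippet below illustrates that idea with plain string equality. The helpers strip_accents and relaxed_equal? are hypothetical, and Unicode NFD decomposition is used here as a stand-in for ActiveSupport::Inflector.transliterate, which is what the gem actually calls. The gem's match_first and match_last also handle initials and nicknames, which this sketch ignores.

```ruby
# Hypothetical helper: remove combining marks after canonical decomposition.
# A stand-in for ActiveSupport::Inflector.transliterate, not the gem's code.
def strip_accents(str)
  str.unicode_normalize(:nfd).gsub(/\p{Mn}/, '')
end

# Hypothetical helper: compare two names after transliterating BOTH sides.
def relaxed_equal?(a, b)
  strip_accents(a) == strip_accents(b)
end

# The two spellings carry their accents on different letters.
strict  = 'Uí Laighléis' == 'Ui Laíghleis'
relaxed = relaxed_equal?('Uí Laighléis', 'Ui Laíghleis')
puts "strict=#{strict} relaxed=#{relaxed}"
```

The pair of spellings is taken from the spec change further down, where the same comparison is expected to succeed once :chars => "US-ASCII" is given.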
@@ -73,6 +70,18 @@ module ICU
 @original.strip!
 @original.gsub!(/\s+/, ' ')
 end
+
+# Transliterate characters to ASCII or Latin1.
+def transliterate(str, chars='US-ASCII')
+case chars
+when /^(US-?)?ASCII/i
+ActiveSupport::Inflector.transliterate(str)
+when /^(Windows|CP)-?1252|ISO-?8859-?1|Latin(-?1)?$/i
+str.gsub(/./) { |m| m.ord < 256 ? m : ActiveSupport::Inflector.transliterate(m) }
+else
+str.dup
+end
+end
 
 # Canonicalise the first and last names.
 def canonicalize
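Reviewer's sketch (not part of the gem): the three-way dispatch in the new transliterate method can be tried without ActiveSupport using the stand-alone approximation below. The helper to_ascii and the EXTRA table are assumptions of this sketch: they replace ActiveSupport::Inflector.transliterate with NFD decomposition plus a two-entry lookup for the Polish ł, which has no Unicode decomposition. The second pattern is also grouped and anchored as a whole here; in the diff, ^ applies only to the first alternative and $ only to the last.

```ruby
# Assumption: a tiny table for letters that do not decompose under NFD.
EXTRA = { 'ł' => 'l', 'Ł' => 'L' }.freeze

# Hypothetical stand-in for ActiveSupport::Inflector.transliterate:
# strip combining marks, then map anything still non-ASCII via EXTRA or "?".
def to_ascii(str)
  str.unicode_normalize(:nfd)
     .gsub(/\p{Mn}/, '')
     .gsub(/[^[:ascii:]]/) { |c| EXTRA.fetch(c, '?') }
end

# Same three branches as the method in the diff.
def transliterate(str, chars = 'US-ASCII')
  case chars
  when /\A(US-?)?ASCII/i
    to_ascii(str)
  when /\A((Windows|CP)-?1252|ISO-?8859-?1|Latin(-?1)?)\z/i
    # Keep code points below 256 (the Latin-1 range), transliterate the rest.
    str.gsub(/./) { |m| m.ord < 256 ? m : to_ascii(m) }
  else
    str.dup
  end
end

name = 'Żabiński, Józef'
puts transliterate(name, 'ISO-8859-1') # ó (U+00F3) is kept, Ż and ń are not
puts transliterate(name, 'US-ASCII')
puts transliterate(name, 'UTF-8')      # unrecognised value: returned unchanged
```

With the README's example name, the ISO-8859-1 branch keeps ó because its code point is below 256, while Ż and ń fall outside that range and are reduced to Z and n.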
@@ -106,7 +115,15 @@ module ICU
 # Clean up characters in any name keeping only letters (including accented), hyphens, and single quotes.
 def clean(name)
 name.gsub!(/`/, "'")
-name.gsub!(
+name.gsub!(/./) do |m|
+if m.ord < 256
+# Keep Latin1 accented letters.
+m.match(/^[-a-zA-Z\u{c0}-\u{d6}\u{d8}-\u{f6}\u{f8}-\u{ff}.'\s]$/) ? m : ''
+else
+# Keep ASCII characters with diacritics (e.g. Polish ł and Ś).
+transliterate(m) == '?' ? '' : m
+end
+end
 name.gsub!(/\./, ' ')
 name.gsub!(/\s*-\s*/, '-')
 name.gsub!(/'+/, "'")
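Reviewer's sketch (not part of the gem): the per-character filter added to clean can be approximated in plain Ruby as follows. The names keep_name_chars, asciiable?, LATIN1_KEEP and EXTRA_LETTERS are hypothetical. The gem decides whether a character outside Latin-1 is acceptable by checking that ActiveSupport's transliteration does not return "?"; this sketch assumes NFD decomposition plus a short list of non-decomposing letters as a rough equivalent, so the two may disagree on characters not covered by the examples.

```ruby
# Character class copied from the diff: ASCII letters, Latin-1 accented
# letters (excluding the multiplication and division signs), hyphen, dot,
# single quote and white space.
LATIN1_KEEP = /\A[-a-zA-Z\u{c0}-\u{d6}\u{d8}-\u{f6}\u{f8}-\u{ff}.'\s]\z/

# Assumption: letters with an ASCII counterpart but no NFD decomposition.
EXTRA_LETTERS = %w[ł Ł].freeze

# Hypothetical test: is this an ASCII letter once its diacritics are removed?
def asciiable?(char)
  EXTRA_LETTERS.include?(char) ||
    char.unicode_normalize(:nfd).gsub(/\p{Mn}/, '').match?(/\A[a-zA-Z]+\z/)
end

# Keep or drop each character, mirroring the two branches in the diff.
def keep_name_chars(name)
  name.gsub(/./) do |m|
    if m.ord < 256
      m.match?(LATIN1_KEEP) ? m : ''
    else
      asciiable?(m) ? m : ''
    end
  end
end

puts keep_name_chars('BARTŁOMIEJ')
puts keep_name_chars('a1b×c')
puts keep_name_chars(' 渡井美代子').inspect
```

The last call leaves only the leading space, because this sketch covers the filtering step alone; the README's empty-string result for the same input comes after the rest of the gem's canonicalisation.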
data/lib/icu_name/version.rb
CHANGED
data/spec/name_spec.rb
CHANGED
@@ -66,29 +66,25 @@ module ICU
 end
 
 it "characters and encoding" do
-
-
-
-josef.original(:ascii => true).should == "Jozef Zabinski"
-josef = Name.new('Józef', 'Żabiński', :ascii => true)
-josef.name.should == "Jozef Zabinski"
-bu = Name.new('Bǔ Xiángzhì')
-bu.name.should == "B. Xiángzhì"
-eric = Name.new('éric', 'PRIÉ')
-eric.rname.should == "Prié, Éric"
-eric.rname.encoding.name.should == "UTF-8"
+ICU::Name.new('éric', 'PRIÉ').name.should == "Éric Prié"
+ICU::Name.new('BARTŁOMIEJ', 'śliwa').name.should == "Bartłomiej Śliwa"
+ICU::Name.new(' 渡井美代子').name.should == ""
 eric = Name.new('éric'.encode("ISO-8859-1"), 'PRIÉ'.force_encoding("ASCII-8BIT"))
 eric.rname.should == "Prié, Éric"
 eric.rname.encoding.name.should == "UTF-8"
 eric.original.should == "éric PRIÉ"
-eric.original(:ascii => true).should == "eric PRIE"
 eric.original.encoding.name.should == "UTF-8"
-eric.
-
-
+eric.rname(:chars => "US-ASCII").should == "Prie, Eric"
+eric.original(:chars => "US-ASCII").should == "eric PRIE"
+joe = Name.new('Józef', 'Żabiński')
+joe.rname.should == "Żabiński, Józef"
+joe.rname(:chars => "ISO-8859-1").should == "Zabinski, Józef"
+joe.rname(:chars => "US-ASCII").should == "Zabinski, Jozef"
 eric.match('Éric', 'Prié').should be_true
 eric.match('Eric', 'Prie').should be_false
-eric.match('Eric', 'Prie', :
+eric.match('Eric', 'Prie', :chars => "US-ASCII").should be_true
+joe.match('Józef', 'Zabinski').should be_false
+joe.match('Józef', 'Zabinski', :chars => "ISO-8859-1").should be_true
 end
 end
 
@@ -244,7 +240,7 @@ module ICU
 
 context "transliteration" do
 before(:all) do
-@opt = { :
+@opt = { :chars => "US-ASCII" }
 end
 
 it "should be a no-op for names that already ASCII" do
@@ -267,12 +263,6 @@ module ICU
 name.first(@opt).should == 'Eric'
 name.last(@opt).should == 'Prie'
 end
-
-it "should work for the constructor as well as accessors" do
-name = Name.new('Gearóidín', 'Uí Laighléis', @opt)
-name.first.should == 'Gearoidin'
-name.last.should == 'Ui Laighleis'
-end
 end
 
 context "constuction corner cases" do
278
268
|
context "constuction corner cases" do
|
@@ -280,7 +270,6 @@ module ICU
|
|
280
270
|
Name.new('Orr').name.should == 'Orr'
|
281
271
|
Name.new('Orr').rname.should == 'Orr'
|
282
272
|
Name.new('Uí Laighléis').rname.should == 'Laighléis, Uí'
|
283
|
-
Name.new('', 'Uí Laighléis', :ascii => true).last.should == 'Ui Laighleis'
|
284
273
|
Name.new('').name.should == ''
|
285
274
|
Name.new('').rname.should == ''
|
286
275
|
Name.new.name.should == ''
|
@@ -367,8 +356,8 @@ module ICU
 end
 
 it "the matching of accented characters can be relaxed" do
-Name.new('Gearóidín', 'Uí Laighléis').match('Gearoidin', 'Ui Laíghleis', :
-Name.new('Èric-K.', 'Cantona').match('E. K.', 'Cantona', :
+Name.new('Gearóidín', 'Uí Laighléis').match('Gearoidin', 'Ui Laíghleis', :chars => "US-ASCII").should be_true
+Name.new('Èric-K.', 'Cantona').match('E. K.', 'Cantona', :chars => "US-ASCII").should be_true
 end
 end
 end