icu_name 0.1.4 → 1.0.0
- data/README.rdoc +134 -41
- data/config/first_alternatives.yaml +36 -0
- data/config/last_alternatives.yaml +1 -0
- data/config/test_first_alts.yaml +41 -0
- data/config/test_last_alts.yaml +5 -0
- data/lib/icu_name/name.rb +94 -67
- data/lib/icu_name/version.rb +1 -1
- data/spec/name_spec.rb +201 -2
- metadata +8 -4
data/README.rdoc
CHANGED
@@ -23,50 +23,56 @@ To create a name object, supply both the first and second names separately to th
|
|
23
23
|
|
24
24
|
Capitalisation, white space and punctuation will all be automatically corrected:
|
25
25
|
|
26
|
-
robert.name
|
27
|
-
robert.rname
|
26
|
+
robert.name # => 'Robert J. Fischer'
|
27
|
+
robert.rname # => 'Fischer, Robert J.' (reversed name)
|
28
28
|
|
29
29
|
The input text, without any changes apart from white-space cleanup and the insertion of a comma
|
30
|
-
(to separate the two names), is returned by the
|
30
|
+
(to separate the two names), is returned by the <tt>original</tt> method:
|
31
31
|
|
32
|
-
robert.original
|
32
|
+
robert.original # => 'FISCHER, robert j'
|
33
33
|
|
34
34
|
To avoid ambiguity when either the first or second names consist of multiple words, it is better to
|
35
|
-
supply the two separately
|
36
|
-
|
35
|
+
supply the two separately. If the full name is supplied alone to the constructor, without any indication
|
36
|
+
of where the first names end, then the last distinct name is assumed to be the last name.
|
37
37
|
|
38
38
|
bobby = ICU::Name.new(' bobby fischer ')
|
39
39
|
|
40
|
-
bobby.first
|
41
|
-
bobby.last
|
40
|
+
bobby.first # => 'Bobby'
|
41
|
+
bobby.last # => 'Fischer'
|
42
42
|
|
43
|
-
|
43
|
+
In this case, since the names were not supplied separately, the <tt>original</tt> text will not contain a comma:
|
44
44
|
|
45
|
-
bobby.original
|
45
|
+
bobby.original # => 'bobby fischer'
|
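The rule described in the README text above for a name supplied as a single string can be sketched roughly as follows. This is only an illustration under stated assumptions, not the gem's actual parser: `split_full_name` is a hypothetical helper, a comma is assumed to mean "last, first", and otherwise the final word is taken to be the last name.

```ruby
# Hypothetical helper (not part of icu_name): split a full name into [first, last].
def split_full_name(text)
  cleaned = text.strip.gsub(/\s+/, ' ')   # trim and collapse runs of white space
  if cleaned.include?(',')
    # "last, first" form: the comma marks where the last name ends.
    last, first = cleaned.split(/\s*,\s*/, 2)
  else
    # No separator: assume the final word is the last name.
    words = cleaned.split(' ')
    last  = words.pop.to_s
    first = words.join(' ')
  end
  [first, last]
end

split_full_name(' bobby   fischer ')   # => ["bobby", "fischer"]
split_full_name('FISCHER, robert j')   # => ["robert j", "FISCHER"]
```

The sketch leaves capitalisation untouched; the gem canonicalises it in a separate step.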
46
46
|
|
47
47
|
Names will match even if one is missing middle initials or if a nickname is used for one of the first names.
|
48
48
|
|
49
|
-
bobby.match('Robert J.', 'Fischer')
|
49
|
+
bobby.match('Robert J.', 'Fischer') # => true
|
50
50
|
|
51
|
-
|
52
|
-
and not all possibilities.
|
51
|
+
The method <tt>alternatives</tt> can be used to list alternatives to a given first or last name:
|
53
52
|
|
54
|
-
|
53
|
+
Name.new('Stephen', 'Orr').alternatives(:first) # => ["Steve"]
|
54
|
+
Name.new('Michael Stephen', 'Orr').alternatives(:first) # => ["Steve", "Mike", "Mick", "Mikey"]
|
55
|
+
Name.new('Mark', 'Orr').alternatives(:first) # => []
|
56
|
+
|
57
|
+
By default the class is only aware of a few common alternatives for first names (e.g. _Bobby_ and _Robert_,
|
58
|
+
_Bill_ and _William_, etc). However, this can be customized (see below).
|
59
|
+
|
60
|
+
Supplying the <tt>match</tt> method with strings is equivalent to instantiating an instance with the same
|
55
61
|
strings and then matching it. So, for example the following are equivalent:
|
56
62
|
|
57
|
-
robert.match('R.', 'Fischer')
|
58
|
-
robert.match(ICU::Name.new('R.', 'Fischer'))
|
63
|
+
robert.match('R.', 'Fischer') # => true
|
64
|
+
robert.match(ICU::Name.new('R.', 'Fischer')) # => true
|
59
65
|
|
60
|
-
|
61
|
-
always work with initials. In the next example, the initial _R_ does not match the first letter
|
62
|
-
nickname _Bobby_.
|
66
|
+
Here the initial _R_ matches the first letter of _Robert_. However, nickname matches will not
|
67
|
+
always work with initials. In the next example, the initial _R_ does not match the first letter
|
68
|
+
_B_ of the nickname _Bobby_.
|
63
69
|
|
64
|
-
bobby.match('R. J.', 'Fischer')
|
70
|
+
bobby.match('R. J.', 'Fischer') # => false
|
65
71
|
|
66
|
-
Some
|
72
|
+
Some other ways last names are canonicalised are illustrated below:
|
67
73
|
|
68
|
-
ICU::Name.new('John', 'O Reilly').last
|
69
|
-
ICU::Name.new('dave', 'mcmanus').last
|
74
|
+
ICU::Name.new('John', 'O Reilly').last # => "O'Reilly"
|
75
|
+
ICU::Name.new('dave', 'mcmanus').last # => "McManus"
|
70
76
|
|
71
77
|
== Characters and Encoding
|
72
78
|
|
@@ -76,40 +82,127 @@ Along with hyphens and single quotes (which represent apostrophes) letters in ISO
|
|
76
82
|
character plus one or more diacritics (e.g. "ł" or "Ś") are preserved, while everything
|
77
83
|
else is removed.
|
78
84
|
|
79
|
-
ICU::Name.new('éric', 'PRIÉ').name
|
80
|
-
ICU::Name.new('BARTŁOMIEJ', 'śliwa').name
|
81
|
-
ICU::Name.new('
|
85
|
+
ICU::Name.new('éric', 'PRIÉ').name # => "Éric Prié"
|
86
|
+
ICU::Name.new('BARTŁOMIEJ', 'śliwa').name # => "Bartłomiej Śliwa"
|
87
|
+
ICU::Name.new('Սմբատ', 'Լպուտյան').name # => ""
|
82
88
|
|
83
|
-
The various accessors (
|
89
|
+
The various accessors (<tt>first</tt>, <tt>last</tt>, <tt>name</tt>, <tt>rname</tt>, <tt>to_s</tt>, <tt>original</tt>) always return
|
84
90
|
strings encoded in UTF-8, no matter what the input encoding.
|
85
91
|
|
86
92
|
eric = ICU::Name.new('éric'.encode("ISO-8859-1"), 'PRIÉ'.force_encoding("ASCII-8BIT"))
|
87
|
-
eric.rname
|
88
|
-
eric.rname.encoding.name
|
89
|
-
eric.original
|
90
|
-
eric.original.encoding.name
|
93
|
+
eric.rname # => "Prié, Éric"
|
94
|
+
eric.rname.encoding.name # => "UTF-8"
|
95
|
+
eric.original # => "PRIÉ, éric"
|
96
|
+
eric.original.encoding.name # => "UTF-8"
|
91
97
|
|
92
98
|
Accented letters can be transliterated into their US-ASCII counterparts by setting the
|
93
|
-
|
99
|
+
<tt>:chars</tt> option, which is available in all accessors. For example:
|
94
100
|
|
95
|
-
eric.rname(:chars => "US-ASCII")
|
96
|
-
eric.original(:chars => "US-ASCII")
|
101
|
+
eric.rname(:chars => "US-ASCII") # => "Prie, Eric"
|
102
|
+
eric.original(:chars => "US-ASCII") # => "PRIE, eric"
|
97
103
|
|
98
104
|
Also possible is the preservation of ISO-8859-1 characters, but the transliteration of
|
99
105
|
all other accented characters:
|
100
106
|
|
101
107
|
joe = Name.new('Józef', 'Żabiński')
|
102
|
-
joe.rname
|
103
|
-
joe.rname(:chars => "ISO-8859-1")
|
104
|
-
joe.rname(:chars => "US-ASCII")
|
108
|
+
joe.rname # => "Żabiński, Józef"
|
109
|
+
joe.rname(:chars => "ISO-8859-1") # => "Zabinski, Józef"
|
110
|
+
joe.rname(:chars => "US-ASCII") # => "Zabinski, Jozef"
|
105
111
|
|
106
112
|
Note that the character encoding of the strings returned is still UTF-8 in all cases.
|
107
113
|
The same option also relaxes the need for accented characters to match exactly:
|
108
114
|
|
109
|
-
eric.match('Eric', 'Prie')
|
110
|
-
eric.match('Eric', 'Prie', :chars => "US-ASCII")
|
111
|
-
joe.match('Józef', 'Zabinski')
|
112
|
-
joe.match('Józef', 'Zabinski', :chars => "ISO-8859-1")
|
115
|
+
eric.match('Eric', 'Prie') # => false
|
116
|
+
eric.match('Eric', 'Prie', :chars => "US-ASCII") # => true
|
117
|
+
joe.match('Józef', 'Zabinski') # => false
|
118
|
+
joe.match('Józef', 'Zabinski', :chars => "ISO-8859-1") # => true
|
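The two transliteration modes shown in the README examples above can be approximated with Ruby's standard library. This is a sketch under assumptions, not the gem's implementation (which relies on ActiveSupport multibyte support): the helper names are hypothetical, and the approach simply decomposes each character and drops its combining marks.

```ruby
# Hypothetical helpers (not the gem's code): strip diacritics using
# Unicode decomposition from Ruby's standard library (Ruby 2.2+).

# Reduce one character to its base letter by dropping combining marks.
def strip_marks(char)
  char.unicode_normalize(:nfd).gsub(/\p{Mn}/, '')
end

# Transliterate every accented letter that decomposes into base + marks.
def to_ascii_letters(str)
  str.unicode_normalize(:nfc).each_char.map { |c| strip_marks(c) }.join
end

# Keep characters that exist in ISO-8859-1 (code points up to U+00FF)
# and transliterate everything else.
def to_latin1_letters(str)
  str.unicode_normalize(:nfc).each_char.map { |c| c.ord <= 0xFF ? c : strip_marks(c) }.join
end

to_ascii_letters("Żabiński, Józef")    # => "Zabinski, Jozef"
to_latin1_letters("Żabiński, Józef")   # => "Zabinski, Józef"
```

Letters with no Unicode decomposition, such as "ł", pass through this sketch unchanged, so a real implementation needs an explicit mapping for them. As in the README, the returned strings stay UTF-8 encoded in both modes.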
119
|
+
|
120
|
+
== Customization of Alternative Names
|
121
|
+
|
122
|
+
We saw above how _Bobby_ and _Robert_ were able to match because, by default, the
|
123
|
+
matcher is aware of some common English nicknames. These name alternatives can be
|
124
|
+
customised to handle additional nick names and other types of alternative names
|
125
|
+
such as common spelling mistakes and name changes.
|
126
|
+
|
127
|
+
The alternative names are specified in two YAML files, one for first names and
|
128
|
+
one for last names. Each YAML file represents an array and each element in the
|
129
|
+
array is an array representing a set of alternative names. Here, for example,
|
130
|
+
are some of the default first name alternatives:
|
131
|
+
|
132
|
+
[Anthony, Tony]
|
133
|
+
[James, Jim, Jimmy]
|
134
|
+
[Michael, Mike, Mick, Mikey]
|
135
|
+
[Robert, Bob, Bobby]
|
136
|
+
[Stephen, Steve]
|
137
|
+
[Steven, Steve]
|
138
|
+
[Thomas, Tom, Tommy]
|
139
|
+
[William, Will, Willy, Willie, Bill]
|
140
|
+
|
141
|
+
The first of these means that _Anthony_ and _Tony_ are considered equivalent and can match.
|
142
|
+
|
143
|
+
Name.new("Tony", "Miles").match("Anthony", "Miles") # => true
|
144
|
+
|
145
|
+
Note that both _Steven_ and _Stephen_ match _Steve_ but, because they don't occur in the
|
146
|
+
same group, they don't match each other.
|
147
|
+
|
148
|
+
Name.new("Steven", "Hanly").match("Steve", "Hanly") # => true
|
149
|
+
Name.new("Stephen", "Hanly").match("Steve", "Hanly") # => true
|
150
|
+
Name.new("Stephen", "Hanly").match("Steven", "Hanly") # => false
|
151
|
+
|
152
|
+
To customize alternative name behaviour, prepare YAML files with your chosen alternatives
|
153
|
+
and then replace the default alternatives like this:
|
154
|
+
|
155
|
+
Name.load_alternatives(:first, "my_first_name_alternatives.yaml")
|
156
|
+
Name.load_alternatives(:last, "my_last_name_alternatives.yaml")
|
157
|
+
|
158
|
+
An example of one way in which you might want to customize the alternatives is to
|
159
|
+
cater for common spelling mistakes such as _Steven_ and _Stephen_. These two names
|
160
|
+
don't match by default, but you can make them so by replacing the two default rules:
|
161
|
+
|
162
|
+
[Stephen, Steve]
|
163
|
+
[Steven, Steve]
|
164
|
+
|
165
|
+
with the following single rule:
|
166
|
+
|
167
|
+
[Stephen, Steven, Steve]
|
168
|
+
|
169
|
+
so that now:
|
170
|
+
|
171
|
+
Name.new("Stephen", "Hanly").match("Steven", "Hanly") # => true
|
172
|
+
|
173
|
+
Another use is to cater for English and Irish versions of the same name. For example,
|
174
|
+
for last names:
|
175
|
+
|
176
|
+
[Murphy, Murchadha]
|
177
|
+
|
178
|
+
or for first names, including spelling variations:
|
179
|
+
|
180
|
+
[Patrick, Pat, Paddy, Padraig, Padraic, Padhraig, Padhraic]
|
181
|
+
|
182
|
+
== Conditional Alternatives
|
183
|
+
|
184
|
+
Normally, entries in the two YAML files are just lists of alternative names. There is one
|
185
|
+
exception to this however, when one of the entries (it doesn't matter which one but,
|
186
|
+
by convention, the last one) is a regular expression. Here is an example that might
|
187
|
+
be added to the last name alternatives:
|
188
|
+
|
189
|
+
[Quinn, Benjamin, !ruby/regexp /^(Debbie|Deborah)$/]
|
190
|
+
|
191
|
+
What this means is that the last names _Quinn_ and _Benjamin_ match but only when the
|
192
|
+
first name matches the regular expression.
|
193
|
+
|
194
|
+
Name.new("Debbie", "Quinn").match("Debbie", "Benjamin") # => true
|
195
|
+
Name.new("Mark", "Quinn").match("Mark", "Benjamin") # => false
|
196
|
+
|
197
|
+
Another example, this time for first names, is:
|
198
|
+
|
199
|
+
[Sean, John, !ruby/regexp /^Bradley$/]
|
200
|
+
|
201
|
+
This caters for an individual who is known by two normally unrelated first names.
|
202
|
+
We only want these two names to match for that individual and no others.
|
203
|
+
|
204
|
+
Name.new("John", "Bradley").match("Sean", "Bradley") # => true
|
205
|
+
Name.new("John", "Alfred").match("Sean", "Alfred") # => false
|
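The conditional rule from the README text above can be illustrated with a small stand-alone sketch. It is not the gem's code: `build_lookup` and `last_names_match?` are hypothetical names, and the groups are written as Ruby arrays instead of being loaded from YAML.

```ruby
# Hypothetical sketch (names are illustrative, not the gem's API).
# Each group lists interchangeable names; an optional Regexp restricts the
# group to people whose *other* name matches it.
last_name_groups = [
  ["Murphy", "Murchadha"],
  ["Quinn", "Benjamin", /^(Debbie|Deborah)$/],
]

# Build { name => { alternative => true or Regexp } } from the groups.
def build_lookup(groups)
  lookup = Hash.new { |hash, key| hash[key] = {} }
  groups.each do |group|
    cond  = group.find { |entry| entry.is_a?(Regexp) } || true
    names = group.reject { |entry| entry.is_a?(Regexp) }
    names.permutation(2) { |name, other| lookup[name][other] = cond }
  end
  lookup
end

# Do two last names match, given the person's first name?
def last_names_match?(lookup, last1, last2, first)
  return true if last1 == last2
  cond = lookup.fetch(last1, {})[last2]
  return false unless cond
  cond == true || !!(cond =~ first)
end

lookup = build_lookup(last_name_groups)
last_names_match?(lookup, "Quinn", "Benjamin", "Debbie")   # => true
last_names_match?(lookup, "Quinn", "Benjamin", "Mark")     # => false
last_names_match?(lookup, "Murphy", "Murchadha", "Mark")   # => true
```

Storing the condition on every pair keeps the match itself to two hash lookups plus, at most, one regular expression test.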
113
206
|
|
114
207
|
== Author
|
115
208
|
|
@@ -0,0 +1,36 @@
|
|
1
|
+
---
|
2
|
+
- [Alexander, Alex]
|
3
|
+
- [Andrew, Andy]
|
4
|
+
- [Anthony, Tony]
|
5
|
+
- [Benjamin, Ben]
|
6
|
+
- [Catherine, Cathy, Cath]
|
7
|
+
- [Daniel, Danny, Dan]
|
8
|
+
- [David, Dave]
|
9
|
+
- [Deborah, Debbie]
|
10
|
+
- [Des, Desmond]
|
11
|
+
- [Edward, Eddie, Eddy, Ed]
|
12
|
+
- [Frederick, Fred]
|
13
|
+
- [Frederic, Fred]
|
14
|
+
- [Gerald, Gerry]
|
15
|
+
- [Gerard, Gerry]
|
16
|
+
- [James, Jim, Jimmy]
|
17
|
+
- [John, Johnny]
|
18
|
+
- [Jonathan, Jon]
|
19
|
+
- [Kenneth, Ken, Kenny]
|
20
|
+
- [Michael, Mike, Mick, Mikey]
|
21
|
+
- [Nic, Nick, Nicolas]
|
22
|
+
- [Nicola, Nickie, Nicky]
|
23
|
+
- [Patrick, Pat]
|
24
|
+
- [Patricia, Patty, Pat]
|
25
|
+
- [Peter, Pete]
|
26
|
+
- [Philip, Phil]
|
27
|
+
- [Phillip, Phil]
|
28
|
+
- [Rick, Ricky]
|
29
|
+
- [Robert, Bob, Bobby]
|
30
|
+
- [Samual, Sam]
|
31
|
+
- [Samuel, Sam]
|
32
|
+
- [Stephen, Steve]
|
33
|
+
- [Steven, Steve]
|
34
|
+
- [Terence, Terry]
|
35
|
+
- [Thomas, Tom, Tommy]
|
36
|
+
- [William, Will, Willy, Willie, Bill]
|
@@ -0,0 +1 @@
|
|
1
|
+
--- []
|
@@ -0,0 +1,41 @@
|
|
1
|
+
---
|
2
|
+
- [Abdul, Abul]
|
3
|
+
- [Alexander, Alex]
|
4
|
+
- [Anandagopal, Ananda]
|
5
|
+
- [Andrew, Andy]
|
6
|
+
- [Anne, Ann]
|
7
|
+
- [Anthony, Tony]
|
8
|
+
- [Benjamin, Ben]
|
9
|
+
- [Catherine, Cathy, Cath]
|
10
|
+
- [Daniel, Danial, Danny, Dan]
|
11
|
+
- [David, Dave]
|
12
|
+
- [Deborah, Debbie]
|
13
|
+
- [Des, Desmond]
|
14
|
+
- [Eamonn, Eamon]
|
15
|
+
- [Edward, Eddie, Eddy, Ed]
|
16
|
+
- [Eric, Erick, Erik]
|
17
|
+
- [Frederick, Frederic, Fred]
|
18
|
+
- [Gerald, Gerry]
|
19
|
+
- [Gerhard, Gerard, Ger, Gerry]
|
20
|
+
- [James, Jim, Jimmy]
|
21
|
+
- [Joanna, Joan, Joanne]
|
22
|
+
- [John, Johnny]
|
23
|
+
- [Jonathan, Jon]
|
24
|
+
- [Kenneth, Ken, Kenny]
|
25
|
+
- [Michael, Mike, Mick, Micky, Mickie, Mikey]
|
26
|
+
- [Nicholas, Nick, Nicolas]
|
27
|
+
- [Nicola, Nickie, Nicky]
|
28
|
+
- [Patrick, Pat, Paddy, Padraig, Padraic, Padhraig, Padhraic]
|
29
|
+
- [Patricia, Paddy, Patty, Pat]
|
30
|
+
- [Peter, Pete]
|
31
|
+
- [Philippe, Philip, Phillippe, Phillip]
|
32
|
+
- [Rick, Ricky]
|
33
|
+
- [Robert, Bob, Bobby]
|
34
|
+
- [Samual, Sam, Samuel]
|
35
|
+
- [Stef, Stefan, Stephan, Stefen, Stephen]
|
36
|
+
- [Steffy, Stefanie, Stephanie, Stefenie, Stephenie]
|
37
|
+
- [Stephen, Steve, Steven]
|
38
|
+
- [Terence, Terry]
|
39
|
+
- [Thomas, Tom, Tommy]
|
40
|
+
- [William, Will, Willy, Willie, Bill]
|
41
|
+
- [Sean, John, !ruby/regexp /^Bradley$/]
|
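A note on loading files like the one above: on recent Ruby versions the YAML loader is restrictive by default, so an entry tagged `!ruby/regexp` may be rejected unless `Regexp` is explicitly permitted. The sketch below shows one way to do that; `load_groups` is a hypothetical helper, the inline text stands in for a real file, and `permitted_classes:` assumes Psych 3.1 or later.

```ruby
require 'yaml'

# Hypothetical helper: parse alternative-name groups, allowing Regexp entries.
def load_groups(yaml_text)
  YAML.safe_load(yaml_text, permitted_classes: [Regexp])
end

# Inline YAML standing in for a file such as test_first_alts.yaml.
yaml_text = <<~DOC
  ---
  - [Stephen, Steve, Steven]
  - [Sean, John, !ruby/regexp /^Bradley$/]
DOC

load_groups(yaml_text).each do |group|
  cond  = group.find { |entry| entry.is_a?(Regexp) }
  names = group.reject { |entry| entry.is_a?(Regexp) }
  note  = cond ? " (conditional on #{cond.inspect})" : ""
  puts names.join(", ") + note
end
```

Only `Regexp` is added to the permitted classes, so other Ruby object tags in the file are still refused.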
data/lib/icu_name/name.rb
CHANGED
@@ -5,31 +5,42 @@ require 'active_support/core_ext/string/multibyte'
|
|
5
5
|
|
6
6
|
module ICU
|
7
7
|
class Name
|
8
|
+
# Revert to the default sets of alternative names.
|
9
|
+
def self.reset_alternatives
|
10
|
+
@@alts = Hash.new
|
11
|
+
@@cmps = Hash.new
|
12
|
+
end
|
13
|
+
|
14
|
+
# Perform a reset when the class is first loaded.
|
15
|
+
self.reset_alternatives
|
8
16
|
|
9
|
-
# Construct from one or two strings or any objects that have a to_s method.
|
17
|
+
# Construct a new name from one or two strings or any objects that have a to_s method.
|
10
18
|
def initialize(name1='', name2='')
|
11
19
|
@name1 = Util.to_utf8(name1.to_s)
|
12
20
|
@name2 = Util.to_utf8(name2.to_s)
|
13
21
|
originalize
|
14
22
|
canonicalize
|
23
|
+
@first.freeze
|
24
|
+
@last.freeze
|
25
|
+
@original.freeze
|
15
26
|
end
|
16
|
-
|
27
|
+
|
17
28
|
# Original text getter.
|
18
29
|
def original(opts={})
|
19
30
|
return transliterate(@original, opts[:chars]) if opts[:chars]
|
20
|
-
@original
|
31
|
+
@original.dup
|
21
32
|
end
|
22
33
|
|
23
34
|
# First name getter.
|
24
35
|
def first(opts={})
|
25
36
|
return transliterate(@first, opts[:chars]) if opts[:chars]
|
26
|
-
@first
|
37
|
+
@first.dup
|
27
38
|
end
|
28
39
|
|
29
40
|
# Last name getter.
|
30
41
|
def last(opts={})
|
31
42
|
return transliterate(@last, opts[:chars]) if opts[:chars]
|
32
|
-
@last
|
43
|
+
@last.dup
|
33
44
|
end
|
34
45
|
|
35
46
|
# Return a complete name, first name first, no comma.
|
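The hunk above freezes the stored strings and makes the getters return copies. A miniature of that pattern, using a hypothetical `FrozenName` class in place of `ICU::Name`, looks like this:

```ruby
# Hypothetical miniature of the pattern (not the gem's class).
class FrozenName
  def initialize(first)
    @first = first.to_s.dup.freeze   # internal state can no longer be mutated
  end

  # Hand out a copy so callers may modify what they receive.
  def first
    @first.dup
  end
end

name = FrozenName.new('Mark')
copy = name.first
copy.downcase!     # changes only the caller's copy
name.first         # => "Mark"
```

The `dup` is what makes destructive methods such as `downcase!` safe for callers; the `freeze` guards against accidental mutation from inside the class itself.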
@@ -50,7 +61,7 @@ module ICU
|
|
50
61
|
name
|
51
62
|
end
|
52
63
|
|
53
|
-
# Convert
|
64
|
+
# Convert to a string (same as rname).
|
54
65
|
def to_s(opts={})
|
55
66
|
rname(opts)
|
56
67
|
end
|
@@ -61,6 +72,17 @@ module ICU
|
|
61
72
|
match_first(first(opts), other.first(opts)) && match_last(last(opts), other.last(opts))
|
62
73
|
end
|
63
74
|
|
75
|
+
# Load a set of first or last name alternatives. If the YAML file name is absent,
|
76
|
+
# the default set is loaded. <tt>type</tt> should be <tt>:first</tt> or <tt>:last</tt>.
|
77
|
+
def self.load_alternatives(type, file=nil)
|
78
|
+
compile_alts(check_type(type), file, true)
|
79
|
+
end
|
80
|
+
|
81
|
+
# Show first name or last name alternatives.
|
82
|
+
def alternatives(type)
|
83
|
+
get_alts(check_type(type))
|
84
|
+
end
|
85
|
+
|
64
86
|
# :stopdoc:
|
65
87
|
private
|
66
88
|
|
@@ -70,7 +92,7 @@ module ICU
|
|
70
92
|
@original.strip!
|
71
93
|
@original.gsub!(/\s+/, ' ')
|
72
94
|
end
|
73
|
-
|
95
|
+
|
74
96
|
# Transliterate characters to ASCII or Latin1.
|
75
97
|
def transliterate(str, chars='US-ASCII')
|
76
98
|
case chars
|
@@ -154,6 +176,10 @@ module ICU
|
|
154
176
|
names
|
155
177
|
end
|
156
178
|
|
179
|
+
# Check the type argument to the public methods.
|
180
|
+
def check_type(type) self.class.instance_eval { check_type(type) }; end
|
181
|
+
def self.check_type(type) type = type.to_s == "last" ? :last : :first; end
|
182
|
+
|
157
183
|
# Match a complete first name.
|
158
184
|
def match_first(first1, first2)
|
159
185
|
# Is this one a walk in the park?
|
@@ -166,8 +192,9 @@ module ICU
|
|
166
192
|
# Get the long list and the short list.
|
167
193
|
long, short = first1.size >= first2.size ? [first1, first2] : [first2, first1]
|
168
194
|
|
169
|
-
# The short one must be a "subset" of the long one.
|
170
|
-
#
|
195
|
+
# The short one must be a "subset" of the long one. An extra condition must also be satisfied:
|
196
|
+
# either there has to be at least one match not involving initials or the first names must match.
|
197
|
+
# For example "M. J." matches "Mark" but not "John".
|
171
198
|
extra = false
|
172
199
|
(0..long.size-1).each do |i|
|
173
200
|
lword = long.shift
|
@@ -186,6 +213,7 @@ module ICU
|
|
186
213
|
# Match a complete last name.
|
187
214
|
def match_last(last1, last2)
|
188
215
|
return true if last1 == last2
|
216
|
+
return true if match_alt(:last, last1, last2)
|
189
217
|
[last1, last2].each do |last|
|
190
218
|
last.downcase! # case insensitive
|
191
219
|
last.gsub!(/\bmac/, 'mc') # MacDonaugh and McDonaugh
|
@@ -211,74 +239,73 @@ module ICU
|
|
211
239
|
initials = 0
|
212
240
|
initials+= 1 if first1.match(/^[A-Z\u{c0}-\u{de}]\.?$/)
|
213
241
|
initials+= 1 if first2.match(/^[A-Z\u{c0}-\u{de}]\.?$/)
|
214
|
-
return initials if first1 == first2
|
215
|
-
return 0 if initials == 0 &&
|
216
|
-
return -1 unless initials > 0
|
217
|
-
return initials if first1[0] == first2[0]
|
242
|
+
return initials if first1 == first2 # "W." and "W." or "William" and "William"
|
243
|
+
return 0 if initials == 0 && match_alt(:first, first1, first2) # "William" and "Bill"
|
244
|
+
return -1 unless initials > 0 # "William" and "Patricia"
|
245
|
+
return initials if first1[0] == first2[0] # "W." and "William" or "W." and "W"
|
218
246
|
-1
|
219
247
|
end
|
220
248
|
|
221
|
-
# Match two
|
222
|
-
def
|
223
|
-
|
224
|
-
|
225
|
-
return false unless
|
226
|
-
|
249
|
+
# Match two names that might be equivalent due to nicknames, misspellings, changed married names etc.
|
250
|
+
def match_alt(type, nam1, nam2)
|
251
|
+
self.class.compile_alts(type)
|
252
|
+
return false unless nams = @@alts[type][nam1]
|
253
|
+
return false unless cond = nams[nam2]
|
254
|
+
return true if cond == true
|
255
|
+
cond.match(type == :first ? @last : @first)
|
227
256
|
end
|
228
257
|
|
229
|
-
#
|
230
|
-
|
231
|
-
|
258
|
+
# Return an array of alternative first or second names (not including the original name).
|
259
|
+
# Allow for double barrelled last names or multiple first names.
|
260
|
+
def get_alts(type)
|
261
|
+
self.class.compile_alts(type)
|
262
|
+
name = self.send(type)
|
263
|
+
names = name.split(/[- ]/)
|
264
|
+
names.push(name) if names.length > 1
|
265
|
+
target = type == :first ? @last : @first
|
266
|
+
alts = Array.new
|
267
|
+
names.each do |n|
|
268
|
+
next unless @@alts[type][n]
|
269
|
+
@@alts[type][n].each_pair do |k, v|
|
270
|
+
alts.push k if v == true || v.match(target)
|
271
|
+
end
|
272
|
+
end
|
273
|
+
alts
|
274
|
+
end
|
275
|
+
|
276
|
+
# Compile an alternative names hash (for either first names or last names) before matching is first attempted.
|
277
|
+
def self.compile_alts(type, file=nil, force=false)
|
278
|
+
return if @@alts[type] && !force
|
279
|
+
file ||= File.expand_path(File.dirname(__FILE__) + "/../../config/#{type}_alternatives.yaml")
|
280
|
+
data = YAML.load(File.open file)
|
281
|
+
@@cmps[type] ||= 0
|
282
|
+
@@alts[type] = Hash.new
|
232
283
|
code = 1
|
233
|
-
|
234
|
-
|
235
|
-
|
236
|
-
|
284
|
+
data.each do |alts|
|
285
|
+
cond = true
|
286
|
+
alts.reject! do |a|
|
287
|
+
if a.instance_of?(Regexp)
|
288
|
+
cond = a
|
289
|
+
else
|
290
|
+
false
|
291
|
+
end
|
292
|
+
end
|
293
|
+
alts.each do |name|
|
294
|
+
alts.each do |other|
|
295
|
+
unless other == name
|
296
|
+
@@alts[type][name] ||= Hash.new
|
297
|
+
@@alts[type][name][other] = cond
|
298
|
+
end
|
299
|
+
end
|
237
300
|
end
|
238
301
|
code+= 1
|
239
302
|
end
|
303
|
+
@@cmps[type] += 1
|
240
304
|
end
|
241
305
|
|
242
|
-
#
|
243
|
-
|
244
|
-
|
245
|
-
|
246
|
-
Alexander Alex
|
247
|
-
Anandagopal Ananda
|
248
|
-
Andrew Andy
|
249
|
-
Anne Ann
|
250
|
-
Anthony Tony
|
251
|
-
Benjamin Ben
|
252
|
-
Catherine Cathy Cath
|
253
|
-
Daniel Danial Danny Dan
|
254
|
-
David Dave
|
255
|
-
Deborah Debbie
|
256
|
-
Des Desmond
|
257
|
-
Eamonn Eamon
|
258
|
-
Edward Eddie Ed
|
259
|
-
Eric Erick Erik
|
260
|
-
Frederick Frederic Fred
|
261
|
-
Gerald Gerry
|
262
|
-
Gerhard Gerard Ger
|
263
|
-
James Jim
|
264
|
-
Joanna Joan Joanne
|
265
|
-
John Johnny
|
266
|
-
Jonathan Jon
|
267
|
-
Kenneth Ken Kenny
|
268
|
-
Michael Mike Mick Micky
|
269
|
-
Nicholas Nick Nicolas
|
270
|
-
Nicola Nickie Nicky
|
271
|
-
Patrick Pat Paddy
|
272
|
-
Peter Pete
|
273
|
-
Philippe Philip Phillippe Phillip
|
274
|
-
Rick Ricky
|
275
|
-
Robert Bob Bobby
|
276
|
-
Samual Sam Samuel
|
277
|
-
Stefanie Stef
|
278
|
-
Stephen Steven Steve
|
279
|
-
Terence Terry
|
280
|
-
Thomas Tom Tommy
|
281
|
-
William Will Willy Willie Bill
|
282
|
-
EOF
|
306
|
+
# Return the number of YAML file compilations (for testing).
|
307
|
+
def self.alt_compilations(type)
|
308
|
+
@@cmps[check_type(type)] || 0
|
309
|
+
end
|
283
310
|
end
|
284
311
|
end
|
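The compile-once behaviour of `compile_alts` (skip the work unless the table is missing or a reload is forced, and count compilations for the specs) can be sketched independently of the gem. `AltCache` and its methods are hypothetical names, and the groups are passed in directly where the gem reads a YAML file.

```ruby
# Hypothetical sketch of compile-once caching with a forced reload
# (class and method names are illustrative, not the gem's API).
class AltCache
  def initialize
    @tables = {}            # type => compiled lookup table
    @counts = Hash.new(0)   # type => number of compilations performed
  end

  # Compile the table for a type only if it is missing or a reload is forced.
  def table_for(type, groups, force = false)
    return @tables[type] if @tables.key?(type) && !force
    @counts[type] += 1
    @tables[type] = groups.flat_map { |g| g.permutation(2).to_a }
                          .group_by(&:first)
                          .transform_values { |pairs| pairs.map(&:last) }
  end

  def compilations(type)
    @counts[type]
  end
end

cache  = AltCache.new
groups = [%w[Stephen Steve], %w[Steven Steve]]
cache.table_for(:first, groups)["Steve"]   # => ["Stephen", "Steven"]
cache.table_for(:first, groups)            # served from the cache
cache.compilations(:first)                 # => 1
cache.table_for(:first, groups, true)      # forced recompilation
cache.compilations(:first)                 # => 2
```

Because _Stephen_ and _Steven_ sit in separate groups, the table lists _Steve_ as the only alternative for each, which is why they do not match each other by default.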
data/lib/icu_name/version.rb
CHANGED
data/spec/name_spec.rb
CHANGED
@@ -3,6 +3,17 @@ require File.expand_path(File.dirname(__FILE__) + '/spec_helper')
|
|
3
3
|
|
4
4
|
module ICU
|
5
5
|
describe Name do
|
6
|
+
def load_alt_test(*types)
|
7
|
+
types.each do |type|
|
8
|
+
file = File.expand_path(File.dirname(__FILE__) + "/../config/test_#{type}_alts.yaml")
|
9
|
+
Name.load_alternatives(type, file)
|
10
|
+
end
|
11
|
+
end
|
12
|
+
|
13
|
+
def alt_compilations(type)
|
14
|
+
Name.alt_compilations(type)
|
15
|
+
end
|
16
|
+
|
6
17
|
context "public methods" do
|
7
18
|
before(:each) do
|
8
19
|
@simple = Name.new('mark j l', 'ORR')
|
@@ -68,7 +79,7 @@ module ICU
|
|
68
79
|
it "characters and encoding" do
|
69
80
|
ICU::Name.new('éric', 'PRIÉ').name.should == "Éric Prié"
|
70
81
|
ICU::Name.new('BARTŁOMIEJ', 'śliwa').name.should == "Bartłomiej Śliwa"
|
71
|
-
ICU::Name.new('
|
82
|
+
ICU::Name.new('Սմբատ', 'Լպուտյան').name.should == ""
|
72
83
|
eric = Name.new('éric'.encode("ISO-8859-1"), 'PRIÉ'.force_encoding("ASCII-8BIT"))
|
73
84
|
eric.rname.should == "Prié, Éric"
|
74
85
|
eric.rname.encoding.name.should == "UTF-8"
|
@@ -244,7 +255,7 @@ module ICU
|
|
244
255
|
@opt = { :chars => "US-ASCII" }
|
245
256
|
end
|
246
257
|
|
247
|
-
it "should be a no-op for names that already ASCII" do
|
258
|
+
it "should be a no-op for names that are already ASCII" do
|
248
259
|
name = Name.new('Mark J. L.', 'Orr')
|
249
260
|
name.first(@opt).should == 'Mark J. L.'
|
250
261
|
name.last(@opt).should == 'Orr'
|
@@ -325,6 +336,21 @@ module ICU
|
|
325
336
|
Name.new('Mick', 'Orr').match('Mike', 'Orr').should be_true
|
326
337
|
end
|
327
338
|
|
339
|
+
it "should handle ambiguous nicknames" do
|
340
|
+
Name.new('Gerry', 'Orr').match('Gerald', 'Orr').should be_true
|
341
|
+
Name.new('Gerry', 'Orr').match('Gerard', 'Orr').should be_true
|
342
|
+
Name.new('Gerard', 'Orr').match('Gerald', 'Orr').should be_false
|
343
|
+
end
|
344
|
+
|
345
|
+
it "should by default be cautious about misspellings" do
|
346
|
+
Name.new('Steven', 'Brady').match('Stephen', 'Brady').should be_false
|
347
|
+
Name.new('Philip', 'Short').match('Phillip', 'Short').should be_false
|
348
|
+
end
|
349
|
+
|
350
|
+
it "should by default have no conditional matches" do
|
351
|
+
Name.new('Sean', 'Bradley').match('John', 'Bradley').should be_false
|
352
|
+
end
|
353
|
+
|
328
354
|
it "should not mix up nick names" do
|
329
355
|
Name.new('David', 'Orr').match('Bill', 'Orr').should be_false
|
330
356
|
end
|
@@ -343,6 +369,11 @@ module ICU
|
|
343
369
|
Name.new('Alan', 'McDonagh').match('Alan', 'MacDonagh').should be_true
|
344
370
|
Name.new('Darko', 'Polimac').match('Darko', 'Polimc').should be_false
|
345
371
|
end
|
372
|
+
|
373
|
+
it "should by default have no conditional matches" do
|
374
|
+
Name.new('Debbie', 'Quinn').match('Debbie', 'Benjamin').should be_false
|
375
|
+
Name.new('Mairead', "O'Siochru").match('Mairead', 'King').should be_false
|
376
|
+
end
|
346
377
|
end
|
347
378
|
|
348
379
|
context "matches involving accented characters" do
|
@@ -361,5 +392,173 @@ module ICU
|
|
361
392
|
Name.new('Èric-K.', 'Cantona').match('E. K.', 'Cantona', :chars => "US-ASCII").should be_true
|
362
393
|
end
|
363
394
|
end
|
395
|
+
|
396
|
+
context "configuring new first name alternatives" do
|
397
|
+
before(:all) do
|
398
|
+
load_alt_test(:first)
|
399
|
+
end
|
400
|
+
|
401
|
+
it "should match some spelling errors" do
|
402
|
+
Name.new('Steven', 'Brady').match('Stephen', 'Brady').should be_true
|
403
|
+
Name.new('Philip', 'Short').match('Phillip', 'Short').should be_true
|
404
|
+
end
|
405
|
+
|
406
|
+
it "should handle conditional matches" do
|
407
|
+
Name.new('Sean', 'Collins').match('John', 'Collins').should be_false
|
408
|
+
Name.new('Sean', 'Bradley').match('John', 'Bradley').should be_true
|
409
|
+
end
|
410
|
+
end
|
411
|
+
|
412
|
+
context "configuring new last name alternatives" do
|
413
|
+
before(:all) do
|
414
|
+
load_alt_test(:last)
|
415
|
+
end
|
416
|
+
|
417
|
+
it "should match some spelling errors" do
|
418
|
+
Name.new('William', 'Ffrench').match('William', 'French').should be_true
|
419
|
+
end
|
420
|
+
|
421
|
+
it "should handle conditional matches" do
|
422
|
+
Name.new('Mark', 'Quinn').match('Mark', 'Benjamin').should be_false
|
423
|
+
Name.new('Debbie', 'Quinn').match('Debbie', 'Benjamin').should be_true
|
424
|
+
Name.new('Oisin', "O'Siochru").match('Oisin', 'King').should be_false
|
425
|
+
Name.new('Mairead', "O'Siochru").match('Mairead', 'King').should be_true
|
426
|
+
end
|
427
|
+
|
428
|
+
it "should allow some awesome matches" do
|
429
|
+
Name.new('debbie quinn').match('Deborah', 'Benjamin').should be_true
|
430
|
+
Name.new('french, william').match('Bill', 'Ffrench').should be_true
|
431
|
+
Name.new('Oissine', 'Murphy').match('Oissine', 'Murchadha').should be_true
|
432
|
+
end
|
433
|
+
end
|
434
|
+
|
435
|
+
context "configuring new first and new last name alternatives" do
|
436
|
+
before(:all) do
|
437
|
+
load_alt_test(:first, :last)
|
438
|
+
end
|
439
|
+
|
440
|
+
it "should allow some awesome matches" do
|
441
|
+
Name.new('french, steven').match('Stephen', 'Ffrench').should be_true
|
442
|
+
Name.new('Patrick', 'Murphy').match('Padraic', 'Murchadha').should be_true
|
443
|
+
end
|
444
|
+
end
|
445
|
+
|
446
|
+
context "reverting to the default configuration" do
|
447
|
+
before(:all) do
|
448
|
+
load_alt_test(:first, :last)
|
449
|
+
end
|
450
|
+
|
451
|
+
it "should not match so boldly after reverting" do
|
452
|
+
Name.new('french, steven').match('Stephen', 'Ffrench').should be_true
|
453
|
+
Name.load_alternatives(:first)
|
454
|
+
Name.new('Patrick', 'Murphy').match('Padraic', 'Murchadha').should be_false
|
455
|
+
Name.new('Patrick', 'Murphy').match('Patrick', 'Murchadha').should be_true
|
456
|
+
Name.load_alternatives(:last)
|
457
|
+
Name.new('Patrick', 'Murphy').match('Patrick', 'Murchadha').should be_false
|
458
|
+
end
|
459
|
+
end
|
460
|
+
|
461
|
+
context "name alternatives with default configuration" do
|
462
|
+
it "should show common nicknames" do
|
463
|
+
Name.new('William', 'Ffrench').alternatives(:first).should =~ %w{Bill Willy Willie Will}
|
464
|
+
Name.new('Bill', 'Ffrench').alternatives(:first).should =~ %w{William Willy Will Willie}
|
465
|
+
Name.new('Steven', 'Ffrench').alternatives(:first).should =~ %w{Steve}
|
466
|
+
Name.new('Stephen', 'Ffrench').alternatives(:first).should =~ %w{Steve}
|
467
|
+
Name.new('Michael Stephen', 'Ffrench').alternatives(:first).should =~ %w{Steve Mike Mick Mikey}
|
468
|
+
Name.new('Stephen M.', 'Ffrench').alternatives(:first).should =~ %w{Steve}
|
469
|
+
Name.new('S.', 'Ffrench').alternatives(:first).should =~ []
|
470
|
+
Name.new('Sean', 'Bradley').alternatives(:first).should =~ []
|
471
|
+
end
|
472
|
+
|
473
|
+
it "should not have any last name alternatives" do
|
474
|
+
Name.new('William', 'Ffrench').alternatives(:last).should =~ []
|
475
|
+
Name.new('Mairead', "O'Siochru").alternatives(:last).should =~ []
|
476
|
+
Name.new('Oissine', 'Murphy').alternatives(:last).should =~ []
|
477
|
+
Name.new('Debbie', 'Quinn').alternatives(:last).should =~ []
|
478
|
+
end
|
479
|
+
end
|
480
|
+
|
481
|
+
context "name alternatives with more adventurous configuration" do
|
482
|
+
before(:all) do
|
483
|
+
load_alt_test(:first, :last)
|
484
|
+
end
|
485
|
+
|
486
|
+
it "should show additional nicknames" do
|
487
|
+
Name.new('Steven', 'Ffrench').alternatives(:first).should =~ %w{Stephen Steve}
|
488
|
+
Name.new('Stephen', 'Ffrench').alternatives(:first).should =~ %w{Stef Stefan Stefen Stephan Steve Steven}
|
489
|
+
Name.new('Stephen Mike', 'Ffrench').alternatives(:first).should =~ %w{Michael Mick Mickie Micky Mikey Stef Stefan Stefen Stephan Steve Steven}
|
490
|
+
Name.new('Sean', 'Bradley').alternatives(:first).should =~ %w{John}
|
491
|
+
Name.new('Sean', 'McDonagh').alternatives(:first).should =~ []
|
492
|
+
Name.new('John', 'Bradley').alternatives(:first).should =~ %w{Sean Johnny}
|
493
|
+
end
|
494
|
+
|
495
|
+
it "should have some last name alternatives" do
|
496
|
+
Name.new('William', 'Ffrench').alternatives(:last).should =~ %w{French}
|
497
|
+
Name.new('Mairead', "O'Siochru").alternatives(:last).should =~ %w{King}
|
498
|
+
Name.new('Oissine', 'Murphy').alternatives(:last).should =~ %w{Murchadha}
|
499
|
+
Name.new('Debbie', 'Quinn').alternatives(:last).should =~ %w{Benjamin}
|
500
|
+
Name.new('Mark', 'Quinn').alternatives(:last).should =~ []
|
501
|
+
Name.new('Debbie', 'Quinn-French').alternatives(:last).should =~ %w{Benjamin Ffrench}
|
502
|
+
end
|
503
|
+
end
|
504
|
+
|
505
|
+
context "number of alternative compilations" do
|
506
|
+
before(:all) do
|
507
|
+
Name.reset_alternatives
|
508
|
+
end
|
509
|
+
|
510
|
+
it "should be no more than necessary" do
|
511
|
+
alt_compilations(:first).should == 0
|
512
|
+
alt_compilations(:last).should == 0
|
513
|
+
Name.new('William', 'Ffrench').match('Bill', 'French')
|
514
|
+
alt_compilations(:first).should == 1
|
515
|
+
alt_compilations(:last).should == 1
|
516
|
+
Name.new('Debbie', 'Quinn').match('Deborah', 'Benjamin')
|
517
|
+
alt_compilations(:first).should == 1
|
518
|
+
alt_compilations(:last).should == 1
|
519
|
+
load_alt_test(:first)
|
520
|
+
alt_compilations(:first).should == 2
|
521
|
+
alt_compilations(:last).should == 1
|
522
|
+
load_alt_test(:last)
|
523
|
+
alt_compilations(:first).should == 2
|
524
|
+
alt_compilations(:last).should == 2
|
525
|
+
Name.new('William', 'Ffrench').match('Bill', 'French')
|
526
|
+
Name.new('Debbie', 'Quinn').match('Deborah', 'Benjamin')
|
527
|
+
Name.new('Mark', 'Orr').alternatives(:first)
|
528
|
+
Name.new('Mark', 'Orr').alternatives(:last)
|
529
|
+
alt_compilations(:first).should == 2
|
530
|
+
alt_compilations(:last).should == 2
|
531
|
+
end
|
532
|
+
end
|
533
|
+
|
534
|
+
context "immutability" do
|
535
|
+
before(:each) do
|
536
|
+
@mark = ICU::Name.new('Màrk', 'Orr')
|
537
|
+
end
|
538
|
+
|
539
|
+
it "there are no setters" do
|
540
|
+
lambda { @mark.first = "Malcolm" }.should raise_error(/undefined/)
|
541
|
+
lambda { @mark.last = "Dickie" }.should raise_error(/undefined/)
|
542
|
+
lambda { @mark.original = "mark orr" }.should raise_error(/undefined/)
|
543
|
+
end
|
544
|
+
|
545
|
+
it "should prevent accidental access to the instance variables" do
|
546
|
+
@mark.first.downcase!
|
547
|
+
@mark.first.should == "Màrk"
|
548
|
+
@mark.last.downcase!
|
549
|
+
@mark.last.should == "Orr"
|
550
|
+
@mark.original.downcase!
|
551
|
+
@mark.original.should == "Orr, Màrk"
|
552
|
+
end
|
553
|
+
|
554
|
+
it "should prevent accidental access to the instance variables when transliterating" do
|
555
|
+
@mark.first(:chars => "US-ASCII").downcase!
|
556
|
+
@mark.first.should == "Màrk"
|
557
|
+
@mark.last(:chars => "US-ASCII").downcase!
|
558
|
+
@mark.last.should == "Orr"
|
559
|
+
@mark.original(:chars => "US-ASCII").downcase!
|
560
|
+
@mark.original.should == "Orr, Màrk"
|
561
|
+
end
|
562
|
+
end
|
364
563
|
end
|
365
564
|
end
|
metadata
CHANGED
@@ -3,10 +3,10 @@ name: icu_name
|
|
3
3
|
version: !ruby/object:Gem::Version
|
4
4
|
prerelease: false
|
5
5
|
segments:
|
6
|
-
- 0
|
7
6
|
- 1
|
8
|
-
-
|
9
|
-
|
7
|
+
- 0
|
8
|
+
- 0
|
9
|
+
version: 1.0.0
|
10
10
|
platform: ruby
|
11
11
|
authors:
|
12
12
|
- Mark Orr
|
@@ -14,7 +14,7 @@ autorequire:
|
|
14
14
|
bindir: bin
|
15
15
|
cert_chain: []
|
16
16
|
|
17
|
-
date: 2011-
|
17
|
+
date: 2011-04-16 00:00:00 +01:00
|
18
18
|
default_executable:
|
19
19
|
dependencies:
|
20
20
|
- !ruby/object:Gem::Dependency
|
@@ -138,6 +138,10 @@ files:
|
|
138
138
|
- spec/name_spec.rb
|
139
139
|
- spec/spec_helper.rb
|
140
140
|
- spec/util_spec.rb
|
141
|
+
- config/first_alternatives.yaml
|
142
|
+
- config/last_alternatives.yaml
|
143
|
+
- config/test_first_alts.yaml
|
144
|
+
- config/test_last_alts.yaml
|
141
145
|
- LICENCE
|
142
146
|
- README.rdoc
|
143
147
|
has_rdoc: true
|