icu_name 0.1.4 → 1.0.0
- data/README.rdoc +134 -41
- data/config/first_alternatives.yaml +36 -0
- data/config/last_alternatives.yaml +1 -0
- data/config/test_first_alts.yaml +41 -0
- data/config/test_last_alts.yaml +5 -0
- data/lib/icu_name/name.rb +94 -67
- data/lib/icu_name/version.rb +1 -1
- data/spec/name_spec.rb +201 -2
- metadata +8 -4
data/README.rdoc
CHANGED
@@ -23,50 +23,56 @@ To create a name object, supply both the first and second names separately to th
|
|
23
23
|
|
24
24
|
Capitalisation, white space and punctuation will all be automatically corrected:
|
25
25
|
|
26
|
-
robert.name
|
27
|
-
robert.rname
|
26
|
+
robert.name # => 'Robert J. Fischer'
|
27
|
+
robert.rname # => 'Fischer, Robert J.' (reversed name)
|
28
28
|
|
29
29
|
The input text, without any changes apart from white-space cleanup and the insertion of a comma
|
30
|
-
(to separate the two names), is returned by the
|
30
|
+
(to separate the two names), is returned by the <tt>original</tt> method:
|
31
31
|
|
32
|
-
robert.original
|
32
|
+
robert.original # => 'FISCHER, robert j'
|
33
33
|
|
34
34
|
To avoid ambiguity when either the first or second names consist of multiple words, it is better to
|
35
|
-
supply the two separately
|
36
|
-
|
35
|
+
supply the two separately. If the full name is supplied alone to the constructor, without any indication
|
36
|
+
of where the first names end, then the last distinct name is assumed to be the last name.
|
37
37
|
|
38
38
|
bobby = ICU::Name.new(' bobby fischer ')
|
39
39
|
|
40
|
-
bobby.first
|
41
|
-
bobby.last
|
40
|
+
bobby.first # => 'Bobby'
|
41
|
+
bobby.last # => 'Fischer'
|
42
42
|
|
43
|
-
|
43
|
+
In this case, since the names were not supplied separately, the <tt>original</tt> text will not contain a comma:
|
44
44
|
|
45
|
-
bobby.original
|
45
|
+
bobby.original # => 'bobby fischer'
|
46
46
|
|
47
47
|
Names will match even if one is missing middle initials or if a nickname is used for one of the first names.
|
48
48
|
|
49
|
-
bobby.match('Robert J.', 'Fischer')
|
49
|
+
bobby.match('Robert J.', 'Fischer') # => true
|
50
50
|
|
51
|
-
|
52
|
-
and not all possibilities.
|
51
|
+
The method <tt>alternatives</tt> can be used to list alternatives to a given first or last name:
|
53
52
|
|
54
|
-
|
53
|
+
Name.new('Stephen', 'Orr').alternatives(:first) # => ["Steve"]
|
54
|
+
Name.new('Michael Stephen', 'Orr').alternatives(:first) # => ["Steve", "Mike", "Mick", "Mikey"]
|
55
|
+
Name.new('Mark', 'Orr').alternatives(:first) # => []
|
56
|
+
|
57
|
+
By default the class is only aware of a few common alternatives for first names (e.g. _Bobby_ and _Robert_,
|
58
|
+
_Bill_ and _William_, etc). However, this can be customized (see below).
|
59
|
+
|
60
|
+
Supplying the <tt>match</tt> method with strings is equivalent to instantiating an instance with the same
|
55
61
|
strings and then matching it. So, for example the following are equivalent:
|
56
62
|
|
57
|
-
robert.match('R.', 'Fischer')
|
58
|
-
robert.match(ICU::Name.new('R.', 'Fischer'))
|
63
|
+
robert.match('R.', 'Fischer') # => true
|
64
|
+
robert.match(ICU::Name.new('R.', 'Fischer')) # => true
|
59
65
|
|
60
|
-
|
61
|
-
always work with initials. In the next example, the initial _R_ does not match the first letter
|
62
|
-
nickname _Bobby_.
|
66
|
+
Here the initial _R_ matches the first letter of _Robert_. However, nickname matches will not
|
67
|
+
always work with initials. In the next example, the initial _R_ does not match the first letter
|
68
|
+
_B_ of the nickname _Bobby_.
|
63
69
|
|
64
|
-
bobby.match('R. J.', 'Fischer')
|
70
|
+
bobby.match('R. J.', 'Fischer') # => false
|
65
71
|
|
66
|
-
Some
|
72
|
+
Some other ways last names are canonicalised are illustrated below:
|
67
73
|
|
68
|
-
ICU::Name.new('John', 'O Reilly').last
|
69
|
-
ICU::Name.new('dave', 'mcmanus').last
|
74
|
+
ICU::Name.new('John', 'O Reilly').last # => "O'Reilly"
|
75
|
+
ICU::Name.new('dave', 'mcmanus').last # => "McManus"
|
70
76
|
|
71
77
|
== Characters and Encoding
|
72
78
|
|
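The README text in this hunk says that when a full name is passed as a single string, the last distinct word is taken as the last name. A minimal sketch of that rule follows. The helper name `split_full_name_sketch` is hypothetical and is not part of icu_name; the comma handling is an assumption inferred from spec examples such as `Name.new('french, william')`, and the gem's real parsing (which also canonicalises case and punctuation) is not shown in this diff.

```ruby
# Hypothetical helper, not part of icu_name: splits one full-name string
# into [first, last] using the rule described in the README text.
def split_full_name_sketch(text)
  # Collapse runs of white space, as the README says the gem does.
  cleaned = text.strip.gsub(/\s+/, " ")
  if cleaned.include?(",")
    # Assumed "Last, First" form: text before the first comma is the last name.
    last, first = cleaned.split(/\s*,\s*/, 2)
  else
    # No comma: the final word is assumed to be the last name.
    words = cleaned.split(" ")
    last = words.pop.to_s
    first = words.join(" ")
  end
  [first, last]
end

first, last = split_full_name_sketch(" bobby fischer ")
puts "first=#{first} last=#{last}"
```

The sketch leaves capitalisation alone, so it only illustrates where the split falls, not the canonical form the gem returns.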
@@ -76,40 +82,127 @@ Along with hyphens and single quotes (which represent apostrophes) letters in ISO
|
|
76
82
|
character plus one or more diacritics (e.g. "ł" or "Ś") are preserved, while everything
|
77
83
|
else is removed.
|
78
84
|
|
79
|
-
ICU::Name.new('éric', 'PRIÉ').name
|
80
|
-
ICU::Name.new('BARTŁOMIEJ', 'śliwa').name
|
81
|
-
ICU::Name.new('
|
85
|
+
ICU::Name.new('éric', 'PRIÉ').name # => "Éric Prié"
|
86
|
+
ICU::Name.new('BARTŁOMIEJ', 'śliwa').name # => "Bartłomiej Śliwa"
|
87
|
+
ICU::Name.new('Սմբատ', 'Լպուտյան').name # => ""
|
82
88
|
|
83
|
-
The various accessors (
|
89
|
+
The various accessors (<tt>first</tt>, <tt>last</tt>, <tt>name</tt>, <tt>rname</tt>, <tt>to_s</tt>, <tt>original</tt>) always return
|
84
90
|
strings encoded in UTF-8, no matter what the input encoding.
|
85
91
|
|
86
92
|
eric = ICU::Name.new('éric'.encode("ISO-8859-1"), 'PRIÉ'.force_encoding("ASCII-8BIT"))
|
87
|
-
eric.rname
|
88
|
-
eric.rname.encoding.name
|
89
|
-
eric.original
|
90
|
-
eric.original.encoding.name
|
93
|
+
eric.rname # => "Prié, Éric"
|
94
|
+
eric.rname.encoding.name # => "UTF-8"
|
95
|
+
eric.original # => "PRIÉ, éric"
|
96
|
+
eric.original.encoding.name # => "UTF-8"
|
91
97
|
|
92
98
|
Accented letters can be transliterated into their US-ASCII counterparts by setting the
|
93
|
-
|
99
|
+
<tt>:chars</tt> option, which is available in all accessors. For example:
|
94
100
|
|
95
|
-
eric.rname(:chars => "US-ASCII")
|
96
|
-
eric.original(:chars => "US-ASCII")
|
101
|
+
eric.rname(:chars => "US-ASCII") # => "Prie, Eric"
|
102
|
+
eric.original(:chars => "US-ASCII") # => "PRIE, eric"
|
97
103
|
|
98
104
|
Also possible is the preservation of ISO-8859-1 characters, but the transliteration of
|
99
105
|
all other accented characters:
|
100
106
|
|
101
107
|
joe = Name.new('Józef', 'Żabiński')
|
102
|
-
joe.rname
|
103
|
-
joe.rname(:chars => "ISO-8859-1")
|
104
|
-
joe.rname(:chars => "US-ASCII")
|
108
|
+
joe.rname # => "Żabiński, Józef"
|
109
|
+
joe.rname(:chars => "ISO-8859-1") # => "Zabinski, Józef"
|
110
|
+
joe.rname(:chars => "US-ASCII") # => "Zabinski, Jozef"
|
105
111
|
|
106
112
|
Note that the character encoding of the strings returned is still UTF-8 in all cases.
|
107
113
|
The same option also relaxes the need for accented characters to match exactly:
|
108
114
|
|
109
|
-
eric.match('Eric', 'Prie')
|
110
|
-
eric.match('Eric', 'Prie', :chars => "US-ASCII")
|
111
|
-
joe.match('Józef', 'Zabinski')
|
112
|
-
joe.match('Józef', 'Zabinski', :chars => "ISO-8859-1")
|
115
|
+
eric.match('Eric', 'Prie') # => false
|
116
|
+
eric.match('Eric', 'Prie', :chars => "US-ASCII") # => true
|
117
|
+
joe.match('Józef', 'Zabinski') # => false
|
118
|
+
joe.match('Józef', 'Zabinski', :chars => "ISO-8859-1") # => true
|
119
|
+
|
120
|
+
== Customization of Alternative Names
|
121
|
+
|
122
|
+
We saw above how _Bobby_ and _Robert_ were able to match because, by default, the
|
123
|
+
matcher is aware of some common English nicknames. These name alternatives can be
|
124
|
+
customised to handle additional nicknames and other types of alternative names
|
125
|
+
such as common spelling mistakes and name changes.
|
126
|
+
|
127
|
+
The alternative names are specified in two YAML files, one for first names and
|
128
|
+
one for last names. Each YAML file represents an array and each element in the
|
129
|
+
array is an array representing a set of alternative names. Here, for example,
|
130
|
+
are some of the default first name alternatives:
|
131
|
+
|
132
|
+
[Anthony, Tony]
|
133
|
+
[James, Jim, Jimmy]
|
134
|
+
[Michael, Mike, Mick, Mikey]
|
135
|
+
[Robert, Bob, Bobby]
|
136
|
+
[Stephen, Steve]
|
137
|
+
[Steven, Steve]
|
138
|
+
[Thomas, Tom, Tommy]
|
139
|
+
[William, Will, Willy, Willie, Bill]
|
140
|
+
|
141
|
+
The first of these means that _Anthony_ and _Tony_ are considered equivalent and can match.
|
142
|
+
|
143
|
+
Name.new("Tony", "Miles").match("Anthony", "Miles") # => true
|
144
|
+
|
145
|
+
Note that both _Steven_ and _Stephen_ match _Steve_ but, because they don't occur in the
|
146
|
+
same group, they don't match each other.
|
147
|
+
|
148
|
+
Name.new("Steven", "Hanly").match("Steve", "Hanly") # => true
|
149
|
+
Name.new("Stephen", "Hanly").match("Steve", "Hanly") # => true
|
150
|
+
Name.new("Stephen", "Hanly").match("Steven", "Hanly") # => false
|
151
|
+
|
152
|
+
To customize alternative name behaviour, prepare YAML files with your chosen alternatives
|
153
|
+
and then replace the default alternatives like this:
|
154
|
+
|
155
|
+
Name.load_alternatives(:first, "my_first_name_alternatives.yaml")
|
156
|
+
Name.load_alternatives(:last, "my_last_name_alternatives.yaml")
|
157
|
+
|
158
|
+
An example of one way in which you might want to customize the alternatives is to
|
159
|
+
cater for common spelling mistakes such as _Steven_ and _Stephen_. These two names
|
160
|
+
don't match by default, but you can make them so by replacing the two default rules:
|
161
|
+
|
162
|
+
[Stephen, Steve]
|
163
|
+
[Steven, Steve]
|
164
|
+
|
165
|
+
with the following single rule:
|
166
|
+
|
167
|
+
[Stephen, Steven, Steve]
|
168
|
+
|
169
|
+
so that now:
|
170
|
+
|
171
|
+
Name.new("Stephen", "Hanly").match("Steven", "Hanly") # => true
|
172
|
+
|
173
|
+
Another use is to cater for English and Irish versions of the same name. For example,
|
174
|
+
for last names:
|
175
|
+
|
176
|
+
[Murphy, Murchadha]
|
177
|
+
|
178
|
+
or for first names, including spelling variations:
|
179
|
+
|
180
|
+
[Patrick, Pat, Paddy, Padraig, Padraic, Padhraig, Padhraic]
|
181
|
+
|
182
|
+
== Conditional Alternatives
|
183
|
+
|
184
|
+
Normally, entries in the two YAML files are just lists of alternative names. There is one
|
185
|
+
exception to this however, when one of the entries (it doesn't matter which one but,
|
186
|
+
by convention, the last one) is a regular expression. Here is an example that might
|
187
|
+
be added to the last name alternatives:
|
188
|
+
|
189
|
+
[Quinn, Benjamin, !ruby/regexp /^(Debbie|Deborah)$/]
|
190
|
+
|
191
|
+
What this means is that the last names _Quinn_ and _Benjamin_ match but only when the
|
192
|
+
first name matches the regular expression.
|
193
|
+
|
194
|
+
Name.new("Debbie", "Quinn").match("Debbie", "Benjamin") # => true
|
195
|
+
Name.new("Mark", "Quinn").match("Mark", "Benjamin") # => false
|
196
|
+
|
197
|
+
Another example, this time for first names, is:
|
198
|
+
|
199
|
+
[Sean, John, !ruby/regexp /^Bradley$/]
|
200
|
+
|
201
|
+
This caters for an individual who is known by two normally unrelated first names.
|
202
|
+
We only want these two names to match for that individual and no others.
|
203
|
+
|
204
|
+
Name.new("John", "Bradley").match("Sean", "Bradley") # => true
|
205
|
+
Name.new("John", "Alfred").match("Sean", "Alfred") # => false
|
113
206
|
|
114
207
|
== Author
|
115
208
|
|
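The README describes alternatives files as a YAML array of arrays, optionally carrying a `!ruby/regexp` entry as a condition. The sketch below shows what such a document parses to. The constant and method names are hypothetical. It uses `YAML.safe_load` with `permitted_classes: [Regexp]`, which is an assumption about running on a recent Ruby where plain `YAML.load` may refuse to build `Regexp` objects; the gem code in this diff calls `YAML.load` directly.

```ruby
require "yaml"

# Hypothetical sample in the format the README describes.
SAMPLE_ALTERNATIVES_SKETCH = <<~YAML
  ---
  - [Murphy, Murchadha]
  - [Quinn, Benjamin, !ruby/regexp /^(Debbie|Deborah)$/]
YAML

# Hypothetical helper: parse an alternatives document into arrays of
# names, allowing Regexp objects for conditional entries.
def parse_alternatives_sketch(text)
  YAML.safe_load(text, permitted_classes: [Regexp])
end

groups = parse_alternatives_sketch(SAMPLE_ALTERNATIVES_SKETCH)
groups.each do |group|
  names = group.reject { |entry| entry.is_a?(Regexp) }
  condition = group.find { |entry| entry.is_a?(Regexp) }
  puts "#{names.join(' / ')} #{condition ? "(only when #{condition.inspect} matches)" : ''}"
end
```

Each inner array is one set of interchangeable names; a `Regexp` entry, when present, restricts the set to names whose other part matches it.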
@@ -0,0 +1,36 @@
|
|
1
|
+
---
|
2
|
+
- [Alexander, Alex]
|
3
|
+
- [Andrew, Andy]
|
4
|
+
- [Anthony, Tony]
|
5
|
+
- [Benjamin, Ben]
|
6
|
+
- [Catherine, Cathy, Cath]
|
7
|
+
- [Daniel, Danny, Dan]
|
8
|
+
- [David, Dave]
|
9
|
+
- [Deborah, Debbie]
|
10
|
+
- [Des, Desmond]
|
11
|
+
- [Edward, Eddie, Eddy, Ed]
|
12
|
+
- [Frederick, Fred]
|
13
|
+
- [Frederic, Fred]
|
14
|
+
- [Gerald, Gerry]
|
15
|
+
- [Gerard, Gerry]
|
16
|
+
- [James, Jim, Jimmy]
|
17
|
+
- [John, Johnny]
|
18
|
+
- [Jonathan, Jon]
|
19
|
+
- [Kenneth, Ken, Kenny]
|
20
|
+
- [Michael, Mike, Mick, Mikey]
|
21
|
+
- [Nic, Nick, Nicolas]
|
22
|
+
- [Nicola, Nickie, Nicky]
|
23
|
+
- [Patrick, Pat]
|
24
|
+
- [Patricia, Patty, Pat]
|
25
|
+
- [Peter, Pete]
|
26
|
+
- [Philip, Phil]
|
27
|
+
- [Phillip, Phil]
|
28
|
+
- [Rick, Ricky]
|
29
|
+
- [Robert, Bob, Bobby]
|
30
|
+
- [Samual, Sam]
|
31
|
+
- [Samuel, Sam]
|
32
|
+
- [Stephen, Steve]
|
33
|
+
- [Steven, Steve]
|
34
|
+
- [Terence, Terry]
|
35
|
+
- [Thomas, Tom, Tommy]
|
36
|
+
- [William, Will, Willy, Willie, Bill]
|
@@ -0,0 +1 @@
|
|
1
|
+
--- []
|
@@ -0,0 +1,41 @@
|
|
1
|
+
---
|
2
|
+
- [Abdul, Abul]
|
3
|
+
- [Alexander, Alex]
|
4
|
+
- [Anandagopal, Ananda]
|
5
|
+
- [Andrew, Andy]
|
6
|
+
- [Anne, Ann]
|
7
|
+
- [Anthony, Tony]
|
8
|
+
- [Benjamin, Ben]
|
9
|
+
- [Catherine, Cathy, Cath]
|
10
|
+
- [Daniel, Danial, Danny, Dan]
|
11
|
+
- [David, Dave]
|
12
|
+
- [Deborah, Debbie]
|
13
|
+
- [Des, Desmond]
|
14
|
+
- [Eamonn, Eamon]
|
15
|
+
- [Edward, Eddie, Eddy, Ed]
|
16
|
+
- [Eric, Erick, Erik]
|
17
|
+
- [Frederick, Frederic, Fred]
|
18
|
+
- [Gerald, Gerry]
|
19
|
+
- [Gerhard, Gerard, Ger, Gerry]
|
20
|
+
- [James, Jim, Jimmy]
|
21
|
+
- [Joanna, Joan, Joanne]
|
22
|
+
- [John, Johnny]
|
23
|
+
- [Jonathan, Jon]
|
24
|
+
- [Kenneth, Ken, Kenny]
|
25
|
+
- [Michael, Mike, Mick, Micky, Mickie, Mikey]
|
26
|
+
- [Nicholas, Nick, Nicolas]
|
27
|
+
- [Nicola, Nickie, Nicky]
|
28
|
+
- [Patrick, Pat, Paddy, Padraig, Padraic, Padhraig, Padhraic]
|
29
|
+
- [Patricia, Paddy, Patty, Pat]
|
30
|
+
- [Peter, Pete]
|
31
|
+
- [Philippe, Philip, Phillippe, Phillip]
|
32
|
+
- [Rick, Ricky]
|
33
|
+
- [Robert, Bob, Bobby]
|
34
|
+
- [Samual, Sam, Samuel]
|
35
|
+
- [Stef, Stefan, Stephan, Stefen, Stephen]
|
36
|
+
- [Steffy, Stefanie, Stephanie, Stefenie, Stephenie]
|
37
|
+
- [Stephen, Steve, Steven]
|
38
|
+
- [Terence, Terry]
|
39
|
+
- [Thomas, Tom, Tommy]
|
40
|
+
- [William, Will, Willy, Willie, Bill]
|
41
|
+
- [Sean, John, !ruby/regexp /^Bradley$/]
|
data/lib/icu_name/name.rb
CHANGED
@@ -5,31 +5,42 @@ require 'active_support/core_ext/string/multibyte'
|
|
5
5
|
|
6
6
|
module ICU
|
7
7
|
class Name
|
8
|
+
# Revert to the default sets of alternative names.
|
9
|
+
def self.reset_alternatives
|
10
|
+
@@alts = Hash.new
|
11
|
+
@@cmps = Hash.new
|
12
|
+
end
|
13
|
+
|
14
|
+
# Perform a reset when the class is first loaded.
|
15
|
+
self.reset_alternatives
|
8
16
|
|
9
|
-
# Construct from one or two strings or any objects that have a to_s method.
|
17
|
+
# Construct a new name from one or two strings or any objects that have a to_s method.
|
10
18
|
def initialize(name1='', name2='')
|
11
19
|
@name1 = Util.to_utf8(name1.to_s)
|
12
20
|
@name2 = Util.to_utf8(name2.to_s)
|
13
21
|
originalize
|
14
22
|
canonicalize
|
23
|
+
@first.freeze
|
24
|
+
@last.freeze
|
25
|
+
@original.freeze
|
15
26
|
end
|
16
|
-
|
27
|
+
|
17
28
|
# Original text getter.
|
18
29
|
def original(opts={})
|
19
30
|
return transliterate(@original, opts[:chars]) if opts[:chars]
|
20
|
-
@original
|
31
|
+
@original.dup
|
21
32
|
end
|
22
33
|
|
23
34
|
# First name getter.
|
24
35
|
def first(opts={})
|
25
36
|
return transliterate(@first, opts[:chars]) if opts[:chars]
|
26
|
-
@first
|
37
|
+
@first.dup
|
27
38
|
end
|
28
39
|
|
29
40
|
# Last name getter.
|
30
41
|
def last(opts={})
|
31
42
|
return transliterate(@last, opts[:chars]) if opts[:chars]
|
32
|
-
@last
|
43
|
+
@last.dup
|
33
44
|
end
|
34
45
|
|
35
46
|
# Return a complete name, first name first, no comma.
|
@@ -50,7 +61,7 @@ module ICU
|
|
50
61
|
name
|
51
62
|
end
|
52
63
|
|
53
|
-
# Convert
|
64
|
+
# Convert to a string (same as rname).
|
54
65
|
def to_s(opts={})
|
55
66
|
rname(opts)
|
56
67
|
end
|
@@ -61,6 +72,17 @@ module ICU
|
|
61
72
|
match_first(first(opts), other.first(opts)) && match_last(last(opts), other.last(opts))
|
62
73
|
end
|
63
74
|
|
75
|
+
# Load a set of first or last name alternatives. If the YAML file name is absent,
|
76
|
+
# the default set is loaded. <tt>type</tt> should be <tt>:first</tt> or <tt>:last</tt>.
|
77
|
+
def self.load_alternatives(type, file=nil)
|
78
|
+
compile_alts(check_type(type), file, true)
|
79
|
+
end
|
80
|
+
|
81
|
+
# Show first name or last name alternatives.
|
82
|
+
def alternatives(type)
|
83
|
+
get_alts(check_type(type))
|
84
|
+
end
|
85
|
+
|
64
86
|
# :stopdoc:
|
65
87
|
private
|
66
88
|
|
@@ -70,7 +92,7 @@ module ICU
|
|
70
92
|
@original.strip!
|
71
93
|
@original.gsub!(/\s+/, ' ')
|
72
94
|
end
|
73
|
-
|
95
|
+
|
74
96
|
# Transliterate characters to ASCII or Latin1.
|
75
97
|
def transliterate(str, chars='US-ASCII')
|
76
98
|
case chars
|
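The body of `transliterate` is cut off by this hunk, so the following is only a sketch of one way the `:chars` behaviour described in the README could work, not the gem's implementation. It assumes Unicode NFD decomposition plus removal of combining marks, and the method names are hypothetical. Letters that do not decompose, such as "ł", would pass through unchanged in this sketch.

```ruby
# Hypothetical helper: true if the character exists in ISO-8859-1.
def latin1_char_sketch?(char)
  char.encode("ISO-8859-1")
  true
rescue EncodingError
  false
end

# Hypothetical sketch of the :chars option. With "US-ASCII" every accented
# letter is reduced to its base letter; with "ISO-8859-1" letters that
# exist in Latin-1 are kept and only the rest are reduced.
def transliterate_sketch(str, chars = "US-ASCII")
  str.each_char.map do |char|
    next char if char.ascii_only?
    next char if chars == "ISO-8859-1" && latin1_char_sketch?(char)
    # Decompose into base letter + combining marks, then drop the marks.
    char.unicode_normalize(:nfd).gsub(/\p{Mn}/, "")
  end.join
end

name = "\u017Babi\u0144ski, J\u00F3zef" # the README's Polish example
puts transliterate_sketch(name)
puts transliterate_sketch(name, "ISO-8859-1")
```

The returned strings stay UTF-8 encoded in both modes, which agrees with the README's note that only the character repertoire changes, not the encoding.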
@@ -154,6 +176,10 @@ module ICU
|
|
154
176
|
names
|
155
177
|
end
|
156
178
|
|
179
|
+
# Check the type argument to the public methods.
|
180
|
+
def check_type(type) self.class.instance_eval { check_type(type) }; end
|
181
|
+
def self.check_type(type) type = type.to_s == "last" ? :last : :first; end
|
182
|
+
|
157
183
|
# Match a complete first name.
|
158
184
|
def match_first(first1, first2)
|
159
185
|
# Is this one a walk in the park?
|
@@ -166,8 +192,9 @@ module ICU
|
|
166
192
|
# Get the long list and the short list.
|
167
193
|
long, short = first1.size >= first2.size ? [first1, first2] : [first2, first1]
|
168
194
|
|
169
|
-
# The short one must be a "subset" of the long one.
|
170
|
-
#
|
195
|
+
# The short one must be a "subset" of the long one. An extra condition must also be satisfied:
|
196
|
+
# either there has to be at least one match not involving initials or the first names must match.
|
197
|
+
# For example "M. J." matches "Mark" but not "John".
|
171
198
|
extra = false
|
172
199
|
(0..long.size-1).each do |i|
|
173
200
|
lword = long.shift
|
@@ -186,6 +213,7 @@ module ICU
|
|
186
213
|
# Match a complete last name.
|
187
214
|
def match_last(last1, last2)
|
188
215
|
return true if last1 == last2
|
216
|
+
return true if match_alt(:last, last1, last2)
|
189
217
|
[last1, last2].each do |last|
|
190
218
|
last.downcase! # case insensitive
|
191
219
|
last.gsub!(/\bmac/, 'mc') # MacDonaugh and McDonaugh
|
@@ -211,74 +239,73 @@ module ICU
|
|
211
239
|
initials = 0
|
212
240
|
initials+= 1 if first1.match(/^[A-Z\u{c0}-\u{de}]\.?$/)
|
213
241
|
initials+= 1 if first2.match(/^[A-Z\u{c0}-\u{de}]\.?$/)
|
214
|
-
return initials if first1 == first2
|
215
|
-
return 0 if initials == 0 &&
|
216
|
-
return -1 unless initials > 0
|
217
|
-
return initials if first1[0] == first2[0]
|
242
|
+
return initials if first1 == first2 # "W." and "W." or "William" and "William"
|
243
|
+
return 0 if initials == 0 && match_alt(:first, first1, first2) # "William" and "Bill"
|
244
|
+
return -1 unless initials > 0 # "William" and "Patricia"
|
245
|
+
return initials if first1[0] == first2[0] # "W." and "William" or "W." and "W"
|
218
246
|
-1
|
219
247
|
end
|
220
248
|
|
221
|
-
# Match two
|
222
|
-
def
|
223
|
-
|
224
|
-
|
225
|
-
return false unless
|
226
|
-
|
249
|
+
# Match two names that might be equivalent due to nicknames, misspellings, changed married names etc.
|
250
|
+
def match_alt(type, nam1, nam2)
|
251
|
+
self.class.compile_alts(type)
|
252
|
+
return false unless nams = @@alts[type][nam1]
|
253
|
+
return false unless cond = nams[nam2]
|
254
|
+
return true if cond == true
|
255
|
+
cond.match(type == :first ? @last : @first)
|
227
256
|
end
|
228
257
|
|
229
|
-
#
|
230
|
-
|
231
|
-
|
258
|
+
# Return an array of alternative first or second names (not including the original name).
|
259
|
+
# Allow for double barrelled last names or multiple first names.
|
260
|
+
def get_alts(type)
|
261
|
+
self.class.compile_alts(type)
|
262
|
+
name = self.send(type)
|
263
|
+
names = name.split(/[- ]/)
|
264
|
+
names.push(name) if names.length > 1
|
265
|
+
target = type == :first ? @last : @first
|
266
|
+
alts = Array.new
|
267
|
+
names.each do |n|
|
268
|
+
next unless @@alts[type][n]
|
269
|
+
@@alts[type][n].each_pair do |k, v|
|
270
|
+
alts.push k if v == true || v.match(target)
|
271
|
+
end
|
272
|
+
end
|
273
|
+
alts
|
274
|
+
end
|
275
|
+
|
276
|
+
# Compile an alternative names hash (for either first names or last names) before matching is first attempted.
|
277
|
+
def self.compile_alts(type, file=nil, force=false)
|
278
|
+
return if @@alts[type] && !force
|
279
|
+
file ||= File.expand_path(File.dirname(__FILE__) + "/../../config/#{type}_alternatives.yaml")
|
280
|
+
data = YAML.load(File.open file)
|
281
|
+
@@cmps[type] ||= 0
|
282
|
+
@@alts[type] = Hash.new
|
232
283
|
code = 1
|
233
|
-
|
234
|
-
|
235
|
-
|
236
|
-
|
284
|
+
data.each do |alts|
|
285
|
+
cond = true
|
286
|
+
alts.reject! do |a|
|
287
|
+
if a.instance_of?(Regexp)
|
288
|
+
cond = a
|
289
|
+
else
|
290
|
+
false
|
291
|
+
end
|
292
|
+
end
|
293
|
+
alts.each do |name|
|
294
|
+
alts.each do |other|
|
295
|
+
unless other == name
|
296
|
+
@@alts[type][name] ||= Hash.new
|
297
|
+
@@alts[type][name][other] = cond
|
298
|
+
end
|
299
|
+
end
|
237
300
|
end
|
238
301
|
code+= 1
|
239
302
|
end
|
303
|
+
@@cmps[type] += 1
|
240
304
|
end
|
241
305
|
|
242
|
-
#
|
243
|
-
|
244
|
-
|
245
|
-
|
246
|
-
Alexander Alex
|
247
|
-
Anandagopal Ananda
|
248
|
-
Andrew Andy
|
249
|
-
Anne Ann
|
250
|
-
Anthony Tony
|
251
|
-
Benjamin Ben
|
252
|
-
Catherine Cathy Cath
|
253
|
-
Daniel Danial Danny Dan
|
254
|
-
David Dave
|
255
|
-
Deborah Debbie
|
256
|
-
Des Desmond
|
257
|
-
Eamonn Eamon
|
258
|
-
Edward Eddie Ed
|
259
|
-
Eric Erick Erik
|
260
|
-
Frederick Frederic Fred
|
261
|
-
Gerald Gerry
|
262
|
-
Gerhard Gerard Ger
|
263
|
-
James Jim
|
264
|
-
Joanna Joan Joanne
|
265
|
-
John Johnny
|
266
|
-
Jonathan Jon
|
267
|
-
Kenneth Ken Kenny
|
268
|
-
Michael Mike Mick Micky
|
269
|
-
Nicholas Nick Nicolas
|
270
|
-
Nicola Nickie Nicky
|
271
|
-
Patrick Pat Paddy
|
272
|
-
Peter Pete
|
273
|
-
Philippe Philip Phillippe Phillip
|
274
|
-
Rick Ricky
|
275
|
-
Robert Bob Bobby
|
276
|
-
Samual Sam Samuel
|
277
|
-
Stefanie Stef
|
278
|
-
Stephen Steven Steve
|
279
|
-
Terence Terry
|
280
|
-
Thomas Tom Tommy
|
281
|
-
William Will Willy Willie Bill
|
282
|
-
EOF
|
306
|
+
# Return the number of YAML file compilations (for testing).
|
307
|
+
def self.alt_compilations(type)
|
308
|
+
@@cmps[check_type(type)] || 0
|
309
|
+
end
|
283
310
|
end
|
284
311
|
end
|
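The `compile_alts` and `match_alt` changes above turn each group of alternatives into a lookup from a name to its alternatives, with either `true` or a `Regexp` condition attached. The standalone sketch below restates that idea outside the class so it can be run on its own. The method names are hypothetical, and it takes plain Ruby arrays rather than reading a YAML file.

```ruby
# Hypothetical stand-in for the compiled lookup table: maps each name to a
# hash of { alternative => condition }, where condition is true or a Regexp.
def compile_alternatives_sketch(groups)
  table = {}
  groups.each do |group|
    condition = group.find { |entry| entry.is_a?(Regexp) } || true
    names = group.reject { |entry| entry.is_a?(Regexp) }
    # Link every name in the group to every other name in the same group.
    names.permutation(2) do |name, other|
      (table[name] ||= {})[other] = condition
    end
  end
  table
end

# Hypothetical matcher: other_part is the last name when comparing first
# names, and the first name when comparing last names.
def alternative_match_sketch?(table, name1, name2, other_part)
  condition = table.fetch(name1, {})[name2]
  return false if condition.nil?
  condition == true || condition.match?(other_part)
end

table = compile_alternatives_sketch([
  ["Stephen", "Steve"],
  ["Steven", "Steve"],
  ["Sean", "John", /^Bradley$/],
])
puts alternative_match_sketch?(table, "Stephen", "Steve", "Hanly")
puts alternative_match_sketch?(table, "Sean", "John", "Alfred")
```

Because names are only linked within a group, "Stephen" and "Steven" both reach "Steve" without reaching each other, which is the behaviour the README describes for the default rules.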
data/lib/icu_name/version.rb
CHANGED
data/spec/name_spec.rb
CHANGED
@@ -3,6 +3,17 @@ require File.expand_path(File.dirname(__FILE__) + '/spec_helper')
|
|
3
3
|
|
4
4
|
module ICU
|
5
5
|
describe Name do
|
6
|
+
def load_alt_test(*types)
|
7
|
+
types.each do |type|
|
8
|
+
file = File.expand_path(File.dirname(__FILE__) + "/../config/test_#{type}_alts.yaml")
|
9
|
+
Name.load_alternatives(type, file)
|
10
|
+
end
|
11
|
+
end
|
12
|
+
|
13
|
+
def alt_compilations(type)
|
14
|
+
Name.alt_compilations(type)
|
15
|
+
end
|
16
|
+
|
6
17
|
context "public methods" do
|
7
18
|
before(:each) do
|
8
19
|
@simple = Name.new('mark j l', 'ORR')
|
@@ -68,7 +79,7 @@ module ICU
|
|
68
79
|
it "characters and encoding" do
|
69
80
|
ICU::Name.new('éric', 'PRIÉ').name.should == "Éric Prié"
|
70
81
|
ICU::Name.new('BARTŁOMIEJ', 'śliwa').name.should == "Bartłomiej Śliwa"
|
71
|
-
ICU::Name.new('
|
82
|
+
ICU::Name.new('Սմբատ', 'Լպուտյան').name.should == ""
|
72
83
|
eric = Name.new('éric'.encode("ISO-8859-1"), 'PRIÉ'.force_encoding("ASCII-8BIT"))
|
73
84
|
eric.rname.should == "Prié, Éric"
|
74
85
|
eric.rname.encoding.name.should == "UTF-8"
|
@@ -244,7 +255,7 @@ module ICU
|
|
244
255
|
@opt = { :chars => "US-ASCII" }
|
245
256
|
end
|
246
257
|
|
247
|
-
it "should be a no-op for names that already ASCII" do
|
258
|
+
it "should be a no-op for names that are already ASCII" do
|
248
259
|
name = Name.new('Mark J. L.', 'Orr')
|
249
260
|
name.first(@opt).should == 'Mark J. L.'
|
250
261
|
name.last(@opt).should == 'Orr'
|
@@ -325,6 +336,21 @@ module ICU
|
|
325
336
|
Name.new('Mick', 'Orr').match('Mike', 'Orr').should be_true
|
326
337
|
end
|
327
338
|
|
339
|
+
it "should handle ambiguous nicknames" do
|
340
|
+
Name.new('Gerry', 'Orr').match('Gerald', 'Orr').should be_true
|
341
|
+
Name.new('Gerry', 'Orr').match('Gerard', 'Orr').should be_true
|
342
|
+
Name.new('Gerard', 'Orr').match('Gerald', 'Orr').should be_false
|
343
|
+
end
|
344
|
+
|
345
|
+
it "should by default be cautious about misspellings" do
|
346
|
+
Name.new('Steven', 'Brady').match('Stephen', 'Brady').should be_false
|
347
|
+
Name.new('Philip', 'Short').match('Phillip', 'Short').should be_false
|
348
|
+
end
|
349
|
+
|
350
|
+
it "should by default have no conditional matches" do
|
351
|
+
Name.new('Sean', 'Bradley').match('John', 'Bradley').should be_false
|
352
|
+
end
|
353
|
+
|
328
354
|
it "should not mix up nick names" do
|
329
355
|
Name.new('David', 'Orr').match('Bill', 'Orr').should be_false
|
330
356
|
end
|
@@ -343,6 +369,11 @@ module ICU
|
|
343
369
|
Name.new('Alan', 'McDonagh').match('Alan', 'MacDonagh').should be_true
|
344
370
|
Name.new('Darko', 'Polimac').match('Darko', 'Polimc').should be_false
|
345
371
|
end
|
372
|
+
|
373
|
+
it "should by defaut have no conditional matches" do
|
374
|
+
Name.new('Debbie', 'Quinn').match('Debbie', 'Benjamin').should be_false
|
375
|
+
Name.new('Mairead', "O'Siochru").match('Mairead', 'King').should be_false
|
376
|
+
end
|
346
377
|
end
|
347
378
|
|
348
379
|
context "matches involving accented characters" do
|
@@ -361,5 +392,173 @@ module ICU
|
|
361
392
|
Name.new('Èric-K.', 'Cantona').match('E. K.', 'Cantona', :chars => "US-ASCII").should be_true
|
362
393
|
end
|
363
394
|
end
|
395
|
+
|
396
|
+
context "configuring new first name alternatives" do
|
397
|
+
before(:all) do
|
398
|
+
load_alt_test(:first)
|
399
|
+
end
|
400
|
+
|
401
|
+
it "should match some spelling errors" do
|
402
|
+
Name.new('Steven', 'Brady').match('Stephen', 'Brady').should be_true
|
403
|
+
Name.new('Philip', 'Short').match('Phillip', 'Short').should be_true
|
404
|
+
end
|
405
|
+
|
406
|
+
it "should handle conditional matches" do
|
407
|
+
Name.new('Sean', 'Collins').match('John', 'Collins').should be_false
|
408
|
+
Name.new('Sean', 'Bradley').match('John', 'Bradley').should be_true
|
409
|
+
end
|
410
|
+
end
|
411
|
+
|
412
|
+
context "configuring new last name alternatives" do
|
413
|
+
before(:all) do
|
414
|
+
load_alt_test(:last)
|
415
|
+
end
|
416
|
+
|
417
|
+
it "should match some spelling errors" do
|
418
|
+
Name.new('William', 'Ffrench').match('William', 'French').should be_true
|
419
|
+
end
|
420
|
+
|
421
|
+
it "should handle conditional matches" do
|
422
|
+
Name.new('Mark', 'Quinn').match('Mark', 'Benjamin').should be_false
|
423
|
+
Name.new('Debbie', 'Quinn').match('Debbie', 'Benjamin').should be_true
|
424
|
+
Name.new('Oisin', "O'Siochru").match('Oisin', 'King').should be_false
|
425
|
+
Name.new('Mairead', "O'Siochru").match('Mairead', 'King').should be_true
|
426
|
+
end
|
427
|
+
|
428
|
+
it "should allow some awesome matches" do
|
429
|
+
Name.new('debbie quinn').match('Deborah', 'Benjamin').should be_true
|
430
|
+
Name.new('french, william').match('Bill', 'Ffrench').should be_true
|
431
|
+
Name.new('Oissine', 'Murphy').match('Oissine', 'Murchadha').should be_true
|
432
|
+
end
|
433
|
+
end
|
434
|
+
|
435
|
+
context "configuring new first and new last name alternatives" do
|
436
|
+
before(:all) do
|
437
|
+
load_alt_test(:first, :last)
|
438
|
+
end
|
439
|
+
|
440
|
+
it "should allow some awesome matches" do
|
441
|
+
Name.new('french, steven').match('Stephen', 'Ffrench').should be_true
|
442
|
+
Name.new('Patrick', 'Murphy').match('Padraic', 'Murchadha').should be_true
|
443
|
+
end
|
444
|
+
end
|
445
|
+
|
446
|
+
context "reverting to the default configuration" do
|
447
|
+
before(:all) do
|
448
|
+
load_alt_test(:first, :last)
|
449
|
+
end
|
450
|
+
|
451
|
+
it "should not match so boldly after reverting" do
|
452
|
+
Name.new('french, steven').match('Stephen', 'Ffrench').should be_true
|
453
|
+
Name.load_alternatives(:first)
|
454
|
+
Name.new('Patrick', 'Murphy').match('Padraic', 'Murchadha').should be_false
|
455
|
+
Name.new('Patrick', 'Murphy').match('Patrick', 'Murchadha').should be_true
|
456
|
+
Name.load_alternatives(:last)
|
457
|
+
Name.new('Patrick', 'Murphy').match('Patrick', 'Murchadha').should be_false
|
458
|
+
end
|
459
|
+
end
|
460
|
+
|
461
|
+
context "name alternatives with default configuration" do
|
462
|
+
it "should show common nicknames" do
|
463
|
+
Name.new('William', 'Ffrench').alternatives(:first).should =~ %w{Bill Willy Willie Will}
|
464
|
+
Name.new('Bill', 'Ffrench').alternatives(:first).should =~ %w{William Willy Will Willie}
|
465
|
+
Name.new('Steven', 'Ffrench').alternatives(:first).should =~ %w{Steve}
|
466
|
+
Name.new('Stephen', 'Ffrench').alternatives(:first).should =~ %w{Steve}
|
467
|
+
Name.new('Michael Stephen', 'Ffrench').alternatives(:first).should =~ %w{Steve Mike Mick Mikey}
|
468
|
+
Name.new('Stephen M.', 'Ffrench').alternatives(:first).should =~ %w{Steve}
|
469
|
+
Name.new('S.', 'Ffrench').alternatives(:first).should =~ []
|
470
|
+
Name.new('Sean', 'Bradley').alternatives(:first).should =~ []
|
471
|
+
end
|
472
|
+
|
473
|
+
it "should not have any last name alternatives" do
|
474
|
+
Name.new('William', 'Ffrench').alternatives(:last).should =~ []
|
475
|
+
Name.new('Mairead', "O'Siochru").alternatives(:last).should =~ []
|
476
|
+
Name.new('Oissine', 'Murphy').alternatives(:last).should =~ []
|
477
|
+
Name.new('Debbie', 'Quinn').alternatives(:last).should =~ []
|
478
|
+
end
|
479
|
+
end
|
480
|
+
|
481
|
+
context "name alternatives with more adventurous configuration" do
|
482
|
+
before(:all) do
|
483
|
+
load_alt_test(:first, :last)
|
484
|
+
end
|
485
|
+
|
486
|
+
it "should show additional nicknames" do
|
487
|
+
Name.new('Steven', 'Ffrench').alternatives(:first).should =~ %w{Stephen Steve}
|
488
|
+
Name.new('Stephen', 'Ffrench').alternatives(:first).should =~ %w{Stef Stefan Stefen Stephan Steve Steven}
|
489
|
+
Name.new('Stephen Mike', 'Ffrench').alternatives(:first).should =~ %w{Michael Mick Mickie Micky Mikey Stef Stefan Stefen Stephan Steve Steven}
|
490
|
+
Name.new('Sean', 'Bradley').alternatives(:first).should =~ %w{John}
|
491
|
+
Name.new('Sean', 'McDonagh').alternatives(:first).should =~ []
|
492
|
+
Name.new('John', 'Bradley').alternatives(:first).should =~ %w{Sean Johnny}
|
493
|
+
end
|
494
|
+
|
495
|
+
it "should have some last name alternatives" do
|
496
|
+
Name.new('William', 'Ffrench').alternatives(:last).should =~ %w{French}
|
497
|
+
Name.new('Mairead', "O'Siochru").alternatives(:last).should =~ %w{King}
|
498
|
+
Name.new('Oissine', 'Murphy').alternatives(:last).should =~ %w{Murchadha}
|
499
|
+
Name.new('Debbie', 'Quinn').alternatives(:last).should =~ %w{Benjamin}
|
500
|
+
Name.new('Mark', 'Quinn').alternatives(:last).should =~ []
|
501
|
+
Name.new('Debbie', 'Quinn-French').alternatives(:last).should =~ %w{Benjamin Ffrench}
|
502
|
+
end
|
503
|
+
end
|
504
|
+
|
505
|
+
context "number of alternative compilations" do
|
506
|
+
before(:all) do
|
507
|
+
Name.reset_alternatives
|
508
|
+
end
|
509
|
+
|
510
|
+
it "should be no more than necessary" do
|
511
|
+
alt_compilations(:first).should == 0
|
512
|
+
alt_compilations(:last).should == 0
|
513
|
+
Name.new('William', 'Ffrench').match('Bill', 'French')
|
514
|
+
alt_compilations(:first).should == 1
|
515
|
+
alt_compilations(:last).should == 1
|
516
|
+
Name.new('Debbie', 'Quinn').match('Deborah', 'Benjamin')
|
517
|
+
alt_compilations(:first).should == 1
|
518
|
+
alt_compilations(:last).should == 1
|
519
|
+
load_alt_test(:first)
|
520
|
+
alt_compilations(:first).should == 2
|
521
|
+
alt_compilations(:last).should == 1
|
522
|
+
load_alt_test(:last)
|
523
|
+
alt_compilations(:first).should == 2
|
524
|
+
alt_compilations(:last).should == 2
|
525
|
+
Name.new('William', 'Ffrench').match('Bill', 'French')
|
526
|
+
Name.new('Debbie', 'Quinn').match('Deborah', 'Benjamin')
|
527
|
+
Name.new('Mark', 'Orr').alternatives(:first)
|
528
|
+
Name.new('Mark', 'Orr').alternatives(:last)
|
529
|
+
alt_compilations(:first).should == 2
|
530
|
+
alt_compilations(:last).should == 2
|
531
|
+
end
|
532
|
+
end
|
533
|
+
|
534
|
+
context "immutability" do
|
535
|
+
before(:each) do
|
536
|
+
@mark = ICU::Name.new('Màrk', 'Orr')
|
537
|
+
end
|
538
|
+
|
539
|
+
it "there are no setters" do
|
540
|
+
lambda { @mark.first = "Malcolm" }.should raise_error(/undefined/)
|
541
|
+
lambda { @mark.last = "Dickie" }.should raise_error(/undefined/)
|
542
|
+
lambda { @mark.original = "mark orr" }.should raise_error(/undefined/)
|
543
|
+
end
|
544
|
+
|
545
|
+
it "should prevent accidental access to the instance variables" do
|
546
|
+
@mark.first.downcase!
|
547
|
+
@mark.first.should == "Màrk"
|
548
|
+
@mark.last.downcase!
|
549
|
+
@mark.last.should == "Orr"
|
550
|
+
@mark.original.downcase!
|
551
|
+
@mark.original.should == "Orr, Màrk"
|
552
|
+
end
|
553
|
+
|
554
|
+
it "should prevent accidental access to the instance variables when transliterating" do
|
555
|
+
@mark.first(:chars => "US-ASCII").downcase!
|
556
|
+
@mark.first.should == "Màrk"
|
557
|
+
@mark.last(:chars => "US-ASCII").downcase!
|
558
|
+
@mark.last.should == "Orr"
|
559
|
+
@mark.original(:chars => "US-ASCII").downcase!
|
560
|
+
@mark.original.should == "Orr, Màrk"
|
561
|
+
end
|
562
|
+
end
|
364
563
|
end
|
365
564
|
end
|
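The immutability specs above rely on the getters returning copies (`@first.dup` and so on) while the stored strings are frozen. A minimal sketch of that pattern, using a hypothetical class that is not part of icu_name:

```ruby
# Hypothetical class illustrating the freeze-and-dup pattern.
class FrozenNameSketch
  def initialize(first, last)
    # Freeze private copies so the internal state cannot be mutated.
    @first = first.dup.freeze
    @last = last.dup.freeze
  end

  # Getters hand out unfrozen copies, so callers may mutate what they
  # receive without affecting the object.
  def first
    @first.dup
  end

  def last
    @last.dup
  end
end

mark = FrozenNameSketch.new("Mark", "Orr")
mark.first.downcase! # mutates only the returned copy
puts mark.first
```

Freezing alone would make `mark.first.downcase!` raise an error; combining it with `dup` in the getters lets such calls succeed harmlessly on a copy, which is what the specs check.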
metadata
CHANGED
@@ -3,10 +3,10 @@ name: icu_name
|
|
3
3
|
version: !ruby/object:Gem::Version
|
4
4
|
prerelease: false
|
5
5
|
segments:
|
6
|
-
- 0
|
7
6
|
- 1
|
8
|
-
-
|
9
|
-
|
7
|
+
- 0
|
8
|
+
- 0
|
9
|
+
version: 1.0.0
|
10
10
|
platform: ruby
|
11
11
|
authors:
|
12
12
|
- Mark Orr
|
@@ -14,7 +14,7 @@ autorequire:
|
|
14
14
|
bindir: bin
|
15
15
|
cert_chain: []
|
16
16
|
|
17
|
-
date: 2011-
|
17
|
+
date: 2011-04-16 00:00:00 +01:00
|
18
18
|
default_executable:
|
19
19
|
dependencies:
|
20
20
|
- !ruby/object:Gem::Dependency
|
@@ -138,6 +138,10 @@ files:
|
|
138
138
|
- spec/name_spec.rb
|
139
139
|
- spec/spec_helper.rb
|
140
140
|
- spec/util_spec.rb
|
141
|
+
- config/first_alternatives.yaml
|
142
|
+
- config/last_alternatives.yaml
|
143
|
+
- config/test_first_alts.yaml
|
144
|
+
- config/test_last_alts.yaml
|
141
145
|
- LICENCE
|
142
146
|
- README.rdoc
|
143
147
|
has_rdoc: true
|