icu_name 0.0.7 → 0.1.0

data/README.rdoc CHANGED
@@ -8,7 +8,7 @@ For ruby 1.9.2 and above.

    gem install icu_name

- It depends on active_support and i18n.
+ It depends on _active_support_ and _i18n_.

  == Names

@@ -23,84 +23,88 @@ To create a name object, supply both the first and second names separately to th

  Capitalisation, white space and punctuation will all be automatically corrected:

-   robert.name # => 'Robert J. Fischer'
-   robert.rname # => 'Fischer, Robert J.' (reversed name)
+   robert.name # => 'Robert J. Fischer'
+   robert.rname # => 'Fischer, Robert J.' (reversed name)

  The input text, without any changes apart from white-space cleanup, is returned by the _original_ method:

-   robert.original # => 'robert j FISHER'
+   robert.original # => 'robert j FISHER'

  To avoid ambiguity when either the first or second names consist of multiple words, it is better to
  supply the two separately, if known. However, the full name can be supplied alone to the constructor
  and a guess will be made as to the first and last names.

    bobby = ICU::Name.new(' bobby fischer ')
-
-   bobby.first # => 'Bobby'
-   bobby.last # => 'Fischer'
+
+   bobby.first # => 'Bobby'
+   bobby.last # => 'Fischer'

  Names will match even if one is missing middle initials or if a nickname is used for one of the first names.

-   bobby.match('Robert J.', 'Fischer') # => true
+   bobby.match('Robert J.', 'Fischer') # => true

  Note that the class is aware of only common nicknames (e.g. _Bobby_ and _Robert_, _Bill_ and _William_, etc), not all possibilities.

  Supplying the _match_ method with strings is equivalent to instantiating a Name instance with the same
  strings and then matching it. So, for example the following are equivalent:

-   robert.match('R.', 'Fischer') # => true
-   robert.match(ICU::Name.new('R.', 'Fischer')) # => true
+   robert.match('R.', 'Fischer') # => true
+   robert.match(ICU::Name.new('R.', 'Fischer')) # => true

  The inital _R_, for example, matches the first letter of _Robert_. However, nickname matches will not
  always work with initials. In the next example, the initial _R_ does not match the first letter _B_ of the
  nickname _Bobby_.

-   bobby.match('R. J.', 'Fischer') # => false
+   bobby.match('R. J.', 'Fischer') # => false

  Some of the ways last names are canonicalised are illustrated below:

-   ICU::Name.new('John', 'O Reilly').last # => "O'Reilly"
-   ICU::Name.new('dave', 'mcmanus').last # => "McManus"
+   ICU::Name.new('John', 'O Reilly').last # => "O'Reilly"
+   ICU::Name.new('dave', 'mcmanus').last # => "McManus"

  == Characters and Encoding

- The class can only cope with Western European letter characters, including the accented ones in Latin-1.
- It's various accessors (_first_, _last_, _name_, _rname_, _to_s_, _original_) always return strings
- encoded in UTF-8, no matter what the input encoding.
+ The class can only cope with Latin characters, including those with diacritics (accents).
+ Along with hyphens and single quotes (which represent apostophes) letters in ISO-8859-1
+ (e.g. "a", "è", "Ö") and letters outside ISO-8859-1 which are decomposable into a US-ASCII
+ character plus one or more diacritics (e.g. "ł" or "Ś") are preserved, while everything
+ else is removed.
+
+   ICU::Name.new('éric', 'PRIÉ').name # => "Éric Prié"
+   ICU::Name.new('BARTŁOMIEJ', 'śliwa').name # => "Bartłomiej Śliwa"
+   ICU::Name.new(' 渡井美代子').name # => ""

-   eric = ICU::Name.new('éric', 'PRIÉ')
-   eric.rname # => "Prié, Éric"
-   eric.rname.encoding.name # => "UTF-8"
+ The various accessors (_first_, _last_, _name_, _rname_, _to_s_, _original_) always return
+ strings encoded in UTF-8, no matter what the input encoding.

    eric = ICU::Name.new('éric'.encode("ISO-8859-1"), 'PRIÉ'.force_encoding("ASCII-8BIT"))
-   eric.rname # => "Prié, Éric"
-   eric.rname.encoding.name # => "UTF-8"
-   eric.original # => "éric PRIÉ"
-   eric.original.encoding.name # => "UTF-8"
+   eric.rname # => "Prié, Éric"
+   eric.rname.encoding.name # => "UTF-8"
+   eric.original # => "éric PRIÉ"
+   eric.original.encoding.name # => "UTF-8"

- Currently, all characters outside the Latin-1 range are removed as if they wern't there.
+ Accented letters can be transliterated into their US-ASCII counterparts by setting the
+ _chars_ option, which is available in all accessors. For example:

-   ICU::Name.new('Józef Żabiński').name # => "Józef Abiski"
-   ICU::Name.new('Bǔ Xiángzhì').name # => "B. Xiángzhì"
+   eric.rname(:chars => "US-ASCII") # => "Prie, Eric"
+   eric.original(:chars => "US-ASCII") # => "eric PRIE"

- Accented Latin-1 characters can be transliterated into their ascii counterparts by setting the
- _ascii_ option to a true value.
+ Also possible is the preservation of ISO-8859-1 characters, but the transliteration of
+ all other accented characters:

-   eric.name(:ascii => true) # => "Eric Prie"
+   joe = Name.new('Józef', 'Żabiński')
+   joe.rname # => "Żabiński, Józef"
+   joe.rname(:chars => "ISO-8859-1") # => "Zabinski, Józef"
+   joe.rname(:chars => "US-ASCII") # => "Zabinski, Jozef"

- This works with all the other accessors and also with the constructor:
+ Note that the character encoding of the strings returned is still UTF-8 in all cases.
+ The same option also relaxes the need for accented characters to match exactly:

-   eric_ascii = ICU::Name.new('éric', 'PRIÉ', :ascii => true)
-   eric_ascii.name # => "Eric Prie"
-   jozef_ascii = ICU::Name.new('Józef', 'Żabiński', :ascii => true).name
-   jozef_ascii.name # => "Jozef Zabinski"
-
- The option also relaxes the need for accented characters to match exactly:
+   eric.match('Eric', 'Prie') # => false
+   eric.match('Eric', 'Prie', :chars => "US-ASCII") # => true
+   joe.match('Józef', 'Zabinski') # => false
+   joe.match('Józef', 'Zabinski', :chars => "ISO-8859-1") # => true

-   eric.match('Éric', 'Prié') # => true
-   eric.match('Eric', 'Prie') # => false
-   eric.match('Eric', 'Prie', :ascii => true) # => true
-
  == Author

  Mark Orr, rating officer for the Irish Chess Union (ICU[http://icu.ie]).
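
To make the three modes of the new _chars_ option in the README changes above concrete, here is a minimal standalone sketch. It does not load the gem. The names `to_ascii`, `transliterate_sketch` and `NO_DECOMPOSITION` are hypothetical, and Ruby's built-in `String#unicode_normalize` plus a small hand-written table stand in for `ActiveSupport::Inflector.transliterate`, which is what the gem itself calls. Treat it as an approximation of the idea rather than the gem's implementation.

```ruby
# encoding: UTF-8

# Assumption: a tiny stand-in table for letters that have no Unicode
# decomposition. The gem relies on ActiveSupport's transliteration instead.
NO_DECOMPOSITION = { "ł" => "l", "Ł" => "L", "ø" => "o", "Ø" => "O" }.freeze

# Hypothetical helper: approximate a string in US-ASCII, using "?" for any
# character that cannot be reduced to ASCII letters.
def to_ascii(str)
  str.gsub(/./) do |ch|
    next ch if ch.ord < 128
    next NO_DECOMPOSITION[ch] if NO_DECOMPOSITION.key?(ch)
    # Decompose into base letter + combining marks, then drop the marks.
    base = ch.unicode_normalize(:nfd).gsub(/\p{Mn}/, "")
    (!base.empty? && base.ascii_only?) ? base : "?"
  end
end

# Hypothetical equivalent of the accessor option:
#   "US-ASCII"   -> transliterate every accented letter
#   "ISO-8859-1" -> keep Latin-1 letters, transliterate the rest
#   anything else (including nil) -> return an unchanged copy
def transliterate_sketch(str, chars = nil)
  case chars
  when /\A(US-?)?ASCII\z/i
    to_ascii(str)
  when /\A(ISO-?8859-?1|Latin-?1)\z/i
    str.gsub(/./) { |ch| ch.ord < 256 ? ch : to_ascii(ch) }
  else
    str.dup
  end
end

name = "Żabiński, Józef"
puts transliterate_sketch(name)               # Żabiński, Józef
puts transliterate_sketch(name, "ISO-8859-1") # Zabinski, Józef
puts transliterate_sketch(name, "US-ASCII")   # Zabinski, Jozef
```

In every mode the returned string is still UTF-8 encoded; the option only controls which characters appear in it, which matches the README note above.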
data/lib/icu_name/name.rb CHANGED
@@ -1,3 +1,4 @@
+ # encoding: UTF-8
  require 'active_support'
  require 'active_support/inflector/transliterate'
  require 'active_support/core_ext/string/multibyte'
@@ -6,32 +7,28 @@ module ICU
    class Name

      # Construct from one or two strings or any objects that have a to_s method.
-     def initialize(name1='', name2='', opt={})
+     def initialize(name1='', name2='')
        @name1 = Util.to_utf8(name1.to_s)
        @name2 = Util.to_utf8(name2.to_s)
        originalize
-       if opt[:ascii]
-         @name1 = ActiveSupport::Inflector.transliterate(@name1)
-         @name2 = ActiveSupport::Inflector.transliterate(@name2)
-       end
        canonicalize
      end

      # Original text getter.
      def original(opts={})
-       return ActiveSupport::Inflector.transliterate(@original) if opts[:ascii]
+       return transliterate(@original, opts[:chars]) if opts[:chars]
        @original
      end

      # First name getter.
      def first(opts={})
-       return ActiveSupport::Inflector.transliterate(@first) if opts[:ascii]
+       return transliterate(@first, opts[:chars]) if opts[:chars]
        @first
      end

      # Last name getter.
      def last(opts={})
-       return ActiveSupport::Inflector.transliterate(@last) if opts[:ascii]
+       return transliterate(@last, opts[:chars]) if opts[:chars]
        @last
      end

@@ -60,8 +57,8 @@ module ICU

      # Match another name to this object, returning true or false.
      def match(name1='', name2='', opts={})
-       other = Name.new(name1, name2, opts)
-       match_first(first(opts), other.first) && match_last(last(opts), other.last)
+       other = Name.new(name1, name2)
+       match_first(first(opts), other.first(opts)) && match_last(last(opts), other.last(opts))
      end

      # :stopdoc:
@@ -73,6 +70,18 @@ module ICU
        @original.strip!
        @original.gsub!(/\s+/, ' ')
      end
+
+     # Transliterate characters to ASCII or Latin1.
+     def transliterate(str, chars='US-ASCII')
+       case chars
+       when /^(US-?)?ASCII/i
+         ActiveSupport::Inflector.transliterate(str)
+       when /^(Windows|CP)-?1252|ISO-?8859-?1|Latin(-?1)?$/i
+         str.gsub(/./) { |m| m.ord < 256 ? m : ActiveSupport::Inflector.transliterate(m) }
+       else
+         str.dup
+       end
+     end

      # Canonicalise the first and last names.
      def canonicalize
@@ -106,7 +115,15 @@ module ICU
      # Clean up characters in any name keeping only letters (including accented), hyphens, and single quotes.
      def clean(name)
        name.gsub!(/`/, "'")
-       name.gsub!(/[^-a-zA-Z\u{c0}-\u{d6}\u{d8}-\u{f6}\u{f8}-\u{ff}.'\s]/, '')
+       name.gsub!(/./) do |m|
+         if m.ord < 256
+           # Keep Latin1 accented letters.
+           m.match(/^[-a-zA-Z\u{c0}-\u{d6}\u{d8}-\u{f6}\u{f8}-\u{ff}.'\s]$/) ? m : ''
+         else
+           # Keep ASCII characters with diacritics (e.g. Polish ł and Ś).
+           transliterate(m) == '?' ? '' : m
+         end
+       end
        name.gsub!(/\./, ' ')
        name.gsub!(/\s*-\s*/, '-')
        name.gsub!(/'+/, "'")
@@ -2,6 +2,6 @@

  module ICU
    class Name
-     VERSION = "0.0.7"
+     VERSION = "0.1.0"
    end
  end
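
The keep-or-drop rule that the revised `clean` method in name.rb applies, character by character, can be tried in isolation with the sketch below. It is an approximation under stated assumptions: `clean_sketch`, `ascii_plus_diacritics?`, `LATIN1_KEEP` and `UNDECOMPOSABLE_LETTERS` are hypothetical names, and Unicode NFD decomposition (plus a two-letter table for `ł`/`Ł`, which do not decompose) replaces the gem's test of whether `ActiveSupport::Inflector.transliterate` returns `'?'`. The two tests may disagree on unusual characters.

```ruby
# encoding: UTF-8

# Same Latin-1 whitelist as the character class in the diff: ASCII and
# Latin-1 letters, hyphen, full stop, single quote and white space.
LATIN1_KEEP = /\A[-a-zA-Z\u00c0-\u00d6\u00d8-\u00f6\u00f8-\u00ff.'\s]\z/

# Assumption: letters with no Unicode decomposition that should still be
# kept (the gem gets these from ActiveSupport's transliteration rules).
UNDECOMPOSABLE_LETTERS = %w[ł Ł].freeze

# True if the character is an ASCII letter plus one or more diacritics.
def ascii_plus_diacritics?(ch)
  return true if UNDECOMPOSABLE_LETTERS.include?(ch)
  base = ch.unicode_normalize(:nfd).gsub(/\p{Mn}/, "")
  base.match?(/\A[a-zA-Z]+\z/)
end

# Drop every character that is neither whitelisted Latin-1 nor an
# accented ASCII letter from outside Latin-1.
def clean_sketch(name)
  name.gsub(/./) do |ch|
    if ch.ord < 256
      LATIN1_KEEP.match?(ch) ? ch : ""
    else
      ascii_plus_diacritics?(ch) ? ch : ""
    end
  end
end

puts clean_sketch("BARTŁOMIEJ śliwa")         # BARTŁOMIEJ śliwa
puts clean_sketch(" 渡井美代子").strip.empty? # true
```

The sketch covers only the filtering step; the capitalisation, hyphen and apostrophe normalisation that follow in the real method are left out.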
data/spec/name_spec.rb CHANGED
@@ -66,29 +66,25 @@ module ICU
        end

        it "characters and encoding" do
-         josef = Name.new('Józef', 'Żabiński')
-         josef.name.should == "Józef Abiski"
-         josef.original.should == "Józef Żabiński"
-         josef.original(:ascii => true).should == "Jozef Zabinski"
-         josef = Name.new('Józef', 'Żabiński', :ascii => true)
-         josef.name.should == "Jozef Zabinski"
-         bu = Name.new('Bǔ Xiángzhì')
-         bu.name.should == "B. Xiángzhì"
-         eric = Name.new('éric', 'PRIÉ')
-         eric.rname.should == "Prié, Éric"
-         eric.rname.encoding.name.should == "UTF-8"
+         ICU::Name.new('éric', 'PRIÉ').name.should == "Éric Prié"
+         ICU::Name.new('BARTŁOMIEJ', 'śliwa').name.should == "Bartłomiej Śliwa"
+         ICU::Name.new(' 渡井美代子').name.should == ""
          eric = Name.new('éric'.encode("ISO-8859-1"), 'PRIÉ'.force_encoding("ASCII-8BIT"))
          eric.rname.should == "Prié, Éric"
          eric.rname.encoding.name.should == "UTF-8"
          eric.original.should == "éric PRIÉ"
-         eric.original(:ascii => true).should == "eric PRIE"
          eric.original.encoding.name.should == "UTF-8"
-         eric.name(:ascii => true).should == "Eric Prie"
-         eric_ascii = Name.new('éric', 'PRIÉ', :ascii => true)
-         eric_ascii.name.should == "Eric Prie"
+         eric.rname(:chars => "US-ASCII").should == "Prie, Eric"
+         eric.original(:chars => "US-ASCII").should == "eric PRIE"
+         joe = Name.new('Józef', 'Żabiński')
+         joe.rname.should == "Żabiński, Józef"
+         joe.rname(:chars => "ISO-8859-1").should == "Zabinski, Józef"
+         joe.rname(:chars => "US-ASCII").should == "Zabinski, Jozef"
          eric.match('Éric', 'Prié').should be_true
          eric.match('Eric', 'Prie').should be_false
-         eric.match('Eric', 'Prie', :ascii => true).should be_true
+         eric.match('Eric', 'Prie', :chars => "US-ASCII").should be_true
+         joe.match('Józef', 'Zabinski').should be_false
+         joe.match('Józef', 'Zabinski', :chars => "ISO-8859-1").should be_true
        end
      end

@@ -244,7 +240,7 @@ module ICU

      context "transliteration" do
        before(:all) do
-         @opt = { :ascii => true }
+         @opt = { :chars => "US-ASCII" }
        end

        it "should be a no-op for names that already ASCII" do
@@ -267,12 +263,6 @@ module ICU
          name.first(@opt).should == 'Eric'
          name.last(@opt).should == 'Prie'
        end
-
-       it "should work for the constructor as well as accessors" do
-         name = Name.new('Gearóidín', 'Uí Laighléis', @opt)
-         name.first.should == 'Gearoidin'
-         name.last.should == 'Ui Laighleis'
-       end
      end

      context "constuction corner cases" do
@@ -280,7 +270,6 @@ module ICU
          Name.new('Orr').name.should == 'Orr'
          Name.new('Orr').rname.should == 'Orr'
          Name.new('Uí Laighléis').rname.should == 'Laighléis, Uí'
-         Name.new('', 'Uí Laighléis', :ascii => true).last.should == 'Ui Laighleis'
          Name.new('').name.should == ''
          Name.new('').rname.should == ''
          Name.new.name.should == ''
@@ -367,8 +356,8 @@ module ICU
        end

        it "the matching of accented characters can be relaxed" do
-         Name.new('Gearóidín', 'Uí Laighléis').match('Gearoidin', 'Ui Laíghleis', :ascii => true).should be_true
-         Name.new('Èric-K.', 'Cantona').match('E. K.', 'Cantona', :ascii => true).should be_true
+         Name.new('Gearóidín', 'Uí Laighléis').match('Gearoidin', 'Ui Laíghleis', :chars => "US-ASCII").should be_true
+         Name.new('Èric-K.', 'Cantona').match('E. K.', 'Cantona', :chars => "US-ASCII").should be_true
        end
      end
    end
metadata CHANGED
@@ -4,9 +4,9 @@ version: !ruby/object:Gem::Version
    prerelease: false
    segments:
    - 0
+   - 1
    - 0
-   - 7
-   version: 0.0.7
+   version: 0.1.0
  platform: ruby
  authors:
  - Mark Orr