namae 0.1.0

@@ -0,0 +1,133 @@
1
+ Namae
2
+ =====
3
+ Namae is a parser for human names. It recognizes personal names of various
4
+ cultural backgrounds and tries to split them into their component parts
5
+ (e.g., given and family names, honorifics etc.).
6
+
7
+ [![Build Status](https://secure.travis-ci.org/berkmancenter/namae.png)](http://travis-ci.org/berkmancenter/namae)
8
+
9
+ Quickstart
10
+ ----------
11
+ 1. Install the namae gem (or add it to your Gemfile):
12
+
13
+ $ gem install namae
14
+
15
+ 2. Start parsing names! Namae expects you to pass in a string and it returns
16
+ a list of parsed names:
17
+
18
+ require 'namae'
19
+
20
+ names = Namae.parse 'Yukihiro "Matz" Matsumoto'
21
+ #-> [#<Name family="Matsumoto" given="Yukihiro" nick="Matz">]
22
+
23
+ 3. Use the name objects to access the individual parts:
24
+
25
+ matz = names[0]
26
+
27
+ matz.nick
28
+ #-> "Matz"
29
+
30
+ matz.family
31
+ #-> "Matsumoto"
32
+
33
+ matz.initials
34
+ #-> "Y.M."
35
+
36
+ matz.initials :expand => true
37
+ #-> "Y. Matsumoto"
38
+
39
+ matz.initials :dots => false
40
+ #-> "YM"
41
+
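The `initials` options shown above can be pictured with a small standalone sketch. It does not require the gem. It follows the regular-expression approach of the private `initials_of` helper in `lib/namae/name.rb`, but the keyword-argument signature is an assumption made here for brevity, and the `:spaces` output is derived from that helper's logic rather than from a documented example.

```ruby
# Standalone sketch, not the gem's public API: a top-level helper that
# mimics how initials are derived from a name string.
def initials_of(name, dots: true, spaces: false)
  # Collapse each capitalized word (e.g. "Yukihiro") to its first letter,
  # optionally followed by a dot.
  result = name.gsub(/([[:upper:]])[[:lower:]]+/, dots ? '\1.' : '\1')
  # Drop the whitespace between the initials unless spaces were requested.
  result = result.gsub(/\s+/, '') unless spaces
  result
end

puts initials_of('Yukihiro Matsumoto')                # Y.M.
puts initials_of('Yukihiro Matsumoto', dots: false)   # YM
puts initials_of('Yukihiro Matsumoto', spaces: true)  # Y. M.
```

Words without trailing lower-case letters (such as an existing initial like "M.") are left untouched by this pattern.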
42
+ Format and Examples
43
+ -------------------
44
+ Namae recognizes names in two basic formats, internally
45
+ referred to as display-order and sort-order. For example, the following
46
+ names are written in display-order:
47
+
48
+ Namae.parse 'Charles Babbage'
49
+ #-> [#<Name family="Babbage" given="Charles">]
50
+
51
+ Namae.parse 'Mr. Alan M. Turing'
52
+ #-> [#<Name family="Turing" given="Alan M." appellation="Mr.">]
53
+
54
+ Namae.parse 'Yukihiro "Matz" Matsumoto'
55
+ #-> [#<Name family="Matsumoto" given="Yukihiro" nick="Matz">]
56
+
57
+ Namae.parse 'Augusta Ada King and Lord Byron'
58
+ #-> [#<Name family="King" given="Augusta Ada">, #<Name family="Byron" title="Lord">]
59
+
60
+ Namae.parse 'Sir Isaac Newton'
61
+ #-> [#<Name family="Newton" given="Isaac" title="Sir">]
62
+
63
+ Namae.parse 'Prof. Donald Ervin Knuth'
64
+ #-> [#<Name family="Knuth" given="Donald Ervin" title="Prof.">]
65
+
66
+ Or in sort-order:
67
+
68
+ Namae.parse 'Turing, Alan M.'
69
+ #-> [#<Name family="Turing" given="Alan M.">]
70
+
71
+ You can also mix sort- and display-order in the same expression:
72
+
73
+ Namae.parse 'Torvalds, Linus and Alan Cox'
74
+ #-> [#<Name family="Torvalds" given="Linus">, #<Name family="Cox" given="Alan">]
75
+
76
+ Typically, sort-order names are easier to parse, because the syntax is less
77
+ ambiguous. For example, multiple family names are always possible in sort-order:
78
+
79
+ Namae.parse 'Carreño Quiñones, María-Jose'
80
+ #-> [#<Name family="Carreño Quiñones" given="María-Jose">]
81
+
82
+ Whilst in display-order, multiple family names are only supported when the
83
+ name contains a particle or a nickname.
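Why a particle makes multiple family names possible in display-order can be seen in a simplified sketch of the BibTeX-style heuristic. This is not Namae's actual grammar (the real parser is generated with racc), and `split_display_order` is a hypothetical helper. It assumes that a lower-case word marks a particle, and it ignores nicknames, titles, suffixes and brace-protected tokens.

```ruby
# Hypothetical helper illustrating a BibTeX-style display-order heuristic.
def split_display_order(name)
  words = name.split
  return { first: '', von: '', last: '' } if words.empty?

  # The final word always belongs to the family name.
  head = words[0...-1]
  lower = ->(word) { word =~ /\A[[:lower:]]/ }

  von_start = head.index(&lower)
  if von_start.nil?
    # No particle: everything before the final word is the given name,
    # so only a single family name can be recognized.
    return { first: head.join(' '), von: '', last: words.last }
  end

  # The particle runs from the first to the last lower-case word;
  # everything after it is the (possibly multi-word) family name.
  von_end = head.rindex(&lower)
  {
    first: head[0...von_start].join(' '),
    von:   head[von_start..von_end].join(' '),
    last:  words[(von_end + 1)..-1].join(' ')
  }
end

p split_display_order('John Paul Jones')
p split_display_order('Charles Louis Xavier Joseph de la Vallee Poussin')
```

Without a particle, the sketch has no way to tell where the given names end, which is why it assigns only the last word to the family name.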
84
+
85
+
86
+ Rationale
87
+ ---------
88
+ Parsing human names is at once too easy and too hard. When working in the
89
+ confines of a single language or culture it is often a trivial task that
90
+ does not warrant a dedicated software package; when working across different
91
+ cultures, languages, or scripts, however, it may quickly become unrealistic
92
+ to devise a satisfying, one-size-fits-all solution. In languages like
93
+ Japanese or Chinese, for instance, the issue of word segmentation alone is
94
+ probably more difficult than name parsing itself.
95
+
96
+ Having said that, Namae is based on the rules used by BibTeX to format names
97
+ and can therefore be used to parse names of most languages using Latin
98
+ script with the long-term goal of supporting as many languages and scripts as
99
+ possible without the need for sophisticated or large dictionary-based
100
+ language-detection or word segmentation features.
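As a rough illustration of the BibTeX-style rules mentioned above, a sort-order name ("von Last, Jr, First") can be split on its commas and then scanned for a lower-case particle. The sketch below is an assumption-laden approximation, not the parser shipped with Namae: `split_sort_order` is a hypothetical helper, and brace-protected tokens such as `{von Beethoven}` are not handled.

```ruby
# Hypothetical helper: splits a sort-order name into BibTeX-style parts.
def split_sort_order(name)
  # A limit of -1 keeps trailing empty fields, e.g. "BB," => ["BB", ""].
  parts = name.split(',', -1).map(&:strip)
  family, suffix, given =
    case parts.length
    when 1 then [parts[0], '', '']
    when 2 then [parts[0], '', parts[1]]
    else        [parts[0], parts[1], parts[2..-1].join(', ')]
    end

  words = family.split
  # The final word is always part of the last name, so only the words
  # before it are candidates for the particle ("von" part).
  head = words[0...-1] || []
  von_end = head.rindex { |word| word =~ /\A[[:lower:]]/ }
  von, last =
    if von_end
      [head[0..von_end].join(' '), words[(von_end + 1)..-1].join(' ')]
    else
      ['', words.join(' ')]
    end

  { first: given, von: von, last: last, jr: suffix }
end

p split_sort_order('von Beethoven, Ludwig')
p split_sort_order('Ford, Jr., Henry')
p split_sort_order('Brinch Hansen, Per')
```

Because the comma already separates the family name from the given name, this form needs no guessing about where the family name begins.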
101
+
102
+ For further reading, see the W3C's primer on
103
+ [Personal Names Around the World](http://www.w3.org/International/questions/qa-personal-names).
104
+
105
+ Development
106
+ -----------
107
+ The Namae source code is [hosted on GitHub](https://github.com/berkmancenter/namae).
108
+ You can check out a copy of the latest code using Git:
109
+
110
+ $ git clone https://github.com/berkmancenter/namae.git
111
+
112
+ To get started, generate the parser and run all tests:
113
+
114
+ $ cd namae
115
+ $ bundle install
116
+ $ rake racc
117
+ $ rake features
118
+ $ rake spec
119
+
120
+ If you've found a bug or have a question, please open an issue on the
121
+ [issue tracker](https://github.com/berkmancenter/namae/issues). Or, for extra
122
+ credit, clone the Namae repository, write a failing example, fix the bug
123
+ and submit a pull request.
124
+
125
+ Contributors
126
+ ------------
127
+ * [Sylvester Keil](http://sylvester.keil.or.at)
128
+ * Dan Collis-Puro
129
+
130
+ Copyright
131
+ ---------
132
+ Copyright (c) 2012 President and Fellows of Harvard College.
133
+ Please see LICENSE for further details.
@@ -0,0 +1,62 @@
1
+ # encoding: utf-8
2
+
3
+ require 'bundler'
4
+ begin
5
+ Bundler.setup(:default, :development, :debug, :test)
6
+ rescue Bundler::BundlerError => e
7
+ $stderr.puts e.message
8
+ $stderr.puts "Run `bundle install` to install missing gems"
9
+ exit e.status_code
10
+ end
11
+ require 'rake'
12
+
13
+ $:.unshift(File.join(File.dirname(__FILE__), './lib'))
14
+ require 'namae'
15
+
16
+ begin
17
+ require 'jeweler'
18
+ Jeweler::Tasks.new do |gem|
19
+ gem.name = 'namae'
20
+ gem.version = Namae::Version::STRING.dup
21
+ gem.homepage = 'https://github.com/berkmancenter/namae'
22
+
23
+ gem.email = ['sylvester@keil.or.at', 'dan@collispuro.com']
24
+ gem.authors = ['Sylvester Keil', 'Dan Collis-Puro']
25
+
26
+ gem.license = 'AGPL'
27
+
28
+ gem.summary =
29
+ 'Namae parses personal names and splits them into their component parts.'
30
+
31
+ gem.description = %q{
32
+ Namae is a parser for human names. It recognizes personal names of various
33
+ cultural backgrounds and tries to split them into their component parts
34
+ (e.g., given and family names, honorifics etc.).
35
+ }.gsub(/\s+/, ' ')
36
+
37
+ end
38
+ Jeweler::RubygemsDotOrgTasks.new
39
+ rescue LoadError
40
+ warn 'failed to load jeweler'
41
+ end
42
+
43
+ desc 'Generate the name parser'
44
+ task :racc => ['lib/namae/parser.rb']
45
+
46
+ file 'lib/namae/parser.rb' => ['lib/namae/parser.y'] do
47
+ sh 'bundle exec racc -o lib/namae/parser.rb lib/namae/parser.y'
48
+ end
49
+
50
+ require 'rspec/core'
51
+ require 'rspec/core/rake_task'
52
+ RSpec::Core::RakeTask.new(:spec) do |spec|
53
+ spec.pattern = FileList['spec/**/*_spec.rb']
54
+ end
55
+
56
+ require 'cucumber/rake/task'
57
+ Cucumber::Rake::Task.new(:features)
58
+
59
+ task :default => :spec
60
+
61
+ require 'yard'
62
+ YARD::Rake::YardocTask.new
@@ -0,0 +1 @@
1
+ default: --format progress --require features --color
@@ -0,0 +1,78 @@
1
+ Feature: Parse BibTeX-style names
2
+ As a hacker who works with bibliographies
3
+ I want to be able to parse BibTeX-style names
4
+
5
+ Scenario Outline: Name splitting
6
+ When I parse the name "<name>"
7
+ Then the BibTeX parts should be:
8
+ | first | von | last | jr |
9
+ | <first> | <von> | <last> | <jr> |
10
+
11
+ @names @display
12
+ Scenarios: Decoret test suite (display order)
13
+ | name | first | von | last | jr |
14
+ | AA BB | AA | | BB | |
15
+ | AA BB CC | AA BB | | CC | |
16
+ # | AA | | | AA | |
17
+ | AA bb | AA | | bb | |
18
+ # | aa | | | aa | |
19
+ | aa bb | | aa | bb | |
20
+ | aa BB | | aa | BB | |
21
+ | AA bb CC | AA | bb | CC | |
22
+ | AA bb CC dd EE | AA | bb CC dd | EE | |
23
+ # | AA 1B cc dd | AA 1B | cc | dd | |
24
+ # | AA 1b cc dd | AA | 1b cc | dd | |
25
+ | AA {b}B cc dd | AA {b}B | cc | dd | |
26
+ | AA {b}b cc dd | AA | {b}b cc | dd | |
27
+ | AA {B}b cc dd | AA | {B}b cc | dd | |
28
+ | AA {B}B cc dd | AA {B}B | cc | dd | |
29
+ | AA \BB{b} cc dd | AA \\BB{b} | cc | dd | |
30
+ | AA \bb{b} cc dd | AA \\bb{b} | cc | dd | |
31
+ | AA {bb} cc DD | AA {bb} | cc | DD | |
32
+ | AA bb {cc} DD | AA | bb | {cc} DD | |
33
+ | AA {bb} CC | AA {bb} | | CC | |
34
+
35
+ @names @sort
36
+ Scenarios: Decoret test suite (sort order)
37
+ | name | first | von | last | jr |
38
+ | bb CC, AA | AA | bb | CC | |
39
+ | bb CC, aa | aa | bb | CC | |
40
+ | bb CC dd EE, AA | AA | bb CC dd | EE | |
41
+ | bb, AA | AA | | bb | |
42
+ | BB, | | | BB | |
43
+ | bb CC,XX, AA | AA | bb | CC | XX |
44
+ | bb CC,xx, AA | AA | bb | CC | xx |
45
+ | BB,, AA | AA | | BB | |
46
+ | CC dd BB, AA | AA | CC dd | BB | |
47
+ | BB, AA | AA | | BB | |
48
+
49
+ @names @sort
50
+ Scenarios: Long von parts
51
+ | name | first | von | last | jr |
52
+ | bb cc dd CC, AA | AA | bb cc dd | CC | |
53
+ | bb CC dd CC, AA | AA | bb CC dd | CC | |
54
+ | BB cc dd CC, AA | AA | BB cc dd | CC | |
55
+ | BB CC dd CC, AA | AA | BB CC dd | CC | |
56
+
57
+ @names
58
+ Scenarios: Decoret further remarks
59
+ | name | first | von | last | jr |
60
+ | Dominique Galouzeau de Villepin | Dominique Galouzeau | de | Villepin | |
61
+ | Dominique {G}alouzeau de Villepin | Dominique | {G}alouzeau de | Villepin | |
62
+ | Galouzeau {de} Villepin, Dominique | Dominique | | Galouzeau {de} Villepin | |
63
+
64
+ @names
65
+ Scenarios: Some actual names
66
+ | name | first | von | last | jr |
67
+ | John Paul Jones | John Paul | | Jones | |
68
+ | Ludwig von Beethoven | Ludwig | von | Beethoven | |
69
+ | von Beethoven, Ludwig | Ludwig | von | Beethoven | |
70
+ | {von Beethoven}, Ludwig | Ludwig | | {von Beethoven} | |
71
+ | {{von} Beethoven}, Ludwig | Ludwig | | {{von} Beethoven} | |
72
+ | John {}Paul Jones | John {}Paul | | Jones | |
73
+ | Ford, Jr., Henry | Henry | | Ford | Jr. |
74
+ | Brinch Hansen, Per | Per | | Brinch Hansen | |
75
+ # | {Barnes and Noble, Inc.} | | | {Barnes and Noble, Inc.} | |
76
+ | {Barnes and} {Noble, Inc.} | {Barnes and} | | {Noble, Inc.} | |
77
+ | {Barnes} {and} {Noble,} {Inc.} | {Barnes} {and} {Noble,} | | {Inc.} | |
78
+ | Charles Louis Xavier Joseph de la Vallee Poussin | Charles Louis Xavier Joseph | de la | Vallee Poussin | |
@@ -0,0 +1,24 @@
1
+ Feature: Parse the names in the Readme file
2
+ As a hacker who works with Namae
3
+ I want to be able to parse all the examples in the Readme file
4
+
5
+ Scenario Outline: Names Parsing
6
+ When I parse the name "<name>"
7
+ Then the parts should be:
8
+ | given | particle | family | suffix | title | appellation | nick |
9
+ | <given> | <particle> | <family> | <suffix> | <title> | <appellation> | <nick> |
10
+
11
+ @readme @display
12
+ Scenarios: Readme examples (display-order)
13
+ | name | given | particle | family | suffix | title | appellation | nick |
14
+ | Charles Babbage | Charles | | Babbage | | | | |
15
+ | Mr. Alan M. Turing | Alan M. | | Turing | | | Mr. | |
16
+ | Yukihiro "Matz" Matsumoto | Yukihiro | | Matsumoto | | | | Matz |
17
+ | Sir Isaac Newton | Isaac | | Newton | | Sir | | |
18
+ | Prof. Donald Ervin Knuth | Donald Ervin | | Knuth | | Prof. | | |
19
+ | Lord Byron | | | Byron | | Lord | | |
20
+
21
+ @readme @sort
22
+ Scenarios: Readme examples (sort-order)
23
+ | name | given | particle | family | suffix | title | appellation | nick |
24
+ | Carreño Quiñones, María-Jose | María-Jose | | Carreño Quiñones | | | | |
@@ -0,0 +1,22 @@
1
+ When /^I parse the name "(.*)"$/ do |string|
2
+ @name = Namae.parse!(string)[0]
3
+ end
4
+
5
+ When /^I parse the names "(.*)"$/ do |string|
6
+ @names = Namae.parse!(string)
7
+ end
8
+
9
+
10
+ Then /^the BibTeX parts should be:$/ do |table|
11
+ table.hashes.each do |row|
12
+ @name.values_at(:given, :particle, :family, :suffix).map(&:to_s).should ==
13
+ row.values_at('first', 'von', 'last', 'jr')
14
+ end
15
+ end
16
+
17
+ Then /^the parts should be:$/ do |table|
18
+ table.hashes.each do |row|
19
+ @name.values_at(:given, :particle, :family, :suffix, :title, :appellation, :nick).map(&:to_s).should ==
20
+ row.values_at('given', 'particle', 'family', 'suffix', 'title', 'appellation', 'nick')
21
+ end
22
+ end
@@ -0,0 +1,19 @@
1
+ require 'bundler'
2
+ begin
3
+ Bundler.setup(:default, :development)
4
+ rescue Bundler::BundlerError => e
5
+ $stderr.puts e.message
6
+ $stderr.puts "Run `bundle install` to install missing gems"
7
+ exit e.status_code
8
+ end
9
+
10
+ begin
11
+ require 'simplecov'
12
+ rescue LoadError
13
+ # ignore
14
+ end
15
+
16
+ $LOAD_PATH.unshift(File.dirname(__FILE__) + '/../../lib')
17
+ require 'namae'
18
+
19
+ require 'rspec/expectations'
@@ -0,0 +1,5 @@
1
+ require 'namae/version'
2
+
3
+ require 'namae/name'
4
+ require 'namae/parser'
5
+ require 'namae/utility'
@@ -0,0 +1,119 @@
1
+ module Namae
2
+
3
+ # A Name represents a single personal name, exposing its constituent
4
+ # parts (e.g., family name, given name etc.). Name instances are typically
5
+ # created and returned from {Namae.parse Namae.parse}.
6
+ #
7
+ # name = Namae.parse('Yukihiro "Matz" Matsumoto')[0]
8
+ #
9
+ # name.family #=> Matsumoto
10
+ # name.nick #=> Matz
11
+ # name.given #=> Yukihiro
12
+ #
13
+ class Name < Struct.new :family, :given, :suffix, :particle,
14
+ :dropping_particle, :nick, :appellation, :title
15
+
16
+ # rbx compatibility
17
+ @parts = members.map(&:to_sym).freeze
18
+
19
+ @defaults = {
20
+ :initials => {
21
+ :expand => false,
22
+ :dots => true,
23
+ :spaces => false
24
+ }
25
+ }
26
+
27
+ class << self
28
+ attr_reader :parts, :defaults
29
+ end
30
+
31
+
32
+ # @param attributes [Hash] the individual parts of the name
33
+ # @example
34
+ # Name.new(:family => 'Matsumoto')
35
+ def initialize(attributes = {})
36
+ super(*attributes.values_at(*Name.parts))
37
+ end
38
+
39
+
40
+ # True if all the name components are nil.
41
+ def empty?
42
+ values.compact.empty?
43
+ end
44
+
45
+ # Merges the name with the passed-in name or hash.
46
+ #
47
+ # @param other [#each_pair] the other name or hash
48
+ # @return [self]
49
+ def merge(other)
50
+ raise ArgumentError, "failed to merge #{other.class} into Name" unless
51
+ other.respond_to?(:each_pair)
52
+
53
+ other.each_pair do |part, value|
54
+ writer = "#{part}="
55
+ send(writer, value) if !value.nil? && respond_to?(writer)
56
+ end
57
+
58
+ self
59
+ end
60
+
61
+ # @param options [Hash] the options to create the initials
62
+ #
63
+ # @option options [true,false] :expand (false) whether or not to expand the family name
64
+ # @option options [true,false] :dots (true) whether or not to print dots between the initials
65
+ # @option options [true,false] :spaces (false) whether or not to print spaces between the initials
66
+ #
67
+ # @return [String] the name's initials.
68
+ def initials(options = {})
69
+ options = Name.defaults[:initials].merge(options)
70
+
71
+ if options[:expand]
72
+ [initials_of(given_part, options), family].compact.join(' ')
73
+ else
74
+ initials_of([given_part, family_part].join(' '), options)
75
+ end
76
+ end
77
+
78
+ # @overload values_at(selector, ... )
79
+ # Returns an array containing the elements in self corresponding to
80
+ # the given selector(s). The selectors may be either integer indices,
81
+ # ranges (functionality inherited from Struct) or symbols
82
+ # identifying valid keys.
83
+ #
84
+ # @example
85
+ # name.values_at(:family, :nick) #=> ['Matsumoto', 'Matz']
86
+ #
87
+ # @see Struct#values_at
88
+ # @return [Array] the list of values
89
+ def values_at(*arguments)
90
+ super(*arguments.flatten.map { |k| k.is_a?(Symbol) ? Name.parts.index(k) : k })
91
+ end
92
+
93
+
94
+ # Describe the contents of this name in a string.
95
+ def inspect
96
+ "#<Name #{each_pair.map { |k,v| [k,v.inspect].join('=') if v }.compact.join(' ')}>"
97
+ end
98
+
99
+
100
+ private
101
+
102
+ def family_part
103
+ [particle, family].compact.join(' ')
104
+ end
105
+
106
+ def given_part
107
+ [given, dropping_particle].compact.join(' ')
108
+ end
109
+
110
+ # @param name [String] a name or part of a name
111
+ # @return [String] the initials of the passed-in name
112
+ def initials_of(name, options = {})
113
+ i = name.gsub(/([[:upper:]])[[:lower:]]+/, options[:dots] ? '\1.' : '\1')
114
+ i.gsub!(/\s+/, '') unless options[:spaces]
115
+ i
116
+ end
117
+
118
+ end
119
+ end
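The core pattern of the `Name` class above (a `Struct` subclass that is initialized from a hash, looked up by symbol, and merged in place) can be tried without the gem. `MiniName` below is a hypothetical, trimmed-down stand-in with only three parts. It is a sketch of the technique, not a replacement for `Namae::Name`.

```ruby
# MiniName is a hypothetical stand-in used only for illustration.
class MiniName < Struct.new(:family, :given, :nick)
  # Member names as symbols, in declaration order.
  PARTS = members.map(&:to_sym).freeze

  # Accept a hash of parts instead of positional arguments.
  def initialize(attributes = {})
    super(*attributes.values_at(*PARTS))
  end

  # Allow symbols as selectors by translating them to member indices.
  def values_at(*keys)
    super(*keys.flatten.map { |k| k.is_a?(Symbol) ? PARTS.index(k) : k })
  end

  # Copy every non-nil value whose writer exists; unknown keys are skipped.
  def merge(other)
    other.each_pair do |part, value|
      writer = "#{part}="
      send(writer, value) if !value.nil? && respond_to?(writer)
    end
    self
  end
end

name = MiniName.new(:family => 'Matsumoto', :given => 'Yukihiro')
name.merge(:nick => 'Matz', :unknown => 'ignored')
p name.values_at(:family, :nick)  # ["Matsumoto", "Matz"]
```

Translating symbols to indices keeps the inherited integer and range selectors of `Struct#values_at` working alongside the symbol form.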