namae 0.1.0

@@ -0,0 +1,133 @@
1
+ Namae
2
+ =====
3
+ Namae is a parser for human names. It recognizes personal names of various
4
+ cultural backgrounds and tries to split them into their component parts
5
+ (e.g., given and family names, honorifics etc.).
6
+
7
+ [![Build Status](https://secure.travis-ci.org/berkmancenter/namae.png)](http://travis-ci.org/berkmancenter/namae)
8
+
9
+ Quickstart
10
+ ----------
11
+ 1. Install the namae gem (or add it to your Gemfile):
12
+
13
+ $ gem install namae
14
+
15
+ 2. Start parsing names! Namae expects you to pass in a string and it returns
16
+ a list of parsed names:
17
+
18
+ require 'namae'
19
+
20
+ names = Namae.parse 'Yukihiro "Matz" Matsumoto'
21
+ #-> [#<Name family="Matsumoto" given="Yukihiro" nick="Matz">]
22
+
23
+ 3. Use the name objects to access the individual parts:
24
+
25
+ matz = names[0]
26
+
27
+ matz.nick
28
+ #-> "Matz"
29
+
30
+ matz.family
31
+ #-> "Matsumoto"
32
+
33
+ matz.initials
34
+ #-> "Y.M."
35
+
36
+ matz.initials :expand => true
37
+ #-> "Y. Matsumoto"
38
+
39
+ matz.initials :dots => false
40
+ #-> "YM"
41
+
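The `initials` options shown above can be reproduced with a small standalone sketch. It reuses the regular expression from the private `initials_of` helper in `lib/namae/name.rb`; the free-standing method and its keyword arguments are assumptions made for illustration and are not part of Namae's public API.

```ruby
# Standalone sketch of how initials could be derived from a name string.
# Assumption: mirrors the regex of Name#initials_of; keyword arguments
# replace the options hash for brevity.
def initials_of(name, dots: true, spaces: false)
  # Replace each capitalized word by its first letter (plus an optional dot).
  result = name.gsub(/([[:upper:]])[[:lower:]]+/, dots ? '\1.' : '\1')
  # Drop whitespace between the initials unless spaces were requested.
  result = result.gsub(/\s+/, '') unless spaces
  result
end

puts initials_of('Yukihiro Matsumoto')                # Y.M.
puts initials_of('Yukihiro Matsumoto', dots: false)   # YM
puts initials_of('Yukihiro Matsumoto', spaces: true)  # Y. M.
```

The `:expand => true` variant corresponds to applying the same helper to the given name only and appending the family name unchanged.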
42
+ Format and Examples
43
+ -------------------
44
+ Namae recognizes names in two basic formats, internally
45
+ referred to as display-order and sort-order. For example, the following
46
+ names are written in display-order:
47
+
48
+ Namae.parse 'Charles Babbage'
49
+ #-> [#<Name family="Babbage" given="Charles">]
50
+
51
+ Namae.parse 'Mr. Alan M. Turing'
52
+ #-> [#<Name family="Turing" given="Alan M." appellation="Mr.">]
53
+
54
+ Namae.parse 'Yukihiro "Matz" Matsumoto'
55
+ #-> [#<Name family="Matsumoto" given="Yukihiro" nick="Matz">]
56
+
57
+ Namae.parse 'Augusta Ada King and Lord Byron'
58
+ #-> [#<Name family="King" given="Augusta Ada">, #<Name family="Byron" title="Lord">]
59
+
60
+ Namae.parse 'Sir Isaac Newton'
61
+ #-> [#<Name family="Newton" given="Isaac" title="Sir">]
62
+
63
+ Namae.parse 'Prof. Donald Ervin Knuth'
64
+ #-> [#<Name family="Knuth" given="Donald Ervin" title="Prof.">]
65
+
66
+ Or in sort-order:
67
+
68
+ Namae.parse 'Turing, Alan M.'
69
+ #-> [#<Name family="Turing" given="Alan M.">]
70
+
71
+ You can also mix sort- and display-order in the same expression:
72
+
73
+ Namae.parse 'Torvalds, Linus and Alan Cox'
74
+ #-> [#<Name family="Torvalds" given="Linus">, #<Name family="Cox" given="Alan">]
75
+
76
+ Typically, sort-order names are easier to parse, because the syntax is less
77
+ ambiguous. For example, multiple family names are always possible in sort-order:
78
+
79
+ Namae.parse 'Carreño Quiñones, María-Jose'
80
+ #-> [#<Name family="Carreño Quiñones" given="María-Jose">]
81
+
82
+ Whilst in display-order, multiple family names are only supported when the
83
+ name contains a particle or a nickname.
84
+
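To illustrate why sort-order is less ambiguous, here is a minimal sketch that splits a sort-order name on its commas. It is not Namae's parser (which is generated from a racc grammar); the method name `split_sort_order` and the returned hash are hypothetical, and particles are deliberately ignored.

```ruby
# Minimal sketch: split "Family, Given" or "Family, Suffix, Given".
# Assumption: commas only ever separate the parts; particles such as
# "von" are left inside the family part.
def split_sort_order(name)
  parts = name.split(/\s*,\s*/)
  {
    family: parts.first.to_s,
    # With three parts the middle one is treated as the suffix.
    suffix: parts.length > 2 ? parts[1] : '',
    given:  parts.length > 1 ? parts.last : ''
  }
end

p split_sort_order('Brinch Hansen, Per')
p split_sort_order('Ford, Jr., Henry')
```

Because the comma marks the boundary explicitly, everything before it can be taken as the family name, no matter how many words it contains. Display-order input offers no such marker.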
85
+
86
+ Rationale
87
+ ---------
88
+ Parsing human names is at once too easy and too hard. When working in the
89
+ confines of a single language or culture it is often a trivial task that
90
+ does not warrant a dedicated software package; when working across different
91
+ cultures, languages, or scripts, however, it may quickly become unrealistic
92
+ to devise a satisfying, one-size-fits-all solution. In languages like
93
+ Japanese or Chinese, for instance, the issue of word segmentation alone is
94
+ probably more difficult than name parsing itself.
95
+
96
+ Having said that, Namae is based on the rules used by BibTeX to format names
97
+ and can therefore parse names in most languages that use Latin script. The
98
+ long-term goal is to support as many languages and scripts as possible
99
+ without the need for sophisticated or large dictionary-based
100
+ language-detection or word-segmentation features.
101
+
102
+ For further reading, see the W3C's primer on
103
+ [Personal Names Around the World](http://www.w3.org/International/questions/qa-personal-names).
104
+
105
+ Development
106
+ -----------
107
+ The Namae source code is [hosted on GitHub](https://github.com/berkmancenter/namae).
108
+ You can check out a copy of the latest code using Git:
109
+
110
+ $ git clone https://github.com/berkmancenter/namae.git
111
+
112
+ To get started, generate the parser and run all tests:
113
+
114
+ $ cd namae
115
+ $ bundle install
116
+ $ rake racc
117
+ $ rake features
118
+ $ rake spec
119
+
120
+ If you've found a bug or have a question, please open an issue on the
121
+ [issue tracker](https://github.com/berkmancenter/namae/issues). Or, for extra
122
+ credit, clone the Namae repository, write a failing example, fix the bug
123
+ and submit a pull request.
124
+
125
+ Contributors
126
+ ------------
127
+ * [Sylvester Keil](http://sylvester.keil.or.at)
128
+ * Dan Collis-Puro
129
+
130
+ Copyright
131
+ ---------
132
+ Copyright (c) 2012 President and Fellows of Harvard College.
133
+ Please see LICENSE for further details.
@@ -0,0 +1,62 @@
1
+ # encoding: utf-8
2
+
3
+ require 'bundler'
4
+ begin
5
+ Bundler.setup(:default, :development, :debug, :test)
6
+ rescue Bundler::BundlerError => e
7
+ $stderr.puts e.message
8
+ $stderr.puts "Run `bundle install` to install missing gems"
9
+ exit e.status_code
10
+ end
11
+ require 'rake'
12
+
13
+ $:.unshift(File.join(File.dirname(__FILE__), './lib'))
14
+ require 'namae'
15
+
16
+ begin
17
+ require 'jeweler'
18
+ Jeweler::Tasks.new do |gem|
19
+ gem.name = 'namae'
20
+ gem.version = Namae::Version::STRING.dup
21
+ gem.homepage = 'https://github.com/berkmancenter/namae'
22
+
23
+ gem.email = ['sylvester@keil.or.at', 'dan@collispuro.com']
24
+ gem.authors = ['Sylvester Keil', 'Dan Collis-Puro']
25
+
26
+ gem.license = 'AGPL'
27
+
28
+ gem.summary =
29
+ 'Namae parses personal names and splits them into their component parts.'
30
+
31
+ gem.description = %q{
32
+ Namae is a parser for human names. It recognizes personal names of various
33
+ cultural backgrounds and tries to split them into their component parts
34
+ (e.g., given and family names, honorifics etc.).
35
+ }.gsub(/\s+/, ' ')
36
+
37
+ end
38
+ Jeweler::RubygemsDotOrgTasks.new
39
+ rescue LoadError
40
+ warn 'failed to load jeweler'
41
+ end
42
+
43
+ desc 'Generate the name parser'
44
+ task :racc => ['lib/namae/parser.rb']
45
+
46
+ file 'lib/namae/parser.rb' => ['lib/namae/parser.y'] do
47
+ sh 'bundle exec racc -o lib/namae/parser.rb lib/namae/parser.y'
48
+ end
49
+
50
+ require 'rspec/core'
51
+ require 'rspec/core/rake_task'
52
+ RSpec::Core::RakeTask.new(:spec) do |spec|
53
+ spec.pattern = FileList['spec/**/*_spec.rb']
54
+ end
55
+
56
+ require 'cucumber/rake/task'
57
+ Cucumber::Rake::Task.new(:features)
58
+
59
+ task :default => :spec
60
+
61
+ require 'yard'
62
+ YARD::Rake::YardocTask.new
@@ -0,0 +1 @@
1
+ default: --format progress --require features --color
@@ -0,0 +1,78 @@
1
+ Feature: Parse BibTeX-style names
2
+ As a hacker who works with bibliographies
3
+ I want to be able to parse BibTeX-style names
4
+
5
+ Scenario Outline: Name splitting
6
+ When I parse the name "<name>"
7
+ Then the BibTeX parts should be:
8
+ | first | von | last | jr |
9
+ | <first> | <von> | <last> | <jr> |
10
+
11
+ @names @display
12
+ Scenarios: Decoret test suite (display order)
13
+ | name | first | von | last | jr |
14
+ | AA BB | AA | | BB | |
15
+ | AA BB CC | AA BB | | CC | |
16
+ # | AA | | | AA | |
17
+ | AA bb | AA | | bb | |
18
+ # | aa | | | aa | |
19
+ | aa bb | | aa | bb | |
20
+ | aa BB | | aa | BB | |
21
+ | AA bb CC | AA | bb | CC | |
22
+ | AA bb CC dd EE | AA | bb CC dd | EE | |
23
+ # | AA 1B cc dd | AA 1B | cc | dd | |
24
+ # | AA 1b cc dd | AA | 1b cc | dd | |
25
+ | AA {b}B cc dd | AA {b}B | cc | dd | |
26
+ | AA {b}b cc dd | AA | {b}b cc | dd | |
27
+ | AA {B}b cc dd | AA | {B}b cc | dd | |
28
+ | AA {B}B cc dd | AA {B}B | cc | dd | |
29
+ | AA \BB{b} cc dd | AA \\BB{b} | cc | dd | |
30
+ | AA \bb{b} cc dd | AA \\bb{b} | cc | dd | |
31
+ | AA {bb} cc DD | AA {bb} | cc | DD | |
32
+ | AA bb {cc} DD | AA | bb | {cc} DD | |
33
+ | AA {bb} CC | AA {bb} | | CC | |
34
+
35
+ @names @sort
36
+ Scenarios: Decoret test suite (sort order)
37
+ | name | first | von | last | jr |
38
+ | bb CC, AA | AA | bb | CC | |
39
+ | bb CC, aa | aa | bb | CC | |
40
+ | bb CC dd EE, AA | AA | bb CC dd | EE | |
41
+ | bb, AA | AA | | bb | |
42
+ | BB, | | | BB | |
43
+ | bb CC,XX, AA | AA | bb | CC | XX |
44
+ | bb CC,xx, AA | AA | bb | CC | xx |
45
+ | BB,, AA | AA | | BB | |
46
+ | CC dd BB, AA | AA | CC dd | BB | |
47
+ | BB, AA | AA | | BB | |
48
+
49
+ @names @sort
50
+ Scenarios: Long von parts
51
+ | name | first | von | last | jr |
52
+ | bb cc dd CC, AA | AA | bb cc dd | CC | |
53
+ | bb CC dd CC, AA | AA | bb CC dd | CC | |
54
+ | BB cc dd CC, AA | AA | BB cc dd | CC | |
55
+ | BB CC dd CC, AA | AA | BB CC dd | CC | |
56
+
57
+ @names
58
+ Scenarios: Decoret further remarks
59
+ | name | first | von | last | jr |
60
+ | Dominique Galouzeau de Villepin | Dominique Galouzeau | de | Villepin | |
61
+ | Dominique {G}alouzeau de Villepin | Dominique | {G}alouzeau de | Villepin | |
62
+ | Galouzeau {de} Villepin, Dominique | Dominique | | Galouzeau {de} Villepin | |
63
+
64
+ @names
65
+ Scenarios: Some actual names
66
+ | name | first | von | last | jr |
67
+ | John Paul Jones | John Paul | | Jones | |
68
+ | Ludwig von Beethoven | Ludwig | von | Beethoven | |
69
+ | von Beethoven, Ludwig | Ludwig | von | Beethoven | |
70
+ | {von Beethoven}, Ludwig | Ludwig | | {von Beethoven} | |
71
+ | {{von} Beethoven}, Ludwig | Ludwig | | {{von} Beethoven} | |
72
+ | John {}Paul Jones | John {}Paul | | Jones | |
73
+ | Ford, Jr., Henry | Henry | | Ford | Jr. |
74
+ | Brinch Hansen, Per | Per | | Brinch Hansen | |
75
+ # | {Barnes and Noble, Inc.} | | | {Barnes and Noble, Inc.} | |
76
+ | {Barnes and} {Noble, Inc.} | {Barnes and} | | {Noble, Inc.} | |
77
+ | {Barnes} {and} {Noble,} {Inc.} | {Barnes} {and} {Noble,} | | {Inc.} | |
78
+ | Charles Louis Xavier Joseph de la Vallee Poussin | Charles Louis Xavier Joseph | de la | Vallee Poussin | |
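The display-order scenarios above follow BibTeX's "First von Last" rule: the last word always belongs to the last name, and the von part runs from the first to the last lowercase word before it. The following is a simplified sketch of that rule, not Namae's racc grammar; the method name is hypothetical, and braces and TeX commands (as in the `{b}B` rows) are not handled.

```ruby
# Simplified sketch of BibTeX's "First von Last" splitting.
# Assumption: words are whitespace-separated and contain no braces
# or TeX commands.
def split_display_order(name)
  tokens = name.split(/\s+/)
  final = tokens.pop # the final word always belongs to the last name
  lowercase = ->(token) { token =~ /\A[[:lower:]]/ }

  von_start = tokens.index(&lowercase)
  if von_start.nil?
    # No lowercase word: everything before the final word is the first name.
    return { first: tokens.join(' '), von: '', last: final.to_s }
  end

  von_end = tokens.rindex(&lowercase)
  {
    first: tokens[0...von_start].join(' '),
    von:   tokens[von_start..von_end].join(' '),
    last:  (tokens[(von_end + 1)..-1] + [final]).join(' ')
  }
end

p split_display_order('AA bb CC dd EE')
p split_display_order('Charles Louis Xavier Joseph de la Vallee Poussin')
```

Rows that rely on brace protection, such as `{von Beethoven}, Ludwig`, need a tokenizer that tracks brace depth, which this sketch omits.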
@@ -0,0 +1,24 @@
1
+ Feature: Parse the names in the Readme file
2
+ As a hacker who works with Namae
3
+ I want to be able to parse all the examples in the Readme file
4
+
5
+ Scenario Outline: Names Parsing
6
+ When I parse the name "<name>"
7
+ Then the parts should be:
8
+ | given | particle | family | suffix | title | appellation | nick |
9
+ | <given> | <particle> | <family> | <suffix> | <title> | <appellation> | <nick> |
10
+
11
+ @readme @display
12
+ Scenarios: Readme examples (display-order)
13
+ | name | given | particle | family | suffix | title | appellation | nick |
14
+ | Charles Babbage | Charles | | Babbage | | | | |
15
+ | Mr. Alan M. Turing | Alan M. | | Turing | | | Mr. | |
16
+ | Yukihiro "Matz" Matsumoto | Yukihiro | | Matsumoto | | | | Matz |
17
+ | Sir Isaac Newton | Isaac | | Newton | | Sir | | |
18
+ | Prof. Donald Ervin Knuth | Donald Ervin | | Knuth | | Prof. | | |
19
+ | Lord Byron | | | Byron | | Lord | | |
20
+
21
+ @readme @sort
22
+ Scenarios: Readme examples (sort-order)
23
+ | name | given | particle | family | suffix | title | appellation | nick |
24
+ | Carreño Quiñones, María-Jose | María-Jose | | Carreño Quiñones | | | | |
@@ -0,0 +1,22 @@
1
+ When /^I parse the name "(.*)"$/ do |string|
2
+ @name = Namae.parse!(string)[0]
3
+ end
4
+
5
+ When /^I parse the names "(.*)"$/ do |string|
6
+ @names = Namae.parse!(string)
7
+ end
8
+
9
+
10
+ Then /^the BibTeX parts should be:$/ do |table|
11
+ table.hashes.each do |row|
12
+ @name.values_at(:given, :particle, :family, :suffix).map(&:to_s).should ==
13
+ row.values_at('first', 'von', 'last', 'jr')
14
+ end
15
+ end
16
+
17
+ Then /^the parts should be:$/ do |table|
18
+ table.hashes.each do |row|
19
+ @name.values_at(:given, :particle, :family, :suffix, :title, :appellation, :nick).map(&:to_s).should ==
20
+ row.values_at('given', 'particle', 'family', 'suffix', 'title', 'appellation', 'nick')
21
+ end
22
+ end
@@ -0,0 +1,19 @@
1
+ require 'bundler'
2
+ begin
3
+ Bundler.setup(:default, :development)
4
+ rescue Bundler::BundlerError => e
5
+ $stderr.puts e.message
6
+ $stderr.puts "Run `bundle install` to install missing gems"
7
+ exit e.status_code
8
+ end
9
+
10
+ begin
11
+ require 'simplecov'
12
+ rescue LoadError
13
+ # ignore
14
+ end
15
+
16
+ $LOAD_PATH.unshift(File.dirname(__FILE__) + '/../../lib')
17
+ require 'namae'
18
+
19
+ require 'rspec/expectations'
@@ -0,0 +1,5 @@
1
+ require 'namae/version'
2
+
3
+ require 'namae/name'
4
+ require 'namae/parser'
5
+ require 'namae/utility'
@@ -0,0 +1,119 @@
1
+ module Namae
2
+
3
+ # A Name represents a single personal name, exposing its constituent
4
+ # parts (e.g., family name, given name etc.). Name instances are typically
5
+ # created and returned from {Namae.parse Namae.parse}.
6
+ #
7
+ # name = Namae.parse('Yukihiro "Matz" Matsumoto')[0]
8
+ #
9
+ # name.family #=> Matsumoto
10
+ # name.nick #=> Matz
11
+ # name.given #=> Yukihiro
12
+ #
13
+ class Name < Struct.new :family, :given, :suffix, :particle,
14
+ :dropping_particle, :nick, :appellation, :title
15
+
16
+ # rbx compatibility
17
+ @parts = members.map(&:to_sym).freeze
18
+
19
+ @defaults = {
20
+ :initials => {
21
+ :expand => false,
22
+ :dots => true,
23
+ :spaces => false
24
+ }
25
+ }
26
+
27
+ class << self
28
+ attr_reader :parts, :defaults
29
+ end
30
+
31
+
32
+ # @param attributes [Hash] the individual parts of the name
33
+ # @example
34
+ # Name.new(:family => 'Matsumoto')
35
+ def initialize(attributes = {})
36
+ super(*attributes.values_at(*Name.parts))
37
+ end
38
+
39
+
40
+ # True if all the name components are nil.
41
+ def empty?
42
+ values.compact.empty?
43
+ end
44
+
45
+ # Merges the name with the passed-in name or hash.
46
+ #
47
+ # @param other [#each_pair] the other name or hash
48
+ # @return [self]
49
+ def merge(other)
50
+ raise ArgumentError, "failed to merge #{other.class} into Name" unless
51
+ other.respond_to?(:each_pair)
52
+
53
+ other.each_pair do |part, value|
54
+ writer = "#{part}="
55
+ send(writer, value) if !value.nil? && respond_to?(writer)
56
+ end
57
+
58
+ self
59
+ end
60
+
61
+ # @param options [Hash] the options to create the initials
62
+ #
63
+ # @option options [true,false] :expand (false) whether or not to expand the family name
64
+ # @option options [true,false] :dots (true) whether or not to print dots between the initials
65
+ # @option options [true,false] :spaces (false) whether or not to print spaces between the initials
66
+ #
67
+ # @return [String] the name's initials.
68
+ def initials(options = {})
69
+ options = Name.defaults[:initials].merge(options)
70
+
71
+ if options[:expand]
72
+ [initials_of(given_part, options), family].compact.join(' ')
73
+ else
74
+ initials_of([given_part, family_part].join(' '), options)
75
+ end
76
+ end
77
+
78
+ # @overload values_at(selector, ... )
79
+ # Returns an array containing the elements in self corresponding to
80
+ # the given selector(s). The selectors may be either integer indices,
81
+ # ranges (functionality inherited from Struct) or symbols
82
+ # identifying valid keys.
83
+ #
84
+ # @example
85
+ # name.values_at(:family, :nick) #=> ['Matsumoto', 'Matz']
86
+ #
87
+ # @see Struct#values_at
88
+ # @return [Array] the list of values
89
+ def values_at(*arguments)
90
+ super(*arguments.flatten.map { |k| k.is_a?(Symbol) ? Name.parts.index(k) : k })
91
+ end
92
+
93
+
94
+ # Describe the contents of this name in a string.
95
+ def inspect
96
+ "#<Name #{each_pair.map { |k,v| [k,v.inspect].join('=') if v }.compact.join(' ')}>"
97
+ end
98
+
99
+
100
+ private
101
+
102
+ def family_part
103
+ [particle, family].compact.join(' ')
104
+ end
105
+
106
+ def given_part
107
+ [given, dropping_particle].compact.join(' ')
108
+ end
109
+
110
+ # @param name [String] a name or part of a name
111
+ # @return [String] the initials of the passed-in name
112
+ def initials_of(name, options = {})
113
+ i = name.gsub(/([[:upper:]])[[:lower:]]+/, options[:dots] ? '\1.' : '\1')
114
+ i.gsub!(/\s+/, '') unless options[:spaces]
115
+ i
116
+ end
117
+
118
+ end
119
+ end
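The `values_at` override above translates symbol selectors into positional indices before delegating to `Struct#values_at`. A self-contained sketch of the same technique, using a hypothetical `Person` struct rather than `Namae::Name`, could look like this:

```ruby
# Hypothetical struct used only for illustration; it is not part of Namae.
Person = Struct.new(:family, :given, :nick) do
  # Accept symbols in addition to the integer indices and ranges that
  # Struct#values_at supports, by mapping each symbol to its position.
  def values_at(*selectors)
    super(*selectors.flatten.map { |key| key.is_a?(Symbol) ? members.index(key) : key })
  end
end

matz = Person.new('Matsumoto', 'Yukihiro', 'Matz')
p matz.values_at(:family, :nick)
p matz.values_at(1, :nick)
```

Note that a symbol that is not a member maps to `nil`, which `Struct#values_at` rejects; callers are assumed to pass valid keys only.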