japanese_names 0.0.3 → 0.1.0
- checksums.yaml +5 -13
- data/README.md +18 -33
- data/lib/japanese_names/backend/memory/finder.rb +55 -0
- data/lib/japanese_names/backend/memory/store.rb +25 -0
- data/lib/japanese_names/enamdict.rb +10 -100
- data/lib/japanese_names/finder.rb +21 -0
- data/lib/japanese_names/splitter.rb +55 -0
- data/lib/japanese_names/util/ngram.rb +46 -0
- data/lib/japanese_names/version.rb +1 -4
- data/lib/japanese_names.rb +11 -1
- data/spec/spec_helper.rb +1 -3
- data/spec/unit/{enamdict_spec.rb → finder_spec.rb} +2 -22
- data/spec/unit/ngram_spec.rb +24 -0
- data/spec/unit/{parser_spec.rb → splitter_spec.rb} +11 -14
- metadata +24 -19
- data/lib/japanese_names/parser.rb +0 -80
checksums.yaml
CHANGED
@@ -1,15 +1,7 @@
|
|
1
1
|
---
|
2
|
-
|
3
|
-
metadata.gz:
|
4
|
-
|
5
|
-
data.tar.gz: !binary |-
|
6
|
-
MDdhYjQ3NTA3NTgxZjMyY2Q4ZmQxZTk2MzgwMmRjYzMwMzAxNmQ4Mg==
|
2
|
+
SHA1:
|
3
|
+
metadata.gz: 66e24980c69de00af005fe290cb738acadc6a42c
|
4
|
+
data.tar.gz: fe841d057b518bc964f19d79004892565e6d80e0
|
7
5
|
SHA512:
|
8
|
-
metadata.gz:
|
9
|
-
|
10
|
-
NjFiOTQ3MGJkYjg1Y2NiNTI5YWI1NTJmNWIwNWNkYTkyODU3ODU3Yjc4MDY3
|
11
|
-
N2Q2NmJlMGUzZWY2N2MzZjA1N2VjYTUwYTc2MDVjMTE3YWM0YWE=
|
12
|
-
data.tar.gz: !binary |-
|
13
|
-
YjRiNTMzNzQzMDc1OGQ2NWY3YmExM2VjOWMyOWE4ODBiZTFmNGQzMmI3MTJi
|
14
|
-
NzU0NmJiNzdhY2YwMjY5OGRiYWUzYTM4NmM2MDUwMjA3NTE0MTljY2Y2ODgx
|
15
|
-
ZGM2MTc1ZmVhN2ZjMTk0ZGFmZGVhOTljNTY1ZWQ1NTZiZDhlODE=
|
6
|
+
metadata.gz: e8392c49fe7e1d091889af90fd78e80b4ce41635c25a70247caf317f3697d047f85dfb17357c7b1ed53898974ad73de8e271cb2b56f76c9c4ba8a6fe75b7d304
|
7
|
+
data.tar.gz: 928c0867d2ccdcaaf502d1318d2e2126cb29e5f719c414fe6b40ca530bc2a39985be870b78f3a1ea7f697f8f5931575df25b23390bfb7107f53f22459dd63f8a
|
data/README.md
CHANGED
@@ -1,28 +1,28 @@
|
|
1
1
|
# JapaneseNames
|
2
2
|
|
3
|
-
JapaneseNames provides an interface to the [
|
3
|
+
JapaneseNames provides an interface to the [ENAMDICT file](http://www.csse.monash.edu.au/~jwb/enamdict_doc.html).
|
4
4
|
|
5
5
|
|
6
|
-
##
|
6
|
+
## ENAMDICT
|
7
7
|
|
8
|
-
This library comes packaged with a compacted version of the [
|
8
|
+
This library comes packaged with a compacted version of the [ENAMDICT file](http://www.csse.monash.edu.au/~jwb/enamdict_doc.html)
|
9
9
|
at `bin/enamdict.min`. Refer to *Rake Tasks* below for how this file is constructed.
|
10
10
|
|
11
|
-
`JapaneseNames::Enamdict` is a module; all methods are called on the module `self` class.
|
12
11
|
|
13
|
-
|
14
|
-
### Enamdict.find
|
12
|
+
### Finder.find
|
15
13
|
|
16
14
|
Provides a structured query interface to access ENAMDICT data.
|
17
15
|
|
18
16
|
```ruby
|
19
|
-
JapaneseNames::
|
17
|
+
finder = JapaneseNames::Finder.new
|
18
|
+
|
19
|
+
finder.find(kanji: '外世子') #=> [["外世子", "とよこ", "f"]]
|
20
20
|
|
21
|
-
|
22
|
-
|
23
|
-
|
21
|
+
finder.find(kana: 'ならしま', flags: 's') #=> [["奈良島", "ならしま", "s"],
|
22
|
+
# ["楢島", "ならしま", "s"],
|
23
|
+
# ["楢嶋", "ならしま", "s"]]
|
24
24
|
|
25
|
-
|
25
|
+
finder.find(kanji: '楢二郎', kana: 'ならじろう') #=> [["楢二郎", "ならじろう", "m"]]
|
26
26
|
```
|
27
27
|
|
28
28
|
where options are:
|
@@ -30,37 +30,22 @@ where options are:
|
|
30
30
|
* `kanji`: The kanji name string to match. Regex syntax supported. Either `:kanji` or `:kana` must be specified.
|
31
31
|
* `kana`: The kana name string to match. Regex syntax supported.
|
32
32
|
* `flags`: The flag char or array of flag chars to match. Refer to [ENAMDICT documentation](http://www.csse.monash.edu.au/~jwb/enamdict_doc.html).
|
33
|
-
Additionally constants JapaneseNames::Enamdict::
|
33
|
+
Additionally constants `JapaneseNames::Enamdict::NAME_SURNAME` and `JapaneseNames::Enamdict::NAME_GIVEN` may be used.
|
34
34
|
|
35
35
|
Note that romaji data has been removed from our `enamdict.min` file in the compression step. We recommend to use a gem such as `mojinizer` to convert romaji to kana before doing a query.
|
36
36
|
|
37
37
|
|
38
|
-
|
39
|
-
|
40
|
-
Provides a raw interface to match ENAMDICT entries via a block, which would typically contain a `Regexp` expression:
|
38
|
+
## JapaneseNames::Splitter
|
41
39
|
|
42
|
-
|
43
|
-
JapaneseNames::Enamdict.match{|entry| entry =~ /^堺|/} #=> [["堺", "さかい", "p,s"], ["堺", "さかえ", "p"]]
|
44
|
-
```
|
45
|
-
|
46
|
-
where each dictionary entry is in the format below (different from raw ENAMDICT file):
|
47
|
-
|
48
|
-
```
|
49
|
-
kanji|kana|flag1(,flag2,...)
|
50
|
-
```
|
51
|
-
|
52
|
-
|
53
|
-
## JapaneseNames::Parser
|
54
|
-
|
55
|
-
### Parser#split
|
40
|
+
### Splitter#split
|
56
41
|
|
57
42
|
Currently the main method is `split` which, given a kanji and kana representation of a name splits
|
58
43
|
it into family/given names.
|
59
44
|
|
60
|
-
```ruby
|
61
|
-
|
62
|
-
|
63
|
-
```
|
45
|
+
```ruby
|
46
|
+
splitter = JapaneseNames::Splitter.new
|
47
|
+
splitter.split('堺雅美', 'さかいマサミ') #=> [['堺', '雅美'], ['さかい', 'マサミ']]
|
48
|
+
```
|
64
49
|
|
65
50
|
The logic is as follows:
|
66
51
|
|
@@ -0,0 +1,55 @@
|
|
1
|
+
module JapaneseNames
|
2
|
+
module Backend
|
3
|
+
module Memory
|
4
|
+
class Finder
|
5
|
+
|
6
|
+
class << self
|
7
|
+
|
8
|
+
# Public: Finds kanji and/or kana regex strings in the dictionary via
|
9
|
+
# a structured query interface.
|
10
|
+
#
|
11
|
+
# opts - The Hash options used to match the dictionary (default: {}):
|
12
|
+
# kanji: Regex to match kanji name (optional)
|
13
|
+
# kana: Regex to match kana name (optional)
|
14
|
+
# flags: Flag or Array of flags to filter the match (optional)
|
15
|
+
#
|
16
|
+
# Returns the dict entries as an Array of Arrays [[kanji, kana, flags], ...]
|
17
|
+
def find(opts={})
|
18
|
+
return [] unless opts[:kanji] || opts[:kana]
|
19
|
+
kanji = name_regex opts.delete(:kanji)
|
20
|
+
kana = name_regex opts.delete(:kana)
|
21
|
+
flags = flags_regex opts.delete(:flags)
|
22
|
+
store.select do |row|
|
23
|
+
(!kanji || row[0] =~ kanji) && (!kana || row[1] =~ kana) && (!flags || row[2] =~ flags)
|
24
|
+
end
|
25
|
+
end
|
26
|
+
|
27
|
+
private
|
28
|
+
|
29
|
+
def store
|
30
|
+
::JapaneseNames::Backend::Memory::Store.store
|
31
|
+
end
|
32
|
+
|
33
|
+
# Internal: Builds regex criteria for name.
|
34
|
+
def name_regex(name)
|
35
|
+
case name
|
36
|
+
when String, Symbol then /\A#{name}\z/
|
37
|
+
when Array then /\A(?:#{name.join('|')})\z/
|
38
|
+
else nil
|
39
|
+
end
|
40
|
+
end
|
41
|
+
|
42
|
+
# Internal: Builds regex criteria for flags.
|
43
|
+
def flags_regex(flags)
|
44
|
+
case flags
|
45
|
+
when ::JapaneseNames::Enamdict::NAME_ANY then nil
|
46
|
+
when String, Symbol then /[#{flags}]/
|
47
|
+
when Array then /[#{flags.join}]/
|
48
|
+
else nil
|
49
|
+
end
|
50
|
+
end
|
51
|
+
end
|
52
|
+
end
|
53
|
+
end
|
54
|
+
end
|
55
|
+
end
|
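The filtering done by `Backend::Memory::Finder.find` can be tried without the bundled dictionary. The following is a minimal sketch, not the gem's code: it assumes a small hand-written `ROWS` array (entries borrowed from the README examples) in place of `bin/enamdict.min`, and a standalone `find` method with keyword arguments, both of which are illustrative. It mirrors the anchored name regex and the character-class flag regex from the hunk above.

```ruby
# Hypothetical stand-in for the rows normally loaded from bin/enamdict.min.
ROWS = [
  ['奈良島', 'ならしま', 's'],
  ['楢島', 'ならしま', 's'],
  ['楢二郎', 'ならじろう', 'm'],
  ['堺', 'さかい', 'p,s']
].freeze

# Builds an anchored regex for a single name or a list of alternatives.
def name_regex(name)
  case name
  when String, Symbol then /\A#{name}\z/
  when Array then /\A(?:#{name.join('|')})\z/
  end
end

# Builds a character class that matches if any requested flag is present.
def flags_regex(flags)
  case flags
  when String, Symbol then /[#{flags}]/
  when Array then /[#{flags.join}]/
  end
end

# Illustrative equivalent of the backend query: every supplied criterion must match.
def find(rows, kanji: nil, kana: nil, flags: nil)
  return [] unless kanji || kana

  kanji_re = name_regex(kanji)
  kana_re = name_regex(kana)
  flags_re = flags_regex(flags)
  rows.select do |row|
    (!kanji_re || row[0] =~ kanji_re) &&
      (!kana_re || row[1] =~ kana_re) &&
      (!flags_re || row[2] =~ flags_re)
  end
end

find(ROWS, kana: 'ならしま', flags: 's')
# => [["奈良島", "ならしま", "s"], ["楢島", "ならしま", "s"]]
```

Because the name is interpolated without escaping, regex syntax inside a query string is interpreted as a pattern, which is consistent with the README's note that regex syntax is supported.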
@@ -0,0 +1,25 @@
|
|
1
|
+
module JapaneseNames
|
2
|
+
module Backend
|
3
|
+
module Memory
|
4
|
+
class Store
|
5
|
+
|
6
|
+
class << self
|
7
|
+
|
8
|
+
# Public: The memoized dictionary instance.
|
9
|
+
def store
|
10
|
+
@store ||= File.open(filepath, 'r:utf-8').map do |line|
|
11
|
+
line.chop.split('|').map(&:freeze).freeze
|
12
|
+
end.freeze
|
13
|
+
end
|
14
|
+
|
15
|
+
private
|
16
|
+
|
17
|
+
# Internal: Returns the filepath to the enamdict.min file.
|
18
|
+
def filepath
|
19
|
+
File.join(JapaneseNames.root, 'bin/enamdict.min')
|
20
|
+
end
|
21
|
+
end
|
22
|
+
end
|
23
|
+
end
|
24
|
+
end
|
25
|
+
end
|
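As a rough illustration of what the store holds, the sketch below loads a pipe-delimited file into frozen `[kanji, kana, flags]` rows. It is an assumption-laden variant rather than the gem's implementation: `load_rows` and the sample file are hypothetical, `File.foreach` is swapped in so the file handle is closed after reading, and `chomp` replaces `chop` so a final line without a trailing newline keeps its last character.

```ruby
require 'tmpdir'

# Hypothetical loader: one frozen [kanji, kana, flags] row per line.
def load_rows(path)
  File.foreach(path, encoding: 'utf-8').map do |line|
    line.chomp.split('|').map(&:freeze).freeze
  end.freeze
end

# Write a two-line sample in the kanji|kana|flags layout and load it back.
SAMPLE_ROWS = Dir.mktmpdir do |dir|
  path = File.join(dir, 'enamdict.sample')
  File.write(path, "堺|さかい|p,s\n楢二郎|ならじろう|m")
  load_rows(path)
end

SAMPLE_ROWS
# => [["堺", "さかい", "p,s"], ["楢二郎", "ならじろう", "m"]]
```

Freezing the rows and their strings means callers cannot mutate the shared, memoized dictionary by accident.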
@@ -1,105 +1,15 @@
|
|
1
|
-
#!/bin/env ruby
|
2
|
-
# encoding: utf-8
|
3
|
-
|
4
1
|
module JapaneseNames
|
5
2
|
|
6
|
-
#
|
3
|
+
# Enumerated flags for the ENAMDICT file (http://www.csse.monash.edu.au/~jwb/enamdict_doc.html)
|
7
4
|
module Enamdict
|
8
|
-
|
9
|
-
#
|
10
|
-
|
11
|
-
|
12
|
-
#
|
13
|
-
#
|
14
|
-
|
15
|
-
|
16
|
-
|
17
|
-
NAME_ANY = NAME_FAM | NAME_GIV
|
18
|
-
|
19
|
-
class << self
|
20
|
-
|
21
|
-
# Public: Finds kanji and/or kana regex strings in the dictionary via
|
22
|
-
# a structured query interface.
|
23
|
-
#
|
24
|
-
# opts - The Hash options used to match the dictionary (default: {}):
|
25
|
-
# kanji: Regex to match kanji name (optional)
|
26
|
-
# kana: Regex to match kana name (optional)
|
27
|
-
# flags: Flag or Array of flags to filter the match (optional)
|
28
|
-
#
|
29
|
-
# Returns the dict entries as an Array of Arrays [[kanji, kana, flags], ...]
|
30
|
-
def find(opts={})
|
31
|
-
return [] unless opts[:kanji] || opts[:kana]
|
32
|
-
|
33
|
-
kanji = name_regex opts.delete(:kanji)
|
34
|
-
kana = name_regex opts.delete(:kana)
|
35
|
-
flags = flags_regex opts.delete(:flags)
|
36
|
-
regex = /^#{kanji}\|#{kana}\|#{flags}$/
|
37
|
-
|
38
|
-
match{|line| line[regex]}
|
39
|
-
end
|
40
|
-
|
41
|
-
# Public: Matches entries in the enamdict based on a block which should
|
42
|
-
# evaluate true or false (typically a regex).
|
43
|
-
#
|
44
|
-
# Returns the dict entries as an Array of Arrays [[kanji, kana, flags], ...]
|
45
|
-
def match(&block)
|
46
|
-
sel = []
|
47
|
-
each_line do |line|
|
48
|
-
if block.call(line)
|
49
|
-
sel << unpack_line(line)
|
50
|
-
end
|
51
|
-
end
|
52
|
-
sel
|
53
|
-
end
|
54
|
-
|
55
|
-
protected
|
56
|
-
|
57
|
-
# Internal: Returns the filepath to the enamdict.min file.
|
58
|
-
def filepath
|
59
|
-
File.join(File.dirname(__FILE__), '../../bin/enamdict.min')
|
60
|
-
end
|
61
|
-
|
62
|
-
# Internal: The memoized dictionary instance.
|
63
|
-
def dict
|
64
|
-
return @dict if @dict
|
65
|
-
@dict = []
|
66
|
-
File.open(self.filepath, 'r:utf-8') do |f|
|
67
|
-
while(line = f.gets) != nil
|
68
|
-
@dict << line[0..-2] # omit trailing newline char
|
69
|
-
end
|
70
|
-
end
|
71
|
-
@dict.freeze
|
72
|
-
end
|
73
|
-
|
74
|
-
# Internal: Calls the given block for each line in the dict.
|
75
|
-
def each_line(&block)
|
76
|
-
dict.each{|line| block.call(line) }
|
77
|
-
end
|
78
|
-
|
79
|
-
# Internal: Formats a line as a 3-tuple Array [kanji, kana, flags]
|
80
|
-
def unpack_line(line)
|
81
|
-
line.split('|')
|
82
|
-
end
|
83
|
-
|
84
|
-
# Internal: Builds regex criteria for name.
|
85
|
-
def name_regex(name)
|
86
|
-
case name
|
87
|
-
when String then name
|
88
|
-
when Array then "(?:#{name.join('|')})"
|
89
|
-
else '.+?'
|
90
|
-
end
|
91
|
-
end
|
92
|
-
|
93
|
-
# Internal: Builds regex criteria for flags.
|
94
|
-
def flags_regex(flags)
|
95
|
-
if !flags || flags == NAME_ANY
|
96
|
-
'.+?'
|
97
|
-
elsif flags.is_a?(Array)
|
98
|
-
".*?[#{flags.join}].*?"
|
99
|
-
else
|
100
|
-
flags
|
101
|
-
end
|
102
|
-
end
|
103
|
-
end
|
5
|
+
NAME_PLACE = %i(p).freeze # place-name (99,500)
|
6
|
+
NAME_PERSON = %i(u).freeze # person name, either given or surname, as-yet unclassified (139,000)
|
7
|
+
NAME_SURNAME = %i(s).freeze # surname (138,500)
|
8
|
+
NAME_GIVEN_MALE = %i(m).freeze # male given name (14,500)
|
9
|
+
NAME_GIVEN_FEMALE = %i(f).freeze # female given name (106,300)
|
10
|
+
NAME_GIVEN_OTHER = %i(g).freeze # given name, as-yet not classified by sex (64,600)
|
11
|
+
NAME_SURNAME_ANY = (NAME_PLACE | NAME_PERSON | NAME_SURNAME).freeze
|
12
|
+
NAME_GIVEN_ANY = (NAME_PERSON | NAME_GIVEN_MALE| NAME_GIVEN_FEMALE | NAME_GIVEN_OTHER).freeze
|
13
|
+
NAME_ANY = (NAME_SURNAME_ANY | NAME_GIVEN_ANY).freeze
|
104
14
|
end
|
105
15
|
end
|
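The composite constants in the new `Enamdict` module are built with `Array#|`, which is a set union. The sketch below re-declares the same flag arrays in a hypothetical `FlagSketch` module (so it runs without the gem) to show how the union removes the duplicated `:u` flag and how such an array turns into the character class used by the memory backend.

```ruby
# Hypothetical copy of the flag constants, for illustration only.
module FlagSketch
  NAME_PLACE        = %i[p].freeze
  NAME_PERSON       = %i[u].freeze
  NAME_SURNAME      = %i[s].freeze
  NAME_GIVEN_MALE   = %i[m].freeze
  NAME_GIVEN_FEMALE = %i[f].freeze
  NAME_GIVEN_OTHER  = %i[g].freeze
  NAME_SURNAME_ANY  = (NAME_PLACE | NAME_PERSON | NAME_SURNAME).freeze
  NAME_GIVEN_ANY    = (NAME_PERSON | NAME_GIVEN_MALE | NAME_GIVEN_FEMALE | NAME_GIVEN_OTHER).freeze
  NAME_ANY          = (NAME_SURNAME_ANY | NAME_GIVEN_ANY).freeze
end

# Array#| keeps the first occurrence of each element, so :u appears once.
FlagSketch::NAME_ANY # => [:p, :u, :s, :m, :f, :g]

# Joined into a character class, as the backend does for Array flags.
SURNAME_CLASS = /[#{FlagSketch::NAME_SURNAME_ANY.join}]/ # => /[pus]/
```

Note that the constants shown in this hunk include `NAME_SURNAME` and `NAME_GIVEN_ANY` but no plain `NAME_GIVEN`, so the README's mention of `NAME_GIVEN` may refer to one of the `NAME_GIVEN_*` constants.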
@@ -0,0 +1,21 @@
|
|
1
|
+
module JapaneseNames
|
2
|
+
|
3
|
+
# Query interface for ENAMDICT
|
4
|
+
class Finder
|
5
|
+
|
6
|
+
# Hash opts
|
7
|
+
# - kanji: String kanji to match
|
8
|
+
# - kana: String kana to match
|
9
|
+
# - flags: Array<Symbol> ENAMDICT flags to match
|
10
|
+
def find(opts={})
|
11
|
+
backend.find(opts)
|
12
|
+
end
|
13
|
+
|
14
|
+
private
|
15
|
+
|
16
|
+
# Internal: Returns the backend class that executes the query.
|
17
|
+
def backend
|
18
|
+
::JapaneseNames::Backend::Memory::Finder
|
19
|
+
end
|
20
|
+
end
|
21
|
+
end
|
@@ -0,0 +1,55 @@
|
|
1
|
+
module JapaneseNames
|
2
|
+
|
3
|
+
# Provides methods to split a full Japanese name string into surname and given name.
|
4
|
+
class Splitter
|
5
|
+
|
6
|
+
# Given a kanji and kana representation of a name, splits it into family/given names.
|
7
|
+
#
|
8
|
+
# The choice to prioritize family name is arbitrary. Further analysis is needed
|
9
|
+
# for whether given or family name should be prioritized.
|
10
|
+
#
|
11
|
+
# Returns Array [[kanji_fam, kanji_giv], [kana_fam, kana_giv]] if there was a match.
|
12
|
+
# Returns nil if there was no match.
|
13
|
+
def split(kanji, kana)
|
14
|
+
split_surname(kanji, kana) || split_given(kanji, kana)
|
15
|
+
end
|
16
|
+
|
17
|
+
def split_giv(kanji, kana)
|
18
|
+
return nil unless kanji && kana
|
19
|
+
kanji, kana = kanji.strip, kana.strip
|
20
|
+
dict = finder.find(kanji: Util::Ngram.ngram_right(kanji))
|
21
|
+
dict.sort!{|x,y| y[0].size <=> x[0].size}
|
22
|
+
kana_match = nil
|
23
|
+
if match = dict.detect{|m| kana_match = kana[/#{hk m[1]}\z/]}
|
24
|
+
return [[Util::Ngram.mask_right(kanji, match[0]), match[0]],[Util::Ngram.mask_right(kana, kana_match), kana_match]]
|
25
|
+
end
|
26
|
+
end
|
27
|
+
alias :split_given :split_giv
|
28
|
+
|
29
|
+
def split_sur(kanji, kana)
|
30
|
+
return nil unless kanji && kana
|
31
|
+
kanji, kana = kanji.strip, kana.strip
|
32
|
+
dict = finder.find(kanji: Util::Ngram.ngram_left(kanji))
|
33
|
+
dict.sort!{|x,y| y[0].size <=> x[0].size}
|
34
|
+
kana_match = nil
|
35
|
+
if match = dict.detect{|m| kana_match = kana[/\A#{hk m[1]}/]}
|
36
|
+
return [[match[0], Util::Ngram.mask_left(kanji, match[0])],[kana_match, Util::Ngram.mask_left(kana, kana_match)]]
|
37
|
+
end
|
38
|
+
end
|
39
|
+
alias :split_surname :split_sur
|
40
|
+
|
41
|
+
# TODO: add option to strip honorific '様'
|
42
|
+
# TODO: add option to infer sex (0 = unknown, 1 = male, 2 = female as per ISO/IEC 5218)
|
43
|
+
|
44
|
+
private
|
45
|
+
|
46
|
+
# Returns a regex string which matches both hiragana and katakana variations of a String.
|
47
|
+
def hk(str)
|
48
|
+
"(?:#{Moji.kata_to_hira(str)}|#{Moji.hira_to_kata(str)})"
|
49
|
+
end
|
50
|
+
|
51
|
+
def finder
|
52
|
+
@finder ||= Finder.new
|
53
|
+
end
|
54
|
+
end
|
55
|
+
end
|
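The surname-first branch of `Splitter#split` can be sketched without the dictionary or the `moji` gem. Everything below is illustrative: `SURNAMES` is a hypothetical three-entry lookup table, `hira` and `kata` approximate the Moji conversions with `String#tr`, readings are passed through `Regexp.escape` (the gem interpolates them directly), and the remainder is taken by slicing instead of `Util::Ngram.mask_left`.

```ruby
# Hypothetical [kanji, kana] surname entries standing in for Finder#find results.
SURNAMES = [
  ['堺', 'さかい'],
  ['上原', 'うえはら'],
  ['上', 'うえ']
].freeze

# Rough katakana <-> hiragana conversion (assumption: stands in for Moji).
def hira(str)
  str.tr('ァ-ン', 'ぁ-ん')
end

def kata(str)
  str.tr('ぁ-ん', 'ァ-ン')
end

# Progressively shorter prefixes, e.g. "abcd" -> ["abcd", "abc", "ab", "a"].
def ngram_left(str)
  (0...str.size).to_a.reverse.map { |i| str[0..i] }
end

# Tries the longest kanji prefix first; its reading must also prefix the kana.
def split_surname(kanji, kana, dict)
  return nil unless kanji && kana

  kanji = kanji.strip
  kana = kana.strip
  prefixes = ngram_left(kanji)
  candidates = dict.select { |k, _| prefixes.include?(k) }
                   .sort_by { |k, _| -k.size }
  candidates.each do |k, reading|
    pattern = /\A(?:#{Regexp.escape(hira(reading))}|#{Regexp.escape(kata(reading))})/
    kana_match = kana[pattern]
    next unless kana_match

    return [[k, kanji[k.size..-1]], [kana_match, kana[kana_match.size..-1]]]
  end
  nil
end

split_surname('堺雅美', 'さかいマサミ', SURNAMES)
# => [["堺", "雅美"], ["さかい", "マサミ"]]
```

Sorting candidates by descending kanji length is what makes `上原` win over the shorter `上` when both are dictionary entries.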
@@ -0,0 +1,46 @@
|
|
1
|
+
module JapaneseNames
|
2
|
+
module Util
|
3
|
+
|
4
|
+
# Provides methods for parsing Japanese name strings.
|
5
|
+
class Ngram
|
6
|
+
|
7
|
+
class << self
|
8
|
+
|
9
|
+
# Given a String, returns an ordered array of all possible substrings.
|
10
|
+
#
|
11
|
+
# Example: ngram("abcd") #=> ["abcd", "abc", "bcd", "ab", "bc", "cd", "a", "b", "c", "d"]
|
12
|
+
def ngram(str)
|
13
|
+
(0...str.size).to_a.reverse.map{|i| (0...(str.size-i)).map{|j| str[j..(i+j)]}}.flatten.uniq
|
14
|
+
end
|
15
|
+
|
16
|
+
# Given a String, returns an array of progressively smaller substrings anchored on the left side.
|
17
|
+
#
|
18
|
+
# Example: ngram_left("abcd") #=> ["abcd", "abc", "ab", "a"]
|
19
|
+
def ngram_left(str)
|
20
|
+
(0...str.size).to_a.reverse.map{|i| str[0..i]}
|
21
|
+
end
|
22
|
+
|
23
|
+
# Given a String, returns an array of progressively smaller substrings anchored on the right side.
|
24
|
+
#
|
25
|
+
# Example: ngram_right("abcd") #=> ["abcd", "bcd", "cd", "d"]
|
26
|
+
def ngram_right(str)
|
27
|
+
(0...str.size).map{|i| str[i..-1]}
|
28
|
+
end
|
29
|
+
|
30
|
+
# Masks a String from the left side and returns the remaining (right) portion of the String.
|
31
|
+
#
|
32
|
+
# Example: mask_left("abcde", "ab") #=> "cde"
|
33
|
+
def mask_left(str, mask)
|
34
|
+
str.gsub(/^#{mask}/, '')
|
35
|
+
end
|
36
|
+
|
37
|
+
# Masks a String from the right side and returns the remaining (left) portion of the String.
|
38
|
+
#
|
39
|
+
# Example: mask_right("abcde", "de") #=> "abc"
|
40
|
+
def mask_right(str, mask)
|
41
|
+
str.gsub(/#{mask}$/, '')
|
42
|
+
end
|
43
|
+
end
|
44
|
+
end
|
45
|
+
end
|
46
|
+
end
|
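The n-gram helpers are easy to exercise in isolation. The sketch below restates them as top-level methods so it runs without the gem; the outputs match the expectations in `spec/unit/ngram_spec.rb`. The two mask helpers are a deliberate variation, not the gem's code: they use `String#delete_prefix` and `String#delete_suffix` (Ruby 2.5+) so the mask is treated literally, whereas the `gsub` versions above interpolate the mask into a regex.

```ruby
# All substrings, longest first, without duplicates.
def ngram(str)
  (0...str.size).to_a.reverse.flat_map do |i|
    (0...(str.size - i)).map { |j| str[j..(i + j)] }
  end.uniq
end

# Progressively shorter substrings anchored on the left.
def ngram_left(str)
  (0...str.size).to_a.reverse.map { |i| str[0..i] }
end

# Progressively shorter substrings anchored on the right.
def ngram_right(str)
  (0...str.size).map { |i| str[i..-1] }
end

# Literal-mask variants (assumption: swapped in for the regex-based gsub calls).
def mask_left(str, mask)
  str.delete_prefix(mask)
end

def mask_right(str, mask)
  str.delete_suffix(mask)
end

ngram_left('abcd')        # => ["abcd", "abc", "ab", "a"]
ngram_right('abcd')       # => ["abcd", "bcd", "cd", "d"]
mask_left('abcde', 'ab')  # => "cde"
mask_right('abcde', 'de') # => "abc"
mask_left('abcde', 'a.')  # => "abcde" (the dot is not treated as a wildcard)
```

For plain kanji and kana masks the two approaches should behave the same; the literal form only differs when a mask contains regex metacharacters or the string spans several lines.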
data/lib/japanese_names.rb
CHANGED
@@ -4,4 +4,14 @@ require 'moji'
|
|
4
4
|
|
5
5
|
require 'japanese_names/version'
|
6
6
|
require 'japanese_names/enamdict'
|
7
|
-
require 'japanese_names/
|
7
|
+
require 'japanese_names/finder'
|
8
|
+
require 'japanese_names/splitter'
|
9
|
+
require 'japanese_names/util/ngram'
|
10
|
+
require 'japanese_names/backend/memory/store'
|
11
|
+
require 'japanese_names/backend/memory/finder'
|
12
|
+
|
13
|
+
module JapaneseNames
|
14
|
+
def self.root
|
15
|
+
File.join(File.dirname(__FILE__), '../')
|
16
|
+
end
|
17
|
+
end
|
data/spec/spec_helper.rb
CHANGED
@@ -1,28 +1,8 @@
|
|
1
|
-
#!/bin/env ruby
|
2
|
-
# encoding: utf-8
|
3
|
-
|
4
1
|
require 'spec_helper'
|
5
2
|
|
6
|
-
describe JapaneseNames::
|
7
|
-
|
8
|
-
subject { JapaneseNames::Enamdict }
|
3
|
+
describe JapaneseNames::Finder do
|
9
4
|
|
10
|
-
|
11
|
-
|
12
|
-
it 'should select only lines which match criteria' do
|
13
|
-
result = subject.match{|line| line =~ /^.+?\|あわのはら\|.+?$/}
|
14
|
-
result.should eq [["粟野原", "あわのはら", "s"]]
|
15
|
-
end
|
16
|
-
|
17
|
-
it 'should select multiple lines' do
|
18
|
-
result = subject.match{|line| line =~ /^.+?\|はしの\|.+?$/}
|
19
|
-
result.should eq [["橋之", "はしの", "p"],
|
20
|
-
["橋埜", "はしの", "s"],
|
21
|
-
["橋野", "はしの", "s"],
|
22
|
-
["端野", "はしの", "s"],
|
23
|
-
["箸野", "はしの", "s"]]
|
24
|
-
end
|
25
|
-
end
|
5
|
+
subject { described_class.new }
|
26
6
|
|
27
7
|
describe '#find' do
|
28
8
|
|
@@ -0,0 +1,24 @@
|
|
1
|
+
require 'spec_helper'
|
2
|
+
|
3
|
+
describe JapaneseNames::Util::Ngram do
|
4
|
+
|
5
|
+
describe '#ngram' do
|
6
|
+
it { expect(described_class.ngram("abcd")).to eq ["abcd", "abc", "bcd", "ab", "bc", "cd", "a", "b", "c", "d"] }
|
7
|
+
end
|
8
|
+
|
9
|
+
describe '#ngram_left' do
|
10
|
+
it { expect(described_class.ngram_left("abcd")).to eq ["abcd", "abc", "ab", "a"] }
|
11
|
+
end
|
12
|
+
|
13
|
+
describe '#ngram_right' do
|
14
|
+
it { expect(described_class.ngram_right("abcd")).to eq ["abcd", "bcd", "cd", "d"] }
|
15
|
+
end
|
16
|
+
|
17
|
+
describe '#mask_left' do
|
18
|
+
it { expect(described_class.mask_left("abcde", "ab")).to eq "cde" }
|
19
|
+
end
|
20
|
+
|
21
|
+
describe '#mask_right' do
|
22
|
+
it { expect(described_class.mask_right("abcde", "de")).to eq "abc" }
|
23
|
+
end
|
24
|
+
end
|
@@ -1,21 +1,18 @@
|
|
1
|
-
#!/bin/env ruby
|
2
|
-
# encoding: utf-8
|
3
|
-
|
4
1
|
require 'spec_helper'
|
5
2
|
|
6
|
-
describe JapaneseNames::
|
3
|
+
describe JapaneseNames::Splitter do
|
7
4
|
|
8
|
-
subject {
|
5
|
+
subject { described_class.new }
|
9
6
|
|
10
7
|
describe '#split' do
|
11
8
|
|
12
|
-
[['上原','望','ウエハラ',
|
13
|
-
['樋口','知美','ヒグチ',
|
14
|
-
['堺','雅美','さかい',
|
15
|
-
['中村','幸子','ナカムラ',
|
16
|
-
['秋保','郁子','アキホ',
|
17
|
-
['光野','亜佐子','ミツノ',
|
18
|
-
['熊澤','貴子','クマザワ',
|
9
|
+
[['上原','望','ウエハラ','ノゾミ'],
|
10
|
+
['樋口','知美','ヒグチ','ともみ'],
|
11
|
+
['堺','雅美','さかい','マサミ'],
|
12
|
+
['中村','幸子','ナカムラ','サチコ'],
|
13
|
+
['秋保','郁子','アキホ','いくこ'],
|
14
|
+
['光野','亜佐子','ミツノ','アサコ'],
|
15
|
+
['熊澤','貴子','クマザワ','タカコ']].each do |kanji_fam, kanji_giv, kana_fam, kana_giv|
|
19
16
|
it "should parse #{kanji_fam+kanji_giv} #{kana_fam+kana_giv}" do
|
20
17
|
result = subject.split(kanji_fam+kanji_giv, kana_fam+kana_giv)
|
21
18
|
result.should eq [[kanji_fam, kanji_giv], [kana_fam, kana_giv]]
|
@@ -27,7 +24,7 @@ describe JapaneseNames::Parser do
|
|
27
24
|
end
|
28
25
|
|
29
26
|
it "should parse #{kanji_fam+kanji_giv} #{kana_fam+kana_giv} by family name" do
|
30
|
-
result = subject.
|
27
|
+
result = subject.split_sur(kanji_fam+kanji_giv, kana_fam+kana_giv)
|
31
28
|
result.should eq [[kanji_fam, kanji_giv], [kana_fam, kana_giv]]
|
32
29
|
end
|
33
30
|
end
|
@@ -42,7 +39,7 @@ describe JapaneseNames::Parser do
|
|
42
39
|
it 'should strip leading/trailing whitespace' do
|
43
40
|
subject.split(' 上原望 ', ' ウエハラノゾミ ').should eq [['上原','望'],['ウエハラ','ノゾミ']]
|
44
41
|
subject.split_giv(' 上原望 ', ' ウエハラノゾミ ').should eq [['上原','望'],['ウエハラ','ノゾミ']]
|
45
|
-
subject.
|
42
|
+
subject.split_sur(' 上原望 ', ' ウエハラノゾミ ').should eq [['上原','望'],['ウエハラ','ノゾミ']]
|
46
43
|
end
|
47
44
|
|
48
45
|
it 'should return nil for nil input' do
|
metadata
CHANGED
@@ -1,69 +1,69 @@
|
|
1
1
|
--- !ruby/object:Gem::Specification
|
2
2
|
name: japanese_names
|
3
3
|
version: !ruby/object:Gem::Version
|
4
|
-
version: 0.0
|
4
|
+
version: 0.1.0
|
5
5
|
platform: ruby
|
6
6
|
authors:
|
7
7
|
- Johnny Shields
|
8
8
|
autorequire:
|
9
9
|
bindir: bin
|
10
10
|
cert_chain: []
|
11
|
-
date:
|
11
|
+
date: 2016-10-12 00:00:00.000000000 Z
|
12
12
|
dependencies:
|
13
13
|
- !ruby/object:Gem::Dependency
|
14
14
|
name: moji
|
15
15
|
requirement: !ruby/object:Gem::Requirement
|
16
16
|
requirements:
|
17
|
-
- -
|
17
|
+
- - ">="
|
18
18
|
- !ruby/object:Gem::Version
|
19
19
|
version: '1.6'
|
20
20
|
type: :runtime
|
21
21
|
prerelease: false
|
22
22
|
version_requirements: !ruby/object:Gem::Requirement
|
23
23
|
requirements:
|
24
|
-
- -
|
24
|
+
- - ">="
|
25
25
|
- !ruby/object:Gem::Version
|
26
26
|
version: '1.6'
|
27
27
|
- !ruby/object:Gem::Dependency
|
28
28
|
name: rake
|
29
29
|
requirement: !ruby/object:Gem::Requirement
|
30
30
|
requirements:
|
31
|
-
- -
|
31
|
+
- - ">="
|
32
32
|
- !ruby/object:Gem::Version
|
33
33
|
version: '0'
|
34
34
|
type: :development
|
35
35
|
prerelease: false
|
36
36
|
version_requirements: !ruby/object:Gem::Requirement
|
37
37
|
requirements:
|
38
|
-
- -
|
38
|
+
- - ">="
|
39
39
|
- !ruby/object:Gem::Version
|
40
40
|
version: '0'
|
41
41
|
- !ruby/object:Gem::Dependency
|
42
42
|
name: rspec
|
43
43
|
requirement: !ruby/object:Gem::Requirement
|
44
44
|
requirements:
|
45
|
-
- -
|
45
|
+
- - ">="
|
46
46
|
- !ruby/object:Gem::Version
|
47
47
|
version: 3.0.0
|
48
48
|
type: :development
|
49
49
|
prerelease: false
|
50
50
|
version_requirements: !ruby/object:Gem::Requirement
|
51
51
|
requirements:
|
52
|
-
- -
|
52
|
+
- - ">="
|
53
53
|
- !ruby/object:Gem::Version
|
54
54
|
version: 3.0.0
|
55
55
|
- !ruby/object:Gem::Dependency
|
56
56
|
name: gem-release
|
57
57
|
requirement: !ruby/object:Gem::Requirement
|
58
58
|
requirements:
|
59
|
-
- -
|
59
|
+
- - ">="
|
60
60
|
- !ruby/object:Gem::Version
|
61
61
|
version: '0'
|
62
62
|
type: :development
|
63
63
|
prerelease: false
|
64
64
|
version_requirements: !ruby/object:Gem::Requirement
|
65
65
|
requirements:
|
66
|
-
- -
|
66
|
+
- - ">="
|
67
67
|
- !ruby/object:Gem::Version
|
68
68
|
version: '0'
|
69
69
|
description: Japanese name parser based on ENAMDICT
|
@@ -76,12 +76,17 @@ files:
|
|
76
76
|
- README.md
|
77
77
|
- bin/enamdict.min
|
78
78
|
- lib/japanese_names.rb
|
79
|
+
- lib/japanese_names/backend/memory/finder.rb
|
80
|
+
- lib/japanese_names/backend/memory/store.rb
|
79
81
|
- lib/japanese_names/enamdict.rb
|
80
|
-
- lib/japanese_names/
|
82
|
+
- lib/japanese_names/finder.rb
|
83
|
+
- lib/japanese_names/splitter.rb
|
84
|
+
- lib/japanese_names/util/ngram.rb
|
81
85
|
- lib/japanese_names/version.rb
|
82
86
|
- spec/spec_helper.rb
|
83
|
-
- spec/unit/
|
84
|
-
- spec/unit/
|
87
|
+
- spec/unit/finder_spec.rb
|
88
|
+
- spec/unit/ngram_spec.rb
|
89
|
+
- spec/unit/splitter_spec.rb
|
85
90
|
homepage: https://github.com/johnnyshields/japanese_names
|
86
91
|
licenses:
|
87
92
|
- MIT
|
@@ -92,22 +97,22 @@ require_paths:
|
|
92
97
|
- lib
|
93
98
|
required_ruby_version: !ruby/object:Gem::Requirement
|
94
99
|
requirements:
|
95
|
-
- -
|
100
|
+
- - ">="
|
96
101
|
- !ruby/object:Gem::Version
|
97
102
|
version: '0'
|
98
103
|
required_rubygems_version: !ruby/object:Gem::Requirement
|
99
104
|
requirements:
|
100
|
-
- -
|
105
|
+
- - ">="
|
101
106
|
- !ruby/object:Gem::Version
|
102
107
|
version: '0'
|
103
108
|
requirements: []
|
104
109
|
rubyforge_project:
|
105
|
-
rubygems_version: 2.
|
110
|
+
rubygems_version: 2.4.7
|
106
111
|
signing_key:
|
107
112
|
specification_version: 4
|
108
113
|
summary: Tools for parsing japanese names
|
109
114
|
test_files:
|
110
115
|
- spec/spec_helper.rb
|
111
|
-
- spec/unit/
|
112
|
-
- spec/unit/
|
113
|
-
|
116
|
+
- spec/unit/finder_spec.rb
|
117
|
+
- spec/unit/ngram_spec.rb
|
118
|
+
- spec/unit/splitter_spec.rb
|
@@ -1,80 +0,0 @@
|
|
1
|
-
#!/bin/env ruby
|
2
|
-
# encoding: utf-8
|
3
|
-
|
4
|
-
module JapaneseNames
|
5
|
-
|
6
|
-
# Provides methods for parsing Japanese name strings.
|
7
|
-
class Parser
|
8
|
-
|
9
|
-
# Given a kanji and kana representation of a name splits into to family/given names.
|
10
|
-
#
|
11
|
-
# The choice to prioritize family name is arbitrary. Further analysis is needed
|
12
|
-
# for whether given or family name should be prioritized.
|
13
|
-
#
|
14
|
-
# Returns Array [[kanji_fam, kanji_giv], [kana_fam, kana_giv]] if there was a match.
|
15
|
-
# Returns nil if there was no match.
|
16
|
-
def split(kanji, kana)
|
17
|
-
split_fam(kanji, kana) || split_giv(kanji, kana)
|
18
|
-
end
|
19
|
-
|
20
|
-
def split_giv(kanji, kana)
|
21
|
-
return nil unless kanji && kana
|
22
|
-
kanji, kana = kanji.strip, kana.strip
|
23
|
-
dict = Enamdict.find(kanji: window_right(kanji))
|
24
|
-
dict.sort!{|x,y| y[0].size <=> x[0].size}
|
25
|
-
kana_match = nil
|
26
|
-
if match = dict.detect{|m| kana_match = kana[/#{hk m[1]}$/]}
|
27
|
-
return [[mask_right(kanji, match[0]), match[0]],[mask_right(kana, kana_match), kana_match]]
|
28
|
-
end
|
29
|
-
end
|
30
|
-
|
31
|
-
def split_fam(kanji, kana)
|
32
|
-
return nil unless kanji && kana
|
33
|
-
kanji, kana = kanji.strip, kana.strip
|
34
|
-
dict = Enamdict.find(kanji: window_left(kanji))
|
35
|
-
dict.sort!{|x,y| y[0].size <=> x[0].size}
|
36
|
-
kana_match = nil
|
37
|
-
if match = dict.detect{|m| kana_match = kana[/^#{hk m[1]}/]}
|
38
|
-
return [[match[0], mask_left(kanji, match[0])],[kana_match, mask_left(kana, kana_match)]]
|
39
|
-
end
|
40
|
-
end
|
41
|
-
|
42
|
-
# TODO: add option to strip honorific '様'
|
43
|
-
# TODO: add option to infer sex (0 = unknown, 1 = male, 2 = female as per ISO/IEC 5218)
|
44
|
-
|
45
|
-
protected
|
46
|
-
|
47
|
-
# Returns a regex string which matches both hiragana and katakana variations of a String.
|
48
|
-
def hk(str)
|
49
|
-
"(?:#{Moji.kata_to_hira(str)}|#{Moji.hira_to_kata(str)})"
|
50
|
-
end
|
51
|
-
|
52
|
-
# Masks a String from the left side and returns the remaining (right) portion of the String.
|
53
|
-
#
|
54
|
-
# Example: mask_left("abcde", "ab") #=> "cde"
|
55
|
-
def mask_left(str, mask)
|
56
|
-
str.gsub(/^#{mask}/, '')
|
57
|
-
end
|
58
|
-
|
59
|
-
# Masks a String from the right side and returns the remaining (left) portion of the String.
|
60
|
-
#
|
61
|
-
# Example: mask_right("abcde", "de") #=> "abc"
|
62
|
-
def mask_right(str, mask)
|
63
|
-
str.gsub(/#{mask}$/, '')
|
64
|
-
end
|
65
|
-
|
66
|
-
# Given a String, returns an array of progressively smaller substrings anchored on the left side.
|
67
|
-
#
|
68
|
-
# Example: window_left("abcd") #=> ["abcd", "abc", "ab", "a"]
|
69
|
-
def window_left(str)
|
70
|
-
(0...str.size).to_a.reverse.map{|i| str[0..i]}
|
71
|
-
end
|
72
|
-
|
73
|
-
# Given a String, returns an array of progressively smaller substrings anchored on the right side.
|
74
|
-
#
|
75
|
-
# Example: window_right("abcd") #=> ["abcd", "bcd", "cd", "d"]
|
76
|
-
def window_right(str)
|
77
|
-
(0...str.size).map{|i| str[i..-1]}
|
78
|
-
end
|
79
|
-
end
|
80
|
-
end
|