japanese_names 0.0.2 → 0.0.3
- checksums.yaml +8 -8
- data/README.md +125 -7
- data/lib/japanese_names/enamdict.rb +6 -5
- data/lib/japanese_names/parser.rb +2 -2
- data/lib/japanese_names/version.rb +1 -1
- data/spec/unit/enamdict_spec.rb +9 -9
- metadata +2 -2
checksums.yaml
CHANGED
@@ -1,15 +1,15 @@
 ---
 !binary "U0hBMQ==":
   metadata.gz: !binary |-
-
+    YTZhOWUxMzRlNTE5ZmVmZTJkMWJmMzlhZTYzMzBjNzAxZjEzMzQ2MA==
   data.tar.gz: !binary |-
-
+    MDdhYjQ3NTA3NTgxZjMyY2Q4ZmQxZTk2MzgwMmRjYzMwMzAxNmQ4Mg==
 SHA512:
   metadata.gz: !binary |-
-
-
-
+    ZGQxNjI5ODEyYjUwZTYzY2JlZDZmOWQwYzI4ZjRkNzc3ZTljMTc0YzdlNWNj
+    NjFiOTQ3MGJkYjg1Y2NiNTI5YWI1NTJmNWIwNWNkYTkyODU3ODU3Yjc4MDY3
+    N2Q2NmJlMGUzZWY2N2MzZjA1N2VjYTUwYTc2MDVjMTE3YWM0YWE=
   data.tar.gz: !binary |-
-
-
-
+    YjRiNTMzNzQzMDc1OGQ2NWY3YmExM2VjOWMyOWE4ODBiZTFmNGQzMmI3MTJi
+    NzU0NmJiNzdhY2YwMjY5OGRiYWUzYTM4NmM2MDUwMjA3NTE0MTljY2Y2ODgx
+    ZGM2MTc1ZmVhN2ZjMTk0ZGFmZGVhOTljNTY1ZWQ1NTZiZDhlODE=
data/README.md
CHANGED
@@ -1,10 +1,58 @@
 # JapaneseNames

-
+JapaneseNames provides an interface to the [ENAMDICT file](http://www.csse.monash.edu.au/~jwb/enamdict_doc.html).

-## Overview

-JapaneseNames
+## JapaneseNames::Enamdict
+
+This library comes packaged with a compacted version of the [ENAMDICT file](http://www.csse.monash.edu.au/~jwb/enamdict_doc.html)
+at `bin/enamdict.min`. Refer to *Rake Tasks* below for how this file is constructed.
+
+`JapaneseNames::Enamdict` is a module; all methods are called on the module itself.
+
+
+### Enamdict.find
+
+Provides a structured query interface to access ENAMDICT data.
+
+```ruby
+JapaneseNames::Enamdict.find(kanji: '外世子') #=> [["外世子", "とよこ", "f"]]
+
+JapaneseNames::Enamdict.find(kana: 'ならしま', flags: 's') #=> [["奈良島", "ならしま", "s"],
+                                                            ["楢島", "ならしま", "s"],
+                                                            ["楢嶋", "ならしま", "s"]]
+
+JapaneseNames::Enamdict.find(kanji: '楢二郎', kana: 'ならじろう') #=> [["楢二郎", "ならじろう", "m"]]
+```
+
+where options are:
+
+* `kanji`: The kanji name string to match. Regex syntax supported. Either `:kanji` or `:kana` must be specified.
+* `kana`: The kana name string to match. Regex syntax supported.
+* `flags`: The flag char or array of flag chars to match. Refer to the [ENAMDICT documentation](http://www.csse.monash.edu.au/~jwb/enamdict_doc.html).
+  Additionally, the constants `JapaneseNames::Enamdict::NAME_FAM` and `JapaneseNames::Enamdict::NAME_GIV` may be used.
+
+Note that romaji data has been removed from our `enamdict.min` file in the compression step. We recommend using a gem such as `mojinizer` to convert romaji to kana before doing a query.
+
+
+### Enamdict.match
+
+Provides a raw interface to match ENAMDICT entries via a block, which would typically contain a `Regexp` expression:
+
+```ruby
+JapaneseNames::Enamdict.match{|entry| entry =~ /^堺\|/} #=> [["堺", "さかい", "p,s"], ["堺", "さかえ", "p"]]
+```
+
+where each dictionary entry is in the format below (different from the raw ENAMDICT file):
+
+```
+kanji|kana|flag1(,flag2,...)
+```
+
+
+## JapaneseNames::Parser
+
+### Parser#split

 Currently the main method is `split` which, given a kanji and kana representation of a name, splits it
 into family/given names.
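The pipe-delimited entry format and the block-based `match` interface described in this hunk can be tried out without the gem. The following is only a minimal sketch, assuming a small in-memory sample in place of `bin/enamdict.min`; `SAMPLE_LINES`, `parse_entry` and `match_lines` are hypothetical names and are not part of the gem's API. Because `|` is both the field delimiter and the regex alternation operator, the sketch escapes it as `\|` inside patterns.

```ruby
# Hypothetical in-memory stand-in for the compacted dictionary file.
# Each line follows the documented format: kanji|kana|flag1(,flag2,...)
SAMPLE_LINES = [
  '堺|さかい|p,s',
  '堺|さかえ|p',
  '楢二郎|ならじろう|m'
].freeze

# Splits one entry line into [kanji, kana, flags].
def parse_entry(line)
  line.chomp.split('|', 3)
end

# Yields each raw line to the block and collects the parsed entries
# for which the block returns a truthy value.
def match_lines(lines = SAMPLE_LINES)
  lines.select { |line| yield(line) }.map { |line| parse_entry(line) }
end

# The pipe is escaped here: an unescaped `|` would be regex alternation,
# and an empty alternative matches every line.
p match_lines { |entry| entry =~ /^堺\|/ }
```

With the sample data above, the final call should return only the two 堺 entries, each as a `[kanji, kana, flags]` array.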
@@ -14,13 +62,77 @@ into family/given names.
 parser.split('堺雅美', 'さかいマサミ') #=> [['堺', '雅美'], ['さかい', 'マサミ']]
 ```

+The logic is as follows:

-
+* Step 1: Split kanji name into possible surname sub-strings

-
-
+```
+上原亜沙子 =>
+
+上原亜沙子
+上原亜沙
+上原亜
+上原
+上
+```
+
+* Step 2: Look up possible kana matches in dictionary (done in a single pass)
+
+```
+上原亜沙子 => X
+上原亜沙 => X
+上原亜 => X
+上原 => かみはら かみばら うえはら うえばら...
+上 => かみ うえ ...
+```
+
+* Step 3: Compare kana lookups versus kana name and detect first match (starting from longest candidate string)
+
+```
+うえはらあさこ contains かみはら ? => X
+うえはらあさこ contains かみばら ? => X
+うえはらあさこ contains うえはら ? => YES! [うえはら]あさこ
+```
+
+* Step 4: If match found, split names accordingly
+
+```
+[上原]亜沙子 => 上原 亜沙子
+[うえはら]あさこ => うえはら あさこ
+```
+
+* Step 5: If match not found, repeat steps 1-4 in reverse for given name:
+
+```
+上原亜沙子 =>

-
+上原亜沙子 => X
+原亜沙子 => X
+亜沙子 => あさこ
+沙子 => さこ
+子 => こ
+
+上原[亜沙子] => 上原 亜沙子
+うえはら[あさこ] => うえはら あさこ
+```
+
+* Step 6: If match still not found, return `nil`
+
+
+## Rake Tasks
+
+The following tasks are used for development purposes of this gem only. They will not be accessible
+in projects which use this gem.
+
+* `rake enamdict:refresh`: Runs `enamdict:download` and `enamdict:minify` (see below)
+
+* `rake enamdict:download`: Downloads and extracts the ENAMDICT file to `/tmp/enamdict`
+
+* `rake enamdict:minify`: Compiles `/bin/enamdict.min` file from `/tmp/enamdict`. Performs several processing steps including:
+  * Converts to UTF-8
+  * Compacts format (pipe-delimited)
+  * Removes non-human name entries
+  * Removes romaji strings (redundant with kana)


 ## TODO
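Steps 1 to 6 of the README text in this hunk can be sketched in plain Ruby. This is an illustration under stated assumptions, not the gem's `Parser`: `TINY_DICT` is a hypothetical three-entry dictionary standing in for the ENAMDICT lookup, all method names are invented for the example, and the sketch skips the hiragana/katakana normalization that the real parser applies before comparing readings.

```ruby
# Hypothetical miniature dictionary: kanji => possible kana readings.
# The real gem looks these up in its compacted ENAMDICT file instead.
TINY_DICT = {
  '上原' => %w[かみはら かみばら うえはら うえばら],
  '上' => %w[かみ うえ],
  '亜沙子' => %w[あさこ]
}.freeze

# Step 1: leading substrings of the kanji name, longest first.
def surname_candidates(kanji)
  kanji.length.downto(1).map { |len| kanji[0, len] }
end

# Step 5 (mirror of step 1): trailing substrings, longest first.
def given_candidates(kanji)
  (0...kanji.length).map { |start| kanji[start..-1] }
end

# Steps 1-4: find the longest surname candidate with a reading that
# prefixes the kana name, then split both strings at that boundary.
def split_by_surname(kanji, kana, dict = TINY_DICT)
  surname_candidates(kanji).each do |fam|
    reading = (dict[fam] || []).find { |r| kana.start_with?(r) }
    next unless reading
    return [[fam, kanji[fam.length..-1]], [reading, kana[reading.length..-1]]]
  end
  nil
end

# Step 5: the same search from the right-hand side, for the given name.
def split_by_given(kanji, kana, dict = TINY_DICT)
  given_candidates(kanji).each do |giv|
    reading = (dict[giv] || []).find { |r| kana.end_with?(r) }
    next unless reading
    return [[kanji[0, kanji.length - giv.length], giv],
            [kana[0, kana.length - reading.length], reading]]
  end
  nil
end

# Step 6: nil when neither direction produces a match.
def split_name(kanji, kana, dict = TINY_DICT)
  split_by_surname(kanji, kana, dict) || split_by_given(kanji, kana, dict)
end

p split_name('上原亜沙子', 'うえはらあさこ')
```

For the README's example name, the surname pass should already succeed at 上原 / うえはら, so the given-name pass only runs when the surname is missing from the dictionary.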
@@ -38,6 +150,12 @@ implementation of the dictionary would be nice.
 Fork -> Commit -> Spec -> Push -> Pull Request


+## Similar Projects
+
+* Marco Bresciani's [wwwjdic](https://rubygems.org/gems/wwwjdic) gem, which is **NOT** used by this lib
+* [@jeresig](https://github.com/jeresig)'s [node-enamdict](https://github.com/jeresig/node-enamdict), an ENAMDICT reader for Node.js
+
+
 ## Authors

 * [@johnnyshields](https://github.com/johnnyshields)
data/lib/japanese_names/enamdict.rb
CHANGED
@@ -18,7 +18,8 @@ module JapaneseNames

     class << self

-      # Public:
+      # Public: Finds kanji and/or kana regex strings in the dictionary via
+      # a structured query interface.
       #
       # opts - The Hash options used to match the dictionary (default: {}):
       #        kanji: Regex to match kanji name (optional)
@@ -26,7 +27,7 @@ module JapaneseNames
       #        flags: Flag or Array of flags to filter the match (optional)
       #
       # Returns the dict entries as an Array of Arrays [[kanji, kana, flags], ...]
-      def
+      def find(opts={})
         return [] unless opts[:kanji] || opts[:kana]

         kanji = name_regex opts.delete(:kanji)
@@ -34,14 +35,14 @@ module JapaneseNames
         flags = flags_regex opts.delete(:flags)
         regex = /^#{kanji}\|#{kana}\|#{flags}$/

-
+        match{|line| line[regex]}
       end

-      # Public:
+      # Public: Matches entries in the enamdict based on a block which should
       # evaluate true or false (typically a regex).
       #
       # Returns the dict entries as an Array of Arrays [[kanji, kana, flags], ...]
-      def
+      def match(&block)
         sel = []
         each_line do |line|
           if block.call(line)
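The hunks above show `find` compiling its options into a single anchored pattern, `/^#{kanji}\|#{kana}\|#{flags}$/`, and delegating to `match`. The bodies of `name_regex` and `flags_regex` are not part of this diff, so the sketch below substitutes hypothetical helpers (`sketch_name_regex`, `sketch_flags_regex`, `sketch_query_regex`) to show one way such a query could be assembled; the gem's actual helpers may behave differently.

```ruby
# Hypothetical stand-ins for the gem's private helpers, whose bodies are
# not shown in this diff. A missing field matches any run of non-pipe chars.
def sketch_name_regex(name)
  name ? "(?:#{name})" : '[^|]+'
end

# Accepts nil, a single flag String, or an Array of flags, and matches a
# comma-separated flag list that contains at least one of them.
def sketch_flags_regex(flags)
  return '[^|]*' if flags.nil?
  alternatives = Array(flags).map { |f| Regexp.escape(f) }.join('|')
  "(?:[^|,]+,)*(?:#{alternatives})(?:,[^|,]+)*"
end

# Builds one anchored pattern for a line of the form kanji|kana|flags.
def sketch_query_regex(opts = {})
  kanji = sketch_name_regex(opts[:kanji])
  kana  = sketch_name_regex(opts[:kana])
  flags = sketch_flags_regex(opts[:flags])
  /^#{kanji}\|#{kana}\|#{flags}$/
end

# Sample lines in the compacted format (assumed data for illustration).
lines = ['樽島|ならしま|u', '奈良島|ならしま|s', '堺|さかい|p,s']
regex = sketch_query_regex(kana: 'ならしま', flags: 's')
p lines.grep(regex)
```

Wrapping each field in a non-capturing group keeps a user-supplied alternation such as `a|b` from leaking across the field delimiters, which is one reason to build the pattern from parts rather than concatenating raw strings.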
data/lib/japanese_names/parser.rb
CHANGED
@@ -20,7 +20,7 @@ module JapaneseNames
     def split_giv(kanji, kana)
       return nil unless kanji && kana
       kanji, kana = kanji.strip, kana.strip
-      dict = Enamdict.
+      dict = Enamdict.find(kanji: window_right(kanji))
       dict.sort!{|x,y| y[0].size <=> x[0].size}
       kana_match = nil
       if match = dict.detect{|m| kana_match = kana[/#{hk m[1]}$/]}
@@ -31,7 +31,7 @@ module JapaneseNames
     def split_fam(kanji, kana)
       return nil unless kanji && kana
       kanji, kana = kanji.strip, kana.strip
-      dict = Enamdict.
+      dict = Enamdict.find(kanji: window_left(kanji))
       dict.sort!{|x,y| y[0].size <=> x[0].size}
       kana_match = nil
       if match = dict.detect{|m| kana_match = kana[/^#{hk m[1]}/]}
data/spec/unit/enamdict_spec.rb
CHANGED
@@ -7,15 +7,15 @@ describe JapaneseNames::Enamdict do

   subject { JapaneseNames::Enamdict }

-  describe '#
+  describe '#match' do

     it 'should select only lines which match criteria' do
-      result = subject.
+      result = subject.match{|line| line =~ /^.+?\|あわのはら\|.+?$/}
       result.should eq [["粟野原", "あわのはら", "s"]]
     end

     it 'should select multiple lines' do
-      result = subject.
+      result = subject.match{|line| line =~ /^.+?\|はしの\|.+?$/}
       result.should eq [["橋之", "はしの", "p"],
                         ["橋埜", "はしの", "s"],
                         ["橋野", "はしの", "s"],
@@ -24,15 +24,15 @@ describe JapaneseNames::Enamdict do
     end
   end

-  describe '#
+  describe '#find' do

     it 'should match kanji only' do
-      result = subject.
+      result = subject.find(kanji: '外世子')
       result.should eq [["外世子", "とよこ", "f"]]
     end

     it 'should match kana only' do
-      result = subject.
+      result = subject.find(kana: 'ならしま')
       result.should eq [["樽島", "ならしま", "u"],
                         ["奈良島", "ならしま", "s"],
                         ["楢島", "ならしま", "s"],
@@ -40,19 +40,19 @@ describe JapaneseNames::Enamdict do
     end

     it 'should match both kanji and kana only' do
-      result = subject.
+      result = subject.find(kanji: '楢二郎', kana: 'ならじろう')
       result.should eq [["楢二郎", "ならじろう", "m"]]
     end

     it 'should match flags as String' do
-      result = subject.
+      result = subject.find(kana: 'ならしま', flags: 's')
       result.should eq [["奈良島", "ならしま", "s"],
                         ["楢島", "ならしま", "s"],
                         ["楢嶋", "ならしま", "s"]]
     end

     it 'should match flags as Array' do
-      result = subject.
+      result = subject.find(kana: 'ならしま', flags: ['u','g'])
       result.should eq [["樽島", "ならしま", "u"]]
     end
   end
metadata
CHANGED
@@ -1,14 +1,14 @@
 --- !ruby/object:Gem::Specification
 name: japanese_names
 version: !ruby/object:Gem::Version
-  version: 0.0.
+  version: 0.0.3
 platform: ruby
 authors:
 - Johnny Shields
 autorequire:
 bindir: bin
 cert_chain: []
-date: 2014-09-
+date: 2014-09-08 00:00:00.000000000 Z
 dependencies:
 - !ruby/object:Gem::Dependency
   name: moji