japanese_names 0.0.2 → 0.0.3
- checksums.yaml +8 -8
- data/README.md +125 -7
- data/lib/japanese_names/enamdict.rb +6 -5
- data/lib/japanese_names/parser.rb +2 -2
- data/lib/japanese_names/version.rb +1 -1
- data/spec/unit/enamdict_spec.rb +9 -9
- metadata +2 -2
checksums.yaml
CHANGED
|
@@ -1,15 +1,15 @@
|
|
|
1
1
|
---
|
|
2
2
|
!binary "U0hBMQ==":
|
|
3
3
|
metadata.gz: !binary |-
|
|
4
|
-
|
|
4
|
+
YTZhOWUxMzRlNTE5ZmVmZTJkMWJmMzlhZTYzMzBjNzAxZjEzMzQ2MA==
|
|
5
5
|
data.tar.gz: !binary |-
|
|
6
|
-
|
|
6
|
+
MDdhYjQ3NTA3NTgxZjMyY2Q4ZmQxZTk2MzgwMmRjYzMwMzAxNmQ4Mg==
|
|
7
7
|
SHA512:
|
|
8
8
|
metadata.gz: !binary |-
|
|
9
|
-
|
|
10
|
-
|
|
11
|
-
|
|
9
|
+
ZGQxNjI5ODEyYjUwZTYzY2JlZDZmOWQwYzI4ZjRkNzc3ZTljMTc0YzdlNWNj
|
|
10
|
+
NjFiOTQ3MGJkYjg1Y2NiNTI5YWI1NTJmNWIwNWNkYTkyODU3ODU3Yjc4MDY3
|
|
11
|
+
N2Q2NmJlMGUzZWY2N2MzZjA1N2VjYTUwYTc2MDVjMTE3YWM0YWE=
|
|
12
12
|
data.tar.gz: !binary |-
|
|
13
|
-
|
|
14
|
-
|
|
15
|
-
|
|
13
|
+
YjRiNTMzNzQzMDc1OGQ2NWY3YmExM2VjOWMyOWE4ODBiZTFmNGQzMmI3MTJi
|
|
14
|
+
NzU0NmJiNzdhY2YwMjY5OGRiYWUzYTM4NmM2MDUwMjA3NTE0MTljY2Y2ODgx
|
|
15
|
+
ZGM2MTc1ZmVhN2ZjMTk0ZGFmZGVhOTljNTY1ZWQ1NTZiZDhlODE=
|
data/README.md
CHANGED
|
@@ -1,10 +1,58 @@
|
|
|
1
1
|
# JapaneseNames
|
|
2
2
|
|
|
3
|
-
|
|
3
|
+
JapaneseNames provides an interface to the [ENAMDICT file](http://www.csse.monash.edu.au/~jwb/enamdict_doc.html).
|
|
4
4
|
|
|
5
|
-
## Overview
|
|
6
5
|
|
|
7
|
-
JapaneseNames
|
|
6
|
+
## JapaneseNames::Enamdict
|
|
7
|
+
|
|
8
|
+
This library comes packaged with a compacted version of the [ENAMDICT file](http://www.csse.monash.edu.au/~jwb/enamdict_doc.html)
|
|
9
|
+
at `bin/enamdict.min`. Refer to *Rake Tasks* below for how this file is constructed.
|
|
10
|
+
|
|
11
|
+
`JapaneseNames::Enamdict` is a module; all of its methods are module-level methods, called directly on `JapaneseNames::Enamdict`.
|
|
12
|
+
|
|
13
|
+
|
|
14
|
+
### Enamdict.find
|
|
15
|
+
|
|
16
|
+
Provides a structured query interface to access ENAMDICT data.
|
|
17
|
+
|
|
18
|
+
```ruby
|
|
19
|
+
JapaneseNames::Enamdict.find(kanji: '外世子') #=> [["外世子", "とよこ", "f"]]
|
|
20
|
+
|
|
21
|
+
JapaneseNames::Enamdict.find(kana: 'ならしま', flags: 's') #=> [["奈良島", "ならしま", "s"],
|
|
22
|
+
["楢島", "ならしま", "s"],
|
|
23
|
+
["楢嶋", "ならしま", "s"]]
|
|
24
|
+
|
|
25
|
+
JapaneseNames::Enamdict.find(kanji: '楢二郎', kana: 'ならじろう') #=> [["楢二郎", "ならじろう", "m"]]
|
|
26
|
+
```
|
|
27
|
+
|
|
28
|
+
where options are:
|
|
29
|
+
|
|
30
|
+
* `kanji`: The kanji name string to match. Regex syntax supported. Either `:kanji` or `:kana` must be specified.
|
|
31
|
+
* `kana`: The kana name string to match. Regex syntax supported.
|
|
32
|
+
* `flags`: The flag char or array of flag chars to match. Refer to the [ENAMDICT documentation](http://www.csse.monash.edu.au/~jwb/enamdict_doc.html).
|
|
33
|
+
Additionally, the constants `JapaneseNames::Enamdict::NAME_FAM` and `JapaneseNames::Enamdict::NAME_GIV` may be used.
|
|
34
|
+
|
|
35
|
+
Note that romaji data has been removed from our `enamdict.min` file in the compression step. We recommend using a gem such as `mojinizer` to convert romaji to kana before doing a query.
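As a rough illustration of how these options could combine into a single pattern over `kanji|kana|flags` lines, here is a minimal sketch. The helper `build_query_regex` and the `ENTRIES` sample are hypothetical and written for this example only; the gem's own helpers may behave differently, for instance in how an omitted field or a flag list is matched.

```ruby
# Hypothetical helper, not part of the gem: builds one regex over
# "kanji|kana|flags" lines from the find options.
def build_query_regex(kanji: nil, kana: nil, flags: nil)
  any = '[^|]+' # wildcard used when a field is not specified
  flag_part =
    if flags
      alt = Array(flags).map { |f| Regexp.escape(f) }.join('|')
      # Accept any of the requested flags anywhere in the comma-separated list.
      "(?:[^|,]+,)*(?:#{alt})(?:,[^|,]+)*"
    else
      any
    end
  /^(?:#{kanji || any})\|(?:#{kana || any})\|#{flag_part}$/
end

# Sample lines in the compact format (assumed data for the sketch).
ENTRIES = ['奈良島|ならしま|s', '樽島|ならしま|u', '堺|さかい|p,s'].freeze

regex = build_query_regex(kana: 'ならしま', flags: 's')
p ENTRIES.grep(regex).map { |line| line.split('|') }
```

Escaping the flags but not the kanji and kana values mirrors the documented behaviour that name strings accept regex syntax while flags are plain characters.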
|
|
36
|
+
|
|
37
|
+
|
|
38
|
+
### Enamdict.match
|
|
39
|
+
|
|
40
|
+
Provides a raw interface to match ENAMDICT entries via a block, which would typically contain a `Regexp` expression:
|
|
41
|
+
|
|
42
|
+
```ruby
|
|
43
|
+
JapaneseNames::Enamdict.match{|entry| entry =~ /^堺\|/} #=> [["堺", "さかい", "p,s"], ["堺", "さかえ", "p"]]
|
|
44
|
+
```
|
|
45
|
+
|
|
46
|
+
where each dictionary entry is in the format below (different from raw ENAMDICT file):
|
|
47
|
+
|
|
48
|
+
```
|
|
49
|
+
kanji|kana|flag1(,flag2,...)
|
|
50
|
+
```
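To make the entry format concrete, the following sketch splits one compact line into its fields. `parse_entry` is a hypothetical helper written for this example and is not part of the gem's API.

```ruby
# Hypothetical helper: split one compact entry into kanji, kana, and flags.
def parse_entry(line)
  kanji, kana, flags = line.chomp.split('|', 3)
  { kanji: kanji, kana: kana, flags: flags.to_s.split(',') }
end

p parse_entry('堺|さかい|p,s')
```

Since a block passed to `Enamdict.match` receives the raw line, a literal field separator inside a `Regexp` has to be escaped as `\|`.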
|
|
51
|
+
|
|
52
|
+
|
|
53
|
+
## JapaneseNames::Parser
|
|
54
|
+
|
|
55
|
+
### Parser#split
|
|
8
56
|
|
|
9
57
|
Currently the main method is `split` which, given a kanji and kana representation of a name splits
|
|
10
58
|
it into family/given names.
|
|
@@ -14,13 +62,77 @@ into to family/given names.
|
|
|
14
62
|
parser.split('堺雅美', 'さかいマサミ') #=> [['堺', '雅美'], ['さかい', 'マサミ']]
|
|
15
63
|
```
|
|
16
64
|
|
|
65
|
+
The logic is as follows:
|
|
17
66
|
|
|
18
|
-
|
|
67
|
+
* Step 1: Split kanji name into possible surname sub-strings
|
|
19
68
|
|
|
20
|
-
|
|
21
|
-
|
|
69
|
+
```
|
|
70
|
+
上原亜沙子 =>
|
|
71
|
+
|
|
72
|
+
上原亜沙子
|
|
73
|
+
上原亜沙
|
|
74
|
+
上原亜
|
|
75
|
+
上原
|
|
76
|
+
上
|
|
77
|
+
```
|
|
78
|
+
|
|
79
|
+
* Step 2: Look up possible kana matches in dictionary (done in a single pass)
|
|
80
|
+
|
|
81
|
+
```
|
|
82
|
+
上原亜沙子 => X
|
|
83
|
+
上原亜沙 => X
|
|
84
|
+
上原亜 => X
|
|
85
|
+
上原 => かみはら かみばら うえはら うえばら...
|
|
86
|
+
上 => かみ うえ ...
|
|
87
|
+
```
|
|
88
|
+
|
|
89
|
+
* Step 3: Compare kana lookups versus kana name and detect first match (starting from longest candidate string)
|
|
90
|
+
|
|
91
|
+
```
|
|
92
|
+
うえはらあさこ contains かみはら ? => X
|
|
93
|
+
うえはらあさこ contains かみばら ? => X
|
|
94
|
+
うえはらあさこ contains うえはら ? => YES! [うえはら]あさこ
|
|
95
|
+
```
|
|
96
|
+
|
|
97
|
+
* Step 4: If match found, split names accordingly
|
|
98
|
+
|
|
99
|
+
```
|
|
100
|
+
[上原]亜沙子 => 上原 亜沙子
|
|
101
|
+
[うえはら]あさこ => うえはら あさこ
|
|
102
|
+
```
|
|
103
|
+
|
|
104
|
+
* Step 5: If match not found, repeat steps 1-4 in reverse for given name:
|
|
105
|
+
|
|
106
|
+
```
|
|
107
|
+
上原亜沙子 =>
|
|
22
108
|
|
|
23
|
-
|
|
109
|
+
上原亜沙子 => X
|
|
110
|
+
原亜沙子 => X
|
|
111
|
+
亜沙子 => あさこ
|
|
112
|
+
沙子 => さこ
|
|
113
|
+
子 => こ
|
|
114
|
+
|
|
115
|
+
上原[亜沙子] => 上原 亜沙子
|
|
116
|
+
うえはら[あさこ] => うえはら あさこ
|
|
117
|
+
```
|
|
118
|
+
|
|
119
|
+
* Step 6: If match still not found, return `nil`
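The steps above can be sketched in a few lines of Ruby. This is a simplified illustration, not the gem's implementation: `TOY_DICT`, `candidates`, `split_on`, and `toy_split` are hypothetical names, the dictionary is a tiny in-memory stand-in for ENAMDICT, and hiragana/katakana normalization is left out. Skipping candidates that would leave the other name empty is also an assumption of this sketch.

```ruby
# Toy dictionary standing in for ENAMDICT: kanji => known kana readings.
TOY_DICT = {
  '上原'   => %w[かみはら うえはら],
  '上'     => %w[かみ うえ],
  '亜沙子' => %w[あさこ]
}.freeze

# Step 1: candidate substrings, longest first, anchored at the left
# (surname) or at the right (given name).
def candidates(kanji, from: :left)
  (1..kanji.size).map { |n| from == :left ? kanji[0, n] : kanji[-n, n] }.reverse
end

# Steps 2-4: the first candidate whose reading starts (or ends) the kana
# string decides where both strings are cut.
def split_on(kanji, kana, from:)
  candidates(kanji, from: from).each do |cand|
    TOY_DICT.fetch(cand, []).each do |reading|
      hit = from == :left ? kana.start_with?(reading) : kana.end_with?(reading)
      next unless hit
      # Assumption: a match must leave something for the other name.
      next if cand.size == kanji.size || reading.size == kana.size
      if from == :left
        return [[cand, kanji[cand.size..-1]], [reading, kana[reading.size..-1]]]
      else
        return [[kanji[0...-cand.size], cand], [kana[0...-reading.size], reading]]
      end
    end
  end
  nil
end

# Steps 5-6: try the surname side first, then the given-name side, else nil.
def toy_split(kanji, kana)
  split_on(kanji, kana, from: :left) || split_on(kanji, kana, from: :right)
end

p toy_split('上原亜沙子', 'うえはらあさこ')
```

Trying the longest candidate first means a full surname such as 上原 is preferred over the shorter 上 when both have readings that fit.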
|
|
120
|
+
|
|
121
|
+
|
|
122
|
+
## Rake Tasks
|
|
123
|
+
|
|
124
|
+
The following tasks are used for development purposes of this gem only. They will not be accessible
|
|
125
|
+
in projects which use this gem.
|
|
126
|
+
|
|
127
|
+
* `rake enamdict:refresh`: Runs `enamdict:download` and `enamdict:minify` (see below)
|
|
128
|
+
|
|
129
|
+
* `rake enamdict:download`: Downloads and extracts the ENAMDICT file to `/tmp/enamdict`
|
|
130
|
+
|
|
131
|
+
* `rake enamdict:minify`: Compiles `/bin/enamdict.min` file from `/tmp/enamdict`. Performs several processing steps including:
|
|
132
|
+
* Converts to UTF-8
|
|
133
|
+
* Compacts format (pipe-delimited)
|
|
134
|
+
* Removes non-human name entries
|
|
135
|
+
* Removes romaji strings (redundant with kana)
|
|
24
136
|
|
|
25
137
|
|
|
26
138
|
## TODO
|
|
@@ -38,6 +150,12 @@ implementation of the dictionary would be nice.
|
|
|
38
150
|
Fork -> Commit -> Spec -> Push -> Pull Request
|
|
39
151
|
|
|
40
152
|
|
|
153
|
+
## Similar Projects
|
|
154
|
+
|
|
155
|
+
* Marco Bresciani's [wwwjdic](https://rubygems.org/gems/wwwjdic) gem, which is **NOT** used by this lib
|
|
156
|
+
* [@jeresig](https://github.com/jeresig)'s [node-enamdict](https://github.com/jeresig/node-enamdict), an ENAMDICT reader for Node.js
|
|
157
|
+
|
|
158
|
+
|
|
41
159
|
## Authors
|
|
42
160
|
|
|
43
161
|
* [@johnnyshields](https://github.com/johnnyshields)
|
data/lib/japanese_names/enamdict.rb
CHANGED
|
@@ -18,7 +18,8 @@ module JapaneseNames
|
|
|
18
18
|
|
|
19
19
|
class << self
|
|
20
20
|
|
|
21
|
-
# Public:
|
|
21
|
+
# Public: Finds kanji and/or kana regex strings in the dictionary via
|
|
22
|
+
# a structured query interface.
|
|
22
23
|
#
|
|
23
24
|
# opts - The Hash options used to match the dictionary (default: {}):
|
|
24
25
|
# kanji: Regex to match kanji name (optional)
|
|
@@ -26,7 +27,7 @@ module JapaneseNames
|
|
|
26
27
|
# flags: Flag or Array of flags to filter the match (optional)
|
|
27
28
|
#
|
|
28
29
|
# Returns the dict entries as an Array of Arrays [[kanji, kana, flags], ...]
|
|
29
|
-
def
|
|
30
|
+
def find(opts={})
|
|
30
31
|
return [] unless opts[:kanji] || opts[:kana]
|
|
31
32
|
|
|
32
33
|
kanji = name_regex opts.delete(:kanji)
|
|
@@ -34,14 +35,14 @@ module JapaneseNames
|
|
|
34
35
|
flags = flags_regex opts.delete(:flags)
|
|
35
36
|
regex = /^#{kanji}\|#{kana}\|#{flags}$/
|
|
36
37
|
|
|
37
|
-
|
|
38
|
+
match{|line| line[regex]}
|
|
38
39
|
end
|
|
39
40
|
|
|
40
|
-
# Public:
|
|
41
|
+
# Public: Matches entries in the enamdict based on a block which should
|
|
41
42
|
# evaluate true or false (typically a regex).
|
|
42
43
|
#
|
|
43
44
|
# Returns the dict entries as an Array of Arrays [[kanji, kana, flags], ...]
|
|
44
|
-
def
|
|
45
|
+
def match(&block)
|
|
45
46
|
sel = []
|
|
46
47
|
each_line do |line|
|
|
47
48
|
if block.call(line)
|
data/lib/japanese_names/parser.rb
CHANGED
|
@@ -20,7 +20,7 @@ module JapaneseNames
|
|
|
20
20
|
def split_giv(kanji, kana)
|
|
21
21
|
return nil unless kanji && kana
|
|
22
22
|
kanji, kana = kanji.strip, kana.strip
|
|
23
|
-
dict = Enamdict.
|
|
23
|
+
dict = Enamdict.find(kanji: window_right(kanji))
|
|
24
24
|
dict.sort!{|x,y| y[0].size <=> x[0].size}
|
|
25
25
|
kana_match = nil
|
|
26
26
|
if match = dict.detect{|m| kana_match = kana[/#{hk m[1]}$/]}
|
|
@@ -31,7 +31,7 @@ module JapaneseNames
|
|
|
31
31
|
def split_fam(kanji, kana)
|
|
32
32
|
return nil unless kanji && kana
|
|
33
33
|
kanji, kana = kanji.strip, kana.strip
|
|
34
|
-
dict = Enamdict.
|
|
34
|
+
dict = Enamdict.find(kanji: window_left(kanji))
|
|
35
35
|
dict.sort!{|x,y| y[0].size <=> x[0].size}
|
|
36
36
|
kana_match = nil
|
|
37
37
|
if match = dict.detect{|m| kana_match = kana[/^#{hk m[1]}/]}
|
data/spec/unit/enamdict_spec.rb
CHANGED
|
@@ -7,15 +7,15 @@ describe JapaneseNames::Enamdict do
|
|
|
7
7
|
|
|
8
8
|
subject { JapaneseNames::Enamdict }
|
|
9
9
|
|
|
10
|
-
describe '#
|
|
10
|
+
describe '#match' do
|
|
11
11
|
|
|
12
12
|
it 'should select only lines which match criteria' do
|
|
13
|
-
result = subject.
|
|
13
|
+
result = subject.match{|line| line =~ /^.+?\|あわのはら\|.+?$/}
|
|
14
14
|
result.should eq [["粟野原", "あわのはら", "s"]]
|
|
15
15
|
end
|
|
16
16
|
|
|
17
17
|
it 'should select multiple lines' do
|
|
18
|
-
result = subject.
|
|
18
|
+
result = subject.match{|line| line =~ /^.+?\|はしの\|.+?$/}
|
|
19
19
|
result.should eq [["橋之", "はしの", "p"],
|
|
20
20
|
["橋埜", "はしの", "s"],
|
|
21
21
|
["橋野", "はしの", "s"],
|
|
@@ -24,15 +24,15 @@ describe JapaneseNames::Enamdict do
|
|
|
24
24
|
end
|
|
25
25
|
end
|
|
26
26
|
|
|
27
|
-
describe '#
|
|
27
|
+
describe '#find' do
|
|
28
28
|
|
|
29
29
|
it 'should match kanji only' do
|
|
30
|
-
result = subject.
|
|
30
|
+
result = subject.find(kanji: '外世子')
|
|
31
31
|
result.should eq [["外世子", "とよこ", "f"]]
|
|
32
32
|
end
|
|
33
33
|
|
|
34
34
|
it 'should match kana only' do
|
|
35
|
-
result = subject.
|
|
35
|
+
result = subject.find(kana: 'ならしま')
|
|
36
36
|
result.should eq [["樽島", "ならしま", "u"],
|
|
37
37
|
["奈良島", "ならしま", "s"],
|
|
38
38
|
["楢島", "ならしま", "s"],
|
|
@@ -40,19 +40,19 @@ describe JapaneseNames::Enamdict do
|
|
|
40
40
|
end
|
|
41
41
|
|
|
42
42
|
it 'should match both kanji and kana only' do
|
|
43
|
-
result = subject.
|
|
43
|
+
result = subject.find(kanji: '楢二郎', kana: 'ならじろう')
|
|
44
44
|
result.should eq [["楢二郎", "ならじろう", "m"]]
|
|
45
45
|
end
|
|
46
46
|
|
|
47
47
|
it 'should match flags as String' do
|
|
48
|
-
result = subject.
|
|
48
|
+
result = subject.find(kana: 'ならしま', flags: 's')
|
|
49
49
|
result.should eq [["奈良島", "ならしま", "s"],
|
|
50
50
|
["楢島", "ならしま", "s"],
|
|
51
51
|
["楢嶋", "ならしま", "s"]]
|
|
52
52
|
end
|
|
53
53
|
|
|
54
54
|
it 'should match flags as Array' do
|
|
55
|
-
result = subject.
|
|
55
|
+
result = subject.find(kana: 'ならしま', flags: ['u','g'])
|
|
56
56
|
result.should eq [["樽島", "ならしま", "u"]]
|
|
57
57
|
end
|
|
58
58
|
end
|
metadata
CHANGED
|
@@ -1,14 +1,14 @@
|
|
|
1
1
|
--- !ruby/object:Gem::Specification
|
|
2
2
|
name: japanese_names
|
|
3
3
|
version: !ruby/object:Gem::Version
|
|
4
|
-
version: 0.0.
|
|
4
|
+
version: 0.0.3
|
|
5
5
|
platform: ruby
|
|
6
6
|
authors:
|
|
7
7
|
- Johnny Shields
|
|
8
8
|
autorequire:
|
|
9
9
|
bindir: bin
|
|
10
10
|
cert_chain: []
|
|
11
|
-
date: 2014-09-
|
|
11
|
+
date: 2014-09-08 00:00:00.000000000 Z
|
|
12
12
|
dependencies:
|
|
13
13
|
- !ruby/object:Gem::Dependency
|
|
14
14
|
name: moji
|