japanese_names 0.0.2 → 0.0.3

checksums.yaml CHANGED
@@ -1,15 +1,15 @@
1
1
  ---
2
2
  !binary "U0hBMQ==":
3
3
  metadata.gz: !binary |-
4
- Y2RiZjRlZTEzMDQ4NjJhZmQyMTg4YmI5ZjE0M2IyNGU5MGRlNGU1YQ==
4
+ YTZhOWUxMzRlNTE5ZmVmZTJkMWJmMzlhZTYzMzBjNzAxZjEzMzQ2MA==
5
5
  data.tar.gz: !binary |-
6
- OWVlN2EzY2NmYWNiMzA1ZThjZGQxYjUxY2MwZTE2YzRhZGZiNDU5ZA==
6
+ MDdhYjQ3NTA3NTgxZjMyY2Q4ZmQxZTk2MzgwMmRjYzMwMzAxNmQ4Mg==
7
7
  SHA512:
8
8
  metadata.gz: !binary |-
9
- YmViZDBkNzFhNjU5OGE0NGNlZDAzMzUzMTAwZjBmOWI3OWE3ZjdhZjYwYTI4
10
- MDZlODJkNTA2ZjQ5ZTU2M2E2YjEwNzY1Mjk2MDVlOWMyNWU4NzA4ZTlkNmFl
11
- MTlkZjdhNjhhY2NiZDQ3MTU3MjQ3MTMzMmVjMTE2YTg5NmM4YzM=
9
+ ZGQxNjI5ODEyYjUwZTYzY2JlZDZmOWQwYzI4ZjRkNzc3ZTljMTc0YzdlNWNj
10
+ NjFiOTQ3MGJkYjg1Y2NiNTI5YWI1NTJmNWIwNWNkYTkyODU3ODU3Yjc4MDY3
11
+ N2Q2NmJlMGUzZWY2N2MzZjA1N2VjYTUwYTc2MDVjMTE3YWM0YWE=
12
12
  data.tar.gz: !binary |-
13
- ZTVmZWI4ZGUxNDQ5ZDA5ZTZhYjRkZjcxMzU3ZDQxNmQ2YTYwMmQyZTk3Njgz
14
- NzUxYzdlOThkNTgwM2I4NGQxODY5OWMxNGM3OGIwODViYmU1ZjZjYmIzZWNi
15
- YjhhOWIzMDM2ZjFhNDhlMjZlM2I3MjFjYjdlNGYxYmU0NzM0NGE=
13
+ YjRiNTMzNzQzMDc1OGQ2NWY3YmExM2VjOWMyOWE4ODBiZTFmNGQzMmI3MTJi
14
+ NzU0NmJiNzdhY2YwMjY5OGRiYWUzYTM4NmM2MDUwMjA3NTE0MTljY2Y2ODgx
15
+ ZGM2MTc1ZmVhN2ZjMTk0ZGFmZGVhOTljNTY1ZWQ1NTZiZDhlODE=
data/README.md CHANGED
@@ -1,10 +1,58 @@
1
1
  # JapaneseNames
2
2
 
3
- Japanese name parser based on ENAMDIC
3
+ JapaneseNames provides an interface to the [ENAMDICT file](http://www.csse.monash.edu.au/~jwb/enamdict_doc.html).
4
4
 
5
- ## Overview
6
5
 
7
- JapaneseNames provides an interface to the [ENAMDIC file](http://www.csse.monash.edu.au/~jwb/enamdict_doc.html).
6
+ ## JapaneseNames::Enamdict
7
+
8
+ This library comes packaged with a compacted version of the [ENAMDICT file](http://www.csse.monash.edu.au/~jwb/enamdict_doc.html)
9
+ at `bin/enamdict.min`. Refer to *Rake Tasks* below for how this file is constructed.
10
+
11
+ `JapaneseNames::Enamdict` is a module; all of its methods are module-level methods, called directly on `JapaneseNames::Enamdict`.
12
+
13
+
14
+ ### Enamdict.find
15
+
16
+ Provides a structured query interface to access ENAMDICT data.
17
+
18
+ ```ruby
19
+ JapaneseNames::Enamdict.find(kanji: '外世子') #=> [["外世子", "とよこ", "f"]]
20
+
21
+ JapaneseNames::Enamdict.find(kana: 'ならしま', flags: 's') #=> [["奈良島", "ならしま", "s"],
22
+ ["楢島", "ならしま", "s"],
23
+ ["楢嶋", "ならしま", "s"]]
24
+
25
+ JapaneseNames::Enamdict.find(kanji: '楢二郎', kana: 'ならじろう') #=> [["楢二郎", "ならじろう", "m"]]
26
+ ```
27
+
28
+ where options are:
29
+
30
+ * `kanji`: The kanji name string to match. Regex syntax supported. Either `:kanji` or `:kana` must be specified.
31
+ * `kana`: The kana name string to match. Regex syntax supported.
32
+ * `flags`: The flag char or array of flag chars to match. Refer to the [ENAMDICT documentation](http://www.csse.monash.edu.au/~jwb/enamdict_doc.html).
33
+ Additionally, the constants `JapaneseNames::Enamdict::NAME_FAM` and `JapaneseNames::Enamdict::NAME_GIV` may be used.
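
The options above end up in a single regex over pipe-delimited dictionary lines (the source builds `/^#{kanji}\|#{kana}\|#{flags}$/`). The sketch below illustrates one way such a regex could be assembled. The helpers `name_pattern`, `flags_pattern`, and `build_query` are hypothetical names for illustration only; the gem's own `name_regex` and `flags_regex` are not shown in this diff and may behave differently.

```ruby
# Hypothetical helper: turn a kanji/kana option into a regex fragment.
# A missing option matches any non-empty field.
def name_pattern(value)
  case value
  when nil    then '[^|]+'
  when Regexp then value.source
  else             value.to_s # left unescaped, since regex syntax is supported
  end
end

# Hypothetical helper: match any one of the given flags inside a
# comma-separated flag field such as "p,s".
def flags_pattern(flags)
  return '[^|]+' if flags.nil?
  alternatives = Array(flags).map { |f| Regexp.escape(f) }.join('|')
  "(?:[^|,]+,)*(?:#{alternatives})(?:,[^|,]+)*"
end

# Assemble the full line regex in the kanji|kana|flags layout.
def build_query(opts)
  /^#{name_pattern(opts[:kanji])}\|#{name_pattern(opts[:kana])}\|#{flags_pattern(opts[:flags])}$/
end

lines = ['樽島|ならしま|u', '奈良島|ならしま|s', '堺|さかい|p,s']
p lines.grep(build_query(kana: 'ならしま', flags: 's'))
```

Whether the real implementation matches a flag inside a multi-flag field in this way is an assumption of the sketch, not something the diff confirms.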
34
+
35
+ Note that romaji data has been removed from our `enamdict.min` file in the compression step. We recommend using a gem such as `mojinizer` to convert romaji to kana before running a query.
36
+
37
+
38
+ ### Enamdict.match
39
+
40
+ Provides a raw interface to match ENAMDICT entries via a block, which would typically contain a `Regexp` expression:
41
+
42
+ ```ruby
43
+ JapaneseNames::Enamdict.match{|entry| entry =~ /^堺\|/} #=> [["堺", "さかい", "p,s"], ["堺", "さかえ", "p"]]
44
+ ```
45
+
46
+ where each dictionary entry is in the format below (different from raw ENAMDICT file):
47
+
48
+ ```
49
+ kanji|kana|flag1(,flag2,...)
50
+ ```
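
As a small illustration of this entry format, the sketch below splits one pipe-delimited line into the `[kanji, kana, flags]` triple that the query methods return. `parse_entry` is a hypothetical helper written for this example, not a method of the gem. It also shows why a literal `|` has to be escaped when matching raw lines with a regex.

```ruby
# Hypothetical helper: split "kanji|kana|flags" into a three-element Array.
# The flags field is kept as a single comma-separated String.
def parse_entry(line)
  line.chomp.split('|', 3)
end

p parse_entry("堺|さかい|p,s\n")

# In a Regexp, an unescaped | is alternation. /^堺|/ therefore means
# "starts with 堺, OR the empty string", which matches every line.
p('楢島|ならしま|s' =~ /^堺|/)
p('楢島|ならしま|s' =~ /^堺\|/)
```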
51
+
52
+
53
+ ## JapaneseNames::Parser
54
+
55
+ ### Parser#split
8
56
 
9
57
  Currently the main method is `split` which, given a kanji and kana representation of a name splits
10
58
  into family/given names.
@@ -14,13 +62,77 @@ into to family/given names.
14
62
  parser.split('堺雅美', 'さかいマサミ') #=> [['堺', '雅美'], ['さかい', 'マサミ']]
15
63
  ```
16
64
 
65
+ The logic is as follows:
17
66
 
18
- ## ENAMDICT
67
+ * Step 1: Split kanji name into possible surname sub-strings
19
68
 
20
- This library comes packaged with a compacted version of the [ENAMDIC file](http://www.csse.monash.edu.au/~jwb/enamdict_doc.html)
21
- at `bin/enamdict.min`.
69
+ ```
70
+ 上原亜沙子 =>
71
+
72
+ 上原亜沙子
73
+ 上原亜沙
74
+ 上原亜
75
+ 上原
76
+
77
+ ```
78
+
79
+ * Step 2: Lookup possible kana matches in dictionary (done in a single pass)
80
+
81
+ ```
82
+ 上原亜沙子 => X
83
+ 上原亜沙  => X
84
+ 上原亜   => X
85
+ 上原    => かみはら かみばら うえはら うえばら...
86
+ 上     => かみ うえ ...
87
+ ```
88
+
89
+ * Step 3: Compare kana lookups versus kana name and detect first match (starting from longest candidate string)
90
+
91
+ ```
92
+ うえはらあさこ contains かみはら ? => X
93
+ うえはらあさこ contains かみばら ? => X
94
+ うえはらあさこ contains うえはら ? => YES! [うえはら]あさこ
95
+ ```
96
+
97
+ * Step 4: If match found, split names accordingly
98
+
99
+ ```
100
+ [上原]亜沙子 => 上原 亜沙子
101
+ [うえはら]あさこ => うえはら あさこ
102
+ ```
103
+
104
+ * Step 5: If match not found, repeat steps 1-4 in reverse for given name:
105
+
106
+ ```
107
+ 上原亜沙子 =>
22
108
 
23
- This file can be regenerated by `rake enamdict:refresh`, which downloads, extracts, and compiles the ENAMDICT file.
109
+ 上原亜沙子 => X
110
+  原亜沙子 => X
111
+   亜沙子 => あさこ
112
+    沙子 => さこ
113
+     子 => こ
114
+
115
+ 上原[亜沙子] => 上原 亜沙子
116
+ うえはら[あさこ] => うえはら あさこ
117
+ ```
118
+
119
+ * Step 6: If match still not found, return `nil`
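
The six steps above can be sketched in a few lines of Ruby. This is a minimal illustration under stated assumptions, not the gem's implementation: `TOY_DICT` is a made-up stand-in for the ENAMDICT data, `leading_substrings`, `trailing_substrings`, and `split_name` are hypothetical names, and the hiragana/katakana normalization that the real parser applies is left out.

```ruby
# Toy dictionary standing in for ENAMDICT rows: [kanji, kana, flags].
TOY_DICT = [
  ['上原', 'かみはら', 's'],
  ['上原', 'うえはら', 's'],
  ['上',   'うえ',     's'],
  ['亜沙子', 'あさこ', 'f']
].freeze

# Step 1: candidate surname substrings, anchored at the left.
def leading_substrings(kanji)
  kanji.size.downto(1).map { |n| kanji[0, n] }
end

# Step 5: candidate given-name substrings, anchored at the right.
def trailing_substrings(kanji)
  (0...kanji.size).map { |i| kanji[i..-1] }
end

# Steps 2-4: look up candidates, try the longest first, split on first hit.
def split_fam(kanji, kana, dict = TOY_DICT)
  candidates = leading_substrings(kanji)
  hit = dict.select { |k, _, _| candidates.include?(k) }
            .sort_by { |k, _, _| -k.size }
            .find { |_, reading, _| kana.start_with?(reading) }
  return nil unless hit
  [[hit[0], kanji[hit[0].size..-1]], [hit[1], kana[hit[1].size..-1]]]
end

# Step 5: the same idea in reverse, matching the end of the kana string.
def split_giv(kanji, kana, dict = TOY_DICT)
  candidates = trailing_substrings(kanji)
  hit = dict.select { |k, _, _| candidates.include?(k) }
            .sort_by { |k, _, _| -k.size }
            .find { |_, reading, _| kana.end_with?(reading) }
  return nil unless hit
  [[kanji[0, kanji.size - hit[0].size], hit[0]],
   [kana[0, kana.size - hit[1].size], hit[1]]]
end

# Step 6: nil when neither direction finds a match.
def split_name(kanji, kana, dict = TOY_DICT)
  split_fam(kanji, kana, dict) || split_giv(kanji, kana, dict)
end

p split_name('上原亜沙子', 'うえはらあさこ')
```

Sorting candidates by descending kanji length is what makes 上原 win over 上 when both readings fit the start of the kana string.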
120
+
121
+
122
+ ## Rake Tasks
123
+
124
+ The following tasks are used for development purposes of this gem only. They will not be accessible
125
+ in projects which use this gem.
126
+
127
+ * `rake enamdict:refresh`: Runs `enamdict:download` and `enamdict:minify` (see below)
128
+
129
+ * `rake enamdict:download`: Downloads and extracts the ENAMDICT file to `/tmp/enamdict`
130
+
131
+ * `rake enamdict:minify`: Compiles `/bin/enamdict.min` file from `/tmp/enamdict`. Performs several processing steps including:
132
+ * Converts to UTF-8
133
+ * Compacts format (pipe-delimited)
134
+ * Removes non-human name entries
135
+ * Removes romaji strings (redundant with kana)
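
To make the compaction steps more concrete, here is a rough sketch of what converting one raw line might look like. It assumes the raw file is EUC-JP encoded and that entries follow the usual EDICT-style layout `KANJI [KANA] /(flags) romaji/`. Both points are assumptions of this example, and `to_utf8` and `minify_line` are hypothetical helpers, not the gem's rake task code. Filtering out non-human entries is omitted, because the diff does not say which flags are dropped.

```ruby
# Assumed raw layout: "楢島 [ならしま] /(s) Narashima/"
RAW_ENTRY = %r{\A(\S+) \[(\S+)\] /\(([a-z,]+)\)}

# Convert raw bytes (assumed EUC-JP) to a UTF-8 String.
def to_utf8(bytes)
  bytes.dup.force_encoding('EUC-JP').encode('UTF-8')
end

# Compact one UTF-8 raw line to "kanji|kana|flags", dropping the romaji.
# Returns nil for lines that do not fit the assumed layout.
def minify_line(raw)
  m = RAW_ENTRY.match(raw)
  return nil unless m
  m.captures.join('|')
end

puts minify_line('楢島 [ならしま] /(s) Narashima/')
```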
24
136
 
25
137
 
26
138
  ## TODO
@@ -38,6 +150,12 @@ implementation of the dictionary would be nice.
38
150
  Fork -> Commit -> Spec -> Push -> Pull Request
39
151
 
40
152
 
153
+ ## Similar Projects
154
+
155
+ * Marco Bresciani's [wwwjdic](https://rubygems.org/gems/wwwjdic) gem, which is **NOT** used by this lib
156
+ * [@jeresig](https://github.com/jeresig)'s [node-enamdict](https://github.com/jeresig/node-enamdict), an ENAMDICT reader for Node.js
157
+
158
+
41
159
  ## Authors
42
160
 
43
161
  * [@johnnyshields](https://github.com/johnnyshields)
@@ -18,7 +18,8 @@ module JapaneseNames
18
18
 
19
19
  class << self
20
20
 
21
- # Public: Matches kanji and/or kana regex strings in the dictionary.
21
+ # Public: Finds kanji and/or kana regex strings in the dictionary via
22
+ # a structured query interface.
22
23
  #
23
24
  # opts - The Hash options used to match the dictionary (default: {}):
24
25
  # kanji: Regex to match kanji name (optional)
@@ -26,7 +27,7 @@ module JapaneseNames
26
27
  # flags: Flag or Array of flags to filter the match (optional)
27
28
  #
28
29
  # Returns the dict entries as an Array of Arrays [[kanji, kana, flags], ...]
29
- def match(opts={})
30
+ def find(opts={})
30
31
  return [] unless opts[:kanji] || opts[:kana]
31
32
 
32
33
  kanji = name_regex opts.delete(:kanji)
@@ -34,14 +35,14 @@ module JapaneseNames
34
35
  flags = flags_regex opts.delete(:flags)
35
36
  regex = /^#{kanji}\|#{kana}\|#{flags}$/
36
37
 
37
- search{|line| line[regex]}
38
+ match{|line| line[regex]}
38
39
  end
39
40
 
40
- # Public: Selects entries in the enamdict based on a block which should
41
+ # Public: Matches entries in the enamdict based on a block which should
41
42
  # evaluate true or false (typically a regex).
42
43
  #
43
44
  # Returns the dict entries as an Array of Arrays [[kanji, kana, flags], ...]
44
- def search(&block)
45
+ def match(&block)
45
46
  sel = []
46
47
  each_line do |line|
47
48
  if block.call(line)
@@ -20,7 +20,7 @@ module JapaneseNames
20
20
  def split_giv(kanji, kana)
21
21
  return nil unless kanji && kana
22
22
  kanji, kana = kanji.strip, kana.strip
23
- dict = Enamdict.match(kanji: window_right(kanji))
23
+ dict = Enamdict.find(kanji: window_right(kanji))
24
24
  dict.sort!{|x,y| y[0].size <=> x[0].size}
25
25
  kana_match = nil
26
26
  if match = dict.detect{|m| kana_match = kana[/#{hk m[1]}$/]}
@@ -31,7 +31,7 @@ module JapaneseNames
31
31
  def split_fam(kanji, kana)
32
32
  return nil unless kanji && kana
33
33
  kanji, kana = kanji.strip, kana.strip
34
- dict = Enamdict.match(kanji: window_left(kanji))
34
+ dict = Enamdict.find(kanji: window_left(kanji))
35
35
  dict.sort!{|x,y| y[0].size <=> x[0].size}
36
36
  kana_match = nil
37
37
  if match = dict.detect{|m| kana_match = kana[/^#{hk m[1]}/]}
@@ -2,5 +2,5 @@
2
2
  # encoding: utf-8
3
3
 
4
4
  module JapaneseNames
5
- VERSION = '0.0.2'
5
+ VERSION = '0.0.3'
6
6
  end
@@ -7,15 +7,15 @@ describe JapaneseNames::Enamdict do
7
7
 
8
8
  subject { JapaneseNames::Enamdict }
9
9
 
10
- describe '#search' do
10
+ describe '#match' do
11
11
 
12
12
  it 'should select only lines which match criteria' do
13
- result = subject.search{|line| line =~ /^.+?\|あわのはら\|.+?$/}
13
+ result = subject.match{|line| line =~ /^.+?\|あわのはら\|.+?$/}
14
14
  result.should eq [["粟野原", "あわのはら", "s"]]
15
15
  end
16
16
 
17
17
  it 'should select multiple lines' do
18
- result = subject.search{|line| line =~ /^.+?\|はしの\|.+?$/}
18
+ result = subject.match{|line| line =~ /^.+?\|はしの\|.+?$/}
19
19
  result.should eq [["橋之", "はしの", "p"],
20
20
  ["橋埜", "はしの", "s"],
21
21
  ["橋野", "はしの", "s"],
@@ -24,15 +24,15 @@ describe JapaneseNames::Enamdict do
24
24
  end
25
25
  end
26
26
 
27
- describe '#lookup' do
27
+ describe '#find' do
28
28
 
29
29
  it 'should match kanji only' do
30
- result = subject.match(kanji: '外世子')
30
+ result = subject.find(kanji: '外世子')
31
31
  result.should eq [["外世子", "とよこ", "f"]]
32
32
  end
33
33
 
34
34
  it 'should match kana only' do
35
- result = subject.match(kana: 'ならしま')
35
+ result = subject.find(kana: 'ならしま')
36
36
  result.should eq [["樽島", "ならしま", "u"],
37
37
  ["奈良島", "ならしま", "s"],
38
38
  ["楢島", "ならしま", "s"],
@@ -40,19 +40,19 @@ describe JapaneseNames::Enamdict do
40
40
  end
41
41
 
42
42
  it 'should match both kanji and kana only' do
43
- result = subject.match(kanji: '楢二郎', kana: 'ならじろう')
43
+ result = subject.find(kanji: '楢二郎', kana: 'ならじろう')
44
44
  result.should eq [["楢二郎", "ならじろう", "m"]]
45
45
  end
46
46
 
47
47
  it 'should match flags as String' do
48
- result = subject.match(kana: 'ならしま', flags: 's')
48
+ result = subject.find(kana: 'ならしま', flags: 's')
49
49
  result.should eq [["奈良島", "ならしま", "s"],
50
50
  ["楢島", "ならしま", "s"],
51
51
  ["楢嶋", "ならしま", "s"]]
52
52
  end
53
53
 
54
54
  it 'should match flags as Array' do
55
- result = subject.match(kana: 'ならしま', flags: ['u','g'])
55
+ result = subject.find(kana: 'ならしま', flags: ['u','g'])
56
56
  result.should eq [["樽島", "ならしま", "u"]]
57
57
  end
58
58
  end
metadata CHANGED
@@ -1,14 +1,14 @@
1
1
  --- !ruby/object:Gem::Specification
2
2
  name: japanese_names
3
3
  version: !ruby/object:Gem::Version
4
- version: 0.0.2
4
+ version: 0.0.3
5
5
  platform: ruby
6
6
  authors:
7
7
  - Johnny Shields
8
8
  autorequire:
9
9
  bindir: bin
10
10
  cert_chain: []
11
- date: 2014-09-07 00:00:00.000000000 Z
11
+ date: 2014-09-08 00:00:00.000000000 Z
12
12
  dependencies:
13
13
  - !ruby/object:Gem::Dependency
14
14
  name: moji