japanese_names 0.0.2 → 0.0.3

checksums.yaml CHANGED
@@ -1,15 +1,15 @@
1
1
  ---
2
2
  !binary "U0hBMQ==":
3
3
  metadata.gz: !binary |-
4
- Y2RiZjRlZTEzMDQ4NjJhZmQyMTg4YmI5ZjE0M2IyNGU5MGRlNGU1YQ==
4
+ YTZhOWUxMzRlNTE5ZmVmZTJkMWJmMzlhZTYzMzBjNzAxZjEzMzQ2MA==
5
5
  data.tar.gz: !binary |-
6
- OWVlN2EzY2NmYWNiMzA1ZThjZGQxYjUxY2MwZTE2YzRhZGZiNDU5ZA==
6
+ MDdhYjQ3NTA3NTgxZjMyY2Q4ZmQxZTk2MzgwMmRjYzMwMzAxNmQ4Mg==
7
7
  SHA512:
8
8
  metadata.gz: !binary |-
9
- YmViZDBkNzFhNjU5OGE0NGNlZDAzMzUzMTAwZjBmOWI3OWE3ZjdhZjYwYTI4
10
- MDZlODJkNTA2ZjQ5ZTU2M2E2YjEwNzY1Mjk2MDVlOWMyNWU4NzA4ZTlkNmFl
11
- MTlkZjdhNjhhY2NiZDQ3MTU3MjQ3MTMzMmVjMTE2YTg5NmM4YzM=
9
+ ZGQxNjI5ODEyYjUwZTYzY2JlZDZmOWQwYzI4ZjRkNzc3ZTljMTc0YzdlNWNj
10
+ NjFiOTQ3MGJkYjg1Y2NiNTI5YWI1NTJmNWIwNWNkYTkyODU3ODU3Yjc4MDY3
11
+ N2Q2NmJlMGUzZWY2N2MzZjA1N2VjYTUwYTc2MDVjMTE3YWM0YWE=
12
12
  data.tar.gz: !binary |-
13
- ZTVmZWI4ZGUxNDQ5ZDA5ZTZhYjRkZjcxMzU3ZDQxNmQ2YTYwMmQyZTk3Njgz
14
- NzUxYzdlOThkNTgwM2I4NGQxODY5OWMxNGM3OGIwODViYmU1ZjZjYmIzZWNi
15
- YjhhOWIzMDM2ZjFhNDhlMjZlM2I3MjFjYjdlNGYxYmU0NzM0NGE=
13
+ YjRiNTMzNzQzMDc1OGQ2NWY3YmExM2VjOWMyOWE4ODBiZTFmNGQzMmI3MTJi
14
+ NzU0NmJiNzdhY2YwMjY5OGRiYWUzYTM4NmM2MDUwMjA3NTE0MTljY2Y2ODgx
15
+ ZGM2MTc1ZmVhN2ZjMTk0ZGFmZGVhOTljNTY1ZWQ1NTZiZDhlODE=
data/README.md CHANGED
@@ -1,10 +1,58 @@
1
1
  # JapaneseNames
2
2
 
3
- Japanese name parser based on ENAMDIC
3
+ JapaneseNames provides an interface to the [ENAMDIC file](http://www.csse.monash.edu.au/~jwb/enamdict_doc.html).
4
4
 
5
- ## Overview
6
5
 
7
- JapaneseNames provides an interface to the [ENAMDIC file](http://www.csse.monash.edu.au/~jwb/enamdict_doc.html).
6
+ ## JapaneseNames::Enamdict
7
+
8
+ This library comes packaged with a compacted version of the [ENAMDIC file](http://www.csse.monash.edu.au/~jwb/enamdict_doc.html)
9
+ at `bin/enamdict.min`. Refer to *Rake Tasks* below for how this file is constructed.
10
+
11
+ `JapaneseNames::Enamdict` is a module; all of its methods are module-level methods, called directly on `JapaneseNames::Enamdict`.
12
+
13
+
14
+ ### Enamdict.find
15
+
16
+ Provides a structured query interface to access ENAMDICT data.
17
+
18
+ ```ruby
19
+ JapaneseNames::Enamdict.find(kanji: '外世子') #=> [["外世子", "とよこ", "f"]]
20
+
21
+ JapaneseNames::Enamdict.find(kana: 'ならしま', flags: 's') #=> [["奈良島", "ならしま", "s"],
22
+ ["楢島", "ならしま", "s"],
23
+ ["楢嶋", "ならしま", "s"]]
24
+
25
+ JapaneseNames::Enamdict.find(kanji: '楢二郎', kana: 'ならじろう') #=> [["楢二郎", "ならじろう", "m"]]
26
+ ```
27
+
28
+ where options are:
29
+
30
+ * `kanji`: The kanji name string to match. Regex syntax supported. Either `:kanji` or `:kana` must be specified.
31
+ * `kana`: The kana name string to match. Regex syntax supported.
32
+ * `flags`: The flag char or array of flag chars to match. Refer to [ENAMDIC documentation](http://www.csse.monash.edu.au/~jwb/enamdict_doc.html).
33
+ Additionally, the constants `JapaneseNames::Enamdict::NAME_FAM` and `JapaneseNames::Enamdict::NAME_GIV` may be used.
34
+
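A minimal sketch of how such a structured query could be turned into a single regex over pipe-delimited entries, in the spirit of the `/^#{kanji}\|#{kana}\|#{flags}$/` pattern visible in the library source. The `ENTRIES` list, `field_pattern`, `flags_pattern`, and `find_sketch` are hypothetical names for illustration only, and the flag semantics (an entry matches when it carries at least one requested flag) are an assumption, not confirmed behavior of the gem.

```ruby
# Hypothetical in-memory stand-in for a few lines of enamdict.min.
ENTRIES = [
  '樽島|ならしま|u',
  '奈良島|ならしま|s',
  '楢島|ならしま|s',
  '楢嶋|ならしま|s',
  '楢二郎|ならじろう|m'
].freeze

# Hypothetical helper: an omitted field matches any non-empty value.
def field_pattern(value)
  value ? "(?:#{value})" : '[^|]+'
end

# Hypothetical helper (assumed semantics): match entries that carry
# at least one of the requested flags in their comma-separated list.
def flags_pattern(flags)
  return '[^|]+' if flags.nil?
  alternatives = Array(flags).map { |f| Regexp.escape(f) }.join('|')
  "(?:[^|,]+,)*(?:#{alternatives})(?:,[^|,]+)*"
end

# Builds one regex from the options and returns [kanji, kana, flags] triples.
def find_sketch(opts = {})
  return [] unless opts[:kanji] || opts[:kana]
  kanji = field_pattern(opts[:kanji])
  kana  = field_pattern(opts[:kana])
  flags = flags_pattern(opts[:flags])
  regex = /\A#{kanji}\|#{kana}\|#{flags}\z/
  ENTRIES.select { |line| line =~ regex }.map { |line| line.split('|') }
end

find_sketch(kana: 'ならしま', flags: 's')
find_sketch(kana: 'ならしま', flags: ['u', 'g'])
```

Compiling the whole query into one pattern lets every dictionary line be tested in a single pass, which matters when the backing file is scanned line by line.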
35
+ Note that romaji data has been removed from our `enamdict.min` file in the compression step. We recommend using a gem such as `mojinizer` to convert romaji to kana before running a query.
36
+
37
+
38
+ ### Enamdict.match
39
+
40
+ Provides a raw interface to match ENAMDICT entries via a block, which would typically contain a `Regexp` expression:
41
+
42
+ ```ruby
43
+ JapaneseNames::Enamdict.match{|entry| entry =~ /^堺\|/} #=> [["堺", "さかい", "p,s"], ["堺", "さかえ", "p"]]
44
+ ```
45
+
46
+ where each dictionary entry is in the format below (different from raw ENAMDICT file):
47
+
48
+ ```
49
+ kanji|kana|flag1(,flag2,...)
50
+ ```
51
+
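Because entries are pipe-delimited, a literal `|` inside a `Regexp` has to be escaped; left unescaped it is the alternation operator, and an empty alternative matches every line. The sketch below illustrates this on a few hypothetical sample entries and shows one way to split an entry into its fields. `parse_entry` and the sample `lines` are illustrative names, not part of the gem.

```ruby
# Hypothetical sample entries in the compacted kanji|kana|flags format.
lines = ['堺|さかい|p,s', '堺|さかえ|p', '楢島|ならしま|s']

# Split one entry into its three fields; the flags become an Array.
def parse_entry(line)
  kanji, kana, flags = line.chomp.split('|', 3)
  [kanji, kana, flags.to_s.split(',')]
end

# Unescaped pipe: "^堺" OR the empty pattern, so every line matches.
unescaped = lines.select { |l| l =~ /^堺|/ }

# Escaped pipe: the kanji field must be exactly 堺.
escaped = lines.select { |l| l =~ /^堺\|/ }

parsed = escaped.map { |l| parse_entry(l) }
```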
52
+
53
+ ## JapaneseNames::Parser
54
+
55
+ ### Parser#split
8
56
 
9
57
  Currently the main method is `split` which, given a kanji and kana representation of a name, splits it
10
58
  into family/given names.
@@ -14,13 +62,77 @@ into family/given names.
14
62
  parser.split('堺雅美', 'さかいマサミ') #=> [['堺', '雅美'], ['さかい', 'マサミ']]
15
63
  ```
16
64
 
65
+ The logic is as follows:
17
66
 
18
- ## ENAMDICT
67
+ * Step 1: Split kanji name into possible surname sub-strings
19
68
 
20
- This library comes packaged with a compacted version of the [ENAMDIC file](http://www.csse.monash.edu.au/~jwb/enamdict_doc.html)
21
- at `bin/enamdict.min`.
69
+ ```
70
+ 上原亜沙子 =>
71
+
72
+ 上原亜沙子
73
+ 上原亜沙
74
+ 上原亜
75
+ 上原
76
+
77
+ ```
78
+
79
+ * Step 2: Lookup possible kana matches in dictionary (done in a single pass)
80
+
81
+ ```
82
+ 上原亜沙子 => X
83
+ 上原亜沙  => X
84
+ 上原亜   => X
85
+ 上原    => かみはら かみばら うえはら うえばら...
86
+ 上     => かみ うえ ...
87
+ ```
88
+
89
+ * Step 3: Compare kana lookups versus kana name and detect first match (starting from longest candidate string)
90
+
91
+ ```
92
+ うえはらあさこ contains かみはら ? => X
93
+ うえはらあさこ contains かみばら ? => X
94
+ うえはらあさこ contains うえはら ? => YES! [うえはら]あさこ
95
+ ```
96
+
97
+ * Step 4: If match found, split names accordingly
98
+
99
+ ```
100
+ [上原]亜沙子 => 上原 亜沙子
101
+ [うえはら]あさこ => うえはら あさこ
102
+ ```
103
+
104
+ * Step 5: If match not found, repeat steps 1-4 in reverse for given name:
105
+
106
+ ```
107
+ 上原亜沙子 =>
22
108
 
23
- This file can be regenerated by `rake enamdict:refresh`, which downloads, extracts, and compiles the ENAMDICT file.
109
+ 上原亜沙子 => X
110
+  原亜沙子 => X
111
+   亜沙子 => あさこ
112
+    沙子 => さこ
113
+     子 => こ
114
+
115
+ 上原[亜沙子] => 上原 亜沙子
116
+ うえはら[あさこ] => うえはら あさこ
117
+ ```
118
+
119
+ * Step 6: If match still not found, return `nil`
120
+
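Steps 1 to 4 above can be sketched as follows. This is an illustration under stated assumptions, not the gem's implementation: `SURNAMES` is a hypothetical in-memory dictionary standing in for the ENAMDICT lookup, `surname_candidates` and `split_by_surname` are made-up names, the full-length candidate is skipped so the given name is never empty, and hiragana/katakana normalization is omitted.

```ruby
# Hypothetical dictionary: kanji surname => possible kana readings.
SURNAMES = {
  '上原' => %w[かみはら かみばら うえはら うえばら],
  '上'   => %w[かみ うえ]
}.freeze

# Step 1: left-anchored substrings, longest first. The full string is
# skipped (an assumption) so at least one character remains for the given name.
def surname_candidates(kanji)
  (kanji.size - 1).downto(1).map { |len| kanji[0, len] }
end

# Steps 2-4: look up readings for each candidate, take the first reading
# that prefixes the kana name, and split both strings at that boundary.
# Returns [[family_kanji, given_kanji], [family_kana, given_kana]] or nil.
def split_by_surname(kanji, kana, dict = SURNAMES)
  surname_candidates(kanji).each do |cand|
    (dict[cand] || []).each do |reading|
      next unless kana.start_with?(reading) && kana.size > reading.size
      return [[cand, kanji[cand.size..-1]], [reading, kana[reading.size..-1]]]
    end
  end
  nil # corresponds to falling through to step 5 (or step 6 if that fails too)
end

split_by_surname('上原亜沙子', 'うえはらあさこ')
```

Trying the longest candidate first avoids splitting `上原` after `上` when the longer surname also has a matching reading.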
121
+
122
+ ## Rake Tasks
123
+
124
+ The following tasks are used for development purposes of this gem only. They will not be accessible
125
+ in projects which use this gem.
126
+
127
+ * `rake enamdict:refresh`: Runs `enamdict:download` and `enamdict:minify` (see below)
128
+
129
+ * `rake enamdict:download`: Downloads and extracts the ENAMDICT file to `/tmp/enamdict`
130
+
131
+ * `rake enamdict:minify`: Compiles `/bin/enamdict.min` file from `/tmp/enamdict`. Performs several processing steps including:
132
+ * Converts to UTF-8
133
+ * Compacts format (pipe-delimited)
134
+ * Removes non-human name entries
135
+ * Removes romaji strings (redundant with kana)
24
136
 
25
137
 
26
138
  ## TODO
@@ -38,6 +150,12 @@ implementation of the dictionary would be nice.
38
150
  Fork -> Commit -> Spec -> Push -> Pull Request
39
151
 
40
152
 
153
+ ## Similar Projects
154
+
155
+ * Marco Bresciani's [wwwjdic](https://rubygems.org/gems/wwwjdic) gem, which is **NOT** used by this lib
156
+ * [@jeresig](https://github.com/jeresig)'s [node-enamdict](https://github.com/jeresig/node-enamdict), an ENAMDICT reader for Node.js
157
+
158
+
41
159
  ## Authors
42
160
 
43
161
  * [@johnnyshields](https://github.com/johnnyshields)
@@ -18,7 +18,8 @@ module JapaneseNames
18
18
 
19
19
  class << self
20
20
 
21
- # Public: Matches kanji and/or kana regex strings in the dictionary.
21
+ # Public: Finds kanji and/or kana regex strings in the dictionary via
22
+ # a structured query interface.
22
23
  #
23
24
  # opts - The Hash options used to match the dictionary (default: {}):
24
25
  # kanji: Regex to match kanji name (optional)
@@ -26,7 +27,7 @@ module JapaneseNames
26
27
  # flags: Flag or Array of flags to filter the match (optional)
27
28
  #
28
29
  # Returns the dict entries as an Array of Arrays [[kanji, kana, flags], ...]
29
- def match(opts={})
30
+ def find(opts={})
30
31
  return [] unless opts[:kanji] || opts[:kana]
31
32
 
32
33
  kanji = name_regex opts.delete(:kanji)
@@ -34,14 +35,14 @@ module JapaneseNames
34
35
  flags = flags_regex opts.delete(:flags)
35
36
  regex = /^#{kanji}\|#{kana}\|#{flags}$/
36
37
 
37
- search{|line| line[regex]}
38
+ match{|line| line[regex]}
38
39
  end
39
40
 
40
- # Public: Selects entries in the enamdict based on a block which should
41
+ # Public: Matches entries in the enamdict based on a block which should
41
42
  # evaluate true or false (typically a regex).
42
43
  #
43
44
  # Returns the dict entries as an Array of Arrays [[kanji, kana, flags], ...]
44
- def search(&block)
45
+ def match(&block)
45
46
  sel = []
46
47
  each_line do |line|
47
48
  if block.call(line)
@@ -20,7 +20,7 @@ module JapaneseNames
20
20
  def split_giv(kanji, kana)
21
21
  return nil unless kanji && kana
22
22
  kanji, kana = kanji.strip, kana.strip
23
- dict = Enamdict.match(kanji: window_right(kanji))
23
+ dict = Enamdict.find(kanji: window_right(kanji))
24
24
  dict.sort!{|x,y| y[0].size <=> x[0].size}
25
25
  kana_match = nil
26
26
  if match = dict.detect{|m| kana_match = kana[/#{hk m[1]}$/]}
@@ -31,7 +31,7 @@ module JapaneseNames
31
31
  def split_fam(kanji, kana)
32
32
  return nil unless kanji && kana
33
33
  kanji, kana = kanji.strip, kana.strip
34
- dict = Enamdict.match(kanji: window_left(kanji))
34
+ dict = Enamdict.find(kanji: window_left(kanji))
35
35
  dict.sort!{|x,y| y[0].size <=> x[0].size}
36
36
  kana_match = nil
37
37
  if match = dict.detect{|m| kana_match = kana[/^#{hk m[1]}/]}
@@ -2,5 +2,5 @@
2
2
  # encoding: utf-8
3
3
 
4
4
  module JapaneseNames
5
- VERSION = '0.0.2'
5
+ VERSION = '0.0.3'
6
6
  end
@@ -7,15 +7,15 @@ describe JapaneseNames::Enamdict do
7
7
 
8
8
  subject { JapaneseNames::Enamdict }
9
9
 
10
- describe '#search' do
10
+ describe '#match' do
11
11
 
12
12
  it 'should select only lines which match criteria' do
13
- result = subject.search{|line| line =~ /^.+?\|あわのはら\|.+?$/}
13
+ result = subject.match{|line| line =~ /^.+?\|あわのはら\|.+?$/}
14
14
  result.should eq [["粟野原", "あわのはら", "s"]]
15
15
  end
16
16
 
17
17
  it 'should select multiple lines' do
18
- result = subject.search{|line| line =~ /^.+?\|はしの\|.+?$/}
18
+ result = subject.match{|line| line =~ /^.+?\|はしの\|.+?$/}
19
19
  result.should eq [["橋之", "はしの", "p"],
20
20
  ["橋埜", "はしの", "s"],
21
21
  ["橋野", "はしの", "s"],
@@ -24,15 +24,15 @@ describe JapaneseNames::Enamdict do
24
24
  end
25
25
  end
26
26
 
27
- describe '#lookup' do
27
+ describe '#find' do
28
28
 
29
29
  it 'should match kanji only' do
30
- result = subject.match(kanji: '外世子')
30
+ result = subject.find(kanji: '外世子')
31
31
  result.should eq [["外世子", "とよこ", "f"]]
32
32
  end
33
33
 
34
34
  it 'should match kana only' do
35
- result = subject.match(kana: 'ならしま')
35
+ result = subject.find(kana: 'ならしま')
36
36
  result.should eq [["樽島", "ならしま", "u"],
37
37
  ["奈良島", "ならしま", "s"],
38
38
  ["楢島", "ならしま", "s"],
@@ -40,19 +40,19 @@ describe JapaneseNames::Enamdict do
40
40
  end
41
41
 
42
42
  it 'should match both kanji and kana only' do
43
- result = subject.match(kanji: '楢二郎', kana: 'ならじろう')
43
+ result = subject.find(kanji: '楢二郎', kana: 'ならじろう')
44
44
  result.should eq [["楢二郎", "ならじろう", "m"]]
45
45
  end
46
46
 
47
47
  it 'should match flags as String' do
48
- result = subject.match(kana: 'ならしま', flags: 's')
48
+ result = subject.find(kana: 'ならしま', flags: 's')
49
49
  result.should eq [["奈良島", "ならしま", "s"],
50
50
  ["楢島", "ならしま", "s"],
51
51
  ["楢嶋", "ならしま", "s"]]
52
52
  end
53
53
 
54
54
  it 'should match flags as Array' do
55
- result = subject.match(kana: 'ならしま', flags: ['u','g'])
55
+ result = subject.find(kana: 'ならしま', flags: ['u','g'])
56
56
  result.should eq [["樽島", "ならしま", "u"]]
57
57
  end
58
58
  end
metadata CHANGED
@@ -1,14 +1,14 @@
1
1
  --- !ruby/object:Gem::Specification
2
2
  name: japanese_names
3
3
  version: !ruby/object:Gem::Version
4
- version: 0.0.2
4
+ version: 0.0.3
5
5
  platform: ruby
6
6
  authors:
7
7
  - Johnny Shields
8
8
  autorequire:
9
9
  bindir: bin
10
10
  cert_chain: []
11
- date: 2014-09-07 00:00:00.000000000 Z
11
+ date: 2014-09-08 00:00:00.000000000 Z
12
12
  dependencies:
13
13
  - !ruby/object:Gem::Dependency
14
14
  name: moji