japanese_names 0.0.3 → 0.1.0

checksums.yaml CHANGED
@@ -1,15 +1,7 @@
  ---
- !binary "U0hBMQ==":
-   metadata.gz: !binary |-
-     YTZhOWUxMzRlNTE5ZmVmZTJkMWJmMzlhZTYzMzBjNzAxZjEzMzQ2MA==
-   data.tar.gz: !binary |-
-     MDdhYjQ3NTA3NTgxZjMyY2Q4ZmQxZTk2MzgwMmRjYzMwMzAxNmQ4Mg==
+ SHA1:
+   metadata.gz: 66e24980c69de00af005fe290cb738acadc6a42c
+   data.tar.gz: fe841d057b518bc964f19d79004892565e6d80e0
  SHA512:
-   metadata.gz: !binary |-
-     ZGQxNjI5ODEyYjUwZTYzY2JlZDZmOWQwYzI4ZjRkNzc3ZTljMTc0YzdlNWNj
-     NjFiOTQ3MGJkYjg1Y2NiNTI5YWI1NTJmNWIwNWNkYTkyODU3ODU3Yjc4MDY3
-     N2Q2NmJlMGUzZWY2N2MzZjA1N2VjYTUwYTc2MDVjMTE3YWM0YWE=
-   data.tar.gz: !binary |-
-     YjRiNTMzNzQzMDc1OGQ2NWY3YmExM2VjOWMyOWE4ODBiZTFmNGQzMmI3MTJi
-     NzU0NmJiNzdhY2YwMjY5OGRiYWUzYTM4NmM2MDUwMjA3NTE0MTljY2Y2ODgx
-     ZGM2MTc1ZmVhN2ZjMTk0ZGFmZGVhOTljNTY1ZWQ1NTZiZDhlODE=
+   metadata.gz: e8392c49fe7e1d091889af90fd78e80b4ce41635c25a70247caf317f3697d047f85dfb17357c7b1ed53898974ad73de8e271cb2b56f76c9c4ba8a6fe75b7d304
+   data.tar.gz: 928c0867d2ccdcaaf502d1318d2e2126cb29e5f719c414fe6b40ca530bc2a39985be870b78f3a1ea7f697f8f5931575df25b23390bfb7107f53f22459dd63f8a
data/README.md CHANGED
@@ -1,28 +1,28 @@
  # JapaneseNames

- JapaneseNames provides an interface to the [ENAMDIC file](http://www.csse.monash.edu.au/~jwb/enamdict_doc.html).
+ JapaneseNames provides an interface to the [ENAMDICT file](http://www.csse.monash.edu.au/~jwb/enamdict_doc.html).


- ## JapaneseNames::Enamdict
+ ## ENAMDICT

- This library comes packaged with a compacted version of the [ENAMDIC file](http://www.csse.monash.edu.au/~jwb/enamdict_doc.html)
+ This library comes packaged with a compacted version of the [ENAMDICT file](http://www.csse.monash.edu.au/~jwb/enamdict_doc.html)
  at `bin/enamdict.min`. Refer to *Rake Tasks* below for how this file is constructed.

- `JapaneseNames::Enamdict` is a module; all methods are called on the module `self` class.

-
- ### Enamdict.find
+ ### Finder#find

  Provides a structured query interface to access ENAMDICT data.

  ```ruby
- JapaneseNames::Enamdict.find(kanji: '外世子') #=> [["外世子", "とよこ", "f"]]
+ finder = JapaneseNames::Finder.new
+
+ finder.find(kanji: '外世子') #=> [["外世子", "とよこ", "f"]]

- JapaneseNames::Enamdict.find(kana: 'ならしま', flags: 's') #=> [["奈良島", "ならしま", "s"],
-                                                            ["楢島", "ならしま", "s"],
-                                                            ["楢嶋", "ならしま", "s"]]
+ finder.find(kana: 'ならしま', flags: 's') #=> [["奈良島", "ならしま", "s"],
+                                           #    ["楢島", "ならしま", "s"],
+                                           #    ["楢嶋", "ならしま", "s"]]

- JapaneseNames::Enamdict.find(kanji: '楢二郎', kana: 'ならじろう') #=> [["楢二郎", "ならじろう", "m"]]
+ finder.find(kanji: '楢二郎', kana: 'ならじろう') #=> [["楢二郎", "ならじろう", "m"]]
  ```

  where options are:
@@ -30,37 +30,22 @@ where options are:
  * `kanji`: The kanji name string to match. Regex syntax supported. Either `:kanji` or `:kana` must be specified.
  * `kana`: The kana name string to match. Regex syntax supported.
  * `flags`: The flag char or array of flag chars to match. Refer to [ENAMDICT documentation](http://www.csse.monash.edu.au/~jwb/enamdict_doc.html).
- Additionally constants JapaneseNames::Enamdict::NAME_FAM and JapaneseNames::Enamdict::NAME_GIV may be used.
+ Additionally, the constants `JapaneseNames::Enamdict::NAME_SURNAME_ANY` and `JapaneseNames::Enamdict::NAME_GIVEN_ANY` may be used.

  Note that romaji data has been removed from our `enamdict.min` file in the compression step. We recommend using a gem such as `mojinizer` to convert romaji to kana before doing a query.


- ### Enamdict.match
-
- Provides a raw interface to match ENAMDICT entries via a block, which would typically contain a `Regexp` expression:
+ ## JapaneseNames::Splitter

- ```ruby
- JapaneseNames::Enamdict.match{|entry| entry =~ /^堺|/} #=> [["堺", "さかい", "p,s"], ["堺", "さかえ", "p"]]
- ```
-
- where each dictionary entry is in the format below (different from raw ENAMDICT file):
-
- ```
- kanji|kana|flag1(,flag2,...)
- ```
-
-
- ## JapaneseNames::Parser
-
- ### Parser#split
+ ### Splitter#split

  Currently the main method is `split` which, given a kanji and kana representation of a name, splits
  it into family/given names.

- ```ruby
- parser = JapaneseNames::Parser.new
- parser.split('堺雅美', 'さかいマサミ') #=> [['堺', '雅美'], ['さかい', 'マサミ']]
- ```
+ ```ruby
+ splitter = JapaneseNames::Splitter.new
+ splitter.split('堺雅美', 'さかいマサミ') #=> [['堺', '雅美'], ['さかい', 'マサミ']]
+ ```

  The logic is as follows:

@@ -0,0 +1,55 @@
+ module JapaneseNames
+   module Backend
+     module Memory
+       class Finder
+
+         class << self
+
+           # Public: Finds kanji and/or kana regex strings in the dictionary via
+           # a structured query interface.
+           #
+           # opts - The Hash options used to match the dictionary (default: {}):
+           #        kanji: Regex to match kanji name (optional)
+           #        kana:  Regex to match kana name (optional)
+           #        flags: Flag or Array of flags to filter the match (optional)
+           #
+           # Returns the dict entries as an Array of Arrays [[kanji, kana, flags], ...]
+           def find(opts={})
+             return [] unless opts[:kanji] || opts[:kana]
+             kanji = name_regex opts.delete(:kanji)
+             kana = name_regex opts.delete(:kana)
+             flags = flags_regex opts.delete(:flags)
+             store.select do |row|
+               (!kanji || row[0] =~ kanji) && (!kana || row[1] =~ kana) && (!flags || row[2] =~ flags)
+             end
+           end
+
+           private
+
+           def store
+             ::JapaneseNames::Backend::Memory::Store.store
+           end
+
+           # Internal: Builds regex criteria for name.
+           def name_regex(name)
+             case name
+             when String, Symbol then /\A#{name}\z/
+             when Array then /\A(?:#{name.join('|')})\z/
+             else nil
+             end
+           end
+
+           # Internal: Builds regex criteria for flags.
+           def flags_regex(flags)
+             case flags
+             when ::JapaneseNames::Enamdict::NAME_ANY then nil
+             when String, Symbol then /[#{flags}]/
+             when Array then /[#{flags.join}]/
+             else nil
+             end
+           end
+         end
+       end
+     end
+   end
+ end
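The filtering in this hunk can be tried out in isolation. The following is a minimal sketch, assuming a small hard-coded `ROWS` array in place of the real `enamdict.min` store; `ROWS` and the top-level `find`, `name_regex` and `flags_regex` methods are illustrative stand-ins rather than the gem's API, and the sample rows are borrowed from the README examples.

```ruby
# Toy stand-in for the in-memory store: [kanji, kana, flags] rows.
ROWS = [
  ['奈良島', 'ならしま', 's'],
  ['楢島', 'ならしま', 's'],
  ['楢二郎', 'ならじろう', 'm'],
  ['堺', 'さかい', 'p,s']
].map(&:freeze).freeze

# Anchors the pattern with \A and \z so only whole-field matches succeed.
def name_regex(name)
  case name
  when String, Symbol then /\A#{name}\z/
  when Array          then /\A(?:#{name.join('|')})\z/
  end
end

# A character class matches when any requested flag occurs in the flag field.
def flags_regex(flags)
  case flags
  when String, Symbol then /[#{flags}]/
  when Array          then /[#{flags.join}]/
  end
end

# Returns every row that satisfies all criteria that were supplied.
def find(kanji: nil, kana: nil, flags: nil)
  return [] unless kanji || kana
  kanji_re = name_regex(kanji)
  kana_re  = name_regex(kana)
  flags_re = flags_regex(flags)
  ROWS.select do |row|
    (!kanji_re || row[0] =~ kanji_re) &&
      (!kana_re || row[1] =~ kana_re) &&
      (!flags_re || row[2] =~ flags_re)
  end
end
```

Because each criterion is a separate regex applied to its own column, a missing criterion is simply skipped instead of being replaced by a catch-all pattern such as `.+?`, which is what the removed single-regex implementation did.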
@@ -0,0 +1,25 @@
+ module JapaneseNames
+   module Backend
+     module Memory
+       class Store
+
+         class << self
+
+           # Public: The memoized dictionary instance.
+           def store
+             @store ||= File.open(filepath, 'r:utf-8').map do |line|
+               line.chop.split('|').map(&:freeze).freeze
+             end.freeze
+           end
+
+           private
+
+           # Internal: Returns the filepath to the enamdict.min file.
+           def filepath
+             File.join(JapaneseNames.root, 'bin/enamdict.min')
+           end
+         end
+       end
+     end
+   end
+ end
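The parsing step of the store can be sketched without the dictionary file. This is an illustration only, assuming an in-memory `StringIO` in place of `bin/enamdict.min`; `parse_rows` is a hypothetical helper, not part of the gem. It uses `chomp` instead of `chop`, a deliberate variation, since `chomp` removes only a trailing newline and so leaves a final line without a newline intact.

```ruby
require 'stringio'

# Hypothetical helper: turns "kanji|kana|flags" lines into frozen 3-element rows.
def parse_rows(io)
  io.each_line.map do |line|
    line.chomp.split('|').map(&:freeze).freeze
  end.freeze
end

# Two sample entries; the second deliberately has no trailing newline.
sample = StringIO.new("堺|さかい|p,s\n楢二郎|ならじろう|m")
parse_rows(sample).each { |row| puts row.inspect }
```

Freezing the rows and the outer array means the memoized data cannot be altered by callers, which matters because every query shares the same array.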
@@ -1,105 +1,15 @@
1
- #!/bin/env ruby
2
- # encoding: utf-8
3
-
4
1
  module JapaneseNames
5
2
 
6
- # Query interface for the ENAMDICT file (http://www.csse.monash.edu.au/~jwb/enamdict_doc.html)
3
+ # Enumerated flags for the ENAMDICT file (http://www.csse.monash.edu.au/~jwb/enamdict_doc.html)
7
4
  module Enamdict
8
-
9
- # s - surname (138,500)
10
- # p - place-name (99,500)
11
- # u - person name, either given or surname, as-yet unclassified (139,000)
12
- # g - given name, as-yet not classified by sex (64,600)
13
- # f - female given name (106,300)
14
- # m - male given name (14,500)
15
- NAME_FAM = %w(s p u)
16
- NAME_GIV = %w(u g f m)
17
- NAME_ANY = NAME_FAM | NAME_GIV
18
-
19
- class << self
20
-
21
- # Public: Finds kanji and/or kana regex strings in the dictionary via
22
- # a structured query interface.
23
- #
24
- # opts - The Hash options used to match the dictionary (default: {}):
25
- # kanji: Regex to match kanji name (optional)
26
- # kana: Regex to match kana name (optional)
27
- # flags: Flag or Array of flags to filter the match (optional)
28
- #
29
- # Returns the dict entries as an Array of Arrays [[kanji, kana, flags], ...]
30
- def find(opts={})
31
- return [] unless opts[:kanji] || opts[:kana]
32
-
33
- kanji = name_regex opts.delete(:kanji)
34
- kana = name_regex opts.delete(:kana)
35
- flags = flags_regex opts.delete(:flags)
36
- regex = /^#{kanji}\|#{kana}\|#{flags}$/
37
-
38
- match{|line| line[regex]}
39
- end
40
-
41
- # Public: Matches entries in the enamdict based on a block which should
42
- # evaluate true or false (typically a regex).
43
- #
44
- # Returns the dict entries as an Array of Arrays [[kanji, kana, flags], ...]
45
- def match(&block)
46
- sel = []
47
- each_line do |line|
48
- if block.call(line)
49
- sel << unpack_line(line)
50
- end
51
- end
52
- sel
53
- end
54
-
55
- protected
56
-
57
- # Internal: Returns the filepath to the enamdict.min file.
58
- def filepath
59
- File.join(File.dirname(__FILE__), '../../bin/enamdict.min')
60
- end
61
-
62
- # Internal: The memoized dictionary instance.
63
- def dict
64
- return @dict if @dict
65
- @dict = []
66
- File.open(self.filepath, 'r:utf-8') do |f|
67
- while(line = f.gets) != nil
68
- @dict << line[0..-2] # omit trailing newline char
69
- end
70
- end
71
- @dict.freeze
72
- end
73
-
74
- # Internal: Calls the given block for each line in the dict.
75
- def each_line(&block)
76
- dict.each{|line| block.call(line) }
77
- end
78
-
79
- # Internal: Formats a line as a 3-tuple Array [kanji, kana, flags]
80
- def unpack_line(line)
81
- line.split('|')
82
- end
83
-
84
- # Internal: Builds regex criteria for name.
85
- def name_regex(name)
86
- case name
87
- when String then name
88
- when Array then "(?:#{name.join('|')})"
89
- else '.+?'
90
- end
91
- end
92
-
93
- # Internal: Builds regex criteria for flags.
94
- def flags_regex(flags)
95
- if !flags || flags == NAME_ANY
96
- '.+?'
97
- elsif flags.is_a?(Array)
98
- ".*?[#{flags.join}].*?"
99
- else
100
- flags
101
- end
102
- end
103
- end
5
+ NAME_PLACE = %i(p).freeze # place-name (99,500)
6
+ NAME_PERSON = %i(u).freeze # person name, either given or surname, as-yet unclassified (139,000)
7
+ NAME_SURNAME = %i(s).freeze # surname (138,500)
8
+ NAME_GIVEN_MALE = %i(m).freeze # male given name (14,500)
9
+ NAME_GIVEN_FEMALE = %i(f).freeze # female given name (106,300)
10
+ NAME_GIVEN_OTHER = %i(g).freeze # given name, as-yet not classified by sex (64,600)
11
+ NAME_SURNAME_ANY = (NAME_PLACE | NAME_PERSON | NAME_SURNAME).freeze
12
+ NAME_GIVEN_ANY = (NAME_PERSON | NAME_GIVEN_MALE | NAME_GIVEN_FEMALE | NAME_GIVEN_OTHER).freeze
13
+ NAME_ANY = (NAME_SURNAME_ANY | NAME_GIVEN_ANY).freeze
104
14
  end
105
15
  end
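The composite constants rely on `Array#|`, which is a set union. A small sketch of how the groups combine, wrapped in a throwaway `FlagSketch` module (a hypothetical name, used here so the example does not depend on the gem being loaded); `flags_regex` is likewise an illustrative helper:

```ruby
# Sketch only: mirrors the flag groups above inside a throwaway module.
module FlagSketch
  NAME_PLACE        = %i(p).freeze
  NAME_PERSON       = %i(u).freeze
  NAME_SURNAME      = %i(s).freeze
  NAME_GIVEN_MALE   = %i(m).freeze
  NAME_GIVEN_FEMALE = %i(f).freeze
  NAME_GIVEN_OTHER  = %i(g).freeze

  # Array#| keeps first-seen order and drops duplicates, so :u appears
  # only once in NAME_ANY even though both groups include it.
  NAME_SURNAME_ANY = (NAME_PLACE | NAME_PERSON | NAME_SURNAME).freeze
  NAME_GIVEN_ANY   = (NAME_PERSON | NAME_GIVEN_MALE | NAME_GIVEN_FEMALE | NAME_GIVEN_OTHER).freeze
  NAME_ANY         = (NAME_SURNAME_ANY | NAME_GIVEN_ANY).freeze

  # Builds a character class from one flag or a list of flags.
  def self.flags_regex(flags)
    /[#{Array(flags).join}]/
  end
end

p FlagSketch::NAME_SURNAME_ANY
p FlagSketch::NAME_ANY
```

Keeping each flag in its own single-element array lets every group be written as a union of named parts, at the cost of callers having to pass arrays rather than bare characters.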
@@ -0,0 +1,21 @@
+ module JapaneseNames
+
+   # Query interface for ENAMDICT
+   class Finder
+
+     # Hash opts
+     # - kanji: String kanji to match
+     # - kana: String kana to match
+     # - flags: Array<Symbol> ENAMDICT flags to match
+     def find(opts={})
+       backend.find(opts)
+     end
+
+     private
+
+     # Internal: Returns the backend class that performs the lookup.
+     def backend
+       ::JapaneseNames::Backend::Memory::Finder
+     end
+   end
+ end
@@ -0,0 +1,55 @@
+ module JapaneseNames
+
+   # Provides methods to split a full Japanese name string into surname and given name.
+   class Splitter
+
+     # Given a kanji and kana representation of a name, splits it into family/given names.
+     #
+     # The choice to prioritize family name is arbitrary. Further analysis is needed
+     # for whether given or family name should be prioritized.
+     #
+     # Returns Array [[kanji_fam, kanji_giv], [kana_fam, kana_giv]] if there was a match.
+     # Returns nil if there was no match.
+     def split(kanji, kana)
+       split_surname(kanji, kana) || split_given(kanji, kana)
+     end
+
+     def split_giv(kanji, kana)
+       return nil unless kanji && kana
+       kanji, kana = kanji.strip, kana.strip
+       dict = finder.find(kanji: Util::Ngram.ngram_right(kanji))
+       dict.sort!{|x,y| y[0].size <=> x[0].size}
+       kana_match = nil
+       if match = dict.detect{|m| kana_match = kana[/#{hk m[1]}\z/]}
+         return [[Util::Ngram.mask_right(kanji, match[0]), match[0]],[Util::Ngram.mask_right(kana, kana_match), kana_match]]
+       end
+     end
+     alias :split_given :split_giv
+
+     def split_sur(kanji, kana)
+       return nil unless kanji && kana
+       kanji, kana = kanji.strip, kana.strip
+       dict = finder.find(kanji: Util::Ngram.ngram_left(kanji))
+       dict.sort!{|x,y| y[0].size <=> x[0].size}
+       kana_match = nil
+       if match = dict.detect{|m| kana_match = kana[/\A#{hk m[1]}/]}
+         return [[match[0], Util::Ngram.mask_left(kanji, match[0])],[kana_match, Util::Ngram.mask_left(kana, kana_match)]]
+       end
+     end
+     alias :split_surname :split_sur
+
+     # TODO: add option to strip honorific '様'
+     # TODO: add option to infer sex (0 = unknown, 1 = male, 2 = female as per ISO/IEC 5218)
+
+     private
+
+     # Returns a regex string which matches both hiragana and katakana variations of a String.
+     def hk(str)
+       "(?:#{Moji.kata_to_hira(str)}|#{Moji.hira_to_kata(str)})"
+     end
+
+     def finder
+       @finder ||= Finder.new
+     end
+   end
+ end
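The surname-first strategy can be illustrated with a toy dictionary. This is a simplified sketch under stated assumptions, not the gem's implementation: `DICT`, `prefixes`, `to_hiragana` and the top-level `split_surname` are hypothetical, the dictionary holds three made-up entries instead of ENAMDICT, and `String#tr` stands in for the `Moji` conversion by normalizing katakana to hiragana before comparing.

```ruby
# Toy dictionary of [kanji, kana] pairs; the real gem queries ENAMDICT instead.
DICT = [%w(上原 うえはら), %w(上 うえ), %w(望 のぞみ)].freeze

# Progressively shorter prefixes, longest first: "abc" -> ["abc", "ab", "a"].
def prefixes(str)
  str.size.downto(1).map { |len| str[0, len] }
end

# Normalizes katakana to hiragana so that either script matches.
def to_hiragana(str)
  str.tr('ァ-ン', 'ぁ-ん')
end

# Returns [[kanji_sur, kanji_giv], [kana_sur, kana_giv]], or nil if nothing matches.
def split_surname(kanji, kana)
  prefixes(kanji).each do |candidate|
    DICT.each do |dict_kanji, dict_kana|
      next unless dict_kanji == candidate
      next unless to_hiragana(kana).start_with?(dict_kana)
      n = dict_kana.size
      return [[candidate, kanji[candidate.size..-1]], [kana[0, n], kana[n..-1]]]
    end
  end
  nil
end

p split_surname('上原望', 'ウエハラノゾミ')
```

Trying the longest kanji prefix first plays the same role as the `sort!` by kanji length in the hunk above: `上原` is preferred over the shorter entry `上` when both readings fit the kana.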
@@ -0,0 +1,46 @@
+ module JapaneseNames
+   module Util
+
+     # Provides methods for parsing Japanese name strings.
+     class Ngram
+
+       class << self
+
+         # Given a String, returns an ordered array of all possible substrings.
+         #
+         # Example: ngram("abcd") #=> ["abcd", "abc", "bcd", "ab", "bc", "cd", "a", "b", "c", "d"]
+         def ngram(str)
+           (0...str.size).to_a.reverse.map{|i| (0...(str.size-i)).map{|j| str[j..(i+j)]}}.flatten.uniq
+         end
+
+         # Given a String, returns an array of progressively smaller substrings anchored on the left side.
+         #
+         # Example: ngram_left("abcd") #=> ["abcd", "abc", "ab", "a"]
+         def ngram_left(str)
+           (0...str.size).to_a.reverse.map{|i| str[0..i]}
+         end
+
+         # Given a String, returns an array of progressively smaller substrings anchored on the right side.
+         #
+         # Example: ngram_right("abcd") #=> ["abcd", "bcd", "cd", "d"]
+         def ngram_right(str)
+           (0...str.size).map{|i| str[i..-1]}
+         end
+
+         # Masks a String from the left side and returns the remaining (right) portion of the String.
+         #
+         # Example: mask_left("abcde", "ab") #=> "cde"
+         def mask_left(str, mask)
+           str.gsub(/^#{mask}/, '')
+         end
+
+         # Masks a String from the right side and returns the remaining (left) portion of the String.
+         #
+         # Example: mask_right("abcde", "de") #=> "abc"
+         def mask_right(str, mask)
+           str.gsub(/#{mask}$/, '')
+         end
+       end
+     end
+   end
+ end
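The masking helpers interpolate `mask` into a regex as-is and anchor with `^` and `$`, which match at line boundaries. A possible hardening, shown here as a sketch and not as the gem's code, is to escape the mask with `Regexp.escape` and anchor with `\A` and `\z`. The top-level method names below reuse the names from the hunk purely for comparison.

```ruby
# Variant of mask_left that treats the mask as literal text and
# anchors at the very start of the string.
def mask_left(str, mask)
  str.sub(/\A#{Regexp.escape(mask)}/, '')
end

# Variant of mask_right that anchors at the very end of the string.
def mask_right(str, mask)
  str.sub(/#{Regexp.escape(mask)}\z/, '')
end

# Same logic as in the hunk: prefixes from longest to shortest.
def ngram_left(str)
  (0...str.size).to_a.reverse.map { |i| str[0..i] }
end

# Same logic as in the hunk: suffixes from longest to shortest.
def ngram_right(str)
  (0...str.size).map { |i| str[i..-1] }
end

p mask_left('abcde', 'ab')
p mask_right('abcde', 'de')
p ngram_left('abcd')
```

With escaping, a mask such as `"."` removes only a literal leading dot, whereas the unescaped pattern `/^./` would strip whatever character comes first.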
@@ -1,6 +1,3 @@
1
- #!/bin/env ruby
2
- # encoding: utf-8
3
-
4
1
  module JapaneseNames
5
- VERSION = '0.0.3'
2
+ VERSION = '0.1.0'
6
3
  end
@@ -4,4 +4,14 @@ require 'moji'
4
4
 
5
5
  require 'japanese_names/version'
6
6
  require 'japanese_names/enamdict'
7
- require 'japanese_names/parser'
7
+ require 'japanese_names/finder'
8
+ require 'japanese_names/splitter'
9
+ require 'japanese_names/util/ngram'
10
+ require 'japanese_names/backend/memory/store'
11
+ require 'japanese_names/backend/memory/finder'
12
+
13
+ module JapaneseNames
14
+ def self.root
15
+ File.join(File.dirname(__FILE__), '../')
16
+ end
17
+ end
data/spec/spec_helper.rb CHANGED
@@ -1,9 +1,7 @@
1
- #!/bin/env ruby
2
- # encoding: utf-8
3
-
4
1
  $:.push File.expand_path('../../lib', __FILE__)
5
2
 
6
3
  require 'rubygems'
4
+ require 'rspec'
7
5
  require 'japanese_names'
8
6
 
9
7
  RSpec.configure do |config|
@@ -1,28 +1,8 @@
1
- #!/bin/env ruby
2
- # encoding: utf-8
3
-
4
1
  require 'spec_helper'
5
2
 
6
- describe JapaneseNames::Enamdict do
7
-
8
- subject { JapaneseNames::Enamdict }
3
+ describe JapaneseNames::Finder do
9
4
 
10
- describe '#match' do
11
-
12
- it 'should select only lines which match criteria' do
13
- result = subject.match{|line| line =~ /^.+?\|あわのはら\|.+?$/}
14
- result.should eq [["粟野原", "あわのはら", "s"]]
15
- end
16
-
17
- it 'should select multiple lines' do
18
- result = subject.match{|line| line =~ /^.+?\|はしの\|.+?$/}
19
- result.should eq [["橋之", "はしの", "p"],
20
- ["橋埜", "はしの", "s"],
21
- ["橋野", "はしの", "s"],
22
- ["端野", "はしの", "s"],
23
- ["箸野", "はしの", "s"]]
24
- end
25
- end
5
+ subject { described_class.new }
26
6
 
27
7
  describe '#find' do
28
8
 
@@ -0,0 +1,24 @@
1
+ require 'spec_helper'
2
+
3
+ describe JapaneseNames::Util::Ngram do
4
+
5
+ describe '#ngram' do
6
+ it { expect(described_class.ngram("abcd")).to eq ["abcd", "abc", "bcd", "ab", "bc", "cd", "a", "b", "c", "d"] }
7
+ end
8
+
9
+ describe '#ngram_left' do
10
+ it { expect(described_class.ngram_left("abcd")).to eq ["abcd", "abc", "ab", "a"] }
11
+ end
12
+
13
+ describe '#ngram_right' do
14
+ it { expect(described_class.ngram_right("abcd")).to eq ["abcd", "bcd", "cd", "d"] }
15
+ end
16
+
17
+ describe '#mask_left' do
18
+ it { expect(described_class.mask_left("abcde", "ab")).to eq "cde" }
19
+ end
20
+
21
+ describe '#mask_right' do
22
+ it { expect(described_class.mask_right("abcde", "de")).to eq "abc" }
23
+ end
24
+ end
@@ -1,21 +1,18 @@
1
- #!/bin/env ruby
2
- # encoding: utf-8
3
-
4
1
  require 'spec_helper'
5
2
 
6
- describe JapaneseNames::Parser do
3
+ describe JapaneseNames::Splitter do
7
4
 
8
- subject { JapaneseNames::Parser.new }
5
+ subject { described_class.new }
9
6
 
10
7
  describe '#split' do
11
8
 
12
- [['上原','望','ウエハラ', 'ノゾミ'],
13
- ['樋口','知美','ヒグチ', 'ともみ'],
14
- ['堺','雅美','さかい', 'マサミ'],
15
- ['中村','幸子','ナカムラ', 'サチコ'],
16
- ['秋保','郁子','アキホ', 'いくこ'],
17
- ['光野','亜佐子','ミツノ', 'アサコ'],
18
- ['熊澤','貴子','クマザワ', 'タカコ']].each do |kanji_fam, kanji_giv, kana_fam, kana_giv|
9
+ [['上原','望','ウエハラ','ノゾミ'],
10
+ ['樋口','知美','ヒグチ','ともみ'],
11
+ ['堺','雅美','さかい','マサミ'],
12
+ ['中村','幸子','ナカムラ','サチコ'],
13
+ ['秋保','郁子','アキホ','いくこ'],
14
+ ['光野','亜佐子','ミツノ','アサコ'],
15
+ ['熊澤','貴子','クマザワ','タカコ']].each do |kanji_fam, kanji_giv, kana_fam, kana_giv|
19
16
  it "should parse #{kanji_fam+kanji_giv} #{kana_fam+kana_giv}" do
20
17
  result = subject.split(kanji_fam+kanji_giv, kana_fam+kana_giv)
21
18
  result.should eq [[kanji_fam, kanji_giv], [kana_fam, kana_giv]]
@@ -27,7 +24,7 @@ describe JapaneseNames::Parser do
27
24
  end
28
25
 
29
26
  it "should parse #{kanji_fam+kanji_giv} #{kana_fam+kana_giv} by family name" do
30
- result = subject.split_fam(kanji_fam+kanji_giv, kana_fam+kana_giv)
27
+ result = subject.split_sur(kanji_fam+kanji_giv, kana_fam+kana_giv)
31
28
  result.should eq [[kanji_fam, kanji_giv], [kana_fam, kana_giv]]
32
29
  end
33
30
  end
@@ -42,7 +39,7 @@ describe JapaneseNames::Parser do
42
39
  it 'should strip leading/trailing whitespace' do
43
40
  subject.split(' 上原望 ', ' ウエハラノゾミ ').should eq [['上原','望'],['ウエハラ','ノゾミ']]
44
41
  subject.split_giv(' 上原望 ', ' ウエハラノゾミ ').should eq [['上原','望'],['ウエハラ','ノゾミ']]
45
- subject.split_fam(' 上原望 ', ' ウエハラノゾミ ').should eq [['上原','望'],['ウエハラ','ノゾミ']]
42
+ subject.split_sur(' 上原望 ', ' ウエハラノゾミ ').should eq [['上原','望'],['ウエハラ','ノゾミ']]
46
43
  end
47
44
 
48
45
  it 'should return nil for nil input' do
metadata CHANGED
@@ -1,69 +1,69 @@
1
1
  --- !ruby/object:Gem::Specification
2
2
  name: japanese_names
3
3
  version: !ruby/object:Gem::Version
4
- version: 0.0.3
4
+ version: 0.1.0
5
5
  platform: ruby
6
6
  authors:
7
7
  - Johnny Shields
8
8
  autorequire:
9
9
  bindir: bin
10
10
  cert_chain: []
11
- date: 2014-09-08 00:00:00.000000000 Z
11
+ date: 2016-10-12 00:00:00.000000000 Z
12
12
  dependencies:
13
13
  - !ruby/object:Gem::Dependency
14
14
  name: moji
15
15
  requirement: !ruby/object:Gem::Requirement
16
16
  requirements:
17
- - - ! '>='
17
+ - - ">="
18
18
  - !ruby/object:Gem::Version
19
19
  version: '1.6'
20
20
  type: :runtime
21
21
  prerelease: false
22
22
  version_requirements: !ruby/object:Gem::Requirement
23
23
  requirements:
24
- - - ! '>='
24
+ - - ">="
25
25
  - !ruby/object:Gem::Version
26
26
  version: '1.6'
27
27
  - !ruby/object:Gem::Dependency
28
28
  name: rake
29
29
  requirement: !ruby/object:Gem::Requirement
30
30
  requirements:
31
- - - ! '>='
31
+ - - ">="
32
32
  - !ruby/object:Gem::Version
33
33
  version: '0'
34
34
  type: :development
35
35
  prerelease: false
36
36
  version_requirements: !ruby/object:Gem::Requirement
37
37
  requirements:
38
- - - ! '>='
38
+ - - ">="
39
39
  - !ruby/object:Gem::Version
40
40
  version: '0'
41
41
  - !ruby/object:Gem::Dependency
42
42
  name: rspec
43
43
  requirement: !ruby/object:Gem::Requirement
44
44
  requirements:
45
- - - ! '>='
45
+ - - ">="
46
46
  - !ruby/object:Gem::Version
47
47
  version: 3.0.0
48
48
  type: :development
49
49
  prerelease: false
50
50
  version_requirements: !ruby/object:Gem::Requirement
51
51
  requirements:
52
- - - ! '>='
52
+ - - ">="
53
53
  - !ruby/object:Gem::Version
54
54
  version: 3.0.0
55
55
  - !ruby/object:Gem::Dependency
56
56
  name: gem-release
57
57
  requirement: !ruby/object:Gem::Requirement
58
58
  requirements:
59
- - - ! '>='
59
+ - - ">="
60
60
  - !ruby/object:Gem::Version
61
61
  version: '0'
62
62
  type: :development
63
63
  prerelease: false
64
64
  version_requirements: !ruby/object:Gem::Requirement
65
65
  requirements:
66
- - - ! '>='
66
+ - - ">="
67
67
  - !ruby/object:Gem::Version
68
68
  version: '0'
69
69
  description: Japanese name parser based on ENAMDICT
@@ -76,12 +76,17 @@ files:
76
76
  - README.md
77
77
  - bin/enamdict.min
78
78
  - lib/japanese_names.rb
79
+ - lib/japanese_names/backend/memory/finder.rb
80
+ - lib/japanese_names/backend/memory/store.rb
79
81
  - lib/japanese_names/enamdict.rb
80
- - lib/japanese_names/parser.rb
82
+ - lib/japanese_names/finder.rb
83
+ - lib/japanese_names/splitter.rb
84
+ - lib/japanese_names/util/ngram.rb
81
85
  - lib/japanese_names/version.rb
82
86
  - spec/spec_helper.rb
83
- - spec/unit/enamdict_spec.rb
84
- - spec/unit/parser_spec.rb
87
+ - spec/unit/finder_spec.rb
88
+ - spec/unit/ngram_spec.rb
89
+ - spec/unit/splitter_spec.rb
85
90
  homepage: https://github.com/johnnyshields/japanese_names
86
91
  licenses:
87
92
  - MIT
@@ -92,22 +97,22 @@ require_paths:
92
97
  - lib
93
98
  required_ruby_version: !ruby/object:Gem::Requirement
94
99
  requirements:
95
- - - ! '>='
100
+ - - ">="
96
101
  - !ruby/object:Gem::Version
97
102
  version: '0'
98
103
  required_rubygems_version: !ruby/object:Gem::Requirement
99
104
  requirements:
100
- - - ! '>='
105
+ - - ">="
101
106
  - !ruby/object:Gem::Version
102
107
  version: '0'
103
108
  requirements: []
104
109
  rubyforge_project:
105
- rubygems_version: 2.2.1
110
+ rubygems_version: 2.4.7
106
111
  signing_key:
107
112
  specification_version: 4
108
113
  summary: Tools for parsing japanese names
109
114
  test_files:
110
115
  - spec/spec_helper.rb
111
- - spec/unit/enamdict_spec.rb
112
- - spec/unit/parser_spec.rb
113
- has_rdoc:
116
+ - spec/unit/finder_spec.rb
117
+ - spec/unit/ngram_spec.rb
118
+ - spec/unit/splitter_spec.rb
@@ -1,80 +0,0 @@
1
- #!/bin/env ruby
2
- # encoding: utf-8
3
-
4
- module JapaneseNames
5
-
6
- # Provides methods for parsing Japanese name strings.
7
- class Parser
8
-
9
- # Given a kanji and kana representation of a name splits into to family/given names.
10
- #
11
- # The choice to prioritize family name is arbitrary. Further analysis is needed
12
- # for whether given or family name should be prioritized.
13
- #
14
- # Returns Array [[kanji_fam, kanji_giv], [kana_fam, kana_giv]] if there was a match.
15
- # Returns nil if there was no match.
16
- def split(kanji, kana)
17
- split_fam(kanji, kana) || split_giv(kanji, kana)
18
- end
19
-
20
- def split_giv(kanji, kana)
21
- return nil unless kanji && kana
22
- kanji, kana = kanji.strip, kana.strip
23
- dict = Enamdict.find(kanji: window_right(kanji))
24
- dict.sort!{|x,y| y[0].size <=> x[0].size}
25
- kana_match = nil
26
- if match = dict.detect{|m| kana_match = kana[/#{hk m[1]}$/]}
27
- return [[mask_right(kanji, match[0]), match[0]],[mask_right(kana, kana_match), kana_match]]
28
- end
29
- end
30
-
31
- def split_fam(kanji, kana)
32
- return nil unless kanji && kana
33
- kanji, kana = kanji.strip, kana.strip
34
- dict = Enamdict.find(kanji: window_left(kanji))
35
- dict.sort!{|x,y| y[0].size <=> x[0].size}
36
- kana_match = nil
37
- if match = dict.detect{|m| kana_match = kana[/^#{hk m[1]}/]}
38
- return [[match[0], mask_left(kanji, match[0])],[kana_match, mask_left(kana, kana_match)]]
39
- end
40
- end
41
-
42
- # TODO: add option to strip honorific '様'
43
- # TODO: add option to infer sex (0 = unknown, 1 = male, 2 = female as per ISO/IEC 5218)
44
-
45
- protected
46
-
47
- # Returns a regex string which matches both hiragana and katakana variations of a String.
48
- def hk(str)
49
- "(?:#{Moji.kata_to_hira(str)}|#{Moji.hira_to_kata(str)})"
50
- end
51
-
52
- # Masks a String from the left side and returns the remaining (right) portion of the String.
53
- #
54
- # Example: mask_left("abcde", "ab") #=> "cde"
55
- def mask_left(str, mask)
56
- str.gsub(/^#{mask}/, '')
57
- end
58
-
59
- # Masks a String from the right side and returns the remaining (left) portion of the String.
60
- #
61
- # Example: mask_right("abcde", "de") #=> "abc"
62
- def mask_right(str, mask)
63
- str.gsub(/#{mask}$/, '')
64
- end
65
-
66
- # Given a String, returns an array of progressively smaller substrings anchored on the left side.
67
- #
68
- # Example: window_left("abcd") #=> ["abcd", "abc", "ab", "a"]
69
- def window_left(str)
70
- (0...str.size).to_a.reverse.map{|i| str[0..i]}
71
- end
72
-
73
- # Given a String, returns an array of progressively smaller substrings anchored on the right side.
74
- #
75
- # Example: window_right("abcd") #=> ["abcd", "bcd", "cd", "d"]
76
- def window_right(str)
77
- (0...str.size).map{|i| str[i..-1]}
78
- end
79
- end
80
- end