ruby-pinyin-ez 0.5.0

checksums.yaml ADDED
@@ -0,0 +1,7 @@
+ ---
+ SHA256:
+   metadata.gz: 92a6c23a0492659ffbe8d356f6d845d060a4ac7c801669d9d59f742bac7c4bf6
+   data.tar.gz: 9cbf2d4733d3ffe853cf8968d916a647d9f934f10de77eeecf6176c375edb135
+ SHA512:
+   metadata.gz: 9e2af60065a7a96c17bef6d5f9e3ccb069b575710f2c18cd561785bea898d5edd0d5bd6ea7a6a70375246e82c8f6804c5fe1959c7e117304dfeb7a17984fce9e
+   data.tar.gz: 5845a4e2a627c32cc5dfe90490acdce961d3ebfa3f71b8ab58325110c47fc762744b8965fdbfbfac62c71c7d91ab335c5e0371bb55a2858dbb66b3a667295a91
data/LICENSE ADDED
@@ -0,0 +1,10 @@
+ Copyright (c) 2012, Jan Xie <jan.h.xie@gmail.com>
+ All rights reserved.
+
+ Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met:
+
+ * Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer.
+ * Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution.
+ * Neither the name of Jan Xie nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission.
+
+ THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
data/README.markdown ADDED
@@ -0,0 +1,100 @@
+ # ruby-pinyin: a Chinese-character-to-pinyin converter with polyphone support (支持多音字的汉字转拼音工具)
+ [![Build Status](https://travis-ci.org/janx/ruby-pinyin.svg?branch=master)](https://travis-ci.org/janx/ruby-pinyin)
+
+ ruby-pinyin: zhī chí duō yīn zì de hàn zì zhuǎn pīn yīn gōng jù
+
+ ruby-pinyin converts Chinese characters to their pinyin and handles polyphonic characters (characters with more than one reading) reasonably well. For example:
+
+     PinYin.of_string('南京市长江大桥', :ascii)
+
+ correctly converts "长" to "chang2" rather than "zhang3".
+
+ ## Features
+
+ * Supports polyphonic characters
+ * Uses the latest Unicode data (6.3.0, published 2013/02/26)
+ * Outputs tones as digits or as Unicode tone marks (e.g. 'cao1', 'cāo')
+ * Rich API
+ * Supports strings that mix Chinese and English text and punctuation
+ * Converts Chinese punctuation to English punctuation
+ * Supports custom readings
+
+ ## Installation
+
+     gem install ruby-pinyin
+
+ Or add ruby-pinyin to your Gemfile:
+
+     gem 'ruby-pinyin'
+
+ ## Examples
+
+     # encoding: utf-8
+     require 'ruby-pinyin'
+
+     # return ['jie', 'cao']
+     PinYin.of_string('节操')
+
+     # return ['jie2', 'cao1']
+     PinYin.of_string('节操', true)
+     PinYin.of_string('节操', :ascii)
+
+     # return ["jié", "cāo"]
+     PinYin.of_string('节操', :unicode)
+
+     # handles polyphonic characters correctly: return ["nán", "jīng", "shì", "cháng", "jiāng", "dà", "qiáo"]
+     PinYin.of_string('南京市长江大桥', :unicode)
+
+     # return %w(gan xie party gan xie guo jia)
+     PinYin.of_string('感谢party感谢guo jia')
+
+     # return 'gan-xie-party-gan-xie-guo-jia'
+     PinYin.permlink('感谢party感谢guo jia')
+
+     # return 'gxpartygxguojia'
+     PinYin.abbr('感谢party感谢guo jia')
+
+     # return 'gan xie party, gan xie guo jia!'
+     # PinYin.sentence keeps punctuation and replaces Chinese punctuation with its English equivalent
+     PinYin.sentence('感谢party, 感谢guo家!')
+
+     # override readings with your own data file
+     PinYin.override_files = [File.expand_path('../my.dat', __FILE__)]
+
+ For more examples and options, see the [test cases](https://github.com/janx/ruby-pinyin/blob/master/test/pinyin_test.rb).
+
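+ ## Configuration ##
+
+ ruby-pinyin has two backends: `PinYin::Backend::Simple` and `PinYin::Backend::MMSeg`. The MMSeg backend is the default and supports polyphone recognition. If you do not need polyphone recognition, have tight memory constraints, or want to fall back to the Simple backend for any other reason, configure it as follows:
+
+ ```ruby
+ PinYin.backend = PinYin::Backend::Simple.new
+ ```
+
+ ## Custom pronunciations ##
+
+ `PinYin.override_files` lets you customize the readings of specific characters. Custom data is stored in a plain text file, one character per line, with the character's Unicode code point and its pinyin separated by an ASCII space. For the format, see the [lib/ruby-pinyin/data/Mandarin.dat](https://github.com/janx/ruby-pinyin/blob/master/lib/ruby-pinyin/data/Mandarin.dat) file.
+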
+ ## Help welcome ##
+
+ If you like this project, please help it in any of the following ways (among others):
+
+ * Use it
+ * Spread the word
+ * Report bugs and make suggestions (GitHub issue tracker)
+ * Fix bugs and implement features (GitHub pull requests)
+
+ ## LICENSE ##
+
+ [BSD LICENSE](https://github.com/janx/ruby-pinyin/blob/master/LICENSE)
+
+ The pinyin data in ruby-pinyin was compiled by the author from the Internet. You may use it freely outside ruby-pinyin, but please credit ruby-pinyin as the source :-)
+
+ ## Contributors ##
+
+ * [Martin91](https://github.com/Martin91)
+ * [jaxi](https://github.com/jaxi)
+ * [jiangxin](https://github.com/jiangxin)
+ * [forresty](https://github.com/forresty)
+ * [pzpz](https://github.com/pzpz)
+ * [Eric Guo](https://github.com/Eric-Guo)
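The custom-pronunciation section of the README describes the override file as one character per line, with the Unicode code point and the pinyin separated by an ASCII space. The sketch below builds such lines. It assumes the code point is written as uppercase hex and that multiple readings are comma-separated, which is how `load_codes_from` in this diff parses each line. `override_line` is a hypothetical helper and not part of ruby-pinyin.

```ruby
# Hypothetical helper (not part of ruby-pinyin): format one override-file line.
# Assumed layout: "<UPPERCASE HEX CODE POINT> <reading>[,<reading>...]"
def override_line(char, readings)
  code = char.unpack('U*').first.to_s(16).upcase
  "#{code} #{Array(readings).join(',')}"
end

# "\u957F" is the character read "chang" in the README's bridge example;
# the reading is written with a Unicode tone mark ("\u00E1" is a with acute).
puts override_line("\u957F", ["ch\u00E1ng"])
```

Under these assumptions, writing such lines to a file and listing that file in `PinYin.override_files` would replace the bundled reading for that character.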
@@ -0,0 +1,101 @@
+ module PinYin
+   module Backend
+     class EZSeg
+
+       def initialize(override_files=[])
+         @override_files = override_files || []
+       end
+
+       # Romanize str one character at a time. A character with several readings
+       # contributes an array of Values (one per reading) instead of a single Value.
+       def romanize(str, tone=nil, include_punctuations=false)
+         return [] unless str && str.length > 0
+         words = segment str
+
+         res = []
+         words.each do |word|
+           if word && !word.empty?
+             word.unpack('U*').each do |t|
+               code = sprintf('%x',t).upcase
+               readings = codes[code]
+
+               if readings
+                 multiple_arr = readings.collect{|one| Value.new(format([one], tone), false)}
+                 res << (multiple_arr.length > 1 ? multiple_arr : multiple_arr[0])
+               else
+                 val = [t].pack('U*')
+                 if val =~ /^[0-9a-zA-Z\s]*$/ # keep ASCII letters, digits and whitespace; other characters (e.g. full-width symbols) are dropped
+                   if res.last && res.last.respond_to?(:english?) && res.last.english?
+                     res.last << Value.new(val, true)
+                   elsif val != ' '
+                     res << Value.new(val, true)
+                   end
+                 elsif include_punctuations
+                   val = [Punctuation[code]].pack('H*') if Punctuation.include?(code)
+                   (res.last ? res.last : res) << Value.new(val, false)
+                 end
+               end
+             end
+           end
+         end
+         res
+       end
+
+       private
+
+       def codes
+         return @codes if @codes
+         # Mandarin.dat is loaded first, so entries in override files replace its readings
+         @codes = {}
+         src = File.expand_path('../../data/Mandarin.dat', __FILE__)
+         ([src] + @override_files).each do |file| # build a new list; do not mutate the caller's array
+           load_codes_from(file)
+         end
+         @codes
+       end
+
+       def load_codes_from(file)
+         File.readlines(file).each do |line|
+           code, readings = line.split(' ') # "<HEX CODE POINT> <reading>[,<reading>...]"
+           @codes[code] = readings.split(',') if readings # skip blank or malformed lines
+         end
+       end
+
+       def format(readings, tone)
+         case tone
+         when :unicode
+           readings[0]
+         when :ascii, true
+           PinYin::Util.to_ascii(readings[0])
+         else
+           PinYin::Util.to_ascii(readings[0], false)
+         end
+       end
+
+       def segment(str)
+         words = []
+         str.split('').each do |s|
+           words.push(s) unless s =~ Punctuation.chinese_regexp
+         end
+
+         words
+       end
+
+       # def apply(base, patch)
+       #   result = []
+       #   base.each_with_index do |char, i|
+       #     if patch[i].nil?
+       #       result.push char
+       #     elsif char =~ Punctuation.regexp
+       #       result.push Value.new("#{patch[i]}#{$1}", char.english?)
+       #     else
+       #       result.push Value.new(patch[i], char.english?)
+       #     end
+       #   end
+       #   result
+       # end
+
+
+     end
+   end
+ end
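EZSeg differs from Simple mainly in what it returns for a polyphonic character: every reading rather than only the first. The sketch below mirrors that lookup with plain strings and a tiny in-memory table. `readings_for` and the table contents are illustrative assumptions; the real backend wraps readings in `Value` objects, loads `Mandarin.dat`, and drops unknown non-ASCII characters unless punctuation is requested.

```ruby
# Illustrative lookup (hypothetical, simplified from EZSeg#romanize):
# one reading yields a String, several readings yield an Array of Strings.
def readings_for(str, codes)
  str.unpack('U*').map do |code_point|
    readings = codes[format('%X', code_point)]
    if readings.nil?
      [code_point].pack('U*') # simplification: pass unknown characters through
    elsif readings.length > 1
      readings
    else
      readings.first
    end
  end
end

# Hypothetical table standing in for Mandarin.dat (keys are uppercase hex code points).
codes = {
  '957F' => ["ch\u00E1ng", "zh\u01CEng"], # U+957F, two readings
  '6C5F' => ["ji\u0101ng"]                # U+6C5F, one reading
}

p readings_for("\u957F\u6C5F", codes)
```

Returning an array for ambiguous characters leaves the choice of reading to the caller, whereas the MMSeg backend tries to resolve it from word context.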
@@ -0,0 +1,110 @@
+ # -*- coding: utf-8 -*-
+
+ require 'rmmseg-cpp-new'
+
+ module PinYin
+   module Backend
+     class MMSeg
+
+       def initialize(override_files=[])
+         @simple = Simple.new override_files
+         # replace RMMSeg's default word dictionary with the one bundled with this gem
+         RMMSeg::Dictionary.dictionaries.delete_if {|(type, path)| type == :words}
+         RMMSeg::Dictionary.dictionaries.push [:words, File.expand_path('../../data/words.dic', __FILE__)]
+         RMMSeg::Dictionary.load_dictionaries
+       end
+
+       def romanize(str, tone=nil, include_punctuations=false)
+         return [] unless str && str.length > 0
+
+         words = segment str
+         # base: per-character readings from Simple; patch: word-level readings (nil where the word is unknown)
+         base = @simple.romanize(str, tone, include_punctuations)
+         patch = words.map {|w| format(w, tone) }.flatten
+
+         if base.size != patch.size
+           base.compact!
+           patch.compact!
+         end
+
+         apply base, patch
+       end
+
+       def segment(str)
+         algor = RMMSeg::Algorithm.new str
+
+         words = []
+         while (token = algor.next_token)
+           s = token.text.force_encoding("UTF-8")
+           words.push(s) unless s =~ Punctuation.chinese_regexp
+         end
+         words
+       end
+
+       private
+
+       def dictionary
+         return @dict if @dict
+
+         @dict = {}
+         src = File.expand_path('../../data/words.dat', __FILE__)
+         File.readlines(src).each do |line|
+           word, unicode = line.strip.split(',')
+           @dict[word] = unicode
+         end
+
+         @dict
+       end
+
+       def get_pinyin(word, tone)
+         return unless dictionary[word]
+
+         case tone
+         when :unicode
+           dictionary[word]
+         when :ascii, true
+           to_ascii dictionary[word], true
+         else
+           to_ascii dictionary[word], false
+         end
+       end
+
+       def to_ascii(word, with_tone)
+         word.split(' ').map do |reading|
+           PinYin::Util.to_ascii(reading, with_tone)
+         end.join(' ')
+       end
+
+       def format(word, tone)
+         pinyin = get_pinyin(word, tone)
+         return pinyin.split(' ') if pinyin
+
+         # Return an English word as-is; otherwise return an array of nils as long as the word
+         if word =~ /^[_0-9a-zA-Z\s]*$/
+           word
+         elsif word.respond_to? :force_encoding
+           # word has been encoded in UTF-8 already
+           [nil] * word.size
+         else
+           # For ruby 1.8, there is no native utf-8 support
+           [nil] * word.unpack('U*').size
+         end
+       end
+
+       def apply(base, patch)
+         result = []
+         base.each_with_index do |char, i|
+           if patch[i].nil?
+             result.push char
+           elsif char =~ Punctuation.regexp
+             result.push Value.new("#{patch[i]}#{$1}", char.english?)
+           else
+             result.push Value.new(patch[i], char.english?)
+           end
+         end
+         result
+       end
+
+     end
+   end
+ end
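The MMSeg backend resolves polyphones by overlaying word-level readings (`patch`) on the per-character readings from Simple (`base`), keeping the base reading wherever the patch is `nil`. A minimal sketch of that merge with plain strings follows. `apply_patch` and the sample data are hypothetical; the real `apply` also re-attaches trailing punctuation and builds `Value` objects.

```ruby
# Simplified merge (hypothetical): prefer the word-level reading and fall back
# to the per-character reading where the word dictionary had no entry (nil).
def apply_patch(base, patch)
  base.each_with_index.map { |reading, i| patch[i].nil? ? reading : patch[i] }
end

# Made-up data for the README's seven-character bridge example; the base value
# at index 3 is deliberately the "wrong" reading so the override is visible.
base  = %w[nan jing shi zhang jiang da qiao]
patch = [nil, nil, nil, 'chang', 'jiang', nil, nil]

p apply_patch(base, patch)
```

The merge only lines up when both arrays have one entry per character, which is presumably why `romanize` compacts both arrays when their sizes differ.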
@@ -0,0 +1,72 @@
+ # -*- coding: utf-8 -*-
+
+ module PinYin
+   module Backend
+     class Simple
+
+       def initialize(override_files=[])
+         @override_files = override_files || []
+       end
+
+       def romanize(str, tone=nil, include_punctuations=false)
+         res = []
+         return res unless str && !str.empty?
+
+         str.unpack('U*').each do |t|
+           code = sprintf('%x',t).upcase
+           readings = codes[code]
+
+           if readings
+             res << Value.new(format(readings, tone), false)
+           else
+             val = [t].pack('U*')
+             if val =~ /^[0-9a-zA-Z\s]*$/ # keep ASCII letters, digits and whitespace; other characters (e.g. full-width symbols) are dropped
+               if res.last && res.last.english?
+                 res.last << Value.new(val, true)
+               elsif val != ' '
+                 res << Value.new(val, true)
+               end
+             elsif include_punctuations
+               val = [Punctuation[code]].pack('H*') if Punctuation.include?(code)
+               (res.last ? res.last : res) << Value.new(val, false)
+             end
+           end
+         end
+
+         res.map {|phrase| phrase.split(/\s+/)}.flatten
+       end
+
+       private
+
+       def codes
+         return @codes if @codes
+         # Mandarin.dat is loaded first, so entries in override files replace its readings
+         @codes = {}
+         src = File.expand_path('../../data/Mandarin.dat', __FILE__)
+         ([src] + @override_files).each do |file| # build a new list; do not mutate the caller's array
+           load_codes_from(file)
+         end
+         @codes
+       end
+
+       def load_codes_from(file)
+         File.readlines(file).each do |line|
+           code, readings = line.split(' ') # "<HEX CODE POINT> <reading>[,<reading>...]"
+           @codes[code] = readings.split(',') if readings # skip blank or malformed lines
+         end
+       end
+
+       def format(readings, tone)
+         case tone
+         when :unicode
+           readings[0]
+         when :ascii, true
+           PinYin::Util.to_ascii(readings[0])
+         else
+           PinYin::Util.to_ascii(readings[0], false)
+         end
+       end
+
+     end
+   end
+ end
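The `format` methods in these backends delegate tone handling to `PinYin::Util.to_ascii`, whose source is not among the files shown here. As a rough illustration of what turning "jié" into "jie2" involves, the sketch below uses Unicode normalization. `to_numbered` is a hypothetical stand-in, and the real implementation may differ, for example in how it treats ü or neutral tones.

```ruby
# Hypothetical stand-in for PinYin::Util.to_ascii (its real source is not shown).
# Decomposes the reading (NFD), maps the combining tone mark to a digit,
# strips the mark, and optionally appends the digit.
def to_numbered(reading, with_tone = true)
  # Combining macron, acute, caron, grave => tones 1 to 4.
  marks = { "\u0304" => 1, "\u0301" => 2, "\u030C" => 3, "\u0300" => 4 }
  chars = reading.encode(Encoding::UTF_8).unicode_normalize(:nfd).chars
  tone  = chars.map { |ch| marks[ch] }.compact.first
  plain = chars.reject { |ch| marks.key?(ch) }.join.unicode_normalize(:nfc)
  with_tone && tone ? "#{plain}#{tone}" : plain
end

puts to_numbered("ji\u00E9")        # jie2
puts to_numbered("c\u0101o")        # cao1
puts to_numbered("c\u0101o", false) # cao
```

The two-argument shape mirrors the calls in `format`, where the second argument switches the tone digit off.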
@@ -0,0 +1,7 @@
+ module PinYin
+   module Backend
+     autoload :Simple, 'ruby-pinyin/backend/simple'
+     autoload :MMSeg, 'ruby-pinyin/backend/mmseg'
+     autoload :EZSeg, 'ruby-pinyin/backend/ezseg'
+   end
+ end