ruby-pinyin-ez 0.5.0
- checksums.yaml +7 -0
- data/LICENSE +10 -0
- data/README.markdown +100 -0
- data/lib/ruby-pinyin/backend/ezseg.rb +101 -0
- data/lib/ruby-pinyin/backend/mmseg.rb +110 -0
- data/lib/ruby-pinyin/backend/simple.rb +72 -0
- data/lib/ruby-pinyin/backend.rb +7 -0
- data/lib/ruby-pinyin/data/Mandarin.dat +41208 -0
- data/lib/ruby-pinyin/data/Punctuations.dat +14 -0
- data/lib/ruby-pinyin/data/words.dat +175180 -0
- data/lib/ruby-pinyin/data/words.dic +175180 -0
- data/lib/ruby-pinyin/punctuation.rb +46 -0
- data/lib/ruby-pinyin/util.rb +29 -0
- data/lib/ruby-pinyin/value.rb +16 -0
- data/lib/ruby-pinyin/version.rb +3 -0
- data/lib/ruby-pinyin.rb +41 -0
- metadata +87 -0
checksums.yaml
ADDED
@@ -0,0 +1,7 @@
---
SHA256:
  metadata.gz: 92a6c23a0492659ffbe8d356f6d845d060a4ac7c801669d9d59f742bac7c4bf6
  data.tar.gz: 9cbf2d4733d3ffe853cf8968d916a647d9f934f10de77eeecf6176c375edb135
SHA512:
  metadata.gz: 9e2af60065a7a96c17bef6d5f9e3ccb069b575710f2c18cd561785bea898d5edd0d5bd6ea7a6a70375246e82c8f6804c5fe1959c7e117304dfeb7a17984fce9e
  data.tar.gz: 5845a4e2a627c32cc5dfe90490acdce961d3ebfa3f71b8ab58325110c47fc762744b8965fdbfbfac62c71c7d91ab335c5e0371bb55a2858dbb66b3a667295a91
data/LICENSE
ADDED
@@ -0,0 +1,10 @@
Copyright (c) 2012, Jan Xie <jan.h.xie@gmail.com>
All rights reserved.

Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met:

* Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer.
* Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution.
* Neither the name of Jan Xie nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission.

THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
data/README.markdown
ADDED
@@ -0,0 +1,100 @@
# ruby-pinyin: 支持多音字的汉字转拼音工具 (a Chinese-character-to-pinyin converter with polyphone support)
[![Build Status](https://travis-ci.org/janx/ruby-pinyin.svg?branch=master)](https://travis-ci.org/janx/ruby-pinyin)

ruby-pinyin: zhī chí duō yīn zì de hàn zì zhuǎn pīn yīn gōng jù

ruby-pinyin converts Chinese characters to their pinyin and handles polyphonic characters (characters with more than one reading) reasonably well. For example:

    PinYin.of_string('南京市长江大桥', :ascii)

correctly converts "长" to "chang2" rather than "zhang3".

## Features

* Supports polyphonic characters
* Uses up-to-date Unicode data (6.3.0 published at 2013/02/26)
* Outputs tones as digits or as Unicode tone marks (e.g. 'cao1', 'cāo')
* Rich API
* Supports strings that mix Chinese and English punctuation
* Converts Chinese punctuation to English punctuation
* Supports custom readings

## Installation

    gem install ruby-pinyin

Or add ruby-pinyin to your Gemfile:

    gem 'ruby-pinyin'

## Examples

    # encoding: utf-8
    require 'ruby-pinyin'

    # return ['jie', 'cao']
    PinYin.of_string('节操')

    # return ['jie2', 'cao1']
    PinYin.of_string('节操', true)
    PinYin.of_string('节操', :ascii)

    # return ["jié", "cāo"]
    PinYin.of_string('节操', :unicode)

    # polyphonic characters are handled correctly: return ["nán", "jīng", "shì", "cháng", "jiāng", "dà", "qiáo"]
    PinYin.of_string('南京市长江大桥', :unicode)

    # return %w(gan xie party gan xie guo jia)
    PinYin.of_string('感谢party感谢guo jia')

    # return 'gan-xie-party-gan-xie-guo-jia'
    PinYin.permlink('感谢party感谢guo jia')

    # return 'gxpartygxguojia'
    PinYin.abbr('感谢party感谢guo jia')

    # return 'gan xie party, gan xie guo jia!'
    # PinYin.sentence keeps punctuation and replaces Chinese punctuation with its English counterpart
    PinYin.sentence('感谢party, 感谢guo家!')

    # override readings with your own data file
    PinYin.override_files = [File.expand_path('../my.dat', __FILE__)]

For more examples and options, see the [test cases](https://github.com/janx/ruby-pinyin/blob/master/test/pinyin_test.rb).

## Configuration ##

ruby-pinyin has two backends: `PinYin::Backend::Simple` and `PinYin::Backend::MMSeg`. The MMSeg backend is the default and supports polyphone recognition. If you do not need polyphone recognition, have tight memory constraints, or want to fall back to the Simple backend for any other reason, configure it as follows:

```ruby
PinYin.backend = PinYin::Backend::Simple.new
```

## Custom readings ##

`PinYin.override_files` lets you customize the readings of individual characters. Custom data is stored in plain text files, one character per line, with the character's Unicode code point and its pinyin separated by an ASCII space. For the format, see [lib/ruby-pinyin/data/Mandarin.dat](https://github.com/janx/ruby-pinyin/blob/master/lib/ruby-pinyin/data/Mandarin.dat).
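
To make the file format concrete, here is a minimal sketch of how such an override file could be parsed. It follows the one-entry-per-line layout described above. `parse_override_lines` and `reading_for` are hypothetical helper names, not part of the ruby-pinyin API, and the sample entry is made up for illustration.

```ruby
# Hypothetical helper: parse "CODEPOINT reading[,reading...]" lines into a Hash.
def parse_override_lines(lines)
  lines.each_with_object({}) do |line, table|
    code, readings = line.split(' ')
    next unless code && readings        # skip blank or malformed lines
    table[code.upcase] = readings.split(',')
  end
end

# Hypothetical helper: look a character up by its upper-case hex code point.
def reading_for(table, char)
  table[format('%X', char.ord)]
end

# Illustrative data only: 884C is the code point of 行.
overrides = parse_override_lines(["884C háng,xíng\n", "\n"])
p reading_for(overrides, '行')
```

Keying the table by hex code point rather than by the character itself keeps the data file plain ASCII on the left-hand side, which is easy to diff and edit.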

## Help welcome ##

If you like this project, please help out in any of the following ways (the list is not exhaustive):

* Use it
* Spread the word
* Report bugs and suggest improvements (GitHub issue tracker)
* Fix bugs and implement features (GitHub pull requests)

## LICENSE ##

[BSD LICENSE](https://github.com/janx/ruby-pinyin/blob/master/LICENSE)

The pinyin data in ruby-pinyin was compiled by the author from the Internet. You may use it freely outside ruby-pinyin, but please credit ruby-pinyin as the source :-)

## Contributors ##

* [Martin91](https://github.com/Martin91)
* [jaxi](https://github.com/jaxi)
* [jiangxin](https://github.com/jiangxin)
* [forresty](https://github.com/forresty)
* [pzpz](https://github.com/pzpz)
* [Eric Guo](https://github.com/Eric-Guo)
data/lib/ruby-pinyin/backend/ezseg.rb
ADDED
@@ -0,0 +1,101 @@
module PinYin
  module Backend
    # Character-by-character backend that keeps every reading of a
    # polyphonic character instead of only the first one.
    class EZSeg

      def initialize(override_files=[])
        @override_files = override_files || []
      end

      def romanize(str, tone=nil, include_punctuations=false)
        return [] unless str && str.length > 0
        words = segment str

        res = []
        words.each do |word|
          word.unpack('U*').each do |t|
            # Data files are keyed by upper-case hex code point.
            code = sprintf('%x',t).upcase
            readings = codes[code]

            if readings
              # One Value per reading; a polyphonic character yields an array.
              multiple_arr = readings.collect{|one| Value.new(format([one], tone), false)}
              res << (multiple_arr.length > 1 ? multiple_arr : multiple_arr[0])
            else
              val = [t].pack('U*')
              if val =~ /^[0-9a-zA-Z\s]*$/ # keep ASCII letters, digits and whitespace; drop special characters such as full-width symbols
                if res.last && res.last.respond_to?(:english?) && res.last.english?
                  res.last << Value.new(val, true)
                elsif val != ' '
                  res << Value.new(val, true)
                end
              elsif include_punctuations
                val = [Punctuation[code]].pack('H*') if Punctuation.include?(code)
                (res.last ? res.last : res) << Value.new(val, false)
              end
            end
          end
        end
        res
      end

      private

      def codes
        return @codes if @codes

        @codes = {}
        src = File.expand_path('../../data/Mandarin.dat', __FILE__)
        # Bundled data first, then overrides. Build a new array so the
        # caller's override list is not mutated.
        ([src] + @override_files).each do |file|
          load_codes_from(file)
        end
        @codes
      end

      def load_codes_from(file)
        File.readlines(file).each do |line|
          code, readings = line.split(' ')
          next unless code && readings # skip blank or malformed lines
          @codes[code] = readings.split(',')
        end
      end

      def format(readings, tone)
        case tone
        when :unicode
          readings[0]
        when :ascii, true
          PinYin::Util.to_ascii(readings[0])
        else
          PinYin::Util.to_ascii(readings[0], false)
        end
      end

      # "Segmentation" here is a split into single characters, minus Chinese punctuation.
      def segment(str)
        words = []
        str.split('').each do |s|
          words.push(s) unless s =~ Punctuation.chinese_regexp
        end

        words
      end

      # def apply(base, patch)
      #   result = []
      #   base.each_with_index do |char, i|
      #     if patch[i].nil?
      #       result.push char
      #     elsif char =~ Punctuation.regexp
      #       result.push Value.new("#{patch[i]}#{$1}", char.english?)
      #     else
      #       result.push Value.new(patch[i], char.english?)
      #     end
      #   end
      #   result
      # end

    end
  end
end
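
The lookup at the heart of `EZSeg#romanize` can be tried in isolation. The sketch below assumes a tiny in-memory table in place of `Mandarin.dat`; `TABLE` and `readings_per_char` are hypothetical names and the readings are illustrative. It shows how a character becomes an upper-case hex key and how a polyphonic character yields several readings rather than one.

```ruby
# Hypothetical stand-in for Mandarin.dat: hex code point => readings.
TABLE = {
  '957F' => %w[cháng zhǎng],  # 长
  '6C5F' => %w[jiāng]         # 江
}.freeze

def readings_per_char(str)
  str.unpack('U*').map do |codepoint|
    key = sprintf('%x', codepoint).upcase        # same key format as the backend
    readings = TABLE[key]
    next [codepoint].pack('U*') unless readings  # unknown: keep the character
    readings.length > 1 ? readings : readings[0]
  end
end

p readings_per_char('长江')
```

Returning a nested array only for polyphonic characters keeps the common single-reading case flat, at the cost of callers having to check whether an element is an array.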
data/lib/ruby-pinyin/backend/mmseg.rb
ADDED
@@ -0,0 +1,110 @@
# -*- coding: utf-8 -*-

require 'rmmseg-cpp-new'

module PinYin
  module Backend
    class MMSeg

      def initialize(override_files=[])
        @simple = Simple.new override_files

        # Swap any registered :words dictionary for the one bundled with this gem.
        RMMSeg::Dictionary.dictionaries.delete_if {|(type, path)| type == :words}
        RMMSeg::Dictionary.dictionaries.push [:words, File.expand_path('../../data/words.dic', __FILE__)]
        RMMSeg::Dictionary.load_dictionaries
      end

      def romanize(str, tone=nil, include_punctuations=false)
        return [] unless str && str.length > 0

        words = segment str

        # base: per-character readings from the Simple backend
        # patch: per-word readings from words.dat (nil where no word matches)
        base = @simple.romanize(str, tone, include_punctuations)
        patch = words.map {|w| format(w, tone) }.flatten

        if base.size != patch.size
          base.compact!
          patch.compact!
        end

        apply base, patch
      end

      def segment(str)
        algor = RMMSeg::Algorithm.new str

        words = []
        while token = algor.next_token
          s = token.text.force_encoding("UTF-8")
          words.push(s) unless s =~ Punctuation.chinese_regexp
        end
        words
      end

      private

      # Lazily loads words.dat into a Hash of word => space-separated readings.
      def dictionary
        return @dict if @dict

        @dict = {}
        src = File.expand_path('../../data/words.dat', __FILE__)
        File.readlines(src).each do |line|
          word, unicode = line.strip.split(',')
          @dict[word] = unicode
        end

        @dict
      end

      def get_pinyin(word, tone)
        return unless dictionary[word]

        case tone
        when :unicode
          dictionary[word]
        when :ascii, true
          to_ascii dictionary[word], true
        else
          to_ascii dictionary[word], false
        end
      end

      def to_ascii(word, with_tone)
        word.split(' ').map do |reading|
          PinYin::Util.to_ascii(reading, with_tone)
        end.join(' ')
      end

      def format(word, tone)
        pinyin = get_pinyin(word, tone)
        return pinyin.split(' ') if pinyin

        # Return an English word as-is; otherwise return an array of nils as long as the word.
        if word =~ /^[_0-9a-zA-Z\s]*$/
          word
        elsif word.respond_to? :force_encoding
          # word has been encoded in UTF-8 already
          [nil] * word.size
        else
          # For ruby 1.8, there is no native utf-8 support
          [nil] * word.unpack('U*').size
        end
      end

      # Overlays patch on base: a non-nil patch entry replaces the base reading at the same index.
      def apply(base, patch)
        result = []
        base.each_with_index do |char, i|
          if patch[i].nil?
            result.push char
          elsif char =~ Punctuation.regexp
            result.push Value.new("#{patch[i]}#{$1}", char.english?)
          else
            result.push Value.new(patch[i], char.english?)
          end
        end
        result
      end

    end
  end
end
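
The base-and-patch idea used by `MMSeg#romanize` and `apply` can be illustrated without rmmseg. This sketch assumes the text has already been segmented into words and uses two tiny hypothetical tables, `CHAR_READINGS` and `WORD_READINGS`, in place of the gem's data files. It leaves out punctuation and English handling, and the readings are illustrative.

```ruby
# Hypothetical per-character table (first reading only) and per-word table.
CHAR_READINGS = { '市' => 'shì', '长' => 'zhǎng', '江' => 'jiāng' }.freeze
WORD_READINGS = { '长江' => 'cháng jiāng' }.freeze

def romanize_words(words)
  # base: one reading per character, ignoring context
  base = words.join.chars.map { |c| CHAR_READINGS[c] }
  # patch: word-level readings, or nils where the word is not in the table
  patch = words.flat_map do |w|
    reading = WORD_READINGS[w]
    reading ? reading.split(' ') : [nil] * w.size
  end
  # a non-nil patch entry wins over the base reading at the same index
  base.each_with_index.map { |b, i| patch[i] || b }
end

p romanize_words(%w[市 长江])
```

Because both arrays are aligned by character index, a word the dictionary does not know simply falls through to the per-character reading.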
data/lib/ruby-pinyin/backend/simple.rb
ADDED
@@ -0,0 +1,72 @@
# -*- coding: utf-8 -*-

module PinYin
  module Backend
    # Character-by-character backend that always uses the first listed reading.
    class Simple

      def initialize(override_files=[])
        @override_files = override_files || []
      end

      def romanize(str, tone=nil, include_punctuations=false)
        res = []
        return res unless str && !str.empty?

        str.unpack('U*').each do |t|
          # Data files are keyed by upper-case hex code point.
          code = sprintf('%x',t).upcase
          readings = codes[code]

          if readings
            res << Value.new(format(readings, tone), false)
          else
            val = [t].pack('U*')
            if val =~ /^[0-9a-zA-Z\s]*$/ # keep ASCII letters, digits and whitespace; drop special characters such as full-width symbols
              if res.last && res.last.english?
                res.last << Value.new(val, true)
              elsif val != ' '
                res << Value.new(val, true)
              end
            elsif include_punctuations
              val = [Punctuation[code]].pack('H*') if Punctuation.include?(code)
              (res.last ? res.last : res) << Value.new(val, false)
            end
          end
        end

        # English runs may contain whitespace; split them into separate words.
        res.map {|phrase| phrase.split(/\s+/)}.flatten
      end

      private

      def codes
        return @codes if @codes

        @codes = {}
        src = File.expand_path('../../data/Mandarin.dat', __FILE__)
        # Bundled data first, then overrides. Build a new array so the
        # caller's override list is not mutated.
        ([src] + @override_files).each do |file|
          load_codes_from(file)
        end
        @codes
      end

      def load_codes_from(file)
        File.readlines(file).each do |line|
          code, readings = line.split(' ')
          next unless code && readings # skip blank or malformed lines
          @codes[code] = readings.split(',')
        end
      end

      def format(readings, tone)
        case tone
        when :unicode
          readings[0]
        when :ascii, true
          PinYin::Util.to_ascii(readings[0])
        else
          PinYin::Util.to_ascii(readings[0], false)
        end
      end

    end
  end
end
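
`format` delegates tone handling to `PinYin::Util.to_ascii`, which lives in `util.rb` and is not shown in this file. As a rough illustration of what the `tone` modes produce, here is a sketch built on Unicode normalization. `to_numbered` and `TONE_MARKS` are hypothetical names and this is not the gem's implementation; it only recognizes the four combining tone marks and leaves unmarked syllables without a digit.

```ruby
# Combining marks for tones 1-4: macron, acute, caron, grave.
TONE_MARKS = { "\u0304" => 1, "\u0301" => 2, "\u030C" => 3, "\u0300" => 4 }.freeze

# Hypothetical helper: "jié" -> "jie2" (with_tone) or "jie" (without).
def to_numbered(reading, with_tone = true)
  decomposed = reading.unicode_normalize(:nfd)  # split letters from their marks
  tone = nil
  plain = decomposed.each_char.reject do |ch|
    tone = TONE_MARKS[ch] if TONE_MARKS.key?(ch)
    TONE_MARKS.key?(ch)                         # drop the tone mark itself
  end.join.unicode_normalize(:nfc)              # recompose what remains, e.g. ü
  with_tone && tone ? "#{plain}#{tone}" : plain
end

p to_numbered('jié')
p to_numbered('cāo', false)
```

Decomposing first means the tone mark can be removed without a lookup table of every accented vowel, and the diaeresis on ü survives because it is not in `TONE_MARKS`.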