myanmar-tools 0.1.1 → 1.2.0
- checksums.yaml +4 -4
- data/.gitignore +1 -0
- data/README.md +21 -1
- data/lib/myanmar-tools.rb +2 -2
- data/lib/myanmar-tools/resources/compatibility.tsv +42 -34
- data/lib/myanmar-tools/resources/zawgyiUnicodeModel.dat +0 -0
- data/lib/myanmar-tools/version.rb +1 -1
- data/lib/myanmar-tools/zawgyi_unicode_markov_model.rb +11 -6
- data/myanmar-tools.gemspec +2 -2
- data/test/zawgyidetector_test.rb +4 -4
- metadata +6 -6
checksums.yaml
CHANGED
@@ -1,7 +1,7 @@
 ---
 SHA256:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: 0dbc0284872a618fddefb0d2a887f6fc5b94d4c90a7848ce0f5c72f00a6066da
+  data.tar.gz: 38b2ae3dd0df9a6225401675acf2288ed7690c11d597366e5666cc8b884a61e9
 SHA512:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: 652287b770fba31364a0e2bc82f2d88e36c08fadcf91a4c1137609dccb674d19492f201e9cb31dd813c27fc64a26b9ae36d976621fe6f37996c65c47bc0dda02
+  data.tar.gz: 2766900bf91ff26e26b8b3235895e15ecd5b426023c7d43f91bca932c53fa1f502de8abd1d20e0231b0101560461fb53a13ff8e92689fd481583eb767802d457
data/.gitignore
CHANGED
data/README.md
CHANGED
@@ -10,6 +10,8 @@ Add the dependency to your project:
 
 ```bash
 $ gem install myanmar-tools
+or
+$ bundle add myanmar-tools
 ```
 
 To detect Zawgyi, create an instance of ZawgyiDetector, and call `get_zawgyi_probability` with your string.
@@ -21,10 +23,28 @@ score = detector.get_zawgyi_probability('အျပည္ျပည္ဆို
 # score is 1.0 (The input is definitely Zawgyi)
 ```
 
+To convert from Zawgyi to Unicode, you can use [the Ruby wrapper over icu4c](https://github.com/fantasticfears/icu4r). Install it like this:
+
+```bash
+$ gem install icu
+or
+$ bundle add icu
+```
+
+Then convert from Z to U like this:
+
+```ruby
+require 'icu'
+converter = ICU::Transliterator.new "Zawgyi-my"
+output = converter.transliterate("မ္း")
+# output is now "မ်း"
+```
+
 For a complete working example, see [samples/ruby/demo.rb](../../samples/ruby/demo.rb).
+
 ## Contributing
 
-Bug reports and pull requests are welcome on GitHub at https://github.com/
+Bug reports and pull requests are welcome on GitHub at https://github.com/google/myanmar-tools . This project is intended to be a safe, welcoming space for collaboration, and contributors are expected to adhere to the [Contributor Covenant](http://contributor-covenant.org) code of conduct.
 
 ## License
 
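The README changes above describe two separate steps: scoring a string with `get_zawgyi_probability` and converting it with an ICU transliterator. A minimal sketch of chaining the two might look like the following. The helper name `normalize_to_unicode` and the 0.95 threshold are assumptions made for illustration and are not part of the gem; the detector and converter are passed in as arguments so the logic does not depend on either gem being installed.

```ruby
# Hypothetical helper (not part of myanmar-tools): converts text to Unicode
# only when the detector is confident that the input is Zawgyi.
#
# detector  - any object responding to get_zawgyi_probability(string),
#             e.g. MyanmarTools::ZawgyiDetector.new
# converter - any object responding to transliterate(string),
#             e.g. ICU::Transliterator.new("Zawgyi-my")
# threshold - assumed cut-off; tune it for your own data
def normalize_to_unicode(text, detector, converter, threshold = 0.95)
  score = detector.get_zawgyi_probability(text)
  # Scores at or below the threshold leave the text untouched. This includes
  # a -Infinity score, as listed in compatibility.tsv for all-ASCII input.
  return text unless score > threshold

  converter.transliterate(text)
end
```

With both gems installed, the call would be `normalize_to_unicode(input, MyanmarTools::ZawgyiDetector.new, ICU::Transliterator.new("Zawgyi-my"))`. Short strings can score anywhere between 0 and 1, as the updated compatibility data shows, so a conservative threshold avoids converting text that is already Unicode.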
data/lib/myanmar-tools.rb
CHANGED
@@ -18,7 +18,7 @@ require 'myanmar-tools/zawgyi_unicode_markov_model'
 
 module MyanmarTools
   # Uses a machine learning model to determine whether a string of text is Zawgyi or Unicode.
-  # For more details and documentation, see https://github.com/
+  # For more details and documentation, see https://github.com/google/myanmar-tools
   class ZawgyiDetector
     def initialize
       @model = ZawgyiUnicodeMarkovModel.new(stream_markov_model)
@@ -54,4 +54,4 @@ class ZawgyiDetector
       end
     end
   end
-end
+end
data/lib/myanmar-tools/resources/compatibility.tsv
CHANGED
@@ -1,42 +1,50 @@
 -Infinity all ascii
-
-
+1.7578816048593336E-11 ascii mixed with စင်္ကာပူမှာ unicode
+0.9999999999720051 ascii mixed with ဘဂၤလားေဒရွ္႕ zawgyi
 -Infinity digits ၅၂၀
-0.
-
-0.
-
-0.
-
-
-
-1.
-0.
-0.
-0.
-
-0.
-0.
-
-
-
-
-0.
-0.
-0.
-0.
-0.
-0.
+0.26211359963003 လူခပ်သိမ်း၏
+5.31323436438016E-17 မျိုးရိုးဂုဏ်သိက္ခာနှင့်တကွ
+0.4889728131312674 လူတို
+1.6642496914651517E-5 င်း
+0.11237116076150185 အညီ
+6.707728524305754E-4 အမျှခံ
+7.574188856564602E-16 စားခွင့်ရှိသည့်
+4.638074119608733E-16 အခွင့်အရေးများကို
+1.5487280820343697E-21 အသိအမှတ်ပြုခြင်းသည်
+0.016794132891985412 လူခပ်
+0.14479549474886305 သိမ်း
+0.9934952187669185 ၏လွ
+1.0003119485747386E-12 တ်လပ်မှု၊
+0.1341121039650929 တ
+0.688567924875057 ရား
+3.606036708928397E-11 မျှတမှု၊
+3.2639839874100116E-13 ငြိမ်းချမ်းမှုတို့၏
+1.7756935949309309E-43 အ ခြေခံ အုတ် မြစ်ဖြစ်သော ကြောင့် လည်းကော င်း၊
+0.9999999999999065 လူ႔အခြင့္ေရး
+0.999992967794266 မ်ားကို
+0.9988885380499896 အေရးမ
+0.9161983282020472 ထား မထီေ
+0.9186801971030496 လး
+0.785076109857006 စား
+0.9999999994387101 ျပဳျခ
 0.9999999999999998 င္းသည္ လူခ
-0.
-0.
-0.
-0.
-0.
+0.9995136549000326 ပ္
+0.9997519059251643 သိမ္း
+0.9840808937738904 ၏
+0.4867781535978122 အ
+0.999997345195692 က်င့္သိ
 1.0 ကၡာကို ခ်ိဳးေဖာက္
 1.0 ဖ်က္ဆီးတတ္သည့္ ရက္စက္ၾကမ္းၾကဳတ္ေသာ
-0.
+0.99999994404868 အျပဳအမွု
 1.0 မ်ားကို ျဖစ္ေပၚေ
-0.
+0.18826884503304805 စ
 1.0 ခဲ့ေသာေၾကာင့္
 1.0 လည္းေကာင္း၊
+1.7562339552675883E-12 ထၢမ်တွပ်ႇ (shan language)
+2.0362467174145114E-13 ၶူးႁိုၼ် (shan language)
+1.825753617221113E-15 ဂလာန်ညးဒါန်လိက် (mon language)
+7.802298132172665E-15 ဍုၚ်မွဲမွဲ (mon language)
+3.859871377455363E-6 ဝိနယပိဋကေ (pali language)
+8.004254357250203E-4 ပါရာဇိကပါဠိ (pali language)
+1.3507380767721152E-12 တၢ်ကစီၣ် (karen language)
+1.0297557056138798E-21 တၢ်ထံၣ်တၢ်ပၥ်သး (karen language)
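Each row of `compatibility.tsv` pairs an expected score with an input string. A minimal sketch of reading such rows follows, assuming the two columns are separated by a single tab (the file extension suggests this, but the separator is not visible in the rendered diff). `parse_compatibility_line` is a hypothetical name, not something the gem provides. Ruby's `Float()` rejects the literal `-Infinity`, so the sketch maps it explicitly.

```ruby
# Hypothetical helper: parses one "score<TAB>text" row into [Float, String].
# Assumes a tab separator; only the first tab splits the row, so the sample
# text may itself contain spaces.
def parse_compatibility_line(line)
  score_text, sample = line.chomp.split("\t", 2)
  score =
    case score_text
    when '-Infinity' then -Float::INFINITY
    when 'Infinity'  then Float::INFINITY
    else Float(score_text) # raises ArgumentError on malformed numbers
    end
  [score, sample]
end
```

Rows such as `-Infinity all ascii` then yield `-Float::INFINITY`, which can be compared directly against the detector's result in a test loop.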
data/lib/myanmar-tools/resources/zawgyiUnicodeModel.dat
CHANGED
Binary file
data/lib/myanmar-tools/zawgyi_unicode_markov_model.rb
CHANGED
@@ -30,9 +30,6 @@ class ZawgyiUnicodeMarkovModel
   # Magic number used to identify this object in byte streams. (Reads in ASCII as "UZMODEL ")
   BINARY_TAG = "555a4d4f44454c20"
 
-  # Current serial format version number, used in association with the magic number.
-  BINARY_VERSION = 1
-
   # Standard Myanmar code point range before digits
   STD_CP0 = 0x1000
   STD_CP1 = 0x103F
@@ -61,7 +58,7 @@ class ZawgyiUnicodeMarkovModel
   SPC_OFFSET = EXB_OFFSET + EXB_CP1 - EXB_CP0 + 1
   NUM_STATES = SPC_OFFSET + SPC_CP1 - SPC_CP0 + 1
 
-  private_constant :BINARY_TAG, :
+  private_constant :BINARY_TAG, :STD_CP0, :STD_CP1, :AFT_CP0, :AFT_CP1,
     :EXA_CP0, :EXA_CP1, :EXB_CP0, :EXB_CP1, :SPC_CP0, :SPC_CP1, :STD_OFFSET,
     :AFT_OFFSET, :EXA_OFFSET, :EXB_OFFSET, :SPC_OFFSET, :NUM_STATES
 
@@ -74,8 +71,16 @@ class ZawgyiUnicodeMarkovModel
     end
 
     binary_version = stream.read(4).unpack('H*')[0].to_i
-    if binary_version
-
+    if binary_version == 1
+      @ssv = 0
+    elsif binary_version == 2
+      # TODO: Support nonzero SSV if needed in the future
+      @ssv = stream.read(4).unpack('H*')[0].to_i
+      if @ssv != 0
+        raise "Unsupported ssv: #{@ssv}"
+      end
+    else
+      raise "Unexpected serial version number: expected 1 or 2 but got #{binary_version}"
     end
 
     @classifier = BinaryMarkov.new(stream)
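The version check above decodes each 4-byte field with `unpack('H*')[0].to_i`, which renders the bytes as a hex string and then parses that string as decimal. That gives the expected result for versions 1 and 2, but a value such as 0x10 would be read as 10. The sketch below shows the same header check with a direct 32-bit decode. It rests on several assumptions: that the 8-byte tag comes immediately before the version (the tag read itself is outside the hunks shown), that the fields are big-endian, and `read_model_header` and `MODEL_TAG` are hypothetical names that do not appear in the gem.

```ruby
require 'stringio'

# ASCII form of the hex tag 555a4d4f44454c20 from the diff above.
MODEL_TAG = 'UZMODEL '.b

# Hypothetical sketch: reads the model header and returns
# [binary_version, ssv], raising on an unknown tag, version, or SSV.
def read_model_header(stream)
  tag = stream.read(8)
  raise "Unexpected magic number: #{tag.inspect}" unless tag == MODEL_TAG

  # 'N' decodes a big-endian unsigned 32-bit integer (assumed byte order).
  binary_version = stream.read(4).unpack('N').first
  ssv =
    case binary_version
    when 1 then 0 # version 1 carries no SSV field
    when 2 then stream.read(4).unpack('N').first
    else raise "Unexpected serial version number: expected 1 or 2 but got #{binary_version}"
    end
  raise "Unsupported ssv: #{ssv}" unless ssv.zero?

  [binary_version, ssv]
end
```

Calling `read_model_header(StringIO.new(bytes))` on an in-memory buffer makes the version handling testable without the bundled `.dat` file.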
data/myanmar-tools.gemspec
CHANGED
@@ -11,7 +11,7 @@ Gem::Specification.new do |spec|
 
   spec.summary = 'Tools for handling the Zawgyi font encoding in Myanmar.'
   spec.description = 'Tools for handling the Zawgyi font encoding in Myanmar.'
-  spec.homepage = 'https://github.com/
+  spec.homepage = 'https://github.com/google/myanmar-tools'
   spec.required_ruby_version = '>= 2.0.0'
   spec.license = 'Apache-2.0'
 
@@ -23,7 +23,7 @@ Gem::Specification.new do |spec|
   spec.executables = spec.files.grep(%r{^exe/}) { |f| File.basename(f) }
   spec.require_paths = ['lib']
 
-  spec.add_development_dependency 'bundler', '
+  spec.add_development_dependency 'bundler', '>= 1.0'
   spec.add_development_dependency 'rake', '~> 10.0'
   spec.add_development_dependency 'minitest', '~> 3.0'
 end
data/test/zawgyidetector_test.rb
CHANGED
@@ -58,17 +58,17 @@ class ZawgyiDetectorTest < Test::Unit::TestCase
     cases = []
     # STRINGS IDENTICAL IN UNICODE/ZAWGYI
     cases << {cp: "အသံကို အစားထိုးလိုပါသလား။", score: 0.995}
-    cases << {cp: "နမူနာ", score: 0.
-    cases << {cp: " ဦး", score: 0.
+    cases << {cp: "နမူနာ", score: 0.26}
+    cases << {cp: " ဦး", score: 0.35}
 
     # UNICODE STRINGS WITH HIGH ZAWGYI SCORES
     cases << {cp: "အစားထိုး အထူးအက္ခရာ", score: 0.995}
     cases << {cp: "ယခု မိုးရွာနေပါသလား။", score: 0.995}
-    cases << {cp: "အခြား", score: 0.
+    cases << {cp: "အခြား", score: 0.74}
 
     # DIFFICULT STRINGS THAT DETECT CORRECTLY
     cases << {cp: "ကာမစာအုပ္မ်ား(ေစာက္ပတ္စာအုပ္မ်ား)", score: 1.0}
-    cases << {cp: "ညႇပ္စရာမလို", score:
+    cases << {cp: "ညႇပ္စရာမလို", score: 0.82}
 
     cases.each{|c|
       result = @detector.get_zawgyi_probability(c[:cp])
metadata
CHANGED
@@ -1,27 +1,27 @@
 --- !ruby/object:Gem::Specification
 name: myanmar-tools
 version: !ruby/object:Gem::Version
-  version:
+  version: 1.2.0
 platform: ruby
 authors:
 - Aung Kyaw Phyo
 autorequire:
 bindir: exe
 cert_chain: []
-date:
+date: 2020-07-29 00:00:00.000000000 Z
 dependencies:
 - !ruby/object:Gem::Dependency
   name: bundler
   requirement: !ruby/object:Gem::Requirement
     requirements:
-    - - "
+    - - ">="
       - !ruby/object:Gem::Version
         version: '1.0'
   type: :development
   prerelease: false
   version_requirements: !ruby/object:Gem::Requirement
     requirements:
-    - - "
+    - - ">="
       - !ruby/object:Gem::Version
         version: '1.0'
 - !ruby/object:Gem::Dependency
@@ -75,7 +75,7 @@ files:
 - lib/myanmar-tools/zawgyi_unicode_markov_model.rb
 - myanmar-tools.gemspec
 - test/zawgyidetector_test.rb
-homepage: https://github.com/
+homepage: https://github.com/google/myanmar-tools
 licenses:
 - Apache-2.0
 metadata: {}
@@ -95,7 +95,7 @@ required_rubygems_version: !ruby/object:Gem::Requirement
       version: '0'
 requirements: []
 rubyforge_project:
-rubygems_version: 2.7.
+rubygems_version: 2.7.6.2
 signing_key:
 specification_version: 4
 summary: Tools for handling the Zawgyi font encoding in Myanmar.