myanmar-tools 0.1.1 → 1.2.0
- checksums.yaml +4 -4
- data/.gitignore +1 -0
- data/README.md +21 -1
- data/lib/myanmar-tools.rb +2 -2
- data/lib/myanmar-tools/resources/compatibility.tsv +42 -34
- data/lib/myanmar-tools/resources/zawgyiUnicodeModel.dat +0 -0
- data/lib/myanmar-tools/version.rb +1 -1
- data/lib/myanmar-tools/zawgyi_unicode_markov_model.rb +11 -6
- data/myanmar-tools.gemspec +2 -2
- data/test/zawgyidetector_test.rb +4 -4
- metadata +6 -6
checksums.yaml
CHANGED
@@ -1,7 +1,7 @@
 ---
 SHA256:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: 0dbc0284872a618fddefb0d2a887f6fc5b94d4c90a7848ce0f5c72f00a6066da
+  data.tar.gz: 38b2ae3dd0df9a6225401675acf2288ed7690c11d597366e5666cc8b884a61e9
 SHA512:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: 652287b770fba31364a0e2bc82f2d88e36c08fadcf91a4c1137609dccb674d19492f201e9cb31dd813c27fc64a26b9ae36d976621fe6f37996c65c47bc0dda02
+  data.tar.gz: 2766900bf91ff26e26b8b3235895e15ecd5b426023c7d43f91bca932c53fa1f502de8abd1d20e0231b0101560461fb53a13ff8e92689fd481583eb767802d457
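checksums.yaml records SHA256 and SHA512 digests for the two archives packed inside the .gem file. The following is a minimal sketch of checking an extracted archive against one of those values, assuming Ruby's standard `digest` library. The helper name `digest_matches?` and the file path are hypothetical and are not part of myanmar-tools.

```ruby
require 'digest'

# Hypothetical helper: returns true when the file at `path` hashes to
# `expected_hex` under the given digest class (SHA256 by default).
def digest_matches?(path, expected_hex, algorithm: Digest::SHA256)
  # Digest classes can hash a file directly; hexdigest returns lowercase hex.
  algorithm.file(path).hexdigest == expected_hex.downcase
end

# Example call (the path is a placeholder for an archive extracted from the gem):
#   expected = '38b2ae3dd0df9a6225401675acf2288ed7690c11d597366e5666cc8b884a61e9'
#   digest_matches?('data.tar.gz', expected)
#   digest_matches?('data.tar.gz', sha512_value, algorithm: Digest::SHA512)
```

Passing the digest class as a keyword keeps one helper usable for both sections of checksums.yaml.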
data/.gitignore
CHANGED
data/README.md
CHANGED
@@ -10,6 +10,8 @@ Add the dependency to your project:
 
 ```bash
 $ gem install myanmar-tools
+or
+$ bundle add myanmar-tools
 ```
 
 To detect Zawgyi, create an instance of ZawgyiDetector, and call `get_zawgyi_probability` with your string.
@@ -21,10 +23,28 @@ score = detector.get_zawgyi_probability('အျပည္ျပည္ဆို
 # score is 1.0 (The input is definitely Zawgyi)
 ```
 
+To convert from Zawgyi to Unicode, you can use [the Ruby wrapper over icu4c](https://github.com/fantasticfears/icu4r). Install it like this:
+
+```bash
+$ gem install icu
+or
+$ bundle add icu
+```
+
+Then convert from Z to U like this:
+
+```ruby
+require 'icu'
+converter = ICU::Transliterator.new "Zawgyi-my"
+output = converter.transliterate("မ္း")
+# output is now "မ်း"
+```
+
 For a complete working example, see [samples/ruby/demo.rb](../../samples/ruby/demo.rb).
+
 ## Contributing
 
-Bug reports and pull requests are welcome on GitHub at https://github.com/
+Bug reports and pull requests are welcome on GitHub at https://github.com/google/myanmar-tools . This project is intended to be a safe, welcoming space for collaboration, and contributors are expected to adhere to the [Contributor Covenant](http://contributor-covenant.org) code of conduct.
 
 ## License
 
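The README changes above describe two separate steps: scoring text with `get_zawgyi_probability` and converting it with an ICU transliterator. A minimal sketch of how the two might be combined follows. The helper `normalize_to_unicode` and the 0.95 threshold are assumptions for illustration, not part of either gem; the detector and converter are passed in so that the helper only relies on the two method names shown in the README.

```ruby
# Hypothetical helper, not part of myanmar-tools: converts text only when the
# detector is confident that it is Zawgyi.
#
# detector  - any object responding to get_zawgyi_probability(String)
# converter - any object responding to transliterate(String)
# threshold - assumed cut-off; tune it for your own data
def normalize_to_unicode(text, detector:, converter:, threshold: 0.95)
  score = detector.get_zawgyi_probability(text)
  # Scores below the threshold (including -Infinity, which the compatibility
  # data lists for all-ASCII input) leave the text untouched.
  return text unless score >= threshold

  converter.transliterate(text)
end

# With both gems installed, the wiring would look roughly like this:
#   require 'myanmar-tools'
#   require 'icu'
#   detector  = MyanmarTools::ZawgyiDetector.new
#   converter = ICU::Transliterator.new "Zawgyi-my"
#   normalize_to_unicode(input, detector: detector, converter: converter)
```

Short strings can score near the middle of the range, so a conservative threshold avoids converting text that is already Unicode.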
data/lib/myanmar-tools.rb
CHANGED
@@ -18,7 +18,7 @@ require 'myanmar-tools/zawgyi_unicode_markov_model'
 
 module MyanmarTools
   # Uses a machine learning model to determine whether a string of text is Zawgyi or Unicode.
-  # For more details and documentation, see https://github.com/
+  # For more details and documentation, see https://github.com/google/myanmar-tools
   class ZawgyiDetector
     def initialize
       @model = ZawgyiUnicodeMarkovModel.new(stream_markov_model)
@@ -54,4 +54,4 @@ class ZawgyiDetector
       end
     end
   end
-end
+end
data/lib/myanmar-tools/resources/compatibility.tsv
CHANGED
@@ -1,42 +1,50 @@
 -Infinity all ascii
-
-
+1.7578816048593336E-11 ascii mixed with စင်္ကာပူမှာ unicode
+0.9999999999720051 ascii mixed with ဘဂၤလားေဒရွ္႕ zawgyi
 -Infinity digits ၅၂၀
-0.
-
-0.
-
-0.
-
-
-
-1.
-0.
-0.
-0.
-
-0.
-0.
-
-
-
-
-0.
-0.
-0.
-0.
-0.
-0.
+0.26211359963003 လူခပ်သိမ်း၏
+5.31323436438016E-17 မျိုးရိုးဂုဏ်သိက္ခာနှင့်တကွ
+0.4889728131312674 လူတို
+1.6642496914651517E-5 င်း
+0.11237116076150185 အညီ
+6.707728524305754E-4 အမျှခံ
+7.574188856564602E-16 စားခွင့်ရှိသည့်
+4.638074119608733E-16 အခွင့်အရေးများကို
+1.5487280820343697E-21 အသိအမှတ်ပြုခြင်းသည်
+0.016794132891985412 လူခပ်
+0.14479549474886305 သိမ်း
+0.9934952187669185 ၏လွ
+1.0003119485747386E-12 တ်လပ်မှု၊
+0.1341121039650929 တ
+0.688567924875057 ရား
+3.606036708928397E-11 မျှတမှု၊
+3.2639839874100116E-13 ငြိမ်းချမ်းမှုတို့၏
+1.7756935949309309E-43 အ ခြေခံ အုတ် မြစ်ဖြစ်သော ကြောင့် လည်းကော င်း၊
+0.9999999999999065 လူ႔အခြင့္ေရး
+0.999992967794266 မ်ားကို
+0.9988885380499896 အေရးမ
+0.9161983282020472 ထား မထီေ
+0.9186801971030496 လး
+0.785076109857006 စား
+0.9999999994387101 ျပဳျခ
 0.9999999999999998 င္းသည္ လူခ
-0.
-0.
-0.
-0.
-0.
+0.9995136549000326 ပ္
+0.9997519059251643 သိမ္း
+0.9840808937738904 ၏
+0.4867781535978122 အ
+0.999997345195692 က်င့္သိ
 1.0 ကၡာကို ခ်ိဳးေဖာက္
 1.0 ဖ်က္ဆီးတတ္သည့္ ရက္စက္ၾကမ္းၾကဳတ္ေသာ
-0.
+0.99999994404868 အျပဳအမွု
 1.0 မ်ားကို ျဖစ္ေပၚေ
-0.
+0.18826884503304805 စ
 1.0 ခဲ့ေသာေၾကာင့္
 1.0 လည္းေကာင္း၊
+1.7562339552675883E-12 ထၢမ်တွပ်ႇ (shan language)
+2.0362467174145114E-13 ၶူးႁိုၼ် (shan language)
+1.825753617221113E-15 ဂလာန်ညးဒါန်လိက် (mon language)
+7.802298132172665E-15 ဍုၚ်မွဲမွဲ (mon language)
+3.859871377455363E-6 ဝိနယပိဋကေ (pali language)
+8.004254357250203E-4 ပါရာဇိကပါဠိ (pali language)
+1.3507380767721152E-12 တၢ်ကစီၣ် (karen language)
+1.0297557056138798E-21 တၢ်ထံၣ်တၢ်ပၥ်သး (karen language)
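Each row of the compatibility data pairs an expected Zawgyi probability with a test string, and some scores are written as `-Infinity`, which Ruby's `Float()` does not parse. The sketch below shows one way such rows might be read and compared. It assumes a tab separator (inferred from the .tsv extension); `parse_compat_line`, `scores_match?`, and the tolerance value are hypothetical names and choices, not taken from the gem.

```ruby
# Hypothetical parser for one "score<TAB>text" row.
# The tab separator is an assumption based on the .tsv extension.
def parse_compat_line(line)
  score_str, text = line.chomp.split("\t", 2)
  score = case score_str
          when '-Infinity' then -Float::INFINITY
          when 'Infinity'  then Float::INFINITY
          else Float(score_str) # accepts forms such as 5.31323436438016E-17
          end
  [score, text]
end

# Compares an expected score with a detector result. Infinite values must
# match exactly; finite values may differ by a small, assumed tolerance.
def scores_match?(expected, actual, tolerance = 1e-6)
  return expected == actual if expected.infinite? || actual.infinite?

  (expected - actual).abs <= tolerance
end

# Example loop (the path and detector wiring are placeholders):
#   File.foreach('compatibility.tsv', encoding: 'UTF-8') do |line|
#     expected, text = parse_compat_line(line)
#     actual = detector.get_zawgyi_probability(text)
#     warn "mismatch for #{text}" unless scores_match?(expected, actual)
#   end
```

Limiting the split to two fields keeps test strings that contain spaces or further tabs intact.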
data/lib/myanmar-tools/resources/zawgyiUnicodeModel.dat
CHANGED
Binary file
data/lib/myanmar-tools/zawgyi_unicode_markov_model.rb
CHANGED
@@ -30,9 +30,6 @@ class ZawgyiUnicodeMarkovModel
 # Magic number used to identify this object in byte streams. (Reads in ASCII as "UZMODEL ")
 BINARY_TAG = "555a4d4f44454c20"
 
-# Current serial format version number, used in association with the magic number.
-BINARY_VERSION = 1
-
 # Standard Myanmar code point range before digits
 STD_CP0 = 0x1000
 STD_CP1 = 0x103F
@@ -61,7 +58,7 @@ class ZawgyiUnicodeMarkovModel
 SPC_OFFSET = EXB_OFFSET + EXB_CP1 - EXB_CP0 + 1
 NUM_STATES = SPC_OFFSET + SPC_CP1 - SPC_CP0 + 1
 
-private_constant :BINARY_TAG, :
+private_constant :BINARY_TAG, :STD_CP0, :STD_CP1, :AFT_CP0, :AFT_CP1,
   :EXA_CP0, :EXA_CP1, :EXB_CP0, :EXB_CP1, :SPC_CP0, :SPC_CP1, :STD_OFFSET,
   :AFT_OFFSET, :EXA_OFFSET, :EXB_OFFSET, :SPC_OFFSET, :NUM_STATES
 
@@ -74,8 +71,16 @@ class ZawgyiUnicodeMarkovModel
 end
 
 binary_version = stream.read(4).unpack('H*')[0].to_i
-if binary_version
-
+if binary_version == 1
+  @ssv = 0
+elsif binary_version == 2
+  # TODO: Support nonzero SSV if needed in the future
+  @ssv = stream.read(4).unpack('H*')[0].to_i
+  if @ssv != 0
+    raise "Unsupported ssv: #{@ssv}"
+  end
+else
+  raise "Unexpected serial version number: expected 1 or 2 but got #{binary_version}"
 end
 
 @classifier = BinaryMarkov.new(stream)
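The change above makes the loader accept serial version 1 (no SSV field, treated as 0) or version 2 (an SSV field that must currently be 0). The sketch below restates that branching as a standalone function. It assumes the stream begins with the 8-byte tag followed by big-endian 32-bit fields, and it decodes integers with `unpack('N')` rather than the gem's `unpack('H*')[0].to_i`; `read_model_header` is a hypothetical name, and this is not the gem's actual implementation.

```ruby
require 'stringio'

# Hypothetical standalone reader for the model header.
# Assumed layout: 8-byte tag, 4-byte version, then (version 2 only) 4-byte SSV,
# with all integers stored big-endian.
def read_model_header(stream)
  tag_hex = stream.read(8).to_s.unpack('H*')[0]
  # "555a4d4f44454c20" is the hex form of the ASCII string "UZMODEL ".
  raise "Unexpected magic number: #{tag_hex}" unless tag_hex == '555a4d4f44454c20'

  version = stream.read(4).to_s.unpack('N')[0]
  ssv = case version
        when 1 then 0 # version 1 has no SSV field
        when 2 then stream.read(4).to_s.unpack('N')[0]
        else raise "Unexpected serial version number: expected 1 or 2 but got #{version.inspect}"
        end
  raise "Unsupported ssv: #{ssv.inspect}" unless ssv == 0

  { version: version, ssv: ssv }
end

# Usage with an in-memory version 2 header:
header = StringIO.new('UZMODEL ' + [2, 0].pack('N2'))
result = read_model_header(header)
puts result[:version]
```

Decoding with `'N'` reads the field as an unsigned integer directly, whereas converting the hex string with `to_i` parses it as decimal and only agrees for small values such as 1 and 2.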
data/myanmar-tools.gemspec
CHANGED
@@ -11,7 +11,7 @@ Gem::Specification.new do |spec|
 
   spec.summary = 'Tools for handling the Zawgyi font encoding in Myanmar.'
   spec.description = 'Tools for handling the Zawgyi font encoding in Myanmar.'
-  spec.homepage = 'https://github.com/
+  spec.homepage = 'https://github.com/google/myanmar-tools'
   spec.required_ruby_version = '>= 2.0.0'
   spec.license = 'Apache-2.0'
 
@@ -23,7 +23,7 @@ Gem::Specification.new do |spec|
   spec.executables = spec.files.grep(%r{^exe/}) { |f| File.basename(f) }
   spec.require_paths = ['lib']
 
-  spec.add_development_dependency 'bundler', '
+  spec.add_development_dependency 'bundler', '>= 1.0'
   spec.add_development_dependency 'rake', '~> 10.0'
   spec.add_development_dependency 'minitest', '~> 3.0'
 end
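The new bundler constraint `'>= 1.0'` is open-ended, while the unchanged rake constraint `'~> 10.0'` is pessimistic. The short sketch below uses RubyGems' own `Gem::Requirement` and `Gem::Version` classes to show which versions each constraint accepts. The helper `satisfies?` and the sample version numbers are illustrative only.

```ruby
require 'rubygems'

# Illustrative helper: does `version` satisfy the gemspec-style `constraint`?
def satisfies?(constraint, version)
  Gem::Requirement.new(constraint).satisfied_by?(Gem::Version.new(version))
end

puts satisfies?('>= 1.0', '1.17.3')   # true: any 1.x release from 1.0 upward
puts satisfies?('>= 1.0', '2.1.4')    # true: there is no upper bound
puts satisfies?('~> 10.0', '10.5.0')  # true: at least 10.0 and below 11.0
puts satisfies?('~> 10.0', '11.0.0')  # false: 11.0 is outside the range
```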
data/test/zawgyidetector_test.rb
CHANGED
@@ -58,17 +58,17 @@ class ZawgyiDetectorTest < Test::Unit::TestCase
 cases = []
 # STRINGS IDENTICAL IN UNICODE/ZAWGYI
 cases << {cp: "အသံကို အစားထိုးလိုပါသလား။", score: 0.995}
-cases << {cp: "နမူနာ", score: 0.
-cases << {cp: " ဦး", score: 0.
+cases << {cp: "နမူနာ", score: 0.26}
+cases << {cp: " ဦး", score: 0.35}
 
 # UNICODE STRINGS WITH HIGH ZAWGYI SCORES
 cases << {cp: "အစားထိုး အထူးအက္ခရာ", score: 0.995}
 cases << {cp: "ယခု မိုးရွာနေပါသလား။", score: 0.995}
-cases << {cp: "အခြား", score: 0.
+cases << {cp: "အခြား", score: 0.74}
 
 # DIFFICULT STRINGS THAT DETECT CORRECTLY
 cases << {cp: "ကာမစာအုပ္မ်ား(ေစာက္ပတ္စာအုပ္မ်ား)", score: 1.0}
-cases << {cp: "ညႇပ္စရာမလို", score:
+cases << {cp: "ညႇပ္စရာမလို", score: 0.82}
 
 cases.each{|c|
   result = @detector.get_zawgyi_probability(c[:cp])
metadata
CHANGED
@@ -1,27 +1,27 @@
 --- !ruby/object:Gem::Specification
 name: myanmar-tools
 version: !ruby/object:Gem::Version
-  version:
+  version: 1.2.0
 platform: ruby
 authors:
 - Aung Kyaw Phyo
 autorequire:
 bindir: exe
 cert_chain: []
-date:
+date: 2020-07-29 00:00:00.000000000 Z
 dependencies:
 - !ruby/object:Gem::Dependency
   name: bundler
   requirement: !ruby/object:Gem::Requirement
     requirements:
-    - - "
+    - - ">="
       - !ruby/object:Gem::Version
         version: '1.0'
   type: :development
   prerelease: false
   version_requirements: !ruby/object:Gem::Requirement
     requirements:
-    - - "
+    - - ">="
       - !ruby/object:Gem::Version
         version: '1.0'
 - !ruby/object:Gem::Dependency
@@ -75,7 +75,7 @@ files:
 - lib/myanmar-tools/zawgyi_unicode_markov_model.rb
 - myanmar-tools.gemspec
 - test/zawgyidetector_test.rb
-homepage: https://github.com/
+homepage: https://github.com/google/myanmar-tools
 licenses:
 - Apache-2.0
 metadata: {}
@@ -95,7 +95,7 @@ required_rubygems_version: !ruby/object:Gem::Requirement
       version: '0'
 requirements: []
 rubyforge_project:
-rubygems_version: 2.7.
+rubygems_version: 2.7.6.2
 signing_key:
 specification_version: 4
 summary: Tools for handling the Zawgyi font encoding in Myanmar.