myanmar-tools 0.1.1 → 1.2.0

checksums.yaml CHANGED
@@ -1,7 +1,7 @@
  ---
  SHA256:
-   metadata.gz: 7fe5046c36d1fa57bb056fdd103f13c693342990e5df99b25b47a4e2330fd2ee
-   data.tar.gz: d73e7f38c95b093c74c185353040ee02436e9afd86e5ed2c21bd548586e5151c
+   metadata.gz: 0dbc0284872a618fddefb0d2a887f6fc5b94d4c90a7848ce0f5c72f00a6066da
+   data.tar.gz: 38b2ae3dd0df9a6225401675acf2288ed7690c11d597366e5666cc8b884a61e9
  SHA512:
-   metadata.gz: 10a8cc29a764b8c95eb14b4ca497a16ba4e54577c58860785f87ec019e2bc968f8ac41cedd8c88fd303429da22c8ee3f500af2240d635d273ee5e52a75f40738
-   data.tar.gz: df22cc42616d9e2ce6499f74fbe9e5bdb9f184c01be85d2ec3702ab7ba3a5716a6c822c0da7f7d55cd912a15a40f0f6e9f407a0eb57d7156cb66a5a108a6ad89
+   metadata.gz: 652287b770fba31364a0e2bc82f2d88e36c08fadcf91a4c1137609dccb674d19492f201e9cb31dd813c27fc64a26b9ae36d976621fe6f37996c65c47bc0dda02
+   data.tar.gz: 2766900bf91ff26e26b8b3235895e15ecd5b426023c7d43f91bca932c53fa1f502de8abd1d20e0231b0101560461fb53a13ff8e92689fd481583eb767802d457
data/.gitignore CHANGED
@@ -6,5 +6,6 @@
  /pkg/
  /spec/reports/
  /tmp/
+ /vendor/
  *.gem
  *.lock
data/README.md CHANGED
@@ -10,6 +10,8 @@ Add the dependency to your project:
 
  ```bash
  $ gem install myanmar-tools
+ or
+ $ bundle add myanmar-tools
  ```
 
  To detect Zawgyi, create an instance of ZawgyiDetector, and call `get_zawgyi_probability` with your string.
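
As a rough illustration of how the probability returned by `get_zawgyi_probability` might be used, the sketch below converts text only when the score clears a threshold. `normalize_to_unicode`, the 0.95 threshold, and both stub classes are hypothetical stand-ins so that the snippet runs without the gem installed. In real use the detector would be `MyanmarTools::ZawgyiDetector.new`, and the converter would be whatever Zawgyi-to-Unicode transliterator the project uses (this release's README points to `ICU::Transliterator`).

```ruby
# Hypothetical helper: convert only when the detector is confident the text is Zawgyi.
# `detector` must respond to get_zawgyi_probability(text); `converter` to transliterate(text).
def normalize_to_unicode(text, detector, converter, threshold = 0.95)
  score = detector.get_zawgyi_probability(text)
  # Scores below the threshold (including -Infinity, which the compatibility
  # data in this diff lists for plain ASCII input) leave the text unchanged.
  return text if score < threshold

  converter.transliterate(text)
end

# Stubs standing in for the real detector and converter (assumptions for the demo only).
class StubDetector
  def initialize(score)
    @score = score
  end

  def get_zawgyi_probability(_text)
    @score
  end
end

class StubConverter
  def transliterate(text)
    "converted(#{text})"
  end
end

puts normalize_to_unicode('abc', StubDetector.new(0.99), StubConverter.new)
puts normalize_to_unicode('abc', StubDetector.new(0.10), StubConverter.new)
```

The threshold is a policy choice rather than something the gem prescribes; the test expectations later in this diff show that short strings can score anywhere between 0 and 1.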
@@ -21,10 +23,28 @@ score = detector.get_zawgyi_probability('အျပည္ျပည္ဆို
  # score is 1.0 (The input is definitely Zawgyi)
  ```
 
+ To convert from Zawgyi to Unicode, you can use [the Ruby wrapper over icu4c](https://github.com/fantasticfears/icu4r). Install it like this:
+
+ ```bash
+ $ gem install icu
+ or
+ $ bundle add icu
+ ```
+
+ Then convert from Z to U like this:
+
+ ```ruby
+ require 'icu'
+ converter = ICU::Transliterator.new "Zawgyi-my"
+ output = converter.transliterate("မ္း")
+ # output is now "မ်း"
+ ```
+
  For a complete working example, see [samples/ruby/demo.rb](../../samples/ruby/demo.rb).
+
  ## Contributing
 
- Bug reports and pull requests are welcome on GitHub at https://github.com/googlei18n/myanmar-tools . This project is intended to be a safe, welcoming space for collaboration, and contributors are expected to adhere to the [Contributor Covenant](http://contributor-covenant.org) code of conduct.
+ Bug reports and pull requests are welcome on GitHub at https://github.com/google/myanmar-tools . This project is intended to be a safe, welcoming space for collaboration, and contributors are expected to adhere to the [Contributor Covenant](http://contributor-covenant.org) code of conduct.
 
  ## License
 
@@ -18,7 +18,7 @@ require 'myanmar-tools/zawgyi_unicode_markov_model'
 
  module MyanmarTools
    # Uses a machine learning model to determine whether a string of text is Zawgyi or Unicode.
-   # For more details and documentation, see https://github.com/googlei18n/myanmar-tools
+   # For more details and documentation, see https://github.com/google/myanmar-tools
    class ZawgyiDetector
      def initialize
        @model = ZawgyiUnicodeMarkovModel.new(stream_markov_model)
@@ -54,4 +54,4 @@ class ZawgyiDetector
  end
  end
  end
- end
+ end
@@ -1,42 +1,50 @@
  -Infinity all ascii
- 2.938530756915008E-11 ascii mixed with စင်္ကာပူမှာ unicode
- 1.0 ascii mixed with ဘဂၤလားေဒရွ္႕ zawgyi
+ 1.7578816048593336E-11 ascii mixed with စင်္ကာပူမှာ unicode
+ 0.9999999999720051 ascii mixed with ဘဂၤလားေဒရွ္႕ zawgyi
  -Infinity digits ၅၂၀
- 0.7892929982998882 လူခပ်သိမ်း၏
- 3.6119816361186886E-12 မျိုးရိုးဂုဏ်သိက္ခာနှင့်တကွ
- 0.4663329700445173 လူတို
- 4.9665655888566266E-6 င်း
- 0.23084563016290943 အညီ
- 0.0014668938180250133 အမျှခံ
- 3.5209271742118754E-16 စားခွင့်ရှိသည့်
- 2.3390629414495122E-15 အခွင့်အရေးများကို
- 1.5319184949968255E-21 အသိအမှတ်ပြုခြင်းသည်
- 0.058716887399297764 လူခပ်
- 0.07438825572428456 သိမ်း
- 0.9998813892221834 ၏လွ
- 3.8690513464199966E-12 တ်လပ်မှု၊
- 0.15418652297321492
- 0.646170406728457 ရား
- 4.1284781607674815E-10 မျှတမှု၊
- 5.985589691142374E-13 ငြိမ်းချမ်းမှုတို့၏
- 2.897899546774602E-42 အ ခြေခံ အုတ် မြစ်ဖြစ်သော ကြောင့် လည်းကော င်း၊
- 1.0 လူ႔အခြင့္ေရး
- 0.9999946000796034 မ်ားကို
- 0.9999970383777537 အေရးမ
- 0.9977308862728237 ထား မထီေ
- 0.9997356775204073 လး
- 0.7807064594381038 စား
- 0.9999999999926983 ျပဳျခ
+ 0.26211359963003 လူခပ်သိမ်း၏
+ 5.31323436438016E-17 မျိုးရိုးဂုဏ်သိက္ခာနှင့်တကွ
+ 0.4889728131312674 လူတို
+ 1.6642496914651517E-5 င်း
+ 0.11237116076150185 အညီ
+ 6.707728524305754E-4 အမျှခံ
+ 7.574188856564602E-16 စားခွင့်ရှိသည့်
+ 4.638074119608733E-16 အခွင့်အရေးများကို
+ 1.5487280820343697E-21 အသိအမှတ်ပြုခြင်းသည်
+ 0.016794132891985412 လူခပ်
+ 0.14479549474886305 သိမ်း
+ 0.9934952187669185 ၏လွ
+ 1.0003119485747386E-12 တ်လပ်မှု၊
+ 0.1341121039650929
+ 0.688567924875057 ရား
+ 3.606036708928397E-11 မျှတမှု၊
+ 3.2639839874100116E-13 ငြိမ်းချမ်းမှုတို့၏
+ 1.7756935949309309E-43 အ ခြေခံ အုတ် မြစ်ဖြစ်သော ကြောင့် လည်းကော င်း၊
+ 0.9999999999999065 လူ႔အခြင့္ေရး
+ 0.999992967794266 မ်ားကို
+ 0.9988885380499896 အေရးမ
+ 0.9161983282020472 ထား မထီေ
+ 0.9186801971030496 လး
+ 0.785076109857006 စား
+ 0.9999999994387101 ျပဳျခ
  0.9999999999999998 င္းသည္ လူခ
- 0.9999622852008538 ပ္
- 0.9998249697467181 သိမ္း
- 0.9964205019608555
- 0.9525739114028678
- 0.9999995405990166 က်င့္သိ
+ 0.9995136549000326 ပ္
+ 0.9997519059251643 သိမ္း
+ 0.9840808937738904
+ 0.4867781535978122
+ 0.999997345195692 က်င့္သိ
  1.0 ကၡာကို ခ်ိဳးေဖာက္
  1.0 ဖ်က္ဆီးတတ္သည့္ ရက္စက္ၾကမ္းၾကဳတ္ေသာ
- 0.9999999999933542 အျပဳအမွု
+ 0.99999994404868 အျပဳအမွု
  1.0 မ်ားကို ျဖစ္ေပၚေ
- 0.6167785052019471
+ 0.18826884503304805
  1.0 ခဲ့ေသာေၾကာင့္
  1.0 လည္းေကာင္း၊
+ 1.7562339552675883E-12 ထၢမ်တွပ်ႇ (shan language)
+ 2.0362467174145114E-13 ၶူးႁိုၼ် (shan language)
+ 1.825753617221113E-15 ဂလာန်ညးဒါန်လိက် (mon language)
+ 7.802298132172665E-15 ဍုၚ်မွဲမွဲ (mon language)
+ 3.859871377455363E-6 ဝိနယပိဋကေ (pali language)
+ 8.004254357250203E-4 ပါရာဇိကပါဠိ (pali language)
+ 1.3507380767721152E-12 တၢ်ကစီၣ် (karen language)
+ 1.0297557056138798E-21 တၢ်ထံၣ်တၢ်ပၥ်သး (karen language)
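
Each row in this data file pairs an expected score with an input string. A minimal sketch of reading such rows follows. It assumes the two fields are separated by a single whitespace character (the rendering here does not show whether that is a tab or a space), and `parse_compat_line` is a hypothetical name, not something the gem defines. `-Infinity` is handled separately because Ruby's `Float()` does not accept that spelling.

```ruby
# Hypothetical parser for one "score<whitespace>text" row of the compatibility data.
def parse_compat_line(line)
  # Split on the first whitespace only, so spaces inside the text are preserved.
  score_text, text = line.chomp.split(/\s/, 2)
  score = score_text == '-Infinity' ? -Float::INFINITY : Float(score_text)
  [score, text.to_s]
end

rows = [
  '-Infinity all ascii',
  '0.26211359963003 လူခပ်သိမ်း၏',
  '6.707728524305754E-4 အမျှခံ'
]

rows.each do |row|
  score, text = parse_compat_line(row)
  puts "#{score.inspect}\t#{text}"
end
```

Rows whose text is empty or invisible still parse; the text field simply comes back as an empty string.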
@@ -1,3 +1,3 @@
  module MyanmarTools
-   VERSION = '0.1.1'
+   VERSION = '1.2.0'
  end
@@ -30,9 +30,6 @@ class ZawgyiUnicodeMarkovModel
  # Magic number used to identify this object in byte streams. (Reads in ASCII as "UZMODEL ")
  BINARY_TAG = "555a4d4f44454c20"
 
- # Current serial format version number, used in association with the magic number.
- BINARY_VERSION = 1
-
  # Standard Myanmar code point range before digits
  STD_CP0 = 0x1000
  STD_CP1 = 0x103F
@@ -61,7 +58,7 @@ class ZawgyiUnicodeMarkovModel
  SPC_OFFSET = EXB_OFFSET + EXB_CP1 - EXB_CP0 + 1
  NUM_STATES = SPC_OFFSET + SPC_CP1 - SPC_CP0 + 1
 
- private_constant :BINARY_TAG, :BINARY_VERSION, :STD_CP0, :STD_CP1, :AFT_CP0, :AFT_CP1,
+ private_constant :BINARY_TAG, :STD_CP0, :STD_CP1, :AFT_CP0, :AFT_CP1,
  :EXA_CP0, :EXA_CP1, :EXB_CP0, :EXB_CP1, :SPC_CP0, :SPC_CP1, :STD_OFFSET,
  :AFT_OFFSET, :EXA_OFFSET, :EXB_OFFSET, :SPC_OFFSET, :NUM_STATES
 
@@ -74,8 +71,16 @@ class ZawgyiUnicodeMarkovModel
  end
 
  binary_version = stream.read(4).unpack('H*')[0].to_i
- if binary_version != BINARY_VERSION
-   raise "Unexpected serial version number: expected #{BINARY_VERSION} but got #{binary_version}"
+ if binary_version == 1
+   @ssv = 0
+ elsif binary_version == 2
+   # TODO: Support nonzero SSV if needed in the future
+   @ssv = stream.read(4).unpack('H*')[0].to_i
+   if @ssv != 0
+     raise "Unsupported ssv: #{@ssv}"
+   end
+ else
+   raise "Unexpected serial version number: expected 1 or 2 but got #{binary_version}"
  end
 
  @classifier = BinaryMarkov.new(stream)
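
The hunk above makes the loader accept serial version 1 (no SSV field) or version 2 (a 4-byte SSV field that must be zero). A standalone sketch of that branching follows. `read_model_header` is a hypothetical helper, and the tag comparison is assumed from the `BINARY_TAG` constant since the diff does not show that code. Note one deliberate substitution: the gem reads each field with `unpack('H*')[0].to_i`, which renders the bytes as a hex string and then parses that string as decimal. That is fine for 1 and 2 but would misread a value such as 10 (`"0000000a".to_i` is 0), so the sketch uses `unpack('N')` (32-bit big-endian unsigned) instead.

```ruby
require 'stringio'

# Hypothetical helper, not part of the gem: validates the 8-byte tag and
# returns the serial version plus the SSV field.
def read_model_header(stream)
  tag = stream.read(8).unpack('H*')[0]
  raise "Unexpected magic number: #{tag}" unless tag == '555a4d4f44454c20'

  version = stream.read(4).unpack('N')[0]
  case version
  when 1
    ssv = 0 # version 1 streams carry no SSV field
  when 2
    ssv = stream.read(4).unpack('N')[0]
    raise "Unsupported ssv: #{ssv}" unless ssv.zero?
  else
    raise "Unexpected serial version number: expected 1 or 2 but got #{version}"
  end
  { version: version, ssv: ssv }
end

# Build two tiny in-memory headers: "UZMODEL " followed by big-endian integers.
v1 = StringIO.new('UZMODEL '.b + [1].pack('N'))
v2 = StringIO.new('UZMODEL '.b + [2, 0].pack('N2'))
p read_model_header(v1)
p read_model_header(v2)
```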
@@ -11,7 +11,7 @@ Gem::Specification.new do |spec|
 
  spec.summary = 'Tools for handling the Zawgyi font encoding in Myanmar.'
  spec.description = 'Tools for handling the Zawgyi font encoding in Myanmar.'
- spec.homepage = 'https://github.com/googlei18n/myanmar-tools'
+ spec.homepage = 'https://github.com/google/myanmar-tools'
  spec.required_ruby_version = '>= 2.0.0'
  spec.license = 'Apache-2.0'
 
@@ -23,7 +23,7 @@ Gem::Specification.new do |spec|
  spec.executables = spec.files.grep(%r{^exe/}) { |f| File.basename(f) }
  spec.require_paths = ['lib']
 
- spec.add_development_dependency 'bundler', '~> 1.0'
+ spec.add_development_dependency 'bundler', '>= 1.0'
  spec.add_development_dependency 'rake', '~> 10.0'
  spec.add_development_dependency 'minitest', '~> 3.0'
  end
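
The bundler development dependency is relaxed here from `~> 1.0` to `>= 1.0`. The practical difference can be checked with RubyGems' own `Gem::Requirement`; the two version numbers below are arbitrary examples chosen for illustration.

```ruby
require 'rubygems'

pessimistic = Gem::Requirement.new('~> 1.0') # equivalent to >= 1.0 and < 2.0
open_ended  = Gem::Requirement.new('>= 1.0') # no upper bound

%w[1.17.3 2.1.4].each do |v|
  version = Gem::Version.new(v)
  puts "#{v}: ~> 1.0 => #{pessimistic.satisfied_by?(version)}, " \
       ">= 1.0 => #{open_ended.satisfied_by?(version)}"
end
```

With the old constraint a Bundler 2.x install would not satisfy the development dependency; with the new one it does.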
@@ -58,17 +58,17 @@ class ZawgyiDetectorTest < Test::Unit::TestCase
  cases = []
  # STRINGS IDENTICAL IN UNICODE/ZAWGYI
  cases << {cp: "အသံကို အစားထိုးလိုပါသလား။", score: 0.995}
- cases << {cp: "နမူနာ", score: 0.90}
- cases << {cp: " ဦး", score: 0.995}
+ cases << {cp: "နမူနာ", score: 0.26}
+ cases << {cp: " ဦး", score: 0.35}
 
  # UNICODE STRINGS WITH HIGH ZAWGYI SCORES
  cases << {cp: "အစားထိုး အထူးအက္ခရာ", score: 0.995}
  cases << {cp: "ယခု မိုးရွာနေပါသလား။", score: 0.995}
- cases << {cp: "အခြား", score: 0.63}
+ cases << {cp: "အခြား", score: 0.74}
 
  # DIFFICULT STRINGS THAT DETECT CORRECTLY
  cases << {cp: "ကာမစာအုပ္မ်ား(ေစာက္ပတ္စာအုပ္မ်ား)", score: 1.0}
- cases << {cp: "ညႇပ္စရာမလို", score: 1.0}
+ cases << {cp: "ညႇပ္စရာမလို", score: 0.82}
 
  cases.each{|c|
    result = @detector.get_zawgyi_probability(c[:cp])
metadata CHANGED
@@ -1,27 +1,27 @@
  --- !ruby/object:Gem::Specification
  name: myanmar-tools
  version: !ruby/object:Gem::Version
-   version: 0.1.1
+   version: 1.2.0
  platform: ruby
  authors:
  - Aung Kyaw Phyo
  autorequire:
  bindir: exe
  cert_chain: []
- date: 2018-03-19 00:00:00.000000000 Z
+ date: 2020-07-29 00:00:00.000000000 Z
  dependencies:
  - !ruby/object:Gem::Dependency
    name: bundler
    requirement: !ruby/object:Gem::Requirement
      requirements:
-     - - "~>"
+     - - ">="
        - !ruby/object:Gem::Version
          version: '1.0'
    type: :development
    prerelease: false
    version_requirements: !ruby/object:Gem::Requirement
      requirements:
-     - - "~>"
+     - - ">="
        - !ruby/object:Gem::Version
          version: '1.0'
  - !ruby/object:Gem::Dependency
@@ -75,7 +75,7 @@ files:
  - lib/myanmar-tools/zawgyi_unicode_markov_model.rb
  - myanmar-tools.gemspec
  - test/zawgyidetector_test.rb
- homepage: https://github.com/googlei18n/myanmar-tools
+ homepage: https://github.com/google/myanmar-tools
  licenses:
  - Apache-2.0
  metadata: {}
@@ -95,7 +95,7 @@ required_rubygems_version: !ruby/object:Gem::Requirement
        version: '0'
  requirements: []
  rubyforge_project:
- rubygems_version: 2.7.3
+ rubygems_version: 2.7.6.2
  signing_key:
  specification_version: 4
  summary: Tools for handling the Zawgyi font encoding in Myanmar.