gurmukhi_utils 0.0.1

checksums.yaml ADDED
@@ -0,0 +1,7 @@
1
+ ---
2
+ SHA256:
3
+ metadata.gz: dbb164eb5abec25b925649743ac6ae8797bce240903a42e13ee7bca4cd5635bc
4
+ data.tar.gz: 62fd647545eee43081089e407517bab0c00f3136a37862a82a78e7e276a9262f
5
+ SHA512:
6
+ metadata.gz: 314ce8d8a0149fa9d7327c44b8b0c145e2a0bfb031704d2b8843bfa636873d45fbf198dfe7666511242b3f0f788291beae2e9dd4f8a8d980fc6b99017a731424
7
+ data.tar.gz: 4d92bda8fe9cb31cbe430812d1953e517bc302acc5fa6e938bcf6197c05bd3dae39345ba723d5c0a79095b7576824f7a05a9bf1e9673e62aef675ced3e45abce
data/CONTRIBUTING.md ADDED
@@ -0,0 +1,76 @@
1
+ # Contributing
2
+
3
+ Please see our [community docs on contributing](https://shabados.com/docs/community/contributing).
4
+
5
+ This document is for developers or programmers contributing to source code. If you're interested in contributing a different way, please see the link above.
6
+
7
+ ## Project Structure
8
+
9
+ This project follows a standard Ruby gem structure. Here's a brief overview of the main directories and files:
10
+
11
+ - `lib/`: Contains the main source code for the GurmukhiUtils gem.
12
+ - `gurmukhi_utils.rb`: The main entry point for the gem. This file requires all other necessary files.
13
+ - `<feature>.rb` files: Each additional file in this directory represents a specific feature/module of the gem.
14
+ - `spec/`: Contains the RSpec test files for the gem. Test files should be placed in this directory, following the naming convention `lib/<feature>_spec.rb`.
15
+ - `Gemfile`: Specifies the gem dependencies for development and testing.
16
+ - `Gemfile.lock`: Generated by Bundler, this file contains the exact gem versions and their dependencies used in the project.
17
+ - `gurmukhi_utils.gemspec`: The gem specification file, which provides information about the gem. We need this to publish and release the gem.
18
+
19
+ ## Adding New Features to `GurmukhiUtils`
20
+
21
+ To add a new feature to GurmukhiUtils, follow these steps:
22
+
23
+ **Step 1.**
24
+
25
+ Create a new file in the `lib/` directory for the new feature, and place the new functionality within the `GurmukhiUtils` module.
26
+
27
+ For example, if you want to create an `ascii` method, create a new file called `ascii.rb`:
28
+
29
+ ```ruby
30
+ # lib/ascii.rb
31
+
32
+ module GurmukhiUtils
33
+ def self.helpers
34
+ # ...
35
+ end
36
+
37
+ def self.other_methods
38
+ # ...
39
+ end
40
+
41
+ def self.ascii
42
+ # ...
43
+ end
44
+ end
45
+ ```
46
+
47
+ **Step 2.**
48
+
49
+ Update the `lib/gurmukhi_utils.rb` file to require the new feature file using `require_relative`:
50
+
51
+ ```ruby
52
+ # lib/gurmukhi_utils.rb
53
+
54
+ require_relative 'gurmukhi_utils/version'
55
+ require_relative 'unicode'
56
+ require_relative 'ascii' # Add this line for the new feature
57
+ ```
58
+
59
+ **Step 3.**
60
+
61
+ Test the feature
62
+
63
+ Write tests for the new feature in the `spec` directory.
64
+
65
+ You can also try the feature interactively in `irb`:
66
+
67
+ ```ruby
68
+ 001 > require_relative "lib/gurmukhi_utils"
69
+ => true
70
+ 002 > GurmukhiUtils.ascii("...")
71
+ => "..."
72
+ ```
73
+
74
+ ## Thank you
75
+
76
+ Your contributions to open source, large or small, make great projects like this possible. Thank you for taking the time to participate in this project.
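The contributing steps above can be sketched in plain Ruby. This is a minimal illustration, assuming a hypothetical `GurmukhiUtilsSketch` module and hypothetical `digits` / `digits?` methods that are not part of the gem. It only shows how separate feature files can reopen one module and expose module-level methods.

```ruby
# Hypothetical stand-in for a feature file such as lib/<feature>.rb.
module GurmukhiUtilsSketch
  GURMUKHI_DIGITS = '੦੧੨੩੪੫੬੭੮੯'

  # Module-level method, callable as GurmukhiUtilsSketch.digits(...).
  def self.digits(string)
    string.tr('0-9', GURMUKHI_DIGITS)
  end
end

# A second hypothetical feature file reopens the same module
# and adds another method without touching the first file.
module GurmukhiUtilsSketch
  def self.digits?(string)
    string.match?(/[#{GURMUKHI_DIGITS}]/)
  end
end

puts GurmukhiUtilsSketch.digits('123')   # prints ੧੨੩
puts GurmukhiUtilsSketch.digits?('abc')  # prints false
```

Because every file defines methods on the same module, the entry-point file only needs one `require_relative` line per feature.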
data/Gemfile ADDED
@@ -0,0 +1,14 @@
1
+ # frozen_string_literal: true
2
+
3
+ ruby '3.2.1'
4
+
5
+ source 'https://rubygems.org'
6
+
7
+ group :development do
8
+ gem 'rubocop'
9
+ gem 'rubocop-rspec', :require => false
10
+ end
11
+
12
+ group :test do
13
+ gem 'rspec'
14
+ end
data/Gemfile.lock ADDED
@@ -0,0 +1,60 @@
1
+ GEM
2
+ remote: https://rubygems.org/
3
+ specs:
4
+ ast (2.4.2)
5
+ diff-lcs (1.5.0)
6
+ json (2.6.3)
7
+ parallel (1.22.1)
8
+ parser (3.2.1.0)
9
+ ast (~> 2.4.1)
10
+ rainbow (3.1.1)
11
+ regexp_parser (2.7.0)
12
+ rexml (3.2.5)
13
+ rspec (3.12.0)
14
+ rspec-core (~> 3.12.0)
15
+ rspec-expectations (~> 3.12.0)
16
+ rspec-mocks (~> 3.12.0)
17
+ rspec-core (3.12.2)
18
+ rspec-support (~> 3.12.0)
19
+ rspec-expectations (3.12.3)
20
+ diff-lcs (>= 1.2.0, < 2.0)
21
+ rspec-support (~> 3.12.0)
22
+ rspec-mocks (3.12.5)
23
+ diff-lcs (>= 1.2.0, < 2.0)
24
+ rspec-support (~> 3.12.0)
25
+ rspec-support (3.12.0)
26
+ rubocop (1.47.0)
27
+ json (~> 2.3)
28
+ parallel (~> 1.10)
29
+ parser (>= 3.2.0.0)
30
+ rainbow (>= 2.2.2, < 4.0)
31
+ regexp_parser (>= 1.8, < 3.0)
32
+ rexml (>= 3.2.5, < 4.0)
33
+ rubocop-ast (>= 1.26.0, < 2.0)
34
+ ruby-progressbar (~> 1.7)
35
+ unicode-display_width (>= 2.4.0, < 3.0)
36
+ rubocop-ast (1.27.0)
37
+ parser (>= 3.2.1.0)
38
+ rubocop-capybara (2.17.1)
39
+ rubocop (~> 1.41)
40
+ rubocop-rspec (2.19.0)
41
+ rubocop (~> 1.33)
42
+ rubocop-capybara (~> 2.17)
43
+ ruby-progressbar (1.12.0)
44
+ strscan (3.0.5)
45
+ unicode-display_width (2.4.2)
46
+
47
+ PLATFORMS
48
+ x86_64-darwin-19
49
+
50
+ DEPENDENCIES
51
+ rspec
52
+ rubocop
53
+ rubocop-rspec
54
+ strscan
55
+
56
+ RUBY VERSION
57
+ ruby 3.2.1p31
58
+
59
+ BUNDLED WITH
60
+ 2.4.7
data/README.md ADDED
@@ -0,0 +1,14 @@
1
+ # Gurmukhi Utils (Ruby)
2
+
3
+ Utilities library for converting, analyzing, and testing Gurmukhi strings.
4
+
5
+ ## Related
6
+
7
+ This library is one of many in the Gurmukhi Utils super-repo.
8
+
9
+ - [Super Repo](/README.md)
10
+ - [Python](/python/README.md)
11
+ - [JavaScript](/javascript/README.md)
12
+ - [Ruby](/ruby/README.md)
13
+ - [C# / C Sharp](/csharp/README.md)
14
+ - [Dart](/dart/README.md)
data/gurmukhi_utils.gemspec ADDED
@@ -0,0 +1,20 @@
1
+ # frozen_string_literal: true
2
+
3
+ Gem::Specification.new do |spec|
4
+ spec.name = 'gurmukhi_utils'
5
+ spec.version = '0.0.1'
6
+ spec.authors = ['Dilraj Singh Somel (dsomel21)']
7
+ spec.email = ['dsomel21@gmail.com']
8
+
9
+ spec.summary = 'A utility gem for converting, analyzing, and testing Gurmukhi strings.'
10
+ spec.description = 'Library for working with Gurmukhi text, providing various operations like unicode conversion, ascii conversion, and more.'
11
+ spec.homepage = 'https://github.com/ShabadOS/gurmukhi-utils/tree/main/ruby'
12
+ spec.license = 'MIT'
13
+
14
+ spec.required_ruby_version = '3.2.1'
15
+
16
+ spec.files = `git ls-files -z`.split("\x0").reject { |f| f.match(%r{^(test|spec)/}) }
17
+ spec.require_paths = ['lib']
18
+
19
+ spec.metadata['rubygems_mfa_required'] = 'true'
20
+ end
data/lib/ascii.rb ADDED
@@ -0,0 +1,161 @@
1
+ # frozen_string_literal: true
2
+
3
+ module GurmukhiUtils
4
+ ASCII_TRANSLATION = {
5
+ 'ੳ'.ord => 'a',
6
+ 'ਅ'.ord => 'A',
7
+ 'ੲ'.ord => 'e',
8
+ 'ਸ'.ord => 's',
9
+ 'ਹ'.ord => 'h',
10
+ 'ਕ'.ord => 'k',
11
+ 'ਖ'.ord => 'K',
12
+ 'ਗ'.ord => 'g',
13
+ 'ਘ'.ord => 'G',
14
+ 'ਙ'.ord => '|',
15
+ 'ਚ'.ord => 'c',
16
+ 'ਛ'.ord => 'C',
17
+ 'ਜ'.ord => 'j',
18
+ 'ਝ'.ord => 'J',
19
+ 'ਞ'.ord => '\\',
20
+ 'ਟ'.ord => 't',
21
+ 'ਠ'.ord => 'T',
22
+ 'ਡ'.ord => 'f',
23
+ 'ਢ'.ord => 'F',
24
+ 'ਣ'.ord => 'x',
25
+ 'ਤ'.ord => 'q',
26
+ 'ਥ'.ord => 'Q',
27
+ 'ਦ'.ord => 'd',
28
+ 'ਧ'.ord => 'D',
29
+ 'ਨ'.ord => 'n',
30
+ 'ਪ'.ord => 'p',
31
+ 'ਫ'.ord => 'P',
32
+ 'ਬ'.ord => 'b',
33
+ 'ਭ'.ord => 'B',
34
+ 'ਮ'.ord => 'm',
35
+ 'ਯ'.ord => 'X',
36
+ 'ਰ'.ord => 'r',
37
+ 'ਲ'.ord => 'l',
38
+ 'ਵ'.ord => 'v',
39
+ 'ੜ'.ord => 'V',
40
+ 'ਸ਼'.ord => 'S',
41
+ 'ਜ਼'.ord => 'z',
42
+ 'ਖ਼'.ord => '^',
43
+ 'ਫ਼'.ord => '&',
44
+ 'ਗ਼'.ord => 'Z',
45
+ 'ਲ਼'.ord => 'L',
46
+ '਼'.ord => 'æ',
47
+ 'ੑ'.ord => '@',
48
+ 'ੵ'.ord => "\u00b4", # acute accent (´)
49
+ 'ਃ'.ord => 'Ú', # capital u-acute letter
50
+ "\u0a13".ord => 'E', # ਓ
51
+ "\u0a06".ord => 'Aw', # ਆ
52
+ "\u0a07".ord => 'ei', # ਇ
53
+ "\u0a08".ord => 'eI', # ਈ
54
+ "\u0a09".ord => 'au', # ਉ
55
+ "\u0a0a".ord => 'aU', # ਊ
56
+ "\u0a0f".ord => 'ey', # ਏ
57
+ "\u0a10".ord => 'AY', # ਐ
58
+ "\u0a14".ord => 'AO', # ਔ
59
+ 'ਾ'.ord => 'w',
60
+ 'ਿ'.ord => 'i',
61
+ 'ੀ'.ord => 'I',
62
+ 'ੁ'.ord => 'u',
63
+ 'ੂ'.ord => 'U',
64
+ 'ੇ'.ord => 'y',
65
+ 'ੈ'.ord => 'Y',
66
+ 'ੋ'.ord => 'o',
67
+ 'ੌ'.ord => 'O',
68
+ 'ੰ'.ord => 'M',
69
+ 'ਂ'.ord => 'N',
70
+ 'ੱ'.ord => '~',
71
+ '।'.ord => '[',
72
+ '॥'.ord => ']',
73
+ '੦'.ord => '0',
74
+ '੧'.ord => '1',
75
+ '੨'.ord => '2',
76
+ '੩'.ord => '3',
77
+ '੪'.ord => '4',
78
+ '੫'.ord => '5',
79
+ '੬'.ord => '6',
80
+ '੭'.ord => '7',
81
+ '੮'.ord => '8',
82
+ '੯'.ord => '9',
83
+ 'ੴ'.ord => '<>',
84
+ '☬'.ord => 'Ç'
85
+ }.freeze
86
+
87
+ ASCII_REPLACEMENTS = {
88
+ '੍ਯ' => 'Î', # half-yayya
89
+ '꠳ਯ' => 'Î', # sant lipi variation
90
+ '꠴ਯ' => 'ï', # open-top yayya
91
+ '꠵ਯ' => 'î', # open-top half-yayya
92
+ '੍ਰ' => 'R',
93
+ '੍ਵ' => 'Í', # capital i-acute letter
94
+ '੍ਹ' => 'H',
95
+ '੍ਚ' => 'ç', # c-cedilla letter
96
+ '੍ਟ' => '†', # dagger symbol
97
+ '੍ਤ' => 'œ',
98
+ '੍ਨ' => "\u02dc" # small tilde (˜)
99
+ }.freeze
100
+
101
+ # TODO: Raise warnings if incorrect vowel syntax
102
+
103
+ def self.ascii(string)
104
+ string = unicode_normalize(string)
105
+
106
+ # Perform replacements
107
+ ASCII_REPLACEMENTS.each do |key, value|
108
+ string.gsub!(key, value)
109
+ end
110
+
111
+ # Perform translation
112
+ string = string.chars.map { |c| ASCII_TRANSLATION[c.ord] || c }.join
113
+
114
+ # Re-arrange sihari
115
+ ascii_base_letters = 'AeshkKgG|cCjJ\\\\tTfFxqQdDnpPbBmXrlvVSz^&ZLÎïî' # two backslashes, so the regex class matches a literal "\" and "t" rather than a tab
116
+ ascii_modifiers = 'æ@´ÚwIuUyYoO`MNRÍH熜˜ü¨®µˆW~¤Ï'
117
+ regex = Regexp.new("([#{ascii_base_letters}][#{ascii_modifiers}]*)i([#{ascii_modifiers}]*)")
118
+ string.gsub!(regex, 'i\1\2')
119
+
120
+ # Fix below-base-letter + u vowel positioning
121
+ ascii_below_base_letters = 'RÍH熜˜´@'
122
+ below_vowel_mappings = {
123
+ 'u' => 'ü',
124
+ 'U' => '¨'
125
+ }
126
+
127
+ below_vowel_mappings.each do |key, value|
128
+ string.gsub!(/([#{ascii_below_base_letters}][#{ascii_modifiers}]*)#{key}([#{ascii_modifiers}]*)/, "\\1#{value}\\2")
129
+ end
130
+
131
+ # Fix center-stroke + tippi positioning
132
+ center_stroke_letters = 'nT'
133
+ string.gsub!(/([#{center_stroke_letters}][#{ascii_modifiers}]*)M([#{ascii_modifiers}]*)/, '\\1µ\\2')
134
+
135
+ # Fix positioning of bindi/tippi when it is the only above-base-form
136
+ ascii_non_above_modifiers = 'æ@´ÚwuURÍH熜˜ü¨®Ï'
137
+ nasalization_mappings = {
138
+ 'N' => 'ˆ',
139
+ '~' => '`'
140
+ }
141
+ nasalization_mappings.each do |key, value|
142
+ string.gsub!(/([#{ascii_base_letters}][#{ascii_non_above_modifiers}]*)#{key}([#{ascii_non_above_modifiers}]*)/, "\\1#{value}\\2")
143
+ end
144
+
145
+ # Make rendering changes for combos
146
+ ascii_combo_replacements = {
147
+ 'Iਁ' => 'ˆØI', # bindi + bihari ligature
148
+ 'IM' => 'µØI', # tippi + bihari ligature
149
+ 'Iµ' => 'µØI', # tippi + bihari ligature
150
+ 'kR' => 'k®', # kakka + pair-rara ligature
151
+ 'H¨' => '§',
152
+ 'wN' => 'W', # addhak positioning
153
+ 'wˆ' => 'W', # addhak positioning
154
+ 'nUµ' => 'ƒ'
155
+ }
156
+ ascii_combo_replacements.each do |key, value|
157
+ string.gsub!(key, value)
158
+ end
159
+ return string
160
+ end
161
+ end
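The sihari re-arrangement step in `ascii` can be illustrated in isolation. In Unicode the sihari (ਿ) is stored after its base letter, while the legacy ASCII font encoding expects it before the letter. The sketch below uses deliberately tiny, hypothetical `BASE_LETTERS` and `MODIFIERS` sets and a hypothetical helper name. `Regexp.escape` is used here as an assumption (it is not the gem's approach) so that a backslash in the letter set is matched literally rather than read as a regex escape such as `\t`.

```ruby
# Illustrative subsets only; the gem's real character sets are much larger.
BASE_LETTERS = 'shkgrt\\' # the trailing '\\' is one literal backslash
MODIFIERS = 'wuUR'

# Escape both sets before interpolating them into character classes.
BASE_CLASS = Regexp.escape(BASE_LETTERS)
MODIFIER_CLASS = Regexp.escape(MODIFIERS)

# base letter + modifiers, then "i", then any trailing modifiers
SIHARI_PATTERN = /([#{BASE_CLASS}][#{MODIFIER_CLASS}]*)i([#{MODIFIER_CLASS}]*)/

# Moves each "i" in front of the base letter it follows.
def move_sihari_before_base(string)
  string.gsub(SIHARI_PATTERN, 'i\1\2')
end

puts move_sihari_before_base('si')  # prints is
puts move_sihari_before_base('sRi') # prints isR
puts move_sihari_before_base('ti')  # prints it
```

Without the escaping, a class built from the characters `\` and `t` would match a tab instead of either letter, so the last example would be left unchanged.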
data/lib/constants.rb ADDED
@@ -0,0 +1,32 @@
1
+ # frozen_string_literal: true
2
+
3
+ module GurmukhiUtils
4
+ # pause chars / ਵਿਸ਼ਰਾਮ symbols
5
+ VISHRAM_LIGHT = '.'
6
+ VISHRAM_MEDIUM = ','
7
+ VISHRAM_HEAVY = ';'
8
+ VISHRAMS = [VISHRAM_LIGHT, VISHRAM_MEDIUM, VISHRAM_HEAVY].freeze
9
+
10
+ BASE_LETTERS = 'ਸਹਕਖਗਘਙਚਛਜਝਞਟਠਡਢਣਤਥਦਧਨਪਫਬਭਮਯਰਲਵੜਸ਼ਖ਼ਗ਼ਜ਼ਫ਼ਲ਼'
11
+ VOWEL_LETTERS =
12
+ # ਅ ਆ ਏ ਐ ਇ ਈ ਓ ਔ ਉ ਊ
13
+ "ਅ\u0a06\u0a0f\u0a10\u0a07\u0a08\u0a13\u0a14\u0a09\u0a0a"
14
+
15
+ ORDERED_VOWELS = [
16
+ 'ਿ',
17
+ 'ੇ',
18
+ 'ੈ',
19
+ 'ੋ',
20
+ 'ੌ',
21
+ 'ੁ',
22
+ 'ੂ',
23
+ 'ਾ',
24
+ 'ੀ'
25
+ ].freeze
26
+
27
+ VIRAMA = '੍'
28
+ BELOW_LETTERS = 'ਹਰਵਟਤਨਚ'
29
+ YAKASH = 'ੵ'
30
+
31
+ VOWEL_DIACRITICS = ORDERED_VOWELS.join
32
+ end
data/lib/gurmukhi_utils.rb ADDED
@@ -0,0 +1,6 @@
1
+ # frozen_string_literal: true
2
+
3
+ require_relative 'unicode'
4
+ require_relative 'ascii'
5
+ require_relative 'remove'
6
+ require_relative 'constants'
data/lib/remove.rb ADDED
@@ -0,0 +1,70 @@
1
+ # frozen_string_literal: true
2
+
3
+ module GurmukhiUtils
4
+ # Removes substrings from the string.
5
+ #
6
+ # @param string [String] The string to affect.
7
+ # @param removals [Array<String>] Any substring to remove.
8
+ # @return [String] The string without any substrings.
9
+ #
10
+ # @example
11
+ # remove("ਸਬਦ. ਸਬਦ, ਸਬਦ; ਸਬਦ", [".", ","])
12
+ # # => "ਸਬਦ ਸਬਦ ਸਬਦ; ਸਬਦ"
13
+ # remove("ਸਬਦ. ਸਬਦ, ਸਬਦ; ਸਬਦ", GurmukhiUtils::VISHRAMS)
14
+ # # => "ਸਬਦ ਸਬਦ ਸਬਦ ਸਬਦ"
15
+ def remove(string, removals)
16
+ # Build a new string instead of mutating the caller's argument (which may be frozen).
17
+ removals.reduce(string) { |result, removal| result.gsub(removal, '') }
18
+ end
19
+
20
+ # Removes regex patterns from the string.
21
+ #
22
+ # Note:
23
+ # Also removes duplicate space characters from the string.
24
+ #
25
+ # @param string [String] The string to affect.
26
+ # @param patterns [Array<String>] Any pattern to remove.
27
+ # @return [String] The string without any matching patterns, duplicate spaces, or leading/trailing spaces.
28
+ #
29
+ # @example
30
+ # remove_regex("ਸਬਦ. ਸਬਦ, ਸਬਦ; ਸਬਦ", [".+\\s"])
31
+ # # => "ਸਬਦ"
32
+ def remove_regex(string, patterns)
33
+ string = patterns.reduce(string) { |result, pattern| result.gsub(Regexp.new(pattern), '') }
34
+ # squeeze! and strip! return nil when nothing changes, so use the non-bang forms.
35
+ string.squeeze(' ').strip
36
+ end
37
+
38
+ # Attempts to remove line endings as best as possible.
39
+ #
40
+ # @param string [String] The unicode Gurmukhi, Hindi, or English translation/transliteration to affect.
41
+ # @return [String] The string without line endings.
42
+ #
43
+ # @example
44
+ # remove_line_endings("ਸਬਦ ॥ ਸਬਦ ॥੧॥ ਰਹਾਉ ॥")
45
+ # # => "ਸਬਦ ਸਬਦ"
46
+ def remove_line_endings(string)
47
+ line_ending_patterns = [
48
+ '[।॥] *(ਰਹਾਉ|रहाउ).*',
49
+ '[|] *Pause.*',
50
+ '[|] *(rahaau|rahau|rahao).*',
51
+ '[।॥][੦-੯|०-९].*',
52
+ '[|]\\d.*',
53
+ '[।॥|]'
54
+ ]
55
+
56
+ remove_regex(string, line_ending_patterns)
57
+ end
58
+
59
+ # Removes all vishram characters.
60
+ #
61
+ # @param string [String] The string to affect.
62
+ # @return [String] The string without vishrams.
63
+ #
64
+ # @example
65
+ # remove_vishrams("ਸਬਦ. ਸਬਦ, ਸਬਦ; ਸਬਦ ॥")
66
+ # # => "ਸਬਦ ਸਬਦ ਸਬਦ ਸਬਦ ॥"
67
+ def remove_vishrams(string)
68
+ remove(string, GurmukhiUtils::VISHRAMS)
69
+ end
70
+ end
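The removal helpers can be sketched without mutating their input. The example below is a hedged illustration, not the gem's implementation: `remove_all` and `tidy_spaces` are hypothetical names, and `VISHRAMS` is redefined locally so the snippet is self-contained. It also shows why chaining bang methods is risky: `squeeze!` and `strip!` return `nil` when they make no change.

```ruby
VISHRAMS = ['.', ',', ';'].freeze
WORD = 'ਸਬਦ'
LINE = "#{WORD}. #{WORD}, #{WORD}; #{WORD}".freeze

# Returns a new string with every removal deleted; the input is untouched,
# so frozen strings are safe to pass in.
def remove_all(string, removals)
  removals.reduce(string) { |result, removal| result.gsub(removal, '') }
end

# Collapses runs of spaces and trims both ends, always returning a String.
def tidy_spaces(string)
  string.squeeze(' ').strip
end

puts remove_all(LINE, VISHRAMS)          # the four words separated by single spaces
puts tidy_spaces("  #{WORD}   #{WORD} ") # two words, one space between them
p WORD.dup.squeeze!(' ')                 # => nil, because nothing was squeezed
```

Using `reduce` with the non-bang `gsub` keeps each helper a pure function of its arguments, which makes the documented examples safe to call with string literals under `frozen_string_literal: true`.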
data/lib/unicode.rb ADDED
@@ -0,0 +1,298 @@
1
+ # frozen_string_literal: true
2
+
3
+ # Copied from https://github.com/shabados/gurmukhiutils/blob/b7a1ca9f0d341b64715158893afb93675c773823/gurmukhiutils/unicode.py
4
+ require 'strscan'
5
+
6
+ module GurmukhiUtils
7
+ UNICODE_STANDARDS = ['Unicode Consortium', 'Sant Lipi'].freeze
8
+
9
+ ASCII_TO_SL_TRANSLATION = {
10
+ 'a'.ord => 'ੳ',
11
+ 'b'.ord => 'ਬ',
12
+ 'c'.ord => 'ਚ',
13
+ 'd'.ord => 'ਦ',
14
+ 'e'.ord => 'ੲ',
15
+ 'f'.ord => 'ਡ',
16
+ 'g'.ord => 'ਗ',
17
+ 'h'.ord => 'ਹ',
18
+ 'i'.ord => 'ਿ',
19
+ 'j'.ord => 'ਜ',
20
+ 'k'.ord => 'ਕ',
21
+ 'l'.ord => 'ਲ',
22
+ 'm'.ord => 'ਮ',
23
+ 'n'.ord => 'ਨ',
24
+ 'o'.ord => 'ੋ',
25
+ 'p'.ord => 'ਪ',
26
+ 'q'.ord => 'ਤ',
27
+ 'r'.ord => 'ਰ',
28
+ 's'.ord => 'ਸ',
29
+ 't'.ord => 'ਟ',
30
+ 'u'.ord => 'ੁ',
31
+ 'v'.ord => 'ਵ',
32
+ 'w'.ord => 'ਾ',
33
+ 'x'.ord => 'ਣ',
34
+ 'y'.ord => 'ੇ',
35
+ 'z'.ord => 'ਜ਼',
36
+ 'A'.ord => 'ਅ',
37
+ 'B'.ord => 'ਭ',
38
+ 'C'.ord => 'ਛ',
39
+ 'D'.ord => 'ਧ',
40
+ 'E'.ord => 'ਓ',
41
+ 'F'.ord => 'ਢ',
42
+ 'G'.ord => 'ਘ',
43
+ 'H'.ord => '੍ਹ',
44
+ 'I'.ord => 'ੀ',
45
+ 'J'.ord => 'ਝ',
46
+ 'K'.ord => 'ਖ',
47
+ 'L'.ord => 'ਲ਼',
48
+ 'M'.ord => 'ੰ',
49
+ 'N'.ord => 'ਂ',
50
+ 'O'.ord => 'ੌ',
51
+ 'P'.ord => 'ਫ',
52
+ 'Q'.ord => 'ਥ',
53
+ 'R'.ord => '੍ਰ',
54
+ 'S'.ord => 'ਸ਼',
55
+ 'T'.ord => 'ਠ',
56
+ 'U'.ord => 'ੂ',
57
+ 'V'.ord => 'ੜ',
58
+ 'W'.ord => 'ਾਂ',
59
+ 'X'.ord => 'ਯ',
60
+ 'Y'.ord => 'ੈ',
61
+ 'Z'.ord => 'ਗ਼',
62
+ '0'.ord => '੦',
63
+ '1'.ord => '੧',
64
+ '2'.ord => '੨',
65
+ '3'.ord => '੩',
66
+ '4'.ord => '੪',
67
+ '5'.ord => '੫',
68
+ '6'.ord => '੬',
69
+ '7'.ord => '੭',
70
+ '8'.ord => '੮',
71
+ '9'.ord => '੯',
72
+ '['.ord => '।',
73
+ ']'.ord => '॥',
74
+ '\\'.ord => 'ਞ',
75
+ '|'.ord => 'ਙ',
76
+ '`'.ord => 'ੱ',
77
+ '~'.ord => 'ੱ',
78
+ '@'.ord => 'ੑ',
79
+ '^'.ord => 'ਖ਼',
80
+ '&'.ord => 'ਫ਼',
81
+ '†'.ord => '੍ਟ', # dagger symbol
82
+ 'ü'.ord => 'ੁ', # u-diaeresis letter
83
+ '®'.ord => '੍ਰ', # registered symbol
84
+ "\u00b4".ord => 'ੵ', # acute accent (´)
85
+ "\u00a8".ord => 'ੂ', # diaeresis accent (¨)
86
+ 'µ'.ord => 'ੰ', # mu letter
87
+ 'æ'.ord => '਼',
88
+ "\u00a1".ord => 'ੴ', # inverted exclamation (¡)
89
+ 'ƒ'.ord => 'ਨੂੰ', # florin symbol
90
+ 'œ'.ord => '੍ਤ',
91
+ 'Í'.ord => '੍ਵ', # capital i-acute letter
92
+ 'Î'.ord => '੍ਯ', # capital i-circumflex letter
93
+ 'Ï'.ord => 'ੵ', # capital i-diaeresis letter
94
+ 'Ò'.ord => '॥', # capital o-grave letter
95
+ 'Ú'.ord => 'ਃ', # capital u-acute letter
96
+ "\u02c6".ord => 'ਂ', # circumflex accent (ˆ)
97
+ "\u02dc".ord => '੍ਨ', # small tilde (˜)
98
+ '§'.ord => '੍ਹੂ', # section symbol
99
+ '¤'.ord => 'ੱ', # currency symbol
100
+ 'ç'.ord => '੍ਚ', # c-cedilla letter
101
+ 'Ç'.ord => '☬', # khanda instead of california state symbol
102
+ "\u201a".ord => '❁', # single low-9 quotation (‚) mark
103
+ 'Æ'.ord => nil,
104
+ 'Ø'.ord => nil, # This is a topline / shirorekha (शिरोरेखा) extender
105
+ 'ÿ'.ord => nil, # This is the author Kulbir S Thind's stamp
106
+ 'Œ'.ord => nil, # Box drawing left flower
107
+ '‰'.ord => nil, # Box drawing right flower
108
+ 'Ó'.ord => nil, # Box drawing top flower
109
+ 'Ô'.ord => nil, # Box drawing bottom flower
110
+ 'Î'.ord => '꠳ਯ', # half-yayya
111
+ 'ï'.ord => '꠴ਯ', # open-top yayya
112
+ 'î'.ord => '꠵ਯ' # open-top half-yayya
113
+ }.freeze
114
+
115
+ ASCII_TO_SL_REPLACEMENTS = {
116
+ 'ˆØI' => 'ੀਁ', # Handle pre-bihari-bindi with unused adakbindi
117
+ '<>' => 'ੴ', # AnmolLipi/GurbaniAkhar variant
118
+ '<' => 'ੴ', # GurbaniLipi variant
119
+ '>' => '☬', # GurbaniLipi variant
120
+ 'Åå' => 'ੴ', # AnmolLipi/GurbaniAkhar variant
121
+ 'Å' => 'ੴ', # GurbaniLipi variant
122
+ 'å' => 'ੴ' # GurbaniLipi variant
123
+ }.freeze
124
+
125
+ UNICODE_TO_SL_REPLACEMENTS = {
126
+ '੍ਯ' => '꠳ਯ' # replace unicode half-yayya with Sant Lipi ligature (north indic one-sixteenth fraction + yayya)
127
+ }.freeze
128
+
129
+ SL_TO_UNICODE_REPLACEMENTS = {
130
+ '꠳ਯ' => '੍ਯ',
131
+ '꠴ਯ' => 'ਯ',
132
+ '꠵ਯ' => '੍ਯ',
133
+ 'ਁ' => 'ਂ' # pre-bihari-bindi
134
+ }.freeze
135
+
136
+ ##
137
+ # Converts any ASCII Gurmukhi characters and sanitizes to Unicode Gurmukhi.
138
+ # Note:
139
+ # Converting yayya (ਯ) variants with an open top using the Unicode Consortium standard is considered destructive.
140
+ # This function will substitute the original with its shirorekha/top-line equivalent.
141
+ #
142
+ # Many fonts and text shaping engines fail to render half-yayya (੍ਯ) correctly. Regardless of the standard used,
143
+ # it is recommended to use the Sant Lipi font mentioned below.
144
+ #
145
+ # @param string [String] The string to affect.
146
+ # @param unicode_standard [String] The mapping system to use. The default is Unicode compliant and can render
147
+ # 99% of the Shabad OS Database. The other option "Sant Lipi" is intended for a custom Unicode font bearing the
148
+ # same name (see: https://github.com/shabados/SantLipi). Defaults to "Unicode Consortium".
149
+ # @return [String] A string whose Gurmukhi is normalized to a Unicode standard.
150
+ #
151
+ # @example
152
+ # unicode("123")
153
+ # #=> "੧੨੩"
154
+ # unicode("<> > <")
155
+ # #=> "ੴ ☬ ੴ"
156
+ # unicode("gurU")
157
+ # #=> "ਗੁਰੂ"
158
+ ##
159
+ def self.unicode(string, unicode_standard = 'Unicode Consortium')
160
+ # Move ASCII sihari before mapping to unicode
161
+ ascii_base_letters = '\\\\a-zA-Z|^&Îîï' # two backslashes, so the class holds a literal "\" instead of the \a (bell) escape
162
+ ascii_sihari_pattern = Regexp.new("(i)([#{ascii_base_letters}])")
163
+ string = string.gsub(ascii_sihari_pattern, '\2\1')
164
+
165
+ # Map any ASCII / Unicode Gurmukhi to Sant Lipi format
166
+ ASCII_TO_SL_REPLACEMENTS.each do |key, value|
167
+ string.gsub!(key, value)
168
+ end
169
+
170
+ UNICODE_TO_SL_REPLACEMENTS.each do |key, value|
171
+ string.gsub!(key, value)
172
+ end
173
+
174
+ string = string.chars.map { |c| ASCII_TO_SL_TRANSLATION[c.ord] || c }.join
175
+
176
+ string = unicode_normalize(string)
177
+
178
+ if unicode_standard == 'Unicode Consortium'
179
+ SL_TO_UNICODE_REPLACEMENTS.each do |key, value|
180
+ string.gsub!(key, value)
181
+ end
182
+ end
183
+
184
+ return string
185
+ end
186
+
187
+ ##
188
+ # Normalizes Gurmukhi according to Unicode Standards.
189
+ # @param string [String] The string to affect.
190
+ # @return [String] A string containing normalized Gurmukhi.
191
+ #
192
+ # @example
193
+ # unicode_normalize("Hello ਜੀ")
194
+ # #=> "Hello ਜੀ"
195
+ ##
196
+ def self.unicode_normalize(string)
197
+ string = sort_diacritics(string)
198
+ return sanitize_unicode(string)
199
+ end
200
+
201
+ ##
202
+ # In Gurmukhi script, common diacritics include vowel signs (matras), such as:
203
+ # ੁ (u), ੀ (ī), or ੋ (ō), and other marks like bindi (ਂ), tippi (ੰ), and nukta (਼).
204
+ #
205
+ # @brief Orders the Gurmukhi diacritics in a string according to Unicode standards.
206
+ # Not intended for base letters with multiple subjoined letters.
207
+ # @param string [String] The string to affect.
208
+ # @return [String] The same string with Gurmukhi diacritics arranged in a sorted manner.
209
+ #
210
+ # @example
211
+ # sort_diacritics("\u0a41\u0a4b") # => "\u0a4b\u0a41" # ੁੋ vs ੋੁ
212
+ ##
213
+ def self.sort_diacritics(string)
214
+ # Nukta is essential to form a new base letter and must be ordered first.
215
+ # Udaat, Yakash, and subjoined letters should follow.
216
+ # Subjoined letters are constructed (they are not single char), so they cannot be used
217
+ # in the same regex group pattern. See further below for subjoined letters.
218
+ base_letter_modifiers = ['਼', 'ੑ', 'ੵ']
219
+
220
+ # More generally, when a consonant or independent vowel is modified by multiple vowel signs, the sequence of the vowel signs in the underlying representation of the text should be: left, top, bottom, right.
221
+ # p. 491 of The Unicode® Standard Version 14.0 – Core Specification
222
+ # https://www.unicode.org/versions/Unicode14.0.0/ch12.pdf
223
+ vowel_order = ['ਿ', 'ੇ', 'ੈ', 'ੋ', 'ੌ', 'ੁ', 'ੂ', 'ਾ', 'ੀ']
224
+
225
+ # The remaining diacritics are to be sorted at the end according to the following order
226
+ remaining_modifier_order = ['ਁ', 'ੱ', 'ਂ', 'ੰ', 'ਃ']
227
+
228
+ generated_marks = (base_letter_modifiers + vowel_order + remaining_modifier_order).join
229
+ mark_pattern = Regexp.new("([#{generated_marks}]*)")
230
+
231
+ virama = '੍'
232
+ below_base_letters = 'ਹਰਵਟਤਨਚ'
233
+ below_base_pattern = Regexp.new("(#{virama}[#{below_base_letters}])?")
234
+
235
+ regex_match_pattern = Regexp.new("#{mark_pattern}#{below_base_pattern}#{mark_pattern}")
236
+
237
+ generated_match_order = (base_letter_modifiers + [virama] + below_base_letters.chars + vowel_order + remaining_modifier_order).join
238
+
239
+ string.gsub(regex_match_pattern) do |match|
240
+ match_chars = match.chars
241
+ match_chars.sort_by! { |e| generated_match_order.index(e) }
242
+ match_chars.join
243
+ end
244
+ end
245
+
246
+ def self.sanitize_unicode(string)
247
+ unicode_sanitization_map = {
248
+ "\u0a73\u0a4b" => "\u0a13", # ਓ
249
+ "\u0a05\u0a3e" => "\u0a06", # ਅ + ਾ = ਆ
250
+ "\u0a72\u0a3f" => "\u0a07", # ਇ
251
+ "\u0a72\u0a40" => "\u0a08", # ਈ
252
+ "\u0a73\u0a41" => "\u0a09", # ਉ
253
+ "\u0a73\u0a42" => "\u0a0a", # ਊ
254
+ "\u0a72\u0a47" => "\u0a0f", # ਏ
255
+ "\u0a05\u0a48" => "\u0a10", # ਐ
256
+ "\u0a05\u0a4c" => "\u0a14", # ਔ
257
+ "\u0a32\u0a3c" => "\u0a33", # ਲ਼
258
+ "\u0a38\u0a3c" => "\u0a36", # ਸ਼
259
+ "\u0a16\u0a3c" => "\u0a59", # ਖ਼
260
+ "\u0a17\u0a3c" => "\u0a5a", # ਗ਼
261
+ "\u0a1c\u0a3c" => "\u0a5b", # ਜ਼
262
+ "\u0a2b\u0a3c" => "\u0a5e", # ਫ਼
263
+ "\u0a71\u0a02" => "\u0a01" # ਁ adak bindi (quite literally never used today or in the Shabad OS Database, only included for parity with the Unicode block)
264
+ }
265
+
266
+ unicode_sanitization_map.each do |key, value|
267
+ string.gsub!(key, value)
268
+ end
269
+
270
+ return string
271
+ end
272
+
273
+ ##
274
+ # Takes a string and returns a list of keys and values of each character and its corresponding code point.
275
+ # @param string [String] the string to affect
276
+ # @return [Array<Hash{String => String}>] a list of each character and its corresponding code point
277
+ # @example
278
+ # decode_unicode("To ਜੀ")
279
+ # #=> [{"T" => "0054"}, {"o" => "006f"}, {" " => "0020"}, {"ਜ" => "0a1c"}, {"ੀ" => "0a40"}]
280
+ ##
281
+ def self.decode_unicode(string)
282
+ return string.chars.map { |item| { item => format('%04x', item.ord) } }
283
+ end
284
+
285
+ ##
286
+ # Takes a list of hexadecimal code point strings and returns the corresponding Unicode characters.
287
+ # @param strings [Array<String>] the list containing any strings to encode
288
+ # @return [Array<String>] a list of any corresponding unicode characters
289
+ # @example
290
+ # encode_unicode(["0054"])
291
+ # #=> ["T"]
292
+ # encode_unicode(["0a1c", "0A40"])
293
+ # #=> ["ਜ", "ੀ"]
294
+ ##
295
+ def self.encode_unicode(strings)
296
+ return strings.map { |string| string.to_i(16).chr(Encoding::UTF_8) }
297
+ end
298
+ end
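The diacritic ordering idea behind `sort_diacritics` can be shown on a reduced scale. The sketch below is an assumption-laden simplification: it ignores subjoined letters, uses only seven marks, and the names `MARK_ORDER`, `sort_marks`, and `code_points` are hypothetical. Each run of consecutive marks is re-sorted by its position in one canonical list, and a small helper prints code points so the reordering is visible.

```ruby
# Canonical order for a small illustrative subset of marks:
# nukta, sihari, hora, aunkar, kanna, bindi, tippi.
MARK_ORDER = [
  "\u0a3c", "\u0a3f", "\u0a4b", "\u0a41", "\u0a3e", "\u0a02", "\u0a70"
].freeze

# One or more consecutive marks from the list above.
MARK_RUN = /[#{MARK_ORDER.join}]+/

# Re-sorts every run of marks according to MARK_ORDER.
def sort_marks(string)
  string.gsub(MARK_RUN) do |run|
    run.chars.sort_by { |mark| MARK_ORDER.index(mark) }.join
  end
end

# Lists the four-digit hexadecimal code point of each character.
def code_points(string)
  string.chars.map { |char| format('%04x', char.ord) }
end

unsorted = "\u0a17\u0a41\u0a4b" # gagga + aunkar + hora
p code_points(unsorted)             # => ["0a17", "0a41", "0a4b"]
p code_points(sort_marks(unsorted)) # => ["0a17", "0a4b", "0a41"]
```

Sorting by index in a single ordered list keeps the rule set in one place; adding a mark only requires inserting it at the right position in `MARK_ORDER`.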
metadata ADDED
@@ -0,0 +1,55 @@
1
+ --- !ruby/object:Gem::Specification
2
+ name: gurmukhi_utils
3
+ version: !ruby/object:Gem::Version
4
+ version: 0.0.1
5
+ platform: ruby
6
+ authors:
7
+ - Dilraj Singh Somel (dsomel21)
8
+ autorequire:
9
+ bindir: bin
10
+ cert_chain: []
11
+ date: 2023-07-15 00:00:00.000000000 Z
12
+ dependencies: []
13
+ description: Library for working with Gurmukhi text, providing various operations
14
+ like unicode conversion, ascii conversion, and more.
15
+ email:
16
+ - dsomel21@gmail.com
17
+ executables: []
18
+ extensions: []
19
+ extra_rdoc_files: []
20
+ files:
21
+ - CONTRIBUTING.md
22
+ - Gemfile
23
+ - Gemfile.lock
24
+ - README.md
25
+ - gurmukhi_utils.gemspec
26
+ - lib/ascii.rb
27
+ - lib/constants.rb
28
+ - lib/gurmukhi_utils.rb
29
+ - lib/remove.rb
30
+ - lib/unicode.rb
31
+ homepage: https://github.com/ShabadOS/gurmukhi-utils/tree/main/ruby
32
+ licenses:
33
+ - MIT
34
+ metadata:
35
+ rubygems_mfa_required: 'true'
36
+ post_install_message:
37
+ rdoc_options: []
38
+ require_paths:
39
+ - lib
40
+ required_ruby_version: !ruby/object:Gem::Requirement
41
+ requirements:
42
+ - - '='
43
+ - !ruby/object:Gem::Version
44
+ version: 3.2.1
45
+ required_rubygems_version: !ruby/object:Gem::Requirement
46
+ requirements:
47
+ - - ">="
48
+ - !ruby/object:Gem::Version
49
+ version: '0'
50
+ requirements: []
51
+ rubygems_version: 3.4.7
52
+ signing_key:
53
+ specification_version: 4
54
+ summary: A utility gem for converting, analyzing, and testing Gurmukhi strings.
55
+ test_files: []