gurmukhi_utils 0.0.1

checksums.yaml ADDED
@@ -0,0 +1,7 @@
1
+ ---
2
+ SHA256:
3
+ metadata.gz: dbb164eb5abec25b925649743ac6ae8797bce240903a42e13ee7bca4cd5635bc
4
+ data.tar.gz: 62fd647545eee43081089e407517bab0c00f3136a37862a82a78e7e276a9262f
5
+ SHA512:
6
+ metadata.gz: 314ce8d8a0149fa9d7327c44b8b0c145e2a0bfb031704d2b8843bfa636873d45fbf198dfe7666511242b3f0f788291beae2e9dd4f8a8d980fc6b99017a731424
7
+ data.tar.gz: 4d92bda8fe9cb31cbe430812d1953e517bc302acc5fa6e938bcf6197c05bd3dae39345ba723d5c0a79095b7576824f7a05a9bf1e9673e62aef675ced3e45abce
data/CONTRIBUTING.md ADDED
@@ -0,0 +1,76 @@
1
+ # Contributing
2
+
3
+ Please see our [community docs on contributing](https://shabados.com/docs/community/contributing).
4
+
5
+ This document is for developers or programmers contributing to source code. If you're interested in contributing a different way, please see the link above.
6
+
7
+ ## Project Structure
8
+
9
+ This project follows a standard Ruby gem structure. Here's a brief overview of the main directories and files:
10
+
11
+ - `lib/`: Contains the main source code for the GurmukhiUtils gem.
12
+ - `gurmukhi_utils.rb`: The main entry point for the gem. This file requires all other necessary files.
13
+ - `<feature>.rb` files: Each additional file in this directory represents a specific feature/module of the gem.
14
+ - `spec/`: Contains the RSpec test files for the gem. Test files should be placed in this directory, following the naming convention `lib/<feature>_spec.rb`.
15
+ - `Gemfile`: Specifies the gem dependencies for development and testing.
16
+ - `Gemfile.lock`: Generated by Bundler, this file contains the exact gem versions and their dependencies used in the project.
17
+ - `gurmukhi_utils.gemspec`: The gem specification file, which provides information about the gem. We need this to publish and release the gem.
18
+
19
+ ## Adding New Features to `GurmukhiUtils`
20
+
21
+ To add a new feature to GurmukhiUtils, follow these steps:
22
+
23
+ **Step 1.**
24
+
25
+ Create a new file in the `lib/` directory for the new feature, and place the new functionality within the `GurmukhiUtils` module.
26
+
27
+ For example, if you want to create an `ascii` method, create a new file called `ascii.rb`:
28
+
29
+ ```ruby
30
+ # lib/ascii.rb
31
+
32
+ module GurmukhiUtils
33
+ def self.helpers
34
+ # ...
35
+ end
36
+
37
+ def self.other_methods
38
+ # ...
39
+ end
40
+
41
+ def self.ascii
42
+ # ...
43
+ end
44
+ end
45
+ ```
46
+
47
+ **Step 2.**
48
+
49
+ Update the `lib/gurmukhi_utils.rb` file to require the new feature file using `require_relative`:
50
+
51
+ ```ruby
52
+ # lib/gurmukhi_utils.rb
53
+
54
+ require_relative 'gurmukhi_utils/version'
55
+ require_relative 'unicode'
56
+ require_relative 'ascii' # Add this line for the new feature
57
+ ```
58
+
59
+ **Step 3.**
60
+
61
+ Test the feature
62
+
63
+ Write tests for the new feature in the `spec` directory.
64
+
65
+ You can also use `irb` and do:
66
+
67
+ ```ruby
68
+ 001 > require_relative "lib/gurmukhi_utils"
69
+ => true
70
+ 002 > GurmukhiUtils.ascii("...")
71
+ => "..."
72
+ ```
73
+
74
+ ## Thank you
75
+
76
+ Your contributions to open source, large or small, make great projects like this possible. Thank you for taking the time to participate in this project.
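The three steps in CONTRIBUTING.md can be sketched end to end in a single file. This is a minimal illustration, not the gem's code: the `shout` method and its behaviour are hypothetical stand-ins for a real feature, and a plain `raise` check stands in for the RSpec example that would live in `spec/`.

```ruby
# Step 1: put the new functionality inside the GurmukhiUtils module.
# `shout` is a hypothetical feature used only for illustration.
module GurmukhiUtils
  def self.shout(string)
    "#{string.upcase}!"
  end
end

# Step 2 would be `require_relative 'shout'` in lib/gurmukhi_utils.rb.

# Step 3: exercise the feature. In the project this check would be an RSpec
# example; a plain assertion keeps the sketch dependency-free.
result = GurmukhiUtils.shout('abc')
raise "unexpected result: #{result}" unless result == 'ABC!'
```

Defining the method with `def self.` is what makes it callable as `GurmukhiUtils.shout(...)`, the same calling style the `irb` session in Step 3 uses.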
data/Gemfile ADDED
@@ -0,0 +1,14 @@
1
+ # frozen_string_literal: true
2
+
3
+ ruby '3.2.1'
4
+
5
+ source 'https://rubygems.org'
6
+
7
+ group :development do
8
+ gem 'rubocop'
9
+ gem 'rubocop-rspec', :require => false
10
+ end
11
+
12
+ group :test do
13
+ gem 'rspec'
14
+ end
data/Gemfile.lock ADDED
@@ -0,0 +1,60 @@
1
+ GEM
2
+ remote: https://rubygems.org/
3
+ specs:
4
+ ast (2.4.2)
5
+ diff-lcs (1.5.0)
6
+ json (2.6.3)
7
+ parallel (1.22.1)
8
+ parser (3.2.1.0)
9
+ ast (~> 2.4.1)
10
+ rainbow (3.1.1)
11
+ regexp_parser (2.7.0)
12
+ rexml (3.2.5)
13
+ rspec (3.12.0)
14
+ rspec-core (~> 3.12.0)
15
+ rspec-expectations (~> 3.12.0)
16
+ rspec-mocks (~> 3.12.0)
17
+ rspec-core (3.12.2)
18
+ rspec-support (~> 3.12.0)
19
+ rspec-expectations (3.12.3)
20
+ diff-lcs (>= 1.2.0, < 2.0)
21
+ rspec-support (~> 3.12.0)
22
+ rspec-mocks (3.12.5)
23
+ diff-lcs (>= 1.2.0, < 2.0)
24
+ rspec-support (~> 3.12.0)
25
+ rspec-support (3.12.0)
26
+ rubocop (1.47.0)
27
+ json (~> 2.3)
28
+ parallel (~> 1.10)
29
+ parser (>= 3.2.0.0)
30
+ rainbow (>= 2.2.2, < 4.0)
31
+ regexp_parser (>= 1.8, < 3.0)
32
+ rexml (>= 3.2.5, < 4.0)
33
+ rubocop-ast (>= 1.26.0, < 2.0)
34
+ ruby-progressbar (~> 1.7)
35
+ unicode-display_width (>= 2.4.0, < 3.0)
36
+ rubocop-ast (1.27.0)
37
+ parser (>= 3.2.1.0)
38
+ rubocop-capybara (2.17.1)
39
+ rubocop (~> 1.41)
40
+ rubocop-rspec (2.19.0)
41
+ rubocop (~> 1.33)
42
+ rubocop-capybara (~> 2.17)
43
+ ruby-progressbar (1.12.0)
44
+ strscan (3.0.5)
45
+ unicode-display_width (2.4.2)
46
+
47
+ PLATFORMS
48
+ x86_64-darwin-19
49
+
50
+ DEPENDENCIES
51
+ rspec
52
+ rubocop
53
+ rubocop-rspec
54
+ strscan
55
+
56
+ RUBY VERSION
57
+ ruby 3.2.1p31
58
+
59
+ BUNDLED WITH
60
+ 2.4.7
data/README.md ADDED
@@ -0,0 +1,14 @@
1
+ # Gurmukhi Utils (Ruby)
2
+
3
+ Utilities library for converting, analyzing, and testing Gurmukhi strings.
4
+
5
+ ## Related
6
+
7
+ This library is one of many in the Gurmukhi Utils super-repo.
8
+
9
+ - [Super Repo](/README.md)
10
+ - [Python](/python/README.md)
11
+ - [JavaScript](/javascript/README.md)
12
+ - [Ruby](/ruby/README.md)
13
+ - [C# / C Sharp](/csharp/README.md)
14
+ - [Dart](/dart/README.md)
data/gurmukhi_utils.gemspec ADDED
@@ -0,0 +1,20 @@
1
+ # frozen_string_literal: true
2
+
3
+ Gem::Specification.new do |spec|
4
+ spec.name = 'gurmukhi_utils'
5
+ spec.version = '0.0.1'
6
+ spec.authors = ['Dilraj Singh Somel (dsomel21)']
7
+ spec.email = ['dsomel21@gmail.com']
8
+
9
+ spec.summary = 'A utility gem for converting, analyzing, and testing Gurmukhi strings.'
10
+ spec.description = 'Library for working with Gurmukhi text, providing various operations like unicode conversion, ascii conversion, and more.'
11
+ spec.homepage = 'https://github.com/ShabadOS/gurmukhi-utils/tree/main/ruby'
12
+ spec.license = 'MIT'
13
+
14
+ spec.required_ruby_version = '3.2.1'
15
+
16
+ spec.files = `git ls-files -z`.split("\x0").reject { |f| f.match(%r{^(test|spec)/}) }
17
+ spec.require_paths = ['lib']
18
+
19
+ spec.metadata['rubygems_mfa_required'] = 'true'
20
+ end
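A note on `required_ruby_version`: a bare version string is treated by RubyGems as an exact (`=`) requirement, which is why the gem metadata records the operator as `'='`. The sketch below uses RubyGems' own `Gem::Requirement` and `Gem::Version` classes to show what such a pin accepts. The `~> 3.2` alternative is only an illustrative assumption, not something this gem declares.

```ruby
require 'rubygems' # loaded by default in modern Ruby; explicit for clarity

EXACT = Gem::Requirement.new('3.2.1')        # equivalent to '= 3.2.1'
PESSIMISTIC = Gem::Requirement.new('~> 3.2') # hypothetical looser constraint

PATCH_RELEASE = Gem::Version.new('3.2.2')

puts EXACT.satisfied_by?(Gem::Version.new('3.2.1')) # true
puts EXACT.satisfied_by?(PATCH_RELEASE)             # false
puts PESSIMISTIC.satisfied_by?(PATCH_RELEASE)       # true
```

With the exact pin, installing the gem on any other Ruby patch release would be rejected by RubyGems.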
data/lib/ascii.rb ADDED
@@ -0,0 +1,161 @@
1
+ # frozen_string_literal: true
2
+
3
+ module GurmukhiUtils
4
+ ASCII_TRANSLATION = {
5
+ 'ੳ'.ord => 'a',
6
+ 'ਅ'.ord => 'A',
7
+ 'ੲ'.ord => 'e',
8
+ 'ਸ'.ord => 's',
9
+ 'ਹ'.ord => 'h',
10
+ 'ਕ'.ord => 'k',
11
+ 'ਖ'.ord => 'K',
12
+ 'ਗ'.ord => 'g',
13
+ 'ਘ'.ord => 'G',
14
+ 'ਙ'.ord => '|',
15
+ 'ਚ'.ord => 'c',
16
+ 'ਛ'.ord => 'C',
17
+ 'ਜ'.ord => 'j',
18
+ 'ਝ'.ord => 'J',
19
+ 'ਞ'.ord => '\\',
20
+ 'ਟ'.ord => 't',
21
+ 'ਠ'.ord => 'T',
22
+ 'ਡ'.ord => 'f',
23
+ 'ਢ'.ord => 'F',
24
+ 'ਣ'.ord => 'x',
25
+ 'ਤ'.ord => 'q',
26
+ 'ਥ'.ord => 'Q',
27
+ 'ਦ'.ord => 'd',
28
+ 'ਧ'.ord => 'D',
29
+ 'ਨ'.ord => 'n',
30
+ 'ਪ'.ord => 'p',
31
+ 'ਫ'.ord => 'P',
32
+ 'ਬ'.ord => 'b',
33
+ 'ਭ'.ord => 'B',
34
+ 'ਮ'.ord => 'm',
35
+ 'ਯ'.ord => 'X',
36
+ 'ਰ'.ord => 'r',
37
+ 'ਲ'.ord => 'l',
38
+ 'ਵ'.ord => 'v',
39
+ 'ੜ'.ord => 'V',
40
+ 'ਸ਼'.ord => 'S',
41
+ 'ਜ਼'.ord => 'z',
42
+ 'ਖ਼'.ord => '^',
43
+ 'ਫ਼'.ord => '&',
44
+ 'ਗ਼'.ord => 'Z',
45
+ 'ਲ਼'.ord => 'L',
46
+ '਼'.ord => 'æ',
47
+ 'ੑ'.ord => '@',
48
+ 'ੵ'.ord => "\u00b4", # acute accent (´)
49
+ 'ਃ'.ord => 'Ú', # capital u-acute letter
50
+ "\u0a13".ord => 'E', # ਓ
51
+ "\u0a06".ord => 'Aw', # ਆ
52
+ "\u0a07".ord => 'ei', # ਇ
53
+ "\u0a08".ord => 'eI', # ਈ
54
+ "\u0a09".ord => 'au', # ਉ
55
+ "\u0a0a".ord => 'aU', # ਊ
56
+ "\u0a0f".ord => 'ey', # ਏ
57
+ "\u0a10".ord => 'AY', # ਐ
58
+ "\u0a14".ord => 'AO', # ਔ
59
+ 'ਾ'.ord => 'w',
60
+ 'ਿ'.ord => 'i',
61
+ 'ੀ'.ord => 'I',
62
+ 'ੁ'.ord => 'u',
63
+ 'ੂ'.ord => 'U',
64
+ 'ੇ'.ord => 'y',
65
+ 'ੈ'.ord => 'Y',
66
+ 'ੋ'.ord => 'o',
67
+ 'ੌ'.ord => 'O',
68
+ 'ੰ'.ord => 'M',
69
+ 'ਂ'.ord => 'N',
70
+ 'ੱ'.ord => '~',
71
+ '।'.ord => '[',
72
+ '॥'.ord => ']',
73
+ '੦'.ord => '0',
74
+ '੧'.ord => '1',
75
+ '੨'.ord => '2',
76
+ '੩'.ord => '3',
77
+ '੪'.ord => '4',
78
+ '੫'.ord => '5',
79
+ '੬'.ord => '6',
80
+ '੭'.ord => '7',
81
+ '੮'.ord => '8',
82
+ '੯'.ord => '9',
83
+ 'ੴ'.ord => '<>',
84
+ '☬'.ord => 'Ç'
85
+ }.freeze
86
+
87
+ ASCII_REPLACEMENTS = {
88
+ '੍ਯ' => 'Î', # half-yayya
89
+ '꠳ਯ' => 'Î', # sant lipi variation
90
+ '꠴ਯ' => 'ï', # open-top yayya
91
+ '꠵ਯ' => 'î', # open-top half-yayya
92
+ '੍ਰ' => 'R',
93
+ '੍ਵ' => 'Í', # capital i-acute letter
94
+ '੍ਹ' => 'H',
95
+ '੍ਚ' => 'ç', # c-cedilla letter
96
+ '੍ਟ' => '†', # dagger symbol
97
+ '੍ਤ' => 'œ',
98
+ '੍ਨ' => "\u02dc" # small tilde (˜)
99
+ }.freeze
100
+
101
+ # TODO: Raise warnings if incorrect vowel syntax
102
+
103
+ def self.ascii(string)
104
+ string = unicode_normalize(string)
105
+
106
+ # Perform replacements
107
+ ASCII_REPLACEMENTS.each do |key, value|
108
+ string.gsub!(key, value)
109
+ end
110
+
111
+ # Perform translation
112
+ string = string.chars.map { |c| ASCII_TRANSLATION[c.ord] || c }.join
113
+
114
+ # Re-arrange sihari
115
+ ascii_base_letters = 'AeshkKgG|cCjJ\\\\tTfFxqQdDnpPbBmXrlvVSz^&ZLÎïî' # doubled backslash so the character class matches a literal backslash and "t", not a tab
116
+ ascii_modifiers = 'æ@´ÚwIuUyYoO`MNRÍH熜˜ü¨®µˆW~¤Ï'
117
+ regex = Regexp.new("([#{ascii_base_letters}][#{ascii_modifiers}]*)i([#{ascii_modifiers}]*)")
118
+ string.gsub!(regex, 'i\1\2')
119
+
120
+ # Fix below-base-letter + u vowel positioning
121
+ ascii_below_base_letters = 'RÍH熜˜´@'
122
+ below_vowel_mappings = {
123
+ 'u' => 'ü',
124
+ 'U' => '¨'
125
+ }
126
+
127
+ below_vowel_mappings.each do |key, value|
128
+ string.gsub!(/([#{ascii_below_base_letters}][#{ascii_modifiers}]*)#{key}([#{ascii_modifiers}]*)/, "\\1#{value}\\2")
129
+ end
130
+
131
+ # Fix center-stroke + tippi positioning
132
+ center_stroke_letters = 'nT'
133
+ string.gsub!(/([#{center_stroke_letters}][#{ascii_modifiers}]*)M([#{ascii_modifiers}]*)/, '\\1µ\\2')
134
+
135
+ # Fix positioning of bindi/tippi when it is the only above-base-form
136
+ ascii_non_above_modifiers = 'æ@´ÚwuURÍH熜˜ü¨®Ï'
137
+ nasalization_mappings = {
138
+ 'N' => 'ˆ',
139
+ '~' => '`'
140
+ }
141
+ nasalization_mappings.each do |key, value|
142
+ string.gsub!(/([#{ascii_base_letters}][#{ascii_non_above_modifiers}]*)#{key}([#{ascii_non_above_modifiers}]*)/, "\\1#{value}\\2")
143
+ end
144
+
145
+ # Make rendering changes for combos
146
+ ascii_combo_replacements = {
147
+ 'Iਁ' => 'ˆØI', # bindi + bihari ligature
148
+ 'IM' => 'µØI', # tippi + bihari ligature
149
+ 'Iµ' => 'µØI', # tippi + bihari ligature
150
+ 'kR' => 'k®', # kakka + pair-rara ligature
151
+ 'H¨' => '§',
152
+ 'wN' => 'W', # addhak positioning
153
+ 'wˆ' => 'W', # addhak positioning
154
+ 'nUµ' => 'ƒ'
155
+ }
156
+ ascii_combo_replacements.each do |key, value|
157
+ string.gsub!(key, value)
158
+ end
159
+ return string
160
+ end
161
+ end
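The sihari step in `ascii` relies on an ordering quirk: in Unicode the sihari vowel sign follows its base letter, while the ASCII encoding expects `i` to come before it. A reduced sketch of that regex follows. The three-letter base set and two-letter modifier set are illustrative assumptions, far smaller than the lists in the method, and `move_sihari` is a hypothetical name.

```ruby
# Reduced, hypothetical character sets (the real method uses much longer lists).
SKETCH_BASE_LETTERS = 'skg' # ASCII codes for a few base letters
SKETCH_MODIFIERS = 'RH'     # ASCII codes for two subjoined letters

# Base letter, optional modifiers, then sihari ("i"), then more modifiers.
SIHARI_PATTERN = Regexp.new(
  "([#{SKETCH_BASE_LETTERS}][#{SKETCH_MODIFIERS}]*)i([#{SKETCH_MODIFIERS}]*)"
)

# Move "i" in front of the whole cluster, keeping the modifiers in place.
def move_sihari(text)
  text.gsub(SIHARI_PATTERN, 'i\1\2')
end

puts move_sihari('si')   # is
puts move_sihari('kRig') # ikRg
```

Because the letter sets are interpolated into a character class, any character with special meaning there (a backslash in particular) has to be escaped in the source string.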
data/lib/constants.rb ADDED
@@ -0,0 +1,32 @@
1
+ # frozen_string_literal: true
2
+
3
+ module GurmukhiUtils
4
+ # pause chars / ਵਿਸ਼ਰਾਮ symbols
5
+ VISHRAM_LIGHT = '.'
6
+ VISHRAM_MEDIUM = ','
7
+ VISHRAM_HEAVY = ';'
8
+ VISHRAMS = [VISHRAM_LIGHT, VISHRAM_MEDIUM, VISHRAM_HEAVY].freeze
9
+
10
+ BASE_LETTERS = 'ਸਹਕਖਗਘਙਚਛਜਝਞਟਠਡਢਣਤਥਦਧਨਪਫਬਭਮਯਰਲਵੜਸ਼ਖ਼ਗ਼ਜ਼ਫ਼ਲ਼'
11
+ VOWEL_LETTERS =
12
+ # ਅ ਆ ਏ ਐ ਇ ਈ ਓ ਔ ਉ ਊ
13
+ "ਅ\u0a06\u0a0f\u0a10\u0a07\u0a08\u0a13\u0a14\u0a09\u0a0a"
14
+
15
+ ORDERED_VOWELS = [
16
+ 'ਿ',
17
+ 'ੇ',
18
+ 'ੈ',
19
+ 'ੋ',
20
+ 'ੌ',
21
+ 'ੁ',
22
+ 'ੂ',
23
+ 'ਾ',
24
+ 'ੀ'
25
+ ].freeze
26
+
27
+ VIRAMA = '੍'
28
+ BELOW_LETTERS = 'ਹਰਵਟਤਨਚ'
29
+ YAKASH = 'ੵ'
30
+
31
+ VOWEL_DIACRITICS = ORDERED_VOWELS.join
32
+ end
data/lib/gurmukhi_utils.rb ADDED
@@ -0,0 +1,6 @@
1
+ # frozen_string_literal: true
2
+
3
+ require_relative 'unicode'
4
+ require_relative 'ascii'
5
+ require_relative 'remove'
6
+ require_relative 'constants'
data/lib/remove.rb ADDED
@@ -0,0 +1,70 @@
1
+ # frozen_string_literal: true
2
+
3
+ module GurmukhiUtils
4
+ # Removes substrings from the string.
5
+ #
6
+ # @param string [String] The string to affect.
7
+ # @param removals [Array<String>] Any substring to remove.
8
+ # @return [String] The string without any substrings.
9
+ #
10
+ # @example
11
+ # remove("ਸਬਦ. ਸਬਦ, ਸਬਦ; ਸਬਦ", [".", ","])
12
+ # # => "ਸਬਦ ਸਬਦ ਸਬਦ; ਸਬਦ"
13
+ # remove("ਸਬਦ. ਸਬਦ, ਸਬਦ; ਸਬਦ", GurmukhiUtils::VISHRAMS)
14
+ # # => "ਸਬਦ ਸਬਦ ਸਬਦ ਸਬਦ"
15
+ def remove(string, removals)
16
+ removals.each { |removal| string.gsub!(removal, '') }
17
+ string
18
+ end
19
+
20
+ # Removes regex patterns from the string.
21
+ #
22
+ # Note:
23
+ # Also removes duplicate space characters from the string.
24
+ #
25
+ # @param string [String] The string to affect.
26
+ # @param patterns [Array<String>] Any pattern to remove.
27
+ # @return [String] The string without any matching patterns, duplicate spaces, or leading/trailing spaces.
28
+ #
29
+ # @example
30
+ # remove_regex("ਸਬਦ. ਸਬਦ, ਸਬਦ; ਸਬਦ", [".+\\s"])
31
+ # # => "ਸਬਦ"
32
+ def remove_regex(string, patterns)
33
+ patterns.each { |pattern| string.gsub!(Regexp.new(pattern), '') }
34
+ string.replace(string.squeeze(' ').strip) # squeeze! returns nil when nothing changes, so chaining strip! onto it can raise
35
+ string
36
+ end
37
+
38
+ # Attempts to remove line endings as best as possible.
39
+ #
40
+ # @param string [String] The unicode Gurmukhi, Hindi, or English translation/transliteration to affect.
41
+ # @return [String] The string without line endings.
42
+ #
43
+ # @example
44
+ # remove_line_endings("ਸਬਦ ॥ ਸਬਦ ॥੧॥ ਰਹਾਉ ॥")
45
+ # # => "ਸਬਦ ਸਬਦ"
46
+ def remove_line_endings(string)
47
+ line_ending_patterns = [
48
+ '[।॥] *(ਰਹਾਉ|रहाउ).*',
49
+ '[|] *Pause.*',
50
+ '[|] *(rahaau|rahau|rahao).*',
51
+ '[।॥][੦-੯|०-९].*',
52
+ '[|]\\d.*',
53
+ '[।॥|]'
54
+ ]
55
+
56
+ remove_regex(string, line_ending_patterns)
57
+ end
58
+
59
+ # Removes all vishram characters.
60
+ #
61
+ # @param string [String] The string to affect.
62
+ # @return [String] The string without vishrams.
63
+ #
64
+ # @example
65
+ # remove_vishrams("ਸਬਦ. ਸਬਦ, ਸਬਦ; ਸਬਦ ॥")
66
+ # # => "ਸਬਦ ਸਬਦ ਸਬਦ ਸਬਦ ॥"
67
+ def remove_vishrams(string)
68
+ remove(string, GurmukhiUtils::VISHRAMS)
69
+ end
70
+ end
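The methods in this file modify their argument in place. A detail worth keeping in mind when reading or extending them: `gsub!`, `squeeze!`, and `strip!` return `nil` when they make no change, so their return values cannot be chained safely. The sketch below shows that behaviour with core `String` methods only; `tidy_spaces` is a hypothetical helper, not part of the gem.

```ruby
# Bang methods return nil when the receiver was left unchanged.
already_clean = +'abc'
puts already_clean.squeeze!(' ').inspect # nil
puts already_clean.strip!.inspect        # nil

# Hypothetical helper: compute with the non-bang variants, then write the
# result back into the same object so callers still see in-place mutation.
def tidy_spaces(string)
  string.replace(string.squeeze(' ').strip)
end

text = +'  a   b  '
tidy_spaces(text)
puts text.inspect # "a b"
```

In-place mutation also means these methods raise `FrozenError` on frozen strings, which includes string literals in files that use `# frozen_string_literal: true`.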
data/lib/unicode.rb ADDED
@@ -0,0 +1,298 @@
1
+ # frozen_string_literal: true
2
+
3
+ # Copied from https://github.com/shabados/gurmukhiutils/blob/b7a1ca9f0d341b64715158893afb93675c773823/gurmukhiutils/unicode.py
4
+ require 'strscan'
5
+
6
+ module GurmukhiUtils
7
+ UNICODE_STANDARDS = ['Unicode Consortium', 'Sant Lipi'].freeze
8
+
9
+ ASCII_TO_SL_TRANSLATION = {
10
+ 'a'.ord => 'ੳ',
11
+ 'b'.ord => 'ਬ',
12
+ 'c'.ord => 'ਚ',
13
+ 'd'.ord => 'ਦ',
14
+ 'e'.ord => 'ੲ',
15
+ 'f'.ord => 'ਡ',
16
+ 'g'.ord => 'ਗ',
17
+ 'h'.ord => 'ਹ',
18
+ 'i'.ord => 'ਿ',
19
+ 'j'.ord => 'ਜ',
20
+ 'k'.ord => 'ਕ',
21
+ 'l'.ord => 'ਲ',
22
+ 'm'.ord => 'ਮ',
23
+ 'n'.ord => 'ਨ',
24
+ 'o'.ord => 'ੋ',
25
+ 'p'.ord => 'ਪ',
26
+ 'q'.ord => 'ਤ',
27
+ 'r'.ord => 'ਰ',
28
+ 's'.ord => 'ਸ',
29
+ 't'.ord => 'ਟ',
30
+ 'u'.ord => 'ੁ',
31
+ 'v'.ord => 'ਵ',
32
+ 'w'.ord => 'ਾ',
33
+ 'x'.ord => 'ਣ',
34
+ 'y'.ord => 'ੇ',
35
+ 'z'.ord => 'ਜ਼',
36
+ 'A'.ord => 'ਅ',
37
+ 'B'.ord => 'ਭ',
38
+ 'C'.ord => 'ਛ',
39
+ 'D'.ord => 'ਧ',
40
+ 'E'.ord => 'ਓ',
41
+ 'F'.ord => 'ਢ',
42
+ 'G'.ord => 'ਘ',
43
+ 'H'.ord => '੍ਹ',
44
+ 'I'.ord => 'ੀ',
45
+ 'J'.ord => 'ਝ',
46
+ 'K'.ord => 'ਖ',
47
+ 'L'.ord => 'ਲ਼',
48
+ 'M'.ord => 'ੰ',
49
+ 'N'.ord => 'ਂ',
50
+ 'O'.ord => 'ੌ',
51
+ 'P'.ord => 'ਫ',
52
+ 'Q'.ord => 'ਥ',
53
+ 'R'.ord => '੍ਰ',
54
+ 'S'.ord => 'ਸ਼',
55
+ 'T'.ord => 'ਠ',
56
+ 'U'.ord => 'ੂ',
57
+ 'V'.ord => 'ੜ',
58
+ 'W'.ord => 'ਾਂ',
59
+ 'X'.ord => 'ਯ',
60
+ 'Y'.ord => 'ੈ',
61
+ 'Z'.ord => 'ਗ਼',
62
+ '0'.ord => '੦',
63
+ '1'.ord => '੧',
64
+ '2'.ord => '੨',
65
+ '3'.ord => '੩',
66
+ '4'.ord => '੪',
67
+ '5'.ord => '੫',
68
+ '6'.ord => '੬',
69
+ '7'.ord => '੭',
70
+ '8'.ord => '੮',
71
+ '9'.ord => '੯',
72
+ '['.ord => '।',
73
+ ']'.ord => '॥',
74
+ '\\'.ord => 'ਞ',
75
+ '|'.ord => 'ਙ',
76
+ '`'.ord => 'ੱ',
77
+ '~'.ord => 'ੱ',
78
+ '@'.ord => 'ੑ',
79
+ '^'.ord => 'ਖ਼',
80
+ '&'.ord => 'ਫ਼',
81
+ '†'.ord => '੍ਟ', # dagger symbol
82
+ 'ü'.ord => 'ੁ', # u-diaeresis letter
83
+ '®'.ord => '੍ਰ', # registered symbol
84
+ "\u00b4".ord => 'ੵ', # acute accent (´)
85
+ "\u00a8".ord => 'ੂ', # diaeresis accent (¨)
86
+ 'µ'.ord => 'ੰ', # mu letter
87
+ 'æ'.ord => '਼',
88
+ "\u00a1".ord => 'ੴ', # inverted exclamation (¡)
89
+ 'ƒ'.ord => 'ਨੂੰ', # florin symbol
90
+ 'œ'.ord => '੍ਤ',
91
+ 'Í'.ord => '੍ਵ', # capital i-acute letter
92
+ 'Î'.ord => '੍ਯ', # capital i-circumflex letter
93
+ 'Ï'.ord => 'ੵ', # capital i-diaeresis letter
94
+ 'Ò'.ord => '॥', # capital o-grave letter
95
+ 'Ú'.ord => 'ਃ', # capital u-acute letter
96
+ "\u02c6".ord => 'ਂ', # circumflex accent (ˆ)
97
+ "\u02dc".ord => '੍ਨ', # small tilde (˜)
98
+ '§'.ord => '੍ਹੂ', # section symbol
99
+ '¤'.ord => 'ੱ', # currency symbol
100
+ 'ç'.ord => '੍ਚ', # c-cedilla letter
101
+ 'Ç'.ord => '☬', # khanda instead of california state symbol
102
+ "\u201a".ord => '❁', # single low-9 quotation (‚) mark
103
+ 'Æ'.ord => nil,
104
+ 'Ø'.ord => nil, # This is a topline / shirorekha (शिरोरेखा) extender
105
+ 'ÿ'.ord => nil, # This is the author Kulbir S Thind's stamp
106
+ 'Œ'.ord => nil, # Box drawing left flower
107
+ '‰'.ord => nil, # Box drawing right flower
108
+ 'Ó'.ord => nil, # Box drawing top flower
109
+ 'Ô'.ord => nil, # Box drawing bottom flower
110
+ 'Î'.ord => '꠳ਯ', # half-yayya
111
+ 'ï'.ord => '꠴ਯ', # open-top yayya
112
+ 'î'.ord => '꠵ਯ' # open-top half-yayya
113
+ }.freeze
114
+
115
+ ASCII_TO_SL_REPLACEMENTS = {
116
+ 'ˆØI' => 'ੀਁ', # Handle pre-bihari-bindi with unused adakbindi
117
+ '<>' => 'ੴ', # AnmolLipi/GurbaniAkhar variant
118
+ '<' => 'ੴ', # GurbaniLipi variant
119
+ '>' => '☬', # GurbaniLipi variant
120
+ 'Åå' => 'ੴ', # AnmolLipi/GurbaniAkhar variant
121
+ 'Å' => 'ੴ', # GurbaniLipi variant
122
+ 'å' => 'ੴ' # GurbaniLipi variant
123
+ }.freeze
124
+
125
+ UNICODE_TO_SL_REPLACEMENTS = {
126
+ '੍ਯ' => '꠳ਯ' # replace unicode half-yayya with Sant Lipi ligature (north indic one-sixteenth fraction + yayya)
127
+ }.freeze
128
+
129
+ SL_TO_UNICODE_REPLACEMENTS = {
130
+ '꠳ਯ' => '੍ਯ',
131
+ '꠴ਯ' => 'ਯ',
132
+ '꠵ਯ' => '੍ਯ',
133
+ 'ਁ' => 'ਂ' # pre-bihari-bindi
134
+ }.freeze
135
+
136
+ ##
137
+ # Converts any ASCII Gurmukhi characters and sanitizes to Unicode Gurmukhi.
138
+ # Note:
139
+ # Converting yayya (ਯ) variants with an open top using the Unicode Consortium standard is considered destructive.
140
+ # This function will substitute the original with its shirorekha/top-line equivalent.
141
+ #
142
+ # Many fonts and text shaping engines fail to render half-yayya (੍ਯ) correctly. Regardless of the standard used,
143
+ # it is recommended to use the Sant Lipi font mentioned below.
144
+ #
145
+ # @param string [String] The string to affect.
146
+ # @param unicode_standard [String] The mapping system to use. The default is Unicode compliant and can render
147
+ # 99% of the Shabad OS Database. The other option "Sant Lipi" is intended for a custom Unicode font bearing the
148
+ # same name (see: https://github.com/shabados/SantLipi). Defaults to "Unicode Consortium".
149
+ # @return [String] A string whose Gurmukhi is normalized to a Unicode standard.
150
+ #
151
+ # @example
152
+ # unicode("123")
153
+ # #=> "੧੨੩"
154
+ # unicode("<> > <")
155
+ # #=> "ੴ ☬ ੴ"
156
+ # unicode("gurU")
157
+ # #=> "ਗੁਰੂ"
158
+ ##
159
+ def self.unicode(string, unicode_standard = 'Unicode Consortium')
160
+ # Move ASCII sihari before mapping to unicode
161
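+ ascii_base_letters = '\\\\a-zA-Z|^&Îîï' # doubled backslash so the character class matches a literal backslash rather than starting a range at \a (bell)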
162
+ ascii_sihari_pattern = Regexp.new("(i)([#{ascii_base_letters}])")
163
+ string = string.gsub(ascii_sihari_pattern, '\2\1')
164
+
165
+ # Map any ASCII / Unicode Gurmukhi to Sant Lipi format
166
+ ASCII_TO_SL_REPLACEMENTS.each do |key, value|
167
+ string.gsub!(key, value)
168
+ end
169
+
170
+ UNICODE_TO_SL_REPLACEMENTS.each do |key, value|
171
+ string.gsub!(key, value)
172
+ end
173
+
174
+ string = string.chars.map { |c| ASCII_TO_SL_TRANSLATION.fetch(c.ord, c) }.join # fetch keeps nil mappings, so those characters are dropped by join
175
+
176
+ string = unicode_normalize(string)
177
+
178
+ if unicode_standard == 'Unicode Consortium'
179
+ SL_TO_UNICODE_REPLACEMENTS.each do |key, value|
180
+ string.gsub!(key, value)
181
+ end
182
+ end
183
+
184
+ return string
185
+ end
186
+
187
+ ##
188
+ # Normalizes Gurmukhi according to Unicode Standards.
189
+ # @param string [String] The string to affect.
190
+ # @return [String] A string containing normalized Gurmukhi.
191
+ #
192
+ # @example
193
+ # unicode_normalize("Hello ਜੀ")
194
+ # #=> "Hello ਜੀ"
195
+ ##
196
+ def self.unicode_normalize(string)
197
+ string = sort_diacritics(string)
198
+ return sanitize_unicode(string)
199
+ end
200
+
201
+ ##
202
+ # In Gurmukhi script, some common diacritics include vowel signs (matras), such as:
203
+ # ੁ (u), ੀ (ī), or ੋ (ō), and other marks like bindi (ਂ), tippi (ੰ), and nukta (਼).
204
+ #
205
+ # @brief Orders the Gurmukhi diacritics in a string according to Unicode standards.
206
+ # Not intended for base letters with multiple subjoined letters.
207
+ # @param string [String] The string to affect.
208
+ # @return [String] The same string with Gurmukhi diacritics arranged in a sorted manner.
209
+ #
210
+ # @example
211
+ # sort_diacritics("\u0a41\u0a4b") # => "\u0a4b\u0a41" # ੁੋ vs ੋੁ
212
+ ##
213
+ def self.sort_diacritics(string)
214
+ # Nukta is essential to form a new base letter and must be ordered first.
215
+ # Udaat, Yakash, and subjoined letters should follow.
216
+ # Subjoined letters are constructed (they are not single char), so they cannot be used
217
+ # in the same regex group pattern. See further below for subjoined letters.
218
+ base_letter_modifiers = ['਼', 'ੑ', 'ੵ']
219
+
220
+ # More generally, when a consonant or independent vowel is modified by multiple vowel signs, the sequence of the vowel signs in the underlying representation of the text should be: left, top, bottom, right.
221
+ # p. 491 of The Unicode® Standard Version 14.0 – Core Specification
222
+ # https://www.unicode.org/versions/Unicode14.0.0/ch12.pdf
223
+ vowel_order = ['ਿ', 'ੇ', 'ੈ', 'ੋ', 'ੌ', 'ੁ', 'ੂ', 'ਾ', 'ੀ']
224
+
225
+ # The remaining diacritics are to be sorted at the end according to the following order
226
+ remaining_modifier_order = ['ਁ', 'ੱ', 'ਂ', 'ੰ', 'ਃ']
227
+
228
+ generated_marks = (base_letter_modifiers + vowel_order + remaining_modifier_order).join
229
+ mark_pattern = Regexp.new("([#{generated_marks}]*)")
230
+
231
+ virama = '੍'
232
+ below_base_letters = 'ਹਰਵਟਤਨਚ'
233
+ below_base_pattern = Regexp.new("(#{virama}[#{below_base_letters}])?")
234
+
235
+ regex_match_pattern = Regexp.new("#{mark_pattern}#{below_base_pattern}#{mark_pattern}")
236
+
237
+ generated_match_order = (base_letter_modifiers + [virama] + below_base_letters.chars + vowel_order + remaining_modifier_order).join
238
+
239
+ string.gsub(regex_match_pattern) do |match|
240
+ match_chars = match.chars
241
+ match_chars.sort_by! { |e| generated_match_order.index(e) }
242
+ match_chars.join
243
+ end
244
+ end
245
+
246
+ def self.sanitize_unicode(string)
247
+ unicode_sanitization_map = {
248
+ "\u0a73\u0a4b" => "\u0a13", # ਓ
249
+ "\u0a05\u0a3e" => "\u0a06", # ਅ + ਾ = ਆ
250
+ "\u0a72\u0a3f" => "\u0a07", # ਇ
251
+ "\u0a72\u0a40" => "\u0a08", # ਈ
252
+ "\u0a73\u0a41" => "\u0a09", # ਉ
253
+ "\u0a73\u0a42" => "\u0a0a", # ਊ
254
+ "\u0a72\u0a47" => "\u0a0f", # ਏ
255
+ "\u0a05\u0a48" => "\u0a10", # ਐ
256
+ "\u0a05\u0a4c" => "\u0a14", # ਔ
257
+ "\u0a32\u0a3c" => "\u0a33", # ਲ਼
258
+ "\u0a38\u0a3c" => "\u0a36", # ਸ਼
259
+ "\u0a16\u0a3c" => "\u0a59", # ਖ਼
260
+ "\u0a17\u0a3c" => "\u0a5a", # ਗ਼
261
+ "\u0a1c\u0a3c" => "\u0a5b", # ਜ਼
262
+ "\u0a2b\u0a3c" => "\u0a5e", # ਫ਼
263
+ "\u0a71\u0a02" => "\u0a01" # ਁ adak bindi (quite literally never used today or in the Shabad OS Database, only included for parity with the Unicode block)
264
+ }
265
+
266
+ unicode_sanitization_map.each do |key, value|
267
+ string.gsub!(key, value)
268
+ end
269
+
270
+ return string
271
+ end
272
+
273
+ ##
274
+ # Takes a string and returns a list of keys and values of each character and its corresponding code point.
275
+ # @param string [String] the string to affect
276
+ # @return [Array<Hash{String => String}>] a list of each character and its corresponding code point
277
+ # @example
278
+ # decode_unicode("To ਜੀ")
279
+ # #=> [{"T" => "0054"}, {"o" => "006f"}, {" " => "0020"}, {"ਜ" => "0a1c"}, {"ੀ" => "0a40"}]
280
+ ##
281
+ def self.decode_unicode(string)
282
+ return string.chars.map { |item| { item => format('%04x', item.ord) } }
283
+ end
284
+
285
+ ##
286
+ # Takes a list of hexadecimal code point strings and returns the corresponding unicode characters.
287
+ # @param strings [Array<String>] the list containing any strings to encode
288
+ # @return [Array<String>] a list of any corresponding unicode characters
289
+ # @example
290
+ # encode_unicode(["0054"])
291
+ # #=> "T"
292
+ # encode_unicode(["0a1c", "0A40"])
293
+ # #=> ["ਜ", "ੀ"]
294
+ ##
295
+ def self.encode_unicode(strings)
296
+ return strings.map { |string| string.to_i(16).chr(Encoding::UTF_8) }
297
+ end
298
+ end
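The core idea in `sort_diacritics` is to sort each run of combining marks by their position in a reference ordering string. A stripped-down sketch of that technique is below. `MARK_ORDER` holds only three marks and `sort_marks` is a hypothetical name, so this illustrates the approach rather than reproducing the method.

```ruby
# Reference order (subset): ੋ (U+0A4B) sorts before ੁ (U+0A41),
# and ੰ (U+0A70) comes last.
MARK_ORDER = "\u0a4b\u0a41\u0a70"
MARK_RUN = Regexp.new("[#{MARK_ORDER}]+")

# Re-order every run of known marks by each mark's index in MARK_ORDER.
def sort_marks(text)
  text.gsub(MARK_RUN) do |run|
    run.chars.sort_by { |mark| MARK_ORDER.index(mark) }.join
  end
end

sorted = sort_marks("\u0a17\u0a70\u0a41") # ਗ followed by ੰ then ੁ
puts sorted == "\u0a17\u0a41\u0a70"       # true
```

Characters outside the run pattern, such as the base letter ਗ here, are never passed to the block and so keep their position.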
metadata ADDED
@@ -0,0 +1,55 @@
1
+ --- !ruby/object:Gem::Specification
2
+ name: gurmukhi_utils
3
+ version: !ruby/object:Gem::Version
4
+ version: 0.0.1
5
+ platform: ruby
6
+ authors:
7
+ - Dilraj Singh Somel (dsomel21)
8
+ autorequire:
9
+ bindir: bin
10
+ cert_chain: []
11
+ date: 2023-07-15 00:00:00.000000000 Z
12
+ dependencies: []
13
+ description: Library for working with Gurmukhi text, providing various operations
14
+ like unicode conversion, ascii conversion, and more.
15
+ email:
16
+ - dsomel21@gmail.com
17
+ executables: []
18
+ extensions: []
19
+ extra_rdoc_files: []
20
+ files:
21
+ - CONTRIBUTING.md
22
+ - Gemfile
23
+ - Gemfile.lock
24
+ - README.md
25
+ - gurmukhi_utils.gemspec
26
+ - lib/ascii.rb
27
+ - lib/constants.rb
28
+ - lib/gurmukhi_utils.rb
29
+ - lib/remove.rb
30
+ - lib/unicode.rb
31
+ homepage: https://github.com/ShabadOS/gurmukhi-utils/tree/main/ruby
32
+ licenses:
33
+ - MIT
34
+ metadata:
35
+ rubygems_mfa_required: 'true'
36
+ post_install_message:
37
+ rdoc_options: []
38
+ require_paths:
39
+ - lib
40
+ required_ruby_version: !ruby/object:Gem::Requirement
41
+ requirements:
42
+ - - '='
43
+ - !ruby/object:Gem::Version
44
+ version: 3.2.1
45
+ required_rubygems_version: !ruby/object:Gem::Requirement
46
+ requirements:
47
+ - - ">="
48
+ - !ruby/object:Gem::Version
49
+ version: '0'
50
+ requirements: []
51
+ rubygems_version: 3.4.7
52
+ signing_key:
53
+ specification_version: 4
54
+ summary: A utility gem for converting, analyzing, and testing Gurmukhi strings.
55
+ test_files: []