gurmukhi_utils 0.0.1
- checksums.yaml +7 -0
- data/CONTRIBUTING.md +76 -0
- data/Gemfile +14 -0
- data/Gemfile.lock +60 -0
- data/README.md +14 -0
- data/gurmukhi_utils.gemspec +20 -0
- data/lib/ascii.rb +161 -0
- data/lib/constants.rb +32 -0
- data/lib/gurmukhi_utils.rb +6 -0
- data/lib/remove.rb +70 -0
- data/lib/unicode.rb +298 -0
- metadata +55 -0
checksums.yaml
ADDED
@@ -0,0 +1,7 @@
+---
+SHA256:
+  metadata.gz: dbb164eb5abec25b925649743ac6ae8797bce240903a42e13ee7bca4cd5635bc
+  data.tar.gz: 62fd647545eee43081089e407517bab0c00f3136a37862a82a78e7e276a9262f
+SHA512:
+  metadata.gz: 314ce8d8a0149fa9d7327c44b8b0c145e2a0bfb031704d2b8843bfa636873d45fbf198dfe7666511242b3f0f788291beae2e9dd4f8a8d980fc6b99017a731424
+  data.tar.gz: 4d92bda8fe9cb31cbe430812d1953e517bc302acc5fa6e938bcf6197c05bd3dae39345ba723d5c0a79095b7576824f7a05a9bf1e9673e62aef675ced3e45abce
data/CONTRIBUTING.md
ADDED
@@ -0,0 +1,76 @@
|
|
1
|
+
# Contributing
|
2
|
+
|
3
|
+
Please see our [community docs on contributing](https://shabados.com/docs/community/contributing).
|
4
|
+
|
5
|
+
This document is for developers or programmers contributing to source code. If you're interested in contributing a different way, please see the link above.
|
6
|
+
|
7
|
+
## Project Structure
|
8
|
+
|
9
|
+
This project follows a standard Ruby gem structure. Here's a brief overview of the main directories and files:
|
10
|
+
|
11
|
+
- `lib/`: Contains the main source code for the GurmukhiUtils gem.
|
12
|
+
- `gurmukhi_utils.rb`: The main entry point for the gem. This file requires all other necessary files.
|
13
|
+
- `<feature>.rb` files: Each additional file in this directory represents a specific feature/module of the gem.
|
14
|
+
- `spec/`: Contains the RSpec test files for the gem. Test files should be placed in this directory, following the naming convention `lib/<feature>_spec.rb`.
|
15
|
+
- `Gemfile`: Specifies the gem dependencies for development and testing.
|
16
|
+
- `Gemfile.lock`: Generated by Bundler, this file contains the exact gem versions and their dependencies used in the project.
|
17
|
+
- `gurmukhi_utils.gemspec`: The gem specification file, which provides information about the gem. We need this to publish and release the gem.
|
18
|
+
|
19
|
+
## Adding New Features to `GurmukhiUtils`
|
20
|
+
|
21
|
+
To add a new feature to GurmukhiUtils, follow these steps:
|
22
|
+
|
23
|
+
**Step 1.**
|
24
|
+
|
25
|
+
Create a new file in the `lib/` directory for the new feature, and place the new functionality within the `GurmukhiUtils` module.
|
26
|
+
|
27
|
+
For example, if you want to create an `ascii` method, create a new file called `ascii.rb`:
|
28
|
+
|
29
|
+
```ruby
|
30
|
+
# lib/ascii.rb
|
31
|
+
|
32
|
+
module GurmukhiUtils
|
33
|
+
def self.helpers
|
34
|
+
# ...
|
35
|
+
end
|
36
|
+
|
37
|
+
def self.other_methods
|
38
|
+
# ...
|
39
|
+
end
|
40
|
+
|
41
|
+
def self.ascii
|
42
|
+
# ...
|
43
|
+
end
|
44
|
+
end
|
45
|
+
```
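Because Ruby modules can be reopened, each feature file can add its own methods to the same `GurmukhiUtils` namespace. The following is only a minimal sketch of that mechanism; the file names and the `greet` and `shout` methods are hypothetical and are not part of the gem.

```ruby
# Stand-in for a first feature file, e.g. lib/first_feature.rb (hypothetical).
module GurmukhiUtils
  def self.greet(name)
    "Sat Sri Akal, #{name}"
  end
end

# Stand-in for a second feature file, e.g. lib/second_feature.rb (hypothetical).
# Declaring the module again reopens it instead of replacing it.
module GurmukhiUtils
  def self.shout(text)
    text.upcase
  end
end

# Both methods are available on the same module.
puts GurmukhiUtils.greet('ji')
puts GurmukhiUtils.shout('waheguru')
```

Defining methods with `self.` is what makes them callable as `GurmukhiUtils.method_name` without including the module anywhere.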
|
46
|
+
|
47
|
+
**Step 2.**
|
48
|
+
|
49
|
+
Update the `lib/gurmukhi_utils.rb` file to require the new feature file using `require_relative`:
|
50
|
+
|
51
|
+
```ruby
|
52
|
+
# lib/gurmukhi_utils.rb
|
53
|
+
|
54
|
+
require_relative 'gurmukhi_utils/version'
|
55
|
+
require_relative 'unicode'
|
56
|
+
require_relative 'ascii' # Add this line for the new feature
|
57
|
+
```
|
58
|
+
|
59
|
+
**Step 3.**
|
60
|
+
|
61
|
+
Test the feature
|
62
|
+
|
63
|
+
Write tests for the new feature in the `spec` directory.
|
64
|
+
|
65
|
+
You can also use `irb` and do:
|
66
|
+
|
67
|
+
```ruby
|
68
|
+
001 > require_relative "lib/gurmukhi_utils"
|
69
|
+
=> true
|
70
|
+
002 > GurmukhiUtils.ascii("...")
|
71
|
+
=> "..."
|
72
|
+
```
|
73
|
+
|
74
|
+
## Thank you
|
75
|
+
|
76
|
+
Your contributions to open source, large or small, make great projects like this possible. Thank you for taking the time to participate in this project.
|
data/Gemfile
ADDED
data/Gemfile.lock
ADDED
@@ -0,0 +1,60 @@
|
|
1
|
+
GEM
|
2
|
+
remote: https://rubygems.org/
|
3
|
+
specs:
|
4
|
+
ast (2.4.2)
|
5
|
+
diff-lcs (1.5.0)
|
6
|
+
json (2.6.3)
|
7
|
+
parallel (1.22.1)
|
8
|
+
parser (3.2.1.0)
|
9
|
+
ast (~> 2.4.1)
|
10
|
+
rainbow (3.1.1)
|
11
|
+
regexp_parser (2.7.0)
|
12
|
+
rexml (3.2.5)
|
13
|
+
rspec (3.12.0)
|
14
|
+
rspec-core (~> 3.12.0)
|
15
|
+
rspec-expectations (~> 3.12.0)
|
16
|
+
rspec-mocks (~> 3.12.0)
|
17
|
+
rspec-core (3.12.2)
|
18
|
+
rspec-support (~> 3.12.0)
|
19
|
+
rspec-expectations (3.12.3)
|
20
|
+
diff-lcs (>= 1.2.0, < 2.0)
|
21
|
+
rspec-support (~> 3.12.0)
|
22
|
+
rspec-mocks (3.12.5)
|
23
|
+
diff-lcs (>= 1.2.0, < 2.0)
|
24
|
+
rspec-support (~> 3.12.0)
|
25
|
+
rspec-support (3.12.0)
|
26
|
+
rubocop (1.47.0)
|
27
|
+
json (~> 2.3)
|
28
|
+
parallel (~> 1.10)
|
29
|
+
parser (>= 3.2.0.0)
|
30
|
+
rainbow (>= 2.2.2, < 4.0)
|
31
|
+
regexp_parser (>= 1.8, < 3.0)
|
32
|
+
rexml (>= 3.2.5, < 4.0)
|
33
|
+
rubocop-ast (>= 1.26.0, < 2.0)
|
34
|
+
ruby-progressbar (~> 1.7)
|
35
|
+
unicode-display_width (>= 2.4.0, < 3.0)
|
36
|
+
rubocop-ast (1.27.0)
|
37
|
+
parser (>= 3.2.1.0)
|
38
|
+
rubocop-capybara (2.17.1)
|
39
|
+
rubocop (~> 1.41)
|
40
|
+
rubocop-rspec (2.19.0)
|
41
|
+
rubocop (~> 1.33)
|
42
|
+
rubocop-capybara (~> 2.17)
|
43
|
+
ruby-progressbar (1.12.0)
|
44
|
+
strscan (3.0.5)
|
45
|
+
unicode-display_width (2.4.2)
|
46
|
+
|
47
|
+
PLATFORMS
|
48
|
+
x86_64-darwin-19
|
49
|
+
|
50
|
+
DEPENDENCIES
|
51
|
+
rspec
|
52
|
+
rubocop
|
53
|
+
rubocop-rspec
|
54
|
+
strscan
|
55
|
+
|
56
|
+
RUBY VERSION
|
57
|
+
ruby 3.2.1p31
|
58
|
+
|
59
|
+
BUNDLED WITH
|
60
|
+
2.4.7
|
data/README.md
ADDED
@@ -0,0 +1,14 @@
+# Gurmukhi Utils (Ruby)
+
+Utilities library for converting, analyzing, and testing Gurmukhi strings.
+
+## Related
+
+This library is one of many in the Gurmukhi Utils super-repo.
+
+- [Super Repo](/README.md)
+- [Python](/python/README.md)
+- [JavaScript](/javascript/README.md)
+- [Ruby](/ruby/README.md)
+- [C# / C Sharp](/csharp/README.md)
+- [Dart](/dart/README.md)
|
data/gurmukhi_utils.gemspec
ADDED
@@ -0,0 +1,20 @@
+# frozen_string_literal: true
+
+Gem::Specification.new do |spec|
+  spec.name = 'gurmukhi_utils'
+  spec.version = '0.0.1'
+  spec.authors = ['Dilraj Singh Somel (dsomel21)']
+  spec.email = ['dsomel21@gmail.com']
+
+  spec.summary = 'A utility gem for converting, analyzing, and testing Gurmukhi strings.'
+  spec.description = 'Library for working with Gurmukhi text, providing various operations like unicode conversion, ascii conversion, and more.'
+  spec.homepage = 'https://github.com/ShabadOS/gurmukhi-utils/tree/main/ruby'
+  spec.license = 'MIT'
+
+  spec.required_ruby_version = '3.2.1'
+
+  spec.files = `git ls-files -z`.split("\x0").reject { |f| f.match(%r{^(test|spec)/}) }
+  spec.require_paths = ['lib']
+
+  spec.metadata['rubygems_mfa_required'] = 'true'
+end
|
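The gemspec sets `required_ruby_version` to a bare `'3.2.1'`, which RubyGems records as an exact `=` requirement (the packaged metadata shows `- - '='`). The sketch below uses the standard `Gem::Requirement` and `Gem::Version` classes to show what that means in practice; the `'>= 3.2.1'` requirement is a hypothetical alternative for comparison, not something the gem declares.

```ruby
require 'rubygems' # normally loaded already; shown for clarity

# A bare version string is treated as an exact match ("= 3.2.1").
exact = Gem::Requirement.new('3.2.1')

# Hypothetical looser constraint, for comparison only.
minimum = Gem::Requirement.new('>= 3.2.1')

same_patch  = Gem::Version.new('3.2.1')
newer_patch = Gem::Version.new('3.2.2')

puts exact.satisfied_by?(same_patch)    # true
puts exact.satisfied_by?(newer_patch)   # false
puts minimum.satisfied_by?(newer_patch) # true
```

With the exact pin, installing on any Ruby other than 3.2.1 would be rejected by RubyGems.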
data/lib/ascii.rb
ADDED
@@ -0,0 +1,161 @@
|
|
1
|
+
# frozen_string_literal: true
|
2
|
+
|
3
|
+
module GurmukhiUtils
|
4
|
+
ASCII_TRANSLATION = {
|
5
|
+
'ੳ'.ord => 'a',
|
6
|
+
'ਅ'.ord => 'A',
|
7
|
+
'ੲ'.ord => 'e',
|
8
|
+
'ਸ'.ord => 's',
|
9
|
+
'ਹ'.ord => 'h',
|
10
|
+
'ਕ'.ord => 'k',
|
11
|
+
'ਖ'.ord => 'K',
|
12
|
+
'ਗ'.ord => 'g',
|
13
|
+
'ਘ'.ord => 'G',
|
14
|
+
'ਙ'.ord => '|',
|
15
|
+
'ਚ'.ord => 'c',
|
16
|
+
'ਛ'.ord => 'C',
|
17
|
+
'ਜ'.ord => 'j',
|
18
|
+
'ਝ'.ord => 'J',
|
19
|
+
'ਞ'.ord => '\\',
|
20
|
+
'ਟ'.ord => 't',
|
21
|
+
'ਠ'.ord => 'T',
|
22
|
+
'ਡ'.ord => 'f',
|
23
|
+
'ਢ'.ord => 'F',
|
24
|
+
'ਣ'.ord => 'x',
|
25
|
+
'ਤ'.ord => 'q',
|
26
|
+
'ਥ'.ord => 'Q',
|
27
|
+
'ਦ'.ord => 'd',
|
28
|
+
'ਧ'.ord => 'D',
|
29
|
+
'ਨ'.ord => 'n',
|
30
|
+
'ਪ'.ord => 'p',
|
31
|
+
'ਫ'.ord => 'P',
|
32
|
+
'ਬ'.ord => 'b',
|
33
|
+
'ਭ'.ord => 'B',
|
34
|
+
'ਮ'.ord => 'm',
|
35
|
+
'ਯ'.ord => 'X',
|
36
|
+
'ਰ'.ord => 'r',
|
37
|
+
'ਲ'.ord => 'l',
|
38
|
+
'ਵ'.ord => 'v',
|
39
|
+
'ੜ'.ord => 'V',
|
40
|
+
'ਸ਼'.ord => 'S',
|
41
|
+
'ਜ਼'.ord => 'z',
|
42
|
+
'ਖ਼'.ord => '^',
|
43
|
+
'ਫ਼'.ord => '&',
|
44
|
+
'ਗ਼'.ord => 'Z',
|
45
|
+
'ਲ਼'.ord => 'L',
|
46
|
+
'਼'.ord => 'æ',
|
47
|
+
'ੑ'.ord => '@',
|
48
|
+
'ੵ'.ord => "\u00b4", # acute accent (´)
|
49
|
+
'ਃ'.ord => 'Ú', # capital u-acute letter
|
50
|
+
"\u0a13".ord => 'E', # ਓ
|
51
|
+
"\u0a06".ord => 'Aw', # ਆ
|
52
|
+
"\u0a07".ord => 'ei', # ਇ
|
53
|
+
"\u0a08".ord => 'eI', # ਈ
|
54
|
+
"\u0a09".ord => 'au', # ਉ
|
55
|
+
"\u0a0a".ord => 'aU', # ਊ
|
56
|
+
"\u0a0f".ord => 'ey', # ਏ
|
57
|
+
"\u0a10".ord => 'AY', # ਐ
|
58
|
+
"\u0a14".ord => 'AO', # ਔ
|
59
|
+
'ਾ'.ord => 'w',
|
60
|
+
'ਿ'.ord => 'i',
|
61
|
+
'ੀ'.ord => 'I',
|
62
|
+
'ੁ'.ord => 'u',
|
63
|
+
'ੂ'.ord => 'U',
|
64
|
+
'ੇ'.ord => 'y',
|
65
|
+
'ੈ'.ord => 'Y',
|
66
|
+
'ੋ'.ord => 'o',
|
67
|
+
'ੌ'.ord => 'O',
|
68
|
+
'ੰ'.ord => 'M',
|
69
|
+
'ਂ'.ord => 'N',
|
70
|
+
'ੱ'.ord => '~',
|
71
|
+
'।'.ord => '[',
|
72
|
+
'॥'.ord => ']',
|
73
|
+
'੦'.ord => '0',
|
74
|
+
'੧'.ord => '1',
|
75
|
+
'੨'.ord => '2',
|
76
|
+
'੩'.ord => '3',
|
77
|
+
'੪'.ord => '4',
|
78
|
+
'੫'.ord => '5',
|
79
|
+
'੬'.ord => '6',
|
80
|
+
'੭'.ord => '7',
|
81
|
+
'੮'.ord => '8',
|
82
|
+
'੯'.ord => '9',
|
83
|
+
'ੴ'.ord => '<>',
|
84
|
+
'☬'.ord => 'Ç'
|
85
|
+
}.freeze
|
86
|
+
|
87
|
+
ASCII_REPLACEMENTS = {
|
88
|
+
'੍ਯ' => 'Î', # half-yayya
|
89
|
+
'꠳ਯ' => 'Î', # sant lipi variation
|
90
|
+
'꠴ਯ' => 'ï', # open-top yayya
|
91
|
+
'꠵ਯ' => 'î', # open-top half-yayya
|
92
|
+
'੍ਰ' => 'R',
|
93
|
+
'੍ਵ' => 'Í', # capital i-acute letter
|
94
|
+
'੍ਹ' => 'H',
|
95
|
+
'੍ਚ' => 'ç', # c-cedilla letter
|
96
|
+
'੍ਟ' => '†', # dagger symbol
|
97
|
+
'੍ਤ' => 'œ',
|
98
|
+
'੍ਨ' => "\u02dc" # small tilde (˜)
|
99
|
+
}.freeze
|
100
|
+
|
101
|
+
# TODO: Raise warnings if incorrect vowel syntax
|
102
|
+
|
103
|
+
def self.ascii(string)
|
104
|
+
string = unicode_normalize(string)
|
105
|
+
|
106
|
+
# Perform replacements
|
107
|
+
ASCII_REPLACEMENTS.each do |key, value|
|
108
|
+
string.gsub!(key, value)
|
109
|
+
end
|
110
|
+
|
111
|
+
# Perform translation
|
112
|
+
string = string.chars.map { |c| ASCII_TRANSLATION[c.ord] || c }.join
|
113
|
+
|
114
|
+
# Re-arrange sihari
|
115
|
+
ascii_base_letters = 'AeshkKgG|cCjJ\\\\tTfFxqQdDnpPbBmXrlvVSz^&ZLÎïî' # doubled backslash: the regex class then holds a literal "\" (ਞ) and "t" (ਟ) instead of a tab
|
116
|
+
ascii_modifiers = 'æ@´ÚwIuUyYoO`MNRÍH熜˜ü¨®µˆW~¤Ï'
|
117
|
+
regex = Regexp.new("([#{ascii_base_letters}][#{ascii_modifiers}]*)i([#{ascii_modifiers}]*)")
|
118
|
+
string.gsub!(regex, 'i\1\2')
|
119
|
+
|
120
|
+
# Fix below-base-letter + u vowel positioning
|
121
|
+
ascii_below_base_letters = 'RÍH熜˜´@'
|
122
|
+
below_vowel_mappings = {
|
123
|
+
'u' => 'ü',
|
124
|
+
'U' => '¨'
|
125
|
+
}
|
126
|
+
|
127
|
+
below_vowel_mappings.each do |key, value|
|
128
|
+
string.gsub!(/([#{ascii_below_base_letters}][#{ascii_modifiers}]*)#{key}([#{ascii_modifiers}]*)/, "\\1#{value}\\2")
|
129
|
+
end
|
130
|
+
|
131
|
+
# Fix center-stroke + tippi positioning
|
132
|
+
center_stroke_letters = 'nT'
|
133
|
+
string.gsub!(/([#{center_stroke_letters}][#{ascii_modifiers}]*)M([#{ascii_modifiers}]*)/, '\\1µ\\2')
|
134
|
+
|
135
|
+
# Fix positioning of bindi/tippi when it is the only above-base-form
|
136
|
+
ascii_non_above_modifiers = 'æ@´ÚwuURÍH熜˜ü¨®Ï'
|
137
|
+
nasalization_mappings = {
|
138
|
+
'N' => 'ˆ',
|
139
|
+
'~' => '`'
|
140
|
+
}
|
141
|
+
nasalization_mappings.each do |key, value|
|
142
|
+
string.gsub!(/([#{ascii_base_letters}][#{ascii_non_above_modifiers}]*)#{key}([#{ascii_non_above_modifiers}]*)/, "\\1#{value}\\2")
|
143
|
+
end
|
144
|
+
|
145
|
+
# Make rendering changes for combos
|
146
|
+
ascii_combo_replacements = {
|
147
|
+
'Iਁ' => 'ˆØI', # bindi + bihari ligature
|
148
|
+
'IM' => 'µØI', # tippi + bihari ligature
|
149
|
+
'Iµ' => 'µØI', # tippi + bihari ligature
|
150
|
+
'kR' => 'k®', # kakka + pair-rara ligature
|
151
|
+
'H¨' => '§',
|
152
|
+
'wN' => 'W', # addhak positioning
|
153
|
+
'wˆ' => 'W', # addhak positioning
|
154
|
+
'nUµ' => 'ƒ'
|
155
|
+
}
|
156
|
+
ascii_combo_replacements.each do |key, value|
|
157
|
+
string.gsub!(key, value)
|
158
|
+
end
|
159
|
+
return string
|
160
|
+
end
|
161
|
+
end
|
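The letter sets in `ascii` are interpolated into regular-expression character classes, so any backslash sequence inside them is interpreted by the regex engine rather than by the Ruby string literal. The sketch below shows that behavior with shortened, illustrative letter sets (not the gem's full lists).

```ruby
# In a single-quoted literal, '\t' is two characters: a backslash and "t".
# Inside a character class, the regex engine reads that pair as a TAB escape.
single_backslash = 'J\tT'
tab_class = Regexp.new("[#{single_backslash}]")

puts tab_class.match?("\t") # true: the class contains a tab
puts tab_class.match?('t')  # false: "t" is not in the class

# Writing four backslashes in the literal yields two in the string, which the
# regex engine reads as one escaped literal backslash followed by a plain "t".
double_backslash = 'J\\\\tT'
literal_class = Regexp.new("[#{double_backslash}]")

puts literal_class.match?('t')  # true
puts literal_class.match?('\\') # true: literal backslash
puts literal_class.match?("\t") # false
```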
data/lib/constants.rb
ADDED
@@ -0,0 +1,32 @@
|
|
1
|
+
# frozen_string_literal: true
|
2
|
+
|
3
|
+
module GurmukhiUtils
|
4
|
+
# pause chars / ਵਿਸ਼ਰਾਮ symbols
|
5
|
+
VISHRAM_LIGHT = '.'
|
6
|
+
VISHRAM_MEDIUM = ','
|
7
|
+
VISHRAM_HEAVY = ';'
|
8
|
+
VISHRAMS = [VISHRAM_LIGHT, VISHRAM_MEDIUM, VISHRAM_HEAVY].freeze
|
9
|
+
|
10
|
+
BASE_LETTERS = 'ਸਹਕਖਗਘਙਚਛਜਝਞਟਠਡਢਣਤਥਦਧਨਪਫਬਭਮਯਰਲਵੜਸ਼ਖ਼ਗ਼ਜ਼ਫ਼ਲ਼'
|
11
|
+
VOWEL_LETTERS =
|
12
|
+
# ਅ ਆ ਏ ਐ ਇ ਈ ਓ ਔ ਉ ਊ
|
13
|
+
"ਅ\u0a06\u0a0f\u0a10\u0a07\u0a08\u0a13\u0a14\u0a09\u0a0a"
|
14
|
+
|
15
|
+
ORDERED_VOWELS = [
|
16
|
+
'ਿ',
|
17
|
+
'ੇ',
|
18
|
+
'ੈ',
|
19
|
+
'ੋ',
|
20
|
+
'ੌ',
|
21
|
+
'ੁ',
|
22
|
+
'ੂ',
|
23
|
+
'ਾ',
|
24
|
+
'ੀ'
|
25
|
+
].freeze
|
26
|
+
|
27
|
+
VIRAMA = '੍'
|
28
|
+
BELOW_LETTERS = 'ਹਰਵਟਤਨਚ'
|
29
|
+
YAKASH = 'ੵ'
|
30
|
+
|
31
|
+
VOWEL_DIACRITICS = ORDERED_VOWELS.join
|
32
|
+
end
|
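The vishram constants can be combined with non-mutating string calls so that callers may pass frozen string literals. This is a minimal sketch under stated assumptions: `VishramDemo` and `strip_vishrams` are hypothetical names, and the constants are a stand-in copy of the ones defined above.

```ruby
# frozen_string_literal: true

# Hypothetical demo module; mirrors the vishram constants for illustration.
module VishramDemo
  VISHRAM_LIGHT = '.'
  VISHRAM_MEDIUM = ','
  VISHRAM_HEAVY = ';'
  VISHRAMS = [VISHRAM_LIGHT, VISHRAM_MEDIUM, VISHRAM_HEAVY].freeze

  # Returns a new string; String patterns passed to gsub are matched
  # literally, so '.' removes only full stops, not every character.
  def self.strip_vishrams(string)
    VISHRAMS.reduce(string) { |acc, vishram| acc.gsub(vishram, '') }
  end
end

puts VishramDemo.strip_vishrams('ਸਬਦ. ਸਬਦ, ਸਬਦ; ਸਬਦ')
```

Because `gsub` returns a copy, the input string is left untouched and no `FrozenError` is raised for literals.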
data/lib/remove.rb
ADDED
@@ -0,0 +1,70 @@
|
|
1
|
+
# frozen_string_literal: true
|
2
|
+
|
3
|
+
module GurmukhiUtils
|
4
|
+
# Removes substrings from the string.
|
5
|
+
#
|
6
|
+
# @param string [String] The string to affect.
|
7
|
+
# @param removals [Array<String>] Any substring to remove.
|
8
|
+
# @return [String] The string without any substrings.
|
9
|
+
#
|
10
|
+
# @example
|
11
|
+
# remove("ਸਬਦ. ਸਬਦ, ਸਬਦ; ਸਬਦ", [".", ","])
|
12
|
+
# # => "ਸਬਦ ਸਬਦ ਸਬਦ; ਸਬਦ"
|
13
|
+
# remove("ਸਬਦ. ਸਬਦ, ਸਬਦ; ਸਬਦ", GurmukhiUtils::VISHRAMS)
|
14
|
+
# # => "ਸਬਦ ਸਬਦ ਸਬਦ ਸਬਦ"
|
15
|
+
def remove(string, removals)
|
16
|
+
removals.each { |removal| string.gsub!(removal, '') }
|
17
|
+
string
|
18
|
+
end
|
19
|
+
|
20
|
+
# Removes regex patterns from the string.
|
21
|
+
#
|
22
|
+
# Note:
|
23
|
+
# Also removes duplicate space characters from the string.
|
24
|
+
#
|
25
|
+
# @param string [String] The string to affect.
|
26
|
+
# @param patterns [Array<String>] Any pattern to remove.
|
27
|
+
# @return [String] The string without any matching patterns, duplicate spaces, or leading/trailing spaces.
|
28
|
+
#
|
29
|
+
# @example
|
30
|
+
# remove_regex("ਸਬਦ. ਸਬਦ, ਸਬਦ; ਸਬਦ", [".+\\s"])
|
31
|
+
# # => "ਸਬਦ"
|
32
|
+
def remove_regex(string, patterns)
|
33
|
+
patterns.each { |pattern| string.gsub!(Regexp.new(pattern), '') }
|
34
|
+
string = string.squeeze(' ').strip # bang variants return nil when nothing changes, so they cannot be chained safely
|
35
|
+
string
|
36
|
+
end
|
37
|
+
|
38
|
+
# Attempts to remove line endings as best as possible.
|
39
|
+
#
|
40
|
+
# @param string [String] The unicode Gurmukhi, Hindi, or English translation/transliteration to affect.
|
41
|
+
# @return [String] The string without line endings.
|
42
|
+
#
|
43
|
+
# @example
|
44
|
+
# remove_line_endings("ਸਬਦ ॥ ਸਬਦ ॥੧॥ ਰਹਾਉ ॥")
|
45
|
+
# # => "ਸਬਦ ਸਬਦ"
|
46
|
+
def remove_line_endings(string)
|
47
|
+
line_ending_patterns = [
|
48
|
+
'[।॥] *(ਰਹਾਉ|रहाउ).*',
|
49
|
+
'[|] *Pause.*',
|
50
|
+
'[|] *(rahaau|rahau|rahao).*',
|
51
|
+
'[।॥][੦-੯|०-९].*',
|
52
|
+
'[|]\\d.*',
|
53
|
+
'[।॥|]'
|
54
|
+
]
|
55
|
+
|
56
|
+
remove_regex(string, line_ending_patterns)
|
57
|
+
end
|
58
|
+
|
59
|
+
# Removes all vishram characters.
|
60
|
+
#
|
61
|
+
# @param string [String] The string to affect.
|
62
|
+
# @return [String] The string without vishrams.
|
63
|
+
#
|
64
|
+
# @example
|
65
|
+
# remove_vishrams("ਸਬਦ. ਸਬਦ, ਸਬਦ; ਸਬਦ ॥")
|
66
|
+
# # => "ਸਬਦ ਸਬਦ ਸਬਦ ਸਬਦ ॥"
|
67
|
+
def remove_vishrams(string)
|
68
|
+
remove(string, GurmukhiUtils::VISHRAMS)
|
69
|
+
end
|
70
|
+
end
|
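Ruby's bang string methods such as `squeeze!` and `strip!` return `nil` when they make no change, which matters whenever such calls are chained in helpers like `remove_regex`. The sketch below demonstrates this with a hypothetical `tidy_spaces` helper that uses the non-bang variants instead.

```ruby
# A string with no repeated or surrounding spaces.
unchanged = 'ਸਬਦ ਸਬਦ'.dup

p unchanged.squeeze!(' ') # nil, because nothing was squeezed
p unchanged.strip!        # nil, because nothing was stripped

# Hypothetical helper: the non-bang methods always return a String,
# so they can be chained regardless of the input.
def tidy_spaces(string)
  string.squeeze(' ').strip
end

p tidy_spaces('  ਸਬਦ   ਸਬਦ ')
p tidy_spaces(unchanged)
```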
data/lib/unicode.rb
ADDED
@@ -0,0 +1,298 @@
|
|
1
|
+
# frozen_string_literal: true
|
2
|
+
|
3
|
+
# Copied from https://github.com/shabados/gurmukhiutils/blob/b7a1ca9f0d341b64715158893afb93675c773823/gurmukhiutils/unicode.py
|
4
|
+
require 'strscan'
|
5
|
+
|
6
|
+
module GurmukhiUtils
|
7
|
+
UNICODE_STANDARDS = ['Unicode Consortium', 'Sant Lipi'].freeze
|
8
|
+
|
9
|
+
ASCII_TO_SL_TRANSLATION = {
|
10
|
+
'a'.ord => 'ੳ',
|
11
|
+
'b'.ord => 'ਬ',
|
12
|
+
'c'.ord => 'ਚ',
|
13
|
+
'd'.ord => 'ਦ',
|
14
|
+
'e'.ord => 'ੲ',
|
15
|
+
'f'.ord => 'ਡ',
|
16
|
+
'g'.ord => 'ਗ',
|
17
|
+
'h'.ord => 'ਹ',
|
18
|
+
'i'.ord => 'ਿ',
|
19
|
+
'j'.ord => 'ਜ',
|
20
|
+
'k'.ord => 'ਕ',
|
21
|
+
'l'.ord => 'ਲ',
|
22
|
+
'm'.ord => 'ਮ',
|
23
|
+
'n'.ord => 'ਨ',
|
24
|
+
'o'.ord => 'ੋ',
|
25
|
+
'p'.ord => 'ਪ',
|
26
|
+
'q'.ord => 'ਤ',
|
27
|
+
'r'.ord => 'ਰ',
|
28
|
+
's'.ord => 'ਸ',
|
29
|
+
't'.ord => 'ਟ',
|
30
|
+
'u'.ord => 'ੁ',
|
31
|
+
'v'.ord => 'ਵ',
|
32
|
+
'w'.ord => 'ਾ',
|
33
|
+
'x'.ord => 'ਣ',
|
34
|
+
'y'.ord => 'ੇ',
|
35
|
+
'z'.ord => 'ਜ਼',
|
36
|
+
'A'.ord => 'ਅ',
|
37
|
+
'B'.ord => 'ਭ',
|
38
|
+
'C'.ord => 'ਛ',
|
39
|
+
'D'.ord => 'ਧ',
|
40
|
+
'E'.ord => 'ਓ',
|
41
|
+
'F'.ord => 'ਢ',
|
42
|
+
'G'.ord => 'ਘ',
|
43
|
+
'H'.ord => '੍ਹ',
|
44
|
+
'I'.ord => 'ੀ',
|
45
|
+
'J'.ord => 'ਝ',
|
46
|
+
'K'.ord => 'ਖ',
|
47
|
+
'L'.ord => 'ਲ਼',
|
48
|
+
'M'.ord => 'ੰ',
|
49
|
+
'N'.ord => 'ਂ',
|
50
|
+
'O'.ord => 'ੌ',
|
51
|
+
'P'.ord => 'ਫ',
|
52
|
+
'Q'.ord => 'ਥ',
|
53
|
+
'R'.ord => '੍ਰ',
|
54
|
+
'S'.ord => 'ਸ਼',
|
55
|
+
'T'.ord => 'ਠ',
|
56
|
+
'U'.ord => 'ੂ',
|
57
|
+
'V'.ord => 'ੜ',
|
58
|
+
'W'.ord => 'ਾਂ',
|
59
|
+
'X'.ord => 'ਯ',
|
60
|
+
'Y'.ord => 'ੈ',
|
61
|
+
'Z'.ord => 'ਗ਼',
|
62
|
+
'0'.ord => '੦',
|
63
|
+
'1'.ord => '੧',
|
64
|
+
'2'.ord => '੨',
|
65
|
+
'3'.ord => '੩',
|
66
|
+
'4'.ord => '੪',
|
67
|
+
'5'.ord => '੫',
|
68
|
+
'6'.ord => '੬',
|
69
|
+
'7'.ord => '੭',
|
70
|
+
'8'.ord => '੮',
|
71
|
+
'9'.ord => '੯',
|
72
|
+
'['.ord => '।',
|
73
|
+
']'.ord => '॥',
|
74
|
+
'\\'.ord => 'ਞ',
|
75
|
+
'|'.ord => 'ਙ',
|
76
|
+
'`'.ord => 'ੱ',
|
77
|
+
'~'.ord => 'ੱ',
|
78
|
+
'@'.ord => 'ੑ',
|
79
|
+
'^'.ord => 'ਖ਼',
|
80
|
+
'&'.ord => 'ਫ਼',
|
81
|
+
'†'.ord => '੍ਟ', # dagger symbol
|
82
|
+
'ü'.ord => 'ੁ', # u-diaeresis letter
|
83
|
+
'®'.ord => '੍ਰ', # registered symbol
|
84
|
+
"\u00b4".ord => 'ੵ', # acute accent (´)
|
85
|
+
"\u00a8".ord => 'ੂ', # diaeresis accent (¨)
|
86
|
+
'µ'.ord => 'ੰ', # mu letter
|
87
|
+
'æ'.ord => '਼',
|
88
|
+
"\u00a1".ord => 'ੴ', # inverted exclamation (¡)
|
89
|
+
'ƒ'.ord => 'ਨੂੰ', # florin symbol
|
90
|
+
'œ'.ord => '੍ਤ',
|
91
|
+
'Í'.ord => '੍ਵ', # capital i-acute letter
|
92
|
+
'Î'.ord => '੍ਯ', # capital i-circumflex letter
|
93
|
+
'Ï'.ord => 'ੵ', # capital i-diaeresis letter
|
94
|
+
'Ò'.ord => '॥', # capital o-grave letter
|
95
|
+
'Ú'.ord => 'ਃ', # capital u-acute letter
|
96
|
+
"\u02c6".ord => 'ਂ', # circumflex accent (ˆ)
|
97
|
+
"\u02dc".ord => '੍ਨ', # small tilde (˜)
|
98
|
+
'§'.ord => '੍ਹੂ', # section symbol
|
99
|
+
'¤'.ord => 'ੱ', # currency symbol
|
100
|
+
'ç'.ord => '੍ਚ', # c-cedilla letter
|
101
|
+
'Ç'.ord => '☬', # khanda instead of california state symbol
|
102
|
+
"\u201a".ord => '❁', # single low-9 quotation (‚) mark
|
103
|
+
'Æ'.ord => nil,
|
104
|
+
'Ø'.ord => nil, # This is a topline / shirorekha (शिरोरेखा) extender
|
105
|
+
'ÿ'.ord => nil, # This is the author Kulbir S Thind's stamp
|
106
|
+
'Œ'.ord => nil, # Box drawing left flower
|
107
|
+
'‰'.ord => nil, # Box drawing right flower
|
108
|
+
'Ó'.ord => nil, # Box drawing top flower
|
109
|
+
'Ô'.ord => nil, # Box drawing bottom flower
|
110
|
+
'Î'.ord => '꠳ਯ', # half-yayya
|
111
|
+
'ï'.ord => '꠴ਯ', # open-top yayya
|
112
|
+
'î'.ord => '꠵ਯ' # open-top half-yayya
|
113
|
+
}.freeze
|
114
|
+
|
115
|
+
ASCII_TO_SL_REPLACEMENTS = {
|
116
|
+
'ˆØI' => 'ੀਁ', # Handle pre-bihari-bindi with unused adakbindi
|
117
|
+
'<>' => 'ੴ', # AnmolLipi/GurbaniAkhar variant
|
118
|
+
'<' => 'ੴ', # GurbaniLipi variant
|
119
|
+
'>' => '☬', # GurbaniLipi variant
|
120
|
+
'Åå' => 'ੴ', # AnmolLipi/GurbaniAkhar variant
|
121
|
+
'Å' => 'ੴ', # GurbaniLipi variant
|
122
|
+
'å' => 'ੴ' # GurbaniLipi variant
|
123
|
+
}.freeze
|
124
|
+
|
125
|
+
UNICODE_TO_SL_REPLACEMENTS = {
|
126
|
+
'੍ਯ' => '꠳ਯ' # replace unicode half-yayya with Sant Lipi ligature (north indic one-sixteenth fraction + yayya)
|
127
|
+
}.freeze
|
128
|
+
|
129
|
+
SL_TO_UNICODE_REPLACEMENTS = {
|
130
|
+
'꠳ਯ' => '੍ਯ',
|
131
|
+
'꠴ਯ' => 'ਯ',
|
132
|
+
'꠵ਯ' => '੍ਯ',
|
133
|
+
'ਁ' => 'ਂ' # pre-bihari-bindi
|
134
|
+
}.freeze
|
135
|
+
|
136
|
+
##
|
137
|
+
# Converts any ASCII Gurmukhi characters and sanitizes to Unicode Gurmukhi.
|
138
|
+
# Note:
|
139
|
+
# Converting yayya (ਯ) variants with an open top using the Unicode Consortium standard is considered destructive.
|
140
|
+
# This function will substitute the original with its shirorekha/top-line equivalent.
|
141
|
+
#
|
142
|
+
# Many fonts and text shaping engines fail to render half-yayya (੍ਯ) correctly. Regardless of the standard used,
|
143
|
+
# it is recommended to use the Sant Lipi font mentioned below.
|
144
|
+
#
|
145
|
+
# @param string [String] The string to affect.
|
146
|
+
# @param unicode_standard [String] The mapping system to use. The default is Unicode compliant and can render
|
147
|
+
# 99% of the Shabad OS Database. The other option "Sant Lipi" is intended for a custom Unicode font bearing the
|
148
|
+
# same name (see: https://github.com/shabados/SantLipi). Defaults to "Unicode Consortium".
|
149
|
+
# @return [String] A string whose Gurmukhi is normalized to a Unicode standard.
|
150
|
+
#
|
151
|
+
# @example
|
152
|
+
# unicode("123")
|
153
|
+
# #=> "੧੨੩"
|
154
|
+
# unicode("<> > <")
|
155
|
+
# #=> "ੴ ☬ ੴ"
|
156
|
+
# unicode("gurU")
|
157
|
+
# #=> "ਗੁਰੂ"
|
158
|
+
##
|
159
|
+
def self.unicode(string, unicode_standard = 'Unicode Consortium')
|
160
|
+
# Move ASCII sihari before mapping to unicode
|
161
|
+
ascii_base_letters = '\\\\a-zA-Z|^&Îîï' # doubled backslash: a literal "\" (ਞ) in the regex class, not the start of a "\a"-to-"z" range
|
162
|
+
ascii_sihari_pattern = Regexp.new("(i)([#{ascii_base_letters}])")
|
163
|
+
string = string.gsub(ascii_sihari_pattern, '\2\1')
|
164
|
+
|
165
|
+
# Map any ASCII / Unicode Gurmukhi to Sant Lipi format
|
166
|
+
ASCII_TO_SL_REPLACEMENTS.each do |key, value|
|
167
|
+
string.gsub!(key, value)
|
168
|
+
end
|
169
|
+
|
170
|
+
UNICODE_TO_SL_REPLACEMENTS.each do |key, value|
|
171
|
+
string.gsub!(key, value)
|
172
|
+
end
|
173
|
+
|
174
|
+
string = string.chars.map { |c| ASCII_TO_SL_TRANSLATION[c.ord] || c }.join
|
175
|
+
|
176
|
+
string = unicode_normalize(string)
|
177
|
+
|
178
|
+
if unicode_standard == 'Unicode Consortium'
|
179
|
+
SL_TO_UNICODE_REPLACEMENTS.each do |key, value|
|
180
|
+
string.gsub!(key, value)
|
181
|
+
end
|
182
|
+
end
|
183
|
+
|
184
|
+
return string
|
185
|
+
end
|
186
|
+
|
187
|
+
##
|
188
|
+
# Normalizes Gurmukhi according to Unicode Standards.
|
189
|
+
# @param string [String] The string to affect.
|
190
|
+
# @return [String] A string containing normalized Gurmukhi.
|
191
|
+
#
|
192
|
+
# @example
|
193
|
+
# unicode_normalize("Hello ਜੀ")
|
194
|
+
# #=> "Hello ਜੀ"
|
195
|
+
##
|
196
|
+
def self.unicode_normalize(string)
|
197
|
+
string = sort_diacritics(string)
|
198
|
+
return sanitize_unicode(string)
|
199
|
+
end
|
200
|
+
|
201
|
+
##
|
202
|
+
# In Gurmukhi script, common diacritics include vowel signs (matras), such as:
|
203
|
+
# ੁ (u), ੀ (ī), or ੋ (ō), and other marks like bindi (ਂ), tippi (ੰ), and nukta (਼).
|
204
|
+
#
|
205
|
+
# @brief Orders the Gurmukhi diacritics in a string according to Unicode standards.
|
206
|
+
# Not intended for base letters with multiple subjoined letters.
|
207
|
+
# @param string [String] The string to affect.
|
208
|
+
# @return [String] The same string with Gurmukhi diacritics arranged in a sorted manner.
|
209
|
+
#
|
210
|
+
# @example
|
211
|
+
# sort_diacritics("\u0a41\u0a4b") # => "\u0a4b\u0a41" # ੁੋ vs ੋੁ
|
212
|
+
##
|
213
|
+
def self.sort_diacritics(string)
|
214
|
+
# Nukta is essential to form a new base letter and must be ordered first.
|
215
|
+
# Udaat, Yakash, and subjoined letters should follow.
|
216
|
+
# Subjoined letters are constructed (they are not single char), so they cannot be used
|
217
|
+
# in the same regex group pattern. See further below for subjoined letters.
|
218
|
+
base_letter_modifiers = ['਼', 'ੑ', 'ੵ']
|
219
|
+
|
220
|
+
# More generally, when a consonant or independent vowel is modified by multiple vowel signs, the sequence of the vowel signs in the underlying representation of the text should be: left, top, bottom, right.
|
221
|
+
# p. 491 of The Unicode® Standard Version 14.0 – Core Specification
|
222
|
+
# https://www.unicode.org/versions/Unicode14.0.0/ch12.pdf
|
223
|
+
vowel_order = ['ਿ', 'ੇ', 'ੈ', 'ੋ', 'ੌ', 'ੁ', 'ੂ', 'ਾ', 'ੀ']
|
224
|
+
|
225
|
+
# The remaining diacritics are to be sorted at the end according to the following order
|
226
|
+
remaining_modifier_order = ['ਁ', 'ੱ', 'ਂ', 'ੰ', 'ਃ']
|
227
|
+
|
228
|
+
generated_marks = (base_letter_modifiers + vowel_order + remaining_modifier_order).join
|
229
|
+
mark_pattern = Regexp.new("([#{generated_marks}]*)")
|
230
|
+
|
231
|
+
virama = '੍'
|
232
|
+
below_base_letters = 'ਹਰਵਟਤਨਚ'
|
233
|
+
below_base_pattern = Regexp.new("(#{virama}[#{below_base_letters}])?")
|
234
|
+
|
235
|
+
regex_match_pattern = Regexp.new("#{mark_pattern}#{below_base_pattern}#{mark_pattern}")
|
236
|
+
|
237
|
+
generated_match_order = (base_letter_modifiers + [virama] + below_base_letters.chars + vowel_order + remaining_modifier_order).join
|
238
|
+
|
239
|
+
string.gsub(regex_match_pattern) do |match|
|
240
|
+
match_chars = match.chars
|
241
|
+
match_chars.sort_by! { |e| generated_match_order.index(e) }
|
242
|
+
match_chars.join
|
243
|
+
end
|
244
|
+
end
|
245
|
+
|
246
|
+
def self.sanitize_unicode(string)
|
247
|
+
unicode_sanitization_map = {
|
248
|
+
"\u0a73\u0a4b" => "\u0a13", # ਓ
|
249
|
+
"\u0a05\u0a3e" => "\u0a06", # ਅ + ਾ = ਆ
|
250
|
+
"\u0a72\u0a3f" => "\u0a07", # ਇ
|
251
|
+
"\u0a72\u0a40" => "\u0a08", # ਈ
|
252
|
+
"\u0a73\u0a41" => "\u0a09", # ਉ
|
253
|
+
"\u0a73\u0a42" => "\u0a0a", # ਊ
|
254
|
+
"\u0a72\u0a47" => "\u0a0f", # ਏ
|
255
|
+
"\u0a05\u0a48" => "\u0a10", # ਐ
|
256
|
+
"\u0a05\u0a4c" => "\u0a14", # ਔ
|
257
|
+
"\u0a32\u0a3c" => "\u0a33", # ਲ਼
|
258
|
+
"\u0a38\u0a3c" => "\u0a36", # ਸ਼
|
259
|
+
"\u0a16\u0a3c" => "\u0a59", # ਖ਼
|
260
|
+
"\u0a17\u0a3c" => "\u0a5a", # ਗ਼
|
261
|
+
"\u0a1c\u0a3c" => "\u0a5b", # ਜ਼
|
262
|
+
"\u0a2b\u0a3c" => "\u0a5e", # ਫ਼
|
263
|
+
"\u0a71\u0a02" => "\u0a01" # ਁ adak bindi (quite literally never used today or in the Shabad OS Database, only included for parity with the Unicode block)
|
264
|
+
}
|
265
|
+
|
266
|
+
unicode_sanitization_map.each do |key, value|
|
267
|
+
string.gsub!(key, value)
|
268
|
+
end
|
269
|
+
|
270
|
+
return string
|
271
|
+
end
|
272
|
+
|
273
|
+
##
|
274
|
+
# Takes a string and returns a list of keys and values of each character and its corresponding code point.
|
275
|
+
# @param string [String] the string to affect
|
276
|
+
# @return [Array<Hash{String => String}>] a list of each character and its corresponding code point
|
277
|
+
# @example
|
278
|
+
# decode_unicode("To ਜੀ")
|
279
|
+
# #=> [{"T" => "0054"}, {"o" => "006f"}, {" " => "0020"}, {"ਜ" => "0a1c"}, {"ੀ" => "0a40"}]
|
280
|
+
##
|
281
|
+
def self.decode_unicode(string)
|
282
|
+
return string.chars.map { |item| { item => format('%04x', item.ord) } }
|
283
|
+
end
|
284
|
+
|
285
|
+
##
|
286
|
+
# Takes a list of hexadecimal code point strings and returns the corresponding Unicode characters.
|
287
|
+
# @param strings [Array<String>] the list containing any strings to encode
|
288
|
+
# @return [Array<String>] a list of any corresponding unicode characters
|
289
|
+
# @example
|
290
|
+
# encode_unicode(["0054"])
|
291
|
+
# #=> "T"
|
292
|
+
# encode_unicode(["0a1c", "0A40"])
|
293
|
+
# #=> ["ਜ", "ੀ"]
|
294
|
+
##
|
295
|
+
def self.encode_unicode(strings)
|
296
|
+
return strings.map { |string| string.to_i(16).chr(Encoding::UTF_8) }
|
297
|
+
end
|
298
|
+
end
|
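The ordering idea behind `sort_diacritics` can be sketched in isolation: find each run of combining marks and sort it by position in a fixed order string. This is a simplified illustration that covers only vowel signs and trailing marks (no nukta, udaat, yakash, or subjoined letters); the names `MARK_ORDER`, `MARK_RUN`, and `sort_marks` are hypothetical.

```ruby
# Vowel signs in the order used above: ਿ ੇ ੈ ੋ ੌ ੁ ੂ ਾ ੀ
VOWEL_ORDER = ["\u0a3f", "\u0a47", "\u0a48", "\u0a4b", "\u0a4c",
               "\u0a41", "\u0a42", "\u0a3e", "\u0a40"].freeze

# Remaining marks, sorted after the vowels: ਁ ੱ ਂ ੰ ਃ
TRAILING_MARKS = ["\u0a01", "\u0a71", "\u0a02", "\u0a70", "\u0a03"].freeze

MARK_ORDER = (VOWEL_ORDER + TRAILING_MARKS).join.freeze
MARK_RUN = Regexp.new("[#{MARK_ORDER}]+")

# Sorts every run of marks by its index in MARK_ORDER; other characters
# (base letters, spaces, Latin text) are left where they are.
def sort_marks(string)
  string.gsub(MARK_RUN) do |run|
    run.chars.sort_by { |mark| MARK_ORDER.index(mark) }.join
  end
end

# ਗ followed by ੁ then ੋ is reordered to ੋ then ੁ.
p sort_marks("\u0a17\u0a41\u0a4b") == "\u0a17\u0a4b\u0a41"
```

Using a single order string keeps the sort key trivial, at the cost of an O(n) `index` lookup per mark, which is negligible for runs this short.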
metadata
ADDED
@@ -0,0 +1,55 @@
+--- !ruby/object:Gem::Specification
+name: gurmukhi_utils
+version: !ruby/object:Gem::Version
+  version: 0.0.1
+platform: ruby
+authors:
+- Dilraj Singh Somel (dsomel21)
+autorequire:
+bindir: bin
+cert_chain: []
+date: 2023-07-15 00:00:00.000000000 Z
+dependencies: []
+description: Library for working with Gurmukhi text, providing various operations
+  like unicode conversion, ascii conversion, and more.
+email:
+- dsomel21@gmail.com
+executables: []
+extensions: []
+extra_rdoc_files: []
+files:
+- CONTRIBUTING.md
+- Gemfile
+- Gemfile.lock
+- README.md
+- gurmukhi_utils.gemspec
+- lib/ascii.rb
+- lib/constants.rb
+- lib/gurmukhi_utils.rb
+- lib/remove.rb
+- lib/unicode.rb
+homepage: https://github.com/ShabadOS/gurmukhi-utils/tree/main/ruby
+licenses:
+- MIT
+metadata:
+  rubygems_mfa_required: 'true'
+post_install_message:
+rdoc_options: []
+require_paths:
+- lib
+required_ruby_version: !ruby/object:Gem::Requirement
+  requirements:
+  - - '='
+    - !ruby/object:Gem::Version
+      version: 3.2.1
+required_rubygems_version: !ruby/object:Gem::Requirement
+  requirements:
+  - - ">="
+    - !ruby/object:Gem::Version
+      version: '0'
+requirements: []
+rubygems_version: 3.4.7
+signing_key:
+specification_version: 4
+summary: A utility gem for converting, analyzing, and testing Gurmukhi strings.
+test_files: []