gurmukhi_utils 0.0.1
- checksums.yaml +7 -0
- data/CONTRIBUTING.md +76 -0
- data/Gemfile +14 -0
- data/Gemfile.lock +60 -0
- data/README.md +14 -0
- data/gurmukhi_utils.gemspec +20 -0
- data/lib/ascii.rb +161 -0
- data/lib/constants.rb +32 -0
- data/lib/gurmukhi_utils.rb +6 -0
- data/lib/remove.rb +70 -0
- data/lib/unicode.rb +298 -0
- metadata +55 -0
checksums.yaml
ADDED
@@ -0,0 +1,7 @@
+---
+SHA256:
+  metadata.gz: dbb164eb5abec25b925649743ac6ae8797bce240903a42e13ee7bca4cd5635bc
+  data.tar.gz: 62fd647545eee43081089e407517bab0c00f3136a37862a82a78e7e276a9262f
+SHA512:
+  metadata.gz: 314ce8d8a0149fa9d7327c44b8b0c145e2a0bfb031704d2b8843bfa636873d45fbf198dfe7666511242b3f0f788291beae2e9dd4f8a8d980fc6b99017a731424
+  data.tar.gz: 4d92bda8fe9cb31cbe430812d1953e517bc302acc5fa6e938bcf6197c05bd3dae39345ba723d5c0a79095b7576824f7a05a9bf1e9673e62aef675ced3e45abce
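The digests in `checksums.yaml` can be compared against the bytes of the corresponding archive member. The following is a minimal sketch using Ruby's standard `digest` library; the helper name `checksum_matches?` and the stand-in payload are illustrative assumptions, not part of the gem.

```ruby
require 'digest'

# Hypothetical helper (not part of gurmukhi_utils): compares the SHA-256
# digest of some bytes with an expected hexadecimal digest.
def checksum_matches?(bytes, expected_hex)
  Digest::SHA256.hexdigest(bytes) == expected_hex.downcase
end

# Stand-in payload; in practice this would be File.binread('data.tar.gz')
# and the expected value would come from checksums.yaml.
sample_bytes = 'gurmukhi_utils'
expected_hex = Digest::SHA256.hexdigest(sample_bytes)

puts checksum_matches?(sample_bytes, expected_hex)        # payload matches its own digest
puts checksum_matches?(sample_bytes + 'x', expected_hex)  # any change breaks the match
```

The same comparison works for the SHA512 entries by swapping in `Digest::SHA512`.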
data/CONTRIBUTING.md
ADDED
@@ -0,0 +1,76 @@
+# Contributing
+
+Please see our [community docs on contributing](https://shabados.com/docs/community/contributing).
+
+This document is for developers or programmers contributing to source code. If you're interested in contributing a different way, please see the link above.
+
+## Project Structure
+
+This project follows a standard Ruby gem structure. Here's a brief overview of the main directories and files:
+
+- `lib/`: Contains the main source code for the GurmukhiUtils gem.
+  - `gurmukhi_utils.rb`: The main entry point for the gem. This file requires all other necessary files.
+  - `<feature>.rb` files: Each additional file in this directory represents a specific feature/module of the gem.
+- `spec/`: Contains the RSpec test files for the gem. Test files should be placed in this directory, following the naming convention `lib/<feature>_spec.rb`.
+- `Gemfile`: Specifies the gem dependencies for development and testing.
+- `Gemfile.lock`: Generated by Bundler, this file contains the exact gem versions and their dependencies used in the project.
+- `gurmukhi_utils.gemspec`: The gem specification file, which provides information about the gem. We need this to publish and release the gem.
+
+## Adding New Features to `GurmukhiUtils`
+
+To add a new feature to GurmukhiUtils, follow these steps:
+
+**Step 1.**
+
+Create a new file in the `lib/` directory for the new feature, and place the new functionality within the `GurmukhiUtils` module.
+
+For example, if you want to create an `ascii` method, create a new file called `ascii.rb`:
+
+```ruby
+# lib/ascii.rb
+
+module GurmukhiUtils
+  def self.helpers
+    # ...
+  end
+
+  def self.other_methods
+    # ...
+  end
+
+  def self.ascii
+    # ...
+  end
+end
+```
+
+**Step 2.**
+
+Update the `lib/gurmukhi_utils.rb` file to require the new feature file using require_relative:
+
+```ruby
+# lib/gurmukhi_utils.rb
+
+require_relative 'gurmukhi_utils/version'
+require_relative 'unicode'
+require_relative 'ascii' # Add this line for the new feature
+```
+
+**Step 3.**
+
+Test the feature
+
+Write tests for the new feature in the `spec` directory.
+
+You can also use `irb` and do:
+
+```ruby
+001 > require_relative "lib/gurmukhi_utils"
+=> true
+002 > GurmukhiUtils.ascii("...")
+=> "..."
+```
+
+## Thank you
+
+Your contributions to open source, large or small, make great projects like this possible. Thank you for taking the time to participate in this project.
data/Gemfile
ADDED
data/Gemfile.lock
ADDED
@@ -0,0 +1,60 @@
+GEM
+  remote: https://rubygems.org/
+  specs:
+    ast (2.4.2)
+    diff-lcs (1.5.0)
+    json (2.6.3)
+    parallel (1.22.1)
+    parser (3.2.1.0)
+      ast (~> 2.4.1)
+    rainbow (3.1.1)
+    regexp_parser (2.7.0)
+    rexml (3.2.5)
+    rspec (3.12.0)
+      rspec-core (~> 3.12.0)
+      rspec-expectations (~> 3.12.0)
+      rspec-mocks (~> 3.12.0)
+    rspec-core (3.12.2)
+      rspec-support (~> 3.12.0)
+    rspec-expectations (3.12.3)
+      diff-lcs (>= 1.2.0, < 2.0)
+      rspec-support (~> 3.12.0)
+    rspec-mocks (3.12.5)
+      diff-lcs (>= 1.2.0, < 2.0)
+      rspec-support (~> 3.12.0)
+    rspec-support (3.12.0)
+    rubocop (1.47.0)
+      json (~> 2.3)
+      parallel (~> 1.10)
+      parser (>= 3.2.0.0)
+      rainbow (>= 2.2.2, < 4.0)
+      regexp_parser (>= 1.8, < 3.0)
+      rexml (>= 3.2.5, < 4.0)
+      rubocop-ast (>= 1.26.0, < 2.0)
+      ruby-progressbar (~> 1.7)
+      unicode-display_width (>= 2.4.0, < 3.0)
+    rubocop-ast (1.27.0)
+      parser (>= 3.2.1.0)
+    rubocop-capybara (2.17.1)
+      rubocop (~> 1.41)
+    rubocop-rspec (2.19.0)
+      rubocop (~> 1.33)
+      rubocop-capybara (~> 2.17)
+    ruby-progressbar (1.12.0)
+    strscan (3.0.5)
+    unicode-display_width (2.4.2)
+
+PLATFORMS
+  x86_64-darwin-19
+
+DEPENDENCIES
+  rspec
+  rubocop
+  rubocop-rspec
+  strscan
+
+RUBY VERSION
+   ruby 3.2.1p31
+
+BUNDLED WITH
+   2.4.7
data/README.md
ADDED
@@ -0,0 +1,14 @@
+# Gurmukhi Utils (Ruby)
+
+Utilities library for converting, analyzing, and testing Gurmukhi strings.
+
+## Related
+
+This library is one of many in the Gurmukhi Utils super-repo.
+
+- [Super Repo](/README.md)
+- [Python](/python/README.md)
+- [JavaScript](/javascript/README.md)
+- [Ruby](/ruby/README.md)
+- [C# / C Sharp](/csharp/README.md)
+- [Dart](/dart/README.md)
data/gurmukhi_utils.gemspec
ADDED
@@ -0,0 +1,20 @@
+# frozen_string_literal: true
+
+Gem::Specification.new do |spec|
+  spec.name = 'gurmukhi_utils'
+  spec.version = '0.0.1'
+  spec.authors = ['Dilraj Singh Somel (dsomel21)']
+  spec.email = ['dsomel21@gmail.com']
+
+  spec.summary = 'A utility gem for converting, analyzing, and testing Gurmukhi strings.'
+  spec.description = 'Library for working with Gurmukhi text, providing various operations like unicode conversion, ascii conversion, and more.'
+  spec.homepage = 'https://github.com/ShabadOS/gurmukhi-utils/tree/main/ruby'
+  spec.license = 'MIT'
+
+  spec.required_ruby_version = '3.2.1'
+
+  spec.files = `git ls-files -z`.split("\x0").reject { |f| f.match(%r{^(test|spec)/}) }
+  spec.require_paths = ['lib']
+
+  spec.metadata['rubygems_mfa_required'] = 'true'
+end
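The gemspec sets `required_ruby_version` to the bare string `'3.2.1'`, which RubyGems treats as an exact `=` constraint (the packaged `metadata` file records it as `'='`). The sketch below shows what that means for a later patch release, using the `Gem::Requirement` and `Gem::Version` classes that ship with RubyGems. The relaxed `>= 3.2.1` requirement is a hypothetical alternative for comparison only and does not appear in the gem.

```ruby
require 'rubygems'

# The constraint as declared in the gemspec: a bare version means "= 3.2.1".
exact_requirement = Gem::Requirement.new('3.2.1')

# Hypothetical alternative for comparison; not what the gem declares.
relaxed_requirement = Gem::Requirement.new('>= 3.2.1')

patch_release = Gem::Version.new('3.2.2')

exact_allows_patch   = exact_requirement.satisfied_by?(patch_release)
relaxed_allows_patch = relaxed_requirement.satisfied_by?(patch_release)

puts "exact pin accepts 3.2.2:   #{exact_allows_patch}"
puts "relaxed pin accepts 3.2.2: #{relaxed_allows_patch}"
```

Under the exact pin, installation is refused on any interpreter other than 3.2.1, including newer patch releases.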
data/lib/ascii.rb
ADDED
@@ -0,0 +1,161 @@
|
|
1
|
+
# frozen_string_literal: true
|
2
|
+
|
3
|
+
module GurmukhiUtils
|
4
|
+
ASCII_TRANSLATION = {
|
5
|
+
'ੳ'.ord => 'a',
|
6
|
+
'ਅ'.ord => 'A',
|
7
|
+
'ੲ'.ord => 'e',
|
8
|
+
'ਸ'.ord => 's',
|
9
|
+
'ਹ'.ord => 'h',
|
10
|
+
'ਕ'.ord => 'k',
|
11
|
+
'ਖ'.ord => 'K',
|
12
|
+
'ਗ'.ord => 'g',
|
13
|
+
'ਘ'.ord => 'G',
|
14
|
+
'ਙ'.ord => '|',
|
15
|
+
'ਚ'.ord => 'c',
|
16
|
+
'ਛ'.ord => 'C',
|
17
|
+
'ਜ'.ord => 'j',
|
18
|
+
'ਝ'.ord => 'J',
|
19
|
+
'ਞ'.ord => '\\',
|
20
|
+
'ਟ'.ord => 't',
|
21
|
+
'ਠ'.ord => 'T',
|
22
|
+
'ਡ'.ord => 'f',
|
23
|
+
'ਢ'.ord => 'F',
|
24
|
+
'ਣ'.ord => 'x',
|
25
|
+
'ਤ'.ord => 'q',
|
26
|
+
'ਥ'.ord => 'Q',
|
27
|
+
'ਦ'.ord => 'd',
|
28
|
+
'ਧ'.ord => 'D',
|
29
|
+
'ਨ'.ord => 'n',
|
30
|
+
'ਪ'.ord => 'p',
|
31
|
+
'ਫ'.ord => 'P',
|
32
|
+
'ਬ'.ord => 'b',
|
33
|
+
'ਭ'.ord => 'B',
|
34
|
+
'ਮ'.ord => 'm',
|
35
|
+
'ਯ'.ord => 'X',
|
36
|
+
'ਰ'.ord => 'r',
|
37
|
+
'ਲ'.ord => 'l',
|
38
|
+
'ਵ'.ord => 'v',
|
39
|
+
'ੜ'.ord => 'V',
|
40
|
+
'ਸ਼'.ord => 'S',
|
41
|
+
'ਜ਼'.ord => 'z',
|
42
|
+
'ਖ਼'.ord => '^',
|
43
|
+
'ਫ਼'.ord => '&',
|
44
|
+
'ਗ਼'.ord => 'Z',
|
45
|
+
'ਲ਼'.ord => 'L',
|
46
|
+
'਼'.ord => 'æ',
|
47
|
+
'ੑ'.ord => '@',
|
48
|
+
'ੵ'.ord => "\u00b4", # acute accent (´)
|
49
|
+
'ਃ'.ord => 'Ú', # capital u-acute letter
|
50
|
+
"\u0a13".ord => 'E', # ਓ
|
51
|
+
"\u0a06".ord => 'Aw', # ਆ
|
52
|
+
"\u0a07".ord => 'ei', # ਇ
|
53
|
+
"\u0a08".ord => 'eI', # ਈ
|
54
|
+
"\u0a09".ord => 'au', # ਉ
|
55
|
+
"\u0a0a".ord => 'aU', # ਊ
|
56
|
+
"\u0a0f".ord => 'ey', # ਏ
|
57
|
+
"\u0a10".ord => 'AY', # ਐ
|
58
|
+
"\u0a14".ord => 'AO', # ਔ
|
59
|
+
'ਾ'.ord => 'w',
|
60
|
+
'ਿ'.ord => 'i',
|
61
|
+
'ੀ'.ord => 'I',
|
62
|
+
'ੁ'.ord => 'u',
|
63
|
+
'ੂ'.ord => 'U',
|
64
|
+
'ੇ'.ord => 'y',
|
65
|
+
'ੈ'.ord => 'Y',
|
66
|
+
'ੋ'.ord => 'o',
|
67
|
+
'ੌ'.ord => 'O',
|
68
|
+
'ੰ'.ord => 'M',
|
69
|
+
'ਂ'.ord => 'N',
|
70
|
+
'ੱ'.ord => '~',
|
71
|
+
'।'.ord => '[',
|
72
|
+
'॥'.ord => ']',
|
73
|
+
'੦'.ord => '0',
|
74
|
+
'੧'.ord => '1',
|
75
|
+
'੨'.ord => '2',
|
76
|
+
'੩'.ord => '3',
|
77
|
+
'੪'.ord => '4',
|
78
|
+
'੫'.ord => '5',
|
79
|
+
'੬'.ord => '6',
|
80
|
+
'੭'.ord => '7',
|
81
|
+
'੮'.ord => '8',
|
82
|
+
'੯'.ord => '9',
|
83
|
+
'ੴ'.ord => '<>',
|
84
|
+
'☬'.ord => 'Ç'
|
85
|
+
}.freeze
|
86
|
+
|
87
|
+
ASCII_REPLACEMENTS = {
|
88
|
+
'੍ਯ' => 'Î', # half-yayya
|
89
|
+
'꠳ਯ' => 'Î', # sant lipi variation
|
90
|
+
'꠴ਯ' => 'ï', # open-top yayya
|
91
|
+
'꠵ਯ' => 'î', # open-top half-yayya
|
92
|
+
'੍ਰ' => 'R',
|
93
|
+
'੍ਵ' => 'Í', # capital i-acute letter
|
94
|
+
'੍ਹ' => 'H',
|
95
|
+
'੍ਚ' => 'ç', # c-cedilla letter
|
96
|
+
'੍ਟ' => '†', # dagger symbol
|
97
|
+
'੍ਤ' => 'œ',
|
98
|
+
'੍ਨ' => "\u02dc" # small tilde (˜)
|
99
|
+
}.freeze
|
100
|
+
|
101
|
+
# TODO: Raise warnings if incorrect vowel syntax
|
102
|
+
|
103
|
+
def self.ascii(string)
|
104
|
+
string = unicode_normalize(string)
|
105
|
+
|
106
|
+
# Perform replacements
|
107
|
+
ASCII_REPLACEMENTS.each do |key, value|
|
108
|
+
string.gsub!(key, value)
|
109
|
+
end
|
110
|
+
|
111
|
+
# Perform translation
|
112
|
+
string = string.chars.map { |c| ASCII_TRANSLATION[c.ord] || c }.join
|
113
|
+
|
114
|
+
# Re-arrange sihari
|
115
|
+
ascii_base_letters = 'AeshkKgG|cCjJ\tTfFxqQdDnpPbBmXrlvVSz^&ZLÎïî'
|
116
|
+
ascii_modifiers = 'æ@´ÚwIuUyYoO`MNRÍH熜˜ü¨®µˆW~¤Ï'
|
117
|
+
regex = Regexp.new("([#{ascii_base_letters}][#{ascii_modifiers}]*)i([#{ascii_modifiers}]*)")
|
118
|
+
string.gsub!(regex, 'i\1\2')
|
119
|
+
|
120
|
+
# Fix below-base-letter + u vowel positioning
|
121
|
+
ascii_below_base_letters = 'RÍH熜˜´@'
|
122
|
+
below_vowel_mappings = {
|
123
|
+
'u' => 'ü',
|
124
|
+
'U' => '¨'
|
125
|
+
}
|
126
|
+
|
127
|
+
below_vowel_mappings.each do |key, value|
|
128
|
+
string.gsub!(/([#{ascii_below_base_letters}][#{ascii_modifiers}]*)#{key}([#{ascii_modifiers}]*)/, "\\1#{value}\\2")
|
129
|
+
end
|
130
|
+
|
131
|
+
# Fix center-stroke + tippi positioning
|
132
|
+
center_stroke_letters = 'nT'
|
133
|
+
string.gsub!(/([#{center_stroke_letters}][#{ascii_modifiers}]*)M([#{ascii_modifiers}]*)/, '\\1µ\\2')
|
134
|
+
|
135
|
+
# Fix positioning of bindi/tippi when it is the only above-base-form
|
136
|
+
ascii_non_above_modifiers = 'æ@´ÚwuURÍH熜˜ü¨®Ï'
|
137
|
+
nasalization_mappings = {
|
138
|
+
'N' => 'ˆ',
|
139
|
+
'~' => '`'
|
140
|
+
}
|
141
|
+
nasalization_mappings.each do |key, value|
|
142
|
+
string.gsub!(/([#{ascii_base_letters}][#{ascii_non_above_modifiers}]*)#{key}([#{ascii_non_above_modifiers}]*)/, "\\1#{value}\\2")
|
143
|
+
end
|
144
|
+
|
145
|
+
# Make rendering changes for combos
|
146
|
+
ascii_combo_replacements = {
|
147
|
+
'Iਁ' => 'ˆØI', # bindi + bihari ligature
|
148
|
+
'IM' => 'µØI', # tippi + bihari ligature
|
149
|
+
'Iµ' => 'µØI', # tippi + bihari ligature
|
150
|
+
'kR' => 'k®', # kakka + pair-rara ligature
|
151
|
+
'H¨' => '§',
|
152
|
+
'wN' => 'W', # addhak positioning
|
153
|
+
'wˆ' => 'W', # addhak positioning
|
154
|
+
'nUµ' => 'ƒ'
|
155
|
+
}
|
156
|
+
ascii_combo_replacements.each do |key, value|
|
157
|
+
string.gsub!(key, value)
|
158
|
+
end
|
159
|
+
return string
|
160
|
+
end
|
161
|
+
end
|
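The "Re-arrange sihari" step in `ascii.rb` moves the ASCII sihari (`i`) in front of the base letter it follows, because legacy ASCII Gurmukhi fonts type the sihari before the consonant while Unicode stores it after. A minimal, self-contained sketch of that regex technique follows. The letter sets here are deliberately tiny illustrative subsets and the method name `move_sihari_before_base` is an assumption; the gem's own sets are much larger.

```ruby
# Illustrative subsets only; the gem's real letter and modifier sets are larger.
SKETCH_BASE_LETTERS = 'shkgm'
SKETCH_MODIFIERS = 'RH'

# A base letter, any trailing modifiers, then "i", then any further modifiers.
SKETCH_SIHARI_PATTERN = /([#{SKETCH_BASE_LETTERS}][#{SKETCH_MODIFIERS}]*)i([#{SKETCH_MODIFIERS}]*)/

# Hypothetical helper: returns a new string with each sihari moved in front
# of the base letter (and modifiers) it was typed after.
def move_sihari_before_base(ascii)
  ascii.gsub(SKETCH_SIHARI_PATTERN, 'i\1\2')
end

puts move_sihari_before_base('si')   # sihari moves in front of "s"
puts move_sihari_before_base('kRi')  # the subjoined "R" stays attached to "k"
```

Using non-mutating `gsub` keeps the caller's string untouched, which also avoids errors on frozen string literals.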
data/lib/constants.rb
ADDED
@@ -0,0 +1,32 @@
+# frozen_string_literal: true
+
+module GurmukhiUtils
+  # pause chars / ਵਿਸ਼ਰਾਮ symbols
+  VISHRAM_LIGHT = '.'
+  VISHRAM_MEDIUM = ','
+  VISHRAM_HEAVY = ';'
+  VISHRAMS = [VISHRAM_LIGHT, VISHRAM_MEDIUM, VISHRAM_HEAVY].freeze
+
+  BASE_LETTERS = 'ਸਹਕਖਗਘਙਚਛਜਝਞਟਠਡਢਣਤਥਦਧਨਪਫਬਭਮਯਰਲਵੜਸ਼ਖ਼ਗ਼ਜ਼ਫ਼ਲ਼'
+  VOWEL_LETTERS =
+    # ਅ ਆ ਏ ਐ ਇ ਈ ਓ ਔ ਉ ਊ
+    "ਅ\u0a06\u0a0f\u0a10\u0a07\u0a08\u0a13\u0a14\u0a09\u0a0a"
+
+  ORDERED_VOWELS = [
+    'ਿ',
+    'ੇ',
+    'ੈ',
+    'ੋ',
+    'ੌ',
+    'ੁ',
+    'ੂ',
+    'ਾ',
+    'ੀ'
+  ].freeze
+
+  VIRAMA = '੍'
+  BELOW_LETTERS = 'ਹਰਵਟਤਨਚ'
+  YAKASH = 'ੵ'
+
+  VOWEL_DIACRITICS = ORDERED_VOWELS.join
+end
data/lib/remove.rb
ADDED
@@ -0,0 +1,70 @@
+# frozen_string_literal: true
+
+module GurmukhiUtils
+  # Removes substrings from the string.
+  #
+  # @param string [String] The string to affect.
+  # @param removals [Array<String>] Any substring to remove.
+  # @return [String] The string with the given substrings removed.
+  #
+  # @example
+  #   remove("ਸਬਦ. ਸਬਦ, ਸਬਦ; ਸਬਦ", [".", ","])
+  #   # => "ਸਬਦ ਸਬਦ ਸਬਦ; ਸਬਦ"
+  #   remove("ਸਬਦ. ਸਬਦ, ਸਬਦ; ਸਬਦ", GurmukhiUtils::VISHRAMS)
+  #   # => "ਸਬਦ ਸਬਦ ਸਬਦ ਸਬਦ"
+  def remove(string, removals)
+    removals.each { |removal| string.gsub!(removal, '') }
+    string
+  end
+
+  # Removes regex patterns from the string.
+  #
+  # Note:
+  #   Also removes duplicate space characters from the string.
+  #
+  # @param string [String] The string to affect.
+  # @param patterns [Array<String>] Any pattern to remove.
+  # @return [String] The string without any matching patterns, duplicate spaces, or leading/trailing spaces.
+  #
+  # @example
+  #   remove_regex("ਸਬਦ. ਸਬਦ, ਸਬਦ; ਸਬਦ", [".+\\s"])
+  #   # => "ਸਬਦ"
+  def remove_regex(string, patterns)
+    patterns.each { |pattern| string.gsub!(Regexp.new(pattern), '') }
+    string.replace(string.squeeze(' ').strip) # squeeze! returns nil when nothing changes, so avoid chaining bang methods
+    string
+  end
+
+  # Attempts to remove line endings as best as possible.
+  #
+  # @param string [String] The unicode Gurmukhi, Hindi, or English translation/transliteration to affect.
+  # @return [String] The string without line endings.
+  #
+  # @example
+  #   remove_line_endings("ਸਬਦ ॥ ਸਬਦ ॥੧॥ ਰਹਾਉ ॥")
+  #   # => "ਸਬਦ ਸਬਦ"
+  def remove_line_endings(string)
+    line_ending_patterns = [
+      '[।॥] *(ਰਹਾਉ|रहाउ).*',
+      '[|] *Pause.*',
+      '[|] *(rahaau|rahau|rahao).*',
+      '[।॥][੦-੯|०-९].*',
+      '[|]\\d.*',
+      '[।॥|]'
+    ]
+
+    remove_regex(string, line_ending_patterns)
+  end
+
+  # Removes all vishram characters.
+  #
+  # @param string [String] The string to affect.
+  # @return [String] The string without vishrams.
+  #
+  # @example
+  #   remove_vishrams("ਸਬਦ. ਸਬਦ, ਸਬਦ; ਸਬਦ ॥")
+  #   # => "ਸਬਦ ਸਬਦ ਸਬਦ ਸਬਦ ॥"
+  def remove_vishrams(string)
+    remove(string, GurmukhiUtils::VISHRAMS) # constants.rb defines VISHRAMS directly on GurmukhiUtils
+  end
+end
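One detail in `remove_regex` is worth spelling out: `String#squeeze!` and `String#strip!` return `nil` when they make no change, so chaining one bang method onto another can raise `NoMethodError`. The bang methods also raise `FrozenError` on frozen strings, which is relevant in files that use `# frozen_string_literal: true`. The following is a minimal non-mutating sketch of the same idea. The name `remove_patterns` is hypothetical and this is not the gem's implementation.

```ruby
# Hypothetical non-mutating variant of the pattern-removal idea above.
# It never modifies its argument, so frozen strings are safe to pass in.
def remove_patterns(string, patterns)
  cleaned = patterns.reduce(string) do |result, pattern|
    result.gsub(Regexp.new(pattern), '')
  end
  # squeeze and strip (without "!") always return a String, never nil.
  cleaned.squeeze(' ').strip
end

frozen_input = 'one.  two, three; four '.freeze
puts remove_patterns(frozen_input, ['[.,;]'])

# Demonstrates why chaining bang methods is fragile:
puts 'no doubles'.dup.squeeze!(' ').inspect
```

The last line prints `nil`, because the string has no repeated spaces for `squeeze!` to collapse.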
data/lib/unicode.rb
ADDED
@@ -0,0 +1,298 @@
|
|
1
|
+
# frozen_string_literal: true
|
2
|
+
|
3
|
+
# Copied from https://github.com/shabados/gurmukhiutils/blob/b7a1ca9f0d341b64715158893afb93675c773823/gurmukhiutils/unicode.py
|
4
|
+
require 'strscan'
|
5
|
+
|
6
|
+
module GurmukhiUtils
|
7
|
+
UNICODE_STANDARDS = ['Unicode Consortium', 'Sant Lipi'].freeze
|
8
|
+
|
9
|
+
ASCII_TO_SL_TRANSLATION = {
|
10
|
+
'a'.ord => 'ੳ',
|
11
|
+
'b'.ord => 'ਬ',
|
12
|
+
'c'.ord => 'ਚ',
|
13
|
+
'd'.ord => 'ਦ',
|
14
|
+
'e'.ord => 'ੲ',
|
15
|
+
'f'.ord => 'ਡ',
|
16
|
+
'g'.ord => 'ਗ',
|
17
|
+
'h'.ord => 'ਹ',
|
18
|
+
'i'.ord => 'ਿ',
|
19
|
+
'j'.ord => 'ਜ',
|
20
|
+
'k'.ord => 'ਕ',
|
21
|
+
'l'.ord => 'ਲ',
|
22
|
+
'm'.ord => 'ਮ',
|
23
|
+
'n'.ord => 'ਨ',
|
24
|
+
'o'.ord => 'ੋ',
|
25
|
+
'p'.ord => 'ਪ',
|
26
|
+
'q'.ord => 'ਤ',
|
27
|
+
'r'.ord => 'ਰ',
|
28
|
+
's'.ord => 'ਸ',
|
29
|
+
't'.ord => 'ਟ',
|
30
|
+
'u'.ord => 'ੁ',
|
31
|
+
'v'.ord => 'ਵ',
|
32
|
+
'w'.ord => 'ਾ',
|
33
|
+
'x'.ord => 'ਣ',
|
34
|
+
'y'.ord => 'ੇ',
|
35
|
+
'z'.ord => 'ਜ਼',
|
36
|
+
'A'.ord => 'ਅ',
|
37
|
+
'B'.ord => 'ਭ',
|
38
|
+
'C'.ord => 'ਛ',
|
39
|
+
'D'.ord => 'ਧ',
|
40
|
+
'E'.ord => 'ਓ',
|
41
|
+
'F'.ord => 'ਢ',
|
42
|
+
'G'.ord => 'ਘ',
|
43
|
+
'H'.ord => '੍ਹ',
|
44
|
+
'I'.ord => 'ੀ',
|
45
|
+
'J'.ord => 'ਝ',
|
46
|
+
'K'.ord => 'ਖ',
|
47
|
+
'L'.ord => 'ਲ਼',
|
48
|
+
'M'.ord => 'ੰ',
|
49
|
+
'N'.ord => 'ਂ',
|
50
|
+
'O'.ord => 'ੌ',
|
51
|
+
'P'.ord => 'ਫ',
|
52
|
+
'Q'.ord => 'ਥ',
|
53
|
+
'R'.ord => '੍ਰ',
|
54
|
+
'S'.ord => 'ਸ਼',
|
55
|
+
'T'.ord => 'ਠ',
|
56
|
+
'U'.ord => 'ੂ',
|
57
|
+
'V'.ord => 'ੜ',
|
58
|
+
'W'.ord => 'ਾਂ',
|
59
|
+
'X'.ord => 'ਯ',
|
60
|
+
'Y'.ord => 'ੈ',
|
61
|
+
'Z'.ord => 'ਗ਼',
|
62
|
+
'0'.ord => '੦',
|
63
|
+
'1'.ord => '੧',
|
64
|
+
'2'.ord => '੨',
|
65
|
+
'3'.ord => '੩',
|
66
|
+
'4'.ord => '੪',
|
67
|
+
'5'.ord => '੫',
|
68
|
+
'6'.ord => '੬',
|
69
|
+
'7'.ord => '੭',
|
70
|
+
'8'.ord => '੮',
|
71
|
+
'9'.ord => '੯',
|
72
|
+
'['.ord => '।',
|
73
|
+
']'.ord => '॥',
|
74
|
+
'\\'.ord => 'ਞ',
|
75
|
+
'|'.ord => 'ਙ',
|
76
|
+
'`'.ord => 'ੱ',
|
77
|
+
'~'.ord => 'ੱ',
|
78
|
+
'@'.ord => 'ੑ',
|
79
|
+
'^'.ord => 'ਖ਼',
|
80
|
+
'&'.ord => 'ਫ਼',
|
81
|
+
'†'.ord => '੍ਟ', # dagger symbol
|
82
|
+
'ü'.ord => 'ੁ', # u-diaeresis letter
|
83
|
+
'®'.ord => '੍ਰ', # registered symbol
|
84
|
+
"\u00b4".ord => 'ੵ', # acute accent (´)
|
85
|
+
"\u00a8".ord => 'ੂ', # diaeresis accent (¨)
|
86
|
+
'µ'.ord => 'ੰ', # mu letter
|
87
|
+
'æ'.ord => '਼',
|
88
|
+
"\u00a1".ord => 'ੴ', # inverted exclamation (¡)
|
89
|
+
'ƒ'.ord => 'ਨੂੰ', # florin symbol
|
90
|
+
'œ'.ord => '੍ਤ',
|
91
|
+
'Í'.ord => '੍ਵ', # capital i-acute letter
|
92
|
+
'Î'.ord => '੍ਯ', # capital i-circumflex letter
|
93
|
+
'Ï'.ord => 'ੵ', # capital i-diaeresis letter
|
94
|
+
'Ò'.ord => '॥', # capital o-grave letter
|
95
|
+
'Ú'.ord => 'ਃ', # capital u-acute letter
|
96
|
+
"\u02c6".ord => 'ਂ', # circumflex accent (ˆ)
|
97
|
+
"\u02dc".ord => '੍ਨ', # small tilde (˜)
|
98
|
+
'§'.ord => '੍ਹੂ', # section symbol
|
99
|
+
'¤'.ord => 'ੱ', # currency symbol
|
100
|
+
'ç'.ord => '੍ਚ', # c-cedilla letter
|
101
|
+
'Ç'.ord => '☬', # khanda instead of california state symbol
|
102
|
+
"\u201a".ord => '❁', # single low-9 quotation (‚) mark
|
103
|
+
'Æ'.ord => nil,
|
104
|
+
'Ø'.ord => nil, # This is a topline / shirorekha (शिरोरेखा) extender
|
105
|
+
'ÿ'.ord => nil, # This is the author Kulbir S Thind's stamp
|
106
|
+
'Œ'.ord => nil, # Box drawing left flower
|
107
|
+
'‰'.ord => nil, # Box drawing right flower
|
108
|
+
'Ó'.ord => nil, # Box drawing top flower
|
109
|
+
'Ô'.ord => nil, # Box drawing bottom flower
|
110
|
+
'Î'.ord => '꠳ਯ', # half-yayya
|
111
|
+
'ï'.ord => '꠴ਯ', # open-top yayya
|
112
|
+
'î'.ord => '꠵ਯ' # open-top half-yayya
|
113
|
+
}.freeze
|
114
|
+
|
115
|
+
ASCII_TO_SL_REPLACEMENTS = {
|
116
|
+
'ˆØI' => 'ੀਁ', # Handle pre-bihari-bindi with unused adakbindi
|
117
|
+
'<>' => 'ੴ', # AnmolLipi/GurbaniAkhar variant
|
118
|
+
'<' => 'ੴ', # GurbaniLipi variant
|
119
|
+
'>' => '☬', # GurbaniLipi variant
|
120
|
+
'Åå' => 'ੴ', # AnmolLipi/GurbaniAkhar variant
|
121
|
+
'Å' => 'ੴ', # GurbaniLipi variant
|
122
|
+
'å' => 'ੴ' # GurbaniLipi variant
|
123
|
+
}.freeze
|
124
|
+
|
125
|
+
UNICODE_TO_SL_REPLACEMENTS = {
|
126
|
+
'੍ਯ' => '꠳ਯ' # replace unicode half-yayya with Sant Lipi ligature (north indic one-sixteenth fraction + yayya)
|
127
|
+
}.freeze
|
128
|
+
|
129
|
+
SL_TO_UNICODE_REPLACEMENTS = {
|
130
|
+
'꠳ਯ' => '੍ਯ',
|
131
|
+
'꠴ਯ' => 'ਯ',
|
132
|
+
'꠵ਯ' => '੍ਯ',
|
133
|
+
'ਁ' => 'ਂ' # pre-bihari-bindi
|
134
|
+
}.freeze
|
135
|
+
|
136
|
+
##
|
137
|
+
# Converts any ASCII Gurmukhi characters and sanitizes to Unicode Gurmukhi.
|
138
|
+
# Note:
|
139
|
+
# Converting yayya (ਯ) variants with an open top using the Unicode Consortium standard is considered destructive.
|
140
|
+
# This function will substitute the original with its shirorekha/top-line equivalent.
|
141
|
+
#
|
142
|
+
# Many fonts and text shaping engines fail to render half-yayya (੍ਯ) correctly. Regardless of the standard used,
|
143
|
+
# it is recommended to use the Sant Lipi font mentioned below.
|
144
|
+
#
|
145
|
+
# @param string [String] The string to affect.
|
146
|
+
# @param unicode_standard [String] The mapping system to use. The default is Unicode compliant and can render
|
147
|
+
# 99% of the Shabad OS Database. The other option "Sant Lipi" is intended for a custom Unicode font bearing the
|
148
|
+
# same name (see: https://github.com/shabados/SantLipi). Defaults to "Unicode Consortium".
|
149
|
+
# @return [String] A string whose Gurmukhi is normalized to a Unicode standard.
|
150
|
+
#
|
151
|
+
# @example
|
152
|
+
# unicode("123")
|
153
|
+
# #=> "੧੨੩"
|
154
|
+
# unicode("<> > <")
|
155
|
+
# #=> "ੴ ☬ ੴ"
|
156
|
+
# unicode("gurU")
|
157
|
+
# #=> "ਗੁਰੂ"
|
158
|
+
##
|
159
|
+
def self.unicode(string, unicode_standard = 'Unicode Consortium')
|
160
|
+
# Move ASCII sihari before mapping to unicode
|
161
|
+
ascii_base_letters = '\\a-zA-Z|^&Îîï'
|
162
|
+
ascii_sihari_pattern = Regexp.new("(i)([#{ascii_base_letters}])")
|
163
|
+
string = string.gsub(ascii_sihari_pattern, '\2\1')
|
164
|
+
|
165
|
+
# Map any ASCII / Unicode Gurmukhi to Sant Lipi format
|
166
|
+
ASCII_TO_SL_REPLACEMENTS.each do |key, value|
|
167
|
+
string.gsub!(key, value)
|
168
|
+
end
|
169
|
+
|
170
|
+
UNICODE_TO_SL_REPLACEMENTS.each do |key, value|
|
171
|
+
string.gsub!(key, value)
|
172
|
+
end
|
173
|
+
|
174
|
+
string = string.chars.map { |c| ASCII_TO_SL_TRANSLATION[c.ord] || c }.join
|
175
|
+
|
176
|
+
string = unicode_normalize(string)
|
177
|
+
|
178
|
+
if unicode_standard == 'Unicode Consortium'
|
179
|
+
SL_TO_UNICODE_REPLACEMENTS.each do |key, value|
|
180
|
+
string.gsub!(key, value)
|
181
|
+
end
|
182
|
+
end
|
183
|
+
|
184
|
+
return string
|
185
|
+
end
|
186
|
+
|
187
|
+
##
|
188
|
+
# Normalizes Gurmukhi according to Unicode Standards.
|
189
|
+
# @param string [String] The string to affect.
|
190
|
+
# @return [String] A string containing normalized Gurmukhi.
|
191
|
+
#
|
192
|
+
# @example
|
193
|
+
# unicode_normalize("Hello ਜੀ")
|
194
|
+
# #=> "Hello ਜੀ"
|
195
|
+
##
|
196
|
+
def self.unicode_normalize(string)
|
197
|
+
string = sort_diacritics(string)
|
198
|
+
return sanitize_unicode(string)
|
199
|
+
end
|
200
|
+
|
201
|
+
##
|
202
|
+
# In Gurmukhi script, common diacritics include vowel signs (matras), such as:
|
203
|
+
# ੁ (u), ੀ (ī), or ੋ (ō), and other marks like bindi (ਂ), tippi (ੰ), and nukta (਼).
|
204
|
+
#
|
205
|
+
# @brief Orders the Gurmukhi diacritics in a string according to Unicode standards.
|
206
|
+
# Not intended for base letters with multiple subjoined letters.
|
207
|
+
# @param string [String] The string to affect.
|
208
|
+
# @return [String] The same string with Gurmukhi diacritics arranged in a sorted manner.
|
209
|
+
#
|
210
|
+
# @example
|
211
|
+
# sort_diacritics("\u0a41\u0a4b") # => "\u0a4b\u0a41" # ੁੋ vs ੋੁ
|
212
|
+
##
|
213
|
+
def self.sort_diacritics(string)
|
214
|
+
# Nukta is essential to form a new base letter and must be ordered first.
|
215
|
+
# Udaat, Yakash, and subjoined letters should follow.
|
216
|
+
# Subjoined letters are constructed (they are not single char), so they cannot be used
|
217
|
+
# in the same regex group pattern. See further below for subjoined letters.
|
218
|
+
base_letter_modifiers = ['਼', 'ੑ', 'ੵ']
|
219
|
+
|
220
|
+
# More generally, when a consonant or independent vowel is modified by multiple vowel signs, the sequence of the vowel signs in the underlying representation of the text should be: left, top, bottom, right.
|
221
|
+
# p. 491 of The Unicode® Standard Version 14.0 – Core Specification
|
222
|
+
# https://www.unicode.org/versions/Unicode14.0.0/ch12.pdf
|
223
|
+
vowel_order = ['ਿ', 'ੇ', 'ੈ', 'ੋ', 'ੌ', 'ੁ', 'ੂ', 'ਾ', 'ੀ']
|
224
|
+
|
225
|
+
# The remaining diacritics are to be sorted at the end according to the following order
|
226
|
+
remaining_modifier_order = ['ਁ', 'ੱ', 'ਂ', 'ੰ', 'ਃ']
|
227
|
+
|
228
|
+
generated_marks = (base_letter_modifiers + vowel_order + remaining_modifier_order).join
|
229
|
+
mark_pattern = Regexp.new("([#{generated_marks}]*)")
|
230
|
+
|
231
|
+
virama = '੍'
|
232
|
+
below_base_letters = 'ਹਰਵਟਤਨਚ'
|
233
|
+
below_base_pattern = Regexp.new("(#{virama}[#{below_base_letters}])?")
|
234
|
+
|
235
|
+
regex_match_pattern = Regexp.new("#{mark_pattern}#{below_base_pattern}#{mark_pattern}")
|
236
|
+
|
237
|
+
generated_match_order = (base_letter_modifiers + [virama] + below_base_letters.chars + vowel_order + remaining_modifier_order).join
|
238
|
+
|
239
|
+
string.gsub(regex_match_pattern) do |match|
|
240
|
+
match_chars = match.chars
|
241
|
+
match_chars.sort_by! { |e| generated_match_order.index(e) }
|
242
|
+
match_chars.join
|
243
|
+
end
|
244
|
+
end
|
245
|
+
|
246
|
+
def self.sanitize_unicode(string)
|
247
|
+
unicode_sanitization_map = {
|
248
|
+
"\u0a73\u0a4b" => "\u0a13", # ਓ
|
249
|
+
"\u0a05\u0a3e" => "\u0a06", # ਅ + ਾ = ਆ
|
250
|
+
"\u0a72\u0a3f" => "\u0a07", # ਇ
|
251
|
+
"\u0a72\u0a40" => "\u0a08", # ਈ
|
252
|
+
"\u0a73\u0a41" => "\u0a09", # ਉ
|
253
|
+
"\u0a73\u0a42" => "\u0a0a", # ਊ
|
254
|
+
"\u0a72\u0a47" => "\u0a0f", # ਏ
|
255
|
+
"\u0a05\u0a48" => "\u0a10", # ਐ
|
256
|
+
"\u0a05\u0a4c" => "\u0a14", # ਔ
|
257
|
+
"\u0a32\u0a3c" => "\u0a33", # ਲ਼
|
258
|
+
"\u0a38\u0a3c" => "\u0a36", # ਸ਼
|
259
|
+
"\u0a16\u0a3c" => "\u0a59", # ਖ਼
|
260
|
+
"\u0a17\u0a3c" => "\u0a5a", # ਗ਼
|
261
|
+
"\u0a1c\u0a3c" => "\u0a5b", # ਜ਼
|
262
|
+
"\u0a2b\u0a3c" => "\u0a5e", # ਫ਼
|
263
|
+
"\u0a71\u0a02" => "\u0a01" # ਁ adak bindi (quite literally never used today or in the Shabad OS Database, only included for parity with the Unicode block)
|
264
|
+
}
|
265
|
+
|
266
|
+
unicode_sanitization_map.each do |key, value|
|
267
|
+
string.gsub!(key, value)
|
268
|
+
end
|
269
|
+
|
270
|
+
return string
|
271
|
+
end
|
272
|
+
|
273
|
+
##
|
274
|
+
# Takes a string and returns a list of keys and values of each character and its corresponding code point.
|
275
|
+
# @param string [String] the string to affect
|
276
|
+
# @return [Array<Hash{String => String}>] a list of each character and its corresponding code point
|
277
|
+
# @example
|
278
|
+
# decode_unicode("To ਜੀ")
|
279
|
+
# #=> [{"T" => "0054"}, {"o" => "006f"}, {" " => "0020"}, {"ਜ" => "0a1c"}, {"ੀ" => "0a40"}]
|
280
|
+
##
|
281
|
+
def self.decode_unicode(string)
|
282
|
+
return string.chars.map { |item| { item => format('%04x', item.ord) } }
|
283
|
+
end
|
284
|
+
|
285
|
+
##
|
286
|
+
# Takes a list of hexadecimal code point strings and returns the corresponding Unicode characters.
|
287
|
+
# @param strings [Array<String>] the list containing any strings to encode
|
288
|
+
# @return [Array<String>] a list of any corresponding unicode characters
|
289
|
+
# @example
|
290
|
+
# encode_unicode(["0054"])
|
291
|
+
# #=> ["T"]
|
292
|
+
# encode_unicode(["0a1c", "0A40"])
|
293
|
+
# #=> ["ਜ", "ੀ"]
|
294
|
+
##
|
295
|
+
def self.encode_unicode(strings)
|
296
|
+
return strings.map { |string| string.to_i(16).chr(Encoding::UTF_8) }
|
297
|
+
end
|
298
|
+
end
|
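The `decode_unicode` and `encode_unicode` helpers at the end of `unicode.rb` are inverses of each other: one maps characters to four-digit hexadecimal code points, the other maps hexadecimal strings back to characters. The sketch below reproduces that round trip in a self-contained form. The method names `decode_code_points` and `encode_code_points` are stand-ins chosen for illustration so the example does not depend on the gem being loaded.

```ruby
# Stand-in for decode_unicode: one { character => hex code point } hash per character.
def decode_code_points(string)
  string.chars.map { |char| { char => format('%04x', char.ord) } }
end

# Stand-in for encode_unicode: hex code point strings back to UTF-8 characters.
def encode_code_points(hex_strings)
  hex_strings.map { |hex| hex.to_i(16).chr(Encoding::UTF_8) }
end

original = "To \u0a1c\u0a40" # "To " followed by JA (U+0A1C) and the vowel sign II (U+0A40)

decoded = decode_code_points(original)
hex_values = decoded.flat_map(&:values)
round_trip = encode_code_points(hex_values).join

puts hex_values.inspect
puts round_trip == original
```

Note that the encoder always returns an array, even for a single input, and that `to_i(16)` accepts both upper- and lower-case hexadecimal digits.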
metadata
ADDED
@@ -0,0 +1,55 @@
+--- !ruby/object:Gem::Specification
+name: gurmukhi_utils
+version: !ruby/object:Gem::Version
+  version: 0.0.1
+platform: ruby
+authors:
+- Dilraj Singh Somel (dsomel21)
+autorequire:
+bindir: bin
+cert_chain: []
+date: 2023-07-15 00:00:00.000000000 Z
+dependencies: []
+description: Library for working with Gurmukhi text, providing various operations
+  like unicode conversion, ascii conversion, and more.
+email:
+- dsomel21@gmail.com
+executables: []
+extensions: []
+extra_rdoc_files: []
+files:
+- CONTRIBUTING.md
+- Gemfile
+- Gemfile.lock
+- README.md
+- gurmukhi_utils.gemspec
+- lib/ascii.rb
+- lib/constants.rb
+- lib/gurmukhi_utils.rb
+- lib/remove.rb
+- lib/unicode.rb
+homepage: https://github.com/ShabadOS/gurmukhi-utils/tree/main/ruby
+licenses:
+- MIT
+metadata:
+  rubygems_mfa_required: 'true'
+post_install_message:
+rdoc_options: []
+require_paths:
+- lib
+required_ruby_version: !ruby/object:Gem::Requirement
+  requirements:
+  - - '='
+    - !ruby/object:Gem::Version
+      version: 3.2.1
+required_rubygems_version: !ruby/object:Gem::Requirement
+  requirements:
+  - - ">="
+    - !ruby/object:Gem::Version
+      version: '0'
+requirements: []
+rubygems_version: 3.4.7
+signing_key:
+specification_version: 4
+summary: A utility gem for converting, analyzing, and testing Gurmukhi strings.
+test_files: []