RubyGems - gurmukhi_utils - Versions diffs - 0.0.1 - Mend

gurmukhi_utils 0.0.1

Files changed (12)

checksums.yaml ADDED

@@ -0,0 +1,7 @@
+---
+SHA256:
+  metadata.gz: dbb164eb5abec25b925649743ac6ae8797bce240903a42e13ee7bca4cd5635bc
+  data.tar.gz: 62fd647545eee43081089e407517bab0c00f3136a37862a82a78e7e276a9262f
+SHA512:
+  metadata.gz: 314ce8d8a0149fa9d7327c44b8b0c145e2a0bfb031704d2b8843bfa636873d45fbf198dfe7666511242b3f0f788291beae2e9dd4f8a8d980fc6b99017a731424
+  data.tar.gz: 4d92bda8fe9cb31cbe430812d1953e517bc302acc5fa6e938bcf6197c05bd3dae39345ba723d5c0a79095b7576824f7a05a9bf1e9673e62aef675ced3e45abce
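
The digests above can be compared against a downloaded archive. A minimal sketch, assuming the bytes of `metadata.gz` or `data.tar.gz` are already in memory; the helper name `checksum_matches?` is hypothetical and not part of RubyGems:

```ruby
require 'digest'

# Hypothetical helper: true when the SHA256 hex digest of `bytes`
# equals the digest recorded in checksums.yaml.
def checksum_matches?(bytes, expected_hex)
  Digest::SHA256.hexdigest(bytes) == expected_hex.downcase
end

# Example with an empty payload, whose SHA256 digest is well known.
EMPTY_SHA256 = 'e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855'
puts checksum_matches?('', EMPTY_SHA256)
```

For a real check, read the file in binary mode (for example with `File.binread`) and pass the matching `SHA256` entry from `checksums.yaml`.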

data/CONTRIBUTING.md ADDED

@@ -0,0 +1,76 @@
+# Contributing
+Please see our [community docs on contributing](https://shabados.com/docs/community/contributing).
+This document is for developers or programmers contributing to source code. If you're interested in contributing a different way, please see the link above.
+## Project Structure
+This project follows a standard Ruby gem structure. Here's a brief overview of the main directories and files:
+- `lib/`: Contains the main source code for the GurmukhiUtils gem.
+  - `gurmukhi_utils.rb`: The main entry point for the gem. This file requires all other necessary files.
+  - `<feature>.rb` files: Each additional file in this directory represents a specific feature/module of the gem.
+- `spec/`: Contains the RSpec test files for the gem. Test files should be placed in this directory, following the naming convention `lib/<feature>_spec.rb`.
+- `Gemfile`: Specifies the gem dependencies for development and testing.
+- `Gemfile.lock`: Generated by Bundler, this file contains the exact gem versions and their dependencies used in the project.
+- `gurmukhi_utils.gemspec`: The gem specification file, which provides information about the gem. We need this to publish and release the gem.
+## Adding New Features to `GurmukhiUtils`
+To add a new feature to GurmukhiUtils, follow these steps:
+**Step 1.**
+Create a new file in the `lib/` directory for the new feature, and place the new functionality within the `GurmukhiUtils` module.
+For example, if you want to create an `ascii` method, create a new file called `ascii.rb`:
+```ruby
+# lib/ascii.rb
+module GurmukhiUtils
+  def self.helpers
+    # ...
+  end
+  def self.other_methods
+    # ...
+  end
+  def self.ascii
+    # ...
+  end
+end
+```
+**Step 2.**
+Update the `lib/gurmukhi_utils.rb` file to require the new feature file using `require_relative`:
+```ruby
+# lib/gurmukhi_utils.rb
+require_relative 'gurmukhi_utils/version'
+require_relative 'unicode'
+require_relative 'ascii' # Add this line for the new feature
+```
+**Step 3.**
+Test the feature
+Write tests for the new feature in the `spec` directory.
+You can also try the feature interactively in `irb`:
+```ruby
+001 > require_relative "lib/gurmukhi_utils"
+ => true
+002 > GurmukhiUtils.ascii("...")
+ => "..."
+```
+## Thank you
+Your contributions to open source, large or small, make great projects like this possible. Thank you for taking the time to participate in this project.
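
The steps in CONTRIBUTING.md work because Ruby modules are open: each feature file reopens `GurmukhiUtils` and adds methods to it. A minimal single-file sketch of that mechanism follows. The module `GurmukhiUtilsSketch` and both method names are hypothetical stand-ins, not part of the gem:

```ruby
# Stand-in for an existing feature file such as lib/unicode.rb.
module GurmukhiUtilsSketch
  def self.existing_feature(text)
    text.strip
  end
end

# Stand-in for a new lib/<feature>.rb: reopening the module adds a method
# without touching the earlier definition, and the new method can call it.
module GurmukhiUtilsSketch
  def self.new_feature(text)
    existing_feature(text).reverse
  end
end

puts GurmukhiUtilsSketch.new_feature('  abc ')
```

In the gem itself the two `module` blocks would live in separate files, tied together by the `require_relative` lines in `lib/gurmukhi_utils.rb`.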

data/Gemfile ADDED

@@ -0,0 +1,14 @@
+# frozen_string_literal: true
+ruby '3.2.1'
+source 'https://rubygems.org'
+group :development do
+  gem 'rubocop'
+  gem 'rubocop-rspec', require: false
+end
+group :test do
+  gem 'rspec'
+end

data/Gemfile.lock ADDED

@@ -0,0 +1,60 @@
+GEM
+  remote: https://rubygems.org/
+  specs:
+    ast (2.4.2)
+    diff-lcs (1.5.0)
+    json (2.6.3)
+    parallel (1.22.1)
+    parser (3.2.1.0)
+      ast (~> 2.4.1)
+    rainbow (3.1.1)
+    regexp_parser (2.7.0)
+    rexml (3.2.5)
+    rspec (3.12.0)
+      rspec-core (~> 3.12.0)
+      rspec-expectations (~> 3.12.0)
+      rspec-mocks (~> 3.12.0)
+    rspec-core (3.12.2)
+      rspec-support (~> 3.12.0)
+    rspec-expectations (3.12.3)
+      diff-lcs (>= 1.2.0, < 2.0)
+      rspec-support (~> 3.12.0)
+    rspec-mocks (3.12.5)
+      diff-lcs (>= 1.2.0, < 2.0)
+      rspec-support (~> 3.12.0)
+    rspec-support (3.12.0)
+    rubocop (1.47.0)
+      json (~> 2.3)
+      parallel (~> 1.10)
+      parser (>= 3.2.0.0)
+      rainbow (>= 2.2.2, < 4.0)
+      regexp_parser (>= 1.8, < 3.0)
+      rexml (>= 3.2.5, < 4.0)
+      rubocop-ast (>= 1.26.0, < 2.0)
+      ruby-progressbar (~> 1.7)
+      unicode-display_width (>= 2.4.0, < 3.0)
+    rubocop-ast (1.27.0)
+      parser (>= 3.2.1.0)
+    rubocop-capybara (2.17.1)
+      rubocop (~> 1.41)
+    rubocop-rspec (2.19.0)
+      rubocop (~> 1.33)
+      rubocop-capybara (~> 2.17)
+    ruby-progressbar (1.12.0)
+    strscan (3.0.5)
+    unicode-display_width (2.4.2)
+PLATFORMS
+  x86_64-darwin-19
+DEPENDENCIES
+  rspec
+  rubocop
+  rubocop-rspec
+  strscan
+RUBY VERSION
+   ruby 3.2.1p31
+BUNDLED WITH
+   2.4.7

data/README.md ADDED

@@ -0,0 +1,14 @@
+# Gurmukhi Utils (Ruby)
+Utilities library for converting, analyzing, and testing Gurmukhi strings.
+## Related
+This library is one of many in the Gurmukhi Utils super-repo.
+- [Super Repo](/README.md)
+- [Python](/python/README.md)
+- [JavaScript](/javascript/README.md)
+- [Ruby](/ruby/README.md)
+- [C# / C Sharp](/csharp/README.md)
+- [Dart](/dart/README.md)

data/gurmukhi_utils.gemspec ADDED

@@ -0,0 +1,20 @@
+# frozen_string_literal: true
+Gem::Specification.new do |spec|
+  spec.name          = 'gurmukhi_utils'
+  spec.version       = '0.0.1'
+  spec.authors       = ['Dilraj Singh Somel (dsomel21)']
+  spec.email         = ['dsomel21@gmail.com']
+  spec.summary       = 'A utility gem for converting, analyzing, and testing Gurmukhi strings.'
+  spec.description   = 'Library for working with Gurmukhi text, providing various operations like unicode conversion, ascii conversion, and more.'
+  spec.homepage      = 'https://github.com/ShabadOS/gurmukhi-utils/tree/main/ruby'
+  spec.license       = 'MIT'
+  spec.required_ruby_version = '3.2.1'
+  spec.files = `git ls-files -z`.split("\x0").reject { |f| f.match(%r{^(test|spec)/}) }
+  spec.require_paths = ['lib']
+  spec.metadata['rubygems_mfa_required'] = 'true'
+end
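
The `required_ruby_version = '3.2.1'` setting has no operator, and RubyGems treats a bare version as an exact match (the packaged metadata records it as `=`). The sketch below shows how such requirements evaluate using RubyGems' `Gem::Requirement` and `Gem::Version`. The helper name `ruby_allowed?` is hypothetical, and the `'>= 3.2.1'` requirement is shown only for comparison; the gem does not declare it.

```ruby
require 'rubygems'

# Hypothetical helper: does `ruby_version` satisfy the requirement string?
def ruby_allowed?(requirement, ruby_version)
  Gem::Requirement.new(requirement).satisfied_by?(Gem::Version.new(ruby_version))
end

puts ruby_allowed?('3.2.1', '3.2.1')    # exact match
puts ruby_allowed?('3.2.1', '3.2.2')    # newer patch release against an exact pin
puts ruby_allowed?('>= 3.2.1', '3.2.2') # newer patch release against a minimum
```

Under the exact pin, only Ruby 3.2.1 satisfies the requirement, so installs on any other Ruby version would be rejected.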

data/lib/ascii.rb ADDED

@@ -0,0 +1,161 @@
+# frozen_string_literal: true
+module GurmukhiUtils
+  ASCII_TRANSLATION = {
+    'ੳ'.ord => 'a',
+    'ਅ'.ord => 'A',
+    'ੲ'.ord => 'e',
+    'ਸ'.ord => 's',
+    'ਹ'.ord => 'h',
+    'ਕ'.ord => 'k',
+    'ਖ'.ord => 'K',
+    'ਗ'.ord => 'g',
+    'ਘ'.ord => 'G',
+    'ਙ'.ord => '|',
+    'ਚ'.ord => 'c',
+    'ਛ'.ord => 'C',
+    'ਜ'.ord => 'j',
+    'ਝ'.ord => 'J',
+    'ਞ'.ord => '\\',
+    'ਟ'.ord => 't',
+    'ਠ'.ord => 'T',
+    'ਡ'.ord => 'f',
+    'ਢ'.ord => 'F',
+    'ਣ'.ord => 'x',
+    'ਤ'.ord => 'q',
+    'ਥ'.ord => 'Q',
+    'ਦ'.ord => 'd',
+    'ਧ'.ord => 'D',
+    'ਨ'.ord => 'n',
+    'ਪ'.ord => 'p',
+    'ਫ'.ord => 'P',
+    'ਬ'.ord => 'b',
+    'ਭ'.ord => 'B',
+    'ਮ'.ord => 'm',
+    'ਯ'.ord => 'X',
+    'ਰ'.ord => 'r',
+    'ਲ'.ord => 'l',
+    'ਵ'.ord => 'v',
+    'ੜ'.ord => 'V',
+    'ਸ਼'.ord => 'S',
+    'ਜ਼'.ord => 'z',
+    'ਖ਼'.ord => '^',
+    'ਫ਼'.ord => '&',
+    'ਗ਼'.ord => 'Z',
+    'ਲ਼'.ord => 'L',
+    '਼'.ord => 'æ',
+    'ੑ'.ord => '@',
+    'ੵ'.ord => "\u00b4", # acute accent (´)
+    'ਃ'.ord => 'Ú', # capital u-acute letter
+    "\u0a13".ord => 'E', # ਓ
+    "\u0a06".ord => 'Aw',  # ਆ
+    "\u0a07".ord => 'ei',  # ਇ
+    "\u0a08".ord => 'eI',  # ਈ
+    "\u0a09".ord => 'au',  # ਉ
+    "\u0a0a".ord => 'aU',  # ਊ
+    "\u0a0f".ord => 'ey',  # ਏ
+    "\u0a10".ord => 'AY',  # ਐ
+    "\u0a14".ord => 'AO',  # ਔ
+    'ਾ'.ord => 'w',
+    'ਿ'.ord => 'i',
+    'ੀ'.ord => 'I',
+    'ੁ'.ord => 'u',
+    'ੂ'.ord => 'U',
+    'ੇ'.ord => 'y',
+    'ੈ'.ord => 'Y',
+    'ੋ'.ord => 'o',
+    'ੌ'.ord => 'O',
+    'ੰ'.ord => 'M',
+    'ਂ'.ord => 'N',
+    'ੱ'.ord => '~',
+    '।'.ord => '[',
+    '॥'.ord => ']',
+    '੦'.ord => '0',
+    '੧'.ord => '1',
+    '੨'.ord => '2',
+    '੩'.ord => '3',
+    '੪'.ord => '4',
+    '੫'.ord => '5',
+    '੬'.ord => '6',
+    '੭'.ord => '7',
+    '੮'.ord => '8',
+    '੯'.ord => '9',
+    'ੴ'.ord => '<>',
+    '☬'.ord => 'Ç'
+  }.freeze
+  ASCII_REPLACEMENTS = {
+    '੍ਯ' => 'Î',  # half-yayya
+    '꠳ਯ' => 'Î',  # sant lipi variation
+    '꠴ਯ' => 'ï',  # open-top yayya
+    '꠵ਯ' => 'î',  # open-top half-yayya
+    '੍ਰ' => 'R',
+    '੍ਵ' => 'Í',  # capital i-acute letter
+    '੍ਹ' => 'H',
+    '੍ਚ' => 'ç',  # c-cedilla letter
+    '੍ਟ' => '†',  # dagger symbol
+    '੍ਤ' => 'œ',
+    '੍ਨ' => "\u02dc" # small tilde (˜)
+  }.freeze
+  # TODO: Raise warnings if the vowel syntax is incorrect
+  def self.ascii(string)
+    string = unicode_normalize(string)
+    # Perform replacements
+    ASCII_REPLACEMENTS.each do |key, value|
+      string.gsub!(key, value)
+    end
+    # Perform translation
+    string = string.chars.map { |c| ASCII_TRANSLATION[c.ord] || c }.join
+    # Re-arrange sihari
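+    # The doubled backslashes keep "\" (ਞ) and "t" (ਟ) as literals inside the
+    # character classes below; a single "\t" there would be read as a tab.
+    ascii_base_letters = 'AeshkKgG|cCjJ\\\\tTfFxqQdDnpPbBmXrlvVSz^&ZLÎïî'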
+    ascii_modifiers = 'æ@´ÚwIuUyYoO`MNRÍHç†œ˜ü¨®µˆW~¤Ï'
+    regex = Regexp.new("([#{ascii_base_letters}][#{ascii_modifiers}]*)i([#{ascii_modifiers}]*)")
+    string.gsub!(regex, 'i\1\2')
+    # Fix below-base-letter + u vowel positioning
+    ascii_below_base_letters = 'RÍHç†œ˜´@'
+    below_vowel_mappings = {
+      'u' => 'ü',
+      'U' => '¨'
+    }
+    below_vowel_mappings.each do |key, value|
+      string.gsub!(/([#{ascii_below_base_letters}][#{ascii_modifiers}]*)#{key}([#{ascii_modifiers}]*)/, "\\1#{value}\\2")
+    end
+    # Fix center-stroke + tippi positioning
+    center_stroke_letters = 'nT'
+    string.gsub!(/([#{center_stroke_letters}][#{ascii_modifiers}]*)M([#{ascii_modifiers}]*)/, '\\1µ\\2')
+    # Fix positioning of bindi/tippi when it is the only above-base-form
+    ascii_non_above_modifiers = 'æ@´ÚwuURÍHç†œ˜ü¨®Ï'
+    nasalization_mappings = {
+      'N' => 'ˆ',
+      '~' => '`'
+    }
+    nasalization_mappings.each do |key, value|
+      string.gsub!(/([#{ascii_base_letters}][#{ascii_non_above_modifiers}]*)#{key}([#{ascii_non_above_modifiers}]*)/, "\\1#{value}\\2")
+    end
+    # Make rendering changes for combos
+    ascii_combo_replacements = {
+      'Iਁ' => 'ˆØI',  # bindi + bihari ligature
+      'IM' => 'µØI',  # tippi + bihari ligature
+      'Iµ' => 'µØI',  # tippi + bihari ligature
+      'kR' => 'k®', # kakka + pair-rara ligature
+      'H¨' => '§',
+      'wN' => 'W',  # kanna + bindi ligature
+      'wˆ' => 'W',  # kanna + bindi ligature
+      'nUµ' => 'ƒ'
+    }
+    ascii_combo_replacements.each do |key, value|
+      string.gsub!(key, value)
+    end
+    return string
+  end
+end
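
The conversion in `GurmukhiUtils.ascii` runs in three stages: multi-character replacements, per-character translation, and sihari repositioning. The sketch below is a reduced version under stated assumptions. The tables are a small illustrative subset, the names `SKETCH_REPLACEMENTS`, `SKETCH_TRANSLATION`, and `sketch_ascii` are hypothetical, and the normalization step and the later rendering fixes are omitted.

```ruby
# Reduced, illustrative tables; the complete ones are in lib/ascii.rb.
SKETCH_REPLACEMENTS = { "\u0a4d\u0a30" => 'R' }.freeze # virama + rara => subjoined rara
SKETCH_TRANSLATION = {
  "\u0a38" => 's', # sassa
  "\u0a15" => 'k', # kakka
  "\u0a30" => 'r', # rara
  "\u0a1f" => 't', # tainka
  "\u0a3f" => 'i', # sihari
  "\u0a3e" => 'w'  # kanna
}.freeze

def sketch_ascii(text)
  # 1. Multi-character sequences first, so the virama is consumed with its letter.
  replaced = SKETCH_REPLACEMENTS.reduce(text) { |acc, (from, to)| acc.gsub(from, to) }
  # 2. One-to-one translation; characters without a mapping pass through unchanged.
  translated = replaced.each_char.map { |char| SKETCH_TRANSLATION.fetch(char, char) }.join
  # 3. Unicode stores sihari after the base letter, but ASCII fonts expect it
  #    before the base letter and its modifiers, so move it to the front.
  translated.gsub(/([skrt][Rw]*)i/, 'i\1')
end

puts sketch_ascii("\u0a15\u0a3f")             # kakka + sihari
puts sketch_ascii("\u0a15\u0a4d\u0a30\u0a3f") # kakka + subjoined rara + sihari
```

The order matters: if translation ran before the replacements, the virama would have no mapping and the subjoined letter would come out as a full `r`.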

data/lib/constants.rb ADDED

@@ -0,0 +1,32 @@
+# frozen_string_literal: true
+module GurmukhiUtils
+  # pause chars / ਵਿਸ਼ਰਾਮ symbols
+  VISHRAM_LIGHT = '.'
+  VISHRAM_MEDIUM = ','
+  VISHRAM_HEAVY = ';'
+  VISHRAMS = [VISHRAM_LIGHT, VISHRAM_MEDIUM, VISHRAM_HEAVY].freeze
+  BASE_LETTERS = 'ਸਹਕਖਗਘਙਚਛਜਝਞਟਠਡਢਣਤਥਦਧਨਪਫਬਭਮਯਰਲਵੜਸ਼ਖ਼ਗ਼ਜ਼ਫ਼ਲ਼'
+  VOWEL_LETTERS =
+    # ਅ ਆ ਏ ਐ ਇ ਈ ਓ ਔ ਉ ਊ
+    "ਅ\u0a06\u0a0f\u0a10\u0a07\u0a08\u0a13\u0a14\u0a09\u0a0a"
+  ORDERED_VOWELS = [
+    'ਿ',
+    'ੇ',
+    'ੈ',
+    'ੋ',
+    'ੌ',
+    'ੁ',
+    'ੂ',
+    'ਾ',
+    'ੀ'
+  ].freeze
+  VIRAMA = '੍'
+  BELOW_LETTERS = 'ਹਰਵਟਤਨਚ'
+  YAKASH = 'ੵ'
+  VOWEL_DIACRITICS = ORDERED_VOWELS.join
+end

data/lib/gurmukhi_utils.rb ADDED

@@ -0,0 +1,6 @@
+# frozen_string_literal: true
+require_relative 'unicode'
+require_relative 'ascii'
+require_relative 'remove'
+require_relative 'constants'

data/lib/remove.rb ADDED

@@ -0,0 +1,70 @@
+# frozen_string_literal: true
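+module GurmukhiUtils
+  # Expose the methods below at module level (e.g. GurmukhiUtils.remove),
+  # matching how the ascii and unicode helpers are called.
+  module_function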
+  # Removes substrings from the string.
+  #
+  # @param string [String] The string to affect.
+  # @param removals [Array<String>] Any substring to remove.
+  # @return [String] The string without any substrings.
+  #
+  # @example
+  #   remove("ਸਬਦ. ਸਬਦ, ਸਬਦ; ਸਬਦ", [".", ","])
+  #   # => "ਸਬਦ ਸਬਦ ਸਬਦ; ਸਬਦ"
+  #   remove("ਸਬਦ. ਸਬਦ, ਸਬਦ; ਸਬਦ", GurmukhiUtils::VISHRAMS)
+  #   # => "ਸਬਦ ਸਬਦ ਸਬਦ ਸਬਦ"
+  def remove(string, removals)
+    # Build a new string rather than mutating the caller's (possibly frozen) argument.
+    removals.reduce(string) { |result, removal| result.gsub(removal, '') }
+  end
+  # Removes regex patterns from the string.
+  #
+  # Note:
+  #   Also removes duplicate space characters from the string.
+  #
+  # @param string [String] The string to affect.
+  # @param patterns [Array<String>] Any pattern to remove.
+  # @return [String] The string without any matching patterns, duplicate spaces, or leading/trailing spaces.
+  #
+  # @example
+  #   remove_regex("ਸਬਦ. ਸਬਦ, ਸਬਦ; ਸਬਦ", [".+\\s"])
+  #   # => "ਸਬਦ"
+  def remove_regex(string, patterns)
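+    result = patterns.reduce(string) { |acc, pattern| acc.gsub(Regexp.new(pattern), '') }
+    # Non-bang calls: squeeze! and strip! return nil when nothing changes, which breaks chaining.
+    result.squeeze(' ').strip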
+  end
+  # Attempts to remove line endings as best as possible.
+  #
+  # @param string [String] The unicode Gurmukhi, Hindi, or English translation/transliteration to affect.
+  # @return [String] The string without line endings.
+  #
+  # @example
+  #   remove_line_endings("ਸਬਦ ॥ ਸਬਦ ॥੧॥ ਰਹਾਉ ॥")
+  #   # => "ਸਬਦ ਸਬਦ"
+  def remove_line_endings(string)
+    line_ending_patterns = [
+      '[।॥] *(ਰਹਾਉ|रहाउ).*',
+      '[|] *Pause.*',
+      '[|] *(rahaau|rahau|rahao).*',
+      '[।॥][੦-੯|०-९].*',
+      '[|]\\d.*',
+      '[।॥|]'
+    ]
+    remove_regex(string, line_ending_patterns)
+  end
+  # Removes all vishram characters.
+  #
+  # @param string [String] The string to affect.
+  # @return [String] The string without vishrams.
+  #
+  # @example
+  #   remove_vishrams("ਸਬਦ. ਸਬਦ, ਸਬਦ; ਸਬਦ ॥")
+  #   # => "ਸਬਦ ਸਬਦ ਸਬਦ ਸਬਦ ॥"
+  def remove_vishrams(string)
+    remove(string, GurmukhiUtils::VISHRAMS)
+  end
+end
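
One Ruby detail matters when removing patterns and tidying whitespace: bang methods such as `String#squeeze!` and `String#strip!` return `nil` when they change nothing, and they raise on frozen strings. Below is a non-mutating sketch of pattern removal; the method name `strip_patterns` and the constant `SKETCH_LINE_ENDINGS` are hypothetical, and the patterns are the ASCII-only subset of those listed in `remove_line_endings`.

```ruby
# ASCII-only subset of the line-ending patterns, for illustration.
SKETCH_LINE_ENDINGS = [
  '[|] *(rahaau|rahau|rahao).*',
  '[|]\\d.*',
  '[|]'
].freeze

# Removes every pattern, then collapses repeated spaces and trims the ends.
# Returns a new string, so frozen input is safe.
def strip_patterns(text, patterns)
  cleaned = patterns.reduce(text) { |acc, pattern| acc.gsub(Regexp.new(pattern), '') }
  cleaned.squeeze(' ').strip
end

puts strip_patterns('shabad | shabad |1| rahau |'.freeze, SKETCH_LINE_ENDINGS)
```

Using the non-bang `squeeze` and `strip` means the method always returns a string, even when the input needs no whitespace changes.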

data/lib/unicode.rb ADDED

@@ -0,0 +1,298 @@
+# frozen_string_literal: true
+# Copied from https://github.com/shabados/gurmukhiutils/blob/b7a1ca9f0d341b64715158893afb93675c773823/gurmukhiutils/unicode.py
+require 'strscan'
+module GurmukhiUtils
+  UNICODE_STANDARDS = ['Unicode Consortium', 'Sant Lipi'].freeze
+  ASCII_TO_SL_TRANSLATION = {
+    'a'.ord => 'ੳ',
+    'b'.ord => 'ਬ',
+    'c'.ord => 'ਚ',
+    'd'.ord => 'ਦ',
+    'e'.ord => 'ੲ',
+    'f'.ord => 'ਡ',
+    'g'.ord => 'ਗ',
+    'h'.ord => 'ਹ',
+    'i'.ord => 'ਿ',
+    'j'.ord => 'ਜ',
+    'k'.ord => 'ਕ',
+    'l'.ord => 'ਲ',
+    'm'.ord => 'ਮ',
+    'n'.ord => 'ਨ',
+    'o'.ord => 'ੋ',
+    'p'.ord => 'ਪ',
+    'q'.ord => 'ਤ',
+    'r'.ord => 'ਰ',
+    's'.ord => 'ਸ',
+    't'.ord => 'ਟ',
+    'u'.ord => 'ੁ',
+    'v'.ord => 'ਵ',
+    'w'.ord => 'ਾ',
+    'x'.ord => 'ਣ',
+    'y'.ord => 'ੇ',
+    'z'.ord => 'ਜ਼',
+    'A'.ord => 'ਅ',
+    'B'.ord => 'ਭ',
+    'C'.ord => 'ਛ',
+    'D'.ord => 'ਧ',
+    'E'.ord => 'ਓ',
+    'F'.ord => 'ਢ',
+    'G'.ord => 'ਘ',
+    'H'.ord => '੍ਹ',
+    'I'.ord => 'ੀ',
+    'J'.ord => 'ਝ',
+    'K'.ord => 'ਖ',
+    'L'.ord => 'ਲ਼',
+    'M'.ord => 'ੰ',
+    'N'.ord => 'ਂ',
+    'O'.ord => 'ੌ',
+    'P'.ord => 'ਫ',
+    'Q'.ord => 'ਥ',
+    'R'.ord => '੍ਰ',
+    'S'.ord => 'ਸ਼',
+    'T'.ord => 'ਠ',
+    'U'.ord => 'ੂ',
+    'V'.ord => 'ੜ',
+    'W'.ord => 'ਾਂ',
+    'X'.ord => 'ਯ',
+    'Y'.ord => 'ੈ',
+    'Z'.ord => 'ਗ਼',
+    '0'.ord => '੦',
+    '1'.ord => '੧',
+    '2'.ord => '੨',
+    '3'.ord => '੩',
+    '4'.ord => '੪',
+    '5'.ord => '੫',
+    '6'.ord => '੬',
+    '7'.ord => '੭',
+    '8'.ord => '੮',
+    '9'.ord => '੯',
+    '['.ord => '।',
+    ']'.ord => '॥',
+    '\\'.ord => 'ਞ',
+    '|'.ord => 'ਙ',
+    '`'.ord => 'ੱ',
+    '~'.ord => 'ੱ',
+    '@'.ord => 'ੑ',
+    '^'.ord => 'ਖ਼',
+    '&'.ord => 'ਫ਼',
+    '†'.ord => '੍ਟ', # dagger symbol
+    'ü'.ord => 'ੁ',  # u-diaeresis letter
+    '®'.ord => '੍ਰ', # registered symbol
+    "\u00b4".ord => 'ੵ', # acute accent (´)
+    "\u00a8".ord => 'ੂ',  # diaeresis accent (¨)
+    'µ'.ord => 'ੰ',       # mu letter
+    'æ'.ord => '਼',
+    "\u00a1".ord => 'ੴ',  # inverted exclamation (¡)
+    'ƒ'.ord => 'ਨੂੰ',     # florin symbol
+    'œ'.ord => '੍ਤ',
+    'Í'.ord => '੍ਵ',      # capital i-acute letter
+    'Î'.ord => '੍ਯ',      # capital i-circumflex letter
+    'Ï'.ord => 'ੵ',       # capital i-diaeresis letter
+    'Ò'.ord => '॥',       # capital o-grave letter
+    'Ú'.ord => 'ਃ',       # capital u-acute letter
+    "\u02c6".ord => 'ਂ',  # circumflex accent (ˆ)
+    "\u02dc".ord => '੍ਨ',  # small tilde (˜)
+    '§'.ord => '੍ਹੂ',      # section symbol
+    '¤'.ord => 'ੱ',       # currency symbol
+    'ç'.ord => '੍ਚ',      # c-cedilla letter
+    'Ç'.ord => '☬',       # khanda instead of california state symbol
+    "\u201a".ord => '❁',  # single low-9 quotation (‚) mark
+    'Æ'.ord => nil,
+    'Ø'.ord => nil,       # This is a topline / shirorekha (शिरोरेखा) extender
+    'ÿ'.ord => nil,       # This is the author Kulbir S Thind's stamp
+    'Œ'.ord => nil,       # Box drawing left flower
+    '‰'.ord => nil,       # Box drawing right flower
+    'Ó'.ord => nil,       # Box drawing top flower
+    'Ô'.ord => nil,       # Box drawing bottom flower
+    'Î'.ord => '꠳ਯ',      # half-yayya
+    'ï'.ord => '꠴ਯ',      # open-top yayya
+    'î'.ord => '꠵ਯ'       # open-top half-yayya
+  }.freeze
+  ASCII_TO_SL_REPLACEMENTS = {
+    'ˆØI' => 'ੀਁ', # Handle pre-bihari-bindi with unused adakbindi
+    '<>' => 'ੴ', # AnmolLipi/GurbaniAkhar variant
+    '<' => 'ੴ',  # GurbaniLipi variant
+    '>' => '☬',  # GurbaniLipi variant
+    'Åå' => 'ੴ', # AnmolLipi/GurbaniAkhar variant
+    'Å' => 'ੴ', # GurbaniLipi variant
+    'å' => 'ੴ' # GurbaniLipi variant
+  }.freeze
+  UNICODE_TO_SL_REPLACEMENTS = {
+    '੍ਯ' => '꠳ਯ' # replace unicode half-yayya with Sant Lipi ligature (north indic one-sixteenth fraction + yayya)
+  }.freeze
+  SL_TO_UNICODE_REPLACEMENTS = {
+    '꠳ਯ' => '੍ਯ',
+    '꠴ਯ' => 'ਯ',
+    '꠵ਯ' => '੍ਯ',
+    'ਁ' => 'ਂ' # pre-bihari-bindi
+  }.freeze
+  ##
+  # Converts any ASCII Gurmukhi characters and sanitizes to Unicode Gurmukhi.
+  # Note:
+  #   Converting yayya (ਯ) variants with an open top using the Unicode Consortium standard is considered destructive.
+  #   This function will substitute the original with its shirorekha/top-line equivalent.
+  #
+  #   Many fonts and text shaping engines fail to render half-yayya (੍ਯ) correctly. Regardless of the standard used,
+  #   it is recommended to use the Sant Lipi font mentioned below.
+  #
+  # @param string [String] The string to affect.
+  # @param unicode_standard [String] The mapping system to use. The default is Unicode compliant and can render
+  #   99% of the Shabad OS Database. The other option "Sant Lipi" is intended for a custom Unicode font bearing the
+  #   same name (see: https://github.com/shabados/SantLipi). Defaults to "Unicode Consortium".
+  # @return [String] A string whose Gurmukhi is normalized to a Unicode standard.
+  #
+  # @example
+  #   unicode("123")
+  #   #=> "੧੨੩"
+  #   unicode("<> > <")
+  #   #=> "ੴ ☬ ੴ"
+  #   unicode("gurU")
+  #   #=> "ਗੁਰੂ"
+  ##
+  def self.unicode(string, unicode_standard = 'Unicode Consortium')
+    # Move ASCII sihari before mapping to unicode
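+    # The doubled backslashes put a literal "\" (ਞ) in the character class;
+    # a single one would form "\a" (bell) and start an unintended range up to "z".
+    ascii_base_letters = '\\\\a-zA-Z|^&Îîï'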
+    ascii_sihari_pattern = Regexp.new("(i)([#{ascii_base_letters}])")
+    string = string.gsub(ascii_sihari_pattern, '\2\1')
+    # Map any ASCII / Unicode Gurmukhi to Sant Lipi format
+    ASCII_TO_SL_REPLACEMENTS.each do |key, value|
+      string.gsub!(key, value)
+    end
+    UNICODE_TO_SL_REPLACEMENTS.each do |key, value|
+      string.gsub!(key, value)
+    end
+    string = string.chars.map { |c| ASCII_TO_SL_TRANSLATION[c.ord] || c }.join
+    string = unicode_normalize(string)
+    if unicode_standard == 'Unicode Consortium'
+      SL_TO_UNICODE_REPLACEMENTS.each do |key, value|
+        string.gsub!(key, value)
+      end
+    end
+    return string
+  end
+  ##
+  # Normalizes Gurmukhi according to Unicode Standards.
+  # @param string [String] The string to affect.
+  # @return [String] A string containing normalized Gurmukhi.
+  #
+  # @example
+  #   unicode_normalize("Hello ਜੀ")
+  #   #=> "Hello ਜੀ"
+  ##
+  def self.unicode_normalize(string)
+    string = sort_diacritics(string)
+    return sanitize_unicode(string)
+  end
+  ##
+  # In Gurmukhi script, common diacritics include vowel signs (matras), such as:
+  #  ੁ (u), ੀ (ī), or ੋ (ō), and other marks like bindi (ਂ), tippi (ੰ), and nukta (਼).
+  #
+  # @brief Orders the Gurmukhi diacritics in a string according to Unicode standards.
+  # Not intended for base letters with multiple subjoined letters.
+  # @param string [String] The string to affect.
+  # @return [String] The same string with Gurmukhi diacritics arranged in a sorted manner.
+  #
+  # @example
+  #   sort_diacritics("\u0a41\u0a4b") # => "\u0a4b\u0a41" # ੁੋ vs  ੋੁ
+  ##
+  def self.sort_diacritics(string)
+    # Nukta is essential to form a new base letter and must be ordered first.
+    # Udaat, Yakash, and subjoined letters should follow.
+    # Subjoined letters are constructed (they are not single char), so they cannot be used
+    # in the same regex group pattern. See further below for subjoined letters.
+    base_letter_modifiers = ['਼', 'ੑ', 'ੵ']
+    # More generally, when a consonant or independent vowel is modified by multiple vowel signs, the sequence of the vowel signs in the underlying representation of the text should be: left, top, bottom, right.
+    # p. 491 of The Unicode® Standard Version 14.0 – Core Specification
+    # https://www.unicode.org/versions/Unicode14.0.0/ch12.pdf
+    vowel_order = ['ਿ', 'ੇ', 'ੈ', 'ੋ', 'ੌ', 'ੁ', 'ੂ', 'ਾ', 'ੀ']
+    # The remaining diacritics are to be sorted at the end according to the following order
+    remaining_modifier_order = ['ਁ', 'ੱ', 'ਂ', 'ੰ', 'ਃ']
+    generated_marks = (base_letter_modifiers + vowel_order + remaining_modifier_order).join
+    mark_pattern = Regexp.new("([#{generated_marks}]*)")
+    virama = '੍'
+    below_base_letters = 'ਹਰਵਟਤਨਚ'
+    below_base_pattern = Regexp.new("(#{virama}[#{below_base_letters}])?")
+    regex_match_pattern = Regexp.new("#{mark_pattern}#{below_base_pattern}#{mark_pattern}")
+    generated_match_order = (base_letter_modifiers + [virama] + below_base_letters.chars + vowel_order + remaining_modifier_order).join
+    string.gsub(regex_match_pattern) do |match|
+      match_chars = match.chars
+      match_chars.sort_by! { |e| generated_match_order.index(e) }
+      match_chars.join
+    end
+  end
+  def self.sanitize_unicode(string)
+    unicode_sanitization_map = {
+      "\u0a73\u0a4b" => "\u0a13", # ਓ
+      "\u0a05\u0a3e" => "\u0a06",  # ਅ + ਾ = ਆ
+      "\u0a72\u0a3f" => "\u0a07",  # ਇ
+      "\u0a72\u0a40" => "\u0a08",  # ਈ
+      "\u0a73\u0a41" => "\u0a09",  # ਉ
+      "\u0a73\u0a42" => "\u0a0a",  # ਊ
+      "\u0a72\u0a47" => "\u0a0f",  # ਏ
+      "\u0a05\u0a48" => "\u0a10",  # ਐ
+      "\u0a05\u0a4c" => "\u0a14",  # ਔ
+      "\u0a32\u0a3c" => "\u0a33",  # ਲ਼
+      "\u0a38\u0a3c" => "\u0a36",  # ਸ਼
+      "\u0a16\u0a3c" => "\u0a59",  # ਖ਼
+      "\u0a17\u0a3c" => "\u0a5a",  # ਗ਼
+      "\u0a1c\u0a3c" => "\u0a5b",  # ਜ਼
+      "\u0a2b\u0a3c" => "\u0a5e",  # ਫ਼
+      "\u0a71\u0a02" => "\u0a01" # ਁ adak bindi (quite literally never used today or in the Shabad OS Database, only included for parity with the Unicode block)
+    }
+    unicode_sanitization_map.each do |key, value|
+      string.gsub!(key, value)
+    end
+    return string
+  end
+  ##
+  # Takes a string and returns a list of keys and values of each character and its corresponding code point.
+  # @param string [String] the string to affect
+  # @return [Array<Hash{String => String}>] a list of each character and its corresponding code point
+  # @example
+  #   decode_unicode("To ਜੀ")
+  #   #=> [{"T" => "0054"}, {"o" => "006f"}, {" " => "0020"}, {"ਜ" => "0a1c"}, {"ੀ" => "0a40"}]
+  ##
+  def self.decode_unicode(string)
+    return string.chars.map { |item| { item => format('%04x', item.ord) } }
+  end
+  ##
+  # Takes a list of hexadecimal code point strings and returns the corresponding Unicode characters.
+  # @param strings [Array<String>] the list containing any strings to encode
+  # @return [Array<String>] a list of any corresponding unicode characters
+  # @example
+  #   encode_unicode(["0054"])
+  #   #=> ["T"]
+  #   encode_unicode(["0a1c", "0A40"])
+  #   #=> ["ਜ", "ੀ"]
+  ##
+  def self.encode_unicode(strings)
+    return strings.map { |string| string.to_i(16).chr(Encoding::UTF_8) }
+  end
+end
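
The core idea of `sort_diacritics` is to find each run of combining marks and reorder it by a fixed priority list. The sketch below shows that idea on a reduced mark set. The names `SKETCH_MARK_ORDER` and `sketch_sort_marks` are hypothetical, and subjoined letters (virama sequences) are left out.

```ruby
# Reduced priority list: nukta first, then vowel signs, then nasalization.
SKETCH_MARK_ORDER = [
  "\u0a3c", # nukta
  "\u0a3f", # sihari
  "\u0a47", # lavan
  "\u0a4b", # hora
  "\u0a41", # aunkar
  "\u0a42", # dulainkar
  "\u0a3e", # kanna
  "\u0a40", # bihari
  "\u0a02", # bindi
  "\u0a70"  # tippi
].freeze

# Matches one or more consecutive marks from the list above.
SKETCH_MARK_RUN = Regexp.new("[#{SKETCH_MARK_ORDER.join}]+")

# Reorders every run of marks according to its position in the priority list.
def sketch_sort_marks(text)
  text.gsub(SKETCH_MARK_RUN) do |run|
    run.chars.sort_by { |mark| SKETCH_MARK_ORDER.index(mark) }.join
  end
end

sorted = sketch_sort_marks("\u0a17\u0a41\u0a4b") # gagga + aunkar + hora
puts sorted.chars.map { |char| format('%04x', char.ord) }.join(' ')
```

Base letters are never matched by the pattern, so they stay in place and only the marks attached to them move.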

metadata ADDED

@@ -0,0 +1,55 @@
+--- !ruby/object:Gem::Specification
+name: gurmukhi_utils
+version: !ruby/object:Gem::Version
+  version: 0.0.1
+platform: ruby
+authors:
+- Dilraj Singh Somel (dsomel21)
+autorequire:
+bindir: bin
+cert_chain: []
+date: 2023-07-15 00:00:00.000000000 Z
+dependencies: []
+description: Library for working with Gurmukhi text, providing various operations
+  like unicode conversion, ascii conversion, and more.
+email:
+- dsomel21@gmail.com
+executables: []
+extensions: []
+extra_rdoc_files: []
+files:
+- CONTRIBUTING.md
+- Gemfile
+- Gemfile.lock
+- README.md
+- gurmukhi_utils.gemspec
+- lib/ascii.rb
+- lib/constants.rb
+- lib/gurmukhi_utils.rb
+- lib/remove.rb
+- lib/unicode.rb
+homepage: https://github.com/ShabadOS/gurmukhi-utils/tree/main/ruby
+licenses:
+- MIT
+metadata:
+  rubygems_mfa_required: 'true'
+post_install_message:
+rdoc_options: []
+require_paths:
+- lib
+required_ruby_version: !ruby/object:Gem::Requirement
+  requirements:
+  - - '='
+    - !ruby/object:Gem::Version
+      version: 3.2.1
+required_rubygems_version: !ruby/object:Gem::Requirement
+  requirements:
+  - - ">="
+    - !ruby/object:Gem::Version
+      version: '0'
+requirements: []
+rubygems_version: 3.4.7
+signing_key:
+specification_version: 4
+summary: A utility gem for converting, analyzing, and testing Gurmukhi strings.
+test_files: []