injalid_dejice 0.1.0 → 1.0.0

Files changed (10)

checksums.yaml +4 -4
data/CHANGELOG.md +4 -0
data/README.md +54 -4
data/lib/injalid_dejice/decoder.rb +48 -28
data/lib/injalid_dejice/encoder.rb +7 -49
data/lib/injalid_dejice/injalid_dejice.rb +4 -2
data/lib/injalid_dejice/locale_resolver.rb +106 -0
data/lib/injalid_dejice/version.rb +1 -1
data/lib/injalid_dejice.rb +34 -5
metadata +3 -2

checksums.yaml CHANGED

@@ -1,7 +1,7 @@
 ---
 SHA256:
-  metadata.gz: 810e98c784de94eb504872bdb8bc8dd7eadcb948c40b40b77ddc8050b787da31
-  data.tar.gz: fe8a847cfdcf4c8e631e1ce1759fe4d68ebf9e166d79873d223e9ebf330e060b
+  metadata.gz: dd5574caa3104732d22132bccd8709065f75b34fca9424b4babf8838d4cd8816
+  data.tar.gz: 9f8a57a4d12c8e435e5c955b76674fb0afe3ee82a102add50cf1f54cb0afdd8a
 SHA512:
-  metadata.gz: 9e4ff10a85b68e80f8c099abd8083c7df0ab556ce88eba41e9ff2118ed868a341cfe9cfd2e5e0fd803e9ae801bfe5e6aee28bb0cb8e2100356ec9aa6585ade33
-  data.tar.gz: eb223603c6d2b4788b66dde0011a56dce5618c16ae620ad6aa501a6f69b15f6752e79d20a88bff3ef7f1c8125b9bdd42f11d27ff8016e32540a0a6723142ce74
+  metadata.gz: c673d4083d47c3a21cc2ae4395bb7316f6d0a5e9805cde6371b063620d9bd50356fca7e15d6a2aa46c264716eeb8853c80baf649cab57ec912bfe42c1798b4f1
+  data.tar.gz: 6ca6c76abe1a0c7ed2acdec0c76be176cade95f89a464b0927f5e2d3467a2c32771207ae375892ff7ac666dc4298746fa68e30c48170e9460ba070e6b2a99838

data/CHANGELOG.md CHANGED

@@ -1,3 +1,7 @@
+## [1.0.0] - 2024-01-13
+- API change.
 ## [0.1.0] - 2024-01-11
 - Initial release

data/README.md CHANGED

@@ -1,12 +1,12 @@
 # injalid_dejice.rb
-Convets UTF-8 strings to the legacy KOI-7 encoding.
+Converts UTF-8 strings to the legacy KOI-7 encoding.
-**KOI-7** (Russian: КОИ-7) was a 7-bit character encoding developed in the USSR and based on the US-ASCII encoding.
+**KOI-7** (Russian: КОИ-7) was a 7-bit character encoding developed in the USSR and based on the US-ASCII encoding. US-ASCII has only 128 code points, which are enough to encode the Latin alphabet (with numbers and most common punctuation marks), but not enough to encode both Latin and Cyrillic alphabets with the full set of characters (uppercase and lowercase).
-US-ASCII has only 128 code points, which is enough to encode the Latin alphabet (with numbers and most common punctuation marks), but not enough to encode both Latin and Cyrillic alphabets with the full set of characters (uppercase and lowercase). In order to add the Cyrillic alphabet support, but also remain the US-ASCII compatibility, KOI-7 designers came up with a “hack”, in which both Cyrillic and Latin characters were encoded with the same code points, while the ASCII control characters 0x0E and 0x0F were adapted to switch between the Cyrillic and the Latin code pages respectively.
+In order to add Cyrillic alphabet support, but also to remain US-ASCII compatible, KOI-7 designers came up with a “hack”, in which both Cyrillic and Latin characters were encoded with the same code points, while the ASCII control characters 0x0E and 0x0F were adapted to switch between the Cyrillic and the Latin code pages, respectively.
-That led to funny bugs when a message was displayed with an incorrect code page. The «**иНЖАЛИД ДЕЖИЦЕ**» message (pronounced as “injalid dejitze”) was the most memorable, and it became somewhat a meme (hence the name of the gem). In fact, it was an incorrectly displayed English message, that simply stated: “Invalid device”.
+This led to very annoying, yet sometimes funny, bugs when English text was printed (on a screen, a printer, or a teletype) while the code page was still set to Cyrillic. The «**иНЖАЛИД ДЕЖИЦЕ**» message (pronounced as “injalid dejitze”) is the most notable example of such a case. The message became somewhat of a meme, and also an eponym for improperly coded text that looks like gibberish. In fact, the phrase was an improperly coded error message that simply states: “Invalid device”. Hence the name of the gem.
 KOI-7 was widely used on the Soviet PDP-11 clones, notably the DVK systems and the Elektronika MK90 portable computer.
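
The code-page switching described in the README text above can be illustrated with a short Ruby sketch. It is not taken from the gem: `render_koi7` is a hypothetical helper, and `CYR_PAGE` is a deliberately partial table that holds only the letter pairs readable from the README's own "Invalid device" example.

```ruby
# Partial, illustrative table: only the letters of "Invalid device", taken from
# the README example. A real KOI-7 table covers the whole alphabet.
CYR_PAGE = {
  "I" => "и", "n" => "Н", "v" => "Ж", "a" => "А", "l" => "Л",
  "i" => "И", "d" => "Д", "e" => "Е", "c" => "Ц"
}.freeze

SO = 0x0E.chr # "Shift Out": switch to the Cyrillic code page
SI = 0x0F.chr # "Shift In": switch back to the Latin code page

# Hypothetical helper: show a 7-bit string the way a KOI-7 terminal would,
# tracking which code page is currently selected.
def render_koi7(bytes)
  cyrillic = false
  bytes.each_char.with_object(+"") do |char, out|
    case char
    when SO then cyrillic = true
    when SI then cyrillic = false
    else out << (cyrillic ? CYR_PAGE.fetch(char, char) : char)
    end
  end
end

puts render_koi7("Invalid device")        # prints "Invalid device"
puts render_koi7("#{SO}Invalid device")   # prints "иНЖАЛИД ДЕЖИЦЕ"
```

The second call reproduces the bug the gem is named after: the text itself is plain English, but a stray SO left the terminal on the Cyrillic page.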
@@ -28,6 +28,56 @@ Or install it yourself as:
 ## Usage
+Ruby doesn't support KOI-7 natively (see `Encoding.name_list`), but it does support 7-bit US-ASCII, which is KOI-7-compatible (KOI-7 was based on US-ASCII). So the encoder returns its result as a US-ASCII string, and the decoder expects the input string to be encoded in US-ASCII or in any compatible encoding (e.g. UTF-8 with characters in the range 0x00 - 0x7F).
+### Encoding a string from UTF-8 to KOI-7
+InjalidDejice.utf_to_koi(string [, keyword arguments])
+Arguments:
+- string
+Keyword arguments:
+- :forced_latin ([])
+  Characters 0x00 - 0x3F are shared by both the Latin and the Cyrillic code pages (KOI-7 N0 and KOI-7 N1). The locale of such a character is defined by the preceding character(s). But sometimes such a character should be forcibly switched to the Latin locale, regardless of the preceding character(s).
+  The :forced_latin option allows that. It specifies characters that are always treated as Latin: if the Cyrillic locale was set with SO (0x0E), the SI (0x0F) code is inserted before any character listed in the forced_latin array.
+- :unknown_char_rep ("?")
+  A replacement character for the unsupported characters.
+Returns:
+- String
+### Decoding a string from KOI-7 to UTF-8
+InjalidDejice.koi_to_utf(string [, keyword arguments])
+Arguments:
+- string
+Keyword arguments:
+- :strict_mode (false)
+  When 'true', if the locale was switched to Cyrillic with an SO (0x0E) character, it must be switched back to the Latin locale with an SI (0x0F) character. If this condition is not met, an ArgumentError is raised.
+- :unknown_char_rep ("?")
+  A replacement character for the unsupported characters.
+Returns:
+- String
+### Examples
 ```ruby
 require "injalid_dejice"

data/lib/injalid_dejice/decoder.rb CHANGED

@@ -7,49 +7,69 @@ module InjalidDejice
   class Decoder
     OUTPUT_ENCODING = "UTF-8"
+    def initialize(unknown_char_rep: DEF_UNKNOWN_CHAR_REP, strict_mode: false)
+      @unknown_char_rep = unknown_char_rep
+      @strict_mode = strict_mode
+    end
     #
     # Decode a US-ASCII (KOI-7-compatible) string to UTF-8.
     #
     # @param [String] input_string
     #
-    # @return [String] modified_string
-    #
-    # @raise [ArgumentError]
-    #   Found the beginning of the Cyrillic sub-string, but couldn't find the end.
+    # @return [String]
     #
     def call(input_string)
-      modified_string = input_string.dup
-      start_idx = nil
+      @result_string = input_string.dup
+      @start_idx = nil
+      _process_chars(input_string).force_encoding(OUTPUT_ENCODING)
+    end
+    private
+    def _process_chars(input_string)
       input_string.each_char.with_index do |char, idx|
-        # Everything above the 0x7F cannot be in the KOI-7,
-        # so just get rid of it.
-        modified_string[idx] = UNKNOWN_CHAR if char.ord > 127
+        # Everything above the 0x7F cannot be in the KOI-7, so just get rid of it.
+        @result_string[idx] = @unknown_char_rep if char.ord > ASCII_CHARS_LIMIT
         # Found beginning of a Cyr. sub-string.
-        start_idx = idx if char == CYR_CHAR
-        # Found end of a Cyr. sub-string. Decode everything between the
-        # control characters.
-        if char == LATIN_CHAR
-          # Decode characters only if 'start_idx' was set.
-          if start_idx
-            (start_idx + 1..idx).each do |sub_idx|
-              if modified_string[sub_idx].ord >= SHARED_CHARS_LIMIT
-                modified_string[sub_idx] = (LAT_TO_CYR_DICT[modified_string[sub_idx]] || UNKNOWN_CHAR)
-              end
-            end
-          end
-          start_idx = nil
+        @start_idx = idx if char == CYR_CHAR
+        # Found beginning of a Latin string, which should indicate the end of the Cyr. string.
+        _on_cyr_substr_end(idx) if char == LATIN_CHAR && @start_idx
+      end
+      _check_unexpected_end if @strict_mode
+      # Before the return remove control characters with the regex.
+      @result_string.gsub(/[#{LATIN_CHAR}#{CYR_CHAR}]/, "")
+    end
+    def _on_cyr_substr_end(idx)
+      _decode_cyr_substr(idx)
+      @start_idx = nil
+    end
+    #
+    # Decode everything between the control characters.
+    #
+    def _decode_cyr_substr(idx)
+      (@start_idx + 1..idx).each do |sub_idx|
+        if @result_string[sub_idx].ord >= SHARED_CHARS_LIMIT
+          @result_string[sub_idx] = (LAT_TO_CYR_DICT[@result_string[sub_idx]] || @unknown_char_rep)
         end
       end
+    end
+    #
+    # @raise [ArgumentError]
+    #   Found the beginning of the Cyrillic sub-string, but couldn't find the end.
+    #
+    def _check_unexpected_end
       # Found the beginning of the Cyr. sub-string, but couldn't find the end (the LATIN_CHAR marker),
-      # probably that means an unexpected end of the string. Raise an error.
-      raise ArgumentError, "unexpected end of the string" if start_idx
-      # Before the return remove control characters with the regex, and set the encoding.
-      modified_string.gsub(/[#{LATIN_CHAR}#{CYR_CHAR}]/, "").force_encoding(OUTPUT_ENCODING)
+      # which probably means an unexpected end of the string. Raise an error (strict mode only).
+      raise ArgumentError, "unexpected end of the string" if @start_idx
     end
   end
 end
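
The strict-mode check in the new decoder can be sketched in isolation. This is a simplified stand-in, not the gem's code: `unterminated_run_index` and `check_strict!` are hypothetical names, and only the `@start_idx` bookkeeping visible in the diff is mirrored (an SO records a start index, an SI clears it).

```ruby
SO = 0x0E.chr # "Shift Out": start of a Cyrillic run
SI = 0x0F.chr # "Shift In": back to the Latin code page

# Hypothetical helper: return the index of an SO that is never closed by an SI,
# or nil when every Cyrillic run is terminated.
def unterminated_run_index(koi_string)
  start_idx = nil
  koi_string.each_char.with_index do |char, idx|
    start_idx = idx if char == SO
    start_idx = nil if char == SI && start_idx
  end
  start_idx
end

# Hypothetical helper: raise the same error the diff raises in strict mode.
def check_strict!(koi_string)
  raise ArgumentError, "unexpected end of the string" if unterminated_run_index(koi_string)

  koi_string
end

p unterminated_run_index("ab#{SO}cd#{SI}ef") # prints nil
p unterminated_run_index("ab#{SO}cd")        # prints 2
```

As the diff reads, with `strict_mode: false` an unterminated run does not raise, and it also appears to stay undecoded, because decoding is triggered only when the closing SI is seen.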

data/lib/injalid_dejice/encoder.rb CHANGED

@@ -7,6 +7,11 @@ module InjalidDejice
   class Encoder
     OUTPUT_ENCODING = "US-ASCII"
+    def initialize(forced_latin: [], unknown_char_rep: DEF_UNKNOWN_CHAR_REP)
+      @forced_latin = forced_latin
+      @unknown_char_rep = unknown_char_rep
+    end
     #
     # Encode 'input_string' to the KOI-7 encoding.
     #
@@ -17,7 +22,7 @@ module InjalidDejice
     #   An ASCII string, KOI-7-compatible.
     #
     def call(input_string)
-      non_latin_indexes = _search_non_latin_indexes(input_string)
+      non_latin_indexes = LocaleResolver.new(input_string, @forced_latin).call
       result = if non_latin_indexes.empty?
                  # No non-Latin characters were found, so no further action needed.
@@ -38,7 +43,6 @@ module InjalidDejice
     # Replace all non-Latin characters in the 'input_string'.
     #
     # @param [String] input_string
-    # @param [Array<Array<Integer>>] non_latin_indexes
     #
     # @return [String] modified_string
     #
@@ -46,7 +50,7 @@ module InjalidDejice
       encoding_options = {
         invalid: :replace,
         fallback: lambda { |char|
-          CYR_TO_LAT_DICT.fetch(char, UNKNOWN_CHAR)
+          CYR_TO_LAT_DICT.fetch(char, @unknown_char_rep)
         }
       }
@@ -81,51 +85,5 @@ module InjalidDejice
       input_string
     end
-    #
-    # Return start and end indexes for all so-called "non-Latin" sub-strings of the
-    # input string. Numbers and some punctuation characters can be part of both Latin,
-    # and non-Latin sub-strings. The "locale" of these characters is defined by the
-    # preceding character. E.g., in the string "привет-123 hello-456" the "-123 " chunk
-    # (note the trailing space) would be considered as a part of the non-Latin sub-
-    # string, while the "-456" chunk as a part of the Latin sub-string.
-    #
-    # @param [String] input_string
-    #
-    # @return [Array<Array<Integer>>] non_latin_indexes
-    #   Pairs of non-Latin character indexes.
-    #   Example: [[2, 5], [9, 14]], i.e. 'input_string' has two non-Latin sub-strings.
-    #   The first one occupies characters from 2-nd to 5-th, the second one from 9-th
-    #   to 14-th. If there's no such sub-string, the 'non_latin_indexes' is an empty
-    #   array.
-    #
-    def _search_non_latin_indexes(input_string)
-      non_latin_indexes = []
-      start_index = nil
-      input_string.each_char.with_index do |char, idx|
-        if char.ord < SHARED_CHARS_LIMIT
-          # Skip, no action needed, as the character is shared by both
-          # Latin & Cyrillic charsets.
-        elsif char.ord >= SHARED_CHARS_LIMIT && char.ord <= ASCII_CHARS_LIMIT
-          # Latin character.
-          # If the current character is Latin and 'start_index' is set,
-          # it means the Cyrillic substring has ended.
-          if start_index
-            non_latin_indexes << [start_index, idx - 1]
-            start_index = nil
-          end
-        elsif start_index.nil?
-          start_index = idx
-        end
-        # Cyrillic character (or possibly other unsupported non-ASCII).
-      end
-      # Handle the last sub-string.
-      non_latin_indexes << [start_index, input_string.length - 1] if start_index
-      non_latin_indexes
-    end
   end
 end
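
The encoder hunk above passes a `fallback:` lambda to Ruby's `String#encode`. A minimal sketch of that substitution step alone follows. `substitute_chars` and `CYR_TO_LAT_SAMPLE` are hypothetical names, the table is a partial one derived from the README's "Invalid device" example, and the SO/SI control characters the gem also inserts are left out.

```ruby
# Partial, illustrative table (inverse of the README's "Invalid device" example).
CYR_TO_LAT_SAMPLE = {
  "и" => "I", "Н" => "n", "Ж" => "v", "А" => "a", "Л" => "l",
  "И" => "i", "Д" => "d", "Е" => "e", "Ц" => "c"
}.freeze

# Hypothetical helper: transcode to US-ASCII and let the fallback lambda decide
# what each character without a US-ASCII equivalent becomes.
def substitute_chars(utf_string, unknown_char_rep: "?")
  utf_string.encode(
    "US-ASCII",
    fallback: ->(char) { CYR_TO_LAT_SAMPLE.fetch(char, unknown_char_rep) }
  )
end

puts substitute_chars("иНЖАЛИД ДЕЖИЦЕ")             # prints "Invalid device"
puts substitute_chars("Ц€", unknown_char_rep: "*")  # prints "c*"
```

Passing the replacement through the lambda, as the 1.0.0 change does with `@unknown_char_rep`, lets each caller pick the placeholder instead of relying on a module-wide constant.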

data/lib/injalid_dejice/injalid_dejice.rb CHANGED

@@ -75,11 +75,13 @@ module InjalidDejice
   # KOI-7 code pages.
   SHARED_CHARS_LIMIT = 64
+  SHARED_CHARS = " !\"#$%&'()*+,-./0123456789:;<=>?"
   # ASCII highest possible character code.
   ASCII_CHARS_LIMIT = 127
-  # Character to use if no replacement was found in the CYR_TO_LAT_DICT dictionary.
-  UNKNOWN_CHAR = "?"
+  # Character to use if no replacement was found in the dictionary.
+  DEF_UNKNOWN_CHAR_REP = "?"
   # Switches encoding to the Cyrillic KOI-7 code page. Also known as "Shift Out" (SO) character.
   CYR_CHAR = 0x0E.chr
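
A quick way to see how the new `SHARED_CHARS` constant relates to `SHARED_CHARS_LIMIT` is to print its code range. The sketch below copies both values from the diff (the string is written with single quotes here, which is only a notational choice) and assumes nothing else about the gem.

```ruby
SHARED_CHARS_LIMIT = 64
SHARED_CHARS = ' !"#$%&\'()*+,-./0123456789:;<=>?'

codes = SHARED_CHARS.each_char.map(&:ord)

puts format("0x%02X..0x%02X", codes.min, codes.max)  # prints "0x20..0x3F"
puts codes.all? { |code| code < SHARED_CHARS_LIMIT } # prints "true"
```

So the constant lists the printable part of the shared range; the control characters 0x00 - 0x1F are also below the limit but are not listed in the string.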

data/lib/injalid_dejice/locale_resolver.rb ADDED

@@ -0,0 +1,106 @@
+# frozen_string_literal: true
+module InjalidDejice
+  #
+  # Finds start and end indexes of a non-Latin sub-string.
+  #
+  class LocaleResolver
+    #
+    # Class init.
+    #
+    # @param [String] string
+    # @param [Array<String>] forced_latin ([])
+    #
+    def initialize(string, forced_latin = [])
+      @string = string
+      @forced_latin = forced_latin
+    end
+    #
+    # Return start and end indexes for all so-called "non-Latin" sub-strings of the
+    # input string. Numbers and some punctuation characters can be part of both Latin,
+    # and non-Latin sub-strings. The "locale" of these characters is defined by the
+    # preceding character(s). E.g., in the string "привет-123 hello-456" the "-123 "
+    # chunk (note the trailing space) would be considered as a part of the non-Latin
+    # sub-string, while the "-456" chunk as a part of the Latin sub-string.
+    #
+    # Sometimes it is required to forcibly apply the Latin 'locale' to the character.
+    # This is needed to avoid some bugs in the legacy software (e.g. MK90 BASIC). For
+    # this purpose there's the optional argument 'forced_latin'.
+    #
+    # @return [Array<Array<Integer>>] non_latin_indexes
+    #   Pairs of non-Latin character indexes.
+    #   Example: [[2, 5], [9, 14]], i.e. @string has two non-Latin sub-strings.
+    #   The first one occupies characters from 2-nd to 5-th, the second one from 9-th
+    #   to 14-th. If there's no such sub-string, the @non_latin_indexes is an empty
+    #   array.
+    #
+    def call
+      @non_latin_indexes = []
+      @start_index = nil
+      _handle_subtrings
+      _handle_last_substring if @start_index
+      @non_latin_indexes
+    end
+    private
+    def _handle_subtrings
+      @string.each_char.with_index do |char, idx|
+        case _char_kind_of?(char)
+        when :shared
+          _handle_shared_char(char, idx)
+        when :latin
+          _handle_latin_char(idx)
+        else
+          _handle_other_char(idx)
+        end
+      end
+    end
+    def _handle_shared_char(char, idx)
+      # Force a character to be recognized as a Latin one if it was specified
+      # in the '@forced_latin' array.
+      _on_end_of_non_latin_substr(idx) if @forced_latin.include?(char) && @start_index
+    end
+    def _handle_latin_char(idx)
+      # If the current character is Latin and '@start_index' is set,
+      # it means the Cyrillic substring has ended.
+      _on_end_of_non_latin_substr(idx) if @start_index
+    end
+    def _handle_other_char(idx)
+      @start_index = idx if @start_index.nil?
+    end
+    def _handle_last_substring
+      @non_latin_indexes << [@start_index, @string.length - 1]
+    end
+    def _on_end_of_non_latin_substr(idx)
+      @non_latin_indexes << [@start_index, idx - 1]
+      @start_index = nil
+    end
+    #
+    # Return character type based on its code point.
+    #
+    # @param [String] char
+    #
+    # @return [Symbol]
+    #
+    def _char_kind_of?(char)
+      if char.ord < SHARED_CHARS_LIMIT
+        :shared
+      elsif char.ord >= SHARED_CHARS_LIMIT && char.ord <= ASCII_CHARS_LIMIT
+        :latin
+      else
+        :other
+      end
+    end
+  end
+end
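
The index search performed by `LocaleResolver` can be condensed into one function for experimenting. This is a compact re-statement under the same three-way classification (shared, Latin, other), not the gem's class: `non_latin_ranges`, `SHARED_LIMIT` and `ASCII_LIMIT` are hypothetical names chosen for this sketch.

```ruby
SHARED_LIMIT = 64  # codes below this belong to both code pages
ASCII_LIMIT = 127  # highest US-ASCII code

# Hypothetical helper: return [start, end] index pairs of non-Latin sub-strings.
def non_latin_ranges(string, forced_latin = [])
  ranges = []
  start = nil
  string.each_char.with_index do |char, idx|
    code = char.ord
    if code > ASCII_LIMIT
      start ||= idx # a non-Latin run begins (or continues)
    elsif start && (code >= SHARED_LIMIT || forced_latin.include?(char))
      ranges << [start, idx - 1] # a Latin or forced-Latin character ends the run
      start = nil
    end
  end
  ranges << [start, string.length - 1] if start # run that reaches the end
  ranges
end

p non_latin_ranges("привет-123 hello-456")         # prints [[0, 10]]
p non_latin_ranges("привет-123 hello-456", ["-"])  # prints [[0, 5]]
```

The first call matches the doc comment's example: "-123 " stays inside the non-Latin run. With "-" forced to Latin, the run ends right before the hyphen.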

data/lib/injalid_dejice/version.rb CHANGED

@@ -1,5 +1,5 @@
 # frozen_string_literal: true
 module InjalidDejice
-  VERSION = "0.1.0"
+  VERSION = "1.0.0"
 end

data/lib/injalid_dejice.rb CHANGED

@@ -1,8 +1,9 @@
 # frozen_string_literal: true
-require_relative "injalid_dejice/encoder"
 require_relative "injalid_dejice/decoder"
+require_relative "injalid_dejice/encoder"
 require_relative "injalid_dejice/injalid_dejice"
+require_relative "injalid_dejice/locale_resolver"
 require_relative "injalid_dejice/version"
 #
@@ -17,14 +18,42 @@ module InjalidDejice
   # @param [String] utf_string
   #   A UTF-8 string.
   #
+  # @param [Hash{Symbol => Object}] kwargs
+  #   Options.
+  #
+  # @option kwargs [Array<String>] :forced_latin ([])
+  #   Force characters to be recognized as Latin ones.
+  #
+  # @option kwargs [String] :unknown_char_rep (DEF_UNKNOWN_CHAR_REP)
+  #   Replacement character for the unsupported characters.
+  #
   # @return [String]
   #   A US-ASCII string, KOI-7-compatible.
   #
-  def self.utf_to_koi(utf_string)
-    Encoder.new.call(utf_string)
+  def self.utf_to_koi(utf_string, **kwargs)
+    Encoder.new(**kwargs).call(utf_string)
   end
-  def self.koi_to_utf(koi_string)
-    Decoder.new.call(koi_string)
+  #
+  # Decode 'koi_string' from the KOI-7 to the UTF-8.
+  #
+  # @param [String] koi_string
+  #   A KOI-7-compatible string.
+  #
+  # @param [Hash{Symbol => Object}] kwargs
+  #   Options.
+  #
+  # @option kwargs [String] :unknown_char_rep (DEF_UNKNOWN_CHAR_REP)
+  #   Replacement character for the unsupported characters.
+  #
+  # @option kwargs [Boolean] :strict_mode (false)
+  #   When 'true', an opening SO (0x0E) character must have a closing SI (0x0F)
+  #   counterpart, otherwise an ArgumentError is raised. When 'false', no error
+  #   is raised.
+  #
+  # @return [String]
+  #
+  def self.koi_to_utf(koi_string, **kwargs)
+    Decoder.new(**kwargs).call(koi_string)
   end
 end
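
Both public methods now forward `**kwargs` straight to the constructor, so the accepted options are exactly the constructor's keywords. The sketch below shows that pattern with a stand-in class. `DemoDecoder` and `demo_options` are hypothetical, and only the keyword names and defaults are taken from `Decoder#initialize` in the diff.

```ruby
# Stand-in with the same keyword signature as Decoder#initialize in the diff.
class DemoDecoder
  def initialize(unknown_char_rep: "?", strict_mode: false)
    @unknown_char_rep = unknown_char_rep
    @strict_mode = strict_mode
  end

  def options
    { unknown_char_rep: @unknown_char_rep, strict_mode: @strict_mode }
  end
end

# Same forwarding shape as InjalidDejice.koi_to_utf: every keyword goes to .new.
def demo_options(**kwargs)
  DemoDecoder.new(**kwargs).options
end

p demo_options                     # defaults apply
p demo_options(strict_mode: true)  # one option overridden

begin
  demo_options(strict: true)       # misspelled keyword
rescue ArgumentError => e
  puts "rejected: #{e.message}"
end
```

One consequence of this design is that a misspelled option fails loudly with an ArgumentError rather than being silently ignored.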

metadata CHANGED

@@ -1,14 +1,14 @@
 --- !ruby/object:Gem::Specification
 name: injalid_dejice
 version: !ruby/object:Gem::Version
-  version: 0.1.0
+  version: 1.0.0
 platform: ruby
 authors:
 - 8bit-m8
 autorequire:
 bindir: exe
 cert_chain: []
-date: 2024-01-10 00:00:00.000000000 Z
+date: 2024-01-13 00:00:00.000000000 Z
 dependencies: []
 description:
 email:
@@ -28,6 +28,7 @@ files:
 - lib/injalid_dejice/decoder.rb
 - lib/injalid_dejice/encoder.rb
 - lib/injalid_dejice/injalid_dejice.rb
+- lib/injalid_dejice/locale_resolver.rb
 - lib/injalid_dejice/version.rb
 - sig/injalid_dejice.rbs
 homepage: https://github.com/8bit-mate/injalid_dejice.rb