injalid_dejice 0.1.0 → 1.0.0

checksums.yaml CHANGED
@@ -1,7 +1,7 @@
  ---
  SHA256:
-   metadata.gz: 810e98c784de94eb504872bdb8bc8dd7eadcb948c40b40b77ddc8050b787da31
-   data.tar.gz: fe8a847cfdcf4c8e631e1ce1759fe4d68ebf9e166d79873d223e9ebf330e060b
+   metadata.gz: dd5574caa3104732d22132bccd8709065f75b34fca9424b4babf8838d4cd8816
+   data.tar.gz: 9f8a57a4d12c8e435e5c955b76674fb0afe3ee82a102add50cf1f54cb0afdd8a
  SHA512:
-   metadata.gz: 9e4ff10a85b68e80f8c099abd8083c7df0ab556ce88eba41e9ff2118ed868a341cfe9cfd2e5e0fd803e9ae801bfe5e6aee28bb0cb8e2100356ec9aa6585ade33
-   data.tar.gz: eb223603c6d2b4788b66dde0011a56dce5618c16ae620ad6aa501a6f69b15f6752e79d20a88bff3ef7f1c8125b9bdd42f11d27ff8016e32540a0a6723142ce74
+   metadata.gz: c673d4083d47c3a21cc2ae4395bb7316f6d0a5e9805cde6371b063620d9bd50356fca7e15d6a2aa46c264716eeb8853c80baf649cab57ec912bfe42c1798b4f1
+   data.tar.gz: 6ca6c76abe1a0c7ed2acdec0c76be176cade95f89a464b0927f5e2d3467a2c32771207ae375892ff7ac666dc4298746fa68e30c48170e9460ba070e6b2a99838
data/CHANGELOG.md CHANGED
@@ -1,3 +1,7 @@
+ ## [1.0.0] - 2024-01-13
+
+ - API change.
+
  ## [0.1.0] - 2024-01-11

  - Initial release
data/README.md CHANGED
@@ -1,12 +1,12 @@
  # injalid_dejice.rb

- Convets UTF-8 strings to the legacy KOI-7 encoding.
+ Converts UTF-8 strings to the legacy KOI-7 encoding.

- **KOI-7** (Russian: КОИ-7) was a 7-bit character encoding developed in the USSR and based on the US-ASCII encoding.
+ **KOI-7** (Russian: КОИ-7) was a 7-bit character encoding developed in the USSR and based on the US-ASCII encoding. US-ASCII has only 128 code points, which are enough to encode the Latin alphabet (with numbers and most common punctuation marks), but not enough to encode both Latin and Cyrillic alphabets with the full set of characters (uppercase and lowercase).

- US-ASCII has only 128 code points, which is enough to encode the Latin alphabet (with numbers and most common punctuation marks), but not enough to encode both Latin and Cyrillic alphabets with the full set of characters (uppercase and lowercase). In order to add the Cyrillic alphabet support, but also remain the US-ASCII compatibility, KOI-7 designers came up with a “hack”, in which both Cyrillic and Latin characters were encoded with the same code points, while the ASCII control characters 0x0E and 0x0F were adapted to switch between the Cyrillic and the Latin code pages respectively.
+ In order to add Cyrillic alphabet support, but also to remain US-ASCII compatible, KOI-7 designers came up with a “hack”, in which both Cyrillic and Latin characters were encoded with the same code points, while the ASCII control characters 0x0E and 0x0F were adapted to switch between the Cyrillic and the Latin code pages, respectively.

- That led to funny bugs when a message was displayed with an incorrect code page. The «**иНЖАЛИД ДЕЖИЦЕ**» message (pronounced as “injalid dejitze”) was the most memorable, and it became somewhat a meme (hence the name of the gem). In fact, it was an incorrectly displayed English message, that simply stated: “Invalid device”.
+ This led to annoying, yet sometimes funny, bugs when English text was printed (on a screen, a printer, or a teletype) while the code page was still set to Cyrillic. The «**иНЖАЛИД ДЕЖИЦЕ**» message (pronounced as “injalid dejitze”) is the most notable example of such a case. The message became something of a meme, and also an eponym for improperly coded text that looks like gibberish. In fact, the phrase was an improperly coded error message that simply states: “Invalid device”. Hence the name of the gem.

  KOI-7 was widely used on the Soviet PDP-11 clones, notably the DVK systems and the Elektronika MK90 portable computer.
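
The code-page mix-up described in the README text above can be reproduced without the gem. The sketch below is not part of injalid_dejice: the method name `misread_as_cyrillic` is hypothetical, and it rests on the assumption that the Cyrillic KOI-7 code page equals the upper half of KOI8-R with the high bit cleared, so Ruby's built-in KOI8-R support can stand in for a device that is still in Cyrillic mode.

```ruby
# Hypothetical helper, not part of the gem: shows how ASCII text would be
# rendered by a device that is still switched to the Cyrillic KOI-7 code page.
# Assumption: Cyrillic KOI-7 equals the upper half of KOI8-R with bit 7 cleared.
def misread_as_cyrillic(text)
  bytes = text.bytes.map do |byte|
    # Codes below 0x40 are shared by both code pages; 0x7F is DEL.
    byte.between?(0x40, 0x7E) ? byte | 0x80 : byte
  end

  bytes.pack("C*").force_encoding("KOI8-R").encode("UTF-8")
end

puts misread_as_cyrillic("Invalid device") # prints "иНЖАЛИД ДЕЖИЦЕ"
```

Digits, spaces and punctuation pass through unchanged, which is why only the letters turn into gibberish.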
 
@@ -28,6 +28,56 @@ Or install it yourself as:

  ## Usage

+ Ruby doesn't support KOI-7 natively (see `Encoding.name_list`), but it does support 7-bit US-ASCII, a KOI-7-compatible encoding (since KOI-7 was based on US-ASCII). So the encoder returns its result as a US-ASCII string, and the decoder expects the input string to be encoded in US-ASCII or in any compatible encoding (e.g. UTF-8 with all characters in the range 0x00 - 0x7F).
+
+ ### Encoding a string from UTF-8 to KOI-7
+
+ InjalidDejice.utf_to_koi(string [, keyword arguments])
+
+ Arguments:
+
+ - string
+
+ Keyword arguments:
+
+ - :forced_latin ([])
+
+ Characters 0x00 - 0x3F are shared by both the Cyrillic and the Latin code pages (the KOI-7 N0 and KOI-7 N1 code pages). The locale of such a character is defined by the preceding character(s). Sometimes, however, such a character has to be forced into the Latin locale regardless of the preceding character(s).
+
+ The :forced_latin option does exactly that. It lists the characters that are always treated as Latin: if the Cyrillic locale was set with SO (0x0E), the encoder inserts an SI (0x0F) code before any character listed in the forced_latin array.
+
+ - :unknown_char_rep ("?")
+
+ A replacement character for the unsupported characters.
+
+ Returns:
+
+ - String
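
To make the effect of `:forced_latin` concrete, here is a minimal, self-contained sketch that only marks where the SO and SI control characters would go; it does not transliterate anything and is not the gem's encoder. The names `mark_locales_sketch`, `SKETCH_SO` and `SKETCH_SI` are hypothetical, and closing a trailing Cyrillic run with SI is an assumption of the sketch.

```ruby
SKETCH_SO = "\x0E" # "Shift Out": switch to the Cyrillic code page
SKETCH_SI = "\x0F" # "Shift In": switch back to the Latin code page

# Hypothetical illustration: wrap non-Latin runs in SO ... SI markers.
def mark_locales_sketch(string, forced_latin: [])
  output = +""
  cyrillic = false

  string.each_char do |char|
    code = char.ord

    if code > 127
      # Non-ASCII character: open a Cyrillic run if none is open yet.
      output << SKETCH_SO unless cyrillic
      cyrillic = true
    elsif code >= 64 || forced_latin.include?(char)
      # Latin letter, or a shared character forced to Latin: close the run.
      output << SKETCH_SI if cyrillic
      cyrillic = false
    end
    # Any other shared character keeps the locale of what precedes it.

    output << char
  end

  # Assumption of this sketch: a trailing Cyrillic run is closed explicitly.
  output << SKETCH_SI if cyrillic
  output
end

p mark_locales_sketch("привет-123 hello-456")
p mark_locales_sketch("привет-123 hello-456", forced_latin: ["-"])
```

In the first call the SI marker lands right before `hello`, so `-123 ` stays in the Cyrillic run; in the second call it lands before the first `-`.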
+
+ ### Decoding a string from KOI-7 to UTF-8
+
+ InjalidDejice.koi_to_utf(string [, keyword arguments])
+
+ Arguments:
+
+ - string
+
+ Keyword arguments:
+
+ - :strict_mode (false)
+
+ When 'true', a Cyrillic locale opened with an SO (0x0E) character must be switched back to the Latin locale with an SI (0x0F) character. If it isn't, an ArgumentError is raised.
+
+ - :unknown_char_rep ("?")
+
+ A replacement character for the unsupported characters.
+
+ Returns:
+
+ - String
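
The strict-mode rule can be illustrated with a small stand-alone decoder. This is a sketch under stated assumptions, not the gem's implementation: `decode_koi7_sketch` is a hypothetical name, the Cyrillic table is borrowed from Ruby's built-in KOI8-R encoding (assuming Cyrillic KOI-7 is KOI8-R's upper half with the high bit cleared), and it decodes eagerly, whereas the gem's Decoder only decodes a run once it meets the closing SI.

```ruby
# Hypothetical stand-alone decoder used only to illustrate strict mode.
def decode_koi7_sketch(koi_string, strict_mode: false, unknown_char_rep: "?")
  cyrillic = false
  output = +""

  koi_string.each_char do |char|
    code = char.ord

    if code == 0x0E       # SO: switch to the Cyrillic code page
      cyrillic = true
    elsif code == 0x0F    # SI: switch back to the Latin code page
      cyrillic = false
    elsif code > 0x7F     # cannot occur in a 7-bit encoding
      output << unknown_char_rep
    elsif cyrillic && code.between?(0x40, 0x7E)
      # Assumption: set the high bit and read the byte as KOI8-R.
      output << (code | 0x80).chr("KOI8-R").encode("UTF-8")
    else
      output << char
    end
  end

  # Strict mode: an SO without a closing SI is an error.
  raise ArgumentError, "unexpected end of the string" if strict_mode && cyrillic

  output
end

puts decode_koi7_sketch("\x0EPRIWET\x0F world") # prints "привет world"
```

Calling `decode_koi7_sketch("\x0EPRIWET", strict_mode: true)` raises an ArgumentError, because the Cyrillic run is never closed.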
+
+ ### Examples
+
  ```ruby
  require "injalid_dejice"
 
data/lib/injalid_dejice/decoder.rb CHANGED
@@ -7,49 +7,69 @@ module InjalidDejice
    class Decoder
      OUTPUT_ENCODING = "UTF-8"

+     def initialize(unknown_char_rep: DEF_UNKNOWN_CHAR_REP, strict_mode: false)
+       @unknown_char_rep = unknown_char_rep
+       @strict_mode = strict_mode
+     end
+
      #
      # Decode a US-ASCII (KOI-7-compatible) string to UTF-8.
      #
      # @param [String] input_string
      #
-     # @return [String] modified_string
-     #
-     # @raise [ArgumentError]
-     #   Found the beginning of the Cyrillic sub-string, but couldn't find the end.
+     # @return [String]
      #
      def call(input_string)
-       modified_string = input_string.dup
-       start_idx = nil
+       @result_string = input_string.dup
+       @start_idx = nil
+
+       _process_chars(input_string).force_encoding(OUTPUT_ENCODING)
+     end
+
+     private

+     def _process_chars(input_string)
        input_string.each_char.with_index do |char, idx|
-         # Everything above the 0x7F cannot be in the KOI-7,
-         # so just get rid of it.
-         modified_string[idx] = UNKNOWN_CHAR if char.ord > 127
+         # Everything above the 0x7F cannot be in the KOI-7, so just get rid of it.
+         @result_string[idx] = @unknown_char_rep if char.ord > ASCII_CHARS_LIMIT

          # Found beginning of a Cyr. sub-string.
-         start_idx = idx if char == CYR_CHAR
-
-         # Found end of a Cyr. sub-string. Decode everything between the
-         # control characters.
-         if char == LATIN_CHAR
-           # Decode characters only if 'start_idx' was set.
-           if start_idx
-             (start_idx + 1..idx).each do |sub_idx|
-               if modified_string[sub_idx].ord >= SHARED_CHARS_LIMIT
-                 modified_string[sub_idx] = (LAT_TO_CYR_DICT[modified_string[sub_idx]] || UNKNOWN_CHAR)
-               end
-             end
-           end
-           start_idx = nil
+         @start_idx = idx if char == CYR_CHAR
+
+         # Found beginning of a Latin string, which should indicate the end of the Cyr. string.
+         _on_cyr_substr_end(idx) if char == LATIN_CHAR && @start_idx
+       end
+
+       _check_unexpected_end if @strict_mode
+
+       # Before the return remove control characters with the regex.
+       @result_string.gsub(/[#{LATIN_CHAR}#{CYR_CHAR}]/, "")
+     end
+
+     def _on_cyr_substr_end(idx)
+       _decode_cyr_substr(idx)
+       @start_idx = nil
+     end
+
+     #
+     # Decode everything between the control characters.
+     #
+     def _decode_cyr_substr(idx)
+       (@start_idx + 1..idx).each do |sub_idx|
+         if @result_string[sub_idx].ord >= SHARED_CHARS_LIMIT
+           @result_string[sub_idx] = (LAT_TO_CYR_DICT[@result_string[sub_idx]] || DEF_UNKNOWN_CHAR_REP)
          end
        end
+     end

+     #
+     # @raise [ArgumentError]
+     #   Found the beginning of the Cyrillic sub-string, but couldn't find the end.
+     #
+     def _check_unexpected_end
        # Found the beginning of the Cyr. sub-string, but couldn't find the end (the LATIN_CHAR marker),
-       # probably that means an unexpected end of the string. Raise an error.
-       raise ArgumentError, "unexpected end of the string" if start_idx
-
-       # Before the return remove control characters with the regex, and set the encoding.
-       modified_string.gsub(/[#{LATIN_CHAR}#{CYR_CHAR}]/, "").force_encoding(OUTPUT_ENCODING)
+       # probably that means an unexpected end of the string. Raise an error (only in the strict mode).
+       raise ArgumentError, "unexpected end of the string" if @start_idx
      end
    end
  end
data/lib/injalid_dejice/encoder.rb CHANGED
@@ -7,6 +7,11 @@ module InjalidDejice
    class Encoder
      OUTPUT_ENCODING = "US-ASCII"

+     def initialize(forced_latin: [], unknown_char_rep: DEF_UNKNOWN_CHAR_REP)
+       @forced_latin = forced_latin
+       @unknown_char_rep = unknown_char_rep
+     end
+
      #
      # Encode 'input_string' to the KOI-7 encoding.
      #
@@ -17,7 +22,7 @@ module InjalidDejice
      #   An ASCII string, KOI-7-compatible.
      #
      def call(input_string)
-       non_latin_indexes = _search_non_latin_indexes(input_string)
+       non_latin_indexes = LocaleResolver.new(input_string, @forced_latin).call

        result = if non_latin_indexes.empty?
          # No non-Latin characters were found, so no further action needed.
@@ -38,7 +43,6 @@ module InjalidDejice
      # Replace all non-Latin characters in the 'input_string'.
      #
      # @param [String] input_string
-     # @param [Array<Array<Integer>>] non_latin_indexes
      #
      # @return [String] modified_string
      #
@@ -46,7 +50,7 @@ module InjalidDejice
        encoding_options = {
          invalid: :replace,
          fallback: lambda { |char|
-           CYR_TO_LAT_DICT.fetch(char, UNKNOWN_CHAR)
+           CYR_TO_LAT_DICT.fetch(char, @unknown_char_rep)
          }
        }

@@ -81,51 +85,5 @@ module InjalidDejice

        input_string
      end
-
-     #
-     # Return start and end indexes for all so-called "non-Latin" sub-strings of the
-     # input string. Numbers and some punctuation characters can be part of both Latin,
-     # and non-Latin sub-strings. The "locale" of these characters is defined by the
-     # preceding character. E.g., in the string "привет-123 hello-456" the "-123 " chunk
-     # (note the trailing space) would be considered as a part of the non-Latin sub-
-     # string, while the "-456" chunk as a part of the Latin sub-string.
-     #
-     # @param [String] input_string
-     #
-     # @return [Array<Array<Integer>>] non_latin_indexes
-     #   Pairs of non-Latin character indexes.
-     #   Example: [[2, 5], [9, 14]], i.e. 'input_string' has two non-Latin sub-strings.
-     #   The first one occupies characters from 2-nd to 5-th, the second one from 9-th
-     #   to 14-th. If there's no such sub-string, the 'non_latin_indexes' is an empty
-     #   array.
-     #
-     def _search_non_latin_indexes(input_string)
-       non_latin_indexes = []
-       start_index = nil
-
-       input_string.each_char.with_index do |char, idx|
-         if char.ord < SHARED_CHARS_LIMIT
-           # Skip, no action needed, as the character is shared by both
-           # Latin & Cyrillic charsets.
-         elsif char.ord >= SHARED_CHARS_LIMIT && char.ord <= ASCII_CHARS_LIMIT
-           # Latin character.
-
-           # If the current character is Latin and 'start_index' is set,
-           # it means the Cyrillic substring has ended.
-           if start_index
-             non_latin_indexes << [start_index, idx - 1]
-             start_index = nil
-           end
-         elsif start_index.nil?
-           start_index = idx
-         end
-         # Cyrillic character (or possibly other unsupported non-ASCII).
-       end
-
-       # Handle the last sub-string.
-       non_latin_indexes << [start_index, input_string.length - 1] if start_index
-
-       non_latin_indexes
-     end
    end
  end
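
The `fallback:` option used in the encoder hunk above is a standard `String#encode` feature: Ruby calls the given lambda for every character that has no representation in the target encoding and inserts its return value instead. A minimal sketch of the same pattern follows; `SKETCH_CYR_TO_LAT` and `transliterate_sketch` are hypothetical names, and the tiny table is an illustrative subset with assumed values, not the gem's `CYR_TO_LAT_DICT`.

```ruby
# Illustrative subset only; the values are assumed from the KOI-7 layout.
SKETCH_CYR_TO_LAT = {
  "п" => "P", "р" => "R", "и" => "I", "в" => "W", "е" => "E", "т" => "T"
}.freeze

# Hypothetical helper: transcode to US-ASCII, replacing every character that
# US-ASCII cannot represent via the lookup table (or 'unknown_char_rep').
def transliterate_sketch(string, unknown_char_rep: "?")
  string.encode(
    "US-ASCII",
    invalid: :replace,
    fallback: ->(char) { SKETCH_CYR_TO_LAT.fetch(char, unknown_char_rep) }
  )
end

puts transliterate_sketch("привет!")                      # prints "PRIWET!"
puts transliterate_sketch("ё", unknown_char_rep: "_")     # prints "_"
```

Passing the replacement character as an argument, rather than reading a constant, mirrors the move from `UNKNOWN_CHAR` to `@unknown_char_rep` in this release.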
data/lib/injalid_dejice/injalid_dejice.rb CHANGED
@@ -75,11 +75,13 @@ module InjalidDejice
    # KOI-7 code pages.
    SHARED_CHARS_LIMIT = 64

+   SHARED_CHARS = " !\"#$%&'()*+,-./0123456789:;<=>?"
+
    # ASCII highest possible character code.
    ASCII_CHARS_LIMIT = 127

-   # Character to use if no replacement was found in the CYR_TO_LAT_DICT dictionary.
-   UNKNOWN_CHAR = "?"
+   # Character to use if no replacement was found in the dictionary.
+   DEF_UNKNOWN_CHAR_REP = "?"

    # Switches encoding to the Cyrillic KOI-7 code page. Also known as "Shift Out" (SO) character.
    CYR_CHAR = 0x0E.chr
data/lib/injalid_dejice/locale_resolver.rb ADDED
@@ -0,0 +1,106 @@
+ # frozen_string_literal: true
+
+ module InjalidDejice
+   #
+   # Finds start and end indexes of a non-Latin sub-string.
+   #
+   class LocaleResolver
+     #
+     # Class init.
+     #
+     # @param [String] string
+     # @param [Array<String>] forced_latin ([])
+     #
+     def initialize(string, forced_latin = [])
+       @string = string
+       @forced_latin = forced_latin
+     end
+
+     #
+     # Return start and end indexes for all so-called "non-Latin" sub-strings of the
+     # input string. Numbers and some punctuation characters can be part of both Latin,
+     # and non-Latin sub-strings. The "locale" of these characters is defined by the
+     # preceding character(s). E.g., in the string "привет-123 hello-456" the "-123 "
+     # chunk (note the trailing space) would be considered as a part of the non-Latin
+     # sub-string, while the "-456" chunk as a part of the Latin sub-string.
+     #
+     # Sometimes it is required to forcibly apply the Latin 'locale' to the character.
+     # This is needed to avoid some bugs in the legacy software (e.g. MK90 BASIC). For
+     # this purpose there's the optional argument 'forced_latin'.
+     #
+     # @return [Array<Array<Integer>>] non_latin_indexes
+     #   Pairs of non-Latin character indexes.
+     #   Example: [[2, 5], [9, 14]], i.e. @string has two non-Latin sub-strings.
+     #   The first one occupies characters from 2-nd to 5-th, the second one from 9-th
+     #   to 14-th. If there's no such sub-string, the @non_latin_indexes is an empty
+     #   array.
+     #
+     def call
+       @non_latin_indexes = []
+       @start_index = nil
+
+       _handle_subtrings
+
+       _handle_last_substring if @start_index
+
+       @non_latin_indexes
+     end
+
+     private
+
+     def _handle_subtrings
+       @string.each_char.with_index do |char, idx|
+         case _char_kind_of?(char)
+         when :shared
+           _handle_shared_char(char, idx)
+         when :latin
+           _handle_latin_char(idx)
+         else
+           _handle_other_char(idx)
+         end
+       end
+     end
+
+     def _handle_shared_char(char, idx)
+       # Force a character to be recognized as a Latin one if it was specified
+       # in the '@forced_latin' array.
+       _on_end_of_non_latin_substr(idx) if @forced_latin.include?(char) && @start_index
+     end
+
+     def _handle_latin_char(idx)
+       # If the current character is Latin and '@start_index' is set,
+       # it means the Cyrillic substring has ended.
+       _on_end_of_non_latin_substr(idx) if @start_index
+     end
+
+     def _handle_other_char(idx)
+       @start_index = idx if @start_index.nil?
+     end
+
+     def _handle_last_substring
+       @non_latin_indexes << [@start_index, @string.length - 1]
+     end
+
+     def _on_end_of_non_latin_substr(idx)
+       @non_latin_indexes << [@start_index, idx - 1]
+       @start_index = nil
+     end
+
+     #
+     # Return character type based on its UTF-8 code.
+     #
+     # @param [String] char
+     #
+     # @return [Symbol]
+     #
+     def _char_kind_of?(char)
+       if char.ord < SHARED_CHARS_LIMIT
+         :shared
+       elsif char.ord >= SHARED_CHARS_LIMIT && char.ord <= ASCII_CHARS_LIMIT
+         :latin
+       else
+         :other
+       end
+     end
+   end
+ end
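
The index pairs returned by the resolver are easiest to read on a concrete input. The method below is a condensed, single-method sketch of the same rules, written only to work through the example from the doc comment; `non_latin_ranges_sketch` is a hypothetical name, and the limits 64 and 127 are copied from `SHARED_CHARS_LIMIT` and `ASCII_CHARS_LIMIT`.

```ruby
# Hypothetical condensed version of the resolver, for illustration only.
def non_latin_ranges_sketch(string, forced_latin = [])
  ranges = []
  start_index = nil

  string.each_char.with_index do |char, idx|
    code = char.ord
    shared = code < 64
    latin = code.between?(64, 127) || (shared && forced_latin.include?(char))

    if latin
      # A Latin (or forced-Latin) character closes an open non-Latin run.
      if start_index
        ranges << [start_index, idx - 1]
        start_index = nil
      end
    elsif !shared
      # Non-ASCII character: open a run unless one is already open.
      start_index ||= idx
    end
    # Remaining shared characters inherit the locale of what precedes them.
  end

  # Close a run that reaches the end of the string.
  ranges << [start_index, string.length - 1] if start_index
  ranges
end

p non_latin_ranges_sketch("привет-123 hello-456")        # prints [[0, 10]]
p non_latin_ranges_sketch("привет-123 hello-456", ["-"]) # prints [[0, 5]]
```

With `"-"` forced to Latin, the run ends right after the last Cyrillic letter instead of swallowing the `-123 ` chunk.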
data/lib/injalid_dejice/version.rb CHANGED
@@ -1,5 +1,5 @@
  # frozen_string_literal: true

  module InjalidDejice
-   VERSION = "0.1.0"
+   VERSION = "1.0.0"
  end
data/lib/injalid_dejice.rb CHANGED
@@ -1,8 +1,9 @@
  # frozen_string_literal: true

- require_relative "injalid_dejice/encoder"
  require_relative "injalid_dejice/decoder"
+ require_relative "injalid_dejice/encoder"
  require_relative "injalid_dejice/injalid_dejice"
+ require_relative "injalid_dejice/locale_resolver"
  require_relative "injalid_dejice/version"

  #
@@ -17,14 +18,42 @@ module InjalidDejice
    # @param [String] utf_string
    #   A UTF-8 string.
    #
+   # @param [Hash{Symbol => Object}] kwargs
+   #   Options.
+   #
+   # @option kwargs [Array<String>] :forced_latin ([])
+   #   Force characters to be recognized as Latin ones.
+   #
+   # @option kwargs [String] :unknown_char_rep (DEF_UNKNOWN_CHAR_REP)
+   #   Replacement character for the unsupported characters.
+   #
    # @return [String]
    #   A US-ASCII string, KOI-7-compatible.
    #
-   def self.utf_to_koi(utf_string)
-     Encoder.new.call(utf_string)
+   def self.utf_to_koi(utf_string, **kwargs)
+     Encoder.new(**kwargs).call(utf_string)
    end

-   def self.koi_to_utf(koi_string)
-     Decoder.new.call(koi_string)
+   #
+   # Decode 'koi_string' from the KOI-7 to the UTF-8.
+   #
+   # @param [String] koi_string
+   #   A KOI-7-compatible string.
+   #
+   # @param [Hash{Symbol => Object}] kwargs
+   #   Options.
+   #
+   # @option kwargs [String] :unknown_char_rep (DEF_UNKNOWN_CHAR_REP)
+   #   Replacement character for the unsupported characters.
+   #
+   # @option kwargs [Boolean] :strict_mode (false)
+   #   When 'true', an opening SO (0x0E) character must have a closing SI (0x0F)
+   #   counterpart, otherwise an ArgumentError is raised. When 'false', no error
+   #   is raised.
+   #
+   # @return [String]
+   #
+   def self.koi_to_utf(koi_string, **kwargs)
+     Decoder.new(**kwargs).call(koi_string)
    end
  end
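
One consequence of forwarding `**kwargs` straight into an initializer that declares explicit keywords is that a misspelled option is rejected by Ruby itself rather than silently ignored. The sketch below shows the pattern with stand-in classes; `SketchDecoder` and `SketchApi` are hypothetical and only mimic the signatures visible in this diff.

```ruby
# Stand-in for a class whose initializer declares explicit keyword arguments.
class SketchDecoder
  attr_reader :options

  def initialize(unknown_char_rep: "?", strict_mode: false)
    @options = { unknown_char_rep: unknown_char_rep, strict_mode: strict_mode }
  end
end

# Stand-in for a module-level entry point that forwards its keyword arguments.
module SketchApi
  def self.decoder_options(**kwargs)
    SketchDecoder.new(**kwargs).options
  end
end

p SketchApi.decoder_options(strict_mode: true)

begin
  SketchApi.decoder_options(strict: true) # misspelled keyword
rescue ArgumentError => e
  puts "rejected: #{e.message}"
end
```

Keeping the keyword list in one place (the initializer) means the public method does not need to repeat the defaults.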
metadata CHANGED
@@ -1,14 +1,14 @@
  --- !ruby/object:Gem::Specification
  name: injalid_dejice
  version: !ruby/object:Gem::Version
-   version: 0.1.0
+   version: 1.0.0
  platform: ruby
  authors:
  - 8bit-m8
  autorequire:
  bindir: exe
  cert_chain: []
- date: 2024-01-10 00:00:00.000000000 Z
+ date: 2024-01-13 00:00:00.000000000 Z
  dependencies: []
  description:
  email:
@@ -28,6 +28,7 @@ files:
  - lib/injalid_dejice/decoder.rb
  - lib/injalid_dejice/encoder.rb
  - lib/injalid_dejice/injalid_dejice.rb
+ - lib/injalid_dejice/locale_resolver.rb
  - lib/injalid_dejice/version.rb
  - sig/injalid_dejice.rbs
  homepage: https://github.com/8bit-mate/injalid_dejice.rb