unibits 1.0.0 → 1.1.0

checksums.yaml CHANGED
@@ -1,7 +1,7 @@
  ---
  SHA1:
- metadata.gz: 0405b9099735bcf25bc05f82e91d9c5629071ee1
- data.tar.gz: 85d9bde5ddc1e90ac70cf288e7753ddc8511fc87
+ metadata.gz: 3a765c7b1a6edfd62f8fed06b683d0d4ff2cacae
+ data.tar.gz: c66cb699f9a4425fc169fd6562dd4b758b94c584
  SHA512:
- metadata.gz: a8d944dfcf50869cf1a64a7237e8145ee8b8fcb57a801db9a475684ac0e7bd4865be2b2d372f943ddeebb6d86f632a3eb9655d147cc6e3b79bc6e0c6dc09fa66
- data.tar.gz: be6685c1bfc9b9ecf073e9899be4ded5d590ccdd90b527dfdfdfeb1ac654016cac4116033905183f988e249cabedcc767cc8fa3a4d04b3681650ad7852035330
+ metadata.gz: a516728fe08185386fc1902fa341784517521d0a85e494e35a7e2700c1c6d7fc3dfc8f86bb20d4268de6d69df591ab431aa3ab5ee02d96f6798fcd97f2426a8e
+ data.tar.gz: cb2fbfea68be359661ad75cc6e203bd4235ccd81431a37f74f965e90f6116e1a12d8cb04094a213f537e2a032c0c25775edd08378cd7c8ae613503dbe5cc3e0e
data/CHANGELOG.md CHANGED
@@ -1,5 +1,13 @@
  ## CHANGELOG
 
+ ### 1.1.0
+
+ * Support (and highlight) invalid encodings \o/
+ * Improve character symbolification
+ * Fix that the Kernel method would not take keyword arguments
+ * New option for setting a custom output width to use
+ * New option for activating wide ambiguous characters
+
  ### 1.0.0
 
  * Initial release
data/README.md CHANGED
@@ -1,18 +1,24 @@
- # unibits [![[version]](https://badge.fury.io/rb/unibits.svg)](http://badge.fury.io/rb/unibits) [![[travis]](https://travis-ci.org/janlelis/unibits.svg)](https://travis-ci.org/janlelis/unibits)
+ # unibits | Reveal the Unicode [![[version]](https://badge.fury.io/rb/unibits.svg)](http://badge.fury.io/rb/unibits) [![[travis]](https://travis-ci.org/janlelis/unibits.svg)](https://travis-ci.org/janlelis/unibits)
 
- Ruby library and CLI command that visualizes various Unicode and ASCII encodings in the terminal.
+ Ruby library and CLI command that visualizes various Unicode and ASCII encodings in the terminal:
 
- Helps you debugging and learning about Unicode encodings.
-
- Supported encodings: **UTF-8**, **UTF-16LE**, **UTF-16BE**, **UTF-32LE**, **UTF-32BE**, **BINARY**, **ASCII**
+ - Makes analyzing encodings easier
+ - Helps you with debugging strings
+ - Supports **UTF-8**, **UTF-16LE**/**UTF-16BE**, **UTF-32LE**/**UTF-32BE**, arbitrary **BINARY** data, and **ASCII**
+ - Highlights invalid encodings
 
  ## Setup
 
+ Make sure you have Ruby installed and installing gems works properly. Then do:
+
  ```
  $ gem install unibits
  ```
 
  ## Usage
+
+ Pass the string to debug to unibits:
+
  ### From CLI
 
  ```
@@ -26,17 +32,17 @@ require 'unibits/kernel_method'
  unibits "🌫 Idiosyncrätic ℜսᖯʏ"
  ```
 
- ### Options
+ ### Advanced Options
 
- `unibits` takes three optional options:
+ `unibits` accepts several options:
 
  - *encoding (e)*: The encoding of the given string (uses your default encoding if none given)
  - *convert (c)*: An encoding the string should be converted to before visualizing it
  - *stats*: Whether to show a short stats header (default: `true`), you can deactivate on the CLI with `--no-stats`
+ - *wide-ambiguous*: Treat characters of ambiguous width as 2 spaces instead of 1 ([more info](https://github.com/janlelis/unicode-display_width))
+ - *width (w)*: Set a custom column width; if not set, *unibits* will retrieve it from the terminal or fall back to 80
 
- **Please note**: This uses Ruby's built-in encoding support. Currently, only strings with valid encodings are supported.
-
- ## Output for Different Encodings
+ ## Output of Different Valid Encodings
  ### UTF-8
 
  CLI: `$ unibits -e utf-8 -c utf-8 "🌫 Idiosyncrätic ℜսᖯʏ"`
@@ -47,33 +53,33 @@ Ruby: `unibits "🌫 Idiosyncrätic ℜսᖯʏ", encoding: 'utf-8', convert:
 
  ### UTF-16LE
 
- CLI: `$ unibits -e utf-8 -c utf-8 "🌫 Idiosyncrätic ℜսᖯʏ"`
+ CLI: `$ unibits -e utf-8 -c utf-16le "🌫 Idiosyncrätic ℜսᖯʏ"`
 
- Ruby: `unibits "🌫 Idiosyncrätic ℜսᖯʏ", encoding: 'utf-8', convert: 'utf-8'`
+ Ruby: `unibits "🌫 Idiosyncrätic ℜսᖯʏ", encoding: 'utf-8', convert: 'utf-16le'`
 
  ![Screenshot UTF-16LE](/screenshots/utf-16le.png?raw=true "UTF-16LE")
 
  ### UTF-16BE
 
- CLI: `$ unibits -e utf-8 -c utf-8 "🌫 Idiosyncrätic ℜսᖯʏ"`
+ CLI: `$ unibits -e utf-8 -c utf-16be "🌫 Idiosyncrätic ℜսᖯʏ"`
 
- Ruby: `unibits "🌫 Idiosyncrätic ℜսᖯʏ", encoding: 'utf-8', convert: 'utf-8'`
+ Ruby: `unibits "🌫 Idiosyncrätic ℜսᖯʏ", encoding: 'utf-8', convert: 'utf-16be'`
 
  ![Screenshot UTF-16BE](/screenshots/utf-16be.png?raw=true "UTF-16BE")
 
  ### UTF-32LE
 
- CLI: `$ unibits -e utf-8 -c utf-8 "🌫 Idiosyncrätic ℜսᖯʏ"`
+ CLI: `$ unibits -e utf-8 -c utf-32le "🌫 Idiosyncrätic ℜսᖯʏ"`
 
- Ruby: `unibits "🌫 Idiosyncrätic ℜսᖯʏ", encoding: 'utf-8', convert: 'utf-8'`
+ Ruby: `unibits "🌫 Idiosyncrätic ℜսᖯʏ", encoding: 'utf-8', convert: 'utf-32le'`
 
  ![Screenshot UTF-32LE](/screenshots/utf-32le.png?raw=true "UTF-32LE")
 
  ### UTF-32BE
 
- CLI: `$ unibits -e utf-8 -c utf-8 "🌫 Idiosyncrätic ℜսᖯʏ"`
+ CLI: `$ unibits -e utf-8 -c utf-32be "🌫 Idiosyncrätic ℜսᖯʏ"`
 
- Ruby: `unibits "🌫 Idiosyncrätic ℜսᖯʏ", encoding: 'utf-8', convert: 'utf-8'`
+ Ruby: `unibits "🌫 Idiosyncrätic ℜսᖯʏ", encoding: 'utf-8', convert: 'utf-32be'`
 
  ![Screenshot UTF-32BE](/screenshots/utf-32be.png?raw=true "UTF-32BE")
 
@@ -93,6 +99,23 @@ Ruby: `unibits "ASCII String", encoding: 'utf-8', convert: 'ascii'`
 
  ![Screenshot ASCII](/screenshots/ascii.png?raw=true "ASCII")
 
+ ## Invalid Encodings
+ ### UTF-8
+
+ Example in Ruby: `unibits "unexpected \x80 | not enough \xF0\x9F\x8C | overlong \xE0\x81\x81 | surrogate \xED\xA0\x80 | too large \xF5\x8F\xBF\xBF"`
+
+ ![Screenshot invalid UTF-8](/screenshots/utf-8.invalid.png?raw=true "Invalid UTF-8")
+
+ ### ASCII
+
+ Example in Ruby: `unibits "🌫 Idiosyncrätic ℜսᖯʏ", encoding: 'ascii'`
+
+ ![Screenshot invalid ASCII](/screenshots/ascii.invalid.png?raw=true "Invalid ASCII")
+
+ ### BINARY
+
+ (not possible to produce invalid binary strings)
+
  ## Notes
 
  Also see
@@ -101,6 +124,7 @@ Also see
  - [UTF-16 (Wikipedia)](https://en.wikipedia.org/wiki/UTF-16#Description)
  - [UTF-32 (Wikipedia)](https://en.wikipedia.org/wiki/UTF-32)
  - [Ruby's Encoding class](https://ruby-doc.org/core/Encoding.html)
+ - [Difference between BINARY and ASCII](http://idiosyncratic-ruby.com/56-us-ascii-8bit.html)
  - [Unicode Micro Libraries for Ruby](https://github.com/janlelis/unicode-x)
 
  Lots of thanks to @damienklinnert for the motivation and inspiration required to build this! 🎆
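
The new README section on invalid encodings relies on Ruby being able to report, character by character, which bytes do not form a valid sequence. The following is a minimal sketch of that idea in plain Ruby, without unibits itself; `invalid_bytes` is a hypothetical helper name chosen for illustration, and the sample string is a shortened form of the README example.

```ruby
# Sketch only: plain Ruby, unibits is not required here.
sample = "unexpected \x80 | not enough \xF0\x9F\x8C | overlong \xE0\x81\x81"

# Hypothetical helper: hex dump of every character Ruby considers invalid
# in the string's current encoding.
def invalid_bytes(string)
  string.each_char.reject(&:valid_encoding?).map { |char| char.unpack1("H*") }
end

puts sample.valid_encoding?            # false, the string contains broken UTF-8
p invalid_bytes("abc\x80efg")          # the stray continuation byte: ["80"]
p invalid_bytes(sample)                # all offending bytes of the sample
```

unibits 1.1.0 goes further and labels each invalid byte with a reason (such as `unexp.c.` or `overlong`); this sketch only shows where the invalid bytes are.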
data/Rakefile CHANGED
@@ -24,7 +24,7 @@ end
 
  desc "#{gemspec.name} | IRB"
  task :irb do
- sh "irb -I ./lib -r #{gemspec.name.gsub '-','/'}"
+ sh "irb -I ./lib -r #{gemspec.name.gsub '-','/'}/kernel_method"
  end
 
  # # #
data/bin/unibits CHANGED
@@ -7,14 +7,16 @@ argv = Rationalist.parse(
  ARGV,
  string: '_',
  alias: {
- e: 'encoding',
  c: 'convert',
+ e: 'encoding',
  v: 'version',
+ w: 'width',
  },
  boolean: [
  'help',
+ 'stats',
  'version',
- 'stats'
+ 'wide-ambiguous',
  ],
  default: {
  stats: true
@@ -22,21 +24,54 @@ argv = Rationalist.parse(
  )
 
  if argv[:version]
- puts Unibits::VERSION
+ puts "unibits #{Unibits::VERSION} by #{Paint["J-_-L", :bold]} <https://github.com/janlelis/unibits>"
  exit(0)
  end
 
  if argv[:help]
  puts <<-HELP
 
+ #{Paint["DESCRIPTION", :underline]}
+
+ Visualizes ANSI and various Unicode encodings in the terminal.
+
  #{Paint["USAGE", :underline]}
 
- #{Paint["unibits", :bold]} [--encoding <encoding>] [--convert <encoding>] [--no-stats] data
+ #{Paint["unibits", :bold]} [options] data
+
+ --encoding <encoding> | -e | which encoding to use for given data
+ --convert <encoding> | -c | which encoding to convert to (if possible)
+ --width <n> | -w | force a specific number of terminal columns
+ --no-stats | | no stats header with length info
+ --version | | displays version of unibits
+ --wide-ambiguous | | treat characters of ambiguous width as 2 columns wide
+
+ #{Paint["ENCODINGS", :underline]}
+
+ #{Unibits::SUPPORTED_ENCODINGS.join(', ')}
+
+ #{Paint["STATS", :underline]}
+
+ ( bytes / codepoints / glyphs / expected terminal width )
+
+ #{Paint["INVALID BYTES", :underline]}
+
+ UTF-8
+
+ n.e.con. | not enough continuation bytes
+ unexp.c. | unexpected continuation byte
+ toolarge | codepoint value exceeds maximum allowed (U+10FFFF)
+ sur.gate | codepoint value would be a surrogate half
+ overlong | unnecessary null padding
+
+ UTF-16
+
+ incompl. | not enough bytes left to finish codepoint
+ hlf.srg. | other half of surrogate missing
 
- Supported encodings: #{Unibits::SUPPORTED_ENCODINGS.join(', ')}
- Explanation of stats: bytes / codepoints / glyphs / expected terminal width
+ #{Paint["MORE INFO", :underline]}
 
- More info and examples at: https://github.com/janlelis/unibits
+ https://github.com/janlelis/unibits
 
  HELP
  exit(0)
@@ -51,7 +86,14 @@ else
  end
 
  begin
- Unibits.of(data, encoding: argv[:encoding], convert: argv[:convert], stats: argv[:stats])
+ Unibits.of(
+ data,
+ encoding: argv[:encoding],
+ convert: argv[:convert],
+ stats: argv[:stats],
+ wide_ambiguous: argv[:'wide-ambiguous'],
+ width: argv[:width],
+ )
  rescue ArgumentError
  $stderr.puts Paint[$!.message, :red]
  exit(1)
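
The help text above describes the stats header as `bytes / codepoints / glyphs / expected terminal width`. As a rough illustration of how the first three numbers can differ for the same string, here is a small sketch in plain Ruby. `string_counts` is a hypothetical helper, not part of unibits, and the fourth number is left out because it depends on the `unicode-display_width` gem.

```ruby
# Hypothetical helper: the first three numbers of the stats header.
def string_counts(string)
  {
    bytes:      string.bytesize,           # raw storage size
    codepoints: string.size,               # characters in Ruby's sense
    glyphs:     string.scan(/\X/).size,    # \X matches one extended grapheme cluster
  }
end

# "e" followed by U+0301 COMBINING ACUTE ACCENT:
# 3 bytes in UTF-8, 2 codepoints, but a single visible glyph.
counts = string_counts("e\u0301")
puts counts.values.join("/")
```

For pure ASCII input all three numbers coincide, which is why the distinction only becomes visible with multi-byte or combining characters.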
data/lib/unibits.rb CHANGED
@@ -1,4 +1,5 @@
1
1
  require_relative "unibits/version"
2
+ require_relative "unibits/symbolify"
2
3
 
3
4
  require "io/console"
4
5
  require "paint"
@@ -14,8 +15,9 @@ module Unibits
14
15
  'ASCII-8BIT',
15
16
  'US-ASCII',
16
17
  ].freeze
18
+ DEFAULT_TERMINAL_WIDTH = 80
17
19
 
18
- def self.of(string, encoding: nil, convert: nil, stats: true)
20
+ def self.of(string, encoding: nil, convert: nil, stats: true, wide_ambiguous: false, width: nil)
19
21
  if !string || string.empty?
20
22
  raise ArgumentError, "no data given to unibits"
21
23
  end
@@ -25,12 +27,8 @@ module Unibits
25
27
 
26
28
  case string.encoding.name
27
29
  when *SUPPORTED_ENCODINGS
28
- unless string.valid_encoding?
29
- raise ArgumentError, "the data given is not valid #{string.encoding.name}"
30
- end
31
-
32
- puts stats(string) if stats
33
- puts visualize(string)
30
+ puts stats(string, wide_ambiguous: wide_ambiguous) if stats
31
+ puts visualize(string, wide_ambiguous: wide_ambiguous, width: width)
34
32
  when 'UTF-16', 'UTF-32'
35
33
  raise ArgumentError, "unibits only supports #{string.encoding.name} with specified endianess, please use #{string.encoding.name}LE or #{string.encoding.name}BE"
36
34
  else
@@ -38,27 +36,36 @@ module Unibits
38
36
  end
39
37
  end
40
38
 
41
- def self.stats(string)
42
- "\n " \
43
- "#{Paint[string.encoding.name, :bold]} (" \
44
- "#{string.bytesize}/" \
45
- "#{string.size}/" \
46
- "#{string.scan(Regexp.compile('\X'.encode(string.encoding))).size}/" \
47
- "#{Unicode::DisplayWidth.of(string)})"
39
+ def self.stats(string, wide_ambiguous: false)
40
+ valid = string.valid_encoding?
41
+ bytes = string.bytesize rescue "?"
42
+ codepoints = string.size rescue "?"
43
+ glyphs = string.scan(Regexp.compile('\X'.encode(string.encoding))).size rescue "?"
44
+ width = Unicode::DisplayWidth.of(string, wide_ambiguous ? 2 : 1) rescue "?"
45
+
46
+ "\n #{valid ? '' : Paint["Invalid ", :bold, :red]}#{Paint[string.encoding.name, :bold]} (#{bytes}/#{codepoints}/#{glyphs}/#{width})"
48
47
  end
49
48
 
50
- def self.visualize(string)
51
- cols = determine_terminal_cols
49
+ def self.visualize(string, wide_ambiguous: false, width: nil)
50
+ cols = width || determine_terminal_cols
52
51
 
53
52
  cp_buffer = [" "]
54
53
  enc_buffer = [" "]
55
54
  hex_buffer = [" "]
56
55
  bin_buffer = [" "]
57
56
  separator = [" "]
57
+ current_encoding_error = nil
58
58
 
59
59
  puts
60
60
  string.each_char{ |char|
61
- current_color = random_color
61
+ if char.valid_encoding?
62
+ char_valid = true
63
+ current_color = random_color
64
+ current_encoding_error = nil
65
+ else
66
+ char_valid = false
67
+ current_color = :red
68
+ end
62
69
 
63
70
  char.each_byte.with_index{ |byte, index|
64
71
  if Paint.unpaint(hex_buffer[-1]).bytesize > cols - 12
@@ -70,12 +77,116 @@ module Unibits
70
77
  end
71
78
 
72
79
  if index == 0
80
+ if char_valid
81
+ codepoint = "U+%04X" % char.ord
82
+ else
83
+ case string.encoding.name
84
+ when "US-ASCII"
85
+ codepoint = "invalid"
86
+ when "UTF-8"
87
+ # this tries to detect what is wrong with this utf-8 encoded string
88
+ # sorry for this mess
89
+ case char.unpack("B*")[0]
90
+ when /^110.{5}$/
91
+ current_encoding_error = [:nec, 1, 1]
92
+ codepoint = "n.e.con."
93
+ when /^1110(.{4})$/
94
+ if $1 == "1101"
95
+ current_encoding_error = [:nec, 2, 2, :maybe_surrogate]
96
+ else
97
+ current_encoding_error = [:nec, 2, 2]
98
+ end
99
+ codepoint = "n.e.con."
100
+ when /^11110(.{3})$/
101
+ case $1
102
+ when "100"
103
+ current_encoding_error = [:nec, 3, 3, :leading_at_max]
104
+ when "101", "110", "111"
105
+ current_encoding_error = [:nec, 3, 3, :too_large]
106
+ else
107
+ current_encoding_error = [:nec, 3, 3]
108
+ end
109
+ codepoint = "n.e.con."
110
+ when /^11111.{3}$/
111
+ codepoint = "toolarge"
112
+ when /^10(.{2}).{4}$/
113
+ # uglyhack to fixup that it is not n.e.c, but something different
114
+ if current_encoding_error && current_encoding_error[0] == :nec
115
+ if current_encoding_error[3] == :leading_at_max
116
+ if $1 != "00"
117
+ current_encoding_error[3] = :too_large
118
+ else
119
+ current_encoding_error[3] = nil
120
+ end
121
+ elsif current_encoding_error[3] == :maybe_surrogate
122
+ if $1[0] == "1"
123
+ current_encoding_error[3] = :surrogate
124
+ else
125
+ current_encoding_error[3] = nil
126
+ end
127
+ end
128
+
129
+ if current_encoding_error[1] > 1
130
+ current_encoding_error[1] -= 1
131
+ codepoint = "n.e.con."
132
+ else
133
+ case current_encoding_error[3]
134
+ when :too_large
135
+ actual_error = "toolarge"
136
+ when :surrogate
137
+ actual_error = "sur.gate"
138
+ else
139
+ actual_error = "overlong"
140
+ end
141
+ current_cp_buffer_index = -1
142
+ (current_encoding_error[2]).times{
143
+ if index = cp_buffer[current_cp_buffer_index].rindex("n.e.con.")
144
+ cp_buffer[current_cp_buffer_index][index..-1] = cp_buffer[current_cp_buffer_index][index..-1].sub("n.e.con.", actual_error)
145
+ else
146
+ current_cp_buffer_index -= 1
147
+ index = cp_buffer[current_cp_buffer_index].rindex("n.e.con.")
148
+ cp_buffer[current_cp_buffer_index][index..-1] = cp_buffer[current_cp_buffer_index][index..-1].sub("n.e.con.", actual_error)
149
+ end
150
+ current_encoding_error = [:overlong]
151
+ codepoint = actual_error
152
+ }
153
+ end
154
+ else
155
+ current_encoding_error = [:unexp]
156
+ codepoint = "unexp.c."
157
+ end
158
+ else
159
+ current_encoding_error = [:invalid]
160
+ codepoint = "invalid"
161
+ end
162
+ when 'UTF-16LE', 'UTF-16BE'
163
+ if char.bytesize.odd?
164
+ codepoint = "incompl."
165
+ elsif char.b[string.encoding.name == 'UTF-16LE' ? 1 : 0].unpack("B*")[0][0, 5] == "11011"
166
+ codepoint = "hlf.srg."
167
+ else
168
+ codepoint = "invalid"
169
+ end
170
+ when 'UTF-32LE', 'UTF-32BE'
171
+ if char.bytesize != 4
172
+ codepoint = "incompl."
173
+ else
174
+ codepoint = "invalid"
175
+ end
176
+ end
177
+ end
178
+
73
179
  cp_buffer[-1] << Paint[
74
- ("U+%04X" % char.ord).ljust(10), current_color, :bold
180
+ codepoint.ljust(10), current_color, :bold
75
181
  ]
76
182
 
77
- symbolified_char = symbolify(char)
78
- padding = 10 - Unicode::DisplayWidth.of(symbolified_char)
183
+ if char_valid
184
+ symbolified_char = symbolify(char)
185
+ else
186
+ symbolified_char = "�"
187
+ end
188
+
189
+ padding = 10 - Unicode::DisplayWidth.of(symbolified_char, wide_ambiguous ? 2 : 1)
79
190
 
80
191
  enc_buffer[-1] << Paint[
81
192
  symbolified_char, current_color
@@ -92,54 +203,62 @@ module Unibits
92
203
 
93
204
  bin_byte_complete = byte.to_s(2).rjust(8, "0")
94
205
 
95
- case string.encoding.name
96
- when 'US-ASCII'
97
- bin_byte_1 = bin_byte_complete[0...1]
98
- bin_byte_2 = bin_byte_complete[1...8]
99
- when 'ASCII-8BIT'
100
- bin_byte_1 = ""
101
- bin_byte_2 = bin_byte_complete
102
- when 'UTF-8'
103
- if index == 0
104
- if bin_byte_complete =~ /^(0|1{2,4}0)([01]+)$/
206
+ if !char_valid
207
+ bin_byte_1 = bin_byte_complete
208
+ bin_byte_2 = ""
209
+ else
210
+ case string.encoding.name
211
+ when 'US-ASCII'
212
+ bin_byte_1 = bin_byte_complete[0...1]
213
+ bin_byte_2 = bin_byte_complete[1...8]
214
+ when 'ASCII-8BIT'
215
+ bin_byte_1 = ""
216
+ bin_byte_2 = bin_byte_complete
217
+ when 'UTF-8'
218
+ if index == 0
219
+ if bin_byte_complete =~ /^(0|1{2,4}0)([01]+)$/
220
+ bin_byte_1 = $1
221
+ bin_byte_2 = $2
222
+ else
223
+ bin_byte_1 = ""
224
+ bin_byte_2 = bin_byte_complete
225
+ end
226
+ else
227
+ bin_byte_1 = bin_byte_complete[0...2]
228
+ bin_byte_2 = bin_byte_complete[2...8]
229
+ end
230
+ when 'UTF-16LE'
231
+ if char.ord <= 0xFFFF || index == 0 || index == 2
232
+ bin_byte_1 = ""
233
+ bin_byte_2 = bin_byte_complete
234
+ else
235
+ bin_byte_complete =~ /^(11011[01])([01]+)$/
105
236
  bin_byte_1 = $1
106
237
  bin_byte_2 = $2
107
- else
238
+ end
239
+ when 'UTF-16BE'
240
+ if char.ord <= 0xFFFF || index == 1 || index == 3
108
241
  bin_byte_1 = ""
109
242
  bin_byte_2 = bin_byte_complete
243
+ else
244
+ bin_byte_complete =~ /^(11011[01])([01]+)$/
245
+ bin_byte_1 = $1
246
+ bin_byte_2 = $2
110
247
  end
111
- else
112
- bin_byte_1 = bin_byte_complete[0...2]
113
- bin_byte_2 = bin_byte_complete[2...8]
114
- end
115
- when 'UTF-16LE'
116
- if char.ord <= 0xFFFF || index == 0 || index == 2
248
+ when 'UTF-32LE', 'UTF-32BE'
117
249
  bin_byte_1 = ""
118
250
  bin_byte_2 = bin_byte_complete
119
- else
120
- bin_byte_complete =~ /^(11011[01])([01]+)$/
121
- bin_byte_1 = $1
122
- bin_byte_2 = $2
123
- end
124
- when 'UTF-16BE'
125
- if char.ord <= 0xFFFF || index == 1 || index == 3
126
- bin_byte_1 = ""
127
- bin_byte_2 = bin_byte_complete
128
- else
129
- bin_byte_complete =~ /^(11011[01])([01]+)$/
130
- bin_byte_1 = $1
131
- bin_byte_2 = $2
132
251
  end
133
- when 'UTF-32LE', 'UTF-32BE'
134
- bin_byte_1 = ""
135
- bin_byte_2 = bin_byte_complete
136
252
  end
253
+
137
254
  bin_buffer[-1] << Paint[
138
255
  bin_byte_1, current_color
139
256
  ] unless !bin_byte_1 || bin_byte_1.empty?
257
+
140
258
  bin_buffer[-1] << Paint[
141
259
  bin_byte_2, current_color, :underline
142
260
  ] unless !bin_byte_2 || bin_byte_2.empty?
261
+
143
262
  bin_buffer[-1] << " "
144
263
  }
145
264
  }
@@ -157,18 +276,12 @@ module Unibits
157
276
 
158
277
  def self.symbolify(char)
159
278
  return char.inspect unless char.encoding.name[0, 3] == "UTF"
160
- char
161
- .tr("\x00-\x1F".encode(char.encoding), "\u{2400}-\u{241F}".encode(char.encoding))
162
- .gsub(
163
- Regexp.compile('[\p{Space}᠎​‌‍⁠]'.encode(char.encoding)),
164
- ']\0['.encode(char.encoding)
165
- )
166
- .encode('UTF-8')
279
+ Symbolify.symbolify(char).encode('UTF-8')
167
280
  end
168
281
 
169
282
  def self.determine_terminal_cols
170
- STDIN.winsize[1] || 80
283
+ STDIN.winsize[1] || DEFAULT_TERMINAL_WIDTH
171
284
  rescue Errno::ENOTTY
172
- return 80
285
+ return DEFAULT_TERMINAL_WIDTH
173
286
  end
174
287
  end
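
The UTF-8 branch of `visualize` above decides what is wrong with an invalid byte by looking at its leading bits. The sketch below shows only that first step, classifying a single byte by its bit pattern; `utf8_byte_kind` and its return symbols are hypothetical names for illustration, and the follow-up logic in the diff (tracking missing continuation bytes, overlong forms, surrogates) is deliberately omitted.

```ruby
# Hypothetical helper, not part of unibits: classify one byte of UTF-8 data
# by its leading bits.
def utf8_byte_kind(byte)
  case byte
  when 0x00..0x7F then :ascii         # 0xxxxxxx, a complete character
  when 0x80..0xBF then :continuation  # 10xxxxxx, only valid after a lead byte
  when 0xC0..0xDF then :lead_2        # 110xxxxx, expects 1 continuation byte
  when 0xE0..0xEF then :lead_3        # 1110xxxx, expects 2 continuation bytes
  when 0xF0..0xF7 then :lead_4        # 11110xxx, expects 3 continuation bytes
  else :too_large                     # 11111xxx never occurs in UTF-8
  end
end

# A truncated 4-byte sequence followed by "A", as in the spec's
# "not enough continuation bytes" case.
"\xF0\x9F\x8CA".each_byte do |byte|
  puts format("%08b %s", byte, utf8_byte_kind(byte))
end
```

A lead byte alone does not prove validity: for example `0xF5` to `0xF7` always encode values above U+10FFFF, which is why the diff keeps extra state while walking the following bytes.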
@@ -3,7 +3,7 @@ require_relative '../unibits'
  module Kernel
  private
 
- def unibits(string)
- Unibits.of(string)
+ def unibits(string, **kwargs)
+ Unibits.of(string, **kwargs)
  end
  end
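
The Kernel method fix above comes down to collecting keyword arguments with `**kwargs` and forwarding them unchanged. The following sketch reproduces both shapes with a stand-in module; `Demo`, `show_old`, and `show_new` are hypothetical names used only for illustration, and `Demo.of` merely echoes its arguments instead of printing a visualization.

```ruby
# Hypothetical stand-in for Unibits.of; it only echoes what it receives.
module Demo
  def self.of(string, encoding: nil, width: nil)
    { string: string, encoding: encoding, width: width }
  end
end

# 1.0.0 shape: a single positional parameter, so options cannot be passed.
def show_old(string)
  Demo.of(string)
end

# 1.1.0 shape: collect keyword arguments and forward them unchanged.
def show_new(string, **kwargs)
  Demo.of(string, **kwargs)
end

p show_new("abc", width: 50)

begin
  show_old("abc", width: 50)
rescue ArgumentError => error
  puts "old signature: #{error.message}"
end
```

With the old signature the options hash is treated as a second positional argument, so the call fails with an `ArgumentError` before it ever reaches the library.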
@@ -0,0 +1,108 @@
1
+ module Unibits
2
+ module Symbolify
3
+ ASCII_CONTROL_CODEPOINTS = "\x00-\x1F\x7F".freeze
4
+ ASCII_CONTROL_SYMBOLS = "\u{2400}-\u{241F}\u{2421}".freeze
5
+ ASCII_CHARS = "\x21-\x7E".freeze
6
+ TAG_START = "\u{E0001}".freeze
7
+ TAG_START_SYMBOL = "LANG TAG".freeze
8
+ TAG_SPACE = "\u{E0020}".freeze
9
+ TAG_SPACE_SYMBOL = "TAG ␠".freeze
10
+ TAGS = "\u{E0021}-\u{E007E}".freeze
11
+ TAG_DELETE = "\u{E007F}".freeze
12
+ TAG_DELETE_SYMBOL = "TAG ␡".freeze
13
+ INTERESTING_CODEPOINTS = {
14
+ "\u{0080}" => "PAD",
15
+ "\u{0081}" => "HOP",
16
+ "\u{0082}" => "BPH",
17
+ "\u{0083}" => "NBH",
18
+ "\u{0084}" => "IND",
19
+ "\u{0085}" => "NEL",
20
+ "\u{0086}" => "SSA",
21
+ "\u{0087}" => "ESA",
22
+ "\u{0088}" => "HTS",
23
+ "\u{0089}" => "HTJ",
24
+ "\u{008A}" => "VTS",
25
+ "\u{008B}" => "PLD",
26
+ "\u{008C}" => "PLU",
27
+ "\u{008D}" => "RI",
28
+ "\u{008E}" => "SS2",
29
+ "\u{008F}" => "SS3",
30
+ "\u{0090}" => "DCS",
31
+ "\u{0091}" => "PU1",
32
+ "\u{0092}" => "PU2",
33
+ "\u{0093}" => "STS",
34
+ "\u{0094}" => "CCH",
35
+ "\u{0095}" => "MW",
36
+ "\u{0096}" => "SPA",
37
+ "\u{0097}" => "EPA",
38
+ "\u{0098}" => "SOS",
39
+ "\u{0099}" => "SGC",
40
+ "\u{009A}" => "SCI",
41
+ "\u{009B}" => "CSI",
42
+ "\u{009C}" => "ST",
43
+ "\u{009D}" => "OSC",
44
+ "\u{009E}" => "PM",
45
+ "\u{009F}" => "APC",
46
+
47
+ "\u{200E}" => "LRM",
48
+ "\u{200F}" => "RLM",
49
+ "\u{202A}" => "LRE",
50
+ "\u{202B}" => "RLE",
51
+ "\u{202C}" => "PDF",
52
+ "\u{202D}" => "LRO",
53
+ "\u{202E}" => "RLO",
54
+ "\u{2066}" => "LRI",
55
+ "\u{2067}" => "RLI",
56
+ "\u{2068}" => "FSI",
57
+ "\u{2069}" => "PDI",
58
+
59
+ "\u{FE00}" => "VS1",
60
+ "\u{FE01}" => "VS2",
61
+ "\u{FE02}" => "VS3",
62
+ "\u{FE03}" => "VS4",
63
+ "\u{FE04}" => "VS5",
64
+ "\u{FE05}" => "VS6",
65
+ "\u{FE06}" => "VS7",
66
+ "\u{FE07}" => "VS8",
67
+ "\u{FE08}" => "VS9",
68
+ "\u{FE09}" => "VS10",
69
+ "\u{FE0A}" => "VS11",
70
+ "\u{FE0B}" => "VS12",
71
+ "\u{FE0C}" => "VS13",
72
+ "\u{FE0D}" => "VS14",
73
+ "\u{FE0E}" => "VS15",
74
+ "\u{FE0F}" => "VS16",
75
+ }.freeze
76
+ COULD_BE_WHITESPACE = '[\p{Space}­᠎​‌‍⁠⁡⁢⁣⁤⠀𛲠𛲡𛲢𛲣𝅳𝅴𝅵𝅶𝅷𝅸𝅹𝅺]'.freeze
77
+ # UNASSIGNED = '\p{Cn}'.freeze
78
+
79
+ def self.symbolify(char, encoding = char.encoding)
80
+ char.tr!(
81
+ ASCII_CONTROL_CODEPOINTS.encode(encoding),
82
+ ASCII_CONTROL_SYMBOLS.encode(encoding)
83
+ )
84
+ char.gsub!(
85
+ Regexp.compile(COULD_BE_WHITESPACE.encode(encoding)),
86
+ ']\0['.encode(encoding)
87
+ )
88
+ # char.gsub!(
89
+ # Regexp.compile(UNASSIGNED.encode(encoding)),
90
+ # 'n/a'.encode(encoding)
91
+ # )
92
+ INTERESTING_CODEPOINTS.each{ |cp, desc|
93
+ char.gsub! Regexp.compile(cp.encode(encoding)), desc.encode(encoding)
94
+ }
95
+ char.gsub! TAG_START.encode(encoding), TAG_START_SYMBOL.encode(encoding)
96
+ char.gsub! TAG_SPACE.encode(encoding), TAG_SPACE_SYMBOL.encode(encoding)
97
+ char.gsub! TAG_DELETE.encode(encoding), TAG_DELETE_SYMBOL.encode(encoding)
98
+
99
+ ord = char.ord
100
+ if ord > 917536 && ord < 917631
101
+ char.tr!(TAGS.encode(encoding), ASCII_CHARS.encode(encoding))
102
+ char = "TAG ".encode(encoding) + char
103
+ end
104
+
105
+ char
106
+ end
107
+ end
108
+ end
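
The first step of `Symbolify.symbolify` above replaces ASCII control characters with their visible counterparts from the Unicode "Control Pictures" block by pairing two `tr` ranges. A minimal sketch of just that step, using a non-destructive `tr` instead of the `tr!` in the diff; the constant and method names here are hypothetical.

```ruby
# Same ranges as ASCII_CONTROL_CODEPOINTS / ASCII_CONTROL_SYMBOLS in the diff.
CONTROL_CODEPOINTS = "\x00-\x1F\x7F"
CONTROL_PICTURES   = "\u{2400}-\u{241F}\u{2421}"

# Hypothetical helper: returns a copy with control characters made visible.
# tr maps the n-th character of the first set to the n-th of the second,
# so U+0009 becomes U+2409 and U+007F becomes U+2421.
def picture_for_controls(text)
  text.tr(CONTROL_CODEPOINTS, CONTROL_PICTURES)
end

puts picture_for_controls("a\tb\n") # tab and newline are shown as ␉ and ␊
```

Because `tr` pairs characters by position, both sets must describe the same number of characters; a range that is one character longer or shorter shifts every mapping after it.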
@@ -1,3 +1,3 @@
  module Unibits
- VERSION = "1.0.0".freeze
+ VERSION = "1.1.0".freeze
  end
Binary file
Binary file
data/spec/unibits_spec.rb CHANGED
@@ -66,5 +66,171 @@ describe Unibits do
66
66
  result.must_match "43"
67
67
  result.must_match "01000011"
68
68
  end
69
+
70
+ describe "invalid UTF-8 encodings" do
71
+ it "- unexpected continuation byte (1/2)" do
72
+ string = "abc\x80efg"
73
+ result = Paint.unpaint(Unibits.visualize(string))
74
+ result.must_match "�"
75
+ result.must_match "unexp.c."
76
+ result.must_match /e.*f.*g/m
77
+ end
78
+
79
+ it "- unexpected continuation byte (2/2)" do
80
+ string = "🌫\x81efg"
81
+ result = Paint.unpaint(Unibits.visualize(string))
82
+ result.must_match "�"
83
+ result.must_match "unexp.c."
84
+ result.must_match /e.*f.*g/m
85
+ end
86
+
87
+ it "- not enough continuation bytes" do
88
+ string = "\xF0\x9F\x8CABC"
89
+ result = Paint.unpaint(Unibits.visualize(string))
90
+ result.must_match "�"
91
+ result.must_match "n.e.con."
92
+ result.must_match /A.*B.*C/m
93
+ end
94
+
95
+ it "- overlong padding (1/2)" do
96
+ string = "\xE0\x81\x81ABC"
97
+ result = Paint.unpaint(Unibits.visualize(string))
98
+ result.must_match "�"
99
+ result.must_match /overlong.*overlong.*overlong/m
100
+ result.must_match /A.*B.*C/m
101
+ end
102
+
103
+ it "- overlong padding (2/2)" do
104
+ string = "\xC0\x80no double null"
105
+ result = Paint.unpaint(Unibits.visualize(string))
106
+ result.must_match "�"
107
+ result.must_match /overlong.*overlong/m
108
+ end
109
+
110
+ it "- too large codepoint (1/2)" do
111
+ string = "\xF5\x8F\xBF\xBFABC"
112
+ result = Paint.unpaint(Unibits.visualize(string))
113
+ result.must_match "�"
114
+ result.must_match /toolarge.*toolarge.*toolarge.*toolarge/m
115
+ result.must_match /A.*B.*C/m
116
+ end
117
+
118
+ it "- too large codepoint (2/2)" do
119
+ string = "\xF4\xAF\xBF\xBFABC"
120
+ result = Paint.unpaint(Unibits.visualize(string))
121
+ result.must_match "�"
122
+ result.must_match /toolarge.*toolarge.*toolarge.*toolarge/m
123
+ result.must_match /A.*B.*C/m
124
+ end
125
+
126
+ it "- too large byte" do
127
+ string = "\xFF"
128
+ result = Paint.unpaint(Unibits.visualize(string))
129
+ result.must_match "�"
130
+ result.must_match "toolarge"
131
+ end
132
+
133
+ it "- has surrogate (1/2)" do
134
+ string = "\xED\xA0\x80ABC"
135
+ result = Paint.unpaint(Unibits.visualize(string))
136
+ result.must_match "�"
137
+ result.must_match "sur.gate"
138
+ result.must_match /A.*B.*C/m
139
+ end
140
+
141
+ it "- has surrogate (2/2)" do
142
+ string = "\xED\xBF\xBFABC"
143
+ result = Paint.unpaint(Unibits.visualize(string))
144
+ result.must_match "�"
145
+ result.must_match "sur.gate"
146
+ result.must_match /A.*B.*C/m
147
+ end
148
+ end
149
+
150
+ describe "invalid UTF-16 encodings" do
151
+ it "- incomplete number of bytes (1/2)" do
152
+ string = "a".b.force_encoding("UTF-16LE")
153
+ result = Paint.unpaint(Unibits.visualize(string))
154
+ result.must_match "incompl."
155
+ result.must_match "�"
156
+ end
157
+
158
+ it "- incomplete number of bytes (2/2)" do
159
+ string = "🌫".b[0..-2].force_encoding("UTF-16LE")
160
+ result = Paint.unpaint(Unibits.visualize(string))
161
+ result.must_match "incompl."
162
+ result.must_match "�"
163
+ end
164
+
165
+ it "- only lower half surrogate" do
166
+ string = "\x3C\xD8\x2Ba".force_encoding("UTF-16LE")
167
+ result = Paint.unpaint(Unibits.visualize(string))
168
+ result.must_match "hlf.srg."
169
+ result.must_match "�"
170
+ end
171
+
172
+ it "- only higher half surrogate" do
173
+ string = "\x3Ca\x2B\xDF".force_encoding("UTF-16LE")
174
+ result = Paint.unpaint(Unibits.visualize(string))
175
+ result.must_match "hlf.srg."
176
+ result.must_match "�"
177
+ end
178
+ end
179
+
180
+ describe "invalid UTF-32 encodings" do
181
+ # please note, currently, too large codepoints and encoded utf16 surrogates are treated as valid encodings
182
+
183
+ it "- incomplete number of bytes (1/3)" do
184
+ string = "a".b.force_encoding("UTF-32LE")
185
+ result = Paint.unpaint(Unibits.visualize(string))
186
+ result.must_match "incompl."
187
+ result.must_match "�"
188
+ end
189
+
190
+ it "- incomplete number of bytes (2/3)" do
191
+ string = "🌫".b[0..-2].force_encoding("UTF-32LE")
192
+ result = Paint.unpaint(Unibits.visualize(string))
193
+ result.must_match "incompl."
194
+ result.must_match "�"
195
+ end
196
+
197
+ it "- incomplete number of bytes (3/3)" do
198
+ string = "🌫".b[0..-2].force_encoding("UTF-32LE")
199
+ result = Paint.unpaint(Unibits.visualize(string))
200
+ result.must_match "incompl."
201
+ result.must_match "�"
202
+ end
203
+ end
204
+
205
+ describe "invalid ASCII encodings" do
206
+ it "- contains bytes with 8th bit set" do
207
+ string = "abc\x80efg".force_encoding("ASCII")
208
+ result = Paint.unpaint(Unibits.visualize(string))
209
+ result.must_match "�"
210
+ result.must_match /e.*f.*g/m
211
+ end
212
+ end
213
+ end
214
+
215
+ describe "wide_ambiguous: option" do
216
+ it "- default is 1" do
217
+ string = "⚀······"
218
+ result = Unibits.stats(string)
219
+ result.wont_match "13"
220
+ end
221
+
222
+ it "- default is 2" do
223
+ string = "⚀······"
224
+ result = Unibits.stats(string, wide_ambiguous: true)
225
+ result.must_match "13"
226
+ end
227
+ end
228
+
229
+ describe "width: option" do
230
+ it "sets a custom column width" do
231
+ string = "bla" * 99
232
+ result = Paint.unpaint(Unibits.visualize(string, width: 50))
233
+ (result[/^.*$/].size <= 50).must_equal true
234
+ end
69
235
  end
70
236
  end
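
The UTF-16 specs above build strings in which one half of a surrogate pair is missing. As a rough illustration of what "half surrogate" means at the byte level, the sketch below decodes a single little-endian code unit and checks whether it falls into the surrogate range; `surrogate_unit?` is a hypothetical helper, not the check unibits performs internally.

```ruby
# Hypothetical helper: does this 2-byte UTF-16LE code unit lie in the
# surrogate range U+D800..U+DFFF?
def surrogate_unit?(two_bytes)
  unit = two_bytes.unpack1("v") # "v" = 16-bit unsigned integer, little-endian
  (0xD800..0xDFFF).cover?(unit)
end

# Same bytes as in the spec: a surrogate code unit (0xD83C) followed by an
# ordinary code unit (0x612B) instead of its matching partner.
lone = "\x3C\xD8\x2Ba".b.force_encoding("UTF-16LE")

puts lone.valid_encoding?           # false
puts surrogate_unit?("\x3C\xD8".b)  # true
puts surrogate_unit?("\x2Ba".b)     # false
```

A surrogate code unit is only valid when a high surrogate is immediately followed by a low one, which is why Ruby rejects the string as a whole.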
data/unibits.gemspec CHANGED
@@ -17,7 +17,7 @@ Gem::Specification.new do |gem|
  gem.test_files = gem.files.grep(%r{^(test|spec|features)/})
  gem.require_paths = ["lib"]
 
- gem.add_dependency 'paint', '~> 2.0'
+ gem.add_dependency 'paint', '>= 0.9', '< 3.0'
  gem.add_dependency 'unicode-display_width', '~> 1.1'
  gem.add_dependency 'rationalist', '~> 2.0'
 
metadata CHANGED
@@ -1,29 +1,35 @@
  --- !ruby/object:Gem::Specification
  name: unibits
  version: !ruby/object:Gem::Version
- version: 1.0.0
+ version: 1.1.0
  platform: ruby
  authors:
  - Jan Lelis
  autorequire:
  bindir: bin
  cert_chain: []
- date: 2017-03-05 00:00:00.000000000 Z
+ date: 2017-03-08 00:00:00.000000000 Z
  dependencies:
  - !ruby/object:Gem::Dependency
  name: paint
  requirement: !ruby/object:Gem::Requirement
  requirements:
- - - "~>"
+ - - ">="
  - !ruby/object:Gem::Version
- version: '2.0'
+ version: '0.9'
+ - - "<"
+ - !ruby/object:Gem::Version
+ version: '3.0'
  type: :runtime
  prerelease: false
  version_requirements: !ruby/object:Gem::Requirement
  requirements:
- - - "~>"
+ - - ">="
  - !ruby/object:Gem::Version
- version: '2.0'
+ version: '0.9'
+ - - "<"
+ - !ruby/object:Gem::Version
+ version: '3.0'
  - !ruby/object:Gem::Dependency
  name: unicode-display_width
  requirement: !ruby/object:Gem::Requirement
@@ -74,13 +80,16 @@ files:
  - bin/unibits
  - lib/unibits.rb
  - lib/unibits/kernel_method.rb
+ - lib/unibits/symbolify.rb
  - lib/unibits/version.rb
+ - screenshots/ascii.invalid.png
  - screenshots/ascii.png
  - screenshots/binary.png
  - screenshots/utf-16be.png
  - screenshots/utf-16le.png
  - screenshots/utf-32be.png
  - screenshots/utf-32le.png
+ - screenshots/utf-8.invalid.png
  - screenshots/utf-8.png
  - spec/unibits_spec.rb
  - unibits.gemspec
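
The gemspec and metadata changes relax the `paint` requirement from `~> 2.0` to `>= 0.9`, `< 3.0`. The sketch below uses RubyGems' own `Gem::Requirement` and `Gem::Version` classes to show which versions each constraint accepts; the version numbers in the list are illustrative values, not a claim about which `paint` releases exist.

```ruby
require "rubygems"

old_req = Gem::Requirement.new("~> 2.0")           # 1.0.0 constraint
new_req = Gem::Requirement.new(">= 0.9", "< 3.0")  # 1.1.0 constraint

# Illustrative version numbers only.
%w[0.9.0 1.1.5 2.0.0 3.0.0].each do |number|
  version = Gem::Version.new(number)
  puts format("paint %-6s old: %-5s new: %s",
              number, old_req.satisfied_by?(version), new_req.satisfied_by?(version))
end
```

The upper bound stays the same in both versions; only the lower bound moves down, so 0.9.x and 1.x installs become acceptable while 3.0 and later remain excluded.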