unibits 1.0.0 → 1.1.0

checksums.yaml CHANGED
@@ -1,7 +1,7 @@
1
1
  ---
2
2
  SHA1:
3
- metadata.gz: 0405b9099735bcf25bc05f82e91d9c5629071ee1
4
- data.tar.gz: 85d9bde5ddc1e90ac70cf288e7753ddc8511fc87
3
+ metadata.gz: 3a765c7b1a6edfd62f8fed06b683d0d4ff2cacae
4
+ data.tar.gz: c66cb699f9a4425fc169fd6562dd4b758b94c584
5
5
  SHA512:
6
- metadata.gz: a8d944dfcf50869cf1a64a7237e8145ee8b8fcb57a801db9a475684ac0e7bd4865be2b2d372f943ddeebb6d86f632a3eb9655d147cc6e3b79bc6e0c6dc09fa66
7
- data.tar.gz: be6685c1bfc9b9ecf073e9899be4ded5d590ccdd90b527dfdfdfeb1ac654016cac4116033905183f988e249cabedcc767cc8fa3a4d04b3681650ad7852035330
6
+ metadata.gz: a516728fe08185386fc1902fa341784517521d0a85e494e35a7e2700c1c6d7fc3dfc8f86bb20d4268de6d69df591ab431aa3ab5ee02d96f6798fcd97f2426a8e
7
+ data.tar.gz: cb2fbfea68be359661ad75cc6e203bd4235ccd81431a37f74f965e90f6116e1a12d8cb04094a213f537e2a032c0c25775edd08378cd7c8ae613503dbe5cc3e0e
data/CHANGELOG.md CHANGED
@@ -1,5 +1,13 @@
1
1
  ## CHANGELOG
2
2
 
3
+ ### 1.1.0
4
+
5
+ * Support (and highlight) invalid encodings \o/
6
+ * Improve character symbolification
7
+ * Fix that the Kernel method would not take keyword arguments
8
+ * New option for setting a custom output width to use
9
+ * New option for activating wide ambiguous characters
10
+
3
11
  ### 1.0.0
4
12
 
5
13
  * Initial release
data/README.md CHANGED
@@ -1,18 +1,24 @@
1
- # unibits [![[version]](https://badge.fury.io/rb/unibits.svg)](http://badge.fury.io/rb/unibits) [![[travis]](https://travis-ci.org/janlelis/unibits.svg)](https://travis-ci.org/janlelis/unibits)
1
+ # unibits | Reveal the Unicode [![[version]](https://badge.fury.io/rb/unibits.svg)](http://badge.fury.io/rb/unibits) [![[travis]](https://travis-ci.org/janlelis/unibits.svg)](https://travis-ci.org/janlelis/unibits)
2
2
 
3
- Ruby library and CLI command that visualizes various Unicode and ASCII encodings in the terminal.
3
+ Ruby library and CLI command that visualizes various Unicode and ASCII encodings in the terminal:
4
4
 
5
- Helps you debugging and learning about Unicode encodings.
6
-
7
- Supported encodings: **UTF-8**, **UTF-16LE**, **UTF-16BE**, **UTF-32LE**, **UTF-32BE**, **BINARY**, **ASCII**
5
+ - Makes analyzing encodings easier
6
+ - Helps you with debugging strings
7
+ - Supports **UTF-8**, **UTF-16LE**/**UTF-16BE**, **UTF-32LE**/**UTF-32BE**, arbitrary **BINARY** data, and **ASCII**
8
+ - Highlights invalid encodings
8
9
 
9
10
  ## Setup
10
11
 
12
+ Make sure you have Ruby installed and installing gems works properly. Then do:
13
+
11
14
  ```
12
15
  $ gem install unibits
13
16
  ```
14
17
 
15
18
  ## Usage
19
+
20
+ Pass the string to debug to unibits:
21
+
16
22
  ### From CLI
17
23
 
18
24
  ```
@@ -26,17 +32,17 @@ require 'unibits/kernel_method'
26
32
  unibits "🌫 Idiosyncrätic ℜսᖯʏ"
27
33
  ```
28
34
 
29
- ### Options
35
+ ### Advanced Options
30
36
 
31
- `unibits` takes three optional options:
37
+ `unibits` takes some optional options:
32
38
 
33
39
  - *encoding (e)*: The encoding of the given string (uses your default encoding if none given)
34
40
  - *convert (c)*: An encoding the string should be converted to before visualizing it
35
41
  - *stats*: Whether to show a short stats header (default: `true`), you can deactivate on the CLI with `--no-stats`
42
+ - *wide-ambiguous*: Treat characters of ambiguous width as 2 spaces instead of 1 ([more info](https://github.com/janlelis/unicode-display_width))
43
+ - *width (w)*: Set a custom column width; if not set, *unibits* retrieves it from the terminal or falls back to 80
36
44
 
37
- **Please note**: This uses Ruby's built-in encoding support. Currently, only strings with valid encodings are supported.
38
-
39
- ## Output for Different Encodings
45
+ ## Output of Different Valid Encodings
40
46
  ### UTF-8
41
47
 
42
48
  CLI: `$ unibits -e utf-8 -c utf-8 "🌫 Idiosyncrätic ℜսᖯʏ"`
@@ -47,33 +53,33 @@ Ruby: `unibits "🌫 Idiosyncrätic ℜսᖯʏ", encoding: 'utf-8', convert:
47
53
 
48
54
  ### UTF-16LE
49
55
 
50
- CLI: `$ unibits -e utf-8 -c utf-8 "🌫 Idiosyncrätic ℜսᖯʏ"`
56
+ CLI: `$ unibits -e utf-8 -c utf-16le "🌫 Idiosyncrätic ℜսᖯʏ"`
51
57
 
52
- Ruby: `unibits "🌫 Idiosyncrätic ℜսᖯʏ", encoding: 'utf-8', convert: 'utf-8'`
58
+ Ruby: `unibits "🌫 Idiosyncrätic ℜսᖯʏ", encoding: 'utf-8', convert: 'utf-16le'`
53
59
 
54
60
  ![Screenshot UTF-16LE](/screenshots/utf-16le.png?raw=true "UTF-16LE")
55
61
 
56
62
  ### UTF-16BE
57
63
 
58
- CLI: `$ unibits -e utf-8 -c utf-8 "🌫 Idiosyncrätic ℜսᖯʏ"`
64
+ CLI: `$ unibits -e utf-8 -c utf-16be "🌫 Idiosyncrätic ℜսᖯʏ"`
59
65
 
60
- Ruby: `unibits "🌫 Idiosyncrätic ℜսᖯʏ", encoding: 'utf-8', convert: 'utf-8'`
66
+ Ruby: `unibits "🌫 Idiosyncrätic ℜսᖯʏ", encoding: 'utf-8', convert: 'utf-16be'`
61
67
 
62
68
  ![Screenshot UTF-16BE](/screenshots/utf-16be.png?raw=true "UTF-16BE")
63
69
 
64
70
  ### UTF-32LE
65
71
 
66
- CLI: `$ unibits -e utf-8 -c utf-8 "🌫 Idiosyncrätic ℜսᖯʏ"`
72
+ CLI: `$ unibits -e utf-8 -c utf-32le "🌫 Idiosyncrätic ℜսᖯʏ"`
67
73
 
68
- Ruby: `unibits "🌫 Idiosyncrätic ℜսᖯʏ", encoding: 'utf-8', convert: 'utf-8'`
74
+ Ruby: `unibits "🌫 Idiosyncrätic ℜսᖯʏ", encoding: 'utf-8', convert: 'utf-32le'`
69
75
 
70
76
  ![Screenshot UTF-32LE](/screenshots/utf-32le.png?raw=true "UTF-32LE")
71
77
 
72
78
  ### UTF-32BE
73
79
 
74
- CLI: `$ unibits -e utf-8 -c utf-8 "🌫 Idiosyncrätic ℜսᖯʏ"`
80
+ CLI: `$ unibits -e utf-8 -c utf-32be "🌫 Idiosyncrätic ℜսᖯʏ"`
75
81
 
76
- Ruby: `unibits "🌫 Idiosyncrätic ℜսᖯʏ", encoding: 'utf-8', convert: 'utf-8'`
82
+ Ruby: `unibits "🌫 Idiosyncrätic ℜսᖯʏ", encoding: 'utf-8', convert: 'utf-32be'`
77
83
 
78
84
  ![Screenshot UTF-32BE](/screenshots/utf-32be.png?raw=true "UTF-32BE")
79
85
 
@@ -93,6 +99,23 @@ Ruby: `unibits "ASCII String", encoding: 'utf-8', convert: 'ascii'`
93
99
 
94
100
  ![Screenshot ASCII](/screenshots/ascii.png?raw=true "ASCII")
95
101
 
102
+ ## Invalid Encodings
103
+ ### UTF-8
104
+
105
+ Example in Ruby: `unibits "unexpected \x80 | not enough \xF0\x9F\x8C | overlong \xE0\x81\x81 | surrogate \xED\xA0\x80 | too large \xF5\x8F\xBF\xBF"`
106
+
107
+ ![Screenshot invalid UTF-8](/screenshots/utf-8.invalid.png?raw=true "Invalid UTF-8")
108
+
109
+ ### ASCII
110
+
111
+ Example in Ruby: `unibits "🌫 Idiosyncrätic ℜսᖯʏ", encoding: 'ascii'`
112
+
113
+ ![Screenshot invalid ASCII](/screenshots/ascii.invalid.png?raw=true "Invalid ASCII")
114
+
115
+ ### BINARY
116
+
117
+ (not possible to produce invalid binary strings)
118
+
96
119
  ## Notes
97
120
 
98
121
  Also see
@@ -101,6 +124,7 @@ Also see
101
124
  - [UTF-16 (Wikipedia)](https://en.wikipedia.org/wiki/UTF-16#Description)
102
125
  - [UTF-32 (Wikipedia)](https://en.wikipedia.org/wiki/UTF-32)
103
126
  - [Ruby's Encoding class](https://ruby-doc.org/core/Encoding.html)
127
+ - [Difference between BINARY and ASCII](http://idiosyncratic-ruby.com/56-us-ascii-8bit.html)
104
128
  - [Unicode Micro Libraries for Ruby](https://github.com/janlelis/unicode-x)
105
129
 
106
130
  Lots of thanks to @damienklinnert for the motivation and inspiration required to build this! 🎆
data/Rakefile CHANGED
@@ -24,7 +24,7 @@ end
24
24
 
25
25
  desc "#{gemspec.name} | IRB"
26
26
  task :irb do
27
- sh "irb -I ./lib -r #{gemspec.name.gsub '-','/'}"
27
+ sh "irb -I ./lib -r #{gemspec.name.gsub '-','/'}/kernel_method"
28
28
  end
29
29
 
30
30
  # # #
data/bin/unibits CHANGED
@@ -7,14 +7,16 @@ argv = Rationalist.parse(
7
7
  ARGV,
8
8
  string: '_',
9
9
  alias: {
10
- e: 'encoding',
11
10
  c: 'convert',
11
+ e: 'encoding',
12
12
  v: 'version',
13
+ w: 'width',
13
14
  },
14
15
  boolean: [
15
16
  'help',
17
+ 'stats',
16
18
  'version',
17
- 'stats'
19
+ 'wide-ambiguous',
18
20
  ],
19
21
  default: {
20
22
  stats: true
@@ -22,21 +24,54 @@ argv = Rationalist.parse(
22
24
  )
23
25
 
24
26
  if argv[:version]
25
- puts Unibits::VERSION
27
+ puts "unibits #{Unibits::VERSION} by #{Paint["J-_-L", :bold]} <https://github.com/janlelis/unibits>"
26
28
  exit(0)
27
29
  end
28
30
 
29
31
  if argv[:help]
30
32
  puts <<-HELP
31
33
 
34
+ #{Paint["DESCRIPTION", :underline]}
35
+
36
+ Visualizes ANSI and various Unicode encodings in the terminal.
37
+
32
38
  #{Paint["USAGE", :underline]}
33
39
 
34
- #{Paint["unibits", :bold]} [--encoding <encoding>] [--convert <encoding>] [--no-stats] data
40
+ #{Paint["unibits", :bold]} [options] data
41
+
42
+ --encoding <encoding> | -e | which encoding to use for given data
43
+ --convert <encoding> | -c | which encoding to convert to (if possible)
44
+ --width <n> | -w | force a specific number of terminal columns
45
+ --no-stats | | no stats header with length info
46
+ --version | | displays version of unibits
47
+ --wide-ambiguous | | treat ambiguous-width characters as 2 columns wide
48
+
49
+ #{Paint["ENCODINGS", :underline]}
50
+
51
+ #{Unibits::SUPPORTED_ENCODINGS.join(', ')}
52
+
53
+ #{Paint["STATS", :underline]}
54
+
55
+ ( bytes / codepoints / glyphs / expected terminal width )
56
+
57
+ #{Paint["INVALID BYTES", :underline]}
58
+
59
+ UTF-8
60
+
61
+ n.e.con. | not enough continuation bytes
62
+ unexp.c. | unexpected continuation byte
63
+ toolarge | codepoint value exceeds maximum allowed (U+10FFFF)
64
+ sur.gate | codepoint value would be a surrogate half
65
+ overlong | unnecessary null padding
66
+
67
+ UTF-16
68
+
69
+ incompl. | not enough bytes left to finish codepoint
70
+ hlf.srg. | other half of surrogate missing
35
71
 
36
- Supported encodings: #{Unibits::SUPPORTED_ENCODINGS.join(', ')}
37
- Explanation of stats: bytes / codepoints / glyphs / expected terminal width
72
+ #{Paint["MORE INFO", :underline]}
38
73
 
39
- More info and examples at: https://github.com/janlelis/unibits
74
+ https://github.com/janlelis/unibits
40
75
 
41
76
  HELP
42
77
  exit(0)
@@ -51,7 +86,14 @@ else
51
86
  end
52
87
 
53
88
  begin
54
- Unibits.of(data, encoding: argv[:encoding], convert: argv[:convert], stats: argv[:stats])
89
+ Unibits.of(
90
+ data,
91
+ encoding: argv[:encoding],
92
+ convert: argv[:convert],
93
+ stats: argv[:stats],
94
+ wide_ambiguous: argv[:'wide-ambiguous'],
95
+ width: argv[:width],
96
+ )
55
97
  rescue ArgumentError
56
98
  $stderr.puts Paint[$!.message, :red]
57
99
  exit(1)
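
The executable above parses its flags with Rationalist and then maps the dashed `wide-ambiguous` key onto the `wide_ambiguous:` keyword. As a rough illustration of the same flag set, here is a sketch that swaps in Ruby's standard-library `OptionParser` instead of Rationalist. The method name `parse_unibits_args` is hypothetical and not part of the gem.

```ruby
require "optparse"

# Hypothetical helper: parses the flags listed in the help text using
# stdlib OptionParser (the gem itself uses Rationalist).
def parse_unibits_args(argv)
  options = { encoding: nil, convert: nil, stats: true, wide_ambiguous: false, width: nil }

  parser = OptionParser.new do |opts|
    opts.on("-e", "--encoding ENCODING") { |value| options[:encoding] = value }
    opts.on("-c", "--convert ENCODING")  { |value| options[:convert] = value }
    opts.on("-w", "--width N", Integer)  { |value| options[:width] = value }
    opts.on("--[no-]stats")              { |value| options[:stats] = value }
    # The dashed CLI flag is stored under an underscored key,
    # mirroring argv[:'wide-ambiguous'] => wide_ambiguous: in the diff
    opts.on("--wide-ambiguous")          { |value| options[:wide_ambiguous] = value }
  end

  # parse (without !) leaves argv untouched and returns the non-option arguments
  data = parser.parse(argv)
  [options, data]
end

options, data = parse_unibits_args(%w[-w 50 --no-stats --wide-ambiguous hello])
p options
p data
```

The resulting hash could then be splatted into a call such as `Unibits.of(data.first, **options)`, assuming the keyword names match the signature shown in `lib/unibits.rb`.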
data/lib/unibits.rb CHANGED
@@ -1,4 +1,5 @@
1
1
  require_relative "unibits/version"
2
+ require_relative "unibits/symbolify"
2
3
 
3
4
  require "io/console"
4
5
  require "paint"
@@ -14,8 +15,9 @@ module Unibits
14
15
  'ASCII-8BIT',
15
16
  'US-ASCII',
16
17
  ].freeze
18
+ DEFAULT_TERMINAL_WIDTH = 80
17
19
 
18
- def self.of(string, encoding: nil, convert: nil, stats: true)
20
+ def self.of(string, encoding: nil, convert: nil, stats: true, wide_ambiguous: false, width: nil)
19
21
  if !string || string.empty?
20
22
  raise ArgumentError, "no data given to unibits"
21
23
  end
@@ -25,12 +27,8 @@ module Unibits
25
27
 
26
28
  case string.encoding.name
27
29
  when *SUPPORTED_ENCODINGS
28
- unless string.valid_encoding?
29
- raise ArgumentError, "the data given is not valid #{string.encoding.name}"
30
- end
31
-
32
- puts stats(string) if stats
33
- puts visualize(string)
30
+ puts stats(string, wide_ambiguous: wide_ambiguous) if stats
31
+ puts visualize(string, wide_ambiguous: wide_ambiguous, width: width)
34
32
  when 'UTF-16', 'UTF-32'
35
33
  raise ArgumentError, "unibits only supports #{string.encoding.name} with specified endianness, please use #{string.encoding.name}LE or #{string.encoding.name}BE"
36
34
  else
@@ -38,27 +36,36 @@ module Unibits
38
36
  end
39
37
  end
40
38
 
41
- def self.stats(string)
42
- "\n " \
43
- "#{Paint[string.encoding.name, :bold]} (" \
44
- "#{string.bytesize}/" \
45
- "#{string.size}/" \
46
- "#{string.scan(Regexp.compile('\X'.encode(string.encoding))).size}/" \
47
- "#{Unicode::DisplayWidth.of(string)})"
39
+ def self.stats(string, wide_ambiguous: false)
40
+ valid = string.valid_encoding?
41
+ bytes = string.bytesize rescue "?"
42
+ codepoints = string.size rescue "?"
43
+ glyphs = string.scan(Regexp.compile('\X'.encode(string.encoding))).size rescue "?"
44
+ width = Unicode::DisplayWidth.of(string, wide_ambiguous ? 2 : 1) rescue "?"
45
+
46
+ "\n #{valid ? '' : Paint["Invalid ", :bold, :red]}#{Paint[string.encoding.name, :bold]} (#{bytes}/#{codepoints}/#{glyphs}/#{width})"
48
47
  end
49
48
 
50
- def self.visualize(string)
51
- cols = determine_terminal_cols
49
+ def self.visualize(string, wide_ambiguous: false, width: nil)
50
+ cols = width || determine_terminal_cols
52
51
 
53
52
  cp_buffer = [" "]
54
53
  enc_buffer = [" "]
55
54
  hex_buffer = [" "]
56
55
  bin_buffer = [" "]
57
56
  separator = [" "]
57
+ current_encoding_error = nil
58
58
 
59
59
  puts
60
60
  string.each_char{ |char|
61
- current_color = random_color
61
+ if char.valid_encoding?
62
+ char_valid = true
63
+ current_color = random_color
64
+ current_encoding_error = nil
65
+ else
66
+ char_valid = false
67
+ current_color = :red
68
+ end
62
69
 
63
70
  char.each_byte.with_index{ |byte, index|
64
71
  if Paint.unpaint(hex_buffer[-1]).bytesize > cols - 12
@@ -70,12 +77,116 @@ module Unibits
70
77
  end
71
78
 
72
79
  if index == 0
80
+ if char_valid
81
+ codepoint = "U+%04X" % char.ord
82
+ else
83
+ case string.encoding.name
84
+ when "US-ASCII"
85
+ codepoint = "invalid"
86
+ when "UTF-8"
87
+ # this tries to detect what is wrong with this utf-8 encoded string
88
+ # sorry for this mess
89
+ case char.unpack("B*")[0]
90
+ when /^110.{5}$/
91
+ current_encoding_error = [:nec, 1, 1]
92
+ codepoint = "n.e.con."
93
+ when /^1110(.{4})$/
94
+ if $1 == "1101"
95
+ current_encoding_error = [:nec, 2, 2, :maybe_surrogate]
96
+ else
97
+ current_encoding_error = [:nec, 2, 2]
98
+ end
99
+ codepoint = "n.e.con."
100
+ when /^11110(.{3})$/
101
+ case $1
102
+ when "100"
103
+ current_encoding_error = [:nec, 3, 3, :leading_at_max]
104
+ when "101", "110", "111"
105
+ current_encoding_error = [:nec, 3, 3, :too_large]
106
+ else
107
+ current_encoding_error = [:nec, 3, 3]
108
+ end
109
+ codepoint = "n.e.con."
110
+ when /^11111.{3}$/
111
+ codepoint = "toolarge"
112
+ when /^10(.{2}).{4}$/
113
+ # uglyhack to fixup that it is not n.e.c, but something different
114
+ if current_encoding_error && current_encoding_error[0] == :nec
115
+ if current_encoding_error[3] == :leading_at_max
116
+ if $1 != "00"
117
+ current_encoding_error[3] = :too_large
118
+ else
119
+ current_encoding_error[3] = nil
120
+ end
121
+ elsif current_encoding_error[3] == :maybe_surrogate
122
+ if $1[0] == "1"
123
+ current_encoding_error[3] = :surrogate
124
+ else
125
+ current_encoding_error[3] = nil
126
+ end
127
+ end
128
+
129
+ if current_encoding_error[1] > 1
130
+ current_encoding_error[1] -= 1
131
+ codepoint = "n.e.con."
132
+ else
133
+ case current_encoding_error[3]
134
+ when :too_large
135
+ actual_error = "toolarge"
136
+ when :surrogate
137
+ actual_error = "sur.gate"
138
+ else
139
+ actual_error = "overlong"
140
+ end
141
+ current_cp_buffer_index = -1
142
+ (current_encoding_error[2]).times{
143
+ if index = cp_buffer[current_cp_buffer_index].rindex("n.e.con.")
144
+ cp_buffer[current_cp_buffer_index][index..-1] = cp_buffer[current_cp_buffer_index][index..-1].sub("n.e.con.", actual_error)
145
+ else
146
+ current_cp_buffer_index -= 1
147
+ index = cp_buffer[current_cp_buffer_index].rindex("n.e.con.")
148
+ cp_buffer[current_cp_buffer_index][index..-1] = cp_buffer[current_cp_buffer_index][index..-1].sub("n.e.con.", actual_error)
149
+ end
150
+ current_encoding_error = [:overlong]
151
+ codepoint = actual_error
152
+ }
153
+ end
154
+ else
155
+ current_encoding_error = [:unexp]
156
+ codepoint = "unexp.c."
157
+ end
158
+ else
159
+ current_encoding_error = [:invalid]
160
+ codepoint = "invalid"
161
+ end
162
+ when 'UTF-16LE', 'UTF-16BE'
163
+ if char.bytesize.odd?
164
+ codepoint = "incompl."
165
+ elsif char.b[string.encoding.name == 'UTF-16LE' ? 1 : 0].unpack("B*")[0][0, 5] == "11011"
166
+ codepoint = "hlf.srg."
167
+ else
168
+ codepoint = "invalid"
169
+ end
170
+ when 'UTF-32LE', 'UTF-32BE'
171
+ if char.bytesize != 4
172
+ codepoint = "incompl."
173
+ else
174
+ codepoint = "invalid"
175
+ end
176
+ end
177
+ end
178
+
73
179
  cp_buffer[-1] << Paint[
74
- ("U+%04X" % char.ord).ljust(10), current_color, :bold
180
+ codepoint.ljust(10), current_color, :bold
75
181
  ]
76
182
 
77
- symbolified_char = symbolify(char)
78
- padding = 10 - Unicode::DisplayWidth.of(symbolified_char)
183
+ if char_valid
184
+ symbolified_char = symbolify(char)
185
+ else
186
+ symbolified_char = "�"
187
+ end
188
+
189
+ padding = 10 - Unicode::DisplayWidth.of(symbolified_char, wide_ambiguous ? 2 : 1)
79
190
 
80
191
  enc_buffer[-1] << Paint[
81
192
  symbolified_char, current_color
@@ -92,54 +203,62 @@ module Unibits
92
203
 
93
204
  bin_byte_complete = byte.to_s(2).rjust(8, "0")
94
205
 
95
- case string.encoding.name
96
- when 'US-ASCII'
97
- bin_byte_1 = bin_byte_complete[0...1]
98
- bin_byte_2 = bin_byte_complete[1...8]
99
- when 'ASCII-8BIT'
100
- bin_byte_1 = ""
101
- bin_byte_2 = bin_byte_complete
102
- when 'UTF-8'
103
- if index == 0
104
- if bin_byte_complete =~ /^(0|1{2,4}0)([01]+)$/
206
+ if !char_valid
207
+ bin_byte_1 = bin_byte_complete
208
+ bin_byte_2 = ""
209
+ else
210
+ case string.encoding.name
211
+ when 'US-ASCII'
212
+ bin_byte_1 = bin_byte_complete[0...1]
213
+ bin_byte_2 = bin_byte_complete[1...8]
214
+ when 'ASCII-8BIT'
215
+ bin_byte_1 = ""
216
+ bin_byte_2 = bin_byte_complete
217
+ when 'UTF-8'
218
+ if index == 0
219
+ if bin_byte_complete =~ /^(0|1{2,4}0)([01]+)$/
220
+ bin_byte_1 = $1
221
+ bin_byte_2 = $2
222
+ else
223
+ bin_byte_1 = ""
224
+ bin_byte_2 = bin_byte_complete
225
+ end
226
+ else
227
+ bin_byte_1 = bin_byte_complete[0...2]
228
+ bin_byte_2 = bin_byte_complete[2...8]
229
+ end
230
+ when 'UTF-16LE'
231
+ if char.ord <= 0xFFFF || index == 0 || index == 2
232
+ bin_byte_1 = ""
233
+ bin_byte_2 = bin_byte_complete
234
+ else
235
+ bin_byte_complete =~ /^(11011[01])([01]+)$/
105
236
  bin_byte_1 = $1
106
237
  bin_byte_2 = $2
107
- else
238
+ end
239
+ when 'UTF-16BE'
240
+ if char.ord <= 0xFFFF || index == 1 || index == 3
108
241
  bin_byte_1 = ""
109
242
  bin_byte_2 = bin_byte_complete
243
+ else
244
+ bin_byte_complete =~ /^(11011[01])([01]+)$/
245
+ bin_byte_1 = $1
246
+ bin_byte_2 = $2
110
247
  end
111
- else
112
- bin_byte_1 = bin_byte_complete[0...2]
113
- bin_byte_2 = bin_byte_complete[2...8]
114
- end
115
- when 'UTF-16LE'
116
- if char.ord <= 0xFFFF || index == 0 || index == 2
248
+ when 'UTF-32LE', 'UTF-32BE'
117
249
  bin_byte_1 = ""
118
250
  bin_byte_2 = bin_byte_complete
119
- else
120
- bin_byte_complete =~ /^(11011[01])([01]+)$/
121
- bin_byte_1 = $1
122
- bin_byte_2 = $2
123
- end
124
- when 'UTF-16BE'
125
- if char.ord <= 0xFFFF || index == 1 || index == 3
126
- bin_byte_1 = ""
127
- bin_byte_2 = bin_byte_complete
128
- else
129
- bin_byte_complete =~ /^(11011[01])([01]+)$/
130
- bin_byte_1 = $1
131
- bin_byte_2 = $2
132
251
  end
133
- when 'UTF-32LE', 'UTF-32BE'
134
- bin_byte_1 = ""
135
- bin_byte_2 = bin_byte_complete
136
252
  end
253
+
137
254
  bin_buffer[-1] << Paint[
138
255
  bin_byte_1, current_color
139
256
  ] unless !bin_byte_1 || bin_byte_1.empty?
257
+
140
258
  bin_buffer[-1] << Paint[
141
259
  bin_byte_2, current_color, :underline
142
260
  ] unless !bin_byte_2 || bin_byte_2.empty?
261
+
143
262
  bin_buffer[-1] << " "
144
263
  }
145
264
  }
@@ -157,18 +276,12 @@ module Unibits
157
276
 
158
277
  def self.symbolify(char)
159
278
  return char.inspect unless char.encoding.name[0, 3] == "UTF"
160
- char
161
- .tr("\x00-\x1F".encode(char.encoding), "\u{2400}-\u{241F}".encode(char.encoding))
162
- .gsub(
163
- Regexp.compile('[\p{Space}᠎​‌‍⁠]'.encode(char.encoding)),
164
- ']\0['.encode(char.encoding)
165
- )
166
- .encode('UTF-8')
279
+ Symbolify.symbolify(char).encode('UTF-8')
167
280
  end
168
281
 
169
282
  def self.determine_terminal_cols
170
- STDIN.winsize[1] || 80
283
+ STDIN.winsize[1] || DEFAULT_TERMINAL_WIDTH
171
284
  rescue Errno::ENOTTY
172
- return 80
285
+ return DEFAULT_TERMINAL_WIDTH
173
286
  end
174
287
  end
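
The invalid UTF-8 handling in `visualize` starts by matching each byte's bit prefix (`110…`, `1110…`, `11110…`, `11111…`, `10…`). The sketch below shows only that first step, using integer ranges rather than regular expressions on `unpack("B*")`. The names `utf8_byte_kind` and `describe_bytes` are made up for illustration, and the sketch does not attempt the later overlong, surrogate, or too-large refinements.

```ruby
# Classify a single byte (0..255) by its UTF-8 bit prefix.
def utf8_byte_kind(byte)
  case byte
  when 0x00..0x7F then :ascii         # 0xxxxxxx
  when 0x80..0xBF then :continuation  # 10xxxxxx
  when 0xC0..0xDF then :lead_2        # 110xxxxx, expects 1 continuation byte
  when 0xE0..0xEF then :lead_3        # 1110xxxx, expects 2 continuation bytes
  when 0xF0..0xF7 then :lead_4        # 11110xxx, expects 3 continuation bytes
  else                 :too_large     # 11111xxx, never valid in UTF-8
  end
end

# Pair every byte of a string (valid or not) with its hex value and kind.
def describe_bytes(string)
  string.each_byte.map { |byte| [format("%02X", byte), utf8_byte_kind(byte)] }
end

# A 4-byte lead followed by only two continuation bytes ("not enough")
describe_bytes("\xF0\x9F\x8CA").each { |hex, kind| puts "#{hex} #{kind}" }
```

Working on bytes rather than characters sidesteps the question of how Ruby splits an invalid string into characters, which the real implementation relies on.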
@@ -3,7 +3,7 @@ require_relative '../unibits'
3
3
  module Kernel
4
4
  private
5
5
 
6
- def unibits(string)
7
- Unibits.of(string)
6
+ def unibits(string, **kwargs)
7
+ Unibits.of(string, **kwargs)
8
8
  end
9
9
  end
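
The Kernel method fix above amounts to forwarding keyword arguments with a double splat. A minimal, self-contained sketch of that pattern follows. `BitsSketch` and `bits_sketch` are hypothetical stand-ins for `Unibits` and the `unibits` Kernel method, and the stand-in returns its options instead of printing anything.

```ruby
module BitsSketch
  # Stand-in with the same keyword list as the 1.1.0 Unibits.of signature;
  # it returns the received options so the forwarding can be inspected.
  def self.of(string, encoding: nil, convert: nil, stats: true, wide_ambiguous: false, width: nil)
    raise ArgumentError, "no data given" if !string || string.empty?

    { encoding: encoding, convert: convert, stats: stats,
      wide_ambiguous: wide_ambiguous, width: width }
  end
end

# Wrapper that accepts any keywords and passes them through unchanged.
# Unknown keywords still raise ArgumentError in the target method.
def bits_sketch(string, **kwargs)
  BitsSketch.of(string, **kwargs)
end

p bits_sketch("abc", width: 50, stats: false)
```

With the 1.0.0 signature (`def unibits(string)`), a call like `unibits "abc", width: 50` would have failed with an arity error, which is what the changelog entry refers to.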
@@ -0,0 +1,108 @@
1
+ module Unibits
2
+ module Symbolify
3
+ ASCII_CONTROL_CODEPOINTS = "\x00-\x1F\x7F".freeze
4
+ ASCII_CONTROL_SYMBOLS = "\u{2400}-\u{241F}\u{2421}".freeze
5
+ ASCII_CHARS = "\x20-\x7E".freeze
6
+ TAG_START = "\u{E0001}".freeze
7
+ TAG_START_SYMBOL = "LANG TAG".freeze
8
+ TAG_SPACE = "\u{E0020}".freeze
9
+ TAG_SPACE_SYMBOL = "TAG ␠".freeze
10
+ TAGS = "\u{E0021}-\u{E007E}".freeze
11
+ TAG_DELETE = "\u{E007F}".freeze
12
+ TAG_DELETE_SYMBOL = "TAG ␡".freeze
13
+ INTERESTING_CODEPOINTS = {
14
+ "\u{0080}" => "PAD",
15
+ "\u{0081}" => "HOP",
16
+ "\u{0082}" => "BPH",
17
+ "\u{0083}" => "NBH",
18
+ "\u{0084}" => "IND",
19
+ "\u{0085}" => "NEL",
20
+ "\u{0086}" => "SSA",
21
+ "\u{0087}" => "ESA",
22
+ "\u{0088}" => "HTS",
23
+ "\u{0089}" => "HTJ",
24
+ "\u{008A}" => "VTS",
25
+ "\u{008B}" => "PLD",
26
+ "\u{008C}" => "PLU",
27
+ "\u{008D}" => "RI",
28
+ "\u{008E}" => "SS2",
29
+ "\u{008F}" => "SS3",
30
+ "\u{0090}" => "DCS",
31
+ "\u{0091}" => "PU1",
32
+ "\u{0092}" => "PU2",
33
+ "\u{0093}" => "STS",
34
+ "\u{0094}" => "CCH",
35
+ "\u{0095}" => "MW",
36
+ "\u{0096}" => "SPA",
37
+ "\u{0097}" => "EPA",
38
+ "\u{0098}" => "SOS",
39
+ "\u{0099}" => "SGC",
40
+ "\u{009A}" => "SCI",
41
+ "\u{009B}" => "CSI",
42
+ "\u{009C}" => "ST",
43
+ "\u{009D}" => "OSC",
44
+ "\u{009E}" => "PM",
45
+ "\u{009F}" => "APC",
46
+
47
+ "\u{200E}" => "LRM",
48
+ "\u{200F}" => "RLM",
49
+ "\u{202A}" => "LRE",
50
+ "\u{202B}" => "RLE",
51
+ "\u{202C}" => "PDF",
52
+ "\u{202D}" => "LRO",
53
+ "\u{202E}" => "RLO",
54
+ "\u{2066}" => "LRI",
55
+ "\u{2067}" => "RLI",
56
+ "\u{2068}" => "FSI",
57
+ "\u{2069}" => "PDI",
58
+
59
+ "\u{FE00}" => "VS1",
60
+ "\u{FE01}" => "VS2",
61
+ "\u{FE02}" => "VS3",
62
+ "\u{FE03}" => "VS4",
63
+ "\u{FE04}" => "VS5",
64
+ "\u{FE05}" => "VS6",
65
+ "\u{FE06}" => "VS7",
66
+ "\u{FE07}" => "VS8",
67
+ "\u{FE08}" => "VS9",
68
+ "\u{FE09}" => "VS10",
69
+ "\u{FE0A}" => "VS11",
70
+ "\u{FE0B}" => "VS12",
71
+ "\u{FE0C}" => "VS13",
72
+ "\u{FE0D}" => "VS14",
73
+ "\u{FE0E}" => "VS15",
74
+ "\u{FE0F}" => "VS16",
75
+ }.freeze
76
+ COULD_BE_WHITESPACE = '[\p{Space}­᠎​‌‍⁠⁡⁢⁣⁤⠀𛲠𛲡𛲢𛲣𝅳𝅴𝅵𝅶𝅷𝅸𝅹𝅺]'.freeze
77
+ # UNASSIGNED = '\p{Cn}'.freeze
78
+
79
+ def self.symbolify(char, encoding = char.encoding)
80
+ char.tr!(
81
+ ASCII_CONTROL_CODEPOINTS.encode(encoding),
82
+ ASCII_CONTROL_SYMBOLS.encode(encoding)
83
+ )
84
+ char.gsub!(
85
+ Regexp.compile(COULD_BE_WHITESPACE.encode(encoding)),
86
+ ']\0['.encode(encoding)
87
+ )
88
+ # char.gsub!(
89
+ # Regexp.compile(UNASSIGNED.encode(encoding)),
90
+ # 'n/a'.encode(encoding)
91
+ # )
92
+ INTERESTING_CODEPOINTS.each{ |cp, desc|
93
+ char.gsub! Regexp.compile(cp.encode(encoding)), desc.encode(encoding)
94
+ }
95
+ char.gsub! TAG_START.encode(encoding), TAG_START_SYMBOL.encode(encoding)
96
+ char.gsub! TAG_SPACE.encode(encoding), TAG_SPACE_SYMBOL.encode(encoding)
97
+ char.gsub! TAG_DELETE.encode(encoding), TAG_DELETE_SYMBOL.encode(encoding)
98
+
99
+ ord = char.ord
100
+ if ord > 917536 && ord < 917631
101
+ char.tr!(TAGS.encode(encoding), ASCII_CHARS.encode(encoding))
102
+ char = "TAG ".encode(encoding) + char
103
+ end
104
+
105
+ char
106
+ end
107
+ end
108
+ end
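
The first step of `Symbolify.symbolify` maps ASCII control characters onto the Unicode "Control Pictures" block with `String#tr`. The sketch below isolates that step with the same two character ranges. The constant and method names are chosen for illustration, and it uses the non-mutating `tr` instead of `tr!` so the input string is left unchanged.

```ruby
# Same ranges as ASCII_CONTROL_CODEPOINTS / ASCII_CONTROL_SYMBOLS in the diff:
# U+0000..U+001F map to U+2400..U+241F, and DEL (U+007F) maps to U+2421.
CONTROL_CODEPOINTS = "\x00-\x1F\x7F".freeze
CONTROL_PICTURES   = "\u{2400}-\u{241F}\u{2421}".freeze

# Returns a copy of the string with control characters made visible.
def show_controls(string)
  string.tr(CONTROL_CODEPOINTS, CONTROL_PICTURES)
end

puts show_controls("col1\tcol2\n")
```

Because both ranges have the same length, `tr` pairs them up position by position, so a tab (U+0009) becomes U+2409 and a newline (U+000A) becomes U+240A.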
@@ -1,3 +1,3 @@
1
1
  module Unibits
2
- VERSION = "1.0.0".freeze
2
+ VERSION = "1.1.0".freeze
3
3
  end
Binary file
Binary file
data/spec/unibits_spec.rb CHANGED
@@ -66,5 +66,171 @@ describe Unibits do
66
66
  result.must_match "43"
67
67
  result.must_match "01000011"
68
68
  end
69
+
70
+ describe "invalid UTF-8 encodings" do
71
+ it "- unexpected continuation byte (1/2)" do
72
+ string = "abc\x80efg"
73
+ result = Paint.unpaint(Unibits.visualize(string))
74
+ result.must_match "�"
75
+ result.must_match "unexp.c."
76
+ result.must_match /e.*f.*g/m
77
+ end
78
+
79
+ it "- unexpected continuation byte (2/2)" do
80
+ string = "🌫\x81efg"
81
+ result = Paint.unpaint(Unibits.visualize(string))
82
+ result.must_match "�"
83
+ result.must_match "unexp.c."
84
+ result.must_match /e.*f.*g/m
85
+ end
86
+
87
+ it "- not enough continuation bytes" do
88
+ string = "\xF0\x9F\x8CABC"
89
+ result = Paint.unpaint(Unibits.visualize(string))
90
+ result.must_match "�"
91
+ result.must_match "n.e.con."
92
+ result.must_match /A.*B.*C/m
93
+ end
94
+
95
+ it "- overlong padding (1/2)" do
96
+ string = "\xE0\x81\x81ABC"
97
+ result = Paint.unpaint(Unibits.visualize(string))
98
+ result.must_match "�"
99
+ result.must_match /overlong.*overlong.*overlong/m
100
+ result.must_match /A.*B.*C/m
101
+ end
102
+
103
+ it "- overlong padding (2/2)" do
104
+ string = "\xC0\x80no double null"
105
+ result = Paint.unpaint(Unibits.visualize(string))
106
+ result.must_match "�"
107
+ result.must_match /overlong.*overlong/m
108
+ end
109
+
110
+ it "- too large codepoint (1/2)" do
111
+ string = "\xF5\x8F\xBF\xBFABC"
112
+ result = Paint.unpaint(Unibits.visualize(string))
113
+ result.must_match "�"
114
+ result.must_match /toolarge.*toolarge.*toolarge.*toolarge/m
115
+ result.must_match /A.*B.*C/m
116
+ end
117
+
118
+ it "- too large codepoint (2/2)" do
119
+ string = "\xF4\xAF\xBF\xBFABC"
120
+ result = Paint.unpaint(Unibits.visualize(string))
121
+ result.must_match "�"
122
+ result.must_match /toolarge.*toolarge.*toolarge.*toolarge/m
123
+ result.must_match /A.*B.*C/m
124
+ end
125
+
126
+ it "- too large byte" do
127
+ string = "\xFF"
128
+ result = Paint.unpaint(Unibits.visualize(string))
129
+ result.must_match "�"
130
+ result.must_match "toolarge"
131
+ end
132
+
133
+ it "- has surrogate (1/2)" do
134
+ string = "\xED\xA0\x80ABC"
135
+ result = Paint.unpaint(Unibits.visualize(string))
136
+ result.must_match "�"
137
+ result.must_match "sur.gate"
138
+ result.must_match /A.*B.*C/m
139
+ end
140
+
141
+ it "- has surrogate (2/2)" do
142
+ string = "\xED\xBF\xBFABC"
143
+ result = Paint.unpaint(Unibits.visualize(string))
144
+ result.must_match "�"
145
+ result.must_match "sur.gate"
146
+ result.must_match /A.*B.*C/m
147
+ end
148
+ end
149
+
150
+ describe "invalid UTF-16 encodings" do
151
+ it "- incomplete number of bytes (1/2)" do
152
+ string = "a".b.force_encoding("UTF-16LE")
153
+ result = Paint.unpaint(Unibits.visualize(string))
154
+ result.must_match "incompl."
155
+ result.must_match "�"
156
+ end
157
+
158
+ it "- incomplete number of bytes (2/2)" do
159
+ string = "🌫".b[0..-2].force_encoding("UTF-16LE")
160
+ result = Paint.unpaint(Unibits.visualize(string))
161
+ result.must_match "incompl."
162
+ result.must_match "�"
163
+ end
164
+
165
+ it "- only lower half surrogate" do
166
+ string = "\x3C\xD8\x2Ba".force_encoding("UTF-16LE")
167
+ result = Paint.unpaint(Unibits.visualize(string))
168
+ result.must_match "hlf.srg."
169
+ result.must_match "�"
170
+ end
171
+
172
+ it "- only higher half surrogate" do
173
+ string = "\x3Ca\x2B\xDF".force_encoding("UTF-16LE")
174
+ result = Paint.unpaint(Unibits.visualize(string))
175
+ result.must_match "hlf.srg."
176
+ result.must_match "�"
177
+ end
178
+ end
179
+
180
+ describe "invalid UTF-32 encodings" do
181
+ # please note, currently, too large codepoints and encoded utf16 surrogates are treated as valid encodings
182
+
183
+ it "- incomplete number of bytes (1/3)" do
184
+ string = "a".b.force_encoding("UTF-32LE")
185
+ result = Paint.unpaint(Unibits.visualize(string))
186
+ result.must_match "incompl."
187
+ result.must_match "�"
188
+ end
189
+
190
+ it "- incomplete number of bytes (2/3)" do
191
+ string = "🌫".b[0..-2].force_encoding("UTF-32LE")
192
+ result = Paint.unpaint(Unibits.visualize(string))
193
+ result.must_match "incompl."
194
+ result.must_match "�"
195
+ end
196
+
197
+ it "- incomplete number of bytes (3/3)" do
198
+ string = "🌫".b[0..-2].force_encoding("UTF-32LE")
199
+ result = Paint.unpaint(Unibits.visualize(string))
200
+ result.must_match "incompl."
201
+ result.must_match "�"
202
+ end
203
+ end
204
+
205
+ describe "invalid ASCII encodings" do
206
+ it "- contains bytes with 8th bit set" do
207
+ string = "abc\x80efg".force_encoding("ASCII")
208
+ result = Paint.unpaint(Unibits.visualize(string))
209
+ result.must_match "�"
210
+ result.must_match /e.*f.*g/m
211
+ end
212
+ end
213
+ end
214
+
215
+ describe "wide_ambiguous: option" do
216
+ it "- default is 1" do
217
+ string = "⚀······"
218
+ result = Unibits.stats(string)
219
+ result.wont_match "13"
220
+ end
221
+
222
+ it "- default is 2" do
223
+ string = "⚀······"
224
+ result = Unibits.stats(string, wide_ambiguous: true)
225
+ result.must_match "13"
226
+ end
227
+ end
228
+
229
+ describe "width: option" do
230
+ it "sets a custom column width" do
231
+ string = "bla" * 99
232
+ result = Paint.unpaint(Unibits.visualize(string, width: 50))
233
+ (result[/^.*$/].size <= 50).must_equal true
234
+ end
69
235
  end
70
236
  end
data/unibits.gemspec CHANGED
@@ -17,7 +17,7 @@ Gem::Specification.new do |gem|
17
17
  gem.test_files = gem.files.grep(%r{^(test|spec|features)/})
18
18
  gem.require_paths = ["lib"]
19
19
 
20
- gem.add_dependency 'paint', '~> 2.0'
20
+ gem.add_dependency 'paint', '>= 0.9', '< 3.0'
21
21
  gem.add_dependency 'unicode-display_width', '~> 1.1'
22
22
  gem.add_dependency 'rationalist', '~> 2.0'
23
23
 
metadata CHANGED
@@ -1,29 +1,35 @@
1
1
  --- !ruby/object:Gem::Specification
2
2
  name: unibits
3
3
  version: !ruby/object:Gem::Version
4
- version: 1.0.0
4
+ version: 1.1.0
5
5
  platform: ruby
6
6
  authors:
7
7
  - Jan Lelis
8
8
  autorequire:
9
9
  bindir: bin
10
10
  cert_chain: []
11
- date: 2017-03-05 00:00:00.000000000 Z
11
+ date: 2017-03-08 00:00:00.000000000 Z
12
12
  dependencies:
13
13
  - !ruby/object:Gem::Dependency
14
14
  name: paint
15
15
  requirement: !ruby/object:Gem::Requirement
16
16
  requirements:
17
- - - "~>"
17
+ - - ">="
18
18
  - !ruby/object:Gem::Version
19
- version: '2.0'
19
+ version: '0.9'
20
+ - - "<"
21
+ - !ruby/object:Gem::Version
22
+ version: '3.0'
20
23
  type: :runtime
21
24
  prerelease: false
22
25
  version_requirements: !ruby/object:Gem::Requirement
23
26
  requirements:
24
- - - "~>"
27
+ - - ">="
25
28
  - !ruby/object:Gem::Version
26
- version: '2.0'
29
+ version: '0.9'
30
+ - - "<"
31
+ - !ruby/object:Gem::Version
32
+ version: '3.0'
27
33
  - !ruby/object:Gem::Dependency
28
34
  name: unicode-display_width
29
35
  requirement: !ruby/object:Gem::Requirement
@@ -74,13 +80,16 @@ files:
74
80
  - bin/unibits
75
81
  - lib/unibits.rb
76
82
  - lib/unibits/kernel_method.rb
83
+ - lib/unibits/symbolify.rb
77
84
  - lib/unibits/version.rb
85
+ - screenshots/ascii.invalid.png
78
86
  - screenshots/ascii.png
79
87
  - screenshots/binary.png
80
88
  - screenshots/utf-16be.png
81
89
  - screenshots/utf-16le.png
82
90
  - screenshots/utf-32be.png
83
91
  - screenshots/utf-32le.png
92
+ - screenshots/utf-8.invalid.png
84
93
  - screenshots/utf-8.png
85
94
  - spec/unibits_spec.rb
86
95
  - unibits.gemspec