unibits 1.0.0 → 1.1.0
- checksums.yaml +4 -4
- data/CHANGELOG.md +8 -0
- data/README.md +42 -18
- data/Rakefile +1 -1
- data/bin/unibits +50 -8
- data/lib/unibits.rb +175 -62
- data/lib/unibits/kernel_method.rb +2 -2
- data/lib/unibits/symbolify.rb +108 -0
- data/lib/unibits/version.rb +1 -1
- data/screenshots/ascii.invalid.png +0 -0
- data/screenshots/utf-8.invalid.png +0 -0
- data/spec/unibits_spec.rb +166 -0
- data/unibits.gemspec +1 -1
- metadata +15 -6
checksums.yaml
CHANGED
@@ -1,7 +1,7 @@
|
|
1
1
|
---
|
2
2
|
SHA1:
|
3
|
-
metadata.gz:
|
4
|
-
data.tar.gz:
|
3
|
+
metadata.gz: 3a765c7b1a6edfd62f8fed06b683d0d4ff2cacae
|
4
|
+
data.tar.gz: c66cb699f9a4425fc169fd6562dd4b758b94c584
|
5
5
|
SHA512:
|
6
|
-
metadata.gz:
|
7
|
-
data.tar.gz:
|
6
|
+
metadata.gz: a516728fe08185386fc1902fa341784517521d0a85e494e35a7e2700c1c6d7fc3dfc8f86bb20d4268de6d69df591ab431aa3ab5ee02d96f6798fcd97f2426a8e
|
7
|
+
data.tar.gz: cb2fbfea68be359661ad75cc6e203bd4235ccd81431a37f74f965e90f6116e1a12d8cb04094a213f537e2a032c0c25775edd08378cd7c8ae613503dbe5cc3e0e
|
data/CHANGELOG.md
CHANGED
@@ -1,5 +1,13 @@
|
|
1
1
|
## CHANGELOG
|
2
2
|
|
3
|
+
### 1.1.0
|
4
|
+
|
5
|
+
* Support (and highlight) invalid encodings \o/
|
6
|
+
* Improve character symbolification
|
7
|
+
* Fix that the Kernel method would not take keyword arguments
|
8
|
+
* New option for setting a custom output width to use
|
9
|
+
* New option for activating wide ambiguous characters
|
10
|
+
|
3
11
|
### 1.0.0
|
4
12
|
|
5
13
|
* Initial release
|
data/README.md
CHANGED
@@ -1,18 +1,24 @@
|
|
1
|
-
# unibits [![[version]](https://badge.fury.io/rb/unibits.svg)](http://badge.fury.io/rb/unibits) [![[travis]](https://travis-ci.org/janlelis/unibits.svg)](https://travis-ci.org/janlelis/unibits)
|
1
|
+
# unibits | Reveal the Unicode [![[version]](https://badge.fury.io/rb/unibits.svg)](http://badge.fury.io/rb/unibits) [![[travis]](https://travis-ci.org/janlelis/unibits.svg)](https://travis-ci.org/janlelis/unibits)
|
2
2
|
|
3
|
-
Ruby library and CLI command that visualizes various Unicode and ASCII encodings in the terminal
|
3
|
+
Ruby library and CLI command that visualizes various Unicode and ASCII encodings in the terminal:
|
4
4
|
|
5
|
-
|
6
|
-
|
7
|
-
|
5
|
+
- Makes analyzing encodings easier
|
6
|
+
- Helps you with debugging strings
|
7
|
+
- Supports **UTF-8**, **UTF-16LE**/**UTF-16BE**, **UTF-32LE**/**UTF-32BE**, arbitrary **BINARY** data, and **ASCII**
|
8
|
+
- Highlights invalid encodings
|
8
9
|
|
9
10
|
## Setup
|
10
11
|
|
12
|
+
Make sure you have Ruby installed and installing gems works properly. Then do:
|
13
|
+
|
11
14
|
```
|
12
15
|
$ gem install unibits
|
13
16
|
```
|
14
17
|
|
15
18
|
## Usage
|
19
|
+
|
20
|
+
Pass the string to debug to unibits:
|
21
|
+
|
16
22
|
### From CLI
|
17
23
|
|
18
24
|
```
|
@@ -26,17 +32,17 @@ require 'unibits/kernel_method'
|
|
26
32
|
unibits "🌫 Idiosyncrätic ℜսᖯʏ"
|
27
33
|
```
|
28
34
|
|
29
|
-
### Options
|
35
|
+
### Advanced Options
|
30
36
|
|
31
|
-
`unibits` takes
|
37
|
+
`unibits` accepts the following options:
|
32
38
|
|
33
39
|
- *encoding (e)*: The encoding of the given string (uses your default encoding if none given)
|
34
40
|
- *convert (c)*: An encoding the string should be converted to before visualizing it
|
35
41
|
- *stats*: Whether to show a short stats header (default: `true`), you can deactivate on the CLI with `--no-stats`
|
42
|
+
- *wide-ambiguous*: Treat characters of ambiguous width as 2 spaces instead of 1 ([more info](https://github.com/janlelis/unicode-display_width))
|
43
|
+
- *width (w)*: Set a custom column width, if not set, *unibits* will retrieve it from the terminal or just use 80
|
36
44
|
|
37
|
-
|
38
|
-
|
39
|
-
## Output for Different Encodings
|
45
|
+
## Output of Different Valid Encodings
|
40
46
|
### UTF-8
|
41
47
|
|
42
48
|
CLI: `$ unibits -e utf-8 -c utf-8 "🌫 Idiosyncrätic ℜսᖯʏ"`
|
@@ -47,33 +53,33 @@ Ruby: `unibits "🌫 Idiosyncrätic ℜսᖯʏ", encoding: 'utf-8', convert:
|
|
47
53
|
|
48
54
|
### UTF-16LE
|
49
55
|
|
50
|
-
CLI: `$ unibits -e utf-8 -c utf-
|
56
|
+
CLI: `$ unibits -e utf-8 -c utf-16le "🌫 Idiosyncrätic ℜսᖯʏ"`
|
51
57
|
|
52
|
-
Ruby: `unibits "🌫 Idiosyncrätic ℜսᖯʏ", encoding: 'utf-8', convert: 'utf-
|
58
|
+
Ruby: `unibits "🌫 Idiosyncrätic ℜսᖯʏ", encoding: 'utf-8', convert: 'utf-16le'`
|
53
59
|
|
54
60
|

|
55
61
|
|
56
62
|
### UTF-16BE
|
57
63
|
|
58
|
-
CLI: `$ unibits -e utf-8 -c utf-
|
64
|
+
CLI: `$ unibits -e utf-8 -c utf-16be "🌫 Idiosyncrätic ℜսᖯʏ"`
|
59
65
|
|
60
|
-
Ruby: `unibits "🌫 Idiosyncrätic ℜսᖯʏ", encoding: 'utf-8', convert: 'utf-
|
66
|
+
Ruby: `unibits "🌫 Idiosyncrätic ℜսᖯʏ", encoding: 'utf-8', convert: 'utf-16be'`
|
61
67
|
|
62
68
|

|
63
69
|
|
64
70
|
### UTF-32LE
|
65
71
|
|
66
|
-
CLI: `$ unibits -e utf-8 -c utf-
|
72
|
+
CLI: `$ unibits -e utf-8 -c utf-32le "🌫 Idiosyncrätic ℜսᖯʏ"`
|
67
73
|
|
68
|
-
Ruby: `unibits "🌫 Idiosyncrätic ℜսᖯʏ", encoding: 'utf-8', convert: 'utf-
|
74
|
+
Ruby: `unibits "🌫 Idiosyncrätic ℜսᖯʏ", encoding: 'utf-8', convert: 'utf-32le'`
|
69
75
|
|
70
76
|

|
71
77
|
|
72
78
|
### UTF-32BE
|
73
79
|
|
74
|
-
CLI: `$ unibits -e utf-8 -c utf-
|
80
|
+
CLI: `$ unibits -e utf-8 -c utf-32be "🌫 Idiosyncrätic ℜսᖯʏ"`
|
75
81
|
|
76
|
-
Ruby: `unibits "🌫 Idiosyncrätic ℜսᖯʏ", encoding: 'utf-8', convert: 'utf-
|
82
|
+
Ruby: `unibits "🌫 Idiosyncrätic ℜսᖯʏ", encoding: 'utf-8', convert: 'utf-32be'`
|
77
83
|
|
78
84
|

|
79
85
|
|
@@ -93,6 +99,23 @@ Ruby: `unibits "ASCII String", encoding: 'utf-8', convert: 'ascii'`
|
|
93
99
|
|
94
100
|

|
95
101
|
|
102
|
+
## Invalid Encodings
|
103
|
+
### UTF-8
|
104
|
+
|
105
|
+
Example in Ruby: `unibits "unexpected \x80 | not enough \xF0\x9F\x8C | overlong \xE0\x81\x81 | surrogate \xED\xA0\x80 | too large \xF5\x8F\xBF\xBF"`
|
106
|
+
|
107
|
+

|
108
|
+
|
109
|
+
### ASCII
|
110
|
+
|
111
|
+
Example in Ruby: `unibits "🌫 Idiosyncrätic ℜսᖯʏ", encoding: 'ascii'`
|
112
|
+
|
113
|
+

|
114
|
+
|
115
|
+
### BINARY
|
116
|
+
|
117
|
+
(not possible to produce invalid binary strings)
|
118
|
+
|
96
119
|
## Notes
|
97
120
|
|
98
121
|
Also see
|
@@ -101,6 +124,7 @@ Also see
|
|
101
124
|
- [UTF-16 (Wikipedia)](https://en.wikipedia.org/wiki/UTF-16#Description)
|
102
125
|
- [UTF-32 (Wikipedia)](https://en.wikipedia.org/wiki/UTF-32)
|
103
126
|
- [Ruby's Encoding class](https://ruby-doc.org/core/Encoding.html)
|
127
|
+
- [Difference between BINARY and ASCII](http://idiosyncratic-ruby.com/56-us-ascii-8bit.html)
|
104
128
|
- [Unicode Micro Libraries for Ruby](https://github.com/janlelis/unicode-x)
|
105
129
|
|
106
130
|
Lots of thanks to @damienklinnert for the motivation and inspiration required to build this! 🎆
|
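The byte sequences from the README's new "Invalid Encodings" section can be checked without unibits. This is a minimal sketch that uses only Ruby's built-in `String#valid_encoding?`; the `SAMPLES` constant and its labels are illustrative names and not part of the gem.

```ruby
# Byte sequences taken from the README example; the labels are illustrative.
SAMPLES = {
  "unexpected continuation byte" => "\x80",
  "not enough continuation bytes" => "\xF0\x9F\x8C",
  "overlong encoding"             => "\xE0\x81\x81",
  "encoded surrogate"             => "\xED\xA0\x80",
  "codepoint too large"           => "\xF5\x8F\xBF\xBF",
}.freeze

SAMPLES.each do |label, bytes|
  # Tag the raw bytes as UTF-8 and ask Ruby whether they are well-formed.
  string = bytes.dup.force_encoding(Encoding::UTF_8)
  puts format("%-30s valid? %s", label, string.valid_encoding?)
end

# Non-ASCII bytes relabelled as US-ASCII are invalid as well.
ascii = "Idiosyncr\u00E4tic".dup.force_encoding(Encoding::US_ASCII)
puts format("%-30s valid? %s", "UTF-8 bytes as US-ASCII", ascii.valid_encoding?)
```

Ruby only reports whether a string is valid. Working out why it is invalid is the part unibits 1.1.0 adds.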
data/Rakefile
CHANGED
data/bin/unibits
CHANGED
@@ -7,14 +7,16 @@ argv = Rationalist.parse(
|
|
7
7
|
ARGV,
|
8
8
|
string: '_',
|
9
9
|
alias: {
|
10
|
-
e: 'encoding',
|
11
10
|
c: 'convert',
|
11
|
+
e: 'encoding',
|
12
12
|
v: 'version',
|
13
|
+
w: 'width',
|
13
14
|
},
|
14
15
|
boolean: [
|
15
16
|
'help',
|
17
|
+
'stats',
|
16
18
|
'version',
|
17
|
-
'
|
19
|
+
'wide-ambiguous',
|
18
20
|
],
|
19
21
|
default: {
|
20
22
|
stats: true
|
@@ -22,21 +24,54 @@ argv = Rationalist.parse(
|
|
22
24
|
)
|
23
25
|
|
24
26
|
if argv[:version]
|
25
|
-
puts Unibits::VERSION
|
27
|
+
puts "unibits #{Unibits::VERSION} by #{Paint["J-_-L", :bold]} <https://github.com/janlelis/unibits>"
|
26
28
|
exit(0)
|
27
29
|
end
|
28
30
|
|
29
31
|
if argv[:help]
|
30
32
|
puts <<-HELP
|
31
33
|
|
34
|
+
#{Paint["DESCRIPTION", :underline]}
|
35
|
+
|
36
|
+
Visualizes ANSI and various Unicode encodings in the terminal.
|
37
|
+
|
32
38
|
#{Paint["USAGE", :underline]}
|
33
39
|
|
34
|
-
#{Paint["unibits", :bold]} [
|
40
|
+
#{Paint["unibits", :bold]} [options] data
|
41
|
+
|
42
|
+
--encoding <encoding> | -e | which encoding to use for given data
|
43
|
+
--convert <encoding> | -c | which encoding to convert to (if possible)
|
44
|
+
--width <n> | -w | force a specific number of terminal columns
|
45
|
+
--no-stats | | no stats header with length info
|
46
|
+
--version | | displays version of unibits
|
47
|
+
--wide-ambiguous | | treat ambiguous-width characters as 2 columns wide
|
48
|
+
|
49
|
+
#{Paint["ENCODINGS", :underline]}
|
50
|
+
|
51
|
+
#{Unibits::SUPPORTED_ENCODINGS.join(', ')}
|
52
|
+
|
53
|
+
#{Paint["STATS", :underline]}
|
54
|
+
|
55
|
+
( bytes / codepoints / glyphs / expected terminal width )
|
56
|
+
|
57
|
+
#{Paint["INVALID BYTES", :underline]}
|
58
|
+
|
59
|
+
UTF-8
|
60
|
+
|
61
|
+
n.e.con. | not enough continuation bytes
|
62
|
+
unexp.c. | unexpected continuation byte
|
63
|
+
toolarge | codepoint value exceeds maximum allowed (U+10FFFF)
|
64
|
+
sur.gate | codepoint value would be a surrogate half
|
65
|
+
overlong | unnecessary null padding
|
66
|
+
|
67
|
+
UTF-16
|
68
|
+
|
69
|
+
incompl. | not enough bytes left to finish codepoint
|
70
|
+
hlf.srg. | other half of surrogate missing
|
35
71
|
|
36
|
-
|
37
|
-
Explanation of stats: bytes / codepoints / glyphs / expected terminal width
|
72
|
+
#{Paint["MORE INFO", :underline]}
|
38
73
|
|
39
|
-
|
74
|
+
https://github.com/janlelis/unibits
|
40
75
|
|
41
76
|
HELP
|
42
77
|
exit(0)
|
@@ -51,7 +86,14 @@ else
|
|
51
86
|
end
|
52
87
|
|
53
88
|
begin
|
54
|
-
Unibits.of(
|
89
|
+
Unibits.of(
|
90
|
+
data,
|
91
|
+
encoding: argv[:encoding],
|
92
|
+
convert: argv[:convert],
|
93
|
+
stats: argv[:stats],
|
94
|
+
wide_ambiguous: argv[:'wide-ambiguous'],
|
95
|
+
width: argv[:width],
|
96
|
+
)
|
55
97
|
rescue ArgumentError
|
56
98
|
$stderr.puts Paint[$!.message, :red]
|
57
99
|
exit(1)
|
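The STATS line in the help text lists bytes, codepoints, glyphs and expected terminal width. The first three can be approximated with core Ruby alone, as sketched here for UTF-8 input. `basic_stats` is a hypothetical helper name. The width field is left out because it depends on the unicode-display_width gem.

```ruby
# Rough equivalent of the first three stats fields for a UTF-8 string.
def basic_stats(string)
  {
    bytes:      string.bytesize,        # raw storage size
    codepoints: string.size,            # number of Unicode codepoints
    glyphs:     string.scan(/\X/).size, # \X matches one extended grapheme cluster
  }
end

# "e" followed by U+0301 COMBINING ACUTE ACCENT
p basic_stats("e\u0301")
```

For this input the three counts differ: three bytes, two codepoints, and one glyph.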
data/lib/unibits.rb
CHANGED
@@ -1,4 +1,5 @@
|
|
1
1
|
require_relative "unibits/version"
|
2
|
+
require_relative "unibits/symbolify"
|
2
3
|
|
3
4
|
require "io/console"
|
4
5
|
require "paint"
|
@@ -14,8 +15,9 @@ module Unibits
|
|
14
15
|
'ASCII-8BIT',
|
15
16
|
'US-ASCII',
|
16
17
|
].freeze
|
18
|
+
DEFAULT_TERMINAL_WIDTH = 80
|
17
19
|
|
18
|
-
def self.of(string, encoding: nil, convert: nil, stats: true)
|
20
|
+
def self.of(string, encoding: nil, convert: nil, stats: true, wide_ambiguous: false, width: nil)
|
19
21
|
if !string || string.empty?
|
20
22
|
raise ArgumentError, "no data given to unibits"
|
21
23
|
end
|
@@ -25,12 +27,8 @@ module Unibits
|
|
25
27
|
|
26
28
|
case string.encoding.name
|
27
29
|
when *SUPPORTED_ENCODINGS
|
28
|
-
|
29
|
-
|
30
|
-
end
|
31
|
-
|
32
|
-
puts stats(string) if stats
|
33
|
-
puts visualize(string)
|
30
|
+
puts stats(string, wide_ambiguous: wide_ambiguous) if stats
|
31
|
+
puts visualize(string, wide_ambiguous: wide_ambiguous, width: width)
|
34
32
|
when 'UTF-16', 'UTF-32'
|
35
33
|
raise ArgumentError, "unibits only supports #{string.encoding.name} with specified endianness, please use #{string.encoding.name}LE or #{string.encoding.name}BE"
|
36
34
|
else
|
@@ -38,27 +36,36 @@ module Unibits
|
|
38
36
|
end
|
39
37
|
end
|
40
38
|
|
41
|
-
def self.stats(string)
|
42
|
-
|
43
|
-
|
44
|
-
|
45
|
-
|
46
|
-
|
47
|
-
|
39
|
+
def self.stats(string, wide_ambiguous: false)
|
40
|
+
valid = string.valid_encoding?
|
41
|
+
bytes = string.bytesize rescue "?"
|
42
|
+
codepoints = string.size rescue "?"
|
43
|
+
glyphs = string.scan(Regexp.compile('\X'.encode(string.encoding))).size rescue "?"
|
44
|
+
width = Unicode::DisplayWidth.of(string, wide_ambiguous ? 2 : 1) rescue "?"
|
45
|
+
|
46
|
+
"\n #{valid ? '' : Paint["Invalid ", :bold, :red]}#{Paint[string.encoding.name, :bold]} (#{bytes}/#{codepoints}/#{glyphs}/#{width})"
|
48
47
|
end
|
49
48
|
|
50
|
-
def self.visualize(string)
|
51
|
-
cols = determine_terminal_cols
|
49
|
+
def self.visualize(string, wide_ambiguous: false, width: nil)
|
50
|
+
cols = width || determine_terminal_cols
|
52
51
|
|
53
52
|
cp_buffer = [" "]
|
54
53
|
enc_buffer = [" "]
|
55
54
|
hex_buffer = [" "]
|
56
55
|
bin_buffer = [" "]
|
57
56
|
separator = [" "]
|
57
|
+
current_encoding_error = nil
|
58
58
|
|
59
59
|
puts
|
60
60
|
string.each_char{ |char|
|
61
|
-
|
61
|
+
if char.valid_encoding?
|
62
|
+
char_valid = true
|
63
|
+
current_color = random_color
|
64
|
+
current_encoding_error = nil
|
65
|
+
else
|
66
|
+
char_valid = false
|
67
|
+
current_color = :red
|
68
|
+
end
|
62
69
|
|
63
70
|
char.each_byte.with_index{ |byte, index|
|
64
71
|
if Paint.unpaint(hex_buffer[-1]).bytesize > cols - 12
|
@@ -70,12 +77,116 @@ module Unibits
|
|
70
77
|
end
|
71
78
|
|
72
79
|
if index == 0
|
80
|
+
if char_valid
|
81
|
+
codepoint = "U+%04X" % char.ord
|
82
|
+
else
|
83
|
+
case string.encoding.name
|
84
|
+
when "US-ASCII"
|
85
|
+
codepoint = "invalid"
|
86
|
+
when "UTF-8"
|
87
|
+
# this tries to detect what is wrong with this utf-8 encoded string
|
88
|
+
# sorry for this mess
|
89
|
+
case char.unpack("B*")[0]
|
90
|
+
when /^110.{5}$/
|
91
|
+
current_encoding_error = [:nec, 1, 1]
|
92
|
+
codepoint = "n.e.con."
|
93
|
+
when /^1110(.{4})$/
|
94
|
+
if $1 == "1101"
|
95
|
+
current_encoding_error = [:nec, 2, 2, :maybe_surrogate]
|
96
|
+
else
|
97
|
+
current_encoding_error = [:nec, 2, 2]
|
98
|
+
end
|
99
|
+
codepoint = "n.e.con."
|
100
|
+
when /^11110(.{3})$/
|
101
|
+
case $1
|
102
|
+
when "100"
|
103
|
+
current_encoding_error = [:nec, 3, 3, :leading_at_max]
|
104
|
+
when "101", "110", "111"
|
105
|
+
current_encoding_error = [:nec, 3, 3, :too_large]
|
106
|
+
else
|
107
|
+
current_encoding_error = [:nec, 3, 3]
|
108
|
+
end
|
109
|
+
codepoint = "n.e.con."
|
110
|
+
when /^11111.{3}$/
|
111
|
+
codepoint = "toolarge"
|
112
|
+
when /^10(.{2}).{4}$/
|
113
|
+
# uglyhack to fixup that it is not n.e.c, but something different
|
114
|
+
if current_encoding_error && current_encoding_error[0] == :nec
|
115
|
+
if current_encoding_error[3] == :leading_at_max
|
116
|
+
if $1 != "00"
|
117
|
+
current_encoding_error[3] = :too_large
|
118
|
+
else
|
119
|
+
current_encoding_error[3] = nil
|
120
|
+
end
|
121
|
+
elsif current_encoding_error[3] == :maybe_surrogate
|
122
|
+
if $1[0] == "1"
|
123
|
+
current_encoding_error[3] = :surrogate
|
124
|
+
else
|
125
|
+
current_encoding_error[3] = nil
|
126
|
+
end
|
127
|
+
end
|
128
|
+
|
129
|
+
if current_encoding_error[1] > 1
|
130
|
+
current_encoding_error[1] -= 1
|
131
|
+
codepoint = "n.e.con."
|
132
|
+
else
|
133
|
+
case current_encoding_error[3]
|
134
|
+
when :too_large
|
135
|
+
actual_error = "toolarge"
|
136
|
+
when :surrogate
|
137
|
+
actual_error = "sur.gate"
|
138
|
+
else
|
139
|
+
actual_error = "overlong"
|
140
|
+
end
|
141
|
+
current_cp_buffer_index = -1
|
142
|
+
(current_encoding_error[2]).times{
|
143
|
+
if index = cp_buffer[current_cp_buffer_index].rindex("n.e.con.")
|
144
|
+
cp_buffer[current_cp_buffer_index][index..-1] = cp_buffer[current_cp_buffer_index][index..-1].sub("n.e.con.", actual_error)
|
145
|
+
else
|
146
|
+
current_cp_buffer_index -= 1
|
147
|
+
index = cp_buffer[current_cp_buffer_index].rindex("n.e.con.")
|
148
|
+
cp_buffer[current_cp_buffer_index][index..-1] = cp_buffer[current_cp_buffer_index][index..-1].sub("n.e.con.", actual_error)
|
149
|
+
end
|
150
|
+
current_encoding_error = [:overlong]
|
151
|
+
codepoint = actual_error
|
152
|
+
}
|
153
|
+
end
|
154
|
+
else
|
155
|
+
current_encoding_error = [:unexp]
|
156
|
+
codepoint = "unexp.c."
|
157
|
+
end
|
158
|
+
else
|
159
|
+
current_encoding_error = [:invalid]
|
160
|
+
codepoint = "invalid"
|
161
|
+
end
|
162
|
+
when 'UTF-16LE', 'UTF-16BE'
|
163
|
+
if char.bytesize.odd?
|
164
|
+
codepoint = "incompl."
|
165
|
+
elsif char.b[string.encoding.name == 'UTF-16LE' ? 1 : 0].unpack("B*")[0][0, 5] == "11011"
|
166
|
+
codepoint = "hlf.srg."
|
167
|
+
else
|
168
|
+
codepoint = "invalid"
|
169
|
+
end
|
170
|
+
when 'UTF-32LE', 'UTF-32BE'
|
171
|
+
if char.bytesize != 4
|
172
|
+
codepoint = "incompl."
|
173
|
+
else
|
174
|
+
codepoint = "invalid"
|
175
|
+
end
|
176
|
+
end
|
177
|
+
end
|
178
|
+
|
73
179
|
cp_buffer[-1] << Paint[
|
74
|
-
|
180
|
+
codepoint.ljust(10), current_color, :bold
|
75
181
|
]
|
76
182
|
|
77
|
-
|
78
|
-
|
183
|
+
if char_valid
|
184
|
+
symbolified_char = symbolify(char)
|
185
|
+
else
|
186
|
+
symbolified_char = "�"
|
187
|
+
end
|
188
|
+
|
189
|
+
padding = 10 - Unicode::DisplayWidth.of(symbolified_char, wide_ambiguous ? 2 : 1)
|
79
190
|
|
80
191
|
enc_buffer[-1] << Paint[
|
81
192
|
symbolified_char, current_color
|
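The UTF-8 branch in the hunk above decides what went wrong by matching each byte's bit pattern. A much smaller sketch of the same idea follows. `utf8_byte_kind` is a hypothetical helper and not the gem's code, and it only classifies one byte in isolation.

```ruby
# Classify a single byte by its leading UTF-8 bits (illustrative helper).
def utf8_byte_kind(byte)
  case format("%08b", byte)
  when /\A0/     then :ascii        # 0xxxxxxx
  when /\A10/    then :continuation # 10xxxxxx
  when /\A110/   then :lead_2       # 110xxxxx, expects 1 continuation byte
  when /\A1110/  then :lead_3       # 1110xxxx, expects 2 continuation bytes
  when /\A11110/ then :lead_4       # 11110xxx, expects 3 continuation bytes
  else                :invalid      # 11111xxx never occurs in UTF-8
  end
end

"\xF0\x9F\x8C".each_byte do |byte|
  puts format("%02X %s", byte, utf8_byte_kind(byte))
end
```

This pattern check does not by itself catch overlong forms, surrogates, or values above U+10FFFF. For those the gem's code also inspects the payload bits of the following bytes.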
@@ -92,54 +203,62 @@ module Unibits
|
|
92
203
|
|
93
204
|
bin_byte_complete = byte.to_s(2).rjust(8, "0")
|
94
205
|
|
95
|
-
|
96
|
-
|
97
|
-
|
98
|
-
|
99
|
-
|
100
|
-
|
101
|
-
|
102
|
-
|
103
|
-
|
104
|
-
|
206
|
+
if !char_valid
|
207
|
+
bin_byte_1 = bin_byte_complete
|
208
|
+
bin_byte_2 = ""
|
209
|
+
else
|
210
|
+
case string.encoding.name
|
211
|
+
when 'US-ASCII'
|
212
|
+
bin_byte_1 = bin_byte_complete[0...1]
|
213
|
+
bin_byte_2 = bin_byte_complete[1...8]
|
214
|
+
when 'ASCII-8BIT'
|
215
|
+
bin_byte_1 = ""
|
216
|
+
bin_byte_2 = bin_byte_complete
|
217
|
+
when 'UTF-8'
|
218
|
+
if index == 0
|
219
|
+
if bin_byte_complete =~ /^(0|1{2,4}0)([01]+)$/
|
220
|
+
bin_byte_1 = $1
|
221
|
+
bin_byte_2 = $2
|
222
|
+
else
|
223
|
+
bin_byte_1 = ""
|
224
|
+
bin_byte_2 = bin_byte_complete
|
225
|
+
end
|
226
|
+
else
|
227
|
+
bin_byte_1 = bin_byte_complete[0...2]
|
228
|
+
bin_byte_2 = bin_byte_complete[2...8]
|
229
|
+
end
|
230
|
+
when 'UTF-16LE'
|
231
|
+
if char.ord <= 0xFFFF || index == 0 || index == 2
|
232
|
+
bin_byte_1 = ""
|
233
|
+
bin_byte_2 = bin_byte_complete
|
234
|
+
else
|
235
|
+
bin_byte_complete =~ /^(11011[01])([01]+)$/
|
105
236
|
bin_byte_1 = $1
|
106
237
|
bin_byte_2 = $2
|
107
|
-
|
238
|
+
end
|
239
|
+
when 'UTF-16BE'
|
240
|
+
if char.ord <= 0xFFFF || index == 1 || index == 3
|
108
241
|
bin_byte_1 = ""
|
109
242
|
bin_byte_2 = bin_byte_complete
|
243
|
+
else
|
244
|
+
bin_byte_complete =~ /^(11011[01])([01]+)$/
|
245
|
+
bin_byte_1 = $1
|
246
|
+
bin_byte_2 = $2
|
110
247
|
end
|
111
|
-
|
112
|
-
bin_byte_1 = bin_byte_complete[0...2]
|
113
|
-
bin_byte_2 = bin_byte_complete[2...8]
|
114
|
-
end
|
115
|
-
when 'UTF-16LE'
|
116
|
-
if char.ord <= 0xFFFF || index == 0 || index == 2
|
248
|
+
when 'UTF-32LE', 'UTF-32BE'
|
117
249
|
bin_byte_1 = ""
|
118
250
|
bin_byte_2 = bin_byte_complete
|
119
|
-
else
|
120
|
-
bin_byte_complete =~ /^(11011[01])([01]+)$/
|
121
|
-
bin_byte_1 = $1
|
122
|
-
bin_byte_2 = $2
|
123
|
-
end
|
124
|
-
when 'UTF-16BE'
|
125
|
-
if char.ord <= 0xFFFF || index == 1 || index == 3
|
126
|
-
bin_byte_1 = ""
|
127
|
-
bin_byte_2 = bin_byte_complete
|
128
|
-
else
|
129
|
-
bin_byte_complete =~ /^(11011[01])([01]+)$/
|
130
|
-
bin_byte_1 = $1
|
131
|
-
bin_byte_2 = $2
|
132
251
|
end
|
133
|
-
when 'UTF-32LE', 'UTF-32BE'
|
134
|
-
bin_byte_1 = ""
|
135
|
-
bin_byte_2 = bin_byte_complete
|
136
252
|
end
|
253
|
+
|
137
254
|
bin_buffer[-1] << Paint[
|
138
255
|
bin_byte_1, current_color
|
139
256
|
] unless !bin_byte_1 || bin_byte_1.empty?
|
257
|
+
|
140
258
|
bin_buffer[-1] << Paint[
|
141
259
|
bin_byte_2, current_color, :underline
|
142
260
|
] unless !bin_byte_2 || bin_byte_2.empty?
|
261
|
+
|
143
262
|
bin_buffer[-1] << " "
|
144
263
|
}
|
145
264
|
}
|
@@ -157,18 +276,12 @@ module Unibits
|
|
157
276
|
|
158
277
|
def self.symbolify(char)
|
159
278
|
return char.inspect unless char.encoding.name[0, 3] == "UTF"
|
160
|
-
char
|
161
|
-
.tr("\x00-\x1F".encode(char.encoding), "\u{2400}-\u{241F}".encode(char.encoding))
|
162
|
-
.gsub(
|
163
|
-
Regexp.compile('[\p{Space}]'.encode(char.encoding)),
|
164
|
-
']\0['.encode(char.encoding)
|
165
|
-
)
|
166
|
-
.encode('UTF-8')
|
279
|
+
Symbolify.symbolify(char).encode('UTF-8')
|
167
280
|
end
|
168
281
|
|
169
282
|
def self.determine_terminal_cols
|
170
|
-
STDIN.winsize[1] ||
|
283
|
+
STDIN.winsize[1] || DEFAULT_TERMINAL_WIDTH
|
171
284
|
rescue Errno::ENOTTY
|
172
|
-
return
|
285
|
+
return DEFAULT_TERMINAL_WIDTH
|
173
286
|
end
|
174
287
|
end
|
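The new `width:` option and `DEFAULT_TERMINAL_WIDTH` give a three-step lookup: explicit width, then terminal size, then 80. The following sketch shows that order using only the standard library. `terminal_cols` and `FALLBACK_COLS` are hypothetical names, and using `IO.console` instead of `STDIN` is an assumption made so the sketch also runs without a TTY.

```ruby
require "io/console"

FALLBACK_COLS = 80 # mirrors the value of DEFAULT_TERMINAL_WIDTH in the diff

# Explicit width wins; otherwise ask the terminal; otherwise fall back.
def terminal_cols(explicit_width = nil)
  return explicit_width if explicit_width

  size = IO.console&.winsize # [rows, cols], or nil when no console exists
  size ? size[1] : FALLBACK_COLS
rescue SystemCallError       # e.g. Errno::ENOTTY
  FALLBACK_COLS
end

puts terminal_cols(50)
puts terminal_cols
```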
@@ -0,0 +1,108 @@
|
|
1
|
+
module Unibits
|
2
|
+
module Symbolify
|
3
|
+
ASCII_CONTROL_CODEPOINTS = "\x00-\x1F\x7F".freeze
|
4
|
+
ASCII_CONTROL_SYMBOLS = "\u{2400}-\u{241F}\u{2421}".freeze
|
5
|
+
ASCII_CHARS = "\x20-\x7E".freeze
|
6
|
+
TAG_START = "\u{E0001}".freeze
|
7
|
+
TAG_START_SYMBOL = "LANG TAG".freeze
|
8
|
+
TAG_SPACE = "\u{E0020}".freeze
|
9
|
+
TAG_SPACE_SYMBOL = "TAG ␠".freeze
|
10
|
+
TAGS = "\u{E0021}-\u{E007E}".freeze
|
11
|
+
TAG_DELETE = "\u{E007F}".freeze
|
12
|
+
TAG_DELETE_SYMBOL = "TAG ␡".freeze
|
13
|
+
INTERESTING_CODEPOINTS = {
|
14
|
+
"\u{0080}" => "PAD",
|
15
|
+
"\u{0081}" => "HOP",
|
16
|
+
"\u{0082}" => "BPH",
|
17
|
+
"\u{0083}" => "NBH",
|
18
|
+
"\u{0084}" => "IND",
|
19
|
+
"\u{0085}" => "NEL",
|
20
|
+
"\u{0086}" => "SSA",
|
21
|
+
"\u{0087}" => "ESA",
|
22
|
+
"\u{0088}" => "HTS",
|
23
|
+
"\u{0089}" => "HTJ",
|
24
|
+
"\u{008A}" => "VTS",
|
25
|
+
"\u{008B}" => "PLD",
|
26
|
+
"\u{008C}" => "PLU",
|
27
|
+
"\u{008D}" => "RI",
|
28
|
+
"\u{008E}" => "SS2",
|
29
|
+
"\u{008F}" => "SS3",
|
30
|
+
"\u{0090}" => "DCS",
|
31
|
+
"\u{0091}" => "PU1",
|
32
|
+
"\u{0092}" => "PU2",
|
33
|
+
"\u{0093}" => "STS",
|
34
|
+
"\u{0094}" => "CCH",
|
35
|
+
"\u{0095}" => "MW",
|
36
|
+
"\u{0096}" => "SPA",
|
37
|
+
"\u{0097}" => "EPA",
|
38
|
+
"\u{0098}" => "SOS",
|
39
|
+
"\u{0099}" => "SGC",
|
40
|
+
"\u{009A}" => "SCI",
|
41
|
+
"\u{009B}" => "CSI",
|
42
|
+
"\u{009C}" => "ST",
|
43
|
+
"\u{009D}" => "OSC",
|
44
|
+
"\u{009E}" => "PM",
|
45
|
+
"\u{009F}" => "APC",
|
46
|
+
|
47
|
+
"\u{200E}" => "LRM",
|
48
|
+
"\u{200F}" => "RLM",
|
49
|
+
"\u{202A}" => "LRE",
|
50
|
+
"\u{202B}" => "RLE",
|
51
|
+
"\u{202C}" => "PDF",
|
52
|
+
"\u{202D}" => "LRO",
|
53
|
+
"\u{202E}" => "RLO",
|
54
|
+
"\u{2066}" => "LRI",
|
55
|
+
"\u{2067}" => "RLI",
|
56
|
+
"\u{2068}" => "FSI",
|
57
|
+
"\u{2069}" => "PDI",
|
58
|
+
|
59
|
+
"\u{FE00}" => "VS1",
|
60
|
+
"\u{FE01}" => "VS2",
|
61
|
+
"\u{FE02}" => "VS3",
|
62
|
+
"\u{FE03}" => "VS4",
|
63
|
+
"\u{FE04}" => "VS5",
|
64
|
+
"\u{FE05}" => "VS6",
|
65
|
+
"\u{FE06}" => "VS7",
|
66
|
+
"\u{FE07}" => "VS8",
|
67
|
+
"\u{FE08}" => "VS9",
|
68
|
+
"\u{FE09}" => "VS10",
|
69
|
+
"\u{FE0A}" => "VS11",
|
70
|
+
"\u{FE0B}" => "VS12",
|
71
|
+
"\u{FE0C}" => "VS13",
|
72
|
+
"\u{FE0D}" => "VS14",
|
73
|
+
"\u{FE0E}" => "VS15",
|
74
|
+
"\u{FE0F}" => "VS16",
|
75
|
+
}.freeze
|
76
|
+
COULD_BE_WHITESPACE = '[\p{Space}⠀]'.freeze
|
77
|
+
# UNASSIGNED = '\p{Cn}'.freeze
|
78
|
+
|
79
|
+
def self.symbolify(char, encoding = char.encoding)
|
80
|
+
char.tr!(
|
81
|
+
ASCII_CONTROL_CODEPOINTS.encode(encoding),
|
82
|
+
ASCII_CONTROL_SYMBOLS.encode(encoding)
|
83
|
+
)
|
84
|
+
char.gsub!(
|
85
|
+
Regexp.compile(COULD_BE_WHITESPACE.encode(encoding)),
|
86
|
+
']\0['.encode(encoding)
|
87
|
+
)
|
88
|
+
# char.gsub!(
|
89
|
+
# Regexp.compile(UNASSIGNED.encode(encoding)),
|
90
|
+
# 'n/a'.encode(encoding)
|
91
|
+
# )
|
92
|
+
INTERESTING_CODEPOINTS.each{ |cp, desc|
|
93
|
+
char.gsub! Regexp.compile(cp.encode(encoding)), desc.encode(encoding)
|
94
|
+
}
|
95
|
+
char.gsub! TAG_START.encode(encoding), TAG_START_SYMBOL.encode(encoding)
|
96
|
+
char.gsub! TAG_SPACE.encode(encoding), TAG_SPACE_SYMBOL.encode(encoding)
|
97
|
+
char.gsub! TAG_DELETE.encode(encoding), TAG_DELETE_SYMBOL.encode(encoding)
|
98
|
+
|
99
|
+
ord = char.ord
|
100
|
+
if ord > 917536 && ord < 917631
|
101
|
+
char.tr!(TAGS.encode(encoding), ASCII_CHARS.encode(encoding))
|
102
|
+
char = "TAG ".encode(encoding) + char
|
103
|
+
end
|
104
|
+
|
105
|
+
char
|
106
|
+
end
|
107
|
+
end
|
108
|
+
end
|
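The first step of `Symbolify.symbolify` maps ASCII control characters onto the Unicode "Control Pictures" block with `tr`. The sketch below isolates that step for UTF-8 strings. `control_pictures` is a hypothetical name, and it uses the non-mutating `tr` where the gem uses `tr!`.

```ruby
# Replace C0 control characters and DEL with their visible symbols.
def control_pictures(text)
  # U+0000..U+001F map to U+2400..U+241F, and U+007F maps to U+2421.
  text.tr("\x00-\x1F\x7F", "\u2400-\u241F\u2421")
end

puts control_pictures("a\tb\n")
```

Both ranges have the same length, so `tr` maps them position by position. A tab (U+0009) becomes U+2409 and a newline (U+000A) becomes U+240A.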
data/lib/unibits/version.rb
CHANGED
Binary file
|
Binary file
|
data/spec/unibits_spec.rb
CHANGED
@@ -66,5 +66,171 @@ describe Unibits do
|
|
66
66
|
result.must_match "43"
|
67
67
|
result.must_match "01000011"
|
68
68
|
end
|
69
|
+
|
70
|
+
describe "invalid UTF-8 encodings" do
|
71
|
+
it "- unexpected continuation byte (1/2)" do
|
72
|
+
string = "abc\x80efg"
|
73
|
+
result = Paint.unpaint(Unibits.visualize(string))
|
74
|
+
result.must_match "�"
|
75
|
+
result.must_match "unexp.c."
|
76
|
+
result.must_match /e.*f.*g/m
|
77
|
+
end
|
78
|
+
|
79
|
+
it "- unexpected continuation byte (2/2)" do
|
80
|
+
string = "🌫\x81efg"
|
81
|
+
result = Paint.unpaint(Unibits.visualize(string))
|
82
|
+
result.must_match "�"
|
83
|
+
result.must_match "unexp.c."
|
84
|
+
result.must_match /e.*f.*g/m
|
85
|
+
end
|
86
|
+
|
87
|
+
it "- not enough continuation bytes" do
|
88
|
+
string = "\xF0\x9F\x8CABC"
|
89
|
+
result = Paint.unpaint(Unibits.visualize(string))
|
90
|
+
result.must_match "�"
|
91
|
+
result.must_match "n.e.con."
|
92
|
+
result.must_match /A.*B.*C/m
|
93
|
+
end
|
94
|
+
|
95
|
+
it "- overlong padding (1/2)" do
|
96
|
+
string = "\xE0\x81\x81ABC"
|
97
|
+
result = Paint.unpaint(Unibits.visualize(string))
|
98
|
+
result.must_match "�"
|
99
|
+
result.must_match /overlong.*overlong.*overlong/m
|
100
|
+
result.must_match /A.*B.*C/m
|
101
|
+
end
|
102
|
+
|
103
|
+
it "- overlong padding (2/2)" do
|
104
|
+
string = "\xC0\x80no double null"
|
105
|
+
result = Paint.unpaint(Unibits.visualize(string))
|
106
|
+
result.must_match "�"
|
107
|
+
result.must_match /overlong.*overlong/m
|
108
|
+
end
|
109
|
+
|
110
|
+
it "- too large codepoint (1/2)" do
|
111
|
+
string = "\xF5\x8F\xBF\xBFABC"
|
112
|
+
result = Paint.unpaint(Unibits.visualize(string))
|
113
|
+
result.must_match "�"
|
114
|
+
result.must_match /toolarge.*toolarge.*toolarge.*toolarge/m
|
115
|
+
result.must_match /A.*B.*C/m
|
116
|
+
end
|
117
|
+
|
118
|
+
it "- too large codepoint (2/2)" do
|
119
|
+
string = "\xF4\xAF\xBF\xBFABC"
|
120
|
+
result = Paint.unpaint(Unibits.visualize(string))
|
121
|
+
result.must_match "�"
|
122
|
+
result.must_match /toolarge.*toolarge.*toolarge.*toolarge/m
|
123
|
+
result.must_match /A.*B.*C/m
|
124
|
+
end
|
125
|
+
|
126
|
+
it "- too large byte" do
|
127
|
+
string = "\xFF"
|
128
|
+
result = Paint.unpaint(Unibits.visualize(string))
|
129
|
+
result.must_match "�"
|
130
|
+
result.must_match "toolarge"
|
131
|
+
end
|
132
|
+
|
133
|
+
it "- has surrogate (1/2)" do
|
134
|
+
string = "\xED\xA0\x80ABC"
|
135
|
+
result = Paint.unpaint(Unibits.visualize(string))
|
136
|
+
result.must_match "�"
|
137
|
+
result.must_match "sur.gate"
|
138
|
+
result.must_match /A.*B.*C/m
|
139
|
+
end
|
140
|
+
|
141
|
+
it "- has surrogate (2/2)" do
|
142
|
+
string = "\xED\xBF\xBFABC"
|
143
|
+
result = Paint.unpaint(Unibits.visualize(string))
|
144
|
+
result.must_match "�"
|
145
|
+
result.must_match "sur.gate"
|
146
|
+
result.must_match /A.*B.*C/m
|
147
|
+
end
|
148
|
+
end
|
149
|
+
|
150
|
+
describe "invalid UTF-16 encodings" do
|
151
|
+
it "- incomplete number of bytes (1/2)" do
|
152
|
+
string = "a".b.force_encoding("UTF-16LE")
|
153
|
+
result = Paint.unpaint(Unibits.visualize(string))
|
154
|
+
result.must_match "incompl."
|
155
|
+
result.must_match "�"
|
156
|
+
end
|
157
|
+
|
158
|
+
it "- incomplete number of bytes (2/2)" do
|
159
|
+
string = "🌫".b[0..-2].force_encoding("UTF-16LE")
|
160
|
+
result = Paint.unpaint(Unibits.visualize(string))
|
161
|
+
result.must_match "incompl."
|
162
|
+
result.must_match "�"
|
163
|
+
end
|
164
|
+
|
165
|
+
it "- only lower half surrogate" do
|
166
|
+
string = "\x3C\xD8\x2Ba".force_encoding("UTF-16LE")
|
167
|
+
result = Paint.unpaint(Unibits.visualize(string))
|
168
|
+
result.must_match "hlf.srg."
|
169
|
+
result.must_match "�"
|
170
|
+
end
|
171
|
+
|
172
|
+
it "- only higher half surrogate" do
|
173
|
+
string = "\x3Ca\x2B\xDF".force_encoding("UTF-16LE")
|
174
|
+
result = Paint.unpaint(Unibits.visualize(string))
|
175
|
+
result.must_match "hlf.srg."
|
176
|
+
result.must_match "�"
|
177
|
+
end
|
178
|
+
end
|
179
|
+
|
180
|
+
describe "invalid UTF-32 encodings" do
|
181
|
+
# please note, currently, too large codepoints and encoded utf16 surrogates are treated as valid encodings
|
182
|
+
|
183
|
+
it "- incomplete number of bytes (1/3)" do
|
184
|
+
string = "a".b.force_encoding("UTF-32LE")
|
185
|
+
result = Paint.unpaint(Unibits.visualize(string))
|
186
|
+
result.must_match "incompl."
|
187
|
+
result.must_match "�"
|
188
|
+
end
|
189
|
+
|
190
|
+
it "- incomplete number of bytes (2/3)" do
|
191
|
+
string = "🌫".b[0..-2].force_encoding("UTF-32LE")
|
192
|
+
result = Paint.unpaint(Unibits.visualize(string))
|
193
|
+
result.must_match "incompl."
|
194
|
+
result.must_match "�"
|
195
|
+
end
|
196
|
+
|
197
|
+
it "- incomplete number of bytes (3/3)" do
|
198
|
+
string = "🌫".b[0..-2].force_encoding("UTF-32LE")
|
199
|
+
result = Paint.unpaint(Unibits.visualize(string))
|
200
|
+
result.must_match "incompl."
|
201
|
+
result.must_match "�"
|
202
|
+
end
|
203
|
+
end
|
204
|
+
|
205
|
+
describe "invalid ASCII encodings" do
|
206
|
+
it "- contains bytes with 8th bit set" do
|
207
|
+
string = "abc\x80efg".force_encoding("ASCII")
|
208
|
+
result = Paint.unpaint(Unibits.visualize(string))
|
209
|
+
result.must_match "�"
|
210
|
+
result.must_match /e.*f.*g/m
|
211
|
+
end
|
212
|
+
end
|
213
|
+
end
|
214
|
+
|
215
|
+
describe "wide_ambiguous: option" do
|
216
|
+
it "- default is 1" do
|
217
|
+
string = "⚀······"
|
218
|
+
result = Unibits.stats(string)
|
219
|
+
result.wont_match "13"
|
220
|
+
end
|
221
|
+
|
222
|
+
it "- default is 2" do
|
223
|
+
string = "⚀······"
|
224
|
+
result = Unibits.stats(string, wide_ambiguous: true)
|
225
|
+
result.must_match "13"
|
226
|
+
end
|
227
|
+
end
|
228
|
+
|
229
|
+
describe "width: option" do
|
230
|
+
it "sets a custom column width" do
|
231
|
+
string = "bla" * 99
|
232
|
+
result = Paint.unpaint(Unibits.visualize(string, width: 50))
|
233
|
+
(result[/^.*$/].size <= 50).must_equal true
|
234
|
+
end
|
69
235
|
end
|
70
236
|
end
|
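The UTF-16 specs above rely on the surrogate ranges. The sketch below splits a string into 16-bit units and labels each one. `utf16_unit_kind` is a hypothetical helper that does not appear in the gem.

```ruby
# Label a 16-bit UTF-16 code unit (illustrative helper).
def utf16_unit_kind(unit)
  case unit
  when 0xD800..0xDBFF then :high_surrogate # first half of a pair
  when 0xDC00..0xDFFF then :low_surrogate  # second half of a pair
  else                     :bmp            # stands on its own
  end
end

# U+1F32B (the fog emoji used in the specs) needs a surrogate pair.
units = "\u{1F32B}".encode(Encoding::UTF_16LE).unpack("v*")
units.each { |unit| puts format("%04X %s", unit, utf16_unit_kind(unit)) }
```

The two units are D83C and DF2B, which are the little-endian bytes `3C D8 2B DF` used in the half-surrogate specs.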
data/unibits.gemspec
CHANGED
@@ -17,7 +17,7 @@ Gem::Specification.new do |gem|
|
|
17
17
|
gem.test_files = gem.files.grep(%r{^(test|spec|features)/})
|
18
18
|
gem.require_paths = ["lib"]
|
19
19
|
|
20
|
-
gem.add_dependency 'paint', '
|
20
|
+
gem.add_dependency 'paint', '>= 0.9', '< 3.0'
|
21
21
|
gem.add_dependency 'unicode-display_width', '~> 1.1'
|
22
22
|
gem.add_dependency 'rationalist', '~> 2.0'
|
23
23
|
|
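The gemspec hunk above sets the paint dependency to `'>= 0.9', '< 3.0'`. The following sketch uses RubyGems' own `Gem::Requirement` to show which versions that range accepts. The version numbers tested are arbitrary examples and not a list of actual paint releases.

```ruby
require "rubygems"

# The two constraints from the gemspec, combined into one requirement.
requirement = Gem::Requirement.new(">= 0.9", "< 3.0")

%w[0.8 0.9 2.0.0 3.0].each do |version|
  ok = requirement.satisfied_by?(Gem::Version.new(version))
  puts "paint #{version}: #{ok ? 'accepted' : 'rejected'}"
end
```

Of these, only 0.9 and 2.0.0 are accepted. 0.8 is below the lower bound, and 3.0 is excluded by the strict upper bound.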
metadata
CHANGED
@@ -1,29 +1,35 @@
|
|
1
1
|
--- !ruby/object:Gem::Specification
|
2
2
|
name: unibits
|
3
3
|
version: !ruby/object:Gem::Version
|
4
|
-
version: 1.
|
4
|
+
version: 1.1.0
|
5
5
|
platform: ruby
|
6
6
|
authors:
|
7
7
|
- Jan Lelis
|
8
8
|
autorequire:
|
9
9
|
bindir: bin
|
10
10
|
cert_chain: []
|
11
|
-
date: 2017-03-
|
11
|
+
date: 2017-03-08 00:00:00.000000000 Z
|
12
12
|
dependencies:
|
13
13
|
- !ruby/object:Gem::Dependency
|
14
14
|
name: paint
|
15
15
|
requirement: !ruby/object:Gem::Requirement
|
16
16
|
requirements:
|
17
|
-
- - "
|
17
|
+
- - ">="
|
18
18
|
- !ruby/object:Gem::Version
|
19
|
-
version: '
|
19
|
+
version: '0.9'
|
20
|
+
- - "<"
|
21
|
+
- !ruby/object:Gem::Version
|
22
|
+
version: '3.0'
|
20
23
|
type: :runtime
|
21
24
|
prerelease: false
|
22
25
|
version_requirements: !ruby/object:Gem::Requirement
|
23
26
|
requirements:
|
24
|
-
- - "
|
27
|
+
- - ">="
|
25
28
|
- !ruby/object:Gem::Version
|
26
|
-
version: '
|
29
|
+
version: '0.9'
|
30
|
+
- - "<"
|
31
|
+
- !ruby/object:Gem::Version
|
32
|
+
version: '3.0'
|
27
33
|
- !ruby/object:Gem::Dependency
|
28
34
|
name: unicode-display_width
|
29
35
|
requirement: !ruby/object:Gem::Requirement
|
@@ -74,13 +80,16 @@ files:
|
|
74
80
|
- bin/unibits
|
75
81
|
- lib/unibits.rb
|
76
82
|
- lib/unibits/kernel_method.rb
|
83
|
+
- lib/unibits/symbolify.rb
|
77
84
|
- lib/unibits/version.rb
|
85
|
+
- screenshots/ascii.invalid.png
|
78
86
|
- screenshots/ascii.png
|
79
87
|
- screenshots/binary.png
|
80
88
|
- screenshots/utf-16be.png
|
81
89
|
- screenshots/utf-16le.png
|
82
90
|
- screenshots/utf-32be.png
|
83
91
|
- screenshots/utf-32le.png
|
92
|
+
- screenshots/utf-8.invalid.png
|
84
93
|
- screenshots/utf-8.png
|
85
94
|
- spec/unibits_spec.rb
|
86
95
|
- unibits.gemspec
|