unibits 1.0.0 → 1.1.0
- checksums.yaml +4 -4
- data/CHANGELOG.md +8 -0
- data/README.md +42 -18
- data/Rakefile +1 -1
- data/bin/unibits +50 -8
- data/lib/unibits.rb +175 -62
- data/lib/unibits/kernel_method.rb +2 -2
- data/lib/unibits/symbolify.rb +108 -0
- data/lib/unibits/version.rb +1 -1
- data/screenshots/ascii.invalid.png +0 -0
- data/screenshots/utf-8.invalid.png +0 -0
- data/spec/unibits_spec.rb +166 -0
- data/unibits.gemspec +1 -1
- metadata +15 -6
checksums.yaml
CHANGED
@@ -1,7 +1,7 @@
|
|
1
1
|
---
|
2
2
|
SHA1:
|
3
|
-
metadata.gz:
|
4
|
-
data.tar.gz:
|
3
|
+
metadata.gz: 3a765c7b1a6edfd62f8fed06b683d0d4ff2cacae
|
4
|
+
data.tar.gz: c66cb699f9a4425fc169fd6562dd4b758b94c584
|
5
5
|
SHA512:
|
6
|
-
metadata.gz:
|
7
|
-
data.tar.gz:
|
6
|
+
metadata.gz: a516728fe08185386fc1902fa341784517521d0a85e494e35a7e2700c1c6d7fc3dfc8f86bb20d4268de6d69df591ab431aa3ab5ee02d96f6798fcd97f2426a8e
|
7
|
+
data.tar.gz: cb2fbfea68be359661ad75cc6e203bd4235ccd81431a37f74f965e90f6116e1a12d8cb04094a213f537e2a032c0c25775edd08378cd7c8ae613503dbe5cc3e0e
|
data/CHANGELOG.md
CHANGED
@@ -1,5 +1,13 @@
|
|
1
1
|
## CHANGELOG
|
2
2
|
|
3
|
+
### 1.1.0
|
4
|
+
|
5
|
+
* Support (and highlight) invalid encodings \o/
|
6
|
+
* Improve character symbolification
|
7
|
+
* Fix that the Kernel method would not take keyword arguments
|
8
|
+
* New option for setting a custom output width to use
|
9
|
+
* New option for activating wide ambiguous characters
|
10
|
+
|
3
11
|
### 1.0.0
|
4
12
|
|
5
13
|
* Initial release
|
data/README.md
CHANGED
@@ -1,18 +1,24 @@
|
|
1
|
-
# unibits [![[version]](https://badge.fury.io/rb/unibits.svg)](http://badge.fury.io/rb/unibits) [![[travis]](https://travis-ci.org/janlelis/unibits.svg)](https://travis-ci.org/janlelis/unibits)
|
1
|
+
# unibits | Reveal the Unicode [![[version]](https://badge.fury.io/rb/unibits.svg)](http://badge.fury.io/rb/unibits) [![[travis]](https://travis-ci.org/janlelis/unibits.svg)](https://travis-ci.org/janlelis/unibits)
|
2
2
|
|
3
|
-
Ruby library and CLI command that visualizes various Unicode and ASCII encodings in the terminal
|
3
|
+
Ruby library and CLI command that visualizes various Unicode and ASCII encodings in the terminal:
|
4
4
|
|
5
|
-
|
6
|
-
|
7
|
-
|
5
|
+
- Makes analyzing encodings easier
|
6
|
+
- Helps you with debugging strings
|
7
|
+
- Supports **UTF-8**, **UTF-16LE**/**UTF-16BE**, **UTF-32LE**/**UTF-32BE**, arbitrary **BINARY** data, and **ASCII**
|
8
|
+
- Highlights invalid encodings
|
8
9
|
|
9
10
|
## Setup
|
10
11
|
|
12
|
+
Make sure you have Ruby installed and installing gems works properly. Then do:
|
13
|
+
|
11
14
|
```
|
12
15
|
$ gem install unibits
|
13
16
|
```
|
14
17
|
|
15
18
|
## Usage
|
19
|
+
|
20
|
+
Pass the string to debug to unibits:
|
21
|
+
|
16
22
|
### From CLI
|
17
23
|
|
18
24
|
```
|
@@ -26,17 +32,17 @@ require 'unibits/kernel_method'
|
|
26
32
|
unibits "🌫 Idiosyncrätic ℜսᖯʏ"
|
27
33
|
```
|
28
34
|
|
29
|
-
### Options
|
35
|
+
### Advanced Options
|
30
36
|
|
31
|
-
`unibits` takes
|
37
|
+
`unibits` accepts the following options:
|
32
38
|
|
33
39
|
- *encoding (e)*: The encoding of the given string (uses your default encoding if none given)
|
34
40
|
- *convert (c)*: An encoding the string should be converted to before visualizing it
|
35
41
|
- *stats*: Whether to show a short stats header (default: `true`); you can deactivate it on the CLI with `--no-stats`
|
42
|
+
- *wide-ambiguous*: Treat characters of ambiguous width as 2 spaces instead of 1 ([more info](https://github.com/janlelis/unicode-display_width))
|
43
|
+
- *width (w)*: Set a custom column width; if not set, *unibits* retrieves it from the terminal or falls back to 80
|
36
44
|
|
37
|
-
|
38
|
-
|
39
|
-
## Output for Different Encodings
|
45
|
+
## Output of Different Valid Encodings
|
40
46
|
### UTF-8
|
41
47
|
|
42
48
|
CLI: `$ unibits -e utf-8 -c utf-8 "🌫 Idiosyncrätic ℜսᖯʏ"`
|
@@ -47,33 +53,33 @@ Ruby: `unibits "🌫 Idiosyncrätic ℜսᖯʏ", encoding: 'utf-8', convert:
|
|
47
53
|
|
48
54
|
### UTF-16LE
|
49
55
|
|
50
|
-
CLI: `$ unibits -e utf-8 -c utf-
|
56
|
+
CLI: `$ unibits -e utf-8 -c utf-16le "🌫 Idiosyncrätic ℜսᖯʏ"`
|
51
57
|
|
52
|
-
Ruby: `unibits "🌫 Idiosyncrätic ℜսᖯʏ", encoding: 'utf-8', convert: 'utf-
|
58
|
+
Ruby: `unibits "🌫 Idiosyncrätic ℜսᖯʏ", encoding: 'utf-8', convert: 'utf-16le'`
|
53
59
|
|
54
60
|
![Screenshot UTF-16LE](/screenshots/utf-16le.png?raw=true "UTF-16LE")
|
55
61
|
|
56
62
|
### UTF-16BE
|
57
63
|
|
58
|
-
CLI: `$ unibits -e utf-8 -c utf-
|
64
|
+
CLI: `$ unibits -e utf-8 -c utf-16be "🌫 Idiosyncrätic ℜսᖯʏ"`
|
59
65
|
|
60
|
-
Ruby: `unibits "🌫 Idiosyncrätic ℜսᖯʏ", encoding: 'utf-8', convert: 'utf-
|
66
|
+
Ruby: `unibits "🌫 Idiosyncrätic ℜսᖯʏ", encoding: 'utf-8', convert: 'utf-16be'`
|
61
67
|
|
62
68
|
![Screenshot UTF-16BE](/screenshots/utf-16be.png?raw=true "UTF-16BE")
|
63
69
|
|
64
70
|
### UTF-32LE
|
65
71
|
|
66
|
-
CLI: `$ unibits -e utf-8 -c utf-
|
72
|
+
CLI: `$ unibits -e utf-8 -c utf-32le "🌫 Idiosyncrätic ℜսᖯʏ"`
|
67
73
|
|
68
|
-
Ruby: `unibits "🌫 Idiosyncrätic ℜսᖯʏ", encoding: 'utf-8', convert: 'utf-
|
74
|
+
Ruby: `unibits "🌫 Idiosyncrätic ℜսᖯʏ", encoding: 'utf-8', convert: 'utf-32le'`
|
69
75
|
|
70
76
|
![Screenshot UTF-32LE](/screenshots/utf-32le.png?raw=true "UTF-32LE")
|
71
77
|
|
72
78
|
### UTF-32BE
|
73
79
|
|
74
|
-
CLI: `$ unibits -e utf-8 -c utf-
|
80
|
+
CLI: `$ unibits -e utf-8 -c utf-32be "🌫 Idiosyncrätic ℜսᖯʏ"`
|
75
81
|
|
76
|
-
Ruby: `unibits "🌫 Idiosyncrätic ℜսᖯʏ", encoding: 'utf-8', convert: 'utf-
|
82
|
+
Ruby: `unibits "🌫 Idiosyncrätic ℜսᖯʏ", encoding: 'utf-8', convert: 'utf-32be'`
|
77
83
|
|
78
84
|
![Screenshot UTF-32BE](/screenshots/utf-32be.png?raw=true "UTF-32BE")
|
79
85
|
|
@@ -93,6 +99,23 @@ Ruby: `unibits "ASCII String", encoding: 'utf-8', convert: 'ascii'`
|
|
93
99
|
|
94
100
|
![Screenshot ASCII](/screenshots/ascii.png?raw=true "ASCII")
|
95
101
|
|
102
|
+
## Invalid Encodings
|
103
|
+
### UTF-8
|
104
|
+
|
105
|
+
Example in Ruby: `unibits "unexpected \x80 | not enough \xF0\x9F\x8C | overlong \xE0\x81\x81 | surrogate \xED\xA0\x80 | too large \xF5\x8F\xBF\xBF"`
|
106
|
+
|
107
|
+
![Screenshot invalid UTF-8](/screenshots/utf-8.invalid.png?raw=true "Invalid UTF-8")
|
108
|
+
|
109
|
+
### ASCII
|
110
|
+
|
111
|
+
Example in Ruby: `unibits "🌫 Idiosyncrätic ℜսᖯʏ", encoding: 'ascii'`
|
112
|
+
|
113
|
+
![Screenshot invalid ASCII](/screenshots/ascii.invalid.png?raw=true "Invalid ASCII")
|
114
|
+
|
115
|
+
### BINARY
|
116
|
+
|
117
|
+
(not possible to produce invalid binary strings)
|
118
|
+
|
96
119
|
## Notes
|
97
120
|
|
98
121
|
Also see
|
@@ -101,6 +124,7 @@ Also see
|
|
101
124
|
- [UTF-16 (Wikipedia)](https://en.wikipedia.org/wiki/UTF-16#Description)
|
102
125
|
- [UTF-32 (Wikipedia)](https://en.wikipedia.org/wiki/UTF-32)
|
103
126
|
- [Ruby's Encoding class](https://ruby-doc.org/core/Encoding.html)
|
127
|
+
- [Difference between BINARY and ASCII](http://idiosyncratic-ruby.com/56-us-ascii-8bit.html)
|
104
128
|
- [Unicode Micro Libraries for Ruby](https://github.com/janlelis/unicode-x)
|
105
129
|
|
106
130
|
Lots of thanks to @damienklinnert for the motivation and inspiration required to build this! 🎆
|
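The invalid UTF-8 example from the README can be explored with plain Ruby before handing the string to unibits. The following is a minimal sketch, not part of the gem: `invalid_chunks` is a hypothetical helper, and it only relies on `String#valid_encoding?` and `String#each_char` from Ruby core. It assumes the script's source encoding is UTF-8 (the Ruby default).

```ruby
# Byte sequences taken from the README's invalid UTF-8 example.
samples = {
  "unexpected" => "\x80",
  "not enough" => "\xF0\x9F\x8C",
  "overlong"   => "\xE0\x81\x81",
  "surrogate"  => "\xED\xA0\x80",
  "too large"  => "\xF5\x8F\xBF\xBF",
}

# Hypothetical helper: returns the hex dump of every chunk that
# String#each_char yields and that is not validly encoded.
def invalid_chunks(string)
  string.each_char.reject(&:valid_encoding?).map { |c| c.unpack("H*")[0] }
end

samples.each do |label, bytes|
  puts "#{label}: valid=#{bytes.valid_encoding?} chunks=#{invalid_chunks(bytes).inspect}"
end
```

How Ruby splits a broken multi-byte sequence into chunks is up to the interpreter, so the sketch prints the chunks instead of assuming a particular grouping.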
data/Rakefile
CHANGED
data/bin/unibits
CHANGED
@@ -7,14 +7,16 @@ argv = Rationalist.parse(
|
|
7
7
|
ARGV,
|
8
8
|
string: '_',
|
9
9
|
alias: {
|
10
|
-
e: 'encoding',
|
11
10
|
c: 'convert',
|
11
|
+
e: 'encoding',
|
12
12
|
v: 'version',
|
13
|
+
w: 'width',
|
13
14
|
},
|
14
15
|
boolean: [
|
15
16
|
'help',
|
17
|
+
'stats',
|
16
18
|
'version',
|
17
|
-
'
|
19
|
+
'wide-ambiguous',
|
18
20
|
],
|
19
21
|
default: {
|
20
22
|
stats: true
|
@@ -22,21 +24,54 @@ argv = Rationalist.parse(
|
|
22
24
|
)
|
23
25
|
|
24
26
|
if argv[:version]
|
25
|
-
puts Unibits::VERSION
|
27
|
+
puts "unibits #{Unibits::VERSION} by #{Paint["J-_-L", :bold]} <https://github.com/janlelis/unibits>"
|
26
28
|
exit(0)
|
27
29
|
end
|
28
30
|
|
29
31
|
if argv[:help]
|
30
32
|
puts <<-HELP
|
31
33
|
|
34
|
+
#{Paint["DESCRIPTION", :underline]}
|
35
|
+
|
36
|
+
Visualizes ANSI and various Unicode encodings in the terminal.
|
37
|
+
|
32
38
|
#{Paint["USAGE", :underline]}
|
33
39
|
|
34
|
-
#{Paint["unibits", :bold]} [
|
40
|
+
#{Paint["unibits", :bold]} [options] data
|
41
|
+
|
42
|
+
--encoding <encoding> | -e | which encoding to use for given data
|
43
|
+
--convert <encoding> | -c | which encoding to convert to (if possible)
|
44
|
+
--width <n> | -w | force a specific number of terminal columns
|
45
|
+
--no-stats | | no stats header with length info
|
46
|
+
--version | | displays version of unibits
|
47
|
+
--wide-ambiguous | | treat ambiguous-width characters as 2 columns wide
|
48
|
+
|
49
|
+
#{Paint["ENCODINGS", :underline]}
|
50
|
+
|
51
|
+
#{Unibits::SUPPORTED_ENCODINGS.join(', ')}
|
52
|
+
|
53
|
+
#{Paint["STATS", :underline]}
|
54
|
+
|
55
|
+
( bytes / codepoints / glyphs / expected terminal width )
|
56
|
+
|
57
|
+
#{Paint["INVALID BYTES", :underline]}
|
58
|
+
|
59
|
+
UTF-8
|
60
|
+
|
61
|
+
n.e.con. | not enough continuation bytes
|
62
|
+
unexp.c. | unexpected continuation byte
|
63
|
+
toolarge | codepoint value exceeds maximum allowed (U+10FFFF)
|
64
|
+
sur.gate | codepoint value would be a surrogate half
|
65
|
+
overlong | unnecessary null padding
|
66
|
+
|
67
|
+
UTF-16
|
68
|
+
|
69
|
+
incompl. | not enough bytes left to finish codepoint
|
70
|
+
hlf.srg. | other half of surrogate missing
|
35
71
|
|
36
|
-
|
37
|
-
Explanation of stats: bytes / codepoints / glyphs / expected terminal width
|
72
|
+
#{Paint["MORE INFO", :underline]}
|
38
73
|
|
39
|
-
|
74
|
+
https://github.com/janlelis/unibits
|
40
75
|
|
41
76
|
HELP
|
42
77
|
exit(0)
|
@@ -51,7 +86,14 @@ else
|
|
51
86
|
end
|
52
87
|
|
53
88
|
begin
|
54
|
-
Unibits.of(
|
89
|
+
Unibits.of(
|
90
|
+
data,
|
91
|
+
encoding: argv[:encoding],
|
92
|
+
convert: argv[:convert],
|
93
|
+
stats: argv[:stats],
|
94
|
+
wide_ambiguous: argv[:'wide-ambiguous'],
|
95
|
+
width: argv[:width],
|
96
|
+
)
|
55
97
|
rescue ArgumentError
|
56
98
|
$stderr.puts Paint[$!.message, :red]
|
57
99
|
exit(1)
|
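The `--width` option overrides the column count that would otherwise be read from the terminal, with 80 as the fallback. A minimal sketch of that resolution order is shown below; `resolve_width` and `DEFAULT_WIDTH` are illustrative names and not the gem's API, and the sketch uses only `io/console` from the standard library.

```ruby
require "io/console"

DEFAULT_WIDTH = 80 # fallback used when no terminal size is available

# Illustrative helper: prefer an explicit width, then the terminal's
# column count, then the default.
def resolve_width(explicit = nil, io = STDIN)
  return explicit if explicit
  io.winsize[1] || DEFAULT_WIDTH # winsize returns [rows, columns]
rescue SystemCallError, IOError, NoMethodError
  # Not a TTY, a closed stream, or an object without #winsize
  DEFAULT_WIDTH
end

puts resolve_width(50) # an explicit width always wins
```

Passing the IO object as a parameter keeps the fallback path easy to exercise without a real terminal.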
data/lib/unibits.rb
CHANGED
@@ -1,4 +1,5 @@
|
|
1
1
|
require_relative "unibits/version"
|
2
|
+
require_relative "unibits/symbolify"
|
2
3
|
|
3
4
|
require "io/console"
|
4
5
|
require "paint"
|
@@ -14,8 +15,9 @@ module Unibits
|
|
14
15
|
'ASCII-8BIT',
|
15
16
|
'US-ASCII',
|
16
17
|
].freeze
|
18
|
+
DEFAULT_TERMINAL_WIDTH = 80
|
17
19
|
|
18
|
-
def self.of(string, encoding: nil, convert: nil, stats: true)
|
20
|
+
def self.of(string, encoding: nil, convert: nil, stats: true, wide_ambiguous: false, width: nil)
|
19
21
|
if !string || string.empty?
|
20
22
|
raise ArgumentError, "no data given to unibits"
|
21
23
|
end
|
@@ -25,12 +27,8 @@ module Unibits
|
|
25
27
|
|
26
28
|
case string.encoding.name
|
27
29
|
when *SUPPORTED_ENCODINGS
|
28
|
-
|
29
|
-
|
30
|
-
end
|
31
|
-
|
32
|
-
puts stats(string) if stats
|
33
|
-
puts visualize(string)
|
30
|
+
puts stats(string, wide_ambiguous: wide_ambiguous) if stats
|
31
|
+
puts visualize(string, wide_ambiguous: wide_ambiguous, width: width)
|
34
32
|
when 'UTF-16', 'UTF-32'
|
35
33
|
raise ArgumentError, "unibits only supports #{string.encoding.name} with specified endianness, please use #{string.encoding.name}LE or #{string.encoding.name}BE"
|
36
34
|
else
|
@@ -38,27 +36,36 @@ module Unibits
|
|
38
36
|
end
|
39
37
|
end
|
40
38
|
|
41
|
-
def self.stats(string)
|
42
|
-
|
43
|
-
|
44
|
-
|
45
|
-
|
46
|
-
|
47
|
-
|
39
|
+
def self.stats(string, wide_ambiguous: false)
|
40
|
+
valid = string.valid_encoding?
|
41
|
+
bytes = string.bytesize rescue "?"
|
42
|
+
codepoints = string.size rescue "?"
|
43
|
+
glyphs = string.scan(Regexp.compile('\X'.encode(string.encoding))).size rescue "?"
|
44
|
+
width = Unicode::DisplayWidth.of(string, wide_ambiguous ? 2 : 1) rescue "?"
|
45
|
+
|
46
|
+
"\n #{valid ? '' : Paint["Invalid ", :bold, :red]}#{Paint[string.encoding.name, :bold]} (#{bytes}/#{codepoints}/#{glyphs}/#{width})"
|
48
47
|
end
|
49
48
|
|
50
|
-
def self.visualize(string)
|
51
|
-
cols = determine_terminal_cols
|
49
|
+
def self.visualize(string, wide_ambiguous: false, width: nil)
|
50
|
+
cols = width || determine_terminal_cols
|
52
51
|
|
53
52
|
cp_buffer = [" "]
|
54
53
|
enc_buffer = [" "]
|
55
54
|
hex_buffer = [" "]
|
56
55
|
bin_buffer = [" "]
|
57
56
|
separator = [" "]
|
57
|
+
current_encoding_error = nil
|
58
58
|
|
59
59
|
puts
|
60
60
|
string.each_char{ |char|
|
61
|
-
|
61
|
+
if char.valid_encoding?
|
62
|
+
char_valid = true
|
63
|
+
current_color = random_color
|
64
|
+
current_encoding_error = nil
|
65
|
+
else
|
66
|
+
char_valid = false
|
67
|
+
current_color = :red
|
68
|
+
end
|
62
69
|
|
63
70
|
char.each_byte.with_index{ |byte, index|
|
64
71
|
if Paint.unpaint(hex_buffer[-1]).bytesize > cols - 12
|
@@ -70,12 +77,116 @@ module Unibits
|
|
70
77
|
end
|
71
78
|
|
72
79
|
if index == 0
|
80
|
+
if char_valid
|
81
|
+
codepoint = "U+%04X" % char.ord
|
82
|
+
else
|
83
|
+
case string.encoding.name
|
84
|
+
when "US-ASCII"
|
85
|
+
codepoint = "invalid"
|
86
|
+
when "UTF-8"
|
87
|
+
# this tries to detect what is wrong with this utf-8 encoded string
|
88
|
+
# sorry for this mess
|
89
|
+
case char.unpack("B*")[0]
|
90
|
+
when /^110.{5}$/
|
91
|
+
current_encoding_error = [:nec, 1, 1]
|
92
|
+
codepoint = "n.e.con."
|
93
|
+
when /^1110(.{4})$/
|
94
|
+
if $1 == "1101"
|
95
|
+
current_encoding_error = [:nec, 2, 2, :maybe_surrogate]
|
96
|
+
else
|
97
|
+
current_encoding_error = [:nec, 2, 2]
|
98
|
+
end
|
99
|
+
codepoint = "n.e.con."
|
100
|
+
when /^11110(.{3})$/
|
101
|
+
case $1
|
102
|
+
when "100"
|
103
|
+
current_encoding_error = [:nec, 3, 3, :leading_at_max]
|
104
|
+
when "101", "110", "111"
|
105
|
+
current_encoding_error = [:nec, 3, 3, :too_large]
|
106
|
+
else
|
107
|
+
current_encoding_error = [:nec, 3, 3]
|
108
|
+
end
|
109
|
+
codepoint = "n.e.con."
|
110
|
+
when /^11111.{3}$/
|
111
|
+
codepoint = "toolarge"
|
112
|
+
when /^10(.{2}).{4}$/
|
113
|
+
# uglyhack to fixup that it is not n.e.c, but something different
|
114
|
+
if current_encoding_error && current_encoding_error[0] == :nec
|
115
|
+
if current_encoding_error[3] == :leading_at_max
|
116
|
+
if $1 != "00"
|
117
|
+
current_encoding_error[3] = :too_large
|
118
|
+
else
|
119
|
+
current_encoding_error[3] = nil
|
120
|
+
end
|
121
|
+
elsif current_encoding_error[3] == :maybe_surrogate
|
122
|
+
if $1[0] == "1"
|
123
|
+
current_encoding_error[3] = :surrogate
|
124
|
+
else
|
125
|
+
current_encoding_error[3] = nil
|
126
|
+
end
|
127
|
+
end
|
128
|
+
|
129
|
+
if current_encoding_error[1] > 1
|
130
|
+
current_encoding_error[1] -= 1
|
131
|
+
codepoint = "n.e.con."
|
132
|
+
else
|
133
|
+
case current_encoding_error[3]
|
134
|
+
when :too_large
|
135
|
+
actual_error = "toolarge"
|
136
|
+
when :surrogate
|
137
|
+
actual_error = "sur.gate"
|
138
|
+
else
|
139
|
+
actual_error = "overlong"
|
140
|
+
end
|
141
|
+
current_cp_buffer_index = -1
|
142
|
+
(current_encoding_error[2]).times{
|
143
|
+
if index = cp_buffer[current_cp_buffer_index].rindex("n.e.con.")
|
144
|
+
cp_buffer[current_cp_buffer_index][index..-1] = cp_buffer[current_cp_buffer_index][index..-1].sub("n.e.con.", actual_error)
|
145
|
+
else
|
146
|
+
current_cp_buffer_index -= 1
|
147
|
+
index = cp_buffer[current_cp_buffer_index].rindex("n.e.con.")
|
148
|
+
cp_buffer[current_cp_buffer_index][index..-1] = cp_buffer[current_cp_buffer_index][index..-1].sub("n.e.con.", actual_error)
|
149
|
+
end
|
150
|
+
current_encoding_error = [:overlong]
|
151
|
+
codepoint = actual_error
|
152
|
+
}
|
153
|
+
end
|
154
|
+
else
|
155
|
+
current_encoding_error = [:unexp]
|
156
|
+
codepoint = "unexp.c."
|
157
|
+
end
|
158
|
+
else
|
159
|
+
current_encoding_error = [:invalid]
|
160
|
+
codepoint = "invalid"
|
161
|
+
end
|
162
|
+
when 'UTF-16LE', 'UTF-16BE'
|
163
|
+
if char.bytesize.odd?
|
164
|
+
codepoint = "incompl."
|
165
|
+
elsif char.b[string.encoding.name == 'UTF-16LE' ? 1 : 0].unpack("B*")[0][0, 5] == "11011"
|
166
|
+
codepoint = "hlf.srg."
|
167
|
+
else
|
168
|
+
codepoint = "invalid"
|
169
|
+
end
|
170
|
+
when 'UTF-32LE', 'UTF-32BE'
|
171
|
+
if char.bytesize != 4
|
172
|
+
codepoint = "incompl."
|
173
|
+
else
|
174
|
+
codepoint = "invalid"
|
175
|
+
end
|
176
|
+
end
|
177
|
+
end
|
178
|
+
|
73
179
|
cp_buffer[-1] << Paint[
|
74
|
-
|
180
|
+
codepoint.ljust(10), current_color, :bold
|
75
181
|
]
|
76
182
|
|
77
|
-
|
78
|
-
|
183
|
+
if char_valid
|
184
|
+
symbolified_char = symbolify(char)
|
185
|
+
else
|
186
|
+
symbolified_char = "�"
|
187
|
+
end
|
188
|
+
|
189
|
+
padding = 10 - Unicode::DisplayWidth.of(symbolified_char, wide_ambiguous ? 2 : 1)
|
79
190
|
|
80
191
|
enc_buffer[-1] << Paint[
|
81
192
|
symbolified_char, current_color
|
@@ -92,54 +203,62 @@ module Unibits
|
|
92
203
|
|
93
204
|
bin_byte_complete = byte.to_s(2).rjust(8, "0")
|
94
205
|
|
95
|
-
|
96
|
-
|
97
|
-
|
98
|
-
|
99
|
-
|
100
|
-
|
101
|
-
|
102
|
-
|
103
|
-
|
104
|
-
|
206
|
+
if !char_valid
|
207
|
+
bin_byte_1 = bin_byte_complete
|
208
|
+
bin_byte_2 = ""
|
209
|
+
else
|
210
|
+
case string.encoding.name
|
211
|
+
when 'US-ASCII'
|
212
|
+
bin_byte_1 = bin_byte_complete[0...1]
|
213
|
+
bin_byte_2 = bin_byte_complete[1...8]
|
214
|
+
when 'ASCII-8BIT'
|
215
|
+
bin_byte_1 = ""
|
216
|
+
bin_byte_2 = bin_byte_complete
|
217
|
+
when 'UTF-8'
|
218
|
+
if index == 0
|
219
|
+
if bin_byte_complete =~ /^(0|1{2,4}0)([01]+)$/
|
220
|
+
bin_byte_1 = $1
|
221
|
+
bin_byte_2 = $2
|
222
|
+
else
|
223
|
+
bin_byte_1 = ""
|
224
|
+
bin_byte_2 = bin_byte_complete
|
225
|
+
end
|
226
|
+
else
|
227
|
+
bin_byte_1 = bin_byte_complete[0...2]
|
228
|
+
bin_byte_2 = bin_byte_complete[2...8]
|
229
|
+
end
|
230
|
+
when 'UTF-16LE'
|
231
|
+
if char.ord <= 0xFFFF || index == 0 || index == 2
|
232
|
+
bin_byte_1 = ""
|
233
|
+
bin_byte_2 = bin_byte_complete
|
234
|
+
else
|
235
|
+
bin_byte_complete =~ /^(11011[01])([01]+)$/
|
105
236
|
bin_byte_1 = $1
|
106
237
|
bin_byte_2 = $2
|
107
|
-
|
238
|
+
end
|
239
|
+
when 'UTF-16BE'
|
240
|
+
if char.ord <= 0xFFFF || index == 1 || index == 3
|
108
241
|
bin_byte_1 = ""
|
109
242
|
bin_byte_2 = bin_byte_complete
|
243
|
+
else
|
244
|
+
bin_byte_complete =~ /^(11011[01])([01]+)$/
|
245
|
+
bin_byte_1 = $1
|
246
|
+
bin_byte_2 = $2
|
110
247
|
end
|
111
|
-
|
112
|
-
bin_byte_1 = bin_byte_complete[0...2]
|
113
|
-
bin_byte_2 = bin_byte_complete[2...8]
|
114
|
-
end
|
115
|
-
when 'UTF-16LE'
|
116
|
-
if char.ord <= 0xFFFF || index == 0 || index == 2
|
248
|
+
when 'UTF-32LE', 'UTF-32BE'
|
117
249
|
bin_byte_1 = ""
|
118
250
|
bin_byte_2 = bin_byte_complete
|
119
|
-
else
|
120
|
-
bin_byte_complete =~ /^(11011[01])([01]+)$/
|
121
|
-
bin_byte_1 = $1
|
122
|
-
bin_byte_2 = $2
|
123
|
-
end
|
124
|
-
when 'UTF-16BE'
|
125
|
-
if char.ord <= 0xFFFF || index == 1 || index == 3
|
126
|
-
bin_byte_1 = ""
|
127
|
-
bin_byte_2 = bin_byte_complete
|
128
|
-
else
|
129
|
-
bin_byte_complete =~ /^(11011[01])([01]+)$/
|
130
|
-
bin_byte_1 = $1
|
131
|
-
bin_byte_2 = $2
|
132
251
|
end
|
133
|
-
when 'UTF-32LE', 'UTF-32BE'
|
134
|
-
bin_byte_1 = ""
|
135
|
-
bin_byte_2 = bin_byte_complete
|
136
252
|
end
|
253
|
+
|
137
254
|
bin_buffer[-1] << Paint[
|
138
255
|
bin_byte_1, current_color
|
139
256
|
] unless !bin_byte_1 || bin_byte_1.empty?
|
257
|
+
|
140
258
|
bin_buffer[-1] << Paint[
|
141
259
|
bin_byte_2, current_color, :underline
|
142
260
|
] unless !bin_byte_2 || bin_byte_2.empty?
|
261
|
+
|
143
262
|
bin_buffer[-1] << " "
|
144
263
|
}
|
145
264
|
}
|
@@ -157,18 +276,12 @@ module Unibits
|
|
157
276
|
|
158
277
|
def self.symbolify(char)
|
159
278
|
return char.inspect unless char.encoding.name[0, 3] == "UTF"
|
160
|
-
char
|
161
|
-
.tr("\x00-\x1F".encode(char.encoding), "\u{2400}-\u{241F}".encode(char.encoding))
|
162
|
-
.gsub(
|
163
|
-
Regexp.compile('[\p{Space}]'.encode(char.encoding)),
|
164
|
-
']\0['.encode(char.encoding)
|
165
|
-
)
|
166
|
-
.encode('UTF-8')
|
279
|
+
Symbolify.symbolify(char).encode('UTF-8')
|
167
280
|
end
|
168
281
|
|
169
282
|
def self.determine_terminal_cols
|
170
|
-
STDIN.winsize[1] ||
|
283
|
+
STDIN.winsize[1] || DEFAULT_TERMINAL_WIDTH
|
171
284
|
rescue Errno::ENOTTY
|
172
|
-
return
|
285
|
+
return DEFAULT_TERMINAL_WIDTH
|
173
286
|
end
|
174
287
|
end
|
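The UTF-8 error detection above matches each byte's bit pattern against the prefixes that UTF-8 reserves for lead and continuation bytes. A much reduced sketch of that idea follows; `utf8_byte_kind` is a hypothetical name and the sketch only classifies a single byte, without the overlong, surrogate, and too-large bookkeeping the gem performs.

```ruby
# Classify one byte (0..255) by its UTF-8 bit prefix.
# Note: some bytes that match a lead prefix (0xC0, 0xC1, 0xF5..0xF7)
# can still never start a valid sequence; this sketch ignores that.
def utf8_byte_kind(byte)
  case byte.to_s(2).rjust(8, "0")
  when /\A0/     then :ascii        # 0xxxxxxx
  when /\A10/    then :continuation # 10xxxxxx
  when /\A110/   then :lead_2       # 110xxxxx, expects 1 continuation byte
  when /\A1110/  then :lead_3       # 1110xxxx, expects 2 continuation bytes
  when /\A11110/ then :lead_4       # 11110xxx, expects 3 continuation bytes
  else                :invalid      # 11111xxx never occurs in UTF-8
  end
end

p "\u00E4".bytes.map { |b| utf8_byte_kind(b) } # => [:lead_2, :continuation]
```

The order of the `when` branches matters only for readability here, since the anchored prefixes are mutually exclusive.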
@@ -0,0 +1,108 @@
|
|
1
|
+
module Unibits
|
2
|
+
module Symbolify
|
3
|
+
ASCII_CONTROL_CODEPOINTS = "\x00-\x1F\x7F".freeze
|
4
|
+
ASCII_CONTROL_SYMBOLS = "\u{2400}-\u{241F}\u{2421}".freeze
|
5
|
+
ASCII_CHARS = "\x20-\x7E".freeze
|
6
|
+
TAG_START = "\u{E0001}".freeze
|
7
|
+
TAG_START_SYMBOL = "LANG TAG".freeze
|
8
|
+
TAG_SPACE = "\u{E0020}".freeze
|
9
|
+
TAG_SPACE_SYMBOL = "TAG ␠".freeze
|
10
|
+
TAGS = "\u{E0021}-\u{E007E}".freeze
|
11
|
+
TAG_DELETE = "\u{E007F}".freeze
|
12
|
+
TAG_DELETE_SYMBOL = "TAG ␡".freeze
|
13
|
+
INTERESTING_CODEPOINTS = {
|
14
|
+
"\u{0080}" => "PAD",
|
15
|
+
"\u{0081}" => "HOP",
|
16
|
+
"\u{0082}" => "BPH",
|
17
|
+
"\u{0083}" => "NBH",
|
18
|
+
"\u{0084}" => "IND",
|
19
|
+
"\u{0085}" => "NEL",
|
20
|
+
"\u{0086}" => "SSA",
|
21
|
+
"\u{0087}" => "ESA",
|
22
|
+
"\u{0088}" => "HTS",
|
23
|
+
"\u{0089}" => "HTJ",
|
24
|
+
"\u{008A}" => "VTS",
|
25
|
+
"\u{008B}" => "PLD",
|
26
|
+
"\u{008C}" => "PLU",
|
27
|
+
"\u{008D}" => "RI",
|
28
|
+
"\u{008E}" => "SS2",
|
29
|
+
"\u{008F}" => "SS3",
|
30
|
+
"\u{0090}" => "DCS",
|
31
|
+
"\u{0091}" => "PU1",
|
32
|
+
"\u{0092}" => "PU2",
|
33
|
+
"\u{0093}" => "STS",
|
34
|
+
"\u{0094}" => "CCH",
|
35
|
+
"\u{0095}" => "MW",
|
36
|
+
"\u{0096}" => "SPA",
|
37
|
+
"\u{0097}" => "EPA",
|
38
|
+
"\u{0098}" => "SOS",
|
39
|
+
"\u{0099}" => "SGC",
|
40
|
+
"\u{009A}" => "SCI",
|
41
|
+
"\u{009B}" => "CSI",
|
42
|
+
"\u{009C}" => "ST",
|
43
|
+
"\u{009D}" => "OSC",
|
44
|
+
"\u{009E}" => "PM",
|
45
|
+
"\u{009F}" => "APC",
|
46
|
+
|
47
|
+
"\u{200E}" => "LRM",
|
48
|
+
"\u{200F}" => "RLM",
|
49
|
+
"\u{202A}" => "LRE",
|
50
|
+
"\u{202B}" => "RLE",
|
51
|
+
"\u{202C}" => "PDF",
|
52
|
+
"\u{202D}" => "LRO",
|
53
|
+
"\u{202E}" => "RLO",
|
54
|
+
"\u{2066}" => "LRI",
|
55
|
+
"\u{2067}" => "RLI",
|
56
|
+
"\u{2068}" => "FSI",
|
57
|
+
"\u{2069}" => "PDI",
|
58
|
+
|
59
|
+
"\u{FE00}" => "VS1",
|
60
|
+
"\u{FE01}" => "VS2",
|
61
|
+
"\u{FE02}" => "VS3",
|
62
|
+
"\u{FE03}" => "VS4",
|
63
|
+
"\u{FE04}" => "VS5",
|
64
|
+
"\u{FE05}" => "VS6",
|
65
|
+
"\u{FE06}" => "VS7",
|
66
|
+
"\u{FE07}" => "VS8",
|
67
|
+
"\u{FE08}" => "VS9",
|
68
|
+
"\u{FE09}" => "VS10",
|
69
|
+
"\u{FE0A}" => "VS11",
|
70
|
+
"\u{FE0B}" => "VS12",
|
71
|
+
"\u{FE0C}" => "VS13",
|
72
|
+
"\u{FE0D}" => "VS14",
|
73
|
+
"\u{FE0E}" => "VS15",
|
74
|
+
"\u{FE0F}" => "VS16",
|
75
|
+
}.freeze
|
76
|
+
COULD_BE_WHITESPACE = '[\p{Space}⠀]'.freeze
|
77
|
+
# UNASSIGNED = '\p{Cn}'.freeze
|
78
|
+
|
79
|
+
def self.symbolify(char, encoding = char.encoding)
|
80
|
+
char.tr!(
|
81
|
+
ASCII_CONTROL_CODEPOINTS.encode(encoding),
|
82
|
+
ASCII_CONTROL_SYMBOLS.encode(encoding)
|
83
|
+
)
|
84
|
+
char.gsub!(
|
85
|
+
Regexp.compile(COULD_BE_WHITESPACE.encode(encoding)),
|
86
|
+
']\0['.encode(encoding)
|
87
|
+
)
|
88
|
+
# char.gsub!(
|
89
|
+
# Regexp.compile(UNASSIGNED.encode(encoding)),
|
90
|
+
# 'n/a'.encode(encoding)
|
91
|
+
# )
|
92
|
+
INTERESTING_CODEPOINTS.each{ |cp, desc|
|
93
|
+
char.gsub! Regexp.compile(cp.encode(encoding)), desc.encode(encoding)
|
94
|
+
}
|
95
|
+
char.gsub! TAG_START.encode(encoding), TAG_START_SYMBOL.encode(encoding)
|
96
|
+
char.gsub! TAG_SPACE.encode(encoding), TAG_SPACE_SYMBOL.encode(encoding)
|
97
|
+
char.gsub! TAG_DELETE.encode(encoding), TAG_DELETE_SYMBOL.encode(encoding)
|
98
|
+
|
99
|
+
ord = char.ord
|
100
|
+
if ord > 917536 && ord < 917631
|
101
|
+
char.tr!(TAGS.encode(encoding), ASCII_CHARS.encode(encoding))
|
102
|
+
char = "TAG ".encode(encoding) + char
|
103
|
+
end
|
104
|
+
|
105
|
+
char
|
106
|
+
end
|
107
|
+
end
|
108
|
+
end
|
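The first step of `Symbolify.symbolify` replaces ASCII control characters with their Unicode "control picture" counterparts via `String#tr`. A standalone sketch of just that step is given below, assuming a UTF-8 input string; `show_controls` and the constant names are illustrative, and the non-mutating `tr` is used here instead of the `tr!` seen in the diff.

```ruby
# Ranges mirror the constants in the diff: C0 controls plus DELETE,
# mapped onto U+2400..U+241F plus U+2421 (SYMBOL FOR DELETE).
CONTROL_CODEPOINTS = "\x00-\x1F\x7F"
CONTROL_PICTURES   = "\u{2400}-\u{241F}\u{2421}"

# Illustrative helper: returns a copy with control characters made visible.
def show_controls(string)
  string.tr(CONTROL_CODEPOINTS, CONTROL_PICTURES)
end

puts show_controls("a\tb\n") # prints "a␉b␊"
```

Both range strings expand to 33 characters, so each control character maps to exactly one picture.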
data/lib/unibits/version.rb
CHANGED
Binary file
|
Binary file
|
data/spec/unibits_spec.rb
CHANGED
@@ -66,5 +66,171 @@ describe Unibits do
|
|
66
66
|
result.must_match "43"
|
67
67
|
result.must_match "01000011"
|
68
68
|
end
|
69
|
+
|
70
|
+
describe "invalid UTF-8 encodings" do
|
71
|
+
it "- unexpected continuation byte (1/2)" do
|
72
|
+
string = "abc\x80efg"
|
73
|
+
result = Paint.unpaint(Unibits.visualize(string))
|
74
|
+
result.must_match "�"
|
75
|
+
result.must_match "unexp.c."
|
76
|
+
result.must_match /e.*f.*g/m
|
77
|
+
end
|
78
|
+
|
79
|
+
it "- unexpected continuation byte (2/2)" do
|
80
|
+
string = "🌫\x81efg"
|
81
|
+
result = Paint.unpaint(Unibits.visualize(string))
|
82
|
+
result.must_match "�"
|
83
|
+
result.must_match "unexp.c."
|
84
|
+
result.must_match /e.*f.*g/m
|
85
|
+
end
|
86
|
+
|
87
|
+
it "- not enough continuation bytes" do
|
88
|
+
string = "\xF0\x9F\x8CABC"
|
89
|
+
result = Paint.unpaint(Unibits.visualize(string))
|
90
|
+
result.must_match "�"
|
91
|
+
result.must_match "n.e.con."
|
92
|
+
result.must_match /A.*B.*C/m
|
93
|
+
end
|
94
|
+
|
95
|
+
it "- overlong padding (1/2)" do
|
96
|
+
string = "\xE0\x81\x81ABC"
|
97
|
+
result = Paint.unpaint(Unibits.visualize(string))
|
98
|
+
result.must_match "�"
|
99
|
+
result.must_match /overlong.*overlong.*overlong/m
|
100
|
+
result.must_match /A.*B.*C/m
|
101
|
+
end
|
102
|
+
|
103
|
+
it "- overlong padding (2/2)" do
|
104
|
+
string = "\xC0\x80no double null"
|
105
|
+
result = Paint.unpaint(Unibits.visualize(string))
|
106
|
+
result.must_match "�"
|
107
|
+
result.must_match /overlong.*overlong/m
|
108
|
+
end
|
109
|
+
|
110
|
+
it "- too large codepoint (1/2)" do
|
111
|
+
string = "\xF5\x8F\xBF\xBFABC"
|
112
|
+
result = Paint.unpaint(Unibits.visualize(string))
|
113
|
+
result.must_match "�"
|
114
|
+
result.must_match /toolarge.*toolarge.*toolarge.*toolarge/m
|
115
|
+
result.must_match /A.*B.*C/m
|
116
|
+
end
|
117
|
+
|
118
|
+
it "- too large codepoint (2/2)" do
|
119
|
+
string = "\xF4\xAF\xBF\xBFABC"
|
120
|
+
result = Paint.unpaint(Unibits.visualize(string))
|
121
|
+
result.must_match "�"
|
122
|
+
result.must_match /toolarge.*toolarge.*toolarge.*toolarge/m
|
123
|
+
result.must_match /A.*B.*C/m
|
124
|
+
end
|
125
|
+
|
126
|
+
it "- too large byte" do
|
127
|
+
string = "\xFF"
|
128
|
+
result = Paint.unpaint(Unibits.visualize(string))
|
129
|
+
result.must_match "�"
|
130
|
+
result.must_match "toolarge"
|
131
|
+
end
|
132
|
+
|
133
|
+
it "- has surrogate (1/2)" do
|
134
|
+
string = "\xED\xA0\x80ABC"
|
135
|
+
result = Paint.unpaint(Unibits.visualize(string))
|
136
|
+
result.must_match "�"
|
137
|
+
result.must_match "sur.gate"
|
138
|
+
result.must_match /A.*B.*C/m
|
139
|
+
end
|
140
|
+
|
141
|
+
it "- has surrogate (2/2)" do
|
142
|
+
string = "\xED\xBF\xBFABC"
|
143
|
+
result = Paint.unpaint(Unibits.visualize(string))
|
144
|
+
result.must_match "�"
|
145
|
+
result.must_match "sur.gate"
|
146
|
+
result.must_match /A.*B.*C/m
|
147
|
+
end
|
148
|
+
end
|
149
|
+
|
150
|
+
describe "invalid UTF-16 encodings" do
|
151
|
+
it "- incomplete number of bytes (1/2)" do
|
152
|
+
string = "a".b.force_encoding("UTF-16LE")
|
153
|
+
result = Paint.unpaint(Unibits.visualize(string))
|
154
|
+
result.must_match "incompl."
|
155
|
+
result.must_match "�"
|
156
|
+
end
|
157
|
+
|
158
|
+
it "- incomplete number of bytes (2/2)" do
|
159
|
+
string = "🌫".b[0..-2].force_encoding("UTF-16LE")
|
160
|
+
result = Paint.unpaint(Unibits.visualize(string))
|
161
|
+
result.must_match "incompl."
|
162
|
+
result.must_match "�"
|
163
|
+
end
|
164
|
+
|
165
|
+
it "- only lower half surrogate" do
|
166
|
+
string = "\x3C\xD8\x2Ba".force_encoding("UTF-16LE")
|
167
|
+
result = Paint.unpaint(Unibits.visualize(string))
|
168
|
+
result.must_match "hlf.srg."
|
169
|
+
result.must_match "�"
|
170
|
+
end
|
171
|
+
|
172
|
+
it "- only higher half surrogate" do
|
173
|
+
string = "\x3Ca\x2B\xDF".force_encoding("UTF-16LE")
|
174
|
+
result = Paint.unpaint(Unibits.visualize(string))
|
175
|
+
result.must_match "hlf.srg."
|
176
|
+
result.must_match "�"
|
177
|
+
end
|
178
|
+
end
|
179
|
+
|
180
|
+
describe "invalid UTF-32 encodings" do
|
181
|
+
# please note, currently, too large codepoints and encoded utf16 surrogates are treated as valid encodings
|
182
|
+
|
183
|
+
it "- incomplete number of bytes (1/3)" do
|
184
|
+
string = "a".b.force_encoding("UTF-32LE")
|
185
|
+
result = Paint.unpaint(Unibits.visualize(string))
|
186
|
+
result.must_match "incompl."
|
187
|
+
result.must_match "�"
|
188
|
+
end
|
189
|
+
|
190
|
+
it "- incomplete number of bytes (2/3)" do
|
191
|
+
string = "🌫".b[0..-2].force_encoding("UTF-32LE")
|
192
|
+
result = Paint.unpaint(Unibits.visualize(string))
|
193
|
+
result.must_match "incompl."
|
194
|
+
result.must_match "�"
|
195
|
+
end
|
196
|
+
|
197
|
+
it "- incomplete number of bytes (3/3)" do
|
198
|
+
string = "🌫".b[0..-2].force_encoding("UTF-32LE")
|
199
|
+
result = Paint.unpaint(Unibits.visualize(string))
|
200
|
+
result.must_match "incompl."
|
201
|
+
result.must_match "�"
|
202
|
+
end
|
203
|
+
end
|
204
|
+
|
205
|
+
describe "invalid ASCII encodings" do
|
206
|
+
it "- contains bytes with 8th bit set" do
|
207
|
+
string = "abc\x80efg".force_encoding("ASCII")
|
208
|
+
result = Paint.unpaint(Unibits.visualize(string))
|
209
|
+
result.must_match "�"
|
210
|
+
result.must_match /e.*f.*g/m
|
211
|
+
end
|
212
|
+
end
|
213
|
+
end
|
214
|
+
|
215
|
+
describe "wide_ambiguous: option" do
|
216
|
+
it "- default is 1" do
|
217
|
+
string = "⚀······"
|
218
|
+
result = Unibits.stats(string)
|
219
|
+
result.wont_match "13"
|
220
|
+
end
|
221
|
+
|
222
|
+
it "- is 2 when wide_ambiguous is true" do
|
223
|
+
string = "⚀······"
|
224
|
+
result = Unibits.stats(string, wide_ambiguous: true)
|
225
|
+
result.must_match "13"
|
226
|
+
end
|
227
|
+
end
|
228
|
+
|
229
|
+
describe "width: option" do
|
230
|
+
it "sets a custom column width" do
|
231
|
+
string = "bla" * 99
|
232
|
+
result = Paint.unpaint(Unibits.visualize(string, width: 50))
|
233
|
+
(result[/^.*$/].size <= 50).must_equal true
|
234
|
+
end
|
69
235
|
end
|
70
236
|
end
|
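The stats header exercised by these specs reports bytes, codepoints, glyphs, and expected terminal width. The first three can be reproduced with Ruby core alone, as in the sketch below; `basic_stats` is a hypothetical helper, it assumes a valid UTF-8 string, and it leaves out the display width, which the gem obtains from the unicode-display_width dependency.

```ruby
# Illustrative helper: [bytes, codepoints, glyphs] for a UTF-8 string.
# \X matches one extended grapheme cluster (a user-perceived character).
def basic_stats(string)
  [string.bytesize, string.size, string.scan(/\X/).size]
end

# "e" followed by U+0301 COMBINING ACUTE ACCENT renders as one glyph.
p basic_stats("e\u0301") # => [3, 2, 1]
```

The gap between the three numbers is what makes the header useful when debugging strings that look shorter or longer than they are.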
data/unibits.gemspec
CHANGED
@@ -17,7 +17,7 @@ Gem::Specification.new do |gem|
|
|
17
17
|
gem.test_files = gem.files.grep(%r{^(test|spec|features)/})
|
18
18
|
gem.require_paths = ["lib"]
|
19
19
|
|
20
|
-
gem.add_dependency 'paint', '
|
20
|
+
gem.add_dependency 'paint', '>= 0.9', '< 3.0'
|
21
21
|
gem.add_dependency 'unicode-display_width', '~> 1.1'
|
22
22
|
gem.add_dependency 'rationalist', '~> 2.0'
|
23
23
|
|
metadata
CHANGED
@@ -1,29 +1,35 @@
|
|
1
1
|
--- !ruby/object:Gem::Specification
|
2
2
|
name: unibits
|
3
3
|
version: !ruby/object:Gem::Version
|
4
|
-
version: 1.
|
4
|
+
version: 1.1.0
|
5
5
|
platform: ruby
|
6
6
|
authors:
|
7
7
|
- Jan Lelis
|
8
8
|
autorequire:
|
9
9
|
bindir: bin
|
10
10
|
cert_chain: []
|
11
|
-
date: 2017-03-
|
11
|
+
date: 2017-03-08 00:00:00.000000000 Z
|
12
12
|
dependencies:
|
13
13
|
- !ruby/object:Gem::Dependency
|
14
14
|
name: paint
|
15
15
|
requirement: !ruby/object:Gem::Requirement
|
16
16
|
requirements:
|
17
|
-
- - "
|
17
|
+
- - ">="
|
18
18
|
- !ruby/object:Gem::Version
|
19
|
-
version: '
|
19
|
+
version: '0.9'
|
20
|
+
- - "<"
|
21
|
+
- !ruby/object:Gem::Version
|
22
|
+
version: '3.0'
|
20
23
|
type: :runtime
|
21
24
|
prerelease: false
|
22
25
|
version_requirements: !ruby/object:Gem::Requirement
|
23
26
|
requirements:
|
24
|
-
- - "
|
27
|
+
- - ">="
|
25
28
|
- !ruby/object:Gem::Version
|
26
|
-
version: '
|
29
|
+
version: '0.9'
|
30
|
+
- - "<"
|
31
|
+
- !ruby/object:Gem::Version
|
32
|
+
version: '3.0'
|
27
33
|
- !ruby/object:Gem::Dependency
|
28
34
|
name: unicode-display_width
|
29
35
|
requirement: !ruby/object:Gem::Requirement
|
@@ -74,13 +80,16 @@ files:
|
|
74
80
|
- bin/unibits
|
75
81
|
- lib/unibits.rb
|
76
82
|
- lib/unibits/kernel_method.rb
|
83
|
+
- lib/unibits/symbolify.rb
|
77
84
|
- lib/unibits/version.rb
|
85
|
+
- screenshots/ascii.invalid.png
|
78
86
|
- screenshots/ascii.png
|
79
87
|
- screenshots/binary.png
|
80
88
|
- screenshots/utf-16be.png
|
81
89
|
- screenshots/utf-16le.png
|
82
90
|
- screenshots/utf-32be.png
|
83
91
|
- screenshots/utf-32le.png
|
92
|
+
- screenshots/utf-8.invalid.png
|
84
93
|
- screenshots/utf-8.png
|
85
94
|
- spec/unibits_spec.rb
|
86
95
|
- unibits.gemspec
|