characteristics 0.6.0 → 0.7.0

checksums.yaml CHANGED
@@ -1,7 +1,7 @@
  ---
  SHA1:
-   metadata.gz: 95c672e25105f6a9281caee75f58cc242e50bee5
-   data.tar.gz: 17fe74e2842c0175b284e9a0b170561aba80c7af
+   metadata.gz: 8bd09e16bed3587eaa18057790766a6e720e3d25
+   data.tar.gz: 0bb97a8df68aa447fce5b04750573ef9da8a7373
  SHA512:
-   metadata.gz: f12e4d9b7b471019a606060166629a99e950b1e5fc3d44f933da70239f2cbc5ed4ac8023b9bb79998709a6bed14c3681dd141f1894c11ddec14649e31c68d841
-   data.tar.gz: 5110369bc3619f9e3ddc23c9bc6dd5abd7ff0e8ffdfd46889fbc2b04aea1d7611de607852ed0dfab04befa330e8c3e8c3afe5fa680ff0cb16ace232a5a968f6d
+   metadata.gz: '0519adb1524a00105cb7bc54b329aadac50ad2fd350bbe0fda9356474e038a52796ad6548cd3a33865b943f1fd66ae22c5c6bf0673c8528b31cfdf2b5296626e'
+   data.tar.gz: 4a10cc4ec190ca4ed5fdf075449ce8d655182a70cc09faeb75e3805a02a0c7798b762901e6ff881723170f5a7bf51bc119c4af18c44be51c8fb803722db956f7
@@ -1,5 +1,13 @@
  ## CHANGELOG

+ ### 0.7.0
+
+ * Add more Unicode properties
+   * variation_selector?
+   * tag?
+   * ignorable?
+   * noncharacter?
+
  ### 0.6.0

  * Add separator? property
data/README.md CHANGED
@@ -1,14 +1,17 @@
  # Characteristics [![[version]](https://badge.fury.io/rb/characteristics.svg)](http://badge.fury.io/rb/characteristics) [![[travis]](https://travis-ci.org/janlelis/characteristics.svg)](https://travis-ci.org/janlelis/characteristics)

- A Ruby library which provides some basic information about how characters behave in different encodings:
+ A Ruby library that provides additional info about characters

- - Is a character valid according to its encoding?
+ - Could a character be invisible (blank)?
  - Is a character assigned?
  - Is a character a special control character?
- - Could a character be invisible (blank)?
+
+ Extra data is available for Unicode characters (see below).

  The [unibits](https://github.com/janlelis/unibits) and [uniscribe](https://github.com/janlelis/uniscribe) gems makes use of this data to visualize it accordingliy.

+ ¹ in the sense of [codepoints](https://en.wikipedia.org/wiki/Codepoint)
+
  ## Setup

  Add to your `Gemfile`:
@@ -20,6 +23,7 @@ gem 'characteristics'
  ## Usage

  ```ruby
+ # All supported encodings
  char_info = Characteristics.create(character)
  char_info.valid? # => true / false
  char_info.unicode? # => true / false
@@ -28,6 +32,13 @@ char_info.control? # => true / false
  char_info.blank? # => true / false
  char_info.separator? # => true / false
  char_info.format? # => true / false
+
+ # Unicode characters
+ char_info = Characteristics.create(character)
+ char_info.variation_selector? # => true / false
+ char_info.tag? # => true / false
+ char_info.ignorable? # => true / false
+ char_info.noncharacter? # => true / false
  ```

  ## Types of Encodings
@@ -43,44 +54,68 @@ This library knows of four different kinds of encodings:
  - **:binary** Arbitrary string
    - *ASCII-8BIT*

- Other encodings are not supported, yet.
+ Other encodings are currently not supported.
+
+ ## Properties

- ## Predicates
+ ### General

- ### `valid?`
+ #### `valid?`

  Validness is determined by Ruby's `String#valid_encoding?`

- ### `unicode?`
+ #### `unicode?`

- `true` for Unicode encodings (`UTF-X`)
+ **true** for Unicode encodings (`UTF-X`)

- ### `control?`
+ #### `control?`

  Control characters are codepoints in the is [C0, delete or C1 control character range](https://en.wikipedia.org/wiki/C0_and_C1_control_codes). Characters in this range of [IBM codepage 437](https://en.wikipedia.org/wiki/Code_page_437) based encodings are always treated as control characters.

- ### `assigned?`
+ #### `assigned?`

  - All valid ASCII and BINARY characters are considered assigned
  - For other byte based encodings, a character is considered assigned if it is not on the exception list included in this library. C0 control characters (and `\x7F`) are always considered assigned. C1 control characters are treated as assigned, if the encoding generally does not assign characters in the C1 region.
  - For Unicode, the general category is considered

- ### `blank?`
+ #### `blank?`

  The library includes a list of characters that might not be rendered visually. This list does not include unassigned codepoints, control characters (except for `\t`, `\n`, `\v`, `\f`, `\r`, and `\u{85}` in Unicode), or special formatting characters (right-to-left markers, variation selectors, etc).

- ### `separator?`
+ #### `separator?`

  Returns true if character is considered a separator. All separators also return true for the `blank?` check. In Unicode, the following characters are separators: `\n`, `\v`, `\f`, `\r`, `\u{85}` (next line), `\u{2028}` (line separator), and `\u{2029}` (paragraph separator)

- ### `format?`
+ #### `format?`
+
+ This flag is *true* only for special formatting characters, which are not control characters, like right-to-left marks. In Unicode, this means codepoints with the General Category of **Cf**.
+
+ ### Additional Unicode Properties

- This flag is `true` only for special formatting characters, which are not control characters, like Right-to-left marks. In Unicode, this means codepoints with the General Category of **Cf**.
+ #### `variation_selector?`
+
+ **true** for [variation selectors](https://en.wikipedia.org/wiki/Variation_Selector).
+
+ #### `tag?`
+
+ **true** for [tags](https://en.wikipedia.org/wiki/Tags_(Unicode_block)).
+
+ #### `ignorable?`
+
+ **true** for characters which might not be implemented, and thus, might render no visible glyph.
+
+ #### `noncharacter?`
+
+ **true** if codepoint will never be assigned in a future standard of Unicode.

  ## Todo

  - Support all non-dummy encodings that Ruby supports

+ ## Also See
+
+ - [Symbolify](https://github.com/janlelis/symbolify)
+
  ## MIT License

  Copyright (C) 2017 Jan Lelis <http://janlelis.com>. Released under the MIT license.
@@ -91,6 +91,76 @@ class UnicodeCharacteristics < Characteristics
      0x2069,
    ].freeze

+   VARIATION_SELECTORS = [
+     *0x180B..0x180D,
+     *0xFE00..0xFE0F,
+     *0xE0100..0xE01EF,
+   ].freeze
+
+   TAGS = [
+     0xE0001,
+     *0xE0020..0xE007F,
+   ].freeze
+
+   NONCHARACTERS = [
+     *0xFDD0..0xFDEF,
+     0xFFFE, 0xFFFF,
+     0x1FFFE, 0x1FFFF,
+     0x2FFFE, 0x2FFFF,
+     0x3FFFE, 0x3FFFF,
+     0x4FFFE, 0x4FFFF,
+     0x5FFFE, 0x5FFFF,
+     0x6FFFE, 0x6FFFF,
+     0x7FFFE, 0x7FFFF,
+     0x8FFFE, 0x8FFFF,
+     0x9FFFE, 0x9FFFF,
+     0xAFFFE, 0xAFFFF,
+     0xBFFFE, 0xBFFFF,
+     0xCFFFE, 0xCFFFF,
+     0xDFFFE, 0xDFFFF,
+     0xEFFFE, 0xEFFFF,
+     0xFFFFE, 0xFFFFF,
+     0x10FFFE, 0x10FFFF,
+   ].freeze
+
+   IGNORABLE = [
+     0x00AD,
+     0x034F,
+     0x061C,
+     *0x115F..0x1160,
+     *0x17B4..0x17B5,
+     *0x180B..0x180E,
+     *0x200B..0x200F,
+     *0x202A..0x202E,
+     *0x2060..0x206F,
+     0x3164,
+     *0xFE00..0xFE0F,
+     0xFEFF,
+     0xFFA0,
+     *0xFFF0..0xFFF8,
+     *0x1BCA0..0x1BCA3,
+     *0x1D173..0x1D17A,
+     *0xE0000..0xE0FFF,
+   ].freeze
+
+   KDDI = [
+     *0xE468..0xE5DF,
+     *0xEA80..0xEB8E,
+   ].freeze
+
+   SOFTBANK = [
+     *0xE001..0xE05A,
+     *0xE101..0xE15A,
+     *0xE201..0xE25A,
+     *0xE301..0xE34D,
+     *0xE401..0xE44C,
+     *0xE501..0xE53E,
+   ].freeze
+
+   DOCOMO = [
+     *0xE63E..0xE757,
+   ].freeze
+
    attr_reader :category

    def initialize(char)
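Reviewer note: the `NONCHARACTERS` table added in this hunk can be cross-checked arithmetically. Unicode defines exactly 66 noncharacters: U+FDD0..U+FDEF plus the last two code points of each of the 17 planes. The following is a minimal sketch of that check, not code from the gem; the helper name `noncharacter_codepoint?` and the local `table` are hypothetical and introduced only for illustration.

```ruby
# Hypothetical helper, not part of the characteristics gem.
# A code point is a noncharacter if it lies in U+FDD0..U+FDEF or if it is
# one of the last two code points of a plane (low 16 bits FFFE or FFFF).
def noncharacter_codepoint?(codepoint)
  return false unless (0..0x10FFFF).cover?(codepoint)

  (0xFDD0..0xFDEF).cover?(codepoint) || (codepoint & 0xFFFE) == 0xFFFE
end

# Rebuild the same values as the NONCHARACTERS table for comparison:
# the contiguous block plus the FFFE/FFFF pair of planes 0 through 16.
table = [
  *0xFDD0..0xFDEF,
  *(0..16).flat_map { |plane| [(plane << 16) | 0xFFFE, (plane << 16) | 0xFFFF] },
]

# Scan the whole code space and collect everything the helper accepts.
computed = (0..0x10FFFF).select { |cp| noncharacter_codepoint?(cp) }

puts computed.size                 # number of noncharacters found
puts computed.sort == table.sort   # whether both approaches agree
```

If both approaches agree, the hand-written table covers every plane and the U+FDD0..U+FDEF block without gaps.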
@@ -142,28 +212,42 @@ class UnicodeCharacteristics < Characteristics
      @is_valid && BIDI_CONTROL.include?(@ord)
    end

+   # unicode specific
+
+   def variation_selector?
+     @is_valid && VARIATION_SELECTORS.include?(@ord)
+   end
+
+   def tag?
+     @is_valid && TAGS.include?(@ord)
+   end
+
+   def noncharacter?
+     @is_valid && NONCHARACTERS.include?(@ord)
+   end
+
+   def ignorable?
+     @is_valid && IGNORABLE.include?(@ord)
+   end
+
+   # emoji
+
    def kddi?
      @is_valid &&
      encoding_has_kddi? &&
-     ( @ord >= 0xE468 && @ord <= 0xE5DF ||
-     @ord >= 0xEA80 && @ord <= 0xEB8E )
+     KDDI.include?(@ord)
    end

    def softbank?
      @is_valid &&
      encoding_has_softbank? &&
-     ( @ord >= 0xE001 && @ord <= 0xE05A ||
-     @ord >= 0xE101 && @ord <= 0xE15A ||
-     @ord >= 0xE201 && @ord <= 0xE25A ||
-     @ord >= 0xE301 && @ord <= 0xE34D ||
-     @ord >= 0xE401 && @ord <= 0xE44C ||
-     @ord >= 0xE501 && @ord <= 0xE53E )
+     SOFTBANK.include?(@ord)
    end

    def docomo?
      @is_valid &&
      encoding_has_docomo? &&
-     ( @ord >= 0xE63E && @ord <= 0xE757 )
+     DOCOMO.include?(@ord)
    end

    private
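Reviewer note: the predicates in this hunk test membership with `Array#include?` on arrays built from splatted ranges, which is a linear scan (`IGNORABLE` alone expands `0xE0000..0xE0FFF` into 4096 entries). One possible alternative, shown here only as a sketch under that assumption, keeps the ranges unexpanded and checks each with `Range#cover?`. The names `IGNORABLE_RANGES` and `ignorable_codepoint?` are hypothetical and do not exist in the gem.

```ruby
# Hypothetical alternative, not the gem's implementation: store the same
# code points as unexpanded ranges (single values become one-element ranges).
IGNORABLE_RANGES = [
  0x00AD..0x00AD,
  0x034F..0x034F,
  0x061C..0x061C,
  0x115F..0x1160,
  0x17B4..0x17B5,
  0x180B..0x180E,
  0x200B..0x200F,
  0x202A..0x202E,
  0x2060..0x206F,
  0x3164..0x3164,
  0xFE00..0xFE0F,
  0xFEFF..0xFEFF,
  0xFFA0..0xFFA0,
  0xFFF0..0xFFF8,
  0x1BCA0..0x1BCA3,
  0x1D173..0x1D17A,
  0xE0000..0xE0FFF,
].freeze

# Range#cover? compares against the range bounds only, so the cost depends
# on the number of ranges (17 here), not on how many code points they span.
def ignorable_codepoint?(codepoint)
  IGNORABLE_RANGES.any? { |range| range.cover?(codepoint) }
end

# Same inputs as the "is ignorable or not" spec in this diff.
puts ignorable_codepoint?("\u{AD}".ord)    # => true
puts ignorable_codepoint?("\u{E0000}".ord) # => true
puts ignorable_codepoint?(" ".ord)         # => false
```

Whether this matters in practice depends on how often the predicates are called; the array-based version in the diff is simpler to read and behaves identically for valid input.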
@@ -1,6 +1,6 @@
  # frozen_string_literal: true

  class Characteristics
-   VERSION = "0.6.0".freeze
+   VERSION = "0.7.0".freeze
    UNICODE_VERSION = "9.0.0".freeze
  end
@@ -41,13 +41,13 @@ describe Characteristics do

      it "is assigned or not" do
        assert assigned? "\x21"
-       refute assigned? "\uFFEF"
+       refute assigned? "\u{FFEF}"
      end

      it "is control or not" do
        assert control? "\x1E"
        assert control? "\x7F"
-       assert control? "\u0080"
+       assert control? "\u{0080}"
        refute control? "\x67"
      end

@@ -62,32 +62,55 @@ describe Characteristics do
      end

      it "is format or not" do
-       assert format? "\uFFF9"
+       assert format? "\u{FFF9}"
        refute format? "\x21"
      end

      it "is bidi_control or not" do
-       assert bidi_control? "\u202D"
+       assert bidi_control? "\u{202D}"
        refute bidi_control? "\x21"
      end
    end

+   describe "Unicode Properties" do
+     it "is variation_selector or not" do
+       assert Characteristics.create("\u{FE00}").variation_selector?
+       refute Characteristics.create("a").variation_selector?
+     end
+
+     it "is tag or not" do
+       assert Characteristics.create("\u{E0020}").tag?
+       refute Characteristics.create("a").tag?
+     end
+
+     it "is noncharacter or not" do
+       assert Characteristics.create("\u{10FFFF}").noncharacter?
+       refute Characteristics.create("a").noncharacter?
+     end
+
+     it "is ignorable or not" do
+       assert Characteristics.create("\u{AD}").ignorable?
+       assert Characteristics.create("\u{E0000}").ignorable?
+       refute Characteristics.create(" ").ignorable?
+     end
+   end
+
    describe "Japanese Emojis" do
      it "can be a KDDI emoji" do
        encoding = "UTF8-KDDI"
-       assert Characteristics.create("\uE468".force_encoding(encoding)).kddi?
+       assert Characteristics.create("\u{E468}".force_encoding(encoding)).kddi?
        refute Characteristics.create("A".force_encoding(encoding)).kddi?
      end

      it "can be a SoftBank emoji" do
        encoding = "UTF8-SoftBank"
-       assert Characteristics.create("\uE001".force_encoding(encoding)).softbank?
+       assert Characteristics.create("\u{E001}".force_encoding(encoding)).softbank?
        refute Characteristics.create("A".force_encoding(encoding)).softbank?
      end

      it "can be a DoCoMo emoji" do
        encoding = "UTF8-DoCoMo"
-       assert Characteristics.create("\uE63E".force_encoding(encoding)).docomo?
+       assert Characteristics.create("\u{E63E}".force_encoding(encoding)).docomo?
        refute Characteristics.create("A".force_encoding(encoding)).docomo?
      end
    end
metadata CHANGED
@@ -1,14 +1,14 @@
  --- !ruby/object:Gem::Specification
  name: characteristics
  version: !ruby/object:Gem::Version
-   version: 0.6.0
+   version: 0.7.0
  platform: ruby
  authors:
  - Jan Lelis
  autorequire:
  bindir: bin
  cert_chain: []
- date: 2017-03-30 00:00:00.000000000 Z
+ date: 2017-03-31 00:00:00.000000000 Z
  dependencies:
  - !ruby/object:Gem::Dependency
    name: unicode-categories