unicode-scripts 1.10.0 → 1.11.0
- checksums.yaml +4 -4
- data/CHANGELOG.md +4 -0
- data/Gemfile.lock +1 -1
- data/README.md +78 -368
- data/lib/unicode/scripts/constants.rb +3 -1
- data/lib/unicode/scripts.rb +70 -4
- data/spec/unicode_scripts_spec.rb +50 -0
- metadata +3 -3
checksums.yaml
CHANGED
@@ -1,7 +1,7 @@
|
|
1
1
|
---
|
2
2
|
SHA256:
|
3
|
-
metadata.gz:
|
4
|
-
data.tar.gz:
|
3
|
+
metadata.gz: feaabd20c3a3869a96e62e34d7c39b83739365549904ecdb129e83d9f73540d4
|
4
|
+
data.tar.gz: 40af16102c2aa63b35051f09b65cd8e2d14c32fbce21a7c802ea260121ade5b5
|
5
5
|
SHA512:
|
6
|
-
metadata.gz:
|
7
|
-
data.tar.gz:
|
6
|
+
metadata.gz: 8d5f215ed6b03d5192eef673d22f0705cac149e3701427570ab52b4e3c538ac1537b7ca5a0768a66e6d5ffdd4d66b9363b4fdafbdf74112df1e7e59ab639cf2c
|
7
|
+
data.tar.gz: 735b9611f0bfee72dd074a8873c3c269d23a150b6d7a9da64ca33b7d02c5a65316a45eaa47bb60fc44a554b70d3efbd3bcd8cf0c94185e01d7fdd5f767766839
|
data/CHANGELOG.md
CHANGED
data/Gemfile.lock
CHANGED
data/README.md
CHANGED
@@ -1,12 +1,12 @@
|
|
1
1
|
# Unicode::Scripts [![[version]](https://badge.fury.io/rb/unicode-scripts.svg)](https://badge.fury.io/rb/unicode-scripts) [![[ci]](https://github.com/janlelis/unicode-scripts/workflows/Test/badge.svg)](https://github.com/janlelis/unicode-scripts/actions?query=workflow%3ATest)
|
2
2
|
|
3
|
-
Retrieve
|
3
|
+
Retrieve all [Unicode script(s)](https://en.wikipedia.org/wiki/Script_%28Unicode%29) a string belongs to. Can also return the *Script_Extension* property (scx) which is defined as characters which are "commonly used with more than one script, but with a limited number of scripts".
|
4
4
|
|
5
|
-
|
5
|
+
Based on the *Script_Extension*, this library can also return the [augmented script set](https://www.unicode.org/reports/tr39/#def-augmented-script-set) to figure out if a string is **mixed-script** or **single-script**. Mixed scripts can be an indicator of suspicious user inputs.
|
6
6
|
|
7
|
-
|
7
|
+
Unicode version: **16.0.0** (September 2024)
|
8
8
|
|
9
|
-
|
9
|
+
Supported Rubies: **3.x** (might work: **2.x**)
|
10
10
|
|
11
11
|
## Gemfile
|
12
12
|
|
@@ -14,7 +14,7 @@ Old Rubies that might still work: **2.7**, **2.6**, **2.5**, **2.4**, **2.3**, *
|
|
14
14
|
gem "unicode-scripts"
|
15
15
|
```
|
16
16
|
|
17
|
-
## Usage
|
17
|
+
## Usage - Scripts and Script Extensions
|
18
18
|
|
19
19
|
```ruby
|
20
20
|
require "unicode/scripts"
|
@@ -34,387 +34,97 @@ Unicode::Scripts.script_extensions("॥")
|
|
34
34
|
"Oriya", "Sinhala", "Syloti_Nagri", "Takri", "Tamil", "Telugu", "Tirhuta"]
|
35
35
|
```
|
36
36
|
|
37
|
-
##
|
38
|
-
### Regex Matching
|
37
|
+
## Usage - Augmented Scripts
|
39
38
|
|
40
|
-
|
39
|
+
Like script extensions, but adds meta scripts for Asian languages and treats _Common_/_Inherited_ values as ALL scripts.
|
41
40
|
|
42
41
|
```ruby
|
43
|
-
|
42
|
+
require "unicode/scripts"
|
43
|
+
|
44
|
+
Unicode::Scripts.augmented_scripts("ねガ") # => ['Hira', 'Kana', 'Jpan']
|
45
|
+
Unicode::Scripts.augmented_scripts("1") # => ["Adlm", "Aghb", "Ahom", … ]
|
44
46
|
```
|
45
47
|
|
46
|
-
|
48
|
+
## Usage - Resolved Script
|
49
|
+
|
50
|
+
Intersection of all augmented scripts per character.
|
51
|
+
|
52
|
+
```ruby
|
53
|
+
require "unicode/scripts"
|
54
|
+
|
55
|
+
Unicode::Scripts.resolved_scripts("СігсӀе") # => [ 'Cyrl' ]
|
56
|
+
Unicode::Scripts.resolved_scripts("Сirсlе") # => []
|
57
|
+
Unicode::Scripts.resolved_scripts("𝖢𝗂𝗋𝖼𝗅𝖾") # => ['Adlm', 'Aghb', 'Ahom', … ]
|
58
|
+
Unicode::Scripts.resolved_scripts("1") # => ['Adlm','Aghb', 'Ahom', … ]
|
59
|
+
Unicode::Scripts.resolved_scripts("ねガ") # => ['Hira', 'Kana', 'Jpan']
|
60
|
+
```
|
61
|
+
|
62
|
+
Please note that the **resolved script** can contain multiple scripts, as per standard.
|
63
|
+
|
64
|
+
## Usage - Mixed-Script Detection
|
65
|
+
|
66
|
+
Mixed-script if resolved script set is empty, single-script otherwise.
|
67
|
+
|
68
|
+
```ruby
|
69
|
+
require "unicode/scripts"
|
47
70
|
|
48
|
-
|
71
|
+
Unicode::Scripts.mixed?("СігсӀе"); # => false
|
72
|
+
Unicode::Scripts.mixed?("Сirсlе"); # => true
|
73
|
+
Unicode::Scripts.mixed?("𝖢𝗂𝗋𝖼𝗅𝖾"); # => false
|
74
|
+
Unicode::Scripts.mixed?("1"); # => false
|
75
|
+
Unicode::Scripts.mixed?("ねガ"); # => false
|
76
|
+
|
77
|
+
Unicode::Scripts.single?("СігсӀе"); # => true
|
78
|
+
Unicode::Scripts.single?("Сirсlе"); # => false
|
79
|
+
Unicode::Scripts.single?("𝖢𝗂𝗋𝖼𝗅𝖾"); # => true
|
80
|
+
Unicode::Scripts.single?("1"); # => true
|
81
|
+
Unicode::Scripts.single?("ねガ"); # => true
|
82
|
+
```
|
83
|
+
|
84
|
+
Please note that a **single-script** string might actually contain multiple scripts, as per standard (e.g. for Asian languages)
|
85
|
+
|
86
|
+
### List of All Scripts
|
49
87
|
|
50
88
|
You can extract all script names from the gem like this:
|
51
89
|
|
52
90
|
```ruby
|
53
91
|
require "unicode/scripts"
|
54
|
-
puts Unicode::Scripts.names
|
55
|
-
|
56
|
-
# # # Output # # #
|
57
|
-
|
58
|
-
Adlam
|
59
|
-
Ahom
|
60
|
-
Anatolian_Hieroglyphs
|
61
|
-
Arabic
|
62
|
-
Armenian
|
63
|
-
Avestan
|
64
|
-
Balinese
|
65
|
-
Bamum
|
66
|
-
Bassa_Vah
|
67
|
-
Batak
|
68
|
-
Bengali
|
69
|
-
Bhaiksuki
|
70
|
-
Bopomofo
|
71
|
-
Brahmi
|
72
|
-
Braille
|
73
|
-
Buginese
|
74
|
-
Buhid
|
75
|
-
Canadian_Aboriginal
|
76
|
-
Carian
|
77
|
-
Caucasian_Albanian
|
78
|
-
Chakma
|
79
|
-
Cham
|
80
|
-
Cherokee
|
81
|
-
Chorasmian
|
82
|
-
Common
|
83
|
-
Coptic
|
84
|
-
Cuneiform
|
85
|
-
Cypriot
|
86
|
-
Cypro_Minoan
|
87
|
-
Cyrillic
|
88
|
-
Deseret
|
89
|
-
Devanagari
|
90
|
-
Dives_Akuru
|
91
|
-
Dogra
|
92
|
-
Duployan
|
93
|
-
Egyptian_Hieroglyphs
|
94
|
-
Elbasan
|
95
|
-
Elymaic
|
96
|
-
Ethiopic
|
97
|
-
Garay
|
98
|
-
Georgian
|
99
|
-
Glagolitic
|
100
|
-
Gothic
|
101
|
-
Grantha
|
102
|
-
Greek
|
103
|
-
Gujarati
|
104
|
-
Gunjala_Gondi
|
105
|
-
Gurmukhi
|
106
|
-
Gurung_Khema
|
107
|
-
Han
|
108
|
-
Hangul
|
109
|
-
Hanifi_Rohingya
|
110
|
-
Hanunoo
|
111
|
-
Hatran
|
112
|
-
Hebrew
|
113
|
-
Hiragana
|
114
|
-
Imperial_Aramaic
|
115
|
-
Inherited
|
116
|
-
Inscriptional_Pahlavi
|
117
|
-
Inscriptional_Parthian
|
118
|
-
Javanese
|
119
|
-
Kaithi
|
120
|
-
Kannada
|
121
|
-
Katakana
|
122
|
-
Katakana_Or_Hiragana
|
123
|
-
Kawi
|
124
|
-
Kayah_Li
|
125
|
-
Kharoshthi
|
126
|
-
Khitan_Small_Script
|
127
|
-
Khmer
|
128
|
-
Khojki
|
129
|
-
Khudawadi
|
130
|
-
Kirat_Rai
|
131
|
-
Lao
|
132
|
-
Latin
|
133
|
-
Lepcha
|
134
|
-
Limbu
|
135
|
-
Linear_A
|
136
|
-
Linear_B
|
137
|
-
Lisu
|
138
|
-
Lycian
|
139
|
-
Lydian
|
140
|
-
Mahajani
|
141
|
-
Makasar
|
142
|
-
Malayalam
|
143
|
-
Mandaic
|
144
|
-
Manichaean
|
145
|
-
Marchen
|
146
|
-
Masaram_Gondi
|
147
|
-
Medefaidrin
|
148
|
-
Meetei_Mayek
|
149
|
-
Mende_Kikakui
|
150
|
-
Meroitic_Cursive
|
151
|
-
Meroitic_Hieroglyphs
|
152
|
-
Miao
|
153
|
-
Modi
|
154
|
-
Mongolian
|
155
|
-
Mro
|
156
|
-
Multani
|
157
|
-
Myanmar
|
158
|
-
Nabataean
|
159
|
-
Nag_Mundari
|
160
|
-
Nandinagari
|
161
|
-
New_Tai_Lue
|
162
|
-
Newa
|
163
|
-
Nko
|
164
|
-
Nushu
|
165
|
-
Nyiakeng_Puachue_Hmong
|
166
|
-
Ogham
|
167
|
-
Ol_Chiki
|
168
|
-
Ol_Onal
|
169
|
-
Old_Hungarian
|
170
|
-
Old_Italic
|
171
|
-
Old_North_Arabian
|
172
|
-
Old_Permic
|
173
|
-
Old_Persian
|
174
|
-
Old_Sogdian
|
175
|
-
Old_South_Arabian
|
176
|
-
Old_Turkic
|
177
|
-
Old_Uyghur
|
178
|
-
Oriya
|
179
|
-
Osage
|
180
|
-
Osmanya
|
181
|
-
Pahawh_Hmong
|
182
|
-
Palmyrene
|
183
|
-
Pau_Cin_Hau
|
184
|
-
Phags_Pa
|
185
|
-
Phoenician
|
186
|
-
Psalter_Pahlavi
|
187
|
-
Rejang
|
188
|
-
Runic
|
189
|
-
Samaritan
|
190
|
-
Saurashtra
|
191
|
-
Sharada
|
192
|
-
Shavian
|
193
|
-
Siddham
|
194
|
-
SignWriting
|
195
|
-
Sinhala
|
196
|
-
Sogdian
|
197
|
-
Sora_Sompeng
|
198
|
-
Soyombo
|
199
|
-
Sundanese
|
200
|
-
Sunuwar
|
201
|
-
Syloti_Nagri
|
202
|
-
Syriac
|
203
|
-
Tagalog
|
204
|
-
Tagbanwa
|
205
|
-
Tai_Le
|
206
|
-
Tai_Tham
|
207
|
-
Tai_Viet
|
208
|
-
Takri
|
209
|
-
Tamil
|
210
|
-
Tangsa
|
211
|
-
Tangut
|
212
|
-
Telugu
|
213
|
-
Thaana
|
214
|
-
Thai
|
215
|
-
Tibetan
|
216
|
-
Tifinagh
|
217
|
-
Tirhuta
|
218
|
-
Todhri
|
219
|
-
Toto
|
220
|
-
Tulu_Tigalari
|
221
|
-
Ugaritic
|
222
|
-
Unknown
|
223
|
-
Vai
|
224
|
-
Vithkuqi
|
225
|
-
Wancho
|
226
|
-
Warang_Citi
|
227
|
-
Yezidi
|
228
|
-
Yi
|
229
|
-
Zanabazar_Square
|
92
|
+
puts Unicode::Scripts.names # list of scripts
|
230
93
|
```
|
231
94
|
|
232
|
-
|
233
|
-
You can extract all 4 letter script names from the gem like this:
|
95
|
+
To get all 4 letter script codes (ISO 15924):
|
234
96
|
|
235
97
|
```ruby
|
236
98
|
require "unicode/scripts"
|
237
|
-
puts Unicode::Scripts.names(format: :short)
|
238
|
-
|
239
|
-
# # # Output # # #
|
240
|
-
|
241
|
-
Adlm
|
242
|
-
Aghb
|
243
|
-
Ahom
|
244
|
-
Arab
|
245
|
-
Armi
|
246
|
-
Armn
|
247
|
-
Avst
|
248
|
-
Bali
|
249
|
-
Bamu
|
250
|
-
Bass
|
251
|
-
Batk
|
252
|
-
Beng
|
253
|
-
Bhks
|
254
|
-
Bopo
|
255
|
-
Brah
|
256
|
-
Brai
|
257
|
-
Bugi
|
258
|
-
Buhd
|
259
|
-
Cakm
|
260
|
-
Cans
|
261
|
-
Cari
|
262
|
-
Cham
|
263
|
-
Cher
|
264
|
-
Chrs
|
265
|
-
Copt
|
266
|
-
Cpmn
|
267
|
-
Cprt
|
268
|
-
Cyrl
|
269
|
-
Deva
|
270
|
-
Diak
|
271
|
-
Dogr
|
272
|
-
Dsrt
|
273
|
-
Dupl
|
274
|
-
Egyp
|
275
|
-
Elba
|
276
|
-
Elym
|
277
|
-
Ethi
|
278
|
-
Gara
|
279
|
-
Geor
|
280
|
-
Glag
|
281
|
-
Gong
|
282
|
-
Gonm
|
283
|
-
Goth
|
284
|
-
Gran
|
285
|
-
Grek
|
286
|
-
Gujr
|
287
|
-
Gukh
|
288
|
-
Guru
|
289
|
-
Hang
|
290
|
-
Hani
|
291
|
-
Hano
|
292
|
-
Hatr
|
293
|
-
Hebr
|
294
|
-
Hira
|
295
|
-
Hluw
|
296
|
-
Hmng
|
297
|
-
Hmnp
|
298
|
-
Hrkt
|
299
|
-
Hung
|
300
|
-
Ital
|
301
|
-
Java
|
302
|
-
Kali
|
303
|
-
Kana
|
304
|
-
Kawi
|
305
|
-
Khar
|
306
|
-
Khmr
|
307
|
-
Khoj
|
308
|
-
Kits
|
309
|
-
Knda
|
310
|
-
Krai
|
311
|
-
Kthi
|
312
|
-
Lana
|
313
|
-
Laoo
|
314
|
-
Latn
|
315
|
-
Lepc
|
316
|
-
Limb
|
317
|
-
Lina
|
318
|
-
Linb
|
319
|
-
Lisu
|
320
|
-
Lyci
|
321
|
-
Lydi
|
322
|
-
Mahj
|
323
|
-
Maka
|
324
|
-
Mand
|
325
|
-
Mani
|
326
|
-
Marc
|
327
|
-
Medf
|
328
|
-
Mend
|
329
|
-
Merc
|
330
|
-
Mero
|
331
|
-
Mlym
|
332
|
-
Modi
|
333
|
-
Mong
|
334
|
-
Mroo
|
335
|
-
Mtei
|
336
|
-
Mult
|
337
|
-
Mymr
|
338
|
-
Nagm
|
339
|
-
Nand
|
340
|
-
Narb
|
341
|
-
Nbat
|
342
|
-
Newa
|
343
|
-
Nkoo
|
344
|
-
Nshu
|
345
|
-
Ogam
|
346
|
-
Olck
|
347
|
-
Onao
|
348
|
-
Orkh
|
349
|
-
Orya
|
350
|
-
Osge
|
351
|
-
Osma
|
352
|
-
Ougr
|
353
|
-
Palm
|
354
|
-
Pauc
|
355
|
-
Perm
|
356
|
-
Phag
|
357
|
-
Phli
|
358
|
-
Phlp
|
359
|
-
Phnx
|
360
|
-
Plrd
|
361
|
-
Prti
|
362
|
-
Qaac
|
363
|
-
Qaai
|
364
|
-
Rjng
|
365
|
-
Rohg
|
366
|
-
Runr
|
367
|
-
Samr
|
368
|
-
Sarb
|
369
|
-
Saur
|
370
|
-
Sgnw
|
371
|
-
Shaw
|
372
|
-
Shrd
|
373
|
-
Sidd
|
374
|
-
Sind
|
375
|
-
Sinh
|
376
|
-
Sogd
|
377
|
-
Sogo
|
378
|
-
Sora
|
379
|
-
Soyo
|
380
|
-
Sund
|
381
|
-
Sunu
|
382
|
-
Sylo
|
383
|
-
Syrc
|
384
|
-
Tagb
|
385
|
-
Takr
|
386
|
-
Tale
|
387
|
-
Talu
|
388
|
-
Taml
|
389
|
-
Tang
|
390
|
-
Tavt
|
391
|
-
Telu
|
392
|
-
Tfng
|
393
|
-
Tglg
|
394
|
-
Thaa
|
395
|
-
Thai
|
396
|
-
Tibt
|
397
|
-
Tirh
|
398
|
-
Tnsa
|
399
|
-
Todr
|
400
|
-
Toto
|
401
|
-
Tutg
|
402
|
-
Ugar
|
403
|
-
Vaii
|
404
|
-
Vith
|
405
|
-
Wara
|
406
|
-
Wcho
|
407
|
-
Xpeo
|
408
|
-
Xsux
|
409
|
-
Yezi
|
410
|
-
Yiii
|
411
|
-
Zanb
|
412
|
-
Zinh
|
413
|
-
Zyyy
|
414
|
-
Zzzz
|
99
|
+
puts Unicode::Scripts.names(format: :short) # list of scripts
|
415
100
|
```
|
416
101
|
|
417
|
-
|
102
|
+
Augmented scripts:
|
103
|
+
|
104
|
+
```ruby
|
105
|
+
require "unicode/scripts"
|
106
|
+
puts Unicode::Scripts.names(format: :short, augmented: :only)
|
107
|
+
```
|
108
|
+
|
109
|
+
You can find a list of all scripts in Unicode, with links to Wikipedia on [character.construction/scripts](https://character.construction/scripts)
|
110
|
+
|
111
|
+
## Hints
|
112
|
+
### Regex Matching
|
113
|
+
|
114
|
+
If you have a string and want to match a substring/character from a specific Unicode script, you actually won't need this gem. Instead, you can use the [Regexp Unicode Property Syntax `\p{}`](https://ruby-doc.org/core/Regexp.html#class-Regexp-label-Character+Properties):
|
115
|
+
|
116
|
+
```ruby
|
117
|
+
"Coptic letter: ⲁ".scan(/\p{Coptic}/) # => ["ⲁ"]
|
118
|
+
```
|
119
|
+
|
120
|
+
See [Idiosyncratic Ruby: Proper Unicoding](https://idiosyncratic-ruby.com/41-proper-unicoding.html) for more info.
|
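The regex hint above can be stretched a little further. The following is a minimal sketch, assuming you only care about a handful of scripts; `SCRIPT_PATTERNS` and `scripts_seen` are hypothetical names for illustration and are not part of unicode-scripts. It relies only on Ruby's built-in `\p{...}` script properties and does not apply Script_Extensions or augmented script sets.

```ruby
# Hypothetical helper (not part of the gem): report which of a few
# hand-picked scripts occur in a string, using only Regexp script properties.
SCRIPT_PATTERNS = {
  "Latin"    => /\p{Latin}/,
  "Cyrillic" => /\p{Cyrillic}/,
  "Greek"    => /\p{Greek}/,
}.freeze

def scripts_seen(string)
  # Keep the entries whose pattern matches at least one character.
  SCRIPT_PATTERNS.select { |_name, pattern| string.match?(pattern) }.keys
end

p scripts_seen("Circle")                    # => ["Latin"]
# Same word with Cyrillic lookalikes: U+0421, U+0441, U+0435
p scripts_seen("\u0421ir\u0441l\u0435")     # => ["Latin", "Cyrillic"]
```

This is enough for a quick check against a known, short list of scripts; for the per-character intersection logic described in the README, the gem's own methods are the better fit.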
121
|
+
|
122
|
+
## Also See
|
123
|
+
|
124
|
+
- JavaScript implementation (same data & algorithms): [unicode-script.js](https://github.com/janlelis/unicode-script.js)
|
125
|
+
- Index created with: [unicoder](https://github.com/janlelis/unicoder)
|
126
|
+
- Get the Unicode blocks of a string: [unicode-blocks gem](https://github.com/janlelis/unicode-blocks)
|
127
|
+
- See [unicode-x](https://github.com/janlelis/unicode-x) for more Unicode related micro libraries for Ruby.
|
418
128
|
|
419
129
|
## MIT License
|
420
130
|
|
data/lib/unicode/scripts/constants.rb
CHANGED
@@ -2,9 +2,11 @@
|
|
2
2
|
|
3
3
|
module Unicode
|
4
4
|
module Scripts
|
5
|
-
VERSION = "1.
|
5
|
+
VERSION = "1.11.0"
|
6
6
|
UNICODE_VERSION = "16.0.0"
|
7
7
|
DATA_DIRECTORY = File.expand_path(File.dirname(__FILE__) + "/../../../data/").freeze
|
8
8
|
INDEX_FILENAME = (DATA_DIRECTORY + "/scripts.marshal.gz").freeze
|
9
|
+
|
10
|
+
AUGMENTED_SCRIPT_CODES = ["Hanb", "Jpan", "Kore"]
|
9
11
|
end
|
10
12
|
end
|
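A side note on the new constant: an array literal assigned to a constant is not frozen by default, so a caller that receives it directly could modify it in place. The sketch below uses a hypothetical stand-in module (`SketchScripts`, not the gem's code) to show how freezing such a list behaves; whether the gem needs this is a judgment call for its maintainer.

```ruby
# Hypothetical stand-in module for illustration only.
module SketchScripts
  # Freezing the array prevents in-place modification by callers.
  AUGMENTED_CODES = %w[Hanb Jpan Kore].freeze

  def self.augmented_names
    AUGMENTED_CODES
  end
end

codes = SketchScripts.augmented_names
begin
  codes << "Xxxx" # attempt to mutate the shared list
rescue FrozenError => e
  puts "rejected: #{e.class}"
end

p SketchScripts.augmented_names # => ["Hanb", "Jpan", "Kore"]
```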
data/lib/unicode/scripts.rb
CHANGED
@@ -46,11 +46,77 @@ module Unicode
|
|
46
46
|
}.sort
|
47
47
|
end
|
48
48
|
|
49
|
-
def self.
|
49
|
+
def self.augmented_scripts(string)
|
50
50
|
require_relative 'scripts/index' unless defined? ::Unicode::Scripts::INDEX
|
51
|
-
|
52
|
-
|
53
|
-
|
51
|
+
|
52
|
+
augmented = string.each_codepoint.inject([]){ |res, codepoint|
|
53
|
+
if new_scripts = INDEX[:SCRIPT_EXTENSIONS][codepoint]
|
54
|
+
script_extension_names = new_scripts.map{ |new_script|
|
55
|
+
INDEX[:SCRIPT_ALIASES].key(new_script)
|
56
|
+
}
|
57
|
+
else
|
58
|
+
script_extension_names = scripts([codepoint].pack("U"), format: :short)
|
59
|
+
end
|
60
|
+
|
61
|
+
res | script_extension_names
|
62
|
+
}
|
63
|
+
|
64
|
+
if augmented.include? "Hani"
|
65
|
+
augmented |= ["Hanb", "Jpan", "Kore"]
|
66
|
+
end
|
67
|
+
if augmented.include?("Hira") || augmented.include?("Kana")
|
68
|
+
augmented |= ["Jpan"]
|
69
|
+
end
|
70
|
+
if augmented.include? "Hang"
|
71
|
+
augmented |= ["Kore"]
|
72
|
+
end
|
73
|
+
if augmented.include? "Bopo"
|
74
|
+
augmented |= ["Hanb"]
|
75
|
+
end
|
76
|
+
if augmented.include?("Zyyy") || augmented.include?("Zinh")
|
77
|
+
augmented |= names(format: :short, augmented: :include )
|
78
|
+
end
|
79
|
+
|
80
|
+
augmented.sort
|
81
|
+
end
|
82
|
+
|
83
|
+
def self.resolved_scripts(string)
|
84
|
+
string.chars.reduce(
|
85
|
+
Unicode::Scripts.names(format: :short, augmented: :include)
|
86
|
+
){ |acc, char|
|
87
|
+
acc & augmented_scripts(char)
|
88
|
+
}
|
89
|
+
end
|
90
|
+
|
91
|
+
def self.mixed?(string)
|
92
|
+
resolved_scripts(string).empty?
|
93
|
+
end
|
94
|
+
|
95
|
+
def self.single?(string)
|
96
|
+
!resolved_scripts(string).empty?
|
97
|
+
end
|
98
|
+
|
99
|
+
# Lists scripts. Options:
|
100
|
+
# - format - :long, :short
|
101
|
+
# - augmented - :include, :exclude, :only
|
102
|
+
def self.names(format: :long, augmented: :exclude)
|
103
|
+
if format == :long && augmented != :exclude
|
104
|
+
raise ArgumentError, "only short four-letter script codes (ISO 15924) supported when listing augmented scripts"
|
105
|
+
end
|
106
|
+
|
107
|
+
if augmented == :only
|
108
|
+
return AUGMENTED_SCRIPT_CODES
|
109
|
+
end
|
110
|
+
|
111
|
+
require_relative 'scripts/index' unless defined? ::Unicode::Scripts::INDEX
|
112
|
+
|
113
|
+
if format == :long
|
114
|
+
INDEX[:SCRIPT_NAMES].sort
|
115
|
+
elsif augmented == :exclude
|
116
|
+
INDEX[:SCRIPT_ALIASES].keys.sort
|
117
|
+
else
|
118
|
+
(INDEX[:SCRIPT_ALIASES].keys + AUGMENTED_SCRIPT_CODES).sort
|
119
|
+
end
|
54
120
|
end
|
55
121
|
end
|
56
122
|
end
|
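The methods added in this hunk build on each other: a per-character augmented set, an intersection across characters, and an emptiness check. The following self-contained sketch walks through that chain under loud assumptions: `SAMPLE_SCRIPTS` and `SAMPLE_CODES` are invented toy data standing in for the gem's bundled index, and the `sample_*` method names are hypothetical. It mirrors the rules visible in the diff but is not the gem's implementation.

```ruby
# Invented per-character script data (ISO 15924 codes). The gem reads this
# from its bundled index; only the characters used below are covered here.
SAMPLE_SCRIPTS = {
  "a"      => ["Latn"],
  "b"      => ["Latn"],
  "\u0430" => ["Cyrl"], # CYRILLIC SMALL LETTER A
  "\u306D" => ["Hira"], # HIRAGANA LETTER NE
  "\u30AC" => ["Kana"], # KATAKANA LETTER GA
  "1"      => ["Zyyy"], # Common
}.freeze

# Toy universe of codes standing in for the full script list.
SAMPLE_CODES = %w[Cyrl Hanb Hani Hira Jpan Kana Kore Latn].freeze

# Augmented set for one character: add the meta scripts for East Asian
# writing systems and expand Common/Inherited to every known code.
def sample_augmented(char)
  set = SAMPLE_SCRIPTS.fetch(char)
  set |= %w[Hanb Jpan Kore] if set.include?("Hani")
  set |= %w[Jpan] if set.include?("Hira") || set.include?("Kana")
  set |= %w[Kore] if set.include?("Hang")
  set |= %w[Hanb] if set.include?("Bopo")
  set |= SAMPLE_CODES if set.include?("Zyyy") || set.include?("Zinh")
  set.sort
end

# Resolved set: intersection of the augmented sets of all characters.
def sample_resolved(string)
  string.chars.reduce(SAMPLE_CODES) { |acc, char| acc & sample_augmented(char) }
end

# Mixed-script means no script is shared by every character.
def sample_mixed?(string)
  sample_resolved(string).empty?
end

p sample_resolved("ab")             # => ["Latn"]
p sample_resolved("a1")             # => ["Latn"]
p sample_resolved("a\u0430")        # => []
p sample_mixed?("a\u0430")          # => true
p sample_mixed?("\u306D\u30AC")     # => false
```

The last line shows why a string counted as single-script can still contain characters from more than one script: hiragana and katakana both carry the `Jpan` meta script, so their intersection is not empty.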
data/spec/unicode_scripts_spec.rb
CHANGED
@@ -130,11 +130,61 @@ describe Unicode::Scripts do
|
|
130
130
|
end
|
131
131
|
end
|
132
132
|
|
133
|
+
describe ".augmented_scripts" do
|
134
|
+
it "will always return an Array" do
|
135
|
+
assert_equal [], Unicode::Scripts.augmented_scripts("")
|
136
|
+
end
|
137
|
+
|
138
|
+
it "will return all extended scripts that characters in the string belong to + augmented" do
|
139
|
+
assert_equal ["Hira", "Jpan", "Kana"], Unicode::Scripts.augmented_scripts("ねガ")
|
140
|
+
end
|
141
|
+
|
142
|
+
it "will replace Common with all scripts" do
|
143
|
+
assert_equal \
|
144
|
+
Unicode::Scripts.names(format: :short, augmented: :include),
|
145
|
+
Unicode::Scripts.augmented_scripts("1")
|
146
|
+
end
|
147
|
+
end
|
148
|
+
|
149
|
+
describe ".resolved_scripts" do
|
150
|
+
it "return intersection of augmented scripts per character" do
|
151
|
+
assert_equal ["Cyrl"], Unicode::Scripts.resolved_scripts("СігсӀе")
|
152
|
+
assert_equal [], Unicode::Scripts.resolved_scripts("Сirсlе")
|
153
|
+
assert_equal \
|
154
|
+
Unicode::Scripts.names(format: :short, augmented: :include),
|
155
|
+
Unicode::Scripts.resolved_scripts("𝖢𝗂𝗋𝖼𝗅𝖾")
|
156
|
+
end
|
157
|
+
end
|
158
|
+
|
159
|
+
describe "mixed?" do
|
160
|
+
it "will return true if .resolved_scripts(string) is empty" do
|
161
|
+
assert_equal false, Unicode::Scripts.mixed?("СігсӀе")
|
162
|
+
assert Unicode::Scripts.mixed?("Сirсlе")
|
163
|
+
assert_equal false, Unicode::Scripts.mixed?("𝖢𝗂𝗋𝖼𝗅𝖾")
|
164
|
+
assert_equal false, Unicode::Scripts.mixed?("1")
|
165
|
+
assert_equal false, Unicode::Scripts.mixed?("ねガ")
|
166
|
+
end
|
167
|
+
end
|
168
|
+
|
169
|
+
describe "single?" do
|
170
|
+
it "will return true if .resolved_scripts(string) is not empty" do
|
171
|
+
assert Unicode::Scripts.single?("СігсӀе")
|
172
|
+
assert_equal false, Unicode::Scripts.single?("Сirсlе")
|
173
|
+
assert Unicode::Scripts.single?("𝖢𝗂𝗋𝖼𝗅𝖾")
|
174
|
+
assert Unicode::Scripts.single?("1")
|
175
|
+
assert Unicode::Scripts.single?("ねガ")
|
176
|
+
end
|
177
|
+
end
|
178
|
+
|
133
179
|
describe ".names" do
|
134
180
|
it "will return a list of all script names" do
|
135
181
|
assert_kind_of Array, Unicode::Scripts.names
|
136
182
|
assert_includes Unicode::Scripts.names, "Inscriptional_Parthian"
|
137
183
|
end
|
184
|
+
|
185
|
+
it "will return a list of all augmented script codes" do
|
186
|
+
assert_equal Unicode::Scripts.names(format: :short, augmented: :only), ["Hanb", "Jpan", "Kore"]
|
187
|
+
end
|
138
188
|
end
|
139
189
|
end
|
140
190
|
|
metadata
CHANGED
@@ -1,14 +1,14 @@
|
|
1
1
|
--- !ruby/object:Gem::Specification
|
2
2
|
name: unicode-scripts
|
3
3
|
version: !ruby/object:Gem::Version
|
4
|
-
version: 1.
|
4
|
+
version: 1.11.0
|
5
5
|
platform: ruby
|
6
6
|
authors:
|
7
7
|
- Jan Lelis
|
8
8
|
autorequire:
|
9
9
|
bindir: bin
|
10
10
|
cert_chain: []
|
11
|
-
date: 2024-
|
11
|
+
date: 2024-11-03 00:00:00.000000000 Z
|
12
12
|
dependencies: []
|
13
13
|
description: "[Unicode 16.0.0] Retrieve the Unicode script(s) a string belongs to.
|
14
14
|
Can also return the Script_Extension property which is defined as characters which
|
@@ -54,7 +54,7 @@ required_rubygems_version: !ruby/object:Gem::Requirement
|
|
54
54
|
- !ruby/object:Gem::Version
|
55
55
|
version: '0'
|
56
56
|
requirements: []
|
57
|
-
rubygems_version: 3.5.
|
57
|
+
rubygems_version: 3.5.21
|
58
58
|
signing_key:
|
59
59
|
specification_version: 4
|
60
60
|
summary: Which script(s) does a Unicode string belong to?
|