character_set 1.5.0 → 1.6.0

This diff shows the changes between two publicly released versions of the package, as they appear in the public registry. It is provided for informational purposes only.
checksums.yaml CHANGED
@@ -1,7 +1,7 @@
1
1
  ---
2
2
  SHA256:
3
- metadata.gz: 9622bc20bbdb48f8deff84dbed9e800e6bc500a6a08a27e7b3aea2ea651cd278
4
- data.tar.gz: 5853e8d5be7e9a1963419aa4f9fbc631148fe5bef45aa185b9117d32b44aa959
3
+ metadata.gz: e216e6c199ac9443cda9180a9e35d5ed92b50b45c12e7f64f45d74ecd2cf08d6
4
+ data.tar.gz: 5f3634d426dc33875d6c197ce75466544d97808b1e8b1858ac56d93422b226e8
5
5
  SHA512:
6
- metadata.gz: 2cc2a60b9388a2e3beef66da20aa8205cc501980a7dc66f2716c66f7e999a083927b27a761e6b932b6d5c16b8e5968f8e04370ecf3c999326f378f60bfa3cedc
7
- data.tar.gz: a2a8d1f9ac6cdf6302af98662fc3efda4b8c6fe003c7cdc853a61a64f9c7a596b1bbd7a79dca19081b8ce2576f9c3d848869141b164c145e22befaaffec8b265
6
+ metadata.gz: d24cfaa40b6e4e472e1f76cc8b6f7f3f1282e6830c0cbf76c4810c0f6f365c7419a19816d0b741cee99eb428dae03fc1d60eecab7d1ba6d210015f0cf2d5ee14
7
+ data.tar.gz: 2bd7ea63b286e106358293b1428a687374d0cd2cdc985b2da5b5cf1f45c6c541cb0ddde5b06477243cf4011065cfac7fa6bb8a521fb144a750c90039d268f03b
data/.gitattributes CHANGED
@@ -1,3 +1,3 @@
1
1
  *.cps linguist-detectable=false
2
2
  benchmarks/* linguist-detectable=false
3
- spec/ruby-spec/* linguist-vendored
3
+ spec/* linguist-detectable=false
@@ -1,6 +1,10 @@
1
1
  name: tests
2
2
 
3
- on: [push, pull_request]
3
+ on:
4
+ push:
5
+ pull_request:
6
+ schedule:
7
+ - cron: '11 11 14 * *' # at 11:11 am on the 14th of every month
4
8
 
5
9
  jobs:
6
10
  build:
@@ -8,7 +12,7 @@ jobs:
8
12
 
9
13
  strategy:
10
14
  matrix:
11
- ruby: [ '2.2', '2.7', '3.0', 'ruby-head', 'jruby-head' ]
15
+ ruby: [ '2.2', '2.7', '3.0', '3.1', 'ruby-head', 'jruby-head' ]
12
16
 
13
17
  steps:
14
18
  - uses: actions/checkout@v2
data/BENCHMARK.md CHANGED
@@ -1,86 +1,90 @@
1
- Results of `rake:benchmark` on ruby 3.0.0p0 (2020-12-25 revision 95aff21468) [x86_64-darwin19]
1
+ Results of `rake:benchmark` on ruby 3.2.0dev (2022-02-14T14:35:54Z master 26187a8520) [arm64-darwin21]
2
2
 
3
3
  ```
4
4
  Counting non-letters
5
5
 
6
- CharacterSet#count_in: 9472902.2 i/s
7
- String#count: 2221799.9 i/s - 4.26x slower
6
+ CharacterSet#count_in: 14794607.9 i/s
7
+ String#count: 3875939.3 i/s - 3.82x slower
8
8
  ```
9
9
  ```
10
10
  Detecting non-whitespace
11
11
 
12
- CharacterSet#cover?: 12388427.2 i/s
13
- Regexp#match?: 7901676.8 i/s - 1.57x slower
12
+ CharacterSet#cover?: 17448329.0 i/s
13
+ Regexp#match?: 13089358.1 i/s - 1.33x slower
14
14
  ```
15
15
  ```
16
16
  Detecting non-letters
17
17
 
18
- CharacterSet#cover?: 12263689.1 i/s
19
- Regexp#match?: 4940889.9 i/s - 2.48x slower
18
+ CharacterSet#cover?: 17565596.9 i/s
19
+ Regexp#match?: 7951108.0 i/s - 2.21x slower
20
20
  ```
21
21
  ```
22
- Removing whitespace
22
+ Removing ASCII whitespace
23
23
 
24
- CharacterSet#delete_in: 2406722.6 i/s
25
- String#gsub: 235760.3 i/s - 10.21x slower
24
+ CharacterSet#delete_in: 6306078.2 i/s
25
+ String#tr: 4734401.0 i/s - 1.33x slower
26
+ String#gsub: 211631.8 i/s - 29.80x slower
26
27
  ```
27
28
  ```
28
29
  Removing whitespace, emoji and umlauts
29
30
 
30
- CharacterSet#delete_in: 1653607.6 i/s
31
- String#gsub: 272782.9 i/s - 6.06x slower
31
+ CharacterSet#delete_in: 5984149.6 i/s
32
+ String#tr: 363643.1 i/s - 16.46x slower
33
+ String#gsub: 317201.7 i/s - 18.87x slower
32
34
  ```
33
35
  ```
34
36
  Removing non-whitespace
35
37
 
36
- CharacterSet#keep_in: 2671038.2 i/s
37
- String#gsub: 242551.0 i/s - 11.01x slower
38
+ CharacterSet#keep_in: 7650925.6 i/s
39
+ String#gsub: 207374.6 i/s - 36.89x slower
40
+ String#tr: 12.3 i/s - 619745.60x slower
38
41
  ```
39
42
  ```
40
- Extracting emoji
43
+ Keeping only emoji
41
44
 
42
- CharacterSet#keep_in: 1726496.5 i/s
43
- String#gsub: 215609.2 i/s - 8.01x slower
45
+ CharacterSet#keep_in: 7272940.1 i/s
46
+ String#gsub: 177993.8 i/s - 40.86x slower
47
+ String#tr: 12.3 i/s - 590222.71x slower
44
48
  ```
45
49
  ```
46
50
  Extracting emoji to an Array
47
51
 
48
- CharacterSet#scan: 2373856.1 i/s
49
- String#scan: 480000.5 i/s - 4.95x slower
52
+ CharacterSet#scan: 2978285.0 i/s
53
+ String#scan: 865793.8 i/s - 3.44x slower
50
54
  ```
51
55
  ```
52
56
  Detecting whitespace
53
57
 
54
- CharacterSet#used_by?: 11988328.7 i/s
55
- Regexp#match?: 6758146.8 i/s - 1.77x slower
58
+ CharacterSet#used_by?: 17292338.4 i/s
59
+ Regexp#match?: 11705563.9 i/s - 1.48x slower
56
60
  ```
57
61
  ```
58
62
  Detecting emoji in a large string
59
63
 
60
- CharacterSet#used_by?: 288223.3 i/s
61
- Regexp#match?: 102384.2 i/s - 2.82x slower
64
+ CharacterSet#used_by?: 340444.1 i/s
65
+ Regexp#match?: 180549.8 i/s - 1.89x slower
62
66
  ```
63
67
  ```
64
68
  Adding entries
65
69
 
66
- CharacterSet#add: 2538251.2 i/s
67
- SortedSet#add: 443925.9 i/s - 5.72x slower
70
+ CharacterSet#add: 4951781.4 i/s
71
+ SortedSet#add: 1019637.9 i/s - 4.86x slower
68
72
  ```
69
73
  ```
70
74
  Removing entries
71
75
 
72
- CharacterSet#delete: 2487620.8 i/s
73
- SortedSet#delete: 628816.1 i/s - 3.96x slower
76
+ CharacterSet#delete: 5006337.6 i/s
77
+ SortedSet#delete: 3922752.2 i/s - same-ish
74
78
  ```
75
79
  ```
76
80
  Merging entries
77
81
 
78
- CharacterSet#merge: 551.6 i/s
79
- SortedSet#merge: 1.4 i/s - 393.59x slower
82
+ CharacterSet#merge: 661.8 i/s
83
+ SortedSet#merge: 3.9 i/s - 167.82x slower
80
84
  ```
81
85
  ```
82
86
  Getting the min and max
83
87
 
84
- CharacterSet#minmax: 636890.7 i/s
85
- SortedSet#minmax: 254.1 i/s - 2506.20x slower
88
+ CharacterSet#minmax: 1212462.2 i/s
89
+ SortedSet#minmax: 844.4 i/s - 1435.93x slower
86
90
  ```
data/CHANGELOG.md CHANGED
@@ -4,6 +4,21 @@ All notable changes to this project will be documented in this file.
4
4
  The format is based on [Keep a Changelog](http://keepachangelog.com/en/1.0.0/)
5
5
  and this project adheres to [Semantic Versioning](http://semver.org/spec/v2.0.0.html).
6
6
 
7
+ ## [1.6.0] - 2022-02-16
8
+
9
+ ### Added
10
+
11
+ - `::of` now supports both `String` and `Regexp` arguments
12
+
13
+ ### Fixed
14
+
15
+ - fixed segfault during `String` manipulation on Ruby 3.2.0-dev
16
+ - improved performance for `String` manipulation
17
+ - allow usage in Ractors
18
+ - predefined sets must be pre-initialized for this, though
19
+ - e.g. `CharacterSet.ascii`, `keep_character_set(:ascii)` etc.
20
+ - call them once in the main Ractor to trigger initialization
21
+
7
22
  ## [1.5.0] - 2021-12-05
8
23
 
9
24
  ### Added
data/Gemfile CHANGED
@@ -4,3 +4,17 @@ git_source(:github) {|repo_name| "https://github.com/#{repo_name}" }
4
4
 
5
5
  # Specify your gem's dependencies in character_set.gemspec
6
6
  gemspec
7
+
8
+ gem 'benchmark-ips', '~> 2.7'
9
+ gem 'get_process_mem', '~> 0.2.3'
10
+ gem 'rake', '~> 13.0'
11
+ gem 'rake-compiler', '~> 1.1'
12
+ gem 'range_compressor', '~> 1.0'
13
+ gem 'regexp_parser', '~> 2.1'
14
+ gem 'regexp_property_values', '~> 1.0'
15
+ gem 'rspec', '~> 3.8'
16
+ if RUBY_VERSION.to_f >= 2.7
17
+ gem 'codecov', '~> 0.2.12'
18
+ gem 'gouteur', '~> 1.0.0'
19
+ gem 'rubocop', '~> 1.8'
20
+ end
data/README.md CHANGED
@@ -5,16 +5,17 @@
5
5
  [![Build Status](https://github.com/jaynetics/character_set/workflows/gouteur/badge.svg)](https://github.com/jaynetics/character_set/actions)
6
6
  [![codecov](https://codecov.io/gh/jaynetics/character_set/branch/master/graph/badge.svg)](https://codecov.io/gh/jaynetics/character_set)
7
7
 
8
- This is a C-extended Ruby gem to work with sets of Unicode codepoints. It can read and write these sets in various formats and implements the stdlib `Set` interface for them.
8
+ This is a C-extended Ruby gem to work with sets of Unicode codepoints.
9
9
 
10
- It also offers an alternate paradigm of `String` processing which grants much better performance than `Regexp` and `String` methods from the stdlib where applicable (see [benchmarks](./BENCHMARK.md)).
10
+ It can [read](#parseinitialize) and [write](#write) sets of codepoints in various formats and it implements the stdlib `Set` interface for them.
11
+
12
+ It also offers a [way of scrubbing and scanning characters in Strings](#interact-with-strings) that is more semantic and consistently offers better performance than `Regexp` and `String` methods from the stdlib for this (see [benchmarks](./BENCHMARK.md)).
11
13
 
12
14
  Many parts can be used independently, e.g.:
13
15
  - `CharacterSet::Character`
14
16
  - `CharacterSet::ExpressionConverter`
15
17
  - `CharacterSet::Parser`
16
18
  - `CharacterSet::Writer`
17
- - [`RangeCompressor`](https://github.com/jaynetics/range_compressor)
18
19
 
19
20
  ## Usage
20
21
 
@@ -42,9 +43,10 @@ CharacterSet.parse('[a-c]')
42
43
  CharacterSet.parse('\U00000061-\U00000063')
43
44
  ```
44
45
 
45
- If the gems [`regexp_parser`](https://github.com/ammar/regexp_parser) and [`regexp_property_values`](https://github.com/jaynetics/regexp_property_values) are installed, `::of_regexp` and `::of_property` can also be used. `::of_regexp` can handle intersections, negations, and set nesting. Regexp's `i`-flag is ignored; call `#case_insensitive` on the result if needed.
46
+ If the gems [`regexp_parser`](https://github.com/ammar/regexp_parser) and [`regexp_property_values`](https://github.com/jaynetics/regexp_property_values) are installed, `Regexp` and unicode property names can also be read. Regexp intersections, negations, and set nesting are covered, but the `i`-flag is ignored; call `#case_insensitive` on the result if needed.
46
47
 
47
48
  ```ruby
49
+ CharacterSet.of(/./) # => #<CharacterSet (size: 1112064)>
48
50
  CharacterSet.of_property('Thai') # => #<CharacterSet (size: 86)>
49
51
 
50
52
  require 'character_set/core_ext/regexp_ext'
@@ -145,6 +147,7 @@ CharacterSet['1', 'A'].case_insensitive # => CharacterSet['1', 'A', 'a']
145
147
  ```
146
148
 
147
149
  ### Write
150
+
148
151
  ```ruby
149
152
  set = CharacterSet['a', 'b', 'c', 'j', '-']
150
153
 
@@ -211,6 +214,6 @@ CharacterSet['a', 'ü', '🤩'].member_in_plane?(7) # => false
211
214
  CharacterSet::Character.new('a').plane # => 0
212
215
  ```
213
216
 
214
- ### Contributions
217
+ ## Contributions
215
218
 
216
219
  Feel free to send suggestions, point out issues, or submit pull requests.
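The README hunk above lists `CharacterSet.of(/./)` with a size of 1112064. That figure equals the number of Unicode scalar values, i.e. all codepoints minus the surrogate range. The arithmetic can be checked with plain Ruby; whether the gem derives the number this way is an assumption, and the constant names below are illustrative only.

```ruby
# Count of all Unicode codepoints, U+0000..U+10FFFF.
ALL_CODEPOINTS = (0x0..0x10FFFF).size

# Surrogate codepoints (U+D800..U+DFFF) cannot appear in valid UTF-8 strings.
SURROGATES = (0xD800..0xDFFF).size

# Remaining codepoints, the Unicode scalar values.
SCALAR_VALUES = ALL_CODEPOINTS - SURROGATES

puts SCALAR_VALUES # => 1112064
```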
data/Rakefile CHANGED
@@ -147,8 +147,11 @@ namespace :benchmark do
147
147
  f.puts "Results of `rake:benchmark` on #{RUBY_DESCRIPTION}", ''
148
148
 
149
149
  $store_comparison_results.each do |caption, result|
150
- f.puts '```', caption, '',
151
- result.strip.gsub(/(same-ish).*$/, '\1').lines[1..-1], '```'
150
+ f.puts '```',
151
+ caption,
152
+ '',
153
+ result.strip.gsub(/ \(±[^)]+\) /, '').gsub(/(same-ish).*$/, '\1').lines[1..-1],
154
+ '```'
152
155
  end
153
156
  end
154
157
  end
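The Rakefile change above adds a second `gsub` that strips the ` (±…) ` error margin from each comparison line before the results are written to BENCHMARK.md. A minimal sketch of that post-processing follows. The helper name `clean_comparison` is hypothetical, and the sample text is an assumption about what the benchmark output looks like, not captured output.

```ruby
# Applies the same chain of calls as the Rakefile hunk to one comparison report.
def clean_comparison(result)
  result.strip
        .gsub(/ \(±[^)]+\) /, '')     # drop the error margin, e.g. " (± 0.00) "
        .gsub(/(same-ish).*$/, '\1')  # shorten the "same-ish: ..." explanation
        .lines[1..-1]                 # drop the leading "Comparison:" line
end

# Assumed sample input, loosely modelled on the figures in BENCHMARK.md.
SAMPLE = <<~OUT
  Comparison:
  CharacterSet#delete_in:  6306078.2 i/s
  String#tr:  4734401.0 i/s - 1.33x  (± 0.00) slower
  SortedSet#delete:  3922752.2 i/s - same-ish: difference falls within error
OUT

puts clean_comparison(SAMPLE)
```

With this input, the `String#tr` line ends in `1.33x slower` and the `SortedSet#delete` line ends in `same-ish`, which matches the shape of the new BENCHMARK.md entries.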
@@ -2,24 +2,28 @@ require_relative './shared'
2
2
 
3
3
  str = 'Lorem ipsum et dolorem'
4
4
  rx = /\s/
5
+ trt = "\t\n\v\f\r\s"
5
6
  cs = CharacterSet.whitespace
6
7
 
7
8
  benchmark(
8
- caption: 'Removing whitespace',
9
+ caption: 'Removing ASCII whitespace',
9
10
  cases: {
10
11
  'String#gsub' => -> { str.gsub(rx, '') },
12
+ 'String#tr' => -> { str.tr(trt, '') },
11
13
  'CharacterSet#delete_in' => -> { cs.delete_in(str) },
12
14
  }
13
15
  )
14
16
 
15
17
  str = 'Lörem ipsüm ⛷ et dölörem'
16
18
  rx = /[\s\p{emoji}äüö]/
19
+ trt = "\t\n\v\f\r\s😀-🙏äüö"
17
20
  cs = CharacterSet.whitespace + CharacterSet.emoji + CharacterSet['ä', 'ö', 'ü']
18
21
 
19
22
  benchmark(
20
23
  caption: 'Removing whitespace, emoji and umlauts',
21
24
  cases: {
22
25
  'String#gsub' => -> { str.gsub(rx, '') },
26
+ 'String#tr' => -> { str.tr(trt, '') },
23
27
  'CharacterSet#delete_in' => -> { cs.delete_in(str) },
24
28
  }
25
29
  )
@@ -2,24 +2,28 @@ require_relative './shared'
2
2
 
3
3
  str = 'Lorem ipsum et dolorem'
4
4
  rx = /\S/
5
+ trt = "\u{0080}-\u{10FFFF}" # approximation
5
6
  cs = CharacterSet.whitespace
6
7
 
7
8
  benchmark(
8
9
  caption: 'Removing non-whitespace',
9
10
  cases: {
10
11
  'String#gsub' => -> { str.gsub(rx, '') },
12
+ 'String#tr' => -> { str.tr(trt, '') },
11
13
  'CharacterSet#keep_in' => -> { cs.keep_in(str) },
12
14
  }
13
15
  )
14
16
 
15
17
  str = 'Lorem ipsum ⛷ et dolorem'
16
18
  rx = /\p{^emoji}/
19
+ trt = "\u0000-\u{1F599}\u{1F650}-\u{10FFFF}"
17
20
  cs = CharacterSet.emoji
18
21
 
19
22
  benchmark(
20
- caption: 'Extracting emoji',
23
+ caption: 'Keeping only emoji',
21
24
  cases: {
22
25
  'String#gsub' => -> { str.gsub(rx, '') },
26
+ 'String#tr' => -> { str.tr(trt, '') },
23
27
  'CharacterSet#keep_in' => -> { cs.keep_in(str) },
24
28
  }
25
29
  )
@@ -28,18 +28,4 @@ Gem::Specification.new do |s|
28
28
  if RUBY_VERSION.to_f >= 3.0 && !RUBY_PLATFORM[/java/i]
29
29
  s.add_dependency 'sorted_set', '~> 1.0'
30
30
  end
31
-
32
- s.add_development_dependency 'benchmark-ips', '~> 2.7'
33
- s.add_development_dependency 'get_process_mem', '~> 0.2.3'
34
- s.add_development_dependency 'rake', '~> 13.0'
35
- s.add_development_dependency 'rake-compiler', '~> 1.1'
36
- s.add_development_dependency 'range_compressor', '~> 1.0'
37
- s.add_development_dependency 'regexp_parser', '~> 2.1'
38
- s.add_development_dependency 'regexp_property_values', '~> 1.0'
39
- s.add_development_dependency 'rspec', '~> 3.8'
40
- if RUBY_VERSION.to_f >= 2.7
41
- s.add_development_dependency 'codecov', '~> 0.2.12'
42
- s.add_development_dependency 'gouteur', '~> 1.0.0'
43
- s.add_development_dependency 'rubocop', '~> 1.8'
44
- end
45
31
  end
@@ -82,7 +82,11 @@ static const rb_data_type_t cs_type = {
82
82
  .dsize = cs_memsize,
83
83
  },
84
84
  .data = NULL,
85
+ #ifdef RUBY_TYPED_FROZEN_SHAREABLE
86
+ .flags = RUBY_TYPED_FREE_IMMEDIATELY | RUBY_TYPED_FROZEN_SHAREABLE,
87
+ #else
85
88
  .flags = RUBY_TYPED_FREE_IMMEDIATELY,
89
+ #endif
86
90
  };
87
91
 
88
92
  static inline VALUE
@@ -315,9 +319,9 @@ cs_method_minmax(VALUE self)
315
319
  cs_cp cp, alen, blen; \
316
320
  cs_ar *acps, *bcps; \
317
321
  struct cs_data *new_data; \
318
- new_cs = cs_alloc(RBASIC(self)->klass, &new_data); \
319
322
  acps = cs_fetch_cps(cs_a, &alen); \
320
323
  bcps = cs_fetch_cps(cs_b, &blen); \
324
+ new_cs = cs_alloc(RBASIC(self)->klass, &new_data); \
321
325
  for (cp = 0; cp < UNICODE_CP_COUNT; cp++) \
322
326
  { \
323
327
  if (tst_cp(acps, alen, cp) comp_op tst_cp(bcps, blen, cp)) \
@@ -705,8 +709,7 @@ cs_method_ranges(VALUE self)
705
709
 
706
710
  if (!previous_cp_num) {
707
711
  current_start = cp_num;
708
- } else if (previous_cp_num + 2 != cp_num)
709
- {
712
+ } else if (previous_cp_num + 2 != cp_num) {
710
713
  // gap found, finalize previous range
711
714
  rb_ary_push(ranges, rb_range_new(current_start, current_end, 0));
712
715
  current_start = cp_num;
@@ -1047,17 +1050,14 @@ raise_arg_err_unless_string(VALUE val)
1047
1050
  }
1048
1051
 
1049
1052
  static VALUE
1050
- cs_class_method_of(int argc, VALUE *argv, VALUE self)
1053
+ cs_class_method_of_string(VALUE self, VALUE string)
1051
1054
  {
1052
1055
  VALUE new_cs;
1053
1056
  struct cs_data *new_data;
1054
- int i;
1057
+
1058
+ raise_arg_err_unless_string(string);
1055
1059
  new_cs = cs_alloc(self, &new_data);
1056
- for (i = 0; i < argc; i++)
1057
- {
1058
- raise_arg_err_unless_string(argv[i]);
1059
- each_cp(argv[i], add_str_cp_to_arr, 0, 0, new_data, 0);
1060
- }
1060
+ each_cp(string, add_str_cp_to_arr, 0, 0, new_data, 0);
1061
1061
  return new_cs;
1062
1062
  }
1063
1063
 
@@ -1138,116 +1138,76 @@ cs_method_used_by_p(VALUE self, VALUE str)
1138
1138
  return only_uses_other_cps == Qfalse ? Qtrue : Qfalse;
1139
1139
  }
1140
1140
 
1141
- static void
1142
- cs_str_buf_cat(VALUE str, const char *ptr, long len)
1143
- {
1144
- long total, olen;
1145
- char *sptr;
1146
-
1147
- RSTRING_GETMEM(str, sptr, olen);
1148
- sptr = RSTRING(str)->as.heap.ptr;
1149
- olen = RSTRING(str)->as.heap.len;
1150
- total = olen + len;
1151
- memcpy(sptr + olen, ptr, len);
1152
- RSTRING(str)->as.heap.len = total;
1153
- }
1154
-
1155
- #ifndef TERM_FILL
1156
- #define TERM_FILL(ptr, termlen) \
1157
- do \
1158
- { \
1159
- char *const term_fill_ptr = (ptr); \
1160
- const int term_fill_len = (termlen); \
1161
- *term_fill_ptr = '\0'; \
1162
- if (__builtin_expect(!!(term_fill_len > 1), 0)) \
1163
- memset(term_fill_ptr, 0, term_fill_len); \
1164
- } while (0)
1165
- #endif
1166
-
1167
- static void
1168
- cs_str_buf_terminate(VALUE str, rb_encoding *enc)
1169
- {
1170
- char *ptr;
1171
- long len;
1172
-
1173
- ptr = RSTRING(str)->as.heap.ptr;
1174
- len = RSTRING(str)->as.heap.len;
1175
- TERM_FILL(ptr + len, rb_enc_mbminlen(enc));
1176
- }
1177
-
1141
+ // partially based on rb_str_delete_bang
1178
1142
  static inline VALUE
1179
1143
  cs_apply_to_str(VALUE set, VALUE str, int delete, int bang)
1180
1144
  {
1181
1145
  cs_ar *cps;
1182
- cs_cp len;
1183
- rb_encoding *str_enc;
1184
- VALUE orig_len, new_str_buf;
1185
- int cp_len;
1186
- unsigned int str_cp;
1187
- const char *ptr, *end;
1146
+ cs_cp cs_len;
1147
+ VALUE orig_str_len;
1148
+
1149
+ rb_encoding *enc;
1150
+ char *s, *send, *t;
1151
+ int ascompat, cr;
1188
1152
 
1189
1153
  raise_arg_err_unless_string(str);
1190
1154
 
1191
- cps = cs_fetch_cps(set, &len);
1155
+ orig_str_len = RSTRING_LEN(str);
1192
1156
 
1193
- orig_len = RSTRING_LEN(str);
1194
- if (orig_len < 1) // empty string, will never change
1157
+ if (orig_str_len == 0)
1195
1158
  {
1196
- if (bang)
1197
- {
1198
- return Qnil;
1199
- }
1200
- return rb_str_dup(str);
1159
+ return bang ? Qnil : str;
1201
1160
  }
1202
1161
 
1203
- new_str_buf = rb_str_buf_new(orig_len + 30); // len + margin
1204
- str_enc = rb_enc_get(str);
1205
- rb_enc_associate(new_str_buf, str_enc);
1206
- rb_str_modify(new_str_buf);
1207
- ENC_CODERANGE_SET(new_str_buf, rb_enc_asciicompat(str_enc) ? ENC_CODERANGE_7BIT : ENC_CODERANGE_VALID);
1208
-
1209
- ptr = RSTRING_PTR(str);
1210
- end = RSTRING_END(str);
1162
+ if (!bang)
1163
+ {
1164
+ str = rb_str_dup(str);
1165
+ }
1211
1166
 
1212
- if (single_byte_optimizable(str))
1167
+ cps = cs_fetch_cps(set, &cs_len);
1168
+ rb_str_modify(str);
1169
+ enc = rb_enc_get(str);
1170
+ ascompat = rb_enc_asciicompat(enc);
1171
+ s = t = RSTRING_PTR(str);
1172
+ send = RSTRING_END(str);
1173
+ cr = ascompat ? ENC_CODERANGE_7BIT : ENC_CODERANGE_VALID;
1174
+ while (s < send)
1213
1175
  {
1214
- while (ptr < end)
1176
+ unsigned int c;
1177
+ int clen;
1178
+
1179
+ if (ascompat && (c = *(unsigned char *)s) < 0x80)
1215
1180
  {
1216
- str_cp = *ptr & 0xff;
1217
- if ((!tst_cp(cps, len, str_cp)) == delete)
1181
+ if (tst_cp(cps, cs_len, c) != delete)
1218
1182
  {
1219
- cs_str_buf_cat(new_str_buf, ptr, 1);
1183
+ if (t != s)
1184
+ *t = c;
1185
+ t++;
1220
1186
  }
1221
- ptr++;
1187
+ s++;
1222
1188
  }
1223
- }
1224
- else // likely to be multibyte string
1225
- {
1226
- while (ptr < end)
1189
+ else
1227
1190
  {
1228
- str_cp = rb_enc_codepoint_len(ptr, end, &cp_len, str_enc);
1229
- if ((!tst_cp(cps, len, str_cp)) == delete)
1191
+ c = rb_enc_codepoint_len(s, send, &clen, enc);
1192
+
1193
+ if (tst_cp(cps, cs_len, c) != delete)
1230
1194
  {
1231
- cs_str_buf_cat(new_str_buf, ptr, cp_len);
1195
+ if (t != s)
1196
+ rb_enc_mbcput(c, t, enc);
1197
+ t += clen;
1198
+ if (cr == ENC_CODERANGE_7BIT)
1199
+ cr = ENC_CODERANGE_VALID;
1232
1200
  }
1233
- ptr += cp_len;
1201
+ s += clen;
1234
1202
  }
1235
1203
  }
1236
1204
 
1237
- cs_str_buf_terminate(new_str_buf, str_enc);
1205
+ rb_str_set_len(str, t - RSTRING_PTR(str));
1206
+ ENC_CODERANGE_SET(str, cr);
1238
1207
 
1239
- if (bang)
1208
+ if (bang && (RSTRING_LEN(str) == (long)orig_str_len)) // string unchanged
1240
1209
  {
1241
- if (RSTRING_LEN(new_str_buf) == (long)orig_len) // string unchanged
1242
- {
1243
- return Qnil;
1244
- }
1245
- rb_str_shared_replace(str, new_str_buf);
1246
- }
1247
- else
1248
- {
1249
- RB_OBJ_WRITE(new_str_buf, &(RBASIC(new_str_buf))->klass, rb_obj_class(str));
1250
- str = new_str_buf;
1210
+ return Qnil;
1251
1211
  }
1252
1212
 
1253
1213
  return str;
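The rewritten `cs_apply_to_str` above no longer builds a second buffer. It walks the string with a read pointer (`s`) and a write pointer (`t`), copies the characters to keep towards the front, and truncates at the end. The idea can be sketched in Ruby on an array of codepoints instead of raw bytes. This is an illustration under that simplification, not the extension's code, and `apply_in_place!` is a hypothetical name.

```ruby
require 'set'

# Compacts `codepoints` in place. With delete = true, members of `set` are
# dropped; with delete = false, only members of `set` are kept.
# Returns true if anything was removed.
def apply_in_place!(codepoints, set, delete)
  write = 0
  codepoints.each_index do |read|
    cp = codepoints[read]
    next unless set.include?(cp) != delete # same test as `tst_cp(...) != delete`

    codepoints[write] = cp if write != read # copy only once a gap has opened
    write += 1
  end
  removed = codepoints.size - write
  codepoints.pop(removed) # truncate, comparable to rb_str_set_len in the C code
  removed > 0
end

cps = 'a b c'.codepoints
apply_in_place!(cps, Set[0x20], true)
puts cps.pack('U*') # => abc
```

The boolean return value plays the role of the length comparison that the C function uses to decide whether a bang method returns `nil`.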
@@ -1289,6 +1249,10 @@ cs_method_allocated_length(VALUE self)
1289
1249
 
1290
1250
  void Init_character_set()
1291
1251
  {
1252
+ #ifdef HAVE_RB_EXT_RACTOR_SAFE
1253
+ rb_ext_ractor_safe(true);
1254
+ #endif
1255
+
1292
1256
  VALUE cs = rb_define_class("CharacterSet", rb_cObject);
1293
1257
 
1294
1258
  rb_define_alloc_func(cs, cs_method_allocate);
@@ -1343,7 +1307,7 @@ void Init_character_set()
1343
1307
  // `CharacterSet`-specific methods
1344
1308
 
1345
1309
  rb_define_singleton_method(cs, "from_ranges", cs_class_method_from_ranges, -2);
1346
- rb_define_singleton_method(cs, "of", cs_class_method_of, -1);
1310
+ rb_define_singleton_method(cs, "of_string", cs_class_method_of_string, 1);
1347
1311
 
1348
1312
  rb_define_method(cs, "ranges", cs_method_ranges, 0);
1349
1313
  rb_define_method(cs, "sample", cs_method_sample, -1);
@@ -2,7 +2,7 @@ class CharacterSet
2
2
  module CoreExt
3
3
  module StringExt
4
4
  def character_set
5
- CharacterSet.of(self)
5
+ CharacterSet.of_string(self)
6
6
  end
7
7
 
8
8
  {
@@ -22,6 +22,17 @@ class CharacterSet
22
22
  alias valid unicode
23
23
 
24
24
  def build_from_cps_file(path)
25
+ if defined?(Ractor) && Ractor.current != Ractor.main
26
+ raise <<-EOS.gsub(/^ */, '')
27
+ CharacterSet's predefined sets are lazy-loaded.
28
+ Pre-load them to use them in Ractors. E.g.:
29
+
30
+ CharacterSet.ascii # pre-load
31
+ Ractor.new { CharacterSet.ascii.size }.take # => 128
32
+ Ractor.new { 'abc'.keep_character_set(:ascii) }.take # => 'abc'
33
+ EOS
34
+ end
35
+
25
36
  File.readlines(path).inject(new) do |set, line|
26
37
  range_start, range_end = line.split(',')
27
38
  set.merge((range_start.to_i(16))..(range_end.to_i(16)))
@@ -6,13 +6,9 @@ class CharacterSet
6
6
  new(Array(ranges).flat_map(&:to_a))
7
7
  end
8
8
 
9
- def of(*strings)
10
- new_set = new
11
- strings.each do |str|
12
- raise ArgumentError, 'pass a String' unless str.respond_to?(:codepoints)
13
- str.codepoints.each { |cp| new_set << cp }
14
- end
15
- new_set
9
+ def of_string(str)
10
+ raise ArgumentError, 'pass a String' unless str.respond_to?(:codepoints)
11
+ str.codepoints.each_with_object(new) { |cp, set| set << cp }
16
12
  end
17
13
  end
18
14
 
@@ -15,6 +15,12 @@ class CharacterSet
15
15
  new(Array(args))
16
16
  end
17
17
 
18
+ def of(*args)
19
+ args.map do |arg|
20
+ arg.is_a?(Regexp) ? of_regexp(arg) : of_string(arg)
21
+ end.reduce(:merge) || new
22
+ end
23
+
18
24
  def parse(string)
19
25
  codepoints = Parser.codepoints_from_bracket_expression(string)
20
26
  result = new(codepoints)
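The new `of` in the hunk above dispatches each argument to `of_regexp` or `of_string` and merges the results, falling back to an empty set when called without arguments. A self-contained sketch of that dispatch is given below. `TinyCodepointSet` is a hypothetical stand-in built on the stdlib `Set`, and its `of_regexp` only brute-forces the ASCII range, which is a simplification and not how the gem reads a `Regexp`.

```ruby
require 'set'

# Hypothetical stand-in for CharacterSet, for illustration only.
class TinyCodepointSet < Set
  def self.of_string(str)
    raise ArgumentError, 'pass a String' unless str.respond_to?(:codepoints)
    str.codepoints.each_with_object(new) { |cp, set| set << cp }
  end

  # Simplified assumption: test every ASCII codepoint against the Regexp.
  def self.of_regexp(regexp)
    (0..127).each_with_object(new) do |cp, set|
      set << cp if regexp.match?([cp].pack('U'))
    end
  end

  # Same shape as the method added in the diff.
  def self.of(*args)
    args.map do |arg|
      arg.is_a?(Regexp) ? of_regexp(arg) : of_string(arg)
    end.reduce(:merge) || new
  end
end

p TinyCodepointSet.of('ab', /[b-d]/).to_a.sort # => [97, 98, 99, 100]
p TinyCodepointSet.of.empty?                   # => true
```

The `|| new` fallback matters because `reduce` on an empty array returns `nil`.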
@@ -1,3 +1,3 @@
1
1
  class CharacterSet
2
- VERSION = '1.5.0'
2
+ VERSION = '1.6.0'
3
3
  end
metadata CHANGED
@@ -1,14 +1,14 @@
1
1
  --- !ruby/object:Gem::Specification
2
2
  name: character_set
3
3
  version: !ruby/object:Gem::Version
4
- version: 1.5.0
4
+ version: 1.6.0
5
5
  platform: ruby
6
6
  authors:
7
7
  - Janosch Müller
8
8
  autorequire:
9
9
  bindir: bin
10
10
  cert_chain: []
11
- date: 2021-12-05 00:00:00.000000000 Z
11
+ date: 2022-02-16 00:00:00.000000000 Z
12
12
  dependencies:
13
13
  - !ruby/object:Gem::Dependency
14
14
  name: sorted_set
@@ -24,160 +24,6 @@ dependencies:
24
24
  - - "~>"
25
25
  - !ruby/object:Gem::Version
26
26
  version: '1.0'
27
- - !ruby/object:Gem::Dependency
28
- name: benchmark-ips
29
- requirement: !ruby/object:Gem::Requirement
30
- requirements:
31
- - - "~>"
32
- - !ruby/object:Gem::Version
33
- version: '2.7'
34
- type: :development
35
- prerelease: false
36
- version_requirements: !ruby/object:Gem::Requirement
37
- requirements:
38
- - - "~>"
39
- - !ruby/object:Gem::Version
40
- version: '2.7'
41
- - !ruby/object:Gem::Dependency
42
- name: get_process_mem
43
- requirement: !ruby/object:Gem::Requirement
44
- requirements:
45
- - - "~>"
46
- - !ruby/object:Gem::Version
47
- version: 0.2.3
48
- type: :development
49
- prerelease: false
50
- version_requirements: !ruby/object:Gem::Requirement
51
- requirements:
52
- - - "~>"
53
- - !ruby/object:Gem::Version
54
- version: 0.2.3
55
- - !ruby/object:Gem::Dependency
56
- name: rake
57
- requirement: !ruby/object:Gem::Requirement
58
- requirements:
59
- - - "~>"
60
- - !ruby/object:Gem::Version
61
- version: '13.0'
62
- type: :development
63
- prerelease: false
64
- version_requirements: !ruby/object:Gem::Requirement
65
- requirements:
66
- - - "~>"
67
- - !ruby/object:Gem::Version
68
- version: '13.0'
69
- - !ruby/object:Gem::Dependency
70
- name: rake-compiler
71
- requirement: !ruby/object:Gem::Requirement
72
- requirements:
73
- - - "~>"
74
- - !ruby/object:Gem::Version
75
- version: '1.1'
76
- type: :development
77
- prerelease: false
78
- version_requirements: !ruby/object:Gem::Requirement
79
- requirements:
80
- - - "~>"
81
- - !ruby/object:Gem::Version
82
- version: '1.1'
83
- - !ruby/object:Gem::Dependency
84
- name: range_compressor
85
- requirement: !ruby/object:Gem::Requirement
86
- requirements:
87
- - - "~>"
88
- - !ruby/object:Gem::Version
89
- version: '1.0'
90
- type: :development
91
- prerelease: false
92
- version_requirements: !ruby/object:Gem::Requirement
93
- requirements:
94
- - - "~>"
95
- - !ruby/object:Gem::Version
96
- version: '1.0'
97
- - !ruby/object:Gem::Dependency
98
- name: regexp_parser
99
- requirement: !ruby/object:Gem::Requirement
100
- requirements:
101
- - - "~>"
102
- - !ruby/object:Gem::Version
103
- version: '2.1'
104
- type: :development
105
- prerelease: false
106
- version_requirements: !ruby/object:Gem::Requirement
107
- requirements:
108
- - - "~>"
109
- - !ruby/object:Gem::Version
110
- version: '2.1'
111
- - !ruby/object:Gem::Dependency
112
- name: regexp_property_values
113
- requirement: !ruby/object:Gem::Requirement
114
- requirements:
115
- - - "~>"
116
- - !ruby/object:Gem::Version
117
- version: '1.0'
118
- type: :development
119
- prerelease: false
120
- version_requirements: !ruby/object:Gem::Requirement
121
- requirements:
122
- - - "~>"
123
- - !ruby/object:Gem::Version
124
- version: '1.0'
125
- - !ruby/object:Gem::Dependency
126
- name: rspec
127
- requirement: !ruby/object:Gem::Requirement
128
- requirements:
129
- - - "~>"
130
- - !ruby/object:Gem::Version
131
- version: '3.8'
132
- type: :development
133
- prerelease: false
134
- version_requirements: !ruby/object:Gem::Requirement
135
- requirements:
136
- - - "~>"
137
- - !ruby/object:Gem::Version
138
- version: '3.8'
139
- - !ruby/object:Gem::Dependency
140
- name: codecov
141
- requirement: !ruby/object:Gem::Requirement
142
- requirements:
143
- - - "~>"
144
- - !ruby/object:Gem::Version
145
- version: 0.2.12
146
- type: :development
147
- prerelease: false
148
- version_requirements: !ruby/object:Gem::Requirement
149
- requirements:
150
- - - "~>"
151
- - !ruby/object:Gem::Version
152
- version: 0.2.12
153
- - !ruby/object:Gem::Dependency
154
- name: gouteur
155
- requirement: !ruby/object:Gem::Requirement
156
- requirements:
157
- - - "~>"
158
- - !ruby/object:Gem::Version
159
- version: 1.0.0
160
- type: :development
161
- prerelease: false
162
- version_requirements: !ruby/object:Gem::Requirement
163
- requirements:
164
- - - "~>"
165
- - !ruby/object:Gem::Version
166
- version: 1.0.0
167
- - !ruby/object:Gem::Dependency
168
- name: rubocop
169
- requirement: !ruby/object:Gem::Requirement
170
- requirements:
171
- - - "~>"
172
- - !ruby/object:Gem::Version
173
- version: '1.8'
174
- type: :development
175
- prerelease: false
176
- version_requirements: !ruby/object:Gem::Requirement
177
- requirements:
178
- - - "~>"
179
- - !ruby/object:Gem::Version
180
- version: '1.8'
181
27
  description:
182
28
  email:
183
29
  - janosch84@gmail.com
@@ -269,7 +115,7 @@ required_rubygems_version: !ruby/object:Gem::Requirement
269
115
  - !ruby/object:Gem::Version
270
116
  version: '0'
271
117
  requirements: []
272
- rubygems_version: 3.3.0.dev
118
+ rubygems_version: 3.4.0.dev
273
119
  signing_key:
274
120
  specification_version: 4
275
121
  summary: Build, read, write and compare sets of Unicode codepoints.