string_view 0.1.0-arm64-darwin

checksums.yaml ADDED
@@ -0,0 +1,7 @@
1
+ ---
2
+ SHA256:
3
+ metadata.gz: 9a28e83b128d95e9494a197674999728de3a0da909e8cc6f4e5adad40b759a4d
4
+ data.tar.gz: b7373a6840f1fa39f7cda173a98c8b4ff174495f73866819a90fd54f6ced613e
5
+ SHA512:
6
+ metadata.gz: 5b45cdfc7d76fb266224805f1f39bd1bd94a57f4b4453cfb74bcf0e63e17d78b26656d94875bf6e14a3167886561240af2c9b8e4c09768577b719040ebe46bf0
7
+ data.tar.gz: 72106bec2faef615ae1d354bd529df26ecc887aa5b5a2b65d56b63b23abea24bf4b13dbb78c427ac6c8d8f074e27b04c09b4c4fd2c913cba46b5c01961bb7a42
data/LICENSE-simdutf.txt ADDED
@@ -0,0 +1,25 @@
1
+ simdutf is used under the MIT License.
2
+
3
+ Source: https://github.com/simdutf/simdutf
4
+ Version: 8.2.0
5
+
6
+ ==============================================================================
7
+
8
+ Copyright 2021 The simdutf authors
9
+
10
+ Permission is hereby granted, free of charge, to any person obtaining a copy of
11
+ this software and associated documentation files (the "Software"), to deal in
12
+ the Software without restriction, including without limitation the rights to
13
+ use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of
14
+ the Software, and to permit persons to whom the Software is furnished to do so,
15
+ subject to the following conditions:
16
+
17
+ The above copyright notice and this permission notice shall be included in all
18
+ copies or substantial portions of the Software.
19
+
20
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
21
+ IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS
22
+ FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR
23
+ COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER
24
+ IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
25
+ CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
data/LICENSE.txt ADDED
@@ -0,0 +1,21 @@
1
+ The MIT License (MIT)
2
+
3
+ Copyright (c) 2026 Shopify
4
+
5
+ Permission is hereby granted, free of charge, to any person obtaining a copy
6
+ of this software and associated documentation files (the "Software"), to deal
7
+ in the Software without restriction, including without limitation the rights
8
+ to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
9
+ copies of the Software, and to permit persons to whom the Software is
10
+ furnished to do so, subject to the following conditions:
11
+
12
+ The above copyright notice and this permission notice shall be included in
13
+ all copies or substantial portions of the Software.
14
+
15
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16
+ IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17
+ FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18
+ AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19
+ LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20
+ OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
21
+ THE SOFTWARE.
data/README.md ADDED
@@ -0,0 +1,264 @@
1
+ # StringView
2
+
3
+ A zero-copy, read-only view into a Ruby String, implemented as a C extension. StringView avoids allocating and copying bytes when slicing strings, which makes inner slices faster and more memory-efficient than `String#[]`, especially on large buffers: the benchmarks below measure 3.75x (1 KB slices) to 165x (500 KB slices) faster slice creation on a 1MB backing.
4
+
5
+ Think of it like C++'s `std::string_view` or Rust's `&str`: a lightweight window into an existing string's bytes.
6
+
7
+ ## How it works
8
+
9
+ A `StringView` wraps a frozen Ruby String (the "backing") and stores a byte offset and length. Slicing a `StringView` returns a new `StringView` pointing into the same backing — no bytes are ever copied.
10
+
11
+ ```ruby
12
+ buf = "Hello, world! This is a large buffer.".freeze
13
+ sv = StringView.new(buf)
14
+
15
+ # Slicing returns a new StringView — zero copy, O(1)
16
+ chunk = sv[7, 6] # => #<StringView "world!">
17
+ chunk.to_s # => "world!"
18
+ chunk.class # => StringView
19
+ ```
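As a rough mental model of the paragraph above, the offset-and-length bookkeeping can be sketched in plain Ruby. This is a minimal sketch, not the gem's C implementation: `SketchView` and its methods are hypothetical names, it indexes by bytes only, and its `to_s` goes through `String#byteslice`.

```ruby
# Hypothetical pure-Ruby model of a view (illustration only; the gem is a C extension).
class SketchView
  attr_reader :backing, :offset, :bytesize

  def initialize(backing, offset = 0, bytesize = backing.bytesize - offset)
    @backing = backing.freeze # views rely on the backing never changing
    @offset = offset
    @bytesize = bytesize
  end

  # Byte-indexed slice: allocates one small object and copies no bytes.
  def [](start, len)
    raise IndexError, "slice out of range" if start.negative? || len.negative? || start > @bytesize

    len = [len, @bytesize - start].min
    SketchView.new(@backing, @offset + start, len)
  end

  # Materializing a String is where bytes may be copied.
  def to_s
    @backing.byteslice(@offset, @bytesize)
  end
end

sv = SketchView.new("Hello, world! This is a large buffer.")
chunk = sv[7, 6]
puts chunk.to_s                       # prints "world!"
puts chunk.backing.equal?(sv.backing) # prints "true"
```

Every slice, including a slice of a slice, points at the same frozen backing object, so only the three small fields differ between views.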
20
+
21
+ ### Three-tier design
22
+
23
+ StringView methods are organized into three tiers based on their allocation behavior:
24
+
25
+ | Tier | Strategy | Returns | Allocations |
26
+ |------|----------|---------|-------------|
27
+ | **Tier 1** | Native C, zero-copy | Primitives | **0** |
28
+ | **Tier 2** | New view into same backing | `StringView` | **1 object** (no byte copy) |
29
+ | **Tier 3** | Delegate to String | `String` | **1 String** (result only) |
30
+
31
+ **Tier 1 — Zero-copy reads** (no allocations):
32
+ `bytesize`, `length`, `empty?`, `encoding`, `ascii_only?`, `include?`, `start_with?`, `end_with?`, `index`, `rindex`, `getbyte`, `byteindex`, `byterindex`, `each_byte`, `each_char`, `bytes`, `chars`, `match`, `match?`, `=~`, `to_i`, `to_f`, `hex`, `oct`, `==`, `<=>`, `eql?`, `hash`
33
+
34
+ **Tier 2 — Slicing returns StringView** (one object, zero byte copy):
35
+ `[]`, `slice`, `byteslice`
36
+
37
+ **Tier 3 — Transform via delegation** (one String for the result, no intermediate copy):
38
+ `upcase`, `downcase`, `capitalize`, `swapcase`, `strip`, `lstrip`, `rstrip`, `chomp`, `chop`, `reverse`, `squeeze`, `encode`, `gsub`, `sub`, `tr`, `split`, `scan`, `count`, `delete`, `center`, `ljust`, `rjust`, `%`, `+`, `*`, `unpack1`, `scrub`, `unicode_normalize`
39
+
40
+ ### Design decisions
41
+
42
+ - **`to_str` is private.** Ruby's internal coercion protocol (`rb_check_string_type`) ignores method visibility, so StringView works seamlessly with `Regexp#=~`, string interpolation, and `String#+`. But `respond_to?(:to_str)` returns `false`, so code that explicitly checks for string-like objects won't treat StringView as a drop-in String replacement. When coercion happens, it returns a frozen shared string (zero-copy for heap-allocated backings).
43
+ - **All bang methods raise `FrozenError`.** StringView is immutable — `upcase!`, `gsub!`, `slice!`, etc. all raise immediately.
44
+ - **`method_missing` is a safety net.** Any String method not yet implemented natively raises `NotImplementedError` with a message telling you to call `.to_s.method_name(...)` explicitly. No silent fallback.
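The three decisions above can be mimicked in plain Ruby. The sketch below is a hypothetical stand-in (`GuardedView` is not part of the gem, and the error messages are invented); it only shows the observable behavior described in the list, assuming the real extension behaves as the README states.

```ruby
# Hypothetical stand-in for illustration; the gem implements this behavior in C.
class GuardedView
  def initialize(str)
    @str = str.dup.freeze
  end

  def to_s
    @str
  end

  # A few bang methods, all failing fast because the view is read-only.
  [:upcase!, :gsub!, :slice!].each do |name|
    define_method(name) do |*|
      raise FrozenError, "can't modify frozen #{self.class}"
    end
  end

  # Safety net: String methods without a native version raise instead of
  # silently falling back to a copy.
  def method_missing(name, *args, &block)
    if String.method_defined?(name)
      raise NotImplementedError,
        "#{self.class}##{name} is not implemented; call .to_s.#{name}(...) explicitly"
    end
    super
  end

  private

  # Private, so respond_to?(:to_str) is false.
  def to_str
    @str
  end
end

view = GuardedView.new("abc")
puts view.respond_to?(:to_str) # prints "false"

begin
  view.ljust(5)
rescue NotImplementedError => e
  puts e.message
end
```

Note that `NotImplementedError` descends from `ScriptError`, not `StandardError`, so a bare `rescue` will not catch it; it has to be rescued by name.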
45
+
46
+ ## Performance
47
+
48
+ ### Where StringView is faster
49
+
50
+ StringView's primary advantage is **inner slicing** — extracting a substring from the middle of a large string. CRuby's `String#[]` must copy the bytes (except for tail slices), while StringView just adjusts an offset and length.
51
+
52
+ **Inner slice creation** (Ruby 4.0.2, YJIT, 1MB ASCII backing):
53
+
54
+ | Slice size | String | StringView | Speedup |
55
+ |------------|--------|------------|---------|
56
+ | 1 KB | 5.98M i/s | 22.4M i/s | **3.75x faster** |
57
+ | 10 KB | 1.74M i/s | 21.2M i/s | **12.2x faster** |
58
+ | 100 KB | 555K i/s | 21.5M i/s | **38.8x faster** |
59
+ | 500 KB | 131K i/s | 21.6M i/s | **165x faster** |
60
+
61
+ StringView slice creation is **constant time** (~45ns) regardless of slice size, while String scales linearly with the number of bytes copied.
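The ~45ns figure follows from the throughput column: time per operation is the reciprocal of iterations per second. A small sketch of that arithmetic, using the rates from the table above (the helper name `ns_per_op` is made up for illustration):

```ruby
# Convert benchmark throughput (iterations per second) into nanoseconds per operation.
def ns_per_op(iterations_per_second)
  1e9 / iterations_per_second
end

# Rates copied from the inner-slice table above: [String, StringView].
rates = {
  "1 KB" => [5.98e6, 22.4e6],
  "10 KB" => [1.74e6, 21.2e6],
  "100 KB" => [555e3, 21.5e6],
  "500 KB" => [131e3, 21.6e6],
}

rates.each do |size, (string_ips, view_ips)|
  puts format("%-6s String: %8.1f ns  StringView: %5.1f ns", size, ns_per_op(string_ips), ns_per_op(view_ips))
end
```

The StringView column stays between roughly 45 and 47 ns, while the String column grows from about 167 ns at 1 KB to about 7.6 µs at 500 KB.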
62
+
63
+ **Combined: inner slice then operate** (the real-world pattern):
64
+
65
+ | Operation | String | StringView | Speedup |
66
+ |-----------|--------|------------|---------|
67
+ | slice + `start_with?` | 126K i/s | 18.9M i/s | **151x faster** |
68
+ | slice + `include?` (500KB) | 6.97K i/s | 7.37K i/s | 1.06x faster |
69
+ | 50 inner slices | 60.1K i/s | 396K i/s | **6.6x faster** |
70
+
71
+ **UTF-8 character-indexed slicing** (YJIT, 1.2M-char / 2.8MB UTF-8 string, stride index pre-built):
72
+
73
+ | Character offset | String | StringView | Speedup |
74
+ |------------------|--------|------------|---------|
75
+ | 100 | 0.029s | 0.021s | **1.4x faster** |
76
+ | 1,000 | 0.048s | 0.024s | **2x faster** |
77
+ | 10,000 | 0.144s | 0.020s | **7x faster** |
78
+ | 100,000 | 1.12s | 0.008s | **134x faster** |
79
+ | 500,000 | 5.25s | 0.007s | **743x faster** |
80
+ | 1,000,000 | 10.35s | 0.015s | **698x faster** |
81
+
82
+ StringView builds a stride index on first character-indexed access, making subsequent lookups O(1). String must scan from the beginning every time (O(n)).
83
+
84
+ **UTF-8 character counting** (`length` on multibyte strings):
85
+ - First call: **2.8x faster** than String (SIMD-accelerated via [simdutf](https://github.com/simdutf/simdutf))
86
+ - Subsequent calls: **100x faster** (cached)
87
+
88
+ **Memory** (1,000 inner slices of 10KB each from a 1MB string):
89
+ - String: **10 MB** (each slice copies its bytes)
90
+ - StringView: **224 KB** (each slice is a small embedded struct)
91
+ - **45x less memory**
92
+
93
+ **Reads on pre-existing slices:**
94
+
95
+ | Operation | String | StringView | Ratio |
96
+ |-----------|--------|------------|-------|
97
+ | `start_with?` | 40.0M i/s | 48.9M i/s | **1.22x faster** |
98
+ | `include?` (500KB) | 7.2K i/s | 7.2K i/s | same |
99
+
100
+ ### Where String is faster
101
+
102
+ For simple accessor methods on pre-existing objects, String has a small advantage because its internal macros (`RSTRING_LEN`, etc.) are inlined by the compiler, while StringView goes through the TypedData API:
103
+
104
+ | Operation | String | StringView | Overhead |
105
+ |-----------|--------|------------|----------|
106
+ | `bytesize` | 67.4M i/s | 60.7M i/s | 1.11x slower |
107
+ | `getbyte` | 62.2M i/s | 57.3M i/s | 1.09x slower (within error) |
108
+
109
+ This is the irreducible cost of being a C extension type rather than a built-in. In practice, the throughput figures above work out to a difference of roughly 1 to 2 ns per call.
110
+
111
+ **Tail slices** (slices that extend to the end of the string) are already zero-copy in CRuby via shared strings, so StringView provides no advantage there.
112
+
113
+ **Small string slices** (below ~23 bytes) use CRuby's embedded string optimization, which is already very fast. StringView's per-object overhead is comparable.
114
+
115
+ ### When to use StringView
116
+
117
+ StringView is most beneficial when you:
118
+
119
+ - Parse large buffers (HTTP bodies, log files, serialized data) by extracting many substrings
120
+ - Repeatedly slice into the middle of large strings
121
+ - Work with large UTF-8 text and need character-indexed access
122
+ - Want to reduce memory pressure from retained string slices
123
+ - Need to pass around "windows" into a string without copying
124
+
125
+ StringView is less beneficial when you:
126
+
127
+ - Work with small strings (< 1KB) where copy cost is negligible
128
+ - Only need tail slices (CRuby already optimizes these)
129
+ - Primarily call simple accessors (`bytesize`, `getbyte`) on pre-existing slices
130
+
131
+ ## Installation
132
+
133
+ Add to your Gemfile:
134
+
135
+ ```ruby
136
+ gem "string_view"
137
+ ```
138
+
139
+ Or install directly:
140
+
141
+ ```bash
142
+ gem install string_view
143
+ ```
144
+
145
+ The gem includes a C extension that compiles during installation. It requires Ruby 3.3+ and a toolchain that supports C99 (for the extension itself) and C++17 (for the bundled simdutf). Ruby 3.3 is needed for `RUBY_TYPED_EMBEDDABLE` (struct embedding in the Ruby object) and `RTYPEDDATA_GET_DATA` (fast struct access).
146
+
147
+ ## Usage
148
+
149
+ ### Basic usage
150
+
151
+ ```ruby
152
+ require "string_view"
153
+
154
+ # Create from a String (freezes the backing automatically)
155
+ sv = StringView.new("Hello, world!")
156
+
157
+ # Slicing returns StringView — zero copy
158
+ greeting = sv[0, 5] # => StringView "Hello"
159
+ name = sv[7, 5] # => StringView "world"
160
+
161
+ # Chain slices — all share the same backing
162
+ first = greeting[0, 1] # => StringView "H"
163
+
164
+ # Materialize when you need a real String
165
+ str = sv.to_s # => "Hello, world!" (new String)
166
+
167
+ # Read-only operations work directly
168
+ sv.include?("world") # => true
169
+ sv.start_with?("Hello") # => true
170
+ sv.length # => 13
171
+ sv.bytesize # => 13
172
+
173
+ # Transforms return String (not StringView)
174
+ sv.upcase # => "HELLO, WORLD!" (String)
175
+ sv.split(", ") # => ["Hello", "world!"] (Array of Strings)
176
+ ```
177
+
178
+ ### Parsing a large buffer
179
+
180
+ ```ruby
181
+ # Simulate a large HTTP response or log chunk
182
+ buffer = File.read("large_file.txt")
183
+ sv = StringView.new(buffer)
184
+
185
+ # Extract fields without copying the entire buffer
186
+ header = sv[0, 1024] # First 1KB — zero copy
187
+ body = sv[1024, sv.bytesize - 1024] # Rest — zero copy
188
+
189
+ # Search within a region
190
+ body.include?("ERROR") # Searches directly, no copy
191
+ body.match?(/\d{4}-\d{2}-\d{2}/) # Regex on the view
192
+ ```
193
+
194
+ ### Reusing a view with `reset!`
195
+
196
+ ```ruby
197
+ sv = StringView.new("initial content")
198
+
199
+ # Re-point at a different backing string
200
+ new_data = "different content"
201
+ sv.reset!(new_data, 0, new_data.bytesize)
202
+ sv.to_s # => "different content"
203
+ ```
204
+
205
+ ## Internals
206
+
207
+ ### Struct layout
208
+
209
+ Each StringView is a small C struct embedded directly in the Ruby object (via `RUBY_TYPED_EMBEDDABLE`):
210
+
211
+ ```
212
+ StringView object (no separate heap allocation)
213
+ ├── backing: VALUE — frozen String (strong GC reference)
214
+ ├── base: const char* — cached RSTRING_PTR(backing)
215
+ ├── enc: rb_encoding* — cached encoding
216
+ ├── offset: long — byte offset into backing
217
+ ├── length: long — byte length of this view
218
+ ├── charlen: long — cached character count (-1 = not computed)
219
+ ├── single_byte: int — cached flag: 1=ASCII/binary, 0=multibyte
220
+ └── stride_idx: ptr — lazily-built char→byte index (UTF-8 only)
221
+ ```
222
+
223
+ ### UTF-8 acceleration
224
+
225
+ For UTF-8 strings with actual multibyte content, StringView uses two techniques:
226
+
227
+ 1. **SIMD character counting** via [simdutf](https://github.com/simdutf/simdutf): counts UTF-8 characters at billions of bytes per second using NEON (ARM) or SSE/AVX (x86).
228
+
229
+ 2. **Stride index**: on first character-indexed access, builds a lookup table mapping every 128th character to its byte offset. Subsequent character→byte conversions are O(1): one table lookup plus a scalar scan over at most 127 characters.
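The stride index idea can be sketched in plain Ruby. This is an illustration under assumptions, not the extension's C code: the function names are hypothetical, the input is assumed to be valid UTF-8, and the table is a plain Array of byte offsets.

```ruby
STRIDE = 128 # one table entry per 128 characters, as described above

# Build the table: byte offset of character 0, 128, 256, ...
def build_stride_index(str)
  index = []
  byte_pos = 0
  str.each_char.with_index do |ch, char_pos|
    index << byte_pos if (char_pos % STRIDE).zero?
    byte_pos += ch.bytesize
  end
  index
end

# Convert a character position (0 <= char_pos < str.length) to a byte offset.
def char_to_byte_offset(str, index, char_pos)
  byte_pos = index.fetch(char_pos / STRIDE) # O(1) jump to the nearest checkpoint
  (char_pos % STRIDE).times do
    byte_pos += 1 # step over the lead byte
    # then over any UTF-8 continuation bytes (bit pattern 10xxxxxx)
    byte_pos += 1 while (byte = str.getbyte(byte_pos)) && (byte & 0xC0) == 0x80
  end
  byte_pos
end

text = "héllo wörld " * 100
index = build_stride_index(text)
puts char_to_byte_offset(text, index, 500) == text[0, 500].bytesize # prints "true"
```

Building the table costs one pass over the string; after that, each lookup walks at most 127 characters from its checkpoint regardless of how far into the string the position lies.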
230
+
231
+ ### Compilation
232
+
233
+ The extension is compiled with `-O3` and uses `__attribute__((always_inline))` on hot paths, `RTYPEDDATA_GET_DATA` for fast struct access (skipping type checks), and `FL_SET_RAW` for freeze (bypassing method dispatch).
234
+
235
+ ## Development
236
+
237
+ ```bash
238
+ git clone https://github.com/Shopify/string_view
239
+ cd string_view
240
+ bundle install
241
+ bundle exec rake compile
242
+ bundle exec rake test
243
+ ```
244
+
245
+ Run benchmarks:
246
+
247
+ ```bash
248
+ bundle exec rake compile
249
+ ruby --yjit -Ilib benchmark/bench.rb
250
+ ```
251
+
252
+ ## License
253
+
254
+ The gem is available as open source under the terms of the [MIT License](https://opensource.org/licenses/MIT).
255
+
256
+ simdutf is included under the [MIT License](https://github.com/simdutf/simdutf/blob/master/LICENSE-MIT) (also available under Apache 2.0). See [LICENSE-simdutf.txt](LICENSE-simdutf.txt) for the full text.
257
+
258
+ ## Contributing
259
+
260
+ See [CONTRIBUTING.md](CONTRIBUTING.md) for details on how to contribute.
261
+
262
+ ## Code of Conduct
263
+
264
+ Everyone interacting in the StringView project's codebases, issue trackers, chat rooms and mailing lists is expected to follow the [Contributor Code of Conduct](CODE_OF_CONDUCT.md). Report unacceptable behavior to <opensource@shopify.com>.
data/Rakefile ADDED
@@ -0,0 +1,20 @@
1
+ # frozen_string_literal: true
2
+
3
+ require "bundler/gem_tasks"
4
+ require "rake/extensiontask"
5
+ require "minitest/test_task"
6
+
7
+ Rake::ExtensionTask.new("string_view") do |ext|
8
+ ext.lib_dir = "lib/string_view"
9
+ end
10
+
11
+ Minitest::TestTask.create
12
+
13
+ # Compile the extension before running tests
14
+ task test: :compile
15
+
16
+ require "rubocop/rake_task"
17
+
18
+ RuboCop::RakeTask.new
19
+
20
+ task default: [:test, :rubocop]
@@ -0,0 +1,15 @@
1
+ # frozen_string_literal: true
2
+
3
+ require "mkmf"
4
+
5
+ # C flags for string_view.c
6
+ $CFLAGS << " -std=c99 -O3 -Wall -Wextra -Wno-unused-parameter"
7
+
8
+ # C++ flags for simdutf
9
+ $CXXFLAGS = " -std=c++17 -O3 -DNDEBUG"
10
+
11
+ # Tell mkmf about our source files (C and C++ mixed)
12
+ $srcs = ["string_view.c", "simdutf.cpp"]
13
+ $INCFLAGS << " -I$(srcdir)"
14
+
15
+ create_makefile("string_view/string_view")