hyperion-rb 2.12.0 → 2.13.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
checksums.yaml CHANGED
@@ -1,7 +1,7 @@
  ---
  SHA256:
- metadata.gz: ae613d6efdf92072a2076a95a3656687e6bc0b5a9302700c56d75bcc0a900b56
- data.tar.gz: f4f3f26cff7f4ebd08d35757726219ee53e65418bacdd0061d05626d7b354dba
+ metadata.gz: 4a2ebbc1658c8f98f631a1c8e7d8059248995d0d7e89905cc796385583d4739b
+ data.tar.gz: '0910002789df50d8de0b387a9125ef6f84b4e7d4a740e579f6d43cd4c1a04d16'
  SHA512:
- metadata.gz: 5e0cbc3aa81bbe246dbf670642cae5fd2e1f0275f36030729bb68fe23b00dd54bbc51fb5f2e62c0a306fd825f73ab35e86cc49dd3bb6a27f38269885067de62c
- data.tar.gz: f749e0b6e05c52695bc4698d847b92e88c10a17dc21a2d6d93deb9c9a013c5c712f778dfa13aa6dab6f5dd0ebf398c69a5eb56255031aaf9539ca561a38bca6d
+ metadata.gz: 34d825dd492712e8a5f555854097205535fc779c57418035104c3f2a0e51ec4a1dcdd93de67b5bf736cbb1642b0845cd8af61d89a3b82a06a581194e9a7d3df6
+ data.tar.gz: 4bce0232541f6f49ad51826bcf49cbfa791fd9df0b876ff3baf3952629a668da45b006600d08adf1a77480d62588e4ec5c52bfb126982d75063ec6241f63c989
data/CHANGELOG.md CHANGED
@@ -1,5 +1,518 @@
  # Changelog

+ ## 2.13.0 — 2026-05-01
+
+ ### 2.13-E — io_uring soak signal + default-ON decision
+
+ **Background.** 2.12-D shipped the io_uring accept loop on Linux 5.x+
+ (opt-in via `HYPERION_IO_URING_ACCEPT=1`, bench delta:
+ 15,685 → 134,084 r/s on `handle_static` hello — 8.6× over the 2.12-C
+ accept4 fallback, 7× over Agoo's 19,024). The CHANGELOG explicitly
+ deferred the default-ON decision to 2.13: "default off until 2.13
+ production soak". 2.13-E is that soak harness — operators run the
+ soak themselves in their own staging.
+
15
+ **What 2.13-E ships.**
16
+
17
+ 1. **`bench/io_uring_soak.sh`** — bash-only soak harness.
18
+ - Boots Hyperion with `HYPERION_IO_URING_ACCEPT=1 -w 1 -t 32`
19
+ against `bench/hello_static.ru` (the 2.12-D fast path) on a
20
+ fixed port. `setsid nohup … & disown` so the master survives
21
+ SSH disconnect over a 24h run.
22
+ - 30s warm-up, then `wrk -t4 -c100 -d24h --latency` in the
23
+ foreground. `SOAK_DURATION` is operator-tunable (defaults to
24
+ `24h`; the bench-window proof-of-concept was `30m`).
25
+ - In parallel, every `SAMPLE_INTERVAL` (default 60s), samples
26
+ `/proc/$PID/status` (VmRSS, VmSize, Threads), `/proc/$PID/fd`
27
+ count, scrapes `hyperion_requests_dispatch_total` from
28
+ `/-/metrics`, and bucket-derives p50/p99 from
29
+ `hyperion_request_duration_seconds_bucket`. Appends one CSV
30
+ row per sample to `/tmp/io_uring_soak_<tag>_<ts>.csv`.
31
+ - On exit (24h elapsed OR Ctrl-C), prints summary:
32
+ min/max/mean/stddev RSS, fd_count peak, p99 stddev/mean, plus
33
+ wrk's HDR-precision p50/p99/p999 from `--latency`.
34
+ - **Verdict**:
35
+ - PASS if RSS variance < 10%, fd peak ≤ `WRK_CONNS + 50`, and
36
+ (when histogram has ≥ 3 distinct bucket values across the
37
+ window) p99 stddev/mean < 20%. Eligible for default-flip.
38
+ - SOAK FAIL on any breach. Defer the flip; the failed metric
39
+ is documented in the verdict notes.
40
+ - The histogram p99 check is bypassed when there are < 3
41
+ distinct bucket values across the soak window — Hyperion's
42
+ 7-edge histogram (1 ms, 5 ms, 25 ms, …) quantizes a stable
43
+ hello-world p99 into bucket-boundary jumps, and the wrk
44
+ `--latency` p99 is the right tail-truth source there.
45
+ - `IO_URING=0` runs the same harness against the 2.12-C accept4
46
+ fallback so the operator can diff io_uring vs accept4 CSVs
47
+ apples-to-apples.
48
+
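The verdict rules above can be sketched in Ruby. This is an illustration of the pass/fail arithmetic only — the shipped harness is bash, and the function and field names here are assumptions, not the script's actual identifiers:

```ruby
# Illustrative sketch of the soak verdict rules: RSS coefficient of
# variation < 10%, fd peak within WRK_CONNS + 50, and (only when the
# histogram produced >= 3 distinct p99 bucket values) p99 stddev/mean
# < 20%. Not the shipped bash implementation.
def soak_verdict(rss_samples, fd_peak, p99_samples, wrk_conns:)
  mean   = ->(xs) { xs.sum.to_f / xs.size }
  stddev = ->(xs) {
    m = mean.call(xs)
    Math.sqrt(xs.sum { |x| (x - m)**2 } / xs.size)
  }

  failures = []
  rss_cv = stddev.call(rss_samples) / mean.call(rss_samples)
  failures << "rss_variance=#{(rss_cv * 100).round(1)}%" if rss_cv >= 0.10
  failures << "fd_peak=#{fd_peak}" if fd_peak > wrk_conns + 50

  # Bypassed when bucket quantization leaves < 3 distinct p99 values;
  # wrk's --latency p99 is the tail-truth source in that case.
  if p99_samples.uniq.size >= 3
    p99_cv = stddev.call(p99_samples) / mean.call(p99_samples)
    failures << "p99_cv=#{(p99_cv * 100).round(1)}%" if p99_cv >= 0.20
  end

  failures.empty? ? "PASS" : "SOAK FAIL (#{failures.join(', ')})"
end
```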
+ 2. **`spec/hyperion/io_uring_soak_smoke_spec.rb`** — durable CI
+    coverage. A 1000-request mini-soak over the io_uring loop with
+    a 200-request warm-up; asserts RSS delta < 20 MB,
+    fd_count back to baseline ± 5, threads back to baseline ± 4.
+    Skipped on macOS / non-liburing builds via the same
+    `Hyperion::Http::PageCache.io_uring_loop_compiled?` predicate the
+    2.12-D wire-shape spec uses. Lives in its own file so a 2.13-E
+    leak-signal regression is diagnosable without re-reading the
+    2.12-D wire-shape spec.
+
+    *Calibration note*: the 2.13-E ticket header proposed a 5 MB
+    delta bound. Bench-host measurements (3-run baseline, IO_URING=1,
+    1000 sequential GETs) put the delta at 7-9 MB — but the
+    dominant allocator is the test process itself
+    (`::TCPSocket.new`, `Timeout` threads, response Strings), not the
+    Hyperion server. The 20 MB threshold catches a real Hyperion
+    leak (1 KB/req of leakage = +1 MB at the assertion site) without
+    false-positiving on the test driver's own arena cost.
+
+ 3. **30-minute proof-of-concept soak** — see the companion
+    `[bench]` commit for the harness-vs-harness numbers across
+    IO_URING=1 / IO_URING=0 and the explicit default-ON decision.
+
+ **Constraints respected.** No regression in spec count. macOS-host
+ suite: 1124/0/14 → 1126/0/16 (+2 examples, +2 pending — the soak
+ smoke is pending on macOS, the documentation example is active). Linux
+ bench-host suite: 1124/0/14 → 1126/0/15 (+2 examples, +1 pending —
+ the soak smoke runs, the documentation example is pending). The
+ 2.10-G TCP_NODELAY hunk, 2.10-E preload hooks, 2.10-F
+ `rb_pc_serve_request`, 2.11-A dispatch pool warmup, 2.11-B cglue HPACK
+ default, 2.12-C accept4 loop, 2.12-D io_uring loop, 2.12-E per-worker
+ counter, 2.12-F gRPC unary trailers, 2.13-A metric shards, 2.13-B
+ response head builder, 2.13-C flake fixes, 2.13-D gRPC streaming —
+ all on master and untouched.
+
+ ### 2.13-D — gRPC streaming RPCs + ghz vs Falcon bench
+
+ **Background.** 2.12-F shipped gRPC unary on h2 — trailers
+ (`grpc-status` / `grpc-message` final HEADERS frame), `te: trailers`
+ handling, and h2 request half-close semantics. The three remaining
+ gRPC call shapes — server-streaming, client-streaming, and
+ bidirectional — were not yet wired. 2.13-D closes that gap.
+
+ **What was missing.** `dispatch_stream` gated on
+ `RequestStream#request_complete` (i.e., END_STREAM on the request),
+ which is correct for unary but blocks both streaming-input shapes:
+ the app cannot read DATA frames until END_STREAM has already arrived.
+ Likewise the response path materialised the full body into a single
+ String before splitting it across DATA frames, which folded
+ multi-message server-streaming responses into one logical write
+ (verified: a 5-message body produced one DATA frame, not five).
+
+ **What 2.13-D ships.**
+
+ 1. **Server-streaming.** When the response body responds to
+    `:trailers`, `dispatch_stream` now iterates `body#each` lazily and
+    emits one DATA frame per yielded chunk (no inter-chunk coalescing),
+    followed by the trailer HEADERS frame carrying END_STREAM=1. A
+    single oversize chunk still gets max-frame-size split inside the
+    per-chunk send path, but small messages stay one DATA frame each.
+    Plain HTTP/2 traffic (no `:trailers` method on the body) keeps the
+    pre-2.13-D buffered shape — no behaviour change for non-gRPC apps.
+
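The branch described in point 1 can be sketched as follows. The `frames` array and `[:data, …]` tuples stand in for the real h2 frame writer and are illustrative only, not Hyperion's API:

```ruby
# Sketch of the 2.13-D body-emit decision: a body that responds to
# :trailers is treated as a gRPC-style streaming body and emitted one
# DATA frame per yielded chunk; anything else keeps the buffered
# pre-2.13-D shape. Frame tuples here are illustrative placeholders.
def write_response_body(body, frames)
  if body.respond_to?(:trailers)
    # One DATA frame per chunk, no inter-chunk coalescing...
    body.each { |chunk| frames << [:data, chunk] }
    # ...then the trailer HEADERS frame carrying END_STREAM=1.
    frames << [:headers, body.trailers, :end_stream]
  else
    # Plain HTTP/2: buffer the whole body into one logical write.
    buffered = +""
    body.each { |chunk| buffered << chunk }
    frames << [:data, buffered, :end_stream]
  end
  frames
end
```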
+ 2. **Streaming-input dispatch.** A new
+    `Hyperion::Http2Handler::StreamingInput` IO-shaped queue replaces
+    the buffered `@request_body` String for requests that look like
+    gRPC: `content-type: application/grpc*` AND `te: trailers` on a
+    POST. When promoted, `process_data` pushes each DATA frame's bytes
+    into the queue (and the END_STREAM frame closes the writer), and
+    the serve-loop dispatches the app on HEADERS arrival via a new
+    `RequestStream#dispatchable?` predicate. The Rack adapter detects
+    the non-String request body and sets `env['rack.input']` directly
+    to the queue (no StringIO wrap, so streaming-read semantics are
+    preserved). Reads block the calling fiber on `Async::Notification`
+    until either bytes arrive or the writer closes.
+
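The queue semantics in point 2 can be sketched in plain Ruby. The shipped class parks the calling *fiber* on `Async::Notification`; this illustration uses `Mutex` + `ConditionVariable` so it runs standalone, and the class name is hypothetical:

```ruby
# Minimal sketch of the StreamingInput semantics: DATA frames are
# written in, END_STREAM closes the writer, and rack.input-style reads
# block until bytes arrive or the writer closes. Thread-based stand-in
# for the fiber-blocking implementation described above.
class StreamingInputSketch
  def initialize
    @chunks, @closed = [], false
    @mutex, @cond = Mutex.new, ConditionVariable.new
  end

  # Called once per DATA frame by the frame-processing side.
  def write(bytes)
    @mutex.synchronize { @chunks << bytes.dup; @cond.signal }
  end

  # Called when the END_STREAM frame closes the writer.
  def close_write
    @mutex.synchronize { @closed = true; @cond.broadcast }
  end

  # Rack input protocol shape: read() drains to EOF ("" at EOF);
  # read(length) returns up to length bytes, nil at EOF.
  def read(length = nil)
    @mutex.synchronize do
      loop do
        if length.nil? && @closed
          return @chunks.slice!(0..).join
        elsif length && !@chunks.empty?
          buf = @chunks.join
          @chunks.replace(buf.bytesize > length ? [buf.byteslice(length..)] : [])
          return buf.byteslice(0, length)
        elsif @closed
          return nil
        end
        @cond.wait(@mutex)   # park until a write or close_write
      end
    end
  end
end
```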
+ 3. **Bidirectional.** Falls out for free once 1 + 2 are in place —
+    each h2 stream already runs on its own fiber, so concurrent
+    read+write on the same stream is supported by the Async scheduler.
+
+ **Tests.** `spec/hyperion/grpc_streaming_spec.rb` (5 examples):
+ server-streaming wire shape (5 yielded chunks → 5 DATA frames + trailer
+ HEADERS, END_STREAM ride placement asserted), client-streaming
+ (5 spaced DATA frames decoded by the app via `rack.input.read`),
+ bidirectional (5 round-trips with strict ordering), and 2 unit specs
+ on `StreamingInput` (blocking reads + EOF handling, partial-chunk
+ slicing). All run end-to-end via `Protocol::HTTP2::Client` over real
+ TLS — same harness shape as the 2.12-F unary specs.
+
+ **Constraints respected.** No regression in the 1176/0/15 baseline:
+ post-2.13-D suite is 1181/0/15 (5 new streaming examples + 0
+ regressions). The pre-existing
+ `http2_empty_body_short_circuit_spec`'s `FakeStream` test double
+ needed a `respond_to?(:streaming_input)` guard at the dispatch
+ read-site — added defensively (no protocol change). The 2.10-G
+ TCP_NODELAY hunk, 2.10-E preload hooks, 2.10-F `rb_pc_serve_request`,
+ 2.11-A dispatch pool warmup, 2.11-B cglue HPACK default, 2.12-C accept4
+ loop, 2.12-D io_uring loop, 2.12-E per-worker counter, 2.12-F gRPC
+ unary trailers, 2.13-A metric shards, 2.13-B response head builder
+ C-rewrite, 2.13-C flake fixes are all on master and untouched by
+ this change.
+
+ **Bench.** See the companion `[bench]` commit for ghz numbers and
+ the documented limits of the Hyperion-vs-Falcon comparison.
+
+ ### 2.13-C — Spec flake hunt
+
+ Two flakes carried over from the 2.11/2.12 release cuts. Both are
+ spec-side hermeticity issues — neither indicates a regression in
+ `lib/` or `ext/`.
+
+ **Flake 1 — `spec/hyperion/tls_ktls_spec.rb`, macOS, seed-dependent.**
+ The two examples in the `Linux-only: kTLS engages with default :auto policy`
+ describe block previously gated their bodies on
+ `unless Hyperion::TLS.ktls_supported?`. That probe consults
+ `Etc.uname[:sysname]` and `OpenSSL::OPENSSL_VERSION_NUMBER`. The same
+ `Etc.uname` is stubbed elsewhere in the suite (`io_uring_spec`)
+ to drive the io_uring platform matrix; under a particular full-
+ suite seed those stubs and this spec's `before { reset_ktls_probe! }`
+ overlapped in a way that let the probe report `true` on the actual
+ Darwin host. The body then ran the kTLS-supported branch and failed.
+
+ *Fix shape:* hard `RUBY_PLATFORM.include?('linux')` guard at the
+ example-body top BEFORE the existing probe-based guard. Runtime
+ platform is unstubbable from another spec and matches the
+ example title's intent. Linux runs unaffected (the second guard
+ still protects Linux + old-OpenSSL hosts).
+
+ *Verification:* 10/10 green on macOS arm64; 1/1 green on the
+ Linux x86_64 bench (openclaw-vm, kernel 6.8); full suite holds
+ the 1175 baseline on macOS / 1118 on bench.
+
+ **Flake 2 — `spec/hyperion/connection_loop_spec.rb:79`, Linux, deterministic.**
+ Misdiagnosed previously as "port-9292-busy". The actual root cause
+ is Linux ≥ 5.x's `close()`-doesn't-wake-other-thread-`accept(2)`
+ behaviour: when the spec calls `listener.close` from one thread,
+ the C accept loop parked in `accept(2)` on that fd in another thread
+ stays blocked until the next connection arrives. The `stop_accept_loop`
+ flag is checked BETWEEN accepts, not while parked, so flipping it
+ without a wake is a no-op. `thread.join(5)` then exhausts its
+ timeout and `result` (assigned inside the thread block) is still
+ `nil`, breaking the `expect(result).to be_a(Integer)` assertion.
+ Other examples in the file had the same teardown shape but happened
+ to assert on side-effects populated BEFORE the join, so they passed
+ despite the same 5 s thread leak — runtime was 46 s for 10 examples.
+
+ *Fix shape:* extract a `stop_loop_and_wake(listener, thread)`
+ helper that flips the stop flag, dials one throwaway TCP connection
+ at the listener so the parked `accept(2)` returns, then closes the
+ listener and joins. Replace the `stop_accept_loop` + `listener.close`
+ + `thread.join(5)` pattern at every callsite (8 in-file plus the
+ Server-level engagement example, which goes through `server.stop`).
+ Add a regression block — "teardown is hermetic across repeated
+ bring-ups" — that runs the bring-up + serve + teardown cycle 3
+ times in one process and asserts each teardown is < 1 s.
+
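The fix shape above can be sketched as a self-contained helper. Argument names mirror the spec helper described in the text, but this is an illustration, not the spec's actual code (the stop flag is modelled as a shared Hash here):

```ruby
# Sketch of stop_loop_and_wake: flipping the flag alone is a no-op
# while the loop is parked in accept(2), so dial one throwaway
# connection to make accept return, THEN close the listener and join.
require "socket"

def stop_loop_and_wake(listener, thread, stop_flag)
  stop_flag[:stop] = true                     # checked BETWEEN accepts
  port = listener.addr[1]
  begin
    TCPSocket.new("127.0.0.1", port).close    # wake the parked accept(2)
  rescue SystemCallError
    # listener already gone — nothing to wake
  end
  listener.close
  thread.join(5)
end
```

A minimal accept loop driven through this helper tears down in well under a second instead of burning the full 5 s join timeout.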
+ *Verification:* 0/10 failures on the bench (was 5/5 deterministic
+ failure pre-fix); spec runtime 46 s → 1.3 s; macOS 11/11 green;
+ full suite 1119 on bench / 1176 on macOS (regression spec adds 1).
+
+ *Out of scope:* the same wake-shape affects `Hyperion::Server#stop`
+ in production — `close()` on the listener fd from the signal-
+ handling thread won't reliably wake the worker's parked accept.
+ Flagging this as a follow-up rather than fixing in 2.13-C scope:
+ briefing was explicit ("Don't touch lib/ext code unless the flake
+ is a real bug there"), and the production failure mode (worker
+ hangs on shutdown) is operationally distinct from the spec flake
+ (test-suite stalls).
+
+ ### 2.13-B — CPU JSON gap
+
+ **Background.** The 2.12-B re-bench surfaced one row that got *worse*
+ relative to Agoo across the 2.10/2.11/2.12 streams: `bench/work.ru`
+ (50-key JSON serialised per-request, no `handle_static` because the
+ response varies per request). 2.10-B had Hyperion 3,450 / Agoo 6,374
+ (1.85× behind); the 2.12-B re-bench had Hyperion 3,659 / Agoo 7,489
+ (2.05× behind). Hyperion +6.0%, Agoo +17.5% over the same window —
+ the gap *widened*. None of the 2.10/2.11/2.12 work touched this row,
+ so it was the obvious 2.13 follow-on.
+
+ **Profile.** `perf record -F 199 -g` on the worker pid while
+ `wrk -t4 -c100 -d15s` ran (CPU-JSON workload, default config
+ `-t 5 -w 1` = bench harness canonical):
+
+ ```
+ 13.15% vm_exec_core (Ruby VM dispatch)
+ 13.01% _raw_spin_unlock_irqrestore (kernel; ~6% inside TCP write softirq)
+  4.56% raw_generate_json_string (JSON.generate — app's own work)
+  2.28% generate_json_general (ditto)
+  1.75% vm_call_cfunc_with_frame
+  1.34% rb_class_of
+  1.24% json_object_i (ditto)
+  1.11% rb_vm_opt_getconstant_path
+  1.01% BSD_vfprintf (sprintf — content-length builder + JSON floats)
+  0.94% generate_json_float (ditto)
+  0.70% hash_foreach_call (header iteration)
+  0.57% llhttp__internal__run (Hyperion C ext request parser)
+ ```
+
+ The honest read: ~10 % of CPU is `JSON.generate` (the app's per-request
+ work — not removable from inside Hyperion); ~13 % is kernel TCP write
+ softirq (one `write(2)` per response — already minimal); ~13 % is the
+ Ruby VM dispatch loop, of which the adapter + writer Ruby path is a
+ fraction. **The dominant tail is GVL serialisation under `-t 5 -w 1`**
+ — a concurrency sweep shows c=1 → 5,800 r/s, c=5 → 3,563 r/s, c=100 →
+ 3,665 r/s. Hyperion's throughput SCALES DOWN with concurrency
+ because every request holds the GVL through `app.call` + `JSON.generate`
+ (Ruby code, no I/O wait). Agoo (pure-C HTTP core) scales UP with
+ concurrency: c=1 → 4,384 r/s, c=5 → 6,182 r/s, c=100 → 6,519 r/s. That structural
+ gap cannot close inside Hyperion — `app.call(env)` IS Ruby. What 2.13-B
+ *can* do is shrink Hyperion's GVL hold time per request so the ratio
+ of "GVL held by Hyperion" to "GVL held by app.call" drops, leaving
+ more room for the worker pool to interleave.
+
+ ### 2.13-B — CPU savings in `cbuild_response_head`
+
+ The C-side response-head builder (called once per Rack response) had
+ five removable per-request costs:
+
+ 1. **Status line `snprintf`.** Every request ran
+    `snprintf("HTTP/1.1 %d ", status)` + `rb_str_cat(reason)` +
+    `rb_str_cat("\r\n", 2)` to build "HTTP/1.1 200 OK\r\n". The 23
+    status codes in `ResponseWriter::REASONS` are a fixed set with
+    fixed reason phrases — the entire status line is a constant per
+    `(status, reason)` pair. The 2.13-B builder switches on `status`
+    and emits the pre-baked line in ONE `rb_str_cat` when the reason
+    phrase matches; falls back to the snprintf path for unknown
+    statuses or operator-overridden reason phrases.
+
+ 2. **Header iteration via `rb_funcall(:keys)`.** The legacy iterator
+    called `rb_funcall(rb_headers, :keys, 0)` to materialise a fresh
+    keys Array per request, then `rb_ary_entry(keys, i)` +
+    `rb_hash_aref(rb_headers, key)` per header. The 2.13-B builder
+    uses `rb_hash_foreach`, which walks the hash table directly with
+    no intermediate Array allocation and no per-key hash lookup.
+
+ 3. **Per-key `String#downcase` allocation.** Header keys are nearly
+    always frozen-literal Strings in Rack apps (`'content-type'`,
+    `'cache-control'`, …) — same `VALUE` every request. The legacy
+    builder ran `rb_funcall(:downcase)` per key per call, allocating
+    a fresh lowercase String + crossing the FFI boundary. 2.13-B
+    keys an `st_table` on the input String's identity and stores
+    the cached lowercase form + the pre-built `"<lc>: "` prefix
+    buffer; the second-and-later requests for the same frozen-literal
+    key get one st-table hit. Capped at 64 entries — a misbehaving
+    app emitting `x-trace-<uuid>` per request can't grow the cache
+    without bound, just falls back to the per-call downcase.
+
+ 4. **Per-(key, value) full-line cache.** When BOTH the key AND the
+    value are frozen-literal Strings (`'cache-control' => 'no-store'`
+    in `bench/work.ru`, `'content-type' => 'application/json'`),
+    the entire wire line `"<lc-key>: <value>\r\n"` is identical
+    every request. 2.13-B caches the prebuilt line keyed on
+    `(key.object_id, value.object_id)`; on hit the entire emit is
+    ONE `rb_str_cat`. Capped at 256 entries with the same fall-back
+    semantics. The CRLF-injection guard always re-runs (the cache
+    stores only validated lines; new (k, v) pairs go through the
+    full validator before the line populates).
+
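The identity-keyed full-line cache in point 4 can be sketched in Ruby. The shipped builder is C (an `st_table` keyed on object ids); this mirrors the cap and fall-back semantics only, and the class name is hypothetical:

```ruby
# Ruby sketch of the per-(key, value) full-line cache: hit = return
# the validated prebuilt line (one concatenation at the callsite);
# miss = validate, build, and cache unless the 256-entry cap is hit.
class HeaderLineCacheSketch
  CAP = 256

  def initialize
    @lines = {}   # [key.object_id, value.object_id] => validated wire line
  end

  def emit(key, value)
    id = [key.object_id, value.object_id]
    cached = @lines[id]
    return cached if cached

    # CRLF-injection guard runs before anything is cached.
    if key.include?("\r") || key.include?("\n") ||
       value.include?("\r") || value.include?("\n")
      raise ArgumentError, "CRLF in header"
    end

    line = "#{key.downcase}: #{value}\r\n"
    @lines[id] = line if @lines.size < CAP   # past the cap: just fall back
    line
  end
end
```

Frozen-literal keys and values keep the same `object_id` across requests, which is what makes the identity key a stable cache key.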
+ 5. **`itoa_positive_decimal` for content-length.**
+    `snprintf("content-length: %ld\r\n", body_size)` was 1 % of CPU
+    on the profile. `body_size` is always non-negative (bytesize of
+    a buffered body) so the sign branch + locale logic in `vfprintf`
+    are pure overhead. 2.13-B writes the digits backwards into a
+    24-byte stack scratch then `rb_str_cat`s the populated suffix —
+    no heap, no locale, no format-string parser.
+
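The digits-backwards idea in point 5 transliterates directly to Ruby (the real routine is C on a stack buffer; this is an illustration of the algorithm, not the extension's code):

```ruby
# Sketch of itoa_positive_decimal: write digits backwards into a
# fixed 24-byte scratch, then take the populated suffix — no sign
# handling, no locale, no format-string parsing.
def itoa_positive_decimal(n)
  return "0" if n.zero?
  scratch = "\0" * 24            # 24 bytes covers any 64-bit magnitude
  pos = 24
  while n > 0
    pos -= 1
    scratch.setbyte(pos, 48 + (n % 10))  # 48 == '0'.ord
    n /= 10
  end
  scratch.byteslice(pos, 24 - pos)
end
```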
+ **Bench impact.** Same bench host (openclaw-vm, Linux 6.8.0, Ruby
+ 3.3.3, loopback), each version compiled fresh from source on the
+ host before its run.
+
+ | Workload | Baseline (master adac63e) | 2.13-B | Δ |
+ |---|---:|---:|---|
+ | Single-thread synthetic (`Adapter::Rack.call → ResponseWriter#write` against a sink, 50,000 iters; 3-trial median r/s) | 18,018 | **19,404** | **+7.7%** |
+ | Multi-thread loopback `wrk -t4 -c100 -d20s --latency`, two batches of 3-trial median r/s | 3,427; 3,550 | 3,440; 3,528 | **−0.1%** |
+ | Multi-thread loopback p99 latency | 2.77ms; 2.64ms | 2.74ms; 2.67ms | tied |
+
+ The +7.7 % single-thread win is the per-request CPU savings inside
+ `cbuild_response_head`. The neutral multi-thread result is the
+ GVL-contention floor: at `-t 5 -w 1` Hyperion's worker threads
+ serialise on the GVL while running `JSON.generate` + `app.call`,
+ so shaving 2-3 µs off Hyperion's slice of the hot path leaves the
+ total throughput dominated by `JSON.generate` (~10 % CPU per the
+ profile) and the kernel TCP write softirq (~6 %). For comparison,
+ on the same bench host, same Ruby, Hyperion with `-w 4` SO_REUSEPORT
+ on `bench/work.ru`: **14,200 r/s** — 2× over Agoo's single-process
+ 7,489 r/s baseline.
+
+ **Honest assessment of the residual gap.** The 2.05× gap to Agoo
+ on the canonical `-t 5 -w 1` row is a GVL-architecture gap, not a
+ per-request CPU gap. Agoo's pure-C HTTP core lets 5 worker threads
+ truly run in parallel; Hyperion's adapter + writer + `app.call`
+ hold the GVL together because every step except the `read(2)` /
+ `write(2)` syscalls is Ruby. Closing this row to ≥ Agoo would
+ require either (a) running `-w 4` SO_REUSEPORT (the 2.12-E
+ cluster work — Hyperion DOES exceed Agoo by 2× there), or (b) a
+ 2.14+ track that moves more of the per-request lifecycle into C
+ (e.g. running `cbuild_response_head` from the C accept loop with
+ the writer fully C-side). 2.13-B closes Hyperion's portion of the
+ GVL hold; the rest is structural.
+
+ ### 2.13-A — Extend C-side wins to generic Rack apps
+
+ **Background.** The 2.12 sprint shipped huge wins on the
+ `Server.handle_static`-routed traffic shape: 5,502 r/s → 134,084 r/s on
+ the static-route `hello` workload (24× over 2.11.0; 7× over Agoo). But
+ those wins are gated on the C accept-loop's `route_table.lookup`
+ returning a `RouteTable::StaticEntry`. Generic Rack apps — the vast
+ majority of real-world deployments (Rails, Sinatra, Roda, Hanami,
+ anything calling `body.each` to yield response chunks) — never engage
+ the C loop; they go through `Hyperion::Adapter::Rack` + the Ruby
+ accept loop + the thread pool. The 2.12-B re-bench confirmed a generic
+ Rack `bench/hello.ru` ran at 4,477 r/s — 4.25× behind Agoo, and 30×
+ behind the C-loop static path on the same machine. Most of the 2.12
+ wins were not available to operators running real apps.
+
+ **What 2.13-A targets.** Optimizations that *do* port to the generic
+ Rack dispatch path without breaking semantics. Per-request we don't
+ get to skip `app.call(env)` (that IS the dispatch) and we can't
+ prebuild the response body (it's dynamic), but we can attack:
+ syscall coalescing on accept+read, env hash + rack.input recycling,
+ metrics-mutex contention under multi-thread workloads, and the
+ keepalive-fast-path tail.
+
+ ### 2.13-A — Per-thread shard for hot-path metrics
+
+ Pre-2.13-A, every `observe_histogram` and `increment_labeled_counter`
+ took `@hg_mutex.synchronize`. The original commit comment claimed
+ those paths were "low-rate", but that's no longer true:
+
+ * `tick_worker_request` is called once per dispatched request
+   (every `Connection#serve` iteration, every h2 stream, every
+   handed-off connection from the C loop).
+ * `observe_histogram` is called once per dispatched request via
+   the per-route request-duration histogram registered in
+   `Connection#register_request_duration_histogram!`.
+
+ Under `-t 32` that single mutex serialised 32 worker threads on the
+ request-completion tail — every `+= 1` waited behind the previous
+ thread's release. That contention was invisible on the C accept loop
+ (the loop bypasses Ruby metrics entirely and folds in its atomic
+ counter at scrape time), but it was the dominant tail-latency term
+ on the generic Rack workload.
+
+ The new path keeps per-thread shards (`Thread#thread_variable_set`,
+ true thread-local — NOT fiber-local, matching the unlabeled counter
+ convention from 2.0.0) for both `@histograms` and `@labeled_counters`.
+ Observations and increments hit the per-thread shard with zero
+ contention; `histogram_snapshot` and `labeled_counter_snapshot` merge
+ across shards under the mutex (a low-rate operation — once per
+ `/-/metrics` scrape).
+
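The shard scheme above can be sketched as follows — hot-path increments hit a thread-local Hash with no lock, and only the scrape-time merge takes the mutex. Class and variable names are illustrative, not Hyperion's (and a real implementation would key the thread variable per registry instance):

```ruby
# Sketch of the per-thread shard scheme: increment() is contention-free
# on the calling thread's own shard; snapshot() merges all shards under
# the mutex, which only runs once per /-/metrics scrape.
class ShardedCounterSketch
  def initialize
    @mutex  = Mutex.new
    @shards = []   # one Hash per worker thread, registered lazily
  end

  def increment(name, labels)
    shard = Thread.current.thread_variable_get(:counter_shard)
    unless shard
      shard = Hash.new(0)
      Thread.current.thread_variable_set(:counter_shard, shard)
      @mutex.synchronize { @shards << shard }   # registration is rare
    end
    shard[[name, labels]] += 1                  # zero-contention hot path
  end

  # Low-rate: merge every thread's shard into one view under the mutex.
  def snapshot
    @mutex.synchronize do
      @shards.each_with_object(Hash.new(0)) do |shard, out|
        shard.each { |key, count| out[key] += count }
      end
    end
  end
end
```

`thread_variable_get` is true thread-local storage (not fiber-local `Thread#[]`), matching the convention the text describes.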
+ **Public API stays identical.** `observe_histogram`, `register_histogram`,
+ `set_gauge`, `histogram_snapshot`, `labeled_counter_snapshot`,
+ `increment_labeled_counter` keep the same signatures and semantics.
+ Registered-but-never-observed families still surface in the snapshot
+ (pre-2.13-A behaviour). `reset!` now also clears the per-thread
+ shards so cross-spec leakage is prevented.
+
+ **Edge cases covered by spec:**
+
+ * Multi-thread observe-then-snapshot: 8 threads × 1000 observations
+   on the same `(name, labels)` produce `count == 8000` and
+   cumulative bucket counts that match.
+ * 16 threads × 500 increments × distinct label values produce 16
+   series with count 500 each.
+ * Concurrent observe + 50 mid-run snapshots run without deadlock
+   or torn counts.
+ * Reset clears across threads.
+ * Unregistered observe is a silent no-op.
+ * Registered-but-never-observed families show up in the scrape
+   with an empty `:series` Hash.
+
+ ### 2.13-A — Cached worker-id label tuple + Rack-3 keepalive fast path
+
+ Two micro-optimizations on the per-request hot path of
+ `Connection#serve`, both targeting the steady-state -c1 single-
+ keepalive profile (where 8000 r/s = the upper bound of single-thread
+ Ruby work and every saved allocation / iteration shows up).
+
+ **Cached worker-id label tuple.** `tick_worker_request(@worker_id)`
+ went through a wrapper that called `worker_id.to_s` (worker_id is
+ already a String) and built a fresh `[label]` Array per request. The
+ wrapper also re-checked `@worker_request_family_registered` on every
+ call. The new path pre-builds the frozen `[@worker_id]` tuple once in
+ the Connection constructor, registers the family once at construction
+ too, and the request loop calls `increment_labeled_counter` directly
+ with the cached tuple — saving one Array allocation + one method
+ dispatch + one early-return-checked branch per request.
+
+ **Rack-3 keepalive fast path.** `should_keep_alive?` used to scan the
+ entire response-headers Hash with `headers.find { |k,_| k.to_s.downcase
+ == 'connection' }`. That ran `to_s.downcase` (one transient String
+ allocation) PER iteration and walked to completion on every response
+ that didn't carry a `Connection` header (which is most of them — the
+ response writer adds its own). Rack 3 mandates lowercase Hash keys
+ (spec §6.4), so the new path is a single `headers['connection']`
+ lookup. Apps that violate Rack 3 by returning mixed-case keys lose
+ the Connection-close response signal and stay on keep-alive — a
+ benign degradation pinned by spec; the fix is to update the app to
+ spec. Non-Hash header containers (legacy Array-of-pairs) still flow
+ through a slow-scan fallback, also case-sensitive on the lowercase
+ key.
+
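The fast path above reduces to a few lines. This is the shape of the change, not Hyperion's actual method (the real predicate lives on `Connection`):

```ruby
# Sketch of the Rack-3 keepalive fast path: Hash headers get one
# lowercase lookup (Rack 3 mandates lowercase keys); legacy
# Array-of-pairs containers fall back to a scan, also case-sensitive
# on the lowercase key. Mixed-case keys fall through to keep-alive.
def should_keep_alive?(headers)
  connection =
    if headers.is_a?(Hash)
      headers["connection"]                           # O(1), zero allocations
    else
      pair = headers.find { |k, _| k == "connection" } # legacy fallback
      pair && pair[1]
    end
  connection.to_s.downcase != "close"
end
```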
+ **Spec coverage:**
+
+ * Lowercase `connection: close` from app closes the connection.
+ * Lowercase `connection: keep-alive` keeps the conn alive across
+   pipelined requests.
+ * Mixed-case `Connection: close` (Rack-3 violation) is documented
+   as falling through to keep-alive — pinned so the behaviour is
+   stable.
+ * `@worker_id_label_tuple` is constructed once, frozen, and reused
+   by identity across requests on the same Connection.
+
+ ### 2.13-A — Bench result
+
+ Bench host: openclaw-vm (Linux 6.8.0, x86_64). Hyperion `-t 32 -w 1`
+ on `bench/hello.ru` (generic Rack hello). 5-trial median, wrk 4.x.
+
+ **With access logging on (default — JSON access lines per request):**
+
+ | Workload | Master | 2.13-A | Δ |
+ |------------|-------:|-------:|---:|
+ | -c1 -t1 | 7,631 | 7,386 | -3.2% (within noise) |
+ | -c100 -t4 | 4,004 | 4,031 | +0.7% (within noise) |
+
+ **With `--no-log-requests` (logging disabled):**
+
+ | Workload | Master | 2.13-A | Δ |
+ |------------|-------:|-------:|---:|
+ | -c1 -t1 | 9,028 | 8,938 | -1.0% (within noise) |
+ | -c100 -t4 | 4,804 | 4,979 | **+3.6%** |
+
+ **Honest framing.** The +3.6% at `-c100 --no-log-requests` is the
+ clearest signal that the metrics-mutex contention removal lands. At
+ `-c1` the workload is single-thread CPU-bound (one wrk thread, one
+ keepalive connection, one Hyperion worker thread serving it); the
+ optimisations don't help and don't hurt. At `-c100` the
+ optimisations reach a reasonable +3.6% on the no-log path; with
+ logging on, the access-log path dominates the per-request CPU
+ budget so the metrics savings are masked.
+
+ **What the bench reveals.** The 4.25× gap to Agoo on the generic
+ Rack workload that 2.12-B identified is not a metrics-contention
+ problem and not an env-pool problem (env pooling was already in
+ place since 1.6.x). It is a **single-thread Ruby work per request**
+ ceiling — at `-c1` the bench tops out at ~9,000 r/s = 110 µs/req
+ of single-thread Ruby work, which Agoo (Rack-shape C server) does
+ in ~52 µs/req. Closing that gap requires moving meaningful chunks
+ of the per-request Ruby surface (parser, env build, headers, log
+ line) into C — work that's already 50% complete in
+ `Hyperion::CParser` (`build_env`, `build_response_head`,
+ `build_access_line`). The 2.13 sprint will continue moving the
+ remaining Ruby-side pieces (`Connection#serve` request loop,
+ ResponseWriter dispatch, `should_keep_alive?`) into C in subsequent
+ phases.
+
+ **Durable infrastructure.** The 2.13-A optimisations stay in the
+ tree even though the headline-bench delta is small. The
+ per-thread metrics shard removes a real mutex bottleneck that
+ *will* compound under future workloads — Ractor-based dispatch,
+ multi-process scrapers, observability-heavy apps that observe
+ custom histograms per-request. The Rack-3 keepalive fast path and
+ the cached worker-id tuple are pure-quality changes (less code,
+ fewer allocations, no API surface change).
+
  ## 2.12.0 — 2026-05-01

  ### 2.12-F — gRPC support on h2
data/README.md CHANGED
@@ -11,6 +11,55 @@ gem install hyperion-rb
  bundle exec hyperion config.ru
  ```

+ ## What's new in 2.13.0
+
+ **Hardening sprint: profile-driven CPU work, durable infrastructure,
+ and gRPC streaming.** 2.13 follows up the 2.12 perf jump with the
+ work that wasn't structural enough to need its own major release —
+ each stream is small, but they add up:
+ - **2.13-A — Generic Rack hot-path wins.** Per-thread shards for
+   hot-path metrics (no more cross-worker mutex on `observe_histogram`),
+   cached `worker_id` label tuple, and a Rack-3 keepalive fast-path.
+   Generic Rack hello bench: **+3.6% on `-c100` no-log**. Honest
+   finding: env-pool + body-coalesce already shipped; the deeper
+   generic-Rack gap to Agoo is a single-thread Ruby ceiling, and
+   closing it needs moving `app.call` into the C accept loop — a
+   2.14 lift.
28
+ - **2.13-B — Response head builder rewritten in C.** Pre-baked
29
+ status-line table, `rb_hash_foreach` replacing `rb_funcall(:keys)`,
30
+ per-key downcase + per-(key, value) full-line caches, custom `itoa`
31
+ replacing `snprintf`. **+7.7% single-thread synthetic; multi-thread
32
+ neutral (GVL-bound).** Profile confirms Hyperion's own C-ext code is
33
+ **<1%** of wall-clock; the rest is libruby + JSON gem.
34
+ - **2.13-C — Spec flake hunt.** Two long-standing flakes fixed:
35
+ `tls_ktls_spec` macOS skip leak (unconditional `RUBY_PLATFORM`
36
+ guard), and `connection_loop_spec:79` Linux port-bind flake (root
37
+ cause: Linux `close()` doesn't wake a parked `accept(2)` — fixed
38
+ with a `stop_loop_and_wake` helper). 5/5 → 0/10 failure rate;
39
+ spec-suite runtime 46 s → 1.3 s.
40
+ - **2.13-D — gRPC streaming RPCs.** Server-streaming, client-streaming,
41
+ and bidirectional RPCs on top of 2.12-F's unary trailers foundation.
42
+ New `bench/grpc_stream.{proto,ru}` + `grpc_stream_bench.sh` ghz
43
+ harness for operator-side comparison vs Falcon's `async-grpc`.
44
+ - **2.13-E — io_uring soak harness + CI smoke.** New
45
+ `bench/io_uring_soak.sh` runs a 24h soak against the 2.12-D
46
+ io_uring loop with `/proc/$PID` + `/-/metrics` sampling, emits a
47
+ CSV + verdict (PASS / SOAK FAIL / borderline). New
48
+ `spec/hyperion/io_uring_soak_smoke_spec.rb` runs a 1000-request
49
+ mini-soak in CI to catch leak regressions before any 24h run.
50
+ **`HYPERION_IO_URING_ACCEPT` stays opt-in for 2.13** — operators
51
+ with their own staging environments can now collect signal; the
52
+ default-flip decision moves to 2.14.
53
+
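The 2.13-B per-(key, value) full-line cache can be illustrated in Ruby, though the shipped builder is C. The `HEADER_LINE_CACHE` constant and `header_block` helper below are hypothetical, shown only to make the caching idea concrete: a repeated header pair resolves to one memoised `"key: value\r\n"` string instead of re-concatenating per response.

```ruby
# Hypothetical Ruby rendering of the 2.13-B per-(key, value) full-line
# cache; the real implementation lives in the C extension. The default
# block builds and memoises the full header line on first miss.
HEADER_LINE_CACHE = Hash.new do |cache, (key, value)|
  cache[[key, value]] = "#{key}: #{value}\r\n".freeze
end

# Joins cached lines into a response head block.
def header_block(headers)
  headers.map { |kv| HEADER_LINE_CACHE[kv] }.join
end
```

After warmup, the hot path does one hash lookup per header pair and zero string allocations for repeated pairs.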
+ Plus all previous wins are preserved and verified by the 1183-spec
+ suite (2.10-G TCP_NODELAY at accept, 2.10-E preload hooks, 2.10-F
+ C-ext fast-path response writer, 2.11-A dispatch pool warmup, 2.11-B
+ cglue HPACK default, 2.12-C accept4 connection loop, 2.12-D io_uring
+ loop, 2.12-E per-worker request counter, 2.12-F gRPC unary trailers).
+
+ Full per-stream details, bench tables, and follow-up items in
+ [`CHANGELOG.md`](CHANGELOG.md).
+
  ## What's new in 2.12.0
 
  **The hot path moves into C — and gRPC ships.** The headline win:
@@ -244,8 +293,77 @@ What Hyperion handles for you: ALPN negotiation, HTTP/2 framing, HPACK,
  per-stream flow control, the trailer-frame emit, binary-clean
  `env['rack.input']` (gRPC bodies are non-UTF-8), and `te: trailers`
  preserved into `env['HTTP_TE']`. What you handle: protobuf
- marshalling and the `grpc-status` semantics. Streaming RPCs (server /
- client / bidi) are 2.13 candidates — pin to unary for now.
+ marshalling and the `grpc-status` semantics.
+
+ ### Streaming RPCs (2.13-D+)
+
+ All four gRPC call shapes work on Hyperion since 2.13-D — unary,
+ server-streaming, client-streaming, and bidirectional. The detection
+ trigger is the gRPC content-type plus `te: trailers`; any HTTP/2
+ request that carries both is dispatched to the Rack app on HEADERS
+ arrival (rather than after END_STREAM), and `env['rack.input']`
+ becomes a streaming IO that blocks reads until the next DATA frame
+ lands. Plain HTTP/2 traffic (without those headers) keeps the unary
+ buffered semantics — no behaviour change for non-gRPC clients.
+
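That detection rule can be written as a small predicate. This is illustrative only; `streaming_grpc?` is not Hyperion API, and the real check runs against the h2 HEADERS block inside the server:

```ruby
# Illustrative predicate for the dispatch rule described above:
# streaming-input dispatch only when the request is gRPC *and*
# advertises `te: trailers`; everything else stays unary-buffered.
def streaming_grpc?(headers)
  content_type = headers['content-type'].to_s
  te_values    = headers['te'].to_s.split(',').map(&:strip)
  content_type.start_with?('application/grpc') && te_values.include?('trailers')
end
```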
+ **Server-streaming.** Yield one gRPC-framed message per `each`
+ iteration; Hyperion writes each yield as its own DATA frame:
+
+ ```ruby
+ class StreamReply
+   def initialize(messages); @messages = messages; end
+   def each; @messages.each { |m| yield m }; end # one DATA frame each
+   def trailers; { 'grpc-status' => '0' }; end
+   def close; end
+ end
+
+ run ->(env) {
+   env['rack.input'].read # the unary request message
+   # messages: your app's enumerable of gRPC-framed reply payloads
+   [200, { 'content-type' => 'application/grpc' }, StreamReply.new(messages)]
+ }
+ ```
+
+ **Client-streaming.** Read messages off `env['rack.input']` as the peer
+ sends them. Reads block until a DATA frame arrives:
+
+ ```ruby
+ run ->(env) {
+   io = env['rack.input']
+   count = 0
+   # each message: 1-byte compressed flag + 4-byte big-endian length + payload
+   while (prefix = io.read(5)) && prefix.bytesize == 5
+     length = prefix.byteslice(1, 4).unpack1('N')
+     msg = io.read(length) # consume the payload
+     count += 1
+   end
+   [200, { 'content-type' => 'application/grpc' }, GrpcBody.new(reply_for(count))]
+ }
+ ```
+
+ **Bidirectional.** Interleave reads and writes. The response body is
+ iterated lazily, so you can read one request message, yield one reply,
+ read the next, yield the next:
+
+ ```ruby
+ class BidiReplies
+   def initialize(io); @io = io; end
+   def each
+     while (prefix = @io.read(5)) && prefix.bytesize == 5
+       len = prefix.byteslice(1, 4).unpack1('N')
+       msg = @io.read(len)
+       yield grpc_frame(handle(msg)) # sent immediately
+     end
+   end
+   def trailers; { 'grpc-status' => '0' }; end
+   def close; end
+ end
+
+ run ->(env) {
+   [200, { 'content-type' => 'application/grpc' }, BidiReplies.new(env['rack.input'])]
+ }
+ ```
+
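The `grpc_frame` helper assumed in the bidi example is not something Hyperion ships; you supply it. A minimal version of gRPC's length-prefixed message framing (the same 5-byte prefix the read loops above parse) could look like:

```ruby
# Sketch of the grpc_frame helper used above (hypothetical name, not
# Hyperion API): gRPC wraps every message in a 1-byte compressed flag
# followed by a 4-byte big-endian payload length.
def grpc_frame(payload, compressed: false)
  [compressed ? 1 : 0, payload.bytesize].pack('CN') + payload.b
end
```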
+ The streaming-input path runs each stream on its own fiber, so
+ concurrent read+write on the same stream is safe.
 
  ## Highlights