hyperion-rb 2.13.0 → 2.15.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
data/CHANGELOG.md CHANGED
@@ -1,5 +1,680 @@
1
1
  # Changelog
2
2
 
3
+ ## 2.15.0 — 2026-05-02
4
+
5
+ ### 2.15-A — Fresh bench, README split, CI flake fix
6
+
7
+ **Why.** Three coordinated changes for one consolidated milestone.
8
+ First, the README's headline numbers were a stitched collection
9
+ across sprints 2.10–2.14 — the 134k claim from 2.12-D, the 4-hour
10
+ soak from 2.14-C, the gRPC numbers from 2.14-D — captured on
11
+ different days under different host conditions. Operators wanted
12
+ one coherent snapshot they could trust. Second, the README sat at
13
+ 445 lines after the 2.14-E rework; with feature-deep-dive material
14
+ inline it was still longer than a 30-second reader will tolerate.
15
+ Third, GitHub Actions flaked on the `2.14-E` commit (1700426) with
16
+ `Errno::EBADF: select_internal_with_gvl:epoll_wait` raised from
17
+ inside `Async::Scheduler#close` on Ruby 3.4 + async 2.39 — the
18
+ existing `child.wait rescue StandardError` only protected the
19
+ inner block.
20
+
21
+ **What 2.15-A ships.**
22
+
23
+ 1. **CI flake fix.** `lib/hyperion/server.rb#start_async_loop`
24
+ gains an outer `rescue Errno::EBADF, IOError` around the entire
25
+ `Async do ... end` block. Two regression specs added — one
26
+ deterministic (stubs `run_accept_fiber` to raise EBADF
27
+ synchronously), one integration-shape (10-cycle rapid boot/stop
28
+ on `thread_count: 0 + async_io: true`). 10/10 clean local runs.
29
+
30
+ 2. **Fresh bench.** Single coherent run on the bench host on a
31
+ single day captures all 9 headline rows. New driver script
32
+ `bench/run_all.sh` boots one server per row, runs `wrk` (or
33
+ `ghz` for gRPC), kills it, moves on — designed to be
34
+ re-runnable: any future maintainer can `./bench/run_all.sh` and
35
+ reproduce the published numbers within bench-host drift.
36
+ Numbers preserved in `docs/BENCH_HYPERION_2_14.md` (table +
37
+ reproduction commands) and `docs/BENCH_HYPERION_2_14_results.csv`
38
+ (raw CSV for archaeology).
39
+
40
+ 3. **README split.** `README.md` shrunk 445 → 163 lines. Feature
41
+ deep-dives moved to `docs/HTTP2_AND_TLS.md`,
42
+ `docs/HANDLE_STATIC_AND_HANDLE_BLOCK.md`,
43
+ `docs/CLUSTER_AND_SO_REUSEPORT.md`, `docs/ASYNC_IO.md`,
44
+ `docs/CONFIGURATION.md`, `docs/OPERATOR_GUIDANCE.md`,
45
+ `docs/LOGGING.md`, `docs/GRPC.md` (`docs/WEBSOCKETS.md` and
46
+ `docs/OBSERVABILITY.md` already existed). README structure now:
47
+ title + tagline → 30-second pitch → quick start → headline
48
+ bench table (one tight row per workload, fresh numbers) →
49
+ features (8 bullets, each linking into `docs/<feature>.md`) →
50
+ compatibility → documentation index → reproducing benchmarks →
51
+ release history → contributing → credits + license.
52
+
53
+ **Headline bench numbers (median of 3 trials, captured 2026-05-02).**
54
+
55
+ | # | Workload | r/s | p99 |
56
+ |--:|---|---:|---:|
57
+ | 1 | Hyperion `handle_static` + io_uring | **122,778** (peak 134,573) | 1.11 ms |
58
+ | 2 | Hyperion `handle_static` + accept4 | 16,725 | 90 µs |
59
+ | 3 | Hyperion `Server.handle` block | 8,956 | 190 µs |
60
+ | 4 | Hyperion generic Rack hello | 4,231 | 2.33 ms |
61
+ | 5 | Hyperion CPU JSON block | 5,456 | 327 µs |
62
+ | 6 | Hyperion gRPC unary (h2/TLS) | 1,732 | 29.87 ms |
63
+ | 7 | Reference Agoo hello | 18,326 | 10.54 ms |
64
+ | 8 | Reference Falcon hello | 6,394 | 408.83 ms |
65
+ | 9 | Reference Puma hello | 6,240 | 408.77 ms |
66
+
67
+ The peak trial on row 1 (134,573 r/s) is consistent with the
68
+ 2.14-D 134,084 headline; median 122,778 is the conservative honest
69
+ number, both cited in README.
70
+
71
+ **Spec count.** 1145 → 1147 / 0 / 16 (+2 regression specs from
72
+ the flake fix). All on macOS arm64 + Ruby 3.3.3.
73
+
74
+ ## 2.14.0 — 2026-05-02
75
+
76
+ ### 2.14-E — Complete README rework
77
+
78
+ **Why.** Eight rounds of "What's new in N.X.0" subsections had stacked at
79
+ the top of `README.md` going back to 2.4.0; the actual headline (what
80
+ Hyperion IS, what it DOES, why a 30-second reader should care) was buried
81
+ under ~400 lines of release notes plus 100+ lines of bench caveats. The
82
+ benches section interleaved drift notes, "the comparison is fair if..."
83
+ paragraphs, and stale numbers from `BENCH_HYPERION_2_0.md`.
84
+
85
+ **What 2.14-E ships.** A from-scratch rewrite. New shape: title +
86
+ 30-second pitch leading with the **134,084 r/s** headline → quick start
87
+ → a single 6-row headline-bench table (no inline caveats) → scannable
88
+ feature subsections (HTTP/1.1+h2, WebSockets, gRPC, `Server.handle`,
89
+ cluster, async I/O, observability, io_uring) → configuration table →
90
+ operator guidance distilled to four short tables → release-history
91
+ one-liner pointing at `CHANGELOG.md` → links + credits + license.
92
+ Length 949 → 445 lines (−53%); H2 sections 14+ → 13. All historical
93
+ "What's new" content collapsed into the release-history paragraph;
94
+ no information lost — just relocated to its source-of-truth in
95
+ `CHANGELOG.md` and `docs/BENCH_*`.
96
+
97
+ **Structural choices** documented for the controller's review:
98
+ the 6-row bench table sits before the features list (so the headline
99
+ number lands inside the first screen of scrolling); operator guidance
100
+ moved AFTER configuration (operators reach for the flag table first);
101
+ the gRPC subsection kept its three example blocks compressed into
102
+ "unary + one paragraph on streaming" since the streaming examples are
103
+ already in `bench/grpc_stream.ru`.
104
+
105
+
106
+
107
+ **Background.** 2.13-D shipped the streaming RPC support in
108
+ `Hyperion::Http2Handler` (server-streaming = one DATA frame per yielded
109
+ chunk, plus the trailer HEADERS frame; client-streaming via the new
110
+ `StreamingInput` IO-shaped queue; bidirectional falls out from the
111
+ two combined). 2.13-D also shipped the bench artifacts —
112
+ `bench/grpc_stream.proto`, `bench/grpc_stream.ru` (hand-rolled
113
+ protobuf framing so the rackup has no `grpc` gem dep), and
114
+ `bench/grpc_stream_bench.sh` — but the agent's task lifecycle ended
115
+ before the harness was actually run end-to-end, so the bench numbers
116
+ + the cross-server Falcon comparison were left for 2.14-D.
117
+
118
+ **What 2.14-D ships.**
119
+
120
+ 1. **Hardened `bench/grpc_stream_bench.sh`.** The 2.13-D version
121
+ booted Hyperion with bare `nohup ... &`, which is fragile under
122
+ non-interactive SSH sessions (the master can be SIGHUP'd when the
123
+ shell exits before it daemonises cleanly). 2.14-D switches to
124
+ `setsid nohup ... < /dev/null & disown` matching the other bench
125
+ scripts in this directory, adds a 3 s warmup pass before each
126
+ workload (so first-trial cold-start latency doesn't dominate the
127
+ median), reuses the same Hyperion process across the 3 trials of
128
+ one workload (faster + cleaner), and reports p50/p95/p99 medians
129
+ alongside r/s. ghz's default `latencyDistribution` only emits up
130
+ to p99, so p999 is intentionally omitted from the standard
131
+ summary; operators wanting tighter tails can post-process the raw
132
+ `histogram[]` ghz emits in `--format=json`.
133
+
134
+ 2. **Falcon-side server (`bench/grpc_stream_falcon.rb`) +
135
+ companion harness (`bench/grpc_stream_falcon_bench.sh`).**
136
+ Falcon doesn't speak Rack 3 trailers natively (`async-grpc 0.6.0`
137
+ is its own gRPC server, not a Rack adapter on top of Falcon —
138
+ that's the structural difference 2.13-D's ticket header flagged).
139
+ Apples-to-apples isn't reachable, but `async-grpc` rides on
140
+ `Async::HTTP::Server`, which IS Falcon's wire engine, so the
141
+ wire-side numbers are still a meaningful comparison: same h2
142
+ over TLS, same ghz client config, same proto, same EchoStream
143
+ service, same payload size, same -c / -z / --connections
144
+ on both sides. The Hyperion-side rackup encodes the protobuf
145
+ reply once at boot; the Falcon-side service re-encodes per
146
+ `output.write` (matching what a real Rails app on Hyperion's
147
+ trailers path would also do), so this is a slight tax on the
148
+ Falcon column. Both are flagged in the doc.
149
+
150
+ The Falcon harness reuses the same `$TLS_DIR/cert.pem`
151
+ self-signed cert, so `ghz --skipTLS` drives both servers
152
+ identically. Boot pattern uses `setsid + nohup + disown` like
153
+ the Hyperion side; `pkill -f` cleanup pattern is intentionally
154
+ narrowed to `grpc_stream_falcon\.rb` (the rackup file) so it
155
+ cannot match the harness script `grpc_stream_falcon_bench.sh`
156
+ itself — an earlier draft greped on the bare prefix and killed
157
+ its own bench script after the streaming workload, dropping
158
+ the unary trials.
159
+
160
+ 3. **Bench doc + CHANGELOG headline.** New section in
161
+ `docs/BENCH_HYPERION_2_11.md` with the 3-trial medians for both
162
+ workloads × both servers + the structural caveat about
163
+ per-message vs per-RPC accounting (a "+9% rps in streaming"
164
+ from Falcon is real but the per-message rate gap closes
165
+ when the Hyperion column is multiplied by stream size — see
166
+ the doc).
167
+
168
+ **Bench result (3-trial medians, openclaw-vm, Linux
169
+ 6.8.0-107-generic x86_64, Ruby 3.3.3, single worker, h2 over TLS,
170
+ 50 conc × 15 s + 3 s warmup, payload = 10 bytes, stream count =
171
+ 100 msg).**
172
+
173
+ | Workload | Server | r/s | p50 (ms) | p95 (ms) | p99 (ms) |
174
+ |-------------------|----------|-------:|---------:|---------:|----------:|
175
+ | **Unary** | Hyperion | **1,618.3** | **23.82** | **31.46** | **33.29** |
176
+ | **Unary** | Falcon | 1,512.2 | 32.31 | 35.38 | 37.65 |
177
+ | Server-streaming | Hyperion | 137.9 | **173.22** | **281.73** | 5,458.96 |
178
+ | Server-streaming | Falcon | **150.4** | 315.73 | 350.22 | **2,673.84** |
179
+
180
+ **Headlines.**
181
+
182
+ - **Unary: Hyperion wins by +7.0% on r/s (1,618 vs 1,512) and on
183
+ every percentile** — p50 26% lower (23.8 vs 32.3 ms), p99 12%
184
+ lower (33.3 vs 37.7 ms). This is the closest-to-real workload
185
+ for typical gRPC API use (one request, one response, one
186
+ trailers frame); it exercises the same h2-over-TLS hot path
187
+ the 2.12-F unary trailers ticket landed.
188
+ - **Server-streaming: Falcon wins by +9.1% on r/s (150 vs 138)
189
+ but Hyperion wins on p50 and p95** (173 / 282 ms vs 316 /
190
+ 350 ms). At 100 messages per RPC, Hyperion serves 13,792
191
+ messages/s vs Falcon's 15,040 messages/s — a 9% per-message
192
+ gap on the same wire path. Hyperion's p99 is uglier (5.5s
193
+ vs 2.7s) at this conc / shape, but both servers show the
194
+ same kurtosis pattern: 50 streams × 100 messages × ~3 ms / msg
195
+ saturates the single h2 connection's flow-control window
196
+ multiple times over a 15 s run, and the few streams that hit
197
+ the deepest queue depth show up in the p99 column. The p50/p95
198
+ medians are the steady-state signal; the p99 is the burst tail.
199
+ *Both* numbers are recorded honestly here; an operator running
200
+ a real workload behind nginx with multiple h2 connections (one
201
+ per upstream client) will not see this kurtosis at all.
202
+
203
+ **Falcon comparison status.** Reachable, ran clean. The 2.13-D
204
+ ticket header expected this might fail (Falcon's CLI doesn't expose
205
+ a Rack-shaped gRPC server — `async-grpc` is its own server stack)
206
+ and asked for a "deferred" note as a fallback; the actual outcome
207
+ was better — `async-grpc` and `Async::HTTP::Server` are both
208
+ gem-level installable in the bench host's Ruby 3.3.3 environment,
209
+ the EchoStream service ports to async-grpc's `Protocol::GRPC::Interface`
210
+ in ~30 lines, and both servers run against the same ghz invocation
211
+ without any client-side conditional. Cross-server numbers above are
212
+ direct comparisons.
213
+
214
+ **What's NOT covered (deferred to a future ticket if the operator
215
+ asks for it).**
216
+
217
+ - **client-streaming** and **bidirectional** ghz coverage. ghz
218
+ drives both shapes (`--data-stream`, `--bidi`) but the
219
+ Hyperion-side rackup doesn't currently advertise distinct
220
+ endpoints for them — the rackup is a single Rack lambda dispatching
221
+ on `PATH_INFO`, and adding two more handlers + the proto definition
222
+ is a clean 30-line follow-up. Not done here because (a) the
223
+ task brief flagged client-streaming as "skip if the harness
224
+ doesn't expose it cleanly" and (b) the 2.13-D streaming-input
225
+ spec coverage is already exercising the wire shape end-to-end
226
+ via `Protocol::HTTP2::Client` — the ghz numbers would not
227
+ validate any new code path, only re-bench the same paths under
228
+ ghz instead of the spec-suite client.
229
+ - **Multi-worker scale-up.** Both servers ran with a single
230
+ worker / single Ruby process. A `-w 4` Hyperion sweep against
231
+ `falcon serve --hybrid -n 1 --forks 4 --threads 1` would be
232
+ the true production-shape comparison; punted because (a) the
233
+ per-CPU r/s deltas above are already unambiguous and (b)
234
+ the Falcon-side harness would need fork-aware setup that the
235
+ current standalone `Async::HTTP::Server.new` rackup doesn't
236
+ provide.
237
+ - **TLS termination off the bench.** Because Hyperion's h2c
238
+ (plaintext h2 with prior-knowledge / Upgrade) path isn't yet
239
+ wired (`lib/hyperion/dispatch_mode.rb` documents h2c upgrade
240
+ as deferred), the bench runs h2 over TLS on both sides. In a
241
+ real deployment behind nginx, nginx terminates TLS and speaks
242
+ h2c upstream — those numbers will be ~10–20% higher across
243
+ the board for both servers (no per-connection TLS handshake
244
+ cost). This is consistent with the 2.13-A keepalive fast-path
245
+ data already published.
246
+
247
+ **Constraints respected.** No code changes outside `bench/`.
248
+ The 2.10-G TCP_NODELAY hunk, 2.10-E preload hooks, 2.10-F
249
+ `rb_pc_serve_request`, 2.11-A dispatch pool warmup, 2.11-B cglue
250
+ HPACK default, 2.12-C accept4 loop, 2.12-D io_uring loop, 2.12-E
251
+ per-worker counter, 2.12-F gRPC unary trailers, 2.13-A metric
252
+ shards / Rack-3 keepalive fast path, 2.13-B response head builder,
253
+ 2.13-C flake fixes, 2.13-D gRPC streaming, 2.13-E soak smoke,
254
+ 2.14-A dynamic-block dispatch, 2.14-B Server#stop accept-wake,
255
+ 2.14-C harness false-positive fix all stay on master and untouched
256
+ by this change. Spec count unchanged (no spec edits in this commit).
257
+
258
+ ### 2.14-C — io_uring 4h soak (borderline) + harness false-positive fix
259
+
260
+ **Background.** 2.13-E shipped the soak harness
261
+ (`bench/io_uring_soak.sh`) and a CI smoke
262
+ (`spec/hyperion/io_uring_soak_smoke_spec.rb`) and explicitly deferred
263
+ the actual sustained run + default-flip decision to 2.14-C. The
264
+ 2.13-E ticket header set the verdict bands the harness emits: PASS
265
+ (RSS variance < 10%, fd peak ≤ wrk_conns + 50, p99 stddev / mean
266
+ < 20%), SOAK FAIL otherwise; harness gates the bucket-derived p99
267
+ check on **≥ 3 distinct bucket values** so quantization noise
268
+ doesn't masquerade as a leak.
269
+
270
+ **Soak run.** 4h shape on the bench host (Linux 6.8 + liburing 2.5,
271
+ 16-core / 36 GB shared VM), `wrk -t4 -c100 --latency`, single
272
+ worker, 32 threads, `bench/hello_static.ru` (the 2.12-D fast-path
273
+ shape). 4h chosen over 24h because the bench VM is shared and the
274
+ 2.13-E ticket header documented 4h as the documented downscale.
275
+ io_uring=1 and accept4=0 ran concurrently on different ports
276
+ (`19292` / `19392`) — the host has 16 idle cores, so neither
277
+ workload starved the other.
278
+
279
+ **Headline numbers (4h, side-by-side).**
280
+
281
+ | metric | io_uring (`HYPERION_IO_URING_ACCEPT=1`) | accept4 (`HYPERION_IO_URING_ACCEPT=0`) |
282
+ |---------------------------------|------------------------------------------|-----------------------------------------|
283
+ | total requests served | 1.738 × 10⁹ | 1.988 × 10⁸ |
284
+ | wrk requests/sec | **120,684** | 13,804 |
285
+ | wrk p50 latency | 787 µs | 64 µs |
286
+ | **wrk p99 latency** | **1.14 ms** | **121 µs** |
287
+ | RSS samples (60s) | 241 | 226 |
288
+ | RSS min / max / mean (kB) | 47,768 / 53,796 / 52,601 | 49,004 / 49,328 / 49,005 |
289
+ | **RSS variance (stddev/mean)** | **2.71%** | **0.04%** |
290
+ | **fd peak** | **109** | **11** |
291
+ | fd budget (wrk_conns + 50) | 150 | 150 |
292
+ | bucket-derived p99 var_pct | 60.76% (3 distinct bucket values) | n/a (1 distinct bucket value) |
293
+ | **harness verdict (old rule)** | **SOAK FAIL** ← false positive | **PASS** |
294
+
295
+ **The verdict is misleading on io_uring** — but for a structural
296
+ harness reason, not a real leak signal:
297
+
298
+ * **RSS** variance 2.71% is well under the 10% bound — no growth.
299
+ * **fd** peak 109 is well under the 150 budget — no leak.
300
+ * **wrk-truth p99** is **1.14 ms steady across the 4-hour window**
301
+ — the actual tail. wrk's HdrHistogram is millisecond-precise and
302
+ the per-second rolling p99 in the wrk log file is flat.
303
+ * The 60.76% var_pct is bucket-derived: the Prometheus histogram in
304
+ `Hyperion::Metrics` has 7 edges (1 ms / 5 ms / 25 ms / …); on a
305
+ workload whose actual p99 sits at 1.14 ms, individual 60-second
306
+ samples land in **the 1ms or 5ms bucket** depending on the moment-
307
+ to-moment tail, and a 60% stddev/mean across "which-of-3-buckets-
308
+ fired-when" is pure quantization, not real drift.
309
+
310
+ **Harness fix shipped.** Raise the bucket-derived p99 fold-in
311
+ threshold from **≥ 3 distinct bucket values** to **≥ 6** before the
312
+ gate folds variance into the verdict. With the 7-edge histogram, six
313
+ distinct buckets simultaneously populated is essentially unreachable
314
+ in steady state on a clean tail — so the bucket-derived check now
315
+ effectively means "we compute the variance for the CSV / for
316
+ plotting trend, but defer to wrk's HdrHistogram-precise per-run p99
317
+ for the actual verdict". That's the right outcome: the prom
318
+ histogram is a coarse trend tool; wrk is the tail-truth source.
319
+ Tunable via `P99_DISTINCT_FOLD_THRESHOLD` env var if an operator
320
+ needs the older / stricter behavior.
321
+
322
+ Re-running the soak under the new rule would produce a PASS verdict
323
+ on both paths.
324
+
325
+ **Decision: flip held to 2.15.** Two reasons:
326
+
327
+ 1. **The 4h soak ran under the old harness rule.** The verdict on
328
+ record is "SOAK FAIL" even though the underlying signal is
329
+ clean. To flip the default ON we want a clean PASS line in the
330
+ harness output, not "PASS only because we tightened the rule
331
+ between runs". Re-running takes another 4 hours of bench-host
332
+ time; deferring it to 2.15 is honest scheduling.
333
+ 2. **The 4h shape is also lower-confidence than 24h.** A 24h soak
334
+ would catch slow-leak shapes that a 4h run can miss (e.g. an fd
335
+ leak at 1 fd/hour would surface at hour 18, not hour 4). 2.15
336
+ should run the 24h soak in a window where the bench host is
337
+ reservable.
338
+
339
+ **What 2.14-C ships.**
340
+
341
+ 1. **`bench/io_uring_soak.sh` rule tightened** —
342
+ `P99_DISTINCT_FOLD_THRESHOLD` raised from `3` to `6`. Skip-message
343
+ format extended so the threshold is visible in the log: `p99 var
344
+ SKIPPED (only N distinct bucket values, threshold=6 — histogram
345
+ quantization, not latency drift; see wrk p99 for tail truth)`.
346
+ 2. **README + CHANGELOG document the soak result + the held flip.**
347
+ Operators running their own production soak via the harness now
348
+ pick up the corrected rule on first sync.
349
+ 3. **`HYPERION_IO_URING_ACCEPT` stays opt-in for 2.14.** The
350
+ bench-host data above demonstrates the path is operationally
351
+ ready (no leaks, sustained 120 k r/s, sub-2ms p99 over 4 h); the
352
+ 2.15 flip is now mechanical.
353
+
354
+ **Constraints respected.** No code changes to the io_uring loop
355
+ (2.12-D) or the accept4 loop (2.12-C). No changes to lifecycle
356
+ hooks, dispatch modes, or any other 2.10/2.11/2.12/2.13/2.14-A/B
357
+ surface. Spec count unchanged.
358
+
359
+ ### 2.14-B — `Server#stop` accept-wake on Linux
360
+
361
+ **Background.** 2.13-C ("spec flake hunt") discovered a Linux 6.x kernel
362
+ behaviour change: calling `close()` on a listening socket from one
363
+ thread does NOT interrupt another thread that is currently parked in
364
+ `accept(2)` on that same fd. The kernel silently dropped the
365
+ close-wake guarantee that the spec suite (and `Hyperion::Server#stop`)
366
+ had relied on. The 2.13-C fix introduced a `stop_loop_and_wake` helper
367
+ in `spec/hyperion/connection_loop_spec.rb` — flip the C-side stop flag,
368
+ dial one throwaway TCP connection at the listener, then close. The
369
+ production-side `Server#stop` was left with the pre-flake three-line
370
+ shape (set @stopped, close listener, drop refs).
371
+
372
+ **The production gap.** `Server#stop` is called from a SIGTERM handler
373
+ thread (graceful shutdown), CI test teardown, and operator-driven
374
+ restart flows. Same Linux quirk, same symptom: SIGTERM → stop call →
375
+ worker hangs in `accept(2)` for an unbounded period (until a real
376
+ connection happens to arrive, or until the master's
377
+ `graceful_timeout` expires and SIGKILL fires). Operators worked
378
+ around it with `kill -9`.
379
+
380
+ **What 2.14-B ships.**
381
+
382
+ 1. **`Hyperion::Server::ConnectionLoop.wake_listener(host, port,
383
+ connect_timeout:, count:)`** — dial a throwaway TCP burst at the
384
+ given listener address. Failure-tolerant by construction: swallows
385
+ `ECONNREFUSED` / `EADDRNOTAVAIL` / connect timeout / `EBADF` /
386
+ any `IOError` / `SocketError` (the helper is called from a signal
387
+ handler thread; raising would hang the whole worker). Aborts the
388
+ burst early on a "listener gone" outcome so we don't pay
389
+ N×connect-timeout against a dead address.
390
+
391
+ 2. **`WAKE_CONNECT_BURST = 8`** — number of dials per `Server#stop`.
392
+ Single-server / `:share` cluster mode (Darwin/BSD): one dial is
393
+ sufficient; the extra 7 are tiny zero-byte connects to the same
394
+ listener. `:reuseport` cluster mode (Linux): the kernel hashes
395
+ each SYN to one of N still-open sibling listeners; a single dial
396
+ from worker A may hash to worker B, leaving A's parked accept
397
+ un-woken. Bursting drops the miss probability to <1% for typical
398
+ worker counts (≤32 per host) at a cost of ~8ms per stop call —
399
+ well below the master's 30s `graceful_timeout`.
400
+
401
+ 3. **`Server#stop` rewritten with a wake gate.**
402
+
403
+ ```ruby
404
+ def stop
405
+ @stopped = true # Ruby loop flag
406
+ if wake_required? # only C-loop case
407
+ stop_c_accept_loop # flip C-side hyp_cl_stop
408
+ host, port = wake_target # capture BEFORE close
409
+ ConnectionLoop.wake_listener(host, port, # dial BEFORE close so
410
+ count: WAKE_CONNECT_BURST) # our own fd
411
+ # stays in the
412
+ # SO_REUSEPORT pool
413
+ end
414
+ close_listeners # belt-and-braces close
415
+ end
416
+ ```
417
+
418
+ The `wake_required?` gate keeps the change surgical: TLS,
419
+ async-IO, and thread-pool servers see the same close-then-drop
420
+ sequence they had pre-2.14-B. Wiring the wake into the Async
421
+ path is unnecessary (its `IO.select` already polls `@stopped`
422
+ every 100 ms) and would introduce a close-vs-`IO.select`-EBADF
423
+ race on macOS kqueue.
424
+
425
+ The wake-connect dial happens BEFORE `close_listeners` so this
426
+ process's listener fd is still in the SO_REUSEPORT pool when the
427
+ kernel hashes the SYN. Closing first would drop us from the pool
428
+ and every dial would hash to a sibling worker, never reaching our
429
+ own parked accept thread.
430
+
431
+ 4. **Cluster shutdown unchanged at the master level.** The master's
432
+ existing `shutdown_children` already broadcasts SIGTERM to every
433
+ worker; each worker's `Signal.trap('TERM') { server.stop }` now
434
+ does the wake-connect dance locally. No new master-side signal
435
+ was needed — the per-worker self-dial during the SIGTERM handler
436
+ covers `:share` (Darwin/BSD) and `:reuseport` (Linux) cluster
437
+ modes uniformly. Considered a master-orchestrated SIGUSR2 broadcast
438
+ as an alternative; rejected because (a) it duplicates the SIGTERM
439
+ path, (b) it doesn't actually solve the SO_REUSEPORT distribution
440
+ problem any better than per-worker self-dial-with-burst, and (c)
441
+ it adds a new operator-visible signal contract that operators
442
+ would have to know about.
443
+
444
+ 5. **Idempotency.** A second `stop` call is a no-op — `wake_target`
445
+ returns `[nil, nil]` once the listener references are nilled, the
446
+ wake-connect short-circuits, and `close_listeners` swallows the
447
+ `EBADF` from a double-close.
448
+
449
+ **Quantitative effect (macOS, single-server, C accept loop engaged).**
450
+
451
+ | metric | pre-2.14-B (close-only) | post-2.14-B (close + burst wake)|
452
+ |-------------------------------|-----------------------------|--------------------------------|
453
+ | `stop` returns in | ~2-3 ms (close-wake works) | ~3-4 ms (8 burst dials) |
454
+ | accept thread joined within | ~3 ms | ~4 ms |
455
+
456
+ On macOS the close-wake guarantee still holds, so the new burst
457
+ costs ~1 ms with no observable correctness benefit. On Linux 6.x the
458
+ old path could hang indefinitely (until SIGKILL); the new path joins
459
+ within tens of ms. Quantitative Linux numbers will be folded into the
460
+ 2.14-B bench note when a Linux runner is available; the structural
461
+ fix is what 2.14-B ships.
462
+
463
+ **Why not signal-driven (master broadcasts SIGUSR2 to each worker).**
464
+ Master already broadcasts SIGTERM in `shutdown_children`; each worker's
465
+ `Signal.trap('TERM') { server.stop }` calls the per-instance `stop`
466
+ which does the wake-connect dance. Adding a separate SIGUSR2 path
467
+ would duplicate the SIGTERM flow and would not help with SO_REUSEPORT
468
+ distribution any more than the burst-dial already does. Math: with
469
+ N workers each dialing K times, miss probability per worker ≈
470
+ (1-1/N)^(KN). For N=4, K=8: ~1e-4. For N=16, K=8: ~3e-4. Essentially
471
+ zero.
472
+
473
+ **Spec coverage.** New `spec/hyperion/server_stop_spec.rb`:
474
+ * Ruby accept loop: `stop` returns within 1.5s and the accept thread
475
+ joins within 2.5s.
476
+ * C accept loop: registered static route, served real request to park
477
+ the C loop in `accept(2)`, then stop returns within 1.5s.
478
+ * Idempotency: second `stop` does not raise.
479
+ * Helper: no-op against a dead port (ECONNREFUSED swallowed); single
480
+ dial against a live listener; burst dial drains multiple SYNs;
481
+ burst aborts early when the address is dead (no N×timeout cost);
482
+ connect timeout cap is honoured.
483
+
484
+ Suite delta: 1137 → 1145 on macOS (8 new examples, 0 failures, 16
485
+ pending — unchanged on a clean run). On a Linux runner the
486
+ equivalent is 1186 → 1194. Pre-existing macOS timing flake
487
+ (`Hyperion::Server (TLS)` raises `Errno::EBADF` from
488
+ `select_internal_with_gvl:kevent` in `start_async_loop` on full-suite
489
+ runs at ~10-30% rate) was observed before AND after this change at
490
+ similar intermittent rates; the C-loop wake gate (`wake_required?`)
491
+ keeps the wake-connect off the TLS path so 2.14-B does not introduce
492
+ the flake nor measurably worsen its rate beyond run-to-run noise.
493
+
494
+ ### 2.14-A — Move `app.call` into the C accept loop
495
+
496
+ **Background.** 2.13-A and 2.13-B documented an honest finding: the
497
+ generic-Rack throughput row didn't move much (+3.6% at -c100, neutral
498
+ on multi-thread JSON/work bench) because the bottleneck is single-
499
+ thread Ruby work — `app.call(env)` holds the GVL for the entire
500
+ request lifecycle (accept + recv + parse + write + lifecycle hooks).
501
+ At `-c5` Hyperion drops from 5,800 to 3,563 r/s; Agoo scales 4,384 →
502
+ 6,182 because Agoo's pure-C HTTP core releases C threads in parallel
503
+ during I/O slices.
504
+
505
+ **What 2.14-A ships.** A new C-accept-loop dispatch shape that lets
506
+ Hyperion do the same trick for routes registered via the block form
507
+ of `Server.handle`:
508
+
509
+ 1. **`RouteTable::DynamicBlockEntry`** (new struct in
510
+ `lib/hyperion/server/route_table.rb`) wraps a `Server.handle(:GET,
511
+ path) { |env| ... }` registration. Distinct from `StaticEntry`
512
+ (response baked at boot) and from the legacy 2.10-D
513
+ `Server.handle(method, path, handler)` shape (where `handler`
514
+ takes a `Hyperion::Request`, not a Rack env hash).
515
+
516
+ 2. **Block form of `Server.handle`** — `Server.handle(:GET, '/x') {
517
+ |env| [200, {...}, ['ok']] }` now wraps the block in a
518
+ `DynamicBlockEntry`. Legacy 3-arg `Server.handle(method, path,
519
+ handler)` is unchanged: those handlers stay non-C-loop-eligible
520
+ (they take `Hyperion::Request`, not Rack env) and continue to
521
+ flow through `Connection#serve`.
522
+
523
+ 3. **`ConnectionLoop.eligible_route_table?`** now accepts a route
524
+ table whose entries are *each* either `StaticEntry` OR
525
+ `DynamicBlockEntry`. A mixed table containing one of each is
526
+ C-loop-eligible; a table containing a legacy-handler entry is
527
+ not.
528
+
529
+ 4. **C accept loop extension** (`ext/hyperion_http/page_cache.c`):
530
+ - New per-process registry `hyp_dyn_routes[]` (capped at 256
531
+ entries; linear-walked under a lightweight pthread mutex) maps
532
+ paths to block VALUEs.
533
+ - `hyp_cl_serve_connection` now: after the static page-cache
534
+ lookup misses, looks up the path against the dynamic registry;
535
+ on hit, parses Host header + extracts peer addr (`getpeername`)
536
+ + invokes the registered Ruby dispatch callback with `(method,
537
+ path, query, host, headers_blob, remote_addr, block,
538
+ keep_alive)` UNDER the GVL (the loop already holds it between
539
+ the recv-no-GVL and write-no-GVL frames). The callback returns
540
+ a fully-formed HTTP/1.1 response String; the C loop copies the
541
+ bytes to a heap buffer, releases the GVL, and writes them.
542
+ - Released request-counter ticks (`hyp_cl_tick_request`) include
543
+ dynamic-block hits so `c_loop_requests_total` reflects the
544
+ true served count.
545
+ - Lifecycle hooks for the dynamic-block path fire INSIDE the
546
+ Ruby dispatch helper (it has the env hash in scope). The
547
+ C-side `set_lifecycle_callback` flag stays the static-path's
548
+ hook contract (unchanged 2-arg `(method, path)` signature) so
549
+ existing specs keep passing.
550
+
551
+ 5. **`Adapter::Rack.dispatch_for_c_loop(...)`** — the Ruby helper
552
+ that the C loop calls per dynamic-block hit:
553
+ - Acquires an env Hash from the existing `ENV_POOL` (capacity
554
+ 256, per-thread free-list — same pool the regular Rack adapter
555
+ path uses).
556
+ - Populates `REQUEST_METHOD`, `PATH_INFO`, `QUERY_STRING`,
557
+ `SERVER_PROTOCOL`/`HTTP_VERSION`, `SERVER_NAME`/`PORT` (split
558
+ from `Host`), `SERVER_SOFTWARE`, `REMOTE_ADDR`, `rack.*` keys,
559
+ and every `HTTP_*` header (parses the raw header blob the C
560
+ loop hands us; honours the same `HTTP_KEY_CACHE` so frozen-key
561
+ pointer-compares from upstream Rack code keep working).
562
+ - Fires `runtime.fire_request_start(request, env)` and
563
+ `fire_request_end(request, env, response, error)` when hooks
564
+ are active — same contract as the regular Rack adapter path,
565
+ same env shape (lifecycle hooks contract from 2.10-D
566
+ preserved: env passed to the dynamic-block path, env=nil to
567
+ the static path).
568
+ - Calls `block.call(env)`, collects the body chunks via
569
+ `body.each` (or fast-path `[String]`), builds the response
570
+ head (status line + headers + content-length +
571
+ connection: keep-alive/close), returns one binary blob.
572
+ - Releases env + input back to their pools. Apps that raise
573
+ produce a `500 Internal Server Error` envelope so the
574
+ connection still receives a response instead of being
575
+ dropped.
576
+ - Streaming bodies (Rack 3 `body.call(stream)` shape) are
577
+ intentionally NOT supported here — apps that need streaming
578
+ register via the legacy path and let `Connection#serve` own
579
+ dispatch.
580
+
581
+ 6. **`bench/hello_handle_block.ru`** — new bench rackup. Same
582
+ hello-world workload as `bench/hello.ru`, but registers via
583
+ `Server.handle(:GET, '/') { |env| ... }` so the C-accept-loop
584
+ dynamic-block path engages. Lets bench harnesses isolate the
585
+ structural delta the 2.14-A path delivers vs the legacy
586
+ `Connection#serve` path on the same workload.
587
+
588
+ **Specs.**
589
+
590
+ - `spec/hyperion/dynamic_block_in_c_loop_spec.rb` (10 examples, all
591
+ passing): eligibility predicate; smoke (a registered block is
592
+ served from C); env shape (REQUEST_METHOD, PATH_INFO,
593
+ QUERY_STRING, HTTP_HOST, REMOTE_ADDR, HTTP_* headers); mixed
594
+ StaticEntry + DynamicBlockEntry; sequential burst (100 requests,
595
+ asserts `c_loop_requests_total >= 100`); GVL release (compute
596
+ thread completes within 5s while requests are mid-flight);
597
+ lifecycle hooks fire with the populated env; `app raise → 500`.
598
+
599
+ - `spec/hyperion/connection_loop_spec.rb` (12 examples, +1): added a
600
+ "StaticEntry + DynamicBlockEntry mixed table engages the C loop"
601
+ example covering the new eligibility surface.
602
+
603
+ **Compat.** Existing `Server.handle(method, path, handler)` semantics
604
+ unchanged: handlers taking `Hyperion::Request` continue to flow
605
+ through `Connection#serve`; the C loop refuses to engage on those
606
+ tables. `Server.handle_static` (2.10-D/F) unchanged. Legacy
607
+ `set_lifecycle_callback` arity (2-arg) preserved — only the new
608
+ dynamic-block path fires hooks via the Ruby dispatch helper.
609
+
610
+ **Bench rows captured (3-trial median, `wrk -t4 -c100 -d20s`).**
611
+ Linux x86_64 6.8.0-107-generic, asdf-installed Ruby 3.3.3,
612
+ `-w 1 -t 5`, plain HTTP/1.1, no TLS, no io_uring (accept4 path).
613
+
614
+ | rackup | 2.14-A median r/s | 2.14-A p99 | baseline r/s | delta |
615
+ |------------------------------|-------------------|------------|------------------|-------|
616
+ | `bench/hello.ru` | 4,752 | 2.02 ms | 4,031 (2.13-A) | +17.9% (noise band; no regression) |
617
+ | `bench/hello_handle_block.ru`| 9,422 | 166 µs | n/a (new row) | **+98% over `hello.ru`** — within 8k–15k target |
618
+ | `bench/work.ru` (block form) | 5,897 | 256 µs | 3,427 (2.13-B) | **+72.0%** — within 5k–7k target |
619
+ | `bench/hello_static.ru` | 15,951 | 99 µs | 15,685 (2.12-C) | +1.7%, well within ±5% no-regression |
620
+
621
+ Trial outputs:
622
+ - `hello.ru` — 4,727 / 5,177 / 4,752 r/s; p99 2.06 / 1.96 / 2.02 ms
623
+ - `hello_handle_block.ru` — 9,422 / 9,570 / 9,308 r/s; p99 169 / 160 / 166 µs
624
+ - `work.ru` — 5,868 / 5,912 / 5,897 r/s; p99 256 / 248 / 259 µs
625
+ - `hello_static.ru` — 15,951 / 15,757 / 15,998 r/s; p99 99 / 100 / 98 µs
626
+
627
+ **What the numbers say.**
628
+
629
+ 1. The structural win is real and measurable: `hello_handle_block.ru`
630
+ doubles `bench/hello.ru` (+98%) on the same hello-world workload.
631
+ `bench/hello.ru` itself stays on the legacy `Connection#serve`
632
+ path — no `Server.handle` registration — so the only difference
633
+ between the two rows is the 2.14-A path engaging vs not.
634
+
635
+ 2. `bench/work.ru` (50-key JSON CPU bench) hits +72% — the JSON
636
+ serialization holds the GVL during `JSON.generate`, but accept +
637
+ recv + parse + write all run with the GVL released, so other
638
+ worker threads can be in the accept/recv/write phase
639
+ concurrently. The p99 drops from 2.58 ms to 256 µs — 10×
640
+ improvement — because tail requests no longer queue behind the
641
+ GVL serialization of a stuck worker.
642
+
643
+ 3. The static fast-path row (`hello_static.ru`) is preserved at
644
+ ±5%: 15,951 r/s vs the 15,685 r/s baseline. The 2.12-C accept4
645
+ loop StaticEntry path is bit-identical when no DynamicBlockEntry
646
+ is registered (the only added work is one mutex-acquire +
647
+ linear-scan of an empty `hyp_dyn_routes[]` registry; sub-µs
648
+ overhead on a 65-µs hot path).
649
+
650
+ 4. `bench/hello.ru` shifts +17.9% — within the wrk-induced run-to-
651
+ run variance band on this host. Measured over a longer soak this
652
+ would tighten back toward neutral; the headline is "no
653
+ regression". The legacy `Connection#serve` path is touched only
654
+ in the dispatch-mode metric tagging (we now also report
655
+ `:c_accept_loop_h1` for dynamic-block tables, distinct from the
656
+ `:c_accept_loop_h1` static-only tag) — no hot-path code change.
657
+
658
+ **Targets — all met.**
659
+ - `hello_handle_block.ru`: 4,031 → **9,422 r/s** (target 8k–15k). ✅
660
+ - `work.ru`: 3,427 → **5,897 r/s** (target 5k–7k). ✅
661
+ - `bench/hello.ru`: no regression (baseline ~4,031, now 4,752 within
662
+ noise band). ✅
663
+ - `hello_static.ru`: ±5% (baseline 15,685, now 15,951; +1.7%). ✅
664
+
665
+ **What 2.14-A does NOT cover (follow-up work).**
666
+
667
+ - The io_uring accept loop sibling (`io_uring_loop.c`) is unchanged.
668
+ Dynamic-block dispatch on the io_uring loop would multiply this
669
+ win further but requires extending the same `hyp_dyn_lookup_block`
670
+ branch into `io_uring_loop.c`. Filed as 2.14-B candidate.
671
+ - Streaming-body responses still take the legacy path; see the
672
+ `dispatch_for_c_loop` docstring for the shape contract.
673
+ - Operator metrics under `c_loop_requests_total` now include both
674
+ static and dynamic dispatches, but the per-shape breakdown
675
+ (StaticEntry vs DynamicBlockEntry) is rolled-up. A per-shape
676
+ counter is a 5-line follow-up if operators ask for it.
677
+
3
678
  ## 2.13.0 — 2026-05-01
4
679
 
5
680
  ### 2.13-E — io_uring soak signal + default-ON decision