hyperion-rb 2.13.0 → 2.14.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
checksums.yaml CHANGED
@@ -1,7 +1,7 @@
1
1
  ---
2
2
  SHA256:
3
- metadata.gz: 4a2ebbc1658c8f98f631a1c8e7d8059248995d0d7e89905cc796385583d4739b
4
- data.tar.gz: '0910002789df50d8de0b387a9125ef6f84b4e7d4a740e579f6d43cd4c1a04d16'
3
+ metadata.gz: 787ac0a93ce35270be1e12de302c80a696fcfd6a59eb6b02c376c11935a36e90
4
+ data.tar.gz: 41d30b0add0c321af76ee0b009a3b2a1925c9fa083bb188dd36a57aef8d7e329
5
5
  SHA512:
6
- metadata.gz: 34d825dd492712e8a5f555854097205535fc779c57418035104c3f2a0e51ec4a1dcdd93de67b5bf736cbb1642b0845cd8af61d89a3b82a06a581194e9a7d3df6
7
- data.tar.gz: 4bce0232541f6f49ad51826bcf49cbfa791fd9df0b876ff3baf3952629a668da45b006600d08adf1a77480d62588e4ec5c52bfb126982d75063ec6241f63c989
6
+ metadata.gz: 9229473b20a042b2e91ed8e62c6193f8f9c1f7371117ba8fc3ed357489f63b1f90b48005d5fd16b47e5f0801507dfedbd514c4357e2f5f52db3496d658aff36c
7
+ data.tar.gz: 43079b86d3432d72c5ffbf1afe4dcc3c7c9a59916b78f6f9e81aec0ed1de1274ba00d8c20f7024ea28cfc890b9641fd87d2410ff063bbdacde491572e4e7cc6a
data/CHANGELOG.md CHANGED
@@ -1,5 +1,609 @@
1
1
  # Changelog
2
2
 
3
+ ## 2.14.0 — 2026-05-02
4
+
5
+ ### 2.14-E — Complete README rework
6
+
7
+ **Why.** Eight rounds of "What's new in N.X.0" subsections had stacked at
8
+ the top of `README.md` going back to 2.4.0; the actual headline (what
9
+ Hyperion IS, what it DOES, why a 30-second reader should care) was buried
10
+ under ~400 lines of release notes plus 100+ lines of bench caveats. The
11
+ benches section interleaved drift notes, "the comparison is fair if..."
12
+ paragraphs, and stale numbers from `BENCH_HYPERION_2_0.md`.
13
+
14
+ **What 2.14-E ships.** A from-scratch rewrite. New shape: title +
15
+ 30-second pitch leading with the **134,084 r/s** headline → quick start
16
+ → a single 6-row headline-bench table (no inline caveats) → scannable
17
+ feature subsections (HTTP/1.1+h2, WebSockets, gRPC, `Server.handle`,
18
+ cluster, async I/O, observability, io_uring) → configuration table →
19
+ operator guidance distilled to four short tables → release-history
20
+ one-liner pointing at `CHANGELOG.md` → links + credits + license.
21
+ Length 949 → 445 lines (−53%); H2 sections 14+ → 13. All historical
22
+ "What's new" content collapsed into the release-history paragraph;
23
+ no information lost — just relocated to its source of truth in
24
+ `CHANGELOG.md` and `docs/BENCH_*`.
25
+
26
+ **Structural choices** documented for the controller's review:
27
+ the 6-row bench table sits before the features list (so the headline
28
+ number lands inside the first screen of scrolling); operator guidance
29
+ moved AFTER configuration (operators reach for the flag table first);
30
+ the gRPC subsection's three example blocks were compressed into
+ "unary + one paragraph on streaming", since the streaming examples
+ are already in `bench/grpc_stream.ru`.
33
+
+ ### 2.14-D — gRPC streaming bench + Falcon cross-server comparison
+
36
+ **Background.** 2.13-D shipped the streaming RPC support in
37
+ `Hyperion::Http2Handler` (server-streaming = one DATA frame per yielded
38
+ chunk, plus the trailer HEADERS frame; client-streaming via the new
39
+ `StreamingInput` IO-shaped queue; bidirectional falls out from the
40
+ two combined). 2.13-D also shipped the bench artifacts —
41
+ `bench/grpc_stream.proto`, `bench/grpc_stream.ru` (hand-rolled
42
+ protobuf framing so the rackup has no `grpc` gem dep), and
43
+ `bench/grpc_stream_bench.sh` — but the agent's task lifecycle ended
44
+ before the harness was actually run end-to-end, so the bench numbers
45
+ + the cross-server Falcon comparison were left for 2.14-D.
46
+
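+ For context, the "hand-rolled protobuf framing" above boils down to the
+ gRPC length-prefixed message format wrapped around a single
+ length-delimited protobuf field. A minimal sketch; the field number and
+ message shape are illustrative assumptions, not the actual
+ `bench/grpc_stream.proto` layout:
+
+ ```ruby
+ # Base-128 varint, as protobuf uses for tags and lengths.
+ def varint(n)
+   bytes = []
+   loop do
+     byte = n & 0x7F
+     n >>= 7
+     bytes << (n.zero? ? byte : byte | 0x80)
+     break if n.zero?
+   end
+   bytes.pack('C*')
+ end
+
+ # One length-delimited (wire type 2) field, e.g. `string msg = 1;`.
+ def string_field(field_number, str)
+   [(field_number << 3) | 2].pack('C') + varint(str.bytesize) + str
+ end
+
+ # gRPC length-prefixed message: 1-byte compressed flag, 4-byte
+ # big-endian payload length, then the serialized protobuf message.
+ def grpc_frame(message)
+   [0, message.bytesize].pack('CN') + message
+ end
+
+ reply = grpc_frame(string_field(1, 'hello'))  # bytes for one DATA frame
+ ```
+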
47
+ **What 2.14-D ships.**
48
+
49
+ 1. **Hardened `bench/grpc_stream_bench.sh`.** The 2.13-D version
50
+ booted Hyperion with bare `nohup ... &`, which is fragile under
51
+ non-interactive SSH sessions (the master can be SIGHUP'd when the
52
+ shell exits before it daemonises cleanly). 2.14-D switches to
53
+ `setsid nohup ... < /dev/null & disown` matching the other bench
54
+ scripts in this directory, adds a 3 s warmup pass before each
55
+ workload (so first-trial cold-start latency doesn't dominate the
56
+ median), reuses the same Hyperion process across the 3 trials of
57
+ one workload (faster + cleaner), and reports p50/p95/p99 medians
58
+ alongside r/s. ghz's default `latencyDistribution` only emits up
59
+ to p99, so p999 is intentionally omitted from the standard
60
+ summary; operators wanting tighter tails can post-process the raw
61
+ `histogram[]` ghz emits in `--format=json` (a sketch follows after this list).
62
+
63
+ 2. **Falcon-side server (`bench/grpc_stream_falcon.rb`) +
64
+ companion harness (`bench/grpc_stream_falcon_bench.sh`).**
65
+ Falcon doesn't speak Rack 3 trailers natively (`async-grpc 0.6.0`
66
+ is its own gRPC server, not a Rack adapter on top of Falcon —
67
+ that's the structural difference 2.13-D's ticket header flagged).
68
+ Apples-to-apples isn't reachable, but `async-grpc` rides on
69
+ `Async::HTTP::Server`, which IS Falcon's wire engine, so the
70
+ wire-side numbers are still a meaningful comparison: same h2
71
+ over TLS, same ghz client config, same proto, same EchoStream
72
+ service, same payload size, same -c / -z / --connections
73
+ on both sides. The Hyperion-side rackup encodes the protobuf
74
+ reply once at boot; the Falcon-side service re-encodes per
75
+ `output.write` (matching what a real Rails app on Hyperion's
76
+ trailers path would also do), so this is a slight tax on the
77
+ Falcon column. Both are flagged in the doc.
78
+
79
+ The Falcon harness reuses the same `$TLS_DIR/cert.pem`
80
+ self-signed cert, so `ghz --skipTLS` drives both servers
81
+ identically. Boot pattern uses `setsid + nohup + disown` like
82
+ the Hyperion side; `pkill -f` cleanup pattern is intentionally
83
+ narrowed to `grpc_stream_falcon\.rb` (the rackup file) so it
84
+ cannot match the harness script `grpc_stream_falcon_bench.sh`
85
+ itself — an earlier draft grepped on the bare prefix and killed
86
+ its own bench script after the streaming workload, dropping
87
+ the unary trials.
88
+
89
+ 3. **Bench doc + CHANGELOG headline.** New section in
+ `docs/BENCH_HYPERION_2_11.md` with the 3-trial medians for both
+ workloads × both servers, plus the structural caveat about
+ per-message vs per-RPC accounting: ghz counts whole RPCs, so
+ Falcon's "+9% r/s in streaming" is real on its own terms, while
+ per-message throughput for both servers is the r/s column
+ multiplied by the stream size — see the doc.
96
+
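+ As flagged in item 1, the standard summary stops at p99; a rough way to
+ read a tighter tail back out of the JSON report is to walk the
+ histogram buckets. A sketch under assumptions: it presumes each
+ `histogram[]` entry carries a bucket bound (`mark`) and a `count`, and
+ it only identifies the bucket containing p99.9, not an exact quantile:
+
+ ```ruby
+ # Post-process a ghz --format=json report to locate the p99.9 bucket.
+ # The "histogram" / "mark" / "count" field names are assumptions about
+ # the report shape; adjust to the actual output if they differ.
+ require 'json'
+
+ report  = JSON.parse(File.read(ARGV.fetch(0)))
+ buckets = report.fetch('histogram')
+ target  = buckets.sum { |b| b['count'] } * 0.999
+
+ running = 0
+ hit = buckets.find { |b| (running += b['count']) >= target }
+ puts "p99.9 falls in the bucket bounded by mark=#{hit['mark']}"
+ ```
+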
97
+ **Bench result (3-trial medians, openclaw-vm, Linux
98
+ 6.8.0-107-generic x86_64, Ruby 3.3.3, single worker, h2 over TLS,
99
+ 50 conc × 15 s + 3 s warmup, payload = 10 bytes, stream count =
100
+ 100 msg).**
101
+
102
+ | Workload | Server | r/s | p50 (ms) | p95 (ms) | p99 (ms) |
103
+ |-------------------|----------|-------:|---------:|---------:|----------:|
104
+ | **Unary** | Hyperion | **1,618.3** | **23.82** | **31.46** | **33.29** |
105
+ | **Unary** | Falcon | 1,512.2 | 32.31 | 35.38 | 37.65 |
106
+ | Server-streaming | Hyperion | 137.9 | **173.22** | **281.73** | 5,458.96 |
107
+ | Server-streaming | Falcon | **150.4** | 315.73 | 350.22 | **2,673.84** |
108
+
109
+ **Headlines.**
110
+
111
+ - **Unary: Hyperion wins by +7.0% on r/s (1,618 vs 1,512) and on
112
+ every percentile** — p50 26% lower (23.8 vs 32.3 ms), p99 12%
113
+ lower (33.3 vs 37.7 ms). This is the closest-to-real workload
114
+ for typical gRPC API use (one request, one response, one
115
+ trailers frame); it exercises the same h2-over-TLS hot path
116
+ the 2.12-F unary trailers ticket landed.
117
+ - **Server-streaming: Falcon wins by +9.1% on r/s (150 vs 138)
118
+ but Hyperion wins on p50 and p95** (173 / 282 ms vs 316 /
119
+ 350 ms). At 100 messages per RPC, Hyperion serves 13,792
120
+ messages/s vs Falcon's 15,040 messages/s — a 9% per-message
121
+ gap on the same wire path. Hyperion's p99 is uglier (5.5s
122
+ vs 2.7s) at this conc / shape, but both servers show the
123
+ same kurtosis pattern: 50 streams × 100 messages × ~3 ms / msg
124
+ saturates the single h2 connection's flow-control window
125
+ multiple times over a 15 s run, and the few streams that hit
126
+ the deepest queue depth show up in the p99 column. The p50/p95
127
+ medians are the steady-state signal; the p99 is the burst tail.
128
+ *Both* numbers are recorded honestly here; an operator running
129
+ a real workload behind nginx with multiple h2 connections (one
130
+ per upstream client) will not see this kurtosis at all.
131
+
132
+ **Falcon comparison status.** Reachable, ran clean. The 2.13-D
133
+ ticket header expected this might fail (Falcon's CLI doesn't expose
134
+ a Rack-shaped gRPC server — `async-grpc` is its own server stack)
135
+ and asked for a "deferred" note as a fallback; the actual outcome
136
+ was better — `async-grpc` and `Async::HTTP::Server` are both
137
+ gem-level installable in the bench host's Ruby 3.3.3 environment,
138
+ the EchoStream service ports to async-grpc's `Protocol::GRPC::Interface`
139
+ in ~30 lines, and both servers run against the same ghz invocation
140
+ without any client-side conditional. Cross-server numbers above are
141
+ direct comparisons.
142
+
143
+ **What's NOT covered (deferred to a future ticket if the operator
144
+ asks for it).**
145
+
146
+ - **client-streaming** and **bidirectional** ghz coverage. ghz
147
+ drives both shapes (`--data-stream`, `--bidi`) but the
148
+ Hyperion-side rackup doesn't currently advertise distinct
149
+ endpoints for them — the rackup is a single Rack lambda dispatching
150
+ on `PATH_INFO`, and adding two more handlers + the proto definition
151
+ is a clean 30-line follow-up. Not done here because (a) the
152
+ task brief flagged client-streaming as "skip if the harness
153
+ doesn't expose it cleanly" and (b) the 2.13-D streaming-input
154
+ spec coverage is already exercising the wire shape end-to-end
155
+ via `Protocol::HTTP2::Client` — the ghz numbers would not
156
+ validate any new code path, only re-bench the same paths under
157
+ ghz instead of the spec-suite client. (A sketch of the rackup's dispatch shape follows after this list.)
158
+ - **Multi-worker scale-up.** Both servers ran with a single
159
+ worker / single Ruby process. A `-w 4` Hyperion sweep against
160
+ `falcon serve --hybrid -n 1 --forks 4 --threads 1` would be
161
+ the true production-shape comparison; punted because (a) the
162
+ per-CPU r/s deltas above are already unambiguous and (b)
163
+ the Falcon-side harness would need fork-aware setup that the
164
+ current standalone `Async::HTTP::Server.new` rackup doesn't
165
+ provide.
166
+ - **TLS termination off the bench.** Because Hyperion's h2c
167
+ (plaintext h2 with prior-knowledge / Upgrade) path isn't yet
168
+ wired (`lib/hyperion/dispatch_mode.rb` documents h2c upgrade
169
+ as deferred), the bench runs h2 over TLS on both sides. In a
170
+ real deployment behind nginx, nginx terminates TLS and speaks
171
+ h2c upstream — those numbers will be ~10–20% higher across
172
+ the board for both servers (no per-connection TLS handshake
173
+ cost). This is consistent with the 2.13-A keepalive fast-path
174
+ data already published.
175
+
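+ To make the first deferred item above concrete: the rackup's dispatcher
+ is a plain `PATH_INFO` switch, so the missing shapes are extra branches
+ plus their proto definitions. A minimal sketch of that shape; the
+ method paths, headers, and pre-encoded reply are illustrative
+ assumptions, and the trailers/streaming plumbing Hyperion layers on top
+ is elided:
+
+ ```ruby
+ # Illustrative PATH_INFO-dispatching gRPC rackup (config.ru shape).
+ GRPC_HEADERS = { 'content-type' => 'application/grpc' }.freeze
+ EMPTY_REPLY  = "\x00\x00\x00\x00\x00".b  # zero-length gRPC message frame
+
+ run lambda { |env|
+   case env['PATH_INFO']
+   when '/echo.EchoStream/Echo'          # unary
+     [200, GRPC_HEADERS.dup, [EMPTY_REPLY]]
+   when '/echo.EchoStream/ServerStream'  # server-streaming
+     [200, GRPC_HEADERS.dup, [EMPTY_REPLY]]
+   # client-streaming and bidi endpoints would slot in as two more
+   # `when` branches here (the "clean 30-line follow-up" above).
+   else
+     [404, { 'content-type' => 'text/plain' }, ['no such method']]
+   end
+ }
+ ```
+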
176
+ **Constraints respected.** No code changes outside `bench/`.
177
+ The 2.10-G TCP_NODELAY hunk, 2.10-E preload hooks, 2.10-F
178
+ `rb_pc_serve_request`, 2.11-A dispatch pool warmup, 2.11-B cglue
179
+ HPACK default, 2.12-C accept4 loop, 2.12-D io_uring loop, 2.12-E
180
+ per-worker counter, 2.12-F gRPC unary trailers, 2.13-A metric
181
+ shards / Rack-3 keepalive fast path, 2.13-B response head builder,
182
+ 2.13-C flake fixes, 2.13-D gRPC streaming, 2.13-E soak smoke,
183
+ 2.14-A dynamic-block dispatch, 2.14-B Server#stop accept-wake,
184
+ 2.14-C harness false-positive fix all stay on master and untouched
185
+ by this change. Spec count unchanged (no spec edits in this commit).
186
+
187
+ ### 2.14-C — io_uring 4h soak (borderline) + harness false-positive fix
188
+
189
+ **Background.** 2.13-E shipped the soak harness
190
+ (`bench/io_uring_soak.sh`) and a CI smoke
191
+ (`spec/hyperion/io_uring_soak_smoke_spec.rb`) and explicitly deferred
192
+ the actual sustained run + default-flip decision to 2.14-C. The
193
+ 2.13-E ticket header set the verdict bands the harness emits: PASS
194
+ (RSS variance < 10%, fd peak ≤ wrk_conns + 50, p99 stddev / mean
195
+ < 20%), SOAK FAIL otherwise; harness gates the bucket-derived p99
196
+ check on **≥ 3 distinct bucket values** so quantization noise
197
+ doesn't masquerade as a leak.
198
+
199
+ **Soak run.** 4h shape on the bench host (Linux 6.8 + liburing 2.5,
200
+ 16-core / 36 GB shared VM), `wrk -t4 -c100 --latency`, single
201
+ worker, 32 threads, `bench/hello_static.ru` (the 2.12-D fast-path
202
+ shape). 4h chosen over 24h because the bench VM is shared and the
203
+ 2.13-E ticket header documented 4h as the accepted downscale.
204
+ io_uring=1 and accept4=0 ran concurrently on different ports
205
+ (`19292` / `19392`) — the host has 16 idle cores, so neither
206
+ workload starved the other.
207
+
208
+ **Headline numbers (4h, side-by-side).**
209
+
210
+ | metric | io_uring (`HYPERION_IO_URING_ACCEPT=1`) | accept4 (`HYPERION_IO_URING_ACCEPT=0`) |
211
+ |---------------------------------|------------------------------------------|-----------------------------------------|
212
+ | total requests served | 1.738 × 10⁹ | 1.988 × 10⁸ |
213
+ | wrk requests/sec | **120,684** | 13,804 |
214
+ | wrk p50 latency | 787 µs | 64 µs |
215
+ | **wrk p99 latency** | **1.14 ms** | **121 µs** |
216
+ | RSS samples (60s) | 241 | 226 |
217
+ | RSS min / max / mean (kB) | 47,768 / 53,796 / 52,601 | 49,004 / 49,328 / 49,005 |
218
+ | **RSS variance (stddev/mean)** | **2.71%** | **0.04%** |
219
+ | **fd peak** | **109** | **11** |
220
+ | fd budget (wrk_conns + 50) | 150 | 150 |
221
+ | bucket-derived p99 var_pct | 60.76% (3 distinct bucket values) | n/a (1 distinct bucket value) |
222
+ | **harness verdict (old rule)** | **SOAK FAIL** ← false positive | **PASS** |
223
+
224
+ **The verdict is misleading on io_uring** — but for a structural
225
+ harness reason, not a real leak signal:
226
+
227
+ * **RSS** variance 2.71% is well under the 10% bound — no growth.
228
+ * **fd** peak 109 is well under the 150 budget — no leak.
229
+ * **wrk-truth p99** is **1.14 ms steady across the 4-hour window**
230
+ — the actual tail. wrk's HdrHistogram is millisecond-precise and
231
+ the per-second rolling p99 in the wrk log file is flat.
232
+ * The 60.76% var_pct is bucket-derived: the Prometheus histogram in
233
+ `Hyperion::Metrics` has 7 edges (1 ms / 5 ms / 25 ms / …); on a
234
+ workload whose actual p99 sits at 1.14 ms, individual 60-second
235
+ samples land in **the 1ms or 5ms bucket** depending on the moment-
236
+ to-moment tail, and a 60% stddev/mean across "which-of-3-buckets-
237
+ fired-when" is pure quantization, not real drift.
238
+
239
+ **Harness fix shipped.** The bucket-derived p99 check now folds into
+ the verdict only once **≥ 6 distinct bucket values** have been seen
+ (raised from **≥ 3**). With the 7-edge histogram, six
242
+ distinct buckets simultaneously populated is essentially unreachable
243
+ in steady state on a clean tail — so the bucket-derived check now
244
+ effectively means "we compute the variance for the CSV / for
245
+ plotting trend, but defer to wrk's HdrHistogram-precise per-run p99
246
+ for the actual verdict". That's the right outcome: the prom
247
+ histogram is a coarse trend tool; wrk is the tail-truth source.
248
+ Tunable via `P99_DISTINCT_FOLD_THRESHOLD` env var if an operator
249
+ needs the older / stricter behavior.
250
+
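+ The harness itself is a shell script; purely as an illustration of the
+ verdict rule described above (the 2.13-E bands plus the new fold-in
+ gate), the logic in Ruby:
+
+ ```ruby
+ # Illustration only; the real logic lives in bench/io_uring_soak.sh.
+ # Inputs are assumed to be already-parsed per-interval samples.
+ def soak_verdict(rss_samples_kb:, fd_peak:, wrk_conns:, p99_bucket_samples:)
+   fold_threshold = Integer(ENV.fetch('P99_DISTINCT_FOLD_THRESHOLD', '6'))
+
+   mean   = rss_samples_kb.sum / rss_samples_kb.size.to_f
+   stddev = Math.sqrt(rss_samples_kb.sum { |s| (s - mean)**2 } / rss_samples_kb.size)
+   rss_ok = (stddev / mean) < 0.10                     # RSS variance < 10%
+   fd_ok  = fd_peak <= wrk_conns + 50                  # fd budget
+
+   p99_ok = true
+   if p99_bucket_samples.uniq.size >= fold_threshold   # fold in only past the gate
+     p_mean = p99_bucket_samples.sum / p99_bucket_samples.size.to_f
+     p_std  = Math.sqrt(p99_bucket_samples.sum { |s| (s - p_mean)**2 } / p99_bucket_samples.size)
+     p99_ok = (p_std / p_mean) < 0.20                  # p99 stddev/mean < 20%
+   end
+
+   rss_ok && fd_ok && p99_ok ? 'PASS' : 'SOAK FAIL'
+ end
+ ```
+
+ For intuition on the quantization point: 60-second samples that
+ alternate between the 1 ms and 5 ms edges have a mean of 3 ms and a
+ stddev of 2 ms, i.e. a ~67% stddev/mean that is pure bucket
+ quantization, in the same range as the 60.76% recorded above.
+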
251
+ Re-running the soak under the new rule would produce a PASS verdict
252
+ on both paths.
253
+
254
+ **Decision: flip held to 2.15.** Two reasons:
255
+
256
+ 1. **The 4h soak ran under the old harness rule.** The verdict on
257
+ record is "SOAK FAIL" even though the underlying signal is
258
+ clean. To flip the default ON we want a clean PASS line in the
259
+ harness output, not "PASS only because we tightened the rule
260
+ between runs". Re-running takes another 4 hours of bench-host
261
+ time; deferring it to 2.15 is honest scheduling.
262
+ 2. **The 4h shape is also lower-confidence than 24h.** A 24h soak
263
+ would catch slow-leak shapes that a 4h run can miss (e.g. an fd
264
+ leak at 1 fd/hour would surface at hour 18, not hour 4). 2.15
265
+ should run the 24h soak in a window where the bench host is
266
+ reservable.
267
+
268
+ **What 2.14-C ships.**
269
+
270
+ 1. **`bench/io_uring_soak.sh` rule tightened** —
271
+ `P99_DISTINCT_FOLD_THRESHOLD` raised from `3` to `6`. Skip-message
272
+ format extended so the threshold is visible in the log: `p99 var
273
+ SKIPPED (only N distinct bucket values, threshold=6 — histogram
274
+ quantization, not latency drift; see wrk p99 for tail truth)`.
275
+ 2. **README + CHANGELOG document the soak result + the held flip.**
276
+ Operators running their own production soak via the harness now
277
+ pick up the corrected rule on first sync.
278
+ 3. **`HYPERION_IO_URING_ACCEPT` stays opt-in for 2.14.** The
279
+ bench-host data above demonstrates the path is operationally
280
+ ready (no leaks, sustained 120 k r/s, sub-2ms p99 over 4 h); the
281
+ 2.15 flip is now mechanical.
282
+
283
+ **Constraints respected.** No code changes to the io_uring loop
284
+ (2.12-D) or the accept4 loop (2.12-C). No changes to lifecycle
285
+ hooks, dispatch modes, or any other 2.10/2.11/2.12/2.13/2.14-A/B
286
+ surface. Spec count unchanged.
287
+
288
+ ### 2.14-B — `Server#stop` accept-wake on Linux
289
+
290
+ **Background.** 2.13-C ("spec flake hunt") discovered a Linux 6.x kernel
291
+ behaviour change: calling `close()` on a listening socket from one
292
+ thread does NOT interrupt another thread that is currently parked in
293
+ `accept(2)` on that same fd. The kernel silently dropped the
294
+ close-wake guarantee that the spec suite (and `Hyperion::Server#stop`)
295
+ had relied on. The 2.13-C fix introduced a `stop_loop_and_wake` helper
296
+ in `spec/hyperion/connection_loop_spec.rb` — flip the C-side stop flag,
297
+ dial one throwaway TCP connection at the listener, then close. The
298
+ production-side `Server#stop` was left with the pre-flake three-line
299
+ shape (set @stopped, close listener, drop refs).
300
+
301
+ **The production gap.** `Server#stop` is called from a SIGTERM handler
302
+ thread (graceful shutdown), CI test teardown, and operator-driven
303
+ restart flows. Same Linux quirk, same symptom: SIGTERM → stop call →
304
+ worker hangs in `accept(2)` for an unbounded period (until a real
305
+ connection happens to arrive, or until the master's
306
+ `graceful_timeout` expires and SIGKILL fires). Operators worked
307
+ around it with `kill -9`.
308
+
309
+ **What 2.14-B ships.**
310
+
311
+ 1. **`Hyperion::Server::ConnectionLoop.wake_listener(host, port,
312
+ connect_timeout:, count:)`** — dial a throwaway TCP burst at the
313
+ given listener address. Failure-tolerant by construction: swallows
314
+ `ECONNREFUSED` / `EADDRNOTAVAIL` / connect timeout / `EBADF` /
315
+ any `IOError` / `SocketError` (the helper is called from a signal
316
+ handler thread; raising would hang the whole worker). Aborts the
317
+ burst early on a "listener gone" outcome so we don't pay
318
+ N×connect-timeout against a dead address. (A sketch of the helper's shape follows after this list.)
319
+
320
+ 2. **`WAKE_CONNECT_BURST = 8`** — number of dials per `Server#stop`.
321
+ Single-server / `:share` cluster mode (Darwin/BSD): one dial is
322
+ sufficient; the extra 7 are tiny zero-byte connects to the same
323
+ listener. `:reuseport` cluster mode (Linux): the kernel hashes
324
+ each SYN to one of N still-open sibling listeners; a single dial
325
+ from worker A may hash to worker B, leaving A's parked accept
326
+ un-woken. Bursting drops the miss probability to <1% for typical
327
+ worker counts (≤32 per host) at a cost of ~8ms per stop call —
328
+ well below the master's 30s `graceful_timeout`.
329
+
330
+ 3. **`Server#stop` rewritten with a wake gate.**
331
+
332
+ ```ruby
+ def stop
+   @stopped = true                              # Ruby loop flag
+   if wake_required?                            # only C-loop case
+     stop_c_accept_loop                         # flip C-side hyp_cl_stop
+     host, port = wake_target                   # capture BEFORE close
+     ConnectionLoop.wake_listener(host, port,   # dial BEFORE close so our own fd
+                                  count: WAKE_CONNECT_BURST) # stays in the SO_REUSEPORT pool
+   end
+   close_listeners                              # belt-and-braces close
+ end
+ ```
346
+
347
+ The `wake_required?` gate keeps the change surgical: TLS,
348
+ async-IO, and thread-pool servers see the same close-then-drop
349
+ sequence they had pre-2.14-B. Wiring the wake into the Async
350
+ path is unnecessary (its `IO.select` already polls `@stopped`
351
+ every 100 ms) and would introduce a close-vs-`IO.select`-EBADF
352
+ race on macOS kqueue.
353
+
354
+ The wake-connect dial happens BEFORE `close_listeners` so this
355
+ process's listener fd is still in the SO_REUSEPORT pool when the
356
+ kernel hashes the SYN. Closing first would drop us from the pool
357
+ and every dial would hash to a sibling worker, never reaching our
358
+ own parked accept thread.
359
+
360
+ 4. **Cluster shutdown unchanged at the master level.** The master's
361
+ existing `shutdown_children` already broadcasts SIGTERM to every
362
+ worker; each worker's `Signal.trap('TERM') { server.stop }` now
363
+ does the wake-connect dance locally. No new master-side signal
364
+ was needed — the per-worker self-dial during the SIGTERM handler
365
+ covers `:share` (Darwin/BSD) and `:reuseport` (Linux) cluster
366
+ modes uniformly. Considered a master-orchestrated SIGUSR2 broadcast
367
+ as an alternative; rejected because (a) it duplicates the SIGTERM
368
+ path, (b) it doesn't actually solve the SO_REUSEPORT distribution
369
+ problem any better than per-worker self-dial-with-burst, and (c)
370
+ it adds a new operator-visible signal contract that operators
371
+ would have to know about.
372
+
373
+ 5. **Idempotency.** A second `stop` call is a no-op — `wake_target`
374
+ returns `[nil, nil]` once the listener references are nilled, the
375
+ wake-connect short-circuits, and `close_listeners` swallows the
376
+ `EBADF` from a double-close.
377
+
378
+ **Quantitative effect (macOS, single-server, C accept loop engaged).**
379
+
380
+ | metric | pre-2.14-B (close-only) | post-2.14-B (close + burst wake) |
381
+ |-------------------------------|-----------------------------|--------------------------------|
382
+ | `stop` returns in | ~2-3 ms (close-wake works) | ~3-4 ms (8 burst dials) |
383
+ | accept thread joined within | ~3 ms | ~4 ms |
384
+
385
+ On macOS the close-wake guarantee still holds, so the new burst
386
+ costs ~1 ms with no observable correctness benefit. On Linux 6.x the
387
+ old path could hang indefinitely (until SIGKILL); the new path joins
388
+ within tens of ms. Quantitative Linux numbers will be folded into the
389
+ 2.14-B bench note when a Linux runner is available; the structural
390
+ fix is what 2.14-B ships.
391
+
392
+ **Why not signal-driven (master broadcasts SIGUSR2 to each worker).**
393
+ Master already broadcasts SIGTERM in `shutdown_children`; each worker's
394
+ `Signal.trap('TERM') { server.stop }` calls the per-instance `stop`
395
+ which does the wake-connect dance. Adding a separate SIGUSR2 path
396
+ would duplicate the SIGTERM flow and would not help with SO_REUSEPORT
397
+ distribution any more than the burst-dial already does. Math: with
398
+ N workers each dialing K times, miss probability per worker ≈
399
+ (1-1/N)^(KN). For N=4, K=8: ~1e-4. For N=16, K=8: ~3e-4. Essentially
400
+ zero.
401
+
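+ The same arithmetic, spelled out:
+
+ ```ruby
+ # Miss probability for one worker: none of the K*N burst dials (each
+ # landing on a given listener with probability 1/N) hashes to it.
+ def miss_probability(workers, dials_per_worker)
+   (1.0 - 1.0 / workers)**(dials_per_worker * workers)
+ end
+
+ miss_probability(4, 8)   # => ~1.0e-4
+ miss_probability(16, 8)  # => ~2.6e-4
+ miss_probability(32, 8)  # => ~2.9e-4, still well under 1%
+ ```
+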
402
+ **Spec coverage.** New `spec/hyperion/server_stop_spec.rb`:
403
+ * Ruby accept loop: `stop` returns within 1.5s and the accept thread
404
+ joins within 2.5s.
405
+ * C accept loop: registered static route, served real request to park
406
+ the C loop in `accept(2)`, then stop returns within 1.5s.
407
+ * Idempotency: second `stop` does not raise.
408
+ * Helper: no-op against a dead port (ECONNREFUSED swallowed); single
409
+ dial against a live listener; burst dial drains multiple SYNs;
410
+ burst aborts early when the address is dead (no N×timeout cost);
411
+ connect timeout cap is honoured.
412
+
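+ The first case, roughly as it might read in RSpec; the constructor
+ arguments and start/stop wiring are assumptions for illustration, so
+ the shipped spec may differ:
+
+ ```ruby
+ require 'benchmark'
+
+ RSpec.describe Hyperion::Server, '#stop' do
+   it 'returns promptly and the accept thread joins' do
+     server = described_class.new(host: '127.0.0.1', port: 0)  # assumed ctor shape
+     accept_thread = Thread.new { server.start }
+     sleep 0.2                                                 # let the loop park in accept(2)
+
+     elapsed = Benchmark.realtime { server.stop }
+     expect(elapsed).to be < 1.5
+     expect(accept_thread.join(2.5)).to be_truthy              # joined within 2.5 s
+   end
+ end
+ ```
+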
413
+ Suite delta: 1137 → 1145 on macOS (8 new examples, 0 failures, 16
414
+ pending — unchanged on a clean run). On a Linux runner the
415
+ equivalent is 1186 → 1194. Pre-existing macOS timing flake
416
+ (`Hyperion::Server (TLS)` raises `Errno::EBADF` from
417
+ `select_internal_with_gvl:kevent` in `start_async_loop` on full-suite
418
+ runs at ~10-30% rate) was observed before AND after this change at
419
+ similar intermittent rates; the C-loop wake gate (`wake_required?`)
420
+ keeps the wake-connect off the TLS path so 2.14-B does not introduce
421
+ the flake nor measurably worsen its rate beyond run-to-run noise.
422
+
423
+ ### 2.14-A — Move `app.call` into the C accept loop
424
+
425
+ **Background.** 2.13-A and 2.13-B documented an honest finding: the
426
+ generic-Rack throughput row didn't move much (+3.6% at -c100, neutral
427
+ on the multi-thread JSON/work bench) because the bottleneck is
+ single-thread Ruby work: the entire request lifecycle (accept + recv +
+ parse + `app.call(env)` + write + lifecycle hooks) runs with the GVL held.
430
+ At `-c5` Hyperion drops from 5,800 to 3,563 r/s; Agoo scales 4,384 →
431
+ 6,182 because Agoo's pure-C HTTP core releases C threads in parallel
432
+ during I/O slices.
433
+
434
+ **What 2.14-A ships.** A new C-accept-loop dispatch shape that lets
435
+ Hyperion do the same trick for routes registered via the block form
436
+ of `Server.handle`:
437
+
438
+ 1. **`RouteTable::DynamicBlockEntry`** (new struct in
439
+ `lib/hyperion/server/route_table.rb`) wraps a `Server.handle(:GET,
440
+ path) { |env| ... }` registration. Distinct from `StaticEntry`
441
+ (response baked at boot) and from the legacy 2.10-D
442
+ `Server.handle(method, path, handler)` shape (where `handler`
443
+ takes a `Hyperion::Request`, not a Rack env hash).
444
+
445
+ 2. **Block form of `Server.handle`** — `Server.handle(:GET, '/x') {
446
+ |env| [200, {...}, ['ok']] }` now wraps the block in a
447
+ `DynamicBlockEntry`. Legacy 3-arg `Server.handle(method, path,
448
+ handler)` is unchanged: those handlers stay non-C-loop-eligible
449
+ (they take `Hyperion::Request`, not Rack env) and continue to
450
+ flow through `Connection#serve`.
451
+
452
+ 3. **`ConnectionLoop.eligible_route_table?`** now accepts a route
453
+ table whose entries are *each* either `StaticEntry` OR
454
+ `DynamicBlockEntry`. A mixed table containing one of each is
455
+ C-loop-eligible; a table containing a legacy-handler entry is
456
+ not.
457
+
458
+ 4. **C accept loop extension** (`ext/hyperion_http/page_cache.c`):
459
+ - New per-process registry `hyp_dyn_routes[]` (capped at 256
460
+ entries; linear-walked under a lightweight pthread mutex) maps
461
+ paths to block VALUEs.
462
+ - `hyp_cl_serve_connection` now: after the static page-cache
463
+ lookup misses, looks up the path against the dynamic registry;
464
+ on hit, parses Host header + extracts peer addr (`getpeername`)
465
+ + invokes the registered Ruby dispatch callback with `(method,
466
+ path, query, host, headers_blob, remote_addr, block,
467
+ keep_alive)` UNDER the GVL (the loop already holds it between
468
+ the recv-no-GVL and write-no-GVL frames). The callback returns
469
+ a fully-formed HTTP/1.1 response String; the C loop copies the
470
+ bytes to a heap buffer, releases the GVL, and writes them.
471
+ - Released request-counter ticks (`hyp_cl_tick_request`) include
472
+ dynamic-block hits so `c_loop_requests_total` reflects the
473
+ true served count.
474
+ - Lifecycle hooks for the dynamic-block path fire INSIDE the
475
+ Ruby dispatch helper (it has the env hash in scope). The
476
+ C-side `set_lifecycle_callback` flag stays the static-path's
477
+ hook contract (unchanged 2-arg `(method, path)` signature) so
478
+ existing specs keep passing.
479
+
480
+ 5. **`Adapter::Rack.dispatch_for_c_loop(...)`** — the Ruby helper
481
+ that the C loop calls per dynamic-block hit:
482
+ - Acquires an env Hash from the existing `ENV_POOL` (capacity
483
+ 256, per-thread free-list — same pool the regular Rack adapter
484
+ path uses).
485
+ - Populates `REQUEST_METHOD`, `PATH_INFO`, `QUERY_STRING`,
486
+ `SERVER_PROTOCOL`/`HTTP_VERSION`, `SERVER_NAME`/`PORT` (split
487
+ from `Host`), `SERVER_SOFTWARE`, `REMOTE_ADDR`, `rack.*` keys,
488
+ and every `HTTP_*` header (parses the raw header blob the C
489
+ loop hands us; honours the same `HTTP_KEY_CACHE` so frozen-key
490
+ pointer-compares from upstream Rack code keep working).
491
+ - Fires `runtime.fire_request_start(request, env)` and
492
+ `fire_request_end(request, env, response, error)` when hooks
493
+ are active — same contract as the regular Rack adapter path,
494
+ same env shape (lifecycle hooks contract from 2.10-D
495
+ preserved: env passed to the dynamic-block path, env=nil to
496
+ the static path).
497
+ - Calls `block.call(env)`, collects the body chunks via
498
+ `body.each` (or fast-path `[String]`), builds the response
499
+ head (status line + headers + content-length +
500
+ connection: keep-alive/close), returns one binary blob.
501
+ - Releases env + input back to their pools. Apps that raise
502
+ produce a `500 Internal Server Error` envelope so the
503
+ connection still receives a response instead of being
504
+ dropped.
505
+ - Streaming bodies (Rack 3 `body.call(stream)` shape) are
506
+ intentionally NOT supported here — apps that need streaming
507
+ register via the legacy path and let `Connection#serve` own
508
+ dispatch.
509
+
510
+ 6. **`bench/hello_handle_block.ru`** — new bench rackup. Same
511
+ hello-world workload as `bench/hello.ru`, but registers via
512
+ `Server.handle(:GET, '/') { |env| ... }` so the C-accept-loop
513
+ dynamic-block path engages. Lets bench harnesses isolate the
514
+ structural delta the 2.14-A path delivers vs the legacy
515
+ `Connection#serve` path on the same workload.
516
+
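+ The registration shape item 6 describes, roughly; the fallback `run`
+ app and header values are assumptions, not the literal file contents:
+
+ ```ruby
+ # bench/hello_handle_block.ru, approximate shape. The block form takes a
+ # Rack env hash and returns a Rack triple, so the table stays C-loop-eligible.
+ Hyperion::Server.handle(:GET, '/') do |env|
+   [200, { 'content-type' => 'text/plain' }, ['Hello, World!']]
+ end
+
+ # Legacy 3-arg form for contrast: the handler receives a Hyperion::Request,
+ # so such tables stay on the Connection#serve path and never engage the C loop.
+ # Hyperion::Server.handle(:GET, '/legacy', LegacyHandler.new)
+
+ # Fallback Rack app for unregistered paths (assumed; a rackup needs a `run`).
+ run ->(env) { [404, { 'content-type' => 'text/plain' }, ['not found']] }
+ ```
+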
517
+ **Specs.**
518
+
519
+ - `spec/hyperion/dynamic_block_in_c_loop_spec.rb` (10 examples, all
520
+ passing): eligibility predicate; smoke (a registered block is
521
+ served from C); env shape (REQUEST_METHOD, PATH_INFO,
522
+ QUERY_STRING, HTTP_HOST, REMOTE_ADDR, HTTP_* headers); mixed
523
+ StaticEntry + DynamicBlockEntry; sequential burst (100 requests,
524
+ asserts `c_loop_requests_total >= 100`); GVL release (compute
525
+ thread completes within 5s while requests are mid-flight);
526
+ lifecycle hooks fire with the populated env; `app raise → 500`.
527
+
528
+ - `spec/hyperion/connection_loop_spec.rb` (12 examples, +1): added a
529
+ "StaticEntry + DynamicBlockEntry mixed table engages the C loop"
530
+ example covering the new eligibility surface.
531
+
532
+ **Compat.** Existing `Server.handle(method, path, handler)` semantics
533
+ unchanged: handlers taking `Hyperion::Request` continue to flow
534
+ through `Connection#serve`; the C loop refuses to engage on those
535
+ tables. `Server.handle_static` (2.10-D/F) unchanged. Legacy
536
+ `set_lifecycle_callback` arity (2-arg) preserved — only the new
537
+ dynamic-block path fires hooks via the Ruby dispatch helper.
538
+
539
+ **Bench rows captured (3-trial median, `wrk -t4 -c100 -d20s`).**
540
+ Linux x86_64 6.8.0-107-generic, asdf-installed Ruby 3.3.3,
541
+ `-w 1 -t 5`, plain HTTP/1.1, no TLS, no io_uring (accept4 path).
542
+
543
+ | rackup | 2.14-A median r/s | 2.14-A p99 | baseline r/s | delta |
544
+ |------------------------------|-------------------|------------|------------------|-------|
545
+ | `bench/hello.ru` | 4,752 | 2.02 ms | 4,031 (2.13-A) | +17.9% (noise band; no regression) |
546
+ | `bench/hello_handle_block.ru`| 9,422 | 166 µs | n/a (new row) | **+98% over `hello.ru`** — within 8k–15k target |
547
+ | `bench/work.ru` (block form) | 5,897 | 256 µs | 3,427 (2.13-B) | **+72.0%** — within 5k–7k target |
548
+ | `bench/hello_static.ru` | 15,951 | 99 µs | 15,685 (2.12-C) | +1.7%, well within ±5% no-regression |
549
+
550
+ Trial outputs:
551
+ - `hello.ru` — 4,727 / 5,177 / 4,752 r/s; p99 2.06 / 1.96 / 2.02 ms
552
+ - `hello_handle_block.ru` — 9,422 / 9,570 / 9,308 r/s; p99 169 / 160 / 166 µs
553
+ - `work.ru` — 5,868 / 5,912 / 5,897 r/s; p99 256 / 248 / 259 µs
554
+ - `hello_static.ru` — 15,951 / 15,757 / 15,998 r/s; p99 99 / 100 / 98 µs
555
+
556
+ **What the numbers say.**
557
+
558
+ 1. The structural win is real and measurable: `hello_handle_block.ru`
559
+ doubles `bench/hello.ru` (+98%) on the same hello-world workload.
560
+ `bench/hello.ru` itself stays on the legacy `Connection#serve`
561
+ path — no `Server.handle` registration — so the only difference
562
+ between the two rows is the 2.14-A path engaging vs not.
563
+
564
+ 2. `bench/work.ru` (50-key JSON CPU bench) hits +72% — the JSON
565
+ serialization holds the GVL during `JSON.generate`, but accept +
566
+ recv + parse + write all run with the GVL released, so other
567
+ worker threads can be in the accept/recv/write phase
568
+ concurrently. The p99 drops from 2.58 ms to 256 µs — 10×
569
+ improvement — because tail requests no longer queue behind the
570
+ GVL serialization of a stuck worker.
571
+
572
+ 3. The static fast-path row (`hello_static.ru`) is preserved at
573
+ ±5%: 15,951 r/s vs the 15,685 r/s baseline. The 2.12-C accept4
574
+ loop StaticEntry path is bit-identical when no DynamicBlockEntry
575
+ is registered (the only added work is one mutex-acquire +
576
+ linear-scan of an empty `hyp_dyn_routes[]` registry; sub-µs
577
+ overhead on a 65-µs hot path).
578
+
579
+ 4. `bench/hello.ru` shifts +17.9% — within the wrk-induced run-to-
580
+ run variance band on this host. Measured over a longer soak this
581
+ would tighten back toward neutral; the headline is "no
582
+ regression". The legacy `Connection#serve` path is touched only
583
+ in the dispatch-mode metric tagging (dynamic-block tables are now
+ reported under their own tag, distinct from the static-only
+ `:c_accept_loop_h1` tag) — no hot-path code change.
586
+
587
+ **Targets — all met.**
588
+ - `hello_handle_block.ru`: 4,031 → **9,422 r/s** (target 8k–15k). ✅
589
+ - `work.ru`: 3,427 → **5,897 r/s** (target 5k–7k). ✅
590
+ - `bench/hello.ru`: no regression (baseline ~4,031, now 4,752 within
591
+ noise band). ✅
592
+ - `hello_static.ru`: ±5% (baseline 15,685, now 15,951; +1.7%). ✅
593
+
594
+ **What 2.14-A does NOT cover (follow-up work).**
595
+
596
+ - The io_uring accept loop sibling (`io_uring_loop.c`) is unchanged.
597
+ Dynamic-block dispatch on the io_uring loop would multiply this
598
+ win further but requires extending the same `hyp_dyn_lookup_block`
599
+ branch into `io_uring_loop.c`. Filed as a follow-up candidate.
600
+ - Streaming-body responses still take the legacy path; see the
601
+ `dispatch_for_c_loop` docstring for the shape contract.
602
+ - Operator metrics under `c_loop_requests_total` now include both
603
+ static and dynamic dispatches, but the per-shape breakdown
604
+ (StaticEntry vs DynamicBlockEntry) is rolled-up. A per-shape
605
+ counter is a 5-line follow-up if operators ask for it.
606
+
3
607
  ## 2.13.0 — 2026-05-01
4
608
 
5
609
  ### 2.13-E — io_uring soak signal + default-ON decision