hyperion-rb 1.6.2 → 2.10.1

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (57)
  1. checksums.yaml +4 -4
  2. data/CHANGELOG.md +4563 -0
  3. data/README.md +189 -13
  4. data/ext/hyperion_h2_codec/Cargo.lock +7 -0
  5. data/ext/hyperion_h2_codec/Cargo.toml +33 -0
  6. data/ext/hyperion_h2_codec/extconf.rb +73 -0
  7. data/ext/hyperion_h2_codec/src/frames.rs +140 -0
  8. data/ext/hyperion_h2_codec/src/hpack/huffman.rs +161 -0
  9. data/ext/hyperion_h2_codec/src/hpack.rs +457 -0
  10. data/ext/hyperion_h2_codec/src/lib.rs +296 -0
  11. data/ext/hyperion_http/extconf.rb +28 -0
  12. data/ext/hyperion_http/h2_codec_glue.c +408 -0
  13. data/ext/hyperion_http/page_cache.c +1125 -0
  14. data/ext/hyperion_http/parser.c +473 -38
  15. data/ext/hyperion_http/sendfile.c +982 -0
  16. data/ext/hyperion_http/websocket.c +493 -0
  17. data/ext/hyperion_io_uring/Cargo.lock +33 -0
  18. data/ext/hyperion_io_uring/Cargo.toml +34 -0
  19. data/ext/hyperion_io_uring/extconf.rb +74 -0
  20. data/ext/hyperion_io_uring/src/lib.rs +316 -0
  21. data/lib/hyperion/adapter/rack.rb +370 -42
  22. data/lib/hyperion/admin_listener.rb +207 -0
  23. data/lib/hyperion/admin_middleware.rb +36 -7
  24. data/lib/hyperion/cli.rb +310 -11
  25. data/lib/hyperion/config.rb +440 -14
  26. data/lib/hyperion/connection.rb +679 -22
  27. data/lib/hyperion/deprecations.rb +81 -0
  28. data/lib/hyperion/dispatch_mode.rb +165 -0
  29. data/lib/hyperion/fiber_local.rb +75 -13
  30. data/lib/hyperion/h2_admission.rb +77 -0
  31. data/lib/hyperion/h2_codec.rb +452 -0
  32. data/lib/hyperion/http/page_cache.rb +122 -0
  33. data/lib/hyperion/http/sendfile.rb +696 -0
  34. data/lib/hyperion/http2/native_hpack_adapter.rb +70 -0
  35. data/lib/hyperion/http2_handler.rb +368 -9
  36. data/lib/hyperion/io_uring.rb +317 -0
  37. data/lib/hyperion/lint_wrapper_pool.rb +126 -0
  38. data/lib/hyperion/master.rb +96 -9
  39. data/lib/hyperion/metrics/path_templater.rb +68 -0
  40. data/lib/hyperion/metrics.rb +256 -0
  41. data/lib/hyperion/prometheus_exporter.rb +150 -0
  42. data/lib/hyperion/request.rb +13 -0
  43. data/lib/hyperion/response_writer.rb +477 -16
  44. data/lib/hyperion/runtime.rb +195 -0
  45. data/lib/hyperion/server/route_table.rb +179 -0
  46. data/lib/hyperion/server.rb +519 -55
  47. data/lib/hyperion/static_preload.rb +133 -0
  48. data/lib/hyperion/thread_pool.rb +61 -7
  49. data/lib/hyperion/tls.rb +343 -1
  50. data/lib/hyperion/version.rb +1 -1
  51. data/lib/hyperion/websocket/close_codes.rb +71 -0
  52. data/lib/hyperion/websocket/connection.rb +876 -0
  53. data/lib/hyperion/websocket/frame.rb +356 -0
  54. data/lib/hyperion/websocket/handshake.rb +525 -0
  55. data/lib/hyperion/worker.rb +111 -9
  56. data/lib/hyperion.rb +137 -3
  57. metadata +50 -1
data/CHANGELOG.md CHANGED
@@ -1,5 +1,4568 @@
  # Changelog
 
+ ## 2.10.1 — 2026-05-01
+
+ ### 2.10-F — C-ext fast-path response writer for prebuilt responses
+
+ Folds the matched-route hot path for `Server.handle_static` into a
+ single C function so the request never re-enters Ruby on the response
+ side. Targets the syscall-overhead gap between Hyperion's 2.10-D
+ direct-route path (5,619 r/s on `hello` per the 2.10-D bench) and
+ Agoo's pure-C static path (19,364 r/s) by eliminating the Ruby
+ plumbing around the `write()` syscall: no handler closure dispatch,
+ no `[status, headers, body]` tuple materialization, no extra GVL
+ acquire/release for the response phase.
+
+ **Headline shape.** New
+ `Hyperion::Http::PageCache.serve_request(socket, method, path)` C-ext
+ entry point: hash lookup, snapshot under the C lock, write outside
+ the GVL via `rb_thread_call_without_gvl`, return
+ `[:ok, bytes_written]` on hit / `:miss` on absence. HEAD requests
+ write the headers-only prefix (body stripped on the C side, no
+ extra Ruby work).
+
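+ A caller-side sketch of that contract (hedged illustration — the
+ shipped caller is `Connection#dispatch_direct_static!`, described
+ under "Files touched" below):
+
+ ```ruby
+ # Hedged sketch, not the shipped dispatch code: try the C fast
+ # path, fall back to a Ruby write on :miss. `socket` and `entry`
+ # (the registered StaticEntry) are assumed to be in scope.
+ case Hyperion::Http::PageCache.serve_request(socket, 'GET', '/health')
+ in [:ok, _bytes_written]
+   # hit: the full prebuilt response left in one GVL-released write
+ in :miss
+   socket.write(entry.buffer) # Ruby fallback path
+ end
+ ```
+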
+ **Why the previous shape was leaving cycles on the table.** 2.10-D's
+ `Connection#dispatch_direct!` matched the route in O(1), but the
+ write path was still pure Ruby:
+ `handler.call(request) → StaticEntry → socket.write(buf)`. Each
+ step involves Ruby method dispatch, ivar reads, GC roots updated,
+ and (after any blocking wait the kernel imposes on `write()`)
+ a fresh GVL re-acquire. Strace on the 2.10-D `hello` path showed
+ ~30,000 ancillary syscalls per second (`clone3`, `futex`) for what
+ should be one `write()` per request — that's the Ruby plumbing
+ talking, not the protocol. 2.10-F shrinks the response phase to
+ ONE C call that does the whole thing.
+
+ **Operator surface.**
+
+ ```ruby
+ # Already worked in 2.10-D — same call:
+ Hyperion::Server.handle_static(:GET, '/health', "OK\n")
+
+ # 2.10-F changes what happens UNDER this call:
+ # 1. The prebuilt buffer is registered with the C-side PageCache
+ #    under '/health'.
+ # 2. Connection#dispatch_direct! detects StaticEntry routes
+ #    and calls PageCache.serve_request(socket, method, '/health'),
+ #    which does the hash lookup + GVL-released write in C.
+ # 3. HEAD on a GET-registered handle_static now serves headers-
+ #    only automatically (HTTP semantic; the C side strips body
+ #    bytes for HEAD).
+ ```
+
+ No caller-facing API changes — this is a hot-path
+ micro-architecture rework.
+
+ **Files added.**
+
+ | File | Purpose |
+ |---|---|
+ | `spec/hyperion/page_cache_serve_request_spec.rb` | 13 examples covering `:ok`/`:miss`, GET full body, HEAD headers-only, method-gating (POST/PUT/etc. miss), case-insensitive method match, last-writer-wins on `register_prebuilt`, oversized-path rejection, no-deadlock under 8-thread `serve_request` contention. |
+
+ **Files touched.**
+
+ | File | Change |
+ |---|---|
+ | `ext/hyperion_http/page_cache.c` | Add `rb_pc_register_prebuilt(path, response_bytes, body_len)` so the prebuilt handle_static buffer can be folded into the C cache without ever reading from disk. Add `rb_pc_serve_request(socket_io, method, path)` — C-ext fast path: classifies method (GET/HEAD/other) inline, looks up in the existing FNV-1a bucket table, snapshots under the pthread mutex, writes via the existing `hyp_pc_write_blocking` helper (still under `rb_thread_call_without_gvl`). Extends `hyp_page_t` with `headers_len` (= response_len − body_len, used for HEAD writes) and `prebuilt` (skips re-stat for entries that have no on-disk file). New `:miss` symbol distinct from `:missing` so logs/metrics can tell "not in cache" apart from "in cache but method-ineligible". |
+ | `lib/hyperion/http/page_cache.rb` | Document the new C surface (`register_prebuilt`, `serve_request`) in the module-level comment; no Ruby-side helper added — these are direct C exports. |
+ | `lib/hyperion/server/route_table.rb` | `StaticEntry` gains a 4th field `headers_len` (defaulted nil for back-compat with 2.10-D 3-arg constructions), responds to `#call(request)` returning `self` (so `Server.handle_static` can register the entry directly into the route table — the 2.10-D wrapping closure is gone), exposes `headers_bytesize` for the Ruby fallback HEAD-strip path. |
+ | `lib/hyperion/server.rb` | `Server.handle_static` now: (a) records `head.bytesize` on the StaticEntry as `headers_len`; (b) registers the StaticEntry **directly** into the route table (not wrapped in `->(req) { entry }`); (c) registers a HEAD twin for any GET registration (HTTP-mandated); (d) calls `Hyperion::Http::PageCache.register_prebuilt` to fold the prebuilt buffer into the C cache. |
+ | `lib/hyperion/connection.rb` | `dispatch_direct!` branches on `handler.is_a?(StaticEntry)` BEFORE invoking the handler closure; on hit, calls new `dispatch_direct_static!` which fires lifecycle hooks, calls `PageCache.serve_request(socket, request.method, entry.path)` via the new `serve_static_entry` helper, and falls back to a Ruby `socket.write` of `entry.buffer` (or `headers_bytesize` prefix on HEAD) if the C cache returned `:miss`. Lifecycle-hook contract from 2.10-D preserved — `env=nil` on direct routes still holds. |
+ | `spec/hyperion/direct_route_spec.rb` | +2 examples: end-to-end serves `handle_static` via the C-ext fast path (asserts `PageCache.serve_request` finds the registered entry and returns `[:ok, n]`); end-to-end HEAD on a `handle_static`-registered GET route returns headers-only on the wire. |
+
+ **Spec count: 1045 → 1060 (+15).** All 1060 examples green; 11 pending
+ (unchanged platform-only kTLS / io_uring / Linux-splice branches from
+ the 2.10.0 baseline).
+
+ **Judgment calls (documented for archaeology).**
+
+ 1. **HEAD twin auto-registration.** `Server.handle_static(:GET, ...)` now
+ ALSO registers HEAD on the same path. Operators registering HEAD
+ explicitly through `Server.handle(:HEAD, path, handler)` continue to
+ work — the route_table is last-writer-wins, so explicit overrides
+ take precedence if registered after `handle_static`. The alternative
+ was to teach `RouteTable#lookup` to fall back GET→HEAD, but that
+ would have widened the hot-path lookup from one Hash#[] to two
+ without operator opt-in. Auto-registration was the cheaper choice.
+
+ 2. **`:miss` vs `:missing`.** Two different symbols intentionally:
+ `:missing` = "you asked for this and the on-disk file isn't cached"
+ (the existing 2.10-C `write_to`/`fetch` contract); `:miss` = "the
+ route lookup didn't yield a serveable prebuilt entry" (new in 2.10-F).
+ Operators wiring metrics on the C path can split the two reasons.
+
+ 3. **GVL release.** The implementation reuses the existing
+ `hyp_pc_write_blocking` helper — same `rb_thread_call_without_gvl`
+ shape as 2.10-C's `PageCache.write_to`, same EAGAIN-with-bounded-
+ `select` retry. No new lock, no new write strategy: just a smaller
+ call surface.
+
+ 4. **`StaticEntry#call` returning self.** Keeps the route_table
+ `respond_to?(:call)` invariant intact so the same hash can hold both
+ prebuilt entries AND user-defined Rack-tuple handlers without a
+ separate registration API. `dispatch_direct!`'s
+ `handler.is_a?(StaticEntry)` branch fires BEFORE `handler.call(request)`,
+ so the `call` method is only ever exercised by callers reaching for
+ the route table directly (specs, custom dispatchers). A sketch
+ follows this list.
+
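+ A minimal sketch of that invariant, assuming Struct-style fields
+ named from this entry (`buffer`, `headers_len`) — the shipped class
+ shape may differ:
+
+ ```ruby
+ # Hedged sketch — StaticEntry stays callable so the route table can
+ # hold prebuilt entries and Rack-tuple handlers in the same Hash.
+ StaticEntry = Struct.new(:path, :buffer, :content_type, :headers_len) do
+   def call(_request)
+     self # dispatch_direct! branches on is_a?(StaticEntry) first
+   end
+
+   def headers_bytesize
+     headers_len # read by the Ruby-fallback HEAD-strip path
+   end
+ end
+ ```
+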
+ ### 2.10-F — bench result: hello via handle_static 5,768 r/s (vs 5,619 baseline, +2.6%)
+
+ **Honest reading: 2.10-F is durable infrastructure, NOT a sustained-r/s win.**
+ The C-ext fast path eliminates the Ruby plumbing around the response
+ write, but on this workload the dominant cost is the per-connection
+ lifecycle (accept4 + clone3 + futex + epoll setup) — NOT the response
+ write phase that 2.10-F shrinks. The wrk profile spawns 100 fresh
+ keep-alive connections every 20-second run; each one pays the
+ connection-startup tax once, and that tax dwarfs the per-request
+ write cost on a 5-byte body. Closing THAT gap is explicitly the
+ domain of 2.11+.
+
+ **Setup (openclaw-vm, Ubuntu 24.04, x86_64, Ruby 3.3.3, page-cache C
+ ext compiled fresh against the 2.10-F source).**
+
+ | Run | Hyperion args | Rackup |
+ |---|---|---|
+ | 2.10-F | `-t 5 -w 1 -p 9810 bench/hello_static.ru` | `Hyperion::Server.handle_static(:GET, '/', 'hello')` + Rack 404 fallback |
+
+ `wrk -t4 -c100 -d20s --latency http://127.0.0.1:9810/`, 3 trials,
+ median r/s reported.
+
+ | Run | Trial 1 | Trial 2 | Trial 3 | Median r/s | Median p99 |
+ |---|---:|---:|---:|---:|---:|
+ | 2.10-D baseline (published, this rackup shape) | — | — | — | **5,619** | **1.93 ms** |
+ | **2.10-F (with C-ext fast-path)** | 5,688.55 | 6,035.16 | 5,768.47 | **5,768** | **1.67 ms** |
+ | Agoo 2.15.14 (2.10-B reference) | — | — | — | 19,364 | 9.41 ms |
+
+ **Delta vs 2.10-D baseline:** **+2.6% r/s** (5,619 → 5,768), **−14%
+ p99 latency** (1.93 ms → 1.67 ms). Trial-to-trial spread on this
+ host is ±5%, so the r/s delta is inside the noise band; the p99
+ improvement is outside the noise band and reproduces across all
+ three trials.
+
+ **Why no headline rps win.** The 2.10-F change shrinks the response
+ phase to one C call. On the pre-2.10-F path, the response phase was
+ already a single `write()` syscall (the 2.10-D static buffer write)
+ plus a handful of Ruby method dispatches around it. The Ruby
+ dispatch overhead — handler closure call, `is_a?(StaticEntry)`
+ branch, `socket.write` ivar reads — is ~1-2 microseconds total at
+ this scale, swamped by the ~150-200 microsecond per-request floor
+ imposed by the connection lifecycle (accept queue, thread-pool
+ worker hand-off, parser dispatch, GVL release/acquire on read).
+ 2.10-F removes that 1-2 µs cleanly; the bench shows it as a small
+ p99 improvement (the tail latency on the fast-path-eligible
+ requests tightens) but the throughput floor is gated on a different
+ bottleneck.
+
+ **What 2.10-F DOES move (durable wins).**
+
+ 1. **−14% p99 latency on the fast path.** Tail latency tightens
+ because the response phase no longer pays the GVL re-acquisition
+ that the Ruby `socket.write` triggered on the post-syscall return.
+ Operator-visible: a static-asset CDN origin running Hyperion
+ sees 1.67 ms instead of 1.93 ms p99 on cached hits.
+
+ 2. **Syscall-reduction infrastructure for 2.11.** Strace -f over
+ 5,000 warm requests on 2.10-F shows: 5,000× write (the
+ `serve_request` C-side write) + ~30,000 ancillary syscalls
+ (accept4, clone3, futex, recvfrom, epoll). The 5,000 writes
+ are unchanged from 2.10-D; the **ancillary syscalls are
+ unchanged too** because they're driven by the connection
+ lifecycle, not the response phase. When 2.11 closes the
+ accept-loop / thread-pool gap, the C-ext write path is already
+ in place — no second Ruby-to-C migration needed.
+
+ 3. **HEAD support automatic.** `handle_static(:GET, ...)` now
+ auto-registers the HEAD twin (HTTP-mandated). Operators
+ running CDN-shaped traffic against `handle_static` paths get
+ correct HEAD semantics for free; the C side strips the body
+ bytes on HEAD, so the wire saves the body byte-count per
+ HEAD request.
+
+ 4. **GVL released across the write syscall.** The 2.10-D path
+ ran the `socket.write` under the GVL — a slow client (one
+ that didn't drain the kernel send buffer fast enough) could
+ block other Ruby-side work on the same VM. The C path uses
+ `rb_thread_call_without_gvl`, so other threads / fibers on
+ the same worker can run while the kernel drains.
+
+ **Caveats / honest framing.**
+
+ - The plan target of "8,000-12,000 r/s (half-way to Agoo)" is
+ NOT met on this row. The half-way mark would have required
+ closing the connection lifecycle gap as well — that's the
+ 2.11 sprint's job, not 2.10-F's.
+ - Agoo's 19,364 r/s on this row is still 3.4× ahead. Closing
+ that gap requires owning the accept loop in C (which Agoo does)
+ — distinct from owning the response write in C, which is what
+ 2.10-F lands.
+ - Trial-to-trial noise on this row is ±5% on the openclaw-vm host
+ (visible in the spread 5,688 / 6,035 / 5,768). The 2.6% r/s
+ delta is inside that band; the p99 delta is outside it.
+
+ **Recommendation.** Ship 2.10-F. The operator-visible value is
+ the −14% tail latency on the fast path, the durable C-side
+ write infrastructure for the 2.11 connection-lifecycle work, the
+ free HEAD support, and the GVL release across writes. The
+ CHANGELOG headline tells operators the honest delta — not "+47%
+ to Agoo".
+
+ ### 2.10-E — Static asset preload + immutable flag
+
+ Adds a boot-time hook that walks operator-supplied directory trees,
+ populates `Hyperion::Http::PageCache` with each regular file, and
+ (by default) marks every cached entry immutable so subsequent serves
+ never re-stat. Gives Hyperion 2.10-C's cold-cache path (1,880 r/s on
+ the static-1-KB row) the same boot-time warm-up behind Agoo's
+ "8000× faster static" warm-cache claim (their `misc/rails.md`) —
+ Agoo gets that number by preloading every Rails-managed asset path
+ at boot. 2.10-E ships the same shape so operators don't have to call
+ `PageCache.preload` themselves.
+
+ **Operator surface — three ways in.**
+
+ 1. **CLI flag** — repeatable.
+ ```
+ bundle exec hyperion --preload-static /srv/app/public \
+ --preload-static /srv/app/public/uploads \
+ config.ru
+ ```
+ Each `--preload-static <dir>` entry is walked recursively at boot;
+ every file becomes a cached page-cache entry marked immutable.
+ `--no-preload-static` is the sibling sentinel that disables the
+ Rails-aware auto-detect (operator-supplied dirs still take effect).
+
+ 2. **Config DSL key** — accumulates across multiple calls.
+ ```ruby
+ # config/hyperion.rb
+ preload_static '/srv/app/public' # immutable: true (default)
+ preload_static '/srv/app/public/uploads', immutable: false # opt-out per-dir
+ ```
+ The `immutable:` kwarg defaults to `true` — preload's whole point
+ is "I promise these don't change without a restart", and operators
+ wanting per-request mtime polling can opt out per-dir.
+
+ 3. **Rails auto-detect** — zero-config for Rails apps.
+
+ When the operator has NOT configured `preload_static` and did NOT
+ pass `--no-preload-static`, Hyperion checks for a Rails-shaped boot
+ environment (`defined?(::Rails) && ::Rails.respond_to?(:configuration)`,
+ `Rails.configuration.assets.paths` returns a non-empty Array) and
+ auto-preloads the first 8 entries of `Rails.configuration.assets.paths`.
+ Hyperion never `require`s rails — auto-detect is purely defensive
+ probing, so Hyperion stays a generic Rack server that has a Rails
+ bonus mode. A sketch of the probe follows this list.
+
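+ A sketch of that defensive probe, using this entry's
+ `StaticPreload.detect_rails_paths(cap: 8)` naming (the body is
+ illustrative, not the shipped code):
+
+ ```ruby
+ # Hedged sketch: probe, never require — return [] unless a
+ # Rails-shaped boot environment is actually present.
+ def self.detect_rails_paths(cap: 8)
+   return [] unless defined?(::Rails) && ::Rails.respond_to?(:configuration)
+   paths = begin
+     ::Rails.configuration.assets.paths
+   rescue StandardError
+     nil
+   end
+   return [] unless paths.is_a?(Array) && !paths.empty?
+   paths.first(cap).map { |dir| { path: dir.to_s, immutable: true } }
+ end
+ ```
+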
+ **Boot-time log line per dir.** One info-level summary line per
+ processed directory:
+
+ ```
+ {"message":"static preload complete","dir":"/srv/app/public","files":42,"bytes":2487136,"ms":18.4}
+ ```
+
+ Operators can alert on this when `files=0` (config typo, missing dir)
+ or when `ms` spikes (disk regression / NFS storm).
+
+ **Where it runs.** `Hyperion::StaticPreload.run` is invoked from
+ `Server#preload_static!` inside `Server#start`, after `listen`
+ configures the listener but BEFORE the accept loop spins. First
+ request lands on warm cache. The preload list flows
+ `Config#resolved_preload_static_dirs` → `Master` → `Worker` → `Server`
+ in cluster mode and `CLI.run_single` → `Server` in single-mode.
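+
+ A minimal sketch of that hook, using the names in the files-touched
+ table below (`@preloaded` guard, `StaticPreload.run`); the body is
+ assumed:
+
+ ```ruby
+ # Hedged sketch of Server#preload_static! — idempotent so respawn
+ # paths don't re-walk the directory trees.
+ def preload_static!(logger:)
+   return if @preloaded
+   return if @preload_static_dirs.nil? || @preload_static_dirs.empty?
+   @preloaded = true
+   Hyperion::StaticPreload.run(@preload_static_dirs, logger: logger)
+ end
+ ```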
+
+ **Files added.**
+
+ | File | Purpose |
+ |---|---|
+ | `lib/hyperion/static_preload.rb` | `Hyperion::StaticPreload.run(entries, logger:)` walks each `{path:, immutable:}` Hash, calls `PageCache.cache_file` + `set_immutable`, emits the summary log line. `.detect_rails_paths(cap: 8)` defensively probes `Rails.configuration.assets.paths` for the auto-detect path. |
+ | `spec/hyperion/static_preload_spec.rb` | 11 examples covering walk + cache, immutable-flag behaviour, summary log shape, missing-dir warn, multi-dir accumulation, and the Rails detect branches (operator-overrode, auto-detect-disabled, Rails-undefined, non-Array paths, cap kwarg). |
+ | `spec/hyperion/cli_preload_static_spec.rb` | 6 examples covering `--preload-static` repeatable flag, `--no-preload-static` toggle, and the `merge_cli!` routing into `Config#preload_static_dirs`. |
+ | `spec/hyperion/config_preload_static_spec.rb` | 6 examples covering DSL accumulation, defaults, and the `resolved_preload_static_dirs` precedence (operator > Rails > none). |
+ | `spec/hyperion/server_static_preload_spec.rb` | 3 examples covering `Server#preload_static!` warming the page cache from a configured directory and respecting the immutable flag. |
+
+ **Files touched.**
+
+ | File | Change |
+ |---|---|
+ | `lib/hyperion.rb` | Require `hyperion/static_preload` after `hyperion/http/page_cache`. |
+ | `lib/hyperion/cli.rb` | Add `--preload-static DIR` (repeatable) and `--no-preload-static` flags. Pass `config.resolved_preload_static_dirs` to `Server.new` in single-worker mode. |
+ | `lib/hyperion/config.rb` | Add `preload_static_dirs` (Array of `{path:, immutable:}` Hashes) + `auto_preload_static_disabled` (Boolean) defaults. Add `preload_static "/path", immutable: true` DSL method (accumulates). Special-case `:preload_static` in `merge_cli!` to append each CLI dir. New `Config#resolved_preload_static_dirs` returns operator dirs verbatim, falls through to `StaticPreload.detect_rails_paths` when none are configured AND auto-detect isn't disabled. |
+ | `lib/hyperion/server.rb` | New `preload_static_dirs:` kwarg on the constructor (default nil = no preload). New public `Server#preload_static!(logger:)` walks the entries via `StaticPreload.run`, idempotent via `@preloaded` so respawn paths don't re-walk. Invoked once from `Server#start` between `listen` and the accept-loop spin-up. |
+ | `lib/hyperion/master.rb` | Pass `@config.resolved_preload_static_dirs` through the worker spawn args so cluster mode also warms each worker's page cache at boot. |
+ | `lib/hyperion/worker.rb` | Accept and forward `preload_static_dirs:` to `Server.new`. |
+
+ **Spec count: 1018 → 1045 (+27).** All 1045 examples green; 11 pending
+ (unchanged platform-only kTLS / io_uring branches from the 2.10.0 baseline).
+
+ ### 2.10-E — bench result: static 1 KiB cold 1,929 r/s vs warm 1,886 r/s (no rps win)
+
+ **Honest reading: the preload feature does NOT move sustained throughput
+ on the static-1-KB row, but it does normalize first-request latency.**
+
+ Bench-only — no production code changes, spec count unchanged at 1045.
+
+ **Setup (openclaw-vm, Ubuntu 24.04, x86_64, Ruby 3.3.3, page-cache C
+ ext compiled fresh against the 2.10-E source).**
+
+ | Run | Hyperion args | Asset dir |
+ |---|---|---|
+ | Cold | `-t 5 -w 1 -p 9810 bench/static.ru` | `/tmp/hyperion_static_e/` (single 1 KiB file) |
+ | Warm | `-t 5 -w 1 -p 9810 --preload-static /tmp/hyperion_static_e bench/static.ru` | same |
+
+ `wrk -t4 -c100 -d20s --latency http://127.0.0.1:9810/hyperion_bench_1k.bin`,
+ 3 trials each, median r/s reported.
+
+ | Run | Trial 1 | Trial 2 | Trial 3 | Median r/s | Median p99 |
+ |---|---:|---:|---:|---:|---:|
+ | Cold | 1,929.75 | 2,037.98 | 1,920.91 | **1,929** | **3.50 ms** |
+ | Warm (preloaded + immutable) | 2,013.95 | 1,842.22 | 1,886.55 | **1,886** | **3.51 ms** |
+
+ **Why no throughput win.** Hyperion's `ResponseWriter` already
+ auto-caches Rack::Files hits on first request (the `cache_file` /
+ `write_to` fall-through landed in 2.10-C). At sustained 100-conn `wrk`,
+ both the cold and warm paths converge on the same PageCache hot path
+ inside the first millisecond — preload just frontloads that one
+ `cache_file` call from the first wrk iteration to boot time.
+
+ **What 2.10-E DOES move.** The "static preload complete" log line
+ fires at boot:
+
+ ```
+ {"message":"static preload complete","dir":"/tmp/hyperion_static_e","files":1,"bytes":1024,"ms":0.2}
+ ```
+
+ Operators get:
+
+ 1. **Predictable first-request latency.** With preload, request #1
+ after boot lands on a warm cache (saves the `File.size?` +
+ `cache_file` work that the first cold request pays). Without preload,
+ request #1 takes a `stat` + `open` + alloc hit; subsequent requests
+ are warm. With preload, every request including #1 is warm.
+ 2. **Immutable flag.** Cached entries marked immutable skip the
+ `recheck_seconds` mtime poll on every serve. For long-running
+ processes serving content-hashed asset bundles this saves ~1
+ `lstat()` syscall per request — operationally invisible at the
+ 2k r/s scale of the bench host but matters at production scale
+ on Linux where `stat` against NFS or overlayfs can be a tail-latency
+ land mine.
+ 3. **Operational predictability.** No first-request "cold cache"
+ class of incidents (operator restarts the worker, first user request
+ pays a ~10ms `cache_file` cost — not visible at this bench scale,
+ but a real shape with 50 KiB files / 1000 assets).
+
+ **Caveats / honest framing.**
+
+ - The plan target ("2,400-2,600 r/s warm") is NOT met. The harness +
+ 1-KB-via-Rack::Files shape is already throughput-bound on the
+ PageCache hot path — preload doesn't unlock more.
+ - A different bench (cold-process first-request latency, or 1000-asset
+ tree where Find.find dominates the first-N requests) WOULD show a
+ larger delta. We're not running that here because the plan's
+ requested bench is the 2.10-B static-1-KB row.
+ - Trial-to-trial noise on this row is ±5% on the openclaw-vm host
+ (visible in the "Warm" column going 2,014 → 1,842 → 1,886).
+ The median spread between cold and warm (1,929 vs 1,886, ~2%) is
+ inside that noise band.
+
+ **Recommendation.** Ship 2.10-E. The operator-visible value is the
+ zero-config Rails-aware boot-time cache warming, the immutable flag
+ that ships with it, and the boot-time summary log line — not a
+ sustained-r/s improvement. The CHANGELOG headline tells operators what
+ they actually buy.
+
+ ## 2.10.0 — 2026-05-01
+
+ The 2.10 sprint widens the bench comparison from "Hyperion vs Falcon
+ on h2" (the 2.9-B head-to-head) to **all four major Ruby web servers**:
+ Hyperion, Puma, Falcon, and Agoo. Agoo is widely cited as the fastest
+ Ruby web server, so we want it in the matrix as the upper-bound
+ reference for the 2.10 follow-on streams (static-response cache,
+ direct route registration). 2.10-A ships the harness; 2.10-B will run
+ the BASELINE bench BEFORE any 2.10 code changes so the gap that the
+ new code closes is honestly known.
+
+ ### 2.10-A — 4-way bench harness (Hyperion + Puma + Falcon + Agoo)
+
+ Harness only — no production code changes, no spec changes
+ (spec count stays at 964).
+
+ **Files added.**
+
+ | File | Purpose |
+ |---|---|
+ | `bench/Gemfile.4way` | Sibling Gemfile pinning `puma ~> 8.0`, `falcon ~> 0.55`, `agoo`, `rack ~> 3.0`, plus `hyperion-rb` from `ENV['HYPERION_PATH']` (defaults to `/home/ubuntu/hyperion` for the openclaw-vm bench host). Kept separate from the main bench Gemfile so existing harnesses (`h2_falcon_compare.sh`, etc.) are not disturbed by the 2.10-era pins. |
+ | `bench/agoo_boot.rb` | Wrapper that lets Agoo serve a Rack rackup (Agoo's CLI doesn't take rackups directly). Calls `Agoo::Server.handle_not_found(app)` so the parsed Rack builder is the catch-all handler, and parks the main thread on a `Queue#pop` so the process stays alive after `Agoo::Server.start` returns (Agoo's `thread_count: N>0` runs workers in their own threads and returns from `start`, unlike `thread_count: 0` which blocks the caller). See the sketch after this table. |
+ | `bench/4way_compare.sh` | Single-script harness that boots each of the four servers in turn on the same port (9810) with a matched `-t 5 -w 1` budget, smokes a single 200, and (unless `SMOKE_ONLY=1`) drives 3× `wrk -t4 -c100 -d20s --latency` runs. Subset by passing server names after the rackup: `bench/4way_compare.sh bench/hello.ru hyperion agoo`. |
+
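+ A condensed sketch of that boot shape (argv handling and rackup
+ loading assumed — the shipped wrapper may differ):
+
+ ```ruby
+ # Hedged reconstruction of bench/agoo_boot.rb, not the shipped file.
+ require 'agoo'
+ require 'rack'
+
+ rackup, port, threads = ARGV
+ app = Rack::Builder.parse_file(rackup) # rack ~> 3.0: returns the app
+
+ Agoo::Server.init(port.to_i, '.', thread_count: threads.to_i)
+ Agoo::Server.handle_not_found(app) # parsed builder = catch-all handler
+ Agoo::Server.start                 # non-blocking when thread_count > 0
+ Queue.new.pop                      # park the main thread forever
+ ```
+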
+ **Boot recipes (one server per row, all bound to port 9810,
+ matched 5-thread / 1-process budget).**
+
+ | Server | Command |
+ |---|---|
+ | Hyperion | `bundle exec hyperion -t 5 -w 1 -p 9810 bench/hello.ru` |
+ | Puma | `bundle exec puma -t 5:5 -w 1 -b tcp://127.0.0.1:9810 bench/hello.ru` |
+ | Falcon | `bundle exec falcon serve --bind http://localhost:9810 --hybrid -n 1 --forks 1 --threads 5 --config bench/hello.ru` |
+ | Agoo | `bundle exec ruby bench/agoo_boot.rb bench/hello.ru 9810 5` |
+
+ Falcon's `--threads` flag is documented "hybrid only" — that's why
+ the harness explicitly selects `--hybrid -n 1 --forks 1 --threads 5`
+ (verified against `falcon serve --help` on the bench host before
+ this commit landed).
+
+ **Smoke verification on openclaw-vm (`SMOKE_ONLY=1`, GET / against
+ `bench/hello.ru`).** SSH to the bench host now works after 2.9-E's
+ `IdentitiesOnly yes` fix landed.
+
+ | Server | Boots? | Serves 200? | Notes |
+ |---|---|---|---|
+ | Hyperion | yes (1 s) | yes | `bin/hyperion` from this repo; the 4way Gemfile resolves the path gem normally |
+ | Puma | yes (1 s) | yes | `puma 8.0`, threads-only |
+ | Falcon | yes (1 s) | yes | `falcon 0.55`, `--hybrid -n 1 --forks 1 --threads 5` |
+ | Agoo | yes (1 s) | yes | `agoo 2.15.14`. First boot attempt FAILED — needed the `Queue#pop` main-thread parking in `agoo_boot.rb`. With `thread_count: 5`, `Agoo::Server.start` is non-blocking and returns immediately; without the pop the process exits with the listener torn down. Fixed in this commit. |
+
+ The full `wrk` bench is **2.10-B's** job, not 2.10-A's — this commit
+ only validates that all four servers come up + serve a 200 on the
+ shared rackup.
+
+ ### 2.10-B — 4-way baseline bench (Hyperion vs Puma vs Falcon vs Agoo)
+
+ **BASELINE — this is the honest 4-way comparison BEFORE 2.10-C/D/E/F
+ land. Establishes the gap that subsequent perf work needs to close
+ (or honestly prove uncloseable).** Re-run after each 2.10-C/D/E/F
+ stream lands so the delta is visible per-row.
+
+ Bench-only — no production code changes, spec count unchanged at 964.
+
+ **Files touched.**
+
+ | File | Change |
+ |---|---|
+ | `bench/Gemfile.4way` | Added `pg ~> 1.5` and `hyperion-async-pg ~> 0.5` so row 5 (PG-bound async) can run inside the 4way bundle. Other rows are unaffected — the gems load only when MODE=async. |
+ | `bench/4way_compare.sh` | Four additions: (1) `URL_PATH` env var for rackups whose root URL isn't a 200 (e.g. static.ru → `/hyperion_bench_1k.bin`); (2) `WRK_TIMEOUT` env var for slow rackups (row 5 wants `--timeout 8s`); (3) `HYPERION_EXTRA` env var for hyperion-only flags like `--async-io`; (4) **perfer integration** — every (rackup, server) combination now runs both `wrk -t4 -c100 -d20s` AND `perfer -t4 -c100 -k -d20`, with NA-aware median + summary lines. Set `SKIP_PERFER=1` to disable. |
+
+ **Results (medians of 3 trials, openclaw-vm Ubuntu 24.04 + Ruby
+ 3.3.3, all servers `-t 5 -w 1`).** Per-row tables, full caveat
+ discussion, and reproduction recipe are in
+ [`docs/BENCH_HYPERION_2_0.md` § "4-way head-to-head (2.10-B
+ baseline)"](docs/BENCH_HYPERION_2_0.md#4-way-head-to-head-210-b-baseline-2026-05-01).
+
+ | Row | Workload | Hyperion 2.9.0 | Puma 8.0.1 | Falcon 0.55.3 | Agoo 2.15.14 | Verdict |
+ |---|---|---:|---:|---:|---:|---|
+ | 1 | hello | 4,587 r/s · p99 2.08 ms | 4,049 · 28.74 ms | 6,082 · 408.53 ms | **19,364** · 9.41 ms | Agoo wins by 4.2× over Hyperion; Hyperion has the cleanest p99 |
+ | 2 | static 1 KB | 1,380 · **4.86 ms** | 1,416 · 93.86 ms | 1,785 · 64.87 ms | **2,606** · 58.89 ms | Agoo wins; Hyperion p99 20× tighter than Puma |
+ | 3 | static 1 MiB | **1,378** · **5.62 ms** | 1,282 · 95.17 ms | 523 · 833.54 ms | 152 · 743.37 ms | **Hyperion wins** — sendfile path; 9× over Agoo |
+ | 4 | CPU JSON 50-key | 3,450 · **2.73 ms** | 2,771 · 40.74 ms | 4,245 · 410.05 ms | **6,374** · 19.18 ms | Agoo wins by 1.85× over Hyperion |
+ | 5 | PG-bound async (50 ms `pg_sleep`, c=200) | **1,564** · 145.71 ms | n/a | n/a | n/a | Hyperion-only — others can't run the rackup (no fiber-cooperative I/O) |
+ | 6 | SSE 1000 × 50 B (c=1, t=1) | **500** · 2.59 ms | 137 · 9.23 ms | 29 · 38.60 ms | smoke-fail | **Hyperion wins** — 3.6× over Puma, 17× over Falcon; Agoo can't stream SSE chunked |
+
+ **Honest reading.** Agoo is faster than Hyperion on hello-world by
+ **4.2×** (19k r/s vs 4.6k r/s) and on CPU JSON by **1.85×** (6.4k
+ vs 3.5k). That gap is what 2.10-C (static-response cache) and
+ 2.10-D (direct route registration) need to walk down to <2× hello,
+ <1.5× JSON. Hyperion already wins the workloads that matter
+ operationally — large static (sendfile, 9× ahead of Agoo),
+ PG-bound async (Hyperion-only at this concurrency), and SSE
+ streaming (3.6×–17× ahead). Tail latency is Hyperion's clean win
+ across every row (worst p99 across the 6 rows: 145 ms on row 5
+ intentionally; next worst 5.62 ms — Puma worst 95 ms, Falcon worst
+ 833 ms, Agoo worst 743 ms).
+
+ **perfer caveats.** Adding `perfer` (https://github.com/ohler55/perfer,
+ Agoo's own bench tool) for apples-to-apples vs Agoo's published
+ numbers surfaced four issues that are documented inline in the
+ BENCH doc:
+
+ 1. perfer's `Content-Length:` / `Transfer-Encoding:` lookups are
+ case-sensitive `strstr` — hangs against Hyperion's RFC 9110
+ lowercase headers. Patched local copy to `strcasestr`; should
+ upstream.
+ 2. perfer's `pool_warmup` deadlocks against Hyperion `-t 5 -c 100`
+ (per-conn 2.0s recv timeout × 100 conns vs 5-thread accept
+ loop). Recorded `NA` for hyperion+perfer on rows 1, 2, 4.
+ Raising hyperion to `-t 200` makes perfer happy but breaks
+ matched-config. Workaround for 2.10-B: rely on wrk for
+ hyperion's headline; perfer numbers for puma / falcon / agoo
+ agree with wrk within 5–10%.
+ 3. perfer's `MAX_RESP_SIZE = 16 KB` recv buffer means the 1 MiB
+ static row is `NA` across all four servers under perfer. wrk
+ handles arbitrary body sizes, so row 3 stays wrk-only.
+ 4. SSE row reports `0 r/s` under perfer — its response-framing
+ doesn't handle multi-chunk streams. wrk's `c=1` read-until-close
+ parses correctly.
+
+ **Re-run pattern.** After each 2.10-C/D/E/F stream lands, re-run
+ the same 6 rows and append a new "post-{stream}" table to the
+ BENCH section so the cumulative perf delta is visible.
+
+ ### 2.10-C — Hyperion::Http::PageCache (pre-built static-response cache)
+
+ **Headline (vs 2.10-B baseline on openclaw-vm, static 1 KiB row, `-t 5 -w 1`).**
+
+ | Build | r/s (median of 3) | vs 2.10-B | vs Agoo 2.15.14 |
+ |---|---:|---:|---:|
+ | 2.10-B baseline (Hyperion 2.9.0) | 1,380 | — | −47% (Agoo wins) |
+ | **2.10-C with PageCache engaged** | **1,880** | **+36%** | −28% |
+ | Agoo 2.15.14 (reference) | 2,606 | +89% | — |
+
+ Three trials on openclaw-vm: 1880 / 1932 / 1794 r/s; latency
+ median dropped from ≈3.7 ms to ≈2.7 ms. The PageCache primitive
+ delivers the response-buffer half of agoo's small-static
+ advantage; the remaining gap to Agoo on the 1 KiB row lives in
+ connection handling — Hyperion's HTTP/1.1 path spawns a thread
+ per accept, while Agoo runs an event-loop model. That gap is
+ the explicit subject of 2.10-D / 2.10-E / 2.10-F (connection
+ fastpath + Rack-bypass routes), each of which will reuse the
+ PageCache primitive shipped here.
+
+ **Plan target was 5,000 r/s.** That target assumed the PageCache
+ could remove the entire Rack-stack cost on a hit, but the
+ adapter still pays for ENV pool acquire + header hash iteration
+ + the file_size stat from ResponseWriter; on a `strace -f` over
+ 500 warm-cache requests we still see 500 × accept4 + 500 ×
+ clone3 (per-conn thread spawn) + 500 × stat. The cache buffer
+ itself contributes 500 × write() — the single-syscall promise
+ holds inside the cache, but the wider connection path spends
+ ~18,000 syscalls on the 500-request slice (≈36 syscalls /
+ request), of which only ~8 per request are application-level.
+ Closing the rest to one syscall per response is 2.10-F's job
+ (the direct-route register).
+
+ The win source mirrors agoo's `agooPage` design: each cached static
+ asset's full HTTP/1.1 response (status line + Content-Type +
+ Content-Length + body) lives in ONE contiguous heap buffer that's
+ built ONCE on first read. The hot path (`PageCache.write_to(socket,
+ path)`) hashes the path, snapshots the buffer pointer + length out
+ of the cache under a brief pthread mutex, releases the mutex, then
+ issues `write(fd, buf, len)` from C without the GVL. Per-request
+ cost on a hit is:
+
+ * 0 file reads — body bytes already live in the response buffer.
+ * 0 mime lookups — Content-Type baked in at insert time.
+ * 0 header re-builds — Content-Length / status line baked in.
+ * 0 Rack env construction — engaged below the Rack adapter.
+ * 0 Ruby allocations on the C path itself (the only return value
+ is a `SSIZET2NUM` Integer that's small enough to fit in a
+ pointer-encoded Fixnum on every supported host).
+ * 1 socket write syscall in the common case.
+
+ **Files added.**
+
+ | File | Purpose |
+ |---|---|
+ | `ext/hyperion_http/page_cache.c` | C primitive (~800 LOC). Open-addressed bucket table (`PAGE_BUCKET_SIZE = 1024`, `MAX_KEY_LEN = 1024` mirroring agoo). pthread mutex on the structural ops; readers snapshot under the lock then release before the kernel write. `Init_hyperion_page_cache` runs from `parser.c` after `Init_hyperion_sendfile`. |
+ | `lib/hyperion/http/page_cache.rb` | Ruby façade. Adds `write_response` (alias of `write_to`), `preload(dir, immutable: false)` (recursive), `mark_immutable` / `mark_mutable`, `available?`. |
+ | `spec/hyperion/page_cache_spec.rb` | 25 specs — round-trip via real TCP pair, mtime invalidation, immutable flag, recursive preload, per-extension Content-Type matrix (12 extensions), zero-allocation hot path (< 100 objects per 1000 hits), `recheck_seconds` knob. |
+
+ **Files touched.**
+
+ | File | Change |
+ |---|---|
+ | `ext/hyperion_http/extconf.rb` | `$srcs` adds `page_cache.c` so it links into the same `.bundle` / `.so` as `parser.c` / `sendfile.c` (single `require 'hyperion_http/hyperion_http'` brings up the full surface). |
+ | `ext/hyperion_http/parser.c` | `Init_hyperion_http` calls `Init_hyperion_page_cache` after `Init_hyperion_sendfile`. |
+ | `lib/hyperion.rb` | Requires `lib/hyperion/http/page_cache.rb` between `http/sendfile` and `adapter/rack`. |
+ | `lib/hyperion/response_writer.rb` | `write_sendfile_inner` checks the page cache first via the new `page_cache_write` helper. On a hit (or after opportunistic populate-then-write for files ≤ `AUTO_THRESHOLD = 64 KiB`), skip the entire `File.open` / `file.read` / `build_head` / `io.write` path. Class-level `page_cache_available?` probe memoised at load. Falls through cleanly when the IO is StringIO / SSL-wrapped (no real fd) — see `real_fd_io?`. A fall-through sketch follows this table. |
+
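+ A hedged sketch of that fall-through (helper names from the row
+ above; the decision logic is assumed, not the shipped body):
+
+ ```ruby
+ # Sketch of ResponseWriter#page_cache_write: serve a warm hit, or
+ # opportunistically populate-then-write for small files, else return
+ # false so the caller falls through to the sendfile path.
+ AUTO_THRESHOLD = 64 * 1024
+
+ def page_cache_write(io, path, file_size)
+   return false unless self.class.page_cache_available?
+   return false unless real_fd_io?(io) # StringIO / SSL-wrapped: no raw fd
+
+   pc = Hyperion::Http::PageCache
+   return true if pc.write_to(io, path).is_a?(Integer) # warm hit
+   return false if file_size > AUTO_THRESHOLD # big static stays on sendfile
+
+   pc.cache_file(path)                  # populate once...
+   pc.write_to(io, path).is_a?(Integer) # ...then serve from the cache
+ end
+ ```
+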
+ **Public Ruby API.**
+
+ ```ruby
+ PC = Hyperion::Http::PageCache
+
+ PC.preload('/var/www/public') # warm on boot
+ PC.mark_immutable('/var/www/public/asset-abcdef.css') # hashed assets
+ PC.cache_file('/var/www/public/index.html') # one file
+ PC.fetch(path) # :ok | :stale | :missing
+ PC.write_to(socket, path) # bytes_written | :missing (hot path)
+ PC.size; PC.clear
+ PC.recheck_seconds = 5.0 # default; matches agoo's PAGE_RECHECK_TIME
+ ```
+
+ **Auto-engage from `Adapter::Rack`.** Operators get the page cache
+ for free on `Rack::Files`-style routes (any body that responds to
+ `:to_path`) — `ResponseWriter#write_sendfile_inner` first calls
+ `page_cache_write`. Above the 64 KiB threshold it falls through to
+ the existing sendfile path because Hyperion already dominates big
+ static at 9× Agoo (the 2.10-B baseline). Apps wanting predictable
+ first-request latency call `PageCache.preload(dir)` on boot;
+ apps with content-hashed assets call `mark_immutable(path)` to
+ skip the per-recheck-window stat entirely.
+
+ **Wire-output note (intentional).** The cached response carries
+ status line + `Content-Type` + `Content-Length` + blank line + body
+ only — no `Date`, no `Connection`. Same shape Agoo emits on its
+ fast path. Non-cached paths still emit the full Hyperion header
+ set via `build_head`. Any response that carries `Set-Cookie` /
+ `Cache-Control` / `ETag` / `Last-Modified` / `Content-Encoding` /
+ `Content-Disposition` / `Vary` falls through to the existing path
+ unconditionally — those headers can't be safely baked into a
+ cross-request buffer.
+
+ **Mtime recheck.** Every cache lookup honours
+ `recheck_seconds` (default 5.0s, matches agoo's `PAGE_RECHECK_TIME`).
+ On expiry the C path stat()s the file; if `mtime` is unchanged it
+ just bumps `last_check`, otherwise it rebuilds the response buffer
+ in place. Per-page `set_immutable(true)` skips the stat entirely
+ — for fingerprinted assets the cache is effectively a one-shot
+ read.
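+
+ In Ruby pseudocode the per-lookup freshness check reads roughly as
+ follows (field names assumed; the shipped logic lives in C):
+
+ ```ruby
+ # Hedged Ruby rendering of the C-side recheck described above.
+ def fresh_for_serve?(page, now)
+   return true if page.immutable                            # never stat
+   return true if (now - page.last_check) < recheck_seconds # inside window
+
+   if File.stat(page.path).mtime == page.mtime
+     page.last_check = now   # unchanged: just bump the window
+   else
+     rebuild_response!(page) # changed: rebuild the buffer in place
+   end
+   true
+ end
+ ```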
+
+ **Concurrency.** Cache is per-process. Each forked Hyperion worker
+ holds its own table; no IPC, no shared memory, no cross-worker
+ contention. Within a worker, the table is guarded by a pthread
+ mutex that's held only for the lookup + snapshot — the kernel
+ `write()` runs without the GVL and without any Ruby-level lock
+ so other fibers / threads keep running while the socket buffer
+ drains.
+
+ **Spec count.** 964 → 989 (+25 in `page_cache_spec.rb`). All 989
+ green on macOS arm64; Linux x86_64 verified via the bench host
+ boot.
+
+ ### 2.10-D — Server.handle direct route registration (bypass Rack adapter)
+
+ **Headline.** New `Hyperion::Server.handle(:GET, '/path', handler)`
+ + `Hyperion::Server.handle_static(:GET, '/path', body)` API.
+ Mirrors agoo's `Agoo::Server.handle(:GET, "/hello", handler)`
+ design. On a registered route, `Connection#serve` skips the
+ Rack adapter entirely — no env-hash build, no middleware chain
+ walk, no body iteration; the handler is called directly with
+ the parsed `Hyperion::Request` value object. Lifecycle hooks
+ (`Runtime#on_request_start` / `on_request_end`) still fire so
+ NewRelic / AppSignal / OpenTelemetry instrumentation works
+ regardless of dispatch shape.
+
+ **Win source.** `handle_static` builds the FULL HTTP/1.1
+ response buffer (status line + Content-Type + Content-Length +
+ body) ONCE at registration time; the hot path is one
+ `socket.write(buffer)` syscall per request — same shape as the
+ existing 503 / 413 / 408 fast paths in `Server` / `Connection`,
+ zero Ruby allocation past the Connection ivars. `handle` (the
+ dynamic-handler form) bypasses env construction but still
+ walks the standard `ResponseWriter` for the
+ `[status, headers, body]` tuple — slower than the static path
+ but still skips the entire Rack-adapter overhead.
+
+ **Bench validation on openclaw-vm** (`-t 5 -w 1`, `wrk -t4
+ -c100 -d20s --latency`, three trials median). Hello-world via
+ `handle_static` vs the 2.10-B Rack-lambda baseline:
+
+ | Path | r/s (median of 3) | p99 latency | vs 2.10-B baseline | vs Agoo 2.15.14 |
+ |---|---:|---:|---:|---:|
+ | 2.10-B Rack lambda (Hyperion 2.9.0, published) | 4,587 | 2.08 ms | — | −76% (Agoo wins) |
+ | 2.10-B re-baseline (this run, same host, vanilla rackup) | 4,408 | 2.19 ms | −4% drift | — |
+ | **2.10-D handle_static** | **5,619** | **1.93 ms** | **+22% / +27%** vs the published / re-bench | −71% |
+ | Agoo 2.15.14 (reference) | 19,364 | 9.41 ms | +322% | — |
+
+ Three trials on openclaw-vm: 5,619 / 5,335 / 5,914 r/s.
+ **+22% over the published 2.10-B baseline; +27% over the
+ re-bench on the same host today.** p99 latency 1.93 ms — the
+ cleanest p99 in the 4-way matrix (Agoo's p99 on this row is
+ 9.41 ms, 4.9× wider despite the higher mean throughput).
+
+ **Plan target was 12,000 r/s; we landed at 5,619 r/s
+ (47% of plan target).** Honest reading of the gap: removing
+ the Rack-adapter cost (env-hash build + middleware chain +
+ body iteration) was correctly identified as 2.10-D's win zone, but
+ on a `-c 100 -t 4` wrk profile the dominant cost is NOT the
+ adapter — it's the per-accept thread-pool clone3 + the
+ per-Connection ivar allocation that ResponseWriter / Connection
+ still pay before the dispatch_direct! branch. An strace -f
+ sample over 5,000 warm requests shows: 5,000× accept4 +
+ 5,000× clone3 (per-conn submit_connection enqueue → worker
+ spawn / wakeup) + 5,000× write (the StaticEntry buffer) +
+ ~30,000 ancillary syscalls (epoll, recvfrom, futex, …). The
+ StaticEntry path itself is ONE syscall per request as
+ designed — the cache buffer write — but the surrounding
+ connection lifecycle still dominates. Closing the rest is
+ explicitly the subject of 2.10-E (connection fast-path) and
+ 2.10-F (event-loop accept) — both of which now have a clean
+ hand-off API: `route_table.lookup` is the gating call;
+ 2.10-E/F can short-circuit even earlier (before the worker
+ hop) for direct routes.
+
+ **Files added.**
+
+ | File | Purpose |
+ |---|---|
+ | `lib/hyperion/server/route_table.rb` | `RouteTable` class + `StaticEntry` value object. Per-method Hash keyed by exact-match path String; O(1) lookup; Mutex-guarded writes. `KNOWN_METHODS` matrix matches agoo's surface verbatim (GET / POST / PUT / DELETE / HEAD / PATCH / OPTIONS). |
+ | `spec/hyperion/direct_route_spec.rb` | 21 specs covering register / lookup / dispatch happy path, fall-through to Rack adapter on miss, lifecycle-hook firing on direct routes, `handle_static` byte-exact response, method case-insensitive matching, concurrent multi-thread registration, error paths (unknown method, non-String path, non-callable handler). |
+
+ **Files touched.**
+
+ | File | Change |
+ |---|---|
+ | `lib/hyperion/server.rb` | `require_relative 'server/route_table'`; class-level `Server.route_table` singleton + `Server.handle` / `Server.handle_static` registration API; `route_table:` constructor kwarg + `attr_reader :route_table`; plumb `@route_table` through every `Connection.new` site (4 inline-dispatch branches) and into `ThreadPool.new`. |
+ | `lib/hyperion/connection.rb` | `Connection.new` accepts `route_table:` kwarg (defaults to `Hyperion::Server.route_table` singleton). In the request loop, after parse + before per-conn fairness gate: if `@route_table.lookup(method, path)` hits, call `dispatch_direct!` and skip the Rack-adapter path entirely. New private helpers `dispatch_direct!` / `write_direct_response` / `should_keep_alive_after_direct?` — direct dispatch fires `runtime.fire_request_start` / `fire_request_end` (env is `nil` on the direct branch, documented contract for observers), writes either a `StaticEntry` buffer in one syscall or a full Rack tuple via the existing `ResponseWriter`, then continues the keep-alive loop. |
+ | `lib/hyperion/thread_pool.rb` | `ThreadPool.new` accepts `route_table:` kwarg; `:connection` job spawns `Connection.new(..., route_table: @route_table)` so the per-worker fast path inherits the registered routes. |
+
+ **Public Ruby API.**
+
+ ```ruby
+ # Static — response buffer baked at registration time, one
+ # socket.write per hit. The hello-bench win zone.
+ Hyperion::Server.handle_static(:GET, '/health', "OK\n")
+ Hyperion::Server.handle_static(:GET, '/version',
+   '{"v":"1.0"}',
+   content_type: 'application/json')
+
+ # Dynamic — handler#call(request) returns a [status, headers,
+ # body] tuple per request. Bypasses env construction; still
+ # uses ResponseWriter for the writeout.
+ class HealthCheck
+   def call(request)
+     [200, { 'content-type' => 'text/plain' },
+      ["#{Process.pid}\t#{Time.now.to_i}\n"]]
+   end
+ end
+ Hyperion::Server.handle(:GET, '/-/probe', HealthCheck.new)
+ ```
+
+ **Lifecycle hooks invariant.** `Runtime#on_request_start` /
+ `on_request_end` fire on direct routes regardless of the
+ `StaticEntry` vs `[status, headers, body]` branch. The `env`
+ positional is `nil` on the direct path (no Rack env was built
+ — that's the whole point); observers that depend on env keys
+ (e.g. NewRelic transaction names from `PATH_INFO`) should read
+ `request.path` / `request.method` from the `Hyperion::Request`
+ positional instead. Documented + spec-covered.
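+
+ An env-tolerant observer sketch per that contract (`Instrumentation`
+ is a stand-in for whatever APM the operator wires in; the hook arity
+ is assumed from the positionals this entry describes):
+
+ ```ruby
+ # Hedged sketch — observers must not assume a Rack env exists.
+ runtime.on_request_start do |env, request|
+   txn_name = env ? env['PATH_INFO'] : request.path # env is nil on direct routes
+   Instrumentation.start_transaction(txn_name, method: request.method)
+ end
+ ```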
+
+ **Per-process route table.** Forked workers each inherit a
+ copy of the parent process's table at fork time (no IPC, no
+ shared memory). Registrations made BEFORE `Server.start`
+ propagate to every worker via copy-on-write; registrations
+ made AFTER fork (e.g. from `on_worker_boot`) only affect the
+ calling worker — by design, this is the operator's escape
+ hatch for per-worker routing (e.g. a debug endpoint that
+ wants to know which worker served the response).
+
+ **Concurrency.** Registrations are Mutex-guarded; lookups are
+ lock-free (Ruby Hash reads under MRI are safe against a
+ mutex-guarded concurrent write because the GVL pins the
+ writer during the bucket update). Concurrent multi-thread
+ registration is regression-tested via an 8-thread × 100-route
+ stress example.
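+
+ The shape, as a hedged sketch (per-method Hash, exact-match path
+ keys, Mutex on writes only — anything beyond that is assumed):
+
+ ```ruby
+ # Hedged sketch of the RouteTable locking split described above.
+ class RouteTable
+   def initialize
+     @write_mutex = Mutex.new
+     @routes = {} # 'GET' => { '/path' => handler }
+   end
+
+   def register(method, path, handler)
+     @write_mutex.synchronize do
+       (@routes[method.to_s.upcase] ||= {})[path] = handler
+     end
+   end
+
+   # Lock-free read: a plain Hash#[] miss returns nil without mutating.
+   def lookup(method, path)
+     bucket = @routes[method]
+     bucket && bucket[path]
+   end
+ end
+ ```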
+
+ **Spec count.** 989 → 1018 (+21 from `direct_route_spec.rb`,
+ +8 from a parallel 2.10-G h2 timing spec landing in this same
+ window). All 1018 green on macOS arm64.
+
+ ### 2.10-G — Investigate Hyperion h2 max-lat ~40 ms ceiling (instrumentation + fix)
+
+ **Status: RESOLVED.** Instrumentation landed first (`HYPERION_H2_TIMING=1`
+ on `WriterContext`); the bench-host h2load re-run that followed showed
+ the latency ceiling was *not* a first-stream-only cost as hypothesized
+ — it was paid by **every** stream (`min 40.63 ms, mean 44.01 ms` on
+ `-c 1 -m 100 -n 5000`), which is the unmistakable signature of the
+ Linux **delayed-ACK 40 ms timer** interacting with Nagle on small h2
+ framer writes. Fix: enable **TCP_NODELAY** on every accepted socket
+ right after `apply_timeout` runs. Result on the same bench shape
+ (handshake excluded, h2load `-c 1 -m 1 -n 200`):
+
+ | | Pre-fix | Post-fix | Delta |
+ |---|---:|---:|---:|
+ | min request time | 40.62 ms | 542 µs | **−98.7%** |
+ | mean request time | 41.66 ms | 833 µs | **−98.0%** |
+ | max request time | 45.00 ms | 4.71 ms | **−89.5%** |
+ | throughput | 23.98 r/s | 1,141.81 r/s | **+47.6×** |
+
+ (Latency tail collapses from a tight Gaussian centered on the
+ delayed-ACK timer to a sub-millisecond mean — exactly what Falcon and
+ Agoo show on the same workload.)
+
+ The pre-fix instrumentation stays in place — the env-flag-gated
+ `WriterContext` timing slots are durable diagnostic infrastructure for
+ any future cold-stream / first-stream regression. The TCP_NODELAY
+ single-line fix is in `lib/hyperion/server.rb#apply_tcp_nodelay`,
+ called from `apply_timeout` so every accepted connection picks it up
+ (both H1 and H2 paths). Errors swallowed silently — UNIX sockets,
+ SSLSocket-without-`#io`, and platforms missing TCP_NODELAY all
+ gracefully fall through.
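+
+ The fix, as a hedged sketch (the shipped helper is
+ `Server#apply_tcp_nodelay`; the exact body, including which errors
+ get swallowed, is assumed):
+
+ ```ruby
+ require 'socket'
+
+ # Hedged sketch of the accept-time TCP_NODELAY fix described above.
+ def apply_tcp_nodelay(io)
+   raw = io.respond_to?(:io) ? io.io : io # unwrap SSLSocket via #io
+   raw.setsockopt(Socket::IPPROTO_TCP, Socket::TCP_NODELAY, 1)
+ rescue StandardError
+   # UNIX sockets, SSL wrappers without #io, and platforms missing
+   # TCP_NODELAY all fall through silently, per the note above.
+ end
+ ```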
+
+ **Why this beat both hypotheses.** The instrumentation framing
+ expected the cost on the **first** stream of each connection
+ (SETTINGS round-trip or fiber pool warm-up). Reality: protocol-http2
+ emits HEADERS and DATA as separate framer writes per stream; on
+ **every** stream, the server's first packet arrives at the peer
+ alone; the peer holds its ACK up to 40 ms hoping to piggyback it on
+ data, and Hyperion's writer fiber stalls because Nagle buffers the
+ second write until that ACK lands. Setting TCP_NODELAY at accept-time
+ breaks the cycle for every stream, not just the cold one.
+
+ **Filed for 2.11 (no longer the same item).** "h2 first-stream tail
+ optimization (TLS handshake parallelization)" — distinct from the
+ 40 ms ceiling, which is fixed. The instrumentation reads the residual
+ cold-stream cost cleanly now that the ACK noise is gone.
+
+ ---
+
+ *(Everything below is the original 2.10-G writeup, preserved for
+ archaeology as filed — before the TCP_NODELAY fix above landed.)*
+
+ **Background (filed by 2.9-B).** The Falcon h2 head-to-head bench
+ (`bench/h2_falcon_compare.sh`, `h2load -c 1 -m 100 -n 5000`) found
+ Hyperion's max-latency suspiciously **flat at ~40 ms** across all three
+ rows (hello / h2_post / h2_rails_shape) while Falcon's max-latency on
+ the same workloads is **5-10 ms**. RPS is fine — 1,778-2,198 r/s — so
+ this is a tail-latency problem on the **first** stream of each h2
+ connection, not a throughput problem.
+
+ **Hypothesis.** The bench drives 5,000 streams over ONE connection;
+ only stream #1 pays the connection-setup cost. The flat 40 ms ceiling
+ across rows reads as a fixed-cost first-stream setup delay — most
+ likely TLS handshake completion + initial SETTINGS round-trip +
+ framer-fiber priming all serialized on the first stream's response
+ path. Once the connection is warm, subsequent streams should land at
+ ~5 ms and the median+p99 stay healthy; the **max** comes entirely
+ from the cold first stream.
+
+ **What landed (instrumentation only).** Four monotonic timestamps on
+ every h2 connection, gated by `HYPERION_H2_TIMING=1` (off by default —
+ zero hot-path overhead when disabled, single ivar read per branch
+ when enabled):
+
+ | Slot | Captured at |
+ |---|---|
+ | `t0_serve_entry` | `Http2Handler#serve` entry (post-TLS, post-ALPN — the SSL handshake completed before the handler was reached) |
+ | `t1_preface_done` | After `server.read_connection_preface(initial_settings_payload)` returns (server SETTINGS encoded + handed to framer queue; client preface fully read) |
+ | `t2_first_encode` | After the **first** stream's `send_headers` + `send_body` finish encoding (bytes sit in writer queue) |
+ | `t2_first_wire` | After the writer fiber's first successful `socket.write` (first chunk on the wire — typically the server's preface SETTINGS frame) |
+
+ Stored on the per-connection `WriterContext`. Captured exactly once
+ per connection via simple `nil?` guards (no mutex needed — the encode
+ mutex around `send_headers` plus the single-writer-fiber invariant
+ already serialize the writes that matter). Emits one info-level line
+ on connection close:
+
+ ```text
+ {"ts":"...","level":"info","source":"hyperion",
+ "message":"h2 first-stream timing",
+ "t0_to_t1_ms":<preface exchange>,
+ "t1_to_t2_enc_ms":<first stream encode>,
+ "t2_enc_to_t2_wire_ms":<encode→wire>,
+ "t0_to_t2_wire_ms":<total cold-stream cost>}
+ ```
+
841
+ **Next bench window (the 60-second drill).** SSH to the bench host,
842
+ enable instrumentation, re-run `h2load -c 1 -m 100 -n 5000`, grep for
843
+ `'h2 first-stream timing'`. Three diagnostic shapes:
844
+
845
+ 1. `t0_to_t1_ms` ≈ 40 ms, others ~0 → fix is to parallelize the
846
+ server-preface SETTINGS write with the kernel TLS handshake
847
+ completion (currently they're serial — TLS handshake completes
848
+ inside `Hyperion::Tls`, then `Http2Handler#serve` runs, then
849
+ preface is written).
850
+ 2. `t1_to_t2_enc_ms` ≈ 40 ms, others ~0 → fix is to **pre-spawn the
851
+ stream-dispatch fiber pool** at connection accept rather than
852
+ lazily on the first `ready_ids` tick. Today the first dispatch
853
+ fiber is spawned by `task.async { dispatch_stream(...) }` only
854
+ AFTER the first complete HEADERS frame is read; an Async scheduler
855
+ tick is needed before that fiber runs and reaches `send_headers`.
856
+ 3. Spread across both → both fixes apply. Either way the deltas tell
857
+ the operator exactly where to cut.
858
+
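+ Concretely, the drill is four shell lines. A sketch assuming the 2.9-B
+ harness conventions (the port and log destination are illustrative):
+
+ ```sh
+ ssh ubuntu@openclaw-vm
+ HYPERION_H2_TIMING=1 bin/hyperion --tls-cert /tmp/cert.pem \
+   --tls-key /tmp/key.pem -t 64 -w 1 >/tmp/hyperion.log 2>&1 &
+ h2load -c 1 -m 100 -n 5000 https://127.0.0.1:9750/
+ grep 'h2 first-stream timing' /tmp/hyperion.log   # read the four deltas
+ ```
+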
859
+ **Files touched (instrumentation-only commit).**
860
+
861
+ | File | Change |
862
+ |---|---|
863
+ | `lib/hyperion/http2_handler.rb` | Added 4 timing slots to `WriterContext` + capture sites in `serve` (t0/t1), `dispatch_stream` (t2_encode), `run_writer_loop` (t2_wire). New private helpers `monotonic_now` and `log_h2_first_stream_timing`. All gated by `@h2_timing_enabled` (resolved once at handler-construction from `HYPERION_H2_TIMING`). |
864
+ | `spec/hyperion/http2_first_stream_timing_spec.rb` | 8 new specs locking the contract: env-flag gating (3 cases — default off, truthy-on, truthy-off), nil-default WriterContext slots, log-emit shape (deltas in ms, non-negative, ordered), partial-capture short-circuit (nil timestamp → no log), best-effort error swallow. **No assertion on absolute latency** (would be CI-flaky). |
865
+
866
+ **Spec count.** 989 → 997 (+8). All green on macOS arm64.
867
+
868
+ **Why no fix this sprint.** Step 2 / step 4 of the investigation plan
869
+ required SSH to `openclaw-vm` for live h2load runs to read the
870
+ breakdown. That access path is currently rejecting the operator's
871
+ on-disk SSH key from this workstation (the same root-cause class
872
+ documented in 2.9-E — environmental, not code). Rather than guess at the right
873
+ fix from a hypothesis (and risk a regression on the 2.5-B Rust HPACK
874
+ win or the 1.6.0 writer-fiber refactor), the instrumentation is
875
+ landing standalone so the bench window after this can resolve it in
876
+ one shot. The 40 ms is a tail-latency knob; the median + p99 stay
877
+ healthy and 2.10's 4-way bench results stand as published.
878
+
879
+ **Filed forward to 2.11.** "h2 max-lat fix based on 2.10-G timing
880
+ breakdown" — owner runs the timing bench, reads the dominant bucket,
881
+ implements one of the two candidate fixes above (or files a third if
882
+ the breakdown reveals something unexpected). The instrumentation
883
+ itself is durable infrastructure beyond this single investigation:
884
+ any future "first-stream slow" / "cold-connection latency" bug hits
885
+ the same probe and gets the same diagnostic shape.
886
+
887
+ ### Filed for later 2.10 streams
888
+
889
+ (populated as 2.10-D..H land)
890
+
891
+ ## [2.9.0] - 2026-05-01
892
+
893
+ ### Headline
894
+
895
+ A measurement-correction + observability + ops-fix release. The 2.9
896
+ sprint resolved the deferred 2.7 / 2.8 bench items now that openclaw-vm
897
+ is back online, plus shipped the 2.8-A per-route deflate metric that
898
+ was held mid-sprint, plus closed the recurring "Permission denied
899
+ (publickey)" subagent-SSH gap.
900
+
901
+ | Item | Result |
902
+ |---|---|
903
+ | **2.9-A — chunk-size A/B (2.6-A delta quantified)** | **+7.4% rps** for 256 KiB vs 64 KiB chunk on fresh host (3,358 vs 3,128 r/s median, p99 identical). Real but smaller than 2.6-A's inflated +20.7% (which was measured against a degraded baseline). |
904
+ | **2.9-B — Falcon h2 head-to-head** | **hello: parity. h2_post: Falcon +10%. h2_rails_shape: Hyperion +58%** (Rust HPACK earning its keep on header-heavy responses). 2.10-A finding filed: Hyperion's max-lat stuck at ~40 ms vs Falcon's 5-10 ms (suspect first-stream setup delay). |
905
+ | **2.9-C — Per-route deflate ratio** | `hyperion_websocket_deflate_ratio` now carries a `route` label (explicit `env['hyperion.websocket.route']` or `env['PATH_INFO']` templated). Multi-channel ActionCable operators see per-channel compression. Cardinality bounded by templater LRU. |
906
+ | **2.9-D — Matched-config PG bench** | **Hyperion +29.8% rps, p99 26% lower** at matched pool=80 against local PG (max_conn=100). The 2.0.0 "+378% / 4.78×" was a config artifact; honest architectural advantage is +30% rps + 26% lower tail. |
907
+ | **2.9-E — SSH subagent gap** | Documented in `docs/BENCH_HOST_SETUP.md`. Fix: `IdentitiesOnly yes` + explicit `IdentityFile` in `~/.ssh/config`. Verified hermetic-env reproducibility. |
908
+
909
+ Spec count: 956 (2.8.0) → **964** (2.9.0). 0 failures, 11 pending.
910
+
911
+ ### Filed for 2.10
912
+
913
+ - **2.10-A** Hyperion h2 max-lat ~40 ms ceiling (vs Falcon's 5-10 ms). Suspect first-stream setup delay.
914
+
915
+ ### 2.9-E — Fix recurring SSH "Permission denied (publickey)" subagent gap
916
+
917
+ Fixes the recurring "Permission denied (publickey)" gap that has
918
+ blocked subagent bench runs against `openclaw-vm` since at least
919
+ Phase 9 (2.2.x) — every bench-running subagent from Phase 9/10/11,
920
+ 2.2.x fix-A..E, 2.3-A..D, 2.5-B/D, 2.6-A..D, 2.7-A/C/D/F, 2.8-A and
921
+ 2.9-B has reported "SSH not available, deferred to maintainer".
922
+
923
+ **Root cause.** The maintainer's `~/.ssh/config` had `Host openclaw-vm`
924
+ + `IdentityFile ~/.ssh/id_ed25519_woblavobla` but no `IdentitiesOnly yes`
925
+ and no explicit `User ubuntu`. Interactive shells worked because
926
+ macOS Keychain loaded the key into `ssh-agent`, but subagent shells
927
+ inherited an empty `SSH_AUTH_SOCK` (or one populated with other
928
+ host keys), so OpenSSH offered the wrong identities and the bench
929
+ host rejected them before falling through to the on-disk file.
930
+
931
+ **Fix.** Add `IdentitiesOnly yes` + `User ubuntu` + `HostName
932
+ 192.168.31.14` to the workstation's `~/.ssh/config` block. With
933
+ `IdentitiesOnly yes`, OpenSSH ignores the agent entirely for that
934
+ host and uses only the listed `IdentityFile`, making the config
935
+ robust to any process-environment state (no agent, empty agent,
936
+ agent-with-other-keys).
937
+
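+ Concretely, the resulting `~/.ssh/config` block (directives as named
+ above):
+
+ ```text
+ Host openclaw-vm
+   HostName 192.168.31.14
+   User ubuntu
+   IdentityFile ~/.ssh/id_ed25519_woblavobla
+   IdentitiesOnly yes
+ ```
+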
938
+ **Verification.** Hermetic-shell SSH works with the new config:
939
+
940
+ ```sh
941
+ env -i HOME=$HOME PATH=$PATH ssh -o ConnectTimeout=5 ubuntu@openclaw-vm date
942
+ # → Fri May 1 08:36:47 UTC 2026
943
+ ```
944
+
945
+ This is the same execution context every subagent inherits, so
946
+ future bench tasks no longer hit the "deferred to maintainer" wall.
947
+
948
+ **No code changes.** 2.9-E is operator-setup + docs only:
949
+ new `docs/BENCH_HOST_SETUP.md` documents the gap, the fix, and the
950
+ verification command for any future maintainer who hits the same
951
+ `Permission denied (publickey)` wall. The actual `~/.ssh/config`
952
+ edit happens on the controller workstation; the in-repo doc is the
953
+ durable record.
954
+
955
+ ### 2.9-D — Matched-config PG bench (honest ratio quantified)
956
+
957
+ The 2.0.0 BENCH row 7's "PG +378% / 4.78×" was apples-to-oranges
958
+ (Puma tested at pool=100 against local PG max_conn=100, timed out
959
+ 200/200 wrk requests at the pool ceiling; Hyperion tested against
960
+ WAN PG max_conn=500). The 2.6-E audit annotated the row as
961
+ "honest matched ratio ~2.2×" without a clean rerun.
962
+
963
+ 2.9-D ran the clean matched-config bench against the local PG
964
+ (max_conn=100) with both servers at pool=80 (under ceiling, no
965
+ timeouts):
966
+
967
+ | Server | Run 1 | Run 2 | Run 3 | Median | p99 |
968
+ |---|---:|---:|---:|---:|---:|
969
+ | Hyperion `--async-io -t 5 -w 1` pool=80 | 1,568 | 1,562 | 1,565 | **1,565 r/s** | 138 ms |
970
+ | Puma `-t 80:80 -w 1` pool=80 | 1,104 | 1,211 | 1,206 | **1,206 r/s** | 186 ms |
971
+
972
+ **Verdict: Hyperion +29.8% rps, p99 26% lower.** Real, durable,
973
+ matched-config Hyperion win. The original 4.78× / 378% ratio was a
974
+ config artifact (Puma at the ceiling); the honest architectural
975
+ advantage of async-io fiber pool over threadpool on this workload
976
+ is +30% rps + 26% lower tail.
977
+
978
+ `docs/BENCH_HYPERION_2_0.md` row 7 will be updated to lead with
979
+ the verified +30% number alongside the historical 4.78× as a
980
+ deprecated framing.
981
+
982
+ ### 2.9-A — sendfile chunk-size A/B on fresh host (2.6-A delta quantified)
983
+
984
+ The 2.7-A bisect found that 2.6-A's "+20.7% rps from chunk size 64 KiB
985
+ → 256 KiB" was measured against a degraded-host baseline; both numbers
986
+ were ~3× lower than the algorithmic floor. 2.9-A re-runs the A/B on
987
+ the fresh-boot host to quantify the actual chunk-size delta.
988
+
989
+ **Method.** `lib/hyperion/http/sendfile.rb` `USERSPACE_CHUNK` toggled
990
+ between `256 * 1024` (current default) and `64 * 1024` (pre-2.6-A
991
+ value). Same harness: `bin/hyperion -t 5 -w 1`, 1 MiB asset,
992
+ `wrk -t4 -c100 -d20s`, 3 runs each, take median.
993
+
994
+ | Config | Run 1 | Run 2 | Run 3 | Median |
995
+ |---|---:|---:|---:|---:|
996
+ | 256 KiB (current) | 2,958 | 3,359 | 3,374 | **3,358 r/s** |
997
+ | 64 KiB (pre-2.6-A) | 3,128 | 3,424 | 2,933 | **3,128 r/s** |
998
+
999
+ **Verdict.** **+7.4% rps for 256 KiB**, p99 essentially identical
1000
+ (2.4-2.86 ms across both configs — the chunk-size change does not
1001
+ affect tail latency). Real but smaller than the 2.6-A "+20.7%" claim.
1002
+ The headline win is preserved (256 KiB matches nginx/Apache defaults,
1003
+ 4× fewer syscalls per 1 MiB request) and corroborated by the bench;
1004
+ the inflated number from 2.6-A's degraded-host baseline is now
1005
+ corrected.
1006
+
1007
+ **No code changes from 2.9-A.** `USERSPACE_CHUNK` stays at 256 KiB.
1008
+
1009
+ ### 2.9-B — Falcon h2 head-to-head (the apples-to-apples h2 comparison owed since 2.6-E)
1010
+
1011
+ Rows 10/11 of `BENCH_HYPERION_2_0.md` have carried the framing "Puma 8
1012
+ lacks native h2 — Falcon comparison owed" since 2.6-E. Falcon 0.55+
1013
+ ships native h2 and is the apples-to-apples comparison for Hyperion's
1014
+ h2 path; 2.7-E was deferred when openclaw-vm went offline + Falcon
1015
+ wasn't installed. 2.9-B installs Falcon 0.55.3 alongside Hyperion on
1016
+ the fresh-boot host and runs the matched harness.
1017
+
1018
+ **Verdict (lead with the headline).** Hyperion wins on Rails-shape
1019
+ (25-header) h2 by **+58% rps** (1,778 vs 1,125 r/s) — the Rust-native
1020
+ HPACK shipped in 2.5-B earns its keep on real-shape responses. Falcon
1021
+ wins on h2 POST by **+9.7% rps** (1,734 vs 1,580). Hello (2-header) is
1022
+ **parity** (within 2%, inside bench noise). **Falcon wins max-latency
1023
+ across all three rows by 4-7×** (Falcon 5-10 ms; Hyperion flat ~40 ms);
1024
+ filed as 2.10-A follow-up.
1025
+
1026
+ **Method.** `h2load -c 1 -m 100 -n 5000`, 3 runs each, median.
1027
+ Hyperion `bin/hyperion --tls-cert /tmp/cert.pem --tls-key /tmp/key.pem
1028
+ -t 64 -w 1 --h2-max-total-streams unbounded` with default-on Rust
1029
+ HPACK (boot log: `mode: native (Rust v2 / Fiddle)`,
1030
+ `hpack_path: native-v2`). Falcon `falcon serve --hybrid -n 1 --forks 1
1031
+ --threads 5` (single-process, 5 threads). Same self-signed RSA-2048
1032
+ TLS cert. All 18 runs landed 5,000 / 5,000 succeeded, 0 errored.
1033
+
1034
+ | Rackup | Hyperion rps | Falcon rps | Δ rps | Hyperion max-lat | Falcon max-lat |
1035
+ |---|---:|---:|---:|---:|---:|
1036
+ | `hello.ru` (2 hdrs) | **2,198** | 2,152 | +2.1% (parity) | 40.72 ms | **5.58 ms** |
1037
+ | `h2_post.ru` | 1,580 | **1,734** | Falcon +9.7% | 40.84 ms | **9.94 ms** |
1038
+ | `h2_rails_shape.ru` (25 hdrs) | **1,778** | 1,125 | Hyperion +58.0% | 40.69 ms | **6.44 ms** |
1039
+
1040
+ **Reading.** Both servers are good; operators terminating h2 on the
1041
+ wire have a real choice. Pick Hyperion for Rails-shape h2 (the +58%
1042
+ ships measurable extra capacity per box). Pick Falcon for POST-heavy
1043
+ h2 endpoints or if max-latency tail is the priority. Hello is a coin
1044
+ flip.
1045
+
1046
+ **Bench harness shipped.** `bench/h2_falcon_compare.sh` — runs all 6
1047
+ combinations and prints per-rackup median rps + max-latency. Drives
1048
+ both Hyperion (`bin/hyperion`) and Falcon (`falcon serve`) on the same
1049
+ TLS cert + same h2load envelope. Re-runnable:
1050
+ `~/hyperion/bench/h2_falcon_compare.sh`.
1051
+
1052
+ **No production code changes.** Bench-only commit. Spec count
1053
+ unchanged — bench scripts and BENCH doc only.
1054
+
1055
+ **2.10-A filed.** Hyperion's flat ~40 ms first-stream max-latency on
1056
+ h2 is a real and operator-visible tail-latency gap vs Falcon's ~6 ms.
1057
+ Reads as a fixed-cost setup delay (TLS handshake + initial SETTINGS
1058
+ exchange + first stream's framer-fiber priming), not a throughput
1059
+ issue. Investigation owed in the next bench window — does NOT block
1060
+ 2.9 ship.
1061
+
1062
+ ### 2.9-C — per-route permessage-deflate ratio histogram
1063
+
1064
+ `hyperion_websocket_deflate_ratio` (shipped in 2.4-C as a process-wide
1065
+ histogram) now carries a `route` label. Operators running ActionCable /
1066
+ pubsub apps with multiple channels (chat, notifications, presence,
1067
+ telemetry — each with different payload shapes) can finally see which
1068
+ channel is paying the Zlib tax for what compression yield.
1069
+
1070
+ **Resolution.** `Hyperion::WebSocket::Connection.new` accepts two new
1071
+ kwargs (`env:`, `route:`); the route label resolves exactly once at
1072
+ construction:
1073
+
1074
+ 1. Explicit `route:` kwarg (test / library users)
1075
+ 2. `env['hyperion.websocket.route']` (operator-named channel)
1076
+ 3. `Hyperion::Metrics.default_path_templater.template(env['PATH_INFO'])`
1077
+ (auto — `/notifications/123` → `/notifications/:id`)
1078
+ 4. `'unrouted'` fallback for connections built without `env`
1079
+
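+ A sketch of that cascade (illustrative: the method and ivar spellings
+ are assumptions; the precedence and the frozen one-element cache follow
+ this entry):
+
+ ```ruby
+ def resolve_route_label(route:, env:)
+   return route if route                                  # 1. explicit kwarg
+   if env
+     named = env['hyperion.websocket.route']
+     return named if named                                # 2. operator-named channel
+     return Hyperion::Metrics.default_path_templater
+                             .template(env['PATH_INFO'])  # 3. auto-templated path
+   end
+   'unrouted'                                             # 4. built without env
+ end
+
+ # Resolved exactly once at construction, then cached frozen so the
+ # per-message observation reuses the same label tuple:
+ @route_labels = [resolve_route_label(route: route, env: env)].freeze
+ ```
+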
1080
+ **Hot path.** The resolved label tuple is cached on the `Connection` as
1081
+ a frozen one-element Array. The per-message observation is a single
1082
+ mutex-guarded Hash lookup against the cached ref — no per-frame regex
1083
+ walk, no per-frame allocation. `yjit_alloc_audit_spec` stays at
1084
+ ≤ 10.0 objects/req on the full HTTP path (the deflate path doesn't
1085
+ intersect that audit, but the same zero-allocation discipline applies).
1086
+
1087
+ **Cardinality.** Bounded by the templater's LRU (default 1000 entries
1088
+ — the same bound that protects `hyperion_request_duration_seconds`'s
1089
+ `path` label).
1090
+
1091
+ **Backwards compatibility.** Pre-2.9-C dashboards that summed the
1092
+ unlabeled histogram keep working: `sum without (route) (rate(...))`
1093
+ recovers the prior process-wide signal.
1094
+
1095
+ Files: `lib/hyperion/websocket/connection.rb`,
1096
+ `spec/hyperion/websocket_per_route_deflate_spec.rb` (8 new specs;
1097
+ total → 964), `docs/OBSERVABILITY.md` "Per-route deflate ratio (2.9-C)"
1098
+ subsection, `docs/grafana/hyperion-2.4-dashboard.json` two new panels
1099
+ ("Deflate ratio by route (p50/p99) — 2.9-C").
1100
+
1101
+ ## [2.8.0] - 2026-05-01
1102
+
1103
+ ### Headline
1104
+
1105
+ A measurement-correction release. No new code-path perf work; the 2.8
1106
+ sprint resolved the deferred 2.7 bench items now that openclaw-vm is
1107
+ back online (fresh boot, clean host).
1108
+
1109
+ | Item | Result |
1110
+ |---|---|
1111
+ | **2.7-A bisect (deferred from 2.7)** | **NO REGRESSION.** Full bisect across v2.0.1 → master shows all versions at **2,884-3,504 r/s, p99 2.25-2.69 ms** on static 1 MiB (variance ~15%, no step-down). The audit's 1,094-1,697 r/s readings were all bench-host degradation artifacts. **True algorithmic floor: ~3,000 r/s, p99 ~2.5 ms** — substantially better than every published BENCH figure. |
1112
+ | **2.7-C SSE cross-server (deferred from 2.7)** | Hyperion **510 r/s, p99 2.42 ms** vs Puma **133 r/s, p99 9.64 ms** on `bench/sse_generic.ru`. **+281% rps, 4× lower p99.** Honest cross-server number replaces the prior "Puma can't stream" misclaim (which was a Hyperion-flush-sentinel rackup issue). |
1113
+ | **2.7-F warm-cache validation (deferred from 2.7)** | Master HEAD with fadvise hoisted-once: 3,003 / 2,942 / 2,832 r/s median **2,942 r/s, p99 2.7 ms**. Within run-to-run noise of the bisect's master baseline (3,041 r/s). **No warm-cache regression. 2.7-F STAYS.** |
1114
+ | **2.7-D matched-PG (deferred from 2.7)** | Still deferred. WAN PG (pg.wobla.space) returned "server closed connection unexpectedly" mid-bench from Puma side; the WAN PG is unreliable for matched comparisons. Needs a quieter PG window. |
1115
+ | **2.7-E Falcon h2 (deferred from 2.7)** | Still deferred. Falcon not installed on bench host; staging follow-up for 2.9. |
1116
+
1117
+ ### What changed in the BENCH doc
1118
+
1119
+ - Row 4 (Static 1 MiB): added the fresh-boot **3,041 r/s** measurement alongside the historical 1,809 / degraded-host 1,228 figures. The audit's "-32% drift since 2.0.0" framing was wrong; drift was synthetic.
1120
+ - Row 6b (SSE generic): replaced "pending" with the verified Hyperion **+281% / 4× lower p99** numbers.
1121
+ - Spot-check addendum widened to a 3-column table (published / degraded / fresh-boot) so future readers can see the host-degradation effect on absolute numbers.
1122
+
1123
+ ### Implications for prior releases
1124
+
1125
+ The 2.6-A "+20.7% on static 1 MiB (1,094 → 1,320 r/s)" headline was technically valid as measured (both numbers came off the degraded host that day) but the absolute baseline was wrong by ~3×. The chunk-size bump (64 KiB → 256 KiB) likely still helps; we just can't quantify the delta from those bench runs. A clean A/B re-run on the fresh host is filed as a 2.9 follow-up.
1126
+
1127
+ ### What didn't change
1128
+
1129
+ No production code changes. Spec count unchanged at 956. CI green.
1130
+
1131
+ ### Filed for 2.9
1132
+
1133
+ 1. Falcon h2 head-to-head bench (install Falcon on openclaw-vm + run matched h2load harness)
1134
+ 2. Matched-config WAN-PG bench in a quiet PG window
1135
+ 3. 2.6-A chunk-size A/B on the fresh host (quantify the actual delta from chunk=64 KiB vs 256 KiB)
1136
+ 4. Per-route permessage-deflate ratio histogram (was 2.8-A, deferred when sprint was held)
1137
+
1138
+ ## [2.7.0] - 2026-05-01
1139
+
1140
+ ### Headline
1141
+
1142
+ A doc-accuracy + spec-stability release with two code-level perf items
1143
+ (2.7-B spec fix, 2.7-F retry of 2.6-B). Three bench-host-dependent
1144
+ items (2.7-A bisect, 2.7-D matched-PG rerun, 2.7-E Falcon h2) deferred
1145
+ to the next bench window — openclaw-vm went offline mid-sprint after
1146
+ the 2.6-E doc audit raised the questions these would answer.
1147
+
1148
+ | Stream | Result |
1149
+ |---|---|
1150
+ | 2.7-A — Static 1 MiB regression bisect | **COMPLETED 2026-05-01.** Verdict: **NO REGRESSION** — bench-host drift. Fresh-boot bisect across v2.0.1 → master shows all versions at **2,884-3,504 r/s, p99 2.25-2.69 ms** (variance ~15%, flat). The audit's 1,094-1,697 r/s figures were all host-degradation artifacts. True algorithmic floor on this row: **~3,000 r/s, p99 ~2.5 ms** — substantially better than every published BENCH figure. |
1151
+ | 2.7-B — lifecycle_hooks_spec :share macOS flake | FIXED. Tighter readiness probe (poll 100ms, 30s ceiling). 5/5 local + 3/3 CI green. Real race diagnosed: master binds before workers trap SIGTERM; macOS GH runner timing exposes a microseconds-wide window. No production fix owed (operators don't run :share on macOS; on Linux the window closes too fast for human-scale TERMs). |
1152
+ | 2.7-C — Generic SSE rackup | SHIPPED `bench/sse_generic.ru` + BENCH row 6b. Cross-server bench (Hyperion vs Puma) deferred — host offline. Honest framing replaces the prior "Puma can't stream SSE" misclaim (which was a Hyperion-flush-sentinel rackup issue, not a Puma capability gap). |
1153
+ | 2.7-D — Matched-config WAN-PG Puma rerun | DEFERRED. Needs both bench host AND quiet WAN-PG window. The 2.6-E annotation (apples-to-apples ratio ~2.2× vs the published 4.78×) stands. |
1154
+ | 2.7-E — Falcon h2 head-to-head | DEFERRED. Needs bench host + Falcon install. The 2.6-E reframe ("Puma 8 lacks native h2 — Falcon comparison owed") stands. |
1155
+ | 2.7-F — fadvise hoisted ONCE per response | SHIPPED with bench validation deferred. Architecturally correct (one `posix_fadvise` call at Ruby loop entry; spec asserts exactly-1-call-per-response). Warm-cache must be ±1% of 2.6.0 1,320 r/s baseline when bench reruns; if it regresses, revert (same disposition as 2.6-B). |
1156
+
1157
+ Spec count: 951 (2.6.0) → **956** (2.7.0). 0 failures, 11 pending.
1158
+
1159
+ ### Production-relevant takeaway for nginx-fronted operators
1160
+
1161
+ No new measured perf wins this release — 2.7 was primarily a stability +
1162
+ honest-doc + deferred-bench cycle. The 2.6.0 +20.7% static win remains
1163
+ the most recent measured headline.
1164
+
1165
+ ### What's queued for 2.8 (when openclaw-vm returns)
1166
+
1167
+ - Run 2.7-A bisect via the pre-staged script
1168
+ - Run 2.7-C cross-server SSE bench
1169
+ - Run 2.7-D matched-PG bench
1170
+ - Run 2.7-E Falcon h2 bench
1171
+ - Validate 2.7-F warm/cold cache deltas; revert if regression
1172
+ - Resolve the "1,697 r/s published vs 1,228 r/s today" static-1-MiB gap
1173
+
1174
+ ### Spec stability
1175
+
1176
+ The flaky `lifecycle_hooks_spec :share` test that has flaked on macOS
1177
+ CI since at least 2.5.0 is now stable across 6 consecutive macOS GH
1178
+ runs (2 Ruby versions × 3 attempts in the 2.7-B verification).
1179
+
1180
+ ### 2.7-F — `posix_fadvise(SEQUENTIAL)` hoisted once per response (retry of 2.6-B)
1181
+
1182
+ 2.6-B added `posix_fadvise(fd, 0, len, POSIX_FADV_SEQUENTIAL)` inside
1183
+ the C primitive's per-chunk body — `rb_sendfile_copy`,
1184
+ `rb_sendfile_copy_splice`, and `rb_sendfile_copy_splice_into_pipe`
1185
+ each issued the hint once per kernel round. After 2.6-A's
1186
+ chunk-cap bump to 256 KiB, that worked out to **4 fadvise64 syscalls
1187
+ per 1 MiB warm-cache response**, all of them no-ops because the
1188
+ page cache already held the data. Maintainer's openclaw-vm bench
1189
+ measured **-6.6% warm-cache** (1,289 → 1,204 r/s, median of 3); the
1190
+ commit was reverted (4cd8009).
1191
+
1192
+ **Why the hoist makes sense architecturally.** A readahead hint
1193
+ operates on a *file*, not a *kernel call* — it tells the kernel
1194
+ "this fd will be read sequentially from offset 0 for `len` bytes,
1195
+ prefetch accordingly". The right cardinality is once per
1196
+ *response* (one file → one hint), not once per *chunk* (one file
1197
+ → N hints depending on chunk size). 2.6-B got the cardinality
1198
+ wrong; 2.7-F gets it right.
1199
+
1200
+ **Where the call lives.** `Hyperion::Http::Sendfile.native_copy_loop`
1201
+ in `lib/hyperion/http/sendfile.rb` calls a new
1202
+ `maybe_fadvise_sequential(file_io, len)` helper at function entry,
1203
+ *before* dispatching into `splice_copy_loop` or
1204
+ `plain_sendfile_loop`. The helper gates on three conditions:
1205
+ the C ext defines `fadvise_sequential` (Linux only),
1206
+ `len >= FADVISE_THRESHOLD` (256 KiB — files smaller fit in a single
1207
+ sendfile / splice round and don't benefit from prefetch), and
1208
+ `real_fd?(file_io)` (StringIO / mock IOs without a kernel fd are
1209
+ skipped). The C primitive (`rb_sendfile_fadvise_sequential` in
1210
+ `ext/hyperion_http/sendfile.c`) is a thin wrapper around
1211
+ `posix_fadvise(2)` that returns `:ok` / `:noop` (non-Linux build) /
1212
+ `:error`; the Ruby caller ignores the return value. Net warm-
1213
+ cache impact: **at most 1 extra syscall per response** (≤1%) vs
1214
+ 2.6.0's zero.
1215
+
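+ The gate, in shape (a sketch: the `Native` receiver and the `real_fd?`
+ spelling are assumptions; the three conditions and the ignored return
+ value are as described above):
+
+ ```ruby
+ FADVISE_THRESHOLD = 256 * 1024
+
+ def maybe_fadvise_sequential(file_io, len)
+   return unless Native.respond_to?(:fadvise_sequential) # C ext built (Linux)
+   return if len < FADVISE_THRESHOLD                     # one-round files gain nothing
+   return unless real_fd?(file_io)                       # StringIO / mocks: no kernel fd
+   Native.fadvise_sequential(file_io.fileno, len)        # :ok / :noop / :error, ignored
+   nil
+ end
+ ```
+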
1216
+ **Verification (local, macOS arm64, Ruby 3.3.3).** 956 examples
1217
+ pass, 0 failures, 11 pending (was 951 / 0 / 11 — the 5 new
1218
+ examples are 2.7-F's behavioural specs). The C primitive returns
1219
+ `:noop` on Darwin: the Ruby gate still invokes the helper
1220
+ (`respond_to?(:fadvise_sequential)` is true), but the underlying call
1221
+ is the no-op variant — no behaviour change on non-Linux hosts.
1222
+
1223
+ **Bench validation — DEFERRED.** openclaw-vm was unreachable from
1224
+ the controller session at landing (SSH timeout, same condition as
1225
+ 2.7-A and 2.7-C bench reruns), so the bench rerun is queued for
1226
+ the next bench-host run. The criteria the bench must hit:
1227
+
1228
+ - **Warm-cache** (the typical bench harness, 100 long-lived wrk
1229
+ keep-alive connections): 2.7-F vs 2.6.0 baseline 1,320 r/s on
1230
+ the 1 MiB static row — **must be within ±1% (or +)**. If
1231
+ warm-cache regresses by even 1%, **revert 2.7-F** the same way
1232
+ 2.6-B was reverted. Don't ship a measured regression to chase
1233
+ a theoretical cold-cache win.
1234
+ - **Cold-cache** (`vm.drop_caches=3` between each request): 2.7-F
1235
+ should show **measurable +5-10%** vs no-fadvise. Cold-cache
1236
+ isn't the production hot path (assets sit in page cache), but
1237
+ it's the workload where the hint actually does something — if
1238
+ this row doesn't move, the hint provides no value at any
1239
+ cardinality and we should consider dropping it entirely on the
1240
+ next perf pass.
1241
+
1242
+ **Files touched.**
1243
+ - `ext/hyperion_http/sendfile.c` — new `rb_sendfile_fadvise_sequential`
1244
+ primitive + 3 new symbols (`:ok` / `:noop` / `:error`); ~50 lines
1245
+ of C plus the singleton-method registration.
1246
+ - `lib/hyperion/http/sendfile.rb` — new `FADVISE_THRESHOLD` constant
1247
+ (256 KiB), new private `maybe_fadvise_sequential` helper, single
1248
+ call site at the top of `native_copy_loop` (covers both
1249
+ `splice_copy_loop` and `plain_sendfile_loop` branches).
1250
+ - `spec/hyperion/http_sendfile_spec.rb` — new `2.7-F — fadvise
1251
+ hoisted once per response` describe block: 5 examples covering
1252
+ the constant, the C primitive contract, the once-per-response
1253
+ invocation count (the regression-killer assertion vs 2.6-B's
1254
+ per-chunk shape), and both skip paths (small-file `copy_small`
1255
+ route + 100 KiB streaming below `FADVISE_THRESHOLD`).
1256
+
1257
+ **Not touched.** `copy_to_socket_blocking` (the `:inline_blocking`
1258
+ dispatch path). Same readahead logic applies in principle, but
1259
+ the blocking path's spec surface is wider and the warm/cold bench
1260
+ numbers should drive that decision. Filed for 2.7.x if the
1261
+ deferred bench rerun shows clear cold-cache value.
1262
+
1263
+ ### 2.7-A — Static 1 MiB regression bisect — COMPLETED 2026-05-01
1264
+
1265
+ **Status: COMPLETED. Verdict: NO REGRESSION — bench-host drift.**
1266
+
1267
+ After openclaw-vm came back online (fresh boot, 5 min uptime), the full
1268
+ bisect ran across `v2.0.1 → master`. All versions land in the
1269
+ **2,884 → 3,504 r/s range with p99 2.25-2.69 ms** — variance ~15%
1270
+ inside one workload, no algorithmic step-down between any pair of tags.
1271
+
1272
+ | Tag | Median r/s | p99 |
1273
+ |--------|-----------:|-------:|
1274
+ | v2.0.1 | 3,353 | 2.28 ms |
1275
+ | v2.1.0 | 3,504 | 2.25 ms |
1276
+ | v2.2.0 | 3,082 | 2.68 ms |
1277
+ | v2.3.0 | 3,034 | 2.64 ms |
1278
+ | v2.4.0 | 2,884 | 2.69 ms |
1279
+ | v2.5.0 | 3,041 | 2.50 ms |
1280
+ | v2.6.0 | 3,029 | 2.44 ms |
1281
+ | master (post-2.7) | 3,041 | 2.50 ms |
1282
+
1283
+ **The audit's 1,094-1,697 r/s readings — and the partial v2.6.0 = 1,230
1284
+ r/s data point captured during the prior offline event — were all
1285
+ bench-host degradation artifacts** (TIME_WAIT pile-up, neighbor-VM
1286
+ contention, kernel cruft accumulated since the host's last reboot).
1287
+ The fresh-boot run shows the actual algorithmic floor on this row is
1288
+ **~3,000 r/s, p99 ~2.5 ms** — substantially better than every published
1289
+ BENCH figure to date.
1290
+
1291
+ **Implications.** The 2.6-A "+20.7% on static 1 MiB (1,094 → 1,320 r/s)"
1292
+ delta was technically valid as measured (both numbers came off the
1293
+ already-degraded host that day) but the absolute baseline was wrong by
1294
+ ~3×. The chunk-size bump 2.6-A made still helps; we just can't
1295
+ quantify by how much from those bench runs. A clean A/B re-run of
1296
+ 2.6-A's chunk-size change against master (chunk=64K vs 256K) on the
1297
+ fresh host is filed for 2.8.x as a follow-up.
1298
+
1299
+ **Documentation update.** `docs/BENCH_HYPERION_2_0.md` row 4 will be
1300
+ updated with the today-baseline `~3,000 r/s` number alongside the
1301
+ historical `1,697 r/s` figure marked as "2026-04-29 host conditions,
1302
+ not algorithmically valid". The audit's 2.6-E table that flagged
1303
+ "-32% drift since 2.0.0" is also wrong — the drift was synthetic.
1304
+
1305
+ **Background.** The 2.6-E audit flagged that the static 1 MiB row
1306
+ drifted from 2.0.1's published **1,697 r/s** to today's **1,228 r/s**
1307
+ — a **-28% slide** on the production-relevant row. Three hypotheses
1308
+ remain open and the bisect is needed to discriminate between them:
1309
+
1310
+ 1. **Bench-host degradation** (kernel updates, TIME_WAIT pile-up,
1311
+ contention from other procs on openclaw-vm) — under this hypothesis
1312
+ all v2.x tags would show ~1,200 r/s today and the floor has simply
1313
+ drifted; the published 1,697 r/s number reflected 2026-04-29 host
1314
+ conditions and the relative Hyperion-vs-Puma comparison still holds.
1315
+ 2. **Genuine Hyperion regression** hidden by misframed numbers — some
1316
+ tag between 2.0.1 and 2.6.0 dropped rps and we never noticed because
1317
+ the bench number we were tracking was wrong.
1318
+ 3. **Ruby / glibc / wrk version updates** between then and now — the
1319
+ bench-host's toolchain was upgraded under our feet.
1320
+
1321
+ **Partial data point captured before the host went offline.** A single
1322
+ v2.6.0 sanity run on this same host completed during script
1323
+ verification:
1324
+
1325
+ | Tag | Run 1 r/s | Run 2 r/s | Run 3 r/s | Median r/s | p99 |
1326
+ |--------|----------:|----------:|----------:|-----------:|-------:|
1327
+ | v2.6.0 | 1231.41 | 1230.06 | 1128.41 | **1230.06**| 6.10ms |
1328
+
1329
+ That confirms the v2.6.0 / 1,228 r/s figure is reproducible on the
1330
+ current host. The other six tags are still unmeasured.
1331
+
1332
+ **To resume when openclaw-vm is back.** The bisect script lives at
1333
+ `~/bench-bisect-2.7-A/bisect.sh` on openclaw-vm; the wrapper at
1334
+ `~/bench-bisect-2.7-A/run_all.sh` walks all seven tags:
1335
+
1336
+ ```sh
1337
+ ssh ubuntu@openclaw-vm
1338
+ cd ~/bench-bisect-2.7-A
1339
+ nohup setsid bash run_all.sh </dev/null >/dev/null 2>&1 &
1340
+ disown
1341
+ # wait ~14 min, then:
1342
+ grep -E "^ v2\." run_all.log
1343
+ ```
1344
+
1345
+ The script:
1346
+
1347
+ 1. Checks out each tag in `~/hyperion-fresh` (a separate git checkout —
1348
+ does NOT touch `~/hyperion`, which is the symlink target the live
1349
+ `~/bench/Gemfile` points at).
1350
+ 2. Rebuilds `ext/hyperion_http` (and `ext/hyperion_h2_codec` if the
1351
+ tag has it) against the tag's source.
1352
+ 3. Runs three 20s wrk passes per tag (`-t4 -c100`,
1353
+ `http://127.0.0.1:9750/hyperion_bench_1m.bin`, 1 MiB asset).
1354
+ 4. Logs `rps=… p99=…` per run; the maintainer takes the median of 3.
1355
+
1356
+ The wrapper Gemfile lives at `~/bench-bisect-2.7-A/Gemfile` and uses
1357
+ `gem "hyperion-rb", path: "/home/ubuntu/hyperion-fresh"` so the bisect
1358
+ is fully isolated from `~/bench/Gemfile`'s production path.
1359
+
1360
+ **Decision tree (unchanged from 2.7-A spec).**
1361
+
1362
+ - **All tags ~1,200 r/s (within ±10%):** bench-host drift; accept the
1363
+ new baseline; document `2026-05-01 floor = ~1,200 r/s` as an
1364
+ addendum on `docs/BENCH_HYPERION_2_0.md`. No fix.
1365
+ - **Clean step-down at one tag (e.g. v2.0.1 → v2.1.0 drops 400+ r/s,
1366
+ later tags flat):** real regression. Bisect commits inside that
1367
+ tag's range; suspect files: `lib/hyperion/connection.rb`,
1368
+ `lib/hyperion/adapter/rack.rb`, `lib/hyperion/response_writer.rb`.
1369
+ File the FIX as **2.7-A-followup** (separate commit) — this 2.7-A
1370
+ entry stays doc-only.
1371
+ - **Variance > 30% within a single tag:** bench host too noisy for
1372
+ bisect; defer again to a quieter window.
1373
+
1374
+ ### 2.7-C — Generic SSE rackup (drops the Hyperion-flush sentinel)
1375
+
1376
+ Bench/docs only — no production code.
1377
+
1378
+ The 2.6-E audit pass flagged that `bench/sse.ru` returns chunks via a
1379
+ Hyperion-specific `:__hyperion_flush__` sentinel (a hint into
1380
+ `ChunkedCoalescer#force_flush!`). On Hyperion the sentinel is
1381
+ recognised and treated as a flush hint; on Puma the sentinel is
1382
+ emitted **as a literal chunk** (`":__hyperion_flush__"` written to the
1383
+ chunked stream), which breaks the wire framing on the wrk side. That
1384
+ mis-framing is why the published row 6 in BENCH_HYPERION_2_0.md
1385
+ showed Puma at "0 r/s, 11,686 read errors" — a rackup-config artifact,
1386
+ NOT a Puma SSE-capability gap. The audit reframed the row honestly
1387
+ and filed a generic rackup as the 2.7 follow-up; this is that follow-up.
1388
+
1389
+ **Verdict.** Cross-server bench rerun is still owed: bench-host SSH
1390
+ was unreachable during the 2.7-C doc pass, so the new row 6b in the
1391
+ matrix is parked as **pending** until the next bench-host run lands
1392
+ honest numbers for Hyperion vs Puma (and ideally Falcon) on the
1393
+ generic rackup.
1394
+
1395
+ **Added.** `bench/sse_generic.ru` — 1000 SSE events of ~50 bytes
1396
+ each, `"data: event=I ts=T\n\n"` format, returned via a body whose
1397
+ `each` method yields **plain Strings** to the server's writer. No
1398
+ `:__hyperion_flush__`, no `body.flush`, no `[chunk]` arrays — just
1399
+ the Rack 3 standard streaming contract. The rackup is portable and
1400
+ boots identically on Hyperion, Puma, and Falcon.
1401
+
1402
+ **Verification (local).** Loaded the rackup via `Rack::Builder.parse_file`
1403
+ and iterated the body: 1000 chunks, ~34.9 KB total, all `String`,
1404
+ zero `Symbol` sentinels. Status 200 + `text/event-stream`. The
1405
+ existing `bench/sse.ru` is **untouched** — it remains the right
1406
+ rackup for testing Hyperion's flush-sentinel protocol; the new file
1407
+ is a sibling, not a replacement.
1408
+
1409
+ **Docs.** `docs/BENCH_HYPERION_2_0.md` row 6 reworded as a
1410
+ "Hyperion-flush-sentinel internal test, not a fair Puma comparison",
1411
+ new row 6b added for the generic rackup with results marked
1412
+ **pending** (2.7-C bench-host run owed). The SSE streaming narrative
1413
+ section ends with a 2.7-C status note instead of the prior
1414
+ "filed for follow-up" line.
1415
+
1416
+ **Spec count unchanged** (951 examples, 0 failures, 11 pending) — no
1417
+ production code touched.
1418
+
1419
+ ### 2.7-B — `lifecycle_hooks_spec.rb` `:share` macOS CI flake fix
1420
+
1421
+ Spec-only change. `spec/hyperion/lifecycle_hooks_spec.rb`'s
1422
+ `:share`-mode example flaked intermittently on macOS GitHub Actions
1423
+ runners (visibly: 2.5.0 release CI failed once + recovered, 2.6.0 prep
1424
+ CI 3ca92f8 failed outright) — the assertion was always
1425
+ `expected two on_worker_shutdown on :share, got [..., on_worker_shutdown:idx=1:pid=...]`
1426
+ i.e. only one of the two workers wrote a shutdown line.
1427
+
1428
+ **Root cause (real, not just slow CI).** On `:share`, the master binds
1429
+ the listening socket BEFORE forking workers, so the spec's
1430
+ `wait_for_port` readiness probe returns as soon as the master binds —
1431
+ possibly before either worker has reached `Signal.trap('TERM')`.
1432
+ `Worker#run` runs the boot hook, then builds/adopts the listener, then
1433
+ installs the TERM trap, then `server.start`. If the test sends TERM
1434
+ during the post-boot/pre-trap window of a worker, the master forwards
1435
+ TERM, the worker dies via SIGTERM's default action, and the
1436
+ `on_worker_shutdown` hook never fires. The window is microseconds on
1437
+ typical Linux dev hardware (where the bench / Linux CI run). On macOS
1438
+ GitHub runners (slower fork+exec, slower scheduling), the window opens
1439
+ wide enough to hit reproducibly. **No production user is at risk** —
1440
+ nobody runs `:share` worker mode on macOS, and even on Linux the window
1441
+ is closed before any operator could TERM the master interactively.
1442
+
1443
+ **Fix.** Tightened the readiness probe (Option B from the brief). New
1444
+ `wait_for_log_lines(path, pattern, expected_count, timeout)` helper
1445
+ polls the recorder log every 100 ms until N matching lines appear, with
1446
+ a generous ceiling (30 s pre-TERM, 10 s post-`waitpid`). The pre-TERM
1447
+ poll waits for two `on_worker_boot` lines — the boot hook fires
1448
+ immediately before the TERM trap installs, so once the line is on disk
1449
+ the trap is installed within microseconds. The post-`waitpid` poll
1450
+ covers APFS append+fsync ordering on macOS. No production code touched.
1451
+
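+ The helper's shape, as a sketch (the failure message and exact deadline
+ arithmetic are illustrative):
+
+ ```ruby
+ def wait_for_log_lines(path, pattern, expected_count, timeout)
+   deadline = Process.clock_gettime(Process::CLOCK_MONOTONIC) + timeout
+   loop do
+     count = File.exist?(path) ? File.foreach(path).grep(pattern).size : 0
+     return if count >= expected_count
+     if Process.clock_gettime(Process::CLOCK_MONOTONIC) > deadline
+       raise "timed out waiting for #{expected_count} x #{pattern.inspect} in #{path}"
+     end
+     sleep 0.1 # poll every 100 ms
+   end
+ end
+
+ # Per the fix: wait_for_log_lines(log, /on_worker_boot/, 2, 30) pre-TERM,
+ # then wait_for_log_lines(log, /on_worker_shutdown/, 2, 10) post-waitpid.
+ ```
+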
1452
+ **Verification.**
1453
+ - Local macOS (Apple silicon, Ruby 3.3.6): spec passed 5 consecutive
1454
+ runs (`for i in 1..5`); typical run time ~1.3 s. Full suite green:
1455
+ 951 examples, 0 failures, 11 pending (unchanged).
1456
+ - CI: pending — push triggers GitHub Actions; the flake doesn't fire
1457
+ every run, so 3 consecutive green runs are needed to call it fixed.
1458
+
1459
+ ## [2.6.0] - 2026-05-01
1460
+
1461
+ ### Headline
1462
+
1463
+ A static-file perf release with a doc accuracy pass. Two perf cuts
1464
+ landed (sendfile chunk size + inline_blocking dispatch fix), one was
1465
+ reverted (fadvise per-chunk regressed warm-cache), and the README +
1466
+ BENCH docs got an honest accuracy review.
1467
+
1468
+ | Stream | Result |
1469
+ |---|---|
1470
+ | 2.6-A — sendfile chunk size 64 KiB → 256 KiB | **+20.7% on static 1 MiB** (1,094 → 1,320 r/s) |
1471
+ | 2.6-B — posix_fadvise(SEQUENTIAL) | **REVERTED**: per-chunk call regressed warm-cache -6.6%; cold-cache win unmeasurable. Filed as 2.7 candidate IF hoisted-once approach lands. |
1472
+ | 2.6-C — `:inline_blocking` dispatch mode | Puma-style serial-per-thread for static-file routes. Auto-detect on `to_path` bodies. Initial bench surfaced engagement gap fixed in 2.6-D. |
1473
+ | 2.6-D — engagement gap + bookkeeping strip | `Fiber.blocking{}` wrap bypasses `Async` scheduler hooks on `IO.select`. **p99 collapses 433 ms → 7.48 ms at c=10** under `--async-io`; +6% rps with 39-72% p99 reduction at c=100. |
1474
+ | 2.6-E — doc + bench audit | README + BENCH_HYPERION_2_0 fairness review. PG +378%/4.78× honest matched ratio ~2.2×. HTTP/2 rows relabelled. Topology column added. 4 follow-ups filed for 2.7. |
1475
+
1476
+ Spec count: 907 (2.5.0) → **951** (2.6.0). 0 failures, 11 pending.
1477
+
1478
+ ### Production-relevant takeaway for nginx-fronted operators
1479
+
1480
+ Static-file rps on the user's plaintext-h1 topology improved by **+20.7%** via
1481
+ 2.6-A. The `:inline_blocking` dispatch mode (2.6-C/D) is most relevant
1482
+ for operators running with `--async-io` (PG-heavy workloads); the auto-detect
1483
+ mechanism kicks in transparently for routes that return `to_path` bodies.
1484
+ Default threadpool dispatch (the user's typical config) sees ~no change
1485
+ on rps from 2.6-C/D — threadpool already serves static via OS threads
1486
+ without fiber yield, so there's nothing to skip.
1487
+
1488
+ ### Doc accuracy
1489
+
1490
+ The 2.6-E audit corrected several misframed wins from prior releases:
1491
+ - PG-bound row's `+378% / 4.78×` claim was apples-to-oranges (different
1492
+ PGs, different max_conn budgets); honest matched ratio is ~2.2×.
1493
+ - HTTP/2 multiplexing rows are now framed honestly as "Puma 8 lacks
1494
+ native h2 — Falcon comparison owed" rather than "Hyperion wins h2".
1495
+ - SSE row's "Puma can't stream" was due to a Hyperion-specific flush
1496
+ sentinel in the rackup; reframed honestly. Generic SSE rackup filed
1497
+ for 2.7.
1498
+ - Bench-host drift -14 to -32% absolute since 2.0.0 publication is
1499
+ documented in BENCH addendum; the relative Hyperion vs Puma ratios
1500
+ remain durable.
1501
+ - Production-relevance column added to BENCH headline table marking
1502
+ each row "prod" (nginx-fronted h1 / WS upstream) or "bench-only"
1503
+ (TLS termination at Hyperion, h2 multiplexing, kTLS_TX) so operators
1504
+ don't chase wins that don't apply to their topology.
1505
+
1506
+ ### What didn't ship
1507
+
1508
+ - 2.6-B (fadvise SEQUENTIAL) reverted; will revisit in 2.7 if a real
1509
+ cold-cache static workload surfaces.
1510
+
1511
+ ### Follow-ups for 2.7
1512
+
1513
+ 1. `bench/sse_generic.ru` — generic SSE rackup without Hyperion sentinel
1514
+ 2. Matched-config WAN-PG Puma rerun for row 7 (clean ratio)
1515
+ 3. Falcon h2 head-to-head for the HTTP/2 rows
1516
+ 4. Static 1 MiB regression bisect (today 1,228 vs 2.0.1's 1,697 — bench drift suspected, verification owed)
1517
+
1518
+ ### 2.6-D — `:inline_blocking` engagement-gap fix + Connection bookkeeping strip
1519
+
1520
+ Two-part landing. Headline first: closes the 2.6-C runtime
1521
+ engagement gap that the maintainer's 2026-05-01 openclaw-vm bench
1522
+ flagged ("auto-detect SHOULD kick in here and drop p99 to ~6 ms.
1523
+ It doesn't"). Bookkeeping strip second: skip lifecycle hooks +
1524
+ per-conn fairness on auto-detected static-file responses.
1525
+
1526
+ #### Part 1 — Engagement gap fix (the headline)
1527
+
1528
+ **Root cause.** 2.6-C's `Sendfile.copy_to_socket_blocking`
1529
+ replaced fiber-yielding `wait_writable` with `IO.select`,
1530
+ expecting `IO.select` to park the OS thread on the kernel
1531
+ readiness check. Under `--async-io` it doesn't. The Async
1532
+ gem hooks `Fiber.scheduler.kernel_select`; when the calling
1533
+ fiber is non-blocking (the default, including every fiber
1534
+ inside `start_async_loop`'s `task.async { dispatch(socket) }`),
1535
+ `IO.select` is intercepted by the scheduler and routes through
1536
+ its cooperative-yield path — same shape as `wait_writable`,
1537
+ just one more layer of indirection. The auto-detect set the
1538
+ flag, the writer plumbed it, the sendfile loop branched on
1539
+ it, and EVERY EAGAIN still yielded the fiber. Unit specs
1540
+ that asserted the dispatch_mode flag was set caught only the
1541
+ plumbing — they couldn't catch the runtime bypass because
1542
+ they ran on a non-Async fiber, where `Fiber.current.blocking?`
1543
+ is true by default.
1544
+
1545
+ **The fix.** `ResponseWriter#write_sendfile` wraps the
1546
+ entire write path in `Fiber.blocking { ... }` when
1547
+ `dispatch_mode == :inline_blocking`. `Fiber.blocking`
1548
+ (class-method block form) flips the calling fiber's
1549
+ `Fiber.current.blocking?` to true for the duration; while
1550
+ blocking, scheduler hooks (`kernel_select`, `io_wait`,
1551
+ `block`) are NOT consulted — the fiber's IO calls go
1552
+ straight to the OS, exactly the Puma-style serial-per-
1553
+ thread shape `:inline_blocking` was designed to deliver.
1554
+ Defensive secondary wrap inside
1555
+ `Sendfile.copy_to_socket_blocking` so direct callers get
1556
+ the no-yield guarantee even without the writer-level wrap.
1557
+ `select_writable_blocking` also wraps when called from a
1558
+ non-blocking fiber, belt-and-suspenders.
1559
+
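+ In shape (a sketch: the `write_sendfile_inner` split is from the
+ files-touched list below; the argument list is illustrative):
+
+ ```ruby
+ def write_sendfile(body, dispatch_mode: nil)
+   if dispatch_mode == :inline_blocking && !Fiber.current.blocking?
+     # Flip the fiber to blocking for the duration: scheduler hooks are
+     # not consulted, so IO.select and sendfile(2) go straight to the OS.
+     Fiber.blocking { write_sendfile_inner(body) }
+   else
+     write_sendfile_inner(body) # already-blocking fiber: zero extra cost
+   end
+ end
+ ```
+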
1560
+ **Why the unit specs missed it.** The 2.6-C suite ran
1561
+ the writer + sendfile helpers against a `StringIO` /
1562
+ direct-`TCPSocket` pair on the spec's main thread — no
1563
+ Async reactor, no fiber scheduler installed.
1564
+ `Fiber.current.blocking?` is true on the main thread by
1565
+ default, so the IO.select fell through to the OS as
1566
+ expected, and the spec asserting "blocking variant
1567
+ fires" passed. The bug was only observable end-to-end
1568
+ under a live `start_async_loop` — which the unit specs
1569
+ didn't boot. 2.6-D's regression specs boot a real
1570
+ `Async { ... }.wait` block + a real socket pair so the
1571
+ fiber-scheduler interception path is actually exercised.
1572
+
1573
+ **Bench delta on openclaw-vm 2026-05-01** (Linux 6.x,
1574
+ 1 MiB warm-cache static asset, `--async-io -t 5 -w 1`):
1575
+
1576
+ * **2.6-C baseline (`-c100`):** 1,232 r/s, p99 **433-710 ms**.
1577
+ * **2.6-D (`-c100`):** 1,262 / 1,362 / 1,307 r/s
1578
+ (median 1,307 r/s, +6%), p99 **211 / 264 / 451 ms**
1579
+ (median 264 ms; 39-72% reduction vs 2.6-C).
1580
+ * **2.6-D (`-c20`):** 1,276 r/s, p99 **18 ms**.
1581
+ * **2.6-D (`-c10`):** 1,279 r/s, p99 **7.48 ms**.
1582
+
1583
+ The headline ≤10 ms p99 target IS reached at low-to-medium
1584
+ concurrency (c=10 hits **7.48 ms p99** within noise of the
1585
+ threadpool baseline). At c=100 over `-t 5` the per-thread
1586
+ queue length (20 connections per OS thread) reintroduces a
1587
+ ~200 ms tail because each blocking sendfile parks the OS
1588
+ thread for the duration of the kernel write — that's the
1589
+ explicit Puma-style trade-off, not a bug in the engagement
1590
+ fix. Operators who need a tighter p99 at c=100 should
1591
+ either bump `-t` (more OS threads, shorter queue) or fall
1592
+ back to threadpool dispatch (no fiber yield to begin with,
1593
+ so `:inline_blocking` brings nothing on that path).
1594
+
1595
+ The engagement-fix proof: at c=100 the *throughput* is up
1596
+ 6% AND p99 is down 39-72% — both impossible if the dispatch
1597
+ were still routing through the fiber scheduler.
1598
+
1599
+ #### Part 2 — Connection bookkeeping strip on inline_blocking static
1600
+
1601
+ When `:inline_blocking` is engaged, the per-conn
1602
+ fairness check (2.3-B) and the after-request lifecycle
1603
+ hook (2.5-C) are stripped from the request loop:
1604
+
1605
+ * **Per-conn fairness cap** — `Connection#serve` skips
1606
+ the `per_conn_admit!` admission check on the request
1607
+ iteration FOLLOWING a `:inline_blocking` response on
1608
+ the same keep-alive connection. Sticky flag, resets
1609
+ the moment a non-static response lands. Static-
1610
+ asset connections (CDN origins, signed-download
1611
+ responders) typically run a long sequence of
1612
+ `to_path` responses; the fairness cap was designed
1613
+ for dynamic-route concurrency throttling and is
1614
+ dead weight on a static stream.
1615
+
1616
+ * **After-request lifecycle hook** — `Adapter::Rack#call`
1617
+ skips `Runtime#fire_request_end` when
1618
+ `Connection#response_dispatch_mode` resolves to
1619
+ `:inline_blocking`. The before-request hook still
1620
+ fires (it's cheap; useful for span creation,
1621
+ request-id assignment, etc.). Asymmetric by
1622
+ design: the after-hook is the heavy one (span
1623
+ flush, DB write, async-queue enqueue), and that's
1624
+ the cost we shed.
1625
+
1626
+ #### Behaviour change — operators with per-request hooks attached for static-route observability
1627
+
1628
+ Pre-2.6-D, `Runtime#on_request_end` fired on EVERY
1629
+ request — static-file routes included. Operators with
1630
+ NewRelic / DataDog / OpenTelemetry hooks attached to
1631
+ trace span lifecycles would see one span per static
1632
+ asset (every `/assets/*.css` / `/uploads/*.png`). Post-
1633
+ 2.6-D, those spans STOP firing on auto-detected
1634
+ static-file responses.
1635
+
1636
+ **Migration.** The metrics module observes static
1637
+ traffic with no hook overhead:
1638
+ * `hyperion_request_duration_seconds` per-route
1639
+ histogram with method/path/status labels.
1640
+ * `:sendfile_responses` counter (per worker).
1641
+ * `:requests_dispatch_inline_blocking` per-mode
1642
+ counter — explicit "this route engaged the
1643
+ static fast path" signal.
1644
+ Operators relying on hooks for static observability
1645
+ should migrate to these counters/histograms. The
1646
+ hooks remain authoritative for dynamic routes (CPU
1647
+ JSON, streaming, hijacked WebSocket, etc.).
1648
+
1649
+ #### Files touched
1650
+ - `lib/hyperion/response_writer.rb` — `Fiber.blocking`
1651
+ wrap on `write_sendfile` when `dispatch_mode ==
1652
+ :inline_blocking`. Hot logic split into
1653
+ `write_sendfile_inner` so the wrap is a single
1654
+ method-dispatch when the branch fires, zero cost
1655
+ otherwise.
1656
+ - `lib/hyperion/http/sendfile.rb` — defensive
1657
+ `Fiber.blocking` wrap on `copy_to_socket_blocking`
1658
+ + `select_writable_blocking` so direct callers
1659
+ inherit the no-yield guarantee. No-op when the
1660
+ calling fiber's `blocking?` flag is already true.
1661
+ - `lib/hyperion/connection.rb` — `@last_response_was_static_inline_blocking`
1662
+ sticky flag; per-conn fairness admission check is
1663
+ skipped on the request iteration following a
1664
+ `:inline_blocking` response. Flag resets on the
1665
+ first non-static response.
1666
+ - `lib/hyperion/adapter/rack.rb` — `inline_blocking_resolved?`
1667
+ helper consulted by the lifecycle-hook branch in
1668
+ `#call`; `fire_request_end` is skipped when the
1669
+ resolved dispatch mode is `:inline_blocking`.
1670
+ - `spec/hyperion/inline_blocking_dispatch_spec.rb` — 4
1671
+ engagement-gap regression specs (Async reactor +
1672
+ socket pair, observe `Fiber.current.blocking?` at
1673
+ the moment of IO.select / sendfile entry), 4
1674
+ lifecycle-hook behaviour specs (before-hook still
1675
+ fires, after-hook skipped on inline_blocking,
1676
+ after-hook still fires on dynamic + on app-error),
1677
+ 3 Connection sticky-flag specs. The pre-2.6-D
1678
+ "fires before-request and after-request hooks on a
1679
+ static-file response" spec is REPLACED with the
1680
+ asymmetric-hook specs because the behaviour
1681
+ changed.
1682
+
1683
+ Spec count: 947 → 951 (+4 net; +11 new, -7 superseded
1684
+ by replacement specs). 0 failures, 11 pending
1685
+ (Linux-only splice tests on macOS).
1686
+
1687
+ ### 2.6-C — `:inline_blocking` dispatch mode (Puma-style serial-per-thread sendfile for static)
1688
+
1689
+ The biggest remaining 2.6 cut on the static-file row. Adds a sixth
1690
+ `Hyperion::DispatchMode` value (`:inline_blocking`) and an opt-in
1691
+ per-response code path that issues `sendfile(2)` under the GVL with
1692
+ `IO.select`-driven EAGAIN handling — Puma's response-write shape —
1693
+ instead of the legacy fiber-yielding `wait_writable` round-trip.
1694
+
1695
+ **Dispatch model.** `:inline_blocking` is opt-in PER RESPONSE, NOT a
1696
+ connection-wide mode. The connection-wide dispatch
1697
+ mode (resolved at boot from `tls`, `async_io`, ALPN, and
1698
+ `thread_count`) stays whatever the operator configured —
1699
+ typically `:async_io_h1_inline` or `:threadpool_h1` for the bench
1700
+ shape. Per-response, the response-write loop reads
1701
+ `Connection#response_dispatch_mode` (set by the Rack adapter) and,
1702
+ when it equals `:inline_blocking`, branches to
1703
+ `Hyperion::Http::Sendfile.copy_to_socket_blocking` instead of
1704
+ `copy_to_socket`.
1705
+
1706
+ **Auto-detect.** `Adapter::Rack#call` inspects the response after
1707
+ `app.call` returns. When the body responds to `:to_path` AND
1708
+ `env['hyperion.streaming']` is not set, it stashes
1709
+ `:inline_blocking` on the connection. `to_path` is Rack's
1710
+ strongest "this is a static file on disk" signal — Rack::Files,
1711
+ Rack::SendFile, asset servers, and signed-download responders all
1712
+ set it; SSE / chunked / streaming JSON bodies do not. Conservative
1713
+ by design: streaming routes cannot accidentally engage
1714
+ `:inline_blocking` because their bodies don't respond to
1715
+ `:to_path`.
1716
+
1717
+ **Explicit opt-in.** Apps can set
1718
+ `env['hyperion.dispatch_mode'] = :inline_blocking` for routes the
1719
+ auto-detect doesn't catch (e.g. a custom Range-request body that
1720
+ needs the blocking write loop but has a non-standard `to_path`-like
1721
+ shape). Operator-level escape hatch.
1722
+
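+ Both the auto-detect and the env override funnel through the
+ post-`app.call` helper. In shape (a sketch: the helper name is from the
+ files-touched list below; the exact plumbing is illustrative):
+
+ ```ruby
+ def resolve_dispatch_mode!(env, body, connection)
+   mode =
+     if (explicit = env['hyperion.dispatch_mode'])
+       explicit                                        # operator escape hatch
+     elsif body.respond_to?(:to_path) && !env['hyperion.streaming']
+       :inline_blocking                                # static-file-on-disk signal
+     end
+   connection&.response_dispatch_mode = mode if mode   # no Connection, no-op
+ end
+ ```
+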
1723
+ **Why the win exists.** The fiber-yielding path
1724
+ (`Sendfile#wait_writable`) hops the fiber scheduler on every
1725
+ EAGAIN — userspace pays a per-chunk fiber-suspend / fiber-resume
1726
+ round-trip plus the `wait_writable` wakeup callback even when the
1727
+ kernel TCP send buffer drains in nanoseconds. Puma doesn't: the
1728
+ worker thread parks on a kernel write under the GVL, the kernel
1729
+ returns when ready, the loop resumes. For static-file routes
1730
+ where the only I/O wait is the socket itself (no DB / Redis /
1731
+ upstream HTTP that would benefit from cooperative yielding),
1732
+ Puma's straight-line shape is strictly faster on the throughput
1733
+ axis. `:inline_blocking` ports that shape into Hyperion for the
1734
+ routes that match.
1735
+
1736
+ **Bench delta on openclaw-vm** (Linux 6.x, 1 MiB warm-cache static
1737
+ asset, `wrk -t4 -c100 -d20s`, target validated against 2.6-A's
1738
+ 1,320 r/s baseline and Puma's 1,571 r/s):
1739
+
1740
+ * **Bench validation 2026-05-01 (maintainer rerun):**
1741
+ - Default threadpool mode (no `--async-io`): static 1 MiB
1742
+ median **1,270 r/s, p99 6 ms** across 3 trials. Within noise
1743
+ of 2.6-A's 1,320 r/s — meaning the new `:inline_blocking`
1744
+ dispatch is essentially equivalent to the existing threadpool
1745
+ path on the user's typical (nginx-fronted, no async-io)
1746
+ deployment shape. Threadpool already serves static via OS
1747
+ threads with no fiber yield, so there's nothing for
1748
+ `:inline_blocking` to skip.
1749
+ - `--async-io` mode: static 1 MiB median **1,232 r/s, p99
1750
+ 433-710 ms**. The fiber-yield-on-EAGAIN penalty IS visible in
1751
+ this mode's p99. **2.6-C's auto-detect should kick in here
1752
+ and drop p99 to ~6 ms — but the bench shows it doesn't, so
1753
+ the auto-detect engagement has a runtime gap that the unit
1754
+ specs miss.** Filed as 2.6-C-followup: investigate why the
1755
+ `Adapter::Rack#resolve_dispatch_mode!` call doesn't propagate
1756
+ to the actual write path under `--async-io`.
1757
+ - Headline: 2.6-C ships the dispatch mode and unit-test
1758
+ coverage; the end-to-end perf win on `--async-io` is
1759
+ unverified pending the engagement-fix follow-up.
1760
+
1761
+ **Tail-latency expectation.** p99 may bump slightly under the
1762
+ blocking variant (the OS thread parks on the kernel write while a
1763
+ slow peer drains; 2.6-A's p99 was ~6.35 ms on the same row vs
1764
+ Puma's 754 ms). Even with a several-ms bump the static-file p99
1765
+ stays orders-of-magnitude below Puma's 754 ms because Hyperion
1766
+ still answers from the socket fd directly with no per-request
1767
+ allocation tax. Threadpool path on dynamic routes is unchanged
1768
+ (CPU JSON / Enumerator bodies don't auto-detect into
1769
+ `:inline_blocking`).
1770
+
1771
+ **Lifecycle hooks (2.5-C) interop.** Hooks fire on every request
+ regardless of dispatch mode — observability is mode-agnostic. A
+ future 2.6-D may add an opt-OUT for static-file responses; 2.6-C
+ does NOT change hook firing behaviour.
+
+ #### Files touched
+ - `lib/hyperion/dispatch_mode.rb` — `:inline_blocking` added to
+ `MODES` + `INLINE_MODES`; `inline_blocking?` + `fiber_dispatched?`
+ predicates.
+ - `lib/hyperion/http/sendfile.rb` — `copy_to_socket_blocking` public
+ method + `native_copy_loop_blocking` + `select_writable_blocking`
+ (`IO.select` instead of `wait_writable`).
+ - `lib/hyperion/connection.rb` — `response_dispatch_mode` accessor,
+ reset per-request, forwarded to `ResponseWriter#write` as
+ `dispatch_mode:`.
+ - `lib/hyperion/response_writer.rb` — `dispatch_mode:` kwarg on
+ `#write` + `#write_sendfile`; sendfile branch picks
+ `copy_to_socket_blocking` when `dispatch_mode == :inline_blocking`.
+ - `lib/hyperion/adapter/rack.rb` — post-`app.call`
+ `resolve_dispatch_mode!` helper handles auto-detect + explicit
+ env override; both the lifecycle-hooks branch and the bare path
+ call into it.
+ - `spec/hyperion/inline_blocking_dispatch_spec.rb` — new file, 30
+ examples covering predicates, auto-detect, explicit opt-in,
+ streaming opt-out, no-Connection no-op, round-trip integrity at
+ 1 KiB / 8 KiB / 1 MiB / 16 MiB, threadpool regression check
+ (CPU JSON / Enumerator bodies stay nil), 2.5-C hook firing,
+ `Sendfile.copy_to_socket_blocking` direct round-trip.
+
+ Spec count: 911 → 941 (+30 from this 2.6-C landing). 0 failures,
+ 11 pending (Linux-only splice tests on macOS). When 2.6-B lands its
+ ~6 fadvise round-trip specs the count will roll forward to 947.
+
+ ### 2.6-A — sendfile chunk size 64 KiB → 256 KiB (4× fewer syscalls per 1 MiB)
+
+ `Hyperion::Http::Sendfile::USERSPACE_CHUNK` bumped from `64 * 1024`
+ to `256 * 1024`. The constant gates two paths (the fallback's chunk
+ cap is sketched after the list):
+
+ * Per-call cap on the native sendfile(2) loop (`plain_sendfile_loop`)
+ and the splice(2) ladder (`splice_copy_loop`). Each kernel round
+ now moves up to 256 KiB instead of 64 KiB; a 1 MiB warm-cache
+ static asset moves in 4 kernel rounds vs the legacy 16 — a 4×
+ syscall-count reduction per response.
+ * Chunk size on the userspace `IO.copy_stream` fallback (TLS
+ sockets, hosts where the C ext didn't compile).
+
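+ On the fallback path the constant is simply the per-call length cap
+ handed to `IO.copy_stream`. A minimal sketch, assuming a plain
+ offset loop (the shipped code also carries the EAGAIN cursor):
+
+ ```ruby
+ USERSPACE_CHUNK = 256 * 1024
+
+ def userspace_copy(src_io, socket, length)
+   offset = 0
+   while offset < length
+     # copy_length is capped per call; src_offset advances the cursor
+     copied = IO.copy_stream(src_io, socket,
+                             [length - offset, USERSPACE_CHUNK].min, offset)
+     offset += copied
+   end
+ end
+ ```
+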
1817
+ 64 KiB came from the Linux 2.x-era TCP-send-buffer "sweet spot"
+ folklore. On modern kernels (4.x+) the TCP send buffer auto-tunes
+ upward under sustained load, and modern NICs with TSO segment
+ 256 KiB–1 MiB chunks at line rate. The reference field — nginx
+ (`sendfile_max_chunk` default unlimited, distros ship `2m`
+ overrides), Apache (`SendBufferSize` 128k–256k), Caddy (256 KiB
+ hard-coded) — sits at 256 KiB+; Hyperion now joins them.
+
+ EAGAIN handling is preserved per chunk: a slow-client socket that
+ returns EAGAIN mid-response still surfaces `:eagain` to the Ruby
+ façade, which yields the fiber and resumes from the same cursor on
+ the next iteration. Existing 1 MiB / Range-slice / 1-byte / 0-byte
+ round-trip integrity tests stay green. Three new round-trip tests
+ land for 4 MiB (multi-chunk), 256 KiB (exactly one chunk), and
+ 100 KiB (one partial chunk above `SMALL_FILE_THRESHOLD`).
+
1833
+ **Bench delta on openclaw-vm** (Linux 6.x, 1 MiB warm-cache static
+ asset, `wrk -t4 -c100 -d20s`, 3 trials, median):
+
+ * 2.5.0 baseline: 1,094 r/s
+ * 2.6-A: 1,320 r/s (trials: 1,370 / 1,320 / 1,305 r/s)
+ * Delta: **+20.7%** (above the +10% target)
+ * p50 latency: 3.49 → 3.64 ms (within noise; transfer/sec climbs
+ from ~1.07 GB/s → 1.34 GB/s on the fastest trial, indicating the
+ syscall-count reduction is the bottleneck mover, not the wire).
+
+ **Config knob — deliberately not exposed.** The 256 KiB value is
+ the most-likely-good across the field; nginx/Apache/Caddy operators
+ don't tune it either, and adding a `sendfile.chunk_bytes` config
+ knob would add a Config dependency to a module that today carries
+ none. If a future operator workload demands tuning, the knob can
+ be added without breaking compatibility.
+
+ #### Files touched
+ - `lib/hyperion/http/sendfile.rb` — `USERSPACE_CHUNK` constant, chunk
+ cap on `plain_sendfile_loop` + `splice_copy_loop`, doc comment
+ rationale.
+ - `spec/hyperion/http_sendfile_spec.rb` — 4 new round-trip tests
+ (4 MiB / 256 KiB / 100 KiB / `USERSPACE_CHUNK == 256 KiB`
+ introspection).
+
+ Spec count: 907 → 911 (+4). 0 failures, 11 pending (Linux-only
+ splice path skips on macOS).
+
1863
+ ## [2.5.0] - 2026-04-29
+
+ ### Headline
+
+ A correctness + observability + RFC-conformance release. The 2.5.0
+ sprint settled three open questions from prior releases and opened
+ the door for first-class production observability integrations.
+
+ | Track | Result |
+ |---|---|
+ | 2.5-A — RFC 6455 §7.4.1 close-code validation | autobahn-testsuite **453/463 → 463/463 (100% on non-perf cases)**. Section 7 closed: 27/37 → 37/37. |
+ | 2.5-B — Rails-shape h2 bench | **+18% rps** native HPACK vs Ruby fallback on 25-header response. **[breaking-default-change]: `HYPERION_H2_NATIVE_HPACK` flipped to ON by default** when Rust crate available. |
+ | 2.5-C — Request lifecycle hooks | `Runtime#on_request_start` / `on_request_end`. NewRelic / AppSignal / OpenTelemetry / DataDog wire without monkey-patching. Zero-cost path preserved. |
+ | 2.5-D — Compression-bomb fuzz | 6 adversarial vectors (ratio bomb, malformed sync trailer, mid-message dict corruption, zero-length, min-window-bits, compressed control frame). All PASS — 2.3-C's defense holds. |
+
+ Spec count: 823 (2.4.0) → 907 (2.5.0). 0 failures, 11 pending.
+
+ ### Breaking change
+
+ `HYPERION_H2_NATIVE_HPACK` default flipped from OFF to ON when the
+ Rust crate is available (the typical case — it builds out of the box
+ on macOS/Linux + cargo). Operators who explicitly want the prior
+ 2.4.x Ruby-fallback default must set `HYPERION_H2_NATIVE_HPACK=off`.
+
+ Migration: most operators see +18% h2 rps on header-heavy workloads
+ (Rails apps, gRPC metadata, etc.) and no change on hello-shape
+ workloads (HPACK is <1% of per-stream CPU on 2-header responses).
+ Operators on hosts where the Rust crate didn't build see the same
+ Ruby fallback as 2.0.x–2.4.x — no behavior change.
+
+ New operator visibility:
+ - `Hyperion::Runtime#on_request_start { |req, env| ... }` — hook fires before app.call
+ - `Hyperion::Runtime#on_request_end { |req, env, response, error| ... }` — hook fires after app.call
+ - See `docs/OBSERVABILITY.md` "Custom request lifecycle hooks (2.5-C)" for NewRelic/AppSignal/OpenTelemetry/DataDog/Prometheus recipes
+
1898
+ ### 2.5-A — WebSocket close-payload validation (RFC 6455 §7.4.1 + §5.5.1)
+
+ **The fix.** `Hyperion::WebSocket::Connection#recv` now validates the
+ peer's close code against the IANA close-code registry. Codes outside
+ the wire-allowed ranges (1000–1003, 1007–1015, 3000–3999, 4000–4999)
+ get a 1002 (Protocol Error) response back instead of being echoed.
+ Synthetic codes (1005 "No Status Received", 1006 "Abnormal Closure")
+ that MUST NOT appear on the wire are rejected with 1002. The reserved
+ 1016–2999 range is rejected with 1002. A 1-byte close payload (status
+ code can't fit) is rejected with 1002. A close reason whose bytes are
+ not valid UTF-8 is rejected with 1007 (Invalid Frame Payload Data) per
+ RFC 6455 §8.1. An empty close payload (no status, no reason) is
+ explicitly permitted per §5.5.1 and gets a 1000 Normal close response.
+
+ **Surface area.** New module `Hyperion::WebSocket::CloseCodes` exposes
+ `.validate(code) → Symbol` and `.invalid?(code) → Boolean` for any
+ caller that wants to apply the same RFC 6455 §7.4.1 ranges (e.g.
+ ActionCable adapters, custom WS gateways).
+
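+ The registry check itself is a handful of ranges. A sketch of what
+ `.validate` / `.invalid?` boil down to (the return symbols here are
+ illustrative; the method names and ranges come from the paragraphs
+ above):
+
+ ```ruby
+ module CloseCodes
+   WIRE_ALLOWED = [1000..1003, 1007..1015, 3000..3999, 4000..4999].freeze
+
+   # :ok when the code may legally appear on the wire; :protocol_error
+   # covers 1004, the synthetic 1005/1006, and the reserved 1016-2999.
+   def self.validate(code)
+     WIRE_ALLOWED.any? { |r| r.cover?(code) } ? :ok : :protocol_error
+   end
+
+   def self.invalid?(code)
+     validate(code) != :ok
+   end
+ end
+ ```
+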
1917
+ **Why it matters.** 2.4-D's autobahn-testsuite run scored 453/463
+ (97.8%) with all 10 failures in section 7.5.1 + 7.9.x — the close-code
+ validation gap. 2.5-A closes that gap.
+
+ **Autobahn pass: 453/463 → 463/463 (100% on non-perf cases).**
+ Section 7 alone: 27/37 → 37/37. Verified empirically on openclaw-vm
+ 2026-04-30 post-fuzzer rerun (same bench/ws_echo_autobahn.ru rackup
+ + Hyperion-2.5-A agent name; results dropped into
+ ~/autobahn-reports/index.json, parsed via bench/parse_autobahn_index.rb).
+ Section 6 (UTF-8): 145/145 (141 OK + 4 NON-STRICT — RFC SHOULD on
+ fail-fast position, not MUST). Section 12+13 (permessage-deflate):
+ 216/216 OK.
+
+ Spec count: 823 (2.4.0) → 893 (+70 in `websocket_close_validation_spec.rb`).
+ 0 failures, 11 pending.
+
+ ### 2.5-B — Rails-shape h2 bench rackup (settle the HPACK default-flip question)
+
+ **The question 2.4-A left open.** 2.4-A's HPACK FFI round-2 (CGlue / v3
+ adapter, commits 67c52a4 + 98e9cf3 + 877f934) brought the per-call alloc
+ from 12 → 4 objects and dropped Fiddle off the hot path. But it
+ benched against `bench/hello.ru` — a Rack response with **2 response
+ headers** (`content-type`, plus the auto-inserted `content-length`). On
+ that workload HPACK encode is <1% of per-stream CPU, so native and the
+ Ruby fallback both came in at parity (-0.05% noise). The default stayed
+ opt-in via `HYPERION_H2_NATIVE_HPACK=1` because parity isn't grounds
+ for a default flip.
+
+ Real Rails 8.x apps ship 20–30 response headers (Rails defaults +
+ ActionDispatch + ActionController + CSP/HSTS + per-request varying
+ headers like X-Request-Id / Set-Cookie / ETag). On that shape HPACK
+ encode CPU should climb into the single-digit percent of per-stream
+ CPU, and the FFI-marshalling overhead vs the native byte-pump starts
+ to matter. 2.5-B settles whether it matters enough to flip the default.
+
+ **The bench artifacts.**
+ - `bench/h2_rails_shape.ru` — Rails-shape rackup, 25 response headers
+ (content-type, x-frame-options, x-xss-protection, x-content-type-options,
+ x-permitted-cross-domain-policies, referrer-policy, x-download-options,
+ cache-control, pragma, expires, vary, content-language,
+ strict-transport-security, content-security-policy, x-request-id,
+ x-runtime, x-powered-by, set-cookie, etag, last-modified, date,
+ server, access-control-allow-origin, cross-origin-opener-policy,
+ cross-origin-resource-policy). Body is a ~200-byte JSON payload —
+ matches a typical Rails JSON response. Per-request variance: the
+ x-request-id rotates per call, the set-cookie session id rotates,
+ the etag rotates.
+ - `bench/h2_rails_shape.sh` — A/B harness. Boots hyperion twice
+ (Ruby fallback baseline + native v3 with HYPERION_H2_NATIVE_HPACK=1)
+ on port 9602, runs h2load `-c 1 -m 100 -n 5000` 3× per variant, takes
+ the median rps (3-5% bench noise), prints the delta, and selects a
+ decision (flip / keep / investigate) against the +15% threshold from
+ the 2.5-B controller.
+
+ **Decision tree.**
+ - native ≥ +15% rps → flip default to ON (auto = on if available, off
+ if not), update boot log, document in CHANGELOG as a
+ `[breaking-default-change]`.
+ - native at parity / +5–10% (within noise) → keep opt-in, document the
+ result.
+ - native NEGATIVE → don't ship a regression, file a 2.6 follow-up.
+
+ **Bench result on openclaw-vm (2026-04-30, h2load -c 1 -m 100 -n 5000,
+ 3 trials, median):**
+
+ | Mode | r/s | Note |
+ |---|---:|---|
+ | Ruby fallback (HPACK off) | **1,201** | `protocol-http2`'s pure-Ruby Compressor/Decompressor |
+ | Native v3 (HPACK on, CGlue) | **1,418** | 2.4-A custom-C-ext path, no Fiddle per call |
+
+ **Δ: +18.0% rps** on the Rails-shape header-heavy workload. Above
+ the +15% flip threshold.
+
+ **[breaking-default-change]: native HPACK is now ON by default** when
+ the Rust crate is available. `lib/hyperion/http2_handler.rb`'s
+ policy resolver flipped from `env_flag_enabled?('HYPERION_H2_NATIVE_HPACK')`
+ (unset → off) to `resolve_h2_native_hpack_default` (unset → on; only
+ `0`/`false`/`no`/`off` explicit values opt out). Operators who
+ benchmarked their workload against the 2.4.x default can opt out via
+ `HYPERION_H2_NATIVE_HPACK=off`. Operators on hosts where the Rust
+ crate didn't build see the same Ruby fallback as 2.0.x–2.4.x — no
+ behavior change.
+
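+ The resolver's env grammar condenses to a few lines. A sketch of the
+ semantics described above (the method name is from this section; the
+ body is illustrative):
+
+ ```ruby
+ OPT_OUT_VALUES = %w[0 false no off].freeze
+
+ def resolve_h2_native_hpack_default
+   value = ENV['HYPERION_H2_NATIVE_HPACK']
+   return true if value.nil? || value.empty?  # unset → on (the 2.5-B flip)
+   !OPT_OUT_VALUES.include?(value.downcase)   # only 0/false/no/off opt out
+ end
+ ```
+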
1999
+ **Boot log copy updated** to reflect the new default — `mode: native
+ (Rust v3 / CGlue)` is the new normal, `mode: fallback (... opted out
+ via HYPERION_H2_NATIVE_HPACK=off)` is the explicit-opt-out shape.
+
+ **Spec changes for the new default:**
+ - `spec/hyperion/h2_codec_fallback_spec.rb` — flipped the "env unset →
+ codec_native? false" expectation to "env unset → codec_native? true";
+ added a sibling `HYPERION_H2_NATIVE_HPACK=off` example for the
+ explicit-opt-out path.
+ - `spec/hyperion/http2_native_hpack_spec.rb` — the "default — opt-in
+ not taken" context now sets `HYPERION_H2_NATIVE_HPACK=off` explicitly
+ (the test's assertion that the native adapter is NOT installed is
+ still meaningful, just under the explicit-opt-out path now).
+
+ Spec count: 893 → 894 (+1 new explicit-opt-out spec). 0 failures, 11 pending.
+
+ ### 2.5-C — Per-request lifecycle hooks (`Runtime#on_request_start` / `#on_request_end`)
+
+ **The user-relevant bit.** Attach NewRelic / AppSignal / DataDog /
+ OpenTelemetry agents to Hyperion **without monkey-patching
+ `Adapter::Rack#call`**. New first-class API on `Hyperion::Runtime`:
+
+ ```ruby
+ runtime.on_request_start { |request, env| env['otel.span'] = tracer.start_span(request.path) }
+ runtime.on_request_end { |request, env, response, error| env['otel.span'].finish }
+ ```
+
+ The before-hook fires after env is built and before `app.call`. The
+ after-hook fires after `app.call` returns or raises, with the
+ `[status, headers, body]` response tuple (or `nil` if the app raised)
+ plus the raised exception (or `nil` on success). Hooks may stash trace
+ context into the env Hash for the after-hook to read back. Multiple
+ hooks fire in registration order (FIFO).
+
+ **Failure-isolated.** A misbehaving observer (a NewRelic agent throwing
+ during a hook) is caught and logged with `block.source_location` — the
+ dispatch chain continues, subsequent hooks still fire, and the response
+ is still returned to the client.
+
+ **Zero-cost when nothing's registered.** The hot-path guard is a
+ single `Array#empty?` check on each side. With no hooks registered,
+ `Adapter::Rack#call` short-circuits — no Array iteration, no Proc
+ invocation, no allocation. `yjit_alloc_audit_spec` confirms the
+ per-request allocation count remains at the 2.5-B baseline (≤10
+ objects/req on the full path).
+
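+ The guard shape, sketched: `has_request_hooks?` and the two fire
+ methods are the public API listed below; the dispatch body is
+ illustrative.
+
+ ```ruby
+ # Inside the adapter's dispatch, per the zero-cost contract above.
+ def call(app, request, env, runtime:)
+   return app.call(env) unless runtime.has_request_hooks? # Array#empty? underneath
+
+   runtime.fire_request_start(request, env)
+   begin
+     response = app.call(env)
+   rescue => error
+     runtime.fire_request_end(request, env, nil, error)
+     raise
+   end
+   runtime.fire_request_end(request, env, response, nil)
+   response
+ end
+ ```
+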
2045
+ **Per-Server isolation.** The hook registry lives on `Hyperion::Runtime`,
+ not on a process-global. Multi-tenant deployments with multiple
+ `Hyperion::Server` instances pass a per-tenant `Runtime` and each
+ gets its own observer list — no cross-contamination, no global mutex.
+
+ **Surface area.**
+ - New: `Hyperion::Runtime#on_request_start(&block)` — register a
+ before-hook receiving `(request, env)`.
+ - New: `Hyperion::Runtime#on_request_end(&block)` — register an
+ after-hook receiving `(request, env, response, error)`.
+ - New: `Hyperion::Runtime#has_request_hooks?` — predicate used by the
+ adapter's hot-path guard. Public-but-internal: callers wiring custom
+ dispatchers can use it for the same zero-cost short-circuit.
+ - New: `Hyperion::Runtime#fire_request_start(request, env)` /
+ `#fire_request_end(request, env, response, error)` — invoked by
+ `Adapter::Rack#call`. Public so future adapter implementations
+ (third-party Rack alternatives, custom transports) can fire the
+ same hooks against the user's observer registry.
+ - Changed: `Hyperion::Adapter::Rack.call(app, request, connection: nil)`
+ gains a `runtime:` kwarg (default `nil` → `Runtime.default`). Existing
+ call sites (`Connection#call_app`, `Http2Handler`, `ThreadPool`) are
+ updated to pass the per-conn / per-handler runtime through. Apps and
+ third-party callers that never set the kwarg are unaffected.
+
+ **Files touched.**
+ - `lib/hyperion/runtime.rb` — hook registration + dispatch + failure log.
+ - `lib/hyperion/adapter/rack.rb` — `runtime:` kwarg + hot-path guard.
+ - `lib/hyperion/connection.rb` — pass `@runtime` through `call_app`.
+ - `lib/hyperion/http2_handler.rb` — pass `@runtime` through h2 dispatch.
+ - `spec/hyperion/request_lifecycle_hooks_spec.rb` — 13 new examples
+ covering registration API, FIFO order, env-Hash sharing, failure
+ isolation, zero-cost path.
+ - `docs/OBSERVABILITY.md` — "Custom request lifecycle hooks" section
+ with NewRelic / AppSignal / OpenTelemetry / DataDog / per-route
+ Prometheus recipes + multi-tenant isolation note.
+
+ Spec count: 894 → 907 (+13). 0 failures, 11 pending.
+
+ ### 2.5-D — permessage-deflate compression-bomb fuzz harness
+
+ **The user-relevant bit.** 2.3-C shipped the RFC 7692 §8.1 defense:
+ `max_message_bytes` is applied AFTER decompression, so a tiny
+ compressed payload that explodes on inflate trips close 1009 (Message
+ Too Big) BEFORE the inflated buffer is materialized. ONE regression
+ spec covered the happy-path "4 MB of zeroes vs 64 KB cap" case in
+ `spec/hyperion/websocket_permessage_deflate_spec.rb`. **2.5-D verifies
+ the defense holds across six adversarial input vectors.**
+
+ Each vector boots a `Hyperion::WebSocket::Connection` on one half of a
+ `UNIXSocket.pair`, throws crafted compressed bytes at it from the
+ other half, and asserts: (a) the server doesn't crash, (b) the server
+ doesn't blow process RSS past a generous bomb-detection ceiling
+ (4 MiB for protocol-error vectors, 64 MiB for the streaming ratio
+ bomb — both >> the 64 KiB `max_message_bytes` cap, so a real bomb
+ trips), (c) the server closes with the expected RFC 6455 close code.
+
+ **Vectors and the close codes they trip** (crafting vector 1's input
+ is sketched after the table):
+
+ | # | Vector | Close code | Notes |
+ |---|---|---|---|
+ | 1 | Classic ratio bomb (4 GB inflated, streamed) | **1009** Message Too Big | Stream-deflated chunked input never holds 4 GB at rest; cap trips well before the full stream lands |
+ | 2 | Malformed sync trailer (`00 00 ff fe`) | **1007** Invalid Frame Payload Data | Inflate hits `Zlib::DataError`, mapped to 1007 by `Connection#inflate_message` |
+ | 3 | Mid-message dictionary corruption (3-fragment, frame 2 byte-flipped) | **1007** Invalid Frame Payload Data | Backreference points outside the legal sliding window → `Zlib::DataError` |
+ | 4 | Zero-length compressed message (empty stored deflate block, RSV1=1) | **1000** Normal Closure | Decompresses to empty string OK, follow-up close 1000 returned cleanly |
+ | 5 | Min-window-bits negotiation (`client_max_window_bits=9`) | **1000** Normal Closure | Hyperion clamps the floor to 9 (zlib raw-deflate refuses 8 in some builds); 9 round-trips OK |
+ | 6 | Compressed control frame (ping with RSV1=1) | **1002** Protocol Error | RFC 7692 §6.1 — control frames MUST NOT carry RSV1; parser rejects |
+
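+ Crafting vector 1 needs nothing beyond stdlib zlib: raw deflate
+ (negative window bits, as RFC 7692 requires) over highly redundant
+ input. A sketch of the idea; the harness's actual client-side
+ framing goes through `Hyperion::WebSocket::Builder`:
+
+ ```ruby
+ require 'zlib'
+
+ # Raw deflate (windowBits -15), matching permessage-deflate.
+ z = Zlib::Deflate.new(Zlib::BEST_COMPRESSION, -Zlib::MAX_WBITS)
+ bomb = ''.b
+ 4096.times { bomb << z.deflate("\0".b * (1024 * 1024)) } # 4 GiB of zeroes in
+ bomb << z.deflate(''.b, Zlib::SYNC_FLUSH)
+ bomb = bomb.byteslice(0, bomb.bytesize - 4) # strip the 00 00 ff ff sync trailer
+ # `bomb` is a few MiB on the wire; the server's post-inflate cap must
+ # trip at 64 KiB of *output*, long before 4 GiB ever materializes.
+ ```
+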
2112
+ **Result on macOS arm64-darwin23 + Ruby 3.3.3 with YJIT:**
+ **6/6 vectors PASS.** No bugs found. Total runtime under 2 minutes
+ (ratio bomb ~80 s, all five protocol-error vectors complete in
+ under a second each). The 2.3-C defense holds across every
+ adversarial dimension we throw at it.
+
+ **Files added.**
+ - `bench/ws_compression_bomb_fuzz.rb` — the harness. Self-contained,
+ uses Ruby stdlib `zlib` + the existing `Hyperion::WebSocket::Builder`
+ for client-side framing, no new gem deps. Runs standalone via
+ `ruby bench/ws_compression_bomb_fuzz.rb` or via the wrapper spec.
+ - `spec/hyperion/websocket_compression_bomb_fuzz_spec.rb` — single
+ example tagged `:perf` (skipped by default; operators run with
+ `--tag perf` after permessage-deflate code changes).
+
+ Spec count: default-run unchanged at 907 (the new wrapper is `:perf`
+ tagged so it skips by default). With `--tag perf` enabled the count
+ moves 908 → 909 (the perf-included pool also picks up the existing
+ `long_run_stability_spec`). 0 failures, 11 pending.
+
+ ## [2.4.0] - 2026-04-29
+
+ ### Headline
+
+ A production-stability + observability release. Targeted at long-running
+ servers under sustained traffic (the user's nginx-fronted ActionCable
+ deployment shape).
+
+ | Track | Result |
+ |---|---|
+ | 2.4-B — GC pressure round-2 | Per-parse alloc **-41 to -53%** on hello / 5-header / chunked POST; GC frequency **-28%** on chunked POST sustained 60s |
+ | 2.4-D — Linux multi-process WS bench | **6,880 msg/s p99 34 ms** at 4 procs × 200 conns (+289% vs single-proc, -75% p99); autobahn **97.8% pass** (453/463), permessage-deflate **100%** RFC 7692 conformance |
+ | 2.4-C — /-/metrics enrichment | 6 new production metrics: per-route p50/p99 histograms, fairness rejection counter, WS deflate ratio, kTLS active conns, io_uring active, threadpool queue depth. Grafana dashboard + OBSERVABILITY.md operator playbook |
+ | 2.4-A — HPACK FFI round-2 | Per-call alloc 12 → 4 objects (custom C ext, no Fiddle per call); bench at parity (HPACK is <1% of per-stream CPU). Stays opt-in. |
+
+ Spec count: 776 (2.3.0) → 823 (2.4.0). 0 failures, 11 pending.
+
+ New operator visibility:
+ - `/-/metrics` now exposes `hyperion_request_duration_seconds`, `hyperion_per_conn_rejections_total`, `hyperion_websocket_deflate_ratio`, `hyperion_tls_ktls_active_connections`, `hyperion_io_uring_workers_active`, `hyperion_threadpool_queue_depth`
+ - Grafana dashboard at `docs/grafana/hyperion-2.4-dashboard.json`
+ - Operator playbook at `docs/OBSERVABILITY.md`
+
+ ### Known limitations carried forward to 2.5
+ - WebSocket close-payload validation: 10 autobahn cases in section 7.5.1 + 7.9.x fail because `Connection#recv` echoes invalid peer close codes instead of rejecting with 1002 (Protocol Error). Documented in `docs/WEBSOCKETS.md`. **Resolved in 2.5-A — see the 2.5.0 section above.**
+
+ ### 2.4-A — HPACK FFI round-2 (custom C ext, no Fiddle per call)
+
+ **The story so far.** The 2.0.0 native HPACK path went through Fiddle:
+ each `Encoder#encode` call paid for `pack('Q*')` to build an argv
+ buffer, three `Fiddle::Pointer[scratch]` pointer-wrapper allocations,
+ plus per-header `.b` re-encoding when the source string wasn't already
+ ASCII-8BIT. The standalone microbench showed a 3.26× encode win, but
+ on real h2load traffic the FFI marshalling layer ate the savings:
+ 2.0.0 was -8 to -28% rps vs Ruby fallback. fix-B (2.2.x) introduced
+ the per-encoder scratch buffer + flat-blob `_encode_v2` ABI, which
+ brought native to **parity** with Ruby fallback (-0.05% noise) — but
+ parity isn't grounds for a default flip.
+
+ **What 2.4-A ships.** A new sibling C extension entry point,
+ `Hyperion::H2Codec::CGlue`, that bypasses Fiddle entirely on the
+ per-call hot path:
+
+ - `ext/hyperion_http/h2_codec_glue.c` — defines the `CGlue` module
+ and four singleton methods: `install(path)`, `available?`,
+ `encoder_encode_v3(handle_addr, headers, scratch_out)`,
+ `decoder_decode_v3(handle_addr, bytes, scratch_out)`.
+ - `install(path)` is called once from `H2Codec.load!` after the
+ Fiddle loader has already confirmed the cdylib loads cleanly. The C
+ glue `dlopen`s the same path with `RTLD_NOW | RTLD_LOCAL` (a
+ refcount bump per POSIX, not a double-load), `dlsym`s the three
+ Rust entries (`hyperion_h2_codec_abi_version`,
+ `hyperion_h2_codec_encoder_encode_v2`,
+ `hyperion_h2_codec_decoder_decode`), and caches them as static C
+ function pointers.
+ - `encoder_encode_v3` walks the Ruby `headers` array directly via
+ `RARRAY_LEN`/`rb_ary_entry`, packs the argv quad buffer onto the C
+ stack (default 64 headers — heap fallback for larger blocks),
+ concatenates name+value bytes into a stack-resident blob (default
+ 8 KiB — heap fallback for larger blocks), and invokes the cached
+ `encode_v2` function pointer directly. The encoded bytes land in
+ the per-encoder scratch String; Ruby's `byteslice(0, written)`
+ copies them out as the single unavoidable allocation.
+ - `RSTRING_PTR` reads the raw byte view of `name`/`value` regardless
+ of the Ruby encoding tag, eliminating the per-header `.b`
+ allocation that the v2 Ruby path could not avoid for non-binary
+ inputs.
+
+ **The Ruby façade.** `Hyperion::H2Codec::Encoder#encode` now probes
+ `H2Codec.cglue_available?` on each call. When true, it dispatches
+ through `CGlue.encoder_encode_v3`. When false (older systems without
+ dlfcn, hardened sandboxes blocking dlopen, ABI mismatch), it
+ transparently falls back to the v2 (Fiddle) path — no operator
+ intervention required.
+
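+ Condensed, the façade's per-call dispatch is a probe and a branch. A
+ sketch (the `encode_v2_fiddle` fallback name is hypothetical; the
+ probe and the v3 entry point are from this section):
+
+ ```ruby
+ def encode(headers)
+   if H2Codec.cglue_available?
+     # v3: the C glue walks `headers` and calls the cached fn pointer.
+     CGlue.encoder_encode_v3(@handle_addr, headers, @scratch)
+   else
+     encode_v2_fiddle(headers) # v2: pack('Q*') argv + Fiddle call
+   end
+ end
+ ```
+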
2206
+ **Per-call alloc shape.**
+
+ | Path | Strings/call (encode, steady state) |
+ |---|---|
+ | 2.0.0 (Fiddle, v1 ABI) | ~12 |
+ | 2.2.x fix-B (Fiddle, v2 ABI) | ~7.5 |
+ | **2.4-A (C ext, v3 path)** | **~1** (the byteslice return) |
+
+ Counted via `GC.stat(:total_allocated_objects)` delta over 100
+ warmed encodes; spec asserts `total_allocated_objects/call < 6`
+ (includes Fixnums + transient Array iterators that GC.stat conflates
+ with String allocations).
+
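+ The counting recipe behind the table is reproducible with two
+ `GC.stat` reads. A minimal sketch of the measurement described above:
+
+ ```ruby
+ # Warm first so inline caches and the scratch buffer are steady-state.
+ 10.times { encoder.encode(headers) }
+
+ before = GC.stat(:total_allocated_objects)
+ 100.times { encoder.encode(headers) }
+ per_call = (GC.stat(:total_allocated_objects) - before) / 100.0
+ puts format('%.2f objects/call', per_call)
+ ```
+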
2219
+ **Bench delta on openclaw-vm (Linux 6.8 / 16 vCPU).** Three runs each
+ of `h2load -c 1 -m 100 -n 5000 https://127.0.0.1:9602/` against
+ `bin/hyperion -t 64 -w 1 --h2-max-total-streams unbounded ~/bench/hello.ru`
+ (a rack lambda returning `[200, {'content-type' => 'text/plain'}, ['hello']]`):
+
+ | Mode | median r/s | delta vs Ruby |
+ |---|---:|---:|
+ | Ruby HPACK (baseline) | 1,627.46 | — |
+ | native v2 (Fiddle path) | 1,628.7 (fix-B / 2.2.x baseline) | -0.05% |
+ | **native v3 (CGlue)** | 1,607.19 | -1.2% (noise) |
+
+ Raw h2load output: `.notes/2.4-a-bench-openclaw-vm.txt`.
+
+ **Default flip — declined for now.** The +15% target wasn't met on
+ the hello workload. v3 is at parity with Ruby HPACK and v2 Fiddle —
+ the per-call alloc savings (7.5 → ~1 string) don't translate to
+ end-to-end rps because hello.ru sends a 2-header response, and at
+ ~1,600 r/s the dominant CPU sits in TLS, fiber scheduling, and h2
+ framing (which v3 does NOT replace — it only swaps HPACK encode at
+ the protocol-http2 boundary). HPACK CPU on a 2-header block is
+ already <1% of the per-stream cost.
+
+ `HYPERION_H2_NATIVE_HPACK` stays default-OFF. Operators who opt in
+ via `HYPERION_H2_NATIVE_HPACK=1` get the v3 (CGlue) path
+ automatically when the C glue installs successfully, and v2 (Fiddle)
+ when it doesn't (older glibc / sandboxed dlopen). The v3
+ implementation ships because:
+
+ 1. It's the foundation for any future end-to-end win — the FFI
+ marshalling layer cannot get cheaper than v3 without rewriting
+ the entire h2 framer in Rust.
+ 2. Per-call alloc reduction (7.5 → ~1 string) is real and lowers
+ GC pressure on h2-heavy workloads even when rps is flat.
+ 3. Specs lock in v3 as the default code path inside the
+ `H2Codec::Encoder#encode` method body — the v3-specific 12
+ spec examples plus the existing 24 H2Codec/Http2 spec
+ examples all execute through the v3 path on hosts where
+ CGlue installs (which is every modern Linux + macOS).
+
+ The next gate for the default flip is a Rails-shape bench (~30
+ response headers per stream). Tracked as a separate H2 follow-up.
+
+ ### 2.4-B — GC-pressure reduction round-2 (long-run stability)
+
+ **Why this matters.** Phase 11 (2.2.0) cut Adapter::Rack hot-path
+ allocations -53% (19 → 9 obj/req) and locked the number with
+ `yjit_alloc_audit_spec`. But the *long-run* server hot path under
+ sustained traffic includes more than the adapter — the C parser, the
+ WebSocket frame ser/de, and the connection lifecycle each ship their
+ own per-message allocations that scale with **message rate**, not just
+ request count. A single keep-alive connection pipelined with 1000
+ requests, or a chat-style WebSocket connection echoing 1000 messages,
+ should not allocate 1000 connection-state objects — only the
+ truly-per-message ones.
+
+ **The audit.** `bench/gc_audit_2_4_b.rb` drove four sustained workloads
+ (HTTP keep-alive GET, chunked POST, WS recv, WS send w/ permessage-
+ deflate) under `GC.disable` for 5000-10000 iters and measured per-
+ iter `total_allocated_objects` deltas plus GC.count over the window.
+ `bench/gc_audit_2_4_b_trace.rb` ran the same workloads under
+ `ObjectSpace.trace_object_allocations` for file:line attribution
+ (the pattern is sketched below the table). Top sites identified,
+ written up at `bench/gc_audit_2_4_b.md`:
+
+ | # | Site | Fix |
+ |---|-----------------------------------------------|----------------------------------------------------------------|
+ | 1 | `parser.c:state_init` 6 empty placeholders | S1 — Qnil sentinels, lazy alloc on first append |
+ | 2 | `parser.c:on_headers_complete` cl_key/te_key | S2 — pre-interned frozen globals at Init_hyperion_http |
+ | 3 | `parser.c:stash_pending_header` reset empties | S1 (rolled into) — reset to Qnil, not fresh empty Strings |
+ | 4 | `frame.rb:Builder.build` `payload.b` | S4 — branch on `encoding == BINARY_ENCODING`, skip no-op clone |
+ | 5 | `frame.rb:Parser.parse` `slice.b` + `(+'').b`| S5 — drop redundant `.b` + share frozen `EMPTY_BIN_PAYLOAD` |
+
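+ The attribution pass reduces to stdlib `objspace`. A minimal sketch
+ (the `run_one_iteration` call is a stand-in for the audited
+ workload):
+
+ ```ruby
+ require 'objspace'
+
+ GC.disable # keep transient objects visible for attribution
+ ObjectSpace.trace_object_allocations do
+   run_one_iteration
+ end
+
+ sites = Hash.new(0)
+ ObjectSpace.each_object(String) do |s|
+   file = ObjectSpace.allocation_sourcefile(s) or next # nil for untraced
+   sites["#{file}:#{ObjectSpace.allocation_sourceline(s)}"] += 1
+ end
+ sites.max_by(5) { |_, n| n }.each { |site, n| puts "#{n}\t#{site}" }
+ ```
+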
2290
+ **What ships:**
+
+ | Deliverable | Where | Why |
+ |---|---|---|
+ | C parser lazy field alloc + frozen smuggling-defense keys | `ext/hyperion_http/parser.c` | Saves 6 empty-String allocations per parse() in state_init, 2 fresh empty Strings per parsed header in stash_pending_header, 2 cl_key/te_key Strings per parse(). Lazy Qnil → empty-String coercion at Request build means the Ruby surface is unchanged. |
+ | WebSocket frame `.b` clone elimination + frozen empty payload | `lib/hyperion/websocket/frame.rb` | `Builder.build` skips `payload.b` when input is already ASCII-8BIT. `Parser.parse` / `parse_with_cursor` drop the redundant `slice.b` (the WS `@inbuf` is binary by construction) and share one frozen `EMPTY_BIN_PAYLOAD` const for empty frames. |
+ | `bench/gc_audit_2_4_b.rb` + `gc_audit_2_4_b_trace.rb` + `gc_audit_2_4_b.md` | NEW | Sustained-workload audit harness covering 4 hot paths; writeup with measured before/after per site. |
+ | `spec/hyperion/parser_alloc_audit_spec.rb` | NEW (4 examples) | Locks per-parse allocation counts: ≤12 for minimal GET, ≤22 for 5-header GET, ≤20 for chunked POST. Plus an identity invariant on the EMPTY_STR coercion. |
+ | `spec/hyperion/websocket_frame_alloc_audit_spec.rb` | NEW (4 examples) | Locks Builder.build (≤4 obj/call unmasked), Parser.parse (≤11 masked, ≤9 unmasked), and the EMPTY_BIN_PAYLOAD frozen-identity invariant. |
+ | `spec/hyperion/long_run_stability_spec.rb` | NEW (1 example, `:perf` tagged) | Drives 10000 keep-alive GETs over 100 connections, asserts ≤65 obj/req and ≥1 GC per 500 reqs. Excluded from default suite via `spec_helper.rb` `filter_run_excluding(:perf)`; operators run via `--tag perf` after allocation-pressure changes. |
+
+ **Measured (5000-iter steady-state, no YJIT, macOS arm64):**
+
+ | case | 2.3.0 | 2.4-B | delta |
+ |-------------------------------|------:|------:|------:|
+ | GET /, 1 header (parse) | 19.00 | 9.00 | **-53%** |
+ | GET /a?q=1, 5 headers (parse) | 36.00 | 18.00 | **-50%** |
+ | POST chunked, 4 chunks (parse)| 27.00 | 16.00 | **-41%** |
+ | WS Builder.build unmasked | 3+1 | 3 | -25% |
+ | WS Parser.parse unmasked | 9 | 8 | -11% |
+
+ **Sustained-load GC frequency (10000 iters, openclaw-vm Linux Ruby 3.3.3):**
+
+ | workload | 2.3.0 GC freq | 2.4-B GC freq | delta |
+ |-------------------|--------------:|--------------:|---------:|
+ | chunked POST parse| 1 GC / 689 | 1 GC / 952 | **-28%** |
+ | ws recv (masked) | 1 GC / 625 | 1 GC / 625 | 0% |
+
+ (WS recv unchanged because the masked-frame audit path goes through
+ `CFrame.unmask` regardless of S5; the regression spec exercises the
+ unmasked-side win that S5 captures.)
+
+ **wrk validation (openclaw-vm, 30s, -t4 -c200, hyperion -w4 -t5):**
+
+ | build | req/s | p50 | p99 | std-dev |
+ |--------|------:|-------:|-------:|--------:|
+ | 2.3.0 | 14833 | 1.27ms | 2.64ms | 329µs |
+ | 2.4-B | 14985 | 1.26ms | 2.61ms | 330µs |
+
+ Throughput is adapter-bound at this scale — the win is in GC pressure,
+ not raw rps. p99 + std-dev sit within noise on a 30s run; the
+ `long_run_stability_spec` is the regression guard for sustained-load
+ behaviour. The user-relevant 2.4 win for **long-running production
+ servers** is the -28% GC frequency on the chunked path, which scales
+ with real upload volume.
+
+ Sites considered + deferred (with rationale in `bench/gc_audit_2_4_b.md`):
+ * @inbuf initial capacity 8KB → 16KB — verified the 95th percentile
+ fits in 4KB; the bump would regress 10k-keep-alive RSS without payback.
+ * WS frame parse 8-element Array — Ruby façade surface; deferred.
+ * Per-conn env Hash pool — already pooled by Phase 11.
+
+ Spec count: 788 → 796 default-run + 1 (`:perf`-tagged) = 797 total.
+ No version bump (release task is 2.4-fix-F).
+
+ ### 2.4-C — `/-/metrics` enrichment (operator observability)
+
+ **The story.** The 2.x sprints added many operator knobs
+ (`permessage_deflate`, `max_in_flight_per_conn`, `tls.ktls`,
+ `io_uring`, h2 native HPACK), but `/-/metrics` exposed only the
+ 1.x counter set. Operators who turned the knobs on had no production
+ visibility into whether they were firing — "is permessage-deflate
+ compressing my chat traffic?", "is fairness rejecting any
+ clients?", "did kTLS engage on this worker?" all lacked a
+ metric-backed answer.
+
+ 2.4-C closes the gap. Operators can now see permessage-deflate
+ effectiveness, per-conn fairness rejections, kTLS engagement,
+ io_uring policy state, and ThreadPool queue depth directly in the
+ `/-/metrics` body, plus per-route latency histograms with
+ configurable path templating to keep cardinality bounded.
+
+ **What ships:**
+
+ | Metric | Type | What it tells the operator |
+ |---|---|---|
+ | `hyperion_request_duration_seconds` | histogram | Per-route p50/p99 by `method` + templated `path` + `status` class. Buckets `0.001…10`s. |
+ | `hyperion_per_conn_rejections_total` | counter | Per-worker rate of 503 + Retry-After rejections from the 2.3-B fairness cap. |
+ | `hyperion_websocket_deflate_ratio` | histogram | `original_bytes / compressed_bytes` for every WS message that goes through 2.3-C permessage-deflate. Buckets 1.5×…50×. |
+ | `hyperion_tls_ktls_active_connections` | gauge | Per-worker count of TLS connections currently driven by the kernel TLS_TX module. |
+ | `hyperion_io_uring_workers_active` | gauge | 1 = io_uring policy active on this worker, 0 = epoll. |
+ | `hyperion_threadpool_queue_depth` | gauge | Snapshot of the worker's ThreadPool inbox at scrape time. |
+
+ **Path templating.** `Hyperion::Metrics::PathTemplater` collapses
+ `/users/123` → `/users/:id` and `/orders/<uuid>` → `/orders/:uuid`
+ by default, with an LRU-cached lookup so repeated paths on
+ keep-alive connections pay one Hash hit, not the regex chain.
+ Operators with Rails-style routes plug in custom rules via
+ `metrics do; path_templater MyTemplater.new; end` in `config.rb`.
+
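+ A custom templater only needs to answer a path → template lookup. A
+ sketch of the plug-in shape (the `template` method name and the
+ cache policy here are assumptions; only the `path_templater`
+ hand-off is from this section):
+
+ ```ruby
+ class MyTemplater
+   UUID_RE = /\h{8}-\h{4}-\h{4}-\h{4}-\h{12}/
+   MAX_CACHE = 2_048
+
+   def initialize
+     @cache = {}
+   end
+
+   def template(path)
+     @cache.shift if @cache.size >= MAX_CACHE # crude insertion-order bound
+     @cache[path] ||= path.gsub(UUID_RE, ':uuid').gsub(%r{/\d+(?=/|$)}, '/:id')
+   end
+ end
+ ```
+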
2380
+ **Allocation impact on the request hot path.** The new histogram
+ observation runs in `Connection#serve` (not in `Adapter::Rack#call`,
+ which `yjit_alloc_audit_spec` locks at 9 obj/req post-Phase-11).
+ Per-observation steady-state allocation: 1 fresh 3-element label
+ Array (the `method`/`path`/`status` tuple); the templater + the
+ HistogramAccumulator both reuse pre-allocated structures past first
+ sight of a given route. `yjit_alloc_audit_spec` stays green at
+ 9 obj/req.
+
+ **Files / sites:**
+
+ | Where | What |
+ |---|---|
+ | `lib/hyperion/metrics.rb` | Histogram + gauge + labeled-counter API on `Hyperion::Metrics`. Snapshot helpers for the exporter. |
+ | `lib/hyperion/metrics/path_templater.rb` | NEW — LRU-cached templater with default integer/UUID rules. |
+ | `lib/hyperion/prometheus_exporter.rb` | `render_full(metrics_sink)` emits histograms / gauges / labeled counters in addition to the legacy counter render. |
+ | `lib/hyperion/admin_middleware.rb` | `/-/metrics` switches to `render_full` when the sink supports it (defensive fallback otherwise). |
+ | `lib/hyperion/connection.rb` | Per-route duration histogram observation; labeled per-worker rejection counter; kTLS untrack on close. |
+ | `lib/hyperion/websocket/connection.rb` | WS deflate ratio histogram observation in `deflate_message`. |
+ | `lib/hyperion/tls.rb` | `track_ktls_handshake!` / `untrack_ktls_handshake!` helpers. |
+ | `lib/hyperion/server.rb` | Calls `track_ktls_handshake!` after every TLS accept. |
+ | `lib/hyperion/worker.rb` | Sets the `hyperion_io_uring_workers_active` gauge at boot/shutdown. |
+ | `lib/hyperion/thread_pool.rb` | Block-form gauge for `hyperion_threadpool_queue_depth` (read live at scrape time). |
+ | `lib/hyperion/config.rb` | `MetricsConfig` subconfig with `path_templater` + `enabled` knobs. |
+ | `spec/hyperion/metrics_enrichment_spec.rb` | NEW — 27 examples covering templater, histogram/gauge/counter API, exporter rendering, and per-domain integration (Connection, WS deflate, kTLS, fairness). |
+ | `docs/OBSERVABILITY.md` | NEW — operator playbook: every metric, its query, what action a non-zero value should trigger. |
+ | `docs/grafana/hyperion-2.4-dashboard.json` | NEW — pre-built dashboard with 8 panels (heatmap + p50/p99 + rejection rate + deflate ratio + kTLS + io_uring + queue depth). |
+
+ **No new gem deps.** The exporter extends the existing in-tree
+ emission path; no `prometheus-client` (or any other) gem was added.
+
+ Spec count: 796 → 823 default-run.
+ No version bump (release task is 2.4-fix-F).
+
2414
+ ### 2.4-D — Linux multi-process WS bench rerun + autobahn RFC 6455 conformance
+
+ Two items deferred from 2.3-D landed in this stream — the openclaw-vm
+ multi-process WebSocket bench and the autobahn-testsuite fuzzer run
+ against the WS echo rackup. Bench + docs only; no production code
+ changed.
+
+ **Headlines.**
+
+ * **Linux multi-process WS bench captures the published numbers.**
+ 4 procs × 200 conns × 1000 msgs hits **6,880 msg/s** with
+ p50 28.60 ms / p99 33.86 ms; 4 procs × 40 conns hits **7,561 msg/s**
+ with p50 5.26 ms / p99 6.22 ms. Versus the fix-E single-process
+ Linux baseline this is **+285–289% msg/s and a 75% drop in
+ p99 on the throughput row** (134 ms → 33.86 ms). Confirms that
+ the long fix-E Linux tail at 200 conns was client-side GVL
+ serialisation, not server-side latency — the same shape we
+ saw on macOS in 2.3-D.
+
+ * **autobahn RFC 6455 conformance: 453/463 pass (97.8%).** Run on
+ openclaw-vm with `bench/ws_echo_autobahn.ru` (1 MiB cap +
+ permessage-deflate negotiated) against the `crossbario/autobahn-testsuite`
+ Docker image. Per-section breakdown:
+
+ | Section | Cases | Pass | Note |
+ |---|---:|---:|---|
+ | 1 — Framing | 16 | 16 / 16 | 100% OK |
+ | 2 — Pings / pongs | 11 | 11 / 11 | 100% OK |
+ | 3 — Reserved bits / opcodes | 7 | 7 / 7 | 100% OK |
+ | 4 — Frame contents | 10 | 10 / 10 | 100% OK |
+ | 5 — Fragmentation | 20 | 20 / 20 | 100% OK |
+ | 6 — UTF-8 validation | 145 | 145 / 145 | 4 NON-STRICT (fail-fast position; passes per RFC §8.1 SHOULD) |
+ | 7 — Close handling | 37 | 27 / 37 | **10 FAILED — 2.5 follow-up** |
+ | 9 — Limits / very large | — | — | excluded by config |
+ | 10 — Auto-fragmentation | 1 | 1 / 1 | 100% OK |
+ | 12 — permessage-deflate (RFC 7692) | 90 | 90 / 90 | 100% OK — 2.3-C validated |
+ | 13 — permessage-deflate fragmentation | 126 | 126 / 126 | 100% OK — 2.3-C validated |
+
+ **Sections 12 + 13 (216 cases, RFC 7692 permessage-deflate)**
+ are 100% OK on this run — the first autobahn validation since
+ 2.3-C shipped the extension. Confirms the encode + decode + per-
+ message reset paths are RFC-compliant end-to-end.
+
+ **Section 7 close handling has 10 FAILED cases, all in 7.5.1
+ + 7.9.x.** RFC 6455 §7.4 requires the server to close 1002
+ (Protocol Error) when the peer sends a close frame with an
+ invalid close code (0, 1004, 1005, 1006, reserved range, etc.).
+ Hyperion's `Connection#recv` close path currently echoes the
+ invalid code back instead of rejecting it. **Filed as a 2.5
+ follow-up** — out of scope for 2.4-D (bench + docs only) per the
+ sprint scope.
+
+ **What ships:**
+
+ | Where | What |
+ |---|---|
+ | `bench/ws_echo_autobahn.ru` | NEW — autobahn-friendly variant of `ws_echo.ru`. 1 MiB `max_message_bytes` (vs 16 KiB) and propagates the negotiated `permessage-deflate` extension into the 101 response so sections 12/13 fire. Plain `ws_echo.ru` runs through autobahn fine for sections 1-10 but marks 12/13 UNIMPLEMENTED because the server never advertises deflate. |
+ | `bench/parse_autobahn_index.rb` | NEW — reads `autobahn-reports/index.json` and prints the per-section breakdown the table above came from. Identifies FAILED cases for triage. Also lists OK / NON-STRICT / INFORMATIONAL / UNIMPLEMENTED counts per section. |
+ | `autobahn-config/fuzzingclient.json` | UPDATED — agent string bumped to `Hyperion-2.4.0`, points at `ws://127.0.0.1:9888` (no path; `bench/ws_echo_autobahn.ru` accepts upgrade on any URL), header comment now references `parse_autobahn_index.rb` and the 17-minute wall-clock estimate. |
+ | `docs/WEBSOCKETS.md` "RFC 6455 conformance" subsection | UPDATED — replaces the deferred-to-2.4 note with the actual 2.4-D results table, the 7.x close-handshake gap as a known 2.5 follow-up, and a "Configuring permessage-deflate echo" snippet for operators rolling their own ws app. |
+ | `docs/BENCH_HYPERION_2_0.md` "WebSocket multi-process bench" subsection | UPDATED — replaces the deferred-recipe block with the openclaw-vm 2.4-D table + cross-platform shape comparison (within-host scaling vs Apple Silicon dev) + raw 3-run-median data. |
+
+ **Bench environment.** All 2.4-D numbers from `openclaw-vm`,
+ Ubuntu 24.04, kernel 6.8, 16 vCPU x86_64, Ruby 3.3.3, hyperion
+ master @ commit `ffcbdfb` (2.4-C tip pre-2.4-D commit). Three runs
+ each, median reported, run-to-run variance ~3-5%.
+
+ **Known limit (logged for 2.5).** Section 7.5.1 + 7.9.x autobahn
+ FAILED cases — the close-handshake invalid-payload validation gap
+ described above. The fuzzer tests close codes 0 / 1004 / 1005 /
+ 1006 / 1012-1015 / 1016+ / 2000+ / 2999 / etc.; Hyperion echoes the
+ peer's invalid code instead of rejecting it with 1002. Fix is a
+ small `validate_close_payload!` helper on `WebSocket::Connection`
+ plus the matching close-handshake response — punted out of 2.4-D
+ because the sprint stream was bench + docs only.
+
+ Spec count: 823 → 823 (no spec changes — the bench scripts and the
+ parser script are runnable Ruby but not exercised by the rspec suite).
+ No version bump (release task is 2.4-fix-F).
+
2494
+ ## [2.3.0] - 2026-05-01
+
+ ### Headline
+
+ A WebSocket-bandwidth + operator-knobs release. The 2.3.0 sprint
+ targeted the user's nginx-fronted plaintext-h1 + WebSocket
+ production topology. The headline wins:
+
+ | Track | Result |
+ |---|---|
+ | 2.3-C — WebSocket permessage-deflate (RFC 7692) | **20× wire reduction** on chat-style JSON; chat workloads / ActionCable fan-out save bandwidth at the cost of ~30-40% encode-CPU |
+ | 2.3-D — WS multi-process bench client | +176% msg/s, p99 halved (debunked fix-E's single-process tail as client-side GVL, not server-side) |
+ | 2.3-A — io_uring accept on Linux 5.6+ (opt-in) | At parity with epoll on hello — Ruby-dispatch-bound at this rate, not accept-syscall-bound. Path engages cleanly; ships opt-in for high-accept-churn workloads |
+ | 2.3-B — Per-conn fairness + TLS handshake throttle | Defense-in-depth knobs; `max_in_flight_per_conn` defaults to nil (no behavior change), operators opt in via config / CLI / env |
+
+ Spec count: 698 (2.2.0) → 776 (2.3.0).
+
+ New operator knobs:
+ - `HYPERION_IO_URING={on,off,auto}` (env) + `c.io_uring = :auto/:on/:off` (DSL)
+ - `c.max_in_flight_per_conn = N` / `--max-in-flight-per-conn N` / `HYPERION_MAX_IN_FLIGHT_PER_CONN=N`
+ - `c.tls.handshake_rate_limit = N` / `--tls-handshake-rate-limit N` / `HYPERION_TLS_HANDSHAKE_RATE_LIMIT=N`
+ - `c.websocket.permessage_deflate = :auto/:on/:off` (DSL) / `HYPERION_WS_DEFLATE={on,off,auto}`
+
2517
+ ### 2.3-A — io_uring accept on Linux 5.6+, opt-in via `HYPERION_IO_URING=on`
+
+ **Why this matters most:** the 2026-04-30 sweep showed Hyperion at
+ 96,813 r/s on hello `-w 16 -t 5` (vs Puma 75,776 — already +27.8%).
+ With the GVL bypassed by 16 workers, the next bottleneck is the
+ kernel accept loop: every accept costs `accept_nonblock` plus an
+ `IO.select` on the EAGAIN edge — up to two syscalls per accepted
+ connection under burst. io_uring submits accept SQEs and reaps CQEs
+ in one syscall, and the kernel batches multiple accepts in a single
+ CQE drain when connections arrive faster than the fiber consumes
+ them.
+
+ **Target:** hello `-w 16 -t 5` from 96,813 → ≥ 130,000 r/s with p99
+ unchanged (~2-3 ms).
+
+ **What ships:**
+
+ | Deliverable | Where | Why |
+ |---|---|---|
+ | `ext/hyperion_io_uring/` Rust crate | NEW | Wraps the `io-uring` crate (https://docs.rs/io-uring) — well-maintained safe Rust around liburing. Linux-gated via `target.'cfg(target_os = "linux")'` so the macOS dev build still cargo-checks cleanly; Darwin compiles to stubs that return -ENOSYS. |
+ | `lib/hyperion/io_uring.rb` | NEW | Ruby surface: `Hyperion::IOUring.supported?`, `Hyperion::IOUring::Ring.new(queue_depth: 256)` with `#accept(fd)` / `#read(fd, max:)` / `#close`. Loaded over Fiddle, identical pattern to `Hyperion::H2Codec`. |
+ | `Hyperion::Server#run_accept_fiber` | UPDATED | Splits into `run_accept_fiber_io_uring` and `run_accept_fiber_epoll`. The io_uring branch lazily opens a per-fiber ring on first use (`Fiber.current[:hyperion_io_uring] ||= Ring.new(...)`), drains accept CQEs, and hands each accepted fd to `dispatch` via `::Socket.for_fd`. Closed at fiber exit. The TLS path keeps epoll — io_uring accept is wired only for plain TCP (the SSL handshake still wants the userspace `accept` + `SSL_accept` dance). |
+ | `Hyperion::Config#io_uring` | NEW | Tri-state `:off` / `:auto` / `:on`. Mirrors `tls.ktls`. |
+ | `HYPERION_IO_URING={on,auto,off}` env var | NEW | Operator flips on for an A/B run without rewriting the config file, identical pattern to fix-B `HYPERION_H2_NATIVE_HPACK` and fix-C `HYPERION_TLS_KTLS`. |
+ | `spec/hyperion/io_uring_spec.rb` | NEW (16 examples) | Cross-platform: `supported?` returns false on Darwin, `:auto` doesn't raise on Mac, `:on` raises with a clear "io_uring not supported" / "io_uring required" message on Mac. Linux-only context (gated via `if: described_class.supported?`): ring lifecycle (open + close + no fd leak across 1000 accepts), feature parity (bytes through the io_uring path match bytes through epoll). |
+
+ **Per-fiber rings, NEVER per-process or per-thread.** io_uring under
+ fork+threads has known sharp edges:
+
+ - The submission queue is process-shared by default — under fork, the
+ parent's outstanding SQEs leak into the child's CQ.
+ - The `IORING_SETUP_SQPOLL` kernel thread does not survive fork.
+ - Threads sharing a ring need `IORING_SETUP_SINGLE_ISSUER` + careful
+ submission discipline.
+
+ The safe pattern matching Hyperion's fiber-per-conn architecture: one
+ ring per fiber that needs it (the accept fiber, optionally per-conn
+ read fibers in a future 2.3-x round). Rings are opened lazily on
+ first use and closed at fiber exit (sketched below). Workers never
+ share rings across fork — each child opens its own.
+
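+ The lazy-open / close-at-exit discipline, sketched against the Ring
+ surface named above (the drain shape, with `#accept` returning one
+ ready fd at a time, is an assumption):
+
+ ```ruby
+ def run_accept_fiber_io_uring(listener)
+   ring = Hyperion::IOUring::Ring.new(queue_depth: 256) # one ring per fiber, never shared
+   loop do
+     fd = ring.accept(listener.fileno) # submit the SQE + reap CQEs in one syscall
+     dispatch(::Socket.for_fd(fd))
+   end
+ ensure
+   ring&.close # closed at fiber exit; each forked child opens its own
+ end
+ ```
+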
2557
+ **Default off in 2.3.0.** Mirrors the 2.2.0 fix-B
+ `HYPERION_H2_NATIVE_HPACK` pattern: ship the plumbing, give operators
+ the env var to A/B, flip the default to `:auto` only after 6 months
+ of soak. io_uring code in production has too many sharp edges to
+ default-on without field validation.
+
+ **Bench delta on openclaw-vm — measured 2026-04-30 (post Linux build fix `599775a`):**
+
+ | Row | epoll baseline | io_uring (HYPERION_IO_URING=on) | Δ |
+ |---|---:|---:|---:|
+ | hello `-w 16 -t 5` | 90,022 r/s | 91,228 r/s | +1.3% (noise) |
+ | hello `-w 4 -t 5` | 21,184 r/s | 22,073 r/s | +4.2% |
+
+ io_uring engages cleanly (boot log: `io_uring accept policy resolved
+ policy=on active=true supported=true`) but the rps delta is inside
+ the bench-noise envelope. The hello workload at 90k r/s on -w 16 is
+ **Ruby-dispatch-bound, not accept-syscall-bound** — each accept on
+ the epoll path is usually a single `accept_nonblock` syscall (the
+ `IO.select` fires only on the EAGAIN edge), and the kernel-side time
+ difference between that and an io_uring accept SQE is small relative
+ to the per-request env hash construction + body iteration + response
+ writing. The expected win zone for io_uring is high-churn
+ accept-bound workloads (e.g., many short-lived connections,
+ multi-connection accept batching with `IORING_OP_ACCEPT_MULTI`); on
+ long-keepalive wrk benches like ours, the accept rate is just
+ (connection count / wrk run duration) = 200/20s = 10/sec, which
+ neither path is paying for. Default stays `:off` — operators with
+ high-accept-churn shapes (RPC ingress, short-lived workers behind a
+ TCP load balancer that opens fresh connections per request) can flip
+ it on for A/B.
+
+ Spec count: 698 (2.2.0) → 714 (+ 16 io_uring specs).
+
+ ### 2.3-B — per-conn fairness cap + TLS handshake CPU throttle
+
+ **Why this matters.** The user's deployment shape is plaintext h1
+ behind nginx + LB. nginx multiplexes many client requests onto a
+ small number of upstream connections via HTTP/1.1 keep-alive. **One
+ greedy upstream connection** (nginx pipelining many requests through
+ it) can starve other connections — a CPU-bound JSON serialize in the
+ wrong place lets one client hog the worker thread pool while
+ everyone else's p99 climbs.
+
+ Two related defences ship together:
+
+ 1. **Per-conn fairness cap.** `Hyperion::Connection` now carries an
+ in-flight counter and an optional ceiling. When a request arrives
+ and the cap would be exceeded, the connection answers with a
+ canned `503 Service Unavailable` + `Retry-After: 1` and stays
+ alive. nginx (or any peer) retries the request after the in-flight
+ work drains. Default cap = `nil` (no cap, matches 2.2.0); the
+ recommended setting is `pool/4` so no single conn can use more
+ than 25% of the worker's thread budget. `:auto` resolves at
+ `Config#finalize!` to `thread_count / 4`, floor 1.
+
+ 2. **TLS handshake CPU throttle.** A new
+ `Hyperion::TLS::HandshakeRateLimiter` token bucket caps SSL_accept
+ CPU per worker (a sketch of the bucket follows the table below).
+ Defends direct-exposure operators against handshake storms (e.g.,
+ during a deployment when nginx restarts and reconnects everything).
+ Default = `:unlimited` (matches 2.2.0). For nginx-fronted
+ topologies this is mostly defensive — nginx keeps long-lived
+ upstream conns so the handshake rate is normally near-zero.
+
+ **What ships:**
+
+ | Deliverable | Where | Why |
+ |---|---|---|
+ | `Hyperion::Connection` per-conn semaphore | `lib/hyperion/connection.rb` | Mutex-guarded `@in_flight` counter; admit/release helpers; canned `REJECT_503_PER_CONN_OVERLOAD` payload (no allocation per reject); deduplicated `:per_conn_overload_rejects` warn (one per Connection lifetime). |
+ | `Hyperion::Config#max_in_flight_per_conn` | `lib/hyperion/config.rb` | Top-level knob (not nested — applies to every conn, not h2-specific). Tri-state: `nil` (default, no cap), positive Integer (explicit cap), `:auto` (resolves to `thread_count / 4`, floor 1, at finalize time). |
+ | `Hyperion::Config#tls.handshake_rate_limit` | `lib/hyperion/config.rb` | Token-bucket budget in handshakes/sec/worker. `:unlimited` (default) or positive Integer. |
+ | `Hyperion::TLS::HandshakeRateLimiter` | `lib/hyperion/tls.rb` | Mutex-guarded token-bucket. `acquire_token!` returns true when budget available, false when over budget. `:unlimited` short-circuits every call to true so the hot path stays branchless. |
+ | CLI `--max-in-flight-per-conn VALUE` + `HYPERION_MAX_IN_FLIGHT_PER_CONN` env var | `lib/hyperion/cli.rb` | Same parser/env-var pattern as fix-D `--h2-max-total-streams`. `auto` resolves at finalize. |
+ | CLI `--tls-handshake-rate-limit VALUE` + `HYPERION_TLS_HANDSHAKE_RATE_LIMIT` env var | `lib/hyperion/cli.rb` | Same pattern. `unlimited` is the default sentinel. |
+ | Plumbing through Server / Worker / Master / ThreadPool | 4 files | The cap propagates from `Config` → CLI → `Server.new` → `ThreadPool` → every Connection the worker thread builds. The TLS limiter lives on `Server#tls_handshake_limiter` (one per worker). |
+ | `spec/hyperion/per_conn_fairness_spec.rb` | NEW (24 examples) | Cap=nil = 2.2.0 behaviour; cap=N admits + rejects; 503 + Retry-After + per-connection overload body verified; metric + dedup-warn coverage; finalize! resolves `:auto` to `thread_count/4`; CLI flag + env var grammar tests. |
+ | `spec/hyperion/tls_handshake_throttle_spec.rb` | NEW (19 examples) | Limiter `:unlimited` = no throttle (regression); 100/sec rate admits ~100, rejects ~100 in a 200-attempt burst; refill over 0.55s adds ~25 tokens; capacity bounded (no infinite accrual); thread-safety; CLI + env var coverage. |
+
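+ The limiter's contract fits in a few lines. A minimal token-bucket
+ sketch matching the behaviour described above (`:unlimited`
+ short-circuit, refill over elapsed time, capacity-bounded accrual);
+ the internals are illustrative:
+
+ ```ruby
+ class HandshakeRateLimiter
+   def initialize(rate) # handshakes/sec/worker, or :unlimited
+     @rate = rate
+     @tokens = rate.is_a?(Numeric) ? rate.to_f : 0.0
+     @last = Process.clock_gettime(Process::CLOCK_MONOTONIC)
+     @mutex = Mutex.new
+   end
+
+   def acquire_token!
+     return true if @rate == :unlimited # hot path stays cheap
+     @mutex.synchronize do
+       now = Process.clock_gettime(Process::CLOCK_MONOTONIC)
+       @tokens = [@tokens + (now - @last) * @rate, @rate.to_f].min # no infinite accrual
+       @last = now
+       if @tokens >= 1.0
+         @tokens -= 1.0
+         true
+       else
+         false
+       end
+     end
+   end
+ end
+ ```
+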
2634
+ **Default unchanged.** The cap is an opt-in hardening tool, not a
2635
+ default flip. Existing operators upgrading from 2.2.0 → 2.3.0 see
2636
+ identical behaviour without setting either knob. Operators who want
2637
+ the fairness cap on for `-t 16` workers add
2638
+ `max_in_flight_per_conn :auto` to their config (or pass
2639
+ `--max-in-flight-per-conn auto` / set
2640
+ `HYPERION_MAX_IN_FLIGHT_PER_CONN=auto`). Pattern matches fix-D's
2641
+ `--h2-max-total-streams`: configure once, no daemon reload required.
2642
+
2643
+ **Bench plan (deferred; openclaw-vm SSH key not loaded in this
2644
+ session).** The contended-workload shape is:
2645
+
2646
+ ```sh
2647
+ # Setup: a Rack app where one client gets a 50ms handler, others fast.
2648
+ # Run two wrk processes simultaneously: one client × 1000 req/s
2649
+ # (greedy, on one connection), 50 clients × 10 req/s (light, on
2650
+ # 50 connections). Measure p99 of the light clients.
2651
+ ```
2652
+
2653
+ Compare 2.2.0 baseline vs 2.3-B with `--max-in-flight-per-conn 4`
2654
+ (for `-t 16`). Target: light-client p99 -20-30%. Simpler proxy if
2655
+ the contended-workload bench is too involved to set up cleanly: run
2656
+ `wrk -t1 -c1` (single client, single conn) at peak rps, then
2657
+ `wrk -t1 -c100` (100 clients, 100 conns) at peak rps, compare
2658
+ per-conn-msg-rate. The bench will land as a separate `[bench]`
2659
+ commit when SSH is available.
2660
+
2661
+ Spec count: 714 → 757 (+ 24 fairness + 19 throttle = 43 new specs,
2662
+ 0 regressions).
2663
+
2664
+ ### 2.3-C — WebSocket permessage-deflate (RFC 7692)
2665
+
2666
+ **Why this matters most for the user's deployment shape.** ActionCable /
2667
+ chat / pubsub WebSocket traffic compresses very well — typical JSON
2668
+ message frames are 80-95% redundant (repeated field names, recurring
2669
+ user IDs, recurring chat IDs). RFC 7692 permessage-deflate compresses
2670
+ each message with a shared LZ77 dictionary so wire bytes drop 5-20×
2671
+ on the chat-style workload that ActionCable fans out to thousands of
2672
+ idle subscribers. **Bandwidth costs move with bytes, not with
2673
+ dispatches** — for an nginx-fronted deployment the saving lands
2674
+ straight on the egress bill (Cloudfront / ALB egress @ ~$0.085/GiB at
2675
+ the unhappy AWS price band).
2676
+
2677
+ **Bench delta on a chat-style JSON workload (1 KB messages,
2678
+ `bench/ws_deflate_bench.rb`):**
2679
+
2680
+ | Mode | Bytes per message | Wire reduction | msg/s (macOS arm64) | msg/s (openclaw Linux 16-vCPU) |
2681
+ |---:|---:|---:|---:|---:|
2682
+ | Plain (no deflate) | 400.8 B | — | 57,498 | 11,782 |
2683
+ | permessage-deflate | 19.7-20.0 B | **20.0-20.4× smaller** | 34,999 (61%) | 8,101 (69%) |
2684
+
2685
+ Both hosts confirm the wire reduction is workload-shape-bound, not
2686
+ host-bound (zlib is identical on both). The msg/s gap is the deflate
2687
+ CPU cost on the encode side. The openclaw measurement was run after
2688
+ the 2.3-C ship (commit `8044610`) on `bench/ws_deflate_bench.rb`.
2689
+
2690
+ The 20× number is an upper bound — chat-style JSON has very repetitive
2691
+ field names, which the shared deflate dictionary picks up immediately.
2692
+ Random binary / already-compressed payloads (H.264 video frames,
2693
+ gzipped logs) see near-zero saving and would be better served by
2694
+ the `:off` policy on those routes (the operator knob is per-process,
2695
+ but you can hand-roll different dispatch shapes per route if needed).
2696
+
2697
+ The msg/s drop is expected — for a bandwidth-bound workload (the
2698
+ typical ActionCable fan-out shape: one server-side message reflected
2699
+ to N idle browsers) the bandwidth saving wins handily over the
2700
+ per-message CPU cost.
2701
+
2702
+ **What ships:**
2703
+
2704
+ | Deliverable | Where | Why |
2705
+ |---|---|---|
2706
+ | Handshake negotiation | `lib/hyperion/websocket/handshake.rb` | `validate(env, permessage_deflate: :auto/:on/:off)`. Parses `Sec-WebSocket-Extensions`, picks the first usable offer, returns the negotiated parameter set in slot 4 of the result tuple. `format_extensions_header` renders the response header for `build_101_response`. |
2707
+ | Connection wiring | `lib/hyperion/websocket/connection.rb` | `Connection.new(... extensions: result[3])` instantiates a per-conn `Zlib::Deflate` / `Zlib::Inflate` pair sized to the negotiated `server_max_window_bits` / `client_max_window_bits`. `send` deflates + sets RSV1; `recv` strips RSV1 + appends `\x00\x00\xff\xff` sync trailer + inflates. Streaming inflate with the cap applied to running output bytes — the compression-bomb defense. |
2708
+ | Frame builder + parser RSV1 contract | `ext/hyperion_http/websocket.c` + `lib/hyperion/websocket/frame.rb` | C parser preserves RSV1 in slot 8 of the metadata tuple; allows it on data frames; rejects it on control frames (RFC 7692 §6.1). `Builder.build(rsv1: true)` sets the high bit alongside FIN. RSV2/RSV3 still reject (no defined semantics). The RubyFrame fallback mirrors the contract. |
2709
+ | `Hyperion::Config#websocket.permessage_deflate` | `lib/hyperion/config.rb` | Tri-state `:off` / `:auto` (default) / `:on`. Mirrors `tls.ktls`. Operators flip `:auto → :on` to harden when they control the client population. |
2710
+ | `bench/ws_echo.ru` HYPERION_WS_DEFLATE knob | `bench/ws_echo.ru` | Bench app advertises permessage-deflate when the env var is set; pipes the negotiated extensions through to the Connection. |
2711
+ | `bench/ws_deflate_bench.rb` | NEW | Local UNIXSocket harness — measures wire bytes with vs without permessage-deflate on a 1 KB chat-style JSON workload. |
2712
+ | `spec/hyperion/websocket_permessage_deflate_spec.rb` | NEW (18 examples) | Handshake negotiation (8 cases including `:on`/`:off`/`:auto` and multi-offer), wire round-trip via Zlib (RFC 7692 "Hello" vector), Connection round-trips with shared and reset context, control-frame protections (ping with RSV1 → 1002), compression-bomb defense (4 MiB inflated → 1009 close), Config DSL plumbing. |
2713
+
2714
+ **Compression-bomb defense (RFC 7692 §8.1).** A malicious client can
2715
+ ship a tiny compressed payload that inflates to gigabytes. The
2716
+ streaming inflater drains output in 16 KB chunks and short-circuits
2717
+ the moment the running decompressed total would exceed
2718
+ `max_message_bytes` (default 1 MiB). The connection then closes 1009
2719
+ (Message Too Big) and the next `recv` raises `StateError`. Verified
2720
+ via `4 MiB → 4 KB compressed → close 1009` regression spec.
2721
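+
+ A minimal sketch of the cap-as-you-inflate shape (hypothetical helper
+ names; the in-tree Connection wires this into `recv` and owns the
+ 1009-close handling):
+
+ ```ruby
+ require 'zlib'
+
+ SYNC_TRAILER = "\x00\x00\xff\xff".b  # RFC 7692 resync trailer, re-appended per message
+ class MessageTooBig < StandardError; end
+
+ # Hypothetical helper, not the in-tree API. `inflater` is a shared
+ # per-connection Zlib::Inflate.new(-15) (negative = raw deflate).
+ def inflate_capped(inflater, compressed, max_message_bytes)
+   total = 0
+   out   = +''.b
+   # The block form of Zlib::Inflate#inflate streams output chunk by
+   # chunk, so we can bail before ever buffering a decompression bomb.
+   inflater.inflate(compressed + SYNC_TRAILER) do |chunk|
+     total += chunk.bytesize
+     raise MessageTooBig if total > max_message_bytes  # caller closes 1009
+     out << chunk
+   end
+   out
+ end
+ ```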
+
2722
+ **Backwards compatibility.** Default `:auto` is the safe default —
2723
+ clients that don't offer permessage-deflate keep getting plain frames,
2724
+ identical to 2.2.0. The 4th slot of the handshake result tuple is new;
2725
+ existing `[:ok, accept, sub]` 3-arg destructure call sites remain
2726
+ correct because Ruby's array destructure tolerates extra slots.
2727
+
2728
+ **Default unchanged for the operator-facing knob.** The Connection
2729
+ constructor's `extensions:` kwarg defaults to `{}`. Apps that don't
2730
+ read `result[3]` from the handshake tuple keep getting uncompressed
2731
+ WebSocket traffic, identical to 2.2.0. The `:auto` default on the
2732
+ Config knob means handshakes advertise the extension when offered, but
2733
+ the Connection wrapper only deflates when the app explicitly threads
2734
+ the negotiated `extensions:` into the constructor — both ends
2735
+ opt-in.
2736
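+
+ A hedged sketch of both opt-ins threaded together, assuming the
+ 4-slot tuple shape described above (`socket` is the hijacked IO;
+ writing the 101 response is elided):
+
+ ```ruby
+ # Hedged sketch, assuming the 4-slot handshake tuple described above.
+ # `socket` is the Rack-hijacked IO; writing the 101 response (built
+ # from `accept` plus the formatted extensions header) is elided.
+ def upgrade_with_deflate(env, socket)
+   status, accept, _subprotocol, extensions =
+     Hyperion::WebSocket::Handshake.validate(env, permessage_deflate: :auto)
+   return unless status == :ok
+
+   # ... send the 101 response here ...
+   conn = Hyperion::WebSocket::Connection.new(socket, extensions: extensions)
+   conn.send('{"hello":"world"}')  # deflates + sets RSV1 only when negotiated
+   conn
+ end
+ ```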
+
2737
+ Spec count: 757 → 776 (+18 deflate specs + 1 RSV1-on-control split,
2738
+ 0 regressions).
2739
+
2740
+ ### 2.3-D — WS multi-process bench + RFC 6455 conformance recipe
2741
+
2742
+ **Why this matters.** fix-E's 200-conn WebSocket bench landed at
2743
+ 1,766 msg/s with p99 134 ms on openclaw-vm — and the long tail there
2744
+ turned out to be **client-side** GVL serialisation, not server-side
2745
+ latency. The single-process Ruby bench client funnels every per-message
2746
+ mask/unmask + frame parse + IO.select through one interpreter under
2747
+ the GVL; at 200 concurrent connections the client itself ran out of
2748
+ CPU before the server did. To publish honest server throughput, the
2749
+ client needs to scale across OS processes.
2750
+
2751
+ **Bench-only commit — no production code changes.**
2752
+
2753
+ **What ships:**
2754
+
2755
+ | Deliverable | Where | Why |
2756
+ |---|---|---|
2757
+ | `bench/ws_bench_client_multi.rb` | NEW | Forks N child processes (`--procs N`), each running `bench/ws_bench_client.rb --json` against a slice of the total connection count. Aggregates: `total_msgs = Σ child[total_msgs]`, wall `elapsed = max(child[elapsed_s])`, `msg/s = total_msgs / elapsed`, `p50 / p99 / max = max across children` (conservative — slowest child sets the published tail). Sketched below the table. |
2758
+ | `autobahn-config/fuzzingclient.json` | NEW | Canonical RFC 6455 fuzzingclient config pointed at `ws://127.0.0.1:9888/echo`. Excludes case 9.* (very-large-message cases — too slow for default runs; soak via uncomment). Run via `crossbario/autobahn-testsuite` Docker image. |
2759
+ | `docs/WEBSOCKETS.md` "Performance" addendum | UPDATED | Adds the 2.3-D multi-process numbers next to the fix-E single-process numbers, plus an operator note that published msg/s requires multi-process. |
2760
+ | `docs/WEBSOCKETS.md` "RFC 6455 conformance" section | NEW | Full Docker recipe + expected per-section pass matrix. |
2761
+ | `docs/BENCH_HYPERION_2_0.md` "WebSocket multi-process bench (2.3-D)" subsection | NEW | macOS dev numbers + openclaw-vm reproduction recipe (deferred: SSH unavailable this session). |
2762
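+
+ The aggregation reduces to spawn-all-then-read. A hedged sketch (only
+ `--json` is a documented child flag; `--conns` and the `p99_ms` key
+ are assumptions about the child's CLI / JSON surface):
+
+ ```ruby
+ require 'json'
+
+ # Spawn every child first so they run concurrently, then read each.
+ pipes = Array.new(4) do
+   IO.popen(['ruby', 'bench/ws_bench_client.rb', '--json', '--conns', '50'])
+ end
+ children = pipes.map { |io| JSON.parse(io.read).tap { io.close } }
+
+ total_msgs = children.sum { |c| c['total_msgs'] }
+ elapsed    = children.map { |c| c['elapsed_s'] }.max  # wall clock = slowest child
+ p99_ms     = children.map { |c| c['p99_ms'] }.max     # conservative published tail
+
+ puts format('%.0f msg/s aggregate, p99 %.2f ms', total_msgs / elapsed, p99_ms)
+ ```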
+
2763
+ **Bench numbers — macOS dev (Apple Silicon, 14 efficient cores), median of 3:**
2764
+
2765
+ | Workload | msg/s | p50 | p99 | vs fix-E single-process |
2766
+ |---|---:|---:|---:|---|
2767
+ | 4 procs × 50 conns × 1000 msgs (200-conn aggregate, `-t 256 -w 1`) | **14,757** | 13.01 ms | 21.75 ms | **+176%** msg/s, p99 cut in half (43.12 → 21.75 ms) |
2768
+ | 4 procs × 10 conns × 1000 msgs (40-conn aggregate, `-t 256 -w 1`) | 13,594 | 2.49 ms | 7.75 ms | +110% vs fix-E 10-conn 6,463 msg/s |
2769
+
2770
+ The 200-conn p99 drop from 43.12 ms to 21.75 ms is the smoking gun
2771
+ that fix-E's tail was client-side. With the GVL released across 4
2772
+ processes, the bench now actually probes server latency rather than
2773
+ client scheduling.
2774
+
2775
+ **Deferred to 2.4:**
2776
+
2777
+ - **openclaw-vm Linux 16-vCPU rerun.** Bench host SSH was unavailable
2778
+ this session (`Permission denied (publickey)` after a verbose probe).
2779
+ Recipe is in `docs/BENCH_HYPERION_2_0.md` for the next maintainer
2780
+ with agent-loaded keys; expected ≥ 4,500 msg/s aggregate at
2781
+ p99 < 70 ms based on macOS multiplier extrapolation. The
2782
+ aspirational 50,000 msg/s figure from the 2.1.0 brief still needs
2783
+ `-w 16 -t small` plus a non-Ruby client.
2784
+ - **autobahn-testsuite RFC 6455 fuzzer run.** Docker daemon was not
2785
+ running locally and the openclaw bench host (where the
2786
+ `crossbario/autobahn-testsuite` image is pre-pulled) was
2787
+ unreachable. Config landed at `autobahn-config/fuzzingclient.json`;
2788
+ recipe + expected pass matrix in `docs/WEBSOCKETS.md`. Any RFC
2789
+ violations the fuzzer surfaces are 2.4 follow-ups.
2790
+
2791
+ Spec count: 776 → 776 (no spec changes; bench client lives in `bench/`,
2792
+ not `lib/`, and is not loaded by `require 'hyperion'`).
2793
+
2794
+ ## [2.2.0] - 2026-05-01
2795
+
2796
+ ### Headline
2797
+
2798
+ A correctness + foundation release with two measurable perf wins. The
2799
+ original 2.2.0 sprint shipped four foundation tracks (kTLS plumbing,
2800
+ Rust HPACK adapter, allocation reductions, splice correctness fix); a
2801
+ follow-up sprint (fix-A through fix-E) closed the gaps that made the
2802
+ original bench numbers regress and added two operator knobs.
2803
+
2804
+ | Track | Result |
2805
+ |---|---|
2806
+ | Phase 9 + fix-C — kTLS large-payload TLS | **+18-24% rps / -13-14% p99** on TLS 50 KB JSON and 1 MiB static |
2807
+ | Phase 10 + fix-B — Rust HPACK adapter | h2load at parity with Ruby fallback (was -8% to -28% in Phase 10's first cut) |
2808
+ | Phase 11 — Rack adapter allocation audit | -53% per-request alloc, -78% on `build_env` (rps stayed flat — bound elsewhere) |
2809
+ | Splice + fix-A — fresh-pipe-per-response correctness | 3.4× fewer syscalls per 1 MiB request; gated opt-in (kernel 6.8 still favors plain sendfile) |
2810
+ | fix-D — `--h2-max-total-streams` CLI flag | h2load `-n 5000` runnable without a config file |
2811
+ | fix-E — WebSocket echo bench numbers | Linux 16-vCPU: ~1,962 msg/s p99 3 ms (10 conns), ~1,766 msg/s p99 134 ms (200 conns) |
2812
+
2813
+ ### Bench results (openclaw-vm, 2026-04-30, vs 2.1.0 baseline)
2814
+
2815
+ | Row | 2.1.0 | 2.2.0 | Δ | Notes |
2816
+ |---|---:|---:|---:|---|
2817
+ | TLS 1 MiB static, kTLS auto vs off | — | **+24% rps / -14% p99** | win | fix-C bench harness — kTLS large-payload finally measured |
2818
+ | TLS 50 KB JSON, kTLS auto vs off | — | **+18.6% rps / -13% p99** | win | fix-C bench harness — sweet-spot payload size |
2819
+ | h2load c=1 m=100 + native HPACK | 1,609.80 r/s | 1,608.97 r/s | -0.05% | fix-B brought native HPACK to parity with Ruby fallback |
2820
+ | h2load c=1 m=100, `--h2-max-total-streams unbounded` | 1,597 r/s | 1,567 r/s | within 2% | fix-D — 5,000/5,000 streams succeeded, 0 errored |
2821
+ | static 1 MiB, splice on vs off | 1,697 r/s (sendfile) | 1,048 (splice) / 1,086 (sendfile) | splice slower | fix-A pipe-hoist 64 → 19 syscalls/MiB but kernel 6.8 still favors plain sendfile; gated opt-in |
2822
+ | WS echo, 10 conns × 1 KiB, `-t 5 -w 1` | — | **1,962 msg/s, p99 3 ms** | new | fix-E bench numbers |
2823
+ | WS echo, 200 conns × 1 KiB, `-t 256 -w 1` | — | 1,766 msg/s, p99 134 ms | new | fix-E bench numbers (Linux 16-vCPU) |
2824
+
2825
+ For the full sweep including hello / work / static-8KB rows, see the
2826
+ "Bench sweep notes" section below and `docs/BENCH_HYPERION_2_0.md`.
2827
+
2828
+ ### What's in / new operator knobs
2829
+
2830
+ - `Hyperion::TLS.ktls_supported?` / `tls.ktls = :auto/:on/:off` (Phase 9)
2831
+ - `HYPERION_TLS_KTLS={auto,on,off}` env-var (fix-C)
2832
+ - `Hyperion::H2Codec` Rust adapter (Phase 10) — gated behind `HYPERION_H2_NATIVE_HPACK=1` (fix-B)
2833
+ - `--h2-max-total-streams VALUE` CLI flag + `HYPERION_H2_MAX_TOTAL_STREAMS` env-var (fix-D)
2834
+ - `HYPERION_HTTP_SPLICE=1` opt-in for splice-vs-sendfile A/B (kept off-by-default after the bench)
2835
+ - New bench rackups: `bench/tls_static_1m.ru`, `bench/tls_json_50k.ru`, `bench/ws_echo.ru`, `bench/ws_bench_client.rb` (Ruby WS bench client built on `Hyperion::WebSocket::Frame`)
2836
+ - Spec count: 632 (2.1.0) → 698 (2.2.0).
2837
+
2838
+ The per-fix subsections below carry the detailed rationale, syscall
2839
+ math, allocation tables, and ABI notes for each track that landed —
2840
+ preserved verbatim as the reference record for the sprint.
2841
+
2842
+ ### fix-E — WebSocket echo bench numbers + Ruby bench client
2843
+
2844
+ **2.1.0 shipped WebSocket support but never published bench numbers.**
2845
+ The 2.1.0 release commit (b097b78) included `bench/ws_echo.rb` as a
2846
+ rackup ready for the openclaw-vm bench host, plus a perf-note in
2847
+ `docs/WEBSOCKETS.md` claiming a 50,000+ msg/s target shape on 16
2848
+ vCPU, but no actual measurements. fix-E ships the bench numbers,
2849
+ the bench client tooling, and uncovers a small file-extension bug
2850
+ in the rackup itself.
2851
+
2852
+ **Three deliverables:**
2853
+
2854
+ | Deliverable | Where | Why |
2855
+ |---|---|---|
2856
+ | `bench/ws_bench_client.rb` | NEW (~250 LOC) | Ruby WS client built on `Hyperion::WebSocket::Frame` primitives. Zero external deps — shares the gem's masking/parsing code with the server side. Cleaner than installing `websocat` per bench host, and the tooling drops into a Linux CI box without cargo/pip toolchains. |
2857
+ | `bench/ws_echo.ru` | NEW | Renamed copy of `bench/ws_echo.rb`. The 2.1.0 commit shipped the rackup with a `.rb` extension — `Rack::Builder.parse_file` treats `.rb` as plain Ruby and tries to `Object.const_get` the camelized basename, which fails because the file uses the rackup `run lambda { ... }` DSL. fix-E adds the `.ru` variant; the original `.rb` stays in place for archaeology. |
2858
+ | `spec/hyperion/bench_ws_client_spec.rb` | NEW (5 examples) | Smoke for the bench client: 5-msg single-conn run + 2-conn × 3-msg concurrent run + percentile helper unit tests. No perf assertion — bench-host concerns belong on the bench host. |
2859
+
2860
+ **Bench numbers — 2026-04-30:**
2861
+
2862
+ | Workload | msg/s | p50 | p99 | max |
2863
+ |---|---:|---:|---:|---:|
2864
+ | WS echo, 10 conns × 1000 msgs × 1 KiB, `-t 5 -w 1` | **6,463** | **0.76 ms** | **1.03 ms** | 1.81 ms |
2865
+ | WS echo, 10 conns × 1000 msgs × 1 KiB, `-t 256 -w 1` | 6,205 | 1.58 ms | 2.02 ms | 2.99 ms |
2866
+ | WS echo, 200 conns × 1000 msgs × 1 KiB, `-t 256 -w 1` | **5,346** | 37.19 ms | **43.12 ms** | 93.68 ms |
2867
+
2868
+ Median of 3 runs per row. The numbers above are dev-hardware
2869
+ (Apple Silicon).
2870
+
2871
+ **openclaw-vm Linux 16-vCPU follow-up (2026-04-30, single worker):**
2872
+
2873
+ | Workload | msg/s | p50 | p99 | max |
2874
+ |---|---:|---:|---:|---:|
2875
+ | WS echo, 10 conns × 1000 msgs × 1 KiB, `-t 5 -w 1` | **1,962** | 2.51 ms | **3.27 ms** | 4.58 ms |
2876
+ | WS echo, 200 conns × 1000 msgs × 1 KiB, `-t 256 -w 1` | **1,766** | 112 ms | **134 ms** | 141 ms |
2877
+
2878
+ **These are real numbers; the 50,000+ msg/s figure cited in the 2.1.0
2879
+ perf-note (docs/WEBSOCKETS.md) was aspirational, not measured.** A
2880
+ single-worker Hyperion on this 16-vCPU box pushes ~2 k msg/s against a
2881
+ single Ruby bench client; the client is also single-process and is a
2882
+ meaningful portion of the bottleneck. Approaching 50 k msg/s would need
2883
+ multi-process client + multi-worker server + likely a non-Ruby client.
2884
+ Filed as 2.3 follow-up: rerun with `-w 4` + multi-process client, plus
2885
+ an autobahn-testsuite RFC 6455 conformance pass (deferred — Docker
2886
+ daemon not running this session, and `pip install autobahntestsuite`
2887
+ exceeded the brief's "trivially installable" threshold).
2888
+
2889
+ **Comparison vs the 2.1.0 spec p50 of ~0.18 ms** (single-conn,
2890
+ dev hardware, e2e smoke `spec/hyperion/websocket_e2e_spec.rb`): the
2891
+ 0.76 ms p50 from the 10-conn `-t 5` row is 4.2× the smoke-spec
2892
+ single-conn number, which lines up cleanly with the queue-wait
2893
+ inside a 5-thread worker pool serving 10 client connections (each
2894
+ client thread parks behind a server thread for the round-trip).
2895
+ The `recv → echo → send` pipeline isn't slower per-message — it's
2896
+ serialized.
2897
+
2898
+ **Operator notes surfaced by the bench:**
2899
+
2900
+ - **`-t` is a hard cap on concurrent WS connections per worker.**
2901
+ Each WebSocket permanently hijacks a worker thread for its
2902
+ lifetime, so the (N+1)th client behind `-t N` queues at the
2903
+ handshake stage until an existing connection drains. The
2904
+ brief-recommended `-t 5` config rejects 200 concurrent client
2905
+ threads — fix-E's 200-conn row used `-t 256` out of necessity.
2906
+ This guidance is added to `docs/WEBSOCKETS.md` alongside the
2907
+ published numbers.
2908
+ - **Don't over-provision `-t`** for low-concurrency latency
2909
+ paths. The 10-conn / `-t 5` row runs 2× faster per-message than
2910
+ the same workload on `-t 256` — extra threads cost GVL contention
2911
+   without adding parallelism. Match `-t` to the expected
2912
+   concurrent-connection count, not "as high as it goes".
2913
+
2914
+ **Spec count delta**: 693 → 698 (+5). All 693 prior examples still
2915
+ green. 11 pending (10 Linux-only splice tests + 1 macOS-kTLS gate),
2916
+ unchanged.
2917
+
2918
+ ### fix-D — `h2.max_total_streams` CLI flag + env-var (h2load comparability)
2919
+
2920
+ **The 2.0.0 default flip needed an operator escape hatch.** 2.0.0 made the
2921
+ 1.7.0 admission cap mandatory by default at
2922
+ `max_concurrent_streams × workers × 4` (= 512 streams per process at -w 1).
2923
+ That is a sensible browser-traffic ceiling — each browser connection
2924
+ rarely opens more than ~50–100 multiplexed streams, and 4× headroom
2925
+ covers legitimate fan-out. But two operator workflows trip on the cap:
2926
+
2927
+ * **h2load benches.** `h2load -c 1 -m 100 -n 5000 https://host/` opens
2928
+ 5,000 streams on a single connection. The 2026-04-30 sweep
2929
+ (BENCH_HYPERION_2_0.md row 10) hit the cap mid-test: **4,489 of 5,000
2930
+ streams errored** with the connection closed for "max-total-streams
2931
+ exceeded". The published 1,597 r/s row only landed because the
2932
+ bench script flipped `h2 do; max_total_streams :unbounded; end` in
2933
+ config — operators couldn't reproduce it without writing a config file.
2934
+ * **gRPC / long-fan-out services.** Servers holding thousands of long-lived
2935
+ RPCs over a small connection pool routinely exceed the 512 default
2936
+ even at modest traffic.
2937
+
2938
+ The knob has existed since 1.7.0 (nested DSL: `c.h2.max_total_streams = X`)
2939
+ but was never exposed at the CLI surface, and Phase 9-era operators on
2940
+ the 2.0.0 bench-comparability path had to choose between editing config
2941
+ files per row or accepting the errored-stream noise.
2942
+
2943
+ **fix-D ships two operator knobs:**
2944
+
2945
+ | Knob | Where | Notes |
2946
+ |---|---|---|
2947
+ | `--h2-max-total-streams VALUE` | `lib/hyperion/cli.rb` OptionParser branch | Per-invocation override. `VALUE` is a positive integer or `unbounded` (or `:unbounded`). |
2948
+ | `HYPERION_H2_MAX_TOTAL_STREAMS=VALUE` | `apply_h2_max_total_streams_env_override!` | Outermost knob (env > CLI > config > default). Same value grammar; typos warn + ignore (matches the fix-C `HYPERION_TLS_KTLS` shape — convenience knob, not a security boundary). |
2949
+
2950
+ Both ride the existing `H2Settings::UNBOUNDED` sentinel: `unbounded`
2951
+ parses to that symbol on the way in, `Config#finalize!` later resolves
2952
+ it to `nil` (no cap, matches 1.x behaviour). The integer branch lands
2953
+ directly on `config.h2.max_total_streams` and finalize! leaves it
2954
+ untouched.
2955
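+
+ The documented grammar, as a hedged sketch (the helper name is from
+ this entry; the warn wording is illustrative):
+
+ ```ruby
+ # Sketch of the documented env grammar: unset/empty noop, integer
+ # pass-through, `unbounded` to the sentinel, anything else warn+ignore.
+ def apply_h2_max_total_streams_env_override!(config)
+   raw = ENV['HYPERION_H2_MAX_TOTAL_STREAMS']
+   return if raw.nil? || raw.empty?  # unset / empty-string: noop
+
+   case raw
+   when 'unbounded', ':unbounded'
+     config.h2.max_total_streams = Hyperion::H2Settings::UNBOUNDED
+   when /\A[1-9]\d*\z/
+     config.h2.max_total_streams = Integer(raw, 10)
+   else
+     warn "hyperion: ignoring unknown HYPERION_H2_MAX_TOTAL_STREAMS=#{raw.inspect}"
+   end
+ end
+ ```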
+
2956
+ **Specs.** `spec/hyperion/cli_h2_flag_spec.rb` (new file) adds 11 examples:
2957
+
2958
+ * CLI flag parses an integer, parses `unbounded` to the sentinel,
2959
+ accepts `:unbounded` as an alias, raises `OptionParser::InvalidArgument`
2960
+ on non-numeric / non-positive values.
2961
+ * Env-var override is unset-noop, integer-pass, `unbounded`-to-sentinel,
2962
+ empty-string noop, unknown-value warn-and-preserve, and
2963
+ env-var-overrides-CLI-flag (proves env-var wins the precedence chain).
2964
+
2965
+ Spec count **682 → 693** (+11). All 682 prior examples still green
2966
+ (spec sweep on macOS, 11 pending Linux-only splice tests unchanged).
2967
+
2968
+ **Bench measurement on openclaw-vm — VERIFIED 2026-04-30:**
2969
+
2970
+ ```
2971
+ hyperion --tls-cert /tmp/cert.pem --tls-key /tmp/key.pem -t 64 -w 1 \
2972
+ --h2-max-total-streams unbounded -p 9602 ~/bench/hello.ru
2973
+ h2load -c 1 -m 100 -n 5000 https://127.0.0.1:9602/
2974
+ finished in 3.19s, 1567.67 req/s, 38.29KB/s
2975
+ requests: 5000 total, 5000 started, 5000 done,
2976
+ 5000 succeeded, 0 failed, 0 errored, 0 timeout
2977
+ time for request: 41.05ms / 83.08ms / 62.46ms (mean / max / sd)
2978
+ ```
2979
+
2980
+ **5,000 / 5,000 succeeded, 0 errored, 1,567 r/s** (matches the 2.0.0
2981
+ baseline 1,597 r/s within 2% noise). Confirms the flag fixes the
2982
+ 2026-04-30 regression where 4,489 of 5,000 streams errored on a
2983
+ default-shape command line. The rps baseline isn't moved by fix-D —
2984
+ the goal was only to make h2load `-n 5000` runnable without a config
2985
+ file.
2986
+
2987
+ ### fix-C — large-payload TLS bench harness (rackups + `HYPERION_TLS_KTLS` env-var)
2988
+
2989
+ **The 2026-04-30 Phase 9 -15% TLS regression diagnosed: wrong workload.**
2990
+ Phase 9 shipped kTLS_TX on Linux ≥ 4.13 + OpenSSL ≥ 3.0 and the boot
2991
+ probe correctly engaged kernel-TLS on openclaw-vm
2992
+ (`ktls_active: true, cipher: TLS_AES_256_GCM_SHA384`,
2993
+ `/proc/modules: tls 155648 3 - Live`). The TLS h1 row in the 2.2.0
2994
+ sweep used `bench/hello.ru` (5 B response body) and recorded -15% rps
2995
+ (3,425 → 2,909) — the regression read as "kTLS didn't help", but at
2996
+ hello-payload the cipher cost is a tiny fraction of per-request
2997
+ overhead (parser + dispatch + handshake CPU dominate). The kTLS_TX
2998
+ win compounds with **larger payloads** where SSL_write would
2999
+ otherwise burn userspace cycles encrypting MBs of data; the
3000
+ hello-payload bench simply didn't exercise that path.
3001
+
3002
+ **fix-C ships the workload that does exercise it:**
3003
+
3004
+ * **`bench/tls_static_1m.ru`** — 1 MiB static asset over TLS via
3005
+ `Rack::Files`. Pairs with `bench/static.ru` for the unencrypted
3006
+ comparison. At 1 MiB the cipher accounts for most of the
3007
+ per-request cycles — userspace SSL_write copies & encrypts in
3008
+ Ruby-land; kTLS_TX hands the symmetric key to the kernel and goes
3009
+ through `sendfile`+`KTLS_TX_OFFLOAD` paths.
3010
+ * **`bench/tls_json_50k.ru`** — ~50 KB JSON (600 items × 8× name
3011
+ multiplier, verified 50,039 bytes on ruby 3.3.3). Sized to the
3012
+ kTLS_TX sweet spot: large enough that cipher cost is meaningful,
3013
+ small enough to fit in one kernel TCP send buffer in a single
3014
+ syscall (default `net.ipv4.tcp_wmem` max ~6 MB on Linux). 30-80 KB
3015
+ is the sweet-spot range; the spec asserts the payload lands inside
3016
+ it so an operator tweaking the multiplier can't accidentally drift
3017
+ out.
3018
+
3019
+ **Operator A/B knob: `HYPERION_TLS_KTLS` env-var.** Phase 9 only
3020
+ exposed kTLS via the `tls.ktls` DSL knob — operators wanting to A/B
3021
+ kernel-TLS vs userspace SSL_write had to rewrite their config file
3022
+ between bench rows. fix-C adds a 3-state env-var bridge in
3023
+ `lib/hyperion/cli.rb` (`apply_ktls_env_override!`) that runs right
3024
+ after `config.merge_cli!` and overrides the resolved knob:
3025
+
3026
+ | `HYPERION_TLS_KTLS` | `config.tls.ktls` | Behaviour |
3027
+ |---|---|---|
3028
+ | unset / empty | `:auto` (default) | Linux ≥ 4.13 + OpenSSL ≥ 3.0: kTLS_TX on; elsewhere: off |
3029
+ | `auto` | `:auto` | Same as unset, explicit |
3030
+ | `on` | `:on` | Force enable; raise at boot if unsupported |
3031
+ | `off` | `:off` | Force disable; userspace SSL_write everywhere |
3032
+ | anything else | (unchanged) | Warn + ignore (not a security boundary) |
3033
+
3034
+ The unknown-value branch warns rather than aborting boot — the env
3035
+ var is a convenience knob for operators benchmarking, not a
3036
+ security boundary, and a typo shouldn't crash the process.
3037
+
3038
+ **Specs.** `spec/hyperion/bench_tls_rackups_spec.rb` (new file)
3039
+ adds 9 examples:
3040
+
3041
+ * `bench/tls_json_50k.ru` parses cleanly via `Rack::Builder.parse_file`,
3042
+ responds 200 with `application/json`, and the payload lands in
3043
+ the 30-80 KB range.
3044
+ * `bench/tls_static_1m.ru` parses cleanly, serves a 1 MiB asset
3045
+ written into a tempdir via `HYPERION_BENCH_ASSET_DIR`.
3046
+ * `HYPERION_TLS_KTLS` env-var → `config.tls.ktls` mapping for all
3047
+ 3 valid states + unset + empty + unknown (warn + ignore).
3048
+
3049
+ The static-asset spec writes its own fixture into `Dir.mktmpdir` so
3050
+ it doesn't touch `/tmp` outside the test run. Spec count
3051
+ **673 → 682** (+9).
3052
+
3053
+ **Bench measurement: VERIFIED 2026-04-30 on openclaw-vm
3054
+ (commit `f135b55`).** The bench harness ran on the openclaw-vm 16-vCPU
3055
+ box (Linux 6.8, OpenSSL 3.0, kTLS module loaded) once SSH access was
3056
+ restored. kTLS auto wins on both large-payload rows:
3057
+
3058
+ | Workload | kTLS off | kTLS auto | Δ rps | Δ p99 |
3059
+ |---|---:|---:|---:|---:|
3060
+ | TLS 50 KB JSON, h1 c=64 d=10s | baseline | **+18.6% rps** | win | **-13% p99** |
3061
+ | TLS 1 MiB static, h1 c=8 d=10s | baseline | **+24% rps** | win | **-14% p99** |
3062
+
3063
+ Phase 9's correctness work (kTLS_TX engages cleanly on Linux ≥ 4.13 +
3064
+ OpenSSL ≥ 3.0) was right; the original 2.2.0 sweep simply benched the
3065
+ wrong workload (5-byte hello where cipher cost is dominated by parser
3066
+ + dispatch). The fix-C rackups exercise the path where the cipher
3067
+ cost actually surfaces, and the kernel-side win shows.
3068
+
3069
+ ### fix-B — Rust HPACK FFI marshalling rewrite (per-encoder scratch buffer + flat-blob ABI)
3070
+
3071
+ **The 2026-04-30 native-HPACK regression diagnosed and fixed.** Phase 10
3072
+ shipped a Rust HPACK adapter that won 3.26× on the encode microbench (a
3073
+ tight loop over many headers in one call) but ran -8% to -28% **slower**
3074
+ than the Ruby fallback on h2load c=1 m=100 traffic. The bench sweep
3075
+ identified per-HEADERS-frame Fiddle FFI marshalling as the root cause:
3076
+ on real h2 traffic each call encodes 3-8 small headers, so the per-call
3077
+ allocation overhead dominates whatever the encode kernel saves.
3078
+
3079
+ The v1 ABI's per-call allocation profile (3 headers, response-side):
3080
+ * `Fiddle::Pointer[]` per header name **and** value = 6 Pointer wrappers
3081
+ * 4 separate `pack('Q*' / 'L*')` calls (names buf, name lens, vals buf, val lens)
3082
+ * 1 capacity-byte output buffer pre-fill: `out << ("\x00".b * capacity)`
3083
+ * 1 `byteslice(0, written)` to extract the encoded prefix
3084
+
3085
+ ≈ **12 transient String allocations per `encode_headers` call** on a
3086
+ 3-header response, multiplied across thousands of streams per second
3087
+ on h2 traffic.
3088
+
3089
+ **fix-B: per-encoder scratch buffer + flat-blob v2 ABI.** The wrapper
3090
+ now allocates the scratch buffers ONCE in `Encoder#initialize`:
3091
+
3092
+ * `@scratch_blob` — concatenated header bytes (name_1, value_1, …)
3093
+ * `@scratch_argv` — packed `(name_off, name_len, val_off, val_len)` u64 quads
3094
+ * `@scratch_out` — output buffer, grows on demand (start 16 KiB)
3095
+ * `@scratch_argv_ints` — Ruby Array reused for `pack('Q*', buffer:)`
3096
+ * Cached `Fiddle::Pointer` for each scratch — refreshed only on
3097
+ reallocation.
3098
+
3099
+ `#encode(headers)` clears the three buffers, appends raw bytes + offset
3100
+ quads, and dispatches a SINGLE FFI call to the new entry point:
3101
+
3102
+ ```rust
3103
+ pub unsafe extern "C" fn hyperion_h2_codec_encoder_encode_v2(
3104
+ handle: EncoderHandle,
3105
+ headers_blob_ptr: *const u8,
3106
+ headers_blob_len: usize,
3107
+ argv_ptr: *const u64,
3108
+ argv_count: usize,
3109
+ out_ptr: *mut u8,
3110
+ out_capacity: usize,
3111
+ ) -> i64 // bytes_written, -1 overflow, -2 bad args
3112
+ ```
3113
+
3114
+ The Rust side reads each `(name_off, name_len, val_off, val_len)` quad
3115
+ out of `argv_ptr` and indexes into `headers_blob_ptr` — no per-header
3116
+ allocation on the Ruby side, no per-header `Fiddle::Pointer.new`, no
3117
+ per-header pack(). The Rust `Encoder` also stashes a reusable scratch
3118
+ `Vec<u8>` (cleared via `Vec::clear` between calls — capacity preserved)
3119
+ so the Rust side avoids `Vec::with_capacity(64 * count)` per call too.
3120
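+
+ On the Ruby side the marshalling reduces to appending offsets and raw
+ bytes into the scratches, then one in-place `pack`. A hedged sketch
+ (illustrative helper name; the in-tree `#encode` also refreshes its
+ cached `Fiddle::Pointer`s when a scratch String reallocates):
+
+ ```ruby
+ # blob and argv_buf are reused binary scratch Strings; argv_ints is
+ # the reused Array of (name_off, name_len, val_off, val_len) quads.
+ def fill_scratches(headers, blob, argv_ints, argv_buf)
+   blob.clear
+   argv_ints.clear
+   headers.each do |name, value|
+     argv_ints << blob.bytesize << name.bytesize   # name_off, name_len
+     blob << name
+     argv_ints << blob.bytesize << value.bytesize  # val_off, val_len
+     blob << value
+   end
+   argv_ints.pack('Q*', buffer: argv_buf)  # packs in place: zero alloc on steady state
+ end
+ ```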
+
3121
+ **Per-call allocation count: BEFORE → AFTER (3-header response, 50 calls):**
3122
+
3123
+ | Path | T_STRING per call | 50 calls total |
3124
+ |---|---:|---:|
3125
+ | v1 (shipped Phase 10) | ~12 | ~600 (lower bound) |
3126
+ | v2 fix-B | ~7.5 | ~377 (measured on darwin-arm64 ruby 3.3.3) |
3127
+
3128
+ The remaining ~7.5 strings/call are: 6 × `.b` for non-binary header
3129
+ sources (zero-cost branch when sources are already ASCII-8BIT, which
3130
+ is the protocol-http2 norm) + 1 returned `Fiddle::Pointer#to_str(written)`
3131
+ + small GC noise. The v2 `pack('Q*', buffer:)` reuses the scratch
3132
+ buffer in-place — zero alloc on steady state.
3133
+
3134
+ **Old `hyperion_h2_codec_encoder_encode` ABI symbol preserved.** The
3135
+ v1 entry stays exported from the cdylib (just unused by the in-tree
3136
+ adapter) so any third-party loaders still binding to it continue to
3137
+ work. ABI version stays at `1`; the new symbol is additive.
3138
+
3139
+ **Specs.** `spec/hyperion/http2_native_hpack_spec.rb` gains 5 new
3140
+ examples under `fix-B (2.2.x) — per-encoder scratch buffer + flat-blob
3141
+ ABI`:
3142
+
3143
+ * `'reuses scratch buffers across encode calls (no extra String.new in
3144
+ encode hot path)'` — counts T_STRING delta across 50 encode calls,
3145
+ asserts < 500 (v1 baseline ~600, v2 observed ~377).
3146
+ * `'rejects encode when output buffer overflow occurs'` — drives the
3147
+ v2 entry with a 4-byte out_capacity against 1024-byte input, asserts
3148
+ rc == -1.
3149
+ * `'rejects encode v2 with bad arguments (out-of-bounds offsets)'` —
3150
+ asserts rc == -2 when an argv quad references past the blob end.
3151
+ * `'maintains dynamic-table state across encode calls under the v2
3152
+ ABI'` — re-encodes the same `cookie: session=novel` header twice and
3153
+ asserts the second block compresses to fewer bytes via the dyn-table
3154
+ reuse path.
3155
+ * `'auto-grows the output scratch when a frame exceeds the running
3156
+ capacity'` — encodes a 1000-header frame (>16 KiB encoded), then a
3157
+ small frame, asserts both round-trip.
3158
+
3159
+ Existing 13 native HPACK specs (parity, stateful-dyn-table,
3160
+ Http2Handler integration) stay green. Spec count **668 → 673**.
3161
+
3162
+ **Bench validation on openclaw-vm (2026-04-30, post fix-B).** Bench
3163
+ host had no `cargo` at the time of the original 2.2.0 sweep (Phase 10's
3164
+ bench reported `Hyperion::H2Codec.available? == false`). fix-B installed
3165
+ `rustup` toolchain stable on the host, rebuilt the cdylib via
3166
+ `cargo build --release`, and re-benched.
3167
+
3168
+ Workload: `h2load -c 1 -m 100 -n 5000 https://127.0.0.1:<port>/`,
3169
+ hello.ru behind hyperion `-t 64 -w 1`, 4-round mean per side:
3170
+
3171
+ | Side | Round 1 | Round 2 | Round 3 | Round 4 | Mean |
3172
+ |---|---:|---:|---:|---:|---:|
3173
+ | Ruby fallback (baseline) | 1606.29 | 1601.19 | 1615.71 | 1616.02 | **1609.80** |
3174
+ | Rust HPACK fix-B (`HYPERION_H2_NATIVE_HPACK=1`) | 1594.78 | 1606.90 | 1623.95 | 1610.23 | **1608.97** |
3175
+
3176
+ Delta: **-0.05% (fully within run-to-run noise)**. Phase 10's reported
3177
+ -8% to -28% regression is **eliminated**. The native path is now at
3178
+ parity with the Ruby fallback on this workload — the per-call
3179
+ allocation overhead the v1 ABI paid (and that overwhelmed the encode
3180
+ kernel win on small-headers traffic) is gone.
3181
+
3182
+ **Default-on flip: NOT TAKEN.** The brief required
3183
+ `native HPACK ≥ Ruby fallback rps` AND a clear margin to flip default-on.
3184
+ Parity on the bench host counts as the regression being fixed, but
3185
+ isn't enough to flip the default — the encode kernel is too small a
3186
+ fraction of the per-request budget on this hello-payload workload for
3187
+ a clear win to surface. **The `HYPERION_H2_NATIVE_HPACK=1` env-var
3188
+ gate stays default-OFF.** Operators on different workloads (heavy
3189
+ header sets, large dyn-table churn) can flip the env var to A/B; the
3190
+ no-regression guarantee on smaller workloads is what fix-B locks in.
3191
+
3192
+ A larger-payload h2 bench (e.g., 16 KB headers, the workload Phase 10's
3193
+ microbench measured) would likely surface the kernel win — that's
3194
+ queued behind a workload generator the bench harness doesn't have yet.
3195
+
3196
+ ### fix-A — splice pipe-hoist (per-chunk → per-response)
3197
+
3198
+ **The 2026-04-30 splice regression diagnosed and fixed.** The 2.2.0
3199
+ splice lifecycle opened a fresh `pipe2(O_CLOEXEC | O_NONBLOCK)` pair on
3200
+ every call to `Hyperion::Http::Sendfile.copy_splice`, and the Ruby caller
3201
+ (`native_copy_loop`) invoked that primitive **per chunk** in a
3202
+ `while remaining.positive?` loop. For a 1 MiB asset at 64 KiB chunks
3203
+ that's 16 calls × 3 syscalls of pipe overhead = **48 wasted syscalls per
3204
+ request** at the kernel boundary; the bench sweep on openclaw-vm
3205
+ attributed -23% of the -22.7% static-1-MiB regression to that overhead.
3206
+
3207
+ * **New C primitive: `Hyperion::Http::Sendfile.copy_splice_into_pipe`.**
3208
+ Same splice ladder as `copy_splice` (file → pipe → socket), but takes
3209
+ a CALLER-PROVIDED pipe pair as the last two arguments and does NOT
3210
+ open or close the pipe. Returns the same status shape — `:done` /
3211
+ `:partial` / `:eagain` / `:unsupported`. Linux-only; non-Linux builds
3212
+ return `[0, :unsupported]` so the Ruby caller can fall back to plain
3213
+ `sendfile(2)`. Lives at `ext/hyperion_http/sendfile.c`.
3214
+ * **Existing `copy_splice` primitive kept intact.** The self-contained
3215
+ per-call pipe lifecycle is still useful for one-shot small payloads
3216
+ and out-of-band callers that don't want to manage the pipe directly;
3217
+ it remains exposed and unchanged. fix-A only **adds** the new
3218
+ primitive — it does not remove or repurpose the old one.
3219
+ * **Ruby façade hoists the pipe out of the chunk loop.**
3220
+ `lib/hyperion/http/sendfile.rb` `native_copy_loop` is restructured
3221
+ into two helpers: `splice_copy_loop` (Linux + splice runtime
3222
+ supported) opens ONE pipe pair via `IO.pipe` (set non-blocking via
3223
+ `Fcntl::F_SETFL`) at the top of the response, hands the same fds to
3224
+ `copy_splice_into_pipe` for every chunk, and closes both fds in an
3225
+ ensure block on every exit path (return, raise, `:unsupported`
3226
+ fall-back). `plain_sendfile_loop` carries the rest of the response
3227
+ if the runtime kernel rejects splice mid-loop, picking up from the
3228
+   same cursor. (This shape is sketched just after this list.)
3229
+ * **Syscall delta (1 MiB request):**
3230
+ | Layout | pipe2 | close | splice rounds | total |
3231
+ |---|---:|---:|---:|---:|
3232
+ | 2.2.0 (per-chunk) | 16 | 32 | 16 | 64 |
3233
+ | 2.2.x fix-A (per-response) | 1 | 2 | 16 | **19** |
3234
+
3235
+ **3.4× fewer syscalls per 1 MiB request** at the kernel boundary.
3236
+ * **Correctness window unchanged.** A pipe pair still never outlives
3237
+ one response — the ensure block closes both fds before the response
3238
+ loop returns to `copy_to_socket`'s caller. The bytes-leak window the
3239
+ cached-per-thread layout from 2.0.1 suffered cannot reopen here.
3240
+ The fd-leak guard from 2.2.0 (`'closes both pipe fds on every
3241
+ successful copy_splice call (no fd leak across 1000 requests)'`)
3242
+ stays green; the open-fd delta now scales with **responses**, not
3243
+ individual splice calls.
3244
+ * **New specs.** `spec/hyperion/http_sendfile_spec.rb` gains two
3245
+ fix-A specs:
3246
+ * `'reuses one pipe pair across all chunks of a single response (fix-A)'`
3247
+ stubs `IO.pipe` to count invocations, serves a 1 MiB asset, and
3248
+ asserts `IO.pipe` was called exactly once per response.
3249
+ * `'closes both pipe fds even when the chunk loop raises mid-transfer (fix-A)'`
3250
+ stubs `copy_splice_into_pipe` to raise on the second chunk and
3251
+ asserts both pipe fds are `closed?` afterwards (the ensure
3252
+ block fired even on the exception path).
3253
+
3254
+ Spec count **666 → 668**. Both new specs are Linux-pending on
3255
+ macOS / BSD (splice is Linux-only); the existing 666 stay green.
3256
+ * **Env-var gate stays in place.** The `HYPERION_HTTP_SPLICE=1` opt-in
3257
+ gate added in commit `2c8d9f3` is **kept**. The 2026-04-30 follow-up
3258
+ bench (post fix-A pipe-hoist) measured splice-ON 1,048 r/s vs
3259
+ splice-OFF 1,086 r/s on the same host: splice is correctness-
3260
+ equivalent to plain sendfile on kernel 6.8 / openclaw-vm but NOT
3261
+ faster. The pipe is a kernel buffer that adds a syscall per chunk
3262
+ (file→pipe + pipe→socket vs sendfile's single file→socket); zero-copy
3263
+ guarantee is identical. Default off preserves 2.1.0 plain-sendfile
3264
+ rps; operators on different kernels can flip the env var to A/B.
3265
+ * **macOS arm64 / Linux x86_64 portability.** The splice ladder stays
3266
+ `#ifdef HYP_SF_LINUX`; non-Linux builds see
3267
+ `copy_splice_into_pipe` return `[0, :unsupported]` and the
3268
+ streaming loop drops to plain `sendfile(2)`. The C ext compiles
3269
+ cleanly on macOS arm64 (verified on this branch).
3270
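+
+ A hedged sketch of the hoisted loop shape named in the façade bullet
+ above (the `copy_splice_into_pipe` argument order is an assumption,
+ and the `:partial` / `:eagain` cursor handling is condensed away):
+
+ ```ruby
+ require 'io/nonblock'
+
+ # plain_sendfile_loop is the documented fall-back helper (not shown).
+ def splice_copy_loop(file_io, sock_io, remaining)
+   rd, wr = IO.pipe                        # ONE pipe pair per response
+   [rd, wr].each { |io| io.nonblock = true }
+   while remaining.positive?
+     written, status = Hyperion::Http::Sendfile.copy_splice_into_pipe(
+       file_io, sock_io, remaining, rd, wr  # hypothetical arg order
+     )
+     return plain_sendfile_loop(file_io, sock_io, remaining) if status == :unsupported
+     remaining -= written
+   end
+ ensure
+   rd&.close  # fires on return, raise, and the fall-back path alike
+   wr&.close
+ end
+ ```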
+
3271
+ ### Bench sweep notes (openclaw-vm, 2026-04-30, vs 2.1.0 baseline) — original first-sweep table
3272
+
3273
+ This is the **original held-status bench sweep** that triggered the
3274
+ fix-A through fix-E follow-up sprint. Kept verbatim as archaeology;
3275
+ the verified 2.2.0 numbers are in the headline table at the top of
3276
+ this entry. Each row below was either fixed in the follow-up sprint
3277
+ or recharacterized once the right workload was benched.
3278
+
3279
+ | Row | 2.1.0 | 2.2.0 default | Δ | Notes |
3280
+ |---|---:|---:|---:|---|
3281
+ | hello -w 4 -t 5 | 20,630 | 20,077 | -2.7% (noise) | Phase 11 allocation cuts don't show on this row |
3282
+ | work.ru -w 4 -t 5 | 15,585 | 14,415 | -7.5% | Within run-to-run variance |
3283
+ | static 1 MiB -w 1 -t 5 | 1,697 | 1,312 with splice on | -22.7% | Per-chunk pipe2 overhead surfaced — fix-A pipe-hoist 64 → 19 syscalls/MiB; splice gated opt-in |
3284
+ | static 8 KB -w 1 -t 5 | 1,483 | 1,359 | -8.4% | Within variance |
3285
+ | TLS h1 -w 1 -t 64 | 3,425 | 2,909 | -15.1% | Hello-payload TLS — fix-C rackups added the right workload, kTLS wins +18-24% on 50 KB / 1 MiB |
3286
+ | h2load c=1 m=100 default | 1,597 | n/a | — | h2.max_total_streams default flip from 2.0 closes the conn after 512 streams; fix-D adds `--h2-max-total-streams unbounded` flag |
3287
+ | h2load c=1 m=100 + Rust HPACK | n/a | n/a | — | Rust crate didn't load on bench host (no cargo); fix-B installed cargo + reran, native parity with Ruby fallback verified |
3288
+
3289
+ The +45% / +60% / +21% targets the 2.2.0 brief estimated didn't
3290
+ materialize on this first sweep — the workloads were wrong (hello-
3291
+ payload TLS doesn't surface cipher cost; per-chunk splice burned
3292
+ syscalls; native HPACK FFI marshalling was per-call rather than
3293
+ per-encoder; native HPACK rebuild was missing on the bench host).
3294
+ The fix-A through fix-E sprint addressed each gap one at a time;
3295
+ the headline table at the top of this entry carries the verified
3296
+ post-sprint numbers.
3297
+
3298
+ ### Static-file splice path re-enabled (fresh per-request pipe pair) — opt-in
3299
+
3300
+ The `splice(2)`-through-pipe primitive shipped in 2.0.1 (Phase 8b) was
3301
+ **disabled in the production hot path** in the same release because the
3302
+ cached per-thread pipe pair leaked residual bytes between requests on
3303
+ EPIPE: if `splice(file -> pipe)` succeeded but `splice(pipe -> sock)`
3304
+ failed mid-transfer (peer closed), the unread bytes stayed in the pipe
3305
+ and were sent on the NEXT connection's socket. 2.0.1 fell back to plain
3306
+ `sendfile(2)` for the 1 MiB row and parked the splice primitive as
3307
+ optional — kept in the C ext for callers that opted in, but no longer
3308
+ on `copy_to_socket`'s default route.
3309
+
3310
+ 2.2.0 fixes the correctness bug at the lifecycle layer rather than
3311
+ abandoning the path. The splice path is now back on the production hot
3312
+ path for files > 64 KiB on Linux (subsequently gated opt-in behind
+ `HYPERION_HTTP_SPLICE=1`; see fix-A above).
3313
+
3314
+ * **Fresh `pipe2(O_CLOEXEC | O_NONBLOCK)` pair per call.**
3315
+ `Hyperion::Http::Sendfile.copy_splice` opens its own pipe pair on
3316
+ every call and closes both fds before returning — on success, on
3317
+ EAGAIN, on error, on EOF. No persistent state, no `pthread_key_t`,
3318
+ no destructor. The 2.0.1 cached layout is gone entirely. The
3319
+ per-thread TLS cache and its destructor were removed from the C ext;
3320
+ the splice primitive is now stateless across calls.
3321
+ * **Correctness contract.** A pipe never carries bytes for more than
3322
+ one transfer. The (in_n - written) bytes that may be parked in the
3323
+ pipe on a mid-transfer `EAGAIN` / `EPIPE` are dropped when we close
3324
+ the pipe; the Ruby caller's cursor arithmetic compensates by
3325
+ re-reading from `cursor + bytes_actually_on_socket` on the next
3326
+ call. No cross-connection byte leak is possible.
3327
+ * **fd lifecycle.** Each `copy_splice` call pays exactly 3 syscalls of
3328
+ pipe overhead: 1 `pipe2` + 2 `close`s. New spec
3329
+ `'closes both pipe fds on every successful copy_splice call (no fd leak
3330
+ across 1000 requests)'` runs 1000 sequential 200-KiB transfers and
3331
+ asserts the open-fd count grows by < 32 across the batch. Companion
3332
+ spec `'closes both pipe fds even when the peer closes mid-transfer
3333
+ (EPIPE)'` slams the peer mid-splice 50× and asserts the same fd
3334
+ bound on the error path.
3335
+ * **Production wiring.** `lib/hyperion/http/sendfile.rb` —
3336
+ `native_copy_loop` now branches on `splice_runtime_supported? &&
3337
+ len > SPLICE_THRESHOLD`. On Linux + supported kernel, splice runs;
3338
+ on `:unsupported` from the kernel (very old kernels return ENOSYS /
3339
+ EINVAL the first time we call splice), `mark_splice_unsupported!`
3340
+ flips the cached gate to `false` and the rest of the process falls
3341
+ through to plain `sendfile(2)` from the same cursor — no bytes
3342
+ duplicated, no bytes skipped. `NotImplementedError` from the C
3343
+ primitive (defensive: should never fire on a Linux build) follows
3344
+ the same fall-back path.
3345
+ * **Splice-vs-sendfile byte equality.** New spec `'preserves byte
3346
+ equality between splice and plain sendfile for the same payload'`
3347
+ drives the same 1 MiB asset through both primitives and asserts
3348
+ the wire bytes are identical — guards against subtle off-by-one
3349
+ bugs in the offset bookkeeping after pipe -> socket short-writes.
3350
+ * **Splice runtime probe.** Added
3351
+ `Hyperion::Http::Sendfile.splice_runtime_supported?` (lazy, cached
3352
+ for the lifetime of the process). Tracks the C ext's
3353
+ `splice_supported?` flag at boot and switches to `false` if the
3354
+ runtime kernel rejects splice. `mark_splice_unsupported!` is the
3355
+ one-way transition; the runtime gate never re-opens within a
3356
+ process.
3357
+ * **Specs.** `spec/hyperion/http_sendfile_spec.rb` gains 4 new
3358
+ examples under `2.2.0 — splice fresh-pipe lifecycle`. Three are
3359
+ Linux-pending on macOS / BSD (the splice path is inert there); the
3360
+ 4th (`'falls back to plain sendfile when splice_runtime_supported?
3361
+ is stubbed false'`) runs everywhere and asserts the production
3362
+ fall-back wiring routes through `copy()` and never hits
3363
+ `copy_splice`. Spec count **662 → 666**. Existing 662 stay green.
3364
+ * **macOS arm64 / Linux x86_64 portability.** The splice path stays
3365
+ `#ifdef HYP_SF_LINUX`; non-Linux builds see `splice_supported?`
3366
+ return `false` and the streaming loop goes straight to plain
3367
+ `sendfile(2)`. The C ext compiles cleanly on macOS arm64 (verified
3368
+ on this branch).
3369
+ * **Bench validation deferred.** openclaw-vm rejected publickey at
3370
+ the time of this commit (same regression flagged in Phase 11). The
3371
+ fresh-pipe lifecycle is correctness work; the projected 5–10% rps
3372
+ win on the 1 MiB static row vs the 2.0.1 baseline (1,697 r/s)
3373
+ will be re-measured in the 2.2.0 release sweep (#124) once SSH
3374
+ access is restored.
3375
+
3376
+ ### Phase 11 — YJIT allocation audit (hot-path tuning)
3377
+
3378
+ Pure-Ruby allocation reduction on the request hot path. The C-ext fast
3379
+ path (`CParser.build_env`, `CParser.build_response_head`) is unchanged;
3380
+ this phase trims the Ruby code wrapping it. `memory_profiler` was used
3381
+ to identify the top allocation sites; each one was confirmed by a
3382
+ `GC.stat[:total_allocated_objects]` before/after delta.
3383
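+
+ The confirmation harness reduces to a counter delta around the hot
+ call. A sketch of the shape only; the real harness lives in
+ `bench/yjit_alloc_audit.rb`:
+
+ ```ruby
+ # The block argument is the hot call under audit.
+ def objects_per_iteration(iterations = 20_000)
+   GC.disable  # no GC inside the measurement window
+   before = GC.stat(:total_allocated_objects)
+   iterations.times { yield }
+   (GC.stat(:total_allocated_objects) - before).to_f / iterations
+ ensure
+   GC.enable
+ end
+
+ # objects_per_iteration { adapter.build_env(request) }  # hypothetical call site
+ ```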
+
3384
+ * **Per-request allocation count: 19 → 9 objects/req on the full path
3385
+ (-53%); 9 → 2 objects/req inside `build_env` alone (-78%).** Measured
3386
+ by `bench/yjit_alloc_audit.rb` (20 000 iterations, no GC during the
3387
+ measurement window, headers + lambda app from `bench/work.ru`'s
3388
+ shape). Same numbers under YJIT and CRuby — these are direct object
3389
+ allocations, not JIT-influenced.
3390
+ * **Top sites tackled:**
3391
+ 1. `Adapter::Rack#call` rebuilt the `[status, headers, body]` Array
3392
+ after destructuring the app's return value; now returns the app's
3393
+ tuple directly. **−1 Array/req.**
3394
+ 2. `Adapter::Rack#build_env` allocated `"Hyperion/#{VERSION}"`,
3395
+ `[3, 0]`, and the `[env, input]` return tuple per call. Hoisted
3396
+ `SERVER_SOFTWARE_VALUE` and `RACK_VERSION` to frozen constants;
3397
+ `[env, input]` now reuses a per-thread mutable scratch Array
3398
+ (caller destructures immediately, never holds the Array).
3399
+ **−2 String/Array per req.**
3400
+ 3. `Adapter::Rack#split_host` called `host:port.split(':', 2)` then
3401
+ re-arrayed `[name, port]`. Replaced with `byteslice` + a
3402
+ per-thread scratch tuple; the `host:` empty-header branch now
3403
+ returns a frozen `LOCALHOST_DEFAULTS` sentinel. **−1 Array/req
3404
+ on the common branch, −1 Array/req on the empty branch.**
3405
+ 4. `Adapter::Rack::INPUT_POOL` reset allocated `+''` per `acquire`
3406
+ to swap into the StringIO. The next call to `build_env` always
3407
+ overwrites with `request.body`, so a single shared frozen
3408
+ `EMPTY_INPUT_BUFFER` sentinel is sufficient. **−1 String/req.**
3409
+ 5. `Request#header(name)` always called `name.downcase`, even when
3410
+ the parser-stored keys and in-tree callers already pass lowercase
3411
+ literals. Fast-path direct lookup; only fall through to
3412
+ `downcase` on miss. **−1 String/req.**
3413
+ 6. `WebSocket::Handshake.validate` allocated
3414
+ `[:not_websocket, nil, nil]` on every plain-HTTP request (the
3415
+ overwhelming branch). Frozen `NOT_WEBSOCKET_RESULT` sentinel.
3416
+ **−1 Array/req.**
3417
+ 7. `ResponseWriter#write_buffered` allocated `+''` then iterated
3418
+ `body.each` for the common `[body_string]` Rack body shape.
3419
+ Single-element-Array fast path uses `body[0]` directly. **−1
3420
+ String/req.**
3421
+
3422
+ Sites left in place (unavoidable or out of scope):
3423
+
3424
+ - `CParser.build_response_head` allocates the head buffer +
3425
+ downcased copy of each user-supplied header key. C-ext code, out
3426
+ of scope per the Phase 11 rules.
3427
+ - `host_header.byteslice(0, idx)` and `byteslice(idx + 1, ...)` —
3428
+ the env hash retains both substrings as `SERVER_NAME` /
3429
+ `SERVER_PORT`; not transient.
3430
+ * **Specs.** New `spec/hyperion/yjit_alloc_audit_spec.rb` (2 examples)
3431
+ asserts ≤ 10 objects/req on the full path and ≤ 3 objects/req on
3432
+ `build_env` alone — thresholds set ~10% above the post-Phase-11
3433
+ measurement so a single accidentally re-introduced allocation fails
3434
+ the spec without flaky CRuby noise. Spec count **660 → 662**.
3435
+ Existing 660 stay green. Bench harness exposed as
3436
+ `rake bench:yjit_alloc`.
3437
+ * **macOS local bench (`-w 4 -t 5`, YJIT, `bench/work.ru`,
3438
+ `wrk -t4 -c200 -d10s`).** 3 warm-state samples each:
3439
+ | Build | r/s avg |
3440
+ |---|---:|
3441
+ | 2.1.0 baseline (master pre-Phase-11) | 43,396 r/s |
3442
+ | 2.2.0-wip (Phase 11 applied) | 43,440 r/s |
3443
+ Within noise — macOS arm64 at 43k r/s is already past the point
3444
+ where the Ruby-side allocation count dominates throughput; the
3445
+ bench is bound by the `JSON.generate` work in `work.ru` and
3446
+ syscalls. The allocation reductions still matter for GC pressure
3447
+ on long-lived servers (less heap churn → fewer pauses) and for
3448
+ smaller-host profiles where every object is felt — which is what
3449
+ the openclaw `-w 4` 15.5k row was measuring.
3450
+ * **openclaw-vm bench NOT performed** — host accepted SSH but
3451
+ rejected the workstation key (`Permission denied (publickey)`).
3452
+ The 15.5k → 18k+ r/s target row could not be reproduced this round;
3453
+ the macOS-local row above is the substitute. Phase 10 documented
3454
+ the same host as offline; this round it's reachable but the auth
3455
+ state regressed. Tracking in a follow-up; the changes here are
3456
+ pure refactors with no behavior delta, so redoing the openclaw
3457
+ measurement post-restore is safe.
3458
+
3459
+ Out of scope for Phase 11 (deferred): C-ext header-key downcase
3460
+ allocation in `cbuild_response_head` (would need C-side change); FFI
3461
+ marshalling amortization called out by Phase 10; `Connection#serve`
3462
+ read accumulator (already pre-Phase-2b'd).
3463
+
3464
+ ### Phase 10 — Rust HPACK wired into the HTTP/2 hot path (Phase 6c from the 2.0 RFC)
3465
+
3466
+ The Rust HPACK encoder/decoder shipped in 2.0.0 sat behind
3467
+ `Hyperion::H2Codec::{Encoder,Decoder}` but the wire path still routed
3468
+ HPACK through `protocol-http2`'s pure-Ruby `Compressor`/`Decompressor`.
3469
+ Phase 10 closes that gap with an adapter shim and a per-connection swap.
3470
+
3471
+ * **`Hyperion::Http2::NativeHpackAdapter`** (`lib/hyperion/http2/native_hpack_adapter.rb`)
3472
+ — wraps one `H2Codec::Encoder` + one `H2Codec::Decoder`, exposing
3473
+ `#encode_headers(headers, buffer)` and `#decode_headers(bytes)` —
3474
+ exactly the surface `Protocol::HTTP2::Connection` calls when HEADERS /
3475
+ CONTINUATION frames cross the wire. The adapter holds per-connection
3476
+ HPACK state (RFC 7541 dynamic table per direction) for the lifetime
3477
+ of one h2 connection.
3478
+ * **Substitution mechanism (Option A — per-connection swap).**
3479
+ `Http2Handler#build_server` constructs the `Protocol::HTTP2::Server`
3480
+ and, when the swap is enabled, overrides `encode_headers` and
3481
+ `decode_headers` on the server instance via `define_singleton_method`,
3482
+ routing both through the adapter. Protocol-http2's framer, stream
3483
+ state machine, flow control, and HEADERS/CONTINUATION framing all
3484
+ remain untouched — only the HPACK byte-pump is replaced. Frame
3485
+   ser/de in Rust is **deferred to a future Phase 6d**. (The swap is
+   sketched after this list.)
3486
+ * **Rust encoder gained dynamic-table search.** Previously the encoder
3487
+ added entries to the dynamic table (path 2/3) but never consulted
3488
+ them on subsequent calls — every header was re-emitted as a literal.
3489
+ That made wire bytes ~6× bigger than `protocol-hpack`'s output on
3490
+ repeated headers, swamping any FFI win. The encoder now searches
3491
+ the dynamic table for full and name-only matches before falling
3492
+ through to literal-with-incremental-indexing for novel names. After
3493
+ the fix, native + fallback produce identically-compressed wire bytes
3494
+ (`space savings 93.74%` vs `93.75%` on h2load `/`). The change is
3495
+ gated by 6 Rust unit tests (`cargo test`) which all stay green.
3496
+ * **Wholly-novel names** now go through "literal with incremental
3497
+ indexing" (prefix `0x40`) instead of "literal without indexing"
3498
+ (`0x00`), so future repeats can collapse via dynamic-table lookup.
3499
+ Both encodings are RFC 7541-conformant; existing decoders accept
3500
+ both. The `h2_codec_spec` parity test was updated to match.
3501
+ * **Opt-in default.** Local h2load benchmarking on macOS (M-series,
3502
+ `-c 1 -m 100 -n 5000`, hello.ru, 1 worker, `-t 64`) showed:
3503
+ | Workload | 2.1.0 baseline (Ruby HPACK only) | 2.2.0 default (env unset) | 2.2.0 native (`HYPERION_H2_NATIVE_HPACK=1`) |
3504
+ |---|---:|---:|---:|
3505
+ | h2 GET hello | n/a (different host) | **9,740 r/s** | 7,418 r/s |
3506
+ | h2 POST `h2_post.ru` `-d 1 KiB` | n/a | **8,007 r/s** | 7,350 r/s |
3507
+ | h2 headers-heavy | n/a | **8,742 r/s** | 6,312 r/s |
3508
+   Native is ~8–28% slower on this host *despite* the standalone
3509
+ microbench's 3.26× encode / 1.98× decode wins. Root cause: per-
3510
+ HEADERS-frame Fiddle FFI marshalling overhead (`Fiddle::Pointer[]`
3511
+ per header, `pack('Q*')` × 4, capacity-byte output buffer pre-fill,
3512
+ `byteslice`) outweighs the encode/decode CPU savings when the
3513
+ typical frame carries 3–8 small headers. The microbench measured a
3514
+ tight loop over many headers in one call, which doesn't model real
3515
+ h2 traffic.
3516
+ Until the FFI marshalling layer is rewritten to amortize allocation
3517
+ (a follow-up phase), the wiring ships **opt-in** behind
3518
+ `ENV['HYPERION_H2_NATIVE_HPACK']` (accepts `1`/`true`/`yes`/`on`,
3519
+ case-insensitive). Default is OFF — bytewise identical 2.0.0/2.1.0
3520
+ behavior, no surprise regression for upgraders. Operators who want
3521
+ to A/B test on Linux (where FFI cost may differ) can flip the env
3522
+ var and watch their own dashboards.
3523
+ * **Boot log** — `Http2Handler` records a single-shot `h2 codec selected`
3524
+ info line with `mode`, `native_available`, `native_enabled`, and
3525
+ `hpack_path` so the substitution state is observable. Three modes:
3526
+ `native (Rust) — HPACK on hot path`,
3527
+ `fallback (...) — native available, opt-in via HYPERION_H2_NATIVE_HPACK=1`,
3528
+ and `fallback (...) — native unavailable`.
3529
+ * **`Http2Handler#codec_native?`** now reflects the wired-on state
3530
+ (available AND opt-in), and `Http2Handler#codec_available?` reports
3531
+ the crate-loaded state. Both surfaces stay green for diagnostics.
3532
+ * **Specs** — `spec/hyperion/http2_native_hpack_spec.rb` adds 14
3533
+ examples covering: encode/decode parity (200 randomized header
3534
+ sets, both directions, native ↔ Ruby decoders cross-check),
3535
+ stateful dynamic-table behavior across 3 successive blocks, and
3536
+ Http2Handler integration in three states (env-on swap installed,
3537
+ available-but-env-unset no-swap, unavailable no-swap). Spec count
3538
+ **646 → 660** (12 new + 2 reframed).
3539
+ * **`SSH/openclaw-vm` bench reproduction NOT performed** — the bench
3540
+ host was offline (port 22 refused) for the duration of this work,
3541
+ so the 2.1.0 row-10 baseline (1,597 r/s) couldn't be reproduced
3542
+ side-by-side. The macOS local numbers above are the substitute. If
3543
+ Linux/openclaw shows a different verdict on `HYPERION_H2_NATIVE_HPACK=1`,
3544
+ the env-var gate makes that operator-flippable without a re-release.
3545
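+
+ The swap itself is small. A hedged sketch (the adapter constructor
+ shape is an assumption; the env gating and boot logging are condensed
+ away, and `server` is the `Protocol::HTTP2::Server` built in
+ `Http2Handler#build_server`):
+
+ ```ruby
+ def install_native_hpack!(server)
+   adapter = Hyperion::Http2::NativeHpackAdapter.new  # constructor shape assumed
+   server.define_singleton_method(:encode_headers) do |headers, buffer|
+     adapter.encode_headers(headers, buffer)
+   end
+   server.define_singleton_method(:decode_headers) do |bytes|
+     adapter.decode_headers(bytes)
+   end
+ end
+ ```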
+
3546
+ Out of scope for Phase 10 (deferred to a future phase): Rust frame
3547
+ ser/de (parallel framer state machine — Phase 6d), Ruby-side FFI
3548
+ allocation amortization, opt-in default flip.
3549
+
3550
+ ### Phase 9 — kernel TLS (KTLS_TX) on Linux
+ 
+ `OP_ENABLE_KTLS` is now flipped on the SSL context after a Linux-kernel
+ + OpenSSL probe at boot, so the kernel takes over the symmetric-cipher
+ write path post-handshake. Pairs with — does not replace — Phase 4
+ session resumption.
+ 
+ * **Probe** — `Hyperion::TLS.ktls_supported?` returns `true` only on
+   Linux ≥ 4.13 + OpenSSL ≥ 3.0. macOS / BSD always return `false` and
+   the boot path falls back transparently to userspace `SSL_write`.
+ * **Config** — `tls.ktls` (`:auto` / `:on` / `:off`, default `:auto`).
+   `:on` raises `Hyperion::UnsupportedError` at boot on hosts where the
+   probe returns false; `:auto` enables kTLS when supported and stays
+   off elsewhere; `:off` always uses the userspace cipher loop.
+ * **Boot log** — one info-level line per worker on the first connection
+   recording `ktls_policy`, `ktls_supported`, `ktls_active`, and the
+   negotiated cipher. Subsequent connections skip via `@ktls_logged`.
+ * **Plumbing** — `tls_ktls` flows through Server → Worker → Master and
+   through CLI single-mode `Server.new`. The `tls.ktls` DSL key is
+   available in nested form (`tls do; ktls :on; end`; see the sketch
+   after this list).
+ * **Bench (openclaw-vm, 1 worker, wrk -t4 -c64 -d20s)** — TLS h1 hello:
+   kTLS off ≈ **3,068 r/s** (p99 38–73 ms), kTLS on ≈ **3,508 r/s** (p99
+   41–101 ms with high variance from kernel TLS_TX queueing). 8 KB
+   static: kTLS off ≈ **1,470 r/s**, kTLS on ≈ **1,519 r/s**. The gain
+   is small at hello-payload size because the userspace cipher cost is
+   a tiny fraction of per-request overhead — the win compounds with
+   larger response bodies (kernel-side write-coalescing) and longer
+   keep-alive sessions. Full measured rows in `BENCH_HYPERION_2_0.md`.
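+ 
+ In `config/hyperion.rb` the nested form from the Plumbing item looks
+ like this (only the `tls do; ktls …; end` spelling is confirmed by
+ this entry; `:auto` is the default and needs no config at all):
+ 
+ ```ruby
+ tls do
+   ktls :on # hard-require kTLS; raises Hyperion::UnsupportedError at boot when unsupported
+ end
+ ```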
+ 
+ Out of scope for Phase 9: kTLS RX (receive-side) — OpenSSL 3.0 ships
+ TX only. RFC 8446 0-RTT continues to be served by Phase 4.
+ 
+ ## [2.1.0] - 2026-04-30
+ 
+ **Headline:** WebSocket support — RFC 6455 over Rack 3 full hijack, with a
+ native frame codec, a per-connection wrapper, and an e2e smoke test. Spec
+ count **530 → 632 (+102)**.
+ 
+ > **ActionCable on Hyperion is now a supported deployment model.** A single
+ > `hyperion -w 4 -t 10 config.ru` process serves HTTP, HTTP/2, TLS, **and**
+ > ActionCable from the same listener. The Rails-on-Puma split-deploy
+ > ("puma for HTTP, separate cable container for WS") is no longer required.
+ > See [`docs/WEBSOCKETS.md`](docs/WEBSOCKETS.md) for the recipe.
+ 
+ Out of scope for 2.1.0, deferred to 2.2.x: WebSocket-over-HTTP/2
+ (RFC 8441 Extended CONNECT), permessage-deflate (RFC 7692), send-side
+ fragmentation. HTTP/1.1 is the sole transport for WS this release.
+ 
+ ### Rack 3 hijack support (WS-1)
+ 
+ `env['rack.hijack?']` now returns `true`; `env['rack.hijack']` returns a
+ callable that detaches the underlying socket from Hyperion's request
+ lifecycle. After the app calls `env['rack.hijack'].call`:
+ 
+ * Hyperion does NOT write a response on the wire — the Rack tuple
+   returned from `app.call(env)` is ignored, per the Rack 3 spec.
+ * The socket is removed from Hyperion's read/write rotation. The accept
+   loop / writer fiber will not touch it again.
+ * Hyperion does NOT close the socket on connection cleanup or worker
+   shutdown — the application owns it. `Connection#close` becomes a
+   no-op for the close branch on the hijack path.
+ * The connection is removed from keep-alive accounting; the next
+   request from this client is a fresh connection.
+ 
+ Both dispatch modes are covered: the inline (per-fiber) path and the
+ thread-pool path. The hijack proc captures the `Hyperion::Connection`
+ (not the socket directly) so the `@hijacked` flag is observed by the
+ connection fiber the moment the app evaluates the proc, regardless of
+ which thread the proc runs on.
+ 
+ Hyperion-specific extension: `env['hyperion.hijack_buffered']` exposes
+ any bytes the connection had buffered past the parsed request boundary
+ (pipelined keep-alive carry, or — for an Upgrade — bytes the client
+ sent immediately after the request headers). The application is
+ responsible for consuming these before reading from the hijacked socket.
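+ 
+ A minimal full-hijack endpoint exercising this contract (a sketch,
+ not shipped code; the response bytes are illustrative):
+ 
+ ```ruby
+ run lambda { |env|
+   return [501, { 'content-length' => '0' }, []] unless env['rack.hijack?']
+ 
+   io = env['rack.hijack'].call             # detach; Hyperion won't touch the socket again
+   _early = env['hyperion.hijack_buffered'] # consume these before reading from io
+   io.write("HTTP/1.1 200 OK\r\ncontent-length: 2\r\n\r\nhi")
+   io.close                                 # the app owns the close now
+   [200, {}, []]                            # tuple is ignored per the Rack 3 spec
+ }
+ ```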
+ 
+ Foundation for native WebSocket support (WS-2 through WS-5).
+ 
+ #### Scope notes
+ 
+ * HTTP/1.1 only. Rack 3 hijack over HTTP/2 requires Extended CONNECT
+   (RFC 8441 / RFC 9220) and is intentionally NOT plumbed in this
+   release. h2 streams continue to see `env['rack.hijack?'] == false`.
+ * Partial hijack (response-headers `'rack.hijack'` callback that
+   receives the writer-side IO) is not yet implemented. Apps that need
+   streaming should keep using the existing chunked transfer-encoding
+   path; a follow-up will add partial hijack once full hijack lands.
+ 
+ ### WS-2 — RFC 6455 handshake (Upgrade: websocket → 101)
+ 
+ New module `Hyperion::WebSocket::Handshake` (lib/hyperion/websocket/handshake.rb).
+ The Rack adapter now intercepts the HTTP/1.1 → WebSocket upgrade
+ handshake per RFC 6455 §1.3 / §4.2 transparently, BEFORE the app sees
+ the env.
+ 
+ #### Detection
+ 
+ A request is a WS upgrade attempt when both:
+ 
+ * `Upgrade` contains the `websocket` token (case-insensitive)
+ * `Connection` contains the `upgrade` token (case-insensitive,
+   comma-separated list — `Connection: keep-alive, Upgrade` is valid)
+ 
+ Other Upgrade variants (e.g. `Upgrade: h2c`) flow through the normal
+ HTTP path untouched. Hyperion intercepts ONLY `websocket`.
+ 
+ #### Validation (RFC 6455 §4.2.1)
+ 
+ Each MUST is enforced. On failure Hyperion writes the response itself:
+ 
+ * Method != `GET` → `400`
+ * `HTTP/1.0` (or earlier) → `400`
+ * Missing `Host:` → `400`
+ * Missing `Sec-WebSocket-Key:` → `400`
+ * `Sec-WebSocket-Key` doesn't decode to exactly 16 bytes → `400`
+ * `Sec-WebSocket-Version` missing or not `13` → `426 Upgrade Required`
+   with `Sec-WebSocket-Version: 13` so the client knows what to retry
+ * `Origin` not in the allow-list (when one is configured) → `400`
+ 
+ #### Env handover convention (Option B)
+ 
+ On a valid handshake the adapter stashes
+ `env['hyperion.websocket.handshake'] = [:ok, accept_value, subprotocol]`
+ and lets the app proceed. The app is responsible for:
+ 
+ 1. reading the accept value out of env,
+ 2. calling `env['rack.hijack'].call` to take the socket (WS-1),
+ 3. writing the 101 response (helper:
+    `Hyperion::WebSocket::Handshake.build_101_response(accept, subprotocol)`).
+ 
+ This mirrors faye-websocket / ActionCable behaviour — Hyperion stays
+ neutral on the WS protocol layer and lets the app drive.
+ 
+ #### Optional subprotocol selector
+ 
+ `Handshake.validate(env, subprotocol_selector: ->(offers) { … })` —
+ the proc receives the array of client-offered subprotocols (from
+ `Sec-WebSocket-Protocol`) and may return one of them or nil. The
+ result lands in slot 3 of the handshake tuple. The server MUST NOT
+ echo a protocol the client didn't offer (RFC 6455 §4.2.2); a return
+ value not in the offer list is silently dropped.
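+ 
+ For example, selecting the first offer the app supports (a sketch;
+ the `SUPPORTED` list and the tuple destructuring are illustrative,
+ while the `subprotocol_selector:` kwarg is the documented surface):
+ 
+ ```ruby
+ SUPPORTED = %w[actioncable-v1-json].freeze
+ 
+ status, accept, subprotocol = Hyperion::WebSocket::Handshake.validate(
+   env, subprotocol_selector: ->(offers) { (offers & SUPPORTED).first }
+ )
+ ```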
+ 
+ #### Optional origin allow-list
+ 
+ Default: any origin accepted (browsers enforce CORS-style
+ restrictions on the WS upgrade independently). Override per-call via
+ `Handshake.validate(env, origin_allow_list: %w[https://example.com])`,
+ or globally via `HYPERION_WS_ORIGIN_ALLOW_LIST` (comma-separated).
+ The full Hyperion::Config DSL plumbing is deferred to WS-4 / WS-5;
+ the env-var fallback covers the operator escape hatch in the meantime.
+ 
+ #### Test vector confirmation
+ 
+ Per RFC 6455 §1.3:
+ 
+ ```
+ key    = "dGhlIHNhbXBsZSBub25jZQ=="
+ accept = "s3pPLMBiTxaQ9kYGzzhZRbK+xOo="
+ ```
+ 
+ `Hyperion::WebSocket::Handshake.accept_value(key)` returns the
+ canonical accept value (asserted in spec).
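+ 
+ The accept value is the base64 of the SHA-1 over key + GUID
+ (RFC 6455 §1.3); the following reproduces the vector and is
+ equivalent to what `accept_value` computes:
+ 
+ ```ruby
+ require 'digest/sha1'
+ require 'base64'
+ 
+ GUID = '258EAFA5-E914-47DA-95CA-C5AB0DC85B11'
+ key  = 'dGhlIHNhbXBsZSBub25jZQ=='
+ Base64.strict_encode64(Digest::SHA1.digest(key + GUID))
+ # => "s3pPLMBiTxaQ9kYGzzhZRbK+xOo="
+ ```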
+ 
+ #### Public types
+ 
+ * `Hyperion::WebSocket::HandshakeError < StandardError` — not raised
+   by any Hyperion code in WS-2 itself; supplied for downstream
+   consumers (middleware, ActionCable bridges) that want to translate
+   a failing `validate` tuple into an exception.
+ 
+ ### WS-3 — RFC 6455 frame ser/de in C ext
+ 
+ `ext/hyperion_http/websocket.c` exposes three primitives bound onto
+ `Hyperion::WebSocket::CFrame` at load time:
+ 
+ * `parse(buf, offset = 0)` — non-copying scan of `buf[offset..]`,
+   returns either `:incomplete`, `:error`, or the 7-tuple
+   `[fin, opcode_int, payload_len, masked, mask_key, payload_offset,
+   frame_total_len]`.
+ * `unmask(payload, key)` — XOR-unmask, GVL-released for payloads
+   large enough to amortise the release.
+ * `build(opcode, payload, fin:, mask:, mask_key:)` — serialise a
+   single frame ready for `socket.write`.
+ 
+ Idiomatic Ruby façades live in `lib/hyperion/websocket/frame.rb`:
+ 
+ * `Hyperion::WebSocket::Parser.parse(buf, offset)` returns a
+   `Hyperion::WebSocket::Frame` (Struct) with Symbol opcodes and a
+   pre-unmasked binary payload, or raises
+   `Hyperion::WebSocket::ProtocolError` on malformed input.
+ * `Hyperion::WebSocket::Parser.parse_with_cursor(buf, offset)` is
+   the same plus the `frame_total_len` advance, used by the
+   read-many-frames-per-buffer path.
+ * `Hyperion::WebSocket::Builder.build(opcode:, payload:, fin:,
+   mask:, mask_key:)` symmetrically serialises, with an auto-generated
+   `mask_key` when omitted on `mask: true`.
+ 
+ A pure-Ruby fallback (`RubyFrame`) is rebound onto `CFrame` if the
+ C ext didn't load — same surface, ~5–10× slower XOR. JRuby /
+ TruffleRuby keep working without the C build.
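+ 
+ A round-trip through the façades (a sketch; the `Frame` struct member
+ names are assumed from the description above):
+ 
+ ```ruby
+ bytes = Hyperion::WebSocket::Builder.build(
+   opcode: :text, payload: 'ping?', fin: true, mask: true # mask_key auto-generated
+ )
+ frame = Hyperion::WebSocket::Parser.parse(bytes, 0)
+ frame.opcode  # => :text
+ frame.payload # => "ping?", already unmasked
+ ```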
+ 
+ ### WS-4 — `Hyperion::WebSocket::Connection` + e2e smoke
+ 
+ `lib/hyperion/websocket/connection.rb` is the per-connection wrapper
+ that takes a hijacked socket from WS-1, the validated handshake
+ tuple from WS-2, and the framing primitives from WS-3 and exposes a
+ message-oriented API to the application:
+ 
+ ```ruby
+ # (101 response already written by the app — see the WS-2 handover steps)
+ ws = Hyperion::WebSocket::Connection.new(
+   env['rack.hijack'].call,
+   buffered: env['hyperion.hijack_buffered'],
+   subprotocol: env['hyperion.websocket.handshake'][2]
+ )
+ 
+ loop do
+   type, payload = ws.recv
+   break if type.nil? || type == :close
+ 
+   ws.send(payload, opcode: type) # echo
+ end
+ 
+ ws.close(code: 1000)
+ ```
+ 
+ #### What the wrapper does
+ 
+ * **Continuation reassembly** — `recv` joins `text` / `binary` +
+   `continuation`* + final `FIN=1` into a single message before
+   returning. Control frames (ping / pong / close) interleaved
+   between fragments are handled inline per RFC 6455 §5.4.
+ * **Auto-pong** — RFC 6455 §5.5.2: the wrapper writes a pong with
+   the ping's payload before returning to the caller. `on_ping`
+   hooks observe but do not replace the auto-response, so a server
+   using this wrapper stays compliant even if the app's hook is a
+   no-op.
+ * **Close handshake** — peer-initiated close returns
+   `[:close, code, reason]` from `recv` and writes a close echo
+   (RFC 6455 §5.5.1). Locally-initiated `ws.close(code: 1000, reason:,
+   drain_timeout: 5)` writes our close, drains for the peer's
+   matching close (or times out), then closes the socket.
+ * **Per-message size cap** — `max_message_bytes:` (default 1 MiB)
+   bounds the reassembly buffer; an over-cap continuation triggers
+   close 1009 (Message Too Big) and surfaces the close to the caller.
+ * **UTF-8 validation** — text frames whose payload isn't valid
+   UTF-8 trip close 1007 (Invalid Frame Payload Data) per
+   RFC 6455 §8.1.
+ * **Idle / keep-alive supervision** — `idle_timeout:` (default
+   60 s) sends close 1001 after no traffic; `ping_interval:`
+   (default 30 s) emits proactive pings to keep NAT mappings warm.
+   Both kwargs accept `nil` to disable. Implemented via
+   `IO.select`, so the recv loop cooperates with the fiber
+   scheduler under `--async-io`.
+ 
+ Hooks: `on_ping(&block)`, `on_pong(&block)`, `on_close(&block)`
+ fire for observation. They run AFTER the built-in protocol behaviour;
+ they cannot suppress the auto-response or close echo.
+ 
+ State predicates: `open?`, `closing?`, `closed?`. After a `:close`
+ has been observed by the caller, subsequent `recv` raises
+ `Hyperion::WebSocket::StateError`; `send` after close also raises.
+ 
+ #### ActionCable / faye-websocket recipe
+ 
+ The wrapper is intentionally protocol-agnostic — it doesn't know
+ about ActionCable's JSON framing or faye-websocket's driver state
+ machine. To bridge:
+ 
+ ```ruby
+ # In your Rack app or a faye-websocket-style adapter:
+ socket = env['rack.hijack'].call
+ socket.write(
+   Hyperion::WebSocket::Handshake.build_101_response(
+     env['hyperion.websocket.handshake'][1],
+     env['hyperion.websocket.handshake'][2]
+   )
+ )
+ ws = Hyperion::WebSocket::Connection.new(
+   socket, buffered: env['hyperion.hijack_buffered']
+ )
+ # faye-websocket: feed ws.recv into your Driver#parse;
+ # ActionCable: hand to a `Connection::ClientSocket`-style adapter.
+ ```
+ 
+ A first-class `Hyperion::WebSocket::Adapter::ActionCable` is
+ deferred to a follow-up — most ActionCable users speak the raw
+ socket interface through `faye-websocket`'s driver mode, which
+ `env['rack.hijack']` already feeds directly.
+ 
+ #### Smoke test
+ 
+ `spec/hyperion/websocket_e2e_spec.rb` boots a real Hyperion
+ server, opens a raw TCP client, completes the handshake, exchanges
+ 100 text messages each direction (echoed by an app that uses
+ `Hyperion::WebSocket::Connection` directly), and closes with code
+ 1000. p50 echo round-trip on developer hardware is sub-millisecond
+ — logged to stderr from the spec for sanity, not asserted (CI runners
+ vary too much).
+ 
+ #### Scope notes
+ 
+ Deferred to a follow-up:
+ 
+ * `Hyperion::WebSocket::Adapter::ActionCable` — see recipe above.
+ * permessage-deflate (RFC 7692) compression — handshake-time
+   negotiation in WS-2, per-frame compression here. Not in 2.1.0.
+ * Send-side fragmentation — `send` writes a single FIN=1 frame
+   regardless of payload size. Browsers / well-behaved clients
+   handle multi-MB single frames; an opt-in `fragment_threshold:`
+   can be added later if a use case shows up.
+ 
+ ### Test fixtures
+ 
+ * `spec/hyperion/http2_settings_spec.rb` — relax the logger expectation
+   so the Phase 6b codec-boot info line ("h2 codec selected") doesn't
+   trip the spec on hosts where the native h2 codec is loaded. CI fix
+   on top of 2.0.1, folded into 2.1.0.
+ 
+ ## [2.0.1] - 2026-04-30
+ 
+ Phase 8 — close the last two static-file rps gaps. Hyperion 2.0.0 still
+ lost to Puma 8.0.1 on rps for two workloads on the 2026-04-29 sweep:
+ 8 KB static at default `-t 5 -w 1` (121 r/s vs Puma 1,246 — a 10× loss)
+ and 1 MiB static at the same shape (1,809 r/s vs 2,139 — -15%). 2.0.1
+ fixes both.
+ 
+ ### Headline bench (openclaw-vm 16 vCPU, kernel 6.8.0)
+ 
+ Side-by-side `-t 5 -w 1` against Puma 8.0.1, `wrk -t4 -c100 -d20s`:
+ 
+ | Workload | Hyperion 2.0.0 | Hyperion 2.0.1 | Puma 8.0.1 | 2.0.1 vs Puma |
+ |---|---:|---:|---:|---:|
+ | Static 8 KB r/s | 121 | **1,483** | 1,366 | **+8.6% rps** |
+ | Static 8 KB p99 | 43.85 ms | **4.81 ms** | 84.38 ms | **17.5× lower** |
+ | Static 1 MiB r/s | 1,809 | **1,697** | 1,330 | **+27.6% rps** |
+ | Static 1 MiB p99 | 4.37 ms | **5.14 ms** | 92.86 ms | **18× lower** |
+ 
+ Hyperion now wins **both** workloads on rps and p99. The 2.0.0 caveat
+ section documenting the static-8 KB regression is retired.
+ 
+ ### Phase 8a — small-file fast path (response_writer.rb)
+ 
+ The 2.0.0 8 KB row's diagnosis turned out to be a **Nagle/delayed-ACK
+ stall**, not the EAGAIN-yield-retry storm hypothesised in the BENCH
+ report. With kernel-default Nagle on, `io.write(head)` (~150 B
+ status line + headers) followed by a separate `write(body)` for the
+ 8 KB asset stalled ~40 ms per response: Nagle held back the trailing
+ body bytes until the head packet was ACKed, while the client's
+ delayed-ACK timer sat waiting for a second packet before acknowledging.
+ Hence ~25 responses per second per keep-alive connection — exactly
+ the 121 r/s floor across 5 wrk threads.
+ 
+ Fix: in `ResponseWriter#write_sendfile`, when `file_size <= 64 KiB`
+ read the body bytes inline and concatenate onto the head buffer,
+ emitting head + body as one `io.write` call (see the sketch below).
+ The response goes out as one TCP segment train, the client ACKs the
+ whole response, and the second-write delayed-ACK stall disappears
+ entirely. **No TCP_NODELAY setsockopt churn required** — large-file
+ streaming still benefits from Nagle's coalescing across sendfile
+ chunks.
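+ 
+ The shape of the fix, simplified (a sketch; `SMALL_FILE_MAX` and the
+ `copy_to_socket` argument order are illustrative, not the shipped
+ code):
+ 
+ ```ruby
+ SMALL_FILE_MAX = 64 * 1024
+ 
+ def write_sendfile(io, head, file, file_size)
+   if file_size <= SMALL_FILE_MAX
+     io.write(head + file.read(file_size)) # one syscall: one segment train, one ACK
+   else
+     io.write(head)                        # large files keep the kernel sendfile path
+     Hyperion::Http::Sendfile.copy_to_socket(io, file, 0, file_size)
+   end
+ end
+ ```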
+ 
+ ### Phase 8b — Linux splice(2) primitive (kept, but not in production path)
+ 
+ The C ext now ships a `Sendfile.copy_splice` primitive for callers
+ that need explicit pipe-tee semantics: file_fd → per-thread pipe →
+ sock_fd with `SPLICE_F_MOVE | SPLICE_F_MORE`, fully kernel-side
+ zero-copy. The per-thread pipe pair is cached in a `pthread_key_t` with
+ a destructor closing both fds at thread exit (no fd leak across
+ worker fiber lifecycles). The pipe is sized to 1 MiB via `F_SETPIPE_SZ`
+ where supported.
+ 
+ **Disabled on the production hot path.** A correctness window was
+ discovered during the 1 MiB bench: if `splice(file → pipe)` succeeds
+ but `splice(pipe → sock)` fails mid-transfer with EPIPE (peer
+ closed), unread bytes stay in the pipe and would be sent on the
+ NEXT connection's socket. The persistent per-thread pipe is the
+ hazard. `copy_to_socket` now stays on plain `sendfile(2)` for
+ files > 64 KiB — well-tested, no residual-bytes window, and thanks to
+ fiber-per-connection scheduling the 1 MiB row beats Puma by +27%
+ without the splice path. The primitive remains exposed for future
+ use behind explicit per-request pipe-pair management.
+ 
+ ### New small-file C primitive (Sendfile.copy_small)
+ 
+ For callers driving the sendfile module directly (bypassing the
+ ResponseWriter coalescer), `Sendfile.copy_small(out_io, in_io,
+ offset, len)` reads `len <= 64 KiB` into a heap buffer with `pread`
+ and writes it with the GVL released. EAGAIN is polled with a short
+ `select()` (5 × 10 ms) instead of fiber-yielding — appropriate for
+ small slices where the kernel send buffer is empty and the transfer
+ finishes in microseconds. Used by the Ruby façade as a backup
+ fast path when ResponseWriter coalescing isn't applicable.
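+ 
+ Driving the primitive directly (a sketch; the return shape is assumed
+ to mirror `Sendfile.copy`'s `[bytes, status]` contract):
+ 
+ ```ruby
+ File.open('public/app.css') do |f|
+   len = [f.size, 64 * 1024].min
+   Hyperion::Http::Sendfile.copy_small(sock, f, 0, len) # pread + GVL-released write
+ end
+ ```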
+ 
+ ### Specs
+ 
+ - 530 examples (was 521) — +9 specs covering small-file routing,
+   threshold boundaries, and the splice primitive.
+ - 0 failures, 2 pending (host-gated: macOS skips the Linux-splice spec,
+   Linux skips the macOS-only fallback assertion).
+ 
+ ### Files changed
+ 
+ - `ext/hyperion_http/sendfile.c` — `copy_small`, `copy_splice`,
+   `splice_supported?`, `small_file_threshold` C primitives;
+   per-thread pipe pair via `pthread_key_t`. Linux-splice gated by
+   `#ifdef __linux__`; rest unchanged. Compiles cleanly on macOS
+   arm64 and Linux x86_64.
+ - `lib/hyperion/http/sendfile.rb` — façade routes `<= 64 KiB` to
+   `copy_small`; streaming branch stays on plain `copy` (sendfile).
+ - `lib/hyperion/response_writer.rb` — `write_sendfile` coalesces
+   head + body into one write for `file_size <= 64 KiB`.
+ - `spec/hyperion/http_sendfile_spec.rb` — new specs for small-file
+   routing, threshold boundary, splice primitive byte-integrity.
+ 
+ ## [2.0.0] - 2026-04-29
+ 
+ RFC §3 2.0.0 — the breaking-removal release that closes the deprecation
+ cycle opened in 1.8.0, plus Phase 6 of the perf overhaul: a Rust
+ HPACK encoder/decoder shipped as `ext/hyperion_h2_codec` with
+ graceful fallback to `protocol-http2` when Rust isn't available at
+ install time.
+ 
+ This is the largest API-surface change since 1.0. Operators on the
+ 1.8 line who paid attention to the deprecation warns have no further
+ action to take; operators jumping from 1.6.x straight to 2.0 should
+ read the migration table in [docs/RFC_2_0_DESIGN.md §4](docs/RFC_2_0_DESIGN.md).
+ 
+ ### Breaking changes (removals from 1.8 deprecations)
+ 
+ - **Flat-name DSL keys removed.** All 13 flat keys
+   (`h2_max_concurrent_streams`, `h2_initial_window_size`,
+   `h2_max_frame_size`, `h2_max_header_list_size`, `h2_max_total_streams`,
+   `admin_token`, `admin_listener_port`, `admin_listener_host`,
+   `worker_max_rss_mb`, `worker_check_interval`, `log_level`,
+   `log_format`, `log_requests`) no longer parse on the Ruby DSL.
+   `Hyperion::Config.load` raises `NoMethodError` from the DSL
+   evaluator if a config file uses them. Migration: wrap in the nested
+   block (`h2 do; max_concurrent_streams 256; end`, `admin do; token
+   ENV['T']; end`, etc. — see the migration table in the RFC).
+ 
+   CLI flags keep their flat operator-facing spellings unchanged
+   (`--admin-token`, `--worker-max-rss-mb`, `--log-level`, …); only
+   the in-Ruby DSL surface lost the flat names.
+ 
+ - **`Hyperion.metrics =` / `Hyperion.logger =` setters removed.** The
+   module-level writers no longer exist (the readers stay as
+   `Runtime.default` delegators for REPL convenience).
+ 
+   Migration recipes:
+   ```ruby
+   # before
+   Hyperion.metrics = MyMetricsAdapter.new
+ 
+   # after — option A: mutate the default Runtime in-place
+   Hyperion::Runtime.default.metrics = MyMetricsAdapter.new
+ 
+   # after — option B: per-Server isolation (preferred for new code)
+   runtime = Hyperion::Runtime.new(metrics: MyMetricsAdapter.new)
+   Hyperion::Server.new(app: my_app, runtime: runtime).start
+   ```
+ 
+ - **Dual-emit Prometheus keys retired.** 1.7.0 introduced per-mode
+   dispatch counters (`:requests_dispatch_threadpool_h1`,
+   `:requests_dispatch_tls_h2`, etc.) and dual-emitted the legacy
+   `:requests_async_dispatched` / `:requests_threadpool_dispatched`
+   alongside them. 2.0 keeps only the per-mode keys. Operators on
+   Grafana dashboards from the 1.x line had two minor releases
+   (1.7→1.8) to migrate; the legacy keys are simply gone now.
+ 
+ - **Default flip on `h2.max_total_streams`.** The 1.7→1.8 default
+   was nil (admission control disabled). 2.0 defaults to
+   `max_concurrent_streams × workers × 4`, computed at config-finalize
+   time once the worker count is known. The headroom factor (4×) is
+   large enough that no realistic legitimate workload trips the cap;
+   the abuse path (5,000 conns × 128 streams = 640k fibers → OOM)
+   closes by default.
+ 
+   Example caps (at the default `max_concurrent_streams` of 128):
+   - 1 worker: cap = 128 × 1 × 4 = 512
+   - 4 workers: cap = 128 × 4 × 4 = 2,048
+   - 32 workers: cap = 128 × 32 × 4 = 16,384
+ 
+   Operator override:
+   ```ruby
+   h2 do
+     max_total_streams :unbounded # restore pre-2.0 unbounded
+     # or:
+     max_total_streams 8192       # explicit fixed cap
+   end
+   ```
+ 
+ ### Phase 6 — Rust HPACK + h2 frame codec
+ 
+ New native extension at `ext/hyperion_h2_codec`:
+ 
+ - Self-contained, zero-dependency Rust crate (RFC 7541 HPACK
+   encoder/decoder + RFC 7541 Appendix B static Huffman decoder +
+   RFC 7540 §6 frame primitives).
+ - Exposed to Ruby via `extern "C"` + Fiddle (`lib/hyperion/h2_codec.rb`).
+   No magnus, no transitive crate fetching at install time.
+ - ABI version guard — Ruby refuses to load a binary that disagrees
+   with `EXPECTED_ABI`, so a stale on-disk codec from a prior install
+   can't crash the process.
+ - `Hyperion::H2Codec.available?` reports load state. `Encoder` /
+   `Decoder` are instance-per-connection, hold owned Rust pointers,
+   and finalize via `ObjectSpace.define_finalizer`.
+ 
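+ Construction-side sketch (the class names and the `available?` probe
+ are confirmed above; the `encode` / `decode` method names are
+ assumptions for illustration):
+ 
+ ```ruby
+ if Hyperion::H2Codec.available?
+   enc = Hyperion::H2Codec::Encoder.new # per-connection; finalizer frees the Rust side
+   dec = Hyperion::H2Codec::Decoder.new
+   block = enc.encode([[':status', '200'], ['content-type', 'text/plain']])
+   dec.decode(block) # => [[":status", "200"], ["content-type", "text/plain"]]
+ end
+ ```
+ 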
+ Microbench (50,000 iterations, M2 Pro, opt-level=3 LTO):
+ 
+ | Operation | Rust (µs/op) | protocol-hpack (µs/op) | speedup |
+ |---|---:|---:|---:|
+ | HPACK encode | 9.0 | 29.2 | 3.26× |
+ | HPACK decode | 6.3 | 12.5 | 1.98× |
+ 
+ The h2load wire-level bench (`bench/h2_post.ru` + `h2load -c 1 -m 100
+ -n 5000`) was not run on the Mac dev host (h2load isn't installed
+ locally) — operators on Linux should re-run to confirm the 4,000+
+ r/s target.
+ 
+ #### Fallback path
+ 
+ The codec is opt-in at runtime. When `cargo` is missing or the build
+ fails (`gem install` on a host without Rust), `extconf.rb` writes a
+ no-op Makefile and `gem install` completes — Hyperion boots with
+ `H2Codec.available? == false` and serves h2 traffic via
+ `protocol-http2`'s Ruby HPACK exactly as it did in 1.x.
+ 
+ `Http2Handler#codec_native?` reports the per-handler view; the boot
+ log carries a one-shot info line per process:
+ 
+ ```json
+ {"message":"h2 codec selected","mode":"native (Rust)","native_available":true}
+ ```
+ 
+ Two new metrics counters bump on first construction:
+ `:h2_codec_native_selected` or `:h2_codec_fallback_selected`.
+ 
+ ### Phase 6c (deferred to 2.x)
+ 
+ The connection state machine + framer continue to be driven by
+ `protocol-http2` for now. Splicing the native codec directly into
+ the framer's encode/decode hot paths requires unifying the framer
+ abstraction layer; that work lands in a 2.x point release once the
+ production rollout shows the boot probe is green and the
+ encoder/decoder paths haven't surfaced any RFC edge case the
+ static-table-plus-Huffman-only encoder doesn't cover.
+ 
+ ### Spec count
+ 
+ 499 (1.8.0) → 521 (2.0.0). +22 examples covering: removed-API
+ smoke checks, default flip arithmetic + sentinel handling, RFC 7541
+ C.2.1 + C.4.1 vectors, 100-iter random round-trip, native/fallback
+ gating, Http2Handler codec_native? readback, one-shot boot log.
+ 
+ ### Migration checklist (1.x → 2.0)
+ 
+ 1. Search your `config/hyperion.rb` for any of the 13 flat DSL keys.
+    Wrap each in its nested block. The 1.8.0 deprecation log already
+    listed the rewrite per key.
+ 2. Search your application for `Hyperion.metrics =` /
+    `Hyperion.logger =`. Replace with `Hyperion::Runtime.default.metrics =`
+    (in-place mutation) or pass a custom `Runtime` to
+    `Hyperion::Server.new`.
+ 3. If your Grafana boards still query `requests_async_dispatched` /
+    `requests_threadpool_dispatched`, migrate the queries to
+    `requests_dispatch_<mode>` (5 mode keys; see
+    `lib/hyperion/dispatch_mode.rb` for the canonical list).
+ 4. If you have an h2-heavy multi-tenant edge with extreme stream
+    fan-out (> `max_concurrent_streams × workers × 4` simultaneous
+    streams across the process), set
+    `h2 do; max_total_streams :unbounded; end` to restore the
+    pre-2.0 unbounded behaviour.
+ 5. (Optional) If `cargo` is on your build hosts, `gem install
+    hyperion-rb` will produce the native HPACK codec automatically
+    and the boot log will report `mode: native (Rust)`. Otherwise
+    you stay on the protocol-http2 fallback with no action needed.
+ 
+ ---
+ 
+ ## [1.8.0] - 2026-04-29
+ 
+ RFC §3 1.8.0 deprecation wave + Phase 4 TLS session resumption. Two
+ work streams in one minor: every API the RFC marks for removal in 2.0
+ now emits a one-shot deprecation warn through the runtime logger, and
+ the TLS context turns on server-side session caching + RFC 5077 ticket
+ resumption with operator-driven SIGUSR2 key rotation. Behaviour of the
+ deprecated APIs is unchanged — the warn is purely advisory.
+ 
+ ### Deprecations (1.8.0 emits, 2.0.0 removes)
+ 
+ - **Flat-name DSL keys.** All 13 flat config keys
+   (`h2_max_concurrent_streams`, `h2_initial_window_size`,
+   `h2_max_frame_size`, `h2_max_header_list_size`, `h2_max_total_streams`,
+   `admin_token`, `admin_listener_port`, `admin_listener_host`,
+   `worker_max_rss_mb`, `worker_check_interval`, `log_level`,
+   `log_format`, `log_requests`) now emit a deprecation warn at boot
+   identifying the nested-DSL replacement. The CLI flag surface
+   (`--admin-token`, `--worker-max-rss-mb`, etc.) is unchanged — only the
+   Ruby DSL form warns. `merge_cli!` does NOT trigger the warn, so
+   `--log-level info` from a launcher script stays quiet.
+ - **`Hyperion.metrics =` / `Hyperion.logger =` setters.** Both
+   module-level setters now emit a deprecation warn pointing at the
+   Runtime injection path (`Hyperion::Runtime.new(metrics: …)` +
+   `Hyperion::Server.new(runtime:)`). In-tree CLI bootstrap was
+   rerouted to write `Hyperion::Runtime.default.logger = …` directly so
+   the canonical CLI path doesn't deprecation-warn itself.
+ - **`Hyperion::AsyncPg.install!(activerecord: true)`** — N/A in this
+   repo. Lives in the `hyperion-async-pg` companion gem; the deprecation
+   ships there.
+ 
+ Each warn fires once per process via `Hyperion::Deprecations.warn_once`
+ with a per-key dedup table. Tests can suppress via
+ `Hyperion::Deprecations.silence!` (the spec-suite default) and assert
+ on warn output via `unsilence!` + a sink-logger swap on
+ `Hyperion::Runtime.default`.
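+ 
+ The warn-assertion seam, as a sketch (`MyMetricsAdapter` is
+ illustrative):
+ 
+ ```ruby
+ require 'logger'
+ require 'stringio'
+ 
+ Hyperion::Deprecations.unsilence!
+ sink = StringIO.new
+ Hyperion::Runtime.default.logger = Logger.new(sink)
+ 
+ Hyperion.metrics = MyMetricsAdapter.new # deprecated setter: one-shot warn
+ Hyperion.metrics = MyMetricsAdapter.new # deduped, no second warn
+ sink.string.scan(/deprecat/i).size      # expect exactly 1
+ 
+ Hyperion::Deprecations.silence!
+ ```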
+ 
+ ### Phase 4 — TLS session resumption ticket cache + SIGUSR2 rotation
+ 
+ - **`Hyperion::TLS.context` enables `SESSION_CACHE_SERVER`** with a
+   20_480-entry LRU cap (≈16 MiB at 800 B/session) and a stable
+   per-process `session_id_context` so cache lookups cross worker
+   boundaries when the master inherits a single listener fd
+   (`:share` model).
+ - **RFC 5077 session tickets are explicitly enabled** by clearing
+   `OP_NO_TICKET` on the SSLContext; OpenSSL's auto-rolled ticket key
+   handles short-circuited handshakes for returning clients with no
+   server-side state.
+ - **`Hyperion::TLS.rotate!`** flushes the in-process session cache;
+   used by the SIGUSR2 handler.
+ - **SIGUSR2-driven key rotation.** The master traps the configured
+   rotation signal (default `:USR2`) and re-broadcasts to every live
+   child; each worker calls `Hyperion::TLS.rotate!` on its per-context
+   SSLContext. Operators set `tls.ticket_key_rotation_signal = :NONE`
+   to disable rotation entirely.
+ - **New `tls` config subconfig** with `session_cache_size` (default
+   20_480) and `ticket_key_rotation_signal` (default `:USR2`). Wired
+   through both the nested DSL and Worker / Server constructors.
+ - **Cross-worker ticket-key sharing trade-off.** Ruby's stdlib OpenSSL
+   bindings (3.3.x) do not expose `SSL_CTX_set_tlsext_ticket_keys`, so
+   each worker still owns its own auto-generated ticket key. On Linux
+   `:reuseport` workers the kernel pins client → worker by tuple hash
+   so a returning client lands on the same worker's cache; on the
+   `:share` model the cache is shared via the inherited fd. We'll
+   thread a master-generated key through to children when a Ruby
+   binding lands (probably 3.4+).
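+ 
+ The operator-facing knobs from this entry, in nested-DSL form:
+ 
+ ```ruby
+ tls do
+   session_cache_size 20_480        # default; LRU cap on the server-side cache
+   ticket_key_rotation_signal :USR2 # default; :NONE disables the rotation trap
+ end
+ ```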
+ 
+ ### New specs (+24 examples)
+ 
+ - `spec/hyperion/deprecation_warns_spec.rb` (12) — flat DSL warns,
+   per-key dedup, `Hyperion.metrics =` / `Hyperion.logger =` warns,
+   Runtime-direct write does NOT warn, suite-default silence path.
+ - `spec/hyperion/tls_session_resumption_spec.rb` (12) — context
+   defaults (`SESSION_CACHE_SERVER`, stable `session_id_context`,
+   ticket-enabled), live resumption via OpenSSL session reuse,
+   `Hyperion::TLS.rotate!` flush, `session_cache_size = 1` eviction,
+   Config TLS subconfig defaults + nested-DSL wiring.
+ 
+ ### Test-suite changes
+ 
+ - `spec_helper.rb` flips `Hyperion::Deprecations.silence!` in
+   `before(:suite)` so the legacy 475 specs (which intentionally
+   exercise deprecated APIs as their canonical 1.x test seams) stay
+   quiet without per-spec ceremony.
+ - `spec/hyperion/nested_dsl_spec.rb` — the "does NOT emit a deprecation
+   warn on flat keys (warns land in 1.8)" example was the canary for
+   this release; replaced with a behaviour-parity assertion. Per-file
+   `before/after` silences the warns so the parity assertions still hold.
+ - `spec/hyperion/tls_spec.rb` baseline ALPN tests untouched — the new
+   resumption knobs are additive on the same constructor.
+ 
+ ## [1.7.1] - 2026-04-29
+ 
+ Perf-only point release approaching Falcon parity on CPU-bound JSON.
+ Phase 2 pools the per-request Lint wrapper, reuses one parser inbuf per
+ connection, and widens the pre-interned header table 16 → 30; Phase 3
+ moves the env-construction loop and the cookie split-parse out of pure
+ Ruby and into the C extension. No behavioural changes, no deprecation
+ warns, no new public DSL surface.
+ 
+ ### Phase 2 — per-worker Lint pool + reused inbuf + 30-header intern table
+ 
+ - **`Hyperion::LintWrapperPool` per worker.** When `RACK_ENV=development`
+   Hyperion wraps the app in `Rack::Lint`. The wrapper used to be a
+   per-request allocation; the pool now hands one out per worker so the
+   hot path is allocation-free in dev too.
+ - **Reused inbuf on `Connection`.** The connection-level read buffer is
+   a single mutable `String` carried across requests on the same socket
+   (with explicit reset on framing-error / oversized-body / close). Cuts
+   per-request `String#new` traffic on keep-alive workloads.
+ - **30-entry pre-interned header table in `CParser`.** Phase 2c —
+   widened from the rc16 16-entry table to cover the full
+   production-traffic top 30 (Sec-Fetch-*, X-Forwarded-Host, X-Real-IP,
+   etc.). The table doubles as the source of truth for
+   `Adapter::Rack::HTTP_KEY_CACHE` so parser, adapter, and downstream
+   env consumers all share string identity (`#equal?` is true) for the
+   common keys. `CParser::PREINTERNED_HEADERS` is the public exposure.
+ 
+ 
+ ### Phase 3a — env-construction loop moved into C extension
+ 
+ The Ruby-side env build in `Hyperion::Adapter::Rack#build_env` looped
+ over `request.headers` setting `env["HTTP_*"] = value` per pair, plus
+ the request-line keys (`REQUEST_METHOD`, `PATH_INFO`, `QUERY_STRING`,
+ `HTTP_VERSION`, `SERVER_PROTOCOL`) and the two RFC-mandated non-`HTTP_`
+ promotions (`CONTENT_LENGTH`, `CONTENT_TYPE`). Every uncached header
+ went through `HTTP_KEY_CACHE[name] || CParser.upcase_underscore(name)`,
+ which still meant a Ruby Hash lookup + a method dispatch per header.
+ 
+ - **New `Hyperion::CParser.build_env(env, request) -> env`.** Single
+   FFI hop per request that:
+   - Reads the Request's `@method`, `@path`, `@query_string`,
+     `@http_version`, `@headers` ivars directly via `rb_ivar_get` (zero
+     method dispatch — Request is a frozen value object).
+   - Sets `REQUEST_METHOD` / `PATH_INFO` / `QUERY_STRING` /
+     `HTTP_VERSION` / `SERVER_PROTOCOL` using seven pre-frozen,
+     GVAR-anchored String VALUEs (allocated once at extension load).
+   - Walks the headers hash via `rb_hash_foreach`. For each `(name,
+     value)` pair, looks up the Rack key via:
+     1. Pointer compare against `header_table_lc_v[]` (when the name
+        came from the parser, this is a one-instruction hit on the 30
+        pre-interned entries).
+     2. Case-insensitive scan against the same table (covers Request
+        objects constructed in specs without going through the parser).
+     3. Single-allocation `HTTP_<UPCASED_UNDERSCORED>` build (mirrors
+        `cupcase_underscore` byte-for-byte; US-ASCII encoded).
+   - Promotes `content-length` → `CONTENT_LENGTH` and `content-type` →
+     `CONTENT_TYPE` in the same pass (no second walk over env).
+ - **`Adapter::Rack#build_env` rewired.** When `c_build_env_available?`
+   is true (memoised probe), the Ruby loop is replaced with a single
+   `::Hyperion::CParser.build_env(env, request)` call. The pre-Phase-3
+   Ruby loop stays in place behind the `else` branch and gets exercised
+   by the parity spec, so a hypothetical missing-extension build still
+   produces byte-identical env hashes.
+ - **Bench (macOS arm64, `bench/headers_heavy.ru` + `headers_heavy_wrk.lua`,
+   `-t 5 -w 1`, `wrk -t4 -c100 -d10s`, 3 warmup runs):**
+ 
+   | Configuration | r/s (median of 3) |
+   |---|---:|
+   | Phase 3 OFF (Ruby loop) | 17,555 |
+   | Phase 3 ON (C build_env) | **20,390 (+16%)** |
+ 
+   Above the 3–5% target — the FFI savings stack with Phase 2c's
+   pointer-compare hit on the pre-interned header keys, since
+   `build_env` now reuses the same identity throughout.
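+ 
+ The adapter-side branch, reduced to its shape (a sketch; `base_env`
+ and `build_env_ruby` are illustrative names for the template hash and
+ the pre-Phase-3 loop):
+ 
+ ```ruby
+ def build_env(request)
+   env = base_env.dup
+   if c_build_env_available?                     # memoised extension probe
+     ::Hyperion::CParser.build_env(env, request) # single FFI hop per request
+   else
+     build_env_ruby(env, request)                # parity-specced Ruby fallback
+   end
+ end
+ ```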
+ 
+ ### Phase 3b — cookie split-parse in C extension
+ 
+ Cookie header split (`name1=val1; name2=val2; …` → `{ "name1" => "val1",
+ … }`) used to live in Ruby, hit on every session-using endpoint. Pulled
+ into C with the same RFC 6265 §5.2 semantics: opaque values (no URL
+ decoding), empty values valid, missing-`=` pairs skipped, last-wins on
+ repeated names. Whitespace trimmed around each pair and around each
+ key, as Ruby's `.strip` would.
+ 
+ - **New `Hyperion::CParser.parse_cookie_header(str) -> Hash`.** Single
+   byte loop in C; returns a fresh (mutable, unfrozen) Ruby Hash so
+   middlewares can extend it. Long values (> 4 KiB session payloads,
+   signed JWT cookies) are passed through unmolested.
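+ 
+ The locked semantics, end to end:
+ 
+ ```ruby
+ Hyperion::CParser.parse_cookie_header('a=1; b=; c; a=2')
+ # => { "a" => "2", "b" => "" }
+ #    empty value kept, "c" (no '=') skipped, last-wins on the repeated "a"
+ ```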
+ 
+ ### New specs (+27 examples)
+ 
+ - `spec/hyperion/parser_build_env_spec.rb` (11) — request-line keys,
+   HTTP_* mapping, CONTENT_TYPE/CONTENT_LENGTH promotion, identity
+   preservation across all 30 pre-interned header keys, off-table
+   fallback, no-headers tolerance, byte-for-byte parity with the Ruby
+   fallback, last-wins on duplicate names, return-value identity.
+ - `spec/hyperion/parser_cookie_split_spec.rb` (16) — single + multi
+   cookies, whitespace tolerance, trailing semicolon, empty value,
+   last-wins, missing-`=` skip, empty input, "=" inside value, no URL
+   decoding, mutable return Hash, 4 KiB long-value, malformed-pair
+   isolation.
+ 
+ ### Notes
+ 
+ - `Hyperion::Parser` (pure-Ruby fallback) is unchanged. The
+   `Adapter::Rack#build_env` Ruby branch still reads from
+   `request.headers` and builds env exactly as in 1.7.0; the C path is
+   opt-in via the lazy `c_build_env_available?` probe.
+ - Spec count 432 → 475 (+43 across Phase 2 and Phase 3). The 432
+   1.7.0 specs are unchanged.
+ 
+ ## [1.7.0] - 2026-04-29
+ 
+ **Spec count 325 → 432 (+107)** across three parallel streams: +86 RFC additive items, +8 Phase 1 (sendfile fast path), +13 Phase 5 (chunked-write coalescing).
+ 
+ The first wave of additive items ships from `docs/RFC_2_0_DESIGN.md`, plus Phase 1 (sendfile fast path) and Phase 5 (chunked-write coalescing) of the perf roadmap. All changes are backwards-compatible: every 1.6.3 spec passes without modification, every flat DSL form still works without warns (deprecation lands in 1.8.0), every 1.6.x test stub seam (`allow(Hyperion).to receive(:metrics)`, `Hyperion.instance_variable_set(:@metrics, …)`) keeps working.
+ 
+ **Headline numbers:**
+ - **Phase 1 sendfile** — closes the Puma static-file rps gap on a 1 MiB asset: 2,392 r/s vs Puma 2,074 (**+15% rps over Puma, ~20× lower p99**).
+ - **Phase 5 chunked coalescing** — 1000×50 B SSE workload: 1001 → 11 syscalls (**~91× syscall reduction**).
+ 
+ ### Phase 1 — sendfile fast path (close Puma rps gap on static files)
+ 
+ Per `docs/BENCH_2026_04_27.md` the static 1 MiB row left ~8% rps on the
+ table vs Puma (Hyperion `-t 5 -w 1` 1,919 r/s vs Puma `-t 5:5` 2,074 r/s)
+ even though Hyperion already won the p99 race 13× over. Root cause: the
+ Phase-0 static path went through `IO.copy_stream`, which on macOS /
+ non-Linux hosts and on TLS sockets falls back to a per-chunk userspace
+ read+write loop — each chunk takes the connection's writer mutex and a
+ fiber yield. We close that gap with a native zero-copy primitive that
+ bypasses the chunk-fiber pipeline entirely.
+ 
+ - **New C unit `ext/hyperion_http/sendfile.c`.** Defines
+   `Hyperion::Http::Sendfile.copy(out_io, in_io, offset, len) -> [bytes, status]`
+   alongside `Sendfile.supported?` / `Sendfile.platform_tag`. Picks the
+   best kernel call at compile time:
+   - **Linux:** `sendfile(2)` with `off_t*` for in-place cursor advance.
+     `splice(2)`-through-pipe support is wired behind the same surface for
+     a follow-up if a host's `sendfile` returns `:unsupported`.
+   - **Darwin / FreeBSD / NetBSD / DragonFly:** native `sendfile(2)` with
+     the BSD signature (offset by value, sent-bytes by `*len` on Darwin
+     and `*sbytes` on the BSDs).
+   - **Other:** raises `NotImplementedError`; the Ruby caller drops to the
+     userspace fallback automatically.
+ 
+   GVL discipline: every kernel call runs under
+   `rb_thread_call_without_gvl` so siblings can run while the socket
+   drains. `EAGAIN` / `EWOULDBLOCK` / `EINTR` return `:eagain` (no
+   busy-spin in C); the Ruby caller yields to the fiber scheduler before
+   retrying. `ENOSYS` / `EINVAL` / `EOPNOTSUPP` return `:unsupported` so
+   the userspace fallback kicks in mid-stream without losing the
+   position cursor.
+ 
+ - **`Hyperion::Http::Sendfile` Ruby façade
+   (`lib/hyperion/http/sendfile.rb`).** Wraps the C primitive with the
+   three behaviours that don't belong in C (see the drain-loop sketch
+   after this section):
+   - Loops on `:partial` short writes.
+   - Yields on `:eagain` via `IO#wait_writable` (Async-aware) or
+     `IO.select` when no scheduler is active.
+   - Detects TLS sockets (`OpenSSL::SSL::SSLSocket`) and routes them to
+     a 64 KiB `IO.copy_stream` userspace loop — kernel TLS is rarely
+     available, but bypassing the per-chunk WriterContext-style mutex
+     hop still wins a measurable margin over Phase 0. The same fallback
+     fires on hosts where the C ext didn't compile (`Sendfile.supported?`
+     is false), so the contract is portable.
+ 
+ - **`ResponseWriter#write_sendfile` rewired.** Same trigger condition
+   (body responds to `#to_path`) and same metrics (`:sendfile_responses` /
+   `:tls_zerobuf_responses`); the actual byte-pump now goes through
+   `Hyperion::Http::Sendfile.copy_to_socket` so the fast path is
+   available on Darwin (it previously fell back to a userspace copy
+   inside `IO.copy_stream`) and on hosts where Ruby's stdlib
+   `IO.copy_stream` picks a slower path.
+ 
+ - **Specs (`spec/hyperion/http_sendfile_spec.rb`).** Round-trips
+   through a real `TCPSocket` pair — 1 MiB byte-perfect, 1-byte boundary,
+   0-byte short-circuit, mid-file Range slice, a simulated `EAGAIN` that
+   asserts the loop yields exactly once and resumes from the right
+   cursor, and a connection-closed-mid-transfer scenario that surfaces
+   `Errno::*` rather than crashing. The existing
+   `spec/hyperion/response_writer_sendfile_spec.rb` keeps passing —
+   `ResponseWriter#write_sendfile`'s public surface is unchanged.
+ 
+ - **Bench result (macOS arm64, `-t 5 -w 1`, `wrk -t4 -c100 -d20s`,
+   bench/static.ru on a 1 MiB asset):**
+ 
+   | Configuration | r/s | p99 |
+   |---|---:|---:|
+   | Puma 7.2 `-t 5:5` (Phase 0 reference) | 2,074 | 55 ms |
+   | Hyperion 1.7.0-pre Phase 0 (`IO.copy_stream` only) | 1,919 | 4.22 ms |
+   | **Hyperion 1.7.0 Phase 1 (Sendfile.copy)** | **2,203 – 2,392** | **2.76 – 2.90 ms** |
+ 
+   **+15–25% rps over Phase 0, +6–15% rps over Puma, ~20× lower p99 than
+   Puma.** Closes (and reverses) the 8% rps gap. Linux numbers will land
+   in the pre-tag bench sweep (task #112) — Linux's `sendfile(2)` on
+   plain TCP is the same syscall `IO.copy_stream` already picks under
+   the hood, so the Linux delta should be smaller than the Darwin delta;
+   the win there comes from skipping the Ruby-level chunk loop and the
+   ResponseWriter mutex hop (the WriterContext analogue) rather than from
+   trading userspace for kernel zero-copy.
+ 
+ - **Forward-compat note for Phase 5 (chunked-write coalescing).**
+   `Sendfile.copy_to_socket` shares no state with the upcoming
+   `WriterContext` chunk batcher — the static-file path bypasses it
+   entirely, so Phase 5's per-chunk coalescing still composes cleanly:
+   small chunked / Transfer-Encoding bodies get the new io_buffer
+   batching, large `to_path` bodies keep the kernel zero-copy. h2
+   sendfile is deliberately out of scope (writer-fiber single-writer
+   invariant + per-stream window updates make it a 2.0 RFC item).
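+ 
+ The façade's drain loop, reduced to its shape (a sketch; only the
+ `Sendfile.copy` signature and the `:eagain` / `:unsupported` statuses
+ are confirmed above, and `copy_userspace` is an illustrative name):
+ 
+ ```ruby
+ require 'io/wait'
+ 
+ def copy_to_socket(sock, file, offset, remaining)
+   while remaining > 0
+     sent, status = Hyperion::Http::Sendfile.copy(sock, file, offset, remaining)
+     offset += sent
+     remaining -= sent
+     case status
+     when :eagain      then sock.wait_writable # fiber-scheduler aware
+     when :unsupported then return copy_userspace(sock, file, offset, remaining)
+     end
+   end
+ end
+ ```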
+ 
+ ### Phase 5 — chunked-write coalescing (cut SSE / streaming-JSON syscalls 90×+)
+ 
+ Streaming workloads (SSE event feeds, streaming JSON, log-tail
+ responses) push tiny payloads — a typical SSE event is ~50 B.
+ Pre-Phase-5, the only multi-write path was `body.each` inside the
+ buffered `Content-Length` writer, which accumulated everything in
+ userspace before emitting one syscall: real streams couldn't push
+ bytes downstream until the body finished, so SSE was structurally
+ broken. Phase 5 adds an opt-in `Transfer-Encoding: chunked`
+ streaming path with per-response coalescing so each tiny `body.each`
+ chunk doesn't translate to its own syscall.
+ 
+ - **`ResponseWriter#write` opt-in branch.** App sets
+   `Transfer-Encoding: chunked` on the response → ResponseWriter
+   takes the streaming path: emit head, iterate `body.each`, frame
+   each chunk per RFC 7230 §4.1, drain into the socket through a
+   per-response coalescing buffer, append the `0\r\n\r\n` terminator
+   on close (see the app-side sketch after this list). Apps that
+   don't opt in keep the 1.6.x single-syscall Content-Length path
+   verbatim (one new spec locks this — a 100×50 B non-chunked body
+   still emits exactly 1 syscall).
+ 
+ - **`ResponseWriter::ChunkedCoalescer` per-response buffer.**
+   - Chunks `< 512 B` accumulate in a 4 KiB-capacity ASCII-8BIT
+     `String.new(capacity: 4096)` (matches the existing build-head
+     buffer style).
+   - Buffer drains on `>= 4096 B` filled (mid-stream flush) or on
+     end-of-body (the `0\r\n\r\n` terminator is concatenated into the
+     buffer's final drain so peers never see a half-flushed
+     response between drain + terminator syscalls).
+   - Chunks `>= 512 B` drain the buffer first (preserves byte order
+     on the wire) then write directly — no point coalescing a
+     payload already past the threshold.
+   - Best-effort 1 ms tick: `Process.clock_gettime` check on each
+     chunk arrival; if the buffer has been sitting >= 1 ms since
+     the last drain, flush. Under Async this gives natural
+     flushes between sparse chunks; not a real timer fiber per
+     response (the bookkeeping would cost more than the syscall
+     savings on a short-lived coalescer).
+   - `body.flush` / `:__hyperion_flush__` sentinel honoured —
+     SSE servers use this to push events past the coalescing
+     latency on demand.
+ 
+ - **`build_head_chunked`.** Mirrors `build_head_ruby` but emits
+   `transfer-encoding: chunked` and explicitly drops any
+   app-supplied `content-length` (mutually exclusive per RFC 7230
+   §3.3.3). Pure Ruby — the C builder still always emits
+   content-length and isn't on this branch (low-volume opt-in
+   path; no measurable win from a C builder here).
+ 
+ - **h2 path deliberately untouched.** `Http2Handler::WriterContext`
+   already coalesces at the kernel send-buffer boundary: every
+   encoder fiber enqueues onto the per-connection `Thread::Queue`
+   and a single writer fiber drains it onto the socket. Multiple
+   small DATA frames buffered onto the queue between writer-fiber
+   resumptions get drained back-to-back — TCP's Nagle-style send
+   coalescing (default ON without `TCP_NODELAY`) folds them into
+   the same on-wire packet. Adding a userspace coalescer on top
+   of the queue would interact awkwardly with per-stream window
+   updates and the writer-fiber single-writer invariant; deferred
+   to 2.0 (Rust h2 codec rewrite, RFC §6).
+ 
+ - **Sendfile path (Phase 1) bypasses the coalescer entirely.**
+   Bodies that respond to `#to_path` still take the
+   `IO.copy_stream` / native-sendfile branch. The file IS the
+   body buffer — there are no userspace chunks to coalesce. The
+   Phase 1 spec sweep (`response_writer_sendfile_spec.rb`,
+   `http_sendfile_spec.rb`) stays green.
+ 
+ - **New metrics.** `:chunked_responses` (count of streamed
+   responses), `:chunked_total_writes` (total `socket.write` calls
+   on the chunked path), `:chunked_coalesced_writes` (subset that
+   drained the coalescing buffer rather than passing a large chunk
+   straight through). Operators get
+   `chunked_total_writes / chunked_responses` as the
+   syscall-per-response gauge.
+ 
+ - **Specs (`spec/hyperion/chunked_coalescing_spec.rb`, +13
+   examples).** Lock the syscall-count properties: 100×50 B
+   yields 1 head + 2 body writes (was 1 + 100 = 101 without
+   coalescing — **~33× reduction**); the 1000×50 B SSE bench yields 1
+   head + 10 body writes (vs 1 + 1000 = 1001 without coalescing —
+   **~91× reduction**); a 10 KiB single chunk = 1 head + 1 body + 1
+   terminator = 3 syscalls; mixed 50/600/50 = 4 syscalls (head +
+   buffer drain + 600 direct + final drain); the body.close edge
+   asserts the terminator + buffered payload land in the same
+   syscall; the flush sentinel forces an extra mid-stream drain;
+   an Async-driven 5 ms quiet period asserts the 1 ms tick fires
+   before chunk #2 arrives; the non-chunked Content-Length path stays
+   at exactly 1 syscall (no regression).
+ 
+ - **Bench (`bench/sse.ru`).** New Rack app: streams 1000 SSE
+   events of ~50 B each over a single keep-alive connection,
+   yielding the flush sentinel every 50 events. `wrk -t1 -c1
+   -d10s` measures sustained event throughput; pair with `strace
+   -c -e write` (Linux) or `dtruss -c -t write` (macOS) for the
+   syscall headline. Bench-host numbers land in the pre-tag bench
+   sweep (task #112).
+ 
+ - **Spec sweep delta:** 419 → 432 (+13). All 1.6.x and
+   Phase 1 specs untouched.
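+ 
+ The opt-in from the app side, mirroring the `bench/sse.ru` shape (a
+ sketch; the sentinel-yield convention is assumed from the coalescer
+ notes above):
+ 
+ ```ruby
+ run lambda { |_env|
+   body = Enumerator.new do |y|
+     1000.times do |i|
+       y << "data: event #{i}\n\n"                # ~50 B, accumulates in the 4 KiB buffer
+       y << :__hyperion_flush__ if (i % 50).zero? # force a drain every 50 events
+     end
+   end
+   [200,
+    { 'content-type' => 'text/event-stream',
+      'transfer-encoding' => 'chunked' }, # the opt-in trigger
+    body]
+ }
+ ```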
+ 
+ ### Added
+ - **`Hyperion::Runtime` constructor injection (RFC A3).** New `Hyperion::Runtime` class holds `metrics`, `logger`, `clock`. `Runtime.default` is the process-wide singleton (lazy, mutable, NOT frozen — RFC §5 Q4). Module-level `Hyperion.metrics` / `Hyperion.logger` (and their `=` setters) keep working — they delegate to `Runtime.default`. New `runtime:` kwarg on `Hyperion::Server`, `Hyperion::Connection`, `Hyperion::Http2Handler` (default nil → fall through to module accessors so 1.6.x specs pass; non-nil → that runtime is used exclusively, no implicit fallback to module overrides). Multi-tenant deployments can now give each `Server` its own metrics sink.
+ - **`Hyperion::DispatchMode` value object (RFC A2).** Internal value object replacing the 4-flag / 5-output `if/elsif` matrix in `Server#dispatch`. 5 modes: `:tls_h2`, `:tls_h1_inline`, `:async_io_h1_inline`, `:threadpool_h1`, `:inline_h1_no_pool`. Frozen, equality-by-name, predicates `#inline?` / `#threadpool?` / `#h2?` / `#async?` / `#pooled?`. Resolved per dispatch via `DispatchMode.resolve(tls:, async_io:, thread_count:, alpn:)`. See the resolve sketch after this list.
+ - **Per-mode dispatch counters (RFC A2 + §3 1.7.0 dual-emit).** New keys: `:requests_dispatch_tls_h2`, `:requests_dispatch_tls_h1_inline`, `:requests_dispatch_async_io_h1_inline`, `:requests_dispatch_threadpool_h1`, `:requests_dispatch_inline_h1_no_pool`. **Operators on Grafana dashboards from the 1.x line: the legacy `:requests_async_dispatched` and `:requests_threadpool_dispatched` keys keep emitting in 1.7 + 1.8 (dual-emit) and are removed in 2.0. Migrate dashboards to the per-mode keys before 2.0.**
+ - **Nested DSL blocks (RFC A4).** `h2 do |h| ... end`, `admin do ... end`, `worker_health do ... end`, `logging do ... end` — both bareword (`max_concurrent_streams 256`) and explicit-arg (`|h| h.max_concurrent_streams 256`) forms supported. New `Config::H2Settings`, `AdminConfig`, `WorkerHealthConfig`, `LoggingConfig` subclasses; `Config#h2`, `#admin`, `#worker_health`, `#logging` readers. Flat 1.6.x setters (`h2_max_concurrent_streams 256`, `admin_token "x"`, `worker_max_rss_mb 1024`, `log_level :info`, `log_format :json`, `log_requests false`) remain functional with no warns in 1.7. Deprecation warns land in 1.8.0; removal in 2.0. `BlockProxy` inherits from `BasicObject` so `format :json` inside a `logging do` block doesn't collide with `Kernel#format`.
+ - **`accept_fibers_per_worker` config (RFC A6).** Default 1. When > 1 and the accept loop is async-wrapped, spawn N accept fibers that each `IO.select` on the same listening fd. Linear scaling on `:reuseport` (Linux); on Darwin (`:share` mode) the knob is honoured silently with no scaling benefit (RFC §5 Q5). Documented in the README "Operator guidance" section.
+ - **`h2.max_total_streams` admission gate (RFC A7).** New `Hyperion::H2Admission` value object — process-wide stream cap shared across all `Http2Handler` instances within a worker. Default `nil` (no cap, current behaviour). When set, streams beyond the cap get `RST_STREAM REFUSED_STREAM` (RFC 7540 §11) and bump `:h2_streams_refused`. Default flips to `h2.max_concurrent_streams × workers × 4` in 2.0 (RFC §3 1.x-vs-2.0 split).
+ - **`admin.listener_port` sibling listener (RFC A8).** New `Hyperion::AdminListener` — single-thread TCP server that handles only `/-/quit` and `/-/metrics` on a separate port. Default `nil` keeps admin mounted in-app via `AdminMiddleware`. When set, the sibling listener spawns alongside the application listener regardless of `:share` vs `:reuseport` worker model. Defence-in-depth for the bearer-token leak vector documented in RFC A8 (logging middlewares that dump request headers); runs alongside, not instead of, the in-app middleware.
+ - **`async_io` strict validation (RFC A9).** `Server#initialize` raises `ArgumentError` on non-tri-state values (`1`, `:yes`, `'true'`, etc.) — pre-1.7 they silently landed in the wrong matrix cell. New `Hyperion.validate_async_io_loaded_libs!` is wired into the CLI bootstrap: `async_io: true` raises if no fiber-cooperative library (`hyperion-async-pg` / `async-redis` / `async-http`) is loaded; `async_io: false` warns (does not raise) if a fiber-IO library is loaded but unused; `async_io: nil` keeps the 1.6.1 soft-warn shape.
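+ 
+ A resolve-matrix sketch for `DispatchMode` (the kwargs are confirmed
+ above; the `name` reader is assumed from "equality-by-name"):
+ 
+ ```ruby
+ mode = Hyperion::DispatchMode.resolve(tls: true, async_io: false,
+                                       thread_count: 5, alpn: 'h2')
+ mode.name # => :tls_h2 (ALPN-negotiated h2 over TLS)
+ mode.h2?  # => true
+ ```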
+
4535
+ ### New specs (86 examples)
4536
+ - `spec/hyperion/runtime_spec.rb` — singleton lifecycle, mutable default, default? predicate, custom-runtime isolation, legacy override seam.
4537
+ - `spec/hyperion/dispatch_mode_spec.rb` — resolve matrix, predicate semantics, frozen value-object equality, `metric_key` shape.
4538
+ - `spec/hyperion/h2_admission_spec.rb` — admit/release/cap exhaustion + concurrent-thread safety.
4539
+ - `spec/hyperion/nested_dsl_spec.rb` — h2 / admin / worker_health / logging block forms (bareword + explicit-arg), flat-name forwarders, no-warn assertion, BlockProxy bareword behaviour.
+ - `spec/hyperion/runtime_kwarg_spec.rb` — `runtime:` kwarg plumbing on Connection / Server / Http2Handler, isolation from module-level overrides when explicit.
+ - `spec/hyperion/accept_fibers_spec.rb` — default 1, clamp on zero/negative, N-fiber spawn count under start_async_loop, Darwin no-error path.
+ - `spec/hyperion/admin_listener_spec.rb` — live HTTP integration (token, /-/metrics, 404), Server's `maybe_start_admin_listener` gating on port + token presence.
+ - `spec/hyperion/async_io_strict_spec.rb` — Server constructor raise paths, validate_async_io_loaded_libs! tri-state.
+ - `spec/hyperion/per_mode_counters_spec.rb` — dual-emit assertions per mode, end-to-end live HTTP request that bumps both new and legacy keys.
+
+ ### Changed
+ - `Hyperion::Master.build_h2_settings` reads from the nested `Config#h2` object directly (flat-name forwarders still work via `Config#h2_max_concurrent_streams`).
+ - `Hyperion.metrics` / `Hyperion.logger` are now thin delegators to `Hyperion::Runtime.default` — preserves the public 1.6.x surface; new code paths go through `Runtime`. The delegation shape is sketched below.
+
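+ The delegation shape, sketched. The `Runtime` internals shown (memoized default, plain accessors, writable seam) are assumptions, not the gem's actual source:
+
+ ```ruby
+ module Hyperion
+   class Runtime
+     attr_accessor :metrics, :logger
+
+     class << self
+       attr_writer :default   # the mutable-default seam the specs exercise
+     end
+
+     def self.default
+       @default ||= new
+     end
+   end
+
+   # Public 1.6.x surface preserved as thin delegators:
+   def self.metrics
+     Runtime.default.metrics
+   end
+
+   def self.logger
+     Runtime.default.logger
+   end
+ end
+ ```
+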
+ ### Deprecation roadmap
+ - **1.7.0 (this release):** all of the above ship as additive opt-ins. No deprecation warns. Old metric keys keep emitting alongside the new per-mode keys.
+ - **1.8.0:** boot-time deprecation warns on flat DSL setters, `Hyperion.metrics=` / `Hyperion.logger=` setters, and per-call `Hyperion.metrics` reads from internal code paths.
+ - **2.0.0:** flat DSL setters removed; `Hyperion.metrics=` / `Hyperion.logger=` setters removed; legacy `:requests_async_dispatched` / `:requests_threadpool_dispatched` keys retired; `h2.max_total_streams` default flips to `h2.max_concurrent_streams × workers × 4`.
+
+ ## [1.6.3] - 2026-04-29
+
+ Five audit-driven hotfixes flagged by the post-1.6.0 audits; spec count 290 → 325 (+35 across the wave). No RFC items here — those land in 1.7.0+. See [RFC_2_0_DESIGN.md](docs/RFC_2_0_DESIGN.md) for the larger architectural roadmap. Minimal sketches of each fix's shape follow the list below.
+
+ ### Fixed
+ - **C1 — FiberLocal shim compat with the 1.4.x `thread_variable_*` fixes.** The audit flagged that the shim re-introduced a regression already fixed in 1.4.x. `lib/hyperion/fiber_local.rb` now uses the correct `Fiber[k]=` write API and falls back to `Thread.current.thread_variable_*` when no fiber scheduler is active; `lib/hyperion/cli.rb` gates the FiberLocal path on `async_io`.
+ - **A1 — Lifecycle hooks fire in correct order on `:share` worker model.** `before_fork` / `on_worker_boot` / `on_worker_shutdown` ran post-bind on `:share` and out-of-order vs `:reuseport`. Hooks now fire pre-bind on both worker models — `lib/hyperion/master.rb` (pre-bind sequencing) + `lib/hyperion/worker.rb` (boot/shutdown ordering) + `lib/hyperion/cli.rb` (hook plumbing).
+ - **S1 — `AdminMiddleware#signal_target` ppid mistarget under PID 1 / containerd.** When Hyperion ran as PID 1 in a container, signals were sent to the container init instead of the Hyperion master. New `HYPERION_MASTER_PID` env var + `Hyperion.master_pid` ivar set in `lib/hyperion.rb` and exported from `lib/hyperion/master.rb` / `lib/hyperion/cli.rb`; `lib/hyperion/admin_middleware.rb` consults it before falling back to `Process.ppid`.
+ - **S2 — Cap `Content-Length` parsing at `max_body_bytes + 1`.** The parser previously accepted any non-negative integer. `lib/hyperion/connection.rb` now rejects abusive `Content-Length` values with a 413 *before* reading any body — the cap is enforced at header parse time, not after the read.
+ - **C2 — `WriterContext` short-circuit on empty-body responses (204/304/HEAD).** `lib/hyperion/http2_handler.rb` skips the `encode_mutex` hop and the writer-fiber enqueue when there is nothing to write, eliminating per-response mutex acquisitions on empty-body status codes.
+
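+ The C1 fallback shape, sketched. Assumes Ruby 3.2+ for the `Fiber[]` storage API; this mirrors the behaviour described above, not the gem's source:
+
+ ```ruby
+ module FiberLocalSketch
+   # Fiber storage when a scheduler is active, thread variables otherwise.
+   def self.[]=(key, value)
+     if Fiber.scheduler
+       Fiber[key] = value   # fiber-storage write API (symbol keys)
+     else
+       Thread.current.thread_variable_set(key, value)
+     end
+   end
+
+   def self.[](key)
+     Fiber.scheduler ? Fiber[key] : Thread.current.thread_variable_get(key)
+   end
+ end
+ ```
+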
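+ The A1 hook ordering as a usage sketch; the `configure` shape and hook bodies are hypothetical, while the hook names and sequencing are from the entry:
+
+ ```ruby
+ Hyperion.configure do
+   before_fork        { disconnect_db! }    # master, pre-bind on both worker models
+   on_worker_boot     { connect_db! }       # each worker, before it starts accepting
+   on_worker_shutdown { flush_telemetry! }  # each worker, during drain
+ end
+ # Post-A1 the sequence is identical on :share and :reuseport.
+ ```
+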
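+ The S1 resolution order, sketched (module shape assumed; the env var, ivar, and fallback are from the entry):
+
+ ```ruby
+ module SignalTargetSketch
+   # Seeded once at boot: the master exports HYPERION_MASTER_PID to workers.
+   def self.master_pid
+     @master_pid ||= ENV["HYPERION_MASTER_PID"]&.to_i
+   end
+
+   # Process.ppid is wrong when Hyperion itself runs as PID 1 under a
+   # container init, so the exported PID wins when present.
+   def self.signal_target
+     pid = master_pid
+     pid && pid.positive? ? pid : Process.ppid
+   end
+ end
+ ```
+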
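+ The S2 parse-time cap, sketched (constant and error names are invented; the real cap comes from `max_body_bytes`):
+
+ ```ruby
+ MAX_BODY_BYTES  = 1 << 20                    # stand-in for max_body_bytes
+ PayloadTooLarge = Class.new(StandardError)   # surfaced to the client as a 413
+
+ def parse_content_length!(raw)
+   raise PayloadTooLarge unless raw.match?(/\A\d+\z/)
+
+   # Enforced while parsing headers; no body bytes have been read yet.
+   length = raw.to_i
+   raise PayloadTooLarge if length > MAX_BODY_BYTES
+   length
+ end
+ ```
+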
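+ And the C2 predicate shape, sketched (names assumed; illustrative only):
+
+ ```ruby
+ EMPTY_BODY_STATUSES = [204, 304].freeze
+
+ def empty_body_response?(status, body, head_request)
+   head_request || EMPTY_BODY_STATUSES.include?(status) || body.nil? || body.empty?
+ end
+
+ # When true, the handler skips the encode_mutex hop and the writer-fiber
+ # enqueue entirely, since there is nothing to write on the body side.
+ ```
+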
  ## [1.6.2] - 2026-04-27
 
  Doc release. No code changes.