hyperion-rb 2.12.0 → 2.13.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
checksums.yaml CHANGED
@@ -1,7 +1,7 @@
  ---
  SHA256:
- metadata.gz: ae613d6efdf92072a2076a95a3656687e6bc0b5a9302700c56d75bcc0a900b56
- data.tar.gz: f4f3f26cff7f4ebd08d35757726219ee53e65418bacdd0061d05626d7b354dba
+ metadata.gz: 4a2ebbc1658c8f98f631a1c8e7d8059248995d0d7e89905cc796385583d4739b
+ data.tar.gz: '0910002789df50d8de0b387a9125ef6f84b4e7d4a740e579f6d43cd4c1a04d16'
  SHA512:
- metadata.gz: 5e0cbc3aa81bbe246dbf670642cae5fd2e1f0275f36030729bb68fe23b00dd54bbc51fb5f2e62c0a306fd825f73ab35e86cc49dd3bb6a27f38269885067de62c
- data.tar.gz: f749e0b6e05c52695bc4698d847b92e88c10a17dc21a2d6d93deb9c9a013c5c712f778dfa13aa6dab6f5dd0ebf398c69a5eb56255031aaf9539ca561a38bca6d
+ metadata.gz: 34d825dd492712e8a5f555854097205535fc779c57418035104c3f2a0e51ec4a1dcdd93de67b5bf736cbb1642b0845cd8af61d89a3b82a06a581194e9a7d3df6
+ data.tar.gz: 4bce0232541f6f49ad51826bcf49cbfa791fd9df0b876ff3baf3952629a668da45b006600d08adf1a77480d62588e4ec5c52bfb126982d75063ec6241f63c989
data/CHANGELOG.md CHANGED
@@ -1,5 +1,518 @@
  # Changelog

+ ## 2.13.0 — 2026-05-01
+
+ ### 2.13-E — io_uring soak signal + default-ON decision
+
+ **Background.** 2.12-D shipped the io_uring accept loop on Linux 5.x+
+ (opt-in via `HYPERION_IO_URING_ACCEPT=1`, bench delta:
+ 15,685 → 134,084 r/s on `handle_static` hello — 8.6× over the 2.12-C
+ accept4 fallback, 7× over Agoo's 19,024). The CHANGELOG explicitly
+ deferred the default-ON decision to 2.13: "default off until 2.13
+ production soak". 2.13-E is that soak harness — operators run the
+ soak themselves in their own staging.
+
15
+ **What 2.13-E ships.**
16
+
17
+ 1. **`bench/io_uring_soak.sh`** — bash-only soak harness.
18
+ - Boots Hyperion with `HYPERION_IO_URING_ACCEPT=1 -w 1 -t 32`
19
+ against `bench/hello_static.ru` (the 2.12-D fast path) on a
20
+ fixed port. `setsid nohup … & disown` so the master survives
21
+ SSH disconnect over a 24h run.
22
+ - 30s warm-up, then `wrk -t4 -c100 -d24h --latency` in the
23
+ foreground. `SOAK_DURATION` is operator-tunable (defaults to
24
+ `24h`; the bench-window proof-of-concept was `30m`).
25
+ - In parallel, every `SAMPLE_INTERVAL` (default 60s), samples
26
+ `/proc/$PID/status` (VmRSS, VmSize, Threads), `/proc/$PID/fd`
27
+ count, scrapes `hyperion_requests_dispatch_total` from
28
+ `/-/metrics`, and bucket-derives p50/p99 from
29
+ `hyperion_request_duration_seconds_bucket`. Appends one CSV
30
+ row per sample to `/tmp/io_uring_soak_<tag>_<ts>.csv`.
31
+ - On exit (24h elapsed OR Ctrl-C), prints summary:
32
+ min/max/mean/stddev RSS, fd_count peak, p99 stddev/mean, plus
33
+ wrk's HDR-precision p50/p99/p999 from `--latency`.
34
+ - **Verdict**:
35
+ - PASS if RSS variance < 10%, fd peak ≤ `WRK_CONNS + 50`, and
36
+ (when histogram has ≥ 3 distinct bucket values across the
37
+ window) p99 stddev/mean < 20%. Eligible for default-flip.
38
+ - SOAK FAIL on any breach. Defer the flip; the failed metric
39
+ is documented in the verdict notes.
40
+ - The histogram p99 check is bypassed when there are < 3
41
+ distinct bucket values across the soak window — Hyperion's
42
+ 7-edge histogram (1 ms, 5 ms, 25 ms, …) quantizes a stable
43
+ hello-world p99 into bucket-boundary jumps, and the wrk
44
+ `--latency` p99 is the right tail-truth source there.
45
+ - `IO_URING=0` runs the same harness against the 2.12-C accept4
46
+ fallback so the operator can diff io_uring vs accept4 CSVs
47
+ apples-to-apples.
48
+
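The verdict rules above can be sketched in Ruby. This is an illustration of the pass/fail arithmetic only — the shipped harness is bash, and the function and field names here are assumptions, not the script's actual identifiers:

```ruby
# Illustrative sketch of the soak verdict rules: RSS coefficient of
# variation < 10%, fd peak within WRK_CONNS + 50, and (only when the
# histogram produced >= 3 distinct p99 bucket values) p99 stddev/mean
# < 20%. Not the shipped bash implementation.
def soak_verdict(rss_samples, fd_peak, p99_samples, wrk_conns:)
  mean   = ->(xs) { xs.sum.to_f / xs.size }
  stddev = ->(xs) {
    m = mean.call(xs)
    Math.sqrt(xs.sum { |x| (x - m)**2 } / xs.size)
  }

  failures = []
  rss_cv = stddev.call(rss_samples) / mean.call(rss_samples)
  failures << "rss_variance=#{(rss_cv * 100).round(1)}%" if rss_cv >= 0.10
  failures << "fd_peak=#{fd_peak}" if fd_peak > wrk_conns + 50

  # Bypassed when bucket quantization leaves < 3 distinct p99 values;
  # wrk's --latency p99 is the tail-truth source in that case.
  if p99_samples.uniq.size >= 3
    p99_cv = stddev.call(p99_samples) / mean.call(p99_samples)
    failures << "p99_cv=#{(p99_cv * 100).round(1)}%" if p99_cv >= 0.20
  end

  failures.empty? ? "PASS" : "SOAK FAIL (#{failures.join(', ')})"
end
```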
+ 2. **`spec/hyperion/io_uring_soak_smoke_spec.rb`** — durable CI
+    coverage. A 1000-request mini-soak over the io_uring loop with
+    a 200-request warm-up; asserts RSS delta < 20 MB,
+    fd_count back to baseline ± 5, threads back to baseline ± 4.
+    Skipped on macOS / non-liburing builds via the same
+    `Hyperion::Http::PageCache.io_uring_loop_compiled?` predicate the
+    2.12-D wire-shape spec uses. Lives in its own file so a 2.13-E
+    leak-signal regression is diagnosable without re-reading the
+    2.12-D wire-shape spec.
+
+    *Calibration note*: the 2.13-E ticket header proposed a 5 MB
+    delta bound. Bench-host measurements (3-run baseline, IO_URING=1,
+    1000 sequential GETs) put the delta at 7-9 MB — but the
+    dominant allocator is the test process itself
+    (`::TCPSocket.new`, `Timeout` threads, response Strings), not the
+    Hyperion server. The 20 MB threshold catches a real Hyperion
+    leak (1 KB/req of leakage = +1 MB at the assertion site) without
+    false-positiving on the test driver's own arena cost.
+
+ 3. **30-minute proof-of-concept soak** — see the companion
+    `[bench]` commit for the harness-vs-harness numbers across
+    IO_URING=1 / IO_URING=0 and the explicit default-ON decision.
+
+ **Constraints respected.** No regression in spec count. macOS-host
+ suite: 1124/0/14 → 1126/0/16 (+2 examples, +2 pending — the soak
+ smoke is pending on macOS, the documentation example is active). Linux
+ bench-host suite: 1124/0/14 → 1126/0/15 (+2 examples, +1 pending —
+ the soak smoke runs, the documentation example is pending). The
+ 2.10-G TCP_NODELAY hunk, 2.10-E preload hooks, 2.10-F
+ `rb_pc_serve_request`, 2.11-A dispatch pool warmup, 2.11-B cglue HPACK
+ default, 2.12-C accept4 loop, 2.12-D io_uring loop, 2.12-E per-worker
+ counter, 2.12-F gRPC unary trailers, 2.13-A metric shards, 2.13-B
+ response head builder, 2.13-C flake fixes, 2.13-D gRPC streaming —
+ all on master and untouched.
+
+ ### 2.13-D — gRPC streaming RPCs + ghz vs Falcon bench
+
+ **Background.** 2.12-F shipped gRPC unary on h2 — trailers
+ (`grpc-status` / `grpc-message` final HEADERS frame), `te: trailers`
+ handling, and h2 request half-close semantics. The three remaining
+ gRPC call shapes — server-streaming, client-streaming, and
+ bidirectional — were not yet wired. 2.13-D closes that gap.
+
+ **What was missing.** `dispatch_stream` gated on
+ `RequestStream#request_complete` (i.e., END_STREAM on the request),
+ which is correct for unary but blocks both streaming-input shapes:
+ the app cannot read DATA frames until END_STREAM has already arrived.
+ Likewise the response path materialised the full body into a single
+ String before splitting it across DATA frames, which folded
+ multi-message server-streaming responses into one logical write
+ (verified: a 5-message body produced one DATA frame, not five).
+
+ **What 2.13-D ships.**
+
+ 1. **Server-streaming.** When the response body responds to
+    `:trailers`, `dispatch_stream` now iterates `body#each` lazily and
+    emits one DATA frame per yielded chunk (no inter-chunk coalescing),
+    followed by the trailer HEADERS frame carrying END_STREAM=1. A
+    single oversize chunk still gets max-frame-size split inside the
+    per-chunk send path, but small messages stay one DATA frame each.
+    Plain HTTP/2 traffic (no `:trailers` method on the body) keeps the
+    pre-2.13-D buffered shape — no behaviour change for non-gRPC apps.
+
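The branch described in point 1 can be sketched as follows. The `frames` array and `[:data, …]` tuples stand in for the real h2 frame writer and are illustrative only, not Hyperion's API:

```ruby
# Sketch of the 2.13-D body-emit decision: a body that responds to
# :trailers is treated as a gRPC-style streaming body and emitted one
# DATA frame per yielded chunk; anything else keeps the buffered
# pre-2.13-D shape. Frame tuples here are illustrative placeholders.
def write_response_body(body, frames)
  if body.respond_to?(:trailers)
    # One DATA frame per chunk, no inter-chunk coalescing...
    body.each { |chunk| frames << [:data, chunk] }
    # ...then the trailer HEADERS frame carrying END_STREAM=1.
    frames << [:headers, body.trailers, :end_stream]
  else
    # Plain HTTP/2: buffer the whole body into one logical write.
    buffered = +""
    body.each { |chunk| buffered << chunk }
    frames << [:data, buffered, :end_stream]
  end
  frames
end
```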
+ 2. **Streaming-input dispatch.** A new
+    `Hyperion::Http2Handler::StreamingInput` IO-shaped queue replaces
+    the buffered `@request_body` String for requests that look like
+    gRPC: `content-type: application/grpc*` AND `te: trailers` on a
+    POST. When promoted, `process_data` pushes each DATA frame's bytes
+    into the queue (and the END_STREAM frame closes the writer), and
+    the serve-loop dispatches the app on HEADERS arrival via a new
+    `RequestStream#dispatchable?` predicate. The Rack adapter detects
+    the non-String request body and sets `env['rack.input']` directly
+    to the queue (no StringIO wrap, so streaming-read semantics are
+    preserved). Reads block the calling fiber on `Async::Notification`
+    until either bytes arrive or the writer closes.
+
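The queue semantics in point 2 can be sketched in plain Ruby. The shipped class parks the calling *fiber* on `Async::Notification`; this illustration uses `Mutex` + `ConditionVariable` so it runs standalone, and the class name is hypothetical:

```ruby
# Minimal sketch of the StreamingInput semantics: DATA frames are
# written in, END_STREAM closes the writer, and rack.input-style reads
# block until bytes arrive or the writer closes. Thread-based stand-in
# for the fiber-blocking implementation described above.
class StreamingInputSketch
  def initialize
    @chunks, @closed = [], false
    @mutex, @cond = Mutex.new, ConditionVariable.new
  end

  # Called once per DATA frame by the frame-processing side.
  def write(bytes)
    @mutex.synchronize { @chunks << bytes.dup; @cond.signal }
  end

  # Called when the END_STREAM frame closes the writer.
  def close_write
    @mutex.synchronize { @closed = true; @cond.broadcast }
  end

  # Rack input protocol shape: read() drains to EOF ("" at EOF);
  # read(length) returns up to length bytes, nil at EOF.
  def read(length = nil)
    @mutex.synchronize do
      loop do
        if length.nil? && @closed
          return @chunks.slice!(0..).join
        elsif length && !@chunks.empty?
          buf = @chunks.join
          @chunks.replace(buf.bytesize > length ? [buf.byteslice(length..)] : [])
          return buf.byteslice(0, length)
        elsif @closed
          return nil
        end
        @cond.wait(@mutex)   # park until a write or close_write
      end
    end
  end
end
```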
+ 3. **Bidirectional.** Falls out for free once 1 + 2 are in place —
+    each h2 stream already runs on its own fiber, so concurrent
+    read+write on the same stream is supported by the Async scheduler.
+
+ **Tests.** `spec/hyperion/grpc_streaming_spec.rb` (5 examples):
+ server-streaming wire shape (5 yielded chunks → 5 DATA frames + trailer
+ HEADERS, END_STREAM ride placement asserted), client-streaming
+ (5 spaced DATA frames decoded by the app via `rack.input.read`),
+ bidirectional (5 round-trips with strict ordering), and 2 unit specs
+ on `StreamingInput` (blocking reads + EOF handling, partial-chunk
+ slicing). All run end-to-end via `Protocol::HTTP2::Client` over real
+ TLS — same harness shape as the 2.12-F unary specs.
+
+ **Constraints respected.** No regression in the 1176/0/15 baseline:
+ post-2.13-D suite is 1181/0/15 (5 new streaming examples + 0
+ regressions). The pre-existing
+ `http2_empty_body_short_circuit_spec`'s `FakeStream` test double
+ needed a `respond_to?(:streaming_input)` guard at the dispatch
+ read-site — added defensively (no protocol change). The 2.10-G
+ TCP_NODELAY hunk, 2.10-E preload hooks, 2.10-F `rb_pc_serve_request`,
+ 2.11-A dispatch pool warmup, 2.11-B cglue HPACK default, 2.12-C accept4
+ loop, 2.12-D io_uring loop, 2.12-E per-worker counter, 2.12-F gRPC
+ unary trailers, 2.13-A metric shards, 2.13-B response head builder
+ C-rewrite, 2.13-C flake fixes are all on master and untouched by
+ this change.
+
+ **Bench.** See the companion `[bench]` commit for ghz numbers and
+ the documented limits of the Hyperion-vs-Falcon comparison.
+
+ ### 2.13-C — Spec flake hunt
+
+ Two flakes carried over from the 2.11/2.12 release cuts. Both are
+ spec-side hermeticity issues — neither indicates a regression in
+ `lib/` or `ext/`.
+
+ **Flake 1 — `spec/hyperion/tls_ktls_spec.rb`, macOS, seed-dependent.**
+ The two examples in the `Linux-only: kTLS engages with default :auto policy`
+ describe block previously gated their bodies on
+ `unless Hyperion::TLS.ktls_supported?`. That probe consults
+ `Etc.uname[:sysname]` and `OpenSSL::OPENSSL_VERSION_NUMBER`. The same
+ `Etc.uname` is stubbed elsewhere in the suite (`io_uring_spec`)
+ to drive the io_uring platform matrix; under a particular full-
+ suite seed those stubs and this spec's `before { reset_ktls_probe! }`
+ overlapped in a way that let the probe report `true` on the actual
+ Darwin host. The body then ran the kTLS-supported branch and failed.
+
+ *Fix shape:* hard `RUBY_PLATFORM.include?('linux')` guard at the
+ example-body top BEFORE the existing probe-based guard. Runtime
+ platform is unstubbable from another spec and matches the
+ example title's intent. Linux runs unaffected (the second guard
+ still protects Linux + old-OpenSSL hosts).
+
+ *Verification:* 10/10 green on macOS arm64; 1/1 green on the
+ Linux x86_64 bench (openclaw-vm, kernel 6.8); full suite holds
+ the 1175 baseline on macOS / 1118 on bench.
+
+ **Flake 2 — `spec/hyperion/connection_loop_spec.rb:79`, Linux, deterministic.**
+ Misdiagnosed previously as "port-9292-busy". The actual root cause
+ is Linux ≥ 5.x's `close()`-doesn't-wake-other-thread-`accept(2)`
+ behaviour: when the spec calls `listener.close` from one thread,
+ the C accept loop parked in `accept(2)` on that fd in another thread
+ stays blocked until the next connection arrives. The `stop_accept_loop`
+ flag is checked BETWEEN accepts, not while parked, so flipping it
+ without a wake is a no-op. `thread.join(5)` then exhausts its
+ timeout and `result` (assigned inside the thread block) is still
+ `nil`, breaking the `expect(result).to be_a(Integer)` assertion.
+ Other examples in the file had the same teardown shape but happened
+ to assert on side-effects populated BEFORE the join, so they passed
+ despite the same 5 s thread leak — runtime was 46 s for 10 examples.
+
+ *Fix shape:* extract a `stop_loop_and_wake(listener, thread)`
+ helper that flips the stop flag, dials one throwaway TCP connection
+ at the listener so the parked `accept(2)` returns, then closes the
+ listener and joins. Replace the `stop_accept_loop` + `listener.close`
+ + `thread.join(5)` pattern at every callsite (8 in-file plus the
+ Server-level engagement example, which goes through `server.stop`).
+ Add a regression block — "teardown is hermetic across repeated
+ bring-ups" — that runs the bring-up + serve + teardown cycle 3
+ times in one process and asserts each teardown is < 1 s.
+
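The fix shape above can be sketched as a self-contained helper. Argument names mirror the spec helper described in the text, but this is an illustration, not the spec's actual code (the stop flag is modelled as a shared Hash here):

```ruby
# Sketch of stop_loop_and_wake: flipping the flag alone is a no-op
# while the loop is parked in accept(2), so dial one throwaway
# connection to make accept return, THEN close the listener and join.
require "socket"

def stop_loop_and_wake(listener, thread, stop_flag)
  stop_flag[:stop] = true                     # checked BETWEEN accepts
  port = listener.addr[1]
  begin
    TCPSocket.new("127.0.0.1", port).close    # wake the parked accept(2)
  rescue SystemCallError
    # listener already gone — nothing to wake
  end
  listener.close
  thread.join(5)
end
```

A minimal accept loop driven through this helper tears down in well under a second instead of burning the full 5 s join timeout.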
+ *Verification:* 0/10 failures on the bench (was 5/5 deterministic
+ failure pre-fix); spec runtime 46 s → 1.3 s; macOS 11/11 green;
+ full suite 1119 on bench / 1176 on macOS (regression spec adds 1).
+
+ *Out of scope:* the same wake-shape affects `Hyperion::Server#stop`
+ in production — `close()` on the listener fd from the signal-
+ handling thread won't reliably wake the worker's parked accept.
+ Flagging this as a follow-up rather than fixing in 2.13-C scope:
+ briefing was explicit ("Don't touch lib/ext code unless the flake
+ is a real bug there"), and the production failure mode (worker
+ hangs on shutdown) is operationally distinct from the spec flake
+ (test-suite stalls).
+
+ ### 2.13-B — CPU JSON gap
+
+ **Background.** The 2.12-B re-bench surfaced one row that got *worse*
+ relative to Agoo across the 2.10/2.11/2.12 streams: `bench/work.ru`
+ (50-key JSON serialised per-request, no `handle_static` because the
+ response varies per request). 2.10-B had Hyperion 3,450 / Agoo 6,374
+ (1.85× behind); the 2.12-B re-bench had Hyperion 3,659 / Agoo 7,489
+ (2.05× behind). Hyperion +6.0%, Agoo +17.5% over the same window —
+ the gap *widened*. None of the 2.10/2.11/2.12 work touched this row,
+ so it was the obvious 2.13 follow-on.
+
+ **Profile.** `perf record -F 199 -g` on the worker pid while
+ `wrk -t4 -c100 -d15s` ran (CPU-JSON workload, default config
+ `-t 5 -w 1` = bench harness canonical):
+
+ ```
+ 13.15% vm_exec_core (Ruby VM dispatch)
+ 13.01% _raw_spin_unlock_irqrestore (kernel; ~6% inside TCP write softirq)
+  4.56% raw_generate_json_string (JSON.generate — app's own work)
+  2.28% generate_json_general (ditto)
+  1.75% vm_call_cfunc_with_frame
+  1.34% rb_class_of
+  1.24% json_object_i (ditto)
+  1.11% rb_vm_opt_getconstant_path
+  1.01% BSD_vfprintf (sprintf — content-length builder + JSON floats)
+  0.94% generate_json_float (ditto)
+  0.70% hash_foreach_call (header iteration)
+  0.57% llhttp__internal__run (Hyperion C ext request parser)
+ ```
+
+ The honest read: ~10 % of CPU is `JSON.generate` (the app's per-request
+ work — not removable from inside Hyperion); ~13 % is kernel TCP write
+ softirq (one `write(2)` per response — already minimal); ~13 % is the
+ Ruby VM dispatch loop, of which the adapter + writer Ruby path is a
+ fraction. **The dominant tail is GVL serialisation under `-t 5 -w 1`**
+ — a concurrency sweep shows c=1 → 5,800 r/s, c=5 → 3,563 r/s, c=100 →
+ 3,665 r/s. Hyperion's throughput SCALES DOWN with concurrency
+ because every request holds the GVL through `app.call` + `JSON.generate`
+ (Ruby code, no I/O wait). Agoo (pure-C HTTP core) scales UP with
+ concurrency: c=1 → 4,384 r/s, c=5 → 6,182 r/s, c=100 → 6,519 r/s. That structural
+ gap cannot close inside Hyperion — `app.call(env)` IS Ruby. What 2.13-B
+ *can* do is shrink Hyperion's GVL hold time per request so the ratio
+ of "GVL held by Hyperion" to "GVL held by app.call" drops, leaving
+ more room for the worker pool to interleave.
+
+ ### 2.13-B — CPU savings in `cbuild_response_head`
+
+ The C-side response-head builder (called once per Rack response) had
+ five removable per-request costs:
+
+ 1. **Status line `snprintf`.** Every request ran
+    `snprintf("HTTP/1.1 %d ", status)` + `rb_str_cat(reason)` +
+    `rb_str_cat("\r\n", 2)` to build "HTTP/1.1 200 OK\r\n". The 23
+    status codes in `ResponseWriter::REASONS` are a fixed set with
+    fixed reason phrases — the entire status line is a constant per
+    `(status, reason)` pair. The 2.13-B builder switches on `status`
+    and emits the pre-baked line in ONE `rb_str_cat` when the reason
+    phrase matches; falls back to the snprintf path for unknown
+    statuses or operator-overridden reason phrases.
+
+ 2. **Header iteration via `rb_funcall(:keys)`.** The legacy iterator
+    called `rb_funcall(rb_headers, :keys, 0)` to materialise a fresh
+    keys Array per request, then `rb_ary_entry(keys, i)` +
+    `rb_hash_aref(rb_headers, key)` per header. The 2.13-B builder
+    uses `rb_hash_foreach`, which walks the hash table directly with
+    no intermediate Array allocation and no per-key hash lookup.
+
+ 3. **Per-key `String#downcase` allocation.** Header keys are nearly
+    always frozen-literal Strings in Rack apps (`'content-type'`,
+    `'cache-control'`, …) — same `VALUE` every request. The legacy
+    builder ran `rb_funcall(:downcase)` per key per call, allocating
+    a fresh lowercase String + crossing the FFI boundary. 2.13-B
+    keys an `st_table` on the input String's identity and stores
+    the cached lowercase form + the pre-built `"<lc>: "` prefix
+    buffer; the second-and-later requests for the same frozen-literal
+    key get one st-table hit. Capped at 64 entries — a misbehaving
+    app emitting `x-trace-<uuid>` per request can't grow the cache
+    without bound, just falls back to the per-call downcase.
+
+ 4. **Per-(key, value) full-line cache.** When BOTH the key AND the
+    value are frozen-literal Strings (`'cache-control' => 'no-store'`
+    in `bench/work.ru`, `'content-type' => 'application/json'`),
+    the entire wire line `"<lc-key>: <value>\r\n"` is identical
+    every request. 2.13-B caches the prebuilt line keyed on
+    `(key.object_id, value.object_id)`; on hit the entire emit is
+    ONE `rb_str_cat`. Capped at 256 entries with the same fall-back
+    semantics. The CRLF-injection guard always re-runs (the cache
+    stores only validated lines; new (k, v) pairs go through the
+    full validator before the line populates).
+
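The identity-keyed full-line cache in point 4 can be sketched in Ruby. The shipped builder is C (an `st_table` keyed on object ids); this mirrors the cap and fall-back semantics only, and the class name is hypothetical:

```ruby
# Ruby sketch of the per-(key, value) full-line cache: hit = return
# the validated prebuilt line (one concatenation at the callsite);
# miss = validate, build, and cache unless the 256-entry cap is hit.
class HeaderLineCacheSketch
  CAP = 256

  def initialize
    @lines = {}   # [key.object_id, value.object_id] => validated wire line
  end

  def emit(key, value)
    id = [key.object_id, value.object_id]
    cached = @lines[id]
    return cached if cached

    # CRLF-injection guard runs before anything is cached.
    if key.include?("\r") || key.include?("\n") ||
       value.include?("\r") || value.include?("\n")
      raise ArgumentError, "CRLF in header"
    end

    line = "#{key.downcase}: #{value}\r\n"
    @lines[id] = line if @lines.size < CAP   # past the cap: just fall back
    line
  end
end
```

Frozen-literal keys and values keep the same `object_id` across requests, which is what makes the identity key a stable cache key.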
+ 5. **`itoa_positive_decimal` for content-length.**
+    `snprintf("content-length: %ld\r\n", body_size)` was 1 % of CPU
+    on the profile. `body_size` is always non-negative (bytesize of
+    a buffered body) so the sign branch + locale logic in `vfprintf`
+    are pure overhead. 2.13-B writes the digits backwards into a
+    24-byte stack scratch then `rb_str_cat`s the populated suffix —
+    no heap, no locale, no format-string parser.
+
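The digits-backwards idea in point 5 transliterates directly to Ruby (the real routine is C on a stack buffer; this is an illustration of the algorithm, not the extension's code):

```ruby
# Sketch of itoa_positive_decimal: write digits backwards into a
# fixed 24-byte scratch, then take the populated suffix — no sign
# handling, no locale, no format-string parsing.
def itoa_positive_decimal(n)
  return "0" if n.zero?
  scratch = "\0" * 24            # 24 bytes covers any 64-bit magnitude
  pos = 24
  while n > 0
    pos -= 1
    scratch.setbyte(pos, 48 + (n % 10))  # 48 == '0'.ord
    n /= 10
  end
  scratch.byteslice(pos, 24 - pos)
end
```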
+ **Bench impact.** Same bench host (openclaw-vm, Linux 6.8.0, Ruby
+ 3.3.3, loopback), each version compiled fresh from source on the
+ host before its run.
+
+ | Workload | Baseline (master adac63e) | 2.13-B | Δ |
+ |---|---:|---:|---|
+ | Single-thread synthetic (`Adapter::Rack.call → ResponseWriter#write` against a sink, 50,000 iters; 3-trial median r/s) | 18,018 | **19,404** | **+7.7%** |
+ | Multi-thread loopback `wrk -t4 -c100 -d20s --latency`, two batches of 3-trial median r/s | 3,427; 3,550 | 3,440; 3,528 | **−0.1%** |
+ | Multi-thread loopback p99 latency | 2.77ms; 2.64ms | 2.74ms; 2.67ms | tied |
+
+ The +7.7 % single-thread win is the per-request CPU savings inside
+ `cbuild_response_head`. The neutral multi-thread result is the
+ GVL-contention floor: at `-t 5 -w 1` Hyperion's worker threads
+ serialise on the GVL while running `JSON.generate` + `app.call`,
+ so shaving 2-3 µs off Hyperion's slice of the hot path leaves the
+ total throughput dominated by `JSON.generate` (~10 % CPU per the
+ profile) and the kernel TCP write softirq (~6 %). For comparison,
+ on the same bench host, same Ruby, Hyperion with `-w 4` SO_REUSEPORT
+ on `bench/work.ru`: **14,200 r/s** — 2× over Agoo's single-process
+ 7,489 r/s baseline.
+
+ **Honest assessment of the residual gap.** The 2.05× gap to Agoo
+ on the canonical `-t 5 -w 1` row is a GVL-architecture gap, not a
+ per-request CPU gap. Agoo's pure-C HTTP core lets 5 worker threads
+ truly run in parallel; Hyperion's adapter + writer + `app.call`
+ hold the GVL together because every step except the `read(2)` /
+ `write(2)` syscalls is Ruby. Closing this row to ≥ Agoo would
+ require either (a) running `-w 4` SO_REUSEPORT (the 2.12-E
+ cluster work — Hyperion DOES exceed Agoo by 2× there), or (b) a
+ 2.14+ track that moves more of the per-request lifecycle into C
+ (e.g. running `cbuild_response_head` from the C accept loop with
+ the writer fully C-side). 2.13-B closes Hyperion's portion of the
+ GVL hold; the rest is structural.
+
+ ### 2.13-A — Extend C-side wins to generic Rack apps
+
+ **Background.** The 2.12 sprint shipped huge wins on the
+ `Server.handle_static`-routed traffic shape: 5,502 r/s → 134,084 r/s on
+ the static-route `hello` workload (24× over 2.11.0; 7× over Agoo). But
+ those wins are gated on the C accept-loop's `route_table.lookup`
+ returning a `RouteTable::StaticEntry`. Generic Rack apps — the vast
+ majority of real-world deployments (Rails, Sinatra, Roda, Hanami,
+ anything calling `body.each` to yield response chunks) — never engage
+ the C loop; they go through `Hyperion::Adapter::Rack` + the Ruby
+ accept loop + the thread pool. The 2.12-B re-bench confirmed a generic
+ Rack `bench/hello.ru` ran at 4,477 r/s — 4.25× behind Agoo, and 30×
+ behind the C-loop static path on the same machine. Most of the 2.12
+ wins were not available to operators running real apps.
+
+ **What 2.13-A targets.** Optimizations that *do* port to the generic
+ Rack dispatch path without breaking semantics. Per-request we don't
+ get to skip `app.call(env)` (that IS the dispatch) and we can't
+ prebuild the response body (it's dynamic), but we can attack:
+ syscall coalescing on accept+read, env hash + rack.input recycling,
+ metrics-mutex contention under multi-thread workloads, and the
+ keepalive-fast-path tail.
+
+ ### 2.13-A — Per-thread shard for hot-path metrics
+
+ Pre-2.13-A, every `observe_histogram` and `increment_labeled_counter`
+ took `@hg_mutex.synchronize`. The original commit comment claimed
+ those paths were "low-rate", but that's no longer true:
+
+ * `tick_worker_request` is called once per dispatched request
+   (every `Connection#serve` iteration, every h2 stream, every
+   handed-off connection from the C loop).
+ * `observe_histogram` is called once per dispatched request via
+   the per-route request-duration histogram registered in
+   `Connection#register_request_duration_histogram!`.
+
+ Under `-t 32` that single mutex serialised 32 worker threads on the
+ request-completion tail — every `+= 1` waited behind the previous
+ thread's release. That contention was invisible on the C accept loop
+ (the loop bypasses Ruby metrics entirely and folds in its atomic
+ counter at scrape time), but it was the dominant tail-latency term
+ on the generic Rack workload.
+
+ The new path keeps per-thread shards (`Thread#thread_variable_set`,
+ true thread-local — NOT fiber-local, matching the unlabeled counter
+ convention from 2.0.0) for both `@histograms` and `@labeled_counters`.
+ Observations and increments hit the per-thread shard with zero
+ contention; `histogram_snapshot` and `labeled_counter_snapshot` merge
+ across shards under the mutex (a low-rate operation — once per
+ `/-/metrics` scrape).
+
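The shard scheme above can be sketched as follows — hot-path increments hit a thread-local Hash with no lock, and only the scrape-time merge takes the mutex. Class and variable names are illustrative, not Hyperion's (and a real implementation would key the thread variable per registry instance):

```ruby
# Sketch of the per-thread shard scheme: increment() is contention-free
# on the calling thread's own shard; snapshot() merges all shards under
# the mutex, which only runs once per /-/metrics scrape.
class ShardedCounterSketch
  def initialize
    @mutex  = Mutex.new
    @shards = []   # one Hash per worker thread, registered lazily
  end

  def increment(name, labels)
    shard = Thread.current.thread_variable_get(:counter_shard)
    unless shard
      shard = Hash.new(0)
      Thread.current.thread_variable_set(:counter_shard, shard)
      @mutex.synchronize { @shards << shard }   # registration is rare
    end
    shard[[name, labels]] += 1                  # zero-contention hot path
  end

  # Low-rate: merge every thread's shard into one view under the mutex.
  def snapshot
    @mutex.synchronize do
      @shards.each_with_object(Hash.new(0)) do |shard, out|
        shard.each { |key, count| out[key] += count }
      end
    end
  end
end
```

`thread_variable_get` is true thread-local storage (not fiber-local `Thread#[]`), matching the convention the text describes.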
+ **Public API stays identical.** `observe_histogram`, `register_histogram`,
+ `set_gauge`, `histogram_snapshot`, `labeled_counter_snapshot`,
+ `increment_labeled_counter` keep the same signatures and semantics.
+ Registered-but-never-observed families still surface in the snapshot
+ (pre-2.13-A behaviour). `reset!` now also clears the per-thread
+ shards so cross-spec leakage is prevented.
+
+ **Edge cases covered by spec:**
+
+ * Multi-thread observe-then-snapshot: 8 threads × 1000 observations
+   on the same `(name, labels)` produce `count == 8000` and
+   cumulative bucket counts that match.
+ * 16 threads × 500 increments × distinct label values produce 16
+   series with count 500 each.
+ * Concurrent observe + 50 mid-run snapshots run without deadlock
+   or torn counts.
+ * Reset clears across threads.
+ * Unregistered observe is a silent no-op.
+ * Registered-but-never-observed families show up in the scrape
+   with an empty `:series` Hash.
+
+ ### 2.13-A — Cached worker-id label tuple + Rack-3 keepalive fast path
+
+ Two micro-optimizations on the per-request hot path of
+ `Connection#serve`, both targeting the steady-state -c1 single-
+ keepalive profile (where 8000 r/s = the upper bound of single-thread
+ Ruby work and every saved allocation / iteration shows up).
+
+ **Cached worker-id label tuple.** `tick_worker_request(@worker_id)`
+ went through a wrapper that called `worker_id.to_s` (worker_id is
+ already a String) and built a fresh `[label]` Array per request. The
+ wrapper also re-checked `@worker_request_family_registered` on every
+ call. The new path pre-builds the frozen `[@worker_id]` tuple once in
+ the Connection constructor, registers the family once at construction
+ too, and the request loop calls `increment_labeled_counter` directly
+ with the cached tuple — saving one Array allocation + one method
+ dispatch + one early-return-checked branch per request.
+
+ **Rack-3 keepalive fast path.** `should_keep_alive?` used to scan the
+ entire response-headers Hash with `headers.find { |k,_| k.to_s.downcase
+ == 'connection' }`. That ran `to_s.downcase` (one transient String
+ allocation) PER iteration and walked to completion on every response
+ that didn't carry a `Connection` header (which is most of them — the
+ response writer adds its own). Rack 3 mandates lowercase Hash keys
+ (spec §6.4), so the new path is a single `headers['connection']`
+ lookup. Apps that violate Rack 3 by returning mixed-case keys lose
+ the Connection-close response signal and stay on keep-alive — a
+ benign degradation pinned by spec; the fix is to update the app to
+ spec. Non-Hash header containers (legacy Array-of-pairs) still flow
+ through a slow-scan fallback, also case-sensitive on the lowercase
+ key.
+
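The fast path above reduces to a few lines. This is the shape of the change, not Hyperion's actual method (the real predicate lives on `Connection`):

```ruby
# Sketch of the Rack-3 keepalive fast path: Hash headers get one
# lowercase lookup (Rack 3 mandates lowercase keys); legacy
# Array-of-pairs containers fall back to a scan, also case-sensitive
# on the lowercase key. Mixed-case keys fall through to keep-alive.
def should_keep_alive?(headers)
  connection =
    if headers.is_a?(Hash)
      headers["connection"]                           # O(1), zero allocations
    else
      pair = headers.find { |k, _| k == "connection" } # legacy fallback
      pair && pair[1]
    end
  connection.to_s.downcase != "close"
end
```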
+ **Spec coverage:**
+
+ * Lowercase `connection: close` from app closes the connection.
+ * Lowercase `connection: keep-alive` keeps the conn alive across
+   pipelined requests.
+ * Mixed-case `Connection: close` (Rack-3 violation) is documented
+   as falling through to keep-alive — pinned so the behaviour is
+   stable.
+ * `@worker_id_label_tuple` is constructed once, frozen, and reused
+   by identity across requests on the same Connection.
+
+ ### 2.13-A — Bench result
+
+ Bench host: openclaw-vm (Linux 6.8.0, x86_64). Hyperion `-t 32 -w 1`
+ on `bench/hello.ru` (generic Rack hello). 5-trial median, wrk 4.x.
+
+ **With access logging on (default — JSON access lines per request):**
+
+ | Workload | Master | 2.13-A | Δ |
+ |------------|-------:|-------:|---:|
+ | -c1 -t1 | 7,631 | 7,386 | -3.2% (within noise) |
+ | -c100 -t4 | 4,004 | 4,031 | +0.7% (within noise) |
+
+ **With `--no-log-requests` (logging disabled):**
+
+ | Workload | Master | 2.13-A | Δ |
+ |------------|-------:|-------:|---:|
+ | -c1 -t1 | 9,028 | 8,938 | -1.0% (within noise) |
+ | -c100 -t4 | 4,804 | 4,979 | **+3.6%** |
+
+ **Honest framing.** The +3.6% at `-c100 --no-log-requests` is the
+ clearest signal that the metrics-mutex contention removal lands. At
+ `-c1` the workload is single-thread CPU-bound (one wrk thread, one
+ keepalive connection, one Hyperion worker thread serving it); the
+ optimisations don't help and don't hurt. At `-c100` the
+ optimisations reach a reasonable +3.6% on the no-log path; with
+ logging on, the access-log path dominates the per-request CPU
+ budget so the metrics savings are masked.
+
+ **What the bench reveals.** The 4.25× gap to Agoo on the generic
+ Rack workload that 2.12-B identified is not a metrics-contention
+ problem and not an env-pool problem (env pooling was already in
+ place since 1.6.x). It is a **single-thread Ruby work per request**
+ ceiling — at `-c1` the bench tops out at ~9,000 r/s = 110 µs/req
+ of single-thread Ruby work, which Agoo (Rack-shape C server) does
+ in ~52 µs/req. Closing that gap requires moving meaningful chunks
+ of the per-request Ruby surface (parser, env build, headers, log
+ line) into C — work that's already 50% complete in
+ `Hyperion::CParser` (`build_env`, `build_response_head`,
+ `build_access_line`). The 2.13 sprint will continue moving the
+ remaining Ruby-side pieces (`Connection#serve` request loop,
+ ResponseWriter dispatch, `should_keep_alive?`) into C in subsequent
+ phases.
+
+ **Durable infrastructure.** The 2.13-A optimisations stay in the
+ tree even though the headline-bench delta is small. The
+ per-thread metrics shard removes a real mutex bottleneck that
+ *will* compound under future workloads — Ractor-based dispatch,
+ multi-process scrapers, observability-heavy apps that observe
+ custom histograms per-request. The Rack-3 keepalive fast path and
+ the cached worker-id tuple are pure-quality changes (less code,
+ fewer allocations, no API surface change).
+
  ## 2.12.0 — 2026-05-01

  ### 2.12-F — gRPC support on h2
data/README.md CHANGED
@@ -11,6 +11,55 @@ gem install hyperion-rb
  bundle exec hyperion config.ru
  ```

+ ## What's new in 2.13.0
+
+ **Hardening sprint: profile-driven CPU work, durable infrastructure,
+ and gRPC streaming.** 2.13 follows up the 2.12 perf jump with the
+ work that wasn't structural enough to need its own major release —
+ each stream is small, but they add up:
+ - **2.13-A — Generic Rack hot-path wins.** Per-thread shards for
+   hot-path metrics (no more cross-worker mutex on `observe_histogram`),
+   cached `worker_id` label tuple, and a Rack-3 keepalive fast-path.
+   Generic Rack hello bench: **+3.6% on `-c100` no-log**. Honest
+   finding: env-pool + body-coalesce already shipped; the deeper
+   generic-Rack gap to Agoo is a single-thread Ruby ceiling, and
+   closing it needs moving `app.call` into the C accept loop — a
+   2.14 lift.
28
+ - **2.13-B — Response head builder rewritten in C.** Pre-baked
29
+ status-line table, `rb_hash_foreach` replacing `rb_funcall(:keys)`,
30
+ per-key downcase + per-(key, value) full-line caches, custom `itoa`
31
+ replacing `snprintf`. **+7.7% single-thread synthetic; multi-thread
32
+ neutral (GVL-bound).** Profile confirms Hyperion's own C-ext code is
33
+ **<1%** of wall-clock; the rest is libruby + JSON gem.
34
+ - **2.13-C — Spec flake hunt.** Two long-standing flakes fixed:
35
+ `tls_ktls_spec` macOS skip leak (unconditional `RUBY_PLATFORM`
36
+ guard), and `connection_loop_spec:79` Linux port-bind flake (root
37
+ cause: Linux `close()` doesn't wake a parked `accept(2)` — fixed
38
+ with a `stop_loop_and_wake` helper). 5/5 → 0/10 failure rate;
39
+ spec-suite runtime 46 s → 1.3 s.
40
+ - **2.13-D — gRPC streaming RPCs.** Server-streaming, client-streaming,
41
+ and bidirectional RPCs on top of 2.12-F's unary trailers foundation.
42
+ New `bench/grpc_stream.{proto,ru}` + `grpc_stream_bench.sh` ghz
43
+ harness for operator-side comparison vs Falcon's `async-grpc`.
44
+ - **2.13-E — io_uring soak harness + CI smoke.** New
45
+ `bench/io_uring_soak.sh` runs a 24h soak against the 2.12-D
46
+ io_uring loop with `/proc/$PID` + `/-/metrics` sampling, emits a
47
+ CSV + verdict (PASS / SOAK FAIL / borderline). New
48
+ `spec/hyperion/io_uring_soak_smoke_spec.rb` runs a 1000-request
49
+ mini-soak in CI to catch leak regressions before any 24h run.
50
+ **`HYPERION_IO_URING_ACCEPT` stays opt-in for 2.13** — operators
51
+ with their own staging environments can now collect signal; the
52
+ default-flip decision moves to 2.14.
53
+
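The 2.13-B per-(key, value) full-line cache can be illustrated in Ruby, though the shipped builder is C. The `HEADER_LINE_CACHE` constant and `header_block` helper below are hypothetical, shown only to make the caching idea concrete: a repeated header pair resolves to one memoised `"key: value\r\n"` string instead of re-concatenating per response.

```ruby
# Hypothetical Ruby rendering of the 2.13-B per-(key, value) full-line
# cache; the real implementation lives in the C extension. The default
# block builds and memoises the full header line on first miss.
HEADER_LINE_CACHE = Hash.new do |cache, (key, value)|
  cache[[key, value]] = "#{key}: #{value}\r\n".freeze
end

# Joins cached lines into a response head block.
def header_block(headers)
  headers.map { |kv| HEADER_LINE_CACHE[kv] }.join
end
```

After warmup, the hot path does one hash lookup per header pair and zero string allocations for repeated pairs.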
+ Plus all previous wins are preserved and verified by the 1183-spec
+ suite (2.10-G TCP_NODELAY at accept, 2.10-E preload hooks, 2.10-F
+ C-ext fast-path response writer, 2.11-A dispatch pool warmup, 2.11-B
+ cglue HPACK default, 2.12-C accept4 connection loop, 2.12-D io_uring
+ loop, 2.12-E per-worker request counter, 2.12-F gRPC unary trailers).
+
+ Full per-stream details, bench tables, and follow-up items in
+ [`CHANGELOG.md`](CHANGELOG.md).
+
  ## What's new in 2.12.0
 
  **The hot path moves into C — and gRPC ships.** The headline win:
@@ -244,8 +293,77 @@ What Hyperion handles for you: ALPN negotiation, HTTP/2 framing, HPACK,
  per-stream flow control, the trailer-frame emit, binary-clean
  `env['rack.input']` (gRPC bodies are non-UTF-8), and `te: trailers`
  preserved into `env['HTTP_TE']`. What you handle: protobuf
- marshalling and the `grpc-status` semantics. Streaming RPCs (server /
- client / bidi) are 2.13 candidates — pin to unary for now.
+ marshalling and the `grpc-status` semantics.
+
+ ### Streaming RPCs (2.13-D+)
+
+ All four gRPC call shapes work on Hyperion since 2.13-D — unary,
+ server-streaming, client-streaming, and bidirectional. The detection
+ trigger is the gRPC content-type plus `te: trailers`; any HTTP/2
+ request that carries both is dispatched to the Rack app on HEADERS
+ arrival (rather than after END_STREAM), and `env['rack.input']`
+ becomes a streaming IO that blocks reads until the next DATA frame
+ lands. Plain HTTP/2 traffic (without those headers) keeps the unary
+ buffered semantics — no behaviour change for non-gRPC clients.
+
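That detection rule can be written as a small predicate. This is illustrative only; `streaming_grpc?` is not Hyperion API, and the real check runs against the h2 HEADERS block inside the server:

```ruby
# Illustrative predicate for the dispatch rule described above:
# streaming-input dispatch only when the request is gRPC *and*
# advertises `te: trailers`; everything else stays unary-buffered.
def streaming_grpc?(headers)
  content_type = headers['content-type'].to_s
  te_values    = headers['te'].to_s.split(',').map(&:strip)
  content_type.start_with?('application/grpc') && te_values.include?('trailers')
end
```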
+ **Server-streaming.** Yield one gRPC-framed message per `each`
+ iteration; Hyperion writes each yield as its own DATA frame:
+
+ ```ruby
+ class StreamReply
+   def initialize(messages); @messages = messages; end
+   def each; @messages.each { |m| yield m }; end # one DATA frame each
+   def trailers; { 'grpc-status' => '0' }; end
+   def close; end
+ end
+
+ run ->(env) {
+   env['rack.input'].read # the unary request message
+   # messages: your app's enumerable of gRPC-framed reply payloads
+   [200, { 'content-type' => 'application/grpc' }, StreamReply.new(messages)]
+ }
+ ```
+
+ **Client-streaming.** Read messages off `env['rack.input']` as the peer
+ sends them. Reads block until a DATA frame arrives:
+
+ ```ruby
+ run ->(env) {
+   io = env['rack.input']
+   count = 0
+   # each message: 1-byte compressed flag + 4-byte big-endian length + payload
+   while (prefix = io.read(5)) && prefix.bytesize == 5
+     length = prefix.byteslice(1, 4).unpack1('N')
+     msg = io.read(length) # consume the payload
+     count += 1
+   end
+   [200, { 'content-type' => 'application/grpc' }, GrpcBody.new(reply_for(count))]
+ }
+ ```
+
+ **Bidirectional.** Interleave reads and writes. The response body is
+ iterated lazily, so you can read one request message, yield one reply,
+ read the next, yield the next:
+
+ ```ruby
+ class BidiReplies
+   def initialize(io); @io = io; end
+   def each
+     while (prefix = @io.read(5)) && prefix.bytesize == 5
+       len = prefix.byteslice(1, 4).unpack1('N')
+       msg = @io.read(len)
+       yield grpc_frame(handle(msg)) # sent immediately
+     end
+   end
+   def trailers; { 'grpc-status' => '0' }; end
+   def close; end
+ end
+
+ run ->(env) {
+   [200, { 'content-type' => 'application/grpc' }, BidiReplies.new(env['rack.input'])]
+ }
+ ```
+
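The `grpc_frame` helper assumed in the bidi example is not something Hyperion ships; you supply it. A minimal version of gRPC's length-prefixed message framing (the same 5-byte prefix the read loops above parse) could look like:

```ruby
# Sketch of the grpc_frame helper used above (hypothetical name, not
# Hyperion API): gRPC wraps every message in a 1-byte compressed flag
# followed by a 4-byte big-endian payload length.
def grpc_frame(payload, compressed: false)
  [compressed ? 1 : 0, payload.bytesize].pack('CN') + payload.b
end
```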
+ The streaming-input path runs each stream on its own fiber, so
+ concurrent read+write on the same stream is safe.
 
  ## Highlights