hyperion-rb 2.13.0 → 2.14.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
data/README.md CHANGED
@@ -1,707 +1,218 @@
1
1
  # Hyperion
2
2
 
3
- High-performance Ruby HTTP server. Falcon-class fiber concurrency, Puma-class compatibility.
3
+ High-performance Ruby HTTP server. Rack 3 + HTTP/2 + WebSockets + gRPC on a single binary.
4
4
 
5
5
  [![CI](https://github.com/andrew-woblavobla/hyperion/actions/workflows/ci.yml/badge.svg)](https://github.com/andrew-woblavobla/hyperion/actions/workflows/ci.yml)
6
6
  [![Gem Version](https://img.shields.io/gem/v/hyperion-rb.svg)](https://rubygems.org/gems/hyperion-rb)
7
7
  [![License: MIT](https://img.shields.io/github/license/andrew-woblavobla/hyperion.svg)](https://github.com/andrew-woblavobla/hyperion/blob/master/LICENSE)
8
8
 
9
+ Hyperion serves a hello-world Rack response at **134,084 r/s with a 1.14 ms p99**
10
+ on a single worker (Linux 6.x, io_uring accept loop, `Server.handle_static`),
11
+ **7×** Agoo's 19,024 r/s on the same hardware. Beyond the C-side fast path
12
+ it's a complete Rack 3 server: HTTP/1.1 + HTTP/2 with ALPN, WebSockets
13
+ (RFC 6455), gRPC unary + streaming on the Rack 3 trailers contract, native
14
+ fiber concurrency for PG-bound apps, and pre-fork cluster mode with
15
+ SO_REUSEPORT-balanced workers.
16
+
9
17
  ```sh
10
18
  gem install hyperion-rb
11
- bundle exec hyperion config.ru
12
- ```
13
-
14
- ## What's new in 2.13.0
15
-
16
- **Hardening sprint: profile-driven CPU work, durable infrastructure,
17
- and gRPC streaming.** 2.13 follows up the 2.12 perf jump with the
18
- work that wasn't structural enough to need its own major release —
19
- each stream is small, but together they add up:
20
-
21
- - **2.13-A — Generic Rack hot-path wins.** Per-thread shards for
22
- hot-path metrics (no more cross-worker mutex on `observe_histogram`),
23
- cached `worker_id` label tuple, and a Rack-3 keepalive fast-path.
24
- Generic Rack hello bench: **+3.6% on `-c100` no-log**. Honest
25
- finding: env-pool + body-coalesce already shipped; the deeper
26
- generic-Rack gap to Agoo is a single-thread Ruby ceiling; closing it
27
- means moving `app.call` into the C accept loop — a 2.14 lift.
28
- - **2.13-B — Response head builder rewritten in C.** Pre-baked
29
- status-line table, `rb_hash_foreach` replacing `rb_funcall(:keys)`,
30
- per-key downcase + per-(key, value) full-line caches, custom `itoa`
31
- replacing `snprintf`. **+7.7% single-thread synthetic; multi-thread
32
- neutral (GVL-bound).** Profile confirms Hyperion's own C-ext code is
33
- **<1%** of wall-clock; the rest is libruby + JSON gem.
34
- - **2.13-C — Spec flake hunt.** Two long-standing flakes fixed:
35
- `tls_ktls_spec` macOS skip leak (unconditional `RUBY_PLATFORM`
36
- guard), and `connection_loop_spec:79` Linux port-bind flake (root
37
- cause: Linux `close()` doesn't wake a parked `accept(2)` — fixed
38
- with a `stop_loop_and_wake` helper). 5/5 → 0/10 failure rate;
39
- spec-suite runtime 46 s → 1.3 s.
40
- - **2.13-D — gRPC streaming RPCs.** Server-streaming, client-streaming,
41
- and bidirectional RPCs on top of 2.12-F's unary trailers foundation.
42
- New `bench/grpc_stream.{proto,ru}` + `grpc_stream_bench.sh` ghz
43
- harness for operator-side comparison vs Falcon's `async-grpc`.
44
- - **2.13-E — io_uring soak harness + CI smoke.** New
45
- `bench/io_uring_soak.sh` runs a 24h soak against the 2.12-D
46
- io_uring loop with `/proc/$PID` + `/-/metrics` sampling, emits a
47
- CSV + verdict (PASS / SOAK FAIL / borderline). New
48
- `spec/hyperion/io_uring_soak_smoke_spec.rb` runs a 1000-request
49
- mini-soak in CI to catch leak regressions before any 24h run.
50
- **`HYPERION_IO_URING_ACCEPT` stays opt-in for 2.13** — operators
51
- with their own staging environments can now collect signal; the
52
- default-flip decision moves to 2.14.
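The 2.13-C root cause deserves a standalone illustration: on Linux, a thread parked in `accept(2)` is not woken when another thread calls `close(2)` on the listener, so a shutdown path has to wake it explicitly. A minimal pure-Ruby sketch of the idea — the real `stop_loop_and_wake` lives in Hyperion's specs; this standalone version is illustrative:

```ruby
require 'socket'

# Illustrative stand-in for the 2.13-C spec helper: set the stop flag first,
# then make a throwaway connection so the parked accept(2) returns and the
# loop can observe the flag.
def stop_loop_and_wake(server, state)
  state[:stop] = true
  begin
    TCPSocket.new('127.0.0.1', server.local_address.ip_port).close
  rescue SystemCallError
    # listener already gone; nothing left to wake
  end
end

server = TCPServer.new('127.0.0.1', 0)
state = { stop: false }
loop_thread = Thread.new do
  until state[:stop]
    conn = server.accept   # parks here between connections
    conn.close
  end
end

sleep 0.05                 # give the loop time to park in accept
stop_loop_and_wake(server, state)
loop_thread.join(2)        # returns promptly instead of hanging
server.close
```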
53
-
54
- Plus all previous wins are preserved and verified by the 1183-spec
55
- suite (2.10-G TCP_NODELAY at accept, 2.10-E preload hooks, 2.10-F
56
- C-ext fast-path response writer, 2.11-A dispatch pool warmup, 2.11-B
57
- CGlue HPACK default, 2.12-C accept4 connection loop, 2.12-D io_uring
58
- loop, 2.12-E per-worker request counter, 2.12-F gRPC unary trailers).
59
-
60
- Full per-stream details, bench tables, and follow-up items in
61
- [`CHANGELOG.md`](CHANGELOG.md).
62
-
63
- ## What's new in 2.12.0
64
-
65
- **The hot path moves into C — and gRPC ships.** The headline win:
66
- `Server.handle_static` routes now serve from a C accept→read→route→write
67
- loop with optional **io_uring** (Linux 5.x+) backing it. The `wrk -t4
68
- -c100 -d20s` hello bench moved from **5,502 r/s** (2.11.0
69
- `Server.handle_static` via Ruby accept loop) to **15,685 r/s** (2.12-C
70
- C accept4 loop) to **134,084 r/s** (2.12-D io_uring loop) — that's
71
- **24× over 2.11.0's `handle_static` and 7× over Agoo 2.15.14's
72
- 19,024 r/s** on the same workload. p99 stays sub-millisecond
73
- throughout. Plus durable foundation work and one big new feature:
74
-
75
- - **2.12-B — Fresh 4-way re-bench.** New
76
- [`docs/BENCH_HYPERION_2_11.md`](docs/BENCH_HYPERION_2_11.md) re-runs
77
- Hyperion / Puma / Falcon / Agoo on the 6 workloads with all 2.10/2.11
78
- wins enabled. Headline shifts: static 1 KB Hyperion `handle_static`
79
- flipped from 1.89× behind Agoo to **+127% ahead**; CPU JSON gap
80
- widened (the one row 2.10/2.11 didn't touch — flagged for follow-up).
81
- - **2.12-C — Connection lifecycle in C.** New
82
- `Hyperion::Http::PageCache.run_static_accept_loop` does
83
- `accept4` + `recv` + path lookup + `write` entirely in a C tight
84
- loop, returning to Ruby only on a route miss / TLS / h2 / WebSocket
85
- upgrade. GVL released across syscalls. Auto-engages when the listener
86
- is plain TCP and the route table contains only `StaticEntry`
87
- registrations. **5,502 → 15,685 r/s (+185%, 2.85×) on `handle_static`
88
- hello; p99 1.59 ms → 107 µs (15× tighter).** Falls through to the
89
- existing Ruby accept loop on miss with no regression.
90
- - **2.12-D — io_uring accept loop (Linux 5.x+).** A multishot accept +
91
- per-conn RECV/WRITE/CLOSE state machine on top of liburing. One
92
- `io_uring_enter` per N requests instead of N×3 syscalls. Opt-in via
93
- `HYPERION_IO_URING_ACCEPT=1` (default off until 2.13 production
94
- soak). **15,685 → 134,084 r/s (+755%, 8.6×) on the same bench.**
95
- Compiles out cleanly without liburing — the `accept4` path stays
96
- the fallback. macOS keeps using `accept4` (no liburing).
97
- - **2.12-E — SO_REUSEPORT cluster-mode audit.** New per-worker request
98
- metric (`requests_dispatch_total{worker_id="N"}`) ticks under every
99
- dispatch mode (Rack, `handle_static`, h2, the C accept loops). New
100
- audit harness `bench/cluster_distribution.sh` and a 4-worker, 30s
101
- sustained-load bench: under steady state the SO_REUSEPORT hash
102
- distributes within **1.004-1.011 max/min ratio** — production-grade,
103
- measured. The cold-start swing (1.16× during the first second of
104
- fresh boot) is documented as expected `SO_REUSEPORT + keep-alive`
105
- behavior and matches what production L4 LBs already exhibit.
106
- - **2.12-F — gRPC support on h2.** Trailers (the `grpc-status` /
107
- `grpc-message` final HEADERS frame), `TE: trailers` handling, h2
108
- request half-close semantics. Rack 3 contract: a Rack body that
109
- defines `#trailers` triggers the trailers wire shape automatically;
110
- bodies that don't are byte-identical to 2.11.x h2. Smoke test against
111
- the real `grpc` Ruby gem ships gated by `RUN_GRPC_SMOKE=1`; the
112
- durable coverage is 11 unit specs driving real `protocol-http2`
113
- framer + HPACK encode/decode + TLS.
114
-
115
- The 2.10-G TCP_NODELAY hunk, 2.10-E preload hooks, 2.10-F C-ext
116
- `rb_pc_serve_request`, 2.11-A dispatch pool warmup, and 2.11-B CGlue
117
- HPACK default all preserved and verified by the 1143-spec suite.
118
-
119
- Full per-stream details, bench tables, and follow-up items in
120
- [`CHANGELOG.md`](CHANGELOG.md).
121
-
122
- ## What's new in 2.11.0
123
-
124
- **h2 cold-stream latency cut + native HPACK CGlue flipped to default.**
125
- Two perf wins on top of 2.10:
126
-
127
- - **2.11-A — h2 first-stream TLS handshake parallelization.** The
128
- 2.10-G `HYPERION_H2_TIMING=1` instrumentation, run against the
129
- TCP_NODELAY-fixed handler, isolated the residual cold-stream cost
130
- to **bucket 2**: lazy `task.async {}` fiber spawn for the first
131
- stream of every connection. Fix: pre-spawn a stream-dispatch fiber
132
- pool at connection accept (configurable via `HYPERION_H2_DISPATCH_POOL`,
133
- default 4, ceiling 16). h2load `-c 1 -m 1 -n 50` cold first-run:
134
- **time-to-1st-byte 20.28 → 9.28 ms (−54%); m=100 throughput +5.5%**.
135
- Warm steady-state unchanged (no head-of-line blocking under the small
136
- pool — backlog still spills to ad-hoc `task.async`).
137
- - **2.11-B — HPACK FFI marshalling round-2 (CGlue flipped to default).**
138
- Three-way bench (`bench/h2_rails_shape.sh` extended): `ruby` (1,585
139
- r/s) vs `native v2` (1,602 r/s, +1% — noise) vs `native v3 / CGlue`
140
- (**2,291 r/s, +43% over v2**). The +18-44% native-vs-Ruby headline
141
- was almost entirely Fiddle marshalling overhead, not the underlying
142
- Rust HPACK encoder — same encoder, no per-call FFI marshalling, +43%
143
- rps. Default flipped: unset `HYPERION_H2_NATIVE_HPACK` now selects
144
- CGlue. Three escape valves stay (`=v2` to force the old path, `=ruby`
145
- / `=off` for the pure-Ruby fallback) for any operator that needs
146
- them. Boot log gains a `native_mode` field documenting which path is
147
- actually live.
148
-
149
- Plus operator infrastructure: a stale-`.dylib`-on-Linux cross-platform
150
- host-OS portability fix in `H2Codec.candidate_paths` (was silently
151
- falling through to pure-Ruby on the bench host); `bench/h2_rails_shape.sh`
152
- race-fixed (boot-log probe + stderr routing). Full bench tables and
153
- flip-decision rationale in [`CHANGELOG.md`](CHANGELOG.md).
154
-
155
- ## What's new in 2.10.1
156
-
157
- **Static-asset operator surface (2.10-E) + C-ext fast-path response
158
- writer (2.10-F).** Two follow-on streams to 2.10's static / direct-route
159
- work:
160
-
161
- - **2.10-E — Static asset preload + immutable flag.** Boot-time hook
162
- warms `Hyperion::Http::PageCache` over a tree of files and marks
163
- every cached entry immutable. Surface: `--preload-static <dir>` (and
164
- `--no-preload-static`) CLI flags, `preload_static "/path", immutable:
165
- true` config DSL key, and zero-config Rails auto-detect that pulls
166
- `Rails.configuration.assets.paths.first(8)` when present. Hyperion
167
- never `require`s Rails — purely defensive `defined?(::Rails)`
168
- probing keeps the generic Rack server path clean. **Operator value:
169
- predictable first-request latency** (the asset is in cache before
170
- the first request arrives) and the `recheck_seconds` mtime poll is
171
- skipped on immutable entries. Sustained-load throughput on the
172
- static-1-KB bench did *not* move (cold 1,929 r/s vs warm 1,886 r/s,
173
- inside trial noise) because `ResponseWriter` already auto-caches
174
- Rack::Files responses on the first hit; preload moves that one
175
- `cache_file` call from request 1 to boot.
176
- - **2.10-F — C-ext fast-path response writer for prebuilt responses.**
177
- `Server.handle_static`-routed requests now serve from a single
178
- C function (`rb_pc_serve_request` in `ext/hyperion_http/page_cache.c`)
179
- that does route lookup → header build → `write()` syscall without
180
- re-entering Ruby on the response side. GVL is released across the
181
- `write()` so slow clients no longer block other Ruby work on the
182
- same VM. Automatic HEAD support (HTTP-mandated) lights up on every
183
- GET registered via `handle_static` — same buffer, body stripped.
184
- Bench (3-trial median, `wrk -t4 -c100 -d20s`): **5,768 r/s vs
185
- 2.10-D's 5,619 r/s (+2.6% — inside noise) and p99 1.93 → 1.67 ms
186
- (−14% — outside noise, reproducible).** The throughput needle didn't
187
- move because the per-connection lifecycle (accept4 + clone3 + futex
188
- on GVL handoff) dominates at 100 concurrent connections; 2.10-F
189
- shrinks the response phase, but the response phase isn't the
190
- bottleneck on this profile. Durable infrastructure for 2.11+ when
191
- the accept-loop work closes.
192
-
193
- Full per-stream details and bench tables in
194
- [`CHANGELOG.md`](CHANGELOG.md).
195
-
196
- ## What's new in 2.10.0
197
-
198
- **4-way bench harness, page cache, direct routes, and the h2 40 ms
199
- ceiling killed.** This sprint widens the comparison matrix to all four
200
- major Ruby web servers (Hyperion + Puma + Falcon + Agoo) and ships
201
- four substantive perf streams against that backdrop:
202
-
203
- - **2.10-A / 2.10-B — 4-way bench harness + honest baseline.**
204
- `bench/4way_compare.sh` runs the same 6 workloads (hello, static
205
- 1 KB / 1 MiB, CPU JSON, PG-bound, SSE) against all four servers from
206
- one script. Baseline numbers committed *before* any code changes:
207
- Agoo wins the static-asset and JSON columns by ~2-4×, Hyperion wins
208
- the static 1 MiB column by 9× and the SSE column by 3.6-17×.
209
- - **2.10-C — `Hyperion::Http::PageCache` (pre-built static response
210
- cache).** Open-addressed bucket table behind a pthread mutex
211
- (GVL-released for writes), engages automatically on `Rack::Files`
212
- responses. **Static 1 KB: 1,380 → 1,880 r/s (+36%), p99 3.7 → 2.7
213
- ms.** Closes the Agoo gap from −47% to −28% on that column.
214
- - **2.10-D — `Hyperion::Server.handle` direct route registration.**
215
- New API for hot Rack-bypass paths (`Server.handle '/health' do …
216
- end`, `Server.handle_static '/robots.txt', body: '...'`). Skips Rack
217
- adapter + env-build for matched routes. **`hello` via
218
- `handle_static`: 4,408 → 5,619 r/s (+27%), p99 1.93 ms** — the
219
- cleanest p99 in the 4-way matrix.
220
- - **2.10-G — h2 max-latency ceiling at ~40 ms: fixed.** Filed by 2.9-B
221
- as a "first-stream cost" hypothesis, the instrumentation revealed
222
- it was paid by *every* h2 stream — the canonical Linux delayed-ACK
223
- + Nagle interaction on small framer writes. One-line fix:
224
- TCP_NODELAY at accept time. **h2load `-c 1 -m 1 -n 200`: min
225
- 40.62 → 0.54 ms (−98.7%), throughput 24 → 1,142 r/s (+47.6×).** The
226
- `HYPERION_H2_TIMING=1` instrumentation stays in place as durable
227
- diagnostic infrastructure.
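The one-line nature of the 2.10-G fix is visible in plain Ruby (a sketch of the idea, not Hyperion's C accept path): set `TCP_NODELAY` on each socket as it is accepted, so small framer writes go out immediately instead of waiting on the delayed-ACK/Nagle interaction.

```ruby
require 'socket'

# Sketch of the 2.10-G fix: disable Nagle on each accepted connection so
# small HTTP/2 framer writes aren't queued behind a delayed ACK.
server = TCPServer.new('127.0.0.1', 0)
client = TCPSocket.new('127.0.0.1', server.local_address.ip_port)
conn = server.accept
conn.setsockopt(Socket::IPPROTO_TCP, Socket::TCP_NODELAY, 1)

flag = conn.getsockopt(Socket::IPPROTO_TCP, Socket::TCP_NODELAY).int
[conn, client, server].each(&:close)
```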
228
-
229
- Full per-stream details, bench numbers, and follow-up items live in
230
- [`CHANGELOG.md`](CHANGELOG.md).
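Pulled together, the 2.10-D registration surface reads like this in a `config.ru` — a sketch: the handler bodies and fallback app are illustrative; only `Server.handle`, `Server.handle_static`, and the `body:` keyword come from the text above.

```ruby
# config.ru — sketch of the 2.10-D direct-route API described above.
require 'hyperion'

# Prebuilt static response: skips the Rack adapter and env build entirely.
Hyperion::Server.handle_static '/robots.txt', body: "User-agent: *\nDisallow:\n"

# Direct dynamic route: still bypasses Rack env construction.
Hyperion::Server.handle '/health' do
  [200, { 'content-type' => 'text/plain' }, ['ok']]
end

# Everything else falls through to the normal Rack app.
run ->(env) { [200, { 'content-type' => 'text/plain' }, ['hello']] }
```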
231
-
232
- ## What's new in 2.5.0
233
-
234
- **Native HPACK ON by default + autobahn 100% conformance + request
235
- hooks.** The Rust HPACK encoder (added in 2.0.0, opt-in until 2.4.x)
236
- flips ON by default in 2.5.0 — verified **+18% rps on Rails-shape h2
237
- workloads** (25-header responses, the bench harness lives at
238
- `bench/h2_rails_shape.ru` + `bench/h2_rails_shape.sh`). RFC 6455
239
- WebSocket conformance hit **463/463 autobahn-testsuite cases passing**
240
- (2.5-A, host openclaw-vm). Request lifecycle hooks
241
- (`Runtime#on_request_start` / `on_request_end`) shipped in 2.5-C —
242
- recipes in [`docs/OBSERVABILITY.md`](docs/OBSERVABILITY.md).
243
-
244
- ## What's new in 2.4.0
245
-
246
- **Production observability.** The `/-/metrics` endpoint now exposes
247
- per-route latency histograms, per-conn fairness rejections, WebSocket
248
- permessage-deflate compression ratio, kTLS active connections,
249
- io_uring-active workers, and ThreadPool queue depth — operators can
250
- finally see whether the 2.x knobs are firing and how effective they
251
- are. A pre-built Grafana dashboard ships at
252
- [`docs/grafana/hyperion-2.4-dashboard.json`](docs/grafana/hyperion-2.4-dashboard.json).
253
- Full metric reference + operator playbook in
254
- [`docs/OBSERVABILITY.md`](docs/OBSERVABILITY.md).
255
-
256
- ## What's new in 2.1.0
257
-
258
- **WebSockets.** RFC 6455 over Rack 3 full hijack, native frame codec,
259
- per-connection wrapper with auto-pong / close handshake / UTF-8 validation /
260
- per-message size cap. **ActionCable on Hyperion is now a single-binary
261
- deployment** — one `hyperion -w 4 -t 10 config.ru` process serves HTTP,
262
- HTTP/2, TLS, **and** `/cable` from the same listener; no separate cable
263
- container required. HTTP/1.1 only this release; WS-over-HTTP/2 (RFC 8441
264
- Extended CONNECT) and permessage-deflate (RFC 7692) defer to 2.2.x.
265
- See [`docs/WEBSOCKETS.md`](docs/WEBSOCKETS.md).
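The handshake behind that RFC 6455 support is small enough to show: the server proves it understood the upgrade by hashing the client's `Sec-WebSocket-Key` with a GUID fixed by the spec. This is the standard RFC 6455 computation any conformant server performs, not Hyperion-specific code:

```ruby
require 'digest/sha1'

# RFC 6455: Sec-WebSocket-Accept = base64(SHA1(key + GUID)).
WS_GUID = '258EAFA5-E914-47DA-95CA-C5AB0DC85B11'

def websocket_accept_key(client_key)
  # strict base64 (no trailing newline) via Array#pack('m0')
  [Digest::SHA1.digest(client_key + WS_GUID)].pack('m0')
end

# The RFC's own example key yields its documented accept value:
websocket_accept_key('dGhlIHNhbXBsZSBub25jZQ==')
# => "s3pPLMBiTxaQ9kYGzzhZRbK+xOo="
```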
266
-
267
- ## gRPC on Hyperion (2.12-F+)
268
-
269
- Hyperion's HTTP/2 path supports gRPC unary calls via the Rack 3 trailers
270
- contract: any response body that exposes `:trailers` gets a final
271
- HEADERS frame (with END_STREAM=1) carrying the trailer map after the
272
- DATA frames. That's the wire shape gRPC clients expect for the
273
- `grpc-status` / `grpc-message` map.
274
-
275
- A minimal Rack-shaped gRPC handler:
276
-
277
- ```ruby
278
- class GrpcBody
279
- def initialize(reply); @reply = reply; end
280
- def each; yield @reply; end
281
- def trailers; { 'grpc-status' => '0', 'grpc-message' => 'OK' }; end
282
- def close; end
283
- end
284
-
285
- run ->(env) {
286
- request = env['rack.input'].read # gRPC-framed protobuf bytes
287
- reply = handle(request) # your service implementation
288
- [200, { 'content-type' => 'application/grpc' }, GrpcBody.new(reply)]
289
- }
290
- ```
291
-
292
- What Hyperion handles for you: ALPN negotiation, HTTP/2 framing, HPACK,
293
- per-stream flow control, the trailer-frame emit, binary-clean
294
- `env['rack.input']` (gRPC bodies are non-UTF-8), and `te: trailers`
295
- preserved into `env['HTTP_TE']`. What you handle: protobuf
296
- marshalling and the `grpc-status` semantics.
297
-
298
- ### Streaming RPCs (2.13-D+)
299
-
300
- All four gRPC call shapes work on Hyperion since 2.13-D — unary,
301
- server-streaming, client-streaming, and bidirectional. The detection
302
- trigger is the gRPC content-type plus `te: trailers`; any HTTP/2
303
- request that carries both is dispatched to the Rack app on HEADERS
304
- arrival (rather than after END_STREAM), and `env['rack.input']`
305
- becomes a streaming IO that blocks reads until the next DATA frame
306
- lands. Plain HTTP/2 traffic (without those headers) keeps the unary
307
- buffered semantics — no behaviour change for non-gRPC clients.
308
-
309
- **Server-streaming.** Yield one gRPC-framed message per `each`
310
- iteration; Hyperion writes each yield as its own DATA frame:
311
-
312
- ```ruby
313
- class StreamReply
314
- def initialize(messages); @messages = messages; end
315
- def each; @messages.each { |m| yield m }; end # one DATA frame each
316
- def trailers; { 'grpc-status' => '0' }; end
317
- def close; end
318
- end
319
-
320
- run ->(env) {
321
- env['rack.input'].read # the unary request message
322
- [200, { 'content-type' => 'application/grpc' }, StreamReply.new(messages)]
323
- }
324
- ```
325
-
326
- **Client-streaming.** Read messages off `env['rack.input']` as the peer
327
- sends them. Reads block until a DATA frame arrives:
328
-
329
- ```ruby
330
- run ->(env) {
331
- io = env['rack.input']
332
- count = 0
333
- while (prefix = io.read(5)) && prefix.bytesize == 5
334
- length = prefix.byteslice(1, 4).unpack1('N')
335
- msg = io.read(length) # drain the message bytes; this example only counts them
336
- count += 1
337
- end
338
- [200, { 'content-type' => 'application/grpc' }, GrpcBody.new(reply_for(count))]
339
- }
340
- ```
341
-
342
- **Bidirectional.** Interleave reads and writes. The response body is
343
- iterated lazily, so you can read one request message, yield one reply,
344
- read the next, yield the next:
345
-
346
- ```ruby
347
- class BidiReplies
348
- def initialize(io); @io = io; end
349
- def each
350
- while (prefix = @io.read(5)) && prefix.bytesize == 5
351
- len = prefix.byteslice(1, 4).unpack1('N')
352
- msg = @io.read(len)
353
- yield grpc_frame(handle(msg)) # sent immediately
354
- end
355
- end
356
- def trailers; { 'grpc-status' => '0' }; end
357
- def close; end
358
- end
359
-
360
- run ->(env) {
361
- [200, { 'content-type' => 'application/grpc' }, BidiReplies.new(env['rack.input'])]
362
- }
19
+ bundle exec hyperion config.ru # http://127.0.0.1:9292
363
20
  ```
364
21
 
365
- The streaming-input path runs each stream on its own fiber, so
366
- concurrent read+write on the same stream is safe.
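The 5-byte prefix both streaming examples parse is gRPC's standard length-prefixed message framing. A minimal pure-Ruby sketch, which also supplies a `grpc_frame`-style helper like the one the bidirectional example assumes (uncompressed messages only):

```ruby
require 'stringio'

# gRPC message framing: 1-byte compressed flag, 4-byte big-endian length,
# then the serialized protobuf payload.
def grpc_frame(message)
  [0, message.bytesize].pack('CN') + message   # flag 0 = uncompressed
end

def grpc_unframe(io)
  prefix = io.read(5)
  return nil unless prefix && prefix.bytesize == 5
  compressed, length = prefix.unpack('CN')
  raise 'compressed frames not handled in this sketch' unless compressed.zero?
  io.read(length)
end

framed = grpc_frame('payload')        # 5-byte prefix + 7 payload bytes
grpc_unframe(StringIO.new(framed))    # round-trips the payload
```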
367
-
368
- ## Highlights
369
-
370
- - **HTTP/1.1 + HTTP/2 + TLS** out of the box (HTTP/2 with per-stream fiber multiplexing, WINDOW_UPDATE-aware flow control, ALPN auto-negotiation).
371
- - **WebSockets (RFC 6455)** — full handshake, native frame codec, per-connection wrapper. ActionCable + faye-websocket work on a single-binary deploy. See [`docs/WEBSOCKETS.md`](docs/WEBSOCKETS.md). (2.1.0+, HTTP/1.1 only.)
372
- - **Pre-fork cluster** with per-OS worker model: `SO_REUSEPORT` on Linux, master-bind + worker-fd-share on macOS/BSD (Darwin's `SO_REUSEPORT` doesn't load-balance).
373
- - **Hybrid concurrency**: fiber-per-connection for I/O, OS-thread pool for `app.call(env)` — synchronous Rack handlers (Rails, ActiveRecord, anything holding a global mutex) get true OS-thread concurrency.
374
- - **Vendored llhttp 9.3.0** C parser; pure-Ruby fallback for non-MRI runtimes.
375
- - **Default-ON structured access logs** (one JSON or text line per request) with hot-path optimisations: per-thread cached timestamp, hand-rolled line builder, lock-free per-thread write buffer.
376
- - **12-factor logger split**: info/debug → stdout, warn/error/fatal → stderr.
377
- - **Ruby DSL config file** (`config/hyperion.rb`) with lifecycle hooks (`before_fork`, `on_worker_boot`, `on_worker_shutdown`).
378
- - **Object pooling** for the Rack `env` hash and `rack.input` IO — amortizes per-request allocations across the worker's lifetime.
379
- - **`Hyperion::FiberLocal`** opt-in shim for older Rails idioms that store request-scoped data via `Thread.current.thread_variable_*`.
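The lifecycle hooks named above slot into the `config/hyperion.rb` DSL; a sketch — the hook names come from the text, while the block contents (and zero-arg arity) are assumptions:

```ruby
# config/hyperion.rb — illustrative; only the hook names are from the docs.
before_fork do
  # runs in the master before workers fork: close shared sockets / DB handles
end

on_worker_boot do
  # runs in each worker after fork: reconnect the DB, warm caches
end

on_worker_shutdown do
  # runs in each worker on graceful stop: flush logs and metrics
end
```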
380
-
381
- ## Benchmarks
382
-
383
- All numbers are real wrk runs against published Hyperion configs. Hyperion ships **with default-ON structured access logs**; Puma comparisons use Puma defaults (no per-request log emission). Each section is stamped with the Hyperion version + bench host it was measured against — bench-host drift over time is real (see "Bench-host drift" note below).
384
-
385
- **Headline doc**: the most recent comprehensive sweep is
386
- [`docs/BENCH_HYPERION_2_11.md`](docs/BENCH_HYPERION_2_11.md) — the
387
- 2.12-B 4-way re-bench (Hyperion 2.11.0 vs Puma 8.0.1 / Falcon 0.55.3 /
388
- Agoo 2.15.14, 16-vCPU Ubuntu 24.04, 6 workloads). It's the post-
389
- 2.10/2.11-wins re-baseline of the four-server matrix that originally
390
- shipped in [`docs/BENCH_HYPERION_2_0.md`](docs/BENCH_HYPERION_2_0.md)
391
- § "4-way head-to-head (2.10-B baseline)" — the older doc is the
392
- **historical baseline (pre-2.10/2.11 wins)** and is preserved
393
- unchanged for archaeology. The 1.6.0 matrix at
394
- [`docs/BENCH_2026_04_27.md`](docs/BENCH_2026_04_27.md) covers 9
395
- workloads × 25+ configs against hyperion-async-pg 0.5.0; all three
396
- docs include caveats and per-row reproduction commands.
397
-
398
- > **Bench-host drift note (2026-05-01).** A spot-check rerun on
399
- > `openclaw-vm` 5 days after the 2.0.0 sweep showed Puma 8.0.1 and
400
- > Hyperion 2.0.0 baseline numbers had drifted 14-32% downward from the
401
- > 2026-04-29 sweep with no code changes — the bench host runs other
402
- > workloads in the background and is a single VM (KVM CPU). Numbers in
403
- > this README and BENCH docs are snapshots; expect ±10-30% absolute
404
- > drift between sweep dates. **The relative position (Hyperion vs Puma
405
- > at matched config) is the durable signal**; e.g. Hyperion `-w 16 -t 5`
406
- > hello-world today is 76,593 r/s vs Puma 8.0.1 `-w 16 -t 5:5` at 55,609
407
- > r/s, **+37.7% over Puma** — wider than the 2.0.0 sweep's +27.8% even
408
- > though absolute rps is lower. Reproduce: `bundle exec bin/hyperion
409
- > -p 9501 -w 16 -t 5 bench/hello.ru` then `wrk -t4 -c200 -d20s
410
- > http://127.0.0.1:9501/`.
411
-
412
- > **Topology relevance.** Hyperion is built to run **fronted by nginx
413
- > or an L7 load balancer** in most production deployments — plaintext
414
- > HTTP/1.1 upstream, TLS terminated at the LB. The benches in this
415
- > README that match that topology are: hello-world, CPU JSON, static,
416
- > SSE, PG, WebSocket. Benches that are **bench-only for nginx-fronted
417
- > ops** (the LB → upstream hop is plaintext h1 regardless): TLS h1,
418
- > HTTP/2, kTLS_TX. Those rows still ship for operators who terminate
419
- > TLS / h2 at Hyperion directly (small static fleets, edge boxes), but
420
- > don't chase the +60% TLS-h1 win unless you actually terminate TLS at
421
- > Hyperion.
422
-
423
- ### Hello-world Rack app
424
-
425
- `bench/hello.ru`, single worker, parity threads (`-t 5` vs Puma `-t 5:5`), 4 wrk threads / 100 connections / 15s, macOS arm64 / Ruby 3.3.3, Hyperion 1.2.0. **macOS dev numbers; the headline Linux 2.0.0 bench is in [`docs/BENCH_HYPERION_2_0.md`](docs/BENCH_HYPERION_2_0.md)**:
426
-
427
- | | r/s | p99 | tail vs Hyperion |
428
- |---|---:|---:|---:|
429
- | **Hyperion 1.2.0** (default, logs ON) | **22,496** | **502 µs** | **1×** |
430
- | Falcon 0.55.3 `--count 1` | 22,199 | 5.36 ms | 11× worse |
431
- | Puma 7.1.0 `-t 5:5` | 20,400 | 422.85 ms | 845× worse |
432
-
433
- **Hyperion: 1.10× Puma throughput, parity with Falcon on throughput, ~10× lower p99 than Falcon and ~845× lower than Puma — while emitting structured JSON access logs the others don't.**
434
-
435
- ### Production cluster config (`-w 4`)
436
-
437
- Same bench app, `-w 4` cluster, parity threads (`-t 5` everywhere), 4 wrk threads / 200 connections / 15s, macOS arm64:
438
-
439
- | | r/s | p99 | tail vs Hyperion |
440
- |---|---:|---:|---:|
441
- | Falcon `--count 4` | 48,197 | 4.84 ms | 5.9× worse |
442
- | **Hyperion `-w 4 -t 5`** | **40,137** | **825 µs** | **1×** |
443
- | Puma `-w 4 -t 5:5` | 34,793 | 177.76 ms | 215× worse (1 timeout) |
444
-
445
- Falcon edges Hyperion ~20% on raw rps at `-w 4` on macOS hello-world. **Hyperion still leads on tail latency by 5.9× over Falcon and 215× over Puma**, and beats Puma on throughput by 1.15×. On Linux production-config and DB-backed workloads (below) Hyperion takes the rps lead too — Falcon's macOS hello-world advantage disappears once the workload includes any actual work or the kernel is Linux.
446
-
447
- ### Linux production-config (DB-backed Rack)
448
-
449
- `-w 4 -t 10` on Ubuntu 24.04 / Ruby 3.3.3. Rack app does one Postgres `SELECT 1` + one Redis `GET` per request, real network round-trip. wrk `-t4 -c50 -d10s` × 3 runs (median):
450
-
451
- | | r/s (median) | vs Puma default |
452
- |---|---:|---:|
453
- | **Hyperion default (rc17, logs ON)** | **5,786** | **1.012×** |
454
- | Hyperion `--no-log-requests` | 6,364 | 1.114× |
455
- | Puma `-w 4 -t 10:10` (no per-req logs) | 5,715 | 1.000× |
456
-
457
- Bench is **wait-bound** — ~3-4 ms median is the PG + Redis round-trip, dwarfing the per-request CPU work where Hyperion's optimisations live. With a synchronous `pg` driver, fibers don't help: every in-flight DB call still parks an OS thread, and both servers max out at `workers × threads` concurrent queries. Widening this gap requires either an async PG driver — see [hyperion-async-pg](https://github.com/andrew-woblavobla/hyperion-async-pg) (companion gem; pair with `--async-io` and a fiber-aware pool, see "Async I/O — fiber concurrency on PG-bound apps" below) — or a CPU-bound workload, where Hyperion's lead becomes visible (next section).
458
-
459
- ### Async I/O — fiber concurrency on PG-bound apps
460
-
461
- Ubuntu 24.04 / 16 vCPU / Ruby 3.3.3, Postgres 17 over WAN, `wrk -t4 -c200 -d20s`. Single worker (`-w 1`) unless noted. All configs returned 0 non-2xx and 0 timeouts. RSS sampled mid-run via `ps -o rss`.
462
-
463
- **Wait-bound workload** (`pg_concurrent.ru`: `SELECT pg_sleep(0.05)` + tiny JSON; rackup lives in the [hyperion-async-pg companion repo](https://github.com/andrew-woblavobla/hyperion-async-pg) and on the bench host at `~/bench/pg_concurrent.ru`, not in this repo):
464
-
465
- | | r/s | p99 | RSS | vs Puma `-t 5` |
466
- |---|---:|---:|---:|---:|
467
- | Puma 8.0 `-t 5` pool=5 | 56.5 | 3.88 s | 87 MB | 1.0× |
468
- | Puma 8.0 `-t 30` pool=30 | 402.1 | 880 ms | 99 MB | 7.1× |
469
- | Puma 8.0 `-t 100` pool=100 | 1067.4 | 557 ms | 121 MB | 18.9× |
470
- | **Hyperion `--async-io -t 5`** pool=32 | 400.4 | 878 ms | 123 MB | 7.1× |
471
- | **Hyperion `--async-io -t 5`** pool=64 | 778.9 | 638 ms | 133 MB | 13.8× |
472
- | **Hyperion `--async-io -t 5`** pool=128 | 1344.2 | 536 ms | 148 MB | 23.8× |
473
- | **Hyperion `--async-io -t 5` pool=200** | **2381.4** | **471 ms** | **164 MB** | **42.2×** |
474
- | Hyperion `--async-io -w 4 -t 5` pool=64 | 1937.5 | 4.84 s | 416 MB | 34.3× (cold-start p99 — see note) |
475
- | Falcon 0.55.3 `--count 1` pool=128 | 1665.7 | 516 ms | 141 MB | 29.5× |
476
-
477
- **Mixed CPU+wait** (`pg_mixed.ru`: same query + 50-key JSON serialization, ~5 ms CPU; rackup lives in hyperion-async-pg + on the bench host at `~/bench/pg_mixed.ru`, not in this repo):
478
-
479
- | | r/s | p99 | RSS | vs Puma `-t 30` |
480
- |---|---:|---:|---:|---:|
481
- | Puma 8.0 `-t 30` pool=30 | 351.7 | 963 ms | 127 MB | 1.0× |
482
- | Hyperion `--async-io -t 5` pool=32 | 371.2 | 919 ms | 151 MB | 1.05× |
483
- | Hyperion `--async-io -t 5` pool=64 | 741.5 | 681 ms | 161 MB | 2.1× |
484
- | **Hyperion `--async-io -t 5` pool=128** | **1739.9** | **512 ms** | **201 MB** | **4.9×** |
485
- | Falcon `--count 1` pool=128 | 1642.1 | 531 ms | 213 MB | 4.7× |
486
-
487
- **Takeaways:**
488
- 1. **Linear scaling with pool size** under `--async-io` — `r/s ≈ pool × 12` on this WAN bench. Single-worker pool=200 hits 2381 r/s. The "**42× Puma `-t 5`**" and "**5.9× Puma's best**" framings above use Puma's pool=5 (timeout-floor) and pool=30 (mid-tier) rows respectively — fair comparisons on the *same* bench fixture, but a Puma operator who sizes their pool to match (`-t 100 pool=100` row above) lands at 1,067 r/s, so the **honest "Puma at its own best vs Hyperion at its own best" ratio is 2,381 / 1,067 ≈ 2.2×**, not 42×. The architectural win — fiber-pool grows to pool=200 without OS-thread cost — is real; the 42× headline is a configuration-difference effect, not a steady-state gap on matched configs.
489
- 2. **Mixed workload doesn't kill the win** — Hyperion `--async-io` pool=128 actually goes *up* on mixed (1740 vs 1344 r/s) because CPU work overlaps other fibers' PG-wait windows. This is the honest "what happens to a real Rails handler" answer.
490
- 3. **Hyperion ≈ Falcon within 3-7%** across pool sizes; both fiber-native architectures extract similar value from `hyperion-async-pg`.
491
- 4. **RSS at single-worker scale isn't the architectural moat** — Linux thread stacks are demand-paged; PG connection buffers dominate RSS at pool sizes ≤ 200. The architectural win is **handler concurrency under load**, not idle memory: Hyperion's fiber path runs thousands of in-flight handler invocations per OS thread, so wait-bound handlers don't queue at `max_threads`. See [Concurrency at scale](#concurrency-at-scale-architectural-advantages) for both the throughput-under-load row and a measured 10k-idle-keepalive RSS sweep against Puma and Falcon.
492
- 5. **`-w 4` cold-start caveat** — multi-worker p99 inflates because the bench rackup uses lazy per-process pool init (each worker pays full pool fill on its first request). Production apps avoid this with `on_worker_boot { Hyperion::AsyncPg::FiberPool.new(...).fill }`.
493
- 6. **Apples-to-apples PG note**: the row above uses `pg.wobla.space` WAN PG with `max_connections=500`. Earlier sweeps that compared Hyperion (WAN, max_conn=500) against Puma (local, max_conn=100) overstated the ratio because the Puma side timed out at the local pool ceiling. The 2.0.0 bench doc carries this caveat in the row 7 verification section; treat any "Hyperion 4× Puma on PG" headline as **indicative**, not precisely calibrated, until rerun against matched-pool PG.
494
-
495
- Three things must all be true to get this win:
496
- 1. **`async_io: true`** in your Hyperion config (or `--async-io` CLI flag). Default is off to keep 1.2.0's raw-loop perf for fiber-unaware apps.
497
- 2. **`hyperion-async-pg`** installed: `gem 'hyperion-async-pg', require: 'hyperion/async_pg'` + `Hyperion::AsyncPg.install!` at boot.
498
- 3. **Fiber-aware connection pool.** The popular `connection_pool` gem is NOT — its Mutex blocks the OS thread. Use `Hyperion::AsyncPg::FiberPool` (ships with hyperion-async-pg 0.3.0+), [`async-pool`](https://github.com/socketry/async-pool), or `Async::Semaphore`.
499
-
500
- Skip any of these and you get parity with Puma at the same `-t`. Run the bench yourself: `MODE=async DATABASE_URL=... PG_POOL_SIZE=200 bundle exec hyperion --async-io -t 5 bench/pg_concurrent.ru` (in the [hyperion-async-pg](https://github.com/andrew-woblavobla/hyperion-async-pg) repo).
501
-
502
- > **TLS + async-pg note (1.4.0+).** TLS / HTTPS already runs each connection on a fiber under `Async::Scheduler` (the TLS path always uses `start_async_loop` for the ALPN handshake). **As of 1.4.0, the post-handshake `app.call` for HTTP/1.1-over-TLS dispatches inline on the calling fiber by default** — so fiber-cooperative libraries (`hyperion-async-pg`, `async-redis`) work on the TLS h1 path without needing `--async-io`. The Async-loop cost is already paid for the handshake; running the handler under the existing scheduler just preserves that context instead of stripping it on a thread-pool hop. h2 streams are always fiber-dispatched and benefit from async-pg without the flag.
503
- >
504
- > Operators who specifically want **TLS + threadpool dispatch** (e.g. CPU-heavy handlers competing for OS threads, where you'd rather not pay fiber yields and want true OS-thread parallelism on a synchronous handler) can pass `async_io: false` in the config to force the pool branch back on. The three-way `async_io` setting:
505
- > - `nil` (default): plain HTTP/1.1 → pool, TLS h1 → inline.
506
- > - `true`: plain HTTP/1.1 → inline, TLS h1 → inline (force fiber dispatch everywhere; needed for `hyperion-async-pg` on plain HTTP).
507
- > - `false`: plain HTTP/1.1 → pool, TLS h1 → pool (explicit opt-out for TLS+threadpool).
508
-
509
- ### CPU-bound JSON workload
22
+ ## Headline benchmarks
510
23
 
511
- `bench/work.ru` handler builds a 50-key fixture, JSON-encodes a fresh response per request (~8 KB body), processes a 6-cookie header chain. wrk `-t4 -c200 -d15s`, macOS arm64 / Ruby 3.3.3, 1.2.0:
24
+ Linux 6.8 / 16-vCPU Ubuntu 24.04 / Ruby 3.3.3, single worker, `wrk -t4 -c100 -d20s`
25
+ unless noted. Reproduction commands and the full 6-row 4-way matrix
26
+ (Hyperion / Puma / Falcon / Agoo) live in
27
+ [docs/BENCH_HYPERION_2_11.md](docs/BENCH_HYPERION_2_11.md).
512
28
 
513
- | | r/s | p99 | tail vs Hyperion |
514
- |---|---:|---:|---:|
515
- | Falcon `--count 4` | 46,166 | 20.17 ms | 24× worse |
516
- | **Hyperion `-w 4 -t 5`** | **43,924** | **824 µs** | **1×** |
517
- | Puma `-w 4 -t 5:5` | 36,383 | 166.30 ms (47 socket errors) | 200× worse |
29
+ | Workload | Hyperion r/s | Hyperion p99 | Reference |
30
+ |-------------------------------------------------------|-------------:|-------------:|----------------------|
31
+ | Static hello, `handle_static` + io_uring (2.12-D) | **134,084** | 1.14 ms | Agoo 2.15.14: 19,024 |
32
+ | Static hello, `handle_static` + accept4 fallback | 15,685 | 107 µs | Agoo 2.15.14: 19,024 |
33
+ | Dynamic block, `Server.handle { \|env\| ... }` (2.14-A) | 9,422 | 166 µs | Agoo 2.15.14: 19,024 |
34
+ | CPU JSON via block (`bench/work.ru`, 2.14-A) | 5,897 | 256 µs | Falcon: 4,226 |
35
+ | Generic Rack hello (no `Server.handle`) | 4,752 | 2.02 ms | Agoo 2.15.14: 19,024 |
36
+ | gRPC unary, h2/TLS, ghz `-c50` (2.14-D) | 1,618 | 33.3 ms | Falcon `async-grpc`: 1,512 (+7%) |
518
37
 
519
- **1.21× Puma throughput, 200× lower p99.** This is the gap that hides behind PG-round-trip noise on the DB bench. Hyperion's per-request CPU savings (lock-free per-thread metrics, frozen header keys in the Rack adapter, C-ext response head builder, cached iso8601 timestamps, cached HTTP Date header) land on the wire when the workload is CPU-bound. Falcon edges us 5% on raw r/s but with 24× worse tail — a different tradeoff curve. Reproduce: `bundle exec bin/hyperion -w 4 -t 5 -p 9292 bench/work.ru`.
38
+ The 134,084 r/s row is sustained over a 4-hour soak at **120,684 r/s**
39
+ with RSS variance 2.71% and `wrk-truth` p99 1.14 ms (2.14-C). The
40
+ io_uring loop is opt-in via `HYPERION_IO_URING_ACCEPT=1` until 2.15;
41
+ the `accept4` row is the default on Linux.
520
42
 
521
- ### Real Rails 8.1 app (single worker, parity threads `-t 16`)
522
-
523
- Health endpoint that traverses the full middleware chain (rack-attack, locale redirect, structured tagger, geo-location, etc.). Plus a Grape API endpoint reading cached data, and a Rails controller doing a Redis GET + an ActiveRecord query.
524
-
525
- | endpoint | server | r/s | p99 | wrk timeouts |
526
- |---|---|---:|---:|---:|
527
- | `/up` (health) | **Hyperion** | **19.03** | **1.12 s** | **0** |
528
- | `/up` (health) | Puma `-t 16:16` | 16.64 | 1.95 s | **138** |
529
- | Grape `/api/v1/cached_data` | **Hyperion** | **16.15** | **779 ms** | 16 |
530
- | Grape `/api/v1/cached_data` | Puma `-t 16:16` | 10.90 | (>2 s, censored) | **110** |
531
- | Rails `/api/v1/health` | **Hyperion** | **15.95** | **992 ms** | 16 |
532
- | Rails `/api/v1/health` | Puma `-t 16:16` | 11.29 | (>2 s, censored) | **114** |
533
-
534
- On Grape and Rails-controller workloads Puma hits wrk's 2 s timeout cap on ~⅔ of requests — its real p99 is censored above 2 s. Hyperion serves all of its requests under 1.2 s with 0 to 16 timeouts. **1.14–1.48× Puma throughput** depending on endpoint.
535
-
536
- ### Static-asset serving (sendfile zero-copy path, 1.2.0+)
537
-
538
- `bench/static.ru` (`Rack::Files` over a 1 MiB asset), `-w 1`, `wrk -t4 -c100 -d15s`, macOS arm64 / Ruby 3.3.3:
539
-
540
- | | r/s | p99 | transferred | tail vs winner |
541
- |---|---:|---:|---:|---:|
542
- | **Hyperion (sendfile path)** | **2,069** | **3.10 ms** | 30.4 GB | **1×** |
543
- | Puma `-w 1 -t 5:5` | 2,109 | 566.16 ms | 31.0 GB | 183× worse |
544
- | Falcon `--count 1` | 1,269 | 801.01 ms | 18.7 GB | 258× worse (28 timeouts) |
545
-
546
- Throughput is bandwidth-bound on localhost (≈2 GB/s = the loopback memory ceiling), so the throughput column looks like parity. The actual win is in the **tail latency** column: Hyperion's `IO.copy_stream` → `sendfile(2)` path skips userspace entirely, while Puma allocates a String per response and Falcon serializes more aggressively. On real network paths sendfile widens the gap further (kernel-to-NIC zero-copy).
43
+ ## Quick start
547
44
 
548
- Reproduce:
549
45
  ```sh
550
- ruby -e 'File.binwrite("/tmp/hyperion_bench_asset_1m.bin", "x" * (1024*1024))'
551
- bundle exec bin/hyperion -p 9292 bench/static.ru
552
- wrk --latency -t4 -c100 -d15s http://127.0.0.1:9292/hyperion_bench_asset_1m.bin
46
+ bundle exec hyperion config.ru # single process
47
+ bundle exec hyperion -w 4 -t 10 config.ru # 4 workers × 10 threads
48
+ bundle exec hyperion -w 0 config.ru # one worker per CPU
49
+ bundle exec hyperion --tls-cert cert.pem --tls-key key.pem -p 9443 config.ru
553
50
  ```
554
51
 
555
- ### Concurrency at scale (architectural advantages)
52
+ `bundle exec rake spec` (and the default task) auto-invoke `compile`, so a
53
+ fresh checkout just needs `bundle install && bundle exec rake` for a green run.
556
54
 
557
- These workloads demonstrate structural differences between Hyperion's fiber-per-connection / fiber-per-stream model and Puma's thread-pool model. Numbers are illustrative; the architecture is what matters. Run on Ubuntu 24.04 / Ruby 3.3.3, single worker, h2load `-c <conns> -n 100000 --rps 1000 --h1`.
55
+ Migrating from Puma? `hyperion -t N -w M` matching your current Puma
56
+ `-t N:N -w M` is the recommended drop-in. See
57
+ [docs/MIGRATING_FROM_PUMA.md](docs/MIGRATING_FROM_PUMA.md).
558
58
 
559
- **5,000 concurrent keep-alive connections (50,000 requests):**
59
+ ## Features
560
60
 
561
- | | succeeded | r/s | wall | master RSS |
562
- |---|---:|---:|---:|---:|
563
- | Hyperion `-w 1 -t 10` | 50,000 / 50,000 | 3,460 | 14.45 s | 53.5 MB |
564
- | Puma `-w 1 -t 10:10` | 50,000 / 50,000 | 1,762 | 28.37 s | 36.9 MB |
61
+ ### HTTP/1.1 + HTTP/2 + TLS
565
62
 
566
- **10,000 concurrent keep-alive connections (100,000 requests):**
567
-
568
- | | succeeded | failed | r/s | wall |
569
- |---|---:|---:|---:|---:|
570
- | Hyperion `-w 1 -t 10` | 93,090 | 6,910 | 3,446 | 27.01 s |
571
- | Puma `-w 1 -t 10:10` | 77,340 | 22,660 | 706 | 109.59 s |
572
-
573
- At 10k concurrent connections under load Hyperion serves **~5× the throughput** of Puma with **~20% fewer dropped requests**. The per-connection bookkeeping cost is bounded by fiber size, not by `max_threads` — workers don't get pinned to long-lived sockets, so a slow handler doesn't starve other connections.
574
-
575
- **Memory at idle keep-alive scale — 10,000 idle HTTP/1.1 keep-alive connections:**
576
-
577
- Each client opens a TCP connection, sends one keep-alive GET, drains the response, then holds the socket open without sending a follow-up request. RSS is sampled once a second across a 30s idle hold. Same hello-world rackup, single worker, no TLS. Hyperion runs with `async_io true` (fiber-per-connection on the plain HTTP/1.1 path).
578
-
579
- | | held | dropped | peak RSS | RSS after drain |
580
- |---|---:|---:|---:|---:|
581
- | Hyperion `-w 1 -t 5 --async-io` | 10,000 / 10,000 | 0 | 173 MB | 155 MB |
582
- | Puma `-w 0 -t 100` | 10,000 / 10,000 | 0 | 101 MB | 104 MB |
583
- | Falcon `--count 1` | 10,000 / 10,000 | 0 | 429 MB | 440 MB |
584
-
585
- All three hold 10k idle conns without OOMing or dropping — the "MB-per-thread" intuition that thread-based servers can't reach this scale doesn't survive contact with Linux's demand-paged thread stacks plus Puma's reactor-based keep-alive handling. Per-conn RSS lands at ~14 KB (Hyperion fiber + parser state), ~7 KB (Puma reactor entry + tiny thread share), ~36 KB (Falcon Async::Task + protocol-http stack). Bounded, not unbounded — for all three.
586
-
587
- The architectural difference shows up under **load**, not at idle: Puma can only run `max_threads` handler invocations concurrently, so wait-bound handlers (DB, HTTP, Redis) starve at higher request concurrency than `max_threads`. Hyperion's fiber-per-connection model + `--async-io` gives one OS thread thousands of in-flight handler executions, paired with [hyperion-async-pg](https://github.com/exodusgaming-io/hyperion-async-pg) for non-blocking DB. The 10k-conn throughput row above (5× Puma) is the consequence — same idle RSS shape, very different behaviour once the handlers actually do work.
588
-
589
- **HTTP/2 multiplexing — 1 connection × 100 concurrent streams (handler sleeps 50 ms):**
590
-
591
- | | wall time |
592
- |---|---:|
593
- | Hyperion (per-stream fiber dispatch) | **1.04 s** |
594
- | Serial baseline (100 × 50 ms) | 5.00 s |
595
-
596
- Hyperion fans 100 in-flight streams across separate fibers within a single TCP connection. A serial server would take 5 s; the fiber-multiplexed result (1.04 s, ~96 req/s on one socket) is bounded by single-handler sleep time plus framing overhead. Puma has no native HTTP/2 path — production deployments terminate h2 at nginx and forward h1 to the worker pool, which serializes again.
597
-
598
- > **1.6.0 outbound write path** — `Http2Handler` no longer serializes every framer write through one `Mutex#synchronize { socket.write(...) }`. HPACK encoding (microseconds, in-memory) still serializes on a fast encode mutex, but the actual `socket.write` is owned by a dedicated per-connection writer fiber draining a queue. On per-connection multi-stream workloads where the kernel send buffer or peer reads are slow, encode work for ready streams overlaps the writer's flush of earlier chunks, instead of stacking up behind it. See `bench/h2_streams.sh` (`h2load -c 1 -m 100 -n 5000`) for a recipe to compare 1.5.0 vs 1.6.0 on a workload of your choice.
599
-
600
- ### Reproducing the benchmarks
601
-
602
- Every number in this README and `docs/BENCH_HYPERION_2_0.md` is reproducible. Operators who don't trust headline numbers (and you shouldn't trust *any* benchmark numbers without independent verification) can rerun the workloads on their own host and get their own honest measurements. Per-row reproduction commands:
603
-
604
- ```sh
605
- # Setup (once)
606
- bundle install
607
- bundle exec rake compile
63
+ ALPN auto-negotiates `h2` or `http/1.1` per connection. HTTP/2 multiplexes
64
+ streams onto fibers within a single connection — slow handlers don't
65
+ head-of-line-block other streams. Cluster-mode TLS works (`-w N` +
66
+ `--tls-cert` / `--tls-key`).
608
67
 
609
- # Hello-world (rps + p99 ceiling, no I/O)
610
- bundle exec bin/hyperion -p 9292 -w 16 -t 5 bench/hello.ru &
611
- wrk -t4 -c200 -d20s --latency http://127.0.0.1:9292/
68
+ Smuggling defenses for HTTP/1.1: `Content-Length` + `Transfer-Encoding`
69
+ together 400; non-chunked `Transfer-Encoding` 501; CRLF in response
70
+ header values `ArgumentError` (response-splitting guard).
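
The response-splitting guard is easy to picture as a standalone check. This is an illustrative sketch, not Hyperion's internal code — `assert_header_value!` is a hypothetical helper name:

```ruby
# Reject any header value containing CR or LF before it reaches the wire:
# an attacker-controlled value with "\r\n" could otherwise inject extra
# headers or an entire forged response.
def assert_header_value!(value)
  raise ArgumentError, 'CRLF in response header value' if value.match?(/[\r\n]/)
  value
end

assert_header_value!('text/html; charset=utf-8')   # passes through unchanged
begin
  assert_header_value!("ok\r\nSet-Cookie: pwned=1")
rescue ArgumentError => e
  e.message                                        # the guard fired
end
```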
612
71
 
613
- # CPU-bound JSON (per-request CPU savings visible)
614
- bundle exec bin/hyperion -p 9292 -w 4 -t 5 bench/work.ru &
615
- wrk -t4 -c200 -d15s --latency http://127.0.0.1:9292/
72
+ ### WebSockets (2.1.0+)
616
73
 
617
- # Static 1 MiB sendfile path
618
- ruby -e 'File.binwrite("/tmp/hyperion_bench_asset_1m.bin", "x" * (1024*1024))'
619
- bundle exec bin/hyperion -p 9292 -w 1 -t 5 bench/static.ru &
620
- wrk -t4 -c100 -d15s --latency http://127.0.0.1:9292/hyperion_bench_asset_1m.bin
74
+ RFC 6455 over Rack 3 full hijack, native frame codec, per-connection
75
+ wrapper with auto-pong, close handshake, UTF-8 validation, and per-message
76
+ size cap. **ActionCable + faye-websocket on a single binary** one
77
+ `hyperion -w 4 -t 10 config.ru` serves HTTP, HTTP/2, TLS, and `/cable`
78
+ from the same listener. Conformance: 463/463 autobahn-testsuite cases
79
+ pass. See [docs/WEBSOCKETS.md](docs/WEBSOCKETS.md).
621
80
 
622
- # SSE streaming (Hyperion-shaped rackup with explicit flush sentinel — see caveat in BENCH doc)
623
- bundle exec bin/hyperion -p 9292 -w 1 -t 5 bench/sse.ru &
624
- wrk -t1 -c1 -d10s http://127.0.0.1:9292/
81
+ ### gRPC (2.12-F+)
625
82
 
626
- # WebSocket multi-process throughput
627
- bundle exec bin/hyperion -p 9888 -w 4 -t 64 bench/ws_echo.ru &
628
- ruby bench/ws_bench_client_multi.rb --port 9888 --procs 4 --conns 200 --msgs 1000 --bytes 1024 --json
83
+ Hyperion's HTTP/2 path supports gRPC unary, server-streaming,
84
+ client-streaming, and bidirectional RPCs via the Rack 3 trailers contract:
85
+ any response body that defines `#trailers` gets a final HEADERS frame
86
+ (with `END_STREAM=1`) carrying the trailer map after the DATA frames.
87
+ Plain HTTP/2 traffic without the gRPC content-type keeps the unary
88
+ buffered semantics — no behaviour change for non-gRPC clients.
629
89
 
630
- # h2 native HPACK (Rails-shape, 25-header response)
631
- ./bench/h2_rails_shape.sh
90
+ A minimal unary handler:
632
91
 
633
- # Idle keep-alive RSS sweep (1k / 5k / 10k conns, 30s hold per server)
634
- ./bench/keepalive_memory.sh
92
+ ```ruby
93
+ class GrpcBody
94
+ def initialize(reply); @reply = reply; end
95
+ def each; yield @reply; end
96
+ def trailers; { 'grpc-status' => '0', 'grpc-message' => 'OK' }; end
97
+ def close; end
98
+ end
635
99
 
636
- # Hello-world quick comparator (Hyperion vs Puma vs Falcon)
637
- bundle exec ruby bench/compare.rb
638
- HYPERION_WORKERS=4 PUMA_WORKERS=4 FALCON_COUNT=4 bundle exec ruby bench/compare.rb
100
+ run ->(env) {
101
+ request = env['rack.input'].read
102
+ reply = handle(request)
103
+ [200, { 'content-type' => 'application/grpc' }, GrpcBody.new(reply)]
104
+ }
639
105
  ```
640
106
 
641
- PG benches (`pg_concurrent.ru`, `pg_mixed.ru`, `pg_realistic.ru`) live in the [hyperion-async-pg companion repo](https://github.com/andrew-woblavobla/hyperion-async-pg); they require a running Postgres + the companion gem and are not part of this repo. The 2.0.0 sweep used `~/bench/pg_concurrent.ru` on the bench host; reproduce by cloning hyperion-async-pg and following its README, or `scp` the rackup + DATABASE_URL.
107
+ Server-streaming yields one DATA frame per `each`; client-streaming
108
+ reads incoming frames off `env['rack.input']` (a streaming IO that
109
+ blocks until the next DATA frame lands); bidirectional interleaves
110
+ both. Reproducible bench at `bench/grpc_stream.{proto,ru}` +
111
+ `bench/grpc_stream_bench.sh` (ghz). Numbers in
112
+ [docs/BENCH_HYPERION_2_11.md](docs/BENCH_HYPERION_2_11.md#grpc-ghz-bench--hyperion-vs-falcon-async-grpc-214-d).
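
Extending the unary example, a server-streaming body under the same contract is just an `each` that yields more than once — one DATA frame per yield, trailers closing the stream. A sketch (gRPC length-prefix framing of each message is elided):

```ruby
class StreamingGrpcBody
  def initialize(messages)
    # messages: already length-prefix-framed gRPC payloads (framing elided here)
    @messages = messages
  end

  # Each yield becomes one HTTP/2 DATA frame on the stream.
  def each
    @messages.each { |m| yield m }
  end

  # Sent as the final HEADERS frame (END_STREAM=1) after the DATA frames.
  def trailers
    { 'grpc-status' => '0', 'grpc-message' => 'OK' }
  end

  def close; end
end

body = StreamingGrpcBody.new(%w[msg1 msg2 msg3])
frames = []
body.each { |f| frames << f }
frames   # => ["msg1", "msg2", "msg3"]
```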
642
113
 
643
- When numbers from your host don't match the published numbers, the most likely explanations (in order): (1) bench-host noise — single-VM benches drift 10-30% over days; (2) Puma version mismatch — the 2.0.0 sweep used Puma 8.0.1 in the `~/bench/Gemfile`, the hyperion repo's own Gemfile pins Puma `~> 6.4`; (3) different kernel / Ruby; (4) different `-t` / `-c` (apples-to-apples requires identical worker count, thread count, wrk concurrency, payload size, kernel, Ruby, TLS cipher).
114
+ ### `Server.handle` direct routes
644
115
 
645
- ## Quick start
116
+ Bypass the Rack adapter for hot paths:
646
117
 
647
- ```sh
648
- bundle install
649
- bundle exec rake compile # build the llhttp C ext
650
- bundle exec hyperion config.ru # single-process default
651
- bundle exec hyperion -w 4 -t 10 config.ru # 4-worker cluster, 10 threads each
652
- bundle exec hyperion -w 0 config.ru # 1 worker per CPU
653
- bundle exec hyperion --tls-cert cert.pem --tls-key key.pem -p 9443 config.ru # HTTPS
654
- curl http://127.0.0.1:9292/ # => hello
655
-
656
- # Chunked POST works:
657
- curl -X POST -H "Transfer-Encoding: chunked" --data-binary @file http://127.0.0.1:9292/
658
-
659
- # HTTP/2 (over TLS, ALPN-negotiated):
660
- curl --http2 -k https://127.0.0.1:9443/
118
+ ```ruby
119
+ Hyperion::Server.handle_static '/health', body: 'ok'
120
+ Hyperion::Server.handle(:GET, '/v1/ping') { |env| [200, {}, ['pong']] }
661
121
  ```
662
122
 
663
- `bundle exec rake spec` (and the `default` task) auto-invoke `compile`, so a fresh checkout just needs `bundle install && bundle exec rake` to get a green run.
664
-
665
- **Migrating from Puma?** See [docs/MIGRATING_FROM_PUMA.md](docs/MIGRATING_FROM_PUMA.md).
123
+ `handle_static` bakes the response at boot and serves from the C accept
124
+ loop (134k r/s with io_uring, 16k r/s on accept4). The dynamic block
125
+ form (2.14-A) runs `app.call(env)` on the C accept loop too — accept +
126
+ recv + parse + write release the GVL while the block holds it, so
127
+ multi-threaded workers actually parallelise.
128
+
129
+ ### Pre-fork cluster
130
+
131
+ Per-OS worker model: `SO_REUSEPORT` on Linux (kernel-balanced accept,
132
+ 1.004–1.011 max/min ratio across workers under steady load — 2.12-E
133
+ audit), master-bind + worker-fd-share on macOS/BSD where Darwin's
134
+ `SO_REUSEPORT` doesn't load-balance. Lifecycle hooks (`before_fork`,
135
+ `on_worker_boot`, `on_worker_shutdown`) for AR / Redis / pool init.
136
+
137
+ ### Async I/O (PG-bound apps)
138
+
139
+ `--async-io` runs plain HTTP/1.1 connections under `Async::Scheduler`,
140
+ turning one OS thread into thousands of in-flight handler invocations.
141
+ Paired with [hyperion-async-pg](https://github.com/andrew-woblavobla/hyperion-async-pg)
142
+ on a `pg_sleep(50ms)` workload, single-worker `pool=200` hits **2,381 r/s**
143
+ vs Puma `-t 5` at 56 r/s (architectural ceiling: pool size, not thread
144
+ count). Three things must all be true: `--async-io`, `hyperion-async-pg`
145
+ loaded, and a fiber-aware pool (`Hyperion::AsyncPg::FiberPool`,
146
+ `async-pool`, or `Async::Semaphore` — **not** the `connection_pool` gem,
147
+ whose `Mutex` blocks the OS thread). Skip any one and you get parity
148
+ with Puma.
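
Wired together, the three requirements look roughly like this in `config/hyperion.rb`. A sketch only — the exact `FiberPool` constructor arguments are assumptions; check the hyperion-async-pg README for the current API:

```ruby
# config/hyperion.rb
async_io true                        # 1. fiber dispatch on plain HTTP/1.1

on_worker_boot do |_worker_index|
  require 'hyperion/async_pg'
  Hyperion::AsyncPg.install!         # 2. non-blocking PG driver patch

  # 3. fiber-aware pool — NOT the connection_pool gem, whose Mutex
  #    blocks the OS thread. Constructor shape is illustrative;
  #    prefilling at boot avoids the lazy-init cold-start spike.
  $pg_pool = Hyperion::AsyncPg::FiberPool.new(size: 200) do
    PG.connect(ENV.fetch('DATABASE_URL'))
  end
  $pg_pool.fill
end
```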
149
+
150
+ ### Observability
151
+
152
+ `/-/metrics` Prometheus endpoint (admin-token guarded), per-route
153
+ latency histograms, per-conn fairness rejections, WebSocket
154
+ permessage-deflate ratio, kTLS active connections, ThreadPool queue
155
+ depth, dispatch-mode counters (Rack / `handle_static` / dynamic block /
156
+ h2 / async-io). Pre-built Grafana dashboard at
157
+ [docs/grafana/hyperion-2.4-dashboard.json](docs/grafana/hyperion-2.4-dashboard.json).
158
+ Full reference: [docs/OBSERVABILITY.md](docs/OBSERVABILITY.md).
159
+
160
+ Default-ON structured access logs (one JSON or text line per request)
161
+ with hot-path optimisations: per-thread cached iso8601 timestamp,
162
+ hand-rolled line builder, lock-free per-thread 4 KiB write buffer.
163
+ 12-factor logger split: `info`/`debug` → stdout, `warn`/`error`/`fatal`
164
+ → stderr.
165
+
166
+ ### Optional io_uring accept loop
167
+
168
+ Linux 5.x+, opt-in via `HYPERION_IO_URING_ACCEPT=1`. Multishot accept
169
+ + per-conn RECV/WRITE/CLOSE state machine on top of liburing. One
170
+ `io_uring_enter` per N requests instead of N×3 syscalls. Compiles out
171
+ cleanly without liburing — the `accept4` path stays the fallback.
172
+ macOS keeps using `accept4`. Default-flip moves to 2.15 with a fresh
173
+ 24h soak.
666
174
 
667
175
  ## Configuration
668
176
 
669
- Three layers, in precedence order: explicit CLI flag > environment variable > `config/hyperion.rb` > built-in default.
177
+ Three layers, in precedence order: explicit CLI flag > environment
178
+ variable > `config/hyperion.rb` > built-in default.
670
179
 
671
- ### CLI flags
180
+ ### Most-used CLI flags
672
181
 
673
182
  | Flag | Default | Notes |
674
183
  |---|---|---|
675
184
  | `-b, --bind HOST` | `127.0.0.1` | |
676
185
  | `-p, --port PORT` | `9292` | |
677
186
  | `-w, --workers N` | `1` | `0` → `Etc.nprocessors` |
678
- | `-t, --threads N` | `5` | OS-thread Rack handler pool per worker. `0` → run inline (no pool, debugging only). |
187
+ | `-t, --threads N` | `5` | OS-thread Rack handler pool per worker. `0` → run inline (debugging). |
679
188
  | `-C, --config PATH` | `config/hyperion.rb` if present | Ruby DSL file. |
680
- | `--tls-cert PATH` | nil | PEM certificate. |
681
- | `--tls-key PATH` | nil | PEM private key. |
682
- | `--log-level LEVEL` | `info` | `debug` / `info` / `warn` / `error` / `fatal`. |
683
- | `--log-format FORMAT` | `auto` | `text` / `json` / `auto`. Auto: JSON when `RAILS_ENV`/`RACK_ENV` is `production`/`staging`, colored text on TTY, JSON otherwise. |
684
- | `--[no-]log-requests` | ON | Per-request access log. |
685
- | `--fiber-local-shim` | off | Patches `Thread#thread_variable_*` to fiber storage for older Rails idioms. |
686
- | `--[no-]yjit` | auto | Force YJIT on/off. Default: auto-on under `RAILS_ENV`/`RACK_ENV` = `production`/`staging`. |
687
- | `--[no-]async-io` | off | Run plain HTTP/1.1 connections under `Async::Scheduler`. Required for `hyperion-async-pg` on plain HTTP. TLS h1 / HTTP/2 always run under the scheduler regardless. |
688
- | `--max-body-bytes BYTES` | `16777216` (16 MiB) | Maximum request body size. |
689
- | `--max-header-bytes BYTES` | `65536` (64 KiB) | Maximum total request-header size. |
690
- | `--max-pending COUNT` | unbounded | Per-worker accept-queue cap before new connections are rejected with HTTP 503 + `Retry-After: 1`. |
691
- | `--max-request-read-seconds SECONDS` | `60` | Total wallclock budget for reading request line + headers + body for ONE request. Slowloris defence. |
692
- | `--admin-token TOKEN` | unset | Bearer token for `POST /-/quit` and `GET /-/metrics`. **Production: prefer `--admin-token-file` — argv is visible via `ps`.** |
693
- | `--admin-token-file PATH` | unset | Read the admin token from a file. Refuses to load if the file is missing or world-readable (mode must mask `0o007`). |
694
- | `--worker-max-rss-mb MB` | unset | Master gracefully recycles a worker once its RSS exceeds this many megabytes. nil = disabled. |
695
- | `--idle-keepalive SECONDS` | `5` | Keep-alive idle timeout. Connection closes after this many seconds of inactivity. |
696
- | `--graceful-timeout SECONDS` | `30` | Shutdown deadline before SIGKILL is delivered to a worker that hasn't drained. |
189
+ | `--tls-cert PATH` / `--tls-key PATH` | nil | PEM cert + key for HTTPS. |
190
+ | `--[no-]async-io` | off | Run plain HTTP/1.1 under `Async::Scheduler`. Required for `hyperion-async-pg` on plain HTTP. |
191
+ | `--preload-static DIR` | nil | Preload static assets from DIR at boot (repeatable, immutable). Rails apps auto-detect from `Rails.configuration.assets.paths`. |
192
+ | `--admin-token-file PATH` | unset | Auth file for `/-/quit` and `/-/metrics`. Refuses world-readable files. |
193
+ | `--worker-max-rss-mb MB` | unset | Master gracefully recycles a worker exceeding MB RSS. |
194
+ | `--max-pending COUNT` | unbounded | Per-worker accept-queue cap before HTTP 503 + `Retry-After: 1`. |
195
+ | `--idle-keepalive SECONDS` | `5` | Keep-alive idle timeout. |
196
+ | `--graceful-timeout SECONDS` | `30` | Shutdown deadline before SIGKILL. |
197
+
198
+ `bin/hyperion --help` prints the full set, including `--max-body-bytes`,
199
+ `--max-header-bytes`, `--max-request-read-seconds` (slowloris defence),
200
+ `--h2-max-total-streams`, `--max-in-flight-per-conn`,
201
+ `--tls-handshake-rate-limit`, and the `--[no-]yjit` /
202
+ `--[no-]log-requests` toggles.
697
203
 
698
204
  ### Environment variables
699
205
 
700
- `HYPERION_LOG_LEVEL`, `HYPERION_LOG_FORMAT`, `HYPERION_LOG_REQUESTS` (`0|1|true|false|yes|no|on|off`), `HYPERION_ENV`, `HYPERION_WORKER_MODEL` (`share|reuseport`).
206
+ `HYPERION_LOG_LEVEL`, `HYPERION_LOG_FORMAT`, `HYPERION_LOG_REQUESTS`
207
+ (`0|1|true|false|yes|no|on|off`), `HYPERION_ENV`,
208
+ `HYPERION_WORKER_MODEL` (`share|reuseport`), `HYPERION_IO_URING_ACCEPT`
209
+ (`0|1`), `HYPERION_H2_DISPATCH_POOL`, `HYPERION_H2_NATIVE_HPACK`
210
+ (`v2|ruby|off`), `HYPERION_H2_TIMING`.
701
211
 
702
212
  ### Config file
703
213
 
704
- `config/hyperion.rb` — same shape as Puma's `puma.rb`. Auto-loaded if present.
214
+ `config/hyperion.rb` — same shape as Puma's `puma.rb`. Auto-loaded if
215
+ present. Strict DSL: unknown methods raise `NoMethodError` at boot.
705
216
 
706
217
  ```ruby
707
218
  # config/hyperion.rb
@@ -718,16 +229,11 @@ read_timeout 30
718
229
  idle_keepalive 5
719
230
  graceful_timeout 30
720
231
 
721
- max_header_bytes 64 * 1024
722
- max_body_bytes 16 * 1024 * 1024
723
-
724
232
  log_level :info
725
233
  log_format :auto
726
234
  log_requests true
727
235
 
728
- fiber_local_shim false
729
-
730
- async_io nil # Three-way (1.4.0+): nil (default, auto: inline-on-fiber for TLS h1, pool hop for plain HTTP/1.1), true (force inline-on-fiber everywhere — required for hyperion-async-pg on plain HTTP/1.1), false (force pool hop everywhere — explicit opt-out for TLS+threadpool with CPU-heavy handlers). ~5% throughput hit on hello-world when inline; in exchange one OS thread serves N concurrent in-flight DB queries on wait-bound workloads. TLS / HTTP/2 accept loops always run under Async::Scheduler regardless of this flag.
236
+ async_io nil # nil = auto (1.4.0+), true = inline-on-fiber everywhere, false = pool everywhere
731
237
 
732
238
  before_fork do
733
239
  ActiveRecord::Base.connection_handler.clear_all_connections! if defined?(ActiveRecord)
@@ -736,199 +242,202 @@ end
736
242
  on_worker_boot do |worker_index|
737
243
  ActiveRecord::Base.establish_connection if defined?(ActiveRecord)
738
244
  end
739
-
740
- on_worker_shutdown do |worker_index|
741
- ActiveRecord::Base.connection_handler.clear_all_connections! if defined?(ActiveRecord)
742
- end
743
245
  ```
744
246
 
745
- Strict DSL: unknown methods raise `NoMethodError` at boot — typos surface immediately rather than getting silently ignored.
746
-
747
- A documented sample lives at [`config/hyperion.example.rb`](config/hyperion.example.rb).
247
+ A documented sample lives at
248
+ [`config/hyperion.example.rb`](config/hyperion.example.rb).
748
249
 
749
250
  ## Operator guidance
750
251
 
751
- Concrete tradeoffs distilled from [`docs/BENCH_2026_04_27.md`](docs/BENCH_2026_04_27.md). If the bench numbers cited below feel surprising, check that doc for the full matrix + caveats.
252
+ Distilled from [docs/BENCH_2026_04_27.md](docs/BENCH_2026_04_27.md)
253
+ (Rails 8.1 real-app sweep). Headline finding: **the simplest drop-in
254
+ is the right answer.**
752
255
 
753
- ### When to use `-w N`
256
+ ### Migrating from Puma
754
257
 
755
- | Workload shape | Recommended | Why |
756
- |---|---|---|
757
- | **Pure I/O-bound** (PG / Redis / external HTTP, no significant CPU) | `-w 1` + larger pool | Bench: `-w 1 pool=200` = 87 MB / 2,180 r/s vs `-w 4 pool=64` = 224 MB / 1,680 r/s. **2.6× more memory, 0.77× rps** if you pick multi-worker on a wait-bound workload. |
758
- | **Pure CPU-bound** (heavy JSON / template render / image processing) | `-w N` matching CPU count | Each worker's accept loop is single-threaded under `--async-io`; multi-worker gives CPU-parallelism. Bench: `-w 16 -t 5` hits 98,818 r/s on a 16-vCPU box, 4.7× a `-w 1` ceiling on the same hardware. |
759
- | **Mixed** (Rails-shaped: ~5 ms CPU + 50 ms PG wait per request) | `-w N/2` (half cores) + medium pool | Lets CPU work parallelise while keeping per-worker memory tractable. Bench `pg_mixed.ru` (in hyperion-async-pg repo / `~/bench/`) at `-w 4 -t 5 pool=128` = 1,740 r/s with no cold-start spike (ForkSafe `prefill_in_child: true`). |
760
-
761
- Multi-worker on PG-wait workloads is the **wrong** default for most apps — the headline rps doesn't justify the memory and PG-connection cost. Verify your shape with the bench before scaling out.
762
-
763
- ### When to use `--async-io`
764
-
765
- ```
766
- Are you using a fiber-cooperative I/O library?
767
- (hyperion-async-pg, async-redis, async-http)
768
-
769
- ┌─────────────┴─────────────┐
770
- yes no
771
- │ │
772
- Pair with a fiber-aware Leave --async-io OFF.
773
- connection pool Default thread-pool dispatch
774
- (FiberPool, async-pool — is faster for synchronous
775
- NOT connection_pool gem, Rails apps. Bench: --async-io
776
- which uses non-fiber Mutex). on hello-world = 47% rps
777
- │ regression + p99 spike to
778
- Set --async-io. 3.65 s under no-yield workloads.
779
- Pool size is the real No reason to flip the flag.
780
- concurrency knob; -t is
781
- decorative for wait-bound.
782
- ```
258
+ `hyperion -t N -w M` matching your current Puma `-t N:N -w M`. No other
259
+ flags. Versus Puma at the same `-t/-w` shape on real Rails endpoints:
260
+ **+9% rps on lightweight endpoints, 28× lower p99 on health-style
261
+ endpoints, 3× lower p99 on PG-touching endpoints.** Same RSS, same
262
+ operator surface: keep all your existing config, monitoring, deploy
263
+ scripts.
783
264
 
784
- Hyperion warns at boot if you set `--async-io` without any fiber-cooperative library loaded. The setting is still honoured; the warn just nudges operators who flipped it expecting a free perf bump.
265
+ ### Knobs that help on synthetic benches but **not** on real Rails
785
266
 
786
- ### Tuning `-t` and pool sizes
267
+ | Knob | Synthetic | Real Rails | Recommendation |
268
+ |---|---|---|---|
269
+ | `-t 30` | +5–10% on hello-world | **Hurts** p99 vs `-t 10` (3.51 s vs 148 ms on `/up`) — GVL + middleware Mutex contention | Stay at `-t 10`. |
270
+ | `--yjit` | +5–10% on CPU-bound | Wash on dev-mode Rails | Skip until you bench production-mode. |
271
+ | `RAILS_POOL > 25` | n/a | No improvement at 50 or 100 | Keep your existing AR pool. |
272
+ | `--async-io` | 33–42× rps on PG-bound | **Worse** than drop-in (4.14 s p99 on `/up`) until your full I/O stack is fiber-cooperative | Don't enable until `redis-rb` → `async-redis`. |
787
273
 
788
- - **Without `--async-io`** (sync server, default): `-t` is the concurrency knob. Each in-flight request holds an OS thread; pool size should match `-t`. Bench shows Puma-style behaviour — at 200 wrk conns hitting a 5-thread server, queue depth dominates p99 (Hyperion `-t 5 -w 1` p50 = 0.95 ms vs Puma's same shape at 59.5 ms — Hyperion's queueing is cheaper but the model still serializes at `-t`).
789
- - **With `--async-io` + a fiber-aware pool**: pool size is the concurrency knob. `-t` is decorative for wait-bound workloads; one accept-loop fiber serves all in-flight queries via the pool. Linear scaling: pool=64 → ~780 r/s, pool=128 → ~1,344 r/s, pool=200 → ~2,180 r/s on 50 ms PG queries.
790
- - **Pool over WAN**: if `PG.connect` round-trip is >50 ms, expect pool fill at startup to take `pool_size / parallel_fill_threads × RTT`. `hyperion-async-pg 0.5.1+` auto-scales `parallel_fill_threads` so pool=200 fills in ~1-2 s.
274
+ ### When `-w N` helps
791
275
 
792
- ### How to read p50 vs p99
276
+ | Workload | Recommended | Why |
277
+ |---|---|---|
278
+ | Pure I/O-bound (PG / Redis / external HTTP) | `-w 1` + larger pool | `-w 1 pool=200` = 87 MB / 2,180 r/s vs `-w 4 pool=64` = 224 MB / 1,680 r/s. **2.6× memory, 0.77× rps** if you pick multi-worker on wait-bound. |
279
+ | Pure CPU-bound | `-w N` matching CPU count | Bench: `-w 16 -t 5` hits 98,818 r/s on a 16-vCPU box. |
280
+ | Mixed (Rails-shaped, ~5 ms CPU + 50 ms wait) | `-w N/2` (half cores) + medium pool | `-w 4 -t 5 pool=128` = 1,740 r/s on `pg_mixed.ru`, no cold-start spike. |
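The trade-off in the first row is plain arithmetic; a two-line check using the figures from the table above:

```ruby
# Figures from the "Pure I/O-bound" row above.
mem_single,  rps_single  = 87.0, 2_180.0    # -w 1 pool=200
mem_cluster, rps_cluster = 224.0, 1_680.0   # -w 4 pool=64

puts format('%.1f× memory, %.2f× rps',
            mem_cluster / mem_single,       # 2.6× the RSS
            rps_cluster / rps_single)       # for 0.77× the throughput
```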
793
281
 
794
- Tail latency tells the queueing story; rps tells the throughput story. Hyperion's tail wins are **always** bigger than its rps wins — sometimes the rps numbers look close to a competitor while p99 is 5-200× lower:
282
+ ### Read p99, not the mean
795
283
 
796
284
  | Workload | Hyperion rps / p99 | Closest competitor | rps ratio | p99 ratio |
797
285
  |---|---|---|---:|---:|
798
- | Hello `-w 4` | 21,215 r/s / 1.87 ms | Falcon 24,061 / 9.78 ms | 0.88× | **5.2× lower** |
799
- | CPU JSON `-w 4` | 15,582 r/s / 2.47 ms | Falcon 18,643 / 13.51 ms | 0.84× | **5.5× lower** |
800
- | Static 1 MiB | 1,919 r/s / 4.22 ms | Puma 2,074 / 55 ms | 0.93× | **13× lower** |
801
- | PG-wait `-w 1` pool=200 | 2,180 r/s / 668 ms | Puma 530 r/s + 200 timeouts | **4.1×** | qualitative crush |
286
+ | Hello `-w 4` | 21,215 / 1.87 ms | Falcon 24,061 / 9.78 ms | 0.88× | **5.2× lower** |
287
+ | CPU JSON `-w 4` | 15,582 / 2.47 ms | Falcon 18,643 / 13.51 ms | 0.84× | **5.5× lower** |
288
+ | Static 1 MiB | 1,919 / 4.22 ms | Puma 2,074 / 55 ms | 0.93× | **13× lower** |
289
+ | PG-wait `-w 1` pool=200 | 2,180 / 668 ms | Puma 530 + 200 timeouts | **4.1×** | qualitative crush |
802
290
 
803
- **Size capacity by p99, not by mean.** Throughput peaks are easy to fake under controlled bench conditions; tail latency reflects what your slowest user actually experiences when the load balancer fans them onto a busy worker.
804
-
805
- ### Production tuning (real Rails apps)
806
-
807
- Distilled from a real-app bench against the [Exodus platform](https://github.com/andrew-woblavobla/hyperion/blob/master/docs/BENCH_2026_04_27.md) (Rails 8.1, on-LAN PG + Redis at ~0.3 ms RTT, `-w 4 -t 10`, `wrk -t8 -c200 -d30s`). The headline finding: the **simplest drop-in is the right answer**, and the additional knobs operators reach for first don't help on real Rails.
808
-
809
- **Recommended for migrating from Puma**: `hyperion -t N -w M` matching your current Puma `-t N:N -w M`. No other flags. That gives you (vs Puma at the same `-t/-w`):
810
-
811
- - **+9% rps on lightweight endpoints** (matches the 5-10% per-request CPU savings the rest of the bench section documents).
812
- - **28× lower p99 on health-style endpoints** — the queue-of-doom shape Puma exhibits under sustained 200-conn load doesn't reproduce on Hyperion's worker-owns-connection model.
813
- - **3.8× lower p99 on PG-touching endpoints**.
814
- - **Same RSS, same operator surface** — you keep all your existing config, monitoring, and deploy scripts.
815
-
816
- **Knobs that help on synthetic benches but NOT on real Rails — leave them off:**
817
-
818
- | Knob | Synthetic bench result | Real Rails result | Recommendation |
819
- |---|---|---|---|
820
- | `-t 30` (more threads/worker) | Helped Hyperion 5-10% on hello-world | **Hurt** p99 vs `-t 10` on real Rails (3.51 s vs 148 ms on /up) — GVL + middleware Mutex contention dominates past `-t 10` | Stay at `-t 10`. Match Puma's recommended `RAILS_MAX_THREADS`. |
821
- | `--yjit` | 5-10% on synthetic CPU-bound | Wash on dev-mode Rails (312 vs 328 rps, p99 worse with YJIT) | Skip for now. Production-mode Rails may behave differently — verify with your own bench before flipping. |
822
- | `RAILS_POOL` > 25 | n/a | No improvement at pool=50 or pool=100 on real Rails (rps within 3%, p99 within noise). Pool starvation is rarely the bottleneck on a `-w 4 -t 10` config | Keep your existing AR pool size. |
823
- | `--async-io` | 33-42× rps on PG-bound (with `hyperion-async-pg`) | **Worse** than drop-in on real Rails (4.14 s p99 on /up vs 148 ms drop-in) | **Don't enable** until your full I/O stack is fiber-cooperative. The synchronous Redis client (`redis-rb`) blocks the OS thread before async-pg can yield, so fibers can't compound. Migrate to `async-redis` *first*, then revisit. |
824
- | `--async-io` + `hyperion-async-pg` AR adapter | Verified 48× rps lift on a single-PG-query bench | Marginal-or-negative on real Rails (similar reason: Redis-first handlers don't yield) | Same — wait for a full-async I/O stack. |
825
-
826
- **Why the simple drop-in wins on real Rails:** the per-request budget on a real handler is dominated by the Rails middleware chain (rack-attack, locale redirect, tagger, etc.) + handler logic + DB + cache I/O. Hyperion's per-request CPU optimizations (C-ext header parser, response builder, lock-free metrics, fiber-cooperative TLS dispatch in 1.4.0+) shave ~5-10% off the *non-I/O* portion of the budget consistently — and the [worker-owns-connection model](#concurrency-at-scale-architectural-advantages) prevents the queue-amplification that Puma's thread-pool dispatch shows under sustained load. You don't need to "tune" anything to get those.
291
+ Throughput peaks are easy to fake under controlled conditions; tail
292
+ latency reflects what your slowest user actually experiences when the
293
+ load balancer fans them onto a busy worker.
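A toy sample makes the point concrete (illustrative numbers, not from the benches above): a handful of queued requests barely move the mean while dominating the tail.

```ruby
# 98 fast requests at 2 ms plus 2 queued requests at 400 ms.
samples = Array.new(98, 2.0) + [400.0, 400.0]

mean = samples.sum / samples.size                    # 9.96 ms, looks healthy
p99  = samples.sort[(samples.size * 0.99).ceil - 1]  # 400.0 ms, the queue is visible

puts format('mean=%.2f ms  p99=%.1f ms', mean, p99)
```

Sizing by the mean here would hide a 400 ms experience that 1 in 50 users actually gets.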
827
294
 
828
295
  ## Logging
829
296
 
830
- Default behaviour (rc16+):
831
-
832
- - **`info` / `debug` → stdout**, **`warn` / `error` / `fatal` → stderr** (12-factor).
833
- - **One structured access-log line per response**, info level, on stdout. Disable with `--no-log-requests` or `HYPERION_LOG_REQUESTS=0`.
834
- - **Format auto-selects**: production envs → JSON (line-delimited, parseable by every log aggregator); TTY → coloured text; piped output without env hint → JSON.
297
+ Default behaviour:
835
298
 
836
- ### Sample access log lines
299
+ - `info`/`debug` stdout, `warn`/`error`/`fatal` → stderr (12-factor).
300
+ - One structured access-log line per response, `info` level. Disable
301
+ with `--no-log-requests` or `HYPERION_LOG_REQUESTS=0`.
302
+ - Format auto-selects: `RAILS_ENV=production`/`staging` → JSON; TTY →
303
+ coloured text; piped output without env hint → JSON.
837
304
 
838
- Text format (TTY default):
305
+ Sample text (TTY default):
839
306
 
840
307
  ```
841
308
  2026-04-26T18:40:04.112Z INFO [hyperion] message=request method=GET path=/api/v1/health status=200 duration_ms=46.63 remote_addr=127.0.0.1 http_version=HTTP/1.1
842
- 2026-04-26T18:40:04.123Z INFO [hyperion] message=request method=GET path=/api/v1/cached_data query="currency=USD" status=200 duration_ms=43.87 remote_addr=127.0.0.1 http_version=HTTP/1.1
843
309
  ```
844
310
 
845
- JSON format (auto-selected on `RAILS_ENV=production`/`staging` or piped output):
311
+ Sample JSON (production / piped):
846
312
 
847
313
  ```json
848
314
  {"ts":"2026-04-26T18:38:49.405Z","level":"info","source":"hyperion","message":"request","method":"GET","path":"/api/v1/health","status":200,"duration_ms":46.63,"remote_addr":"127.0.0.1","http_version":"HTTP/1.1"}
849
- {"ts":"2026-04-26T18:38:49.411Z","level":"info","source":"hyperion","message":"request","method":"GET","path":"/api/v1/cached_data","query":"currency=USD","status":200,"duration_ms":40.64,"remote_addr":"127.0.0.1","http_version":"HTTP/1.1"}
850
315
  ```
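Because each JSON line is a self-contained object, no custom parser is needed downstream; a few lines of plain Ruby (illustrative, using the sample line above) can filter the stream:

```ruby
require 'json'

# One line of the JSON access log, exactly as emitted above.
line  = '{"ts":"2026-04-26T18:38:49.405Z","level":"info","source":"hyperion","message":"request","method":"GET","path":"/api/v1/health","status":200,"duration_ms":46.63,"remote_addr":"127.0.0.1","http_version":"HTTP/1.1"}'
event = JSON.parse(line)

puts "#{event['method']} #{event['path']} -> #{event['status']} (#{event['duration_ms']} ms)"
```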
851
316
 
852
- ### Hot-path optimisations
853
-
854
- The default-ON access log path is engineered to stay near-zero cost:
855
-
856
- - **Per-thread cached `iso8601(3)` timestamp** — one allocation per millisecond per thread, reused across all requests in that millisecond.
857
- - **Hand-rolled single-interpolation line builder** — bypasses generic `Hash#map.join`.
858
- - **Per-thread 4 KiB write buffer** — flushes to stdout when full or on connection close. Cuts ~32× the syscalls under load.
859
- - **Lock-free emit** — POSIX `write(2)` is atomic for writes ≤ PIPE_BUF (4096 B); a log line is ~200 B. No logger mutex.
860
-
861
317
  ## Metrics
862
318
 
863
- `Hyperion.stats` returns a snapshot Hash with the following counters (lock-free per-thread aggregation):
864
-
865
- | Counter | Meaning |
866
- |---|---|
867
- | `connections_accepted` | Lifetime accept count. |
868
- | `connections_active` | Currently in-flight connections. |
869
- | `requests_total` | Lifetime request count. |
870
- | `requests_in_flight` | Currently in-flight requests. |
871
- | `responses_<code>` | One counter per status code emitted (`responses_200`, `responses_400`, …). |
872
- | `parse_errors` | HTTP parse failures → 400. |
873
- | `app_errors` | Rack app raised → 500. |
874
- | `read_timeouts` | Per-connection read deadline hit. |
875
- | `requests_threadpool_dispatched` | HTTP/1.1 connection handed to the worker pool (or served inline in `start_raw_loop` when `thread_count: 0`). The default dispatch path. |
876
- | `requests_async_dispatched` | HTTP/1.1 connection served inline on the accept-loop fiber under `--async-io`. Operators can use the ratio against `requests_threadpool_dispatched` to verify fiber-cooperative I/O is actually engaged. |
877
-
878
- ```ruby
879
- require 'hyperion'
880
- Hyperion.stats
881
- # => {connections_accepted: 1234, connections_active: 7, requests_total: 8910, …}
882
- ```
319
+ `Hyperion.stats` returns a snapshot Hash with lock-free per-thread
320
+ counters (`connections_accepted`, `connections_active`, `requests_total`,
321
+ `requests_in_flight`, `responses_<code>`, `parse_errors`, `app_errors`,
322
+ `read_timeouts`, `requests_threadpool_dispatched`,
323
+ `requests_async_dispatched`, `c_loop_requests_total`).
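The per-thread aggregation can be pictured with a pure-Ruby sketch (illustrative only; Hyperion's real counters live in the C extension): each thread increments a private shard with no shared lock on the hot path, and a snapshot merges the shards on read.

```ruby
# Sketch of the per-thread counter-shard pattern, not Hyperion's code.
class ThreadShardedCounters
  def initialize
    @shards = {}          # Thread => Hash shard
    @mutex  = Mutex.new   # guards shard registration only, never increments
  end

  def increment(key, by = 1)
    shard = Thread.current[:counter_shard] ||= begin
      h = Hash.new(0)
      @mutex.synchronize { @shards[Thread.current] = h }
      h
    end
    shard[key] += by      # uncontended: only this thread writes this shard
  end

  def snapshot
    @mutex.synchronize do
      @shards.values.each_with_object(Hash.new(0)) do |shard, out|
        shard.each { |k, v| out[k] += v }
      end
    end
  end
end

counters = ThreadShardedCounters.new
4.times.map { Thread.new { 1000.times { counters.increment(:requests_total) } } }
       .each(&:join)
puts counters.snapshot[:requests_total]  # => 4000
```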
883
324
 
884
- ### Prometheus exporter
885
-
886
- When `admin_token` is set in your config, Hyperion mounts a `/-/metrics` endpoint that emits Prometheus text-format v0.0.4. Same token guards both `/-/metrics` (GET) and `/-/quit` (POST); auth is via the `X-Hyperion-Admin-Token` header.
325
+ When `admin_token` is set, `/-/metrics` emits Prometheus text-format
326
+ v0.0.4. Auth is via the `X-Hyperion-Admin-Token` header (same token
327
+ guards `POST /-/quit`):
887
328
 
888
329
  ```sh
889
330
  $ curl -s -H 'X-Hyperion-Admin-Token: secret' http://127.0.0.1:9292/-/metrics
890
331
  # HELP hyperion_requests_total Total HTTP requests handled
891
332
  # TYPE hyperion_requests_total counter
892
333
  hyperion_requests_total 8910
893
- # HELP hyperion_bytes_written_total Total bytes written to response sockets
894
- # TYPE hyperion_bytes_written_total counter
895
- hyperion_bytes_written_total 2351023
896
- # HELP hyperion_responses_status_total Responses by HTTP status code
897
- # TYPE hyperion_responses_status_total counter
898
334
  hyperion_responses_status_total{status="200"} 8521
899
335
  hyperion_responses_status_total{status="404"} 12
900
- hyperion_responses_status_total{status="500"} 3
901
- # … and so on for sendfile_responses_total, rejected_connections_total,
902
- # slow_request_aborts_total, requests_async_dispatched_total, etc.
903
336
  ```
904
337
 
905
- Any counter not in the known set (added by app middleware via `Hyperion.metrics.increment(:custom_thing)`) is auto-exported as `hyperion_custom_thing` with a generic HELP line — no Hyperion config change required.
338
+ Any counter not in the known set (added via
339
+ `Hyperion.metrics.increment(:custom_thing)`) is auto-exported as
340
+ `hyperion_custom_thing` with a generic HELP line. Network-isolate the
341
+ admin endpoints if the listener is internet-facing — see
342
+ [docs/REVERSE_PROXY.md](docs/REVERSE_PROXY.md) for the nginx
343
+ `location /-/ { return 404; }` recipe.
344
+
345
+ ## Compatibility
346
+
347
+ | Component | Version |
348
+ |---|---|
349
+ | Ruby | 3.3+ (transitive `protocol-http2 ~> 0.26` floor) |
350
+ | Rack | 3.x |
351
+ | Rails | verified up to 8.1 |
352
+ | Linux kernel | 5.x+ for io_uring opt-in; 4.x+ otherwise |
353
+ | macOS | works (TLS, h2, WebSockets, `accept4` fallback path) |
906
354
 
907
- Point your scraper at it: in Prometheus' `scrape_configs`, set `metrics_path: /-/metrics` and `bearer_token` (or use a custom header relabel — Prometheus 2.42+ supports `authorization.credentials_file` paired with a custom `header` block). Network-isolate the admin endpoints if the listener is internet-facing — see [docs/REVERSE_PROXY.md](docs/REVERSE_PROXY.md) for the nginx `location /-/ { return 404; }` recipe.
355
+ Per the Rack 3 spec, Hyperion auto-sets `SERVER_SOFTWARE`, `rack.version`,
356
+ `REMOTE_ADDR`, IPv6-safe `Host` parsing, CRLF guard. The
357
+ `Hyperion::FiberLocal.install!` opt-in shim handles the residual
358
+ `Thread.current.thread_variable_*` footgun in older Rails idioms;
359
+ modern Rails 7.1+ already uses Fiber storage natively.
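The footgun is easy to demonstrate in plain Ruby: `Thread.current[:key]` is fiber-local on modern Ruby, while `thread_variable_get`/`set` are shared across every fiber of the thread, so request-scoped state stored in thread variables leaks between fibers serving different requests.

```ruby
# Fiber-local vs thread-variable storage on Ruby 3.x.
Thread.new do
  Thread.current[:req_id] = 'A'                   # fiber-local
  Thread.current.thread_variable_set(:req_id, 'A') # shared per-thread
  Fiber.new do
    p Thread.current[:req_id]                      # nil  (isolated per fiber)
    p Thread.current.thread_variable_get(:req_id)  # "A"  (leaks across fibers)
  end.resume
end.join
```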
908
360
 
909
- ## TLS + HTTP/2
361
+ ## Reproducing benchmarks
910
362
 
911
- Provide a PEM cert + key:
363
+ Every number in this README is reproducible. Per-row commands:
912
364
 
913
365
  ```sh
914
- bundle exec hyperion --tls-cert config/cert.pem --tls-key config/key.pem -p 9443 config.ru
915
- ```
366
+ # Setup (once)
367
+ bundle install
368
+ bundle exec rake compile
916
369
 
917
- ALPN auto-negotiates `h2` (HTTP/2) or `http/1.1` per connection. HTTP/2 multiplexes streams onto fibers within a single connection — slow handlers don't head-of-line-block other streams. Cluster-mode TLS works (`-w N` + `--tls-cert` / `--tls-key`).
370
+ # Hello via Server.handle_static + io_uring (134k r/s row)
371
+ HYPERION_IO_URING_ACCEPT=1 bundle exec bin/hyperion -w 1 -t 5 -p 9292 bench/hello_static.ru &
372
+ wrk -t4 -c100 -d20s --latency http://127.0.0.1:9292/
918
373
 
919
- Smuggling defenses for HTTP/1.1: `Content-Length` + `Transfer-Encoding` together → 400; non-chunked `Transfer-Encoding` → 501; CRLF in response header values → `ArgumentError` (response-splitting guard).
374
+ # Dynamic block via Server.handle (9.4k r/s row)
375
+ bundle exec bin/hyperion -w 1 -t 5 -p 9292 bench/hello_handle_block.ru &
376
+ wrk -t4 -c100 -d20s --latency http://127.0.0.1:9292/
920
377
 
921
- ## Compatibility
378
+ # Generic Rack hello (4.7k r/s row)
379
+ bundle exec bin/hyperion -w 1 -t 5 -p 9292 bench/hello.ru &
380
+ wrk -t4 -c100 -d20s --latency http://127.0.0.1:9292/
381
+
382
+ # CPU JSON via block form (5.9k r/s row)
383
+ bundle exec bin/hyperion -w 1 -t 5 -p 9292 bench/work.ru &
384
+ wrk -t4 -c200 -d15s --latency http://127.0.0.1:9292/
385
+
386
+ # 4-way comparator (Hyperion vs Puma vs Falcon vs Agoo)
387
+ bash bench/4way_compare.sh
388
+
389
+ # gRPC unary + streaming (Hyperion side)
390
+ GHZ=/tmp/ghz TRIALS=3 DURATION=15s WARMUP_DURATION=3s bash bench/grpc_stream_bench.sh
391
+
392
+ # Idle keep-alive RSS sweep (10k conns × 30s hold)
393
+ bash bench/keepalive_memory.sh
394
+ ```
922
395
 
923
- - **Ruby 3.3+** required (the `protocol-http2 ~> 0.26` transitive dep imposes this floor; older Ruby installs error at `bundle install`).
924
- - **Rack 3** (auto-sets `SERVER_SOFTWARE`, `rack.version`, `REMOTE_ADDR`, IPv6-safe `Host` parsing, CRLF guard).
925
- - **`Hyperion::FiberLocal.install!`** opt-in shim for older Rails apps that store request-scoped data via `Thread.current.thread_variable_*` (modern Rails 7.1+ already uses Fiber storage natively; the shim handles the residual footgun).
926
- - **`Hyperion::FiberLocal.verify_environment!`** runtime check that `Thread.current[:k]` is fiber-local on the current Ruby (it is on 3.2+).
396
+ PG benches (`pg_concurrent.ru`, `pg_mixed.ru`) live in the
397
+ [hyperion-async-pg](https://github.com/andrew-woblavobla/hyperion-async-pg)
398
+ companion repo; they require a running Postgres and the companion
399
+ gem.
400
+
401
+ When numbers from your host don't match the published ones, the
402
+ most likely explanations (in order): (1) bench-host noise — single-VM
403
+ benches drift 10–30% over days; (2) Puma version mismatch (sweep used
404
+ Puma 8.0.1; the in-repo Gemfile pins `~> 6.4`); (3) different kernel
405
+ or Ruby; (4) different `-t` / `-c` (apples-to-apples requires
406
+ identical worker count, thread count, wrk concurrency, payload, and
407
+ TLS cipher).
408
+
409
+ ## Release history
410
+
411
+ See [CHANGELOG.md](CHANGELOG.md). Recent: 2.14.0 (gRPC streaming ghz
412
+ numbers; dynamic-block C dispatch — `Server.handle { |env| ... }` lifts
413
+ hello to 9,422 r/s and CPU JSON to 5,897 r/s; `Server#stop` accept-wake
414
+ on Linux; io_uring 4h soak), 2.13.0 (response head builder C-rewrite;
415
+ gRPC streaming RPCs; soak harness), 2.12.0 (C connection lifecycle;
416
+ io_uring loop hits 134k r/s; gRPC unary trailers; SO_REUSEPORT
417
+ audit), 2.11.0 (HPACK CGlue default; h2 dispatch-pool warmup), 2.10.x
418
+ (`PageCache`, `Server.handle` direct routes, TCP_NODELAY at accept).
419
+
420
+ ## Links
421
+
422
+ - [CHANGELOG.md](CHANGELOG.md) — per-stream releases.
423
+ - [docs/BENCH_HYPERION_2_11.md](docs/BENCH_HYPERION_2_11.md) — current
424
+ 4-way matrix + 2.14-D gRPC numbers.
425
+ - [docs/BENCH_HYPERION_2_0.md](docs/BENCH_HYPERION_2_0.md) — historical
426
+ 2.10-B baseline (preserved for archaeology).
427
+ - [docs/BENCH_2026_04_27.md](docs/BENCH_2026_04_27.md) — real Rails 8.1
428
+ app sweep (Exodus platform).
429
+ - [docs/OBSERVABILITY.md](docs/OBSERVABILITY.md) — metrics + Grafana.
430
+ - [docs/WEBSOCKETS.md](docs/WEBSOCKETS.md) — RFC 6455 surface.
431
+ - [docs/MIGRATING_FROM_PUMA.md](docs/MIGRATING_FROM_PUMA.md) — drop-in
432
+ guide.
433
+ - [docs/REVERSE_PROXY.md](docs/REVERSE_PROXY.md) — nginx fronting.
927
434
 
928
435
  ## Credits
929
436
 
930
- - Vendored [llhttp](https://github.com/nodejs/llhttp) (Node.js's HTTP parser, MIT) under `ext/hyperion_http/llhttp/`.
931
- - HTTP/2 framing and HPACK via [`protocol-http2`](https://github.com/socketry/protocol-http2).
437
+ - Vendored [llhttp](https://github.com/nodejs/llhttp) (Node.js's HTTP
438
+ parser, MIT) under `ext/hyperion_http/llhttp/`.
439
+ - HTTP/2 framing and HPACK via
440
+ [`protocol-http2`](https://github.com/socketry/protocol-http2).
932
441
  - Fiber scheduler via [`async`](https://github.com/socketry/async).
933
442
 
934
443
  ## License