hyperion-rb 2.12.0 → 2.14.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
data/README.md CHANGED
@@ -1,589 +1,218 @@
1
1
  # Hyperion
2
2
 
3
- High-performance Ruby HTTP server. Falcon-class fiber concurrency, Puma-class compatibility.
3
+ High-performance Ruby HTTP server. Rack 3 + HTTP/2 + WebSockets + gRPC on a single binary.
4
4
 
5
5
  [![CI](https://github.com/andrew-woblavobla/hyperion/actions/workflows/ci.yml/badge.svg)](https://github.com/andrew-woblavobla/hyperion/actions/workflows/ci.yml)
6
6
  [![Gem Version](https://img.shields.io/gem/v/hyperion-rb.svg)](https://rubygems.org/gems/hyperion-rb)
7
7
  [![License: MIT](https://img.shields.io/github/license/andrew-woblavobla/hyperion.svg)](https://github.com/andrew-woblavobla/hyperion/blob/master/LICENSE)
8
8
 
9
+ Hyperion serves a hello-world Rack response at **134,084 r/s with a 1.14 ms p99**
10
+ on a single worker (Linux 6.x, io_uring accept loop, `Server.handle_static`),
11
+ **7×** Agoo's 19,024 r/s on the same hardware. Beyond the C-side fast path
12
+ it's a complete Rack 3 server: HTTP/1.1 + HTTP/2 with ALPN, WebSockets
13
+ (RFC 6455), gRPC unary + streaming on the Rack 3 trailers contract, native
14
+ fiber concurrency for PG-bound apps, and pre-fork cluster mode with
15
+ SO_REUSEPORT-balanced workers.
16
+
9
17
  ```sh
10
18
  gem install hyperion-rb
11
- bundle exec hyperion config.ru
19
+ bundle exec hyperion config.ru # http://127.0.0.1:9292
12
20
  ```
13
21
 
14
- ## What's new in 2.12.0
15
-
16
- **The hot path moves into C — and gRPC ships.** The headline win:
17
- `Server.handle_static` routes now serve from a C accept→read→route→write
18
- loop with optional **io_uring** (Linux 5.x+) backing it. The `wrk -t4
19
- -c100 -d20s` hello bench moved from **5,502 r/s** (2.11.0
20
- `Server.handle_static` via Ruby accept loop) to **15,685 r/s** (2.12-C
21
- C accept4 loop) to **134,084 r/s** (2.12-D io_uring loop) — that's
22
- **24× over 2.11.0's `handle_static` and 7× over Agoo 2.15.14's
23
- 19,024 r/s** on the same workload. p99 stays sub-millisecond
24
- throughout. Plus durable foundation work and one big new feature:
25
-
26
- - **2.12-B — Fresh 4-way re-bench.** New
27
- [`docs/BENCH_HYPERION_2_11.md`](docs/BENCH_HYPERION_2_11.md) re-runs
28
- Hyperion / Puma / Falcon / Agoo on the 6 workloads with all 2.10/2.11
29
- wins enabled. Headline shifts: static 1 KB Hyperion `handle_static`
30
- flipped from 1.89× behind Agoo to **+127% ahead**; CPU JSON gap
31
- widened (the one row 2.10/2.11 didn't touch — flagged for follow-up).
32
- - **2.12-C — Connection lifecycle in C.** New
33
- `Hyperion::Http::PageCache.run_static_accept_loop` does
34
- `accept4` + `recv` + path lookup + `write` entirely in a C tight
35
- loop, returning to Ruby only on a route miss / TLS / h2 / WebSocket
36
- upgrade. GVL released across syscalls. Auto-engages when the listener
37
- is plain TCP and the route table contains only `StaticEntry`
38
- registrations. **5,502 → 15,685 r/s (+185%, 2.85×) on `handle_static`
39
- hello; p99 1.59 ms → 107 µs (15× tighter).** Falls through to the
40
- existing Ruby accept loop on miss with no regression.
41
- - **2.12-D — io_uring accept loop (Linux 5.x+).** A multishot accept +
42
- per-conn RECV/WRITE/CLOSE state machine on top of liburing. One
43
- `io_uring_enter` per N requests instead of N×3 syscalls. Opt-in via
44
- `HYPERION_IO_URING_ACCEPT=1` (default off until 2.13 production
45
- soak). **15,685 → 134,084 r/s (+755%, 8.6×) on the same bench.**
46
- Compiles out cleanly without liburing — the `accept4` path stays
47
- the fallback. macOS keeps using `accept4` (no liburing).
48
- - **2.12-E — SO_REUSEPORT cluster-mode audit.** New per-worker request
49
- metric (`requests_dispatch_total{worker_id="N"}`) ticks under every
50
- dispatch mode (Rack, `handle_static`, h2, the C accept loops). New
51
- audit harness `bench/cluster_distribution.sh` and a 4-worker, 30s
52
- sustained-load bench: under steady state the SO_REUSEPORT hash
53
- distributes within **1.004-1.011 max/min ratio** — production-grade,
54
- measured. The cold-start swing (1.16× during the first second of
55
- fresh boot) is documented as expected `SO_REUSEPORT + keep-alive`
56
- behavior and matches what production L4 LBs already exhibit.
57
- - **2.12-F — gRPC support on h2.** Trailers (the `grpc-status` /
58
- `grpc-message` final HEADERS frame), `TE: trailers` handling, h2
59
- request half-close semantics. Rack 3 contract: a Rack body that
60
- defines `#trailers` triggers the trailers wire shape automatically;
61
- bodies that don't are byte-identical to 2.11.x h2. Smoke test against
62
- the real `grpc` Ruby gem ships gated by `RUN_GRPC_SMOKE=1`; the
63
- durable coverage is 11 unit specs driving real `protocol-http2`
64
- framer + HPACK encode/decode + TLS.
65
-
66
- The 2.10-G TCP_NODELAY hunk, 2.10-E preload hooks, 2.10-F C-ext
67
- `rb_pc_serve_request`, 2.11-A dispatch pool warmup, and 2.11-B cglue
68
- HPACK default are all preserved and verified by the 1143-spec suite.
69
-
70
- Full per-stream details, bench tables, and follow-up items in
71
- [`CHANGELOG.md`](CHANGELOG.md).
72
-
73
- ## What's new in 2.11.0
74
-
75
- **h2 cold-stream latency cut + native HPACK CGlue flipped to default.**
76
- Two perf wins on top of 2.10:
77
-
78
- - **2.11-A — h2 first-stream TLS handshake parallelization.** The
79
- 2.10-G `HYPERION_H2_TIMING=1` instrumentation, run against the
80
- TCP_NODELAY-fixed handler, isolated the residual cold-stream cost
81
- to **bucket 2**: lazy `task.async {}` fiber spawn for the first
82
- stream of every connection. Fix: pre-spawn a stream-dispatch fiber
83
- pool at connection accept (configurable via `HYPERION_H2_DISPATCH_POOL`,
84
- default 4, ceiling 16). h2load `-c 1 -m 1 -n 50` cold first-run:
85
- **time-to-1st-byte 20.28 → 9.28 ms (−54%); m=100 throughput +5.5%**.
86
- Warm steady-state unchanged (no head-of-line blocking under the small
87
- pool — backlog still spills to ad-hoc `task.async`).
88
- - **2.11-B — HPACK FFI marshalling round-2 (CGlue flipped to default).**
89
- Three-way bench (`bench/h2_rails_shape.sh` extended): `ruby` (1,585
90
- r/s) vs `native v2` (1,602 r/s, +1% — noise) vs `native v3 / CGlue`
91
- (**2,291 r/s, +43% over v2**). The +18-44% native-vs-Ruby headline
92
- was almost entirely Fiddle marshalling overhead, not the underlying
93
- Rust HPACK encoder — same encoder, no per-call FFI marshalling, +43%
94
- rps. Default flipped: unset `HYPERION_H2_NATIVE_HPACK` now selects
95
- CGlue. Three escape valves stay (`=v2` to force the old path, `=ruby`
96
- / `=off` for the pure-Ruby fallback) for any operator that needs
97
- them. Boot log gains a `native_mode` field documenting which path is
98
- actually live.
99
-
100
- Plus operator infrastructure: a stale-`.dylib`-on-Linux cross-platform
101
- host-OS portability fix in `H2Codec.candidate_paths` (was silently
102
- falling through to pure-Ruby on the bench host); `bench/h2_rails_shape.sh`
103
- race-fixed (boot-log probe + stderr routing). Full bench tables and
104
- flip-decision rationale in [`CHANGELOG.md`](CHANGELOG.md).
105
-
106
- ## What's new in 2.10.1
107
-
108
- **Static-asset operator surface (2.10-E) + C-ext fast-path response
109
- writer (2.10-F).** Two follow-on streams to 2.10's static / direct-route
110
- work:
111
-
112
- - **2.10-E — Static asset preload + immutable flag.** Boot-time hook
113
- warms `Hyperion::Http::PageCache` over a tree of files and marks
114
- every cached entry immutable. Surface: `--preload-static <dir>` (and
115
- `--no-preload-static`) CLI flags, `preload_static "/path", immutable:
116
- true` config DSL key, and zero-config Rails auto-detect that pulls
117
- `Rails.configuration.assets.paths.first(8)` when present. Hyperion
118
- never `require`s Rails — purely defensive `defined?(::Rails)`
119
- probing keeps the generic Rack server path clean. **Operator value:
120
- predictable first-request latency** (the asset is in cache before
121
- the first request arrives) and the `recheck_seconds` mtime poll is
122
- skipped on immutable entries. Sustained-load throughput on the
123
- static-1-KB bench did *not* move (cold 1,929 r/s vs warm 1,886 r/s,
124
- inside trial noise) because `ResponseWriter` already auto-caches
125
- Rack::Files responses on the first hit; preload moves that one
126
- `cache_file` call from request 1 to boot.
127
- - **2.10-F — C-ext fast-path response writer for prebuilt responses.**
128
- `Server.handle_static`-routed requests now serve from a single
129
- C function (`rb_pc_serve_request` in `ext/hyperion_http/page_cache.c`)
130
- that does route lookup → header build → `write()` syscall without
131
- re-entering Ruby on the response side. GVL is released across the
132
- `write()` so slow clients no longer block other Ruby work on the
133
- same VM. Automatic HEAD support (HTTP-mandated) lights up on every
134
- GET registered via `handle_static` — same buffer, body stripped.
135
- Bench (3-trial median, `wrk -t4 -c100 -d20s`): **5,768 r/s vs
136
- 2.10-D's 5,619 r/s (+2.6% — inside noise) and p99 1.93 → 1.67 ms
137
- (−14% — outside noise, reproducible).** The throughput needle didn't
138
- move because the per-connection lifecycle (accept4 + clone3 + futex
139
- on GVL handoff) dominates at 100 concurrent connections; 2.10-F
140
- shrinks the response phase, but the response phase isn't the
141
- bottleneck on this profile. Durable infrastructure for 2.11+ when
142
- the accept-loop work closes.
143
-
144
- Full per-stream details and bench tables in
145
- [`CHANGELOG.md`](CHANGELOG.md).
146
-
147
- ## What's new in 2.10.0
148
-
149
- **4-way bench harness, page cache, direct routes, and the h2 40 ms
150
- ceiling killed.** This sprint widens the comparison matrix to all four
151
- major Ruby web servers (Hyperion + Puma + Falcon + Agoo) and ships
152
- four substantive perf streams against that backdrop:
153
-
154
- - **2.10-A / 2.10-B — 4-way bench harness + honest baseline.**
155
- `bench/4way_compare.sh` runs the same 6 workloads (hello, static
156
- 1 KB / 1 MiB, CPU JSON, PG-bound, SSE) against all four servers from
157
- one script. Baseline numbers committed *before* any code changes:
158
- Agoo wins the static-asset and JSON columns by ~2-4×, Hyperion wins
159
- the static 1 MiB column by 9× and the SSE column by 3.6-17×.
160
- - **2.10-C — `Hyperion::Http::PageCache` (pre-built static response
161
- cache).** Open-addressed bucket table behind a pthread mutex
162
- (GVL-released for writes), engages automatically on `Rack::Files`
163
- responses. **Static 1 KB: 1,380 → 1,880 r/s (+36%), p99 3.7 → 2.7
164
- ms.** Closes the Agoo gap from −47% to −28% on that column.
165
- - **2.10-D — `Hyperion::Server.handle` direct route registration.**
166
- New API for hot Rack-bypass paths (`Server.handle '/health' do …
167
- end`, `Server.handle_static '/robots.txt', body: '...'`). Skips Rack
168
- adapter + env-build for matched routes. **`hello` via
169
- `handle_static`: 4,408 → 5,619 r/s (+27%), p99 1.93 ms** — the
170
- cleanest p99 in the 4-way matrix.
171
- - **2.10-G — h2 max-latency ceiling at ~40 ms: fixed.** Filed by 2.9-B
172
- as a "first-stream cost" hypothesis, the instrumentation revealed
173
- it was paid by *every* h2 stream — the canonical Linux delayed-ACK
174
- + Nagle interaction on small framer writes. One-line fix:
175
- TCP_NODELAY at accept time. **h2load `-c 1 -m 1 -n 200`: min
176
- 40.62 → 0.54 ms (−98.7%), throughput 24 → 1,142 r/s (47.6×).** The
177
- `HYPERION_H2_TIMING=1` instrumentation stays in place as durable
178
- diagnostic infrastructure.
179
-
180
- Full per-stream details, bench numbers, and follow-up items live in
181
- [`CHANGELOG.md`](CHANGELOG.md).
182
-
183
- ## What's new in 2.5.0
184
-
185
- **Native HPACK ON by default + autobahn 100% conformance + request
186
- hooks.** The Rust HPACK encoder (added in 2.0.0, opt-in until 2.4.x)
187
- flips ON by default in 2.5.0 — verified **+18% rps on Rails-shape h2
188
- workloads** (25-header responses, the bench harness lives at
189
- `bench/h2_rails_shape.ru` + `bench/h2_rails_shape.sh`). RFC 6455
190
- WebSocket conformance hit **463/463 autobahn-testsuite cases passing**
191
- (2.5-A, host openclaw-vm). Request lifecycle hooks
192
- (`Runtime#on_request_start` / `on_request_end`) shipped in 2.5-C —
193
- recipes in [`docs/OBSERVABILITY.md`](docs/OBSERVABILITY.md).
194
-
195
- ## What's new in 2.4.0
196
-
197
- **Production observability.** The `/-/metrics` endpoint now exposes
198
- per-route latency histograms, per-conn fairness rejections, WebSocket
199
- permessage-deflate compression ratio, kTLS active connections,
200
- io_uring-active workers, and ThreadPool queue depth — operators can
201
- finally see whether the 2.x knobs are firing and how effective they
202
- are. A pre-built Grafana dashboard ships at
203
- [`docs/grafana/hyperion-2.4-dashboard.json`](docs/grafana/hyperion-2.4-dashboard.json).
204
- Full metric reference + operator playbook in
205
- [`docs/OBSERVABILITY.md`](docs/OBSERVABILITY.md).
206
-
207
- ## What's new in 2.1.0
208
-
209
- **WebSockets.** RFC 6455 over Rack 3 full hijack, native frame codec,
210
- per-connection wrapper with auto-pong / close handshake / UTF-8 validation /
211
- per-message size cap. **ActionCable on Hyperion is now a single-binary
212
- deployment** — one `hyperion -w 4 -t 10 config.ru` process serves HTTP,
213
- HTTP/2, TLS, **and** `/cable` from the same listener; no separate cable
214
- container required. HTTP/1.1 only this release; WS-over-HTTP/2 (RFC 8441
215
- Extended CONNECT) and permessage-deflate (RFC 7692) defer to 2.2.x.
216
- See [`docs/WEBSOCKETS.md`](docs/WEBSOCKETS.md).
217
-
218
- ## gRPC on Hyperion (2.12-F+)
219
-
220
- Hyperion's HTTP/2 path supports gRPC unary calls via the Rack 3 trailers
221
- contract: any response body that exposes `:trailers` gets a final
222
- HEADERS frame (with END_STREAM=1) carrying the trailer map after the
223
- DATA frames. That's the wire shape gRPC clients expect for the
224
- `grpc-status` / `grpc-message` map.
225
-
226
- A minimal Rack-shaped gRPC handler:
227
-
228
- ```ruby
229
- class GrpcBody
230
- def initialize(reply); @reply = reply; end
231
- def each; yield @reply; end
232
- def trailers; { 'grpc-status' => '0', 'grpc-message' => 'OK' }; end
233
- def close; end
234
- end
235
-
236
- run ->(env) {
237
- request = env['rack.input'].read # gRPC-framed protobuf bytes
238
- reply = handle(request) # your service implementation
239
- [200, { 'content-type' => 'application/grpc' }, GrpcBody.new(reply)]
240
- }
241
- ```
22
+ ## Headline benchmarks
242
23
 
243
- What Hyperion handles for you: ALPN negotiation, HTTP/2 framing, HPACK,
244
- per-stream flow control, the trailer-frame emit, binary-clean
245
- `env['rack.input']` (gRPC bodies are non-UTF-8), and `te: trailers`
246
- preserved into `env['HTTP_TE']`. What you handle: protobuf
247
- marshalling and the `grpc-status` semantics. Streaming RPCs (server /
248
- client / bidi) are 2.13 candidates — pin to unary for now.
249
-
250
- ## Highlights
251
-
252
- - **HTTP/1.1 + HTTP/2 + TLS** out of the box (HTTP/2 with per-stream fiber multiplexing, WINDOW_UPDATE-aware flow control, ALPN auto-negotiation).
253
- - **WebSockets (RFC 6455)** — full handshake, native frame codec, per-connection wrapper. ActionCable + faye-websocket work on a single-binary deploy. See [`docs/WEBSOCKETS.md`](docs/WEBSOCKETS.md). (2.1.0+, HTTP/1.1 only.)
254
- - **Pre-fork cluster** with per-OS worker model: `SO_REUSEPORT` on Linux, master-bind + worker-fd-share on macOS/BSD (Darwin's `SO_REUSEPORT` doesn't load-balance).
255
- - **Hybrid concurrency**: fiber-per-connection for I/O, OS-thread pool for `app.call(env)` — synchronous Rack handlers (Rails, ActiveRecord, anything holding a global mutex) get true OS-thread concurrency.
256
- - **Vendored llhttp 9.3.0** C parser; pure-Ruby fallback for non-MRI runtimes.
257
- - **Default-ON structured access logs** (one JSON or text line per request) with hot-path optimisations: per-thread cached timestamp, hand-rolled line builder, lock-free per-thread write buffer.
258
- - **12-factor logger split**: info/debug → stdout, warn/error/fatal → stderr.
259
- - **Ruby DSL config file** (`config/hyperion.rb`) with lifecycle hooks (`before_fork`, `on_worker_boot`, `on_worker_shutdown`).
260
- - **Object pooling** for the Rack `env` hash and `rack.input` IO — amortizes per-request allocations across the worker's lifetime.
261
- - **`Hyperion::FiberLocal`** opt-in shim for older Rails idioms that store request-scoped data via `Thread.current.thread_variable_*`.
262
-
263
- ## Benchmarks
264
-
265
- All numbers are real wrk runs against published Hyperion configs. Hyperion ships **with default-ON structured access logs**; Puma comparisons use Puma defaults (no per-request log emission). Each section is stamped with the Hyperion version + bench host it was measured against — bench-host drift over time is real (see "Bench-host drift" note below).
266
-
267
- **Headline doc**: the most recent comprehensive sweep is
268
- [`docs/BENCH_HYPERION_2_11.md`](docs/BENCH_HYPERION_2_11.md) — the
269
- 2.12-B 4-way re-bench (Hyperion 2.11.0 vs Puma 8.0.1 / Falcon 0.55.3 /
270
- Agoo 2.15.14, 16-vCPU Ubuntu 24.04, 6 workloads). It's the post-
271
- 2.10/2.11-wins re-baseline of the four-server matrix that originally
272
- shipped in [`docs/BENCH_HYPERION_2_0.md`](docs/BENCH_HYPERION_2_0.md)
273
- § "4-way head-to-head (2.10-B baseline)" — the older doc is the
274
- **historical baseline (pre-2.10/2.11 wins)** and is preserved
275
- unchanged for archaeology. The 1.6.0 matrix at
276
- [`docs/BENCH_2026_04_27.md`](docs/BENCH_2026_04_27.md) covers 9
277
- workloads × 25+ configs against hyperion-async-pg 0.5.0; all three
278
- docs include caveats and per-row reproduction commands.
279
-
280
- > **Bench-host drift note (2026-05-01).** A spot-check rerun on
281
- > `openclaw-vm` 5 days after the 2.0.0 sweep showed Puma 8.0.1 and
282
- > Hyperion 2.0.0 baseline numbers had drifted 14-32% downward from the
283
- > 2026-04-29 sweep with no code changes — the bench host runs other
284
- > workloads in the background and is a single VM (KVM CPU). Numbers in
285
- > this README and BENCH docs are snapshots; expect ±10-30% absolute
286
- > drift between sweep dates. **The relative position (Hyperion vs Puma
287
- > at matched config) is the durable signal**; e.g. Hyperion `-w 16 -t 5`
288
- > hello-world today is 76,593 r/s vs Puma 8.0.1 `-w 16 -t 5:5` at 55,609
289
- > r/s, **+37.7% over Puma** — wider than the 2.0.0 sweep's +27.8% even
290
- > though absolute rps is lower. Reproduce: `bundle exec bin/hyperion
291
- > -p 9501 -w 16 -t 5 bench/hello.ru` then `wrk -t4 -c200 -d20s
292
- > http://127.0.0.1:9501/`.
293
-
294
- > **Topology relevance.** Hyperion is built to run **fronted by nginx
295
- > or an L7 load balancer** in most production deployments — plaintext
296
- > HTTP/1.1 upstream, TLS terminated at the LB. The benches in this
297
- > README that match that topology are: hello-world, CPU JSON, static,
298
- > SSE, PG, WebSocket. Benches that are **bench-only for nginx-fronted
299
- > ops** (the LB → upstream hop is plaintext h1 regardless): TLS h1,
300
- > HTTP/2, kTLS_TX. Those rows still ship for operators who terminate
301
- > TLS / h2 at Hyperion directly (small static fleets, edge boxes), but
302
- > don't chase the +60% TLS-h1 win unless you actually terminate TLS at
303
- > Hyperion.
304
-
305
- ### Hello-world Rack app
306
-
307
- `bench/hello.ru`, single worker, parity threads (`-t 5` vs Puma `-t 5:5`), 4 wrk threads / 100 connections / 15s, macOS arm64 / Ruby 3.3.3, Hyperion 1.2.0. **macOS dev numbers; the headline Linux 2.0.0 bench is in [`docs/BENCH_HYPERION_2_0.md`](docs/BENCH_HYPERION_2_0.md)**:
308
-
309
- | | r/s | p99 | tail vs Hyperion |
310
- |---|---:|---:|---:|
311
- | **Hyperion 1.2.0** (default, logs ON) | **22,496** | **502 µs** | **1×** |
312
- | Falcon 0.55.3 `--count 1` | 22,199 | 5.36 ms | 11× worse |
313
- | Puma 7.1.0 `-t 5:5` | 20,400 | 422.85 ms | 845× worse |
314
-
315
- **Hyperion: 1.10× Puma throughput, parity with Falcon on throughput, ~10× lower p99 than Falcon and ~845× lower than Puma — while emitting structured JSON access logs the others don't.**
316
-
317
- ### Production cluster config (`-w 4`)
318
-
319
- Same bench app, `-w 4` cluster, parity threads (`-t 5` everywhere), 4 wrk threads / 200 connections / 15s, macOS arm64:
320
-
321
- | | r/s | p99 | tail vs Hyperion |
322
- |---|---:|---:|---:|
323
- | Falcon `--count 4` | 48,197 | 4.84 ms | 5.9× worse |
324
- | **Hyperion `-w 4 -t 5`** | **40,137** | **825 µs** | **1×** |
325
- | Puma `-w 4 -t 5:5` | 34,793 | 177.76 ms | 215× worse (1 timeout) |
326
-
327
- Falcon edges Hyperion ~20% on raw rps at `-w 4` on macOS hello-world. **Hyperion still leads on tail latency by 5.9× over Falcon and 215× over Puma**, and beats Puma on throughput by 1.15×. On Linux production-config and DB-backed workloads (below) Hyperion takes the rps lead too — the macOS hello-world advantage to Falcon disappears once the workload includes any actual work or the kernel is Linux.
328
-
329
- ### Linux production-config (DB-backed Rack)
330
-
331
- `-w 4 -t 10` on Ubuntu 24.04 / Ruby 3.3.3. Rack app does one Postgres `SELECT 1` + one Redis `GET` per request, real network round-trip. wrk `-t4 -c50 -d10s` × 3 runs (median):
332
-
333
- | | r/s (median) | vs Puma default |
334
- |---|---:|---:|
335
- | **Hyperion default (rc17, logs ON)** | **5,786** | **1.012×** |
336
- | Hyperion `--no-log-requests` | 6,364 | 1.114× |
337
- | Puma `-w 4 -t 10:10` (no per-req logs) | 5,715 | 1.000× |
338
-
339
- Bench is **wait-bound** — ~3-4 ms median is the PG + Redis round-trip, dwarfing the per-request CPU work where Hyperion's optimisations live. With a synchronous `pg` driver, fibers don't help: every in-flight DB call still parks an OS thread, and both servers max out at `workers × threads` concurrent queries. To widen this gap requires either an async PG driver — see [hyperion-async-pg](https://github.com/andrew-woblavobla/hyperion-async-pg) (companion gem; pair with `--async-io` and a fiber-aware pool, see "Async I/O — fiber concurrency on PG-bound apps" below) — or a CPU-bound workload, where Hyperion's lead becomes visible (next section).
340
-
341
- ### Async I/O — fiber concurrency on PG-bound apps
342
-
343
- Ubuntu 24.04 / 16 vCPU / Ruby 3.3.3, Postgres 17 over WAN, `wrk -t4 -c200 -d20s`. Single worker (`-w 1`) unless noted. All configs returned 0 non-2xx and 0 timeouts. RSS sampled mid-run via `ps -o rss`.
344
-
345
- **Wait-bound workload** (`pg_concurrent.ru`: `SELECT pg_sleep(0.05)` + tiny JSON; rackup lives in the [hyperion-async-pg companion repo](https://github.com/andrew-woblavobla/hyperion-async-pg) and on the bench host at `~/bench/pg_concurrent.ru`, not in this repo):
346
-
347
- | | r/s | p99 | RSS | vs Puma `-t 5` |
348
- |---|---:|---:|---:|---:|
349
- | Puma 8.0 `-t 5` pool=5 | 56.5 | 3.88 s | 87 MB | 1.0× |
350
- | Puma 8.0 `-t 30` pool=30 | 402.1 | 880 ms | 99 MB | 7.1× |
351
- | Puma 8.0 `-t 100` pool=100 | 1067.4 | 557 ms | 121 MB | 18.9× |
352
- | **Hyperion `--async-io -t 5`** pool=32 | 400.4 | 878 ms | 123 MB | 7.1× |
353
- | **Hyperion `--async-io -t 5`** pool=64 | 778.9 | 638 ms | 133 MB | 13.8× |
354
- | **Hyperion `--async-io -t 5`** pool=128 | 1344.2 | 536 ms | 148 MB | 23.8× |
355
- | **Hyperion `--async-io -t 5` pool=200** | **2381.4** | **471 ms** | **164 MB** | **42.2×** |
356
- | Hyperion `--async-io -w 4 -t 5` pool=64 | 1937.5 | 4.84 s | 416 MB | 34.3× (cold-start p99 — see note) |
357
- | Falcon 0.55.3 `--count 1` pool=128 | 1665.7 | 516 ms | 141 MB | 29.5× |
358
-
359
- **Mixed CPU+wait** (`pg_mixed.ru`: same query + 50-key JSON serialization, ~5 ms CPU; rackup lives in hyperion-async-pg + on the bench host at `~/bench/pg_mixed.ru`, not in this repo):
360
-
361
- | | r/s | p99 | RSS | vs Puma `-t 30` |
362
- |---|---:|---:|---:|---:|
363
- | Puma 8.0 `-t 30` pool=30 | 351.7 | 963 ms | 127 MB | 1.0× |
364
- | Hyperion `--async-io -t 5` pool=32 | 371.2 | 919 ms | 151 MB | 1.05× |
365
- | Hyperion `--async-io -t 5` pool=64 | 741.5 | 681 ms | 161 MB | 2.1× |
366
- | **Hyperion `--async-io -t 5` pool=128** | **1739.9** | **512 ms** | **201 MB** | **4.9×** |
367
- | Falcon `--count 1` pool=128 | 1642.1 | 531 ms | 213 MB | 4.7× |
368
-
369
- **Takeaways:**
370
- 1. **Linear scaling with pool size** under `--async-io` — `r/s ≈ pool × 12` on this WAN bench. Single-worker pool=200 hits 2381 r/s. The "**42× Puma `-t 5`**" and "**5.9× Puma's best**" framings above use Puma's pool=5 (timeout-floor) and pool=30 (mid-tier) rows respectively — fair comparisons on the *same* bench fixture, but a Puma operator who sizes their pool to match (`-t 100 pool=100` row above) lands at 1,067 r/s, so the **honest "Puma at its own best vs Hyperion at its own best" ratio is 2,381 / 1,067 ≈ 2.2×**, not 42×. The architectural win — fiber-pool grows to pool=200 without OS-thread cost — is real; the 42× headline is a configuration-difference effect, not a steady-state gap on matched configs.
371
- 2. **Mixed workload doesn't kill the win** — Hyperion `--async-io` pool=128 actually goes *up* on mixed (1740 vs 1344 r/s) because CPU work overlaps other fibers' PG-wait windows. This is the honest "what happens to a real Rails handler" answer.
372
- 3. **Hyperion ≈ Falcon within 3-7%** across pool sizes; both fiber-native architectures extract similar value from `hyperion-async-pg`.
373
- 4. **RSS at single-worker scale isn't the architectural moat** — Linux thread stacks are demand-paged; PG connection buffers dominate RSS at pool sizes ≤ 200. The architectural win is **handler concurrency under load**, not idle memory: Hyperion's fiber path runs thousands of in-flight handler invocations per OS thread, so wait-bound handlers don't queue at `max_threads`. See [Concurrency at scale](#concurrency-at-scale-architectural-advantages) for both the throughput-under-load row and a measured 10k-idle-keepalive RSS sweep against Puma and Falcon.
374
- 5. **`-w 4` cold-start caveat** — multi-worker p99 inflates because the bench rackup uses lazy per-process pool init (each worker pays full pool fill on its first request). Production apps avoid this with `on_worker_boot { Hyperion::AsyncPg::FiberPool.new(...).fill }`.
375
- 6. **Apples-to-apples PG note**: the row above uses `pg.wobla.space` WAN PG with `max_connections=500`. Earlier sweeps that compared Hyperion (WAN, max_conn=500) against Puma (local, max_conn=100) overstated the ratio because the Puma side timed out at the local pool ceiling. The 2.0.0 bench doc carries this caveat in the row 7 verification section; treat any "Hyperion 4× Puma on PG" headline as **indicative**, not precisely calibrated, until rerun against matched-pool PG.
376
-
377
- Three things must all be true to get this win:
378
- 1. **`async_io: true`** in your Hyperion config (or `--async-io` CLI flag). Default is off to keep 1.2.0's raw-loop perf for fiber-unaware apps.
379
- 2. **`hyperion-async-pg`** installed: `gem 'hyperion-async-pg', require: 'hyperion/async_pg'` + `Hyperion::AsyncPg.install!` at boot.
380
- 3. **Fiber-aware connection pool.** The popular `connection_pool` gem is NOT — its Mutex blocks the OS thread. Use `Hyperion::AsyncPg::FiberPool` (ships with hyperion-async-pg 0.3.0+), [`async-pool`](https://github.com/socketry/async-pool), or `Async::Semaphore`.
381
-
382
- Skip any of these and you get parity with Puma at the same `-t`. Run the bench yourself: `MODE=async DATABASE_URL=... PG_POOL_SIZE=200 bundle exec hyperion --async-io -t 5 bench/pg_concurrent.ru` (in the [hyperion-async-pg](https://github.com/andrew-woblavobla/hyperion-async-pg) repo).
383
-
384
- > **TLS + async-pg note (1.4.0+).** TLS / HTTPS already runs each connection on a fiber under `Async::Scheduler` (the TLS path always uses `start_async_loop` for the ALPN handshake). **As of 1.4.0, the post-handshake `app.call` for HTTP/1.1-over-TLS dispatches inline on the calling fiber by default** — so fiber-cooperative libraries (`hyperion-async-pg`, `async-redis`) work on the TLS h1 path without needing `--async-io`. The Async-loop cost is already paid for the handshake; running the handler under the existing scheduler just preserves that context instead of stripping it on a thread-pool hop. h2 streams are always fiber-dispatched and benefit from async-pg without the flag.
385
- >
386
- > Operators who specifically want **TLS + threadpool dispatch** (e.g. CPU-heavy handlers competing for OS threads, where you'd rather not pay fiber yields and want true OS-thread parallelism on a synchronous handler) can pass `async_io: false` in the config to force the pool branch back on. The three-way `async_io` setting:
387
- > - `nil` (default): plain HTTP/1.1 → pool, TLS h1 → inline.
388
- > - `true`: plain HTTP/1.1 → inline, TLS h1 → inline (force fiber dispatch everywhere; needed for `hyperion-async-pg` on plain HTTP).
389
- > - `false`: plain HTTP/1.1 → pool, TLS h1 → pool (explicit opt-out for TLS+threadpool).
24
+ Linux 6.8 / 16-vCPU Ubuntu 24.04 / Ruby 3.3.3, single worker, `wrk -t4 -c100 -d20s`
25
+ unless noted. Reproduction commands and the full 6-row 4-way matrix
26
+ (Hyperion / Puma / Falcon / Agoo) live in
27
+ [docs/BENCH_HYPERION_2_11.md](docs/BENCH_HYPERION_2_11.md).
390
28
 
391
- ### CPU-bound JSON workload
29
+ | Workload | Hyperion r/s | Hyperion p99 | Reference |
30
+ |-------------------------------------------------------|-------------:|-------------:|----------------------|
31
+ | Static hello, `handle_static` + io_uring (2.12-D) | **134,084** | 1.14 ms | Agoo 2.15.14: 19,024 |
32
+ | Static hello, `handle_static` + accept4 fallback | 15,685 | 107 µs | Agoo 2.15.14: 19,024 |
33
+ | Dynamic block, `Server.handle { \|env\| ... }` (2.14-A) | 9,422 | 166 µs | Agoo 2.15.14: 19,024 |
34
+ | CPU JSON via block (`bench/work.ru`, 2.14-A) | 5,897 | 256 µs | Falcon: 4,226 |
35
+ | Generic Rack hello (no `Server.handle`) | 4,752 | 2.02 ms | Agoo 2.15.14: 19,024 |
36
+ | gRPC unary, h2/TLS, ghz `-c50` (2.14-D) | 1,618 | 33.3 ms | Falcon `async-grpc`: 1,512 (+7%) |
392
37
 
393
- `bench/work.ru` — handler builds a 50-key fixture, JSON-encodes a fresh response per request (~8 KB body), processes a 6-cookie header chain. wrk `-t4 -c200 -d15s`, macOS arm64 / Ruby 3.3.3, 1.2.0:
38
+ The 134,084 r/s row is sustained over a 4-hour soak at **120,684 r/s**
39
+ with RSS variance 2.71% and `wrk-truth` p99 1.14 ms (2.14-C). The
40
+ io_uring loop is opt-in via `HYPERION_IO_URING_ACCEPT=1` until 2.15;
41
+ the `accept4` row is the default on Linux.
394
42
 
395
- | | r/s | p99 | tail vs Hyperion |
396
- |---|---:|---:|---:|
397
- | Falcon `--count 4` | 46,166 | 20.17 ms | 24× worse |
398
- | **Hyperion `-w 4 -t 5`** | **43,924** | **824 µs** | **1×** |
399
- | Puma `-w 4 -t 5:5` | 36,383 | 166.30 ms (47 socket errors) | 200× worse |
400
-
401
- **1.21× Puma throughput, 200× lower p99.** This is the gap that hides behind PG-round-trip noise on the DB bench. Hyperion's per-request CPU savings (lock-free per-thread metrics, frozen header keys in the Rack adapter, C-ext response head builder, cached iso8601 timestamps, cached HTTP Date header) land on the wire when the workload is CPU-bound. Falcon edges us 5% on raw r/s but with 24× worse tail — a different tradeoff curve. Reproduce: `bundle exec bin/hyperion -w 4 -t 5 -p 9292 bench/work.ru`.
402
-
403
- ### Real Rails 8.1 app (single worker, parity threads `-t 16`)
404
-
405
- Health endpoint that traverses the full middleware chain (rack-attack, locale redirect, structured tagger, geo-location, etc.). Plus a Grape API endpoint reading cached data, and a Rails controller doing a Redis GET + an ActiveRecord query.
406
-
407
- | endpoint | server | r/s | p99 | wrk timeouts |
408
- |---|---|---:|---:|---:|
409
- | `/up` (health) | **Hyperion** | **19.03** | **1.12 s** | **0** |
410
- | `/up` (health) | Puma `-t 16:16` | 16.64 | 1.95 s | **138** |
411
- | Grape `/api/v1/cached_data` | **Hyperion** | **16.15** | **779 ms** | 16 |
412
- | Grape `/api/v1/cached_data` | Puma `-t 16:16` | 10.90 | (>2 s, censored) | **110** |
413
- | Rails `/api/v1/health` | **Hyperion** | **15.95** | **992 ms** | 16 |
414
- | Rails `/api/v1/health` | Puma `-t 16:16` | 11.29 | (>2 s, censored) | **114** |
415
-
416
- On Grape and Rails-controller workloads Puma hits wrk's 2 s timeout cap on ~⅔ of requests — its real p99 is censored above 2 s. Hyperion serves all of its requests under 1.2 s with 0 to 16 timeouts. **1.14–1.48× Puma throughput** depending on endpoint.
417
-
418
- ### Static-asset serving (sendfile zero-copy path, 1.2.0+)
419
-
420
- `bench/static.ru` (`Rack::Files` over a 1 MiB asset), `-w 1`, `wrk -t4 -c100 -d15s`, macOS arm64 / Ruby 3.3.3:
421
-
422
- | | r/s | p99 | transferred | tail vs winner |
423
- |---|---:|---:|---:|---:|
424
- | **Hyperion (sendfile path)** | **2,069** | **3.10 ms** | 30.4 GB | **1×** |
425
- | Puma `-w 1 -t 5:5` | 2,109 | 566.16 ms | 31.0 GB | 183× worse |
426
- | Falcon `--count 1` | 1,269 | 801.01 ms | 18.7 GB | 258× worse (28 timeouts) |
427
-
428
- Throughput is bandwidth-bound on localhost (≈2 GB/s = the loopback memory ceiling), so the throughput column looks like parity. The actual win is in the **tail latency** column: Hyperion's `IO.copy_stream` → `sendfile(2)` path skips userspace entirely, while Puma allocates a String per response and Falcon serializes more aggressively. On real network paths sendfile widens the gap further (kernel-to-NIC zero-copy).
43
+ ## Quick start
429
44
 
430
- Reproduce:
431
45
  ```sh
432
- ruby -e 'File.binwrite("/tmp/hyperion_bench_asset_1m.bin", "x" * (1024*1024))'
433
- bundle exec bin/hyperion -p 9292 bench/static.ru
434
- wrk --latency -t4 -c100 -d15s http://127.0.0.1:9292/hyperion_bench_asset_1m.bin
46
+ bundle exec hyperion config.ru # single process
47
+ bundle exec hyperion -w 4 -t 10 config.ru # 4 workers × 10 threads
48
+ bundle exec hyperion -w 0 config.ru # one worker per CPU
49
+ bundle exec hyperion --tls-cert cert.pem --tls-key key.pem -p 9443 config.ru
435
50
  ```
436
51
 
437
- ### Concurrency at scale (architectural advantages)
438
-
439
- These workloads demonstrate structural differences between Hyperion's fiber-per-connection / fiber-per-stream model and Puma's thread-pool model. Numbers are illustrative; the architecture is what matters. Run on Ubuntu 24.04 / Ruby 3.3.3, single worker, h2load `-c <conns> -n 100000 --rps 1000 --h1`.
440
-
441
- **5,000 concurrent keep-alive connections (50,000 requests):**
442
-
443
- | | succeeded | r/s | wall | master RSS |
444
- |---|---:|---:|---:|---:|
445
- | Hyperion `-w 1 -t 10` | 50,000 / 50,000 | 3,460 | 14.45 s | 53.5 MB |
446
- | Puma `-w 1 -t 10:10` | 50,000 / 50,000 | 1,762 | 28.37 s | 36.9 MB |
52
+ `bundle exec rake spec` (and the default task) auto-invoke `compile`, so a
53
+ fresh checkout just needs `bundle install && bundle exec rake` for a green run.
447
54
 
448
- **10,000 concurrent keep-alive connections (100,000 requests):**
55
+ Migrating from Puma? `hyperion -t N -w M` matching your current Puma
56
+ `-t N:N -w M` is the recommended drop-in. See
57
+ [docs/MIGRATING_FROM_PUMA.md](docs/MIGRATING_FROM_PUMA.md).
449
58
 
450
- | | succeeded | failed | r/s | wall |
451
- |---|---:|---:|---:|---:|
452
- | Hyperion `-w 1 -t 10` | 93,090 | 6,910 | 3,446 | 27.01 s |
453
- | Puma `-w 1 -t 10:10` | 77,340 | 22,660 | 706 | 109.59 s |
59
+ ## Features
454
60
 
455
- At 10k concurrent connections under load Hyperion serves **~5× the throughput** of Puma with **a third as many dropped requests** (6,910 vs 22,660). The per-connection bookkeeping cost is bounded by fiber size, not by `max_threads` — workers don't get pinned to long-lived sockets, so a slow handler doesn't starve other connections.
61
+ ### HTTP/1.1 + HTTP/2 + TLS
456
62
 
457
- **Memory at idle keep-alive scale (10,000 idle HTTP/1.1 keep-alive connections):**
63
+ ALPN auto-negotiates `h2` or `http/1.1` per connection. HTTP/2 multiplexes
64
+ streams onto fibers within a single connection — slow handlers don't
65
+ head-of-line-block other streams. Cluster-mode TLS works (`-w N` +
66
+ `--tls-cert` / `--tls-key`).
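+
+ A quick way to confirm which protocol ALPN picked, assuming the TLS
+ listener from the quick start above (`-k` because the example cert is
+ self-signed; the exact verbose wording varies by curl version):
+
+ ```sh
+ curl -vk --http2 https://127.0.0.1:9443/ 2>&1 | grep -i alpn
+ # expect a line like "ALPN: server accepted h2"
+ curl -vk --http1.1 https://127.0.0.1:9443/ 2>&1 | grep -i alpn
+ # expect "ALPN: server accepted http/1.1" on the fallback path
+ ```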
458
67
 
459
- Each client opens a TCP connection, sends one keep-alive GET, drains the response, then holds the socket open without sending a follow-up request. RSS is sampled once a second across a 30s idle hold. Same hello-world rackup, single worker, no TLS. Hyperion runs with `async_io true` (fiber-per-connection on the plain HTTP/1.1 path).
68
+ Smuggling defences for HTTP/1.1: `Content-Length` + `Transfer-Encoding`
69
+ together → 400; non-chunked `Transfer-Encoding` → 501; CRLF in response
70
+ header values → `ArgumentError` (response-splitting guard).
460
71
 
461
- | | held | dropped | peak RSS | RSS after drain |
462
- |---|---:|---:|---:|---:|
463
- | Hyperion `-w 1 -t 5 --async-io` | 10,000 / 10,000 | 0 | 173 MB | 155 MB |
464
- | Puma `-w 0 -t 100` | 10,000 / 10,000 | 0 | 101 MB | 104 MB |
465
- | Falcon `--count 1` | 10,000 / 10,000 | 0 | 429 MB | 440 MB |
72
+ ### WebSockets (2.1.0+)
466
73
 
467
- All three hold 10k idle conns without OOMing or dropping — the "MB-per-thread" intuition that thread-based servers can't reach this scale doesn't survive contact with Linux's demand-paged thread stacks plus Puma's reactor-based keep-alive handling. Per-conn RSS lands at ~14 KB (Hyperion fiber + parser state), ~7 KB (Puma reactor entry + tiny thread share), ~36 KB (Falcon Async::Task + protocol-http stack). Bounded, not unbounded — for all three.
74
+ RFC 6455 over Rack 3 full hijack, native frame codec, per-connection
75
+ wrapper with auto-pong, close handshake, UTF-8 validation, and per-message
76
+ size cap. **ActionCable + faye-websocket on a single binary** — one
77
+ `hyperion -w 4 -t 10 config.ru` serves HTTP, HTTP/2, TLS, and `/cable`
78
+ from the same listener. Conformance: 463/463 autobahn-testsuite cases
79
+ pass. See [docs/WEBSOCKETS.md](docs/WEBSOCKETS.md).
468
80
 
469
- The architectural difference shows up under **load**, not at idle: Puma can only run `max_threads` handler invocations concurrently, so wait-bound handlers (DB, HTTP, Redis) starve at higher request concurrency than `max_threads`. Hyperion's fiber-per-connection model + `--async-io` gives one OS thread thousands of in-flight handler executions, paired with [hyperion-async-pg](https://github.com/exodusgaming-io/hyperion-async-pg) for non-blocking DB. The 10k-conn throughput row above (5× Puma) is the consequence — same idle RSS shape, very different behaviour once the handlers actually do work.
81
+ ### gRPC (2.12-F+)
470
82
 
471
- **HTTP/2 multiplexing (1 connection × 100 concurrent streams, handler sleeps 50 ms):**
83
+ Hyperion's HTTP/2 path supports gRPC unary, server-streaming,
84
+ client-streaming, and bidirectional RPCs via the Rack 3 trailers contract:
85
+ any response body that defines `#trailers` gets a final HEADERS frame
86
+ (with `END_STREAM=1`) carrying the trailer map after the DATA frames.
87
+ Plain HTTP/2 traffic without the gRPC content-type keeps the unary
88
+ buffered semantics — no behaviour change for non-gRPC clients.
472
89
 
473
- | | wall time |
474
- |---|---:|
475
- | Hyperion (per-stream fiber dispatch) | **1.04 s** |
476
- | Serial baseline (100 × 50 ms) | 5.00 s |
90
+ A minimal unary handler:
477
91
 
478
- Hyperion fans 100 in-flight streams across separate fibers within a single TCP connection. A serial server would take 5 s; the fiber-multiplexed result (1.04 s, ~96 req/s on one socket) is bounded by single-handler sleep time plus framing overhead. Puma has no native HTTP/2 path — production deployments terminate h2 at nginx and forward h1 to the worker pool, which serializes again.
479
-
480
- > **1.6.0 outbound write path** — `Http2Handler` no longer serializes every framer write through one `Mutex#synchronize { socket.write(...) }`. HPACK encoding (microseconds, in-memory) still serializes on a fast encode mutex, but the actual `socket.write` is owned by a dedicated per-connection writer fiber draining a queue. On per-connection multi-stream workloads where the kernel send buffer or peer reads are slow, encode work for ready streams overlaps the writer's flush of earlier chunks, instead of stacking up behind it. See `bench/h2_streams.sh` (`h2load -c 1 -m 100 -n 5000`) for a recipe to compare 1.5.0 vs 1.6.0 on a workload of your choice.
481
-
482
- ### Reproducing the benchmarks
483
-
484
- Every number in this README and `docs/BENCH_HYPERION_2_0.md` is reproducible. Operators who don't trust headline numbers (and you shouldn't trust *any* benchmark numbers without independent verification) can rerun the workloads on their own host and get their own honest measurements. Per-row reproduction commands:
485
-
486
- ```sh
487
- # Setup (once)
488
- bundle install
489
- bundle exec rake compile
490
-
491
- # Hello-world (rps + p99 ceiling, no I/O)
492
- bundle exec bin/hyperion -p 9292 -w 16 -t 5 bench/hello.ru &
493
- wrk -t4 -c200 -d20s --latency http://127.0.0.1:9292/
494
-
495
- # CPU-bound JSON (per-request CPU savings visible)
496
- bundle exec bin/hyperion -p 9292 -w 4 -t 5 bench/work.ru &
497
- wrk -t4 -c200 -d15s --latency http://127.0.0.1:9292/
498
-
499
- # Static 1 MiB sendfile path
500
- ruby -e 'File.binwrite("/tmp/hyperion_bench_asset_1m.bin", "x" * (1024*1024))'
501
- bundle exec bin/hyperion -p 9292 -w 1 -t 5 bench/static.ru &
502
- wrk -t4 -c100 -d15s --latency http://127.0.0.1:9292/hyperion_bench_asset_1m.bin
503
-
504
- # SSE streaming (Hyperion-shaped rackup with explicit flush sentinel — see caveat in BENCH doc)
505
- bundle exec bin/hyperion -p 9292 -w 1 -t 5 bench/sse.ru &
506
- wrk -t1 -c1 -d10s http://127.0.0.1:9292/
507
-
508
- # WebSocket multi-process throughput
509
- bundle exec bin/hyperion -p 9888 -w 4 -t 64 bench/ws_echo.ru &
510
- ruby bench/ws_bench_client_multi.rb --port 9888 --procs 4 --conns 200 --msgs 1000 --bytes 1024 --json
511
-
512
- # h2 native HPACK (Rails-shape, 25-header response)
513
- ./bench/h2_rails_shape.sh
514
-
515
- # Idle keep-alive RSS sweep (1k / 5k / 10k conns, 30s hold per server)
516
- ./bench/keepalive_memory.sh
92
+ ```ruby
93
+ class GrpcBody
94
+ def initialize(reply); @reply = reply; end
95
+ def each; yield @reply; end
96
+ def trailers; { 'grpc-status' => '0', 'grpc-message' => 'OK' }; end
97
+ def close; end
98
+ end
517
99
 
518
- # Hello-world quick comparator (Hyperion vs Puma vs Falcon)
519
- bundle exec ruby bench/compare.rb
520
- HYPERION_WORKERS=4 PUMA_WORKERS=4 FALCON_COUNT=4 bundle exec ruby bench/compare.rb
100
+ run ->(env) {
101
+ request = env['rack.input'].read # gRPC-framed protobuf bytes
102
+ reply = handle(request) # your service implementation
103
+ [200, { 'content-type' => 'application/grpc' }, GrpcBody.new(reply)]
104
+ }
521
105
  ```
522
106
 
523
- PG benches (`pg_concurrent.ru`, `pg_mixed.ru`, `pg_realistic.ru`) live in the [hyperion-async-pg companion repo](https://github.com/andrew-woblavobla/hyperion-async-pg); they require a running Postgres + the companion gem and are not part of this repo. The 2.0.0 sweep used `~/bench/pg_concurrent.ru` on the bench host; reproduce by cloning hyperion-async-pg and following its README, or `scp` the rackup + DATABASE_URL.
107
+ Server-streaming yields one DATA frame per `each`; client-streaming
108
+ reads incoming frames off `env['rack.input']` (a streaming IO that
109
+ blocks until the next DATA frame lands); bidirectional interleaves
110
+ both. Reproducible bench at `bench/grpc_stream.{proto,ru}` +
111
+ `bench/grpc_stream_bench.sh` (ghz). Numbers in
112
+ [docs/BENCH_HYPERION_2_11.md](docs/BENCH_HYPERION_2_11.md#grpc-ghz-bench--hyperion-vs-falcon-async-grpc-214-d).
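+
+ A server-streaming body is the same contract with a multi-yield `each`.
+ A minimal sketch (the `build_replies` helper is hypothetical; the
+ body/trailer shape follows the unary example above):
+
+ ```ruby
+ class StreamingGrpcBody
+   def initialize(replies); @replies = replies; end
+   # one DATA frame per yield; the trailer map goes out as the final
+   # HEADERS frame (END_STREAM=1) after the last chunk
+   def each; @replies.each { |reply| yield reply }; end
+   def trailers; { 'grpc-status' => '0', 'grpc-message' => 'OK' }; end
+   def close; end
+ end
+
+ run ->(env) {
+   # client-streaming side: each read blocks until the next DATA frame
+   replies = build_replies(env['rack.input']) # hypothetical service logic
+   [200, { 'content-type' => 'application/grpc' }, StreamingGrpcBody.new(replies)]
+ }
+ ```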
524
113
 
525
- When numbers from your host don't match the published numbers, the most likely explanations (in order): (1) bench-host noise — single-VM benches drift 10-30% over days; (2) Puma version mismatch — the 2.0.0 sweep used Puma 8.0.1 in the `~/bench/Gemfile`, the hyperion repo's own Gemfile pins Puma `~> 6.4`; (3) different kernel / Ruby; (4) different `-t` / `-c` (apples-to-apples requires identical worker count, thread count, wrk concurrency, payload size, kernel, Ruby, TLS cipher).
114
+ ### `Server.handle` direct routes
526
115
 
527
- ## Quick start
116
+ Bypass the Rack adapter for hot paths:
528
117
 
529
- ```sh
530
- bundle install
531
- bundle exec rake compile # build the llhttp C ext
532
- bundle exec hyperion config.ru # single-process default
533
- bundle exec hyperion -w 4 -t 10 config.ru # 4-worker cluster, 10 threads each
534
- bundle exec hyperion -w 0 config.ru # 1 worker per CPU
535
- bundle exec hyperion --tls-cert cert.pem --tls-key key.pem -p 9443 config.ru # HTTPS
536
- curl http://127.0.0.1:9292/ # => hello
537
-
538
- # Chunked POST works:
539
- curl -X POST -H "Transfer-Encoding: chunked" --data-binary @file http://127.0.0.1:9292/
540
-
541
- # HTTP/2 (over TLS, ALPN-negotiated):
542
- curl --http2 -k https://127.0.0.1:9443/
118
+ ```ruby
119
+ Hyperion::Server.handle_static '/health', body: 'ok'
120
+ Hyperion::Server.handle(:GET, '/v1/ping') { |env| [200, {}, ['pong']] }
543
121
  ```
544
122
 
545
- `bundle exec rake spec` (and the `default` task) auto-invoke `compile`, so a fresh checkout just needs `bundle install && bundle exec rake` to get a green run.
546
-
547
- **Migrating from Puma?** See [docs/MIGRATING_FROM_PUMA.md](docs/MIGRATING_FROM_PUMA.md).
123
+ `handle_static` bakes the response at boot and serves from the C accept
124
+ loop (134k r/s with io_uring, 16k r/s on accept4). The dynamic block
125
+ form (2.14-A) runs `app.call(env)` on the C accept loop too — accept +
126
+ recv + parse + write release the GVL while the block holds it, so
127
+ multi-threaded workers actually parallelise.
128
+
129
+ ### Pre-fork cluster
130
+
131
+ Per-OS worker model: `SO_REUSEPORT` on Linux (kernel-balanced accept,
132
+ 1.004–1.011 max/min ratio across workers under steady load — 2.12-E
133
+ audit), master-bind + worker-fd-share on macOS/BSD where Darwin's
134
+ `SO_REUSEPORT` doesn't load-balance. Lifecycle hooks (`before_fork`,
135
+ `on_worker_boot`, `on_worker_shutdown`) for AR / Redis / pool init.
136
+
137
+ ### Async I/O (PG-bound apps)
138
+
139
+ `--async-io` runs plain HTTP/1.1 connections under `Async::Scheduler`,
140
+ turning one OS thread into thousands of in-flight handler invocations.
141
+ Paired with [hyperion-async-pg](https://github.com/andrew-woblavobla/hyperion-async-pg)
142
+ on a `pg_sleep(50ms)` workload, single-worker `pool=200` hits **2,381 r/s**
143
+ vs Puma `-t 5` at 56 r/s (architectural ceiling: pool size, not thread
144
+ count). Three things must all be true: `--async-io`, `hyperion-async-pg`
145
+ loaded, and a fiber-aware pool (`Hyperion::AsyncPg::FiberPool`,
146
+ `async-pool`, or `Async::Semaphore` — **not** the `connection_pool` gem,
147
+ whose `Mutex` blocks the OS thread). Skip any one and you get parity
148
+ with Puma.
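+
+ Wiring the three pieces together, a minimal sketch (the `FiberPool`
+ constructor arguments are elided; see the companion gem's README for
+ the real signature):
+
+ ```ruby
+ # Gemfile
+ gem 'hyperion-async-pg', require: 'hyperion/async_pg'
+
+ # config/hyperion.rb
+ async_io true # piece 1: fiber dispatch on plain HTTP/1.1
+
+ on_worker_boot do |_worker_index|
+   Hyperion::AsyncPg.install! # piece 2: async PG hooks, installed at boot
+   # piece 3: fiber-aware pool, prefilled so the first request
+   # doesn't pay the pool-fill cost ((...) elided: see hyperion-async-pg)
+   Hyperion::AsyncPg::FiberPool.new(...).fill
+ end
+ ```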
149
+
150
+ ### Observability
151
+
152
+ `/-/metrics` Prometheus endpoint (admin-token guarded), per-route
153
+ latency histograms, per-conn fairness rejections, WebSocket
154
+ permessage-deflate ratio, kTLS active connections, ThreadPool queue
155
+ depth, dispatch-mode counters (Rack / `handle_static` / dynamic block /
156
+ h2 / async-io). Pre-built Grafana dashboard at
157
+ [docs/grafana/hyperion-2.4-dashboard.json](docs/grafana/hyperion-2.4-dashboard.json).
158
+ Full reference: [docs/OBSERVABILITY.md](docs/OBSERVABILITY.md).
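+
+ A minimal scrape, assuming a token provisioned via `--admin-token-file`
+ (the file path and sample values below are illustrative; the counter is
+ the 2.12-E per-worker dispatch metric):
+
+ ```sh
+ curl -H "Authorization: Bearer $(cat /etc/hyperion/admin-token)" \
+      http://127.0.0.1:9292/-/metrics | grep requests_dispatch_total
+ # requests_dispatch_total{worker_id="0"} 102934
+ # requests_dispatch_total{worker_id="1"} 103112
+ ```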
159
+
160
+ Default-ON structured access logs (one JSON or text line per request)
161
+ with hot-path optimisations: per-thread cached iso8601 timestamp,
162
+ hand-rolled line builder, lock-free per-thread 4 KiB write buffer.
163
+ 12-factor logger split: `info`/`debug` → stdout, `warn`/`error`/`fatal`
164
+ → stderr.
165
+
166
+ ### Optional io_uring accept loop
167
+
168
+ Linux 5.x+, opt-in via `HYPERION_IO_URING_ACCEPT=1`. Multishot accept
169
+ + per-conn RECV/WRITE/CLOSE state machine on top of liburing. One
170
+ `io_uring_enter` per N requests instead of N×3 syscalls. Compiles out
171
+ cleanly without liburing — the `accept4` path stays the fallback.
172
+ macOS keeps using `accept4`. Default-flip moves to 2.15 with a fresh
173
+ 24h soak.
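+
+ Enabling it is one environment variable on an otherwise normal boot:
+
+ ```sh
+ # falls back to the accept4 loop automatically if liburing is absent
+ HYPERION_IO_URING_ACCEPT=1 bundle exec hyperion -w 4 config.ru
+ ```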
548
174
 
549
175
  ## Configuration
550
176
 
551
- Three layers, in precedence order: explicit CLI flag > environment variable > `config/hyperion.rb` > built-in default.
177
+ Three layers, in precedence order: explicit CLI flag > environment
178
+ variable > `config/hyperion.rb` > built-in default.
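+
+ For example, if all three layers set the log level, the flag wins:
+
+ ```sh
+ # config/hyperion.rb: log_level :info
+ HYPERION_LOG_LEVEL=debug bundle exec hyperion --log-level warn config.ru
+ # effective level: warn (CLI flag outranks env var and config file)
+ ```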
552
179
 
553
- ### CLI flags
180
+ ### Most-used CLI flags
554
181
 
555
182
  | Flag | Default | Notes |
556
183
  |---|---|---|
557
184
  | `-b, --bind HOST` | `127.0.0.1` | |
558
185
  | `-p, --port PORT` | `9292` | |
559
186
  | `-w, --workers N` | `1` | `0` → `Etc.nprocessors` |
560
- | `-t, --threads N` | `5` | OS-thread Rack handler pool per worker. `0` → run inline (no pool, debugging only). |
187
+ | `-t, --threads N` | `5` | OS-thread Rack handler pool per worker. `0` → run inline (debugging). |
561
188
  | `-C, --config PATH` | `config/hyperion.rb` if present | Ruby DSL file. |
562
- | `--tls-cert PATH` | nil | PEM certificate. |
563
- | `--tls-key PATH` | nil | PEM private key. |
564
- | `--log-level LEVEL` | `info` | `debug` / `info` / `warn` / `error` / `fatal`. |
565
- | `--log-format FORMAT` | `auto` | `text` / `json` / `auto`. Auto: JSON when `RAILS_ENV`/`RACK_ENV` is `production`/`staging`, colored text on TTY, JSON otherwise. |
566
- | `--[no-]log-requests` | ON | Per-request access log. |
567
- | `--fiber-local-shim` | off | Patches `Thread#thread_variable_*` to fiber storage for older Rails idioms. |
568
- | `--[no-]yjit` | auto | Force YJIT on/off. Default: auto-on under `RAILS_ENV`/`RACK_ENV` = `production`/`staging`. |
569
- | `--[no-]async-io` | off | Run plain HTTP/1.1 connections under `Async::Scheduler`. Required for `hyperion-async-pg` on plain HTTP. TLS h1 / HTTP/2 always run under the scheduler regardless. |
570
- | `--max-body-bytes BYTES` | `16777216` (16 MiB) | Maximum request body size. |
571
- | `--max-header-bytes BYTES` | `65536` (64 KiB) | Maximum total request-header size. |
572
- | `--max-pending COUNT` | unbounded | Per-worker accept-queue cap before new connections are rejected with HTTP 503 + `Retry-After: 1`. |
573
- | `--max-request-read-seconds SECONDS` | `60` | Total wallclock budget for reading request line + headers + body for ONE request. Slowloris defence. |
574
- | `--admin-token TOKEN` | unset | Bearer token for `POST /-/quit` and `GET /-/metrics`. **Production: prefer `--admin-token-file` — argv is visible via `ps`.** |
575
- | `--admin-token-file PATH` | unset | Read the admin token from a file. Refuses to load if the file is missing or world-readable (mode must mask `0o007`). |
576
- | `--worker-max-rss-mb MB` | unset | Master gracefully recycles a worker once its RSS exceeds this many megabytes. nil = disabled. |
577
- | `--idle-keepalive SECONDS` | `5` | Keep-alive idle timeout. Connection closes after this many seconds of inactivity. |
578
- | `--graceful-timeout SECONDS` | `30` | Shutdown deadline before SIGKILL is delivered to a worker that hasn't drained. |
189
+ | `--tls-cert PATH` / `--tls-key PATH` | nil | PEM cert + key for HTTPS. |
190
+ | `--[no-]async-io` | off | Run plain HTTP/1.1 under `Async::Scheduler`. Required for `hyperion-async-pg` on plain HTTP. |
191
+ | `--preload-static DIR` | nil | Preload static assets from DIR at boot (repeatable, immutable). Rails apps auto-detect from `Rails.configuration.assets.paths`. |
192
+ | `--admin-token-file PATH` | unset | Auth file for `/-/quit` and `/-/metrics`. Refuses world-readable files. |
193
+ | `--worker-max-rss-mb MB` | unset | Master gracefully recycles a worker exceeding MB RSS. |
194
+ | `--max-pending COUNT` | unbounded | Per-worker accept-queue cap before HTTP 503 + `Retry-After: 1`. |
195
+ | `--idle-keepalive SECONDS` | `5` | Keep-alive idle timeout. |
196
+ | `--graceful-timeout SECONDS` | `30` | Shutdown deadline before SIGKILL. |
197
+
198
+ `bin/hyperion --help` prints the full set, including `--max-body-bytes`,
199
+ `--max-header-bytes`, `--max-request-read-seconds` (slowloris defence),
200
+ `--h2-max-total-streams`, `--max-in-flight-per-conn`,
201
+ `--tls-handshake-rate-limit`, and the `--[no-]yjit` /
202
+ `--[no-]log-requests` toggles.
579
203
 
580
204
  ### Environment variables
581
205
 
582
- `HYPERION_LOG_LEVEL`, `HYPERION_LOG_FORMAT`, `HYPERION_LOG_REQUESTS` (`0|1|true|false|yes|no|on|off`), `HYPERION_ENV`, `HYPERION_WORKER_MODEL` (`share|reuseport`).
206
+ `HYPERION_LOG_LEVEL`, `HYPERION_LOG_FORMAT`, `HYPERION_LOG_REQUESTS`
207
+ (`0|1|true|false|yes|no|on|off`), `HYPERION_ENV`,
208
+ `HYPERION_WORKER_MODEL` (`share|reuseport`), `HYPERION_IO_URING_ACCEPT`
209
+ (`0|1`), `HYPERION_H2_DISPATCH_POOL`, `HYPERION_H2_NATIVE_HPACK`
210
+ (`v2|ruby|off`), `HYPERION_H2_TIMING`.
583
211
 
584
212
  ### Config file
585
213
 
586
- `config/hyperion.rb` — same shape as Puma's `puma.rb`. Auto-loaded if present.
214
+ `config/hyperion.rb` — same shape as Puma's `puma.rb`. Auto-loaded if
215
+ present. Strict DSL: unknown methods raise `NoMethodError` at boot.
587
216
 
588
217
  ```ruby
589
218
  # config/hyperion.rb
@@ -600,16 +229,11 @@ read_timeout 30
600
229
  idle_keepalive 5
601
230
  graceful_timeout 30
602
231
 
603
- max_header_bytes 64 * 1024
604
- max_body_bytes 16 * 1024 * 1024
605
-
606
232
  log_level :info
607
233
  log_format :auto
608
234
  log_requests true
609
235
 
610
- fiber_local_shim false
611
-
612
- async_io nil # Three-way (1.4.0+): nil (default, auto: inline-on-fiber for TLS h1, pool hop for plain HTTP/1.1), true (force inline-on-fiber everywhere — required for hyperion-async-pg on plain HTTP/1.1), false (force pool hop everywhere — explicit opt-out for TLS+threadpool with CPU-heavy handlers). ~5% throughput hit on hello-world when inline; in exchange one OS thread serves N concurrent in-flight DB queries on wait-bound workloads. TLS / HTTP/2 accept loops always run under Async::Scheduler regardless of this flag.
236
+ async_io nil # nil = auto (1.4.0+), true = inline-on-fiber everywhere, false = pool everywhere
613
237
 
614
238
  before_fork do
615
239
  ActiveRecord::Base.connection_handler.clear_all_connections! if defined?(ActiveRecord)
@@ -618,199 +242,202 @@ end
618
242
  on_worker_boot do |worker_index|
619
243
  ActiveRecord::Base.establish_connection if defined?(ActiveRecord)
620
244
  end
621
-
622
- on_worker_shutdown do |worker_index|
623
- ActiveRecord::Base.connection_handler.clear_all_connections! if defined?(ActiveRecord)
624
- end
625
245
  ```
626
246
 
627
- Strict DSL: unknown methods raise `NoMethodError` at boot — typos surface immediately rather than getting silently ignored.
628
-
629
- A documented sample lives at [`config/hyperion.example.rb`](config/hyperion.example.rb).
247
+ A documented sample lives at
248
+ [`config/hyperion.example.rb`](config/hyperion.example.rb).
630
249
 
631
250
  ## Operator guidance
632
251
 
633
- Concrete tradeoffs distilled from [`docs/BENCH_2026_04_27.md`](docs/BENCH_2026_04_27.md). If the bench numbers cited below feel surprising, check that doc for the full matrix + caveats.
634
-
635
- ### When to use `-w N`
636
-
637
- | Workload shape | Recommended | Why |
638
- |---|---|---|
639
- | **Pure I/O-bound** (PG / Redis / external HTTP, no significant CPU) | `-w 1` + larger pool | Bench: `-w 1 pool=200` = 87 MB / 2,180 r/s vs `-w 4 pool=64` = 224 MB / 1,680 r/s. **2.6× more memory, 0.77× rps** if you pick multi-worker on a wait-bound workload. |
640
- | **Pure CPU-bound** (heavy JSON / template render / image processing) | `-w N` matching CPU count | Each worker's accept loop is single-threaded under `--async-io`; multi-worker gives CPU-parallelism. Bench: `-w 16 -t 5` hits 98,818 r/s on a 16-vCPU box, 4.7× a `-w 1` ceiling on the same hardware. |
641
- | **Mixed** (Rails-shaped: ~5 ms CPU + 50 ms PG wait per request) | `-w N/2` (half cores) + medium pool | Lets CPU work parallelise while keeping per-worker memory tractable. Bench `pg_mixed.ru` (in hyperion-async-pg repo / `~/bench/`) at `-w 4 -t 5 pool=128` = 1,740 r/s with no cold-start spike (ForkSafe `prefill_in_child: true`). |
642
-
643
- Multi-worker on PG-wait workloads is the **wrong** default for most apps — the headline rps doesn't justify the memory and PG-connection cost. Verify your shape with the bench before scaling out.
252
+ Distilled from [docs/BENCH_2026_04_27.md](docs/BENCH_2026_04_27.md)
253
+ (Rails 8.1 real-app sweep). Headline finding: **the simplest drop-in
254
+ is the right answer.**
644
255
 
645
- ### When to use `--async-io`
256
+ ### Migrating from Puma
646
257
 
647
- ```
648
- Are you using a fiber-cooperative I/O library?
649
- (hyperion-async-pg, async-redis, async-http)
650
-
651
- ┌─────────────┴─────────────┐
652
- yes no
653
- │ │
654
- Pair with a fiber-aware Leave --async-io OFF.
655
- connection pool Default thread-pool dispatch
656
- (FiberPool, async-pool — is faster for synchronous
657
- NOT connection_pool gem, Rails apps. Bench: --async-io
658
- which uses non-fiber Mutex). on hello-world = 47% rps
659
- │ regression + p99 spike to
660
- Set --async-io. 3.65 s under no-yield workloads.
661
- Pool size is the real No reason to flip the flag.
662
- concurrency knob; -t is
663
- decorative for wait-bound.
664
- ```
258
+ Run `hyperion -t N -w M`, matching your current Puma `-t N:N -w M`. No other
259
+ flags. Versus Puma at the same `-t/-w` shape on real Rails endpoints:
260
+ **+9% rps on lightweight endpoints, 28× lower p99 on health-style
261
+ endpoints, 3.8× lower p99 on PG-touching endpoints.** Same RSS, same
262
+ operator surface — keep all your existing config, monitoring, deploy
263
+ scripts.
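+
+ A minimal config sketch of that shape. The `workers` / `threads` keys
+ here are assumptions carried over from the "same shape as Puma's
+ `puma.rb`" contract in the Config file section; verify against
+ `config/hyperion.example.rb`:
+
+ ```ruby
+ # config/hyperion.rb: drop-in for a Puma `-w 4 -t 10:10` deploy
+ workers 4        # match your current Puma worker count
+ threads 10, 10   # match RAILS_MAX_THREADS; fixed-size pool, Puma-style
+ ```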
665
264
 
666
- Hyperion warns at boot if you set `--async-io` without any fiber-cooperative library loaded. The setting is still honoured; the warn just nudges operators who flipped it expecting a free perf bump.
265
+ ### Knobs that help on synthetic benches but **not** on real Rails
667
266
 
668
- ### Tuning `-t` and pool sizes
267
+ | Knob | Synthetic bench result | Real Rails result | Recommendation |
268
+ |---|---|---|---|
269
+ | `-t 30` | +5–10% on hello-world | **Hurts** p99 vs `-t 10` (3.51 s vs 148 ms on `/up`) — GVL + middleware Mutex contention | Stay at `-t 10`. |
270
+ | `--yjit` | +5–10% on CPU-bound | Wash on dev-mode Rails | Skip until you bench production-mode. |
271
+ | `RAILS_POOL > 25` | n/a | No improvement at 50 or 100 | Keep your existing AR pool. |
272
+ | `--async-io` | 33–42× rps on PG-bound | **Worse** than drop-in (4.14 s p99 on `/up`) until your full I/O stack is fiber-cooperative | Don't enable until `redis-rb` → `async-redis`; see the sketch below the table. |
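+
+ If you do flip `--async-io`, confirm the fiber path is actually engaged
+ via the dispatch counters documented under Metrics below. A console
+ sketch (counter names from that section):
+
+ ```ruby
+ s = Hyperion.stats
+ async  = s[:requests_async_dispatched].to_i
+ pooled = s[:requests_threadpool_dispatched].to_i
+ total  = async + pooled
+ # A high async share means inline-on-fiber dispatch is live; near zero
+ # means requests are still hopping to the thread pool.
+ puts total.zero? ? "no requests yet" : format("async share: %.1f%%", 100.0 * async / total)
+ ```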
669
273
 
670
- - **Without `--async-io`** (sync server, default): `-t` is the concurrency knob. Each in-flight request holds an OS thread; pool size should match `-t`. Bench shows Puma-style behaviour — at 200 wrk conns hitting a 5-thread server, queue depth dominates p99 (Hyperion `-t 5 -w 1` p50 = 0.95 ms vs Puma's same shape at 59.5 ms — Hyperion's queueing is cheaper but the model still serializes at `-t`).
671
- - **With `--async-io` + a fiber-aware pool**: pool size is the concurrency knob. `-t` is decorative for wait-bound workloads; one accept-loop fiber serves all in-flight queries via the pool. Linear scaling: pool=64 → ~780 r/s, pool=128 → ~1,344 r/s, pool=200 → ~2,180 r/s on 50 ms PG queries.
672
- - **Pool over WAN**: if `PG.connect` round-trip is >50 ms, expect pool fill at startup to take `pool_size / parallel_fill_threads × RTT`. `hyperion-async-pg 0.5.1+` auto-scales `parallel_fill_threads` so pool=200 fills in ~1-2 s.
274
+ ### When `-w N` helps
673
275
 
674
- ### How to read p50 vs p99
276
+ | Workload | Recommended | Why |
277
+ |---|---|---|
278
+ | Pure I/O-bound (PG / Redis / external HTTP) | `-w 1` + larger pool | `-w 1 pool=200` = 87 MB / 2,180 r/s vs `-w 4 pool=64` = 224 MB / 1,680 r/s. **2.6× memory, 0.77× rps** if you pick multi-worker on a wait-bound workload. |
279
+ | Pure CPU-bound | `-w N` matching CPU count | Bench: `-w 16 -t 5` hits 98,818 r/s on a 16-vCPU box. |
280
+ | Mixed (Rails-shaped, ~5 ms CPU + 50 ms wait) | `-w N/2` (half cores) + medium pool | `-w 4 -t 5 pool=128` = 1,740 r/s on `pg_mixed.ru`, no cold-start spike. |
675
281
 
676
- Tail latency tells the queueing story; rps tells the throughput story. Hyperion's tail wins are **always** bigger than its rps wins — sometimes the rps numbers look close to a competitor while p99 is 5-200× lower:
282
+ ### Read p99, not the mean
677
283
 
678
284
  | Workload | Hyperion rps / p99 | Closest competitor | rps ratio | p99 ratio |
679
285
  |---|---|---|---:|---:|
680
- | Hello `-w 4` | 21,215 r/s / 1.87 ms | Falcon 24,061 / 9.78 ms | 0.88× | **5.2× lower** |
681
- | CPU JSON `-w 4` | 15,582 r/s / 2.47 ms | Falcon 18,643 / 13.51 ms | 0.84× | **5.5× lower** |
682
- | Static 1 MiB | 1,919 r/s / 4.22 ms | Puma 2,074 / 55 ms | 0.93× | **13× lower** |
683
- | PG-wait `-w 1` pool=200 | 2,180 r/s / 668 ms | Puma 530 r/s + 200 timeouts | **4.1×** | qualitative crush |
684
-
685
- **Size capacity by p99, not by mean.** Throughput peaks are easy to fake under controlled bench conditions; tail latency reflects what your slowest user actually experiences when the load balancer fans them onto a busy worker.
686
-
687
- ### Production tuning (real Rails apps)
688
-
689
- Distilled from a real-app bench against the [Exodus platform](https://github.com/andrew-woblavobla/hyperion/blob/master/docs/BENCH_2026_04_27.md) (Rails 8.1, on-LAN PG + Redis at ~0.3 ms RTT, `-w 4 -t 10`, `wrk -t8 -c200 -d30s`). The headline finding: the **simplest drop-in is the right answer**, and the additional knobs operators reach for first don't help on real Rails.
690
-
691
- **Recommended for migrating from Puma**: `hyperion -t N -w M` matching your current Puma `-t N:N -w M`. No other flags. That gives you (vs Puma at the same `-t/-w`):
692
-
693
- - **+9% rps on lightweight endpoints** (matches the 5-10% per-request CPU savings the rest of the bench section documents).
694
- - **28× lower p99 on health-style endpoints** — the queue-of-doom shape Puma exhibits under sustained 200-conn load doesn't reproduce on Hyperion's worker-owns-connection model.
695
- - **3.8× lower p99 on PG-touching endpoints**.
696
- - **Same RSS, same operator surface** — you keep all your existing config, monitoring, and deploy scripts.
697
-
698
- **Knobs that help on synthetic benches but NOT on real Rails — leave them off:**
699
-
700
- | Knob | Synthetic bench result | Real Rails result | Recommendation |
701
- |---|---|---|---|
702
- | `-t 30` (more threads/worker) | Helped Hyperion 5-10% on hello-world | **Hurt** p99 vs `-t 10` on real Rails (3.51 s vs 148 ms on /up) — GVL + middleware Mutex contention dominates past `-t 10` | Stay at `-t 10`. Match Puma's recommended `RAILS_MAX_THREADS`. |
703
- | `--yjit` | 5-10% on synthetic CPU-bound | Wash on dev-mode Rails (312 vs 328 rps, p99 worse with YJIT) | Skip for now. Production-mode Rails may behave differently — verify with your own bench before flipping. |
704
- | `RAILS_POOL` > 25 | n/a | No improvement at pool=50 or pool=100 on real Rails (rps within 3%, p99 within noise). Pool starvation is rarely the bottleneck on a `-w 4 -t 10` config | Keep your existing AR pool size. |
705
- | `--async-io` | 33-42× rps on PG-bound (with `hyperion-async-pg`) | **Worse** than drop-in on real Rails (4.14 s p99 on /up vs 148 ms drop-in) | **Don't enable** until your full I/O stack is fiber-cooperative. The synchronous Redis client (`redis-rb`) blocks the OS thread before async-pg can yield, so fibers can't compound. Migrate to `async-redis` *first*, then revisit. |
706
- | `--async-io` + `hyperion-async-pg` AR adapter | Verified 48× rps lift on a single-PG-query bench | Marginal-or-negative on real Rails (similar reason: Redis-first handlers don't yield) | Same — wait for a full-async I/O stack. |
286
+ | Hello `-w 4` | 21,215 / 1.87 ms | Falcon 24,061 / 9.78 ms | 0.88× | **5.2× lower** |
287
+ | CPU JSON `-w 4` | 15,582 / 2.47 ms | Falcon 18,643 / 13.51 ms | 0.84× | **5.5× lower** |
288
+ | Static 1 MiB | 1,919 / 4.22 ms | Puma 2,074 / 55 ms | 0.93× | **13× lower** |
289
+ | PG-wait `-w 1` pool=200 | 2,180 / 668 ms | Puma 530 r/s + 200 timeouts | **4.1×** | qualitative crush |
707
290
 
708
- **Why the simple drop-in wins on real Rails:** the per-request budget on a real handler is dominated by the Rails middleware chain (rack-attack, locale redirect, tagger, etc.) + handler logic + DB + cache I/O. Hyperion's per-request CPU optimizations (C-ext header parser, response builder, lock-free metrics, fiber-cooperative TLS dispatch in 1.4.0+) shave ~5-10% off the *non-I/O* portion of the budget consistently — and the [worker-owns-connection model](#concurrency-at-scale-architectural-advantages) prevents the queue-amplification that Puma's thread-pool dispatch shows under sustained load. You don't need to "tune" anything to get those.
291
+ Throughput peaks are easy to fake under controlled conditions; tail
292
+ latency reflects what your slowest user actually experiences when the
293
+ load balancer fans them onto a busy worker.
709
294
 
710
295
  ## Logging
711
296
 
712
- Default behaviour (rc16+):
713
-
714
- - **`info` / `debug` → stdout**, **`warn` / `error` / `fatal` → stderr** (12-factor).
715
- - **One structured access-log line per response**, info level, on stdout. Disable with `--no-log-requests` or `HYPERION_LOG_REQUESTS=0`.
716
- - **Format auto-selects**: production envs → JSON (line-delimited, parseable by every log aggregator); TTY → coloured text; piped output without env hint → JSON.
297
+ Default behaviour:
717
298
 
718
- ### Sample access log lines
299
+ - `info`/`debug` → stdout, `warn`/`error`/`fatal` → stderr (12-factor).
300
+ - One structured access-log line per response, `info` level. Disable
301
+ with `--no-log-requests` or `HYPERION_LOG_REQUESTS=0`.
302
+ - Format auto-selects: `RAILS_ENV=production`/`staging` → JSON; TTY →
303
+ coloured text; piped output without env hint → JSON (override sketch below).
719
304
 
720
- Text format (TTY default):
305
+ Sample text (TTY default):
721
306
 
722
307
  ```
723
308
  2026-04-26T18:40:04.112Z INFO [hyperion] message=request method=GET path=/api/v1/health status=200 duration_ms=46.63 remote_addr=127.0.0.1 http_version=HTTP/1.1
724
- 2026-04-26T18:40:04.123Z INFO [hyperion] message=request method=GET path=/api/v1/cached_data query="currency=USD" status=200 duration_ms=43.87 remote_addr=127.0.0.1 http_version=HTTP/1.1
725
309
  ```
726
310
 
727
- JSON format (auto-selected on `RAILS_ENV=production`/`staging` or piped output):
311
+ Sample JSON (production / piped):
728
312
 
729
313
  ```json
730
314
  {"ts":"2026-04-26T18:38:49.405Z","level":"info","source":"hyperion","message":"request","method":"GET","path":"/api/v1/health","status":200,"duration_ms":46.63,"remote_addr":"127.0.0.1","http_version":"HTTP/1.1"}
731
- {"ts":"2026-04-26T18:38:49.411Z","level":"info","source":"hyperion","message":"request","method":"GET","path":"/api/v1/cached_data","query":"currency=USD","status":200,"duration_ms":40.64,"remote_addr":"127.0.0.1","http_version":"HTTP/1.1"}
732
315
  ```
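+
+ To pin the format instead of relying on auto-detect, set the config keys
+ shown in the Config file section. Treat `:json` as an assumed explicit
+ `log_format` value; the documented sample only shows `:auto`:
+
+ ```ruby
+ # config/hyperion.rb
+ log_format :json     # force line-delimited JSON even on a TTY
+ log_requests false   # same effect as --no-log-requests / HYPERION_LOG_REQUESTS=0
+ ```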
733
316
 
734
- ### Hot-path optimisations
735
-
736
- The default-ON access log path is engineered to stay near-zero cost:
737
-
738
- - **Per-thread cached `iso8601(3)` timestamp** — one allocation per millisecond per thread, reused across all requests in that millisecond.
739
- - **Hand-rolled single-interpolation line builder** — bypasses generic `Hash#map.join`.
740
- - **Per-thread 4 KiB write buffer** — flushes to stdout when full or on connection close. Cuts ~32× the syscalls under load.
741
- - **Lock-free emit** — POSIX `write(2)` is atomic for writes ≤ PIPE_BUF (4096 B); a log line is ~200 B. No logger mutex.
742
-
743
317
  ## Metrics
744
318
 
745
- `Hyperion.stats` returns a snapshot Hash with the following counters (lock-free per-thread aggregation):
319
+ `Hyperion.stats` returns a snapshot Hash with lock-free per-thread
320
+ counters (`connections_accepted`, `connections_active`, `requests_total`,
321
+ `requests_in_flight`, `responses_<code>`, `parse_errors`, `app_errors`,
322
+ `read_timeouts`, `requests_threadpool_dispatched`,
323
+ `requests_async_dispatched`, `c_loop_requests_total`).
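+
+ Quick console check:
+
+ ```ruby
+ require 'hyperion'
+ Hyperion.stats
+ # => {connections_accepted: 1234, connections_active: 7, requests_total: 8910, …}
+ ```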
746
324
 
747
- | Counter | Meaning |
748
- |---|---|
749
- | `connections_accepted` | Lifetime accept count. |
750
- | `connections_active` | Currently in-flight connections. |
751
- | `requests_total` | Lifetime request count. |
752
- | `requests_in_flight` | Currently in-flight requests. |
753
- | `responses_<code>` | One counter per status code emitted (`responses_200`, `responses_400`, …). |
754
- | `parse_errors` | HTTP parse failures → 400. |
755
- | `app_errors` | Rack app raised → 500. |
756
- | `read_timeouts` | Per-connection read deadline hit. |
757
- | `requests_threadpool_dispatched` | HTTP/1.1 connection handed to the worker pool (or served inline in `start_raw_loop` when `thread_count: 0`). The default dispatch path. |
758
- | `requests_async_dispatched` | HTTP/1.1 connection served inline on the accept-loop fiber under `--async-io`. Operators can use the ratio against `requests_threadpool_dispatched` to verify fiber-cooperative I/O is actually engaged. |
759
-
760
- ```ruby
761
- require 'hyperion'
762
- Hyperion.stats
763
- # => {connections_accepted: 1234, connections_active: 7, requests_total: 8910, …}
764
- ```
765
-
766
- ### Prometheus exporter
767
-
768
- When `admin_token` is set in your config, Hyperion mounts a `/-/metrics` endpoint that emits Prometheus text-format v0.0.4. Same token guards both `/-/metrics` (GET) and `/-/quit` (POST); auth is via the `X-Hyperion-Admin-Token` header.
325
+ When `admin_token` is set, `/-/metrics` emits Prometheus text-format
326
+ v0.0.4. Auth is via the `X-Hyperion-Admin-Token` header (same token
327
+ guards `POST /-/quit`):
769
328
 
770
329
  ```sh
771
330
  $ curl -s -H 'X-Hyperion-Admin-Token: secret' http://127.0.0.1:9292/-/metrics
772
331
  # HELP hyperion_requests_total Total HTTP requests handled
773
332
  # TYPE hyperion_requests_total counter
774
333
  hyperion_requests_total 8910
775
- # HELP hyperion_bytes_written_total Total bytes written to response sockets
776
- # TYPE hyperion_bytes_written_total counter
777
- hyperion_bytes_written_total 2351023
778
- # HELP hyperion_responses_status_total Responses by HTTP status code
779
- # TYPE hyperion_responses_status_total counter
780
334
  hyperion_responses_status_total{status="200"} 8521
781
335
  hyperion_responses_status_total{status="404"} 12
782
- hyperion_responses_status_total{status="500"} 3
783
- # … and so on for sendfile_responses_total, rejected_connections_total,
784
- # slow_request_aborts_total, requests_async_dispatched_total, etc.
785
336
  ```
786
337
 
787
- Any counter not in the known set (added by app middleware via `Hyperion.metrics.increment(:custom_thing)`) is auto-exported as `hyperion_custom_thing` with a generic HELP line — no Hyperion config change required.
338
+ Any counter not in the known set (added via
339
+ `Hyperion.metrics.increment(:custom_thing)`) is auto-exported as
340
+ `hyperion_custom_thing` with a generic HELP line. Network-isolate the
341
+ admin endpoints if the listener is internet-facing — see
342
+ [docs/REVERSE_PROXY.md](docs/REVERSE_PROXY.md) for the nginx
343
+ `location /-/ { return 404; }` recipe.
344
+
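+ A Rack middleware sketch for such a custom counter (the `increment` call
+ is the documented API; the middleware itself is illustrative):
+
+ ```ruby
+ # Bumps a counter on one path; the exporter emits it as
+ # `hyperion_custom_thing` with a generic HELP line.
+ class CountThing
+   def initialize(app)
+     @app = app
+   end
+
+   def call(env)
+     Hyperion.metrics.increment(:custom_thing) if env["PATH_INFO"] == "/thing"
+     @app.call(env)
+   end
+ end
+ ```
+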
345
+ ## Compatibility
346
+
347
+ | Component | Version |
348
+ |---|---|
349
+ | Ruby | 3.3+ (transitive `protocol-http2 ~> 0.26` floor) |
350
+ | Rack | 3.x |
351
+ | Rails | verified up to 8.1 |
352
+ | Linux kernel | 5.x+ for io_uring opt-in; 4.x+ otherwise |
353
+ | macOS | works (TLS, h2, WebSockets, `accept4` fallback path) |
788
354
 
789
- Point your scraper at it: in Prometheus' `scrape_configs`, set `metrics_path: /-/metrics` and `bearer_token` (or use a custom header relabel — Prometheus 2.42+ supports `authorization.credentials_file` paired with a custom `header` block). Network-isolate the admin endpoints if the listener is internet-facing — see [docs/REVERSE_PROXY.md](docs/REVERSE_PROXY.md) for the nginx `location /-/ { return 404; }` recipe.
355
+ Per the Rack 3 spec, Hyperion auto-sets `SERVER_SOFTWARE`, `rack.version`,
356
+ `REMOTE_ADDR`, IPv6-safe `Host` parsing, CRLF guard. The
357
+ `Hyperion::FiberLocal.install!` opt-in shim handles the residual
358
+ `Thread.current.thread_variable_*` footgun in older Rails idioms;
359
+ modern Rails 7.1+ already uses Fiber storage natively.
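+
+ Opting in to the shim, sketched as a Rails initializer (the `install!`
+ call is as named above; the file location is just a convention):
+
+ ```ruby
+ # config/initializers/hyperion.rb
+ # Only needed if the app still stores request-scoped state via
+ # Thread.current.thread_variable_*.
+ Hyperion::FiberLocal.install! if defined?(Hyperion::FiberLocal)
+ ```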
790
360
 
791
- ## TLS + HTTP/2
361
+ ## Reproducing benchmarks
792
362
 
793
- Provide a PEM cert + key:
363
+ Every number in this README is reproducible. Per-row commands:
794
364
 
795
365
  ```sh
796
- bundle exec hyperion --tls-cert config/cert.pem --tls-key config/key.pem -p 9443 config.ru
797
- ```
366
+ # Setup (once)
367
+ bundle install
368
+ bundle exec rake compile
798
369
 
799
- ALPN auto-negotiates `h2` (HTTP/2) or `http/1.1` per connection. HTTP/2 multiplexes streams onto fibers within a single connection — slow handlers don't head-of-line-block other streams. Cluster-mode TLS works (`-w N` + `--tls-cert` / `--tls-key`).
370
+ # Hello via Server.handle_static + io_uring (134k r/s row)
371
+ HYPERION_IO_URING_ACCEPT=1 bundle exec bin/hyperion -w 1 -t 5 -p 9292 bench/hello_static.ru &
372
+ wrk -t4 -c100 -d20s --latency http://127.0.0.1:9292/
800
373
 
801
- Smuggling defenses for HTTP/1.1: `Content-Length` + `Transfer-Encoding` together → 400; non-chunked `Transfer-Encoding` → 501; CRLF in response header values → `ArgumentError` (response-splitting guard).
374
+ # Dynamic block via Server.handle (9.4k r/s row)
375
+ bundle exec bin/hyperion -w 1 -t 5 -p 9292 bench/hello_handle_block.ru &
376
+ wrk -t4 -c100 -d20s --latency http://127.0.0.1:9292/
802
377
 
803
- ## Compatibility
378
+ # Generic Rack hello (4.7k r/s row)
379
+ bundle exec bin/hyperion -w 1 -t 5 -p 9292 bench/hello.ru &
380
+ wrk -t4 -c100 -d20s --latency http://127.0.0.1:9292/
381
+
382
+ # CPU JSON via block form (5.9k r/s row)
383
+ bundle exec bin/hyperion -w 1 -t 5 -p 9292 bench/work.ru &
384
+ wrk -t4 -c200 -d15s --latency http://127.0.0.1:9292/
385
+
386
+ # 4-way comparator (Hyperion vs Puma vs Falcon vs Agoo)
387
+ bash bench/4way_compare.sh
388
+
389
+ # gRPC unary + streaming (Hyperion side)
390
+ GHZ=/tmp/ghz TRIALS=3 DURATION=15s WARMUP_DURATION=3s bash bench/grpc_stream_bench.sh
391
+
392
+ # Idle keep-alive RSS sweep (10k conns × 30s hold)
393
+ bash bench/keepalive_memory.sh
394
+ ```
804
395
 
805
- - **Ruby 3.3+** required (the `protocol-http2 ~> 0.26` transitive dep imposes this floor; older Ruby installs error at `bundle install`).
806
- - **Rack 3** (auto-sets `SERVER_SOFTWARE`, `rack.version`, `REMOTE_ADDR`, IPv6-safe `Host` parsing, CRLF guard).
807
- - **`Hyperion::FiberLocal.install!`** opt-in shim for older Rails apps that store request-scoped data via `Thread.current.thread_variable_*` (modern Rails 7.1+ already uses Fiber storage natively; the shim handles the residual footgun).
808
- - **`Hyperion::FiberLocal.verify_environment!`** runtime check that `Thread.current[:k]` is fiber-local on the current Ruby (it is on 3.2+).
396
+ PG benches (`pg_concurrent.ru`, `pg_mixed.ru`) live in the
397
+ [hyperion-async-pg](https://github.com/andrew-woblavobla/hyperion-async-pg)
398
+ companion repo; they require a running Postgres and the companion
399
+ gem.
400
+
401
+ If numbers from your host don't match the published ones, the
402
+ most likely explanations (in order) are: (1) bench-host noise — single-VM
403
+ benches drift 10–30% over days; (2) Puma version mismatch (sweep used
404
+ Puma 8.0.1; the in-repo Gemfile pins `~> 6.4`); (3) different kernel
405
+ or Ruby; (4) different `-t` / `-c` (apples-to-apples requires
406
+ identical worker count, thread count, wrk concurrency, payload, and
407
+ TLS cipher).
408
+
409
+ ## Release history
410
+
411
+ See [CHANGELOG.md](CHANGELOG.md). Recent: 2.14.0 (gRPC streaming ghz
412
+ numbers; dynamic-block C dispatch — `Server.handle { |env| ... }` lifts
413
+ hello to 9,422 r/s and CPU JSON to 5,897 r/s; `Server#stop` accept-wake
414
+ on Linux; io_uring 4h soak), 2.13.0 (response head builder C-rewrite;
415
+ gRPC streaming RPCs; soak harness), 2.12.0 (C connection lifecycle;
416
+ io_uring loop hits 134k r/s; gRPC unary trailers; SO_REUSEPORT
417
+ audit), 2.11.0 (HPACK CGlue default; h2 dispatch-pool warmup), 2.10.x
418
+ (`PageCache`, `Server.handle` direct routes, TCP_NODELAY at accept).
419
+
420
+ ## Links
421
+
422
+ - [CHANGELOG.md](CHANGELOG.md) — per-stream releases.
423
+ - [docs/BENCH_HYPERION_2_11.md](docs/BENCH_HYPERION_2_11.md) — current
424
+ 4-way matrix + 2.14-D gRPC numbers.
425
+ - [docs/BENCH_HYPERION_2_0.md](docs/BENCH_HYPERION_2_0.md) — historical
426
+ 2.10-B baseline (preserved for archaeology).
427
+ - [docs/BENCH_2026_04_27.md](docs/BENCH_2026_04_27.md) — real Rails 8.1
428
+ app sweep (Exodus platform).
429
+ - [docs/OBSERVABILITY.md](docs/OBSERVABILITY.md) — metrics + Grafana.
430
+ - [docs/WEBSOCKETS.md](docs/WEBSOCKETS.md) — RFC 6455 surface.
431
+ - [docs/MIGRATING_FROM_PUMA.md](docs/MIGRATING_FROM_PUMA.md) — drop-in
432
+ guide.
433
+ - [docs/REVERSE_PROXY.md](docs/REVERSE_PROXY.md) — nginx fronting.
809
434
 
810
435
  ## Credits
811
436
 
812
- - Vendored [llhttp](https://github.com/nodejs/llhttp) (Node.js's HTTP parser, MIT) under `ext/hyperion_http/llhttp/`.
813
- - HTTP/2 framing and HPACK via [`protocol-http2`](https://github.com/socketry/protocol-http2).
437
+ - Vendored [llhttp](https://github.com/nodejs/llhttp) (Node.js's HTTP
438
+ parser, MIT) under `ext/hyperion_http/llhttp/`.
439
+ - HTTP/2 framing and HPACK via
440
+ [`protocol-http2`](https://github.com/socketry/protocol-http2).
814
441
  - Fiber scheduler via [`async`](https://github.com/socketry/async).
815
442
 
816
443
  ## License