hyperion-rb 2.13.0 → 2.14.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
data/README.md CHANGED
@@ -1,707 +1,218 @@
1
1
  # Hyperion
2
2
 
3
- High-performance Ruby HTTP server. Falcon-class fiber concurrency, Puma-class compatibility.
3
+ High-performance Ruby HTTP server. Rack 3 + HTTP/2 + WebSockets + gRPC on a single binary.
4
4
 
5
5
  [![CI](https://github.com/andrew-woblavobla/hyperion/actions/workflows/ci.yml/badge.svg)](https://github.com/andrew-woblavobla/hyperion/actions/workflows/ci.yml)
6
6
  [![Gem Version](https://img.shields.io/gem/v/hyperion-rb.svg)](https://rubygems.org/gems/hyperion-rb)
7
7
  [![License: MIT](https://img.shields.io/github/license/andrew-woblavobla/hyperion.svg)](https://github.com/andrew-woblavobla/hyperion/blob/master/LICENSE)
8
8
 
9
+ Hyperion serves a hello-world Rack response at **134,084 r/s with a 1.14 ms p99**
10
+ on a single worker (Linux 6.x, io_uring accept loop, `Server.handle_static`),
11
+ **7×** Agoo's 19,024 r/s on the same hardware. Beyond the C-side fast path
12
+ it's a complete Rack 3 server: HTTP/1.1 + HTTP/2 with ALPN, WebSockets
13
+ (RFC 6455), gRPC unary + streaming on the Rack 3 trailers contract, native
14
+ fiber concurrency for PG-bound apps, and pre-fork cluster mode with
15
+ SO_REUSEPORT-balanced workers.
16
+
9
17
  ```sh
10
18
  gem install hyperion-rb
11
- bundle exec hyperion config.ru
12
- ```
13
-
14
- ## What's new in 2.13.0
15
-
16
- **Hardening sprint: profile-driven CPU work, durable infrastructure,
17
- and gRPC streaming.** 2.13 follows up the 2.12 perf jump with the
18
- work that wasn't structural enough to need its own major release —
19
- each stream is small, but together they add up:
20
-
21
- - **2.13-A — Generic Rack hot-path wins.** Per-thread shards for
22
- hot-path metrics (no more cross-worker mutex on `observe_histogram`),
23
- cached `worker_id` label tuple, and a Rack-3 keepalive fast-path.
24
- Generic Rack hello bench: **+3.6% on `-c100` no-log**. Honest
25
- finding: env-pool + body-coalesce already shipped; the deeper
26
- generic-Rack gap to Agoo is a single-thread Ruby ceiling; closing it
27
- means moving `app.call` into the C accept loop — a 2.14 lift.
28
- - **2.13-B — Response head builder rewritten in C.** Pre-baked
29
- status-line table, `rb_hash_foreach` replacing `rb_funcall(:keys)`,
30
- per-key downcase + per-(key, value) full-line caches, custom `itoa`
31
- replacing `snprintf`. **+7.7% single-thread synthetic; multi-thread
32
- neutral (GVL-bound).** Profile confirms Hyperion's own C-ext code is
33
- **<1%** of wall-clock; the rest is libruby + JSON gem.
34
- - **2.13-C — Spec flake hunt.** Two long-standing flakes fixed:
35
- `tls_ktls_spec` macOS skip leak (unconditional `RUBY_PLATFORM`
36
- guard), and `connection_loop_spec:79` Linux port-bind flake (root
37
- cause: Linux `close()` doesn't wake a parked `accept(2)` — fixed
38
- with a `stop_loop_and_wake` helper). 5/5 → 0/10 failure rate;
39
- spec-suite runtime 46 s → 1.3 s.
40
- - **2.13-D — gRPC streaming RPCs.** Server-streaming, client-streaming,
41
- and bidirectional RPCs on top of 2.12-F's unary trailers foundation.
42
- New `bench/grpc_stream.{proto,ru}` + `grpc_stream_bench.sh` ghz
43
- harness for operator-side comparison vs Falcon's `async-grpc`.
44
- - **2.13-E — io_uring soak harness + CI smoke.** New
45
- `bench/io_uring_soak.sh` runs a 24h soak against the 2.12-D
46
- io_uring loop with `/proc/$PID` + `/-/metrics` sampling, emits a
47
- CSV + verdict (PASS / SOAK FAIL / borderline). New
48
- `spec/hyperion/io_uring_soak_smoke_spec.rb` runs a 1000-request
49
- mini-soak in CI to catch leak regressions before any 24h run.
50
- **`HYPERION_IO_URING_ACCEPT` stays opt-in for 2.13** — operators
51
- with their own staging environments can now collect signal; the
52
- default-flip decision moves to 2.14.
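The 2.13-C root cause deserves a standalone illustration: on Linux, a thread parked in `accept(2)` is not woken when another thread calls `close(2)` on the listener, so a shutdown path has to wake it explicitly. A minimal pure-Ruby sketch of the idea — the real `stop_loop_and_wake` lives in Hyperion's specs; this standalone version is illustrative:

```ruby
require 'socket'

# Illustrative stand-in for the 2.13-C spec helper: set the stop flag first,
# then make a throwaway connection so the parked accept(2) returns and the
# loop can observe the flag.
def stop_loop_and_wake(server, state)
  state[:stop] = true
  begin
    TCPSocket.new('127.0.0.1', server.local_address.ip_port).close
  rescue SystemCallError
    # listener already gone; nothing left to wake
  end
end

server = TCPServer.new('127.0.0.1', 0)
state = { stop: false }
loop_thread = Thread.new do
  until state[:stop]
    conn = server.accept   # parks here between connections
    conn.close
  end
end

sleep 0.05                 # give the loop time to park in accept
stop_loop_and_wake(server, state)
loop_thread.join(2)        # returns promptly instead of hanging
server.close
```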
53
-
54
- Plus all previous wins are preserved and verified by the 1183-spec
55
- suite (2.10-G TCP_NODELAY at accept, 2.10-E preload hooks, 2.10-F
56
- C-ext fast-path response writer, 2.11-A dispatch pool warmup, 2.11-B
57
- CGlue HPACK default, 2.12-C accept4 connection loop, 2.12-D io_uring
58
- loop, 2.12-E per-worker request counter, 2.12-F gRPC unary trailers).
59
-
60
- Full per-stream details, bench tables, and follow-up items in
61
- [`CHANGELOG.md`](CHANGELOG.md).
62
-
63
- ## What's new in 2.12.0
64
-
65
- **The hot path moves into C — and gRPC ships.** The headline win:
66
- `Server.handle_static` routes now serve from a C accept→read→route→write
67
- loop with optional **io_uring** (Linux 5.x+) backing it. The `wrk -t4
68
- -c100 -d20s` hello bench moved from **5,502 r/s** (2.11.0
69
- `Server.handle_static` via Ruby accept loop) to **15,685 r/s** (2.12-C
70
- C accept4 loop) to **134,084 r/s** (2.12-D io_uring loop) — that's
71
- **24× over 2.11.0's `handle_static` and 7× over Agoo 2.15.14's
72
- 19,024 r/s** on the same workload. p99 stays sub-millisecond
73
- throughout. Plus durable foundation work and one big new feature:
74
-
75
- - **2.12-B — Fresh 4-way re-bench.** New
76
- [`docs/BENCH_HYPERION_2_11.md`](docs/BENCH_HYPERION_2_11.md) re-runs
77
- Hyperion / Puma / Falcon / Agoo on the 6 workloads with all 2.10/2.11
78
- wins enabled. Headline shifts: static 1 KB Hyperion `handle_static`
79
- flipped from 1.89× behind Agoo to **+127% ahead**; CPU JSON gap
80
- widened (the one row 2.10/2.11 didn't touch — flagged for follow-up).
81
- - **2.12-C — Connection lifecycle in C.** New
82
- `Hyperion::Http::PageCache.run_static_accept_loop` does
83
- `accept4` + `recv` + path lookup + `write` entirely in a C tight
84
- loop, returning to Ruby only on a route miss / TLS / h2 / WebSocket
85
- upgrade. GVL released across syscalls. Auto-engages when the listener
86
- is plain TCP and the route table contains only `StaticEntry`
87
- registrations. **5,502 → 15,685 r/s (+185%, 2.85×) on `handle_static`
88
- hello; p99 1.59 ms → 107 µs (15× tighter).** Falls through to the
89
- existing Ruby accept loop on miss with no regression.
90
- - **2.12-D — io_uring accept loop (Linux 5.x+).** A multishot accept +
91
- per-conn RECV/WRITE/CLOSE state machine on top of liburing. One
92
- `io_uring_enter` per N requests instead of N×3 syscalls. Opt-in via
93
- `HYPERION_IO_URING_ACCEPT=1` (default off until 2.13 production
94
- soak). **15,685 → 134,084 r/s (+755%, 8.6×) on the same bench.**
95
- Compiles out cleanly without liburing — the `accept4` path stays
96
- the fallback. macOS keeps using `accept4` (no liburing).
97
- - **2.12-E — SO_REUSEPORT cluster-mode audit.** New per-worker request
98
- metric (`requests_dispatch_total{worker_id="N"}`) ticks under every
99
- dispatch mode (Rack, `handle_static`, h2, the C accept loops). New
100
- audit harness `bench/cluster_distribution.sh` and a 4-worker, 30s
101
- sustained-load bench: under steady state the SO_REUSEPORT hash
102
- distributes within **1.004-1.011 max/min ratio** — production-grade,
103
- measured. The cold-start swing (1.16× during the first second of
104
- fresh boot) is documented as expected `SO_REUSEPORT + keep-alive`
105
- behavior and matches what production L4 LBs already exhibit.
106
- - **2.12-F — gRPC support on h2.** Trailers (the `grpc-status` /
107
- `grpc-message` final HEADERS frame), `TE: trailers` handling, h2
108
- request half-close semantics. Rack 3 contract: a Rack body that
109
- defines `#trailers` triggers the trailers wire shape automatically;
110
- bodies that don't are byte-identical to 2.11.x h2. Smoke test against
111
- the real `grpc` Ruby gem ships gated by `RUN_GRPC_SMOKE=1`; the
112
- durable coverage is 11 unit specs driving real `protocol-http2`
113
- framer + HPACK encode/decode + TLS.
114
-
115
- The 2.10-G TCP_NODELAY hunk, 2.10-E preload hooks, 2.10-F C-ext
116
- `rb_pc_serve_request`, 2.11-A dispatch pool warmup, and 2.11-B CGlue
117
- HPACK default all preserved and verified by the 1143-spec suite.
118
-
119
- Full per-stream details, bench tables, and follow-up items in
120
- [`CHANGELOG.md`](CHANGELOG.md).
121
-
122
- ## What's new in 2.11.0
123
-
124
- **h2 cold-stream latency cut + native HPACK CGlue flipped to default.**
125
- Two perf wins on top of 2.10:
126
-
127
- - **2.11-A — h2 first-stream TLS handshake parallelization.** The
128
- 2.10-G `HYPERION_H2_TIMING=1` instrumentation, run against the
129
- TCP_NODELAY-fixed handler, isolated the residual cold-stream cost
130
- to **bucket 2**: lazy `task.async {}` fiber spawn for the first
131
- stream of every connection. Fix: pre-spawn a stream-dispatch fiber
132
- pool at connection accept (configurable via `HYPERION_H2_DISPATCH_POOL`,
133
- default 4, ceiling 16). h2load `-c 1 -m 1 -n 50` cold first-run:
134
- **time-to-1st-byte 20.28 → 9.28 ms (−54%); m=100 throughput +5.5%**.
135
- Warm steady-state unchanged (no head-of-line blocking under the small
136
- pool — backlog still spills to ad-hoc `task.async`).
137
- - **2.11-B — HPACK FFI marshalling round-2 (CGlue flipped to default).**
138
- Three-way bench (`bench/h2_rails_shape.sh` extended): `ruby` (1,585
139
- r/s) vs `native v2` (1,602 r/s, +1% — noise) vs `native v3 / CGlue`
140
- (**2,291 r/s, +43% over v2**). The +18-44% native-vs-Ruby headline
141
- was almost entirely Fiddle marshalling overhead, not the underlying
142
- Rust HPACK encoder — same encoder, no per-call FFI marshalling, +43%
143
- rps. Default flipped: unset `HYPERION_H2_NATIVE_HPACK` now selects
144
- CGlue. Three escape valves stay (`=v2` to force the old path, `=ruby`
145
- / `=off` for the pure-Ruby fallback) for any operator that needs
146
- them. Boot log gains a `native_mode` field documenting which path is
147
- actually live.
148
-
149
- Plus operator infrastructure: a stale-`.dylib`-on-Linux cross-platform
150
- host-OS portability fix in `H2Codec.candidate_paths` (was silently
151
- falling through to pure-Ruby on the bench host); `bench/h2_rails_shape.sh`
152
- race-fixed (boot-log probe + stderr routing). Full bench tables and
153
- flip-decision rationale in [`CHANGELOG.md`](CHANGELOG.md).
154
-
155
- ## What's new in 2.10.1
156
-
157
- **Static-asset operator surface (2.10-E) + C-ext fast-path response
158
- writer (2.10-F).** Two follow-on streams to 2.10's static / direct-route
159
- work:
160
-
161
- - **2.10-E — Static asset preload + immutable flag.** Boot-time hook
162
- warms `Hyperion::Http::PageCache` over a tree of files and marks
163
- every cached entry immutable. Surface: `--preload-static <dir>` (and
164
- `--no-preload-static`) CLI flags, `preload_static "/path", immutable:
165
- true` config DSL key, and zero-config Rails auto-detect that pulls
166
- `Rails.configuration.assets.paths.first(8)` when present. Hyperion
167
- never `require`s Rails — purely defensive `defined?(::Rails)`
168
- probing keeps the generic Rack server path clean. **Operator value:
169
- predictable first-request latency** (the asset is in cache before
170
- the first request arrives) and the `recheck_seconds` mtime poll is
171
- skipped on immutable entries. Sustained-load throughput on the
172
- static-1-KB bench did *not* move (cold 1,929 r/s vs warm 1,886 r/s,
173
- inside trial noise) because `ResponseWriter` already auto-caches
174
- Rack::Files responses on the first hit; preload moves that one
175
- `cache_file` call from request 1 to boot.
176
- - **2.10-F — C-ext fast-path response writer for prebuilt responses.**
177
- `Server.handle_static`-routed requests now serve from a single
178
- C function (`rb_pc_serve_request` in `ext/hyperion_http/page_cache.c`)
179
- that does route lookup → header build → `write()` syscall without
180
- re-entering Ruby on the response side. GVL is released across the
181
- `write()` so slow clients no longer block other Ruby work on the
182
- same VM. Automatic HEAD support (HTTP-mandated) lights up on every
183
- GET registered via `handle_static` — same buffer, body stripped.
184
- Bench (3-trial median, `wrk -t4 -c100 -d20s`): **5,768 r/s vs
185
- 2.10-D's 5,619 r/s (+2.6% — inside noise) and p99 1.93 → 1.67 ms
186
- (−14% — outside noise, reproducible).** The throughput needle didn't
187
- move because the per-connection lifecycle (accept4 + clone3 + futex
188
- on GVL handoff) dominates at 100 concurrent connections; 2.10-F
189
- shrinks the response phase, but the response phase isn't the
190
- bottleneck on this profile. Durable infrastructure for 2.11+ when
191
- the accept-loop work closes.
192
-
193
- Full per-stream details and bench tables in
194
- [`CHANGELOG.md`](CHANGELOG.md).
195
-
196
- ## What's new in 2.10.0
197
-
198
- **4-way bench harness, page cache, direct routes, and the h2 40 ms
199
- ceiling killed.** This sprint widens the comparison matrix to all four
200
- major Ruby web servers (Hyperion + Puma + Falcon + Agoo) and ships
201
- four substantive perf streams against that backdrop:
202
-
203
- - **2.10-A / 2.10-B — 4-way bench harness + honest baseline.**
204
- `bench/4way_compare.sh` runs the same 6 workloads (hello, static
205
- 1 KB / 1 MiB, CPU JSON, PG-bound, SSE) against all four servers from
206
- one script. Baseline numbers committed *before* any code changes:
207
- Agoo wins the static-asset and JSON columns by ~2-4×, Hyperion wins
208
- the static 1 MiB column by 9× and the SSE column by 3.6-17×.
209
- - **2.10-C — `Hyperion::Http::PageCache` (pre-built static response
210
- cache).** Open-addressed bucket table behind a pthread mutex
211
- (GVL-released for writes), engages automatically on `Rack::Files`
212
- responses. **Static 1 KB: 1,380 → 1,880 r/s (+36%), p99 3.7 → 2.7
213
- ms.** Closes the Agoo gap from −47% to −28% on that column.
214
- - **2.10-D — `Hyperion::Server.handle` direct route registration.**
215
- New API for hot Rack-bypass paths (`Server.handle '/health' do …
216
- end`, `Server.handle_static '/robots.txt', body: '...'`). Skips Rack
217
- adapter + env-build for matched routes. **`hello` via
218
- `handle_static`: 4,408 → 5,619 r/s (+27%), p99 1.93 ms** — the
219
- cleanest p99 in the 4-way matrix.
220
- - **2.10-G — h2 max-latency ceiling at ~40 ms: fixed.** Filed by 2.9-B
221
- as a "first-stream cost" hypothesis, the instrumentation revealed
222
- it was paid by *every* h2 stream — the canonical Linux delayed-ACK
223
- + Nagle interaction on small framer writes. One-line fix:
224
- TCP_NODELAY at accept time. **h2load `-c 1 -m 1 -n 200`: min
225
- 40.62 → 0.54 ms (−98.7%), throughput 24 → 1,142 r/s (+47.6×).** The
226
- `HYPERION_H2_TIMING=1` instrumentation stays in place as durable
227
- diagnostic infrastructure.
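The one-line nature of the 2.10-G fix is visible in plain Ruby (a sketch of the idea, not Hyperion's C accept path): set `TCP_NODELAY` on each socket as it is accepted, so small framer writes go out immediately instead of waiting on the delayed-ACK/Nagle interaction.

```ruby
require 'socket'

# Sketch of the 2.10-G fix: disable Nagle on each accepted connection so
# small HTTP/2 framer writes aren't queued behind a delayed ACK.
server = TCPServer.new('127.0.0.1', 0)
client = TCPSocket.new('127.0.0.1', server.local_address.ip_port)
conn = server.accept
conn.setsockopt(Socket::IPPROTO_TCP, Socket::TCP_NODELAY, 1)

flag = conn.getsockopt(Socket::IPPROTO_TCP, Socket::TCP_NODELAY).int
[conn, client, server].each(&:close)
```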
228
-
229
- Full per-stream details, bench numbers, and follow-up items live in
230
- [`CHANGELOG.md`](CHANGELOG.md).
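Pulled together, the 2.10-D registration surface reads like this in a `config.ru` — a sketch: the handler bodies and fallback app are illustrative; only `Server.handle`, `Server.handle_static`, and the `body:` keyword come from the text above.

```ruby
# config.ru — sketch of the 2.10-D direct-route API described above.
require 'hyperion'

# Prebuilt static response: skips the Rack adapter and env build entirely.
Hyperion::Server.handle_static '/robots.txt', body: "User-agent: *\nDisallow:\n"

# Direct dynamic route: still bypasses Rack env construction.
Hyperion::Server.handle '/health' do
  [200, { 'content-type' => 'text/plain' }, ['ok']]
end

# Everything else falls through to the normal Rack app.
run ->(env) { [200, { 'content-type' => 'text/plain' }, ['hello']] }
```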
231
-
232
- ## What's new in 2.5.0
233
-
234
- **Native HPACK ON by default + autobahn 100% conformance + request
235
- hooks.** The Rust HPACK encoder (added in 2.0.0, opt-in until 2.4.x)
236
- flips ON by default in 2.5.0 — verified **+18% rps on Rails-shape h2
237
- workloads** (25-header responses, the bench harness lives at
238
- `bench/h2_rails_shape.ru` + `bench/h2_rails_shape.sh`). RFC 6455
239
- WebSocket conformance hit **463/463 autobahn-testsuite cases passing**
240
- (2.5-A, host openclaw-vm). Request lifecycle hooks
241
- (`Runtime#on_request_start` / `on_request_end`) shipped in 2.5-C —
242
- recipes in [`docs/OBSERVABILITY.md`](docs/OBSERVABILITY.md).
243
-
244
- ## What's new in 2.4.0
245
-
246
- **Production observability.** The `/-/metrics` endpoint now exposes
247
- per-route latency histograms, per-conn fairness rejections, WebSocket
248
- permessage-deflate compression ratio, kTLS active connections,
249
- io_uring-active workers, and ThreadPool queue depth — operators can
250
- finally see whether the 2.x knobs are firing and how effective they
251
- are. A pre-built Grafana dashboard ships at
252
- [`docs/grafana/hyperion-2.4-dashboard.json`](docs/grafana/hyperion-2.4-dashboard.json).
253
- Full metric reference + operator playbook in
254
- [`docs/OBSERVABILITY.md`](docs/OBSERVABILITY.md).
255
-
256
- ## What's new in 2.1.0
257
-
258
- **WebSockets.** RFC 6455 over Rack 3 full hijack, native frame codec,
259
- per-connection wrapper with auto-pong / close handshake / UTF-8 validation /
260
- per-message size cap. **ActionCable on Hyperion is now a single-binary
261
- deployment** — one `hyperion -w 4 -t 10 config.ru` process serves HTTP,
262
- HTTP/2, TLS, **and** `/cable` from the same listener; no separate cable
263
- container required. HTTP/1.1 only this release; WS-over-HTTP/2 (RFC 8441
264
- Extended CONNECT) and permessage-deflate (RFC 7692) defer to 2.2.x.
265
- See [`docs/WEBSOCKETS.md`](docs/WEBSOCKETS.md).
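The handshake behind that RFC 6455 support is small enough to show: the server proves it understood the upgrade by hashing the client's `Sec-WebSocket-Key` with a GUID fixed by the spec. This is the standard RFC 6455 computation any conformant server performs, not Hyperion-specific code:

```ruby
require 'digest/sha1'

# RFC 6455: Sec-WebSocket-Accept = base64(SHA1(key + GUID)).
WS_GUID = '258EAFA5-E914-47DA-95CA-C5AB0DC85B11'

def websocket_accept_key(client_key)
  # strict base64 (no trailing newline) via Array#pack('m0')
  [Digest::SHA1.digest(client_key + WS_GUID)].pack('m0')
end

# The RFC's own example key yields its documented accept value:
websocket_accept_key('dGhlIHNhbXBsZSBub25jZQ==')
# => "s3pPLMBiTxaQ9kYGzzhZRbK+xOo="
```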
266
-
267
- ## gRPC on Hyperion (2.12-F+)
268
-
269
- Hyperion's HTTP/2 path supports gRPC unary calls via the Rack 3 trailers
270
- contract: any response body that exposes `:trailers` gets a final
271
- HEADERS frame (with END_STREAM=1) carrying the trailer map after the
272
- DATA frames. That's the wire shape gRPC clients expect for the
273
- `grpc-status` / `grpc-message` map.
274
-
275
- A minimal Rack-shaped gRPC handler:
276
-
277
- ```ruby
278
- class GrpcBody
279
- def initialize(reply); @reply = reply; end
280
- def each; yield @reply; end
281
- def trailers; { 'grpc-status' => '0', 'grpc-message' => 'OK' }; end
282
- def close; end
283
- end
284
-
285
- run ->(env) {
286
- request = env['rack.input'].read # gRPC-framed protobuf bytes
287
- reply = handle(request) # your service implementation
288
- [200, { 'content-type' => 'application/grpc' }, GrpcBody.new(reply)]
289
- }
290
- ```
291
-
292
- What Hyperion handles for you: ALPN negotiation, HTTP/2 framing, HPACK,
293
- per-stream flow control, the trailer-frame emit, binary-clean
294
- `env['rack.input']` (gRPC bodies are non-UTF-8), and `te: trailers`
295
- preserved into `env['HTTP_TE']`. What you handle: protobuf
296
- marshalling and the `grpc-status` semantics.
297
-
298
- ### Streaming RPCs (2.13-D+)
299
-
300
- All four gRPC call shapes work on Hyperion since 2.13-D — unary,
301
- server-streaming, client-streaming, and bidirectional. The detection
302
- trigger is the gRPC content-type plus `te: trailers`; any HTTP/2
303
- request that carries both is dispatched to the Rack app on HEADERS
304
- arrival (rather than after END_STREAM), and `env['rack.input']`
305
- becomes a streaming IO that blocks reads until the next DATA frame
306
- lands. Plain HTTP/2 traffic (without those headers) keeps the unary
307
- buffered semantics — no behaviour change for non-gRPC clients.
308
-
309
- **Server-streaming.** Yield one gRPC-framed message per `each`
310
- iteration; Hyperion writes each yield as its own DATA frame:
311
-
312
- ```ruby
313
- class StreamReply
314
- def initialize(messages); @messages = messages; end
315
- def each; @messages.each { |m| yield m }; end # one DATA frame each
316
- def trailers; { 'grpc-status' => '0' }; end
317
- def close; end
318
- end
319
-
320
- run ->(env) {
321
- env['rack.input'].read # the unary request message
322
- [200, { 'content-type' => 'application/grpc' }, StreamReply.new(messages)]
323
- }
324
- ```
325
-
326
- **Client-streaming.** Read messages off `env['rack.input']` as the peer
327
- sends them. Reads block until a DATA frame arrives:
328
-
329
- ```ruby
330
- run ->(env) {
331
- io = env['rack.input']
332
- count = 0
333
- while (prefix = io.read(5)) && prefix.bytesize == 5
334
- length = prefix.byteslice(1, 4).unpack1('N')
335
- msg = io.read(length) # drain the message bytes; this example only counts them
336
- count += 1
337
- end
338
- [200, { 'content-type' => 'application/grpc' }, GrpcBody.new(reply_for(count))]
339
- }
340
- ```
341
-
342
- **Bidirectional.** Interleave reads and writes. The response body is
343
- iterated lazily, so you can read one request message, yield one reply,
344
- read the next, yield the next:
345
-
346
- ```ruby
347
- class BidiReplies
348
- def initialize(io); @io = io; end
349
- def each
350
- while (prefix = @io.read(5)) && prefix.bytesize == 5
351
- len = prefix.byteslice(1, 4).unpack1('N')
352
- msg = @io.read(len)
353
- yield grpc_frame(handle(msg)) # sent immediately
354
- end
355
- end
356
- def trailers; { 'grpc-status' => '0' }; end
357
- def close; end
358
- end
359
-
360
- run ->(env) {
361
- [200, { 'content-type' => 'application/grpc' }, BidiReplies.new(env['rack.input'])]
362
- }
19
+ bundle exec hyperion config.ru # http://127.0.0.1:9292
363
20
  ```
364
21
 
365
- The streaming-input path runs each stream on its own fiber, so
366
- concurrent read+write on the same stream is safe.
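The 5-byte prefix both streaming examples parse is gRPC's standard length-prefixed message framing. A minimal pure-Ruby sketch, which also supplies a `grpc_frame`-style helper like the one the bidirectional example assumes (uncompressed messages only):

```ruby
require 'stringio'

# gRPC message framing: 1-byte compressed flag, 4-byte big-endian length,
# then the serialized protobuf payload.
def grpc_frame(message)
  [0, message.bytesize].pack('CN') + message   # flag 0 = uncompressed
end

def grpc_unframe(io)
  prefix = io.read(5)
  return nil unless prefix && prefix.bytesize == 5
  compressed, length = prefix.unpack('CN')
  raise 'compressed frames not handled in this sketch' unless compressed.zero?
  io.read(length)
end

framed = grpc_frame('payload')        # 5-byte prefix + 7 payload bytes
grpc_unframe(StringIO.new(framed))    # round-trips the payload
```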
367
-
368
- ## Highlights
369
-
370
- - **HTTP/1.1 + HTTP/2 + TLS** out of the box (HTTP/2 with per-stream fiber multiplexing, WINDOW_UPDATE-aware flow control, ALPN auto-negotiation).
371
- - **WebSockets (RFC 6455)** — full handshake, native frame codec, per-connection wrapper. ActionCable + faye-websocket work on a single-binary deploy. See [`docs/WEBSOCKETS.md`](docs/WEBSOCKETS.md). (2.1.0+, HTTP/1.1 only.)
372
- - **Pre-fork cluster** with per-OS worker model: `SO_REUSEPORT` on Linux, master-bind + worker-fd-share on macOS/BSD (Darwin's `SO_REUSEPORT` doesn't load-balance).
373
- - **Hybrid concurrency**: fiber-per-connection for I/O, OS-thread pool for `app.call(env)` — synchronous Rack handlers (Rails, ActiveRecord, anything holding a global mutex) get true OS-thread concurrency.
374
- - **Vendored llhttp 9.3.0** C parser; pure-Ruby fallback for non-MRI runtimes.
375
- - **Default-ON structured access logs** (one JSON or text line per request) with hot-path optimisations: per-thread cached timestamp, hand-rolled line builder, lock-free per-thread write buffer.
376
- - **12-factor logger split**: info/debug → stdout, warn/error/fatal → stderr.
377
- - **Ruby DSL config file** (`config/hyperion.rb`) with lifecycle hooks (`before_fork`, `on_worker_boot`, `on_worker_shutdown`).
378
- - **Object pooling** for the Rack `env` hash and `rack.input` IO — amortizes per-request allocations across the worker's lifetime.
379
- - **`Hyperion::FiberLocal`** opt-in shim for older Rails idioms that store request-scoped data via `Thread.current.thread_variable_*`.
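The lifecycle hooks named above slot into the `config/hyperion.rb` DSL; a sketch — the hook names come from the text, while the block contents (and zero-arg arity) are assumptions:

```ruby
# config/hyperion.rb — illustrative; only the hook names are from the docs.
before_fork do
  # runs in the master before workers fork: close shared sockets / DB handles
end

on_worker_boot do
  # runs in each worker after fork: reconnect the DB, warm caches
end

on_worker_shutdown do
  # runs in each worker on graceful stop: flush logs and metrics
end
```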
380
-
381
- ## Benchmarks
382
-
383
- All numbers are real wrk runs against published Hyperion configs. Hyperion ships **with default-ON structured access logs**; Puma comparisons use Puma defaults (no per-request log emission). Each section is stamped with the Hyperion version + bench host it was measured against — bench-host drift over time is real (see "Bench-host drift" note below).
384
-
385
- **Headline doc**: the most recent comprehensive sweep is
386
- [`docs/BENCH_HYPERION_2_11.md`](docs/BENCH_HYPERION_2_11.md) — the
387
- 2.12-B 4-way re-bench (Hyperion 2.11.0 vs Puma 8.0.1 / Falcon 0.55.3 /
388
- Agoo 2.15.14, 16-vCPU Ubuntu 24.04, 6 workloads). It's the post-
389
- 2.10/2.11-wins re-baseline of the four-server matrix that originally
390
- shipped in [`docs/BENCH_HYPERION_2_0.md`](docs/BENCH_HYPERION_2_0.md)
391
- § "4-way head-to-head (2.10-B baseline)" — the older doc is the
392
- **historical baseline (pre-2.10/2.11 wins)** and is preserved
393
- unchanged for archaeology. The 1.6.0 matrix at
394
- [`docs/BENCH_2026_04_27.md`](docs/BENCH_2026_04_27.md) covers 9
395
- workloads × 25+ configs against hyperion-async-pg 0.5.0; all three
396
- docs include caveats and per-row reproduction commands.
397
-
398
- > **Bench-host drift note (2026-05-01).** A spot-check rerun on
399
- > `openclaw-vm` 5 days after the 2.0.0 sweep showed Puma 8.0.1 and
400
- > Hyperion 2.0.0 baseline numbers had drifted 14-32% downward from the
401
- > 2026-04-29 sweep with no code changes — the bench host runs other
402
- > workloads in the background and is a single VM (KVM CPU). Numbers in
403
- > this README and BENCH docs are snapshots; expect ±10-30% absolute
404
- > drift between sweep dates. **The relative position (Hyperion vs Puma
405
- > at matched config) is the durable signal**; e.g. Hyperion `-w 16 -t 5`
406
- > hello-world today is 76,593 r/s vs Puma 8.0.1 `-w 16 -t 5:5` at 55,609
407
- > r/s, **+37.7% over Puma** — wider than the 2.0.0 sweep's +27.8% even
408
- > though absolute rps is lower. Reproduce: `bundle exec bin/hyperion
409
- > -p 9501 -w 16 -t 5 bench/hello.ru` then `wrk -t4 -c200 -d20s
410
- > http://127.0.0.1:9501/`.
411
-
412
- > **Topology relevance.** Hyperion is built to run **fronted by nginx
413
- > or an L7 load balancer** in most production deployments — plaintext
414
- > HTTP/1.1 upstream, TLS terminated at the LB. The benches in this
415
- > README that match that topology are: hello-world, CPU JSON, static,
416
- > SSE, PG, WebSocket. Benches that are **bench-only for nginx-fronted
417
- > ops** (the LB → upstream hop is plaintext h1 regardless): TLS h1,
418
- > HTTP/2, kTLS_TX. Those rows still ship for operators who terminate
419
- > TLS / h2 at Hyperion directly (small static fleets, edge boxes), but
420
- > don't chase the +60% TLS-h1 win unless you actually terminate TLS at
421
- > Hyperion.
422
-
423
- ### Hello-world Rack app
424
-
425
- `bench/hello.ru`, single worker, parity threads (`-t 5` vs Puma `-t 5:5`), 4 wrk threads / 100 connections / 15s, macOS arm64 / Ruby 3.3.3, Hyperion 1.2.0. **macOS dev numbers; the headline Linux 2.0.0 bench is in [`docs/BENCH_HYPERION_2_0.md`](docs/BENCH_HYPERION_2_0.md)**:
426
-
427
- | | r/s | p99 | tail vs Hyperion |
428
- |---|---:|---:|---:|
429
- | **Hyperion 1.2.0** (default, logs ON) | **22,496** | **502 µs** | **1×** |
430
- | Falcon 0.55.3 `--count 1` | 22,199 | 5.36 ms | 11× worse |
431
- | Puma 7.1.0 `-t 5:5` | 20,400 | 422.85 ms | 845× worse |
432
-
433
- **Hyperion: 1.10× Puma throughput, parity with Falcon on throughput, ~10× lower p99 than Falcon and ~845× lower than Puma — while emitting structured JSON access logs the others don't.**
434
-
435
- ### Production cluster config (`-w 4`)
436
-
437
- Same bench app, `-w 4` cluster, parity threads (`-t 5` everywhere), 4 wrk threads / 200 connections / 15s, macOS arm64:
438
-
439
- | | r/s | p99 | tail vs Hyperion |
440
- |---|---:|---:|---:|
441
- | Falcon `--count 4` | 48,197 | 4.84 ms | 5.9× worse |
442
- | **Hyperion `-w 4 -t 5`** | **40,137** | **825 µs** | **1×** |
443
- | Puma `-w 4 -t 5:5` | 34,793 | 177.76 ms | 215× worse (1 timeout) |
444
-
445
- Falcon edges Hyperion ~20% on raw rps at `-w 4` on macOS hello-world. **Hyperion still leads on tail latency by 5.9× over Falcon and 215× over Puma**, and beats Puma on throughput by 1.15×. On Linux production-config and DB-backed workloads (below) Hyperion takes the rps lead too — Falcon's macOS hello-world advantage disappears once the workload includes any actual work or the kernel is Linux.
446
-
447
- ### Linux production-config (DB-backed Rack)
448
-
449
- `-w 4 -t 10` on Ubuntu 24.04 / Ruby 3.3.3. Rack app does one Postgres `SELECT 1` + one Redis `GET` per request, real network round-trip. wrk `-t4 -c50 -d10s` × 3 runs (median):
450
-
451
- | | r/s (median) | vs Puma default |
452
- |---|---:|---:|
453
- | **Hyperion default (rc17, logs ON)** | **5,786** | **1.012×** |
454
- | Hyperion `--no-log-requests` | 6,364 | 1.114× |
455
- | Puma `-w 4 -t 10:10` (no per-req logs) | 5,715 | 1.000× |
456
-
457
- Bench is **wait-bound** — ~3-4 ms median is the PG + Redis round-trip, dwarfing the per-request CPU work where Hyperion's optimisations live. With a synchronous `pg` driver, fibers don't help: every in-flight DB call still parks an OS thread, and both servers max out at `workers × threads` concurrent queries. Widening this gap requires either an async PG driver — see [hyperion-async-pg](https://github.com/andrew-woblavobla/hyperion-async-pg) (companion gem; pair with `--async-io` and a fiber-aware pool, see "Async I/O — fiber concurrency on PG-bound apps" below) — or a CPU-bound workload, where Hyperion's lead becomes visible (next section).
458
-
459
- ### Async I/O — fiber concurrency on PG-bound apps
460
-
461
- Ubuntu 24.04 / 16 vCPU / Ruby 3.3.3, Postgres 17 over WAN, `wrk -t4 -c200 -d20s`. Single worker (`-w 1`) unless noted. All configs returned 0 non-2xx and 0 timeouts. RSS sampled mid-run via `ps -o rss`.
462
-
463
- **Wait-bound workload** (`pg_concurrent.ru`: `SELECT pg_sleep(0.05)` + tiny JSON; rackup lives in the [hyperion-async-pg companion repo](https://github.com/andrew-woblavobla/hyperion-async-pg) and on the bench host at `~/bench/pg_concurrent.ru`, not in this repo):
464
-
465
- | | r/s | p99 | RSS | vs Puma `-t 5` |
466
- |---|---:|---:|---:|---:|
467
- | Puma 8.0 `-t 5` pool=5 | 56.5 | 3.88 s | 87 MB | 1.0× |
468
- | Puma 8.0 `-t 30` pool=30 | 402.1 | 880 ms | 99 MB | 7.1× |
469
- | Puma 8.0 `-t 100` pool=100 | 1067.4 | 557 ms | 121 MB | 18.9× |
470
- | **Hyperion `--async-io -t 5`** pool=32 | 400.4 | 878 ms | 123 MB | 7.1× |
471
- | **Hyperion `--async-io -t 5`** pool=64 | 778.9 | 638 ms | 133 MB | 13.8× |
472
- | **Hyperion `--async-io -t 5`** pool=128 | 1344.2 | 536 ms | 148 MB | 23.8× |
473
- | **Hyperion `--async-io -t 5` pool=200** | **2381.4** | **471 ms** | **164 MB** | **42.2×** |
474
- | Hyperion `--async-io -w 4 -t 5` pool=64 | 1937.5 | 4.84 s | 416 MB | 34.3× (cold-start p99 — see note) |
475
- | Falcon 0.55.3 `--count 1` pool=128 | 1665.7 | 516 ms | 141 MB | 29.5× |
476
-
477
- **Mixed CPU+wait** (`pg_mixed.ru`: same query + 50-key JSON serialization, ~5 ms CPU; rackup lives in hyperion-async-pg + on the bench host at `~/bench/pg_mixed.ru`, not in this repo):
478
-
479
- | | r/s | p99 | RSS | vs Puma `-t 30` |
480
- |---|---:|---:|---:|---:|
481
- | Puma 8.0 `-t 30` pool=30 | 351.7 | 963 ms | 127 MB | 1.0× |
482
- | Hyperion `--async-io -t 5` pool=32 | 371.2 | 919 ms | 151 MB | 1.05× |
483
- | Hyperion `--async-io -t 5` pool=64 | 741.5 | 681 ms | 161 MB | 2.1× |
484
- | **Hyperion `--async-io -t 5` pool=128** | **1739.9** | **512 ms** | **201 MB** | **4.9×** |
485
- | Falcon `--count 1` pool=128 | 1642.1 | 531 ms | 213 MB | 4.7× |
486
-
487
- **Takeaways:**
488
- 1. **Linear scaling with pool size** under `--async-io` — `r/s ≈ pool × 12` on this WAN bench. Single-worker pool=200 hits 2381 r/s. The "**42× Puma `-t 5`**" and "**5.9× Puma's best**" framings above use Puma's pool=5 (timeout-floor) and pool=30 (mid-tier) rows respectively — fair comparisons on the *same* bench fixture, but a Puma operator who sizes their pool to match (`-t 100 pool=100` row above) lands at 1,067 r/s, so the **honest "Puma at its own best vs Hyperion at its own best" ratio is 2,381 / 1,067 ≈ 2.2×**, not 42×. The architectural win — fiber-pool grows to pool=200 without OS-thread cost — is real; the 42× headline is a configuration-difference effect, not a steady-state gap on matched configs.
489
- 2. **Mixed workload doesn't kill the win** — Hyperion `--async-io` pool=128 actually goes *up* on mixed (1740 vs 1344 r/s) because CPU work overlaps other fibers' PG-wait windows. This is the honest "what happens to a real Rails handler" answer.
490
- 3. **Hyperion ≈ Falcon within 3-7%** across pool sizes; both fiber-native architectures extract similar value from `hyperion-async-pg`.
491
- 4. **RSS at single-worker scale isn't the architectural moat** — Linux thread stacks are demand-paged; PG connection buffers dominate RSS at pool sizes ≤ 200. The architectural win is **handler concurrency under load**, not idle memory: Hyperion's fiber path runs thousands of in-flight handler invocations per OS thread, so wait-bound handlers don't queue at `max_threads`. See [Concurrency at scale](#concurrency-at-scale-architectural-advantages) for both the throughput-under-load row and a measured 10k-idle-keepalive RSS sweep against Puma and Falcon.
492
- 5. **`-w 4` cold-start caveat** — multi-worker p99 inflates because the bench rackup uses lazy per-process pool init (each worker pays full pool fill on its first request). Production apps avoid this with `on_worker_boot { Hyperion::AsyncPg::FiberPool.new(...).fill }`.
493
- 6. **Apples-to-apples PG note**: the row above uses `pg.wobla.space` WAN PG with `max_connections=500`. Earlier sweeps that compared Hyperion (WAN, max_conn=500) against Puma (local, max_conn=100) overstated the ratio because the Puma side timed out at the local pool ceiling. The 2.0.0 bench doc carries this caveat in the row 7 verification section; treat any "Hyperion 4× Puma on PG" headline as **indicative**, not precisely calibrated, until rerun against matched-pool PG.
494
-
495
- Three things must all be true to get this win:
496
- 1. **`async_io: true`** in your Hyperion config (or `--async-io` CLI flag). Default is off to keep 1.2.0's raw-loop perf for fiber-unaware apps.
497
- 2. **`hyperion-async-pg`** installed: `gem 'hyperion-async-pg', require: 'hyperion/async_pg'` + `Hyperion::AsyncPg.install!` at boot.
498
- 3. **Fiber-aware connection pool.** The popular `connection_pool` gem is NOT — its Mutex blocks the OS thread. Use `Hyperion::AsyncPg::FiberPool` (ships with hyperion-async-pg 0.3.0+), [`async-pool`](https://github.com/socketry/async-pool), or `Async::Semaphore`.
499
-
500
- Skip any of these and you get parity with Puma at the same `-t`. Run the bench yourself: `MODE=async DATABASE_URL=... PG_POOL_SIZE=200 bundle exec hyperion --async-io -t 5 bench/pg_concurrent.ru` (in the [hyperion-async-pg](https://github.com/andrew-woblavobla/hyperion-async-pg) repo).
501
-
502
- > **TLS + async-pg note (1.4.0+).** TLS / HTTPS already runs each connection on a fiber under `Async::Scheduler` (the TLS path always uses `start_async_loop` for the ALPN handshake). **As of 1.4.0, the post-handshake `app.call` for HTTP/1.1-over-TLS dispatches inline on the calling fiber by default** — so fiber-cooperative libraries (`hyperion-async-pg`, `async-redis`) work on the TLS h1 path without needing `--async-io`. The Async-loop cost is already paid for the handshake; running the handler under the existing scheduler just preserves that context instead of stripping it on a thread-pool hop. h2 streams are always fiber-dispatched and benefit from async-pg without the flag.
503
- >
504
- > Operators who specifically want **TLS + threadpool dispatch** (e.g. CPU-heavy handlers competing for OS threads, where you'd rather not pay fiber yields and want true OS-thread parallelism on a synchronous handler) can pass `async_io: false` in the config to force the pool branch back on. The three-way `async_io` setting:
505
- > - `nil` (default): plain HTTP/1.1 → pool, TLS h1 → inline.
506
- > - `true`: plain HTTP/1.1 → inline, TLS h1 → inline (force fiber dispatch everywhere; needed for `hyperion-async-pg` on plain HTTP).
507
- > - `false`: plain HTTP/1.1 → pool, TLS h1 → pool (explicit opt-out for TLS+threadpool).
508
-
509
- ### CPU-bound JSON workload
22
+ ## Headline benchmarks
510
23
 
511
- `bench/work.ru` handler builds a 50-key fixture, JSON-encodes a fresh response per request (~8 KB body), processes a 6-cookie header chain. wrk `-t4 -c200 -d15s`, macOS arm64 / Ruby 3.3.3, 1.2.0:
24
+ Linux 6.8 / 16-vCPU Ubuntu 24.04 / Ruby 3.3.3, single worker, `wrk -t4 -c100 -d20s`
25
+ unless noted. Reproduction commands and the full 6-row 4-way matrix
26
+ (Hyperion / Puma / Falcon / Agoo) live in
27
+ [docs/BENCH_HYPERION_2_11.md](docs/BENCH_HYPERION_2_11.md).
512
28
 
513
- | | r/s | p99 | tail vs Hyperion |
514
- |---|---:|---:|---:|
515
- | Falcon `--count 4` | 46,166 | 20.17 ms | 24× worse |
516
- | **Hyperion `-w 4 -t 5`** | **43,924** | **824 µs** | **1×** |
517
- | Puma `-w 4 -t 5:5` | 36,383 | 166.30 ms (47 socket errors) | 200× worse |
29
+ | Workload | Hyperion r/s | Hyperion p99 | Reference |
30
+ |-------------------------------------------------------|-------------:|-------------:|----------------------|
31
+ | Static hello, `handle_static` + io_uring (2.12-D) | **134,084** | 1.14 ms | Agoo 2.15.14: 19,024 |
32
+ | Static hello, `handle_static` + accept4 fallback | 15,685 | 107 µs | Agoo 2.15.14: 19,024 |
33
+ | Dynamic block, `Server.handle { \|env\| ... }` (2.14-A) | 9,422 | 166 µs | Agoo 2.15.14: 19,024 |
34
+ | CPU JSON via block (`bench/work.ru`, 2.14-A) | 5,897 | 256 µs | Falcon: 4,226 |
35
+ | Generic Rack hello (no `Server.handle`) | 4,752 | 2.02 ms | Agoo 2.15.14: 19,024 |
36
+ | gRPC unary, h2/TLS, ghz `-c50` (2.14-D) | 1,618 | 33.3 ms | Falcon `async-grpc`: 1,512 (+7%) |
518
37
 
519
- **1.21× Puma throughput, 200× lower p99.** This is the gap that hides behind PG-round-trip noise on the DB bench. Hyperion's per-request CPU savings (lock-free per-thread metrics, frozen header keys in the Rack adapter, C-ext response head builder, cached iso8601 timestamps, cached HTTP Date header) land on the wire when the workload is CPU-bound. Falcon edges us 5% on raw r/s but with 24× worse tail — a different tradeoff curve. Reproduce: `bundle exec bin/hyperion -w 4 -t 5 -p 9292 bench/work.ru`.
38
+ The 134,084 r/s row is sustained over a 4-hour soak at **120,684 r/s**
39
+ with RSS variance 2.71% and `wrk-truth` p99 1.14 ms (2.14-C). The
40
+ io_uring loop is opt-in via `HYPERION_IO_URING_ACCEPT=1` until 2.15;
41
+ the `accept4` row is the default on Linux.
520
42
 
521
- ### Real Rails 8.1 app (single worker, parity threads `-t 16`)
522
-
523
- Health endpoint that traverses the full middleware chain (rack-attack, locale redirect, structured tagger, geo-location, etc.). Plus a Grape API endpoint reading cached data, and a Rails controller doing a Redis GET + an ActiveRecord query.
524
-
525
- | endpoint | server | r/s | p99 | wrk timeouts |
526
- |---|---|---:|---:|---:|
527
- | `/up` (health) | **Hyperion** | **19.03** | **1.12 s** | **0** |
528
- | `/up` (health) | Puma `-t 16:16` | 16.64 | 1.95 s | **138** |
529
- | Grape `/api/v1/cached_data` | **Hyperion** | **16.15** | **779 ms** | 16 |
530
- | Grape `/api/v1/cached_data` | Puma `-t 16:16` | 10.90 | (>2 s, censored) | **110** |
531
- | Rails `/api/v1/health` | **Hyperion** | **15.95** | **992 ms** | 16 |
532
- | Rails `/api/v1/health` | Puma `-t 16:16` | 11.29 | (>2 s, censored) | **114** |
533
-
534
- On Grape and Rails-controller workloads Puma hits wrk's 2 s timeout cap on ~⅔ of requests — its real p99 is censored above 2 s. Hyperion serves all of its requests under 1.2 s with 0 to 16 timeouts. **1.14–1.48× Puma throughput** depending on endpoint.
535
-
536
- ### Static-asset serving (sendfile zero-copy path, 1.2.0+)
537
-
538
- `bench/static.ru` (`Rack::Files` over a 1 MiB asset), `-w 1`, `wrk -t4 -c100 -d15s`, macOS arm64 / Ruby 3.3.3:
539
-
540
- | | r/s | p99 | transferred | tail vs winner |
541
- |---|---:|---:|---:|---:|
542
- | **Hyperion (sendfile path)** | **2,069** | **3.10 ms** | 30.4 GB | **1×** |
543
- | Puma `-w 1 -t 5:5` | 2,109 | 566.16 ms | 31.0 GB | 183× worse |
544
- | Falcon `--count 1` | 1,269 | 801.01 ms | 18.7 GB | 258× worse (28 timeouts) |
545
-
546
- Throughput is bandwidth-bound on localhost (≈2 GB/s = the loopback memory ceiling), so the throughput column looks like parity. The actual win is in the **tail latency** column: Hyperion's `IO.copy_stream` → `sendfile(2)` path skips userspace entirely, while Puma allocates a String per response and Falcon serializes more aggressively. On real network paths sendfile widens the gap further (kernel-to-NIC zero-copy).
43
+ ## Quick start
547
44
 
548
- Reproduce:
549
45
  ```sh
550
- ruby -e 'File.binwrite("/tmp/hyperion_bench_asset_1m.bin", "x" * (1024*1024))'
551
- bundle exec bin/hyperion -p 9292 bench/static.ru
552
- wrk --latency -t4 -c100 -d15s http://127.0.0.1:9292/hyperion_bench_asset_1m.bin
46
+ bundle exec hyperion config.ru # single process
47
+ bundle exec hyperion -w 4 -t 10 config.ru # 4 workers × 10 threads
48
+ bundle exec hyperion -w 0 config.ru # one worker per CPU
49
+ bundle exec hyperion --tls-cert cert.pem --tls-key key.pem -p 9443 config.ru
553
50
  ```
554
51
 
555
- ### Concurrency at scale (architectural advantages)
52
+ `bundle exec rake spec` (and the default task) auto-invoke `compile`, so a
53
+ fresh checkout just needs `bundle install && bundle exec rake` for a green run.
556
54
 
557
- These workloads demonstrate structural differences between Hyperion's fiber-per-connection / fiber-per-stream model and Puma's thread-pool model. Numbers are illustrative; the architecture is what matters. Run on Ubuntu 24.04 / Ruby 3.3.3, single worker, h2load `-c <conns> -n 100000 --rps 1000 --h1`.
55
+ Migrating from Puma? `hyperion -t N -w M` matching your current Puma
56
+ `-t N:N -w M` is the recommended drop-in. See
57
+ [docs/MIGRATING_FROM_PUMA.md](docs/MIGRATING_FROM_PUMA.md).
558
58
 
559
- **5,000 concurrent keep-alive connections (50,000 requests):**
59
+ ## Features
560
60
 
561
- | | succeeded | r/s | wall | master RSS |
562
- |---|---:|---:|---:|---:|
563
- | Hyperion `-w 1 -t 10` | 50,000 / 50,000 | 3,460 | 14.45 s | 53.5 MB |
564
- | Puma `-w 1 -t 10:10` | 50,000 / 50,000 | 1,762 | 28.37 s | 36.9 MB |
61
+ ### HTTP/1.1 + HTTP/2 + TLS
565
62
 
566
- **10,000 concurrent keep-alive connections (100,000 requests):**
567
-
568
- | | succeeded | failed | r/s | wall |
569
- |---|---:|---:|---:|---:|
570
- | Hyperion `-w 1 -t 10` | 93,090 | 6,910 | 3,446 | 27.01 s |
571
- | Puma `-w 1 -t 10:10` | 77,340 | 22,660 | 706 | 109.59 s |
572
-
573
- At 10k concurrent connections under load Hyperion serves **~5× the throughput** of Puma with **~20% fewer dropped requests**. The per-connection bookkeeping cost is bounded by fiber size, not by `max_threads` — workers don't get pinned to long-lived sockets, so a slow handler doesn't starve other connections.
574
-
575
- **Memory at idle keep-alive scale — 10,000 idle HTTP/1.1 keep-alive connections:**
576
-
577
- Each client opens a TCP connection, sends one keep-alive GET, drains the response, then holds the socket open without sending a follow-up request. RSS is sampled once a second across a 30s idle hold. Same hello-world rackup, single worker, no TLS. Hyperion runs with `async_io true` (fiber-per-connection on the plain HTTP/1.1 path).
578
-
579
- | | held | dropped | peak RSS | RSS after drain |
580
- |---|---:|---:|---:|---:|
581
- | Hyperion `-w 1 -t 5 --async-io` | 10,000 / 10,000 | 0 | 173 MB | 155 MB |
582
- | Puma `-w 0 -t 100` | 10,000 / 10,000 | 0 | 101 MB | 104 MB |
583
- | Falcon `--count 1` | 10,000 / 10,000 | 0 | 429 MB | 440 MB |
584
-
585
- All three hold 10k idle conns without OOMing or dropping — the "MB-per-thread" intuition that thread-based servers can't reach this scale doesn't survive contact with Linux's demand-paged thread stacks plus Puma's reactor-based keep-alive handling. Per-conn RSS lands at ~14 KB (Hyperion fiber + parser state), ~7 KB (Puma reactor entry + tiny thread share), ~36 KB (Falcon Async::Task + protocol-http stack). Bounded, not unbounded — for all three.
586
-
587
- The architectural difference shows up under **load**, not at idle: Puma can only run `max_threads` handler invocations concurrently, so wait-bound handlers (DB, HTTP, Redis) starve at higher request concurrency than `max_threads`. Hyperion's fiber-per-connection model + `--async-io` gives one OS thread thousands of in-flight handler executions, paired with [hyperion-async-pg](https://github.com/exodusgaming-io/hyperion-async-pg) for non-blocking DB. The 10k-conn throughput row above (5× Puma) is the consequence — same idle RSS shape, very different behaviour once the handlers actually do work.
588
-
589
- **HTTP/2 multiplexing — 1 connection × 100 concurrent streams (handler sleeps 50 ms):**
590
-
591
- | | wall time |
592
- |---|---:|
593
- | Hyperion (per-stream fiber dispatch) | **1.04 s** |
594
- | Serial baseline (100 × 50 ms) | 5.00 s |
595
-
596
- Hyperion fans 100 in-flight streams across separate fibers within a single TCP connection. A serial server would take 5 s; the fiber-multiplexed result (1.04 s, ~96 req/s on one socket) is bounded by single-handler sleep time plus framing overhead. Puma has no native HTTP/2 path — production deployments terminate h2 at nginx and forward h1 to the worker pool, which serializes again.
597
-
598
- > **1.6.0 outbound write path** — `Http2Handler` no longer serializes every framer write through one `Mutex#synchronize { socket.write(...) }`. HPACK encoding (microseconds, in-memory) still serializes on a fast encode mutex, but the actual `socket.write` is owned by a dedicated per-connection writer fiber draining a queue. On per-connection multi-stream workloads where the kernel send buffer or peer reads are slow, encode work for ready streams overlaps the writer's flush of earlier chunks, instead of stacking up behind it. See `bench/h2_streams.sh` (`h2load -c 1 -m 100 -n 5000`) for a recipe to compare 1.5.0 vs 1.6.0 on a workload of your choice.
599
-
600
- ### Reproducing the benchmarks
601
-
602
- Every number in this README and `docs/BENCH_HYPERION_2_0.md` is reproducible. Operators who don't trust headline numbers (and you shouldn't trust *any* benchmark numbers without independent verification) can rerun the workloads on their own host and get their own honest measurements. Per-row reproduction commands:
603
-
604
- ```sh
605
- # Setup (once)
606
- bundle install
607
- bundle exec rake compile
63
+ ALPN auto-negotiates `h2` or `http/1.1` per connection. HTTP/2 multiplexes
64
+ streams onto fibers within a single connection — slow handlers don't
65
+ head-of-line-block other streams. Cluster-mode TLS works (`-w N` +
66
+ `--tls-cert` / `--tls-key`).
608
67
 
609
- # Hello-world (rps + p99 ceiling, no I/O)
610
- bundle exec bin/hyperion -p 9292 -w 16 -t 5 bench/hello.ru &
611
- wrk -t4 -c200 -d20s --latency http://127.0.0.1:9292/
68
+ Smuggling defenses for HTTP/1.1: `Content-Length` + `Transfer-Encoding`
69
+ together 400; non-chunked `Transfer-Encoding` 501; CRLF in response
70
+ header values `ArgumentError` (response-splitting guard).
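
The response-splitting guard is easy to picture as a standalone check. This is an illustrative sketch, not Hyperion's internal code — `assert_header_value!` is a hypothetical helper name:

```ruby
# Reject any header value containing CR or LF before it reaches the wire:
# an attacker-controlled value with "\r\n" could otherwise inject extra
# headers or an entire forged response.
def assert_header_value!(value)
  raise ArgumentError, 'CRLF in response header value' if value.match?(/[\r\n]/)
  value
end

assert_header_value!('text/html; charset=utf-8')   # passes through unchanged
begin
  assert_header_value!("ok\r\nSet-Cookie: pwned=1")
rescue ArgumentError => e
  e.message                                        # the guard fired
end
```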
612
71
 
613
- # CPU-bound JSON (per-request CPU savings visible)
614
- bundle exec bin/hyperion -p 9292 -w 4 -t 5 bench/work.ru &
615
- wrk -t4 -c200 -d15s --latency http://127.0.0.1:9292/
72
+ ### WebSockets (2.1.0+)
616
73
 
617
- # Static 1 MiB sendfile path
618
- ruby -e 'File.binwrite("/tmp/hyperion_bench_asset_1m.bin", "x" * (1024*1024))'
619
- bundle exec bin/hyperion -p 9292 -w 1 -t 5 bench/static.ru &
620
- wrk -t4 -c100 -d15s --latency http://127.0.0.1:9292/hyperion_bench_asset_1m.bin
74
+ RFC 6455 over Rack 3 full hijack, native frame codec, per-connection
75
+ wrapper with auto-pong, close handshake, UTF-8 validation, and per-message
76
+ size cap. **ActionCable + faye-websocket on a single binary** one
77
+ `hyperion -w 4 -t 10 config.ru` serves HTTP, HTTP/2, TLS, and `/cable`
78
+ from the same listener. Conformance: 463/463 autobahn-testsuite cases
79
+ pass. See [docs/WEBSOCKETS.md](docs/WEBSOCKETS.md).
621
80
 
622
- # SSE streaming (Hyperion-shaped rackup with explicit flush sentinel — see caveat in BENCH doc)
623
- bundle exec bin/hyperion -p 9292 -w 1 -t 5 bench/sse.ru &
624
- wrk -t1 -c1 -d10s http://127.0.0.1:9292/
81
+ ### gRPC (2.12-F+)
625
82
 
626
- # WebSocket multi-process throughput
627
- bundle exec bin/hyperion -p 9888 -w 4 -t 64 bench/ws_echo.ru &
628
- ruby bench/ws_bench_client_multi.rb --port 9888 --procs 4 --conns 200 --msgs 1000 --bytes 1024 --json
83
+ Hyperion's HTTP/2 path supports gRPC unary, server-streaming,
84
+ client-streaming, and bidirectional RPCs via the Rack 3 trailers contract:
85
+ any response body that defines `#trailers` gets a final HEADERS frame
86
+ (with `END_STREAM=1`) carrying the trailer map after the DATA frames.
87
+ Plain HTTP/2 traffic without the gRPC content-type keeps the unary
88
+ buffered semantics — no behaviour change for non-gRPC clients.
629
89
 
630
- # h2 native HPACK (Rails-shape, 25-header response)
631
- ./bench/h2_rails_shape.sh
90
+ A minimal unary handler:
632
91
 
633
- # Idle keep-alive RSS sweep (1k / 5k / 10k conns, 30s hold per server)
634
- ./bench/keepalive_memory.sh
92
+ ```ruby
93
+ class GrpcBody
94
+ def initialize(reply); @reply = reply; end
95
+ def each; yield @reply; end
96
+ def trailers; { 'grpc-status' => '0', 'grpc-message' => 'OK' }; end
97
+ def close; end
98
+ end
635
99
 
636
- # Hello-world quick comparator (Hyperion vs Puma vs Falcon)
637
- bundle exec ruby bench/compare.rb
638
- HYPERION_WORKERS=4 PUMA_WORKERS=4 FALCON_COUNT=4 bundle exec ruby bench/compare.rb
100
+ run ->(env) {
101
+ request = env['rack.input'].read
102
+ reply = handle(request)
103
+ [200, { 'content-type' => 'application/grpc' }, GrpcBody.new(reply)]
104
+ }
639
105
  ```
640
106
 
641
- PG benches (`pg_concurrent.ru`, `pg_mixed.ru`, `pg_realistic.ru`) live in the [hyperion-async-pg companion repo](https://github.com/andrew-woblavobla/hyperion-async-pg); they require a running Postgres + the companion gem and are not part of this repo. The 2.0.0 sweep used `~/bench/pg_concurrent.ru` on the bench host; reproduce by cloning hyperion-async-pg and following its README, or `scp` the rackup + DATABASE_URL.
107
+ Server-streaming yields one DATA frame per `each`; client-streaming
108
+ reads incoming frames off `env['rack.input']` (a streaming IO that
109
+ blocks until the next DATA frame lands); bidirectional interleaves
110
+ both. Reproducible bench at `bench/grpc_stream.{proto,ru}` +
111
+ `bench/grpc_stream_bench.sh` (ghz). Numbers in
112
+ [docs/BENCH_HYPERION_2_11.md](docs/BENCH_HYPERION_2_11.md#grpc-ghz-bench--hyperion-vs-falcon-async-grpc-214-d).
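
Extending the unary example, a server-streaming body under the same contract is just an `each` that yields more than once — one DATA frame per yield, trailers closing the stream. A sketch (gRPC length-prefix framing of each message is elided):

```ruby
class StreamingGrpcBody
  def initialize(messages)
    # messages: already length-prefix-framed gRPC payloads (framing elided here)
    @messages = messages
  end

  # Each yield becomes one HTTP/2 DATA frame on the stream.
  def each
    @messages.each { |m| yield m }
  end

  # Sent as the final HEADERS frame (END_STREAM=1) after the DATA frames.
  def trailers
    { 'grpc-status' => '0', 'grpc-message' => 'OK' }
  end

  def close; end
end

body = StreamingGrpcBody.new(%w[msg1 msg2 msg3])
frames = []
body.each { |f| frames << f }
frames   # => ["msg1", "msg2", "msg3"]
```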
642
113
 
643
- When numbers from your host don't match the published numbers, the most likely explanations (in order): (1) bench-host noise — single-VM benches drift 10-30% over days; (2) Puma version mismatch — the 2.0.0 sweep used Puma 8.0.1 in the `~/bench/Gemfile`, the hyperion repo's own Gemfile pins Puma `~> 6.4`; (3) different kernel / Ruby; (4) different `-t` / `-c` (apples-to-apples requires identical worker count, thread count, wrk concurrency, payload size, kernel, Ruby, TLS cipher).
114
+ ### `Server.handle` direct routes
644
115
 
645
- ## Quick start
116
+ Bypass the Rack adapter for hot paths:
646
117
 
647
- ```sh
648
- bundle install
649
- bundle exec rake compile # build the llhttp C ext
650
- bundle exec hyperion config.ru # single-process default
651
- bundle exec hyperion -w 4 -t 10 config.ru # 4-worker cluster, 10 threads each
652
- bundle exec hyperion -w 0 config.ru # 1 worker per CPU
653
- bundle exec hyperion --tls-cert cert.pem --tls-key key.pem -p 9443 config.ru # HTTPS
654
- curl http://127.0.0.1:9292/ # => hello
655
-
656
- # Chunked POST works:
657
- curl -X POST -H "Transfer-Encoding: chunked" --data-binary @file http://127.0.0.1:9292/
658
-
659
- # HTTP/2 (over TLS, ALPN-negotiated):
660
- curl --http2 -k https://127.0.0.1:9443/
118
+ ```ruby
119
+ Hyperion::Server.handle_static '/health', body: 'ok'
120
+ Hyperion::Server.handle(:GET, '/v1/ping') { |env| [200, {}, ['pong']] }
661
121
  ```
662
122
 
663
- `bundle exec rake spec` (and the `default` task) auto-invoke `compile`, so a fresh checkout just needs `bundle install && bundle exec rake` to get a green run.
664
-
665
- **Migrating from Puma?** See [docs/MIGRATING_FROM_PUMA.md](docs/MIGRATING_FROM_PUMA.md).
123
+ `handle_static` bakes the response at boot and serves from the C accept
124
+ loop (134k r/s with io_uring, 16k r/s on accept4). The dynamic block
125
+ form (2.14-A) runs `app.call(env)` on the C accept loop too — accept +
126
+ recv + parse + write release the GVL while the block holds it, so
127
+ multi-threaded workers actually parallelise.
128
+
129
+ ### Pre-fork cluster
130
+
131
+ Per-OS worker model: `SO_REUSEPORT` on Linux (kernel-balanced accept,
132
+ 1.004–1.011 max/min ratio across workers under steady load — 2.12-E
133
+ audit), master-bind + worker-fd-share on macOS/BSD where Darwin's
134
+ `SO_REUSEPORT` doesn't load-balance. Lifecycle hooks (`before_fork`,
135
+ `on_worker_boot`, `on_worker_shutdown`) for AR / Redis / pool init.
136
+
137
+ ### Async I/O (PG-bound apps)
138
+
139
+ `--async-io` runs plain HTTP/1.1 connections under `Async::Scheduler`,
140
+ turning one OS thread into thousands of in-flight handler invocations.
141
+ Paired with [hyperion-async-pg](https://github.com/andrew-woblavobla/hyperion-async-pg)
142
+ on a `pg_sleep(50ms)` workload, single-worker `pool=200` hits **2,381 r/s**
143
+ vs Puma `-t 5` at 56 r/s (architectural ceiling: pool size, not thread
144
+ count). Three things must all be true: `--async-io`, `hyperion-async-pg`
145
+ loaded, and a fiber-aware pool (`Hyperion::AsyncPg::FiberPool`,
146
+ `async-pool`, or `Async::Semaphore` — **not** the `connection_pool` gem,
147
+ whose `Mutex` blocks the OS thread). Skip any one and you get parity
148
+ with Puma.
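
Wired together, the three requirements look roughly like this in `config/hyperion.rb`. A sketch only — the exact `FiberPool` constructor arguments are assumptions; check the hyperion-async-pg README for the current API:

```ruby
# config/hyperion.rb
async_io true                        # 1. fiber dispatch on plain HTTP/1.1

on_worker_boot do |_worker_index|
  require 'hyperion/async_pg'
  Hyperion::AsyncPg.install!         # 2. non-blocking PG driver patch

  # 3. fiber-aware pool — NOT the connection_pool gem, whose Mutex
  #    blocks the OS thread. Constructor shape is illustrative;
  #    prefilling at boot avoids the lazy-init cold-start spike.
  $pg_pool = Hyperion::AsyncPg::FiberPool.new(size: 200) do
    PG.connect(ENV.fetch('DATABASE_URL'))
  end
  $pg_pool.fill
end
```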
149
+
150
+ ### Observability
151
+
152
+ `/-/metrics` Prometheus endpoint (admin-token guarded), per-route
153
+ latency histograms, per-conn fairness rejections, WebSocket
154
+ permessage-deflate ratio, kTLS active connections, ThreadPool queue
155
+ depth, dispatch-mode counters (Rack / `handle_static` / dynamic block /
156
+ h2 / async-io). Pre-built Grafana dashboard at
157
+ [docs/grafana/hyperion-2.4-dashboard.json](docs/grafana/hyperion-2.4-dashboard.json).
158
+ Full reference: [docs/OBSERVABILITY.md](docs/OBSERVABILITY.md).
159
+
160
+ Default-ON structured access logs (one JSON or text line per request)
161
+ with hot-path optimisations: per-thread cached iso8601 timestamp,
162
+ hand-rolled line builder, lock-free per-thread 4 KiB write buffer.
163
+ 12-factor logger split: `info`/`debug` → stdout, `warn`/`error`/`fatal`
164
+ → stderr.
165
+
166
+ ### Optional io_uring accept loop
167
+
168
+ Linux 5.x+, opt-in via `HYPERION_IO_URING_ACCEPT=1`. Multishot accept
169
+ + per-conn RECV/WRITE/CLOSE state machine on top of liburing. One
170
+ `io_uring_enter` per N requests instead of N×3 syscalls. Compiles out
171
+ cleanly without liburing — the `accept4` path stays the fallback.
172
+ macOS keeps using `accept4`. Default-flip moves to 2.15 with a fresh
173
+ 24h soak.
666
174
 
667
175
  ## Configuration
668
176
 
669
- Three layers, in precedence order: explicit CLI flag > environment variable > `config/hyperion.rb` > built-in default.
177
+ Three layers, in precedence order: explicit CLI flag > environment
178
+ variable > `config/hyperion.rb` > built-in default.
670
179
 
671
- ### CLI flags
180
+ ### Most-used CLI flags
672
181
 
673
182
  | Flag | Default | Notes |
674
183
  |---|---|---|
675
184
  | `-b, --bind HOST` | `127.0.0.1` | |
676
185
  | `-p, --port PORT` | `9292` | |
677
186
  | `-w, --workers N` | `1` | `0` → `Etc.nprocessors` |
678
- | `-t, --threads N` | `5` | OS-thread Rack handler pool per worker. `0` → run inline (no pool, debugging only). |
187
+ | `-t, --threads N` | `5` | OS-thread Rack handler pool per worker. `0` → run inline (debugging). |
679
188
  | `-C, --config PATH` | `config/hyperion.rb` if present | Ruby DSL file. |
680
- | `--tls-cert PATH` | nil | PEM certificate. |
681
- | `--tls-key PATH` | nil | PEM private key. |
682
- | `--log-level LEVEL` | `info` | `debug` / `info` / `warn` / `error` / `fatal`. |
683
- | `--log-format FORMAT` | `auto` | `text` / `json` / `auto`. Auto: JSON when `RAILS_ENV`/`RACK_ENV` is `production`/`staging`, colored text on TTY, JSON otherwise. |
684
- | `--[no-]log-requests` | ON | Per-request access log. |
685
- | `--fiber-local-shim` | off | Patches `Thread#thread_variable_*` to fiber storage for older Rails idioms. |
686
- | `--[no-]yjit` | auto | Force YJIT on/off. Default: auto-on under `RAILS_ENV`/`RACK_ENV` = `production`/`staging`. |
687
- | `--[no-]async-io` | off | Run plain HTTP/1.1 connections under `Async::Scheduler`. Required for `hyperion-async-pg` on plain HTTP. TLS h1 / HTTP/2 always run under the scheduler regardless. |
688
- | `--max-body-bytes BYTES` | `16777216` (16 MiB) | Maximum request body size. |
689
- | `--max-header-bytes BYTES` | `65536` (64 KiB) | Maximum total request-header size. |
690
- | `--max-pending COUNT` | unbounded | Per-worker accept-queue cap before new connections are rejected with HTTP 503 + `Retry-After: 1`. |
691
- | `--max-request-read-seconds SECONDS` | `60` | Total wallclock budget for reading request line + headers + body for ONE request. Slowloris defence. |
692
- | `--admin-token TOKEN` | unset | Bearer token for `POST /-/quit` and `GET /-/metrics`. **Production: prefer `--admin-token-file` — argv is visible via `ps`.** |
693
- | `--admin-token-file PATH` | unset | Read the admin token from a file. Refuses to load if the file is missing or world-readable (mode must mask `0o007`). |
694
- | `--worker-max-rss-mb MB` | unset | Master gracefully recycles a worker once its RSS exceeds this many megabytes. nil = disabled. |
695
- | `--idle-keepalive SECONDS` | `5` | Keep-alive idle timeout. Connection closes after this many seconds of inactivity. |
696
- | `--graceful-timeout SECONDS` | `30` | Shutdown deadline before SIGKILL is delivered to a worker that hasn't drained. |
189
+ | `--tls-cert PATH` / `--tls-key PATH` | nil | PEM cert + key for HTTPS. |
190
+ | `--[no-]async-io` | off | Run plain HTTP/1.1 under `Async::Scheduler`. Required for `hyperion-async-pg` on plain HTTP. |
191
+ | `--preload-static DIR` | nil | Preload static assets from DIR at boot (repeatable, immutable). Rails apps auto-detect from `Rails.configuration.assets.paths`. |
192
+ | `--admin-token-file PATH` | unset | Auth file for `/-/quit` and `/-/metrics`. Refuses world-readable files. |
193
+ | `--worker-max-rss-mb MB` | unset | Master gracefully recycles a worker exceeding MB RSS. |
194
+ | `--max-pending COUNT` | unbounded | Per-worker accept-queue cap before HTTP 503 + `Retry-After: 1`. |
195
+ | `--idle-keepalive SECONDS` | `5` | Keep-alive idle timeout. |
196
+ | `--graceful-timeout SECONDS` | `30` | Shutdown deadline before SIGKILL. |
197
+
198
+ `bin/hyperion --help` prints the full set, including `--max-body-bytes`,
199
+ `--max-header-bytes`, `--max-request-read-seconds` (slowloris defence),
200
+ `--h2-max-total-streams`, `--max-in-flight-per-conn`,
201
+ `--tls-handshake-rate-limit`, and the `--[no-]yjit` /
202
+ `--[no-]log-requests` toggles.
697
203
 
698
204
  ### Environment variables
699
205
 
700
- `HYPERION_LOG_LEVEL`, `HYPERION_LOG_FORMAT`, `HYPERION_LOG_REQUESTS` (`0|1|true|false|yes|no|on|off`), `HYPERION_ENV`, `HYPERION_WORKER_MODEL` (`share|reuseport`).
206
+ `HYPERION_LOG_LEVEL`, `HYPERION_LOG_FORMAT`, `HYPERION_LOG_REQUESTS`
207
+ (`0|1|true|false|yes|no|on|off`), `HYPERION_ENV`,
208
+ `HYPERION_WORKER_MODEL` (`share|reuseport`), `HYPERION_IO_URING_ACCEPT`
209
+ (`0|1`), `HYPERION_H2_DISPATCH_POOL`, `HYPERION_H2_NATIVE_HPACK`
210
+ (`v2|ruby|off`), `HYPERION_H2_TIMING`.
701
211
 
702
212
  ### Config file
703
213
 
704
- `config/hyperion.rb` — same shape as Puma's `puma.rb`. Auto-loaded if present.
214
+ `config/hyperion.rb` — same shape as Puma's `puma.rb`. Auto-loaded if
215
+ present. Strict DSL: unknown methods raise `NoMethodError` at boot.
705
216
 
706
217
  ```ruby
707
218
  # config/hyperion.rb
@@ -718,16 +229,11 @@ read_timeout 30
718
229
  idle_keepalive 5
719
230
  graceful_timeout 30
720
231
 
721
- max_header_bytes 64 * 1024
722
- max_body_bytes 16 * 1024 * 1024
723
-
724
232
  log_level :info
725
233
  log_format :auto
726
234
  log_requests true
727
235
 
728
- fiber_local_shim false
729
-
730
- async_io nil # Three-way (1.4.0+): nil (default, auto: inline-on-fiber for TLS h1, pool hop for plain HTTP/1.1), true (force inline-on-fiber everywhere — required for hyperion-async-pg on plain HTTP/1.1), false (force pool hop everywhere — explicit opt-out for TLS+threadpool with CPU-heavy handlers). ~5% throughput hit on hello-world when inline; in exchange one OS thread serves N concurrent in-flight DB queries on wait-bound workloads. TLS / HTTP/2 accept loops always run under Async::Scheduler regardless of this flag.
236
+ async_io nil # nil = auto (1.4.0+), true = inline-on-fiber everywhere, false = pool everywhere
731
237
 
732
238
  before_fork do
733
239
  ActiveRecord::Base.connection_handler.clear_all_connections! if defined?(ActiveRecord)
@@ -736,199 +242,202 @@ end
736
242
  on_worker_boot do |worker_index|
737
243
  ActiveRecord::Base.establish_connection if defined?(ActiveRecord)
738
244
  end
739
-
740
- on_worker_shutdown do |worker_index|
741
- ActiveRecord::Base.connection_handler.clear_all_connections! if defined?(ActiveRecord)
742
- end
743
245
  ```
744
246
 
745
- Strict DSL: unknown methods raise `NoMethodError` at boot — typos surface immediately rather than getting silently ignored.
746
-
747
- A documented sample lives at [`config/hyperion.example.rb`](config/hyperion.example.rb).
247
+ A documented sample lives at
248
+ [`config/hyperion.example.rb`](config/hyperion.example.rb).
748
249
 
749
250
  ## Operator guidance
750
251
 
751
- Concrete tradeoffs distilled from [`docs/BENCH_2026_04_27.md`](docs/BENCH_2026_04_27.md). If the bench numbers cited below feel surprising, check that doc for the full matrix + caveats.
252
+ Distilled from [docs/BENCH_2026_04_27.md](docs/BENCH_2026_04_27.md)
253
+ (Rails 8.1 real-app sweep). Headline finding: **the simplest drop-in
254
+ is the right answer.**
752
255
 
753
- ### When to use `-w N`
256
+ ### Migrating from Puma
754
257
 
755
- | Workload shape | Recommended | Why |
756
- |---|---|---|
757
- | **Pure I/O-bound** (PG / Redis / external HTTP, no significant CPU) | `-w 1` + larger pool | Bench: `-w 1 pool=200` = 87 MB / 2,180 r/s vs `-w 4 pool=64` = 224 MB / 1,680 r/s. **2.6× more memory, 0.77× rps** if you pick multi-worker on a wait-bound workload. |
758
- | **Pure CPU-bound** (heavy JSON / template render / image processing) | `-w N` matching CPU count | Each worker's accept loop is single-threaded under `--async-io`; multi-worker gives CPU-parallelism. Bench: `-w 16 -t 5` hits 98,818 r/s on a 16-vCPU box, 4.7× a `-w 1` ceiling on the same hardware. |
759
- | **Mixed** (Rails-shaped: ~5 ms CPU + 50 ms PG wait per request) | `-w N/2` (half cores) + medium pool | Lets CPU work parallelise while keeping per-worker memory tractable. Bench `pg_mixed.ru` (in hyperion-async-pg repo / `~/bench/`) at `-w 4 -t 5 pool=128` = 1,740 r/s with no cold-start spike (ForkSafe `prefill_in_child: true`). |
760
-
761
- Multi-worker on PG-wait workloads is the **wrong** default for most apps — the headline rps doesn't justify the memory and PG-connection cost. Verify your shape with the bench before scaling out.
762
-
763
- ### When to use `--async-io`
764
-
765
- ```
766
- Are you using a fiber-cooperative I/O library?
767
- (hyperion-async-pg, async-redis, async-http)
768
-
769
- ┌─────────────┴─────────────┐
770
- yes no
771
- │ │
772
- Pair with a fiber-aware Leave --async-io OFF.
773
- connection pool Default thread-pool dispatch
774
- (FiberPool, async-pool — is faster for synchronous
775
- NOT connection_pool gem, Rails apps. Bench: --async-io
776
- which uses non-fiber Mutex). on hello-world = 47% rps
777
- │ regression + p99 spike to
778
- Set --async-io. 3.65 s under no-yield workloads.
779
- Pool size is the real No reason to flip the flag.
780
- concurrency knob; -t is
781
- decorative for wait-bound.
782
- ```
258
+ `hyperion -t N -w M` matching your current Puma `-t N:N -w M`. No other
259
+ flags. Versus Puma at the same `-t/-w` shape on real Rails endpoints:
260
+ **+9% rps on lightweight endpoints, 28× lower p99 on health-style
261
+ endpoints, 3× lower p99 on PG-touching endpoints.** Same RSS, same
262
+ operator surface: keep all your existing config, monitoring, deploy
263
+ scripts.
783
264
 
784
- Hyperion warns at boot if you set `--async-io` without any fiber-cooperative library loaded. The setting is still honoured; the warn just nudges operators who flipped it expecting a free perf bump.
265
+ ### Knobs that help on synthetic benches but **not** on real Rails
785
266
 
786
- ### Tuning `-t` and pool sizes
267
+ | Knob | Synthetic | Real Rails | Recommendation |
268
+ |---|---|---|---|
269
+ | `-t 30` | +5–10% on hello-world | **Hurts** p99 vs `-t 10` (3.51 s vs 148 ms on `/up`) — GVL + middleware Mutex contention | Stay at `-t 10`. |
270
+ | `--yjit` | +5–10% on CPU-bound | Wash on dev-mode Rails | Skip until you bench production-mode. |
271
+ | `RAILS_POOL > 25` | n/a | No improvement at 50 or 100 | Keep your existing AR pool. |
272
+ | `--async-io` | 33–42× rps on PG-bound | **Worse** than drop-in (4.14 s p99 on `/up`) until your full I/O stack is fiber-cooperative | Don't enable until `redis-rb` → `async-redis`. |
787
273
 
788
- - **Without `--async-io`** (sync server, default): `-t` is the concurrency knob. Each in-flight request holds an OS thread; pool size should match `-t`. Bench shows Puma-style behaviour — at 200 wrk conns hitting a 5-thread server, queue depth dominates p99 (Hyperion `-t 5 -w 1` p50 = 0.95 ms vs Puma's same shape at 59.5 ms — Hyperion's queueing is cheaper but the model still serializes at `-t`).
789
- - **With `--async-io` + a fiber-aware pool**: pool size is the concurrency knob. `-t` is decorative for wait-bound workloads; one accept-loop fiber serves all in-flight queries via the pool. Linear scaling: pool=64 → ~780 r/s, pool=128 → ~1,344 r/s, pool=200 → ~2,180 r/s on 50 ms PG queries.
790
- - **Pool over WAN**: if `PG.connect` round-trip is >50 ms, expect pool fill at startup to take `pool_size / parallel_fill_threads × RTT`. `hyperion-async-pg 0.5.1+` auto-scales `parallel_fill_threads` so pool=200 fills in ~1-2 s.
274
+ ### When `-w N` helps
791
275
 
792
- ### How to read p50 vs p99
276
+ | Workload | Recommended | Why |
277
+ |---|---|---|
278
+ | Pure I/O-bound (PG / Redis / external HTTP) | `-w 1` + larger pool | `-w 1 pool=200` = 87 MB / 2,180 r/s vs `-w 4 pool=64` = 224 MB / 1,680 r/s. **2.6× memory, 0.77× rps** if you pick multi-worker on wait-bound. |
279
+ | Pure CPU-bound | `-w N` matching CPU count | Bench: `-w 16 -t 5` hits 98,818 r/s on a 16-vCPU box. |
280
+ | Mixed (Rails-shaped, ~5 ms CPU + 50 ms wait) | `-w N/2` (half cores) + medium pool | `-w 4 -t 5 pool=128` = 1,740 r/s on `pg_mixed.ru`, no cold-start spike. |
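The trade-off in the first row is plain arithmetic; a two-line check using the figures from the table above:

```ruby
# Figures from the "Pure I/O-bound" row above.
mem_single,  rps_single  = 87.0, 2_180.0    # -w 1 pool=200
mem_cluster, rps_cluster = 224.0, 1_680.0   # -w 4 pool=64

puts format('%.1f× memory, %.2f× rps',
            mem_cluster / mem_single,       # 2.6× the RSS
            rps_cluster / rps_single)       # for 0.77× the throughput
```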
793
281
 
794
- Tail latency tells the queueing story; rps tells the throughput story. Hyperion's tail wins are **always** bigger than its rps wins — sometimes the rps numbers look close to a competitor while p99 is 5-200× lower:
282
+ ### Read p99, not the mean
795
283
 
796
284
  | Workload | Hyperion rps / p99 | Closest competitor | rps ratio | p99 ratio |
797
285
  |---|---|---|---:|---:|
798
- | Hello `-w 4` | 21,215 r/s / 1.87 ms | Falcon 24,061 / 9.78 ms | 0.88× | **5.2× lower** |
799
- | CPU JSON `-w 4` | 15,582 r/s / 2.47 ms | Falcon 18,643 / 13.51 ms | 0.84× | **5.5× lower** |
800
- | Static 1 MiB | 1,919 r/s / 4.22 ms | Puma 2,074 / 55 ms | 0.93× | **13× lower** |
801
- | PG-wait `-w 1` pool=200 | 2,180 r/s / 668 ms | Puma 530 r/s + 200 timeouts | **4.1×** | qualitative crush |
286
+ | Hello `-w 4` | 21,215 / 1.87 ms | Falcon 24,061 / 9.78 ms | 0.88× | **5.2× lower** |
287
+ | CPU JSON `-w 4` | 15,582 / 2.47 ms | Falcon 18,643 / 13.51 ms | 0.84× | **5.5× lower** |
288
+ | Static 1 MiB | 1,919 / 4.22 ms | Puma 2,074 / 55 ms | 0.93× | **13× lower** |
289
+ | PG-wait `-w 1` pool=200 | 2,180 / 668 ms | Puma 530 + 200 timeouts | **4.1×** | qualitative crush |
802
290
 
803
- **Size capacity by p99, not by mean.** Throughput peaks are easy to fake under controlled bench conditions; tail latency reflects what your slowest user actually experiences when the load balancer fans them onto a busy worker.
804
-
805
- ### Production tuning (real Rails apps)
806
-
807
- Distilled from a real-app bench against the [Exodus platform](https://github.com/andrew-woblavobla/hyperion/blob/master/docs/BENCH_2026_04_27.md) (Rails 8.1, on-LAN PG + Redis at ~0.3 ms RTT, `-w 4 -t 10`, `wrk -t8 -c200 -d30s`). The headline finding: the **simplest drop-in is the right answer**, and the additional knobs operators reach for first don't help on real Rails.
808
-
809
- **Recommended for migrating from Puma**: `hyperion -t N -w M` matching your current Puma `-t N:N -w M`. No other flags. That gives you (vs Puma at the same `-t/-w`):
810
-
811
- - **+9% rps on lightweight endpoints** (matches the 5-10% per-request CPU savings the rest of the bench section documents).
812
- - **28× lower p99 on health-style endpoints** — the queue-of-doom shape Puma exhibits under sustained 200-conn load doesn't reproduce on Hyperion's worker-owns-connection model.
813
- - **3.8× lower p99 on PG-touching endpoints**.
814
- - **Same RSS, same operator surface** — you keep all your existing config, monitoring, and deploy scripts.
815
-
816
- **Knobs that help on synthetic benches but NOT on real Rails — leave them off:**
817
-
818
- | Knob | Synthetic bench result | Real Rails result | Recommendation |
819
- |---|---|---|---|
820
- | `-t 30` (more threads/worker) | Helped Hyperion 5-10% on hello-world | **Hurt** p99 vs `-t 10` on real Rails (3.51 s vs 148 ms on /up) — GVL + middleware Mutex contention dominates past `-t 10` | Stay at `-t 10`. Match Puma's recommended `RAILS_MAX_THREADS`. |
821
- | `--yjit` | 5-10% on synthetic CPU-bound | Wash on dev-mode Rails (312 vs 328 rps, p99 worse with YJIT) | Skip for now. Production-mode Rails may behave differently — verify with your own bench before flipping. |
822
- | `RAILS_POOL` > 25 | n/a | No improvement at pool=50 or pool=100 on real Rails (rps within 3%, p99 within noise). Pool starvation is rarely the bottleneck on a `-w 4 -t 10` config | Keep your existing AR pool size. |
823
- | `--async-io` | 33-42× rps on PG-bound (with `hyperion-async-pg`) | **Worse** than drop-in on real Rails (4.14 s p99 on /up vs 148 ms drop-in) | **Don't enable** until your full I/O stack is fiber-cooperative. The synchronous Redis client (`redis-rb`) blocks the OS thread before async-pg can yield, so fibers can't compound. Migrate to `async-redis` *first*, then revisit. |
824
- | `--async-io` + `hyperion-async-pg` AR adapter | Verified 48× rps lift on a single-PG-query bench | Marginal-or-negative on real Rails (similar reason: Redis-first handlers don't yield) | Same — wait for a full-async I/O stack. |
825
-
826
- **Why the simple drop-in wins on real Rails:** the per-request budget on a real handler is dominated by the Rails middleware chain (rack-attack, locale redirect, tagger, etc.) + handler logic + DB + cache I/O. Hyperion's per-request CPU optimizations (C-ext header parser, response builder, lock-free metrics, fiber-cooperative TLS dispatch in 1.4.0+) shave ~5-10% off the *non-I/O* portion of the budget consistently — and the [worker-owns-connection model](#concurrency-at-scale-architectural-advantages) prevents the queue-amplification that Puma's thread-pool dispatch shows under sustained load. You don't need to "tune" anything to get those.
291
+ Throughput peaks are easy to fake under controlled conditions; tail
292
+ latency reflects what your slowest user actually experiences when the
293
+ load balancer fans them onto a busy worker.
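A toy sample makes the point concrete (illustrative numbers, not from the benches above): a handful of queued requests barely move the mean while dominating the tail.

```ruby
# 98 fast requests at 2 ms plus 2 queued requests at 400 ms.
samples = Array.new(98, 2.0) + [400.0, 400.0]

mean = samples.sum / samples.size                    # 9.96 ms, looks healthy
p99  = samples.sort[(samples.size * 0.99).ceil - 1]  # 400.0 ms, the queue is visible

puts format('mean=%.2f ms  p99=%.1f ms', mean, p99)
```

Sizing by the mean here would hide a 400 ms experience that 1 in 50 users actually gets.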
827
294
 
828
295
  ## Logging
829
296
 
830
- Default behaviour (rc16+):
831
-
832
- - **`info` / `debug` → stdout**, **`warn` / `error` / `fatal` → stderr** (12-factor).
833
- - **One structured access-log line per response**, info level, on stdout. Disable with `--no-log-requests` or `HYPERION_LOG_REQUESTS=0`.
834
- - **Format auto-selects**: production envs → JSON (line-delimited, parseable by every log aggregator); TTY → coloured text; piped output without env hint → JSON.
297
+ Default behaviour:
835
298
 
836
- ### Sample access log lines
299
+ - `info`/`debug` stdout, `warn`/`error`/`fatal` → stderr (12-factor).
300
+ - One structured access-log line per response, `info` level. Disable
301
+ with `--no-log-requests` or `HYPERION_LOG_REQUESTS=0`.
302
+ - Format auto-selects: `RAILS_ENV=production`/`staging` → JSON; TTY →
303
+ coloured text; piped output without env hint → JSON.
837
304
 
838
- Text format (TTY default):
305
+ Sample text (TTY default):
839
306
 
840
307
  ```
841
308
  2026-04-26T18:40:04.112Z INFO [hyperion] message=request method=GET path=/api/v1/health status=200 duration_ms=46.63 remote_addr=127.0.0.1 http_version=HTTP/1.1
842
- 2026-04-26T18:40:04.123Z INFO [hyperion] message=request method=GET path=/api/v1/cached_data query="currency=USD" status=200 duration_ms=43.87 remote_addr=127.0.0.1 http_version=HTTP/1.1
843
309
  ```
844
310
 
845
- JSON format (auto-selected on `RAILS_ENV=production`/`staging` or piped output):
311
+ Sample JSON (production / piped):
846
312
 
847
313
  ```json
848
314
  {"ts":"2026-04-26T18:38:49.405Z","level":"info","source":"hyperion","message":"request","method":"GET","path":"/api/v1/health","status":200,"duration_ms":46.63,"remote_addr":"127.0.0.1","http_version":"HTTP/1.1"}
849
- {"ts":"2026-04-26T18:38:49.411Z","level":"info","source":"hyperion","message":"request","method":"GET","path":"/api/v1/cached_data","query":"currency=USD","status":200,"duration_ms":40.64,"remote_addr":"127.0.0.1","http_version":"HTTP/1.1"}
850
315
  ```
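Because each JSON line is a self-contained object, no custom parser is needed downstream; a few lines of plain Ruby (illustrative, using the sample line above) can filter the stream:

```ruby
require 'json'

# One line of the JSON access log, exactly as emitted above.
line  = '{"ts":"2026-04-26T18:38:49.405Z","level":"info","source":"hyperion","message":"request","method":"GET","path":"/api/v1/health","status":200,"duration_ms":46.63,"remote_addr":"127.0.0.1","http_version":"HTTP/1.1"}'
event = JSON.parse(line)

puts "#{event['method']} #{event['path']} -> #{event['status']} (#{event['duration_ms']} ms)"
```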
851
316
 
852
- ### Hot-path optimisations
853
-
854
- The default-ON access log path is engineered to stay near-zero cost:
855
-
856
- - **Per-thread cached `iso8601(3)` timestamp** — one allocation per millisecond per thread, reused across all requests in that millisecond.
857
- - **Hand-rolled single-interpolation line builder** — bypasses generic `Hash#map.join`.
858
- - **Per-thread 4 KiB write buffer** — flushes to stdout when full or on connection close. Cuts ~32× the syscalls under load.
859
- - **Lock-free emit** — POSIX `write(2)` is atomic for writes ≤ PIPE_BUF (4096 B); a log line is ~200 B. No logger mutex.
860
-
861
317
  ## Metrics
862
318
 
863
- `Hyperion.stats` returns a snapshot Hash with the following counters (lock-free per-thread aggregation):
864
-
865
- | Counter | Meaning |
866
- |---|---|
867
- | `connections_accepted` | Lifetime accept count. |
868
- | `connections_active` | Currently in-flight connections. |
869
- | `requests_total` | Lifetime request count. |
870
- | `requests_in_flight` | Currently in-flight requests. |
871
- | `responses_<code>` | One counter per status code emitted (`responses_200`, `responses_400`, …). |
872
- | `parse_errors` | HTTP parse failures → 400. |
873
- | `app_errors` | Rack app raised → 500. |
874
- | `read_timeouts` | Per-connection read deadline hit. |
875
- | `requests_threadpool_dispatched` | HTTP/1.1 connection handed to the worker pool (or served inline in `start_raw_loop` when `thread_count: 0`). The default dispatch path. |
876
- | `requests_async_dispatched` | HTTP/1.1 connection served inline on the accept-loop fiber under `--async-io`. Operators can use the ratio against `requests_threadpool_dispatched` to verify fiber-cooperative I/O is actually engaged. |
877
-
878
- ```ruby
879
- require 'hyperion'
880
- Hyperion.stats
881
- # => {connections_accepted: 1234, connections_active: 7, requests_total: 8910, …}
882
- ```
319
+ `Hyperion.stats` returns a snapshot Hash with lock-free per-thread
320
+ counters (`connections_accepted`, `connections_active`, `requests_total`,
321
+ `requests_in_flight`, `responses_<code>`, `parse_errors`, `app_errors`,
322
+ `read_timeouts`, `requests_threadpool_dispatched`,
323
+ `requests_async_dispatched`, `c_loop_requests_total`).
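The per-thread aggregation can be pictured with a pure-Ruby sketch (illustrative only; Hyperion's real counters live in the C extension): each thread increments a private shard with no shared lock on the hot path, and a snapshot merges the shards on read.

```ruby
# Sketch of the per-thread counter-shard pattern, not Hyperion's code.
class ThreadShardedCounters
  def initialize
    @shards = {}          # Thread => Hash shard
    @mutex  = Mutex.new   # guards shard registration only, never increments
  end

  def increment(key, by = 1)
    shard = Thread.current[:counter_shard] ||= begin
      h = Hash.new(0)
      @mutex.synchronize { @shards[Thread.current] = h }
      h
    end
    shard[key] += by      # uncontended: only this thread writes this shard
  end

  def snapshot
    @mutex.synchronize do
      @shards.values.each_with_object(Hash.new(0)) do |shard, out|
        shard.each { |k, v| out[k] += v }
      end
    end
  end
end

counters = ThreadShardedCounters.new
4.times.map { Thread.new { 1000.times { counters.increment(:requests_total) } } }
       .each(&:join)
puts counters.snapshot[:requests_total]  # => 4000
```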
883
324
 
884
- ### Prometheus exporter
885
-
886
- When `admin_token` is set in your config, Hyperion mounts a `/-/metrics` endpoint that emits Prometheus text-format v0.0.4. Same token guards both `/-/metrics` (GET) and `/-/quit` (POST); auth is via the `X-Hyperion-Admin-Token` header.
325
+ When `admin_token` is set, `/-/metrics` emits Prometheus text-format
326
+ v0.0.4. Auth is via the `X-Hyperion-Admin-Token` header (same token
327
+ guards `POST /-/quit`):
887
328
 
888
329
  ```sh
889
330
  $ curl -s -H 'X-Hyperion-Admin-Token: secret' http://127.0.0.1:9292/-/metrics
890
331
  # HELP hyperion_requests_total Total HTTP requests handled
891
332
  # TYPE hyperion_requests_total counter
892
333
  hyperion_requests_total 8910
893
- # HELP hyperion_bytes_written_total Total bytes written to response sockets
894
- # TYPE hyperion_bytes_written_total counter
895
- hyperion_bytes_written_total 2351023
896
- # HELP hyperion_responses_status_total Responses by HTTP status code
897
- # TYPE hyperion_responses_status_total counter
898
334
  hyperion_responses_status_total{status="200"} 8521
899
335
  hyperion_responses_status_total{status="404"} 12
900
- hyperion_responses_status_total{status="500"} 3
901
- # … and so on for sendfile_responses_total, rejected_connections_total,
902
- # slow_request_aborts_total, requests_async_dispatched_total, etc.
903
336
  ```
904
337
 
905
- Any counter not in the known set (added by app middleware via `Hyperion.metrics.increment(:custom_thing)`) is auto-exported as `hyperion_custom_thing` with a generic HELP line — no Hyperion config change required.
338
+ Any counter not in the known set (added via
339
+ `Hyperion.metrics.increment(:custom_thing)`) is auto-exported as
340
+ `hyperion_custom_thing` with a generic HELP line. Network-isolate the
341
+ admin endpoints if the listener is internet-facing — see
342
+ [docs/REVERSE_PROXY.md](docs/REVERSE_PROXY.md) for the nginx
343
+ `location /-/ { return 404; }` recipe.
344
+
345
+ ## Compatibility
346
+
347
+ | Component | Version |
348
+ |---|---|
349
+ | Ruby | 3.3+ (transitive `protocol-http2 ~> 0.26` floor) |
350
+ | Rack | 3.x |
351
+ | Rails | verified up to 8.1 |
352
+ | Linux kernel | 5.x+ for io_uring opt-in; 4.x+ otherwise |
353
+ | macOS | works (TLS, h2, WebSockets, `accept4` fallback path) |
906
354
 
907
- Point your scraper at it: in Prometheus' `scrape_configs`, set `metrics_path: /-/metrics` and `bearer_token` (or use a custom header relabel — Prometheus 2.42+ supports `authorization.credentials_file` paired with a custom `header` block). Network-isolate the admin endpoints if the listener is internet-facing — see [docs/REVERSE_PROXY.md](docs/REVERSE_PROXY.md) for the nginx `location /-/ { return 404; }` recipe.
355
+ Per the Rack 3 spec, Hyperion auto-sets `SERVER_SOFTWARE`, `rack.version`,
356
+ `REMOTE_ADDR`, IPv6-safe `Host` parsing, CRLF guard. The
357
+ `Hyperion::FiberLocal.install!` opt-in shim handles the residual
358
+ `Thread.current.thread_variable_*` footgun in older Rails idioms;
359
+ modern Rails 7.1+ already uses Fiber storage natively.
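The footgun is easy to demonstrate in plain Ruby: `Thread.current[:key]` is fiber-local on modern Ruby, while `thread_variable_get`/`set` are shared across every fiber of the thread, so request-scoped state stored in thread variables leaks between fibers serving different requests.

```ruby
# Fiber-local vs thread-variable storage on Ruby 3.x.
Thread.new do
  Thread.current[:req_id] = 'A'                   # fiber-local
  Thread.current.thread_variable_set(:req_id, 'A') # shared per-thread
  Fiber.new do
    p Thread.current[:req_id]                      # nil  (isolated per fiber)
    p Thread.current.thread_variable_get(:req_id)  # "A"  (leaks across fibers)
  end.resume
end.join
```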
908
360
 
909
- ## TLS + HTTP/2
361
+ ## Reproducing benchmarks
910
362
 
911
- Provide a PEM cert + key:
363
+ Every number in this README is reproducible. Per-row commands:
912
364
 
913
365
  ```sh
914
- bundle exec hyperion --tls-cert config/cert.pem --tls-key config/key.pem -p 9443 config.ru
915
- ```
366
+ # Setup (once)
367
+ bundle install
368
+ bundle exec rake compile
916
369
 
917
- ALPN auto-negotiates `h2` (HTTP/2) or `http/1.1` per connection. HTTP/2 multiplexes streams onto fibers within a single connection — slow handlers don't head-of-line-block other streams. Cluster-mode TLS works (`-w N` + `--tls-cert` / `--tls-key`).
370
+ # Hello via Server.handle_static + io_uring (134k r/s row)
371
+ HYPERION_IO_URING_ACCEPT=1 bundle exec bin/hyperion -w 1 -t 5 -p 9292 bench/hello_static.ru &
372
+ wrk -t4 -c100 -d20s --latency http://127.0.0.1:9292/
918
373
 
919
- Smuggling defenses for HTTP/1.1: `Content-Length` + `Transfer-Encoding` together → 400; non-chunked `Transfer-Encoding` → 501; CRLF in response header values → `ArgumentError` (response-splitting guard).
374
+ # Dynamic block via Server.handle (9.4k r/s row)
375
+ bundle exec bin/hyperion -w 1 -t 5 -p 9292 bench/hello_handle_block.ru &
376
+ wrk -t4 -c100 -d20s --latency http://127.0.0.1:9292/
920
377
 
921
- ## Compatibility
378
+ # Generic Rack hello (4.7k r/s row)
379
+ bundle exec bin/hyperion -w 1 -t 5 -p 9292 bench/hello.ru &
380
+ wrk -t4 -c100 -d20s --latency http://127.0.0.1:9292/
381
+
382
+ # CPU JSON via block form (5.9k r/s row)
383
+ bundle exec bin/hyperion -w 1 -t 5 -p 9292 bench/work.ru &
384
+ wrk -t4 -c200 -d15s --latency http://127.0.0.1:9292/
385
+
386
+ # 4-way comparator (Hyperion vs Puma vs Falcon vs Agoo)
387
+ bash bench/4way_compare.sh
388
+
389
+ # gRPC unary + streaming (Hyperion side)
390
+ GHZ=/tmp/ghz TRIALS=3 DURATION=15s WARMUP_DURATION=3s bash bench/grpc_stream_bench.sh
391
+
392
+ # Idle keep-alive RSS sweep (10k conns × 30s hold)
393
+ bash bench/keepalive_memory.sh
394
+ ```
922
395
 
923
- - **Ruby 3.3+** required (the `protocol-http2 ~> 0.26` transitive dep imposes this floor; older Ruby installs error at `bundle install`).
924
- - **Rack 3** (auto-sets `SERVER_SOFTWARE`, `rack.version`, `REMOTE_ADDR`, IPv6-safe `Host` parsing, CRLF guard).
925
- - **`Hyperion::FiberLocal.install!`** opt-in shim for older Rails apps that store request-scoped data via `Thread.current.thread_variable_*` (modern Rails 7.1+ already uses Fiber storage natively; the shim handles the residual footgun).
926
- - **`Hyperion::FiberLocal.verify_environment!`** runtime check that `Thread.current[:k]` is fiber-local on the current Ruby (it is on 3.2+).
396
+ PG benches (`pg_concurrent.ru`, `pg_mixed.ru`) live in the
397
+ [hyperion-async-pg](https://github.com/andrew-woblavobla/hyperion-async-pg)
398
+ companion repo; they require a running Postgres and the companion
399
+ gem.
400
+
401
+ When numbers from your host don't match the published ones, the
402
+ most likely explanations (in order): (1) bench-host noise — single-VM
403
+ benches drift 10–30% over days; (2) Puma version mismatch (sweep used
404
+ Puma 8.0.1; the in-repo Gemfile pins `~> 6.4`); (3) different kernel
405
+ or Ruby; (4) different `-t` / `-c` (apples-to-apples requires
406
+ identical worker count, thread count, wrk concurrency, payload, and
407
+ TLS cipher).
408
+
409
+ ## Release history
410
+
411
+ See [CHANGELOG.md](CHANGELOG.md). Recent: 2.14.0 (gRPC streaming ghz
412
+ numbers; dynamic-block C dispatch — `Server.handle { |env| ... }` lifts
413
+ hello to 9,422 r/s and CPU JSON to 5,897 r/s; `Server#stop` accept-wake
414
+ on Linux; io_uring 4h soak), 2.13.0 (response head builder C-rewrite;
415
+ gRPC streaming RPCs; soak harness), 2.12.0 (C connection lifecycle;
416
+ io_uring loop hits 134k r/s; gRPC unary trailers; SO_REUSEPORT
417
+ audit), 2.11.0 (HPACK CGlue default; h2 dispatch-pool warmup), 2.10.x
418
+ (`PageCache`, `Server.handle` direct routes, TCP_NODELAY at accept).
419
+
420
+ ## Links
421
+
422
+ - [CHANGELOG.md](CHANGELOG.md) — per-stream releases.
423
+ - [docs/BENCH_HYPERION_2_11.md](docs/BENCH_HYPERION_2_11.md) — current
424
+ 4-way matrix + 2.14-D gRPC numbers.
425
+ - [docs/BENCH_HYPERION_2_0.md](docs/BENCH_HYPERION_2_0.md) — historical
426
+ 2.10-B baseline (preserved for archaeology).
427
+ - [docs/BENCH_2026_04_27.md](docs/BENCH_2026_04_27.md) — real Rails 8.1
428
+ app sweep (Exodus platform).
429
+ - [docs/OBSERVABILITY.md](docs/OBSERVABILITY.md) — metrics + Grafana.
430
+ - [docs/WEBSOCKETS.md](docs/WEBSOCKETS.md) — RFC 6455 surface.
431
+ - [docs/MIGRATING_FROM_PUMA.md](docs/MIGRATING_FROM_PUMA.md) — drop-in
432
+ guide.
433
+ - [docs/REVERSE_PROXY.md](docs/REVERSE_PROXY.md) — nginx fronting.
927
434
 
928
435
  ## Credits
929
436
 
930
- - Vendored [llhttp](https://github.com/nodejs/llhttp) (Node.js's HTTP parser, MIT) under `ext/hyperion_http/llhttp/`.
931
- - HTTP/2 framing and HPACK via [`protocol-http2`](https://github.com/socketry/protocol-http2).
437
+ - Vendored [llhttp](https://github.com/nodejs/llhttp) (Node.js's HTTP
438
+ parser, MIT) under `ext/hyperion_http/llhttp/`.
439
+ - HTTP/2 framing and HPACK via
440
+ [`protocol-http2`](https://github.com/socketry/protocol-http2).
932
441
  - Fiber scheduler via [`async`](https://github.com/socketry/async).
933
442
 
934
443
  ## License