hyperion-rb 2.13.0 → 2.15.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
data/README.md CHANGED
@@ -1,933 +1,160 @@
1
1
  # Hyperion
2
2
 
3
- High-performance Ruby HTTP server. Falcon-class fiber concurrency, Puma-class compatibility.
3
+ High-performance Ruby HTTP server. Rack 3 + HTTP/2 + WebSockets + gRPC on a single binary.
4
4
 
5
5
  [![CI](https://github.com/andrew-woblavobla/hyperion/actions/workflows/ci.yml/badge.svg)](https://github.com/andrew-woblavobla/hyperion/actions/workflows/ci.yml)
6
6
  [![Gem Version](https://img.shields.io/gem/v/hyperion-rb.svg)](https://rubygems.org/gems/hyperion-rb)
7
7
  [![License: MIT](https://img.shields.io/github/license/andrew-woblavobla/hyperion.svg)](https://github.com/andrew-woblavobla/hyperion/blob/master/LICENSE)
8
8
 
9
- ```sh
10
- gem install hyperion-rb
11
- bundle exec hyperion config.ru
12
- ```
13
-
14
- ## What's new in 2.13.0
15
-
16
- **Hardening sprint: profile-driven CPU work, durable infrastructure,
17
- and gRPC streaming.** 2.13 follows up the 2.12 perf jump with the
18
- work that wasn't structural enough to need its own major release —
19
- each stream is small but adds up:
20
-
21
- - **2.13-A — Generic Rack hot-path wins.** Per-thread shards for
22
- hot-path metrics (no more cross-worker mutex on `observe_histogram`),
23
- cached `worker_id` label tuple, and a Rack-3 keepalive fast-path.
24
- Generic Rack hello bench: **+3.6% on `-c100` no-log**. Honest
25
- finding: env-pool + body-coalesce already shipped; the deeper
26
- generic-Rack gap to Agoo is a single-thread Ruby ceiling; closing it
27
- needs moving `app.call` into the C accept loop — a 2.14 lift.
28
- - **2.13-B — Response head builder rewritten in C.** Pre-baked
29
- status-line table, `rb_hash_foreach` replacing `rb_funcall(:keys)`,
30
- per-key downcase + per-(key, value) full-line caches, custom `itoa`
31
- replacing `snprintf`. **+7.7% single-thread synthetic; multi-thread
32
- neutral (GVL-bound).** Profile confirms Hyperion's own C-ext code is
33
- **<1%** of wall-clock; the rest is libruby + JSON gem.
34
- - **2.13-C — Spec flake hunt.** Two long-standing flakes fixed:
35
- `tls_ktls_spec` macOS skip leak (unconditional `RUBY_PLATFORM`
36
- guard), and `connection_loop_spec:79` Linux port-bind flake (root
37
- cause: Linux `close()` doesn't wake a parked `accept(2)` — fixed
38
- with a `stop_loop_and_wake` helper). Failure rate: 5/5 runs before, 0/10 after;
39
- spec-suite runtime 46 s → 1.3 s.
40
- - **2.13-D — gRPC streaming RPCs.** Server-streaming, client-streaming,
41
- and bidirectional RPCs on top of 2.12-F's unary trailers foundation.
42
- New `bench/grpc_stream.{proto,ru}` + `grpc_stream_bench.sh` ghz
43
- harness for operator-side comparison vs Falcon's `async-grpc`.
44
- - **2.13-E — io_uring soak harness + CI smoke.** New
45
- `bench/io_uring_soak.sh` runs a 24h soak against the 2.12-D
46
- io_uring loop with `/proc/$PID` + `/-/metrics` sampling, emits a
47
- CSV + verdict (PASS / SOAK FAIL / borderline). New
48
- `spec/hyperion/io_uring_soak_smoke_spec.rb` runs a 1000-request
49
- mini-soak in CI to catch leak regressions before any 24h run.
50
- **`HYPERION_IO_URING_ACCEPT` stays opt-in for 2.13** — operators
51
- with their own staging environments can now collect signal; the
52
- default-flip decision moves to 2.14. (Staging opt-in example after this list.)
53
-
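- Flipping it on for a staging soak is a one-liner; a minimal sketch using only
- the env var named above (worker / thread counts are illustrative):
-
- ```sh
- # opt-in io_uring accept loop (Linux 5.x+); without liburing the accept4 path stays the fallback
- HYPERION_IO_URING_ACCEPT=1 bundle exec hyperion -w 4 -t 10 config.ru
- ```
-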
54
- Plus all previous wins are preserved and verified by the 1183-spec
55
- suite (2.10-G TCP_NODELAY at accept, 2.10-E preload hooks, 2.10-F
56
- C-ext fast-path response writer, 2.11-A dispatch pool warmup, 2.11-B
57
- cglue HPACK default, 2.12-C accept4 connection loop, 2.12-D io_uring
58
- loop, 2.12-E per-worker request counter, 2.12-F gRPC unary trailers).
59
-
60
- Full per-stream details, bench tables, and follow-up items in
61
- [`CHANGELOG.md`](CHANGELOG.md).
62
-
63
- ## What's new in 2.12.0
64
-
65
- **The hot path moves into C — and gRPC ships.** The headline win:
66
- `Server.handle_static` routes now serve from a C accept→read→route→write
67
- loop with optional **io_uring** (Linux 5.x+) backing it. The `wrk -t4
68
- -c100 -d20s` hello bench moved from **5,502 r/s** (2.11.0
69
- `Server.handle_static` via Ruby accept loop) to **15,685 r/s** (2.12-C
70
- C accept4 loop) to **134,084 r/s** (2.12-D io_uring loop) — that's
71
- **24× over 2.11.0's `handle_static` and 7× over Agoo 2.15.14's
72
- 19,024 r/s** on the same workload. p99 stays sub-millisecond
73
- throughout. Plus durable foundation work and one big new feature:
74
-
75
- - **2.12-B — Fresh 4-way re-bench.** New
76
- [`docs/BENCH_HYPERION_2_11.md`](docs/BENCH_HYPERION_2_11.md) re-runs
77
- Hyperion / Puma / Falcon / Agoo on the 6 workloads with all 2.10/2.11
78
- wins enabled. Headline shifts: static 1 KB Hyperion `handle_static`
79
- flipped from 1.89× behind Agoo to **+127% ahead**; CPU JSON gap
80
- widened (the one row 2.10/2.11 didn't touch — flagged for follow-up).
81
- - **2.12-C — Connection lifecycle in C.** New
82
- `Hyperion::Http::PageCache.run_static_accept_loop` does
83
- `accept4` + `recv` + path lookup + `write` entirely in a C tight
84
- loop, returning to Ruby only on a route miss / TLS / h2 / WebSocket
85
- upgrade. GVL released across syscalls. Auto-engages when the listener
86
- is plain TCP and the route table contains only `StaticEntry`
87
- registrations. **5,502 → 15,685 r/s (+185%, 2.85×) on `handle_static`
88
- hello; p99 1.59 ms → 107 µs (15× tighter).** Falls through to the
89
- existing Ruby accept loop on miss with no regression.
90
- - **2.12-D — io_uring accept loop (Linux 5.x+).** A multishot accept +
91
- per-conn RECV/WRITE/CLOSE state machine on top of liburing. One
92
- `io_uring_enter` per N requests instead of N×3 syscalls. Opt-in via
93
- `HYPERION_IO_URING_ACCEPT=1` (default off until 2.13 production
94
- soak). **15,685 → 134,084 r/s (+755%, 8.6×) on the same bench.**
95
- Compiles out cleanly without liburing — the `accept4` path stays
96
- the fallback. macOS keeps using `accept4` (no liburing).
97
- - **2.12-E — SO_REUSEPORT cluster-mode audit.** New per-worker request
98
- metric (`requests_dispatch_total{worker_id="N"}`) ticks under every
99
- dispatch mode (Rack, `handle_static`, h2, the C accept loops). New
100
- audit harness `bench/cluster_distribution.sh` and a 4-worker, 30s
101
- sustained-load bench: under steady state the SO_REUSEPORT hash
102
- distributes within **1.004-1.011 max/min ratio** — production-grade,
103
- measured. The cold-start swing (1.16× during the first second of
104
- fresh boot) is documented as expected `SO_REUSEPORT + keep-alive`
105
- behavior and matches what production L4 LBs already exhibit.
106
- - **2.12-F — gRPC support on h2.** Trailers (the `grpc-status` /
107
- `grpc-message` final HEADERS frame), `TE: trailers` handling, h2
108
- request half-close semantics. Rack 3 contract: a Rack body that
109
- defines `#trailers` triggers the trailers wire shape automatically;
110
- bodies that don't are byte-identical to 2.11.x h2. Smoke test against
111
- the real `grpc` Ruby gem ships gated by `RUN_GRPC_SMOKE=1`; the
112
- durable coverage is 11 unit specs driving real `protocol-http2`
113
- framer + HPACK encode/decode + TLS.
114
-
115
- The 2.10-G TCP_NODELAY hunk, 2.10-E preload hooks, 2.10-F C-ext
116
- `rb_pc_serve_request`, 2.11-A dispatch pool warmup, and 2.11-B cglue
117
- HPACK default all preserved and verified by the 1143-spec suite.
118
-
119
- Full per-stream details, bench tables, and follow-up items in
120
- [`CHANGELOG.md`](CHANGELOG.md).
121
-
122
- ## What's new in 2.11.0
123
-
124
- **h2 cold-stream latency cut + native HPACK CGlue flipped to default.**
125
- Two perf wins on top of 2.10:
126
-
127
- - **2.11-A — h2 first-stream TLS handshake parallelization.** The
128
- 2.10-G `HYPERION_H2_TIMING=1` instrumentation, run against the
129
- TCP_NODELAY-fixed handler, isolated the residual cold-stream cost
130
- to **bucket 2**: lazy `task.async {}` fiber spawn for the first
131
- stream of every connection. Fix: pre-spawn a stream-dispatch fiber
132
- pool at connection accept (configurable via `HYPERION_H2_DISPATCH_POOL`,
133
- default 4, ceiling 16). h2load `-c 1 -m 1 -n 50` cold first-run:
134
- **time-to-1st-byte 20.28 → 9.28 ms (−54%); m=100 throughput +5.5%**.
135
- Warm steady-state unchanged (no head-of-line blocking under the small
136
- pool — backlog still spills to ad-hoc `task.async`).
137
- - **2.11-B — HPACK FFI marshalling round-2 (CGlue flipped to default).**
138
- Three-way bench (`bench/h2_rails_shape.sh` extended): `ruby` (1,585
139
- r/s) vs `native v2` (1,602 r/s, +1% — noise) vs `native v3 / CGlue`
140
- (**2,291 r/s, +43% over v2**). The +18-44% native-vs-Ruby headline
141
- was almost entirely Fiddle marshalling overhead, not the underlying
142
- Rust HPACK encoder — same encoder, no per-call FFI marshalling, +43%
143
- rps. Default flipped: unset `HYPERION_H2_NATIVE_HPACK` now selects
144
- CGlue. Three escape valves stay (`=v2` to force the old path, `=ruby`
145
- / `=off` for the pure-Ruby fallback) for any operator that needs
146
- them. Boot log gains a `native_mode` field documenting which path is
147
- actually live.
148
-
149
- Plus operator infrastructure: a host-OS portability fix in
150
- `H2Codec.candidate_paths` (a stale `.dylib` on Linux was silently
151
- falling through to pure-Ruby on the bench host); `bench/h2_rails_shape.sh`
152
- race-fixed (boot-log probe + stderr routing). Full bench tables and
153
- flip-decision rationale in [`CHANGELOG.md`](CHANGELOG.md).
154
-
155
- ## What's new in 2.10.1
156
-
157
- **Static-asset operator surface (2.10-E) + C-ext fast-path response
158
- writer (2.10-F).** Two follow-on streams to 2.10's static / direct-route
159
- work:
160
-
161
- - **2.10-E — Static asset preload + immutable flag.** Boot-time hook
162
- warms `Hyperion::Http::PageCache` over a tree of files and marks
163
- every cached entry immutable. Surface: `--preload-static <dir>` (and
164
- `--no-preload-static`) CLI flags, `preload_static "/path", immutable:
165
- true` config DSL key, and zero-config Rails auto-detect that pulls
166
- `Rails.configuration.assets.paths.first(8)` when present. Hyperion
167
- never `require`s Rails — purely defensive `defined?(::Rails)`
168
- probing keeps the generic Rack server path clean. **Operator value:
169
- predictable first-request latency** (the asset is in cache before
170
- the first request arrives) and the `recheck_seconds` mtime poll is
171
- skipped on immutable entries. Sustained-load throughput on the
172
- static-1-KB bench did *not* move (cold 1,929 r/s vs warm 1,886 r/s,
173
- inside trial noise) because `ResponseWriter` already auto-caches
174
- Rack::Files responses on the first hit; preload moves that one
175
- `cache_file` call from request 1 to boot. (Config sketch after this list.)
176
- - **2.10-F — C-ext fast-path response writer for prebuilt responses.**
177
- `Server.handle_static`-routed requests now serve from a single
178
- C function (`rb_pc_serve_request` in `ext/hyperion_http/page_cache.c`)
179
- that does route lookup → header build → `write()` syscall without
180
- re-entering Ruby on the response side. GVL is released across the
181
- `write()` so slow clients no longer block other Ruby work on the
182
- same VM. Automatic HEAD support (HTTP-mandated) lights up on every
183
- GET registered via `handle_static` — same buffer, body stripped.
184
- Bench (3-trial median, `wrk -t4 -c100 -d20s`): **5,768 r/s vs
185
- 2.10-D's 5,619 r/s (+2.6% — inside noise) and p99 1.93 → 1.67 ms
186
- (−14% — outside noise, reproducible).** The throughput needle didn't
187
- move because the per-connection lifecycle (accept4 + clone3 + futex
188
- on GVL handoff) dominates at 100 concurrent connections; 2.10-F
189
- shrinks the response phase, but the response phase isn't the
190
- bottleneck on this profile. Durable infrastructure for 2.11+ when
191
- the accept-loop work closes.
192
-
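- The 2.10-E surface in config terms, as a minimal sketch (the asset path is
- illustrative; the DSL key and CLI flag are the ones named above):
-
- ```ruby
- # config/hyperion.rb: warm Hyperion::Http::PageCache at boot, mark entries immutable
- preload_static "/srv/app/public/assets", immutable: true
- ```
-
- The CLI equivalent is `bundle exec hyperion --preload-static /srv/app/public/assets config.ru`.
-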
193
- Full per-stream details and bench tables in
194
- [`CHANGELOG.md`](CHANGELOG.md).
195
-
196
- ## What's new in 2.10.0
197
-
198
- **4-way bench harness, page cache, direct routes, and the h2 40 ms
199
- ceiling killed.** This sprint widens the comparison matrix to all four
200
- major Ruby web servers (Hyperion + Puma + Falcon + Agoo) and ships
201
- four substantive perf streams against that backdrop:
202
-
203
- - **2.10-A / 2.10-B — 4-way bench harness + honest baseline.**
204
- `bench/4way_compare.sh` runs the same 6 workloads (hello, static
205
- 1 KB / 1 MiB, CPU JSON, PG-bound, SSE) against all four servers from
206
- one script. Baseline numbers committed *before* any code changes:
207
- Agoo wins the static-asset and JSON columns by ~2-4×, Hyperion wins
208
- the static 1 MiB column by 9× and the SSE column by 3.6-17×.
209
- - **2.10-C — `Hyperion::Http::PageCache` (pre-built static response
210
- cache).** Open-addressed bucket table behind a pthread mutex
211
- (GVL-released for writes), engages automatically on `Rack::Files`
212
- responses. **Static 1 KB: 1,380 → 1,880 r/s (+36%), p99 3.7 → 2.7
213
- ms.** Closes the Agoo gap from −47% to −28% on that column.
214
- - **2.10-D — `Hyperion::Server.handle` direct route registration.**
215
- New API for hot Rack-bypass paths (`Server.handle '/health' do …
216
- end`, `Server.handle_static '/robots.txt', body: '...'`). Skips Rack
217
- adapter + env-build for matched routes. **`hello` via
218
- `handle_static`: 4,408 → 5,619 r/s (+27%), p99 1.93 ms** — the
219
- cleanest p99 in the 4-way matrix. (Usage sketch after this list.)
220
- - **2.10-G — h2 max-latency ceiling at ~40 ms: fixed.** Filed by 2.9-B
221
- as a "first-stream cost" hypothesis, the instrumentation revealed
222
- it was paid by *every* h2 stream — the canonical Linux delayed-ACK
223
- + Nagle interaction on small framer writes. One-line fix:
224
- TCP_NODELAY at accept time. **h2load `-c 1 -m 1 -n 200`: min
225
- 40.62 → 0.54 ms (−98.7%), throughput 24 → 1,142 r/s (47.6×).** The
226
- `HYPERION_H2_TIMING=1` instrumentation stays in place as durable
227
- diagnostic infrastructure.
228
-
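- What the 2.10-D direct routes look like in a rackup, as a minimal sketch built
- from the calls named above (the robots body and the `MyApp` constant are
- illustrative; the `Server.handle` block contract isn't spelled out here):
-
- ```ruby
- # Rack-bypass routes: matched requests skip the Rack adapter + env build entirely
- Hyperion::Server.handle_static '/robots.txt', body: "User-agent: *\nDisallow:\n"
-
- # Dynamic hot paths register a block: Server.handle '/health' do ... end
-
- run MyApp   # everything else takes the normal Rack path
- ```
-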
229
- Full per-stream details, bench numbers, and follow-up items live in
230
- [`CHANGELOG.md`](CHANGELOG.md).
231
-
232
- ## What's new in 2.5.0
233
-
234
- **Native HPACK ON by default + autobahn 100% conformance + request
235
- hooks.** The Rust HPACK encoder (added in 2.0.0, opt-in until 2.4.x)
236
- flips ON by default in 2.5.0 — verified **+18% rps on Rails-shape h2
237
- workloads** (25-header responses, the bench harness lives at
238
- `bench/h2_rails_shape.ru` + `bench/h2_rails_shape.sh`). RFC 6455
239
- WebSocket conformance hit **463/463 autobahn-testsuite cases passing**
240
- (2.5-A, host openclaw-vm). Request lifecycle hooks
241
- (`Runtime#on_request_start` / `on_request_end`) shipped in 2.5-C —
242
- recipes in [`docs/OBSERVABILITY.md`](docs/OBSERVABILITY.md).
243
-
244
- ## What's new in 2.4.0
245
-
246
- **Production observability.** The `/-/metrics` endpoint now exposes
247
- per-route latency histograms, per-conn fairness rejections, WebSocket
248
- permessage-deflate compression ratio, kTLS active connections,
249
- io_uring-active workers, and ThreadPool queue depth — operators can
250
- finally see whether the 2.x knobs are firing and how effective they
251
- are. A pre-built Grafana dashboard ships at
252
- [`docs/grafana/hyperion-2.4-dashboard.json`](docs/grafana/hyperion-2.4-dashboard.json).
253
- Full metric reference + operator playbook in
254
- [`docs/OBSERVABILITY.md`](docs/OBSERVABILITY.md).
255
-
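- Scraping it is one request. The auth header is only needed when `--admin-token`
- / `--admin-token-file` is set; the standard `Authorization: Bearer` shape and
- the `$ADMIN_TOKEN` placeholder below are assumptions:
-
- ```sh
- curl -H "Authorization: Bearer $ADMIN_TOKEN" http://127.0.0.1:9292/-/metrics
- ```
-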
256
- ## What's new in 2.1.0
257
-
258
- **WebSockets.** RFC 6455 over Rack 3 full hijack, native frame codec,
259
- per-connection wrapper with auto-pong / close handshake / UTF-8 validation /
260
- per-message size cap. **ActionCable on Hyperion is now a single-binary
261
- deployment** — one `hyperion -w 4 -t 10 config.ru` process serves HTTP,
262
- HTTP/2, TLS, **and** `/cable` from the same listener; no separate cable
263
- container required. HTTP/1.1 only this release; WS-over-HTTP/2 (RFC 8441
264
- Extended CONNECT) and permessage-deflate (RFC 7692) defer to 2.2.x.
265
- See [`docs/WEBSOCKETS.md`](docs/WEBSOCKETS.md).
266
-
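- The single-binary deploy shape as a command line (a sketch; port, worker and
- thread counts are illustrative, and the TLS flags are the ones documented in
- the CLI reference further down):
-
- ```sh
- # one process: HTTP/1.1 + HTTP/2 (ALPN) + TLS + /cable WebSockets on the same listener
- bundle exec hyperion -w 4 -t 10 --tls-cert cert.pem --tls-key key.pem -p 9443 config.ru
- ```
-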
267
- ## gRPC on Hyperion (2.12-F+)
268
-
269
- Hyperion's HTTP/2 path supports gRPC unary calls via the Rack 3 trailers
270
- contract: any response body that exposes `:trailers` gets a final
271
- HEADERS frame (with END_STREAM=1) carrying the trailer map after the
272
- DATA frames. That's the wire shape gRPC clients expect for the
273
- `grpc-status` / `grpc-message` map.
274
-
275
- A minimal Rack-shaped gRPC handler:
276
-
277
- ```ruby
278
- class GrpcBody
279
- def initialize(reply); @reply = reply; end
280
- def each; yield @reply; end
281
- def trailers; { 'grpc-status' => '0', 'grpc-message' => 'OK' }; end
282
- def close; end
283
- end
284
-
285
- run ->(env) {
286
- request = env['rack.input'].read # gRPC-framed protobuf bytes
287
- reply = handle(request) # your service implementation
288
- [200, { 'content-type' => 'application/grpc' }, GrpcBody.new(reply)]
289
- }
290
- ```
291
-
292
- What Hyperion handles for you: ALPN negotiation, HTTP/2 framing, HPACK,
293
- per-stream flow control, the trailer-frame emit, binary-clean
294
- `env['rack.input']` (gRPC bodies are non-UTF-8), and `te: trailers`
295
- preserved into `env['HTTP_TE']`. What you handle: protobuf
296
- marshalling and the `grpc-status` semantics.
297
-
298
- ### Streaming RPCs (2.13-D+)
299
-
300
- All four gRPC call shapes work on Hyperion since 2.13-D — unary,
301
- server-streaming, client-streaming, and bidirectional. The detection
302
- trigger is the gRPC content-type plus `te: trailers`; any HTTP/2
303
- request that carries both is dispatched to the Rack app on HEADERS
304
- arrival (rather than after END_STREAM), and `env['rack.input']`
305
- becomes a streaming IO that blocks reads until the next DATA frame
306
- lands. Plain HTTP/2 traffic (without those headers) keeps the unary
307
- buffered semantics — no behaviour change for non-gRPC clients.
308
-
309
- **Server-streaming.** Yield one gRPC-framed message per `each`
310
- iteration; Hyperion writes each yield as its own DATA frame:
311
-
312
- ```ruby
313
- class StreamReply
314
- def initialize(messages); @messages = messages; end
315
- def each; @messages.each { |m| yield m }; end # one DATA frame each
316
- def trailers; { 'grpc-status' => '0' }; end
317
- def close; end
318
- end
319
-
320
- run ->(env) {
321
- env['rack.input'].read # the unary request message
322
- [200, { 'content-type' => 'application/grpc' }, StreamReply.new(messages)]
323
- }
324
- ```
325
-
326
- **Client-streaming.** Read messages off `env['rack.input']` as the peer
327
- sends them. Reads block until a DATA frame arrives:
328
-
329
- ```ruby
330
- run ->(env) {
331
-   io = env['rack.input']
332
-   count = 0
333
-   while (prefix = io.read(5)) && prefix.bytesize == 5
334
-     length = prefix.byteslice(1, 4).unpack1('N')
335
-     msg = io.read(length)
336
-     count += 1
337
-   end
338
-   [200, { 'content-type' => 'application/grpc' }, GrpcBody.new(reply_for(count))]
339
- }
340
- ```
341
-
342
- **Bidirectional.** Interleave reads and writes. The response body is
343
- iterated lazily, so you can read one request message, yield one reply,
344
- read the next, yield the next:
345
-
346
- ```ruby
347
- class BidiReplies
347
-   def initialize(io); @io = io; end
348
-   def each
349
-     while (prefix = @io.read(5)) && prefix.bytesize == 5
350
-       len = prefix.byteslice(1, 4).unpack1('N')
351
-       msg = @io.read(len)
352
-       yield grpc_frame(handle(msg)) # sent immediately
353
-     end
354
-   end
355
-   def trailers; { 'grpc-status' => '0' }; end
356
-   def close; end
357
- end
358
-
359
- run ->(env) {
360
-   [200, { 'content-type' => 'application/grpc' }, BidiReplies.new(env['rack.input'])]
361
- }
363
- ```
364
-
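- `grpc_frame` and `handle` in the sketch above are your code, not Hyperion API.
- A minimal framing helper, assuming uncompressed messages, looks like:
-
- ```ruby
- # gRPC length-prefix framing: 1-byte compressed flag + 4-byte big-endian length + payload
- def grpc_frame(message_bytes)
-   [0, message_bytes.bytesize].pack('CN') + message_bytes
- end
- ```
-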
365
- The streaming-input path runs each stream on its own fiber, so
366
- concurrent read+write on the same stream is safe.
367
-
368
- ## Highlights
369
-
370
- - **HTTP/1.1 + HTTP/2 + TLS** out of the box (HTTP/2 with per-stream fiber multiplexing, WINDOW_UPDATE-aware flow control, ALPN auto-negotiation).
371
- - **WebSockets (RFC 6455)** — full handshake, native frame codec, per-connection wrapper. ActionCable + faye-websocket work on a single-binary deploy. See [`docs/WEBSOCKETS.md`](docs/WEBSOCKETS.md). (2.1.0+, HTTP/1.1 only.)
372
- - **Pre-fork cluster** with per-OS worker model: `SO_REUSEPORT` on Linux, master-bind + worker-fd-share on macOS/BSD (Darwin's `SO_REUSEPORT` doesn't load-balance).
373
- - **Hybrid concurrency**: fiber-per-connection for I/O, OS-thread pool for `app.call(env)` — synchronous Rack handlers (Rails, ActiveRecord, anything holding a global mutex) get true OS-thread concurrency.
374
- - **Vendored llhttp 9.3.0** C parser; pure-Ruby fallback for non-MRI runtimes.
375
- - **Default-ON structured access logs** (one JSON or text line per request) with hot-path optimisations: per-thread cached timestamp, hand-rolled line builder, lock-free per-thread write buffer.
376
- - **12-factor logger split**: info/debug → stdout, warn/error/fatal → stderr.
377
- - **Ruby DSL config file** (`config/hyperion.rb`) with lifecycle hooks (`before_fork`, `on_worker_boot`, `on_worker_shutdown`).
378
- - **Object pooling** for the Rack `env` hash and `rack.input` IO — amortizes per-request allocations across the worker's lifetime.
379
- - **`Hyperion::FiberLocal`** opt-in shim for older Rails idioms that store request-scoped data via `Thread.current.thread_variable_*`.
380
-
381
- ## Benchmarks
382
-
383
- All numbers are real wrk runs against published Hyperion configs. Hyperion ships **with default-ON structured access logs**; Puma comparisons use Puma defaults (no per-request log emission). Each section is stamped with the Hyperion version + bench host it was measured against — bench-host drift over time is real (see "Bench-host drift" note below).
384
-
385
- **Headline doc**: the most recent comprehensive sweep is
386
- [`docs/BENCH_HYPERION_2_11.md`](docs/BENCH_HYPERION_2_11.md) — the
387
- 2.12-B 4-way re-bench (Hyperion 2.11.0 vs Puma 8.0.1 / Falcon 0.55.3 /
388
- Agoo 2.15.14, 16-vCPU Ubuntu 24.04, 6 workloads). It's the post-
389
- 2.10/2.11-wins re-baseline of the four-server matrix that originally
390
- shipped in [`docs/BENCH_HYPERION_2_0.md`](docs/BENCH_HYPERION_2_0.md)
391
- § "4-way head-to-head (2.10-B baseline)" — the older doc is the
392
- **historical baseline (pre-2.10/2.11 wins)** and is preserved
393
- unchanged for archaeology. The 1.6.0 matrix at
394
- [`docs/BENCH_2026_04_27.md`](docs/BENCH_2026_04_27.md) covers 9
395
- workloads × 25+ configs against hyperion-async-pg 0.5.0; all three
396
- docs include caveats and per-row reproduction commands.
397
-
398
- > **Bench-host drift note (2026-05-01).** A spot-check rerun on
399
- > `openclaw-vm` 5 days after the 2.0.0 sweep showed Puma 8.0.1 and
400
- > Hyperion 2.0.0 baseline numbers had drifted 14-32% downward from the
401
- > 2026-04-29 sweep with no code changes — the bench host runs other
402
- > workloads in the background and is a single VM (KVM CPU). Numbers in
403
- > this README and BENCH docs are snapshots; expect ±10-30% absolute
404
- > drift between sweep dates. **The relative position (Hyperion vs Puma
405
- > at matched config) is the durable signal**; e.g. Hyperion `-w 16 -t 5`
406
- > hello-world today is 76,593 r/s vs Puma 8.0.1 `-w 16 -t 5:5` at 55,609
407
- > r/s, **+37.7% over Puma** — wider than the 2.0.0 sweep's +27.8% even
408
- > though absolute rps is lower. Reproduce: `bundle exec bin/hyperion
409
- > -p 9501 -w 16 -t 5 bench/hello.ru` then `wrk -t4 -c200 -d20s
410
- > http://127.0.0.1:9501/`.
411
-
412
- > **Topology relevance.** Hyperion is built to run **fronted by nginx
413
- > or an L7 load balancer** in most production deployments — plaintext
414
- > HTTP/1.1 upstream, TLS terminated at the LB. The benches in this
415
- > README that match that topology are: hello-world, CPU JSON, static,
416
- > SSE, PG, WebSocket. Benches that are **bench-only for nginx-fronted
417
- > ops** (the LB → upstream hop is plaintext h1 regardless): TLS h1,
418
- > HTTP/2, kTLS_TX. Those rows still ship for operators who terminate
419
- > TLS / h2 at Hyperion directly (small static fleets, edge boxes), but
420
- > don't chase the +60% TLS-h1 win unless you actually terminate TLS at
421
- > Hyperion.
422
-
423
- ### Hello-world Rack app
424
-
425
- `bench/hello.ru`, single worker, parity threads (`-t 5` vs Puma `-t 5:5`), 4 wrk threads / 100 connections / 15s, macOS arm64 / Ruby 3.3.3, Hyperion 1.2.0. **macOS dev numbers; the headline Linux 2.0.0 bench is in [`docs/BENCH_HYPERION_2_0.md`](docs/BENCH_HYPERION_2_0.md)**:
426
-
427
- | | r/s | p99 | tail vs Hyperion |
428
- |---|---:|---:|---:|
429
- | **Hyperion 1.2.0** (default, logs ON) | **22,496** | **502 µs** | **1×** |
430
- | Falcon 0.55.3 `--count 1` | 22,199 | 5.36 ms | 11× worse |
431
- | Puma 7.1.0 `-t 5:5` | 20,400 | 422.85 ms | 845× worse |
432
-
433
- **Hyperion: 1.10× Puma throughput, parity with Falcon on throughput, ~10× lower p99 than Falcon and ~845× lower than Puma — while emitting structured JSON access logs the others don't.**
434
-
435
- ### Production cluster config (`-w 4`)
436
-
437
- Same bench app, `-w 4` cluster, parity threads (`-t 5` everywhere), 4 wrk threads / 200 connections / 15s, macOS arm64:
438
-
439
- | | r/s | p99 | tail vs Hyperion |
440
- |---|---:|---:|---:|
441
- | Falcon `--count 4` | 48,197 | 4.84 ms | 5.9× worse |
442
- | **Hyperion `-w 4 -t 5`** | **40,137** | **825 µs** | **1×** |
443
- | Puma `-w 4 -t 5:5` | 34,793 | 177.76 ms | 215× worse (1 timeout) |
444
-
445
- Falcon edges Hyperion ~20% on raw rps at `-w 4` on macOS hello-world. **Hyperion still leads on tail latency by 5.9× over Falcon and 215× over Puma**, and beats Puma on throughput by 1.15×. On Linux production-config and DB-backed workloads (below) Hyperion takes the rps lead too — the macOS hello-world advantage to Falcon disappears once the workload includes any actual work or the kernel is Linux.
446
-
447
- ### Linux production-config (DB-backed Rack)
448
-
449
- `-w 4 -t 10` on Ubuntu 24.04 / Ruby 3.3.3. Rack app does one Postgres `SELECT 1` + one Redis `GET` per request, real network round-trip. wrk `-t4 -c50 -d10s` × 3 runs (median):
450
-
451
- | | r/s (median) | vs Puma default |
452
- |---|---:|---:|
453
- | **Hyperion default (rc17, logs ON)** | **5,786** | **1.012×** |
454
- | Hyperion `--no-log-requests` | 6,364 | 1.114× |
455
- | Puma `-w 4 -t 10:10` (no per-req logs) | 5,715 | 1.000× |
456
-
457
- Bench is **wait-bound** — ~3-4 ms median is the PG + Redis round-trip, dwarfing the per-request CPU work where Hyperion's optimisations live. With a synchronous `pg` driver, fibers don't help: every in-flight DB call still parks an OS thread, and both servers max out at `workers × threads` concurrent queries. To widen this gap requires either an async PG driver — see [hyperion-async-pg](https://github.com/andrew-woblavobla/hyperion-async-pg) (companion gem; pair with `--async-io` and a fiber-aware pool, see "Async I/O — fiber concurrency on PG-bound apps" below) — or a CPU-bound workload, where Hyperion's lead becomes visible (next section).
458
-
459
- ### Async I/O — fiber concurrency on PG-bound apps
460
-
461
- Ubuntu 24.04 / 16 vCPU / Ruby 3.3.3, Postgres 17 over WAN, `wrk -t4 -c200 -d20s`. Single worker (`-w 1`) unless noted. All configs returned 0 non-2xx and 0 timeouts. RSS sampled mid-run via `ps -o rss`.
462
-
463
- **Wait-bound workload** (`pg_concurrent.ru`: `SELECT pg_sleep(0.05)` + tiny JSON; rackup lives in the [hyperion-async-pg companion repo](https://github.com/andrew-woblavobla/hyperion-async-pg) and on the bench host at `~/bench/pg_concurrent.ru`, not in this repo):
464
-
465
- | | r/s | p99 | RSS | vs Puma `-t 5` |
466
- |---|---:|---:|---:|---:|
467
- | Puma 8.0 `-t 5` pool=5 | 56.5 | 3.88 s | 87 MB | 1.0× |
468
- | Puma 8.0 `-t 30` pool=30 | 402.1 | 880 ms | 99 MB | 7.1× |
469
- | Puma 8.0 `-t 100` pool=100 | 1067.4 | 557 ms | 121 MB | 18.9× |
470
- | **Hyperion `--async-io -t 5`** pool=32 | 400.4 | 878 ms | 123 MB | 7.1× |
471
- | **Hyperion `--async-io -t 5`** pool=64 | 778.9 | 638 ms | 133 MB | 13.8× |
472
- | **Hyperion `--async-io -t 5`** pool=128 | 1344.2 | 536 ms | 148 MB | 23.8× |
473
- | **Hyperion `--async-io -t 5` pool=200** | **2381.4** | **471 ms** | **164 MB** | **42.2×** |
474
- | Hyperion `--async-io -w 4 -t 5` pool=64 | 1937.5 | 4.84 s | 416 MB | 34.3× (cold-start p99 — see note) |
475
- | Falcon 0.55.3 `--count 1` pool=128 | 1665.7 | 516 ms | 141 MB | 29.5× |
476
-
477
- **Mixed CPU+wait** (`pg_mixed.ru`: same query + 50-key JSON serialization, ~5 ms CPU; rackup lives in hyperion-async-pg + on the bench host at `~/bench/pg_mixed.ru`, not in this repo):
478
-
479
- | | r/s | p99 | RSS | vs Puma `-t 30` |
480
- |---|---:|---:|---:|---:|
481
- | Puma 8.0 `-t 30` pool=30 | 351.7 | 963 ms | 127 MB | 1.0× |
482
- | Hyperion `--async-io -t 5` pool=32 | 371.2 | 919 ms | 151 MB | 1.05× |
483
- | Hyperion `--async-io -t 5` pool=64 | 741.5 | 681 ms | 161 MB | 2.1× |
484
- | **Hyperion `--async-io -t 5` pool=128** | **1739.9** | **512 ms** | **201 MB** | **4.9×** |
485
- | Falcon `--count 1` pool=128 | 1642.1 | 531 ms | 213 MB | 4.7× |
486
-
487
- **Takeaways:**
488
- 1. **Linear scaling with pool size** under `--async-io` — `r/s ≈ pool × 12` on this WAN bench. Single-worker pool=200 hits 2381 r/s. The "**42× Puma `-t 5`**" and "**5.9× Puma's best**" framings above use Puma's pool=5 (timeout-floor) and pool=30 (mid-tier) rows respectively — fair comparisons on the *same* bench fixture, but a Puma operator who sizes their pool to match (`-t 100 pool=100` row above) lands at 1,067 r/s, so the **honest "Puma at its own best vs Hyperion at its own best" ratio is 2,381 / 1,067 ≈ 2.2×**, not 42×. The architectural win — fiber-pool grows to pool=200 without OS-thread cost — is real; the 42× headline is a configuration-difference effect, not a steady-state gap on matched configs.
489
- 2. **Mixed workload doesn't kill the win** — Hyperion `--async-io` pool=128 actually goes *up* on mixed (1740 vs 1344 r/s) because CPU work overlaps other fibers' PG-wait windows. This is the honest "what happens to a real Rails handler" answer.
490
- 3. **Hyperion ≈ Falcon within 3-7%** across pool sizes; both fiber-native architectures extract similar value from `hyperion-async-pg`.
491
- 4. **RSS at single-worker scale isn't the architectural moat** — Linux thread stacks are demand-paged; PG connection buffers dominate RSS at pool sizes ≤ 200. The architectural win is **handler concurrency under load**, not idle memory: Hyperion's fiber path runs thousands of in-flight handler invocations per OS thread, so wait-bound handlers don't queue at `max_threads`. See [Concurrency at scale](#concurrency-at-scale-architectural-advantages) for both the throughput-under-load row and a measured 10k-idle-keepalive RSS sweep against Puma and Falcon.
492
- 5. **`-w 4` cold-start caveat** — multi-worker p99 inflates because the bench rackup uses lazy per-process pool init (each worker pays full pool fill on its first request). Production apps avoid this with `on_worker_boot { Hyperion::AsyncPg::FiberPool.new(...).fill }`.
493
- 6. **Apples-to-apples PG note**: the row above uses `pg.wobla.space` WAN PG with `max_connections=500`. Earlier sweeps that compared Hyperion (WAN, max_conn=500) against Puma (local, max_conn=100) overstated the ratio because the Puma side timed out at the local pool ceiling. The 2.0.0 bench doc carries this caveat in the row 7 verification section; treat any "Hyperion 4× Puma on PG" headline as **indicative**, not precisely calibrated, until rerun against matched-pool PG.
494
-
495
- Three things must all be true to get this win:
496
- 1. **`async_io: true`** in your Hyperion config (or `--async-io` CLI flag). Default is off to keep 1.2.0's raw-loop perf for fiber-unaware apps.
497
- 2. **`hyperion-async-pg`** installed: `gem 'hyperion-async-pg', require: 'hyperion/async_pg'` + `Hyperion::AsyncPg.install!` at boot.
498
- 3. **Fiber-aware connection pool.** The popular `connection_pool` gem is NOT — its Mutex blocks the OS thread. Use `Hyperion::AsyncPg::FiberPool` (ships with hyperion-async-pg 0.3.0+), [`async-pool`](https://github.com/socketry/async-pool), or `Async::Semaphore`.
499
-
500
- Skip any of these and you get parity with Puma at the same `-t`. Run the bench yourself: `MODE=async DATABASE_URL=... PG_POOL_SIZE=200 bundle exec hyperion --async-io -t 5 bench/pg_concurrent.ru` (in the [hyperion-async-pg](https://github.com/andrew-woblavobla/hyperion-async-pg) repo).
501
-
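- Wired together, the three requirements above look like this (a sketch; the pool
- choice and where you call `install!` will vary by app):
-
- ```ruby
- # Gemfile
- gem 'hyperion-async-pg', require: 'hyperion/async_pg'
-
- # config/hyperion.rb
- async_io true                  # requirement 1: fiber dispatch on plain HTTP/1.1
-
- # boot (e.g. config.ru or an initializer)
- Hyperion::AsyncPg.install!     # requirement 2: the async PG hookup
- # requirement 3: a fiber-aware pool (Hyperion::AsyncPg::FiberPool or async-pool,
- #   NOT the connection_pool gem; its Mutex blocks the OS thread)
- ```
-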
502
- > **TLS + async-pg note (1.4.0+).** TLS / HTTPS already runs each connection on a fiber under `Async::Scheduler` (the TLS path always uses `start_async_loop` for the ALPN handshake). **As of 1.4.0, the post-handshake `app.call` for HTTP/1.1-over-TLS dispatches inline on the calling fiber by default** — so fiber-cooperative libraries (`hyperion-async-pg`, `async-redis`) work on the TLS h1 path without needing `--async-io`. The Async-loop cost is already paid for the handshake; running the handler under the existing scheduler just preserves that context instead of stripping it on a thread-pool hop. h2 streams are always fiber-dispatched and benefit from async-pg without the flag.
503
- >
504
- > Operators who specifically want **TLS + threadpool dispatch** (e.g. CPU-heavy handlers competing for OS threads, where you'd rather not pay fiber yields and want true OS-thread parallelism on a synchronous handler) can pass `async_io: false` in the config to force the pool branch back on. The three-way `async_io` setting:
505
- > - `nil` (default): plain HTTP/1.1 → pool, TLS h1 → inline.
506
- > - `true`: plain HTTP/1.1 → inline, TLS h1 → inline (force fiber dispatch everywhere; needed for `hyperion-async-pg` on plain HTTP).
507
- > - `false`: plain HTTP/1.1 → pool, TLS h1 → pool (explicit opt-out for TLS+threadpool).
508
-
509
- ### CPU-bound JSON workload
510
-
511
- `bench/work.ru` — handler builds a 50-key fixture, JSON-encodes a fresh response per request (~8 KB body), processes a 6-cookie header chain. wrk `-t4 -c200 -d15s`, macOS arm64 / Ruby 3.3.3, 1.2.0:
512
-
513
- | | r/s | p99 | tail vs Hyperion |
514
- |---|---:|---:|---:|
515
- | Falcon `--count 4` | 46,166 | 20.17 ms | 24× worse |
516
- | **Hyperion `-w 4 -t 5`** | **43,924** | **824 µs** | **1×** |
517
- | Puma `-w 4 -t 5:5` | 36,383 | 166.30 ms (47 socket errors) | 200× worse |
518
-
519
- **1.21× Puma throughput, 200× lower p99.** This is the gap that hides behind PG-round-trip noise on the DB bench. Hyperion's per-request CPU savings (lock-free per-thread metrics, frozen header keys in the Rack adapter, C-ext response head builder, cached iso8601 timestamps, cached HTTP Date header) land on the wire when the workload is CPU-bound. Falcon edges us 5% on raw r/s but with 24× worse tail — a different tradeoff curve. Reproduce: `bundle exec bin/hyperion -w 4 -t 5 -p 9292 bench/work.ru`.
520
-
521
- ### Real Rails 8.1 app (single worker, parity threads `-t 16`)
522
-
523
- Health endpoint that traverses the full middleware chain (rack-attack, locale redirect, structured tagger, geo-location, etc.). Plus a Grape API endpoint reading cached data, and a Rails controller doing a Redis GET + an ActiveRecord query.
524
-
525
- | endpoint | server | r/s | p99 | wrk timeouts |
526
- |---|---|---:|---:|---:|
527
- | `/up` (health) | **Hyperion** | **19.03** | **1.12 s** | **0** |
528
- | `/up` (health) | Puma `-t 16:16` | 16.64 | 1.95 s | **138** |
529
- | Grape `/api/v1/cached_data` | **Hyperion** | **16.15** | **779 ms** | 16 |
530
- | Grape `/api/v1/cached_data` | Puma `-t 16:16` | 10.90 | (>2 s, censored) | **110** |
531
- | Rails `/api/v1/health` | **Hyperion** | **15.95** | **992 ms** | 16 |
532
- | Rails `/api/v1/health` | Puma `-t 16:16` | 11.29 | (>2 s, censored) | **114** |
533
-
534
- On Grape and Rails-controller workloads Puma hits wrk's 2 s timeout cap on ~⅔ of requests — its real p99 is censored above 2 s. Hyperion serves all of its requests under 1.2 s with 0 to 16 timeouts. **1.14–1.48× Puma throughput** depending on endpoint.
535
-
536
- ### Static-asset serving (sendfile zero-copy path, 1.2.0+)
537
-
538
- `bench/static.ru` (`Rack::Files` over a 1 MiB asset), `-w 1`, `wrk -t4 -c100 -d15s`, macOS arm64 / Ruby 3.3.3:
539
-
540
- | | r/s | p99 | transferred | tail vs winner |
541
- |---|---:|---:|---:|---:|
542
- | **Hyperion (sendfile path)** | **2,069** | **3.10 ms** | 30.4 GB | **1×** |
543
- | Puma `-w 1 -t 5:5` | 2,109 | 566.16 ms | 31.0 GB | 183× worse |
544
- | Falcon `--count 1` | 1,269 | 801.01 ms | 18.7 GB | 258× worse (28 timeouts) |
545
-
546
- Throughput is bandwidth-bound on localhost (≈2 GB/s = the loopback memory ceiling), so the throughput column looks like parity. The actual win is in the **tail latency** column: Hyperion's `IO.copy_stream` → `sendfile(2)` path skips userspace entirely, while Puma allocates a String per response and Falcon serializes more aggressively. On real network paths sendfile widens the gap further (kernel-to-NIC zero-copy).
547
-
548
- Reproduce:
549
- ```sh
550
- ruby -e 'File.binwrite("/tmp/hyperion_bench_asset_1m.bin", "x" * (1024*1024))'
551
- bundle exec bin/hyperion -p 9292 bench/static.ru
552
- wrk --latency -t4 -c100 -d15s http://127.0.0.1:9292/hyperion_bench_asset_1m.bin
553
- ```
554
-
555
- ### Concurrency at scale (architectural advantages)
556
-
557
- These workloads demonstrate structural differences between Hyperion's fiber-per-connection / fiber-per-stream model and Puma's thread-pool model. Numbers are illustrative; the architecture is what matters. Run on Ubuntu 24.04 / Ruby 3.3.3, single worker, h2load `-c <conns> -n 100000 --rps 1000 --h1`.
558
-
559
- **5,000 concurrent keep-alive connections (50,000 requests):**
560
-
561
- | | succeeded | r/s | wall | master RSS |
562
- |---|---:|---:|---:|---:|
563
- | Hyperion `-w 1 -t 10` | 50,000 / 50,000 | 3,460 | 14.45 s | 53.5 MB |
564
- | Puma `-w 1 -t 10:10` | 50,000 / 50,000 | 1,762 | 28.37 s | 36.9 MB |
565
-
566
- **10,000 concurrent keep-alive connections (100,000 requests):**
567
-
568
- | | succeeded | failed | r/s | wall |
569
- |---|---:|---:|---:|---:|
570
- | Hyperion `-w 1 -t 10` | 93,090 | 6,910 | 3,446 | 27.01 s |
571
- | Puma `-w 1 -t 10:10` | 77,340 | 22,660 | 706 | 109.59 s |
572
-
573
- At 10k concurrent connections under load Hyperion serves **~5× the throughput** of Puma and completes **~20% more requests** (93,090 vs 77,340). The per-connection bookkeeping cost is bounded by fiber size, not by `max_threads` — workers don't get pinned to long-lived sockets, so a slow handler doesn't starve other connections.
574
-
575
- **Memory at idle keep-alive scale — 10,000 idle HTTP/1.1 keep-alive connections:**
576
-
577
- Each client opens a TCP connection, sends one keep-alive GET, drains the response, then holds the socket open without sending a follow-up request. RSS is sampled once a second across a 30s idle hold. Same hello-world rackup, single worker, no TLS. Hyperion runs with `async_io true` (fiber-per-connection on the plain HTTP/1.1 path).
578
-
579
- | | held | dropped | peak RSS | RSS after drain |
580
- |---|---:|---:|---:|---:|
581
- | Hyperion `-w 1 -t 5 --async-io` | 10,000 / 10,000 | 0 | 173 MB | 155 MB |
582
- | Puma `-w 0 -t 100` | 10,000 / 10,000 | 0 | 101 MB | 104 MB |
583
- | Falcon `--count 1` | 10,000 / 10,000 | 0 | 429 MB | 440 MB |
584
-
585
- All three hold 10k idle conns without OOMing or dropping — the "MB-per-thread" intuition that thread-based servers can't reach this scale doesn't survive contact with Linux's demand-paged thread stacks plus Puma's reactor-based keep-alive handling. Per-conn RSS lands at ~14 KB (Hyperion fiber + parser state), ~7 KB (Puma reactor entry + tiny thread share), ~36 KB (Falcon Async::Task + protocol-http stack). Bounded, not unbounded — for all three.
586
-
587
- The architectural difference shows up under **load**, not at idle: Puma can only run `max_threads` handler invocations concurrently, so wait-bound handlers (DB, HTTP, Redis) starve at higher request concurrency than `max_threads`. Hyperion's fiber-per-connection model + `--async-io` gives one OS thread thousands of in-flight handler executions, paired with [hyperion-async-pg](https://github.com/exodusgaming-io/hyperion-async-pg) for non-blocking DB. The 10k-conn throughput row above (5× Puma) is the consequence — same idle RSS shape, very different behaviour once the handlers actually do work.
588
-
589
- **HTTP/2 multiplexing — 1 connection × 100 concurrent streams (handler sleeps 50 ms):**
590
-
591
- | | wall time |
592
- |---|---:|
593
- | Hyperion (per-stream fiber dispatch) | **1.04 s** |
594
- | Serial baseline (100 × 50 ms) | 5.00 s |
595
-
596
- Hyperion fans 100 in-flight streams across separate fibers within a single TCP connection. A serial server would take 5 s; the fiber-multiplexed result (1.04 s, ~96 req/s on one socket) is bounded by single-handler sleep time plus framing overhead. Puma has no native HTTP/2 path — production deployments terminate h2 at nginx and forward h1 to the worker pool, which serializes again.
597
-
598
- > **1.6.0 outbound write path** — `Http2Handler` no longer serializes every framer write through one `Mutex#synchronize { socket.write(...) }`. HPACK encoding (microseconds, in-memory) still serializes on a fast encode mutex, but the actual `socket.write` is owned by a dedicated per-connection writer fiber draining a queue. On per-connection multi-stream workloads where the kernel send buffer or peer reads are slow, encode work for ready streams overlaps the writer's flush of earlier chunks, instead of stacking up behind it. See `bench/h2_streams.sh` (`h2load -c 1 -m 100 -n 5000`) for a recipe to compare 1.5.0 vs 1.6.0 on a workload of your choice.
599
-
600
- ### Reproducing the benchmarks
601
-
602
- Every number in this README and `docs/BENCH_HYPERION_2_0.md` is reproducible. Operators who don't trust headline numbers (and you shouldn't trust *any* benchmark numbers without independent verification) can rerun the workloads on their own host and get their own honest measurements. Per-row reproduction commands:
603
-
604
- ```sh
605
- # Setup (once)
606
- bundle install
607
- bundle exec rake compile
608
-
609
- # Hello-world (rps + p99 ceiling, no I/O)
610
- bundle exec bin/hyperion -p 9292 -w 16 -t 5 bench/hello.ru &
611
- wrk -t4 -c200 -d20s --latency http://127.0.0.1:9292/
612
-
613
- # CPU-bound JSON (per-request CPU savings visible)
614
- bundle exec bin/hyperion -p 9292 -w 4 -t 5 bench/work.ru &
615
- wrk -t4 -c200 -d15s --latency http://127.0.0.1:9292/
616
-
617
- # Static 1 MiB sendfile path
618
- ruby -e 'File.binwrite("/tmp/hyperion_bench_asset_1m.bin", "x" * (1024*1024))'
619
- bundle exec bin/hyperion -p 9292 -w 1 -t 5 bench/static.ru &
620
- wrk -t4 -c100 -d15s --latency http://127.0.0.1:9292/hyperion_bench_asset_1m.bin
621
-
622
- # SSE streaming (Hyperion-shaped rackup with explicit flush sentinel — see caveat in BENCH doc)
623
- bundle exec bin/hyperion -p 9292 -w 1 -t 5 bench/sse.ru &
624
- wrk -t1 -c1 -d10s http://127.0.0.1:9292/
625
-
626
- # WebSocket multi-process throughput
627
- bundle exec bin/hyperion -p 9888 -w 4 -t 64 bench/ws_echo.ru &
628
- ruby bench/ws_bench_client_multi.rb --port 9888 --procs 4 --conns 200 --msgs 1000 --bytes 1024 --json
629
-
630
- # h2 native HPACK (Rails-shape, 25-header response)
631
- ./bench/h2_rails_shape.sh
632
-
633
- # Idle keep-alive RSS sweep (1k / 5k / 10k conns, 30s hold per server)
634
- ./bench/keepalive_memory.sh
635
-
636
- # Hello-world quick comparator (Hyperion vs Puma vs Falcon)
637
- bundle exec ruby bench/compare.rb
638
- HYPERION_WORKERS=4 PUMA_WORKERS=4 FALCON_COUNT=4 bundle exec ruby bench/compare.rb
639
- ```
640
-
641
- PG benches (`pg_concurrent.ru`, `pg_mixed.ru`, `pg_realistic.ru`) live in the [hyperion-async-pg companion repo](https://github.com/andrew-woblavobla/hyperion-async-pg) — they require a running Postgres + the companion gem and are not part of this repo. The 2.0.0 sweep used `~/bench/pg_concurrent.ru` on the bench host; reproduce by cloning hyperion-async-pg and following its README, or `scp` the rackup + DATABASE_URL.
642
-
643
- When numbers from your host don't match the published numbers, the most likely explanations (in order): (1) bench-host noise — single-VM benches drift 10-30% over days; (2) Puma version mismatch — the 2.0.0 sweep used Puma 8.0.1 in the `~/bench/Gemfile`, the hyperion repo's own Gemfile pins Puma `~> 6.4`; (3) different kernel / Ruby; (4) different `-t` / `-c` (apples-to-apples requires identical worker count, thread count, wrk concurrency, payload size, kernel, Ruby, TLS cipher).
9
+ Hyperion serves a hello-world Rack response at **122,778 r/s with a 1.14 ms p99**
10
+ (median of 3 trials, peak 134,573) on a single worker — Linux 6.x, io_uring
11
+ accept loop, `Server.handle_static`, **6.7×** Agoo's 18,326 r/s on the same
12
+ hardware. Beyond the C-side fast path it's a complete Rack 3 server: HTTP/1.1
13
+ + HTTP/2 with ALPN, WebSockets (RFC 6455), gRPC unary + streaming on the Rack
14
+ 3 trailers contract, native fiber concurrency for PG-bound apps, and pre-fork
15
+ cluster mode with SO_REUSEPORT-balanced workers.
644
16
 
645
17
  ## Quick start
646
18
 
647
19
  ```sh
648
- bundle install
649
- bundle exec rake compile # build the llhttp C ext
650
- bundle exec hyperion config.ru # single-process default
651
- bundle exec hyperion -w 4 -t 10 config.ru # 4-worker cluster, 10 threads each
652
- bundle exec hyperion -w 0 config.ru # 1 worker per CPU
653
- bundle exec hyperion --tls-cert cert.pem --tls-key key.pem -p 9443 config.ru # HTTPS
654
- curl http://127.0.0.1:9292/ # => hello
655
-
656
- # Chunked POST works:
657
- curl -X POST -H "Transfer-Encoding: chunked" --data-binary @file http://127.0.0.1:9292/
658
-
659
- # HTTP/2 (over TLS, ALPN-negotiated):
660
- curl --http2 -k https://127.0.0.1:9443/
661
- ```
662
-
663
- `bundle exec rake spec` (and the `default` task) auto-invoke `compile`, so a fresh checkout just needs `bundle install && bundle exec rake` to get a green run.
664
-
665
- **Migrating from Puma?** See [docs/MIGRATING_FROM_PUMA.md](docs/MIGRATING_FROM_PUMA.md).
666
-
667
- ## Configuration
668
-
669
- Three layers, in precedence order: explicit CLI flag > environment variable > `config/hyperion.rb` > built-in default.
670
-
671
- ### CLI flags
672
-
673
- | Flag | Default | Notes |
674
- |---|---|---|
675
- | `-b, --bind HOST` | `127.0.0.1` | |
676
- | `-p, --port PORT` | `9292` | |
677
- | `-w, --workers N` | `1` | `0` = one worker per CPU (`Etc.nprocessors`). |
678
- | `-t, --threads N` | `5` | OS-thread Rack handler pool per worker. `0` = run inline (no pool, debugging only). |
679
- | `-C, --config PATH` | `config/hyperion.rb` if present | Ruby DSL file. |
680
- | `--tls-cert PATH` | nil | PEM certificate. |
681
- | `--tls-key PATH` | nil | PEM private key. |
682
- | `--log-level LEVEL` | `info` | `debug` / `info` / `warn` / `error` / `fatal`. |
683
- | `--log-format FORMAT` | `auto` | `text` / `json` / `auto`. Auto: JSON when `RAILS_ENV`/`RACK_ENV` is `production`/`staging`, colored text on TTY, JSON otherwise. |
684
- | `--[no-]log-requests` | ON | Per-request access log. |
685
- | `--fiber-local-shim` | off | Patches `Thread#thread_variable_*` to fiber storage for older Rails idioms. |
686
- | `--[no-]yjit` | auto | Force YJIT on/off. Default: auto-on under `RAILS_ENV`/`RACK_ENV` = `production`/`staging`. |
687
- | `--[no-]async-io` | off | Run plain HTTP/1.1 connections under `Async::Scheduler`. Required for `hyperion-async-pg` on plain HTTP. TLS h1 / HTTP/2 always run under the scheduler regardless. |
688
- | `--max-body-bytes BYTES` | `16777216` (16 MiB) | Maximum request body size. |
689
- | `--max-header-bytes BYTES` | `65536` (64 KiB) | Maximum total request-header size. |
690
- | `--max-pending COUNT` | unbounded | Per-worker accept-queue cap before new connections are rejected with HTTP 503 + `Retry-After: 1`. |
691
- | `--max-request-read-seconds SECONDS` | `60` | Total wallclock budget for reading request line + headers + body for ONE request. Slowloris defence. |
692
- | `--admin-token TOKEN` | unset | Bearer token for `POST /-/quit` and `GET /-/metrics`. **Production: prefer `--admin-token-file` — argv is visible via `ps`.** |
693
- | `--admin-token-file PATH` | unset | Read the admin token from a file. Refuses to load if the file is missing or world-readable (mode must mask `0o007`). |
694
- | `--worker-max-rss-mb MB` | unset | Master gracefully recycles a worker once its RSS exceeds this many megabytes. nil = disabled. |
695
- | `--idle-keepalive SECONDS` | `5` | Keep-alive idle timeout. Connection closes after this many seconds of inactivity. |
696
- | `--graceful-timeout SECONDS` | `30` | Shutdown deadline before SIGKILL is delivered to a worker that hasn't drained. |
697
-
698
- ### Environment variables
699
-
700
- `HYPERION_LOG_LEVEL`, `HYPERION_LOG_FORMAT`, `HYPERION_LOG_REQUESTS` (`0|1|true|false|yes|no|on|off`), `HYPERION_ENV`, `HYPERION_WORKER_MODEL` (`share|reuseport`).
701
-
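- For example (each value is one of the accepted ones listed above; the
- combination itself is illustrative):
-
- ```sh
- HYPERION_LOG_FORMAT=json HYPERION_LOG_REQUESTS=0 HYPERION_WORKER_MODEL=reuseport \
-   bundle exec hyperion -w 4 -t 10 config.ru
- ```
-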
702
- ### Config file
703
-
704
- `config/hyperion.rb` same shape as Puma's `puma.rb`. Auto-loaded if present.
705
-
706
- ```ruby
707
- # config/hyperion.rb
708
- bind '0.0.0.0'
709
- port 9292
710
-
711
- workers 4
712
- thread_count 10
713
-
714
- # tls_cert_path 'config/cert.pem'
715
- # tls_key_path 'config/key.pem'
716
-
717
- read_timeout 30
718
- idle_keepalive 5
719
- graceful_timeout 30
720
-
721
- max_header_bytes 64 * 1024
722
- max_body_bytes 16 * 1024 * 1024
723
-
724
- log_level :info
725
- log_format :auto
726
- log_requests true
727
-
728
- fiber_local_shim false
729
-
730
- async_io nil # Three-way (1.4.0+): nil (default, auto: inline-on-fiber for TLS h1, pool hop for plain HTTP/1.1), true (force inline-on-fiber everywhere — required for hyperion-async-pg on plain HTTP/1.1), false (force pool hop everywhere — explicit opt-out for TLS+threadpool with CPU-heavy handlers). ~5% throughput hit on hello-world when inline; in exchange one OS thread serves N concurrent in-flight DB queries on wait-bound workloads. TLS / HTTP/2 accept loops always run under Async::Scheduler regardless of this flag.
731
-
732
- before_fork do
733
- ActiveRecord::Base.connection_handler.clear_all_connections! if defined?(ActiveRecord)
734
- end
735
-
736
- on_worker_boot do |worker_index|
737
- ActiveRecord::Base.establish_connection if defined?(ActiveRecord)
738
- end
739
-
740
- on_worker_shutdown do |worker_index|
741
- ActiveRecord::Base.connection_handler.clear_all_connections! if defined?(ActiveRecord)
742
- end
743
- ```
744
-
745
- Strict DSL: unknown methods raise `NoMethodError` at boot — typos surface immediately rather than getting silently ignored.
746
-
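- For instance, a misspelled key fails loudly at boot (sketch; `tread_count` is
- the deliberate typo):
-
- ```ruby
- # config/hyperion.rb
- thread_count 10   # valid DSL key
- tread_count 10    # typo: raises NoMethodError at boot instead of being silently ignored
- ```
-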
747
- A documented sample lives at [`config/hyperion.example.rb`](config/hyperion.example.rb).
748
-
749
- ## Operator guidance
750
-
751
- Concrete tradeoffs distilled from [`docs/BENCH_2026_04_27.md`](docs/BENCH_2026_04_27.md). If the bench numbers cited below feel surprising, check that doc for the full matrix + caveats.
752
-
753
- ### When to use `-w N`
754
-
755
- | Workload shape | Recommended | Why |
756
- |---|---|---|
757
- | **Pure I/O-bound** (PG / Redis / external HTTP, no significant CPU) | `-w 1` + larger pool | Bench: `-w 1 pool=200` = 87 MB / 2,180 r/s vs `-w 4 pool=64` = 224 MB / 1,680 r/s. **2.6× more memory, 0.77× rps** if you pick multi-worker on a wait-bound workload. |
758
- | **Pure CPU-bound** (heavy JSON / template render / image processing) | `-w N` matching CPU count | Each worker's accept loop is single-threaded under `--async-io`; multi-worker gives CPU-parallelism. Bench: `-w 16 -t 5` hits 98,818 r/s on a 16-vCPU box, 4.7× a `-w 1` ceiling on the same hardware. |
759
- | **Mixed** (Rails-shaped: ~5 ms CPU + 50 ms PG wait per request) | `-w N/2` (half cores) + medium pool | Lets CPU work parallelise while keeping per-worker memory tractable. Bench `pg_mixed.ru` (in hyperion-async-pg repo / `~/bench/`) at `-w 4 -t 5 pool=128` = 1,740 r/s with no cold-start spike (ForkSafe `prefill_in_child: true`). |
760
-
761
- Multi-worker on PG-wait workloads is the **wrong** default for most apps — the headline rps doesn't justify the memory and PG-connection cost. Verify your shape with the bench before scaling out.
762
-
763
- ### When to use `--async-io`
764
-
765
- ```
766
- Are you using a fiber-cooperative I/O library?
767
- (hyperion-async-pg, async-redis, async-http)
768
-
769
-           ┌─────────────┴─────────────┐
770
-          yes                          no
771
-           │                           │
772
- Pair with a fiber-aware        Leave --async-io OFF.
773
- connection pool                Default thread-pool dispatch
774
- (FiberPool, async-pool —       is faster for synchronous
775
- NOT connection_pool gem,       Rails apps. Bench: --async-io
776
- which uses non-fiber Mutex).   on hello-world = 47% rps
777
-           │                    regression + p99 spike to
778
- Set --async-io.                3.65 s under no-yield workloads.
779
- Pool size is the real          No reason to flip the flag.
780
- concurrency knob; -t is
781
- decorative for wait-bound.
782
- ```
783
-
784
- Hyperion warns at boot if you set `--async-io` without any fiber-cooperative library loaded. The setting is still honoured; the warn just nudges operators who flipped it expecting a free perf bump.
785
-
786
- ### Tuning `-t` and pool sizes
787
-
788
- - **Without `--async-io`** (sync server, default): `-t` is the concurrency knob. Each in-flight request holds an OS thread; pool size should match `-t`. Bench shows Puma-style behaviour — at 200 wrk conns hitting a 5-thread server, queue depth dominates p99 (Hyperion `-t 5 -w 1` p50 = 0.95 ms vs Puma's same shape at 59.5 ms — Hyperion's queueing is cheaper but the model still serializes at `-t`).
789
- - **With `--async-io` + a fiber-aware pool**: pool size is the concurrency knob. `-t` is decorative for wait-bound workloads; one accept-loop fiber serves all in-flight queries via the pool. Linear scaling: pool=64 → ~780 r/s, pool=128 → ~1,344 r/s, pool=200 → ~2,180 r/s on 50 ms PG queries.
790
- - **Pool over WAN**: if `PG.connect` round-trip is >50 ms, expect pool fill at startup to take `pool_size / parallel_fill_threads × RTT`. `hyperion-async-pg 0.5.1+` auto-scales `parallel_fill_threads` so pool=200 fills in ~1-2 s.
791
-
792
- ### How to read p50 vs p99
793
-
794
- Tail latency tells the queueing story; rps tells the throughput story. Hyperion's tail wins are **always** bigger than its rps wins — sometimes the rps numbers look close to a competitor while p99 is 5-200× lower:
795
-
796
- | Workload | Hyperion rps / p99 | Closest competitor | rps ratio | p99 ratio |
797
- |---|---|---|---:|---:|
798
- | Hello `-w 4` | 21,215 r/s / 1.87 ms | Falcon 24,061 / 9.78 ms | 0.88× | **5.2× lower** |
799
- | CPU JSON `-w 4` | 15,582 r/s / 2.47 ms | Falcon 18,643 / 13.51 ms | 0.84× | **5.5× lower** |
800
- | Static 1 MiB | 1,919 r/s / 4.22 ms | Puma 2,074 / 55 ms | 0.93× | **13× lower** |
801
- | PG-wait `-w 1` pool=200 | 2,180 r/s / 668 ms | Puma 530 r/s + 200 timeouts | **4.1×** | qualitative crush |
802
-
803
- **Size capacity by p99, not by mean.** Throughput peaks are easy to fake under controlled bench conditions; tail latency reflects what your slowest user actually experiences when the load balancer fans them onto a busy worker.
804
-
805
- ### Production tuning (real Rails apps)
806
-
807
- Distilled from a real-app bench against the [Exodus platform](https://github.com/andrew-woblavobla/hyperion/blob/master/docs/BENCH_2026_04_27.md) (Rails 8.1, on-LAN PG + Redis at ~0.3 ms RTT, `-w 4 -t 10`, `wrk -t8 -c200 -d30s`). The headline finding: the **simplest drop-in is the right answer**, and the additional knobs operators reach for first don't help on real Rails.
808
-
809
- **Recommended for migrating from Puma**: `hyperion -t N -w M` matching your current Puma `-t N:N -w M`. No other flags. That gives you (vs Puma at the same `-t/-w`):
810
-
811
- - **+9% rps on lightweight endpoints** (matches the 5-10% per-request CPU savings the rest of the bench section documents).
812
- - **28× lower p99 on health-style endpoints** — the queue-of-doom shape Puma exhibits under sustained 200-conn load doesn't reproduce on Hyperion's worker-owns-connection model.
813
- - **3.8× lower p99 on PG-touching endpoints**.
814
- - **Same RSS, same operator surface** — you keep all your existing config, monitoring, and deploy scripts.
815
-
816
- **Knobs that help on synthetic benches but NOT on real Rails — leave them off:**
817
-
818
- | Knob | Synthetic bench result | Real Rails result | Recommendation |
819
- |---|---|---|---|
820
- | `-t 30` (more threads/worker) | Helped Hyperion 5-10% on hello-world | **Hurt** p99 vs `-t 10` on real Rails (3.51 s vs 148 ms on /up) — GVL + middleware Mutex contention dominates past `-t 10` | Stay at `-t 10`. Match Puma's recommended `RAILS_MAX_THREADS`. |
821
- | `--yjit` | 5-10% on synthetic CPU-bound | Wash on dev-mode Rails (312 vs 328 rps, p99 worse with YJIT) | Skip for now. Production-mode Rails may behave differently — verify with your own bench before flipping. |
822
- | `RAILS_POOL` > 25 | n/a | No improvement at pool=50 or pool=100 on real Rails (rps within 3%, p99 within noise). Pool starvation is rarely the bottleneck on a `-w 4 -t 10` config | Keep your existing AR pool size. |
823
- | `--async-io` | 33-42× rps on PG-bound (with `hyperion-async-pg`) | **Worse** than drop-in on real Rails (4.14 s p99 on /up vs 148 ms drop-in) | **Don't enable** until your full I/O stack is fiber-cooperative. The synchronous Redis client (`redis-rb`) blocks the OS thread before async-pg can yield, so fibers can't compound. Migrate to `async-redis` *first*, then revisit. |
824
- | `--async-io` + `hyperion-async-pg` AR adapter | Verified 48× rps lift on a single-PG-query bench | Marginal-or-negative on real Rails (similar reason: Redis-first handlers don't yield) | Same — wait for a full-async I/O stack. |
825
-
826
- **Why the simple drop-in wins on real Rails:** the per-request budget on a real handler is dominated by the Rails middleware chain (rack-attack, locale redirect, tagger, etc.) + handler logic + DB + cache I/O. Hyperion's per-request CPU optimizations (C-ext header parser, response builder, lock-free metrics, fiber-cooperative TLS dispatch in 1.4.0+) shave ~5-10% off the *non-I/O* portion of the budget consistently — and the [worker-owns-connection model](#concurrency-at-scale-architectural-advantages) prevents the queue-amplification that Puma's thread-pool dispatch shows under sustained load. You don't need to "tune" anything to get those.
827
-
828
- ## Logging
829
-
830
- Default behaviour (rc16+):
831
-
832
- - **`info` / `debug` → stdout**, **`warn` / `error` / `fatal` → stderr** (12-factor).
833
- - **One structured access-log line per response**, info level, on stdout. Disable with `--no-log-requests` or `HYPERION_LOG_REQUESTS=0`.
834
- - **Format auto-selects**: production envs → JSON (line-delimited, parseable by every log aggregator); TTY → coloured text; piped output without env hint → JSON.
835
-
836
- ### Sample access log lines
837
-
838
- Text format (TTY default):
839
-
840
- ```
841
- 2026-04-26T18:40:04.112Z INFO [hyperion] message=request method=GET path=/api/v1/health status=200 duration_ms=46.63 remote_addr=127.0.0.1 http_version=HTTP/1.1
842
- 2026-04-26T18:40:04.123Z INFO [hyperion] message=request method=GET path=/api/v1/cached_data query="currency=USD" status=200 duration_ms=43.87 remote_addr=127.0.0.1 http_version=HTTP/1.1
843
- ```
844
-
845
- JSON format (auto-selected on `RAILS_ENV=production`/`staging` or piped output):
846
-
847
- ```json
848
- {"ts":"2026-04-26T18:38:49.405Z","level":"info","source":"hyperion","message":"request","method":"GET","path":"/api/v1/health","status":200,"duration_ms":46.63,"remote_addr":"127.0.0.1","http_version":"HTTP/1.1"}
849
- {"ts":"2026-04-26T18:38:49.411Z","level":"info","source":"hyperion","message":"request","method":"GET","path":"/api/v1/cached_data","query":"currency=USD","status":200,"duration_ms":40.64,"remote_addr":"127.0.0.1","http_version":"HTTP/1.1"}
850
- ```
851
-
852
- ### Hot-path optimisations
853
-
854
- The default-ON access log path is engineered to stay near-zero cost:
855
-
856
- **Per-thread cached `iso8601(3)` timestamp** — one allocation per millisecond per thread, reused across all requests in that millisecond. Sketched after this list.
857
- - **Hand-rolled single-interpolation line builder** — bypasses generic `Hash#map.join`.
858
- **Per-thread 4 KiB write buffer** — flushes to stdout when full or on connection close, cutting syscalls roughly 32× under load.
859
- - **Lock-free emit** — POSIX `write(2)` is atomic for writes ≤ PIPE_BUF (4096 B); a log line is ~200 B. No logger mutex.
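-
- The first bullet's millisecond cache, as a standalone sketch; names and layout are illustrative, not Hyperion's internal code:
-
- ```ruby
- require 'time'
-
- # Illustrative per-thread, per-millisecond iso8601(3) cache: each thread
- # re-renders the timestamp string at most once per millisecond and reuses
- # it for every access-log line emitted in that window.
- def cached_iso8601_ms
-   cache = Thread.current.thread_variable_get(:log_ts_cache) ||
-           Thread.current.thread_variable_set(:log_ts_cache, { ms: -1, str: nil })
-   now_ms = Process.clock_gettime(Process::CLOCK_REALTIME, :millisecond)
-   if cache[:ms] != now_ms
-     cache[:ms]  = now_ms
-     cache[:str] = Time.at(now_ms / 1000r).utc.iso8601(3)
-   end
-   cache[:str]
- end
- ```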
860
-
861
- ## Metrics
20
+ gem install hyperion-rb
21
+ bundle exec hyperion config.ru # http://127.0.0.1:9292
22
+ bundle exec hyperion -w 4 -t 10 config.ru # 4 workers × 10 threads
23
+ bundle exec hyperion --tls-cert cert.pem --tls-key key.pem -p 9443 config.ru
24
+ ```
25
+
26
+ Migrating from Puma? `hyperion -t N -w M` matching your current Puma
27
+ `-t N:N -w M` is the recommended drop-in. See
28
+ [docs/MIGRATING_FROM_PUMA.md](docs/MIGRATING_FROM_PUMA.md).
29
+
30
+ ## Headline benchmarks
31
+
32
+ Linux 6.8 / 16-vCPU Ubuntu 24.04 / Ruby 3.3.3, single worker, `wrk -t4 -c100 -d20s`
33
+ unless noted. Three trials per row, median reported. Captured 2026-05-02 on
34
+ the 2.14.0 release commit. Full reproduction in
35
+ [docs/BENCH_HYPERION_2_14.md](docs/BENCH_HYPERION_2_14.md); single-command
36
+ re-bench via [`bench/run_all.sh`](bench/run_all.sh).
37
+
38
+ | Workload | Hyperion r/s | Hyperion p99 | Reference |
39
+ |-------------------------------------------------------|-------------:|-------------:|----------------------|
40
+ | Static hello, `handle_static` + io_uring | **122,778** | 1.11 ms | Agoo: 18,326 |
41
+ | Static hello, `handle_static` + accept4 fallback | 16,725 | 90 µs | Agoo: 18,326 |
42
+ | Dynamic block, `Server.handle { \|env\| ... }` | 8,956 | 190 µs | Agoo: 18,326 |
43
+ | CPU JSON via block (`bench/work.ru`) | 5,456 | 327 µs | Falcon: 6,394 |
44
+ | Generic Rack hello (no `Server.handle`) | 4,231 | 2.33 ms | Agoo: 18,326 |
45
+ | gRPC unary, h2/TLS, `ghz -c50` | 1,732 | 29.87 ms | (Falcon `async-grpc` historical: 1,512) |
46
+
47
+ Peak trial on row 1: 134,573 r/s. The io_uring loop is opt-in via
48
+ `HYPERION_IO_URING_ACCEPT=1` until 2.15; the `accept4` row is the default on
49
+ Linux. Falcon and Puma both show tail latencies of **>400 ms p99** on the generic
50
+ Rack hello row Hyperion serves at 2.33 ms; the closest competitor's mean is
51
+ above Hyperion's p99. Read the tail, not the throughput peak.
52
+
53
+ ## Features
54
+
55
+ - **HTTP/1.1 + HTTP/2 + TLS** with ALPN auto-negotiation. Multiplexed h2
56
+ streams on fibers; smuggling defences inline. See
57
+ [docs/HTTP2_AND_TLS.md](docs/HTTP2_AND_TLS.md).
58
+ - **WebSockets** (RFC 6455) over Rack 3 full hijack. ActionCable +
59
+ faye-websocket on the same listener. 463/463 autobahn cases pass. See
60
+ [docs/WEBSOCKETS.md](docs/WEBSOCKETS.md).
61
+ - **gRPC** unary, server-stream, client-stream, bidirectional via
62
+ Rack 3 trailers. See [docs/GRPC.md](docs/GRPC.md).
63
+ - **`Server.handle_static`** + **`Server.handle { |env| }`** —
64
+ C-loop direct routes that bypass the Rack adapter for hot paths.
65
+ See [docs/HANDLE_STATIC_AND_HANDLE_BLOCK.md](docs/HANDLE_STATIC_AND_HANDLE_BLOCK.md).
66
+ - **Pre-fork cluster mode** — `SO_REUSEPORT` on Linux, master-bind on
67
+ macOS / BSD. 1.004–1.011 max/min worker fairness ratio under steady
68
+ load. See [docs/CLUSTER_AND_SO_REUSEPORT.md](docs/CLUSTER_AND_SO_REUSEPORT.md).
69
+ - **Async I/O** for PG-bound apps via `--async-io` +
70
+ [hyperion-async-pg](https://github.com/andrew-woblavobla/hyperion-async-pg).
71
+ Single worker `pool=200` hits 2,381 r/s on `pg_sleep(50ms)` vs Puma's 56
72
+ r/s. See [docs/ASYNC_IO.md](docs/ASYNC_IO.md).
73
+ - **Observability** — `/-/metrics` Prometheus endpoint, per-route
74
+ histograms, dispatch-mode counters, kTLS gauge. Pre-built Grafana
75
+ dashboard. See [docs/OBSERVABILITY.md](docs/OBSERVABILITY.md).
76
+ - **Default-on structured access logs** — JSON in production, coloured
77
+ text on TTY. Per-thread cached timestamps; ≈ 0.1 µs per logged
78
+ request. See [docs/LOGGING.md](docs/LOGGING.md).
79
+ - **io_uring accept loop** (Linux 5.x+, opt-in) — multishot accept +
80
+ per-conn state machine. Compiles out cleanly without liburing.
81
+ Default-flip moves to 2.15 with a fresh 24h soak.
862
82
 
863
- `Hyperion.stats` returns a snapshot Hash with the following counters (lock-free per-thread aggregation):
83
+ ## Compatibility
864
84
 
865
- | Counter | Meaning |
85
+ | Component | Version |
866
86
  |---|---|
867
- | `connections_accepted` | Lifetime accept count. |
868
- | `connections_active` | Currently in-flight connections. |
869
- | `requests_total` | Lifetime request count. |
870
- | `requests_in_flight` | Currently in-flight requests. |
871
- | `responses_<code>` | One counter per status code emitted (`responses_200`, `responses_400`, …). |
872
- | `parse_errors` | HTTP parse failures → 400. |
873
- | `app_errors` | Rack app raised → 500. |
874
- | `read_timeouts` | Per-connection read deadline hit. |
875
- | `requests_threadpool_dispatched` | HTTP/1.1 connection handed to the worker pool (or served inline in `start_raw_loop` when `thread_count: 0`). The default dispatch path. |
876
- | `requests_async_dispatched` | HTTP/1.1 connection served inline on the accept-loop fiber under `--async-io`. Operators can use the ratio against `requests_threadpool_dispatched` to verify fiber-cooperative I/O is actually engaged. |
877
-
878
- ```ruby
879
- require 'hyperion'
880
- Hyperion.stats
881
- # => {connections_accepted: 1234, connections_active: 7, requests_total: 8910, …}
882
- ```
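-
- A sketch of acting on the dispatch counters from the table above (only the documented `Hyperion.stats` keys are assumed):
-
- ```ruby
- # Confirm --async-io is actually taking the inline fiber path.
- s      = Hyperion.stats
- async  = s[:requests_async_dispatched].to_i
- pooled = s[:requests_threadpool_dispatched].to_i
- total  = async + pooled
- share  = total.zero? ? 0.0 : async.fdiv(total)
- puts format('async-dispatched: %.1f%% of %d requests', share * 100, total)
- # A low share under --async-io usually means something in the stack still
- # blocks the OS thread before the fiber can yield.
- ```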
883
-
884
- ### Prometheus exporter
885
-
886
- When `admin_token` is set in your config, Hyperion mounts a `/-/metrics` endpoint that emits Prometheus text-format v0.0.4. Same token guards both `/-/metrics` (GET) and `/-/quit` (POST); auth is via the `X-Hyperion-Admin-Token` header.
87
+ | Ruby | 3.3+ |
88
+ | Rack | 3.x |
89
+ | Rails | verified up to 8.1 |
90
+ | Linux kernel | 5.x+ for io_uring opt-in; 4.x+ otherwise |
91
+ | macOS | works (TLS, h2, WebSockets, `accept4` fallback) |
92
+
93
+ ## Documentation
94
+
95
+ - [BENCH_HYPERION_2_14.md](docs/BENCH_HYPERION_2_14.md) — fresh 2.14.0
96
+ bench (this README's headline numbers, with reproduction commands).
97
+ - [BENCH_HYPERION_2_11.md](docs/BENCH_HYPERION_2_11.md) — 4-way
98
+ matrix (Hyperion / Puma / Falcon / Agoo).
99
+ - [BENCH_2026_04_27.md](docs/BENCH_2026_04_27.md) — real Rails 8.1
100
+ app sweep (Exodus platform).
101
+ - [CONFIGURATION.md](docs/CONFIGURATION.md) — CLI flags, env vars,
102
+ `config/hyperion.rb` DSL.
103
+ - [OPERATOR_GUIDANCE.md](docs/OPERATOR_GUIDANCE.md) — what `-w N` /
104
+ `-t N` / `--async-io` actually do on Rails-shaped traffic.
105
+ - [HTTP2_AND_TLS.md](docs/HTTP2_AND_TLS.md) — h2 + TLS surface.
106
+ - [WEBSOCKETS.md](docs/WEBSOCKETS.md) — RFC 6455 surface.
107
+ - [GRPC.md](docs/GRPC.md) — Rack 3 trailers + streaming RPCs.
108
+ - [HANDLE_STATIC_AND_HANDLE_BLOCK.md](docs/HANDLE_STATIC_AND_HANDLE_BLOCK.md)
109
+ — direct-route forms.
110
+ - [CLUSTER_AND_SO_REUSEPORT.md](docs/CLUSTER_AND_SO_REUSEPORT.md) —
111
+ cluster mode and per-OS worker model.
112
+ - [ASYNC_IO.md](docs/ASYNC_IO.md) — `--async-io` for PG-bound apps.
113
+ - [OBSERVABILITY.md](docs/OBSERVABILITY.md) — metrics + Grafana.
114
+ - [LOGGING.md](docs/LOGGING.md) — access log surface.
115
+ - [MIGRATING_FROM_PUMA.md](docs/MIGRATING_FROM_PUMA.md) — drop-in guide.
116
+ - [REVERSE_PROXY.md](docs/REVERSE_PROXY.md) — nginx fronting.
117
+
118
+ ## Reproducing benchmarks
887
119
 
888
120
  ```sh
889
- $ curl -s -H 'X-Hyperion-Admin-Token: secret' http://127.0.0.1:9292/-/metrics
890
- # HELP hyperion_requests_total Total HTTP requests handled
891
- # TYPE hyperion_requests_total counter
892
- hyperion_requests_total 8910
893
- # HELP hyperion_bytes_written_total Total bytes written to response sockets
894
- # TYPE hyperion_bytes_written_total counter
895
- hyperion_bytes_written_total 2351023
896
- # HELP hyperion_responses_status_total Responses by HTTP status code
897
- # TYPE hyperion_responses_status_total counter
898
- hyperion_responses_status_total{status="200"} 8521
899
- hyperion_responses_status_total{status="404"} 12
900
- hyperion_responses_status_total{status="500"} 3
901
- # … and so on for sendfile_responses_total, rejected_connections_total,
902
- # slow_request_aborts_total, requests_async_dispatched_total, etc.
121
+ bundle install && bundle exec rake compile
122
+ ./bench/run_all.sh # full table
123
+ ./bench/run_all.sh --row 1 # single row
124
+ ./bench/run_all.sh --skip-grpc # rows 1-5 + 7-9
903
125
  ```
904
126
 
905
- Any counter not in the known set (added by app middleware via `Hyperion.metrics.increment(:custom_thing)`) is auto-exported as `hyperion_custom_thing` with a generic HELP line — no Hyperion config change required.
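-
- For example, a tiny Rack middleware that bumps its own counter; the counter name and header check are invented for illustration, only `Hyperion.metrics.increment` is from the paragraph above:
-
- ```ruby
- # Exports hyperion_upstream_cache_misses at /-/metrics with no Hyperion config change.
- class CacheMissCounter
-   def initialize(app)
-     @app = app
-   end
-
-   def call(env)
-     status, headers, body = @app.call(env)
-     Hyperion.metrics.increment(:upstream_cache_misses) if headers['x-cache'] == 'MISS'
-     [status, headers, body]
-   end
- end
- # config.ru: `use CacheMissCounter` above your `run` line.
- ```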
127
+ The `bench/run_all.sh` driver boots one server per row, runs `wrk` (or
128
+ `ghz` for gRPC), kills it, moves on — no concurrent runs (cross-talk
129
+ inflates noise on shared hosts). Output: CSV + markdown table at
130
+ `$OUT_CSV` / `$OUT_MD` (default `/tmp/hyperion-2.15-bench.{csv,md}`).
906
131
 
907
- Point your scraper at it: in Prometheus' `scrape_configs`, set `metrics_path: /-/metrics` and `bearer_token` (or use a custom header relabel — Prometheus 2.42+ supports `authorization.credentials_file` paired with a custom `header` block). Network-isolate the admin endpoints if the listener is internet-facing — see [docs/REVERSE_PROXY.md](docs/REVERSE_PROXY.md) for the nginx `location /-/ { return 404; }` recipe.
132
+ Per-row commands and the host snapshot live in
133
+ [docs/BENCH_HYPERION_2_14.md](docs/BENCH_HYPERION_2_14.md). When
134
+ your numbers don't match, the usual culprits are bench-host noise (it drifts
135
+ ±10–30% over days), a Puma version mismatch (the sweep used 8.0.x; the
136
+ in-repo Gemfile pins `~> 6.4`), and different `-t` / `-c` settings.
908
137
 
909
- ## TLS + HTTP/2
138
+ ## Release history
910
139
 
911
- Provide a PEM cert + key:
140
+ See [CHANGELOG.md](CHANGELOG.md). Recent: 2.14.0 (gRPC streaming
141
+ ghz; dynamic-block C dispatch; `Server#stop` accept-wake on Linux;
142
+ io_uring 4h soak), 2.13.0 (response head builder C-rewrite; gRPC
143
+ streaming RPCs), 2.12.0 (C connection lifecycle; io_uring loop;
144
+ gRPC unary trailers), 2.11.0 (HPACK CGlue default; h2 dispatch-pool
145
+ warmup), 2.10.x (PageCache, `Server.handle` direct routes,
146
+ TCP_NODELAY at accept).
912
147
 
913
- ```sh
914
- bundle exec hyperion --tls-cert config/cert.pem --tls-key config/key.pem -p 9443 config.ru
915
- ```
916
-
917
- ALPN auto-negotiates `h2` (HTTP/2) or `http/1.1` per connection. HTTP/2 multiplexes streams onto fibers within a single connection — slow handlers don't head-of-line-block other streams. Cluster-mode TLS works (`-w N` + `--tls-cert` / `--tls-key`).
918
-
919
- Smuggling defenses for HTTP/1.1: `Content-Length` + `Transfer-Encoding` together → 400; non-chunked `Transfer-Encoding` → 501; CRLF in response header values → `ArgumentError` (response-splitting guard).
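-
- As a concrete illustration of the response-splitting guard (the handler and parameter name are made up; the `ArgumentError` behaviour is as stated above):
-
- ```ruby
- # A handler that reflects untrusted input into a header value. If
- # params['next'] contains a CRLF ("\r\nSet-Cookie: ..."), the response writer
- # raises ArgumentError instead of emitting a split response.
- run lambda { |env|
-   target = Rack::Request.new(env).params['next'].to_s
-   [302, { 'location' => target }, []]
- }
- ```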
920
-
921
- ## Compatibility
148
+ ## Contributing
922
149
 
923
- - **Ruby 3.3+** required (the `protocol-http2 ~> 0.26` transitive dep imposes this floor; older Ruby installs error at `bundle install`).
924
- - **Rack 3** (auto-sets `SERVER_SOFTWARE`, `rack.version`, `REMOTE_ADDR`, IPv6-safe `Host` parsing, CRLF guard).
925
- - **`Hyperion::FiberLocal.install!`** opt-in shim for older Rails apps that store request-scoped data via `Thread.current.thread_variable_*` (modern Rails 7.1+ already uses Fiber storage natively; the shim handles the residual footgun).
926
- - **`Hyperion::FiberLocal.verify_environment!`** runtime check that `Thread.current[:k]` is fiber-local on the current Ruby (it is on 3.2+).
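-
- A minimal sketch of wiring both calls from an app initializer; the file path and the env-var guard are illustrative, only the two method names come from the bullets above:
-
- ```ruby
- # config/initializers/hyperion.rb (illustrative placement)
- require 'hyperion'
-
- # Runtime check that Thread.current[:k] is fiber-local on this Ruby (documented above).
- Hyperion::FiberLocal.verify_environment!
-
- # Opt-in shim for apps that still stash request state via
- # Thread.current.thread_variable_*; skip it on modern Rails.
- Hyperion::FiberLocal.install! if ENV['HYPERION_FIBER_LOCAL_SHIM'] == '1'
- ```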
150
+ See [CONTRIBUTING.md](CONTRIBUTING.md). `bundle install && bundle exec rake`
151
+ gives you a green test suite (1147 examples / 0 failures / 16 pending
152
+ on macOS arm64 + Ruby 3.3.3 as of 2.15-A).
927
153
 
928
154
  ## Credits
929
155
 
930
- - Vendored [llhttp](https://github.com/nodejs/llhttp) (Node.js's HTTP parser, MIT) under `ext/hyperion_http/llhttp/`.
156
+ - Vendored [llhttp](https://github.com/nodejs/llhttp) (Node.js's HTTP
157
+ parser, MIT) under `ext/hyperion_http/llhttp/`.
931
158
  - HTTP/2 framing and HPACK via [`protocol-http2`](https://github.com/socketry/protocol-http2).
932
159
  - Fiber scheduler via [`async`](https://github.com/socketry/async).
933
160