kino 0.1.0-x86_64-linux
- checksums.yaml +7 -0
- data/.yardopts +14 -0
- data/CHANGELOG.md +54 -0
- data/LICENSE.txt +21 -0
- data/README.md +384 -0
- data/doc/README.md +6 -0
- data/doc/architecture.md +161 -0
- data/doc/benchmarks.md +321 -0
- data/doc/rails-on-ractors.md +50 -0
- data/doc/why-kino.md +91 -0
- data/exe/kino +26 -0
- data/lib/kino/check.rb +199 -0
- data/lib/kino/cli.rb +254 -0
- data/lib/kino/configuration.rb +190 -0
- data/lib/kino/errors_stream.rb +25 -0
- data/lib/kino/input.rb +77 -0
- data/lib/kino/kino.so +0 -0
- data/lib/kino/logger.rb +56 -0
- data/lib/kino/null_input.rb +37 -0
- data/lib/kino/ractor_supervisor.rb +103 -0
- data/lib/kino/server.rb +271 -0
- data/lib/kino/stream.rb +61 -0
- data/lib/kino/templates/kino.rb.tt +141 -0
- data/lib/kino/version.rb +6 -0
- data/lib/kino/worker.rb +124 -0
- data/lib/kino.rb +53 -0
- data/sig/kino.rbs +178 -0
- metadata +193 -0
data/doc/benchmarks.md
ADDED
|
@@ -0,0 +1,321 @@
# Benchmarks: methodology and analysis

The result tables live in the [README](../README.md#benchmarks). This
document is the part that doesn't fit in a README: how the numbers were
produced, what they do and don't mean, and the investigations behind the
odd-looking columns.

The interesting question is not which server is faster—it's what a
Ractor-based dispatch model buys, and what it costs, relative to the
battle-tested approaches (Puma's forked workers, Falcon's
fiber-per-request). Puma is the reference point throughout because it's
the deployment most apps run today.

## Methodology

- Primary hardware: AWS **c7a.2xlarge**—8 dedicated cores of AMD EPYC
  9R14 (Genoa), 16 GB RAM, Amazon Linux 2023, kernel 6.18. A realistic
  app-server size, deliberately: nobody provisions a 32-core box per
  app process.
- Toolchain built on the box via mise: Ruby 4.0.5 (**YJIT enabled**,
  `RUBY_YJIT_ENABLE=1` for every server), Rust 1.96, Kino compiled in
  the release profile.
- Load generator: wrk 4.2 on the same host, 8-second windows, 64
  connections (`bench/run.sh 8 64`). Same-host load generation costs
  both sides CPU equally; we verified the generator was not the
  bottleneck by A/B-ing against single-threaded ab, which capped Kino's
  plaintext 26-37% lower while leaving Puma's number unchanged.
- Identical app for every server (`bench/bench_app.rb`), Ractor-shareable
  so Kino's `:ractor` mode can run it unmodified.
- Topology held equal: Puma 8 forked workers × 3 threads vs Kino
  8 workers × 3 threads in one process.
- Follow-up studies (`bench/studies.sh`): CPU recipe, topology sweep,
  /io worker scaling, logging costs, and memory—run in the same session
  as the headline tables.
- The harness waits for the port to be genuinely free between targets.
  This matters: Falcon binds with `SO_REUSEPORT`, so a leftover instance
  silently splits traffic with the next server and poisons every number
  after it (we learned this the hard way).
- Secondary data point: macOS (MacBook Pro, M1 10-core), where every
  server converges near the loopback ceiling (~42-49k plaintext) and
  differences compress; its table is at the end of this document.
  Earlier published numbers from Docker-on-Mac are retired—real
  hardware contradicted several of that environment's findings, noted
  inline below where the conclusion changed.

## Reading the headline tables

- **Plaintext/10k**: Kino's tokio front-end clears the fork cluster by
  1.4-2× (lanes plaintext 241,501 vs Puma 117,838 = 2.05×; the smallest
  margin is threaded /10k at 1.44×). The cross-ractor handoff shows up
  as ractor (201k) trailing threaded (218k) on trivial handlers—nothing
  in them needs parallel Ruby—and lane dispatch reverses that (241k).
- **CPU (recursive fib)**: ractor mode does **5× its own GVL-bound
  threaded mode** (66,735 vs 13,298)—that's the entire point of
  ractors—and beats the fork cluster outright: +15% with stock
  defaults, +21% with lanes (70,373 vs 58,207).
- **Memory**: serving the same loaded bench app, Kino held **57 MB
  (ractor) / 50 MB (threaded)** where the 8-worker cluster held
  **1,078 MB**—a fork per core pays one full copy of the VM, the app,
  and its YJIT-compiled code per worker. On the Rails hello-world:
  Kino 97 MB vs cluster 797 MB.
- **I/O (5 ms wait)**: all dispatch models tie within ~4% at equal slot
  counts; the lever that matters is slot count, see
  [below](#why-io-lags-in-ractor-mode-on-linux).

## CPU-bound tuning

On real hardware, Kino's stock defaults already lead the cluster on
pure CPU—same-session studies run:

| config | /cpu req/s |
|---|---:|
| Puma cluster (reference) | 58,376 |
| Kino `workers 8, threads 3`, tokio auto (default) | 68,257 |
| Kino `workers 8, threads 1, tokio_threads 1` (recipe) | 68,629 |

The tuned recipe is a wash (+0.5%)—and it still costs plaintext
(112,815 vs ~200k) and /io (1,532, 8 slots): on this hardware there is
no reason to use it. **This is a finding that changed with the
environment**: in the earlier Docker-on-Mac runs the recipe was worth
+12%, because tokio threads and wake churn competed for oversubscribed
virtualized cores. If you deploy into a constrained/virtualized
environment, the recipe may still pay; measure there.

Two findings that survived the environment change:

**Ruby executes at identical per-core speed in ractors and forks.** We
briefly believed parallel ractors paid a ~24% execution tax; that probe
compared 8 busy ractors against a *single* busy thread, which also
compares all-core clocks against single-core boost. The controlled probe
(8 forked processes vs 8 ractors, same all-core clock): forks 8,973
fib/s/core, ractors 8,918, a 0.6% difference that sits inside the ~1-2%
same-session repeat spread. `GC.disable` changes nothing (fib
barely allocates). The VM is innocent. Never compare parallel against
single-threaded baselines without controlling for clocks.

**The GVL ceiling is absolute.** Threaded mode posts the same ~13k /cpu
whatever the topology—24 threads in one process serialize on one lock.
Parallelism for CPU-bound Ruby comes from ractors or forks, nothing else.

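
The controlled probe described above can be approximated with a few lines of Ruby. This is only a sketch of the idea (equal worker counts on both sides, so both run at the all-core clock), not the probe used for the published numbers: the helper names, the worker count constant, and the small `DEPTH` are assumptions chosen so it finishes quickly, and each timing also includes process or ractor startup.

```ruby
# Naive recursive Fibonacci: CPU-bound and nearly allocation-free.
def fib(n)
  n < 2 ? n : fib(n - 1) + fib(n - 2)
end

def monotonic
  Process.clock_gettime(Process::CLOCK_MONOTONIC)
end

# Newer Rubies expose Ractor#value; older ones only have Ractor#take.
def ractor_result(ractor)
  ractor.respond_to?(:value) ? ractor.value : ractor.take
end

# Wall time for `workers` forked processes that each run fib(n) once.
def time_forks(workers, n)
  started = monotonic
  pids = Array.new(workers) { fork { fib(n); exit!(0) } }
  pids.each { |pid| Process.wait(pid) }
  monotonic - started
end

# Wall time for `workers` ractors that each run fib(n) once.
def time_ractors(workers, n)
  started = monotonic
  ractors = Array.new(workers) { Ractor.new(n) { |depth| fib(depth) } }
  ractors.each { |ractor| ractor_result(ractor) }
  monotonic - started
end

WORKERS = 8  # hypothetical: match the core count so both sides load every core
DEPTH   = 24 # hypothetical: raise this for a measurement that dwarfs startup cost

fork_seconds   = time_forks(WORKERS, DEPTH)
ractor_seconds = time_ractors(WORKERS, DEPTH)
puts format("forks %.3fs, ractors %.3fs, ratio %.2f",
            fork_seconds, ractor_seconds, ractor_seconds / fork_seconds)
```

Comparing 8 against 8 is the point of the design: a 1-versus-8 comparison would mix single-core boost clocks into the result, which is the mistake the text describes.
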
## Why /io lags in ractor mode on Linux

On bare metal the gap is small: ractor /io 4,527 vs threaded 4,715
(−4%). In Docker it was −18%, and a pure-Ruby probe there measured
`sleep(0.005)` waking +2.3-2.8 ms late inside ractors vs +1.8 ms on the
main ractor—non-main-ractor timer wakeups are coarser in Ruby 4.0, but
how much that costs depends heavily on the kernel/virtualization stack.
A follow-up probe showed `IO.select`-style waits are tighter than
`sleep` inside ractors, so real I/O readiness suffers less than timers.

**Mitigation 1—`Kino.sleep`:** releases the GVL and waits on the OS
clock directly (chunked, so `Thread#kill`/shutdown stay responsive). The
`/io_native` endpoint (same 5 ms wait via `Kino.sleep` when available)
erases the remaining ractor gap on this box: 4,714 vs 4,527 plain sleep.

**Mitigation 2—add workers; they're nearly free.** Wait-bound
throughput is simply `slots ÷ effective wait`, and Kino's slots cost ~a
thread each, not a forked process: `workers 32, threads 1` measured
**5,922 /io (+27% over the 24-thread cluster's 4,672) and 6,254
/io_native (+34%)**, still one small process. A fork cluster buying the
same 32 slots pays for them in full copies of the app.

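
The `slots ÷ effective wait` rule is easy to check against the figures quoted in this section. The sketch below computes the ideal ceiling for a 5 ms wait and the effective per-request wait each measured rate implies; the method names and the `RUNS` table layout are illustrative, while the slot counts and req/s values are the ones stated above.

```ruby
# Ideal wait-bound throughput: each slot finishes one request per wait period.
def ceiling_rps(slots, wait_ms)
  slots * 1000.0 / wait_ms
end

# Per-request wait implied by a measured rate (handler wait plus overhead).
def effective_wait_ms(slots, measured_rps)
  slots * 1000.0 / measured_rps
end

WAIT_MS = 5

# [label, slots, measured /io req/s] as quoted in this document.
RUNS = [
  ["threaded 8x3", 24, 4_715],
  ["ractor 8x3",   24, 4_527],
  ["ractor 32x1",  32, 5_922]
].freeze

RUNS.each do |label, slots, measured|
  puts format("%-13s ceiling %5d req/s, measured %5d, effective wait %.2f ms",
              label, ceiling_rps(slots, WAIT_MS), measured,
              effective_wait_ms(slots, measured))
end
```

Read this way, the extra workers raise the ceiling from 4,800 to 6,400 req/s, and the gap between ceiling and measurement is the per-request overhead added on top of the 5 ms wait.
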
## The ractor-pool-wrapper comparison

A reasonable first experiment for anyone curious about ractors is a Rack
wrapper that ships each request to a ractor pool on whatever server they
already run. `bench/ractor_wrapper.rb` is that experiment, benchmarked on
Puma and Falcon—not as a comparison of those servers, but to measure
what the Rack-level hop itself costs (c7a.2xlarge, same session):

| endpoint | Kino :ractor | Puma + wrapper | Falcon + wrapper |
|------------|-------------:|---------------:|-----------------:|
| /plaintext | 201,472 | 19,425 | 100,624 |
| /cpu (fib) | 66,735 | 17,106 | 49,083 |
| /io (5 ms) | 4,527 | 1,447 | 1,549 |

Inside the Rack contract, the wrapper must reduce the env to a shareable
subset, copy it to the worker ractor, copy the response back, and hold a
server thread for the round trip—that's the 10× gap in the Puma
column, and it would be the same for any server in that position. The
Falcon numbers mostly show its per-core forks doing the work (the
per-fork ractor adds little), while `Port#receive` blocking its event
loop is what limits the I/O endpoints—ractors and fiber schedulers
don't compose yet, which is a Ruby-level limitation, not a Falcon one.
Our conclusion: ractor dispatch needs to live at the server layer, below
the Rack contract—which is the experiment this gem exists to run.

## Rails

The example app (`examples/rails-hello`, edge Rails, production mode,
8 workers × 5 threads) on the same box:

| | req/s | RSS under load |
|---|---:|---:|
| Kino `:threaded` (one process) | 2,298 | **97 MB** |
| Puma cluster (8 workers) | 11,923 | 797 MB |

This is the honest version of the Rails story: in threaded mode Kino is
one GVL-bound process, so the fork cluster outruns it ~5× by using all
8 cores—at 8× the memory. Rails-on-Ractors is interesting precisely
because it would close that throughput gap at the one-process memory
cost; the upstream blockers are documented in
[rails-on-ractors.md](rails-on-ractors.md).

## YJIT × Ractors gotcha (found the hard way)

Plain `def` methods get the full YJIT speedup inside worker ractors
(5.8× in our probe—YJIT + Ractors compose fine in Ruby 4.0). But a
**self-referential lambda** (`fib = ->(x) { ... fib.call(x - 1) ... }`)
runs *slower* with YJIT than without when shared across parallel
ractors. Keep hot-path code in methods (which real apps do anyway); our
own /cpu benchmark was a victim of this pattern until it wasn't.

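
For reference, the two code shapes contrasted above look like this. The sketch only shows the patterns and that they compute the same result; it does not reproduce the slowdown, which per the text needs YJIT and parallel ractors. `HotPath` is a hypothetical module name used for illustration.

```ruby
# Shape the text warns about: a lambda that calls itself through the local
# variable it is assigned to.
fib_lambda = ->(x) { x < 2 ? x : fib_lambda.call(x - 1) + fib_lambda.call(x - 2) }

# Recommended shape: the same recursion written as a plain method.
module HotPath
  def self.fib(x)
    x < 2 ? x : fib(x - 1) + fib(x - 2)
  end
end

puts fib_lambda.call(20)
puts HotPath.fib(20)
```
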
## Rust-side allocator: mimalloc

The native extension's global allocator is **mimalloc**, unconditionally.
It covers all Rust-side allocations—request/response buffers, hyper,
tokio, channels—not the Ruby heap. The decision came from a three-way
shoot-out (measured in the earlier Docker-on-Linux environment, one
container session):

| allocator | /plaintext | /10k | /cpu |
|-----------|-----------:|-----:|-----:|
| system (glibc) | 131,978 | 114,121 | 45,690 |
| **mimalloc** | **145,229** | 113,907 | 45,351 |
| jemalloc | 134,632 | 111,341 | 47,273 |

mimalloc won plaintext by ~10% with everything else flat, with no
downside measured. For the record, jemalloc inside a Ruby extension also
needs its `disable_initial_exec_tls` build flag just to load (dlopen +
initial-exec TLS = `cannot allocate memory in static TLS block`)—one
more reason to prefer mimalloc in dlopen'd extensions.

## Run-to-run variance (a.k.a. "is this a regression?")

Rule of thumb from chasing this twice: never compare numbers from
different sessions; interleave A/B rounds in one session instead. The
Docker-on-Mac environment swung ±10% on /cpu between sessions with the
VM's mood; the dedicated c7a box is far steadier (same-session repeats
land within ~1-2%), but the discipline stays—every comparative claim in
these docs comes from same-session pairs.

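
One way to apply the interleaving rule is to alternate the two targets round by round and compare medians, so that slow drift in the machine affects both sides equally. The following is a minimal sketch of that idea, not the project's harness: `interleaved_ab`, `median`, and the canned sample values are all illustrative, and in practice each callable would run the load generator once and return its req/s.

```ruby
def median(values)
  sorted = values.sort
  mid = sorted.length / 2
  sorted.length.odd? ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2.0
end

# measure_a and measure_b are callables that each return one throughput
# sample. Rounds alternate A, B, A, B so both see the same conditions.
def interleaved_ab(measure_a, measure_b, rounds: 5)
  a_samples = []
  b_samples = []
  rounds.times do
    a_samples << measure_a.call
    b_samples << measure_b.call
  end
  a = median(a_samples)
  b = median(b_samples)
  { a: a, b: b, delta_percent: (b - a) / a.to_f * 100 }
end

# Canned samples stand in for real benchmark runs so the sketch is runnable.
baseline  = [100, 102, 98, 101, 99].each
candidate = [110, 112, 108, 111, 109].each

result = interleaved_ab(-> { baseline.next }, -> { candidate.next })
puts format("A %.0f, B %.0f, delta %+.1f%%",
            result[:a], result[:b], result[:delta_percent])
```
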
## Topology notes

Measured on c7a.2xlarge, plaintext, ractor mode, same session: `8×3`
(workers×threads) = 199,470, `8×1` = **232,469 (+17%)**, `16×1` =
214,284. Threads inside one ractor share its lock, so every request
handled by a 3-thread ractor pays a lock handoff that a 1-thread ractor
doesn't (`perf` in the earlier Docker sessions attributed ~10% of cycles
to `rb_native_mutex_unlock`/`thread_sched_wakeup_next_thread` at 8×3;
the +17% reproduces exactly on real hardware). Threads-per-ractor exist
for handlers that block on I/O; if yours don't, run `threads 1` and let
workers = cores do the parallelism. (16×1 being worse than 8×1 also says
the shared MPMC queue is *not* the bottleneck—8 extra parked consumers
just add scheduler churn.)

## What profiling tried and rejected

A perf profile of saturated plaintext showed ~20% of cycles in futex
wakeups: at saturation the queue oscillates around empty, so nearly every
request parks its worker and pays two wake round-trips (tokio firing the
channel signal + the GVL reacquire). The textbook fix—a bounded
busy-poll before parking—measured **worse** (−13% in both modes): with
workers + tokio threads oversubscribing the cores, spinners steal exactly
the CPU the event loop needs. Parking is the cheaper evil when threads
outnumber cores; the fix that did land is the fused
`respond_and_take` call (answer the previous request and block for the
next in one FFI crossing) plus the opt-in `batch` directive for grabbing
several queued requests per visit (default 1—values above 1 add
head-of-line blocking behind slow handlers and stretch effective queue
depth, which is also why it's a config knob and not a hardcoded win).

## Lane dispatch (experimental: `lanes true`)

The shared-queue design pays a futex wake per request at saturation
(every worker parks on an empty queue between requests). Lane mode gives
each worker a small private queue; the dispatcher prefers *awake* lanes—so
a hot worker keeps taking without ever being woken—with two
safeguards: lane depth is capped at 4, and workers steal from siblings
before parking (plus on every park tick), so a slow request can't strand
its lane's backlog.

Same-session A/B on c7a.2xlarge (ractor mode):

| endpoint | shared queue | lanes | delta |
|----------|-------------:|------:|------:|
| /plaintext | 201,472 | **241,501** | **+20%** |
| /10k | 156,635 | 183,564 | +17% |
| /cpu | 66,735 | 70,373 | +5% |
| /io | 4,527 | 4,530 | flat |

On this hardware lanes make ractor mode the fastest Kino configuration
outright—+11% over threaded mode's plaintext, where the shared queue
trails it. (On loopback-bound macOS, lanes lose 5-9% on /plaintext,
/10k, and /cpu instead; see the secondary table below.) It stays opt-in
for now because overload
semantics differ from the shared queue (`queue_depth` doesn't apply;
capacity is lanes × 4 with brief dispatcher retries up to
`queue_timeout` before the 503), and crash semantics, stealing fairness,
and drain behavior have spec coverage but not production mileage.

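
The lane rules described in this section (prefer awake lanes, cap depth at 4, steal before parking) can be illustrated with a toy, single-threaded model. This is not Kino's dispatcher, which lives in the Rust core; `LaneModel`, its method names, and details such as stealing the oldest request from the first non-empty sibling are assumptions made only for the sketch.

```ruby
# Toy model of lane dispatch; no threads, no timing, just the placement rules.
class LaneModel
  LANE_DEPTH = 4 # per-lane cap quoted in the text

  attr_reader :lanes, :awake

  def initialize(workers)
    @lanes = Array.new(workers) { [] }
    @awake = Array.new(workers, false)
  end

  # Prefer an awake lane with room; otherwise wake a parked one.
  # Returns the chosen lane index, or nil when every lane is full
  # (the point where the text says the server retries, then answers 503).
  def dispatch(request)
    open = @lanes.each_index.select { |i| @lanes[i].size < LANE_DEPTH }
    return nil if open.empty?

    lane = open.find { |i| @awake[i] } || open.first
    @awake[lane] = true
    @lanes[lane] << request
    lane
  end

  # A worker takes from its own lane, then steals from a sibling, and
  # parks only when both come up empty.
  def take(worker)
    request = @lanes[worker].shift
    if request.nil?
      victim = @lanes.each_index.find { |i| i != worker && !@lanes[i].empty? }
      request = @lanes[victim].shift if victim
    end
    @awake[worker] = !request.nil?
    request
  end
end

model = LaneModel.new(2)
placements = %i[r1 r2 r3 r4 r5].map { |request| model.dispatch(request) }
p placements    # lane index chosen for each request
p model.take(1) # worker 1 drains its own lane first
p model.take(1) # then steals from worker 0's backlog
```

With two lanes the model holds at most 8 queued requests, which mirrors the `lanes × 4` capacity mentioned above.
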
## Logging costs

Measured at full plaintext saturation (one log line per request—rates
that no real deployment logs at; treat these as worst-case ceilings, not
typical costs):

| case (8×3, same session) | req/s |
|---|---:|
| threaded, no logging | 217,113 |
| threaded, `log_requests true` (native access log) | 193,200 (−11%) |
| ractor, access log off / on | 198,624 / 183,565 (−8%) |
| app logs 1 line/req via shared `::Logger` (file) | **62,962** |
| app logs 1 line/req via `Kino::Logger` (file) | **150,810 (2.4×)** |

The shared-`::Logger` cost is the mutex: 24 worker threads serialize
through one lock plus a write syscall per line. `Kino::Logger` hands the
formatted line to a lock-free channel and returns—the remaining cost vs
not logging at all is Ruby-side formatting, which no device can remove.
(In the Docker environment the same comparison showed 8.5×—overlay-fs
write latency punished the synchronous logger far harder than this box's
NVMe does. The ranking is environment-independent; the multiple is not.)

One trade-off worth knowing: the sink **never blocks** request threads,
so at absurd rates against a slow disk it drops lines once its 8192-line
buffer is full, while `::Logger` writes every line by strangling
throughput instead. Pick your failure mode; at sane logging rates
neither happens.

Puma comparison note: request logging is opt-in there too (`--quiet` is
the default, `-v/--log-requests` enables it)—Kino's default-off
`log_requests` matches the ecosystem's standard behavior.

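
The "never block, drop when full" policy described above can be modeled in plain Ruby with a bounded queue and a non-blocking push. This is a sketch of the policy only, not of `Kino::Logger`: `DroppingSink` is a hypothetical class, and Ruby's `SizedQueue` takes an internal lock, unlike the lock-free channel the text describes.

```ruby
require "stringio"

class DroppingSink
  attr_reader :dropped

  def initialize(io, capacity: 8192)
    @io = io
    @queue = SizedQueue.new(capacity)
    @dropped = 0
    @writer = nil
  end

  # Request path: enqueue without blocking; drop the line if the buffer is full.
  def log(line)
    @queue.push(line, true) # non_block = true raises ThreadError when full
    true
  rescue ThreadError
    @dropped += 1 # approximate under concurrency; fine for a sketch
    false
  end

  # Background writer: the only thread that touches the device.
  def start
    @writer = Thread.new do
      # pop returns nil once the queue is closed and drained
      while (line = @queue.pop)
        @io.write(line)
      end
    end
    self
  end

  def close
    @queue.close
    @writer&.join
  end
end

device = StringIO.new
sink = DroppingSink.new(device, capacity: 2)
accepted = ["a\n", "b\n", "c\n"].map { |line| sink.log(line) } # writer not started yet
sink.start
sink.close
puts "accepted=#{accepted.inspect} dropped=#{sink.dropped} written=#{device.string.inspect}"
```

A blocking logger would instead make the third caller wait for the writer, which is the throughput cost the `::Logger` row in the table reflects.
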
## Hot-path notes

For the curious, the dispatch-path work behind the numbers: a try-pop
fast path skips the GVL release when a request is already queued,
bodyless requests spawn no body-forwarder task, `TCP_NODELAY`, a frozen
(Ractor-shareable) cache of env keys + common header names built once at
init, response headers read in place across the FFI boundary, ahash for
per-request lookups, SmallVec for header joins. Details in
[architecture.md](architecture.md).

## Secondary data point: macOS

MacBook Pro (M1 10-core), ab with keep-alive, 10 workers × 3 threads.
Everything converges near the loopback ceiling and differences compress;
useful mainly as a sanity check that the ranking holds on a second OS:

| endpoint | Kino :ractor | + lanes | Kino :threaded | Puma (cluster) |
|-------------|-------------:|--------:|---------------:|---------------:|
| /plaintext | 48,441 | 44,000 | 49,352 | 44,594 |
| /10k | 45,840 | 42,362 | 46,482 | 42,890 |
| /cpu (fib) | 43,827 | 41,426 | 10,076 | 34,161 |
| /io (5 ms) | 4,943 | 4,883 | 4,890 | 4,780 |
| /io_native | 4,758 | 4,844 | 4,848 | 4,805 |

Notes from this environment: lanes lose 5-9% on /plaintext, /10k, and
/cpu (the loopback
stack, not dispatch, is the ceiling); macOS loopback occasionally stalls
entire benchmark windows under back-to-back runs (the harness sleeps
between endpoints to reduce this; treat any isolated collapsed cell as
suspect and re-run); the ractor /cpu margin over the cluster (+28%) is
wider than on Linux because macOS Puma forks pay a higher per-fork cost.
|
data/doc/rails-on-ractors.md
ADDED
|
@@ -0,0 +1,50 @@
# Rails on Kino, and the state of Rails-on-Ractors

`examples/rails-hello` runs edge Rails on Kino.

## What works as of Kino 0.1.0: `:threaded` mode

Rails 8.2.0.alpha boots and serves with `mode :threaded` (see the
example's `kino.rb`; just `bundle exec kino` in that directory). Measured
on the hello-world (c7a.2xlarge, 8 cores, production mode, 8×5):
~2.3k req/s in 97 MB, single process. The 8-worker Puma cluster reaches
~11.9k in 797 MB by parallelizing across forks—Rails-on-Ractors is
interesting precisely because it could offer that ~5× parallelism at
~1/8th of the memory.

Pair it with production-style Rails settings: eager load, no code
reloading, database pool ≥ workers × threads, logger to stdout or another
thread-safe device.

## What doesn't (yet): `:ractor` mode—with the exact blockers

`examples/rails-hello/ractor_probe.rb` re-tests this against whatever
Rails is bundled; when it prints SUCCEEDED *and* a 200 ractor-mode
response (sharing the app is only the first blocker), flip
`mode :ractor`. As of June 2026 (rails main):

1. **Sharing the booted app fails.**
   `Ractor.make_shareable(Rails.application)` raises
   `Ractor::IsolationError: Proc's self is not shareable` at
   `railties/lib/rails/application/finisher.rb:151`—a routes-reloader
   hook lambda Rails registers at boot (in the development path) whose
   `self` is the unshareable application instance.
2. **Calling the app from a worker ractor fails** even where
   `make_shareable` succeeds on a class:
   `Ractor::IsolationError: can not get unshareable values from instance
   variables of classes/modules from non-main Ractors (@instance from
   HelloApp)`—`Rails.application` is a class-level ivar read on the
   dispatch path.
3. **Every workaround is closed by one VM rule** (verified on Ruby 4.0.5):
   class instance variables are main-ractor-only—non-main ractors can
   neither read them (when the value is unshareable) nor set them, **even
   inside a `Ruby::Box`** (`RUBY_BOX=1`). Box isolates constant tables,
   but Ractor isolation still applies to box-local classes:

   ```
   Ractor::IsolationError: can not set instance variables of
   classes/modules by non-main Ractors
   ```

   So booting a private Rails per worker ractor (boxed or not) dies on
   the first class-ivar write, and sharing one boot dies on reads.
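
The class-ivar rule behind blockers 2 and 3 can be reproduced without Rails. In this minimal sketch, `HelloApp` is a stand-in class (not the example app), and each ractor rescues the error itself so the script prints the messages instead of aborting; the exact message text may vary between Ruby versions.

```ruby
class HelloApp
  @instance = Object.new # an ordinary, unshareable object in a class-level ivar

  def self.instance
    @instance
  end

  def self.instance=(value)
    @instance = value
  end
end

# Newer Rubies expose Ractor#value; older ones only have Ractor#take.
def ractor_result(ractor)
  ractor.respond_to?(:value) ? ractor.value : ractor.take
end

read_attempt = Ractor.new do
  HelloApp.instance
  "read succeeded"
rescue Ractor::IsolationError => e
  "read failed: #{e.message}"
end

write_attempt = Ractor.new do
  HelloApp.instance = :anything
  "write succeeded"
rescue Ractor::IsolationError => e
  "write failed: #{e.message}"
end

read_message  = ractor_result(read_attempt)
write_message = ractor_result(write_attempt)
puts read_message
puts write_message
```

Both attempts fail from the worker ractor while the same calls succeed on the main ractor, which is why neither sharing one booted app nor booting one per ractor gets past this rule.
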
|
data/doc/why-kino.md
ADDED
|
@@ -0,0 +1,91 @@
# Why Kino is a Ractor server

What makes this server ractor-specific, why do Ractors work fast here
when they are slow everywhere else, and which Rust parts deserve the
credit. Companion to [architecture.md](architecture.md), which describes
the same machinery piece by piece.

## The problem Kino exists to solve

Ractors forbid sharing mutable Ruby objects. That single rule breaks
every conventional server design. Puma hands env Hashes, socket objects,
and request state between threads freely; with Ractors, a Ruby object
created on the accepting side cannot be given to a worker—`Ractor#send`
deep-copies it, and sockets cannot cross at all.

We measured what the "obvious" workaround costs. The ractor-pool wrapper
experiment (reduce the env to a shareable subset, copy it to a worker
over a `Ractor::Port`, copy the response back) runs at **19k req/s where
Kino does 201k** on the same hardware—see the
[wrapper comparison](benchmarks.md#the-ractor-pool-wrapper-comparison).
Copying at the Rack layer eats the entire ractor dividend. Dispatch has
to live below the Rack contract.

## The trick: shared state lives below the FFI line

Kino's answer is that **the request never exists as a Ruby object until
it is already inside the worker ractor**. The whole request—method, URI,
headers, body stream, response channel—lives in native memory as a Rust
`RequestCtx`. The only things that cross ractor boundaries in Ruby are
what Ractor law allows for free: integers (server id, worker ids) and
the frozen, shareable app. When a worker calls `take_one`, the env Hash
and the `Kino::Native::Request` handle are *constructed inside that
worker's ractor*—ownership is correct by construction, nothing is
copied, nothing is shared.

Put differently: Ractors cannot have shared mutable memory in Ruby, so
Kino moves all shared mutable state into Rust, where `Send`/`Sync` rules
apply instead of ractor isolation. Ruby sees only ids and frozen
objects; Rust sees one queue and one registry.

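
The difference between what Ractors copy and what they share for free can be seen in a few lines of plain Ruby. This sketch is independent of Kino: `ENV_KEYS` and the sample request Hash are illustrative, and object ids are compared only to show whether the worker received the same object or a copy.

```ruby
# Frozen, deeply shareable lookup table, built once on the main ractor.
ENV_KEYS = Ractor.make_shareable({ "method" => "REQUEST_METHOD", "path" => "PATH_INFO" })

# Newer Rubies expose Ractor#value; older ones only have Ractor#take.
def ractor_result(ractor)
  ractor.respond_to?(:value) ? ractor.value : ractor.take
end

request = { "REQUEST_METHOD" => "GET" } # ordinary mutable Hash

worker = Ractor.new(request) do |received|
  # `received` is a deep copy; ENV_KEYS is the very same frozen object.
  { copy_id: received.object_id, shared_id: ENV_KEYS.object_id }
end
ids = ractor_result(worker)

puts "request was copied: #{ids[:copy_id] != request.object_id}"
puts "ENV_KEYS was shared: #{ids[:shared_id] == ENV_KEYS.object_id}"
```

The mutable Hash pays a copy on every handoff, while the frozen table costs nothing to read from any ractor, which is the property the design above leans on.
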
## The Rust parts to thank, in order of credit

1. **`rb_ext_ractor_safe(true)`** (lib.rs)—one line, and the
   precondition for everything: without it, *any* native call from a
   non-main ractor raises `Ractor::UnsafeError`. The spec suite keeps a
   canary test on this.
2. **The registry** (registry.rs)—global Rust-side state keyed by
   `u64`. This is the "shared memory" Ractors legally cannot have:
   workers address everything by integer, and no TypedData object ever
   crosses a boundary.
3. **The flume MPMC queue**—the cross-ractor work distributor. One
   queue, async send from tokio, blocking receive from any ractor's
   thread. Ruby has no ractor-safe equivalent that does not copy
   (Ports copy every message); a Rust channel moves a
   `Box<RequestCtx>` pointer.
4. **`gvl.rs`**—`rb_thread_call_without_gvl` plus the atomic-flag UBF
   idiom. A worker blocking on the queue releases its per-ractor lock,
   stays interruptible (Thread#kill, shutdown) via bounded
   `recv_timeout` ticks, and costs nothing when the queue has work: the
   `try_recv` fast path skips the lock release entirely.
5. **The frozen env-string cache** (env_strings.rs)—exploits the one
   sharing channel Ractors *do* allow: frozen objects. Env keys, 44
   common header names, methods, LRU'd host and peer-address values,
   built once on the main ractor and read by every worker forever.
   Without this, each ractor would allocate every key on every request.
6. **The Responder** (response.rs)—responses travel back as Rust types
   through a oneshot/frame channel into hyper, never as shared Ruby
   objects. Its first-claimant atomic is also what makes crash recovery
   work: the supervisor (on the main ractor) can 500 a dead ractor's
   in-flight requests through `Weak` references into native memory.
   That cross-ractor cleanup would be impossible if request state were
   Ruby-side.
7. **tokio + hyper owning all I/O**—sockets are exactly the kind of
   unshareable object that poisons ractor designs; here Ruby never
   touches one.

## Why it is fast, by the numbers

With the dispatch cost eliminated, Ractors deliver the thing they were
built for—a lock per ractor instead of one GVL—and each layer is
visible in the [benchmarks](benchmarks.md): `/cpu` at 66.7k req/s in
ractor mode vs **13.3k threaded (5×, the GVL ceiling)**, matching the
fork cluster's CPU parallelism while holding **57 MB against the
cluster's 1,078 MB**, because eight ractors share one VM, one Rust
front-end, one queue, and one JIT, where eight forks each pay full
price.

The cleanest proof of the design is the threaded fallback itself: it
reuses ~95% of the same machinery, because the Rust core is
dispatch-agnostic. Ractors are not what makes Kino's engine fast—they
are what the engine finally makes *usable*.
|
data/exe/kino
ADDED
|
@@ -0,0 +1,26 @@
#!/usr/bin/env ruby
# frozen_string_literal: true

# kino [-C config.rb] [rackup_file]
#
# Boots a Rack app from a rackup file (default: config.ru) with settings
# from a Puma-style Ruby DSL config file (default: kino.rb if present).
# All the real work lives in Kino::CLI.start; this file only sets up the
# process-global bits an executable owns.

Warning[:experimental] = false
# Startup output must land immediately even when stdout is a pipe or file
# (process supervisors, `kino > server.log`); block buffering would hold
# the banner back until exit.
$stdout.sync = true

# Running from a git checkout (no installed gem, no bundler context):
# prefer the checkout's own lib so `require "kino"` resolves.
lib = File.expand_path("../lib", __dir__)
$LOAD_PATH.unshift(lib) if File.directory?(File.join(lib, "kino")) && !$LOAD_PATH.include?(lib)

# cli.rb loads no native code: --help and --version stay instant; the
# compiled extension loads only when a server actually boots.
require "kino/cli"

exit Kino::CLI.start(ARGV)
|