kino 0.1.0-aarch64-linux

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
data/doc/benchmarks.md ADDED
@@ -0,0 +1,321 @@
1
+ # Benchmarks: methodology and analysis
2
+
3
+ The result tables live in the [README](../README.md#benchmarks). This
4
+ document is the part that doesn't fit in a README: how the numbers were
5
+ produced, what they do and don't mean, and the investigations behind the
6
+ odd-looking columns.
7
+
8
+ The interesting question is not which server is faster—it's what a
9
+ Ractor-based dispatch model buys, and what it costs, relative to the
10
+ battle-tested approaches (Puma's forked workers, Falcon's
11
+ fiber-per-request). Puma is the reference point throughout because it's
12
+ the deployment most apps run today.
13
+
14
+ ## Methodology
15
+
16
+ - Primary hardware: AWS **c7a.2xlarge**—8 dedicated cores of AMD EPYC
17
+ 9R14 (Genoa), 16 GB RAM, Amazon Linux 2023, kernel 6.18. A realistic
18
+ app-server size, deliberately: nobody provisions a 32-core box per
19
+ app process.
20
+ - Toolchain built on the box via mise: Ruby 4.0.5 (**YJIT enabled**,
21
+ `RUBY_YJIT_ENABLE=1` for every server), Rust 1.96, Kino compiled in
22
+ the release profile.
23
+ - Load generator: wrk 4.2 on the same host, 8-second windows, 64
24
+ connections (`bench/run.sh 8 64`). Same-host load generation costs
25
+ both sides CPU equally; we verified the generator was not the
26
+ bottleneck by A/B-ing against single-threaded ab, which capped Kino's
27
+ plaintext 26-37% lower while leaving Puma's number unchanged.
28
+ - Identical app for every server (`bench/bench_app.rb`), Ractor-shareable
29
+ so Kino's `:ractor` mode can run it unmodified.
30
+ - Topology held equal: Puma 8 forked workers × 3 threads vs Kino
31
+ 8 workers × 3 threads in one process.
32
+ - Follow-up studies (`bench/studies.sh`): CPU recipe, topology sweep,
33
+ /io worker scaling, logging costs, and memory—run in the same session
34
+ as the headline tables.
35
+ - The harness waits for the port to be genuinely free between targets.
36
+ This matters: falcon binds with `SO_REUSEPORT`, so a leftover instance
37
+ silently splits traffic with the next server and poisons every number
38
+ after it (we learned this the hard way).
39
+ - Secondary data point: macOS (MacBook Pro, M1 10-core), where every
40
+ server converges near the loopback ceiling (~42-49k plaintext) and
41
+ differences compress; its table is at the end of this document.
42
+ Earlier published numbers from Docker-on-Mac are retired—real
43
+ hardware contradicted several of that environment's findings, noted
44
+ inline below where the conclusion changed.
45
+
46
+ ## Reading the headline tables
47
+
48
+ - **Plaintext/10k**: Kino's tokio front-end clears the fork cluster by
49
+ 1.4-2× (lanes plaintext 241,501 vs Puma 117,838 = 2.05×; the smallest
50
+ margin is threaded /10k at 1.44×). The cross-ractor handoff shows up
51
+ as ractor (201k) trailing threaded (218k) on trivial handlers—nothing
52
+ in them needs parallel Ruby—and lane dispatch reverses that (241k).
53
+ - **CPU (recursive fib)**: ractor mode does **5× its own GVL-bound
54
+ threaded mode** (66,735 vs 13,298)—that's the entire point of
55
+ ractors—and beats the fork cluster outright: +15% with stock
56
+ defaults, +21% with lanes (70,373 vs 58,207).
57
+ - **Memory**: serving the same loaded bench app, Kino held **57 MB
58
+ (ractor) / 50 MB (threaded)** where the 8-worker cluster held
59
+ **1,078 MB**—a fork per core pays one full copy of the VM, the app,
60
+ and its YJIT-compiled code per worker. On the Rails hello-world:
61
+ Kino 97 MB vs cluster 797 MB.
62
+ - **I/O (5 ms wait)**: all dispatch models tie within ~4% at equal slot
63
+ counts; the lever that matters is slot count, see
64
+ [below](#why-io-lags-in-ractor-mode-on-linux).
65
+
66
+ ## CPU-bound tuning
67
+
68
+ On real hardware, Kino's stock defaults already lead the cluster on
69
+ pure CPU—same-session studies run:
70
+
71
+ | config | /cpu req/s |
72
+ |---|---:|
73
+ | Puma cluster (reference) | 58,376 |
74
+ | Kino `workers 8, threads 3`, tokio auto (default) | 68,257 |
75
+ | Kino `workers 8, threads 1, tokio_threads 1` (recipe) | 68,629 |
76
+
77
+ The tuned recipe is a wash (+0.5%)—and it still costs plaintext
78
+ (112,815 vs ~200k) and /io (1,532, 8 slots): on this hardware there is
79
+ no reason to use it. **This is a finding that changed with the
80
+ environment**: in the earlier Docker-on-Mac runs the recipe was worth
81
+ +12%, because tokio threads and wake churn competed for oversubscribed
82
+ virtualized cores. If you deploy into a constrained/virtualized
83
+ environment, the recipe may still pay; measure there.
84
+
85
+ Two findings that survived the environment change:
86
+
87
+ **Ruby executes at identical per-core speed in ractors and forks.** We
88
+ briefly believed parallel ractors paid a ~24% execution tax; that probe
89
+ compared 8 busy ractors against a *single* busy thread, which also
90
+ compares all-core clocks against single-core boost. The controlled probe
91
+ (8 forked processes vs 8 ractors, same all-core clock): forks 8,973
92
+ fib/s/core, ractors 8,918—identical. `GC.disable` changes nothing (fib
93
+ barely allocates). The VM is innocent. Never compare parallel against
94
+ single-threaded baselines without controlling for clocks.
95
+
96
+ **The GVL ceiling is absolute.** Threaded mode posts the same ~13k /cpu
97
+ whatever the topology—24 threads in one process serialize on one lock.
98
+ Parallelism for CPU-bound Ruby comes from ractors or forks, nothing else.
99
+
100
+ ## Why /io lags in ractor mode on Linux
101
+
102
+ On bare metal the gap is small: ractor /io 4,527 vs threaded 4,715
103
+ (−4%). In Docker it was −18%, and a pure-Ruby probe there measured
104
+ `sleep(0.005)` waking +2.3-2.8 ms late inside ractors vs +1.8 ms on the
105
+ main ractor—non-main-ractor timer wakeups are coarser in Ruby 4.0, but
106
+ how much that costs depends heavily on the kernel/virtualization stack.
107
+ A follow-up probe showed `IO.select`-style waits are tighter than
108
+ `sleep` inside ractors, so real I/O readiness suffers less than timers.
109
+
110
+ **Mitigation 1—`Kino.sleep`:** releases the GVL and waits on the OS
111
+ clock directly (chunked, so `Thread#kill`/shutdown stay responsive). The
112
+ `/io_native` endpoint (same 5 ms wait via `Kino.sleep` when available)
113
+ erases the remaining ractor gap on this box: 4,714 vs 4,527 plain sleep.
114
+
115
+ **Mitigation 2—add workers; they're nearly free.** Wait-bound
116
+ throughput is simply `slots ÷ effective wait`, and Kino's slots cost ~a
117
+ thread each, not a forked process: `workers 32, threads 1` measured
118
+ **5,922 /io (+27% over the 24-thread cluster's 4,672) and 6,254
119
+ /io_native (+34%)**, still one small process. A fork cluster buying the
120
+ same 32 slots pays for them in full copies of the app.
121
+
122
+ ## The ractor-pool-wrapper comparison
123
+
124
+ A reasonable first experiment for anyone curious about ractors is a Rack
125
+ wrapper that ships each request to a ractor pool on whatever server they
126
+ already run. `bench/ractor_wrapper.rb` is that experiment, benchmarked on
127
+ Puma and Falcon—not as a comparison of those servers, but to measure
128
+ what the Rack-level hop itself costs (c7a.2xlarge, same session):
129
+
130
+ | endpoint | Kino :ractor | Puma + wrapper | Falcon + wrapper |
131
+ |------------|-------------:|---------------:|-----------------:|
132
+ | /plaintext | 201,472 | 19,425 | 100,624 |
133
+ | /cpu (fib) | 66,735 | 17,106 | 49,083 |
134
+ | /io (5 ms) | 4,527 | 1,447 | 1,549 |
135
+
136
+ Inside the Rack contract, the wrapper must reduce the env to a shareable
137
+ subset, copy it to the worker ractor, copy the response back, and hold a
138
+ server thread for the round trip—that's the 10× gap in the Puma
139
+ column, and it would be the same for any server in that position. The
140
+ Falcon numbers mostly show its per-core forks doing the work (the
141
+ per-fork ractor adds little), while `Port#receive` blocking its event
142
+ loop is what limits the I/O endpoints—ractors and fiber schedulers
143
+ don't compose yet, which is a Ruby-level limitation, not a Falcon one.
144
+ Our conclusion: ractor dispatch needs to live at the server layer, below
145
+ the Rack contract—which is the experiment this gem exists to run.
146
+
147
+ ## Rails
148
+
149
+ The example app (`examples/rails-hello`, edge Rails, production mode,
150
+ 8 workers × 5 threads) on the same box:
151
+
152
+ | | req/s | RSS under load |
153
+ |---|---:|---:|
154
+ | Kino `:threaded` (one process) | 2,298 | **97 MB** |
155
+ | Puma cluster (8 workers) | 11,923 | 797 MB |
156
+
157
+ This is the honest version of the Rails story: in threaded mode Kino is
158
+ one GVL-bound process, so the fork cluster outruns it ~5× by using all
159
+ 8 cores—at 8× the memory. Rails-on-Ractors is interesting precisely
160
+ because it would close that throughput gap at the one-process memory
161
+ cost; the upstream blockers are documented in
162
+ [rails-on-ractors.md](rails-on-ractors.md).
163
+
164
+ ## YJIT × Ractors gotcha (found the hard way)
165
+
166
+ Plain `def` methods get the full YJIT speedup inside worker ractors
167
+ (5.8× in our probe—YJIT + Ractors compose fine in Ruby 4.0). But a
168
+ **self-referential lambda** (`fib = ->(x) { ... fib.call(x - 1) ... }`)
169
+ runs *slower* with YJIT than without when shared across parallel
170
+ ractors. Keep hot-path code in methods (which real apps do anyway); our
171
+ own /cpu benchmark was a victim of this pattern until it wasn't.
172
+
173
+ ## Rust-side allocator: mimalloc
174
+
175
+ The native extension's global allocator is **mimalloc**, unconditionally.
176
+ It covers all Rust-side allocations—request/response buffers, hyper,
177
+ tokio, channels—not the Ruby heap. The decision came from a three-way
178
+ shoot-out (measured in the earlier Docker-on-Linux environment, one
179
+ container session):
180
+
181
+ | allocator | /plaintext | /10k | /cpu |
182
+ |-----------|-----------:|-----:|-----:|
183
+ | system (glibc) | 131,978 | 114,121 | 45,690 |
184
+ | **mimalloc** | **145,229** | 113,907 | 45,351 |
185
+ | jemalloc | 134,632 | 111,341 | 47,273 |
186
+
187
+ mimalloc won plaintext by ~10% with everything else flat, with no
188
+ downside measured. For the record, jemalloc inside a Ruby extension also
189
+ needs its `disable_initial_exec_tls` build flag just to load (dlopen +
190
+ initial-exec TLS = `cannot allocate memory in static TLS block`)—one
191
+ more reason to prefer mimalloc in dlopen'd extensions.
192
+
193
+ ## Run-to-run variance (a.k.a. "is this a regression?")
194
+
195
+ Rule of thumb from chasing this twice: never compare numbers from
196
+ different sessions; interleave A/B rounds in one session instead. The
197
+ Docker-on-Mac environment swung ±10% on /cpu between sessions with the
198
+ VM's mood; the dedicated c7a box is far steadier (same-session repeats
199
+ land within ~1-2%), but the discipline stays—every comparative claim in
200
+ these docs comes from same-session pairs.
201
+
202
+ ## Topology notes
203
+
204
+ Measured on c7a.2xlarge, plaintext, ractor mode, same session: `8×3`
205
+ (workers×threads) = 199,470, `8×1` = **232,469 (+17%)**, `16×1` =
206
+ 214,284. Threads inside one ractor share its lock, so every request
207
+ handled by a 3-thread ractor pays a lock handoff that a 1-thread ractor
208
+ doesn't (`perf` in the earlier Docker sessions attributed ~10% of cycles
209
+ to `rb_native_mutex_unlock`/`thread_sched_wakeup_next_thread` at 8×3;
210
+ the +17% reproduces exactly on real hardware). Threads-per-ractor exist
211
+ for handlers that block on I/O; if yours don't, run `threads 1` and let
212
+ workers = cores do the parallelism. (16×1 being worse than 8×1 also says
213
+ the shared MPMC queue is *not* the bottleneck—8 extra parked consumers
214
+ just add scheduler churn.)
215
+
216
+ ## What profiling tried and rejected
217
+
218
+ A perf profile of saturated plaintext showed ~20% of cycles in futex
219
+ wakeups: at saturation the queue oscillates around empty, so nearly every
220
+ request parks its worker and pays two wake round-trips (tokio firing the
221
+ channel signal + the GVL reacquire). The textbook fix—a bounded
222
+ busy-poll before parking—measured **worse** (-13% on both modes): with
223
+ workers + tokio threads oversubscribing the cores, spinners steal exactly
224
+ the CPU the event loop needs. Parking is the cheaper evil when threads
225
+ outnumber cores; the fix that did land is the fused
226
+ `respond_and_take` call (answer the previous request and block for the
227
+ next in one FFI crossing) plus the opt-in `batch` directive for grabbing
228
+ several queued requests per visit (default 1—values above 1 add
229
+ head-of-line blocking behind slow handlers and stretch effective queue
230
+ depth, which is also why it's a config knob and not a hardcoded win).
231
+
232
+ ## Lane dispatch (experimental: `lanes true`)
233
+
234
+ The shared-queue design pays a futex wake per request at saturation
235
+ (every worker parks on an empty queue between requests). Lane mode gives
236
+ each worker a small private queue; the dispatcher prefers *awake* lanes—so
237
+ a hot worker keeps taking without ever being woken—with two
238
+ safeguards: lane depth is capped at 4, and workers steal from siblings
239
+ before parking (plus on every park tick), so a slow request can't strand
240
+ its lane's backlog.
241
+
242
+ Same-session A/B on c7a.2xlarge (ractor mode):
243
+
244
+ | endpoint | shared queue | lanes | delta |
245
+ |----------|-------------:|------:|------:|
246
+ | /plaintext | 201,472 | **241,501** | **+20%** |
247
+ | /10k | 156,635 | 183,564 | +17% |
248
+ | /cpu | 66,735 | 70,373 | +5% |
249
+ | /io | 4,527 | 4,530 | flat |
250
+
251
+ On this hardware lanes make ractor mode the fastest Kino configuration
252
+ outright—+11% over threaded mode's plaintext, where the shared queue
253
+ trails it. (On loopback-bound macOS, lanes lose a few percent instead;
254
+ see the secondary table below.) It stays opt-in for now because overload
255
+ semantics differ from the shared queue (`queue_depth` doesn't apply;
256
+ capacity is lanes × 4 with brief dispatcher retries up to
257
+ `queue_timeout` before the 503), and crash semantics, stealing fairness,
258
+ and drain behavior have spec coverage but not production mileage.
259
+
260
+ ## Logging costs
261
+
262
+ Measured at full plaintext saturation (one log line per request—rates
263
+ that no real deployment logs at; treat these as worst-case ceilings, not
264
+ typical costs):
265
+
266
+ | case (8×3, same session) | req/s |
267
+ |---|---:|
268
+ | threaded, no logging | 217,113 |
269
+ | threaded, `log_requests true` (native access log) | 193,200 (−11%) |
270
+ | ractor, access log off / on | 198,624 / 183,565 (−8%) |
271
+ | app logs 1 line/req via shared `::Logger` (file) | **62,962** |
272
+ | app logs 1 line/req via `Kino::Logger` (file) | **150,810 (2.4×)** |
273
+
274
+ The shared-`::Logger` cost is the mutex: 24 worker threads serialize
275
+ through one lock plus a write syscall per line. `Kino::Logger` hands the
276
+ formatted line to a lock-free channel and returns—the remaining cost vs
277
+ not logging at all is Ruby-side formatting, which no device can remove.
278
+ (In the Docker environment the same comparison showed 8.5×—overlay-fs
279
+ write latency punished the synchronous logger far harder than this box's
280
+ NVMe does. The ranking is environment-independent; the multiple is not.)
281
+
282
+ One trade-off worth knowing: the sink **never blocks** request threads,
283
+ so at absurd rates against a slow disk it drops lines once its 8192-line
284
+ buffer is full, while `::Logger` writes every line by strangling
285
+ throughput instead. Pick your failure mode; at sane logging rates
286
+ neither happens.
287
+
288
+ Puma comparison note: request logging is opt-in there too (`--quiet` is
289
+ the default, `-v/--log-requests` enables it)—Kino's default-off
290
+ `log_requests` matches the ecosystem's standard behavior.
291
+
292
+ ## Hot-path notes
293
+
294
+ For the curious, the dispatch-path work behind the numbers: a try-pop
295
+ fast path skips the GVL release when a request is already queued,
296
+ bodyless requests spawn no body-forwarder task, `TCP_NODELAY`, a frozen
297
+ (Ractor-shareable) cache of env keys + common header names built once at
298
+ init, response headers read in place across the FFI boundary, ahash for
299
+ per-request lookups, SmallVec for header joins. Details in
300
+ [architecture.md](architecture.md).
301
+
302
+ ## Secondary data point: macOS
303
+
304
+ MacBook Pro (M1 10-core), ab with keep-alive, 10 workers × 3 threads.
305
+ Everything converges near the loopback ceiling and differences compress;
306
+ useful mainly as a sanity check that the ranking holds on a second OS:
307
+
308
+ | endpoint | Kino :ractor | + lanes | Kino :threaded | Puma (cluster) |
309
+ |-------------|-------------:|--------:|---------------:|---------------:|
310
+ | /plaintext | 48,441 | 44,000 | 49,352 | 44,594 |
311
+ | /10k | 45,840 | 42,362 | 46,482 | 42,890 |
312
+ | /cpu (fib) | 43,827 | 41,426 | 10,076 | 34,161 |
313
+ | /io (5 ms) | 4,943 | 4,883 | 4,890 | 4,780 |
314
+ | /io_native | 4,758 | 4,844 | 4,848 | 4,805 |
315
+
316
+ Notes from this environment: lanes lose a few percent (the loopback
317
+ stack, not dispatch, is the ceiling); macOS loopback occasionally stalls
318
+ entire benchmark windows under back-to-back runs (the harness sleeps
319
+ between endpoints to reduce this; treat any isolated collapsed cell as
320
+ suspect and re-run); the ractor /cpu margin over the cluster (+28%) is
321
+ wider than on Linux because macOS Puma forks pay a higher per-fork cost.
@@ -0,0 +1,50 @@
1
+ # Rails on Kino, and the state of Rails-on-Ractors
2
+
3
+ `examples/rails-hello` runs edge Rails on Kino.
4
+
5
+ ## What works as of Kino 0.1.0: `:threaded` mode
6
+
7
+ Rails 8.2.0.alpha boots and serves with `mode :threaded` (see the
8
+ example's `kino.rb`; just `bundle exec kino` in that directory). Measured
9
+ on the hello-world (c7a.2xlarge, 8 cores, production mode, 8×5):
10
+ ~2.3k req/s in 97 MB, single process. The 8-worker Puma cluster reaches
11
+ ~11.9k in 797 MB by parallelizing across forks—Rails-on-Ractors is
12
+ interesting precisely because it could offer that ~5× parallelism at
13
+ ~1/8th of the memory.
14
+
15
+ Pair it with production-style Rails settings: eager load, no code
16
+ reloading, database pool ≥ workers × threads, logger to stdout or another
17
+ thread-safe device.
18
+
19
+ ## What doesn't (yet): `:ractor` mode—with the exact blockers
20
+
21
+ `examples/rails-hello/ractor_probe.rb` re-tests this against whatever
22
+ Rails is bundled; when it prints SUCCEEDED *and* a 200 ractor-mode
23
+ response (sharing the app is only the first blocker), flip
24
+ `mode :ractor`. As of June 2026 (rails main):
25
+
26
+ 1. **Sharing the booted app fails.**
27
+ `Ractor.make_shareable(Rails.application)` raises
28
+ `Ractor::IsolationError: Proc's self is not shareable` at
29
+ `railties/lib/rails/application/finisher.rb:151`—a routes-reloader
30
+ hook lambda Rails registers at boot (in the development path) whose
31
+ `self` is the unshareable application instance.
32
+ 2. **Calling the app from a worker ractor fails** even where
33
+ `make_shareable` succeeds on a class:
34
+ `Ractor::IsolationError: can not get unshareable values from instance
35
+ variables of classes/modules from non-main Ractors (@instance from
36
+ HelloApp)`—`Rails.application` is a class-level ivar read on the
37
+ dispatch path.
38
+ 3. **Every workaround is closed by one VM rule** (verified on Ruby 4.0.5):
39
+ class instance variables are main-ractor-only—non-main ractors can
40
+ neither read them (when the value is unshareable) nor set them, **even
41
+ inside a `Ruby::Box`** (`RUBY_BOX=1`). Box isolates constant tables,
42
+ but Ractor isolation still applies to box-local classes:
43
+
44
+ ```
45
+ Ractor::IsolationError: can not set instance variables of
46
+ classes/modules by non-main Ractors
47
+ ```
48
+
49
+ So booting a private Rails per worker ractor (boxed or not) dies on
50
+ the first class-ivar write, and sharing one boot dies on reads.
data/doc/why-kino.md ADDED
@@ -0,0 +1,91 @@
1
+ # Why Kino is a Ractor server
2
+
3
+ What makes this server ractor-specific, why do Ractors work fast here
4
+ when they are slow everywhere else, and which Rust parts deserve the
5
+ credit. Companion to [architecture.md](architecture.md), which describes
6
+ the same machinery piece by piece.
7
+
8
+ ## The problem Kino exists to solve
9
+
10
+ Ractors forbid sharing mutable Ruby objects. That single rule breaks
11
+ every conventional server design. Puma hands env Hashes, socket objects,
12
+ and request state between threads freely; with Ractors, a Ruby object
13
+ created on the accepting side cannot be given to a worker—`Ractor#send`
14
+ deep-copies it, and sockets cannot cross at all.
15
+
16
+ We measured what the "obvious" workaround costs. The ractor-pool wrapper
17
+ experiment (reduce the env to a shareable subset, copy it to a worker
18
+ over a `Ractor::Port`, copy the response back) runs at **19k req/s where
19
+ Kino does 201k** on the same hardware—see the
20
+ [wrapper comparison](benchmarks.md#the-ractor-pool-wrapper-comparison).
21
+ Copying at the Rack layer eats the entire ractor dividend. Dispatch has
22
+ to live below the Rack contract.
23
+
24
+ ## The trick: shared state lives below the FFI line
25
+
26
+ Kino's answer is that **the request never exists as a Ruby object until
27
+ it is already inside the worker ractor**. The whole request—method, URI,
28
+ headers, body stream, response channel—lives in native memory as a Rust
29
+ `RequestCtx`. The only things that cross ractor boundaries in Ruby are
30
+ what Ractor law allows for free: integers (server id, worker ids) and
31
+ the frozen, shareable app. When a worker calls `take_one`, the env Hash
32
+ and the `Kino::Native::Request` handle are *constructed inside that
33
+ worker's ractor*—ownership is correct by construction, nothing is
34
+ copied, nothing is shared.
35
+
36
+ Put differently: Ractors cannot have shared mutable memory in Ruby, so
37
+ Kino moves all shared mutable state into Rust, where `Send`/`Sync` rules
38
+ apply instead of ractor isolation. Ruby sees only ids and frozen
39
+ objects; Rust sees one queue and one registry.
40
+
41
+ ## The Rust parts to thank, in order of credit
42
+
43
+ 1. **`rb_ext_ractor_safe(true)`** (lib.rs)—one line, and the
44
+ precondition for everything: without it, *any* native call from a
45
+ non-main ractor raises `Ractor::UnsafeError`. The spec suite keeps a
46
+ canary test on this.
47
+ 2. **The registry** (registry.rs)—global Rust-side state keyed by
48
+ `u64`. This is the "shared memory" Ractors legally cannot have:
49
+ workers address everything by integer, and no TypedData object ever
50
+ crosses a boundary.
51
+ 3. **The flume MPMC queue**—the cross-ractor work distributor. One
52
+ queue, async send from tokio, blocking receive from any ractor's
53
+ thread. Ruby has no ractor-safe equivalent that does not copy
54
+ (Ports copy every message); a Rust channel moves a
55
+ `Box<RequestCtx>` pointer.
56
+ 4. **`gvl.rs`**—`rb_thread_call_without_gvl` plus the atomic-flag UBF
57
+ idiom. A worker blocking on the queue releases its per-ractor lock,
58
+ stays interruptible (Thread#kill, shutdown) via bounded
59
+ `recv_timeout` ticks, and costs nothing when the queue has work: the
60
+ `try_recv` fast path skips the lock release entirely.
61
+ 5. **The frozen env-string cache** (env_strings.rs)—exploits the one
62
+ sharing channel Ractors *do* allow: frozen objects. Env keys, 44
63
+ common header names, methods, LRU'd host and peer-address values,
64
+ built once on the main ractor and read by every worker forever.
65
+ Without this, each ractor would allocate every key on every request.
66
+ 6. **The Responder** (response.rs)—responses travel back as Rust types
67
+ through a oneshot/frame channel into hyper, never as shared Ruby
68
+ objects. Its first-claimant atomic is also what makes crash recovery
69
+ work: the supervisor (on the main ractor) can 500 a dead ractor's
70
+ in-flight requests through `Weak` references into native memory.
71
+ That cross-ractor cleanup would be impossible if request state were
72
+ Ruby-side.
73
+ 7. **tokio + hyper owning all I/O**—sockets are exactly the kind of
74
+ unshareable object that poisons ractor designs; here Ruby never
75
+ touches one.
76
+
77
+ ## Why it is fast, by the numbers
78
+
79
+ With the dispatch cost eliminated, Ractors deliver the thing they were
80
+ built for—a lock per ractor instead of one GVL—and each layer is
81
+ visible in the [benchmarks](benchmarks.md): `/cpu` at 66.7k req/s in
82
+ ractor mode vs **13.3k threaded (5×, the GVL ceiling)**, matching the
83
+ fork cluster's CPU parallelism while holding **57 MB against the
84
+ cluster's 1,078 MB**, because eight ractors share one VM, one Rust
85
+ front-end, one queue, and one JIT, where eight forks each pay full
86
+ price.
87
+
88
+ The cleanest proof of the design is the threaded fallback itself: it
89
+ reuses ~95% of the same machinery, because the Rust core is
90
+ dispatch-agnostic. Ractors are not what makes Kino's engine fast—they
91
+ are what the engine finally makes *usable*.
data/exe/kino ADDED
@@ -0,0 +1,26 @@
1
+ #!/usr/bin/env ruby
2
+ # frozen_string_literal: true
3
+
4
+ # kino [-C config.rb] [rackup_file]
5
+ #
6
+ # Boots a Rack app from a rackup file (default: config.ru) with settings
7
+ # from a Puma-style Ruby DSL config file (default: kino.rb if present).
8
+ # All the real work lives in Kino::CLI.start; this file only sets up the
9
+ # process-global bits an executable owns.
10
+
11
+ Warning[:experimental] = false
12
+ # Startup output must land immediately even when stdout is a pipe or file
13
+ # (process supervisors, `kino > server.log`); block buffering would hold
14
+ # the banner back until exit.
15
+ $stdout.sync = true
16
+
17
+ # Running from a git checkout (no installed gem, no bundler context):
18
+ # prefer the checkout's own lib so `require "kino"` resolves.
19
+ lib = File.expand_path("../lib", __dir__)
20
+ $LOAD_PATH.unshift(lib) if File.directory?(File.join(lib, "kino")) && !$LOAD_PATH.include?(lib)
21
+
22
+ # cli.rb loads no native code: --help and --version stay instant; the
23
+ # compiled extension loads only when a server actually boots.
24
+ require "kino/cli"
25
+
26
+ exit Kino::CLI.start(ARGV)