kino 0.1.0-x86_64-linux → 0.1.2-x86_64-linux
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- checksums.yaml +4 -4
- data/CHANGELOG.md +37 -0
- data/README.md +103 -57
- data/doc/benchmarks.md +208 -89
- data/doc/rails-on-ractors.md +5 -4
- data/doc/why-kino.md +8 -8
- data/lib/kino/configuration.rb +14 -3
- data/lib/kino/kino.so +0 -0
- data/lib/kino/server.rb +21 -1
- data/lib/kino/templates/kino.rb.tt +63 -83
- data/lib/kino/version.rb +1 -1
- data/sig/kino.rbs +2 -0
- metadata +2 -2
checksums.yaml
CHANGED
|
@@ -1,7 +1,7 @@
|
|
|
1
1
|
---
|
|
2
2
|
SHA256:
|
|
3
|
-
metadata.gz:
|
|
4
|
-
data.tar.gz:
|
|
3
|
+
metadata.gz: 70ea3dc39cb1ed5a61650ab42a83e3a2ebae3d7b2079abcbc20530240750c687
|
|
4
|
+
data.tar.gz: 8efcd33a04f46a3d74b1ac1a6ffacf0ed82e33ed84602de2e34673cbd2a5c7dc
|
|
5
5
|
SHA512:
|
|
6
|
-
metadata.gz:
|
|
7
|
-
data.tar.gz:
|
|
6
|
+
metadata.gz: de5ad9a1d564811ce1377507422354eb1b0c82df35a28843506fbada9bb0cbc914bcacf428eacdbf0f66fa904f3663c1e71396866374576f4906fcdb7c1f2552
|
|
7
|
+
data.tar.gz: ae6ce1fb6b2dff351420aa2e2973813bb3cc73dd6af6df5859db7127b0accb4c276f583ea001564b528ebc8a68d13a26fb2675aeb9aeb946d2bc8882486a702b
|
data/CHANGELOG.md
CHANGED
|
@@ -1,3 +1,40 @@
|
|
|
1
|
+
## [0.1.2] - 2026-06-22
|
|
2
|
+
|
|
3
|
+
- Drop a connection that has not sent its complete request headers
|
|
4
|
+
within 15 seconds. Closes a slowloris hole: hyper's built-in header-read
|
|
5
|
+
timeout was inert because the server installed no timer, so a slow-header
|
|
6
|
+
client could tie up a connection (and its tokio task) indefinitely.
|
|
7
|
+
- Cap concurrent connections (new `max_connections` directive). Past the cap,
|
|
8
|
+
new connections wait in the kernel backlog instead of piling up until a
|
|
9
|
+
flood exhausts file descriptors or memory. Defaults to most of the process
|
|
10
|
+
open-file limit (`ulimit -n`), so it scales with the OS limit and only
|
|
11
|
+
engages under a flood.
|
|
12
|
+
- Bound the TLS handshake to 10 seconds. A client that completed the TCP
|
|
13
|
+
connect but stalled the handshake could otherwise hold a connection slot
|
|
14
|
+
indefinitely, since the request and header-read deadlines only begin once
|
|
15
|
+
the handshake finishes.
|
|
16
|
+
- Cap the request body at 50 MB by default (new `max_body_size` directive,
|
|
17
|
+
configurable; nil or 0 disables and delegates to a fronting proxy). An app
|
|
18
|
+
that reads `rack.input` could otherwise be driven to run out of memory by an
|
|
19
|
+
oversized or endless upload. A truthful oversize Content-Length is refused
|
|
20
|
+
with a 413 before the app runs; a chunked or lying client is cut off
|
|
21
|
+
mid-stream once it passes the cap.
|
|
22
|
+
- Bound the idle time between request-body frames to 30 seconds. A client that
|
|
23
|
+
began a request then stalled mid-body would otherwise keep a worker blocked
|
|
24
|
+
in `rack.input.read` indefinitely; now the read raises and the worker
|
|
25
|
+
reclaims its slot. Only a silent client trips it: a steadily-sent body resets
|
|
26
|
+
the deadline each frame, so slow-but-active uploads are unaffected.
|
|
27
|
+
|
|
28
|
+
## [0.1.1] - 2026-06-11
|
|
29
|
+
|
|
30
|
+
- Mode-dependent `threads` default: 1 per worker in :ractor mode (threads
|
|
31
|
+
inside a ractor share its lock and cost a per-request handoff; +16-18%
|
|
32
|
+
on fast handlers, measured on dedicated hardware), 3 in :threaded mode.
|
|
33
|
+
Explicit `threads` always wins; waiting-heavy ractor apps should raise
|
|
34
|
+
`workers` instead.
|
|
35
|
+
- `queue_timeout` default raised from 1 to 5 seconds: a brief burst now
|
|
36
|
+
waits out the spike instead of shedding 503s within a second.
|
|
37
|
+
|
|
1
38
|
## [0.1.0] - 2026-06-11
|
|
2
39
|
|
|
3
40
|
Initial release.
|
data/README.md
CHANGED
|
@@ -11,14 +11,14 @@ on every core in **one small process**. A **Rust** (tokio + hyper)
|
|
|
11
11
|
front-end owns the network, parallel **Ractors** run your Rack 3 app,
|
|
12
12
|
and a threaded fallback mode runs everything else, Rails included.
|
|
13
13
|
|
|
14
|
-
* **Fast.** On a real 8-core server, every Kino mode is **1.
|
|
15
|
-
of a
|
|
16
|
-
|
|
17
|
-
* **A fraction of the memory.**
|
|
18
|
-
about **
|
|
19
|
-
|
|
20
|
-
|
|
21
|
-
|
|
14
|
+
* **Fast.** On a real 8-core server, every Kino mode is **1.5-2×**
|
|
15
|
+
ahead of a Puma fork cluster on I/O-light endpoints. Ractor mode also
|
|
16
|
+
wins on pure CPU, **30%+**. [Benchmarks](#benchmarks) below.
|
|
17
|
+
* **A fraction of the memory.** Aabout **~7×** on the simplistic bench
|
|
18
|
+
Ractor app, and about **4× less memory** than a Puma cluster serving Rails in fallback threaded mode.
|
|
19
|
+
* **Parallel without forking.** Ractor mode runs CPU work **more than
|
|
20
|
+
5× faster** than Kino's own GVL-bound threaded mode, in the same
|
|
21
|
+
small process.
|
|
22
22
|
* **Production plumbing included.** Graceful drain, crash supervision
|
|
23
23
|
and respawn, bounded queues with 503 backpressure, request timeouts,
|
|
24
24
|
TLS (rustls), live stats, async access and app logging.
|
|
@@ -63,63 +63,108 @@ notes live in [doc/architecture.md](doc/architecture.md).
|
|
|
63
63
|
## Benchmarks
|
|
64
64
|
|
|
65
65
|
Measured on a real server: AWS **c7a.2xlarge** (8-core AMD EPYC 9R14,
|
|
66
|
-
16 GB, Amazon Linux 2023). This is a realistic app-server size.
|
|
67
|
-
|
|
68
|
-
|
|
69
|
-
|
|
70
|
-
|
|
66
|
+
16 GB, Amazon Linux 2023). This is a realistic app-server size.
|
|
67
|
+
|
|
68
|
+
**These tables run a tiny synthetic Rack app**—plaintext, a 10 KB body,
|
|
69
|
+
a CPU-bound `fib`, a 5 ms wait—deliberately small, to measure the server
|
|
70
|
+
rather than an app. It is Ractor-shareable, so Kino runs it in `:ractor`
|
|
71
|
+
mode (and `:threaded` for comparison). **A real Rails app is a different
|
|
72
|
+
story:** it is *not* Ractor-shareable, so it runs only in Kino's
|
|
73
|
+
`:threaded` fallback, with its own numbers—see [Rails](#rails) below.
|
|
74
|
+
Ruby 4.0.5 with YJIT, every server at its defaults: Puma forks 8 workers ×
|
|
75
|
+
3 threads, Kino stays in one process (8 workers; 1 thread each in ractor
|
|
76
|
+
modes, 3 in threaded). Numbers are req/s by wrk (8-second windows, 64
|
|
77
|
+
connections, same host). Methodology:
|
|
71
78
|
[doc/benchmarks.md](doc/benchmarks.md).
|
|
72
79
|
|
|
73
|
-
| endpoint | Kino :ractor | + lanes | Kino :threaded | Puma (cluster) |
|
|
74
|
-
|
|
75
|
-
| /plaintext |
|
|
76
|
-
| /10k |
|
|
77
|
-
| /cpu (fib) |
|
|
78
|
-
| /io (5 ms) |
|
|
79
|
-
| /io_native |
|
|
80
|
+
| endpoint | Kino :ractor | + lanes | :ractor, `workers 32`² | Kino :threaded | Puma (cluster) |
|
|
81
|
+
|-------------|-------------:|--------:|-----------------------:|---------------:|---------------:|
|
|
82
|
+
| /plaintext | 229,534 | **250,222** | 182,997 | 216,994 | 118,176 |
|
|
83
|
+
| /10k | 178,083 | **189,862** | 151,034 | 160,400 | 106,768 |
|
|
84
|
+
| /cpu (fib) | **77,999**¹| 70,885 | 66,100 | 13,429 | 58,006 |
|
|
85
|
+
| /io (5 ms) | 1,552 | 1,551 | **5,888** | 4,709 | 4,693 |
|
|
86
|
+
| /io_native | 1,570 | 1,571 | **6,274** | 4,695 | 4,691 |
|
|
80
87
|
|
|
81
|
-
Memory on the
|
|
88
|
+
Memory tells two different stories depending on the app, both by **PSS**
|
|
89
|
+
(proportional set size; see note) after sustained load.
|
|
82
90
|
|
|
83
|
-
|
|
84
|
-
|
|
85
|
-
|
|
86
|
-
|
|
87
|
-
|
|
91
|
+
**The tiny benchmark app** (Ractor-shareable, so Kino runs it in `:ractor`
|
|
92
|
+
or `:threaded`). Kino is **~7× lighter in :ractor mode, ~10× in :threaded**
|
|
93
|
+
than the Puma cluster — the gap stays large because a trivial app is almost
|
|
94
|
+
all private per-worker heap, which copy-on-write can't share:
|
|
95
|
+
|
|
96
|
+
| tiny app, Kino | Kino (one process) | Puma cluster (8 workers) | ratio |
|
|
97
|
+
|-----------------|-------------------:|-------------------------:|------:|
|
|
98
|
+
| :ractor (8×1) | **148 MB** | 1,068 MB | ~7× |
|
|
99
|
+
| :threaded (8×3) | **107 MB**³| 1,068 MB | ~10× |
|
|
100
|
+
|
|
101
|
+
**A real Rails app** (not Ractor-shareable—Kino's `:threaded` fallback
|
|
102
|
+
only, [below](#rails)). The gap is **~4×**, smaller because Rails' large
|
|
103
|
+
framework *is* shared copy-on-write across Puma's forks:
|
|
104
|
+
|
|
105
|
+
| Rails hello-world | Kino :threaded | Puma cluster (8 workers) | ratio |
|
|
106
|
+
|-------------------|---------------:|-------------------------:|------:|
|
|
107
|
+
| **PSS** | **92 MB** | **389 MB** | ~4× |
|
|
88
108
|
|
|
89
109
|
"+ lanes" is the experimental per-worker-queue dispatcher (`lanes true`).
|
|
90
|
-
It
|
|
91
|
-
mode the fastest Kino configuration. Details:
|
|
110
|
+
It posts the fastest plaintext/10k of any configuration here. Details:
|
|
92
111
|
[doc/benchmarks.md](doc/benchmarks.md#lane-dispatch-experimental-lanes-true).
|
|
93
112
|
|
|
94
113
|
¹ Stock settings, no tuning. Ractor mode beats the fork cluster on pure
|
|
95
|
-
CPU by +
|
|
96
|
-
every single-process Ruby server hits. The CPU-tuning recipe
|
|
97
|
-
|
|
98
|
-
|
|
99
|
-
|
|
100
|
-
|
|
101
|
-
|
|
102
|
-
|
|
103
|
-
|
|
104
|
-
|
|
114
|
+
CPU by +34% (+22% with lanes). Threaded mode shows the GVL ceiling that
|
|
115
|
+
every single-process Ruby server hits. The old CPU-tuning recipe is
|
|
116
|
+
retired: its `threads 1` half **is** the default now, and its
|
|
117
|
+
`tokio_threads 1` half costs −12% on real hardware; see
|
|
118
|
+
[doc/benchmarks.md](doc/benchmarks.md#cpu-bound-tuning).
|
|
119
|
+
|
|
120
|
+
² Wait-bound throughput is slots ÷ wait, and the default columns bring
|
|
121
|
+
8 single-thread workers against the cluster's 24 threads. Kino slots
|
|
122
|
+
are threads, not processes—when your app waits a lot, raise `workers`.
|
|
123
|
+
The `workers 32` column is that tuning: **+25% over the cluster on /io
|
|
124
|
+
(+34% via `Kino.sleep`)** while still ahead of it on pure CPU, all in
|
|
125
|
+
one small process. The cost is the CPU-light rows (32 ractors
|
|
126
|
+
oversubscribe 8 cores); pick the topology your app's wait profile
|
|
127
|
+
needs. See
|
|
105
128
|
[doc/benchmarks.md](doc/benchmarks.md#why-io-lags-in-ractor-mode-on-linux).
|
|
106
129
|
|
|
130
|
+
³ With `MALLOC_ARENA_MAX=2` (the standard Ruby deployment setting;
|
|
131
|
+
Heroku's default). Without it, 24 threads churning 10 KB responses
|
|
132
|
+
through one glibc heap balloon to ~670 MB—an arena-fragmentation
|
|
133
|
+
footgun, not a leak, and ractor mode sidesteps it. See
|
|
134
|
+
[doc/benchmarks.md](doc/benchmarks.md#memory-under-load-and-the-glibc-arena-footgun).
|
|
135
|
+
|
|
107
136
|
A common first idea is to keep your current server and wrap the app in
|
|
108
137
|
a ractor pool. We measured that too (same box; the analysis is in the
|
|
109
138
|
doc):
|
|
110
139
|
|
|
111
|
-
| endpoint | Kino :ractor | Puma + ractor wrapper | Falcon + ractor wrapper |
|
|
112
|
-
|
|
113
|
-
| /plaintext |
|
|
114
|
-
| /cpu (fib) |
|
|
115
|
-
| /io (5 ms) |
|
|
116
|
-
|
|
117
|
-
|
|
118
|
-
|
|
119
|
-
|
|
120
|
-
|
|
121
|
-
|
|
122
|
-
|
|
140
|
+
| endpoint | Kino :ractor (8×3) | Puma + ractor wrapper | Falcon + ractor wrapper |
|
|
141
|
+
|------------|-------------------:|----------------------:|------------------------:|
|
|
142
|
+
| /plaintext | **193,826** | 19,480 | 99,776 |
|
|
143
|
+
| /cpu (fib) | **68,061** | 17,755 | 48,721 |
|
|
144
|
+
| /io (5 ms) | **4,530** | 1,454 | 1,549 |
|
|
145
|
+
|
|
146
|
+
### Rails
|
|
147
|
+
|
|
148
|
+
Rails is not Ractor-shareable today, so Kino serves it in `:threaded`
|
|
149
|
+
fallback — one GVL-bound process. On the same box (`examples/rails-hello`,
|
|
150
|
+
edge Rails, production, 8×5):
|
|
151
|
+
|
|
152
|
+
| Rails hello-world | req/s | memory (PSS) |
|
|
153
|
+
|------------------------------|-------:|-------------:|
|
|
154
|
+
| Kino :threaded (one process) | 2,637 | **92 MB** |
|
|
155
|
+
| Puma cluster (8 workers) | 12,138 | 389 MB |
|
|
156
|
+
|
|
157
|
+
The honest trade-off: Puma's fork cluster uses all 8 cores, so it serves
|
|
158
|
+
~4.6× the throughput — at ~4× the memory. Ractor-mode Rails would close
|
|
159
|
+
the throughput gap at one-process memory cost; the upstream blockers are
|
|
160
|
+
tracked in [doc/rails-on-ractors.md](doc/rails-on-ractors.md).
|
|
161
|
+
|
|
162
|
+
In short: on the tiny synthetic app, ractor mode beats fork-level CPU parallelism (**5.8×** Kino's
|
|
163
|
+
own GVL-bound threaded mode, +34% over the cluster) in one process, at
|
|
164
|
+
about 1/7th of the cluster's memory by PSS (~4× on a real Rails app).
|
|
165
|
+
Every Kino mode is 1.5-2.1× ahead of the cluster on I/O-light endpoints. The macOS numbers
|
|
166
|
+
(secondary; everything there hits the loopback ceiling) and the
|
|
167
|
+
YJIT × Ractors gotcha are in [doc/benchmarks.md](doc/benchmarks.md).
|
|
123
168
|
|
|
124
169
|
Reproduce: `bench/run.sh [seconds] [concurrency]` for the main table,
|
|
125
170
|
`bench/studies.sh` for the follow-ups (CPU recipe, topology, scaling,
|
|
@@ -174,10 +219,10 @@ server = Kino::Server.new(app,
|
|
|
174
219
|
bind: "127.0.0.1",
|
|
175
220
|
port: 9292, # 0 = ephemeral; read back via server.port
|
|
176
221
|
workers: Etc.nprocessors, # ractors (parallelism)
|
|
177
|
-
threads:
|
|
222
|
+
threads: 1, # per worker; ractor default 1, threaded default 3
|
|
178
223
|
mode: :auto, # :auto | :ractor | :threaded
|
|
179
224
|
queue_depth: 1024, # bounded queue; overflow → 503
|
|
180
|
-
queue_timeout:
|
|
225
|
+
queue_timeout: 5.0, # seconds before 503 on a full queue
|
|
181
226
|
request_timeout: nil, # seconds before a slow response becomes a 504 (nil = off)
|
|
182
227
|
shutdown_timeout: 30, # drain deadline
|
|
183
228
|
tls: { cert: "cert.pem", key: "key.pem" }, # file paths or inline PEM
|
|
@@ -210,7 +255,7 @@ kwargs and CLI flags > config file > defaults.
|
|
|
210
255
|
# kino.rb
|
|
211
256
|
port 9292
|
|
212
257
|
workers 8
|
|
213
|
-
threads
|
|
258
|
+
threads 1
|
|
214
259
|
mode :ractor
|
|
215
260
|
```
|
|
216
261
|
|
|
@@ -266,7 +311,7 @@ cost):
|
|
|
266
311
|
|
|
267
312
|
```ruby
|
|
268
313
|
server.stats
|
|
269
|
-
# => {mode: :ractor, lanes: false, workers: 8, threads:
|
|
314
|
+
# => {mode: :ractor, lanes: false, workers: 8, threads: 1, batch: 1,
|
|
270
315
|
# respawns: 0, queued: 0, in_flight: 2, served: 1041, rejected: 0,
|
|
271
316
|
# timeouts: 0}
|
|
272
317
|
# plus lane_depths: [...] when lane dispatch is on
|
|
@@ -276,19 +321,20 @@ From the outside, `kill -USR1 <pid>` prints the same snapshot as one line
|
|
|
276
321
|
(pair it with `pidfile` to find the pid):
|
|
277
322
|
|
|
278
323
|
```
|
|
279
|
-
Kino stats: mode=:ractor lanes=false workers=8 threads=
|
|
324
|
+
Kino stats: mode=:ractor lanes=false workers=8 threads=1 batch=1 respawns=0 queued=0 in_flight=2 served=1041 rejected=0 timeouts=0
|
|
280
325
|
```
|
|
281
326
|
|
|
282
327
|
## Logging
|
|
283
328
|
|
|
284
329
|
With one log line per request, `Kino::Logger` sustained **2.4× the
|
|
285
|
-
throughput of a shared `::Logger`** (
|
|
330
|
+
throughput of a shared `::Logger`** (149k vs 63k req/s on the benchmark
|
|
286
331
|
box). There are two native pieces. Both write through a lock-free
|
|
287
332
|
channel to a Rust flusher thread, so request threads never take a log
|
|
288
333
|
mutex and never make a write syscall:
|
|
289
334
|
|
|
290
335
|
- **Access log** (`log_requests true`): one line per request to stdout,
|
|
291
|
-
including the 503s that never reach your app.
|
|
336
|
+
including the 503s that never reach your app. Recommended in
|
|
337
|
+
development; cheap enough for production. On color terminals the
|
|
292
338
|
lines are tinted by status class: 2xx green, 3xx yellow, 4xx maroon,
|
|
293
339
|
5xx bright red:
|
|
294
340
|
|
data/doc/benchmarks.md
CHANGED
|
@@ -27,9 +27,29 @@ the deployment most apps run today.
|
|
|
27
27
|
plaintext 26-37% lower while leaving Puma's number unchanged.
|
|
28
28
|
- Identical app for every server (`bench/bench_app.rb`), Ractor-shareable
|
|
29
29
|
so Kino's `:ractor` mode can run it unmodified.
|
|
30
|
-
- Topology
|
|
31
|
-
|
|
32
|
-
|
|
30
|
+
- Topology: each server at its shipped defaults—Puma 8 forked workers
|
|
31
|
+
× 3 threads (24 slots) vs Kino 8 workers in one process (× 1 thread
|
|
32
|
+
in ractor modes, the default since 0.1.1; × 3 threads in threaded
|
|
33
|
+
mode). Equal-topology numbers (Kino at 8×3) are in the studies below.
|
|
34
|
+
- The headline tables also carry an io-tuned column (`workers 32,
|
|
35
|
+
threads 1`)—not a default, labeled as such—because the /io rows are
|
|
36
|
+
a slot-count story (see below).
|
|
37
|
+
- The dataset spans four identical c7a.2xlarge boxes: the original
|
|
38
|
+
measurements, a re-measure at the 0.1.1 defaults, the headline sweep,
|
|
39
|
+
and a final full re-validation (every table re-run from scratch).
|
|
40
|
+
Equal-config throughput reproduced across boxes within ~1-2%.
|
|
41
|
+
- **Memory is reported as PSS (proportional set size), not RSS.** A Puma
|
|
42
|
+
cluster forks N workers that share the Ruby VM and gem code
|
|
43
|
+
copy-on-write; summing each worker's RSS counts those shared pages up
|
|
44
|
+
to N times and overstates the cluster's real footprint. PSS divides
|
|
45
|
+
every shared page across the processes mapping it, so it reflects the
|
|
46
|
+
unique physical memory the cluster occupies—the only fair basis for
|
|
47
|
+
comparing one process against a fork-per-core cluster. We read it from
|
|
48
|
+
`/proc/<pid>/smaps_rollup` over the whole process tree, cross-checked
|
|
49
|
+
against `ps` (RSS) and `smem` (PSS). Kino serves from one process, so
|
|
50
|
+
its RSS ≈ PSS; the correction only moves Puma. (`bench/studies.sh`
|
|
51
|
+
reports both columns.)
|
|
52
|
+
- Follow-up studies (`bench/studies.sh`): CPU tuning, topology sweep,
|
|
33
53
|
/io worker scaling, logging costs, and memory—run in the same session
|
|
34
54
|
as the headline tables.
|
|
35
55
|
- The harness waits for the port to be genuinely free between targets.
|
|
@@ -45,42 +65,55 @@ the deployment most apps run today.
|
|
|
45
65
|
|
|
46
66
|
## Reading the headline tables
|
|
47
67
|
|
|
68
|
+
These tables all run the **tiny synthetic Ractor-shareable app**. The real
|
|
69
|
+
Rails app is not Ractor-shareable and runs only in threaded fallback—a
|
|
70
|
+
separate story with separate numbers, in [its own section](#rails).
|
|
71
|
+
|
|
48
72
|
- **Plaintext/10k**: Kino's tokio front-end clears the fork cluster by
|
|
49
|
-
1.
|
|
50
|
-
margin is threaded /10k at 1.
|
|
51
|
-
|
|
52
|
-
|
|
53
|
-
|
|
54
|
-
|
|
55
|
-
|
|
56
|
-
|
|
57
|
-
|
|
58
|
-
|
|
59
|
-
|
|
60
|
-
|
|
61
|
-
|
|
73
|
+
1.5-2.1× (lanes plaintext 250,222 vs Puma 118,176 = 2.12×; the
|
|
74
|
+
smallest margin is threaded /10k at 1.50×). At the old 3-thread
|
|
75
|
+
topology the cross-ractor handoff showed up as ractor trailing
|
|
76
|
+
threaded on trivial handlers; the 1-thread default reverses that
|
|
77
|
+
(ractor 230k vs threaded 217k) and lanes widen it (250k).
|
|
78
|
+
- **CPU (recursive fib)**: ractor mode does **5.8× its own GVL-bound
|
|
79
|
+
threaded mode** (77,999 vs 13,429)—that's the entire point of
|
|
80
|
+
ractors—and beats the fork cluster outright: +34% with stock
|
|
81
|
+
defaults (+22% with lanes, 70,885 vs 58,006). Even the io-tuned
|
|
82
|
+
`workers 32` topology stays ahead of the cluster on CPU (66,100).
|
|
83
|
+
- **Memory (PSS)**: after the full endpoint battery, the tiny app costs
|
|
84
|
+
Kino **148 MB** in ractor mode (107 MB threaded) against the 8-worker
|
|
85
|
+
cluster's **1,068 MB**—~7-10× lighter, because a trivial app is almost
|
|
86
|
+
all private per-worker heap that copy-on-write can't share. The real
|
|
87
|
+
Rails app narrows this to ~4× (its framework *is* shared CoW); both are
|
|
88
|
+
in [Memory under load](#memory-under-load-and-the-glibc-arena-footgun).
|
|
62
89
|
- **I/O (5 ms wait)**: all dispatch models tie within ~4% at equal slot
|
|
63
|
-
counts
|
|
64
|
-
|
|
90
|
+
counts, so the default columns show the ractor modes behind on /io
|
|
91
|
+
(8 slots vs the cluster's 24), and the `workers 32` column shows the
|
|
92
|
+
same engine winning (+25%, +34% via `Kino.sleep`) once it has more
|
|
93
|
+
slots than the cluster. The lever is slot count, and Kino slots are
|
|
94
|
+
cheap: see [below](#why-io-lags-in-ractor-mode-on-linux).
|
|
65
95
|
|
|
66
96
|
## CPU-bound tuning
|
|
67
97
|
|
|
68
|
-
On real hardware, Kino's stock defaults
|
|
69
|
-
|
|
98
|
+
On real hardware, Kino's stock defaults lead the cluster on pure
|
|
99
|
+
CPU—and the old tuning recipe is now obsolete. Same-session studies
|
|
100
|
+
run:
|
|
70
101
|
|
|
71
102
|
| config | /cpu req/s |
|
|
72
103
|
|---|---:|
|
|
73
|
-
| Puma cluster (reference) | 58,
|
|
74
|
-
| Kino `workers 8, threads 3
|
|
75
|
-
| Kino `workers 8, threads 1, tokio_threads 1` (recipe) | 68,
|
|
76
|
-
|
|
77
|
-
|
|
78
|
-
|
|
79
|
-
|
|
80
|
-
|
|
81
|
-
|
|
82
|
-
|
|
83
|
-
|
|
104
|
+
| Puma cluster (reference) | 58,189 |
|
|
105
|
+
| Kino `workers 8, threads 3` (the default before 0.1.1) | 67,394 |
|
|
106
|
+
| Kino `workers 8, threads 1, tokio_threads 1` (the old recipe) | 68,600 |
|
|
107
|
+
| Kino `workers 8, threads 1`, tokio auto (**the default**) | **77,999** |
|
|
108
|
+
|
|
109
|
+
The `threads 1` half of the old recipe became the default; the
|
|
110
|
+
`tokio_threads 1` half now *costs* −12% on /cpu (and still costs
|
|
111
|
+
plaintext: 108,523 vs 230k). Don't pin tokio threads. **The recipe's
|
|
112
|
+
history is an environment story**: in the earlier Docker-on-Mac runs it
|
|
113
|
+
was worth +12%, because tokio threads and wake churn competed for
|
|
114
|
+
oversubscribed virtualized cores; on dedicated cores the same pin
|
|
115
|
+
starves the I/O front-end instead. If you deploy into a
|
|
116
|
+
constrained/virtualized environment, measure there.
|
|
84
117
|
|
|
85
118
|
Two findings that survived the environment change:
|
|
86
119
|
|
|
@@ -99,8 +132,9 @@ Parallelism for CPU-bound Ruby comes from ractors or forks, nothing else.
|
|
|
99
132
|
|
|
100
133
|
## Why /io lags in ractor mode on Linux
|
|
101
134
|
|
|
102
|
-
On bare metal the gap is small: ractor /io 4,
|
|
103
|
-
(−4
|
|
135
|
+
On bare metal the gap is small at equal slot counts: ractor /io 4,530
|
|
136
|
+
vs threaded 4,709 (−4%, both at 8×3). In Docker it was −18%, and a
|
|
137
|
+
pure-Ruby probe there measured
|
|
104
138
|
`sleep(0.005)` waking +2.3-2.8 ms late inside ractors vs +1.8 ms on the
|
|
105
139
|
main ractor—non-main-ractor timer wakeups are coarser in Ruby 4.0, but
|
|
106
140
|
how much that costs depends heavily on the kernel/virtualization stack.
|
|
@@ -110,14 +144,19 @@ A follow-up probe showed `IO.select`-style waits are tighter than
|
|
|
110
144
|
**Mitigation 1—`Kino.sleep`:** releases the GVL and waits on the OS
|
|
111
145
|
clock directly (chunked, so `Thread#kill`/shutdown stay responsive). The
|
|
112
146
|
`/io_native` endpoint (same 5 ms wait via `Kino.sleep` when available)
|
|
113
|
-
erases the remaining ractor gap on this box: 4,
|
|
114
|
-
|
|
115
|
-
**Mitigation 2—add workers; they're nearly free.**
|
|
116
|
-
|
|
117
|
-
|
|
118
|
-
|
|
119
|
-
|
|
120
|
-
|
|
147
|
+
erases the remaining ractor gap on this box: 4,721 vs 4,530 plain sleep.
|
|
148
|
+
|
|
149
|
+
**Mitigation 2—add workers; they're nearly free.** The headline tables
|
|
150
|
+
show default ractor-mode /io at 1,552: that's 8 slots (the 1-thread
|
|
151
|
+
default) against the cluster's 24, because wait-bound throughput is
|
|
152
|
+
simply `slots ÷ effective wait`. Kino's slots cost ~a thread each, not
|
|
153
|
+
a forked process: the `workers 32, threads 1` column measured **5,888
|
|
154
|
+
/io (+25% over the 24-thread cluster's 4,693) and 6,274 /io_native
|
|
155
|
+
(+34%)**, still one small process, and still +14% ahead of the cluster
|
|
156
|
+
on pure CPU. Its cost is the CPU-light rows (183k plaintext vs 230k at
|
|
157
|
+
8×1: 32 ractors oversubscribe 8 cores). A fork cluster buying the same
|
|
158
|
+
32 slots pays for them in full copies of the app; Kino pays in
|
|
159
|
+
scheduler churn only where the cores are already saturated.
|
|
121
160
|
|
|
122
161
|
## The ractor-pool-wrapper comparison
|
|
123
162
|
|
|
@@ -127,11 +166,11 @@ already run. `bench/ractor_wrapper.rb` is that experiment, benchmarked on
|
|
|
127
166
|
Puma and Falcon—not as a comparison of those servers, but to measure
|
|
128
167
|
what the Rack-level hop itself costs (c7a.2xlarge, same session):
|
|
129
168
|
|
|
130
|
-
| endpoint | Kino :ractor | Puma + wrapper | Falcon + wrapper |
|
|
131
|
-
|
|
132
|
-
| /plaintext |
|
|
133
|
-
| /cpu (fib) |
|
|
134
|
-
| /io (5 ms) |
|
|
169
|
+
| endpoint | Kino :ractor (8×3) | Puma + wrapper | Falcon + wrapper |
|
|
170
|
+
|------------|-------------------:|---------------:|-----------------:|
|
|
171
|
+
| /plaintext | 193,826 | 19,480 | 99,776 |
|
|
172
|
+
| /cpu (fib) | 68,061 | 17,755 | 48,721 |
|
|
173
|
+
| /io (5 ms) | 4,530 | 1,454 | 1,549 |
|
|
135
174
|
|
|
136
175
|
Inside the Rack contract, the wrapper must reduce the env to a shareable
|
|
137
176
|
subset, copy it to the worker ractor, copy the response back, and hold a
|
|
@@ -146,18 +185,26 @@ the Rack contract—which is the experiment this gem exists to run.
|
|
|
146
185
|
|
|
147
186
|
## Rails
|
|
148
187
|
|
|
149
|
-
|
|
150
|
-
|
|
151
|
-
|
|
152
|
-
|
|
153
|
-
|
|
154
|
-
|
|
|
155
|
-
|
|
156
|
-
|
|
157
|
-
|
|
158
|
-
|
|
159
|
-
|
|
160
|
-
|
|
188
|
+
Rails is **not Ractor-shareable**, so Kino can only serve it in
|
|
189
|
+
`:threaded` fallback—this whole section is one GVL-bound Kino process,
|
|
190
|
+
never ractor mode. The example app (`examples/rails-hello`, edge Rails,
|
|
191
|
+
production mode, 8 workers × 5 threads) on the same box:
|
|
192
|
+
|
|
193
|
+
| | req/s | RSS | PSS |
|
|
194
|
+
|---|---:|---:|---:|
|
|
195
|
+
| Kino `:threaded` (one process) | 2,637 | 97 MB | **92 MB** |
|
|
196
|
+
| Puma cluster (8 workers) | 12,138 | 794 MB | **389 MB** |
|
|
197
|
+
|
|
198
|
+
This is the honest version of the Rails story. In threaded mode Kino is
|
|
199
|
+
one GVL-bound process, so the fork cluster outruns it ~4.6× by using all
|
|
200
|
+
8 cores—at ~4× the memory by PSS. The metric matters here: Puma's RSS
|
|
201
|
+
(794 MB) counts the shared Rails framework once per worker; PSS (389 MB)
|
|
202
|
+
counts it once, and that is the fair figure (the README's headline used
|
|
203
|
+
to read 8× off RSS). Preloading barely moves it—389 MB with
|
|
204
|
+
`preload_app!` vs 400 MB without—because Ruby's GC dirties most heap
|
|
205
|
+
pages, breaking copy-on-write, so even a preloaded cluster keeps a
|
|
206
|
+
private heap per worker. Rails-on-Ractors is interesting precisely
|
|
207
|
+
because it would close the throughput gap at the one-process memory
|
|
161
208
|
cost; the upstream blockers are documented in
|
|
162
209
|
[rails-on-ractors.md](rails-on-ractors.md).
|
|
163
210
|
|
|
@@ -190,6 +237,62 @@ needs its `disable_initial_exec_tls` build flag just to load (dlopen +
|
|
|
190
237
|
initial-exec TLS = `cannot allocate memory in static TLS block`)—one
|
|
191
238
|
more reason to prefer mimalloc in dlopen'd extensions.
|
|
192
239
|
|
|
240
|
+
## Memory under load (and the glibc arena footgun)
|
|
241
|
+
|
|
242
|
+
All figures are **PSS** (see [Methodology](#methodology)) after the full
|
|
243
|
+
endpoint battery (8 s each of /plaintext, /10k, /cpu, /io—a "warmed
|
|
244
|
+
production process", not a fresh boot, which measures ~26 MB for every
|
|
245
|
+
Kino mode). RSS is shown alongside so the copy-on-write correction is
|
|
246
|
+
visible.
|
|
247
|
+
|
|
248
|
+
### The tiny synthetic app
|
|
249
|
+
|
|
250
|
+
| config | RSS | PSS |
|
|
251
|
+
|---|---:|---:|
|
|
252
|
+
| Kino :ractor 8×1 (default) | 151 | **148** |
|
|
253
|
+
| Kino lanes 8×1 | 137 | **135** |
|
|
254
|
+
| Kino :ractor 8×3 | 171 | **169** |
|
|
255
|
+
| Kino :threaded 8×3 (`MALLOC_ARENA_MAX=2`) | 109 | **107** |
|
|
256
|
+
| Kino :threaded 8×3 (no arena cap) | 668 | **666**¹ |
|
|
257
|
+
| Puma cluster 8×3 | 1,213 | **1,068** |
|
|
258
|
+
|
|
259
|
+
The tiny app is ~7× lighter than the cluster in ractor mode, ~10× in
|
|
260
|
+
arena-capped threaded mode. RSS ≈ PSS for every Kino row (one process,
|
|
261
|
+
nothing to share) and within ~12% for Puma here: a trivial app has almost
|
|
262
|
+
no shared state, so Puma's footprint is ~1,051 MB of *private* per-worker
|
|
263
|
+
heap plus only ~18 MB shared (which RSS counts 8×). This is the case where
|
|
264
|
+
copy-on-write does **not** rescue the cluster—there is nothing to
|
|
265
|
+
share—so the RSS and PSS numbers nearly agree. (The old "80 MB / 15×"
|
|
266
|
+
figure was a lighter, plaintext-only load; the honest full-battery ractor
|
|
267
|
+
figure is ~148 MB, i.e. ~7×.)
|
|
268
|
+
|
|
269
|
+
¹ Not a leak: glibc malloc arena bloat. One 8-second /10k round takes
|
|
270
|
+
threaded mode from ~70 MB to ~670 MB and it never returns—24 threads
|
|
271
|
+
churning 10 KB strings through one process heap is the textbook glibc
|
|
272
|
+
arena-fragmentation case (the reason Rails ops set `MALLOC_ARENA_MAX=2`;
|
|
273
|
+
Heroku ships that default). With the cap the same battery ends at 107 MB
|
|
274
|
+
PSS, throughput unchanged. Ractor mode sidesteps the worst of it without
|
|
275
|
+
any env tweak—objects live in per-ractor heaps.
|
|
276
|
+
|
|
277
|
+
### Rails (threaded fallback)
|
|
278
|
+
|
|
279
|
+
Here copy-on-write **does** matter, which is exactly why PSS is mandatory:
|
|
280
|
+
|
|
281
|
+
| config | RSS | PSS |
|
|
282
|
+
|---|---:|---:|
|
|
283
|
+
| Kino :threaded (one process) | 97 | **92** |
|
|
284
|
+
| Puma cluster 8×3 (preload) | 794 | **389** |
|
|
285
|
+
|
|
286
|
+
Puma serves the same Rails framework from 8 forks that share it
|
|
287
|
+
copy-on-write; RSS counts that shared framework once per worker (794 MB),
|
|
288
|
+
PSS counts it once (389 MB). The fair ratio is **~4×**, not the ~8× a
|
|
289
|
+
naive RSS sum reports—this is the correction that prompted the whole
|
|
290
|
+
re-measure. Preload barely helps (389 vs 400 MB without): Ruby's GC
|
|
291
|
+
dirties most heap pages, breaking copy-on-write, so even a preloaded
|
|
292
|
+
cluster keeps a large private heap per worker. That is why "CoW should
|
|
293
|
+
make a fork cluster nearly free" is only half true—it shares the code,
|
|
294
|
+
not the live object heap.
|
|
295
|
+
|
|
193
296
|
## Run-to-run variance (a.k.a. "is this a regression?")
|
|
194
297
|
|
|
195
298
|
Rule of thumb from chasing this twice: never compare numbers from
|
|
@@ -197,21 +300,33 @@ different sessions; interleave A/B rounds in one session instead. The
|
|
|
197
300
|
Docker-on-Mac environment swung ±10% on /cpu between sessions with the
|
|
198
301
|
VM's mood; the dedicated c7a box is far steadier (same-session repeats
|
|
199
302
|
land within ~1-2%), but the discipline stays—every comparative claim in
|
|
200
|
-
these docs comes from same-session pairs.
|
|
303
|
+
these docs comes from same-session pairs. Cross-box repeatability got
|
|
304
|
+
its own test: the dataset was measured across four identical
|
|
305
|
+
c7a.2xlarge boxes, and equal-config throughput numbers matched within
|
|
306
|
+
~1-2% (loaded-memory measurements swing more with heap-growth
|
|
307
|
+
timing—treat them as ballpark). The same discipline caught the recurring
|
|
308
|
+
threaded-plaintext fluke twice: once 28% low on an earlier box, and again
|
|
309
|
+
in the final re-validation (170k, where three interleaved re-runs put it
|
|
310
|
+
back at 217k). Suspect cells get re-measured, not published.
|
|
201
311
|
|
|
202
312
|
## Topology notes
|
|
203
313
|
|
|
204
|
-
Measured on c7a.2xlarge, plaintext, ractor mode, same session
|
|
205
|
-
|
|
206
|
-
214,
|
|
207
|
-
handled by a 3-thread ractor pays a lock
|
|
208
|
-
doesn't (`perf` in the earlier Docker
|
|
209
|
-
|
|
210
|
-
|
|
211
|
-
|
|
212
|
-
|
|
213
|
-
the
|
|
214
|
-
|
|
314
|
+
Measured on c7a.2xlarge, plaintext, ractor mode, same session (three
|
|
315
|
+
interleaved rounds, medians): `8×3` (workers×threads) = 198,478, `8×1`
|
|
316
|
+
= **229,966 (+16%)**, `16×1` = 214,391. Threads inside one ractor share
|
|
317
|
+
its lock, so every request handled by a 3-thread ractor pays a lock
|
|
318
|
+
handoff that a 1-thread ractor doesn't (`perf` in the earlier Docker
|
|
319
|
+
sessions attributed ~10% of cycles to
|
|
320
|
+
`rb_native_mutex_unlock`/`thread_sched_wakeup_next_thread` at 8×3; the
|
|
321
|
+
gain reproduced on two separate boxes, +16-17% each). **This is why
|
|
322
|
+
`threads` defaults to 1 in ractor mode since 0.1.1** (/cpu gains +16%
|
|
323
|
+
the same way: 77,999 vs 67,394). The trade-off is /io at low worker
|
|
324
|
+
counts: 1,552 at 8×1 vs 4,530 at 8×3—threads-per-ractor exist for
|
|
325
|
+
handlers that block on I/O. If yours wait a lot, raise `workers`
|
|
326
|
+
instead (32×1 beats even the 24-slot cluster, see above); slots are
|
|
327
|
+
cheap. (16×1 being worse than 8×1 on plaintext also says the shared
|
|
328
|
+
MPMC queue is *not* the bottleneck—8 extra parked consumers just add
|
|
329
|
+
scheduler churn.)
|
|
215
330
|
|
|
216
331
|
## What profiling tried and rejected
|
|
217
332
|
|
|
@@ -239,23 +354,27 @@ safeguards: lane depth is capped at 4, and workers steal from siblings
|
|
|
239
354
|
before parking (plus on every park tick), so a slow request can't strand
|
|
240
355
|
its lane's backlog.
|
|
241
356
|
|
|
242
|
-
Same-session A/B on c7a.2xlarge
|
|
357
|
+
Same-session A/B on c7a.2xlarge, ractor mode at the default topology
|
|
358
|
+
(8×1):
|
|
243
359
|
|
|
244
360
|
| endpoint | shared queue | lanes | delta |
|
|
245
361
|
|----------|-------------:|------:|------:|
|
|
246
|
-
| /plaintext |
|
|
247
|
-
| /10k |
|
|
248
|
-
| /cpu |
|
|
249
|
-
| /io |
|
|
250
|
-
|
|
251
|
-
|
|
252
|
-
|
|
253
|
-
|
|
254
|
-
|
|
255
|
-
|
|
256
|
-
|
|
257
|
-
`
|
|
258
|
-
|
|
362
|
+
| /plaintext | 229,534 | **250,222** | **+9%** |
|
|
363
|
+
| /10k | 178,083 | 189,862 | +7% |
|
|
364
|
+
| /cpu | **77,999** | 70,885 | −9% |
|
|
365
|
+
| /io | 1,552 | 1,551 | flat |
|
|
366
|
+
|
|
367
|
+
Lanes' margin shrank with the move to 1-thread workers (at the old 8×3
|
|
368
|
+
it was +21% plaintext: 240,193 vs 199,032 in the same session)—most of
|
|
369
|
+
the futex pain lanes were built to avoid came from thread handoffs
|
|
370
|
+
inside each ractor, and the new default removes those for everyone. At
|
|
371
|
+
the default, lanes still post the fastest plaintext/10k of any Kino
|
|
372
|
+
configuration, but plain shared-queue now takes /cpu. It stays opt-in because overload semantics differ from the
|
|
373
|
+
shared queue (`queue_depth` doesn't apply; capacity is lanes × 4 with
|
|
374
|
+
brief dispatcher retries up to `queue_timeout` before the 503), and
|
|
375
|
+
crash semantics, stealing fairness, and drain behavior have spec
|
|
376
|
+
coverage but not production mileage. (On loopback-bound macOS, lanes
|
|
377
|
+
lose a few percent instead; see the secondary table below.)
|
|
259
378
|
|
|
260
379
|
## Logging costs
|
|
261
380
|
|
|
@@ -265,11 +384,11 @@ typical costs):
|
|
|
265
384
|
|
|
266
385
|
| case (8×3, same session) | req/s |
|
|
267
386
|
|---|---:|
|
|
268
|
-
| threaded, no logging |
|
|
269
|
-
| threaded, `log_requests true` (native access log) | 193,
|
|
270
|
-
| ractor, access log off / on |
|
|
271
|
-
| app logs 1 line/req via shared `::Logger` (file) | **62,
|
|
272
|
-
| app logs 1 line/req via `Kino::Logger` (file) | **
|
|
387
|
+
| threaded, no logging | 219,168 |
|
|
388
|
+
| threaded, `log_requests true` (native access log) | 193,998 (−11%) |
|
|
389
|
+
| ractor, access log off / on | 197,596 / 181,050 (−8%) |
|
|
390
|
+
| app logs 1 line/req via shared `::Logger` (file) | **62,961** |
|
|
391
|
+
| app logs 1 line/req via `Kino::Logger` (file) | **149,519 (2.4×)** |
|
|
273
392
|
|
|
274
393
|
The shared-`::Logger` cost is the mutex: 24 worker threads serialize
|
|
275
394
|
through one lock plus a write syscall per line. `Kino::Logger` hands the
|
data/doc/rails-on-ractors.md
CHANGED
|
@@ -7,10 +7,11 @@
|
|
|
7
7
|
Rails 8.2.0.alpha boots and serves with `mode :threaded` (see the
|
|
8
8
|
example's `kino.rb`; just `bundle exec kino` in that directory). Measured
|
|
9
9
|
on the hello-world (c7a.2xlarge, 8 cores, production mode, 8×5):
|
|
10
|
-
~2.
|
|
11
|
-
~
|
|
12
|
-
|
|
13
|
-
|
|
10
|
+
~2.6k req/s in 92 MB PSS, single process. The 8-worker Puma cluster
|
|
11
|
+
reaches ~12.1k by parallelizing across forks, at 389 MB PSS (794 MB RSS,
|
|
12
|
+
but its forks share the framework copy-on-write, so PSS is the fair
|
|
13
|
+
figure)—Rails-on-Ractors is interesting precisely because it could offer
|
|
14
|
+
that ~4.6× parallelism at ~1/4th of the memory.
|
|
14
15
|
|
|
15
16
|
Pair it with production-style Rails settings: eager load, no code
|
|
16
17
|
reloading, database pool ≥ workers × threads, logger to stdout or another
|
data/doc/why-kino.md
CHANGED
|
@@ -15,8 +15,8 @@ deep-copies it, and sockets cannot cross at all.
|
|
|
15
15
|
|
|
16
16
|
We measured what the "obvious" workaround costs. The ractor-pool wrapper
|
|
17
17
|
experiment (reduce the env to a shareable subset, copy it to a worker
|
|
18
|
-
over a `Ractor::Port`, copy the response back) runs at **
|
|
19
|
-
Kino does
|
|
18
|
+
over a `Ractor::Port`, copy the response back) runs at **19.5k req/s
|
|
19
|
+
where Kino does 194k** on the same hardware—see the
|
|
20
20
|
[wrapper comparison](benchmarks.md#the-ractor-pool-wrapper-comparison).
|
|
21
21
|
Copying at the Rack layer eats the entire ractor dividend. Dispatch has
|
|
22
22
|
to live below the Rack contract.
|
|
@@ -78,12 +78,12 @@ objects; Rust sees one queue and one registry.
|
|
|
78
78
|
|
|
79
79
|
With the dispatch cost eliminated, Ractors deliver the thing they were
|
|
80
80
|
built for—a lock per ractor instead of one GVL—and each layer is
|
|
81
|
-
visible in the [benchmarks](benchmarks.md): `/cpu` at
|
|
82
|
-
ractor mode vs **13.
|
|
83
|
-
fork cluster's CPU parallelism while holding
|
|
84
|
-
cluster's 1,
|
|
85
|
-
front-end, one queue, and one JIT, where
|
|
86
|
-
price.
|
|
81
|
+
visible in the [benchmarks](benchmarks.md): `/cpu` at 78.0k req/s in
|
|
82
|
+
ractor mode vs **13.4k threaded (5.8×, the GVL ceiling)**, beating the
|
|
83
|
+
fork cluster's CPU parallelism by +34% while holding **~148 MB against
|
|
84
|
+
the cluster's ~1,068 MB** (by PSS, on the bench app), because eight
|
|
85
|
+
ractors share one VM, one Rust front-end, one queue, and one JIT, where
|
|
86
|
+
eight forks each pay full price.
|
|
87
87
|
|
|
88
88
|
The cleanest proof of the design is the threaded fallback itself: it
|
|
89
89
|
reuses ~95% of the same machinery, because the Rust core is
|
data/lib/kino/configuration.rb
CHANGED
|
@@ -12,11 +12,13 @@ module Kino
|
|
|
12
12
|
bind: "127.0.0.1",
|
|
13
13
|
port: 0,
|
|
14
14
|
workers: nil, # resolved to Etc.nprocessors in #to_h
|
|
15
|
-
threads: 3
|
|
15
|
+
threads: nil, # resolved per mode in Server: 1 in :ractor, 3 in :threaded
|
|
16
16
|
mode: :auto,
|
|
17
17
|
queue_depth: 1024,
|
|
18
|
-
queue_timeout:
|
|
18
|
+
queue_timeout: 5.0,
|
|
19
19
|
request_timeout: nil,
|
|
20
|
+
max_connections: nil, # nil = derive from the open-file limit
|
|
21
|
+
max_body_size: 50 * 1024 * 1024, # 50 MB; nil/0 = unlimited
|
|
20
22
|
batch: 1,
|
|
21
23
|
lanes: false,
|
|
22
24
|
log_requests: false,
|
|
@@ -144,7 +146,8 @@ module Kino
|
|
|
144
146
|
# Worker count (ractors in :ractor mode); defaults to CPU cores.
|
|
145
147
|
def workers(count) = @config.set(:workers, Integer(count))
|
|
146
148
|
|
|
147
|
-
# Threads per worker (I/O concurrency inside one ractor)
|
|
149
|
+
# Threads per worker (I/O concurrency inside one ractor); default is
|
|
150
|
+
# mode-dependent: 1 in :ractor mode, 3 in :threaded.
|
|
148
151
|
def threads(count) = @config.set(:threads, Integer(count))
|
|
149
152
|
|
|
150
153
|
# Dispatch mode: :auto, :ractor, or :threaded.
|
|
@@ -159,6 +162,14 @@ module Kino
|
|
|
159
162
|
# Seconds the app gets before the client receives a 504; nil = off.
|
|
160
163
|
def request_timeout(seconds) = @config.set(:request_timeout, seconds && Float(seconds))
|
|
161
164
|
|
|
165
|
+
# Max connections served at once; beyond it, new connections wait in
|
|
166
|
+
# the kernel backlog. Defaults to most of the open-file limit.
|
|
167
|
+
def max_connections(count) = @config.set(:max_connections, Integer(count))
|
|
168
|
+
|
|
169
|
+
# Max request-body bytes before a 413; nil disables (delegate to a
|
|
170
|
+
# fronting proxy). Default 50 MB.
|
|
171
|
+
def max_body_size(bytes) = @config.set(:max_body_size, bytes && Integer(bytes))
|
|
172
|
+
|
|
162
173
|
# Requests a worker may grab per queue visit (default 1).
|
|
163
174
|
def batch(count) = @config.set(:batch, Integer(count))
|
|
164
175
|
|
data/lib/kino/kino.so
CHANGED
|
Binary file
|
data/lib/kino/server.rb
CHANGED
|
@@ -41,11 +41,17 @@ module Kino
|
|
|
41
41
|
@bind = settings[:bind]
|
|
42
42
|
@requested_port = settings[:port]
|
|
43
43
|
@workers = Integer(settings[:workers])
|
|
44
|
-
@threads = Integer(settings[:threads])
|
|
45
44
|
@mode = resolve_mode(settings[:mode])
|
|
45
|
+
# Default threads per mode: 1 in :ractor (threads inside a ractor
|
|
46
|
+
# share its lock; a measured +17% on fast handlers; raise `workers`
|
|
47
|
+
# for I/O concurrency instead), 3 in :threaded (threads ARE the
|
|
48
|
+
# concurrency there).
|
|
49
|
+
@threads = Integer(settings[:threads] || ((@mode == :ractor) ? 1 : 3))
|
|
46
50
|
@queue_depth = Integer(settings[:queue_depth])
|
|
47
51
|
@queue_timeout_ms = (Float(settings[:queue_timeout]) * 1000).round
|
|
48
52
|
@request_timeout_ms = settings[:request_timeout] ? (Float(settings[:request_timeout]) * 1000).round : 0
|
|
53
|
+
@max_connections = settings[:max_connections] ? Integer(settings[:max_connections]) : default_max_connections
|
|
54
|
+
@max_body_size = Integer(settings[:max_body_size] || 0)
|
|
49
55
|
@batch = [Integer(settings[:batch]), 1].max
|
|
50
56
|
@lanes = !!settings[:lanes]
|
|
51
57
|
@log_requests = !!settings[:log_requests]
|
|
@@ -70,6 +76,8 @@ module Kino
|
|
|
70
76
|
bind: @bind, port: @requested_port,
|
|
71
77
|
queue_depth: @queue_depth, queue_timeout_ms: @queue_timeout_ms,
|
|
72
78
|
request_timeout_ms: @request_timeout_ms,
|
|
79
|
+
max_connections: @max_connections,
|
|
80
|
+
max_body_size: @max_body_size,
|
|
73
81
|
tokio_threads: @tokio_threads,
|
|
74
82
|
tls_cert: @tls&.fetch(:cert), tls_key: @tls&.fetch(:key),
|
|
75
83
|
lanes: @lanes, log_requests: @log_requests
|
|
@@ -210,6 +218,18 @@ module Kino
|
|
|
210
218
|
Process.clock_gettime(Process::CLOCK_MONOTONIC)
|
|
211
219
|
end
|
|
212
220
|
|
|
221
|
+
# Default connection cap: most of the process open-file limit. A
|
|
222
|
+
# connection flood's failure mode is descriptor exhaustion, and in
|
|
223
|
+
# :ractor/:threaded mode the app's own sockets and files share this
|
|
224
|
+
# process's table, so leave headroom. Scales with `ulimit -n`; raise the
|
|
225
|
+
# OS limit (or set max_connections) to allow more.
|
|
226
|
+
def default_max_connections
|
|
227
|
+
soft, = Process.getrlimit(Process::RLIMIT_NOFILE)
|
|
228
|
+
return 65_536 if soft == Process::RLIM_INFINITY
|
|
229
|
+
|
|
230
|
+
[soft * 8 / 10, 64].max
|
|
231
|
+
end
|
|
232
|
+
|
|
213
233
|
def join_workers(deadline)
|
|
214
234
|
if @supervisor
|
|
215
235
|
@supervisor.shutdown([deadline - monotonic_now, 0].max)
|
|
@@ -1,141 +1,121 @@
|
|
|
1
|
-
# frozen_string_literal: true
|
|
2
|
-
|
|
3
1
|
# Kino configuration.
|
|
4
2
|
# Generated by `kino --init`.
|
|
5
3
|
#
|
|
6
|
-
# Every setting
|
|
7
|
-
#
|
|
8
|
-
#
|
|
4
|
+
# Every setting is shown with its default value and commented out, so
|
|
5
|
+
# this file works as-is: uncomment what you want to change.
|
|
6
|
+
# Command-line flags beat this file; this file beats built-in defaults.
|
|
9
7
|
|
|
10
8
|
## Network
|
|
11
9
|
|
|
12
|
-
# Address to listen on. Use "0.0.0.0" to accept
|
|
10
|
+
# Address to listen on. Use "0.0.0.0" to accept connections from other
|
|
11
|
+
# machines.
|
|
13
12
|
# bind "127.0.0.1"
|
|
14
13
|
|
|
15
|
-
# Port to listen on.
|
|
16
|
-
# The `kino` CLI defaults this to 9292 when nothing else sets it.
|
|
14
|
+
# Port to listen on.
|
|
17
15
|
# port 9292
|
|
18
16
|
|
|
19
|
-
#
|
|
20
|
-
#
|
|
17
|
+
# Serve HTTPS. Point these at your certificate and key files (inline
|
|
18
|
+
# PEM strings also work).
|
|
21
19
|
# tls cert: "config/certs/server.pem", key: "config/certs/server.key"
|
|
22
20
|
|
|
23
21
|
## Topology
|
|
24
22
|
|
|
25
|
-
#
|
|
26
|
-
#
|
|
27
|
-
#
|
|
28
|
-
# multi-core parallelism for Ruby CPU work, one per core is a good start.
|
|
29
|
-
# In :threaded mode the same total (workers × threads) runs as plain
|
|
30
|
-
# Threads on the main ractor.
|
|
31
|
-
|
|
32
|
-
# Defaults to the number of CPU cores (Etc.nprocessors).
|
|
23
|
+
# How many workers to run. Each worker handles requests independently;
|
|
24
|
+
# in :ractor mode every worker runs Ruby in parallel on its own core.
|
|
25
|
+
# Default: one per CPU core.
|
|
33
26
|
# workers 8
|
|
34
27
|
|
|
35
|
-
# Threads
|
|
36
|
-
#
|
|
37
|
-
#
|
|
38
|
-
#
|
|
39
|
-
#
|
|
40
|
-
#
|
|
41
|
-
# threads 3
|
|
28
|
+
# Threads inside each worker. More threads help when your app spends
|
|
29
|
+
# time waiting on databases or other services; they do not make Ruby
|
|
30
|
+
# code run faster. Left unset, Kino picks a sensible default for the
|
|
31
|
+
# mode. If your app waits a lot in :ractor mode, prefer raising
|
|
32
|
+
# `workers` instead.
|
|
33
|
+
# threads 1
|
|
42
34
|
|
|
43
35
|
## Dispatch mode
|
|
44
36
|
#
|
|
45
|
-
# :auto
|
|
46
|
-
#
|
|
47
|
-
#
|
|
48
|
-
#
|
|
49
|
-
# :ractor: require a Ractor-shareable app; raises
|
|
50
|
-
# Kino::UnshareableAppError otherwise. The app must capture
|
|
51
|
-
# nothing mutable: frozen middleware, Ractor.shareable_proc
|
|
52
|
-
# endpoints.
|
|
53
|
-
# :threaded: run ANY Rack app (Rails included) on a classic thread pool.
|
|
37
|
+
# :auto - picks :ractor when your app supports it, else :threaded.
|
|
38
|
+
# :ractor - runs Ruby in parallel on all cores. Your app must be
|
|
39
|
+
# Ractor-shareable; check yours with `kino --check`.
|
|
40
|
+
# :threaded - works with any Rack app, including Rails.
|
|
54
41
|
# mode :auto
|
|
55
42
|
|
|
56
43
|
## Backpressure
|
|
57
44
|
|
|
58
|
-
#
|
|
59
|
-
#
|
|
60
|
-
# instead of waiting forever.
|
|
45
|
+
# How many requests may wait in line. When the line stays full, new
|
|
46
|
+
# requests are turned away with a 503 instead of waiting forever.
|
|
61
47
|
# queue_depth 1024
|
|
62
48
|
|
|
63
|
-
#
|
|
64
|
-
#
|
|
49
|
+
# How long (in seconds) a request may wait for a free spot before
|
|
50
|
+
# getting the 503.
|
|
51
|
+
# queue_timeout 5.0
|
|
65
52
|
|
|
66
|
-
#
|
|
67
|
-
#
|
|
68
|
-
#
|
|
69
|
-
# returns, so size this above your slowest legitimate endpoint.
|
|
53
|
+
# Give up on a response after this many seconds: the client gets a 504
|
|
54
|
+
# while your app finishes in the background. Off unless set. Set it
|
|
55
|
+
# above your slowest legitimate endpoint.
|
|
70
56
|
# request_timeout 30
|
|
71
57
|
|
|
72
|
-
#
|
|
73
|
-
#
|
|
74
|
-
#
|
|
75
|
-
#
|
|
58
|
+
# Most connections to serve at once. Past this, new connections wait in
|
|
59
|
+
# the kernel backlog instead of piling up until the server runs out of
|
|
60
|
+
# file descriptors. Defaults to most of the open-file limit (ulimit -n),
|
|
61
|
+
# so it scales with the OS limit and only bites under a flood.
|
|
62
|
+
# max_connections 8192
|
|
63
|
+
|
|
64
|
+
# Reject request bodies larger than this many bytes with a 413, so an
|
|
65
|
+
# oversized or endless upload can't drive your app to run out of memory.
|
|
66
|
+
# Set to nil to disable and let a fronting proxy handle it. Default: 50 MB.
|
|
67
|
+
# max_body_size 50 * 1024 * 1024
|
|
68
|
+
|
|
69
|
+
# How many requests a worker grabs from the line at once. Leave at 1
|
|
70
|
+
# unless all your endpoints are uniformly fast.
|
|
76
71
|
# batch 1
|
|
77
72
|
|
|
78
|
-
#
|
|
79
|
-
#
|
|
80
|
-
# fast handlers; semantics under overload are slightly different (per-lane
|
|
81
|
-
# caps with brief dispatcher retries instead of one global queue).
|
|
73
|
+
# Experimental dispatcher that gives each worker its own line. Faster
|
|
74
|
+
# for quick handlers; behavior under heavy overload differs slightly.
|
|
82
75
|
# lanes false
|
|
83
76
|
|
|
84
|
-
#
|
|
85
|
-
#
|
|
86
|
-
#
|
|
87
|
-
# On color terminals lines are tinted by status class (2xx green,
|
|
88
|
-
# 3xx yellow, 4xx maroon, 5xx bright red). This is the SERVER's view - it
|
|
89
|
-
# includes the 503 rejections your app never sees - and it interleaves
|
|
90
|
-
# cleanly with your app's own log (e.g. Rails') on stdout. See also
|
|
91
|
-
# Kino::Logger for routing the app log through the same async sink.
|
|
92
|
-
#
|
|
93
|
-
# Try enabling it in the development environment.
|
|
77
|
+
# Print one line per request to stdout, colored by status on a
|
|
78
|
+
# terminal. This is the server's view: it includes requests your app
|
|
79
|
+
# never saw, such as 503s. Recommended in development.
|
|
94
80
|
# log_requests false
|
|
95
81
|
|
|
96
82
|
## Lifecycle
|
|
97
83
|
|
|
98
|
-
#
|
|
99
|
-
#
|
|
100
|
-
# reaped. A second INT/TERM force-exits immediately.
|
|
84
|
+
# On shutdown, give in-flight requests this many seconds to finish.
|
|
85
|
+
# A second Ctrl-C (or signal) force-exits immediately.
|
|
101
86
|
# shutdown_timeout 30
|
|
102
87
|
|
|
103
|
-
# Write the
|
|
88
|
+
# Write the server's process id to this file on start.
|
|
104
89
|
# pidfile "tmp/pids/kino.pid"
|
|
105
90
|
|
|
106
91
|
## Runtime
|
|
107
92
|
|
|
108
|
-
# Threads for the
|
|
109
|
-
#
|
|
110
|
-
# real lever: `tokio_threads 1` + `threads 1` measured +26% on a pure-CPU
|
|
111
|
-
# benchmark (every spare thread is Ruby work you didn't run).
|
|
93
|
+
# Threads for the Rust I/O engine. The default suits most apps; for
|
|
94
|
+
# heavily CPU-bound apps, try 1 to leave more cores for Ruby.
|
|
112
95
|
# tokio_threads 4
|
|
113
96
|
|
|
114
97
|
## App
|
|
115
98
|
|
|
116
|
-
# Rackup file
|
|
99
|
+
# Rackup file to load (a command-line argument wins).
|
|
117
100
|
# rackup "config.ru"
|
|
118
101
|
|
|
119
|
-
# Sets RACK_ENV
|
|
102
|
+
# Sets RACK_ENV before the app is loaded, unless already set.
|
|
120
103
|
# environment "production"
|
|
121
104
|
|
|
122
105
|
## Rails
|
|
123
106
|
#
|
|
124
|
-
# Rails runs on Kino
|
|
107
|
+
# Rails runs on Kino today in :threaded mode:
|
|
125
108
|
#
|
|
126
109
|
# mode :threaded
|
|
127
110
|
# environment "production"
|
|
128
111
|
# threads 5 # match your database pool size
|
|
129
112
|
#
|
|
130
|
-
#
|
|
131
|
-
# -
|
|
132
|
-
#
|
|
133
|
-
#
|
|
134
|
-
# -
|
|
135
|
-
# - Rails.logger goes to stdout/stderr or a thread-safe device.
|
|
113
|
+
# Rails-side tips:
|
|
114
|
+
# - Run with eager loading and no code reloading (the production
|
|
115
|
+
# defaults).
|
|
116
|
+
# - Set the database pool to at least workers x threads.
|
|
117
|
+
# - Send logs to stdout or another thread-safe destination.
|
|
136
118
|
#
|
|
137
|
-
# Rails
|
|
138
|
-
# Rails.application
|
|
139
|
-
#
|
|
140
|
-
# Ractor.make_shareable(Rails.application) succeeds, `mode :ractor` here
|
|
141
|
-
# is all you'll need to change.
|
|
119
|
+
# Ractor mode cannot run Rails yet; the blockers are upstream in Rails.
|
|
120
|
+
# Once `Ractor.make_shareable(Rails.application)` works, switching to
|
|
121
|
+
# `mode :ractor` here is all you will need.
|
data/lib/kino/version.rb
CHANGED
data/sig/kino.rbs
CHANGED
|
@@ -92,6 +92,8 @@ module Kino
|
|
|
92
92
|
def queue_depth: (int depth) -> untyped
|
|
93
93
|
def queue_timeout: (Numeric seconds) -> untyped
|
|
94
94
|
def request_timeout: (Numeric? seconds) -> untyped
|
|
95
|
+
def max_connections: (int count) -> untyped
|
|
96
|
+
def max_body_size: (int? bytes) -> untyped
|
|
95
97
|
def batch: (int count) -> untyped
|
|
96
98
|
def lanes: (boolish enabled) -> untyped
|
|
97
99
|
def log_requests: (boolish enabled) -> untyped
|
metadata
CHANGED
|
@@ -1,14 +1,14 @@
|
|
|
1
1
|
--- !ruby/object:Gem::Specification
|
|
2
2
|
name: kino
|
|
3
3
|
version: !ruby/object:Gem::Version
|
|
4
|
-
version: 0.1.
|
|
4
|
+
version: 0.1.2
|
|
5
5
|
platform: x86_64-linux
|
|
6
6
|
authors:
|
|
7
7
|
- Yaroslav Markin
|
|
8
8
|
autorequire:
|
|
9
9
|
bindir: exe
|
|
10
10
|
cert_chain: []
|
|
11
|
-
date: 2026-06-
|
|
11
|
+
date: 2026-06-21 00:00:00.000000000 Z
|
|
12
12
|
dependencies:
|
|
13
13
|
- !ruby/object:Gem::Dependency
|
|
14
14
|
name: logger
|