kino 0.1.1-aarch64-linux → 0.1.2-aarch64-linux
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- checksums.yaml +4 -4
- data/CHANGELOG.md +28 -0
- data/README.md +65 -31
- data/doc/benchmarks.md +138 -85
- data/doc/rails-on-ractors.md +5 -4
- data/doc/why-kino.md +7 -7
- data/lib/kino/configuration.rb +10 -0
- data/lib/kino/kino.so +0 -0
- data/lib/kino/server.rb +16 -0
- data/lib/kino/templates/kino.rb.tt +11 -0
- data/lib/kino/version.rb +1 -1
- data/sig/kino.rbs +2 -0
- metadata +2 -2
checksums.yaml
CHANGED
|
@@ -1,7 +1,7 @@
|
|
|
1
1
|
---
|
|
2
2
|
SHA256:
|
|
3
|
-
metadata.gz:
|
|
4
|
-
data.tar.gz:
|
|
3
|
+
metadata.gz: 782837060d496035095d9c245aa1a1b7f59b20ef85b6e91a067deef9ee27757c
|
|
4
|
+
data.tar.gz: 115e29c38ae853396d2ecb5cb52884e4aa7245ac4eb2106a00eda8be67673534
|
|
5
5
|
SHA512:
|
|
6
|
-
metadata.gz:
|
|
7
|
-
data.tar.gz:
|
|
6
|
+
metadata.gz: 8ef1ebf271164fee8b032a11252fa56b108aa7de9b4817dd6af482b6470eea685d0b243c6849aa57feb8047a84f71f1cd19342d7bf86370da110608a66728696
|
|
7
|
+
data.tar.gz: b39644afd25f10d6508ee7551f985d31690877fa43edc5169a85bd548f0c394c6d2be7f329e1ac836edc20f55f60d8dffe6fd0a8e9ee83cb411d95cf88ccf344
|
data/CHANGELOG.md
CHANGED
|
@@ -1,4 +1,32 @@
|
|
|
1
|
+
## [0.1.2] - 2026-06-22
|
|
2
|
+
|
|
3
|
+
- Drop a connection that has not sent its complete request headers
|
|
4
|
+
within 15 seconds. Closes a slowloris hole: hyper's built-in header-read
|
|
5
|
+
timeout was inert because the server installed no timer, so a slow-header
|
|
6
|
+
client could tie up a connection (and its tokio task) indefinitely.
|
|
7
|
+
- Cap concurrent connections (new `max_connections` directive). Past the cap,
|
|
8
|
+
new connections wait in the kernel backlog instead of piling up until a
|
|
9
|
+
flood exhausts file descriptors or memory. Defaults to most of the process
|
|
10
|
+
open-file limit (`ulimit -n`), so it scales with the OS limit and only
|
|
11
|
+
engages under a flood.
|
|
12
|
+
- Bound the TLS handshake to 10 seconds. A client that completed the TCP
|
|
13
|
+
connect but stalled the handshake could otherwise hold a connection slot
|
|
14
|
+
indefinitely, since the request and header-read deadlines only begin once
|
|
15
|
+
the handshake finishes.
|
|
16
|
+
- Cap the request body at 50 MB by default (new `max_body_size` directive,
|
|
17
|
+
configurable; nil or 0 disables and delegates to a fronting proxy). An app
|
|
18
|
+
that reads `rack.input` could otherwise be driven to run out of memory by an
|
|
19
|
+
oversized or endless upload. A truthful oversize Content-Length is refused
|
|
20
|
+
with a 413 before the app runs; a chunked or lying client is cut off
|
|
21
|
+
mid-stream once it passes the cap.
|
|
22
|
+
- Bound the idle time between request-body frames to 30 seconds. A client that
|
|
23
|
+
began a request then stalled mid-body would otherwise keep a worker blocked
|
|
24
|
+
in `rack.input.read` indefinitely; now the read raises and the worker
|
|
25
|
+
reclaims its slot. Only a silent client trips it: a steadily-sent body resets
|
|
26
|
+
the deadline each frame, so slow-but-active uploads are unaffected.
|
|
27
|
+
|
|
1
28
|
## [0.1.1] - 2026-06-11
|
|
29
|
+
|
|
2
30
|
- Mode-dependent `threads` default: 1 per worker in :ractor mode (threads
|
|
3
31
|
inside a ractor share its lock and cost a per-request handoff; +16-18%
|
|
4
32
|
on fast handlers, measured on dedicated hardware), 3 in :threaded mode.
|
data/README.md
CHANGED
|
@@ -14,9 +14,8 @@ and a threaded fallback mode runs everything else, Rails included.
|
|
|
14
14
|
* **Fast.** On a real 8-core server, every Kino mode is **1.5-2×**
|
|
15
15
|
ahead of a Puma fork cluster on I/O-light endpoints. Ractor mode also
|
|
16
16
|
wins on pure CPU, **30%+**. [Benchmarks](#benchmarks) below.
|
|
17
|
-
* **A fraction of the memory.**
|
|
18
|
-
about **
|
|
19
|
-
and 8× less when serving the Rails hello-world.
|
|
17
|
+
* **A fraction of the memory.** Aabout **~7×** on the simplistic bench
|
|
18
|
+
Ractor app, and about **4× less memory** than a Puma cluster serving Rails in fallback threaded mode.
|
|
20
19
|
* **Parallel without forking.** Ractor mode runs CPU work **more than
|
|
21
20
|
5× faster** than Kino's own GVL-bound threaded mode, in the same
|
|
22
21
|
small process.
|
|
@@ -64,36 +63,55 @@ notes live in [doc/architecture.md](doc/architecture.md).
|
|
|
64
63
|
## Benchmarks
|
|
65
64
|
|
|
66
65
|
Measured on a real server: AWS **c7a.2xlarge** (8-core AMD EPYC 9R14,
|
|
67
|
-
16 GB, Amazon Linux 2023). This is a realistic app-server size.
|
|
68
|
-
|
|
69
|
-
|
|
70
|
-
|
|
71
|
-
|
|
72
|
-
|
|
66
|
+
16 GB, Amazon Linux 2023). This is a realistic app-server size.
|
|
67
|
+
|
|
68
|
+
**These tables run a tiny synthetic Rack app**—plaintext, a 10 KB body,
|
|
69
|
+
a CPU-bound `fib`, a 5 ms wait—deliberately small, to measure the server
|
|
70
|
+
rather than an app. It is Ractor-shareable, so Kino runs it in `:ractor`
|
|
71
|
+
mode (and `:threaded` for comparison). **A real Rails app is a different
|
|
72
|
+
story:** it is *not* Ractor-shareable, so it runs only in Kino's
|
|
73
|
+
`:threaded` fallback, with its own numbers—see [Rails](#rails) below.
|
|
74
|
+
Ruby 4.0.5 with YJIT, every server at its defaults: Puma forks 8 workers ×
|
|
75
|
+
3 threads, Kino stays in one process (8 workers; 1 thread each in ractor
|
|
76
|
+
modes, 3 in threaded). Numbers are req/s by wrk (8-second windows, 64
|
|
77
|
+
connections, same host). Methodology:
|
|
73
78
|
[doc/benchmarks.md](doc/benchmarks.md).
|
|
74
79
|
|
|
75
80
|
| endpoint | Kino :ractor | + lanes | :ractor, `workers 32`² | Kino :threaded | Puma (cluster) |
|
|
76
81
|
|-------------|-------------:|--------:|-----------------------:|---------------:|---------------:|
|
|
77
|
-
| /plaintext | 229,
|
|
78
|
-
| /10k |
|
|
79
|
-
| /cpu (fib) | **
|
|
80
|
-
| /io (5 ms) | 1,
|
|
81
|
-
| /io_native | 1,570 | 1,571 | **6,
|
|
82
|
+
| /plaintext | 229,534 | **250,222** | 182,997 | 216,994 | 118,176 |
|
|
83
|
+
| /10k | 178,083 | **189,862** | 151,034 | 160,400 | 106,768 |
|
|
84
|
+
| /cpu (fib) | **77,999**¹| 70,885 | 66,100 | 13,429 | 58,006 |
|
|
85
|
+
| /io (5 ms) | 1,552 | 1,551 | **5,888** | 4,709 | 4,693 |
|
|
86
|
+
| /io_native | 1,570 | 1,571 | **6,274** | 4,695 | 4,691 |
|
|
82
87
|
|
|
83
|
-
Memory on the
|
|
88
|
+
Memory tells two different stories depending on the app, both by **PSS**
|
|
89
|
+
(proportional set size; see note) after sustained load.
|
|
84
90
|
|
|
85
|
-
|
|
86
|
-
|
|
87
|
-
|
|
88
|
-
|
|
89
|
-
|
|
91
|
+
**The tiny benchmark app** (Ractor-shareable, so Kino runs it in `:ractor`
|
|
92
|
+
or `:threaded`). Kino is **~7× lighter in :ractor mode, ~10× in :threaded**
|
|
93
|
+
than the Puma cluster — the gap stays large because a trivial app is almost
|
|
94
|
+
all private per-worker heap, which copy-on-write can't share:
|
|
95
|
+
|
|
96
|
+
| tiny app, Kino | Kino (one process) | Puma cluster (8 workers) | ratio |
|
|
97
|
+
|-----------------|-------------------:|-------------------------:|------:|
|
|
98
|
+
| :ractor (8×1) | **148 MB** | 1,068 MB | ~7× |
|
|
99
|
+
| :threaded (8×3) | **107 MB**³| 1,068 MB | ~10× |
|
|
100
|
+
|
|
101
|
+
**A real Rails app** (not Ractor-shareable—Kino's `:threaded` fallback
|
|
102
|
+
only, [below](#rails)). The gap is **~4×**, smaller because Rails' large
|
|
103
|
+
framework *is* shared copy-on-write across Puma's forks:
|
|
104
|
+
|
|
105
|
+
| Rails hello-world | Kino :threaded | Puma cluster (8 workers) | ratio |
|
|
106
|
+
|-------------------|---------------:|-------------------------:|------:|
|
|
107
|
+
| **PSS** | **92 MB** | **389 MB** | ~4× |
|
|
90
108
|
|
|
91
109
|
"+ lanes" is the experimental per-worker-queue dispatcher (`lanes true`).
|
|
92
110
|
It posts the fastest plaintext/10k of any configuration here. Details:
|
|
93
111
|
[doc/benchmarks.md](doc/benchmarks.md#lane-dispatch-experimental-lanes-true).
|
|
94
112
|
|
|
95
113
|
¹ Stock settings, no tuning. Ractor mode beats the fork cluster on pure
|
|
96
|
-
CPU by +
|
|
114
|
+
CPU by +34% (+22% with lanes). Threaded mode shows the GVL ceiling that
|
|
97
115
|
every single-process Ruby server hits. The old CPU-tuning recipe is
|
|
98
116
|
retired: its `threads 1` half **is** the default now, and its
|
|
99
117
|
`tokio_threads 1` half costs −12% on real hardware; see
|
|
@@ -102,7 +120,7 @@ retired: its `threads 1` half **is** the default now, and its
|
|
|
102
120
|
² Wait-bound throughput is slots ÷ wait, and the default columns bring
|
|
103
121
|
8 single-thread workers against the cluster's 24 threads. Kino slots
|
|
104
122
|
are threads, not processes—when your app waits a lot, raise `workers`.
|
|
105
|
-
The `workers 32` column is that tuning: **+
|
|
123
|
+
The `workers 32` column is that tuning: **+25% over the cluster on /io
|
|
106
124
|
(+34% via `Kino.sleep`)** while still ahead of it on pure CPU, all in
|
|
107
125
|
one small process. The cost is the CPU-light rows (32 ractors
|
|
108
126
|
oversubscribe 8 cores); pick the topology your app's wait profile
|
|
@@ -111,7 +129,7 @@ needs. See
|
|
|
111
129
|
|
|
112
130
|
³ With `MALLOC_ARENA_MAX=2` (the standard Ruby deployment setting;
|
|
113
131
|
Heroku's default). Without it, 24 threads churning 10 KB responses
|
|
114
|
-
through one glibc heap balloon to ~
|
|
132
|
+
through one glibc heap balloon to ~670 MB—an arena-fragmentation
|
|
115
133
|
footgun, not a leak, and ractor mode sidesteps it. See
|
|
116
134
|
[doc/benchmarks.md](doc/benchmarks.md#memory-under-load-and-the-glibc-arena-footgun).
|
|
117
135
|
|
|
@@ -121,14 +139,30 @@ doc):
|
|
|
121
139
|
|
|
122
140
|
| endpoint | Kino :ractor (8×3) | Puma + ractor wrapper | Falcon + ractor wrapper |
|
|
123
141
|
|------------|-------------------:|----------------------:|------------------------:|
|
|
124
|
-
| /plaintext | **
|
|
125
|
-
| /cpu (fib) | **68,
|
|
126
|
-
| /io (5 ms) | **4,
|
|
127
|
-
|
|
128
|
-
|
|
129
|
-
|
|
130
|
-
|
|
131
|
-
|
|
142
|
+
| /plaintext | **193,826** | 19,480 | 99,776 |
|
|
143
|
+
| /cpu (fib) | **68,061** | 17,755 | 48,721 |
|
|
144
|
+
| /io (5 ms) | **4,530** | 1,454 | 1,549 |
|
|
145
|
+
|
|
146
|
+
### Rails
|
|
147
|
+
|
|
148
|
+
Rails is not Ractor-shareable today, so Kino serves it in `:threaded`
|
|
149
|
+
fallback — one GVL-bound process. On the same box (`examples/rails-hello`,
|
|
150
|
+
edge Rails, production, 8×5):
|
|
151
|
+
|
|
152
|
+
| Rails hello-world | req/s | memory (PSS) |
|
|
153
|
+
|------------------------------|-------:|-------------:|
|
|
154
|
+
| Kino :threaded (one process) | 2,637 | **92 MB** |
|
|
155
|
+
| Puma cluster (8 workers) | 12,138 | 389 MB |
|
|
156
|
+
|
|
157
|
+
The honest trade-off: Puma's fork cluster uses all 8 cores, so it serves
|
|
158
|
+
~4.6× the throughput — at ~4× the memory. Ractor-mode Rails would close
|
|
159
|
+
the throughput gap at one-process memory cost; the upstream blockers are
|
|
160
|
+
tracked in [doc/rails-on-ractors.md](doc/rails-on-ractors.md).
|
|
161
|
+
|
|
162
|
+
In short: on the tiny synthetic app, ractor mode beats fork-level CPU parallelism (**5.8×** Kino's
|
|
163
|
+
own GVL-bound threaded mode, +34% over the cluster) in one process, at
|
|
164
|
+
about 1/7th of the cluster's memory by PSS (~4× on a real Rails app).
|
|
165
|
+
Every Kino mode is 1.5-2.1× ahead of the cluster on I/O-light endpoints. The macOS numbers
|
|
132
166
|
(secondary; everything there hits the loopback ceiling) and the
|
|
133
167
|
YJIT × Ractors gotcha are in [doc/benchmarks.md](doc/benchmarks.md).
|
|
134
168
|
|
data/doc/benchmarks.md
CHANGED
|
@@ -34,10 +34,21 @@ the deployment most apps run today.
|
|
|
34
34
|
- The headline tables also carry an io-tuned column (`workers 32,
|
|
35
35
|
threads 1`)—not a default, labeled as such—because the /io rows are
|
|
36
36
|
a slot-count story (see below).
|
|
37
|
-
- The dataset spans
|
|
38
|
-
a
|
|
39
|
-
|
|
40
|
-
|
|
37
|
+
- The dataset spans four identical c7a.2xlarge boxes: the original
|
|
38
|
+
measurements, a re-measure at the 0.1.1 defaults, the headline sweep,
|
|
39
|
+
and a final full re-validation (every table re-run from scratch).
|
|
40
|
+
Equal-config throughput reproduced across boxes within ~1-2%.
|
|
41
|
+
- **Memory is reported as PSS (proportional set size), not RSS.** A Puma
|
|
42
|
+
cluster forks N workers that share the Ruby VM and gem code
|
|
43
|
+
copy-on-write; summing each worker's RSS counts those shared pages up
|
|
44
|
+
to N times and overstates the cluster's real footprint. PSS divides
|
|
45
|
+
every shared page across the processes mapping it, so it reflects the
|
|
46
|
+
unique physical memory the cluster occupies—the only fair basis for
|
|
47
|
+
comparing one process against a fork-per-core cluster. We read it from
|
|
48
|
+
`/proc/<pid>/smaps_rollup` over the whole process tree, cross-checked
|
|
49
|
+
against `ps` (RSS) and `smem` (PSS). Kino serves from one process, so
|
|
50
|
+
its RSS ≈ PSS; the correction only moves Puma. (`bench/studies.sh`
|
|
51
|
+
reports both columns.)
|
|
41
52
|
- Follow-up studies (`bench/studies.sh`): CPU tuning, topology sweep,
|
|
42
53
|
/io worker scaling, logging costs, and memory—run in the same session
|
|
43
54
|
as the headline tables.
|
|
@@ -54,28 +65,31 @@ the deployment most apps run today.
|
|
|
54
65
|
|
|
55
66
|
## Reading the headline tables
|
|
56
67
|
|
|
68
|
+
These tables all run the **tiny synthetic Ractor-shareable app**. The real
|
|
69
|
+
Rails app is not Ractor-shareable and runs only in threaded fallback—a
|
|
70
|
+
separate story with separate numbers, in [its own section](#rails).
|
|
71
|
+
|
|
57
72
|
- **Plaintext/10k**: Kino's tokio front-end clears the fork cluster by
|
|
58
|
-
1.5-2.1× (lanes plaintext
|
|
59
|
-
smallest margin is threaded /10k at 1.
|
|
73
|
+
1.5-2.1× (lanes plaintext 250,222 vs Puma 118,176 = 2.12×; the
|
|
74
|
+
smallest margin is threaded /10k at 1.50×). At the old 3-thread
|
|
60
75
|
topology the cross-ractor handoff showed up as ractor trailing
|
|
61
76
|
threaded on trivial handlers; the 1-thread default reverses that
|
|
62
|
-
(ractor 230k vs threaded
|
|
63
|
-
- **CPU (recursive fib)**: ractor mode does **5.
|
|
64
|
-
threaded mode** (
|
|
65
|
-
ractors—and beats the fork cluster outright: +
|
|
66
|
-
defaults (+
|
|
67
|
-
`workers 32` topology stays ahead of the cluster on CPU (
|
|
68
|
-
- **Memory**: after
|
|
69
|
-
**
|
|
70
|
-
cluster
|
|
71
|
-
|
|
72
|
-
|
|
73
|
-
|
|
74
|
-
[Memory under load](#memory-under-load-and-the-glibc-arena-footgun).
|
|
77
|
+
(ractor 230k vs threaded 217k) and lanes widen it (250k).
|
|
78
|
+
- **CPU (recursive fib)**: ractor mode does **5.8× its own GVL-bound
|
|
79
|
+
threaded mode** (77,999 vs 13,429)—that's the entire point of
|
|
80
|
+
ractors—and beats the fork cluster outright: +34% with stock
|
|
81
|
+
defaults (+22% with lanes, 70,885 vs 58,006). Even the io-tuned
|
|
82
|
+
`workers 32` topology stays ahead of the cluster on CPU (66,100).
|
|
83
|
+
- **Memory (PSS)**: after the full endpoint battery, the tiny app costs
|
|
84
|
+
Kino **148 MB** in ractor mode (107 MB threaded) against the 8-worker
|
|
85
|
+
cluster's **1,068 MB**—~7-10× lighter, because a trivial app is almost
|
|
86
|
+
all private per-worker heap that copy-on-write can't share. The real
|
|
87
|
+
Rails app narrows this to ~4× (its framework *is* shared CoW); both are
|
|
88
|
+
in [Memory under load](#memory-under-load-and-the-glibc-arena-footgun).
|
|
75
89
|
- **I/O (5 ms wait)**: all dispatch models tie within ~4% at equal slot
|
|
76
90
|
counts, so the default columns show the ractor modes behind on /io
|
|
77
91
|
(8 slots vs the cluster's 24), and the `workers 32` column shows the
|
|
78
|
-
same engine winning (+
|
|
92
|
+
same engine winning (+25%, +34% via `Kino.sleep`) once it has more
|
|
79
93
|
slots than the cluster. The lever is slot count, and Kino slots are
|
|
80
94
|
cheap: see [below](#why-io-lags-in-ractor-mode-on-linux).
|
|
81
95
|
|
|
@@ -87,14 +101,14 @@ run:
|
|
|
87
101
|
|
|
88
102
|
| config | /cpu req/s |
|
|
89
103
|
|---|---:|
|
|
90
|
-
| Puma cluster (reference) | 58,
|
|
91
|
-
| Kino `workers 8, threads 3` (the default before 0.1.1) | 67,
|
|
92
|
-
| Kino `workers 8, threads 1, tokio_threads 1` (the old recipe) | 68,
|
|
93
|
-
| Kino `workers 8, threads 1`, tokio auto (**the default**) | **
|
|
104
|
+
| Puma cluster (reference) | 58,189 |
|
|
105
|
+
| Kino `workers 8, threads 3` (the default before 0.1.1) | 67,394 |
|
|
106
|
+
| Kino `workers 8, threads 1, tokio_threads 1` (the old recipe) | 68,600 |
|
|
107
|
+
| Kino `workers 8, threads 1`, tokio auto (**the default**) | **77,999** |
|
|
94
108
|
|
|
95
109
|
The `threads 1` half of the old recipe became the default; the
|
|
96
110
|
`tokio_threads 1` half now *costs* −12% on /cpu (and still costs
|
|
97
|
-
plaintext:
|
|
111
|
+
plaintext: 108,523 vs 230k). Don't pin tokio threads. **The recipe's
|
|
98
112
|
history is an environment story**: in the earlier Docker-on-Mac runs it
|
|
99
113
|
was worth +12%, because tokio threads and wake churn competed for
|
|
100
114
|
oversubscribed virtualized cores; on dedicated cores the same pin
|
|
@@ -118,8 +132,8 @@ Parallelism for CPU-bound Ruby comes from ractors or forks, nothing else.
|
|
|
118
132
|
|
|
119
133
|
## Why /io lags in ractor mode on Linux
|
|
120
134
|
|
|
121
|
-
On bare metal the gap is small at equal slot counts: ractor /io 4,
|
|
122
|
-
vs threaded 4,
|
|
135
|
+
On bare metal the gap is small at equal slot counts: ractor /io 4,530
|
|
136
|
+
vs threaded 4,709 (−4%, both at 8×3). In Docker it was −18%, and a
|
|
123
137
|
pure-Ruby probe there measured
|
|
124
138
|
`sleep(0.005)` waking +2.3-2.8 ms late inside ractors vs +1.8 ms on the
|
|
125
139
|
main ractor—non-main-ractor timer wakeups are coarser in Ruby 4.0, but
|
|
@@ -130,16 +144,16 @@ A follow-up probe showed `IO.select`-style waits are tighter than
|
|
|
130
144
|
**Mitigation 1—`Kino.sleep`:** releases the GVL and waits on the OS
|
|
131
145
|
clock directly (chunked, so `Thread#kill`/shutdown stay responsive). The
|
|
132
146
|
`/io_native` endpoint (same 5 ms wait via `Kino.sleep` when available)
|
|
133
|
-
erases the remaining ractor gap on this box: 4,721 vs 4,
|
|
147
|
+
erases the remaining ractor gap on this box: 4,721 vs 4,530 plain sleep.
|
|
134
148
|
|
|
135
149
|
**Mitigation 2—add workers; they're nearly free.** The headline tables
|
|
136
|
-
show default ractor-mode /io at 1,
|
|
150
|
+
show default ractor-mode /io at 1,552: that's 8 slots (the 1-thread
|
|
137
151
|
default) against the cluster's 24, because wait-bound throughput is
|
|
138
152
|
simply `slots ÷ effective wait`. Kino's slots cost ~a thread each, not
|
|
139
|
-
a forked process: the `workers 32, threads 1` column measured **5,
|
|
140
|
-
/io (+
|
|
141
|
-
(+34%)**, still one small process, and still +
|
|
142
|
-
on pure CPU. Its cost is the CPU-light rows (
|
|
153
|
+
a forked process: the `workers 32, threads 1` column measured **5,888
|
|
154
|
+
/io (+25% over the 24-thread cluster's 4,693) and 6,274 /io_native
|
|
155
|
+
(+34%)**, still one small process, and still +14% ahead of the cluster
|
|
156
|
+
on pure CPU. Its cost is the CPU-light rows (183k plaintext vs 230k at
|
|
143
157
|
8×1: 32 ractors oversubscribe 8 cores). A fork cluster buying the same
|
|
144
158
|
32 slots pays for them in full copies of the app; Kino pays in
|
|
145
159
|
scheduler churn only where the cores are already saturated.
|
|
@@ -154,9 +168,9 @@ what the Rack-level hop itself costs (c7a.2xlarge, same session):
|
|
|
154
168
|
|
|
155
169
|
| endpoint | Kino :ractor (8×3) | Puma + wrapper | Falcon + wrapper |
|
|
156
170
|
|------------|-------------------:|---------------:|-----------------:|
|
|
157
|
-
| /plaintext |
|
|
158
|
-
| /cpu (fib) | 68,
|
|
159
|
-
| /io (5 ms) | 4,
|
|
171
|
+
| /plaintext | 193,826 | 19,480 | 99,776 |
|
|
172
|
+
| /cpu (fib) | 68,061 | 17,755 | 48,721 |
|
|
173
|
+
| /io (5 ms) | 4,530 | 1,454 | 1,549 |
|
|
160
174
|
|
|
161
175
|
Inside the Rack contract, the wrapper must reduce the env to a shareable
|
|
162
176
|
subset, copy it to the worker ractor, copy the response back, and hold a
|
|
@@ -171,18 +185,26 @@ the Rack contract—which is the experiment this gem exists to run.
|
|
|
171
185
|
|
|
172
186
|
## Rails
|
|
173
187
|
|
|
174
|
-
|
|
175
|
-
|
|
176
|
-
|
|
177
|
-
|
|
178
|
-
|
|
179
|
-
|
|
|
180
|
-
|
|
181
|
-
|
|
182
|
-
|
|
183
|
-
|
|
184
|
-
|
|
185
|
-
|
|
188
|
+
Rails is **not Ractor-shareable**, so Kino can only serve it in
|
|
189
|
+
`:threaded` fallback—this whole section is one GVL-bound Kino process,
|
|
190
|
+
never ractor mode. The example app (`examples/rails-hello`, edge Rails,
|
|
191
|
+
production mode, 8 workers × 5 threads) on the same box:
|
|
192
|
+
|
|
193
|
+
| | req/s | RSS | PSS |
|
|
194
|
+
|---|---:|---:|---:|
|
|
195
|
+
| Kino `:threaded` (one process) | 2,637 | 97 MB | **92 MB** |
|
|
196
|
+
| Puma cluster (8 workers) | 12,138 | 794 MB | **389 MB** |
|
|
197
|
+
|
|
198
|
+
This is the honest version of the Rails story. In threaded mode Kino is
|
|
199
|
+
one GVL-bound process, so the fork cluster outruns it ~4.6× by using all
|
|
200
|
+
8 cores—at ~4× the memory by PSS. The metric matters here: Puma's RSS
|
|
201
|
+
(794 MB) counts the shared Rails framework once per worker; PSS (389 MB)
|
|
202
|
+
counts it once, and that is the fair figure (the README's headline used
|
|
203
|
+
to read 8× off RSS). Preloading barely moves it—389 MB with
|
|
204
|
+
`preload_app!` vs 400 MB without—because Ruby's GC dirties most heap
|
|
205
|
+
pages, breaking copy-on-write, so even a preloaded cluster keeps a
|
|
206
|
+
private heap per worker. Rails-on-Ractors is interesting precisely
|
|
207
|
+
because it would close the throughput gap at the one-process memory
|
|
186
208
|
cost; the upstream blockers are documented in
|
|
187
209
|
[rails-on-ractors.md](rails-on-ractors.md).
|
|
188
210
|
|
|
@@ -217,29 +239,59 @@ more reason to prefer mimalloc in dlopen'd extensions.
|
|
|
217
239
|
|
|
218
240
|
## Memory under load (and the glibc arena footgun)
|
|
219
241
|
|
|
220
|
-
|
|
221
|
-
/10k, /cpu, /io—a "warmed
|
|
222
|
-
measures 26
|
|
242
|
+
All figures are **PSS** (see [Methodology](#methodology)) after the full
|
|
243
|
+
endpoint battery (8 s each of /plaintext, /10k, /cpu, /io—a "warmed
|
|
244
|
+
production process", not a fresh boot, which measures ~26 MB for every
|
|
245
|
+
Kino mode). RSS is shown alongside so the copy-on-write correction is
|
|
246
|
+
visible.
|
|
223
247
|
|
|
224
|
-
|
|
225
|
-
|
|
226
|
-
|
|
|
227
|
-
|
|
228
|
-
| Kino :ractor 8×
|
|
229
|
-
| Kino
|
|
230
|
-
|
|
|
248
|
+
### The tiny synthetic app
|
|
249
|
+
|
|
250
|
+
| config | RSS | PSS |
|
|
251
|
+
|---|---:|---:|
|
|
252
|
+
| Kino :ractor 8×1 (default) | 151 | **148** |
|
|
253
|
+
| Kino lanes 8×1 | 137 | **135** |
|
|
254
|
+
| Kino :ractor 8×3 | 171 | **169** |
|
|
255
|
+
| Kino :threaded 8×3 (`MALLOC_ARENA_MAX=2`) | 109 | **107** |
|
|
256
|
+
| Kino :threaded 8×3 (no arena cap) | 668 | **666**¹ |
|
|
257
|
+
| Puma cluster 8×3 | 1,213 | **1,068** |
|
|
258
|
+
|
|
259
|
+
The tiny app is ~7× lighter than the cluster in ractor mode, ~10× in
|
|
260
|
+
arena-capped threaded mode. RSS ≈ PSS for every Kino row (one process,
|
|
261
|
+
nothing to share) and within ~12% for Puma here: a trivial app has almost
|
|
262
|
+
no shared state, so Puma's footprint is ~1,051 MB of *private* per-worker
|
|
263
|
+
heap plus only ~18 MB shared (which RSS counts 8×). This is the case where
|
|
264
|
+
copy-on-write does **not** rescue the cluster—there is nothing to
|
|
265
|
+
share—so the RSS and PSS numbers nearly agree. (The old "80 MB / 15×"
|
|
266
|
+
figure was a lighter, plaintext-only load; the honest full-battery ractor
|
|
267
|
+
figure is ~148 MB, i.e. ~7×.)
|
|
231
268
|
|
|
232
269
|
¹ Not a leak: glibc malloc arena bloat. One 8-second /10k round takes
|
|
233
|
-
threaded mode from
|
|
270
|
+
threaded mode from ~70 MB to ~670 MB and it never returns—24 threads
|
|
234
271
|
churning 10 KB strings through one process heap is the textbook glibc
|
|
235
272
|
arena-fragmentation case (the reason Rails ops set `MALLOC_ARENA_MAX=2`;
|
|
236
|
-
Heroku ships that default). With
|
|
237
|
-
|
|
238
|
-
|
|
239
|
-
|
|
240
|
-
|
|
241
|
-
|
|
242
|
-
|
|
273
|
+
Heroku ships that default). With the cap the same battery ends at 107 MB
|
|
274
|
+
PSS, throughput unchanged. Ractor mode sidesteps the worst of it without
|
|
275
|
+
any env tweak—objects live in per-ractor heaps.
|
|
276
|
+
|
|
277
|
+
### Rails (threaded fallback)
|
|
278
|
+
|
|
279
|
+
Here copy-on-write **does** matter, which is exactly why PSS is mandatory:
|
|
280
|
+
|
|
281
|
+
| config | RSS | PSS |
|
|
282
|
+
|---|---:|---:|
|
|
283
|
+
| Kino :threaded (one process) | 97 | **92** |
|
|
284
|
+
| Puma cluster 8×3 (preload) | 794 | **389** |
|
|
285
|
+
|
|
286
|
+
Puma serves the same Rails framework from 8 forks that share it
|
|
287
|
+
copy-on-write; RSS counts that shared framework once per worker (794 MB),
|
|
288
|
+
PSS counts it once (389 MB). The fair ratio is **~4×**, not the ~8× a
|
|
289
|
+
naive RSS sum reports—this is the correction that prompted the whole
|
|
290
|
+
re-measure. Preload barely helps (389 vs 400 MB without): Ruby's GC
|
|
291
|
+
dirties most heap pages, breaking copy-on-write, so even a preloaded
|
|
292
|
+
cluster keeps a large private heap per worker. That is why "CoW should
|
|
293
|
+
make a fork cluster nearly free" is only half true—it shares the code,
|
|
294
|
+
not the live object heap.
|
|
243
295
|
|
|
244
296
|
## Run-to-run variance (a.k.a. "is this a regression?")
|
|
245
297
|
|
|
@@ -249,26 +301,27 @@ Docker-on-Mac environment swung ±10% on /cpu between sessions with the
|
|
|
249
301
|
VM's mood; the dedicated c7a box is far steadier (same-session repeats
|
|
250
302
|
land within ~1-2%), but the discipline stays—every comparative claim in
|
|
251
303
|
these docs comes from same-session pairs. Cross-box repeatability got
|
|
252
|
-
its own test: the dataset was measured across
|
|
304
|
+
its own test: the dataset was measured across four identical
|
|
253
305
|
c7a.2xlarge boxes, and equal-config throughput numbers matched within
|
|
254
|
-
~1-2% (loaded-
|
|
255
|
-
timing—treat
|
|
256
|
-
|
|
257
|
-
|
|
306
|
+
~1-2% (loaded-memory measurements swing more with heap-growth
|
|
307
|
+
timing—treat them as ballpark). The same discipline caught the recurring
|
|
308
|
+
threaded-plaintext fluke twice: once 28% low on an earlier box, and again
|
|
309
|
+
in the final re-validation (170k, where three interleaved re-runs put it
|
|
310
|
+
back at 217k). Suspect cells get re-measured, not published.
|
|
258
311
|
|
|
259
312
|
## Topology notes
|
|
260
313
|
|
|
261
314
|
Measured on c7a.2xlarge, plaintext, ractor mode, same session (three
|
|
262
|
-
interleaved rounds, medians): `8×3` (workers×threads) =
|
|
263
|
-
= **
|
|
315
|
+
interleaved rounds, medians): `8×3` (workers×threads) = 198,478, `8×1`
|
|
316
|
+
= **229,966 (+16%)**, `16×1` = 214,391. Threads inside one ractor share
|
|
264
317
|
its lock, so every request handled by a 3-thread ractor pays a lock
|
|
265
318
|
handoff that a 1-thread ractor doesn't (`perf` in the earlier Docker
|
|
266
319
|
sessions attributed ~10% of cycles to
|
|
267
320
|
`rb_native_mutex_unlock`/`thread_sched_wakeup_next_thread` at 8×3; the
|
|
268
321
|
gain reproduced on two separate boxes, +16-17% each). **This is why
|
|
269
|
-
`threads` defaults to 1 in ractor mode since 0.1.1** (/cpu gains +
|
|
270
|
-
the same way:
|
|
271
|
-
counts: 1,
|
|
322
|
+
`threads` defaults to 1 in ractor mode since 0.1.1** (/cpu gains +16%
|
|
323
|
+
the same way: 77,999 vs 67,394). The trade-off is /io at low worker
|
|
324
|
+
counts: 1,552 at 8×1 vs 4,530 at 8×3—threads-per-ractor exist for
|
|
272
325
|
handlers that block on I/O. If yours wait a lot, raise `workers`
|
|
273
326
|
instead (32×1 beats even the 24-slot cluster, see above); slots are
|
|
274
327
|
cheap. (16×1 being worse than 8×1 on plaintext also says the shared
|
|
@@ -306,10 +359,10 @@ Same-session A/B on c7a.2xlarge, ractor mode at the default topology
|
|
|
306
359
|
|
|
307
360
|
| endpoint | shared queue | lanes | delta |
|
|
308
361
|
|----------|-------------:|------:|------:|
|
|
309
|
-
| /plaintext |
|
|
310
|
-
| /10k |
|
|
311
|
-
| /cpu | **
|
|
312
|
-
| /io | 1,
|
|
362
|
+
| /plaintext | 229,534 | **250,222** | **+9%** |
|
|
363
|
+
| /10k | 178,083 | 189,862 | +7% |
|
|
364
|
+
| /cpu | **77,999** | 70,885 | −9% |
|
|
365
|
+
| /io | 1,552 | 1,551 | flat |
|
|
313
366
|
|
|
314
367
|
Lanes' margin shrank with the move to 1-thread workers (at the old 8×3
|
|
315
368
|
it was +21% plaintext: 240,193 vs 199,032 in the same session)—most of
|
|
@@ -331,11 +384,11 @@ typical costs):
|
|
|
331
384
|
|
|
332
385
|
| case (8×3, same session) | req/s |
|
|
333
386
|
|---|---:|
|
|
334
|
-
| threaded, no logging |
|
|
335
|
-
| threaded, `log_requests true` (native access log) | 193,
|
|
336
|
-
| ractor, access log off / on |
|
|
337
|
-
| app logs 1 line/req via shared `::Logger` (file) | **62,
|
|
338
|
-
| app logs 1 line/req via `Kino::Logger` (file) | **149,
|
|
387
|
+
| threaded, no logging | 219,168 |
|
|
388
|
+
| threaded, `log_requests true` (native access log) | 193,998 (−11%) |
|
|
389
|
+
| ractor, access log off / on | 197,596 / 181,050 (−8%) |
|
|
390
|
+
| app logs 1 line/req via shared `::Logger` (file) | **62,961** |
|
|
391
|
+
| app logs 1 line/req via `Kino::Logger` (file) | **149,519 (2.4×)** |
|
|
339
392
|
|
|
340
393
|
The shared-`::Logger` cost is the mutex: 24 worker threads serialize
|
|
341
394
|
through one lock plus a write syscall per line. `Kino::Logger` hands the
|
data/doc/rails-on-ractors.md
CHANGED
|
@@ -7,10 +7,11 @@
|
|
|
7
7
|
Rails 8.2.0.alpha boots and serves with `mode :threaded` (see the
|
|
8
8
|
example's `kino.rb`; just `bundle exec kino` in that directory). Measured
|
|
9
9
|
on the hello-world (c7a.2xlarge, 8 cores, production mode, 8×5):
|
|
10
|
-
~2.
|
|
11
|
-
~
|
|
12
|
-
|
|
13
|
-
|
|
10
|
+
~2.6k req/s in 92 MB PSS, single process. The 8-worker Puma cluster
|
|
11
|
+
reaches ~12.1k by parallelizing across forks, at 389 MB PSS (794 MB RSS,
|
|
12
|
+
but its forks share the framework copy-on-write, so PSS is the fair
|
|
13
|
+
figure)—Rails-on-Ractors is interesting precisely because it could offer
|
|
14
|
+
that ~4.6× parallelism at ~1/4th of the memory.
|
|
14
15
|
|
|
15
16
|
Pair it with production-style Rails settings: eager load, no code
|
|
16
17
|
reloading, database pool ≥ workers × threads, logger to stdout or another
|
data/doc/why-kino.md
CHANGED
|
@@ -16,7 +16,7 @@ deep-copies it, and sockets cannot cross at all.
|
|
|
16
16
|
We measured what the "obvious" workaround costs. The ractor-pool wrapper
|
|
17
17
|
experiment (reduce the env to a shareable subset, copy it to a worker
|
|
18
18
|
over a `Ractor::Port`, copy the response back) runs at **19.5k req/s
|
|
19
|
-
where Kino does
|
|
19
|
+
where Kino does 194k** on the same hardware—see the
|
|
20
20
|
[wrapper comparison](benchmarks.md#the-ractor-pool-wrapper-comparison).
|
|
21
21
|
Copying at the Rack layer eats the entire ractor dividend. Dispatch has
|
|
22
22
|
to live below the Rack contract.
|
|
@@ -78,12 +78,12 @@ objects; Rust sees one queue and one registry.
|
|
|
78
78
|
|
|
79
79
|
With the dispatch cost eliminated, Ractors deliver the thing they were
|
|
80
80
|
built for—a lock per ractor instead of one GVL—and each layer is
|
|
81
|
-
visible in the [benchmarks](benchmarks.md): `/cpu` at
|
|
82
|
-
ractor mode vs **13.
|
|
83
|
-
fork cluster's CPU parallelism by +
|
|
84
|
-
the cluster's 1,
|
|
85
|
-
front-end, one queue, and one JIT, where
|
|
86
|
-
price.
|
|
81
|
+
visible in the [benchmarks](benchmarks.md): `/cpu` at 78.0k req/s in
|
|
82
|
+
ractor mode vs **13.4k threaded (5.8×, the GVL ceiling)**, beating the
|
|
83
|
+
fork cluster's CPU parallelism by +34% while holding **~148 MB against
|
|
84
|
+
the cluster's ~1,068 MB** (by PSS, on the bench app), because eight
|
|
85
|
+
ractors share one VM, one Rust front-end, one queue, and one JIT, where
|
|
86
|
+
eight forks each pay full price.
|
|
87
87
|
|
|
88
88
|
The cleanest proof of the design is the threaded fallback itself: it
|
|
89
89
|
reuses ~95% of the same machinery, because the Rust core is
|
data/lib/kino/configuration.rb
CHANGED
|
@@ -17,6 +17,8 @@ module Kino
|
|
|
17
17
|
queue_depth: 1024,
|
|
18
18
|
queue_timeout: 5.0,
|
|
19
19
|
request_timeout: nil,
|
|
20
|
+
max_connections: nil, # nil = derive from the open-file limit
|
|
21
|
+
max_body_size: 50 * 1024 * 1024, # 50 MB; nil/0 = unlimited
|
|
20
22
|
batch: 1,
|
|
21
23
|
lanes: false,
|
|
22
24
|
log_requests: false,
|
|
@@ -160,6 +162,14 @@ module Kino
|
|
|
160
162
|
# Seconds the app gets before the client receives a 504; nil = off.
|
|
161
163
|
def request_timeout(seconds) = @config.set(:request_timeout, seconds && Float(seconds))
|
|
162
164
|
|
|
165
|
+
# Max connections served at once; beyond it, new connections wait in
|
|
166
|
+
# the kernel backlog. Defaults to most of the open-file limit.
|
|
167
|
+
def max_connections(count) = @config.set(:max_connections, Integer(count))
|
|
168
|
+
|
|
169
|
+
# Max request-body bytes before a 413; nil disables (delegate to a
|
|
170
|
+
# fronting proxy). Default 50 MB.
|
|
171
|
+
def max_body_size(bytes) = @config.set(:max_body_size, bytes && Integer(bytes))
|
|
172
|
+
|
|
163
173
|
# Requests a worker may grab per queue visit (default 1).
|
|
164
174
|
def batch(count) = @config.set(:batch, Integer(count))
|
|
165
175
|
|
data/lib/kino/kino.so
CHANGED
|
Binary file
|
data/lib/kino/server.rb
CHANGED
|
@@ -50,6 +50,8 @@ module Kino
|
|
|
50
50
|
@queue_depth = Integer(settings[:queue_depth])
|
|
51
51
|
@queue_timeout_ms = (Float(settings[:queue_timeout]) * 1000).round
|
|
52
52
|
@request_timeout_ms = settings[:request_timeout] ? (Float(settings[:request_timeout]) * 1000).round : 0
|
|
53
|
+
@max_connections = settings[:max_connections] ? Integer(settings[:max_connections]) : default_max_connections
|
|
54
|
+
@max_body_size = Integer(settings[:max_body_size] || 0)
|
|
53
55
|
@batch = [Integer(settings[:batch]), 1].max
|
|
54
56
|
@lanes = !!settings[:lanes]
|
|
55
57
|
@log_requests = !!settings[:log_requests]
|
|
@@ -74,6 +76,8 @@ module Kino
|
|
|
74
76
|
bind: @bind, port: @requested_port,
|
|
75
77
|
queue_depth: @queue_depth, queue_timeout_ms: @queue_timeout_ms,
|
|
76
78
|
request_timeout_ms: @request_timeout_ms,
|
|
79
|
+
max_connections: @max_connections,
|
|
80
|
+
max_body_size: @max_body_size,
|
|
77
81
|
tokio_threads: @tokio_threads,
|
|
78
82
|
tls_cert: @tls&.fetch(:cert), tls_key: @tls&.fetch(:key),
|
|
79
83
|
lanes: @lanes, log_requests: @log_requests
|
|
@@ -214,6 +218,18 @@ module Kino
|
|
|
214
218
|
Process.clock_gettime(Process::CLOCK_MONOTONIC)
|
|
215
219
|
end
|
|
216
220
|
|
|
221
|
+
# Default connection cap: most of the process open-file limit. A
|
|
222
|
+
# connection flood's failure mode is descriptor exhaustion, and in
|
|
223
|
+
# :ractor/:threaded mode the app's own sockets and files share this
|
|
224
|
+
# process's table, so leave headroom. Scales with `ulimit -n`; raise the
|
|
225
|
+
# OS limit (or set max_connections) to allow more.
|
|
226
|
+
def default_max_connections
|
|
227
|
+
soft, = Process.getrlimit(Process::RLIMIT_NOFILE)
|
|
228
|
+
return 65_536 if soft == Process::RLIM_INFINITY
|
|
229
|
+
|
|
230
|
+
[soft * 8 / 10, 64].max
|
|
231
|
+
end
|
|
232
|
+
|
|
217
233
|
def join_workers(deadline)
|
|
218
234
|
if @supervisor
|
|
219
235
|
@supervisor.shutdown([deadline - monotonic_now, 0].max)
|
|
@@ -55,6 +55,17 @@
|
|
|
55
55
|
# above your slowest legitimate endpoint.
|
|
56
56
|
# request_timeout 30
|
|
57
57
|
|
|
58
|
+
# Most connections to serve at once. Past this, new connections wait in
|
|
59
|
+
# the kernel backlog instead of piling up until the server runs out of
|
|
60
|
+
# file descriptors. Defaults to most of the open-file limit (ulimit -n),
|
|
61
|
+
# so it scales with the OS limit and only bites under a flood.
|
|
62
|
+
# max_connections 8192
|
|
63
|
+
|
|
64
|
+
# Reject request bodies larger than this many bytes with a 413, so an
|
|
65
|
+
# oversized or endless upload can't drive your app to run out of memory.
|
|
66
|
+
# Set to nil to disable and let a fronting proxy handle it. Default: 50 MB.
|
|
67
|
+
# max_body_size 50 * 1024 * 1024
|
|
68
|
+
|
|
58
69
|
# How many requests a worker grabs from the line at once. Leave at 1
|
|
59
70
|
# unless all your endpoints are uniformly fast.
|
|
60
71
|
# batch 1
|
data/lib/kino/version.rb
CHANGED
data/sig/kino.rbs
CHANGED
|
@@ -92,6 +92,8 @@ module Kino
|
|
|
92
92
|
def queue_depth: (int depth) -> untyped
|
|
93
93
|
def queue_timeout: (Numeric seconds) -> untyped
|
|
94
94
|
def request_timeout: (Numeric? seconds) -> untyped
|
|
95
|
+
def max_connections: (int count) -> untyped
|
|
96
|
+
def max_body_size: (int? bytes) -> untyped
|
|
95
97
|
def batch: (int count) -> untyped
|
|
96
98
|
def lanes: (boolish enabled) -> untyped
|
|
97
99
|
def log_requests: (boolish enabled) -> untyped
|
metadata
CHANGED
|
@@ -1,14 +1,14 @@
|
|
|
1
1
|
--- !ruby/object:Gem::Specification
|
|
2
2
|
name: kino
|
|
3
3
|
version: !ruby/object:Gem::Version
|
|
4
|
-
version: 0.1.
|
|
4
|
+
version: 0.1.2
|
|
5
5
|
platform: aarch64-linux
|
|
6
6
|
authors:
|
|
7
7
|
- Yaroslav Markin
|
|
8
8
|
autorequire:
|
|
9
9
|
bindir: exe
|
|
10
10
|
cert_chain: []
|
|
11
|
-
date: 2026-06-
|
|
11
|
+
date: 2026-06-21 00:00:00.000000000 Z
|
|
12
12
|
dependencies:
|
|
13
13
|
- !ruby/object:Gem::Dependency
|
|
14
14
|
name: logger
|