kino 0.1.1 → 0.1.2
- checksums.yaml +4 -4
- data/CHANGELOG.md +28 -0
- data/Cargo.lock +1 -1
- data/README.md +65 -31
- data/doc/benchmarks.md +138 -85
- data/doc/rails-on-ractors.md +5 -4
- data/doc/why-kino.md +7 -7
- data/ext/kino/Cargo.toml +1 -1
- data/ext/kino/src/registry.rs +4 -0
- data/ext/kino/src/request.rs +33 -1
- data/ext/kino/src/server.rs +123 -25
- data/lib/kino/configuration.rb +10 -0
- data/lib/kino/server.rb +16 -0
- data/lib/kino/templates/kino.rb.tt +11 -0
- data/lib/kino/version.rb +1 -1
- data/sig/kino.rbs +2 -0
- metadata +1 -1
checksums.yaml
CHANGED
|
@@ -1,7 +1,7 @@
|
|
|
1
1
|
---
|
|
2
2
|
SHA256:
|
|
3
|
-
metadata.gz:
|
|
4
|
-
data.tar.gz:
|
|
3
|
+
metadata.gz: 0bd8e6e3b295832fa1d87743b4c9a121cdb5687287011b29f1140814fdae0575
|
|
4
|
+
data.tar.gz: f21305459e857d366159ee258d873e2109b7a7715bb0cbe8d23c47e458ae2b33
|
|
5
5
|
SHA512:
|
|
6
|
-
metadata.gz:
|
|
7
|
-
data.tar.gz:
|
|
6
|
+
metadata.gz: 04e95f9ee2133b4d15bdd2069977e73791a16b033e57a3bdb88478d0048b0bb5d8f56eff1963813bbf8d9399461bb80bf9c33a2039f98805e4f129b3e42b28ef
|
|
7
|
+
data.tar.gz: 8eb131cdbbe5bdbd29d188ab4ff43dbbec2f8c791aacf85586ba69c7208b2f74cc05d4007c469a4de61c4578ef08214e56a4fd080b9ad655af3a01d208b4c752
|
data/CHANGELOG.md
CHANGED
|
@@ -1,4 +1,32 @@
|
|
|
1
|
+
## [0.1.2] - 2026-06-22
|
|
2
|
+
|
|
3
|
+
- Drop a connection that has not sent its complete request headers
|
|
4
|
+
within 15 seconds. Closes a slowloris hole: hyper's built-in header-read
|
|
5
|
+
timeout was inert because the server installed no timer, so a slow-header
|
|
6
|
+
client could tie up a connection (and its tokio task) indefinitely.
|
|
7
|
+
- Cap concurrent connections (new `max_connections` directive). Past the cap,
|
|
8
|
+
new connections wait in the kernel backlog instead of piling up until a
|
|
9
|
+
flood exhausts file descriptors or memory. Defaults to most of the process
|
|
10
|
+
open-file limit (`ulimit -n`), so it scales with the OS limit and only
|
|
11
|
+
engages under a flood.
|
|
12
|
+
- Bound the TLS handshake to 10 seconds. A client that completed the TCP
|
|
13
|
+
connect but stalled the handshake could otherwise hold a connection slot
|
|
14
|
+
indefinitely, since the request and header-read deadlines only begin once
|
|
15
|
+
the handshake finishes.
|
|
16
|
+
- Cap the request body at 50 MB by default (new `max_body_size` directive,
|
|
17
|
+
configurable; nil or 0 disables and delegates to a fronting proxy). An app
|
|
18
|
+
that reads `rack.input` could otherwise be driven to run out of memory by an
|
|
19
|
+
oversized or endless upload. A truthful oversize Content-Length is refused
|
|
20
|
+
with a 413 before the app runs; a chunked or lying client is cut off
|
|
21
|
+
mid-stream once it passes the cap.
|
|
22
|
+
- Bound the idle time between request-body frames to 30 seconds. A client that
|
|
23
|
+
began a request then stalled mid-body would otherwise keep a worker blocked
|
|
24
|
+
in `rack.input.read` indefinitely; now the read raises and the worker
|
|
25
|
+
reclaims its slot. Only a silent client trips it: a steadily-sent body resets
|
|
26
|
+
the deadline each frame, so slow-but-active uploads are unaffected.
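The idle-timeout rule in the last 0.1.2 entry can be sketched roughly as below. This is a minimal Ruby model, not Kino's implementation (the real check lives in the Rust extension and is not shown in this hunk). `IDLE_LIMIT` mirrors the 30-second figure from the changelog; `stalled?` and the arrival-time lists are hypothetical names introduced for illustration.

```ruby
# Model of a per-frame idle deadline: the clock restarts whenever a body
# frame arrives, so only a gap longer than the limit trips it.
IDLE_LIMIT = 30.0 # seconds, the figure quoted in the changelog entry

# arrivals: ascending timestamps (seconds) at which body frames arrived,
# measured from the start of the body read at t = 0.
# Returns true if any gap between consecutive frames exceeds the limit.
def stalled?(arrivals, limit: IDLE_LIMIT)
  last = 0.0
  arrivals.each do |t|
    return true if t - last > limit
    last = t # each frame resets the deadline
  end
  false
end

slow_but_active = (1..20).map { |i| i * 10.0 } # one frame every 10 s, 200 s in total
silent_client   = [5.0, 50.0]                  # 45 s of silence mid-body

puts stalled?(slow_but_active) # false
puts stalled?(silent_client)   # true
```

The model only looks at gaps between frames that did arrive; a client that goes silent after its last listed frame is outside what this sketch covers.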
|
|
27
|
+
|
|
1
28
|
## [0.1.1] - 2026-06-11
|
|
29
|
+
|
|
2
30
|
- Mode-dependent `threads` default: 1 per worker in :ractor mode (threads
|
|
3
31
|
inside a ractor share its lock and cost a per-request handoff; +16-18%
|
|
4
32
|
on fast handlers, measured on dedicated hardware), 3 in :threaded mode.
|
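The request-body cap from the 0.1.2 changelog entry above has two refusal paths, which the sketch below tries to make concrete. It is a hedged Ruby model only: `precheck`, `stream` and `DEFAULT_CAP` are hypothetical names, the cap is assumed to be counted in bytes, and "50 MB" is assumed to mean 50 MiB (the source does not say which).

```ruby
# Hypothetical model of the two refusal paths for an oversized request body.
DEFAULT_CAP = 50 * 1024 * 1024 # assumption: "50 MB" read as 50 MiB

# Up-front check: a declared Content-Length above the cap is refused
# before the app runs. A nil or zero cap disables the limit.
def precheck(content_length, cap: DEFAULT_CAP)
  return :accept if cap.nil? || cap.zero?
  return :accept if content_length.nil? # chunked: nothing declared to check
  content_length > cap ? :reject_413 : :accept
end

# Streaming check: count bytes as frames arrive and cut the client off
# once the running total passes the cap (chunked or lying clients).
def stream(frames, cap: DEFAULT_CAP)
  return :complete if cap.nil? || cap.zero?
  total = 0
  frames.each do |frame|
    total += frame.bytesize
    return :aborted if total > cap
  end
  :complete
end

puts precheck(60 * 1024 * 1024)          # reject_413
puts precheck(nil)                       # accept
puts stream(["a" * 6, "b" * 6], cap: 10) # aborted
puts stream(["a" * 6, "b" * 6], cap: 0)  # complete
```

Keeping the two checks separate mirrors the changelog wording: the first runs on headers alone, the second only has the bytes actually received to go on.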
data/Cargo.lock
CHANGED
data/README.md
CHANGED
|
@@ -14,9 +14,8 @@ and a threaded fallback mode runs everything else, Rails included.
|
|
|
14
14
|
* **Fast.** On a real 8-core server, every Kino mode is **1.5-2×**
|
|
15
15
|
ahead of a Puma fork cluster on I/O-light endpoints. Ractor mode also
|
|
16
16
|
wins on pure CPU, **30%+**. [Benchmarks](#benchmarks) below.
|
|
17
|
-
* **A fraction of the memory.**
|
|
18
|
-
about **
|
|
19
|
-
and 8× less when serving the Rails hello-world.
|
|
17
|
+
* **A fraction of the memory.** About **7× less** on the simplistic bench
|
|
18
|
+
Ractor app, and about **4× less memory** than a Puma cluster serving Rails in fallback threaded mode.
|
|
20
19
|
* **Parallel without forking.** Ractor mode runs CPU work **more than
|
|
21
20
|
5× faster** than Kino's own GVL-bound threaded mode, in the same
|
|
22
21
|
small process.
|
|
@@ -64,36 +63,55 @@ notes live in [doc/architecture.md](doc/architecture.md).
|
|
|
64
63
|
## Benchmarks
|
|
65
64
|
|
|
66
65
|
Measured on a real server: AWS **c7a.2xlarge** (8-core AMD EPYC 9R14,
|
|
67
|
-
16 GB, Amazon Linux 2023). This is a realistic app-server size.
|
|
68
|
-
|
|
69
|
-
|
|
70
|
-
|
|
71
|
-
|
|
72
|
-
|
|
66
|
+
16 GB, Amazon Linux 2023). This is a realistic app-server size.
|
|
67
|
+
|
|
68
|
+
**These tables run a tiny synthetic Rack app**—plaintext, a 10 KB body,
|
|
69
|
+
a CPU-bound `fib`, a 5 ms wait—deliberately small, to measure the server
|
|
70
|
+
rather than an app. It is Ractor-shareable, so Kino runs it in `:ractor`
|
|
71
|
+
mode (and `:threaded` for comparison). **A real Rails app is a different
|
|
72
|
+
story:** it is *not* Ractor-shareable, so it runs only in Kino's
|
|
73
|
+
`:threaded` fallback, with its own numbers—see [Rails](#rails) below.
|
|
74
|
+
Ruby 4.0.5 with YJIT, every server at its defaults: Puma forks 8 workers ×
|
|
75
|
+
3 threads, Kino stays in one process (8 workers; 1 thread each in ractor
|
|
76
|
+
modes, 3 in threaded). Numbers are req/s by wrk (8-second windows, 64
|
|
77
|
+
connections, same host). Methodology:
|
|
73
78
|
[doc/benchmarks.md](doc/benchmarks.md).
|
|
74
79
|
|
|
75
80
|
| endpoint | Kino :ractor | + lanes | :ractor, `workers 32`² | Kino :threaded | Puma (cluster) |
|
|
76
81
|
|-------------|-------------:|--------:|-----------------------:|---------------:|---------------:|
|
|
77
|
-
| /plaintext | 229,
|
|
78
|
-
| /10k |
|
|
79
|
-
| /cpu (fib) | **
|
|
80
|
-
| /io (5 ms) | 1,
|
|
81
|
-
| /io_native | 1,570 | 1,571 | **6,
|
|
82
|
+
| /plaintext | 229,534 | **250,222** | 182,997 | 216,994 | 118,176 |
|
|
83
|
+
| /10k | 178,083 | **189,862** | 151,034 | 160,400 | 106,768 |
|
|
84
|
+
| /cpu (fib) | **77,999**¹| 70,885 | 66,100 | 13,429 | 58,006 |
|
|
85
|
+
| /io (5 ms) | 1,552 | 1,551 | **5,888** | 4,709 | 4,693 |
|
|
86
|
+
| /io_native | 1,570 | 1,571 | **6,274** | 4,695 | 4,691 |
|
|
82
87
|
|
|
83
|
-
Memory on the
|
|
88
|
+
Memory tells two different stories depending on the app, both by **PSS**
|
|
89
|
+
(proportional set size; see note) after sustained load.
|
|
84
90
|
|
|
85
|
-
|
|
86
|
-
|
|
87
|
-
|
|
88
|
-
|
|
89
|
-
|
|
91
|
+
**The tiny benchmark app** (Ractor-shareable, so Kino runs it in `:ractor`
|
|
92
|
+
or `:threaded`). Kino is **~7× lighter in :ractor mode, ~10× in :threaded**
|
|
93
|
+
than the Puma cluster — the gap stays large because a trivial app is almost
|
|
94
|
+
all private per-worker heap, which copy-on-write can't share:
|
|
95
|
+
|
|
96
|
+
| tiny app, Kino | Kino (one process) | Puma cluster (8 workers) | ratio |
|
|
97
|
+
|-----------------|-------------------:|-------------------------:|------:|
|
|
98
|
+
| :ractor (8×1) | **148 MB** | 1,068 MB | ~7× |
|
|
99
|
+
| :threaded (8×3) | **107 MB**³| 1,068 MB | ~10× |
|
|
100
|
+
|
|
101
|
+
**A real Rails app** (not Ractor-shareable—Kino's `:threaded` fallback
|
|
102
|
+
only, [below](#rails)). The gap is **~4×**, smaller because Rails' large
|
|
103
|
+
framework *is* shared copy-on-write across Puma's forks:
|
|
104
|
+
|
|
105
|
+
| Rails hello-world | Kino :threaded | Puma cluster (8 workers) | ratio |
|
|
106
|
+
|-------------------|---------------:|-------------------------:|------:|
|
|
107
|
+
| **PSS** | **92 MB** | **389 MB** | ~4× |
|
|
90
108
|
|
|
91
109
|
"+ lanes" is the experimental per-worker-queue dispatcher (`lanes true`).
|
|
92
110
|
It posts the fastest plaintext/10k of any configuration here. Details:
|
|
93
111
|
[doc/benchmarks.md](doc/benchmarks.md#lane-dispatch-experimental-lanes-true).
|
|
94
112
|
|
|
95
113
|
¹ Stock settings, no tuning. Ractor mode beats the fork cluster on pure
|
|
96
|
-
CPU by +
|
|
114
|
+
CPU by +34% (+22% with lanes). Threaded mode shows the GVL ceiling that
|
|
97
115
|
every single-process Ruby server hits. The old CPU-tuning recipe is
|
|
98
116
|
retired: its `threads 1` half **is** the default now, and its
|
|
99
117
|
`tokio_threads 1` half costs −12% on real hardware; see
|
|
@@ -102,7 +120,7 @@ retired: its `threads 1` half **is** the default now, and its
|
|
|
102
120
|
² Wait-bound throughput is slots ÷ wait, and the default columns bring
|
|
103
121
|
8 single-thread workers against the cluster's 24 threads. Kino slots
|
|
104
122
|
are threads, not processes—when your app waits a lot, raise `workers`.
|
|
105
|
-
The `workers 32` column is that tuning: **+
|
|
123
|
+
The `workers 32` column is that tuning: **+25% over the cluster on /io
|
|
106
124
|
(+34% via `Kino.sleep`)** while still ahead of it on pure CPU, all in
|
|
107
125
|
one small process. The cost is the CPU-light rows (32 ractors
|
|
108
126
|
oversubscribe 8 cores); pick the topology your app's wait profile
|
|
@@ -111,7 +129,7 @@ needs. See
|
|
|
111
129
|
|
|
112
130
|
³ With `MALLOC_ARENA_MAX=2` (the standard Ruby deployment setting;
|
|
113
131
|
Heroku's default). Without it, 24 threads churning 10 KB responses
|
|
114
|
-
through one glibc heap balloon to ~
|
|
132
|
+
through one glibc heap balloon to ~670 MB—an arena-fragmentation
|
|
115
133
|
footgun, not a leak, and ractor mode sidesteps it. See
|
|
116
134
|
[doc/benchmarks.md](doc/benchmarks.md#memory-under-load-and-the-glibc-arena-footgun).
|
|
117
135
|
|
|
@@ -121,14 +139,30 @@ doc):
|
|
|
121
139
|
|
|
122
140
|
| endpoint | Kino :ractor (8×3) | Puma + ractor wrapper | Falcon + ractor wrapper |
|
|
123
141
|
|------------|-------------------:|----------------------:|------------------------:|
|
|
124
|
-
| /plaintext | **
|
|
125
|
-
| /cpu (fib) | **68,
|
|
126
|
-
| /io (5 ms) | **4,
|
|
127
|
-
|
|
128
|
-
|
|
129
|
-
|
|
130
|
-
|
|
131
|
-
|
|
142
|
+
| /plaintext | **193,826** | 19,480 | 99,776 |
|
|
143
|
+
| /cpu (fib) | **68,061** | 17,755 | 48,721 |
|
|
144
|
+
| /io (5 ms) | **4,530** | 1,454 | 1,549 |
|
|
145
|
+
|
|
146
|
+
### Rails
|
|
147
|
+
|
|
148
|
+
Rails is not Ractor-shareable today, so Kino serves it in `:threaded`
|
|
149
|
+
fallback — one GVL-bound process. On the same box (`examples/rails-hello`,
|
|
150
|
+
edge Rails, production, 8×5):
|
|
151
|
+
|
|
152
|
+
| Rails hello-world | req/s | memory (PSS) |
|
|
153
|
+
|------------------------------|-------:|-------------:|
|
|
154
|
+
| Kino :threaded (one process) | 2,637 | **92 MB** |
|
|
155
|
+
| Puma cluster (8 workers) | 12,138 | 389 MB |
|
|
156
|
+
|
|
157
|
+
The honest trade-off: Puma's fork cluster uses all 8 cores, so it serves
|
|
158
|
+
~4.6× the throughput — at ~4× the memory. Ractor-mode Rails would close
|
|
159
|
+
the throughput gap at one-process memory cost; the upstream blockers are
|
|
160
|
+
tracked in [doc/rails-on-ractors.md](doc/rails-on-ractors.md).
|
|
161
|
+
|
|
162
|
+
In short: on the tiny synthetic app, ractor mode beats fork-level CPU parallelism (**5.8×** Kino's
|
|
163
|
+
own GVL-bound threaded mode, +34% over the cluster) in one process, at
|
|
164
|
+
about 1/7th of the cluster's memory by PSS (about 1/4 on a real Rails app).
|
|
165
|
+
Every Kino mode is 1.5-2.1× ahead of the cluster on I/O-light endpoints. The macOS numbers
|
|
132
166
|
(secondary; everything there hits the loopback ceiling) and the
|
|
133
167
|
YJIT × Ractors gotcha are in [doc/benchmarks.md](doc/benchmarks.md).
|
|
134
168
|
|
data/doc/benchmarks.md
CHANGED
|
@@ -34,10 +34,21 @@ the deployment most apps run today.
|
|
|
34
34
|
- The headline tables also carry an io-tuned column (`workers 32,
|
|
35
35
|
threads 1`)—not a default, labeled as such—because the /io rows are
|
|
36
36
|
a slot-count story (see below).
|
|
37
|
-
- The dataset spans
|
|
38
|
-
a
|
|
39
|
-
|
|
40
|
-
|
|
37
|
+
- The dataset spans four identical c7a.2xlarge boxes: the original
|
|
38
|
+
measurements, a re-measure at the 0.1.1 defaults, the headline sweep,
|
|
39
|
+
and a final full re-validation (every table re-run from scratch).
|
|
40
|
+
Equal-config throughput reproduced across boxes within ~1-2%.
|
|
41
|
+
- **Memory is reported as PSS (proportional set size), not RSS.** A Puma
|
|
42
|
+
cluster forks N workers that share the Ruby VM and gem code
|
|
43
|
+
copy-on-write; summing each worker's RSS counts those shared pages up
|
|
44
|
+
to N times and overstates the cluster's real footprint. PSS divides
|
|
45
|
+
every shared page across the processes mapping it, so it reflects the
|
|
46
|
+
unique physical memory the cluster occupies—the only fair basis for
|
|
47
|
+
comparing one process against a fork-per-core cluster. We read it from
|
|
48
|
+
`/proc/<pid>/smaps_rollup` over the whole process tree, cross-checked
|
|
49
|
+
against `ps` (RSS) and `smem` (PSS). Kino serves from one process, so
|
|
50
|
+
its RSS ≈ PSS; the correction only moves Puma. (`bench/studies.sh`
|
|
51
|
+
reports both columns.)
|
|
41
52
|
- Follow-up studies (`bench/studies.sh`): CPU tuning, topology sweep,
|
|
42
53
|
/io worker scaling, logging costs, and memory—run in the same session
|
|
43
54
|
as the headline tables.
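The PSS measurement described in the methodology list above could be approximated along these lines. This is a sketch under stated assumptions, not the project's `bench/studies.sh`: `pss_kb` and `tree_pss_mb` are hypothetical helpers, the sample text is invented for illustration (not a measurement), and collecting the PIDs of the process tree is left out.

```ruby
# Sum the "Pss:" field of smaps_rollup text for several processes.
# The helpers take file contents as strings so the parsing can run anywhere;
# on Linux the text would come from File.read("/proc/#{pid}/smaps_rollup").

# Returns the value of the "Pss:" line in kB (Pss_Anon, Pss_File etc. are skipped).
def pss_kb(rollup_text)
  line = rollup_text.each_line.find { |l| l.start_with?("Pss:") }
  raise ArgumentError, "no Pss line found" unless line
  line[/\d+/].to_i
end

# Total PSS in MB across a process tree, given one rollup text per process.
def tree_pss_mb(rollup_texts)
  rollup_texts.sum { |text| pss_kb(text) } / 1024.0
end

# Invented sample in the smaps_rollup layout, for illustration only.
SAMPLE = <<~ROLLUP
  55d0c0a00000-7ffd1c5f2000 ---p 00000000 00:00 0    [rollup]
  Rss:              102400 kB
  Pss:               51200 kB
  Pss_Anon:          40960 kB
  Pss_File:          10240 kB
ROLLUP

puts pss_kb(SAMPLE)                # 51200
puts tree_pss_mb([SAMPLE, SAMPLE]) # 100.0
```

For a single-process server the list holds one entry; for a fork cluster it would hold the master plus every worker, which is where the RSS and PSS totals diverge.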
|
|
@@ -54,28 +65,31 @@ the deployment most apps run today.
|
|
|
54
65
|
|
|
55
66
|
## Reading the headline tables
|
|
56
67
|
|
|
68
|
+
These tables all run the **tiny synthetic Ractor-shareable app**. The real
|
|
69
|
+
Rails app is not Ractor-shareable and runs only in threaded fallback—a
|
|
70
|
+
separate story with separate numbers, in [its own section](#rails).
|
|
71
|
+
|
|
57
72
|
- **Plaintext/10k**: Kino's tokio front-end clears the fork cluster by
|
|
58
|
-
1.5-2.1× (lanes plaintext
|
|
59
|
-
smallest margin is threaded /10k at 1.
|
|
73
|
+
1.5-2.1× (lanes plaintext 250,222 vs Puma 118,176 = 2.12×; the
|
|
74
|
+
smallest margin is threaded /10k at 1.50×). At the old 3-thread
|
|
60
75
|
topology the cross-ractor handoff showed up as ractor trailing
|
|
61
76
|
threaded on trivial handlers; the 1-thread default reverses that
|
|
62
|
-
(ractor 230k vs threaded
|
|
63
|
-
- **CPU (recursive fib)**: ractor mode does **5.
|
|
64
|
-
threaded mode** (
|
|
65
|
-
ractors—and beats the fork cluster outright: +
|
|
66
|
-
defaults (+
|
|
67
|
-
`workers 32` topology stays ahead of the cluster on CPU (
|
|
68
|
-
- **Memory**: after
|
|
69
|
-
**
|
|
70
|
-
cluster
|
|
71
|
-
|
|
72
|
-
|
|
73
|
-
|
|
74
|
-
[Memory under load](#memory-under-load-and-the-glibc-arena-footgun).
|
|
77
|
+
(ractor 230k vs threaded 217k) and lanes widen it (250k).
|
|
78
|
+
- **CPU (recursive fib)**: ractor mode does **5.8× its own GVL-bound
|
|
79
|
+
threaded mode** (77,999 vs 13,429)—that's the entire point of
|
|
80
|
+
ractors—and beats the fork cluster outright: +34% with stock
|
|
81
|
+
defaults (+22% with lanes, 70,885 vs 58,006). Even the io-tuned
|
|
82
|
+
`workers 32` topology stays ahead of the cluster on CPU (66,100).
|
|
83
|
+
- **Memory (PSS)**: after the full endpoint battery, the tiny app costs
|
|
84
|
+
Kino **148 MB** in ractor mode (107 MB threaded) against the 8-worker
|
|
85
|
+
cluster's **1,068 MB**—~7-10× lighter, because a trivial app is almost
|
|
86
|
+
all private per-worker heap that copy-on-write can't share. The real
|
|
87
|
+
Rails app narrows this to ~4× (its framework *is* shared CoW); both are
|
|
88
|
+
in [Memory under load](#memory-under-load-and-the-glibc-arena-footgun).
|
|
75
89
|
- **I/O (5 ms wait)**: all dispatch models tie within ~4% at equal slot
|
|
76
90
|
counts, so the default columns show the ractor modes behind on /io
|
|
77
91
|
(8 slots vs the cluster's 24), and the `workers 32` column shows the
|
|
78
|
-
same engine winning (+
|
|
92
|
+
same engine winning (+25%, +34% via `Kino.sleep`) once it has more
|
|
79
93
|
slots than the cluster. The lever is slot count, and Kino slots are
|
|
80
94
|
cheap: see [below](#why-io-lags-in-ractor-mode-on-linux).
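The ratios quoted in the list above can be recomputed from the req/s figures in the headline table. The sketch below does only that arithmetic; the hash keys and the `ratio`/`gain_pct` helper names are made up for this example, while the numbers are copied from the table.

```ruby
# Recompute the headline ratios from the req/s figures in the tables.
REQS = {
  lanes_plaintext:     250_222,
  puma_plaintext:      118_176,
  threaded_10k:        160_400,
  puma_10k:            106_768,
  ractor_cpu:           77_999,
  lanes_cpu:            70_885,
  threaded_cpu:         13_429,
  puma_cpu:             58_006,
  workers32_io:          5_888,
  puma_io:               4_693,
  workers32_io_native:   6_274,
  puma_io_native:        4_691
}.freeze

# a / b as a multiplier, rounded to two decimals.
def ratio(a, b)
  (REQS.fetch(a).to_f / REQS.fetch(b)).round(2)
end

# How far a is ahead of b, as a whole-number percentage.
def gain_pct(a, b)
  ((REQS.fetch(a).to_f / REQS.fetch(b) - 1) * 100).round
end

puts ratio(:lanes_plaintext, :puma_plaintext)        # 2.12
puts ratio(:threaded_10k, :puma_10k)                 # 1.5
puts ratio(:ractor_cpu, :threaded_cpu)               # 5.81
puts gain_pct(:ractor_cpu, :puma_cpu)                # 34
puts gain_pct(:lanes_cpu, :puma_cpu)                 # 22
puts gain_pct(:workers32_io, :puma_io)               # 25
puts gain_pct(:workers32_io_native, :puma_io_native) # 34
```

The 5.81 here is the same figure the text rounds to 5.8×.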
|
|
81
95
|
|
|
@@ -87,14 +101,14 @@ run:
|
|
|
87
101
|
|
|
88
102
|
| config | /cpu req/s |
|
|
89
103
|
|---|---:|
|
|
90
|
-
| Puma cluster (reference) | 58,
|
|
91
|
-
| Kino `workers 8, threads 3` (the default before 0.1.1) | 67,
|
|
92
|
-
| Kino `workers 8, threads 1, tokio_threads 1` (the old recipe) | 68,
|
|
93
|
-
| Kino `workers 8, threads 1`, tokio auto (**the default**) | **
|
|
104
|
+
| Puma cluster (reference) | 58,189 |
|
|
105
|
+
| Kino `workers 8, threads 3` (the default before 0.1.1) | 67,394 |
|
|
106
|
+
| Kino `workers 8, threads 1, tokio_threads 1` (the old recipe) | 68,600 |
|
|
107
|
+
| Kino `workers 8, threads 1`, tokio auto (**the default**) | **77,999** |
|
|
94
108
|
|
|
95
109
|
The `threads 1` half of the old recipe became the default; the
|
|
96
110
|
`tokio_threads 1` half now *costs* −12% on /cpu (and still costs
|
|
97
|
-
plaintext:
|
|
111
|
+
plaintext: 108,523 vs 230k). Don't pin tokio threads. **The recipe's
|
|
98
112
|
history is an environment story**: in the earlier Docker-on-Mac runs it
|
|
99
113
|
was worth +12%, because tokio threads and wake churn competed for
|
|
100
114
|
oversubscribed virtualized cores; on dedicated cores the same pin
|
|
@@ -118,8 +132,8 @@ Parallelism for CPU-bound Ruby comes from ractors or forks, nothing else.
|
|
|
118
132
|
|
|
119
133
|
## Why /io lags in ractor mode on Linux
|
|
120
134
|
|
|
121
|
-
On bare metal the gap is small at equal slot counts: ractor /io 4,
|
|
122
|
-
vs threaded 4,
|
|
135
|
+
On bare metal the gap is small at equal slot counts: ractor /io 4,530
|
|
136
|
+
vs threaded 4,709 (−4%, both at 8×3). In Docker it was −18%, and a
|
|
123
137
|
pure-Ruby probe there measured
|
|
124
138
|
`sleep(0.005)` waking +2.3-2.8 ms late inside ractors vs +1.8 ms on the
|
|
125
139
|
main ractor—non-main-ractor timer wakeups are coarser in Ruby 4.0, but
|
|
@@ -130,16 +144,16 @@ A follow-up probe showed `IO.select`-style waits are tighter than
|
|
|
130
144
|
**Mitigation 1—`Kino.sleep`:** releases the GVL and waits on the OS
|
|
131
145
|
clock directly (chunked, so `Thread#kill`/shutdown stay responsive). The
|
|
132
146
|
`/io_native` endpoint (same 5 ms wait via `Kino.sleep` when available)
|
|
133
|
-
erases the remaining ractor gap on this box: 4,721 vs 4,
|
|
147
|
+
erases the remaining ractor gap on this box: 4,721 vs 4,530 plain sleep.
|
|
134
148
|
|
|
135
149
|
**Mitigation 2—add workers; they're nearly free.** The headline tables
|
|
136
|
-
show default ractor-mode /io at 1,
|
|
150
|
+
show default ractor-mode /io at 1,552: that's 8 slots (the 1-thread
|
|
137
151
|
default) against the cluster's 24, because wait-bound throughput is
|
|
138
152
|
simply `slots ÷ effective wait`. Kino's slots cost ~a thread each, not
|
|
139
|
-
a forked process: the `workers 32, threads 1` column measured **5,
|
|
140
|
-
/io (+
|
|
141
|
-
(+34%)**, still one small process, and still +
|
|
142
|
-
on pure CPU. Its cost is the CPU-light rows (
|
|
153
|
+
a forked process: the `workers 32, threads 1` column measured **5,888
|
|
154
|
+
/io (+25% over the 24-thread cluster's 4,693) and 6,274 /io_native
|
|
155
|
+
(+34%)**, still one small process, and still +14% ahead of the cluster
|
|
156
|
+
on pure CPU. Its cost is the CPU-light rows (183k plaintext vs 230k at
|
|
143
157
|
8×1: 32 ractors oversubscribe 8 cores). A fork cluster buying the same
|
|
144
158
|
32 slots pays for them in full copies of the app; Kino pays in
|
|
145
159
|
scheduler churn only where the cores are already saturated.
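The `slots ÷ effective wait` rule above is easy to check against the measured /io rows. The sketch below computes the ideal ceiling for a nominal 5 ms wait and compares it with the table figures; `ceiling_rps` and the `MEASURED` hash are names chosen for this example, and the ceiling ignores any wake-up lateness, so measured values are expected to land a little below it.

```ruby
# Upper bound on wait-bound throughput: each slot completes at most one
# request per wait interval, so the ceiling is slots / wait.
def ceiling_rps(slots, wait_s)
  (slots / wait_s).round
end

WAIT = 0.005 # the nominal 5 ms wait of the /io endpoint

# slots => measured /io req/s from the headline table
# (8 = default ractor mode, 24 = Puma cluster, 32 = `workers 32`)
MEASURED = { 8 => 1_552, 24 => 4_693, 32 => 5_888 }.freeze

MEASURED.each do |slots, measured|
  ceiling = ceiling_rps(slots, WAIT)
  pct = (100.0 * measured / ceiling).round
  puts "#{slots} slots: ceiling #{ceiling}, measured #{measured} (#{pct}%)"
end
# 8 slots: ceiling 1600, measured 1552 (97%)
# 24 slots: ceiling 4800, measured 4693 (98%)
# 32 slots: ceiling 6400, measured 5888 (92%)
```

All three configurations sit within a few percent of the ceiling, which is consistent with the text's point that slot count, not dispatch model, is the lever on wait-bound endpoints.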
|
|
@@ -154,9 +168,9 @@ what the Rack-level hop itself costs (c7a.2xlarge, same session):
|
|
|
154
168
|
|
|
155
169
|
| endpoint | Kino :ractor (8×3) | Puma + wrapper | Falcon + wrapper |
|
|
156
170
|
|------------|-------------------:|---------------:|-----------------:|
|
|
157
|
-
| /plaintext |
|
|
158
|
-
| /cpu (fib) | 68,
|
|
159
|
-
| /io (5 ms) | 4,
|
|
171
|
+
| /plaintext | 193,826 | 19,480 | 99,776 |
|
|
172
|
+
| /cpu (fib) | 68,061 | 17,755 | 48,721 |
|
|
173
|
+
| /io (5 ms) | 4,530 | 1,454 | 1,549 |
|
|
160
174
|
|
|
161
175
|
Inside the Rack contract, the wrapper must reduce the env to a shareable
|
|
162
176
|
subset, copy it to the worker ractor, copy the response back, and hold a
|
|
@@ -171,18 +185,26 @@ the Rack contract—which is the experiment this gem exists to run.
|
|
|
171
185
|
|
|
172
186
|
## Rails
|
|
173
187
|
|
|
174
|
-
|
|
175
|
-
|
|
176
|
-
|
|
177
|
-
|
|
178
|
-
|
|
179
|
-
|
|
|
180
|
-
|
|
181
|
-
|
|
182
|
-
|
|
183
|
-
|
|
184
|
-
|
|
185
|
-
|
|
188
|
+
Rails is **not Ractor-shareable**, so Kino can only serve it in
|
|
189
|
+
`:threaded` fallback—this whole section is one GVL-bound Kino process,
|
|
190
|
+
never ractor mode. The example app (`examples/rails-hello`, edge Rails,
|
|
191
|
+
production mode, 8 workers × 5 threads) on the same box:
|
|
192
|
+
|
|
193
|
+
| | req/s | RSS | PSS |
|
|
194
|
+
|---|---:|---:|---:|
|
|
195
|
+
| Kino `:threaded` (one process) | 2,637 | 97 MB | **92 MB** |
|
|
196
|
+
| Puma cluster (8 workers) | 12,138 | 794 MB | **389 MB** |
|
|
197
|
+
|
|
198
|
+
This is the honest version of the Rails story. In threaded mode Kino is
|
|
199
|
+
one GVL-bound process, so the fork cluster outruns it ~4.6× by using all
|
|
200
|
+
8 cores—at ~4× the memory by PSS. The metric matters here: Puma's RSS
|
|
201
|
+
(794 MB) counts the shared Rails framework once per worker; PSS (389 MB)
|
|
202
|
+
counts it once, and that is the fair figure (the README's headline used
|
|
203
|
+
to read 8× off RSS). Preloading barely moves it—389 MB with
|
|
204
|
+
`preload_app!` vs 400 MB without—because Ruby's GC dirties most heap
|
|
205
|
+
pages, breaking copy-on-write, so even a preloaded cluster keeps a
|
|
206
|
+
private heap per worker. Rails-on-Ractors is interesting precisely
|
|
207
|
+
because it would close the throughput gap at the one-process memory
|
|
186
208
|
cost; the upstream blockers are documented in
|
|
187
209
|
[rails-on-ractors.md](rails-on-ractors.md).
|
|
188
210
|
|
|
@@ -217,29 +239,59 @@ more reason to prefer mimalloc in dlopen'd extensions.
|
|
|
217
239
|
|
|
218
240
|
## Memory under load (and the glibc arena footgun)
|
|
219
241
|
|
|
220
|
-
|
|
221
|
-
/10k, /cpu, /io—a "warmed
|
|
222
|
-
measures 26
|
|
242
|
+
All figures are **PSS** (see [Methodology](#methodology)) after the full
|
|
243
|
+
endpoint battery (8 s each of /plaintext, /10k, /cpu, /io—a "warmed
|
|
244
|
+
production process", not a fresh boot, which measures ~26 MB for every
|
|
245
|
+
Kino mode). RSS is shown alongside so the copy-on-write correction is
|
|
246
|
+
visible.
|
|
223
247
|
|
|
224
|
-
|
|
225
|
-
|
|
226
|
-
|
|
|
227
|
-
|
|
228
|
-
| Kino :ractor 8×
|
|
229
|
-
| Kino
|
|
230
|
-
|
|
|
248
|
+
### The tiny synthetic app
|
|
249
|
+
|
|
250
|
+
| config | RSS (MB) | PSS (MB) |
|
|
251
|
+
|---|---:|---:|
|
|
252
|
+
| Kino :ractor 8×1 (default) | 151 | **148** |
|
|
253
|
+
| Kino lanes 8×1 | 137 | **135** |
|
|
254
|
+
| Kino :ractor 8×3 | 171 | **169** |
|
|
255
|
+
| Kino :threaded 8×3 (`MALLOC_ARENA_MAX=2`) | 109 | **107** |
|
|
256
|
+
| Kino :threaded 8×3 (no arena cap) | 668 | **666**¹ |
|
|
257
|
+
| Puma cluster 8×3 | 1,213 | **1,068** |
|
|
258
|
+
|
|
259
|
+
The tiny app is ~7× lighter than the cluster in ractor mode, ~10× in
|
|
260
|
+
arena-capped threaded mode. RSS ≈ PSS for every Kino row (one process,
|
|
261
|
+
nothing to share) and within ~12% for Puma here: a trivial app has almost
|
|
262
|
+
no shared state, so Puma's footprint is ~1,051 MB of *private* per-worker
|
|
263
|
+
heap plus only ~18 MB shared (which RSS counts 8×). This is the case where
|
|
264
|
+
copy-on-write does **not** rescue the cluster—there is nothing to
|
|
265
|
+
share—so the RSS and PSS numbers nearly agree. (The old "80 MB / 15×"
|
|
266
|
+
figure was a lighter, plaintext-only load; the honest full-battery ractor
|
|
267
|
+
figure is ~148 MB, i.e. ~7×.)
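The RSS-versus-PSS accounting in the paragraph above can be illustrated with a deliberately crude model: private heap is counted once by both metrics, while shared pages are counted once per worker by an RSS sum and once in total by PSS. The helper names are hypothetical, and the model assumes every shared page is mapped by all 8 workers, which the source does not claim.

```ruby
# Toy accounting model for a fork cluster: every worker has its own private
# heap, plus a shared region that (by assumption) all workers map.

# Summed RSS: each worker's RSS includes the shared pages in full.
def rss_sum_mb(private_total, shared, workers)
  private_total + shared * workers
end

# PSS: shared pages are split across the workers mapping them,
# so over the whole cluster they are counted once.
def pss_mb(private_total, shared)
  private_total + shared
end

PRIVATE_MB = 1_051 # "~1,051 MB of private per-worker heap" from the text
SHARED_MB  = 18    # "~18 MB shared" from the text
WORKERS    = 8

puts pss_mb(PRIVATE_MB, SHARED_MB)              # 1069
puts rss_sum_mb(PRIVATE_MB, SHARED_MB, WORKERS) # 1195
```

Both results land close to the measured Puma row (1,068 MB PSS, 1,213 MB RSS); the leftover difference is beyond what this simple model tries to explain.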
|
|
231
268
|
|
|
232
269
|
¹ Not a leak: glibc malloc arena bloat. One 8-second /10k round takes
|
|
233
|
-
threaded mode from
|
|
270
|
+
threaded mode from ~70 MB to ~670 MB and it never returns—24 threads
|
|
234
271
|
churning 10 KB strings through one process heap is the textbook glibc
|
|
235
272
|
arena-fragmentation case (the reason Rails ops set `MALLOC_ARENA_MAX=2`;
|
|
236
|
-
Heroku ships that default). With
|
|
237
|
-
|
|
238
|
-
|
|
239
|
-
|
|
240
|
-
|
|
241
|
-
|
|
242
|
-
|
|
273
|
+
Heroku ships that default). With the cap the same battery ends at 107 MB
|
|
274
|
+
PSS, throughput unchanged. Ractor mode sidesteps the worst of it without
|
|
275
|
+
any env tweak—objects live in per-ractor heaps.
|
|
276
|
+
|
|
277
|
+
### Rails (threaded fallback)
|
|
278
|
+
|
|
279
|
+
Here copy-on-write **does** matter, which is exactly why PSS is mandatory:
|
|
280
|
+
|
|
281
|
+
| config | RSS (MB) | PSS (MB) |
|
|
282
|
+
|---|---:|---:|
|
|
283
|
+
| Kino :threaded (one process) | 97 | **92** |
|
|
284
|
+
| Puma cluster 8×3 (preload) | 794 | **389** |
|
|
285
|
+
|
|
286
|
+
Puma serves the same Rails framework from 8 forks that share it
|
|
287
|
+
copy-on-write; RSS counts that shared framework once per worker (794 MB),
|
|
288
|
+
PSS counts it once (389 MB). The fair ratio is **~4×**, not the ~8× a
|
|
289
|
+
naive RSS sum reports—this is the correction that prompted the whole
|
|
290
|
+
re-measure. Preload barely helps (389 vs 400 MB without): Ruby's GC
|
|
291
|
+
dirties most heap pages, breaking copy-on-write, so even a preloaded
|
|
292
|
+
cluster keeps a large private heap per worker. That is why "CoW should
|
|
293
|
+
make a fork cluster nearly free" is only half true—it shares the code,
|
|
294
|
+
not the live object heap.
|
|
243
295
|
|
|
244
296
|
## Run-to-run variance (a.k.a. "is this a regression?")
|
|
245
297
|
|
|
@@ -249,26 +301,27 @@ Docker-on-Mac environment swung ±10% on /cpu between sessions with the
|
|
|
249
301
|
VM's mood; the dedicated c7a box is far steadier (same-session repeats
|
|
250
302
|
land within ~1-2%), but the discipline stays—every comparative claim in
|
|
251
303
|
these docs comes from same-session pairs. Cross-box repeatability got
|
|
252
|
-
its own test: the dataset was measured across
|
|
304
|
+
its own test: the dataset was measured across four identical
|
|
253
305
|
c7a.2xlarge boxes, and equal-config throughput numbers matched within
|
|
254
|
-
~1-2% (loaded-
|
|
255
|
-
timing—treat
|
|
256
|
-
|
|
257
|
-
|
|
306
|
+
~1-2% (loaded-memory measurements swing more with heap-growth
|
|
307
|
+
timing—treat them as ballpark). The same discipline caught the recurring
|
|
308
|
+
threaded-plaintext fluke twice: once 28% low on an earlier box, and again
|
|
309
|
+
in the final re-validation (170k, where three interleaved re-runs put it
|
|
310
|
+
back at 217k). Suspect cells get re-measured, not published.
|
|
258
311
|
|
|
259
312
|
## Topology notes
|
|
260
313
|
|
|
261
314
|
Measured on c7a.2xlarge, plaintext, ractor mode, same session (three
|
|
262
|
-
interleaved rounds, medians): `8×3` (workers×threads) =
|
|
263
|
-
= **
|
|
315
|
+
interleaved rounds, medians): `8×3` (workers×threads) = 198,478, `8×1`
|
|
316
|
+
= **229,966 (+16%)**, `16×1` = 214,391. Threads inside one ractor share
|
|
264
317
|
its lock, so every request handled by a 3-thread ractor pays a lock
|
|
265
318
|
handoff that a 1-thread ractor doesn't (`perf` in the earlier Docker
|
|
266
319
|
sessions attributed ~10% of cycles to
|
|
267
320
|
`rb_native_mutex_unlock`/`thread_sched_wakeup_next_thread` at 8×3; the
|
|
268
321
|
gain reproduced on two separate boxes, +16-17% each). **This is why
|
|
269
|
-
`threads` defaults to 1 in ractor mode since 0.1.1** (/cpu gains +
|
|
270
|
-
the same way:
|
|
271
|
-
counts: 1,
|
|
322
|
+
`threads` defaults to 1 in ractor mode since 0.1.1** (/cpu gains +16%
|
|
323
|
+
the same way: 77,999 vs 67,394). The trade-off is /io at low worker
|
|
324
|
+
counts: 1,552 at 8×1 vs 4,530 at 8×3—threads-per-ractor exist for
|
|
272
325
|
handlers that block on I/O. If yours wait a lot, raise `workers`
|
|
273
326
|
instead (32×1 beats even the 24-slot cluster, see above); slots are
|
|
274
327
|
cheap. (16×1 being worse than 8×1 on plaintext also says the shared
|
|
@@ -306,10 +359,10 @@ Same-session A/B on c7a.2xlarge, ractor mode at the default topology
|
|
|
306
359
|
|
|
307
360
|
| endpoint | shared queue | lanes | delta |
|
|
308
361
|
|----------|-------------:|------:|------:|
|
|
309
|
-
| /plaintext |
|
|
310
|
-
| /10k |
|
|
311
|
-
| /cpu | **
|
|
312
|
-
| /io | 1,
|
|
362
|
+
| /plaintext | 229,534 | **250,222** | **+9%** |
|
|
363
|
+
| /10k | 178,083 | 189,862 | +7% |
|
|
364
|
+
| /cpu | **77,999** | 70,885 | −9% |
|
|
365
|
+
| /io | 1,552 | 1,551 | flat |
|
|
313
366
|
|
|
314
367
|
Lanes' margin shrank with the move to 1-thread workers (at the old 8×3
|
|
315
368
|
it was +21% plaintext: 240,193 vs 199,032 in the same session)—most of
|
|
@@ -331,11 +384,11 @@ typical costs):
|
|
|
331
384
|
|
|
332
385
|
| case (8×3, same session) | req/s |
|
|
333
386
|
|---|---:|
|
|
334
|
-
| threaded, no logging |
|
|
335
|
-
| threaded, `log_requests true` (native access log) | 193,
|
|
336
|
-
| ractor, access log off / on |
|
|
337
|
-
| app logs 1 line/req via shared `::Logger` (file) | **62,
|
|
338
|
-
| app logs 1 line/req via `Kino::Logger` (file) | **149,
|
|
387
|
+
| threaded, no logging | 219,168 |
|
|
388
|
+
| threaded, `log_requests true` (native access log) | 193,998 (−11%) |
|
|
389
|
+
| ractor, access log off / on | 197,596 / 181,050 (−8%) |
|
|
390
|
+
| app logs 1 line/req via shared `::Logger` (file) | **62,961** |
|
|
391
|
+
| app logs 1 line/req via `Kino::Logger` (file) | **149,519 (2.4×)** |
|
|
339
392
|
|
|
340
393
|
The shared-`::Logger` cost is the mutex: 24 worker threads serialize
|
|
341
394
|
through one lock plus a write syscall per line. `Kino::Logger` hands the
|
data/doc/rails-on-ractors.md
CHANGED
|
@@ -7,10 +7,11 @@
|
|
|
7
7
|
Rails 8.2.0.alpha boots and serves with `mode :threaded` (see the
|
|
8
8
|
example's `kino.rb`; just `bundle exec kino` in that directory). Measured
|
|
9
9
|
on the hello-world (c7a.2xlarge, 8 cores, production mode, 8×5):
|
|
10
|
-
~2.
|
|
11
|
-
~
|
|
12
|
-
|
|
13
|
-
|
|
10
|
+
~2.6k req/s in 92 MB PSS, single process. The 8-worker Puma cluster
|
|
11
|
+
reaches ~12.1k by parallelizing across forks, at 389 MB PSS (794 MB RSS,
|
|
12
|
+
but its forks share the framework copy-on-write, so PSS is the fair
|
|
13
|
+
figure)—Rails-on-Ractors is interesting precisely because it could offer
|
|
14
|
+
that ~4.6× parallelism at ~1/4th of the memory.
|
|
14
15
|
|
|
15
16
|
Pair it with production-style Rails settings: eager load, no code
|
|
16
17
|
reloading, database pool ≥ workers × threads, logger to stdout or another
|
data/doc/why-kino.md
CHANGED
|
@@ -16,7 +16,7 @@ deep-copies it, and sockets cannot cross at all.
|
|
|
16
16
|
We measured what the "obvious" workaround costs. The ractor-pool wrapper
|
|
17
17
|
experiment (reduce the env to a shareable subset, copy it to a worker
|
|
18
18
|
over a `Ractor::Port`, copy the response back) runs at **19.5k req/s
|
|
19
|
-
where Kino does
|
|
19
|
+
where Kino does 194k** on the same hardware—see the
|
|
20
20
|
[wrapper comparison](benchmarks.md#the-ractor-pool-wrapper-comparison).
|
|
21
21
|
Copying at the Rack layer eats the entire ractor dividend. Dispatch has
|
|
22
22
|
to live below the Rack contract.
|
|
@@ -78,12 +78,12 @@ objects; Rust sees one queue and one registry.
|
|
|
78
78
|
|
|
79
79
|
With the dispatch cost eliminated, Ractors deliver the thing they were
|
|
80
80
|
built for—a lock per ractor instead of one GVL—and each layer is
|
|
81
|
-
visible in the [benchmarks](benchmarks.md): `/cpu` at
|
|
82
|
-
ractor mode vs **13.
|
|
83
|
-
fork cluster's CPU parallelism by +
|
|
84
|
-
the cluster's 1,
|
|
85
|
-
front-end, one queue, and one JIT, where
|
|
86
|
-
price.
|
|
81
|
+
visible in the [benchmarks](benchmarks.md): `/cpu` at 78.0k req/s in
|
|
82
|
+
ractor mode vs **13.4k threaded (5.8×, the GVL ceiling)**, beating the
|
|
83
|
+
fork cluster's CPU parallelism by +34% while holding **~148 MB against
|
|
84
|
+
the cluster's ~1,068 MB** (by PSS, on the bench app), because eight
|
|
85
|
+
ractors share one VM, one Rust front-end, one queue, and one JIT, where
|
|
86
|
+
eight forks each pay full price.
|
|
87
87
|
|
|
88
88
|
The cleanest proof of the design is the threaded fallback itself: it
|
|
89
89
|
reuses ~95% of the same machinery, because the Rust core is
|
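The ratios quoted in the why-kino.md hunks above can be re-derived from the rounded figures the text itself gives. This is only a sanity-check sketch: the helper name `ratio` is hypothetical, and the inputs are the rounded numbers from the prose, not raw benchmark data.

```ruby
# Sanity check of the ratios quoted in the why-kino.md text above.
# `ratio` is a hypothetical helper; inputs are the rounded figures
# from the prose, not raw benchmark output.
def ratio(numerator, denominator)
  (numerator.to_f / denominator).round(1)
end

# `/cpu` throughput, ractor mode vs threaded mode (req/s).
puts "ractor vs threaded:  #{ratio(78_000, 13_400)}x"

# Memory by PSS on the bench app, fork cluster vs ractor mode (MB).
puts "cluster vs ractor:   #{ratio(1_068, 148)}x the memory"

# Kino vs the ractor-pool wrapper experiment (req/s).
puts "kino vs wrapper:     #{ratio(194_000, 19_500)}x"
```

The first ratio matches the 5.8× stated in the text. The other two are not stated there and are shown only as derived figures.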
data/ext/kino/Cargo.toml
CHANGED
data/ext/kino/src/registry.rs
CHANGED
|
@@ -41,6 +41,9 @@ pub struct ServerInner {
|
|
|
41
41
|
/// 0 = no request timeout; otherwise the response head must arrive
|
|
42
42
|
/// within this many ms or the client gets a 504.
|
|
43
43
|
pub request_timeout_ms: u64,
|
|
44
|
+
/// 0 = unlimited; otherwise the max request-body bytes accepted before a
|
|
45
|
+
/// 413 (truthful Content-Length) or a mid-stream abort (chunked/lying).
|
|
46
|
+
pub max_body_size: usize,
|
|
44
47
|
pub timeouts: AtomicU64,
|
|
45
48
|
pub https: bool,
|
|
46
49
|
/// Native access log sink (None unless log_requests is on).
|
|
@@ -180,6 +183,7 @@ pub fn test_server(lanes: bool, queue_depth: usize) -> Arc<ServerInner> {
|
|
|
180
183
|
rejected: AtomicU64::new(0),
|
|
181
184
|
queue_timeout_ms: 10,
|
|
182
185
|
request_timeout_ms: 0,
|
|
186
|
+
max_body_size: 0,
|
|
183
187
|
timeouts: AtomicU64::new(0),
|
|
184
188
|
https: false,
|
|
185
189
|
access_log: None,
|
data/ext/kino/src/request.rs
CHANGED
|
@@ -25,6 +25,12 @@ pub struct RequestCtx {
|
|
|
25
25
|
/// Request body, streamed from hyper through a bounded channel: hyper is
|
|
26
26
|
/// only polled as Ruby consumes, so inbound backpressure is free.
|
|
27
27
|
pub body_rx: flume::Receiver<Bytes>,
|
|
28
|
+
/// Set by the body forwarder when the body exceeded max_body_size: turns
|
|
29
|
+
/// the next read into an error instead of a (truncated) clean EOF.
|
|
30
|
+
pub body_overflow: Arc<std::sync::atomic::AtomicBool>,
|
|
31
|
+
/// Set by the body forwarder when the client stalled past the idle
|
|
32
|
+
/// deadline: the next read raises so the worker reclaims its slot.
|
|
33
|
+
pub body_timeout: Arc<std::sync::atomic::AtomicBool>,
|
|
28
34
|
/// When a frame is bigger than read_body's max_len, the rest waits here.
|
|
29
35
|
pub leftover: Option<Bytes>,
|
|
30
36
|
/// The owning worker slot (set at admit time, queue.rs); its interrupt
|
|
@@ -62,6 +68,20 @@ fn interrupted_error(ruby: &Ruby) -> Error {
|
|
|
62
68
|
)
|
|
63
69
|
}
|
|
64
70
|
|
|
71
|
+
fn body_too_large_error(ruby: &Ruby) -> Error {
|
|
72
|
+
Error::new(
|
|
73
|
+
ruby.exception_runtime_error(),
|
|
74
|
+
"Kino: request body exceeded max_body_size",
|
|
75
|
+
)
|
|
76
|
+
}
|
|
77
|
+
|
|
78
|
+
fn body_timeout_error(ruby: &Ruby) -> Error {
|
|
79
|
+
Error::new(
|
|
80
|
+
ruby.exception_runtime_error(),
|
|
81
|
+
"Kino: request body read timed out",
|
|
82
|
+
)
|
|
83
|
+
}
|
|
84
|
+
|
|
65
85
|
fn invalid_response(ruby: &Ruby, e: impl std::fmt::Display) -> Error {
|
|
66
86
|
Error::new(
|
|
67
87
|
ruby.exception_runtime_error(),
|
|
@@ -238,7 +258,17 @@ impl Request {
|
|
|
238
258
|
});
|
|
239
259
|
match outcome {
|
|
240
260
|
Some(Some(bytes)) => bytes,
|
|
241
|
-
Some(None) =>
|
|
261
|
+
Some(None) => {
|
|
262
|
+
// Disconnected: a clean EOF, unless the forwarder
|
|
263
|
+
// abandoned the body (too large, or the client stalled).
|
|
264
|
+
if ctx.body_overflow.load(std::sync::atomic::Ordering::Relaxed) {
|
|
265
|
+
return Err(body_too_large_error(ruby));
|
|
266
|
+
}
|
|
267
|
+
if ctx.body_timeout.load(std::sync::atomic::Ordering::Relaxed) {
|
|
268
|
+
return Err(body_timeout_error(ruby));
|
|
269
|
+
}
|
|
270
|
+
return Ok(None); // EOF
|
|
271
|
+
}
|
|
242
272
|
None => return Err(interrupted_error(ruby)),
|
|
243
273
|
}
|
|
244
274
|
}
|
|
@@ -363,6 +393,8 @@ pub fn test_ctx() -> crate::registry::BoxedCtx {
|
|
|
363
393
|
local_addr: "127.0.0.1:9292".parse().expect("static addr"),
|
|
364
394
|
https: false,
|
|
365
395
|
body_rx,
|
|
396
|
+
body_overflow: Arc::new(std::sync::atomic::AtomicBool::new(false)),
|
|
397
|
+
body_timeout: Arc::new(std::sync::atomic::AtomicBool::new(false)),
|
|
366
398
|
leftover: None,
|
|
367
399
|
slot: None,
|
|
368
400
|
responder: Arc::new(Responder::new(head_tx)),
|
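The request.rs change above makes a disconnected body channel ambiguous only until the two flags are checked: a clean EOF returns nothing, while an abandoned body raises. A minimal Ruby model of that decision follows. It is a sketch, not Kino's API: `BodyState` and `read_chunk` are hypothetical names, and a plain array stands in for the flume channel.

```ruby
# Hypothetical model of the read path in the request.rs hunk above.
# An Array stands in for the bounded channel; the two booleans stand in
# for the body_overflow / body_timeout atomics set by the forwarder.
BodyState = Struct.new(:chunks, :overflow, :timed_out)

# Returns the next chunk, or nil at a clean EOF. Raises when the channel
# is drained AND the forwarder flagged that it abandoned the body.
def read_chunk(state)
  chunk = state.chunks.shift
  return chunk if chunk

  raise "Kino: request body exceeded max_body_size" if state.overflow
  raise "Kino: request body read timed out" if state.timed_out

  nil # clean EOF
end

clean = BodyState.new(["hello"], false, false)
p read_chunk(clean) # a chunk
p read_chunk(clean) # nil: clean EOF

truncated = BodyState.new([], true, false)
begin
  read_chunk(truncated)
rescue RuntimeError => e
  puts "raised: #{e.message}"
end
```

As in the diff, chunks already queued are delivered first; the error only surfaces on the read that would otherwise have reported EOF, so the app never mistakes a truncated body for a complete one.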
data/ext/kino/src/server.rs
CHANGED
|
@@ -51,6 +51,8 @@ pub fn server_start(ruby: &Ruby, config: magnus::RHash) -> Result<(u64, u16), Er
|
|
|
51
51
|
let queue_depth: usize = cfg(ruby, config, "queue_depth")?;
|
|
52
52
|
let queue_timeout_ms: u64 = cfg(ruby, config, "queue_timeout_ms")?;
|
|
53
53
|
let request_timeout_ms: u64 = cfg_opt::<u64>(ruby, config, "request_timeout_ms")?.unwrap_or(0);
|
|
54
|
+
let max_body_size: usize = cfg_opt::<usize>(ruby, config, "max_body_size")?.unwrap_or(0);
|
|
55
|
+
let max_connections: usize = cfg_opt::<usize>(ruby, config, "max_connections")?.unwrap_or(1024);
|
|
54
56
|
let tokio_threads: usize = cfg_opt::<usize>(ruby, config, "tokio_threads")?.unwrap_or(0);
|
|
55
57
|
let tls_cert: Option<String> = cfg_opt(ruby, config, "tls_cert")?;
|
|
56
58
|
let tls_key: Option<String> = cfg_opt(ruby, config, "tls_key")?;
|
|
@@ -104,6 +106,7 @@ pub fn server_start(ruby: &Ruby, config: magnus::RHash) -> Result<(u64, u16), Er
|
|
|
104
106
|
rejected: std::sync::atomic::AtomicU64::new(0),
|
|
105
107
|
queue_timeout_ms,
|
|
106
108
|
request_timeout_ms,
|
|
109
|
+
max_body_size,
|
|
107
110
|
timeouts: std::sync::atomic::AtomicU64::new(0),
|
|
108
111
|
https: acceptor.is_some(),
|
|
109
112
|
access_log: log_requests.then(|| crate::logsink::Sink::new(std::io::stdout())),
|
|
@@ -120,6 +123,7 @@ pub fn server_start(ruby: &Ruby, config: magnus::RHash) -> Result<(u64, u16), Er
|
|
|
120
123
|
tokio_listener,
|
|
121
124
|
acceptor,
|
|
122
125
|
server.clone(),
|
|
126
|
+
max_connections,
|
|
123
127
|
shutdown_rx,
|
|
124
128
|
));
|
|
125
129
|
*server.runtime.lock() = Some(runtime);
|
|
@@ -129,40 +133,73 @@ pub fn server_start(ruby: &Ruby, config: magnus::RHash) -> Result<(u64, u16), Er
|
|
|
129
133
|
Ok((id, local_port))
|
|
130
134
|
}
|
|
131
135
|
|
|
136
|
+
/// Slowloris guard for TLS: a client that completes the TCP connect but then
|
|
137
|
+
/// stalls the handshake would otherwise hold a connection slot indefinitely
|
|
138
|
+
/// (the per-request and header-read deadlines only start once hyper is
|
|
139
|
+
/// serving, i.e. after the handshake). A handshake is a few round trips, so
|
|
140
|
+
/// this is generous even for a high-latency client. Fixed, like the header
|
|
141
|
+
/// timeout: not a knob.
|
|
142
|
+
const TLS_HANDSHAKE_TIMEOUT: Duration = Duration::from_secs(10);
|
|
143
|
+
|
|
132
144
|
async fn accept_loop(
|
|
133
145
|
listener: tokio::net::TcpListener,
|
|
134
146
|
acceptor: Option<tokio_rustls::TlsAcceptor>,
|
|
135
147
|
server: Arc<ServerInner>,
|
|
148
|
+
max_connections: usize,
|
|
136
149
|
mut shutdown_rx: tokio::sync::watch::Receiver<bool>,
|
|
137
150
|
) {
|
|
151
|
+
// Bound concurrent connections: unbounded, a flood spawns a task and holds
|
|
152
|
+
// a socket per connection until file descriptors or memory run out. One
|
|
153
|
+
// permit per live connection; acquiring BEFORE accept leaves the excess in
|
|
154
|
+
// the kernel backlog (backpressure) rather than accepting then dropping.
|
|
155
|
+
let conn_limit = Arc::new(tokio::sync::Semaphore::new(max_connections));
|
|
138
156
|
loop {
|
|
139
|
-
tokio::select! {
|
|
157
|
+
let permit = tokio::select! {
|
|
140
158
|
_ = shutdown_rx.changed() => break,
|
|
141
|
-
|
|
142
|
-
|
|
143
|
-
|
|
144
|
-
|
|
145
|
-
|
|
146
|
-
|
|
147
|
-
|
|
148
|
-
|
|
149
|
-
|
|
150
|
-
|
|
151
|
-
|
|
152
|
-
|
|
153
|
-
|
|
154
|
-
|
|
155
|
-
|
|
156
|
-
|
|
157
|
-
|
|
158
|
-
|
|
159
|
-
|
|
160
|
-
|
|
159
|
+
permit = conn_limit.clone().acquire_owned() => match permit {
|
|
160
|
+
Ok(permit) => permit,
|
|
161
|
+
Err(_) => break, // semaphore closed
|
|
162
|
+
},
|
|
163
|
+
};
|
|
164
|
+
let (stream, remote_addr) = tokio::select! {
|
|
165
|
+
_ = shutdown_rx.changed() => break,
|
|
166
|
+
accepted = listener.accept() => match accepted {
|
|
167
|
+
Ok(pair) => pair,
|
|
168
|
+
Err(_) => continue, // transient accept error; permit drops, retry
|
|
169
|
+
},
|
|
170
|
+
};
|
|
171
|
+
// Small responses must not wait on Nagle + delayed ACK.
|
|
172
|
+
let _ = stream.set_nodelay(true);
|
|
173
|
+
let local_addr = stream
|
|
174
|
+
.local_addr()
|
|
175
|
+
.unwrap_or_else(|_| SocketAddr::from(([0, 0, 0, 0], 0)));
|
|
176
|
+
let server = server.clone();
|
|
177
|
+
let acceptor = acceptor.clone();
|
|
178
|
+
tokio::spawn(async move {
|
|
179
|
+
// Held for the connection's lifetime; dropping it frees a slot.
|
|
180
|
+
let _permit = permit;
|
|
181
|
+
match acceptor {
|
|
182
|
+
Some(acceptor) => {
|
|
183
|
+
// Handshake failures (port scans, plain HTTP to a TLS
|
|
184
|
+
// port) and stalled handshakes (slowloris) just drop the
|
|
185
|
+
// connection; the timeout bounds the latter.
|
|
186
|
+
let handshake = tokio::time::timeout(TLS_HANDSHAKE_TIMEOUT, acceptor.accept(stream));
|
|
187
|
+
let Ok(Ok(tls)) = handshake.await else { return };
|
|
188
|
+
serve_connection(tls, server, remote_addr, local_addr).await;
|
|
189
|
+
}
|
|
190
|
+
None => serve_connection(stream, server, remote_addr, local_addr).await,
|
|
161
191
|
}
|
|
162
|
-
}
|
|
192
|
+
});
|
|
163
193
|
}
|
|
164
194
|
}
|
|
165
195
|
|
|
196
|
+
/// Slowloris guard: drop a connection that has not sent its complete request
|
|
197
|
+
/// headers within this window. Long enough never to trip a real client (even
|
|
198
|
+
/// on a slow mobile link), short enough to reap a stalled one. Deliberately a
|
|
199
|
+
/// constant, not a config knob: fine-tuning intake limits is the fronting
|
|
200
|
+
/// proxy's job; the actual hazard was having no default at all.
|
|
201
|
+
const HEADER_READ_TIMEOUT: Duration = Duration::from_secs(15);
|
|
202
|
+
|
|
166
203
|
async fn serve_connection<I>(
|
|
167
204
|
io: I,
|
|
168
205
|
server: Arc<ServerInner>,
|
|
@@ -176,7 +213,13 @@ async fn serve_connection<I>(
|
|
|
176
213
|
// No auto Date header: it costs a clock read per response (together
|
|
177
214
|
// with timer reads, ~7% of tokio-side cycles in the profile); it's a
|
|
178
215
|
// SHOULD not a MUST, and apps that need it can set it themselves.
|
|
216
|
+
//
|
|
217
|
+
// The timer is installed so header_read_timeout actually fires: hyper's
|
|
218
|
+
// slow-header guard is inert without one. It arms only while the request
|
|
219
|
+
// head is being read, so it adds no per-response cost on the hot path.
|
|
179
220
|
let _ = hyper::server::conn::http1::Builder::new()
|
|
221
|
+
.timer(hyper_util::rt::TokioTimer::new())
|
|
222
|
+
.header_read_timeout(HEADER_READ_TIMEOUT)
|
|
180
223
|
.auto_date_header(false)
|
|
181
224
|
.serve_connection(TokioIo::new(io), service)
|
|
182
225
|
.await;
|
|
@@ -198,6 +241,25 @@ fn branded(mut response: HyperResponse) -> HyperResponse {
|
|
|
198
241
|
response
|
|
199
242
|
}
|
|
200
243
|
|
|
244
|
+
/// A single valid Content-Length as a byte count. hyper has already rejected
|
|
245
|
+
/// conflicting/duplicate values, so the first is authoritative; anything
|
|
246
|
+
/// unparseable yields None and the streaming cap still applies.
|
|
247
|
+
fn content_length(headers: &http::HeaderMap) -> Option<u64> {
|
|
248
|
+
headers
|
|
249
|
+
.get(http::header::CONTENT_LENGTH)?
|
|
250
|
+
.to_str()
|
|
251
|
+
.ok()?
|
|
252
|
+
.trim()
|
|
253
|
+
.parse()
|
|
254
|
+
.ok()
|
|
255
|
+
}
|
|
256
|
+
|
|
257
|
+
/// Idle deadline between request-body frames. A client that stalls mid-body
|
|
258
|
+
/// would otherwise hold a worker slot indefinitely (the worker blocks in
|
|
259
|
+
/// read_body). Generous: a real upload sends steadily and resets this each
|
|
260
|
+
/// frame, so only a silent client trips it. Fixed, like the header timeout.
|
|
261
|
+
const BODY_READ_TIMEOUT: Duration = Duration::from_secs(30);
|
|
262
|
+
|
|
201
263
|
async fn handle_request(
|
|
202
264
|
server: Arc<ServerInner>,
|
|
203
265
|
remote_addr: SocketAddr,
|
|
@@ -221,19 +283,50 @@ async fn handle_request(
|
|
|
221
283
|
)
|
|
222
284
|
});
|
|
223
285
|
|
|
286
|
+
// Body-size guard: an honestly-declared oversize body is refused with a
|
|
287
|
+
// 413 below, before any worker runs. Chunked or lying clients are caught
|
|
288
|
+
// by the forwarder, which caps cumulative bytes and flags an overflow so
|
|
289
|
+
// read_body raises instead of letting the app buffer without bound.
|
|
290
|
+
let max_body = server.max_body_size;
|
|
291
|
+
let oversize =
|
|
292
|
+
max_body > 0 && content_length(&parts.headers).is_some_and(|len| len > max_body as u64);
|
|
293
|
+
|
|
224
294
|
// Stream the request body through a bounded channel: hyper is polled
|
|
225
295
|
// only as fast as the Ruby side consumes (inbound backpressure), and the
|
|
226
296
|
// forwarder dropping the sender is EOF. Bodyless requests (most GETs)
|
|
227
297
|
// skip the forwarder task entirely: dropping the sender IS the EOF.
|
|
228
298
|
let (body_tx, body_rx) = flume::bounded::<bytes::Bytes>(8);
|
|
229
|
-
|
|
299
|
+
let body_overflow = Arc::new(std::sync::atomic::AtomicBool::new(false));
|
|
300
|
+
let body_timeout = Arc::new(std::sync::atomic::AtomicBool::new(false));
|
|
301
|
+
if oversize || hyper::body::Body::is_end_stream(&body) {
|
|
230
302
|
drop(body_tx);
|
|
231
303
|
} else {
|
|
304
|
+
let overflow = body_overflow.clone();
|
|
305
|
+
let timed_out = body_timeout.clone();
|
|
232
306
|
tokio::spawn(async move {
|
|
233
307
|
let mut body = body;
|
|
234
|
-
|
|
235
|
-
|
|
308
|
+
let mut total: u64 = 0;
|
|
309
|
+
loop {
|
|
310
|
+
// Idle deadline between frames: a client that stalls mid-body
|
|
311
|
+
// would otherwise pin a worker blocked in read_body. Only the
|
|
312
|
+
// client's silence trips this; a slow APP blocks the forwarder
|
|
313
|
+
// in send_async below instead, which is not timed.
|
|
314
|
+
let frame = match tokio::time::timeout(BODY_READ_TIMEOUT, body.frame()).await {
|
|
315
|
+
Ok(Some(Ok(frame))) => frame,
|
|
316
|
+
Ok(Some(Err(_))) | Ok(None) => break, // body error or clean EOF
|
|
317
|
+
Err(_) => {
|
|
318
|
+
timed_out.store(true, Ordering::Relaxed);
|
|
319
|
+
break;
|
|
320
|
+
}
|
|
321
|
+
};
|
|
236
322
|
if let Ok(data) = frame.into_data() {
|
|
323
|
+
total += data.len() as u64;
|
|
324
|
+
if max_body > 0 && total > max_body as u64 {
|
|
325
|
+
// Past the cap: flag it and stop pulling. Dropping the
|
|
326
|
+
// sender unblocks read_body, which then raises.
|
|
327
|
+
overflow.store(true, Ordering::Relaxed);
|
|
328
|
+
break;
|
|
329
|
+
}
|
|
237
330
|
if body_tx.send_async(data).await.is_err() {
|
|
238
331
|
break; // request handle dropped; stop pulling
|
|
239
332
|
}
|
|
@@ -253,6 +346,8 @@ async fn handle_request(
|
|
|
253
346
|
local_addr,
|
|
254
347
|
https: server.https,
|
|
255
348
|
body_rx,
|
|
349
|
+
body_overflow,
|
|
350
|
+
body_timeout,
|
|
256
351
|
leftover: None,
|
|
257
352
|
slot: None,
|
|
258
353
|
responder,
|
|
@@ -273,6 +368,9 @@ async fn handle_request(
|
|
|
273
368
|
|
|
274
369
|
// Single exit point so the access log sees every outcome, 503s included.
|
|
275
370
|
let response: HyperResponse = 'resp: {
|
|
371
|
+
if oversize {
|
|
372
|
+
break 'resp plain_response(413, "Payload Too Large\n");
|
|
373
|
+
}
|
|
276
374
|
if server.lanes {
|
|
277
375
|
if !dispatch_to_lane(&server, ctx).await {
|
|
278
376
|
break 'resp unavailable(&server);
|
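The body-size guard in the server.rs hunks above has two layers: an up-front check on a declared Content-Length (answered with a 413 before any worker runs) and a cumulative cap in the forwarder for chunked or mis-declared bodies. The Ruby sketch below models only that logic under stated assumptions: `oversize?` and `forward_body` are hypothetical names, an array of strings stands in for hyper's body frames, and 0 means unlimited as in the diff.

```ruby
# Hypothetical model of the two-layer body-size guard shown above.
# max_body == 0 means "unlimited", matching the Rust side.

# Layer 1: refuse up front when the declared length already exceeds the cap.
# content_length is nil when the header is absent or unparseable.
def oversize?(max_body, content_length)
  max_body > 0 && !content_length.nil? && content_length > max_body
end

# Layer 2: forward chunks until the running total passes the cap.
# Returns [forwarded_chunks, overflowed]; the chunk that crosses the cap
# is not forwarded.
def forward_body(chunks, max_body)
  total = 0
  forwarded = []
  chunks.each do |chunk|
    total += chunk.bytesize
    return [forwarded, true] if max_body > 0 && total > max_body

    forwarded << chunk
  end
  [forwarded, false]
end

p oversize?(10, 11)                         # declared too large
p oversize?(10, nil)                        # no usable header: layer 2 decides
p forward_body(%w[aaaa bbbb cccc], 10)      # third chunk crosses the cap
```

A body exactly at the cap passes both layers, since both comparisons are strict.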
data/lib/kino/configuration.rb
CHANGED
|
@@ -17,6 +17,8 @@ module Kino
|
|
|
17
17
|
queue_depth: 1024,
|
|
18
18
|
queue_timeout: 5.0,
|
|
19
19
|
request_timeout: nil,
|
|
20
|
+
max_connections: nil, # nil = derive from the open-file limit
|
|
21
|
+
max_body_size: 50 * 1024 * 1024, # 50 MB; nil/0 = unlimited
|
|
20
22
|
batch: 1,
|
|
21
23
|
lanes: false,
|
|
22
24
|
log_requests: false,
|
|
@@ -160,6 +162,14 @@ module Kino
|
|
|
160
162
|
# Seconds the app gets before the client receives a 504; nil = off.
|
|
161
163
|
def request_timeout(seconds) = @config.set(:request_timeout, seconds && Float(seconds))
|
|
162
164
|
|
|
165
|
+
# Max connections served at once; beyond it, new connections wait in
|
|
166
|
+
# the kernel backlog. Defaults to most of the open-file limit.
|
|
167
|
+
def max_connections(count) = @config.set(:max_connections, Integer(count))
|
|
168
|
+
|
|
169
|
+
# Max request-body bytes before a 413; nil disables (delegate to a
|
|
170
|
+
# fronting proxy). Default 50 MB.
|
|
171
|
+
def max_body_size(bytes) = @config.set(:max_body_size, bytes && Integer(bytes))
|
|
172
|
+
|
|
163
173
|
# Requests a worker may grab per queue visit (default 1).
|
|
164
174
|
def batch(count) = @config.set(:batch, Integer(count))
|
|
165
175
|
|
data/lib/kino/server.rb
CHANGED
|
@@ -50,6 +50,8 @@ module Kino
|
|
|
50
50
|
@queue_depth = Integer(settings[:queue_depth])
|
|
51
51
|
@queue_timeout_ms = (Float(settings[:queue_timeout]) * 1000).round
|
|
52
52
|
@request_timeout_ms = settings[:request_timeout] ? (Float(settings[:request_timeout]) * 1000).round : 0
|
|
53
|
+
@max_connections = settings[:max_connections] ? Integer(settings[:max_connections]) : default_max_connections
|
|
54
|
+
@max_body_size = Integer(settings[:max_body_size] || 0)
|
|
53
55
|
@batch = [Integer(settings[:batch]), 1].max
|
|
54
56
|
@lanes = !!settings[:lanes]
|
|
55
57
|
@log_requests = !!settings[:log_requests]
|
|
@@ -74,6 +76,8 @@ module Kino
|
|
|
74
76
|
bind: @bind, port: @requested_port,
|
|
75
77
|
queue_depth: @queue_depth, queue_timeout_ms: @queue_timeout_ms,
|
|
76
78
|
request_timeout_ms: @request_timeout_ms,
|
|
79
|
+
max_connections: @max_connections,
|
|
80
|
+
max_body_size: @max_body_size,
|
|
77
81
|
tokio_threads: @tokio_threads,
|
|
78
82
|
tls_cert: @tls&.fetch(:cert), tls_key: @tls&.fetch(:key),
|
|
79
83
|
lanes: @lanes, log_requests: @log_requests
|
|
@@ -214,6 +218,18 @@ module Kino
|
|
|
214
218
|
Process.clock_gettime(Process::CLOCK_MONOTONIC)
|
|
215
219
|
end
|
|
216
220
|
|
|
221
|
+
# Default connection cap: most of the process open-file limit. A
|
|
222
|
+
# connection flood's failure mode is descriptor exhaustion, and in
|
|
223
|
+
# :ractor/:threaded mode the app's own sockets and files share this
|
|
224
|
+
# process's table, so leave headroom. Scales with `ulimit -n`; raise the
|
|
225
|
+
# OS limit (or set max_connections) to allow more.
|
|
226
|
+
def default_max_connections
|
|
227
|
+
soft, = Process.getrlimit(Process::RLIMIT_NOFILE)
|
|
228
|
+
return 65_536 if soft == Process::RLIM_INFINITY
|
|
229
|
+
|
|
230
|
+
[soft * 8 / 10, 64].max
|
|
231
|
+
end
|
|
232
|
+
|
|
217
233
|
def join_workers(deadline)
|
|
218
234
|
if @supervisor
|
|
219
235
|
@supervisor.shutdown([deadline - monotonic_now, 0].max)
|
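To see what the default connection cap works out to on a given machine, the derivation in `default_max_connections` above can be reproduced as a standalone function. This is a sketch: `derive_cap` is a hypothetical name that takes the soft limit as an argument so it can be tried with sample values, and it assumes a platform where `Process.getrlimit` is available.

```ruby
# Standalone copy of the rule in default_max_connections above:
# 80% of the soft open-file limit, never below 64, and 65,536 when the
# limit is unlimited. `derive_cap` is a hypothetical name.
def derive_cap(soft_limit)
  return 65_536 if soft_limit == Process::RLIM_INFINITY

  [soft_limit * 8 / 10, 64].max
end

# The soft limit of the current process (what `ulimit -n` reports).
soft, = Process.getrlimit(Process::RLIMIT_NOFILE)
puts "this process: soft limit #{soft} -> cap #{derive_cap(soft)}"

# Sample values.
[50, 256, 1024, 65_536].each do |limit|
  puts "soft limit #{limit} -> cap #{derive_cap(limit)}"
end
```

With the common default soft limit of 1024 the cap is 819, which leaves roughly a fifth of the descriptor table for the app's own sockets and files.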
data/lib/kino/templates/kino.rb.tt
CHANGED
|
@@ -55,6 +55,17 @@
|
|
|
55
55
|
# above your slowest legitimate endpoint.
|
|
56
56
|
# request_timeout 30
|
|
57
57
|
|
|
58
|
+
# Most connections to serve at once. Past this, new connections wait in
|
|
59
|
+
# the kernel backlog instead of piling up until the server runs out of
|
|
60
|
+
# file descriptors. Defaults to most of the open-file limit (ulimit -n),
|
|
61
|
+
# so it scales with the OS limit and only bites under a flood.
|
|
62
|
+
# max_connections 8192
|
|
63
|
+
|
|
64
|
+
# Reject request bodies larger than this many bytes with a 413, so an
|
|
65
|
+
# oversized or endless upload can't drive your app to run out of memory.
|
|
66
|
+
# Set to nil to disable and let a fronting proxy handle it. Default: 50 MB.
|
|
67
|
+
# max_body_size 50 * 1024 * 1024
|
|
68
|
+
|
|
58
69
|
# How many requests a worker grabs from the line at once. Leave at 1
|
|
59
70
|
# unless all your endpoints are uniformly fast.
|
|
60
71
|
# batch 1
|
data/lib/kino/version.rb
CHANGED
data/sig/kino.rbs
CHANGED
|
@@ -92,6 +92,8 @@ module Kino
|
|
|
92
92
|
def queue_depth: (int depth) -> untyped
|
|
93
93
|
def queue_timeout: (Numeric seconds) -> untyped
|
|
94
94
|
def request_timeout: (Numeric? seconds) -> untyped
|
|
95
|
+
def max_connections: (int count) -> untyped
|
|
96
|
+
def max_body_size: (int? bytes) -> untyped
|
|
95
97
|
def batch: (int count) -> untyped
|
|
96
98
|
def lanes: (boolish enabled) -> untyped
|
|
97
99
|
def log_requests: (boolish enabled) -> untyped
|