kino 0.1.1 → 0.1.2

checksums.yaml CHANGED
@@ -1,7 +1,7 @@
1
1
  ---
2
2
  SHA256:
3
- metadata.gz: 5aba4896a5a135898c04a67f68c2e28622b044124c2a0fe3bea2f5691bf28d74
4
- data.tar.gz: 2ce9e3de94784aa9e396752cbfe73c3a2ee0766ebae45b2fbb89d8e80557d009
3
+ metadata.gz: 0bd8e6e3b295832fa1d87743b4c9a121cdb5687287011b29f1140814fdae0575
4
+ data.tar.gz: f21305459e857d366159ee258d873e2109b7a7715bb0cbe8d23c47e458ae2b33
5
5
  SHA512:
6
- metadata.gz: d0b3eac3075143dd92534d14c57a88c00650078afca07e9e23c523e2c5c2519da80aade571acc11128850be388d79a1957560db1be194fba80b47cad282e8aa6
7
- data.tar.gz: 771c4ddb9f8509550dee82ad0bb2f18f73f173fdd81a32041a5b09d7b98eb836cb07345b583e84c9e42068b1182e48f9e544458cb11cd1706fd331017029d32b
6
+ metadata.gz: 04e95f9ee2133b4d15bdd2069977e73791a16b033e57a3bdb88478d0048b0bb5d8f56eff1963813bbf8d9399461bb80bf9c33a2039f98805e4f129b3e42b28ef
7
+ data.tar.gz: 8eb131cdbbe5bdbd29d188ab4ff43dbbec2f8c791aacf85586ba69c7208b2f74cc05d4007c469a4de61c4578ef08214e56a4fd080b9ad655af3a01d208b4c752
data/CHANGELOG.md CHANGED
@@ -1,4 +1,32 @@
1
+ ## [0.1.2] - 2026-06-22
2
+
3
+ - Drop a connection that has not sent its complete request headers
4
+ within 15 seconds. Closes a slowloris hole: hyper's built-in header-read
5
+ timeout was inert because the server installed no timer, so a slow-header
6
+ client could tie up a connection (and its tokio task) indefinitely.
7
+ - Cap concurrent connections (new `max_connections` directive). Past the cap,
8
+ new connections wait in the kernel backlog instead of piling up until a
9
+ flood exhausts file descriptors or memory. Defaults to most of the process
10
+ open-file limit (`ulimit -n`), so it scales with the OS limit and only
11
+ engages under a flood.
12
+ - Bound the TLS handshake to 10 seconds. A client that completed the TCP
13
+ connect but stalled the handshake could otherwise hold a connection slot
14
+ indefinitely, since the request and header-read deadlines only begin once
15
+ the handshake finishes.
16
+ - Cap the request body at 50 MB by default (new `max_body_size` directive,
17
+ configurable; nil or 0 disables and delegates to a fronting proxy). An app
18
+ that reads `rack.input` could otherwise be driven to run out of memory by an
19
+ oversized or endless upload. A truthful oversize Content-Length is refused
20
+ with a 413 before the app runs; a chunked or lying client is cut off
21
+ mid-stream once it passes the cap.
22
+ - Bound the idle time between request-body frames to 30 seconds. A client that
23
+ began a request then stalled mid-body would otherwise keep a worker blocked
24
+ in `rack.input.read` indefinitely; now the read raises and the worker
25
+ reclaims its slot. Only a silent client trips it: a steadily-sent body resets
26
+ the deadline each frame, so slow-but-active uploads are unaffected.
27
+
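The two enforcement points described for `max_body_size` can be sketched in plain Ruby. This is only an illustration of the logic, not Kino's implementation (the changelog places that work in the native server). `BodyTooLarge`, `check_content_length`, and `read_capped` are hypothetical names, and reading "50 MB" as 50 × 1024 × 1024 bytes is an assumption.

```ruby
# Hypothetical sketch of a request-body cap; none of these names are Kino's API.

DEFAULT_CAP = 50 * 1024 * 1024 # assumption: "50 MB" read as 50 MiB

class BodyTooLarge < StandardError; end

# Point 1: a declared Content-Length over the cap is refused up front.
# Returns 413 (to send before the app runs) or nil to proceed.
# A nil or 0 cap disables the check, as the changelog describes.
def check_content_length(content_length, cap = DEFAULT_CAP)
  return nil if cap.nil? || cap.zero?
  return nil if content_length.nil? # chunked body: no declared length
  content_length > cap ? 413 : nil
end

# Point 2: a chunked (or lying) client is cut off mid-stream once the
# bytes actually received pass the cap.
def read_capped(chunks, cap = DEFAULT_CAP)
  unlimited = cap.nil? || cap.zero?
  received = 0
  body = +""
  chunks.each do |chunk|
    received += chunk.bytesize
    raise BodyTooLarge, "body exceeded #{cap} bytes" if !unlimited && received > cap
    body << chunk
  end
  body
end
```

Checking the declared length first keeps an honest oversized upload from ever reaching the app, while the running byte count covers clients that declare nothing or declare a false length.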
1
28
  ## [0.1.1] - 2026-06-11
29
+
2
30
  - Mode-dependent `threads` default: 1 per worker in :ractor mode (threads
3
31
  inside a ractor share its lock and cost a per-request handoff; +16-18%
4
32
  on fast handlers, measured on dedicated hardware), 3 in :threaded mode.
data/Cargo.lock CHANGED
@@ -332,7 +332,7 @@ dependencies = [
332
332
 
333
333
  [[package]]
334
334
  name = "kino"
335
- version = "0.1.1"
335
+ version = "0.1.2"
336
336
  dependencies = [
337
337
  "ahash",
338
338
  "bytes",
data/README.md CHANGED
@@ -14,9 +14,8 @@ and a threaded fallback mode runs everything else, Rails included.
14
14
  * **Fast.** On a real 8-core server, every Kino mode is **1.5-2×**
15
15
  ahead of a Puma fork cluster on I/O-light endpoints. Ractor mode also
16
16
  wins on pure CPU, **30%+**. [Benchmarks](#benchmarks) below.
17
- * **A fraction of the memory.** One process instead of a fork per core:
18
- about **15× less memory** than the Puma cluster under the same load,
19
- and 8× less when serving the Rails hello-world.
17
+ * **A fraction of the memory.** About **7× less** on the tiny synthetic
18
+ Ractor-shareable bench app, and about **4× less memory** than a Puma cluster serving Rails in threaded fallback mode.
20
19
  * **Parallel without forking.** Ractor mode runs CPU work **more than
21
20
  5× faster** than Kino's own GVL-bound threaded mode, in the same
22
21
  small process.
@@ -64,36 +63,55 @@ notes live in [doc/architecture.md](doc/architecture.md).
64
63
  ## Benchmarks
65
64
 
66
65
  Measured on a real server: AWS **c7a.2xlarge** (8-core AMD EPYC 9R14,
67
- 16 GB, Amazon Linux 2023). This is a realistic app-server size. The same
68
- Ractor-shareable app runs on every server, Ruby 4.0.5 with YJIT, every
69
- server at its defaults: Puma forks 8 workers × 3 threads, Kino stays in
70
- one process (8 workers; 1 thread each in ractor modes, 3 in threaded).
71
- Numbers are req/s by wrk (8-second windows, 64 connections, same host).
72
- Methodology and the analysis behind every column:
66
+ 16 GB, Amazon Linux 2023). This is a realistic app-server size.
67
+
68
+ **These tables run a tiny synthetic Rack app**—plaintext, a 10 KB body,
69
+ a CPU-bound `fib`, a 5 ms wait—deliberately small, to measure the server
70
+ rather than an app. It is Ractor-shareable, so Kino runs it in `:ractor`
71
+ mode (and `:threaded` for comparison). **A real Rails app is a different
72
+ story:** it is *not* Ractor-shareable, so it runs only in Kino's
73
+ `:threaded` fallback, with its own numbers—see [Rails](#rails) below.
74
+ Ruby 4.0.5 with YJIT, every server at its defaults: Puma forks 8 workers ×
75
+ 3 threads, Kino stays in one process (8 workers; 1 thread each in ractor
76
+ modes, 3 in threaded). Numbers are req/s by wrk (8-second windows, 64
77
+ connections, same host). Methodology:
73
78
  [doc/benchmarks.md](doc/benchmarks.md).
74
79
 
75
80
  | endpoint | Kino :ractor | + lanes | :ractor, `workers 32`² | Kino :threaded | Puma (cluster) |
76
81
  |-------------|-------------:|--------:|-----------------------:|---------------:|---------------:|
77
- | /plaintext | 229,565 | **244,340** | 156,118 | 217,619 | 118,190 |
78
- | /10k | 179,119 | **188,258** | 134,457 | 157,147 | 105,588 |
79
- | /cpu (fib) | **76,922**¹| 73,136 | 62,406 | 13,499 | 58,337 |
80
- | /io (5 ms) | 1,548 | 1,548 | **5,935** | 4,715 | 4,687 |
81
- | /io_native | 1,570 | 1,571 | **6,289** | 4,717 | 4,695 |
82
+ | /plaintext | 229,534 | **250,222** | 182,997 | 216,994 | 118,176 |
83
+ | /10k | 178,083 | **189,862** | 151,034 | 160,400 | 106,768 |
84
+ | /cpu (fib) | **77,999**¹| 70,885 | 66,100 | 13,429 | 58,006 |
85
+ | /io (5 ms) | 1,552 | 1,551 | **5,888** | 4,709 | 4,693 |
86
+ | /io_native | 1,570 | 1,571 | **6,274** | 4,695 | 4,691 |
82
87
 
83
- Memory on the same box, RSS after sustained load:
88
+ Memory tells two different stories depending on the app, both by **PSS**
89
+ (proportional set size; see note) after sustained load.
84
90
 
85
- | serving | Kino (one process) | Puma cluster (8 workers) |
86
- |-----------------------|-------------------:|-------------------------:|
87
- | bench app, :ractor | **80 MB** | 1,256 MB |
88
- | bench app, :threaded | **151 MB**³| 1,256 MB |
89
- | Rails hello-world | **97 MB** | 797 MB |
91
+ **The tiny benchmark app** (Ractor-shareable, so Kino runs it in `:ractor`
92
+ or `:threaded`). Kino is **~7× lighter in :ractor mode, ~10× in :threaded**
93
+ than the Puma cluster. The gap stays large because a trivial app is almost
94
+ all private per-worker heap, which copy-on-write can't share:
95
+
96
+ | tiny app, Kino | Kino (one process) | Puma cluster (8 workers) | ratio |
97
+ |-----------------|-------------------:|-------------------------:|------:|
98
+ | :ractor (8×1) | **148 MB** | 1,068 MB | ~7× |
99
+ | :threaded (8×3) | **107 MB**³| 1,068 MB | ~10× |
100
+
101
+ **A real Rails app** (not Ractor-shareable—Kino's `:threaded` fallback
102
+ only, [below](#rails)). The gap is **~4×**, smaller because Rails' large
103
+ framework *is* shared copy-on-write across Puma's forks:
104
+
105
+ | Rails hello-world | Kino :threaded | Puma cluster (8 workers) | ratio |
106
+ |-------------------|---------------:|-------------------------:|------:|
107
+ | **PSS** | **92 MB** | **389 MB** | ~4× |
90
108
 
91
109
  "+ lanes" is the experimental per-worker-queue dispatcher (`lanes true`).
92
110
  It posts the fastest plaintext/10k of any configuration here. Details:
93
111
  [doc/benchmarks.md](doc/benchmarks.md#lane-dispatch-experimental-lanes-true).
94
112
 
95
113
  ¹ Stock settings, no tuning. Ractor mode beats the fork cluster on pure
96
- CPU by +32% (+25% with lanes). Threaded mode shows the GVL ceiling that
114
+ CPU by +34% (+22% with lanes). Threaded mode shows the GVL ceiling that
97
115
  every single-process Ruby server hits. The old CPU-tuning recipe is
98
116
  retired: its `threads 1` half **is** the default now, and its
99
117
  `tokio_threads 1` half costs −12% on real hardware; see
@@ -102,7 +120,7 @@ retired: its `threads 1` half **is** the default now, and its
102
120
  ² Wait-bound throughput is slots ÷ wait, and the default columns bring
103
121
  8 single-thread workers against the cluster's 24 threads. Kino slots
104
122
  are threads, not processes—when your app waits a lot, raise `workers`.
105
- The `workers 32` column is that tuning: **+27% over the cluster on /io
123
+ The `workers 32` column is that tuning: **+25% over the cluster on /io
106
124
  (+34% via `Kino.sleep`)** while still ahead of it on pure CPU, all in
107
125
  one small process. The cost is the CPU-light rows (32 ractors
108
126
  oversubscribe 8 cores); pick the topology your app's wait profile
@@ -111,7 +129,7 @@ needs. See
111
129
 
112
130
  ³ With `MALLOC_ARENA_MAX=2` (the standard Ruby deployment setting;
113
131
  Heroku's default). Without it, 24 threads churning 10 KB responses
114
- through one glibc heap balloon to ~600 MB—an arena-fragmentation
132
+ through one glibc heap balloon to ~670 MB—an arena-fragmentation
115
133
  footgun, not a leak, and ractor mode sidesteps it. See
116
134
  [doc/benchmarks.md](doc/benchmarks.md#memory-under-load-and-the-glibc-arena-footgun).
117
135
 
@@ -121,14 +139,30 @@ doc):
121
139
 
122
140
  | endpoint | Kino :ractor (8×3) | Puma + ractor wrapper | Falcon + ractor wrapper |
123
141
  |------------|-------------------:|----------------------:|------------------------:|
124
- | /plaintext | **199,032** | 19,532 | 100,342 |
125
- | /cpu (fib) | **68,238** | 17,323 | 48,561 |
126
- | /io (5 ms) | **4,531** | 1,452 | 1,544 |
127
-
128
- In short: ractor mode beats fork-level CPU parallelism (**5.7×** Kino's
129
- own GVL-bound threaded mode, +32% over the cluster) in one process, at
130
- about 1/16th of the cluster's memory. Every Kino mode is 1.5-2.1×
131
- ahead of the cluster on I/O-light endpoints. The macOS numbers
142
+ | /plaintext | **193,826** | 19,480 | 99,776 |
143
+ | /cpu (fib) | **68,061** | 17,755 | 48,721 |
144
+ | /io (5 ms) | **4,530** | 1,454 | 1,549 |
145
+
146
+ ### Rails
147
+
148
+ Rails is not Ractor-shareable today, so Kino serves it in `:threaded`
149
+ fallback: one GVL-bound process. On the same box (`examples/rails-hello`,
150
+ edge Rails, production, 8×5):
151
+
152
+ | Rails hello-world | req/s | memory (PSS) |
153
+ |------------------------------|-------:|-------------:|
154
+ | Kino :threaded (one process) | 2,637 | **92 MB** |
155
+ | Puma cluster (8 workers) | 12,138 | 389 MB |
156
+
157
+ The honest trade-off: Puma's fork cluster uses all 8 cores, so it serves
158
+ ~4.6× the throughput — at ~4× the memory. Ractor-mode Rails would close
159
+ the throughput gap at one-process memory cost; the upstream blockers are
160
+ tracked in [doc/rails-on-ractors.md](doc/rails-on-ractors.md).
161
+
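The two ratios in that trade-off follow directly from the table. A quick check in Ruby, using only the figures quoted above (the hash layout and variable names are illustrative):

```ruby
# Figures from the Rails hello-world table: req/s and memory by PSS.
kino = { rps: 2_637,  pss_mb: 92 }
puma = { rps: 12_138, pss_mb: 389 }

# How many times more throughput, and more memory, the fork cluster uses.
throughput_ratio = (puma[:rps].to_f / kino[:rps]).round(1)
memory_ratio     = (puma[:pss_mb].to_f / kino[:pss_mb]).round(1)

puts "throughput: #{throughput_ratio}x, memory: #{memory_ratio}x"
```

Both ratios land close together, which is why the comparison reads as a trade-off rather than a win for either side.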
162
+ In short: on the tiny synthetic app, ractor mode beats fork-level CPU parallelism (**5.8×** Kino's
163
+ own GVL-bound threaded mode, +34% over the cluster) in one process, at
164
+ about 1/7th of the cluster's memory by PSS (about 1/4th on a real Rails app).
165
+ Every Kino mode is 1.5-2.1× ahead of the cluster on I/O-light endpoints. The macOS numbers
132
166
  (secondary; everything there hits the loopback ceiling) and the
133
167
  YJIT × Ractors gotcha are in [doc/benchmarks.md](doc/benchmarks.md).
134
168
 
data/doc/benchmarks.md CHANGED
@@ -34,10 +34,21 @@ the deployment most apps run today.
34
34
  - The headline tables also carry an io-tuned column (`workers 32,
35
35
  threads 1`)—not a default, labeled as such—because the /io rows are
36
36
  a slot-count story (see below).
37
- - The dataset spans three identical boxes: the original measurements,
38
- a full re-measure at the 0.1.1 defaults, and the final headline
39
- sweep. Equal-config numbers reproduced across boxes within ~1-2%
40
- throughout.
37
+ - The dataset spans four identical c7a.2xlarge boxes: the original
38
+ measurements, a re-measure at the 0.1.1 defaults, the headline sweep,
39
+ and a final full re-validation (every table re-run from scratch).
40
+ Equal-config throughput reproduced across boxes within ~1-2%.
41
+ - **Memory is reported as PSS (proportional set size), not RSS.** A Puma
42
+ cluster forks N workers that share the Ruby VM and gem code
43
+ copy-on-write; summing each worker's RSS counts those shared pages up
44
+ to N times and overstates the cluster's real footprint. PSS divides
45
+ every shared page across the processes mapping it, so it reflects the
46
+ unique physical memory the cluster occupies—the only fair basis for
47
+ comparing one process against a fork-per-core cluster. We read it from
48
+ `/proc/<pid>/smaps_rollup` over the whole process tree, cross-checked
49
+ against `ps` (RSS) and `smem` (PSS). Kino serves from one process, so
50
+ its RSS ≈ PSS; the correction only moves Puma. (`bench/studies.sh`
51
+ reports both columns.)
41
52
  - Follow-up studies (`bench/studies.sh`): CPU tuning, topology sweep,
42
53
  /io worker scaling, logging costs, and memory—run in the same session
43
54
  as the headline tables.
@@ -54,28 +65,31 @@ the deployment most apps run today.
54
65
 
55
66
  ## Reading the headline tables
56
67
 
68
+ These tables all run the **tiny synthetic Ractor-shareable app**. The real
69
+ Rails app is not Ractor-shareable and runs only in threaded fallback—a
70
+ separate story with separate numbers, in [its own section](#rails).
71
+
57
72
  - **Plaintext/10k**: Kino's tokio front-end clears the fork cluster by
58
- 1.5-2.1× (lanes plaintext 244,340 vs Puma 118,190 = 2.07×; the
59
- smallest margin is threaded /10k at 1.49×). At the old 3-thread
73
+ 1.5-2.1× (lanes plaintext 250,222 vs Puma 118,176 = 2.12×; the
74
+ smallest margin is threaded /10k at 1.50×). At the old 3-thread
60
75
  topology the cross-ractor handoff showed up as ractor trailing
61
76
  threaded on trivial handlers; the 1-thread default reverses that
62
- (ractor 230k vs threaded 218k) and lanes widen it (244k).
63
- - **CPU (recursive fib)**: ractor mode does **5.7× its own GVL-bound
64
- threaded mode** (76,922 vs 13,499)—that's the entire point of
65
- ractors—and beats the fork cluster outright: +32% with stock
66
- defaults (+25% with lanes, 73,136 vs 58,337). Even the io-tuned
67
- `workers 32` topology stays ahead of the cluster on CPU (62,406).
68
- - **Memory**: after serving the full endpoint battery, Kino held
69
- **80 MB** (ractor or lanes, default topology) where the 8-worker
70
- cluster held **1,256 MB**—a fork per core pays one full copy of the
71
- VM, the app, and its YJIT-compiled code per worker. On the Rails
72
- hello-world: Kino 97 MB vs cluster 797 MB. Threaded mode under the
73
- same battery needs a malloc note; see
74
- [Memory under load](#memory-under-load-and-the-glibc-arena-footgun).
77
+ (ractor 230k vs threaded 217k) and lanes widen it (250k).
78
+ - **CPU (recursive fib)**: ractor mode does **5.8× its own GVL-bound
79
+ threaded mode** (77,999 vs 13,429)—that's the entire point of
80
+ ractors—and beats the fork cluster outright: +34% with stock
81
+ defaults (+22% with lanes, 70,885 vs 58,006). Even the io-tuned
82
+ `workers 32` topology stays ahead of the cluster on CPU (66,100).
83
+ - **Memory (PSS)**: after the full endpoint battery, the tiny app costs
84
+ Kino **148 MB** in ractor mode (107 MB threaded) against the 8-worker
85
+ cluster's **1,068 MB**—~7-10× lighter, because a trivial app is almost
86
+ all private per-worker heap that copy-on-write can't share. The real
87
+ Rails app narrows this to ~4× (its framework *is* shared CoW); both are
88
+ in [Memory under load](#memory-under-load-and-the-glibc-arena-footgun).
75
89
  - **I/O (5 ms wait)**: all dispatch models tie within ~4% at equal slot
76
90
  counts, so the default columns show the ractor modes behind on /io
77
91
  (8 slots vs the cluster's 24), and the `workers 32` column shows the
78
- same engine winning (+27%, +34% via `Kino.sleep`) once it has more
92
+ same engine winning (+25%, +34% via `Kino.sleep`) once it has more
79
93
  slots than the cluster. The lever is slot count, and Kino slots are
80
94
  cheap: see [below](#why-io-lags-in-ractor-mode-on-linux).
81
95
 
@@ -87,14 +101,14 @@ run:
87
101
 
88
102
  | config | /cpu req/s |
89
103
  |---|---:|
90
- | Puma cluster (reference) | 58,505 |
91
- | Kino `workers 8, threads 3` (the default before 0.1.1) | 67,111 |
92
- | Kino `workers 8, threads 1, tokio_threads 1` (the old recipe) | 68,638 |
93
- | Kino `workers 8, threads 1`, tokio auto (**the default**) | **78,175** |
104
+ | Puma cluster (reference) | 58,189 |
105
+ | Kino `workers 8, threads 3` (the default before 0.1.1) | 67,394 |
106
+ | Kino `workers 8, threads 1, tokio_threads 1` (the old recipe) | 68,600 |
107
+ | Kino `workers 8, threads 1`, tokio auto (**the default**) | **77,999** |
94
108
 
95
109
  The `threads 1` half of the old recipe became the default; the
96
110
  `tokio_threads 1` half now *costs* −12% on /cpu (and still costs
97
- plaintext: 107,743 vs 230k). Don't pin tokio threads. **The recipe's
111
+ plaintext: 108,523 vs 230k). Don't pin tokio threads. **The recipe's
98
112
  history is an environment story**: in the earlier Docker-on-Mac runs it
99
113
  was worth +12%, because tokio threads and wake churn competed for
100
114
  oversubscribed virtualized cores; on dedicated cores the same pin
@@ -118,8 +132,8 @@ Parallelism for CPU-bound Ruby comes from ractors or forks, nothing else.
118
132
 
119
133
  ## Why /io lags in ractor mode on Linux
120
134
 
121
- On bare metal the gap is small at equal slot counts: ractor /io 4,531
122
- vs threaded 4,725 (−4%, both at 8×3). In Docker it was −18%, and a
135
+ On bare metal the gap is small at equal slot counts: ractor /io 4,530
136
+ vs threaded 4,709 (−4%, both at 8×3). In Docker it was −18%, and a
123
137
  pure-Ruby probe there measured
124
138
  `sleep(0.005)` waking +2.3-2.8 ms late inside ractors vs +1.8 ms on the
125
139
  main ractor—non-main-ractor timer wakeups are coarser in Ruby 4.0, but
@@ -130,16 +144,16 @@ A follow-up probe showed `IO.select`-style waits are tighter than
130
144
  **Mitigation 1—`Kino.sleep`:** releases the GVL and waits on the OS
131
145
  clock directly (chunked, so `Thread#kill`/shutdown stay responsive). The
132
146
  `/io_native` endpoint (same 5 ms wait via `Kino.sleep` when available)
133
- erases the remaining ractor gap on this box: 4,721 vs 4,531 plain sleep.
147
+ erases the remaining ractor gap on this box: 4,721 vs 4,530 plain sleep.
134
148
 
135
149
  **Mitigation 2—add workers; they're nearly free.** The headline tables
136
- show default ractor-mode /io at 1,548: that's 8 slots (the 1-thread
150
+ show default ractor-mode /io at 1,552: that's 8 slots (the 1-thread
137
151
  default) against the cluster's 24, because wait-bound throughput is
138
152
  simply `slots ÷ effective wait`. Kino's slots cost ~a thread each, not
139
- a forked process: the `workers 32, threads 1` column measured **5,935
140
- /io (+27% over the 24-thread cluster's 4,687) and 6,289 /io_native
141
- (+34%)**, still one small process, and still +7% ahead of the cluster
142
- on pure CPU. Its cost is the CPU-light rows (156k plaintext vs 230k at
153
+ a forked process: the `workers 32, threads 1` column measured **5,888
154
+ /io (+25% over the 24-thread cluster's 4,693) and 6,274 /io_native
155
+ (+34%)**, still one small process, and still +14% ahead of the cluster
156
+ on pure CPU. Its cost is the CPU-light rows (183k plaintext vs 230k at
143
157
  8×1: 32 ractors oversubscribe 8 cores). A fork cluster buying the same
144
158
  32 slots pays for them in full copies of the app; Kino pays in
145
159
  scheduler churn only where the cores are already saturated.
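The `slots ÷ effective wait` rule can be checked against the measured /io rows. The sketch below computes the ideal ceiling for each slot count with a 5 ms wait and prints it next to the figure quoted above; `ideal_rps` is an illustrative helper, not part of Kino.

```ruby
# Ideal wait-bound throughput: every slot finishes one request per wait.
# Integer math (milliseconds) keeps the result exact.
def ideal_rps(slots, wait_ms)
  slots * 1000 / wait_ms
end

WAIT_MS = 5 # the /io endpoint's 5 ms wait

# name => [slots, measured /io req/s from the tables above]
cases = {
  "Kino :ractor 8x1"  => [8,  1_552],
  "Puma cluster 8x3"  => [24, 4_693],
  "Kino :ractor 32x1" => [32, 5_888],
}

cases.each do |name, (slots, measured)|
  ideal = ideal_rps(slots, WAIT_MS)
  pct = (100.0 * measured / ideal).round
  puts format("%-18s ideal %5d req/s, measured %5d (%d%% of ideal)", name, ideal, measured, pct)
end
```

The ceilings are 1,600, 4,800, and 6,400 req/s for 8, 24, and 32 slots, so each configuration sits close to its ceiling and the ordering is decided by slot count alone.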
@@ -154,9 +168,9 @@ what the Rack-level hop itself costs (c7a.2xlarge, same session):
154
168
 
155
169
  | endpoint | Kino :ractor (8×3) | Puma + wrapper | Falcon + wrapper |
156
170
  |------------|-------------------:|---------------:|-----------------:|
157
- | /plaintext | 199,032 | 19,532 | 100,342 |
158
- | /cpu (fib) | 68,238 | 17,323 | 48,561 |
159
- | /io (5 ms) | 4,531 | 1,452 | 1,544 |
171
+ | /plaintext | 193,826 | 19,480 | 99,776 |
172
+ | /cpu (fib) | 68,061 | 17,755 | 48,721 |
173
+ | /io (5 ms) | 4,530 | 1,454 | 1,549 |
160
174
 
161
175
  Inside the Rack contract, the wrapper must reduce the env to a shareable
162
176
  subset, copy it to the worker ractor, copy the response back, and hold a
@@ -171,18 +185,26 @@ the Rack contract—which is the experiment this gem exists to run.
171
185
 
172
186
  ## Rails
173
187
 
174
- The example app (`examples/rails-hello`, edge Rails, production mode,
175
- 8 workers × 5 threads) on the same box:
176
-
177
- | | req/s | RSS under load |
178
- |---|---:|---:|
179
- | Kino `:threaded` (one process) | 2,298 | **97 MB** |
180
- | Puma cluster (8 workers) | 11,923 | 797 MB |
181
-
182
- This is the honest version of the Rails story: in threaded mode Kino is
183
- one GVL-bound process, so the fork cluster outruns it ~5× by using all
184
- 8 cores—at the memory. Rails-on-Ractors is interesting precisely
185
- because it would close that throughput gap at the one-process memory
188
+ Rails is **not Ractor-shareable**, so Kino can only serve it in
189
+ `:threaded` fallback—this whole section is one GVL-bound Kino process,
190
+ never ractor mode. The example app (`examples/rails-hello`, edge Rails,
191
+ production mode, 8 workers × 5 threads) on the same box:
192
+
193
+ | | req/s | RSS | PSS |
194
+ |---|---:|---:|---:|
195
+ | Kino `:threaded` (one process) | 2,637 | 97 MB | **92 MB** |
196
+ | Puma cluster (8 workers) | 12,138 | 794 MB | **389 MB** |
197
+
198
+ This is the honest version of the Rails story. In threaded mode Kino is
199
+ one GVL-bound process, so the fork cluster outruns it ~4.6× by using all
200
+ 8 cores—at ~4× the memory by PSS. The metric matters here: Puma's RSS
201
+ (794 MB) counts the shared Rails framework once per worker; PSS (389 MB)
202
+ counts it once, and that is the fair figure (the README's headline used
203
+ to read 8× off RSS). Preloading barely moves it—389 MB with
204
+ `preload_app!` vs 400 MB without—because Ruby's GC dirties most heap
205
+ pages, breaking copy-on-write, so even a preloaded cluster keeps a
206
+ private heap per worker. Rails-on-Ractors is interesting precisely
207
+ because it would close the throughput gap at the one-process memory
186
208
  cost; the upstream blockers are documented in
187
209
  [rails-on-ractors.md](rails-on-ractors.md).
188
210
 
@@ -217,29 +239,59 @@ more reason to prefer mimalloc in dlopen'd extensions.
217
239
 
218
240
  ## Memory under load (and the glibc arena footgun)
219
241
 
220
- RSS after serving the full endpoint battery (8 s each of /plaintext,
221
- /10k, /cpu, /io—a "warmed production process", not a fresh boot, which
222
- measures 26-27 MB for every Kino mode):
242
+ All figures are **PSS** (see [Methodology](#methodology)) after the full
243
+ endpoint battery (8 s each of /plaintext, /10k, /cpu, /io—a "warmed
244
+ production process", not a fresh boot, which measures ~26 MB for every
245
+ Kino mode). RSS is shown alongside so the copy-on-write correction is
246
+ visible.
223
247
 
224
- | config | RSS loaded |
225
- |---|---:|
226
- | Kino :ractor 8×1 (default) | **80 MB** |
227
- | Kino lanes 8×1 | **80 MB** |
228
- | Kino :ractor 8×3 | 115 MB |
229
- | Kino :threaded 8×3 | 612 MB¹ |
230
- | Puma cluster 8×3 | 1,256 MB |
248
+ ### The tiny synthetic app
249
+
250
+ | config | RSS | PSS |
251
+ |---|---:|---:|
252
+ | Kino :ractor 8×1 (default) | 151 | **148** |
253
+ | Kino lanes 8×1 | 137 | **135** |
254
+ | Kino :ractor 8×3 | 171 | **169** |
255
+ | Kino :threaded 8×3 (`MALLOC_ARENA_MAX=2`) | 109 | **107** |
256
+ | Kino :threaded 8×3 (no arena cap) | 668 | **666**¹ |
257
+ | Puma cluster 8×3 | 1,213 | **1,068** |
258
+
259
+ The tiny app is ~7× lighter than the cluster in ractor mode, ~10× in
260
+ arena-capped threaded mode. RSS ≈ PSS for every Kino row (one process,
261
+ nothing to share) and within ~12% for Puma here: a trivial app has almost
262
+ no shared state, so Puma's footprint is ~1,051 MB of *private* per-worker
263
+ heap plus only ~18 MB shared (which RSS counts 8×). This is the case where
264
+ copy-on-write does **not** rescue the cluster—there is nothing to
265
+ share—so the RSS and PSS numbers nearly agree. (The old "80 MB / 15×"
266
+ figure was a lighter, plaintext-only load; the honest full-battery ractor
267
+ figure is ~148 MB, i.e. ~7×.)
231
268
 
232
269
  ¹ Not a leak: glibc malloc arena bloat. One 8-second /10k round takes
233
- threaded mode from 69 MB to ~800 MB and it never returns—24 threads
270
+ threaded mode from ~70 MB to ~670 MB and it never returns—24 threads
234
271
  churning 10 KB strings through one process heap is the textbook glibc
235
272
  arena-fragmentation case (the reason Rails ops set `MALLOC_ARENA_MAX=2`;
236
- Heroku ships that default). With `MALLOC_ARENA_MAX=2` the same battery
237
- ends at **151 MB** with throughput unchanged (165,993 vs 157,177 /10k,
238
- if anything faster). Ractor mode sidesteps the worst of it without any
239
- env tweak—objects live in per-ractor heaps, and repeated runs landed at
240
- 80-124 MB regardless of arena settings. Puma's 1,256 MB barely moves
241
- under the cap (1,237 MB): its cost is eight full copies of the warmed
242
- VM, not arenas.
273
+ Heroku ships that default). With the cap the same battery ends at 107 MB
274
+ PSS, throughput unchanged. Ractor mode sidesteps the worst of it without
275
+ any env tweak—objects live in per-ractor heaps.
276
+
277
+ ### Rails (threaded fallback)
278
+
279
+ Here copy-on-write **does** matter, which is exactly why PSS is mandatory:
280
+
281
+ | config | RSS | PSS |
282
+ |---|---:|---:|
283
+ | Kino :threaded (one process) | 97 | **92** |
284
+ | Puma cluster 8×3 (preload) | 794 | **389** |
285
+
286
+ Puma serves the same Rails framework from 8 forks that share it
287
+ copy-on-write; RSS counts that shared framework once per worker (794 MB),
288
+ PSS counts it once (389 MB). The fair ratio is **~4×**, not the ~8× a
289
+ naive RSS sum reports—this is the correction that prompted the whole
290
+ re-measure. Preload barely helps (389 vs 400 MB without): Ruby's GC
291
+ dirties most heap pages, breaking copy-on-write, so even a preloaded
292
+ cluster keeps a large private heap per worker. That is why "CoW should
293
+ make a fork cluster nearly free" is only half true—it shares the code,
294
+ not the live object heap.
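A minimal sketch of reading RSS and PSS from `/proc/<pid>/smaps_rollup` and summing them over a process tree, assuming Linux and a list of pids supplied by the caller. This is not the project's `bench/studies.sh`; `field_kb` and `tree_kb` are hypothetical helper names.

```ruby
# Extract one "<Field>:  <n> kB" value from smaps_rollup text; returns kB.
# Matching on "Pss:" with the colon skips sibling fields such as "Pss_Anon:".
def field_kb(smaps_rollup_text, field)
  line = smaps_rollup_text.each_line.find { |l| l.start_with?("#{field}:") }
  line ? line.split[1].to_i : 0
end

# Sum a field over a process tree (for a cluster: the master plus every
# worker). Processes that have exited or cannot be read count as 0.
def tree_kb(pids, field)
  pids.sum do |pid|
    field_kb(File.read("/proc/#{pid}/smaps_rollup"), field)
  rescue SystemCallError
    0
  end
end

if $PROGRAM_NAME == __FILE__
  pids = ARGV.map(&:to_i)
  puts "RSS: #{tree_kb(pids, 'Rss') / 1024} MB"
  puts "PSS: #{tree_kb(pids, 'Pss') / 1024} MB"
end
```

Summing `Rss` counts every shared page once per process that maps it, while summing `Pss` charges each process only its share, so the gap between the two totals is the copy-on-write sharing the section describes.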
243
295
 
244
296
  ## Run-to-run variance (a.k.a. "is this a regression?")
245
297
 
@@ -249,26 +301,27 @@ Docker-on-Mac environment swung ±10% on /cpu between sessions with the
249
301
  VM's mood; the dedicated c7a box is far steadier (same-session repeats
250
302
  land within ~1-2%), but the discipline stays—every comparative claim in
251
303
  these docs comes from same-session pairs. Cross-box repeatability got
252
- its own test: the dataset was measured across three identical
304
+ its own test: the dataset was measured across four identical
253
305
  c7a.2xlarge boxes, and equal-config throughput numbers matched within
254
- ~1-2% (loaded-RSS measurements swing far more with heap-growth
255
- timing—treat memory numbers as ballpark). The same discipline caught one fluke: a sweep round once
256
- posted threaded plaintext 28% low; interleaved re-runs minutes later
257
- put it back—suspect cells get re-measured, not published.
306
+ ~1-2% (loaded-memory measurements swing more with heap-growth
307
+ timing—treat them as ballpark). The same discipline caught the recurring
308
+ threaded-plaintext fluke twice: once 28% low on an earlier box, and again
309
+ in the final re-validation (170k, where three interleaved re-runs put it
310
+ back at 217k). Suspect cells get re-measured, not published.
258
311
 
259
312
  ## Topology notes
260
313
 
261
314
  Measured on c7a.2xlarge, plaintext, ractor mode, same session (three
262
- interleaved rounds, medians): `8×3` (workers×threads) = 200,048, `8×1`
263
- = **232,173 (+16%)**, `16×1` = 214,570. Threads inside one ractor share
315
+ interleaved rounds, medians): `8×3` (workers×threads) = 198,478, `8×1`
316
+ = **229,966 (+16%)**, `16×1` = 214,391. Threads inside one ractor share
264
317
  its lock, so every request handled by a 3-thread ractor pays a lock
265
318
  handoff that a 1-thread ractor doesn't (`perf` in the earlier Docker
266
319
  sessions attributed ~10% of cycles to
267
320
  `rb_native_mutex_unlock`/`thread_sched_wakeup_next_thread` at 8×3; the
268
321
  gain reproduced on two separate boxes, +16-17% each). **This is why
269
- `threads` defaults to 1 in ractor mode since 0.1.1** (/cpu gains +18%
270
- the same way: 80,409 vs 68,132). The trade-off is /io at low worker
271
- counts: 1,534 at 8×1 vs 4,486 at 8×3—threads-per-ractor exist for
322
+ `threads` defaults to 1 in ractor mode since 0.1.1** (/cpu gains +16%
323
+ the same way: 77,999 vs 67,394). The trade-off is /io at low worker
324
+ counts: 1,552 at 8×1 vs 4,530 at 8×3—threads-per-ractor exist for
272
325
  handlers that block on I/O. If yours wait a lot, raise `workers`
273
326
  instead (32×1 beats even the 24-slot cluster, see above); slots are
274
327
  cheap. (16×1 being worse than 8×1 on plaintext also says the shared
@@ -306,10 +359,10 @@ Same-session A/B on c7a.2xlarge, ractor mode at the default topology
306
359
 
307
360
  | endpoint | shared queue | lanes | delta |
308
361
  |----------|-------------:|------:|------:|
309
- | /plaintext | 230,547 | **250,395** | **+9%** |
310
- | /10k | 182,788 | 198,301 | +8% |
311
- | /cpu | **78,175** | 73,345 | −6% |
312
- | /io | 1,550 | 1,550 | flat |
362
+ | /plaintext | 229,534 | **250,222** | **+9%** |
363
+ | /10k | 178,083 | 189,862 | +7% |
364
+ | /cpu | **77,999** | 70,885 | −9% |
365
+ | /io | 1,552 | 1,551 | flat |
313
366
 
314
367
  Lanes' margin shrank with the move to 1-thread workers (at the old 8×3
315
368
  it was +21% plaintext: 240,193 vs 199,032 in the same session)—most of
@@ -331,11 +384,11 @@ typical costs):
331
384
 
332
385
  | case (8×3, same session) | req/s |
333
386
  |---|---:|
334
- | threaded, no logging | 217,377 |
335
- | threaded, `log_requests true` (native access log) | 193,493 (−11%) |
336
- | ractor, access log off / on | 200,478 / 184,357 (−8%) |
337
- | app logs 1 line/req via shared `::Logger` (file) | **62,917** |
338
- | app logs 1 line/req via `Kino::Logger` (file) | **149,237 (2.4×)** |
387
+ | threaded, no logging | 219,168 |
388
+ | threaded, `log_requests true` (native access log) | 193,998 (−11%) |
389
+ | ractor, access log off / on | 197,596 / 181,050 (−8%) |
390
+ | app logs 1 line/req via shared `::Logger` (file) | **62,961** |
391
+ | app logs 1 line/req via `Kino::Logger` (file) | **149,519 (2.4×)** |
339
392
 
340
393
  The shared-`::Logger` cost is the mutex: 24 worker threads serialize
341
394
  through one lock plus a write syscall per line. `Kino::Logger` hands the
@@ -7,10 +7,11 @@
7
7
  Rails 8.2.0.alpha boots and serves with `mode :threaded` (see the
8
8
  example's `kino.rb`; just `bundle exec kino` in that directory). Measured
9
9
  on the hello-world (c7a.2xlarge, 8 cores, production mode, 8×5):
10
- ~2.3k req/s in 97 MB, single process. The 8-worker Puma cluster reaches
11
- ~11.9k in 797 MB by parallelizing across forks—Rails-on-Ractors is
12
- interesting precisely because it could offer that ~5× parallelism at
13
- ~1/8th of the memory.
10
+ ~2.6k req/s in 92 MB PSS, single process. The 8-worker Puma cluster
11
+ reaches ~12.1k by parallelizing across forks, at 389 MB PSS (794 MB RSS,
12
+ but its forks share the framework copy-on-write, so PSS is the fair
13
+ figure)—Rails-on-Ractors is interesting precisely because it could offer
14
+ that ~4.6× parallelism at ~1/4th of the memory.
14
15
 
15
16
  Pair it with production-style Rails settings: eager load, no code
16
17
  reloading, database pool ≥ workers × threads, logger to stdout or another
data/doc/why-kino.md CHANGED
@@ -16,7 +16,7 @@ deep-copies it, and sockets cannot cross at all.
16
16
  We measured what the "obvious" workaround costs. The ractor-pool wrapper
17
17
  experiment (reduce the env to a shareable subset, copy it to a worker
18
18
  over a `Ractor::Port`, copy the response back) runs at **19.5k req/s
19
- where Kino does 199k** on the same hardware—see the
19
+ where Kino does 194k** on the same hardware—see the
20
20
  [wrapper comparison](benchmarks.md#the-ractor-pool-wrapper-comparison).
21
21
  Copying at the Rack layer eats the entire ractor dividend. Dispatch has
22
22
  to live below the Rack contract.
@@ -78,12 +78,12 @@ objects; Rust sees one queue and one registry.
78
78
 
79
79
  With the dispatch cost eliminated, Ractors deliver the thing they were
80
80
  built for—a lock per ractor instead of one GVL—and each layer is
81
- visible in the [benchmarks](benchmarks.md): `/cpu` at 76.9k req/s in
82
- ractor mode vs **13.5k threaded (5.7×, the GVL ceiling)**, beating the
83
- fork cluster's CPU parallelism by +32% while holding **80 MB against
84
- the cluster's 1,256 MB**, because eight ractors share one VM, one Rust
85
- front-end, one queue, and one JIT, where eight forks each pay full
86
- price.
81
+ visible in the [benchmarks](benchmarks.md): `/cpu` at 78.0k req/s in
82
+ ractor mode vs **13.4k threaded (5.8×, the GVL ceiling)**, beating the
83
+ fork cluster's CPU parallelism by +34% while holding **~148 MB against
84
+ the cluster's ~1,068 MB** (by PSS, on the bench app), because eight
85
+ ractors share one VM, one Rust front-end, one queue, and one JIT, where
86
+ eight forks each pay full price.
87
87
 
88
88
  The cleanest proof of the design is the threaded fallback itself: it
89
89
  reuses ~95% of the same machinery, because the Rust core is
data/ext/kino/Cargo.toml CHANGED
@@ -1,6 +1,6 @@
1
1
  [package]
2
2
  name = "kino"
3
- version = "0.1.1"
3
+ version = "0.1.2"
4
4
  edition = "2021"
5
5
  authors = ["Yaroslav Markin <yaroslav@markin.net>"]
6
6
  license = "MIT"
@@ -41,6 +41,9 @@ pub struct ServerInner {
41
41
  /// 0 = no request timeout; otherwise the response head must arrive
42
42
  /// within this many ms or the client gets a 504.
43
43
  pub request_timeout_ms: u64,
44
+ /// 0 = unlimited; otherwise the max request-body bytes accepted before a
45
+ /// 413 (truthful Content-Length) or a mid-stream abort (chunked/lying).
46
+ pub max_body_size: usize,
44
47
  pub timeouts: AtomicU64,
45
48
  pub https: bool,
46
49
  /// Native access log sink (None unless log_requests is on).
@@ -180,6 +183,7 @@ pub fn test_server(lanes: bool, queue_depth: usize) -> Arc<ServerInner> {
180
183
  rejected: AtomicU64::new(0),
181
184
  queue_timeout_ms: 10,
182
185
  request_timeout_ms: 0,
186
+ max_body_size: 0,
183
187
  timeouts: AtomicU64::new(0),
184
188
  https: false,
185
189
  access_log: None,
@@ -25,6 +25,12 @@ pub struct RequestCtx {
25
25
  /// Request body, streamed from hyper through a bounded channel: hyper is
26
26
  /// only polled as Ruby consumes, so inbound backpressure is free.
27
27
  pub body_rx: flume::Receiver<Bytes>,
28
+ /// Set by the body forwarder when the body exceeded max_body_size: turns
29
+ /// the next read into an error instead of a (truncated) clean EOF.
30
+ pub body_overflow: Arc<std::sync::atomic::AtomicBool>,
31
+ /// Set by the body forwarder when the client stalled past the idle
32
+ /// deadline: the next read raises so the worker reclaims its slot.
33
+ pub body_timeout: Arc<std::sync::atomic::AtomicBool>,
28
34
  /// When a frame is bigger than read_body's max_len, the rest waits here.
29
35
  pub leftover: Option<Bytes>,
30
36
  /// The owning worker slot (set at admit time, queue.rs); its interrupt
@@ -62,6 +68,20 @@ fn interrupted_error(ruby: &Ruby) -> Error {
62
68
  )
63
69
  }
64
70
 
71
+ fn body_too_large_error(ruby: &Ruby) -> Error {
72
+ Error::new(
73
+ ruby.exception_runtime_error(),
74
+ "Kino: request body exceeded max_body_size",
75
+ )
76
+ }
77
+
78
+ fn body_timeout_error(ruby: &Ruby) -> Error {
79
+ Error::new(
80
+ ruby.exception_runtime_error(),
81
+ "Kino: request body read timed out",
82
+ )
83
+ }
84
+
65
85
  fn invalid_response(ruby: &Ruby, e: impl std::fmt::Display) -> Error {
66
86
  Error::new(
67
87
  ruby.exception_runtime_error(),
@@ -238,7 +258,17 @@ impl Request {
238
258
  });
239
259
  match outcome {
240
260
  Some(Some(bytes)) => bytes,
241
- Some(None) => return Ok(None), // EOF
261
+ Some(None) => {
262
+ // Disconnected: a clean EOF, unless the forwarder
263
+ // abandoned the body (too large, or the client stalled).
264
+ if ctx.body_overflow.load(std::sync::atomic::Ordering::Relaxed) {
265
+ return Err(body_too_large_error(ruby));
266
+ }
267
+ if ctx.body_timeout.load(std::sync::atomic::Ordering::Relaxed) {
268
+ return Err(body_timeout_error(ruby));
269
+ }
270
+ return Ok(None); // EOF
271
+ }
242
272
  None => return Err(interrupted_error(ruby)),
243
273
  }
244
274
  }
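The hunk above makes a closed body channel ambiguous on its own: it can mean a clean EOF or a body the forwarder gave up on, and the two flags settle which. Below is a minimal Ruby sketch of that read-side decision. `BodyReader`, `push`, `finish`, and `read` are hypothetical names for illustration, not Kino's API, and a plain `Queue` stands in for the flume channel.

```ruby
# Hypothetical illustration only: models the EOF-vs-abandoned decision with a
# Ruby Queue in place of the bounded channel used by the Rust code.
class BodyReader
  def initialize
    @frames = Queue.new
    @overflow = false
    @timed_out = false
  end

  # Producer side: hand one body frame to the reader.
  def push(frame)
    @frames << frame
  end

  # Producer side: record why the body ended, then close the channel.
  # The flags are set before the close so the reader never sees a closed
  # channel with stale flags.
  def finish(overflow: false, timed_out: false)
    @overflow = overflow
    @timed_out = timed_out
    @frames.close
  end

  # Consumer side: returns the next frame, nil on a clean EOF, or raises
  # when the producer abandoned the body.
  def read
    frame = @frames.pop # nil once the queue is closed and drained
    return frame unless frame.nil?

    raise "request body exceeded max_body_size" if @overflow
    raise "request body read timed out" if @timed_out

    nil
  end
end
```

Frames queued before the close are still delivered; only the read that would otherwise report EOF raises, which matches the order of checks in the `Some(None)` arm.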
@@ -363,6 +393,8 @@ pub fn test_ctx() -> crate::registry::BoxedCtx {
363
393
  local_addr: "127.0.0.1:9292".parse().expect("static addr"),
364
394
  https: false,
365
395
  body_rx,
396
+ body_overflow: Arc::new(std::sync::atomic::AtomicBool::new(false)),
397
+ body_timeout: Arc::new(std::sync::atomic::AtomicBool::new(false)),
366
398
  leftover: None,
367
399
  slot: None,
368
400
  responder: Arc::new(Responder::new(head_tx)),
@@ -51,6 +51,8 @@ pub fn server_start(ruby: &Ruby, config: magnus::RHash) -> Result<(u64, u16), Er
51
51
  let queue_depth: usize = cfg(ruby, config, "queue_depth")?;
52
52
  let queue_timeout_ms: u64 = cfg(ruby, config, "queue_timeout_ms")?;
53
53
  let request_timeout_ms: u64 = cfg_opt::<u64>(ruby, config, "request_timeout_ms")?.unwrap_or(0);
54
+ let max_body_size: usize = cfg_opt::<usize>(ruby, config, "max_body_size")?.unwrap_or(0);
55
+ let max_connections: usize = cfg_opt::<usize>(ruby, config, "max_connections")?.unwrap_or(1024);
54
56
  let tokio_threads: usize = cfg_opt::<usize>(ruby, config, "tokio_threads")?.unwrap_or(0);
55
57
  let tls_cert: Option<String> = cfg_opt(ruby, config, "tls_cert")?;
56
58
  let tls_key: Option<String> = cfg_opt(ruby, config, "tls_key")?;
@@ -104,6 +106,7 @@ pub fn server_start(ruby: &Ruby, config: magnus::RHash) -> Result<(u64, u16), Er
104
106
  rejected: std::sync::atomic::AtomicU64::new(0),
105
107
  queue_timeout_ms,
106
108
  request_timeout_ms,
109
+ max_body_size,
107
110
  timeouts: std::sync::atomic::AtomicU64::new(0),
108
111
  https: acceptor.is_some(),
109
112
  access_log: log_requests.then(|| crate::logsink::Sink::new(std::io::stdout())),
@@ -120,6 +123,7 @@ pub fn server_start(ruby: &Ruby, config: magnus::RHash) -> Result<(u64, u16), Er
120
123
  tokio_listener,
121
124
  acceptor,
122
125
  server.clone(),
126
+ max_connections,
123
127
  shutdown_rx,
124
128
  ));
125
129
  *server.runtime.lock() = Some(runtime);
@@ -129,40 +133,73 @@ pub fn server_start(ruby: &Ruby, config: magnus::RHash) -> Result<(u64, u16), Er
129
133
  Ok((id, local_port))
130
134
  }
131
135
 
136
+ /// Slowloris guard for TLS: a client that completes the TCP connect but then
137
+ /// stalls the handshake would otherwise hold a connection slot indefinitely
138
+ /// (the per-request and header-read deadlines only start once hyper is
139
+ /// serving, i.e. after the handshake). A handshake is a few round trips, so
140
+ /// this is generous even for a high-latency client. Fixed, like the header
141
+ /// timeout: not a knob.
142
+ const TLS_HANDSHAKE_TIMEOUT: Duration = Duration::from_secs(10);
143
+
132
144
  async fn accept_loop(
133
145
  listener: tokio::net::TcpListener,
134
146
  acceptor: Option<tokio_rustls::TlsAcceptor>,
135
147
  server: Arc<ServerInner>,
148
+ max_connections: usize,
136
149
  mut shutdown_rx: tokio::sync::watch::Receiver<bool>,
137
150
  ) {
151
+ // Bound concurrent connections: unbounded, a flood spawns a task and holds
152
+ // a socket per connection until file descriptors or memory run out. One
153
+ // permit per live connection; acquiring BEFORE accept leaves the excess in
154
+ // the kernel backlog (backpressure) rather than accepting then dropping.
155
+ let conn_limit = Arc::new(tokio::sync::Semaphore::new(max_connections));
138
156
  loop {
139
- tokio::select! {
157
+ let permit = tokio::select! {
140
158
  _ = shutdown_rx.changed() => break,
141
- accepted = listener.accept() => {
142
- let Ok((stream, remote_addr)) = accepted else { continue };
143
- // Small responses must not wait on Nagle + delayed ACK.
144
- let _ = stream.set_nodelay(true);
145
- let local_addr = stream
146
- .local_addr()
147
- .unwrap_or_else(|_| SocketAddr::from(([0, 0, 0, 0], 0)));
148
- let server = server.clone();
149
- let acceptor = acceptor.clone();
150
- tokio::spawn(async move {
151
- match acceptor {
152
- Some(acceptor) => {
153
- // Handshake failures (port scans, plain HTTP to a
154
- // TLS port) just drop the connection.
155
- let Ok(tls) = acceptor.accept(stream).await else { return };
156
- serve_connection(tls, server, remote_addr, local_addr).await;
157
- }
158
- None => serve_connection(stream, server, remote_addr, local_addr).await,
159
- }
160
- });
159
+ permit = conn_limit.clone().acquire_owned() => match permit {
160
+ Ok(permit) => permit,
161
+ Err(_) => break, // semaphore closed
162
+ },
163
+ };
164
+ let (stream, remote_addr) = tokio::select! {
165
+ _ = shutdown_rx.changed() => break,
166
+ accepted = listener.accept() => match accepted {
167
+ Ok(pair) => pair,
168
+ Err(_) => continue, // transient accept error; permit drops, retry
169
+ },
170
+ };
171
+ // Small responses must not wait on Nagle + delayed ACK.
172
+ let _ = stream.set_nodelay(true);
173
+ let local_addr = stream
174
+ .local_addr()
175
+ .unwrap_or_else(|_| SocketAddr::from(([0, 0, 0, 0], 0)));
176
+ let server = server.clone();
177
+ let acceptor = acceptor.clone();
178
+ tokio::spawn(async move {
179
+ // Held for the connection's lifetime; dropping it frees a slot.
180
+ let _permit = permit;
181
+ match acceptor {
182
+ Some(acceptor) => {
183
+ // Handshake failures (port scans, plain HTTP to a TLS
184
+ // port) and stalled handshakes (slowloris) just drop the
185
+ // connection; the timeout bounds the latter.
186
+ let handshake = tokio::time::timeout(TLS_HANDSHAKE_TIMEOUT, acceptor.accept(stream));
187
+ let Ok(Ok(tls)) = handshake.await else { return };
188
+ serve_connection(tls, server, remote_addr, local_addr).await;
189
+ }
190
+ None => serve_connection(stream, server, remote_addr, local_addr).await,
161
191
  }
162
- }
192
+ });
163
193
  }
164
194
  }
165
195
 
196
+ /// Slowloris guard: drop a connection that has not sent its complete request
197
+ /// headers within this window. Long enough never to trip a real client (even
198
+ /// on a slow mobile link), short enough to reap a stalled one. Deliberately a
199
+ /// constant, not a config knob: fine-tuning intake limits is the fronting
200
+ /// proxy's job; the actual hazard was having no default at all.
201
+ const HEADER_READ_TIMEOUT: Duration = Duration::from_secs(15);
202
+
166
203
  async fn serve_connection<I>(
167
204
  io: I,
168
205
  server: Arc<ServerInner>,
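The accept loop in this hunk takes one permit per live connection and acquires it before calling `accept`, so excess connections stay in the kernel backlog. The same counting-permit idea can be sketched in Ruby with a pre-filled `Queue`. `ConnectionLimiter` and its methods are hypothetical names for illustration and do not exist in Kino; the real code uses `tokio::sync::Semaphore`.

```ruby
# Hypothetical illustration only: a counting semaphore built on Queue.
class ConnectionLimiter
  def initialize(max_connections)
    @permits = Queue.new
    max_connections.times { @permits << :permit }
  end

  # Blocks until a permit is free (the analogue of waiting before accept).
  def acquire
    @permits.pop
  end

  # Non-blocking variant: true if a permit was taken, false if none is free.
  def try_acquire
    @permits.pop(true)
    true
  rescue ThreadError
    false
  end

  # Returns a permit to the pool, freeing a slot.
  def release
    @permits << :permit
  end

  # Number of permits currently free.
  def available
    @permits.size
  end

  # Holds a permit for the duration of the block, like the `_permit` binding
  # that lives as long as the spawned connection task.
  def with_permit
    acquire
    begin
      yield
    ensure
      release
    end
  end
end
```

Tying the release to the end of the block mirrors the design choice in the diff: the permit is dropped when the connection task finishes, whatever path it exits by.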
@@ -176,7 +213,13 @@ async fn serve_connection<I>(
176
213
  // No auto Date header: it costs a clock read per response (together
177
214
  // with timer reads, ~7% of tokio-side cycles in the profile); it's a
178
215
  // SHOULD not a MUST, and apps that need it can set it themselves.
216
+ //
217
+ // The timer is installed so header_read_timeout actually fires: hyper's
218
+ // slow-header guard is inert without one. It arms only while the request
219
+ // head is being read, so it adds no per-response cost on the hot path.
179
220
  let _ = hyper::server::conn::http1::Builder::new()
221
+ .timer(hyper_util::rt::TokioTimer::new())
222
+ .header_read_timeout(HEADER_READ_TIMEOUT)
180
223
  .auto_date_header(false)
181
224
  .serve_connection(TokioIo::new(io), service)
182
225
  .await;
@@ -198,6 +241,25 @@ fn branded(mut response: HyperResponse) -> HyperResponse {
198
241
  response
199
242
  }
200
243
 
244
+ /// A single valid Content-Length as a byte count. hyper has already rejected
245
+ /// conflicting/duplicate values, so the first is authoritative; anything
246
+ /// unparseable yields None and the streaming cap still applies.
247
+ fn content_length(headers: &http::HeaderMap) -> Option<u64> {
248
+ headers
249
+ .get(http::header::CONTENT_LENGTH)?
250
+ .to_str()
251
+ .ok()?
252
+ .trim()
253
+ .parse()
254
+ .ok()
255
+ }
256
+
257
+ /// Idle deadline between request-body frames. A client that stalls mid-body
258
+ /// would otherwise hold a worker slot indefinitely (the worker blocks in
259
+ /// read_body). Generous: a real upload sends steadily and resets this each
260
+ /// frame, so only a silent client trips it. Fixed, like the header timeout.
261
+ const BODY_READ_TIMEOUT: Duration = Duration::from_secs(30);
262
+
201
263
  async fn handle_request(
202
264
  server: Arc<ServerInner>,
203
265
  remote_addr: SocketAddr,
@@ -221,19 +283,50 @@ async fn handle_request(
221
283
  )
222
284
  });
223
285
 
286
+ // Body-size guard: an honestly-declared oversize body is refused with a
287
+ // 413 below, before any worker runs. Chunked or lying clients are caught
288
+ // by the forwarder, which caps cumulative bytes and flags an overflow so
289
+ // read_body raises instead of letting the app buffer without bound.
290
+ let max_body = server.max_body_size;
291
+ let oversize =
292
+ max_body > 0 && content_length(&parts.headers).is_some_and(|len| len > max_body as u64);
293
+
224
294
  // Stream the request body through a bounded channel: hyper is polled
225
295
  // only as fast as the Ruby side consumes (inbound backpressure), and the
226
296
  // forwarder dropping the sender is EOF. Bodyless requests (most GETs)
227
297
  // skip the forwarder task entirely: dropping the sender IS the EOF.
228
298
  let (body_tx, body_rx) = flume::bounded::<bytes::Bytes>(8);
229
- if hyper::body::Body::is_end_stream(&body) {
299
+ let body_overflow = Arc::new(std::sync::atomic::AtomicBool::new(false));
300
+ let body_timeout = Arc::new(std::sync::atomic::AtomicBool::new(false));
301
+ if oversize || hyper::body::Body::is_end_stream(&body) {
230
302
  drop(body_tx);
231
303
  } else {
304
+ let overflow = body_overflow.clone();
305
+ let timed_out = body_timeout.clone();
232
306
  tokio::spawn(async move {
233
307
  let mut body = body;
234
- while let Some(frame) = body.frame().await {
235
- let Ok(frame) = frame else { break };
308
+ let mut total: u64 = 0;
309
+ loop {
310
+ // Idle deadline between frames: a client that stalls mid-body
311
+ // would otherwise pin a worker blocked in read_body. Only the
312
+ // client's silence trips this; a slow APP blocks the forwarder
313
+ // in send_async below instead, which is not timed.
314
+ let frame = match tokio::time::timeout(BODY_READ_TIMEOUT, body.frame()).await {
315
+ Ok(Some(Ok(frame))) => frame,
316
+ Ok(Some(Err(_))) | Ok(None) => break, // body error or clean EOF
317
+ Err(_) => {
318
+ timed_out.store(true, Ordering::Relaxed);
319
+ break;
320
+ }
321
+ };
236
322
  if let Ok(data) = frame.into_data() {
323
+ total += data.len() as u64;
324
+ if max_body > 0 && total > max_body as u64 {
325
+ // Past the cap: flag it and stop pulling. Dropping the
326
+ // sender unblocks read_body, which then raises.
327
+ overflow.store(true, Ordering::Relaxed);
328
+ break;
329
+ }
237
330
  if body_tx.send_async(data).await.is_err() {
238
331
  break; // request handle dropped; stop pulling
239
332
  }
@@ -253,6 +346,8 @@ async fn handle_request(
253
346
  local_addr,
254
347
  https: server.https,
255
348
  body_rx,
349
+ body_overflow,
350
+ body_timeout,
256
351
  leftover: None,
257
352
  slot: None,
258
353
  responder,
@@ -273,6 +368,9 @@ async fn handle_request(
273
368
 
274
369
  // Single exit point so the access log sees every outcome, 503s included.
275
370
  let response: HyperResponse = 'resp: {
371
+ if oversize {
372
+ break 'resp plain_response(413, "Payload Too Large\n");
373
+ }
276
374
  if server.lanes {
277
375
  if !dispatch_to_lane(&server, ctx).await {
278
376
  break 'resp unavailable(&server);
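The body-size guard in this file works in two stages: a declared `Content-Length` over the cap is refused with a 413 before any worker runs, and everything else is counted frame by frame as it streams. A minimal Ruby simulation of that control flow follows. `declared_oversize?` and `forward_frames` are hypothetical helper names rather than Kino functions, and frames are plain strings.

```ruby
# Hypothetical illustration only: simulates the two-stage body cap.
UNLIMITED = 0 # mirrors "0 = unlimited" for max_body_size

# Stage 1: an honestly declared oversize body is rejected up front.
# A missing or unparseable Content-Length (nil here) defers to stage 2.
def declared_oversize?(content_length, max_body_size)
  return false if max_body_size == UNLIMITED || content_length.nil?

  content_length > max_body_size
end

# Stage 2: count bytes as frames arrive and stop once the running total
# passes the cap. The frame that crosses the cap is not forwarded.
# Returns [forwarded_frames, overflowed].
def forward_frames(frames, max_body_size)
  total = 0
  forwarded = []
  frames.each do |frame|
    total += frame.bytesize
    return [forwarded, true] if max_body_size != UNLIMITED && total > max_body_size

    forwarded << frame
  end
  [forwarded, false]
end
```

With a cap of 10 bytes, three 4-byte frames forward the first two and flag an overflow on the third, the case that chunked or misdeclared uploads fall into.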
@@ -17,6 +17,8 @@ module Kino
17
17
  queue_depth: 1024,
18
18
  queue_timeout: 5.0,
19
19
  request_timeout: nil,
20
+ max_connections: nil, # nil = derive from the open-file limit
21
+ max_body_size: 50 * 1024 * 1024, # 50 MB; nil/0 = unlimited
20
22
  batch: 1,
21
23
  lanes: false,
22
24
  log_requests: false,
@@ -160,6 +162,14 @@ module Kino
160
162
  # Seconds the app gets before the client receives a 504; nil = off.
161
163
  def request_timeout(seconds) = @config.set(:request_timeout, seconds && Float(seconds))
162
164
 
165
+ # Max connections served at once; beyond it, new connections wait in
166
+ # the kernel backlog. Defaults to most of the open-file limit.
167
+ def max_connections(count) = @config.set(:max_connections, Integer(count))
168
+
169
+ # Max request-body bytes before a 413; nil disables (delegate to a
170
+ # fronting proxy). Default 50 MB.
171
+ def max_body_size(bytes) = @config.set(:max_body_size, bytes && Integer(bytes))
172
+
163
173
  # Requests a worker may grab per queue visit (default 1).
164
174
  def batch(count) = @config.set(:batch, Integer(count))
165
175
 
data/lib/kino/server.rb CHANGED
@@ -50,6 +50,8 @@ module Kino
50
50
  @queue_depth = Integer(settings[:queue_depth])
51
51
  @queue_timeout_ms = (Float(settings[:queue_timeout]) * 1000).round
52
52
  @request_timeout_ms = settings[:request_timeout] ? (Float(settings[:request_timeout]) * 1000).round : 0
53
+ @max_connections = settings[:max_connections] ? Integer(settings[:max_connections]) : default_max_connections
54
+ @max_body_size = Integer(settings[:max_body_size] || 0)
53
55
  @batch = [Integer(settings[:batch]), 1].max
54
56
  @lanes = !!settings[:lanes]
55
57
  @log_requests = !!settings[:log_requests]
@@ -74,6 +76,8 @@ module Kino
74
76
  bind: @bind, port: @requested_port,
75
77
  queue_depth: @queue_depth, queue_timeout_ms: @queue_timeout_ms,
76
78
  request_timeout_ms: @request_timeout_ms,
79
+ max_connections: @max_connections,
80
+ max_body_size: @max_body_size,
77
81
  tokio_threads: @tokio_threads,
78
82
  tls_cert: @tls&.fetch(:cert), tls_key: @tls&.fetch(:key),
79
83
  lanes: @lanes, log_requests: @log_requests
@@ -214,6 +218,18 @@ module Kino
214
218
  Process.clock_gettime(Process::CLOCK_MONOTONIC)
215
219
  end
216
220
 
221
+ # Default connection cap: most of the process open-file limit. A
222
+ # connection flood's failure mode is descriptor exhaustion, and in
223
+ # :ractor/:threaded mode the app's own sockets and files share this
224
+ # process's table, so leave headroom. Scales with `ulimit -n`; raise the
225
+ # OS limit (or set max_connections) to allow more.
226
+ def default_max_connections
227
+ soft, = Process.getrlimit(Process::RLIMIT_NOFILE)
228
+ return 65_536 if soft == Process::RLIM_INFINITY
229
+
230
+ [soft * 8 / 10, 64].max
231
+ end
232
+
217
233
  def join_workers(deadline)
218
234
  if @supervisor
219
235
  @supervisor.shutdown([deadline - monotonic_now, 0].max)
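To make the default cap concrete, the arithmetic in `default_max_connections` can be pulled into a pure function and tried against a few soft limits. This is a sketch: `connection_cap` and the two constants are hypothetical names introduced here, while the 80% factor, the floor of 64, and the 65,536 fallback are taken from the hunk above.

```ruby
# Hypothetical illustration only: the same arithmetic as
# default_max_connections, with the soft limit passed in as an argument.
INFINITY_FALLBACK = 65_536 # used when the open-file limit is unlimited
MINIMUM_CAP = 64           # floor for very small limits

def connection_cap(soft_limit, infinity: Process::RLIM_INFINITY)
  return INFINITY_FALLBACK if soft_limit == infinity

  # Integer arithmetic: 80% of the limit, rounded down, never below the floor.
  [soft_limit * 8 / 10, MINIMUM_CAP].max
end

# Show the cap this process would get from its own soft limit.
soft, = Process.getrlimit(Process::RLIMIT_NOFILE)
puts "soft limit #{soft} -> cap #{connection_cap(soft)}"
```

A common soft limit of 1024 yields a cap of 819, which leaves roughly 200 descriptors for the app's own sockets and files.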
@@ -55,6 +55,17 @@
55
55
  # above your slowest legitimate endpoint.
56
56
  # request_timeout 30
57
57
 
58
+ # Most connections to serve at once. Past this, new connections wait in
59
+ # the kernel backlog instead of piling up until the server runs out of
60
+ # file descriptors. Defaults to most of the open-file limit (ulimit -n),
61
+ # so it scales with the OS limit and only bites under a flood.
62
+ # max_connections 8192
63
+
64
+ # Reject request bodies larger than this many bytes with a 413, so an
65
+ # oversized or endless upload can't drive your app to run out of memory.
66
+ # Set to nil to disable and let a fronting proxy handle it. Default: 50 MB.
67
+ # max_body_size 50 * 1024 * 1024
68
+
58
69
  # How many requests a worker grabs from the line at once. Leave at 1
59
70
  # unless all your endpoints are uniformly fast.
60
71
  # batch 1
data/lib/kino/version.rb CHANGED
@@ -2,5 +2,5 @@
2
2
 
3
3
  module Kino
4
4
  # The gem version (single source of truth; ext/kino/Cargo.toml syncs).
5
- VERSION = "0.1.1"
5
+ VERSION = "0.1.2"
6
6
  end
data/sig/kino.rbs CHANGED
@@ -92,6 +92,8 @@ module Kino
92
92
  def queue_depth: (int depth) -> untyped
93
93
  def queue_timeout: (Numeric seconds) -> untyped
94
94
  def request_timeout: (Numeric? seconds) -> untyped
95
+ def max_connections: (int count) -> untyped
96
+ def max_body_size: (int? bytes) -> untyped
95
97
  def batch: (int count) -> untyped
96
98
  def lanes: (boolish enabled) -> untyped
97
99
  def log_requests: (boolish enabled) -> untyped
metadata CHANGED
@@ -1,7 +1,7 @@
1
1
  --- !ruby/object:Gem::Specification
2
2
  name: kino
3
3
  version: !ruby/object:Gem::Version
4
- version: 0.1.1
4
+ version: 0.1.2
5
5
  platform: ruby
6
6
  authors:
7
7
  - Yaroslav Markin