kino 0.1.0 → 0.1.2

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
data/doc/benchmarks.md CHANGED
@@ -27,9 +27,29 @@ the deployment most apps run today.
27
27
  plaintext 26-37% lower while leaving Puma's number unchanged.
28
28
  - Identical app for every server (`bench/bench_app.rb`), Ractor-shareable
29
29
  so Kino's `:ractor` mode can run it unmodified.
30
- - Topology held equal: Puma 8 forked workers × 3 threads vs Kino
31
- 8 workers × 3 threads in one process.
32
- - Follow-up studies (`bench/studies.sh`): CPU recipe, topology sweep,
30
+ - Topology: each server at its shipped defaults—Puma 8 forked workers
31
+ × 3 threads (24 slots) vs Kino 8 workers in one process (× 1 thread
32
+ in ractor modes, the default since 0.1.1; × 3 threads in threaded
33
+ mode). Equal-topology numbers (Kino at 8×3) are in the studies below.
34
+ - The headline tables also carry an io-tuned column (`workers 32,
35
+ threads 1`)—not a default, labeled as such—because the /io rows are
36
+ a slot-count story (see below).
37
+ - The dataset spans four identical c7a.2xlarge boxes: the original
38
+ measurements, a re-measure at the 0.1.1 defaults, the headline sweep,
39
+ and a final full re-validation (every table re-run from scratch).
40
+ Equal-config throughput reproduced across boxes within ~1-2%.
41
+ - **Memory is reported as PSS (proportional set size), not RSS.** A Puma
42
+ cluster forks N workers that share the Ruby VM and gem code
43
+ copy-on-write; summing each worker's RSS counts those shared pages up
44
+ to N times and overstates the cluster's real footprint. PSS divides
45
+ every shared page across the processes mapping it, so it reflects the
46
+ unique physical memory the cluster occupies—the only fair basis for
47
+ comparing one process against a fork-per-core cluster. We read it from
48
+ `/proc/<pid>/smaps_rollup` over the whole process tree, cross-checked
49
+ against `ps` (RSS) and `smem` (PSS). Kino serves from one process, so
50
+ its RSS ≈ PSS; the correction only moves Puma. (`bench/studies.sh`
51
+ reports both columns.)
52
+ - Follow-up studies (`bench/studies.sh`): CPU tuning, topology sweep,
33
53
  /io worker scaling, logging costs, and memory—run in the same session
34
54
  as the headline tables.
35
55
  - The harness waits for the port to be genuinely free between targets.
@@ -45,42 +65,55 @@ the deployment most apps run today.
45
65
 
46
66
  ## Reading the headline tables
47
67
 
68
+ These tables all run the **tiny synthetic Ractor-shareable app**. The real
69
+ Rails app is not Ractor-shareable and runs only in threaded fallback—a
70
+ separate story with separate numbers, in [its own section](#rails).
71
+
48
72
  - **Plaintext/10k**: Kino's tokio front-end clears the fork cluster by
49
- 1.4-2× (lanes plaintext 241,501 vs Puma 117,838 = 2.05×; the smallest
50
- margin is threaded /10k at 1.44×). The cross-ractor handoff shows up
51
- as ractor (201k) trailing threaded (218k) on trivial handlers—nothing
52
- in them needs parallel Ruby—and lane dispatch reverses that (241k).
53
- - **CPU (recursive fib)**: ractor mode does **5× its own GVL-bound
54
- threaded mode** (66,735 vs 13,298)—that's the entire point of
55
- ractors—and beats the fork cluster outright: +15% with stock
56
- defaults, +21% with lanes (70,373 vs 58,207).
57
- - **Memory**: serving the same loaded bench app, Kino held **57 MB
58
- (ractor) / 50 MB (threaded)** where the 8-worker cluster held
59
- **1,078 MB**—a fork per core pays one full copy of the VM, the app,
60
- and its YJIT-compiled code per worker. On the Rails hello-world:
61
- Kino 97 MB vs cluster 797 MB.
73
+ 1.5-2.1× (lanes plaintext 250,222 vs Puma 118,176 = 2.12×; the
74
+ smallest margin is threaded /10k at 1.50×). At the old 3-thread
75
+ topology the cross-ractor handoff showed up as ractor trailing
76
+ threaded on trivial handlers; the 1-thread default reverses that
77
+ (ractor 230k vs threaded 217k) and lanes widen it (250k).
78
+ - **CPU (recursive fib)**: ractor mode does **5.8× its own GVL-bound
79
+ threaded mode** (77,999 vs 13,429)—that's the entire point of
80
+ ractors—and beats the fork cluster outright: +34% with stock
81
+ defaults (+22% with lanes, 70,885 vs 58,006). Even the io-tuned
82
+ `workers 32` topology stays ahead of the cluster on CPU (66,100).
83
+ - **Memory (PSS)**: after the full endpoint battery, the tiny app costs
84
+ Kino **148 MB** in ractor mode (107 MB threaded) against the 8-worker
85
+ cluster's **1,068 MB**—~7-10× lighter, because a trivial app is almost
86
+ all private per-worker heap that copy-on-write can't share. The real
87
+ Rails app narrows this to ~4× (its framework *is* shared CoW); both are
88
+ in [Memory under load](#memory-under-load-and-the-glibc-arena-footgun).
62
89
  - **I/O (5 ms wait)**: all dispatch models tie within ~4% at equal slot
63
- counts; the lever that matters is slot count, see
64
- [below](#why-io-lags-in-ractor-mode-on-linux).
90
+ counts, so the default columns show the ractor modes behind on /io
91
+ (8 slots vs the cluster's 24), and the `workers 32` column shows the
92
+ same engine winning (+25%, +34% via `Kino.sleep`) once it has more
93
+ slots than the cluster. The lever is slot count, and Kino slots are
94
+ cheap: see [below](#why-io-lags-in-ractor-mode-on-linux).
65
95
 
66
96
  ## CPU-bound tuning
67
97
 
68
- On real hardware, Kino's stock defaults already lead the cluster on
69
- pure CPU—same-session studies run:
98
+ On real hardware, Kino's stock defaults lead the cluster on pure
99
+ CPU—and the old tuning recipe is now obsolete. Same-session studies
100
+ run:
70
101
 
71
102
  | config | /cpu req/s |
72
103
  |---|---:|
73
- | Puma cluster (reference) | 58,376 |
74
- | Kino `workers 8, threads 3`, tokio auto (default) | 68,257 |
75
- | Kino `workers 8, threads 1, tokio_threads 1` (recipe) | 68,629 |
76
-
77
- The tuned recipe is a wash (+0.5%)—and it still costs plaintext
78
- (112,815 vs ~200k) and /io (1,532, 8 slots): on this hardware there is
79
- no reason to use it. **This is a finding that changed with the
80
- environment**: in the earlier Docker-on-Mac runs the recipe was worth
81
- +12%, because tokio threads and wake churn competed for oversubscribed
82
- virtualized cores. If you deploy into a constrained/virtualized
83
- environment, the recipe may still pay; measure there.
104
+ | Puma cluster (reference) | 58,189 |
105
+ | Kino `workers 8, threads 3` (the default before 0.1.1) | 67,394 |
106
+ | Kino `workers 8, threads 1, tokio_threads 1` (the old recipe) | 68,600 |
107
+ | Kino `workers 8, threads 1`, tokio auto (**the default**) | **77,999** |
108
+
109
+ The `threads 1` half of the old recipe became the default; the
110
+ `tokio_threads 1` half now *costs* −12% on /cpu (and still costs
111
+ plaintext: 108,523 vs 230k). Don't pin tokio threads. **The recipe's
112
+ history is an environment story**: in the earlier Docker-on-Mac runs it
113
+ was worth +12%, because tokio threads and wake churn competed for
114
+ oversubscribed virtualized cores; on dedicated cores the same pin
115
+ starves the I/O front-end instead. If you deploy into a
116
+ constrained/virtualized environment, measure there.
84
117
 
85
118
  Two findings that survived the environment change:
86
119
 
@@ -99,8 +132,9 @@ Parallelism for CPU-bound Ruby comes from ractors or forks, nothing else.
99
132
 
100
133
  ## Why /io lags in ractor mode on Linux
101
134
 
102
- On bare metal the gap is small: ractor /io 4,527 vs threaded 4,715
103
- (−4%). In Docker it was −18%, and a pure-Ruby probe there measured
135
+ On bare metal the gap is small at equal slot counts: ractor /io 4,530
136
+ vs threaded 4,709 (−4%, both at 8×3). In Docker it was −18%, and a
137
+ pure-Ruby probe there measured
104
138
  `sleep(0.005)` waking +2.3-2.8 ms late inside ractors vs +1.8 ms on the
105
139
  main ractor—non-main-ractor timer wakeups are coarser in Ruby 4.0, but
106
140
  how much that costs depends heavily on the kernel/virtualization stack.
@@ -110,14 +144,19 @@ A follow-up probe showed `IO.select`-style waits are tighter than
110
144
  **Mitigation 1—`Kino.sleep`:** releases the GVL and waits on the OS
111
145
  clock directly (chunked, so `Thread#kill`/shutdown stay responsive). The
112
146
  `/io_native` endpoint (same 5 ms wait via `Kino.sleep` when available)
113
- erases the remaining ractor gap on this box: 4,714 vs 4,527 plain sleep.
114
-
115
- **Mitigation 2—add workers; they're nearly free.** Wait-bound
116
- throughput is simply `slots ÷ effective wait`, and Kino's slots cost ~a
117
- thread each, not a forked process: `workers 32, threads 1` measured
118
- **5,922 /io (+27% over the 24-thread cluster's 4,672) and 6,254
119
- /io_native (+34%)**, still one small process. A fork cluster buying the
120
- same 32 slots pays for them in full copies of the app.
147
+ erases the remaining ractor gap on this box: 4,721 vs 4,530 plain sleep.
148
+
149
+ **Mitigation 2—add workers; they're nearly free.** The headline tables
150
+ show default ractor-mode /io at 1,552: that's 8 slots (the 1-thread
151
+ default) against the cluster's 24, because wait-bound throughput is
152
+ simply `slots ÷ effective wait`. Kino's slots cost ~a thread each, not
153
+ a forked process: the `workers 32, threads 1` column measured **5,888
154
+ /io (+25% over the 24-thread cluster's 4,693) and 6,274 /io_native
155
+ (+34%)**, still one small process, and still +14% ahead of the cluster
156
+ on pure CPU. Its cost is the CPU-light rows (183k plaintext vs 230k at
157
+ 8×1: 32 ractors oversubscribe 8 cores). A fork cluster buying the same
158
+ 32 slots pays for them in full copies of the app; Kino pays in
159
+ scheduler churn only where the cores are already saturated.
121
160
 
122
161
  ## The ractor-pool-wrapper comparison
123
162
 
@@ -127,11 +166,11 @@ already run. `bench/ractor_wrapper.rb` is that experiment, benchmarked on
127
166
  Puma and Falcon—not as a comparison of those servers, but to measure
128
167
  what the Rack-level hop itself costs (c7a.2xlarge, same session):
129
168
 
130
- | endpoint | Kino :ractor | Puma + wrapper | Falcon + wrapper |
131
- |------------|-------------:|---------------:|-----------------:|
132
- | /plaintext | 201,472 | 19,425 | 100,624 |
133
- | /cpu (fib) | 66,735 | 17,106 | 49,083 |
134
- | /io (5 ms) | 4,527 | 1,447 | 1,549 |
169
+ | endpoint | Kino :ractor (8×3) | Puma + wrapper | Falcon + wrapper |
170
+ |------------|-------------------:|---------------:|-----------------:|
171
+ | /plaintext | 193,826 | 19,480 | 99,776 |
172
+ | /cpu (fib) | 68,061 | 17,755 | 48,721 |
173
+ | /io (5 ms) | 4,530 | 1,454 | 1,549 |
135
174
 
136
175
  Inside the Rack contract, the wrapper must reduce the env to a shareable
137
176
  subset, copy it to the worker ractor, copy the response back, and hold a
@@ -146,18 +185,26 @@ the Rack contract—which is the experiment this gem exists to run.
146
185
 
147
186
  ## Rails
148
187
 
149
- The example app (`examples/rails-hello`, edge Rails, production mode,
150
- 8 workers × 5 threads) on the same box:
151
-
152
- | | req/s | RSS under load |
153
- |---|---:|---:|
154
- | Kino `:threaded` (one process) | 2,298 | **97 MB** |
155
- | Puma cluster (8 workers) | 11,923 | 797 MB |
156
-
157
- This is the honest version of the Rails story: in threaded mode Kino is
158
- one GVL-bound process, so the fork cluster outruns it ~5× by using all
159
- 8 cores—at the memory. Rails-on-Ractors is interesting precisely
160
- because it would close that throughput gap at the one-process memory
188
+ Rails is **not Ractor-shareable**, so Kino can only serve it in
189
+ `:threaded` fallback—this whole section is one GVL-bound Kino process,
190
+ never ractor mode. The example app (`examples/rails-hello`, edge Rails,
191
+ production mode, 8 workers × 5 threads) on the same box:
192
+
193
+ | | req/s | RSS | PSS |
194
+ |---|---:|---:|---:|
195
+ | Kino `:threaded` (one process) | 2,637 | 97 MB | **92 MB** |
196
+ | Puma cluster (8 workers) | 12,138 | 794 MB | **389 MB** |
197
+
198
+ This is the honest version of the Rails story. In threaded mode Kino is
199
+ one GVL-bound process, so the fork cluster outruns it ~4.6× by using all
200
+ 8 cores—at ~4× the memory by PSS. The metric matters here: Puma's RSS
201
+ (794 MB) counts the shared Rails framework once per worker; PSS (389 MB)
202
+ counts it once, and that is the fair figure (the README's headline used
203
+ to read 8× off RSS). Preloading barely moves it—389 MB with
204
+ `preload_app!` vs 400 MB without—because Ruby's GC dirties most heap
205
+ pages, breaking copy-on-write, so even a preloaded cluster keeps a
206
+ private heap per worker. Rails-on-Ractors is interesting precisely
207
+ because it would close the throughput gap at the one-process memory
161
208
  cost; the upstream blockers are documented in
162
209
  [rails-on-ractors.md](rails-on-ractors.md).
163
210
 
@@ -190,6 +237,62 @@ needs its `disable_initial_exec_tls` build flag just to load (dlopen +
190
237
  initial-exec TLS = `cannot allocate memory in static TLS block`)—one
191
238
  more reason to prefer mimalloc in dlopen'd extensions.
192
239
 
240
+ ## Memory under load (and the glibc arena footgun)
241
+
242
+ All figures are **PSS** (see [Methodology](#methodology)) after the full
243
+ endpoint battery (8 s each of /plaintext, /10k, /cpu, /io—a "warmed
244
+ production process", not a fresh boot, which measures ~26 MB for every
245
+ Kino mode). RSS is shown alongside so the copy-on-write correction is
246
+ visible.
247
+
248
+ ### The tiny synthetic app
249
+
250
+ | config | RSS | PSS |
251
+ |---|---:|---:|
252
+ | Kino :ractor 8×1 (default) | 151 | **148** |
253
+ | Kino lanes 8×1 | 137 | **135** |
254
+ | Kino :ractor 8×3 | 171 | **169** |
255
+ | Kino :threaded 8×3 (`MALLOC_ARENA_MAX=2`) | 109 | **107** |
256
+ | Kino :threaded 8×3 (no arena cap) | 668 | **666**¹ |
257
+ | Puma cluster 8×3 | 1,213 | **1,068** |
258
+
259
+ The tiny app is ~7× lighter than the cluster in ractor mode, ~10× in
260
+ arena-capped threaded mode. RSS ≈ PSS for every Kino row (one process,
261
+ nothing to share) and within ~12% for Puma here: a trivial app has almost
262
+ no shared state, so Puma's footprint is ~1,051 MB of *private* per-worker
263
+ heap plus only ~18 MB shared (which RSS counts 8×). This is the case where
264
+ copy-on-write does **not** rescue the cluster—there is nothing to
265
+ share—so the RSS and PSS numbers nearly agree. (The old "80 MB / 15×"
266
+ figure was a lighter, plaintext-only load; the honest full-battery ractor
267
+ figure is ~148 MB, i.e. ~7×.)
268
+
269
+ ¹ Not a leak: glibc malloc arena bloat. One 8-second /10k round takes
270
+ threaded mode from ~70 MB to ~670 MB and it never returns—24 threads
271
+ churning 10 KB strings through one process heap is the textbook glibc
272
+ arena-fragmentation case (the reason Rails ops set `MALLOC_ARENA_MAX=2`;
273
+ Heroku ships that default). With the cap the same battery ends at 107 MB
274
+ PSS, throughput unchanged. Ractor mode sidesteps the worst of it without
275
+ any env tweak—objects live in per-ractor heaps.
276
+
277
+ ### Rails (threaded fallback)
278
+
279
+ Here copy-on-write **does** matter, which is exactly why PSS is mandatory:
280
+
281
+ | config | RSS | PSS |
282
+ |---|---:|---:|
283
+ | Kino :threaded (one process) | 97 | **92** |
284
+ | Puma cluster 8×3 (preload) | 794 | **389** |
285
+
286
+ Puma serves the same Rails framework from 8 forks that share it
287
+ copy-on-write; RSS counts that shared framework once per worker (794 MB),
288
+ PSS counts it once (389 MB). The fair ratio is **~4×**, not the ~8× a
289
+ naive RSS sum reports—this is the correction that prompted the whole
290
+ re-measure. Preload barely helps (389 vs 400 MB without): Ruby's GC
291
+ dirties most heap pages, breaking copy-on-write, so even a preloaded
292
+ cluster keeps a large private heap per worker. That is why "CoW should
293
+ make a fork cluster nearly free" is only half true—it shares the code,
294
+ not the live object heap.
295
+
193
296
  ## Run-to-run variance (a.k.a. "is this a regression?")
194
297
 
195
298
  Rule of thumb from chasing this twice: never compare numbers from
@@ -197,21 +300,33 @@ different sessions; interleave A/B rounds in one session instead. The
197
300
  Docker-on-Mac environment swung ±10% on /cpu between sessions with the
198
301
  VM's mood; the dedicated c7a box is far steadier (same-session repeats
199
302
  land within ~1-2%), but the discipline stays—every comparative claim in
200
- these docs comes from same-session pairs.
303
+ these docs comes from same-session pairs. Cross-box repeatability got
304
+ its own test: the dataset was measured across four identical
305
+ c7a.2xlarge boxes, and equal-config throughput numbers matched within
306
+ ~1-2% (loaded-memory measurements swing more with heap-growth
307
+ timing—treat them as ballpark). The same discipline caught the recurring
308
+ threaded-plaintext fluke twice: once 28% low on an earlier box, and again
309
+ in the final re-validation (170k, where three interleaved re-runs put it
310
+ back at 217k). Suspect cells get re-measured, not published.
201
311
 
202
312
  ## Topology notes
203
313
 
204
- Measured on c7a.2xlarge, plaintext, ractor mode, same session: `8×3`
205
- (workers×threads) = 199,470, `8×1` = **232,469 (+17%)**, `16×1` =
206
- 214,284. Threads inside one ractor share its lock, so every request
207
- handled by a 3-thread ractor pays a lock handoff that a 1-thread ractor
208
- doesn't (`perf` in the earlier Docker sessions attributed ~10% of cycles
209
- to `rb_native_mutex_unlock`/`thread_sched_wakeup_next_thread` at 8×3;
210
- the +17% reproduces exactly on real hardware). Threads-per-ractor exist
211
- for handlers that block on I/O; if yours don't, run `threads 1` and let
212
- workers = cores do the parallelism. (16×1 being worse than 8×1 also says
213
- the shared MPMC queue is *not* the bottleneck—8 extra parked consumers
214
- just add scheduler churn.)
314
+ Measured on c7a.2xlarge, plaintext, ractor mode, same session (three
315
+ interleaved rounds, medians): `8×3` (workers×threads) = 198,478, `8×1`
316
+ = **229,966 (+16%)**, `16×1` = 214,391. Threads inside one ractor share
317
+ its lock, so every request handled by a 3-thread ractor pays a lock
318
+ handoff that a 1-thread ractor doesn't (`perf` in the earlier Docker
319
+ sessions attributed ~10% of cycles to
320
+ `rb_native_mutex_unlock`/`thread_sched_wakeup_next_thread` at 8×3; the
321
+ gain reproduced on two separate boxes, +16-17% each). **This is why
322
+ `threads` defaults to 1 in ractor mode since 0.1.1** (/cpu gains +16%
323
+ the same way: 77,999 vs 67,394). The trade-off is /io at low worker
324
+ counts: 1,552 at 8×1 vs 4,530 at 8×3—threads-per-ractor exist for
325
+ handlers that block on I/O. If yours wait a lot, raise `workers`
326
+ instead (32×1 beats even the 24-slot cluster, see above); slots are
327
+ cheap. (16×1 being worse than 8×1 on plaintext also says the shared
328
+ MPMC queue is *not* the bottleneck—8 extra parked consumers just add
329
+ scheduler churn.)
215
330
 
216
331
  ## What profiling tried and rejected
217
332
 
@@ -239,23 +354,27 @@ safeguards: lane depth is capped at 4, and workers steal from siblings
239
354
  before parking (plus on every park tick), so a slow request can't strand
240
355
  its lane's backlog.
241
356
 
242
- Same-session A/B on c7a.2xlarge (ractor mode):
357
+ Same-session A/B on c7a.2xlarge, ractor mode at the default topology
358
+ (8×1):
243
359
 
244
360
  | endpoint | shared queue | lanes | delta |
245
361
  |----------|-------------:|------:|------:|
246
- | /plaintext | 201,472 | **241,501** | **+20%** |
247
- | /10k | 156,635 | 183,564 | +17% |
248
- | /cpu | 66,735 | 70,373 | +5% |
249
- | /io | 4,527 | 4,530 | flat |
250
-
251
- On this hardware lanes make ractor mode the fastest Kino configuration
252
- outright—+11% over threaded mode's plaintext, where the shared queue
253
- trails it. (On loopback-bound macOS, lanes lose a few percent instead;
254
- see the secondary table below.) It stays opt-in for now because overload
255
- semantics differ from the shared queue (`queue_depth` doesn't apply;
256
- capacity is lanes × 4 with brief dispatcher retries up to
257
- `queue_timeout` before the 503), and crash semantics, stealing fairness,
258
- and drain behavior have spec coverage but not production mileage.
362
+ | /plaintext | 229,534 | **250,222** | **+9%** |
363
+ | /10k | 178,083 | 189,862 | +7% |
364
+ | /cpu | **77,999** | 70,885 | −9% |
365
+ | /io | 1,552 | 1,551 | flat |
366
+
367
+ Lanes' margin shrank with the move to 1-thread workers (at the old 8×3
368
+ it was +21% plaintext: 240,193 vs 199,032 in the same session)—most of
369
+ the futex pain lanes were built to avoid came from thread handoffs
370
+ inside each ractor, and the new default removes those for everyone. At
371
+ the default, lanes still post the fastest plaintext/10k of any Kino
372
+ configuration, but plain shared-queue now takes /cpu. It stays opt-in because overload semantics differ from the
373
+ shared queue (`queue_depth` doesn't apply; capacity is lanes × 4 with
374
+ brief dispatcher retries up to `queue_timeout` before the 503), and
375
+ crash semantics, stealing fairness, and drain behavior have spec
376
+ coverage but not production mileage. (On loopback-bound macOS, lanes
377
+ lose a few percent instead; see the secondary table below.)
259
378
 
260
379
  ## Logging costs
261
380
 
@@ -265,11 +384,11 @@ typical costs):
265
384
 
266
385
  | case (8×3, same session) | req/s |
267
386
  |---|---:|
268
- | threaded, no logging | 217,113 |
269
- | threaded, `log_requests true` (native access log) | 193,200 (−11%) |
270
- | ractor, access log off / on | 198,624 / 183,565 (−8%) |
271
- | app logs 1 line/req via shared `::Logger` (file) | **62,962** |
272
- | app logs 1 line/req via `Kino::Logger` (file) | **150,810 (2.4×)** |
387
+ | threaded, no logging | 219,168 |
388
+ | threaded, `log_requests true` (native access log) | 193,998 (−11%) |
389
+ | ractor, access log off / on | 197,596 / 181,050 (−8%) |
390
+ | app logs 1 line/req via shared `::Logger` (file) | **62,961** |
391
+ | app logs 1 line/req via `Kino::Logger` (file) | **149,519 (2.4×)** |
273
392
 
274
393
  The shared-`::Logger` cost is the mutex: 24 worker threads serialize
275
394
  through one lock plus a write syscall per line. `Kino::Logger` hands the
@@ -7,10 +7,11 @@
7
7
  Rails 8.2.0.alpha boots and serves with `mode :threaded` (see the
8
8
  example's `kino.rb`; just `bundle exec kino` in that directory). Measured
9
9
  on the hello-world (c7a.2xlarge, 8 cores, production mode, 8×5):
10
- ~2.3k req/s in 97 MB, single process. The 8-worker Puma cluster reaches
11
- ~11.9k in 797 MB by parallelizing across forks—Rails-on-Ractors is
12
- interesting precisely because it could offer that ~5× parallelism at
13
- ~1/8th of the memory.
10
+ ~2.6k req/s in 92 MB PSS, single process. The 8-worker Puma cluster
11
+ reaches ~12.1k by parallelizing across forks, at 389 MB PSS (794 MB RSS,
12
+ but its forks share the framework copy-on-write, so PSS is the fair
13
+ figure)—Rails-on-Ractors is interesting precisely because it could offer
14
+ that ~4.6× parallelism at ~1/4th of the memory.
14
15
 
15
16
  Pair it with production-style Rails settings: eager load, no code
16
17
  reloading, database pool ≥ workers × threads, logger to stdout or another
data/doc/why-kino.md CHANGED
@@ -15,8 +15,8 @@ deep-copies it, and sockets cannot cross at all.
15
15
 
16
16
  We measured what the "obvious" workaround costs. The ractor-pool wrapper
17
17
  experiment (reduce the env to a shareable subset, copy it to a worker
18
- over a `Ractor::Port`, copy the response back) runs at **19k req/s where
19
- Kino does 201k** on the same hardware—see the
18
+ over a `Ractor::Port`, copy the response back) runs at **19.5k req/s
19
+ where Kino does 194k** on the same hardware—see the
20
20
  [wrapper comparison](benchmarks.md#the-ractor-pool-wrapper-comparison).
21
21
  Copying at the Rack layer eats the entire ractor dividend. Dispatch has
22
22
  to live below the Rack contract.
@@ -78,12 +78,12 @@ objects; Rust sees one queue and one registry.
78
78
 
79
79
  With the dispatch cost eliminated, Ractors deliver the thing they were
80
80
  built for—a lock per ractor instead of one GVL—and each layer is
81
- visible in the [benchmarks](benchmarks.md): `/cpu` at 66.7k req/s in
82
- ractor mode vs **13.3k threaded (5×, the GVL ceiling)**, matching the
83
- fork cluster's CPU parallelism while holding **57 MB against the
84
- cluster's 1,078 MB**, because eight ractors share one VM, one Rust
85
- front-end, one queue, and one JIT, where eight forks each pay full
86
- price.
81
+ visible in the [benchmarks](benchmarks.md): `/cpu` at 78.0k req/s in
82
+ ractor mode vs **13.4k threaded (5.8×, the GVL ceiling)**, beating the
83
+ fork cluster's CPU parallelism by +34% while holding **~148 MB against
84
+ the cluster's ~1,068 MB** (by PSS, on the bench app), because eight
85
+ ractors share one VM, one Rust front-end, one queue, and one JIT, where
86
+ eight forks each pay full price.
87
87
 
88
88
  The cleanest proof of the design is the threaded fallback itself: it
89
89
  reuses ~95% of the same machinery, because the Rust core is
data/ext/kino/Cargo.toml CHANGED
@@ -1,6 +1,6 @@
1
1
  [package]
2
2
  name = "kino"
3
- version = "0.1.0"
3
+ version = "0.1.2"
4
4
  edition = "2021"
5
5
  authors = ["Yaroslav Markin <yaroslav@markin.net>"]
6
6
  license = "MIT"
@@ -41,6 +41,9 @@ pub struct ServerInner {
41
41
  /// 0 = no request timeout; otherwise the response head must arrive
42
42
  /// within this many ms or the client gets a 504.
43
43
  pub request_timeout_ms: u64,
44
+ /// 0 = unlimited; otherwise the max request-body bytes accepted before a
45
+ /// 413 (truthful Content-Length) or a mid-stream abort (chunked/lying).
46
+ pub max_body_size: usize,
44
47
  pub timeouts: AtomicU64,
45
48
  pub https: bool,
46
49
  /// Native access log sink (None unless log_requests is on).
@@ -180,6 +183,7 @@ pub fn test_server(lanes: bool, queue_depth: usize) -> Arc<ServerInner> {
180
183
  rejected: AtomicU64::new(0),
181
184
  queue_timeout_ms: 10,
182
185
  request_timeout_ms: 0,
186
+ max_body_size: 0,
183
187
  timeouts: AtomicU64::new(0),
184
188
  https: false,
185
189
  access_log: None,
@@ -25,6 +25,12 @@ pub struct RequestCtx {
25
25
  /// Request body, streamed from hyper through a bounded channel: hyper is
26
26
  /// only polled as Ruby consumes, so inbound backpressure is free.
27
27
  pub body_rx: flume::Receiver<Bytes>,
28
+ /// Set by the body forwarder when the body exceeded max_body_size: turns
29
+ /// the next read into an error instead of a (truncated) clean EOF.
30
+ pub body_overflow: Arc<std::sync::atomic::AtomicBool>,
31
+ /// Set by the body forwarder when the client stalled past the idle
32
+ /// deadline: the next read raises so the worker reclaims its slot.
33
+ pub body_timeout: Arc<std::sync::atomic::AtomicBool>,
28
34
  /// When a frame is bigger than read_body's max_len, the rest waits here.
29
35
  pub leftover: Option<Bytes>,
30
36
  /// The owning worker slot (set at admit time, queue.rs); its interrupt
@@ -62,6 +68,20 @@ fn interrupted_error(ruby: &Ruby) -> Error {
62
68
  )
63
69
  }
64
70
 
71
+ fn body_too_large_error(ruby: &Ruby) -> Error {
72
+ Error::new(
73
+ ruby.exception_runtime_error(),
74
+ "Kino: request body exceeded max_body_size",
75
+ )
76
+ }
77
+
78
+ fn body_timeout_error(ruby: &Ruby) -> Error {
79
+ Error::new(
80
+ ruby.exception_runtime_error(),
81
+ "Kino: request body read timed out",
82
+ )
83
+ }
84
+
65
85
  fn invalid_response(ruby: &Ruby, e: impl std::fmt::Display) -> Error {
66
86
  Error::new(
67
87
  ruby.exception_runtime_error(),
@@ -238,7 +258,17 @@ impl Request {
238
258
  });
239
259
  match outcome {
240
260
  Some(Some(bytes)) => bytes,
241
- Some(None) => return Ok(None), // EOF
261
+ Some(None) => {
262
+ // Disconnected: a clean EOF, unless the forwarder
263
+ // abandoned the body (too large, or the client stalled).
264
+ if ctx.body_overflow.load(std::sync::atomic::Ordering::Relaxed) {
265
+ return Err(body_too_large_error(ruby));
266
+ }
267
+ if ctx.body_timeout.load(std::sync::atomic::Ordering::Relaxed) {
268
+ return Err(body_timeout_error(ruby));
269
+ }
270
+ return Ok(None); // EOF
271
+ }
242
272
  None => return Err(interrupted_error(ruby)),
243
273
  }
244
274
  }
@@ -363,6 +393,8 @@ pub fn test_ctx() -> crate::registry::BoxedCtx {
363
393
  local_addr: "127.0.0.1:9292".parse().expect("static addr"),
364
394
  https: false,
365
395
  body_rx,
396
+ body_overflow: Arc::new(std::sync::atomic::AtomicBool::new(false)),
397
+ body_timeout: Arc::new(std::sync::atomic::AtomicBool::new(false)),
366
398
  leftover: None,
367
399
  slot: None,
368
400
  responder: Arc::new(Responder::new(head_tx)),