RubyGems - kino - Versions diffs - 0.1.1-aarch64-linux → 0.1.2-aarch64-linux - Mend

kino 0.1.1-aarch64-linux → 0.1.2-aarch64-linux


Files changed (13)

checksums.yaml +4 -4
data/CHANGELOG.md +28 -0
data/README.md +65 -31
data/doc/benchmarks.md +138 -85
data/doc/rails-on-ractors.md +5 -4
data/doc/why-kino.md +7 -7
data/lib/kino/configuration.rb +10 -0
data/lib/kino/kino.so +0 -0
data/lib/kino/server.rb +16 -0
data/lib/kino/templates/kino.rb.tt +11 -0
data/lib/kino/version.rb +1 -1
data/sig/kino.rbs +2 -0
metadata +2 -2

checksums.yaml CHANGED

@@ -1,7 +1,7 @@
 ---
 SHA256:
-  metadata.gz: cc5a5ccaf0a6edecf4e34f4b75a7fd9968fe2d16ab89a395bb962b41cea8467d
-  data.tar.gz: 74eacc8c9e1d280e67ee2394b8ae4d0eb45f8366fd693c057ac7ad49c1e48368
+  metadata.gz: 782837060d496035095d9c245aa1a1b7f59b20ef85b6e91a067deef9ee27757c
+  data.tar.gz: 115e29c38ae853396d2ecb5cb52884e4aa7245ac4eb2106a00eda8be67673534
 SHA512:
-  metadata.gz: ee0b6133a6ee0e8b3ce5a4f007cec3630c5ed666ae4b7af3cbdf0e6c188441448e6081882b27f626a7c9e109f05029744628f82992d6e8a68494e4db1985cb38
-  data.tar.gz: 42c73f8c2cd110d3430393695375927e2a6ae65a840d483681c0c270abba47f62b957d2f4b997aa23d4d0761dc3069d08052e3a79af51356d87158fdc8d5a221
+  metadata.gz: 8ef1ebf271164fee8b032a11252fa56b108aa7de9b4817dd6af482b6470eea685d0b243c6849aa57feb8047a84f71f1cd19342d7bf86370da110608a66728696
+  data.tar.gz: b39644afd25f10d6508ee7551f985d31690877fa43edc5169a85bd548f0c394c6d2be7f329e1ac836edc20f55f60d8dffe6fd0a8e9ee83cb411d95cf88ccf344

data/CHANGELOG.md CHANGED

@@ -1,4 +1,32 @@
+## [0.1.2] - 2026-06-22
+- Drop a connection that has not sent its complete request headers
+  within 15 seconds. Closes a slowloris hole: hyper's built-in header-read
+  timeout was inert because the server installed no timer, so a slow-header
+  client could tie up a connection (and its tokio task) indefinitely.
+- Cap concurrent connections (new `max_connections` directive). Past the cap,
+  new connections wait in the kernel backlog instead of piling up until a
+  flood exhausts file descriptors or memory. Defaults to most of the process
+  open-file limit (`ulimit -n`), so it scales with the OS limit and only
+  engages under a flood.
+- Bound the TLS handshake to 10 seconds. A client that completed the TCP
+  connect but stalled the handshake could otherwise hold a connection slot
+  indefinitely, since the request and header-read deadlines only begin once
+  the handshake finishes.
+- Cap the request body at 50 MB by default (new `max_body_size` directive,
+  configurable; nil or 0 disables and delegates to a fronting proxy). An app
+  that reads `rack.input` could otherwise be driven to run out of memory by an
+  oversized or endless upload. A truthful oversize Content-Length is refused
+  with a 413 before the app runs; a chunked or lying client is cut off
+  mid-stream once it passes the cap.
+- Bound the idle time between request-body frames to 30 seconds. A client that
+  began a request then stalled mid-body would otherwise keep a worker blocked
+  in `rack.input.read` indefinitely; now the read raises and the worker
+  reclaims its slot. Only a silent client trips it: a steadily-sent body resets
+  the deadline each frame, so slow-but-active uploads are unaffected.
 ## [0.1.1] - 2026-06-11
 - Mode-dependent `threads` default: 1 per worker in :ractor mode (threads
   inside a ractor share its lock and cost a per-request handoff; +16-18%
   on fast handlers, measured on dedicated hardware), 3 in :threaded mode.
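
The `max_body_size` entry above describes two enforcement points: a declared oversize `Content-Length` is refused with a 413 before the app runs, and a chunked or understated body is cut off once it passes the cap. The sketch below models that decision in plain Ruby for illustration only. It is not Kino's implementation (the real enforcement sits in the native core), and `BodyTooLarge`, `precheck_status`, and `read_capped` are hypothetical names that do not appear in the gem.

```ruby
# Illustrative model of the body-cap behaviour described in the 0.1.2
# changelog. All names below are hypothetical, not part of Kino's API.

DEFAULT_MAX_BODY_SIZE = 50 * 1024 * 1024 # 50 MB, the documented default

# Hypothetical error type for a body that outgrows the cap mid-stream.
class BodyTooLarge < StandardError; end

# nil or 0 disables the cap (the limit is delegated to a fronting proxy).
def cap_disabled?(max_body_size)
  max_body_size.nil? || max_body_size.zero?
end

# Decide from the headers alone, before the app runs.
# Returns 413 for a declared oversize body, nil to let the request proceed.
def precheck_status(content_length, max_body_size = DEFAULT_MAX_BODY_SIZE)
  return nil if cap_disabled?(max_body_size)
  return nil if content_length.nil? # chunked: nothing declared to check yet

  content_length > max_body_size ? 413 : nil
end

# Consume body frames, raising once the running total passes the cap.
# This catches chunked uploads and clients that understated Content-Length.
def read_capped(frames, max_body_size = DEFAULT_MAX_BODY_SIZE)
  total = 0
  frames.each_with_object(+"") do |frame, body|
    total += frame.bytesize
    if !cap_disabled?(max_body_size) && total > max_body_size
      raise BodyTooLarge, "body exceeded #{max_body_size} bytes"
    end
    body << frame
  end
end
```

Checking the declared length first keeps an honest oversize upload from ever reaching the app, while the running total covers the cases where no length, or a false one, was declared.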

data/README.md CHANGED

@@ -14,9 +14,8 @@ and a threaded fallback mode runs everything else, Rails included.
 * **Fast.** On a real 8-core server, every Kino mode is **1.5-2×**
   ahead of a Puma fork cluster on I/O-light endpoints. Ractor mode also
   wins on pure CPU, **30%+**. [Benchmarks](#benchmarks) below.
-* **A fraction of the memory.** One process instead of a fork per core:
-  about **15× less memory** than the Puma cluster under the same load,
-  and 8× less when serving the Rails hello-world.
+* **A fraction of the memory.** About **7× less memory** than a Puma
+  cluster on the tiny Ractor benchmark app, and about **4× less** when serving Rails in threaded fallback mode.
 * **Parallel without forking.** Ractor mode runs CPU work **more than
   5× faster** than Kino's own GVL-bound threaded mode, in the same
   small process.
@@ -64,36 +63,55 @@ notes live in [doc/architecture.md](doc/architecture.md).
 ## Benchmarks
 Measured on a real server: AWS **c7a.2xlarge** (8-core AMD EPYC 9R14,
-16 GB, Amazon Linux 2023). This is a realistic app-server size. The same
-Ractor-shareable app runs on every server, Ruby 4.0.5 with YJIT, every
-server at its defaults: Puma forks 8 workers × 3 threads, Kino stays in
-one process (8 workers; 1 thread each in ractor modes, 3 in threaded).
-Numbers are req/s by wrk (8-second windows, 64 connections, same host).
-Methodology and the analysis behind every column:
+16 GB, Amazon Linux 2023). This is a realistic app-server size.
+**These tables run a tiny synthetic Rack app**—plaintext, a 10 KB body,
+a CPU-bound `fib`, a 5 ms wait—deliberately small, to measure the server
+rather than an app. It is Ractor-shareable, so Kino runs it in `:ractor`
+mode (and `:threaded` for comparison). **A real Rails app is a different
+story:** it is *not* Ractor-shareable, so it runs only in Kino's
+`:threaded` fallback, with its own numbers—see [Rails](#rails) below.
+Ruby 4.0.5 with YJIT, every server at its defaults: Puma forks 8 workers ×
+3 threads, Kino stays in one process (8 workers; 1 thread each in ractor
+modes, 3 in threaded). Numbers are req/s by wrk (8-second windows, 64
+connections, same host). Methodology:
 [doc/benchmarks.md](doc/benchmarks.md).
 | endpoint    | Kino :ractor | + lanes | :ractor, `workers 32`² | Kino :threaded | Puma (cluster) |
 |-------------|-------------:|--------:|-----------------------:|---------------:|---------------:|
-| /plaintext  |      229,565 | **244,340** |         156,118 |        217,619 |        118,190 |
-| /10k        |      179,119 | **188,258** |         134,457 |        157,147 |        105,588 |
-| /cpu (fib)  |   **76,922**¹|  73,136 |          62,406 |         13,499 |         58,337 |
-| /io (5 ms)  |        1,548 |   1,548 |       **5,935** |          4,715 |          4,687 |
-| /io_native  |        1,570 |   1,571 |       **6,289** |          4,717 |          4,695 |
+| /plaintext  |      229,534 | **250,222** |         182,997 |        216,994 |        118,176 |
+| /10k        |      178,083 | **189,862** |         151,034 |        160,400 |        106,768 |
+| /cpu (fib)  |   **77,999**¹|  70,885 |          66,100 |         13,429 |         58,006 |
+| /io (5 ms)  |        1,552 |   1,551 |       **5,888** |          4,709 |          4,693 |
+| /io_native  |        1,570 |   1,571 |       **6,274** |          4,695 |          4,691 |
-Memory on the same box, RSS after sustained load:
+Memory tells two different stories depending on the app, both by **PSS**
+(proportional set size; see note) after sustained load.
-| serving               | Kino (one process) | Puma cluster (8 workers) |
-|-----------------------|-------------------:|-------------------------:|
-| bench app, :ractor    |          **80 MB** |                 1,256 MB |
-| bench app, :threaded  |         **151 MB**³|                 1,256 MB |
-| Rails hello-world     |          **97 MB** |                   797 MB |
+**The tiny benchmark app** (Ractor-shareable, so Kino runs it in `:ractor`
+or `:threaded`). Kino is **~7× lighter in :ractor mode, ~10× in :threaded**
+than the Puma cluster — the gap stays large because a trivial app is almost
+all private per-worker heap, which copy-on-write can't share:
+| tiny app, Kino  | Kino (one process) | Puma cluster (8 workers) | ratio |
+|-----------------|-------------------:|-------------------------:|------:|
+| :ractor (8×1)   |         **148 MB** |                 1,068 MB |  ~7×  |
+| :threaded (8×3) |         **107 MB**³|                 1,068 MB | ~10×  |
+**A real Rails app** (not Ractor-shareable—Kino's `:threaded` fallback
+only, [below](#rails)). The gap is **~4×**, smaller because Rails' large
+framework *is* shared copy-on-write across Puma's forks:
+| Rails hello-world | Kino :threaded | Puma cluster (8 workers) | ratio |
+|-------------------|---------------:|-------------------------:|------:|
+| **PSS**           |      **92 MB** |               **389 MB** |  ~4×  |
 "+ lanes" is the experimental per-worker-queue dispatcher (`lanes true`).
 It posts the fastest plaintext/10k of any configuration here. Details:
 [doc/benchmarks.md](doc/benchmarks.md#lane-dispatch-experimental-lanes-true).
 ¹ Stock settings, no tuning. Ractor mode beats the fork cluster on pure
-CPU by +32% (+25% with lanes). Threaded mode shows the GVL ceiling that
+CPU by +34% (+22% with lanes). Threaded mode shows the GVL ceiling that
 every single-process Ruby server hits. The old CPU-tuning recipe is
 retired: its `threads 1` half **is** the default now, and its
 `tokio_threads 1` half costs −12% on real hardware; see
@@ -102,7 +120,7 @@ retired: its `threads 1` half **is** the default now, and its
 ² Wait-bound throughput is slots ÷ wait, and the default columns bring
 8 single-thread workers against the cluster's 24 threads. Kino slots
 are threads, not processes—when your app waits a lot, raise `workers`.
-The `workers 32` column is that tuning: **+27% over the cluster on /io
+The `workers 32` column is that tuning: **+25% over the cluster on /io
 (+34% via `Kino.sleep`)** while still ahead of it on pure CPU, all in
 one small process. The cost is the CPU-light rows (32 ractors
 oversubscribe 8 cores); pick the topology your app's wait profile
@@ -111,7 +129,7 @@ needs. See
 ³ With `MALLOC_ARENA_MAX=2` (the standard Ruby deployment setting;
 Heroku's default). Without it, 24 threads churning 10 KB responses
-through one glibc heap balloon to ~600 MB—an arena-fragmentation
+through one glibc heap balloon to ~670 MB—an arena-fragmentation
 footgun, not a leak, and ractor mode sidesteps it. See
 [doc/benchmarks.md](doc/benchmarks.md#memory-under-load-and-the-glibc-arena-footgun).
@@ -121,14 +139,30 @@ doc):
 | endpoint   | Kino :ractor (8×3) | Puma + ractor wrapper | Falcon + ractor wrapper |
 |------------|-------------------:|----------------------:|------------------------:|
-| /plaintext |        **199,032** |                19,532 |                 100,342 |
-| /cpu (fib) |         **68,238** |                17,323 |                  48,561 |
-| /io (5 ms) |          **4,531** |                 1,452 |                   1,544 |
-In short: ractor mode beats fork-level CPU parallelism (**5.7×** Kino's
-own GVL-bound threaded mode, +32% over the cluster) in one process, at
-about 1/16th of the cluster's memory. Every Kino mode is 1.5-2.1×
-ahead of the cluster on I/O-light endpoints. The macOS numbers
+| /plaintext |        **193,826** |                19,480 |                  99,776 |
+| /cpu (fib) |         **68,061** |                17,755 |                  48,721 |
+| /io (5 ms) |          **4,530** |                 1,454 |                   1,549 |
+### Rails
+Rails is not Ractor-shareable today, so Kino serves it in `:threaded`
+fallback — one GVL-bound process. On the same box (`examples/rails-hello`,
+edge Rails, production, 8×5):
+| Rails hello-world            |  req/s | memory (PSS) |
+|------------------------------|-------:|-------------:|
+| Kino :threaded (one process) |  2,637 |    **92 MB** |
+| Puma cluster (8 workers)     | 12,138 |       389 MB |
+The honest trade-off: Puma's fork cluster uses all 8 cores, so it serves
+~4.6× the throughput — at ~4× the memory. Ractor-mode Rails would close
+the throughput gap at one-process memory cost; the upstream blockers are
+tracked in [doc/rails-on-ractors.md](doc/rails-on-ractors.md).
+In short: on the tiny synthetic app, ractor mode beats fork-level CPU parallelism (**5.8×** Kino's
+own GVL-bound threaded mode, +34% over the cluster) in one process, at
+about 1/7th of the cluster's memory by PSS (~4× on a real Rails app).
+Every Kino mode is 1.5-2.1× ahead of the cluster on I/O-light endpoints. The macOS numbers
 (secondary; everything there hits the loopback ceiling) and the
 YJIT × Ractors gotcha are in [doc/benchmarks.md](doc/benchmarks.md).
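
Footnote ² in the README diff above says wait-bound throughput is slots ÷ wait. As a rough cross-check, the sketch below computes that ceiling for a 5 ms wait and compares it with the /io figures from the new headline table. The helper name `ceiling_rps` is hypothetical, and the slot counts assume the topologies stated in the README text (Kino 8×1 in ractor mode, Puma 8×3, and the `workers 32` column at one thread each).

```ruby
# Upper bound on wait-bound throughput: each slot completes at most
# 1000 / wait_ms requests per second. (Hypothetical helper name.)
def ceiling_rps(slots, wait_ms)
  slots * 1000.0 / wait_ms
end

# Measured figures copied from the /io (5 ms) row of the 0.1.2 README table.
MEASURED = {
  "Kino :ractor, 8 workers x 1 thread"  => { slots: 8,  rps: 1_552 },
  "Puma cluster, 8 workers x 3 threads" => { slots: 24, rps: 4_693 },
  "Kino :ractor, workers 32"            => { slots: 32, rps: 5_888 },
}.freeze

MEASURED.each do |name, row|
  ceiling = ceiling_rps(row[:slots], 5)
  share = (100.0 * row[:rps] / ceiling).round
  puts format("%-37s ceiling %5d req/s, measured %5d (%d%%)",
              name, ceiling, row[:rps], share)
end
```

Each measured figure lands within about 10% of its ceiling, which is consistent with the README's point that slot count, not the dispatch model, is the lever on wait-bound endpoints.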

data/doc/benchmarks.md CHANGED

@@ -34,10 +34,21 @@ the deployment most apps run today.
 - The headline tables also carry an io-tuned column (`workers 32,
   threads 1`)—not a default, labeled as such—because the /io rows are
   a slot-count story (see below).
-- The dataset spans three identical boxes: the original measurements,
-  a full re-measure at the 0.1.1 defaults, and the final headline
-  sweep. Equal-config numbers reproduced across boxes within ~1-2%
-  throughout.
+- The dataset spans four identical c7a.2xlarge boxes: the original
+  measurements, a re-measure at the 0.1.1 defaults, the headline sweep,
+  and a final full re-validation (every table re-run from scratch).
+  Equal-config throughput reproduced across boxes within ~1-2%.
+- **Memory is reported as PSS (proportional set size), not RSS.** A Puma
+  cluster forks N workers that share the Ruby VM and gem code
+  copy-on-write; summing each worker's RSS counts those shared pages up
+  to N times and overstates the cluster's real footprint. PSS divides
+  every shared page across the processes mapping it, so it reflects the
+  unique physical memory the cluster occupies—the only fair basis for
+  comparing one process against a fork-per-core cluster. We read it from
+  `/proc/<pid>/smaps_rollup` over the whole process tree, cross-checked
+  against `ps` (RSS) and `smem` (PSS). Kino serves from one process, so
+  its RSS ≈ PSS; the correction only moves Puma. (`bench/studies.sh`
+  reports both columns.)
 - Follow-up studies (`bench/studies.sh`): CPU tuning, topology sweep,
   /io worker scaling, logging costs, and memory—run in the same session
   as the headline tables.
@@ -54,28 +65,31 @@ the deployment most apps run today.
 ## Reading the headline tables
+These tables all run the **tiny synthetic Ractor-shareable app**. The real
+Rails app is not Ractor-shareable and runs only in threaded fallback—a
+separate story with separate numbers, in [its own section](#rails).
 - **Plaintext/10k**: Kino's tokio front-end clears the fork cluster by
-  1.5-2.1× (lanes plaintext 244,340 vs Puma 118,190 = 2.07×; the
-  smallest margin is threaded /10k at 1.49×). At the old 3-thread
+  1.5-2.1× (lanes plaintext 250,222 vs Puma 118,176 = 2.12×; the
+  smallest margin is threaded /10k at 1.50×). At the old 3-thread
   topology the cross-ractor handoff showed up as ractor trailing
   threaded on trivial handlers; the 1-thread default reverses that
-  (ractor 230k vs threaded 218k) and lanes widen it (244k).
-- **CPU (recursive fib)**: ractor mode does **5.7× its own GVL-bound
-  threaded mode** (76,922 vs 13,499)—that's the entire point of
-  ractors—and beats the fork cluster outright: +32% with stock
-  defaults (+25% with lanes, 73,136 vs 58,337). Even the io-tuned
-  `workers 32` topology stays ahead of the cluster on CPU (62,406).
-- **Memory**: after serving the full endpoint battery, Kino held
-  **80 MB** (ractor or lanes, default topology) where the 8-worker
-  cluster held **1,256 MB**—a fork per core pays one full copy of the
-  VM, the app, and its YJIT-compiled code per worker. On the Rails
-  hello-world: Kino 97 MB vs cluster 797 MB. Threaded mode under the
-  same battery needs a malloc note; see
-  [Memory under load](#memory-under-load-and-the-glibc-arena-footgun).
+  (ractor 230k vs threaded 217k) and lanes widen it (250k).
+- **CPU (recursive fib)**: ractor mode does **5.8× its own GVL-bound
+  threaded mode** (77,999 vs 13,429)—that's the entire point of
+  ractors—and beats the fork cluster outright: +34% with stock
+  defaults (+22% with lanes, 70,885 vs 58,006). Even the io-tuned
+  `workers 32` topology stays ahead of the cluster on CPU (66,100).
+- **Memory (PSS)**: after the full endpoint battery, the tiny app costs
+  Kino **148 MB** in ractor mode (107 MB threaded) against the 8-worker
+  cluster's **1,068 MB**—~7-10× lighter, because a trivial app is almost
+  all private per-worker heap that copy-on-write can't share. The real
+  Rails app narrows this to ~4× (its framework *is* shared CoW); both are
+  in [Memory under load](#memory-under-load-and-the-glibc-arena-footgun).
 - **I/O (5 ms wait)**: all dispatch models tie within ~4% at equal slot
   counts, so the default columns show the ractor modes behind on /io
   (8 slots vs the cluster's 24), and the `workers 32` column shows the
-  same engine winning (+27%, +34% via `Kino.sleep`) once it has more
+  same engine winning (+25%, +34% via `Kino.sleep`) once it has more
   slots than the cluster. The lever is slot count, and Kino slots are
   cheap: see [below](#why-io-lags-in-ractor-mode-on-linux).
@@ -87,14 +101,14 @@ run:
 | config | /cpu req/s |
 |---|---:|
-| Puma cluster (reference) | 58,505 |
-| Kino `workers 8, threads 3` (the default before 0.1.1) | 67,111 |
-| Kino `workers 8, threads 1, tokio_threads 1` (the old recipe) | 68,638 |
-| Kino `workers 8, threads 1`, tokio auto (**the default**) | **78,175** |
+| Puma cluster (reference) | 58,189 |
+| Kino `workers 8, threads 3` (the default before 0.1.1) | 67,394 |
+| Kino `workers 8, threads 1, tokio_threads 1` (the old recipe) | 68,600 |
+| Kino `workers 8, threads 1`, tokio auto (**the default**) | **77,999** |
 The `threads 1` half of the old recipe became the default; the
 `tokio_threads 1` half now *costs* −12% on /cpu (and still costs
-plaintext: 107,743 vs 230k). Don't pin tokio threads. **The recipe's
+plaintext: 108,523 vs 230k). Don't pin tokio threads. **The recipe's
 history is an environment story**: in the earlier Docker-on-Mac runs it
 was worth +12%, because tokio threads and wake churn competed for
 oversubscribed virtualized cores; on dedicated cores the same pin
@@ -118,8 +132,8 @@ Parallelism for CPU-bound Ruby comes from ractors or forks, nothing else.
 ## Why /io lags in ractor mode on Linux
-On bare metal the gap is small at equal slot counts: ractor /io 4,531
-vs threaded 4,725 (−4%, both at 8×3). In Docker it was −18%, and a
+On bare metal the gap is small at equal slot counts: ractor /io 4,530
+vs threaded 4,709 (−4%, both at 8×3). In Docker it was −18%, and a
 pure-Ruby probe there measured
 `sleep(0.005)` waking +2.3-2.8 ms late inside ractors vs +1.8 ms on the
 main ractor—non-main-ractor timer wakeups are coarser in Ruby 4.0, but
@@ -130,16 +144,16 @@ A follow-up probe showed `IO.select`-style waits are tighter than
 **Mitigation 1—`Kino.sleep`:** releases the GVL and waits on the OS
 clock directly (chunked, so `Thread#kill`/shutdown stay responsive). The
 `/io_native` endpoint (same 5 ms wait via `Kino.sleep` when available)
-erases the remaining ractor gap on this box: 4,721 vs 4,531 plain sleep.
+erases the remaining ractor gap on this box: 4,721 vs 4,530 plain sleep.
 **Mitigation 2—add workers; they're nearly free.** The headline tables
-show default ractor-mode /io at 1,548: that's 8 slots (the 1-thread
+show default ractor-mode /io at 1,552: that's 8 slots (the 1-thread
 default) against the cluster's 24, because wait-bound throughput is
 simply `slots ÷ effective wait`. Kino's slots cost ~a thread each, not
-a forked process: the `workers 32, threads 1` column measured **5,935
-/io (+27% over the 24-thread cluster's 4,687) and 6,289 /io_native
-(+34%)**, still one small process, and still +7% ahead of the cluster
-on pure CPU. Its cost is the CPU-light rows (156k plaintext vs 230k at
+a forked process: the `workers 32, threads 1` column measured **5,888
+/io (+25% over the 24-thread cluster's 4,693) and 6,274 /io_native
+(+34%)**, still one small process, and still +14% ahead of the cluster
+on pure CPU. Its cost is the CPU-light rows (183k plaintext vs 230k at
 8×1: 32 ractors oversubscribe 8 cores). A fork cluster buying the same
 32 slots pays for them in full copies of the app; Kino pays in
 scheduler churn only where the cores are already saturated.
@@ -154,9 +168,9 @@ what the Rack-level hop itself costs (c7a.2xlarge, same session):
 | endpoint   | Kino :ractor (8×3) | Puma + wrapper | Falcon + wrapper |
 |------------|-------------------:|---------------:|-----------------:|
-| /plaintext |            199,032 |         19,532 |          100,342 |
-| /cpu (fib) |             68,238 |         17,323 |           48,561 |
-| /io (5 ms) |              4,531 |          1,452 |            1,544 |
+| /plaintext |            193,826 |         19,480 |           99,776 |
+| /cpu (fib) |             68,061 |         17,755 |           48,721 |
+| /io (5 ms) |              4,530 |          1,454 |            1,549 |
 Inside the Rack contract, the wrapper must reduce the env to a shareable
 subset, copy it to the worker ractor, copy the response back, and hold a
@@ -171,18 +185,26 @@ the Rack contract—which is the experiment this gem exists to run.
 ## Rails
-The example app (`examples/rails-hello`, edge Rails, production mode,
-8 workers × 5 threads) on the same box:
-| | req/s | RSS under load |
-|---|---:|---:|
-| Kino `:threaded` (one process) | 2,298 | **97 MB** |
-| Puma cluster (8 workers) | 11,923 | 797 MB |
-This is the honest version of the Rails story: in threaded mode Kino is
-one GVL-bound process, so the fork cluster outruns it ~5× by using all
-8 cores—at 8× the memory. Rails-on-Ractors is interesting precisely
-because it would close that throughput gap at the one-process memory
+Rails is **not Ractor-shareable**, so Kino can only serve it in
+`:threaded` fallback—this whole section is one GVL-bound Kino process,
+never ractor mode. The example app (`examples/rails-hello`, edge Rails,
+production mode, 8 workers × 5 threads) on the same box:
+| | req/s | RSS | PSS |
+|---|---:|---:|---:|
+| Kino `:threaded` (one process) |  2,637 |  97 MB | **92 MB** |
+| Puma cluster (8 workers) | 12,138 | 794 MB | **389 MB** |
+This is the honest version of the Rails story. In threaded mode Kino is
+one GVL-bound process, so the fork cluster outruns it ~4.6× by using all
+8 cores—at ~4× the memory by PSS. The metric matters here: Puma's RSS
+(794 MB) counts the shared Rails framework once per worker; PSS (389 MB)
+counts it once, and that is the fair figure (the README's headline used
+to read 8× off RSS). Preloading barely moves it—389 MB with
+`preload_app!` vs 400 MB without—because Ruby's GC dirties most heap
+pages, breaking copy-on-write, so even a preloaded cluster keeps a
+private heap per worker. Rails-on-Ractors is interesting precisely
+because it would close the throughput gap at the one-process memory
 cost; the upstream blockers are documented in
 [rails-on-ractors.md](rails-on-ractors.md).
@@ -217,29 +239,59 @@ more reason to prefer mimalloc in dlopen'd extensions.
 ## Memory under load (and the glibc arena footgun)
-RSS after serving the full endpoint battery (8 s each of /plaintext,
-/10k, /cpu, /io—a "warmed production process", not a fresh boot, which
-measures 26-27 MB for every Kino mode):
+All figures are **PSS** (see [Methodology](#methodology)) after the full
+endpoint battery (8 s each of /plaintext, /10k, /cpu, /io—a "warmed
+production process", not a fresh boot, which measures ~26 MB for every
+Kino mode). RSS is shown alongside so the copy-on-write correction is
+visible.
-| config | RSS loaded |
-|---|---:|
-| Kino :ractor 8×1 (default) | **80 MB** |
-| Kino lanes 8×1 | **80 MB** |
-| Kino :ractor 8×3 | 115 MB |
-| Kino :threaded 8×3 | 612 MB¹ |
-| Puma cluster 8×3 | 1,256 MB |
+### The tiny synthetic app
+| config | RSS | PSS |
+|---|---:|---:|
+| Kino :ractor 8×1 (default) | 151 | **148** |
+| Kino lanes 8×1 | 137 | **135** |
+| Kino :ractor 8×3 | 171 | **169** |
+| Kino :threaded 8×3 (`MALLOC_ARENA_MAX=2`) | 109 | **107** |
+| Kino :threaded 8×3 (no arena cap) | 668 | **666**¹ |
+| Puma cluster 8×3 | 1,213 | **1,068** |
+The tiny app is ~7× lighter than the cluster in ractor mode, ~10× in
+arena-capped threaded mode. RSS ≈ PSS for every Kino row (one process,
+nothing to share) and within ~12% for Puma here: a trivial app has almost
+no shared state, so Puma's footprint is ~1,051 MB of *private* per-worker
+heap plus only ~18 MB shared (which RSS counts 8×). This is the case where
+copy-on-write does **not** rescue the cluster—there is nothing to
+share—so the RSS and PSS numbers nearly agree. (The old "80 MB / 15×"
+figure was a lighter, plaintext-only load; the honest full-battery ractor
+figure is ~148 MB, i.e. ~7×.)
 ¹ Not a leak: glibc malloc arena bloat. One 8-second /10k round takes
-threaded mode from 69 MB to ~800 MB and it never returns—24 threads
+threaded mode from ~70 MB to ~670 MB and it never returns—24 threads
 churning 10 KB strings through one process heap is the textbook glibc
 arena-fragmentation case (the reason Rails ops set `MALLOC_ARENA_MAX=2`;
-Heroku ships that default). With `MALLOC_ARENA_MAX=2` the same battery
-ends at **151 MB** with throughput unchanged (165,993 vs 157,177 /10k,
-if anything faster). Ractor mode sidesteps the worst of it without any
-env tweak—objects live in per-ractor heaps, and repeated runs landed at
-80-124 MB regardless of arena settings. Puma's 1,256 MB barely moves
-under the cap (1,237 MB): its cost is eight full copies of the warmed
-VM, not arenas.
+Heroku ships that default). With the cap the same battery ends at 107 MB
+PSS, throughput unchanged. Ractor mode sidesteps the worst of it without
+any env tweak—objects live in per-ractor heaps.
+### Rails (threaded fallback)
+Here copy-on-write **does** matter, which is exactly why PSS is mandatory:
+| config | RSS | PSS |
+|---|---:|---:|
+| Kino :threaded (one process) |  97 |  **92** |
+| Puma cluster 8×3 (preload) | 794 | **389** |
+Puma serves the same Rails framework from 8 forks that share it
+copy-on-write; RSS counts that shared framework once per worker (794 MB),
+PSS counts it once (389 MB). The fair ratio is **~4×**, not the ~8× a
+naive RSS sum reports—this is the correction that prompted the whole
+re-measure. Preload barely helps (389 vs 400 MB without): Ruby's GC
+dirties most heap pages, breaking copy-on-write, so even a preloaded
+cluster keeps a large private heap per worker. That is why "CoW should
+make a fork cluster nearly free" is only half true—it shares the code,
+not the live object heap.
 ## Run-to-run variance (a.k.a. "is this a regression?")
@@ -249,26 +301,27 @@ Docker-on-Mac environment swung ±10% on /cpu between sessions with the
 VM's mood; the dedicated c7a box is far steadier (same-session repeats
 land within ~1-2%), but the discipline stays—every comparative claim in
 these docs comes from same-session pairs. Cross-box repeatability got
-its own test: the dataset was measured across three identical
+its own test: the dataset was measured across four identical
 c7a.2xlarge boxes, and equal-config throughput numbers matched within
-~1-2% (loaded-RSS measurements swing far more with heap-growth
-timing—treat memory numbers as ballpark). The same discipline caught one fluke: a sweep round once
-posted threaded plaintext 28% low; interleaved re-runs minutes later
-put it back—suspect cells get re-measured, not published.
+~1-2% (loaded-memory measurements swing more with heap-growth
+timing—treat them as ballpark). The same discipline caught the recurring
+threaded-plaintext fluke twice: once 28% low on an earlier box, and again
+in the final re-validation (170k, where three interleaved re-runs put it
+back at 217k). Suspect cells get re-measured, not published.
 ## Topology notes
 Measured on c7a.2xlarge, plaintext, ractor mode, same session (three
-interleaved rounds, medians): `8×3` (workers×threads) = 200,048, `8×1`
-= **232,173 (+16%)**, `16×1` = 214,570. Threads inside one ractor share
+interleaved rounds, medians): `8×3` (workers×threads) = 198,478, `8×1`
+= **229,966 (+16%)**, `16×1` = 214,391. Threads inside one ractor share
 its lock, so every request handled by a 3-thread ractor pays a lock
 handoff that a 1-thread ractor doesn't (`perf` in the earlier Docker
 sessions attributed ~10% of cycles to
 `rb_native_mutex_unlock`/`thread_sched_wakeup_next_thread` at 8×3; the
 gain reproduced on two separate boxes, +16-17% each). **This is why
-`threads` defaults to 1 in ractor mode since 0.1.1** (/cpu gains +18%
-the same way: 80,409 vs 68,132). The trade-off is /io at low worker
-counts: 1,534 at 8×1 vs 4,486 at 8×3—threads-per-ractor exist for
+`threads` defaults to 1 in ractor mode since 0.1.1** (/cpu gains +16%
+the same way: 77,999 vs 67,394). The trade-off is /io at low worker
+counts: 1,552 at 8×1 vs 4,530 at 8×3—threads-per-ractor exist for
 handlers that block on I/O. If yours wait a lot, raise `workers`
 instead (32×1 beats even the 24-slot cluster, see above); slots are
 cheap. (16×1 being worse than 8×1 on plaintext also says the shared
@@ -306,10 +359,10 @@ Same-session A/B on c7a.2xlarge, ractor mode at the default topology
 | endpoint | shared queue | lanes | delta |
 |----------|-------------:|------:|------:|
-| /plaintext | 230,547 | **250,395** | **+9%** |
-| /10k | 182,788 | 198,301 | +8% |
-| /cpu | **78,175** | 73,345 | −6% |
-| /io | 1,550 | 1,550 | flat |
+| /plaintext | 229,534 | **250,222** | **+9%** |
+| /10k | 178,083 | 189,862 | +7% |
+| /cpu | **77,999** | 70,885 | −9% |
+| /io | 1,552 | 1,551 | flat |
 Lanes' margin shrank with the move to 1-thread workers (at the old 8×3
 it was +21% plaintext: 240,193 vs 199,032 in the same session)—most of
@@ -331,11 +384,11 @@ typical costs):
 | case (8×3, same session) | req/s |
 |---|---:|
-| threaded, no logging | 217,377 |
-| threaded, `log_requests true` (native access log) | 193,493 (−11%) |
-| ractor, access log off / on | 200,478 / 184,357 (−8%) |
-| app logs 1 line/req via shared `::Logger` (file) | **62,917** |
-| app logs 1 line/req via `Kino::Logger` (file) | **149,237 (2.4×)** |
+| threaded, no logging | 219,168 |
+| threaded, `log_requests true` (native access log) | 193,998 (−11%) |
+| ractor, access log off / on | 197,596 / 181,050 (−8%) |
+| app logs 1 line/req via shared `::Logger` (file) | **62,961** |
+| app logs 1 line/req via `Kino::Logger` (file) | **149,519 (2.4×)** |
 The shared-`::Logger` cost is the mutex: 24 worker threads serialize
 through one lock plus a write syscall per line. `Kino::Logger` hands the
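
The methodology change in this file reports memory as PSS read from `/proc/<pid>/smaps_rollup` and summed over the process tree. A minimal sketch of that measurement follows; it is not the project's `bench/studies.sh`, the function names are hypothetical, and the caller is assumed to supply the list of PIDs. "MB" here means the kernel's kB figure divided by 1024.

```ruby
# Extract the "Pss:" total (in kB) from the text of /proc/<pid>/smaps_rollup.
# The line anchor keeps Pss_Anon:, Pss_File: and similar breakdown lines out.
def pss_kb(rollup_text)
  match = rollup_text.match(/^Pss:\s+(\d+)\s+kB/)
  match ? match[1].to_i : 0
end

# Sum PSS over a set of processes (for example a Puma master and its workers)
# and convert to MB. Discovering the process tree is left to the caller.
# `reader` is injectable so the parsing can be exercised off Linux.
def tree_pss_mb(pids, reader: ->(pid) { File.read("/proc/#{pid}/smaps_rollup") })
  total_kb = pids.sum { |pid| pss_kb(reader.call(pid)) }
  (total_kb / 1024.0).round(1)
end
```

On a Linux host, `tree_pss_mb([Process.pid])` reports the current process. Because PSS splits each shared page among the processes that map it, summing it across a fork cluster counts shared pages once in total, whereas summing RSS counts them once per worker.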

data/doc/rails-on-ractors.md CHANGED

@@ -7,10 +7,11 @@
 Rails 8.2.0.alpha boots and serves with `mode :threaded` (see the
 example's `kino.rb`; just `bundle exec kino` in that directory). Measured
 on the hello-world (c7a.2xlarge, 8 cores, production mode, 8×5):
-~2.3k req/s in 97 MB, single process. The 8-worker Puma cluster reaches
-~11.9k in 797 MB by parallelizing across forks—Rails-on-Ractors is
-interesting precisely because it could offer that ~5× parallelism at
-~1/8th of the memory.
+~2.6k req/s in 92 MB PSS, single process. The 8-worker Puma cluster
+reaches ~12.1k by parallelizing across forks, at 389 MB PSS (794 MB RSS,
+but its forks share the framework copy-on-write, so PSS is the fair
+figure)—Rails-on-Ractors is interesting precisely because it could offer
+that ~4.6× parallelism at ~1/4th of the memory.
 Pair it with production-style Rails settings: eager load, no code
 reloading, database pool ≥ workers × threads, logger to stdout or another

data/doc/why-kino.md CHANGED

@@ -16,7 +16,7 @@ deep-copies it, and sockets cannot cross at all.
 We measured what the "obvious" workaround costs. The ractor-pool wrapper
 experiment (reduce the env to a shareable subset, copy it to a worker
 over a `Ractor::Port`, copy the response back) runs at **19.5k req/s
-where Kino does 199k** on the same hardware—see the
+where Kino does 194k** on the same hardware—see the
 [wrapper comparison](benchmarks.md#the-ractor-pool-wrapper-comparison).
 Copying at the Rack layer eats the entire ractor dividend. Dispatch has
 to live below the Rack contract.
@@ -78,12 +78,12 @@ objects; Rust sees one queue and one registry.
 With the dispatch cost eliminated, Ractors deliver the thing they were
 built for—a lock per ractor instead of one GVL—and each layer is
-visible in the [benchmarks](benchmarks.md): `/cpu` at 76.9k req/s in
-ractor mode vs **13.5k threaded (5.7×, the GVL ceiling)**, beating the
-fork cluster's CPU parallelism by +32% while holding **80 MB against
-the cluster's 1,256 MB**, because eight ractors share one VM, one Rust
-front-end, one queue, and one JIT, where eight forks each pay full
-price.
+visible in the [benchmarks](benchmarks.md): `/cpu` at 78.0k req/s in
+ractor mode vs **13.4k threaded (5.8×, the GVL ceiling)**, beating the
+fork cluster's CPU parallelism by +34% while holding **~148 MB against
+the cluster's ~1,068 MB** (by PSS, on the bench app), because eight
+ractors share one VM, one Rust front-end, one queue, and one JIT, where
+eight forks each pay full price.
 The cleanest proof of the design is the threaded fallback itself: it
 reuses ~95% of the same machinery, because the Rust core is

data/lib/kino/configuration.rb CHANGED

@@ -17,6 +17,8 @@ module Kino
       queue_depth: 1024,
       queue_timeout: 5.0,
       request_timeout: nil,
+      max_connections: nil, # nil = derive from the open-file limit
+      max_body_size: 50 * 1024 * 1024, # 50 MB; nil/0 = unlimited
       batch: 1,
       lanes: false,
       log_requests: false,
@@ -160,6 +162,14 @@ module Kino
       # Seconds the app gets before the client receives a 504; nil = off.
       def request_timeout(seconds) = @config.set(:request_timeout, seconds && Float(seconds))
+      # Max connections served at once; beyond it, new connections wait in
+      # the kernel backlog. Defaults to most of the open-file limit.
+      def max_connections(count) = @config.set(:max_connections, Integer(count))
+      # Max request-body bytes before a 413; nil disables (delegate to a
+      # fronting proxy). Default 50 MB.
+      def max_body_size(bytes) = @config.set(:max_body_size, bytes && Integer(bytes))
       # Requests a worker may grab per queue visit (default 1).
       def batch(count) = @config.set(:batch, Integer(count))
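
Note on the two new setters: `max_body_size` guards its coercion with `bytes &&`, so `nil` passes through, while `max_connections` calls `Integer(count)` directly. A minimal sketch of that difference in plain Ruby, assuming nothing about Kino beyond the lines in this hunk and the `|| 0` normalization in the `server.rb` hunk of this same diff; the helper names are hypothetical and are not part of Kino's API:

```ruby
# Hypothetical stand-ins that mirror the coercion expressions in the diff.
# They are not Kino methods; they only isolate the Integer() behaviour.

# Mirrors `Integer(count)`: no nil guard, so nil raises TypeError.
def coerce_max_connections(count)
  Integer(count)
end

# Mirrors `bytes && Integer(bytes)`: nil short-circuits and stays nil.
def coerce_max_body_size(bytes)
  bytes && Integer(bytes)
end

# Mirrors `Integer(settings[:max_body_size] || 0)` on the server side:
# a nil setting becomes 0, which the config comment documents as unlimited.
def effective_max_body_size(setting)
  Integer(setting || 0)
end

p coerce_max_body_size("1024")       # numeric strings are accepted
p effective_max_body_size(nil)       # nil normalizes to 0

begin
  coerce_max_connections(nil)
rescue TypeError => e
  puts "max_connections(nil) would raise: #{e.class}"
end
```

If that reading is right, the default cap is reached by leaving `max_connections` unset rather than by passing `nil` explicitly.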

data/lib/kino/kino.so CHANGED

Binary file

data/lib/kino/server.rb CHANGED

@@ -50,6 +50,8 @@ module Kino
       @queue_depth = Integer(settings[:queue_depth])
       @queue_timeout_ms = (Float(settings[:queue_timeout]) * 1000).round
       @request_timeout_ms = settings[:request_timeout] ? (Float(settings[:request_timeout]) * 1000).round : 0
+      @max_connections = settings[:max_connections] ? Integer(settings[:max_connections]) : default_max_connections
+      @max_body_size = Integer(settings[:max_body_size] || 0)
       @batch = [Integer(settings[:batch]), 1].max
       @lanes = !!settings[:lanes]
       @log_requests = !!settings[:log_requests]
@@ -74,6 +76,8 @@ module Kino
         bind: @bind, port: @requested_port,
         queue_depth: @queue_depth, queue_timeout_ms: @queue_timeout_ms,
         request_timeout_ms: @request_timeout_ms,
+        max_connections: @max_connections,
+        max_body_size: @max_body_size,
         tokio_threads: @tokio_threads,
         tls_cert: @tls&.fetch(:cert), tls_key: @tls&.fetch(:key),
         lanes: @lanes, log_requests: @log_requests
@@ -214,6 +218,18 @@ module Kino
       Process.clock_gettime(Process::CLOCK_MONOTONIC)
     end
+    # Default connection cap: most of the process open-file limit. A
+    # connection flood's failure mode is descriptor exhaustion, and in
+    # :ractor/:threaded mode the app's own sockets and files share this
+    # process's table, so leave headroom. Scales with `ulimit -n`; raise the
+    # OS limit (or set max_connections) to allow more.
+    def default_max_connections
+      soft, = Process.getrlimit(Process::RLIMIT_NOFILE)
+      return 65_536 if soft == Process::RLIM_INFINITY
+      [soft * 8 / 10, 64].max
+    end
     def join_workers(deadline)
       if @supervisor
         @supervisor.shutdown([deadline - monotonic_now, 0].max)
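
Worked example of the default cap added in this hunk: 80% of the soft open-file limit, with a floor of 64 and a fixed 65,536 when the limit is unlimited. The sketch below restates that arithmetic as a pure function so it can be tried with different limits; `connection_cap` and its parameter are hypothetical names for illustration, not part of Kino.

```ruby
# Hypothetical pure version of the cap calculation shown in the diff.
# The soft limit is passed in so the arithmetic can be checked without
# changing the real process limit.
def connection_cap(soft_limit)
  # An unlimited soft limit falls back to a fixed cap.
  return 65_536 if soft_limit == Process::RLIM_INFINITY

  # Integer arithmetic: 80% of the limit, rounded down, never below 64.
  [soft_limit * 8 / 10, 64].max
end

# Read this process's own soft limit, as the diff's method does.
soft, _hard = Process.getrlimit(Process::RLIMIT_NOFILE)
puts "soft limit #{soft} -> cap #{connection_cap(soft)}"

puts connection_cap(1024)   # common default `ulimit -n`
puts connection_cap(50)     # tiny limit: the floor applies
```

With the common `ulimit -n` of 1024 this gives 819 connections, which leaves roughly 200 descriptors for the app's own sockets and files.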

data/lib/kino/templates/kino.rb.tt CHANGED

@@ -55,6 +55,17 @@
 # above your slowest legitimate endpoint.
 # request_timeout 30
+# Most connections to serve at once. Past this, new connections wait in
+# the kernel backlog instead of piling up until the server runs out of
+# file descriptors. Defaults to most of the open-file limit (ulimit -n),
+# so it scales with the OS limit and only bites under a flood.
+# max_connections 8192
+# Reject request bodies larger than this many bytes with a 413, so an
+# oversized or endless upload can't drive your app to run out of memory.
+# Set to nil to disable and let a fronting proxy handle it. Default: 50 MB.
+# max_body_size 50 * 1024 * 1024
 # How many requests a worker grabs from the line at once. Leave at 1
 # unless all your endpoints are uniformly fast.
 # batch 1

data/lib/kino/version.rb CHANGED

@@ -2,5 +2,5 @@
 module Kino
   # The gem version (single source of truth; ext/kino/Cargo.toml syncs).
-  VERSION = "0.1.1"
+  VERSION = "0.1.2"
 end

data/sig/kino.rbs CHANGED

@@ -92,6 +92,8 @@ module Kino
       def queue_depth: (int depth) -> untyped
       def queue_timeout: (Numeric seconds) -> untyped
       def request_timeout: (Numeric? seconds) -> untyped
+      def max_connections: (int count) -> untyped
+      def max_body_size: (int? bytes) -> untyped
       def batch: (int count) -> untyped
       def lanes: (boolish enabled) -> untyped
       def log_requests: (boolish enabled) -> untyped

metadata CHANGED

@@ -1,14 +1,14 @@
 --- !ruby/object:Gem::Specification
 name: kino
 version: !ruby/object:Gem::Version
-  version: 0.1.1
+  version: 0.1.2
 platform: aarch64-linux
 authors:
 - Yaroslav Markin
 autorequire:
 bindir: exe
 cert_chain: []
-date: 2026-06-11 00:00:00.000000000 Z
+date: 2026-06-21 00:00:00.000000000 Z
 dependencies:
 - !ruby/object:Gem::Dependency
   name: logger