kino 0.1.0-aarch64-linux → 0.1.1-aarch64-linux

checksums.yaml CHANGED
@@ -1,7 +1,7 @@
1
1
  ---
2
2
  SHA256:
3
- metadata.gz: 4f532332dd554c483d0524b766b2fefd4eca812aed81a570e9b78398b8ad96c7
4
- data.tar.gz: b063b09af92ac8dc5ba5cc50e665b773204320ba2f51140ff09c8b2dfe3ee7bf
3
+ metadata.gz: cc5a5ccaf0a6edecf4e34f4b75a7fd9968fe2d16ab89a395bb962b41cea8467d
4
+ data.tar.gz: 74eacc8c9e1d280e67ee2394b8ae4d0eb45f8366fd693c057ac7ad49c1e48368
5
5
  SHA512:
6
- metadata.gz: fccff01128173ee48574f5aa16c7e9137931b43fff5781c65df0bf3df3be7b84710f8a13abfc5e845ed7f78dbe371130469056da98a812c5be7ea567c0d1cc88
7
- data.tar.gz: 65159df0de0a77e072aa3d01c4e46ad2518bbf2bf0248d95fa441ccf7a47cd0aa2b1f7f7fbe7f3dc43040c7afefa0b304a4e4b5a89df6cc9a2d0d7ba8d9a6277
6
+ metadata.gz: ee0b6133a6ee0e8b3ce5a4f007cec3630c5ed666ae4b7af3cbdf0e6c188441448e6081882b27f626a7c9e109f05029744628f82992d6e8a68494e4db1985cb38
7
+ data.tar.gz: 42c73f8c2cd110d3430393695375927e2a6ae65a840d483681c0c270abba47f62b957d2f4b997aa23d4d0761dc3069d08052e3a79af51356d87158fdc8d5a221
data/CHANGELOG.md CHANGED
@@ -1,3 +1,12 @@
1
+ ## [0.1.1] - 2026-06-11
2
+ - Mode-dependent `threads` default: 1 per worker in :ractor mode (threads
3
+ inside a ractor share its lock and cost a per-request handoff; +16-18%
4
+ on fast handlers, measured on dedicated hardware), 3 in :threaded mode.
5
+ Explicit `threads` always wins; waiting-heavy ractor apps should raise
6
+ `workers` instead.
7
+ - `queue_timeout` default raised from 1 to 5 seconds: a brief burst now
8
+ waits out the spike instead of shedding 503s within a second.
9
+
1
10
  ## [0.1.0] - 2026-06-11
2
11
 
3
12
  Initial release.
data/README.md CHANGED
@@ -11,14 +11,15 @@ on every core in **one small process**. A **Rust** (tokio + hyper)
11
11
  front-end owns the network, parallel **Ractors** run your Rack 3 app,
12
12
  and a threaded fallback mode runs everything else, Rails included.
13
13
 
14
- * **Fast.** On a real 8-core server, every Kino mode is **1.4-2×** ahead
15
- of a same-topology Puma cluster on I/O-light endpoints. Ractor mode
16
- also wins on pure CPU. [Benchmarks](#benchmarks) below.
14
+ * **Fast.** On a real 8-core server, every Kino mode is **1.5-2×**
15
+ ahead of a Puma fork cluster on I/O-light endpoints. Ractor mode also
16
+ wins on pure CPU, **30%+**. [Benchmarks](#benchmarks) below.
17
17
  * **A fraction of the memory.** One process instead of a fork per core:
18
- about **1/19th of the Puma cluster's memory** under the same load, and
19
- about 1/8th when serving the Rails hello-world.
20
- * **Parallel without forking.** Ractor mode runs CPU work **5×** faster
21
- than Kino's own GVL-bound threaded mode, in the same small process.
18
+ about **15× less memory** than the Puma cluster under the same load,
19
+ and less when serving the Rails hello-world.
20
+ * **Parallel without forking.** Ractor mode runs CPU work **more than
21
+ 5× faster** than Kino's own GVL-bound threaded mode, in the same
22
+ small process.
22
23
  * **Production plumbing included.** Graceful drain, crash supervision
23
24
  and respawn, bounded queues with 503 backpressure, request timeouts,
24
25
  TLS (rustls), live stats, async access and app logging.
@@ -64,62 +65,72 @@ notes live in [doc/architecture.md](doc/architecture.md).
64
65
 
65
66
  Measured on a real server: AWS **c7a.2xlarge** (8-core AMD EPYC 9R14,
66
67
  16 GB, Amazon Linux 2023). This is a realistic app-server size. The same
67
- Ractor-shareable app runs on every server, Ruby 4.0.5 with YJIT, equal
68
- topology (8 workers × 3 threads; Puma forks, Kino stays in one process).
68
+ Ractor-shareable app runs on every server, Ruby 4.0.5 with YJIT, every
69
+ server at its defaults: Puma forks 8 workers × 3 threads, Kino stays in
70
+ one process (8 workers; 1 thread each in ractor modes, 3 in threaded).
69
71
  Numbers are req/s by wrk (8-second windows, 64 connections, same host).
70
72
  Methodology and the analysis behind every column:
71
73
  [doc/benchmarks.md](doc/benchmarks.md).
72
74
 
73
- | endpoint | Kino :ractor | + lanes | Kino :threaded | Puma (cluster) |
74
- |-------------|-------------:|--------:|---------------:|---------------:|
75
- | /plaintext | 201,472 | **241,501** | 218,348 | 117,838 |
76
- | /10k | 156,635 | **183,564** | 153,442 | 106,666 |
77
- | /cpu (fib) | 66,735¹| **70,373** | 13,298 | 58,207 |
78
- | /io (5 ms) | 4,527²| 4,530 | **4,715** | 4,691 |
79
- | /io_native | 4,714 | **4,717** | 4,709 | 4,692 |
75
+ | endpoint | Kino :ractor | + lanes | :ractor, `workers 32`² | Kino :threaded | Puma (cluster) |
76
+ |-------------|-------------:|--------:|-----------------------:|---------------:|---------------:|
77
+ | /plaintext | 229,565 | **244,340** | 156,118 | 217,619 | 118,190 |
78
+ | /10k | 179,119 | **188,258** | 134,457 | 157,147 | 105,588 |
79
+ | /cpu (fib) | **76,922**¹| 73,136 | 62,406 | 13,499 | 58,337 |
80
+ | /io (5 ms) | 1,548 | 1,548 | **5,935** | 4,715 | 4,687 |
81
+ | /io_native | 1,570 | 1,571 | **6,289** | 4,717 | 4,695 |
80
82
 
81
- Memory on the same box, RSS under load:
83
+ Memory on the same box, RSS after sustained load:
82
84
 
83
85
  | serving | Kino (one process) | Puma cluster (8 workers) |
84
86
  |-----------------------|-------------------:|-------------------------:|
85
- | bench app, :ractor | **57 MB** | 1,078 MB |
86
- | bench app, :threaded | **50 MB** | 1,078 MB |
87
+ | bench app, :ractor | **80 MB** | 1,256 MB |
88
+ | bench app, :threaded | **151 MB**³| 1,256 MB |
87
89
  | Rails hello-world | **97 MB** | 797 MB |
88
90
 
89
91
  "+ lanes" is the experimental per-worker-queue dispatcher (`lanes true`).
90
- It adds +20% over the shared queue on this hardware and makes ractor
91
- mode the fastest Kino configuration. Details:
92
+ It posts the fastest plaintext/10k of any configuration here. Details:
92
93
  [doc/benchmarks.md](doc/benchmarks.md#lane-dispatch-experimental-lanes-true).
93
94
 
94
95
  ¹ Stock settings, no tuning. Ractor mode beats the fork cluster on pure
95
- CPU by +15% (+21% with lanes). Threaded mode shows the GVL ceiling that
96
- every single-process Ruby server hits. The CPU-tuning recipe that our
97
- earlier Docker measurements needed makes no difference on real hardware
98
- (+0.5%); see [doc/benchmarks.md](doc/benchmarks.md#cpu-bound-tuning).
99
-
100
- ² The ractor timer tax is small on real hardware: −4% against threaded
101
- mode (it was −18% in Docker). Wait-bound throughput is slots ÷ wait, and
102
- Kino slots are threads, not processes. `workers 32, threads 1` measured
103
- **5,922 /io (+27% over the cluster) and 6,254 /io_native (+34%)**, still
104
- one small process. See
96
+ CPU by +32% (+25% with lanes). Threaded mode shows the GVL ceiling that
97
+ every single-process Ruby server hits. The old CPU-tuning recipe is
98
+ retired: its `threads 1` half **is** the default now, and its
99
+ `tokio_threads 1` half costs −12% on real hardware; see
100
+ [doc/benchmarks.md](doc/benchmarks.md#cpu-bound-tuning).
101
+
102
+ ² Wait-bound throughput is slots ÷ wait, and the default columns bring
103
+ 8 single-thread workers against the cluster's 24 threads. Kino slots
104
+ are threads, not processes—when your app waits a lot, raise `workers`.
105
+ The `workers 32` column is that tuning: **+27% over the cluster on /io
106
+ (+34% via `Kino.sleep`)** while still ahead of it on pure CPU, all in
107
+ one small process. The cost is the CPU-light rows (32 ractors
108
+ oversubscribe 8 cores); pick the topology your app's wait profile
109
+ needs. See
105
110
  [doc/benchmarks.md](doc/benchmarks.md#why-io-lags-in-ractor-mode-on-linux).
106
111
 
112
+ ³ With `MALLOC_ARENA_MAX=2` (the standard Ruby deployment setting;
113
+ Heroku's default). Without it, 24 threads churning 10 KB responses
114
+ through one glibc heap balloon to ~600 MB—an arena-fragmentation
115
+ footgun, not a leak, and ractor mode sidesteps it. See
116
+ [doc/benchmarks.md](doc/benchmarks.md#memory-under-load-and-the-glibc-arena-footgun).
117
+
107
118
  A common first idea is to keep your current server and wrap the app in
108
119
  a ractor pool. We measured that too (same box; the analysis is in the
109
120
  doc):
110
121
 
111
- | endpoint | Kino :ractor | Puma + ractor wrapper | Falcon + ractor wrapper |
112
- |------------|-------------:|----------------------:|------------------------:|
113
- | /plaintext | **201,472** | 19,425 | 100,624 |
114
- | /cpu (fib) | **66,735** | 17,106 | 49,083 |
115
- | /io (5 ms) | **4,527** | 1,447 | 1,549 |
116
-
117
- In short: ractor mode reaches fork-level CPU parallelism (**5×** Kino's
118
- own GVL-bound threaded mode) in one process, at about 1/19th of the
119
- cluster's memory. Every Kino mode is 1.4-2× ahead of the cluster on
120
- I/O-light endpoints. The macOS numbers (secondary; everything there hits
121
- the loopback ceiling) and the YJIT × Ractors gotcha are in
122
- [doc/benchmarks.md](doc/benchmarks.md).
122
+ | endpoint | Kino :ractor (8×3) | Puma + ractor wrapper | Falcon + ractor wrapper |
123
+ |------------|-------------------:|----------------------:|------------------------:|
124
+ | /plaintext | **199,032** | 19,532 | 100,342 |
125
+ | /cpu (fib) | **68,238** | 17,323 | 48,561 |
126
+ | /io (5 ms) | **4,531** | 1,452 | 1,544 |
127
+
128
+ In short: ractor mode beats fork-level CPU parallelism (**5.7×** Kino's
129
+ own GVL-bound threaded mode, +32% over the cluster) in one process, at
130
+ about 1/16th of the cluster's memory. Every Kino mode is 1.5-2.1×
131
+ ahead of the cluster on I/O-light endpoints. The macOS numbers
132
+ (secondary; everything there hits the loopback ceiling) and the
133
+ YJIT × Ractors gotcha are in [doc/benchmarks.md](doc/benchmarks.md).
123
134
 
124
135
  Reproduce: `bench/run.sh [seconds] [concurrency]` for the main table,
125
136
  `bench/studies.sh` for the follow-ups (CPU recipe, topology, scaling,
@@ -174,10 +185,10 @@ server = Kino::Server.new(app,
174
185
  bind: "127.0.0.1",
175
186
  port: 9292, # 0 = ephemeral; read back via server.port
176
187
  workers: Etc.nprocessors, # ractors (parallelism)
177
- threads: 3, # threads per ractor (I/O concurrency, Puma-style)
188
+ threads: 1, # per worker; ractor default 1, threaded default 3
178
189
  mode: :auto, # :auto | :ractor | :threaded
179
190
  queue_depth: 1024, # bounded queue; overflow → 503
180
- queue_timeout: 1.0, # seconds before 503 on a full queue
191
+ queue_timeout: 5.0, # seconds before 503 on a full queue
181
192
  request_timeout: nil, # seconds before a slow response becomes a 504 (nil = off)
182
193
  shutdown_timeout: 30, # drain deadline
183
194
  tls: { cert: "cert.pem", key: "key.pem" }, # file paths or inline PEM
@@ -210,7 +221,7 @@ kwargs and CLI flags > config file > defaults.
210
221
  # kino.rb
211
222
  port 9292
212
223
  workers 8
213
- threads 3
224
+ threads 1
214
225
  mode :ractor
215
226
  ```
216
227
 
@@ -266,7 +277,7 @@ cost):
266
277
 
267
278
  ```ruby
268
279
  server.stats
269
- # => {mode: :ractor, lanes: false, workers: 8, threads: 3, batch: 1,
280
+ # => {mode: :ractor, lanes: false, workers: 8, threads: 1, batch: 1,
270
281
  # respawns: 0, queued: 0, in_flight: 2, served: 1041, rejected: 0,
271
282
  # timeouts: 0}
272
283
  # plus lane_depths: [...] when lane dispatch is on
@@ -276,19 +287,20 @@ From the outside, `kill -USR1 <pid>` prints the same snapshot as one line
276
287
  (pair it with `pidfile` to find the pid):
277
288
 
278
289
  ```
279
- Kino stats: mode=:ractor lanes=false workers=8 threads=3 batch=1 respawns=0 queued=0 in_flight=2 served=1041 rejected=0 timeouts=0
290
+ Kino stats: mode=:ractor lanes=false workers=8 threads=1 batch=1 respawns=0 queued=0 in_flight=2 served=1041 rejected=0 timeouts=0
280
291
  ```
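For scripts that scrape the `USR1` line, a minimal parsing sketch follows. It assumes the line looks exactly like the sample above: a `Kino stats: ` prefix followed by space-separated `key=value` pairs with no spaces inside a value. The helper name `parse_kino_stats` is hypothetical and not part of Kino, and the sketch does not handle a `lane_depths` field if that field prints with embedded spaces.

```ruby
# Parse the one-line stats snapshot into a Hash of string keys and values.
# Assumption: pairs are separated by whitespace and values contain no spaces.
def parse_kino_stats(line)
  line.delete_prefix("Kino stats: ").split.to_h { |pair| pair.split("=", 2) }
end

line = "Kino stats: mode=:ractor lanes=false workers=8 threads=1 batch=1 " \
       "respawns=0 queued=0 in_flight=2 served=1041 rejected=0 timeouts=0"

stats = parse_kino_stats(line)
puts stats["mode"]             # values stay strings, e.g. ":ractor"
puts Integer(stats["served"])  # convert counters explicitly when needed
```

Values are left as strings on purpose, so the caller decides which fields to convert.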
281
292
 
282
293
  ## Logging
283
294
 
284
295
  With one log line per request, `Kino::Logger` sustained **2.4× the
285
- throughput of a shared `::Logger`** (151k vs 63k req/s on the benchmark
296
+ throughput of a shared `::Logger`** (149k vs 63k req/s on the benchmark
286
297
  box). There are two native pieces. Both write through a lock-free
287
298
  channel to a Rust flusher thread, so request threads never take a log
288
299
  mutex and never make a write syscall:
289
300
 
290
301
  - **Access log** (`log_requests true`): one line per request to stdout,
291
- including the 503s that never reach your app. On color terminals the
302
+ including the 503s that never reach your app. Recommended in
303
+ development; cheap enough for production. On color terminals the
292
304
  lines are tinted by status class: 2xx green, 3xx yellow, 4xx maroon,
293
305
  5xx bright red:
294
306
 
data/doc/benchmarks.md CHANGED
@@ -27,9 +27,18 @@ the deployment most apps run today.
27
27
  plaintext 26-37% lower while leaving Puma's number unchanged.
28
28
  - Identical app for every server (`bench/bench_app.rb`), Ractor-shareable
29
29
  so Kino's `:ractor` mode can run it unmodified.
30
- - Topology held equal: Puma 8 forked workers × 3 threads vs Kino
31
- 8 workers × 3 threads in one process.
32
- - Follow-up studies (`bench/studies.sh`): CPU recipe, topology sweep,
30
+ - Topology: each server at its shipped defaults—Puma 8 forked workers
31
+ × 3 threads (24 slots) vs Kino 8 workers in one process (× 1 thread
32
+ in ractor modes, the default since 0.1.1; × 3 threads in threaded
33
+ mode). Equal-topology numbers (Kino at 8×3) are in the studies below.
34
+ - The headline tables also carry an io-tuned column (`workers 32,
35
+ threads 1`)—not a default, labeled as such—because the /io rows are
36
+ a slot-count story (see below).
37
+ - The dataset spans three identical boxes: the original measurements,
38
+ a full re-measure at the 0.1.1 defaults, and the final headline
39
+ sweep. Equal-config numbers reproduced across boxes within ~1-2%
40
+ throughout.
41
+ - Follow-up studies (`bench/studies.sh`): CPU tuning, topology sweep,
33
42
  /io worker scaling, logging costs, and memory—run in the same session
34
43
  as the headline tables.
35
44
  - The harness waits for the port to be genuinely free between targets.
@@ -46,41 +55,51 @@ the deployment most apps run today.
46
55
  ## Reading the headline tables
47
56
 
48
57
  - **Plaintext/10k**: Kino's tokio front-end clears the fork cluster by
49
- 1.4-2× (lanes plaintext 241,501 vs Puma 117,838 = 2.05×; the smallest
50
- margin is threaded /10k at 1.44×). The cross-ractor handoff shows up
51
- as ractor (201k) trailing threaded (218k) on trivial handlers—nothing
52
- in them needs parallel Ruby—and lane dispatch reverses that (241k).
53
- - **CPU (recursive fib)**: ractor mode does **5× its own GVL-bound
54
- threaded mode** (66,735 vs 13,298)—that's the entire point of
55
- ractors—and beats the fork cluster outright: +15% with stock
56
- defaults, +21% with lanes (70,373 vs 58,207).
57
- - **Memory**: serving the same loaded bench app, Kino held **57 MB
58
- (ractor) / 50 MB (threaded)** where the 8-worker cluster held
59
- **1,078 MB**—a fork per core pays one full copy of the VM, the app,
60
- and its YJIT-compiled code per worker. On the Rails hello-world:
61
- Kino 97 MB vs cluster 797 MB.
58
+ 1.5-2.1× (lanes plaintext 244,340 vs Puma 118,190 = 2.07×; the
59
+ smallest margin is threaded /10k at 1.49×). At the old 3-thread
60
+ topology the cross-ractor handoff showed up as ractor trailing
61
+ threaded on trivial handlers; the 1-thread default reverses that
62
+ (ractor 230k vs threaded 218k) and lanes widen it (244k).
63
+ - **CPU (recursive fib)**: ractor mode does **5.7× its own GVL-bound
64
+ threaded mode** (76,922 vs 13,499)—that's the entire point of
65
+ ractors—and beats the fork cluster outright: +32% with stock
66
+ defaults (+25% with lanes, 73,136 vs 58,337). Even the io-tuned
67
+ `workers 32` topology stays ahead of the cluster on CPU (62,406).
68
+ - **Memory**: after serving the full endpoint battery, Kino held
69
+ **80 MB** (ractor or lanes, default topology) where the 8-worker
70
+ cluster held **1,256 MB**—a fork per core pays one full copy of the
71
+ VM, the app, and its YJIT-compiled code per worker. On the Rails
72
+ hello-world: Kino 97 MB vs cluster 797 MB. Threaded mode under the
73
+ same battery needs a malloc note; see
74
+ [Memory under load](#memory-under-load-and-the-glibc-arena-footgun).
62
75
  - **I/O (5 ms wait)**: all dispatch models tie within ~4% at equal slot
63
- counts; the lever that matters is slot count, see
64
- [below](#why-io-lags-in-ractor-mode-on-linux).
76
+ counts, so the default columns show the ractor modes behind on /io
77
+ (8 slots vs the cluster's 24), and the `workers 32` column shows the
78
+ same engine winning (+27%, +34% via `Kino.sleep`) once it has more
79
+ slots than the cluster. The lever is slot count, and Kino slots are
80
+ cheap: see [below](#why-io-lags-in-ractor-mode-on-linux).
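The ratios quoted in these bullets can be rechecked from the table cells. A minimal Ruby sketch: the helper names `times_faster` and `percent_gain` are illustrative and not part of Kino; only the req/s figures come from the tables.

```ruby
# Recompute headline ratios from the req/s figures in the tables.

# How many times faster `ours` is than `theirs`, rounded to `digits` places.
def times_faster(ours, theirs, digits = 2)
  (ours.to_f / theirs).round(digits)
end

# Relative gain of `ours` over `theirs`, as a whole-number percentage.
def percent_gain(ours, theirs)
  ((ours.to_f / theirs - 1) * 100).round
end

puts times_faster(244_340, 118_190)   # lanes /plaintext vs Puma    => 2.07
puts times_faster(157_147, 105_588)   # threaded /10k vs Puma       => 1.49
puts times_faster(76_922, 13_499, 1)  # ractor /cpu vs threaded     => 5.7
puts percent_gain(76_922, 58_337)     # ractor /cpu vs Puma         => 32
puts percent_gain(73_136, 58_337)     # lanes /cpu vs Puma          => 25
```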
65
81
 
66
82
  ## CPU-bound tuning
67
83
 
68
- On real hardware, Kino's stock defaults already lead the cluster on
69
- pure CPU—same-session studies run:
84
+ On real hardware, Kino's stock defaults lead the cluster on pure
85
+ CPU—and the old tuning recipe is now obsolete. Same-session studies
86
+ run:
70
87
 
71
88
  | config | /cpu req/s |
72
89
  |---|---:|
73
- | Puma cluster (reference) | 58,376 |
74
- | Kino `workers 8, threads 3`, tokio auto (default) | 68,257 |
75
- | Kino `workers 8, threads 1, tokio_threads 1` (recipe) | 68,629 |
76
-
77
- The tuned recipe is a wash (+0.5%)—and it still costs plaintext
78
- (112,815 vs ~200k) and /io (1,532, 8 slots): on this hardware there is
79
- no reason to use it. **This is a finding that changed with the
80
- environment**: in the earlier Docker-on-Mac runs the recipe was worth
81
- +12%, because tokio threads and wake churn competed for oversubscribed
82
- virtualized cores. If you deploy into a constrained/virtualized
83
- environment, the recipe may still pay; measure there.
90
+ | Puma cluster (reference) | 58,505 |
91
+ | Kino `workers 8, threads 3` (the default before 0.1.1) | 67,111 |
92
+ | Kino `workers 8, threads 1, tokio_threads 1` (the old recipe) | 68,638 |
93
+ | Kino `workers 8, threads 1`, tokio auto (**the default**) | **78,175** |
94
+
95
+ The `threads 1` half of the old recipe became the default; the
96
+ `tokio_threads 1` half now *costs* −12% on /cpu (and still costs
97
+ plaintext: 107,743 vs 230k). Don't pin tokio threads. **The recipe's
98
+ history is an environment story**: in the earlier Docker-on-Mac runs it
99
+ was worth +12%, because tokio threads and wake churn competed for
100
+ oversubscribed virtualized cores; on dedicated cores the same pin
101
+ starves the I/O front-end instead. If you deploy into a
102
+ constrained/virtualized environment, measure there.
84
103
 
85
104
  Two findings that survived the environment change:
86
105
 
@@ -99,8 +118,9 @@ Parallelism for CPU-bound Ruby comes from ractors or forks, nothing else.
99
118
 
100
119
  ## Why /io lags in ractor mode on Linux
101
120
 
102
- On bare metal the gap is small: ractor /io 4,527 vs threaded 4,715
103
- (−4%). In Docker it was −18%, and a pure-Ruby probe there measured
121
+ On bare metal the gap is small at equal slot counts: ractor /io 4,531
122
+ vs threaded 4,725 (−4%, both at 8×3). In Docker it was −18%, and a
123
+ pure-Ruby probe there measured
104
124
  `sleep(0.005)` waking +2.3-2.8 ms late inside ractors vs +1.8 ms on the
105
125
  main ractor—non-main-ractor timer wakeups are coarser in Ruby 4.0, but
106
126
  how much that costs depends heavily on the kernel/virtualization stack.
@@ -110,14 +130,19 @@ A follow-up probe showed `IO.select`-style waits are tighter than
110
130
  **Mitigation 1—`Kino.sleep`:** releases the GVL and waits on the OS
111
131
  clock directly (chunked, so `Thread#kill`/shutdown stay responsive). The
112
132
  `/io_native` endpoint (same 5 ms wait via `Kino.sleep` when available)
113
- erases the remaining ractor gap on this box: 4,714 vs 4,527 plain sleep.
114
-
115
- **Mitigation 2—add workers; they're nearly free.** Wait-bound
116
- throughput is simply `slots ÷ effective wait`, and Kino's slots cost ~a
117
- thread each, not a forked process: `workers 32, threads 1` measured
118
- **5,922 /io (+27% over the 24-thread cluster's 4,672) and 6,254
119
- /io_native (+34%)**, still one small process. A fork cluster buying the
120
- same 32 slots pays for them in full copies of the app.
133
+ erases the remaining ractor gap on this box: 4,721 vs 4,531 plain sleep.
134
+
135
+ **Mitigation 2—add workers; they're nearly free.** The headline tables
136
+ show default ractor-mode /io at 1,548: that's 8 slots (the 1-thread
137
+ default) against the cluster's 24, because wait-bound throughput is
138
+ simply `slots ÷ effective wait`. Kino's slots cost ~a thread each, not
139
+ a forked process: the `workers 32, threads 1` column measured **5,935
140
+ /io (+27% over the 24-thread cluster's 4,687) and 6,289 /io_native
141
+ (+34%)**, still one small process, and still +7% ahead of the cluster
142
+ on pure CPU. Its cost is the CPU-light rows (156k plaintext vs 230k at
143
+ 8×1: 32 ractors oversubscribe 8 cores). A fork cluster buying the same
144
+ 32 slots pays for them in full copies of the app; Kino pays in
145
+ scheduler churn only where the cores are already saturated.
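The `slots ÷ effective wait` rule can be worked through with the numbers already quoted. The sketch below is an illustration, not Kino code: it assumes the nominal 5 ms wait as the whole per-request cost, so it gives an upper bound, and the name `ideal_wait_bound_rps` is hypothetical.

```ruby
# Ideal wait-bound throughput: each slot finishes one request per wait
# period, so req/s = slots / wait. Integer math keeps the result exact.
def ideal_wait_bound_rps(slots, wait_ms)
  slots * 1000 / wait_ms
end

WAIT_MS = 5 # the /io endpoint's 5 ms wait

# topology => [slots, measured /io req/s from the headline table]
TOPOLOGIES = {
  "Kino :ractor 8x1 (default)" => [8, 1_548],
  "Puma cluster 8x3"           => [24, 4_687],
  "Kino :ractor 32x1"          => [32, 5_935]
}

TOPOLOGIES.each do |name, (slots, measured)|
  ceiling = ideal_wait_bound_rps(slots, WAIT_MS)
  share = (100.0 * measured / ceiling).round
  puts "#{name}: ceiling #{ceiling} req/s, measured #{measured} (#{share}% of ceiling)"
end
```

The ceilings come out at 1,600, 4,800 and 6,400 req/s, and each measured figure sits just under its ceiling, which is consistent with the /io rows being a slot-count story.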
121
146
 
122
147
  ## The ractor-pool-wrapper comparison
123
148
 
@@ -127,11 +152,11 @@ already run. `bench/ractor_wrapper.rb` is that experiment, benchmarked on
127
152
  Puma and Falcon—not as a comparison of those servers, but to measure
128
153
  what the Rack-level hop itself costs (c7a.2xlarge, same session):
129
154
 
130
- | endpoint | Kino :ractor | Puma + wrapper | Falcon + wrapper |
131
- |------------|-------------:|---------------:|-----------------:|
132
- | /plaintext | 201,472 | 19,425 | 100,624 |
133
- | /cpu (fib) | 66,735 | 17,106 | 49,083 |
134
- | /io (5 ms) | 4,527 | 1,447 | 1,549 |
155
+ | endpoint | Kino :ractor (8×3) | Puma + wrapper | Falcon + wrapper |
156
+ |------------|-------------------:|---------------:|-----------------:|
157
+ | /plaintext | 199,032 | 19,532 | 100,342 |
158
+ | /cpu (fib) | 68,238 | 17,323 | 48,561 |
159
+ | /io (5 ms) | 4,531 | 1,452 | 1,544 |
135
160
 
136
161
  Inside the Rack contract, the wrapper must reduce the env to a shareable
137
162
  subset, copy it to the worker ractor, copy the response back, and hold a
@@ -190,6 +215,32 @@ needs its `disable_initial_exec_tls` build flag just to load (dlopen +
190
215
  initial-exec TLS = `cannot allocate memory in static TLS block`)—one
191
216
  more reason to prefer mimalloc in dlopen'd extensions.
192
217
 
218
+ ## Memory under load (and the glibc arena footgun)
219
+
220
+ RSS after serving the full endpoint battery (8 s each of /plaintext,
221
+ /10k, /cpu, /io—a "warmed production process", not a fresh boot, which
222
+ measures 26-27 MB for every Kino mode):
223
+
224
+ | config | RSS loaded |
225
+ |---|---:|
226
+ | Kino :ractor 8×1 (default) | **80 MB** |
227
+ | Kino lanes 8×1 | **80 MB** |
228
+ | Kino :ractor 8×3 | 115 MB |
229
+ | Kino :threaded 8×3 | 612 MB¹ |
230
+ | Puma cluster 8×3 | 1,256 MB |
231
+
232
+ ¹ Not a leak: glibc malloc arena bloat. One 8-second /10k round takes
233
+ threaded mode from 69 MB to ~800 MB and it never returns—24 threads
234
+ churning 10 KB strings through one process heap is the textbook glibc
235
+ arena-fragmentation case (the reason Rails ops set `MALLOC_ARENA_MAX=2`;
236
+ Heroku ships that default). With `MALLOC_ARENA_MAX=2` the same battery
237
+ ends at **151 MB** with throughput unchanged (165,993 vs 157,177 /10k,
238
+ if anything faster). Ractor mode sidesteps the worst of it without any
239
+ env tweak—objects live in per-ractor heaps, and repeated runs landed at
240
+ 80-124 MB regardless of arena settings. Puma's 1,256 MB barely moves
241
+ under the cap (1,237 MB): its cost is eight full copies of the warmed
242
+ VM, not arenas.
243
+
193
244
  ## Run-to-run variance (a.k.a. "is this a regression?")
194
245
 
195
246
  Rule of thumb from chasing this twice: never compare numbers from
@@ -197,21 +248,32 @@ different sessions; interleave A/B rounds in one session instead. The
197
248
  Docker-on-Mac environment swung ±10% on /cpu between sessions with the
198
249
  VM's mood; the dedicated c7a box is far steadier (same-session repeats
199
250
  land within ~1-2%), but the discipline stays—every comparative claim in
200
- these docs comes from same-session pairs.
251
+ these docs comes from same-session pairs. Cross-box repeatability got
252
+ its own test: the dataset was measured across three identical
253
+ c7a.2xlarge boxes, and equal-config throughput numbers matched within
254
+ ~1-2% (loaded-RSS measurements swing far more with heap-growth
255
+ timing—treat memory numbers as ballpark). The same discipline caught one fluke: a sweep round once
256
+ posted threaded plaintext 28% low; interleaved re-runs minutes later
257
+ put it back—suspect cells get re-measured, not published.
201
258
 
202
259
  ## Topology notes
203
260
 
204
- Measured on c7a.2xlarge, plaintext, ractor mode, same session: `8×3`
205
- (workers×threads) = 199,470, `8×1` = **232,469 (+17%)**, `16×1` =
206
- 214,284. Threads inside one ractor share its lock, so every request
207
- handled by a 3-thread ractor pays a lock handoff that a 1-thread ractor
208
- doesn't (`perf` in the earlier Docker sessions attributed ~10% of cycles
209
- to `rb_native_mutex_unlock`/`thread_sched_wakeup_next_thread` at 8×3;
210
- the +17% reproduces exactly on real hardware). Threads-per-ractor exist
211
- for handlers that block on I/O; if yours don't, run `threads 1` and let
212
- workers = cores do the parallelism. (16×1 being worse than 8×1 also says
213
- the shared MPMC queue is *not* the bottleneck—8 extra parked consumers
214
- just add scheduler churn.)
261
+ Measured on c7a.2xlarge, plaintext, ractor mode, same session (three
262
+ interleaved rounds, medians): `8×3` (workers×threads) = 200,048, `8×1`
263
+ = **232,173 (+16%)**, `16×1` = 214,570. Threads inside one ractor share
264
+ its lock, so every request handled by a 3-thread ractor pays a lock
265
+ handoff that a 1-thread ractor doesn't (`perf` in the earlier Docker
266
+ sessions attributed ~10% of cycles to
267
+ `rb_native_mutex_unlock`/`thread_sched_wakeup_next_thread` at 8×3; the
268
+ gain reproduced on two separate boxes, +16-17% each). **This is why
269
+ `threads` defaults to 1 in ractor mode since 0.1.1** (/cpu gains +18%
270
+ the same way: 80,409 vs 68,132). The trade-off is /io at low worker
271
+ counts: 1,534 at 8×1 vs 4,486 at 8×3—threads-per-ractor exist for
272
+ handlers that block on I/O. If yours wait a lot, raise `workers`
273
+ instead (32×1 beats even the 24-slot cluster, see above); slots are
274
+ cheap. (16×1 being worse than 8×1 on plaintext also says the shared
275
+ MPMC queue is *not* the bottleneck—8 extra parked consumers just add
276
+ scheduler churn.)
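As a config-file sketch of that advice for a wait-heavy app, using only directives that appear in the README's `kino.rb` example (the port is carried over from that example; 32 is the worker count measured above, not a recommendation for every app):

```ruby
# kino.rb: io-tuned topology measured above (32 workers x 1 thread)
port 9292
workers 32   # more slots for waiting handlers; costs CPU-light throughput
threads 1    # already the :ractor default since 0.1.1, shown for clarity
mode :ractor
```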
215
277
 
216
278
  ## What profiling tried and rejected
217
279
 
@@ -239,23 +301,27 @@ safeguards: lane depth is capped at 4, and workers steal from siblings
239
301
  before parking (plus on every park tick), so a slow request can't strand
240
302
  its lane's backlog.
241
303
 
242
- Same-session A/B on c7a.2xlarge (ractor mode):
304
+ Same-session A/B on c7a.2xlarge, ractor mode at the default topology
305
+ (8×1):
243
306
 
244
307
  | endpoint | shared queue | lanes | delta |
245
308
  |----------|-------------:|------:|------:|
246
- | /plaintext | 201,472 | **241,501** | **+20%** |
247
- | /10k | 156,635 | 183,564 | +17% |
248
- | /cpu | 66,735 | 70,373 | +5% |
249
- | /io | 4,527 | 4,530 | flat |
250
-
251
- On this hardware lanes make ractor mode the fastest Kino configuration
252
- outright—+11% over threaded mode's plaintext, where the shared queue
253
- trails it. (On loopback-bound macOS, lanes lose a few percent instead;
254
- see the secondary table below.) It stays opt-in for now because overload
255
- semantics differ from the shared queue (`queue_depth` doesn't apply;
256
- capacity is lanes × 4 with brief dispatcher retries up to
257
- `queue_timeout` before the 503), and crash semantics, stealing fairness,
258
- and drain behavior have spec coverage but not production mileage.
309
+ | /plaintext | 230,547 | **250,395** | **+9%** |
310
+ | /10k | 182,788 | 198,301 | +8% |
311
+ | /cpu | **78,175** | 73,345 | −6% |
312
+ | /io | 1,550 | 1,550 | flat |
313
+
314
+ Lanes' margin shrank with the move to 1-thread workers (at the old 8×3
315
+ it was +21% plaintext: 240,193 vs 199,032 in the same session)—most of
316
+ the futex pain lanes were built to avoid came from thread handoffs
317
+ inside each ractor, and the new default removes those for everyone. At
318
+ the default, lanes still post the fastest plaintext/10k of any Kino
319
+ configuration, but plain shared-queue now takes /cpu. It stays opt-in because overload semantics differ from the
320
+ shared queue (`queue_depth` doesn't apply; capacity is lanes × 4 with
321
+ brief dispatcher retries up to `queue_timeout` before the 503), and
322
+ crash semantics, stealing fairness, and drain behavior have spec
323
+ coverage but not production mileage. (On loopback-bound macOS, lanes
324
+ lose a few percent instead; see the secondary table below.)
259
325
 
260
326
  ## Logging costs
261
327
 
@@ -265,11 +331,11 @@ typical costs):
265
331
 
266
332
  | case (8×3, same session) | req/s |
267
333
  |---|---:|
268
- | threaded, no logging | 217,113 |
269
- | threaded, `log_requests true` (native access log) | 193,200 (−11%) |
270
- | ractor, access log off / on | 198,624 / 183,565 (−8%) |
271
- | app logs 1 line/req via shared `::Logger` (file) | **62,962** |
272
- | app logs 1 line/req via `Kino::Logger` (file) | **150,810 (2.4×)** |
334
+ | threaded, no logging | 217,377 |
335
+ | threaded, `log_requests true` (native access log) | 193,493 (−11%) |
336
+ | ractor, access log off / on | 200,478 / 184,357 (−8%) |
337
+ | app logs 1 line/req via shared `::Logger` (file) | **62,917** |
338
+ | app logs 1 line/req via `Kino::Logger` (file) | **149,237 (2.4×)** |
273
339
 
274
340
  The shared-`::Logger` cost is the mutex: 24 worker threads serialize
275
341
  through one lock plus a write syscall per line. `Kino::Logger` hands the
data/doc/why-kino.md CHANGED
@@ -15,8 +15,8 @@ deep-copies it, and sockets cannot cross at all.
15
15
 
16
16
  We measured what the "obvious" workaround costs. The ractor-pool wrapper
17
17
  experiment (reduce the env to a shareable subset, copy it to a worker
18
- over a `Ractor::Port`, copy the response back) runs at **19k req/s where
19
- Kino does 201k** on the same hardware—see the
18
+ over a `Ractor::Port`, copy the response back) runs at **19.5k req/s
19
+ where Kino does 199k** on the same hardware—see the
20
20
  [wrapper comparison](benchmarks.md#the-ractor-pool-wrapper-comparison).
21
21
  Copying at the Rack layer eats the entire ractor dividend. Dispatch has
22
22
  to live below the Rack contract.
@@ -78,10 +78,10 @@ objects; Rust sees one queue and one registry.
78
78
 
79
79
  With the dispatch cost eliminated, Ractors deliver the thing they were
80
80
  built for—a lock per ractor instead of one GVL—and each layer is
81
- visible in the [benchmarks](benchmarks.md): `/cpu` at 66.7k req/s in
82
- ractor mode vs **13.3k threaded (5×, the GVL ceiling)**, matching the
83
- fork cluster's CPU parallelism while holding **57 MB against the
84
- cluster's 1,078 MB**, because eight ractors share one VM, one Rust
81
+ visible in the [benchmarks](benchmarks.md): `/cpu` at 76.9k req/s in
82
+ ractor mode vs **13.5k threaded (5.7×, the GVL ceiling)**, beating the
83
+ fork cluster's CPU parallelism by +32% while holding **80 MB against
84
+ the cluster's 1,256 MB**, because eight ractors share one VM, one Rust
85
85
  front-end, one queue, and one JIT, where eight forks each pay full
86
86
  price.
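The multipliers quoted in this hunk can be re-derived from the throughput and memory figures it states. The following is a minimal sketch in plain Ruby: `ratio` is a hypothetical helper introduced here, not part of Kino, and the inputs are only the numbers shown in the old and new text above.

```ruby
# Hypothetical helper (not part of Kino): ratio of two measurements,
# rounded to one decimal place.
def ratio(a, b)
  (a.to_f / b).round(1)
end

old_cpu = ratio(66.7, 13.3) # 0.1.0 text: ractor vs threaded on /cpu, in k req/s
new_cpu = ratio(76.9, 13.5) # 0.1.1 text: same comparison
old_mem = ratio(1078, 57)   # 0.1.0 text: fork cluster MB vs Kino MB
new_mem = ratio(1256, 80)   # 0.1.1 text: same comparison

puts "cpu speedup:  #{old_cpu}x -> #{new_cpu}x"
puts "memory ratio: #{old_mem}x -> #{new_mem}x"
```

The CPU ratios agree with the "5×" and "5.7×" quoted in the two versions of the paragraph. The "+32%" figure cannot be checked this way because the fork cluster's throughput is not stated in this hunk.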
87
87
 
@@ -12,10 +12,10 @@ module Kino
12
12
  bind: "127.0.0.1",
13
13
  port: 0,
14
14
  workers: nil, # resolved to Etc.nprocessors in #to_h
15
- threads: 3,
15
+ threads: nil, # resolved per mode in Server: 1 in :ractor, 3 in :threaded
16
16
  mode: :auto,
17
17
  queue_depth: 1024,
18
- queue_timeout: 1.0,
18
+ queue_timeout: 5.0,
19
19
  request_timeout: nil,
20
20
  batch: 1,
21
21
  lanes: false,
@@ -144,7 +144,8 @@ module Kino
144
144
  # Worker count (ractors in :ractor mode); defaults to CPU cores.
145
145
  def workers(count) = @config.set(:workers, Integer(count))
146
146
 
147
- # Threads per worker (I/O concurrency inside one ractor).
147
+ # Threads per worker (I/O concurrency inside one ractor); default is
148
+ # mode-dependent: 1 in :ractor mode, 3 in :threaded.
148
149
  def threads(count) = @config.set(:threads, Integer(count))
149
150
 
150
151
  # Dispatch mode: :auto, :ractor, or :threaded.
data/lib/kino/kino.so CHANGED
Binary file
data/lib/kino/server.rb CHANGED
@@ -41,8 +41,12 @@ module Kino
41
41
  @bind = settings[:bind]
42
42
  @requested_port = settings[:port]
43
43
  @workers = Integer(settings[:workers])
44
- @threads = Integer(settings[:threads])
45
44
  @mode = resolve_mode(settings[:mode])
45
+ # Default threads per mode: 1 in :ractor (threads inside a ractor
46
+ # share its lock; a measured +17% on fast handlers; raise `workers`
47
+ # for I/O concurrency instead), 3 in :threaded (threads ARE the
48
+ # concurrency there).
49
+ @threads = Integer(settings[:threads] || ((@mode == :ractor) ? 1 : 3))
46
50
  @queue_depth = Integer(settings[:queue_depth])
47
51
  @queue_timeout_ms = (Float(settings[:queue_timeout]) * 1000).round
48
52
  @request_timeout_ms = settings[:request_timeout] ? (Float(settings[:request_timeout]) * 1000).round : 0
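The defaulting added in this hunk can be tried in isolation. The sketch below copies the two expressions from the hunk into standalone methods. `default_threads` and `timeout_ms` are hypothetical names used only for illustration; in the gem the logic sits inline in the server's initializer.

```ruby
# Hypothetical stand-ins, not Kino API: they mirror the expressions in the
# hunk above so the fallback behavior can be checked without the gem.
def default_threads(requested, mode)
  # An explicit value always wins; nil falls back to the per-mode default.
  Integer(requested || ((mode == :ractor) ? 1 : 3))
end

def timeout_ms(seconds)
  # Seconds (possibly fractional) converted to whole milliseconds.
  (Float(seconds) * 1000).round
end

p default_threads(nil, :ractor)   # unset in :ractor mode
p default_threads(nil, :threaded) # unset in :threaded mode
p default_threads(5, :ractor)     # explicit setting is kept
p timeout_ms(5.0)                 # the new queue_timeout default
```

Because the fallback reads `@mode`, it only works once the mode has been resolved, which would explain why the hunk moves the `@threads` assignment below `resolve_mode`.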
@@ -1,141 +1,110 @@
1
- # frozen_string_literal: true
2
-
3
1
  # Kino configuration.
4
2
  # Generated by `kino --init`.
5
3
  #
6
- # Every setting below is shown with its default value, commented out:
7
- # the file is a valid no-op until you uncomment something. Precedence:
8
- # explicit Server.new kwargs / CLI flags > this file > built-in defaults.
4
+ # Every setting is shown with its default value and commented out, so
5
+ # this file works as-is: uncomment what you want to change.
6
+ # Command-line flags beat this file; this file beats built-in defaults.
9
7
 
10
8
  ## Network
11
9
 
12
- # Address to listen on. Use "0.0.0.0" to accept non-local connections.
10
+ # Address to listen on. Use "0.0.0.0" to accept connections from other
11
+ # machines.
13
12
  # bind "127.0.0.1"
14
13
 
15
- # Port to listen on. 0 picks an ephemeral port (readable via server.port).
16
- # The `kino` CLI defaults this to 9292 when nothing else sets it.
14
+ # Port to listen on.
17
15
  # port 9292
18
16
 
19
- # TLS termination (rustls, in Rust; never blocks a Ruby thread).
20
- # Values are file paths or inline PEM strings. ALPN is http/1.1.
17
+ # Serve HTTPS. Point these at your certificate and key files (inline
18
+ # PEM strings also work).
21
19
  # tls cert: "config/certs/server.pem", key: "config/certs/server.key"
22
20
 
23
21
  ## Topology
24
22
 
25
- # Puma-style two-level topology: `workers` × `threads`.
26
- #
27
- # In :ractor mode, `workers` is the number of worker Ractors: true
28
- # multi-core parallelism for Ruby CPU work, one per core is a good start.
29
- # In :threaded mode the same total (workers × threads) runs as plain
30
- # Threads on the main ractor.
31
-
32
- # Defaults to the number of CPU cores (Etc.nprocessors).
23
+ # How many workers to run. Each worker handles requests independently;
24
+ # in :ractor mode every worker runs Ruby in parallel on its own core.
25
+ # Default: one per CPU core.
33
26
  # workers 8
34
27
 
35
- # Threads per worker. Threads inside one ractor share its lock, so they
36
- # only add concurrency where handlers block on I/O (database calls, HTTP).
37
- # CPU-bound apps gain nothing past 1 (and pay a lock-handoff tax: threads 1
38
- # measured +17% on fast handlers). I/O-heavy apps want more SLOTS overall -
39
- # in :ractor mode prefer raising `workers` over `threads` (slots are cheap,
40
- # no fork memory): 32 workers x 1 thread beat 8x3 by +35% on waits.
41
- # threads 3
28
+ # Threads inside each worker. More threads help when your app spends
29
+ # time waiting on databases or other services; they do not make Ruby
30
+ # code run faster. Left unset, Kino picks a sensible default for the
31
+ # mode. If your app waits a lot in :ractor mode, prefer raising
32
+ # `workers` instead.
33
+ # threads 1
42
34
 
43
35
  ## Dispatch mode
44
36
  #
45
- # :auto: :ractor when the app is Ractor-shareable, else :threaded
46
- # (with a warning). Note: a Class used as a Rack app always
47
- # counts as "shareable" even if calling it touches unshareable
48
- # state; force :threaded for those.
49
- # :ractor: require a Ractor-shareable app; raises
50
- # Kino::UnshareableAppError otherwise. The app must capture
51
- # nothing mutable: frozen middleware, Ractor.shareable_proc
52
- # endpoints.
53
- # :threaded: run ANY Rack app (Rails included) on a classic thread pool.
37
+ # :auto - picks :ractor when your app supports it, else :threaded.
38
+ # :ractor - runs Ruby in parallel on all cores. Your app must be
39
+ # Ractor-shareable; check yours with `kino --check`.
40
+ # :threaded - works with any Rack app, including Rails.
54
41
  # mode :auto
55
42
 
56
43
  ## Backpressure
57
44
 
58
- # Bounded request queue between the Rust front-end and Ruby workers.
59
- # When it stays full past queue_timeout, clients get an immediate 503
60
- # instead of waiting forever.
45
+ # How many requests may wait in line. When the line stays full, new
46
+ # requests are turned away with a 503 instead of waiting forever.
61
47
  # queue_depth 1024
62
48
 
63
- # Seconds a request may wait for queue space before the 503.
64
- # queue_timeout 1.0
49
+ # How long (in seconds) a request may wait for a free spot before
50
+ # getting the 503.
51
+ # queue_timeout 5.0
65
52
 
66
- # Seconds the app gets to produce a response before the client receives a
67
- # 504 instead. Off by default (nil = wait forever). The handler is NOT
68
- # killed - its late response is dropped and its slot stays busy until it
69
- # returns, so size this above your slowest legitimate endpoint.
53
+ # Give up on a response after this many seconds: the client gets a 504
54
+ # while your app finishes in the background. Off unless set. Set it
55
+ # above your slowest legitimate endpoint.
70
56
  # request_timeout 30
71
57
 
72
- # Requests a worker may grab per queue visit. Values above 1 squeeze more
73
- # throughput out of uniformly fast handlers, but add head-of-line blocking
74
- # behind slow ones and stretch the effective queue depth - leave at 1
75
- # unless your handlers are all sub-millisecond.
58
+ # How many requests a worker grabs from the line at once. Leave at 1
59
+ # unless all your endpoints are uniformly fast.
76
60
  # batch 1
77
61
 
78
- # EXPERIMENTAL lane dispatch: per-worker queues with awake-preferring
79
- # assignment and work stealing. Cuts per-request wakeups for uniformly
80
- # fast handlers; semantics under overload are slightly different (per-lane
81
- # caps with brief dispatcher retries instead of one global queue).
62
+ # Experimental dispatcher that gives each worker its own line. Faster
63
+ # for quick handlers; behavior under heavy overload differs slightly.
82
64
  # lanes false
83
65
 
84
- # Native access log: one line per request to stdout, written by a
85
- # Rust-side flusher thread - request threads never block on the log.
86
- #
87
- # On color terminals lines are tinted by status class (2xx green,
88
- # 3xx yellow, 4xx maroon, 5xx bright red). This is the SERVER's view - it
89
- # includes the 503 rejections your app never sees - and it interleaves
90
- # cleanly with your app's own log (e.g. Rails') on stdout. See also
91
- # Kino::Logger for routing the app log through the same async sink.
92
- #
93
- # Try enabling it in the development environment.
66
+ # Print one line per request to stdout, colored by status on a
67
+ # terminal. This is the server's view: it includes requests your app
68
+ # never saw, such as 503s. Recommended in development.
94
69
  # log_requests false
95
70
 
96
71
  ## Lifecycle
97
72
 
98
- # Graceful-shutdown drain deadline in seconds: in-flight requests get this
99
- # long to finish; past it, their clients receive 500s and workers are
100
- # reaped. A second INT/TERM force-exits immediately.
73
+ # On shutdown, give in-flight requests this many seconds to finish.
74
+ # A second Ctrl-C (or signal) force-exits immediately.
101
75
  # shutdown_timeout 30
102
76
 
103
- # Write the master PID here on start; removed on graceful shutdown.
77
+ # Write the server's process id to this file on start.
104
78
  # pidfile "tmp/pids/kino.pid"
105
79
 
106
80
  ## Runtime
107
81
 
108
- # Threads for the tokio (Rust I/O) runtime. Default (nil) lets tokio use
109
- # one per core: right for I/O-heavy apps. For CPU-heavy apps this is a
110
- # real lever: `tokio_threads 1` + `threads 1` measured +26% on a pure-CPU
111
- # benchmark (every spare thread is Ruby work you didn't run).
82
+ # Threads for the Rust I/O engine. The default suits most apps; for
83
+ # heavily CPU-bound apps, try 1 to leave more cores for Ruby.
112
84
  # tokio_threads 4
113
85
 
114
86
  ## App
115
87
 
116
- # Rackup file the `kino` CLI loads (positional CLI argument wins).
88
+ # Rackup file to load (a command-line argument wins).
117
89
  # rackup "config.ru"
118
90
 
119
- # Sets RACK_ENV (unless already set) before the app is loaded by the CLI.
91
+ # Sets RACK_ENV before the app is loaded, unless already set.
120
92
  # environment "production"
121
93
 
122
94
  ## Rails
123
95
  #
124
- # Rails runs on Kino TODAY in :threaded mode; uncomment for a Rails app:
96
+ # Rails runs on Kino today in :threaded mode:
125
97
  #
126
98
  # mode :threaded
127
99
  # environment "production"
128
100
  # threads 5 # match your database pool size
129
101
  #
130
- # Recommended Rails-side settings to pair with Kino:
131
- # - config.eager_load = true and no code reloading (production defaults):
132
- # Kino's workers serve concurrently; lazy class loading under
133
- # concurrency is slow and, in ractor mode, unsafe.
134
- # - Database pool >= workers × threads (config/database.yml `pool:`).
135
- # - Rails.logger goes to stdout/stderr or a thread-safe device.
102
+ # Rails-side tips:
103
+ # - Run with eager loading and no code reloading (the production
104
+ # defaults).
105
+ # - Set the database pool to at least workers x threads.
106
+ # - Send logs to stdout or another thread-safe destination.
136
107
  #
137
- # Rails main is being ractorized, but
138
- # Rails.application still captures unshareable state at boot; known
139
- # blockers are documented in Kino's README. Track rails/rails main; when
140
- # Ractor.make_shareable(Rails.application) succeeds, `mode :ractor` here
141
- # is all you'll need to change.
108
+ # Ractor mode cannot run Rails yet; the blockers are upstream in Rails.
109
+ # Once `Ractor.make_shareable(Rails.application)` works, switching to
110
+ # `mode :ractor` here is all you will need.
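The Rails tip above about sizing the database pool is simple arithmetic. A minimal sketch follows, assuming the template's sample values (`workers 8` and the Rails block's `threads 5`). `min_db_pool` is a hypothetical helper, not part of Kino or Rails.

```ruby
# Hypothetical helper (not part of Kino or Rails): the smallest pool that
# gives every request slot its own database connection.
def min_db_pool(workers:, threads:)
  Integer(workers) * Integer(threads)
end

pool = min_db_pool(workers: 8, threads: 5)
puts "config/database.yml pool should be at least #{pool}"
```

With `workers` left at its default of one per CPU core, the product changes from machine to machine, so the pool size is worth recomputing per deployment target.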
data/lib/kino/version.rb CHANGED
@@ -2,5 +2,5 @@
2
2
 
3
3
  module Kino
4
4
  # The gem version (single source of truth; ext/kino/Cargo.toml syncs).
5
- VERSION = "0.1.0"
5
+ VERSION = "0.1.1"
6
6
  end
metadata CHANGED
@@ -1,7 +1,7 @@
1
1
  --- !ruby/object:Gem::Specification
2
2
  name: kino
3
3
  version: !ruby/object:Gem::Version
4
- version: 0.1.0
4
+ version: 0.1.1
5
5
  platform: aarch64-linux
6
6
  authors:
7
7
  - Yaroslav Markin