kino 0.1.0 → 0.1.2

checksums.yaml CHANGED
@@ -1,7 +1,7 @@
1
1
  ---
2
2
  SHA256:
3
- metadata.gz: a4e73026c9f3087a58234e0f8fc88be1e407bb7435e8e8a0be43d087f318bcad
4
- data.tar.gz: 49581091f865d3ff11227900376ad9499d1bddef77a4b3b0bc5b3b7405d58bd8
3
+ metadata.gz: 0bd8e6e3b295832fa1d87743b4c9a121cdb5687287011b29f1140814fdae0575
4
+ data.tar.gz: f21305459e857d366159ee258d873e2109b7a7715bb0cbe8d23c47e458ae2b33
5
5
  SHA512:
6
- metadata.gz: 3e420380536f5c73a417d20945ad49f7f0a8e46d3cec5feabfde50600aab9dfc18b2e16d64656b2e1529e5e2c08384efabf4be9f36bebfb9cc737a245e348aec
7
- data.tar.gz: f0e38db0ea8bb93cfa7825db47ebcf72319e76ac82487ed6efb191d82db86a8c205899f7996f34976b36d1960dd97f3086fa4a5359179ce8c7f35bfa5d8b94f4
6
+ metadata.gz: 04e95f9ee2133b4d15bdd2069977e73791a16b033e57a3bdb88478d0048b0bb5d8f56eff1963813bbf8d9399461bb80bf9c33a2039f98805e4f129b3e42b28ef
7
+ data.tar.gz: 8eb131cdbbe5bdbd29d188ab4ff43dbbec2f8c791aacf85586ba69c7208b2f74cc05d4007c469a4de61c4578ef08214e56a4fd080b9ad655af3a01d208b4c752
data/CHANGELOG.md CHANGED
@@ -1,3 +1,40 @@
1
+ ## [0.1.2] - 2026-06-22
2
+
3
+ - Drop a connection that has not sent its complete request headers
4
+ within 15 seconds. Closes a slowloris hole: hyper's built-in header-read
5
+ timeout was inert because the server installed no timer, so a slow-header
6
+ client could tie up a connection (and its tokio task) indefinitely.
7
+ - Cap concurrent connections (new `max_connections` directive). Past the cap,
8
+ new connections wait in the kernel backlog instead of piling up until a
9
+ flood exhausts file descriptors or memory. Defaults to most of the process
10
+ open-file limit (`ulimit -n`), so it scales with the OS limit and only
11
+ engages under a flood.
12
+ - Bound the TLS handshake to 10 seconds. A client that completed the TCP
13
+ connect but stalled the handshake could otherwise hold a connection slot
14
+ indefinitely, since the request and header-read deadlines only begin once
15
+ the handshake finishes.
16
+ - Cap the request body at 50 MB by default (new `max_body_size` directive,
17
+ configurable; nil or 0 disables and delegates to a fronting proxy). An app
18
+ that reads `rack.input` could otherwise be driven to run out of memory by an
19
+ oversized or endless upload. A truthful oversize Content-Length is refused
20
+ with a 413 before the app runs; a chunked or lying client is cut off
21
+ mid-stream once it passes the cap.
22
+ - Bound the idle time between request-body frames to 30 seconds. A client that
23
+ began a request then stalled mid-body would otherwise keep a worker blocked
24
+ in `rack.input.read` indefinitely; now the read raises and the worker
25
+ reclaims its slot. Only a silent client trips it: a steadily-sent body resets
26
+ the deadline each frame, so slow-but-active uploads are unaffected.
27
+
28
+ ## [0.1.1] - 2026-06-11
29
+
30
+ - Mode-dependent `threads` default: 1 per worker in :ractor mode (threads
31
+ inside a ractor share its lock and cost a per-request handoff; +16-18%
32
+ on fast handlers, measured on dedicated hardware), 3 in :threaded mode.
33
+ Explicit `threads` always wins; waiting-heavy ractor apps should raise
34
+ `workers` instead.
35
+ - `queue_timeout` default raised from 1 to 5 seconds: a brief burst now
36
+ waits out the spike instead of shedding 503s within a second.
37
+
1
38
  ## [0.1.0] - 2026-06-11
2
39
 
3
40
  Initial release.
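The `max_body_size` behaviour described in the 0.1.2 entry (refuse a truthful oversize Content-Length with a 413 before the app runs, cut off a chunked or lying client once it passes the cap) can be sketched in plain Ruby. This is only an illustration of the two checks, not Kino's implementation: `precheck_status`, `read_capped`, `BodyTooLarge` and the 16 KB chunk size are hypothetical names and values chosen for the example, and the 50 MB figure is taken from the changelog.

```ruby
require "stringio"

# Hypothetical sketch; none of these names come from Kino itself.
MAX_BODY_SIZE = 50 * 1024 * 1024 # 50 MB default cap, per the changelog

class BodyTooLarge < StandardError; end

# Up-front check on a declared Content-Length.
# Returns 413 to refuse the request, or nil to let it proceed.
def precheck_status(content_length, cap)
  return nil if cap.nil? || cap.zero? # nil or 0 disables the cap
  return nil if content_length.nil?   # chunked body: nothing declared
  content_length > cap ? 413 : nil
end

# Streaming check: read in chunks and stop as soon as the running
# total passes the cap, whatever the client declared.
def read_capped(io, cap, chunk_size: 16 * 1024)
  total = 0
  out = String.new
  while (chunk = io.read(chunk_size))
    total += chunk.bytesize
    if cap && cap > 0 && total > cap
      raise BodyTooLarge, "body exceeded #{cap} bytes"
    end
    out << chunk
  end
  out
end

# Usage: a small body passes, an oversized declared length is refused.
puts read_capped(StringIO.new("hello"), MAX_BODY_SIZE)
puts precheck_status(MAX_BODY_SIZE + 1, MAX_BODY_SIZE).inspect
```

The two checks are kept separate on purpose: the first costs nothing and never touches the body, while the second is the only defence against a client whose declared length is absent or false.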
data/Cargo.lock CHANGED
@@ -332,7 +332,7 @@ dependencies = [
332
332
 
333
333
  [[package]]
334
334
  name = "kino"
335
- version = "0.1.0"
335
+ version = "0.1.2"
336
336
  dependencies = [
337
337
  "ahash",
338
338
  "bytes",
data/README.md CHANGED
@@ -11,14 +11,14 @@ on every core in **one small process**. A **Rust** (tokio + hyper)
11
11
  front-end owns the network, parallel **Ractors** run your Rack 3 app,
12
12
  and a threaded fallback mode runs everything else, Rails included.
13
13
 
14
- * **Fast.** On a real 8-core server, every Kino mode is **1.4-2×** ahead
15
- of a same-topology Puma cluster on I/O-light endpoints. Ractor mode
16
- also wins on pure CPU. [Benchmarks](#benchmarks) below.
17
- * **A fraction of the memory.** One process instead of a fork per core:
18
- about **1/19th of the Puma cluster's memory** under the same load, and
19
- about 1/8th when serving the Rails hello-world.
20
- * **Parallel without forking.** Ractor mode runs CPU work **5×** faster
21
- than Kino's own GVL-bound threaded mode, in the same small process.
14
+ * **Fast.** On a real 8-core server, every Kino mode is **1.5-2×**
15
+ ahead of a Puma fork cluster on I/O-light endpoints. Ractor mode also
16
+ wins on pure CPU, **30%+**. [Benchmarks](#benchmarks) below.
17
+ * **A fraction of the memory.** About **~7×** on the simplistic bench
18
+ Ractor app, and about **~4× less memory** than a Puma cluster serving Rails in fallback threaded mode.
19
+ * **Parallel without forking.** Ractor mode runs CPU work **more than 5×
20
+ faster** than Kino's own GVL-bound threaded mode, in the same
21
+ small process.
22
22
  * **Production plumbing included.** Graceful drain, crash supervision
23
23
  and respawn, bounded queues with 503 backpressure, request timeouts,
24
24
  TLS (rustls), live stats, async access and app logging.
@@ -63,63 +63,108 @@ notes live in [doc/architecture.md](doc/architecture.md).
63
63
  ## Benchmarks
64
64
 
65
65
  Measured on a real server: AWS **c7a.2xlarge** (8-core AMD EPYC 9R14,
66
- 16 GB, Amazon Linux 2023). This is a realistic app-server size. The same
67
- Ractor-shareable app runs on every server, Ruby 4.0.5 with YJIT, equal
68
- topology (8 workers × 3 threads; Puma forks, Kino stays in one process).
69
- Numbers are req/s by wrk (8-second windows, 64 connections, same host).
70
- Methodology and the analysis behind every column:
66
+ 16 GB, Amazon Linux 2023). This is a realistic app-server size.
67
+
68
+ **These tables run a tiny synthetic Rack app**—plaintext, a 10 KB body,
69
+ a CPU-bound `fib`, a 5 ms wait—deliberately small, to measure the server
70
+ rather than an app. It is Ractor-shareable, so Kino runs it in `:ractor`
71
+ mode (and `:threaded` for comparison). **A real Rails app is a different
72
+ story:** it is *not* Ractor-shareable, so it runs only in Kino's
73
+ `:threaded` fallback, with its own numbers—see [Rails](#rails) below.
74
+ Ruby 4.0.5 with YJIT, every server at its defaults: Puma forks 8 workers ×
75
+ 3 threads, Kino stays in one process (8 workers; 1 thread each in ractor
76
+ modes, 3 in threaded). Numbers are req/s by wrk (8-second windows, 64
77
+ connections, same host). Methodology:
71
78
  [doc/benchmarks.md](doc/benchmarks.md).
72
79
 
73
- | endpoint | Kino :ractor | + lanes | Kino :threaded | Puma (cluster) |
74
- |-------------|-------------:|--------:|---------------:|---------------:|
75
- | /plaintext | 201,472 | **241,501** | 218,348 | 117,838 |
76
- | /10k | 156,635 | **183,564** | 153,442 | 106,666 |
77
- | /cpu (fib) | 66,735¹| **70,373** | 13,298 | 58,207 |
78
- | /io (5 ms) | 4,527²| 4,530 | **4,715** | 4,691 |
79
- | /io_native | 4,714 | **4,717** | 4,709 | 4,692 |
80
+ | endpoint | Kino :ractor | + lanes | :ractor, `workers 32`² | Kino :threaded | Puma (cluster) |
81
+ |-------------|-------------:|--------:|-----------------------:|---------------:|---------------:|
82
+ | /plaintext | 229,534 | **250,222** | 182,997 | 216,994 | 118,176 |
83
+ | /10k | 178,083 | **189,862** | 151,034 | 160,400 | 106,768 |
84
+ | /cpu (fib) | **77,999**¹| 70,885 | 66,100 | 13,429 | 58,006 |
85
+ | /io (5 ms) | 1,552 | 1,551 | **5,888** | 4,709 | 4,693 |
86
+ | /io_native | 1,570 | 1,571 | **6,274** | 4,695 | 4,691 |
80
87
 
81
- Memory on the same box, RSS under load:
88
+ Memory tells two different stories depending on the app, both by **PSS**
89
+ (proportional set size; see note) after sustained load.
82
90
 
83
- | serving | Kino (one process) | Puma cluster (8 workers) |
84
- |-----------------------|-------------------:|-------------------------:|
85
- | bench app, :ractor | **57 MB** | 1,078 MB |
86
- | bench app, :threaded | **50 MB** | 1,078 MB |
87
- | Rails hello-world | **97 MB** | 797 MB |
91
+ **The tiny benchmark app** (Ractor-shareable, so Kino runs it in `:ractor`
92
+ or `:threaded`). Kino is **~7× lighter in :ractor mode, ~10× in :threaded**
93
+ than the Puma cluster; the gap stays large because a trivial app is almost
94
+ all private per-worker heap, which copy-on-write can't share:
95
+
96
+ | tiny app, Kino | Kino (one process) | Puma cluster (8 workers) | ratio |
97
+ |-----------------|-------------------:|-------------------------:|------:|
98
+ | :ractor (8×1) | **148 MB** | 1,068 MB | ~7× |
99
+ | :threaded (8×3) | **107 MB**³| 1,068 MB | ~10× |
100
+
101
+ **A real Rails app** (not Ractor-shareable—Kino's `:threaded` fallback
102
+ only, [below](#rails)). The gap is **~4×**, smaller because Rails' large
103
+ framework *is* shared copy-on-write across Puma's forks:
104
+
105
+ | Rails hello-world | Kino :threaded | Puma cluster (8 workers) | ratio |
106
+ |-------------------|---------------:|-------------------------:|------:|
107
+ | **PSS** | **92 MB** | **389 MB** | ~4× |
88
108
 
89
109
  "+ lanes" is the experimental per-worker-queue dispatcher (`lanes true`).
90
- It adds +20% over the shared queue on this hardware and makes ractor
91
- mode the fastest Kino configuration. Details:
110
+ It posts the fastest plaintext/10k of any configuration here. Details:
92
111
  [doc/benchmarks.md](doc/benchmarks.md#lane-dispatch-experimental-lanes-true).
93
112
 
94
113
  ¹ Stock settings, no tuning. Ractor mode beats the fork cluster on pure
95
- CPU by +15% (+21% with lanes). Threaded mode shows the GVL ceiling that
96
- every single-process Ruby server hits. The CPU-tuning recipe that our
97
- earlier Docker measurements needed makes no difference on real hardware
98
- (+0.5%); see [doc/benchmarks.md](doc/benchmarks.md#cpu-bound-tuning).
99
-
100
- ² The ractor timer tax is small on real hardware: −4% against threaded
101
- mode (it was −18% in Docker). Wait-bound throughput is slots ÷ wait, and
102
- Kino slots are threads, not processes. `workers 32, threads 1` measured
103
- **5,922 /io (+27% over the cluster) and 6,254 /io_native (+34%)**, still
104
- one small process. See
114
+ CPU by +34% (+22% with lanes). Threaded mode shows the GVL ceiling that
115
+ every single-process Ruby server hits. The old CPU-tuning recipe is
116
+ retired: its `threads 1` half **is** the default now, and its
117
+ `tokio_threads 1` half costs −12% on real hardware; see
118
+ [doc/benchmarks.md](doc/benchmarks.md#cpu-bound-tuning).
119
+
120
+ ² Wait-bound throughput is slots ÷ wait, and the default columns bring
121
+ 8 single-thread workers against the cluster's 24 threads. Kino slots
122
+ are threads, not processes—when your app waits a lot, raise `workers`.
123
+ The `workers 32` column is that tuning: **+25% over the cluster on /io
124
+ (+34% via `Kino.sleep`)** while still ahead of it on pure CPU, all in
125
+ one small process. The cost is the CPU-light rows (32 ractors
126
+ oversubscribe 8 cores); pick the topology your app's wait profile
127
+ needs. See
105
128
  [doc/benchmarks.md](doc/benchmarks.md#why-io-lags-in-ractor-mode-on-linux).
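The "slots ÷ wait" rule in footnote ² can be checked against the /io row with a few lines of Ruby. This is a rough sketch: `wait_bound_ceiling` is a hypothetical helper, the slot counts assume the topologies stated above (8 single-thread workers, the cluster's 24 threads, and 32 single-thread workers), and the measured figures are copied from the main table.

```ruby
# Wait-bound ceiling: each slot finishes at most 1/wait requests per second,
# so the whole server tops out at slots / wait.
def wait_bound_ceiling(slots, wait_seconds)
  (slots / wait_seconds).round
end

WAIT = 0.005 # the 5 ms wait of the /io endpoint

# label => [slots, measured req/s on /io from the table above]
rows = {
  "Kino :ractor, 8 workers x 1 thread"  => [8,  1_552],
  "Puma cluster, 8 workers x 3 threads" => [24, 4_693],
  "Kino :ractor, workers 32 x 1 thread" => [32, 5_888],
}

rows.each do |label, (slots, measured)|
  ceiling = wait_bound_ceiling(slots, WAIT)
  share = (100.0 * measured / ceiling).round
  puts "#{label}: ceiling #{ceiling} req/s, measured #{measured} (#{share}% of ceiling)"
end
```

Each measured figure sits just under its ceiling, which is why raising `workers` moves the /io rows while the per-request cost barely matters.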
106
129
 
130
+ ³ With `MALLOC_ARENA_MAX=2` (the standard Ruby deployment setting;
131
+ Heroku's default). Without it, 24 threads churning 10 KB responses
132
+ through one glibc heap balloon to ~670 MB—an arena-fragmentation
133
+ footgun, not a leak, and ractor mode sidesteps it. See
134
+ [doc/benchmarks.md](doc/benchmarks.md#memory-under-load-and-the-glibc-arena-footgun).
135
+
107
136
  A common first idea is to keep your current server and wrap the app in
108
137
  a ractor pool. We measured that too (same box; the analysis is in the
109
138
  doc):
110
139
 
111
- | endpoint | Kino :ractor | Puma + ractor wrapper | Falcon + ractor wrapper |
112
- |------------|-------------:|----------------------:|------------------------:|
113
- | /plaintext | **201,472** | 19,425 | 100,624 |
114
- | /cpu (fib) | **66,735** | 17,106 | 49,083 |
115
- | /io (5 ms) | **4,527** | 1,447 | 1,549 |
116
-
117
- In short: ractor mode reaches fork-level CPU parallelism (**5×** Kino's
118
- own GVL-bound threaded mode) in one process, at about 1/19th of the
119
- cluster's memory. Every Kino mode is 1.4-2× ahead of the cluster on
120
- I/O-light endpoints. The macOS numbers (secondary; everything there hits
121
- the loopback ceiling) and the YJIT × Ractors gotcha are in
122
- [doc/benchmarks.md](doc/benchmarks.md).
140
+ | endpoint | Kino :ractor (8×3) | Puma + ractor wrapper | Falcon + ractor wrapper |
141
+ |------------|-------------------:|----------------------:|------------------------:|
142
+ | /plaintext | **193,826** | 19,480 | 99,776 |
143
+ | /cpu (fib) | **68,061** | 17,755 | 48,721 |
144
+ | /io (5 ms) | **4,530** | 1,454 | 1,549 |
145
+
146
+ ### Rails
147
+
148
+ Rails is not Ractor-shareable today, so Kino serves it in `:threaded`
149
+ fallback — one GVL-bound process. On the same box (`examples/rails-hello`,
150
+ edge Rails, production, 8×5):
151
+
152
+ | Rails hello-world | req/s | memory (PSS) |
153
+ |------------------------------|-------:|-------------:|
154
+ | Kino :threaded (one process) | 2,637 | **92 MB** |
155
+ | Puma cluster (8 workers) | 12,138 | 389 MB |
156
+
157
+ The honest trade-off: Puma's fork cluster uses all 8 cores, so it serves
158
+ ~4.6× the throughput — at ~4× the memory. Ractor-mode Rails would close
159
+ the throughput gap at one-process memory cost; the upstream blockers are
160
+ tracked in [doc/rails-on-ractors.md](doc/rails-on-ractors.md).
161
+
162
+ In short: on the tiny synthetic app, ractor mode beats fork-level CPU parallelism (**5.8×** Kino's
163
+ own GVL-bound threaded mode, +34% over the cluster) in one process, at
164
+ about 1/7th of the cluster's memory by PSS (~4× on a real Rails app).
165
+ Every Kino mode is 1.5-2.1× ahead of the cluster on I/O-light endpoints. The macOS numbers
166
+ (secondary; everything there hits the loopback ceiling) and the
167
+ YJIT × Ractors gotcha are in [doc/benchmarks.md](doc/benchmarks.md).
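The headline ratios in this summary follow directly from the tables above. The short Ruby sketch below recomputes them from the published figures; `pct_over` and `times` are hypothetical helper names, and the inputs are copied from the /cpu row, the memory tables and the Rails table.

```ruby
# Percentage by which a exceeds b, rounded to a whole percent.
def pct_over(a, b)
  ((a.to_f / b - 1) * 100).round
end

# How many times larger a is than b, to one decimal place.
def times(a, b)
  (a.to_f / b).round(1)
end

# /cpu (fib) row, req/s
cpu = { ractor: 77_999, lanes: 70_885, threaded: 13_429, puma: 58_006 }

puts "ractor vs cluster on /cpu:  +#{pct_over(cpu[:ractor], cpu[:puma])}%"
puts "lanes vs cluster on /cpu:   +#{pct_over(cpu[:lanes], cpu[:puma])}%"
puts "ractor vs threaded on /cpu: #{times(cpu[:ractor], cpu[:threaded])}x"

# Memory by PSS, MB
puts "tiny app, :ractor:   #{times(1_068, 148)}x lighter than the cluster"
puts "tiny app, :threaded: #{times(1_068, 107)}x lighter than the cluster"
puts "Rails hello-world:   #{times(389, 92)}x lighter than the cluster"

# Rails hello-world throughput, req/s
puts "Rails throughput, Puma vs Kino: #{times(12_138, 2_637)}x"
```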
123
168
 
124
169
  Reproduce: `bench/run.sh [seconds] [concurrency]` for the main table,
125
170
  `bench/studies.sh` for the follow-ups (CPU recipe, topology, scaling,
@@ -174,10 +219,10 @@ server = Kino::Server.new(app,
174
219
  bind: "127.0.0.1",
175
220
  port: 9292, # 0 = ephemeral; read back via server.port
176
221
  workers: Etc.nprocessors, # ractors (parallelism)
177
- threads: 3, # threads per ractor (I/O concurrency, Puma-style)
222
+ threads: 1, # per worker; ractor default 1, threaded default 3
178
223
  mode: :auto, # :auto | :ractor | :threaded
179
224
  queue_depth: 1024, # bounded queue; overflow → 503
180
- queue_timeout: 1.0, # seconds before 503 on a full queue
225
+ queue_timeout: 5.0, # seconds before 503 on a full queue
181
226
  request_timeout: nil, # seconds before a slow response becomes a 504 (nil = off)
182
227
  shutdown_timeout: 30, # drain deadline
183
228
  tls: { cert: "cert.pem", key: "key.pem" }, # file paths or inline PEM
@@ -210,7 +255,7 @@ kwargs and CLI flags > config file > defaults.
210
255
  # kino.rb
211
256
  port 9292
212
257
  workers 8
213
- threads 3
258
+ threads 1
214
259
  mode :ractor
215
260
  ```
216
261
 
@@ -266,7 +311,7 @@ cost):
266
311
 
267
312
  ```ruby
268
313
  server.stats
269
- # => {mode: :ractor, lanes: false, workers: 8, threads: 3, batch: 1,
314
+ # => {mode: :ractor, lanes: false, workers: 8, threads: 1, batch: 1,
270
315
  # respawns: 0, queued: 0, in_flight: 2, served: 1041, rejected: 0,
271
316
  # timeouts: 0}
272
317
  # plus lane_depths: [...] when lane dispatch is on
@@ -276,19 +321,20 @@ From the outside, `kill -USR1 <pid>` prints the same snapshot as one line
276
321
  (pair it with `pidfile` to find the pid):
277
322
 
278
323
  ```
279
- Kino stats: mode=:ractor lanes=false workers=8 threads=3 batch=1 respawns=0 queued=0 in_flight=2 served=1041 rejected=0 timeouts=0
324
+ Kino stats: mode=:ractor lanes=false workers=8 threads=1 batch=1 respawns=0 queued=0 in_flight=2 served=1041 rejected=0 timeouts=0
280
325
  ```
281
326
 
282
327
  ## Logging
283
328
 
284
329
  With one log line per request, `Kino::Logger` sustained **2.4× the
285
- throughput of a shared `::Logger`** (151k vs 63k req/s on the benchmark
330
+ throughput of a shared `::Logger`** (149k vs 63k req/s on the benchmark
286
331
  box). There are two native pieces. Both write through a lock-free
287
332
  channel to a Rust flusher thread, so request threads never take a log
288
333
  mutex and never make a write syscall:
289
334
 
290
335
  - **Access log** (`log_requests true`): one line per request to stdout,
291
- including the 503s that never reach your app. On color terminals the
336
+ including the 503s that never reach your app. Recommended in
337
+ development; cheap enough for production. On color terminals the
292
338
  lines are tinted by status class: 2xx green, 3xx yellow, 4xx maroon,
293
339
  5xx bright red:
294
340