omq-zstd 0.4.0

data/RFC.md ADDED
@@ -0,0 +1,453 @@
1
+ # ZMTP over Zstd+TCP: Zstandard-Compressed TCP Transport for ZMTP
2
+
3
+ | Field | Value |
4
+ |----------|----------------------------------------------------|
5
+ | Status | Draft |
6
+ | Editor | Patrik Wenger |
7
+ | Requires | [RFC 37/ZMTP 3.1](https://rfc.zeromq.org/spec/37/) |
8
+
9
+ ## 1. Abstract
10
+
11
+ This specification defines `zstd+tcp://`, a TCP transport for ZMTP 3.1
12
+ that applies per-part Zstandard compression after the ZMTP handshake.
13
+ Both peers use the `zstd+tcp://` scheme in their endpoint URIs. The ZMTP
14
+ greeting and handshake proceed over raw TCP exactly as they would over
15
+ `tcp://`. After the handshake completes, every message part on the wire
16
+ is individually encoded with a 4-byte sentinel dispatch that
17
+ distinguishes uncompressed plaintext, Zstandard-compressed frames, and
18
+ dictionary shipments. No ZMTP properties, command frames, or
19
+ negotiation are involved: compression is an intrinsic property of the
20
+ transport, just as encryption is an intrinsic property of TLS.
21
+
22
+ ## 2. Motivation
23
+
24
+ Zstandard at low compression levels encodes in single-digit microseconds
25
+ per kilobyte, decompresses faster still, and on dictionary-trained
26
+ workloads compresses small frames to a fraction of their size. For most
27
+ ZMTP deployments compression can be treated as almost free CPU-wise,
28
+ while recovering large fractions of the wire budget.
29
+
30
+ Network-bound or bandwidth-constrained deployments (publish/subscribe
31
+ fan-out, cross-region replication, IoT telemetry) trade a small amount
32
+ of CPU for a large reduction in wire time. Zstandard's dictionary mode
33
+ is a good fit for the small-message profile typical of ZMQ workloads.
34
+
35
+ ZMTP applications today either accept the wire cost or layer ad-hoc,
36
+ per-payload compression into the application format. The latter requires
37
+ both sides to opt in and bakes compression into the payload rather than
38
+ the transport. `zstd+tcp://` replaces it with a transport-level
39
+ mechanism that any ZMTP application benefits from without changes to the
40
+ payload.
41
+
42
+ ### 2.1 Why a transport scheme
43
+
44
+ Compression could live at one of three layers. Two of them have a fatal
45
+ flaw; only the transport layer does not.
46
+
47
+ **Socket-level wrapper** (too high). A wrapper above routing knows
48
+ nothing about transports. It compresses local connections (pure
49
+ overhead) and cannot act on new connections naturally — dictionary
50
+ shipping requires per-connection state, but a wrapper only sees messages
51
+ after routing has dispatched them. Reconnect handling requires hooking
52
+ into connection lifecycle events that are awkward from outside.
53
+
54
+ **ZMTP connection layer** (too low). Embedding compression into each
55
+ ZMTP connection means fan-out patterns compress the same message N times
56
+ (once per subscriber connection). The connection layer has no
57
+ socket-wide view, so there is no way to share compression work across
58
+ connections.
59
+
60
+ **Transport layer** (right). `zstd+tcp://` makes transport selection
61
+ explicit in the endpoint URI. Only TCP connections get compressed. Local
62
+ transports are unaffected even on the same socket. Dictionary lifetime
63
+ matches connection lifetime naturally (new connection = new wrapper =
64
+ re-ship dictionary). No negotiation is needed — both peers use
65
+ `zstd+tcp://`. The codec is socket-wide (shared across connections), so
66
+ fan-out patterns compress once and reuse the result.
67
+
68
+ ### 2.2 Why not negotiate
69
+
70
+ ZMTP 3.1 already supports unknown READY properties — an unaware peer
71
+ silently ignores them. A negotiation-based design could fall back to
72
+ plaintext when the peer does not understand compression. But this
73
+ introduces complexity (profile matching, asymmetric per-direction state,
74
+ passive senders) for a marginal benefit: in practice, compression is a
75
+ deployment decision, not a runtime discovery. Both peers are configured
76
+ to use `zstd+tcp://` or they are not. The transport scheme approach
77
+ eliminates the entire negotiation surface and its edge cases.
78
+
79
+ ### 2.3 Why Zstandard
80
+
81
+ Zstandard at low levels matches LZ4 on encode latency, beats it on
82
+ decompression speed and ratio at every realistic ZMQ payload size, and
83
+ has a first-class dictionary story. The decompression advantage is
84
+ particularly important for fan-out patterns (PUB/SUB, RADIO/DISH): the
85
+ publisher pays one compress, every subscriber pays decompress, so
86
+ per-subscriber CPU dominates the total budget.
87
+
88
+ ## 3. Goals and Non-goals
89
+
90
+ ### 3.1 Goals
91
+
92
+ - Transparent to application code: send/receive operations see plaintext.
93
+ - Per-part sender decision: opt out for short or incompressible parts.
94
+ - Works for legacy multipart socket types (PUSH/PULL, PUB/SUB, ...) and
95
+ draft single-frame types alike.
96
+ - Small-message-friendly via an optional shared dictionary, either
97
+ supplied out of band or automatically trained from early traffic.
98
+ - No ZMTP-level negotiation, no new READY properties, no new command
99
+ frames.
100
+
101
+ ### 3.2 Non-goals
102
+
103
+ - New ZMTP mechanism, new socket type, new greeting, new frame flag bit.
104
+ - Compression of the ZMTP greeting or command frames (READY, SUBSCRIBE,
105
+ PING, PONG, ...).
106
+ - Application to non-TCP transports (`inproc://` is zero-copy —
107
+ compression is pure overhead; `ipc://` rarely benefits).
108
+ - Replacing or weakening CurveZMQ or any other security mechanism.
109
+ See Sec. 8.
110
+ - Streaming / context-takeover compression. Each part is decodable in
111
+ isolation with no dependency on a previous part's LZ77 history.
112
+
113
+ ## 4. Terminology
114
+
115
+ | Term | Meaning |
116
+ |---------------------|---------------------------------------------------------------------------|
117
+ | Part | One ZMTP message frame body. A multipart message has multiple parts. |
118
+ | Sentinel | The first 4 bytes of a post-handshake part on the wire (Sec. 5.1). |
119
+ | Uncompressed part | A wire part whose sentinel is `00 00 00 00`. |
120
+ | Compressed part | A wire part whose first 4 bytes are the Zstandard magic `28 B5 2F FD`. |
121
+ | Dictionary part | A wire part whose first 4 bytes are `37 A4 30 EC` (Sec. 6). |
122
+ | Dictionary message | A single-part ZMTP message consisting of exactly one dictionary part. |
123
+
124
+ ## 5. Part Encoding
125
+
126
+ After the ZMTP handshake completes, every message part on the wire is
127
+ individually encoded. The ZMTP MORE flag is carried on the wire frame
128
+ header as normal. Multipart messages are encoded part by part — each
129
+ part is independent.
130
+
131
+ ### 5.1 Sentinel dispatch
132
+
133
+ The first 4 bytes of each wire part determine how it is decoded.
134
+
135
+ | Sentinel (hex) | Meaning |
136
+ |------------------|-----------------------------------------------------------------|
137
+ | `00 00 00 00` | Uncompressed plaintext (Sec. 5.3) |
138
+ | `28 B5 2F FD` | Zstandard compressed frame (Sec. 5.4) |
139
+ | `37 A4 30 EC` | Dictionary shipment (Sec. 6) |
140
+
141
+ All other 4-byte values are reserved. A receiver that encounters an
142
+ unknown sentinel MUST drop the connection with an error.
143
+
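The dispatch table above can be sketched as a lookup on the first 4 bytes. This is a minimal illustration, not the reference implementation: `SentinelError` and `classify_part` are hypothetical names, and raising stands in for "drop the connection with an error".

```ruby
# Sentinel values from Sec. 5.1, keyed as binary strings.
SENTINELS = {
  "\x00\x00\x00\x00".b => :uncompressed,
  "\x28\xB5\x2F\xFD".b => :zstd_frame,
  "\x37\xA4\x30\xEC".b => :dictionary,
}.freeze

# Hypothetical error class; a real transport would close the connection.
class SentinelError < StandardError; end

# Classifies one post-handshake wire part by its first 4 bytes.
def classify_part(wire)
  bytes = wire.b # compare as raw bytes, regardless of source encoding
  raise SentinelError, "part shorter than 4 bytes" if bytes.bytesize < 4

  SENTINELS.fetch(bytes.byteslice(0, 4)) do
    raise SentinelError, "unknown sentinel"
  end
end
```

Any value outside the three known sentinels, including a part too short to carry one, ends in an error rather than a guess.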
144
+ ### 5.2 Compression level
145
+
146
+ The default compression level is **-3** (Zstandard fast strategy). At
147
+ this level the encoder cost is in the low single-digit microseconds per
148
+ kilobyte, and the achieved ratio is within a few percent of level 3 once
149
+ a dictionary is in play.
150
+
151
+ The compression level is a sender choice and is not communicated on the
152
+ wire — the receiver decodes any valid Zstandard frame regardless of the
153
+ level used to encode it. Implementations SHOULD expose the level as a
154
+ configurable parameter.
155
+
156
+ ### 5.3 Uncompressed sentinel `00 00 00 00`
157
+
158
+ ```
159
+ +------------------+-------------------+
160
+ | 00 00 00 00 | plaintext payload |
161
+ | (4 bytes) | (N bytes) |
162
+ +------------------+-------------------+
163
+ ```
164
+
165
+ The sender uses this sentinel when it decides not to compress the part.
166
+ The 4-byte overhead is the price of per-part selective compression
167
+ without an extra flag bit in the ZMTP frame header.
168
+
169
+ Four zero bytes cannot collide with a valid Zstandard frame magic or the
170
+ dictionary sentinel, so no ambiguity arises.
171
+
172
+ ### 5.4 Compressed Zstandard frame
173
+
174
+ ```
175
+ +------------------+
176
+ | Zstandard frame |
177
+ | (M bytes) |
178
+ +------------------+
179
+ ```
180
+
181
+ The wire part IS the Zstandard frame — its first 4 bytes are the
182
+ standard Zstandard frame magic `28 B5 2F FD`. No additional framing is
183
+ added.
184
+
185
+ The sender MUST configure the encoder to write the `Frame_Content_Size`
186
+ field in the Zstandard frame header (RFC 8878 §3.1.1.1.4). This field
187
+ is required for the receiver's budget enforcement (Sec. 5.6).
188
+
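Whether a frame carries `Frame_Content_Size` is visible in the single descriptor byte that follows the magic. The sketch below decodes that byte using the bit layout from RFC 8878; the method name and the returned hash keys are hypothetical, chosen only for illustration.

```ruby
# Decodes the Frame_Header_Descriptor, the byte that follows the 4-byte
# frame magic, using the bit layout from RFC 8878.
def describe_frame_header(fhd)
  fcs_flag       = (fhd >> 6) & 0x03          # Frame_Content_Size_Flag
  single_segment = ((fhd >> 5) & 0x01) == 1   # Single_Segment_Flag
  did_flag       = fhd & 0x03                 # Dictionary_ID_Flag

  # With flag 0 the field exists (1 byte) only when Single_Segment is set.
  fcs_field_size =
    if fcs_flag.zero?
      single_segment ? 1 : 0
    else
      [0, 2, 4, 8][fcs_flag]
    end

  {
    fcs_field_size: fcs_field_size,       # 0 means the field is absent
    window_descriptor: !single_segment,   # present unless single-segment
    dict_id_size: [0, 1, 2, 4][did_flag], # bytes of Dictionary_ID in header
  }
end
```

A descriptor of `0x00` yields `fcs_field_size: 0`, which is the case a receiver has to reject under the rule above.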
189
+ ### 5.5 Sender rules
190
+
191
+ For each outgoing message part, the sender proceeds as follows:
192
+
193
+ 1. Compute `min_size`:
194
+ - If a dictionary is currently installed: **64 bytes**.
195
+ - Otherwise: **512 bytes**.
196
+
197
+ These thresholds reflect empirical measurement: without a dictionary,
198
+ Zstandard cannot usefully compress typical payloads below ~512 bytes;
199
+ with a dictionary, even 64-byte payloads compress to ~20 bytes.
200
+ Implementations MAY tune these thresholds.
201
+
202
+ 2. If `plaintext_size < min_size`, prepend `00 00 00 00` and emit.
203
+
204
+ 3. Otherwise, run the Zstandard encoder. The encoder MUST write the
205
+ `Frame_Content_Size` field. If the compressed output's size is
206
+ ≥ `plaintext_size - 4` (the frame saves at most 8 bytes over the
207
+ `plaintext_size + 4` bytes of the uncompressed alternative), prepend
208
+ `00 00 00 00` and emit the plaintext instead. Otherwise emit the
209
+ Zstandard frame as-is.
210
+
211
+ 4. If the plaintext's first 4 bytes happen to be `28 B5 2F FD` or
212
+ `37 A4 30 EC` and the sender chooses not to compress, the sender
213
+ MUST still prepend `00 00 00 00` to avoid sentinel ambiguity.
214
+ Step 2 and step 3's fallback path already guarantee this.
215
+
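The size arithmetic behind steps 1 to 3 can be checked with a few numbers. The sketch below works on sizes only and runs no encoder; `choose_wire_form` is a hypothetical helper, and the thresholds are the defaults from this section.

```ruby
SENTINEL_LEN           = 4
MIN_COMPRESS_NO_DICT   = 512
MIN_COMPRESS_WITH_DICT = 64

# Returns [form, wire_size] for one part, given its plaintext size and the
# size the encoder produced (ignored when the part is below the threshold).
def choose_wire_form(plaintext_size, compressed_size, dict_installed: false)
  min_size = dict_installed ? MIN_COMPRESS_WITH_DICT : MIN_COMPRESS_NO_DICT
  plain    = [:plain, plaintext_size + SENTINEL_LEN]

  return plain if plaintext_size < min_size                        # step 2
  return plain if compressed_size >= plaintext_size - SENTINEL_LEN # step 3

  [:zstd, compressed_size]
end
```

For a 600-byte part without a dictionary, a 596-byte frame falls back to plaintext (604 bytes on the wire), while a 595-byte frame is emitted as-is.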
216
+ ### 5.6 Receiver rules
217
+
218
+ For each incoming wire part, the receiver proceeds as follows:
219
+
220
+ 1. Read the first 4 bytes as the sentinel. If the part is shorter than
221
+ 4 bytes, drop the connection with an error.
222
+
223
+ 2. Sentinel `00 00 00 00`: the bytes after the sentinel are plaintext.
224
+ Return them.
225
+
226
+ 3. Sentinel `28 B5 2F FD`: the entire wire part is a Zstandard frame.
227
+ - Read the `Frame_Content_Size` field from the Zstandard header. If
228
+ the field is absent, drop the connection with an error.
229
+ - If the connection enforces a maximum message size, add this part's
230
+ declared content size to the running decompressed total for the
231
+ current multipart message (parts chained by the ZMTP MORE flag).
232
+ If the running total would exceed the maximum, drop the connection
233
+ with an error without invoking the decoder.
234
+ - Invoke the decoder in a bounded mode that aborts if it would write
235
+ more bytes than `Frame_Content_Size` declared. On such an abort,
236
+ drop the connection with an error.
237
+ - Return the decompressed plaintext.
238
+
239
+ 4. Sentinel `37 A4 30 EC`: dictionary shipment. See Sec. 6.
240
+
241
+ 5. Any other sentinel: drop the connection with an error.
242
+
243
+ The maximum message size always refers to the **decompressed** plaintext
244
+ summed across all parts of a multipart message. A multipart message
245
+ whose total wire length is small but whose total decompressed size
246
+ exceeds the limit MUST be rejected before decoder invocation.
247
+
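The running-total check can be sketched as a small per-connection tracker. `MessageBudget` and `BudgetExceeded` are hypothetical names, and raising stands in for dropping the connection. The sketch counts whatever size the caller passes; feeding it the length of uncompressed parts as well is an assumption, since the rules above only spell out the compressed case.

```ruby
# Hypothetical error; a real transport would close the connection instead.
class BudgetExceeded < StandardError; end

# Tracks the declared decompressed size of the current multipart message.
class MessageBudget
  def initialize(max_message_size)
    @max   = max_message_size # nil means no limit is enforced
    @total = 0
  end

  # declared_size: the part's Frame_Content_Size, read before decoding.
  # more: the ZMTP MORE flag of this part.
  # Returns the running total after admitting the part.
  def admit!(declared_size, more:)
    @total += declared_size
    if @max && @total > @max
      raise BudgetExceeded, "message would exceed #{@max} bytes"
    end

    total  = @total
    @total = 0 unless more # last part: the next message starts from zero
    total
  end
end
```

Calling `admit!` before the decoder runs is what lets an oversized message be rejected on its headers alone.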
248
+ ## 6. Dictionary Shipment
249
+
250
+ ### 6.1 Dictionary message format
251
+
252
+ A dictionary is shipped as a **single-part ZMTP message** (no MORE flag)
253
+ whose body begins with the dictionary sentinel:
254
+
255
+ ```
256
+ +------------------+------------------------+
257
+ | 37 A4 30 EC | dictionary bytes |
258
+ | (4 bytes) | (D bytes) |
259
+ +------------------+------------------------+
260
+ ```
261
+
262
+ The sentinel `37 A4 30 EC` has the same bytes as the Zstandard dictionary
263
+ magic number (`0xEC30A437`, stored little-endian). It cannot collide
264
+ with the Zstandard frame magic or the uncompressed sentinel.
265
+
266
+ The remaining `D` bytes are the raw dictionary as it should be passed
267
+ to the Zstandard decoder's dictionary-load operation.
268
+
269
+ ### 6.2 Constraints
270
+
271
+ - A dictionary message MUST be a single-part ZMTP message (MORE flag
272
+ not set on the frame header). A dictionary sentinel in any part of a
273
+ multipart message is a protocol error.
274
+
275
+ - A dictionary message MUST NOT exceed **64 KiB** total (sentinel +
276
+ dictionary bytes). A receiver that receives a dictionary message
277
+ larger than 64 KiB MUST drop the connection with an error.
278
+
279
+ - A sender MUST send at most **one** dictionary message per direction
280
+ per connection. A receiver that receives a second dictionary message
281
+ on the same connection MUST drop the connection with an error.
282
+
283
+ - A dictionary message MUST be sent BEFORE any compressed part that
284
+ references the dictionary. In practice the sender ships the dictionary
285
+ before the first compressed write that uses it; with automatic
286
+ training, that is right after training completes.
287
+
288
+ ### 6.3 Receiver handling
289
+
290
+ When the receiver encounters a dictionary part:
291
+
292
+ 1. Validate the constraints in Sec. 6.2.
293
+ 2. Strip the 4-byte sentinel.
294
+ 3. Install the remaining bytes as the decompression dictionary for this
295
+ connection.
296
+ 4. Discard the message — it is not delivered to the application.
297
+
298
+ Because a dictionary message is always single-part, nothing remains to
299
+ deliver once the dictionary is installed, and the receiver loops to
300
+ receive the next message.
301
+
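The constraints in Sec. 6.2 and the handling steps above could be combined roughly as follows. This is a sketch under assumptions: `DictionaryReceiver` and `DictionaryError` are hypothetical names, raising stands in for dropping the connection, and loading the bytes into an actual Zstandard decoder is left out.

```ruby
# Hypothetical error; a real transport would close the connection instead.
class DictionaryError < StandardError; end

# Per-connection receive-side state for dictionary messages.
class DictionaryReceiver
  DICT_SENTINEL = "\x37\xA4\x30\xEC".b.freeze
  MAX_DICT_SIZE = 64 * 1024 # sentinel + dictionary bytes

  attr_reader :dictionary

  def initialize
    @dictionary = nil
  end

  # parts: the wire parts of one ZMTP message.
  # Returns true if the message was a dictionary message and was consumed,
  # false if it should be decoded and delivered as usual.
  def consume(parts)
    return false if parts.none? { |p| p.b.start_with?(DICT_SENTINEL) }

    raise DictionaryError, "dictionary sentinel in multipart message" if parts.size > 1

    wire = parts.first.b
    raise DictionaryError, "dictionary message exceeds 64 KiB" if wire.bytesize > MAX_DICT_SIZE
    raise DictionaryError, "second dictionary on this connection" if @dictionary

    @dictionary = wire.byteslice(4, wire.bytesize - 4) # strip the sentinel
    true
  end
end
```

A `true` return tells the caller to skip delivery and read the next message.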
302
+ ### 6.4 Dictionary scope
303
+
304
+ The dictionary a sender ships applies to a single direction of a single
305
+ connection. Each peer may independently ship its own dictionary for its
306
+ own send direction. The common deployment is one-directional: a
307
+ publisher ships its dictionary; subscribers decode with it and send
308
+ nothing (or uncompressed traffic) back.
309
+
310
+ The sender's dictionary is typically socket-wide: trained once from
311
+ early traffic across all connections and reused. But this is an
312
+ implementation choice — the wire protocol carries no dictionary identity
313
+ or scope metadata.
314
+
315
+ An implementation MAY pool training samples and share the resulting
316
+ auto-trained dictionary across all `zstd+tcp://` connections of a
317
+ single socket. This is beneficial when a socket binds or connects
318
+ multiple `zstd+tcp://` endpoints: samples from one endpoint accelerate
319
+ training for all of them, and newly opened connections benefit from a
320
+ dictionary trained by their predecessors. Connections that were
321
+ configured with an explicit out-of-band dictionary MUST NOT participate
322
+ in shared training — they use their own dictionary independently.
323
+
324
+ ### 6.5 Automatic dictionary training
325
+
326
+ A sender MAY train a dictionary automatically from early traffic:
327
+
328
+ 1. Buffer plaintext samples from the first messages. Samples larger
329
+ than **1024 bytes** SHOULD be skipped — dictionaries primarily
330
+ benefit small frames.
331
+ 2. When the buffer reaches **1000 samples** OR **100 KiB** of
332
+ plaintext (whichever comes first), train a Zstandard dictionary from
333
+ the buffered samples and discard the buffer.
334
+ 3. The recommended dictionary capacity (training target size) is
335
+ **8 KiB**.
336
+ 4. Ship the trained dictionary via a dictionary message (Sec. 6.1) on
337
+ every connection, before any compressed part that uses it.
338
+ 5. Switch to dictionary-bound compression for all subsequent parts.
339
+
340
+ If training fails (the sample set was too small or too uniform), the
341
+ sender MUST stay in no-dictionary mode for the rest of the socket's
342
+ lifetime. It MUST NOT retry training.
343
+
344
+ ### 6.6 Dictionary ID
345
+
346
+ Auto-trained dictionaries SHOULD be patched with a random dictionary ID
347
+ in the Zstandard private range (32768 to 2^31 - 1) to avoid collisions
348
+ with the dictionary ID ranges that RFC 8878 reserves. Out-of-band dictionaries
349
+ retain whatever dictionary ID they were created with.
350
+
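To check which range a dictionary's ID falls in, the ID can be read back from the header. The sketch assumes the standard Zstandard dictionary layout (4-byte magic followed by a 4-byte little-endian `Dictionary_ID`); both method names are hypothetical.

```ruby
# Range recommended in this section for auto-trained dictionaries.
USER_DICT_ID_RANGE = (32_768..(2**31 - 1)).freeze

# Reads the Dictionary_ID: 4 little-endian bytes after the 4-byte magic.
def dictionary_id(dict_bytes)
  bytes = dict_bytes.b
  raise ArgumentError, "dictionary header too short" if bytes.bytesize < 8

  bytes.byteslice(4, 4).unpack1("V")
end

# True when the dictionary's ID lies in the recommended range.
def private_dictionary_id?(dict_bytes)
  USER_DICT_ID_RANGE.cover?(dictionary_id(dict_bytes))
end
```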
351
+ ## 7. ZMTP Interaction
352
+
353
+ ### 7.1 Greeting and handshake
354
+
355
+ The ZMTP greeting and security mechanism handshake proceed over raw TCP
356
+ exactly as specified by RFC 37. `zstd+tcp://` does not modify the
357
+ greeting, mechanism, READY properties, or any command frames. The
358
+ compression layer activates only after the handshake is complete and the
359
+ connection is ready for message traffic.
360
+
361
+ ### 7.2 Command frames
362
+
363
+ ZMTP command frames (READY, SUBSCRIBE, CANCEL, JOIN, LEAVE, PING,
364
+ PONG) are never compressed. They are sent and received as standard ZMTP
365
+ command frames. Only message frames (the COMMAND bit not set in the
366
+ frame header) are subject to sentinel-dispatched encoding.
367
+
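The distinction rests on one bit of the ZMTP frame flags byte. A minimal sketch, assuming the ZMTP 3.x flag layout (bit 0 MORE, bit 1 LONG, bit 2 COMMAND); the constant and method names are hypothetical.

```ruby
# ZMTP 3.x frame flag bits.
FLAG_MORE    = 0x01
FLAG_LONG    = 0x02
FLAG_COMMAND = 0x04

# True when the frame body is subject to sentinel-dispatched encoding,
# i.e. the frame is a message frame rather than a command frame.
def sentinel_encoded?(flags)
  (flags & FLAG_COMMAND).zero?
end
```

MORE and LONG do not affect the decision; only the COMMAND bit exempts a frame from the encoding.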
368
+ ### 7.3 Socket type compatibility
369
+
370
+ `zstd+tcp://` is compatible with all ZMTP socket types. The socket type
371
+ negotiation in the READY handshake is unaffected.
372
+
373
+ ### 7.4 Peer requirement
374
+
375
+ Both peers of a connection MUST use `zstd+tcp://`. There is no
376
+ fallback to plaintext TCP and no negotiation. A `zstd+tcp://` peer
377
+ connecting to a plain `tcp://` peer (or vice versa) will see garbled
378
+ data or sentinel errors and the connection will fail.
379
+
380
+ ## 8. Security Considerations
381
+
382
+ ### 8.1 Compression combined with encryption (CRIME / BREACH)
383
+
384
+ Combining length-revealing compression with a secure channel that
385
+ carries attacker-influenced plaintext enables CRIME- and BREACH-style
386
+ side-channel attacks. An attacker who can inject chosen bytes into the
387
+ plaintext and observe the ciphertext length can extract secrets byte
388
+ by byte.
389
+
390
+ Implementations SHOULD refuse to layer `zstd+tcp://` inside an
391
+ encrypted tunnel when the plaintext contains attacker-controlled
392
+ content. Deployments that accept this risk MUST do so with explicit
393
+ opt-in.
394
+
395
+ ### 8.2 Length side-channel
396
+
397
+ Compression makes the wire length of a part depend on its content. An
398
+ on-path observer can learn something about the plaintext from the
399
+ compressed length alone. Deployments that care about traffic analysis
400
+ MUST NOT rely on `zstd+tcp://` to hide payload shape.
401
+
402
+ ### 8.3 Dictionary contents
403
+
404
+ When auto-training is enabled, the receiver loads dictionary bytes
405
+ chosen by the peer. The Zstandard reference dictionary loader is
406
+ hardened against malformed inputs, but implementations MUST enforce the
407
+ 64 KiB cap on dictionary messages (Sec. 6.2) and SHOULD NOT cache
408
+ received dictionaries across connections.
409
+
410
+ ### 8.4 Decompression bombs
411
+
412
+ A small compressed frame can decompress to many megabytes of plaintext.
413
+ The receiver rules in Sec. 5.6 mitigate this:
414
+
415
+ 1. Every compressed part MUST carry `Frame_Content_Size`. The receiver
416
+ checks the declared total against the maximum message size before
417
+ invoking the decoder, so a bomb is rejected on its header alone.
418
+ 2. The decoder is invoked in bounded mode — it aborts if it would write
419
+ more bytes than declared. A peer that lies in the header cannot
420
+ expand a part past its declared size.
421
+
422
+ Implementations SHOULD set a conservative maximum message size on
423
+ `zstd+tcp://` connections even if they would otherwise leave it
424
+ unbounded.
425
+
426
+ ## 9. Constants
427
+
428
+ ```
429
+ SENTINEL_UNCOMPRESSED = 00 00 00 00 (4 bytes)
430
+ SENTINEL_ZSTD_FRAME = 28 B5 2F FD (4 bytes, Zstandard frame magic)
431
+ SENTINEL_ZSTD_DICT = 37 A4 30 EC (4 bytes)
432
+
433
+ DEFAULT_LEVEL = -3
434
+
435
+ MIN_COMPRESS_NO_DICT = 512 bytes
436
+ MIN_COMPRESS_WITH_DICT = 64 bytes
437
+
438
+ MAX_DECOMPRESSED_SIZE = 16 MiB (absolute cap per frame)
439
+ MAX_DICT_SIZE = 64 KiB
440
+
441
+ TRAIN_MAX_SAMPLES = 1000
442
+ TRAIN_MAX_BYTES = 100 KiB
443
+ TRAIN_MAX_SAMPLE_LEN = 1024 bytes
444
+ DICT_CAPACITY = 8 KiB
445
+ ```
446
+
447
+ ## 10. References
448
+
449
+ - [RFC 37 / ZMTP 3.1](https://rfc.zeromq.org/spec/37/) — underlying wire protocol
450
+ - [RFC 8878 — Zstandard Compression Data Format](https://datatracker.ietf.org/doc/html/rfc8878)
451
+ - [Zstandard dictionary builder](https://github.com/facebook/zstd/blob/dev/lib/dictBuilder/zdict.h)
452
+ - [CRIME attack](https://en.wikipedia.org/wiki/CRIME) — compression side-channel on TLS
453
+ - [BREACH attack](https://en.wikipedia.org/wiki/BREACH) — HTTP-layer variant
@@ -0,0 +1,163 @@
1
+ # frozen_string_literal: true
2
+
3
+ module OMQ
4
+ module Transport
5
+ module ZstdTcp
6
+ class Codec
7
+ MAX_DICT_SIZE = 64 * 1024
8
+ DICT_CAPACITY = 8 * 1024
9
+ TRAIN_MAX_SAMPLES = 1000
10
+ TRAIN_MAX_BYTES = 100 * 1024
11
+ TRAIN_MAX_SAMPLE_LEN = 1024
12
+ MIN_COMPRESS_NO_DICT = 512
13
+ MIN_COMPRESS_WITH_DICT = 64
14
+
15
+ NUL_PREAMBLE = ("\x00" * 4).b.freeze
16
+ ZSTD_MAGIC = "\x28\xB5\x2F\xFD".b.freeze
17
+ ZDICT_MAGIC = "\x37\xA4\x30\xEC".b.freeze
18
+
19
+ USER_DICT_ID_RANGE = (32_768..(2**31 - 1)).freeze
20
+
21
+
22
+ attr_reader :send_dict_bytes, :max_message_size
23
+
24
+
25
+ def initialize(level:, dict: nil, max_message_size: nil)
26
+ @level = level
27
+ @max_message_size = max_message_size
28
+
29
+ @send_dict = nil
30
+ @send_dict_bytes = nil
31
+
32
+ @training = dict.nil?
33
+ @train_samples = []
34
+ @train_bytes = 0
35
+
36
+ @cached_parts = nil
37
+ @cached_compressed = nil
38
+
39
+ install_send_dict(dict.b) if dict
40
+ end
41
+
42
+
43
+ def compress_parts(parts)
44
+ return @cached_compressed if parts.equal?(@cached_parts)
45
+
46
+ parts.each { |p| maybe_train!(p) }
47
+
48
+ compressed = parts.map { |p| compress_or_plain(p) }
49
+ @cached_parts = parts
50
+ @cached_compressed = compressed.freeze
51
+ compressed
52
+ end
53
+
54
+
55
+ def parse_frame_content_size(wire)
56
+ return nil if wire.bytesize < 5
57
+
58
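+ fhd = wire.getbyte(4) # Frame_Header_Descriptor follows the 4-byte magic
59
+ did_flag = fhd & 0x03 # Dictionary_ID_Flag (bits 0-1)
60
+ single_seg = (fhd >> 5) & 0x01 # Single_Segment_Flag (bit 5)
61
+ fcs_flag = (fhd >> 6) & 0x03 # Frame_Content_Size_Flag (bits 6-7)
62
+
63
+ return nil if fcs_flag == 0 && single_seg == 0 # Frame_Content_Size absent
64
+
65
+ off = 5 + (single_seg == 0 ? 1 : 0) + [0, 1, 2, 4][did_flag] # skip Window_Descriptor and Dictionary_ID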
66
+
67
+ case fcs_flag
68
+ when 0
69
+ return nil if wire.bytesize < off + 1
70
+ wire.getbyte(off)
71
+ when 1
72
+ return nil if wire.bytesize < off + 2
73
+ wire.byteslice(off, 2).unpack1("v") + 256
74
+ when 2
75
+ return nil if wire.bytesize < off + 4
76
+ wire.byteslice(off, 4).unpack1("V")
77
+ when 3
78
+ return nil if wire.bytesize < off + 8
79
+ lo, hi = wire.byteslice(off, 8).unpack("VV")
80
+ (hi << 32) | lo
81
+ end
82
+ end
83
+
84
+
85
+ private
86
+
87
+
88
+ def maybe_train!(part)
89
+ return unless @training
90
+
91
+ bytes = part.is_a?(String) && part.encoding == Encoding::BINARY ? part : part.to_s.b
92
+ return if bytes.bytesize > TRAIN_MAX_SAMPLE_LEN
93
+
94
+ @train_samples << bytes
95
+ @train_bytes += bytes.bytesize
96
+
97
+ return unless @train_samples.size >= TRAIN_MAX_SAMPLES ||
98
+ @train_bytes >= TRAIN_MAX_BYTES
99
+
100
+ begin
101
+ trained = RZstd::Dictionary.train(@train_samples, capacity: DICT_CAPACITY)
102
+ rescue RuntimeError
103
+ @training = false
104
+ @train_samples = nil
105
+ return
106
+ end
107
+
108
+ @training = false
109
+ @train_samples = nil
110
+
111
+ patched = patch_auto_dict_id(trained)
112
+ install_send_dict(patched)
113
+ end
114
+
115
+
116
+ def patch_auto_dict_id(bytes)
117
+ out = bytes.dup.b
118
+ id = rand(USER_DICT_ID_RANGE)
119
+ out[4, 4] = [id].pack("V")
120
+ out
121
+ end
122
+
123
+
124
+ def install_send_dict(bytes)
125
+ unless bytes.byteslice(0, 4) == ZDICT_MAGIC
126
+ raise ProtocolError, "supplied dict is not ZDICT-format"
127
+ end
128
+
129
+ if bytes.bytesize > MAX_DICT_SIZE
130
+ raise ProtocolError, "dict exceeds #{MAX_DICT_SIZE} bytes"
131
+ end
132
+
133
+ @send_dict = RZstd::Dictionary.new(bytes, level: @level)
134
+ @send_dict_bytes = bytes
135
+ end
136
+
137
+
138
+ def compress_or_plain(part)
139
+ bytes = part.is_a?(String) && part.encoding == Encoding::BINARY ? part : part.to_s.b
140
+ threshold = @send_dict ? MIN_COMPRESS_WITH_DICT : MIN_COMPRESS_NO_DICT
141
+ return plain(bytes) if bytes.bytesize < threshold
142
+
143
+ compressed =
144
+ if @send_dict
145
+ @send_dict.compress(bytes)
146
+ else
147
+ RZstd.compress(bytes, level: @level)
148
+ end
149
+
150
+ return plain(bytes) if compressed.bytesize >= bytes.bytesize - 4
151
+
152
+ compressed
153
+ end
154
+
155
+
156
+ def plain(body)
157
+ NUL_PREAMBLE + body
158
+ end
159
+
160
+ end
161
+ end
162
+ end
163
+ end