turbopipe 1.0.4__tar.gz → 1.0.5__tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.


@@ -0,0 +1,368 @@
1
+ Metadata-Version: 2.1
2
+ Name: turbopipe
3
+ Version: 1.0.5
4
+ Summary: 🌀 Faster ModernGL Buffer inter process data transfers
5
+ Home-page: https://brokensrc.dev
6
+ Author-Email: Tremeschin <29046864+Tremeschin@users.noreply.github.com>
7
+ License: MIT License
8
+
9
+ Copyright (c) 2024 Gabriel Tremeschin
10
+
11
+ Permission is hereby granted, free of charge, to any person obtaining a copy
12
+ of this software and associated documentation files (the "Software"), to deal
13
+ in the Software without restriction, including without limitation the rights
14
+ to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
15
+ copies of the Software, and to permit persons to whom the Software is
16
+ furnished to do so, subject to the following conditions:
17
+
18
+ The above copyright notice and this permission notice shall be included in all
19
+ copies or substantial portions of the Software.
20
+
21
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
22
+ IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
23
+ FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
24
+ AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
25
+ LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
26
+ OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
27
+ SOFTWARE.
28
+ Project-URL: Issues, https://github.com/BrokenSource/TurboPipe/issues
29
+ Project-URL: Repository, https://github.com/BrokenSource/TurboPipe
30
+ Project-URL: Documentation, https://github.com/BrokenSource/TurboPipe
31
+ Project-URL: Homepage, https://brokensrc.dev
32
+ Requires-Python: >=3.7
33
+ Requires-Dist: moderngl
34
+ Description-Content-Type: text/markdown
35
+
36
+ > [!IMPORTANT]
37
+ > <sub>Also check out [**ShaderFlow**](https://github.com/BrokenSource/ShaderFlow), where **TurboPipe** shines! 😉</sub>
38
+
39
+ <div align="center">
40
+ <a href="https://brokensrc.dev/"><img src="https://raw.githubusercontent.com/BrokenSource/TurboPipe/main/turbopipe/resources/images/turbopipe.png" width="200"></a>
41
+ <h1>TurboPipe</h1>
42
+ <br>
43
+ Faster <a href="https://github.com/moderngl/moderngl"><b>ModernGL</b></a> inter-process data transfers
44
+ </div>
45
+
46
+ <br>
47
+
48
+ # 🔥 Description
49
+
50
+ > TurboPipe speeds up sending raw bytes from `moderngl.Buffer` objects, primarily to an `FFmpeg` subprocess
51
+
52
+ The **optimizations** involved are:
53
+
54
+ - **Zero-copy**: Avoid unnecessary memory copies or allocation (intermediate `buffer.read()`)
55
+ - **C++**: The core of TurboPipe is written in C++ for speed, efficiency and low-level control
56
+ - **Chunks**: Write in chunks of 4096 bytes (RAM page size), so the hardware is happy
57
+ - **Threaded**:
58
+ - Doesn't block Python code execution, allowing the next frame to be rendered
59
+ - Decouples the main thread from the I/O thread for performance
60
+
61
+ ✅ Don't worry, there's proper **safety** in place. TurboPipe will block Python if a memory address is already queued for writing, and guarantees order of writes per file-descriptor. Just call `.sync()` when done 😉
62
+
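To make the **Zero-copy** and **Chunks** points concrete, here is a minimal pure-Python sketch of chunked writing to a file descriptor. It is not TurboPipe's C++ implementation: `write_chunked` is a hypothetical helper, and the `memoryview` stands in for the mapped buffer memory that the real code reads from.

```python
import os

CHUNK = 4096  # chunk size quoted in the list above


def write_chunked(fd: int, data, chunk: int = CHUNK) -> int:
    """Write `data` to `fd` in `chunk`-sized slices without copying it first."""
    # Slicing a memoryview creates a new view, not a copy of the bytes
    view = memoryview(data).cast("B")
    written = 0
    while written < len(view):
        # os.write may write fewer bytes than requested, so advance by its result
        written += os.write(fd, view[written:written + chunk])
    return written
```

The function returns the total number of bytes written, so a caller can check it against the buffer size.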
63
+ <br>
64
+
65
+ # 📦 Installation
66
+
67
+ It couldn't be easier! Just install the [**`turbopipe`**](https://pypi.org/project/turbopipe/) package from PyPI:
68
+
69
+ ```bash
70
+ # With pip (https://pip.pypa.io/)
71
+ python -m pip install turbopipe
72
+
73
+ # With Poetry (https://python-poetry.org/)
74
+ poetry add turbopipe
75
+
76
+ # With PDM (https://pdm-project.org/en/latest/)
77
+ pdm add turbopipe
78
+
79
+ # With Rye (https://rye.astral.sh/)
80
+ rye add turbopipe
81
+ ```
82
+
83
+ <br>
84
+
85
+ # 🚀 Usage
86
+
87
+ See also the [**Examples**](https://github.com/BrokenSource/TurboPipe/tree/main/examples) folder for more controlled usage, and [**ShaderFlow**](https://github.com/BrokenSource/ShaderFlow/blob/main/ShaderFlow/Scene.py) usage of it!
88
+
89
+ ```python
90
+ import subprocess
91
+
92
+ import moderngl
93
+ import turbopipe
94
+
95
+ # Create ModernGL objects
96
+ ctx = moderngl.create_standalone_context()
97
+ buffer = ctx.buffer(reserve=1920*1080*3)
98
+
99
+ # Make sure resolution, pixel format matches!
100
+ ffmpeg = subprocess.Popen(
101
+     'ffmpeg -f rawvideo -pix_fmt rgb24 -r 60 -s 1920x1080 -i - -f null -'.split(),
101
+     stdin=subprocess.PIPE
103
+ )
104
+
105
+ # Rendering loop of yours (eg. 1m footage)
106
+ for _ in range(60 * 60):
107
+     turbopipe.pipe(buffer, ffmpeg.stdin.fileno())
108
+
109
+ # Finalize writing
110
+ turbopipe.sync()
111
+ ffmpeg.stdin.close()
112
+ ffmpeg.wait()
113
+ ```
114
+
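Since the buffer size and the FFmpeg arguments must agree, one option is to derive both from the same numbers. The sketch below assumes the `rgb24` layout used in the example above; `frame_bytes` and `rawvideo_command` are hypothetical helper names, not part of TurboPipe.

```python
def frame_bytes(width: int, height: int, components: int = 3) -> int:
    """Size in bytes of one raw frame (rgb24 has 3 one-byte components)."""
    return width * height * components


def rawvideo_command(width: int, height: int, fps: int) -> list:
    """Build the same FFmpeg argument list as the usage example."""
    return [
        "ffmpeg",
        "-f", "rawvideo",
        "-pix_fmt", "rgb24",
        "-r", str(fps),
        "-s", f"{width}x{height}",
        "-i", "-",          # read frames from stdin
        "-f", "null", "-",  # discard the output
    ]
```

The result of `frame_bytes(1920, 1080)` could then be passed as `reserve=` to `ctx.buffer`, and `rawvideo_command(1920, 1080, 60)` to `subprocess.Popen`.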
115
+ <br>
116
+
117
+ # ⭐️ Benchmarks
118
+
119
+ > [!NOTE]
120
+ > **The test conditions are as follows**:
121
+ > - Each result is the average of 3 runs to ensure consistency, with 5 GB of the same data being piped
122
+ > - These aren't tests of render speed, but rather of the throughput of GPU -> CPU -> RAM -> IPC
123
+ > - All resolutions are wide-screen (16:9) and have 3 components (RGB) with 3 bytes per pixel (SDR)
124
+ > - The data is random noise per buffer, with values between 128 and 135, so multi-buffer runs are a noise video
125
+ > - Multi-buffer runs cycle through a list of buffers (e.g. 1, 2, 3, 1, 2, 3... for 3 buffers)
126
+ > - All FFmpeg outputs are discarded with `-f null -` to avoid any disk I/O bottlenecks
127
+ > - The `gain` column is the percentage increase over the standard method
128
+ > - When `x264` is Null, no encoding took place (passthrough)
129
+ > - The test case emojis signify:
130
+ > - 🐢: Standard `ffmpeg.stdin.write(buffer.read())` on just the main thread, pure Python
131
+ > - 🚀: Threaded `ffmpeg.stdin.write(buffer.read())` with a queue (similar to turbopipe)
132
+ > - 🌀: The magic of `turbopipe.pipe(buffer, ffmpeg.stdin.fileno())`
133
+ >
134
+ > Also see [`benchmark.py`](https://github.com/BrokenSource/TurboPipe/blob/main/examples/benchmark.py) for the implementation
135
+
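For orientation, the 🚀 case (threaded write with a queue) could look roughly like the sketch below. This is an assumption about its general shape, not the code from `benchmark.py`: `threaded_writer` is a hypothetical name, an `io.BytesIO` stands in for `ffmpeg.stdin`, and small byte strings stand in for `buffer.read()`.

```python
import io
import queue
import threading


def threaded_writer(stream):
    """Start a worker thread that writes queued frames to `stream` in order."""
    frames = queue.Queue(maxsize=4)  # bounded; the size is an arbitrary choice

    def worker():
        while True:
            frame = frames.get()
            if frame is None:  # sentinel: no more frames
                break
            stream.write(frame)

    thread = threading.Thread(target=worker, daemon=True)
    thread.start()
    return frames, thread


sink = io.BytesIO()  # stands in for ffmpeg.stdin
frames, thread = threaded_writer(sink)
for i in range(3):
    frames.put(bytes([i]) * 4)  # stands in for buffer.read()
frames.put(None)  # tell the worker to stop
thread.join()     # wait until everything has been written
```

The main thread only blocks when the queue is full, which is what lets it prepare the next frame while the previous one is still being written.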
136
+ ✅ Check out benchmarks on a couple of systems below:
137
+
138
+ 📦 TurboPipe v1.0.4:
139
+
140
+ <details>
141
+ <summary><b>Desktop</b> • (AMD Ryzen 9 5900x) • (NVIDIA RTX 3060 12 GB) • (DDR4 2x32 GB 3200 MT/s) • (Arch Linux)</summary>
142
+ <br>
143
+
144
+ <b>Note:</b> I have noticed inconsistency across tests, especially at lower resolutions. Some 720p runs might peak at 2900 fps and stay there, while others are limited to 1750 fps. I'm not sure whether the Linux EEVDF scheduler or the CPU topology causes this. Nevertheless, results are stable on Windows 11 on the same machine.
145
+
146
+ | 720p | x264 | Buffers | Framerate | Bandwidth | Gain |
147
+ |:----:|:----------|:---------:|----------:|------------:|---------:|
148
+ | 🐢 | Null | 1 | 882 fps | 2.44 GB/s | |
149
+ | 🚀 | Null | 1 | 793 fps | 2.19 GB/s | -10.04% |
150
+ | 🌀 | Null | 1 | 1911 fps | 5.28 GB/s | 116.70% |
151
+ | 🐢 | Null | 4 | 880 fps | 2.43 GB/s | |
152
+ | 🚀 | Null | 4 | 924 fps | 2.56 GB/s | 5.05% |
153
+ | 🌀 | Null | 4 | 2037 fps | 5.63 GB/s | 131.59% |
154
+ | 🐢 | ultrafast | 4 | 714 fps | 1.98 GB/s | |
155
+ | 🚀 | ultrafast | 4 | 670 fps | 1.85 GB/s | -6.10% |
156
+ | 🌀 | ultrafast | 4 | 1093 fps | 3.02 GB/s | 53.13% |
157
+ | 🐢 | slow | 4 | 206 fps | 0.57 GB/s | |
158
+ | 🚀 | slow | 4 | 208 fps | 0.58 GB/s | 1.37% |
159
+ | 🌀 | slow | 4 | 214 fps | 0.59 GB/s | 3.93% |
160
+
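The Bandwidth and Gain columns can be approximately reproduced from the Framerate column. The sketch below assumes 1 GB = 10^9 bytes (which matches the table values) and uses the rounded fps figures, so gains can differ from the table in the last digit. Both function names are hypothetical.

```python
def bandwidth_gb_s(width: int, height: int, fps: float, bytes_per_pixel: int = 3) -> float:
    """Raw frame throughput in GB/s, assuming 1 GB = 1e9 bytes."""
    return width * height * bytes_per_pixel * fps / 1e9


def gain_percent(fps: float, baseline_fps: float) -> float:
    """Percentage increase of `fps` over the standard (🐢) method."""
    return (fps / baseline_fps - 1) * 100


# 720p, Null, 1 buffer: 🌀 at 1911 fps against 🐢 at 882 fps
print(f"{bandwidth_gb_s(1280, 720, 1911):.2f} GB/s")  # prints "5.28 GB/s"
print(f"{gain_percent(1911, 882):.1f} %")             # prints "116.7 %"
```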
161
+ | 1080p | x264 | Buffers | Framerate | Bandwidth | Gain |
162
+ |:-----:|:----------|:---------:|----------:|------------:|--------:|
163
+ | 🐢 | Null | 1 | 410 fps | 2.55 GB/s | |
164
+ | 🚀 | Null | 1 | 399 fps | 2.48 GB/s | -2.60% |
165
+ | 🌀 | Null | 1 | 794 fps | 4.94 GB/s | 93.80% |
166
+ | 🐢 | Null | 4 | 390 fps | 2.43 GB/s | |
167
+ | 🚀 | Null | 4 | 391 fps | 2.43 GB/s | 0.26% |
168
+ | 🌀 | Null | 4 | 756 fps | 4.71 GB/s | 94.01% |
169
+ | 🐢 | ultrafast | 4 | 277 fps | 1.73 GB/s | |
170
+ | 🚀 | ultrafast | 4 | 270 fps | 1.68 GB/s | -2.40% |
171
+ | 🌀 | ultrafast | 4 | 402 fps | 2.50 GB/s | 45.32% |
172
+ | 🐢 | slow | 4 | 115 fps | 0.72 GB/s | |
173
+ | 🚀 | slow | 4 | 118 fps | 0.74 GB/s | 3.40% |
174
+ | 🌀 | slow | 4 | 119 fps | 0.75 GB/s | 4.34% |
175
+
176
+ | 1440p | x264 | Buffers | Framerate | Bandwidth | Gain |
177
+ |:-----:|:----------|:---------:|----------:|------------:|---------:|
178
+ | 🐢 | Null | 1 | 210 fps | 2.33 GB/s | |
179
+ | 🚀 | Null | 1 | 239 fps | 2.64 GB/s | 13.84% |
180
+ | 🌀 | Null | 1 | 534 fps | 5.91 GB/s | 154.32% |
181
+ | 🐢 | Null | 4 | 233 fps | 2.58 GB/s | |
182
+ | 🚀 | Null | 4 | 232 fps | 2.57 GB/s | -0.08% |
183
+ | 🌀 | Null | 4 | 495 fps | 5.48 GB/s | 112.64% |
184
+ | 🐢 | ultrafast | 4 | 141 fps | 1.56 GB/s | |
185
+ | 🚀 | ultrafast | 4 | 150 fps | 1.67 GB/s | 6.92% |
186
+ | 🌀 | ultrafast | 4 | 226 fps | 2.50 GB/s | 60.37% |
187
+ | 🐢 | slow | 4 | 72 fps | 0.80 GB/s | |
188
+ | 🚀 | slow | 4 | 71 fps | 0.79 GB/s | -0.70% |
189
+ | 🌀 | slow | 4 | 75 fps | 0.83 GB/s | 4.60% |
190
+
191
+ | 2160p | x264 | Buffers | Framerate | Bandwidth | Gain |
192
+ |:-----:|:----------|:---------:|----------:|------------:|---------:|
193
+ | 🐢 | Null | 1 | 81 fps | 2.03 GB/s | |
194
+ | 🚀 | Null | 1 | 107 fps | 2.67 GB/s | 32.26% |
195
+ | 🌀 | Null | 1 | 213 fps | 5.31 GB/s | 163.47% |
196
+ | 🐢 | Null | 4 | 87 fps | 2.18 GB/s | |
197
+ | 🚀 | Null | 4 | 109 fps | 2.72 GB/s | 25.43% |
198
+ | 🌀 | Null | 4 | 212 fps | 5.28 GB/s | 143.72% |
199
+ | 🐢 | ultrafast | 4 | 59 fps | 1.48 GB/s | |
200
+ | 🚀 | ultrafast | 4 | 67 fps | 1.68 GB/s | 14.46% |
201
+ | 🌀 | ultrafast | 4 | 95 fps | 2.39 GB/s | 62.66% |
202
+ | 🐢 | slow | 4 | 37 fps | 0.94 GB/s | |
203
+ | 🚀 | slow | 4 | 43 fps | 1.07 GB/s | 16.22% |
204
+ | 🌀 | slow | 4 | 44 fps | 1.11 GB/s | 20.65% |
205
+
206
+ </details>
207
+
208
+ <details>
209
+ <summary><b>Desktop</b> • (AMD Ryzen 9 5900x) • (NVIDIA RTX 3060 12 GB) • (DDR4 2x32 GB 3200 MT/s) • (Windows 11)</summary>
210
+ <br>
211
+
212
+ | 720p | x264 | Buffers | Framerate | Bandwidth | Gain |
213
+ |:----:|:----------|:---------:|----------:|------------:|--------:|
214
+ | 🐢 | Null | 1 | 981 fps | 2.71 GB/s | |
215
+ | 🚀 | Null | 1 | 1145 fps | 3.17 GB/s | 16.74% |
216
+ | 🌀 | Null | 1 | 1504 fps | 4.16 GB/s | 53.38% |
217
+ | 🐢 | Null | 4 | 997 fps | 2.76 GB/s | |
218
+ | 🚀 | Null | 4 | 1117 fps | 3.09 GB/s | 12.08% |
219
+ | 🌀 | Null | 4 | 1467 fps | 4.06 GB/s | 47.14% |
220
+ | 🐢 | ultrafast | 4 | 601 fps | 1.66 GB/s | |
221
+ | 🚀 | ultrafast | 4 | 616 fps | 1.70 GB/s | 2.57% |
222
+ | 🌀 | ultrafast | 4 | 721 fps | 1.99 GB/s | 20.04% |
223
+ | 🐢 | slow | 4 | 206 fps | 0.57 GB/s | |
224
+ | 🚀 | slow | 4 | 206 fps | 0.57 GB/s | 0.39% |
225
+ | 🌀 | slow | 4 | 206 fps | 0.57 GB/s | 0.13% |
226
+
227
+ | 1080p | x264 | Buffers | Framerate | Bandwidth | Gain |
228
+ |:-----:|:----------|:---------:|----------:|------------:|--------:|
229
+ | 🐢 | Null | 1 | 451 fps | 2.81 GB/s | |
230
+ | 🚀 | Null | 1 | 542 fps | 3.38 GB/s | 20.31% |
231
+ | 🌀 | Null | 1 | 711 fps | 4.43 GB/s | 57.86% |
232
+ | 🐢 | Null | 4 | 449 fps | 2.79 GB/s | |
233
+ | 🚀 | Null | 4 | 518 fps | 3.23 GB/s | 15.48% |
234
+ | 🌀 | Null | 4 | 614 fps | 3.82 GB/s | 36.83% |
235
+ | 🐢 | ultrafast | 4 | 262 fps | 1.64 GB/s | |
236
+ | 🚀 | ultrafast | 4 | 266 fps | 1.66 GB/s | 1.57% |
237
+ | 🌀 | ultrafast | 4 | 319 fps | 1.99 GB/s | 21.88% |
238
+ | 🐢 | slow | 4 | 119 fps | 0.74 GB/s | |
239
+ | 🚀 | slow | 4 | 121 fps | 0.76 GB/s | 2.46% |
240
+ | 🌀 | slow | 4 | 121 fps | 0.75 GB/s | 1.90% |
241
+
242
+ | 1440p | x264 | Buffers | Framerate | Bandwidth | Gain |
243
+ |:-----:|:----------|:---------:|----------:|------------:|--------:|
244
+ | 🐢 | Null | 1 | 266 fps | 2.95 GB/s | |
245
+ | 🚀 | Null | 1 | 308 fps | 3.41 GB/s | 15.87% |
246
+ | 🌀 | Null | 1 | 402 fps | 4.45 GB/s | 51.22% |
247
+ | 🐢 | Null | 4 | 276 fps | 3.06 GB/s | |
248
+ | 🚀 | Null | 4 | 307 fps | 3.40 GB/s | 11.32% |
249
+ | 🌀 | Null | 4 | 427 fps | 4.73 GB/s | 54.86% |
250
+ | 🐢 | ultrafast | 4 | 152 fps | 1.68 GB/s | |
251
+ | 🚀 | ultrafast | 4 | 156 fps | 1.73 GB/s | 3.02% |
252
+ | 🌀 | ultrafast | 4 | 181 fps | 2.01 GB/s | 19.36% |
253
+ | 🐢 | slow | 4 | 77 fps | 0.86 GB/s | |
254
+ | 🚀 | slow | 4 | 79 fps | 0.88 GB/s | 3.27% |
255
+ | 🌀 | slow | 4 | 80 fps | 0.89 GB/s | 4.86% |
256
+
257
+ | 2160p | x264 | Buffers | Framerate | Bandwidth | Gain |
258
+ |:-----:|:----------|:---------:|----------:|------------:|--------:|
259
+ | 🐢 | Null | 1 | 134 fps | 3.35 GB/s | |
260
+ | 🚀 | Null | 1 | 152 fps | 3.81 GB/s | 14.15% |
261
+ | 🌀 | Null | 1 | 221 fps | 5.52 GB/s | 65.44% |
262
+ | 🐢 | Null | 4 | 135 fps | 3.36 GB/s | |
263
+ | 🚀 | Null | 4 | 151 fps | 3.76 GB/s | 11.89% |
264
+ | 🌀 | Null | 4 | 220 fps | 5.49 GB/s | 63.34% |
265
+ | 🐢 | ultrafast | 4 | 66 fps | 1.65 GB/s | |
266
+ | 🚀 | ultrafast | 4 | 70 fps | 1.75 GB/s | 6.44% |
267
+ | 🌀 | ultrafast | 4 | 82 fps | 2.04 GB/s | 24.31% |
268
+ | 🐢 | slow | 4 | 40 fps | 1.01 GB/s | |
269
+ | 🚀 | slow | 4 | 43 fps | 1.09 GB/s | 9.54% |
270
+ | 🌀 | slow | 4 | 44 fps | 1.10 GB/s | 10.15% |
271
+
272
+ </details>
273
+
274
+ <details>
275
+ <summary><b>Laptop</b> • (Intel Core i7 11800H) • (NVIDIA RTX 3070) • (DDR4 2x16 GB 3200 MT/s) • (Windows 11)</summary>
276
+ <br>
277
+
278
+ <b>Note:</b> Must select NVIDIA GPU on their Control Panel instead of Intel iGPU
279
+
280
+ | 720p | x264 | Buffers | Framerate | Bandwidth | Gain |
281
+ |:----:|:----------|:---------:|----------:|------------:|--------:|
282
+ | 🐢 | Null | 1 | 786 fps | 2.17 GB/s | |
283
+ | 🚀 | Null | 1 | 903 fps | 2.50 GB/s | 14.91% |
284
+ | 🌀 | Null | 1 | 1366 fps | 3.78 GB/s | 73.90% |
285
+ | 🐢 | Null | 4 | 739 fps | 2.04 GB/s | |
286
+ | 🚀 | Null | 4 | 855 fps | 2.37 GB/s | 15.78% |
287
+ | 🌀 | Null | 4 | 1240 fps | 3.43 GB/s | 67.91% |
288
+ | 🐢 | ultrafast | 4 | 484 fps | 1.34 GB/s | |
289
+ | 🚀 | ultrafast | 4 | 503 fps | 1.39 GB/s | 4.10% |
290
+ | 🌀 | ultrafast | 4 | 577 fps | 1.60 GB/s | 19.37% |
291
+ | 🐢 | slow | 4 | 143 fps | 0.40 GB/s | |
292
+ | 🚀 | slow | 4 | 145 fps | 0.40 GB/s | 1.78% |
293
+ | 🌀 | slow | 4 | 151 fps | 0.42 GB/s | 5.76% |
294
+
295
+ | 1080p | x264 | Buffers | Framerate | Bandwidth | Gain |
296
+ |:-----:|:----------|:---------:|----------:|------------:|--------:|
297
+ | 🐢 | Null | 1 | 358 fps | 2.23 GB/s | |
298
+ | 🚀 | Null | 1 | 427 fps | 2.66 GB/s | 19.45% |
299
+ | 🌀 | Null | 1 | 566 fps | 3.53 GB/s | 58.31% |
300
+ | 🐢 | Null | 4 | 343 fps | 2.14 GB/s | |
301
+ | 🚀 | Null | 4 | 404 fps | 2.51 GB/s | 17.86% |
302
+ | 🌀 | Null | 4 | 465 fps | 2.89 GB/s | 35.62% |
303
+ | 🐢 | ultrafast | 4 | 191 fps | 1.19 GB/s | |
304
+ | 🚀 | ultrafast | 4 | 207 fps | 1.29 GB/s | 8.89% |
305
+ | 🌀 | ultrafast | 4 | 234 fps | 1.46 GB/s | 22.77% |
306
+ | 🐢 | slow | 4 | 62 fps | 0.39 GB/s | |
307
+ | 🚀 | slow | 4 | 67 fps | 0.42 GB/s | 8.40% |
308
+ | 🌀 | slow | 4 | 74 fps | 0.47 GB/s | 20.89% |
309
+
310
+ | 1440p | x264 | Buffers | Framerate | Bandwidth | Gain |
311
+ |:-----:|:----------|:---------:|----------:|------------:|--------:|
312
+ | 🐢 | Null | 1 | 180 fps | 1.99 GB/s | |
313
+ | 🚀 | Null | 1 | 216 fps | 2.40 GB/s | 20.34% |
314
+ | 🌀 | Null | 1 | 264 fps | 2.92 GB/s | 46.74% |
315
+ | 🐢 | Null | 4 | 178 fps | 1.97 GB/s | |
316
+ | 🚀 | Null | 4 | 211 fps | 2.34 GB/s | 19.07% |
317
+ | 🌀 | Null | 4 | 250 fps | 2.77 GB/s | 40.48% |
318
+ | 🐢 | ultrafast | 4 | 98 fps | 1.09 GB/s | |
319
+ | 🚀 | ultrafast | 4 | 110 fps | 1.23 GB/s | 13.18% |
320
+ | 🌀 | ultrafast | 4 | 121 fps | 1.35 GB/s | 24.15% |
321
+ | 🐢 | slow | 4 | 40 fps | 0.45 GB/s | |
322
+ | 🚀 | slow | 4 | 41 fps | 0.46 GB/s | 4.90% |
323
+ | 🌀 | slow | 4 | 43 fps | 0.48 GB/s | 7.89% |
324
+
325
+ | 2160p | x264 | Buffers | Framerate | Bandwidth | Gain |
326
+ |:-----:|:----------|:---------:|----------:|------------:|--------:|
327
+ | 🐢 | Null | 1 | 79 fps | 1.98 GB/s | |
328
+ | 🚀 | Null | 1 | 95 fps | 2.37 GB/s | 20.52% |
329
+ | 🌀 | Null | 1 | 104 fps | 2.60 GB/s | 32.15% |
330
+ | 🐢 | Null | 4 | 80 fps | 2.00 GB/s | |
331
+ | 🚀 | Null | 4 | 94 fps | 2.35 GB/s | 17.82% |
332
+ | 🌀 | Null | 4 | 108 fps | 2.70 GB/s | 35.40% |
333
+ | 🐢 | ultrafast | 4 | 41 fps | 1.04 GB/s | |
334
+ | 🚀 | ultrafast | 4 | 48 fps | 1.20 GB/s | 17.67% |
335
+ | 🌀 | ultrafast | 4 | 52 fps | 1.30 GB/s | 27.49% |
336
+ | 🐢 | slow | 4 | 17 fps | 0.43 GB/s | |
337
+ | 🚀 | slow | 4 | 19 fps | 0.48 GB/s | 13.16% |
338
+ | 🌀 | slow | 4 | 19 fps | 0.48 GB/s | 13.78% |
339
+
340
+ </details>
341
+ <br>
342
+
343
+ <div align="justify">
344
+
345
+ # 🌀 Conclusion
346
+
347
+ TurboPipe significantly increases the speed of feeding data to FFmpeg, especially at higher resolutions. However, if little CPU compute is available, or the video is too hard to encode (`slow` preset), the gains over the other methods are insignificant, as encoding becomes the bottleneck. Multi-buffering didn't prove to have an advantage: debugging shows that the TurboPipe C++ thread is often starved of data to write (most likely because the file stream is buffered by the OS), and context switching or cache misses might be the cause of the slowdown.
348
+
349
+ The theory supports the threaded method being faster: writing to a file descriptor is a blocking operation for Python, but it is a syscall under the hood that doesn't necessarily lock the GIL, only the calling thread. TurboPipe speeds this up even further by avoiding an unnecessary copy of the buffer data and writing directly to the file descriptor on a C++ thread. On the same system, Linux performs better than Windows after the optimizations, but Windows wins with the standard method.
350
+
351
+ Interestingly, due to either Linux's scheduler on AMD Ryzen CPUs or their operating philosophy, it was experimentally seen that Ryzen's frenetic thread switching slightly degrades single-thread performance compared to Windows on the same system (Linux 🚀 = Windows 🐢). This can be _"fixed"_ by prepending the command with `taskset --cpu 0,2` (not recommended at all). It may also be due to the topology of the tested CPUs, which have more than one Core Complex Die (CCD). Intel CPUs seem to stick to the same thread for longer, which often makes the Python threaded method slightly faster.
352
+
353
+ ### Personal experience
354
+
355
+ On realistic loads, like [**ShaderFlow**](https://github.com/BrokenSource/ShaderFlow)'s default lightweight shader export, TurboPipe increases rendering speed from 1080p260 to 1080p360 on my system, with CPU usage in the mid 80%s rather than the low 60%s. For [**DepthFlow**](https://github.com/BrokenSource/ShaderFlow)'s default depth video export, no gains are seen, as the CPU is almost saturated encoding at 1080p130.
356
+
357
+ </div>
358
+
359
+ <br>
360
+
361
+ # 📚 Future work
362
+
363
+ - Add support for NumPy arrays, memoryviews, and byte-like objects
364
+ - Disable/investigate performance degradation on Windows iGPUs
365
+ - Improve the thread synchronization and/or use a ThreadPool
366
+ - A more stable way of finding mglo struct offsets (moderngl.h?)
367
+ - Maybe use `mmap` instead of chunked writes on Linux
368
+ - Test on macOS 🙈
@@ -0,0 +1,333 @@
1
+ > [!IMPORTANT]
2
+ > <sub>Also check out [**ShaderFlow**](https://github.com/BrokenSource/ShaderFlow), where **TurboPipe** shines! 😉</sub>
3
+
4
+ <div align="center">
5
+ <a href="https://brokensrc.dev/"><img src="https://raw.githubusercontent.com/BrokenSource/TurboPipe/main/turbopipe/resources/images/turbopipe.png" width="200"></a>
6
+ <h1>TurboPipe</h1>
7
+ <br>
8
+ Faster <a href="https://github.com/moderngl/moderngl"><b>ModernGL</b></a> inter-process data transfers
9
+ </div>
10
+
11
+ <br>
12
+
13
+ # 🔥 Description
14
+
15
+ > TurboPipe speeds up sending raw bytes from `moderngl.Buffer` objects, primarily to an `FFmpeg` subprocess
16
+
17
+ The **optimizations** involved are:
18
+
19
+ - **Zero-copy**: Avoid unnecessary memory copies or allocation (intermediate `buffer.read()`)
20
+ - **C++**: The core of TurboPipe is written in C++ for speed, efficiency and low-level control
21
+ - **Chunks**: Write in chunks of 4096 bytes (RAM page size), so the hardware is happy
22
+ - **Threaded**:
23
+ - Doesn't block Python code execution, allowing the next frame to be rendered
24
+ - Decouples the main thread from the I/O thread for performance
25
+
26
+ ✅ Don't worry, there's proper **safety** in place. TurboPipe will block Python if a memory address is already queued for writing, and guarantees order of writes per file-descriptor. Just call `.sync()` when done 😉
27
+
28
+ <br>
29
+
30
+ # 📦 Installation
31
+
32
+ It couldn't be easier! Just install the [**`turbopipe`**](https://pypi.org/project/turbopipe/) package from PyPI:
33
+
34
+ ```bash
35
+ # With pip (https://pip.pypa.io/)
36
+ python -m pip install turbopipe
37
+
38
+ # With Poetry (https://python-poetry.org/)
39
+ poetry add turbopipe
40
+
41
+ # With PDM (https://pdm-project.org/en/latest/)
42
+ pdm add turbopipe
43
+
44
+ # With Rye (https://rye.astral.sh/)
45
+ rye add turbopipe
46
+ ```
47
+
48
+ <br>
49
+
50
+ # 🚀 Usage
51
+
52
+ See also the [**Examples**](https://github.com/BrokenSource/TurboPipe/tree/main/examples) folder for more controlled usage, and [**ShaderFlow**](https://github.com/BrokenSource/ShaderFlow/blob/main/ShaderFlow/Scene.py) usage of it!
53
+
54
+ ```python
55
+ import subprocess
56
+
57
+ import moderngl
58
+ import turbopipe
59
+
60
+ # Create ModernGL objects
61
+ ctx = moderngl.create_standalone_context()
62
+ buffer = ctx.buffer(reserve=1920*1080*3)
63
+
64
+ # Make sure resolution, pixel format matches!
65
+ ffmpeg = subprocess.Popen(
66
+     'ffmpeg -f rawvideo -pix_fmt rgb24 -r 60 -s 1920x1080 -i - -f null -'.split(),
66
+     stdin=subprocess.PIPE
68
+ )
69
+
70
+ # Rendering loop of yours (eg. 1m footage)
71
+ for _ in range(60 * 60):
72
+     turbopipe.pipe(buffer, ffmpeg.stdin.fileno())
73
+
74
+ # Finalize writing
75
+ turbopipe.sync()
76
+ ffmpeg.stdin.close()
77
+ ffmpeg.wait()
78
+ ```
79
+
80
+ <br>
81
+
82
+ # ⭐️ Benchmarks
83
+
84
+ > [!NOTE]
85
+ > **The test conditions are as follows**:
86
+ > - Each result is the average of 3 runs to ensure consistency, with 5 GB of the same data being piped
87
+ > - These aren't tests of render speed, but rather of the throughput of GPU -> CPU -> RAM -> IPC
88
+ > - All resolutions are wide-screen (16:9) and have 3 components (RGB) with 3 bytes per pixel (SDR)
89
+ > - The data is random noise per buffer, with values between 128 and 135, so multi-buffer runs are a noise video
90
+ > - Multi-buffer runs cycle through a list of buffers (e.g. 1, 2, 3, 1, 2, 3... for 3 buffers)
91
+ > - All FFmpeg outputs are discarded with `-f null -` to avoid any disk I/O bottlenecks
92
+ > - The `gain` column is the percentage increase over the standard method
93
+ > - When `x264` is Null, no encoding took place (passthrough)
94
+ > - The test case emojis signify:
95
+ > - 🐢: Standard `ffmpeg.stdin.write(buffer.read())` on just the main thread, pure Python
96
+ > - 🚀: Threaded `ffmpeg.stdin.write(buffer.read())` with a queue (similar to turbopipe)
97
+ > - 🌀: The magic of `turbopipe.pipe(buffer, ffmpeg.stdin.fileno())`
98
+ >
99
+ > Also see [`benchmark.py`](https://github.com/BrokenSource/TurboPipe/blob/main/examples/benchmark.py) for the implementation
100
+
101
+ ✅ Check out benchmarks on a couple of systems below:
102
+
103
+ 📦 TurboPipe v1.0.4:
104
+
105
+ <details>
106
+ <summary><b>Desktop</b> • (AMD Ryzen 9 5900x) • (NVIDIA RTX 3060 12 GB) • (DDR4 2x32 GB 3200 MT/s) • (Arch Linux)</summary>
107
+ <br>
108
+
109
+ <b>Note:</b> I have noticed inconsistency across tests, especially at lower resolutions. Some 720p runs might peak at 2900 fps and stay there, while others are limited to 1750 fps. I'm not sure whether the Linux EEVDF scheduler or the CPU topology causes this. Nevertheless, results are stable on Windows 11 on the same machine.
110
+
111
+ | 720p | x264 | Buffers | Framerate | Bandwidth | Gain |
112
+ |:----:|:----------|:---------:|----------:|------------:|---------:|
113
+ | 🐢 | Null | 1 | 882 fps | 2.44 GB/s | |
114
+ | 🚀 | Null | 1 | 793 fps | 2.19 GB/s | -10.04% |
115
+ | 🌀 | Null | 1 | 1911 fps | 5.28 GB/s | 116.70% |
116
+ | 🐢 | Null | 4 | 880 fps | 2.43 GB/s | |
117
+ | 🚀 | Null | 4 | 924 fps | 2.56 GB/s | 5.05% |
118
+ | 🌀 | Null | 4 | 2037 fps | 5.63 GB/s | 131.59% |
119
+ | 🐢 | ultrafast | 4 | 714 fps | 1.98 GB/s | |
120
+ | 🚀 | ultrafast | 4 | 670 fps | 1.85 GB/s | -6.10% |
121
+ | 🌀 | ultrafast | 4 | 1093 fps | 3.02 GB/s | 53.13% |
122
+ | 🐢 | slow | 4 | 206 fps | 0.57 GB/s | |
123
+ | 🚀 | slow | 4 | 208 fps | 0.58 GB/s | 1.37% |
124
+ | 🌀 | slow | 4 | 214 fps | 0.59 GB/s | 3.93% |
125
+
126
+ | 1080p | x264 | Buffers | Framerate | Bandwidth | Gain |
127
+ |:-----:|:----------|:---------:|----------:|------------:|--------:|
128
+ | 🐢 | Null | 1 | 410 fps | 2.55 GB/s | |
129
+ | 🚀 | Null | 1 | 399 fps | 2.48 GB/s | -2.60% |
130
+ | 🌀 | Null | 1 | 794 fps | 4.94 GB/s | 93.80% |
131
+ | 🐢 | Null | 4 | 390 fps | 2.43 GB/s | |
132
+ | 🚀 | Null | 4 | 391 fps | 2.43 GB/s | 0.26% |
133
+ | 🌀 | Null | 4 | 756 fps | 4.71 GB/s | 94.01% |
134
+ | 🐢 | ultrafast | 4 | 277 fps | 1.73 GB/s | |
135
+ | 🚀 | ultrafast | 4 | 270 fps | 1.68 GB/s | -2.40% |
136
+ | 🌀 | ultrafast | 4 | 402 fps | 2.50 GB/s | 45.32% |
137
+ | 🐢 | slow | 4 | 115 fps | 0.72 GB/s | |
138
+ | 🚀 | slow | 4 | 118 fps | 0.74 GB/s | 3.40% |
139
+ | 🌀 | slow | 4 | 119 fps | 0.75 GB/s | 4.34% |
140
+
141
+ | 1440p | x264 | Buffers | Framerate | Bandwidth | Gain |
142
+ |:-----:|:----------|:---------:|----------:|------------:|---------:|
143
+ | 🐢 | Null | 1 | 210 fps | 2.33 GB/s | |
144
+ | 🚀 | Null | 1 | 239 fps | 2.64 GB/s | 13.84% |
145
+ | 🌀 | Null | 1 | 534 fps | 5.91 GB/s | 154.32% |
146
+ | 🐢 | Null | 4 | 233 fps | 2.58 GB/s | |
147
+ | 🚀 | Null | 4 | 232 fps | 2.57 GB/s | -0.08% |
148
+ | 🌀 | Null | 4 | 495 fps | 5.48 GB/s | 112.64% |
149
+ | 🐢 | ultrafast | 4 | 141 fps | 1.56 GB/s | |
150
+ | 🚀 | ultrafast | 4 | 150 fps | 1.67 GB/s | 6.92% |
151
+ | 🌀 | ultrafast | 4 | 226 fps | 2.50 GB/s | 60.37% |
152
+ | 🐢 | slow | 4 | 72 fps | 0.80 GB/s | |
153
+ | 🚀 | slow | 4 | 71 fps | 0.79 GB/s | -0.70% |
154
+ | 🌀 | slow | 4 | 75 fps | 0.83 GB/s | 4.60% |
155
+
156
+ | 2160p | x264 | Buffers | Framerate | Bandwidth | Gain |
157
+ |:-----:|:----------|:---------:|----------:|------------:|---------:|
158
+ | 🐢 | Null | 1 | 81 fps | 2.03 GB/s | |
159
+ | 🚀 | Null | 1 | 107 fps | 2.67 GB/s | 32.26% |
160
+ | 🌀 | Null | 1 | 213 fps | 5.31 GB/s | 163.47% |
161
+ | 🐢 | Null | 4 | 87 fps | 2.18 GB/s | |
162
+ | 🚀 | Null | 4 | 109 fps | 2.72 GB/s | 25.43% |
163
+ | 🌀 | Null | 4 | 212 fps | 5.28 GB/s | 143.72% |
164
+ | 🐢 | ultrafast | 4 | 59 fps | 1.48 GB/s | |
165
+ | 🚀 | ultrafast | 4 | 67 fps | 1.68 GB/s | 14.46% |
166
+ | 🌀 | ultrafast | 4 | 95 fps | 2.39 GB/s | 62.66% |
167
+ | 🐢 | slow | 4 | 37 fps | 0.94 GB/s | |
168
+ | 🚀 | slow | 4 | 43 fps | 1.07 GB/s | 16.22% |
169
+ | 🌀 | slow | 4 | 44 fps | 1.11 GB/s | 20.65% |
170
+
171
+ </details>
172
+
173
+ <details>
174
+ <summary><b>Desktop</b> • (AMD Ryzen 9 5900x) • (NVIDIA RTX 3060 12 GB) • (DDR4 2x32 GB 3200 MT/s) • (Windows 11)</summary>
175
+ <br>
176
+
177
+ | 720p | x264 | Buffers | Framerate | Bandwidth | Gain |
178
+ |:----:|:----------|:---------:|----------:|------------:|--------:|
179
+ | 🐢 | Null | 1 | 981 fps | 2.71 GB/s | |
180
+ | 🚀 | Null | 1 | 1145 fps | 3.17 GB/s | 16.74% |
181
+ | 🌀 | Null | 1 | 1504 fps | 4.16 GB/s | 53.38% |
182
+ | 🐢 | Null | 4 | 997 fps | 2.76 GB/s | |
183
+ | 🚀 | Null | 4 | 1117 fps | 3.09 GB/s | 12.08% |
184
+ | 🌀 | Null | 4 | 1467 fps | 4.06 GB/s | 47.14% |
185
+ | 🐢 | ultrafast | 4 | 601 fps | 1.66 GB/s | |
186
+ | 🚀 | ultrafast | 4 | 616 fps | 1.70 GB/s | 2.57% |
187
+ | 🌀 | ultrafast | 4 | 721 fps | 1.99 GB/s | 20.04% |
188
+ | 🐢 | slow | 4 | 206 fps | 0.57 GB/s | |
189
+ | 🚀 | slow | 4 | 206 fps | 0.57 GB/s | 0.39% |
190
+ | 🌀 | slow | 4 | 206 fps | 0.57 GB/s | 0.13% |
191
+
192
+ | 1080p | x264 | Buffers | Framerate | Bandwidth | Gain |
193
+ |:-----:|:----------|:---------:|----------:|------------:|--------:|
194
+ | 🐢 | Null | 1 | 451 fps | 2.81 GB/s | |
195
+ | 🚀 | Null | 1 | 542 fps | 3.38 GB/s | 20.31% |
196
+ | 🌀 | Null | 1 | 711 fps | 4.43 GB/s | 57.86% |
197
+ | 🐢 | Null | 4 | 449 fps | 2.79 GB/s | |
198
+ | 🚀 | Null | 4 | 518 fps | 3.23 GB/s | 15.48% |
199
+ | 🌀 | Null | 4 | 614 fps | 3.82 GB/s | 36.83% |
200
+ | 🐢 | ultrafast | 4 | 262 fps | 1.64 GB/s | |
201
+ | 🚀 | ultrafast | 4 | 266 fps | 1.66 GB/s | 1.57% |
202
+ | 🌀 | ultrafast | 4 | 319 fps | 1.99 GB/s | 21.88% |
203
+ | 🐢 | slow | 4 | 119 fps | 0.74 GB/s | |
204
+ | 🚀 | slow | 4 | 121 fps | 0.76 GB/s | 2.46% |
205
+ | 🌀 | slow | 4 | 121 fps | 0.75 GB/s | 1.90% |
206
+
207
+ | 1440p | x264 | Buffers | Framerate | Bandwidth | Gain |
208
+ |:-----:|:----------|:---------:|----------:|------------:|--------:|
209
+ | 🐢 | Null | 1 | 266 fps | 2.95 GB/s | |
210
+ | 🚀 | Null | 1 | 308 fps | 3.41 GB/s | 15.87% |
211
+ | 🌀 | Null | 1 | 402 fps | 4.45 GB/s | 51.22% |
212
+ | 🐢 | Null | 4 | 276 fps | 3.06 GB/s | |
213
+ | 🚀 | Null | 4 | 307 fps | 3.40 GB/s | 11.32% |
214
+ | 🌀 | Null | 4 | 427 fps | 4.73 GB/s | 54.86% |
215
+ | 🐢 | ultrafast | 4 | 152 fps | 1.68 GB/s | |
216
+ | 🚀 | ultrafast | 4 | 156 fps | 1.73 GB/s | 3.02% |
217
+ | 🌀 | ultrafast | 4 | 181 fps | 2.01 GB/s | 19.36% |
218
+ | 🐢 | slow | 4 | 77 fps | 0.86 GB/s | |
219
+ | 🚀 | slow | 4 | 79 fps | 0.88 GB/s | 3.27% |
220
+ | 🌀 | slow | 4 | 80 fps | 0.89 GB/s | 4.86% |
221
+
222
+ | 2160p | x264 | Buffers | Framerate | Bandwidth | Gain |
223
+ |:-----:|:----------|:---------:|----------:|------------:|--------:|
224
+ | 🐢 | Null | 1 | 134 fps | 3.35 GB/s | |
225
+ | 🚀 | Null | 1 | 152 fps | 3.81 GB/s | 14.15% |
226
+ | 🌀 | Null | 1 | 221 fps | 5.52 GB/s | 65.44% |
227
+ | 🐢 | Null | 4 | 135 fps | 3.36 GB/s | |
228
+ | 🚀 | Null | 4 | 151 fps | 3.76 GB/s | 11.89% |
229
+ | 🌀 | Null | 4 | 220 fps | 5.49 GB/s | 63.34% |
230
+ | 🐢 | ultrafast | 4 | 66 fps | 1.65 GB/s | |
231
+ | 🚀 | ultrafast | 4 | 70 fps | 1.75 GB/s | 6.44% |
232
+ | 🌀 | ultrafast | 4 | 82 fps | 2.04 GB/s | 24.31% |
233
+ | 🐢 | slow | 4 | 40 fps | 1.01 GB/s | |
234
+ | 🚀 | slow | 4 | 43 fps | 1.09 GB/s | 9.54% |
235
+ | 🌀 | slow | 4 | 44 fps | 1.10 GB/s | 10.15% |
236
+
237
+ </details>
238
+
239
+ <details>
240
+ <summary><b>Laptop</b> • (Intel Core i7 11800H) • (NVIDIA RTX 3070) • (DDR4 2x16 GB 3200 MT/s) • (Windows 11)</summary>
241
+ <br>
242
+
243
+ <b>Note:</b> Must select NVIDIA GPU on their Control Panel instead of Intel iGPU
244
+
245
+ | 720p | x264 | Buffers | Framerate | Bandwidth | Gain |
246
+ |:----:|:----------|:---------:|----------:|------------:|--------:|
247
+ | 🐢 | Null | 1 | 786 fps | 2.17 GB/s | |
248
+ | 🚀 | Null | 1 | 903 fps | 2.50 GB/s | 14.91% |
249
+ | 🌀 | Null | 1 | 1366 fps | 3.78 GB/s | 73.90% |
250
+ | 🐢 | Null | 4 | 739 fps | 2.04 GB/s | |
251
+ | 🚀 | Null | 4 | 855 fps | 2.37 GB/s | 15.78% |
252
+ | 🌀 | Null | 4 | 1240 fps | 3.43 GB/s | 67.91% |
253
+ | 🐢 | ultrafast | 4 | 484 fps | 1.34 GB/s | |
254
+ | 🚀 | ultrafast | 4 | 503 fps | 1.39 GB/s | 4.10% |
255
+ | 🌀 | ultrafast | 4 | 577 fps | 1.60 GB/s | 19.37% |
256
+ | 🐢 | slow | 4 | 143 fps | 0.40 GB/s | |
257
+ | 🚀 | slow | 4 | 145 fps | 0.40 GB/s | 1.78% |
258
+ | 🌀 | slow | 4 | 151 fps | 0.42 GB/s | 5.76% |
259
+
260
+ | 1080p | x264 | Buffers | Framerate | Bandwidth | Gain |
261
+ |:-----:|:----------|:---------:|----------:|------------:|--------:|
262
+ | 🐢 | Null | 1 | 358 fps | 2.23 GB/s | |
263
+ | 🚀 | Null | 1 | 427 fps | 2.66 GB/s | 19.45% |
264
+ | 🌀 | Null | 1 | 566 fps | 3.53 GB/s | 58.31% |
265
+ | 🐢 | Null | 4 | 343 fps | 2.14 GB/s | |
266
+ | 🚀 | Null | 4 | 404 fps | 2.51 GB/s | 17.86% |
267
+ | 🌀 | Null | 4 | 465 fps | 2.89 GB/s | 35.62% |
268
+ | 🐢 | ultrafast | 4 | 191 fps | 1.19 GB/s | |
269
+ | 🚀 | ultrafast | 4 | 207 fps | 1.29 GB/s | 8.89% |
270
+ | 🌀 | ultrafast | 4 | 234 fps | 1.46 GB/s | 22.77% |
271
+ | 🐢 | slow | 4 | 62 fps | 0.39 GB/s | |
272
+ | 🚀 | slow | 4 | 67 fps | 0.42 GB/s | 8.40% |
273
+ | 🌀 | slow | 4 | 74 fps | 0.47 GB/s | 20.89% |
274
+
275
+ | 1440p | x264 | Buffers | Framerate | Bandwidth | Gain |
276
+ |:-----:|:----------|:---------:|----------:|------------:|--------:|
277
+ | 🐢 | Null | 1 | 180 fps | 1.99 GB/s | |
278
+ | 🚀 | Null | 1 | 216 fps | 2.40 GB/s | 20.34% |
279
+ | 🌀 | Null | 1 | 264 fps | 2.92 GB/s | 46.74% |
280
+ | 🐢 | Null | 4 | 178 fps | 1.97 GB/s | |
281
+ | 🚀 | Null | 4 | 211 fps | 2.34 GB/s | 19.07% |
282
+ | 🌀 | Null | 4 | 250 fps | 2.77 GB/s | 40.48% |
283
+ | 🐢 | ultrafast | 4 | 98 fps | 1.09 GB/s | |
284
+ | 🚀 | ultrafast | 4 | 110 fps | 1.23 GB/s | 13.18% |
285
+ | 🌀 | ultrafast | 4 | 121 fps | 1.35 GB/s | 24.15% |
286
+ | 🐢 | slow | 4 | 40 fps | 0.45 GB/s | |
287
+ | 🚀 | slow | 4 | 41 fps | 0.46 GB/s | 4.90% |
288
+ | 🌀 | slow | 4 | 43 fps | 0.48 GB/s | 7.89% |
289
+
290
+ | 2160p | x264 | Buffers | Framerate | Bandwidth | Gain |
291
+ |:-----:|:----------|:---------:|----------:|------------:|--------:|
292
+ | 🐢 | Null | 1 | 79 fps | 1.98 GB/s | |
293
+ | 🚀 | Null | 1 | 95 fps | 2.37 GB/s | 20.52% |
294
+ | 🌀 | Null | 1 | 104 fps | 2.60 GB/s | 32.15% |
295
+ | 🐢 | Null | 4 | 80 fps | 2.00 GB/s | |
296
+ | 🚀 | Null | 4 | 94 fps | 2.35 GB/s | 17.82% |
297
+ | 🌀 | Null | 4 | 108 fps | 2.70 GB/s | 35.40% |
298
+ | 🐢 | ultrafast | 4 | 41 fps | 1.04 GB/s | |
299
+ | 🚀 | ultrafast | 4 | 48 fps | 1.20 GB/s | 17.67% |
300
+ | 🌀 | ultrafast | 4 | 52 fps | 1.30 GB/s | 27.49% |
301
+ | 🐢 | slow | 4 | 17 fps | 0.43 GB/s | |
302
+ | 🚀 | slow | 4 | 19 fps | 0.48 GB/s | 13.16% |
303
+ | 🌀 | slow | 4 | 19 fps | 0.48 GB/s | 13.78% |
304
+
305
+ </details>
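
As a rough cross-check of the tables above, the Bandwidth column appears consistent with framerate times bytes per frame (width × height × 3 for RGB), expressed in decimal gigabytes, and the Gain column with the relative increase over the 🐢 row. This is an inference from the published numbers, not taken from the benchmark source; the helper names below are hypothetical, and small differences are expected because the tables seem to be computed from unrounded framerates.

```python
def bytes_per_frame(width, height, components=3):
    """Size of one raw RGB frame in bytes (assumes 1 byte per component)."""
    return width * height * components


def bandwidth_gb_per_s(fps, width, height):
    """Approximate bandwidth in decimal GB/s (assumption: 1 GB = 1e9 bytes)."""
    return fps * bytes_per_frame(width, height) / 1e9


def gain_percent(fps, baseline_fps):
    """Percentage increase of `fps` over the standard-method framerate."""
    return (fps / baseline_fps - 1.0) * 100.0


if __name__ == "__main__":
    # Laptop, 1080p, Null encoder, 1 buffer: standard 358 fps, TurboPipe 566 fps
    print(f"{bandwidth_gb_per_s(358, 1920, 1080):.2f} GB/s")
    print(f"{gain_percent(566, 358):.2f} %")
```

The recomputed gain lands within a fraction of a percentage point of the tabulated value, which is about what rounding the framerate to an integer would explain.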
306
+ <br>
307
+
308
+ <div align="justify">
309
+
310
+ # 🌀 Conclusion
311
+
312
+ TurboPipe significantly increases the speed of feeding data to FFmpeg, especially at higher resolutions. However, if there is little CPU compute available, or the video is too hard to encode (slow preset), the gains over the other methods are insignificant, as the encoder becomes the bottleneck. Multi-buffering didn't prove to have an advantage: debugging shows that the TurboPipe C++ thread is often starved of data to write (most likely because the file stream is buffered by the OS), and context switching or cache misses might be the cause of the slowdown.
313
+
314
+ The theory supports the threaded method being faster: writing to a file descriptor is a blocking operation for Python, but it is a syscall under the hood that doesn't necessarily hold the GIL and only blocks the calling thread. TurboPipe speeds this up even further by avoiding an unnecessary copy of the buffer data and writing directly to the file descriptor on a C++ thread. Linux shows better performance than Windows on the same system after the optimizations, but Windows wins on the standard method.
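
The threaded approach described in this paragraph can be sketched in pure Python as below. This is a minimal illustration under stated assumptions, not the benchmark's actual 🚀 implementation: `writer_loop`, `send_frames` and `demo` are hypothetical names, and an OS pipe stands in for FFmpeg's stdin. It relies only on the fact that `os.write` releases the GIL while the syscall blocks.

```python
import os
import queue
import threading


def writer_loop(fd, pending):
    """Write every queued frame fully to `fd` until a None sentinel arrives."""
    while True:
        frame = pending.get()
        if frame is None:
            return
        view = memoryview(frame)
        while len(view) > 0:
            # os.write() releases the GIL while the syscall blocks,
            # so the main thread can keep producing frames meanwhile
            written = os.write(fd, view)
            view = view[written:]


def send_frames(fd, frames, depth=2):
    """Queue frames for a background writer thread; returns bytes queued."""
    pending = queue.Queue(maxsize=depth)
    thread = threading.Thread(target=writer_loop, args=(fd, pending))
    thread.start()
    total = 0
    for frame in frames:
        pending.put(frame)  # blocks only while the queue is full
        total += len(frame)
    pending.put(None)  # sentinel: no more frames
    thread.join()
    return total


def demo():
    """Round-trip a few fake frames through an OS pipe instead of FFmpeg."""
    read_fd, write_fd = os.pipe()
    received = bytearray()

    def drain():
        while True:
            chunk = os.read(read_fd, 65536)
            if not chunk:
                return
            received.extend(chunk)

    reader = threading.Thread(target=drain)
    reader.start()
    sent = send_frames(write_fd, (bytes([i]) * 100_000 for i in range(8)))
    os.close(write_fd)
    reader.join()
    os.close(read_fd)
    return sent, len(received)


if __name__ == "__main__":
    print(demo())
```

Note that this sketch still hands Python `bytes` objects to the writer, so it keeps the intermediate copy that TurboPipe is described as avoiding.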
315
+
316
+ Interestingly, due to either Linux's scheduler on AMD Ryzen CPUs or their operating philosophy, it was experimentally seen that Ryzen's frenetic thread switching slightly degrades single-thread performance compared to Windows on the same system (Linux 🚀 = Windows 🐢). This can be _"fixed"_ by prepending the command with `taskset --cpu 0,2` (not recommended at all). It can also be due to the topology of the tested CPUs, which have more than one Core Complex Die (CCD). Intel CPUs seem to stick to the same thread for longer, which makes the Python threaded method often slightly faster.
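
For experimentation only, the same kind of CPU pinning can be approximated from inside Python on Linux with `os.sched_setaffinity`. The sketch below is an assumption-laden illustration, not something the project does: `pin_to_cpus` is a hypothetical helper, it silently does nothing on platforms without the affinity API, and it only ever narrows the process to CPUs it is already allowed to use.

```python
import os


def pin_to_cpus(wanted=(0, 2)):
    """Restrict this process to the wanted CPUs, where the OS allows it.

    Returns the CPU set now in effect, or None if the platform has no
    affinity API (for example Windows or macOS).
    """
    if not hasattr(os, "sched_setaffinity"):
        return None
    allowed = os.sched_getaffinity(0)
    chosen = allowed & set(wanted)
    if not chosen:
        # None of the wanted CPUs is available: leave the affinity untouched
        return set(allowed)
    os.sched_setaffinity(0, chosen)
    return chosen


if __name__ == "__main__":
    print(pin_to_cpus())
```

As with the `taskset` command, pinning is a diagnostic aid for comparing scheduler behaviour rather than a recommended setting.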
317
+
318
+ ### Personal experience
319
+
320
+ On realistic loads, like [**ShaderFlow**](https://github.com/BrokenSource/ShaderFlow)'s default lightweight shader export, TurboPipe increases rendering speed from 1080p260 to 1080p360 on my system, with CPU usage in the mid 80% range rather than the low 60s. For [**DepthFlow**](https://github.com/BrokenSource/ShaderFlow)'s default depth video export, no gains are seen, as the CPU is almost saturated encoding at 1080p130.
321
+
322
+ </div>
323
+
324
+ <br>
325
+
326
+ # 📚 Future work
327
+
328
+ - Add support for NumPy arrays, memoryviews, and byte-like objects
329
+ - Disable/investigate performance degradation on Windows iGPUs
330
+ - Improve the thread synchronization and/or use a ThreadPool
331
+ - Find a more stable way of locating the mglo struct offsets (moderngl.h?)
332
+ - Maybe use `mmap` instead of chunked writes on Linux
333
+ - Test on macOS 🙈
@@ -21,6 +21,7 @@ TOTAL_BYTES = (BYTES_PER_FRAME * TOTAL_FRAMES)
21
21
  # Create ModernGL objects
22
22
  ctx = moderngl.create_standalone_context()
23
23
  buffer = ctx.buffer(reserve=BYTES_PER_FRAME)
24
+ print(f"OpenGL Renderer: {ctx.info['GL_RENDERER']}")
24
25
 
25
26
  # Let's play fair and avoid any OS/Python/Hardware optimizations
26
27
  buffer.write(bytearray(random.getrandbits(8) for _ in range(BYTES_PER_FRAME)))
@@ -61,7 +62,7 @@ def Progress():
61
62
 
62
63
  # -------------------------------------------------------------------------------------------------|
63
64
 
64
- print("\n:: Traditional method\n")
65
+ print("\n:: Standard method\n")
65
66
 
66
67
  with Progress() as progress, FFmpeg() as ffmpeg:
67
68
  for frame in range(TOTAL_FRAMES):
@@ -17,8 +17,10 @@ import tqdm
17
17
  import turbopipe
18
18
 
19
19
  ctx = moderngl.create_standalone_context()
20
+ print(f"OpenGL Renderer: {ctx.info['GL_RENDERER']}")
21
+
20
22
  AVERAGE_N_RUNS = 3
21
- DATA_SIZE = 3
23
+ DATA_SIZE_GB = 5
22
24
 
23
25
  # -------------------------------------------------------------------------------------------------|
24
26
 
@@ -141,7 +143,7 @@ class Benchmark:
141
143
  nbuffer = (4 if test_case in ["B", "C", "D"] else 1)
142
144
  buffers = [ctx.buffer(data=numpy.random.randint(128, 135, (height, width, 3), dtype=numpy.uint8)) for _ in range(nbuffer)]
143
145
  bytes_per_frame = (width * height * 3)
144
- total_frames = int((DATA_SIZE * 1024**3) / bytes_per_frame)
146
+ total_frames = int((DATA_SIZE_GB * 1024**3) / bytes_per_frame)
145
147
  statistics = Statistics(bytes_per_frame)
146
148
 
147
149
  for run in range(AVERAGE_N_RUNS):
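
To make the effect of raising `DATA_SIZE_GB` to 5 concrete, the `total_frames` expression from the hunk above can be evaluated per resolution. This is a small worked example, not part of the benchmark: the `RESOLUTIONS` table assumes the usual 16:9 pixel dimensions, and wrapping the expression in a function is purely for illustration.

```python
DATA_SIZE_GB = 5  # value introduced in the 1.0.5 benchmark

# Assumed 16:9 dimensions for the resolutions named in the README tables
RESOLUTIONS = {
    "720p": (1280, 720),
    "1080p": (1920, 1080),
    "1440p": (2560, 1440),
    "2160p": (3840, 2160),
}


def total_frames(width, height, data_size_gb=DATA_SIZE_GB):
    """Number of whole RGB frames that fit in `data_size_gb` GiB."""
    bytes_per_frame = width * height * 3
    return int((data_size_gb * 1024**3) / bytes_per_frame)


if __name__ == "__main__":
    for name, (width, height) in RESOLUTIONS.items():
        print(name, total_frames(width, height))
```

Under these assumptions a single run pipes 863 frames at 1080p and 215 frames at 2160p, so higher resolutions average over far fewer frames per run.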
@@ -1,3 +1,3 @@
1
1
  #!/usr/bin/env python3
2
- __version__ = "1.0.4"
2
+ __version__ = "1.0.5"
3
3
  print(__version__)
turbopipe-1.0.4/PKG-INFO DELETED
@@ -1,223 +0,0 @@
1
- Metadata-Version: 2.1
2
- Name: turbopipe
3
- Version: 1.0.4
4
- Summary: 🌀 Faster ModernGL Buffer inter process data transfers
5
- Home-page: https://brokensrc.dev
6
- Author-Email: Tremeschin <29046864+Tremeschin@users.noreply.github.com>
7
- License: MIT License
8
-
9
- Copyright (c) 2024 Gabriel Tremeschin
10
-
11
- Permission is hereby granted, free of charge, to any person obtaining a copy
12
- of this software and associated documentation files (the "Software"), to deal
13
- in the Software without restriction, including without limitation the rights
14
- to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
15
- copies of the Software, and to permit persons to whom the Software is
16
- furnished to do so, subject to the following conditions:
17
-
18
- The above copyright notice and this permission notice shall be included in all
19
- copies or substantial portions of the Software.
20
-
21
- THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
22
- IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
23
- FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
24
- AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
25
- LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
26
- OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
27
- SOFTWARE.
28
- Project-URL: Issues, https://github.com/BrokenSource/TurboPipe/issues
29
- Project-URL: Repository, https://github.com/BrokenSource/TurboPipe
30
- Project-URL: Documentation, https://github.com/BrokenSource/TurboPipe
31
- Project-URL: Homepage, https://brokensrc.dev
32
- Requires-Python: >=3.7
33
- Requires-Dist: moderngl
34
- Description-Content-Type: text/markdown
35
-
36
- > [!IMPORTANT]
37
- > <sub>Also check out [**ShaderFlow**](https://github.com/BrokenSource/ShaderFlow), where **TurboPipe** shines! 😉</sub>
38
-
39
- <div align="center">
40
- <a href="https://brokensrc.dev/"><img src="https://raw.githubusercontent.com/BrokenSource/TurboPipe/main/turbopipe/resources/images/turbopipe.png" width="200"></a>
41
- <h1>TurboPipe</h1>
42
- <br>
43
- Faster <a href="https://github.com/moderngl/moderngl"><b>ModernGL</b></a> inter-process data transfers
44
- </div>
45
-
46
- <br>
47
-
48
- # 🔥 Description
49
-
50
- > TurboPipe speeds up sending raw bytes from `moderngl.Buffer` objects primarily to `FFmpeg` subprocess
51
-
52
- The **optimizations** involved are:
53
-
54
- - **Zero-copy**: Avoid unnecessary memory copies or allocation (intermediate `buffer.read()`)
55
- - **C++**: The core of TurboPipe is written in C++ for speed, efficiency and low-level control
56
- - **Chunks**: Write in chunks of 4096 bytes (RAM page size), so the hardware is happy
57
- - **Threaded**:
58
- - Doesn't block Python code execution, allows to render next frame
59
- - Decouples the main thread from the I/O thread for performance
60
-
61
- ✅ Don't worry, there's proper **safety** in place. TurboPipe will block Python if a memory address is already queued for writing, and guarantees order of writes per file-descriptor. Just call `.sync()` when done 😉
62
-
63
- <br>
64
-
65
- # 📦 Installation
66
-
67
- It couldn't be easier! Just install in your package manager:
68
-
69
- ```bash
70
- pip install turbopipe
71
- poetry add turbopipe
72
- pdm add turbopipe
73
- rye add turbopipe
74
- ```
75
-
76
- <br>
77
-
78
- # 🚀 Usage
79
-
80
- See also the [**Examples**](https://github.com/BrokenSource/TurboPipe/tree/main/examples) folder for more controlled usage, and [**ShaderFlow**](https://github.com/BrokenSource/ShaderFlow/blob/main/ShaderFlow/Scene.py) usage of it!
81
-
82
- ```python
83
- import subprocess
84
- import moderngl
85
- import turbopipe
86
-
87
- # Create ModernGL objects
88
- ctx = moderngl.create_standalone_context()
89
- buffer = ctx.buffer(reserve=1920*1080*3)
90
-
91
- # Make sure resolution, pixel format matches!
92
- ffmpeg = subprocess.Popen(
93
- 'ffmpeg -f rawvideo -pix_fmt rgb24 -s 1920x1080 -i - -f null -'.split(),
94
- stdin=subprocess.PIPE
95
- )
96
-
97
- # Rendering loop of yours
98
- for _ in range(100):
99
- turbopipe.pipe(buffer, ffmpeg.stdin.fileno())
100
-
101
- # Finalize writing
102
- turbopipe.sync()
103
- ffmpeg.stdin.close()
104
- ffmpeg.wait()
105
- ```
106
-
107
- <br>
108
-
109
- # ⭐️ Benchmarks
110
-
111
- > [!NOTE]
112
- > **The test conditions are as follows**:
113
- > - The tests are the average of 3 runs to ensure consistency, with 3 GB of the same data being piped
114
- > - The data is a random noise per-buffer between 128-135. So, multi-buffers runs are a noise video
115
- > - All resolutions are wide-screen (16:9) and have 3 components (RGB) with 3 bytes per pixel (SDR)
116
- > - Multi-buffer cycles through a list of buffers (e.g. 1, 2, 3, 1, 2, 3... for 3 buffers)
117
- > - All FFmpeg outputs are discarded with `-f null -` to avoid any disk I/O bottlenecks
118
- > - The `gain` column is the percentage increase over the standard method
119
- > - When `x264` is Null, no encoding took place (passthrough)
120
- > - The test case emojis signify:
121
- > - 🐢: Standard `ffmpeg.stdin.write(buffer.read())` on just the main thread, pure Python
122
- > - 🚀: Threaded `ffmpeg.stdin.write(buffer.read())` with a queue (similar to turbopipe)
123
- > - 🌀: The magic of `turbopipe.pipe(buffer, ffmpeg.stdin.fileno())`
124
- >
125
- > Also see [`benchmark.py`](https://github.com/BrokenSource/TurboPipe/blob/main/examples/benchmark.py) for the implementation
126
-
127
- ✅ Check out benchmarks in a couple of systems below:
128
-
129
- <details>
130
- <summary><b>Desktop</b> • (AMD Ryzen 9 5900x) • (NVIDIA RTX 3060 12 GB) • (DDR4 2x32 GB 3200 MT/s) • (Arch Linux)</summary>
131
- <br>
132
-
133
- | 720p | x264 | Buffers | Framerate | Bandwidth | Gain |
134
- |:----:|:----------|:---------:|----------:|------------:|---------:|
135
- | 🐢 | Null | 1 | 882 fps | 2.44 GB/s | |
136
- | 🚀 | Null | 1 | 793 fps | 2.19 GB/s | -10.04% |
137
- | 🌀 | Null | 1 | 1911 fps | 5.28 GB/s | 116.70% |
138
- | 🐢 | Null | 4 | 818 fps | 2.26 GB/s | |
139
- | 🚀 | Null | 4 | 684 fps | 1.89 GB/s | -16.35% |
140
- | 🌀 | Null | 4 | 1494 fps | 4.13 GB/s | 82.73% |
141
- | 🐢 | ultrafast | 4 | 664 fps | 1.84 GB/s | |
142
- | 🚀 | ultrafast | 4 | 635 fps | 1.76 GB/s | -4.33% |
143
- | 🌀 | ultrafast | 4 | 869 fps | 2.40 GB/s | 31.00% |
144
- | 🐢 | slow | 4 | 204 fps | 0.57 GB/s | |
145
- | 🚀 | slow | 4 | 205 fps | 0.57 GB/s | 0.58% |
146
- | 🌀 | slow | 4 | 208 fps | 0.58 GB/s | 2.22% |
147
-
148
- | 1080p | x264 | Buffers | Framerate | Bandwidth | Gain |
149
- |:-----:|:----------|:---------:|----------:|------------:|--------:|
150
- | 🐢 | Null | 1 | 385 fps | 2.40 GB/s | |
151
- | 🚀 | Null | 1 | 369 fps | 2.30 GB/s | -3.91% |
152
- | 🌀 | Null | 1 | 641 fps | 3.99 GB/s | 66.54% |
153
- | 🐢 | Null | 4 | 387 fps | 2.41 GB/s | |
154
- | 🚀 | Null | 4 | 359 fps | 2.23 GB/s | -7.21% |
155
- | 🌀 | Null | 4 | 632 fps | 3.93 GB/s | 63.40% |
156
- | 🐢 | ultrafast | 4 | 272 fps | 1.70 GB/s | |
157
- | 🚀 | ultrafast | 4 | 266 fps | 1.66 GB/s | -2.14% |
158
- | 🌀 | ultrafast | 4 | 405 fps | 2.53 GB/s | 49.24% |
159
- | 🐢 | slow | 4 | 117 fps | 0.73 GB/s | |
160
- | 🚀 | slow | 4 | 122 fps | 0.76 GB/s | 4.43% |
161
- | 🌀 | slow | 4 | 124 fps | 0.77 GB/s | 6.48% |
162
-
163
- | 1440p | x264 | Buffers | Framerate | Bandwidth | Gain |
164
- |:-----:|:----------|:---------:|----------:|------------:|--------:|
165
- | 🐢 | Null | 1 | 204 fps | 2.26 GB/s | |
166
- | 🚀 | Null | 1 | 241 fps | 2.67 GB/s | 18.49% |
167
- | 🌀 | Null | 1 | 297 fps | 3.29 GB/s | 45.67% |
168
- | 🐢 | Null | 4 | 230 fps | 2.54 GB/s | |
169
- | 🚀 | Null | 4 | 235 fps | 2.61 GB/s | 2.52% |
170
- | 🌀 | Null | 4 | 411 fps | 4.55 GB/s | 78.97% |
171
- | 🐢 | ultrafast | 4 | 146 fps | 1.62 GB/s | |
172
- | 🚀 | ultrafast | 4 | 153 fps | 1.70 GB/s | 5.21% |
173
- | 🌀 | ultrafast | 4 | 216 fps | 2.39 GB/s | 47.96% |
174
- | 🐢 | slow | 4 | 73 fps | 0.82 GB/s | |
175
- | 🚀 | slow | 4 | 78 fps | 0.86 GB/s | 7.06% |
176
- | 🌀 | slow | 4 | 79 fps | 0.88 GB/s | 9.27% |
177
-
178
- | 2160p | x264 | Buffers | Framerate | Bandwidth | Gain |
179
- |:-----:|:----------|:---------:|----------:|------------:|---------:|
180
- | 🐢 | Null | 1 | 81 fps | 2.03 GB/s | |
181
- | 🚀 | Null | 1 | 107 fps | 2.67 GB/s | 32.26% |
182
- | 🌀 | Null | 1 | 213 fps | 5.31 GB/s | 163.47% |
183
- | 🐢 | Null | 4 | 87 fps | 2.18 GB/s | |
184
- | 🚀 | Null | 4 | 109 fps | 2.72 GB/s | 25.43% |
185
- | 🌀 | Null | 4 | 212 fps | 5.28 GB/s | 143.72% |
186
- | 🐢 | ultrafast | 4 | 59 fps | 1.48 GB/s | |
187
- | 🚀 | ultrafast | 4 | 67 fps | 1.68 GB/s | 14.46% |
188
- | 🌀 | ultrafast | 4 | 95 fps | 2.39 GB/s | 62.66% |
189
- | 🐢 | slow | 4 | 37 fps | 0.94 GB/s | |
190
- | 🚀 | slow | 4 | 43 fps | 1.07 GB/s | 16.22% |
191
- | 🌀 | slow | 4 | 44 fps | 1.11 GB/s | 20.65% |
192
-
193
- </details>
194
-
195
- <details>
196
- <summary><b>Desktop</b> • (AMD Ryzen 9 5900x) • (NVIDIA RTX 3060 12 GB) • (DDR4 2x32 GB 3200 MT/s) • (Windows 11)</summary>
197
- <br>
198
- </details>
199
-
200
- <br>
201
-
202
- <div align="justify">
203
-
204
- # 🌀 Conclusion
205
-
206
- TurboPipe significantly increases the speed of feeding data to FFmpeg, especially at higher resolutions. However, if there is little CPU compute available, or the video is too hard to encode (slow preset), the gains over the other methods are insignificant, as the encoder becomes the bottleneck. Multi-buffering didn't prove to have an advantage: debugging shows that the TurboPipe C++ thread is often starved of data to write (most likely because the file stream is buffered by the OS), and context switching or cache misses might be the cause of the slowdown.
207
-
208
- Interestingly, due to either Linux's scheduler on AMD Ryzen CPUs or their operating philosophy, it was experimentally seen that Ryzen's frenetic thread switching slightly degrades single-thread performance compared to Windows on the same system (Linux 🚀 = Windows 🐢). This can be _"fixed"_ by prepending the command with `taskset --cpu 0,2` (not recommended at all). It can also be due to the topology of the tested CPUs, which have more than one Core Complex Die (CCD). Intel CPUs seem to stick to the same thread for longer, which makes the Python threaded method an unnecessary overhead.
209
-
210
- ### Personal experience
211
-
212
- On realistic loads, like [**ShaderFlow**](https://github.com/BrokenSource/ShaderFlow)'s default lightweight shader export, TurboPipe increases rendering speed from 1080p260 to 1080p330 on my system, with CPU usage in the mid 80% range rather than the low 60s. For [**DepthFlow**](https://github.com/BrokenSource/ShaderFlow)'s default depth video export, no gains are seen, as the CPU is almost saturated encoding at 1080p130.
213
-
214
- </div>
215
-
216
- <br>
217
-
218
- # 📚 Future work
219
-
220
- - Add support for NumPy arrays, memoryviews, and byte-like objects
221
- - Improve the thread synchronization and/or use a ThreadPool
222
- - Maybe use `mmap` instead of chunks writing
223
- - Test on MacOS 🙈
turbopipe-1.0.4/Readme.md DELETED
@@ -1,188 +0,0 @@
1
- > [!IMPORTANT]
2
- > <sub>Also check out [**ShaderFlow**](https://github.com/BrokenSource/ShaderFlow), where **TurboPipe** shines! 😉</sub>
3
-
4
- <div align="center">
5
- <a href="https://brokensrc.dev/"><img src="https://raw.githubusercontent.com/BrokenSource/TurboPipe/main/turbopipe/resources/images/turbopipe.png" width="200"></a>
6
- <h1>TurboPipe</h1>
7
- <br>
8
- Faster <a href="https://github.com/moderngl/moderngl"><b>ModernGL</b></a> inter-process data transfers
9
- </div>
10
-
11
- <br>
12
-
13
- # 🔥 Description
14
-
15
- > TurboPipe speeds up sending raw bytes from `moderngl.Buffer` objects primarily to `FFmpeg` subprocess
16
-
17
- The **optimizations** involved are:
18
-
19
- - **Zero-copy**: Avoid unnecessary memory copies or allocation (intermediate `buffer.read()`)
20
- - **C++**: The core of TurboPipe is written in C++ for speed, efficiency and low-level control
21
- - **Chunks**: Write in chunks of 4096 bytes (RAM page size), so the hardware is happy
22
- - **Threaded**:
23
- - Doesn't block Python code execution, allows to render next frame
24
- - Decouples the main thread from the I/O thread for performance
25
-
26
- ✅ Don't worry, there's proper **safety** in place. TurboPipe will block Python if a memory address is already queued for writing, and guarantees order of writes per file-descriptor. Just call `.sync()` when done 😉
27
-
28
- <br>
29
-
30
- # 📦 Installation
31
-
32
- It couldn't be easier! Just install in your package manager:
33
-
34
- ```bash
35
- pip install turbopipe
36
- poetry add turbopipe
37
- pdm add turbopipe
38
- rye add turbopipe
39
- ```
40
-
41
- <br>
42
-
43
- # 🚀 Usage
44
-
45
- See also the [**Examples**](https://github.com/BrokenSource/TurboPipe/tree/main/examples) folder for more controlled usage, and [**ShaderFlow**](https://github.com/BrokenSource/ShaderFlow/blob/main/ShaderFlow/Scene.py) usage of it!
46
-
47
- ```python
48
- import subprocess
49
- import moderngl
50
- import turbopipe
51
-
52
- # Create ModernGL objects
53
- ctx = moderngl.create_standalone_context()
54
- buffer = ctx.buffer(reserve=1920*1080*3)
55
-
56
- # Make sure resolution, pixel format matches!
57
- ffmpeg = subprocess.Popen(
58
- 'ffmpeg -f rawvideo -pix_fmt rgb24 -s 1920x1080 -i - -f null -'.split(),
59
- stdin=subprocess.PIPE
60
- )
61
-
62
- # Rendering loop of yours
63
- for _ in range(100):
64
- turbopipe.pipe(buffer, ffmpeg.stdin.fileno())
65
-
66
- # Finalize writing
67
- turbopipe.sync()
68
- ffmpeg.stdin.close()
69
- ffmpeg.wait()
70
- ```
71
-
72
- <br>
73
-
74
- # ⭐️ Benchmarks
75
-
76
- > [!NOTE]
77
- > **The test conditions are as follows**:
78
- > - The tests are the average of 3 runs to ensure consistency, with 3 GB of the same data being piped
79
- > - The data is a random noise per-buffer between 128-135. So, multi-buffers runs are a noise video
80
- > - All resolutions are wide-screen (16:9) and have 3 components (RGB) with 3 bytes per pixel (SDR)
81
- > - Multi-buffer cycles through a list of buffers (e.g. 1, 2, 3, 1, 2, 3... for 3 buffers)
82
- > - All FFmpeg outputs are discarded with `-f null -` to avoid any disk I/O bottlenecks
83
- > - The `gain` column is the percentage increase over the standard method
84
- > - When `x264` is Null, no encoding took place (passthrough)
85
- > - The test case emojis signify:
86
- > - 🐢: Standard `ffmpeg.stdin.write(buffer.read())` on just the main thread, pure Python
87
- > - 🚀: Threaded `ffmpeg.stdin.write(buffer.read())` with a queue (similar to turbopipe)
88
- > - 🌀: The magic of `turbopipe.pipe(buffer, ffmpeg.stdin.fileno())`
89
- >
90
- > Also see [`benchmark.py`](https://github.com/BrokenSource/TurboPipe/blob/main/examples/benchmark.py) for the implementation
91
-
92
- ✅ Check out benchmarks in a couple of systems below:
93
-
94
- <details>
95
- <summary><b>Desktop</b> • (AMD Ryzen 9 5900x) • (NVIDIA RTX 3060 12 GB) • (DDR4 2x32 GB 3200 MT/s) • (Arch Linux)</summary>
96
- <br>
97
-
98
- | 720p | x264 | Buffers | Framerate | Bandwidth | Gain |
99
- |:----:|:----------|:---------:|----------:|------------:|---------:|
100
- | 🐢 | Null | 1 | 882 fps | 2.44 GB/s | |
101
- | 🚀 | Null | 1 | 793 fps | 2.19 GB/s | -10.04% |
102
- | 🌀 | Null | 1 | 1911 fps | 5.28 GB/s | 116.70% |
103
- | 🐢 | Null | 4 | 818 fps | 2.26 GB/s | |
104
- | 🚀 | Null | 4 | 684 fps | 1.89 GB/s | -16.35% |
105
- | 🌀 | Null | 4 | 1494 fps | 4.13 GB/s | 82.73% |
106
- | 🐢 | ultrafast | 4 | 664 fps | 1.84 GB/s | |
107
- | 🚀 | ultrafast | 4 | 635 fps | 1.76 GB/s | -4.33% |
108
- | 🌀 | ultrafast | 4 | 869 fps | 2.40 GB/s | 31.00% |
109
- | 🐢 | slow | 4 | 204 fps | 0.57 GB/s | |
110
- | 🚀 | slow | 4 | 205 fps | 0.57 GB/s | 0.58% |
111
- | 🌀 | slow | 4 | 208 fps | 0.58 GB/s | 2.22% |
112
-
113
- | 1080p | x264 | Buffers | Framerate | Bandwidth | Gain |
114
- |:-----:|:----------|:---------:|----------:|------------:|--------:|
115
- | 🐢 | Null | 1 | 385 fps | 2.40 GB/s | |
116
- | 🚀 | Null | 1 | 369 fps | 2.30 GB/s | -3.91% |
117
- | 🌀 | Null | 1 | 641 fps | 3.99 GB/s | 66.54% |
118
- | 🐢 | Null | 4 | 387 fps | 2.41 GB/s | |
119
- | 🚀 | Null | 4 | 359 fps | 2.23 GB/s | -7.21% |
120
- | 🌀 | Null | 4 | 632 fps | 3.93 GB/s | 63.40% |
121
- | 🐢 | ultrafast | 4 | 272 fps | 1.70 GB/s | |
122
- | 🚀 | ultrafast | 4 | 266 fps | 1.66 GB/s | -2.14% |
123
- | 🌀 | ultrafast | 4 | 405 fps | 2.53 GB/s | 49.24% |
124
- | 🐢 | slow | 4 | 117 fps | 0.73 GB/s | |
125
- | 🚀 | slow | 4 | 122 fps | 0.76 GB/s | 4.43% |
126
- | 🌀 | slow | 4 | 124 fps | 0.77 GB/s | 6.48% |
127
-
128
- | 1440p | x264 | Buffers | Framerate | Bandwidth | Gain |
129
- |:-----:|:----------|:---------:|----------:|------------:|--------:|
130
- | 🐢 | Null | 1 | 204 fps | 2.26 GB/s | |
131
- | 🚀 | Null | 1 | 241 fps | 2.67 GB/s | 18.49% |
132
- | 🌀 | Null | 1 | 297 fps | 3.29 GB/s | 45.67% |
133
- | 🐢 | Null | 4 | 230 fps | 2.54 GB/s | |
134
- | 🚀 | Null | 4 | 235 fps | 2.61 GB/s | 2.52% |
135
- | 🌀 | Null | 4 | 411 fps | 4.55 GB/s | 78.97% |
136
- | 🐢 | ultrafast | 4 | 146 fps | 1.62 GB/s | |
137
- | 🚀 | ultrafast | 4 | 153 fps | 1.70 GB/s | 5.21% |
138
- | 🌀 | ultrafast | 4 | 216 fps | 2.39 GB/s | 47.96% |
139
- | 🐢 | slow | 4 | 73 fps | 0.82 GB/s | |
140
- | 🚀 | slow | 4 | 78 fps | 0.86 GB/s | 7.06% |
141
- | 🌀 | slow | 4 | 79 fps | 0.88 GB/s | 9.27% |
142
-
143
- | 2160p | x264 | Buffers | Framerate | Bandwidth | Gain |
144
- |:-----:|:----------|:---------:|----------:|------------:|---------:|
145
- | 🐢 | Null | 1 | 81 fps | 2.03 GB/s | |
146
- | 🚀 | Null | 1 | 107 fps | 2.67 GB/s | 32.26% |
147
- | 🌀 | Null | 1 | 213 fps | 5.31 GB/s | 163.47% |
148
- | 🐢 | Null | 4 | 87 fps | 2.18 GB/s | |
149
- | 🚀 | Null | 4 | 109 fps | 2.72 GB/s | 25.43% |
150
- | 🌀 | Null | 4 | 212 fps | 5.28 GB/s | 143.72% |
151
- | 🐢 | ultrafast | 4 | 59 fps | 1.48 GB/s | |
152
- | 🚀 | ultrafast | 4 | 67 fps | 1.68 GB/s | 14.46% |
153
- | 🌀 | ultrafast | 4 | 95 fps | 2.39 GB/s | 62.66% |
154
- | 🐢 | slow | 4 | 37 fps | 0.94 GB/s | |
155
- | 🚀 | slow | 4 | 43 fps | 1.07 GB/s | 16.22% |
156
- | 🌀 | slow | 4 | 44 fps | 1.11 GB/s | 20.65% |
157
-
158
- </details>
159
-
160
- <details>
161
- <summary><b>Desktop</b> • (AMD Ryzen 9 5900x) • (NVIDIA RTX 3060 12 GB) • (DDR4 2x32 GB 3200 MT/s) • (Windows 11)</summary>
162
- <br>
163
- </details>
164
-
165
- <br>
166
-
167
- <div align="justify">
168
-
169
- # 🌀 Conclusion
170
-
171
- TurboPipe significantly increases the speed of feeding data to FFmpeg, especially at higher resolutions. However, if there is little CPU compute available, or the video is too hard to encode (slow preset), the gains over the other methods are insignificant, as the encoder becomes the bottleneck. Multi-buffering didn't prove to have an advantage: debugging shows that the TurboPipe C++ thread is often starved of data to write (most likely because the file stream is buffered by the OS), and context switching or cache misses might be the cause of the slowdown.
172
-
173
- Interestingly, due to either Linux's scheduler on AMD Ryzen CPUs or their operating philosophy, it was experimentally seen that Ryzen's frenetic thread switching slightly degrades single-thread performance compared to Windows on the same system (Linux 🚀 = Windows 🐢). This can be _"fixed"_ by prepending the command with `taskset --cpu 0,2` (not recommended at all). It can also be due to the topology of the tested CPUs, which have more than one Core Complex Die (CCD). Intel CPUs seem to stick to the same thread for longer, which makes the Python threaded method an unnecessary overhead.
174
-
175
- ### Personal experience
176
-
177
- On realistic loads, like [**ShaderFlow**](https://github.com/BrokenSource/ShaderFlow)'s default lightweight shader export, TurboPipe increases rendering speed from 1080p260 to 1080p330 on my system, with CPU usage in the mid 80% range rather than the low 60s. For [**DepthFlow**](https://github.com/BrokenSource/ShaderFlow)'s default depth video export, no gains are seen, as the CPU is almost saturated encoding at 1080p130.
178
-
179
- </div>
180
-
181
- <br>
182
-
183
- # 📚 Future work
184
-
185
- - Add support for NumPy arrays, memoryviews, and byte-like objects
186
- - Improve the thread synchronization and/or use a ThreadPool
187
- - Maybe use `mmap` instead of chunks writing
188
- - Test on MacOS 🙈
File without changes