mosaix-0.1.0.tar.gz
- mosaix-0.1.0/.gitignore +42 -0
- mosaix-0.1.0/CHANGELOG.md +53 -0
- mosaix-0.1.0/LICENSE +21 -0
- mosaix-0.1.0/PKG-INFO +369 -0
- mosaix-0.1.0/README.md +315 -0
- mosaix-0.1.0/docs/adapters.md +136 -0
- mosaix-0.1.0/docs/architecture.md +79 -0
- mosaix-0.1.0/docs/configuration.md +92 -0
- mosaix-0.1.0/docs/performance.md +126 -0
- mosaix-0.1.0/examples/custom_model.py +53 -0
- mosaix-0.1.0/examples/quickstart.py +38 -0
- mosaix-0.1.0/examples/whole_image.py +45 -0
- mosaix-0.1.0/pyproject.toml +90 -0
- mosaix-0.1.0/src/mosaix/__init__.py +88 -0
- mosaix-0.1.0/src/mosaix/adapters/__init__.py +123 -0
- mosaix-0.1.0/src/mosaix/adapters/base.py +87 -0
- mosaix-0.1.0/src/mosaix/adapters/callable_adapter.py +70 -0
- mosaix-0.1.0/src/mosaix/adapters/depth_adapter.py +50 -0
- mosaix-0.1.0/src/mosaix/adapters/face_adapter.py +58 -0
- mosaix-0.1.0/src/mosaix/adapters/onnx_adapter.py +187 -0
- mosaix-0.1.0/src/mosaix/adapters/tagger_adapter.py +85 -0
- mosaix-0.1.0/src/mosaix/adapters/torchscript_adapter.py +83 -0
- mosaix-0.1.0/src/mosaix/adapters/ultralytics_adapter.py +100 -0
- mosaix-0.1.0/src/mosaix/cli.py +213 -0
- mosaix-0.1.0/src/mosaix/config.py +191 -0
- mosaix-0.1.0/src/mosaix/engine.py +161 -0
- mosaix-0.1.0/src/mosaix/geometry.py +127 -0
- mosaix-0.1.0/src/mosaix/io/__init__.py +12 -0
- mosaix-0.1.0/src/mosaix/io/reader.py +161 -0
- mosaix-0.1.0/src/mosaix/io/writer.py +90 -0
- mosaix-0.1.0/src/mosaix/metrics.py +93 -0
- mosaix-0.1.0/src/mosaix/pipeline.py +149 -0
- mosaix-0.1.0/src/mosaix/results.py +120 -0
mosaix-0.1.0/.gitignore
ADDED
|
@@ -0,0 +1,42 @@
|
|
|
1
|
+
# Python
|
|
2
|
+
__pycache__/
|
|
3
|
+
*.py[cod]
|
|
4
|
+
*.egg-info/
|
|
5
|
+
*.egg
|
|
6
|
+
.eggs/
|
|
7
|
+
build/
|
|
8
|
+
dist/
|
|
9
|
+
.installed.cfg
|
|
10
|
+
|
|
11
|
+
# Virtual envs
|
|
12
|
+
.venv/
|
|
13
|
+
venv/
|
|
14
|
+
env/
|
|
15
|
+
|
|
16
|
+
# Test / coverage
|
|
17
|
+
.pytest_cache/
|
|
18
|
+
.coverage
|
|
19
|
+
htmlcov/
|
|
20
|
+
.ruff_cache/
|
|
21
|
+
|
|
22
|
+
# Models & media (don't ship weights/videos in the repo)
|
|
23
|
+
*.pt
|
|
24
|
+
*.onnx
|
|
25
|
+
*.engine
|
|
26
|
+
*.mp4
|
|
27
|
+
*.avi
|
|
28
|
+
*.mkv
|
|
29
|
+
*.mov
|
|
30
|
+
|
|
31
|
+
# Benchmark / run artifacts
|
|
32
|
+
*.json.bench
|
|
33
|
+
benchmarks/results/
|
|
34
|
+
runs/
|
|
35
|
+
_logs/
|
|
36
|
+
_dbg/
|
|
37
|
+
|
|
38
|
+
# IDE / OS
|
|
39
|
+
.idea/
|
|
40
|
+
.vscode/
|
|
41
|
+
.DS_Store
|
|
42
|
+
Thumbs.db
|
mosaix-0.1.0/CHANGELOG.md
ADDED
|
@@ -0,0 +1,53 @@
|
|
|
1
|
+
# Changelog
|
|
2
|
+
|
|
3
|
+
All notable changes to this project are documented here. The format follows
|
|
4
|
+
[Keep a Changelog](https://keepachangelog.com/), and the project adheres to
|
|
5
|
+
[Semantic Versioning](https://semver.org/).
|
|
6
|
+
|
|
7
|
+
## [Unreleased]
|
|
8
|
+
|
|
9
|
+
### Added
|
|
10
|
+
- **Multi-family model support beyond box detectors.** New adapters and a generalised
|
|
11
|
+
`FrameResult` (carrying `label` / `tags` / `extra` alongside `detections`) let mosaix
|
|
12
|
+
run whole-image tasks through the same batched pipeline:
|
|
13
|
+
- `TaggerAdapter` — ONNX multi-label taggers (WD-EVA02, PixAI); auto NCHW/NHWC.
|
|
14
|
+
- `DepthAdapter` — HuggingFace depth estimation (Depth-Anything-V2).
|
|
15
|
+
- `TorchScriptAdapter` — raw forward for TorchScript/`.pt2` & fixed-size ONNX nets
|
|
16
|
+
(Sapiens, DWPose) for throughput benchmarking.
|
|
17
|
+
- `FaceAdapter` — InsightFace SCRFD detection (buffalo_l/s) → boxes.
|
|
18
|
+
- `UltralyticsAdapter` now auto-selects RT-DETR / FastSAM / SAM, handles **OBB**
|
|
19
|
+
(rotated → axis-aligned) and **classification** (`-cls`) models.
|
|
20
|
+
- **Whole-image engine path** (`MosaicEngine.infer_iter_whole`) for `output_kind="whole"`
|
|
21
|
+
adapters: one model input per frame, throughput from large-batch inference.
|
|
22
|
+
- **Benchmark harness** (`benchmarks/bench_all.py` + `bench_one.py`) — one process per
|
|
23
|
+
model for uncontested FPS / VRAM / GPU-util / RAM, CUDA + CPU, configs 1×1 / 4×32 /
|
|
24
|
+
9×32. `benchmarks/accuracy.py` — coco128 mAP@0.5, native vs mosaic-tiled.
|
|
25
|
+
- **README** — measured benchmark tables (RTX 4060 Laptop) and a tested/supported
|
|
26
|
+
model-families matrix.
|
|
27
|
+
|
|
28
|
+
## [0.1.0] - 2026-06-14
|
|
29
|
+
|
|
30
|
+
Initial release.
|
|
31
|
+
|
|
32
|
+
### Added
|
|
33
|
+
- **Mosaic throughput engine** — gutter-padded tiling (`grid`) + batched forward
|
|
34
|
+
passes (`batch`) covering `grid × batch` frames per pass, with centroid-assign +
|
|
35
|
+
clip remapping that eliminates seam duplicates without cross-cell NMS.
|
|
36
|
+
- **Pluggable model adapters** — `UltralyticsAdapter` (YOLO/RT-DETR, `.pt`/`.onnx`/
|
|
37
|
+
`.engine`), `OnnxAdapter` (generic YOLO ONNX with built-in letterbox/decode/NMS),
|
|
38
|
+
and `CallableAdapter` (wrap any function). Custom adapters via `register_adapter`.
|
|
39
|
+
- **Streaming pipeline** — `VideoPipeline.stream()` / `MosaicEngine.infer_iter()`
|
|
40
|
+
process arbitrarily long videos in constant memory; threaded decode overlaps the GPU.
|
|
41
|
+
- **Throughput metrics** — GPU-synced, reporting both inference-only and true
|
|
42
|
+
end-to-end FPS, per-stage timing, and peak GPU memory.
|
|
43
|
+
- **CLI** — `mosaix run` / `bench` / `info`, exposing every config knob.
|
|
44
|
+
- **Configuration** — fully documented `TileConfig`, `ReaderConfig`, `InferenceConfig`,
|
|
45
|
+
`PipelineConfig` dataclasses.
|
|
46
|
+
- **Docs** — architecture, configuration, adapters, and performance/tuning guides.
|
|
47
|
+
- **Examples & benchmark harness**, plus a model-free test suite.
|
|
48
|
+
|
|
49
|
+
### Performance
|
|
50
|
+
- YOLO11n on RTX 4060 Laptop (8 GB), multi-minute 240p video, decode included:
|
|
51
|
+
up to **2178 FPS inference-only / 1735 FPS end-to-end** at the `grid=16, batch=24`
|
|
52
|
+
speed preset (cool GPU); ~**1349 / 1174 FPS** at the default `grid=9, batch=32`.
|
|
53
|
+
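The speed preset is comfortably past the 1200 FPS target; the default preset clears it inference-only (~1349 FPS) but lands just under it end-to-end (~1174 FPS).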
|
mosaix-0.1.0/LICENSE
ADDED
|
@@ -0,0 +1,21 @@
|
|
|
1
|
+
MIT License
|
|
2
|
+
|
|
3
|
+
Copyright (c) 2026 mosaix contributors
|
|
4
|
+
|
|
5
|
+
Permission is hereby granted, free of charge, to any person obtaining a copy
|
|
6
|
+
of this software and associated documentation files (the "Software"), to deal
|
|
7
|
+
in the Software without restriction, including without limitation the rights
|
|
8
|
+
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
|
|
9
|
+
copies of the Software, and to permit persons to whom the Software is
|
|
10
|
+
furnished to do so, subject to the following conditions:
|
|
11
|
+
|
|
12
|
+
The above copyright notice and this permission notice shall be included in all
|
|
13
|
+
copies or substantial portions of the Software.
|
|
14
|
+
|
|
15
|
+
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
|
|
16
|
+
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
|
|
17
|
+
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
|
|
18
|
+
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
|
|
19
|
+
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
|
|
20
|
+
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
|
|
21
|
+
SOFTWARE.
|
mosaix-0.1.0/PKG-INFO
ADDED
|
@@ -0,0 +1,369 @@
|
|
|
1
|
+
Metadata-Version: 2.4
|
|
2
|
+
Name: mosaix
|
|
3
|
+
Version: 0.1.0
|
|
4
|
+
Summary: Maximum-throughput video inference for YOLO and any vision model, via batched gutter-padded mosaic tiling.
|
|
5
|
+
Project-URL: Homepage, https://github.com/BARKEM-Cognitive-Industries/mosaix
|
|
6
|
+
Project-URL: Documentation, https://github.com/BARKEM-Cognitive-Industries/mosaix/tree/main/docs
|
|
7
|
+
Project-URL: Repository, https://github.com/BARKEM-Cognitive-Industries/mosaix
|
|
8
|
+
Project-URL: Issues, https://github.com/BARKEM-Cognitive-Industries/mosaix/issues
|
|
9
|
+
Project-URL: Changelog, https://github.com/BARKEM-Cognitive-Industries/mosaix/blob/main/CHANGELOG.md
|
|
10
|
+
Author: mosaix contributors
|
|
11
|
+
License: MIT
|
|
12
|
+
License-File: LICENSE
|
|
13
|
+
Keywords: acceleration,batch,computer-vision,gpu,inference,mosaic,object-detection,onnx,real-time,throughput,tiling,ultralytics,video,yolo
|
|
14
|
+
Classifier: Development Status :: 4 - Beta
|
|
15
|
+
Classifier: Intended Audience :: Developers
|
|
16
|
+
Classifier: Intended Audience :: Science/Research
|
|
17
|
+
Classifier: License :: OSI Approved :: MIT License
|
|
18
|
+
Classifier: Operating System :: OS Independent
|
|
19
|
+
Classifier: Programming Language :: Python :: 3
|
|
20
|
+
Classifier: Programming Language :: Python :: 3.9
|
|
21
|
+
Classifier: Programming Language :: Python :: 3.10
|
|
22
|
+
Classifier: Programming Language :: Python :: 3.11
|
|
23
|
+
Classifier: Programming Language :: Python :: 3.12
|
|
24
|
+
Classifier: Topic :: Multimedia :: Video
|
|
25
|
+
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
|
|
26
|
+
Classifier: Topic :: Scientific/Engineering :: Image Recognition
|
|
27
|
+
Requires-Python: >=3.9
|
|
28
|
+
Requires-Dist: numpy>=1.21
|
|
29
|
+
Requires-Dist: opencv-python>=4.5
|
|
30
|
+
Provides-Extra: all
|
|
31
|
+
Requires-Dist: insightface>=0.7; extra == 'all'
|
|
32
|
+
Requires-Dist: onnxruntime-gpu>=1.15; extra == 'all'
|
|
33
|
+
Requires-Dist: transformers>=4.40; extra == 'all'
|
|
34
|
+
Requires-Dist: ultralytics>=8.1; extra == 'all'
|
|
35
|
+
Provides-Extra: depth
|
|
36
|
+
Requires-Dist: transformers>=4.40; extra == 'depth'
|
|
37
|
+
Provides-Extra: dev
|
|
38
|
+
Requires-Dist: build>=1.0; extra == 'dev'
|
|
39
|
+
Requires-Dist: psutil>=5.9; extra == 'dev'
|
|
40
|
+
Requires-Dist: pytest-cov>=4.0; extra == 'dev'
|
|
41
|
+
Requires-Dist: pytest>=7.0; extra == 'dev'
|
|
42
|
+
Requires-Dist: ruff>=0.4; extra == 'dev'
|
|
43
|
+
Requires-Dist: twine>=5.0; extra == 'dev'
|
|
44
|
+
Provides-Extra: face
|
|
45
|
+
Requires-Dist: insightface>=0.7; extra == 'face'
|
|
46
|
+
Requires-Dist: onnxruntime-gpu>=1.15; extra == 'face'
|
|
47
|
+
Provides-Extra: onnx
|
|
48
|
+
Requires-Dist: onnxruntime-gpu>=1.15; extra == 'onnx'
|
|
49
|
+
Provides-Extra: onnx-cpu
|
|
50
|
+
Requires-Dist: onnxruntime>=1.15; extra == 'onnx-cpu'
|
|
51
|
+
Provides-Extra: ultralytics
|
|
52
|
+
Requires-Dist: ultralytics>=8.1; extra == 'ultralytics'
|
|
53
|
+
Description-Content-Type: text/markdown
|
|
54
|
+
|
|
55
|
+
# mosaix
|
|
56
|
+
|
|
57
|
+
**Maximum-throughput video inference for YOLO — and any vision model.**
|
|
58
|
+
|
|
59
|
+
`mosaix` turns an ordinary object detector into a firehose. On a single laptop RTX 4060
|
|
60
|
+
(8 GB) it pushes **YOLO11n to ~1735 FPS end-to-end / 2178 FPS inference-only** on real,
|
|
61
|
+
multi-minute video — decode included — without exotic hardware, custom CUDA, or model
|
|
62
|
+
retraining. Point it at a model and a video; it gives you per-frame detections and the
|
|
63
|
+
throughput number.
|
|
64
|
+
|
|
65
|
+
```python
|
|
66
|
+
import mosaix
|
|
67
|
+
|
|
68
|
+
pipe = mosaix.VideoPipeline.from_model("yolo11n.pt")
|
|
69
|
+
result = pipe.run("long_video.mp4")
|
|
70
|
+
|
|
71
|
+
print(f"{result.fps:.0f} FPS end-to-end, {result.total_detections} detections")
|
|
72
|
+
for frame in result:
|
|
73
|
+
for det in frame.detections:
|
|
74
|
+
print(frame.index, det.cls, det.conf, det.xyxy)
|
|
75
|
+
```
|
|
76
|
+
|
|
77
|
+
---
|
|
78
|
+
|
|
79
|
+
## How it hits 1000+ FPS
|
|
80
|
+
|
|
81
|
+
A detector spends most of its time on per-image fixed overhead (launch, letterbox,
|
|
82
|
+
NMS, host↔device copies), not on the pixels that matter. `mosaix` amortises that
|
|
83
|
+
overhead across many frames at once:
|
|
84
|
+
|
|
85
|
+
```
|
|
86
|
+
one GPU forward pass
|
|
87
|
+
┌───────────────────────────────────────────────────────────┐
|
|
88
|
+
frames │ ┌────┬────┬────┐ ┌────┬────┬────┐ ┌────┬────┬────┐ │
|
|
89
|
+
0..287 │ │ f0 │ f1 │ f2 │ │ f9 │f10 │f11 │ ... │f279│f280│f281│ │
|
|
90
|
+
│ ├────┼────┼────┤ ├────┼────┼────┤ ├────┼────┼────┤ │
|
|
91
|
+
│ │ f3 │ f4 │ f5 │ │f12 │f13 │f14 │ │... │... │... │ │
|
|
92
|
+
│ ├────┼────┼────┤ ├────┼────┼────┤ ├────┼────┼────┤ │
|
|
93
|
+
│ │ f6 │ f7 │ f8 │ │f15 │f16 │f17 │ │... │... │f287│ │
|
|
94
|
+
│ └────┴────┴────┘ └────┴────┴────┘ └────┴────┴────┘ │
|
|
95
|
+
│ mosaic 0 mosaic 1 ... mosaic 31 │
|
|
96
|
+
└───────────────────────────────────────────────────────────┘
|
|
97
|
+
grid = 9 cells × batch = 32 mosaics
|
|
98
|
+
= 288 frames per single forward pass
|
|
99
|
+
```
|
|
100
|
+
|
|
101
|
+
1. **Downscale** every frame to a small cell (default `426×240`, i.e. "240p").
|
|
102
|
+
2. **Mosaic** `grid` cells (default 9 → a 3×3 grid) into one image, with a blank
|
|
103
|
+
**gutter** between cells.
|
|
104
|
+
3. **Batch** `batch` mosaics (default 32) into a single model forward pass — so each
|
|
105
|
+
pass covers `grid × batch = 288` frames.
|
|
106
|
+
4. **Remap** every detection back to the exact frame it came from.
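The mosaic and remap steps above can be sketched in plain Python. This is a minimal illustration of gutter-padded layout plus centroid-assign and clip, not mosaix's actual implementation: the helper names `mosaic_layout` and `remap` are hypothetical, and the sketch assumes a square grid with row-major cell order.

```python
import math

def mosaic_layout(grid, cell_w, cell_h, gutter):
    """Return (mosaic_w, mosaic_h, origins) for a square grid of cells.

    origins[i] is the top-left (x, y) of cell i, in row-major order.
    """
    side = math.isqrt(grid)
    if side * side != grid:
        raise ValueError("this sketch only handles square grids (4, 9, 16, ...)")
    origins = [(col * (cell_w + gutter), row * (cell_h + gutter))
               for row in range(side) for col in range(side)]
    mosaic_w = side * cell_w + (side - 1) * gutter
    mosaic_h = side * cell_h + (side - 1) * gutter
    return mosaic_w, mosaic_h, origins

def remap(boxes, origins, cell_w, cell_h):
    """Assign each (x1, y1, x2, y2, conf, cls) box to the cell that contains
    its centroid, shift it into that cell's coordinates, and clip it."""
    per_cell = [[] for _ in origins]
    for x1, y1, x2, y2, conf, cls in boxes:
        cx, cy = (x1 + x2) / 2.0, (y1 + y2) / 2.0
        for i, (x0, y0) in enumerate(origins):
            if x0 <= cx < x0 + cell_w and y0 <= cy < y0 + cell_h:
                per_cell[i].append((
                    max(x1 - x0, 0.0), max(y1 - y0, 0.0),
                    min(x2 - x0, float(cell_w)), min(y2 - y0, float(cell_h)),
                    conf, cls))
                break
        # A box whose centroid lies in a gutter matches no cell and is dropped.
    return per_cell

w, h, origins = mosaic_layout(grid=9, cell_w=426, cell_h=240, gutter=12)
# One box straddling the seam between cell 0 and cell 1, centred inside cell 1.
boxes = [(430.0, 10.0, 500.0, 60.0, 0.9, 0)]
frames = remap(boxes, origins, 426, 240)
print((w, h), frames[1])  # (1302, 744) [(0.0, 10.0, 62.0, 60.0, 0.9, 0)]
```

Because every box belongs to exactly one cell (the one holding its centroid), a detection that spills across a seam is reported once and clipped, which is why no cross-cell NMS is needed in this scheme.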
|
|
107
|
+
|
|
108
|
+
## Install
|
|
109
|
+
|
|
110
|
+
```bash
|
|
111
|
+
pip install mosaix # core (numpy + opencv)
|
|
112
|
+
pip install "mosaix[ultralytics]" # + YOLO / RT-DETR support
|
|
113
|
+
pip install "mosaix[onnx]" # + ONNX Runtime (GPU)
|
|
114
|
+
pip install "mosaix[all]" # everything
|
|
115
|
+
```
|
|
116
|
+
|
|
117
|
+
GPU inference needs a CUDA-enabled PyTorch (for the Ultralytics backend) or
|
|
118
|
+
`onnxruntime-gpu` (for the ONNX backend) — install those per your CUDA version.
|
|
119
|
+
|
|
120
|
+
---
|
|
121
|
+
|
|
122
|
+
## Tested models & benchmarks
|
|
123
|
+
|
|
124
|
+
mosaix is built around bounding-box detectors, but the adapter layer also runs
|
|
125
|
+
classifiers, taggers, depth, dense-pose and face models. The table below is **measured
|
|
126
|
+
on this repo's models** so you know what to expect *before* you try one.
|
|
127
|
+
|
|
128
|
+
**Test rig:** NVIDIA RTX 4060 Laptop (8 GB), FP16 on CUDA, OpenCV decode (no `decord`).
|
|
129
|
+
**Clip:** *The Monkey Business Illusion* — 854×480, ~30 fps, many people (a non-private,
|
|
130
|
+
reproducible stand-in for crowded real footage). Each model ran in its **own process**
|
|
131
|
+
(uncontested), sweeping configs `1×1` (no tiling), `4×32` and `9×32` (`grid×batch`).
|
|
132
|
+
600 frames per config for detectors, fewer for heavy whole-image nets.
|
|
133
|
+
|
|
134
|
+
> These are **relative, apples-to-apples** numbers across families on a short 480p clip
|
|
135
|
+
> — they are decode-bound (note the low GPU-util), not peak. For tuned single-model peak
|
|
136
|
+
> (YOLO11n hits ~1735 e2e / 2178 infer FPS), see [`docs/performance.md`](docs/performance.md).
|
|
137
|
+
> Install `decord` and use a 240p source to get there.
|
|
138
|
+
|
|
139
|
+
### Throughput
|
|
140
|
+
|
|
141
|
+
`e2e` = decode→mosaic→infer→remap (what you get); `infer` = GPU forward only.
|
|
142
|
+
`tiling×` = best e2e ÷ untiled (`1×1`) e2e — the speedup mosaicking buys.
|
|
143
|
+
VRAM is the Torch allocator peak; ONNX/onnxruntime memory isn't visible to it (`—`).
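As a worked example of the `tiling×` definition, the sketch below picks the best end-to-end config from a sweep and divides by the untiled `1×1` result. The function name `tiling_speedup` and the FPS values are made up for illustration; they are not measurements from this benchmark.

```python
def tiling_speedup(e2e_fps):
    """e2e_fps maps a "grid x batch" label to measured end-to-end FPS."""
    best_cfg = max(e2e_fps, key=e2e_fps.get)
    return best_cfg, e2e_fps[best_cfg] / e2e_fps["1x1"]

# Made-up numbers for illustration only, not taken from the table.
sweep = {"1x1": 50.0, "4x32": 200.0, "9x32": 180.0}
best_cfg, speedup = tiling_speedup(sweep)
print(f"best {best_cfg}, tiling x{speedup:.1f}")  # best 4x32, tiling x4.0
```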
|
|
144
|
+
|
|
145
|
+
| Model | Task | Backend | CUDA e2e | CUDA infer | tiling× | best `g×b` | VRAM | CPU e2e |
|
|
146
|
+
|---|---|---|--:|--:|--:|:--:|--:|--:|
|
|
147
|
+
| FastSAM-s | segment | ultralytics | **308** | 494 | 5.5× | 4×32 | 1187 MB | 43 |
|
|
148
|
+
| omniparser_icon_detect | detect (UI icons) | ultralytics | 250 | 360 | 4.5× | 4×32 | 684 MB | 26 |
|
|
149
|
+
| yolov12n-face | detect (face) | ultralytics | 235 | 473 | 6.7× | 9×32 | 413 MB | 94 |
|
|
150
|
+
| yolo11m | detect | ultralytics | 231 | 355 | 4.3× | 4×32 | 684 MB | 23 |
|
|
151
|
+
| yolo11m-pose | pose | ultralytics | 223 | 349 | 4.7× | 4×32 | 646 MB | 21 |
|
|
152
|
+
| rtdetr-l | detect (DETR) | ultralytics | 223 | 271 | **8.5×** | 4×32 | 1268 MB | 4.0 |
|
|
153
|
+
| FastSAM-x | segment | ultralytics | 220 | 265 | 4.7× | 4×32 | 1803 MB | 4.6 |
|
|
154
|
+
| yolo11n-cls | classify | ultralytics | 207 | 247 | 2.6× | 1×32 | 123 MB | 75 |
|
|
155
|
+
| yolov12m-face | detect (face) | ultralytics | 198 | 304 | 5.0× | 4×32 | 644 MB | 22 |
|
|
156
|
+
| yolo11x-pose | pose | ultralytics | 191 | 244 | 5.8× | 4×32 | 1242 MB | 4.7 |
|
|
157
|
+
| yolov12l-face | detect (face) | ultralytics | 190 | 268 | 8.0× | 4×32 | 770 MB | 16 |
|
|
158
|
+
| yolo11m-seg | segment | ultralytics | 189 | 268 | 4.4× | 4×32 | 1186 MB | 16 |
|
|
159
|
+
| yolo11n | detect | ultralytics | 186 | 309 | 3.0× | 4×32 | 212 MB | 94 |
|
|
160
|
+
| yolo11x | detect | ultralytics | 184 | 246 | 5.7× | 4×32 | 1238 MB | 5.2 |
|
|
161
|
+
| yoloe-11s-seg | segment (open-vocab) | ultralytics | 177 | 317 | 4.1× | 4×32 | 652 MB | 36 |
|
|
162
|
+
| yolo11n-obb | oriented bbox | ultralytics | 175 | 259 | 4.3× | 4×32 | 212 MB | 76 |
|
|
163
|
+
| yolo11n-pose | pose | ultralytics | 170 | 269 | 3.7× | 4×32 | 213 MB | 81 |
|
|
164
|
+
| dw-ss_ucoco | pose (DWPose) | onnx¹ | 169 | 187 | 1.2× | 1×32 | — | 154 |
|
|
165
|
+
| yolo11n-seg | segment | ultralytics | 155 | 243 | 3.6× | 4×32 | 461 MB | 71 |
|
|
166
|
+
| 320n | detect² | onnx | 142 | 154 | 1.1× | 4×32 | — | 146 |
|
|
167
|
+
| depth_anything_v2_small | depth | transformers | 122 | 132 | 2.1× | 1×32 | 1169 MB | 4.3 |
|
|
168
|
+
| yolo11n.onnx | detect | onnx | 102 | 110 | 1.3× | 4×32 | — | 93 |
|
|
169
|
+
| dw-mm_ucoco | pose (DWPose) | onnx¹ | 80 | 84 | 1.0× | 1×32 | — | 77 |
|
|
170
|
+
| insightface (buffalo_l) | detect (face, SCRFD) | insightface | 70 | 73 | 1.1× | 4×32 | — | 62 |
|
|
171
|
+
| dw-ll_ucoco_384 | pose (DWPose) | onnx¹ | 21 | 21 | 1.0× | 1×1 | — | 21 |
|
|
172
|
+
| sapiens_0.3b_goliath | body-part seg | torchscript¹ | 3.2 | 3.2 | 1.0× | 1×1 | 2151 MB | timeout |
|
|
173
|
+
| yolox_l | detect | onnx³ | 2.6 | 2.6 | 1.0× | 1×1 | — | 2.7 |
|
|
174
|
+
| pixai | tagger (multi-label) | onnx | 0.4 | 0.4 | — | 1×1 | — | 0.4 |
|
|
175
|
+
| wd-eva02 | tagger (multi-label) | onnx | 0.4 | 0.4 | — | 1×1 | — | 0.4 |
|
|
176
|
+
| densepose_r50_fpn | dense UV | torchscript | — | — | — | — | — | — |
|
|
177
|
+
| nlf_s / nlf_l | 3D pose | torchscript | — | — | — | — | — | — |
|
|
178
|
+
|
|
179
|
+
¹ Whole-image nets run one frame per input (no mosaic); tiling doesn't apply — `batch`
|
|
180
|
+
is the only lever, so their speedup is modest. ² `320n` loads as a detector but uses a
|
|
181
|
+
non-standard 22-channel head (not COCO). ³ `yolox_l` is exported with a **fixed batch
|
|
182
|
+
of 1**, so mosaix runs mosaics one-at-a-time — re-export with `dynamic=True` for real
|
|
183
|
+
throughput. `densepose`/`nlf` load but expose no generic `forward` (they need their own
|
|
184
|
+
project's inference code), so they're recognised but not benchmarkable here.
|
|
185
|
+
|
|
186
|
+
**Reading it:**
|
|
187
|
+
- **Tiling pays off for detectors** — 3–8.5× over untiled, peaking around `grid=4`
|
|
188
|
+
(`4×32`) on this decode-bound 480p clip. RT-DETR and the large/face models gain most.
|
|
189
|
+
- **`grid=4` won here, not `grid=9`** — because end-to-end is decode-bound (480p,
|
|
190
|
+
OpenCV); the bigger `9×32` mosaic adds compute without feeding faster frames. On a
|
|
191
|
+
240p source with `decord`, `grid=9`/`16` pull ahead (see performance docs).
|
|
192
|
+
- **CPU is viable for nano models** (yolo11n/-pose/-obb, yolov12n-face: 76–94 FPS; 320n: 146 FPS)
|
|
193
|
+
but collapses for large ones (RT-DETR, yolo11x, FastSAM-x: 4–5 FPS). Use CPU only for
|
|
194
|
+
the `n`-class models.
|
|
195
|
+
|
|
196
|
+
### Accuracy & the cost of tiling
|
|
197
|
+
|
|
198
|
+
Measured on **coco128** (128 labelled COCO images, auto-downloaded, ~7 MB) with one mAP
|
|
199
|
+
routine, native full-res vs the default `9×32` mosaic — so the drop *is* the tiling
|
|
200
|
+
cost. Packing 9 frames into one 240p-cell mosaic shrinks small objects, so expect a real
|
|
201
|
+
hit; it's smallest for big objects / larger models and worst for crowded tiny-object
|
|
202
|
+
scenes (use `grid=4` or bigger cells there).
|
|
203
|
+
|
|
204
|
+
| Model | native mAP@.5 | tiled (9×32) mAP@.5 | retention | published (full COCO / source) |
|
|
205
|
+
|---|--:|--:|--:|--|
|
|
206
|
+
| rtdetr-l | 0.81 | 0.48 | 59 % | 53.0 AP @.5:.95 |
|
|
207
|
+
| yolo11x | 0.71 | 0.41 | 58 % | 54.7 mAP@.5:.95 |
|
|
208
|
+
| yolo11m / -seg | 0.66 | 0.38 | 58 % | 51.5 mAP (41.5 mask) |
|
|
209
|
+
| yolo11n-seg | 0.48 | 0.23 | 47 % | 32.0 mAP mask |
|
|
210
|
+
| yolo11n | 0.46 | 0.21 | 46 % | 39.5 mAP@.5:.95 |
|
|
211
|
+
| yolo11n.onnx | 0.46 | 0.19 | 43 % | 39.5 mAP@.5:.95 |
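The retention column is the tiled score divided by the native score. A small sketch, using three rows copied from the table above, shows the arithmetic; the helper name `retention` is hypothetical, and because the table's mAP values are already rounded to two decimals, recomputing other rows this way may land a point or two off the published percentage.

```python
def retention(native, tiled):
    """Percentage of native mAP@.5 that survives tiling, rounded to an integer."""
    return round(100 * tiled / native)

# (native mAP@.5, tiled 9x32 mAP@.5) as listed in the accuracy table
scores = {
    "rtdetr-l": (0.81, 0.48),
    "yolo11x": (0.71, 0.41),
    "yolo11n": (0.46, 0.21),
}
for name, (native, tiled) in scores.items():
    print(f"{name}: {retention(native, tiled)} %")
```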
|
|
212
|
+
|
|
213
|
+
Published metrics for families coco128 can't score (different domains), from each
|
|
214
|
+
model's authoritative source:
|
|
215
|
+
|
|
216
|
+
| Family | Metric (published) |
|
|
217
|
+
|---|---|
|
|
218
|
+
| YOLO11 pose (n/m/x) | 50.0 / 64.9 / 69.5 mAP-pose @.5:.95 (COCO) |
|
|
219
|
+
| YOLO11n-cls | 70.0 % top-1 (ImageNet) |
|
|
220
|
+
| YOLO11n-obb | 78.4 mAP@.5 (DOTAv1) |
|
|
221
|
+
| Depth-Anything-V2-Small | δ1 ≈ 0.724 (Sun-RGBD) |
|
|
222
|
+
| WD-EVA02 tagger | F1 ≈ 0.477 (Danbooru) |
|
|
223
|
+
| Sapiens-0.3B Goliath | mIoU 76.7 (body-part seg) |
|
|
224
|
+
| InsightFace buffalo_l (SCRFD-10GF) | WIDERFACE 95.2 / 93.9 / 83.1 (easy/med/hard) |
|
|
225
|
+
| YOLOX-l | 49.7 AP @.5:.95 (COCO) |
|
|
226
|
+
|
|
227
|
+
Reproduce everything:
|
|
228
|
+
|
|
229
|
+
```bash
|
|
230
|
+
python benchmarks/bench_all.py --models <models_dir> --video <clip.mp4> --devices cuda,cpu
|
|
231
|
+
python benchmarks/accuracy.py --models <models_dir> # coco128 native-vs-tiled mAP
|
|
232
|
+
```
|
|
233
|
+
|
|
234
|
+
### Supported model families at a glance
|
|
235
|
+
|
|
236
|
+
| Family | Examples here | How |
|
|
237
|
+
|---|---|---|
|
|
238
|
+
| YOLO detect / seg / pose / OBB / cls | yolo11{n,m,x}, `-seg`/`-pose`/`-obb`/`-cls` | `UltralyticsAdapter` (auto) |
|
|
239
|
+
| RT-DETR | rtdetr-l | auto (RTDETR) |
|
|
240
|
+
| YOLO-World / YOLOE / FastSAM / SAM | yoloe-11s-seg, FastSAM-{s,x} | auto |
|
|
241
|
+
| Face detection | yolov12{n,m,l}-face, InsightFace buffalo_{l,s} | auto / `FaceAdapter` |
|
|
242
|
+
| Generic YOLO ONNX | yolo11n.onnx, 320n, yolox_l | `OnnxAdapter` (v8/v5 auto) |
|
|
243
|
+
| Multi-label taggers | wd-eva02, pixai | `TaggerAdapter` (NCHW/NHWC) |
|
|
244
|
+
| Monocular depth (HF) | depth_anything_v2_small | `DepthAdapter` |
|
|
245
|
+
| Dense / pose nets (raw fwd) | sapiens, DWPose dw-\* | `TorchScriptAdapter` |
|
|
246
|
+
| Loads, needs native API | densepose, nlf | recognised, not run here |
|
|
247
|
+
| Anything else | your model | `CallableAdapter` (see below) |
|
|
248
|
+
|
|
249
|
+
## Plug in *any* vision model
|
|
250
|
+
|
|
251
|
+
`mosaix` never talks to a model directly — it goes through a thin **adapter**. Three
|
|
252
|
+
ways to bring a model, in increasing order of control:
|
|
253
|
+
|
|
254
|
+
### 1. By file — auto-detected backend
|
|
255
|
+
|
|
256
|
+
```python
|
|
257
|
+
mosaix.VideoPipeline.from_model("yolo11n.pt") # Ultralytics
|
|
258
|
+
mosaix.VideoPipeline.from_model("yolov8n.onnx") # ONNX Runtime
|
|
259
|
+
mosaix.VideoPipeline.from_model("model.engine") # TensorRT (via Ultralytics)
|
|
260
|
+
```
|
|
261
|
+
|
|
262
|
+
### 2. Any callable — the universal escape hatch
|
|
263
|
+
|
|
264
|
+
If your model is a Detectron2 net, a Transformers pipeline, a homemade detector —
|
|
265
|
+
anything — wrap a function that maps a batch of mosaic images to boxes:
|
|
266
|
+
|
|
267
|
+
```python
|
|
268
|
+
import numpy as np
|
|
269
|
+
from mosaix import VideoPipeline, TileConfig, InferenceConfig
|
|
270
|
+
from mosaix.adapters import CallableAdapter
|
|
271
|
+
|
|
272
|
+
def my_detector(mosaics): # list[BGR uint8] -> list[(N,6)]
|
|
273
|
+
out = []
|
|
274
|
+
for img in mosaics:
|
|
275
|
+
boxes = run_whatever(img) # (N,6): x1,y1,x2,y2,conf,cls in mosaic pixels
|
|
276
|
+
out.append(np.asarray(boxes, np.float32))
|
|
277
|
+
return out
|
|
278
|
+
|
|
279
|
+
tile, infer = TileConfig(), InferenceConfig()
|
|
280
|
+
adapter = CallableAdapter(my_detector, tile, infer, name="my-model")
|
|
281
|
+
pipe = VideoPipeline(adapter)
|
|
282
|
+
```
|
|
283
|
+
|
|
284
|
+
The engine handles all the tiling/batching/remapping; your function only ever sees
|
|
285
|
+
ordinary images and returns ordinary boxes. See
|
|
286
|
+
[`examples/custom_model.py`](examples/custom_model.py).
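To make the callable contract concrete, here is a toy function with the same input and output shape as `my_detector`: a list of BGR `uint8` images in, one `(N, 6)` `float32` array per image out. It is only a stand-in for a real model (it draws a single box around all bright pixels), the name `bright_blob_detector` and the `thresh` parameter are hypothetical, and returning an empty `(0, 6)` array for "no detections" is an assumption rather than something the README specifies.

```python
import numpy as np

def bright_blob_detector(mosaics, thresh=200):
    """Toy stand-in for a real model: one box around all bright pixels.

    Takes a list of BGR uint8 images and returns one (N, 6) float32 array
    per image with rows x1, y1, x2, y2, conf, cls in image pixels.
    """
    out = []
    for img in mosaics:
        # Pixel is "bright" if any channel reaches the threshold.
        ys, xs = np.nonzero(img.max(axis=2) >= thresh)
        if xs.size == 0:
            out.append(np.zeros((0, 6), np.float32))  # assumed empty-result shape
            continue
        box = [xs.min(), ys.min(), xs.max() + 1, ys.max() + 1, 1.0, 0]
        out.append(np.asarray([box], np.float32))
    return out

img = np.zeros((240, 426, 3), np.uint8)
img[50:100, 30:80] = 255          # a 50x50 white square
print(bright_blob_detector([img])[0])
```

A function like this can be handy for exercising a pipeline end to end before the real model is wired in, since it needs no weights and no GPU.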
|
|
287
|
+
|
|
288
|
+
### 3. A custom adapter class
|
|
289
|
+
|
|
290
|
+
Subclass `ModelAdapter`, implement `predict_batch`, and `register_adapter("name", Cls)`
|
|
291
|
+
to expose it everywhere. See [`docs/adapters.md`](docs/adapters.md).
|
|
292
|
+
|
|
293
|
+
---
|
|
294
|
+
|
|
295
|
+
## Configuration
|
|
296
|
+
|
|
297
|
+
Everything is a documented dataclass. The defaults are tuned for an 8 GB GPU at 240p.
|
|
298
|
+
|
|
299
|
+
```python
|
|
300
|
+
from mosaix import VideoPipeline, TileConfig, ReaderConfig, InferenceConfig
|
|
301
|
+
|
|
302
|
+
pipe = VideoPipeline.from_model(
|
|
303
|
+
"yolo11n.pt",
|
|
304
|
+
tile=TileConfig(
|
|
305
|
+
grid=9, # 9 = 3x3 mosaic; try 4 (2x2) for bigger objects
|
|
306
|
+
batch=32, # mosaics per forward pass; lower if you OOM
|
|
307
|
+
cell_width=426, # downscaled frame size
|
|
308
|
+
cell_height=240,
|
|
309
|
+
gutter=12, # seam padding
|
|
310
|
+
),
|
|
311
|
+
reader=ReaderConfig(
|
|
312
|
+
stride=1, # process every frame; 5 = 1-in-5
|
|
313
|
+
threaded=True, # overlap decode with GPU (essential for real FPS)
|
|
314
|
+
backend="auto", # "decord" if installed, else "opencv"
|
|
315
|
+
),
|
|
316
|
+
inference=InferenceConfig(
|
|
317
|
+
device="auto",
|
|
318
|
+
half=True, # FP16 — ~2x on modern GPUs
|
|
319
|
+
conf=0.25,
|
|
320
|
+
classes=[0], # keep only COCO 'person'; None = all classes
|
|
321
|
+
),
|
|
322
|
+
)
|
|
323
|
+
```
|
|
324
|
+
|
|
325
|
+
Full reference: [`docs/configuration.md`](docs/configuration.md) ·
|
|
326
|
+
Tuning guide: [`docs/performance.md`](docs/performance.md).
|
|
327
|
+
|
|
328
|
+
---
|
|
329
|
+
|
|
330
|
+
## Streaming (constant memory on long videos)
|
|
331
|
+
|
|
332
|
+
`run()` collects every result. For multi-hour videos, `stream()` yields one
|
|
333
|
+
`FrameResult` at a time with bounded memory:
|
|
334
|
+
|
|
335
|
+
```python
|
|
336
|
+
pipe = mosaix.VideoPipeline.from_model("yolo11n.pt")
|
|
337
|
+
for frame in pipe.stream("8_hour_stream.mp4"):
|
|
338
|
+
process(frame) # memory stays flat regardless of length
|
|
339
|
+
print(pipe.meter.summary()) # FPS, GPU mem, per-stage timing
|
|
340
|
+
```
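The constant-memory behaviour comes from never holding more than one pass worth of frames at a time. The sketch below illustrates that idea with plain generators; it is not mosaix's code, the names `chunked` and `stream_results` are hypothetical, and integers stand in for decoded frames.

```python
from itertools import islice

def chunked(frames, grid=9, batch=32):
    """Yield lists of up to grid * batch frames from any iterable."""
    it = iter(frames)
    size = grid * batch
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk

def stream_results(frames, infer, grid=9, batch=32):
    """Yield one result per frame; only one chunk is alive at a time."""
    for chunk in chunked(frames, grid, batch):
        yield from infer(chunk)

# Stand-ins for a decoder and a model: frames are ints, "inference" doubles them.
results = stream_results(iter(range(1000)), lambda chunk: [f * 2 for f in chunk])
print(sum(1 for _ in results))  # 1000
```

With the defaults, a 1000-frame input is processed as chunks of 288, 288, 288 and 136 frames, so peak memory depends on `grid × batch` rather than on video length.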
|
|
341
|
+
|
|
342
|
+
---
|
|
343
|
+
|
|
344
|
+
## Command line
|
|
345
|
+
|
|
346
|
+
```bash
|
|
347
|
+
mosaix bench yolo11n.pt video.mp4 --grid 9 --batch 32 # measure FPS
|
|
348
|
+
mosaix run yolo11n.pt video.mp4 --out annotated.mp4 --classes 0
|
|
349
|
+
mosaix run yolo11n.pt video.mp4 --jsonl dets.jsonl # detections to JSONL
|
|
350
|
+
mosaix info video.mp4 # probe metadata
|
|
351
|
+
```
|
|
352
|
+
|
|
353
|
+
Every config knob has a flag — run `mosaix bench -h`.
|
|
354
|
+
|
|
355
|
+
---
|
|
356
|
+
|
|
357
|
+
## Why "true" throughput
|
|
358
|
+
|
|
359
|
+
Many benchmarks quote inference-only FPS with decode excluded. `mosaix` reports
|
|
360
|
+
**both**: `infer_fps` (GPU forward passes only) and `e2e_fps` (decode + downscale +
|
|
361
|
+
mosaic + inference + remap, what you actually get). The threaded reader overlaps
|
|
362
|
+
decode with GPU so the end-to-end number stays close to the inference number on long
|
|
363
|
+
videos, and the end-to-end number is the one that matters in production.
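The difference between the two numbers can be illustrated with a small timing harness. This is a sketch under simplifying assumptions, not how mosaix measures: the function name `measure` is hypothetical, the loop is serial (no threaded decode, so the gap is wider than it would be with overlap), and `time.sleep` stands in for decode and inference. On a real GPU the device would also need to be synchronised before each clock read.

```python
import time

def measure(frames, decode, infer):
    """Return (infer_fps, e2e_fps) for a simple serial decode -> infer loop."""
    infer_seconds = 0.0
    count = 0
    start = time.perf_counter()
    for raw in frames:
        frame = decode(raw)                 # counted only in end-to-end time
        t0 = time.perf_counter()
        infer(frame)                        # counted in both timings
        infer_seconds += time.perf_counter() - t0
        count += 1
    e2e_seconds = time.perf_counter() - start
    return count / infer_seconds, count / e2e_seconds

infer_fps, e2e_fps = measure(
    range(20),
    decode=lambda raw: time.sleep(0.002) or raw,   # pretend decode takes ~2 ms
    infer=lambda frame: time.sleep(0.001),         # pretend inference takes ~1 ms
)
print(f"infer {infer_fps:.0f} FPS, e2e {e2e_fps:.0f} FPS")
```

Quoting only the first number would hide the decode cost entirely, which is the gap the two-metric report is meant to expose.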
|
|
364
|
+
|
|
365
|
+
---
|
|
366
|
+
|
|
367
|
+
## License
|
|
368
|
+
|
|
369
|
+
MIT — see [LICENSE](LICENSE).
|