mosaix 0.1.0__tar.gz

Files changed (33)
  1. mosaix-0.1.0/.gitignore +42 -0
  2. mosaix-0.1.0/CHANGELOG.md +53 -0
  3. mosaix-0.1.0/LICENSE +21 -0
  4. mosaix-0.1.0/PKG-INFO +369 -0
  5. mosaix-0.1.0/README.md +315 -0
  6. mosaix-0.1.0/docs/adapters.md +136 -0
  7. mosaix-0.1.0/docs/architecture.md +79 -0
  8. mosaix-0.1.0/docs/configuration.md +92 -0
  9. mosaix-0.1.0/docs/performance.md +126 -0
  10. mosaix-0.1.0/examples/custom_model.py +53 -0
  11. mosaix-0.1.0/examples/quickstart.py +38 -0
  12. mosaix-0.1.0/examples/whole_image.py +45 -0
  13. mosaix-0.1.0/pyproject.toml +90 -0
  14. mosaix-0.1.0/src/mosaix/__init__.py +88 -0
  15. mosaix-0.1.0/src/mosaix/adapters/__init__.py +123 -0
  16. mosaix-0.1.0/src/mosaix/adapters/base.py +87 -0
  17. mosaix-0.1.0/src/mosaix/adapters/callable_adapter.py +70 -0
  18. mosaix-0.1.0/src/mosaix/adapters/depth_adapter.py +50 -0
  19. mosaix-0.1.0/src/mosaix/adapters/face_adapter.py +58 -0
  20. mosaix-0.1.0/src/mosaix/adapters/onnx_adapter.py +187 -0
  21. mosaix-0.1.0/src/mosaix/adapters/tagger_adapter.py +85 -0
  22. mosaix-0.1.0/src/mosaix/adapters/torchscript_adapter.py +83 -0
  23. mosaix-0.1.0/src/mosaix/adapters/ultralytics_adapter.py +100 -0
  24. mosaix-0.1.0/src/mosaix/cli.py +213 -0
  25. mosaix-0.1.0/src/mosaix/config.py +191 -0
  26. mosaix-0.1.0/src/mosaix/engine.py +161 -0
  27. mosaix-0.1.0/src/mosaix/geometry.py +127 -0
  28. mosaix-0.1.0/src/mosaix/io/__init__.py +12 -0
  29. mosaix-0.1.0/src/mosaix/io/reader.py +161 -0
  30. mosaix-0.1.0/src/mosaix/io/writer.py +90 -0
  31. mosaix-0.1.0/src/mosaix/metrics.py +93 -0
  32. mosaix-0.1.0/src/mosaix/pipeline.py +149 -0
  33. mosaix-0.1.0/src/mosaix/results.py +120 -0
mosaix-0.1.0/.gitignore ADDED
@@ -0,0 +1,42 @@
1
+ # Python
2
+ __pycache__/
3
+ *.py[cod]
4
+ *.egg-info/
5
+ *.egg
6
+ .eggs/
7
+ build/
8
+ dist/
9
+ .installed.cfg
10
+
11
+ # Virtual envs
12
+ .venv/
13
+ venv/
14
+ env/
15
+
16
+ # Test / coverage
17
+ .pytest_cache/
18
+ .coverage
19
+ htmlcov/
20
+ .ruff_cache/
21
+
22
+ # Models & media (don't ship weights/videos in the repo)
23
+ *.pt
24
+ *.onnx
25
+ *.engine
26
+ *.mp4
27
+ *.avi
28
+ *.mkv
29
+ *.mov
30
+
31
+ # Benchmark / run artifacts
32
+ *.json.bench
33
+ benchmarks/results/
34
+ runs/
35
+ _logs/
36
+ _dbg/
37
+
38
+ # IDE / OS
39
+ .idea/
40
+ .vscode/
41
+ .DS_Store
42
+ Thumbs.db
mosaix-0.1.0/CHANGELOG.md ADDED
@@ -0,0 +1,53 @@
1
+ # Changelog
2
+
3
+ All notable changes to this project are documented here. The format follows
4
+ [Keep a Changelog](https://keepachangelog.com/), and the project adheres to
5
+ [Semantic Versioning](https://semver.org/).
6
+
7
+ ## [Unreleased]
8
+
9
+ ### Added
10
+ - **Multi-family model support beyond box detectors.** New adapters and a generalised
11
+ `FrameResult` (carrying `label` / `tags` / `extra` alongside `detections`) let mosaix
12
+ run whole-image tasks through the same batched pipeline:
13
+ - `TaggerAdapter` — ONNX multi-label taggers (WD-EVA02, PixAI); auto NCHW/NHWC.
14
+ - `DepthAdapter` — HuggingFace depth estimation (Depth-Anything-V2).
15
+ - `TorchScriptAdapter` — raw forward for TorchScript/`.pt2` & fixed-size ONNX nets
16
+ (Sapiens, DWPose) for throughput benchmarking.
17
+ - `FaceAdapter` — InsightFace SCRFD detection (buffalo_l/s) → boxes.
18
+ - `UltralyticsAdapter` now auto-selects RT-DETR / FastSAM / SAM, handles **OBB**
19
+ (rotated → axis-aligned) and **classification** (`-cls`) models.
20
+ - **Whole-image engine path** (`MosaicEngine.infer_iter_whole`) for `output_kind="whole"`
21
+ adapters: one model input per frame, throughput from large-batch inference.
22
+ - **Benchmark harness** (`benchmarks/bench_all.py` + `bench_one.py`) — one process per
23
+ model for uncontested FPS / VRAM / GPU-util / RAM, CUDA + CPU, configs 1×1 / 4×32 /
24
+ 9×32. `benchmarks/accuracy.py` — coco128 mAP@0.5, native vs mosaic-tiled.
25
+ - **README** — measured benchmark tables (RTX 4060 Laptop) and a tested/supported
26
+ model-families matrix.
27
+
28
+ ## [0.1.0] - 2026-06-14
29
+
30
+ Initial release.
31
+
32
+ ### Added
33
+ - **Mosaic throughput engine** — gutter-padded tiling (`grid`) + batched forward
34
+ passes (`batch`) covering `grid × batch` frames per pass, with centroid-assign +
35
+ clip remapping that eliminates seam duplicates without cross-cell NMS.
36
+ - **Pluggable model adapters** — `UltralyticsAdapter` (YOLO/RT-DETR, `.pt`/`.onnx`/
37
+ `.engine`), `OnnxAdapter` (generic YOLO ONNX with built-in letterbox/decode/NMS),
38
+ and `CallableAdapter` (wrap any function). Custom adapters via `register_adapter`.
39
+ - **Streaming pipeline** — `VideoPipeline.stream()` / `MosaicEngine.infer_iter()`
40
+ process arbitrarily long videos in constant memory; threaded decode overlaps the GPU.
41
+ - **Throughput metrics** — GPU-synced, reporting both inference-only and true
42
+ end-to-end FPS, per-stage timing, and peak GPU memory.
43
+ - **CLI** — `mosaix run` / `bench` / `info`, exposing every config knob.
44
+ - **Configuration** — fully documented `TileConfig`, `ReaderConfig`, `InferenceConfig`,
45
+ `PipelineConfig` dataclasses.
46
+ - **Docs** — architecture, configuration, adapters, and performance/tuning guides.
47
+ - **Examples & benchmark harness**, plus a model-free test suite.
48
+
49
+ ### Performance
50
+ - YOLO11n on RTX 4060 Laptop (8 GB), multi-minute 240p video, decode included:
51
+ up to **2178 FPS inference-only / 1735 FPS end-to-end** at the `grid=16, batch=24`
52
+ speed preset (cool GPU); ~**1349 / 1174 FPS** at the default `grid=9, batch=32`.
53
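+ The speed preset clears the 1200 FPS target end-to-end; the default preset clears it inference-only (1349 FPS) but reaches 1174 FPS end-to-end.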
mosaix-0.1.0/LICENSE ADDED
@@ -0,0 +1,21 @@
1
+ MIT License
2
+
3
+ Copyright (c) 2026 mosaix contributors
4
+
5
+ Permission is hereby granted, free of charge, to any person obtaining a copy
6
+ of this software and associated documentation files (the "Software"), to deal
7
+ in the Software without restriction, including without limitation the rights
8
+ to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
9
+ copies of the Software, and to permit persons to whom the Software is
10
+ furnished to do so, subject to the following conditions:
11
+
12
+ The above copyright notice and this permission notice shall be included in all
13
+ copies or substantial portions of the Software.
14
+
15
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16
+ IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17
+ FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18
+ AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19
+ LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20
+ OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21
+ SOFTWARE.
mosaix-0.1.0/PKG-INFO ADDED
@@ -0,0 +1,369 @@
1
+ Metadata-Version: 2.4
2
+ Name: mosaix
3
+ Version: 0.1.0
4
+ Summary: Maximum-throughput video inference for YOLO and any vision model, via batched gutter-padded mosaic tiling.
5
+ Project-URL: Homepage, https://github.com/BARKEM-Cognitive-Industries/mosaix
6
+ Project-URL: Documentation, https://github.com/BARKEM-Cognitive-Industries/mosaix/tree/main/docs
7
+ Project-URL: Repository, https://github.com/BARKEM-Cognitive-Industries/mosaix
8
+ Project-URL: Issues, https://github.com/BARKEM-Cognitive-Industries/mosaix/issues
9
+ Project-URL: Changelog, https://github.com/BARKEM-Cognitive-Industries/mosaix/blob/main/CHANGELOG.md
10
+ Author: mosaix contributors
11
+ License: MIT
12
+ License-File: LICENSE
13
+ Keywords: acceleration,batch,computer-vision,gpu,inference,mosaic,object-detection,onnx,real-time,throughput,tiling,ultralytics,video,yolo
14
+ Classifier: Development Status :: 4 - Beta
15
+ Classifier: Intended Audience :: Developers
16
+ Classifier: Intended Audience :: Science/Research
17
+ Classifier: License :: OSI Approved :: MIT License
18
+ Classifier: Operating System :: OS Independent
19
+ Classifier: Programming Language :: Python :: 3
20
+ Classifier: Programming Language :: Python :: 3.9
21
+ Classifier: Programming Language :: Python :: 3.10
22
+ Classifier: Programming Language :: Python :: 3.11
23
+ Classifier: Programming Language :: Python :: 3.12
24
+ Classifier: Topic :: Multimedia :: Video
25
+ Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
26
+ Classifier: Topic :: Scientific/Engineering :: Image Recognition
27
+ Requires-Python: >=3.9
28
+ Requires-Dist: numpy>=1.21
29
+ Requires-Dist: opencv-python>=4.5
30
+ Provides-Extra: all
31
+ Requires-Dist: insightface>=0.7; extra == 'all'
32
+ Requires-Dist: onnxruntime-gpu>=1.15; extra == 'all'
33
+ Requires-Dist: transformers>=4.40; extra == 'all'
34
+ Requires-Dist: ultralytics>=8.1; extra == 'all'
35
+ Provides-Extra: depth
36
+ Requires-Dist: transformers>=4.40; extra == 'depth'
37
+ Provides-Extra: dev
38
+ Requires-Dist: build>=1.0; extra == 'dev'
39
+ Requires-Dist: psutil>=5.9; extra == 'dev'
40
+ Requires-Dist: pytest-cov>=4.0; extra == 'dev'
41
+ Requires-Dist: pytest>=7.0; extra == 'dev'
42
+ Requires-Dist: ruff>=0.4; extra == 'dev'
43
+ Requires-Dist: twine>=5.0; extra == 'dev'
44
+ Provides-Extra: face
45
+ Requires-Dist: insightface>=0.7; extra == 'face'
46
+ Requires-Dist: onnxruntime-gpu>=1.15; extra == 'face'
47
+ Provides-Extra: onnx
48
+ Requires-Dist: onnxruntime-gpu>=1.15; extra == 'onnx'
49
+ Provides-Extra: onnx-cpu
50
+ Requires-Dist: onnxruntime>=1.15; extra == 'onnx-cpu'
51
+ Provides-Extra: ultralytics
52
+ Requires-Dist: ultralytics>=8.1; extra == 'ultralytics'
53
+ Description-Content-Type: text/markdown
54
+
55
+ # mosaix
56
+
57
+ **Maximum-throughput video inference for YOLO — and any vision model.**
58
+
59
+ `mosaix` turns an ordinary object detector into a firehose. On a single laptop RTX 4060
60
+ (8 GB) it pushes **YOLO11n to ~1735 FPS end-to-end / 2178 FPS inference-only** on real,
61
+ multi-minute video — decode included — without exotic hardware, custom CUDA, or model
62
+ retraining. Point it at a model and a video; it gives you per-frame detections and the
63
+ throughput number.
64
+
65
+ ```python
66
+ import mosaix
67
+
68
+ pipe = mosaix.VideoPipeline.from_model("yolo11n.pt")
69
+ result = pipe.run("long_video.mp4")
70
+
71
+ print(f"{result.fps:.0f} FPS end-to-end, {result.total_detections} detections")
72
+ for frame in result:
73
+ for det in frame.detections:
74
+ print(frame.index, det.cls, det.conf, det.xyxy)
75
+ ```
76
+
77
+ ---
78
+
79
+ ## How it hits 1000+ FPS
80
+
81
+ A detector spends most of its time on per-image fixed overhead (launch, letterbox,
82
+ NMS, host↔device copies), not on the pixels that matter. `mosaix` amortises that
83
+ overhead across many frames at once:
84
+
85
+ ```
86
+ one GPU forward pass
87
+ ┌───────────────────────────────────────────────────────────┐
88
+ frames │ ┌────┬────┬────┐ ┌────┬────┬────┐ ┌────┬────┬────┐ │
89
+ 0..287 │ │ f0 │ f1 │ f2 │ │ f9 │f10 │f11 │ ... │f279│f280│f281│ │
90
+ │ ├────┼────┼────┤ ├────┼────┼────┤ ├────┼────┼────┤ │
91
+ │ │ f3 │ f4 │ f5 │ │f12 │f13 │f14 │ │... │... │... │ │
92
+ │ ├────┼────┼────┤ ├────┼────┼────┤ ├────┼────┼────┤ │
93
+ │ │ f6 │ f7 │ f8 │ │f15 │f16 │f17 │ │... │... │f287│ │
94
+ │ └────┴────┴────┘ └────┴────┴────┘ └────┴────┴────┘ │
95
+ │ mosaic 0 mosaic 1 ... mosaic 31 │
96
+ └───────────────────────────────────────────────────────────┘
97
+ grid = 9 cells × batch = 32 mosaics
98
+ = 288 frames per single forward pass
99
+ ```
100
+
101
+ 1. **Downscale** every frame to a small cell (default `426×240`, i.e. "240p").
102
+ 2. **Mosaic** `grid` cells (default 9 → a 3×3 grid) into one image, with a blank
103
+ **gutter** between cells.
104
+ 3. **Batch** `batch` mosaics (default 32) into a single model forward pass — so each
105
+ pass covers `grid × batch = 288` frames.
106
+ 4. **Remap** every detection back to the exact frame it came from.
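Steps 2 and 4 can be pictured with a small NumPy sketch. It is illustrative only and not taken from the mosaix source: the helper names (`cell_origin`, `build_mosaic`, `remap_box`) are hypothetical, and it assumes a row-major grid with gutters only between cells. The remap follows the centroid-assign + clip idea named in the changelog: a box belongs to the cell that contains its centre, and is then clipped to that cell's bounds.

```python
import numpy as np


def cell_origin(index, cols, cell_w, cell_h, gutter):
    """Top-left pixel (x, y) of cell `index` in a row-major grid with gutters between cells."""
    row, col = divmod(index, cols)
    return col * (cell_w + gutter), row * (cell_h + gutter)


def build_mosaic(cells, cols, gutter):
    """Paste equally sized HxWx3 cells into one image, leaving blank (zero) gutters."""
    cell_h, cell_w = cells[0].shape[:2]
    rows = -(-len(cells) // cols)  # ceiling division
    height = rows * cell_h + (rows - 1) * gutter
    width = cols * cell_w + (cols - 1) * gutter
    mosaic = np.zeros((height, width, 3), dtype=cells[0].dtype)
    for i, cell in enumerate(cells):
        x0, y0 = cell_origin(i, cols, cell_w, cell_h, gutter)
        mosaic[y0:y0 + cell_h, x0:x0 + cell_w] = cell
    return mosaic


def remap_box(box, cols, rows, cell_w, cell_h, gutter):
    """Assign a mosaic-space box to the cell holding its centroid, then clip to that cell.

    Returns (cell_index, (x1, y1, x2, y2)) with the box in cell-local pixels.
    A centroid that falls inside a gutter is assigned to the cell on its left/top.
    """
    x1, y1, x2, y2 = box
    cx, cy = (x1 + x2) / 2, (y1 + y2) / 2
    col = min(max(int(cx // (cell_w + gutter)), 0), cols - 1)
    row = min(max(int(cy // (cell_h + gutter)), 0), rows - 1)
    x0, y0 = col * (cell_w + gutter), row * (cell_h + gutter)
    local = (
        min(max(x1 - x0, 0), cell_w),
        min(max(y1 - y0, 0), cell_h),
        min(max(x2 - x0, 0), cell_w),
        min(max(y2 - y0, 0), cell_h),
    )
    return row * cols + col, local


# Nine 426x240 cells, each filled with its own index, in a 3x3 grid with a 12 px gutter.
cells = [np.full((240, 426, 3), i, dtype=np.uint8) for i in range(9)]
mosaic = build_mosaic(cells, cols=3, gutter=12)

# A box that straddles the top edge of the centre cell is assigned to it and clipped.
cell_index, local_box = remap_box(
    (440, 250, 500, 300), cols=3, rows=3, cell_w=426, cell_h=240, gutter=12
)
```

Under these assumptions, the source frame for a detection in mosaic `m` of a batch would be `m * grid + cell_index`. Because each box lands in exactly one cell, no NMS across cells is needed in this sketch.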
107
+
108
+ ## Install
109
+
110
+ ```bash
111
+ pip install mosaix # core (numpy + opencv)
112
+ pip install "mosaix[ultralytics]" # + YOLO / RT-DETR support
113
+ pip install "mosaix[onnx]" # + ONNX Runtime (GPU)
114
+ pip install "mosaix[all]" # everything
115
+ ```
116
+
117
+ GPU inference needs a CUDA-enabled PyTorch (for the Ultralytics backend) or
118
+ `onnxruntime-gpu` (for the ONNX backend) — install those per your CUDA version.
119
+
120
+ ---
121
+
122
+ ## Tested models & benchmarks
123
+
124
+ mosaix is built around bounding-box detectors, but the adapter layer also runs
125
+ classifiers, taggers, depth, dense-pose and face models. The table below is **measured
126
+ on this repo's models** so you know what to expect *before* you try one.
127
+
128
+ **Test rig:** NVIDIA RTX 4060 Laptop (8 GB), FP16 on CUDA, OpenCV decode (no `decord`).
129
+ **Clip:** *The Monkey Business Illusion* — 854×480, ~30 fps, many people (a non-private,
130
+ reproducible stand-in for crowded real footage). Each model ran in its **own process**
131
+ (uncontested), sweeping configs `1×1` (no tiling), `4×32` and `9×32` (`grid×batch`).
132
+ 600 frames per config for detectors, fewer for heavy whole-image nets.
133
+
134
+ > These are **relative, apples-to-apples** numbers across families on a short 480p clip
135
+ > — they are decode-bound (note the low GPU-util), not peak. For tuned single-model peak
136
+ > (YOLO11n hits ~1735 e2e / 2178 infer FPS), see [`docs/performance.md`](docs/performance.md).
137
+ > Install `decord` and use a 240p source to get there.
138
+
139
+ ### Throughput
140
+
141
+ `e2e` = decode→mosaic→infer→remap (what you get); `infer` = GPU forward only.
142
+ `tiling×` = best e2e ÷ untiled (`1×1`) e2e — the speedup mosaicking buys.
143
+ VRAM is the Torch allocator peak; ONNX/onnxruntime memory isn't visible to it (`—`).
144
+
145
+ | Model | Task | Backend | CUDA e2e | CUDA infer | tiling× | best `g×b` | VRAM | CPU e2e |
146
+ |---|---|---|--:|--:|--:|:--:|--:|--:|
147
+ | FastSAM-s | segment | ultralytics | **308** | 494 | 5.5× | 4×32 | 1187 MB | 43 |
148
+ | omniparser_icon_detect | detect (UI icons) | ultralytics | 250 | 360 | 4.5× | 4×32 | 684 MB | 26 |
149
+ | yolov12n-face | detect (face) | ultralytics | 235 | 473 | 6.7× | 9×32 | 413 MB | 94 |
150
+ | yolo11m | detect | ultralytics | 231 | 355 | 4.3× | 4×32 | 684 MB | 23 |
151
+ | yolo11m-pose | pose | ultralytics | 223 | 349 | 4.7× | 4×32 | 646 MB | 21 |
152
+ | rtdetr-l | detect (DETR) | ultralytics | 223 | 271 | **8.5×** | 4×32 | 1268 MB | 4.0 |
153
+ | FastSAM-x | segment | ultralytics | 220 | 265 | 4.7× | 4×32 | 1803 MB | 4.6 |
154
+ | yolo11n-cls | classify | ultralytics | 207 | 247 | 2.6× | 1×32 | 123 MB | 75 |
155
+ | yolov12m-face | detect (face) | ultralytics | 198 | 304 | 5.0× | 4×32 | 644 MB | 22 |
156
+ | yolo11x-pose | pose | ultralytics | 191 | 244 | 5.8× | 4×32 | 1242 MB | 4.7 |
157
+ | yolov12l-face | detect (face) | ultralytics | 190 | 268 | 8.0× | 4×32 | 770 MB | 16 |
158
+ | yolo11m-seg | segment | ultralytics | 189 | 268 | 4.4× | 4×32 | 1186 MB | 16 |
159
+ | yolo11n | detect | ultralytics | 186 | 309 | 3.0× | 4×32 | 212 MB | 94 |
160
+ | yolo11x | detect | ultralytics | 184 | 246 | 5.7× | 4×32 | 1238 MB | 5.2 |
161
+ | yoloe-11s-seg | segment (open-vocab) | ultralytics | 177 | 317 | 4.1× | 4×32 | 652 MB | 36 |
162
+ | yolo11n-obb | oriented bbox | ultralytics | 175 | 259 | 4.3× | 4×32 | 212 MB | 76 |
163
+ | yolo11n-pose | pose | ultralytics | 170 | 269 | 3.7× | 4×32 | 213 MB | 81 |
164
+ | dw-ss_ucoco | pose (DWPose) | onnx¹ | 169 | 187 | 1.2× | 1×32 | — | 154 |
165
+ | yolo11n-seg | segment | ultralytics | 155 | 243 | 3.6× | 4×32 | 461 MB | 71 |
166
+ | 320n | detect² | onnx | 142 | 154 | 1.1× | 4×32 | — | 146 |
167
+ | depth_anything_v2_small | depth | transformers | 122 | 132 | 2.1× | 1×32 | 1169 MB | 4.3 |
168
+ | yolo11n.onnx | detect | onnx | 102 | 110 | 1.3× | 4×32 | — | 93 |
169
+ | dw-mm_ucoco | pose (DWPose) | onnx¹ | 80 | 84 | 1.0× | 1×32 | — | 77 |
170
+ | insightface (buffalo_l) | detect (face, SCRFD) | insightface | 70 | 73 | 1.1× | 4×32 | — | 62 |
171
+ | dw-ll_ucoco_384 | pose (DWPose) | onnx¹ | 21 | 21 | 1.0× | 1×1 | — | 21 |
172
+ | sapiens_0.3b_goliath | body-part seg | torchscript¹ | 3.2 | 3.2 | 1.0× | 1×1 | 2151 MB | timeout |
173
+ | yolox_l | detect | onnx³ | 2.6 | 2.6 | 1.0× | 1×1 | — | 2.7 |
174
+ | pixai | tagger (multi-label) | onnx | 0.4 | 0.4 | — | 1×1 | — | 0.4 |
175
+ | wd-eva02 | tagger (multi-label) | onnx | 0.4 | 0.4 | — | 1×1 | — | 0.4 |
176
+ | densepose_r50_fpn | dense UV | torchscript | — | — | — | — | — | — |
177
+ | nlf_s / nlf_l | 3D pose | torchscript | — | — | — | — | — | — |
178
+
179
+ ¹ Whole-image nets run one frame per input (no mosaic); tiling doesn't apply — `batch`
180
+ is the only lever, so their speedup is modest. ² `320n` loads as a detector but uses a
181
+ non-standard 22-channel head (not COCO). ³ `yolox_l` is exported with a **fixed batch
182
+ of 1**, so mosaix runs mosaics one-at-a-time — re-export with `dynamic=True` for real
183
+ throughput. `densepose`/`nlf` load but expose no generic `forward` (they need their own
184
+ project's inference code), so they're recognised but not benchmarkable here.
185
+
186
+ **Reading it:**
187
+ - **Tiling pays off for Ultralytics-backed detectors**: 3–8.5× over untiled, mostly peaking at `grid=4`
188
+ (`4×32`) on this decode-bound 480p clip. RT-DETR and the large/face models gain most; the ONNX and InsightFace rows gain only 1.0–1.3×.
189
+ - **`grid=4` won here, not `grid=9`** — because end-to-end is decode-bound (480p,
190
+ OpenCV); the bigger `9×32` mosaic adds compute without feeding faster frames. On a
191
+ 240p source with `decord`, `grid=9`/`16` pull ahead (see performance docs).
192
+ - **CPU is viable for nano models** (yolo11n/-pose/-obb, yolov12n-face: 75–95 FPS; 320n: 146 FPS)
193
+ but collapses for large ones (RT-DETR, yolo11x, FastSAM-x: 4–5 FPS). Use CPU only for
194
+ the `n`-class models.
195
+
196
+ ### Accuracy & the cost of tiling
197
+
198
+ Measured on **coco128** (128 labelled COCO images, auto-downloaded, ~7 MB) with one mAP
199
+ routine, native full-res vs the default `9×32` mosaic — so the drop *is* the tiling
200
+ cost. Packing 9 frames into one 240p-cell mosaic shrinks small objects, so expect a real
201
+ hit; it's smallest for big objects / larger models and worst for crowded tiny-object
202
+ scenes (use `grid=4` or bigger cells there).
203
+
204
+ | Model | native mAP@.5 | tiled (9×32) mAP@.5 | retention | published (full COCO / source) |
205
+ |---|--:|--:|--:|--|
206
+ | rtdetr-l | 0.81 | 0.48 | 59 % | 53.0 AP @.5:.95 |
207
+ | yolo11x | 0.71 | 0.41 | 58 % | 54.7 mAP@.5:.95 |
208
+ | yolo11m / -seg | 0.66 | 0.38 | 58 % | 51.5 mAP (41.5 mask) |
209
+ | yolo11n-seg | 0.48 | 0.23 | 47 % | 32.0 mAP mask |
210
+ | yolo11n | 0.46 | 0.21 | 46 % | 39.5 mAP@.5:.95 |
211
+ | yolo11n.onnx | 0.46 | 0.19 | 43 % | 39.5 mAP@.5:.95 |
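The retention column is the ratio of tiled to native mAP@.5. A minimal sketch of that arithmetic follows, using three rows of the table above as inputs. The function name is illustrative. The inputs are the rounded two-decimal values from the table, so recomputing other rows this way can differ from the listed percentage by a point or two.

```python
def retention(native_map, tiled_map):
    """Share of native mAP@.5 kept after mosaic tiling, in percent."""
    if native_map <= 0:
        raise ValueError("native mAP must be positive")
    return 100.0 * tiled_map / native_map


# (native mAP@.5, tiled 9x32 mAP@.5) copied from the table.
rows = {
    "rtdetr-l": (0.81, 0.48),
    "yolo11x": (0.71, 0.41),
    "yolo11n": (0.46, 0.21),
}
for name, (native, tiled) in rows.items():
    print(f"{name}: {retention(native, tiled):.0f} %")
```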
212
+
213
+ Published metrics for families coco128 can't score (different domains), from each
214
+ model's authoritative source:
215
+
216
+ | Family | Metric (published) |
217
+ |---|---|
218
+ | YOLO11 pose (n/m/x) | 50.0 / 64.9 / 69.5 mAP-pose @.5:.95 (COCO) |
219
+ | YOLO11n-cls | 70.0 % top-1 (ImageNet) |
220
+ | YOLO11n-obb | 78.4 mAP@.5 (DOTAv1) |
221
+ | Depth-Anything-V2-Small | δ1 ≈ 0.724 (Sun-RGBD) |
222
+ | WD-EVA02 tagger | F1 ≈ 0.477 (Danbooru) |
223
+ | Sapiens-0.3B Goliath | mIoU 76.7 (body-part seg) |
224
+ | InsightFace buffalo_l (SCRFD-10GF) | WIDERFACE 95.2 / 93.9 / 83.1 (easy/med/hard) |
225
+ | YOLOX-l | 49.7 AP @.5:.95 (COCO) |
226
+
227
+ Reproduce everything:
228
+
229
+ ```bash
230
+ python benchmarks/bench_all.py --models <models_dir> --video <clip.mp4> --devices cuda,cpu
231
+ python benchmarks/accuracy.py --models <models_dir> # coco128 native-vs-tiled mAP
232
+ ```
233
+
234
+ ### Supported model families at a glance
235
+
236
+ | Family | Examples here | How |
237
+ |---|---|---|
238
+ | YOLO detect / seg / pose / OBB / cls | yolo11{n,m,x}, `-seg`/`-pose`/`-obb`/`-cls` | `UltralyticsAdapter` (auto) |
239
+ | RT-DETR | rtdetr-l | auto (RTDETR) |
240
+ | YOLO-World / YOLOE / FastSAM / SAM | yoloe-11s-seg, FastSAM-{s,x} | auto |
241
+ | Face detection | yolov12{n,m,l}-face, InsightFace buffalo_{l,s} | auto / `FaceAdapter` |
242
+ | Generic YOLO ONNX | yolo11n.onnx, 320n, yolox_l | `OnnxAdapter` (v8/v5 auto) |
243
+ | Multi-label taggers | wd-eva02, pixai | `TaggerAdapter` (NCHW/NHWC) |
244
+ | Monocular depth (HF) | depth_anything_v2_small | `DepthAdapter` |
245
+ | Dense / pose nets (raw fwd) | sapiens, DWPose dw-\* | `TorchScriptAdapter` |
246
+ | Loads, needs native API | densepose, nlf | recognised, not run here |
247
+ | Anything else | your model | `CallableAdapter` (see below) |
248
+
249
+ ## Plug in *any* vision model
250
+
251
+ `mosaix` never talks to a model directly — it goes through a thin **adapter**. Three
252
+ ways to bring a model, in increasing order of control:
253
+
254
+ ### 1. By file — auto-detected backend
255
+
256
+ ```python
257
+ mosaix.VideoPipeline.from_model("yolo11n.pt") # Ultralytics
258
+ mosaix.VideoPipeline.from_model("yolov8n.onnx") # ONNX Runtime
259
+ mosaix.VideoPipeline.from_model("model.engine") # TensorRT (via Ultralytics)
260
+ ```
261
+
262
+ ### 2. Any callable — the universal escape hatch
263
+
264
+ If your model is a Detectron2 net, a Transformers pipeline, a homemade detector —
265
+ anything — wrap a function that maps a batch of mosaic images to boxes:
266
+
267
+ ```python
268
+ import numpy as np
269
+ from mosaix import VideoPipeline, TileConfig, InferenceConfig
270
+ from mosaix.adapters import CallableAdapter
271
+
272
+ def my_detector(mosaics): # list[BGR uint8] -> list[(N,6)]
273
+ out = []
274
+ for img in mosaics:
275
+ boxes = run_whatever(img) # (N,6): x1,y1,x2,y2,conf,cls in mosaic pixels
276
+ out.append(np.asarray(boxes, np.float32))
277
+ return out
278
+
279
+ tile, infer = TileConfig(), InferenceConfig()
280
+ adapter = CallableAdapter(my_detector, tile, infer, name="my-model")
281
+ pipe = VideoPipeline(adapter)
282
+ ```
283
+
284
+ The engine handles all the tiling/batching/remapping; your function only ever sees
285
+ ordinary images and returns ordinary boxes. See
286
+ [`examples/custom_model.py`](examples/custom_model.py).
287
+
288
+ ### 3. A custom adapter class
289
+
290
+ Subclass `ModelAdapter`, implement `predict_batch`, and `register_adapter("name", Cls)`
291
+ to expose it everywhere. See [`docs/adapters.md`](docs/adapters.md).
292
+
293
+ ---
294
+
295
+ ## Configuration
296
+
297
+ Everything is a documented dataclass. The defaults are tuned for an 8 GB GPU at 240p.
298
+
299
+ ```python
300
+ from mosaix import VideoPipeline, TileConfig, ReaderConfig, InferenceConfig
301
+
302
+ pipe = VideoPipeline.from_model(
303
+ "yolo11n.pt",
304
+ tile=TileConfig(
305
+ grid=9, # 9 = 3x3 mosaic; try 4 (2x2) for bigger objects
306
+ batch=32, # mosaics per forward pass; lower if you OOM
307
+ cell_width=426, # downscaled frame size
308
+ cell_height=240,
309
+ gutter=12, # seam padding
310
+ ),
311
+ reader=ReaderConfig(
312
+ stride=1, # process every frame; 5 = 1-in-5
313
+ threaded=True, # overlap decode with GPU (essential for real FPS)
314
+ backend="auto", # "decord" if installed, else "opencv"
315
+ ),
316
+ inference=InferenceConfig(
317
+ device="auto",
318
+ half=True, # FP16 — ~2x on modern GPUs
319
+ conf=0.25,
320
+ classes=[0], # keep only COCO 'person'; None = all classes
321
+ ),
322
+ )
323
+ ```
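The `backend="auto"` comment describes a "use `decord` if installed, else OpenCV" rule. One plausible way to implement such a rule is sketched below with the standard library. This is an assumption about the behaviour, not the mosaix implementation, and `pick_reader_backend` is a hypothetical name.

```python
from importlib.util import find_spec


def pick_reader_backend(requested="auto"):
    """Resolve a reader backend name; "auto" prefers decord when it is importable."""
    if requested != "auto":
        return requested  # an explicit choice is passed through unchanged
    # find_spec only looks the module up; it does not import it.
    return "decord" if find_spec("decord") is not None else "opencv"


print(pick_reader_backend())          # "decord" or "opencv", depending on the environment
print(pick_reader_backend("opencv"))  # explicit choice wins
```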
324
+
325
+ Full reference: [`docs/configuration.md`](docs/configuration.md) ·
326
+ Tuning guide: [`docs/performance.md`](docs/performance.md).
327
+
328
+ ---
329
+
330
+ ## Streaming (constant memory on long videos)
331
+
332
+ `run()` collects every result. For multi-hour videos, `stream()` yields one
333
+ `FrameResult` at a time with bounded memory:
334
+
335
+ ```python
336
+ pipe = mosaix.VideoPipeline.from_model("yolo11n.pt")
337
+ for frame in pipe.stream("8_hour_stream.mp4"):
338
+ process(frame) # memory stays flat regardless of length
339
+ print(pipe.meter.summary()) # FPS, GPU mem, per-stage timing
340
+ ```
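The bounded-memory behaviour can be pictured as pulling frames from an iterator in chunks of `grid × batch` and yielding results as soon as each chunk is done. The sketch below shows only that pattern. `chunked`, `stream_results` and the `infer_chunk` callback are hypothetical names, and the real pipeline also overlaps decode with inference on a separate thread.

```python
from itertools import islice


def chunked(frames, size):
    """Yield lists of up to `size` items from any iterable, holding one chunk at a time."""
    if size < 1:
        raise ValueError("size must be at least 1")
    it = iter(frames)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


def stream_results(frames, infer_chunk, grid=9, batch=32):
    """Run `infer_chunk` on grid*batch frames at a time and yield per-frame results lazily."""
    for chunk in chunked(frames, grid * batch):
        # Only this chunk and its results are alive here, whatever the video length.
        yield from infer_chunk(chunk)


# Stand-in "model" that doubles each frame index.
for result in stream_results(range(5), lambda chunk: [f * 2 for f in chunk]):
    print(result)
```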
341
+
342
+ ---
343
+
344
+ ## Command line
345
+
346
+ ```bash
347
+ mosaix bench yolo11n.pt video.mp4 --grid 9 --batch 32 # measure FPS
348
+ mosaix run yolo11n.pt video.mp4 --out annotated.mp4 --classes 0
349
+ mosaix run yolo11n.pt video.mp4 --jsonl dets.jsonl # detections to JSONL
350
+ mosaix info video.mp4 # probe metadata
351
+ ```
352
+
353
+ Every config knob has a flag — run `mosaix bench -h`.
354
+
355
+ ---
356
+
357
+ ## Why "true" throughput
358
+
359
+ Many benchmarks quote inference-only FPS with decode excluded. `mosaix` reports
360
+ **both**: `infer_fps` (GPU forward passes only) and `e2e_fps` (decode + downscale +
361
+ mosaic + inference + remap, what you actually get). The threaded reader overlaps
362
+ decode with GPU so the end-to-end number stays close to the inference number on long
363
+ videos — which is the only number that matters in production.
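The difference between the two numbers comes down to which clock is used as the denominator. A minimal sketch of that bookkeeping follows. `StageMeter` is a hypothetical stand-in, not the `pipe.meter` object. On a GPU the device would also need to be synchronised before each clock read, which this CPU-only sketch omits.

```python
import time
from collections import defaultdict
from contextlib import contextmanager


class StageMeter:
    """Accumulates per-stage wall time and derives inference-only and end-to-end FPS."""

    def __init__(self):
        self.seconds = defaultdict(float)
        self.frames = 0
        self._start = time.perf_counter()

    @contextmanager
    def stage(self, name):
        t0 = time.perf_counter()
        try:
            yield
        finally:
            self.seconds[name] += time.perf_counter() - t0

    def summary(self):
        wall = time.perf_counter() - self._start  # everything since construction
        infer = self.seconds["infer"]             # forward passes only
        return {
            "e2e_fps": self.frames / wall if wall > 0 else 0.0,
            "infer_fps": self.frames / infer if infer > 0 else float("inf"),
        }


meter = StageMeter()
with meter.stage("decode"):
    time.sleep(0.01)  # stand-in for reading and downscaling frames
with meter.stage("infer"):
    time.sleep(0.01)  # stand-in for the model forward pass
meter.frames = 288
stats = meter.summary()
print(stats)
```

Because the inference time is a subset of the wall time, `infer_fps` can never be lower than `e2e_fps` in this scheme. The gap between them shows how much decode and remapping cost.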
364
+
365
+ ---
366
+
367
+ ## License
368
+
369
+ MIT — see [LICENSE](LICENSE).