canvas-engineering 0.1.0__tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (26)
  1. canvas_engineering-0.1.0/LICENSE +17 -0
  2. canvas_engineering-0.1.0/PKG-INFO +602 -0
  3. canvas_engineering-0.1.0/README.md +571 -0
  4. canvas_engineering-0.1.0/canvas_engineering/__init__.py +40 -0
  5. canvas_engineering-0.1.0/canvas_engineering/action_heads.py +42 -0
  6. canvas_engineering-0.1.0/canvas_engineering/canvas.py +349 -0
  7. canvas_engineering-0.1.0/canvas_engineering/checkpoint.py +67 -0
  8. canvas_engineering-0.1.0/canvas_engineering/cogvideox.py +96 -0
  9. canvas_engineering-0.1.0/canvas_engineering/connectivity.py +301 -0
  10. canvas_engineering-0.1.0/canvas_engineering/curriculum.py +36 -0
  11. canvas_engineering-0.1.0/canvas_engineering/graft.py +126 -0
  12. canvas_engineering-0.1.0/canvas_engineering/looped_block.py +97 -0
  13. canvas_engineering-0.1.0/canvas_engineering/schema.py +192 -0
  14. canvas_engineering-0.1.0/canvas_engineering/sharpening.py +46 -0
  15. canvas_engineering-0.1.0/canvas_engineering.egg-info/PKG-INFO +602 -0
  16. canvas_engineering-0.1.0/canvas_engineering.egg-info/SOURCES.txt +24 -0
  17. canvas_engineering-0.1.0/canvas_engineering.egg-info/dependency_links.txt +1 -0
  18. canvas_engineering-0.1.0/canvas_engineering.egg-info/requires.txt +15 -0
  19. canvas_engineering-0.1.0/canvas_engineering.egg-info/top_level.txt +1 -0
  20. canvas_engineering-0.1.0/pyproject.toml +41 -0
  21. canvas_engineering-0.1.0/setup.cfg +4 -0
  22. canvas_engineering-0.1.0/tests/test_canvas.py +243 -0
  23. canvas_engineering-0.1.0/tests/test_connectivity.py +483 -0
  24. canvas_engineering-0.1.0/tests/test_graft.py +50 -0
  25. canvas_engineering-0.1.0/tests/test_looped_block.py +79 -0
  26. canvas_engineering-0.1.0/tests/test_schema.py +523 -0
@@ -0,0 +1,17 @@
Apache License
Version 2.0, January 2004
http://www.apache.org/licenses/

TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
@@ -0,0 +1,602 @@
Metadata-Version: 2.4
Name: canvas-engineering
Version: 0.1.0
Summary: Prompt engineering, but for latent space. A type system for multimodal latent dynamics in video diffusion transformers.
Author: Jacob
Author-email: "Claude Opus 4.6" <noreply@anthropic.com>
License-Expression: Apache-2.0
Project-URL: Homepage, https://github.com/JacobFV/canvas-engineering
Project-URL: Documentation, https://jacobfv.github.io/canvas-engineering/
Keywords: diffusion,transformer,looped-attention,video,robotics,canvas
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Science/Research
Classifier: Programming Language :: Python :: 3
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Requires-Python: >=3.9
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: torch>=2.0
Requires-Dist: einops>=0.7
Provides-Extra: cogvideox
Requires-Dist: diffusers>=0.28; extra == "cogvideox"
Requires-Dist: transformers>=4.40; extra == "cogvideox"
Requires-Dist: accelerate; extra == "cogvideox"
Provides-Extra: data
Requires-Dist: av; extra == "data"
Requires-Dist: numpy; extra == "data"
Requires-Dist: Pillow; extra == "data"
Provides-Extra: dev
Requires-Dist: pytest>=7.0; extra == "dev"
Dynamic: license-file

# canvas-engineering

### Prompt engineering, but for latent space.

[![PyPI](https://img.shields.io/pypi/v/canvas-engineering.svg)](https://pypi.org/project/canvas-engineering/)
[![License](https://img.shields.io/badge/License-Apache_2.0-blue.svg)](LICENSE)
[![Python 3.9+](https://img.shields.io/badge/python-3.9+-blue.svg)](https://www.python.org/downloads/)
[![Tests](https://img.shields.io/badge/tests-95%2F95-brightgreen.svg)]()
[![Docs](https://img.shields.io/badge/docs-jacobfv.github.io-blue.svg)](https://jacobfv.github.io/canvas-engineering/)

> Prompt engineering structures what an LLM *sees*. **Canvas engineering** structures what a diffusion model *thinks in*. You declare which regions of latent space carry video, actions, proprioception, reward, or thought — their geometry, their temporal frequency, their connectivity, their loss participation — and the canvas compiles that declaration into attention masks, loss weights, and frame mappings. The layout is the schema. The topology is the compute graph. Together they form a **type system for multimodal latent computation**: the model doesn't discover what its internal state means — you declare it, and the structure constrains what it learns.

<p align="center">
  <img src="https://raw.githubusercontent.com/JacobFV/canvas-engineering/main/assets/canvas_layouts_combined.png" alt="Canvas allocation layouts for three applications" width="100%">
</p>
<p align="center"><i>Canvas allocations for robot manipulation, computer use, and multi-robot control. Each colored block is a modality region on the 3D spatiotemporal grid.</i></p>

---

## The idea

Prompt engineering gives LLMs structured context — few-shot examples, system instructions, tool descriptions — so they produce better outputs. Canvas engineering does the same thing one level deeper: it gives diffusion models structured *latent space* so they learn better representations. A diffusion transformer's latent tensor is just a flat bag of positions. **canvas-engineering** turns it into a typed workspace by letting you declare:

- **What** each region means — `RegionSpec` with bounds, temporal frequency, loss weight, input/output role
- **How** regions interact — `CanvasTopology` as a directed graph of attention operations with temporal constraints
- **How fast** each region runs — `period` maps canvas timesteps to real-world frames, so a "thought" region at period=4 and a "perception" region at period=1 coexist on the same canvas

This is literally a type system. `region_indices()` is an offset calculation. `loss_weight_mask()` is type-directed codegen. The topology is a calling convention. Two agents with the same canvas schema can share latent state directly — no tokenization, no encoding — because the schema tells you what every position means.
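
The offset arithmetic is small enough to sketch inline. The following is an illustrative stand-in, assuming row-major `(t, h, w)` flattening of the `(T, H, W)` grid (consistent with the `(4, 320, 256)` shape in the robot example below), not the library's actual implementation:

```python
# Illustrative stand-in for region_indices(): pure offset arithmetic,
# assuming row-major (t, h, w) flattening of the (T, H, W) grid.
def region_indices(bounds, H, W):
    t0, t1, h0, h1, w0, w1 = bounds
    return [
        t * H * W + h * W + w
        for t in range(t0, t1)
        for h in range(h0, h1)
        for w in range(w0, w1)
    ]

# Robot manipulation layout from the canvas example below (T=5, H=8, W=8)
visual = region_indices((0, 5, 0, 6, 0, 6), H=8, W=8)  # 180 positions
action = region_indices((0, 5, 6, 7, 0, 1), H=8, W=8)  # 5 positions
assert len(visual) == 180 and len(action) == 5
```

Given indices like these, a `loss_weight_mask()` would be just a scatter of each region's `loss_weight` into an `(N,)` buffer — type-directed codegen in the most literal sense.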

<!-- Source: scripts/generate_diagrams.py :: generate_type_system() -->
<p align="center"><img src="https://raw.githubusercontent.com/JacobFV/canvas-engineering/main/assets/canvas_type_system.png" alt="Type system analogy: C struct layout vs canvas schema" width="80%"></p>

The library has two orthogonal pieces, validated over [26 experiments and 236 training runs](https://github.com/JacobFV/recursive-omnimodal-video-action-model):

### 1. The canvas: structured multimodal latent space

Large video diffusion models (CogVideoX, Mochi, Wan) generate video. The **spatiotemporal canvas** extends them to *do things* — predict robot actions, estimate rewards, process proprioception — by placing heterogeneous modalities on a shared 3D grid with dedicated encoders and decoders. You design the schema; the model attends over everything.

### 2. Looped attention: weight-sharing regularization

**Looped attention** iterates transformer blocks multiple times with learned iteration embeddings. The empirical result: **1.73x parameter efficiency** over matched-depth models (p<0.001) through weight-sharing regularization (fixed-point convergence, cosine similarity 0.926 → 0.996). A frozen CogVideoX-2B backbone + **350K trainable loop parameters** outperforms **11.5M unfrozen parameters** on action prediction. 3 loops is optimal.

What looping is *not*: iterative reasoning -- at least not yet. Three independent experiments falsified that hypothesis (p=0.97, p>0.05, p>0.05). The benefit is regularization, not reasoning depth -- at the limited scale I tested, anyway, though I'm skeptical that's the final word.

## Quick start

```bash
pip install canvas-engineering
```

### Graft looped attention onto CogVideoX-2B

```python
from canvas_engineering import graft_looped_blocks, CurriculumScheduler
from diffusers import CogVideoXTransformer3DModel
import torch

# Load pretrained video diffusion model
transformer = CogVideoXTransformer3DModel.from_pretrained(
    "THUDM/CogVideoX-2b", subfolder="transformer", torch_dtype=torch.bfloat16
)

# Graft 3-loop attention onto all 30 frozen DiT blocks
looped_blocks, action_head = graft_looped_blocks(
    transformer,
    max_loops=3,    # 3 is optimal (empirically validated)
    freeze="full",  # freeze backbone, train only loop params
    action_dim=7,   # 6DOF end-effector + gripper
)

# Only 350K params to optimize
optimizer = torch.optim.AdamW(
    [p for b in looped_blocks for p in b.parameters() if p.requires_grad]
    + list(action_head.parameters()),
    lr=1e-4,
)

# Curriculum: gradually ramp from 1 to 3 loops during training
scheduler = CurriculumScheduler(max_loops=3, total_steps=5000)
```

That's it. The frozen 1.69B-parameter backbone now loops its computation 3 times per forward pass, with learned iteration embeddings that cost 0.02% of the model.
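
The scheduler's exact policy isn't shown here; one plausible staged ramp (hypothetical, for intuition only — the real `CurriculumScheduler` may differ) spends roughly equal thirds of training at 1, 2, and 3 loops:

```python
# Hypothetical staged ramp for intuition; the actual CurriculumScheduler
# policy may differ.
def loops_at(step: int, max_loops: int = 3, total_steps: int = 5000) -> int:
    return min(max_loops, 1 + step * max_loops // total_steps)

loops_at(0)     # → 1 (start gently)
loops_at(2500)  # → 2 (mid-training)
loops_at(5000)  # → 3 (full depth, capped thereafter)
```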

## How looped attention works

<!-- Source: scripts/generate_diagrams.py :: generate_looped_attention() -->
<p align="center"><img src="https://raw.githubusercontent.com/JacobFV/canvas-engineering/main/assets/looped_attention.png" alt="Looped attention block diagram" width="75%"></p>

**Zero-init safety**: Loop embeddings start at zero. At initialization, the model behaves identically to the pretrained backbone. No distribution shift. Safe to graft onto any frozen model.

**Gradient checkpointing**: Multi-loop training fits in 40GB VRAM by recomputing activations on the backward pass (per-loop, not per-block).
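
The zero-init property is worth spelling out. Assuming the iteration embedding enters additively (an assumption — the grafting details live in `looped_block.py`), a zero vector makes the first forward pass reproduce the backbone exactly:

```python
# Toy demonstration of zero-init safety (additive embedding assumed):
# a zero iteration embedding leaves the backbone's activations untouched.
hidden = [0.3, -1.2, 0.7]        # backbone hidden state (toy values)
loop_emb = [0.0] * len(hidden)   # learned embedding, initialized to zero
out = [h + e for h, e in zip(hidden, loop_emb)]
assert out == hidden             # identical to the pretrained model at init
```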

## How the canvas works

A **canvas** is a 3D grid `(T, H, W)` where different regions handle different modalities. This is the omnimodal I/O layer — it's what lets a video model also predict actions, read proprioception, and estimate reward.

```python
from canvas_engineering import CanvasLayout, SpatiotemporalCanvas

# Robot manipulation canvas
layout = CanvasLayout(
    T=5, H=8, W=8, d_model=256,
    regions={
        "visual": (0, 5, 0, 6, 0, 6),  # 180 positions — video patches
        "action": (0, 5, 6, 7, 0, 1),  # 5 positions — per-frame actions
        "reward": (2, 3, 7, 8, 0, 1),  # 1 position — scalar reward
    },
    t_current=2,  # t >= 2 is future (diffusion output)
)

canvas = SpatiotemporalCanvas(layout)
batch = canvas.create_empty(batch_size=4)           # (4, 320, 256)
batch = canvas.place(batch, visual_embs, "visual")  # write video patches
actions = canvas.extract(batch, "action")           # read action predictions
```

<!-- Source: scripts/generate_diagrams.py :: generate_3d_gif() / generate_3d_static() -->
<p align="center"><img src="https://raw.githubusercontent.com/JacobFV/canvas-engineering/main/assets/canvas_robot_3d.gif" alt="3D rotating canvas allocation" width="50%"></p>
<p align="center"><i>3D region allocation for a robot manipulation canvas. Each colored block is a modality occupying a subvolume of the (T, H, W) grid.</i></p>

**Built-in examples** for robot manipulation, computer use agents, and multi-robot control:

```python
# Computer use agent: screen pixels + mouse + keyboard + LLM steering
layout = CanvasLayout(
    T=16, H=32, W=32, d_model=768,
    regions={
        "screen": (0, 16, 0, 24, 0, 24),    # 9,216 positions (56%)
        "mouse": (0, 16, 24, 26, 0, 4),     # 128 positions
        "keyboard": (0, 16, 26, 28, 0, 4),  # 128 positions
        "llm": (0, 16, 28, 32, 0, 8),       # 512 positions
    },
)
# → 16,384 total positions, bandwidth-proportional allocation
```

<p align="center"><img src="https://raw.githubusercontent.com/JacobFV/canvas-engineering/main/assets/canvas_computer.png" alt="Computer use agent canvas" width="45%"> <img src="https://raw.githubusercontent.com/JacobFV/canvas-engineering/main/assets/canvas_multi_robot.png" alt="Multi-robot canvas" width="45%"></p>

## Why 3 loops?

From a 12-condition grid ablation on CogVideoX-2B with real Bridge V2 robot video (36 runs, $152 compute):

```
              Action Loss (lower = better)
          Frozen          Half-frozen      Unfrozen
          (350K params)   (3.7M params)    (11.7M params)
1 loop    0.121           0.115            0.108
2 loops   0.140           0.119            0.112
3 loops   0.073 ◀ BEST    0.107            0.088
4 loops   0.104           0.137            0.124
```

**3 loops wins at every freeze level.** The frozen 3-loop condition (350K params) beats every unfrozen condition (11.5M+ params). 4 loops consistently regresses from 3.

Freeze level has no significant effect on action loss (marginal means 0.109 vs 0.108, p=0.72). It only affects video generation quality (an 8-9x gap on diffusion loss).

## Declarative region frequency

Canvas regions can operate at different real-world frequencies. A `RegionSpec` declares per-region semantics — temporal frequency, loss participation, and loss weight — as first-class properties.

```python
from canvas_engineering import CanvasLayout, RegionSpec

layout = CanvasLayout(
    T=16, H=32, W=32, d_model=768,
    regions={
        "screen": (0, 16, 0, 24, 0, 24),  # raw tuple — period=1 default

        "mouse": RegionSpec(
            bounds=(0, 16, 24, 26, 0, 4),
            period=1, loss_weight=2.0,  # high-freq, emphasize accuracy
        ),
        "thought": RegionSpec(
            bounds=(0, 4, 28, 32, 0, 8),
            period=4, loss_weight=1.0,  # low-freq: 4 slots → frames 0,4,8,12
        ),
        "task_prompt": RegionSpec(
            bounds=(0, 1, 26, 28, 0, 4),
            is_output=False,  # input-only conditioning, no loss
        ),
    },
)

# Per-position loss weighting — respects is_output and loss_weight
weights = layout.loss_weight_mask("cuda")  # (N,) tensor
loss = (per_position_loss * weights).sum() / weights.sum()

# Frame mapping between canvas time and real-world time
layout.real_frame("thought", canvas_t=2)   # → 8
layout.canvas_frame("thought", real_t=8)   # → 2
layout.canvas_frame("thought", real_t=7)   # → None (not aligned)
```

Raw tuples auto-wrap as `RegionSpec(bounds=tuple)` with defaults — full backward compatibility. All existing code continues to work unchanged.

**RegionSpec fields:**

| Field | Default | Meaning |
|---|---|---|
| `bounds` | *(required)* | `(t0, t1, h0, h1, w0, w1)` spatial-temporal extent |
| `period` | `1` | Canvas frames per real-world update (1 = every frame) |
| `is_output` | `True` | Whether this region participates in diffusion loss |
| `loss_weight` | `1.0` | Relative loss weight for positions in this region |

## Non-Euclidean connectivity

Canvas regions don't have to interact via Euclidean adjacency. A `CanvasTopology` declaratively specifies which **block-to-block attention operations** are performed per step. Each `Connection` is a discrete cross-attention op: `src` tokens query against `dst` keys/values.

```python
from canvas_engineering import Connection, CanvasTopology

# Declarative: define the full attention compute DAG as data
topology = CanvasTopology(connections=[
    # Self-attention within each region
    Connection(src="robot1_cam", dst="robot1_cam"),
    Connection(src="robot1_action", dst="robot1_action"),
    Connection(src="robot2_cam", dst="robot2_cam"),
    Connection(src="robot2_action", dst="robot2_action"),
    Connection(src="shared_task", dst="shared_task"),

    # Causal: each robot's camera informs its own actions
    Connection(src="robot1_action", dst="robot1_cam"),
    Connection(src="robot2_action", dst="robot2_cam"),

    # Coordination: robots see each other's cameras
    Connection(src="robot1_cam", dst="robot2_cam", weight=0.5),
    Connection(src="robot2_cam", dst="robot1_cam", weight=0.5),

    # Hub: shared task reads from cameras, actions read from task
    Connection(src="shared_task", dst="robot1_cam"),
    Connection(src="shared_task", dst="robot2_cam"),
    Connection(src="robot1_action", dst="shared_task"),
    Connection(src="robot2_action", dst="shared_task"),
])

# Generate attention mask or iterate over ops
mask = topology.to_attention_mask(layout)  # (N, N) float
ops = topology.attention_ops()             # [(src, dst, weight), ...]
```

**Convenience constructors** for common patterns:

```python
CanvasTopology.dense(["a", "b", "c"])                 # fully connected (standard transformer)
CanvasTopology.isolated(["a", "b", "c"])              # block-diagonal (no cross-region)
CanvasTopology.hub_spoke("task", ["r1", "r2"])        # star topology
CanvasTopology.causal_chain(["obs", "plan", "act"])   # A → B → C
CanvasTopology.causal_temporal(["obs", "act"])        # same-frame self + prev-frame cross
```

<!-- Source: scripts/generate_topology_diagrams.py :: generate_all() -->
<p align="center"><img src="https://raw.githubusercontent.com/JacobFV/canvas-engineering/main/assets/topology_constructors.png" alt="Topology convenience constructors" width="100%"></p>

The topology is the compute graph of attention operations — not a soft mask on dense attention. Block self-attention is one special case. Dense is another. The interesting cases are structured DAGs that mirror the causal/information-flow structure of your problem.

### Temporal connectivity

Connections can constrain **which timesteps** participate in each attention op. By default, all timesteps see all timesteps (dense in time). With temporal offsets, you get causal chains over time, same-frame-only constraints, or sliding windows.

```python
# Default: all timesteps (backward compatible)
Connection(src="cam", dst="action")

# Same-frame only: no temporal leakage
Connection(src="cam", dst="action", t_src=0, t_dst=0)

# Previous frame cross-attention: action at t queries obs at t-1
Connection(src="action", dst="obs", t_src=0, t_dst=-1)

# Full temporal self-attention (explicit)
Connection(src="thought", dst="thought", t_src=None, t_dst=None)
```

**Semantics**: `t_src` and `t_dst` are relative offsets from a shared reference frame. The mask generator iterates over all reference frames and pairs positions at `ref + t_src` with positions at `ref + t_dst`. Out-of-bounds timesteps are silently skipped.

| `t_src` | `t_dst` | Behavior |
|---------|---------|----------|
| `None` | `None` | All src ↔ all dst (dense in time) |
| `0` | `0` | Same-frame only |
| `0` | `-1` | Src at current frame queries dst at previous frame |
| `None` | `0` | All src timesteps query dst at each reference frame |

The `causal_temporal` constructor gives you same-frame self-attention + previous-frame cross-attention for all regions — no future leakage, but full temporal context.
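
The pairing rule above can be sketched as a frame-level function (an illustration of the stated semantics, not the library's mask generator, which operates on positions rather than frames):

```python
# Frame-pairing sketch of the temporal offset rule: for each reference frame,
# pair src frame ref+t_src with dst frame ref+t_dst; None means "all frames".
def temporal_pairs(T, t_src, t_dst):
    pairs = set()
    for ref in range(T):
        srcs = range(T) if t_src is None else [ref + t_src]
        dsts = range(T) if t_dst is None else [ref + t_dst]
        for s in srcs:
            for d in dsts:
                if 0 <= s < T and 0 <= d < T:  # out-of-bounds silently skipped
                    pairs.add((s, d))
    return sorted(pairs)

temporal_pairs(4, 0, 0)   # → [(0, 0), (1, 1), (2, 2), (3, 3)]  same-frame only
temporal_pairs(4, 0, -1)  # → [(1, 0), (2, 1), (3, 2)]          query previous frame
```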

## Attention function types

Not all connections should use the same attention mechanism. A `Connection` can declare its `fn` — the type of function used for that edge. Regions can also set `default_attn` — a default for all outgoing connections. The schema declares *intent*; execution is backend-dependent.

```python
from canvas_engineering import CanvasLayout, RegionSpec, Connection, CanvasTopology

layout = CanvasLayout(
    T=8, H=16, W=16, d_model=512,
    regions={
        # Region defaults: what kind of attention makes sense for this modality?
        "visual": RegionSpec(bounds=(0, 8, 0, 12, 0, 12), default_attn="cross_attention"),
        "proprio": RegionSpec(bounds=(0, 8, 12, 13, 0, 2), default_attn="linear_attention"),
        "thought": RegionSpec(bounds=(0, 4, 13, 15, 0, 4), default_attn="mamba"),
        "goal": RegionSpec(bounds=(0, 1, 15, 16, 0, 4), default_attn="cross_attention",
                           is_output=False),
    },
)

topology = CanvasTopology(connections=[
    # Self-attention (uses each region's default_attn)
    Connection(src="visual", dst="visual"),    # → cross_attention
    Connection(src="proprio", dst="proprio"),  # → linear_attention
    Connection(src="thought", dst="thought"),  # → mamba
    Connection(src="goal", dst="goal"),        # → cross_attention

    # Cross-region with explicit fn overrides
    Connection(src="visual", dst="goal", fn="gated"),         # optional conditioning
    Connection(src="thought", dst="visual", fn="perceiver"),  # compress 864 visual tokens
    Connection(src="proprio", dst="visual", fn="pooling"),    # just need a summary
    Connection(src="thought", dst="thought", fn="copy",       # direct latent relay
               t_src=0, t_dst=-1),                            # from previous frame
])

# Resolve: returns (src, dst, weight, fn) with defaults applied
ops = topology.attention_ops(layout)
# [("visual", "visual", 1.0, "cross_attention"),
#  ("proprio", "proprio", 1.0, "linear_attention"),
#  ("thought", "thought", 1.0, "mamba"),
#  ...]
```

**Resolution order:** `connection.fn` (if set) → `region.default_attn` (if layout provided) → `"cross_attention"` (global default). Fully backward compatible — existing code without `fn` or `default_attn` resolves to standard cross-attention.
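
The fallback chain is simple enough to state as code (a sketch of the stated rule; the real resolution lives in `CanvasTopology`'s `resolve_fn()` dispatch, per the API reference):

```python
# The resolution order as code: explicit fn → region default → global default.
def resolve(connection_fn=None, region_default=None):
    return connection_fn or region_default or "cross_attention"

resolve("gated", "linear_attention")  # → "gated" (explicit fn wins)
resolve(None, "mamba")                # → "mamba" (region default)
resolve(None, None)                   # → "cross_attention" (global default)
```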

### The lineup

Every connection function type represents a different theory of how information should flow between regions. The schema declares intent; the executor decides implementation.

| Type | Family | Complexity | Best for |
|------|--------|-----------|----------|
| `cross_attention` | Dot-product | O(NM) | General-purpose, content-based selection |
| `linear_attention` | Dot-product | O(N+M) | Low-dimensional or high-frequency streams |
| `cosine_attention` | Dot-product | O(NM) | Stable gradients, no temperature scaling |
| `sigmoid_attention` | Dot-product | O(NM) | Non-exclusive / multi-label attention |
| `gated` | Gating | O(NM) | Optional conditioning (goals, instructions) |
| `perceiver` | Compression | O(NK) | Large dst regions compressed through bottleneck |
| `pooling` | Compression | O(N+M) | Scalar/low-dim conditioning signals |
| `copy` | Transfer | O(N) | Direct latent sharing, broadcast regions |
| `mamba` | State-space | O(N) | Long temporal sequences, causal connections |
| `rwkv` | State-space | O(N) | Temporal connections with learned decay |
| `hyena` | Convolution | O(N log N) | Sub-quadratic long-range via FFT |
| `sparse_attention` | Sparse | O(NK) | Selective binding to specific positions |
| `local_attention` | Sparse | O(NW) | Spatially local interactions (neighboring patches) |
| `none` | Meta | O(0) | Ablation — edge declared but disabled |
| `random_fixed` | Meta | O(NK) | Baseline — does learned structure matter? |
| `mixture` | Meta | O(NK) | MoE-style routing for multi-modal hubs |

### Design recipes

**Robot manipulation** — vision-heavy, low-latency actions:
```python
"visual": default_attn="cross_attention"   # full attention for spatial reasoning
"proprio": default_attn="linear_attention" # 12D joint state, no need for O(N²)
"action": default_attn="cross_attention"   # content-based selection from visual
# visual → action: cross_attention (which visual patches matter for this action?)
# proprio → action: pooling (just need the joint state vector)
```

**Embodied agent with memory** — long-horizon, selective recall:
```python
"perception": default_attn="cross_attention"
"memory": default_attn="mamba"             # O(N) sequential over long history
"policy": default_attn="cross_attention"
# memory → perception: gated (decide whether to incorporate memory at all)
# perception → memory: perceiver (compress percepts into fixed-size memory)
```

**Multi-agent coordination** — shared latent space:
```python
"agent_a.thought": default_attn="rwkv"     # causal temporal within agent
"agent_b.thought": default_attn="rwkv"
"shared_task": default_attn="cross_attention"
# agent_a.thought → shared_task: cross_attention (selective broadcast)
# shared_task → agent_b.thought: gated (selective incorporation)
# agent_a.thought → agent_b.thought: copy (direct latent relay)
```

**Vision transformer backbone** — drop-in structured attention:
```python
"cls_token": default_attn="cross_attention"
"patches": default_attn="local_attention"  # each patch attends locally
"readout": default_attn="cross_attention"
# cls_token → patches: cross_attention (global aggregation)
# patches → patches: local_attention (spatial locality)
# readout → cls_token: pooling (single vector summary)
```

## Semantic types and transfer distance

Each canvas region represents a modality — RGB video, joint angles, reward, language. `RegionSpec` lets you declare the modality's **semantic type** as a human-readable string and a frozen embedding vector from a fixed model. This turns modality compatibility from a human judgment call into a computable quantity.

```python
from canvas_engineering import RegionSpec, transfer_distance

cam = RegionSpec(
    bounds=(0, 8, 0, 12, 0, 12),
    semantic_type="RGB video 224x224 30fps from front-facing monocular camera",
    semantic_embedding=embed("RGB video 224x224 30fps from front-facing monocular camera"),
    embedding_model="openai/text-embedding-3-small",  # fixed, declared
)

depth = RegionSpec(
    bounds=(0, 8, 0, 12, 0, 12),
    semantic_type="Metric depth map 224x224 from front-facing monocular camera",
    semantic_embedding=embed("Metric depth map 224x224 from front-facing monocular camera"),
)

joints = RegionSpec(
    bounds=(0, 8, 12, 13, 0, 1),
    semantic_type="7-DOF joint angles at 30Hz",
    semantic_embedding=embed("7-DOF joint angles at 30Hz"),
)

transfer_distance(cam, depth)   # ~0.15 — cheap to bridge (1-2 layers)
transfer_distance(cam, joints)  # ~0.65 — expensive (full MLP adapter)
```
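
Per the API reference, `transfer_distance` is cosine distance between the two regions' `semantic_embedding` vectors. A self-contained sketch of the metric over plain lists (the values above come from real text embeddings; the toy vectors here only illustrate the endpoints):

```python
import math

# Cosine distance: 0.0 for identical directions, 1.0 for orthogonal ones.
def transfer_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return 1.0 - dot / norm

transfer_distance([1.0, 0.0], [1.0, 0.0])  # → 0.0 (same modality)
transfer_distance([1.0, 0.0], [0.0, 1.0])  # → 1.0 (unrelated modality)
```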

<!-- Source: scripts/generate_semantic_diagrams.py :: generate_transfer_distance() -->
<p align="center"><img src="https://raw.githubusercontent.com/JacobFV/canvas-engineering/main/assets/transfer_distance.png" alt="Semantic embedding space with transfer distances" width="65%"></p>

**Why this matters:** If canvas schemas produce stable latent representations (an empirical hypothesis we're testing), then semantic embedding distance approximates the real cost of bridging two modalities — how many adapter layers, how much data. The embedding model must be fixed and declared so distances are comparable across time and projects.

## Canvas schemas

A `CanvasSchema` bundles layout + topology into a single portable, serializable object — the complete type signature for a canvas-based model.

```python
from canvas_engineering import CanvasSchema, CanvasLayout, RegionSpec, CanvasTopology, Connection

schema = CanvasSchema(
    layout=CanvasLayout(
        T=8, H=16, W=16, d_model=256,
        regions={
            "visual": RegionSpec(
                bounds=(0, 8, 0, 12, 0, 12),
                semantic_type="RGB video 224x224",
                semantic_embedding=(0.12, -0.05, ...),
            ),
            "action": RegionSpec(
                bounds=(0, 8, 12, 14, 0, 2),
                loss_weight=2.0,
                semantic_type="6-DOF end-effector + gripper",
                semantic_embedding=(0.31, 0.08, ...),
            ),
        },
    ),
    topology=CanvasTopology(connections=[
        Connection(src="visual", dst="visual"),
        Connection(src="action", dst="visual"),
        Connection(src="action", dst="action"),
    ]),
    metadata={"model": "CogVideoX-2B", "data": "bridge_v2"},
)

# Serialize — the schema is the complete declaration
schema.to_json("robot_v1.json")
loaded = CanvasSchema.from_json("robot_v1.json")

# Find compatible regions across two schemas
pairs = schema.compatible_regions(other_schema, threshold=0.3)
# → [("visual", "camera", 0.04), ("action", "gripper_cmd", 0.12)]
```

The schema file is human-readable JSON. It declares everything needed to interpret a canvas tensor: geometry, region semantics, connectivity, and modality types. Two models with the same schema can share latent state directly.

<!-- Source: scripts/generate_semantic_diagrams.py :: generate_schema_alignment() -->
<p align="center"><img src="https://raw.githubusercontent.com/JacobFV/canvas-engineering/main/assets/schema_alignment.png" alt="Cross-schema region alignment between robot and computer agents" width="90%"></p>
<p align="center"><i>Two agents with different canvas schemas. <code>compatible_regions()</code> finds semantically aligned region pairs — solid lines indicate direct latent transfer is possible, dashed lines require adapter layers.</i></p>

## API reference

| Module | What it does |
|--------|-------------|
| **Canvas (omnimodal I/O)** | |
| `CanvasLayout` | Declarative 3D canvas geometry with named regions |
| `RegionSpec` | Per-region semantics: frequency, loss weight, output participation |
| `SpatiotemporalCanvas` | Canvas tensor ops: `create_empty`, `place`, `extract` |
| `Connection` | Single attention op with temporal offsets and function type (`fn`) |
| `CanvasTopology` | Declarative DAG of attention ops with `resolve_fn()` dispatch |
| `ATTENTION_TYPES` | Registry of 16 declared attention function types |
| `transfer_distance()` | Cosine distance between semantic type embeddings |
| `CanvasSchema` | Portable bundle: layout + topology + metadata, JSON-serializable |
| `ActionHead` | MLP decoder: latent channels → robot actions |
| **Looped attention (adaptive compute)** | |
| `LoopedBlockWrapper` | Wrap **any** transformer block for looped execution |
| `graft_looped_blocks()` | One-line grafting onto CogVideoX (auto-detects block type) |
| `freeze_full()` / `freeze_half()` | Freeze strategies for the backbone |
| `CurriculumScheduler` | Ramp loop count 1→3 during training |
| `SharpeningSchedule` | Progressive attention sharpening across loops (soft→sharp) |
| **Utilities** | |
| `save_loop_checkpoint()` | Save only loop params (~0.1% of model, ~1.4 MB) |

## Freeze strategies

| Strategy | What's frozen | Trainable | Action loss | Diffusion loss | Use when |
|----------|:---:|:---:|:---:|:---:|----------|
| `"full"` | Everything except loops | 350K | 0.073 | 1.48 | Max efficiency, action-only tasks |
| `"half"` | Only `patch_embed` | 3.7M | 0.107 | 0.19 | Good video + good actions |
| `"none"` | Nothing | 11.7M | 0.088 | 0.18 | Full fine-tuning, compute available |

## Progressive sharpening

Loop-indexed inverse temperature for bridging the soft→sharp attention discontinuity:

```python
from canvas_engineering import SharpeningSchedule

schedule = SharpeningSchedule(max_loops=3, beta_min=1.0, beta_max=4.0)

# Loop 0: beta=1.0 (soft, broad gradients)
# Loop 1: beta=2.5 (medium)
# Loop 2: beta=4.0 (sharp, precise attention)
```

Early loops train Q/K matrices via gradient flow. Later loops exploit trained structure with near-discrete attention. Empirically: mild sharpening (beta→2) gives 1.30x F1 on contact detection; aggressive (beta→8) hurts.
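
The printed betas are consistent with linear interpolation between `beta_min` and `beta_max` over loop indices — a sketch matching the values above, though the library's exact formula may differ:

```python
# Linear interpolation over loop index: loop 0 is softest, the last loop sharpest.
def beta_for_loop(i, max_loops=3, beta_min=1.0, beta_max=4.0):
    if max_loops == 1:
        return beta_max
    return beta_min + (beta_max - beta_min) * i / (max_loops - 1)

[beta_for_loop(i) for i in range(3)]  # → [1.0, 2.5, 4.0]
```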

## What looping is NOT

We tested three cortical-computation hypotheses rigorously. Two are **falsified**:

| Hypothesis | Result | Evidence |
|---|---|---|
| Looping enables iterative reasoning | **Falsified** | 3 independent nulls (p=0.97, p>0.05, p>0.05) |
| Shared canvas creates multi-modal binding | **Falsified** | Joint prediction 19% worse (p<0.0001) |
| Token allocation follows power laws | Borderline | R^2=0.902 but alpha=0.011 (doubling tokens yields only a 0.8% gain) |

The looping benefit is **weight-sharing regularization** (parameter efficiency, fixed-point convergence, lower variance), not iterative reasoning. The omnimodal capability comes from the **canvas architecture** (multi-encoder/multi-decoder), not from the looping.

## Examples

```
examples/
├── quickstart.py          # 30-line graft-and-train
├── graft_cogvideox.py     # Full CogVideoX grafting with training loop
├── define_canvas.py       # Canvas layouts for 3 applications
└── train_bridge_v2.py     # Real robot data training
```

## Installation

```bash
# Core (canvas + looped blocks)
pip install canvas-engineering

# With CogVideoX support
pip install "canvas-engineering[cogvideox]"

# With video dataset loading
pip install "canvas-engineering[data]"

# Development
pip install "canvas-engineering[dev]"
```

Requires Python 3.9+ and PyTorch 2.0+.

## Paper

> **Looped Attention in Video Diffusion Transformers: 26 Experiments on What Works, What Doesn't, and Why**
>
> Jacob Valdez and Claude Opus 4.6

[Paper PDF](https://github.com/JacobFV/recursive-omnimodal-video-action-model/blob/16c4bed/papers/empirical/main.pdf) | [Video](https://youtu.be/LHEhdFAWkEc) | [Full experiment data](https://github.com/JacobFV/recursive-omnimodal-video-action-model/tree/main/archive/experiments)

## License

Apache 2.0