canvas-engineering 0.1.0__tar.gz
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- canvas_engineering-0.1.0/LICENSE +17 -0
- canvas_engineering-0.1.0/PKG-INFO +602 -0
- canvas_engineering-0.1.0/README.md +571 -0
- canvas_engineering-0.1.0/canvas_engineering/__init__.py +40 -0
- canvas_engineering-0.1.0/canvas_engineering/action_heads.py +42 -0
- canvas_engineering-0.1.0/canvas_engineering/canvas.py +349 -0
- canvas_engineering-0.1.0/canvas_engineering/checkpoint.py +67 -0
- canvas_engineering-0.1.0/canvas_engineering/cogvideox.py +96 -0
- canvas_engineering-0.1.0/canvas_engineering/connectivity.py +301 -0
- canvas_engineering-0.1.0/canvas_engineering/curriculum.py +36 -0
- canvas_engineering-0.1.0/canvas_engineering/graft.py +126 -0
- canvas_engineering-0.1.0/canvas_engineering/looped_block.py +97 -0
- canvas_engineering-0.1.0/canvas_engineering/schema.py +192 -0
- canvas_engineering-0.1.0/canvas_engineering/sharpening.py +46 -0
- canvas_engineering-0.1.0/canvas_engineering.egg-info/PKG-INFO +602 -0
- canvas_engineering-0.1.0/canvas_engineering.egg-info/SOURCES.txt +24 -0
- canvas_engineering-0.1.0/canvas_engineering.egg-info/dependency_links.txt +1 -0
- canvas_engineering-0.1.0/canvas_engineering.egg-info/requires.txt +15 -0
- canvas_engineering-0.1.0/canvas_engineering.egg-info/top_level.txt +1 -0
- canvas_engineering-0.1.0/pyproject.toml +41 -0
- canvas_engineering-0.1.0/setup.cfg +4 -0
- canvas_engineering-0.1.0/tests/test_canvas.py +243 -0
- canvas_engineering-0.1.0/tests/test_connectivity.py +483 -0
- canvas_engineering-0.1.0/tests/test_graft.py +50 -0
- canvas_engineering-0.1.0/tests/test_looped_block.py +79 -0
- canvas_engineering-0.1.0/tests/test_schema.py +523 -0

@@ -0,0 +1,17 @@

Apache License
Version 2.0, January 2004
http://www.apache.org/licenses/

TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.

@@ -0,0 +1,602 @@

Metadata-Version: 2.4
Name: canvas-engineering
Version: 0.1.0
Summary: Prompt engineering, but for latent space. A type system for multimodal latent dynamics in video diffusion transformers.
Author: Jacob
Author-email: "Claude Opus 4.6" <noreply@anthropic.com>
License-Expression: Apache-2.0
Project-URL: Homepage, https://github.com/JacobFV/canvas-engineering
Project-URL: Documentation, https://jacobfv.github.io/canvas-engineering/
Keywords: diffusion,transformer,looped-attention,video,robotics,canvas
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Science/Research
Classifier: Programming Language :: Python :: 3
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Requires-Python: >=3.9
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: torch>=2.0
Requires-Dist: einops>=0.7
Provides-Extra: cogvideox
Requires-Dist: diffusers>=0.28; extra == "cogvideox"
Requires-Dist: transformers>=4.40; extra == "cogvideox"
Requires-Dist: accelerate; extra == "cogvideox"
Provides-Extra: data
Requires-Dist: av; extra == "data"
Requires-Dist: numpy; extra == "data"
Requires-Dist: Pillow; extra == "data"
Provides-Extra: dev
Requires-Dist: pytest>=7.0; extra == "dev"
Dynamic: license-file

# canvas-engineering

### Prompt engineering, but for latent space.

[PyPI](https://pypi.org/project/canvas-engineering/)
[License](LICENSE)
[Python](https://www.python.org/downloads/)
[Docs](https://jacobfv.github.io/canvas-engineering/)

> Prompt engineering structures what an LLM *sees*. **Canvas engineering** structures what a diffusion model *thinks in*. You declare which regions of latent space carry video, actions, proprioception, reward, or thought — their geometry, their temporal frequency, their connectivity, their loss participation — and the canvas compiles that declaration into attention masks, loss weights, and frame mappings. The layout is the schema. The topology is the compute graph. Together they form a **type system for multimodal latent computation**: the model doesn't discover what its internal state means — you declare it, and the structure constrains what it learns.

<p align="center">
  <img src="https://raw.githubusercontent.com/JacobFV/canvas-engineering/main/assets/canvas_layouts_combined.png" alt="Canvas allocation layouts for three applications" width="100%">
</p>
<p align="center"><i>Canvas allocations for robot manipulation, computer use, and multi-robot control. Each colored block is a modality region on the 3D spatiotemporal grid.</i></p>

---

## The idea

Prompt engineering gives LLMs structured context — few-shot examples, system instructions, tool descriptions — so they produce better outputs. Canvas engineering does the same thing one level deeper: it gives diffusion models structured *latent space* so they learn better representations. A diffusion transformer's latent tensor is just a flat bag of positions. **canvas-engineering** turns it into a typed workspace by letting you declare:

- **What** each region means — `RegionSpec` with bounds, temporal frequency, loss weight, input/output role
- **How** regions interact — `CanvasTopology` as a directed graph of attention operations with temporal constraints
- **How fast** each region runs — `period` maps canvas timesteps to real-world frames, so a "thought" region at period=4 and a "perception" region at period=1 coexist on the same canvas

This is literally a type system. `region_indices()` is an offset calculation. `loss_weight_mask()` is type-directed codegen. The topology is a calling convention. Two agents with the same canvas schema can share latent state directly — no tokenization, no encoding — because the schema tells you what every position means.
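
The struct analogy is concrete: a region's flat token indices fall out of row-major offset arithmetic over the `(T, H, W)` grid. A minimal sketch of that calculation (a stand-in for illustration, not the library's `region_indices()` implementation):

```python
def region_indices_sketch(T, H, W, bounds):
    """Flat indices of a region on a row-major (T, H, W) grid.

    bounds = (t0, t1, h0, h1, w0, w1), half-open like Python slices.
    """
    t0, t1, h0, h1, w0, w1 = bounds
    return [
        t * H * W + h * W + w          # same offset math as a C struct field
        for t in range(t0, t1)
        for h in range(h0, h1)
        for w in range(w0, w1)
    ]

# The "reward" region from the robot layout below: one position at (t=2, h=7, w=0)
region_indices_sketch(5, 8, 8, (2, 3, 7, 8, 0, 1))  # → [184]
```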

<!-- Source: scripts/generate_diagrams.py :: generate_type_system() -->
<p align="center"><img src="https://raw.githubusercontent.com/JacobFV/canvas-engineering/main/assets/canvas_type_system.png" alt="Type system analogy: C struct layout vs canvas schema" width="80%"></p>

The library has two orthogonal pieces, validated over [26 experiments and 236 training runs](https://github.com/JacobFV/recursive-omnimodal-video-action-model):

### 1. The canvas: structured multimodal latent space

Large video diffusion models (CogVideoX, Mochi, Wan) generate video. The **spatiotemporal canvas** extends them to *do things* — predict robot actions, estimate rewards, process proprioception — by placing heterogeneous modalities on a shared 3D grid with dedicated encoders and decoders. You design the schema; the model attends over everything.

### 2. Looped attention: weight-sharing regularization

**Looped attention** iterates transformer blocks multiple times with learned iteration embeddings. The empirical result: **1.73x parameter efficiency** over matched-depth models (p<0.001) through weight-sharing regularization (fixed-point convergence, cosine similarity 0.926 → 0.996). A frozen CogVideoX-2B backbone + **350K trainable loop parameters** outperforms **11.7M unfrozen parameters** on action prediction. 3 loops is optimal.

What looping is *not*: iterative reasoning — at least not yet. Three independent experiments falsified that hypothesis (p=0.97, p>0.05, p>0.05). The benefit is regularization, not reasoning depth — at the limited scale I tested, anyway; I'm skeptical the null result holds at larger scale.

## Quick start

```bash
pip install canvas-engineering
```

### Graft looped attention onto CogVideoX-2B

```python
from canvas_engineering import graft_looped_blocks, CurriculumScheduler
from diffusers import CogVideoXTransformer3DModel
import torch

# Load pretrained video diffusion model
transformer = CogVideoXTransformer3DModel.from_pretrained(
    "THUDM/CogVideoX-2b", subfolder="transformer", torch_dtype=torch.bfloat16
)

# Graft 3-loop attention onto all 30 frozen DiT blocks
looped_blocks, action_head = graft_looped_blocks(
    transformer,
    max_loops=3,      # 3 is optimal (empirically validated)
    freeze="full",    # freeze backbone, train only loop params
    action_dim=7,     # 6DOF end-effector + gripper
)

# Only 350K params to optimize
optimizer = torch.optim.AdamW(
    [p for b in looped_blocks for p in b.parameters() if p.requires_grad]
    + list(action_head.parameters()),
    lr=1e-4,
)

# Curriculum: gradually ramp from 1 to 3 loops during training
scheduler = CurriculumScheduler(max_loops=3, total_steps=5000)
```

That's it. The frozen 1.69B-parameter backbone now loops its computation 3 times per forward pass, with learned iteration embeddings that cost 0.02% of the model.

## How looped attention works

<!-- Source: scripts/generate_diagrams.py :: generate_looped_attention() -->
<p align="center"><img src="https://raw.githubusercontent.com/JacobFV/canvas-engineering/main/assets/looped_attention.png" alt="Looped attention block diagram" width="75%"></p>

**Zero-init safety**: Loop embeddings start at zero. At initialization, the model behaves identically to the pretrained backbone. No distribution shift. Safe to graft onto any frozen model.

**Gradient checkpointing**: Multi-loop training fits in 40GB VRAM by recomputing activations on the backward pass (per-loop, not per-block).
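
The two properties combine naturally in one wrapper. Below is a minimal sketch of the mechanism (zero-initialized contributions from extra loops, one checkpoint per loop); it illustrates the idea and is not the library's actual looped block, whose internals may differ:

```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

class LoopedBlockSketch(nn.Module):
    """Wrap a block so it runs up to max_loops times per forward pass."""

    def __init__(self, block: nn.Module, d_model: int, max_loops: int = 3):
        super().__init__()
        self.block = block
        self.max_loops = max_loops
        # Learned iteration embeddings, one per loop, zero-initialized.
        self.loop_emb = nn.Parameter(torch.zeros(max_loops, d_model))
        # Zero-init gates: at step 0 the extra loops contribute nothing,
        # so the grafted model matches the pretrained backbone exactly.
        self.gate = nn.Parameter(torch.zeros(max_loops))

    def forward(self, x, num_loops=None):
        out = self.block(x)  # loop 0: plain backbone behavior
        for i in range(1, num_loops or self.max_loops):
            # One checkpoint per loop: recompute this loop's activations
            # on the backward pass instead of storing them.
            h = checkpoint(self.block, out + self.loop_emb[i], use_reentrant=False)
            out = out + self.gate[i] * h  # gated residual, zero at init
        return out

block = nn.Linear(16, 16)
looped = LoopedBlockSketch(block, d_model=16, max_loops=3)
x = torch.randn(2, 16)
assert torch.allclose(looped(x), block(x))  # identical at initialization
```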

## How the canvas works

A **canvas** is a 3D grid `(T, H, W)` where different regions handle different modalities. This is the omnimodal I/O layer — it's what lets a video model also predict actions, read proprioception, and estimate reward.

```python
from canvas_engineering import CanvasLayout, SpatiotemporalCanvas

# Robot manipulation canvas
layout = CanvasLayout(
    T=5, H=8, W=8, d_model=256,
    regions={
        "visual": (0, 5, 0, 6, 0, 6),   # 180 positions — video patches
        "action": (0, 5, 6, 7, 0, 1),   # 5 positions — per-frame actions
        "reward": (2, 3, 7, 8, 0, 1),   # 1 position — scalar reward
    },
    t_current=2,  # t >= 2 is future (diffusion output)
)

canvas = SpatiotemporalCanvas(layout)
batch = canvas.create_empty(batch_size=4)            # (4, 320, 256)
batch = canvas.place(batch, visual_embs, "visual")   # write video patches
actions = canvas.extract(batch, "action")            # read action predictions
```

<!-- Source: scripts/generate_diagrams.py :: generate_3d_gif() / generate_3d_static() -->
<p align="center"><img src="https://raw.githubusercontent.com/JacobFV/canvas-engineering/main/assets/canvas_robot_3d.gif" alt="3D rotating canvas allocation" width="50%"></p>
<p align="center"><i>3D region allocation for a robot manipulation canvas. Each colored block is a modality occupying a subvolume of the (T, H, W) grid.</i></p>

**Built-in examples** for robot manipulation, computer use agents, and multi-robot control:

```python
# Computer use agent: screen pixels + mouse + keyboard + LLM steering
layout = CanvasLayout(
    T=16, H=32, W=32, d_model=768,
    regions={
        "screen":   (0, 16,  0, 24, 0, 24),  # 9,216 positions (56%)
        "mouse":    (0, 16, 24, 26, 0,  4),  # 128 positions
        "keyboard": (0, 16, 26, 28, 0,  4),  # 128 positions
        "llm":      (0, 16, 28, 32, 0,  8),  # 512 positions
    },
)
# → 16,384 total positions, bandwidth-proportional allocation
```

<p align="center"><img src="https://raw.githubusercontent.com/JacobFV/canvas-engineering/main/assets/canvas_computer.png" alt="Computer use agent canvas" width="45%"> <img src="https://raw.githubusercontent.com/JacobFV/canvas-engineering/main/assets/canvas_multi_robot.png" alt="Multi-robot canvas" width="45%"></p>

## Why 3 loops?

From a 12-condition grid ablation on CogVideoX-2B with real Bridge V2 robot video (36 runs, $152 compute):

```
Action Loss (lower = better)

           Frozen          Half-frozen     Unfrozen
           (350K params)   (3.7M params)   (11.7M params)
1 loop     0.121           0.115           0.108
2 loops    0.140           0.119           0.112
3 loops    0.073 ◀ BEST    0.107           0.088
4 loops    0.104           0.137           0.124
```

**3 loops wins at every freeze level.** The frozen 3-loop condition (350K params) beats every unfrozen condition (11.7M+ params). 4 loops consistently regresses from 3.

Freeze level doesn't affect action loss at all (marginals: 0.109 vs 0.108, p=0.72). It only affects video generation quality (8-9x gap on diffusion loss).

## Declarative region frequency

Canvas regions can operate at different real-world frequencies. A `RegionSpec` declares per-region semantics — temporal frequency, loss participation, and loss weight — as first-class properties.

```python
from canvas_engineering import CanvasLayout, RegionSpec

layout = CanvasLayout(
    T=16, H=32, W=32, d_model=768,
    regions={
        "screen": (0, 16, 0, 24, 0, 24),  # raw tuple — period=1 default

        "mouse": RegionSpec(
            bounds=(0, 16, 24, 26, 0, 4),
            period=1, loss_weight=2.0,    # high-freq, emphasize accuracy
        ),
        "thought": RegionSpec(
            bounds=(0, 4, 28, 32, 0, 8),
            period=4, loss_weight=1.0,    # low-freq: 4 slots → frames 0,4,8,12
        ),
        "task_prompt": RegionSpec(
            bounds=(0, 1, 26, 28, 0, 4),
            is_output=False,              # input-only conditioning, no loss
        ),
    },
)

# Per-position loss weighting — respects is_output and loss_weight
weights = layout.loss_weight_mask("cuda")  # (N,) tensor
loss = (per_position_loss * weights).sum() / weights.sum()

# Frame mapping between canvas time and real-world time
layout.real_frame("thought", canvas_t=2)    # → 8
layout.canvas_frame("thought", real_t=8)    # → 2
layout.canvas_frame("thought", real_t=7)    # → None (not aligned)
```

Raw tuples auto-wrap as `RegionSpec(bounds=tuple)` with defaults — full backward compatibility. All existing code continues to work unchanged.
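
The auto-wrap rule is simple enough to state in a few lines. A sketch using a stand-in dataclass with the documented defaults (not the package's own `RegionSpec`):

```python
from dataclasses import dataclass
from typing import Tuple, Union

Bounds = Tuple[int, int, int, int, int, int]

@dataclass(frozen=True)
class RegionSpecSketch:
    """Stand-in for RegionSpec with the documented defaults."""
    bounds: Bounds
    period: int = 1
    is_output: bool = True
    loss_weight: float = 1.0

def normalize_region(region: Union[Bounds, "RegionSpecSketch"]) -> "RegionSpecSketch":
    """Auto-wrap: a raw bounds tuple becomes a full spec with default
    semantics, so pre-RegionSpec layouts keep working unchanged."""
    if isinstance(region, tuple):
        return RegionSpecSketch(bounds=region)
    return region

spec = normalize_region((0, 16, 0, 24, 0, 24))
assert spec.period == 1 and spec.is_output and spec.loss_weight == 1.0
```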

**RegionSpec fields:**

| Field | Default | Meaning |
|---|---|---|
| `bounds` | *(required)* | `(t0, t1, h0, h1, w0, w1)` spatiotemporal extent |
| `period` | `1` | Canvas frames per real-world update (1 = every frame) |
| `is_output` | `True` | Whether this region participates in diffusion loss |
| `loss_weight` | `1.0` | Relative loss weight for positions in this region |

## Non-Euclidean connectivity

Canvas regions don't have to interact via Euclidean adjacency. A `CanvasTopology` declaratively specifies which **block-to-block attention operations** are performed per step. Each `Connection` is a discrete cross-attention op: `src` tokens query against `dst` keys/values.

```python
from canvas_engineering import Connection, CanvasTopology

# Declarative: define the full attention compute DAG as data
topology = CanvasTopology(connections=[
    # Self-attention within each region
    Connection(src="robot1_cam", dst="robot1_cam"),
    Connection(src="robot1_action", dst="robot1_action"),
    Connection(src="robot2_cam", dst="robot2_cam"),
    Connection(src="robot2_action", dst="robot2_action"),
    Connection(src="shared_task", dst="shared_task"),

    # Causal: each robot's camera informs its own actions
    Connection(src="robot1_action", dst="robot1_cam"),
    Connection(src="robot2_action", dst="robot2_cam"),

    # Coordination: robots see each other's cameras
    Connection(src="robot1_cam", dst="robot2_cam", weight=0.5),
    Connection(src="robot2_cam", dst="robot1_cam", weight=0.5),

    # Hub: shared task reads from cameras, actions read from task
    Connection(src="shared_task", dst="robot1_cam"),
    Connection(src="shared_task", dst="robot2_cam"),
    Connection(src="robot1_action", dst="shared_task"),
    Connection(src="robot2_action", dst="shared_task"),
])

# Generate attention mask or iterate over ops
mask = topology.to_attention_mask(layout)   # (N, N) float
ops = topology.attention_ops()              # [(src, dst, weight), ...]
```

**Convenience constructors** for common patterns:

```python
CanvasTopology.dense(["a", "b", "c"])                 # fully connected (standard transformer)
CanvasTopology.isolated(["a", "b", "c"])              # block-diagonal (no cross-region)
CanvasTopology.hub_spoke("task", ["r1", "r2"])        # star topology
CanvasTopology.causal_chain(["obs", "plan", "act"])   # A → B → C
CanvasTopology.causal_temporal(["obs", "act"])        # same-frame self + prev-frame cross
```

<!-- Source: scripts/generate_topology_diagrams.py :: generate_all() -->
<p align="center"><img src="https://raw.githubusercontent.com/JacobFV/canvas-engineering/main/assets/topology_constructors.png" alt="Topology convenience constructors" width="100%"></p>

The topology is the compute graph of attention operations — not a soft mask on dense attention. Block self-attention is one special case. Dense is another. The interesting cases are structured DAGs that mirror the causal/information-flow structure of your problem.
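
To make "the topology is the compute graph" concrete, here is one way an executor might consume the ops list: each edge becomes an independent cross-attention call whose weighted output updates the `src` region. This is a sketch of the execution model, not the package's executor, and it assumes regions have already been gathered into per-region token tensors:

```python
import torch
import torch.nn.functional as F

def run_attention_ops(tokens, ops):
    """tokens: region name -> (B, N_region, D) tensor.
    ops: (src, dst, weight) triples, e.g. from topology.attention_ops()."""
    updates = {name: torch.zeros_like(t) for name, t in tokens.items()}
    for src, dst, weight in ops:
        q, kv = tokens[src], tokens[dst]
        # src queries attend over dst keys/values — one discrete op per edge
        out = F.scaled_dot_product_attention(q, kv, kv)
        updates[src] = updates[src] + weight * out
    return {name: tokens[name] + updates[name] for name in tokens}

tokens = {"cam": torch.randn(2, 180, 64), "action": torch.randn(2, 5, 64)}
ops = [("cam", "cam", 1.0), ("action", "cam", 1.0)]  # self-attn + causal edge
out = run_attention_ops(tokens, ops)
assert out["action"].shape == (2, 5, 64)
```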

### Temporal connectivity

Connections can constrain **which timesteps** participate in each attention op. By default, all timesteps see all timesteps (dense in time). With temporal offsets, you get causal chains over time, same-frame-only constraints, or sliding windows.

```python
# Default: all timesteps (backward compatible)
Connection(src="cam", dst="action")

# Same-frame only: no temporal leakage
Connection(src="cam", dst="action", t_src=0, t_dst=0)

# Previous-frame cross-attention: action at t queries obs at t-1
Connection(src="action", dst="obs", t_src=0, t_dst=-1)

# Full temporal self-attention (explicit)
Connection(src="thought", dst="thought", t_src=None, t_dst=None)
```

**Semantics**: `t_src` and `t_dst` are relative offsets from a shared reference frame. The mask generator iterates over all reference frames and pairs positions at `ref + t_src` with positions at `ref + t_dst`. Out-of-bounds timesteps are silently skipped.
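
The offset rule can be stated in a few lines of code. A sketch of the frame-pairing logic only (the real mask generator also expands frames to spatial positions):

```python
def temporal_pairs(T, t_src, t_dst):
    """Which (src_frame, dst_frame) pairs a connection enables.

    For each reference frame, pair src frame ref+t_src with dst frame
    ref+t_dst; None means "all frames". Out-of-bounds frames are
    silently skipped, as described above.
    """
    pairs = set()
    for ref in range(T):
        srcs = range(T) if t_src is None else [ref + t_src]
        dsts = range(T) if t_dst is None else [ref + t_dst]
        for s in srcs:
            for d in dsts:
                if 0 <= s < T and 0 <= d < T:
                    pairs.add((s, d))
    return pairs

temporal_pairs(4, 0, -1)  # pairs (1, 0), (2, 1), (3, 2): strictly previous-frame
```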

| `t_src` | `t_dst` | Behavior |
|---------|---------|----------|
| `None` | `None` | All src ↔ all dst (dense in time) |
| `0` | `0` | Same-frame only |
| `0` | `-1` | Src at current frame queries dst at previous frame |
| `None` | `0` | All src timesteps query dst at each reference frame |

The `causal_temporal` constructor gives you same-frame self-attention + previous-frame cross-attention for all regions — no future leakage, but full temporal context.
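
Based on that description, `causal_temporal` plausibly expands to the following connection set; this expansion is inferred from the prose above, not taken from the package source:

```python
def causal_temporal_sketch(regions):
    """Hypothetical expansion: same-frame self-attention for each region,
    plus previous-frame cross-attention between every region pair."""
    conns = [dict(src=r, dst=r, t_src=0, t_dst=0) for r in regions]
    conns += [
        dict(src=r, dst=other, t_src=0, t_dst=-1)
        for r in regions
        for other in regions
    ]
    return conns

conns = causal_temporal_sketch(["obs", "act"])
assert len(conns) == 2 + 4  # 2 same-frame self edges + 2x2 previous-frame edges
```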

## Attention function types

Not all connections should use the same attention mechanism. A `Connection` can declare its `fn` — the type of function used for that edge. Regions can also set `default_attn` — a default for all outgoing connections. The schema declares *intent*; execution is backend-dependent.

```python
from canvas_engineering import CanvasLayout, RegionSpec, Connection, CanvasTopology

layout = CanvasLayout(
    T=8, H=16, W=16, d_model=512,
    regions={
        # Region defaults: what kind of attention makes sense for this modality?
        "visual":  RegionSpec(bounds=(0, 8,  0, 12, 0, 12), default_attn="cross_attention"),
        "proprio": RegionSpec(bounds=(0, 8, 12, 13, 0,  2), default_attn="linear_attention"),
        "thought": RegionSpec(bounds=(0, 4, 13, 15, 0,  4), default_attn="mamba"),
        "goal":    RegionSpec(bounds=(0, 1, 15, 16, 0,  4), default_attn="cross_attention",
                              is_output=False),
    },
)

topology = CanvasTopology(connections=[
    # Self-attention (uses each region's default_attn)
    Connection(src="visual", dst="visual"),      # → cross_attention
    Connection(src="proprio", dst="proprio"),    # → linear_attention
    Connection(src="thought", dst="thought"),    # → mamba
    Connection(src="goal", dst="goal"),          # → cross_attention

    # Cross-region with explicit fn overrides
    Connection(src="visual", dst="goal", fn="gated"),          # optional conditioning
    Connection(src="thought", dst="visual", fn="perceiver"),   # compress 864 visual tokens
    Connection(src="proprio", dst="visual", fn="pooling"),     # just need a summary
    Connection(src="thought", dst="thought", fn="copy",        # direct latent relay
               t_src=0, t_dst=-1),                             # from previous frame
])

# Resolve: returns (src, dst, weight, fn) with defaults applied
ops = topology.attention_ops(layout)
# [("visual", "visual", 1.0, "cross_attention"),
#  ("proprio", "proprio", 1.0, "linear_attention"),
#  ("thought", "thought", 1.0, "mamba"),
#  ...]
```

**Resolution order:** `connection.fn` (if set) → `region.default_attn` (if layout provided) → `"cross_attention"` (global default). Fully backward compatible — existing code without `fn` or `default_attn` resolves to standard cross-attention.
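
The fallback chain fits in a few lines. A sketch of the rule (a stand-in illustration, not the package's `resolve_fn()`):

```python
def resolve_fn_sketch(connection_fn=None, region_default=None):
    """Three-step fallback described above."""
    if connection_fn is not None:
        return connection_fn       # 1. explicit fn on the Connection
    if region_default is not None:
        return region_default      # 2. src region's default_attn
    return "cross_attention"       # 3. global default

resolve_fn_sketch(None, "mamba")     # → "mamba"
resolve_fn_sketch("gated", "mamba")  # → "gated" (explicit fn wins)
resolve_fn_sketch(None, None)        # → "cross_attention"
```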

### The lineup

Every connection function type represents a different theory of how information should flow between regions. The schema declares intent; the executor decides implementation.

| Type | Family | Complexity | Best for |
|------|--------|-----------|----------|
| `cross_attention` | Dot-product | O(NM) | General-purpose, content-based selection |
| `linear_attention` | Dot-product | O(N+M) | Low-dimensional or high-frequency streams |
| `cosine_attention` | Dot-product | O(NM) | Stable gradients, no temperature scaling |
| `sigmoid_attention` | Dot-product | O(NM) | Non-exclusive / multi-label attention |
| `gated` | Gating | O(NM) | Optional conditioning (goals, instructions) |
| `perceiver` | Compression | O(NK) | Large dst regions compressed through bottleneck |
| `pooling` | Compression | O(N+M) | Scalar/low-dim conditioning signals |
| `copy` | Transfer | O(N) | Direct latent sharing, broadcast regions |
| `mamba` | State-space | O(N) | Long temporal sequences, causal connections |
| `rwkv` | State-space | O(N) | Temporal connections with learned decay |
| `hyena` | Convolution | O(N log N) | Sub-quadratic long-range via FFT |
| `sparse_attention` | Sparse | O(NK) | Selective binding to specific positions |
| `local_attention` | Sparse | O(NW) | Spatially local interactions (neighboring patches) |
| `none` | Meta | O(0) | Ablation — edge declared but disabled |
| `random_fixed` | Meta | O(NK) | Baseline — does learned structure matter? |
| `mixture` | Meta | O(NK) | MoE-style routing for multi-modal hubs |

### Design recipes

**Robot manipulation** — vision-heavy, low-latency actions:
```python
"visual":  default_attn="cross_attention"   # full attention for spatial reasoning
"proprio": default_attn="linear_attention"  # 12D joint state, no need for O(N²)
"action":  default_attn="cross_attention"   # content-based selection from visual
# visual → action: cross_attention (which visual patches matter for this action?)
# proprio → action: pooling (just need the joint state vector)
```

**Embodied agent with memory** — long-horizon, selective recall:
```python
"perception": default_attn="cross_attention"
"memory":     default_attn="mamba"          # O(N) sequential over long history
"policy":     default_attn="cross_attention"
# memory → perception: gated (decide whether to incorporate memory at all)
# perception → memory: perceiver (compress percepts into fixed-size memory)
```

**Multi-agent coordination** — shared latent space:
```python
"agent_a.thought": default_attn="rwkv"      # causal temporal within agent
"agent_b.thought": default_attn="rwkv"
"shared_task":     default_attn="cross_attention"
# agent_a.thought → shared_task: cross_attention (selective broadcast)
# shared_task → agent_b.thought: gated (selective incorporation)
# agent_a.thought → agent_b.thought: copy (direct latent relay)
```

**Vision transformer backbone** — drop-in structured attention:
```python
"cls_token": default_attn="cross_attention"
"patches":   default_attn="local_attention"  # each patch attends locally
"readout":   default_attn="cross_attention"
# cls_token → patches: cross_attention (global aggregation)
# patches → patches: local_attention (spatial locality)
# readout → cls_token: pooling (single vector summary)
```

## Semantic types and transfer distance

Each canvas region represents a modality — RGB video, joint angles, reward, language. `RegionSpec` lets you declare the modality's **semantic type** as a human-readable string and a frozen embedding vector from a fixed model. This turns modality compatibility from a human judgment call into a computable quantity.

```python
from canvas_engineering import RegionSpec, transfer_distance

cam = RegionSpec(
    bounds=(0, 8, 0, 12, 0, 12),
    semantic_type="RGB video 224x224 30fps from front-facing monocular camera",
    semantic_embedding=embed("RGB video 224x224 30fps from front-facing monocular camera"),
    embedding_model="openai/text-embedding-3-small",  # fixed, declared
)

depth = RegionSpec(
    bounds=(0, 8, 0, 12, 0, 12),
    semantic_type="Metric depth map 224x224 from front-facing monocular camera",
    semantic_embedding=embed("Metric depth map 224x224 from front-facing monocular camera"),
)

joints = RegionSpec(
    bounds=(0, 8, 12, 13, 0, 1),
    semantic_type="7-DOF joint angles at 30Hz",
    semantic_embedding=embed("7-DOF joint angles at 30Hz"),
)

transfer_distance(cam, depth)   # ~0.15 — cheap to bridge (1-2 layers)
transfer_distance(cam, joints)  # ~0.65 — expensive (full MLP adapter)
```

<!-- Source: scripts/generate_semantic_diagrams.py :: generate_transfer_distance() -->
<p align="center"><img src="https://raw.githubusercontent.com/JacobFV/canvas-engineering/main/assets/transfer_distance.png" alt="Semantic embedding space with transfer distances" width="65%"></p>

**Why this matters:** If canvas schemas produce stable latent representations (an empirical hypothesis we're testing), then semantic embedding distance approximates the real cost of bridging two modalities — how many adapter layers, how much data. The embedding model must be fixed and declared so distances are comparable across time and projects.
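
A natural candidate metric for `transfer_distance` is cosine distance between the two frozen semantic embeddings. The sketch below uses that metric as an assumption; the package may compute something different:

```python
import math

def transfer_distance_sketch(emb_a, emb_b):
    """Cosine distance between two semantic embedding vectors.
    0.0 = identical direction, up to 2.0 = opposite direction."""
    dot = sum(a * b for a, b in zip(emb_a, emb_b))
    norm_a = math.sqrt(sum(a * a for a in emb_a))
    norm_b = math.sqrt(sum(b * b for b in emb_b))
    return 1.0 - dot / (norm_a * norm_b)

transfer_distance_sketch((1.0, 0.0), (1.0, 0.0))  # → 0.0 (identical modalities)
transfer_distance_sketch((1.0, 0.0), (0.0, 1.0))  # → 1.0 (orthogonal modalities)
```

Because both vectors come from one fixed, declared embedding model, distances computed this way stay comparable across schemas.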

## Canvas schemas

A `CanvasSchema` bundles layout + topology into a single portable, serializable object — the complete type signature for a canvas-based model.

```python
from canvas_engineering import CanvasSchema, CanvasLayout, RegionSpec, CanvasTopology, Connection

schema = CanvasSchema(
    layout=CanvasLayout(
        T=8, H=16, W=16, d_model=256,
        regions={
            "visual": RegionSpec(
                bounds=(0, 8, 0, 12, 0, 12),
                semantic_type="RGB video 224x224",
                semantic_embedding=(0.12, -0.05, ...),
            ),
            "action": RegionSpec(
                bounds=(0, 8, 12, 14, 0, 2),
                loss_weight=2.0,
                semantic_type="6-DOF end-effector + gripper",
                semantic_embedding=(0.31, 0.08, ...),
            ),
        },
    ),
    topology=CanvasTopology(connections=[
        Connection(src="visual", dst="visual"),
        Connection(src="action", dst="visual"),
        Connection(src="action", dst="action"),
    ]),
    metadata={"model": "CogVideoX-2B", "data": "bridge_v2"},
)

# Serialize — the schema is the complete declaration
schema.to_json("robot_v1.json")
loaded = CanvasSchema.from_json("robot_v1.json")

# Find compatible regions across two schemas
pairs = schema.compatible_regions(other_schema, threshold=0.3)
# → [("visual", "camera", 0.04), ("action", "gripper_cmd", 0.12)]
```
|
|
498
|
+
|
|
499
|
+
The schema file is human-readable JSON. It declares everything needed to interpret a canvas tensor: geometry, region semantics, connectivity, and modality types. Two models with the same schema can share latent state directly.

<!-- Source: scripts/generate_semantic_diagrams.py :: generate_schema_alignment() -->
<p align="center"><img src="https://raw.githubusercontent.com/JacobFV/canvas-engineering/main/assets/schema_alignment.png" alt="Cross-schema region alignment between robot and computer agents" width="90%"></p>
<p align="center"><i>Two agents with different canvas schemas. <code>compatible_regions()</code> finds semantically aligned region pairs — solid lines indicate direct latent transfer is possible, dashed lines require adapter layers.</i></p>

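The pairing logic behind `compatible_regions()` can be sketched in a few lines. This is a hypothetical illustration, assuming each region carries a semantic embedding and any pair below the distance threshold counts as compatible; the library's actual implementation may differ:

```python
import math

def compatible_regions_sketch(regions_a, regions_b, threshold=0.3):
    """Pair every (a, b) region whose embedding cosine distance is
    below the threshold, sorted cheapest-to-bridge first."""
    def dist(u, v):
        dot = sum(x * y for x, y in zip(u, v))
        nu = math.sqrt(sum(x * x for x in u))
        nv = math.sqrt(sum(x * x for x in v))
        return 1.0 - dot / (nu * nv)

    pairs = [
        (name_a, name_b, dist(emb_a, emb_b))
        for name_a, emb_a in regions_a.items()
        for name_b, emb_b in regions_b.items()
        if dist(emb_a, emb_b) < threshold
    ]
    return sorted(pairs, key=lambda p: p[2])

# Toy 2-D embeddings standing in for the real semantic vectors.
robot = {"visual": (0.9, 0.1), "action": (0.1, 0.9)}
agent = {"camera": (0.88, 0.12), "gripper_cmd": (0.15, 0.85)}
print(compatible_regions_sketch(robot, agent))
```

Cross-modal pairs like `visual`/`gripper_cmd` fall above the threshold and are dropped, which is the behavior the diagram's dashed-vs-solid lines depict.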
## API reference

| Module | What it does |
|--------|--------------|
| **Canvas (omnimodal I/O)** | |
| `CanvasLayout` | Declarative 3D canvas geometry with named regions |
| `RegionSpec` | Per-region semantics: frequency, loss weight, output participation |
| `SpatiotemporalCanvas` | Canvas tensor ops: `create_empty`, `place`, `extract` |
| `Connection` | Single attention op with temporal offsets and function type (`fn`) |
| `CanvasTopology` | Declarative DAG of attention ops with `resolve_fn()` dispatch |
| `ATTENTION_TYPES` | Registry of 16 declared attention function types |
| `transfer_distance()` | Cosine distance between semantic type embeddings |
| `CanvasSchema` | Portable bundle: layout + topology + metadata, JSON-serializable |
| `ActionHead` | MLP decoder: latent channels → robot actions |
| **Looped attention (adaptive compute)** | |
| `LoopedBlockWrapper` | Wrap **any** transformer block for looped execution |
| `graft_looped_blocks()` | One-line grafting onto CogVideoX (auto-detects block type) |
| `freeze_full()` / `freeze_half()` | Freeze strategies for the backbone |
| `CurriculumScheduler` | Ramp loop count 1→3 during training |
| `SharpeningSchedule` | Progressive attention sharpening across loops (soft→sharp) |
| **Utilities** | |
| `save_loop_checkpoint()` | Save only loop params (~0.1% of model, ~1.4 MB) |

## Freeze strategies

| Strategy | What's frozen | Trainable params | Action loss | Diffusion loss | Use when |
|----------|:---:|:---:|:---:|:---:|----------|
| `"full"` | Everything except loops | 350K | 0.073 | 1.48 | Max efficiency, action-only tasks |
| `"half"` | Only `patch_embed` | 3.7M | 0.107 | 0.19 | Good video + good actions |
| `"none"` | Nothing | 11.7M | 0.088 | 0.18 | Full fine-tuning, compute available |

## Progressive sharpening

Loop-indexed inverse temperature for bridging the soft→sharp attention discontinuity:

```python
from canvas_engineering import SharpeningSchedule

schedule = SharpeningSchedule(max_loops=3, beta_min=1.0, beta_max=4.0)

# Loop 0: beta=1.0 (soft, broad gradients)
# Loop 1: beta=2.5 (medium)
# Loop 2: beta=4.0 (sharp, precise attention)
```

Early loops train the Q/K matrices via gradient flow; later loops exploit that trained structure with near-discrete attention. Empirically, mild sharpening (beta→2) gives a 1.30x F1 improvement on contact detection, while aggressive sharpening (beta→8) hurts performance.

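The loop→beta values in the example above are consistent with a simple linear ramp, and beta acts as an inverse temperature on the attention softmax. A sketch under that assumption (the library's `SharpeningSchedule` may compute beta differently):

```python
import math

def beta_for_loop(loop, max_loops=3, beta_min=1.0, beta_max=4.0):
    # Linear ramp from beta_min (first loop) to beta_max (last loop).
    if max_loops == 1:
        return beta_max
    return beta_min + (beta_max - beta_min) * loop / (max_loops - 1)

def sharpened_softmax(scores, beta):
    # Higher beta concentrates probability mass on the max-score key.
    exps = [math.exp(beta * s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

print([beta_for_loop(i) for i in range(3)])  # → [1.0, 2.5, 4.0]
soft = sharpened_softmax([2.0, 1.0, 0.0], beta_for_loop(0))
sharp = sharpened_softmax([2.0, 1.0, 0.0], beta_for_loop(2))
print(max(soft), max(sharp))  # the loop-2 distribution is visibly more peaked
```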
## What looping is NOT

We tested three cortical-computation hypotheses rigorously. Two are **falsified**:

| Hypothesis | Result | Evidence |
|---|---|---|
| Looping enables iterative reasoning | **Falsified** | 3 independent nulls (p=0.97, p>0.05, p>0.05) |
| Shared canvas creates multi-modal binding | **Falsified** | Joint prediction 19% worse (p<0.0001) |
| Token allocation follows power laws | Borderline | R^2=0.902 but alpha=0.011 (doubling tokens ≈ 0.8% gain) |

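The borderline power-law row is easy to sanity-check: with a scaling exponent of alpha = 0.011, doubling the token budget multiplies performance by 2^alpha, which is under one percent:

```python
alpha = 0.011          # fitted scaling exponent from the table above
gain = 2 ** alpha - 1  # relative improvement from doubling the token budget
print(f"{gain:.3%}")   # → 0.765%, i.e. roughly the 0.8% quoted above
```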
The looping benefit is **weight-sharing regularization** (parameter efficiency, fixed-point convergence, lower variance), not iterative reasoning. The omnimodal capability comes from the **canvas architecture** (multi-encoder/multi-decoder), not from the looping.

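Weight sharing and fixed-point convergence can be seen in miniature with any contraction mapping: looping reuses the *same* parameters every iteration, and the iterates settle toward a fixed point rather than computing fresh reasoning steps. A toy sketch in plain Python (analogous to, not taken from, the library's looped block):

```python
def looped_apply(step, x, loops=3):
    # One set of "weights" (the step function) reused across every loop,
    # analogous to how a looped block reuses one block's parameters.
    for _ in range(loops):
        x = step(x)
    return x

step = lambda x: 0.5 * x + 1.0  # contraction with fixed point at 2.0
print([looped_apply(step, 0.0, loops=n) for n in (1, 2, 3)])  # → [1.0, 1.5, 1.75]
```

Extra loops shrink the distance to the fixed point geometrically, which is a regularization story, not a reasoning one.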
## Examples

```
examples/
├── quickstart.py        # 30-line graft-and-train
├── graft_cogvideox.py   # Full CogVideoX grafting with training loop
├── define_canvas.py     # Canvas layouts for 3 applications
└── train_bridge_v2.py   # Real robot data training
```

## Installation

```bash
# Core (canvas + looped blocks)
pip install canvas-engineering

# With CogVideoX support
pip install canvas-engineering[cogvideox]

# With video dataset loading
pip install canvas-engineering[data]

# Development
pip install canvas-engineering[dev]
```

Requires Python 3.9+ and PyTorch 2.0+.

## Paper

> **Looped Attention in Video Diffusion Transformers: 26 Experiments on What Works, What Doesn't, and Why**
>
> Jacob Valdez and Claude Opus 4.6

[Paper PDF](https://github.com/JacobFV/recursive-omnimodal-video-action-model/blob/16c4bed/papers/empirical/main.pdf) | [Video](https://youtu.be/LHEhdFAWkEc) | [Full experiment data](https://github.com/JacobFV/recursive-omnimodal-video-action-model/tree/main/archive/experiments)

## License

Apache 2.0