@kolbo/kolbo-code-linux-x64-baseline-musl 2.1.20 → 2.1.22

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
@@ -7,6 +7,8 @@ description: Generate, edit, or analyze creative media through Kolbo AI. Load th
 
  You have direct access to the Kolbo AI creative platform via MCP tools (auto-configured by `kolbo auth login`). Use them to generate and deliver real content — do NOT just describe what you would create.
 
+ > 🚫 **Don't dump generated URLs as bare text or markdown links in chat** — the UI already renders artifacts as a gallery tile + canvas. Refer to artifacts by description ("the rainy scene") and store URLs in `.kolbo/production.md`. Inline `![](url)` images ARE allowed and encouraged for catalog-style replies (per-item thumbs in numbered lists). Full rule + Do/Don't: see the "Generated URLs in chat" section below.
+
  ## Available MCP Tools
 
  ### Generation
@@ -46,8 +48,25 @@ You have direct access to the Kolbo AI creative platform via MCP tools (auto-con
 
  | Tool | Description |
  |------|-------------|
- | `upload_media` | Upload ANY local file to Kolbo CDN → returns a public URL. Works for images, videos, audio, HTML, documents — any file type. Use for: feeding media to `chat_send_message`, sharing files publicly, hosting HTML pages, or multi-tool workflows. |
- | `list_media` | Browse user's uploaded media with filtering by type and search. |
+ | `upload_media` | Upload ANY local file (or remote URL) to Kolbo CDN → returns a stable public URL. Use for feeding media to `chat_send_message`, hosting HTML, or any multi-tool workflow that re-uses the same file. |
+ | `list_media` | Browse the user's library — both uploaded files AND saved AI outputs. Filter by `project_id`, `folder_id`, `category` (ai / uploaded / edited / favorites / training-lab), `type`, `source_type`, `sort`; paginate; full-text `search`. Response items include `is_favorited`, `prompt`, `dimensions`, `duration`, `project_id`. |
+ | `get_media` | Fetch one item by id (full details + extended metadata). Use when the user references a specific past creation. |
+ | `get_media_stats` | Counts + storage usage: `{ total, images, videos, audio, total_size_bytes }`. Optional `project_id`. Use for "how many videos do I have?" / "what's my usage?" / sizing a bulk op. |
+ | `favorite_media` / `unfavorite_media` | Toggle favorite. Idempotent. Per-user (shared projects: your favorites ≠ teammates'). |
+ | `delete_media` | **Soft delete** → trash (30-day recovery). Owner only. This is the right call for "delete this". |
+ | `restore_media` | Restore from trash. Pair with `delete_media`. |
+ | `permanently_delete_media` | **HARD delete** — MongoDB + S3 + folders + source generation record. NOT REVERSIBLE. **Always confirm with the user before calling.** Never default here for "delete". |
+ | `move_media` | Move one item to a different project (caller must own the item + have access to the target project). |
+ | `bulk_delete_media` | Soft-delete up to 1000 ids. Items not owned by the user are silently skipped. |
+ | `bulk_restore_media` | Restore up to 1000 trashed ids. |
+ | `bulk_permanently_delete_media` | Hard-delete up to 1000 ids. **Always confirm with the user before calling.** |
+ | `bulk_move_media` | Move up to 1000 ids to another project. **Atomic** — if ANY id isn't owned by the caller, the whole op is rejected; do not retry partially. |
+ | `move_folder_contents` | Move every item in a folder to another project (owner-only on every item). |
+ | `list_media_folders` | List the user's folders (owned + shared). Folders span projects. |
+ | `create_media_folder` / `update_media_folder` / `delete_media_folder` | Folder lifecycle. Delete is owner-only and detaches items (items stay in the library); **confirm before delete**. |
+ | `add_media_to_folder` / `remove_media_from_folder` | Up to 500 ids per call. Idempotent on add. |
+ | `share_media_folder` | Share by email (resolved to user ids; emails not found come back in `not_found`). Owner only. Members can list/add/remove items but cannot delete or reshare the folder. |
+ | `unshare_media_folder` | Revoke one user's access. Takes `user_id` from the folder's `shared_with` array. |
 
  ### Visual DNA (Character/Style Consistency)
 
@@ -58,6 +77,8 @@ You have direct access to the Kolbo AI creative platform via MCP tools (auto-con
  | `get_visual_dna` | Fetch full profile details including system_prompt and reference images. |
  | `delete_visual_dna` | Delete a Visual DNA profile. |
 
+ **Visual DNA routing (server-side, automatic):** passing `visual_dna_ids` is enough — the server expands the DNA's reference images and auto-routes the selected text-to-image model to its image-editing variant (e.g. `nano-banana-2` → `nano-banana-2-image-editing`). You do NOT need to also pass `reference_images` when using DNA. If the chosen model has no edit variant at all, the server falls back to using the DNA's images as style references on the t2i model. Either way, DNA payloads are never silently dropped.
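+
+ A minimal sketch of what that routing means in practice (TypeScript-style pseudocode; the tool name comes from the Generation table above, but the exact payload and return shapes are assumptions):
+
+ ```ts
+ // Illustrative only: the payload shape is assumed, not the confirmed MCP schema.
+ declare function generate_image(args: {
+   prompt: string;
+   model?: string;
+   visual_dna_ids?: string[];
+   reference_images?: string[];
+ }): Promise<{ urls: string[] }>;
+
+ // Passing visual_dna_ids is enough: the server expands the DNA's reference
+ // images and swaps the t2i model for its image-editing variant by itself.
+ const result = await generate_image({
+   prompt: "Maya reading in a rainy cafe window, cinematic",
+   model: "nano-banana-2",        // server may route to nano-banana-2-image-editing
+   visual_dna_ids: ["vdna_8f2c"], // no reference_images needed alongside DNA
+ });
+ ```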
+
  ### Moodboards & Presets
 
  | Tool | Description |
@@ -88,7 +109,29 @@ You have direct access to the Kolbo AI creative platform via MCP tools (auto-con
  | `app_builder_list_generations` | List all generations for a session (needed for `edit_app`). |
  | `app_builder_delete_session` | Permanently delete a session and all resources. IRREVERSIBLE. |
 
- ## ⚠️ Generate vs Edit — Know the Difference
+ ### Artifact Publishing
+
+ | Tool | Description |
+ |------|-------------|
+ | `publish_html_artifact` | Publish a built HTML page (or SVG / Mermaid diagram) to `sites.kolbo.ai` and return a public shareable URL. Use when the user asks to share, publish, deploy, or "give me a link to" a built artifact. Pass `title` + `content` (the full HTML document). Server dedupes by content hash — re-publishing the same bytes returns the same URL. Page is served with strict CSP (`connect-src 'none'`, `form-action 'none'`) so it cannot exfiltrate data; CDN frameworks (Tailwind/Chart.js/Three.js/React) still load. Use this *instead of* dumping the HTML into chat or telling the user to open the local file. |
+
+ ## ⚠️ If the user names a tool, USE THAT TOOL (HARD RULE)
+
+ A user-named tool — in any language — overrides every other rule on this page. Same precedence as a user-named model: no routing, no substitution.
+
+ Recognized aliases:
+
+ | User said (any language) | Use exactly this MCP tool |
+ |---|---|
+ | "director", "creative director", "director tool", **"במאי"**, **"כלי במאי"**, "director-tool", "ad set", "campaign tool", "storyboard tool" | `generate_creative_director` |
+ | "image edit", "edit", "modify", "remove background", **"עריכת תמונה"** (only when paired with a per-image instruction, not a multi-output one) | `generate_image_edit` |
+ | "elements" / **"אלמנטים"** | `generate_elements` |
+ | "first/last frame" / **"פריימים"** | `generate_first_last_frame` |
+ | "lipsync" / **"ליפסינק"** | `generate_lipsync` |
+
+ **Mixed signals — named tool always wins.** "*Image edit with the director tool to make 4 angles*" → `generate_creative_director` (the verb says "edit" but the named tool is "director"). Same rule applies in Hebrew/Arabic phrases that contain both an edit-verb and a tool name.
+
+ ## ⚠️ Generate vs Edit — Know the Difference (only when the user did NOT name a tool)
 
  | User intent | Action | NOT this |
  |-------------|--------|----------|
@@ -97,8 +140,9 @@ You have direct access to the Kolbo AI creative platform via MCP tools (auto-con
  | "Create motion graphics" / "Animated text" / "Title sequence" | Load `remotion-best-practices` skill → Remotion | ❌ Do NOT call `generate_video` |
  | "Animate this image" / "Make this photo move" | `generate_video_from_image` (Kolbo MCP) | — |
  | "Restyle this video as anime" | `generate_video_from_video` (Kolbo MCP) | — |
-
- **`generate_video` creates NEW videos from text prompts. It cannot edit, cut, trim, merge, or modify existing video files.** For any operation on an existing video file, use FFmpeg via the `video-production` skill.
+ | "Modify THIS one image" — change background, remove object, recolor, add text | `generate_image_edit` (Kolbo MCP) | ❌ Do NOT use for multi-output |
+ | **"4 angles / poses / views of this character"** / "Show her in 8 different scenes / outfits / moods / settings" / "Variations of this character" | `generate_creative_director` with `visual_dna_ids` (Kolbo MCP) | ❌ Do NOT call `generate_image_edit` repeatedly. The director tool produces N coherent scenes from one brief and keeps the character consistent. |
+ | "4 variations of THIS exact image" (same prompt, different seeds, no new direction) | `generate_image` with `num_images=4` | ❌ Do NOT use `generate_image_edit` |
 
  ## Core Workflow
 
@@ -173,47 +217,110 @@ Creative generations bill against the user's Kolbo credit balance. **Billing uni
  - Images: `"1K"` ≈ 1024px, `"2K"` ≈ Full HD (1920×1080), `"3K"` ≈ QHD (2560×1440), `"4K"` ≈ UHD (3840×2160). Picker shows only tiers the model actually supports (per `supported_resolutions`).
  - Videos: `"720p"` / `"1080p"` / `"1440p"` / `"2160p"` = vertical pixels (720p = HD, 1080p = Full HD, 1440p = QHD, 2160p = 4K UHD). Some models use model-specific labels like `"512P"` / `"1024P"` (Hailuo).
 
- **Cost confirmation — know when to skip it:**
- - **User specified everything** (model, count, duration, e.g. "make 5 videos, seedance 2 fast, 15s, 16:9"): **ACT IMMEDIATELY** — that IS the confirmation. Do not re-explain costs or ask again.
- - **Single generation under 5 credits**: proceed without confirmation.
- - **Everything else**: calculate total cost, present a summary, and wait for the user to confirm before generating.
- - **Batch totalling 100+ credits**: run `check_credits` before starting to verify the balance is sufficient, and include the available balance in your cost summary.
-
- **When confirmation IS needed:**
- 1. Calculate per-item cost using the formulas above.
- 2. Multiply by the number of items.
- 3. Present a summary: "This will generate 8 videos × 5s each using [model] at X cr/s = **Y credits total**. Proceed?"
- 4. **Suggest cheaper alternatives** if available.
- 5. Only proceed after the user confirms.
+ **Cost confirmation — when:**
+ - **Skip** when the user already specified model + count + duration ("make 5 videos, seedance 2 fast, 15s" IS the confirmation), or when a single generation costs under 5 credits.
+ - **Required** otherwise: present a one-line summary ("8 videos × 5s × [model] @ X cr/s = **Y credits**. Proceed?"), suggest a cheaper alternative if one exists, and wait for confirmation before firing (worked example below).
+ - **Batch totalling 100+ credits**: run `check_credits` first and include the available balance in the summary.
 
  ### Rate Limiting & Batch Generation (CRITICAL)
 
- **Rate limits** (per user, enforced server-side):
- - **Image generation**: 30 requests per minute (higher because images are fast and cheap)
- - **All other generation types**: 10 requests per minute per type (e.g. 10 video + 10 image = fine, but 11 video in 1 minute = 429)
- - **300 requests per minute** global across all media endpoints
- - **Uploads** (`upload_media`): 300/min, no credit cost — much lighter than generation
- - The API **queues** requests internally — it never silently drops them. If you're within limits, every request will be processed.
+ **Rate limits** (per user, server-enforced; the API queues — never silently drops):
+ - `generate_image`: 30/min
+ - All other generation tools: 10/min per type
+ - 300/min global across all media endpoints
+ - `upload_media`: 300/min, no credit cost
 
- **⚠️ NEVER duplicate a generation you already fired.**
- Before calling any generation tool, check your conversation history. If you already called that tool with the same or similar prompt in this session:
- - Do NOT call it again — even if it was aborted or interrupted (it is still running server-side and will complete)
- - Only retry if the user explicitly says "retry", "redo", or "try again"
- - Each duplicate wastes real credits from the user's balance
- - If unsure whether a generation went through, use `get_generation_status` to check — the API returns 202 immediately and processes in the background, so aborted tool calls still generate
+ **⚠️ NEVER re-fire a generation you already called.** Aborted, interrupted, or timed-out tool calls still process server-side and will complete. Before retrying, run `get_generation_status` — only retry if it returns `failed` (not `pending` or `completed`). Only re-fire on explicit user request ("retry", "redo", "try again"). Every duplicate burns real credits.
 
  **Batch generation workflow (≤10 items):**
- 1. Confirm cost ONCE — or skip if the user already specified model, count, and duration (e.g. "make 5 videos, seedance 2 fast, 15s" IS the confirmation — act immediately)
- 2. **Output ALL generation tool calls in a single response** up to 10 per tool type. The system runs them concurrently, so 5 videos render in parallel and finish in the time of the slowest one, not 5× the time.
- 3. Each call blocks until its generation is complete (images: seconds, video: 1-5 minutes). This is normal — don't apologize for the wait.
- 4. Track what you've generated — never re-fire a completed or in-progress generation.
- 5. After all complete, present all results together.
- 6. If any fail with 429: wait 60 seconds and retry only the failed ones (max 2 retries).
-
- **Multi-image decision:**
- - User gives a **general brief** ("make 4 product shots", "create a storyboard", "show the character in 4 different settings") → use `generate_creative_director` with `scene_count`. Pass `visual_dna_ids` to keep a character consistent across all scenes.
- - User gives **explicit separate prompts** ("Image 1: X, Image 2: Y, Image 3: Z") → fire all as **parallel `generate_image` calls** in one response
- - Never call `generate_image` sequentially in a loop — either use `generate_creative_director` or fire all calls in one parallel batch
+ 1. Confirm cost ONCE — skip if the user already specified model/count/duration ("make 5 videos, seedance 2 fast, 15s" IS the confirmation).
+ 2. **Output ALL tool calls in one response** (up to 10 per type); they run concurrently, so 5 videos finish in the time of the slowest one (see the sketch below).
+ 3. Each call blocks until done (images: seconds; video: 1–5 min). Don't apologize for the wait.
+ 4. After all complete, present results together.
+ 5. On 429: wait 60s, retry only the failed items (max 2 retries).
+
+ ### ⚠️ Multi-output? Default to `generate_creative_director` (CRITICAL)
+
+ `generate_creative_director` is not a niche tool — **it IS an agent**. It plans the prompt for each scene internally, locks style + character consistency across the whole set, and runs every scene in parallel. For anything that produces **2 or more related outputs**, it is almost always the right call.
+
+ **Default rule:** if the user asks for multiple related outputs (variations, scenes, angles, poses, settings, moods, ad shots, product shots, storyboards, character sheets, key frames for a video) → **use `generate_creative_director`**. Reach for parallel `generate_image` calls only when the rule below explicitly says to.
+
+ **Decision matrix:**
+
+ | User intent | Tool | Why |
+ |---|---|---|
+ | "make 4 / 6 / 8 [shots, scenes, variations, angles, poses, outfits, moods, settings, frames]" | **`generate_creative_director`** with `scene_count` | One brief → N distinct, coherent outputs. Director handles per-scene prompting. |
+ | "show the character in N different ___" | **`generate_creative_director`** with `scene_count` + `visual_dna_ids` | Character consistency is its specialty. |
+ | "create a storyboard / ad campaign / product set" | **`generate_creative_director`** | The model itself was built for this case. |
+ | "key frames for a video / shots for the ad" | **`generate_creative_director`** (then `generate_video_from_image` per frame) | See "Character-driven video — frames first" below. |
+ | "give me 4 variations of THIS exact image" (same prompt, random seed only) | `generate_image` with `num_images=4` | No new direction needed — seeds, not scenes. |
+ | User dictates **explicit separate prompts** word-for-word: "Image 1: X. Image 2: Y. Image 3: Z." | Parallel `generate_image` calls (one per prompt) | The user already wrote the per-scene prompts; the director's planning step would be wasted. |
+ | Single image, single prompt | `generate_image` | Nothing to coordinate. |
+
+ **Tie-breaker:** if you're about to fire ≥2 `generate_image` calls and the user did NOT dictate per-image prompts word-for-word, stop and use `generate_creative_director` instead. The director is cheaper in tokens (one call, one plan) and the results stay coherent.
+
+ **Never call `generate_image` sequentially in a loop.** Either use `generate_creative_director` (preferred for any related set) or fire all calls in a single parallel batch.
+
+ ### 🛑 Runaway-loop guard — ONE generation per requested item (CRITICAL)
+
+ When the user asks for **one specific change** to a specific media item ("change the 2nd video to anime", "make this image brighter", "regenerate scene 3"), the answer is **a single tool call** with a single output. After that tool returns URLs, **stop**. Surface the result to the user and wait for their next message.
+
+ You are NOT allowed to:
+
+ - Fire the same tool 3+ times in a single turn unless the user explicitly asked for "N variations" / "make X versions".
+ - Re-fire a tool because you think the previous result might not be exactly what the user wanted — let the user judge; if they don't like it, they'll say so.
+ - Auto-retry on success ("the first one wasn't perfect, let me try again"). If the tool returned URLs successfully, the work is done.
+ - Fire 5+ parallel `generate_video*` calls speculatively. Video is expensive — every extra call burns credits the user didn't ask to spend.
+
+ **Rule of thumb**: if you've fired the same tool 3+ times in one turn and the user asked for ONE thing, stop. You're in a loop. Surface what you have, ask the user which one they want to keep, and wait.
+
+ Only re-fire when (sketched below):
+ 1. The user explicitly asked for variations (with a count).
+ 2. The previous call returned `failure.retryable === true` AND it was a transient error — then ONE retry, max.
+ 3. The previous call returned `completed` but with `urls.length === 0` — then ONE retry on the same payload.
+
+ ### ⚠️ Editing an existing video → ONE call, not frames-first (CRITICAL)
+
+ If the user references an **existing video** and asks to modify it ("change this video to anime style", "make it cinematic", "video-to-video edit", "the 2nd video looks too cheesy, fix it"), that is a **single `generate_video_from_video` call** — pass the source video URL + the edit prompt and you're done. **One call. One output. Done.**
+
+ **Use a TRUE video-to-video model.** Image-to-video models (anything whose identifier ends in `image-to-video` / `i2v` / contains `image2video`) will NOT work — they require an image input, not a video, and the server rejects the call with `WRONG_MODEL_TYPE`. Examples of valid v2v models you can pick:
+
+ - `wan/2-7-videoedit` — Wan 2.7 video edit (re-style / re-prompt)
+ - `happyhorse/video-edit` — HappyHorse video edit
+ - `kling-video/o3-video-to-video` — Kling O3 V2V
+ - Any model whose DB `type` array contains `video_to_video` (use `list_models` with `type: "video_to_video"` to see the current set)
+
+ If you're unsure which model to use, call `list_models({ type: "video_to_video" })` first — do NOT guess based on the name.
+
+ **Do NOT:**
+ - Decompose into frames (frames-first applies only when *creating new character video from scratch*, NOT for editing an existing video).
+ - Re-fire the same tool repeatedly if the first call returned URLs — even if you're not 100% sure the result matches; surface the result to the user and let them iterate.
+ - Generate multiple variations unless the user explicitly asks for "options" / "variations" / "X different versions".
+ - Use an image-to-video model just because it shows up in `list_models` — check that its `type` array includes `video_to_video`.
+
+ **If the first call legitimately fails** (content policy, transient error, model refused), surface the failure to the user via the `failure` envelope — do not blindly retry. See "Reading failure envelopes" below.
+
+ This rule overrides the "frames first" guidance below for any video-EDITING request. The "frames first" rule is only for **generating new character-driven video from scratch**, not for re-styling / modifying an existing clip.
+
+ ### ⚠️ Character-driven video — frames first, then animate (CRITICAL)
+
+ For any ad / story / scene-based video **created from scratch** featuring a Visual DNA character (NOT video-to-video edits — see the rule above for those), do **NOT** jump straight from DNA to `generate_elements` / `generate_video` per shot. The right flow is:
+
+ 1. **Generate the shot frames first** as still images — one image per shot — via `generate_creative_director` with `scene_count` + `visual_dna_ids`. (Use the director, NOT parallel `generate_image` calls — multiple shots = a coherent set, which is exactly what the director is for.) The DNA is at its strongest in image generation: the character lands consistently, you can preview cheaply, and the user can approve/revise the frames before any expensive video runs.
+ 2. **Confirm the frames with the user** if there are more than ~3 shots, or if the user hasn't explicitly said "go straight to video."
+ 3. **Animate each frame to video** with `generate_video_from_image`, passing each approved frame as `image_url`. This is cheaper and more predictable than direct DNA→video, and identity is locked because the model is animating an existing pixel-perfect frame instead of re-inferring the character.
+
+ **Why this matters:**
+ - `generate_elements` + Seedance / Kling for direct character video is more expensive per shot and the character can drift between shots.
+ - Image-to-video (`generate_video_from_image`) anchors the first frame to your approved still, so the character/setting/composition stays locked.
+ - Frames are debuggable: if shot 4's pose is wrong you regenerate one image, not a 10-second video.
+
+ **When to skip frames-first and go direct to `generate_elements`:**
+ - User explicitly asks "go straight to video / skip the storyboard / use seedance for everything."
+ - Single-shot quick experiments where no character consistency is needed.
+ - The user supplies their own approved frames and just wants animation.
+
+ Default to frames-first unless one of those applies. After all frames are approved, fire the image-to-video calls in parallel (subject to the bulk-generation ceilings below).
 
  **⚠️ Parameter names — do NOT confuse these:**
  - `generate_image` → `num_images` (1–4): all images use the **same prompt**, just different random seeds — use this for "give me 4 variations of this image"
@@ -221,117 +328,492 @@ Before calling any generation tool, check your conversation history. If you alre
 
  **After `generate_creative_director` completes — share results as individual URLs, one per scene. Do NOT create an HTML grid artifact or any combined layout. Just list each scene's title and its image URL on separate lines.**
 
- **Don't narrate, just generate.** When the user says "make 5 videos", output all 5 tool calls in one response. Don't explain your plan, don't calculate step-by-step, don't say "Generating Video 1 of 5..." — just call the tools.
+ ### ⚠️ Generated URLs in chat (CRITICAL)
+
+ The chat renders markdown natively: `![alt](url)` becomes an inline image, `[label](url.png)` becomes a labeled link with an auto-preview, and bare URLs become clickable links. Use whichever fits the reply:
+
+ - **Catalog-style replies** (numbered lists of characters / scenes / products where the user wants to *see* each item next to its description) — embed `![alt](url)` or `[label](url)` so each item shows its thumb inline. The agent decides; the renderer handles the rest.
+ - **Conversational replies** ("4 shots ready") — keep the prose short; the canvas chip already shows the gallery, so you don't need to re-list URLs.
+
+ Avoid bare URL dumps (`1. https://… 2. https://…`) and HTML `<table>` grids — they're ugly, and the canvas already provides a gallery. Anything you want the user to actually see inline should be wrapped in markdown image / link syntax.
+
+ **Always** record every URL in `.kolbo/production.md` — that's the durable record, independent of what you show in chat.
+
+ ### ⚠️ Bulk Generation (>10 items)
+
+ For large briefs ("make 50 UGC ads") the rules above still apply, plus:
+
+ **Real-world batch ceilings (cheat sheet)** — these are tighter than the published rate limits; exceeding them causes 429s that can throttle the whole session:
+
+ | Tool | Max safe in-flight | Notes |
+ |---|---|---|
+ | `generate_image` | 8–10 | Fast (~10–30s each) |
+ | `generate_image_edit` | 5–8 | Multi-angle models slower |
+ | `generate_creative_director` | 1 call → up to 8 scenes | Runs scenes in parallel internally — never batch externally |
+ | `generate_video` / `generate_video_from_image` / `generate_first_last_frame` / `generate_lipsync` | 3–5 | 1–5 min each |
+ | `generate_video_from_video` | 3 | Heaviest |
+ | `generate_elements` | 3–5 | Confirmed real-world ceiling for 50-item bulk runs |
+ | `generate_music` / `generate_speech` / `generate_sound` | 5–8 | |
+ | `upload_media` | 10+ | No practical ceiling |
+
+ For 50 outputs: fire one batch → wait for all to finish → fire the next batch. Never fire all 50 in one response.
+
+ **`upload_media` external URLs first.** `files` on `generate_elements` and source images on edit/from-image tools only accept Kolbo-hosted URLs reliably; external URLs (e.g. unsplash) cause `400 Bad Request`. Pattern: external URL/local file → `upload_media` → use the returned Kolbo CDN URL in `reference_images` / `source_images` / `image_url`. Image upload constraints: JPEG/PNG/WebP only, 300×300 to 2048×2048 — pre-validate before upload.
+
+ **On 429:** finish the in-flight batch, wait 60s, retry only the failed items. Second 429 → wait 120s, retry once. Third → stop the whole job, report completed/failed counts to the user.
+
+ **Persist every `generation_id`** in `.kolbo/production.md` (even for failures) — required for `get_generation_status` recovery and cross-session dedupe.
+
+ Bulk production-log entry shape:
+ ```md
+ 12. ✅ Asian F 24, bedroom, hype POV
+     - generation_id: gen_8a2c…
+     - url: https://…
+     - model: seedance-2 · 720p · 10s · sound-on
+     - generated: 2026-05-14T07:42Z
+ 13. ❌ Latino M 31, gym
+     - generation_id: gen_ff19…
+     - error: 429 Too many generation requests
+     - retry_after: 2026-05-14T07:43Z
+ ```
+
+ **Don't narrate** — output the tool calls, skip "Generating Video 1 of 5…" preambles.
+
+ **Handling interruptions:** if the user aborts mid-batch then says "do the rest," check what you already fired, skip those, fire only the remainder. Never restart from the beginning.
+
+ ### Reading failure envelopes from `get_generation_status`
+
+ When a generation fails, `get_generation_status` now returns a structured `failure` field alongside `error`:
+
+ ```json
+ {
+   "state": "failed",
+   "error": "The input or output was flagged as sensitive…",
+   "failure": {
+     "message": "The input or output was flagged as sensitive…",
+     "category": "content_policy",   // content_policy | network | auth | model_failure | ...
+     "code": "CONTENT_FLAGGED_SENSITIVE",
+     "retryable": false,             // true = transient, safe to retry; false = same input will fail again
+     "severity": "error",
+     "provider": "kie-nano-banana"
+   }
+ }
+ ```
+
+ Branch on `failure.category` / `failure.retryable`:
+
+ - `category === "content_policy"` (or `code === "CONTENT_FLAGGED_SENSITIVE"`) → **do not retry the same prompt**. Tell the user the model refused, suggest a less explicit phrasing or a Visual DNA fallback. Log to `.kolbo/production.md` Failures section with the exact reason.
+ - `category === "auth"` or `code === "[KOLBO_AUTH_EXPIRED]"` → surface the reconnect flow (`open_kolbo_app`), don't auto-retry.
+ - `retryable === true` (transient: network, rate limit, provider 5xx) → retry once with the same payload after a short pause. If it fails again, surface to the user.
+ - `retryable === false` and unknown category → surface the raw `message` to the user, don't retry.
 
- **Handling interruptions:** If the user aborts or interrupts mid-batch (e.g. cancels Video 1, then says "do the rest" or "continue with 2-5"), pick up where you left off. Check which generations you already fired, skip those, and fire only the remaining ones. Never restart a batch from the beginning. Remember: aborted tool calls still process server-side — don't re-fire them.
+ If `failure` is absent (older kolbo-api), fall back to the heuristic in the next section.
+
+ ### ⚠️ Detecting failed generations (CRITICAL)
+
+ A generation can fail in three distinct ways (post-call check sketched below). Treat ALL three as failure — don't pretend it worked:
+
+ 1. **Tool returns `error`** — explicit failure, you see the error text. Easy case. Surface it to the user, suggest a retry, and log the `generation_id` if present so you can call `get_generation_status` later.
+ 2. **Tool returns `completed` but with NO URL in `urls`** — the most common silent failure. The server marked the generation done but produced nothing (NSFW filter, model OOM, upstream provider 5xx, transient bug). Treat as failure. Do NOT log this to `.kolbo/production.md`. Do NOT claim the generation worked. Tell the user "the generation completed without an output — retrying" and re-fire ONCE.
+ 3. **Tool hangs / never returns** — the MCP poll timed out (the JS-side timeout in the MCP fired before the server-side generation finished, OR the server died mid-job). Two signals: (a) the tool result includes a timeout error mentioning `generation_id`, or (b) the user comes back later and says "you said you were generating but I never got it."
+
+ - On case (a): IMMEDIATELY call `get_generation_status(generation_id)`. The server might be done — recover the URL instead of re-firing. Only retry if `status === "failed"`.
+ - On case (b): if you have a recorded `generation_id` in `.kolbo/production.md` for that step, call `get_generation_status` first. If you don't (because you never logged it — see Bulk Generation rules), the work is lost.
+
+ **Always-true rules:**
+
+ - **Don't celebrate a generation before reading the tool result.** "✅ done!" before checking that `urls` is non-empty is wrong. Even when the tool returns `completed`, verify there's at least one URL before reporting success.
+ - **Don't auto-retry without surfacing the failure.** If a single generation fails, tell the user before silently retrying. If a BATCH partially fails, list the failed items with their reasons and the count of successful ones. Never paper over partials with "✅ all done!"
+ - **Don't double-log failed items.** Failures DO NOT go into `.kolbo/production.md`'s artifact list. Only successful generations land there. Failures get a one-line note in chat + (optionally) a separate `## Failures` section in the production log with the `generation_id` for retry traceability.
+ - **Surface the user's count.** If the user asked for 8 and you got 6 successes + 2 failures, the reply MUST say "6 of 8 ready" — not "videos ready." Misreporting partial success is the most common UX bug here.
 
  ---
 
- ## Transcription & Audio/Video Analysis
+ ## ⚠️ Production Log — `.kolbo/production.md` (CRITICAL)
 
- Use `transcribe_audio` ONLY when the user explicitly asks for:
- - A text transcript
- - Subtitles (SRT format)
- - Word-by-word timed subtitles (for karaoke, motion graphics, Remotion captions, video editing)
- - Summary of what was **spoken/said** in the video
- - Dialogue extraction from video
+ Every URL, id, and brief produced by a Kolbo MCP tool MUST be recorded in `.kolbo/production.md` in the user's workspace. This file — not chat history — is your source of truth for prior artifacts: URLs scattered across `tool_result` blobs are unreliable to re-scan and disappear entirely on context compaction.
 
- **Do NOT use `transcribe_audio` to "analyze" a video visually.** For visual analysis **of videos or audio**, use `upload_media` → `chat_send_message` with `media_urls`. For **images**, use the `Read` tool directly — you have built-in vision.
+ ### When to READ it
 
- ### Workflow
- 1. Call `transcribe_audio` with the `source` (URL or absolute local file path)
- 2. The tool returns:
-    - `text` — full transcript as plain text
-    - `srt_url` — download URL for grouped SRT subtitles (configurable words-per-line)
-    - `word_by_word_srt_url` — download URL for **word-by-word SRT** (one word per subtitle entry with precise timestamps from ElevenLabs Scribe v2)
-    - `txt_url` — download URL for plain text file
-    - `duration` — audio duration in seconds
- 3. Analyze the transcript text as needed (summarize, translate, extract topics, answer questions about content)
-
- ### Supported Formats
- - **Audio**: mp3, wav, m4a, flac, aac
- - **Video** (extracts audio track): mp4, mov, webm, mkv, avi, m4v
-
- ### Word-by-Word Transcription
- The `word_by_word_srt_url` contains an SRT file where each subtitle entry is a **single word** with precise start/end timestamps (powered by ElevenLabs Scribe v2). This is ideal for:
- - **Karaoke-style captions** — highlight one word at a time
- - **Remotion/motion graphics** — animate text word-by-word synced to audio
- - **Video editing** — precise cut points aligned to speech
- - **Accessibility** — word-level navigation for hearing-impaired users
-
- The regular `srt_url` groups words into readable subtitle lines (default 12 words per line, up to 2 lines per subtitle).
-
- ### Use Cases & Examples
- - "Transcribe this podcast" → `transcribe_audio` with the audio URL
- - "What's being said in this video?" → `transcribe_audio` → analyze the returned text
- - "Generate subtitles for my video" → `transcribe_audio` → share the `srt_url`
- - "I need word-by-word timing for this audio" → `transcribe_audio` → share `word_by_word_srt_url`
- - "Summarize this meeting recording" → `transcribe_audio` → summarize the text
- - "Extract key points from this lecture" → `transcribe_audio` → analyze and extract
-
- ### Long Content
- Transcription supports files up to 30 minutes. For longer content, split the file first or provide segments.
-
- ### Visual Video/Audio/Image Analysis
-
- **The agent has built-in vision — ALWAYS prefer your own model for images:**
-
- | Media type | How to analyze |
- |------------|----------------|
- | **Image** (jpg, png, webp, etc.) | **Read it directly with the `Read` tool** — you see images natively. No upload, no API call, no rate-limit risk. This is ALWAYS the first choice for images. |
- | **Video / Audio** | `upload_media` → `chat_send_message` with `media_urls` (Gemini handles video/audio) |
- | **Transcription** | `transcribe_audio` — ONLY when user explicitly says "transcribe", "subtitles", "SRT", or "what's being said" |
-
- **⚠️ Image analysis priority: YOUR OWN VISION FIRST.**
- You are a multimodal model — you can see and analyze images directly via the `Read` tool. This is faster, free, and avoids API rate limits. **Never upload images to Kolbo or use `chat_send_message` for image analysis** unless the user explicitly asks to use a specific Kolbo chat model. Even with 10+ images, read them all yourself — you can handle up to 10 images in a single analysis pass.
-
- **NEVER use ffmpeg or frame extraction for analysis. NEVER ask the user — just pick the right path above.**
-
- **Video/Audio analysis workflow — Step 1 is NOT optional:**
- 1. `upload_media({ source: "/absolute/local/path/to/file.mp4" })` → returns `{ url, thumbnail_url, ... }`
-    - **Use `url`** — the actual CDN URL. Ignore `thumbnail_url` (preview JPG only).
- 2. `chat_send_message({ message: "<your question>", media_urls: [result.url] })`
-    - **`media_urls` is mandatory** — the model only sees the video if you pass the CDN URL here.
-    - Always an **array**: `media_urls: ["https://cdn.kolbo.ai/..."]`
-    - **Omit `model`** — Smart Select auto-routes to Gemini when media is detected
-    - **Sessions do NOT remember media between messages.** On retry: reuse the same CDN `url` (no re-upload) but always pass `media_urls` again.
-    - **Batch / many videos**: use `list_models` to find the cheapest Gemini model and pass it explicitly for cheaper bulk runs
+ Read `.kolbo/production.md` **before** acting on any of these signals:
+ - "edit", "animate", "combine", "redo", "polish", "fix", "regenerate"
+ - "the same character / scene / image / video / sound", "that X", "scene N", "the rainy one", etc.
+ - `@name` references for Visual DNA
+ - Any continuation of prior media work ("now make scene 3")
 
- ### ⚠️ Batching Media in Chat Messages (CRITICAL)
+ If the file is missing and the user is referencing prior media, ask the user — do not guess from chat.
+
+ ### When to WRITE to it
+
+ **Immediately after every successful generation tool call**, before your next tool call or your final reply. The runtime will inject a reminder after generation tool results — treat that as a hard rule, not a suggestion.
+
+ Tools that REQUIRE logging:
+ - `generate_image`, `generate_image_edit`, `edit_image`
+ - `generate_video`, `generate_video_from_image`, `generate_video_from_video`, `edit_video`
+ - `generate_elements`, `generate_first_last_frame`, `generate_lipsync`
+ - `generate_music`, `generate_sound`, `generate_speech`
+ - `generate_3d`, `generate_creative_director`
+ - `create_visual_dna`, `upload_media`
+
+ Tools that do NOT log: `list_*`, `get_*`, `check_credits`, `chat_*`, `transcribe_audio` (read-only / discovery).
+
+ ### File creation — pick the right tool to avoid the "must Read first" error
+
+ `Edit` refuses to overwrite a file unless you've `Read` it first in the same session. Pick by file state:
+
+ | State | Tool |
+ |---|---|
+ | File **does not exist** (typical first turn) | `Write` with the full stub below |
+ | File **exists** | `Read` first, then `Edit` |
+ | Not sure | `Read` first; on ENOENT, fall back to `Write` |
+
+ Stub for first creation:
+
+ ```md
+ <!-- .kolbo/production.md — agent-managed media artifact registry.
+      User may hand-edit; agent must Read-before-Edit to reconcile. -->
+
+ # Production Log
+
+ ## 🎯 Now
+
+ **Brief:** <paraphrase of user's overall goal in 1-3 sentences>
+ **Now working on:** <the immediate next step>
+ **Last updated:** <ISO date>
+
+ ---
+
+ ## Production: <name from user's request, slugified human label>
+
+ ### Cast
+ ### Visual DNA
+ ### Scenes
+ ### Audio
+ ### Final
+ ```
+
+ Subsections (`### Cast` etc.) are **suggested defaults**, not required. Adapt: a logo set has `### Logos`, an album has `### Tracks`, a 3D render has `### Models`. Leave empty subsections out of the file when you create entries.
+
+ ### Entry shape
+
+ One bullet per artifact. Write the label **the way the user would reference it next time** ("the rainy one"), not the model's raw output.
+
+ ```md
+ ### Cast
+ - **Maya** — female, 30, urban photographer, leather jacket
+   - portrait: https://...characters/maya.png (nano-banana-2, 2026-05-13)
+   - visual DNA: vdna_8f2c (@maya)
+
+ ### Scenes
+ 1. **Coffee shop morning** — Maya at counter, soft light, wide shot
+    - still: https://...scenes/01-coffee.png (flux-2-pro, 2026-05-13)
+    - video: (pending)
+ 2. **Rainy street walk** — neon reflections, slow dolly
+    - still: https://...scenes/02-rain.png (flux-2-pro, 2026-05-13)
+    - video: https://...videos/02-rain.mp4 (kling-2, 2026-05-13)
+ ```
+
+ ### Header rewrite rule (Manus pattern — IMPORTANT)
+
+ The `## 🎯 Now` block at the top of the file is **rewritten every turn** to keep the brief + current step near the model's recency window. Body sections (everything below the first `---`) are **append-only**.
+
+ When a user request supersedes a previous artifact (e.g., "redo scene 2 with more rain"), do not delete the old entry. Mark it `(superseded YYYY-MM-DD)` and place the new entry beneath:
+
+ ```md
+ 2. **Rainy street walk** — neon reflections, slow dolly
+    - still: https://...scenes/02-rain.png (superseded 2026-05-13)
+    - still: https://...scenes/02-rain-v2.png (flux-2-pro, 2026-05-13)
+    - video: https://...videos/02-rain-v2.mp4 (kling-2, 2026-05-13)
+ ```
+
+ ### Rules
+
+ 1. **First touch `Write`, subsequent touches `Read` → `Edit`** (see "File creation" above). If `Edit` fails on exact-match, `Read` again — the user may have hand-edited.
+ 2. **Plain English labels** — write what the user would call it.
+ 3. **Append-only body.** Only the `## 🎯 Now` header is rewritten. Never delete artifact entries; mark them `(superseded)` instead.
+ 4. **Do not log failures.** Only successful generations.
+ 5. **Resolve user references via the log, not chat history.** If the user says "scene 3," use the URL the log says is scene 3, even if a later tool_result mentioned a different URL.
+ 6. **One file per workspace.** Multiple concurrent productions go under separate `## Production: <name>` headings inside the same file.
+
+ ### Production Log vs TodoWrite
+
+ Use both — different jobs:
+
+ | | `.kolbo/production.md` | `TodoWrite` |
+ |---|---|---|
+ | Purpose | Durable artifact registry | Ephemeral step plan |
+ | Lifetime | Persists across sessions / compaction | Per turn / per request |
+ | Content | URLs, ids, briefs | "Do X, then Y, then Z" |
+ | Example | `still: https://...01-coffee.png` | `Generate visual DNA for Maya` |
+
+ ---
+
+ ## Video / Audio Analysis & Transcription
+
+ You have three routes. The right one depends on the file profile — pick before calling any tool.
+
+ ### Decision tree
+
+ ```
+ Image (jpg/png/webp)? → Read directly (native vision, up to 10 per pass)
+ File >100MB OR >15 min OR dialogue-dense? → HYBRID (transcribe + ffmpeg frames + Read + your synthesis)
+ User wants the transcript/SRT as deliverable? → transcribe_audio, return the URLs
+ Precise answer about one specific frame? → ffmpeg that frame → Read
+ Otherwise (short/medium video, mixed content) → upload_media → chat_send_message (Gemini native)
+ ```
+
+ ### Why `upload_media` → chat is **not** always the default
+
+ Gemini-via-chat processes frames + motion + audio in one pass and is the simplest route when it works. But it has three known failure surfaces — recognize them and pivot to the hybrid path:
+
+ 1. **>100MB upload cap.** Hard limit; the upload won't succeed. No option but to split with ffmpeg or go hybrid.
+ 2. **Long-form decay** (rough threshold: 15–20 min). Even when it fits, attention degrades — shallow or hallucinated answers on the back half of the file.
+ 3. **Transcription-dense laziness.** Lectures, interviews, podcasts, anything where speech is the substance: chat models summarize aggressively, paraphrase quotes wrong, or silently skip stretches. Always transcribe these first to get the actual words, then add visuals only if they matter.
+
+ ### The hybrid path (workaround for all three failures)
+
+ ```
+ 1. transcribe_audio({ source }) → text, srt_url, word_by_word_srt_url, duration
+ 2. Read the transcript text from the tool output directly
+ 3. Pick 3–8 timestamps from the SRT where visuals actually matter
+ 4. ffmpeg -ss <ts> -i <file> -frames:v 1 <frame.jpg> (one extract per timestamp)
+ 5. Read each frame with native vision (up to ~10 frames per analysis pass)
+ 6. Synthesize from transcript + frames + the user's question
+ ```
+
+ This is usually **cheaper** than chat for long files — transcription is per-minute, ffmpeg + Read are free — and produces stronger answers on dialogue-heavy material because you have the complete text, not a model's summary of it.
+
+ For media >30 min (past the transcription cap), split with ffmpeg into ~25-min chunks, transcribe each, concatenate.
+
+ ### Transcribe-as-deliverable vs transcribe-as-input
+
+ | Request pattern | Action |
+ |---|---|
+ | "Transcribe this" / "give me an SRT" / "I need word-by-word timing" / "make subtitles" | Run `transcribe_audio`, return the URL(s). The transcript IS the deliverable. |
+ | "What did they say about X?" / "Summarize this meeting" / "Find the part where they mention Y" | Run `transcribe_audio` to *get* the text → **you** read/summarize/search. Transcript is a means, not the answer. |
+
+ ### `transcribe_audio` — tool details
+
+ - `source`: URL or absolute local path.
+ - **Audio**: mp3, wav, m4a, flac, aac. **Video** (audio track extracted): mp4, mov, webm, mkv, avi, m4v.
+ - **30-minute hard cap.** Longer → split with ffmpeg first.
+ - Returns:
+   - `text` — full transcript, plain.
+   - `srt_url` — grouped SRT (~12 words per line, up to 2 lines per subtitle). Use this for normal subtitle delivery.
+   - `word_by_word_srt_url` — one word per cue with millisecond-precise start/end (ElevenLabs Scribe v2). Use **only** when downstream is animation (Remotion captions, after-effects karaoke, precise speech-aligned cuts). Noise for normal subtitle workflows.
+   - `txt_url` — plain text file.
+   - `duration` — seconds.
+ - Cost: per-minute (`model.credit × duration_minutes`). Run `check_credits` before transcribing very long files.
+ - Read-only / discovery — does NOT trigger the `.kolbo/production.md` log nudge. If the user wants the transcript saved as a durable artifact, `Write` it to a workspace file, not the production log.
+
+ ### `upload_media` → `chat_send_message` — tool details
+
+ - `upload_media({ source: "/absolute/local/path/file.mp4" })` → returns `{ url, thumbnail_url, ... }`. **Use `url`** (the CDN URL); ignore `thumbnail_url` (preview JPG only).
+ - `chat_send_message({ message, media_urls: [url] })`:
+   - `media_urls` is **mandatory** — the model only sees the file if you pass the CDN URL here. Always an array.
+   - **Omit `model`** — Smart Select auto-routes to Gemini when media is detected.
+   - Sessions do NOT remember media between messages. On retry: reuse the same CDN URL (no re-upload), but always pass `media_urls` again.
+ - For cost-sensitive batches of many short videos: `list_models` for the cheapest Gemini, pass it explicitly.
+
+ ### Image analysis — never via chat
+
+ You have native vision. **Always `Read` images directly** (you handle up to 10 per pass). Do not `upload_media` + chat for images unless the user explicitly names a specific Kolbo chat model. Don't extract frames from images either — they're already viewable.
+
+ **NEVER ask the user which path to use — diagnose from the file profile and pick.**
+
+ ### Analyzing the source before a chained generation — when it's worth it
+
+ Before feeding a media asset into another generation tool
+ (`generate_image_edit`, `edit_image`, `generate_video_from_image`,
+ `generate_first_last_frame`, `generate_video_from_video`, `edit_video`,
+ `generate_elements`, `generate_lipsync`), think about whether you actually
+ *know* what's in the source. If you don't, analyze it first so the next
+ prompt can reference concrete details instead of generic adjectives.
+
+ **Analyze first when:**
+
+ - The source is **old** — more than a few turns back, or pulled via
+   `list_media` / `get_media` from earlier in the project. Context has
+   drifted; you likely don't remember the specifics.
+ - The source was **user-provided without a description** — they pasted a
+   URL or uploaded a file but didn't say what it shows.
+ - The previous prompt was **vague** ("make something pretty", "a cool
+   shot") — the output details matter and you don't know them.
+ - The chain step needs to **preserve specific details** the original
+   prompt didn't pin down (exact pose, color of a prop, lighting direction,
+   audio room tone, etc.).
+ - Source is a **video or audio** going into elements / video-from-video /
+   lipsync — motion direction, pacing, voice characteristics, and ambient
+   bed drive the next prompt and can't be guessed from a URL.
+
+ **Skip analysis when:**
+
+ - You **just generated** the asset in the same conversation with a precise
+   prompt — that prompt *is* the spec. Re-analyzing wastes credits.
+ - The edit is **mechanical** — "remove background", "brighten 10%",
+   "loop to 5 seconds", "crop to 1:1". The source content doesn't matter.
+ - The user already **described what's in it** in this turn.
+
+ Default to skipping unless one of the "analyze first" cases applies — an
+ analysis-per-step habit on long chains burns credits and latency without
+ adding signal.
+
+ **How to analyze (pick by media type):**
+
+ | Source media | How |
+ |---|---|
+ | Image (URL or local) | Your native vision — view it directly. No `chat_send_message` round-trip needed. |
+ | Video / Audio | `chat_send_message({ message: "Describe...", media_urls: [url] })`. Batch up to 10 URLs in **one** call (see batching rule below). Omit `model` so Smart Select routes to Gemini vision. |
+
+ **What the analysis should extract** (use whatever is relevant for the next
+ step's prompt):
+
+ - **Subject** — pose, expression, framing (head-and-shoulders / full body / wide).
+ - **Wardrobe & props** — exact colors, materials, distinguishing items.
+ - **Scene & environment** — location, time of day, weather, background depth.
+ - **Lighting & color palette** — dominant temperature, key/fill direction,
+   contrast, color grade.
+ - **Camera** — angle, focal length feel (wide / portrait), depth-of-field.
+ - **Motion** (videos only) — direction, speed, camera move (push-in,
+   pan, locked), what changes between first and last frame.
+ - **Audio** (videos/audio only) — voice characteristics, ambient bed,
+   speech pace, music tempo/mood.
+ - **Anything that already looks wrong** — artifacts, blurred faces, wrong
+   fingers, blown highlights, audio glitches — note these so the next prompt
+   either fixes them (edit) or doesn't preserve them (elements/video).
+
+ **Then write the next prompt with concrete references**, not generic
+ adjectives. Example for an image-to-video chain:
+
+ Bad — generic, no analysis:
+ ```
+ prompt: "Animate this image with a slow push-in"
+ image_url: <generated still>
+ ```
+
+ Good — analyzed first, prompt names the specifics:
+ ```
+ prompt: "Slow 4-second dolly-in toward @maya's face from the medium shot;
+          the warm golden-hour rim light on her left shoulder stays
+          consistent; the wind moves the leaves behind her gently to the
+          right. Camera locked, no shake. Subject does not turn — she keeps
+          the half-smile and direct eye contact from the still."
+ image_url: <generated still>
+ visual_dna_ids: ["vdna_8f2c"] // maya
+ ```
+
+ The point is **not** to dump an essay into the prompt — it's to make sure
+ every concrete detail the next model needs to preserve (or change) is
+ named, so the chain doesn't lose continuity across steps.
 
- **Always send ALL media in ONE `chat_send_message` call.** The `media_urls` array accepts up to **10 URLs** in a single request. Never send one message per image/video.
+ **Production-log tie-in:** when you analyze a generated still/clip, write
+ a one-line description into `.kolbo/production.md` next to the URL — that
+ way the next chained step can read the log instead of re-analyzing.
 
- **Why this matters:** Each `upload_media` call + the final `chat_send_message` all count toward rate limits. Sending 10 uploads + 10 separate chat messages = 20 requests in rapid succession → "Too many generation requests" error. Instead:
+ ### ⚠️ Batching Media in Chat Messages (CRITICAL)
 
- 1. Upload all files at once (output all `upload_media` calls in one response — uploads are 300/min and cost no credits).
- 2. Collect ALL returned CDN URLs into one array.
- 3. Send ONE `chat_send_message` with all URLs in `media_urls`.
+ **Send ALL media in ONE `chat_send_message` call.** `media_urls` accepts up to **10 URLs**. Each separate chat call counts toward rate limits — splitting trips "Too many generation requests."
 
- **Example — analyzing 5 videos:**
  ```
- # Step 1: Upload all in one response (all 5 upload_media calls at once)
+ # Step 1: parallel uploads (one response)
  upload_media({ source: "video1.mp4" }) → url1
- upload_media({ source: "video2.mp4" }) → url2
- upload_media({ source: "video3.mp4" }) → url3
- upload_media({ source: "video4.mp4" }) → url4
- upload_media({ source: "video5.mp4" }) → url5
+ ... (up to 10)
 
- # Step 2: ONE chat call with ALL media URLs
- chat_send_message({
-   message: "Analyze all 5 videos...",
-   media_urls: [url1, url2, url3, url4, url5]
- })
+ # Step 2: ONE chat call with all URLs
+ chat_send_message({ message: "Analyze all 5 videos...", media_urls: [url1, url2, ...] })
  ```
 
- **Rate limit recovery:** If you hit "Too many generation requests", wait 60 seconds before retrying. On retry, do NOT re-upload — reuse the CDN URLs from step 1.
+ On 429: wait 60s, retry the same chat call — reuse the CDN URLs, do not re-upload.
+
+ **Never:** pass a local path in `media_urls` (CDN URLs only); use a transcription `.txt` URL as a video URL; construct a CDN URL yourself; split media across multiple chat calls.
+
+ ---
731
+
732
+ ## ⚠️ Research-First Creative — when to scrape before generating
733
+
734
+ When the user gives you a **product URL, brand reference, or "make X for Y audience" brief** (especially for ads, marketing creative, or anything tied to a real brand), don't jump straight to prompts. Spend one turn researching first — the cost of a single research turn is far less than 10 mis-aimed generations.
735
+
736
+ ### When to do research-first
737
+ - Any URL appears in the brief (product page, landing page, brand site)
738
+ - The brief names a brand, product, or company you don't already have context on
739
+ - The brief targets a specific audience / language / market with conventions you should respect (Hebrew/Israeli, Japanese, Gen-Z TikTok, B2B SaaS, luxury, etc.)
740
+ - The brief explicitly says "research" / "תחקור" / "look up" / "find examples" / "check best practices"
741
+
742
+ ### How to research (parallel calls in one response)
743
+ Fire these IN PARALLEL — they're independent reads:
744
+
745
+ 1. **`WebSearch`** for prompt-engineering patterns specific to the chosen model. **The model name in the search query MUST be the literal model the user named** — never substitute a generic / default / "popular" model. If the user said "nano banana 2", search for `"nano banana 2" prompt …`, NOT `"flux" prompt …` or `"midjourney" prompt …`. The same HARD RULE that applies to *calling* the named model applies to *researching* it. Examples (replace `<model>` with the user's exact wording):
746
+ - `"<model>" prompt engineering ad image text rendering`
747
+ - `"<model>" hex color font specification advertising prompt`
748
+ - `"<model>" hebrew text RTL rendering` (or any user-named language)
749
+ 2. **`WebSearch`** for the audience / market design conventions:
750
+ - `<audience> advertising design trends <year>`
751
+ - `<language> typography <use case> RTL/LTR best practices`
752
+ 3. **`WebFetch`** the product URL with a precise extraction prompt (see below).
753
+ 4. (Optional) `WebSearch` for competitor / reference visuals to set bar.
754
+
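+ A sketch of one such research turn, all fired in one response (model, audience, and URL are placeholders; call shapes are illustrative — substitute the user's literal wording):
+
+ ```
+ WebSearch({ query: "\"nano banana 2\" prompt engineering ad image text rendering" })
+ WebSearch({ query: "israeli social media advertising design trends <year>" })
+ WebFetch({ url: "<product-url>", prompt: "<structured extraction, see template below>" })
+ ```
+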
755
+ ### Extracting the product page (WebFetch prompt template)
756
+
757
+ Don't ask WebFetch a vague "what is this page" — ask for structured extraction:
758
+
759
+ ```
760
+ Extract from this page, in compact bullets:
761
+ 1. Product name + one-line value proposition.
762
+ 2. 3–5 concrete capabilities/benefits (user-facing language).
763
+ 3. All product hero / screenshot image URLs visible in the page.
764
+ 4. Brand color hex codes — pull from inline `style=`, `<style>` tags, or
765
+ linked CSS, ignoring generic UI defaults (#fff/#000). Identify which
766
+ color plays which role (primary CTA, headline text, background, accent).
767
+ 5. Brand voice signals (tone, target user, formality).
768
+ 6. Any explicit fonts named in CSS or visible.
769
+ ```
327
770
 
328
- **❌ Never do this:**
329
- - Pass a local file path in `media_urls` — it won't work, only CDN URLs work
330
- - Use the `.txt` URL from a transcription result as the video URL — that's text, not video
331
- - Skip `upload_media` and try to construct a URL yourself
332
- - Send separate `chat_send_message` calls for each media file — batch them into ONE call
771
+ ### Re-host every external image via `upload_media`
333
772
 
334
- When in doubt, do visual analysis. Do not stop to ask.
773
+ The bulk-API rule applies: external URLs in `reference_images` / `source_images` / `image_url` cause **400 Bad Request**. Pipeline:
774
+
775
+ 1. `Bash: curl -fsSL "<external-url>" -o /tmp/<name>.<ext>` (or use WebFetch where it returns the binary)
776
+ 2. `mcp__kolbo__upload_media` with the local file → returns Kolbo CDN URL
777
+ 3. Use the returned CDN URL in any subsequent generation call
778
+ 4. Log both URLs in the production log (so the user can trace provenance)
779
+
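+ A minimal sketch of the pipeline above (URLs and filenames are placeholders):
+
+ ```
+ Bash: curl -fsSL "https://example.com/img/hero.png" -o /tmp/hero.png
+ upload_media({ source: "/tmp/hero.png" }) → https://cdn.kolbo.ai/.../hero.png
+ generate_image({ prompt: "...", reference_images: ["https://cdn.kolbo.ai/.../hero.png"] })
+ # production.md: hero_1: https://cdn.kolbo.ai/.../hero.png (from https://example.com/img/hero.png)
+ ```
+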
780
+ ### Synthesizing the research
781
+
782
+ In the production log create:
783
+ ```md
784
+ ### Research notes
785
+ - Prompt patterns for <model>: …
786
+ - Audience conventions: …
787
+
788
+ ### Product brief
789
+ - Name: …
790
+ - Value prop: …
791
+ - Capabilities: …, …, …
792
+
793
+ ### Brand palette
794
+ - primary: #...
795
+ - accent: #...
796
+ - text: #...
797
+ - bg: #...
798
+
799
+ ### Re-hosted assets
800
+ - hero_1: <kolbo CDN url> (from <original url>)
801
+ ```
802
+
803
+ ### Building prompts informed by the research
804
+
805
+ When generating ad / marketing creative based on this research:
806
+ - **Exact hex codes for every color** — `#FF4D2E` not "orange". Match brand palette.
807
+ - **On-image text in literal double quotes** — `"שלום עולם"` not `Hebrew greeting`. Specify language and direction (RTL/LTR) when non-English.
808
+ - **Per text element**: position, font weight, point size, color hex, alignment.
809
+ - **Forbid uninvited additions** — explicitly tell the model: NO captions, NO subtitles, NO watermarks, NO extra text beyond what's specified. Same rule as UGC defaults.
810
+ - **Use research findings to shape composition** — e.g. if research said "Israeli social ads favor bold contrast and minimal copy", reflect that.
811
+ - Always **approve the concept + sample prompts with the user** before firing the full batch when the batch is ≥4 ads or the user said "approve first".
812
+
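+ A sketch of a research-informed ad prompt built from the bullets above (hex values, sizes, and copy are placeholders; pull the real palette from the production log):
+
+ ```
+ prompt: "Vertical 9:16 social ad. Solid #0B1F3A background. Headline "שלום עולם"
+ in Hebrew, RTL, top-center, bold sans-serif, ~72pt, #FFFFFF. CTA button
+ bottom-center, #FF4D2E fill, white text "TRY NOW". Product hero from @image1
+ centered. Bold contrast, minimal copy. NO captions, NO subtitles, NO watermarks,
+ NO extra text beyond the quoted strings."
+ reference_images: [hero_1]
+ ```
+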
813
+ ### Skipping research is OK when…
814
+ - User gave no URL, no brand, no audience-specific signal — pure creative ("make a sunset")
815
+ - User said "skip research" / "just generate" / "I have the prompt ready"
816
+ - The brief is for a single quick draft
335
817
 
336
818
  ---
337
819
 
@@ -353,10 +835,218 @@ Use `generate_image_edit` when the user wants to modify an existing image. Pass
353
835
 
354
836
  Simple edits deserve simple prompts. Only elaborate for genuinely complex, multi-step transformations.
355
837
 
356
- ### Multi-Scene / Campaigns
357
- `generate_creative_director` is not only for storyboards and campaigns — use it whenever the user wants a character shown across multiple scenes, outfits, moods, or settings. It generates 1–8 scenes from one brief, each with its own distinct prompt, and keeps style consistent internally. Always pass `visual_dna_ids` when a character must look the same across scenes, and optionally `moodboard_id` for art direction.
838
+ ### Director Tool — Full Capabilities
839
+
840
+ `generate_creative_director` is **not just for storyboards**. It is the right tool any time the user wants **2–8 related outputs from one brief**. The director plans each scene's prompt internally, keeps style consistent across all of them, and runs them in parallel — meaning total wall-time matches the slowest scene, not the sum.
841
+
842
+ **When to reach for it (canonical use cases):**
843
+ - **Multi-angle character sheet** — front / back / sides / 3-quarter, "show her from 4 angles," "turn-around"
844
+ - **Multi-pose** — same character, different poses for the camera
845
+ - **Multi-scene story** — same character through 8 different environments / settings / locations
846
+ - **Wardrobe / outfit variants** — same character, different outfits
847
+ - **Mood / lighting variants** — same scene, different times of day / weather / emotion
848
+ - **Ad campaign / product set** — one product, N hero shots
849
+ - **Storyboard / shot list** — sequential beats of a narrative
850
+ - **Reference sheet for Visual DNA training** — produce 4–8 cohesive images that you'll *then* feed into `create_visual_dna`
851
+
852
+ **What it accepts (all combinable):**
853
+
854
+ | Parameter | Purpose | Use when |
855
+ |---|---|---|
856
+ | `prompt` | The overall brief, *not* a per-scene prompt | Always |
857
+ | `scene_count` (1–8) | How many outputs | Always — never use `num_images` here |
858
+ | `visual_dna_ids: []` | Character / style / product / scene consistency across every output | The character must look the same in every scene |
859
+ | `reference_images: []` | Style / composition references applied to every scene | You have a mood-image or layout reference but no Visual DNA yet |
860
+ | `moodboard_id` / `moodboard_ids: []` | Art-direction overlay (palette, lighting, vibe) | The user gave a brand / style brief |
861
+ | `workflow_type: "video"` | Switch to multi-scene video instead of images | The user asked for "8 short clips" / "4 video variants" |
862
+ | `model` | Pin a specific image / video model | The user named one |
863
+ | `aspect_ratio`, `resolution`, `duration` | Standard formatting | As needed |
864
+
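+ A sketch of a typical call combining these parameters (ids are placeholders):
+
+ ```
+ generate_creative_director({
+   prompt: "Character turn-around of @maya: front, back, left profile, 3-quarter",
+   scene_count: 4,
+   visual_dna_ids: ["vdna_8f2c"],  // maya
+   moodboard_id: "brand_x"
+ })
+ ```
+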
865
+ **When NOT to use it:**
866
+ - User gave **explicit per-image prompts** ("Image 1: X. Image 2: Y. Image 3: Z.") — fire parallel `generate_image` calls instead. Director is for *one brief → N scenes*; explicit per-scene prompts mean the user already did the directing.
867
+ - User wants to **modify a specific existing image** — that's `generate_image_edit`.
868
+ - User asked for **one image** — that's `generate_image`.
869
+
870
+ ### Mixing References, Visual DNAs, and Moodboards
871
+
872
+ You can combine all three reference types in a single call — they're additive, not exclusive. The system blends them; the model uses whichever it can interpret best for the prompt.
873
+
874
+ | Tool | `source_images` (required edit base) | `reference_images` (style / composition) | `visual_dna_ids` (character/style identity) | `moodboard_id` (art direction) |
875
+ |---|:-:|:-:|:-:|:-:|
876
+ | `generate_image` | — | ✅ | ✅ | ✅ |
877
+ | `generate_image_edit` | ✅ (required) | — (source_images plays this role) | ✅ | ✅ |
878
+ | `generate_creative_director` | — | ✅ (applied to every scene) | ✅ (locks character across every scene) | ✅ / `moodboard_ids` |
879
+ | `generate_elements` (video) | — | ✅ (also `reference_videos`, `audio_url`) | ✅ | — |
880
+
881
+ **Practical combinations to know:**
882
+ - *"Make her in a Tokyo street, matching this mood board, with the same face as Visual DNA Maya"* → `generate_image` with `visual_dna_ids=[maya], moodboard_id=tokyo_neon`. No `reference_images` needed.
883
+ - *"Same character, but place her like in this composition"* → `generate_image` with `visual_dna_ids=[maya], reference_images=[layout.png]`. The DNA owns the *face*; the reference owns the *pose/composition*.
884
+ - *"Edit this photo to give her the leather-jacket look from Visual DNA Maya"* → `generate_image_edit` with `source_images=[photo.png], visual_dna_ids=[maya]`. Source is what's edited; the DNA injects the wardrobe identity.
885
+ - *"4 angles of this character, brand-styled"* → `generate_creative_director` with `scene_count=4, visual_dna_ids=[maya], moodboard_id=brand_x`. DNA keeps the face; moodboard sets the look.
886
+ - *"Generate 6 product hero shots; here are 3 reference comp images and our brand moodboard"* → `generate_creative_director` with `scene_count=6, reference_images=[comp1, comp2, comp3], moodboard_id=brand_x`. No DNA needed if it's a product not a face.
887
+
888
+ **Rule of thumb for which to use:**
889
+ - Need an **identity** (face, character, specific product) to stay constant → `visual_dna_ids`.
890
+ - Need a **composition / pose / mood reference** → `reference_images`.
891
+ - Need an **overall style / palette / brand look** → `moodboard_id`.
892
+ - Need all three at once → pass all three. They compose.
893
+
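+ The "all three at once" case, sketched (ids are placeholders):
+
+ ```
+ generate_image({
+   prompt: "@maya posed as in @image1, Tokyo street at night",
+   visual_dna_ids: ["vdna_8f2c"],      // identity (maya)
+   reference_images: ["layout.png"],   // composition (@image1)
+   moodboard_id: "tokyo_neon"          // style / palette
+ })
+ ```
+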
894
+ ### Tagging references inside the prompt (CRITICAL for multi-reference accuracy)
895
+
896
+ When a generation call passes ANY references — `reference_images`,
897
+ `source_images`, `reference_videos`, `source_videos`, `reference_audio`,
898
+ `elements`, OR `visual_dna_ids` — name them inside the prompt so the model
899
+ knows **which asset plays which role**. Without tags, the engine guesses
900
+ and the wrong reference bleeds into the wrong slot ("she ended up wearing
901
+ the background's color" / "the second character got the first character's
902
+ face" / "the wrong song was used as the rhythm reference").
903
+
904
+ **Tag namespaces, used together:**
905
+
906
+ | Tag | Refers to | Order rule |
907
+ |---|---|---|
908
+ | `@image1`, `@image2`, … | Plain images in `reference_images` / `source_images` | Position in the array — `@image1` = `images[0]`, etc. |
909
+ | `@video1`, `@video2`, … | Videos in `reference_videos` / `source_videos` / video `elements` slots | Position in the array. |
910
+ | `@Audio1`, `@Audio2`, … | Audio in `reference_audio` / `audio` slots (lipsync source, music style ref, voice clone, etc.) | Position in the array. |
911
+ | `@<dna-name>` | A Visual DNA — use the literal `name` field from `create_visual_dna` / `list_visual_dnas` (any language, case-insensitive) | Name-based, never positional. See "@name Syntax" rule below. |
912
+
913
+ **Reserved**: `@Image\d+`, `@Video\d+`, `@Audio\d+` are reserved by the Kinovi
914
+ Omni Reference parser — they are NOT looked up as Visual DNAs. Never name a
915
+ Visual DNA `Image1` / `Video2` / etc. (kolbo-api rejects this on creation).
916
+
917
+ **How to write a tagged prompt:**
918
+
919
+ ```
920
+ Place @maya at the coffee-shop counter from @image1, wearing the leather jacket from @image2.
921
+ Keep the warm window light from @image1; ignore the people in the background of @image2.
922
+ ```
923
+
924
+ ```
925
+ Animate @maya walking through @video1's snowy street, matching the camera move of @video1; ignore the people in @video1.
926
+ ```
927
+
928
+ ```
929
+ Lipsync @video1's speaker to the dialogue track @audio1, keeping the original ambient room tone of @video1.
930
+ ```
931
+
932
+ ```
933
+ Compose a 30s track in the style of @audio1 (slow tempo, no vocals), suitable for a product reveal video.
934
+ ```
935
+
936
+ What a tagged prompt does at submission time:
937
+ - `visual_dna_ids: [vdna_8f2c]` → bound to `@maya`
938
+ - `reference_images: [coffee_shop.jpg, jacket_ref.jpg]` → bound to `@image1`, `@image2`
939
+ - `reference_videos: [walking_clip.mp4]` → bound to `@video1`
940
+ - `reference_audio: [dialogue.wav]` → bound to `@audio1`
941
+ - The prompt names each one, so the engine never has to guess.
942
+
943
+ **Rules:**
944
+
945
+ 1. **Order is contract.** `@imageN` / `@videoN` / `@audioN` / `@elementN` are bound to position N in the array you pass. Reordering silently changes what each tag points to — don't reorder mid-conversation; if you need to add a new ref, append it (`@image3`, `@video2`, …) rather than inserting.
946
+ 2. **For edits, the source is `@image1` (or `@video1`).** In `generate_image_edit`, the first entry of `source_images` is the canonical base — refer to it as `@image1`. Same for video tools that take `source_videos`: the first entry is `@video1`. Additional sources become `@image2`/`@video2`/etc.
947
+ 3. **Visual DNA tags are name-based, not positional.** `@maya` always means the DNA you registered as `name: "maya"`, regardless of where its id sits in `visual_dna_ids`.
948
+ 4. **Tag every reference you actually pass.** If you pass a reference but never mention it in the prompt, the engine often treats it as decorative — either drop it or name it explicitly. This applies to images, videos, audio, AND Visual DNAs.
949
+ 5. **Tags carry across the production log.** When you log a generation to `.kolbo/production.md`, write the prompt with the tags intact and record the `@name → URL` / `@name → vdna_id` binding alongside. That way "the rainy scene from last week" remains reproducible weeks later.
950
+ 6. **Tag even single-reference calls when a DNA, video, or audio is involved.** Single plain image with no DNA can use prose ("this image"), but as soon as the call also carries a DNA, a video ref, or an audio ref, tag every asset so the engine knows the subject vs. the modifier role.
951
+
952
+ **Failure modes the tags fix:**
953
+
954
+ | Without tags | With tags |
955
+ |---|---|
956
+ | "Combine these two images" → engine averages them | "Put the subject from @image1 into the scene of @image2" |
957
+ | "Same character, new outfit" with 2 refs → wrong face | "Keep @maya's face from the Visual DNA; apply the outfit from @image1" |
958
+ | "Edit this" with 3 source images → engine edits whichever is first | "In @image1, replace the sky with the sky from @image2" |
959
+ | "Lipsync this video to this audio" with 2 audio tracks → wrong track picked | "Lipsync @video1 to @audio1; ignore @audio2 (that's the music bed)" |
960
+ | "Match this video's style" with 2 video refs → blended motion | "Use @video1's camera move; use @video2's color grade" |
961
+ | "Music like this" with a reference track → engine ignores it | "Compose in the style of @audio1, but slower and without vocals" |
962
+
963
+ ---
964
+
965
+ ## ⚠️ Resolution, Caps & Constraints — read these BEFORE every generation (HARD RULE)
966
+
967
+ Every model exposes a constraint envelope via `list_models`. Submitting a value outside it is a **deterministic 400** — not a degraded result, not a substitution. You MUST consult `list_models` and validate inputs before firing any generation. When in doubt, call `list_models` with `format: "json"` to get the raw model document for programmatic comparison.
968
+
969
+ ### Canonical field reference — which `list_models` field controls which input on which tool
970
+
971
+ The same conceptual slot (e.g. "max reference images") lives under **different field names per model family**. Read the row for your tool, not the model name.
972
+
973
+ | Your input | Tool(s) | Field to read on the model | What "0" / `null` means |
974
+ |---|---|---|---|
975
+ | `reference_images` | `generate_image`, `generate_image_edit` (uses `source_images`), `generate_creative_director`, `generate_video` | `max_reference_images` | `0` = model accepts no refs |
976
+ | `reference_images` | `generate_elements` | `elements_max_images` | `0` = model accepts no image refs |
977
+ | `reference_images` | `generate_video_from_video` | `max_images` | `0` = no secondary image input |
978
+ | `reference_videos` | `generate_elements` | `elements_max_videos` | `0` = no video refs |
979
+ | `reference_videos` | `generate_video_from_video` | `max_videos` | `<= 1` = only the source_video |
980
+ | `elements` | `generate_video_from_video` | `max_elements` | `0` = no elements |
981
+ | `audio_url` | `generate_elements` | `elements_max_audio` (+ `max_audio_duration` for the file) | `0` = no audio ref |
982
+ | `visual_dna_ids` | every tool that accepts DNA | `max_visual_dna` (+ `supports_visual_dna` boolean) | `null` / `0` / `false` = model rejects DNA (silently ignored by some paths) |
983
+ | `aspect_ratio` | any | `supported_aspect_ratios` (or `supported_aspect_ratios_by_type[<type>]` when multimodal) | empty → use `default_aspect_ratio` if set |
984
+ | `resolution` | any | `supported_resolutions` (+ `resolution_multipliers` for cost) | empty → model has no resolution tiering |
985
+ | `duration` (video output) | video tools | `supported_durations` if set, else `min_output_duration`–`max_output_duration` | both null → can't validate, omit and let server default |
986
+ | **input** video duration (source) | `lipsync-video`, `generate_video_from_video` | `min_video_duration` – `max_video_duration` | outside range → reject or upstream truncates |
987
+ | input audio duration | `generate_lipsync`, `generate_elements` audio | `min_audio_duration` – `max_audio_duration` (+ `audio_max_follows_video_duration` for lipsync) | outside range → reject |
988
+ | audio file format | any audio input | `supported_audio_formats` (e.g. `["mp3","wav","m4a"]`; empty = all) | pre-validate before upload |
989
+ | recording duration | `text_to_speech` recording UX | `min_recording_duration` – `max_recording_duration` | usually null for plain TTS |
990
+ | upload file size | every file upload | `max_file_size` (bytes) | null → use platform default |
991
+ | `num_images` | image tools | `images_per_request` overrides for fixed-output models (Midjourney returns 4 regardless) | null → `num_images` honored as-is |
992
+ | `prompt` | every tool | `requires_prompt`, `min_prompt_length`, `max_prompt_length` | null → unconstrained |
993
+ | sound on/off | video tools | `sound_generation_type` (`"native"` vs `"none"`), `sound_enabled_by_default`, `sound_credit_multiplier` | not `"native"` → can't emit synced audio |
994
+ | capability gate | route decision | `supports_visual_dna`, `supports_first_last_frame`, `supports_audio_input` | `false` → the controller silently drops that param |
995
+
996
+ Cost formula: `final_cost = credit × resolution_multipliers[resolution] × (sound_enabled ? sound_credit_multiplier : 1)`, multiplied by `num_images` / `scene_count` as applicable.
997
+
998
+ ### Validation pattern — every generation
999
+
1000
+ Before submitting:
1001
+
1002
+ 1. Call `list_models type=<tool-type>` (text mode is enough for picking; `format: "json"` when you need to programmatically compare caps).
1003
+ 2. For each input array (refs / DNAs / elements) — check `length <= <cap>` from the row above. If over, drop the lowest-priority entries OR ask the user.
1004
+ 3. For each enumerated value (`aspect_ratio` / `resolution` / `duration`) — check it's in `supported_*`. If not, **do not silently substitute**; show the user the allowed set and ask.
1005
+ 4. For each duration-bearing file (source_video for lipsync/v2v, audio for lipsync/elements) — pre-check duration against the min/max range. Use ffmpeg if needed (via `video-production` skill).
1006
+ 5. For uploads — pre-check size against `max_file_size`.
1007
+
1008
+ The MCP tool descriptions also embed the cap field name on the relevant parameter (e.g. `reference_images: "...Cap: pass at most max_reference_images..."`) — use those as inline reminders.
1009
+
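+ The validation steps above, sketched as pseudocode (field names are the real ones from the canonical table; the lookup shape is illustrative):
+
+ ```
+ models = list_models({ type: "elements", format: "json" })
+ model  = models[chosen_model_id]
+ assert reference_images.length <= model.elements_max_images
+ assert visual_dna_ids.length <= model.max_visual_dna       // and model.supports_visual_dna
+ assert model.supported_resolutions.includes(resolution)    // if not: show the set and ASK
+ assert model.min_audio_duration <= audio.duration
+ assert audio.duration <= model.max_audio_duration
+ assert file.size <= model.max_file_size
+ ```
+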
1010
+ ### ⚠️ Quote real cost, never estimates (CRITICAL)
1011
+
1012
+ The formula above is for **pre-approval previews only**. After firing, use the real number from the tool response — every generation now returns `credits_used` (multiplier-adjusted total) and `credits_breakdown` (per-model attribution). Log `credits_used` to `.kolbo/production.md`, not `base × count`.
1013
+
1014
+ ```json
1015
+ { "credits_used": 12, "credits_breakdown": [{ "model": "nano-banana-2", "base": 8, "final": 12, ... }], "urls": [...] }
1016
+ ```
1017
+
1018
+ When the user asks "how much did I spend?" → call `mcp__kolbo__get_session_usage` for the real, multiplier-adjusted session total + per-tool + per-model breakdowns (same numbers as the desktop bottom-bar counter).
1019
+
1020
+ ### Decision rule
1021
+
1022
+ 1. **User specified resolution / sound explicitly** ("4K", "1080p", "480p", "with sound", "silent") → ALWAYS verify the value is in `supported_resolutions` BEFORE firing. If it isn't:
1023
+ - ❌ Do **NOT** silently substitute a "close" value. The user asked for 480p; sending 720p without their consent burns 1.5–2× the credits they expected and produces a different output.
1024
+ - ✅ Show them what the model actually supports in one line and ask which to use:
1025
+ > "Seedance 2 elements supports `[720p, 1080p, 1440p, 2160p]` — 480p isn't available. Closest cheap option is 720p (~+0 credits over your intent). Want 720p, or pick another?"
1026
+ - Only fire after they reply (or after they re-confirm the original intent with the new info).
1027
+ 2. **User specified quality intent without numbers** ("draft", "quick test", "final delivery", "for client", "production") → map intent to tier:
1028
+ - draft / quick / preview → cheapest in `supported_resolutions` (1K / 720p)
1029
+ - normal / standard → the model's default tier (typically 2K / 1080p)
1030
+ - final / production / hero → highest the user's budget allows (3K-4K / 1440p-2160p)
1031
+ 3. **No quality signal at all** AND the cost difference between cheapest and most-expensive is **>2×** OR total batch is large (≥4 outputs) → **ask the user once** with a one-line cost comparison, then default to standard if they don't reply. Example:
1032
+ > "This model offers 1K (8 cr × 4 = 32), 2K (1.5×: 48), 4K (2×: 64). Default to 1K? Or pick 2K/4K?"
1033
+ 4. **No quality signal AND cost difference is small** (≤1.5×) → quietly use the cheapest supported, no need to interrupt.
1034
+ 5. **Sound on a video model with `sound_credit_multiplier > 1`** → if user didn't ask for sound, leave it off (saves credits). If user said "with sound" / "with music" / "with audio", enable it.
1035
+
1036
+ ### Defaults when nothing is specified
1037
+
1038
+ - **Image**: `1K` (or the cheapest in `supported_resolutions`).
1039
+ - **Video**: `720p` (or the cheapest), with `default_duration` (or shortest in `supported_durations`).
1040
+ - **Sound**: respect `sound_enabled_by_default`; if false, leave off.
1041
+
1042
+ ### Always log the resolution / duration / sound choices
1043
+
1044
+ Production-log entries should include the resolution and (for video) duration + sound state alongside the URL, so the user can see what they paid for:
358
1045
 
359
- You can also do multiple parallel `generate_image` calls with the same `visual_dna_ids` when the user provides explicit per-image prompts.
1046
+ ```md
1047
+ - still: https://...01-coffee.png (flux-2-pro · 1K, 2026-05-14)
1048
+ - video: https://...02-rain.mp4 (kling-2 · 1080p · 5s · sound-off, 2026-05-14)
1049
+ ```
360
1050
 
361
1051
  ---
362
1052
 
@@ -365,39 +1055,161 @@ You can also do multiple parallel `generate_image` calls with the same `visual_d
365
1055
  Visual DNA profiles capture the visual "identity" of a character, style, product, or scene from reference media.
366
1056
 
367
1057
  ### Workflow
368
- 1. **Create** a profile with `create_visual_dna` — provide reference images (max 4), optionally video and audio
1058
+ 1. **Create** a profile with `create_visual_dna` — provide reference images (max 4 — if the user gives more, pick the 4 most representative or ask which to keep; never pass 5+), optionally video and audio
369
1059
  2. **Types**: `character` (default), `style`, `product`, `scene`, `environment`
370
1060
  3. **Use** the profile by passing its `id` in `visual_dna_ids` in: `generate_image`, `generate_creative_director`, `generate_elements`
371
1061
  4. **List/inspect** profiles with `list_visual_dnas` / `get_visual_dna`
372
1062
 
373
- ### ⚠️ @name Syntax CRITICAL for Multi-Visual-DNA Prompts
1063
+ ### ⚠️ Pre-flight: Verify the Visual DNA Exists Before Using It (MANDATORY)
1064
+
1065
+ NEVER reference a Visual DNA by name, role, or assumed identity without first
1066
+ confirming it exists in the user's library. This is a frequent failure mode:
1067
+ the user mentions a character ("אסתר", "Maya", "the model from before"), the
1068
+ agent assumes a matching Visual DNA exists, calls `generate_image` /
1069
+ `generate_elements` with a guessed or fabricated `visual_dna_ids` value, and
1070
+ the generation fails or produces the wrong identity.
1071
+
1072
+ **Before** any generation call that uses `visual_dna_ids`:
1073
+
1074
+ 1. Call `list_visual_dnas` to get the actual available DNAs (id + name).
1075
+ 2. Match the user's reference (by name, type, or your `.kolbo/production.md`
1076
+ log) to a real DNA in that list.
1077
+ 3. If there is **no match**, STOP and ask the user one of:
1078
+ - "I don't see a Visual DNA named <X> in your library. Do you want me
1079
+ to create one now (I'll need reference image(s)), use an existing
1080
+ DNA (<list>), or proceed without DNA using direct reference images?"
1081
+ 4. Only proceed once you have a real `vdna_*` id confirmed by either the
1082
+ list or a fresh `create_visual_dna` call you just made.
1083
+
1084
+ Do NOT:
1085
+ - Invent a Visual DNA id or assume one exists from context.
1086
+ - Use the same DNA id for a different character because "it sounded close."
1087
+ - Carry a DNA id from `.kolbo/production.md` into a new generation without
1088
+ re-confirming it still exists (`list_visual_dnas` is cheap — call it).
1089
+
1090
+ When the user says "use the model אסתר" but you've only created a DNA for
1091
+ "זוהר", you MUST ask before generating — never silently substitute or guess.
1092
+
1093
+ ### ⚠️ Don't re-fetch / re-list your own outputs (CRITICAL)
1094
+
1095
+ After a generation tool returns its URLs, those URLs are **already** in the canvas (the desktop app's gallery panel) and in `.kolbo/production.md`. Do **NOT** call `list_media`, `get_media`, `get_media_stats`, `list_visual_dnas`, or `chat_send_message` with `media_urls` on those URLs just to "verify" or "fetch thumbnails of the results" — that's pure noise:
374
1096
 
375
- When using **multiple Visual DNA profiles in a single generation**, reference each profile by its name using the `@name` syntax directly in the prompt. This tells the engine which character or asset appears where:
1097
+ - It burns credits and time for zero new information.
1098
+ - Every such tool call streams partial output into the session, which forces the desktop canvas to re-evaluate (visible flicker on the gallery tiles).
1099
+ - The thumbnails returned by `list_media` / `get_media` are the SAME asset you just generated; you don't need a thumbnail of a thumbnail.
1100
+
1101
+ **Only call list/get media tools when:**
1102
+ - The user explicitly asks ("what do I have in my library?", "show me my old DNAs").
1103
+ - You need details about something generated in an **earlier session** that you don't have a record of.
1104
+ - You're chasing a specific user reference like "the rainy clip from yesterday" that isn't in the current chat's `.kolbo/production.md`.
1105
+
1106
+ **Only call `chat_send_message` with `media_urls` when:**
1107
+ - The user uploaded media themselves and asks you to analyze / describe / extract info from it.
1108
+ - You need to read a video / audio file you didn't generate.
1109
+
1110
+ For media you generated this session, you already know the prompt, model, and result URL — write that into `.kolbo/production.md` and reference it from context.
1111
+
1112
+ ### ⚠️ Presenting list results — show thumbnails (MANDATORY)
1113
+
1114
+ When you display the result of `list_visual_dnas`, `list_media`,
1115
+ `list_moodboards`, or any other tool that returns items with image/thumbnail
1116
+ URLs, render each item's thumbnail as a markdown image so the user can
1117
+ actually see what they have. The chat view auto-renders both `![](url)`
1118
+ markdown and bare image URLs, plus auto-injects a player below links to
1119
+ videos/audio — use that.
1120
+
1121
+ Do NOT dump a text-only bullet list of ids + names when a thumbnail field
1122
+ is available in the response.
1123
+
1124
+ **Visual DNA listing format:**
1125
+ ```
1126
+ Visual DNAs (6):
1127
+ 1. **Maya** — `vdna_abc` (character)
1128
+ ![Maya](https://cdn.kolbo.ai/.../maya-thumb.jpg)
1129
+ 2. **Tokyo Neon** — `vdna_xyz` (style)
1130
+ ![Tokyo Neon](https://cdn.kolbo.ai/.../tokyo-thumb.jpg)
1131
+ ```
1132
+
1133
+ **Media listing format:**
1134
+ ```
1135
+ 1. **rain-loop.mp4** — `med_abc` (video, 5s, 1080p)
1136
+ https://cdn.kolbo.ai/.../rain-loop.mp4
1137
+ 2. **coffee-01.png** — `med_def` (image, 1024x1024)
1138
+ ![](https://cdn.kolbo.ai/.../coffee-01.png)
1139
+ ```
1140
+
1141
+ Fields to read for the image source (use the first one present on the item):
1142
+ `thumbnail`, `thumbnail_url`, `preview_url`, `url`, `image`. For videos and
1143
+ audio, use the file `url` directly — the chat view renders a player inline.
1144
+
1145
+ If an item lacks any image/preview field, fall back to text-only for that
1146
+ row, but never skip thumbnails on the rows that do have them.
1147
+
1148
+ ### ⚠️ @name Syntax — ALWAYS use it when passing visual_dna_ids (MANDATORY)
1149
+
1150
+ Whenever a generation call passes `visual_dna_ids` (even just one), the
1151
+ prompt MUST refer to each Visual DNA by `@<exact-name>` — the literal `name`
1152
+ field as it was set in `create_visual_dna` and as it appears in
1153
+ `list_visual_dnas`. This is how the engine binds the DNA to a role in the
1154
+ scene. Without `@name`, the engine guesses, drops the DNA, or blends
1155
+ multiple DNAs together.
1156
+
1157
+ **Use the actual stored name, programmatically.** When you call
1158
+ `list_visual_dnas` (or `create_visual_dna`), read the `name` field off the
1159
+ response and use that exact string after the `@`. Do NOT:
1160
+
1161
+ - Translate the name into another language ("אסתר" / "esther" / "אסתי" —
1162
+ pick whichever string is in `name` and use ONLY that one).
1163
+ - Invent a friendlier alias ("the model", "המודל", "her").
1164
+ - Write the character's name in plain text without the `@` prefix.
1165
+ - Drop the `@name` when only one DNA is passed — the engine still needs the
1166
+ binding so it knows the DNA is the *subject* and not a passive style.
1167
+
1168
+ **Wrong** (DNA `name` is `esther_model`, user wrote prompt in Hebrew):
1169
+ ```
1170
+ prompt: "אסתר לובשת שרשרת זהב, פורטרט חצי גוף"
1171
+ visual_dna_ids: ["vdna_abc"]
1172
+ ```
1173
+ The engine sees plain text "אסתר" and has no idea it should bind to the DNA.
1174
+
1175
+ **Right:**
1176
+ ```
1177
+ prompt: "@esther_model לובשת שרשרת זהב, פורטרט חצי גוף"
1178
+ visual_dna_ids: ["vdna_abc"] // esther_model
1179
+ ```
376
1180
 
1181
+ **Multi-DNA example:**
377
1182
  ```
378
- "@dana walks into @shop and picks up a product from the shelf"
1183
+ prompt: "@dana standing in @shop, picking up a product"
1184
+ visual_dna_ids: ["vdna_abc", // dana
1185
+ "vdna_xyz"] // shop
379
1186
  ```
380
1187
 
381
- - Profile names are set during `create_visual_dna` (the `name` field)
382
- - Reference them as `@name` (lowercase, no spaces) inside the prompt text
383
- - Multiple profiles can appear in one prompt — the engine blends each one where it's mentioned
384
- - **Without `@name` references, the engine may blend all Visual DNAs together indiscriminately**
385
- - This works across `generate_image`, `generate_creative_director`, and `generate_elements`
1188
+ **How `@name` actually binds:** kolbo-api parses the prompt for `@<name>`
1189
+ mentions, queries the DB for a Visual DNA whose `name` matches
1190
+ (case-insensitive), and **replaces the `@name` token with that DNA's stored
1191
+ `systemPrompt`**. If no `@name` is in the prompt, the systemPrompt never
1192
+ gets injected and the `visual_dna_ids` slot is effectively wasted.
1193
+
1194
+ The match is **literal and case-insensitive**, so:
1195
+ - The `@name` must equal the stored `name` field (e.g. if `name: "esther_model"`
1196
+ → write `@esther_model`, not `@Esther`, not `@אסתר`, not `@the model`).
1197
+ - Any-language characters are supported — if the DNA was created with
1198
+ `name: "אסתר"` you write `@אסתר`. Use the EXACT stored string.
1199
+ - Mentions terminate at punctuation (`.,!?`), double-spaces, another `@`,
1200
+ or end of string. So `@maya, wearing...` matches `maya`.
386
1201
 
387
- **Example workflow two-character scene:**
388
- 1. Create Visual DNA `name: "dana"` (type: character) `id: "vdna_abc"`
389
- 2. Create Visual DNA `name: "shop"` (type: environment) → `id: "vdna_xyz"`
390
- 3. Generate: `prompt: "@dana standing in @shop, picking up a product"`, `visual_dna_ids: ["vdna_abc", "vdna_xyz"]`
1202
+ This composes with `@image1` / `@image2` positional tags for plain
1203
+ reference/source images; see "Tagging references inside the prompt" above
1204
+ for the full system.
391
1205
 
392
- ### Visual DNA Limits (maxVisualDna)
1206
+ **Naming hint for `create_visual_dna`:** pick a short, lowercase, no-space
1207
+ Latin string for `name` (`esther_model`, `dana`, `tokyo_neon`) so it's
1208
+ trivially typable inside any prompt regardless of the user's language.
393
1209
 
394
- Each model has a `maxVisualDna` field in `list_models` results — never pass more Visual DNAs than the model supports:
395
- - **Image models** (non-Kling): up to **8** Visual DNAs
396
- - **Kling image models**: up to **3** Visual DNAs
397
- - **Elements video models**: up to **3–5** Visual DNAs (model-dependent)
398
- - **All other models**: up to **3** Visual DNAs
1210
+ ### Visual DNA Limits
399
1211
 
400
- Always check the `maxVisualDna` field from `list_models` for the exact limit of the chosen model.
1212
+ Read `max_visual_dna` from `list_models` for the exact cap, AND `supports_visual_dna` for the on/off boolean — a model can support DNA without an explicit cap, or have a non-null cap but silently ignore DNA on certain paths (e.g. `generate_video`). Typical ranges: image models (non-Kling) up to **8**, Kling image models **3**, Elements video models **3–5**, everything else up to **3**. The canonical field reference table above gives the per-tool routing.
401
1213
 
402
1214
  ### ⚠️ Visual DNA Creation — Always Generate Reference Images First (MANDATORY)
403
1215
 
@@ -426,12 +1238,8 @@ Always check the `maxVisualDna` field from `list_models` for the exact limit of
426
1238
  - User asks to put a character in a specific environment or scene → create both a character Visual DNA and an environment Visual DNA, use `@name` syntax to place them
427
1239
 
428
1240
  ### ⚠️ When NOT to Use Visual DNA
429
- - **Animating an image** ("make this photo move", "animate this image") use `generate_video_from_image` and pass the image as the source. Do NOT attach `visual_dna_ids` — the source image IS the reference, Visual DNA adds no value here.
430
- - **Text-to-video** from a general description (no specific character to lock in) use `generate_video` without `visual_dna_ids`
431
- - **`generate_video`** — does not support Visual DNA at all. Never pass `visual_dna_ids` to it.
432
- - **`generate_video_from_image`** — does not support Visual DNA. The source image serves as the visual reference.
433
- - **`generate_first_last_frame`** — does not support Visual DNA. The keyframes define the visual.
434
- - **The only video tool that supports Visual DNA is `generate_elements`** (elements-type models like Seedance 2, Kling O3 Reference, Grok Imagine). Use it when the user wants a character to appear consistently in a video scene.
1241
+ - **Animating an image** → `generate_video_from_image`; the source image IS the reference, don't add `visual_dna_ids`.
1242
+ - **Text-to-video (`generate_video`) ignores `visual_dna_ids`.** For character-consistent video, route through `generate_elements` (Seedance 2, Kling O3 Reference, Grok Imagine), the primary DNA→video path; `generate_video_from_image`, `generate_video_from_video`, and `generate_first_last_frame` also honor DNA (see "Pick the right video tool" below).
435
1243
 
436
1244
  ---
437
1245
 
@@ -447,24 +1255,104 @@ Video costs more per generation than images — write prompts deliberately to ge
447
1255
  - **Max 3 shots per prompt.** More shots cause the model to drift.
448
1256
  - **Duration-aware timecodes**: if the user gives a duration, space timecodes to fit (`[0s] [3s]` for 5s total; `[0s] [3s] [6s]` for 10s total). If no duration is given, describe shots sequentially without hardcoded timecodes.
449
1257
 
450
- ### Image-to-Video
1258
+ ### ⚠️ Pick the right video tool
1259
+
1260
+ There are SIX distinct video modes. They take different inputs and route to different model families. Pick by what the user actually has on hand:
1261
+
1262
+ | User has… | Use | Primary inputs | Visual DNA? |
1263
+ |---|---|:-:|:-:|
1264
+ | Nothing — just a text idea | `generate_video` | `prompt` (+ optional `reference_images`, `preset_id`) | **❌ No** (controller ignores DNA — use `generate_elements` if you need DNA) |
1265
+ | One still image they want animated | `generate_video_from_image` | `image_url` + motion `prompt` | ✅ Yes |
1266
+ | An existing video to restyle / transform | `generate_video_from_video` | `source_video` + restyle `prompt` (+ optional `reference_images`, `reference_videos`, `elements`) | ✅ Yes |
1267
+ | Loose assets (products, characters, refs) to compose into a video | `generate_elements` | `prompt` + any of `reference_images`, `reference_videos`, `audio_url`, `files`, `visual_dna_ids` | ✅ Yes (PRIMARY route for DNA→video) |
1268
+ | Two keyframes (start + end) — wants smooth morph between them | `generate_first_last_frame` | `first_frame_url` + `last_frame_url` (or `first_frame` + `last_frame` paths) + optional motion `prompt` | ✅ Yes |
1269
+ | Image or video face + audio to dub | `generate_lipsync` | `source` (image OR video) + `audio` + optional `text_prompt` + optional `bounding_box_target` | — |
1270
+
1271
+ **Rule of thumb:**
1272
+ - Coordinated **multi-scene** video set ("8 short clips of the character") → `generate_creative_director` with `workflow_type: "video"`, never multiple `generate_video` calls.
1273
+ - Need a **character to stay the same** across multiple videos → DNA only flows through `generate_elements`, `generate_video_from_image`, `generate_video_from_video`, `generate_first_last_frame`. **NOT through `generate_video`** — text-to-video silently drops `visual_dna_ids`.
1274
+
1275
+ ### Text-to-Video (`generate_video`)
1276
+ Pure text → video. No source media. Pass `prompt`, optional `reference_images` (style/composition cue), optional `preset_id`. Use `list_models type="text_to_video"` to pick a model, then read `supported_durations`, `supported_aspect_ratios`, `supported_resolutions` on it before setting those params.
1277
+
1278
+ ### Image-to-Video (`generate_video_from_image`)
451
1279
  The model can see the starting frame. Describe **what happens**, not what the image looks like. Focus on motion, camera, and action — don't re-describe the subject or setting.
452
1280
  - Good: "Slow dolly-in on the subject. Her hair drifts in a light breeze. Soft particles float through the air. [6s]"
453
1281
  - Bad: "A woman with long brown hair standing in a forest, wearing a red dress, with golden sunlight..." (re-describes the image)
454
1282
 
455
- ### Video-to-Video (Restyle)
456
- Use `generate_video_from_video` to restyle an existing video. Describe the **new style**, not the original content — the model preserves the original motion.
1283
+ DNA support: yes — `visual_dna_ids` is honored if you need to lock the character to a prior DNA profile.
1284
+
1285
+ ### Video-to-Video (`generate_video_from_video`)
1286
+ Restyle / transform an existing video. Describe the **new style**, not the original content — the model preserves the original motion.
457
1287
  - Good: "Transform into anime style with cel-shading and vibrant colors"
458
1288
  - Bad: "A person walking down a street" (re-describes what's already in the video)
459
1289
 
460
- ### Elements (Reference Assets Video)
461
- Use `generate_elements` when the user has specific assets (product photos, character references) they want animated into a video. Pass them as `reference_images` (URLs) or `files` (local paths).
1290
+ Per-model extras (**call `list_models type="video_to_video"` and read these caps before passing them**):
1291
+
1292
+ | Param | Read this cap | Examples |
1293
+ |---|---|---|
1294
+ | `reference_images` | `max_images > 0` | Kling O1/O3 (character ref), Aleph / gen4_aleph (style ref), WAN VACE (character image) |
1295
+ | `reference_videos` | `max_videos > 1` | WAN 2.6 reference-to-video — accepts 1–3 reference videos |
1296
+ | `elements` | `max_elements > 0` | Models that accept additional element images alongside the main video |
1297
+
1298
+ For models that use `reference_videos` as their *primary* input (like WAN 2.6 reference-to-video), pass the first reference video in BOTH `source_video` AND `reference_videos`.
1299
+
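+ Sketch of that dual-slot pattern (the model id and URL variables are placeholders; confirm the real id via `list_models`):
+
+ ```
+ generate_video_from_video({
+   model: "wan-2.6-reference-to-video",      // placeholder id
+   source_video: ref1_url,                   // first reference video
+   reference_videos: [ref1_url, ref2_url],
+   prompt: "Use @video1's camera move; use @video2's color grade"
+ })
+ ```
+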
1300
+ ### Elements — Reference Assets → Video (`generate_elements`)
1301
+ The **primary route for character-consistent video**. Combine any of: reference images, reference videos, an audio track, Visual DNAs. Pass URLs (`reference_images`, `reference_videos`, `audio_url`) or local file paths (`files`).
1302
+
1303
+ Per-model caps — **call `list_models type="elements"`** and read:
1304
+
1305
+ | Param | Read this cap | What it means |
1306
+ |---|---|---|
1307
+ | `reference_images` | `elements_max_images` | Max distinct image references the model accepts |
1308
+ | `reference_videos` | `elements_max_videos` | Most models = 0; non-zero for video-referenced elements models |
1309
+ | `audio_url` | `elements_max_audio` | Most models = 0; non-zero for audio-driven elements models |
1310
+ | `visual_dna_ids` | `max_visual_dna` | Max DNA profiles. Each DNA may expand into multiple slots — the controller distributes them across the available image slots. |
1311
+
1312
+ Top elements models to know: Seedance 2, Kling O3 Reference, Grok Imagine, Veo 3.1. Specs vary — never assume; always `list_models`.
1313
+
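+ A typical DNA→video elements call, sketched (ids and URL variables are placeholders):
+
+ ```
+ generate_elements({
+   prompt: "@maya holds the bottle from @image1, walking through @video1's street",
+   visual_dna_ids: ["vdna_8f2c"],     // maya
+   reference_images: [bottle_url],    // @image1
+   reference_videos: [street_url]     // @video1 (only if elements_max_videos > 0)
+ })
+ ```
+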
1314
+ ### First/Last Frame (`generate_first_last_frame`)
1315
+ Provide two keyframes; the model interpolates a smooth transition. Two input modes (do NOT mix):
1316
+ - URL mode — `first_frame_url` + `last_frame_url`
1317
+ - File mode — `first_frame` + `last_frame` (URLs or absolute local paths)
1318
+
1319
+ Optional `prompt` describes the desired motion (e.g. "smooth dolly-in"). DNA support: yes.
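+
+ URL mode, sketched (frame URLs are placeholders):
+
+ ```
+ generate_first_last_frame({
+   first_frame_url: "https://cdn.kolbo.ai/.../01-day.png",
+   last_frame_url: "https://cdn.kolbo.ai/.../02-night.png",
+   prompt: "smooth day-to-night time-lapse, camera locked"
+ })
+ ```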
462
1320
 
463
- ### First/Last Frame (Keyframe Interpolation)
464
- Use `generate_first_last_frame` when the user provides two keyframes and wants the model to create a smooth transition between them.
1321
+ ### Lipsync (`generate_lipsync`)
1322
+ Sync audio to a face. Works for **both image-lipsync and video-lipsync**; the tool auto-detects the source type by file extension. Pass `source` (image OR video URL/path), `audio` (URL/path), optional `text_prompt`, optional `bounding_box_target` to pick which face when there are several.
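+
+ Sketch (paths and the bounding-box value are placeholders):
+
+ ```
+ generate_lipsync({
+   source: "https://cdn.kolbo.ai/.../talking-head.mp4",  // or a still image
+   audio: "/tmp/voiceover.wav",
+   bounding_box_target: "<which face>"                   // only when several faces
+ })
+ ```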
465
1323
 
466
- ### Lipsync
467
- Use `generate_lipsync` to sync audio to a face in an image or video. Both `source` (face) and `audio` accept URLs or local file paths.
1324
+ ### Reference inputs combine freely
1325
+ `visual_dna_ids` + `reference_images` + (where supported) `reference_videos` + `audio_url` are **additive** across all video tools that accept them. The same matrix from "Mixing References, Visual DNAs, and Moodboards" applies: DNA owns identity, reference_images own composition/style, audio_url drives sync, video references provide motion or scene context.
1326
+
1327
+ ### UGC / Short-Form Vertical Video — Defaults
1328
+
1329
+ When the user asks for **UGC ads, TikTok content, Reels, Shorts, or any "creator-style" video**, snap to these defaults unless they explicitly override:
1330
+
1331
+ | Setting | UGC default | Why |
1332
+ |---|---|---|
1333
+ | `aspect_ratio` | **`"9:16"`** (vertical) | TikTok / Reels / Shorts are all vertical-first. Using 16:9 forces the user to crop or reshoot. |
1334
+ | Visual aesthetic | Phone-shot, handheld, natural lighting | UGC works precisely *because* it doesn't look produced. Cinematic = wrong vibe. |
1335
+ | Camera language | Slight handheld sway, selfie-arm framing, key light from window/screen | NOT slow dollies, NOT cinematic crane moves, NOT studio key light |
1336
+ | Energy | "talking to a friend" — casual, direct-to-camera, occasional gestures | Not theatrical, not staged, not "model-y" |
1337
+ | Captions / subtitles / text overlays | **NEVER add** unless explicitly requested | Users add captions in CapCut / TikTok native editor; baked-in captions limit reuse |
1338
+ | Brand watermarks / lower-thirds / lower banners | **NEVER add** unless explicitly requested | Same reason |
1339
+ | Music / SFX | Off by default unless asked | They'll layer their own audio in post |
1340
+ | Length | If user gives no number, default to the model's `default_duration` (typically 5–8s for elements/v2v models). Don't extend without asking. | Shorter = more usable for the algorithm |
1341
+
1342
+ **Phrases in the user's prompt that activate UGC defaults:**
1343
+ "UGC", "user-generated", "creator video", "TikTok", "Reels", "Shorts", "POV", "selfie video", "phone-shot", "vlogger", "talking head" (when context implies social media), "for social", "Instagram video", "YouTube short".
1344
+
1345
+ **Phrases that override UGC defaults** (use them as-given, not as UGC):
1346
+ "commercial", "ad spot" (without UGC), "cinematic", "broadcast", "TV ad", "horizontal", "16:9", "landscape", "billboard".
1347
+
1348
+ **Prompt template seed for UGC:**
1349
+ ```
1350
+ UGC selfie video, vertical 9:16, handheld phone aesthetic.
1351
+ {presenter description} in {everyday setting}, {energy level}.
1352
+ They {natural action with the product/subject}, talking directly to camera.
1353
+ Phone-shot lighting (window/screen key light), slight handheld sway, no cinematic moves.
1354
+ Style: authentic creator content, NOT polished commercial.
1355
+ ```
468
1356
 
469
1357
  ### Camera Vocabulary
470
1358
 
@@ -580,11 +1468,78 @@ Describe **genre → mood → instrumentation → tempo → era**, in that order
580
1468
 
581
1469
  ## Media Library
582
1470
 
583
- Use `upload_media` to upload local files or URLs to the Kolbo CDN for stable hosting. Useful when:
584
- - A local file needs to be referenced in multiple generation calls
585
- - You want a permanent CDN URL instead of an ephemeral local path
586
-
587
- Use `list_media` to browse previously uploaded content (filter by type, search by name).
1471
+ The library covers both **uploaded files** and **AI-generated outputs the user has saved**. Tools fall into five groups: ingest, browse, lifecycle (delete/restore/move), folders, and favorites.
1472
+
1473
+ ### ⚠️ Present locally-produced media to the user
1474
+
1475
+ When you produce a media file LOCALLY — `ffmpeg` via the `video-production` skill, Remotion render, manual `Bash` mux of audio + video, `edit_image` outputs saved to disk, any save-to-file flow — make sure the user can actually find and open it. Local files are invisible in the chat / canvas UI by default; only the path string makes it through.
1476
+
1477
+ **Rules:**
1478
+
1479
+ 1. **Surface the file in chat as a clickable thing**, not just a path string. Write the line as a markdown link to a `file://` URL so the user can click to open it in their default app:
1480
+ ```
1481
+ ✅ Final video ready: [zohar_hagai_campaign.mp4](file:///Users/mymac/Documents/test agent 1/zohar_hagai_campaign.mp4) (45s · 1440×1440 · with music)
1482
+ ```
1483
+ The user clicks the link → the desktop app shell hands the path to the system → opens in QuickTime / VLC / Finder reveal, etc.
1484
+
1485
+ 2. **Always log the local path in `.kolbo/production.md`** under the artifact's entry — that's the durable record:
1486
+ ```md
1487
+ ## Final
1488
+ - **Campaign video (45s)**
1489
+ - local: /Users/mymac/Documents/test agent 1/zohar_hagai_campaign.mp4
1490
+ - resolution: 1440×1440
1491
+ - audio: Gilded Horizon (Track 1 & 2, 3:03)
1492
+ - rendered: 2026-05-16
1493
+ ```
1494
+
1495
+ 3. **Don't auto-upload to `upload_media`**. The user wants local-only files to stay local; they have the file on disk and can move/share it themselves. Upload only when the user explicitly asks ("upload this", "share publicly", "give me a CDN URL").
1496
+
1497
+ 4. **Reveal-in-Finder affordance for macOS** when finishing a multi-step production: in addition to the `file://` link, mention the parent directory path so the user can `cd` or open the folder. Many users want to see all the intermediate files (frames, alt cuts, original audio) in one place.
1498
+
1499
+ 5. **Files served via `file://` won't render inline** in the chat as `<video>` / `<img>` — the desktop WebView blocks file:// for security. Don't try to embed; just link.
1500
+
1501
+ ### Routing — user says → call
1502
+
1503
+ | User says | Call |
1504
+ |---|---|
1505
+ | "Upload this file" / "host this" / "give me a public URL for this" | `upload_media` |
1506
+ | "Show my media" / "list my images/videos" / "what do I have?" | `list_media` (pass `type` / `category` / `project_id` / `folder_id` / `search`) |
1507
+ | "Show my favorites" / "list starred items" | `list_media` with `category=favorites` |
1508
+ | "List everything in project X" | `list_media` with `project_id=X` |
1509
+ | "List all videos in folder X" | `list_media` with `folder_id=X, type=video` |
1510
+ | "What was the prompt for [item]?" / "tell me about this generation" | `get_media` |
1511
+ | "How many videos do I have?" / "what's my storage usage?" | `get_media_stats` |
1512
+ | "Favorite this" / "star this" / "save to favorites" | `favorite_media` |
1513
+ | "Unfavorite" / "remove from favorites" / "unstar" | `unfavorite_media` |
1514
+ | "Delete this" / "remove this image" | `delete_media` (soft, recoverable for 30 days) |
1515
+ | "Restore it" / "undelete" / "bring it back from trash" | `restore_media` |
1516
+ | "Permanently delete" / "wipe it forever" / "free up space" | **confirm with user** → `permanently_delete_media` |
1517
+ | "Move this to project X" | `move_media` |
1518
+ | "Clean up old [type]" / "delete everything from [time period]" | `list_media` (find ids) → **confirm** → `bulk_delete_media` |
1519
+ | "Restore all from trash" | `list_media include_deleted=true` → `bulk_restore_media` |
1520
+ | "Empty my trash" / "purge deleted items" | `list_media include_deleted=true` → **show count, confirm** → `bulk_permanently_delete_media` |
1521
+ | "Move all these to project X" | `bulk_move_media` |
1522
+ | "Move everything in folder X to project Y" | `move_folder_contents` |
1523
+ | "Make a folder for X" / "create a 'campaigns' folder" | `create_media_folder` |
1524
+ | "Rename folder" / "change folder color or icon" | `update_media_folder` |
1525
+ | "Delete the [name] folder" | **confirm with user** → `delete_media_folder` (items stay in library) |
1526
+ | "Add these to [folder]" / "put these in folder X" | `add_media_to_folder` |
1527
+ | "Remove these from [folder]" | `remove_media_from_folder` |
1528
+ | "Share [folder] with alice@…" | `share_media_folder` with `user_emails: [...]` |
1529
+ | "Revoke [user]'s access to [folder]" | `unshare_media_folder` with `user_id` |
1530
+ | "Show my folders" / "what folders do I have?" | `list_media_folders` |
1531
+
1532
+ ### Rules and gotchas
1533
+
1534
+ 1. **"Delete" is soft by default.** Use `delete_media` / `bulk_delete_media` for normal "delete" intent — items go to trash for 30 days and are recoverable. Only use `permanently_delete_media` / `bulk_permanently_delete_media` when the user explicitly asks for unrecoverable deletion ("permanently", "forever", "wipe", "free up space"). **Always confirm before either permanent variant.**
1535
+ 2. **Confirm before destructive folder ops.** `delete_media_folder` detaches items (they stay in the library) but the folder itself is gone — no undo. Confirm with the user.
1536
+ 3. **`bulk_move_media` is atomic.** If you get a "not all items owned by you" error, do NOT retry partially. Surface the error to the user and let them pick a smaller batch.
1537
+ 4. **Prefer `list_media` filters over post-filtering.** Pass `project_id` / `folder_id` / `category` / `type` / `search` to the backend; don't fetch the whole library and filter client-side.
1538
+ 5. **`is_favorited` is per-user.** On shared projects, an item can be favorited by you and not by your teammates — the value reflects the calling user only.
1539
+ 6. **"Empty trash" flow:** `list_media` with `include_deleted=true` → show the count → confirm → `bulk_permanently_delete_media`. Never call the bulk-permanent endpoint without listing first so the user knows the scope.
1540
+ 7. **Bulk caps:** 1000 ids for `bulk_delete_media` / `bulk_restore_media` / `bulk_permanently_delete_media` / `bulk_move_media`; 500 ids for `add_media_to_folder` / `remove_media_from_folder`. Split larger jobs into successive calls.
1541
+ 8. **Folder share resolution:** `share_media_folder` takes emails; users not found come back in `not_found`. Report those to the user — don't assume the share succeeded silently. Members can list/add/remove items but cannot delete the folder or reshare it.
1542
+ 9. **`get_media` accepts a generation_id as a fallback** for the `media_id` arg, so you can chase down items the user references by their original generation rather than by library id.
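+
+ The "empty trash" flow from rule 6, sketched (counts and parameter names are illustrative):
+
+ ```
+ list_media({ include_deleted: true }) → 37 trashed items
+ # → "Your trash holds 37 items. Permanently delete all? This is NOT reversible."
+ # user confirms:
+ bulk_permanently_delete_media({ ids: [...] })
+ ```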
588
1543
 
589
1544
  ---
590
1545
 
@@ -623,15 +1578,12 @@ app_builder_get_session(session_id) → returns:
623
1578
  supabase_anon_key → paste into .env as NEXT_PUBLIC_SUPABASE_ANON_KEY
624
1579
  ```
625
1580
 
626
- ### Whitelabel Support
627
-
628
- Works automatically — the MCP client routes App Builder calls through whitelabel API endpoints just like all other Kolbo tools.
629
-
630
1581
  ### ⚠️ Rules
631
1582
 
632
- - **Always confirm before `app_builder_delete_session`** — it permanently deletes the GitHub repo, Supabase DB (unless user-connected), deployed files, and all history. IRREVERSIBLE.
633
- - **After every successful build**, show the `deployment_url` prominently that's the live public URL, no setup needed.
634
- - **On build timeout** (rare): use `app_builder_get_build_status` to check manually, then continue or report to user.
1583
+ - **Always confirm before `app_builder_delete_session`** — permanently deletes the GitHub repo, Supabase DB (unless user-connected), deployed files, and history. IRREVERSIBLE.
1584
+ - **On build timeout** (rare): use `app_builder_get_build_status` to check manually, then continue or report.
1585
+
1586
+ Whitelabel works automatically — the MCP client routes App Builder calls through whitelabel API endpoints.
635
1587
 
636
1588
  ---
637
1589
 
@@ -659,15 +1611,7 @@ When the user shares an image and asks about it:
659
1611
 
660
1612
  ## Sharing HTML Artifacts
661
1613
 
- When you generate an HTML, SVG, or Mermaid artifact in the chat, a **Share** button appears in the artifact preview toolbar (next to Desktop / Mobile). Clicking it:
-
- 1. Uploads the artifact to Kolbo's hosting platform
- 2. Copies a permanent public URL to the clipboard (e.g. `https://api.kolbo.ai/api/shared-artifact-raw/<token>`)
- 3. Shows a toast confirming the link was copied
-
- Anyone with the URL can view the rendered page — no login required.
-
- **Requirements:** You must be logged in (`kolbo auth login`). The share button returns an error toast if you are not authenticated.
+ HTML/SVG/Mermaid artifacts have a **Share** button in the preview toolbar that uploads the artifact and copies a permanent public URL (no login required to view). Requires the user to be authenticated (`kolbo auth login`).
 
 ---
 
@@ -712,45 +1656,37 @@ If Kolbo tools timeout or aren't listed, the MCP server may not be wired. Tell t
 This re-wires the MCP configuration automatically. Then restart the session.
 
 ### "Rate limited" (429 errors)
- Kolbo allows 10 generation requests per minute per user per tool type (video, image, etc. are separate pools). Wait 60 seconds (the window resets) and retry only the failed calls. Use `generate_creative_director` for batch image work instead of multiple `generate_image` calls. The API queues requests — it never silently drops them.
+ Wait 60s for the window to reset, then retry only the failed calls (see the sketch below). For batch image work prefer `generate_creative_director` over multiple `generate_image` calls. Full rate-limit details + retry sequence: see "Rate Limiting & Batch Generation".
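+ 
+ A minimal retry sketch under those rules. The 60s window is documented here; how your MCP client surfaces a 429 (an HTTP-like `status` field is assumed below) is not:
+ 
+ ```ts
+ // Retry only the failed call after the 60s window resets.
+ // Adjust the 429 check to however your client reports rate limiting.
+ async function withRateLimitRetry<T>(call: () => Promise<T>): Promise<T> {
+   try {
+     return await call();
+   } catch (err: any) {
+     if (err?.status !== 429) throw err; // only handle rate limiting
+     await new Promise((resolve) => setTimeout(resolve, 60_000)); // window reset
+     return call(); // single retry of the failed call only
+   }
+ }
+ ```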
 
 ---
 
 ## Examples
 
- Natural-language triggers that should prompt this skill + a tool call:
+ Natural-language triggers → tool routing:
 
 - "Generate an image of a neon-lit Tokyo street at night" → `list_models` (image) → `generate_image`
- - "Use Midjourney to generate a Tokyo street" → `generate_image` with model "midjourney" (user named the model — skip `list_models`)
+ - "Use Midjourney to generate X" → `generate_image` with model "midjourney" (user named the model — skip `list_models`)
 - "Remove the background from this image" → `list_models` (image_edit) → `generate_image_edit`
- - "Create a storyboard for a coffee brand ad" → `list_models` (image) → `generate_creative_director`
- - "Create a 5-second cinematic video of ocean waves at sunset" → `list_models` (video) → `generate_video` with camera + mood guidance
- - "Make 5 videos with Seedance 2 Fast, 15s, 16:9" → fire all 5 `generate_video` calls in parallel (user specified everything — skip `list_models`, skip cost confirmation)
- - "Animate this product photo with a 360° orbit" → `list_models` (video_from_image) → `generate_video_from_image`
+ - "Create a storyboard for a coffee brand ad" / "4 angles of this character" → `generate_creative_director`
+ - "Make 5 videos with Seedance 2 Fast, 15s, 16:9" → fire all 5 `generate_video` calls in parallel (skip `list_models`, skip cost confirmation)
+ - "Animate this product photo with a 360° orbit" → `generate_video_from_image`
 - "Restyle this video as anime" → `generate_video_from_video`
 - "Make this character talk with this voiceover" → `generate_lipsync`
 - "Create a smooth transition between these two frames" → `generate_first_last_frame`
- - "Make a lo-fi hip hop beat, instrumental, 85 BPM" → `list_models` (music) → `generate_music`
- - "Say this in English with a natural female voice: Welcome to Kolbo" → `list_voices` → `generate_speech`
- - "Generate a door slam sound effect" → `list_models` (sound) → `generate_sound`
- - "Create a 3D model of a medieval castle" → `list_models` (three_d) → `generate_3d`
- - "Transcribe this podcast episode" → `transcribe_audio`
- - "What's being said in this video?" → `transcribe_audio` → analyze the text
- - "Generate word-by-word subtitles for this audio" → `transcribe_audio` → share `word_by_word_srt_url`
- - "Analyze this video" / "What do you see?" / "What's in this?" (with video file) → `upload_media` → `chat_send_message` with `media_urls` (omit model — auto-routes to Gemini)
- - "What prompts are shown in this video?" → `upload_media` → `chat_send_message` with `media_urls` (omit model — auto-routes to Gemini)
- - "Keep the same character across all these images" → `create_visual_dna` → `generate_image` with `visual_dna_ids`
- - "Upload this file to my media library" → `upload_media`
- - "Host this HTML page" / "Publish this landing page" / "Give me a public URL for this file" → `upload_media` → share the returned `url` (Kolbo CDN serves any file type publicly)
- - "What video models are available?" → `list_models` (video)
+ - "Make a lo-fi hip hop beat, instrumental, 85 BPM" → `generate_music`
+ - "Say this in English with a natural female voice: Welcome to Kolbo" → `list_voices` → `generate_speech`
+ - "Generate a door slam sound effect" → `generate_sound`
+ - "Create a 3D model of a medieval castle" → `generate_3d`
+ - Transcription / SRT / "what was said" / word-by-word timing → `transcribe_audio` (see Video/Audio Analysis section for full routing)
+ - "Analyze this video" / "What's in this?" → `upload_media` → `chat_send_message` (see decision tree for >100MB / long / dialogue-dense exceptions)
+ - Multi-video analysis → upload all in parallel, then ONE `chat_send_message` with up to 10 URLs (see the sketch after this list)
+ - "Keep the same character across these images" → `create_visual_dna` → `generate_image` with `visual_dna_ids`
+ - "Upload this" / "Host this HTML page" / "Public URL for this file" → `upload_media` (Kolbo CDN serves any file type publicly)
 - "How many credits do I have?" → `check_credits`
- - "What's in this image?" (with upload) → Read the image directly with your own vision — no Kolbo API call needed
- - "Analyze these 10 frames" (with multiple images) → Read all images directly with your own vision — you handle up to 10 natively
- - "Analyze these 5 videos" → upload all 5 with `upload_media`, then ONE `chat_send_message` with all 5 URLs in `media_urls`
- - "Build me a todo app" / "Create a React app with a login form" / "Make me a landing page with a waitlist" → `app_builder_list_projects` → `app_builder_create_session` → `app_builder_generate_app` → show `deployment_url`
- - "Add dark mode to my app" / "Change the color scheme" / "Add a contact form" → `app_builder_list_generations` → `app_builder_edit_app`
- - "How do I run my app locally?" / "Give me the GitHub repo" / "I want the Supabase credentials" → `app_builder_get_session` → show `github_repo_url` + `supabase_url` + `supabase_anon_key`
- - "List my apps" / "What App Builder sessions do I have?" → `app_builder_list_projects` → `app_builder_list_sessions`
- - "Create motion graphics" / "animated text" / "title sequence" → load the `remotion-best-practices` skill for Remotion-based motion graphics
- - "Edit this video" / "cut this clip" / "remove silence" / "add subtitles" / "convert to 9:16" → load the `video-production` skill for FFmpeg-based editing
- - "Create a short-form video" / "make a reel" / "YouTube short" → load the `short-form-video` skill
+ - Image analysis ("what's in this image?", "analyze these N frames") → `Read` directly with native vision, never `upload_media` + chat
+ - "Build me a todo app" / "Make a landing page with waitlist" → `app_builder_list_projects` → `app_builder_create_session` → `app_builder_generate_app` → show `deployment_url`
+ - "Add dark mode to my app" / "Add a contact form" → `app_builder_list_generations` → `app_builder_edit_app`
+ - "Give me the GitHub repo" / "Supabase credentials" → `app_builder_get_session` → return `github_repo_url` + `supabase_url` + `supabase_anon_key`
+ - "Create motion graphics" / "animated text" / "title sequence" → load `remotion-best-practices` skill
+ - "Edit this video" / "cut" / "trim" / "remove silence" / "add subtitles" / "convert to 9:16" → load `video-production` skill (FFmpeg)
+ - "Create a short-form video" / "make a reel" / "YouTube short" → load `short-form-video` skill