@kolbo/kolbo-code-linux-arm64-musl 2.2.5 → 2.3.0
- package/bin/kolbo +0 -0
- package/package.json +1 -1
- package/skills/kolbo/SKILL.md +177 -1651
- package/skills/kolbo/VERSION +1 -0
- package/skills/kolbo/references/models/creative-director.md +106 -0
- package/skills/kolbo/references/models/gpt-image.md +111 -0
- package/skills/kolbo/references/models/html-presentation.md +139 -0
- package/skills/kolbo/references/models/landing-page.md +135 -0
- package/skills/kolbo/references/models/music.md +120 -0
- package/skills/kolbo/references/models/nano-banana.md +97 -0
- package/skills/kolbo/references/models/prompt-copilot.md +133 -0
- package/skills/kolbo/references/models/seedance.md +90 -0
- package/skills/kolbo/references/models/veo.md +110 -0
- package/skills/kolbo/references/models/visual-code.md +80 -0
- package/skills/kolbo/references/workflows/app-builder.md +41 -0
- package/skills/kolbo/references/workflows/cost-and-validation.md +138 -0
- package/skills/kolbo/references/workflows/dtc-ads.md +126 -0
- package/skills/kolbo/references/workflows/marketing-studio.md +157 -0
- package/skills/kolbo/references/workflows/marketplace-cards.md +146 -0
- package/skills/kolbo/references/workflows/media-library.md +76 -0
- package/skills/kolbo/references/workflows/product-photoshoot.md +199 -0
- package/skills/kolbo/references/workflows/production-log.md +155 -0
- package/skills/kolbo/references/workflows/research-first.md +174 -0
- package/skills/kolbo/references/workflows/transcription.md +163 -0
- package/skills/kolbo/references/workflows/troubleshooting.md +73 -0
- package/skills/kolbo/references/workflows/visual-dna.md +233 -0
package/skills/kolbo/SKILL.md
CHANGED
|
@@ -1,1725 +1,251 @@
|
|
|
1
1
|
---
|
|
2
|
+
version: 0.4.0
|
|
2
3
|
name: kolbo
|
|
3
|
-
description:
|
|
4
|
+
description: |
|
|
5
|
+
Generate, edit, or analyze creative media via the Kolbo AI MCP server.
|
|
6
|
+
Covers images (GPT Image 2, Nano Banana, Flux, ...), video (Seedance 2,
|
|
7
|
+
Veo 3.1, Kling, Hailuo, ...), music (Suno), TTS (ElevenLabs), 3D,
|
|
8
|
+
transcription, Visual DNA (character consistency), Marketing Studio
|
|
9
|
+
(UGC + DTC ads + product photoshoot + marketplace cards),
|
|
10
|
+
Creative Director (multi-scene batches), HTML artifact publishing
|
|
11
|
+
(presentations, landing pages, dashboards), and the App Builder.
|
|
12
|
+
|
|
13
|
+
Use when: "generate", "create", "make me a", "edit", "animate",
|
|
14
|
+
"transcribe", "Visual DNA", "the same character", "UGC ad",
|
|
15
|
+
"TikTok / Reels / Shorts", "unboxing", "product review", "TV spot",
|
|
16
|
+
"Pinterest pin", "product photo", "lifestyle shot", "hero banner",
|
|
17
|
+
"ad pack", "social carousel", "virtual try-on", "Amazon listing",
|
|
18
|
+
"marketplace cards", "A+ content", "build a presentation/slide deck",
|
|
19
|
+
"landing page", "dashboard / widget / game", "music / song / jingle",
|
|
20
|
+
"TTS / voice", "sound effect", "3D model", "build me an app".
|
|
21
|
+
|
|
22
|
+
Chain: train Visual DNA → use vdna_id in any DNA-aware tool;
|
|
23
|
+
research-first → persist brand kit (.kolbo/brand-kits/) → DTC ads /
|
|
24
|
+
product photoshoot / marketplace cards; generate frames (Creative
|
|
25
|
+
Director) → animate each frame (generate_video_from_image).
|
|
26
|
+
|
|
27
|
+
NOT for: video editing / FFmpeg work (use video-production skill),
|
|
28
|
+
motion graphics (use remotion-best-practices skill), code editing,
|
|
29
|
+
general chat outside media generation.
|
|
30
|
+
argument-hint: "[prompt-or-command] [--model <name>] [--image <path>] [--video <path>]"
|
|
31
|
+
allowed-tools: Bash, Read, Write, Edit
|
|
4
32
|
---
|
|
5
33
|
|
|
6
34
|
# Kolbo AI — Creative Generation, Analysis & Transcription
|
|
7
35
|
|
|
8
36
|
You have direct access to the Kolbo AI creative platform via MCP tools (auto-configured by `kolbo auth login`). Use them to generate and deliver real content — do NOT just describe what you would create.
|
|
9
37
|
|
|
10
|
-
> 🚫 **Don't dump generated URLs as bare text or markdown links in chat** — the UI already renders artifacts as a gallery tile + canvas. Refer by description ("the rainy scene"), store URLs in `.kolbo/production.md`. INLINE `` images ARE allowed
|
|
38
|
+
> 🚫 **Don't dump generated URLs as bare text or markdown links in chat** — the UI already renders artifacts as a gallery tile + canvas. Refer by description ("the rainy scene"), store URLs in `.kolbo/production.md`. INLINE `` images ARE allowed for catalog-style replies (per-item thumbs in numbered lists).
|
|
11
39
|
|
|
12
|
-
|
|
13
|
-
|
|
14
|
-
### Generation
|
|
15
|
-
|
|
16
|
-
| Tool | Description |
|
|
17
|
-
|------|-------------|
|
|
18
|
-
| `generate_image` | Create a **single** image from a text prompt. Supports Visual DNA, moodboards, reference images, web-search grounding. |
|
|
19
|
-
| `generate_image_edit` | Edit/transform an existing image (background removal, color changes, compositing). Pass source images + edit prompt. |
|
|
20
|
-
| `generate_creative_director` | **Generate 2–8 related images or videos as one coherent set.** Use this INSTEAD of multiple `generate_image` calls whenever the user wants more than one related output (storyboards, ad campaigns, product sets, character sheets, scene variations). Handles style consistency and runs scenes in parallel internally. |
|
|
21
|
-
| `generate_video` | Create videos from text prompts. Supports reference images for style/composition guidance. Does **not** support Visual DNA — use `generate_elements` for character-consistent video. |
|
|
22
|
-
| `generate_video_from_image` | Animate a still image into video. Prompt describes the motion, not the subject. |
|
|
23
|
-
| `generate_video_from_video` | Restyle/transform an existing video (style transfer, scene restyling, subject swap). Keeps the original motion. |
|
|
24
|
-
| `generate_elements` | Generate video from reference assets (images/videos) + prompt. **Supports Visual DNA** for character-consistent video — this is the primary tool for animating characters/scenes with Visual DNA. |
|
|
25
|
-
| `generate_first_last_frame` | Generate video that morphs from a first frame to a last frame (keyframe interpolation). |
|
|
26
|
-
| `generate_lipsync` | Lipsync an audio track to a source image or video face. Accepts local files or URLs. |
|
|
27
|
-
| `generate_music` | Create music from descriptions. Supports instrumental, custom lyrics, style, vocal gender. |
|
|
28
|
-
| `generate_speech` | Convert text to speech (TTS). Default: ElevenLabs. Use `list_voices` to pick a voice. |
|
|
29
|
-
| `generate_sound` | Generate sound effects from descriptions (foley, ambient, impacts, UI sounds). |
|
|
30
|
-
| `generate_3d` | Generate 3D models from text, single image, or multi-view images. Returns GLB, FBX, OBJ, USDZ. |
|
|
31
|
-
|
|
32
|
-
### Transcription & Analysis
|
|
33
|
-
|
|
34
|
-
| Tool | Description |
|
|
35
|
-
|------|-------------|
|
|
36
|
-
| `transcribe_audio` | Transcribe audio or video into text + SRT subtitles + word-by-word SRT. Accepts local files or URLs. |
|
|
37
|
-
|
|
38
|
-
### Voice & Model Discovery
|
|
39
|
-
|
|
40
|
-
| Tool | Description |
|
|
41
|
-
|------|-------------|
|
|
42
|
-
| `list_models` | Browse available AI models filtered by type. |
|
|
43
|
-
| `list_voices` | List available TTS voices with filtering by provider, language, gender. |
|
|
44
|
-
| `check_credits` | Check remaining Kolbo credit balance. |
|
|
45
|
-
| `get_generation_status` | Poll status of an in-progress generation by ID (fallback for timeouts). |
|
|
46
|
-
|
|
47
|
-
### Media Library
|
|
48
|
-
|
|
49
|
-
| Tool | Description |
|
|
50
|
-
|------|-------------|
|
|
51
|
-
| `upload_media` | Upload ANY local file (or remote URL) to Kolbo CDN → returns a stable public URL. Use for feeding media to `chat_send_message`, hosting HTML, or any multi-tool workflow that re-uses the same file. |
|
|
52
|
-
| `list_media` | Browse the user's library — both uploaded files AND saved AI outputs. Filter by `project_id`, `folder_id`, `category` (ai / uploaded / edited / favorites / training-lab), `type`, `source_type`, `sort`; paginate; full-text `search`. Response items include `is_favorited`, `prompt`, `dimensions`, `duration`, `project_id`. |
|
|
53
|
-
| `get_media` | Fetch one item by id (full details + extended metadata). Use when the user references a specific past creation. |
|
|
54
|
-
| `get_media_stats` | Counts + storage usage: `{ total, images, videos, audio, total_size_bytes }`. Optional `project_id`. Use for "how many videos do I have?" / "what's my usage?" / sizing a bulk op. |
|
|
55
|
-
| `favorite_media` / `unfavorite_media` | Toggle favorite. Idempotent. Per-user (shared projects: your favorites ≠ teammates'). |
|
|
56
|
-
| `delete_media` | **Soft delete** → trash (30-day recovery). Owner only. This is the right call for "delete this". |
|
|
57
|
-
| `restore_media` | Restore from trash. Pair with `delete_media`. |
|
|
58
|
-
| `permanently_delete_media` | **HARD delete** — MongoDB + S3 + folders + source generation record. NOT REVERSIBLE. **Always confirm with the user before calling.** Never default here for "delete". |
|
|
59
|
-
| `move_media` | Move one item to a different project (caller must own the item + have access to the target project). |
|
|
60
|
-
| `bulk_delete_media` | Soft-delete up to 1000 ids. Items not owned by the user are silently skipped. |
|
|
61
|
-
| `bulk_restore_media` | Restore up to 1000 trashed ids. |
|
|
62
|
-
| `bulk_permanently_delete_media` | Hard-delete up to 1000 ids. **Always confirm with the user before calling.** |
|
|
63
|
-
| `bulk_move_media` | Move up to 1000 ids to another project. **Atomic** — if ANY id isn't owned by the caller, the whole op is rejected; do not retry partially. |
|
|
64
|
-
| `move_folder_contents` | Move every item in a folder to another project (owner-only on every item). |
|
|
65
|
-
| `list_media_folders` | List the user's folders (owned + shared). Folders span projects. |
|
|
66
|
-
| `create_media_folder` / `update_media_folder` / `delete_media_folder` | Folder lifecycle. Delete is owner-only and detaches items (items stay in the library); **confirm before delete**. |
|
|
67
|
-
| `add_media_to_folder` / `remove_media_from_folder` | Up to 500 ids per call. Idempotent on add. |
|
|
68
|
-
| `share_media_folder` | Share by email (resolved to user ids; emails not found come back in `not_found`). Owner only. Members can list/add/remove items but cannot delete or reshare the folder. |
|
|
69
|
-
| `unshare_media_folder` | Revoke one user's access. Takes `user_id` from the folder's `shared_with` array. |
|
|
70
|
-
|
|
71
|
-
### Visual DNA (Character/Style Consistency)
|
|
40
|
+
This file is the **always-loaded core**: tool inventory + universal hard rules + routing index. For any model-specific prompt rules, Visual DNA workflow, production log format, marketing workflow, cost validation, etc., **Read the matching `references/` file from the index below**. Don't try to remember the rules — load the file when you need them.
|
|
72
41
|
|
|
73
|
-
|
|
74
|
-
|------|-------------|
|
|
75
|
-
| `create_visual_dna` | Create a Visual DNA profile from reference images/video/audio for character, style, product, or scene consistency. |
|
|
76
|
-
| `list_visual_dnas` | List your Visual DNA profiles (id, name, type, thumbnail). |
|
|
77
|
-
| `get_visual_dna` | Fetch full profile details including system_prompt and reference images. |
|
|
78
|
-
| `delete_visual_dna` | Delete a Visual DNA profile. |
|
|
42
|
+
## Step 0 — Bootstrap
|
|
79
43
|
|
|
80
|
-
|
|
44
|
+
Once per conversation, before any other Kolbo tool call:
|
|
81
45
|
|
|
82
|
-
|
|
46
|
+
1. **Run `check_credits`.** If it fails with "Session expired" / "Not authenticated", ask the user to run `kolbo auth login` (or their branded CLI command like `sapir auth login`) and reload the editor.
|
|
47
|
+
2. **If `list_models` returns empty**, MCP isn't wired — same fix.
|
|
48
|
+
3. Remember the credit balance for the session; don't re-check on every turn.
|
|
83
49
|
|
|
84
|
-
|
|
85
|
-
|------|-------------|
|
|
86
|
-
| `list_moodboards` | List available moodboards (personal, system presets, org). |
|
|
87
|
-
| `get_moodboard` | Fetch a moodboard's master_prompt, style_guide, and images. |
|
|
88
|
-
| `list_presets` | Browse generation presets (image/video/music templates with bundled style direction). |
|
|
50
|
+
If the user is on a whitelabel build (`sapir`, etc.), they must use their branded command — not `kolbo`. See `references/workflows/troubleshooting.md`.
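The bootstrap steps above amount to "check once, cache the balance, and turn auth failures into a login hint". A minimal sketch of that pattern follows. The `Bootstrap` class, the injected `check_credits` callable, and the use of `RuntimeError` to carry the error text are all assumptions made for illustration; the real MCP tool call and its error shape may differ.

```python
AUTH_ERRORS = ("Session expired", "Not authenticated")


class Bootstrap:
    """Hypothetical helper: caches the credit balance for one conversation."""

    def __init__(self, check_credits, login_command="kolbo auth login"):
        # check_credits is an injected callable standing in for the MCP tool.
        self._check_credits = check_credits
        self._login_command = login_command
        self._balance = None

    def balance(self):
        # Only the first call reaches the backend; later calls reuse the value.
        if self._balance is None:
            try:
                self._balance = self._check_credits()
            except RuntimeError as err:
                if any(marker in str(err) for marker in AUTH_ERRORS):
                    raise RuntimeError(
                        f"Ask the user to run `{self._login_command}` "
                        "and reload the editor"
                    ) from err
                raise
        return self._balance
```

On a whitelabel build the caller would pass the branded command, for example `Bootstrap(check_credits, login_command="sapir auth login")`.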
|
|
89
51
|
|
|
90
|
-
|
|
52
|
+
## Routing Index — Read These Files on Demand
|
|
91
53
|
|
|
92
|
-
|
|
|
93
|
-
|
|
94
|
-
|
|
|
95
|
-
|
|
|
96
|
-
|
|
|
97
|
-
|
|
98
|
-
|
|
54
|
+
| If the user wants to… | Read first |
|
|
55
|
+
|---|---|
|
|
56
|
+
| Generate a **Seedance 2** video | `references/models/seedance.md` |
|
|
57
|
+
| Generate a **GPT Image 2** image | `references/models/gpt-image.md` |
|
|
58
|
+
| Generate a **Nano Banana / Gemini** image | `references/models/nano-banana.md` |
|
|
59
|
+
| Generate a **Veo 3 / 3.1** video | `references/models/veo.md` |
|
|
60
|
+
| Build a **multi-scene set** (Creative Director, storyboard, campaign batch, 4+ angles) | `references/models/creative-director.md` |
|
|
61
|
+
| Generate **music** (Suno, song, lyrics, jingle, score) | `references/models/music.md` |
|
|
62
|
+
| Build an **HTML presentation / slide deck** | `references/models/html-presentation.md` |
|
|
63
|
+
| Build a **landing page / marketing site** | `references/models/landing-page.md` |
|
|
64
|
+
| Build a **dashboard / data viz / interactive widget / mini-game / UI mockup** | `references/models/visual-code.md` |
|
|
65
|
+
| Generate with **any other model** (Flux, Kling, Sora, Hailuo, ElevenLabs, DeepDub, …) — also covers universal prompt-engineering basics | `references/models/prompt-copilot.md` |
|
|
66
|
+
| Build a **UGC ad / TV spot / branded video / unboxing / product review / virtual try-on** | `references/workflows/marketing-studio.md` |
|
|
67
|
+
| Compose a **DTC ad image** (brand kit + ad format + avatar + product + reference media) | `references/workflows/dtc-ads.md` |
|
|
68
|
+
| Generate **brand product imagery** (studio shot, lifestyle, Pinterest pin, hero banner, carousel, ad pack, virtual try-on, conceptual, restyle) | `references/workflows/product-photoshoot.md` |
|
|
69
|
+
| Generate **marketplace listing cards** (Amazon main + secondary + A+ content) | `references/workflows/marketplace-cards.md` |
|
|
70
|
+
| Use **Visual DNA** / character consistency / `@name` syntax | `references/workflows/visual-dna.md` |
|
|
71
|
+
| Start or continue a **multi-step production** (storyboard → scenes → final cut) | `references/workflows/production-log.md` |
|
|
72
|
+
| **Transcribe** or **analyze** audio/video | `references/workflows/transcription.md` |
|
|
73
|
+
| **Scrape brand/product info** before generating + persist as `.kolbo/brand-kits/<slug>.md` | `references/workflows/research-first.md` |
|
|
74
|
+
| Browse, manage, or present existing **media library** items | `references/workflows/media-library.md` |
|
|
75
|
+
| Use the **App Builder** (React app generation) | `references/workflows/app-builder.md` |
|
|
76
|
+
| Confirm **cost** or validate **resolution / aspect / duration** against model caps | `references/workflows/cost-and-validation.md` |
|
|
77
|
+
| Hit an **auth / MCP / 429** issue | `references/workflows/troubleshooting.md` |
|
|
78
|
+
|
|
79
|
+
Each `references/models/*.md` mirrors the matching skill prompt in `kolbo-api/src/config/systemPrompt.js` — same battle-tuned rules that power Kolbo's web-app help widget. Keep parity (see `packages/opencode/CLAUDE.md` "MCP & Skill Sync Rule").
|
|
99
80
|
|
|
100
|
-
|
|
101
|
-
|------|-------------|
|
|
102
|
-
| `app_builder_list_projects` | List all Kolbo projects to find a `project_id` for App Builder. |
|
|
103
|
-
| `app_builder_create_session` | Create a new App Builder session inside a project. Returns `session_id`. |
|
|
104
|
-
| `app_builder_generate_app` | Generate a full React app from a text prompt. Fires build, polls until deployed, returns live URL. |
|
|
105
|
-
| `app_builder_edit_app` | Edit an existing app with a natural language instruction. Same fire-and-poll pattern. |
|
|
106
|
-
| `app_builder_get_build_status` | Check current build status manually (fallback after timeout). |
|
|
107
|
-
| `app_builder_get_session` | Get session details including GitHub repo URL and Supabase connection info for local dev. |
|
|
108
|
-
| `app_builder_list_sessions` | List all App Builder sessions in a project. |
|
|
109
|
-
| `app_builder_list_generations` | List all generations for a session (needed for `edit_app`). |
|
|
110
|
-
| `app_builder_delete_session` | Permanently delete a session and all resources. IRREVERSIBLE. |
|
|
111
|
-
|
|
112
|
-
### Artifact Publishing
|
|
81
|
+
## Available MCP Tools
|
|
113
82
|
|
|
83
|
+
### Generation
|
|
114
84
|
| Tool | Description |
|
|
115
85
|
|------|-------------|
|
|
116
|
-
| `
|
|
117
|
-
|
|
118
|
-
|
|
119
|
-
|
|
120
|
-
|
|
121
|
-
|
|
122
|
-
|
|
123
|
-
|
|
124
|
-
|
|
|
86
|
+
| `generate_image` | Single image from a text prompt. Supports Visual DNA, moodboards, reference images, web-search grounding. |
|
|
87
|
+
| `generate_image_edit` | Edit/transform an existing image. Pass `source_images` + edit prompt. |
|
|
88
|
+
| `generate_creative_director` | **2–8 related images or videos as one coherent set.** Use INSTEAD of multiple `generate_image` calls for any related multi-output. |
|
|
89
|
+
| `generate_video` | Text-to-video. Does **not** support Visual DNA — use `generate_elements` for character-consistent video. |
|
|
90
|
+
| `generate_video_from_image` | Animate a still. Prompt describes motion, not subject. |
|
|
91
|
+
| `generate_video_from_video` | Restyle/transform an existing video. Keeps original motion. |
|
|
92
|
+
| `generate_elements` | Reference-driven video. **Primary route for DNA → video.** |
|
|
93
|
+
| `generate_first_last_frame` | Keyframe interpolation between two frames. |
|
|
94
|
+
| `generate_lipsync` | Lipsync audio to an image or video face. |
|
|
95
|
+
| `generate_music` | Music generation (Suno + variants). |
|
|
96
|
+
| `generate_speech` | TTS. Use `list_voices` to pick a voice. |
|
|
97
|
+
| `generate_sound` | Sound effects. |
|
|
98
|
+
| `generate_3d` | 3D models from text / single image / multi-view. Returns GLB/FBX/OBJ/USDZ. |
|
|
99
|
+
|
|
100
|
+
### Discovery, Library, Visual DNA, Moodboards, Chat, App Builder, Publishing
|
|
101
|
+
| Tool | Purpose |
|
|
102
|
+
|------|---------|
|
|
103
|
+
| `list_models` / `list_voices` / `check_credits` / `get_generation_status` / `get_session_usage` | Discovery + status |
|
|
104
|
+
| `upload_media` / `list_media` / `get_media` / `get_media_stats` / `favorite_media` / `unfavorite_media` / `delete_media` / `restore_media` / `permanently_delete_media` / `move_media` / `bulk_*_media` / `*_media_folder` | Media library — see `workflows/media-library.md` |
|
|
105
|
+
| `create_visual_dna` / `list_visual_dnas` / `get_visual_dna` / `delete_visual_dna` | Visual DNA — see `workflows/visual-dna.md` |
|
|
106
|
+
| `list_moodboards` / `get_moodboard` / `list_presets` | Style overlays |
|
|
107
|
+
| `chat_send_message` / `chat_list_conversations` / `chat_get_messages` | Kolbo chat with optional `media_urls` (up to 10 per call) |
|
|
108
|
+
| `app_builder_*` (9 tools) | Full React app generation — see `workflows/app-builder.md` |
|
|
109
|
+
| `publish_html_artifact` | Publish HTML / SVG / Mermaid to `sites.kolbo.ai`. Server dedupes by content hash. Strict CSP. |
|
|
110
|
+
|
|
111
|
+
## ⚠️ If the User Names a Tool, USE THAT TOOL (HARD RULE)
|
|
112
|
+
|
|
113
|
+
A user-named tool — in any language — overrides every other rule. Recognized aliases:
|
|
114
|
+
|
|
115
|
+
| User said (any language) | Use exactly |
|
|
125
116
|
|---|---|
|
|
126
|
-
| "director", "creative director",
|
|
127
|
-
| "image edit", "edit", "modify", "remove background", **"עריכת תמונה"** (
|
|
117
|
+
| "director", "creative director", **"במאי"**, "ad set", "campaign tool", "storyboard tool" | `generate_creative_director` |
|
|
118
|
+
| "image edit", "edit", "modify", "remove background", **"עריכת תמונה"** (paired with a per-image instruction) | `generate_image_edit` |
|
|
128
119
|
| "elements" / **"אלמנטים"** | `generate_elements` |
|
|
129
120
|
| "first/last frame" / **"פריימים"** | `generate_first_last_frame` |
|
|
130
121
|
| "lipsync" / **"ליפסינק"** | `generate_lipsync` |
|
|
131
122
|
|
|
132
|
-
**Mixed signals — named tool always wins.** "
|
|
123
|
+
**Mixed signals — named tool always wins.** "Image edit with the director tool to make 4 angles" → `generate_creative_director`.
|
|
133
124
|
|
|
134
|
-
## ⚠️ Generate vs Edit
|
|
125
|
+
## ⚠️ Generate vs Edit (when the user did NOT name a tool)
|
|
135
126
|
|
|
136
127
|
| User intent | Action | NOT this |
|
|
137
128
|
|-------------|--------|----------|
|
|
138
|
-
| "Create a video from scratch"
|
|
139
|
-
| "Edit
|
|
140
|
-
| "Create motion graphics
|
|
141
|
-
| "Animate this image"
|
|
142
|
-
| "Restyle this video as anime" | `generate_video_from_video`
|
|
143
|
-
| "Modify THIS one image" — change
|
|
144
|
-
|
|
|
145
|
-
| "4 variations of THIS exact image" (same prompt, different seeds
|
|
129
|
+
| "Create a video from scratch" | `generate_video` | — |
|
|
130
|
+
| "Edit / Cut / Trim / Add subtitles / Remove silence / Convert to 9:16" | Load `video-production` skill → FFmpeg | ❌ `generate_video` |
|
|
131
|
+
| "Create motion graphics / animated text / title sequence" | Load `remotion-best-practices` skill | ❌ `generate_video` |
|
|
132
|
+
| "Animate this image" | `generate_video_from_image` | — |
|
|
133
|
+
| "Restyle this video as anime" | `generate_video_from_video` | — |
|
|
134
|
+
| "Modify THIS one image" — change bg, remove object, recolor | `generate_image_edit` | ❌ Not for multi-output |
|
|
135
|
+
| "4 angles / poses / views of this character" / "variations of this character" | `generate_creative_director` with `visual_dna_ids` | ❌ Don't loop `generate_image_edit` |
|
|
136
|
+
| "4 variations of THIS exact image" (same prompt, different seeds) | `generate_image` with `num_images=4` | ❌ Not `generate_image_edit` |
|
|
146
137
|
|
|
147
138
|
## Core Workflow
|
|
148
139
|
|
|
149
|
-
1. **Check credits** ONCE per conversation
|
|
150
|
-
2. **Discover models** with `list_models` using a `type` filter — but **skip
|
|
151
|
-
3. **Pick the model**:
|
|
152
|
-
-
|
|
153
|
-
-
|
|
154
|
-
-
|
|
155
|
-
4. **
|
|
156
|
-
5. **
|
|
157
|
-
|
|
158
|
-
**For batch operations** (generating multiple items at once), see the "Rate Limiting & Batch Generation" section below — it overrides the per-item steps above.
|
|
159
|
-
|
|
160
|
-
### Model Types (for `list_models`)
|
|
161
|
-
|
|
162
|
-
Use the DB type name directly. Legacy aliases (right column) still work but prefer DB names.
|
|
163
|
-
|
|
164
|
-
| DB Type | Legacy alias | Use for |
|
|
165
|
-
|---------|-------------|---------|
|
|
166
|
-
| `text_to_img` | `image` | Still-image generation |
|
|
167
|
-
| `image_editing` | `image_edit` | Image editing / transformation |
|
|
168
|
-
| `text_to_video` | `video` | Text-to-video |
|
|
169
|
-
| `img_to_video` | `video_from_image` | Image-to-video animation |
|
|
170
|
-
| `draw_to_video` | — | Draw-to-video (Hailuo, Seedance variants) |
|
|
171
|
-
| `video_to_video` | `video_from_video` | Video restyling / style transfer |
|
|
172
|
-
| `elements` | *(same)* | Reference-to-video — Visual DNA-driven video |
|
|
173
|
-
| `firstlastgenerations` | `first_last_frame` | Keyframe interpolation |
|
|
174
|
-
| `lipsync-image` | (part of `lipsync`) | Lipsync with image source face |
|
|
175
|
-
| `lipsync-video` | (part of `lipsync`) | Lipsync with video source face |
|
|
176
|
-
| `music_gen` | `music` | Music generation |
|
|
177
|
-
| `text_to_speech` | `speech` | Text-to-speech (TTS) |
|
|
178
|
-
| `text_to_sound` | `sound` | Sound effects |
|
|
179
|
-
| `stt` | `transcription` | Audio/video transcription |
|
|
180
|
-
| `text` | `chat` | Chat / AI language models |
|
|
181
|
-
| `3d_text_to_model` | (part of `three_d`) | 3D from text prompt |
|
|
182
|
-
| `3d_image_to_model` | (part of `three_d`) | 3D from single image |
|
|
183
|
-
| `3d_multi_image_to_model` | (part of `three_d`) | 3D from multiple images |
|
|
184
|
-
| `3d_world` | (part of `three_d`) | 3D world generation |
|
|
185
|
-
|
|
186
|
-
> **Note**: `lipsync` alias returns both `lipsync-image` + `lipsync-video`. `three_d` alias returns all four 3D types.
|
|
187
|
-
|
|
188
|
-
### Cost Awareness
|
|
189
|
-
|
|
190
|
-
Creative generations bill against the user's Kolbo credit balance. **Billing units differ by type** — always apply the correct formula before generating.
|
|
191
|
-
|
|
192
|
-
| Type | Billing unit | Credit range | Example |
|
|
193
|
-
|------|-------------|-------------|---------|
|
|
194
|
-
| **Image** | per image (flat) | 1–30 cr | Flux.1 Fast = 1 cr, Midjourney = 4 cr. If `resolution` is set, check the model's `resolutionMultipliers` from `list_models` — some families multiply cost significantly at higher tiers, others are flat. |
|
|
195
|
-
| **Image edit** | per image (flat) | 2–20 cr | |
|
|
196
|
-
| **Video** | **cr/s × duration** | 2–30 cr/s | Kandinsky 5 Fast × 5s = 10 cr; Seedance 2.0 × 10s = 300 cr. If `resolution` or native audio is set, check the model's `resolutionMultipliers` and `soundCreditMultiplier` from `list_models`. |
|
|
197
|
-
| **Video from image** | **cr/s × duration** | 4–30 cr/s | Same per-second rule as text-to-video. Same multiplier check. |
|
|
198
|
-
| **Elements (ref-to-video)** | **cr/s × duration** | 4–30 cr/s | Same per-second billing as video — check `credit` and multipliers in `list_models type="elements"`. |
|
|
199
|
-
| **Lipsync** | **cr/s × duration** | 5–20 cr/s | |
|
|
200
|
-
| **Music** | per generation (flat) | 15–60 cr | Suno v5 = 15 cr; ElevenLabs Music = 60 cr |
|
|
201
|
-
| **Speech (TTS)** | per 100 characters | 2–5 cr/100 chars | ElevenLabs (5) × 500 chars = 25 cr; Google (2) × 500 chars = 10 cr |
|
|
202
|
-
| **Sound effects** | per generation (flat) | 4–7 cr | |
|
|
203
|
-
| **3D model** | per model (flat) | 5–300 cr | Trellis = 5 cr; Meshy v6 = 150 cr; Marble 1.1 = 300 cr |
|
|
204
|
-
| **Transcription (stt)** | per minute of audio | model.credit × duration_minutes | |
|
|
205
|
-
|
|
206
|
-
**Calculation formulas — apply when confirming cost:**
|
|
207
|
-
- **Video / Lipsync**: `total = model_credit_per_second × duration_seconds`
|
|
208
|
-
- Get the `credit` value from `list_models` (or from a previous call in this session) and multiply by duration.
|
|
209
|
-
- Never assume the credit shown is a flat per-generation cost for these types.
|
|
210
|
-
- **Music**: flat per generation — `total = model_credit` (duration does not change the cost).
|
|
211
|
-
- **TTS**: `total = model_credit × ceil(character_count / 100)`
|
|
212
|
-
- Count the actual characters in the text before estimating. 1000 chars with ElevenLabs = 50 credits.
|
|
213
|
-
- **Images / 3D / Sound effects**: `total = model_credit × quantity`
|
|
214
|
-
- **Resolution / audio multipliers**: if the user sets `resolution` or the model has native audio, read `resolutionMultipliers[tier]` and `soundCreditMultiplier` from `list_models`. Formula: `final = base × resolutionMult × (sound ? soundMult : 1) × durationSeconds`.
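The formulas above can be sketched as a small estimator. This is illustrative only: the function names and argument shapes are assumptions, not part of the Kolbo MCP API. In practice the `credit`, `resolutionMultipliers`, and `soundCreditMultiplier` values would be read from `list_models`.

```python
import math


def per_second_cost(credit_per_second, duration_seconds,
                    resolution_mult=1.0, sound=False, sound_mult=1.0):
    """Video, video-from-image, elements, lipsync: billed per second."""
    effective_sound = sound_mult if sound else 1.0
    return credit_per_second * resolution_mult * effective_sound * duration_seconds


def tts_cost(credit_per_100_chars, text):
    """TTS: billed per started block of 100 characters."""
    return credit_per_100_chars * math.ceil(len(text) / 100)


def flat_cost(model_credit, quantity=1):
    """Images, 3D, sound effects: flat credit times quantity.

    Music is flat per generation, so quantity counts generations, not seconds.
    """
    return model_credit * quantity


# Rates below are the ones implied by the examples in the table above.
print(per_second_cost(30, 10))   # 10 s at 30 cr/s
print(tts_cost(5, "x" * 1000))   # 1000 characters at 5 cr per 100
```

Keeping the per-second, per-100-characters, and flat cases in separate functions makes it harder to treat a per-second `credit` value as a flat price by accident.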
|
|
215
|
-
|
|
216
|
-
**Tier label → pixel mapping (rough):**
|
|
217
|
-
- Images: `"1K"` ≈ 1024px, `"2K"` ≈ Full HD (1920×1080), `"3K"` ≈ QHD (2560×1440), `"4K"` ≈ UHD (3840×2160). Picker shows only tiers the model actually supports (per `supported_resolutions`).
|
|
218
|
-
- Videos: `"720p"` / `"1080p"` / `"1440p"` / `"2160p"` = vertical pixels (720p = HD, 1080p = Full HD, 1440p = QHD, 2160p = 4K UHD). Some models use model-specific labels like `"512P"` / `"1024P"` (Hailuo).
|
|
219
|
-
|
|
220
|
-
**Cost confirmation — when:**
|
|
221
|
-
- **Skip** when the user already specified model + count + duration ("make 5 videos, seedance 2 fast, 15s" IS the confirmation), or when a single generation costs under 5 credits.
|
|
222
|
-
- **Required** otherwise: present a one-line summary ("8 videos × 5s × [model] @ X cr/s = **Y credits**. Proceed?"), suggest a cheaper alternative if one exists, wait for confirm before firing.
|
|
223
|
-
- **Batch totalling 100+ credits**: run `check_credits` first and include the available balance in the summary.
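One way to read the confirmation rules above is as two independent checks plus a summary formatter. The sketch below rests on interpretive assumptions: "a single generation" is taken to mean exactly one item, the 100-credit threshold is applied to the batch total, and all function names are hypothetical.

```python
def needs_cost_confirmation(item_count, credits_per_item, user_specified_all=False):
    """True when a one-line cost summary should be shown before generating."""
    if user_specified_all:
        # The user already gave model + count + duration.
        return False
    if item_count == 1 and credits_per_item < 5:
        # A single cheap generation does not need a prompt.
        return False
    return True


def should_check_balance_first(item_count, credits_per_item):
    """Batches totalling 100+ credits call for a check_credits lookup first."""
    return item_count * credits_per_item >= 100


def video_summary(count, seconds, model, rate):
    """Format the one-line summary for a per-second-billed video batch."""
    total = count * seconds * rate
    return f"{count} videos × {seconds}s × {model} @ {rate} cr/s = **{total} credits**. Proceed?"
```

For example, `video_summary(8, 5, "example-model", 2)` yields a line ending in `**80 credits**. Proceed?`, where `example-model` is a placeholder rather than a real model name.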
|
|
224
|
-
|
|
225
|
-
### Rate Limiting & Batch Generation (CRITICAL)
|
|
226
|
-
|
|
227
|
-
**Rate limits** (per user, server-enforced; the API queues — never silently drops):
|
|
228
|
-
- `generate_image`: 30/min
|
|
229
|
-
- All other generation tools: 10/min per type
|
|
230
|
-
- 300/min global across all media endpoints
|
|
231
|
-
- `upload_media`: 300/min, no credit cost
|
|
232
|
-
|
|
233
|
-
**⚠️ NEVER re-fire a generation you already called.** Aborted, interrupted, or timed-out tool calls still process server-side and will complete. Before retrying, run `get_generation_status` — only retry if it returns `failed` (not `pending` or `completed`). Only re-fire on explicit user request ("retry", "redo", "try again"). Every duplicate burns real credits.
|
|
234
|
-
|
|
235
|
-
**Batch generation workflow (≤10 items):**
|
|
236
|
-
1. Confirm cost ONCE — skip if the user already specified model/count/duration ("make 5 videos, seedance 2 fast, 15s" IS the confirmation).
|
|
237
|
-
2. **Output ALL tool calls in one response** (up to 10 per type) — they run concurrently, so 5 videos finish in the time of the slowest one.
|
|
238
|
-
3. Each call blocks until done (images: seconds; video: 1–5 min). Don't apologize for the wait.
|
|
239
|
-
4. After all complete, present results together.
|
|
240
|
-
5. On 429: wait 60s, retry only the failed items (max 2 retries).
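Step 5 of the workflow above (wait 60s, retry only the failed items, at most 2 retries) can be sketched as follows. The `RateLimited` exception and the injected `fire` callable are hypothetical stand-ins for a 429 response and a generation tool call. The sketch also runs items sequentially for simplicity, whereas the workflow above fires them concurrently.

```python
import time


class RateLimited(Exception):
    """Hypothetical stand-in for an HTTP 429 from a generation call."""


def run_batch(items, fire, sleep=time.sleep, max_retries=2, wait_seconds=60):
    """Fire each item once; retry only rate-limited items, up to max_retries."""
    results = {}
    pending = list(items)
    attempt = 0
    while pending:
        rate_limited = []
        for item in pending:
            try:
                results[item] = fire(item)
            except RateLimited:
                rate_limited.append(item)
        pending = rate_limited
        if not pending or attempt == max_retries:
            break
        attempt += 1
        sleep(wait_seconds)  # back off before retrying the failed subset
    # Items still in `pending` were rate-limited on every attempt.
    return results, pending
```

Items that already succeeded are never fired again, which matches the rule above that every duplicate generation burns credits.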
### ⚠️ Multi-output? Default to `generate_creative_director` (CRITICAL)

`generate_creative_director` is not a niche tool — **it IS an agent**. It plans the prompt for each scene internally, locks style + character consistency across the whole set, and runs every scene in parallel. For anything that produces **2 or more related outputs**, it is almost always the right call.

**Default rule:** if the user asks for multiple related outputs (variations, scenes, angles, poses, settings, moods, ad shots, product shots, storyboards, character sheets, key frames for a video) — **use `generate_creative_director`**. Reach for parallel `generate_image` calls only when the rule below explicitly says to.

**Decision matrix:**

| User intent | Tool | Why |
|---|---|---|
| "make 4 / 6 / 8 [shots, scenes, variations, angles, poses, outfits, moods, settings, frames]" | **`generate_creative_director`** with `scene_count` | One brief → N distinct, coherent outputs. Director handles per-scene prompting. |
| "show the character in N different ___" | **`generate_creative_director`** with `scene_count` + `visual_dna_ids` | Character consistency is its specialty. |
| "create a storyboard / ad campaign / product set" | **`generate_creative_director`** | The model itself was built for this case. |
| "key frames for a video / shots for the ad" | **`generate_creative_director`** (then `generate_video_from_image` per frame) | See "Character-driven video — frames first" below. |
| "give me 4 variations of THIS exact image" (same prompt, random seed only) | `generate_image` with `num_images=4` | No new direction needed — seeds, not scenes. |
| User dictates **explicit separate prompts** word-for-word: "Image 1: X. Image 2: Y. Image 3: Z." | Parallel `generate_image` calls (one per prompt) | The user already wrote the per-scene prompts; the director's planning step would be wasted. |
| Single image, single prompt | `generate_image` | Nothing to coordinate. |

**Tie-breaker:** if you're about to fire ≥2 `generate_image` calls and the user did NOT dictate per-image prompts word-for-word, stop and use `generate_creative_director` instead. The director is cheaper in tokens (one call, one plan) and the results stay coherent.

**Never call `generate_image` sequentially in a loop.** Either use `generate_creative_director` (preferred for any related set) or fire all calls in a single parallel batch.
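As a rough illustration, the decision matrix and tie-breaker can be written as a small routing function. This is only a sketch: the `Request` dataclass and its field names are hypothetical stand-ins for whatever the agent infers from the user's message, and the function does not enforce the `num_images` or `scene_count` limits. Only the tool and parameter names come from the text above.

```python
from dataclasses import dataclass


@dataclass
class Request:
    """Hypothetical summary of a user request (not a Kolbo type)."""
    output_count: int                     # how many outputs the user wants
    same_prompt_variations: bool = False  # "4 variations of THIS exact image"
    prompts_dictated: bool = False        # user wrote every prompt word-for-word


def choose_tool(req: Request) -> tuple[str, str]:
    """Return (tool name, how to call it) following the decision matrix."""
    if req.output_count <= 1:
        return ("generate_image", "single call")
    if req.same_prompt_variations:
        return ("generate_image", f"num_images={req.output_count}")
    if req.prompts_dictated:
        return ("generate_image", "one parallel call per dictated prompt")
    # Tie-breaker: 2+ related outputs with no dictated prompts go to the director.
    return ("generate_creative_director", f"scene_count={req.output_count}")
```

For example, `choose_tool(Request(output_count=6))` routes to the director, while the same count with `prompts_dictated=True` falls back to parallel `generate_image` calls.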
### 🛑 Runaway-loop guard — ONE generation per requested item (CRITICAL)

When the user asks for **one specific change** to a specific media item ("change the 2nd video to anime", "make this image brighter", "regenerate scene 3"), the answer is **a single tool call** with a single output. After that tool returns URLs, **stop**. Surface the result to the user and wait for their next message.

You are NOT allowed to:

- Fire the same tool 3+ times in a single turn unless the user explicitly asked for "N variations" / "make X versions".
- Re-fire a tool because you think the previous result might not be exactly what the user wanted — let the user judge; if they don't like it, they'll say so.
- Auto-retry on success ("the first one wasn't perfect, let me try again"). If the tool returned URLs successfully, the work is done.
- Fire 5+ parallel `generate_video*` calls speculatively. Video is expensive — every extra call burns credits the user didn't ask to spend.

**Rule of thumb**: if you've fired the same tool 3+ times in one turn and the user asked for ONE thing, stop. You're in a loop. Surface what you have, ask the user which one they want to keep, and wait.

Only re-fire when:
1. The user explicitly asked for variations (with a count).
2. The previous call returned `failure.retryable === true` AND it was a transient error — then ONE retry, max.
3. The previous call returned `completed` but with `urls.length === 0` — then ONE retry on the same payload.
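The three re-fire conditions can be condensed into a single predicate. The sketch below assumes the tool result is a plain dictionary with `state`, `urls`, and `failure` keys; that shape is an assumption made for illustration, and `may_refire` is a hypothetical helper, not a Kolbo function.

```python
def may_refire(result: dict, retries_so_far: int,
               user_asked_for_variations: bool = False) -> bool:
    """Return True only when one of the three re-fire conditions holds."""
    if user_asked_for_variations:
        return True                      # condition 1: explicit "N variations"
    if retries_so_far >= 1:
        return False                     # conditions 2 and 3 allow ONE retry, max
    failure = result.get("failure") or {}
    if failure.get("retryable") is True:
        return True                      # condition 2: transient, retryable error
    if result.get("state") == "completed" and not result.get("urls"):
        return True                      # condition 3: completed with no URLs
    return False                         # success with URLs: stop and show the user
```

A successful result with at least one URL always returns `False`, which is the point of the guard: the user, not the agent, decides whether another attempt is needed.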
### ⚠️ Editing an existing video → ONE call, not frames-first (CRITICAL)

If the user references an **existing video** and asks to modify it ("change this video to anime style", "make it cinematic", "video-to-video edit", "the 2nd video looks too cheesy, fix it"), that is a **single `generate_video_from_video` call** — pass the source video URL + the edit prompt and you're done. **One call. One output. Done.**

**Use a TRUE video-to-video model.** Image-to-video models (anything whose identifier ends in `image-to-video` / `i2v` / contains `image2video`) will NOT work — they require an image input, not a video, and the server rejects the call with `WRONG_MODEL_TYPE`. Examples of valid v2v models you can pick:

- `wan/2-7-videoedit` — Wan 2.7 video edit (re-style / re-prompt)
- `happyhorse/video-edit` — HappyHorse video edit
- `kling-video/o3-video-to-video` — Kling O3 V2V
- Any model whose DB `type` array contains `video_to_video` (use `list_models` with `type: "video_to_video"` to see the current set)

If you're unsure which model to use, call `list_models({ type: "video_to_video" })` first — do NOT guess based on the name.

**Do NOT:**
- Decompose into frames (frames-first applies only when *creating new character video from scratch*, NOT for editing an existing video).
- Re-fire the same tool repeatedly if the first call returned URLs — even if you're not 100% sure the result matches; surface the result to the user and let them iterate.
- Generate multiple variations unless the user explicitly asks for "options" / "variations" / "X different versions".
- Use an image-to-video model just because it shows up in `list_models` — check that its `type` array includes `video_to_video`.

**If the first call legitimately fails** (content policy, transient error, model refused), surface the failure to the user via the `failure` envelope — do not blindly retry. See "Reading failure envelopes" above.

This rule overrides the "frames first" guidance below for any video-EDITING request. The "frames first" rule is only for **generating new character-driven video from scratch**, not for re-styling / modifying an existing clip.
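The "check the `type` array, not the name" rule amounts to a one-line filter. The sketch below assumes `list_models` yields records with an `identifier` string and a `type` list; that record shape is a guess for illustration, and the third sample entry is made up.

```python
def video_to_video_models(models: list[dict]) -> list[str]:
    """Keep only models whose `type` array contains "video_to_video"."""
    return [m["identifier"] for m in models
            if "video_to_video" in m.get("type", [])]


# Hypothetical list_models output; the record shape is an assumption.
sample = [
    {"identifier": "wan/2-7-videoedit", "type": ["video_to_video"]},
    {"identifier": "kling-video/o3-video-to-video", "type": ["video_to_video"]},
    {"identifier": "example/image-to-video", "type": ["image_to_video"]},  # made up
]
print(video_to_video_models(sample))
```

Filtering on `type` rather than on the identifier avoids the failure mode described above, where a model that merely sounds right is rejected with `WRONG_MODEL_TYPE`.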
### ⚠️ Character-driven video — frames first, then animate (CRITICAL)

For any ad / story / scene-based video **created from scratch** featuring a Visual DNA character (NOT video-to-video edits — see the rule above for those), do **NOT** jump straight from DNA to `generate_elements` / `generate_video` per shot. The right flow is:

1. **Generate the shot frames first** as still images — one image per shot — via `generate_creative_director` with `scene_count` + `visual_dna_ids`. (Use the director, NOT parallel `generate_image` calls — multiple shots = a coherent set, that's exactly what the director is for.) The DNA is at its strongest in image generation: the character lands consistently, you can preview cheaply, and the user can approve/revise the frames before any expensive video runs.
2. **Confirm the frames with the user** if there are more than ~3 shots, or if the user hasn't explicitly said "go straight to video."
3. **Animate each frame to video** with `generate_video_from_image`, passing each approved frame as `image_url`. This is cheaper and more predictable than direct DNA→video, and identity is locked because the model is animating an existing pixel-perfect frame instead of re-inferring the character.

**Why this matters:**
- `generate_elements` + Seedance / Kling for direct character video is more expensive per shot and the character can drift between shots.
- Image-to-video (`generate_video_from_image`) anchors the first frame to your approved still, so the character/setting/composition stays locked.
- Frames are debuggable: if shot 4's pose is wrong you regenerate one image, not a 10-second video.

**When to skip frames-first and go direct to `generate_elements`:**
- User explicitly asks "go straight to video / skip the storyboard / use seedance for everything."
- Single-shot quick experiments where no character consistency is needed.
- The user supplies their own approved frames and just wants animation.

Default to frames-first unless one of those applies. After all frames are approved, fire the image-to-video calls in parallel (subject to the bulk-generation ceilings below).

**⚠️ Parameter names — do NOT confuse these:**
- `generate_image` → `num_images` (1–4): all images use the **same prompt**, just different random seeds — use this for "give me 4 variations of this image"
- `generate_creative_director` → `scene_count` (1–8): each scene gets its **own distinct prompt** — use this for "make 8 different campaign shots" OR "show the character in 8 different scenes/outfits/moods". Always pass `visual_dna_ids` when character consistency matters. **Never pass `num_images` to `generate_creative_director`.**

**After `generate_creative_director` completes — share results as individual URLs, one per scene. Do NOT create an HTML grid artifact or any combined layout. Just list each scene's title and its image URL on separate lines.**
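The frames-first flow in this section can be sketched as three small helpers that build call payloads without sending anything. Treat this as an illustration under assumptions: the `{"tool": ..., "args": ...}` payload shape and the `prompt` argument name are hypothetical, while `scene_count`, `visual_dna_ids`, and `image_url` are the parameter names given above.

```python
def director_call(brief: str, scene_count: int, visual_dna_ids: list[str]) -> dict:
    """Step 1: one director call that produces every shot frame as a still."""
    if not 1 <= scene_count <= 8:
        raise ValueError("scene_count must be between 1 and 8")
    return {"tool": "generate_creative_director",
            "args": {"prompt": brief, "scene_count": scene_count,
                     "visual_dna_ids": list(visual_dna_ids)}}


def needs_confirmation(shot_count: int, user_said_go_straight: bool) -> bool:
    """Step 2: confirm when there are more than ~3 shots or no explicit go-ahead."""
    return shot_count > 3 or not user_said_go_straight


def animate_calls(approved_frames: list[tuple[str, str]]) -> list[dict]:
    """Step 3: one generate_video_from_image call per (frame URL, motion prompt)."""
    return [{"tool": "generate_video_from_image",
             "args": {"image_url": url, "prompt": motion}}
            for url, motion in approved_frames]
```

Note that `director_call` never emits `num_images`; the range check mirrors the 1–8 limit on `scene_count`.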
### ⚠️ Generated URLs in chat (CRITICAL)

The chat renders markdown natively: `![alt](url.png)` becomes an inline image, `[label](url.png)` becomes a labeled link with an auto-preview, bare URLs become clickable links. Use whichever fits the reply:

- **Catalog-style replies** (numbered lists of characters / scenes / products where the user wants to *see* each item next to its description) — embed `![alt](url)` or `[label](url)` so each item shows its thumb inline. The agent decides; the renderer handles the rest.
- **Conversational replies** ("4 shots ready") — keep the prose short; the canvas chip already shows the gallery, you don't need to re-list URLs.

Avoid bare URL dumps (`1. https://… 2. https://…`) and HTML `<table>` grids — they're ugly and the canvas already provides a gallery. Anything you want the user to actually see inline should be wrapped in markdown image / link syntax.

**Always** record every URL in `.kolbo/production.md` — that's the durable record, independent of what you show in chat.
### ⚠️ Bulk Generation (>10 items)

For large briefs ("make 50 UGC ads") the rules above still apply, plus:

**Real-world batch ceilings (cheat sheet)** — these are tighter than the published rate limits; exceeding them causes 429s that can throttle the whole session:

| Tool | Max safe in-flight | Notes |
|---|---|---|
| `generate_image` | 8–10 | Fast (~10–30s each) |
| `generate_image_edit` | 5–8 | Multi-angle models slower |
| `generate_creative_director` | 1 call → up to 8 scenes | Runs scenes in parallel internally — never batch externally |
| `generate_video` / `generate_video_from_image` / `generate_first_last_frame` / `generate_lipsync` | 3–5 | 1–5 min each |
| `generate_video_from_video` | 3 | Heaviest |
| `generate_elements` | 3–5 | Confirmed real-world ceiling for 50-item bulk runs |
| `generate_music` / `generate_speech` / `generate_sound` | 5–8 | |
| `upload_media` | 10+ | No practical ceiling |

For 50 outputs: fire one batch → wait for all to finish → fire next batch. Never fire all 50 in one response.

**`upload_media` external URLs first.** `files` on `generate_elements` and source images on edit/from-image tools only accept Kolbo-hosted URLs reliably; external URLs (e.g. Unsplash) cause `400 Bad Request`. Pattern: external URL/local file → `upload_media` → use the returned Kolbo CDN URL in `reference_images` / `source_images` / `image_url`. Image upload constraints: JPEG/PNG/WebP only, 300×300 to 2048×2048 — pre-validate before upload.

**On 429:** finish the in-flight batch, wait 60s, retry only the failed items. Second 429 → wait 120s, retry once. Third → stop the whole job, report completed/failed counts to the user.

**Persist every `generation_id`** in `.kolbo/production.md` (even for failures) — required for `get_generation_status` recovery and cross-session dedupe.

Bulk production-log entry shape:
```md
12. ✅ Asian F 24, bedroom, hype POV
    - generation_id: gen_8a2c…
    - url: https://…
    - model: seedance-2 · 720p · 10s · sound-on
    - generated: 2026-05-14T07:42Z
13. ❌ Latino M 31, gym
    - generation_id: gen_ff19…
    - error: 429 Too many generation requests
    - retry_after: 2026-05-14T07:43Z
```

**Don't narrate** — output the tool calls, skip "Generating Video 1 of 5…" preambles.

**Handling interruptions:** if the user aborts mid-batch then says "do the rest," check what you already fired, skip those, fire only the remainder. Never restart from the beginning.
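To make the batching and 429 rules above concrete, here is a minimal sketch. It takes the lower bound of each range in the cheat sheet as the batch size, which is a conservative choice made here rather than something the text prescribes, and the helper names are hypothetical.

```python
# Conservative batch sizes: the lower bound of each cheat-sheet range.
MAX_IN_FLIGHT = {
    "generate_image": 8,
    "generate_image_edit": 5,
    "generate_video": 3,
    "generate_video_from_image": 3,
    "generate_first_last_frame": 3,
    "generate_lipsync": 3,
    "generate_video_from_video": 3,
    "generate_elements": 3,
    "generate_music": 5,
    "generate_speech": 5,
    "generate_sound": 5,
}


def batches(items: list, tool: str) -> list[list]:
    """Split a bulk job into batches that respect the tool's ceiling."""
    size = MAX_IN_FLIGHT[tool]
    return [items[i:i + size] for i in range(0, len(items), size)]


def wait_after_429(consecutive_429s: int):
    """Seconds to wait before retrying the failed items, or None to stop the job."""
    return {1: 60, 2: 120}.get(consecutive_429s)
```

With these numbers, a 50-item `generate_elements` job becomes 17 batches of at most 3, fired one after another; a third consecutive 429 returns `None`, which maps to "stop the whole job and report counts".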
### Reading failure envelopes from `get_generation_status`

When a generation fails, `get_generation_status` now returns a structured `failure` field alongside `error`:

```jsonc
{
  "state": "failed",
  "error": "The input or output was flagged as sensitive…",
  "failure": {
    "message": "The input or output was flagged as sensitive…",
    "category": "content_policy",   // content_policy | network | auth | model_failure | ...
    "code": "CONTENT_FLAGGED_SENSITIVE",
    "retryable": false,             // true = transient, safe to retry; false = same input will fail again
    "severity": "error",
    "provider": "kie-nano-banana"
  }
}
```

Branch on `failure.category` / `failure.retryable`:

- `category === "content_policy"` (or `code === "CONTENT_FLAGGED_SENSITIVE"`) → **do not retry the same prompt**. Tell the user the model refused, suggest a less explicit phrasing or a Visual DNA fallback. Log to `.kolbo/production.md` Failures section with the exact reason.
- `category === "auth"` or `code === "[KOLBO_AUTH_EXPIRED]"` → surface the reconnect flow (`open_kolbo_app`), don't auto-retry.
- `retryable === true` (transient: network, rate limit, provider 5xx) → retry once with the same payload after a short pause. If it fails again, surface to user.
- `retryable === false` and unknown category → surface the raw `message` to the user, don't retry.

If `failure` is absent (older kolbo-api), fall back to the heuristic in the next section.
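The branching rules above translate fairly directly into code. In this sketch the returned action labels are hypothetical names chosen for illustration; the envelope field names and values are the ones shown in the example response.

```python
def next_action(status: dict) -> str:
    """Map a get_generation_status result to the next step (labels are illustrative)."""
    failure = status.get("failure")
    if failure is None:
        return "fallback_heuristic"          # older API: no structured envelope
    category = failure.get("category")
    code = failure.get("code")
    if category == "content_policy" or code == "CONTENT_FLAGGED_SENSITIVE":
        return "ask_user_to_rephrase"        # never retry the same prompt
    if category == "auth" or code == "[KOLBO_AUTH_EXPIRED]":
        return "reconnect"                   # surface open_kolbo_app, no auto-retry
    if failure.get("retryable") is True:
        return "retry_once"                  # transient: same payload, short pause
    return "surface_message"                 # unknown and not retryable


example = {
    "state": "failed",
    "failure": {"category": "content_policy",
                "code": "CONTENT_FLAGGED_SENSITIVE",
                "retryable": False},
}
print(next_action(example))
```

The order of the checks matters: content-policy and auth failures are handled before `retryable` is consulted, so a mislabeled envelope cannot trigger a retry of a refused prompt.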
### ⚠️ Detecting failed generations (CRITICAL)

A generation can fail in three distinct ways. Treat ALL three as failure — don't pretend it worked:

1. **Tool returns `error`** — explicit failure, you see the error text. Easy case. Surface it to the user, suggest a retry, and log the `generation_id` if present so you can call `get_generation_status` later.
2. **Tool returns `completed` but with NO URL in `urls`** — most common silent failure. The server marked the generation done but produced nothing (NSFW filter, model OOM, upstream provider 5xx, transient bug). Treat as failure. Do NOT log this to `.kolbo/production.md`. Do NOT claim the generation worked. Tell the user "the generation completed without an output — retrying" and re-fire ONCE.
3. **Tool hangs / never returns** — the MCP poll timed out (the JS-side timeout in the MCP fired before the server-side generation finished, OR the server died mid-job). Two signals: (a) the tool result includes a timeout error mentioning `generation_id`, OR (b) the user comes back later and says "you said you were generating but I never got it."

   - On case (a): IMMEDIATELY call `get_generation_status(generation_id)`. The server might be done — recover the URL instead of re-firing. Only retry if `status === "failed"`.
   - On case (b): if you have a recorded `generation_id` in `.kolbo/production.md` for that step, call `get_generation_status` first. If you don't (because you never logged it — see Bulk Generation rules), the work is lost.

**Always-true rules:**

- **Don't celebrate a generation before reading the tool result.** "✅ done!" before checking that `urls` is non-empty is wrong. Even when the tool returns `completed`, verify there's at least one URL before reporting success.
- **Don't auto-retry without surfacing the failure.** If a single generation fails, tell the user before silently retrying. If a BATCH partially fails, list the failed items with their reasons and the count of successful ones. Never paper over partials with "✅ all done!"
- **Don't double-log failed items.** Failures DO NOT go into `.kolbo/production.md`'s artifact list. Only successful generations land there. Failures get a one-line note in chat + (optionally) a separate `## Failures` section in the production log with the `generation_id` for retry traceability.
- **Surface the user's count.** If the user asked for 8 and you got 6 successes + 2 failures, the reply MUST say "6 of 8 ready" — not "videos ready." Misreporting partial success is the most common UX bug here.
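A small helper can make the "verify `urls` before reporting" and "surface the count" rules hard to forget. This sketch assumes each tool result is a dictionary with optional `error`, `state`, and `urls` keys; that shape and both function names are assumptions for illustration.

```python
def is_success(result: dict) -> bool:
    """A generation counts as done only if it completed AND returned a URL."""
    return (not result.get("error")
            and result.get("state") == "completed"
            and bool(result.get("urls")))


def summarize(results: list[dict]) -> str:
    """Report the honest count, e.g. '6 of 8 ready, 2 failed'."""
    total = len(results)
    ok = sum(1 for r in results if is_success(r))
    if ok == total:
        return f"{total} of {total} ready"
    return f"{ok} of {total} ready, {total - ok} failed"
```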
---

## ⚠️ Production Log — `.kolbo/production.md` (CRITICAL)

Every URL, id, and brief produced by a Kolbo MCP tool MUST be recorded in `.kolbo/production.md` in the user's workspace. This file — not chat history — is your source of truth for prior artifacts: URLs scattered across `tool_result` blobs are unreliable to re-scan and disappear entirely on context compaction.

### When to READ it

Read `.kolbo/production.md` **before** acting on any of these signals:
- "edit", "animate", "combine", "redo", "polish", "fix", "regenerate"
- "the same character / scene / image / video / sound", "that X", "scene N", "the rainy one", etc.
- `@name` references for Visual DNA
- Any continuation of prior media work ("now make scene 3")

If the file is missing and the user is referencing prior media, ask the user — do not guess from chat.
### When to WRITE to it

**Immediately after every successful generation tool call**, before your next tool call or your final reply. The runtime will inject a reminder after generation tool results — treat that as a hard rule, not a suggestion.

Tools that REQUIRE logging:
- `generate_image`, `generate_image_edit`, `edit_image`
- `generate_video`, `generate_video_from_image`, `generate_video_from_video`, `edit_video`
- `generate_elements`, `generate_first_last_frame`, `generate_lipsync`
- `generate_music`, `generate_sound`, `generate_speech`
- `generate_3d`, `generate_creative_director`
- `create_visual_dna`, `upload_media`

Tools that do NOT log: `list_*`, `get_*`, `check_credits`, `chat_*`, `transcribe_audio` (read-only / discovery).
### File creation — pick the right tool to avoid the "must Read first" error

`Edit` refuses to overwrite a file unless you've `Read` it first in the same session. Pick by file state:

| State | Tool |
|---|---|
| File **does not exist** (typical first turn) | `Write` with the full stub below |
| File **exists** | `Read` first, then `Edit` |
| Not sure | `Read` first; on ENOENT, fall back to `Write` |

Stub for first creation:

```md
<!-- .kolbo/production.md — agent-managed media artifact registry.
     User may hand-edit; agent must Read-before-Edit to reconcile. -->

# Production Log

## 🎯 Now

**Brief:** <paraphrase of user's overall goal in 1-3 sentences>
**Now working on:** <the immediate next step>
**Last updated:** <ISO date>

---

## Production: <name from user's request, slugified human label>

### Cast
### Visual DNA
### Scenes
### Audio
### Final
```

Subsections (`### Cast` etc.) are **suggested defaults**, not required. Adapt: a logo set has `### Logos`, an album has `### Tracks`, a 3D render has `### Models`. Leave empty subsections out of the file when you create entries.
### Entry shape

One bullet per artifact. Write the label **the way the user would reference it next time** ("the rainy one"), not the model's raw output.

```md
### Cast
- **Maya** — female, 30, urban photographer, leather jacket
  - portrait: https://...characters/maya.png (nano-banana-2, 2026-05-13)
  - visual DNA: vdna_8f2c (@maya)

### Scenes
1. **Coffee shop morning** — Maya at counter, soft light, wide shot
   - still: https://...scenes/01-coffee.png (flux-2-pro, 2026-05-13)
   - video: (pending)
2. **Rainy street walk** — neon reflections, slow dolly
   - still: https://...scenes/02-rain.png (flux-2-pro, 2026-05-13)
   - video: https://...videos/02-rain.mp4 (kling-2, 2026-05-13)
```
### Header rewrite rule (Manus pattern — IMPORTANT)

The `## 🎯 Now` block at the top of the file is **rewritten every turn** to keep the brief + current step near the model's recency window. Body sections (everything below the first `---`) are **append-only**.

When a user request supersedes a previous artifact (e.g., "redo scene 2 with more rain"), do not delete the old entry. Mark it `(superseded YYYY-MM-DD)` and place the new entry beneath:

```md
2. **Rainy street walk** — neon reflections, slow dolly
   - still: https://...scenes/02-rain.png (superseded 2026-05-13)
   - still: https://...scenes/02-rain-v2.png (flux-2-pro, 2026-05-13)
   - video: https://...videos/02-rain-v2.mp4 (kling-2, 2026-05-13)
```
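One way to picture the header rewrite rule is as a pure text transformation: everything from the `## 🎯 Now` heading up to the first `---` is replaced, and everything from that separator onward is returned untouched. The function below is a hypothetical sketch of that idea, not how the agent's `Edit` tool works; it raises `StopIteration` if either marker is missing.

```python
def rewrite_now_block(log_text: str, brief: str, now_working_on: str,
                      last_updated: str) -> str:
    """Replace the '## 🎯 Now' block; leave the append-only body as it was."""
    lines = log_text.splitlines()
    # Index of the Now heading and of the first '---' separator after it.
    start = next(i for i, line in enumerate(lines) if line.startswith("## 🎯 Now"))
    end = next(i for i in range(start, len(lines)) if lines[i].strip() == "---")
    new_block = [
        "## 🎯 Now",
        "",
        f"**Brief:** {brief}",
        f"**Now working on:** {now_working_on}",
        f"**Last updated:** {last_updated}",
        "",
    ]
    return "\n".join(lines[:start] + new_block + lines[end:]) + "\n"
```

Because the body is sliced out and re-attached verbatim, superseded entries and any hand edits below the separator survive every rewrite.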
### Rules

1. **First touch `Write`, subsequent touches `Read` → `Edit`** (see "File creation" above). If `Edit` fails on exact-match, `Read` again — the user may have hand-edited.
2. **Plain English labels** — write what the user would call it.
3. **Append-only body.** Only the `## 🎯 Now` header is rewritten. Never delete artifact entries; mark them `(superseded)` instead.
4. **Do not log failures as artifacts.** Only successful generations go in the artifact list; a failure gets, at most, a note in a separate `## Failures` section.
5. **Resolve user references via the log, not chat history.** If the user says "scene 3," use the URL the log says is scene 3, even if a later tool_result mentioned a different URL.
6. **One file per workspace.** Multiple concurrent productions go under separate `## Production: <name>` headings inside the same file.
### Production Log vs TodoWrite

Use both — different jobs:

| | `.kolbo/production.md` | `TodoWrite` |
|---|---|---|
| Purpose | Durable artifact registry | Ephemeral step plan |
| Lifetime | Persists across sessions / compaction | Per turn / per request |
| Content | URLs, ids, briefs | "Do X, then Y, then Z" |
| Example | `still: https://...01-coffee.png` | `Generate visual DNA for Maya` |

---
## Video / Audio Analysis & Transcription

You have three routes. The right one depends on the file profile — pick before calling any tool.

### Decision tree

```
Image (jpg/png/webp)?                            → Read directly (native vision, up to 10 per pass)
File >100MB OR >15 min OR dialogue-dense?        → HYBRID (transcribe + ffmpeg frames + Read + your synthesis)
User wants the transcript/SRT as deliverable?    → transcribe_audio, return the URLs
Precise answer about one specific frame?         → ffmpeg that frame → Read
Otherwise (short/medium video, mixed content)    → upload_media → chat_send_message (Gemini native)
```
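Read top to bottom, the decision tree is a chain of early returns. The sketch below encodes it under a few assumptions: the caller already knows the file size, duration, and intent flags; the route labels are illustrative; and `.jpeg` is accepted alongside `.jpg` even though the tree only lists jpg/png/webp.

```python
import os

# Image extensions from the tree; ".jpeg" is added here as an assumption.
IMAGE_EXTS = {".jpg", ".jpeg", ".png", ".webp"}


def pick_route(filename: str, size_mb: float, duration_min: float,
               dialogue_dense: bool = False, wants_transcript: bool = False,
               single_frame_question: bool = False) -> str:
    """Walk the decision tree in order and return a route label."""
    ext = os.path.splitext(filename)[1].lower()
    if ext in IMAGE_EXTS:
        return "read_directly"
    if size_mb > 100 or duration_min > 15 or dialogue_dense:
        return "hybrid"
    if wants_transcript:
        return "transcribe_audio"
    if single_frame_question:
        return "ffmpeg_frame_then_read"
    return "upload_media_then_chat"
```

Because the checks run in the tree's order, a 250 MB file goes hybrid even when the user only wants a transcript, matching the precedence shown above.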
### Why `upload_media` → chat is **not** always the default

Gemini-via-chat processes frames + motion + audio in one pass and is the simplest route when it works. But it has three known failure surfaces — recognize them and pivot to the hybrid path:

1. **>100MB upload cap.** Hard limit; the upload won't succeed. No option but to split with ffmpeg or go hybrid.
2. **Long-form decay** (rough threshold: 15–20 min). Even when it fits, attention degrades — shallow or hallucinated answers on the back half of the file.
3. **Transcription-dense laziness.** Lectures, interviews, podcasts, anything where speech is the substance: chat models summarize aggressively, paraphrase quotes wrong, or silently skip stretches. Always transcribe these first to get the actual words, then add visuals only if they matter.
### The hybrid path (workaround for all three failures)

```
1. transcribe_audio({ source }) → text, srt_url, word_by_word_srt_url, duration
2. Read the transcript text from the tool output directly
3. Pick 3–8 timestamps from the SRT where visuals actually matter
4. ffmpeg -ss <ts> -i <file> -frames:v 1 <frame.jpg>   (one extract per timestamp)
5. Read each frame with native vision (up to ~10 frames per analysis pass)
6. Synthesize from transcript + frames + the user's question
```

This is usually **cheaper** than chat for long files — transcription is per-minute, ffmpeg + Read are free — and produces stronger answers on dialogue-heavy material because you have the complete text, not a model's summary of it.

For media >30 min (past the transcription cap), split with ffmpeg into ~25-min chunks, transcribe each, concatenate.
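Two mechanical pieces of the hybrid path lend themselves to a short sketch: building the frame-extraction command from step 4, and planning the ~25-minute chunks for media past the cap. Both helpers are hypothetical and only compute values; nothing here invokes ffmpeg, and the file names in the usage note are placeholders.

```python
def frame_extract_cmd(video_path: str, timestamp: str, out_path: str) -> list[str]:
    """Argument list for: ffmpeg -ss <ts> -i <file> -frames:v 1 <frame.jpg>"""
    return ["ffmpeg", "-ss", timestamp, "-i", video_path, "-frames:v", "1", out_path]


def chunk_plan(duration_min: float, chunk_min: float = 25) -> list[tuple[float, float]]:
    """(start, length) pairs in minutes that together cover the whole file."""
    plan = []
    start = 0
    while start < duration_min:
        plan.append((start, min(chunk_min, duration_min - start)))
        start += chunk_min
    return plan
```

A 70-minute recording yields three chunks starting at 0, 25, and 50 minutes, each under the 30-minute cap. The command list from `frame_extract_cmd("talk.mp4", "00:12:30", "frame_01.jpg")` can be passed to `subprocess.run` as-is.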
### Transcribe-as-deliverable vs transcribe-as-input

| Request pattern | Action |
|---|---|
| "Transcribe this" / "give me an SRT" / "I need word-by-word timing" / "make subtitles" | Run `transcribe_audio`, return the URL(s). The transcript IS the deliverable. |
| "What did they say about X?" / "Summarize this meeting" / "Find the part where they mention Y" | Run `transcribe_audio` to *get* the text → **you** read/summarize/search. Transcript is a means, not the answer. |
### `transcribe_audio` — tool details

- `source`: URL or absolute local path.
- **Audio**: mp3, wav, m4a, flac, aac. **Video** (audio track extracted): mp4, mov, webm, mkv, avi, m4v.
- **30-minute hard cap.** Longer → split with ffmpeg first.
- Returns:
  - `text` — full transcript, plain.
  - `srt_url` — grouped SRT (~12 words per line, up to 2 lines per subtitle). Use this for normal subtitle delivery.
  - `word_by_word_srt_url` — one word per cue with millisecond-precise start/end (ElevenLabs Scribe v2). Use **only** when downstream is animation (Remotion captions, After Effects karaoke, precise speech-aligned cuts). Noise for normal subtitle workflows.
  - `txt_url` — plain text file.
  - `duration` — seconds.
- Cost: per-minute (`model.credit × duration_minutes`). Run `check_credits` before transcribing very long files.
- Read-only / discovery — does NOT trigger the `.kolbo/production.md` log nudge. If the user wants the transcript saved as a durable artifact, `Write` it to a workspace file, not the production log.
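As a worked example of the cost formula and the cap, the sketch below converts the `duration` field (seconds) to minutes before multiplying. Whether billing rounds partial minutes is not stated above, so this computes the unrounded figure; the function names and the credit rate in the usage note are made up.

```python
TRANSCRIBE_CAP_SECONDS = 30 * 60  # the 30-minute hard cap


def needs_split(duration_seconds: float) -> bool:
    """True when the file is past the cap and must be split first."""
    return duration_seconds > TRANSCRIBE_CAP_SECONDS


def estimated_credits(credit_per_minute: float, duration_seconds: float) -> float:
    """model.credit x duration_minutes, with duration given in seconds."""
    return credit_per_minute * (duration_seconds / 60)
```

At a hypothetical rate of 2 credits per minute, a 10-minute file (600 seconds) comes to 20 credits, which is the kind of estimate worth comparing against `check_credits` before a long run.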
### `upload_media` → `chat_send_message` — tool details

- `upload_media({ source: "/absolute/local/path/file.mp4" })` → returns `{ url, thumbnail_url, ... }`. **Use `url`** (the CDN URL); ignore `thumbnail_url` (preview JPG only).
- `chat_send_message({ message, media_urls: [url] })`:
  - `media_urls` is **mandatory** — the model only sees the file if you pass the CDN URL here. Always an array.
  - **Omit `model`** — Smart Select auto-routes to Gemini when media is detected.
  - Sessions do NOT remember media between messages. On retry: reuse the same CDN URL (no re-upload), but always pass `media_urls` again.
  - For cost-sensitive batches of many short videos: call `list_models` to find the cheapest Gemini model and pass it explicitly.
### Image analysis — never via chat

You have native vision. **Always `Read` images directly** (you handle up to 10 per pass). Do not `upload_media` + chat for images unless the user explicitly names a specific Kolbo chat model. Don't extract frames from images either — they're already viewable.

**NEVER ask the user which path to use — diagnose from the file profile and pick.**
|
|
623
|
-
|
|
624
|
-
### Analyzing the source before a chained generation — when it's worth it
|
|
625
|
-
|
|
626
|
-
Before feeding a media asset into another generation tool
|
|
627
|
-
(`generate_image_edit`, `edit_image`, `generate_video_from_image`,
|
|
628
|
-
`generate_first_last_frame`, `generate_video_from_video`, `edit_video`,
|
|
629
|
-
`generate_elements`, `generate_lipsync`), think about whether you actually
|
|
630
|
-
*know* what's in the source. If you don't, analyze it first so the next
|
|
631
|
-
prompt can reference concrete details instead of generic adjectives.
|
|
632
|
-
|
|
633
|
-
**Analyze first when:**
|
|
634
|
-
|
|
635
|
-
- The source is **old** — more than a few turns back, or pulled via
|
|
636
|
-
`list_media` / `get_media` from earlier in the project. Context has
|
|
637
|
-
drifted; you likely don't remember the specifics.
|
|
638
|
-
- The source was **user-provided without a description** — they pasted a
|
|
639
|
-
URL or uploaded a file but didn't say what it shows.
|
|
640
|
-
- The previous prompt was **vague** ("make something pretty", "a cool
|
|
641
|
-
shot") — the output details matter and you don't know them.
|
|
642
|
-
- The chain step needs to **preserve specific details** the original
|
|
643
|
-
prompt didn't pin down (exact pose, color of a prop, lighting direction,
|
|
644
|
-
audio room tone, etc.).
|
|
645
|
-
- Source is a **video or audio** going into elements / video-from-video /
|
|
646
|
-
lipsync — motion direction, pacing, voice characteristics, and ambient
|
|
647
|
-
bed drive the next prompt and can't be guessed from a URL.
|
|
648
|
-
|
|
649
|
-
**Skip analysis when:**
|
|
650
|
-
|
|
651
|
-
- You **just generated** the asset in the same conversation with a precise
|
|
652
|
-
prompt — that prompt *is* the spec. Re-analyzing wastes credits.
|
|
653
|
-
- The edit is **mechanical** — "remove background", "brighten 10%",
|
|
654
|
-
"loop to 5 seconds", "crop to 1:1". The source content doesn't matter.
|
|
655
|
-
- The user already **described what's in it** in this turn.
|
|
656
|
-
|
|
657
|
-
Default to skipping unless one of the "analyze first" cases applies — an
|
|
658
|
-
analysis-per-step habit on long chains burns credits and latency without
|
|
659
|
-
adding signal.
|
|
660
|
-
|
|
661
|
-
**How to analyze (pick by media type):**
|
|
662
|
-
|
|
663
|
-
| Source media | How |
|
|
664
|
-
|---|---|
|
|
665
|
-
| Image (URL or local) | Your native vision — view it directly. No `chat_send_message` round-trip needed. |
|
|
666
|
-
| Video / Audio | `chat_send_message({ message: "Describe...", media_urls: [url] })`. Batch up to 10 URLs in **one** call (see batching rule above). Omit `model` so Smart Select routes to Gemini vision. |
|
|
667
|
-
|
|
668
|
-
**What the analysis should extract** (use whatever is relevant for the next
|
|
669
|
-
step's prompt):
|
|
670
|
-
|
|
671
|
-
- **Subject** — pose, expression, framing (head-and-shoulders / full body / wide).
|
|
672
|
-
- **Wardrobe & props** — exact colors, materials, distinguishing items.
|
|
673
|
-
- **Scene & environment** — location, time of day, weather, background depth.
|
|
674
|
-
- **Lighting & color palette** — dominant temperature, key/fill direction,
|
|
675
|
-
contrast, color grade.
|
|
676
|
-
- **Camera** — angle, focal length feel (wide / portrait), depth-of-field.
|
|
677
|
-
- **Motion** (videos only) — direction, speed, camera move (push-in,
|
|
678
|
-
pan, locked), what changes between first and last frame.
|
|
679
|
-
- **Audio** (videos/audio only) — voice characteristics, ambient bed,
|
|
680
|
-
speech pace, music tempo/mood.
|
|
681
|
-
- **Anything that already looks wrong** — artifacts, blurred faces, wrong
|
|
682
|
-
fingers, blown highlights, audio glitches — note these so the next prompt
|
|
683
|
-
either fixes them (edit) or doesn't preserve them (elements/video).
|
|
684
|
-
|
|
685
|
-
**Then write the next prompt with concrete references**, not generic
|
|
686
|
-
adjectives. Example for an image-to-video chain:
|
|
687
|
-
|
|
688
|
-
Bad — generic, no analysis:
|
|
689
|
-
```
|
|
690
|
-
prompt: "Animate this image with a slow push-in"
|
|
691
|
-
image_url: <generated still>
|
|
692
|
-
```
|
|
693
|
-
|
|
694
|
-
Good — analyzed first, prompt names the specifics:
|
|
695
|
-
```
|
|
696
|
-
prompt: "Slow 4-second dolly-in toward @maya's face from the medium shot;
|
|
697
|
-
the warm golden-hour rim light on her left shoulder stays
|
|
698
|
-
consistent; the wind moves the leaves behind her gently to the
|
|
699
|
-
right. Steady move, no shake. Subject does not turn; she keeps
|
|
700
|
-
the half-smile and direct eye contact from the still."
|
|
701
|
-
image_url: <generated still>
|
|
702
|
-
visual_dna_ids: ["vdna_8f2c"] // maya
|
|
703
|
-
```
|
|
704
|
-
|
|
705
|
-
The point is **not** to dump an essay into the prompt — it's to make sure
|
|
706
|
-
every concrete detail the next model needs to preserve (or change) is
|
|
707
|
-
named, so the chain doesn't lose continuity across steps.
|
|
708
|
-
|
|
709
|
-
**Production-log tie-in:** when you analyze a generated still/clip, write
|
|
710
|
-
a one-line description into `.kolbo/production.md` next to the URL — that
|
|
711
|
-
way the next chained step can read the log instead of re-analyzing.
|
|
712
|
-
|
|
713
|
-
### ⚠️ Batching Media in Chat Messages (CRITICAL)
|
|
714
|
-
|
|
715
|
-
**Send ALL media in ONE `chat_send_message` call.** `media_urls` accepts up to **10 URLs**. Each separate chat call counts toward rate limits — splitting trips "Too many generation requests."
|
|
716
|
-
|
|
717
|
-
```
|
|
718
|
-
# Step 1: parallel uploads (one response)
|
|
719
|
-
upload_media({ source: "video1.mp4" }) → url1
|
|
720
|
-
... (up to 10)
|
|
721
|
-
|
|
722
|
-
# Step 2: ONE chat call with all URLs
|
|
723
|
-
chat_send_message({ message: "Analyze all 5 videos...", media_urls: [url1, url2, ...] })
|
|
724
|
-
```
|
|
725
|
-
|
|
726
|
-
On 429: wait 60s, retry the same chat call — reuse the CDN URLs, do not re-upload.
|
|
727
|
-
|
|
728
|
-
**Never:** pass a local path in `media_urls` (CDN URLs only); use a transcription `.txt` URL as a video URL; construct a CDN URL yourself; split media across multiple chat calls.
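The batching and retry rules above can be sketched as a small wrapper. This is a minimal illustration, not the Kolbo client: `send` stands in for whatever function actually issues `chat_send_message`, and `RateLimited` is a hypothetical exception that such a wrapper might raise on a 429. Only the 10-URL cap, the 60-second wait, and the "reuse the same URLs" rule come from the text.

```python
from __future__ import annotations

import time
from typing import Callable, Sequence

MAX_MEDIA_URLS = 10  # cap on media_urls stated in the rule above


class RateLimited(Exception):
    """Hypothetical error raised by the caller's client on HTTP 429."""


def send_batched(
    send: Callable[[str, list[str]], str],  # assumed wrapper around chat_send_message
    message: str,
    media_urls: Sequence[str],
    retries: int = 3,
    wait_seconds: float = 60.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Send all URLs in ONE call; on a rate limit, wait and retry the same call."""
    urls = list(media_urls)
    if len(urls) > MAX_MEDIA_URLS:
        raise ValueError(f"media_urls accepts at most {MAX_MEDIA_URLS} URLs, got {len(urls)}")
    if any(not u.startswith(("http://", "https://")) for u in urls):
        raise ValueError("media_urls must be uploaded URLs, not local paths")
    attempt = 0
    while True:
        try:
            return send(message, urls)  # same URLs on every attempt, no re-upload
        except RateLimited:
            if attempt >= retries:
                raise
            attempt += 1
            sleep(wait_seconds)
```

Passing `sleep` as a parameter keeps the wait testable; in normal use the default `time.sleep` applies.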
|
|
729
|
-
|
|
730
|
-
---
|
|
731
|
-
|
|
732
|
-
## ⚠️ Research-First Creative — when to scrape before generating
|
|
733
|
-
|
|
734
|
-
When the user gives you a **product URL, brand reference, or "make X for Y audience" brief** (especially for ads, marketing creative, or anything tied to a real brand), don't jump straight to prompts. Spend one turn researching first — the cost of a single research turn is far less than 10 mis-aimed generations.
|
|
735
|
-
|
|
736
|
-
### When to do research-first
|
|
737
|
-
- Any URL appears in the brief (product page, landing page, brand site)
|
|
738
|
-
- The brief names a brand, product, or company you don't already have context on
|
|
739
|
-
- The brief targets a specific audience / language / market with conventions you should respect (Hebrew/Israeli, Japanese, Gen-Z TikTok, B2B SaaS, luxury, etc.)
|
|
740
|
-
- The brief explicitly says "research" / "תחקור" / "look up" / "find examples" / "check best practices"
|
|
741
|
-
|
|
742
|
-
### How to research (parallel calls in one response)
|
|
743
|
-
Fire these IN PARALLEL — they're independent reads:
|
|
744
|
-
|
|
745
|
-
1. **`WebSearch`** for prompt-engineering patterns specific to the chosen model. **The model name in the search query MUST be the literal model the user named** — never substitute a generic / default / "popular" model. If the user said "nano banana 2", search for `"nano banana 2" prompt …`, NOT `"flux" prompt …` or `"midjourney" prompt …`. The same HARD RULE that applies to *calling* the named model applies to *researching* it. Examples (replace `<model>` with the user's exact wording):
|
|
746
|
-
- `"<model>" prompt engineering ad image text rendering`
|
|
747
|
-
- `"<model>" hex color font specification advertising prompt`
|
|
748
|
-
- `"<model>" hebrew text RTL rendering` (or any user-named language)
|
|
749
|
-
2. **`WebSearch`** for the audience / market design conventions:
|
|
750
|
-
- `<audience> advertising design trends <year>`
|
|
751
|
-
- `<language> typography <use case> RTL/LTR best practices`
|
|
752
|
-
3. **`WebFetch`** the product URL with a precise extraction prompt (see below).
|
|
753
|
-
4. (Optional) `WebSearch` for competitor / reference visuals to set the bar.
|
|
754
|
-
|
|
755
|
-
### Extracting the product page (WebFetch prompt template)
|
|
756
|
-
|
|
757
|
-
Don't ask WebFetch a vague "what is this page" — ask for structured extraction:
|
|
758
|
-
|
|
759
|
-
```
|
|
760
|
-
Extract from this page, in compact bullets:
|
|
761
|
-
1. Product name + one-line value proposition.
|
|
762
|
-
2. 3–5 concrete capabilities/benefits (user-facing language).
|
|
763
|
-
3. All product hero / screenshot image URLs visible in the page.
|
|
764
|
-
4. Brand color hex codes — pull from inline `style=`, `<style>` tags, or
|
|
765
|
-
linked CSS, ignoring generic UI defaults (#fff/#000). Identify which
|
|
766
|
-
color plays which role (primary CTA, headline text, background, accent).
|
|
767
|
-
5. Brand voice signals (tone, target user, formality).
|
|
768
|
-
6. Any explicit fonts named in CSS or visible.
|
|
769
|
-
```
|
|
770
|
-
|
|
771
|
-
### Re-host every external image via `upload_media`
|
|
772
|
-
|
|
773
|
-
The bulk-API rule applies: external URLs in `reference_images` / `source_images` / `image_url` cause **400 Bad Request**. Pipeline:
|
|
774
|
-
|
|
775
|
-
1. `Bash: curl -fsSL "<external-url>" -o /tmp/<name>.<ext>` (or use WebFetch where it returns the binary)
|
|
776
|
-
2. `mcp__kolbo__upload_media` with the local file → returns Kolbo CDN URL
|
|
777
|
-
3. Use the returned CDN URL in any subsequent generation call
|
|
778
|
-
4. Log both URLs in the production log (so the user can trace provenance)
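The four-step pipeline above could be wrapped roughly as follows. This is a sketch under assumptions: `download` and `upload` are hypothetical callables supplied by the caller (for example a `curl` wrapper and a wrapper around `upload_media`); nothing here calls a real Kolbo API.

```python
from typing import Callable, Dict, Iterable


def rehost(
    external_urls: Iterable[str],
    download: Callable[[str], str],  # hypothetical: external URL -> local file path
    upload: Callable[[str], str],    # hypothetical: local file path -> CDN URL
) -> Dict[str, str]:
    """Re-host each external URL once and return {original URL: CDN URL}."""
    provenance: Dict[str, str] = {}
    for url in external_urls:
        if url in provenance:
            continue  # already re-hosted, do not download or upload twice
        local_path = download(url)
        provenance[url] = upload(local_path)
    return provenance
```

The returned mapping holds both URLs per asset, which is what step 4 asks to record in the production log.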
|
|
779
|
-
|
|
780
|
-
### Synthesizing the research
|
|
781
|
-
|
|
782
|
-
In the production log create:
|
|
783
|
-
```md
|
|
784
|
-
### Research notes
|
|
785
|
-
- Prompt patterns for <model>: …
|
|
786
|
-
- Audience conventions: …
|
|
787
|
-
|
|
788
|
-
### Product brief
|
|
789
|
-
- Name: …
|
|
790
|
-
- Value prop: …
|
|
791
|
-
- Capabilities: …, …, …
|
|
792
|
-
|
|
793
|
-
### Brand palette
|
|
794
|
-
- primary: #...
|
|
795
|
-
- accent: #...
|
|
796
|
-
- text: #...
|
|
797
|
-
- bg: #...
|
|
798
|
-
|
|
799
|
-
### Re-hosted assets
|
|
800
|
-
- hero_1: <kolbo CDN url> (from <original url>)
|
|
801
|
-
```
|
|
802
|
-
|
|
803
|
-
### Building prompts informed by the research
|
|
804
|
-
|
|
805
|
-
When generating ad / marketing creative based on this research:
|
|
806
|
-
- **Exact hex codes for every color** — `#FF4D2E` not "orange". Match brand palette.
|
|
807
|
-
- **On-image text in literal double quotes** — `"שלום עולם"` not `Hebrew greeting`. Specify language and direction (RTL/LTR) when non-English.
|
|
808
|
-
- **Per text element**: position, font weight, point size, color hex, alignment.
|
|
809
|
-
- **Forbid uninvited additions** — explicitly tell the model: NO captions, NO subtitles, NO watermarks, NO extra text beyond what's specified. Same rule as UGC defaults.
|
|
810
|
-
- **Use research findings to shape composition** — e.g. if research said "Israeli social ads favor bold contrast and minimal copy", reflect that.
|
|
811
|
-
- Always **get the user's approval on the concept + sample prompts** before firing the full batch when the batch is ≥4 ads or the user said "approve first".
|
|
812
|
-
|
|
813
|
-
### Skipping research is OK when…
|
|
814
|
-
- User gave no URL, no brand, no audience-specific signal — pure creative ("make a sunset")
|
|
815
|
-
- User said "skip research" / "just generate" / "I have the prompt ready"
|
|
816
|
-
- The brief is for a single quick draft
|
|
817
|
-
|
|
818
|
-
---
|
|
819
|
-
|
|
820
|
-
## Image Prompts
|
|
821
|
-
|
|
822
|
-
### Rules
|
|
823
|
-
- **Clean prompts only.** No "Output:", "Tips:", "Notes:", "Resolution:", "Dimensions:", or any instructional/meta language inside the prompt. The prompt is what the model sees — anything not describing the image is noise.
|
|
824
|
-
- **Length**: focused 2-3 sentences beats a bloated paragraph. Only go longer when the concept genuinely needs it (complex scenes, multiple subjects, specific technical requirements). Match prompt length to complexity.
|
|
825
|
-
- **Order**: Subject → action/pose → environment → lighting → style.
|
|
826
|
-
- **Be specific about style** when it matters: "1970s film photography", "watercolor illustration on rough paper", "3D product render with studio softbox lighting" — not vague descriptors like "beautiful" or "high quality".
|
|
827
|
-
- **`enhance_prompt: true`** (default) will improve most prompts automatically. Turn it off only if the user's prompt is already fully engineered or they want literal wording.
|
|
828
|
-
|
|
829
|
-
### Image Editing (image-to-image)
|
|
830
|
-
|
|
831
|
-
Use `generate_image_edit` when the user wants to modify an existing image. Pass the source image URL(s) in `source_images` and describe the change in `prompt`.
|
|
832
|
-
|
|
833
|
-
- Good: "Turn the sky orange and add drifting clouds"
|
|
834
|
-
- Bad: "A mountain landscape with an orange sky and drifting clouds" (re-describes what's already in the image)
|
|
835
|
-
|
|
836
|
-
Simple edits deserve simple prompts. Only elaborate for genuinely complex, multi-step transformations.
|
|
837
|
-
|
|
838
|
-
### Director Tool — Full Capabilities
|
|
839
|
-
|
|
840
|
-
`generate_creative_director` is **not just for storyboards**. It is the right tool any time the user wants **2–8 related outputs from one brief**. The director plans each scene's prompt internally, keeps style consistent across all of them, and runs them in parallel — meaning total wall-time matches the slowest scene, not the sum.
|
|
841
|
-
|
|
842
|
-
**When to reach for it (canonical use cases):**
|
|
843
|
-
- **Multi-angle character sheet** — front / back / sides / 3-quarter, "show her from 4 angles," "turn-around"
|
|
844
|
-
- **Multi-pose** — same character, different poses for the camera
|
|
845
|
-
- **Multi-scene story** — same character through 8 different environments / settings / locations
|
|
846
|
-
- **Wardrobe / outfit variants** — same character, different outfits
|
|
847
|
-
- **Mood / lighting variants** — same scene, different times of day / weather / emotion
|
|
848
|
-
- **Ad campaign / product set** — one product, N hero shots
|
|
849
|
-
- **Storyboard / shot list** — sequential beats of a narrative
|
|
850
|
-
- **Reference sheet for Visual DNA training** — produce 4–8 cohesive images that you'll *then* feed into `create_visual_dna`
|
|
851
|
-
|
|
852
|
-
**What it accepts (all combinable):**
|
|
853
|
-
|
|
854
|
-
| Parameter | Purpose | Use when |
|
|
855
|
-
|---|---|---|
|
|
856
|
-
| `prompt` | The overall brief, *not* a per-scene prompt | Always |
|
|
857
|
-
| `scene_count` (1–8) | How many outputs | Always — never use `num_images` here |
|
|
858
|
-
| `visual_dna_ids: []` | Character / style / product / scene consistency across every output | The character must look the same in every scene |
|
|
859
|
-
| `reference_images: []` | Style / composition references applied to every scene | You have a mood-image or layout reference but no Visual DNA yet |
|
|
860
|
-
| `moodboard_id` / `moodboard_ids: []` | Art-direction overlay (palette, lighting, vibe) | The user gave a brand / style brief |
|
|
861
|
-
| `workflow_type: "video"` | Switch to multi-scene video instead of images | The user asked for "8 short clips" / "4 video variants" |
|
|
862
|
-
| `model` | Pin a specific image / video model | The user named one |
|
|
863
|
-
| `aspect_ratio`, `resolution`, `duration` | Standard formatting | As needed |
|
|
864
|
-
|
|
865
|
-
**When NOT to use it:**
|
|
866
|
-
- User gave **explicit per-image prompts** ("Image 1: X. Image 2: Y. Image 3: Z.") — fire parallel `generate_image` calls instead. Director is for *one brief → N scenes*; explicit per-scene prompts mean the user already did the directing.
|
|
867
|
-
- User wants to **modify a specific existing image** — that's `generate_image_edit`.
|
|
868
|
-
- User asked for **one image** — that's `generate_image`.
|
|
869
|
-
|
|
870
|
-
### Mixing References, Visual DNAs, and Moodboards
|
|
871
|
-
|
|
872
|
-
You can combine all three reference types in a single call — they're additive, not exclusive. The system blends them; the model uses whichever it can interpret best for the prompt.
|
|
873
|
-
|
|
874
|
-
| Tool | `source_images` (required edit base) | `reference_images` (style / composition) | `visual_dna_ids` (character/style identity) | `moodboard_id` (art direction) |
|
|
875
|
-
|---|:-:|:-:|:-:|:-:|
|
|
876
|
-
| `generate_image` | — | ✅ | ✅ | ✅ |
|
|
877
|
-
| `generate_image_edit` | ✅ (required) | — (source_images plays this role) | ✅ | ✅ |
|
|
878
|
-
| `generate_creative_director` | — | ✅ (applied to every scene) | ✅ (locks character across every scene) | ✅ / `moodboard_ids` |
|
|
879
|
-
| `generate_elements` (video) | — | ✅ (also `reference_videos`, `audio_url`) | ✅ | — |
|
|
880
|
-
|
|
881
|
-
**Practical combinations to know:**
|
|
882
|
-
- *"Make her in a Tokyo street, matching this mood board, with the same face as Visual DNA Maya"* → `generate_image` with `visual_dna_ids=[maya], moodboard_id=tokyo_neon`. No `reference_images` needed.
|
|
883
|
-
- *"Same character, but place her like in this composition"* → `generate_image` with `visual_dna_ids=[maya], reference_images=[layout.png]`. The DNA owns the *face*; the reference owns the *pose/composition*.
|
|
884
|
-
- *"Edit this photo to give her the leather-jacket look from Visual DNA Maya"* → `generate_image_edit` with `source_images=[photo.png], visual_dna_ids=[maya]`. Source is what's edited; the DNA injects the wardrobe identity.
|
|
885
|
-
- *"4 angles of this character, brand-styled"* → `generate_creative_director` with `scene_count=4, visual_dna_ids=[maya], moodboard_id=brand_x`. DNA keeps the face; moodboard sets the look.
|
|
886
|
-
- *"Generate 6 product hero shots; here are 3 reference comp images and our brand moodboard"* → `generate_creative_director` with `scene_count=6, reference_images=[comp1, comp2, comp3], moodboard_id=brand_x`. No DNA needed if it's a product not a face.
|
|
887
|
-
|
|
888
|
-
**Rule of thumb for which to use:**
|
|
889
|
-
- Need an **identity** (face, character, specific product) to stay constant → `visual_dna_ids`.
|
|
890
|
-
- Need a **composition / pose / mood reference** → `reference_images`.
|
|
891
|
-
- Need an **overall style / palette / brand look** → `moodboard_id`.
|
|
892
|
-
- Need all three at once → pass all three. They compose.
|
|
893
|
-
|
|
894
|
-
### Tagging references inside the prompt (CRITICAL for multi-reference accuracy)
|
|
895
|
-
|
|
896
|
-
When a generation call passes ANY references — `reference_images`,
|
|
897
|
-
`source_images`, `reference_videos`, `source_videos`, `reference_audio`,
|
|
898
|
-
`elements`, OR `visual_dna_ids` — name them inside the prompt so the model
|
|
899
|
-
knows **which asset plays which role**. Without tags, the engine guesses
|
|
900
|
-
and the wrong reference bleeds into the wrong slot ("she ended up wearing
|
|
901
|
-
the background's color" / "the second character got the first character's
|
|
902
|
-
face" / "the wrong song was used as the rhythm reference").
|
|
903
|
-
|
|
904
|
-
**Tag namespaces, used together:**
|
|
905
|
-
|
|
906
|
-
| Tag | Refers to | Order rule |
|
|
907
|
-
|---|---|---|
|
|
908
|
-
| `@image1`, `@image2`, … | Plain images in `reference_images` / `source_images` | Position in the array — `@image1` = `images[0]`, etc. |
|
|
909
|
-
| `@video1`, `@video2`, … | Videos in `reference_videos` / `source_videos` / video `elements` slots | Position in the array. |
|
|
910
|
-
| `@audio1`, `@audio2`, … | Audio in `reference_audio` / `audio` slots (lipsync source, music style ref, voice clone, etc.) | Position in the array. |
|
|
911
|
-
| `@<dna-name>` | A Visual DNA — use the literal `name` field from `create_visual_dna` / `list_visual_dnas` (any language, case-insensitive) | Name-based, never positional. See "@name Syntax" rule below. |
|
|
912
|
-
|
|
913
|
-
**Reserved**: `@Image\d+`, `@Video\d+`, `@Audio\d+` are reserved by the Kinovi
|
|
914
|
-
Omni Reference parser — they are NOT looked up as Visual DNAs. Never name a
|
|
915
|
-
Visual DNA `Image1` / `Video2` / etc. (kolbo-api rejects this on creation).
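A name check along these lines can catch a reserved name before calling `create_visual_dna`. It is a sketch: the pattern mirrors the reserved forms listed above, and matching case-insensitively is an assumption (the text says DNA names are case-insensitive, but does not state how the parser treats case).

```python
import re

# Reserved forms from the note above; IGNORECASE is an assumption, not confirmed behavior.
RESERVED_TAG = re.compile(r"(image|video|audio)\d+", re.IGNORECASE)


def is_reserved_dna_name(name: str) -> bool:
    """True when the whole name looks like Image1 / Video2 / Audio3."""
    return RESERVED_TAG.fullmatch(name.strip()) is not None
```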
|
|
916
|
-
|
|
917
|
-
**How to write a tagged prompt:**
|
|
918
|
-
|
|
919
|
-
```
|
|
920
|
-
Place @maya at the coffee-shop counter from @image1, wearing the leather jacket from @image2.
|
|
921
|
-
Keep the warm window light from @image1; ignore the people in the background of @image2.
|
|
922
|
-
```
|
|
923
|
-
|
|
924
|
-
```
|
|
925
|
-
Animate @maya walking through @video1's snowy street, matching the camera move of @video1; ignore the people in @video1.
|
|
926
|
-
```
|
|
927
|
-
|
|
928
|
-
```
|
|
929
|
-
Lipsync @video1's speaker to the dialogue track @audio1, keeping the original ambient room tone of @video1.
|
|
930
|
-
```
|
|
931
|
-
|
|
932
|
-
```
|
|
933
|
-
Compose a 30s track in the style of @audio1 (slow tempo, no vocals), suitable for a product reveal video.
|
|
934
|
-
```
|
|
935
|
-
|
|
936
|
-
What a tagged prompt does at submission time:
|
|
937
|
-
- `visual_dna_ids: [vdna_8f2c]` → bound to `@maya`
|
|
938
|
-
- `reference_images: [coffee_shop.jpg, jacket_ref.jpg]` → bound to `@image1`, `@image2`
|
|
939
|
-
- `reference_videos: [walking_clip.mp4]` → bound to `@video1`
|
|
940
|
-
- `reference_audio: [dialogue.wav]` → bound to `@audio1`
|
|
941
|
-
- The prompt names each one, so the engine never has to guess.
|
|
942
|
-
|
|
943
|
-
**Rules:**
|
|
944
|
-
|
|
945
|
-
1. **Order is contract.** `@imageN` / `@videoN` / `@audioN` / `@elementN` are bound to position N in the array you pass. Reordering silently changes what each tag points to — don't reorder mid-conversation; if you need to add a new ref, append it (`@image3`, `@video2`, …) rather than inserting.
|
|
946
|
-
2. **For edits, the source is `@image1` (or `@video1`).** In `generate_image_edit`, the first entry of `source_images` is the canonical base — refer to it as `@image1`. Same for video tools that take `source_videos`: the first entry is `@video1`. Additional sources become `@image2`/`@video2`/etc.
|
|
947
|
-
3. **Visual DNA tags are name-based, not positional.** `@maya` always means the DNA you registered as `name: "maya"`, regardless of where its id sits in `visual_dna_ids`.
|
|
948
|
-
4. **Tag every reference you actually pass.** If you pass a reference but never mention it in the prompt, the engine often treats it as decorative — either drop it or name it explicitly. This applies to images, videos, audio, AND Visual DNAs.
|
|
949
|
-
5. **Tags carry across the production log.** When you log a generation to `.kolbo/production.md`, write the prompt with the tags intact and record the `@name → URL` / `@name → vdna_id` binding alongside. That way "the rainy scene from last week" remains reproducible weeks later.
|
|
950
|
-
6. **Tag even single-reference calls when a DNA, video, or audio is involved.** Single plain image with no DNA can use prose ("this image"), but as soon as the call also carries a DNA, a video ref, or an audio ref, tag every asset so the engine knows the subject vs. the modifier role.
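Rules 1, 3, and 4 can be checked mechanically before submitting. The sketch below derives the positional tags from array order, adds name-based DNA tags, and reports any bound reference the prompt never mentions. The function names and the `dna` mapping (name to id) are illustrative choices, not part of any Kolbo tool.

```python
import re
from typing import Dict, List, Mapping, Sequence


def bind_tags(
    reference_images: Sequence[str] = (),
    reference_videos: Sequence[str] = (),
    reference_audio: Sequence[str] = (),
    dna: Mapping[str, str] = {},
) -> Dict[str, str]:
    """Map each tag to its asset: positional for media, name-based for DNAs."""
    bindings: Dict[str, str] = {}
    groups = (("image", reference_images), ("video", reference_videos), ("audio", reference_audio))
    for prefix, items in groups:
        for position, item in enumerate(items, start=1):
            bindings[f"@{prefix}{position}"] = item  # @image1 is index 0, and so on
    for name, dna_id in dna.items():
        bindings[f"@{name}"] = dna_id
    return bindings


def untagged_references(prompt: str, bindings: Mapping[str, str]) -> List[str]:
    """Return tags that are bound to an asset but never named in the prompt."""
    mentioned = {tag.lower() for tag in re.findall(r"@[\w-]+", prompt)}
    return [tag for tag in bindings if tag.lower() not in mentioned]
```

An empty result from `untagged_references` means every passed reference is named; a non-empty one is the cue to either tag the asset or drop it from the call.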
|
|
951
|
-
|
|
952
|
-
**Failure modes the tags fix:**
|
|
953
|
-
|
|
954
|
-
| Without tags | With tags |
|
|
955
|
-
|---|---|
|
|
956
|
-
| "Combine these two images" → engine averages them | "Put the subject from @image1 into the scene of @image2" |
|
|
957
|
-
| "Same character, new outfit" with 2 refs → wrong face | "Keep @maya's face from the Visual DNA; apply the outfit from @image1" |
|
|
958
|
-
| "Edit this" with 3 source images → engine edits whichever is first | "In @image1, replace the sky with the sky from @image2" |
|
|
959
|
-
| "Lipsync this video to this audio" with 2 audio tracks → wrong track picked | "Lipsync @video1 to @audio1; ignore @audio2 (that's the music bed)" |
|
|
960
|
-
| "Match this video's style" with 2 video refs → blended motion | "Use @video1's camera move; use @video2's color grade" |
|
|
961
|
-
| "Music like this" with a reference track → engine ignores it | "Compose in the style of @audio1, but slower and without vocals" |
|
|
962
|
-
|
|
963
|
-
---
|
|
964
|
-
|
|
965
|
-
## ⚠️ Resolution, Caps & Constraints — read these BEFORE every generation (HARD RULE)
|
|
966
|
-
|
|
967
|
-
Every model exposes a constraint envelope via `list_models`. Submitting a value outside it is a **deterministic 400** — not a degraded result, not a substitution. You MUST consult `list_models` and validate inputs before firing any generation. When in doubt, call `list_models` with `format: "json"` to get the raw model document for programmatic comparison.
|
|
968
|
-
|
|
969
|
-
### Canonical field reference — which `list_models` field controls which input on which tool
|
|
970
|
-
|
|
971
|
-
The same conceptual slot (e.g. "max reference images") lives under **different field names per model family**. Read the row for your tool, not the model name.
|
|
972
|
-
|
|
973
|
-
| Your input | Tool(s) | Field to read on the model | What "0" / `null` means |
|
|
974
|
-
|---|---|---|---|
|
|
975
|
-
| `reference_images` | `generate_image`, `generate_image_edit` (uses `source_images`), `generate_creative_director`, `generate_video` | `max_reference_images` | `0` = model accepts no refs |
|
|
976
|
-
| `reference_images` | `generate_elements` | `elements_max_images` | `0` = model accepts no image refs |
|
|
977
|
-
| `reference_images` | `generate_video_from_video` | `max_images` | `0` = no secondary image input |
|
|
978
|
-
| `reference_videos` | `generate_elements` | `elements_max_videos` | `0` = no video refs |
|
|
979
|
-
| `reference_videos` | `generate_video_from_video` | `max_videos` | `<= 1` = only the source_video |
|
|
980
|
-
| `elements` | `generate_video_from_video` | `max_elements` | `0` = no elements |
|
|
981
|
-
| `audio_url` | `generate_elements` | `elements_max_audio` (+ `max_audio_duration` for the file) | `0` = no audio ref |
|
|
982
|
-
| `visual_dna_ids` | every tool that accepts DNA | `max_visual_dna` (+ `supports_visual_dna` boolean) | `null` / `0` / `false` = model rejects DNA (silently ignored by some paths) |
|
|
983
|
-
| `aspect_ratio` | any | `supported_aspect_ratios` (or `supported_aspect_ratios_by_type[<type>]` when multimodal) | empty → use `default_aspect_ratio` if set |
|
|
984
|
-
| `resolution` | any | `supported_resolutions` (+ `resolution_multipliers` for cost) | empty → model has no resolution tiering |
|
|
985
|
-
| `duration` (video output) | video tools | `supported_durations` if set, else `min_output_duration`–`max_output_duration` | both null → can't validate, omit and let server default |
|
|
986
|
-
| **input** video duration (source) | `generate_lipsync`, `generate_video_from_video` | `min_video_duration` – `max_video_duration` | outside range → reject or upstream truncates |
|
|
987
|
-
| input audio duration | `generate_lipsync`, `generate_elements` audio | `min_audio_duration` – `max_audio_duration` (+ `audio_max_follows_video_duration` for lipsync) | outside range → reject |
|
|
988
|
-
| audio file format | any audio input | `supported_audio_formats` (e.g. `["mp3","wav","m4a"]`; empty = all) | pre-validate before upload |
|
|
989
|
-
| recording duration | `text_to_speech` recording UX | `min_recording_duration` – `max_recording_duration` | usually null for plain TTS |
|
|
990
|
-
| upload file size | every file upload | `max_file_size` (bytes) | null → use platform default |
|
|
991
|
-
| `num_images` | image tools | `images_per_request` overrides for fixed-output models (Midjourney returns 4 regardless) | null → `num_images` honored as-is |
|
|
992
|
-
| `prompt` | every tool | `requires_prompt`, `min_prompt_length`, `max_prompt_length` | null → unconstrained |
|
|
993
|
-
| sound on/off | video tools | `sound_generation_type` (`"native"` vs `"none"`), `sound_enabled_by_default`, `sound_credit_multiplier` | not `"native"` → can't emit synced audio |
|
|
994
|
-
| capability gate | route decision | `supports_visual_dna`, `supports_first_last_frame`, `supports_audio_input` | `false` → the controller silently drops that param |
|
|
995
|
-
|
|
996
|
-
Cost formula: `final_cost = credit × resolution_multipliers[resolution] × (sound_enabled ? sound_credit_multiplier : 1)`, multiplied by `num_images` / `scene_count` as applicable.
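As a worked instance of the formula, a preview calculator might look like the sketch below. The parameter names follow the `list_models` fields named above; treating a resolution that has no entry in `resolution_multipliers` as 1× is an assumption of this sketch.

```python
from typing import Mapping


def preview_cost(
    credit: float,
    resolution: str,
    resolution_multipliers: Mapping[str, float],
    sound_enabled: bool = False,
    sound_credit_multiplier: float = 1.0,
    count: int = 1,  # num_images or scene_count, as applicable
) -> float:
    """Pre-approval estimate only; after firing, use credits_used from the response."""
    resolution_factor = resolution_multipliers.get(resolution, 1.0)  # assumption: missing means 1x
    sound_factor = sound_credit_multiplier if sound_enabled else 1.0
    return credit * resolution_factor * sound_factor * count
```

With a base of 8 credits, four outputs, and multipliers of 1×, 1.5×, and 2×, this gives 32, 48, and 64 credits for the three tiers.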
|
|
997
|
-
|
|
998
|
-
### Validation pattern — every generation
|
|
999
|
-
|
|
1000
|
-
Before submitting:
|
|
1001
|
-
|
|
1002
|
-
1. Call `list_models type=<tool-type>` (text mode is enough for picking; `format: "json"` when you need to programmatically compare caps).
|
|
1003
|
-
2. For each input array (refs / DNAs / elements) — check `length <= <cap>` from the row above. If over, drop the lowest-priority entries OR ask the user.
|
|
1004
|
-
3. For each enumerated value (`aspect_ratio` / `resolution` / `duration`) — check it's in `supported_*`. If not, **do not silently substitute**; show the user the allowed set and ask.
|
|
1005
|
-
4. For each duration-bearing file (source_video for lipsync/v2v, audio for lipsync/elements) — pre-check duration against the min/max range. Use ffmpeg if needed (via `video-production` skill).
|
|
1006
|
-
5. For uploads — pre-check size against `max_file_size`.
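Steps 2 and 3 of the pattern above could be expressed as a pre-flight check over a model document. This is a sketch that covers only `reference_images`, `aspect_ratio`, `resolution`, and output `duration`; the dictionary shapes are assumed, while the field names come from the table above. It returns a list of problems to show the user rather than substituting values.

```python
from typing import Any, Dict, List


def validate_request(model: Dict[str, Any], request: Dict[str, Any]) -> List[str]:
    """Compare a request against a model's caps; an empty list means no problem found."""
    problems: List[str] = []

    cap = model.get("max_reference_images")
    refs = request.get("reference_images", [])
    if cap is not None and len(refs) > cap:
        problems.append(f"reference_images has {len(refs)} entries, cap is {cap}")

    for param, field in (("aspect_ratio", "supported_aspect_ratios"),
                         ("resolution", "supported_resolutions")):
        allowed = model.get(field) or []
        value = request.get(param)
        # An empty allowed list means the model does not constrain this value.
        if value is not None and allowed and value not in allowed:
            problems.append(f"{param}={value!r} is not in {field}: {allowed}")

    duration = request.get("duration")
    if duration is not None:
        allowed = model.get("supported_durations") or []
        low, high = model.get("min_output_duration"), model.get("max_output_duration")
        if allowed:
            if duration not in allowed:
                problems.append(f"duration={duration} is not in supported_durations: {allowed}")
        elif low is not None and high is not None and not (low <= duration <= high):
            problems.append(f"duration={duration} is outside {low}-{high}")
    return problems
```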
|
|
1007
|
-
|
|
1008
|
-
The MCP tool descriptions also embed the cap field name on the relevant parameter (e.g. `reference_images: "...Cap: pass at most max_reference_images..."`) — use those as inline reminders.
|
|
1009
|
-
|
|
1010
|
-
### ⚠️ Quote real cost, never estimates (CRITICAL)
|
|
1011
|
-
|
|
1012
|
-
The formula above is for **pre-approval previews only**. After firing, use the real number from the tool response — every generation now returns `credits_used` (multiplier-adjusted total) and `credits_breakdown` (per-model attribution). Log `credits_used` to `.kolbo/production.md`, not `base × count`.
|
|
1013
|
-
|
|
1014
|
-
```json
|
|
1015
|
-
{ "credits_used": 12, "credits_breakdown": [{ "model": "nano-banana-2", "base": 8, "final": 12, ... }], "urls": [...] }
|
|
1016
|
-
```
|
|
1017
|
-
|
|
1018
|
-
When the user asks "how much did I spend?" → call `mcp__kolbo__get_session_usage` for the real, multiplier-adjusted session total + per-tool + per-model breakdowns (same numbers as the desktop bottom-bar counter).
|
|
1019
|
-
|
|
1020
|
-
### Decision rule
|
|
1021
|
-
|
|
1022
|
-
1. **User specified resolution / sound explicitly** ("4K", "1080p", "480p", "with sound", "silent") → ALWAYS verify the value is in `supported_resolutions` BEFORE firing. If it isn't:
|
|
1023
|
-
- ❌ Do **NOT** silently substitute a "close" value. The user asked for 480p; sending 720p without their consent burns 1.5–2× the credits they expected and produces a different output.
|
|
1024
|
-
- ✅ Show them what the model actually supports in one line and ask which to use:
|
|
1025
|
-
> "Seedance 2 elements supports `[720p, 1080p, 1440p, 2160p]` — 480p isn't available. Closest cheap option is 720p (~+0 credits over your intent). Want 720p, or pick another?"
|
|
1026
|
-
- Only fire after they reply (or after they re-confirm the original intent with the new info).
|
|
1027
|
-
2. **User specified quality intent without numbers** ("draft", "quick test", "final delivery", "for client", "production") → map intent to tier:
|
|
1028
|
-
- draft / quick / preview → cheapest in `supported_resolutions` (1K / 720p)
|
|
1029
|
-
- normal / standard → the mid-tier entry in `supported_resolutions` (typically 2K / 1080p)
|
|
1030
|
-
- final / production / hero → highest the user's budget allows (3K-4K / 1440p-2160p)
|
|
1031
|
-
3. **No quality signal at all** AND the cost difference between cheapest and most-expensive is **>2×** OR total batch is large (≥4 outputs) → **ask the user once** with a one-line cost comparison, then fall back to the cheapest supported tier if they don't reply. Example:
|
|
1032
|
-
> "This model offers 1K (8 cr × 4 = 32), 2K (1.5×: 48), 4K (2×: 64). Default to 1K? Or pick 2K/4K?"
|
|
1033
|
-
4. **No quality signal AND cost difference is small** (≤1.5×) → quietly use the cheapest supported, no need to interrupt.
|
|
1034
|
-
5. **Sound on a video model with `sound_credit_multiplier > 1`** → if user didn't ask for sound, leave it off (saves credits). If user said "with sound" / "with music" / "with audio", enable it.
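Rules 1, 3, and 4 reduce to a small decision function, sketched below. It deliberately leaves out the intent mapping of rule 2 and the sound handling of rule 5. The text does not say what to do when the cost spread falls between 1.5× and 2×; this sketch treats that range like rule 4, which is an assumption.

```python
from typing import List, Mapping, Optional, Sequence, Tuple, Union

Decision = Tuple[str, Union[str, List[str], None]]


def decide_resolution(
    requested: Optional[str],
    supported: Sequence[str],
    multipliers: Mapping[str, float],
    outputs: int = 1,
) -> Decision:
    """Return ("use", value) to proceed, or ("ask", options) to check with the user."""
    if not supported:
        return ("use", None)  # model has no resolution tiering
    if requested is not None:
        # Rule 1: never substitute a "close" value for an explicit request.
        return ("use", requested) if requested in supported else ("ask", list(supported))
    costs = [multipliers.get(r, 1.0) for r in supported]
    cheapest = min(supported, key=lambda r: multipliers.get(r, 1.0))
    spread = max(costs) / min(costs)
    if spread > 2 or outputs >= 4:
        return ("ask", list(supported))  # rule 3: one cost question, then wait
    return ("use", cheapest)  # rule 4, extended here to spreads up to 2x (assumption)
```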
|
|
1035
|
-
|
|
1036
|
-
### Defaults when nothing is specified
|
|
1037
|
-
|
|
1038
|
-
- **Image**: `1K` (or the cheapest in `supported_resolutions`).
|
|
1039
|
-
- **Video**: `720p` (or the cheapest), with `default_duration` (or shortest in `supported_durations`).
|
|
1040
|
-
- **Sound**: respect `sound_enabled_by_default`; if false, leave off.
|
|
1041
|
-
|
|
1042
|
-
### Always log the resolution / duration / sound choices
|
|
1043
|
-
|
|
1044
|
-
Production-log entries should include the resolution and (for video) duration + sound state alongside the URL, so the user can see what they paid for:
|
|
1045
|
-
|
|
1046
|
-
```md
|
|
1047
|
-
- still: https://...01-coffee.png (flux-2-pro · 1K, 2026-05-14)
|
|
1048
|
-
- video: https://...02-rain.mp4 (kling-2 · 1080p · 5s · sound-off, 2026-05-14)
|
|
1049
|
-
```
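A formatter for entries in this shape might look like the following sketch. The layout copies the two sample lines above; the `sound-on` label for enabled sound is an assumption, since the sample only shows `sound-off`.

```python
from typing import Optional


def log_entry(
    kind: str,
    url: str,
    model: str,
    resolution: str,
    date: str,
    duration_s: Optional[int] = None,
    sound: Optional[bool] = None,
) -> str:
    """Build one production-log line: URL, then model, resolution, duration, sound, date."""
    parts = [model, resolution]
    if duration_s is not None:
        parts.append(f"{duration_s}s")
    if sound is not None:
        parts.append("sound-on" if sound else "sound-off")  # "sound-on" is assumed wording
    return f"- {kind}: {url} ({' · '.join(parts)}, {date})"
```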
|
|
1050
|
-
|
|
1051
|
-
---
|
|
1052
|
-
|
|
1053
|
-
## Visual DNA (Character/Style Consistency)
|
|
1054
|
-
|
|
1055
|
-
Visual DNA profiles capture the visual "identity" of a character, style, product, or scene from reference media.
|
|
1056
|
-
|
|
1057
|
-
### Workflow
|
|
1058
|
-
1. **Create** a profile with `create_visual_dna` — provide reference images (max 4 — if the user gives more, pick the 4 most representative or ask which to keep; never pass 5+), optionally video and audio
|
|
1059
|
-
2. **Types**: `character` (default), `style`, `product`, `scene`, `environment`
|
|
1060
|
-
3. **Use** the profile by passing its `id` in `visual_dna_ids` in: `generate_image`, `generate_image_edit`, `generate_creative_director`, `generate_elements`
|
|
1061
|
-
4. **List/inspect** profiles with `list_visual_dnas` / `get_visual_dna`
|
|
1062
|
-
|
|
1063
|
-
### ⚠️ Pre-flight: Verify the Visual DNA Exists Before Using It (MANDATORY)
|
|
1064
|
-
|
|
1065
|
-
NEVER reference a Visual DNA by name, role, or assumed identity without first
|
|
1066
|
-
confirming it exists in the user's library. This is a frequent failure mode:
|
|
1067
|
-
the user mentions a character ("אסתר", "Maya", "the model from before"), the
|
|
1068
|
-
agent assumes a matching Visual DNA exists, calls `generate_image` /
|
|
1069
|
-
`generate_elements` with a guessed or fabricated `visual_dna_ids` value, and
|
|
1070
|
-
the generation fails or produces the wrong identity.
|
|
1071
|
-
|
|
1072
|
-
**Before** any generation call that uses `visual_dna_ids`:
|
|
1073
|
-
|
|
1074
|
-
1. Call `list_visual_dnas` to get the actual available DNAs (id + name).
|
|
1075
|
-
2. Match the user's reference (by name, type, or your `.kolbo/production.md`
|
|
1076
|
-
log) to a real DNA in that list.
|
|
1077
|
-
3. If there is **no match**, STOP and ask the user one of:
|
|
1078
|
-
- "I don't see a Visual DNA named <X> in your library. Do you want me
|
|
1079
|
-
to create one now (I'll need reference image(s)), use an existing
|
|
1080
|
-
DNA (<list>), or proceed without DNA using direct reference images?"
|
|
1081
|
-
4. Only proceed once you have a real `vdna_*` id confirmed by either the
|
|
1082
|
-
list or a fresh `create_visual_dna` call you just made.
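
The pre-flight steps above can be sketched as a lookup over the `list_visual_dnas` result. The response shape (a list of dicts with `id` and `name`) is an assumption, and `resolve_visual_dna` is a hypothetical helper, not part of any Kolbo API.

```python
from typing import Optional


def resolve_visual_dna(requested_name: str, listed: list[dict]) -> Optional[str]:
    """Return the id of the listed DNA whose name matches, or None when nothing matches."""
    wanted = requested_name.strip().casefold()
    for dna in listed:
        if dna["name"].casefold() == wanted:
            return dna["id"]
    return None  # Caller must stop and ask the user instead of guessing an id.
```

A `None` result corresponds to step 3: stop and ask the user whether to create a DNA, pick an existing one, or proceed with direct reference images.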
|
|
1083
|
-
|
|
1084
|
-
Do NOT:
|
|
1085
|
-
- Invent a Visual DNA id or assume one exists from context.
|
|
1086
|
-
- Use the same DNA id for a different character because "it sounded close."
|
|
1087
|
-
- Carry a DNA id from `.kolbo/production.md` into a new generation without
|
|
1088
|
-
re-confirming it still exists (`list_visual_dnas` is cheap — call it).
|
|
1089
|
-
|
|
1090
|
-
When the user says "use the model אסתר" but you've only created a DNA for
|
|
1091
|
-
"זוהר", you MUST ask before generating — never silently substitute or guess.
|
|
1092
|
-
|
|
1093
|
-
### ⚠️ Don't re-fetch / re-list your own outputs (CRITICAL)
|
|
1094
|
-
|
|
1095
|
-
After a generation tool returns its URLs, those URLs are **already** in the canvas (the desktop app's gallery panel) and in `.kolbo/production.md`. Do **NOT** call `list_media`, `get_media`, `get_media_stats`, `list_visual_dnas`, or `chat_send_message` with `media_urls` on those URLs just to "verify" or "fetch thumbnails of the results" — that's pure noise:
|
|
1096
|
-
|
|
1097
|
-
- It burns credits and time for zero new information.
|
|
1098
|
-
- Every such tool call streams partial output into the session, which forces the desktop canvas to re-evaluate (visible flicker on the gallery tiles).
|
|
1099
|
-
- The thumbnails returned by `list_media` / `get_media` are the SAME asset you just generated; you don't need a thumbnail of a thumbnail.
|
|
1100
|
-
|
|
1101
|
-
**Only call list/get media tools when:**
|
|
1102
|
-
- The user explicitly asks ("what do I have in my library?", "show me my old DNAs").
|
|
1103
|
-
- You need details about something generated in an **earlier session** that you don't have a record of.
|
|
1104
|
-
- You're chasing a specific user reference like "the rainy clip from yesterday" that isn't in the current chat's `.kolbo/production.md`.
|
|
1105
|
-
|
|
1106
|
-
**Only call `chat_send_message` with `media_urls` when:**
|
|
1107
|
-
- The user uploaded media themselves and asks you to analyze / describe / extract info from it.
|
|
1108
|
-
- You need to read a video / audio file you didn't generate.
|
|
1109
|
-
|
|
1110
|
-
For media you generated this session, you already know the prompt, model, and result URL — write that into `.kolbo/production.md` and reference it from context.
|
|
1111
|
-
|
|
1112
|
-
### ⚠️ Presenting list results — show thumbnails (MANDATORY)
|
|
1113
|
-
|
|
1114
|
-
When you display the result of `list_visual_dnas`, `list_media`,
|
|
1115
|
-
`list_moodboards`, or any other tool that returns items with image/thumbnail
|
|
1116
|
-
URLs, render each item's thumbnail as a markdown image so the user can
|
|
1117
|
-
actually see what they have. The chat view auto-renders both `![alt](url)`
|
|
1118
|
-
markdown and bare image URLs, plus auto-injects a player below links to
|
|
1119
|
-
videos/audio — use that.
|
|
1120
|
-
|
|
1121
|
-
Do NOT dump a text-only bullet list of ids + names when a thumbnail field
|
|
1122
|
-
is available in the response.
|
|
1123
|
-
|
|
1124
|
-
**Visual DNA listing format:**
|
|
1125
|
-
```
|
|
1126
|
-
Visual DNAs (6):
|
|
1127
|
-
1. **Maya** — `vdna_abc` (character)
|
|
1128
|
-

|
|
1129
|
-
2. **Tokyo Neon** — `vdna_xyz` (style)
|
|
1130
|
-

|
|
1131
|
-
```
|
|
1132
|
-
|
|
1133
|
-
**Media listing format:**
|
|
1134
|
-
```
|
|
1135
|
-
1. **rain-loop.mp4** — `med_abc` (video, 5s, 1080p)
|
|
1136
|
-
https://cdn.kolbo.ai/.../rain-loop.mp4
|
|
1137
|
-
2. **coffee-01.png** — `med_def` (image, 1024x1024)
|
|
1138
|
-

|
|
1139
|
-
```
|
|
1140
|
-
|
|
1141
|
-
Fields to read for the image source (use the first one present on the item):
|
|
1142
|
-
`thumbnail`, `thumbnail_url`, `preview_url`, `url`, `image`. For videos and
|
|
1143
|
-
audio, use the file `url` directly — the chat view renders a player inline.
|
|
1144
|
-
|
|
1145
|
-
If an item lacks any image/preview field, fall back to text-only for that
|
|
1146
|
-
row, but never skip thumbnails on the rows that do have them.
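
The field fallback described above can be sketched as follows. The field order is taken from the text; the item shape (a dict with optional `type` and `name` keys) and the helper name are assumptions for illustration.

```python
from typing import Optional

# Field order from the text: use the first one present on the item.
THUMBNAIL_FIELDS = ("thumbnail", "thumbnail_url", "preview_url", "url", "image")


def thumbnail_line(item: dict) -> Optional[str]:
    """Return the line to print under a list row, or None for a text-only row."""
    if item.get("type") in ("video", "audio"):
        # Bare file URL: the chat view renders a player for it.
        return item.get("url")
    for field in THUMBNAIL_FIELDS:
        src = item.get(field)
        if src:
            return f"![{item.get('name', '')}]({src})"
    return None
```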
|
|
1147
|
-
|
|
1148
|
-
### ⚠️ @name Syntax — ALWAYS use it when passing visual_dna_ids (MANDATORY)
|
|
1149
|
-
|
|
1150
|
-
Whenever a generation call passes `visual_dna_ids` (even just one), the
|
|
1151
|
-
prompt MUST refer to each Visual DNA by `@<exact-name>` — the literal `name`
|
|
1152
|
-
field as it was set in `create_visual_dna` and as it appears in
|
|
1153
|
-
`list_visual_dnas`. This is how the engine binds the DNA to a role in the
|
|
1154
|
-
scene. Without `@name`, the engine guesses, drops the DNA, or blends
|
|
1155
|
-
multiple DNAs together.
|
|
1156
|
-
|
|
1157
|
-
**Use the actual stored name, programmatically.** When you call
|
|
1158
|
-
`list_visual_dnas` (or `create_visual_dna`), read the `name` field off the
|
|
1159
|
-
response and use that exact string after the `@`. Do NOT:
|
|
1160
|
-
|
|
1161
|
-
- Translate the name into another language ("אסתר" / "esther" / "אסתי" —
|
|
1162
|
-
pick whichever string is in `name` and use ONLY that one).
|
|
1163
|
-
- Invent a friendlier alias ("the model", "המודל", "her").
|
|
1164
|
-
- Write the character's name in plain text without the `@` prefix.
|
|
1165
|
-
- Drop the `@name` when only one DNA is passed — the engine still needs the
|
|
1166
|
-
binding so it knows the DNA is the *subject* and not a passive style.
|
|
1167
|
-
|
|
1168
|
-
**Wrong** (DNA `name` is `esther_model`, user wrote prompt in Hebrew):
|
|
1169
|
-
```
|
|
1170
|
-
prompt: "אסתר לובשת שרשרת זהב, פורטרט חצי גוף"
|
|
1171
|
-
visual_dna_ids: ["vdna_abc"]
|
|
1172
|
-
```
|
|
1173
|
-
The engine sees plain text "אסתר" and has no idea it should bind to the DNA.
|
|
1174
|
-
|
|
1175
|
-
**Right:**
|
|
1176
|
-
```
|
|
1177
|
-
prompt: "@esther_model לובשת שרשרת זהב, פורטרט חצי גוף"
|
|
1178
|
-
visual_dna_ids: ["vdna_abc"] // esther_model
|
|
1179
|
-
```
|
|
1180
|
-
|
|
1181
|
-
**Multi-DNA example:**
|
|
1182
|
-
```
|
|
1183
|
-
prompt: "@dana standing in @shop, picking up a product"
|
|
1184
|
-
visual_dna_ids: ["vdna_abc", // dana
|
|
1185
|
-
"vdna_xyz"] // shop
|
|
1186
|
-
```
|
|
1187
|
-
|
|
1188
|
-
**How `@name` actually binds:** kolbo-api parses the prompt for `@<name>`
|
|
1189
|
-
mentions, queries the DB for a Visual DNA whose `name` matches
|
|
1190
|
-
(case-insensitive), and **replaces the `@name` token with that DNA's stored
|
|
1191
|
-
`systemPrompt`**. If no `@name` is in the prompt, the systemPrompt never
|
|
1192
|
-
gets injected — the `visual_dna_ids` slot is effectively wasted.
|
|
1193
|
-
|
|
1194
|
-
The match is **literal and case-insensitive**, so:
|
|
1195
|
-
- The `@name` must equal the stored `name` field (e.g. if `name: "esther_model"`
|
|
1196
|
-
→ write `@esther_model`, not `@Esther`, not `@אסתר`, not `@the model`).
|
|
1197
|
-
- Any-language characters are supported — if the DNA was created with
|
|
1198
|
-
`name: "אסתר"` you write `@אסתר`. Use the EXACT stored string.
|
|
1199
|
-
- Mentions terminate at punctuation (`.,!?`), double-spaces, another `@`,
|
|
1200
|
-
or end of string. So `@maya, wearing...` matches `maya`.
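
A rough Python sketch of the mention rules above, useful for checking a prompt before sending it. This is not the kolbo-api parser: the regular expression is an assumption that treats any whitespace, `.,!?`, or another `@` as the end of a mention, which fits single-token names.

```python
import re

# A mention is "@" followed by characters that are not whitespace, "@", or . , ! ?
MENTION = re.compile(r"@([^\s@.,!?]+)")


def unbound_dnas(prompt: str, dna_names: list[str]) -> list[str]:
    """Return the DNA names that the prompt never mentions as @name (case-insensitive)."""
    mentioned = {m.casefold() for m in MENTION.findall(prompt)}
    return [name for name in dna_names if name.casefold() not in mentioned]
```

A non-empty result means at least one id in `visual_dna_ids` has no `@name` binding in the prompt and the prompt should be rewritten before the call.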
|
|
1201
|
-
|
|
1202
|
-
This composes with `@image1` / `@image2` positional tags for plain
|
|
1203
|
-
reference/source images — see "Tagging references inside the prompt" above
|
|
1204
|
-
for the full system.
|
|
1205
|
-
|
|
1206
|
-
**⚠️ Naming rule for `create_visual_dna` — NO SPACES (MANDATORY).** The
|
|
1207
|
-
`name` you set MUST be a **single token, lowercase, no spaces, ASCII-safe**
|
|
1208
|
-
— `esther_model`, `dana`, `tokyo_neon`, `brand_red`. Never `Sarah Johnson`,
|
|
1209
|
-
never `the red dress`. Reason: the prompt parser stops the `@<token>`
|
|
1210
|
-
match at the first space (and at `.,!?` punctuation). So `@Sarah Johnson`
|
|
1211
|
-
matches *only* `Sarah` — if no DNA named `Sarah` exists, the mention is
|
|
1212
|
-
silently dropped and the DNA never binds. A single-token name is the only
|
|
1213
|
-
way to guarantee inline `@name` works in any sentence, in any language,
|
|
1214
|
-
without forcing the user to write awkward punctuation around it. Use
|
|
1215
|
-
underscores for multi-word concepts (`old_town`, not `Old Town`). When
|
|
1216
|
-
the user proposes a name with spaces, accept the intent but collapse it
|
|
1217
|
-
into a single token before storing (`"Sarah Johnson"` → `sarah_johnson`)
|
|
1218
|
-
and tell them once how you'll refer to it. Source of truth:
|
|
1219
|
-
[kolbo-docs / Visual DNA & @ References](https://docs.kolbo.ai/kolbo-code/visual-dna).
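
One way to collapse a proposed name into a single token, as a sketch. The helper `to_dna_name` is hypothetical; it keeps only lowercase ASCII letters and digits, so a name written entirely in another script is rejected rather than transliterated.

```python
import re


def to_dna_name(raw: str) -> str:
    """Collapse a user-proposed name into a lowercase, underscore-joined ASCII token."""
    token = re.sub(r"[^a-z0-9]+", "_", raw.strip().lower()).strip("_")
    if not token:
        raise ValueError("name has no ASCII letters or digits; ask the user for another name")
    return token
```

For example, `to_dna_name("Sarah Johnson")` gives `sarah_johnson`, which can then be written inline as `@sarah_johnson`.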
|
|
1220
|
-
|
|
1221
|
-
### Visual DNA Limits
|
|
1222
|
-
|
|
1223
|
-
Read `max_visual_dna` from `list_models` for the exact cap, AND `supports_visual_dna` for the on/off boolean — a model can support DNA without an explicit cap, or have a non-null cap but silently ignore DNA on certain paths (e.g. `generate_video`). Typical ranges: image models (non-Kling) up to **8**, Kling image models **3**, Elements video models **3–5**, everything else up to **3**. The canonical field reference table above gives the per-tool routing.
|
|
1224
|
-
|
|
1225
|
-
### ⚠️ Visual DNA Creation — Always Generate Reference Images First (MANDATORY)
|
|
1226
|
-
|
|
1227
|
-
**Before calling `create_visual_dna` for a character**, always generate 2 reference images first and include them alongside any user-provided images. These give the Visual DNA engine multi-angle coverage and dramatically improve consistency:
|
|
1228
|
-
|
|
1229
|
-
**Step 1 — Generate both images in parallel (one `generate_image` call each, fire simultaneously):**
|
|
1230
|
-
|
|
1231
|
-
1. **4-angle character sheet** — prompt: `"[character description], character reference sheet showing front view, back view, left side view, right side view, four panels arranged in a 2x2 grid, neutral solid background, full body, photorealistic"`, aspect ratio `16:9`
|
|
1232
|
-
2. **Close-up portrait** — prompt: `"[character description], close-up portrait, face and shoulders, neutral solid background, soft studio lighting, photorealistic"`, aspect ratio `1:1`
|
|
1233
|
-
|
|
1234
|
-
**Step 2 — Call `create_visual_dna`** with:
|
|
1235
|
-
- `images`: the 4-angle sheet URL first, then the close-up URL — **plus** the user's reference photo(s) only if they provided one (i.e. a real person or existing character they want to match). If they gave no reference image, the 2 generated images alone are sufficient.
|
|
1236
|
-
- `type`: `"character"`
|
|
1237
|
-
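- `name`: a descriptive single-token name (lowercase, no spaces, e.g. `esther_model`), per the naming rule for `create_visual_dna` above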
|
|
1238
|
-
|
|
1239
|
-
**Why:** A single reference photo only shows one angle. The close-up gives the engine facial detail; the 4-angle sheet gives it body geometry and pose range. Together they produce far more consistent generations.
|
|
1240
|
-
|
|
1241
|
-
**Skip this only if** the user explicitly says "just use my image as-is" or provides 3+ reference images already covering multiple angles.
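
A small sketch of how the `images` list for Step 2 might be assembled, in the order given above and respecting the 4-image limit stated in the workflow. The helper name is hypothetical; raising an error stands in for "ask the user which references to keep".

```python
MAX_REFERENCE_IMAGES = 4  # Limit stated in the Visual DNA workflow.


def build_dna_images(sheet_url: str, closeup_url: str, user_refs: tuple[str, ...] = ()) -> list[str]:
    """Order: 4-angle sheet, close-up, then any user-provided references."""
    images = [sheet_url, closeup_url, *user_refs]
    if len(images) > MAX_REFERENCE_IMAGES:
        # Never pass 5+ images; the choice of which to drop belongs to the user.
        raise ValueError("too many reference images; ask the user which ones to keep")
    return images
```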
|
|
1242
|
-
|
|
1243
|
-
### When to Use
|
|
1244
|
-
- User wants the same character across multiple **images** or a campaign → `generate_image` / `generate_creative_director` with `visual_dna_ids`
|
|
1245
|
-
- User wants to animate a character in video using **elements models** (Seedance 2, Kling O3 Reference, Grok Imagine, Veo 3.1, etc.) → `generate_elements` with `visual_dna_ids`
|
|
1246
|
-
- User wants a consistent brand style across a campaign → `generate_creative_director` with `visual_dna_ids`
|
|
1247
|
-
- User references "keep the same look", "same character", or "use that character"
|
|
1248
|
-
- User provides reference photos of a person/product to maintain consistency
|
|
1249
|
-
- User asks to put a character in a specific environment or scene → create both a character Visual DNA and an environment Visual DNA, use `@name` syntax to place them
|
|
1250
|
-
|
|
1251
|
-
### ⚠️ When NOT to Use Visual DNA
|
|
1252
|
-
- **Animating an image** → `generate_video_from_image`; the source image IS the reference, don't add `visual_dna_ids`.
|
|
1253
|
-
- **Video DNA support is limited to `generate_elements`** (Seedance 2, Kling O3 Reference, Grok Imagine). `generate_video`, `generate_video_from_image`, and `generate_first_last_frame` all ignore `visual_dna_ids` — for character-consistent video, route through `generate_elements`.
|
|
1254
|
-
|
|
1255
|
-
---
|
|
1256
|
-
|
|
1257
|
-
## Video Prompts
|
|
1258
|
-
|
|
1259
|
-
Video costs more per generation than images — write prompts deliberately to get it right the first time.
|
|
1260
|
-
|
|
1261
|
-
### Core Rules
|
|
1262
|
-
- **Order**: Subject → Action → Camera → Style → Constraints → Audio
|
|
1263
|
-
- **Length**: 80-280 words. Shorter = random. Longer = the model forgets the start.
|
|
1264
|
-
- **Always specify at least one camera movement per shot.** Even "static wide shot" is a valid explicit choice — just don't leave it unsaid.
|
|
1265
|
-
- **Character consistency**: when a character appears across shots, begin the prompt with the literal phrase `same character throughout all shots` to prevent identity drift.
|
|
1266
|
-
- **Max 3 shots per prompt.** More shots cause the model to drift.
|
|
1267
|
-
- **Duration-aware timecodes**: if the user gives a duration, space timecodes to fit (`[0s] [3s]` for 5s total; `[0s] [3s] [6s]` for 10s total). If no duration is given, describe shots sequentially without hardcoded timecodes.
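
The timecode rule can be illustrated with a short Python sketch. The spacing formula is an assumption: a fixed 3-second step capped at three shots happens to reproduce both examples above, but the text does not prescribe a formula.

```python
def shot_timecodes(duration_s: int, step_s: int = 3, max_shots: int = 3) -> str:
    """Space shot start times evenly inside the clip, never exceeding max_shots."""
    starts = list(range(0, duration_s, step_s))[:max_shots]
    return " ".join(f"[{s}s]" for s in starts)
```

When the user gives no duration, skip this helper entirely and describe the shots sequentially.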
|
|
1268
|
-
|
|
1269
|
-
### ⚠️ Pick the right video tool
|
|
1270
|
-
|
|
1271
|
-
There are SIX distinct video modes. They take different inputs and route to different model families. Pick by what the user actually has on hand:
|
|
1272
|
-
|
|
1273
|
-
| User has… | Use | Primary inputs | Visual DNA? |
|
|
1274
|
-
|---|---|:-:|:-:|
|
|
1275
|
-
| Nothing — just a text idea | `generate_video` | `prompt` (+ optional `reference_images`, `preset_id`) | **❌ No** (controller ignores DNA — use `generate_elements` if you need DNA) |
|
|
1276
|
-
| One still image they want animated | `generate_video_from_image` | `image_url` + motion `prompt` | ✅ Yes |
|
|
1277
|
-
| An existing video to restyle / transform | `generate_video_from_video` | `source_video` + restyle `prompt` (+ optional `reference_images`, `reference_videos`, `elements`) | ✅ Yes |
|
|
1278
|
-
| Loose assets (products, characters, refs) to compose into a video | `generate_elements` | `prompt` + any of `reference_images`, `reference_videos`, `audio_url`, `files`, `visual_dna_ids` | ✅ Yes (PRIMARY route for DNA→video) |
|
|
1279
|
-
| Two keyframes (start + end) — wants smooth morph between them | `generate_first_last_frame` | `first_frame_url` + `last_frame_url` (or `first_frame` + `last_frame` paths) + optional motion `prompt` | ✅ Yes |
|
|
1280
|
-
| Image or video face + audio to dub | `generate_lipsync` | `source` (image OR video) + `audio` + optional `text_prompt` + optional `bounding_box_target` | — |
|
|
1281
|
-
|
|
1282
|
-
**Rule of thumb:**
|
|
1283
|
-
- Coordinated **multi-scene** video set ("8 short clips of the character") → `generate_creative_director` with `workflow_type: "video"`, never multiple `generate_video` calls.
|
|
1284
|
-
- Need a **character to stay the same** across multiple videos → DNA only flows through `generate_elements`, `generate_video_from_image`, `generate_video_from_video`, `generate_first_last_frame`. **NOT through `generate_video`** — text-to-video silently drops `visual_dna_ids`.
|
|
1285
|
-
|
|
1286
|
-
### Text-to-Video (`generate_video`)
|
|
1287
|
-
Pure text → video. No source media. Pass `prompt`, optional `reference_images` (style/composition cue), optional `preset_id`. Use `list_models type="text_to_video"` to pick a model, then read `supported_durations`, `supported_aspect_ratios`, `supported_resolutions` on it before setting those params.
|
|
1288
|
-
|
|
1289
|
-
### Image-to-Video (`generate_video_from_image`)
|
|
1290
|
-
The model can see the starting frame. Describe **what happens**, not what the image looks like. Focus on motion, camera, and action — don't re-describe the subject or setting.
|
|
1291
|
-
- Good: "Slow dolly-in on the subject. Her hair drifts in a light breeze. Soft particles float through the air. [6s]"
|
|
1292
|
-
- Bad: "A woman with long brown hair standing in a forest, wearing a red dress, with golden sunlight..." (re-describes the image)
|
|
1293
|
-
|
|
1294
|
-
DNA support: yes — `visual_dna_ids` is honored if you need to lock the character to a prior DNA profile.
|
|
1295
|
-
|
|
1296
|
-
### Video-to-Video (`generate_video_from_video`)
|
|
1297
|
-
Restyle / transform an existing video. Describe the **new style**, not the original content — the model preserves the original motion.
|
|
1298
|
-
- Good: "Transform into anime style with cel-shading and vibrant colors"
|
|
1299
|
-
- Bad: "A person walking down a street" (re-describes what's already in the video)
|
|
1300
|
-
|
|
1301
|
-
Per-model extras — **call `list_models type="video_to_video"` and read these caps before passing extras**:
|
|
1302
|
-
|
|
1303
|
-
| Param | Read this cap | Examples |
|
|
1304
|
-
|---|---|---|
|
|
1305
|
-
| `reference_images` | `max_images > 0` | Kling O1/O3 (character ref), Aleph / gen4_aleph (style ref), WAN VACE (character image) |
|
|
1306
|
-
| `reference_videos` | `max_videos > 1` | WAN 2.6 reference-to-video — accepts 1–3 reference videos |
|
|
1307
|
-
| `elements` | `max_elements > 0` | Models that accept additional element images alongside the main video |
|
|
1308
|
-
|
|
1309
|
-
For models that use `reference_videos` as their *primary* input (like WAN 2.6 reference-to-video), pass the first reference video in BOTH `source_video` AND `reference_videos`.
|
|
1310
|
-
|
|
1311
|
-
### Elements — Reference Assets → Video (`generate_elements`)
|
|
1312
|
-
The **primary route for character-consistent video**. Combine any of: reference images, reference videos, an audio track, Visual DNAs. Pass URLs (`reference_images`, `reference_videos`, `audio_url`) or local file paths (`files`).
|
|
1313
|
-
|
|
1314
|
-
Per-model caps — **call `list_models type="elements"`** and read:
|
|
1315
|
-
|
|
1316
|
-
| Param | Read this cap | What it means |
|
|
1317
|
-
|---|---|---|
|
|
1318
|
-
| `reference_images` | `elements_max_images` | Max distinct image references the model accepts |
|
|
1319
|
-
| `reference_videos` | `elements_max_videos` | Most models = 0; non-zero for video-referenced elements models |
|
|
1320
|
-
| `audio_url` | `elements_max_audio` | Most models = 0; non-zero for audio-driven elements models |
|
|
1321
|
-
| `visual_dna_ids` | `max_visual_dna` | Max DNA profiles. Each DNA may expand into multiple slots — the controller distributes them across the available image slots. |
|
|
1322
|
-
|
|
1323
|
-
Top elements models to know: Seedance 2, Kling O3 Reference, Grok Imagine, Veo 3.1. Specs vary — never assume; always `list_models`.
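
A sketch of checking `generate_elements` inputs against the caps in the table above. The cap field names come from the table; the dict shapes and the helper name are assumptions, and a missing cap is treated as unknown and left unchecked.

```python
# Parameter -> cap field, as listed in the table above.
ELEMENTS_CAPS = {
    "reference_images": "elements_max_images",
    "reference_videos": "elements_max_videos",
    "audio_url": "elements_max_audio",
    "visual_dna_ids": "max_visual_dna",
}


def cap_violations(model: dict, params: dict) -> list[str]:
    """Return one message per parameter that exceeds the model's known cap."""
    problems = []
    for param, cap_field in ELEMENTS_CAPS.items():
        value = params.get(param)
        if not value:
            continue
        count = len(value) if isinstance(value, (list, tuple)) else 1
        cap = model.get(cap_field)
        if cap is None:
            continue  # Cap not reported; nothing to compare against.
        if count > cap:
            problems.append(f"{param}: {count} given, {cap_field} allows {cap}")
    return problems
```

An empty list means the call fits the reported caps; anything else should be resolved before firing the generation.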
|
|
1324
|
-
|
|
1325
|
-
### First/Last Frame (`generate_first_last_frame`)
|
|
1326
|
-
Provide two keyframes; the model interpolates a smooth transition. Two input modes (do NOT mix):
|
|
1327
|
-
- URL mode — `first_frame_url` + `last_frame_url`
|
|
1328
|
-
- File mode — `first_frame` + `last_frame` (URLs or absolute local paths)
|
|
1329
|
-
|
|
1330
|
-
Optional `prompt` describes the desired motion (e.g. "smooth dolly-in"). DNA support: yes.
|
|
1331
|
-
|
|
1332
|
-
### Lipsync (`generate_lipsync`)
|
|
1333
|
-
Sync audio to a face. This works for **both image-lipsync and video-lipsync**; the tool auto-detects the source type by file extension. Pass `source` (image OR video URL/path), `audio` (URL/path), optional `text_prompt`, and optional `bounding_box_target` to pick which face when there are several.
|
|
1334
|
-
|
|
1335
|
-
### Reference inputs combine freely
|
|
1336
|
-
`visual_dna_ids` + `reference_images` + (where supported) `reference_videos` + `audio_url` are **additive** across all video tools that accept them. The same matrix from "Mixing References, Visual DNAs, and Moodboards" applies: DNA owns identity, reference_images own composition/style, audio_url drives sync, video references provide motion or scene context.
|
|
1337
|
-
|
|
1338
|
-
### ⚠️ Sound on/off — `sound_enabled` (MANDATORY when the user mentions audio)
|
|
1339
|
-
|
|
1340
|
-
All six video tools — `generate_video`, `generate_video_from_image`, `generate_elements`, `generate_first_last_frame`, `generate_video_from_video`, `generate_creative_director` (with `workflow_type: "video"`) — accept an optional **`sound_enabled: boolean`** parameter that controls whether the model produces AI-generated synced audio (ambient/foley/dialogue) on the output video.
|
|
1341
|
-
|
|
1342
|
-
**When to pass it:**
|
|
1343
|
-
|
|
1344
|
-
| User phrasing | Pass |
|
|
1345
|
-
|---|---|
|
|
1346
|
-
| "no sound", "silent", "mute", "without audio", "אל תוסיף סאונד", "בלי קול" | `sound_enabled: false` |
|
|
1347
|
-
| "with sound", "add audio", "include the sound", "עם סאונד" | `sound_enabled: true` |
|
|
1348
|
-
| Did not mention sound at all | **OMIT the field** — the model's `sound_enabled_by_default` applies |
|
|
1349
|
-
|
|
1350
|
-
**Before passing**, check the model's capability via `list_models`:
|
|
1351
|
-
- `sound_generation_type: "native"` → flag is honored (Veo 3.1, Kling V3/2.6/O3, PixVerse V6).
|
|
1352
|
-
- `sound_generation_type: "none"` → flag is silently ignored (Sora 2, Hailuo, Seedance, Grok Imagine, Wan, Hedra, Runway). If the user asked for sound on a model that can't produce it, tell them and offer to switch models (e.g. "Sora 2 doesn't emit audio — want me to use Veo 3.1 instead?").
|
|
1353
|
-
- `sound_enabled_by_default: true` means sound is ON unless you explicitly pass `false`. **This is the root cause of user complaints about unwanted audio**: Veo 3.1 defaults to sound-on. If the user said "no sound", you MUST pass `sound_enabled: false`; omitting it is NOT the same as disabling it.
|
|
1354
|
-
- `sound_credit_multiplier > 1` means enabling sound costs more credits per output. Mention the extra cost when quoting (per the "Quote real cost" rule).
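
The decision rules above can be sketched as a helper that returns the parameters to pass plus an optional note for the user. The model fields are the ones named in the text; the function name, the tri-state `user_wants_sound` argument, and the warning strings are hypothetical.

```python
from typing import Optional


def sound_params(user_wants_sound: Optional[bool], model: dict) -> tuple[dict, Optional[str]]:
    """user_wants_sound: True (asked for sound), False (asked for silence), None (not mentioned)."""
    if user_wants_sound is None:
        return {}, None  # Omit the field so the model default applies.
    params = {"sound_enabled": user_wants_sound}
    warning = None
    if user_wants_sound and model.get("sound_generation_type") != "native":
        warning = "model cannot generate audio; offer a model with native sound"
    elif user_wants_sound and model.get("sound_credit_multiplier", 1) > 1:
        warning = "sound raises the credit cost; include it in the quote"
    return params, warning
```

Note that an explicit "no sound" request always yields `sound_enabled: False`, even when the model defaults to sound on.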
|
|
1355
|
-
|
|
1356
|
-
**Production-log entries** must record the sound state (`sound-on` / `sound-off`) alongside resolution + duration — see "Always log the resolution / duration / sound choices".
|
|
1357
|
-
|
|
1358
|
-
**Do NOT** try to "remove" sound after the fact with `edit_video` if you can prevent it at generation time — passing `sound_enabled: false` upfront is the only correct path.
|
|
1359
|
-
|
|
1360
|
-
### UGC / Short-Form Vertical Video — Defaults
|
|
140
|
+
1. **Check credits** ONCE per conversation (Step 0). Skip if already checked.
|
|
141
|
+
2. **Discover models** with `list_models` using a `type` filter — but **skip when the user names a specific model**.
|
|
142
|
+
3. **Pick the model**:
|
|
143
|
+
- User named one → use it.
|
|
144
|
+
- Auto-select → only from "Auto-selectable" section (models with a `summary`). Cheapest fit. Prefer `[RECOMMENDED]` when cost is similar.
|
|
145
|
+
- Never auto-select from "Named-only" section.
|
|
146
|
+
4. **Validate inputs** against model caps — see `references/workflows/cost-and-validation.md`.
|
|
147
|
+
5. **How calls work**: each tool blocks until generation is fully complete. Images: seconds. Video: minutes. Multiple tool calls in one response run concurrently. If a call times out, use `get_generation_status` with the returned generation ID.
|
|
148
|
+
6. **Share the URL** after success. Never fabricate URLs.
|
|
1361
149
|
|
|
1362
|
-
|
|
150
|
+
Model types for `list_models`: `text_to_img`, `image_editing`, `text_to_video`, `img_to_video`, `draw_to_video`, `video_to_video`, `elements`, `firstlastgenerations`, `lipsync-image`, `lipsync-video`, `music_gen`, `text_to_speech`, `text_to_sound`, `stt`, `text`, `3d_text_to_model`, `3d_image_to_model`, `3d_multi_image_to_model`, `3d_world`.
|
|
1363
151
|
|
|
1364
|
-
|
|
1365
|
-
|---|---|---|
|
|
1366
|
-
| `aspect_ratio` | **`"9:16"`** (vertical) | TikTok / Reels / Shorts are all vertical-first. Using 16:9 forces the user to crop or reshoot. |
|
|
1367
|
-
| Visual aesthetic | Phone-shot, handheld, natural lighting | UGC works precisely *because* it doesn't look produced. Cinematic = wrong vibe. |
|
|
1368
|
-
| Camera language | Slight handheld sway, selfie-arm framing, key light from window/screen | NOT slow dollies, NOT cinematic crane moves, NOT studio key light |
|
|
1369
|
-
| Energy | "talking to a friend" — casual, direct-to-camera, occasional gestures | Not theatrical, not staged, not "model-y" |
|
|
1370
|
-
| Captions / subtitles / text overlays | **NEVER add** unless explicitly requested | Users add captions in CapCut / TikTok native editor; baked-in captions limit reuse |
|
|
1371
|
-
| Brand watermarks / lower-thirds / lower banners | **NEVER add** unless explicitly requested | Same reason |
|
|
1372
|
-
| Music / SFX | Off by default unless asked | They'll layer their own audio in post |
|
|
1373
|
-
| Length | If user gives no number, default to the model's `default_duration` (typically 5–8s for elements/v2v models). Don't extend without asking. | Shorter = more usable for the algorithm |
|
|
152
|
+
## Cost Awareness — Quick Rules
|
|
1374
153
|
|
|
1375
|
-
|
|
1376
|
-
"UGC", "user-generated", "creator video", "TikTok", "Reels", "Shorts", "POV", "selfie video", "phone-shot", "vlogger", "talking head" (when context implies social media), "for social", "Instagram video", "YouTube short".
|
|
154
|
+
Full tables + formulas in `references/workflows/cost-and-validation.md`. Quick rules:
|
|
1377
155
|
|
|
1378
|
-
**
|
|
1379
|
-
|
|
156
|
+
- **Skip cost confirmation** when the user already specified model + count + duration, OR when a single generation costs < 5 credits.
|
|
157
|
+
- **Required cost confirmation** otherwise: one-line summary, suggest cheaper alternative if available, wait for confirm.
|
|
158
|
+
- **Batch totalling 100+ credits**: run `check_credits` first.
|
|
159
|
+
- **Quote real cost**: after firing, log `credits_used` (from the tool result) to `.kolbo/production.md` — never `base × count`.
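
The quick rules can be sketched as a pre-flight check. This assumes "a single generation costs < 5 credits" refers to the per-generation price; the function and key names are hypothetical. The multiplied total is only a pre-flight estimate and must not be logged as the real cost.

```python
def cost_plan(unit_credits: float, count: int, user_specified_all: bool) -> dict:
    """Decide whether to confirm cost and whether to run check_credits first."""
    estimated_total = unit_credits * count  # Estimate only; log credits_used after firing.
    return {
        # Skip confirmation when the user fixed model + count + duration, or the unit cost is small.
        "confirm": not (user_specified_all or unit_credits < 5),
        "check_credits_first": estimated_total >= 100,
    }
```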
|
|
1380
160
|
|
|
1381
|
-
|
|
1382
|
-
```
|
|
1383
|
-
UGC selfie video, vertical 9:16, handheld phone aesthetic.
|
|
1384
|
-
{presenter description} in {everyday setting}, {energy level}.
|
|
1385
|
-
They {natural action with the product/subject}, talking directly to camera.
|
|
1386
|
-
Phone-shot lighting (window/screen key light), slight handheld sway, no cinematic moves.
|
|
1387
|
-
Style: authentic creator content, NOT polished commercial.
|
|
1388
|
-
```
|
|
161
|
+
## Rate Limiting & Batch Generation
|
|
1389
162
|
|
|
1390
|
-
|
|
163
|
+
- `generate_image`: 30/min. All other generation tools: 10/min per type. 300/min global. `upload_media`: 300/min, no credit cost.
|
|
164
|
+
- **⚠️ NEVER re-fire a generation you already called.** Aborted / timed-out calls still process server-side. Run `get_generation_status` before retrying.
|
|
165
|
+
- **Batch ≤10 items**: output ALL tool calls in one response — they run concurrently.
|
|
166
|
+
- **Bulk >10 items**: real-world ceilings — `generate_image` 8–10 in-flight, image-edit 5–8, video tools 3–5, `generate_video_from_video` 3, music/speech/sound 5–8. Fire one batch → wait → fire next. Persist every `generation_id` in `.kolbo/production.md`.
|
|
167
|
+
- **`upload_media` external URLs first.** `files`/`source_images`/`image_url` only accept Kolbo-hosted URLs reliably; external URLs cause `400`.
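
A minimal sketch of the batching rule: up to 10 items go out in one response, larger jobs are split by an in-flight ceiling. The helper is hypothetical, and the caller picks `in_flight` from the ranges above (for example 3 for video tools as a conservative choice).

```python
def plan_batches(items: list, in_flight: int) -> list[list]:
    """Return the groups of items to fire together, in order."""
    if in_flight < 1:
        raise ValueError("in_flight must be at least 1")
    if len(items) <= 10:
        return [list(items)]  # Small batch: all tool calls in one response.
    return [list(items[i:i + in_flight]) for i in range(0, len(items), in_flight)]
```

Between groups, wait for the current batch to finish and persist each `generation_id` before firing the next one.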
|
|
1391
168
|
|
|
1392
|
-
|
|
169
|
+
## ⚠️ Multi-output? Default to `generate_creative_director` (CRITICAL)
|
|
1393
170
|
|
|
1394
|
-
|
|
1395
|
-
|----------|---------|
|
|
1396
|
-
| `slow dolly-in` | Building intensity, focus pull |
|
|
1397
|
-
| `pull-back` / `dolly out` | Scale reveal, loneliness, context |
|
|
1398
|
-
| `extreme low-angle` | Power, heroic framing |
|
|
1399
|
-
| `overhead top-down` | Geometry, pattern, abstraction |
|
|
1400
|
-
| `360° orbit` | Product showcase, bullet-time moments |
|
|
1401
|
-
| `handheld natural lag` | Urgency, documentary, grit |
|
|
1402
|
-
| `tracking shot` | Continuous follow of a subject |
|
|
1403
|
-
| `crash zoom` | Shock, impact moment |
|
|
1404
|
-
| `aerial pull-back` | Epic reveal, landscape scale |
|
|
1405
|
-
| `static drift` | Contemplative, subtle, meditative |
|
|
1406
|
-
| `crane up` / `crane down` | Grandeur, establishing, dismissal |
|
|
1407
|
-
| `whip pan` | Sharp transition, high energy |
|
|
171
|
+
`generate_creative_director` is **an agent**, not a niche tool. Plans each scene internally, locks consistency, runs in parallel. For 2+ related outputs, it's almost always right.
|
|
1408
172
|
|
|
1409
|
-
|
|
173
|
+
**Tie-breaker:** about to fire ≥2 `generate_image` calls and the user did NOT dictate per-image prompts? Stop. Use `generate_creative_director`.
|
|
1410
174
|
|
|
1411
|
-
|
|
1412
|
-
- **Water**: `water splashing with surface tension`, `droplets scattering`, `puddle mirror reflection`
|
|
1413
|
-
- **Sand / dust**: `sand displacement`, `radial dust shockwave`
|
|
1414
|
-
- **Hair**: `hair reacts to acceleration and wind`
|
|
1415
|
-
- **Impact**: `skin distorting on impact`, `delayed follow-through`
|
|
1416
|
-
- **Smoke**: `volumetric smoke curling and dissipating`
|
|
175
|
+
**Never loop `generate_image` sequentially.** Either Creative Director or one parallel batch.
|
|
1417
176
|
|
|
1418
|
-
|
|
177
|
+
**Parameter gotcha:** `num_images` (1–4, same prompt different seeds) on `generate_image` vs `scene_count` (1–8, distinct prompt per scene) on `generate_creative_director`. **Never pass `num_images` to Creative Director.**
|
|
1419
178
|
|
|
1420
|
-
|
|
179
|
+
## 🛑 Runaway-Loop Guard — ONE Generation per Requested Item (CRITICAL)
|
|
1421
180
|
|
|
1422
|
-
When the user
|
|
181
|
+
When the user asks for **one specific change**, the answer is **a single tool call**. After URLs return, **stop**. Surface and wait.
|
|
1423
182
|
|
|
1424
|
-
|
|
1425
|
-
|
|
1426
|
-
|
|
1427
|
-
|
|
1428
|
-
|
|
1429
|
-
|
|
1430
|
-
Think like a director. Describe what **happens**, not what things **look** like.
|
|
1431
|
-
|
|
1432
|
-
### Mood Presets
|
|
1433
|
-
|
|
1434
|
-
Pick techniques that match the user's intent. A calm landscape and an action sequence need different tools.
|
|
1435
|
-
|
|
1436
|
-
- **Cinematic / dramatic**: slow dolly-in, anamorphic 2.39:1, shallow depth of field, volumetric light, subtle film grain
|
|
1437
|
-
- **Product showcase**: 360° orbit, clean white or gradient backdrop, macro detail inserts, smooth tracking
|
|
1438
|
-
- **Dreamy / ethereal**: slow crane up, soft diffused light, gentle particle drift, muted pastels, static drift moments
|
|
1439
|
-
- **Action / intense**: crash zoom, handheld natural lag, extreme slow-motion at the peak beat, high contrast, fast cuts
|
|
1440
|
-
- **Nature / landscape**: aerial pull-back, golden hour lighting, wind physics on foliage, wide establishing shots
|
|
1441
|
-
- **Abstract / motion graphics**: overhead top-down, geometric patterns, bold color blocks, rhythmic cutting
|
|
1442
|
-
|
|
1443
|
-
### Slow-Motion
|
|
1444
|
-
|
|
1445
|
-
Extreme slow-motion is a tool, not a freeze frame. Always describe the micro-movements that *continue* during the slow beat (hair drifting, droplets crawling, fabric rippling), and specify the snap-back to full speed when relevant.
|
|
1446
|
-
|
|
1447
|
-
Format: `extreme slow-motion [Xs] — [micro-movements in ultra slow-mo] — snap-back to full speed`
|
|
1448
|
-
|
|
1449
|
-
---
|
|
1450
|
-
|
|
1451
|
-
## 3D Generation
|
|
1452
|
-
|
|
1453
|
-
Use `generate_3d` for creating 3D models. Three modes:
|
|
1454
|
-
- **Text mode**: prompt-only (e.g., "a medieval sword with ornate handle")
|
|
1455
|
-
- **Single image mode**: one reference image + optional prompt
|
|
1456
|
-
- **Multi-view mode**: 2+ reference images for higher-quality reconstruction
|
|
1457
|
-
|
|
1458
|
-
Returns downloadable model files in GLB, FBX, OBJ, and USDZ formats. Use `list_models` with `type: "three_d"` to discover available models.
|
|
1459
|
-
|
|
1460
|
-
---
|
|
1461
|
-
|
|
1462
|
-
## Music Prompts
|
|
1463
|
-
|
|
1464
|
-
Describe **genre → mood → instrumentation → tempo → era**, in that order.
|
|
1465
|
-
|
|
1466
|
-
- `instrumental: true` excludes vocals.
|
|
1467
|
-
- `lyrics` accepts actual lyric text the model should sing.
|
|
1468
|
-
- `style` accepts short genre tags ("lo-fi hip hop", "orchestral cinematic", "80s synthwave").
|
|
1469
|
-
- Good: "Upbeat 80s synthwave, analog synths, gated reverb drums, 120 BPM, driving bassline, no vocals"
|
|
1470
|
-
- Bad: "A cool song" / "Something for a workout" (too vague)
|
|
1471
|
-
|
|
1472
|
-
---
|
|
1473
|
-
|
|
1474
|
-
## Speech (TTS)
|
|
1475
|
-
|
|
1476
|
-
- Call `list_voices` to find available voices. Filter by `provider`, `language`, or `gender`.
|
|
1477
|
-
- Pass the returned `voice_id` (or the voice's display name like "Rachel") as the `voice` parameter in `generate_speech`.
|
|
1478
|
-
- For multilingual content, pick a voice that supports the target language.
|
|
1479
|
-
- For long text, split at natural sentence boundaries. Each generation has a character cap; chunk long-form content into multiple calls.
|
|
1480
|
-
|
|
1481
|
-
---
|
|
1482
|
-
|
|
1483
|
-
## Sound Effects
|
|
1484
|
-
|
|
1485
|
-
- Describe the sound **literally and physically**. Avoid emotional framing.
|
|
1486
|
-
- Good: "Heavy wooden door creaking open slowly, echoing in a stone hallway, followed by distant dripping water"
|
|
1487
|
-
- Bad: "A scary sound" / "Creepy atmosphere" (the model can't render emotions directly — render the physical source)
|
|
1488
|
-
|
|
1489
|
-
---
|
|
1490
|
-
|
|
1491
|
-
## Moodboards & Presets
|
|
1492
|
-
|
|
1493
|
-
**Moodboards** inject style direction as a **system-level prompt** (master prompt + style guide + reference images) — think of it as a persistent art direction layer applied on top of your generation. Pass a `moodboard_id` to any generation tool to apply its style. Moodboards can be combined with Visual DNA: the moodboard sets the overall aesthetic, while Visual DNA controls specific characters or objects.
|
|
1494
|
-
- `list_moodboards` to browse available options
|
|
1495
|
-
- `get_moodboard` to see full details (master_prompt, style_guide, images) before applying
|
|
1496
|
-
|
|
1497
|
-
**Presets** bundle prompt templates + style direction for specific creative looks. Pass a `preset_id` to generation tools.
|
|
1498
|
-
- `list_presets` with optional `type` filter ("image", "video", "video_from_image", "music")
|
|
1499
|
-
|
|
1500
|
-
---
|
|
1501
|
-
|
|
1502
|
-
## Media Library
|
|
1503
|
-
|
|
1504
|
-
The library covers both **uploaded files** and **AI-generated outputs the user has saved**. Tools fall into five groups: ingest, browse, lifecycle (delete/restore/move), folders, and favorites.
|
|
1505
|
-
|
|
1506
|
-
### ⚠️ Present locally-produced media to the user
|
|
1507
|
-
|
|
1508
|
-
When you produce a media file LOCALLY — `ffmpeg` via the `video-production` skill, Remotion render, manual `Bash` mux of audio + video, `edit_image` outputs saved to disk, any save-to-file flow — make sure the user can actually find and open it. Local files are invisible in the chat / canvas UI by default; only the path string makes it through.
|
|
1509
|
-
|
|
1510
|
-
**Rules:**
|
|
1511
|
-
|
|
1512
|
-
1. **Surface the file in chat as a clickable thing**, not just a path string. Write the line as a markdown link to a `file://` URL so the user can click to open it in their default app:
|
|
1513
|
-
```
|
|
1514
|
-
✅ Final video ready: [zohar_hagai_campaign.mp4](file:///Users/mymac/Documents/test%20agent%201/zohar_hagai_campaign.mp4) (45s · 1440×1440 · with music)
|
|
1515
|
-
```
|
|
1516
|
-
The user clicks the link → the desktop app shell hands the path to the system → opens in QuickTime / VLC / Finder reveal, etc.
|
|
1517
|
-
|
|
1518
|
-
2. **Always log the local path in `.kolbo/production.md`** under the artifact's entry — that's the durable record:
|
|
1519
|
-
```md
|
|
1520
|
-
## Final
|
|
1521
|
-
- **Campaign video (45s)**
|
|
1522
|
-
- local: /Users/mymac/Documents/test agent 1/zohar_hagai_campaign.mp4
|
|
1523
|
-
- resolution: 1440×1440
|
|
1524
|
-
- audio: Gilded Horizon (Track 1 & 2, 3:03)
|
|
1525
|
-
- rendered: 2026-05-16
|
|
1526
|
-
```
|
|
1527
|
-
|
|
1528
|
-
3. **Don't auto-upload to `upload_media`**. The user wants local-only files to stay local; they have the file on disk and can move/share it themselves. Upload only when the user explicitly asks ("upload this", "share publicly", "give me a CDN URL").
|
|
1529
|
-
|
|
1530
|
-
4. **Reveal-in-Finder affordance for macOS** when finishing a multi-step production: in addition to the `file://` link, mention the parent directory path so the user can `cd` or open the folder. Many users want to see all the intermediate files (frames, alt cuts, original audio) in one place.
|
|
1531
|
-
|
|
1532
|
-
5. **Files served via `file://` won't render inline** in the chat as `<video>` / `<img>` — the desktop WebView blocks file:// for security. Don't try to embed; just link.
|
|
1533
|
-
|
|
1534
|
-
### Routing — user says → call
|
|
1535
|
-
|
|
1536
|
-
| User says | Call |
|
|
1537
|
-
|---|---|
|
|
1538
|
-
| "Upload this file" / "host this" / "give me a public URL for this" | `upload_media` |
|
|
1539
|
-
| "Show my media" / "list my images/videos" / "what do I have?" | `list_media` (pass `type` / `category` / `project_id` / `folder_id` / `search`) |
|
|
1540
|
-
| "Show my favorites" / "list starred items" | `list_media` with `category=favorites` |
|
|
1541
|
-
| "List everything in project X" | `list_media` with `project_id=X` |
|
|
1542
|
-
| "List all videos in folder X" | `list_media` with `folder_id=X, type=video` |
|
|
1543
|
-
| "What was the prompt for [item]?" / "tell me about this generation" | `get_media` |
|
|
1544
|
-
| "How many videos do I have?" / "what's my storage usage?" | `get_media_stats` |
|
|
1545
|
-
| "Favorite this" / "star this" / "save to favorites" | `favorite_media` |
|
|
1546
|
-
| "Unfavorite" / "remove from favorites" / "unstar" | `unfavorite_media` |
|
|
1547
|
-
| "Delete this" / "remove this image" | `delete_media` (soft, recoverable for 30 days) |
|
|
1548
|
-
| "Restore it" / "undelete" / "bring it back from trash" | `restore_media` |
|
|
1549
|
-
| "Permanently delete" / "wipe it forever" / "free up space" | **confirm with user** → `permanently_delete_media` |
|
|
1550
|
-
| "Move this to project X" | `move_media` |
|
|
1551
|
-
| "Clean up old [type]" / "delete everything from [time period]" | `list_media` (find ids) → **confirm** → `bulk_delete_media` |
|
|
1552
|
-
| "Restore all from trash" | `list_media include_deleted=true` → `bulk_restore_media` |
|
|
1553
|
-
| "Empty my trash" / "purge deleted items" | `list_media include_deleted=true` → **show count, confirm** → `bulk_permanently_delete_media` |
|
|
1554
|
-
| "Move all these to project X" | `bulk_move_media` |
|
|
1555
|
-
| "Move everything in folder X to project Y" | `move_folder_contents` |
|
|
1556
|
-
| "Make a folder for X" / "create a 'campaigns' folder" | `create_media_folder` |
|
|
1557
|
-
| "Rename folder" / "change folder color or icon" | `update_media_folder` |
|
|
1558
|
-
| "Delete the [name] folder" | **confirm with user** → `delete_media_folder` (items stay in library) |
|
|
1559
|
-
| "Add these to [folder]" / "put these in folder X" | `add_media_to_folder` |
|
|
1560
|
-
| "Remove these from [folder]" | `remove_media_from_folder` |
|
|
1561
|
-
| "Share [folder] with alice@…" | `share_media_folder` with `user_emails: [...]` |
|
|
1562
|
-
| "Revoke [user]'s access to [folder]" | `unshare_media_folder` with `user_id` |
|
|
1563
|
-
| "Show my folders" / "what folders do I have?" | `list_media_folders` |
|
|
1564
|
-
|
|
1565
|
-
### Rules and gotchas
|
|
1566
|
-
|
|
1567
|
-
1. **"Delete" is soft by default.** Use `delete_media` / `bulk_delete_media` for normal "delete" intent — items go to trash for 30 days and are recoverable. Only use `permanently_delete_media` / `bulk_permanently_delete_media` when the user explicitly asks for unrecoverable deletion ("permanently", "forever", "wipe", "free up space"). **Always confirm before either permanent variant.**
|
|
1568
|
-
2. **Confirm before destructive folder ops.** `delete_media_folder` detaches items (they stay in the library) but the folder itself is gone — no undo. Confirm with the user.
|
|
1569
|
-
3. **`bulk_move_media` is atomic.** If you get a "not all items owned by you" error, do NOT retry partially. Surface the error to the user and let them pick a smaller batch.
|
|
1570
|
-
4. **Prefer `list_media` filters over post-filtering.** Pass `project_id` / `folder_id` / `category` / `type` / `search` to the backend; don't fetch the whole library and filter client-side.
|
|
1571
|
-
5. **`is_favorited` is per-user.** On shared projects, an item can be favorited by you and not by your teammates — the value reflects the calling user only.
|
|
1572
|
-
6. **"Empty trash" flow:** `list_media` with `include_deleted=true` → show the count → confirm → `bulk_permanently_delete_media`. Never call the bulk-permanent endpoint without listing first so the user knows the scope.
|
|
1573
|
-
7. **Bulk caps:** 1000 ids for `bulk_delete_media` / `bulk_restore_media` / `bulk_permanently_delete_media` / `bulk_move_media`; 500 ids for `add_media_to_folder` / `remove_media_from_folder`. Split larger jobs into successive calls.
|
|
1574
|
-
8. **Folder share resolution:** `share_media_folder` takes emails; users not found come back in `not_found`. Report those to the user — don't assume the share succeeded silently. Members can list/add/remove items but cannot delete the folder or reshare it.
|
|
1575
|
-
9. **`get_media` accepts a generation_id as a fallback** for the `media_id` arg, so you can chase down items the user references by their original generation rather than by library id.
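
The bulk caps in rule 7 can be respected with a small batching helper. This is a minimal sketch, not the skill's own code: `chunkIds` and the two constants are hypothetical names, and only the cap values (1000 and 500) come from the rules above.

```typescript
// Caps taken from rule 7; the constant names are illustrative.
const BULK_OP_CAP = 1000;   // bulk_delete / bulk_restore / bulk_permanently_delete / bulk_move
const FOLDER_OP_CAP = 500;  // add_media_to_folder / remove_media_from_folder

// Split a list of ids into successive batches no larger than `cap`.
function chunkIds<T>(ids: readonly T[], cap: number): T[][] {
  if (!Number.isInteger(cap) || cap < 1) {
    throw new RangeError("cap must be a positive integer");
  }
  const batches: T[][] = [];
  for (let i = 0; i < ids.length; i += cap) {
    batches.push(ids.slice(i, i + cap));
  }
  return batches;
}
```

Each batch would then go out as its own call. Because `bulk_move_media` is atomic, a failed batch should be surfaced to the user rather than retried in part.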
|
|
1576
|
-
|
|
1577
|
-
---
|
|
183
|
+
You are NOT allowed to:
|
|
184
|
+
- Fire the same tool 3+ times in a single turn unless the user explicitly asked for "N variations".
|
|
185
|
+
- Re-fire because you think the result might not be exactly what the user wanted.
|
|
186
|
+
- Auto-retry on success.
|
|
187
|
+
- Fire 5+ parallel `generate_video*` calls speculatively.
|
|
1578
188
|
|
|
1579
|
-
|
|
189
|
+
**Only re-fire when:** user explicitly asked for variations with a count, OR previous call returned `failure.retryable === true` (ONE retry), OR previous call returned `completed` but `urls.length === 0` (ONE retry).
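
The re-fire rule can be read as a small decision function. The sketch below is illustrative only: the `GenerationResult` and `RetryContext` shapes and the name `shouldFireAgain` are assumptions, and only `failure.retryable`, the `completed` status, and `urls` are taken from the rule above.

```typescript
// Hypothetical result shape; field names beyond failure.retryable, status and urls are assumed.
interface GenerationResult {
  status: "completed" | "error";
  urls: string[];
  failure?: { retryable: boolean };
}

interface RetryContext {
  requestedVariations: number; // count the user explicitly asked for (1 when none was given)
  callsMade: number;           // calls already fired for this requested item
  retriesUsed: number;         // retries already spent on this item
}

function shouldFireAgain(result: GenerationResult, ctx: RetryContext): boolean {
  // The user asked for N variations and fewer than N calls have been made.
  if (ctx.callsMade < ctx.requestedVariations) return true;
  // Failure-driven retries are capped at ONE.
  if (ctx.retriesUsed >= 1) return false;
  // Explicitly retryable failure.
  if (result.failure?.retryable === true) return true;
  // "Completed" but no output: silent failure.
  if (result.status === "completed" && result.urls.length === 0) return true;
  // Success with URLs: stop, surface, wait.
  return false;
}
```

A successful call with non-empty `urls` always yields `false` here, which matches the "stop after URLs return" rule.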
|
|
1580
190
|
|
|
1581
|
-
|
|
191
|
+
## ⚠️ Editing an Existing Video → ONE Call, Not Frames-First (CRITICAL)
|
|
1582
192
|
|
|
1583
|
-
|
|
193
|
+
Existing video → modify → **single `generate_video_from_video` call** with source video URL + edit prompt.
|
|
1584
194
|
|
|
1585
|
-
Use `
|
|
195
|
+
**Use a TRUE video-to-video model.** Image-to-video models reject with `WRONG_MODEL_TYPE`. Valid: `wan/2-7-videoedit`, `happyhorse/video-edit`, `kling-video/o3-video-to-video`, or any model whose DB `type` includes `video_to_video` (use `list_models({ type: "video_to_video" })`).
|
|
1586
196
|
|
|
1587
|
-
|
|
197
|
+
**Do NOT** decompose into frames. **Do NOT** re-fire if the first call returned URLs.
|
|
1588
198
|
|
|
1589
|
-
##
|
|
199
|
+
## ⚠️ Character-Driven Video — Frames First, Then Animate (CRITICAL)
|
|
1590
200
|
|
|
1591
|
-
|
|
201
|
+
For any ad / story / scene-based video **created from scratch** featuring a Visual DNA character (NOT v2v edits):
|
|
1592
202
|
|
|
1593
|
-
|
|
203
|
+
1. **Generate the shot frames first** via `generate_creative_director` with `scene_count` + `visual_dna_ids` (image mode). DNA is strongest in image gen; user can approve cheaply.
|
|
204
|
+
2. **Confirm the frames** if >3 shots.
|
|
205
|
+
3. **Animate each frame** with `generate_video_from_image`, fired in parallel.
|
|
1594
206
|
|
|
1595
|
-
|
|
1596
|
-
2. **Create session**: `app_builder_create_session` with `project_id`
|
|
1597
|
-
3. **Generate app**: `app_builder_generate_app` with `session_id` + `prompt`
|
|
1598
|
-
- Fires the build in the background, polls until `build_status === "deployed"` (up to 5 min)
|
|
1599
|
-
- Always surface the `deployment_url` to the user: **"Your app is live at: [url]"**
|
|
1600
|
-
4. **Iterate**: `app_builder_list_generations` → get `generation_id` → `app_builder_edit_app` with natural language instruction
|
|
207
|
+
Skip frames-first only when the user says "go straight to video", for single-shot quick experiments, or when the user supplies approved frames. Full rules: `references/models/creative-director.md`.
|
|
1601
208
|
|
|
1602
|
-
|
|
209
|
+
## ⚠️ Detecting Failed Generations (CRITICAL)
|
|
1603
210
|
|
|
1604
|
-
|
|
211
|
+
A generation can fail three ways. Treat ALL as failure:
|
|
1605
212
|
|
|
1606
|
-
|
|
1607
|
-
|
|
1608
|
-
|
|
1609
|
-
github_repo_url → git clone <url> && npm install && npm run dev
|
|
1610
|
-
supabase_url → paste into .env as NEXT_PUBLIC_SUPABASE_URL
|
|
1611
|
-
supabase_anon_key → paste into .env as NEXT_PUBLIC_SUPABASE_ANON_KEY
|
|
1612
|
-
```
|
|
213
|
+
1. **Tool returns `error`** — explicit. Surface, suggest retry, log `generation_id`.
|
|
214
|
+
2. **Tool returns `completed` but `urls` is empty** — silent failure (NSFW filter, model OOM, upstream 5xx). Tell user "completed without an output — retrying" and re-fire ONCE. Do NOT log to `.kolbo/production.md`. Do NOT claim it worked.
|
|
215
|
+
3. **Tool hangs / never returns** — MCP poll timed out. Call `get_generation_status(generation_id)` IMMEDIATELY. The server might be done.
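
The three failure modes can be told apart with a small classifier. The sketch below assumes a hypothetical `ToolResponse` shape in which a hung call is represented as `undefined`; only the `error`, `completed`, and `urls` signals come from the list above.

```typescript
// Assumed response shape; a call that hung or timed out is modeled as `undefined`.
interface ToolResponse {
  status: string;
  urls?: string[];
  error?: string;
}

type Outcome =
  | { kind: "success"; urls: string[] }
  | { kind: "explicit_error"; message: string }
  | { kind: "silent_failure" }
  | { kind: "timeout" };

function classifyGeneration(response: ToolResponse | undefined): Outcome {
  // Mode 3: nothing came back, so check get_generation_status next.
  if (response === undefined) return { kind: "timeout" };
  // Mode 1: explicit error, surface it and log the generation_id.
  if (response.error !== undefined) return { kind: "explicit_error", message: response.error };
  // Mode 2: no usable output, even if the status says "completed".
  const urls = response.urls ?? [];
  if (urls.length === 0) return { kind: "silent_failure" };
  return { kind: "success", urls };
}
```

Only the `success` outcome should be logged to `.kolbo/production.md`.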
|
|
1613
216
|
|
|
1614
|
-
|
|
217
|
+
**Always:**
|
|
218
|
+
- Don't celebrate before reading the result. Verify `urls` is non-empty.
|
|
219
|
+
- Don't auto-retry without surfacing the failure. Partial batches: list failed items + reasons + successful count. Never "✅ all done!" on partials.
|
|
220
|
+
- Don't log failed items to `.kolbo/production.md`. Only successes.
|
|
221
|
+
- Surface the user's count. "6 of 8 ready", not "videos ready".
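
The partial-batch reporting rule can be sketched as a summary builder. The `ItemResult` shape and `summarizeBatch` name are hypothetical; the "N of M ready" wording and the list of failed items with reasons follow the bullets above.

```typescript
// Hypothetical per-item record: an item counts as successful only if it has at least one URL.
interface ItemResult {
  label: string;
  urls: string[];
  reason?: string; // failure reason, when one was reported
}

function summarizeBatch(items: ItemResult[]): string {
  const succeeded = items.filter((item) => item.urls.length > 0);
  const failed = items.filter((item) => item.urls.length === 0);
  // Lead with the count, never a blanket "all done" on partial results.
  const lines = [`${succeeded.length} of ${items.length} ready`];
  for (const item of failed) {
    lines.push(`- failed: ${item.label} (${item.reason ?? "no reason reported"})`);
  }
  return lines.join("\n");
}
```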
|
|
1615
222
|
|
|
1616
|
-
|
|
1617
|
-
- **On build timeout** (rare): use `app_builder_get_build_status` to check manually, then continue or report.
|
|
223
|
+
`failure` envelope structure + retry rules: `references/workflows/troubleshooting.md`.
|
|
1618
224
|
|
|
1619
|
-
|
|
225
|
+
## ⚠️ Generated URLs in Chat (CRITICAL)
|
|
1620
226
|
|
|
1621
|
-
|
|
227
|
+
Chat renders markdown natively. `![label](url)` = inline image. `[label](url)` = labeled link with preview.
|
|
1622
228
|
|
|
1623
|
-
|
|
229
|
+
- **Catalog-style replies** (numbered lists of characters / scenes / products): embed `![label](url)` so each item shows inline.
|
|
230
|
+
- **Conversational replies** ("4 shots ready"): keep prose short; canvas chip already shows gallery.
|
|
1624
231
|
|
|
1625
|
-
|
|
232
|
+
Avoid bare URL dumps and HTML `<table>` grids — canvas already provides a gallery.
|
|
1626
233
|
|
|
1627
|
-
|
|
1628
|
-
- **Reference specific regions** when helpful: "top-left corner", "in the foreground", "the figure on the right".
|
|
1629
|
-
- **Extract text verbatim** when asked (OCR-style requests are fine).
|
|
1630
|
-
- **Cannot identify real people.** Describe hair, clothing, pose, expression, and apparent role — but never name a specific individual, even a well-known public figure. If the user insists, decline and offer to describe instead.
|
|
1631
|
-
- **Copyrighted content**: summarize and reference, don't reproduce verbatim large chunks.
|
|
1632
|
-
- If the user wants an **edit** based on the analysis, hand off to `generate_image_edit` (visual edit) or `generate_video_from_image` (motion).
|
|
234
|
+
**After `generate_creative_director` completes** — share results as individual URLs, one per scene. Do NOT create an HTML grid artifact.
|
|
1633
235
|
|
|
1634
|
-
|
|
236
|
+
**Always** record every URL in `.kolbo/production.md` — see `references/workflows/production-log.md`.
|
|
1635
237
|
|
|
1636
238
|
## Limitations & Safety
|
|
1637
239
|
|
|
1638
|
-
- **Real people**: never identify specific
|
|
1639
|
-
- **NSFW**: Kolbo enforces content safety at the model level. If a generation fails on safety grounds, rephrase
|
|
1640
|
-
- **Copyright**: style references are fine (
|
|
1641
|
-
- **No fabricated URLs**: only share URLs that actually came back from a tool call.
|
|
1642
|
-
|
|
1643
|
-
---
|
|
240
|
+
- **Real people**: never identify specific individuals in photos, even public figures. Describe visible attributes only.
|
|
241
|
+
- **NSFW**: Kolbo enforces content safety at the model level. If a generation fails on safety grounds, rephrase rather than retrying identically.
|
|
242
|
+
- **Copyright**: style references are fine ("in the style of Studio Ghibli"); verbatim reproduction is not.
|
|
243
|
+
- **No fabricated URLs**: only share URLs that actually came back from a tool call.
|
|
1644
244
|
|
|
1645
245
|
## Sharing HTML Artifacts
|
|
1646
246
|
|
|
1647
|
-
HTML/SVG/Mermaid artifacts have a **Share** button in the preview toolbar that uploads the artifact and copies a permanent public URL (no login required to view).
|
|
1648
|
-
|
|
1649
|
-
---
|
|
1650
|
-
|
|
1651
|
-
## Kolbo Code Documentation
|
|
1652
|
-
|
|
1653
|
-
Full public documentation for Kolbo Code (the CLI you are running inside) lives at **[docs.kolbo.ai/docs/kolbo-code](https://docs.kolbo.ai/docs/kolbo-code)**. If the user asks about installation, authentication, voice input, supported languages, commands, or how to uninstall, point them to the matching page below rather than guessing:
|
|
1654
|
-
|
|
1655
|
-
| Topic | Path |
|
|
1656
|
-
|-------|------|
|
|
1657
|
-
| Overview & quick links | `/docs/kolbo-code` |
|
|
1658
|
-
| Installation (npm / bun / brew / scoop / choco) | `/docs/kolbo-code/installation` |
|
|
1659
|
-
| Sign in with Kolbo (device-code OAuth) | `/docs/kolbo-code/authentication` |
|
|
1660
|
-
| Push-to-talk voice input (hold `space`) | `/docs/kolbo-code/voice-input` |
|
|
1661
|
-
| 12 supported UI languages + RTL | `/docs/kolbo-code/languages` |
|
|
1662
|
-
| Full CLI command reference | `/docs/kolbo-code/commands` |
|
|
1663
|
-
| Uninstall + cleanup | `/docs/kolbo-code/uninstall` |
|
|
1664
|
-
|
|
1665
|
-
The MDX sources are in the `kolbo-docs` repo under `content/docs/kolbo-code/`. When the user's question has a concrete answer in one of those pages, cite the path and summarize — do not invent new instructions.
|
|
1666
|
-
|
|
1667
|
-
## Troubleshooting
|
|
1668
|
-
|
|
1669
|
-
### "API key is invalid or expired"
|
|
1670
|
-
This usually means the CLI is sending a key to the wrong API endpoint.
|
|
1671
|
-
|
|
1672
|
-
**Common cause — whitelabel overlap:** if the user previously used regular `kolbo` and then switched to a whitelabel/partner CLI (e.g. `sapir`), the old API key may still be cached against the main Kolbo API. Running `kolbo` instead of the branded command (`sapir`) overwrites the MCP config with the wrong endpoint.
|
|
1673
|
-
|
|
1674
|
-
**Fix:** tell the user to re-authenticate with their branded CLI command:
|
|
1675
|
-
```
|
|
1676
|
-
sapir auth login
|
|
1677
|
-
```
|
|
1678
|
-
(Replace `sapir` with their actual CLI command.)
|
|
1679
|
-
|
|
1680
|
-
Then **restart the editor/session** so the MCP picks up the new key and endpoint.
|
|
1681
|
-
|
|
1682
|
-
**Important:** whitelabel users must always use their branded CLI command (e.g. `sapir`), not `kolbo`, to keep the MCP pointed at the correct API.
|
|
1683
|
-
|
|
1684
|
-
### MCP tools not responding or not found
|
|
1685
|
-
If Kolbo tools time out or aren't listed, the MCP server may not be wired up. Tell the user to run:
|
|
1686
|
-
```
|
|
1687
|
-
<their-cli-command> auth login
|
|
1688
|
-
```
|
|
1689
|
-
This re-wires the MCP configuration automatically. Then restart the session.
|
|
1690
|
-
|
|
1691
|
-
### "Rate limited" (429 errors)
|
|
1692
|
-
Wait 60s for the window to reset, then retry only the failed calls. For batch image work prefer `generate_creative_director` over multiple `generate_image` calls. Full rate-limit details + retry sequence: see "Rate Limiting & Batch Generation".
|
|
247
|
+
HTML/SVG/Mermaid artifacts have a **Share** button in the preview toolbar that uploads the artifact and copies a permanent public URL (no login required to view). Or call `publish_html_artifact({ title, content })` directly.
|
|
1693
248
|
|
|
1694
249
|
---
|
|
1695
250
|
|
|
1696
|
-
|
|
1697
|
-
|
|
1698
|
-
Natural-language triggers → tool routing:
|
|
1699
|
-
|
|
1700
|
-
- "Generate an image of a neon-lit Tokyo street at night" → `list_models` (image) → `generate_image`
|
|
1701
|
-
- "Use Midjourney to generate X" → `generate_image` with model "midjourney" (user named → skip `list_models`)
|
|
1702
|
-
- "Remove the background from this image" → `list_models` (image_edit) → `generate_image_edit`
|
|
1703
|
-
- "Create a storyboard for a coffee brand ad" / "4 angles of this character" → `generate_creative_director`
|
|
1704
|
-
- "Make 5 videos with Seedance 2 Fast, 15s, 16:9" → fire all 5 `generate_video` calls in parallel (skip `list_models`, skip cost confirmation)
|
|
1705
|
-
- "Animate this product photo with a 360° orbit" → `generate_video_from_image`
|
|
1706
|
-
- "Restyle this video as anime" → `generate_video_from_video`
|
|
1707
|
-
- "Make this character talk with this voiceover" → `generate_lipsync`
|
|
1708
|
-
- "Create a smooth transition between these two frames" → `generate_first_last_frame`
|
|
1709
|
-
- "Make a lo-fi hip hop beat, instrumental, 85 BPM" → `generate_music`
|
|
1710
|
-
- "Say this in English with a natural female voice: …" → `list_voices` → `generate_speech`
|
|
1711
|
-
- "Generate a door slam sound effect" → `generate_sound`
|
|
1712
|
-
- "Create a 3D model of a medieval castle" → `generate_3d`
|
|
1713
|
-
- Transcription / SRT / "what was said" / word-by-word timing → `transcribe_audio` (see Video/Audio Analysis section for full routing)
|
|
1714
|
-
- "Analyze this video" / "What's in this?" → `upload_media` → `chat_send_message` (see decision tree for >100MB / long / dialogue-dense exceptions)
|
|
1715
|
-
- Multi-video analysis → upload all in parallel, then ONE `chat_send_message` with up to 10 URLs
|
|
1716
|
-
- "Keep the same character across these images" → `create_visual_dna` → `generate_image` with `visual_dna_ids`
|
|
1717
|
-
- "Upload this" / "Host this HTML page" / "Public URL for this file" → `upload_media` (Kolbo CDN serves any file type publicly)
|
|
1718
|
-
- "How many credits do I have?" → `check_credits`
|
|
1719
|
-
- Image analysis ("what's in this image?", "analyze these N frames") → `Read` directly with native vision, never `upload_media` + chat
|
|
1720
|
-
- "Build me a todo app" / "Make a landing page with waitlist" → `app_builder_list_projects` → `app_builder_create_session` → `app_builder_generate_app` → show `deployment_url`
|
|
1721
|
-
- "Add dark mode to my app" / "Add a contact form" → `app_builder_list_generations` → `app_builder_edit_app`
|
|
1722
|
-
- "Give me the GitHub repo" / "Supabase credentials" → `app_builder_get_session` → return `github_repo_url` + `supabase_url` + `supabase_anon_key`
|
|
1723
|
-
- "Create motion graphics" / "animated text" / "title sequence" → load `remotion-best-practices` skill
|
|
1724
|
-
- "Edit this video" / "cut" / "trim" / "remove silence" / "add subtitles" / "convert to 9:16" → load `video-production` skill (FFmpeg)
|
|
1725
|
-
- "Create a short-form video" / "make a reel" / "YouTube short" → load `short-form-video` skill
|
|
251
|
+
If at this point you still don't know which `references/` file to load, default to `references/models/prompt-copilot.md` for generation prompts or `references/workflows/cost-and-validation.md` for cost/validation questions, or just keep going with this core file's rules.
|