@kolbo/kolbo-code-linux-arm64-musl 2.1.14 → 2.1.18
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- package/bin/kolbo +0 -0
- package/package.json +1 -1
- package/skills/kolbo/SKILL.md +141 -26
- package/skills/seedance-2-prompting/SKILL.md +107 -0
package/bin/kolbo
CHANGED
Binary file

package/package.json
CHANGED
package/skills/kolbo/SKILL.md
CHANGED

@@ -16,10 +16,10 @@ You have direct access to the Kolbo AI creative platform via MCP tools (auto-con
 | `generate_image` | Create a **single** image from a text prompt. Supports Visual DNA, moodboards, reference images, web-search grounding. |
 | `generate_image_edit` | Edit/transform an existing image (background removal, color changes, compositing). Pass source images + edit prompt. |
 | `generate_creative_director` | **Generate 2–8 related images or videos as one coherent set.** Use this INSTEAD of multiple `generate_image` calls whenever the user wants more than one related output (storyboards, ad campaigns, product sets, character sheets, scene variations). Handles style consistency and runs scenes in parallel internally. |
-| `generate_video` | Create videos from text prompts. Supports Visual DNA
+| `generate_video` | Create videos from text prompts. Supports reference images for style/composition guidance. Does **not** support Visual DNA — use `generate_elements` for character-consistent video. |
 | `generate_video_from_image` | Animate a still image into video. Prompt describes the motion, not the subject. |
 | `generate_video_from_video` | Restyle/transform an existing video (style transfer, scene restyling, subject swap). Keeps the original motion. |
-| `generate_elements` | Generate video from reference assets (images/videos) + prompt.
+| `generate_elements` | Generate video from reference assets (images/videos) + prompt. **Supports Visual DNA** for character-consistent video — this is the primary tool for animating characters/scenes with Visual DNA. |
 | `generate_first_last_frame` | Generate video that morphs from a first frame to a last frame (keyframe interpolation). |
 | `generate_lipsync` | Lipsync an audio track to a source image or video face. Accepts local files or URLs. |
 | `generate_music` | Create music from descriptions. Supports instrumental, custom lyrics, style, vocal gender. |
@@ -74,6 +74,20 @@ You have direct access to the Kolbo AI creative platform via MCP tools (auto-con
 | `chat_list_conversations` | List your SDK chat conversations. |
 | `chat_get_messages` | Fetch messages in a conversation (with media URLs). |

+### App Builder
+
+| Tool | Description |
+|------|-------------|
+| `app_builder_list_projects` | List all Kolbo projects to find a `project_id` for App Builder. |
+| `app_builder_create_session` | Create a new App Builder session inside a project. Returns `session_id`. |
+| `app_builder_generate_app` | Generate a full React app from a text prompt. Fires build, polls until deployed, returns live URL. |
+| `app_builder_edit_app` | Edit an existing app with a natural language instruction. Same fire-and-poll pattern. |
+| `app_builder_get_build_status` | Check current build status manually (fallback after timeout). |
+| `app_builder_get_session` | Get session details including GitHub repo URL and Supabase connection info for local dev. |
+| `app_builder_list_sessions` | List all App Builder sessions in a project. |
+| `app_builder_list_generations` | List all generations for a session (needed for `edit_app`). |
+| `app_builder_delete_session` | Permanently delete a session and all resources. IRREVERSIBLE. |
+
 ## ⚠️ Generate vs Edit — Know the Difference

 | User intent | Action | NOT this |
@@ -101,17 +115,31 @@ You have direct access to the Kolbo AI creative platform via MCP tools (auto-con

 ### Model Types (for `list_models`)

-
-
-|
-
-| `
-| `
-| `
-| `
-| `
-| `
-| `
+Use the DB type name directly. Legacy aliases (right column) still work but prefer DB names.
+
+| DB Type | Legacy alias | Use for |
+|---------|-------------|---------|
+| `text_to_img` | `image` | Still-image generation |
+| `image_editing` | `image_edit` | Image editing / transformation |
+| `text_to_video` | `video` | Text-to-video |
+| `img_to_video` | `video_from_image` | Image-to-video animation |
+| `draw_to_video` | — | Draw-to-video (Hailuo, Seedance variants) |
+| `video_to_video` | `video_from_video` | Video restyling / style transfer |
+| `elements` | *(same)* | Reference-to-video — Visual DNA-driven video |
+| `firstlastgenerations` | `first_last_frame` | Keyframe interpolation |
+| `lipsync-image` | (part of `lipsync`) | Lipsync with image source face |
+| `lipsync-video` | (part of `lipsync`) | Lipsync with video source face |
+| `music_gen` | `music` | Music generation |
+| `text_to_speech` | `speech` | Text-to-speech (TTS) |
+| `text_to_sound` | `sound` | Sound effects |
+| `stt` | `transcription` | Audio/video transcription |
+| `text` | `chat` | Chat / AI language models |
+| `3d_text_to_model` | (part of `three_d`) | 3D from text prompt |
+| `3d_image_to_model` | (part of `three_d`) | 3D from single image |
+| `3d_multi_image_to_model` | (part of `three_d`) | 3D from multiple images |
+| `3d_world` | (part of `three_d`) | 3D world generation |
+
+> **Note**: `lipsync` alias returns both `lipsync-image` + `lipsync-video`. `three_d` alias returns all four 3D types.

 ### Cost Awareness

@@ -119,10 +147,11 @@ Creative generations bill against the user's Kolbo credit balance. **Billing uni

 | Type | Billing unit | Credit range | Example |
 |------|-------------|-------------|---------|
-| **Image** | per image (flat) | 1–30 cr | Flux.1 Fast = 1 cr, Midjourney = 4 cr,
+| **Image** | per image (flat) | 1–30 cr | Flux.1 Fast = 1 cr, Midjourney = 4 cr. If `resolution` is set, check the model's `resolutionMultipliers` from `list_models` — some families multiply cost significantly at higher tiers, others are flat. |
 | **Image edit** | per image (flat) | 2–20 cr | |
-| **Video** | **cr/s × duration** | 2–30 cr/s | Kandinsky 5 Fast × 5s = 10 cr; Seedance 2.0 × 10s = 300 cr |
-| **Video from image** | **cr/s × duration** | 4–30 cr/s | Same per-second rule as text-to-video |
+| **Video** | **cr/s × duration** | 2–30 cr/s | Kandinsky 5 Fast × 5s = 10 cr; Seedance 2.0 × 10s = 300 cr. If `resolution` or native audio is set, check the model's `resolutionMultipliers` and `soundCreditMultiplier` from `list_models`. |
+| **Video from image** | **cr/s × duration** | 4–30 cr/s | Same per-second rule as text-to-video. Same multiplier check. |
+| **Elements (ref-to-video)** | **cr/s × duration** | 4–30 cr/s | Same per-second billing as video — check `credit` and multipliers in `list_models type="elements"`. |
 | **Lipsync** | **cr/s × duration** | 5–20 cr/s | |
 | **Music** | per generation (flat) | 15–60 cr | Suno v5 = 15 cr; ElevenLabs Music = 60 cr |
 | **Speech (TTS)** | per 100 characters | 2–5 cr/100 chars | ElevenLabs (5) × 500 chars = 25 cr; Google (2) × 500 chars = 10 cr |

@@ -138,11 +167,17 @@ Creative generations bill against the user's Kolbo credit balance. **Billing uni
 - **TTS**: `total = model_credit × ceil(character_count / 100)`
   - Count the actual characters in the text before estimating. 1000 chars with ElevenLabs = 50 credits.
 - **Images / 3D / Sound effects**: `total = model_credit × quantity`
+- **Resolution / audio multipliers**: if the user sets `resolution` or the model has native audio, read `resolutionMultipliers[tier]` and `soundCreditMultiplier` from `list_models`. Formula: `final = base × resolutionMult × (sound ? soundMult : 1) × durationSeconds`.
+
+**Tier label → pixel mapping (rough):**
+- Images: `"1K"` ≈ 1024px, `"2K"` ≈ Full HD (1920×1080), `"3K"` ≈ QHD (2560×1440), `"4K"` ≈ UHD (3840×2160). Picker shows only tiers the model actually supports (per `supported_resolutions`).
+- Videos: `"720p"` / `"1080p"` / `"1440p"` / `"2160p"` = vertical pixels (720p = HD, 1080p = Full HD, 1440p = QHD, 2160p = 4K UHD). Some models use model-specific labels like `"512P"` / `"1024P"` (Hailuo).
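The billing formulas in this hunk are plain arithmetic. A minimal editorial sketch (not part of the package) applying them with the figures quoted in the tables above:

```python
import math

def video_cost(credit_per_s, seconds, resolution_mult=1.0, sound_mult=1.0, with_sound=False):
    # final = base x resolutionMult x (sound ? soundMult : 1) x durationSeconds
    return credit_per_s * resolution_mult * (sound_mult if with_sound else 1.0) * seconds

def tts_cost(credit_per_100_chars, text):
    # total = model_credit x ceil(character_count / 100)
    return credit_per_100_chars * math.ceil(len(text) / 100)

print(video_cost(30, 10))       # Seedance 2.0: 30 cr/s x 10 s = 300 cr
print(tts_cost(5, "x" * 1000))  # ElevenLabs: 5 cr/100 chars x 1000 chars = 50 cr
```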
 
 **Cost confirmation — know when to skip it:**
 - **User specified everything** (model, count, duration, e.g. "make 5 videos, seedance 2 fast, 15s, 16:9"): **ACT IMMEDIATELY** — that IS the confirmation. Do not re-explain costs or ask again.
 - **Single generation under 5 credits**: proceed without confirmation.
 - **Everything else**: calculate total cost, present a summary, and wait for the user to confirm before generating.
+- **Batch totalling 100+ credits**: run `check_credits` before starting to verify the balance is sufficient, and include the available balance in your cost summary.

 **When confirmation IS needed:**
 1. Calculate per-item cost using the formulas above.
@@ -331,21 +366,50 @@ Visual DNA profiles capture the visual "identity" of a character, style, product

 ### Workflow
 1. **Create** a profile with `create_visual_dna` — provide reference images (max 4), optionally video and audio
-2. **Types**: `character` (default), `style`, `product`, `scene`
-3. **Use** the profile by passing its `id` in `visual_dna_ids`
+2. **Types**: `character` (default), `style`, `product`, `scene`, `environment`
+3. **Use** the profile by passing its `id` in `visual_dna_ids` in: `generate_image`, `generate_creative_director`, `generate_elements`
 4. **List/inspect** profiles with `list_visual_dnas` / `get_visual_dna`

+### ⚠️ @name Syntax — CRITICAL for Multi-Visual-DNA Prompts
+
+When using **multiple Visual DNA profiles in a single generation**, reference each profile by its name using the `@name` syntax directly in the prompt. This tells the engine which character or asset appears where:
+
+```
+"@dana walks into @shop and picks up a product from the shelf"
+```
+
+- Profile names are set during `create_visual_dna` (the `name` field)
+- Reference them as `@name` (lowercase, no spaces) inside the prompt text
+- Multiple profiles can appear in one prompt — the engine blends each one where it's mentioned
+- **Without `@name` references, the engine may blend all Visual DNAs together indiscriminately**
+- This works across `generate_image`, `generate_creative_director`, and `generate_elements`
+
+**Example workflow — two-character scene:**
+1. Create Visual DNA `name: "dana"` (type: character) → `id: "vdna_abc"`
+2. Create Visual DNA `name: "shop"` (type: environment) → `id: "vdna_xyz"`
+3. Generate: `prompt: "@dana standing in @shop, picking up a product"`, `visual_dna_ids: ["vdna_abc", "vdna_xyz"]`
+
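As a sketch only: the two-profile workflow just added, expressed as tool calls. The `mcp` helper below is a hypothetical stand-in for the host's MCP dispatch; only the tool names, field names, and `@name` tokens come from the skill text:

```python
def mcp(tool, **args):
    # Hypothetical stub: a real MCP client would dispatch the call
    # and return the tool's actual response.
    print(tool, args)
    return {"id": f"vdna_{args.get('name', 'x')}"}

dana = mcp("create_visual_dna", name="dana", type="character",
           images=["sheet.png", "portrait.png"])
shop = mcp("create_visual_dna", name="shop", type="environment",
           images=["shop_ref.png"])

# @name tokens bind each profile to its place in the scene.
mcp("generate_elements",
    prompt="@dana standing in @shop, picking up a product",
    visual_dna_ids=[dana["id"], shop["id"]])
```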
+### Visual DNA Limits (maxVisualDna)
+
+Each model has a `maxVisualDna` field in `list_models` results — never pass more Visual DNAs than the model supports:
+- **Image models** (non-Kling): up to **8** Visual DNAs
+- **Kling image models**: up to **3** Visual DNAs
+- **Elements video models**: up to **3–5** Visual DNAs (model-dependent)
+- **All other models**: up to **3** Visual DNAs
+
+Always check the `maxVisualDna` field from `list_models` for the exact limit of the chosen model.
+
 ### ⚠️ Visual DNA Creation — Always Generate Reference Images First (MANDATORY)

 **Before calling `create_visual_dna` for a character**, always generate 2 reference images first and include them alongside any user-provided images. These give the Visual DNA engine multi-angle coverage and dramatically improve consistency:

 **Step 1 — Generate both images in parallel (one `generate_image` call each, fire simultaneously):**

-1. **
-2. **
+1. **4-angle character sheet** — prompt: `"[character description], character reference sheet showing front view, back view, left side view, right side view, four panels arranged in a 2x2 grid, neutral solid background, full body, photorealistic"`, aspect ratio `16:9`
+2. **Close-up portrait** — prompt: `"[character description], close-up portrait, face and shoulders, neutral solid background, soft studio lighting, photorealistic"`, aspect ratio `1:1`

 **Step 2 — Call `create_visual_dna`** with:
-- `images`: user's reference
+- `images`: the 4-angle sheet URL first, then the close-up URL — **plus** the user's reference photo(s) only if they provided one (i.e. a real person or existing character they want to match). If they gave no reference image, the 2 generated images alone are sufficient.
 - `type`: `"character"`
 - `name`: descriptive name


@@ -354,10 +418,20 @@ Visual DNA profiles capture the visual "identity" of a character, style, product
 **Skip this only if** the user explicitly says "just use my image as-is" or provides 3+ reference images already covering multiple angles.

 ### When to Use
-- User wants the same character across multiple images/
-- User wants a
-- User
+- User wants the same character across multiple **images** or a campaign → `generate_image` / `generate_creative_director` with `visual_dna_ids`
+- User wants to animate a character in video using **elements models** (Seedance 2, Kling O3 Reference, Grok Imagine, Veo 3.1, etc.) → `generate_elements` with `visual_dna_ids`
+- User wants a consistent brand style across a campaign → `generate_creative_director` with `visual_dna_ids`
+- User references "keep the same look", "same character", or "use that character"
 - User provides reference photos of a person/product to maintain consistency
+- User asks to put a character in a specific environment or scene → create both a character Visual DNA and an environment Visual DNA, use `@name` syntax to place them
+
+### ⚠️ When NOT to Use Visual DNA
+- **Animating an image** ("make this photo move", "animate this image") → use `generate_video_from_image` and pass the image as the source. Do NOT attach `visual_dna_ids` — the source image IS the reference, Visual DNA adds no value here.
+- **Text-to-video** from a general description (no specific character to lock in) → use `generate_video` without `visual_dna_ids`
+- **`generate_video`** — does not support Visual DNA at all. Never pass `visual_dna_ids` to it.
+- **`generate_video_from_image`** — does not support Visual DNA. The source image serves as the visual reference.
+- **`generate_first_last_frame`** — does not support Visual DNA. The keyframes define the visual.
+- **The only video tool that supports Visual DNA is `generate_elements`** (elements-type models like Seedance 2, Kling O3 Reference, Grok Imagine). Use it when the user wants a character to appear consistently in a video scene.

 ---

@@ -495,9 +569,9 @@ Describe **genre → mood → instrumentation → tempo → era**, in that order

 ## Moodboards & Presets

-**Moodboards**
+**Moodboards** inject style direction as a **system-level prompt** (master prompt + style guide + reference images) — think of it as a persistent art direction layer applied on top of your generation. Pass a `moodboard_id` to any generation tool to apply its style. Moodboards can be combined with Visual DNA: the moodboard sets the overall aesthetic, while Visual DNA controls specific characters or objects.
 - `list_moodboards` to browse available options
-- `get_moodboard` to see full details before applying
+- `get_moodboard` to see full details (master_prompt, style_guide, images) before applying

 **Presets** bundle prompt templates + style direction for specific creative looks. Pass a `preset_id` to generation tools.
 - `list_presets` with optional `type` filter ("image", "video", "video_from_image", "music")
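Illustratively (editorial sketch with the same hypothetical `mcp` stub pattern; the `moodboard_id` value is made up), a moodboard and a Visual DNA combined on one generation, per the division of labor described in this hunk:

```python
def mcp(tool, **args):
    print(tool, args)  # hypothetical dispatch stub
    return {}

mcp("list_moodboards")                       # browse available options
mcp("get_moodboard", moodboard_id="mb_123")  # inspect master_prompt / style_guide / images

# Moodboard sets the overall aesthetic; Visual DNA pins the specific character.
mcp("generate_image",
    prompt="@dana at a night market",
    moodboard_id="mb_123",
    visual_dna_ids=["vdna_abc"])
```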
@@ -524,6 +598,43 @@ Use `chat_list_conversations` and `chat_get_messages` to browse conversation his

 ---

+## App Builder
+
+Use the App Builder tools to generate and iterate on full React apps from a text prompt. The backend auto-provisions a GitHub repo, Supabase database (when the app needs storage), and a live hosted deployment — all in one flow.
+
+### Standard Workflow
+
+1. **Find project ID**: `app_builder_list_projects` → pick the right project
+2. **Create session**: `app_builder_create_session` with `project_id`
+3. **Generate app**: `app_builder_generate_app` with `session_id` + `prompt`
+   - Fires the build in the background, polls until `build_status === "deployed"` (up to 5 min)
+   - Always surface the `deployment_url` to the user: **"Your app is live at: [url]"**
+4. **Iterate**: `app_builder_list_generations` → get `generation_id` → `app_builder_edit_app` with natural language instruction
+
+No manual polling needed — `generate_app` and `edit_app` block until the build completes.
+
+### Local Dev Workflow
+
+If the user wants to run the app locally or connect to the database directly:
+```
+app_builder_get_session(session_id) → returns:
+  github_repo_url   → git clone <url> && npm install && npm run dev
+  supabase_url      → paste into .env as NEXT_PUBLIC_SUPABASE_URL
+  supabase_anon_key → paste into .env as NEXT_PUBLIC_SUPABASE_ANON_KEY
+```
+
+### Whitelabel Support
+
+Works automatically — the MCP client routes App Builder calls through whitelabel API endpoints just like all other Kolbo tools.
+
+### ⚠️ Rules
+
+- **Always confirm before `app_builder_delete_session`** — it permanently deletes the GitHub repo, Supabase DB (unless user-connected), deployed files, and all history. IRREVERSIBLE.
+- **After every successful build**, show the `deployment_url` prominently — that's the live public URL, no setup needed.
+- **On build timeout** (rare): use `app_builder_get_build_status` to check manually, then continue or report to user.
+
+---
+
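A sketch of the standard workflow above as sequential calls (hypothetical `mcp` stub again; argument names beyond the documented `project_id`, `session_id`, `prompt`, and `generation_id` are illustrative):

```python
def mcp(tool, **args):
    print(tool, args)  # hypothetical dispatch stub
    return {"session_id": "sess_1", "generation_id": "gen_1",
            "deployment_url": "https://example.invalid"}

mcp("app_builder_list_projects")                                   # 1. find a project_id
session = mcp("app_builder_create_session", project_id="proj_1")   # 2. create a session
app = mcp("app_builder_generate_app",                              # 3. blocks until deployed
          session_id=session["session_id"],
          prompt="A todo app with a login form")
print("Your app is live at:", app["deployment_url"])               # always surface the URL

gen = mcp("app_builder_list_generations",                          # 4. iterate on the app
          session_id=session["session_id"])
mcp("app_builder_edit_app", session_id=session["session_id"],
    generation_id=gen["generation_id"], instruction="Add dark mode")
```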
 
 ## Image Analysis (when the user uploads images)

 When the user shares an image and asks about it:
@@ -636,6 +747,10 @@ Natural-language triggers that should prompt this skill + a tool call:
 - "What's in this image?" (with upload) → Read the image directly with your own vision — no Kolbo API call needed
 - "Analyze these 10 frames" (with multiple images) → Read all images directly with your own vision — you handle up to 10 natively
 - "Analyze these 5 videos" → upload all 5 with `upload_media`, then ONE `chat_send_message` with all 5 URLs in `media_urls`
+- "Build me a todo app" / "Create a React app with a login form" / "Make me a landing page with a waitlist" → `app_builder_list_projects` → `app_builder_create_session` → `app_builder_generate_app` → show `deployment_url`
+- "Add dark mode to my app" / "Change the color scheme" / "Add a contact form" → `app_builder_list_generations` → `app_builder_edit_app`
+- "How do I run my app locally?" / "Give me the GitHub repo" / "I want the Supabase credentials" → `app_builder_get_session` → show `github_repo_url` + `supabase_url` + `supabase_anon_key`
+- "List my apps" / "What App Builder sessions do I have?" → `app_builder_list_projects` → `app_builder_list_sessions`
 - "Create motion graphics" / "animated text" / "title sequence" → load the `remotion-best-practices` skill for Remotion-based motion graphics
 - "Edit this video" / "cut this clip" / "remove silence" / "add subtitles" / "convert to 9:16" → load the `video-production` skill for FFmpeg-based editing
 - "Create a short-form video" / "make a reel" / "YouTube short" → load the `short-form-video` skill
package/skills/seedance-2-prompting/SKILL.md
ADDED

@@ -0,0 +1,107 @@
+---
+name: seedance-2-prompting
+description: "Optimizes prompts for Seedance 2.0 video generation. Load this skill ONLY when the user is generating video with a Seedance 2 model (identifiers containing 'seedance-2', e.g. seedance-2, seedance-2-fast). Do NOT load for other video models."
+---
+
+# Seedance 2.0 Prompt Optimizer
+
+## Generation Modes — Which MCP Tool to Use
+
+Seedance 2.0 supports four distinct generation modes. Always confirm which mode the user wants before generating, then call the correct tool:
+
+| Mode | User intent | MCP Tool | Reference inputs |
+|------|-------------|----------|-----------------|
+| **Text to Video** | Prompt only, no reference images | `generate_video` | None |
+| **Keyframes** | Animate a single reference image | `generate_video_from_image` | 1 image (`@Image 1` = the source frame) |
+| **First/Last Frame** | Morph between two keyframe images | `generate_first_last_frame` | 2 images (`@Image 1` = first frame, `@Image 2` = last frame) |
+| **Elements** | Omni-reference: animate from multiple assets, supports Visual DNA for character consistency | `generate_elements` | 1–4 images/videos + optional `visual_dna_ids` |
+
+**When to use Elements mode:** any time the user wants character consistency across shots, has multiple reference assets, or explicitly mentions Visual DNA. This is Seedance 2.0's most powerful mode.
+
+**Prompt differences by mode:**
+- **Text to Video**: all eight elements must be written in the prompt — no visual anchors exist.
+- **Keyframes**: describe *motion only* — the model sees `@Image 1`, so never re-describe the subject's appearance.
+- **First/Last Frame**: declare `@Image 1 as first frame constraint` and `@Image 2 as last frame constraint` in global settings; the storyboard describes only the transition between them.
+- **Elements**: declare each asset's role in global settings (`@Image 1 (character reference) ...`); the model uses them as visual anchors throughout.
+
+## Role definition
+You are a Seedance 2.0 multimodal AI director and prompt-optimization expert. Your primary task is to intercept low-quality, adjective-heavy prompts from users and guide them to rewrite those prompts into high-quality engineered prompts based on the *Seedance 2.0 prompt engineering optimization framework* (three-section structure, eight core elements, multimodal reference control).
+
+## Core workflow
+When a user enters a rough prompt, provides multimodal assets (images/videos), or **only states a video-generation requirement (such as "Generate a video of a dog running")**, strictly follow the steps below:
+
+### Step 0: Requirement analysis and heuristic questioning (only when the user provides a requirement without a specific prompt)
+If the user only provides a rough idea or requirement (for example: "I want to make a cyberpunk-style video" or "Generate a video of a girl dancing"), you must **actively enter guidance mode**, help the user enrich the details by asking questions, and never invent content on their behalf:
+1. **Ask about core elements**: Guide the user to supplement information based on the "eight core elements".
+   *Sample question*: "Regarding this video of a girl dancing, could you fill in a few details for me? For example: 1. What are the girl's appearance features and clothing? 2. Where is the dancing scene (cyberpunk street / classical stage)? 3. Do you have any reference images (@Image 1) to provide?"
+2. **Switch to the regular process once information is collected**: after the user replies with sufficient information, proceed to Step 1 and the subsequent steps below.
+
+### Step 1: Intent and scenario determination
+1. Determine the generation type: is this "generate a new video" or "edit an existing video (add, delete, modify, or stitch)"?
+2. Determine scenario dynamics: is this a "static scene (requires fine control, such as emotional details)" or a "dynamic scene (preserves large motions, works with reference assets)"?
+
+### Step 2: Element self-check and asset mapping (automatic parsing)
+1. **Multimodal JSON/text parsing and automatic mapping**: If the user directly pastes a complete JSON input containing a `"content"` array, or a long text with a similar structure, you **must actively take the following parsing actions**:
+   - Scan all objects that are not of `text` type (such as `"type": "image_url"`, `"type": "video_url"`).
+   - According to their **order of appearance in the input (starting from 1)**, automatically assign them standard codes such as `@Image 1`, `@Image 2` or `@Video 1`.
+   - Extract their corresponding `url` or `asset-xxx` ID.
+   - Return to the `text`-type entries and automatically replace any `asset-xxx` IDs the user wrote there with the newly assigned `@Image N` or `@Video N` syntax (see the sketch after this step).
+2. **Long image / 9-grid image confirmation**: Ask whether an uploaded asset is a long image or a 9-grid collage; if so, explicitly remind the user to split it into single images before use.
+3. **Mapping logic confirmation**: When there are multiple images but no clear mapping logic (e.g., which is on the left, which is on the right, which is the first frame, which is the last frame), ask the user for clarification.
+
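The numbering-and-replacement step above is mechanical enough to sketch. An editorial illustration (not part of the package) with a hypothetical input shape; the actual `content` array format is whatever the host passes through:

```python
def map_assets(content):
    """Assign @Image N / @Video N by order of appearance (starting from 1),
    then rewrite asset-xxx IDs inside the text parts with those codes."""
    codes, counts, texts = {}, {"image_url": 0, "video_url": 0}, []
    for item in content:
        kind = item["type"]
        if kind in counts:
            counts[kind] += 1
            label = "Image" if kind == "image_url" else "Video"
            codes[item["id"]] = f"@{label} {counts[kind]}"
        elif kind == "text":
            texts.append(item["text"])
    prompt = " ".join(texts)
    for asset_id, code in codes.items():
        prompt = prompt.replace(asset_id, code)
    return codes, prompt

codes, prompt = map_assets([
    {"type": "image_url", "id": "asset-111", "url": "https://example.invalid/lee.png"},
    {"type": "image_url", "id": "asset-222", "url": "https://example.invalid/sue.png"},
    {"type": "text", "text": "asset-111 walks towards asset-222"},
])
print(prompt)  # "@Image 1 walks towards @Image 2" (noun glosses must still be added afterwards)
```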
+### Step 3: Element review and multi-selection interaction
+1. Check if the user's prompt contains the following "eight core elements":
+   - Precise subject (who?)
+   - Action details (what is being done?)
+   - Setting and environment (where?)
+   - Light and shadow tone (what atmosphere?)
+   - Camera movement (how to shoot?)
+   - Visual style (what art style?)
+   - Image quality parameters (how clear?)
+   - Constraints (fallback anti-distortion requirements)
+2. Check for a "camera movement conflict" (e.g., requiring both dolly in and pan left at the same time).
+3. **[Critical: No silent modification]**: When you find missing elements or conflicts, you **must** present specific suggestions through a "multi-selection interaction" so the user can choose.
+
+*Sample multi-selection interaction:*
+I have received your input. The following suggestions were detected. Please select the parts you accept:
+1. [Clarification] Which of Image 1 and Image 2 is on the left, and which is on the right?
+2. [Supplement] How are they running (e.g., chasing, side by side)?
+3. [Camera movement conflict] The current prompt requires both dolly in and pan left at the same time. It is recommended to change this to a single camera movement, such as "dolly in" or "fixed camera".
+
+[Checkboxes]:
+- [ ] Accept suggestion 1 and set: Image 1 is on the left, Image 2 is on the right.
+- [ ] Accept suggestion 2 and set: running in chase.
+- [ ] Accept the camera movement modification and set: dolly in.
+- [ ] Other modifications (please specify)
+
+### Step 4: Structured output
+After the user completes the selection, or the information is already complete, output the final result strictly according to the following three modules:
+
+#### Optimized prompt
+(a strict **three-paragraph** structure)
+1. **Global basic settings**: Lock characters, environment, and core assets.
+   - **[Extremely important] The mapping relationship must be explicitly declared using the `@Image N` syntax** (for example: `@Image 1 is Lee (asset ID: [asset-xxx])`). It is strictly forbidden to use bare, meaningless `[asset-xxx]` IDs or character names alone in subsequent prompts.
+   - **First and last frame control**: If the user's intent includes opening/closing constraints, declare it here (e.g., `@Image 1 as first frame constraint`, `@Image 2 as last frame constraint`).
+2. **Time slice storyboard**: Controls the time layer; dynamically determine the slice lengths (e.g., 0–3s, 3–10s), each including actions and a single camera movement. **When describing actions and positions, strong visual references in the `@Image N` format must be used.**
+   - **Mandatory ambiguity prevention policy**: To prevent the model from producing ambiguity by reading `@Image 1` together with the numbers or quantifiers that follow it (for example, misreading "@Image 1 location is..." as "Image, one position is..."), **after every `@Image N` and `@Video N`, the corresponding character name or noun gloss must be added, set off by parentheses or clear wording**.
+   - **Correct example**: `@Image 1 (Lee) stands up and walks towards @Image 3 (Sue)`, or `The girl in @Image 2 is located on the left side of the screen`.
+   - **Incorrect example**: `@Image 2 is located at...` (very likely to cause ambiguity), `@Image 1 runs towards...`.
+   - **Camera movement restriction**: Ensure there is **only 1 type of camera movement** per time-slice shot (simultaneous pan, tilt, dolly, and zoom are prohibited).
+3. **Editing instructions (for video editing only)**:
+   - If it is an **addition, deletion, or modification**, the time period and spatial position must be clearly indicated (e.g., "Add... in the lower left corner during 0-5s").
+   - If it is a **video extension/stitch**, use the standard syntax (e.g., "Extend `@Video 1` smoothly forwards", or "`@Video 1`, [transition description], followed by `@Video 2`").
+   - If it is **text generation**, specify the text content, timing, position, and presentation (e.g., "Subtitle 'abc' appears at the bottom of the screen, synchronized with the audio").
+4. **Image quality, style and constraints**: Automatically add image-quality enhancement words (e.g., "4K HD, rich details") and fallback anti-distortion constraint words (e.g., "character faces are stable and not distorted, facial features are clear, no clipping through objects").
+
+#### Optimization
+Point out where the original prompt violates the generation rules of large video models (e.g., missing elements, camera-movement conflicts, non-standard formatting, direct use of meaningless asset IDs).
+
+#### Relevant principles
+List the specific rules or guiding ideas from the *Seedance 2.0 prompt engineering optimization framework* that apply to the issues above (e.g., "sentence segmentation ambiguity prevention principle", "asset ID masking principle", "camera movement restriction specification").
+
+## Mandatory constraints
+- **No silent modification**: Never guess and fill in missing elements or modify conflicting camera movements without confirmation from the user.
+- **Mandatory fallback**: The final output prompt must include anti-distortion and high-image-quality constraints.
+- **Complex scenario handling**: For complex multi-person front-facing dynamic videos, **strong orientation constraints must be used** (e.g., "The character on the left wears a gray-blue training uniform"), supplemented by fixed-camera control, to avoid clipping through objects or face jumping.
+- **Asset ID masking principle**: The underlying model cannot directly understand meaningless asset IDs. A bridge from text to visual features must be established through `@Image N`; it is strictly forbidden for a bare `[asset-xxx]` to stand in for a character subject in the action description of the prompt.
+- **Sentence segmentation ambiguity prevention principle**: Each `@Image N` reference must be immediately followed by a referential pronoun or noun (e.g., "the man", "(Lee)"). Directly attaching verbs or location words is strictly prohibited, to prevent quantity errors caused by word-segmentation ambiguity in large models.
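To close, an editorial illustration (not from the package) of the three-paragraph Optimized prompt structure, assembled from the correct-example fragments in the skill above:

```
Global settings: @Image 1 (Lee) is the male lead; @Image 2 (Sue) is the female lead.
@Image 1 as first frame constraint.

0-3s: The man in @Image 1 (Lee) stands up and walks towards @Image 2 (Sue). Fixed camera.
3-8s: @Image 2 (Sue) turns and smiles. Single camera movement: slow dolly in.

4K HD, rich details; character faces are stable and not distorted, facial features are clear, no clipping through objects.
```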