@kolbo/kolbo-code-linux-arm64-musl 1.1.5 → 1.1.66

package/bin/kolbo CHANGED
Binary file
package/package.json CHANGED
@@ -1,6 +1,6 @@
1
1
  {
2
2
  "name": "@kolbo/kolbo-code-linux-arm64-musl",
3
- "version": "1.1.5",
3
+ "version": "1.1.66",
4
4
  "os": [
5
5
  "linux"
6
6
  ],
@@ -1,25 +1,78 @@
1
1
  ---
2
2
  name: kolbo
3
- description: Generate or analyze creative media through Kolbo AI. Load this skill whenever the user asks to create, edit, prompt, or analyze images, videos, music, speech, or sound effects — or to list available AI models / check credit balance. It contains the MCP tool workflow and the prompt-engineering rules for each media type.
3
+ description: Generate, edit, or analyze creative media through Kolbo AI. Load this skill whenever the user asks to create, edit, prompt, or analyze images, videos, music, speech, sound effects, 3D models — or to transcribe audio/video, manage media, use Visual DNA for consistency, check credits, or browse models/presets/moodboards. It contains the MCP tool workflow and the prompt-engineering rules for each media type.
4
4
  ---
5
5
 
6
- # Kolbo AI — Creative Generation & Analysis
6
+ # Kolbo AI — Creative Generation, Analysis & Transcription
7
7
 
8
8
  You have direct access to the Kolbo AI creative platform via MCP tools (auto-configured by `kolbo auth login`). Use them to generate and deliver real content — do NOT just describe what you would create.
9
9
 
10
10
  ## Available MCP Tools
11
11
 
12
+ ### Generation
13
+
14
+ | Tool | Description |
15
+ |------|-------------|
16
+ | `generate_image` | Create images from text prompts. Supports Visual DNA, moodboards, reference images, batch generation, web-search grounding. |
17
+ | `generate_image_edit` | Edit/transform an existing image (background removal, color changes, compositing). Pass source images + edit prompt. |
18
+ | `generate_creative_director` | Generate a coordinated multi-scene set (1–8 scenes) from one creative brief. Ideal for storyboards, ad campaigns, product showcases. Supports image and video modes. |
19
+ | `generate_video` | Create videos from text prompts. Supports Visual DNA and reference images for consistency. |
20
+ | `generate_video_from_image` | Animate a still image into video. Prompt describes the motion, not the subject. |
21
+ | `generate_video_from_video` | Restyle/transform an existing video (style transfer, scene restyling, subject swap). Keeps the original motion. |
22
+ | `generate_elements` | Generate video from reference assets (images/videos) + prompt. Use when animating specific uploaded assets. |
23
+ | `generate_first_last_frame` | Generate video that morphs from a first frame to a last frame (keyframe interpolation). |
24
+ | `generate_lipsync` | Lipsync an audio track to a source image or video face. Accepts local files or URLs. |
25
+ | `generate_music` | Create music from descriptions. Supports instrumental, custom lyrics, style, vocal gender. |
26
+ | `generate_speech` | Convert text to speech (TTS). Default: ElevenLabs. Use `list_voices` to pick a voice. |
27
+ | `generate_sound` | Generate sound effects from descriptions (foley, ambient, impacts, UI sounds). |
28
+ | `generate_3d` | Generate 3D models from text, single image, or multi-view images. Returns GLB, FBX, OBJ, USDZ. |
29
+
30
+ ### Transcription & Analysis
31
+
32
+ | Tool | Description |
33
+ |------|-------------|
34
+ | `transcribe_audio` | Transcribe audio or video into text + SRT subtitles + word-by-word SRT. Accepts local files or URLs. |
35
+
36
+ ### Voice & Model Discovery
37
+
12
38
  | Tool | Description |
13
39
  |------|-------------|
14
- | `generate_image` | Create images from text prompts. Returns image URL(s). |
15
- | `generate_video` | Create videos from text. Returns video URL. |
16
- | `generate_video_from_image` | Animate a still image into video. Returns video URL. |
17
- | `generate_music` | Create music from descriptions. Returns audio URL. |
18
- | `generate_speech` | Convert text to speech. Returns audio URL. |
19
- | `generate_sound` | Generate sound effects. Returns audio URL. |
20
40
  | `list_models` | Browse available AI models filtered by type. |
41
+ | `list_voices` | List available TTS voices with filtering by provider, language, gender. |
21
42
  | `check_credits` | Check remaining Kolbo credit balance. |
22
- | `get_generation_status` | Poll status of an in-progress generation by ID. |
43
+ | `get_generation_status` | Poll status of an in-progress generation by ID (fallback for timeouts). |
44
+
45
+ ### Media Library
46
+
47
+ | Tool | Description |
48
+ |------|-------------|
49
+ | `upload_media` | Upload a local file or URL to the user's Kolbo media library (CDN). Use for multi-tool workflows. |
50
+ | `list_media` | Browse user's uploaded media with filtering by type and search. |
51
+
52
+ ### Visual DNA (Character/Style Consistency)
53
+
54
+ | Tool | Description |
55
+ |------|-------------|
56
+ | `create_visual_dna` | Create a Visual DNA profile from reference images/video/audio for character, style, product, or scene consistency. |
57
+ | `list_visual_dnas` | List your Visual DNA profiles (id, name, type, thumbnail). |
58
+ | `get_visual_dna` | Fetch full profile details including system_prompt and reference images. |
59
+ | `delete_visual_dna` | Delete a Visual DNA profile. |
60
+
61
+ ### Moodboards & Presets
62
+
63
+ | Tool | Description |
64
+ |------|-------------|
65
+ | `list_moodboards` | List available moodboards (personal, system presets, org). |
66
+ | `get_moodboard` | Fetch a moodboard's master_prompt, style_guide, and images. |
67
+ | `list_presets` | Browse generation presets (image/video/music templates with bundled style direction). |
68
+
69
+ ### Chat
70
+
71
+ | Tool | Description |
72
+ |------|-------------|
73
+ | `chat_send_message` | Send a message to Kolbo AI chat. Supports web search and deep think modes. |
74
+ | `chat_list_conversations` | List your SDK chat conversations. |
75
+ | `chat_get_messages` | Fetch messages in a conversation (with media URLs). |
23
76
 
24
77
  ## Core Workflow
25
78
 
@@ -34,23 +87,71 @@ You have direct access to the Kolbo AI creative platform via MCP tools (auto-con
34
87
  | Type | Use for |
35
88
  |------|---------|
36
89
  | `image` | Still-image generation |
90
+ | `image_edit` | Image editing / transformation |
37
91
  | `video` | Text-to-video |
38
92
  | `video_from_image` | Image-to-video animation |
93
+ | `lipsync` | Audio-to-face lipsync |
39
94
  | `music` | Music generation |
40
95
  | `speech` | Text-to-speech |
41
96
  | `sound` | Sound effects |
97
+ | `three_d` | 3D model generation |
42
98
 
43
99
  ### Cost Awareness
44
100
 
45
101
  Creative generations bill against the user's Kolbo credit balance. Rough order of expense, cheapest first:
46
- - **Cheap & fast**: speech (~5-30s), sound effects (~5-30s), image (~10-30s)
47
- - **Medium**: music (~30s-2min)
48
- - **Expensive**: video (~1-5min, highest credit cost)
102
+ - **Cheap & fast**: speech (~5-30s), sound effects (~5-30s), image (~10-30s), transcription (by duration)
103
+ - **Medium**: music (~30s-2min), 3D (~1-3min)
104
+ - **Expensive**: video (~1-5min, highest credit cost), lipsync (~1-3min)
49
105
 
50
106
  Rule of thumb: confirm intent before firing off a video generation unless the user was explicit. For images, just generate.
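
The rule of thumb above can be sketched as a small guard. This is an illustration only: the function name `needs_confirmation`, the `EXPENSIVE_TYPES` set, and the `user_was_explicit` flag are hypothetical and are not part of the Kolbo MCP tools. The set simply mirrors the "Expensive" tier listed above.

```python
# Hypothetical helper, not part of the Kolbo MCP tools.
# Mirrors the "Expensive" tier from the cost list: video and lipsync.
EXPENSIVE_TYPES = {"video", "lipsync"}


def needs_confirmation(generation_type: str, user_was_explicit: bool) -> bool:
    """Return True when the agent should confirm intent before generating."""
    # Cheap types (image, speech, sound) are generated without asking.
    # Expensive types are confirmed unless the user already asked explicitly.
    return generation_type in EXPENSIVE_TYPES and not user_was_explicit
```

With this sketch, `needs_confirmation("image", False)` is false (just generate), while `needs_confirmation("video", False)` is true (ask first).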
51
107
 
52
108
  ---
53
109
 
110
+ ## Transcription & Audio/Video Analysis
111
+
112
+ Use `transcribe_audio` whenever the user provides an audio or video file and wants:
113
+ - A text transcript
114
+ - Subtitles (SRT format)
115
+ - Word-by-word timed subtitles (for karaoke, motion graphics, Remotion captions, video editing)
116
+ - Content analysis or summary of spoken content
117
+ - Dialogue extraction from video
118
+
119
+ ### Workflow
120
+ 1. Call `transcribe_audio` with the `source` (URL or absolute local file path)
121
+ 2. The tool returns:
122
+ - `text` — full transcript as plain text
123
+ - `srt_url` — download URL for grouped SRT subtitles (configurable words-per-line)
124
+ - `word_by_word_srt_url` — download URL for **word-by-word SRT** (one word per subtitle entry with precise timestamps from ElevenLabs Scribe v2)
125
+ - `txt_url` — download URL for plain text file
126
+ - `duration` — audio duration in seconds
127
+ 3. Analyze the transcript text as needed (summarize, translate, extract topics, answer questions about content)
128
+
129
+ ### Supported Formats
130
+ - **Audio**: mp3, wav, m4a, flac, aac
131
+ - **Video** (extracts audio track): mp4, mov, webm, mkv, avi, m4v
132
+
133
+ ### Word-by-Word Transcription
134
+ The `word_by_word_srt_url` contains an SRT file where each subtitle entry is a **single word** with precise start/end timestamps (powered by ElevenLabs Scribe v2). This is ideal for:
135
+ - **Karaoke-style captions** — highlight one word at a time
136
+ - **Remotion/motion graphics** — animate text word-by-word synced to audio
137
+ - **Video editing** — precise cut points aligned to speech
138
+ - **Accessibility** — word-level navigation for hearing-impaired users
139
+
140
+ The regular `srt_url` groups words into readable subtitle lines (default 12 words per line, up to 2 lines per subtitle).
141
+
142
+ ### Use Cases & Examples
143
+ - "Transcribe this podcast" → `transcribe_audio` with the audio URL
144
+ - "What's being said in this video?" → `transcribe_audio` → analyze the returned text
145
+ - "Generate subtitles for my video" → `transcribe_audio` → share the `srt_url`
146
+ - "I need word-by-word timing for this audio" → `transcribe_audio` → share `word_by_word_srt_url`
147
+ - "Summarize this meeting recording" → `transcribe_audio` → summarize the text
148
+ - "Extract key points from this lecture" → `transcribe_audio` → analyze and extract
149
+
150
+ ### Long Content
151
+ Transcription supports files up to 30 minutes. For longer content, split the file first or provide segments.
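
For recordings longer than the 30-minute limit, the segment boundaries can be planned before cutting the file. The following is a minimal sketch: `plan_segments` is a hypothetical helper that only computes time ranges, and the cutting itself would be done with a separate media tool. Fixed-offset cuts can land mid-word, so nudging each boundary to a nearby pause is advisable.

```python
from typing import List, Tuple

# 30-minute limit stated in the section above, in seconds.
MAX_SEGMENT_SECONDS = 30 * 60


def plan_segments(
    duration: float, max_len: float = MAX_SEGMENT_SECONDS
) -> List[Tuple[float, float]]:
    """Split [0, duration] into consecutive ranges no longer than max_len."""
    duration = float(duration)
    if duration <= 0 or max_len <= 0:
        raise ValueError("duration and max_len must be positive")
    segments = []
    start = 0.0
    while start < duration:
        end = min(start + max_len, duration)
        segments.append((start, end))
        start = end
    return segments
```

A 75-minute recording (4500 seconds) is planned as three ranges: two full 30-minute segments and a final 15-minute one. Each segment's transcript timestamps then need the segment's start offset added back when the results are merged.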
152
+
153
+ ---
154
+
54
155
  ## Image Prompts
55
156
 
56
157
  ### Rules
@@ -61,14 +162,36 @@ Rule of thumb: confirm intent before firing off a video generation unless the us
61
162
  - **`enhance_prompt: true`** (default) will improve most prompts automatically. Turn it off only if the user's prompt is already fully engineered or they want literal wording.
62
163
 
63
164
  ### Image Editing (image-to-image)
64
- When the model can see the uploaded image, describe the **change**, not the unchanged parts.
165
+
166
+ Use `generate_image_edit` when the user wants to modify an existing image. Pass the source image URL(s) in `source_images` and describe the change in `prompt`.
167
+
65
168
  - Good: "Turn the sky orange and add drifting clouds"
66
169
  - Bad: "A mountain landscape with an orange sky and drifting clouds" (re-describes what's already in the image)
67
170
 
68
171
  Simple edits deserve simple prompts. Only elaborate for genuinely complex, multi-step transformations.
69
172
 
70
173
  ### Multi-Scene / Campaigns
71
- For storyboards, campaigns, or character-consistent sequences, call `generate_image` once per scene with the same base style cues carried across prompts. Kolbo's web app has a dedicated Creative Director feature for this; in the CLI the workflow is sequential `generate_image` calls.
174
+ For storyboards, campaigns, or character-consistent sequences, use `generate_creative_director`: it generates 1–8 coordinated scenes from a single creative brief with consistent style. Pass `visual_dna_ids` and/or `moodboard_id` for character/style consistency across all scenes.
175
+
176
+ In the CLI, you can also do sequential `generate_image` calls with the same Visual DNA profiles.
177
+
178
+ ---
179
+
180
+ ## Visual DNA (Character/Style Consistency)
181
+
182
+ Visual DNA profiles capture the visual "identity" of a character, style, product, or scene from reference media.
183
+
184
+ ### Workflow
185
+ 1. **Create** a profile with `create_visual_dna` — provide reference images (max 4), optionally video and audio
186
+ 2. **Types**: `character` (default), `style`, `product`, `scene`
187
+ 3. **Use** the profile by passing its `id` in `visual_dna_ids` when calling any generation tool
188
+ 4. **List/inspect** profiles with `list_visual_dnas` / `get_visual_dna`
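
The constraints in the workflow above (at most 4 reference images, four profile types, `character` as the default) can be checked before calling `create_visual_dna`. This is a sketch under assumptions: the argument keys `name`, `type`, and `images` are hypothetical placeholders, since the tool's actual parameter names are not shown here. Only `visual_dna_ids` is taken from the text.

```python
from typing import Dict, List

# Values taken from the workflow above.
ALLOWED_TYPES = ("character", "style", "product", "scene")
MAX_REFERENCE_IMAGES = 4


def build_visual_dna_args(
    name: str, images: List[str], profile_type: str = "character"
) -> Dict[str, object]:
    """Validate inputs and return an argument dict (key names are assumed)."""
    if profile_type not in ALLOWED_TYPES:
        raise ValueError(f"unknown Visual DNA type: {profile_type!r}")
    if len(images) > MAX_REFERENCE_IMAGES:
        raise ValueError(f"at most {MAX_REFERENCE_IMAGES} reference images")
    return {"name": name, "type": profile_type, "images": list(images)}


def with_visual_dna(generation_args: Dict[str, object], profile_id: str) -> Dict[str, object]:
    """Return a copy of generation_args with the profile id in visual_dna_ids."""
    merged = dict(generation_args)
    merged["visual_dna_ids"] = [profile_id]
    return merged
```

The second helper reflects step 3: the `id` returned by profile creation is passed in `visual_dna_ids` on later generation calls.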
189
+
190
+ ### When to Use
191
+ - User wants the same character across multiple images/videos
192
+ - User wants a consistent brand style across a campaign
193
+ - User references "keep the same look" or "same character"
194
+ - User provides reference photos of a person/product to maintain consistency
72
195
 
73
196
  ---
74
197
 
@@ -89,6 +212,20 @@ The model can see the starting frame. Describe **what happens**, not what the im
89
212
  - Good: "Slow dolly-in on the subject. Her hair drifts in a light breeze. Soft particles float through the air. [6s]"
90
213
  - Bad: "A woman with long brown hair standing in a forest, wearing a red dress, with golden sunlight..." (re-describes the image)
91
214
 
215
+ ### Video-to-Video (Restyle)
216
+ Use `generate_video_from_video` to restyle an existing video. Describe the **new style**, not the original content — the model preserves the original motion.
217
+ - Good: "Transform into anime style with cel-shading and vibrant colors"
218
+ - Bad: "A person walking down a street" (re-describes what's already in the video)
219
+
220
+ ### Elements (Reference Assets → Video)
221
+ Use `generate_elements` when the user has specific assets (product photos, character references) they want animated into a video. Pass them as `reference_images` (URLs) or `files` (local paths).
222
+
223
+ ### First/Last Frame (Keyframe Interpolation)
224
+ Use `generate_first_last_frame` when the user provides two keyframes and wants the model to create a smooth transition between them.
225
+
226
+ ### Lipsync
227
+ Use `generate_lipsync` to sync audio to a face in an image or video. Both `source` (face) and `audio` accept URLs or local file paths.
228
+
92
229
  ### Camera Vocabulary
93
230
 
94
231
  Pick what fits the mood. Every shot gets at least one.
@@ -150,6 +287,17 @@ Format: `extreme slow-motion [Xs] — [micro-movements in ultra slow-mo] — sna
150
287
 
151
288
  ---
152
289
 
290
+ ## 3D Generation
291
+
292
+ Use `generate_3d` for creating 3D models. Three modes:
293
+ - **Text mode**: prompt-only (e.g., "a medieval sword with ornate handle")
294
+ - **Single image mode**: one reference image + optional prompt
295
+ - **Multi-view mode**: 2+ reference images for higher-quality reconstruction
296
+
297
+ Returns downloadable model files in GLB, FBX, OBJ, and USDZ formats. Use `list_models` with `type: "three_d"` to discover available models.
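
The choice between the three modes depends only on how many reference images are available. A minimal sketch of that decision follows. The function name and the mode labels `"text"`, `"single_image"`, and `"multi_view"` are hypothetical and are not confirmed `generate_3d` parameter values.

```python
from typing import List, Optional


def pick_3d_mode(reference_images: List[str], prompt: Optional[str] = None) -> str:
    """Choose a 3D generation mode from the number of reference images."""
    count = len(reference_images)
    if count == 0:
        # Text mode is prompt-only, so a prompt is required.
        if not prompt:
            raise ValueError("text mode needs a prompt")
        return "text"
    if count == 1:
        # Single image mode: the prompt is optional.
        return "single_image"
    # Two or more images: multi-view reconstruction.
    return "multi_view"
```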
298
+
299
+ ---
300
+
153
301
  ## Music Prompts
154
302
 
155
303
  Describe **genre → mood → instrumentation → tempo → era**, in that order.
@@ -164,10 +312,10 @@ Describe **genre → mood → instrumentation → tempo → era**, in that order
164
312
 
165
313
  ## Speech (TTS)
166
314
 
167
- - Call `list_models` with `type: speech` to get voice identifiers. Pass the `identifier` as `model` for a consistent voice.
168
- The voice **is** the model for speech; there is no separate voice parameter.
315
+ - Call `list_voices` to find available voices. Filter by `provider`, `language`, or `gender`.
316
+ - Pass the returned `voice_id` (or the voice's display name like "Rachel") as the `voice` parameter in `generate_speech`.
317
+ - For multilingual content, pick a voice that supports the target language.
169
318
  - For long text, split at natural sentence boundaries. Each generation has a character cap; chunk long-form content into multiple calls.
170
- - For multilingual content, pick a voice that supports the target language from `list_models`.
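
Splitting long text at sentence boundaries, as the speech guidance recommends, can be sketched as follows. The character cap is not stated in this document, so `max_chars` is a caller-supplied assumption, and `chunk_text` is a hypothetical helper rather than part of `generate_speech`.

```python
import re
from typing import List


def chunk_text(text: str, max_chars: int) -> List[str]:
    """Group whole sentences into chunks of at most max_chars characters."""
    if max_chars <= 0:
        raise ValueError("max_chars must be positive")
    # Split after sentence-ending punctuation followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks: List[str] = []
    current = ""
    for sentence in sentences:
        if not sentence:
            continue
        # A single sentence longer than the cap is hard-split as a fallback.
        while len(sentence) > max_chars:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(sentence[:max_chars])
            sentence = sentence[max_chars:]
        candidate = f"{current} {sentence}" if current else sentence
        if len(candidate) <= max_chars:
            current = candidate
        else:
            chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks
```

Each returned chunk would be sent as its own `generate_speech` call with the same `voice`, so the audio stays consistent across chunks.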
171
319
 
172
320
  ---
173
321
 
@@ -179,6 +327,35 @@ Describe **genre → mood → instrumentation → tempo → era**, in that order
179
327
 
180
328
  ---
181
329
 
330
+ ## Moodboards & Presets
331
+
332
+ **Moodboards** provide style direction (master prompt + style guide + reference images). Pass a `moodboard_id` to any generation tool to apply its style.
333
+ - `list_moodboards` to browse available options
334
+ - `get_moodboard` to see full details before applying
335
+
336
+ **Presets** bundle prompt templates + style direction for specific creative looks. Pass a `preset_id` to generation tools.
337
+ - `list_presets` with optional `type` filter ("image", "video", "music", "text_to_video")
338
+
339
+ ---
340
+
341
+ ## Media Library
342
+
343
+ Use `upload_media` to upload local files or URLs to the Kolbo CDN for stable hosting. Useful when:
344
+ - A local file needs to be referenced in multiple generation calls
345
+ - You want a permanent CDN URL instead of an ephemeral local path
346
+
347
+ Use `list_media` to browse previously uploaded content (filter by type, search by name).
348
+
349
+ ---
350
+
351
+ ## Chat
352
+
353
+ Use `chat_send_message` to interact with Kolbo AI models (GPT-4o, Claude, etc.) with optional web search and deep think modes. Conversations persist via `session_id`: omit it to start a new conversation, or pass it to continue an existing one.
354
+
355
+ Use `chat_list_conversations` and `chat_get_messages` to browse conversation history.
356
+
357
+ ---
358
+
182
359
  ## Image Analysis (when the user uploads images)
183
360
 
184
361
  When the user shares an image and asks about it:
@@ -188,7 +365,7 @@ When the user shares an image and asks about it:
188
365
  - **Extract text verbatim** when asked (OCR-style requests are fine).
189
366
  - **Cannot identify real people.** Describe hair, clothing, pose, expression, and apparent role — but never name a specific individual, even a well-known public figure. If the user insists, decline and offer to describe instead.
190
367
  - **Copyrighted content**: summarize and reference, don't reproduce verbatim large chunks.
191
- - If the user wants an **edit** based on the analysis, hand off to `generate_video_from_image` (motion) or `generate_image` with an image-to-image model (visual edit) — see the Image Editing section above for prompt structure.
368
+ - If the user wants an **edit** based on the analysis, hand off to `generate_image_edit` (visual edit) or `generate_video_from_image` (motion).
192
369
 
193
370
  ---
194
371
 
@@ -222,11 +399,22 @@ The MDX sources are in the `kolbo-docs` repo under `content/docs/kolbo-code/`. W
222
399
  Natural-language triggers that should prompt this skill + a tool call:
223
400
 
224
401
  - "Generate an image of a neon-lit Tokyo street at night" → `list_models` (image) → `generate_image`
402
+ - "Remove the background from this image" → `list_models` (image_edit) → `generate_image_edit`
403
+ - "Create a storyboard for a coffee brand ad" → `list_models` (image) → `generate_creative_director`
225
404
  - "Create a 5-second cinematic video of ocean waves at sunset" → `list_models` (video) → `generate_video` with camera + mood guidance
226
405
  - "Animate this product photo with a 360° orbit" → `list_models` (video_from_image) → `generate_video_from_image`
406
+ - "Restyle this video as anime" → `generate_video_from_video`
407
+ - "Make this character talk with this voiceover" → `generate_lipsync`
408
+ - "Create a smooth transition between these two frames" → `generate_first_last_frame`
227
409
  - "Make a lo-fi hip hop beat, instrumental, 85 BPM" → `list_models` (music) → `generate_music`
228
- - "Say this in English with a natural female voice: Welcome to Kolbo" → `list_models` (speech) → `generate_speech`
410
+ - "Say this in English with a natural female voice: Welcome to Kolbo" → `list_voices` → `generate_speech`
229
411
  - "Generate a door slam sound effect" → `list_models` (sound) → `generate_sound`
412
+ - "Create a 3D model of a medieval castle" → `list_models` (three_d) → `generate_3d`
413
+ - "Transcribe this podcast episode" → `transcribe_audio`
414
+ - "What's being said in this video?" → `transcribe_audio` → analyze the text
415
+ - "Generate word-by-word subtitles for this audio" → `transcribe_audio` → share `word_by_word_srt_url`
416
+ - "Keep the same character across all these images" → `create_visual_dna` → `generate_image` with `visual_dna_ids`
417
+ - "Upload this file to my media library" → `upload_media`
230
418
  - "What video models are available?" → `list_models` (video)
231
419
  - "How many credits do I have?" → `check_credits`
232
420
  - "What's in this image?" (with upload) → describe per the Image Analysis section; no tool call needed unless the user asks to generate or edit