shmakk 1.2.0 → 1.2.2

@@ -0,0 +1,320 @@
1
+ ---
2
+ name: video-compose
3
+ description: Assemble video segments from images, audio, and transitions into a final MP4. Receives the merged outputs from the voice and visual agents and uses video_compose and video_concat to produce the final video file. Part of the video production pipeline.
4
+ category: media
5
+ ---
6
+
7
+ # Video Compositor
8
+
9
+ The final stage of the video production pipeline. You receive the merged outputs from the voice and visual agents — a list of segments each with an image path, audio path, duration, and transition. Your job is to turn these into a single, synchronized MP4 video using `video_compose` and `video_concat`.
10
+
11
+ ## When to use
12
+
13
+ - You receive the merged handoff from the voice and visual agents
14
+ - You are the compositor in the video production pipeline
15
+ - The user asks to assemble, render, or export the final video
16
+
17
+ ## When not to use
18
+
19
+ - The script has not been written yet (wait for the script agent)
20
+ - Audio or images are still being generated (wait for voice and visual agents)
21
+ - The user wants to edit individual assets (that is upstream)
22
+
23
+ ## Input format
24
+
25
+ You receive a merged segments array where each segment has paths to its audio and image assets:
26
+
27
+ ```json
28
+ {
29
+ "segments": [
30
+ {
31
+ "index": 0,
32
+ "durationSec": 3.5,
33
+ "startSec": 0.0,
34
+ "narration": "Your best ideas don't wait for the right moment.",
35
+ "audioPath": "output/voice/segment-0.wav",
36
+ "imagePath": "output/images/segment-0.png",
37
+ "transition": null
38
+ },
39
+ {
40
+ "index": 1,
41
+ "durationSec": 5.0,
42
+ "startSec": 3.5,
43
+ "narration": "They arrive in the shower...",
44
+ "audioPath": "output/voice/segment-1.wav",
45
+ "imagePath": "output/images/segment-1.png",
46
+ "transition": "dissolve"
47
+ }
48
+ ]
49
+ }
50
+ ```
51
+
52
+ ## Tools
53
+
54
+ ### `video_probe`
55
+
56
+ Get metadata about a media file: duration, codec, resolution, frame rate, sample rate.
57
+
58
+ **Parameters:**
59
+
60
+ | Parameter | Type | Required | Description |
61
+ |-----------|------|----------|-------------|
62
+ | `path` | string | Yes | Path to the media file (image, audio, or video) |
63
+
64
+ **Returns:**
65
+
66
+ ```json
67
+ {
68
+ "durationSec": 5.03,
69
+ "width": 1920,
70
+ "height": 1080,
71
+ "codec": "h264",
72
+ "fps": 30,
73
+ "audioCodec": "aac",
74
+ "audioSampleRate": 44100,
75
+ "format": "mp4"
76
+ }
77
+ ```
78
+
79
+ Wraps `ffprobe -v quiet -print_format json -show_format -show_streams <path>`.
80
+
81
+ **When to use:**
82
+ - Before composing: verify all images are consistent resolution. If they vary, the compositor needs to normalize them.
83
+ - Before composing: verify audio durations match expected segment durations. If an audio clip is shorter or longer than the segment, adjust accordingly.
84
+ - After composing: verify the output video has the expected duration and codec.
85
+
86
+ ### `video_compose`
87
+
88
+ Assemble a single video segment from an image, optional audio, and optional transition. Builds the ffmpeg filtergraph from structured parameters.
89
+
90
+ **Parameters:**
91
+
92
+ | Parameter | Type | Required | Description |
93
+ |-----------|------|----------|-------------|
94
+ | `imagePath` | string | Yes | Path to the still image for this segment |
95
+ | `audioPath` | string | No | Path to the WAV audio file. Omit for silent segments. |
96
+ | `durationSec` | number | Yes | Duration this segment should last in seconds |
97
+ | `outputPath` | string | Yes | Where to write the rendered segment (intermediate MP4) |
98
+ | `width` | number | No | Output width in pixels. Default: 1920 |
99
+ | `height` | number | No | Output height in pixels. Default: 1080 |
100
+ | `fps` | number | No | Frame rate. Default: 30 |
101
+ | `transition` | string | No | Transition type to apply at the START of this segment. Options: `"fade-in"`, `"dissolve"`, `"wipe-left"`, `"wipe-right"`, `"wipe-up"`, `"wipe-down"`. Default: none (hard cut). |
102
+ | `transitionDurationSec` | number | No | How long the transition lasts. Default: 0.5 seconds. Must be less than `durationSec`. |
103
+ | `zoomEffect` | string | No | Apply Ken Burns-style slow zoom. Options: `"in"`, `"out"`. Default: none. |
104
+ | `textOverlay` | object | No | Burn text onto the video. Object with `text`, `position` ("top", "bottom", "center"), `fontSize`, `color`. |
105
+ | `codec` | string | No | Video codec. Default: `"libx264"`. |
106
+ | `crf` | number | No | Quality (0-51, lower is better). Default: 23. |
107
+
108
+ **What it does internally:**
109
+
110
+ Builds an ffmpeg command like:
111
+
112
+ ```bash
113
+ ffmpeg -loop 1 -i image.png -i audio.wav \
114
+ -filter_complex "[0:v]scale=1920:1080:force_original_aspect_ratio=decrease,pad=1920:1080:(ow-iw)/2:(oh-ih)/2,format=yuv420p,fade=t=in:d=0.5[v]" \
115
+ -map "[v]" -map 1:a \
116
+ -t 5.0 -r 30 -c:v libx264 -crf 23 -pix_fmt yuv420p -shortest \
117
+ output.mp4
118
+ ```
119
+
120
+ For segments with transitions, it adds the appropriate filter (fade, wipe, etc.) to the filtergraph.
121
+
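As a rough sketch of how structured parameters could be turned into the command shown above, the following Python builds the argument list. The function name `build_compose_args` and its parameter names are hypothetical and are not taken from the package; only the ffmpeg flags and filters that already appear in the example command are used.

```python
from typing import List, Optional


def build_compose_args(
    image_path: str,
    output_path: str,
    duration_sec: float,
    audio_path: Optional[str] = None,
    width: int = 1920,
    height: int = 1080,
    fps: int = 30,
    crf: int = 23,
    fade_in_sec: Optional[float] = None,
) -> List[str]:
    """Return an ffmpeg argument list mirroring the documented example command."""
    # Scale to fit, pad to the exact target size, then fix the pixel format.
    filters = [
        f"scale={width}:{height}:force_original_aspect_ratio=decrease",
        f"pad={width}:{height}:(ow-iw)/2:(oh-ih)/2",
        "format=yuv420p",
    ]
    if fade_in_sec is not None:
        filters.append(f"fade=t=in:d={fade_in_sec}")
    filtergraph = "[0:v]" + ",".join(filters) + "[v]"

    args = ["ffmpeg", "-loop", "1", "-i", image_path]
    if audio_path is not None:
        args += ["-i", audio_path]
    args += ["-filter_complex", filtergraph, "-map", "[v]"]
    if audio_path is not None:
        args += ["-map", "1:a"]
    args += ["-t", str(duration_sec), "-r", str(fps),
             "-c:v", "libx264", "-crf", str(crf), "-pix_fmt", "yuv420p"]
    if audio_path is not None:
        # Only meaningful when there is a second (audio) stream to compare against.
        args.append("-shortest")
    args.append(output_path)
    return args
```

Returning a list rather than a shell string avoids quoting problems when paths contain spaces; the list can be passed directly to `subprocess.run`.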
122
+ **Returns:**
123
+
124
+ ```json
125
+ {
126
+ "outputPath": "output/rendered/segment-0.mp4",
127
+ "durationSec": 5.0,
128
+ "size": 1048576
129
+ }
130
+ ```
131
+
132
+ ### `video_concat`
133
+
134
+ Concatenate rendered segment MP4s into the final video file. Uses the ffmpeg concat demuxer for lossless joining.
135
+
136
+ **Parameters:**
137
+
138
+ | Parameter | Type | Required | Description |
139
+ |-----------|------|----------|-------------|
140
+ | `inputPaths` | string[] | Yes | Ordered list of segment MP4 paths to concatenate |
141
+ | `outputPath` | string | Yes | Where to write the final video |
142
+ | `codec` | string | No | Video codec. Default: `"libx264"`. |
143
+ | `crf` | number | No | Quality. Default: 23. |
144
+ | `addSilentAudio` | boolean | No | If true, adds a silent audio track to segments that have no audio, ensuring consistent stream count. Default: true. |
145
+
146
+ **What it does internally:**
147
+
148
+ 1. Creates a temporary concat file listing all input paths
149
+ 2. Runs: `ffmpeg -f concat -safe 0 -i concat-list.txt -c copy output.mp4`
150
+ 3. If segments have mismatched codecs/resolutions, falls back to re-encoding: `ffmpeg -f concat -safe 0 -i concat-list.txt -c:v libx264 -crf 23 output.mp4`
151
+
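The temporary concat file in step 1 uses the concat demuxer's `file '<path>'` directive, one per line. A minimal sketch of generating that text is below; the helper name `concat_list_text` is hypothetical and this is not necessarily how the tool writes its list.

```python
from typing import List


def concat_list_text(input_paths: List[str]) -> str:
    """Build the contents of a concat demuxer list file, one entry per segment."""
    lines = []
    for path in input_paths:
        # Inside single quotes, a literal quote is written as '\'' (close, escaped quote, reopen).
        escaped = path.replace("'", "'\\''")
        lines.append(f"file '{escaped}'")
    return "\n".join(lines) + "\n"
```

The concat demuxer typically resolves relative paths against the directory of the list file, so writing the list next to the segments or using absolute paths may avoid surprises.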
152
+ **Returns:**
153
+
154
+ ```json
155
+ {
156
+ "outputPath": "output/final.mp4",
157
+ "durationSec": 30.0,
158
+ "size": 5242880
159
+ }
160
+ ```
161
+
162
+ ## Workflow
163
+
164
+ ### Step 1: Receive and validate input
165
+
166
+ Check that the merged segments array:
167
+ - Is non-empty
168
+ - Every segment has at least `imagePath` (audioPath can be null for silent segments)
169
+ - All file paths exist on disk (use `video_probe` or `list_dir` to verify)
170
+ - Durations are consistent with audio clip lengths
171
+
172
+ If validation fails, report which segments are missing assets and ask the user or upstream agents to fix them.
173
+
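The checks in this step could be sketched as follows. This is an illustration only: `find_missing_assets` is a hypothetical helper, and it checks the filesystem directly with `os.path.isfile` rather than going through `video_probe` or `list_dir`.

```python
import os
from typing import Any, Dict, List


def find_missing_assets(segments: List[Dict[str, Any]]) -> List[str]:
    """Return human-readable problems; an empty list means the input looks usable."""
    if not segments:
        return ["segments array is empty"]
    problems: List[str] = []
    for seg in segments:
        idx = seg.get("index")
        image = seg.get("imagePath")
        audio = seg.get("audioPath")
        if not image:
            problems.append(f"segment {idx}: missing imagePath")
        elif not os.path.isfile(image):
            problems.append(f"segment {idx}: image not found: {image}")
        # A null audioPath is allowed (silent segment), so only check paths that are set.
        if audio and not os.path.isfile(audio):
            problems.append(f"segment {idx}: audio not found: {audio}")
    return problems
```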
174
+ ### Step 2: Probe and normalize
175
+
176
+ Run `video_probe` on each image and audio file to collect:
177
+ - Image dimensions (width, height)
178
+ - Audio duration
179
+
180
+ **Normalization decisions:**
181
+
182
+ - **Resolution:** If images vary in size, pick the most common resolution (or user-specified resolution) and note that `video_compose` will scale/pad all images to match.
183
+ - **Audio alignment:** If an audio clip is shorter than the segment `durationSec`, the rendered segment ends when the audio ends, because the ffmpeg command uses `-shortest` (no trailing silence is added). If it is longer, the audio is truncated at `durationSec` by `-t`. Flag segments where audio and planned duration differ by more than 0.5 seconds, since this may indicate a timing issue from the script agent.
184
+ - **File format:** Ensure all images are PNG or JPG and all audio is WAV. If not, convert them using the `run` tool with ffmpeg before composing.
185
+
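Two of the normalization decisions above reduce to small computations over the probe results. The sketch below assumes the probed sizes and durations have already been collected into plain lists; both helper names are hypothetical.

```python
from collections import Counter
from typing import List, Optional, Tuple


def pick_target_resolution(sizes: List[Tuple[int, int]]) -> Tuple[int, int]:
    """Return the most common (width, height) among the probed images."""
    return Counter(sizes).most_common(1)[0][0]


def flag_duration_mismatches(
    planned: List[float],
    probed: List[Optional[float]],
    tolerance_sec: float = 0.5,
) -> List[int]:
    """Return indices of segments whose audio length differs from the plan by more than the tolerance."""
    flagged = []
    for i, (plan, actual) in enumerate(zip(planned, probed)):
        # None means the segment is silent, so there is nothing to compare.
        if actual is not None and abs(plan - actual) > tolerance_sec:
            flagged.append(i)
    return flagged
```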
186
+ ### Step 3: Compose each segment
187
+
188
+ For each segment in order:
189
+
190
+ 1. Determine the effective duration. Use the audio file's `durationSec` from `video_probe` if available, falling back to `durationSec` from the storyboard.
191
+ 2. Call `video_compose` with:
192
+ - `imagePath`: the segment's image
193
+ - `audioPath`: the segment's audio (omit if null)
194
+ - `durationSec`: the effective duration
195
+ - `transition`: convert the storyboard's `transition` field. The storyboard specifies transitions *entering* a segment, so segment 0 gets no transition, segment 1 with `"dissolve"` gets `transition: "dissolve"` with `transitionDurationSec: 0.5`.
196
+ - `outputPath`: `output/rendered/segment-{index}.mp4`
197
+ 3. If the user requested a zoom effect (Ken Burns), add `zoomEffect: "in"` or `zoomEffect: "out"` — alternate between segments for visual variety, or use `"in"` for the first segment and `"out"` for the last.
198
+ 4. If any segment has text overlays (detected from storyboard visualDesc mentioning "text overlay" or "overlay text"), extract the overlay text and position and pass via `textOverlay`.
199
+
200
+ ### Step 4: Concatenate segments
201
+
202
+ Once all segments are rendered:
203
+
204
+ 1. Collect all rendered segment paths in index order
205
+ 2. Call `video_concat` with:
206
+ - `inputPaths`: the ordered list of segment MP4s
207
+ - `outputPath`: `output/final.mp4` (or user-specified name)
208
+ - `addSilentAudio`: `true` (ensures segments without audio still play correctly in the concat)
209
+
210
+ ### Step 5: Verify the output
211
+
212
+ Run `video_probe` on the final video and verify:
213
+ - Duration matches the sum of segment durations (within 0.1 seconds tolerance)
214
+ - Resolution matches the target
215
+ - File size is reasonable (not zero bytes, not absurdly large)
216
+
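A minimal sketch of the duration and resolution checks, assuming the probe result is a dict shaped like the `video_probe` return value documented earlier. The function name `verify_output` is hypothetical, and the file-size check is left out because the probe result shown does not include a size field.

```python
from typing import Any, Dict, List


def verify_output(
    probe: Dict[str, Any],
    segment_durations: List[float],
    width: int = 1920,
    height: int = 1080,
    tolerance_sec: float = 0.1,
) -> List[str]:
    """Compare the final video's probe data against what the segments should add up to."""
    problems: List[str] = []
    expected = sum(segment_durations)
    if abs(probe["durationSec"] - expected) > tolerance_sec:
        problems.append(
            f"duration {probe['durationSec']}s differs from expected {expected}s"
        )
    if (probe["width"], probe["height"]) != (width, height):
        problems.append(
            f"resolution {probe['width']}x{probe['height']} is not {width}x{height}"
        )
    return problems
```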
217
+ ### Step 6: Clean up intermediates
218
+
219
+ By default, keep intermediate rendered segments so the user can inspect or re-use them. If the user asks to clean up, use `delete_file` to remove the `output/rendered/` directory.
220
+
221
+ ### Step 7: Report the result
222
+
223
+ Output a summary:
224
+
225
+ ```
226
+ Video rendered: output/final.mp4
227
+ Duration: 30.0 seconds
228
+ Resolution: 1920x1080
229
+ Size: 5.2 MB
230
+ Segments: 6
231
+ ```
232
+
233
+ Include the final output path so the user knows where to find the video.
234
+
235
+ ## Transition mapping
236
+
237
+ The storyboard uses these transition values. Map them to `video_compose` transitions:
238
+
239
+ | Storyboard transition | `video_compose` transition | ffmpeg filter |
240
+ |-----------------------|---------------------------|---------------|
241
+ | `null` or first segment | none | — |
242
+ | `"cut"` | none | — |
243
+ | `"fade"` | `"fade-in"` | `fade=t=in:d=0.5` |
244
+ | `"dissolve"` | `"dissolve"` | `fade=t=in:d=0.5` (applied over crossfade when concatenating) |
245
+ | `"wipe-left"` | `"wipe-left"` | custom wipe filter |
246
+ | `"wipe-right"` | `"wipe-right"` | custom wipe filter |
247
+ | `"wipe-up"` | `"wipe-up"` | custom wipe filter |
248
+ | `"wipe-down"` | `"wipe-down"` | custom wipe filter |
249
+
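The mapping table can be expressed as a small lookup. This is a sketch, not the package's code; `STORYBOARD_TO_COMPOSE` and `map_transition` are hypothetical names, and `None` stands for a hard cut (the `transition` parameter is omitted).

```python
from typing import Optional

# Storyboard value -> video_compose value, following the table above.
STORYBOARD_TO_COMPOSE = {
    "cut": None,
    "fade": "fade-in",
    "dissolve": "dissolve",
    "wipe-left": "wipe-left",
    "wipe-right": "wipe-right",
    "wipe-up": "wipe-up",
    "wipe-down": "wipe-down",
}


def map_transition(index: int, storyboard_transition: Optional[str]) -> Optional[str]:
    """Return the video_compose transition for a segment, or None for a hard cut."""
    # The first segment never gets a transition, whatever the storyboard says.
    if index == 0 or storyboard_transition is None:
        return None
    if storyboard_transition not in STORYBOARD_TO_COMPOSE:
        raise ValueError(f"unknown storyboard transition: {storyboard_transition!r}")
    return STORYBOARD_TO_COMPOSE[storyboard_transition]
```

Raising on unknown values, rather than silently falling back to a cut, surfaces storyboard typos early.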
250
+ ## Example
251
+
252
+ Given a 6-segment storyboard with merged assets:
253
+
254
+ ```bash
255
+ # Step 2: Probe assets
256
+ video_probe output/images/segment-0.png
257
+ # → { "width": 1920, "height": 1080 }
258
+
259
+ video_probe output/voice/segment-0.wav
260
+ # → { "durationSec": 3.48 }
261
+
262
+ # Step 3: Compose each segment
263
+ video_compose \
264
+ --imagePath output/images/segment-0.png \
265
+ --audioPath output/voice/segment-0.wav \
266
+ --durationSec 3.5 \
267
+ --outputPath output/rendered/segment-0.mp4
268
+
269
+ video_compose \
270
+ --imagePath output/images/segment-1.png \
271
+ --audioPath output/voice/segment-1.wav \
272
+ --durationSec 5.0 \
273
+ --transition dissolve \
274
+ --transitionDurationSec 0.5 \
275
+ --outputPath output/rendered/segment-1.mp4
276
+
277
+ # ... repeat for remaining segments ...
278
+
279
+ # Step 4: Concatenate
280
+ video_concat \
281
+ --inputPaths output/rendered/segment-0.mp4,output/rendered/segment-1.mp4,... \
282
+ --outputPath output/final.mp4
283
+
284
+ # Step 5: Verify
285
+ video_probe output/final.mp4
286
+ # → { "durationSec": 30.0, "width": 1920, "height": 1080, "codec": "h264" }
287
+ ```
288
+
289
+ Note: These are illustrative. Actual tool invocation is through the LLM function call interface, not shell commands.
290
+
291
+ ## Resolution and aspect ratio
292
+
293
+ Unless the user specifies otherwise:
294
+ - **Default resolution:** 1920x1080 (16:9 landscape)
295
+ - **Social media:** 1080x1920 (9:16 vertical for TikTok/Reels/Shorts), 1080x1080 (1:1 square for Instagram)
296
+ - **Presentation:** 1920x1080 (16:9)
297
+
298
+ `video_compose` automatically scales images to fit the target resolution, adding letterbox/pillarbox padding as needed (`force_original_aspect_ratio=decrease` + `pad`). This means images of any aspect ratio produce a clean result without stretching.
299
+
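To make the scale-then-pad behavior concrete, the geometry can be worked out by hand. The sketch below approximates what `force_original_aspect_ratio=decrease` followed by a centered `pad` produces; the helper name `fit_and_pad` is hypothetical, and ffmpeg's own rounding may differ by a pixel in some cases.

```python
from typing import Tuple


def fit_and_pad(iw: int, ih: int, tw: int = 1920, th: int = 1080) -> Tuple[int, int, int, int]:
    """Return (scaled_width, scaled_height, pad_x, pad_y) for fitting an image into a target frame."""
    # Use the smaller ratio so the whole image fits inside the target without stretching.
    scale = min(tw / iw, th / ih)
    sw, sh = round(iw * scale), round(ih * scale)
    # Center the scaled image, matching the (ow-iw)/2 and (oh-ih)/2 pad offsets.
    return sw, sh, (tw - sw) // 2, (th - sh) // 2
```

For example, a 1080x1080 square image placed in a 1920x1080 frame keeps its size and gets 420 pixels of pillarbox padding on each side.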
300
+ ## Budget awareness
301
+
302
+ - `video_compose` costs 1 budget point per segment
303
+ - `video_concat` costs 1 budget point for the final join
304
+ - `video_probe` costs 1 budget point per call (use sparingly — probe a few representative files, not every frame)
305
+ - All three wrap ffmpeg (local), so no API costs
306
+ - A 12-segment video: ~12 compose calls + 1 concat + 2-3 probes = ~16 budget points for compositing alone
307
+
308
+ ## Error recovery
309
+
310
+ - **Segment compose fails:** Check if the image or audio path is valid. Retry once. If it still fails, skip the segment and note it in the output — the concat will have a gap.
311
+ - **Concat fails with codec mismatch:** Re-run `video_concat` with an explicit codec (`codec: "libx264"`) to force re-encoding instead of stream copy.
312
+ - **Audio/video duration mismatch:** ffmpeg's `-shortest` flag in `video_compose` handles this automatically — the segment ends when the shorter stream (usually audio) finishes.
313
+ - **Out of disk space:** Check available space before starting. A 60-second 1080p video at CRF 23 is roughly 30-80 MB. Warn the user if free space is under 200 MB.
314
+
315
+ ## Coordination with upstream agents
316
+
317
+ The compositor runs after both voice and visual agents complete. The merged segments arrive via pipeline handoff. If either upstream agent failed for some segments:
318
+ - Missing audio: render the segment as silent (video only)
319
+ - Missing image: use a black/color background with text overlay showing the segment's narration or index
320
+ - Both missing: render a 2-second black frame with the segment index as text so the user can see what to fix
@@ -0,0 +1,204 @@
1
+ ---
2
+ name: video-script
3
+ description: Turn a user's video description into a timed JSON storyboard with narration text and visual cues per segment. Use when the task is to produce a structured script that will feed downstream voice-over and visual agents.
4
+ category: media
5
+ ---
6
+
7
+ # Video Scripting
8
+
9
+ Convert a user's natural-language video request into a structured, timed JSON storyboard. This is the first stage of the video production pipeline — your output is handed off to the voice and visual agents.
10
+
11
+ ## When to use
12
+
13
+ - User asks to create, produce, or script a video
14
+ - User provides a narrative, product description, tutorial, or explainer that needs a timed storyboard
15
+ - User gives a rough outline and asks for pacing / timing
16
+
17
+ ## When not to use
18
+
19
+ - The user already has a complete, timestamped JSON storyboard
20
+ - The task is editing an existing video (skip to compositor)
21
+ - The user only wants a single image or audio clip (use imagegen or voice directly)
22
+
23
+ ## Output format
24
+
25
+ Produce a JSON object with a `segments` array. Each segment is a temporal slice of the video:
26
+
27
+ ```json
28
+ {
29
+ "segments": [
30
+ {
31
+ "index": 0,
32
+ "durationSec": 5.0,
33
+ "startSec": 0.0,
34
+ "narration": "Text for the voice-over to speak during this segment.",
35
+ "visualDesc": "Concise, keyword-rich image prompt describing what appears on screen. Include style, mood, composition, and color cues.",
36
+ "transition": null
37
+ }
38
+ ]
39
+ }
40
+ ```
41
+
42
+ ### Segment field reference
43
+
44
+ | Field | Type | Required | Notes |
45
+ |-------|------|----------|-------|
46
+ | `index` | integer | Yes | Zero-based segment number |
47
+ | `durationSec` | number | Yes | Duration in seconds. Minimum 2.0, maximum 30.0. Must match the time the narration needs at a comfortable speaking pace (~150 words/min). |
48
+ | `startSec` | number | Yes | Cumulative start time from beginning of video |
49
+ | `narration` | string | Yes | Exact text the voice agent will speak. Keep each segment under 75 words. Use natural sentence breaks. |
50
+ | `visualDesc` | string | Yes | Visual description for the image generator. Include subject, scene, style, lighting, color palette, mood, and composition. Be specific enough that the image agent needs no clarification. |
51
+ | `transition` | string | No | How to transition INTO this segment. First segment should be `null`. Options: `"cut"`, `"fade"`, `"dissolve"`, `"wipe-left"`, `"wipe-right"`, `"wipe-up"`, `"wipe-down"`. Defaults to `"cut"`. |
52
+
53
+ ## Workflow
54
+
55
+ ### Step 1: Understand the user's intent
56
+
57
+ Ask yourself:
58
+ - What is the video's purpose? (demo, explainer, ad, tutorial, social media post)
59
+ - What tone does the user want? (professional, casual, energetic, calm)
60
+ - What is the target duration? If not specified, ask or infer from scope
61
+ - Is there a specific visual style? (photorealistic, 2D illustration, 3D render, flat UI mockups)
62
+ - Any brand colors, logos, or recurring motifs?
63
+
64
+ If the user provides incomplete information, ask clarifying questions before generating the storyboard. Getting the intent right here prevents rework downstream.
65
+
66
+ ### Step 2: Structure the narrative arc
67
+
68
+ Map the video into a narrative flow:
69
+
70
+ 1. **Hook** (first 3-5 seconds): capture attention
71
+ 2. **Setup** (~15-20% of total): introduce the problem or context
72
+ 3. **Solution/body** (~50-60% of total): the core content
73
+ 4. **Payoff** (~15-20% of total): show the result or benefit
74
+ 5. **Call to action** (last 3-5 seconds): what should the viewer do next
75
+
76
+ For very short videos (< 15 seconds), collapse this into hook → body → CTA.
77
+
78
+ ### Step 3: Allocate time
79
+
80
+ - Total duration should match the user's request (default: 60 seconds if unspecified)
81
+ - Each segment duration = time needed to speak its narration at ~150 words/minute
82
+ - Round segment durations to one decimal place
83
+ - The sum of all `durationSec` values must equal the total video duration
84
+ - `startSec` must be a running total: segment N's startSec = sum of durations of segments 0 through N-1
85
+
86
+ Example timing calculation:
87
+ - Narration: "Welcome to our new productivity dashboard" (6 words)
88
+ - At 150 words/min = 2.5 words/sec → 6 words / 2.5 = 2.4 seconds
89
+ - Round up to 3.0 seconds minimum
90
+
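The timing rule can be sketched as a small function. This is an illustration under assumptions: `estimate_duration_sec` is a hypothetical name, words are counted by splitting on whitespace, and the 2.0 and 30.0 second bounds are taken from the segment field reference.

```python
def estimate_duration_sec(
    narration: str,
    words_per_min: float = 150.0,
    min_sec: float = 2.0,
    max_sec: float = 30.0,
) -> float:
    """Estimate how long a narration takes to speak, clamped and rounded to one decimal."""
    words = len(narration.split())
    seconds = words * 60.0 / words_per_min
    # Clamp to the allowed segment range before rounding.
    return round(min(max(seconds, min_sec), max_sec), 1)


# 9 words at 150 words/min -> 3.6 seconds
print(estimate_duration_sec("Your best ideas don't wait for the right moment."))
```

An empty narration clamps to the minimum, which matches the guidance to still allocate time for visual pacing in segments without voice-over.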
91
+ ### Step 4: Write narration and visuals
92
+
93
+ For each segment:
94
+
95
+ **Narration rules:**
96
+ - Write natural, spoken language — not essay prose
97
+ - Each segment should be one or two complete sentences
98
+ - Avoid words that are hard to synthesize (uncommon acronyms, special symbols)
99
+ - Break at natural pause points between segments
100
+ - Maximum 75 words per segment
101
+
102
+ **Visual description rules:**
103
+ - Be specific: "A modern glass-walled office interior, natural daylight streaming through floor-to-ceiling windows, warm oak desk in foreground, minimalist decor, shallow depth of field, 4K cinematic still" — not "an office"
104
+ - Include composition cues: wide shot, close-up, overhead, split-screen
105
+ - Include mood/lighting: golden hour, moody shadows, bright and clean, neon-lit
106
+ - Include color direction: muted earth tones, vibrant neon palette, monochrome blue
107
+ - If text overlays are needed (titles, labels), explicitly include them in visualDesc: "Overlay text in bottom third: 'Introducing Dashboard v3'"
108
+ - Maintain visual consistency — all segments should feel like they belong to the same video
109
+
110
+ ### Step 5: Assign transitions
111
+
112
+ - First segment: `null` (no transition into the opening)
113
+ - Between segments of the same scene/topic: `"cut"`
114
+ - For scene changes or time jumps: `"fade"` or `"dissolve"`
115
+ - For directional movement (before/after, left/right comparison): `"wipe-left"` or `"wipe-right"`
116
+ - Use sparingly. Most segments should use `"cut"` unless the transition carries meaning.
117
+
118
+ ### Step 6: Validate the storyboard
119
+
120
+ Before output, check:
121
+ 1. Sum of `durationSec` equals the requested total duration
122
+ 2. All `startSec` values are the correct running totals
123
+ 3. Every `narration` is under 75 words
124
+ 4. Every `visualDesc` is sufficiently specific (25+ characters, includes style/mood cues)
125
+ 5. Transitions are appropriate for the narrative flow
126
+ 6. Indices are consecutive and zero-based
127
+
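Checks 1 through 4 and 6 are mechanical and could be automated along these lines. The function name `validate_storyboard` is hypothetical, the word limit is read as "at most 75", and check 5 (whether transitions suit the narrative) is left to judgment.

```python
from typing import Any, Dict, List


def validate_storyboard(
    segments: List[Dict[str, Any]],
    total_sec: float,
    tolerance_sec: float = 0.05,
) -> List[str]:
    """Return a list of problems found in a storyboard; empty means it passed."""
    problems: List[str] = []
    running = 0.0
    for pos, seg in enumerate(segments):
        if seg["index"] != pos:
            problems.append(f"position {pos}: index is {seg['index']}")
        if abs(seg["startSec"] - running) > tolerance_sec:
            problems.append(f"segment {pos}: startSec {seg['startSec']} should be {running}")
        if len(seg["narration"].split()) > 75:
            problems.append(f"segment {pos}: narration exceeds 75 words")
        if len(seg["visualDesc"]) < 25:
            problems.append(f"segment {pos}: visualDesc is too short")
        running += seg["durationSec"]
    if abs(running - total_sec) > tolerance_sec:
        problems.append(f"durations sum to {running}, expected {total_sec}")
    return problems
```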
128
+ ### Step 7: Output
129
+
130
+ Output the complete JSON as the handoff payload. Include a brief summary comment before the JSON explaining the video's narrative arc and total duration. The downstream agents expect this exact JSON structure.
131
+
132
+ ## Example
133
+
134
+ User: "Create a 30-second product teaser for a new note-taking app called Scribble"
135
+
136
+ ```json
137
+ {
138
+ "segments": [
139
+ {
140
+ "index": 0,
141
+ "durationSec": 3.5,
142
+ "startSec": 0.0,
143
+ "narration": "Your best ideas don't wait for the right moment.",
144
+ "visualDesc": "Cinematic close-up of a person staring at a blank notebook page, soft morning light, shallow depth of field, muted warm tones, contemplative mood",
145
+ "transition": null
146
+ },
147
+ {
148
+ "index": 1,
149
+ "durationSec": 5.0,
150
+ "startSec": 3.5,
151
+ "narration": "They arrive in the shower, on a walk, or right before you fall asleep.",
152
+ "visualDesc": "Montage split into three panels: shower steam silhouette, tree-lined walking path at golden hour, bedroom with nightstand clock showing 2:47 AM. Soft transitions between panels. Warm consistent palette.",
153
+ "transition": "dissolve"
154
+ },
155
+ {
156
+ "index": 2,
157
+ "durationSec": 6.0,
158
+ "startSec": 8.5,
159
+ "narration": "Scribble captures them instantly. One tap, speak your thought, and it is saved, organized, and searchable forever.",
160
+ "visualDesc": "Phone screen mockup showing the Scribble app interface. Hand taps the record button, voice waveform animates, the transcribed note appears organized in a clean list. Modern flat UI design, navy and coral color scheme, bright clean lighting.",
161
+ "transition": "cut"
162
+ },
163
+ {
164
+ "index": 3,
165
+ "durationSec": 5.0,
166
+ "startSec": 14.5,
167
+ "narration": "No folders. No friction. Just your mind, unblocked.",
168
+ "visualDesc": "Overhead shot of a desk with a phone displaying Scribble app, a coffee cup, and a plant. The phone glows softly. Clean minimalist composition, natural daylight, warm wood textures.",
169
+ "transition": "dissolve"
170
+ },
171
+ {
172
+ "index": 4,
173
+ "durationSec": 4.5,
174
+ "startSec": 19.5,
175
+ "narration": "Your ideas deserve better than a forgotten notes app. Try Scribble today.",
176
+ "visualDesc": "App icon centered on screen with the Scribble logo, coral gradient background. Text overlay bottom third: 'Available on iOS and Android'. Clean product-hero composition, bright and inviting.",
177
+ "transition": "fade"
178
+ },
179
+ {
180
+ "index": 5,
181
+ "durationSec": 6.0,
182
+ "startSec": 24.0,
183
+ "narration": "",
184
+ "visualDesc": "Scribble logo on clean white background with tagline 'Capture everything.' centered below. Fade to black at end.",
185
+ "transition": "cut"
186
+ }
187
+ ]
188
+ }
189
+ ```
190
+
191
+ ## Edge cases
192
+
193
+ - **Very short video (< 10 seconds):** Limit to 2-3 segments. Combine hook and CTA into the same segment.
194
+ - **No voice-over (music-only):** Set `narration` to empty string `""` for all segments. Still allocate time for the visual pacing.
195
+ - **Multiple speakers / dialogue:** Note the speaker in the narration field like `"[Interviewer]: Tell us about your process."` and `"[Speaker]: I start with research."` — the voice agent can assign different voices to different speaker labels.
196
+ - **Text-heavy video (titles, captions only):** Include all text in `visualDesc` as overlay instructions. Set `narration` to empty if there is no spoken audio.
197
+
198
+ ## Budget awareness
199
+
200
+ This agent is a scripting role — it does not consume budget for image generation or TTS. It only costs the LLM call. However, be mindful:
201
+ - More segments = more downstream tool calls (each segment triggers at least one `image_gen` and one `tts_generate` call)
202
+ - For a 60-second video, aim for 6-12 segments
203
+ - For a 30-second video, aim for 4-8 segments
204
+ - Each segment adds ~2 budget points downstream (image + voice), so keep the segment count proportional to the video length