npm - @koda-sl/baker-cli - Versions diffs - 0.94.0 → 0.95.0 - Mend

@koda-sl/baker-cli 0.94.0 → 0.95.0

Files changed (4)

package/README.md CHANGED

@@ -3653,7 +3653,7 @@ It then scaffolds the full pipeline like an **editing timeline**: each clip gets
 **Montage flashes held as stills.** A rapid-cut beat shorter than ~2s with no spoken line is a **flash** — Seedance's shortest clip is 4s, so generating one (then trimming away most of it) burns credits for motion no viewer perceives. The scaffold instead **holds one keyframe as a still** for the scene length (a cheap ffmpeg loop, no billed `video_generate`), same look at a fraction of the cost. Talking/ambient beats keep a real clip (they need motion + native audio).
-**The phrase model (voice cut at pauses, not at visual cuts).** The voice is grouped into **phrases** — runs of continuous speech with no real pause, which may span several visual scenes. A phrase is voiced ONCE (so a sentence the deconstruct split at a visual cut never breaks mid-word): if the speaker is **shown** anywhere in the phrase it's a single Seedance clip (`s<anchor>_clip`, native lip-sync + audio) re-voiced to the brand voice; if the speaker is **never shown** it's one ElevenLabs `tts` read. The picture is then assembled **scene by scene**: a scene that shows the speaker **slices its window** out of the phrase clip (`s<i>_seg`, an ffmpeg `-ss`/`-t` cut — video and audio come from the *same* clip, so lip-sync holds), and a **b-roll cutaway** gets its own silent clip while the phrase's voice plays underneath. "Shown" is decided by the **presenter element's per-scene presence**, not just who's speaking — a scene where a cast member narrates over b-roll (their element absent) is treated as a cutaway, so the talking head never appears where the original cut away. A presenter run longer than the 15s Seedance ceiling **splits at a scene boundary** into contiguous takes (each its own clip + convert), so a sliced window never reads past its clip. A b-roll cutaway *inside* a phrase lands at an **approximate** time (Seedance exposes no word timing) — nudge the scene boundary if it's off its beat.
+**The phrase model (voice cut at pauses, not at visual cuts).** The voice is grouped into **phrases** — runs of continuous speech with no real pause, which may span several visual scenes. A phrase is voiced ONCE (so a sentence the deconstruct split at a visual cut never breaks mid-word): if the speaker is **shown** anywhere in the phrase it's a single Seedance clip (`s<anchor>_clip`, native lip-sync + audio) re-voiced to the brand voice; if the speaker is **never shown** it's one ElevenLabs `tts` read. The picture is then assembled **scene by scene**: a scene that shows the speaker **slices its window** out of the phrase clip (`s<i>_seg`, an ffmpeg `-ss`/`-t` cut — video and audio come from the *same* clip, so lip-sync holds), and a **b-roll cutaway** gets its own silent clip while the phrase's voice plays underneath. "Shown" is decided by the **presenter element's per-scene presence**, not just who's speaking — a scene where a cast member narrates over b-roll (their element absent) is treated as a cutaway, so the talking head never appears where the original cut away. A presenter run longer than the **gateway-safe ~10s clip ceiling splits at a scene boundary** into contiguous takes (each its own clip + convert), so a sliced window never reads past its clip. (Seedance's *API* max is 15s, but the generation gateway frequently times out — **HTTP 524** — before it can deliver a clip longer than ~10s, so the scaffold never asks for one that long; 10s is a Seedance-allowed duration, so the split clip still snaps cleanly.) A b-roll cutaway *inside* a phrase lands at an **approximate** time (Seedance exposes no word timing) — nudge the scene boundary if it's off its beat.
 **A starting point, not a locked render.** The canvas mirrors the reference's structure to give you a faithful scaffold, but `metadata.todo.full_flexibility` makes explicit that the agent has **full editing freedom**: add / delete / reorder / split / merge scenes, re-prompt any frame or motion brief, change a scene's layout (full-frame ↔ composite), or rewrite any line — the content-addressed cache re-bills only what changes, and `baker canvas validate` re-checks timing/lip-sync after any edit.
@@ -3663,7 +3663,7 @@ It then scaffolds the full pipeline like an **editing timeline**: each clip gets
 **Same-shot lip-sync caution.** A single held shot can carry only ONE lip-synced clip (voiceover turns must not overlap, and Seedance generates one clip per shot), so when the on-camera speaker has further turns in that shot (a rapid "3000? … 4000?" with an off-camera "Plus" between), the first turn is native and the rest play as `tts` over the same clip — where the mouth no longer matches those words. This is inherent to reproducing sparse same-shot dialogue, not a wiring fault; the scaffold lists the affected scenes/lines in **`metadata.video.lip_sync_caution`** (advisory, never gated) so you can cut away to b-roll over those lines or rely on the burned-in captions that already show them.
-**Timing-faithful clip + extract (no overlap).** Each phrase clip is generated to its **coverage window** (the deconstruct's real scene/line timing, capped at 15s) and its converted voice is extracted to the **spoken window** (pause to pause) — *not* padded to a word-count estimate. Padding past the window was what ran the voice the clip's whole length and overlapped the next phrase; trusting the deconstruct's timing keeps consecutive phrases back-to-back and lets Seedance pace the quoted text to fit. `metadata.video.talking_scenes` still records each phrase's `scene_s` vs `est_speech_s` so you can spot a line whose words overrun its window and widen it by hand.
+**Timing-faithful clip + extract (no overlap).** Each phrase clip is generated to its **coverage window** (the deconstruct's real scene/line timing, capped at the gateway-safe ~10s ceiling) and its converted voice is extracted to the **spoken window** (pause to pause) — *not* padded to a word-count estimate. Padding past the window was what ran the voice the clip's whole length and overlapped the next phrase; trusting the deconstruct's timing keeps consecutive phrases back-to-back and lets Seedance pace the quoted text to fit. `metadata.video.talking_scenes` records each phrase's `scene_s` vs `est_speech_s`; on top of that the scaffold flags any scene whose estimated speech overruns its window by more than ~1.3× as **`metadata.todo.overstuffed_scenes`** (also in the stdout checklist) — a loud advisory to shorten the copy or lengthen the scene before rendering, since an over-stuffed line pushes the picture off the audio timeline. It similarly flags **`oversize_scenes`** — a single scene whose own footage exceeds the gateway-safe ~10s clip ceiling (a b-roll shot or one-shot monologue). The phrase splitter only breaks at scene boundaries, so it can't shrink a single over-long scene; its clip would 524 at the gateway, so the advisory tells you to split that scene into two before rendering.
 **Timeline-accurate picture.** Seedance can't render under 4s, so each clip is generated at the smallest allowed duration ≥ the scene length and then **trimmed back to the exact scene duration** before concat. This keeps the concatenated picture on the same timeline as the absolute-timed audio — without it, short scenes balloon to 4s, the spine runs far longer than the soundtrack, and every line plays over the wrong (slowed) scene so the lips never match. Frames are also prompted as **clean text-free plates** (no baked captions/lower-thirds/tickers/logos-as-text) so the overlay layer is the single source of on-screen text.
@@ -3701,7 +3701,7 @@ baker canvas run ./reference-ad.video.canvas.json
 Each scene is captured in a **shoot mode** — `ugc_selfie` (talking heads, the default look), `ugc_broll`, `studio_product` (pack shot), `lifestyle_cinematic`, or `screen_ui`. The scaffold derives one per scene (UGC by default; the cinematic and screen lanes are opt-in) and bakes its capture block into the frame and a camera default into the clip; override per scene with a `shoot_mode` field in `prompt.json`. Capture aesthetic + depth-of-field follow the mode (UGC stays flat; studio/lifestyle allow shallow DoF). Clips also carry **diegetic native audio** — the scene's own ambience described in the Seedance prompt, never music (the music bed is a separate, ducked track); set a scene's `ambient` field to steer it.
-**Automatic by default (no flags).** Every recast **base element — person, pet, product, AND location/set** — is fused into ONE rich multi-view sheet (`image_reference_sheet`, one subject per sheet, **4K**, up to 8 cells) that every frame it appears in grounds on, so the same face/pet/pack/room is rendered from a multi-angle canvas instead of a lone flat snapshot (a one-scene hero element is sheeted too). Each sheet pairs a **full turnaround** (angles, for proportions/wardrobe/layout) with tight **close-ups** so the generator is prepared for ANY framing a scene needs: a **person** gets body cells + face close-ups (front/¾/profile) and a mid-sentence speaking expression (identity pinned, natural skin — no airbrushing); an **animal** gets a body turnaround + head close-ups + an eyes/face macro; a **product** gets a turnaround + label and material detail macros; a **location/set** gets several camera angles of the same room + a key-surface detail. Generated clips are pinned to **1080p** (see `--resolution`) so the video keeps the keyframe's sharpness, and each cast frame keeps the source frame as a **composition anchor** (identity stays on the sheet) so the original framing/camera is reproduced, not re-guessed. An **app/website/chat screen** is never sent to the video model — the scaffold drops the scene to a clean talking-head and seeds a phone-mockup PIP stub to fill with a real `baker images screenshot` or brand HTML block (Seedance garbles UI and a split leaves a seam). The **music bed is instrumental** (the script is never fed to the music model — it would sing over the voice), enters only after the hook, and is **sidechain-ducked** under the voice. **Word-synced TikTok captions** are wired off the deconstruct transcript whenever the ad has speech. Seeded overlays are pushed **off the subject's face** (dead-center → bottom band).
+**Automatic by default (no flags).** Every recast **base element — person, pet, product, AND location/set** — is fused into ONE rich multi-view sheet (`image_reference_sheet`, one subject per sheet, **4K**, up to 8 cells) that every frame it appears in grounds on, so the same face/pet/pack/room is rendered from a multi-angle canvas instead of a lone flat snapshot (a one-scene hero element is sheeted too). Each sheet pairs a **full turnaround** (angles, for proportions/wardrobe/layout) with tight **close-ups** so the generator is prepared for ANY framing a scene needs: a **person** gets body cells + face close-ups (front/¾/profile) and a mid-sentence speaking expression (identity pinned, natural skin — no airbrushing); an **animal** gets a body turnaround + head close-ups + an eyes/face macro; a **product** gets a turnaround + label and material detail macros; a **location/set** gets several camera angles of the same room + a key-surface detail. Generated clips are pinned to **1080p** (see `--resolution`) so the video keeps the keyframe's sharpness, and each cast frame keeps the source frame as a **composition anchor** (identity stays on the sheet) so the original framing/camera is reproduced, not re-guessed. An **app/website/chat screen** is never sent to the video model — the scaffold drops the scene to a clean talking-head and seeds a phone-mockup PIP stub to fill with a real `baker images screenshot` or brand HTML block (Seedance garbles UI and a split leaves a seam). The **music bed is instrumental** (the script is never fed to the music model — it would sing over the voice), enters only after the hook, and is **sidechain-ducked** under the voice. **Word-synced TikTok captions** are wired whenever the ad has speech — and they are **transcribed from the rendered audio** (a `video_transcribe` of the actual voice mix), not the deconstruct's original transcript. This is a correctness boundary: wiring the source transcript would burn the **competitor's** words (their brand name, a claim we can't make) over the ad once the script is re-authored, whereas transcribing the generated audio can only ever show what is actually spoken, so the captions always track the re-written lines. Seeded overlays are pushed **off the subject's face** (dead-center → bottom band).
 The two scaffold passes are billed (the full `video_deconstruct` is the heavy one); **running** the result then generates many image/video/audio assets and is not free. Defaults to vertical 1080×1920 overlays — copy + edit the composition for other aspect ratios. For on-brand overlay type, drop `brand-bold.otf`/`brand-regular.otf` into the copied `video-overlay-composition/` dir (wired via `@font-face`, with a system fallback). Richer transcription (punctuated words + paragraphs) is available via the deconstruct's `transcriber: "deepgram"` param when `DEEPGRAM_API_KEY` is set.
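
The gateway-safe split described in the README change above can be sketched as follows. This is a minimal illustration, not the package's code: `splitRunAtSceneBoundaries` and the plain array of scene durations are hypothetical. Only the ~10s ceiling and the rule that a run breaks at scene boundaries come from the README text.

```javascript
// Hypothetical sketch: group consecutive scene durations into takes whose
// total stays within a ceiling, breaking only at scene boundaries.
const SAFE_MAX_S = 10; // gateway-safe ceiling named in the README change

function splitRunAtSceneBoundaries(sceneDurations, maxS = SAFE_MAX_S) {
  const takes = [];
  let current = [];
  let total = 0;
  for (const d of sceneDurations) {
    // Start a new take when adding this scene would cross the ceiling.
    if (current.length > 0 && total + d > maxS) {
      takes.push(current);
      current = [];
      total = 0;
    }
    current.push(d);
    total += d;
  }
  if (current.length > 0) takes.push(current);
  return takes;
}

console.log(JSON.stringify(splitRunAtSceneBoundaries([4, 5, 3, 6])));
// [[4,5],[3,6]]
console.log(JSON.stringify(splitRunAtSceneBoundaries([12, 3])));
// [[12],[3]]
```

In the second call the 12s scene ends up alone in a take that is still over the ceiling. A boundary-only splitter cannot shrink a single scene, which is the case the README says `oversize_scenes` is meant to flag.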

package/dist/cli.js CHANGED

@@ -9096,7 +9096,8 @@ function buildElementSlots(elements) {
     "spine",
     "overlaid",
     "captions",
-    "captions_transcript",
+    "captions_premux",
+    "captions_transcribe",
     "audio_mix",
     "final",
     "music_bed"
@@ -9762,7 +9763,8 @@ function makePresenterPresent(slots, canonical, opts = {}) {
   };
 }
 var PAUSE_GAP_S = 0.6;
-var PHRASE_MAX_S = SEEDANCE_DURATIONS[SEEDANCE_DURATIONS.length - 1];
+var SEEDANCE_SAFE_MAX_S = SEEDANCE_DURATIONS.find((d) => d >= 10) ?? 10;
+var PHRASE_MAX_S = SEEDANCE_SAFE_MAX_S;
 function collapseVoiceover(blueprint) {
   const casts = castIdSet(blueprint);
   const cameraOn = onCameraDialogue(blueprint);
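
The new ceiling in the hunk above is derived from the duration list rather than hard-coded. A minimal sketch of what the two expressions evaluate to, assuming `SEEDANCE_DURATIONS` is an ascending list of allowed clip lengths. The list itself is not shown in this diff, so the values below are placeholders chosen to match the 4s minimum and 15s maximum the README mentions.

```javascript
// ASSUMPTION: placeholder list; the real SEEDANCE_DURATIONS is not in this diff.
const SEEDANCE_DURATIONS = [4, 5, 6, 8, 10, 12, 15];

// 0.94.0: the last (largest) allowed duration.
const oldPhraseMax = SEEDANCE_DURATIONS[SEEDANCE_DURATIONS.length - 1];

// 0.95.0: the first allowed duration that is at least 10s,
// falling back to 10 if the list has no such entry.
const safeMax = SEEDANCE_DURATIONS.find((d) => d >= 10) ?? 10;

console.log(oldPhraseMax, safeMax); // 15 10
```

Because `find` returns the first match, the result depends on the list being in ascending order. If the list had no entry equal to 10, the expression would return the next larger entry instead, so the ceiling would be above 10s.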
@@ -10564,22 +10566,8 @@ function scaffoldVideoCanvas(input, elementsInput, opts) {
     videoRef = "$ref:overlaid.video";
     videoNode = "overlaid";
   }
-  if (opts.captionsCompositionPath && opts.transcriptPath) {
-    nodes.push({
-      id: "captions_transcript",
-      type: "ingest",
-      params: { source: "path", path: opts.transcriptPath, expect: "json" }
-    });
-    nodes.push({
-      id: "captions",
-      type: "hyperframe_render",
-      inputs: { background: videoRef, transcript: "$ref:captions_transcript.asset" },
-      params: { composition: opts.captionsCompositionPath }
-    });
-    videoRef = "$ref:captions.video";
-    videoNode = "captions";
-  }
   const tracks = [...voTracks, ...buildSfxMusic(blueprint, nodes)];
+  let audioMixRef;
   if (tracks.length > 0) {
     const mixInputs = {};
     for (const t of tracks) mixInputs[t.slot] = t.ref;
@@ -10600,10 +10588,54 @@ function scaffoldVideoCanvas(input, elementsInput, opts) {
         ...duck
       }
     });
+    audioMixRef = "$ref:audio_mix.audio";
+  }
+  if (opts.captionsCompositionPath && audioMixRef) {
+    nodes.push({
+      id: "captions_premux",
+      type: "ffmpeg",
+      inputs: { video: videoRef, audio: audioMixRef },
+      params: {
+        args: [
+          "-i",
+          "{{in.video}}",
+          "-i",
+          "{{in.audio}}",
+          "-map",
+          "0:v:0",
+          "-map",
+          "1:a:0",
+          "-c:v",
+          "copy",
+          "-c:a",
+          "aac",
+          "-shortest",
+          "{{out.video}}"
+        ],
+        outputs: { video: { kind: "video", ext: "mp4" } }
+      }
+    });
+    const captionLanguage = deriveTtsLanguageCode(blueprint);
+    nodes.push({
+      id: "captions_transcribe",
+      type: "video_transcribe",
+      inputs: { video: "$ref:captions_premux.video" },
+      params: captionLanguage ? { language: captionLanguage } : {}
+    });
+    nodes.push({
+      id: "captions",
+      type: "hyperframe_render",
+      inputs: { background: videoRef, transcript: "$ref:captions_transcribe.transcript" },
+      params: { composition: opts.captionsCompositionPath }
+    });
+    videoRef = "$ref:captions.video";
+    videoNode = "captions";
+  }
+  if (audioMixRef) {
     nodes.push({
       id: "final",
       type: "ffmpeg",
-      inputs: { video: videoRef, audio: "$ref:audio_mix.audio" },
+      inputs: { video: videoRef, audio: audioMixRef },
       params: {
         args: [
           "-i",
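
The caption wiring added in the hunk above amounts to a three-node chain. The sketch below is a simplified illustration, not the scaffold's code: `buildCaptionChain` is a hypothetical helper and the ffmpeg args are omitted. The node ids, node types and `$ref` strings are copied from the hunk.

```javascript
// Hypothetical helper mirroring the node order added in the hunk.
function buildCaptionChain(videoRef, audioMixRef, compositionPath, language) {
  // Without a composition or an audio mix, no caption nodes are emitted.
  if (!compositionPath || !audioMixRef) return { nodes: [], videoRef };
  const nodes = [
    // 1. Mux the picture with the voice mix so there is audio to transcribe.
    { id: "captions_premux", type: "ffmpeg", inputs: { video: videoRef, audio: audioMixRef } },
    // 2. Transcribe the rendered audio, not the reference ad's transcript.
    {
      id: "captions_transcribe",
      type: "video_transcribe",
      inputs: { video: "$ref:captions_premux.video" },
      params: language ? { language } : {}
    },
    // 3. Render captions over the original picture using that transcript.
    {
      id: "captions",
      type: "hyperframe_render",
      inputs: { background: videoRef, transcript: "$ref:captions_transcribe.transcript" },
      params: { composition: compositionPath }
    }
  ];
  return { nodes, videoRef: "$ref:captions.video" };
}

const chain = buildCaptionChain("$ref:overlaid.video", "$ref:audio_mix.audio", "./captions", "en");
console.log(chain.nodes.map((n) => n.id).join(" -> "));
// captions_premux -> captions_transcribe -> captions
```

The captions node still takes the un-muxed picture as its background. The premuxed file exists only to give the transcriber audio to work from. The guard on `audioMixRef` also means a canvas with no audio tracks gets no caption nodes at all.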
@@ -10845,14 +10877,36 @@ function buildVideoTodo(report, overlayCount, floatingCount, opts, blueprint) {
     // "none detected" — re-watch the reference and fill the gaps with the right tool.
     completeness_check: 'The scaffold mirrors the deconstruct\'s catalog, which UNDER-DETECTS \u2014 never trust a 0 count. Re-watch the reference frame-by-frame and add anything missing: (1) ON-IMAGE GRAPHICS not in floating_elements (dollar/coin icons, emojis, checkmarks, rating stars, price tags, arrows, progress bars, app UI) \u2192 source each with `baker images icon "<desc>"` / `baker images sticker` / `baker images gif` / `baker images logo <domain>` and add it as an <img class="ov pos-* " data-start data-dur> in video-overlay-composition/index.html (NEVER bake graphics into the frame plates). (2) SOUND CUES not in sound_effects (cha-ching/coin, whoosh, ding, pop, notification, keyboard) \u2192 add a `sound_effect` node (eleven_text_to_sound) and wire it onto `audio_mix` at its timestamp. (3) RECURRING people/animals/products/logos/sets with no el_* slot \u2192 add an `ingest` [TODO] slot and reference it from the frames they appear in. (4) Burned-in captions/text not in text_overlays \u2192 add an <img>-free <div class="ov"> in index.html. (5) ONE person playing MULTIPLE personas/wardrobes (skeptic vs believer, before vs after, two outfits) collapsed into a single el_* slot \u2192 split into one el_* slot PER look, each linked as the SAME individual via `same_as` so every outfit has its own reference image but the face/identity stays identical.',
     scenes_clamped_to_15s: report.clamped_scenes,
+    oversize_scenes: report.oversize_scenes.length > 0 ? {
+      scenes: report.oversize_scenes,
+      fix: "Each listed scene's own footage is longer than the gateway-safe ~10s clip ceiling, so its clip will fail the gateway (HTTP 524) \u2014 and the scaffold can't auto-split a single scene. Split each into two scenes in prompt.json (scene N+1's start frame = scene N's end frame, so the action stays continuous) before rendering."
+    } : "none \u2014 every scene fits within the gateway-safe clip ceiling.",
+    overstuffed_scenes: report.overstuffed_scenes.length > 0 ? {
+      scenes: report.overstuffed_scenes,
+      fix: "Each listed scene's estimated speech runs well past its window \u2014 the line will overrun the scene and push the picture off the audio timeline (lips fall behind). Shorten the copy in prompt.json or lengthen the scene before rendering."
+    } : "none \u2014 every scene's spoken length fits its window.",
     run_warning: "`baker canvas run` generates many billed image/video/audio assets \u2014 validate first, it is not free."
   };
 }
+var OVERSTUFF_RATIO = 1.3;
+function collectClipAdvisories(scene, i, out) {
+  const round22 = (n) => Math.round(n * 100) / 100;
+  const original = scene.duration_s ?? 5;
+  if (original > 15) out.clamped.push({ scene: i, original_s: original, clip_s: snapToSeedance(original) });
+  const window = sceneDurationS(scene);
+  if (window > SEEDANCE_SAFE_MAX_S)
+    out.oversize.push({ scene: i, scene_s: round22(window), clip_s: ceilToSeedance(window) });
+  const speech = (scene.dialogue ?? []).reduce((s, line) => s + (line.line ? estSpeechS(line.line) : 0), 0);
+  if (speech > window * OVERSTUFF_RATIO)
+    out.overstuffed.push({ scene: i, scene_s: round22(window), est_speech_s: round22(speech) });
+}
 function videoReport(input, elementsInput) {
   const blueprint = VideoBlueprint.parse(input);
   const elements = RecurringElements.parse(elementsInput);
   const dialogue = [];
   const clamped = [];
+  const oversize = [];
+  const overstuffed = [];
   let sfxCount = 0;
   let overlayCount = 0;
   blueprint.scenes.forEach((scene, i) => {
@@ -10868,9 +10922,7 @@ function videoReport(input, elementsInput) {
     }
     sfxCount += (scene.sfx ?? []).length;
     overlayCount += (scene.overlays ?? []).length;
-    const original = scene.duration_s ?? 5;
-    const clip = snapToSeedance(original);
-    if (original > 15) clamped.push({ scene: i, original_s: original, clip_s: clip });
+    collectClipAdvisories(scene, i, { clamped, oversize, overstuffed });
   });
   return {
     scene_count: blueprint.scenes.length,
@@ -10885,6 +10937,8 @@ function videoReport(input, elementsInput) {
     sfx_count: sfxCount,
     overlay_count: overlayCount,
     clamped_scenes: clamped,
+    oversize_scenes: oversize,
+    overstuffed_scenes: overstuffed,
     has_music: Boolean(blueprint.global?.music?.music_prompt)
   };
 }
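
The two new advisories reduce to two threshold checks per scene. A rough sketch under stated assumptions: `classifyScene` is a hypothetical name, `duration_s ?? 5` stands in for `sceneDurationS`, and the words-per-second estimate stands in for `estSpeechS`, since neither helper appears in this diff. The 10s ceiling and the 1.3 ratio are taken from the changes above.

```javascript
// ASSUMPTIONS: stand-in helpers; the real sceneDurationS and estSpeechS
// are not shown in this diff.
const SAFE_MAX_S = 10;
const OVERSTUFF_RATIO = 1.3;
const WORDS_PER_SECOND = 2.5; // assumed speaking rate, for illustration only

const estSpeechS = (line) => line.trim().split(/\s+/).length / WORDS_PER_SECOND;

function classifyScene(scene) {
  const windowS = scene.duration_s ?? 5;
  // Sum the estimated speech length of every non-empty dialogue line.
  const speechS = (scene.dialogue ?? []).reduce(
    (sum, d) => sum + (d.line ? estSpeechS(d.line) : 0),
    0
  );
  return {
    oversize: windowS > SAFE_MAX_S,
    overstuffed: speechS > windowS * OVERSTUFF_RATIO
  };
}

// A 12s scene with no dialogue: oversize only.
console.log(classifyScene({ duration_s: 12, dialogue: [] }));

// Ten words (4s at the assumed rate) in a 2s scene: 4 > 2 * 1.3, so overstuffed.
console.log(
  classifyScene({
    duration_s: 2,
    dialogue: [{ line: "one two three four five six seven eight nine ten" }]
  })
);
```

The two flags are independent, so a long monologue scene can be reported under both.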
@@ -10950,12 +11004,9 @@ async function loadTranscriptBestEffort(ref) {
 async function stageCaptions(outDir, transcript) {
   const text = transcript?.trim();
   if (!text || text === "[]") return {};
-  const transcriptPath = path5.join(outDir, "transcript.json");
-  await writeFile2(transcriptPath, `${text}
-`, "utf8");
   const compositionPath = path5.join(outDir, "tiktok-captions-composition");
   await cp(SHIPPED_CAPTIONS_DIR, compositionPath, { recursive: true });
-  return { compositionPath, transcriptPath };
+  return { compositionPath };
 }
 function parseElements2(raw) {
   const parsed = JSON.parse(raw);
@@ -11161,7 +11212,6 @@ var scaffoldVideoCommand = defineCommand78({
       videoModel,
       overlayCompositionPath: compositionDest,
       captionsCompositionPath: captions.compositionPath,
-      transcriptPath: captions.transcriptPath,
       blueprintPath,
       frames,
       ambient: Boolean(args.ambient),
@@ -11214,6 +11264,8 @@ var scaffoldVideoCommand = defineCommand78({
             overlay_count: report.overlay_count,
             has_music: report.has_music,
             scenes_clamped_to_15s: report.clamped_scenes,
+            oversize_scenes: report.oversize_scenes,
+            overstuffed_scenes: report.overstuffed_scenes,
             note: "Drop ONE real source image at each el_* [TODO] (reused across every frame that element appears in), confirm each voice_select casting, then `baker canvas validate` and `baker canvas run`. Running generates many billed image/video/audio assets \u2014 it is not free."
           }
         },