@koda-sl/baker-cli 0.91.0 → 0.92.1

package/README.md CHANGED
@@ -3624,7 +3624,7 @@ Before the deconstruct it runs a **local shot-cut pass** on the source file with
3624
3624
 
3625
3625
  A shot longer than the video model's per-clip ceiling (Seedance's 15s, passed as `video_deconstruct`'s `max_clip_s`) is split into equal **continuation sub-scenes** that share their splice boundary exactly — so a long shot is reproduced in **full** (no truncation) and joins seamlessly. Each sub-scene carries `continues_previous`.
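The continuation rule above can be sketched roughly as follows. This is an illustrative reading of the README text, not the scaffold's actual code: the `Shot` and `SubScene` shapes and the `splitShot` helper are hypothetical, and setting `continues_previous` to `false` on the first part is an assumption.

```typescript
// Hypothetical shapes: a shot on the source timeline, in seconds.
interface Shot { start_s: number; end_s: number; }
interface SubScene extends Shot { continues_previous: boolean; }

// Split a shot into the fewest equal parts that each fit under maxClipS.
function splitShot(shot: Shot, maxClipS: number): SubScene[] {
  const duration = shot.end_s - shot.start_s;
  if (duration <= 0 || maxClipS <= 0) {
    throw new RangeError("shot duration and maxClipS must be positive");
  }
  const parts = Math.ceil(duration / maxClipS);
  const out: SubScene[] = [];
  for (let i = 0; i < parts; i++) {
    // Both boundaries come from the same expression, so part i's end is
    // numerically identical to part i+1's start (a shared splice boundary).
    const start = shot.start_s + (duration * i) / parts;
    const end = i === parts - 1
      ? shot.end_s
      : shot.start_s + (duration * (i + 1)) / parts;
    out.push({ start_s: start, end_s: end, continues_previous: i > 0 });
  }
  return out;
}
```

Under these assumptions a 40 s shot with a 15 s ceiling yields three parts of about 13.3 s each, and the last part ends exactly at the shot's end, so nothing is truncated.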
3626
3626
 
3627
- It then scaffolds the full pipeline like an **editing timeline**: each clip gets a **static-ad-grade start AND end keyframe** (`image_generate`, each with its **own self-contained `params.prompt`** — edit a frame node to change only that frame; `prompt.json` wired as the **authoritative shared `target_blueprint`**, plus a per-element reference legend). Each keyframe is **fully recast** to the dropped `el_*` reference images. For a frame with **no person/animal** the original extracted frame is kept LAST as a pure composition anchor; for any frame **with a face it is dropped entirely** (it leaked the source person's identity the hook used to render the reference woman, not our actor), so the recast `el_*`/actor-sheet is the sole identity reference. Both keyframes feed `video_generate` (`first_frame`+`last_frame`, so Seedance interpolates real in-shot motion; ultra-detailed motion brief; duration snapped to the nearest allowed clip length). Every keyframe grounds **only on its own extracted frame + `el_*` slots** — no reference to any other generated frame — so all images render **in parallel** (no cascade). Source-frame URLs are **deduped** (each ingested once). `--frames reuse` wires the real source frame straight in.
3627
+ It then scaffolds the full pipeline like an **editing timeline**: each clip gets a **static-ad-grade start AND end keyframe** (`image_generate`, each with its **own self-contained `params.prompt`** — edit a frame node to change only that frame; `prompt.json` wired as the **authoritative shared `target_blueprint`**, plus a per-element reference legend). Each keyframe is **fully recast** to the dropped `el_*` reference images. The original extracted frame is kept LAST as a **pure composition anchor** (framing / camera angle / shot size / pose) whenever identity is safely locked — i.e. a frame with no person/animal, OR every cast member present is **sheet-backed** (a multi-view turnaround owns identity, so the anchor can reproduce the source's framing without dictating the face). Since every base element is now sheet-backed by default, cast frames keep their framing anchor too — this is what reproduces the source's composition (a side-profile stays a side-profile, the camera angle holds scene to scene) instead of drifting to a fresh guess. The anchor's legend forbids taking identity/text/palette from it. It is dropped only when a cast member rests on a weak lone-snapshot reference (e.g. a `same_as` second-look slot), where the original frame could re-leak the source actor. Both keyframes feed `video_generate` (`first_frame`+`last_frame`, so Seedance interpolates real in-shot motion; ultra-detailed motion brief; duration snapped to the nearest allowed clip length). Every keyframe grounds **only on its own extracted frame + `el_*` slots** — no reference to any other generated frame — so all images render **in parallel** (no cascade). Source-frame URLs are **deduped** (each ingested once). `--frames reuse` wires the real source frame straight in.
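One way to read the new anchor rule in the added line above is as a single predicate: keep the source frame as a composition anchor unless some person or animal in the frame lacks a sheet. The sketch below is a hedged illustration only; `FrameElement`, `sheetBacked`, and `keepCompositionAnchor` are hypothetical names, not identifiers from the package.

```typescript
// Hypothetical element model for one frame.
type ElementKind = "person" | "animal" | "product" | "location";
interface FrameElement { id: string; kind: ElementKind; sheetBacked: boolean; }

// Keep the extracted source frame as a composition anchor when identity is
// safely locked: no cast present, or every cast member is sheet-backed.
function keepCompositionAnchor(elements: FrameElement[]): boolean {
  const cast = elements.filter(
    (e) => e.kind === "person" || e.kind === "animal",
  );
  // `every` is true for an empty cast, which covers the no-person/animal case.
  return cast.every((e) => e.sheetBacked);
}
```

A frame whose cast includes a lone-snapshot reference (for example a `same_as` second-look slot) would return `false` here, matching the case where the README says the anchor is dropped.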
3628
3628
 
3629
3629
  **Composited scenes (split-screen / picture-in-picture / keyed presenter).** Real ads aren't always one full-frame shot — a frame can be **persistently divided** (b-roll on top, a presenter talking on the bottom) or **layer a presenter** over background footage (boxed in a corner, or green-screen keyed). The deconstruct now reports this per scene as `scene.composition` (`layout: split_screen | pip | keyed_overlay`, with one `region` per stream — each its own clean-plate frame + motion brief, the talking-head region flagged `is_presenter`). The scaffold reproduces a composited scene by building **one clip per region** (`s<i>_r0_*`, `s<i>_r1_*`, …) and compositing them with ffmpeg: a split-screen `vstack`/`hstack` (stack direction read from the region **panels**, so a top/bottom split always stacks vertically), or a picture-in-picture `overlay` of the presenter inset at its corner. A **keyed** presenter is first cut to transparency by `video_background_remove` (`s<i>_key`), then overlaid. The presenter region carries the native lip-synced voice; b-roll/render panels stay silent. To change a layout, edit `composition` in `prompt.json` and re-scaffold, or hand-edit the `s<i>_composite` ffmpeg args. Plain full-frame scenes (the default) are unaffected.
3630
3630
 
@@ -3636,7 +3636,9 @@ It then scaffolds the full pipeline like an **editing timeline**: each clip gets
3636
3636
 
3637
3637
  **Sequenced audio.** Dialogue is a back-and-forth on one absolute timeline, so each **contiguous same-speaker turn** becomes its own `tts` placed at its real `start_s` — turns alternate and never stack (the earlier design concatenated each speaker's whole monologue at their earliest timestamp, so two voices played in parallel for the entire video). Each speaker is locked to one shared `voice_select` voice; a `sound_effect` per SFX and a `music` bed (conditioned on the **ad's own script + emotional arc** so the bed supports the message, styled after the AudD-identified track when available, ducked under the voices, and started at the reference's `music.starts_at_s` rather than always at 0) round out the mix (`audio_timeline`). The final mux normalizes the soundtrack to **−14 LUFS (stereo)** so the output plays loud in every player — the raw mix is quiet mono, which reads as "no sound."
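The turn-splitting idea described above (one `tts` per contiguous same-speaker turn, placed at its real `start_s`) could look something like this. It is a minimal sketch under assumed input shapes; `Line`, `Turn`, and `groupTurns` are hypothetical and do not come from the package.

```typescript
// Hypothetical transcript line and merged turn, timed on one absolute timeline.
interface Line { speaker: string; start_s: number; end_s: number; text: string; }
interface Turn { speaker: string; start_s: number; end_s: number; text: string; }

// Merge consecutive lines by the same speaker into one turn; a change of
// speaker always starts a new turn, so turns alternate instead of stacking.
function groupTurns(lines: Line[]): Turn[] {
  const sorted = [...lines].sort((a, b) => a.start_s - b.start_s);
  const turns: Turn[] = [];
  for (const line of sorted) {
    const last = turns[turns.length - 1];
    if (last && last.speaker === line.speaker) {
      last.end_s = Math.max(last.end_s, line.end_s);
      last.text += " " + line.text;
    } else {
      turns.push({ ...line });
    }
  }
  return turns;
}
```

The earlier design the README mentions corresponds to grouping by speaker alone, which loses the interleaving; grouping by contiguity keeps each turn at its own timestamp.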
3638
3638
 
3639
- **Native talking heads + one voice per person (no post-hoc lip-sync).** Seedance 2.0 generates lip-synced speech **natively** — a presenter phrase puts the full phrase in the clip's prompt with `generate_audio`, so lips and voice are generated together (no `video_lipsync`/veed). Each presenter phrase's audio is extracted and re-voiced through a **per-phrase** `audio_voice_convert` (ElevenLabs Voice Changer; one per phrase keeps each ≤15s clip under the converter's length cap) to the brand voice — timing preserved so the lips stay matched. There is **ONE voice per person**: a single `voice_select` is reused for all that person's phrases, and the deconstruct's `voiceover` label folds into the sole on-camera presenter (so on-camera and off-camera narration are the same voice, not two). A scene with **two on-camera speakers** can't be one clip — both lines become `tts` over a plain scene clip.
3639
+ **Native talking heads + one voice per person (no post-hoc lip-sync).** Seedance 2.0 generates lip-synced speech **natively** — a presenter phrase puts the full phrase in the clip's prompt with `generate_audio`, so lips and voice are generated together (no `video_lipsync`/veed). Each presenter phrase's audio is extracted and re-voiced through a **per-phrase** `audio_voice_convert` (ElevenLabs Voice Changer; one per phrase keeps each ≤15s clip under the converter's length cap) to the brand voice — timing preserved so the lips stay matched. There is **ONE voice per person**: a single `voice_select` is reused for all that person's phrases, and the deconstruct's `voiceover` label folds into the sole on-camera presenter (so on-camera and off-camera narration are the same voice, not two). A scene with **two speakers both on screen** can't be one clip — both lines become `tts` over a plain scene clip. But a scene with **one on-camera speaker trading lines with an OFF-camera voice** (an interviewer, a heard-but-not-shown assistant) keeps the on-camera speaker **native** (lip-synced) and reads the off-camera line as `tts` — "on screen" is decided by the speaker's element presence, so a heard-but-unshown voice no longer drops the whole scene to a silent clip. Every `tts` node is stamped with the spoken track's **`language_code`** when the blueprint states a language (cast localization note / voiceover persona / voice description), so numbers and units are read in the target tongue instead of ElevenLabs' English default (the "6900 read in English" bug). For **NATIVE (Seedance) lines** — which carry no language tag — the scaffold additionally **spells numerals into target-language words** across every part of the clip prompt Seedance can vocalize (the spoken line, the scene summary/action/motion, the transcript), so a French "6930 ?" becomes "six mille neuf cent trente ?" and is never read as English digits. Spelling covers **every language the blueprint can resolve** (fr, es, en, de, it, pt, nl, pl, ar, ja, ko, hi — via `n2words`); a language outside that set leaves digits (the `tts` path still localizes them via `language_code`).
3640
+
3641
+ **Same-shot lip-sync caution.** A single held shot can carry only ONE lip-synced clip (voiceover turns must not overlap, and Seedance generates one clip per shot), so when the on-camera speaker has further turns in that shot (a rapid "3000? … 4000?" with an off-camera "Plus" between), the first turn is native and the rest play as `tts` over the same clip — where the mouth no longer matches those words. This is inherent to reproducing sparse same-shot dialogue, not a wiring fault; the scaffold lists the affected scenes/lines in **`metadata.video.lip_sync_caution`** (advisory, never gated) so you can cut away to b-roll over those lines or rely on the burned-in captions that already show them.
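As a rough illustration of how the advisory rows described above might be derived, the sketch below treats the first on-camera turn in a held shot as the native clip and reports that speaker's later turns. The output shape follows the `lip_sync_caution` schema added to `VideoMeta` in this release; the `ShotTurn` input, the `lipSyncCaution` helper, and the choice to store line text in `tts_over_native` are assumptions.

```typescript
// Hypothetical input: the dialogue turns of one held shot, in order.
interface ShotTurn { speaker: string; onCamera: boolean; text: string; }

// Mirrors the fields of the `lip_sync_caution` entries in the schema.
interface LipSyncCaution { scene: number; speaker: string; tts_over_native: string[]; }

function lipSyncCaution(scene: number, turns: ShotTurn[]): LipSyncCaution | undefined {
  // The first on-camera turn is the one native, lip-synced clip.
  const native = turns.find((t) => t.onCamera);
  if (!native) return undefined;
  // Any further on-camera turns by that speaker play as tts over the clip.
  const later = turns
    .filter((t) => t !== native && t.onCamera && t.speaker === native.speaker)
    .map((t) => t.text);
  return later.length > 0
    ? { scene, speaker: native.speaker, tts_over_native: later }
    : undefined;
}
```

For the README's "3000? … 4000?" example with an off-camera "Plus" in between, this would report only "4000?", since the first turn stays native and the off-camera line was never lip-synced.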
3640
3642
 
3641
3643
  **Timing-faithful clip + extract (no overlap).** Each phrase clip is generated to its **coverage window** (the deconstruct's real scene/line timing, capped at 15s) and its converted voice is extracted to the **spoken window** (pause to pause) — *not* padded to a word-count estimate. Padding past the window was what ran the voice the clip's whole length and overlapped the next phrase; trusting the deconstruct's timing keeps consecutive phrases back-to-back and lets Seedance pace the quoted text to fit. `metadata.video.talking_scenes` still records each phrase's `scene_s` vs `est_speech_s` so you can spot a line whose words overrun its window and widen it by hand.
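The by-hand check the paragraph above suggests (spotting a line whose words overrun its window) might be automated along these lines. This is only a sketch: the README names `scene_s` and `est_speech_s`, but the full row shape of `metadata.video.talking_scenes`, the `TalkingScene` type, and the `overrunningPhrases` helper with its tolerance parameter are assumptions.

```typescript
// Assumed row shape; only scene_s and est_speech_s are named in the README.
interface TalkingScene { scene: number; scene_s: number; est_speech_s: number; }

// Return the rows whose estimated speech time exceeds the clip's window
// by more than an optional tolerance (in seconds).
function overrunningPhrases(rows: TalkingScene[], toleranceS = 0): TalkingScene[] {
  return rows.filter((r) => r.est_speech_s > r.scene_s + toleranceS);
}
```

Rows returned by this filter would be the candidates for widening the window by hand, as the README describes.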
3642
3644
 
@@ -3672,10 +3674,11 @@ baker canvas run ./reference-ad.video.canvas.json
3672
3674
  | `--select-model <id>` | `~google/gemini-flash-latest` | Override the element-selection `text_generate` model. |
3673
3675
  | `--image-model <id>` | `openai/gpt-5.4-image-2` | Override the per-frame `image_generate` model (defaults to the strongest, matching `scaffold-static-ad`). |
3674
3676
  | `--video-model <id>` | `bytedance/seedance-2.0` | Override the `video_generate` model. |
3677
+ | `--resolution <res>` | `1080p` | Output resolution for every generated clip (`480p`/`720p`/`1080p` for Seedance). The model defaults to a LOW tier when unset, which downscales the 2K keyframes — pinning the top tier keeps the clip as sharp as its frames. |
3675
3678
 
3676
3679
  Each scene is captured in a **shoot mode** — `ugc_selfie` (talking heads, the default look), `ugc_broll`, `studio_product` (pack shot), `lifestyle_cinematic`, or `screen_ui`. The scaffold derives one per scene (UGC by default; the cinematic and screen lanes are opt-in) and bakes its capture block into the frame and a camera default into the clip; override per scene with a `shoot_mode` field in `prompt.json`. Capture aesthetic + depth-of-field follow the mode (UGC stays flat; studio/lifestyle allow shallow DoF). Clips also carry **diegetic native audio** — the scene's own ambience described in the Seedance prompt, never music (the music bed is a separate, ducked track); set a scene's `ambient` field to steer it.
3677
3680
 
3678
- **Automatic by default (no flags).** A recast **person/pet recurring across ≥2 scenes** is always locked to ONE multi-view turnaround (`image_reference_sheet`) every frame grounds on. An **app/website/chat screen** is never sent to the video model — the scaffold drops the scene to a clean talking-head and seeds a phone-mockup PIP stub to fill with a real `baker images screenshot` or brand HTML block (Seedance garbles UI and a split leaves a seam). The **music bed is instrumental** (the script is never fed to the music model — it would sing over the voice), enters only after the hook, and is **sidechain-ducked** under the voice. **Word-synced TikTok captions** are wired off the deconstruct transcript whenever the ad has speech. Seeded overlays are pushed **off the subject's face** (dead-center → bottom band).
3681
+ **Automatic by default (no flags).** Every recast **base element — person, pet, product, AND location/set** is fused into ONE rich multi-view sheet (`image_reference_sheet`, one subject per sheet, **4K**, up to 8 cells) that every frame it appears in grounds on, so the same face/pet/pack/room is rendered from a multi-angle canvas instead of a lone flat snapshot (a one-scene hero element is sheeted too). Each sheet pairs a **full turnaround** (angles, for proportions/wardrobe/layout) with tight **close-ups** so the generator is prepared for ANY framing a scene needs: a **person** gets body cells + face close-ups (front/¾/profile) and a mid-sentence speaking expression (identity pinned, natural skin — no airbrushing); an **animal** gets a body turnaround + head close-ups + an eyes/face macro; a **product** gets a turnaround + label and material detail macros; a **location/set** gets several camera angles of the same room + a key-surface detail. Generated clips are pinned to **1080p** (see `--resolution`) so the video keeps the keyframe's sharpness, and each cast frame keeps the source frame as a **composition anchor** (identity stays on the sheet) so the original framing/camera is reproduced, not re-guessed. An **app/website/chat screen** is never sent to the video model — the scaffold drops the scene to a clean talking-head and seeds a phone-mockup PIP stub to fill with a real `baker images screenshot` or brand HTML block (Seedance garbles UI and a split leaves a seam). The **music bed is instrumental** (the script is never fed to the music model — it would sing over the voice), enters only after the hook, and is **sidechain-ducked** under the voice. **Word-synced TikTok captions** are wired off the deconstruct transcript whenever the ad has speech. Seeded overlays are pushed **off the subject's face** (dead-center → bottom band).
3679
3682
 
3680
3683
  The two scaffold passes are billed (the full `video_deconstruct` is the heavy one); **running** the result then generates many image/video/audio assets and is not free. Defaults to vertical 1080×1920 overlays — copy + edit the composition for other aspect ratios. For on-brand overlay type, drop `brand-bold.otf`/`brand-regular.otf` into the copied `video-overlay-composition/` dir (wired via `@font-face`, with a system fallback). Richer transcription (punctuated words + paragraphs) is available via the deconstruct's `transcriber: "deepgram"` param when `DEEPGRAM_API_KEY` is set.
3681
3684
 
@@ -1117,7 +1117,7 @@ var MODEL_REGISTRY = {
1117
1117
  required: ["subject_description", "subject_type"],
1118
1118
  params: {
1119
1119
  subject_description: { kind: "string" },
1120
- subject_type: { kind: "string", enum: ["character", "person", "product"] },
1120
+ subject_type: { kind: "string", enum: ["character", "person", "product", "location"] },
1121
1121
  views: { kind: "json" },
1122
1122
  style: { kind: "string" },
1123
1123
  prompt_override: { kind: "string" },
@@ -1131,7 +1131,7 @@ var MODEL_REGISTRY = {
1131
1131
  required: ["subject_description", "subject_type"],
1132
1132
  params: {
1133
1133
  subject_description: { kind: "string" },
1134
- subject_type: { kind: "string", enum: ["character", "person", "product"] },
1134
+ subject_type: { kind: "string", enum: ["character", "person", "product", "location"] },
1135
1135
  views: { kind: "json" },
1136
1136
  style: { kind: "string" },
1137
1137
  prompt_override: { kind: "string" },
@@ -1589,6 +1589,11 @@ var VideoMeta = z.object({
1589
1589
  z.object({ scene: z.number(), lipsync_node: z.string() })
1590
1590
  ])
1591
1591
  ).default([]),
1592
+ // Advisory, NOT gated: scenes where a presenter's later same-shot turns play as
1593
+ // `tts` over their one native clip (a held shot can hold only ONE lip-synced clip,
1594
+ // and voiceover turns can't overlap), so the lips may not match those words. The
1595
+ // scaffold surfaces it so the agent can cut away or rely on the burned-in captions.
1596
+ lip_sync_caution: z.array(z.object({ scene: z.number(), speaker: z.string(), tts_over_native: z.array(z.string()) })).optional(),
1592
1597
  // Advisory, NOT gated by the validator: the reviewable "which graphic fires
1593
1598
  // on which spoken beat" map emitted by scaffold-video (per-scene window,
1594
1599
  // spoken line, storyboard frames, scheduled graphics). Free-form rows so the
@@ -5398,8 +5403,10 @@ var REFERENCE_SHEET_MODELS = ["google/gemini-3-pro-image-preview", "google/gemin
5398
5403
  var ImageReferenceSheetParams = z20.object({
5399
5404
  model: z20.enum(REFERENCE_SHEET_MODELS),
5400
5405
  subject_description: z20.string().min(1),
5401
- subject_type: z20.enum(["character", "person", "product"]),
5402
- views: z20.array(z20.string().min(1)).min(2).max(6).optional(),
5406
+ // `location` = a set/room shown from several camera ANGLES (not a rotated subject),
5407
+ // so a multi-scene shoot keeps one consistent set.
5408
+ subject_type: z20.enum(["character", "person", "product", "location"]),
5409
+ views: z20.array(z20.string().min(1)).min(2).max(8).optional(),
5403
5410
  style: z20.string().optional(),
5404
5411
  prompt_override: z20.string().min(1).optional(),
5405
5412
  aspect_ratio: z20.enum(["1:1", "16:9", "9:16", "4:3", "3:4", "3:2", "2:3", "4:5", "5:4", "21:9", "1:4", "4:1", "1:8", "8:1"]).optional(),
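For readers who want to see the effect of this hunk without the bundle, the widened constraints can be restated as a dependency-free check. This is an illustrative sketch rather than the package's validator (which uses the Zod schema shown above); `checkSheetParams` is a hypothetical helper that covers only `subject_type` and `views`.

```typescript
// Values accepted after this change; "location" is the new entry.
const SUBJECT_TYPES = ["character", "person", "product", "location"] as const;

// Returns a list of problems; an empty list means both fields pass.
function checkSheetParams(p: { subject_type: string; views?: string[] }): string[] {
  const errors: string[] = [];
  if (!(SUBJECT_TYPES as readonly string[]).includes(p.subject_type)) {
    errors.push(`subject_type must be one of: ${SUBJECT_TYPES.join(", ")}`);
  }
  if (p.views !== undefined) {
    // The upper bound moved from 6 to 8 in this release; the lower bound stays 2.
    if (p.views.length < 2 || p.views.length > 8) {
      errors.push("views must have between 2 and 8 entries");
    }
    if (p.views.some((v) => v.length < 1)) {
      errors.push("views entries must be non-empty strings");
    }
  }
  return errors;
}
```

A `location` sheet with eight views passes this check, whereas the same input would have been rejected against the 0.91.0 limits on both fields.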
@@ -5409,7 +5416,7 @@ var imageReferenceSheetNode = delegated({
5409
5416
  id: "image_reference_sheet",
5410
5417
  version: "1.0.0",
5411
5418
  category: "image",
5412
- summary: "Fuse 1\u20136 images of a single subject (person, character, or product) into ONE multi-view reference sheet \u2014 a labeled turnaround grid (FRONT / SIDE / BACK\u2026) in consistent style and lighting. Curated models: Gemini 3 Pro Image (best fusion + labels), Gemini 3.1 Flash Image (cheap iteration).",
5419
+ summary: "Fuse 1\u20136 images of a single subject (person, character, product, or location/set) into ONE multi-view reference sheet \u2014 a labeled grid in consistent style and lighting: a turnaround (FRONT / SIDE / BACK\u2026) for a person/character/product, or several camera angles of the same room (WIDE / REVERSE / DETAIL\u2026) for a location. Curated models: Gemini 3 Pro Image (best fusion + labels), Gemini 3.1 Flash Image (cheap iteration).",
5413
5420
  when_to_use: "Use before image_generate / video_generate when a subject must stay consistent across many creatives \u2014 wire the `sheet` output into their `reference` input instead of re-describing the subject per prompt. `subject_description` should be the exact wording you reuse downstream. Pick `google/gemini-3-pro-image-preview` for final 6-view sheets at 2K+, `google/gemini-3.1-flash-image-preview` while iterating.",
5414
5421
  inputs: z20.object({ references: z20.array(ImageRef).min(1).max(6) }).loose(),
5415
5422
  params: ImageReferenceSheetParams,
@@ -6139,4 +6146,4 @@ export {
6139
6146
  defaultRegistry,
6140
6147
  createEngineFromEnv
6141
6148
  };
6142
- //# sourceMappingURL=chunk-LMVDA3EZ.js.map
6149
+ //# sourceMappingURL=chunk-RCPMJKI7.js.map