@koda-sl/baker-cli 0.82.0 → 0.91.0
- package/README.md +25 -9
- package/canvas/video-overlay-composition/index.html +31 -5
- package/dist/{chunk-KIL2ZJST.js → chunk-LMVDA3EZ.js} +151 -17
- package/dist/chunk-LMVDA3EZ.js.map +1 -0
- package/dist/cli.js +1258 -281
- package/dist/cli.js.map +1 -1
- package/dist/engine/index.js +1 -1
- package/package.json +1 -1
- package/dist/chunk-KIL2ZJST.js.map +0 -1
package/README.md
CHANGED
@@ -2620,7 +2620,7 @@ Pick a `source` discriminator and declare the kind you expect. See [Ingestion](#
**Outputs:** `asset` → `<params.expect>` / content-determined (URL strategy table) or extension-inferred (path).
-
**Path-source notes:** the canvas is **not portable** to another machine without the file. Cache key folds the file's `mtime:size`, so editing the file invalidates the cache automatically. Supported extensions: `png`, `jpg`/`jpeg`, `webp`, `gif`, `avif`, `svg`, `mp4`, `webm`, `mov`, `m4v`, `mp3`, `wav`, `m4a`, `ogg`, `flac`, `json`, `txt`, `md`, `markdown`, `html`/`htm`, `csv`, `ttf`, `otf`, `woff`, `woff2`. Unknown extensions fall back to magic-byte sniffing for common image formats (and an SVG content sniff), else `kind_mismatch`. **SVG (`expect: "image"`) is rasterized to a transparent PNG on ingest** — brand logos are usually SVG, and image-generation models can't read SVG markup, so it's upscaled (longest edge near 2048px) with transparency preserved and the resulting asset carries `metadata.rasterized_from: "svg"`.
+
**Path-source notes:** the canvas is **not portable** to another machine without the file. Cache key folds the file's `mtime:size`, so editing the file invalidates the cache automatically. Supported extensions: `png`, `jpg`/`jpeg`, `webp`, `gif`, `avif`, `svg`, `mp4`, `webm`, `mov`, `m4v`, `mp3`, `wav`, `m4a`, `ogg`, `flac`, `json`, `txt`, `md`, `markdown`, `html`/`htm`, `csv`, `ttf`, `otf`, `woff`, `woff2`. Unknown extensions fall back to magic-byte sniffing for common image formats (and an SVG content sniff), else `kind_mismatch`. **SVG (`expect: "image"`) is rasterized to a transparent PNG on ingest** — brand logos are usually SVG, and image-generation models can't read SVG markup, so it's upscaled (longest edge near 2048px) with transparency preserved and the resulting asset carries `metadata.rasterized_from: "svg"`. **Video (`expect: "video"`) duration is probed from the file's ISO-BMFF (`mp4`/`mov`/`m4v`) header** and stamped as the canonical `duration_ms` (and `metadata.duration_ms`); other containers (e.g. `webm`) leave it unset. Downstream `video_deconstruct` uses this declared duration to size its ingest-poll timeout and preflight — without it those fall back to worst-case budgets and a single deconstruct step can hit the action time limit.
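The header probe described above can be sketched roughly as follows. This is a simplified illustration, not the package's code: it assumes a Node.js `Buffer`, handles only 32-bit box sizes and a version-0 `mvhd`, and the names `box`, `findBox` and `readMvhdDurationMs` are hypothetical.

```javascript
// Hypothetical helper: build an ISO-BMFF box (32-bit size + 4-char type + payload).
function box(type, payload) {
  const header = Buffer.alloc(8);
  header.writeUInt32BE(8 + payload.length, 0);
  header.write(type, 4, "ascii");
  return Buffer.concat([header, payload]);
}

// Walk sibling boxes in [start, end) and return the payload range of the first match.
function findBox(buf, start, end, type) {
  let offset = start;
  while (offset + 8 <= end) {
    const size = buf.readUInt32BE(offset);
    // Sketch only: 64-bit sizes (size === 1) and to-end-of-file sizes (size === 0) are rejected.
    if (size < 8 || offset + size > end) return null;
    if (buf.toString("ascii", offset + 4, offset + 8) === type) {
      return { start: offset + 8, end: offset + size };
    }
    offset += size;
  }
  return null;
}

// Duration in milliseconds from moov/mvhd, or undefined when it cannot be read.
function readMvhdDurationMs(buf) {
  const moov = findBox(buf, 0, buf.length, "moov");
  if (!moov) return undefined;
  const mvhd = findBox(buf, moov.start, moov.end, "mvhd");
  if (!mvhd || buf[mvhd.start] !== 0) return undefined; // version 0 only in this sketch
  // Skip version/flags (4) + creation_time (4) + modification_time (4).
  const tsOffset = mvhd.start + 12;
  if (tsOffset + 8 > mvhd.end) return undefined;
  const timescale = buf.readUInt32BE(tsOffset);
  const duration = buf.readUInt32BE(tsOffset + 4);
  if (timescale === 0 || duration === 0 || duration === 0xffffffff) return undefined;
  return Math.round((duration / timescale) * 1000);
}

// Synthetic header for illustration: timescale 1000, duration 12500 ticks.
const mvhdPayload = Buffer.alloc(20);
mvhdPayload.writeUInt32BE(1000, 12);
mvhdPayload.writeUInt32BE(12500, 16);
const sample = Buffer.concat([
  box("ftyp", Buffer.from("isom")),
  box("moov", box("mvhd", mvhdPayload)),
]);
console.log(readMvhdDurationMs(sample)); // 12500
```

A buffer without a readable `moov`/`mvhd` pair yields `undefined`, which matches the described behaviour of leaving `duration_ms` unset for other containers.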
**Cost:** 0 engine credits for direct fetch + yt-dlp + local file. Handinger charges per scrape.
@@ -3282,6 +3282,8 @@ Over-length runs fail with a message that includes ready-to-use suggested window
| `max_scenes` | number | no | `20` (full ≤3 min) / `40` (longer) / `60` (index) | 1–60; extra cuts merge into the last scene |
| `focus` | string | no | — | extra extraction emphasis (e.g. `"overlay typography"`) |
| `start_s` / `end_s` | number | no | whole video | analysis window (absolute seconds) for chunking long videos; window ≤480s |
+
| `shot_cuts` | number[] | no | — | real visual shot-cut timestamps (absolute seconds); scene boundaries snap onto them and any scene spanning one is split, so a scene's frames never straddle a hard cut. `scaffold-video` detects these with PySceneDetect and passes them automatically |
+
| `max_clip_s` | number | no | — | the video model's per-clip ceiling (seconds); a shot longer than this is split into equal continuation sub-scenes that share their splice (full-length, seamless — no truncation), each tagged `continues_previous`. `scaffold-video` passes Seedance's 15 |
**Outputs**
@@ -3358,9 +3360,10 @@ Place and mix several audio clips onto one timeline — a music bed plus timed v
| Name | Type | Required | Notes |
|---|---|---|---|
-
| `tracks` | array | yes | `[{slot, start_s, gain_db?}]` — `slot` matches a wired input; `start_s` is absolute seconds; `gain_db`
+
| `tracks` | array | yes | `[{slot, start_s, gain_db?}]` — `slot` matches a wired input; `start_s` is absolute seconds; `gain_db` sets the static level (e.g. `-20` for a music bed) |
| `total_ms` | number | no | pins the final length (pad/trim) |
| `output_format` | enum | no | `mp3` (default) / `wav` / `m4a` |
+
| `duck` | object | no | sidechain-duck one track under others: `{ track, against: [...], threshold?, ratio?, attack?, release? }` — the `track` (music) drops while any `against` track (voice) carries signal and recovers in the gaps |
**Outputs** — `audio`. **Cost** — free (local). Requires `ffmpeg`.
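As an illustration of the parameter table above, here is a sketch of an `audio_timeline` params object with a `duck` entry. The slot names (`music`, `vo_1`, `vo_2`) and the numeric values are assumptions chosen for the example, and `checkDuck` is a hypothetical pre-flight helper that mirrors the bounds visible in the `DuckSpec` schema of this diff (threshold 0 to 1, ratio 1 to 20, attack/release at least 0.01). Whether `duck.track` and `duck.against` must name slots from `tracks` is also an assumption.

```javascript
// Illustrative audio_timeline params; slot names and numbers are assumptions.
const params = {
  tracks: [
    { slot: "music", start_s: 0, gain_db: -20 }, // static level for the music bed
    { slot: "vo_1", start_s: 1.5 },
    { slot: "vo_2", start_s: 9.0 },
  ],
  total_ms: 30000,
  output_format: "mp3",
  duck: { track: "music", against: ["vo_1", "vo_2"], threshold: 0.05, ratio: 8 },
};

// Hypothetical check mirroring the DuckSpec bounds; not part of the CLI.
function checkDuck(p) {
  const errors = [];
  const d = p.duck;
  if (!d) return errors; // duck is optional
  const slots = new Set(p.tracks.map((t) => t.slot));
  // Assumption: ducked and sidechain tracks are referenced by their slot names.
  for (const name of [d.track, ...d.against]) {
    if (!slots.has(name)) errors.push(`unknown slot: ${name}`);
  }
  if (d.against.length < 1) errors.push("against needs at least one track");
  if (d.threshold !== undefined && (d.threshold < 0 || d.threshold > 1)) {
    errors.push("threshold must be in [0, 1]");
  }
  if (d.ratio !== undefined && (d.ratio < 1 || d.ratio > 20)) {
    errors.push("ratio must be in [1, 20]");
  }
  for (const key of ["attack", "release"]) {
    if (d[key] !== undefined && d[key] < 0.01) errors.push(`${key} must be >= 0.01`);
  }
  return errors;
}

console.log(checkDuck(params)); // []
```

Here `gain_db` sets the bed's resting level while `duck` lowers it further only while a voice track carries signal, so the two settings are complementary rather than alternatives.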
@@ -3615,15 +3618,27 @@ Validate, then execute the graph. Blocks until done. Logs one line per node. Ret
Turn a reference video into a **runnable, self-validated reproduction canvas** in one command — the video counterpart of `scaffold-static-ad`. It runs **billed passes** up front:
1. **`video_deconstruct`** (`~google/gemini-pro-latest`, full mode) — reverse-engineers the video into a scene-by-scene blueprint + word-level transcript, written next to the canvas as **`prompt.json`**. Each scene's `start_frame_prompt`/`end_frame_prompt` are inlined into the frame nodes (see below); `prompt.json` then rides along as the shared **global style reference** (palette, cast cohesion) and as provenance.
-
2. **recurring-element selection** (`~google/gemini-flash-latest`) — picks only the **recurring, identity-critical** elements (each `global.cast` person, a recurring animal, a showcased product, the brand logo) and the scene indices each appears in. One real reference image grounds each element across **every** frame it appears in, so the same actor stays consistent the whole video.
+
2. **recurring-element selection** (`~google/gemini-flash-latest`) — picks only the **recurring, identity-critical** elements (each `global.cast` person, a recurring animal, a showcased product, the brand logo) and the scene indices each appears in. One real reference image grounds each element across **every** frame it appears in, so the same actor stays consistent the whole video. This selection runs as a **second pass over a slimmed blueprint** (cast/branding + each scene's frame prompts only) — a long ad's full blueprint can exceed the engine's inline-prompt limit, so the heavy per-scene detail (dialogue, overlays, transcript) the selector never reads is dropped before the prompt.
-
+
Before the deconstruct it runs a **local shot-cut pass** on the source file with **[PySceneDetect](https://www.scenedetect.com)** (`scenedetect` CLI, `detect-content` — the battle-tested HSV content detector, installed in the canvas sandbox) and passes the cut timestamps as `video_deconstruct`'s `shot_cuts`. The deconstruct snaps its scene boundaries onto those real cuts and **splits any scene that spans one**, so a scene's frames can never straddle a hard cut (the failure where a scene's start frame was the couch and its end frame the b-roll). Two knobs tuned for fast social ads: the content **threshold defaults to 18** (PySceneDetect's own default of 27 misses soft reframes) and the **minimum scene length is dropped to 0.25s** (its default ~0.6s merges away rapid montage flashes) — so super-fast cuts survive and become cheap still-holds downstream. The threshold is **adaptive**: if the first pass looks like a continuous shot shredded into many close micro-cuts (a talking-head selfie's natural motion), it re-runs at PySceneDetect's own default of 27 and trusts that — real montage cuts survive it, motion artifacts don't. Pinning **`--shot-threshold N`** disables the re-check (lower = more cuts). The backend additionally coalesces residual same-shot slivers. If `scenedetect` is unavailable it warns loudly and degrades to LLM-only boundaries.
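The snap-and-split behaviour described above can be sketched as follows. This is not the backend's implementation: the function names, the scene shape (`start_s`/`end_s`) and the `snapWindowS` tolerance of 0.25s are assumptions made for illustration.

```javascript
// Move a boundary onto the nearest shot cut if one lies within the snap window.
function snapToCut(t, cuts, snapWindowS) {
  let best = t;
  let bestDist = snapWindowS;
  for (const c of cuts) {
    const d = Math.abs(c - t);
    if (d <= bestDist) {
      best = c;
      bestDist = d;
    }
  }
  return best;
}

// Snap each scene's edges onto real cuts, then split any scene that still spans a cut.
function alignScenesToCuts(scenes, shotCuts, snapWindowS = 0.25) {
  const cuts = [...shotCuts].sort((a, b) => a - b);
  const out = [];
  for (const scene of scenes) {
    const start = snapToCut(scene.start_s, cuts, snapWindowS);
    const end = snapToCut(scene.end_s, cuts, snapWindowS);
    const inner = cuts.filter((c) => c > start && c < end);
    const edges = [start, ...inner, end];
    for (let i = 0; i + 1 < edges.length; i++) {
      out.push({ start_s: edges[i], end_s: edges[i + 1] });
    }
  }
  return out;
}

const aligned = alignScenesToCuts(
  [{ start_s: 0, end_s: 4 }, { start_s: 4, end_s: 9 }],
  [4.02, 6.5],
);
console.log(aligned);
```

In this example the boundary at 4s snaps onto the real cut at 4.02s, and the second scene is split at 6.5s, so no resulting scene contains a cut strictly inside it.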
-
+
A shot longer than the video model's per-clip ceiling (Seedance's 15s, passed as `video_deconstruct`'s `max_clip_s`) is split into equal **continuation sub-scenes** that share their splice boundary exactly — so a long shot is reproduced in **full** (no truncation) and joins seamlessly. Each sub-scene carries `continues_previous`.
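A minimal sketch of the equal split described above, under stated assumptions: `splitLongShot` is a hypothetical name, and tagging only the second and later sub-scenes with `continues_previous: true` is one reading of the text, not confirmed behaviour.

```javascript
// Split a shot into the fewest equal sub-scenes that each fit under maxClipS.
function splitLongShot(startS, endS, maxClipS) {
  const length = endS - startS;
  const parts = Math.max(1, Math.ceil(length / maxClipS));
  const step = length / parts;
  const subScenes = [];
  for (let i = 0; i < parts; i++) {
    subScenes.push({
      start_s: startS + i * step,
      // The last part ends exactly on the shot's end; inner boundaries are shared.
      end_s: i === parts - 1 ? endS : startS + (i + 1) * step,
      // Assumption: the first part starts the shot, later parts continue it.
      continues_previous: i > 0,
    });
  }
  return subScenes;
}

const parts = splitLongShot(10, 50, 15); // a 40s shot with a 15s ceiling
console.log(parts.length); // 3
```

Because each inner boundary is computed with the same expression for the end of one part and the start of the next, adjacent sub-scenes share their splice time exactly and the parts together cover the whole shot.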
-
+
It then scaffolds the full pipeline like an **editing timeline**: each clip gets a **static-ad-grade start AND end keyframe** (`image_generate`, each with its **own self-contained `params.prompt`** — edit a frame node to change only that frame; `prompt.json` wired as the **authoritative shared `target_blueprint`**, plus a per-element reference legend). Each keyframe is **fully recast** to the dropped `el_*` reference images. For a frame with **no person/animal** the original extracted frame is kept LAST as a pure composition anchor; for any frame **with a face it is dropped entirely** (it leaked the source person's identity — the hook used to render the reference woman, not our actor), so the recast `el_*`/actor-sheet is the sole identity reference. Both keyframes feed `video_generate` (`first_frame`+`last_frame`, so Seedance interpolates real in-shot motion; ultra-detailed motion brief; duration snapped to the nearest allowed clip length). Every keyframe grounds **only on its own extracted frame + `el_*` slots** — no reference to any other generated frame — so all images render **in parallel** (no cascade). Source-frame URLs are **deduped** (each ingested once). `--frames reuse` wires the real source frame straight in.
-
**
+
**Composited scenes (split-screen / picture-in-picture / keyed presenter).** Real ads aren't always one full-frame shot — a frame can be **persistently divided** (b-roll on top, a presenter talking on the bottom) or **layer a presenter** over background footage (boxed in a corner, or green-screen keyed). The deconstruct now reports this per scene as `scene.composition` (`layout: split_screen | pip | keyed_overlay`, with one `region` per stream — each its own clean-plate frame + motion brief, the talking-head region flagged `is_presenter`). The scaffold reproduces a composited scene by building **one clip per region** (`s<i>_r0_*`, `s<i>_r1_*`, …) and compositing them with ffmpeg: a split-screen `vstack`/`hstack` (stack direction read from the region **panels**, so a top/bottom split always stacks vertically), or a picture-in-picture `overlay` of the presenter inset at its corner. A **keyed** presenter is first cut to transparency by `video_background_remove` (`s<i>_key`), then overlaid. The presenter region carries the native lip-synced voice; b-roll/render panels stay silent. To change a layout, edit `composition` in `prompt.json` and re-scaffold, or hand-edit the `s<i>_composite` ffmpeg args. Plain full-frame scenes (the default) are unaffected.
+
+
**Montage flashes held as stills.** A rapid-cut beat shorter than ~2s with no spoken line is a **flash** — Seedance's shortest clip is 4s, so generating one (then trimming away most of it) burns credits for motion no viewer perceives. The scaffold instead **holds one keyframe as a still** for the scene length (a cheap ffmpeg loop, no billed `video_generate`), same look at a fraction of the cost. Talking/ambient beats keep a real clip (they need motion + native audio).
+
+
**The phrase model (voice cut at pauses, not at visual cuts).** The voice is grouped into **phrases** — runs of continuous speech with no real pause, which may span several visual scenes. A phrase is voiced ONCE (so a sentence the deconstruct split at a visual cut never breaks mid-word): if the speaker is **shown** anywhere in the phrase it's a single Seedance clip (`s<anchor>_clip`, native lip-sync + audio) re-voiced to the brand voice; if the speaker is **never shown** it's one ElevenLabs `tts` read. The picture is then assembled **scene by scene**: a scene that shows the speaker **slices its window** out of the phrase clip (`s<i>_seg`, an ffmpeg `-ss`/`-t` cut — video and audio come from the *same* clip, so lip-sync holds), and a **b-roll cutaway** gets its own silent clip while the phrase's voice plays underneath. "Shown" is decided by the **presenter element's per-scene presence**, not just who's speaking — a scene where a cast member narrates over b-roll (their element absent) is treated as a cutaway, so the talking head never appears where the original cut away. A presenter run longer than the 15s Seedance ceiling **splits at a scene boundary** into contiguous takes (each its own clip + convert), so a sliced window never reads past its clip. A b-roll cutaway *inside* a phrase lands at an **approximate** time (Seedance exposes no word timing) — nudge the scene boundary if it's off its beat.
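To make the phrase grouping above concrete, here is a small sketch that groups word timings into phrases at pauses. The word shape (`text`, `start_s`, `end_s`), the name `groupPhrases` and the 0.4s pause threshold are all assumptions; the README does not state what counts as a "real pause".

```javascript
// Group words into phrases: a new phrase starts whenever the silence before a
// word is at least minPauseS. Scene boundaries are deliberately ignored here.
function groupPhrases(words, minPauseS = 0.4) {
  const phrases = [];
  for (const w of words) {
    const last = phrases[phrases.length - 1];
    if (last && w.start_s - last.end_s < minPauseS) {
      last.words.push(w.text);
      last.end_s = w.end_s;
    } else {
      phrases.push({ start_s: w.start_s, end_s: w.end_s, words: [w.text] });
    }
  }
  return phrases;
}

const phrases = groupPhrases([
  { text: "Meet", start_s: 0.0, end_s: 0.3 },
  { text: "Baker.", start_s: 0.35, end_s: 0.8 },
  { text: "Ship", start_s: 1.6, end_s: 1.9 },
  { text: "faster.", start_s: 1.95, end_s: 2.4 },
]);
console.log(phrases.map((p) => p.words.join(" ")));
```

A phrase produced this way can span several visual scenes, which is why the picture is assembled afterwards by slicing each scene's window out of the phrase clip.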
+
+
**A starting point, not a locked render.** The canvas mirrors the reference's structure to give you a faithful scaffold, but `metadata.todo.full_flexibility` makes explicit that the agent has **full editing freedom**: add / delete / reorder / split / merge scenes, re-prompt any frame or motion brief, change a scene's layout (full-frame ↔ composite), or rewrite any line — the content-addressed cache re-bills only what changes, and `baker canvas validate` re-checks timing/lip-sync after any edit.
+
+
**Sequenced audio.** Dialogue is a back-and-forth on one absolute timeline, so each **contiguous same-speaker turn** becomes its own `tts` placed at its real `start_s` — turns alternate and never stack (the earlier design concatenated each speaker's whole monologue at their earliest timestamp, so two voices played in parallel for the entire video). Each speaker is locked to one shared `voice_select` voice; a `sound_effect` per SFX and a `music` bed (conditioned on the **ad's own script + emotional arc** so the bed supports the message, styled after the AudD-identified track when available, ducked under the voices, and started at the reference's `music.starts_at_s` rather than always at 0) round out the mix (`audio_timeline`). The final mux normalizes the soundtrack to **−14 LUFS (stereo)** so the output plays loud in every player — the raw mix is quiet mono, which reads as "no sound."
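The turn grouping described above might look like the following sketch. The input shape (`speaker`, `start_s`, `text`), the function name `groupTurns` and the `turn_<i>` slot naming are hypothetical; only the rule itself (one `tts` per contiguous same-speaker turn, placed at its real `start_s`) comes from the text.

```javascript
// Merge consecutive lines by the same speaker into one turn; a speaker change
// starts a new turn at that line's own start time.
function groupTurns(lines) {
  const turns = [];
  for (const line of lines) {
    const last = turns[turns.length - 1];
    if (last && last.speaker === line.speaker) {
      last.text += " " + line.text;
    } else {
      turns.push({ speaker: line.speaker, start_s: line.start_s, text: line.text });
    }
  }
  return turns;
}

const turns = groupTurns([
  { speaker: "A", start_s: 0.0, text: "Have you tried it?" },
  { speaker: "B", start_s: 2.1, text: "Not yet." },
  { speaker: "B", start_s: 3.0, text: "Is it any good?" },
  { speaker: "A", start_s: 5.2, text: "It changed my mornings." },
]);

// One timeline track per turn, each at its absolute start (slot names are illustrative).
const tracks = turns.map((t, i) => ({ slot: `turn_${i}`, start_s: t.start_s }));
console.log(tracks);
```

Speaker A ends up with two separate turns at 0.0s and 5.2s rather than one concatenated monologue at 0.0s, which is the stacking problem the earlier design had.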
+
+
**Native talking heads + one voice per person (no post-hoc lip-sync).** Seedance 2.0 generates lip-synced speech **natively** — a presenter phrase puts the full phrase in the clip's prompt with `generate_audio`, so lips and voice are generated together (no `video_lipsync`/veed). Each presenter phrase's audio is extracted and re-voiced through a **per-phrase** `audio_voice_convert` (ElevenLabs Voice Changer; one per phrase keeps each ≤15s clip under the converter's length cap) to the brand voice — timing preserved so the lips stay matched. There is **ONE voice per person**: a single `voice_select` is reused for all that person's phrases, and the deconstruct's `voiceover` label folds into the sole on-camera presenter (so on-camera and off-camera narration are the same voice, not two). A scene with **two on-camera speakers** can't be one clip — both lines become `tts` over a plain scene clip.
+
+
**Timing-faithful clip + extract (no overlap).** Each phrase clip is generated to its **coverage window** (the deconstruct's real scene/line timing, capped at 15s) and its converted voice is extracted to the **spoken window** (pause to pause) — *not* padded to a word-count estimate. Padding past the window was what ran the voice the clip's whole length and overlapped the next phrase; trusting the deconstruct's timing keeps consecutive phrases back-to-back and lets Seedance pace the quoted text to fit. `metadata.video.talking_scenes` still records each phrase's `scene_s` vs `est_speech_s` so you can spot a line whose words overrun its window and widen it by hand.
**Timeline-accurate picture.** Seedance can't render under 4s, so each clip is generated at the smallest allowed duration ≥ the scene length and then **trimmed back to the exact scene duration** before concat. This keeps the concatenated picture on the same timeline as the absolute-timed audio — without it, short scenes balloon to 4s, the spine runs far longer than the soundtrack, and every line plays over the wrong (slowed) scene so the lips never match. Frames are also prompted as **clean text-free plates** (no baked captions/lower-thirds/tickers/logos-as-text) so the overlay layer is the single source of on-screen text.
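The generate-then-trim arithmetic above can be sketched like this. The 4s floor and 15s ceiling come from the text; the assumption that allowed clip lengths are whole seconds between them, and the name `planClip`, are illustrative only.

```javascript
const MIN_CLIP_S = 4; // shortest clip the video model renders (from the text)
const MAX_CLIP_S = 15; // per-clip ceiling (from the text)

// Pick the smallest allowed generation length >= the scene, then trim back to the scene.
function planClip(sceneS) {
  if (sceneS <= 0) throw new RangeError("scene length must be positive");
  if (sceneS > MAX_CLIP_S) throw new RangeError("split the shot before planning a clip");
  // Assumption: allowed durations are whole seconds in [MIN_CLIP_S, MAX_CLIP_S].
  const generateS = Math.max(MIN_CLIP_S, Math.ceil(sceneS));
  return { generate_s: generateS, trim_to_s: sceneS };
}

console.log(planClip(2.4)); // { generate_s: 4, trim_to_s: 2.4 }
console.log(planClip(6.2)); // { generate_s: 7, trim_to_s: 6.2 }
```

Summing `trim_to_s` over all scenes gives the picture's total length, which is what keeps it aligned with the absolute-timed soundtrack; summing `generate_s` instead would overshoot whenever scenes are short.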
@@ -3648,9 +3663,8 @@ baker canvas run ./reference-ad.video.canvas.json
| Flag | Default | Effect |
|---|---|---|
| `--out <path>` | `<video-dir>/<name>.video.canvas.json` | Where to write the canvas (composition is copied alongside). |
-
| `--frames <mode>` | `generate` | `generate`
+
| `--frames <mode>` | `generate` | `generate` emits ONE recast keyframe per scene (the original frame is dropped so the dropped `el_*` assets drive identity); `reuse` wires the real extracted first+last frames straight into the clips (faithful, cheaper, no recast). |
| `--ambient` | off | Give silent **b-roll** scenes native diegetic ambient (Seedance `generate_audio`), mixed deep under the music bed. Talking scenes already carry voice; check levels don't muddy the mix before keeping it. |
-
| `--actor-sheets` | off | Lock a recast **person/animal that recurs across ≥2 scenes** to ONE multi-view turnaround (`image_reference_sheet`) that every frame grounds on — the strongest cross-scene identity lock. Costs extra credits per sheet; a fused sheet can over-polish, so eyeball it. |
| `--max-scenes <n>` | all source scenes | **Cost lever that reduces fidelity** — caps the deconstruct, MERGING away every scene beyond the cap (fewer cuts, lost beats). Prints a warning when set; omit it to reproduce every scene. |
| `--language <code>` | auto | Transcript/dialogue language hint (e.g. `fr`, `en`). |
| `--focus <text>` | — | Known provenance/emphasis to ground the deconstruct. |
@@ -3661,6 +3675,8 @@ baker canvas run ./reference-ad.video.canvas.json
Each scene is captured in a **shoot mode** — `ugc_selfie` (talking heads, the default look), `ugc_broll`, `studio_product` (pack shot), `lifestyle_cinematic`, or `screen_ui`. The scaffold derives one per scene (UGC by default; the cinematic and screen lanes are opt-in) and bakes its capture block into the frame and a camera default into the clip; override per scene with a `shoot_mode` field in `prompt.json`. Capture aesthetic + depth-of-field follow the mode (UGC stays flat; studio/lifestyle allow shallow DoF). Clips also carry **diegetic native audio** — the scene's own ambience described in the Seedance prompt, never music (the music bed is a separate, ducked track); set a scene's `ambient` field to steer it.
+
**Automatic by default (no flags).** A recast **person/pet recurring across ≥2 scenes** is always locked to ONE multi-view turnaround (`image_reference_sheet`) every frame grounds on. An **app/website/chat screen** is never sent to the video model — the scaffold drops the scene to a clean talking-head and seeds a phone-mockup PIP stub to fill with a real `baker images screenshot` or brand HTML block (Seedance garbles UI and a split leaves a seam). The **music bed is instrumental** (the script is never fed to the music model — it would sing over the voice), enters only after the hook, and is **sidechain-ducked** under the voice. **Word-synced TikTok captions** are wired off the deconstruct transcript whenever the ad has speech. Seeded overlays are pushed **off the subject's face** (dead-center → bottom band).
+
The two scaffold passes are billed (the full `video_deconstruct` is the heavy one); **running** the result then generates many image/video/audio assets and is not free. Defaults to vertical 1080×1920 overlays — copy + edit the composition for other aspect ratios. For on-brand overlay type, drop `brand-bold.otf`/`brand-regular.otf` into the copied `video-overlay-composition/` dir (wired via `@font-face`, with a system fallback). Richer transcription (punctuated words + paragraphs) is available via the deconstruct's `transcriber: "deepgram"` param when `DEEPGRAM_API_KEY` is set.
#### `baker canvas scaffold-static-ad <image> [flags]`
package/canvas/video-overlay-composition/index.html
CHANGED
@@ -64,12 +64,33 @@
 }
 .ov.fe { font-size: 30px; font-weight: 600; opacity: 0.9; }
 
+/* SAFE ZONES — a 9:16 talking-head's FACE fills the vertical CENTER band. Keep
+   graphics in the TOP band (≤360px) or BOTTOM band (≥1400px); never park a caption
+   dead-center over the face. The scaffold already pushes center placements to the
+   bottom, but if you hand-place, respect the bands. */
+
+/* COLLISION-SAFE TOP BAR — wrap co-timed top items (logo + trust badge + …) in ONE
+   .top-bar so they pack side-by-side with a gap and never overlap. Example:
+     <div class="top-bar" data-start="0" data-dur="30">
+       <img class="brandmark" src="logo.svg"><span class="trust">★ 4,5/5</span>
+     </div> */
+.top-bar {
+  position: absolute; top: 70px; left: 56px; right: 56px;
+  display: flex; align-items: center; gap: 20px; flex-wrap: wrap;
+}
+
+/* STAGGER — give a group `data-stagger="0.12"` and its direct children reveal one
+   after another (e.g. coverage chips appearing as each is spoken), not as a clump. */
+.chips { display: flex; flex-wrap: wrap; gap: 14px; justify-content: center; max-width: 92%; }
+
 /* 9-grid position helpers (absolute). Tweak the insets or add your own. */
 .pos-top-left { top: 90px; left: 56px; text-align: left; }
 .pos-top-center { top: 90px; left: 50%; transform: translateX(-50%); }
 .pos-top-right { top: 90px; right: 56px; text-align: right; }
 .pos-mid-left,
 .pos-center-left { top: 50%; left: 56px; transform: translateY(-50%); text-align: left; }
+/* Center = the face. Kept only for non-talking-head plates; the scaffold remaps
+   seeded center overlays to the bottom band so they never cover the subject. */
 .pos-center,
 .pos-mid-center { top: 50%; left: 50%; transform: translate(-50%,-50%); }
 .pos-mid-right,
@@ -115,26 +136,31 @@
 const els = Array.from(document.querySelectorAll('#overlay-root [data-start]'));
 
 // Generic timeline: show each element at data-start, hide at start+data-dur,
-// with an optional canned entrance from data-anim.
+// with an optional canned entrance from data-anim. With data-stagger="<sec>" the
+// element's DIRECT CHILDREN reveal one-by-one (e.g. coverage chips landing on the
+// word each is spoken) instead of as a single clump. No styling decisions here —
 // the look lives entirely in the CSS/markup above.
 for (const el of els) {
   const at = parseFloat(el.getAttribute('data-start') || '0') || 0;
   const dur = parseFloat(el.getAttribute('data-dur') || '2.5') || 2.5;
   const anim = el.getAttribute('data-anim') || '';
+  const stagger = parseFloat(el.getAttribute('data-stagger') || '0') || 0;
   // Preserve any positioning transform the CSS set (translate(...)).
   const baseTransform = getComputedStyle(el).transform;
   const tx = baseTransform && baseTransform !== 'none' ? baseTransform : '';
 
+  // Stagger animates the children; otherwise the element itself enters.
+  const targets = stagger > 0 && el.children.length ? Array.from(el.children) : el;
   tl.set(el, { visibility: 'visible' }, at);
   if (anim === 'pop') {
-    tl.fromTo(
+    tl.fromTo(targets, { opacity: 0, scale: 0.7 }, { opacity: 1, scale: 1, duration: 0.3, ease: 'back.out(1.7)', stagger }, at);
   } else if (anim === 'slide_up') {
-    tl.fromTo(
+    tl.fromTo(targets, { opacity: 0, yPercent: 30 }, { opacity: 1, yPercent: 0, duration: 0.3, ease: 'power2.out', stagger }, at);
   } else if (anim === 'slide_down') {
-    tl.fromTo(
+    tl.fromTo(targets, { opacity: 0, yPercent: -30 }, { opacity: 1, yPercent: 0, duration: 0.3, ease: 'power2.out', stagger }, at);
   } else {
     // Default / any unrecognized data-anim value: a plain fade.
-    tl.fromTo(
+    tl.fromTo(targets, { opacity: 0 }, { opacity: 1, duration: 0.25, ease: 'power1.out', stagger }, at);
   }
   tl.to(el, { opacity: 0, duration: 0.2 }, Math.max(at + 0.2, at + dur));
   tl.set(el, { visibility: 'hidden' }, at + dur + 0.21);
package/dist/{chunk-KIL2ZJST.js → chunk-LMVDA3EZ.js}
CHANGED
@@ -1224,7 +1224,12 @@ var MODEL_REGISTRY = {
       focus: { kind: "string" },
       start_s: { kind: "number", min: 0 },
       end_s: { kind: "number", min: 0 },
-
+      max_clip_s: { kind: "number", min: 1, max: 60 },
+      transcriber: { kind: "string", enum: ["groq", "deepgram"] },
+      // Real visual shot-cut timestamps (seconds) from local PySceneDetect;
+      // scene boundaries snap onto them so frames never straddle a hard cut.
+      // Shape (number[]) is enforced by the node + backend Zod schemas.
+      shot_cuts: { kind: "json" }
     }
   },
   "~google/gemini-pro-latest": {
@@ -1238,7 +1243,10 @@ var MODEL_REGISTRY = {
       focus: { kind: "string" },
       start_s: { kind: "number", min: 0 },
       end_s: { kind: "number", min: 0 },
-
+      max_clip_s: { kind: "number", min: 1, max: 60 },
+      transcriber: { kind: "string", enum: ["groq", "deepgram"] },
+      // See gemini-flash-latest above — local PySceneDetect shot-cut timestamps.
+      shot_cuts: { kind: "json" }
     }
   }
 },
@@ -2359,8 +2367,9 @@ function estimateCredits(ctx) {
 function talkingSceneSatisfied(ctx, entry, scene) {
   const nodes = ctx.canvas.nodes;
   if (typeof entry === "object" && "voice_convert_node" in entry) {
+    const nativeClipRe = new RegExp(`^s${scene}(_r\\d+)?_clip$`);
     const clipNativeAudio = nodes.some(
-      (n) => n.id
+      (n) => nativeClipRe.test(n.id) && n.type === "video_generate" && n.params?.generate_audio === true
     );
     const converted = nodes.some((n) => n.id === entry.voice_convert_node && n.type === "audio_voice_convert");
     return clipNativeAudio && converted;
@@ -3306,6 +3315,48 @@ function sniffImageMime(buf) {
   }
   return null;
 }
+function findBoxPayload(buf, start, end, type) {
+  let offset = start;
+  while (offset + 8 <= end) {
+    let size = buf.readUInt32BE(offset);
+    const boxType = buf.toString("ascii", offset + 4, offset + 8);
+    let headerLen = 8;
+    if (size === 1) {
+      if (offset + 16 > end) return null;
+      size = Number(buf.readBigUInt64BE(offset + 8));
+      headerLen = 16;
+    } else if (size === 0) {
+      size = end - offset;
+    }
+    if (size < headerLen || offset + size > end) return null;
+    if (boxType === type) return { start: offset + headerLen, end: offset + size };
+    offset += size;
+  }
+  return null;
+}
+var MVHD_DURATION_UNKNOWN = 4294967295;
+function mp4DurationMs(bytes) {
+  const buf = Buffer.isBuffer(bytes) ? bytes : Buffer.from(bytes);
+  const moov = findBoxPayload(buf, 0, buf.length, "moov");
+  if (!moov) return void 0;
+  const mvhd = findBoxPayload(buf, moov.start, moov.end, "mvhd");
+  if (!mvhd) return void 0;
+  const version = buf[mvhd.start];
+  const tsOffset = mvhd.start + 4 + (version === 1 ? 16 : 8);
+  if (version === 1) {
+    if (tsOffset + 12 > mvhd.end) return void 0;
+    const timescale2 = buf.readUInt32BE(tsOffset);
+    const duration2 = buf.readBigUInt64BE(tsOffset + 4);
+    if (timescale2 === 0 || duration2 === 0n) return void 0;
+    return Math.round(Number(duration2) / timescale2 * 1e3);
+  }
+  if (version !== 0) return void 0;
+  if (tsOffset + 8 > mvhd.end) return void 0;
+  const timescale = buf.readUInt32BE(tsOffset);
+  const duration = buf.readUInt32BE(tsOffset + 4);
+  if (timescale === 0 || duration === 0 || duration === MVHD_DURATION_UNKNOWN) return void 0;
+  return Math.round(duration / timescale * 1e3);
+}
 function inferKindFromMime(mime) {
   if (mime.startsWith("image/")) return "image";
   if (mime.startsWith("video/")) return "video";
@@ -3370,20 +3421,36 @@ async function execLocalFile(params, ctx) {
     outMime = "image/png";
     ctx.log(`ingest: rasterized SVG -> PNG (${outBytes.length}B)`);
   }
-
+  const durationMs = probeVideoDurationMs(params.expect, outBytes, ctx);
+  const ref = await uploadAndIngest({
     bytes: outBytes,
     kind: params.expect,
     mime: outMime,
-    metadata: {
-      source_path: absPath,
-      strategy: "local_file",
-      ingested_at: (/* @__PURE__ */ new Date()).toISOString(),
-      file_size: stats.size,
-      original_filename: path3.basename(absPath),
-      ...mime === SVG_MIME ? { rasterized_from: "svg" } : {}
-    },
+    metadata: localFileMetadata({ absPath, fileSize: stats.size, mime, durationMs }),
     ctx
   });
+  return withProbedDuration(ref, durationMs);
+}
+function probeVideoDurationMs(expect, bytes, ctx) {
+  if (expect !== "video") return void 0;
+  const durationMs = mp4DurationMs(bytes);
+  if (durationMs !== void 0) ctx.log(`ingest: probed video duration ${durationMs}ms`);
+  return durationMs;
+}
+function localFileMetadata(args) {
+  return {
+    source_path: args.absPath,
+    strategy: "local_file",
+    ingested_at: (/* @__PURE__ */ new Date()).toISOString(),
+    file_size: args.fileSize,
+    original_filename: path3.basename(args.absPath),
+    ...args.mime === SVG_MIME ? { rasterized_from: "svg" } : {},
+    ...args.durationMs !== void 0 ? { duration_ms: args.durationMs } : {}
+  };
+}
+function withProbedDuration(ref, durationMs) {
+  if (durationMs === void 0 || ref.kind !== "video") return ref;
+  return { ...ref, duration_ms: durationMs };
 }
 var YT_DLP_BIN = "yt-dlp";
 var YT_DLP_TIMEOUT_MS = 10 * 60 * 1e3;
@@ -3786,19 +3853,56 @@ var Track = z6.object({
|
|
|
3786
3853
|
/** Optional level adjustment in dB (negative ducks, e.g. a music bed at -12). */
|
|
3787
3854
|
gain_db: z6.number().optional()
|
|
3788
3855
|
}).strict();
|
|
3856
|
+
var DuckSpec = z6.object({
|
|
3857
|
+
track: z6.string().min(1),
|
|
3858
|
+
against: z6.array(z6.string().min(1)).min(1),
|
|
3859
|
+
threshold: z6.number().min(0).max(1).optional(),
|
|
3860
|
+
ratio: z6.number().min(1).max(20).optional(),
|
|
3861
|
+
attack: z6.number().min(0.01).optional(),
|
|
3862
|
+
release: z6.number().min(0.01).optional()
|
|
3863
|
+
}).strict();
|
|
3789
3864
|
var AudioTimelineParams = z6.object({
|
|
3790
3865
|
tracks: z6.array(Track).min(1),
|
|
3791
3866
|
/** Final track length in ms — pads short / trims long. Defaults to the natural mix length. */
|
|
3792
3867
|
total_ms: z6.number().int().positive().optional(),
|
|
3793
|
-
output_format: z6.enum(["mp3", "wav", "m4a"]).optional()
|
|
3868
|
+
output_format: z6.enum(["mp3", "wav", "m4a"]).optional(),
|
|
3869
|
+
/** Optional sidechain duck of one track under others (e.g. music under voice). */
|
|
3870
|
+
duck: DuckSpec.optional()
|
|
3794
3871
|
}).strict();
|
|
3795
3872
|
var AudioTimelineInputs = z6.record(z6.string(), z6.unknown());
|
|
3796
3873
|
var AudioTimelineOutputs = z6.object({ audio: z6.custom() }).strict();
|
|
3874
|
+
function applyDuck(params, labelFor, filterChains) {
|
|
3875
|
+
const duck = params.duck;
|
|
3876
|
+
if (!duck) return;
|
|
3877
|
+
const trackIdx = params.tracks.findIndex((t) => t.slot === duck.track);
|
|
3878
|
+
const keyIdxs = duck.against.map((s) => params.tracks.findIndex((t) => t.slot === s)).filter((i) => i >= 0);
|
|
3879
|
+
if (trackIdx < 0 || keyIdxs.length === 0) return;
|
|
3880
|
+
const keyLabels = [];
|
|
3881
|
+
for (const ki of keyIdxs) {
|
|
3882
|
+
const base = labelFor[ki];
|
|
3883
|
+
filterChains.push(`[${base}]asplit=2[${base}m][${base}k]`);
|
|
3884
|
+
labelFor[ki] = `${base}m`;
|
|
3885
|
+
keyLabels.push(`[${base}k]`);
|
|
3886
|
+
}
|
|
3887
|
+
let keyOut = keyLabels[0];
|
|
3888
|
+
if (keyLabels.length > 1) {
|
|
3889
|
+
filterChains.push(`${keyLabels.join("")}amix=inputs=${keyLabels.length}:normalize=0[duckkey]`);
|
|
3890
|
+
keyOut = "[duckkey]";
|
|
3891
|
+
}
|
|
3892
|
+
const th = duck.threshold ?? 0.03;
|
|
3893
|
+
const ra = duck.ratio ?? 8;
|
|
3894
|
+
const at = duck.attack ?? 5;
|
|
3895
|
+
const re = duck.release ?? 300;
|
|
3896
|
+
filterChains.push(
|
|
3897
|
+
`[${labelFor[trackIdx]}]${keyOut}sidechaincompress=threshold=${th}:ratio=${ra}:attack=${at}:release=${re}[ducked]`
|
|
3898
|
+
);
|
|
3899
|
+
labelFor[trackIdx] = "ducked";
|
|
3900
|
+
}
|
|
3797
3901
|
function buildAudioTimelineArgs(params) {
|
|
3798
3902
|
const fmt = params.output_format ?? "mp3";
|
|
3799
3903
|
const inputArgs = [];
|
|
3800
3904
|
const filterChains = [];
|
|
3801
|
-
const
|
|
3905
|
+
const labelFor = [];
|
|
3802
3906
|
params.tracks.forEach((track, i) => {
|
|
3803
3907
|
inputArgs.push("-i", `{{in.${track.slot}}}`);
|
|
3804
3908
|
const delayMs = Math.round(track.start_s * 1e3);
|
|
@@ -3806,8 +3910,10 @@ function buildAudioTimelineArgs(params) {
|
|
|
3806
3910
|
if (track.gain_db !== void 0) steps.push(`volume=${track.gain_db}dB`);
|
|
3807
3911
|
const label = `a${i}`;
|
|
3808
3912
|
filterChains.push(`[${i}:a]${steps.join(",")}[${label}]`);
|
|
3809
|
-
|
|
3913
|
+
labelFor[i] = label;
|
|
3810
3914
|
});
|
|
3915
|
+
applyDuck(params, labelFor, filterChains);
|
|
3916
|
+
const mixLabels = labelFor.map((l) => `[${l}]`);
|
|
3811
3917
|
let graph = `${filterChains.join(";")};${mixLabels.join("")}amix=inputs=${params.tracks.length}:normalize=0`;
|
|
3812
3918
|
if (params.total_ms !== void 0) {
|
|
3813
3919
|
const totalS = params.total_ms / 1e3;
|
|
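Reviewer note: to make the label rewiring in `applyDuck` easier to follow, here is a standalone sketch that reproduces only the duck portion of the filter graph. It assumes each track already has a label `a0`, `a1`, ... as in the hunk above. The function name `duckFilters` and its return shape are hypothetical; the filter strings and defaults are copied from the added lines.

```javascript
// Sketch only: mirrors the string building in applyDuck, with a hypothetical signature.
function duckFilters(slots, duck) {
  const labelFor = slots.map((_, i) => `a${i}`);
  const chains = [];
  const trackIdx = slots.indexOf(duck.track);
  const keyIdxs = duck.against.map((s) => slots.indexOf(s)).filter((i) => i >= 0);
  if (trackIdx < 0 || keyIdxs.length === 0) return { chains, labelFor };

  // Each key track is split in two: one copy stays in the final mix ("m"),
  // the other feeds the compressor's sidechain input ("k").
  const keyLabels = [];
  for (const ki of keyIdxs) {
    const base = labelFor[ki];
    chains.push(`[${base}]asplit=2[${base}m][${base}k]`);
    labelFor[ki] = `${base}m`;
    keyLabels.push(`[${base}k]`);
  }

  // Several keys are summed into a single sidechain signal first.
  let keyOut = keyLabels[0];
  if (keyLabels.length > 1) {
    chains.push(`${keyLabels.join("")}amix=inputs=${keyLabels.length}:normalize=0[duckkey]`);
    keyOut = "[duckkey]";
  }

  const th = duck.threshold ?? 0.03;
  const ra = duck.ratio ?? 8;
  const at = duck.attack ?? 5;
  const re = duck.release ?? 300;
  chains.push(
    `[${labelFor[trackIdx]}]${keyOut}sidechaincompress=threshold=${th}:ratio=${ra}:attack=${at}:release=${re}[ducked]`
  );
  labelFor[trackIdx] = "ducked";
  return { chains, labelFor };
}

// Music bed ducked under a single voiceover track.
const demo = duckFilters(["music", "vo"], { track: "music", against: ["vo"] });
console.log(demo.chains.join(";"));
console.log(demo.labelFor);
```

For the two-track demo the chains are `[a1]asplit=2[a1m][a1k]` followed by `[a0][a1k]sidechaincompress=threshold=0.03:ratio=8:attack=5:release=300[ducked]`, and the labels handed to the final `amix` become `ducked` and `a1m`. The input count of the final mix is unchanged because every key track still contributes exactly one label.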
@@ -3818,7 +3924,7 @@ function buildAudioTimelineArgs(params) {
|
|
|
3818
3924
|
}
|
|
3819
3925
|
var audioTimelineNode = defineNode({
|
|
3820
3926
|
id: "audio_timeline",
|
|
3821
|
-
version: "1.
|
|
3927
|
+
version: "1.1.0",
|
|
3822
3928
|
category: "audio",
|
|
3823
3929
|
location: "local",
|
|
3824
3930
|
summary: "Place and mix several audio clips onto one timeline: each track starts at a given second (optionally level-adjusted in dB), then they're combined into a single track. Built for laying a music bed plus timed voiceover lines and sound effects under a video.",
|
|
@@ -3826,6 +3932,9 @@ var audioTimelineNode = defineNode({
|
|
|
3826
3932
|
inputs: AudioTimelineInputs,
|
|
3827
3933
|
params: AudioTimelineParams,
|
|
3828
3934
|
outputs: AudioTimelineOutputs,
|
|
3935
|
+
// The mixed timeline is an audio asset — declare it so strictly-typed consumers
|
|
3936
|
+
// (e.g. audio_voice_convert reading a merged per-speaker track) resolve its kind.
|
|
3937
|
+
outputKinds: { audio: "audio" },
|
|
3829
3938
|
cost: () => ({ credits: 0, seconds_estimate: 10 }),
|
|
3830
3939
|
validateExtra({ rawParams, rawInputs }) {
|
|
3831
3940
|
const issues = [];
|
|
@@ -3840,6 +3949,21 @@ var audioTimelineNode = defineNode({
|
|
|
3840
3949
|
});
|
|
3841
3950
|
}
|
|
3842
3951
|
});
|
|
3952
|
+
const duck = parsed.data.duck;
|
|
3953
|
+
if (duck) {
|
|
3954
|
+
const slots = new Set(parsed.data.tracks.map((t) => t.slot));
|
|
3955
|
+
if (!slots.has(duck.track)) {
|
|
3956
|
+
issues.push({ path: "params.duck.track", message: `duck.track "${duck.track}" is not one of the tracks` });
|
|
3957
|
+
}
|
|
3958
|
+
duck.against.forEach((s, k) => {
|
|
3959
|
+
if (!slots.has(s)) {
|
|
3960
|
+
issues.push({ path: `params.duck.against[${k}]`, message: `duck.against "${s}" is not one of the tracks` });
|
|
3961
|
+
}
|
|
3962
|
+
if (s === duck.track) {
|
|
3963
|
+
issues.push({ path: `params.duck.against[${k}]`, message: "a track cannot duck against itself" });
|
|
3964
|
+
}
|
|
3965
|
+
});
|
|
3966
|
+
}
|
|
3843
3967
|
return issues;
|
|
3844
3968
|
},
|
|
3845
3969
|
async cacheKeyExtras() {
|
|
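Reviewer note: the `validateExtra` additions reject a `duck` spec that names slots not present in `tracks`, or that lists the ducked track as its own key. The sketch below isolates those checks in a hypothetical `duckIssues` function so the resulting issue paths can be seen on a concrete input. The paths and messages are copied from the added lines; the function name and signature are assumptions.

```javascript
// Sketch only: the duck checks from validateExtra, lifted into a hypothetical helper.
function duckIssues(tracks, duck) {
  const issues = [];
  if (!duck) return issues;
  const slots = new Set(tracks.map((t) => t.slot));
  if (!slots.has(duck.track)) {
    issues.push({ path: "params.duck.track", message: `duck.track "${duck.track}" is not one of the tracks` });
  }
  duck.against.forEach((s, k) => {
    if (!slots.has(s)) {
      issues.push({ path: `params.duck.against[${k}]`, message: `duck.against "${s}" is not one of the tracks` });
    }
    if (s === duck.track) {
      issues.push({ path: `params.duck.against[${k}]`, message: "a track cannot duck against itself" });
    }
  });
  return issues;
}

// "vo" is a valid key, "music" is the ducked track itself, "sfx" is not a track.
const tracks = [{ slot: "music", start_s: 0 }, { slot: "vo", start_s: 2 }];
console.log(duckIssues(tracks, { track: "music", against: ["vo", "music", "sfx"] }));
```

For this input the sketch reports two issues, one at `params.duck.against[1]` (self-reference) and one at `params.duck.against[2]` (unknown slot). Note that a key equal to an unknown `duck.track` would be reported twice at the same index, once per check.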
@@ -5617,6 +5741,16 @@ var videoDeconstructNode = delegated({
|
|
|
5617
5741
|
focus: z29.string().optional(),
|
|
5618
5742
|
start_s: z29.number().min(0).optional(),
|
|
5619
5743
|
end_s: z29.number().positive().optional(),
|
|
5744
|
+
// Real visual shot-cut timestamps (absolute seconds), detected locally with
|
|
5745
|
+
// ffmpeg before the deconstruct. The backend SNAPS its LLM scene boundaries
|
|
5746
|
+
// onto these and SPLITS any scene that spans one, so a scene's frames never
|
|
5747
|
+
// straddle a hard cut. `scaffold-video` populates this; omit for LLM-only cuts.
|
|
5748
|
+
shot_cuts: z29.array(z29.number().min(0)).max(200).optional(),
|
|
5749
|
+
// The video model's per-clip ceiling (seconds). A shot longer than this is
|
|
5750
|
+
// split into seamless continuation sub-scenes (shared splice frame), so long
|
|
5751
|
+
// shots reproduce in full instead of being truncated. `scaffold-video` sets
|
|
5752
|
+
// the Seedance ceiling (15); omit to disable length splitting.
|
|
5753
|
+
max_clip_s: z29.number().positive().max(60).optional(),
|
|
5620
5754
|
// Transcript provider for the blueprint's dialogue/transcript. Default
|
|
5621
5755
|
// Groq Whisper; "deepgram" routes to Nova-3 so words carry punctuation.
|
|
5622
5756
|
transcriber: z29.enum(["groq", "deepgram"]).optional()
|
|
@@ -6005,4 +6139,4 @@ export {
|
|
|
6005
6139
|
defaultRegistry,
|
|
6006
6140
|
createEngineFromEnv
|
|
6007
6141
|
};
|
|
6008
|
-
//# sourceMappingURL=chunk-
|
|
6142
|
+
//# sourceMappingURL=chunk-LMVDA3EZ.js.map
|