@koda-sl/baker-cli 0.81.1 → 0.90.1

package/README.md CHANGED
@@ -2620,7 +2620,7 @@ Pick a `source` discriminator and declare the kind you expect. See [Ingestion](#
2620
2620
 
2621
2621
  **Outputs:** `asset` → `<params.expect>` / content-determined (URL strategy table) or extension-inferred (path).
2622
2622
 
2623
- **Path-source notes:** the canvas is **not portable** to another machine without the file. Cache key folds the file's `mtime:size`, so editing the file invalidates the cache automatically. Supported extensions: `png`, `jpg`/`jpeg`, `webp`, `gif`, `avif`, `svg`, `mp4`, `webm`, `mov`, `m4v`, `mp3`, `wav`, `m4a`, `ogg`, `flac`, `json`, `txt`, `md`, `markdown`, `html`/`htm`, `csv`, `ttf`, `otf`, `woff`, `woff2`. Unknown extensions fall back to magic-byte sniffing for common image formats (and an SVG content sniff), else `kind_mismatch`. **SVG (`expect: "image"`) is rasterized to a transparent PNG on ingest** — brand logos are usually SVG, and image-generation models can't read SVG markup, so it's upscaled (longest edge near 2048px) with transparency preserved and the resulting asset carries `metadata.rasterized_from: "svg"`.
2623
+ **Path-source notes:** the canvas is **not portable** to another machine without the file. Cache key folds the file's `mtime:size`, so editing the file invalidates the cache automatically. Supported extensions: `png`, `jpg`/`jpeg`, `webp`, `gif`, `avif`, `svg`, `mp4`, `webm`, `mov`, `m4v`, `mp3`, `wav`, `m4a`, `ogg`, `flac`, `json`, `txt`, `md`, `markdown`, `html`/`htm`, `csv`, `ttf`, `otf`, `woff`, `woff2`. Unknown extensions fall back to magic-byte sniffing for common image formats (and an SVG content sniff), else `kind_mismatch`. **SVG (`expect: "image"`) is rasterized to a transparent PNG on ingest** — brand logos are usually SVG, and image-generation models can't read SVG markup, so it's upscaled (longest edge near 2048px) with transparency preserved and the resulting asset carries `metadata.rasterized_from: "svg"`. **Video (`expect: "video"`) duration is probed from the file's ISO-BMFF (`mp4`/`mov`/`m4v`) header** and stamped as the canonical `duration_ms` (and `metadata.duration_ms`); other containers (e.g. `webm`) leave it unset. Downstream `video_deconstruct` uses this declared duration to size its ingest-poll timeout and preflight — without it those fall back to worst-case budgets and a single deconstruct step can hit the action time limit.
2624
2624
 
2625
2625
  **Cost:** 0 engine credits for direct fetch + yt-dlp + local file. Handinger charges per scrape.
2626
2626
 
@@ -3263,6 +3263,8 @@ Reverse-engineer a video into a replication-grade blueprint: scene boundaries, t
3263
3263
 
3264
3264
  Over-length runs fail with a message that includes ready-to-use suggested windows, so the loop is self-correcting.
3265
3265
 
3266
+ > **Async execution.** Full-mode `video_deconstruct` is a multi-minute job, so the backend runs it as a durable background workflow rather than over a single long-lived HTTP request: the exec call returns immediately and the CLI transparently polls the job to completion. This is invisible to callers — the node's inputs, params, and outputs are unchanged — but it means a stalled provider can no longer drop the connection and cause a whole-node retry storm. (`mode:"index"` stays synchronous.)
3267
+
3266
3268
  **Inputs**
3267
3269
 
3268
3270
  | Slot | Kind | Required | Accepted MIMEs |
@@ -3280,6 +3282,8 @@ Over-length runs fail with a message that includes ready-to-use suggested window
3280
3282
  | `max_scenes` | number | no | `20` (full ≤3 min) / `40` (longer) / `60` (index) | 1–60; extra cuts merge into the last scene |
3281
3283
  | `focus` | string | no | — | extra extraction emphasis (e.g. `"overlay typography"`) |
3282
3284
  | `start_s` / `end_s` | number | no | whole video | analysis window (absolute seconds) for chunking long videos; window ≤480s |
3285
+ | `shot_cuts` | number[] | no | — | real visual shot-cut timestamps (absolute seconds); scene boundaries snap onto them and any scene spanning one is split, so a scene's frames never straddle a hard cut. `scaffold-video` detects these with PySceneDetect and passes them automatically |
3286
+ | `max_clip_s` | number | no | — | the video model's per-clip ceiling (seconds); a shot longer than this is split into equal continuation sub-scenes that share their splice (full-length, seamless — no truncation), each tagged `continues_previous`. `scaffold-video` passes Seedance's 15 |
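A rough illustration of what the two new params describe, not the backend's actual algorithm (which is not part of this diff). The helper names `splitShot` and `splitAtCuts` are hypothetical, the snap-to-nearest-cut step is left out because its tolerance is not documented, and tagging only the second and later sub-scenes with `continues_previous: true` is an assumption.

```javascript
// Hypothetical helper: split one shot into equal continuation sub-scenes
// when it is longer than the video model's per-clip ceiling (max_clip_s).
function splitShot(startS, endS, maxClipS) {
  const length = endS - startS;
  if (!(length > 0) || !(maxClipS > 0)) {
    throw new RangeError("shot length and ceiling must be positive");
  }
  const parts = Math.max(1, Math.ceil(length / maxClipS));
  const step = length / parts;
  const subScenes = [];
  for (let i = 0; i < parts; i++) {
    subScenes.push({
      start_s: startS + i * step,
      // The last part ends exactly on the shot end to avoid float drift.
      end_s: i === parts - 1 ? endS : startS + (i + 1) * step,
      // Assumption: the first sub-scene does not continue anything.
      continues_previous: i > 0,
    });
  }
  return subScenes;
}

// Hypothetical helper: split a scene at every shot cut that falls strictly
// inside it, so no resulting scene straddles a hard cut.
function splitAtCuts(startS, endS, shotCuts) {
  const inner = shotCuts
    .filter((t) => t > startS && t < endS)
    .sort((a, b) => a - b);
  const edges = [startS, ...inner, endS];
  return edges.slice(0, -1).map((s, i) => ({ start_s: s, end_s: edges[i + 1] }));
}

console.log(splitShot(0, 40, 15).length); // 3
console.log(splitAtCuts(0, 10, [12, 4, 7]).length); // 3
```

With a 15s ceiling, a 40s shot becomes three sub-scenes of about 13.3s each, and each sub-scene starts exactly where the previous one ends, which is the shared splice the table row refers to.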
3283
3287
 
3284
3288
  **Outputs**
3285
3289
 
@@ -3613,15 +3617,27 @@ Validate, then execute the graph. Blocks until done. Logs one line per node. Ret
3613
3617
  Turn a reference video into a **runnable, self-validated reproduction canvas** in one command — the video counterpart of `scaffold-static-ad`. It runs **billed passes** up front:
3614
3618
 
3615
3619
  1. **`video_deconstruct`** (`~google/gemini-pro-latest`, full mode) — reverse-engineers the video into a scene-by-scene blueprint + word-level transcript, written next to the canvas as **`prompt.json`**. Each scene's `start_frame_prompt`/`end_frame_prompt` are inlined into the frame nodes (see below); `prompt.json` then rides along as the shared **global style reference** (palette, cast cohesion) and as provenance.
3616
- 2. **recurring-element selection** (`~google/gemini-flash-latest`) — picks only the **recurring, identity-critical** elements (each `global.cast` person, a recurring animal, a showcased product, the brand logo) and the scene indices each appears in. One real reference image grounds each element across **every** frame it appears in, so the same actor stays consistent the whole video.
3620
+ 2. **recurring-element selection** (`~google/gemini-flash-latest`) — picks only the **recurring, identity-critical** elements (each `global.cast` person, a recurring animal, a showcased product, the brand logo) and the scene indices each appears in. One real reference image grounds each element across **every** frame it appears in, so the same actor stays consistent the whole video. This selection runs as a **second pass over a slimmed blueprint** (cast/branding + each scene's frame prompts only) — a long ad's full blueprint can exceed the engine's inline-prompt limit, so the heavy per-scene detail (dialogue, overlays, transcript) the selector never reads is dropped before the prompt.
3621
+
3622
+ Before the deconstruct it runs a **local shot-cut pass** on the source file with **[PySceneDetect](https://www.scenedetect.com)** (`scenedetect` CLI, `detect-content` — the battle-tested HSV content detector, installed in the canvas sandbox) and passes the cut timestamps as `video_deconstruct`'s `shot_cuts`. The deconstruct snaps its scene boundaries onto those real cuts and **splits any scene that spans one**, so a scene's frames can never straddle a hard cut (the failure where a scene's start frame was the couch and its end frame the b-roll). Two knobs tuned for fast social ads: the content **threshold defaults to 18** (PySceneDetect's own default of 27 misses soft reframes) and the **minimum scene length is dropped to 0.25s** (its default ~0.6s merges away rapid montage flashes) — so super-fast cuts survive and become cheap still-holds downstream. Tune sensitivity per video with **`--shot-threshold N`** (lower = more cuts). If `scenedetect` is unavailable it warns loudly and degrades to LLM-only boundaries.
3623
+
3624
+ A shot longer than the video model's per-clip ceiling (Seedance's 15s, passed as `video_deconstruct`'s `max_clip_s`) is split into equal **continuation sub-scenes** that share their splice boundary exactly — so a long shot is reproduced in **full** (no truncation) and joins seamlessly. Each sub-scene carries `continues_previous`.
3625
+
3626
+ It then scaffolds the full pipeline like an **editing timeline**: each clip gets a **static-ad-grade start AND end keyframe** (`image_generate`, each with its **own self-contained `params.prompt`** — edit a frame node to change only that frame; `prompt.json` wired as the **authoritative shared `target_blueprint`**, plus a per-element reference legend). Each keyframe is **fully recast** to the dropped `el_*` reference images, and — exactly like a static ad — the **original extracted frame is kept LAST as a composition anchor** (the RECAST block + "ignore its brand/colors" legend stop it leaking identity). Both keyframes feed `video_generate` (`first_frame`+`last_frame`, so Seedance interpolates real in-shot motion; ultra-detailed motion brief; duration snapped to the nearest allowed clip length). Every keyframe grounds **only on its own extracted frame + `el_*` slots** — no reference to any other generated frame — so all images render **in parallel** (no cascade). Source-frame URLs are **deduped** (each ingested once). `--frames reuse` wires the real source frame straight in.
3627
+
3628
+ **Composited scenes (split-screen / picture-in-picture / keyed presenter).** Real ads aren't always one full-frame shot — a frame can be **persistently divided** (b-roll on top, a presenter talking on the bottom) or **layer a presenter** over background footage (boxed in a corner, or green-screen keyed). The deconstruct now reports this per scene as `scene.composition` (`layout: split_screen | pip | keyed_overlay`, with one `region` per stream — each its own clean-plate frame + motion brief, the talking-head region flagged `is_presenter`). The scaffold reproduces a composited scene by building **one clip per region** (`s<i>_r0_*`, `s<i>_r1_*`, …) and compositing them with ffmpeg: a split-screen `vstack`/`hstack` (stack direction read from the region **panels**, so a top/bottom split always stacks vertically), or a picture-in-picture `overlay` of the presenter inset at its corner. A **keyed** presenter is first cut to transparency by `video_background_remove` (`s<i>_key`), then overlaid. The presenter region carries the native lip-synced voice; b-roll/render panels stay silent. To change a layout, edit `composition` in `prompt.json` and re-scaffold, or hand-edit the `s<i>_composite` ffmpeg args. Plain full-frame scenes (the default) are unaffected.
3629
+
3630
+ **Montage flashes held as stills.** A rapid-cut beat shorter than ~2s with no spoken line is a **flash** — Seedance's shortest clip is 4s, so generating one (then trimming away most of it) burns credits for motion no viewer perceives. The scaffold instead **holds one keyframe as a still** for the scene length (a cheap ffmpeg loop, no billed `video_generate`), same look at a fraction of the cost. Talking/ambient beats keep a real clip (they need motion + native audio).
3631
+
3632
+ **The phrase model (voice cut at pauses, not at visual cuts).** The voice is grouped into **phrases** — runs of continuous speech with no real pause, which may span several visual scenes. A phrase is voiced ONCE (so a sentence the deconstruct split at a visual cut never breaks mid-word): if the speaker is **shown** anywhere in the phrase it's a single Seedance clip (`s<anchor>_clip`, native lip-sync + audio) re-voiced to the brand voice; if the speaker is **never shown** it's one ElevenLabs `tts` read. The picture is then assembled **scene by scene**: a scene that shows the speaker **slices its window** out of the phrase clip (`s<i>_seg`, an ffmpeg `-ss`/`-t` cut — video and audio come from the *same* clip, so lip-sync holds), and a **b-roll cutaway** gets its own silent clip while the phrase's voice plays underneath. "Shown" is decided by the **presenter element's per-scene presence**, not just who's speaking — a scene where a cast member narrates over b-roll (their element absent) is treated as a cutaway, so the talking head never appears where the original cut away. A presenter run longer than the 15s Seedance ceiling **splits at a scene boundary** into contiguous takes (each its own clip + convert), so a sliced window never reads past its clip. A b-roll cutaway *inside* a phrase lands at an **approximate** time (Seedance exposes no word timing) — nudge the scene boundary if it's off its beat.
3617
3633
 
3618
- It then scaffolds the full pipeline: per scene, two **static-ad-grade frames** (`image_generate` with its **own self-contained `params.prompt`**; edit a frame node to change only that frame; `prompt.json` is wired as a demoted shared-style `target_blueprint`, a per-element reference legend, the real extracted frame as a composition anchor) → `video_generate` (Seedance first/last-frame, fed an ultra-detailed motion brief composed from the scene's action, camera, dialogue, and transcript; duration snapped to the nearest allowed clip length).
3634
+ **A starting point, not a locked render.** The canvas mirrors the reference's structure to give you a faithful scaffold, but `metadata.todo.full_flexibility` makes explicit that the agent has **full editing freedom**: add / delete / reorder / split / merge scenes, re-prompt any frame or motion brief, change a scene's layout (full-frame or composite), or rewrite any line. The content-addressed cache re-bills only what changes, and `baker canvas validate` re-checks timing/lip-sync after any edit.
3619
3635
 
3620
- **Sequenced audio.** Dialogue is a back-and-forth on one absolute timeline, so each **contiguous same-speaker turn** becomes its own `tts` placed at its real `start_s` — turns alternate and never stack (the earlier design concatenated each speaker's whole monologue at their earliest timestamp, so two voices played in parallel for the entire video). Each speaker is locked to one shared `voice_select` voice; a `sound_effect` per SFX and a `music` bed (styled after the AudD-identified track when available, ducked under the voices, and started at the reference's `music.starts_at_s` rather than always at 0) round out the mix (`audio_timeline`). The final mux normalizes the soundtrack to **−14 LUFS (stereo)** so the output plays loud in every player — the raw mix is quiet mono, which reads as "no sound."
3636
+ **Sequenced audio.** Dialogue is a back-and-forth on one absolute timeline, so each **contiguous same-speaker turn** becomes its own `tts` placed at its real `start_s` — turns alternate and never stack (the earlier design concatenated each speaker's whole monologue at their earliest timestamp, so two voices played in parallel for the entire video). Each speaker is locked to one shared `voice_select` voice; a `sound_effect` per SFX and a `music` bed (conditioned on the **ad's own script + emotional arc** so the bed supports the message, styled after the AudD-identified track when available, ducked under the voices, and started at the reference's `music.starts_at_s` rather than always at 0) round out the mix (`audio_timeline`). The final mux normalizes the soundtrack to **−14 LUFS (stereo)** so the output plays loud in every player — the raw mix is quiet mono, which reads as "no sound."
3621
3637
 
3622
- **Native talking heads (no post-hoc lip-sync).** Seedance 2.0 generates lip-synced speech **natively** — a scene with a **single on-camera speaker** (a `dialogue.speaker` that maps to a `global.cast` member, when `global.voiceover.mode` isn't pure `voiceover`/`none`) puts the line in the clip's prompt with `generate_audio`, so lips and voice are generated together (no `video_lipsync`/veed). Each native clip's audio is then re-voiced through `audio_voice_convert` (ElevenLabs Voice Changer) to **one brand voice** so every scene sounds like the same person, timing preserved so the lips stay matched. Off-camera narration keeps a sequenced `tts` per turn; scenes with two on-camera speakers stay native-per-clip and are flagged for you to split or pick a primary.
3638
+ **Native talking heads + one voice per person (no post-hoc lip-sync).** Seedance 2.0 generates lip-synced speech **natively** — a presenter phrase puts the full phrase in the clip's prompt with `generate_audio`, so lips and voice are generated together (no `video_lipsync`/veed). Each presenter phrase's audio is extracted and re-voiced through a **per-phrase** `audio_voice_convert` (ElevenLabs Voice Changer; one per phrase keeps each ≤15s clip under the converter's length cap) to the brand voice — timing preserved so the lips stay matched. There is **ONE voice per person**: a single `voice_select` is reused for all that person's phrases, and the deconstruct's `voiceover` label folds into the sole on-camera presenter (so on-camera and off-camera narration are the same voice, not two). A scene with **two on-camera speakers** can't be one clip — both lines become `tts` over a plain scene clip.
3623
3639
 
3624
- **Native-audio drift guard.** Seedance paces a spoken line to fill the clip, so each native clip is generated **long enough for the estimated speech** (≈150 wpm), not just the visual scene length, and its audio is extracted at the **full line length** rather than hard-trimmed to the scene, so a line that runs a beat long isn't cut mid-word (it continues over the next scene as natural VO). `metadata.video.talking_scenes` records each scene's `scene_s` vs `est_speech_s` so you can spot a line that overruns its cut.
3640
+ **Timing-faithful clip + extract (no overlap).** Each phrase clip is generated to its **coverage window** (the deconstruct's real scene/line timing, capped at 15s) and its converted voice is extracted to the **spoken window** (pause to pause) — *not* padded to a word-count estimate. Padding past the window was what ran the voice the clip's whole length and overlapped the next phrase; trusting the deconstruct's timing keeps consecutive phrases back-to-back and lets Seedance pace the quoted text to fit. `metadata.video.talking_scenes` still records each phrase's `scene_s` vs `est_speech_s` so you can spot a line whose words overrun its window and widen it by hand.
3625
3641
 
3626
3642
  **Timeline-accurate picture.** Seedance can't render under 4s, so each clip is generated at the smallest allowed duration ≥ the scene length and then **trimmed back to the exact scene duration** before concat. This keeps the concatenated picture on the same timeline as the absolute-timed audio — without it, short scenes balloon to 4s, the spine runs far longer than the soundtrack, and every line plays over the wrong (slowed) scene so the lips never match. Frames are also prompted as **clean text-free plates** (no baked captions/lower-thirds/tickers/logos-as-text) so the overlay layer is the single source of on-screen text.
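The generate-then-trim rule and the still-hold rule for montage flashes can be sketched as a small per-scene planner. This is only an illustration under stated assumptions: `planScene`, the `has_speech` field, and the list of allowed clip lengths (whole seconds from 4 to 15) are hypothetical. The README states only the 4s minimum, the 15s ceiling, and the roughly 2s flash threshold.

```javascript
// Assumption: allowed clip lengths are whole seconds between the documented
// 4s minimum and 15s ceiling. The real list is not given in this diff.
const ALLOWED_CLIP_S = [4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15];
// The README describes a flash as a beat shorter than about 2s with no speech.
const FLASH_MAX_S = 2;

// Hypothetical planner: decide how one scene is produced.
function planScene(scene) {
  const sceneS = scene.end_s - scene.start_s;
  if (sceneS < FLASH_MAX_S && !scene.has_speech) {
    // Hold one keyframe for the scene length instead of generating a clip.
    return { kind: "still_hold", hold_s: sceneS };
  }
  // Smallest allowed duration that covers the scene.
  const generateS = ALLOWED_CLIP_S.find((d) => d >= sceneS);
  if (generateS === undefined) {
    throw new RangeError("scene exceeds the per-clip ceiling; split it first");
  }
  // Generate at the allowed length, then trim back to the exact scene length
  // so the picture stays on the same timeline as the audio.
  return { kind: "clip", generate_s: generateS, trim_to_s: sceneS };
}

console.log(planScene({ start_s: 10, end_s: 12.5, has_speech: true }));
// { kind: 'clip', generate_s: 4, trim_to_s: 2.5 }
```

A 2.5s talking scene is generated at 4s and trimmed to 2.5s, while a silent 1.2s beat costs no `video_generate` call at all.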
3627
3643
 
@@ -3646,7 +3662,7 @@ baker canvas run ./reference-ad.video.canvas.json
3646
3662
  | Flag | Default | Effect |
3647
3663
  |---|---|---|
3648
3664
  | `--out <path>` | `<video-dir>/<name>.video.canvas.json` | Where to write the canvas (composition is copied alongside). |
3649
- | `--frames <mode>` | `generate` | `generate` regenerates frames anchored on the originals; `reuse` wires the real extracted frames straight into the clips (faithful, cheaper). |
3665
+ | `--frames <mode>` | `generate` | `generate` emits ONE recast keyframe per scene (the original frame is dropped so the dropped `el_*` assets drive identity); `reuse` wires the real extracted first+last frames straight into the clips (faithful, cheaper, no recast). |
3650
3666
  | `--ambient` | off | Give silent **b-roll** scenes native diegetic ambient (Seedance `generate_audio`), mixed deep under the music bed. Talking scenes already carry voice; check levels don't muddy the mix before keeping it. |
3651
3667
  | `--actor-sheets` | off | Lock a recast **person/animal that recurs across ≥2 scenes** to ONE multi-view turnaround (`image_reference_sheet`) that every frame grounds on — the strongest cross-scene identity lock. Costs extra credits per sheet; a fused sheet can over-polish, so eyeball it. |
3652
3668
  | `--max-scenes <n>` | all source scenes | **Cost lever that reduces fidelity** — caps the deconstruct, MERGING away every scene beyond the cap (fewer cuts, lost beats). Prints a warning when set; omit it to reproduce every scene. |
@@ -770,13 +770,48 @@ function shouldRetry(err) {
770
770
  }
771
771
 
772
772
  // src/engine/client/backend-client.ts
773
+ function isAsyncJob(res) {
774
+ return typeof res.job_id === "string";
775
+ }
776
+ var sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));
777
+ var JOB_POLL_INTERVAL_MS = 3e3;
778
+ var JOB_POLL_MAX_MS = 20 * 60 * 1e3;
773
779
  var BackendClient = class {
774
780
  http;
775
781
  constructor(opts) {
776
782
  this.http = new HttpClient(opts);
777
783
  }
778
- exec(req, signal) {
779
- return this.http.postJson("/api/canvas/nodes/exec", req, signal);
784
+ async exec(req, signal) {
785
+ const res = await this.http.postJson("/api/canvas/nodes/exec", req, signal);
786
+ if (isAsyncJob(res)) {
787
+ return await this.pollJob(res.job_id, signal);
788
+ }
789
+ return res;
790
+ }
791
+ async pollJob(jobId, signal) {
792
+ const deadline = Date.now() + JOB_POLL_MAX_MS;
793
+ const path16 = `/api/canvas/jobs/${encodeURIComponent(jobId)}`;
794
+ while (true) {
795
+ if (signal?.aborted) {
796
+ throw new BackendHttpError({ kind: "network", cause: signal.reason ?? new Error("aborted") });
797
+ }
798
+ const job = await this.http.getJson(path16, signal);
799
+ if (job.status === "completed") return job.result;
800
+ if (job.status === "failed") {
801
+ throw new BackendHttpError({
802
+ kind: "provider",
803
+ status: job.error.status ?? 502,
804
+ provider: job.error.provider,
805
+ code: job.error.code ?? "provider_error",
806
+ message: job.error.message ?? "deconstruct failed",
807
+ retryable: job.error.retryable ?? false
808
+ });
809
+ }
810
+ if (Date.now() > deadline) {
811
+ throw new BackendHttpError({ kind: "timeout", message: `job ${jobId} did not finish in time` });
812
+ }
813
+ await sleep(JOB_POLL_INTERVAL_MS);
814
+ }
780
815
  }
781
816
  presignAssetUpload(sha256, mime, signal) {
782
817
  return this.http.postJson(
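The polling pattern added in this hunk can be tried without a backend by injecting the job fetcher, the sleep function, and the clock. The sketch below is a stand-alone approximation, not the bundled code: `pollUntilDone` and its option names are hypothetical, and it throws plain `Error` objects instead of `BackendHttpError`. One deliberate difference: the bundled version reads `job.error.status` directly, which would raise a `TypeError` if a failed job ever arrived without an `error` object, so the sketch uses optional chaining.

```javascript
// Hypothetical stand-alone poll loop. Defaults mirror the constants in the
// diff: poll every 3s, give up after 20 minutes (about 400 polls).
async function pollUntilDone(getJob, options = {}) {
  const {
    intervalMs = 3000,
    maxMs = 20 * 60 * 1000,
    sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms)),
    now = Date.now,
  } = options;
  const deadline = now() + maxMs;
  while (true) {
    const job = await getJob();
    if (job.status === "completed") return job.result;
    if (job.status === "failed") {
      // Optional chaining tolerates a failed job with no error payload.
      const err = new Error(job.error?.message ?? "job failed");
      err.retryable = job.error?.retryable ?? false;
      throw err;
    }
    if (now() > deadline) throw new Error("job did not finish in time");
    await sleep(intervalMs);
  }
}
```

Passing a no-op `sleep` and a fake `now` makes the loop deterministic in tests. In real use the defaults apply, and `getJob` would wrap the GET to the job endpoint.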
@@ -1189,7 +1224,12 @@ var MODEL_REGISTRY = {
1189
1224
  focus: { kind: "string" },
1190
1225
  start_s: { kind: "number", min: 0 },
1191
1226
  end_s: { kind: "number", min: 0 },
1192
- transcriber: { kind: "string", enum: ["groq", "deepgram"] }
1227
+ max_clip_s: { kind: "number", min: 1, max: 60 },
1228
+ transcriber: { kind: "string", enum: ["groq", "deepgram"] },
1229
+ // Real visual shot-cut timestamps (seconds) from local PySceneDetect;
1230
+ // scene boundaries snap onto them so frames never straddle a hard cut.
1231
+ // Shape (number[]) is enforced by the node + backend Zod schemas.
1232
+ shot_cuts: { kind: "json" }
1193
1233
  }
1194
1234
  },
1195
1235
  "~google/gemini-pro-latest": {
@@ -1203,7 +1243,10 @@ var MODEL_REGISTRY = {
1203
1243
  focus: { kind: "string" },
1204
1244
  start_s: { kind: "number", min: 0 },
1205
1245
  end_s: { kind: "number", min: 0 },
1206
- transcriber: { kind: "string", enum: ["groq", "deepgram"] }
1246
+ max_clip_s: { kind: "number", min: 1, max: 60 },
1247
+ transcriber: { kind: "string", enum: ["groq", "deepgram"] },
1248
+ // See gemini-flash-latest above — local PySceneDetect shot-cut timestamps.
1249
+ shot_cuts: { kind: "json" }
1207
1250
  }
1208
1251
  }
1209
1252
  },
@@ -2324,8 +2367,9 @@ function estimateCredits(ctx) {
2324
2367
  function talkingSceneSatisfied(ctx, entry, scene) {
2325
2368
  const nodes = ctx.canvas.nodes;
2326
2369
  if (typeof entry === "object" && "voice_convert_node" in entry) {
2370
+ const nativeClipRe = new RegExp(`^s${scene}(_r\\d+)?_clip$`);
2327
2371
  const clipNativeAudio = nodes.some(
2328
- (n) => n.id === `s${scene}_clip` && n.type === "video_generate" && n.params?.generate_audio === true
2372
+ (n) => nativeClipRe.test(n.id) && n.type === "video_generate" && n.params?.generate_audio === true
2329
2373
  );
2330
2374
  const converted = nodes.some((n) => n.id === entry.voice_convert_node && n.type === "audio_voice_convert");
2331
2375
  return clipNativeAudio && converted;
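The widened ID match in this hunk is easy to check in isolation. The snippet below rebuilds the same regular expression and runs it over node IDs of the shapes the README mentions (`s<i>_clip`, `s<i>_r0_clip`, `s<i>_seg`, `s<i>_key`, `s<i>_composite`). The wrapper name `nativeClipPattern` is hypothetical.

```javascript
// Same pattern as the diff: the scene's own clip, or any per-region clip.
function nativeClipPattern(scene) {
  return new RegExp(`^s${scene}(_r\\d+)?_clip$`);
}

const re = nativeClipPattern(3);
const ids = ["s3_clip", "s3_r0_clip", "s3_r1_clip", "s30_clip", "s3_seg", "s3_key", "s3_composite"];
const matches = ids.filter((id) => re.test(id));

console.log(matches); // [ 's3_clip', 's3_r0_clip', 's3_r1_clip' ]
```

Because the pattern is anchored at both ends, scene 3 does not match `s30_clip`, and region clips of a composited scene now count toward the native-audio check.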
@@ -3271,6 +3315,48 @@ function sniffImageMime(buf) {
3271
3315
  }
3272
3316
  return null;
3273
3317
  }
3318
+ function findBoxPayload(buf, start, end, type) {
3319
+ let offset = start;
3320
+ while (offset + 8 <= end) {
3321
+ let size = buf.readUInt32BE(offset);
3322
+ const boxType = buf.toString("ascii", offset + 4, offset + 8);
3323
+ let headerLen = 8;
3324
+ if (size === 1) {
3325
+ if (offset + 16 > end) return null;
3326
+ size = Number(buf.readBigUInt64BE(offset + 8));
3327
+ headerLen = 16;
3328
+ } else if (size === 0) {
3329
+ size = end - offset;
3330
+ }
3331
+ if (size < headerLen || offset + size > end) return null;
3332
+ if (boxType === type) return { start: offset + headerLen, end: offset + size };
3333
+ offset += size;
3334
+ }
3335
+ return null;
3336
+ }
3337
+ var MVHD_DURATION_UNKNOWN = 4294967295;
3338
+ function mp4DurationMs(bytes) {
3339
+ const buf = Buffer.isBuffer(bytes) ? bytes : Buffer.from(bytes);
3340
+ const moov = findBoxPayload(buf, 0, buf.length, "moov");
3341
+ if (!moov) return void 0;
3342
+ const mvhd = findBoxPayload(buf, moov.start, moov.end, "mvhd");
3343
+ if (!mvhd) return void 0;
3344
+ const version = buf[mvhd.start];
3345
+ const tsOffset = mvhd.start + 4 + (version === 1 ? 16 : 8);
3346
+ if (version === 1) {
3347
+ if (tsOffset + 12 > mvhd.end) return void 0;
3348
+ const timescale2 = buf.readUInt32BE(tsOffset);
3349
+ const duration2 = buf.readBigUInt64BE(tsOffset + 4);
3350
+ if (timescale2 === 0 || duration2 === 0n) return void 0;
3351
+ return Math.round(Number(duration2) / timescale2 * 1e3);
3352
+ }
3353
+ if (version !== 0) return void 0;
3354
+ if (tsOffset + 8 > mvhd.end) return void 0;
3355
+ const timescale = buf.readUInt32BE(tsOffset);
3356
+ const duration = buf.readUInt32BE(tsOffset + 4);
3357
+ if (timescale === 0 || duration === 0 || duration === MVHD_DURATION_UNKNOWN) return void 0;
3358
+ return Math.round(duration / timescale * 1e3);
3359
+ }
3274
3360
  function inferKindFromMime(mime) {
3275
3361
  if (mime.startsWith("image/")) return "image";
3276
3362
  if (mime.startsWith("video/")) return "video";
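To see the `moov`/`mvhd` probe work end to end, one can build a tiny synthetic ISO-BMFF buffer and read the duration back. The sketch below is a simplified re-implementation for illustration only: it handles 32-bit box sizes and version-0 `mvhd` boxes, whereas the bundled `findBoxPayload` and `mp4DurationMs` also cover 64-bit sizes and version 1. The helper names `box`, `mvhdV0`, `findBox`, and `durationMs` are hypothetical.

```javascript
// Build a box: 4-byte big-endian size, 4-byte ASCII type, then the payload.
function box(type, payload) {
  const header = Buffer.alloc(8);
  header.writeUInt32BE(8 + payload.length, 0);
  header.write(type, 4, "ascii");
  return Buffer.concat([header, payload]);
}

// Version-0 mvhd payload: version/flags (4 bytes), creation time (4),
// modification time (4), timescale (4), duration (4).
function mvhdV0(timescale, duration) {
  const payload = Buffer.alloc(20);
  payload.writeUInt32BE(timescale, 12);
  payload.writeUInt32BE(duration, 16);
  return payload;
}

// Simplified box walk: 32-bit sizes only.
function findBox(buf, start, end, type) {
  let offset = start;
  while (offset + 8 <= end) {
    const size = buf.readUInt32BE(offset);
    if (size < 8 || offset + size > end) return null;
    if (buf.toString("ascii", offset + 4, offset + 8) === type) {
      return { start: offset + 8, end: offset + size };
    }
    offset += size;
  }
  return null;
}

// Simplified probe: version-0 mvhd only.
function durationMs(buf) {
  const moov = findBox(buf, 0, buf.length, "moov");
  const mvhd = moov && findBox(buf, moov.start, moov.end, "mvhd");
  if (!mvhd || buf[mvhd.start] !== 0 || mvhd.start + 20 > mvhd.end) return undefined;
  const timescale = buf.readUInt32BE(mvhd.start + 12);
  const duration = buf.readUInt32BE(mvhd.start + 16);
  if (timescale === 0 || duration === 0 || duration === 0xffffffff) return undefined;
  return Math.round((duration / timescale) * 1000);
}

const file = Buffer.concat([
  box("ftyp", Buffer.concat([Buffer.from("isom"), Buffer.alloc(4)])),
  box("moov", box("mvhd", mvhdV0(600, 9300))),
]);

console.log(durationMs(file)); // 15500
```

A timescale of 600 units per second and a duration of 9300 units give 15.5 seconds. Bytes that do not parse as boxes (a `webm` file, for example) return `undefined`, which matches the README note that other containers leave `duration_ms` unset.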
@@ -3335,20 +3421,36 @@ async function execLocalFile(params, ctx) {
3335
3421
  outMime = "image/png";
3336
3422
  ctx.log(`ingest: rasterized SVG -> PNG (${outBytes.length}B)`);
3337
3423
  }
3338
- return await uploadAndIngest({
3424
+ const durationMs = probeVideoDurationMs(params.expect, outBytes, ctx);
3425
+ const ref = await uploadAndIngest({
3339
3426
  bytes: outBytes,
3340
3427
  kind: params.expect,
3341
3428
  mime: outMime,
3342
- metadata: {
3343
- source_path: absPath,
3344
- strategy: "local_file",
3345
- ingested_at: (/* @__PURE__ */ new Date()).toISOString(),
3346
- file_size: stats.size,
3347
- original_filename: path3.basename(absPath),
3348
- ...mime === SVG_MIME ? { rasterized_from: "svg" } : {}
3349
- },
3429
+ metadata: localFileMetadata({ absPath, fileSize: stats.size, mime, durationMs }),
3350
3430
  ctx
3351
3431
  });
3432
+ return withProbedDuration(ref, durationMs);
3433
+ }
3434
+ function probeVideoDurationMs(expect, bytes, ctx) {
3435
+ if (expect !== "video") return void 0;
3436
+ const durationMs = mp4DurationMs(bytes);
3437
+ if (durationMs !== void 0) ctx.log(`ingest: probed video duration ${durationMs}ms`);
3438
+ return durationMs;
3439
+ }
3440
+ function localFileMetadata(args) {
3441
+ return {
3442
+ source_path: args.absPath,
3443
+ strategy: "local_file",
3444
+ ingested_at: (/* @__PURE__ */ new Date()).toISOString(),
3445
+ file_size: args.fileSize,
3446
+ original_filename: path3.basename(args.absPath),
3447
+ ...args.mime === SVG_MIME ? { rasterized_from: "svg" } : {},
3448
+ ...args.durationMs !== void 0 ? { duration_ms: args.durationMs } : {}
3449
+ };
3450
+ }
3451
+ function withProbedDuration(ref, durationMs) {
3452
+ if (durationMs === void 0 || ref.kind !== "video") return ref;
3453
+ return { ...ref, duration_ms: durationMs };
3352
3454
  }
3353
3455
  var YT_DLP_BIN = "yt-dlp";
3354
3456
  var YT_DLP_TIMEOUT_MS = 10 * 60 * 1e3;
@@ -3791,6 +3893,9 @@ var audioTimelineNode = defineNode({
3791
3893
  inputs: AudioTimelineInputs,
3792
3894
  params: AudioTimelineParams,
3793
3895
  outputs: AudioTimelineOutputs,
3896
+ // The mixed timeline is an audio asset — declare it so strictly-typed consumers
3897
+ // (e.g. audio_voice_convert reading a merged per-speaker track) resolve its kind.
3898
+ outputKinds: { audio: "audio" },
3794
3899
  cost: () => ({ credits: 0, seconds_estimate: 10 }),
3795
3900
  validateExtra({ rawParams, rawInputs }) {
3796
3901
  const issues = [];
@@ -5582,6 +5687,16 @@ var videoDeconstructNode = delegated({
5582
5687
  focus: z29.string().optional(),
5583
5688
  start_s: z29.number().min(0).optional(),
5584
5689
  end_s: z29.number().positive().optional(),
5690
+ // Real visual shot-cut timestamps (absolute seconds), detected locally with
5691
+ // ffmpeg before the deconstruct. The backend SNAPS its LLM scene boundaries
5692
+ // onto these and SPLITS any scene that spans one, so a scene's frames never
5693
+ // straddle a hard cut. `scaffold-video` populates this; omit for LLM-only cuts.
5694
+ shot_cuts: z29.array(z29.number().min(0)).max(200).optional(),
5695
+ // The video model's per-clip ceiling (seconds). A shot longer than this is
5696
+ // split into seamless continuation sub-scenes (shared splice frame), so long
5697
+ // shots reproduce in full instead of being truncated. `scaffold-video` sets
5698
+ // the Seedance ceiling (15); omit to disable length splitting.
5699
+ max_clip_s: z29.number().positive().max(60).optional(),
5585
5700
  // Transcript provider for the blueprint's dialogue/transcript. Default
5586
5701
  // Groq Whisper; "deepgram" routes to Nova-3 so words carry punctuation.
5587
5702
  transcriber: z29.enum(["groq", "deepgram"]).optional()
@@ -5970,4 +6085,4 @@ export {
5970
6085
  defaultRegistry,
5971
6086
  createEngineFromEnv
5972
6087
  };
5973
- //# sourceMappingURL=chunk-NBNUNCY7.js.map
6088
+ //# sourceMappingURL=chunk-2E4H2GIJ.js.map