npm - arca-marketing-video - Versions diffs - 2.8.0 → 2.9.0 - Mend

arca-marketing-video 2.8.0 → 2.9.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (5)

package/package.json +1 -1
package/skills/shorts-editor/SKILL.md +22 -6
package/skills/shorts-editor/composition.template.html +29 -38
package/skills/storyboard-prompt/SKILL.md +25 -10
package/skills/video-prompt/SKILL.md +43 -0

package/package.json CHANGED

@@ -1,6 +1,6 @@
 {
   "name": "arca-marketing-video",
-  "version": "2.8.0",
+  "version": "2.9.0",
   "description": "Brand-driven short-form marketing content kit: four installable Claude Code skills (carousel generator, storyboard prompt, video prompt, shorts editor) plus a shared brand profile and assets. Run `npx arca-marketing-video` to install them into .claude/skills.",
   "keywords": [
     "skill",

package/skills/shorts-editor/SKILL.md CHANGED

@@ -18,7 +18,7 @@ in-world mark. `silence_cut.py` and `composition.template.html` are co-located i
 This skill is the final edit stage after `video-prompt` (or runs standalone on any raw footage).
 ## Overview
-Turn a finished-but-flat talking clip into a punchy 9:16 short. Engagement is four layers stacked on clean footage: a **tight silence-cut**, **synced captions**, **face/caption-safe graphic chips + zoom punch-ins**, and a **SFX + brand-splash** finish. The composition is HyperFrames HTML; ffmpeg cuts and masters; faster-whisper supplies timing.
+Turn a finished-but-flat talking clip into a punchy 9:16 short. Engagement is layers stacked on clean footage: a **tight silence-cut** (raw footage only — skip for AI clips), **word-by-word pop-on captions** (Anton, gold keywords, no pill backing), **native treatments** (zoom punch-ins, speed ramps, hard cuts, SFX hits — NOT glass-pill chips), and a **SFX + brand-splash** finish. The composition is HyperFrames HTML; ffmpeg cuts and masters; faster-whisper supplies timing.
 **Core principle:** retention is manufactured by deleting dead time and giving the eye a new beat every 2-4s. Cut the pauses first; every other layer just decorates the tightened result.
@@ -37,13 +37,18 @@ Not for: generating footage from scratch, or pure motion-graphics pieces (use th
 1. **Denoise** → `audio_clean.m4a`:
    `ffmpeg -i src.mp4 -af "highpass=85,afftdn,lowpass=12000,loudnorm=I=-14:TP=-1.5:LRA=11" -ar 48000 -ac 2 audio_clean.m4a`
 2. **Transcribe** word timestamps with faster-whisper → `transcript.json` (`[{text,start,end}]`). `small.en` only if the audio is English.
-3. **Silence-cut** with `./silence_cut.py` → `tight.mp4` (the non-obvious core, see below).
+3. **Silence-cut** with `./silence_cut.py` → `tight.mp4` (the non-obvious core, see below). **SKIP this
+   step for AI-generated footage** (clips from `video-prompt` / Wyren): it is already tight and its
+   pauses are intentional comedic beats — the cutter is built for RAW talking-head dead air, not for
+   generated clips. For AI footage, go straight to assembly and keep the clips' timing. Only run the
+   silence-cut on real raw recordings (interview, vox-pop, vlog, podcast) with genuine dead air.
 4. **Re-transcribe `tight.mp4`**, regroup into caption phrases (sentence-aware, 3-5 words). Cutting shifts every timestamp, so always re-transcribe the cut; never remap old times.
-5. **Build the composition** from `./composition.template.html`: muted plate + separate dialogue audio + captions + chips + zooms + logo + splash. Lint clean.
+5. **Build the composition** from `./composition.template.html`: muted plate + separate dialogue audio + word-pop captions + zooms + logo + splash + SFX (no chips by default). Lint clean.
 6. **Draft render** (`--quality draft`), extract frames at every chip/caption/splash beat, eyeball, fix. Then **`--quality high`** and **master**:
    `ffmpeg -i raw.mp4 -c:v copy -af "loudnorm=I=-14:TP=-1.5,alimiter=limit=0.95" -c:a aac -b:a 192k out.mp4`
 ## The silence-cut (the part that is easy to get wrong)
+**Only for RAW recordings — skip entirely for AI-generated clips** (see Pipeline step 3).
 Neither signal alone works:
 - **silencedetect** misses pauses filled with ambient/breath above the noise floor.
 - **whisper word timestamps** are imprecise around pauses and will report contiguous words across a real ~1s gap (e.g. it claimed "So, cheating" was continuous when 0.8s of silence sat between them).
@@ -51,8 +56,19 @@ Neither signal alone works:
 **Cut where EITHER is true:** acoustic silence (`silencedetect=noise=-33dB:d=0.35`) OR a transcript word-gap > 0.45s. Only remove the dead middle when it exceeds ~0.5s, leave ~0.1s of speech pad each side, and KEEP sub-0.5s gaps (natural rhythm — cutting every micro-gap machine-guns the edit and looks glitchy). Add a 10ms `afade` at each join to kill clicks. **Verify with silencedetect on the OUTPUT** — it should show only short breath gaps. `./silence_cut.py` implements all of this; tune `--cut-min`.
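The cut rule above is essentially interval arithmetic, so it can be illustrated in a few lines. The following is a minimal sketch in JavaScript (the template's language), not the source of `silence_cut.py`, which this diff does not show. `planKeepSegments`, `mergeIntervals` and the `{start, end}` input shapes are hypothetical; the sketch assumes `silences` were already parsed from the `silence_start` / `silence_end` lines that ffmpeg's `silencedetect` logs, that `words` are the chronologically ordered `transcript.json` entries, and that the ~0.5s threshold applies to the whole dead span.

```javascript
// Hypothetical sketch of the "silence OR word-gap" cut plan; not silence_cut.py itself.
// All intervals are {start, end} in seconds on the source timeline.
function mergeIntervals(intervals) {
  const sorted = intervals
    .filter((iv) => iv.end > iv.start)
    .map((iv) => ({ start: iv.start, end: iv.end }))
    .sort((a, b) => a.start - b.start);
  const merged = [];
  for (const iv of sorted) {
    const last = merged[merged.length - 1];
    if (last && iv.start <= last.end) {
      last.end = Math.max(last.end, iv.end); // overlapping or touching: extend
    } else {
      merged.push(iv);
    }
  }
  return merged;
}

function planKeepSegments(duration, silences, words, opts = {}) {
  const wordGap = opts.wordGap ?? 0.45; // transcript gap that counts as dead air
  const cutMin = opts.cutMin ?? 0.5; // only cut dead spans longer than this
  const pad = opts.pad ?? 0.1; // speech pad kept on each side of a cut

  // Long gaps between consecutive transcript words.
  const gaps = [];
  for (let i = 1; i < words.length; i++) {
    if (words[i].start - words[i - 1].end > wordGap) {
      gaps.push({ start: words[i - 1].end, end: words[i].start });
    }
  }

  // Dead air = acoustic silence OR a long word gap (union of both signals).
  const dead = mergeIntervals(silences.concat(gaps));

  // Remove only the padded middle of long dead spans; short gaps are natural rhythm.
  const cuts = dead
    .filter((d) => d.end - d.start > cutMin)
    .map((d) => ({ start: d.start + pad, end: d.end - pad }))
    .filter((c) => c.end > c.start);

  // Keep segments are the complement of the cuts within [0, duration].
  const keep = [];
  let cursor = 0;
  for (const c of cuts) {
    if (c.start > cursor) keep.push({ start: cursor, end: c.start });
    cursor = Math.max(cursor, c.end);
  }
  if (cursor < duration) keep.push({ start: cursor, end: duration });
  return keep;
}
```

The union is what catches the "So, cheating" case: whisper reports the words as contiguous, but the `silencedetect` interval between them still produces a cut. The returned keep list would then drive the actual trim/concat with the 10ms `afade` joins, and the output still needs the `silencedetect` re-check; both are outside this sketch.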
 ## Layers (all face/caption-safe)
-- **Captions:** phrase blocks of 3-5 words, white with one accent color on keywords, ONE visible at a time. Hard-hide each before the next appears: clamp exit to `min(end+0.1, next.start-0.07)`, then `tl.set(opacity:0,visibility:hidden)`. Bottom third (~286px up).
-- **Graphic chips:** glass pill + inline-SVG icon + short label, in the **lower-mid torso band (~y1170 of 1920)** — below the face, above the captions. "Over the people" is fine; over the face is not. One per beat, pop in / hold / pop out.
+- **Captions (word-by-word pop-on — the default, not flat phrase-fade):** the strong default is each
+  WORD popping on as it is spoken, not a whole phrase fading in. We already transcribe word-level times,
+  so use them: show ~3-5 words per group, but reveal them one word at a time on each word's `start`
+  (quick scale/opacity pop), the current word emphasized. **Anton** display font (uppercase), white with
+  **gold (`--accent2`) keywords**, **NO pill / glass backing** — just a hard text-shadow for legibility.
+  ONE group visible at a time: hard-hide each before the next appears — clamp exit to
+  `min(end+0.1, next.start-0.07)`, then `tl.set(opacity:0,visibility:hidden)`. Bottom third (~286px up).
+  The flat phrase-fade is the weak default — don't ship it.
+- **Native treatments over graphic "chips" (chips are OFF by default):** do NOT add the glass-pill /
+  badge "chip" by default — it reads as AI / vibe-coded. Manufacture beats with NATIVE editing instead:
+  zoom punch-ins, speed ramps / hold-frames, the word-pop captions themselves (gold keyword emphasis),
+  hard cuts on the beat, and SFX hits. Only add a chip if the user explicitly asks for an on-screen
+  label, and keep it minimal. The chip CSS/JS is removed from the default template.
 - **Zoom punch-ins:** scale the plate wrapper (base ~1.04) to ~1.10-1.14 on emphasis lines, ease back. Never scale below 1.0 (reveals letterbox edges). Cover-fit the plate.
 - **SFX:** a curated set ships in `./sfx/` — use these first (no download needed). Keep dialogue front (SFX vol 0.25-0.35); every `<audio>` needs an `id`. Mapping:
   | Role | File |
@@ -81,5 +97,5 @@ Neither signal alone works:
 ## Files
 - `./silence_cut.py` — silence ∪ word-gap cutter: `--src --audio --transcript --out [--cut-min 0.5]`.
-- `./composition.template.html` — HyperFrames composition skeleton (plate, captions, chips, zooms, splash, SFX) with the load-bearing GSAP logic already wired.
+- `./composition.template.html` — HyperFrames composition skeleton (plate, word-pop captions, zooms, splash, SFX; chips removed from default) with the load-bearing GSAP logic already wired.
 - `./sfx/` — bundled, ready-to-use SFX (riser, swooshes, ding, tiktok-boom-bling, glitch, wrong, sad-violin). See the SFX mapping above.

package/skills/shorts-editor/composition.template.html CHANGED

@@ -2,7 +2,9 @@
 <!--
   HyperFrames composition skeleton for an engaging vertical short.
   Fill the >>> TODO <<< spots. All times are on the TIGHT (cut) timeline.
-  Layers by z-index: plate(0) < grade(1) < chips(3) < captions(5) < logo(6) < flash(7) < splash(8) < endfade(9).
+  Layers by z-index: plate(0) < grade(1) < captions(5) < logo(6) < flash(7) < splash(8) < endfade(9).
+  Captions are word-by-word pop-on (Anton, gold keywords, no pill backing). Chips are OFF by default
+  (they read as AI/vibe-coded) — use native treatments (zooms, speed, hard cuts, SFX) instead.
   Standalone composition: data-composition-id div lives directly in <body> (NO <template>).
 -->
 <html lang="en">
@@ -17,20 +19,14 @@
       #plate-wrap { position:absolute; inset:0; overflow:hidden; z-index:0; transform-origin:50% 42%; will-change:transform; }
       #plate-wrap video { position:absolute; inset:0; width:100%; height:100%; object-fit:cover; }
       #grade { position:absolute; inset:0; z-index:1; pointer-events:none; background:radial-gradient(120% 80% at 50% 40%, rgba(0,0,0,0) 56%, rgba(0,0,0,.42) 100%); }
-      /* chips: lower-mid torso band — below face, above captions */
-      #chips { position:absolute; inset:0; z-index:3; pointer-events:none; }
-      .chip { position:absolute; top:1170px; left:0; right:0; margin:0 auto; width:max-content; max-width:920px;
-              display:flex; align-items:center; gap:22px; padding:18px 40px 18px 20px; border-radius:999px;
-              background:var(--glass); border:2px solid rgba(41,171,226,.5); box-shadow:0 18px 50px rgba(0,0,0,.45); opacity:0; }
-      .chip .badge { flex:0 0 auto; width:66px; height:66px; border-radius:50%; display:flex; align-items:center; justify-content:center;
-                     background:linear-gradient(150deg,var(--accent),#1565c0); }
-      .chip.alt .badge { background:linear-gradient(150deg,#ffce4d,var(--accent2)); }
-      .chip .badge svg { width:36px; height:36px; color:#fff; }
-      .chip .ctext { font-weight:800; font-size:46px; letter-spacing:-.02em; color:var(--ink); white-space:nowrap; line-height:1; }
+      /* chips are OFF by default (they read as AI/vibe-coded) — native treatments only: zooms, speed, hard cuts, SFX, gold keyword pops */
       #caps { position:absolute; left:0; right:0; bottom:286px; z-index:5; pointer-events:none; }
-      .cap { position:absolute; left:0; right:0; bottom:0; margin:0 auto; width:900px; text-align:center; font-weight:800;
-             font-size:58px; line-height:1.12; letter-spacing:-.015em; color:var(--ink); opacity:0;
-             text-shadow:0 4px 18px rgba(0,0,0,.8),0 1px 3px rgba(0,0,0,.92); }
+      /* word-by-word pop-on captions: Anton display font, uppercase, gold keywords, NO pill backing — just a hard shadow */
+      .cap { position:absolute; left:0; right:0; bottom:0; margin:0 auto; width:900px; text-align:center;
+             font-family:"Anton",system-ui,sans-serif; font-weight:400; text-transform:uppercase;
+             font-size:64px; line-height:1.1; letter-spacing:.005em; color:var(--ink); opacity:0;
+             text-shadow:0 4px 18px rgba(0,0,0,.85),0 2px 4px rgba(0,0,0,.95); }
+      .cap .w { display:inline-block; will-change:transform,opacity; }
       .cap .kw { color:var(--accent2); }
       #logo { position:absolute; top:54px; right:54px; width:90px; z-index:6; opacity:0; filter:drop-shadow(0 6px 18px rgba(0,0,0,.45)); }
       #flash { position:absolute; inset:0; z-index:7; background:#fff; opacity:0; pointer-events:none; }
@@ -44,7 +40,6 @@
     <div id="root" data-composition-id="root" data-width="1080" data-height="1920" data-start="0" data-duration="47.58">
       <div id="plate-wrap"><video id="plate" src="tight.mp4" data-start="0" data-duration="44.58" data-media-start="0" data-track-index="0" muted playsinline></video></div>
       <div id="grade"></div>
-      <div id="chips"></div>
       <div id="caps"></div>
       <img id="logo" src="logo.png" alt="brand" crossorigin="anonymous" />
       <div id="flash"></div>
@@ -53,47 +48,43 @@
       <!-- AUDIO: muted video + separate dialogue track. EVERY <audio> needs an id or it is silent. -->
       <audio id="dlg" src="dialogue.m4a" data-start="0" data-duration="44.58" data-media-start="0" data-track-index="2" data-volume="1"></audio>
-      <audio id="sfx-riser" src="sfx/riser.mp3" data-start="0" data-duration="4.0" data-media-start="0" data-track-index="3" data-volume="0.30"></audio>
-      <!-- >>> whooshes on cuts (track 4), pops on chip entrances (track 5), ding only for splash (track 6) <<< -->
-      <audio id="sfx-ding" src="sfx/ding.mp3" data-start="44.58" data-duration="0.84" data-media-start="0" data-track-index="6" data-volume="0.5"></audio>
+      <audio id="sfx-riser" src="sfx/riser-high.mp3" data-start="0" data-duration="4.0" data-media-start="0" data-track-index="3" data-volume="0.30"></audio>
+      <!-- >>> whooshes on hard cuts (track 4: swoosh-high.mp3 / swoosh-low.mp3); reserve the boom-bling for the splash only (track 6) <<< -->
+      <audio id="sfx-splash" src="sfx/tiktok-boom-bling.mp3" data-start="44.58" data-duration="0.84" data-media-start="0" data-track-index="6" data-volume="0.5"></audio>
       <script src="https://cdn.jsdelivr.net/npm/gsap@3.14.2/dist/gsap.min.js"></script>
-      <script src="data.js"></script><!-- window.CAPTION_GROUPS = [{text,start,end},...] from the re-transcribed cut -->
+      <script src="data.js"></script><!-- window.CAPTION_GROUPS = [{start,end,words:[{w,t},...]},...] — w=word text, t=its spoken start time (from the re-transcribed cut) -->
       <script>
         (function () {
           var VID = 44.58, DUR = 47.58;            // >>> set <<<
           var tl = gsap.timeline({ paused: true });
-          var ICONS = { /* >>> inline-SVG strings keyed by name, e.g. */ dot: '<svg viewBox="0 0 24 24" fill="currentColor"><circle cx="12" cy="12" r="6"/></svg>' };
-          var KW = { /* lowercase keywords to highlight gold */ };
-          function capHTML(t){return t.split(" ").map(function(w){return KW[w.toLowerCase()]?'<span class="kw">'+w+'</span>':w;}).join(" ");}
+          var KW = { /* lowercase keyword (alphanumeric only) -> true, to highlight gold, e.g. arca:true */ };
+          function isKW(w){ return !!KW[w.toLowerCase().replace(/[^a-z0-9]/g,"")]; }
-          // ---- CAPTIONS: one at a time, hard-hidden before the next appears ----
+          // ---- CAPTIONS: word-by-word pop-on, gold keywords, ONE group on screen at a time ----
           var caps = document.getElementById("caps"), G = window.CAPTION_GROUPS || [];
           G.forEach(function (g, i) {
             var el = document.createElement("div"); el.className = "cap"; el.id = "cap-" + i;
-            el.innerHTML = capHTML(g.text); caps.appendChild(el);
+            caps.appendChild(el);
             var next = (i + 1 < G.length) ? G[i + 1].start : VID + 1;
             var inAt = Math.max(0, g.start - 0.05);
             var hardOut = Math.min(g.end + 0.1, next - 0.07);
             if (hardOut <= inAt + 0.12) hardOut = inAt + 0.12;
-            tl.fromTo(el, { opacity:0, y:20 }, { opacity:1, y:0, duration:0.22, ease:"power2.out" }, inAt);
-            tl.to(el, { opacity:0, y:-10, duration:0.14, ease:"power2.in" }, Math.max(inAt + 0.1, hardOut - 0.14));
+            tl.set(el, { opacity:1, visibility:"visible" }, inAt);
+            (g.words || []).forEach(function (wd, j) {
+              var s = document.createElement("span"); s.className = "w";
+              s.innerHTML = isKW(wd.w) ? '<span class="kw">' + wd.w + '</span>' : wd.w;
+              el.appendChild(s); el.appendChild(document.createTextNode(" "));
+              // each word pops on at its own spoken time (fallback: group start)
+              var at = Math.min(hardOut - 0.05, Math.max(inAt, wd.t != null ? wd.t : g.start));
+              tl.fromTo(s, { opacity:0, scale:0.6, y:14 }, { opacity:1, scale:1, y:0, duration:0.16, ease:"back.out(2)" }, at);
+            });
+            tl.to(el, { opacity:0, duration:0.14, ease:"power2.in" }, Math.max(inAt + 0.1, hardOut - 0.14));
             tl.set(el, { opacity:0, visibility:"hidden" }, hardOut);
           });
-          // ---- CHIPS: one per beat, lower-mid band ----
-          var CHIPS = [ /* { s, e, icon:"dot", text:"LABEL", alt:false } */ ];
-          var chips = document.getElementById("chips");
-          CHIPS.forEach(function (c, i) {
-            var el = document.createElement("div"); el.className = "chip" + (c.alt ? " alt" : ""); el.id = "chip-" + i;
-            el.innerHTML = '<span class="badge">' + (ICONS[c.icon]||"") + '</span><span class="ctext">' + c.text + '</span>';
-            chips.appendChild(el);
-            tl.fromTo(el, { opacity:0, scale:0.8, y:24 }, { opacity:1, scale:1, y:0, duration:0.42, ease:"back.out(1.6)" }, c.s);
-            tl.to(el, { y:-6, duration:(c.e - c.s) - 0.42, ease:"sine.inOut" }, c.s + 0.42);
-            tl.to(el, { opacity:0, scale:0.92, duration:0.3, ease:"power2.in" }, c.e - 0.3);
-            tl.set(el, { opacity:0, visibility:"hidden" }, c.e);
-          });
+          // ---- CHIPS removed from default — use native treatments: zoom punch-ins, speed ramps, hard cuts, SFX hits ----
           // ---- ZOOM PUNCH-INS on the plate (never below 1.0) ----
           var W = "#plate-wrap", BASE = 1.04;
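The caption block in the template diff above mixes its timing math with DOM and GSAP calls. As a hypothetical aid (`scheduleCaptions` is not part of the template), the same `inAt` / `hardOut` / per-word arithmetic can be pulled into a pure function so the timing of a `CAPTION_GROUPS` array can be checked in Node before a draft render. It copies the template's constants: 0.05s lead-in, 0.1s tail, 0.07s gap before the next group, 0.12s minimum on-screen time, 0.05s word margin, and the 0.14s exit fade.

```javascript
// Hypothetical pure helper mirroring the template's caption timing math (no DOM, no GSAP).
// groups: [{start, end, words: [{w, t}]}] on the tight timeline; vid: plate duration.
function scheduleCaptions(groups, vid) {
  return groups.map((g, i) => {
    const next = i + 1 < groups.length ? groups[i + 1].start : vid + 1;
    const inAt = Math.max(0, g.start - 0.05);
    let hardOut = Math.min(g.end + 0.1, next - 0.07);
    if (hardOut <= inAt + 0.12) hardOut = inAt + 0.12; // minimum on-screen time
    // Each word pops at its spoken time (fallback: group start),
    // clamped inside [inAt, hardOut - 0.05].
    const wordTimes = (g.words || []).map((wd) =>
      Math.min(hardOut - 0.05, Math.max(inAt, wd.t != null ? wd.t : g.start))
    );
    const fadeAt = Math.max(inAt + 0.1, hardOut - 0.14); // start of the 0.14s exit fade
    return { inAt, hardOut, fadeAt, wordTimes };
  });
}
```

One thing such a check can surface: because of the `inAt + 0.12` floor, a group whose successor starts less than 0.12s later is still visible when the successor appears, which breaks the one-group-at-a-time rule.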

package/skills/storyboard-prompt/SKILL.md CHANGED

@@ -1,6 +1,6 @@
 ---
 name: storyboard-prompt
-description: Use when pressure-testing a short-form video idea and turning it into a 3×3 (or larger) storyboard for TikTok / Reels / Shorts of any video type (UGC, cinematic, animation, etc.) — brutal virality scoring, a 10-route hook lab, a polish pass that makes the idea catchy and engaging with a strong hook and brand-aligned messaging, a first-5-seconds cold open, a clarification checkpoint before image generation, then TWO deliverables: a clean frames-only grid image (no text/notes, easy to crop) plus a text breakdown (concept, flow, dialogue, editor notes, style notes). Triggers on "storyboard this idea", "is this video concept good", "plan a short". Part of the arca-marketing-video kit.
+description: Use when pressure-testing a short-form video idea and turning it into a 3×3 (or larger) storyboard for TikTok / Reels / Shorts of any video type (UGC, cinematic, animation, etc.) — brutal virality scoring, a 10-route hook lab, a polish pass that makes the idea catchy and engaging with a strong hook and brand-aligned messaging, a first-5-seconds cold open, a clarification checkpoint before image generation, then TWO deliverables: a clean frames-only grid image (no text/notes, easy to crop) plus a text breakdown (concept, named characters + 1-line bibles, a required 5-column Frame·Time·Visual·Dialogue·Direction flow table, editor notes, style notes). Triggers on "storyboard this idea", "is this video concept good", "plan a short". Part of the arca-marketing-video kit.
 ---
 # Storyboard Prompt
@@ -729,22 +729,37 @@ croppable. Use these sections:
 - VIDEO CONCEPT — one tight paragraph: the polished concept + logline, the video type / style, the
   target viewer, the core promise, the chosen cold-open strategy, and the recommended length.
-- FLOW — the beat-by-beat shot flow, one short block per frame (1–9, or however many), each with:
-  - FRAME # · TIME (range) · RETENTION ROLE
-  - SHOT / ANGLE / MOVE
-  - ACTION (what happens in the frame)
-  - DIALOGUE / VO (OPTIONAL — include only when the beat has a spoken line)
+- CHARACTERS — name EVERY recurring character (give each a short memorable name, e.g. "Maya the
+  Navigator") and write a ONE-LINE bible for each: face / age / build, hair, wardrobe, and 1–2
+  distinguishing features. This is what makes identity-locking trivial downstream — `video-prompt`
+  pastes these bibles verbatim into every shot. Even a one-person video needs its character named + a
+  bible line. If no character recurs (pure product / motion-graphics), say so in one line.
+- FLOW — the beat-by-beat shot flow as an EXPLICIT 5-COLUMN TABLE, one row per frame (this table is a
+  REQUIRED handoff, never optional — your written dialogue + direction drive the acting far better than
+  lines inferred downstream). Columns:
+  | Frame | Time | Visual | Dialogue | Direction |
+  - **Frame** — number (1–9, or however many) + its retention role + shot / angle / move.
+  - **Time** — the timestamp range.
+  - **Visual** — what happens in the frame (action, who's on screen, key prop).
+  - **Dialogue** — the actual spoken line for that beat, written out. Use "—" only for true silent
+    beats; do NOT leave it blank to be filled in later. These exact lines get spoken (native audio
+    nails scripted lines), so write them tight and in-character.
+  - **Direction** — performance / delivery note that drives the acting (e.g. "deliver softly, almost
+    inspirational", "rushed, glancing off-camera", "deadpan, then a tiny smirk"). Always fill this —
+    it is the highest-value column and the reason this table is mandatory.
 - VIDEO EDITOR NOTES — what to add in the EDIT, not in the frames: suggested on-screen caption per
   beat, cut / transition style, SFX and music, pacing, zoom punch-ins, and where the logo / brand
   splash lands. (These feed the `shorts-editor` skill.)
 - STYLE NOTES — the look the final video must match: video type, lighting, camera feel, color,
-  wardrobe / prop continuity, and the recurring character profile to keep consistent across frames.
-  (These feed the `video-prompt` skill.)
+  wardrobe / prop continuity. Reference the CHARACTERS section above for who must stay consistent
+  across frames. (These feed the `video-prompt` skill.)
 - BRAND / LOGO NOTES — which frames the supplied logo appears in and how (as a subtle in-world prop),
   per Phase 7.
-Keep it tight and scannable. Dialogue is optional. The clean frame grid (Phase 8) plus this text
-breakdown are the two deliverables — never merge the text into the image.
+Keep it tight and scannable. The FLOW table's Dialogue + Direction columns are REQUIRED (write the
+real lines and the delivery notes — don't defer them); only a genuinely silent beat gets a "—" in
+Dialogue. The clean frame grid (Phase 8) plus this text breakdown are the two deliverables — never
+merge the text into the image.
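Because the FLOW table is now a required handoff with hard rules (five columns, Dialogue never blank, Direction always filled), it can be linted mechanically before it is passed downstream. The sketch below is a hypothetical check (`lintFlowTable` is not part of the skill); it assumes the table is written as a pipe-delimited markdown table with one header row and a `---` separator, and that a silent beat is marked with the dash the spec prescribes (written as `\u2014` in the code).

```javascript
// Hypothetical lint for the storyboard FLOW table; not part of the skill itself.
// Expects a pipe-delimited markdown table: header row, --- separator, one row per frame.
const SILENT = "\u2014"; // the dash the spec reserves for genuinely silent beats
const SEPARATOR = /^\|(\s*:?-+:?\s*\|)+$/;

function lintFlowTable(markdown) {
  const lines = markdown
    .split("\n")
    .map((l) => l.trim())
    .filter((l) => l.startsWith("|"));
  // Drop the header row, then the separator row.
  const rows = lines.slice(1).filter((l) => !SEPARATOR.test(l));
  const problems = [];
  rows.forEach((line, i) => {
    const n = i + 1;
    const cells = line.replace(/^\|/, "").replace(/\|$/, "").split("|").map((c) => c.trim());
    if (cells.length !== 5) {
      problems.push(`row ${n}: expected 5 columns (Frame, Time, Visual, Dialogue, Direction), got ${cells.length}`);
      return;
    }
    const [frame, time, visual, dialogue, direction] = cells;
    if (!frame || !time || !visual) problems.push(`row ${n}: Frame, Time and Visual must be filled`);
    if (!dialogue) problems.push(`row ${n}: Dialogue is blank (write the line, or ${SILENT} for a silent beat)`);
    if (!direction) problems.push(`row ${n}: Direction is blank (it is always required)`);
  });
  return problems;
}
```

An empty result means every row carries a real line (or the silent marker) and a delivery note, which is what `video-prompt` needs to write the shot prompts.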
 PHASE 9: ANTI-AI AND ANTI-CINEMA QUALITY GATE

package/skills/video-prompt/SKILL.md CHANGED

@@ -26,6 +26,40 @@ Before generating, ask the user for any of these that are not already provided,
 10. NATIVE AUDIO — whether the video model should synthesize dialogue/sound (only some models support it). Default: on if the chosen model supports `sound`.
 Ask first. Only make smart, briefly stated assumptions for whatever is still missing, and state them briefly. If the user names a budget or "cheapest/fastest", pick the model tier accordingly and say which you picked.
+MODEL CAPABILITY MATRIX — decide this in the FIRST intake round (don't discover it mid-build)
+The question that always surfaces late — "shouldn't the storyboard be a reference image for the video
+model?" — must be answered up front, because the answer changes the whole graph. Reconfirm live with
+`mcp__wyren__list_models` + `get_model_capabilities`, but the load-bearing facts:
+| Model | Ref images | Native audio | Multishot | Start frame | Duration |
+| --- | --- | --- | --- | --- | --- |
+| **Kling V3** (default) | **0** (none) | yes (`sound`) | yes (≤6) | yes | 3–15s |
+| Kling O1 | yes (≤7) | **no** | **no** | yes | — |
+| Kling V2.6 | 0 | yes (+voice clone) | no | yes | 5/10s |
+| Veo 3.1 Fast/Std | ≤3 | yes | no | yes | 4/6/8s |
+| Seedance 2.0 / Fast | ≤4 | no | yes | yes | 4–15s (480/720p) |
+- **Kling V3 takes ZERO reference images** (`maxReferenceImages: 0`). So the storyboard does NOT feed
+  the video model as a reference — it drives the IMAGE stage only (it becomes / designs the start
+  frame). Identity on V3 comes from the per-shot **startFrame + the verbatim character bible** in each
+  shot prompt, never from a reference image.
+- **Only Kling O1 accepts reference images** — but it LOSES native audio AND multishot. So choosing O1
+  for face-locking means giving up scripted dialogue + in-generation angle changes. Usually not worth
+  it: prefer V3 (audio + multishot) with startFrame-driven identity.
+- **Kling V3 Omni is disabled** — do not route to it.
+- Net: for a dialogue UGC skit, default = **Kling V3**, identity via startFrame + bible, audio native.
+  Reach for a ref-image model (O1/Veo/Seedance) only when face-lock genuinely beats audio+multishot.
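To make the first-round intake decision explicit, the matrix can be encoded as data and filtered by hard requirements. This is a minimal hypothetical sketch: `MODELS` is transcribed by hand from the table (durations omitted), `candidateModels` is an illustrative name rather than a Wyren call, and the live answer should still come from `mcp__wyren__list_models` + `get_model_capabilities`. Kling V3 Omni is left out because the text marks it disabled.

```javascript
// Capability matrix transcribed from the table above; reconfirm live before relying on it.
// maxRefImages: 0 means the model takes no reference images.
const MODELS = [
  { name: "Kling V3", maxRefImages: 0, nativeAudio: true, multishot: true },
  { name: "Kling O1", maxRefImages: 7, nativeAudio: false, multishot: false },
  { name: "Kling V2.6", maxRefImages: 0, nativeAudio: true, multishot: false },
  { name: "Veo 3.1 Fast/Std", maxRefImages: 3, nativeAudio: true, multishot: false },
  { name: "Seedance 2.0 / Fast", maxRefImages: 4, nativeAudio: false, multishot: true },
];

// Hypothetical helper: names of models meeting every hard requirement,
// in table order (so the default Kling V3 comes first whenever it qualifies).
function candidateModels({ refImages = 0, audio = false, multishot = false } = {}) {
  return MODELS.filter(
    (m) =>
      m.maxRefImages >= refImages &&
      (!audio || m.nativeAudio) &&
      (!multishot || m.multishot)
  ).map((m) => m.name);
}
```

For a dialogue skit (`audio` + `multishot`), only Kling V3 survives. Filtering this way also shows that, per the table, Veo 3.1 is the one listed model with both reference images and native audio, though without multishot.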
+DECISION RULE — multishot vs separate clips:
+- **Many quick dialogue beats + native audio → ONE Kling V3 multishot per part** (not many tiny
+  separate clips). Multishot keeps the face/voice continuous and gives angle changes in one generation.
+- Watch the math: **Kling's minimum clip is 3s**, and a multishot maxes at 15s, so a single multishot
+  merge caps at **~5 beats** (5 × 3s = 15s). If a part has more than ~5 dialogue beats, split into a
+  second multishot/part rather than crushing beats below 3s.
+- **Native audio nails the exact scripted lines** — for a dialogue skit you do NOT need a separate TTS
+  pass. Write the real lines into each shot's prompt (the storyboard FLOW table's Dialogue column) and
+  let the model speak them. Only fall back to TTS/editor-added VO when the chosen model has no `sound`.
 Use the storyboard as the PLAN, not as footage to copy. First classify it (see STORYBOARD
 INTERPRETATION below): if the panels are photographic, clean them and use them as start frames; if
 they are schematic / annotated mockups (panel numbers, notes boxes, drawn phone bezels, mock UI), do
@@ -196,6 +230,15 @@ Wyren execution flow (per the wyren skill's policy — load it before any `mcp__
    brand splash / end card, zoom/SFX timing, and master. That is the ONLY place HyperFrames is used, and
    only for captions + splash + timing — never to composite text/UI graphics onto the footage.
+WYREN BUILD GOTCHAS (these cost failed `validate_workflow` / `run_workflow` calls — get them right):
+- **`multiPrompt` must be a JSON STRING, not an array.** When using video-model multishot, pass the
+  per-shot prompts as a stringified JSON value (e.g. `"[{...},{...}]"`), not a raw array. A raw array
+  fails validation.
+- **`imageAI` / `videoAI` need a CONNECTED TEXT EDGE — `customPrompt` alone fails.** Wire a text/prompt
+  node into the AI node's prompt input; a `customPrompt` field set on the node without an incoming text
+  edge does not satisfy validation. Build the graph so every imageAI/videoAI has its prompt edge
+  connected, then set the prompt content.
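The multishot arithmetic from the DECISION RULE and the `multiPrompt` gotcha fit together in one small sketch. The names `splitIntoMultishots` and `toMultiPromptString` are hypothetical, and the per-shot shape `{ prompt, duration }` is an assumption: the real Wyren field schema is not shown in this diff, so check it against the wyren skill before use. The sketch relies only on the stated limits: a 3s minimum clip, a 15s multishot maximum, and at most 6 shots per Kling V3 multishot, which together cap a part at 5 beats.

```javascript
// Hypothetical helpers; the per-shot object shape is an assumption, not Wyren's schema.
const MIN_CLIP_S = 3; // Kling's minimum clip length
const MAX_MULTISHOT_S = 15; // multishot maximum length
const MAX_SHOTS = 6; // Kling V3 multishot shot cap (from the capability matrix)

// Greedily pack dialogue beats into parts that respect both caps,
// raising any beat shorter than the 3s minimum instead of crushing it.
function splitIntoMultishots(beats) {
  const parts = [];
  let current = [];
  let used = 0;
  for (const beat of beats) {
    const dur = Math.max(MIN_CLIP_S, beat.duration);
    if (dur > MAX_MULTISHOT_S) throw new Error(`beat "${beat.line}" exceeds ${MAX_MULTISHOT_S}s`);
    if (current.length === MAX_SHOTS || used + dur > MAX_MULTISHOT_S) {
      parts.push(current); // start a new multishot/part
      current = [];
      used = 0;
    }
    current.push({ ...beat, duration: dur });
    used += dur;
  }
  if (current.length) parts.push(current);
  return parts;
}

// multiPrompt must be a JSON STRING, not a raw array.
function toMultiPromptString(part) {
  return JSON.stringify(part.map((b) => ({ prompt: b.prompt, duration: b.duration })));
}
```

Seven 2s beats, for example, become one 5-beat part (5 × 3s = 15s) plus a 2-beat part, which is the "split into a second multishot" rule rather than squeezing seven beats into one generation.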
 RECURRING CHARACTER CONSISTENCY (multishot / multi-clip)
 Any time the video is more than one shot — a multi-clip split (Part 1 / Part 2), or video-model