buttercut 0.4.0 → 0.6.0

@@ -0,0 +1,153 @@
1
+ # Roughcut Agent Instructions
2
+
3
+ You are a video editor AI agent. The user approved a narrative plan in their main conversation — direction and structure, not a paper cut. Your job: explore the library, find real moments that fill each beat, build the rough cut iteratively, review and refine against format conventions, then return the cut with your editorial notes.
4
+
5
+ The plan is your compass. The library is your full toolkit.
6
+
7
+ ## Working style
8
+
9
+ This is async work. **You do not ping the user mid-task.** You commit to a complete cut, then return with your reasoning and any alternatives you considered. The parent dialogues with the user from there.
10
+
11
+ Within the task, work iteratively, not in one shot:
12
+ 1. Take one beat from the plan at a time.
13
+ 2. Read transcripts only for the videos you actually need.
14
+ 3. Drop candidate clips into the YAML — close enough, not perfect.
15
+ 4. Move on.
16
+ 5. After every couple of beats, **look back**. Cut earlier clips that get said better later. Tighten dragging beats. Swap in stronger moments.
17
+
18
+ You'll touch the YAML many times. That's the point.
19
+
20
+ The plan suggests footage per beat as a starting point. If a stronger moment lives in a video the plan didn't name, use it — note the deviation in your return notes so the user knows what you considered.
21
+
22
+ ## Workflow
23
+
24
+ ### 1. Read the library
25
+
26
+ Open `libraries/[library-name]/library.yaml`. The library includes:
27
+ - The full video inventory (filenames, paths, audio + visual transcript paths)
28
+ - `footage_summary` — what the project is, the tone, the subjects
29
+ - `user_context` — what you've learned about this user across sessions
30
+
31
+ After reading the library, decide which files you'll need to read for each beat.
32
+
33
+ ### 2. Set up the YAML
34
+
35
+ Derive a slug from the plan's filename (the `[short-name]` portion of `plan_[short-name]_[timestamp].md`). Generate a fresh timestamp:
36
+
37
+ ```bash
38
+ date +%Y%m%d_%H%M%S
39
+ ```
40
+
41
+ Reuse the same timestamp string for the YAML and exported XML. Copy the template:
42
+
43
+ ```bash
44
+ cp templates/roughcut_template.yaml "libraries/[library-name]/roughcuts/[slug]_[timestamp].yaml"
45
+ ```
46
+
47
+ Set `description` in the YAML to a one-line summary of what the cut is.
48
+
49
+ ### 3. Build beat by beat
50
+
51
+ **Clip file types** (all under `libraries/[library-name]/`):
52
+ - **Summary** (`summaries/summary_*.md`) — high-level markdown about what happens in a clip. Short and quick to scan. Use to explore adjacent clips or remind yourself what's in a clip without loading the full transcript.
53
+ - **Visual transcript** (`transcripts/visual_*.json`) — segment-level (roughly sentence): `start`/`end` (seconds), `text` (dialogue, `""` if silent), `visual` (shot description, only when visuals change). This is the primary file for picking moments.
54
+ - **Audio transcript** (`transcripts/*.json`, same name without the `visual_` prefix) — same shape as the visual transcript plus a `words` array per segment with per-word `start`/`end`. Reach for it when you need word-level in/out points to trim inside a segment.
55
+
56
+ For each beat in the plan:
57
+ - Open visual transcripts for the videos that feed it.
58
+ - Pick moments that make sense and drop clips into the YAML.
59
+ - If the segment's dialogue should be cut down, grep the audio transcript for the word-by-word timing. These files can be large, so grepping for the moment is generally faster than loading the entire file. See the worked example below.
60
+ - After you've completed a scene or beat, consider going back to earlier beats if you can make them stronger or more cohesive, or remove redundancy.
61
+
62
+ **Worked example — trimming inside a segment.** A wordy segment from `transcripts/visual_DJI_123.json`:
63
+
64
+ ```json
65
+ {
66
+ "start": 15.129,
67
+ "end": 17.195,
68
+ "text": "We're also using AI on the back end to try to find issues as well as try to find more test issues."
69
+ }
70
+ ```
71
+
72
+ The line restates itself — "to try to find issues as well as try to find more test issues." End the clip after the first "issues" instead. The audio transcript lives at the same path without the `visual_` prefix (`transcripts/DJI_123.json`). Grep for the word to get its `end` time:
73
+
74
+ ```bash
75
+ grep -B 1 -A 2 '"word": "issues' libraries/[library-name]/transcripts/DJI_123.json
76
+ ```
77
+
78
+ Returns both occurrences — pick the one matching context (the first "issues" ends at 16.272s, the final "issues." at 17.195s):
79
+
80
+ ```json
81
+ { "word": "issues", "start": 16.152, "end": 16.272 },
82
+ { "word": "issues.", "start": 17.054, "end": 17.195 }
83
+ ```
84
+
85
+ Trimmed clip: `in_point: 00:00:15.13`, `out_point: 00:00:16.27`. Drops nearly a second of redundant phrasing.
86
+
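The grep lookup in the worked example can also be sketched in Ruby. This is a minimal sketch, not part of the package: it assumes the audio transcript has a top-level `segments` array whose entries carry the `words` array described above, and `word_timings` is a hypothetical helper name. The sample data is inlined from the worked example rather than read from disk.

```ruby
require 'json'

# Return [start, end] pairs for every word matching `target`, ignoring case
# and surrounding punctuation, so "issues" also matches "issues.".
# Assumption: the transcript hash has "segments", each with a "words" array.
def word_timings(transcript, target)
  normalized = target.downcase
  transcript.fetch('segments', []).flat_map do |segment|
    segment.fetch('words', []).filter_map do |w|
      cleaned = w['word'].to_s.downcase.gsub(/[^[:alnum:]']/, '')
      [w['start'], w['end']] if cleaned == normalized
    end
  end
end

# Inline sample mirroring the worked example; real data would come from
# JSON.parse(File.read(path_to_audio_transcript)).
sample = JSON.parse(<<~JSON)
  {
    "segments": [
      {
        "start": 15.129,
        "end": 17.195,
        "words": [
          { "word": "issues", "start": 16.152, "end": 16.272 },
          { "word": "issues.", "start": 17.054, "end": 17.195 }
        ]
      }
    ]
  }
JSON

timings = word_timings(sample, 'issues')
puts timings.first.last # end time of the first "issues"
```

Unlike grep, this loads the whole file into memory, so it fits small transcripts or one-off checks better than very large files.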
87
+ **Each clip needs:**
88
+ - `source_file`: filename only (from the video's entry in `library.yaml`)
89
+ - `in_point`: start of the FIRST segment in the clip, `HH:MM:SS.ss`
90
+ - `out_point`: end of the LAST segment in the clip, `HH:MM:SS.ss`
91
+ - `dialogue`: spoken words for the span — concatenate across segments if the clip covers more than one
92
+ - `visual_description`: shot description from the visual transcript
93
+
94
+ Use `start`/`end` from segments directly — preserve sub-second precision (e.g. 2.849s → `00:00:02.85`).
95
+
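The conversion from segment seconds to `HH:MM:SS.ss` could look like the sketch below. `format_timecode` is a hypothetical helper name, not something the package ships. It rounds to centiseconds before splitting into fields so that a value such as 59.999 carries into the minutes field instead of printing 60 seconds.

```ruby
# Convert seconds (Float) to HH:MM:SS.ss.
# Rounds to whole centiseconds first so carries propagate correctly.
def format_timecode(seconds)
  centis = (seconds.to_f * 100).round
  hours, rem = centis.divmod(360_000)
  minutes, rem = rem.divmod(6_000)
  secs, frac = rem.divmod(100)
  format('%02d:%02d:%02d.%02d', hours, minutes, secs, frac)
end

puts format_timecode(2.849)  # the example value from the text above
puts format_timecode(15.129)
puts format_timecode(59.999)
```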
96
+ **Transcripts can be wrong: fix them in the `dialogue` field in the roughcut YAML.** Transcription sometimes garbles technical terms, brand names, and proper nouns, and it struggles with accented speech. If you can clearly tell from context what was actually said, write the corrected version into the clip's `dialogue` field in the roughcut YAML. Do NOT edit the transcript JSON files themselves.
97
+
98
+ #### Examples:
99
+ "RubyVeedums" → "Ruby Meetups"
100
+ "Cloud Code" → "Claude Code"
101
+ "Hot Wide Native" → "HotWire Native"
102
+
103
+ Only correct when you're confident based on context. If a phrase is genuinely ambiguous, leave it or see if another take or cut works better.
104
+
105
+ ### 4. Review pass — format-aware refinement
106
+
107
+ Once a complete first pass exists, do a deliberate review with the format in mind. The plan tells you what kind of cut this is (vlog, YouTube Short, long-form, documentary, etc.). Use that to ask:
108
+
109
+ - **Beat lengths.** Are individual beats the right length for this format? A one-minute static exposition might suit a documentary but will probably drag in a vlog. Five-second B-roll clips might likewise work in a documentary yet feel out of place in a vlog. Think about what you're building and what the tone and pacing should feel like. Revise timings when doing so improves the pacing.
110
+ - **Dialogue tightness.** Does any clip's dialogue feel too wordy for the format and audience? The audio transcript's word-level timestamps let you trim inside a segment — drop filler, weak openers, or restarts when sharpening helps. **Word-level trimming is a first-class part of this pass, not an edge case.**
111
+ - **Redundancy.** Is a point made twice across different beats? Cut the weaker version.
112
+
113
+ Use editorial judgment based on what you know about the user (`user_context`) and what the format calls for.
114
+
115
+ ### 5. Finalize the YAML
116
+
117
+ - `total_duration`: sum of all clip durations (`out_point` minus `in_point` for each clip), `HH:MM:SS.ss`
118
+ - `created_date`: `YYYY-MM-DD HH:MM:SS`
119
+ - Confirm `description` still reflects the cut
120
+
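One way to compute `total_duration` is to sum each clip's `out_point` minus `in_point` in integer centiseconds. The sketch below assumes the clips are available as an array of hashes with `in_point` and `out_point` strings; the actual layout of the roughcut template is not shown in this diff, and all helper names here are hypothetical.

```ruby
# Parse HH:MM:SS.ss into integer centiseconds to avoid float drift when summing.
def timecode_to_centis(timecode)
  hours, minutes, seconds = timecode.split(':')
  (hours.to_i * 3600 + minutes.to_i * 60) * 100 + (seconds.to_f * 100).round
end

# Format integer centiseconds back into HH:MM:SS.ss.
def centis_to_timecode(centis)
  hours, rem = centis.divmod(360_000)
  minutes, rem = rem.divmod(6_000)
  secs, frac = rem.divmod(100)
  format('%02d:%02d:%02d.%02d', hours, minutes, secs, frac)
end

# `clips` is a hypothetical array of hashes; the real YAML shape may differ.
def total_duration(clips)
  total = clips.sum do |clip|
    timecode_to_centis(clip['out_point']) - timecode_to_centis(clip['in_point'])
  end
  centis_to_timecode(total)
end

clips = [
  { 'in_point' => '00:00:15.13', 'out_point' => '00:00:16.27' },
  { 'in_point' => '00:01:02.50', 'out_point' => '00:01:10.00' }
]
puts total_duration(clips)
```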
121
+ ### 6. Export
122
+
123
+ Use the `editor` value passed inline in the prompt — the parent already resolved it. Run the matching command:
124
+
125
+ ```bash
126
+ # Final Cut Pro X
127
+ bundle exec ./.claude/skills/roughcut/export_to_fcpxml.rb libraries/[library-name]/roughcuts/[slug]_[timestamp].yaml libraries/[library-name]/roughcuts/[slug]_[timestamp].fcpxml fcpx
128
+
129
+ # Premiere Pro
130
+ bundle exec ./.claude/skills/roughcut/export_to_fcpxml.rb libraries/[library-name]/roughcuts/[slug]_[timestamp].yaml libraries/[library-name]/roughcuts/[slug]_[timestamp].xml premiere
131
+
132
+ # DaVinci Resolve
133
+ bundle exec ./.claude/skills/roughcut/export_to_fcpxml.rb libraries/[library-name]/roughcuts/[slug]_[timestamp].yaml libraries/[library-name]/roughcuts/[slug]_[timestamp].xml resolve
134
+ ```
135
+
136
+ ### 7. Return — with notes
137
+
138
+ Return a conversational message. Include:
139
+ - The path to the YAML
140
+ - The path to the exported XML in the library
141
+ - Your editorial notes — alternatives you considered, judgment calls, plan deviations, pacing flags
142
+
143
+ Example:
144
+
145
+ > YAML: libraries/foo/roughcuts/my_cut_20260501_143022.yaml
146
+ > XML: libraries/foo/roughcuts/my_cut_20260501_143022.fcpxml
147
+ >
148
+ > A couple of alternates I had in mind:
149
+ >
150
+ > - For the ending, the dinosaur-wins angle could work — we'd swap in clips X, Y, Z. Happy to rebuild if that's the direction.
151
+ > - The intro currently runs 35s; if you want it tighter, just the helicopter takeoff (clip K) lands in 8s.
152
+
153
+ The parent reads your notes and dialogues with the user. Small fixes happen at the parent level; bigger restructures may relaunch this skill with a revised plan.
@@ -102,6 +102,31 @@ def main
102
102
  generator.save(output_path)
103
103
 
104
104
  puts "\n✓ Rough cut exported to: #{output_path}"
105
+
106
+ validate_fcpxml(output_path) if editor_symbol == :fcpx
107
+ end
108
+
109
+ def validate_fcpxml(xml_path)
110
+ dtd_path = File.expand_path('../../../dtd/FCPXMLv1_8.dtd', __dir__)
111
+ unless File.exist?(dtd_path)
112
+ puts "⚠ DTD not found at #{dtd_path}; skipping validation."
113
+ return
114
+ end
115
+
116
+ unless system('command -v xmllint > /dev/null 2>&1')
117
+ puts "⚠ xmllint not found; skipping validation."
118
+ return
119
+ end
120
+
121
+ # xmllint prints errors to stderr; --noout suppresses the doc dump on success.
122
+ output = `xmllint --noout --dtdvalid "#{dtd_path}" "#{xml_path}" 2>&1`
123
+ if $?.success?
124
+ puts "✓ FCPXML validates against FCPXMLv1_8.dtd"
125
+ else
126
+ warn "✗ FCPXML failed DTD validation:"
127
+ warn output
128
+ exit 1
129
+ end
105
130
  end
106
131
 
107
132
  main
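The `validate_fcpxml` method above interpolates both paths into a shell string, so a path containing a double quote or `$` could be misparsed by the shell. A possible alternative, sketched here under the assumption that `xmllint` is on `PATH`, passes each argument as its own argv element through `Open3.capture2e`. The helper names `xmllint_command` and `validate_against_dtd` are hypothetical and not part of the package.

```ruby
require 'open3'

# Build the argv for xmllint. Each path is its own element, so no shell
# quoting is involved.
def xmllint_command(dtd_path, xml_path)
  ['xmllint', '--noout', '--dtdvalid', dtd_path, xml_path]
end

# Returns [ok, combined_output]. ok is nil when xmllint is not installed.
def validate_against_dtd(dtd_path, xml_path)
  output, status = Open3.capture2e(*xmllint_command(dtd_path, xml_path))
  [status.success?, output]
rescue Errno::ENOENT
  [nil, 'xmllint not found; validation skipped']
end
```

Rescuing `Errno::ENOENT` covers the missing-binary case, so a separate `command -v` check would not be needed in this variant.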
@@ -0,0 +1,31 @@
1
+ ---
2
+ name: summarize-video
3
+ description: Generates a short markdown summary of a video from its visual transcript. Covers overview, key visuals, notable dialogue, and b-roll. Run after analyze-video as the final footage analysis step; summaries become a required field on every video before any roughcut can be created. Always launch this skill using the Haiku model.
4
+ ---
5
+
6
+ # Skill: Summarize Video (parent brief)
7
+
8
+ Generates a short markdown summary from a video's visual transcript. Always launch on the **Haiku model**.
9
+
10
+ `SKILL.md` is the parent's dispatch brief. The sub-agent's working prompt lives in `agent_prompt.md` — inline its contents when launching the Task agent. Don't pass `SKILL.md`.
11
+
12
+ ## Parallelism
13
+
14
+ Launch at most 10 agents in parallel, and keep launching until all videos are summarized.
15
+
16
+ ## Pre-create the skeleton (parent step, before launching the agent)
17
+
18
+ For each video, the parent runs:
19
+
20
+ ```bash
21
+ ruby .claude/skills/summarize-video/summary_skeleton.rb <visual_transcript_path> <summary_output_path>
22
+ ```
23
+
24
+ This writes a skeleton file with the header (filename + duration) filled in and four `<!-- FILL_X -->` placeholders in the body. The agent fills them via `Edit`. The skeleton + Edit pattern is required: without it, Haiku frequently refuses Write and dumps markdown into its reply instead.
25
+
26
+ ## Inputs to gather and pass inline
27
+
28
+ - `visual_transcript_path` — absolute path to the visual transcript JSON
29
+ - `summary_output_path` — absolute path to the pre-created skeleton file
30
+
31
+ After the agent returns, update `library.yaml` with `summary: <filename>.md`.
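The parent's bookkeeping for this step can be sketched as pairing each visual transcript with a summary path and grouping the pairs into batches of 10. This is an illustration only. The mapping from `visual_<name>.json` to `summary_<name>.md` is an assumption based on the file patterns mentioned elsewhere in this diff, and `summary_path_for` and `batches` are hypothetical names.

```ruby
# Map a visual transcript path to a summary path.
# Assumption: summary_<name>.md mirrors visual_<name>.json and lives in a
# sibling summaries/ directory; the source does not confirm the exact naming.
def summary_path_for(visual_transcript_path)
  library_dir = File.dirname(File.dirname(visual_transcript_path))
  name = File.basename(visual_transcript_path, '.json').sub(/\Avisual_/, '')
  File.join(library_dir, 'summaries', "summary_#{name}.md")
end

# Group [transcript, summary] pairs into batches of at most `size`.
def batches(visual_transcript_paths, size = 10)
  visual_transcript_paths.each_slice(size).map do |batch|
    batch.map { |path| [path, summary_path_for(path)] }
  end
end

paths = (1..23).map { |i| "/libraries/demo/transcripts/visual_clip_#{i}.json" }
batches(paths).each_with_index do |batch, index|
  puts "batch #{index + 1}: #{batch.size} videos"
end
```

Fixed batches are simpler than a rolling pool but leave slots idle while the slowest agent in a batch finishes. Either approach respects the 10-agent cap.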
@@ -0,0 +1,39 @@
1
+ # Summarize Video (sub-agent prompt)
2
+
3
+ You are a sub-agent on the Haiku model. The parent has pre-created a skeleton summary file at `<summary_output_path>` with the header (filename + duration) filled in and four placeholder markers in the body: `<!-- FILL_OVERVIEW -->`, `<!-- FILL_KEY_VISUALS -->`, `<!-- FILL_DIALOGUE -->`, `<!-- FILL_BROLL -->`.
4
+
5
+ Your job is to replace each placeholder with content using the **Edit** tool. Your text reply is just a one-line confirmation.
6
+
7
+ ## Inputs (passed inline by the parent)
8
+
9
+ - `visual_transcript_path` — absolute path to the visual transcript JSON
10
+ - `summary_output_path` — absolute path to the pre-created skeleton file
11
+
12
+ ## Action 1 — Bash: extract the script
13
+
14
+ ```bash
15
+ ruby .claude/skills/summarize-video/visual_script_extractor.rb <visual_transcript_path>
16
+ ```
17
+
18
+ The stdout is your input data: a header followed by interleaved `[VISUAL]` descriptions and timestamped dialogue.
19
+
20
+ ## Action 2 — Read the skeleton
21
+
22
+ Read `<summary_output_path>`. The Edit tool requires this before editing.
23
+
24
+ ## Action 3 — Edit each placeholder
25
+
26
+ Use the **Edit** tool four times to replace each `<!-- FILL_X -->` marker with the corresponding content:
27
+
28
+ - `<!-- FILL_OVERVIEW -->` → 2-3 sentences describing the narrative arc. Be specific; avoid vague endings like "the clip ends with..." or "discusses something."
29
+ - `<!-- FILL_KEY_VISUALS -->` → 3-6 bullets covering locations, distinctive shots, visual changes.
30
+ - `<!-- FILL_DIALOGUE -->` → 0–3 quotes formatted as `> [MM:SS] "Quote"`. For clips under 30 seconds, often 0 or 1 is enough — write `None` if nothing stands out. Skip filler ("um", "you know", "I have to be honest"). Use the `[MM:SS]` shown next to each line in the script.
31
+ - `<!-- FILL_BROLL -->` → cutaway descriptions distinct from the main subject. For single-shot clips, write `None`. Do not speculate about how the footage could be used as b-roll elsewhere.
32
+
33
+ ## Action 4 — Reply with one line
34
+
35
+ After the four Edits succeed, your text reply must be exactly:
36
+
37
+ `✓ <video_filename> summarized`
38
+
39
+ Nothing else. The file is the deliverable.
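Since the file is the deliverable, a parent could check that no placeholder survived before recording the summary. The sketch below is one way to do that and is not part of the package. `unfilled_placeholders` is a hypothetical helper that scans for the `<!-- FILL_X -->` marker shape described above.

```ruby
# Matches the marker shape used by the skeleton, e.g. <!-- FILL_OVERVIEW -->.
PLACEHOLDER = /<!-- FILL_([A-Z_]+) -->/

# Return the names of any placeholders still present in the markdown text.
def unfilled_placeholders(markdown)
  markdown.scan(PLACEHOLDER).flatten
end

skeleton = <<~MD
  ## Overview
  <!-- FILL_OVERVIEW -->

  ## Key Visuals
  <!-- FILL_KEY_VISUALS -->

  ## Notable Dialogue
  <!-- FILL_DIALOGUE -->

  ## B-Roll
  <!-- FILL_BROLL -->
MD

puts unfilled_placeholders(skeleton).inspect
```

An empty result would indicate that all four Edits landed.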
@@ -0,0 +1,78 @@
1
+ #!/usr/bin/env ruby
2
+ # Pre-create a summary skeleton file for the summarize-video skill,
3
+ # with header (filename, duration) filled in and four placeholder markers
4
+ # in the body for the sub-agent to replace via Edit.
5
+ #
6
+ # Usage:
7
+ # ruby summary_skeleton.rb <visual_transcript.json> <summary_output.md>
8
+
9
+ require 'json'
10
+
11
+ class SummarySkeleton
12
+ def self.create(transcript_path, output_path)
13
+ new(transcript_path, output_path).create
14
+ end
15
+
16
+ def initialize(transcript_path, output_path)
17
+ raise ArgumentError, "transcript_path is required" if transcript_path.nil? || transcript_path.empty?
18
+ raise ArgumentError, "output_path is required" if output_path.nil? || output_path.empty?
19
+
20
+ @transcript_path = transcript_path
21
+ @output_path = output_path
22
+ end
23
+
24
+ def create
25
+ File.write(output_path, skeleton)
26
+ puts "skeleton: #{output_path}"
27
+ end
28
+
29
+ private
30
+
31
+ attr_reader :transcript_path, :output_path
32
+
33
+ def data
34
+ @data ||= JSON.parse(File.read(transcript_path))
35
+ end
36
+
37
+ def video_filename
38
+ File.basename(data["video_path"].to_s)
39
+ end
40
+
41
+ def segments
42
+ data["segments"] or raise "transcript JSON has no 'segments' key: #{transcript_path}"
43
+ end
44
+
45
+ def total_duration
46
+ segments.empty? ? 0.0 : segments.last["end"].to_f
47
+ end
48
+
49
+ def format_timestamp(seconds)
50
+ total = seconds.to_i
51
+ "%02d:%02d" % [total / 60, total % 60]
52
+ end
53
+
54
+ def skeleton
55
+ <<~MD
56
+ # #{video_filename}
57
+ **Duration:** #{format_timestamp(total_duration)}
58
+
59
+ ## Overview
60
+ <!-- FILL_OVERVIEW -->
61
+
62
+ ## Key Visuals
63
+ <!-- FILL_KEY_VISUALS -->
64
+
65
+ ## Notable Dialogue
66
+ <!-- FILL_DIALOGUE -->
67
+
68
+ ## B-Roll
69
+ <!-- FILL_BROLL -->
70
+ MD
71
+ end
72
+ end
73
+
74
+ if __FILE__ == $PROGRAM_NAME
75
+ transcript_path, output_path = ARGV
76
+ abort("usage: summary_skeleton.rb <visual_transcript.json> <summary_output.md>") unless transcript_path && output_path
77
+ SummarySkeleton.create(transcript_path, output_path)
78
+ end
@@ -0,0 +1,78 @@
1
+ #!/usr/bin/env ruby
2
+ # Extract a human-readable script from a visual transcript JSON,
3
+ # interleaving [VISUAL] descriptions with timestamped dialogue.
4
+ # Prints to stdout for direct consumption by the summarize-video skill.
5
+ #
6
+ # Usage:
7
+ # ruby visual_script_extractor.rb <visual_transcript.json>
8
+
9
+ require 'json'
10
+
11
+ class VisualScriptExtractor
12
+ def self.extract(transcript_path)
13
+ new(transcript_path).extract
14
+ end
15
+
16
+ def initialize(transcript_path)
17
+ raise ArgumentError, "transcript_path is required" if transcript_path.nil? || transcript_path.empty?
18
+
19
+ @transcript_path = transcript_path
20
+ end
21
+
22
+ def extract
23
+ puts header
24
+ puts
25
+ puts format_script
26
+ end
27
+
28
+ private
29
+
30
+ attr_reader :transcript_path
31
+
32
+ def data
33
+ @data ||= JSON.parse(File.read(transcript_path))
34
+ end
35
+
36
+ def segments
37
+ data["segments"] or raise "transcript JSON has no 'segments' key: #{transcript_path}"
38
+ end
39
+
40
+ def header
41
+ "# Video: #{video_filename}\n# Duration: #{format_timestamp(total_duration)}"
42
+ end
43
+
44
+ def video_filename
45
+ File.basename(data["video_path"].to_s)
46
+ end
47
+
48
+ def total_duration
49
+ segments.empty? ? 0.0 : segments.last["end"].to_f
50
+ end
51
+
52
+ def format_script
53
+ segments.filter_map { |s| format_segment(s) }.join("\n\n")
54
+ end
55
+
56
+ def format_segment(segment)
57
+ text = segment["text"].to_s.strip
58
+ visual = segment["visual"].to_s.strip
59
+ ts = format_timestamp(segment["start"].to_f)
60
+
61
+ lines = []
62
+ lines << "[#{ts}] [VISUAL] #{visual}" unless visual.empty?
63
+ lines << "[#{ts}] #{text}" unless text.empty?
64
+
65
+ lines.empty? ? nil : lines.join("\n")
66
+ end
67
+
68
+ def format_timestamp(seconds)
69
+ total = seconds.to_i
70
+ "%02d:%02d" % [total / 60, total % 60]
71
+ end
72
+ end
73
+
74
+ if __FILE__ == $PROGRAM_NAME
75
+ transcript_path = ARGV[0]
76
+ abort("usage: visual_script_extractor.rb <visual_transcript.json>") unless transcript_path
77
+ VisualScriptExtractor.extract(transcript_path)
78
+ end
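To illustrate the extractor's output shape, the two formatting helpers can be run standalone on in-memory segments. The functions below are condensed from the class in this diff. The sample segments are invented for illustration and do not come from any real transcript.

```ruby
# Condensed from VisualScriptExtractor: MM:SS from whole seconds.
def format_timestamp(seconds)
  total = seconds.to_i
  format('%02d:%02d', total / 60, total % 60)
end

# Condensed from VisualScriptExtractor: one or two lines per segment,
# or nil when the segment has neither dialogue nor a visual description.
def format_segment(segment)
  text = segment['text'].to_s.strip
  visual = segment['visual'].to_s.strip
  ts = format_timestamp(segment['start'].to_f)
  lines = []
  lines << "[#{ts}] [VISUAL] #{visual}" unless visual.empty?
  lines << "[#{ts}] #{text}" unless text.empty?
  lines.empty? ? nil : lines.join("\n")
end

# Hypothetical sample data.
segments = [
  { 'start' => 0.0, 'end' => 4.2, 'text' => '', 'visual' => 'Wide shot of a harbor at dawn' },
  { 'start' => 75.4, 'end' => 78.0, 'text' => 'We launch tomorrow.', 'visual' => 'Close-up of the speaker' },
  { 'start' => 78.0, 'end' => 79.1, 'text' => '', 'visual' => '' }
]

script = segments.filter_map { |s| format_segment(s) }.join("\n\n")
puts script
```

The third segment has no text and no visual, so `filter_map` drops it and the script contains two entries separated by a blank line.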
@@ -3,81 +3,34 @@ name: transcribe-audio
3
3
  description: Transcribes video audio using WhisperX, preserving original timestamps. Creates JSON transcript with word-level timing. Use when you need to generate audio transcripts for videos.
4
4
  ---
5
5
 
6
- # Skill: Transcribe Audio
6
+ # Skill: Transcribe Audio (parent brief)
7
7
 
8
- Transcribes video audio using WhisperX and creates clean JSON transcripts with word-level timing data.
8
+ Transcribes video audio using WhisperX and produces a clean JSON transcript with word-level timing.
9
9
 
10
- ## When to Use
11
- - Videos need audio transcripts before visual analysis
10
+ `SKILL.md` is the parent's dispatch brief. The sub-agent's working prompt lives in `agent_prompt.md` — inline its contents when launching the Task agent. Don't pass `SKILL.md`.
12
11
 
13
- ## Critical Requirements
12
+ ## Parallelism
14
13
 
15
- Use WhisperX, NOT standard Whisper. WhisperX preserves the original video timeline including leading silence, ensuring transcripts match actual video timestamps. Run WhisperX directly on video files. Don't extract audio separately - this ensures timestamp alignment.
14
+ Launch at most **2 in parallel**. WhisperX is already multithreaded internally (~4 CPU threads via CTranslate2); 2 processes is the throughput-vs-RAM sweet spot on a 16GB Mac.
16
15
 
17
- ## Workflow
16
+ ## Inputs to gather and pass inline
18
17
 
19
- ### 1. Read Language from Library File
18
+ The parent reads `library.yaml` and `settings.yaml` and passes these values inline in each agent's prompt:
20
19
 
21
- Read the library's `library.yaml` to get the language code:
20
+ - `video_path` — absolute path to the video file
21
+ - `transcript_output_dir` — where to write the transcript JSON (e.g. `libraries/<library>/transcripts`)
22
+ - `language_code` — ISO 639-1 code (e.g. `en`, `es`) — parent maps from library.yaml's `language` name
23
+ - `whisper_model` — model size from settings.yaml (e.g. `small`, `medium`, `turbo`)
24
+ - `transcript_refinement` — boolean from library.yaml. If `true`, also pass:
25
+ - `user_context` (may be empty string)
26
+ - `footage_summary` (may be empty string)
22
27
 
23
- ```yaml
24
- # Library metadata
25
- library_name: [library-name]
26
- language: en # Language code stored here
27
- ...
28
- ```
28
+ After the agent returns, update `library.yaml` with `transcript: <filename>.json`.
29
29
 
30
- ### 2. Run WhisperX
30
+ ## Next step
31
31
 
32
- ```bash
33
- whisperx "/full/path/to/video.mov" \
34
- --language en \
35
- --model medium \
36
- --compute_type float32 \
37
- --device cpu \
38
- --output_format json \
39
- --output_dir libraries/[library-name]/transcripts
40
- ```
32
+ Once all videos have audio transcripts, dispatch `analyze-video` for visual descriptions.
41
33
 
42
- ### 3. Prepare Audio Transcript
34
+ ## Dependencies
43
35
 
44
- After WhisperX completes, format the JSON using our prepare_audio_script:
45
-
46
- ```bash
47
- ruby .claude/skills/transcribe-audio/prepare_audio_script.rb \
48
- libraries/[library-name]/transcripts/video_name.json \
49
- /full/path/to/original/video_name.mov
50
- ```
51
-
52
- This script:
53
- - Adds video source path as metadata
54
- - Removes unnecessary fields to reduce file size
55
- - Prettifies JSON
56
-
57
- ### 4. Return Success Response
58
-
59
- After audio preparation completes, return this structured response to the parent agent:
60
-
61
- ```
62
- ✓ [video_filename.mov] transcribed successfully
63
- Audio transcript: libraries/[library-name]/transcripts/video_name.json
64
- Video path: /full/path/to/video_filename.mov
65
- ```
66
-
67
- **DO NOT update library.yaml** - the parent agent will handle this to avoid race conditions when running multiple transcriptions in parallel.
68
-
69
- ## Running in Parallel
70
-
71
- This skill is designed to run inside a Task agent for parallel execution:
72
- - Each agent handles ONE video file
73
- - Multiple agents can run simultaneously
74
- - Parent thread updates library.yaml sequentially after each agent completes
75
- - No race conditions on shared YAML file
76
-
77
- ## Next Step
78
-
79
- After audio transcription, use the **analyze-video** skill to add visual descriptions and create the visual transcript.
80
-
81
- ## Installation
82
-
83
- Ensure WhisperX is installed. Use the **setup** skill to verify dependencies.
36
+ WhisperX must be installed. Use the **setup** skill to verify.
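The parent's mapping from a language name to an ISO 639-1 code might be sketched as a small lookup. How `library.yaml` actually spells the `language` value is not shown in this diff, so the lowercase-name keys, the short table, and the helper name `language_code_for` are all assumptions.

```ruby
# A deliberately small table; extend it as needed.
ISO_639_1 = {
  'english' => 'en', 'spanish' => 'es', 'french' => 'fr',
  'german' => 'de', 'portuguese' => 'pt', 'japanese' => 'ja'
}.freeze

# Accepts either a language name ("English") or a known code ("en").
# Raises ArgumentError for anything outside the table rather than guessing.
def language_code_for(name)
  key = name.to_s.strip.downcase
  return key if ISO_639_1.value?(key)

  ISO_639_1.fetch(key) { raise ArgumentError, "unknown language: #{name.inspect}" }
end

puts language_code_for('English')
puts language_code_for('es')
```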
@@ -0,0 +1,53 @@
1
+ # Transcribe Audio (sub-agent prompt)
2
+
3
+ You are a sub-agent. Transcribe one video file using WhisperX and produce a clean JSON transcript with word-level timing.
4
+
5
+ **Critical:** Use WhisperX, NOT standard Whisper. WhisperX preserves the original video timeline including leading silence, ensuring transcripts match actual video timestamps. Run WhisperX directly on the video file — don't extract audio separately.
6
+
7
+ ## Inputs (passed inline by the parent)
8
+
9
+ - `video_path` — absolute path to the video file
10
+ - `transcript_output_dir` — where to write the transcript JSON
11
+ - `language_code` — ISO 639-1 code (e.g. `en`, `es`)
12
+ - `whisper_model` — model size (e.g. `small`, `medium`, `turbo`)
13
+ - `transcript_refinement` — boolean; if `true`, also expect:
14
+ - `user_context` — string, may be empty
15
+ - `footage_summary` — string, may be empty
16
+
17
+ Do NOT read `library.yaml` or `settings.yaml`. If a required input is missing from your prompt, stop and ask the parent rather than inferring from the filesystem.
18
+
19
+ ## 1. Run WhisperX
20
+
21
+ ```bash
22
+ whisperx "<video_path>" \
23
+ --language <language_code> \
24
+ --model <whisper_model> \
25
+ --compute_type float32 \
26
+ --device cpu \
27
+ --output_format json \
28
+ --output_dir <transcript_output_dir>
29
+ ```
30
+
31
+ ## 2. Prepare audio transcript
32
+
33
+ ```bash
34
+ ruby .claude/skills/transcribe-audio/prepare_audio_script.rb \
35
+ <transcript_output_dir>/<video_basename>.json \
36
+ <video_path>
37
+ ```
38
+
39
+ This script adds the video source path as metadata, removes unnecessary fields, and prettifies the JSON.
40
+
41
+ ## 3. (Optional) Refine the transcript
42
+
43
+ If `transcript_refinement: true`, follow `.claude/skills/transcribe-audio/refine_instructions.md`, using the `user_context` and `footage_summary` strings the parent supplied inline. Do NOT open `library.yaml`. Skip if `transcript_refinement` is missing or `false`.
44
+
45
+ ## 4. Return success response
46
+
47
+ ```
48
+ ✓ <video_basename.mov> transcribed successfully
49
+ Audio transcript: <transcript_output_dir>/<video_basename>.json
50
+ Video path: <video_path>
51
+ ```
52
+
53
+ **Do NOT update library.yaml** — the parent handles all yaml I/O to avoid race conditions in parallel runs.
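Steps 1 and 2 of the prompt above can be sketched as helpers that assemble the WhisperX argv and derive the transcript path from the inputs. The flags are copied from the prompt. The helper names `whisperx_command` and `transcript_path` are hypothetical, and the sketch assumes WhisperX names its output after the video's basename, as the prompt's `<video_basename>.json` suggests.

```ruby
# Build the WhisperX argv from the inline inputs. Passing an array to
# system(*cmd) avoids shell quoting issues with paths that contain spaces.
def whisperx_command(video_path:, language_code:, whisper_model:, transcript_output_dir:)
  ['whisperx', video_path,
   '--language', language_code,
   '--model', whisper_model,
   '--compute_type', 'float32',
   '--device', 'cpu',
   '--output_format', 'json',
   '--output_dir', transcript_output_dir]
end

# Assumption: the transcript is written as <output_dir>/<video basename>.json.
def transcript_path(video_path, transcript_output_dir)
  File.join(transcript_output_dir, "#{File.basename(video_path, '.*')}.json")
end

cmd = whisperx_command(
  video_path: '/videos/DJI 123.mov',
  language_code: 'en',
  whisper_model: 'medium',
  transcript_output_dir: 'libraries/demo/transcripts'
)
puts cmd.inspect
puts transcript_path('/videos/DJI 123.mov', 'libraries/demo/transcripts')
```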