buttercut 0.4.0 → 0.6.0
- checksums.yaml +4 -4
- data/.claude/scripts/script_extractor.rb +66 -0
- data/.claude/settings.local.json +7 -1
- data/.claude/skills/analyze-video/SKILL.md +14 -73
- data/.claude/skills/analyze-video/agent_prompt.md +84 -0
- data/.claude/skills/backup-library/SKILL.md +1 -1
- data/.claude/skills/backup-library/backup_libraries.rb +1 -1
- data/.claude/skills/cut-planner/SKILL.md +74 -0
- data/.claude/skills/release/SKILL.md +21 -11
- data/.claude/skills/roughcut/SKILL.md +41 -47
- data/.claude/skills/roughcut/agent_prompt.md +153 -0
- data/.claude/skills/roughcut/export_to_fcpxml.rb +25 -0
- data/.claude/skills/summarize-video/SKILL.md +31 -0
- data/.claude/skills/summarize-video/agent_prompt.md +39 -0
- data/.claude/skills/summarize-video/summary_skeleton.rb +78 -0
- data/.claude/skills/summarize-video/visual_script_extractor.rb +78 -0
- data/.claude/skills/transcribe-audio/SKILL.md +19 -66
- data/.claude/skills/transcribe-audio/agent_prompt.md +53 -0
- data/.claude/skills/transcribe-audio/refine_instructions.md +114 -0
- data/CLAUDE.md +133 -52
- data/README.md +5 -1
- data/lib/buttercut/version.rb +1 -1
- data/templates/library_template.yaml +2 -0
- data/templates/plan_template.md +53 -0
- data/templates/roughcut_template.yaml +3 -20
- data/templates/settings_template.yaml +13 -0
- metadata +14 -3
- data/.claude/skills/roughcut/agent_instructions.md +0 -109
|
@@ -0,0 +1,153 @@
|
|
|
1
|
+
# Roughcut Agent Instructions
|
|
2
|
+
|
|
3
|
+
You are a video editor AI agent. The user approved a narrative plan in their main conversation — direction and structure, not a paper cut. Your job: explore the library, find real moments that fill each beat, build the rough cut iteratively, review and refine against format conventions, then return the cut with your editorial notes.
|
|
4
|
+
|
|
5
|
+
The plan is your compass. The library is your full toolkit.
|
|
6
|
+
|
|
7
|
+
## Working style
|
|
8
|
+
|
|
9
|
+
This is async work. **You do not ping the user mid-task.** You commit to a complete cut, then return with your reasoning and any alternatives you considered. The parent dialogues with the user from there.
|
|
10
|
+
|
|
11
|
+
Within the task, work iteratively, not in one shot:
|
|
12
|
+
1. Take one beat from the plan at a time.
|
|
13
|
+
2. Read transcripts only for the videos you actually need.
|
|
14
|
+
3. Drop candidate clips into the YAML — close enough, not perfect.
|
|
15
|
+
4. Move on.
|
|
16
|
+
5. After every couple of beats, **look back**. Cut earlier clips that get said better later. Tighten dragging beats. Swap in stronger moments.
|
|
17
|
+
|
|
18
|
+
You'll touch the YAML many times. That's the point.
|
|
19
|
+
|
|
20
|
+
The plan suggests footage per beat as a starting point. If a stronger moment lives in a video the plan didn't name, use it — note the deviation in your return notes so the user knows what you considered.
|
|
21
|
+
|
|
22
|
+
## Workflow
|
|
23
|
+
|
|
24
|
+
### 1. Read the library
|
|
25
|
+
|
|
26
|
+
Open `libraries/[library-name]/library.yaml`. The library includes:
|
|
27
|
+
- The full video inventory (filenames, paths, audio + visual transcript paths)
|
|
28
|
+
- `footage_summary` — what the project is, the tone, the subjects
|
|
29
|
+
- `user_context` — what you've learned about this user across sessions
|
|
30
|
+
|
|
31
|
+
After reading the library, decide which files you'll need to read for each beat.
|
|
32
|
+
|
|
33
|
+
### 2. Set up the YAML
|
|
34
|
+
|
|
35
|
+
Derive a slug from the plan's filename (the `[short-name]` portion of `plan_[short-name]_[timestamp].md`). Generate a fresh timestamp:
|
|
36
|
+
|
|
37
|
+
```bash
|
|
38
|
+
date +%Y%m%d_%H%M%S
|
|
39
|
+
```
|
|
40
|
+
|
|
41
|
+
Reuse the same timestamp string for the YAML and exported XML. Copy the template:
|
|
42
|
+
|
|
43
|
+
```bash
|
|
44
|
+
cp templates/roughcut_template.yaml "libraries/[library-name]/roughcuts/[slug]_[timestamp].yaml"
|
|
45
|
+
```
|
|
46
|
+
|
|
47
|
+
Set `description` in the YAML to a one-line summary of what the cut is.
|
|
48
|
+
|
|
49
|
+
### 3. Build beat by beat
|
|
50
|
+
|
|
51
|
+
**Clip file types** (all under `libraries/[library-name]/`):
|
|
52
|
+
- **Summary** (`summaries/summary_*.md`) — high-level markdown about what happens in a clip. Short and quick to scan. Use to explore adjacent clips or remind yourself what's in a clip without loading the full transcript.
|
|
53
|
+
- **Visual transcript** (`transcripts/visual_*.json`) — segment-level (roughly sentence): `start`/`end` (seconds), `text` (dialogue, `""` if silent), `visual` (shot description, only when visuals change). This is the primary file for picking moments.
|
|
54
|
+
- **Audio transcript** (`transcripts/*.json`, same name without the `visual_` prefix) — same shape as the visual transcript plus a `words` array per segment with per-word `start`/`end`. Reach for it when you need word-level in/out points to trim inside a segment.
|
|
55
|
+
|
|
56
|
+
For each beat in the plan:
|
|
57
|
+
- Open visual transcripts for the videos that feed it.
|
|
58
|
+
- Pick moments that make sense and drop clips into the YAML.
|
|
59
|
+
- If a segment's dialogue should be cut down, grep the audio transcript for the word-level timing. These files can be large, so grepping for the moment is generally faster than loading the entire file. See the worked example below.
|
|
60
|
+
- After you've completed a scene or beat, consider going back to improve earlier beats if you can make them stronger, more cohesive, or can remove redundancy.
|
|
61
|
+
|
|
62
|
+
**Worked example — trimming inside a segment.** A wordy segment from `transcripts/visual_DJI_123.json`:
|
|
63
|
+
|
|
64
|
+
```json
|
|
65
|
+
{
|
|
66
|
+
"start": 15.129,
|
|
67
|
+
"end": 17.195,
|
|
68
|
+
"text": "We're also using AI on the back end to try to find issues as well as try to find more test issues."
|
|
69
|
+
}
|
|
70
|
+
```
|
|
71
|
+
|
|
72
|
+
The line restates itself — "to try to find issues as well as try to find more test issues." End the clip after the first "issues" instead. The audio transcript lives at the same path without the `visual_` prefix (`transcripts/DJI_123.json`). Grep for the word to get its `end` time:
|
|
73
|
+
|
|
74
|
+
```bash
|
|
75
|
+
grep -B 1 -A 2 '"word": "issues' libraries/[library-name]/transcripts/DJI_123.json
|
|
76
|
+
```
|
|
77
|
+
|
|
78
|
+
Returns both occurrences — pick the one matching context (the first "issues" ends at 16.272s, the final "issues." at 17.195s):
|
|
79
|
+
|
|
80
|
+
```json
|
|
81
|
+
{ "word": "issues", "start": 16.152, "end": 16.272 },
|
|
82
|
+
{ "word": "issues.", "start": 17.054, "end": 17.195 }
|
|
83
|
+
```
|
|
84
|
+
|
|
85
|
+
Trimmed clip: `in_point: 00:00:15.13`, `out_point: 00:00:16.27`. Drops nearly a second of redundant phrasing.
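The same lookup could be done in Ruby when a structured result is easier to work with than grep output. This is a minimal sketch, assuming the audio transcript has a top-level `segments` array whose entries carry the `words` array described earlier. `word_occurrences` is a hypothetical helper, not part of buttercut.

```ruby
require 'json'

# Hypothetical helper: collect every word entry whose text starts with
# `prefix`, mirroring the prefix match of the grep pattern.
def word_occurrences(transcript, prefix)
  transcript.fetch("segments").flat_map do |segment|
    segment.fetch("words", []).select { |w| w["word"].to_s.start_with?(prefix) }
  end
end

# Inline sample shaped like the worked example. A real run would use
# JSON.parse(File.read(path_to_audio_transcript)) instead.
sample = JSON.parse(<<~JSON)
  {"segments": [{"start": 15.129, "end": 17.195, "words": [
    {"word": "issues", "start": 16.152, "end": 16.272},
    {"word": "issues.", "start": 17.054, "end": 17.195}
  ]}]}
JSON

word_occurrences(sample, "issues").each do |w|
  puts "#{w['word']} #{w['start']} -> #{w['end']}"
end
```

Both occurrences come back in transcript order, so the first entry's `end` is the trim point used above.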
|
|
86
|
+
|
|
87
|
+
**Each clip needs:**
|
|
88
|
+
- `source_file`: filename only (from the video's entry in `library.yaml`)
|
|
89
|
+
- `in_point`: start of the FIRST segment in the clip, `HH:MM:SS.ss`
|
|
90
|
+
- `out_point`: end of the LAST segment in the clip, `HH:MM:SS.ss`
|
|
91
|
+
- `dialogue`: spoken words for the span — concatenate across segments if the clip covers more than one
|
|
92
|
+
- `visual_description`: shot description from the visual transcript
|
|
93
|
+
|
|
94
|
+
Use `start`/`end` from segments directly — preserve sub-second precision (e.g. 2.849s → `00:00:02.85`).
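The conversion from transcript seconds to `HH:MM:SS.ss` can be sketched as below. `to_timecode` is a hypothetical helper name, and rounding to the nearest hundredth is an assumption based on the `2.849s` example.

```ruby
# Hypothetical helper: convert seconds (Float) to an HH:MM:SS.ss string,
# rounding to the nearest hundredth of a second.
def to_timecode(seconds)
  centis = (seconds.to_f * 100).round
  hours, rem = centis.divmod(360_000)
  minutes, rem = rem.divmod(6_000)
  secs, cs = rem.divmod(100)
  format("%02d:%02d:%02d.%02d", hours, minutes, secs, cs)
end

puts to_timecode(2.849)   # => 00:00:02.85
puts to_timecode(15.129)  # => 00:00:15.13
```

Working in whole centiseconds keeps the carry between seconds, minutes, and hours in integer arithmetic.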
|
|
95
|
+
|
|
96
|
+
**Transcripts can be wrong. Fix them in the `dialogue` field in the roughcut YAML.** Transcripts sometimes get technical terms, brand names, and proper nouns wrong, and they can mishear speakers with accents. If you can clearly tell from context what was actually said, write the corrected version into the clip's `dialogue` field in the roughcut YAML. Do NOT edit the transcript JSON files themselves.
|
|
97
|
+
|
|
98
|
+
#### Examples:
|
|
99
|
+
"RubyVeedums" → "Ruby Meetups"
|
|
100
|
+
"Cloud Code" → "Claude Code"
|
|
101
|
+
"Hot Wide Native" → "HotWire Native"
|
|
102
|
+
|
|
103
|
+
Only correct when you're confident based on context. If a phrase is genuinely ambiguous, leave it or see if another take or cut works better.
|
|
104
|
+
|
|
105
|
+
### 4. Review pass — format-aware refinement
|
|
106
|
+
|
|
107
|
+
Once a complete first pass exists, do a deliberate review with the format in mind. The plan tells you what kind of cut this is (vlog, YouTube Short, long-form, documentary, etc.). Use that to ask:
|
|
108
|
+
|
|
109
|
+
- **Beat lengths.** Are individual beats the right length for this format? A one-minute static exposition might suit a documentary but would probably drag in a vlog. Five-second B-roll clips might also work for a documentary but feel out of place in a vlog. Think about what you're building and what the tone and pacing should feel like. Revise timings when doing so improves the pacing.
|
|
110
|
+
- **Dialogue tightness.** Does any clip's dialogue feel too wordy for the format and audience? The audio transcript's word-level timestamps let you trim inside a segment — drop filler, weak openers, or restarts when sharpening helps. **Word-level trimming is a first-class part of this pass, not an edge case.**
|
|
111
|
+
- **Redundancy.** Is a point made twice across different beats? Cut the weaker version.
|
|
112
|
+
|
|
113
|
+
Use editorial judgment based on what you know about the user (`user_context`) and what the format calls for.
|
|
114
|
+
|
|
115
|
+
### 5. Finalize the YAML
|
|
116
|
+
|
|
117
|
+
- `total_duration`: sum of all clips, `HH:MM:SS.ss`
|
|
118
|
+
- `created_date`: `YYYY-MM-DD HH:MM:SS`
|
|
119
|
+
- Confirm `description` still reflects the cut
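Computing `total_duration` by hand is error-prone once a cut has many clips. The sketch below sums clip lengths, assuming the clips are available as an array of hashes with `in_point` and `out_point` strings in `HH:MM:SS.ss`. The array shape and the helper names are assumptions for illustration.

```ruby
# Hypothetical helpers: parse and format HH:MM:SS.ss as whole centiseconds.
def timecode_to_centis(timecode)
  hours, minutes, seconds = timecode.split(":")
  (hours.to_i * 3600 + minutes.to_i * 60) * 100 + (seconds.to_f * 100).round
end

def centis_to_timecode(centis)
  hours, rem = centis.divmod(360_000)
  minutes, rem = rem.divmod(6_000)
  secs, cs = rem.divmod(100)
  format("%02d:%02d:%02d.%02d", hours, minutes, secs, cs)
end

# Sum (out_point - in_point) over all clips and format the result.
def total_duration(clips)
  centis = clips.sum do |clip|
    timecode_to_centis(clip["out_point"]) - timecode_to_centis(clip["in_point"])
  end
  centis_to_timecode(centis)
end

clips = [
  { "in_point" => "00:00:15.13", "out_point" => "00:00:16.27" },
  { "in_point" => "00:01:00.00", "out_point" => "00:01:02.50" }
]
puts total_duration(clips) # => 00:00:03.64
```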
|
|
120
|
+
|
|
121
|
+
### 6. Export
|
|
122
|
+
|
|
123
|
+
Use the `editor` value passed inline in the prompt — the parent already resolved it. Run the matching command:
|
|
124
|
+
|
|
125
|
+
```bash
|
|
126
|
+
# Final Cut Pro X
|
|
127
|
+
bundle exec ./.claude/skills/roughcut/export_to_fcpxml.rb libraries/[library-name]/roughcuts/[slug]_[timestamp].yaml libraries/[library-name]/roughcuts/[slug]_[timestamp].fcpxml fcpx
|
|
128
|
+
|
|
129
|
+
# Premiere Pro
|
|
130
|
+
bundle exec ./.claude/skills/roughcut/export_to_fcpxml.rb libraries/[library-name]/roughcuts/[slug]_[timestamp].yaml libraries/[library-name]/roughcuts/[slug]_[timestamp].xml premiere
|
|
131
|
+
|
|
132
|
+
# DaVinci Resolve
|
|
133
|
+
bundle exec ./.claude/skills/roughcut/export_to_fcpxml.rb libraries/[library-name]/roughcuts/[slug]_[timestamp].yaml libraries/[library-name]/roughcuts/[slug]_[timestamp].xml resolve
|
|
134
|
+
```
|
|
135
|
+
|
|
136
|
+
### 7. Return — with notes
|
|
137
|
+
|
|
138
|
+
Return a conversational message. Include:
|
|
139
|
+
- The path to the YAML
|
|
140
|
+
- The path to the exported XML in the library
|
|
141
|
+
- Your editorial notes — alternatives you considered, judgment calls, plan deviations, pacing flags
|
|
142
|
+
|
|
143
|
+
Example:
|
|
144
|
+
|
|
145
|
+
> YAML: libraries/foo/roughcuts/my_cut_20260501_143022.yaml
|
|
146
|
+
> XML: libraries/foo/roughcuts/my_cut_20260501_143022.fcpxml
|
|
147
|
+
>
|
|
148
|
+
> A couple of alternates I had in mind:
|
|
149
|
+
>
|
|
150
|
+
> - For the ending, the dinosaur-wins angle could work — we'd swap in clips X, Y, Z. Happy to rebuild if that's the direction.
|
|
151
|
+
> - The intro currently runs 35s; if you want it tighter, just the helicopter takeoff (clip K) lands in 8s.
|
|
152
|
+
|
|
153
|
+
The parent reads your notes and dialogues with the user. Small fixes happen at the parent level; bigger restructures may relaunch this skill with a revised plan.
|
|
@@ -102,6 +102,31 @@ def main
|
|
|
102
102
|
generator.save(output_path)
|
|
103
103
|
|
|
104
104
|
puts "\n✓ Rough cut exported to: #{output_path}"
|
|
105
|
+
|
|
106
|
+
validate_fcpxml(output_path) if editor_symbol == :fcpx
|
|
107
|
+
end
|
|
108
|
+
|
|
109
|
+
def validate_fcpxml(xml_path)
|
|
110
|
+
dtd_path = File.expand_path('../../../dtd/FCPXMLv1_8.dtd', __dir__)
|
|
111
|
+
unless File.exist?(dtd_path)
|
|
112
|
+
puts "⚠ DTD not found at #{dtd_path}; skipping validation."
|
|
113
|
+
return
|
|
114
|
+
end
|
|
115
|
+
|
|
116
|
+
unless system('command -v xmllint > /dev/null 2>&1')
|
|
117
|
+
puts "⚠ xmllint not found; skipping validation."
|
|
118
|
+
return
|
|
119
|
+
end
|
|
120
|
+
|
|
121
|
+
# xmllint prints errors to stderr; --noout suppresses the doc dump on success.
|
|
122
|
+
output = `xmllint --noout --dtdvalid "#{dtd_path}" "#{xml_path}" 2>&1`
|
|
123
|
+
if $?.success?
|
|
124
|
+
puts "✓ FCPXML validates against FCPXMLv1_8.dtd"
|
|
125
|
+
else
|
|
126
|
+
warn "✗ FCPXML failed DTD validation:"
|
|
127
|
+
warn output
|
|
128
|
+
exit 1
|
|
129
|
+
end
|
|
105
130
|
end
|
|
106
131
|
|
|
107
132
|
main
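The backtick call in `validate_fcpxml` interpolates both paths into a shell string, which may misbehave if a path contains quotes or `$`. One possible variant, offered as a sketch and not as the gem's code, passes the arguments as an array through `Open3.capture2e` so no shell parsing happens. The helper names are hypothetical. The `xmllint` flags are the ones used in the diff.

```ruby
require 'open3'

# Hypothetical helper: build the xmllint argv with the same flags as the diff.
def xmllint_command(dtd_path, xml_path)
  ["xmllint", "--noout", "--dtdvalid", dtd_path, xml_path]
end

# Hypothetical helper: run a command without a shell and return
# [success?, combined stdout and stderr].
def run_validation(argv)
  output, status = Open3.capture2e(*argv)
  [status.success?, output]
end
```

A caller would use it as `ok, output = run_validation(xmllint_command(dtd_path, xml_path))` and keep the existing success and failure branches.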
|
|
@@ -0,0 +1,31 @@
|
|
|
1
|
+
---
|
|
2
|
+
name: summarize-video
|
|
3
|
+
description: Generates a short markdown summary of a video from its visual transcript. Covers overview, key visuals, notable dialogue, and b-roll. Run after analyze-video as the final footage analysis step; summaries become a required field on every video before any roughcut can be created. Always launch this skill using the Haiku model.
|
|
4
|
+
---
|
|
5
|
+
|
|
6
|
+
# Skill: Summarize Video (parent brief)
|
|
7
|
+
|
|
8
|
+
Generates a short markdown summary from a video's visual transcript. Always launch on the **Haiku model**.
|
|
9
|
+
|
|
10
|
+
`SKILL.md` is the parent's dispatch brief. The sub-agent's working prompt lives in `agent_prompt.md` — inline its contents when launching the Task agent. Don't pass `SKILL.md`.
|
|
11
|
+
|
|
12
|
+
## Parallelism
|
|
13
|
+
|
|
14
|
+
Launch (at most) 10 agents in parallel until all videos are summarized.
|
|
15
|
+
|
|
16
|
+
## Pre-create the skeleton (parent step, before launching the agent)
|
|
17
|
+
|
|
18
|
+
For each video, the parent runs:
|
|
19
|
+
|
|
20
|
+
```bash
|
|
21
|
+
ruby .claude/skills/summarize-video/summary_skeleton.rb <visual_transcript_path> <summary_output_path>
|
|
22
|
+
```
|
|
23
|
+
|
|
24
|
+
This writes a skeleton file with the header (filename + duration) filled in and four `<!-- FILL_X -->` placeholders in the body. The agent fills them via `Edit`. The skeleton + Edit pattern is required: without it, Haiku frequently refuses Write and dumps markdown into its reply instead.
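The parent's per-video step could be scripted roughly as follows. This is a sketch under assumptions: the `summaries/summary_<name>.md` output name is inferred from the `summaries/summary_*.md` and `transcripts/visual_*.json` patterns, and both helpers are hypothetical.

```ruby
SKELETON_SCRIPT = ".claude/skills/summarize-video/summary_skeleton.rb"

# Hypothetical helper: derive the summary path next to the transcripts
# directory and build the argv for the skeleton script.
def skeleton_command(visual_transcript_path)
  library_dir = File.dirname(File.dirname(visual_transcript_path))
  stem = File.basename(visual_transcript_path, ".json").delete_prefix("visual_")
  summary_path = File.join(library_dir, "summaries", "summary_#{stem}.md")
  ["ruby", SKELETON_SCRIPT, visual_transcript_path, summary_path]
end

# Hypothetical helper: group videos into batches of at most `size`.
def batches(paths, size = 10)
  paths.each_slice(size).to_a
end

puts skeleton_command("/libs/foo/transcripts/visual_DJI_123.json").last
# => /libs/foo/summaries/summary_DJI_123.md
```

Fixed batches are a simplification of the "at most 10 in parallel" rule: a real dispatcher could start a new agent as soon as one finishes.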
|
|
25
|
+
|
|
26
|
+
## Inputs to gather and pass inline
|
|
27
|
+
|
|
28
|
+
- `visual_transcript_path` — absolute path to the visual transcript JSON
|
|
29
|
+
- `summary_output_path` — absolute path to the pre-created skeleton file
|
|
30
|
+
|
|
31
|
+
After the agent returns, update `library.yaml` with `summary: <filename>.md`.
|
|
@@ -0,0 +1,39 @@
|
|
|
1
|
+
# Summarize Video (sub-agent prompt)
|
|
2
|
+
|
|
3
|
+
You are a sub-agent on the Haiku model. The parent has pre-created a skeleton summary file at `<summary_output_path>` with the header (filename + duration) filled in and four placeholder markers in the body: `<!-- FILL_OVERVIEW -->`, `<!-- FILL_KEY_VISUALS -->`, `<!-- FILL_DIALOGUE -->`, `<!-- FILL_BROLL -->`.
|
|
4
|
+
|
|
5
|
+
Your job is to replace each placeholder with content using the **Edit** tool. Your text reply is just a one-line confirmation.
|
|
6
|
+
|
|
7
|
+
## Inputs (passed inline by the parent)
|
|
8
|
+
|
|
9
|
+
- `visual_transcript_path` — absolute path to the visual transcript JSON
|
|
10
|
+
- `summary_output_path` — absolute path to the pre-created skeleton file
|
|
11
|
+
|
|
12
|
+
## Action 1 — Bash: extract the script
|
|
13
|
+
|
|
14
|
+
```bash
|
|
15
|
+
ruby .claude/skills/summarize-video/visual_script_extractor.rb <visual_transcript_path>
|
|
16
|
+
```
|
|
17
|
+
|
|
18
|
+
The stdout is your input data: a header followed by interleaved `[VISUAL]` descriptions and timestamped dialogue.
|
|
19
|
+
|
|
20
|
+
## Action 2 — Read the skeleton
|
|
21
|
+
|
|
22
|
+
Read `<summary_output_path>`. The Edit tool requires this before editing.
|
|
23
|
+
|
|
24
|
+
## Action 3 — Edit each placeholder
|
|
25
|
+
|
|
26
|
+
Use the **Edit** tool four times to replace each `<!-- FILL_X -->` marker with the corresponding content:
|
|
27
|
+
|
|
28
|
+
- `<!-- FILL_OVERVIEW -->` → 2-3 sentences describing the narrative arc. Be specific; avoid vague endings like "the clip ends with..." or "discusses something."
|
|
29
|
+
- `<!-- FILL_KEY_VISUALS -->` → 3-6 bullets covering locations, distinctive shots, visual changes.
|
|
30
|
+
- `<!-- FILL_DIALOGUE -->` → 0–3 quotes formatted as `> [MM:SS] "Quote"`. For clips under 30 seconds, often 0 or 1 is enough — write `None` if nothing stands out. Skip filler ("um", "you know", "I have to be honest"). Use the `[MM:SS]` shown next to each line in the script.
|
|
31
|
+
- `<!-- FILL_BROLL -->` → cutaway descriptions distinct from the main subject. For single-shot clips, write `None`. Do not speculate about how the footage could be used as b-roll elsewhere.
|
|
32
|
+
|
|
33
|
+
## Action 4 — Reply with one line
|
|
34
|
+
|
|
35
|
+
After the four Edits succeed, your text reply must be exactly:
|
|
36
|
+
|
|
37
|
+
`✓ <video_filename> summarized`
|
|
38
|
+
|
|
39
|
+
Nothing else. The file is the deliverable.
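Because the file is the deliverable, the parent may want to confirm that no placeholder survived. A minimal sketch of such a check follows. `unfilled_placeholders` is a hypothetical helper and is not part of the skill.

```ruby
# Matches any remaining skeleton marker such as <!-- FILL_OVERVIEW -->.
PLACEHOLDER = /<!-- FILL_[A-Z_]+ -->/

# Hypothetical helper: list the markers still present in a summary.
def unfilled_placeholders(markdown)
  markdown.scan(PLACEHOLDER)
end

partial = "## Overview\n<!-- FILL_OVERVIEW -->\n\n## B-Roll\nNone\n"
puts unfilled_placeholders(partial).inspect # => ["<!-- FILL_OVERVIEW -->"]
```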
|
|
@@ -0,0 +1,78 @@
|
|
|
1
|
+
#!/usr/bin/env ruby
|
|
2
|
+
# Pre-create a summary skeleton file for the summarize-video skill,
|
|
3
|
+
# with header (filename, duration) filled in and four placeholder markers
|
|
4
|
+
# in the body for the sub-agent to replace via Edit.
|
|
5
|
+
#
|
|
6
|
+
# Usage:
|
|
7
|
+
# ruby summary_skeleton.rb <visual_transcript.json> <summary_output.md>
|
|
8
|
+
|
|
9
|
+
require 'json'
|
|
10
|
+
|
|
11
|
+
class SummarySkeleton
|
|
12
|
+
def self.create(transcript_path, output_path)
|
|
13
|
+
new(transcript_path, output_path).create
|
|
14
|
+
end
|
|
15
|
+
|
|
16
|
+
def initialize(transcript_path, output_path)
|
|
17
|
+
raise ArgumentError, "transcript_path is required" if transcript_path.nil? || transcript_path.empty?
|
|
18
|
+
raise ArgumentError, "output_path is required" if output_path.nil? || output_path.empty?
|
|
19
|
+
|
|
20
|
+
@transcript_path = transcript_path
|
|
21
|
+
@output_path = output_path
|
|
22
|
+
end
|
|
23
|
+
|
|
24
|
+
def create
|
|
25
|
+
File.write(output_path, skeleton)
|
|
26
|
+
puts "skeleton: #{output_path}"
|
|
27
|
+
end
|
|
28
|
+
|
|
29
|
+
private
|
|
30
|
+
|
|
31
|
+
attr_reader :transcript_path, :output_path
|
|
32
|
+
|
|
33
|
+
def data
|
|
34
|
+
@data ||= JSON.parse(File.read(transcript_path))
|
|
35
|
+
end
|
|
36
|
+
|
|
37
|
+
def video_filename
|
|
38
|
+
File.basename(data["video_path"].to_s)
|
|
39
|
+
end
|
|
40
|
+
|
|
41
|
+
def segments
|
|
42
|
+
data["segments"] or raise "transcript JSON has no 'segments' key: #{transcript_path}"
|
|
43
|
+
end
|
|
44
|
+
|
|
45
|
+
def total_duration
|
|
46
|
+
segments.last["end"].to_f
|
|
47
|
+
end
|
|
48
|
+
|
|
49
|
+
def format_timestamp(seconds)
|
|
50
|
+
total = seconds.to_i
|
|
51
|
+
"%02d:%02d" % [total / 60, total % 60]
|
|
52
|
+
end
|
|
53
|
+
|
|
54
|
+
def skeleton
|
|
55
|
+
<<~MD
|
|
56
|
+
# #{video_filename}
|
|
57
|
+
**Duration:** #{format_timestamp(total_duration)}
|
|
58
|
+
|
|
59
|
+
## Overview
|
|
60
|
+
<!-- FILL_OVERVIEW -->
|
|
61
|
+
|
|
62
|
+
## Key Visuals
|
|
63
|
+
<!-- FILL_KEY_VISUALS -->
|
|
64
|
+
|
|
65
|
+
## Notable Dialogue
|
|
66
|
+
<!-- FILL_DIALOGUE -->
|
|
67
|
+
|
|
68
|
+
## B-Roll
|
|
69
|
+
<!-- FILL_BROLL -->
|
|
70
|
+
MD
|
|
71
|
+
end
|
|
72
|
+
end
|
|
73
|
+
|
|
74
|
+
if __FILE__ == $PROGRAM_NAME
|
|
75
|
+
transcript_path, output_path = ARGV
|
|
76
|
+
abort("usage: summary_skeleton.rb <visual_transcript.json> <summary_output.md>") unless transcript_path && output_path
|
|
77
|
+
SummarySkeleton.create(transcript_path, output_path)
|
|
78
|
+
end
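Two behaviors of the script above are easy to miss: `format_timestamp` truncates fractional seconds, and `segments.last["end"]` would raise on an empty `segments` array. The sketch below copies the formatter as a free function and shows one possible guard. The guarded `safe_total_duration` is an assumption about desired behavior, not what the gem does.

```ruby
# Copied from the script: truncates to whole seconds, then formats MM:SS.
def format_timestamp(seconds)
  total = seconds.to_i
  "%02d:%02d" % [total / 60, total % 60]
end

# Hypothetical guard: treat an empty segments array as zero duration
# instead of calling [] on nil.
def safe_total_duration(segments)
  last = segments.last
  last ? last["end"].to_f : 0.0
end

puts format_timestamp(75.9)                     # => 01:15
puts format_timestamp(safe_total_duration([]))  # => 00:00
```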
|
|
@@ -0,0 +1,78 @@
|
|
|
1
|
+
#!/usr/bin/env ruby
|
|
2
|
+
# Extract a human-readable script from a visual transcript JSON,
|
|
3
|
+
# interleaving [VISUAL] descriptions with timestamped dialogue.
|
|
4
|
+
# Prints to stdout for direct consumption by the summarize-video skill.
|
|
5
|
+
#
|
|
6
|
+
# Usage:
|
|
7
|
+
# ruby visual_script_extractor.rb <visual_transcript.json>
|
|
8
|
+
|
|
9
|
+
require 'json'
|
|
10
|
+
|
|
11
|
+
class VisualScriptExtractor
|
|
12
|
+
def self.extract(transcript_path)
|
|
13
|
+
new(transcript_path).extract
|
|
14
|
+
end
|
|
15
|
+
|
|
16
|
+
def initialize(transcript_path)
|
|
17
|
+
raise ArgumentError, "transcript_path is required" if transcript_path.nil? || transcript_path.empty?
|
|
18
|
+
|
|
19
|
+
@transcript_path = transcript_path
|
|
20
|
+
end
|
|
21
|
+
|
|
22
|
+
def extract
|
|
23
|
+
puts header
|
|
24
|
+
puts
|
|
25
|
+
puts format_script
|
|
26
|
+
end
|
|
27
|
+
|
|
28
|
+
private
|
|
29
|
+
|
|
30
|
+
attr_reader :transcript_path
|
|
31
|
+
|
|
32
|
+
def data
|
|
33
|
+
@data ||= JSON.parse(File.read(transcript_path))
|
|
34
|
+
end
|
|
35
|
+
|
|
36
|
+
def segments
|
|
37
|
+
data["segments"] or raise "transcript JSON has no 'segments' key: #{transcript_path}"
|
|
38
|
+
end
|
|
39
|
+
|
|
40
|
+
def header
|
|
41
|
+
"# Video: #{video_filename}\n# Duration: #{format_timestamp(total_duration)}"
|
|
42
|
+
end
|
|
43
|
+
|
|
44
|
+
def video_filename
|
|
45
|
+
File.basename(data["video_path"].to_s)
|
|
46
|
+
end
|
|
47
|
+
|
|
48
|
+
def total_duration
|
|
49
|
+
segments.last["end"].to_f
|
|
50
|
+
end
|
|
51
|
+
|
|
52
|
+
def format_script
|
|
53
|
+
segments.filter_map { |s| format_segment(s) }.join("\n\n")
|
|
54
|
+
end
|
|
55
|
+
|
|
56
|
+
def format_segment(segment)
|
|
57
|
+
text = segment["text"].to_s.strip
|
|
58
|
+
visual = segment["visual"].to_s.strip
|
|
59
|
+
ts = format_timestamp(segment["start"].to_f)
|
|
60
|
+
|
|
61
|
+
lines = []
|
|
62
|
+
lines << "[#{ts}] [VISUAL] #{visual}" unless visual.empty?
|
|
63
|
+
lines << "[#{ts}] #{text}" unless text.empty?
|
|
64
|
+
|
|
65
|
+
lines.empty? ? nil : lines.join("\n")
|
|
66
|
+
end
|
|
67
|
+
|
|
68
|
+
def format_timestamp(seconds)
|
|
69
|
+
total = seconds.to_i
|
|
70
|
+
"%02d:%02d" % [total / 60, total % 60]
|
|
71
|
+
end
|
|
72
|
+
end
|
|
73
|
+
|
|
74
|
+
if __FILE__ == $PROGRAM_NAME
|
|
75
|
+
transcript_path = ARGV[0]
|
|
76
|
+
abort("usage: visual_script_extractor.rb <visual_transcript.json>") unless transcript_path
|
|
77
|
+
VisualScriptExtractor.extract(transcript_path)
|
|
78
|
+
end
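To see what the sub-agent receives on stdout, the per-segment formatting can be exercised on sample data. The two functions below repeat the logic of `format_segment` and `format_timestamp` outside the class. The sample segments are invented for illustration.

```ruby
def format_timestamp(seconds)
  total = seconds.to_i
  "%02d:%02d" % [total / 60, total % 60]
end

# Same logic as VisualScriptExtractor#format_segment, as a free function.
def format_segment(segment)
  text = segment["text"].to_s.strip
  visual = segment["visual"].to_s.strip
  ts = format_timestamp(segment["start"].to_f)

  lines = []
  lines << "[#{ts}] [VISUAL] #{visual}" unless visual.empty?
  lines << "[#{ts}] #{text}" unless text.empty?
  lines.empty? ? nil : lines.join("\n")
end

# Invented sample data: one segment with both fields, one fully empty.
segments = [
  { "start" => 62.4, "text" => "Hello there.", "visual" => "Wide shot of a harbor" },
  { "start" => 70.0, "text" => "" }
]
puts segments.filter_map { |s| format_segment(s) }.join("\n\n")
```

The empty segment yields `nil` and is dropped by `filter_map`, so silent segments without a visual change never reach the script.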
|
|
@@ -3,81 +3,34 @@ name: transcribe-audio
|
|
|
3
3
|
description: Transcribes video audio using WhisperX, preserving original timestamps. Creates JSON transcript with word-level timing. Use when you need to generate audio transcripts for videos.
|
|
4
4
|
---
|
|
5
5
|
|
|
6
|
-
# Skill: Transcribe Audio
|
|
6
|
+
# Skill: Transcribe Audio (parent brief)
|
|
7
7
|
|
|
8
|
-
Transcribes video audio using WhisperX and
|
|
8
|
+
Transcribes video audio using WhisperX and produces a clean JSON transcript with word-level timing.
|
|
9
9
|
|
|
10
|
-
|
|
11
|
-
- Videos need audio transcripts before visual analysis
|
|
10
|
+
`SKILL.md` is the parent's dispatch brief. The sub-agent's working prompt lives in `agent_prompt.md` — inline its contents when launching the Task agent. Don't pass `SKILL.md`.
|
|
12
11
|
|
|
13
|
-
##
|
|
12
|
+
## Parallelism
|
|
14
13
|
|
|
15
|
-
|
|
14
|
+
Launch at most **2 in parallel**. WhisperX is already multithreaded internally (~4 CPU threads via CTranslate2); 2 processes is the throughput-vs-RAM sweet spot on a 16GB Mac.
|
|
16
15
|
|
|
17
|
-
##
|
|
16
|
+
## Inputs to gather and pass inline
|
|
18
17
|
|
|
19
|
-
|
|
18
|
+
The parent reads `library.yaml` and `settings.yaml` and passes these values inline in each agent's prompt:
|
|
20
19
|
|
|
21
|
-
|
|
20
|
+
- `video_path` — absolute path to the video file
|
|
21
|
+
- `transcript_output_dir` — where to write the transcript JSON (e.g. `libraries/<library>/transcripts`)
|
|
22
|
+
- `language_code` — ISO 639-1 code (e.g. `en`, `es`) — parent maps from library.yaml's `language` name
|
|
23
|
+
- `whisper_model` — model size from settings.yaml (e.g. `small`, `medium`, `turbo`)
|
|
24
|
+
- `transcript_refinement` — boolean from library.yaml. If `true`, also pass:
|
|
25
|
+
- `user_context` (may be empty string)
|
|
26
|
+
- `footage_summary` (may be empty string)
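The mapping from a language name to `language_code` might look like the sketch below. It assumes `library.yaml` stores an English language name such as `English`. The table is deliberately tiny, and both the table and the helper are hypothetical.

```ruby
# Hypothetical lookup table. Extend it with the languages a library uses.
LANGUAGE_CODES = {
  "english" => "en",
  "spanish" => "es"
}.freeze

# Hypothetical helper: fail loudly on unknown names instead of guessing.
def language_code(name)
  LANGUAGE_CODES.fetch(name.to_s.strip.downcase) do
    raise ArgumentError, "no ISO 639-1 code known for language: #{name.inspect}"
  end
end

puts language_code("English") # => en
```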
|
|
22
27
|
|
|
23
|
-
|
|
24
|
-
# Library metadata
|
|
25
|
-
library_name: [library-name]
|
|
26
|
-
language: en # Language code stored here
|
|
27
|
-
...
|
|
28
|
-
```
|
|
28
|
+
After the agent returns, update `library.yaml` with `transcript: <filename>.json`.
|
|
29
29
|
|
|
30
|
-
|
|
30
|
+
## Next step
|
|
31
31
|
|
|
32
|
-
|
|
33
|
-
whisperx "/full/path/to/video.mov" \
|
|
34
|
-
--language en \
|
|
35
|
-
--model medium \
|
|
36
|
-
--compute_type float32 \
|
|
37
|
-
--device cpu \
|
|
38
|
-
--output_format json \
|
|
39
|
-
--output_dir libraries/[library-name]/transcripts
|
|
40
|
-
```
|
|
32
|
+
Once all videos have audio transcripts, dispatch `analyze-video` for visual descriptions.
|
|
41
33
|
|
|
42
|
-
|
|
34
|
+
## Dependencies
|
|
43
35
|
|
|
44
|
-
|
|
45
|
-
|
|
46
|
-
```bash
|
|
47
|
-
ruby .claude/skills/transcribe-audio/prepare_audio_script.rb \
|
|
48
|
-
libraries/[library-name]/transcripts/video_name.json \
|
|
49
|
-
/full/path/to/original/video_name.mov
|
|
50
|
-
```
|
|
51
|
-
|
|
52
|
-
This script:
|
|
53
|
-
- Adds video source path as metadata
|
|
54
|
-
- Removes unnecessary fields to reduce file size
|
|
55
|
-
- Prettifies JSON
|
|
56
|
-
|
|
57
|
-
### 4. Return Success Response
|
|
58
|
-
|
|
59
|
-
After audio preparation completes, return this structured response to the parent agent:
|
|
60
|
-
|
|
61
|
-
```
|
|
62
|
-
✓ [video_filename.mov] transcribed successfully
|
|
63
|
-
Audio transcript: libraries/[library-name]/transcripts/video_name.json
|
|
64
|
-
Video path: /full/path/to/video_filename.mov
|
|
65
|
-
```
|
|
66
|
-
|
|
67
|
-
**DO NOT update library.yaml** - the parent agent will handle this to avoid race conditions when running multiple transcriptions in parallel.
|
|
68
|
-
|
|
69
|
-
## Running in Parallel
|
|
70
|
-
|
|
71
|
-
This skill is designed to run inside a Task agent for parallel execution:
|
|
72
|
-
- Each agent handles ONE video file
|
|
73
|
-
- Multiple agents can run simultaneously
|
|
74
|
-
- Parent thread updates library.yaml sequentially after each agent completes
|
|
75
|
-
- No race conditions on shared YAML file
|
|
76
|
-
|
|
77
|
-
## Next Step
|
|
78
|
-
|
|
79
|
-
After audio transcription, use the **analyze-video** skill to add visual descriptions and create the visual transcript.
|
|
80
|
-
|
|
81
|
-
## Installation
|
|
82
|
-
|
|
83
|
-
Ensure WhisperX is installed. Use the **setup** skill to verify dependencies.
|
|
36
|
+
WhisperX must be installed. Use the **setup** skill to verify.
|
|
@@ -0,0 +1,53 @@
|
|
|
1
|
+
# Transcribe Audio (sub-agent prompt)
|
|
2
|
+
|
|
3
|
+
You are a sub-agent. Transcribe one video file using WhisperX and produce a clean JSON transcript with word-level timing.
|
|
4
|
+
|
|
5
|
+
**Critical:** Use WhisperX, NOT standard Whisper. WhisperX preserves the original video timeline including leading silence, ensuring transcripts match actual video timestamps. Run WhisperX directly on the video file — don't extract audio separately.
|
|
6
|
+
|
|
7
|
+
## Inputs (passed inline by the parent)
|
|
8
|
+
|
|
9
|
+
- `video_path` — absolute path to the video file
|
|
10
|
+
- `transcript_output_dir` — where to write the transcript JSON
|
|
11
|
+
- `language_code` — ISO 639-1 code (e.g. `en`, `es`)
|
|
12
|
+
- `whisper_model` — model size (e.g. `small`, `medium`, `turbo`)
|
|
13
|
+
- `transcript_refinement` — boolean; if `true`, also expect:
|
|
14
|
+
- `user_context` — string, may be empty
|
|
15
|
+
- `footage_summary` — string, may be empty
|
|
16
|
+
|
|
17
|
+
Do NOT read `library.yaml` or `settings.yaml`. If a required input is missing from your prompt, stop and ask the parent rather than inferring from the filesystem.
|
|
18
|
+
|
|
19
|
+
## 1. Run WhisperX
|
|
20
|
+
|
|
21
|
+
```bash
|
|
22
|
+
whisperx "<video_path>" \
|
|
23
|
+
--language <language_code> \
|
|
24
|
+
--model <whisper_model> \
|
|
25
|
+
--compute_type float32 \
|
|
26
|
+
--device cpu \
|
|
27
|
+
--output_format json \
|
|
28
|
+
--output_dir <transcript_output_dir>
|
|
29
|
+
```
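If the command is assembled programmatically, building it as an argument array avoids quoting problems with paths that contain spaces. A minimal sketch using only the flags shown above follows. The helper names are hypothetical.

```ruby
# Hypothetical helper: argv for the WhisperX call, flags as in the prompt.
def whisperx_command(video_path:, language_code:, whisper_model:, transcript_output_dir:)
  [
    "whisperx", video_path,
    "--language", language_code,
    "--model", whisper_model,
    "--compute_type", "float32",
    "--device", "cpu",
    "--output_format", "json",
    "--output_dir", transcript_output_dir
  ]
end

# Hypothetical helper: the <transcript_output_dir>/<video_basename>.json path.
def transcript_path(video_path, transcript_output_dir)
  File.join(transcript_output_dir, "#{File.basename(video_path, '.*')}.json")
end

puts transcript_path("/footage/My Trip/DJI_123.mov", "libraries/foo/transcripts")
# => libraries/foo/transcripts/DJI_123.json
```

A caller could run it with `system(*whisperx_command(...))`, which bypasses the shell entirely.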
|
|
30
|
+
|
|
31
|
+
## 2. Prepare audio transcript
|
|
32
|
+
|
|
33
|
+
```bash
|
|
34
|
+
ruby .claude/skills/transcribe-audio/prepare_audio_script.rb \
|
|
35
|
+
<transcript_output_dir>/<video_basename>.json \
|
|
36
|
+
<video_path>
|
|
37
|
+
```
|
|
38
|
+
|
|
39
|
+
This script adds the video source path as metadata, removes unnecessary fields, and prettifies the JSON.
|
|
40
|
+
|
|
41
|
+
## 3. (Optional) Refine the transcript
|
|
42
|
+
|
|
43
|
+
If `transcript_refinement: true`, follow `.claude/skills/transcribe-audio/refine_instructions.md`, using the `user_context` and `footage_summary` strings the parent supplied inline. Do NOT open `library.yaml`. Skip if `transcript_refinement` is missing or `false`.
|
|
44
|
+
|
|
45
|
+
## 4. Return success response
|
|
46
|
+
|
|
47
|
+
```
|
|
48
|
+
✓ <video_basename.mov> transcribed successfully
|
|
49
|
+
Audio transcript: <transcript_output_dir>/<video_basename>.json
|
|
50
|
+
Video path: <video_path>
|
|
51
|
+
```
|
|
52
|
+
|
|
53
|
+
**Do NOT update library.yaml** — the parent handles all yaml I/O to avoid race conditions in parallel runs.
|