vidclaude 0.2.0
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- package/README.md +237 -0
- package/SKILL.md +138 -0
- package/bin/setup.js +45 -0
- package/bin/vidclaude.js +27 -0
- package/package.json +31 -0
- package/requirements.txt +2 -0
- package/vidclaude/SKILL.md +138 -0
- package/vidclaude/__init__.py +3 -0
- package/vidclaude/__pycache__/__init__.cpython-313.pyc +0 -0
- package/vidclaude/__pycache__/audio.cpython-313.pyc +0 -0
- package/vidclaude/__pycache__/cli.cpython-313.pyc +0 -0
- package/vidclaude/__pycache__/ingest.cpython-313.pyc +0 -0
- package/vidclaude/__pycache__/intent.cpython-313.pyc +0 -0
- package/vidclaude/__pycache__/memory.cpython-313.pyc +0 -0
- package/vidclaude/__pycache__/models.cpython-313.pyc +0 -0
- package/vidclaude/__pycache__/ocr.cpython-313.pyc +0 -0
- package/vidclaude/__pycache__/reason.cpython-313.pyc +0 -0
- package/vidclaude/__pycache__/segment.cpython-313.pyc +0 -0
- package/vidclaude/__pycache__/timeline.cpython-313.pyc +0 -0
- package/vidclaude/__pycache__/util.cpython-313.pyc +0 -0
- package/vidclaude/audio.py +193 -0
- package/vidclaude/cli.py +389 -0
- package/vidclaude/ingest.py +80 -0
- package/vidclaude/intent.py +116 -0
- package/vidclaude/memory.py +174 -0
- package/vidclaude/models.py +162 -0
- package/vidclaude/ocr.py +110 -0
- package/vidclaude/reason.py +285 -0
- package/vidclaude/segment.py +239 -0
- package/vidclaude/timeline.py +95 -0
- package/vidclaude/util.py +163 -0
- package/video_understand.py +12 -0
package/README.md
ADDED
@@ -0,0 +1,237 @@
# vidclaude

Multimodal video understanding for Claude Code. Extract frames, transcribe audio in 90+ languages, build temporal timelines — all from a single command. No API key needed.

```bash
pip install vidclaude
vidclaude video.mp4 --mode standard --verbose
```

## What it does

Drop a video in, get structured evidence out. Claude in your conversation does the thinking.

```
Video File
    │
    ├─ ffmpeg ──────────► Frames (adaptive, shot-aware sampling)
    ├─ faster-whisper ──► Transcript with timestamps (large-v3, 90+ languages)
    ├─ pytesseract ─────► On-screen text / OCR (optional)
    ├─ scene detection ─► Shot boundaries
    │
    └─► Timeline ──► evidence.md ──► Claude reasons over it
```

Works with Hindi, English, Spanish, Japanese, Arabic — any of the 90+ languages Whisper supports. Language is auto-detected.

## Prerequisites

| Requirement | Install |
|-------------|---------|
| Python 3.10+ | [python.org](https://python.org) |
| ffmpeg | Windows: `winget install ffmpeg` / macOS: `brew install ffmpeg` / Linux: `sudo apt install ffmpeg` |

## Install

```bash
pip install vidclaude
```

That's it. First run downloads the Whisper model (~3GB, one time).

## Usage

### With Claude Code (recommended)

Set up the skill once:

```bash
vidclaude --install-skill
```

Then in Claude Code, just say:

> "analyze the video at C:/Users/me/Videos/meeting.mp4"
>
> "what does the speaker say about the budget in presentation.mp4?"
>
> "when does the logo appear in intro.mov?"

Claude runs the extraction, reads the evidence report + frames, and answers your question. Your Max/Pro plan covers everything — no API key needed.

Follow-up questions about the same video are instant (cached).

### From the command line

```bash
# Standard analysis — good for most videos
vidclaude video.mp4 --mode standard --verbose

# Quick — fewer frames, faster
vidclaude video.mp4 --mode quick

# Deep — dense frames, full OCR, detailed
vidclaude video.mp4 --mode deep --verbose

# Batch process a folder
vidclaude ./videos/ --verbose

# Skip audio transcription
vidclaude video.mp4 --no-audio

# Force fresh extraction (ignore cache)
vidclaude video.mp4 --no-cache
```

### Output

Every run creates a `.vidcache/` directory next to the video:

```
.vidcache/a3f7b2c1/
    evidence.md       ← Human-readable report (Claude reads this)
    frames/           ← Extracted JPEG frames
    transcript.json   ← Timestamped transcript
    timeline.json     ← Unified event timeline
    meta.json         ← Video metadata
    shots.json        ← Shot boundaries
    ocr.json          ← On-screen text
    summaries.json    ← Scene/chapter summaries
```
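
The eight-character directory name is a content-derived cache key. Below is a minimal sketch of how such a key could be computed; hashing the absolute path plus size and mtime is an assumption for illustration, not necessarily vidclaude's actual scheme in `util.py`:

```python
import hashlib
import os

def cache_key(video_path: str) -> str:
    """Derive a short, stable cache key for a video file.

    Hashing path + size + mtime means editing or replacing the file
    automatically invalidates the old cache entry.
    """
    st = os.stat(video_path)
    payload = f"{os.path.abspath(video_path)}:{st.st_size}:{st.st_mtime_ns}"
    return hashlib.sha256(payload.encode()).hexdigest()[:8]
```

The same file always maps to the same directory, which is what makes follow-up questions hit the cache.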

## Modes

| Mode | Frames | Whisper model | OCR | Best for |
|------|--------|---------------|-----|----------|
| `quick` | ~20, uniform | base | skip | Short clips, fast overview |
| `standard` | ~60, shot-aware | large-v3 | keyframes | General use |
| `deep` | ~150, burst sampling | large-v3 | all frames | Long videos, detailed analysis |

**Smart frame budget**: If a video is too long for the frame limit, FPS is automatically reduced. Shot boundary frames are always prioritized.
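
That throttling is simple arithmetic: if sampling at the mode's rate would overshoot the frame cap, the rate is scaled down. A sketch of the calculation (the function name and exact rounding are illustrative, not lifted from vidclaude):

```python
def effective_fps(duration_s: float, target_fps: float, max_frames: int) -> float:
    """Scale the sampling rate down so duration * fps stays within max_frames."""
    if duration_s * target_fps <= max_frames:
        return target_fps
    return max_frames / duration_s
```

For a 10-minute video in standard mode (0.5 fps, cap 60), 600 x 0.5 = 300 candidate frames would overshoot, so the rate drops to 60 / 600 = 0.1 fps.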

## CLI Reference

```
vidclaude [input] [options]

positional:
  input                         Video file or folder

setup:
  --install-skill               Copy SKILL.md to current dir for Claude Code

processing:
  --mode {quick,standard,deep}  Processing mode (default: standard)
  -f, --fps N                   Override frames per second
  -m, --max-frames N            Override max frame count
  --no-audio                    Skip audio transcription
  --no-ocr                      Skip OCR extraction
  --no-cache                    Force re-extraction

output:
  --extract                     Print cache path summary
  -o, --output FILE             Write output to file
  --verbose                     Show detailed progress
  --batch-summary               Cross-video summary for folders
```

## How it works

Built on a multi-layer video understanding architecture:

- **Layer A — Ingestion**: Validates format (MP4, MOV, MKV, WebM, AVI), extracts metadata via ffprobe
- **Layer B — Segmentation**: Detects shot boundaries using ffmpeg's scene change filter
- **Layer C — Adaptive Sampling**: Content-aware frame selection — more frames at scene transitions, smart frame budgets per mode
- **Layer D — Audio**: faster-whisper with large-v3 for multilingual transcription with word-level timestamps and VAD filtering
- **Layer E — OCR**: pytesseract text extraction from key frames (optional)
- **Layer G — Timeline**: Merges speech, OCR, and scene events into a single time-sorted list
- **Layer I — Memory**: Hierarchical summaries for longer videos (scene → chapter → global)
- **Layer J — Evidence Assembly**: Generates `evidence.md` with frame references, transcript, timeline for Claude to reason over
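
The timeline layer amounts to a stable sort over heterogeneous timestamped events. A minimal sketch (the event shape, with `t` and `kind` keys, is an assumption for illustration; vidclaude's real records live in `models.py` and `timeline.py`):

```python
def build_timeline(speech, ocr, scenes):
    """Merge per-modality event lists into one time-sorted timeline.

    Each event is a dict with a 't' timestamp in seconds; a 'kind'
    tag records which modality it came from.
    """
    events = []
    for kind, stream in (("scene", scenes), ("speech", speech), ("ocr", ocr)):
        events.extend({**ev, "kind": kind} for ev in stream)
    # sorted() is stable, so same-timestamp events keep modality order
    return sorted(events, key=lambda ev: ev["t"])
```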

## Intent-aware processing

The tool classifies your question and adjusts the pipeline:

| Question type | Example | What happens |
|--------------|---------|-------------|
| Description | "What happens in this video?" | Balanced extraction |
| Moment retrieval | "When does the person stand up?" | Prioritizes transcript + timeline |
| Temporal ordering | "Does X happen before Y?" | Prioritizes timeline events |
| Counting | "How many cars appear?" | Denser frame sampling |
| OCR / text | "What text is on the slide?" | Prioritizes OCR extraction |
| Speech | "What did they say about revenue?" | Prioritizes transcript |
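
A small keyword heuristic is enough to reproduce the gist of this routing. The sketch below reuses the table's labels, but the rules themselves are illustrative, not the actual logic in `intent.py`:

```python
import re

# Ordered rules: first match wins; anything unmatched is "description".
RULES = [
    ("moment retrieval",  r"\bwhen\b"),
    ("temporal ordering", r"\bbefore\b|\bafter\b|\border\b"),
    ("counting",          r"\bhow many\b|\bcount\b"),
    ("ocr",               r"\btext\b|\bslide\b|\bwritten\b"),
    ("speech",            r"\bsay\b|\bsaid\b|\bmention\b"),
]

def classify_intent(question: str) -> str:
    q = question.lower()
    for label, pattern in RULES:
        if re.search(pattern, q):
            return label
    return "description"
```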

## Optional extras

```bash
# OCR support (also needs Tesseract binary installed)
pip install pytesseract

# Standalone API mode (use outside Claude Code)
pip install anthropic
export ANTHROPIC_API_KEY=sk-...
vidclaude video.mp4 --api -q "What happens in this video?"
```

## Examples

**Analyze a meeting recording:**
```bash
vidclaude meeting.mp4 --mode standard --verbose
# → 55 frames, full transcript, timeline
# → In Claude Code: "summarize the key decisions from this meeting"
```

**Review security footage:**
```bash
vidclaude ./cameras/ --mode deep --verbose
# → Batch processes all videos in the folder
# → Dense frame extraction catches fast events
```

**Extract text from a lecture:**
```bash
pip install pytesseract
vidclaude lecture.mp4 --mode standard --verbose
# → Captures slide text via OCR + speaker transcript
```

**Quick check on a short clip:**
```bash
vidclaude clip.mp4 --mode quick
# → 20 frames, fast whisper, done in seconds
```

## Troubleshooting

| Problem | Fix |
|---------|-----|
| `ffmpeg not found` | Install: `winget install ffmpeg` (Windows) / `brew install ffmpeg` (Mac) |
| `No module named 'faster_whisper'` | `pip install faster-whisper` |
| Slow first run | Normal — downloading Whisper large-v3 model (~3GB, one time) |
| Wrong language detected | Whisper auto-detects; usually correct. For edge cases, the transcript still captures phonetics |
| Large `.vidcache` folder | Delete it to free space: `rm -rf .vidcache/` |
| Want fresh extraction | Use `--no-cache` flag |
| OCR not working | Install pytesseract + Tesseract binary |

## Project structure

```
vidclaude/
    cli.py        Argument parsing, orchestration, caching
    models.py     Data model (VideoMeta, Frame, Shot, TranscriptChunk, etc.)
    ingest.py     Video validation + ffprobe metadata
    segment.py    Shot detection + adaptive frame sampling
    audio.py      faster-whisper transcription (large-v3)
    ocr.py        pytesseract text extraction
    intent.py     Question intent classification
    timeline.py   Temporal event merging
    memory.py     Hierarchical summaries
    reason.py     Evidence assembly + optional API mode
    util.py       ffmpeg helpers, image encoding, caching
    SKILL.md      Claude Code skill definition (bundled)
```

## License

MIT
package/SKILL.md
ADDED
@@ -0,0 +1,138 @@
---
name: video-understand
description: >
  Analyze video files using multimodal extraction (frames, audio transcript,
  OCR, timeline) and Claude's reasoning. Use when the user asks to analyze,
  understand, describe, or answer questions about a video file or folder of
  videos. Triggers on: "analyze this video", "what happens in this video",
  "describe the video", "video question", any path ending in
  .mp4/.mov/.mkv/.webm followed by a question about it.
tools: ["Bash", "Read", "Glob"]
---

# Video Understanding Skill

Analyze videos by extracting visual frames, audio transcripts, OCR text, and
temporal timelines, then reasoning over the combined evidence.

## How It Works

This skill uses a Python extraction pipeline (`video_understand.py`) that:
1. Ingests the video and extracts metadata via ffprobe
2. Detects shot boundaries using ffmpeg scene change detection
3. Adaptively samples frames (more frames at scene transitions)
4. Transcribes audio using OpenAI Whisper (if installed)
5. Extracts on-screen text via OCR (if pytesseract installed)
6. Builds a unified temporal timeline of all events
7. Generates an `evidence.md` report + cached artifacts

You (Claude) then read the evidence and frame images to reason over the video.

## Step-by-Step Usage

### Step 1: Extract evidence from the video

Run the extraction pipeline:

```bash
python D:/ai_creative_stuff/claudevid/video_understand.py "<video_path>" --extract --mode standard --verbose
```

For quick analysis (fewer frames, faster):
```bash
python D:/ai_creative_stuff/claudevid/video_understand.py "<video_path>" --extract --mode quick --verbose
```

For deep analysis (more frames, OCR on all frames):
```bash
python D:/ai_creative_stuff/claudevid/video_understand.py "<video_path>" --extract --mode deep --verbose
```

### Step 2: Read the evidence report

The script outputs a cache directory path. Read the evidence report:

```bash
# The cache path is printed by the script, e.g.:
# Cache: /path/to/.vidcache/a3f7b2c1
# Read the evidence:
cat /path/to/.vidcache/<hash>/evidence.md
```

Use the Read tool to read `evidence.md` from the cache directory.

### Step 3: View key frames

Read the frame images listed in evidence.md to see what's in the video.
Select 5-10 representative frames spread across the video's duration.
Use the Read tool on the frame image paths.
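
Picking frames "spread across the duration" can be done by sampling bucket midpoints. A sketch of one way to do it (the function is illustrative, not part of the skill):

```python
def pick_representative(frame_paths, k=8):
    """Pick up to k frames spread evenly across the list of extracted frames."""
    n = len(frame_paths)
    if n <= k:
        return list(frame_paths)
    # Take the midpoint of each of k equal-width buckets.
    return [frame_paths[(2 * i + 1) * n // (2 * k)] for i in range(k)]
```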

### Step 4: Answer the question

Reason over the combined evidence (timeline, transcript, OCR, frames) to
answer the user's question. Ground claims in timestamps. Note uncertainties.

### Follow-up Questions

For follow-up questions about the same video, the cache already exists.
Re-read the evidence.md and relevant frames — no need to re-extract.

To force re-extraction: add `--no-cache` flag.

## Batch Processing

Process a folder of videos:
```bash
python D:/ai_creative_stuff/claudevid/video_understand.py "<folder_path>" --extract --mode standard --verbose
```

## CLI Reference

| Flag | Default | Description |
|------|---------|-------------|
| `input` | required | Video file or folder path |
| `--extract` | off | Extract only (skill mode, no API key needed) |
| `-q "..."` | none | Question (for API standalone mode) |
| `-f N` | mode default | Frames per second override |
| `-m N` | mode default | Max frames override |
| `--no-audio` | off | Skip audio transcription |
| `--no-ocr` | off | Skip OCR extraction |
| `--mode` | standard | quick / standard / deep |
| `--verbose` | off | Detailed progress output |
| `--no-cache` | off | Force re-extraction |
| `--api` | off | Standalone mode (needs ANTHROPIC_API_KEY) |
| `-o file` | stdout | Write output to file |

## Modes Comparison

| Aspect | quick | standard | deep |
|--------|-------|----------|------|
| Frame sampling | 0.2fps, max 20 | 0.5fps + shot boundaries, max 60 | 1.0fps + burst, max 150 |
| Audio | whisper base | whisper base | whisper small |
| OCR | skip | keyframes only | all frames |
| Summaries | skip | for videos > 5min | always |

## Prerequisites

1. **Python 3.10+**
2. **ffmpeg** on PATH — `winget install ffmpeg` or download from https://ffmpeg.org
3. **Pillow**: `pip install Pillow`
4. **Whisper** (optional): `pip install openai-whisper` (requires PyTorch)
5. **Tesseract** (optional): Install Tesseract OCR + `pip install pytesseract`

## Troubleshooting

- **"ffmpeg not found"**: Ensure ffmpeg is on your PATH. Run `ffmpeg -version` to verify.
- **"openai-whisper not installed"**: Audio will be skipped. Install with `pip install openai-whisper`.
- **Slow frame extraction**: Use `--mode quick` or reduce with `-m 10`.
- **Large cache**: Delete `.vidcache/` folders to free space.
- **Re-analyze same video**: Use `--no-cache` to force fresh extraction.

## Extension Ideas

- **OmniVision offline footage review**: Process security/dashcam footage folders, generate timeline reports, flag anomalous events
- **Meeting summarization**: Extract slides (OCR) + transcript, generate meeting notes
- **Content moderation**: Scan video batches for policy violations
- **Sports analysis**: Dense frame extraction for play-by-play breakdown
- **Accessibility**: Generate audio descriptions from visual content
package/bin/setup.js
ADDED
@@ -0,0 +1,45 @@
#!/usr/bin/env node

const { execSync } = require("child_process");
const fs = require("fs");
const path = require("path");

const reqPath = path.join(__dirname, "..", "requirements.txt");

console.log("vidclaude: Installing Python dependencies...");

// Check Python exists
try {
  execSync("python --version", { stdio: "pipe" });
} catch {
  console.warn(
    "\n⚠ Python not found. You need Python 3.10+ to use vidclaude.\n" +
    "  Install from: https://python.org\n" +
    "  Then run: pip install -r requirements.txt\n"
  );
  process.exit(0); // Don't fail npm install
}

// Check ffmpeg exists
try {
  execSync("ffmpeg -version", { stdio: "pipe" });
} catch {
  console.warn(
    "\n⚠ ffmpeg not found. You need ffmpeg to use vidclaude.\n" +
    "  Windows: winget install ffmpeg\n" +
    "  macOS: brew install ffmpeg\n" +
    "  Linux: sudo apt install ffmpeg\n"
  );
}

// Install Python deps
try {
  execSync(`pip install -r "${reqPath}"`, { stdio: "inherit" });
  console.log("\nvidclaude: Setup complete!");
  console.log("  Run: npx vidclaude video.mp4 --extract --mode standard --verbose\n");
} catch {
  console.warn(
    "\n⚠ Failed to install Python dependencies.\n" +
    "  Try manually: pip install -r requirements.txt\n"
  );
}
package/bin/vidclaude.js
ADDED
@@ -0,0 +1,27 @@
#!/usr/bin/env node

const { spawn } = require("child_process");
const path = require("path");

const scriptPath = path.join(__dirname, "..", "video_understand.py");
const args = process.argv.slice(2);

// Pass all arguments through to the Python script
const proc = spawn("python", [scriptPath, ...args], {
  stdio: "inherit",
});

proc.on("error", (err) => {
  if (err.code === "ENOENT") {
    console.error(
      "Error: Python not found. Install Python 3.10+ from https://python.org"
    );
    process.exit(1);
  }
  console.error(`Error: ${err.message}`);
  process.exit(1);
});

proc.on("close", (code) => {
  process.exit(code || 0);
});
package/package.json
ADDED
@@ -0,0 +1,31 @@
{
  "name": "vidclaude",
  "version": "0.2.0",
  "description": "Multimodal video understanding for Claude Code — extract frames, transcribe audio, build timelines from any video",
  "bin": {
    "vidclaude": "./bin/vidclaude.js"
  },
  "files": [
    "bin/",
    "vidclaude/",
    "video_understand.py",
    "requirements.txt",
    "SKILL.md",
    "README.md"
  ],
  "scripts": {
    "postinstall": "node ./bin/setup.js"
  },
  "keywords": [
    "video",
    "claude",
    "multimodal",
    "whisper",
    "analysis",
    "claude-code"
  ],
  "license": "MIT",
  "engines": {
    "node": ">=16.0.0"
  }
}
package/requirements.txt
ADDED