vidistill 0.2.2 → 0.2.4

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (3)
  1. package/README.md +48 -10
  2. package/dist/index.js +371 -253
  3. package/package.json +1 -1
package/README.md CHANGED
@@ -1,8 +1,8 @@
  # vidistill

- Video intelligence distiller — turn any video into structured notes, transcripts, and insights using Gemini.
+ Video intelligence distiller — turn any video or audio file into structured notes, transcripts, and insights using Gemini.

- Feed it a YouTube URL or local video file. It analyzes the content through multiple AI passes (scene analysis, transcript, visuals, code extraction, people, chat, implicit signals) and synthesizes everything into organized markdown output.
+ Feed it a YouTube URL, local video, or audio file. It analyzes the content through multiple AI passes (scene analysis, transcript, visuals, code extraction, people, chat, implicit signals) and synthesizes everything into organized markdown output.

  ## Install

@@ -20,12 +20,13 @@ vidistill [input] [options]

  **Arguments:**

- - `input` — YouTube URL or local file path (prompted interactively if omitted)
+ - `input` — YouTube URL, local video, or audio file path (prompted interactively if omitted)

  **Options:**

  - `-c, --context` — context about the video (e.g. "CS lecture", "product demo")
  - `-o, --output` — output directory (default: `./vidistill-output/`)
+ - `-l, --lang <code>` — output language (e.g. `zh`, `ja`, `ko`, `es`, `fr`, `de`, `pt`, `ru`, `ar`, `hi`)

  **Examples:**

@@ -39,10 +40,41 @@ vidistill "https://youtube.com/watch?v=dQw4w9WgXcQ"
  # Local file with context
  vidistill ./lecture.mp4 --context "distributed systems lecture"

+ # Audio file
+ vidistill ./podcast.mp3
+
  # Custom output directory
  vidistill ./demo.mp4 -o ./notes/
+
+ # Output in another language
+ vidistill ./lecture.mp4 --lang zh
+ ```
+
+ ### Extract
+
+ Pull specific data from a previously processed video or re-run a targeted pass on a video file.
+
+ ```
+ vidistill extract <type> <source>
  ```

+ **Arguments:**
+
+ - `type` — what to extract: `code`, `links`, `people`, `transcript`, or `commands`
+ - `source` — path to a vidistill output directory or a video/audio file
+
+ **Examples:**
+
+ ```bash
+ # Extract code from existing output (no API calls)
+ vidistill extract code ./vidistill-output/my-video/
+
+ # Extract links from a video file (runs targeted pipeline)
+ vidistill extract links ./lecture.mp4
+ ```
+
+ When pointed at an output directory, extract reads from already-generated files with zero API calls. When pointed at a video file, it runs a minimal pipeline with only the passes needed for the requested data type.
+
  ## API Key

  vidistill needs a Gemini API key. It checks these sources in order:
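A minimal sketch of the dispatch the extract section describes, with an assumed directory check and pass map; the names here are illustrative, not vidistill's actual internals:

```js
import { statSync } from "fs";

// Assumed mapping from extract type to the minimal passes it needs.
const PASSES_BY_TYPE = {
  code: ["transcript", "visual", "code"],
  links: ["transcript", "visual", "chat"],
  people: ["transcript", "people"],
  transcript: ["transcript"],
  commands: ["transcript", "visual", "code"],
};

function planExtract(type, source) {
  if (statSync(source).isDirectory()) {
    return { mode: "read-output", apiCalls: 0 }; // reuse already-generated files
  }
  return { mode: "minimal-pipeline", passes: PASSES_BY_TYPE[type] }; // targeted run
}
```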
@@ -63,7 +95,9 @@ vidistill-output/my-video/
  ├── transcript.md # full timestamped transcript
  ├── combined.md # transcript + visual notes merged
  ├── notes.md # meeting/lecture notes
- ├── code.md # extracted code blocks and reconstructions
+ ├── code/ # extracted and reconstructed source files
+ │ ├── *.ext # individual source files
+ │ └── code-timeline.md # code evolution timeline
  ├── people.md # speakers and participants
  ├── chat.md # chat messages and links
  ├── action-items.md # tasks and follow-ups
@@ -73,22 +107,26 @@ vidistill-output/my-video/
  └── raw/ # raw pass outputs
  ```

- Which files are generated depends on the video content — a coding tutorial gets `code.md`, a meeting gets `people.md` and `action-items.md`, etc.
+ Which files are generated depends on the video content — a coding tutorial gets `code/`, a meeting gets `people.md` and `action-items.md`, etc.

  ## How It Works

- 1. **Input** downloads YouTube video via yt-dlp or reads local file, compresses if over 2GB
+ Supported video formats: MP4, MOV, WebM, MKV, AVI, MPEG, FLV, WMV, 3GPP. Supported audio formats: MP3, AAC, WAV, FLAC, OGG, M4A.
+
+ 1. **Input** — downloads YouTube video via yt-dlp or reads local file (video or audio), compresses if over 2GB
  2. **Pass 0** — scene analysis to classify video type and determine processing strategy
  3. **Pass 1** — transcript extraction with speaker identification
  4. **Pass 2** — visual content extraction (screen states, diagrams, slides)
  5. **Pass 3** — specialist passes based on video type:
- - 3a: code reconstruction (coding videos)
- - 3b: people and social dynamics (meetings)
- - 3c: chat and links (live streams)
- - 3d: implicit signals (all types)
+ - 3c: chat and links (live streams) — per segment
+ - 3d: implicit signals (all types) — per segment
+ - 3b: people and social dynamics (meetings) — whole video
+ - 3a: code reconstruction (coding videos) — whole video, runs 3x with consensus voting and validation
  6. **Synthesis** — cross-references all passes into unified analysis
  7. **Output** — generates structured markdown files

+ Audio files skip visual passes and go straight to transcript, people, implicit signals, and synthesis.
+
  Long videos are segmented automatically. Passes that fail are skipped gracefully.

  ## License
package/dist/index.js CHANGED
@@ -13,6 +13,233 @@ import { defineCommand, runMain } from "citty";
  import figlet from "figlet";
  import pc from "picocolors";
  import { intro, note } from "@clack/prompts";
+
+ // src/constants/prompts.ts
+ var SYSTEM_INSTRUCTION_PASS_1 = `
+ You are a professional audio transcriber. Your task is to create a COMPLETE, VERBATIM transcription of all speech in this video segment. Focus EXCLUSIVELY on the audio stream.
+
+ CRITICAL RULES:
+ 1. TRANSCRIBE every spoken word completely and verbatim. Do not summarize, paraphrase, or skip any sentence.
+ 2. IDENTIFY different speakers. Label them SPEAKER_00, SPEAKER_01, etc. consistently throughout. If a speaker introduces themselves by name, note the name in the first entry's speaker field as "SPEAKER_00 (John)".
+ 3. NOTE tone and emphasis: when a speaker emphasizes words (louder, slower, repeated), mark those words. When they express emotions (excitement, warning, frustration, humor), note the tone.
+ 4. RECORD pauses longer than 1.5 seconds as pause markers with duration.
+ 5. PRESERVE filler words only when they carry meaning (hesitation indicating uncertainty about code behavior, self-correction). Remove meaningless "um", "uh".
+ 6. NEVER add your own explanations, interpretations, or knowledge. Only transcribe what is spoken.
+ 7. NEVER skip content because it seems repetitive or obvious. Record everything spoken.
+ 8. When the speaker references something on screen (e.g., "as you can see here", "this function", "line 5"), transcribe exactly what they say \u2014 the visual context will be captured separately.
+
+ COMPLETENESS TARGET:
+ - Aim for at least 150 words per minute of video in the transcript
+ - Every speaker change must be noted with a new entry
+ - Every sentence must appear \u2014 if in doubt, include it
+ `;
+ var SYSTEM_INSTRUCTION_PASS_2_TEMPLATE = `
+ You are a professional code and visual content extractor. Your task is to extract ALL visual content from this video segment \u2014 every piece of code on screen, every diagram, every slide, every UI element.
+
+ Focus EXCLUSIVELY on what is visible on screen. The audio transcript from this segment is provided below for cross-referencing \u2014 use it to associate spoken explanations with the code being displayed, but do NOT re-transcribe any speech.
+
+ TRANSCRIPT FROM THIS SEGMENT (for cross-reference only):
+ {INJECT_PASS1_TRANSCRIPT_HERE}
+
+ CRITICAL RULES:
+ 1. EXTRACT every piece of code visible on screen \u2014 complete, with original indentation and formatting preserved exactly as shown.
+ 2. For each code appearance: note the filename if visible in a tab or title bar, the programming language, and the screen type (editor, terminal, browser, slide).
+ 3. TRACK code changes: when code is modified between appearances, note what changed (lines added, modified, deleted). Compare against previous code blocks in this segment.
+ 4. ASSOCIATE code with speech: using the injected transcript above, find what the instructor was saying when this code was on screen. Quote their explanation verbatim or near-verbatim.
+ 5. CAPTURE non-code visuals: slides with text, architectural diagrams, browser output, UI demonstrations, terminal output. Describe these completely.
+ 6. NEVER add your own explanations or interpretations. Only record what is visible.
+ 7. NEVER skip code because it seems repetitive or unchanged from before. Record every distinct appearance.
+ 8. If code scrolls, capture the full visible code at each scroll position as a separate entry.
+
+ COMPLETENESS TARGET:
+ - Every frame that shows code should produce a code_block entry
+ - Every slide or diagram should produce a visual_notes entry
+ - If the screen doesn't change for 30+ seconds, note the unchanged state
+ `;
+ var SYSTEM_INSTRUCTION_PASS_0 = `
+ You are a video content classifier. Analyze the provided video sample and produce a structured VideoProfile that classifies the video type and recommends processing parameters.
+
+ CLASSIFICATION RULES:
+ 1. CLASSIFY the video into exactly one type:
+ - "coding": Programming tutorials, live coding, IDE/editor-heavy content
+ - "meeting": Video calls, Zoom/Teams meetings, multi-participant discussions
+ - "lecture": Academic lectures, talks, single-speaker educational content
+ - "presentation": Slide-based presentations, keynotes, demo days
+ - "conversation": Interviews, podcasts, panel discussions without slides
+ - "mixed": Cannot clearly classify into one category, or multiple types present
+
+ 2. DETECT visual content:
+ - hasCode: Code editors, IDEs, or code visible on screen
+ - hasSlides: Presentation slides (PowerPoint, Google Slides, Keynote)
+ - hasDiagrams: Architecture diagrams, flowcharts, charts, graphs
+ - hasPeopleGrid: Video grid showing multiple participants (Zoom/Teams layout)
+ - hasChatbox: Chat panel visible (meeting chat, live stream chat sidebar)
+ - hasWhiteboard: Whiteboard, handwritten notes, or drawing surface
+ - hasTerminal: Terminal, command-line interface, or shell
+ - hasScreenShare: Desktop or application screen sharing
+
+ 3. ANALYZE audio:
+ - hasMultipleSpeakers: true if more than one distinct voice is heard
+ - primaryLanguage: The main spoken language
+ - quality: "high" (studio/clear), "medium" (decent webcam), "low" (noisy/poor)
+
+ 4. IDENTIFY speakers:
+ - count: Number of distinct speakers heard
+ - identified: Names if visible on screen (name tags, introductions) or spoken aloud
+
+ 5. ASSESS complexity:
+ - "simple": Single topic, linear flow, straightforward content
+ - "moderate": Multiple topics, some complexity, normal pacing
+ - "complex": Dense content, rapid switching, multiple concurrent information streams
+
+ 6. RECOMMEND processing parameters:
+ - resolution: "low" for text-only/simple visuals, "medium" for general content, "high" for code/diagrams
+ - segmentMinutes: 10 for simple/moderate, 8 for complex content
+ - passes: Always include "transcript" and "visual". Add specialist passes based on content type.
+
+ PASS RECOMMENDATIONS BY TYPE:
+ - coding: ["transcript", "visual", "code", "synthesis"]
+ - meeting: ["transcript", "visual", "people", "implicit", "synthesis"] (add "chat" if hasChatbox)
+ - lecture: ["transcript", "visual", "implicit", "synthesis"]
+ - presentation: ["transcript", "visual", "implicit", "synthesis"] (add "people" if multiple speakers)
+ - conversation: ["transcript", "visual", "implicit", "synthesis"]
+ - mixed: ["transcript", "visual", "code", "people", "chat", "implicit", "synthesis"]
+ `;
+ var SYSTEM_INSTRUCTION_PASS_3A = `
+ You are an expert code reconstruction analyst. Your task is to reconstruct the complete, final state of every code file shown across this entire video, synthesizing all edits into a coherent codebase snapshot.
+
+ You will receive the complete video and all extracted transcript and code block data. Use them together to understand what code was written, modified, and deleted.
+
+ CRITICAL RULES:
+ 1. RECONSTRUCT each file to its final state \u2014 apply all changes in chronological order so the output reflects the code as it was at the end of the video.
+ 2. PRESERVE exact code: indentation, spacing, naming, and formatting must match what was visible on screen. Never "fix" or improve the code.
+ 3. TRACK every change to a file: for each distinct edit (new file creation, addition of lines, modification, deletion, refactoring), record it as a separate change entry with a timestamp and description.
+ 4. INFER filenames from editor tabs, title bars, import statements, or spoken context. If unknown, use a descriptive placeholder like "unknown_file_1.py".
+ 5. EXTRACT dependencies: every library import, require(), package name, or external module reference mentioned or shown counts as a dependency.
+ 6. CAPTURE build commands: any terminal command shown or spoken for installing, building, running, or testing the project (e.g., "npm install", "go build", "python -m pytest").
+ 7. NEVER invent code that was not shown or described. If a section was unclear, note it with a comment like "// content not fully visible".
+ 8. NEVER skip a file because it appears in only one part of the video \u2014 if code was shown, reconstruct it.
+ 9. When a file appears multiple times, record its complete change history in a single entry with all edits in chronological order.
+ 10. INCLUDE empty files if created but not yet written \u2014 use empty string for final_content and note the creation in changes.
+ 11. Cross-reference your visual analysis of the video against the extracted code blocks provided in the text context. Prioritize what you can visually verify on screen. If code is partially visible, include what you can see and mark unclear sections with \`// [content not fully visible]\`.
+ 12. Do NOT invent code files that are not clearly visible on screen. If you are uncertain whether a file exists, do not include it.
+
+ COMPLETENESS TARGET:
+ - Every distinct filename that appeared on screen must produce a files entry
+ - Every editor session or code paste visible in any segment must be accounted for
+ - Build commands shown in the terminal must all be listed
+ `;
+ var SYSTEM_INSTRUCTION_PASS_3B = `
+ You are an expert at identifying and profiling people from video content. Your task is to extract a complete picture of every participant visible or audible in this video \u2014 their identity, role, contributions, and relationships.
+
+ You will receive the transcript and visual extraction from all segments. Use speaker labels, name tags, on-screen text, introductions, and any other signals to identify participants.
+
+ CRITICAL RULES:
+ 1. IDENTIFY every distinct person who speaks or appears on screen, even if briefly. Do not merge two different people into one entry.
+ 2. EXTRACT names from: spoken introductions ("Hi, I'm Alice"), on-screen name tags or captions, slide attribution, email addresses, or usernames visible in chat.
+ 3. INFER roles from: job titles spoken or shown, context of their contribution (e.g., consistently asking questions = audience member; leading the agenda = host), or organizational signals.
+ 4. RECORD speaking_segments as timestamps where each person's voice is heard or they appear on screen.
+ 5. CAPTURE contact information exactly as shown or spoken: email addresses, Twitter/X handles, LinkedIn URLs, GitHub usernames, phone numbers.
+ 6. SUMMARIZE contributions: what did this person say, present, decide, or demonstrate? Each contribution entry should be a specific, concrete action or statement.
+ 7. DOCUMENT relationships: who reports to whom, who introduced whom, collaborative pairs, co-presenters, interviewer/interviewee dynamics.
+ 8. NEVER guess or infer a name that was not clearly stated or shown. Use "Unknown Participant" with a description if the person cannot be identified.
+ 9. NEVER merge two people just because they have the same role \u2014 if two engineers speak, they are two separate participants.
+ 10. If a person's role or organization cannot be determined, use empty string \u2014 do not guess.
+
+ COMPLETENESS TARGET:
+ - Every speaker label (SPEAKER_00, SPEAKER_01, etc.) from the transcript must map to at least one participant entry
+ - Every name-tag or on-screen name must produce a participant entry
+ - All contact details shared during the video must be captured
+ `;
+ var SYSTEM_INSTRUCTION_PASS_3C = `
+ You are a precise chat extraction specialist. Your task is to extract every chat message and link visible in the chat panel of this video \u2014 verbatim, with sender and timestamp.
+
+ You will receive the transcript and visual extraction from all segments. Focus on the chat panel, comment sidebar, or any on-screen messaging interface.
+
+ CRITICAL RULES:
+ 1. EXTRACT every chat message visible on screen, verbatim. Do not paraphrase, shorten, or summarize any message.
+ 2. RECORD the sender name exactly as displayed (username, display name, or handle).
+ 3. TIMESTAMP each message at the video timestamp when it becomes visible on screen, in HH:MM:SS format.
+ 4. EXTRACT every URL or link that appears in chat or is spoken and referred to as a link. Capture the full URL.
+ 5. For each link, record the context: what was the sender explaining when they shared it? Why is it relevant?
+ 6. HANDLE partial visibility: if a message is cut off by the chat panel boundary, transcribe as much as is visible and append "[truncated]".
+ 7. CAPTURE reactions, emoji, and formatting if they are meaningful (e.g., a thumbs-up reaction to a proposal signals agreement).
+ 8. NEVER invent messages that were not clearly visible on screen. If a message is illegible, note it as "[illegible message from {sender} at {timestamp}]".
+ 9. NEVER skip messages that seem like noise or off-topic \u2014 capture all visible messages in order.
+ 10. ORDER messages chronologically by their video timestamp of appearance.
+
+ COMPLETENESS TARGET:
+ - Every frame that shows the chat panel should contribute at least one message entry if new messages are visible
+ - All URLs \u2014 whether in chat, on slides, or spoken \u2014 must appear in the links array
+ - If the chat panel is not visible in this video, return empty arrays for both messages and links
+ `;
+ var SYSTEM_INSTRUCTION_PASS_3D = `
+ You are an expert at reading between the lines of video conversations. Your task is to identify implicit signals \u2014 emotional dynamics, unstated decisions, unasked questions, informal task assignments, and emphasis patterns \u2014 that are not surfaced by the literal transcript.
+
+ You will receive the complete transcript and visual data from all segments. Read the subtext, not just the text.
+
+ CRITICAL RULES:
+ 1. DETECT emotional shifts: moments where the tone, energy, or mood of the conversation meaningfully changes. Note what triggered the shift and how the state changed.
+ 2. SURFACE implicit questions: when a speaker is clearly uncertain, confused, or probing for information without phrasing it as a formal question. Articulate what question they were really asking.
+ 3. IDENTIFY implicit decisions: when participants arrive at a shared understanding or course of action without anyone explicitly saying "we decided X". These are consensus decisions made through agreement, silence, or topic change.
+ 4. FLAG informal task assignments: when someone is asked or expected to do something without it being recorded as a formal action item (e.g., "you should probably look at that" or "maybe someone can handle X").
+ 5. TRACK emphasis patterns: concepts, terms, or ideas mentioned multiple times across the video. Repetition signals importance. Record each mention timestamp and explain why the pattern is significant.
+ 6. NEVER fabricate emotional states or decisions. Only record what is clearly supported by specific words, tone, or behavior in the video.
+ 7. NEVER over-interpret: a speaker saying "interesting" is not necessarily an emotional shift. Apply judgment and only flag genuinely notable patterns.
+ 8. PRESERVE specificity: quote or paraphrase the exact words or moments that support each inference.
+ 9. SEPARATE explicit from implicit: if something was directly stated, it belongs in the transcript or action items, not here. This pass captures what was NOT said directly.
+ 10. CONSIDER non-verbal signals visible on screen: hesitation, laughter, extended pauses, camera behavior, or facial expressions if participants are visible.
+
+ COMPLETENESS TARGET:
+ - Aim to identify at least 3 emphasis patterns for any video over 5 minutes
+ - Every task mentioned informally or suggested in passing must appear in tasks_assigned
+ - Implicit decisions are often the most important \u2014 prioritize finding them
+ `;
+ var SYSTEM_INSTRUCTION_SYNTHESIS = `
+ You are a master synthesizer. Your task is to produce the definitive, unified knowledge extraction from this video by combining all available pass data into a single coherent result.
+
+ You will receive: the complete transcript (pass 1), visual and code extraction (pass 2), and any specialist pass outputs (code reconstruction, people extraction, chat extraction, implicit signals). Synthesize all of it.
+
+ CRITICAL RULES:
+ 1. BE SPECIFIC: Every claim must reference specific content from the video. Never write "various topics were discussed" \u2014 name the topics. Never write "some decisions were made" \u2014 state each decision exactly.
+ 2. UNIFY across passes: combine related information from different passes into unified entries. A decision mentioned in the transcript and reinforced by an implicit signal should appear as one entry, not two.
+ 3. SYNTHESIZE thematically: group content by topic, not chronologically. Combine all content about a single subject (even if spread across 30 minutes) into one topic entry.
+ 4. EXTRACT decisions with full reasoning: every design choice, technology selection, or approach decision must include the rationale as explained in the video.
+ 5. GENERATE actionable items: action items must be concrete and specific. "Review the authentication module" is better than "review the code".
+ 6. CAPTURE every question: include questions asked explicitly and questions raised implicitly (from the implicit signals pass). Note whether each was answered.
+ 7. PRODUCE meaningful suggestions: AI-generated suggestions must follow logically from the video content. Suggest next steps, deeper resources, or practice exercises that are directly relevant.
+ 8. USE precise timestamps: every entry with a timestamp field must contain a valid HH:MM:SS value referencing when the content appeared.
+ 9. LIST files_to_generate for reference purposes \u2014 this list is informational and does not control which output files are generated. Output files are determined automatically based on available extraction data.
+ 10. NEVER add information not present in the source data. Suggestions are the only place for AI-generated content beyond the video.
+
+ COMPLETENESS TARGET:
+ - Aim for at least 5 topics for any video over 15 minutes
+ - Every explicit and implicit decision must appear in key_decisions
+ - The files_to_generate list should reflect what content was found, but output routing is handled automatically
+ - The overview should be dense with specifics, not vague summary language
+ `;
+ var LANGUAGE_NAMES = {
+ zh: "Chinese",
+ ja: "Japanese",
+ ko: "Korean",
+ es: "Spanish",
+ fr: "French",
+ de: "German",
+ pt: "Portuguese",
+ ru: "Russian",
+ ar: "Arabic",
+ hi: "Hindi"
+ };
+ function withLanguage(prompt, lang) {
+ if (!lang || lang === "en") return prompt;
+ const languageName = LANGUAGE_NAMES[lang] ?? lang;
+ return `IMPORTANT: Generate ALL output text in ${languageName}.
+ Timestamps, speaker labels, and code should remain in their original language.
+
+ ${prompt}`;
+ }
+
+ // src/cli/ui.ts
  function showLogo() {
  const ascii = figlet.textSync("VIDISTILL", { font: "Big" });
  console.log(pc.cyan(ascii));
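In the relocated prompts module, `withLanguage` is the only new logic: it prepends a language directive unless the code is `en` or missing. A quick sketch of its behavior, assuming only the definitions in the hunk above:

```js
// Known code: the directive names the language via LANGUAGE_NAMES.
withLanguage("Transcribe the audio.", "ja");
// → "IMPORTANT: Generate ALL output text in Japanese.\n..." + original prompt

// English or no code: prompt returned untouched.
withLanguage("Transcribe the audio.", "en"); // → "Transcribe the audio."
withLanguage("Transcribe the audio.");       // → "Transcribe the audio."

// Unknown code: falls back to the raw code (LANGUAGE_NAMES[lang] ?? lang).
withLanguage("Transcribe the audio.", "sv"); // → "...output text in sv. ..."
```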
@@ -26,6 +253,13 @@ function showConfigBox(config) {
  `Context: ${config.context ?? "(none)"}`,
  `Output: ${config.output}`
  ];
+ if (config.videoType === "audio") {
+ lines.push("Type: Audio (visual analysis skipped)");
+ }
+ if (config.lang != null && config.lang !== "en") {
+ const langName = LANGUAGE_NAMES[config.lang] ?? config.lang;
+ lines.push(`Language: ${langName} (${config.lang})`);
+ }
  note(lines.join("\n"), "Configuration");
  }

@@ -33,6 +267,7 @@ function showConfigBox(config) {
  import { log as log8, cancel as cancel2 } from "@clack/prompts";
  import pc4 from "picocolors";
  import { basename as basename3, extname as extname2, resolve } from "path";
+ import { existsSync as existsSync3, openSync as openSync2, readSync as readSync2, closeSync as closeSync2 } from "fs";

  // src/cli/prompts.ts
  import { text, password, confirm, select, isCancel, cancel } from "@clack/prompts";
@@ -266,6 +501,7 @@ function createProgressDisplay() {
  seenTotalSteps = true;
  s.stop("");
  progressBar = progress({ max: status.totalSteps });
+ progressBar.start(label);
  }
  if (progressBar != null) {
  if (status.status === "done" && status.currentStep != null) {
@@ -281,7 +517,16 @@ function createProgressDisplay() {
  }
  function onWait(_delayMs) {
  }
- function complete(_result, _elapsedMs) {
+ function complete(result, _elapsedMs) {
+ if (progressBar != null) {
+ if (result.errors.length > 0) {
+ progressBar.stop("");
+ } else {
+ progressBar.stop("");
+ }
+ } else {
+ s.stop("");
+ }
  }
  return { update, onWait, complete };
  }
@@ -456,14 +701,39 @@ function detectMimeType(filePath) {
  } finally {
  closeSync(fd);
  }
+ if (buf.slice(0, 3).toString("ascii") === "ID3") {
+ return { mimeType: "audio/mp3", isMkv: false };
+ }
+ if (buf[0] === 255 && (buf[1] & 240) === 240 && (buf[1] & 6) === 0) {
+ return { mimeType: "audio/aac", isMkv: false };
+ }
+ if (buf[0] === 255 && (buf[1] & 224) === 224 && (buf[1] & 6) !== 0) {
+ return { mimeType: "audio/mp3", isMkv: false };
+ }
+ if (buf.slice(0, 4).toString("ascii") === "fLaC") {
+ return { mimeType: "audio/flac", isMkv: false };
+ }
+ if (buf.slice(0, 4).toString("ascii") === "OggS") {
+ return { mimeType: "audio/ogg", isMkv: false };
+ }
+ if (buf.slice(0, 4).toString("ascii") === "RIFF" && buf.slice(8, 12).toString("ascii") === "WAVE") {
+ return { mimeType: "audio/wav", isMkv: false };
+ }
  if (buf.slice(4, 8).toString("ascii") === "ftyp") {
  const brand = buf.slice(8, 12).toString("ascii");
+ if (brand === "M4A " || brand === "M4B ") {
+ return { mimeType: "audio/mp4", isMkv: false };
+ }
  if (brand.startsWith("qt ")) {
  return { mimeType: "video/quicktime", isMkv: false };
  }
  if (brand.startsWith("3gp") || brand.startsWith("3g2")) {
  return { mimeType: "video/3gpp", isMkv: false };
  }
+ const ext = extname(filePath).toLowerCase();
+ if (ext === ".m4a" || ext === ".m4b") {
+ return { mimeType: "audio/mp4", isMkv: false };
+ }
  return { mimeType: "video/mp4", isMkv: false };
  }
  if (buf[0] === 26 && buf[1] === 69 && buf[2] === 223 && buf[3] === 163) {
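These new branches sniff audio containers from magic bytes rather than trusting the file extension. A standalone sketch of the same signature checks (simplified; the real function also inspects `ftyp` brands above and the MKV EBML header below):

```js
import { openSync, readSync, closeSync } from "fs";

// Read the first 12 bytes and match well-known audio signatures.
function sniffAudioMime(filePath) {
  const buf = Buffer.alloc(12);
  const fd = openSync(filePath, "r");
  try {
    readSync(fd, buf, 0, 12, 0);
  } finally {
    closeSync(fd);
  }
  if (buf.slice(0, 3).toString("ascii") === "ID3") return "audio/mp3";   // MP3 with ID3 tag
  if (buf.slice(0, 4).toString("ascii") === "fLaC") return "audio/flac"; // FLAC stream
  if (buf.slice(0, 4).toString("ascii") === "OggS") return "audio/ogg";  // Ogg container
  if (buf.slice(0, 4).toString("ascii") === "RIFF" &&
      buf.slice(8, 12).toString("ascii") === "WAVE") return "audio/wav"; // RIFF/WAVE
  // 0xFF sync word: the layer bits (buf[1] & 6) separate ADTS AAC (00) from MPEG audio.
  if (buf[0] === 255 && (buf[1] & 240) === 240 && (buf[1] & 6) === 0) return "audio/aac";
  if (buf[0] === 255 && (buf[1] & 224) === 224 && (buf[1] & 6) !== 0) return "audio/mp3";
  return null; // not a recognized audio signature
}
```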
@@ -557,13 +827,12 @@ async function handleLocalFile(filePath, client) {
  if (!existsSync2(filePath)) {
  throw new Error(`File not found: ${filePath}`);
  }
- const isMkv = isMkvFile(filePath);
- if (!isMkv) {
- const match = detectMimeType(filePath);
- if (!match) {
- const ext = extname(filePath).toLowerCase();
- throw new Error(`Unsupported video format: ${ext || basename(filePath)}`);
- }
+ const mimeMatch = detectMimeType(filePath);
+ const isAudio = mimeMatch != null && mimeMatch.mimeType.startsWith("audio/");
+ const isMkv = !isAudio && isMkvFile(filePath);
+ if (!isAudio && !isMkv && !mimeMatch) {
+ const ext = extname(filePath).toLowerCase();
+ throw new Error(`Unsupported video format: ${ext || basename(filePath)}`);
  }
  const originalSize = fileSize(filePath);
  if (originalSize > SIZE_3GB) {
@@ -577,7 +846,7 @@ async function handleLocalFile(filePath, client) {
  tempFiles.push(converted);
  workingPath = converted;
  }
- if (fileSize(workingPath) > SIZE_2GB) {
+ if (!isAudio && fileSize(workingPath) > SIZE_2GB) {
  const compressed = compressTo720p(workingPath);
  tempFiles.push(compressed);
  workingPath = compressed;
@@ -592,7 +861,8 @@ async function handleLocalFile(filePath, client) {
  fileUri: uploaded.uri,
  mimeType: uploaded.mimeType,
  duration: uploaded.duration,
- uploadedFileName: uploaded.name
+ uploadedFileName: uploaded.name,
+ isAudio
  };
  } finally {
  for (const f of tempFiles) {
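Taken together, these three hunks change the local-file flow as follows (a condensed sketch, not the bundled code itself):

```js
// 1. Sniff the MIME type first; audio/* short-circuits the video-only checks.
const mimeMatch = detectMimeType(filePath);
const isAudio = mimeMatch != null && mimeMatch.mimeType.startsWith("audio/");
const isMkv = !isAudio && isMkvFile(filePath); // the MKV probe only matters for video
if (!isAudio && !isMkv && !mimeMatch) throw new Error("Unsupported video format");

// 2. Only video over 2GB is recompressed to 720p; audio is uploaded as-is.
if (!isAudio && fileSize(workingPath) > SIZE_2GB) workingPath = compressTo720p(workingPath);

// 3. The returned handle carries the flag so later stages can skip visual passes.
return { fileUri, mimeType, duration, uploadedFileName, isAudio };
```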
@@ -667,211 +937,6 @@ async function detectDuration(source) {
  // src/core/pipeline.ts
  import { log as log6 } from "@clack/prompts";

- // src/constants/prompts.ts
- var SYSTEM_INSTRUCTION_PASS_1 = `
- You are a professional audio transcriber. Your task is to create a COMPLETE, VERBATIM transcription of all speech in this video segment. Focus EXCLUSIVELY on the audio stream.
-
- CRITICAL RULES:
- 1. TRANSCRIBE every spoken word completely and verbatim. Do not summarize, paraphrase, or skip any sentence.
- 2. IDENTIFY different speakers. Label them SPEAKER_00, SPEAKER_01, etc. consistently throughout. If a speaker introduces themselves by name, note the name in the first entry's speaker field as "SPEAKER_00 (John)".
- 3. NOTE tone and emphasis: when a speaker emphasizes words (louder, slower, repeated), mark those words. When they express emotions (excitement, warning, frustration, humor), note the tone.
- 4. RECORD pauses longer than 1.5 seconds as pause markers with duration.
- 5. PRESERVE filler words only when they carry meaning (hesitation indicating uncertainty about code behavior, self-correction). Remove meaningless "um", "uh".
- 6. NEVER add your own explanations, interpretations, or knowledge. Only transcribe what is spoken.
- 7. NEVER skip content because it seems repetitive or obvious. Record everything spoken.
- 8. When the speaker references something on screen (e.g., "as you can see here", "this function", "line 5"), transcribe exactly what they say \u2014 the visual context will be captured separately.
-
- COMPLETENESS TARGET:
- - Aim for at least 150 words per minute of video in the transcript
- - Every speaker change must be noted with a new entry
- - Every sentence must appear \u2014 if in doubt, include it
- `;
- var SYSTEM_INSTRUCTION_PASS_2_TEMPLATE = `
- You are a professional code and visual content extractor. Your task is to extract ALL visual content from this video segment \u2014 every piece of code on screen, every diagram, every slide, every UI element.
-
- Focus EXCLUSIVELY on what is visible on screen. The audio transcript from this segment is provided below for cross-referencing \u2014 use it to associate spoken explanations with the code being displayed, but do NOT re-transcribe any speech.
-
- TRANSCRIPT FROM THIS SEGMENT (for cross-reference only):
- {INJECT_PASS1_TRANSCRIPT_HERE}
-
- CRITICAL RULES:
- 1. EXTRACT every piece of code visible on screen \u2014 complete, with original indentation and formatting preserved exactly as shown.
- 2. For each code appearance: note the filename if visible in a tab or title bar, the programming language, and the screen type (editor, terminal, browser, slide).
- 3. TRACK code changes: when code is modified between appearances, note what changed (lines added, modified, deleted). Compare against previous code blocks in this segment.
- 4. ASSOCIATE code with speech: using the injected transcript above, find what the instructor was saying when this code was on screen. Quote their explanation verbatim or near-verbatim.
- 5. CAPTURE non-code visuals: slides with text, architectural diagrams, browser output, UI demonstrations, terminal output. Describe these completely.
- 6. NEVER add your own explanations or interpretations. Only record what is visible.
- 7. NEVER skip code because it seems repetitive or unchanged from before. Record every distinct appearance.
- 8. If code scrolls, capture the full visible code at each scroll position as a separate entry.
-
- COMPLETENESS TARGET:
- - Every frame that shows code should produce a code_block entry
- - Every slide or diagram should produce a visual_notes entry
- - If the screen doesn't change for 30+ seconds, note the unchanged state
- `;
- var SYSTEM_INSTRUCTION_PASS_0 = `
- You are a video content classifier. Analyze the provided video sample and produce a structured VideoProfile that classifies the video type and recommends processing parameters.
-
- CLASSIFICATION RULES:
- 1. CLASSIFY the video into exactly one type:
- - "coding": Programming tutorials, live coding, IDE/editor-heavy content
- - "meeting": Video calls, Zoom/Teams meetings, multi-participant discussions
- - "lecture": Academic lectures, talks, single-speaker educational content
- - "presentation": Slide-based presentations, keynotes, demo days
- - "conversation": Interviews, podcasts, panel discussions without slides
- - "mixed": Cannot clearly classify into one category, or multiple types present
-
- 2. DETECT visual content:
- - hasCode: Code editors, IDEs, or code visible on screen
- - hasSlides: Presentation slides (PowerPoint, Google Slides, Keynote)
- - hasDiagrams: Architecture diagrams, flowcharts, charts, graphs
- - hasPeopleGrid: Video grid showing multiple participants (Zoom/Teams layout)
- - hasChatbox: Chat panel visible (meeting chat, live stream chat sidebar)
- - hasWhiteboard: Whiteboard, handwritten notes, or drawing surface
- - hasTerminal: Terminal, command-line interface, or shell
- - hasScreenShare: Desktop or application screen sharing
-
- 3. ANALYZE audio:
- - hasMultipleSpeakers: true if more than one distinct voice is heard
- - primaryLanguage: The main spoken language
- - quality: "high" (studio/clear), "medium" (decent webcam), "low" (noisy/poor)
-
- 4. IDENTIFY speakers:
- - count: Number of distinct speakers heard
- - identified: Names if visible on screen (name tags, introductions) or spoken aloud
-
- 5. ASSESS complexity:
- - "simple": Single topic, linear flow, straightforward content
- - "moderate": Multiple topics, some complexity, normal pacing
- - "complex": Dense content, rapid switching, multiple concurrent information streams
-
- 6. RECOMMEND processing parameters:
- - resolution: "low" for text-only/simple visuals, "medium" for general content, "high" for code/diagrams
- - segmentMinutes: 10 for simple/moderate, 8 for complex content
- - passes: Always include "transcript" and "visual". Add specialist passes based on content type.
-
- PASS RECOMMENDATIONS BY TYPE:
- - coding: ["transcript", "visual", "code", "synthesis"]
- - meeting: ["transcript", "visual", "people", "implicit", "synthesis"] (add "chat" if hasChatbox)
- - lecture: ["transcript", "visual", "implicit", "synthesis"]
- - presentation: ["transcript", "visual", "implicit", "synthesis"] (add "people" if multiple speakers)
- - conversation: ["transcript", "visual", "implicit", "synthesis"]
- - mixed: ["transcript", "visual", "code", "people", "chat", "implicit", "synthesis"]
- `;
- var SYSTEM_INSTRUCTION_PASS_3A = `
- You are an expert code reconstruction analyst. Your task is to reconstruct the complete, final state of every code file shown across this entire video, synthesizing all edits into a coherent codebase snapshot.
-
- You will receive the complete video and all extracted transcript and code block data. Use them together to understand what code was written, modified, and deleted.
-
- CRITICAL RULES:
- 1. RECONSTRUCT each file to its final state \u2014 apply all changes in chronological order so the output reflects the code as it was at the end of the video.
- 2. PRESERVE exact code: indentation, spacing, naming, and formatting must match what was visible on screen. Never "fix" or improve the code.
- 3. TRACK every change to a file: for each distinct edit (new file creation, addition of lines, modification, deletion, refactoring), record it as a separate change entry with a timestamp and description.
- 4. INFER filenames from editor tabs, title bars, import statements, or spoken context. If unknown, use a descriptive placeholder like "unknown_file_1.py".
- 5. EXTRACT dependencies: every library import, require(), package name, or external module reference mentioned or shown counts as a dependency.
- 6. CAPTURE build commands: any terminal command shown or spoken for installing, building, running, or testing the project (e.g., "npm install", "go build", "python -m pytest").
- 7. NEVER invent code that was not shown or described. If a section was unclear, note it with a comment like "// content not fully visible".
- 8. NEVER skip a file because it appears in only one part of the video \u2014 if code was shown, reconstruct it.
- 9. When a file appears multiple times, record its complete change history in a single entry with all edits in chronological order.
- 10. INCLUDE empty files if created but not yet written \u2014 use empty string for final_content and note the creation in changes.
- 11. Cross-reference your visual analysis of the video against the extracted code blocks provided in the text context. Prioritize what you can visually verify on screen. If code is partially visible, include what you can see and mark unclear sections with \`// [content not fully visible]\`.
- 12. Do NOT invent code files that are not clearly visible on screen. If you are uncertain whether a file exists, do not include it.
-
- COMPLETENESS TARGET:
- - Every distinct filename that appeared on screen must produce a files entry
- - Every editor session or code paste visible in any segment must be accounted for
- - Build commands shown in the terminal must all be listed
- `;
- var SYSTEM_INSTRUCTION_PASS_3B = `
- You are an expert at identifying and profiling people from video content. Your task is to extract a complete picture of every participant visible or audible in this video \u2014 their identity, role, contributions, and relationships.
-
- You will receive the transcript and visual extraction from all segments. Use speaker labels, name tags, on-screen text, introductions, and any other signals to identify participants.
-
- CRITICAL RULES:
- 1. IDENTIFY every distinct person who speaks or appears on screen, even if briefly. Do not merge two different people into one entry.
- 2. EXTRACT names from: spoken introductions ("Hi, I'm Alice"), on-screen name tags or captions, slide attribution, email addresses, or usernames visible in chat.
- 3. INFER roles from: job titles spoken or shown, context of their contribution (e.g., consistently asking questions = audience member; leading the agenda = host), or organizational signals.
- 4. RECORD speaking_segments as timestamps where each person's voice is heard or they appear on screen.
- 5. CAPTURE contact information exactly as shown or spoken: email addresses, Twitter/X handles, LinkedIn URLs, GitHub usernames, phone numbers.
- 6. SUMMARIZE contributions: what did this person say, present, decide, or demonstrate? Each contribution entry should be a specific, concrete action or statement.
- 7. DOCUMENT relationships: who reports to whom, who introduced whom, collaborative pairs, co-presenters, interviewer/interviewee dynamics.
- 8. NEVER guess or infer a name that was not clearly stated or shown. Use "Unknown Participant" with a description if the person cannot be identified.
- 9. NEVER merge two people just because they have the same role \u2014 if two engineers speak, they are two separate participants.
- 10. If a person's role or organization cannot be determined, use empty string \u2014 do not guess.
-
- COMPLETENESS TARGET:
- - Every speaker label (SPEAKER_00, SPEAKER_01, etc.) from the transcript must map to at least one participant entry
- - Every name-tag or on-screen name must produce a participant entry
- - All contact details shared during the video must be captured
- `;
- var SYSTEM_INSTRUCTION_PASS_3C = `
- You are a precise chat extraction specialist. Your task is to extract every chat message and link visible in the chat panel of this video \u2014 verbatim, with sender and timestamp.
-
- You will receive the transcript and visual extraction from all segments. Focus on the chat panel, comment sidebar, or any on-screen messaging interface.
-
- CRITICAL RULES:
- 1. EXTRACT every chat message visible on screen, verbatim. Do not paraphrase, shorten, or summarize any message.
- 2. RECORD the sender name exactly as displayed (username, display name, or handle).
- 3. TIMESTAMP each message at the video timestamp when it becomes visible on screen, in HH:MM:SS format.
- 4. EXTRACT every URL or link that appears in chat or is spoken and referred to as a link. Capture the full URL.
- 5. For each link, record the context: what was the sender explaining when they shared it? Why is it relevant?
- 6. HANDLE partial visibility: if a message is cut off by the chat panel boundary, transcribe as much as is visible and append "[truncated]".
- 7. CAPTURE reactions, emoji, and formatting if they are meaningful (e.g., a thumbs-up reaction to a proposal signals agreement).
- 8. NEVER invent messages that were not clearly visible on screen. If a message is illegible, note it as "[illegible message from {sender} at {timestamp}]".
- 9. NEVER skip messages that seem like noise or off-topic \u2014 capture all visible messages in order.
- 10. ORDER messages chronologically by their video timestamp of appearance.
-
- COMPLETENESS TARGET:
- - Every frame that shows the chat panel should contribute at least one message entry if new messages are visible
- - All URLs \u2014 whether in chat, on slides, or spoken \u2014 must appear in the links array
- - If the chat panel is not visible in this video, return empty arrays for both messages and links
- `;
- var SYSTEM_INSTRUCTION_PASS_3D = `
- You are an expert at reading between the lines of video conversations. Your task is to identify implicit signals \u2014 emotional dynamics, unstated decisions, unasked questions, informal task assignments, and emphasis patterns \u2014 that are not surfaced by the literal transcript.
-
- You will receive the complete transcript and visual data from all segments. Read the subtext, not just the text.
-
- CRITICAL RULES:
- 1. DETECT emotional shifts: moments where the tone, energy, or mood of the conversation meaningfully changes. Note what triggered the shift and how the state changed.
- 2. SURFACE implicit questions: when a speaker is clearly uncertain, confused, or probing for information without phrasing it as a formal question. Articulate what question they were really asking.
- 3. IDENTIFY implicit decisions: when participants arrive at a shared understanding or course of action without anyone explicitly saying "we decided X". These are consensus decisions made through agreement, silence, or topic change.
- 4. FLAG informal task assignments: when someone is asked or expected to do something without it being recorded as a formal action item (e.g., "you should probably look at that" or "maybe someone can handle X").
- 5. TRACK emphasis patterns: concepts, terms, or ideas mentioned multiple times across the video. Repetition signals importance. Record each mention timestamp and explain why the pattern is significant.
- 6. NEVER fabricate emotional states or decisions. Only record what is clearly supported by specific words, tone, or behavior in the video.
- 7. NEVER over-interpret: a speaker saying "interesting" is not necessarily an emotional shift. Apply judgment and only flag genuinely notable patterns.
- 8. PRESERVE specificity: quote or paraphrase the exact words or moments that support each inference.
- 9. SEPARATE explicit from implicit: if something was directly stated, it belongs in the transcript or action items, not here. This pass captures what was NOT said directly.
- 10. CONSIDER non-verbal signals visible on screen: hesitation, laughter, extended pauses, camera behavior, or facial expressions if participants are visible.
-
- COMPLETENESS TARGET:
- - Aim to identify at least 3 emphasis patterns for any video over 5 minutes
- - Every task mentioned informally or suggested in passing must appear in tasks_assigned
- - Implicit decisions are often the most important \u2014 prioritize finding them
- `;
- var SYSTEM_INSTRUCTION_SYNTHESIS = `
- You are a master synthesizer. Your task is to produce the definitive, unified knowledge extraction from this video by combining all available pass data into a single coherent result.
-
- You will receive: the complete transcript (pass 1), visual and code extraction (pass 2), and any specialist pass outputs (code reconstruction, people extraction, chat extraction, implicit signals). Synthesize all of it.
-
- CRITICAL RULES:
- 1. BE SPECIFIC: Every claim must reference specific content from the video. Never write "various topics were discussed" \u2014 name the topics. Never write "some decisions were made" \u2014 state each decision exactly.
- 2. UNIFY across passes: combine related information from different passes into unified entries. A decision mentioned in the transcript and reinforced by an implicit signal should appear as one entry, not two.
- 3. SYNTHESIZE thematically: group content by topic, not chronologically. Combine all content about a single subject (even if spread across 30 minutes) into one topic entry.
- 4. EXTRACT decisions with full reasoning: every design choice, technology selection, or approach decision must include the rationale as explained in the video.
- 5. GENERATE actionable items: action items must be concrete and specific. "Review the authentication module" is better than "review the code".
- 6. CAPTURE every question: include questions asked explicitly and questions raised implicitly (from the implicit signals pass). Note whether each was answered.
- 7. PRODUCE meaningful suggestions: AI-generated suggestions must follow logically from the video content. Suggest next steps, deeper resources, or practice exercises that are directly relevant.
- 8. USE precise timestamps: every entry with a timestamp field must contain a valid HH:MM:SS value referencing when the content appeared.
- 9. LIST files_to_generate for reference purposes \u2014 this list is informational and does not control which output files are generated. Output files are determined automatically based on available extraction data.
- 10. NEVER add information not present in the source data. Suggestions are the only place for AI-generated content beyond the video.
-
- COMPLETENESS TARGET:
- - Aim for at least 5 topics for any video over 15 minutes
- - Every explicit and implicit decision must appear in key_decisions
- - The files_to_generate list should reflect what content was found, but output routing is handled automatically
- - The overview should be dense with specifics, not vague summary language
- `;
-

  // src/gemini/schemas.ts
  import { Type } from "@google/genai";
  var SCHEMA_PASS_0 = {
@@ -1428,7 +1493,7 @@ function changeTypeBadge(changeType) {

  // src/passes/transcript.ts
  async function runTranscript(params) {
- const { client, fileUri, mimeType, segment, model, resolution } = params;
+ const { client, fileUri, mimeType, segment, model, resolution, lang } = params;
  const contents = [
  {
  role: "user",
@@ -1450,7 +1515,7 @@ async function runTranscript(params) {
  model,
  contents,
  config: {
- systemInstruction: SYSTEM_INSTRUCTION_PASS_1,
+ systemInstruction: withLanguage(SYSTEM_INSTRUCTION_PASS_1, lang),
  responseSchema: SCHEMA_PASS_1,
  responseMimeType: "application/json",
  ...resolution !== void 0 ? { mediaResolution: resolution } : {},
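Every pass below repeats this same pattern, so one sketch covers them all. The wrapper around the call isn't visible in these hunks; `client.models.generateContent` is our assumption for the @google/genai client:

```js
// Sketch: a pass threads `lang` into its system instruction and requests JSON.
const response = await client.models.generateContent({
  model,
  contents,
  config: {
    systemInstruction: withLanguage(SYSTEM_INSTRUCTION_PASS_1, lang), // lang-aware prompt
    responseSchema: SCHEMA_PASS_1,          // structured-output schema for this pass
    responseMimeType: "application/json",
  },
});
```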
@@ -1466,11 +1531,11 @@ async function runTranscript(params) {

  // src/passes/visual.ts
  async function runVisual(params) {
- const { client, fileUri, mimeType, segment, model, resolution, pass1Transcript } = params;
+ const { client, fileUri, mimeType, segment, model, resolution, pass1Transcript, lang } = params;
  const transcriptText = pass1Transcript != null ? pass1Transcript.transcript_entries.map((t) => `[${t.timestamp}] ${t.speaker}: ${t.text}`).join("\n") : "[No transcript available for this segment]";
- const systemInstruction = SYSTEM_INSTRUCTION_PASS_2_TEMPLATE.replace(
- "{INJECT_PASS1_TRANSCRIPT_HERE}",
- transcriptText
+ const systemInstruction = withLanguage(
+ SYSTEM_INSTRUCTION_PASS_2_TEMPLATE.replace("{INJECT_PASS1_TRANSCRIPT_HERE}", transcriptText),
+ lang
  );
  const contents = [
  {
@@ -1510,7 +1575,7 @@ async function runVisual(params) {
  // src/passes/scene-analysis.ts
  import { MediaResolution } from "@google/genai";
  async function runSceneAnalysis(params) {
- const { client, fileUri, mimeType, duration, model, resolution } = params;
+ const { client, fileUri, mimeType, duration, model, resolution, lang } = params;
  const safeDuration = Number.isFinite(duration) && duration > 0 ? duration : 0;
  const endSeconds = Math.min(180, safeDuration);
  const contents = [
@@ -1534,7 +1599,7 @@ async function runSceneAnalysis(params) {
  model,
  contents,
  config: {
- systemInstruction: SYSTEM_INSTRUCTION_PASS_0,
+ systemInstruction: withLanguage(SYSTEM_INSTRUCTION_PASS_0, lang),
  responseSchema: SCHEMA_PASS_0,
  responseMimeType: "application/json",
  ...resolution !== void 0 ? { mediaResolution: resolution } : { mediaResolution: MediaResolution.MEDIA_RESOLUTION_LOW },
@@ -1611,7 +1676,7 @@ ${block.content}`);
  return contextText;
  }
  async function runCodeReconstruction(params) {
- const { client, fileUri, mimeType, duration, model, resolution, pass1Results, pass2Results } = params;
+ const { client, fileUri, mimeType, duration, model, resolution, pass1Results, pass2Results, lang } = params;
  const contextText = compileContext(duration, pass1Results, pass2Results);
  const contents = [
  {
@@ -1630,7 +1695,7 @@ ${contextText}`
  model,
  contents,
  config: {
- systemInstruction: SYSTEM_INSTRUCTION_PASS_3A,
+ systemInstruction: withLanguage(SYSTEM_INSTRUCTION_PASS_3A, lang),
  responseSchema: SCHEMA_PASS_3A,
  responseMimeType: "application/json",
  ...resolution !== void 0 ? { mediaResolution: resolution } : {},
@@ -1646,7 +1711,7 @@ ${contextText}`

  // src/passes/people.ts
  async function runPeopleExtraction(params) {
- const { client, fileUri, mimeType, model, pass1Results } = params;
+ const { client, fileUri, mimeType, model, pass1Results, lang } = params;
  const hasAnyTranscript = pass1Results.some((r) => r != null);
  const transcriptText = hasAnyTranscript ? pass1Results.filter((r) => r != null).flatMap(
  (r) => r.transcript_entries.map((t) => `[${t.timestamp}] ${t.speaker}: ${t.text}`)
@@ -1666,7 +1731,7 @@ ${transcriptText}`;
  model,
  contents,
  config: {
- systemInstruction: SYSTEM_INSTRUCTION_PASS_3B,
+ systemInstruction: withLanguage(SYSTEM_INSTRUCTION_PASS_3B, lang),
  responseSchema: SCHEMA_PASS_3B,
  responseMimeType: "application/json",
  maxOutputTokens: 65536,
@@ -1681,7 +1746,7 @@ ${transcriptText}`;

  // src/passes/chat.ts
  async function runChatExtraction(params) {
- const { client, fileUri, mimeType, segment, model, resolution, pass2Result } = params;
+ const { client, fileUri, mimeType, segment, model, resolution, pass2Result, lang } = params;
  const visualNotesText = pass2Result != null && pass2Result.visual_notes.length > 0 ? pass2Result.visual_notes.map((n) => `[${n.timestamp}] ${n.visual_type}: ${n.description}`).join("\n") : "[No visual context available for this segment]";
  const codeBlocksText = pass2Result != null && pass2Result.code_blocks.length > 0 ? pass2Result.code_blocks.map((b) => `[${b.timestamp}] ${b.filename} (${b.language}):
  ${b.content}`).join("\n\n") : "[No code blocks available for this segment]";
@@ -1715,7 +1780,7 @@ ${contextText}`
  model,
  contents,
  config: {
- systemInstruction: SYSTEM_INSTRUCTION_PASS_3C,
+ systemInstruction: withLanguage(SYSTEM_INSTRUCTION_PASS_3C, lang),
  responseSchema: SCHEMA_PASS_3C,
  responseMimeType: "application/json",
  ...resolution !== void 0 ? { mediaResolution: resolution } : {},
@@ -1731,7 +1796,7 @@ ${contextText}`

  // src/passes/implicit.ts
  async function runImplicitSignals(params) {
- const { client, fileUri, mimeType, segment, model, resolution, pass1Result, pass2Result } = params;
+ const { client, fileUri, mimeType, segment, model, resolution, pass1Result, pass2Result, lang } = params;
  const transcriptText = pass1Result != null ? pass1Result.transcript_entries.map((t) => `[${t.timestamp}] ${t.speaker} (${t.tone}): ${t.text}`).join("\n") : "[No transcript available for this segment]";
  const visualNotesText = pass2Result != null && pass2Result.visual_notes.length > 0 ? pass2Result.visual_notes.map((n) => `[${n.timestamp}] ${n.visual_type}: ${n.description}`).join("\n") : "[No visual context available for this segment]";
  const contextText = [
@@ -1764,7 +1829,7 @@ ${contextText}`
1764
1829
  model,
1765
1830
  contents,
1766
1831
  config: {
1767
- systemInstruction: SYSTEM_INSTRUCTION_PASS_3D,
1832
+ systemInstruction: withLanguage(SYSTEM_INSTRUCTION_PASS_3D, lang),
1768
1833
  responseSchema: SCHEMA_PASS_3D,
1769
1834
  responseMimeType: "application/json",
1770
1835
  ...resolution !== void 0 ? { mediaResolution: resolution } : {},
@@ -1877,7 +1942,7 @@ function compileContext2(params) {
  return sections.join("\n\n");
  }
  async function runSynthesis(params) {
- const { client, model } = params;
+ const { client, model, lang } = params;
  const compiledContext = compileContext2(params);
  const contents = [
  {
@@ -1889,7 +1954,7 @@ async function runSynthesis(params) {
  model,
  contents,
  config: {
- systemInstruction: SYSTEM_INSTRUCTION_SYNTHESIS,
+ systemInstruction: withLanguage(SYSTEM_INSTRUCTION_SYNTHESIS, lang),
  responseSchema: SCHEMA_SYNTHESIS,
  responseMimeType: "application/json",
  maxOutputTokens: 65536,
@@ -1938,6 +2003,12 @@ function determineStrategy(profile) {
  passes.add("chat");
  passes.add("implicit");
  break;
+ case "audio":
+ passes.delete("visual");
+ passes.delete("code");
+ passes.add("people");
+ passes.add("implicit");
+ break;
  default:
  break;
  }
@@ -2290,25 +2361,30 @@ var DEFAULT_PROFILE = {
  }
  };
  async function runPipeline(config) {
- const { client, fileUri, mimeType, duration, model, rateLimiter, onProgress, onWait, isShuttingDown } = config;
+ const { client, fileUri, mimeType, duration, model, rateLimiter, onProgress, onWait, isShuttingDown, lang } = config;
  const errors = [];
  const passesRun = [];
- onProgress?.({ phase: "pass0", segment: 0, totalSegments: 1, status: "running" });
  let videoProfile;
  let strategy;
- const pass0Attempt = await withRetry(
- () => rateLimiter.execute(() => runSceneAnalysis({ client, fileUri, mimeType, duration, model }), { onWait }),
- "pass0"
- );
- if (pass0Attempt.error !== null) {
- log6.warn(pass0Attempt.error);
- errors.push(pass0Attempt.error);
+ if (config.overrideStrategy != null) {
  videoProfile = DEFAULT_PROFILE;
+ strategy = config.overrideStrategy;
  } else {
- videoProfile = pass0Attempt.result ?? DEFAULT_PROFILE;
+ onProgress?.({ phase: "pass0", segment: 0, totalSegments: 1, status: "running" });
+ const pass0Attempt = await withRetry(
+ () => rateLimiter.execute(() => runSceneAnalysis({ client, fileUri, mimeType, duration, model, lang }), { onWait }),
+ "pass0"
+ );
+ if (pass0Attempt.error !== null) {
+ log6.warn(pass0Attempt.error);
+ errors.push(pass0Attempt.error);
+ videoProfile = DEFAULT_PROFILE;
+ } else {
+ videoProfile = pass0Attempt.result ?? DEFAULT_PROFILE;
+ }
+ strategy = determineStrategy(videoProfile);
+ onProgress?.({ phase: "pass0", segment: 0, totalSegments: 1, status: "done" });
  }
- strategy = determineStrategy(videoProfile);
- onProgress?.({ phase: "pass0", segment: 0, totalSegments: 1, status: "done" });
  const plan = createSegmentPlan(duration, {
  segmentMinutes: strategy.segmentMinutes,
  resolution: strategy.resolution
@@ -2337,7 +2413,7 @@ async function runPipeline(config) {
  onProgress?.({ phase: "pass1", segment: i, totalSegments: n, status: "running", totalSteps });
  let pass1 = null;
  const pass1Attempt = await withRetry(
- () => rateLimiter.execute(() => runTranscript({ client, fileUri, mimeType, segment, model, resolution }), { onWait }),
+ () => rateLimiter.execute(() => runTranscript({ client, fileUri, mimeType, segment, model, resolution, lang }), { onWait }),
  `segment ${i} pass1`
  );
  if (pass1Attempt.error !== null) {
@@ -2360,7 +2436,8 @@ async function runPipeline(config) {
  segment,
  model,
  resolution,
- pass1Transcript: pass1 ?? void 0
+ pass1Transcript: pass1 ?? void 0,
+ lang
  }),
  { onWait }
  ),
@@ -2387,7 +2464,8 @@ async function runPipeline(config) {
  segment,
  model: MODELS.flash,
  resolution,
- pass2Result: pass2 ?? void 0
+ pass2Result: pass2 ?? void 0,
+ lang
  }),
  { onWait }
  ),
@@ -2417,7 +2495,8 @@ async function runPipeline(config) {
  model: MODELS.flash,
  resolution,
  pass1Result: pass1 ?? void 0,
- pass2Result: pass2 ?? void 0
+ pass2Result: pass2 ?? void 0,
+ lang
  }),
  { onWait }
  ),
@@ -2469,7 +2548,8 @@ async function runPipeline(config) {
  fileUri,
  mimeType,
  model: MODELS.flash,
- pass1Results
+ pass1Results,
+ lang
  }),
  { onWait }
  ),
@@ -2500,7 +2580,8 @@ async function runPipeline(config) {
  model: MODELS.pro,
  resolution,
  pass1Results,
- pass2Results
+ pass2Results,
+ lang
  }),
  { onWait }
  ),
@@ -2544,7 +2625,8 @@ async function runPipeline(config) {
  videoProfile,
  peopleExtraction,
  codeReconstruction,
- context: config.context
+ context: config.context,
+ lang
  }),
  { onWait }
  ),
@@ -3604,6 +3686,33 @@ function createShutdownHandler(params) {
  }
 
  // src/commands/distill.ts
+ function peekIsAudio(filePath) {
+ if (!existsSync3(filePath)) return false;
+ try {
+ const fd = openSync2(filePath, "r");
+ const buf = Buffer.alloc(12);
+ try {
+ readSync2(fd, buf, 0, 12, 0);
+ } finally {
+ closeSync2(fd);
+ }
+ if (buf.slice(0, 3).toString("ascii") === "ID3") return true;
+ if (buf[0] === 255 && (buf[1] & 240) === 240 && (buf[1] & 6) === 0) return true;
+ if (buf[0] === 255 && (buf[1] & 224) === 224 && (buf[1] & 6) !== 0) return true;
+ if (buf.slice(0, 4).toString("ascii") === "fLaC") return true;
+ if (buf.slice(0, 4).toString("ascii") === "OggS") return true;
+ if (buf.slice(0, 4).toString("ascii") === "RIFF" && buf.slice(8, 12).toString("ascii") === "WAVE") return true;
+ if (buf.slice(4, 8).toString("ascii") === "ftyp") {
+ const brand = buf.slice(8, 12).toString("ascii");
+ if (brand === "M4A " || brand === "M4B ") return true;
+ const ext = extname2(filePath).toLowerCase();
+ if (ext === ".m4a" || ext === ".m4b") return true;
+ }
+ return false;
+ } catch {
+ return false;
+ }
+ }
  async function runDistill(args) {
  const apiKey = await resolveApiKey();
  let rawInput = args.input ?? await promptVideoSource();
@@ -3612,7 +3721,15 @@ async function runDistill(args) {
  if (!allFlagsProvided) {
  let confirmed = false;
  while (!confirmed) {
- showConfigBox({ input: rawInput, context, output: args.output });
+ const looksLikeUrl = /^https?:\/\/|^www\./i.test(rawInput.trim());
+ const inputIsAudio = !looksLikeUrl && peekIsAudio(rawInput.trim());
+ showConfigBox({
+ input: rawInput,
+ context,
+ output: args.output,
+ videoType: inputIsAudio ? "audio" : void 0,
+ lang: args.lang
+ });
  const choice = await promptConfirmation();
  switch (choice) {
  case "start":
@@ -3687,6 +3804,7 @@ async function runDistill(args) {
  duration,
  model,
  context,
+ lang: args.lang,
  rateLimiter,
  onProgress: (status) => {
  progress2.update(status);
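
Every hunk above threads the same new `lang` value (from the `--lang` flag, through `runPipeline`, into each pass) and wraps each pass's system instruction in `withLanguage(...)`. The helper's body falls outside the hunks shown here, so the sketch below is only a guess at its shape: a wrapper that appends an output-language directive when a language code is set. The directive wording is an assumption, not the package's actual prompt text.

```js
// Hypothetical sketch: this diff shows withLanguage's call sites, not its implementation.
function withLanguage(systemInstruction, lang) {
  // Without --lang, each pass keeps its original system instruction.
  if (lang == null || lang === "") return systemInstruction;
  // With --lang, append an output-language directive for the model.
  return `${systemInstruction}\n\nWrite all output in the language with code "${lang}".`;
}
```

One optional parameter plus one wrapper at each call site keeps the language concern out of the individual pass prompts.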
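The other structural changes are `peekIsAudio`, which classifies local input by magic bytes rather than file extension, and the new `"audio"` case in `determineStrategy`, which drops the visual and code passes and forces the people and implicit passes on. How the two are wired together is not visible in these hunks. For reference, an annotated usage sketch of what each signature check recognizes; the sample file names are illustrative:

```js
// Checks applied to the first 12 bytes, in order:
//   "ID3"                                        -> MP3 with an ID3v2 tag
//   0xFF, high nibble of byte 1 set, layer 00    -> AAC in an ADTS stream
//   0xFF, top 3 bits of byte 1 set, layer != 00  -> raw MPEG audio frame (untagged MP3)
//   "fLaC" / "OggS"                              -> FLAC / Ogg (Vorbis, Opus)
//   "RIFF" ... "WAVE"                            -> WAV
//   "ftyp" with brand "M4A "/"M4B ", or a .m4a/.m4b extension -> MP4 audio
// Video brands such as "isom" fall through to false, so .mp4 screen recordings
// keep the full visual and code-extraction passes.
peekIsAudio("./podcast.mp3"); // true for a typical ID3-tagged MP3
peekIsAudio("./demo.mp4");    // false, treated as video
```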
package/package.json CHANGED
@@ -1,6 +1,6 @@
  {
  "name": "vidistill",
- "version": "0.2.2",
+ "version": "0.2.4",
  "description": "Video intelligence distiller — extract structured notes, transcripts, and insights from any video using Gemini",
  "type": "module",
  "license": "MIT",