@framers/agentos-skills-registry 0.10.0 → 0.11.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
package/package.json CHANGED
@@ -1,6 +1,6 @@
 {
   "name": "@framers/agentos-skills-registry",
-  "version": "0.10.0",
+  "version": "0.11.0",
   "files": [
     "dist",
     "registry",
@@ -0,0 +1,90 @@
---
name: video-ingestion
version: '1.0.0'
description: Video processing for RAG — extract frames via vision pipeline + audio via STT, index into knowledge base.
author: Wunderland
namespace: wunderland
category: productivity
tags: [video, ffmpeg, frames, transcription, multimodal, RAG]
requires_secrets: []
requires_tools: []
metadata:
  agentos:
    emoji: "\U0001F3AC"
---

# Video Ingestion for Multimodal RAG

Use this skill when the user wants to index video content into the agent's knowledge base so it can be searched and recalled during conversation.

Video ingestion works through the `MultimodalMemoryBridge`, which orchestrates two parallel extraction pipelines and feeds the results into both the RAG vector store and (optionally) cognitive memory.

## How It Works

1. **Frame extraction** — ffmpeg samples frames at a configurable interval (default: 1 frame every 5 seconds). Each frame is passed to a vision-capable LLM (e.g. GPT-4o) which generates a text description. That description is embedded and indexed into the vector store with `modality: 'image'` metadata.

2. **Audio extraction** — ffmpeg demuxes the audio track and pipes it to the configured STT provider (e.g. Whisper). The resulting transcript is chunked, embedded, and indexed with `modality: 'audio'` metadata.

3. **Memory traces** — When cognitive memory is enabled, the bridge encodes both visual descriptions and audio transcript chunks as memory traces so the agent can recall video content during future conversations.
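
The two extraction steps above can be sketched as ffmpeg argument builders. These helpers are purely illustrative: the flag choices match the behavior described here, but the bridge's actual internal invocations are not documented.

```typescript
// Illustrative only: ffmpeg argument lists matching the pipeline described above.
function frameExtractionArgs(input: string, intervalSeconds = 5): string[] {
  // fps=1/N keeps one frame every N seconds
  return ["-i", input, "-vf", `fps=1/${intervalSeconds}`, "frames/frame_%04d.jpg"];
}

function audioExtractionArgs(input: string): string[] {
  // -vn drops the video stream; 16 kHz mono WAV is a common STT input format
  return ["-i", input, "-vn", "-ac", "1", "-ar", "16000", "audio.wav"];
}
```

Either list would be passed to an `ffmpeg` process spawn; the point is that frame sampling and audio demuxing are two independent ffmpeg passes over the same input.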

## When to Ingest Video vs. Just Extract Audio

- **Ingest full video** when visual content matters: tutorials, screen recordings, product demos, surveillance footage, presentations with slides, anything where "what is shown" conveys information the transcript alone misses.
- **Extract audio only** when the video is essentially a podcast, voice memo, meeting recording, or phone call where the visual track adds no information. Audio-only ingestion is faster, cheaper (no vision LLM calls), and produces a smaller index footprint.

If you are unsure, prefer full video ingestion. Frame extraction is lightweight and the vision descriptions are short — the marginal cost is small compared to the value of not losing visual context.

## Prerequisites

- **ffmpeg** must be installed and on the system PATH. The bridge shells out to `ffmpeg` for frame and audio extraction. Without it, video ingestion will fail with a clear error.
- A **vision-capable LLM** must be configured (`OPENAI_API_KEY` or equivalent) for frame description.
- An **STT provider** must be configured for audio transcription.
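
A minimal preflight check for the ffmpeg requirement, assuming a Node.js runtime (this helper is a sketch, not part of the bridge API):

```typescript
import { execFile } from "node:child_process";

// Resolves true if an `ffmpeg` binary is on PATH, false otherwise.
function ffmpegAvailable(): Promise<boolean> {
  return new Promise((resolve) => {
    execFile("ffmpeg", ["-version"], (err) => resolve(!err));
  });
}
```

Running a check like this before calling `ingestVideo()` lets an agent fail fast with a friendlier message than the extraction error.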

## Usage

Video ingestion is triggered through the `MultimodalMemoryBridge.ingestVideo()` method. When using the HTTP API, POST the video file to:

```
POST /api/agentos/rag/multimodal/documents/ingest
Content-Type: multipart/form-data
```

with the video file in the `document` field. The system auto-detects video MIME types and routes to the video pipeline.

Programmatic usage:

```typescript
import { MultimodalMemoryBridge } from 'agentos/rag/multimodal';

// `bridge` is assumed to be an initialized MultimodalMemoryBridge instance.
await bridge.ingestVideo(videoBuffer, {
  source: 'user-upload',
  fileName: 'meeting-2024-03-15.mp4',
  extractFrames: true,      // default: true
  frameIntervalSeconds: 10, // sample 1 frame every 10s (default: 5)
  language: 'en',           // STT language hint
});
```

## Configuration Options

| Option | Default | Description |
|--------|---------|-------------|
| `extractFrames` | `true` | Set `false` for audio-only ingestion |
| `frameIntervalSeconds` | `5` | Seconds between sampled frames |
| `language` | auto-detect | BCP-47 language code for STT |
| `collection` | `'multimodal'` | Target vector store collection |
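
Combining these options, audio-only ingestion of a meeting recording into a custom collection could look like this sketch (the wrapper function and its structural parameter type are illustrative; the option names are from the table above):

```typescript
// Audio-only ingestion: no frame sampling, so no vision LLM calls are made.
// The structural type only captures the call shape used in this example.
async function ingestMeetingAudio(
  bridge: { ingestVideo(video: Buffer, opts: object): Promise<unknown> },
  video: Buffer,
): Promise<unknown> {
  return bridge.ingestVideo(video, {
    source: "user-upload",
    fileName: "standup.mp4",
    extractFrames: false,   // skip frames entirely
    collection: "meetings", // override the default 'multimodal' collection
  });
}
```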

## Examples

- "Ingest this tutorial video so I can search it later."
- "Extract the audio from this meeting recording and add it to my knowledge base."
- "Index this product demo video — I need to reference the UI screenshots shown at 2:30."
- "Process all MP4 files in this folder and make them searchable."

## Constraints

- ffmpeg must be installed. The system does not bundle or auto-install it.
- Long videos (>1 hour) produce many frames; consider increasing `frameIntervalSeconds` to 15-30 for very long content.
- Vision LLM calls are billed per frame. A 1-hour video at the default 5-second interval generates ~720 frames.
- Supported container formats: MP4, MKV, WebM, AVI, MOV (anything ffmpeg can demux).
- Video ingestion is not real-time; expect processing time proportional to video length.
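
The per-frame billing above is easy to estimate up front. A small helper (not part of the library):

```typescript
// One sampled frame, and thus one vision LLM call, per interval.
function estimateFrameCount(durationSeconds: number, frameIntervalSeconds = 5): number {
  return Math.ceil(durationSeconds / frameIntervalSeconds);
}

estimateFrameCount(3600);     // 1-hour video at the default interval: 720 frames
estimateFrameCount(3600, 15); // widening the interval to 15s: 240 frames
```

This is where the ~720-frame figure above comes from: 3600 seconds at one frame per 5 seconds.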