@framers/agentos-skills-registry 0.13.0 → 0.15.0
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
package/package.json
CHANGED

@@ -0,0 +1,225 @@
---
name: audio-generation
version: '1.0.0'
description: Music and sound effects generation — 8 providers with fallback chains, user-configurable preferences, local and cloud options.
author: Wunderland
namespace: wunderland
category: media
tags: [audio, music, sound-effects, sfx, generation, suno, elevenlabs, stable-audio, musicgen]
requires_secrets: []
requires_tools: []
metadata:
  agentos:
    emoji: "\U0001F3B5"
---

# Audio Generation (Music & Sound Effects)

Use this skill when the user wants to generate music compositions or sound effects from text descriptions. The system supports 8 provider backends with automatic fallback chains and user-configurable provider preferences.

This skill covers two complementary APIs:

1. **generateMusic()** — Full-length musical compositions from text prompts
2. **generateSFX()** — Short sound effects from text descriptions

## Music Generation

### Basic Usage

Generate music from a text prompt. The system auto-detects the best available provider from environment variables in priority order: `SUNO_API_KEY` (highest quality) -> `UDIO_API_KEY` -> `STABILITY_API_KEY` -> `REPLICATE_API_TOKEN` -> `FAL_API_KEY` -> local MusicGen (no key required).

```typescript
import { generateMusic } from 'agentos';

const result = await generateMusic({
  prompt: 'Upbeat lo-fi hip hop beat with vinyl crackle and mellow piano',
  durationSec: 60,
});
console.log(result.audio[0].url);
```

### Prompt Tips for Music

- **Specify genre and mood first**: "melancholic jazz ballad", "aggressive drum and bass", "peaceful ambient soundscape"
- **Include instrumentation**: "with acoustic guitar, soft brushed drums, and upright bass"
- **Mention tempo and energy**: "slow tempo, 70 BPM", "high energy, driving rhythm"
- **Add texture and production**: "lo-fi with vinyl crackle", "clean studio recording", "reverb-heavy shoegaze"
- **Reference eras or styles**: "1970s progressive rock", "modern trap production", "classical baroque harpsichord"
- **Use negative prompts** where supported: `negativePrompt: 'vocals, singing, lyrics'` (see the combined sketch below)
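
A sketch that folds several of these tips into one request:

```typescript
// Sketch: genre/mood first, then instrumentation, tempo, production
// texture, plus a negative prompt to keep the track instrumental.
const ballad = await generateMusic({
  prompt:
    'Melancholic jazz ballad with acoustic guitar, soft brushed drums, ' +
    'and upright bass, slow tempo, 70 BPM, clean studio recording',
  negativePrompt: 'vocals, singing, lyrics',
  durationSec: 90,
});
console.log(ballad.audio[0].url);
```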

### Music Options

| Option | Default | Description |
|--------|---------|-------------|
| `prompt` | (required) | Text description of the desired music |
| `provider` | auto-detect | Provider ID (`"suno"`, `"udio"`, `"stable-audio"`, etc.) |
| `model` | provider default | Model identifier within the provider |
| `durationSec` | provider default | Output duration in seconds (Suno: up to ~240s, Stable Audio: ~47s) |
| `negativePrompt` | — | Musical elements to avoid (not all providers support this) |
| `outputFormat` | `"mp3"` | Output format: `"mp3"`, `"wav"`, `"flac"`, `"ogg"`, `"aac"` |
| `seed` | random | Seed for reproducible output (provider-dependent) |
| `n` | `1` | Number of clips to generate |
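
A sketch combining several of these options; exact support for `seed` and `n` varies by provider, per the table:

```typescript
// Sketch: request three reproducible WAV takes of the same idea.
const takes = await generateMusic({
  prompt: 'Driving synthwave with arpeggiated bass, 110 BPM',
  outputFormat: 'wav',
  seed: 42,        // reproducible where the provider supports seeding
  n: 3,            // three candidate clips
  durationSec: 45,
});
for (const clip of takes.audio) console.log(clip.url);
```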

## Sound Effect Generation

### Basic Usage

Generate a sound effect from a text description. The SFX detection order is: `ELEVENLABS_API_KEY` (highest quality) -> `STABILITY_API_KEY` -> `REPLICATE_API_TOKEN` -> `FAL_API_KEY` -> local AudioGen (no key required).

```typescript
import { generateSFX } from 'agentos';

const result = await generateSFX({
  prompt: 'Thunder crack followed by heavy rain on a tin roof',
  durationSec: 5,
});
console.log(result.audio[0].url);
```

### Prompt Tips for Sound Effects

- **Be specific about the sound**: "glass bottle shattering on concrete floor" rather than just "glass breaking"
- **Describe the environment**: "footsteps on gravel in an empty parking garage with echo"
- **Layer multiple sounds**: "busy city intersection with car horns, distant sirens, and pedestrian chatter"
- **Specify duration context**: short stingers (1-3s) vs ambient loops (10-15s)
- **Include physical properties**: "heavy wooden door creaking open slowly", "small metallic click of a light switch"

### SFX Options

| Option | Default | Description |
|--------|---------|-------------|
| `prompt` | (required) | Text description of the desired sound effect |
| `provider` | auto-detect | Provider ID (`"elevenlabs-sfx"`, `"stable-audio"`, etc.) |
| `model` | provider default | Model identifier within the provider |
| `durationSec` | provider default | Output duration in seconds (SFX is typically 1-15s) |
| `outputFormat` | `"mp3"` | Output format: `"mp3"`, `"wav"`, `"flac"`, `"ogg"`, `"aac"` |
| `seed` | random | Seed for reproducible output (provider-dependent) |
| `n` | `1` | Number of clips to generate |
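
A sketch using the table's options to generate several candidate takes of one effect:

```typescript
// Sketch: three short candidate chimes; pick the best by ear.
const chimes = await generateSFX({
  prompt: 'Soft two-note notification chime, glassy timbre',
  durationSec: 2,
  n: 3,
  outputFormat: 'wav',
});
chimes.audio.forEach((clip, i) => console.log(`Candidate ${i + 1}:`, clip.url));
```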

## Provider Selection Guide

### Music Providers

| Provider | ID | Best For | Env Var | Key Required |
|----------|-----|----------|---------|-------------|
| **Suno** | `suno` | Highest quality vocals + instrumentals, full songs | `SUNO_API_KEY` | Yes |
| **Udio** | `udio` | High quality music, alternative to Suno | `UDIO_API_KEY` | Yes |
| **Stable Audio** | `stable-audio` | Instrumentals, loops, ambient, fast generation | `STABILITY_API_KEY` | Yes |
| **Replicate** | `replicate-audio` | Open-source models (MusicGen), pay-per-use | `REPLICATE_API_TOKEN` | Yes |
| **Fal** | `fal-audio` | Fast serverless GPU, cost-effective | `FAL_API_KEY` | Yes |
| **MusicGen Local** | `musicgen-local` | Offline generation, no API key needed, privacy | — | No |

### SFX Providers

| Provider | ID | Best For | Env Var | Key Required |
|----------|-----|----------|---------|-------------|
| **ElevenLabs** | `elevenlabs-sfx` | Highest quality SFX, fast turnaround | `ELEVENLABS_API_KEY` | Yes |
| **Stable Audio** | `stable-audio` | Good SFX + music in one provider | `STABILITY_API_KEY` | Yes |
| **Replicate** | `replicate-audio` | Open-source AudioGen model, pay-per-use | `REPLICATE_API_TOKEN` | Yes |
| **Fal** | `fal-audio` | Fast serverless GPU | `FAL_API_KEY` | Yes |
| **AudioGen Local** | `audiogen-local` | Offline SFX generation, no API key needed | — | No |

### Forcing a Specific Provider

```typescript
const result = await generateMusic({
  prompt: 'Chill synthwave with arpeggiated synths',
  provider: 'stable-audio',
  apiKey: 'your-stability-key',
  durationSec: 30,
});
```

## Provider Preferences

Use `providerPreferences` to control the fallback chain ordering and filtering without hardcoding a single provider. This is useful for load balancing, cost optimization, or respecting user preferences.

```typescript
import { generateMusic } from 'agentos';

// Prefer Suno, fall back to Stable Audio, never use Udio
const result = await generateMusic({
  prompt: 'Orchestral film score with dramatic strings',
  providerPreferences: {
    preferred: ['suno', 'stable-audio'],
    blocked: ['udio'],
  },
});
```

### Preference Fields

| Field | Description |
|-------|-------------|
| `preferred` | Ordered list of provider IDs to try first. Providers not in this list are excluded. |
| `blocked` | Provider IDs to unconditionally exclude from the chain. |
| `weights` | Weight map for weighted random selection (useful for A/B testing or load balancing). |

Provider preferences work identically across `generateMusic()`, `generateSFX()`, `generateImage()`, and `generateVideo()`.
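
The `weights` field is the one not exercised above; a sketch, assuming relative weights keyed by provider ID:

```typescript
// Sketch: split traffic roughly 80/20 between Suno and Stable Audio.
// Assumes weights are relative values; exact normalization is internal
// to the provider chain.
const result = await generateMusic({
  prompt: 'Warm acoustic folk instrumental',
  providerPreferences: {
    weights: { suno: 0.8, 'stable-audio': 0.2 },
  },
});
```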

## When to Use Music vs SFX vs TTS

| Need | API | Why |
|------|-----|-----|
| Background music, songs, jingles | `generateMusic()` | Optimized for musical compositions with melody, harmony, rhythm |
| Sound effects, foley, ambient sounds | `generateSFX()` | Optimized for short, non-musical audio (impacts, nature, UI sounds) |
| Speech, narration, voice cloning | TTS (speech subsystem) | Use the speech/TTS APIs instead — audio generation is for non-speech |
| Podcast intros with music + voice | Combine both | Generate music with `generateMusic()`, speech with TTS, mix externally |

## Combining Audio

The audio generation APIs return URLs or base64 data that can be combined in downstream workflows:

1. **Generate background music**: `generateMusic({ prompt: 'Gentle ambient pad' })`
2. **Generate SFX stingers**: `generateSFX({ prompt: 'Notification chime' })`
3. **Generate speech**: Use the TTS subsystem for narration
4. **Mix**: Use ffmpeg or a Web Audio API pipeline to layer the tracks (an ffmpeg sketch follows the example below)

```typescript
import { generateMusic, generateSFX } from 'agentos';

// Generate assets in parallel
const [music, sfx] = await Promise.all([
  generateMusic({ prompt: 'Calm podcast background music', durationSec: 120 }),
  generateSFX({ prompt: 'Soft transition whoosh', durationSec: 2 }),
]);

// Use the URLs/base64 data in your mixing pipeline
console.log('Music:', music.audio[0].url);
console.log('SFX:', sfx.audio[0].url);
```
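
For step 4, one option is to shell out to ffmpeg's `amix` filter once the clips are downloaded; a minimal sketch (file names are placeholders):

```typescript
import { execFile } from 'node:child_process';
import { promisify } from 'node:util';

const run = promisify(execFile);

// Sketch: layer the two downloaded clips with ffmpeg's amix filter.
// 'music.mp3' and 'sfx.mp3' are placeholder paths for the fetched assets.
await run('ffmpeg', [
  '-y',
  '-i', 'music.mp3',
  '-i', 'sfx.mp3',
  '-filter_complex', 'amix=inputs=2:duration=first',
  'mixed.mp3',
]);
```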

## Local Providers (No API Key)

Both MusicGen and AudioGen can run locally without any API keys using HuggingFace Transformers.js. The models are downloaded on first use and cached locally.

**Requirements:**
- `@huggingface/transformers` must be installed as a peer dependency
- Sufficient RAM for model inference (MusicGen Small ~1GB, AudioGen Medium ~2GB)

```typescript
// Explicitly use local generation
const result = await generateMusic({
  prompt: 'Simple piano melody',
  provider: 'musicgen-local',
});
```

Local providers are automatically used as the last fallback when no cloud API keys are configured.

## Prerequisites

- At least one audio provider API key for cloud generation, OR `@huggingface/transformers` for local generation
- For music: `SUNO_API_KEY`, `UDIO_API_KEY`, `STABILITY_API_KEY`, `REPLICATE_API_TOKEN`, or `FAL_API_KEY`
- For SFX: `ELEVENLABS_API_KEY`, `STABILITY_API_KEY`, `REPLICATE_API_TOKEN`, or `FAL_API_KEY`

## Examples

- "Generate a 60-second lo-fi hip hop beat for a study playlist"
- "Create a thunder and rain sound effect for my podcast intro"
- "Make upbeat electronic music for a product demo video"
- "Generate a notification chime sound effect"
- "Create ambient forest sounds with birds and a gentle stream"
- "Generate a dramatic orchestral score for a trailer"
- "Make a retro 8-bit video game soundtrack"
- "Create footstep sounds on different surfaces — wood, gravel, snow"

@@ -0,0 +1,210 @@
---
name: video-generation
version: '1.0.0'
description: Video generation, analysis, and scene detection — text-to-video, image-to-video, structured scene descriptions with RAG indexing, and general-purpose visual change detection.
author: Wunderland
namespace: wunderland
category: media
tags: [video, generation, analysis, scene-detection, RAG, multimodal, runway, replicate, fal]
requires_secrets: []
requires_tools: []
metadata:
  agentos:
    emoji: "\U0001F3AC"
---

# Video Generation, Analysis & Scene Detection

Use this skill when the user wants to create AI-generated videos, analyse existing video content for structured scene descriptions, or detect visual changes in live/recorded frame streams.

This skill covers three complementary APIs:

1. **generateVideo()** — Text-to-video and image-to-video generation
2. **analyzeVideo()** — Structured video analysis with scene descriptions, transcription, and optional RAG indexing
3. **detectScenes()** — Real-time or batch scene boundary detection from frame streams

## Video Generation

### Text-to-Video

Generate a video from a text prompt. The system auto-detects the best available provider from environment variables in priority order: `RUNWAY_API_KEY` (highest quality), `REPLICATE_API_TOKEN` (widest model variety), `FAL_API_KEY` (fast serverless GPU).

```typescript
import { generateVideo } from 'agentos';

const result = await generateVideo({
  prompt: 'A drone flying over a misty forest at sunrise, cinematic 4K',
  durationSec: 5,
  aspectRatio: '16:9',
});
console.log(result.videos[0].url);
```

### Image-to-Video

Animate a still image by providing it as a Buffer via `opts.image`. The prompt describes the desired motion rather than the scene itself.

```typescript
import { generateVideo } from 'agentos';
import { readFileSync } from 'fs';

const result = await generateVideo({
  prompt: 'Camera slowly zooms out, gentle wind moves the leaves',
  image: readFileSync('landscape.png'),
  provider: 'runway',
});
```

### Provider Selection

| Provider | Best For | Env Var |
|----------|----------|---------|
| **Runway** | Highest quality, cinematic output, image-to-video | `RUNWAY_API_KEY` |
| **Replicate** | Widest model variety (Kling, HunyuanVideo, MiniMax), open-source models | `REPLICATE_API_TOKEN` |
| **Fal** | Fast serverless GPU, cost-effective, Kling/CogVideo | `FAL_API_KEY` |

When multiple provider API keys are set, the system wraps the primary in a `FallbackVideoProxy` so a transient failure on one provider automatically retries on the next.

To force a specific provider:

```typescript
const result = await generateVideo({
  prompt: 'A cat playing piano',
  provider: 'replicate',
  model: 'klingai/kling-v1',
  apiKey: 'your-replicate-token',
});
```

### Prompt Tips for Video

- **Be specific about motion**: "camera pans left to right", "person walks toward camera", "time-lapse of clouds moving"
- **Specify style early**: "cinematic 4K", "hand-drawn animation", "vintage film grain"
- **Keep prompts concise**: Video models respond best to clear, focused descriptions (1-3 sentences)
- **Use negative prompts** to avoid unwanted artifacts: `negativePrompt: 'blurry, distorted faces, watermark'` (see the sketch below)
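
Putting the tips together, a minimal sketch using the same `generateVideo()` call as above:

```typescript
// Sketch: concise, motion-first prompt plus a negative prompt.
const result = await generateVideo({
  prompt: 'Time-lapse of clouds moving over a mountain ridge, cinematic 4K',
  negativePrompt: 'blurry, distorted faces, watermark',
  durationSec: 5,
});
```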

### Image-to-Video Motion Strength

When doing image-to-video, the prompt controls how much the image changes:

- **Gentle motion**: "subtle camera drift", "soft wind blowing through hair" — minimal departure from source
- **Moderate motion**: "person turns head and smiles", "camera orbits subject" — clear movement while preserving subject
- **Strong motion**: "explosion of confetti", "character runs toward camera" — significant scene change

Motion strength interpretation varies by provider. Runway tends to be conservative (good for preserving the source image), while Replicate/Fal models may be more aggressive. Start with gentle prompts and increase intensity.

## Video Analysis

### Structured Scene Analysis

Analyse a video to extract structured scene descriptions, detected objects, on-screen text, and optional audio transcription.

```typescript
import { analyzeVideo } from 'agentos';

const analysis = await analyzeVideo({
  videoUrl: 'https://example.com/product-demo.mp4',
  prompt: 'Identify all products shown and their key features',
  transcribeAudio: true,
  descriptionDetail: 'detailed',
});

console.log(analysis.description);
for (const scene of analysis.scenes ?? []) {
  console.log(`[${scene.startSec}s - ${scene.endSec}s] ${scene.description}`);
}
```

### RAG Integration

Enable `indexForRAG: true` to automatically index scene descriptions and transcripts into the vector store for later retrieval. This is especially useful for building searchable video libraries.

```typescript
const analysis = await analyzeVideo({
  videoBuffer: videoData,
  indexForRAG: true,
  descriptionDetail: 'detailed',
  transcribeAudio: true,
});

// Scene descriptions and transcripts are now searchable via RAG
console.log(`Indexed ${analysis.ragChunkIds?.length ?? 0} chunks`);
```

Each scene description becomes a separate vector chunk with metadata including timestamps, scene index, and cut type. This enables queries like "find the part where the presenter shows the pricing slide" to return precise timestamp ranges.
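
A retrieval sketch; `searchRAG` and its result shape are hypothetical stand-ins, since this document doesn't name the vector-store query API:

```typescript
// Hypothetical query helper; substitute your actual vector-store API.
const hits = await searchRAG('presenter shows the pricing slide', { topK: 3 });

for (const hit of hits) {
  // Timestamp metadata comes from the per-scene chunks described above.
  console.log(`${hit.metadata.startSec}s-${hit.metadata.endSec}s:`, hit.text);
}
```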

### Analysis Options

| Option | Default | Description |
|--------|---------|-------------|
| `sceneThreshold` | `0.3` | Scene change sensitivity (0-1, lower = more scenes) |
| `transcribeAudio` | `true` | Transcribe audio via configured STT provider |
| `descriptionDetail` | `'detailed'` | `'brief'`, `'detailed'`, or `'exhaustive'` |
| `maxScenes` | `100` | Cap on detected scenes (prevents runaway on long videos) |
| `indexForRAG` | `false` | Index results into RAG vector store |
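
For long recordings, a sketch that trades detail for speed, using only the options in the table above:

```typescript
// Sketch: coarser scene splits, brief descriptions, hard cap on scenes.
const analysis = await analyzeVideo({
  videoUrl: 'https://example.com/all-hands-recording.mp4',
  sceneThreshold: 0.5,        // higher threshold -> fewer, larger scenes
  descriptionDetail: 'brief',
  maxScenes: 40,
  transcribeAudio: false,     // skip STT when only visuals matter
});
```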

## Scene Detection

### Live Stream / Batch Detection

Use `detectScenes()` for real-time visual change detection on frame streams. Returns an AsyncGenerator that yields `SceneBoundary` objects as visual discontinuities are detected.

```typescript
import { detectScenes } from 'agentos';

// From a pre-recorded video (frames extracted via ffmpeg)
for await (const boundary of detectScenes({ frames: extractedFrameStream })) {
  console.log(`Scene ${boundary.index} at ${boundary.startTimeSec}s`);
  console.log(`  Type: ${boundary.cutType}, Confidence: ${boundary.confidence}`);
}
```

### Use Cases

- **Webcam / security camera**: Detect motion or scene changes in real-time surveillance feeds
- **Screen recording**: Identify slide transitions in presentations, page changes in demos
- **Video editing**: Automatically segment raw footage at cut points
- **Content moderation**: Flag rapid scene changes that may indicate problematic content

### Configuration

```typescript
for await (const boundary of detectScenes({
  frames: webcamStream,
  hardCutThreshold: 0.4,    // Less sensitive to hard cuts
  gradualThreshold: 0.15,   // Standard sensitivity for dissolves/fades
  minSceneDurationSec: 2.0, // Suppress very short scenes
  methods: ['histogram'],   // Fast histogram-only detection
})) {
  handleSceneChange(boundary);
}
```

### Cut Type Classification

The detector classifies each scene boundary:

| Cut Type | Description |
|----------|-------------|
| `hard-cut` | Abrupt frame-to-frame change (most common) |
| `dissolve` | Cross-dissolve / superimposition transition |
| `fade` | Fade from/to black or white |
| `gradual` | Other gradual visual change |
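
Downstream logic can branch on the classification; a sketch (the two handlers are hypothetical placeholders):

```typescript
// Sketch: treat fades as chapter breaks, everything else as plain cuts.
// markChapterBreak / markCutPoint are hypothetical downstream handlers.
for await (const boundary of detectScenes({ frames: extractedFrameStream })) {
  if (boundary.cutType === 'fade') {
    markChapterBreak(boundary.startTimeSec);
  } else {
    markCutPoint(boundary.startTimeSec); // hard-cut, dissolve, or gradual
  }
}
```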

## Prerequisites

- At least one video provider API key for generation (`RUNWAY_API_KEY`, `REPLICATE_API_TOKEN`, or `FAL_API_KEY`)
- **ffmpeg** on PATH for video analysis (frame extraction and audio demuxing)
- A vision-capable LLM (`OPENAI_API_KEY` or equivalent) for scene description
- An STT provider for audio transcription (when `transcribeAudio` is enabled)

Scene detection (`detectScenes()`) has zero external dependencies — it works purely on RGB pixel buffers.

## Examples

- "Generate a 5-second cinematic video of a sunset over the ocean"
- "Turn this product photo into a video with a slow camera orbit"
- "Analyse this tutorial video and index it for search"
- "Detect scene changes in this security camera feed"
- "Extract structured scenes from this presentation recording"
- "Create a video from this image with gentle parallax motion"

@@ -1,22 +1,82 @@
---
name: vision-ocr
version: '1.1.0'
description: Extract text from images using OCR and vision AI with the performOCR() high-level API or the full VisionPipeline.
author: Wunderland
namespace: wunderland
category: vision
tags: [vision, ocr, text-extraction, document, handwriting]
requires_secrets: []
requires_tools: [vision-pipeline]
---

# Vision & OCR

Extract text from images, documents, and handwritten notes using a progressive 3-tier pipeline: local OCR (PaddleOCR / Tesseract) -> local vision models (TrOCR, Florence-2) -> cloud vision LLM (GPT-4o, Claude, Gemini).

## High-Level API: `performOCR()`

For one-shot text extraction, use the top-level `performOCR()` function. It handles input resolution, pipeline lifecycle, and cleanup automatically.

```typescript
import { performOCR } from '@framers/agentos';

const result = await performOCR({
  image: '/path/to/receipt.png', // file path, URL, base64, or Buffer
  strategy: 'progressive',       // 'progressive' | 'local-only' | 'cloud-only'
  confidenceThreshold: 0.7,      // min confidence before escalating tier
});

console.log(result.text);       // extracted text
console.log(result.confidence); // 0–1 score
console.log(result.tier);       // 'ocr' | 'handwriting' | 'document-ai' | 'cloud-vision'
console.log(result.provider);   // 'paddle' | 'tesseract' | 'openai' | etc.
console.log(result.regions);    // bounding boxes (when available)
```

## When to use `performOCR()` vs `VisionPipeline`

| Use case | Recommendation |
|----------|---------------|
| One-shot text extraction from a single image | `performOCR()` — simplest API |
| Batch processing many images | `VisionPipeline` — create once, reuse, dispose when done |
| Need CLIP embeddings or document layout | `VisionPipeline` — richer result shape |
| Quick scripts and integrations | `performOCR()` — zero boilerplate |

## Progressive Tier System

The pipeline tries the cheapest/fastest tier first and only escalates when confidence is below threshold:

1. **Tier 1 — Local OCR** (PaddleOCR or Tesseract.js): Fast, free, offline. Handles printed text in documents, receipts, screenshots.
2. **Tier 2 — Local Vision Models** (TrOCR / Florence-2): Still offline. Handles handwritten notes, complex document layouts with tables and figures.
3. **Tier 3 — Cloud Vision LLM** (GPT-4o / Claude / Gemini): Best quality. Handles photographs, diagrams, mixed content, anything the local tiers can't confidently read.

## Strategy Selection

- **`'progressive'`** (default): Start local, escalate only if needed. Best cost/quality balance for most use cases.
- **`'local-only'`**: Never call cloud APIs. Use for air-gapped environments, privacy-sensitive data (medical records, financial docs), or when no API keys are available. (See the sketch below.)
- **`'cloud-only'`**: Skip local tiers entirely, send straight to a cloud vision LLM. Use when you need the highest quality output and cost is not a concern.
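
A minimal sketch of the privacy-preserving path, using only the options already shown (the file path is a placeholder):

```typescript
// Sketch: air-gapped extraction; nothing leaves the machine.
const local = await performOCR({
  image: '/secure/medical-intake-form.png',
  strategy: 'local-only',
});
console.log(local.text, `(tier: ${local.tier}, provider: ${local.provider})`);
```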

## Input Formats

`performOCR()` accepts four input types:

- **File path**: `'/tmp/scan.png'` — reads from disk
- **URL**: `'https://example.com/receipt.jpg'` — fetches via HTTP
- **Base64 string**: Raw base64 or `data:image/png;base64,...` data URIs — decoded in-memory
- **Buffer**: Raw image bytes — passed directly to the pipeline (see the sketch below)
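
For example, bytes already in memory can go straight in; a sketch:

```typescript
import { performOCR } from '@framers/agentos';
import { readFile } from 'node:fs/promises';

// Sketch: Buffer input; no temp files or re-encoding needed.
const bytes = await readFile('/tmp/scan.png');
const result = await performOCR({ image: bytes });
console.log(result.text);
```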

## Capabilities

- **Printed text OCR**: Extract text from documents, receipts, screenshots, PDFs
- **Handwriting recognition**: Read handwritten notes and forms via TrOCR
- **Document layout understanding**: Parse tables, figures, headings via Florence-2
- **Bounding box regions**: Spatial text locations for overlay rendering
- **Image embeddings**: Generate CLIP vectors for semantic image search (via `VisionPipeline` only)

## Examples

- "Read the text from this receipt"
- "What does this handwritten note say?"
- "Extract the table data from this PDF page"
- "OCR this screenshot and return the error message"