@framers/agentos-skills-registry 0.10.0 → 0.12.0
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
package/package.json
CHANGED
@@ -1,12 +1,12 @@
 ---
 name: image-gen
-version: '
-description: Generate
+version: '2.0.0'
+description: Generate, edit, upscale, and variate images using the AgentOS multi-provider image pipeline with automatic fallback.
 author: Wunderland
 namespace: wunderland
 category: creative
-tags: [image-generation, ai-art, dall-e, stable-diffusion, creative, visual]
-requires_secrets: [
+tags: [image-generation, ai-art, dall-e, stable-diffusion, flux, replicate, stability, fal, creative, visual]
+requires_secrets: []
 requires_tools: [generate_image]
 metadata:
   agentos:
@@ -15,36 +15,89 @@ metadata:
     homepage: https://platform.openai.com/docs/guides/images
 ---
 
-# AI Image Generation
+# AI Image Generation Workflow
 
-Use the
-
-
+Use this skill when the user wants to create, edit, upscale, or create variations of images. AgentOS provides four high-level APIs that route to any configured provider with automatic fallback when multiple providers have credentials set.
+
+## The Four High-Level APIs
+
+1. **`generateImage()`** — Create new images from text prompts.
+2. **`editImage()`** — Transform existing images via img2img, inpainting, or outpainting.
+3. **`upscaleImage()`** — Increase resolution (2x or 4x super-resolution).
+4. **`variateImage()`** — Generate visual variations of an existing image.
 
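The four calls above might look roughly like this in TypeScript. This is an editor's sketch only: the option and result shapes are assumptions for illustration, not the package's published types.

```typescript
// Editor's sketch: these option/result shapes are assumptions,
// not the package's actual type definitions.
type ImageResult = { provider: string; url: string };

interface ImageApi {
  generateImage(opts: { prompt: string; size?: string; n?: number }): Promise<ImageResult[]>;
  editImage(image: Uint8Array, opts: { prompt: string; mode: "img2img" | "inpaint" | "outpaint" }): Promise<ImageResult[]>;
  upscaleImage(image: Uint8Array, opts: { scale: 2 | 4 }): Promise<ImageResult>;
  variateImage(image: Uint8Array, opts: { n: number }): Promise<ImageResult[]>;
}

// Minimal in-memory stub demonstrating the call shapes.
const stub: ImageApi = {
  async generateImage({ prompt, n = 1 }) {
    return Array.from({ length: n }, (_, i) => ({
      provider: "stub",
      url: `stub://gen/${i}?p=${encodeURIComponent(prompt)}`,
    }));
  },
  async editImage(_image, { mode }) {
    return [{ provider: "stub", url: `stub://edit/${mode}` }];
  },
  async upscaleImage(_image, { scale }) {
    return { provider: "stub", url: `stub://upscale/${scale}x` };
  },
  async variateImage(_image, { n }) {
    return Array.from({ length: n }, (_, i) => ({ provider: "stub", url: `stub://var/${i}` }));
  },
};
```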
 If the `generate_image` tool is not loaded, enable it with `extensions_enable image-generation`.
 
-
+## Provider Selection Guide
+
+Choose the provider based on the user's priority:
+
+| Priority | Provider | Env Var | Best For |
+|----------|----------|---------|----------|
+| Quality | **OpenAI** (GPT-Image-1, DALL-E 3) | `OPENAI_API_KEY` | Highest fidelity, prompt adherence, text-in-image |
+| Control | **Stability AI** (SDXL, SD3, Ultra) | `STABILITY_API_KEY` | Negative prompts, style presets, cfg/steps tuning |
+| Speed | **BFL / Flux** (Flux Pro 1.1) | `BFL_API_KEY` | Fast generation with strong quality |
+| Speed | **Fal** (Flux Dev) | `FAL_API_KEY` | Serverless Flux inference, low latency |
+| Variety | **Replicate** (Flux, SDXL, community models) | `REPLICATE_API_TOKEN` | Access to thousands of community models |
+| Cost | **OpenRouter** (routes to cheapest) | `OPENROUTER_API_KEY` | Provider-agnostic routing, best price |
+| Privacy | **Local SD** (A1111 / ComfyUI) | `STABLE_DIFFUSION_LOCAL_BASE_URL` | Fully offline, no data leaves the machine |
+
+When multiple providers are configured, AgentOS wraps them in a **FallbackImageProxy** — if the primary provider fails (rate limit, outage, etc.), the request automatically retries on the next available provider in priority order.
+
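The fallback behavior can be sketched as a small wrapper. This is assumed semantics for illustration; the real `FallbackImageProxy` class is internal to AgentOS and its API may differ.

```typescript
// Illustrative fallback wrapper: try providers in priority order,
// return the first success, rethrow the last error if all fail.
type Generate = (prompt: string) => Promise<string>;

function withFallback(providers: { name: string; generate: Generate }[]): Generate {
  return async (prompt) => {
    let lastError: unknown;
    for (const p of providers) {
      try {
        return await p.generate(prompt); // first success wins
      } catch (err) {
        lastError = err; // rate limit, outage, etc. -> try the next provider
      }
    }
    throw lastError; // every provider failed
  };
}
```

Note that, as the Constraints section states, this chain only activates on failure; it never merges results from multiple providers.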
47
|
+
## Operation Decision Tree
|
|
48
|
+
|
|
49
|
+
Use this to pick the right API for the user's request:
|
|
50
|
+
|
|
51
|
+
- **"Generate / create / draw / imagine"** -> `generateImage()`
|
|
52
|
+
- **"Edit / change / modify / transform"** -> `editImage()` with `mode: 'img2img'`
|
|
53
|
+
- **"Remove / fill in / fix this area"** -> `editImage()` with `mode: 'inpaint'` + mask
|
|
54
|
+
- **"Extend / expand the borders"** -> `editImage()` with `mode: 'outpaint'`
|
|
55
|
+
- **"Make it higher resolution / sharper"** -> `upscaleImage()` with `scale: 2` or `4`
|
|
56
|
+
- **"Show me variations / alternatives"** -> `variateImage()` with `n: 3-4`
|
|
57
|
+
|
|
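A minimal keyword router sketches this decision tree. The verb lists and the routing function are illustrative only, not part of the package.

```typescript
// Hypothetical router mapping request phrasing to one of the four APIs.
type Operation = "generateImage" | "editImage" | "upscaleImage" | "variateImage";

function routeRequest(request: string): Operation {
  const r = request.toLowerCase();
  // Check the most specific intents first.
  if (/\b(variation|alternative)s?\b/.test(r)) return "variateImage";
  if (/\b(upscale|higher resolution|sharper)\b/.test(r)) return "upscaleImage";
  if (/\b(edit|change|modify|transform|remove|fill|fix|extend|expand)\b/.test(r)) return "editImage";
  return "generateImage"; // "generate / create / draw / imagine" and the default
}
```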
+## Prompt Engineering Tips
+
+A strong image prompt has five components:
+
+1. **Subject** — What is in the image. Be specific: "a red panda sitting on a mossy branch" not "an animal."
+2. **Style** — Artistic approach: photorealistic, watercolor, pixel art, oil painting, vector illustration, cinematic, anime.
+3. **Composition** — Camera angle and framing: close-up portrait, wide establishing shot, overhead flat lay, isometric.
+4. **Lighting and Color** — Mood through light: golden hour, dramatic side-lighting, neon glow, muted earth tones, high contrast.
+5. **Atmosphere** — Emotional tone: serene, ominous, whimsical, nostalgic, futuristic.
 
+Additional tips:
+- Front-load the most important elements. Models weight earlier tokens more heavily.
+- Use negative prompts (Stability, Local SD) to exclude unwanted elements: "no text, no watermark, no blurry."
+- For text-in-image, OpenAI GPT-Image-1 is the most reliable. Other models struggle with legible text.
+- Request `quality: 'hd'` for DALL-E 3 when detail matters (doubles cost).
+- For consistent characters across multiple images, describe the character in detail each time or use img2img with a reference.
 
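The five components compose naturally into a single prompt string, subject first so the most important tokens lead. A toy assembler, with made-up field names:

```typescript
// Illustrative prompt assembler for the five components above;
// not an AgentOS helper, and the field names are invented.
interface PromptParts {
  subject: string; // front-loaded: models weight earlier tokens more heavily
  style?: string;
  composition?: string;
  lighting?: string;
  atmosphere?: string;
}

function buildPrompt(p: PromptParts): string {
  return [p.subject, p.style, p.composition, p.lighting, p.atmosphere]
    .filter((part): part is string => Boolean(part))
    .join(", ");
}
```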
-
+## Sizes and Aspect Ratios
 
-
+| Provider | Supported Sizes | Aspect Ratio Support |
+|----------|----------------|---------------------|
+| OpenAI | 1024x1024, 1792x1024, 1024x1792 | Via size selection |
+| Stability | Flexible | `1:1`, `16:9`, `9:16`, `4:3`, `3:4`, etc. |
+| Replicate/Flux | Flexible | `aspectRatio` parameter |
+| Local SD | Any (multiples of 64) | Via `width`/`height` |
 
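A small validity check can be derived from this table. The function and its provider keys are illustrative, not a package helper.

```typescript
// Hypothetical size check based on the table above.
function isValidSize(provider: string, width: number, height: number): boolean {
  switch (provider) {
    case "openai": {
      // OpenAI accepts exactly these three sizes
      const allowed = ["1024x1024", "1792x1024", "1024x1792"];
      return allowed.includes(`${width}x${height}`);
    }
    case "local-sd":
      // A1111 / ComfyUI expect dimensions in multiples of 64
      return width % 64 === 0 && height % 64 === 0;
    default:
      return true; // Stability / Replicate / Flux accept flexible dimensions
  }
}
```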
 ## Examples
 
-- "Generate
-- "Create a professional logo for a coffee shop called 'Bean There'"
-- "
-- "
-- "
+- "Generate a photorealistic image of a cozy cabin in the mountains at sunset."
+- "Create a professional logo for a coffee shop called 'Bean There' — vector illustration style, clean lines."
+- "Edit this photo: make the sky more dramatic with storm clouds." (img2img)
+- "Remove the person from the background of this product photo." (inpaint + mask)
+- "Upscale this thumbnail to 4x resolution for print."
+- "Show me 3 variations of this hero image with different color palettes."
+- "Generate a 16:9 cinematic landscape of a neon-lit Tokyo street at night in the rain."
 
 ## Constraints
 
 - Image generation costs API credits per request; inform the user of approximate costs when possible.
-- Content policy restrictions apply: no realistic faces of real people, no violent/explicit content.
-- DALL-E 3 does not support
+- Content policy restrictions apply per provider: no realistic faces of real people, no violent/explicit content.
+- DALL-E 3 does not support native inpainting — use GPT-Image-1 or Stability for mask-based editing.
+- Upscaling is not supported by OpenAI or OpenRouter — use Stability, Replicate, or Local SD.
 - Generated images may not perfectly match the prompt; iterative refinement is expected.
-- Maximum prompt length varies by model (DALL-E 3: 4,000
--
--
+- Maximum prompt length varies by model (DALL-E 3: 4,000 chars; Stability: 2,000 chars).
+- Local SD requires a running A1111 or ComfyUI instance with the API enabled.
+- The fallback chain only activates when the primary provider fails; it does not merge results from multiple providers.
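The prompt-length constraint can be enforced with a simple guard. The limit values come from the list above; the function itself is hypothetical, not a package API.

```typescript
// Illustrative per-model prompt limits (characters), from this document.
const PROMPT_LIMITS: Record<string, number> = {
  "dall-e-3": 4000,
  "stability": 2000,
};

function clampPrompt(model: string, prompt: string): string {
  const limit = PROMPT_LIMITS[model];
  if (limit === undefined || prompt.length <= limit) return prompt;
  return prompt.slice(0, limit); // truncate rather than let the provider reject
}
```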
@@ -0,0 +1,90 @@
+---
+name: video-ingestion
+version: '1.0.0'
+description: Video processing for RAG — extract frames via vision pipeline + audio via STT, index into knowledge base.
+author: Wunderland
+namespace: wunderland
+category: productivity
+tags: [video, ffmpeg, frames, transcription, multimodal, RAG]
+requires_secrets: []
+requires_tools: []
+metadata:
+  agentos:
+    emoji: "\U0001F3AC"
+---
+
+# Video Ingestion for Multimodal RAG
+
+Use this skill when the user wants to index video content into the agent's knowledge base so it can be searched and recalled during conversation.
+
+Video ingestion works through the `MultimodalMemoryBridge`, which orchestrates two parallel extraction pipelines and feeds the results into both the RAG vector store and (optionally) cognitive memory.
+
+## How It Works
+
+1. **Frame extraction** — ffmpeg samples frames at a configurable interval (default: 1 frame every 5 seconds). Each frame is passed to a vision-capable LLM (e.g. GPT-4o) which generates a text description. That description is embedded and indexed into the vector store with `modality: 'image'` metadata.
+
+2. **Audio extraction** — ffmpeg demuxes the audio track and pipes it to the configured STT provider (e.g. Whisper). The resulting transcript is chunked, embedded, and indexed with `modality: 'audio'` metadata.
+
+3. **Memory traces** — When cognitive memory is enabled, the bridge encodes both visual descriptions and audio transcript chunks as memory traces so the agent can recall video content during future conversations.
+
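The two ffmpeg invocations behind steps 1 and 2 can be sketched as argument builders. These mirror common ffmpeg usage (`-vf fps=1/N` for sampling, `-vn` to drop video); the bridge's actual flags may differ.

```typescript
// Illustrative ffmpeg argument builders for the two extraction pipelines.
function frameExtractionArgs(input: string, intervalSeconds = 5): string[] {
  // fps=1/N samples one frame every N seconds
  return ["-i", input, "-vf", `fps=1/${intervalSeconds}`, "frames/frame_%04d.png"];
}

function audioExtractionArgs(input: string): string[] {
  // -vn drops the video track; 16 kHz mono WAV is a typical STT input format
  return ["-i", input, "-vn", "-ar", "16000", "-ac", "1", "audio.wav"];
}
```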
+## When to Ingest Video vs. Just Extract Audio
+
+- **Ingest full video** when visual content matters: tutorials, screen recordings, product demos, surveillance, presentations with slides, anything where "what is shown" conveys information the transcript alone misses.
+- **Extract audio only** when the video is essentially a podcast, voice memo, meeting recording, or phone call where the visual track adds no information. Audio-only ingestion is faster, cheaper (no vision LLM calls), and produces smaller index footprints.
+
+If you are unsure, prefer full video ingestion. The frame extraction is lightweight and the vision descriptions are short — the marginal cost is small compared to the value of not losing visual context.
+
+## Prerequisites
+
+- **ffmpeg** must be installed and on the system PATH. The bridge shells out to `ffmpeg` for frame and audio extraction. Without it, video ingestion will fail with a clear error.
+- A **vision-capable LLM** must be configured (`OPENAI_API_KEY` or equivalent) for frame description.
+- An **STT provider** must be configured for audio transcription.
+
+## Usage
+
+Video ingestion is triggered through the `MultimodalMemoryBridge.ingestVideo()` method. When using the HTTP API, POST the video file to:
+
+```
+POST /api/agentos/rag/multimodal/documents/ingest
+Content-Type: multipart/form-data
+```
+
+with the video file in the `document` field. The system auto-detects video MIME types and routes to the video pipeline.
+
+Programmatic usage:
+
+```typescript
+import { MultimodalMemoryBridge } from 'agentos/rag/multimodal';
+
+// `bridge` is an already-constructed MultimodalMemoryBridge instance;
+// `videoBuffer` holds the raw video bytes (e.g. read from disk or an upload).
+await bridge.ingestVideo(videoBuffer, {
+  source: 'user-upload',
+  fileName: 'meeting-2024-03-15.mp4',
+  extractFrames: true, // default true
+  frameIntervalSeconds: 10, // sample 1 frame every 10s (default 5)
+  language: 'en', // STT language hint
+});
+```
+
+## Configuration Options
+
+| Option | Default | Description |
+|--------|---------|-------------|
+| `extractFrames` | `true` | Set `false` for audio-only ingestion |
+| `frameIntervalSeconds` | `5` | Seconds between sampled frames |
+| `language` | auto-detect | BCP-47 language code for STT |
+| `collection` | `'multimodal'` | Target vector store collection |
+
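Applying the defaults from this table might look like the following. The field names come from this document; the actual option type and default handling live inside AgentOS.

```typescript
// Hypothetical option shape mirroring the table above.
interface IngestVideoOptions {
  extractFrames?: boolean;
  frameIntervalSeconds?: number;
  language?: string; // BCP-47 hint; omit for auto-detect
  collection?: string;
}

function withDefaults(opts: IngestVideoOptions) {
  return {
    extractFrames: opts.extractFrames ?? true,
    frameIntervalSeconds: opts.frameIntervalSeconds ?? 5,
    collection: opts.collection ?? "multimodal",
    language: opts.language, // undefined means the STT provider auto-detects
  };
}
```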
+## Examples
+
+- "Ingest this tutorial video so I can search it later."
+- "Extract the audio from this meeting recording and add it to my knowledge base."
+- "Index this product demo video — I need to reference the UI screenshots shown at 2:30."
+- "Process all MP4 files in this folder and make them searchable."
+
+## Constraints
+
+- ffmpeg must be installed. The system does not bundle or auto-install it.
+- Long videos (>1 hour) produce many frames; consider increasing `frameIntervalSeconds` to 15-30 for very long content.
+- Vision LLM calls are billed per frame. A 1-hour video at the default 5-second interval generates ~720 frames.
+- Supported container formats: MP4, MKV, WebM, AVI, MOV (anything ffmpeg can demux).
+- Video ingestion is not real-time; expect processing time proportional to video length.
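The frame-count arithmetic behind the ~720-frames figure is simple division; a helper like this (not a package API) makes the cost estimate explicit:

```typescript
// Frames sampled = duration / interval, rounded down.
// Each sampled frame costs one vision LLM call.
function estimateFrameCount(durationSeconds: number, frameIntervalSeconds = 5): number {
  return Math.floor(durationSeconds / frameIntervalSeconds);
}

// A 1-hour video (3600 s) at the default 5 s interval: 720 frames.
// Raising the interval to 15 s cuts that to 240.
```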