@framers/agentos-skills-registry 0.11.0 → 0.13.0
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
package/package.json
CHANGED
@@ -1,12 +1,12 @@
---
name: image-gen
version: '2.0.0'
description: Generate, edit, upscale, and variate images using the AgentOS multi-provider image pipeline with automatic fallback.
author: Wunderland
namespace: wunderland
category: creative
tags: [image-generation, ai-art, dall-e, stable-diffusion, flux, replicate, stability, fal, creative, visual]
requires_secrets: []
requires_tools: [generate_image]
metadata:
  agentos:
@@ -15,36 +15,89 @@ metadata:
    homepage: https://platform.openai.com/docs/guides/images
---

# AI Image Generation Workflow

Use this skill when the user wants to create, edit, upscale, or make variations of images. AgentOS provides four high-level APIs that route to any configured provider, with automatic fallback when multiple providers have credentials set.

## The Four High-Level APIs

1. **`generateImage()`** — Create new images from text prompts.
2. **`editImage()`** — Transform existing images via img2img, inpainting, or outpainting.
3. **`upscaleImage()`** — Increase resolution (2x or 4x super-resolution).
4. **`variateImage()`** — Generate visual variations of an existing image.

If the `generate_image` tool is not loaded, enable it with `extensions_enable image-generation`.

## Provider Selection Guide

Choose the provider based on the user's priority:

| Priority | Provider | Env Var | Best For |
|----------|----------|---------|----------|
| Quality | **OpenAI** (GPT-Image-1, DALL-E 3) | `OPENAI_API_KEY` | Highest fidelity, prompt adherence, text-in-image |
| Control | **Stability AI** (SDXL, SD3, Ultra) | `STABILITY_API_KEY` | Negative prompts, style presets, cfg/steps tuning |
| Speed | **BFL / Flux** (Flux Pro 1.1) | `BFL_API_KEY` | Fast generation with strong quality |
| Speed | **Fal** (Flux Dev) | `FAL_API_KEY` | Serverless Flux inference, low latency |
| Variety | **Replicate** (Flux, SDXL, community models) | `REPLICATE_API_TOKEN` | Access to thousands of community models |
| Cost | **OpenRouter** (routes to cheapest) | `OPENROUTER_API_KEY` | Provider-agnostic routing, best price |
| Privacy | **Local SD** (A1111 / ComfyUI) | `STABLE_DIFFUSION_LOCAL_BASE_URL` | Fully offline, no data leaves the machine |

When multiple providers are configured, AgentOS wraps them in a **FallbackImageProxy** — if the primary provider fails (rate limit, outage, etc.), the request automatically retries on the next available provider in priority order.
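
The fallback behavior can be sketched as follows (the `ImageProvider` interface and the function name here are illustrative sketches, not the actual FallbackImageProxy internals):

```typescript
// Minimal sketch of a priority-ordered fallback chain.
// `ImageProvider` is a hypothetical interface for illustration only.
interface ImageProvider {
  name: string;
  generate(prompt: string): Promise<string>; // resolves to an image URL
}

async function generateWithFallback(
  providers: ImageProvider[],
  prompt: string,
): Promise<string> {
  const errors: string[] = [];
  for (const provider of providers) {
    try {
      // The first provider in priority order that succeeds wins.
      return await provider.generate(prompt);
    } catch (err) {
      // Record the failure and fall through to the next provider.
      errors.push(`${provider.name}: ${String(err)}`);
    }
  }
  throw new Error(`All providers failed: ${errors.join('; ')}`);
}
```

Note that, as stated in the constraints below, the chain retries on failure only; it never fans out to multiple providers at once.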

## Operation Decision Tree

Use this to pick the right API for the user's request:

- **"Generate / create / draw / imagine"** -> `generateImage()`
- **"Edit / change / modify / transform"** -> `editImage()` with `mode: 'img2img'`
- **"Remove / fill in / fix this area"** -> `editImage()` with `mode: 'inpaint'` + mask
- **"Extend / expand the borders"** -> `editImage()` with `mode: 'outpaint'`
- **"Make it higher resolution / sharper"** -> `upscaleImage()` with `scale: 2` or `4`
- **"Show me variations / alternatives"** -> `variateImage()` with `n: 3-4`
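
The tree above can be sketched as a small dispatcher (the `pickOperation` helper and its keyword matching are illustrative, not part of the AgentOS API):

```typescript
// Illustrative mapping from a user request to one of the four APIs.
type ImageOp =
  | { api: 'generateImage' }
  | { api: 'editImage'; mode: 'img2img' | 'inpaint' | 'outpaint' }
  | { api: 'upscaleImage' }
  | { api: 'variateImage' };

function pickOperation(request: string): ImageOp {
  const r = request.toLowerCase();
  // Check the more specific edit modes before generic "edit" keywords.
  if (/\b(remove|fill in|fix)\b/.test(r)) return { api: 'editImage', mode: 'inpaint' };
  if (/\b(extend|expand)\b/.test(r)) return { api: 'editImage', mode: 'outpaint' };
  if (/\b(edit|change|modify|transform)\b/.test(r)) return { api: 'editImage', mode: 'img2img' };
  if (/\b(upscale|higher resolution|sharper)\b/.test(r)) return { api: 'upscaleImage' };
  if (/\b(variation|alternative)s?\b/.test(r)) return { api: 'variateImage' };
  // "generate / create / draw / imagine" and anything else defaults here.
  return { api: 'generateImage' };
}
```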

## Prompt Engineering Tips

A strong image prompt has five components:

1. **Subject** — What is in the image. Be specific: "a red panda sitting on a mossy branch," not "an animal."
2. **Style** — Artistic approach: photorealistic, watercolor, pixel art, oil painting, vector illustration, cinematic, anime.
3. **Composition** — Camera angle and framing: close-up portrait, wide establishing shot, overhead flat lay, isometric.
4. **Lighting and Color** — Mood through light: golden hour, dramatic side-lighting, neon glow, muted earth tones, high contrast.
5. **Atmosphere** — Emotional tone: serene, ominous, whimsical, nostalgic, futuristic.

Additional tips:

- Front-load the most important elements; models weight earlier tokens more heavily.
- Use negative prompts (Stability, Local SD) to exclude unwanted elements, e.g. a negative prompt of "text, watermark, blurry".
- For text-in-image, OpenAI GPT-Image-1 is the most reliable; other models struggle with legible text.
- Request `quality: 'hd'` for DALL-E 3 when detail matters (it doubles the cost).
- For consistent characters across multiple images, describe the character in detail each time, or use img2img with a reference image.
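
As a sketch, the five components can be assembled front-to-back with a tiny helper (hypothetical, not an AgentOS API), keeping the subject first since earlier tokens carry more weight:

```typescript
// Illustrative prompt builder; only `subject` is required.
interface PromptParts {
  subject: string;
  style?: string;
  composition?: string;
  lighting?: string;
  atmosphere?: string;
}

function buildPrompt(parts: PromptParts): string {
  // Order matters: subject is front-loaded, atmosphere comes last.
  return [parts.subject, parts.style, parts.composition, parts.lighting, parts.atmosphere]
    .filter((p): p is string => Boolean(p))
    .join(', ');
}
```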

## Sizes and Aspect Ratios

| Provider | Supported Sizes | Aspect Ratio Support |
|----------|-----------------|----------------------|
| OpenAI | 1024x1024, 1792x1024, 1024x1792 | Via size selection |
| Stability | Flexible | `1:1`, `16:9`, `9:16`, `4:3`, `3:4`, etc. |
| Replicate/Flux | Flexible | `aspectRatio` parameter |
| Local SD | Any (multiples of 64) | Via `width`/`height` |
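
Since Local SD only accepts dimensions in multiples of 64, a requested size can be snapped before submission (illustrative helper, not part of the pipeline):

```typescript
// Snap a requested dimension to the nearest multiple of 64,
// with 64 as the floor, for A1111 / ComfyUI width and height.
function snapToMultipleOf64(value: number): number {
  return Math.max(64, Math.round(value / 64) * 64);
}
```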

## Examples

- "Generate a photorealistic image of a cozy cabin in the mountains at sunset."
- "Create a professional logo for a coffee shop called 'Bean There' — vector illustration style, clean lines."
- "Edit this photo: make the sky more dramatic with storm clouds." (img2img)
- "Remove the person from the background of this product photo." (inpaint + mask)
- "Upscale this thumbnail to 4x resolution for print."
- "Show me 3 variations of this hero image with different color palettes."
- "Generate a 16:9 cinematic landscape of a neon-lit Tokyo street at night in the rain."

## Constraints

- Image generation costs API credits per request; inform the user of approximate costs when possible.
- Content policy restrictions apply per provider: no realistic faces of real people, no violent/explicit content.
- DALL-E 3 does not support native inpainting — use GPT-Image-1 or Stability for mask-based editing.
- Upscaling is not supported by OpenAI or OpenRouter — use Stability, Replicate, or Local SD.
- Generated images may not perfectly match the prompt; iterative refinement is expected.
- Maximum prompt length varies by model (DALL-E 3: 4,000 chars; Stability: 2,000 chars).
- Local SD requires a running A1111 or ComfyUI instance with the API enabled.
- The fallback chain only activates when the primary provider fails; it does not merge results from multiple providers.

@@ -1,23 +1,153 @@
---
name: multimodal-rag
version: '2.0.0'
description: Index and search across text, images, audio, video, and PDFs via the multimodal RAG pipeline and HTTP API.
author: Wunderland
namespace: wunderland
category: productivity
tags: [rag, multimodal, image, audio, video, pdf, search, indexing, memory]
requires_secrets: []
requires_tools: [vision-pipeline]
metadata:
  agentos:
    emoji: "\U0001F50D"
---

# Multimodal RAG

Use this skill when the user wants to index, search, or retrieve content across multiple modalities -- text, images, audio, video, and documents (PDF, DOCX, Markdown, CSV, JSON, XML). All non-text content is converted to a text representation (vision description, STT transcript, document parse) before embedding, so every modality is searchable with the same text query.

## Architecture

```
Image --> Vision LLM --> description --> embed --> vector store
Audio --> STT --> transcript --> embed --> vector store
Video --> ffmpeg (frames + audio) --> vision + STT --> vector store
PDF --> text extraction + chunking --> embed --> vector store
```

When cognitive memory is enabled via `MultimodalMemoryBridge`, ingested content also creates memory traces so agents can recall multimodal content during conversation without an explicit search.
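
The routing step at the top of the pipeline can be sketched as a function of the MIME type (the pipeline names here are illustrative labels, not AgentOS identifiers):

```typescript
// Illustrative router: pick the pipeline that produces the text
// representation for a given MIME type, mirroring the diagram above.
type Pipeline = 'vision' | 'stt' | 'video' | 'document';

function routePipeline(mimeType: string): Pipeline {
  if (mimeType.startsWith('image/')) return 'vision'; // Vision LLM description
  if (mimeType.startsWith('audio/')) return 'stt';    // STT transcript
  if (mimeType.startsWith('video/')) return 'video';  // ffmpeg frames + audio
  return 'document';                                  // text extraction + chunking
}
```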

## Capabilities

- **Image indexing** — Vision LLM describes the image, description is embedded and searchable.
- **Audio indexing** — STT transcribes the audio, transcript is chunked and searchable.
- **Video indexing** — Frame extraction (vision) + audio transcription (STT), both indexed.
- **Document indexing** — PDF, DOCX, TXT, Markdown, CSV, JSON, XML text extracted and indexed.
- **Cross-modal search** — A single text query returns results from all modalities, ranked by relevance.
- **Query-by-image** — Upload an image to find similar indexed content.
- **Query-by-audio** — Upload audio to find related indexed content via transcript matching.

## HTTP API Routes

All routes are mounted under `/api/agentos/rag/multimodal`. Ingestion routes accept `multipart/form-data`.

### Ingest

| Method | Path | Field | Description |
|--------|------|-------|-------------|
| POST | `/images/ingest` | `image` | Ingest an image (max 15 MB). Vision LLM generates a description. |
| POST | `/audio/ingest` | `audio` | Ingest audio (max 25 MB). STT generates a transcript. |
| POST | `/documents/ingest` | `document` | Ingest a document (max 30 MB). Text is extracted and chunked. |

Common form fields for all ingest routes:

| Field | Type | Description |
|-------|------|-------------|
| `collectionId` | string | Target collection (default: auto) |
| `assetId` | string | Optional custom ID for the asset |
| `category` | string | `conversation_memory`, `knowledge_base`, `user_notes`, `system`, `custom` |
| `tags` | string | Comma-separated or JSON array of tags |
| `metadata` | string | JSON object with arbitrary metadata |
| `storePayload` | boolean | Whether to store the raw binary (for later download) |
| `sourceUrl` | string | Original URL of the content |
| `textRepresentation` | string | Override the auto-generated description/transcript |
| `userId` | string | Owner user ID |
| `agentId` | string | Owner agent ID |
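
A minimal ingest request might be assembled like this (standard `FormData`/`Blob` from Node 18+ or the browser; the helper name, filename, and field values are illustrative, and the base URL and auth are deployment-specific):

```typescript
// Build a multipart body for POST /images/ingest using the fields above.
// All values shown are examples, not required settings.
function buildIngestForm(image: Blob): FormData {
  const form = new FormData();
  form.append('image', image, 'product.png'); // file field for /images/ingest
  form.append('collectionId', 'knowledge-base');
  form.append('category', 'knowledge_base');
  form.append('tags', 'product,catalog');
  form.append('metadata', JSON.stringify({ sku: 'A-100' }));
  form.append('storePayload', 'true'); // keep the raw binary for later download
  return form;
}

// Usage (deployment-specific URL):
// await fetch('/api/agentos/rag/multimodal/images/ingest', {
//   method: 'POST',
//   body: buildIngestForm(imageBlob),
// });
```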

### Query

| Method | Path | Body / Field | Description |
|--------|------|--------------|-------------|
| POST | `/query` | JSON body | Text query across all modalities |
| POST | `/images/query` | `image` field | Query by uploading an image |
| POST | `/audio/query` | `audio` field | Query by uploading audio |

Text query body:

```json
{
  "query": "quantum computing diagrams",
  "modalities": ["image", "audio", "document"],
  "collectionIds": ["knowledge-base"],
  "topK": 10,
  "includeMetadata": true
}
```

Image/audio query form fields:

| Field | Type | Description |
|-------|------|-------------|
| `modalities` | string | Comma-separated: `image`, `audio`, `document` |
| `collectionIds` | string | Comma-separated collection IDs to search |
| `topK` | number | Max results (default: 5) |
| `includeMetadata` | boolean | Include stored metadata in results |
| `retrievalMode` | string | `auto` (default), `text`, `native`, `hybrid` |
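
The comma-separated form fields above might be normalized into a typed query object like this (an illustrative sketch, not the actual route handler):

```typescript
// Illustrative normalization of the image/audio query form fields.
interface MultimodalQuery {
  modalities: string[];
  collectionIds: string[];
  topK: number;
  includeMetadata: boolean;
  retrievalMode: 'auto' | 'text' | 'native' | 'hybrid';
}

function parseQueryFields(fields: Record<string, string>): MultimodalQuery {
  // Split a comma-separated value, trimming whitespace and empty entries.
  const csv = (v?: string) => (v ? v.split(',').map((s) => s.trim()).filter(Boolean) : []);
  return {
    modalities: csv(fields.modalities),
    collectionIds: csv(fields.collectionIds),
    topK: fields.topK ? Number(fields.topK) : 5, // default per the table
    includeMetadata: fields.includeMetadata === 'true',
    retrievalMode: (fields.retrievalMode as MultimodalQuery['retrievalMode']) ?? 'auto',
  };
}
```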

### Asset Management

| Method | Path | Description |
|--------|------|-------------|
| GET | `/assets/:assetId` | Get asset metadata |
| GET | `/assets/:assetId/content` | Download the raw binary (if `storePayload` was true) |
| DELETE | `/assets/:assetId` | Delete the asset and its embeddings |

## Retrieval Modes

- **`auto`** (default) — Text-first retrieval with native augmentation when available.
- **`text`** — Derive a caption/transcript and query the standard text pipeline only.
- **`native`** — Use modality-native embeddings (e.g. CLIP for images) when available.
- **`hybrid`** — Combine text and native retrieval, merge and re-rank results.
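
The hybrid merge step might look like the following sketch (illustrative only; the real re-ranking may weight text and native scores differently):

```typescript
// Illustrative hybrid merge: keep the best score per asset across the
// text and native result lists, then re-rank by score descending.
interface Hit {
  assetId: string;
  score: number;
}

function mergeHybrid(textHits: Hit[], nativeHits: Hit[]): Hit[] {
  const best = new Map<string, number>();
  for (const h of [...textHits, ...nativeHits]) {
    best.set(h.assetId, Math.max(best.get(h.assetId) ?? -Infinity, h.score));
  }
  return [...best.entries()]
    .map(([assetId, score]) => ({ assetId, score }))
    .sort((a, b) => b.score - a.score);
}
```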

## Programmatic Usage

```typescript
import { MultimodalMemoryBridge } from 'agentos/rag/multimodal';

// Assumes `bridge` is an initialized MultimodalMemoryBridge and `indexer`
// its underlying multimodal indexer.

// Ingest an image
await bridge.ingestImage(imageBuffer, { source: 'upload', tags: ['product'] });

// Ingest audio
await bridge.ingestAudio(audioBuffer, { language: 'en' });

// Ingest video (requires ffmpeg)
await bridge.ingestVideo(videoBuffer, { extractFrames: true });

// Ingest PDF
await bridge.ingestPDF(pdfBuffer, { extractImages: true });

// Cross-modal search
const results = await indexer.search('quantum computing', {
  topK: 10,
  modalities: ['image', 'text', 'audio'],
});
```

## Examples

- "Index this product photo so I can find it by description later."
- "Ingest all the PDFs in this folder into my knowledge base."
- "Search my audio recordings for mentions of the quarterly budget."
- "Find images related to the network architecture diagram."
- "What does the chart on page 5 of the annual report show?"
- "Upload this meeting recording and make it searchable."

## Constraints

- Image uploads are capped at 15 MB, audio at 25 MB, documents at 30 MB.
- Supported audio formats: MP3, MP4, M4A, WAV, WebM, OGG (Whisper-compatible).
- Supported document formats: PDF, DOCX, TXT, Markdown, CSV, JSON, XML.
- Video ingestion requires ffmpeg installed on the system.
- A vision LLM and an STT provider must be configured for image and audio indexing, respectively.
- Cross-modal search ranks by cosine similarity of embedded text representations; it does not perform true multimodal embedding fusion unless `retrievalMode: 'native'` is used with a CLIP-like model.