@framers/agentos-skills-registry 0.12.0 → 0.13.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
package/package.json CHANGED
@@ -1,6 +1,6 @@
  {
  "name": "@framers/agentos-skills-registry",
- "version": "0.12.0",
+ "version": "0.13.0",
  "files": [
  "dist",
  "registry",
@@ -1,23 +1,153 @@
  ---
  name: multimodal-rag
- description: Index and search across text, images, audio, video, and PDFs
- version: 1.0.0
- tags: [rag, multimodal, image, audio, video, pdf, search]
- tools_required: [vision-pipeline]
+ version: '2.0.0'
+ description: Index and search across text, images, audio, video, and PDFs via the multimodal RAG pipeline and HTTP API.
+ author: Wunderland
+ namespace: wunderland
+ category: productivity
+ tags: [rag, multimodal, image, audio, video, pdf, search, indexing, memory]
+ requires_secrets: []
+ requires_tools: [vision-pipeline]
+ metadata:
+   agentos:
+     emoji: "\U0001F50D"
  ---

  # Multimodal RAG

- Index and retrieve content across all modalities -- text, images, audio, video, and PDFs. Images are described via vision AI, audio is transcribed, video frames are extracted, and PDFs are parsed. All content is embedded and searchable.
+ Use this skill when the user wants to index, search, or retrieve content across multiple modalities -- text, images, audio, video, and documents (PDF, DOCX, Markdown, CSV, JSON, XML). All non-text content is converted to a text representation (vision description, STT transcript, document parse) before embedding, so every modality is searchable with the same text query.
+
+ ## Architecture
+
+ ```
+ Image --> Vision LLM --> description --> embed --> vector store
+ Audio --> STT --> transcript --> embed --> vector store
+ Video --> ffmpeg (frames + audio) --> vision + STT --> vector store
+ PDF --> text extraction + chunking --> embed --> vector store
+ ```
+
+ When cognitive memory is enabled via `MultimodalMemoryBridge`, ingested content also creates memory traces so agents can recall multimodal content during conversation without an explicit search.
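
The normalize-then-embed flow above can be sketched as a small routing step. The helper functions and the `Modality` type are hypothetical stand-ins, not the package's API, and the real vision/STT/parsing providers would be async:

```typescript
// Sketch of the normalize-to-text step from the architecture diagram.
// describeImage/transcribeAudio/parseDocument are hypothetical stand-ins
// for the configured vision LLM, STT, and document-parsing providers.
type Modality = 'image' | 'audio' | 'document' | 'text';

function describeImage(_payload: Uint8Array): string { return 'vision description'; }
function transcribeAudio(_payload: Uint8Array): string { return 'stt transcript'; }
function parseDocument(_payload: Uint8Array): string { return 'parsed document text'; }

// Every modality ends up as text before embedding, which is why one
// text query can search all of them.
function toTextRepresentation(modality: Modality, payload: Uint8Array): string {
  switch (modality) {
    case 'image': return describeImage(payload);
    case 'audio': return transcribeAudio(payload);
    case 'document': return parseDocument(payload);
    case 'text': return new TextDecoder().decode(payload);
  }
}
```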

  ## Capabilities
- - **Image indexing**: Vision LLM describes -> embed -> search
- - **Audio indexing**: STT transcribes -> embed -> search
- - **Video indexing**: Frame extraction + audio transcription
- - **PDF indexing**: Text + embedded image extraction
- - **Cross-modal search**: Query returns results from all modalities
-
- ## Example
- "Find images related to quantum computing"
- "Search my audio recordings for mentions of the project deadline"
- "What does the diagram in page 3 of the PDF show?"
+
+ - **Image indexing**: Vision LLM describes the image; the description is embedded and searchable.
+ - **Audio indexing**: STT transcribes the audio; the transcript is chunked and searchable.
+ - **Video indexing**: Frame extraction (vision) + audio transcription (STT), both indexed.
+ - **Document indexing**: PDF, DOCX, TXT, Markdown, CSV, JSON, and XML text is extracted and indexed.
+ - **Cross-modal search**: A single text query returns results from all modalities, ranked by relevance.
+ - **Query-by-image**: Upload an image to find similar indexed content.
+ - **Query-by-audio**: Upload audio to find related indexed content via transcript matching.
+
+ ## HTTP API Routes
+
+ All routes are mounted under `/api/agentos/rag/multimodal`. Ingestion routes accept `multipart/form-data`.
+
+ ### Ingest
+
+ | Method | Path | Field | Description |
+ |--------|------|-------|-------------|
+ | POST | `/images/ingest` | `image` | Ingest an image (max 15 MB). Vision LLM generates description. |
+ | POST | `/audio/ingest` | `audio` | Ingest audio (max 25 MB). STT generates transcript. |
+ | POST | `/documents/ingest` | `document` | Ingest a document (max 30 MB). Text extracted and chunked. |
+
+ Common form fields for all ingest routes:
+
+ | Field | Type | Description |
+ |-------|------|-------------|
+ | `collectionId` | string | Target collection (default: auto) |
+ | `assetId` | string | Optional custom ID for the asset |
+ | `category` | string | `conversation_memory`, `knowledge_base`, `user_notes`, `system`, `custom` |
+ | `tags` | string | Comma-separated or JSON array of tags |
+ | `metadata` | string | JSON object with arbitrary metadata |
+ | `storePayload` | boolean | Whether to store the raw binary (for later download) |
+ | `sourceUrl` | string | Original URL of the content |
+ | `textRepresentation` | string | Override auto-generated description/transcript |
+ | `userId` | string | Owner user ID |
+ | `agentId` | string | Owner agent ID |
+
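
A minimal client-side sketch of an `/images/ingest` call, assuming a Node 18+ or browser environment with global `FormData` and `Blob`. Field names follow the tables above; the file name, options shape, and helper itself are illustrative, not the package's API:

```typescript
// Build the multipart body for POST /api/agentos/rag/multimodal/images/ingest.
// Field names mirror the ingest tables; everything else is illustrative.
function buildImageIngestForm(image: Blob, opts: {
  collectionId?: string;
  tags?: string[];
  metadata?: Record<string, unknown>;
  storePayload?: boolean;
} = {}): FormData {
  const form = new FormData();
  form.append('image', image, 'photo.png'); // required file field
  if (opts.collectionId !== undefined) form.append('collectionId', opts.collectionId);
  if (opts.tags !== undefined) form.append('tags', JSON.stringify(opts.tags)); // JSON-array form of `tags`
  if (opts.metadata !== undefined) form.append('metadata', JSON.stringify(opts.metadata));
  if (opts.storePayload !== undefined) form.append('storePayload', String(opts.storePayload));
  return form;
}

const form = buildImageIngestForm(new Blob([new Uint8Array(16)]), {
  collectionId: 'knowledge-base',
  tags: ['product'],
  storePayload: true,
});
// await fetch('/api/agentos/rag/multimodal/images/ingest', { method: 'POST', body: form });
```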
+ ### Query
+
+ | Method | Path | Body / Field | Description |
+ |--------|------|--------------|-------------|
+ | POST | `/query` | JSON body | Text query across all modalities |
+ | POST | `/images/query` | `image` field | Query by uploading an image |
+ | POST | `/audio/query` | `audio` field | Query by uploading audio |
+
+ Text query body:
+
+ ```json
+ {
+   "query": "quantum computing diagrams",
+   "modalities": ["image", "audio", "document"],
+   "collectionIds": ["knowledge-base"],
+   "topK": 10,
+   "includeMetadata": true
+ }
+ ```
+
+ Image/audio query form fields:
+
+ | Field | Type | Description |
+ |-------|------|-------------|
+ | `modalities` | string | Comma-separated: `image`, `audio`, `document` |
+ | `collectionIds` | string | Comma-separated collection IDs to search |
+ | `topK` | number | Max results (default: 5) |
+ | `includeMetadata` | boolean | Include stored metadata in results |
+ | `retrievalMode` | string | `auto` (default), `text`, `native`, `hybrid` |
+
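
A typed sketch of the `/query` request. The interface mirrors the documented JSON body (it is not exported by the package), and the URL assumes the mount path from this section:

```typescript
// Typed sketch of the text-query body documented above. The interface
// is inferred from the field tables, not exported by the package.
interface MultimodalQuery {
  query: string;
  modalities?: Array<'image' | 'audio' | 'document'>;
  collectionIds?: string[];
  topK?: number;
  includeMetadata?: boolean;
}

function buildQueryRequest(q: MultimodalQuery): {
  url: string;
  method: string;
  headers: Record<string, string>;
  body: string;
} {
  return {
    url: '/api/agentos/rag/multimodal/query', // mount path from this section
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify(q),
  };
}

const req = buildQueryRequest({ query: 'quantum computing diagrams', topK: 10 });
```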
+ ### Asset Management
+
+ | Method | Path | Description |
+ |--------|------|-------------|
+ | GET | `/assets/:assetId` | Get asset metadata |
+ | GET | `/assets/:assetId/content` | Download raw binary (if `storePayload` was true) |
+ | DELETE | `/assets/:assetId` | Delete asset and its embeddings |
+
+ ## Retrieval Modes
+
+ - **`auto`** (default): Text-first retrieval with native augmentation when available.
+ - **`text`**: Derive a caption/transcript and query the standard text pipeline only.
+ - **`native`**: Use modality-native embeddings (e.g. CLIP for images) when available.
+ - **`hybrid`**: Combine text and native retrieval, merge and re-rank results.
+
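
As an illustration of what `hybrid` mode implies, merging a text-pipeline result list with a native-embedding result list might keep the best score per asset and re-rank. This is a sketch of the concept, not the package's actual ranking code:

```typescript
// Illustrative hybrid merge: take the best score per asset across the
// text and native result lists, then re-rank. NOT the package's code.
interface Scored { assetId: string; score: number; } // cosine similarity

function mergeHybrid(textResults: Scored[], nativeResults: Scored[], topK: number): Scored[] {
  const best = new Map<string, number>();
  for (const r of [...textResults, ...nativeResults]) {
    const prev = best.get(r.assetId);
    if (prev === undefined || r.score > prev) best.set(r.assetId, r.score);
  }
  return [...best.entries()]
    .map(([assetId, score]) => ({ assetId, score }))
    .sort((a, b) => b.score - a.score)
    .slice(0, topK);
}

const merged = mergeHybrid(
  [{ assetId: 'a', score: 0.7 }, { assetId: 'b', score: 0.5 }],
  [{ assetId: 'b', score: 0.9 }, { assetId: 'c', score: 0.6 }],
  2,
);
// merged: b (0.9) first, then a (0.7); c is cut by topK
```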
+ ## Programmatic Usage
+
+ ```typescript
+ import { MultimodalMemoryBridge } from 'agentos/rag/multimodal';
+
+ // Assumes `bridge` (a configured MultimodalMemoryBridge) and
+ // `indexer` (the search-side index) were constructed elsewhere.
+
+ // Ingest an image
+ await bridge.ingestImage(imageBuffer, { source: 'upload', tags: ['product'] });
+
+ // Ingest audio
+ await bridge.ingestAudio(audioBuffer, { language: 'en' });
+
+ // Ingest video (requires ffmpeg)
+ await bridge.ingestVideo(videoBuffer, { extractFrames: true });
+
+ // Ingest PDF
+ await bridge.ingestPDF(pdfBuffer, { extractImages: true });
+
+ // Cross-modal search
+ const results = await indexer.search('quantum computing', {
+   topK: 10,
+   modalities: ['image', 'text', 'audio'],
+ });
+ ```
+
+ ## Examples
+
+ - "Index this product photo so I can find it by description later."
+ - "Ingest all the PDFs in this folder into my knowledge base."
+ - "Search my audio recordings for mentions of the quarterly budget."
+ - "Find images related to the network architecture diagram."
+ - "What does the chart on page 5 of the annual report show?"
+ - "Upload this meeting recording and make it searchable."
+
+ ## Constraints
+
+ - Image uploads are capped at 15 MB, audio at 25 MB, documents at 30 MB.
+ - Supported audio formats: MP3, MP4, M4A, WAV, WebM, OGG (Whisper-compatible).
+ - Supported document formats: PDF, DOCX, TXT, Markdown, CSV, JSON, XML.
+ - Video ingestion requires ffmpeg to be installed on the system.
+ - A vision LLM and an STT provider must be configured for image and audio indexing, respectively.
+ - Cross-modal search ranks by cosine similarity of embedded text representations; it does not perform true multimodal embedding fusion unless `retrievalMode: 'native'` is used with a CLIP-like model.
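
The size caps above lend themselves to a client-side pre-check before uploading. The limits mirror the Constraints list; the helper itself is illustrative:

```typescript
// Client-side pre-check against the documented upload caps.
// Limits come from the Constraints list; the helper is illustrative.
const MAX_UPLOAD_BYTES = {
  image: 15 * 1024 * 1024,    // 15 MB
  audio: 25 * 1024 * 1024,    // 25 MB
  document: 30 * 1024 * 1024, // 30 MB
} as const;

function checkUploadSize(kind: keyof typeof MAX_UPLOAD_BYTES, sizeBytes: number): boolean {
  return sizeBytes > 0 && sizeBytes <= MAX_UPLOAD_BYTES[kind];
}
```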