@r16t/multimodal-mcp 1.2.3 → 1.3.0

package/README.md CHANGED
@@ -1,12 +1,14 @@
  # multimodal-mcp

- Multi-provider media generation MCP server. Generate images, videos, and audio from text prompts using OpenAI, xAI, and Gemini through a single unified interface.
+ Multi-provider media generation MCP server. Generate images, videos, and audio from text prompts, and transcribe audio back to text, using OpenAI, xAI, Gemini, ElevenLabs, and BFL (FLUX) through a single unified interface.

  ## Features

- - 🎨 **Image Generation** — Generate images via OpenAI (gpt-image-1), xAI (grok-imagine-image), or Gemini (imagen-4)
+ - 🎨 **Image Generation** — Generate images via OpenAI (gpt-image-1), xAI (grok-imagine-image), Gemini (imagen-4), or BFL (FLUX Pro 1.1)
+ - ✏️ **Image Editing** — Edit images via OpenAI, xAI, Gemini, or BFL (FLUX Kontext)
  - 🎬 **Video Generation** — Generate videos via OpenAI (sora-2), xAI (grok-imagine-video), or Gemini (veo-3.1)
- - 🔊 **Audio Generation** — Text-to-speech via OpenAI (tts-1) or Gemini (gemini-2.5-flash-preview-tts)
+ - 🔊 **Audio Generation** — Text-to-speech via OpenAI (tts-1), Gemini, or ElevenLabs (Flash v2.5). Sound effects via ElevenLabs
+ - 🎙️ **Audio Transcription** — Speech-to-text via OpenAI (Whisper) or ElevenLabs (Scribe)
  - 🔄 **Auto-Discovery** — Automatically detects configured providers from environment variables
  - 🎯 **Provider Selection** — Auto-select a provider or explicitly choose one per request
  - 📁 **File Output** — Saves all generated media to disk with descriptive filenames
@@ -24,6 +26,12 @@ claude mcp add multimodal-mcp -e OPENAI_API_KEY=sk-... -- npx @r16t/multimodal-m

  # Or using Gemini
  # claude mcp add multimodal-mcp -e GEMINI_API_KEY=AIza... -- npx @r16t/multimodal-mcp@latest
+
+ # Or using ElevenLabs (audio + transcription)
+ # claude mcp add multimodal-mcp -e ELEVENLABS_API_KEY=xi-... -- npx @r16t/multimodal-mcp@latest
+
+ # Or using BFL/FLUX (images)
+ # claude mcp add multimodal-mcp -e BFL_API_KEY=... -- npx @r16t/multimodal-mcp@latest
  ```

  Using a different editor? See [setup instructions](#editor-setup) for Claude Desktop, Cursor, VS Code, Windsurf, and Cline.
@@ -32,10 +40,12 @@ Using a different editor? See [setup instructions](#editor-setup) for Claude Des

  | Variable | Required | Description |
  |----------|----------|-------------|
- | `OPENAI_API_KEY` | At least one provider key | OpenAI API key — enables image, video, and audio generation via gpt-image-1, sora-2, and tts-1 |
+ | `OPENAI_API_KEY` | At least one provider key | OpenAI API key — enables image, video, audio generation, and transcription via gpt-image-1, sora-2, tts-1, and whisper-1 |
  | `XAI_API_KEY` | At least one provider key | xAI API key — enables image and video generation via grok-imagine-image and grok-imagine-video |
  | `GEMINI_API_KEY` | At least one provider key | Gemini API key — enables image, video, and audio generation via imagen-4, veo-3.1, and gemini-2.5-flash-preview-tts |
  | `GOOGLE_API_KEY` | — | Alias for `GEMINI_API_KEY`; either name is accepted |
+ | `ELEVENLABS_API_KEY` | At least one provider key | ElevenLabs API key — enables audio generation (TTS, sound effects) and transcription via Flash v2.5 and Scribe v1 |
+ | `BFL_API_KEY` | At least one provider key | BFL API key — enables image generation and editing via FLUX Pro 1.1 and FLUX Kontext |
  | `MEDIA_OUTPUT_DIR` | No | Directory for saved media files. Defaults to the current working directory |

  ## Available Tools
@@ -47,7 +57,7 @@ Generate an image from a text prompt.
  | Parameter | Type | Required | Description |
  |-----------|------|----------|-------------|
  | `prompt` | string | Yes | Text description of the image to generate |
- | `provider` | string | No | Provider to use: `openai`, `xai`, `google`. Auto-selects if omitted |
+ | `provider` | string | No | Provider to use: `openai`, `xai`, `google`, `bfl`. Auto-selects if omitted |
  | `aspectRatio` | string | No | Aspect ratio: `1:1`, `16:9`, `9:16`, `4:3`, `3:4` |
  | `quality` | string | No | Quality level: `low`, `standard`, `high` |
  | `outputDirectory` | string | No | Directory to save the generated file. Absolute or relative path. Defaults to `MEDIA_OUTPUT_DIR` or cwd |
@@ -69,16 +79,27 @@ Generate a video from a text prompt. Video generation is asynchronous and may ta

  ### `generate_audio`

- Generate audio (text-to-speech) from text. Audio generation is synchronous.
+ Generate audio from text. Supports text-to-speech and sound effects. Audio generation is synchronous.

  | Parameter | Type | Required | Description |
  |-----------|------|----------|-------------|
- | `text` | string | Yes | Text to convert to speech |
- | `provider` | string | No | Provider to use: `openai`, `google`. Auto-selects if omitted |
- | `voice` | string | No | Voice name (provider-specific). OpenAI: `alloy`, `ash`, `coral`, `echo`, `fable`, `nova`, `onyx`, `sage`, `shimmer`. Google: `Kore`, `Charon`, `Fenrir`, `Aoede`, `Puck`, etc. |
+ | `text` | string | Yes | Text to convert to speech, or a description of the sound effect to generate |
+ | `provider` | string | No | Provider to use: `openai`, `google`, `elevenlabs`. Auto-selects if omitted |
+ | `voice` | string | No | Voice name (provider-specific). OpenAI: `alloy`, `ash`, `coral`, `echo`, `fable`, `nova`, `onyx`, `sage`, `shimmer`. Google: `Kore`, `Charon`, `Fenrir`, `Aoede`, `Puck`, etc. ElevenLabs: voice ID |
  | `speed` | number | No | Speech speed multiplier (OpenAI only): `0.25` to `4.0` |
  | `format` | string | No | Output format (OpenAI only): `mp3`, `opus`, `aac`, `flac`, `wav`, `pcm` |
  | `outputDirectory` | string | No | Directory to save the generated file. Absolute or relative path. Defaults to `MEDIA_OUTPUT_DIR` or cwd |
+ | `providerOptions` | object | No | Provider-specific parameters passed through directly. ElevenLabs: set `mode: "sound-effect"` for sound effects, `model` for TTS model selection |
+
+ ### `transcribe_audio`
+
+ Transcribe audio to text (speech-to-text).
+
+ | Parameter | Type | Required | Description |
+ |-----------|------|----------|-------------|
+ | `audioPath` | string | Yes | Absolute path to the audio file to transcribe |
+ | `provider` | string | No | Provider to use: `openai`, `elevenlabs`. Auto-selects if omitted |
+ | `language` | string | No | Language code (e.g., `en`, `fr`, `es`) to hint the transcription language |
  | `providerOptions` | object | No | Provider-specific parameters passed through directly |

  ### `list_providers`
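As a concrete illustration of the new parameter table, a `transcribe_audio` tool call might carry arguments like the following (the path and provider choice here are illustrative, not from the package):

```json
{
  "audioPath": "/home/user/recordings/meeting.m4a",
  "provider": "elevenlabs",
  "language": "en"
}
```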
@@ -87,11 +108,13 @@ List all configured media generation providers and their capabilities. Takes no

  ## Provider Capabilities

- | Provider | Image | Video | Audio | Image Model | Video Model | Audio Model |
- |----------|:-----:|:-----:|:-----:|-------------|-------------|-------------|
- | OpenAI | ✅ | ✅ | ✅ | gpt-image-1 | sora-2 | tts-1 |
- | xAI | ✅ | ✅ | — | grok-imagine-image | grok-imagine-video | — |
- | Gemini | ✅ | ✅ | ✅ | imagen-4 | veo-3.1 | gemini-2.5-flash-preview-tts |
+ | Provider | Image | Image Editing | Video | Audio | Transcription | Key Models |
+ |----------|:-----:|:-------------:|:-----:|:-----:|:-------------:|------------|
+ | OpenAI | ✅ | ✅ | ✅ | ✅ | ✅ | gpt-image-1, sora-2, tts-1, whisper-1 |
+ | xAI | ✅ | ✅ | ✅ | — | — | grok-imagine-image, grok-imagine-video |
+ | Gemini | ✅ | ✅ | ✅ | ✅ | — | imagen-4, veo-3.1, gemini-2.5-flash-preview-tts |
+ | ElevenLabs | — | — | — | ✅ | ✅ | eleven_flash_v2_5, scribe_v1 |
+ | BFL | ✅ | ✅ | — | — | — | flux-pro-1.1, flux-kontext-pro |

  ### Image Aspect Ratios

@@ -100,6 +123,7 @@ List all configured media generation providers and their capabilities. Takes no
  | OpenAI | ✅ | ✅ | ✅ | ✅ | ✅ |
  | xAI | ✅ | ✅ | ✅ | ✅ | ✅ |
  | Gemini | ✅ | ✅ | ✅ | ✅ | ✅ |
+ | BFL | ✅ | ✅ | ✅ | ✅ | ✅ |

  ### Video Aspect Ratios & Resolutions

@@ -115,6 +139,7 @@ List all configured media generation providers and their capabilities. Takes no
  |----------|:---:|:----:|:---:|:----:|:---:|:---:|
  | OpenAI | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
  | Gemini | — | — | — | — | ✅ | — |
+ | ElevenLabs | ✅ | ✅ | — | — | — | ✅ |

  ## Troubleshooting

@@ -124,11 +149,11 @@ List all configured media generation providers and their capabilities. Takes no
  [config] No provider API keys detected
  ```

- Set at least one of `OPENAI_API_KEY`, `XAI_API_KEY`, or `GEMINI_API_KEY` in the MCP server's `env` block.
+ Set at least one of `OPENAI_API_KEY`, `XAI_API_KEY`, `GEMINI_API_KEY`, `ELEVENLABS_API_KEY`, or `BFL_API_KEY` in the MCP server's `env` block.

  ### Provider not available for requested media type

- All three providers support image and video generation. Audio generation (text-to-speech) is supported by OpenAI and Gemini. xAI does not currently offer a standalone TTS API. If you specify a `provider` that isn't configured (no API key) or doesn't support the requested media type, you'll receive an error. Omit the `provider` parameter to auto-select from configured providers.
+ Each provider supports different media types (see [Provider Capabilities](#provider-capabilities)). If you specify a `provider` that isn't configured (no API key) or doesn't support the requested media type, you'll receive an error. Omit the `provider` parameter to auto-select from configured providers.

  ### Video generation timeout

@@ -154,7 +179,7 @@ npm run dev # Watch mode for TypeScript compilation

  ## Editor Setup

- Replace `OPENAI_API_KEY` with your provider of choice (`XAI_API_KEY`, `GEMINI_API_KEY`). You can set multiple keys to enable multiple providers.
+ Replace `OPENAI_API_KEY` with your provider of choice (`XAI_API_KEY`, `GEMINI_API_KEY`, `ELEVENLABS_API_KEY`, `BFL_API_KEY`). You can set multiple keys to enable multiple providers.

  ### Claude Desktop

package/build/config.d.ts CHANGED
@@ -3,6 +3,8 @@ declare const configSchema: z.ZodObject<{
  openaiApiKey: z.ZodOptional<z.ZodString>;
  xaiApiKey: z.ZodOptional<z.ZodString>;
  googleApiKey: z.ZodOptional<z.ZodString>;
+ elevenlabsApiKey: z.ZodOptional<z.ZodString>;
+ bflApiKey: z.ZodOptional<z.ZodString>;
  outputDirectory: z.ZodString;
  }, z.core.$strip>;
  export type Config = z.infer<typeof configSchema>;
package/build/config.js CHANGED
@@ -3,6 +3,8 @@ const configSchema = z.object({
  openaiApiKey: z.string().optional(),
  xaiApiKey: z.string().optional(),
  googleApiKey: z.string().optional(),
+ elevenlabsApiKey: z.string().optional(),
+ bflApiKey: z.string().optional(),
  outputDirectory: z.string(),
  });
  function resolveGeminiKey() {
@@ -13,6 +15,8 @@ export function loadConfig() {
  openaiApiKey: process.env.OPENAI_API_KEY || undefined,
  xaiApiKey: process.env.XAI_API_KEY || undefined,
  googleApiKey: resolveGeminiKey(),
+ elevenlabsApiKey: process.env.ELEVENLABS_API_KEY || undefined,
+ bflApiKey: process.env.BFL_API_KEY || undefined,
  outputDirectory: process.env.MEDIA_OUTPUT_DIR || process.cwd(),
  });
  const detected = [];
@@ -22,6 +26,10 @@ export function loadConfig() {
  detected.push("xAI");
  if (config.googleApiKey)
  detected.push("Gemini");
+ if (config.elevenlabsApiKey)
+ detected.push("ElevenLabs");
+ if (config.bflApiKey)
+ detected.push("BFL");
  if (detected.length > 0) {
  console.error(`[config] Detected providers: ${detected.join(", ")}`);
  }
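The auto-discovery change above boils down to a lookup over known environment variables. A standalone sketch of that pattern (illustrative only — the shipped `loadConfig()` also resolves the `GOOGLE_API_KEY` alias via `resolveGeminiKey()`):

```javascript
// Map each known env var to a provider label and report whichever ones
// are set. Mirrors the detection logic in config.js for version 1.3.0.
const PROVIDER_ENV_VARS = {
  OPENAI_API_KEY: "OpenAI",
  XAI_API_KEY: "xAI",
  GEMINI_API_KEY: "Gemini",
  ELEVENLABS_API_KEY: "ElevenLabs",
  BFL_API_KEY: "BFL",
};

function detectProviders(env) {
  return Object.entries(PROVIDER_ENV_VARS)
    .filter(([name]) => Boolean(env[name]))
    .map(([, label]) => label);
}

console.log(detectProviders({ OPENAI_API_KEY: "sk-test", BFL_API_KEY: "key" }));
// → [ 'OpenAI', 'BFL' ]
```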
package/build/errors.js CHANGED
@@ -3,6 +3,7 @@ const API_KEY_PATTERNS = [
  /xai-[a-zA-Z0-9_-]{10,}/g,
  /AIzaSy[a-zA-Z0-9_-]{10,}/g,
  /key=[a-zA-Z0-9_-]{20,}/g,
+ /xi-[a-zA-Z0-9_-]{10,}/g,
  ];
  export function sanitizeError(error) {
  let message;
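To see the new `xi-` pattern in context, here is a standalone sketch of the redaction step. The pattern list is taken from the diff; the full body of `sanitizeError()` is not shown here, so the `"[REDACTED]"` replacement token is an assumption:

```javascript
// Redact known API-key shapes from an error message before it is
// surfaced to the MCP client. The xi- pattern covers ElevenLabs keys.
const API_KEY_PATTERNS = [
  /xai-[a-zA-Z0-9_-]{10,}/g,
  /AIzaSy[a-zA-Z0-9_-]{10,}/g,
  /key=[a-zA-Z0-9_-]{20,}/g,
  /xi-[a-zA-Z0-9_-]{10,}/g, // ElevenLabs keys, added in 1.3.0
];

function redactKeys(message) {
  return API_KEY_PATTERNS.reduce(
    (msg, pattern) => msg.replace(pattern, "[REDACTED]"),
    message,
  );
}

console.log(redactKeys("ElevenLabs auth failed for xi-abcdefghijklmnop"));
// → ElevenLabs auth failed for [REDACTED]
```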
@@ -0,0 +1,15 @@
+ import type { MediaProvider, ProviderCapabilities, ImageParams, EditImageParams, VideoParams, AudioParams, GeneratedMedia } from "./types.js";
+ export declare class BFLProvider implements MediaProvider {
+ readonly name = "bfl";
+ readonly capabilities: ProviderCapabilities;
+ private apiKey;
+ constructor(apiKey: string);
+ generateImage(params: ImageParams): Promise<GeneratedMedia>;
+ editImage(params: EditImageParams): Promise<GeneratedMedia>;
+ generateVideo(_params: VideoParams): Promise<GeneratedMedia>;
+ generateAudio(_params: AudioParams): Promise<GeneratedMedia>;
+ private submitTask;
+ private pollTask;
+ private downloadResult;
+ private mapAspectRatio;
+ }
@@ -0,0 +1,87 @@
+ import { pollForCompletion } from "./polling.js";
+ const BFL_BASE_URL = "https://api.bfl.ml/v1";
+ const IMAGE_MODEL = "flux-pro-1.1";
+ const EDIT_MODEL = "flux-kontext-pro";
+ const ASPECT_RATIO_MAP = {
+ "1:1": { width: 1024, height: 1024 },
+ "16:9": { width: 1344, height: 768 },
+ "9:16": { width: 768, height: 1344 },
+ "4:3": { width: 1152, height: 896 },
+ "3:4": { width: 896, height: 1152 },
+ };
+ export class BFLProvider {
+ name = "bfl";
+ capabilities = {
+ supportsImageGeneration: true,
+ supportsImageEditing: true,
+ supportsVideoGeneration: false,
+ supportsAudioGeneration: false,
+ supportsTranscription: false,
+ supportedImageAspectRatios: ["1:1", "16:9", "9:16", "4:3", "3:4"],
+ supportedVideoAspectRatios: [],
+ supportedVideoResolutions: [],
+ supportedAudioFormats: [],
+ maxVideoDurationSeconds: 0,
+ };
+ apiKey;
+ constructor(apiKey) {
+ this.apiKey = apiKey;
+ }
+ async generateImage(params) {
+ const { model, ...options } = params.providerOptions ?? {};
+ const modelName = model ?? IMAGE_MODEL;
+ const { width, height } = this.mapAspectRatio(params.aspectRatio);
+ const task = await this.submitTask(modelName, { prompt: params.prompt, width, height, ...options });
+ const result = await this.pollTask(task.id);
+ return this.downloadResult(result.result.sample, modelName);
+ }
+ async editImage(params) {
+ const { model, ...options } = params.providerOptions ?? {};
+ const modelName = model ?? EDIT_MODEL;
+ const input_image = params.imageData.toString("base64");
+ const task = await this.submitTask(modelName, { prompt: params.prompt, input_image, ...options });
+ const result = await this.pollTask(task.id);
+ return this.downloadResult(result.result.sample, modelName);
+ }
+ async generateVideo(_params) {
+ throw new Error("BFL does not support video generation");
+ }
+ async generateAudio(_params) {
+ throw new Error("BFL does not support audio generation");
+ }
+ async submitTask(model, body) {
+ const response = await fetch(`${BFL_BASE_URL}/${model}`, {
+ method: "POST",
+ headers: { "Content-Type": "application/json", "X-Key": this.apiKey },
+ body: JSON.stringify(body),
+ });
+ if (!response.ok) {
+ throw new Error(`BFL task submission failed: ${response.status}`);
+ }
+ return response.json();
+ }
+ async pollTask(taskId) {
+ return pollForCompletion(async () => {
+ const response = await fetch(`${BFL_BASE_URL}/get_result?id=${taskId}`, {
+ headers: { "X-Key": this.apiKey },
+ });
+ return response.json();
+ }, (result) => result.status === "Ready", { timeoutMs: 300_000, intervalMs: 3_000 });
+ }
+ async downloadResult(url, model) {
+ const response = await fetch(url);
+ if (!response.ok) {
+ throw new Error(`BFL image download failed: ${response.status}`);
+ }
+ const mimeType = response.headers.get("content-type") ?? "image/png";
+ const data = Buffer.from(await response.arrayBuffer());
+ return { data, mimeType, metadata: { model, provider: "bfl" } };
+ }
+ mapAspectRatio(ratio) {
+ const dimensions = ASPECT_RATIO_MAP[ratio];
+ if (!dimensions) {
+ throw new Error(`BFL does not support aspect ratio: ${ratio}`);
+ }
+ return dimensions;
+ }
+ }
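`pollForCompletion` is imported from `./polling.js`, which is unchanged and not included in this diff. A minimal generic implementation consistent with how `pollTask` calls it (a fetch function, a readiness predicate, and `timeoutMs`/`intervalMs` options) might look like this sketch — it is not the package's actual code:

```javascript
// Hypothetical sketch of the polling helper used by pollTask() above:
// call fetchResult() every intervalMs until isReady(result) is true,
// failing once timeoutMs has elapsed.
async function pollForCompletion(fetchResult, isReady, { timeoutMs, intervalMs }) {
  const deadline = Date.now() + timeoutMs;
  for (;;) {
    const result = await fetchResult();
    if (isReady(result)) return result;
    if (Date.now() >= deadline) {
      throw new Error(`Polling timed out after ${timeoutMs}ms`);
    }
    await new Promise((resolve) => setTimeout(resolve, intervalMs));
  }
}
```

With the options BFL passes (`timeoutMs: 300_000`, `intervalMs: 3_000`), an image task would be polled every 3 seconds for up to 5 minutes.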
@@ -0,0 +1,14 @@
+ import type { MediaProvider, ProviderCapabilities, ImageParams, EditImageParams, VideoParams, AudioParams, GeneratedMedia, TranscribeParams, TranscribedText } from "./types.js";
+ export declare class ElevenLabsProvider implements MediaProvider {
+ readonly name = "elevenlabs";
+ readonly capabilities: ProviderCapabilities;
+ private apiKey;
+ constructor(apiKey: string);
+ generateImage(_params: ImageParams): Promise<GeneratedMedia>;
+ editImage(_params: EditImageParams): Promise<GeneratedMedia>;
+ generateVideo(_params: VideoParams): Promise<GeneratedMedia>;
+ generateAudio(params: AudioParams): Promise<GeneratedMedia>;
+ transcribeAudio(params: TranscribeParams): Promise<TranscribedText>;
+ private generateSpeech;
+ private generateSoundEffect;
+ }
@@ -0,0 +1,96 @@
+ const BASE_URL = "https://api.elevenlabs.io/v1";
+ const DEFAULT_VOICE_ID = "JBFqnCBsd6RMkjVDRZzb";
+ const DEFAULT_TTS_MODEL = "eleven_flash_v2_5";
+ const TRANSCRIPTION_MODEL = "scribe_v1";
+ export class ElevenLabsProvider {
+ name = "elevenlabs";
+ capabilities = {
+ supportsImageGeneration: false,
+ supportsImageEditing: false,
+ supportsVideoGeneration: false,
+ supportsAudioGeneration: true,
+ supportsTranscription: true,
+ supportedImageAspectRatios: [],
+ supportedVideoAspectRatios: [],
+ supportedVideoResolutions: [],
+ supportedAudioFormats: ["mp3", "pcm", "ulaw", "opus"],
+ maxVideoDurationSeconds: 0,
+ };
+ apiKey;
+ constructor(apiKey) {
+ this.apiKey = apiKey;
+ }
+ async generateImage(_params) {
+ throw new Error("ElevenLabs does not support image generation");
+ }
+ async editImage(_params) {
+ throw new Error("ElevenLabs does not support image editing");
+ }
+ async generateVideo(_params) {
+ throw new Error("ElevenLabs does not support video generation");
+ }
+ async generateAudio(params) {
+ const mode = params.providerOptions?.mode;
+ if (mode === "sound-effect") {
+ return this.generateSoundEffect(params);
+ }
+ return this.generateSpeech(params);
+ }
+ async transcribeAudio(params) {
+ const formData = new FormData();
+ const blob = new Blob([new Uint8Array(params.audioData)], { type: params.audioMimeType });
+ formData.append("file", blob, "audio");
+ formData.append("model_id", TRANSCRIPTION_MODEL);
+ if (params.language) {
+ formData.append("language_code", params.language);
+ }
+ const response = await fetch(`${BASE_URL}/speech-to-text`, {
+ method: "POST",
+ headers: { "xi-api-key": this.apiKey },
+ body: formData,
+ });
+ if (!response.ok) {
+ throw new Error(`ElevenLabs transcription failed: ${response.status}`);
+ }
+ const result = (await response.json());
+ return {
+ text: result.text,
+ metadata: { model: TRANSCRIPTION_MODEL, provider: "elevenlabs" },
+ };
+ }
+ async generateSpeech(params) {
+ const voiceId = params.voice ?? DEFAULT_VOICE_ID;
+ const options = params.providerOptions ?? {};
+ const modelId = options.model ?? DEFAULT_TTS_MODEL;
+ const filtered = Object.fromEntries(Object.entries(options).filter(([k]) => k !== "mode" && k !== "model"));
+ const response = await fetch(`${BASE_URL}/text-to-speech/${voiceId}`, {
+ method: "POST",
+ headers: { "Content-Type": "application/json", "xi-api-key": this.apiKey },
+ body: JSON.stringify({ text: params.text, model_id: modelId, ...filtered }),
+ });
+ if (!response.ok) {
+ throw new Error(`ElevenLabs TTS failed: ${response.status}`);
+ }
+ return {
+ data: Buffer.from(await response.arrayBuffer()),
+ mimeType: "audio/mpeg",
+ metadata: { model: modelId, provider: "elevenlabs", voice: voiceId },
+ };
+ }
+ async generateSoundEffect(params) {
+ const filtered = Object.fromEntries(Object.entries(params.providerOptions ?? {}).filter(([k]) => k !== "mode"));
+ const response = await fetch(`${BASE_URL}/text-to-sound-effects`, {
+ method: "POST",
+ headers: { "Content-Type": "application/json", "xi-api-key": this.apiKey },
+ body: JSON.stringify({ text: params.text, ...filtered }),
+ });
+ if (!response.ok) {
+ throw new Error(`ElevenLabs sound effect generation failed: ${response.status}`);
+ }
+ return {
+ data: Buffer.from(await response.arrayBuffer()),
+ mimeType: "audio/mpeg",
+ metadata: { provider: "elevenlabs" },
+ };
+ }
+ }
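Note how the routing keys in `providerOptions` (`mode`, `model`) are consumed by the provider itself and stripped before the remaining options are forwarded in the request body. The code above does this with `Object.fromEntries`; an equivalent, illustrative sketch using rest destructuring (the `stability` field below is a hypothetical pass-through option, not one defined by this package):

```javascript
// Split providerOptions into routing keys consumed locally (mode, model)
// and pass-through options forwarded to the ElevenLabs request body.
function splitProviderOptions(options = {}) {
  const { mode, model, ...passthrough } = options;
  return { mode, model, passthrough };
}

const split = splitProviderOptions({
  mode: "sound-effect",
  model: "eleven_flash_v2_5",
  stability: 0.5, // hypothetical extra option, forwarded untouched
});
console.log(split.mode, split.passthrough);
// → sound-effect { stability: 0.5 }
```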
@@ -7,6 +7,7 @@ export class GoogleProvider {
  supportsImageEditing: true,
  supportsVideoGeneration: true,
  supportsAudioGeneration: true,
+ supportsTranscription: false,
  supportedImageAspectRatios: ["1:1", "16:9", "9:16", "4:3", "3:4"],
  supportedVideoAspectRatios: ["16:9", "9:16"],
  supportedVideoResolutions: ["720p", "1080p"],
@@ -1,4 +1,4 @@
- import type { MediaProvider, ProviderCapabilities, ImageParams, EditImageParams, VideoParams, AudioParams, GeneratedMedia } from "./types.js";
+ import type { MediaProvider, ProviderCapabilities, ImageParams, EditImageParams, VideoParams, AudioParams, GeneratedMedia, TranscribeParams, TranscribedText } from "./types.js";
  export declare class OpenAIProvider implements MediaProvider {
  readonly name = "openai";
  readonly capabilities: ProviderCapabilities;
@@ -8,6 +8,7 @@ export declare class OpenAIProvider implements MediaProvider {
  editImage(params: EditImageParams): Promise<GeneratedMedia>;
  generateVideo(params: VideoParams): Promise<GeneratedMedia>;
  generateAudio(params: AudioParams): Promise<GeneratedMedia>;
+ transcribeAudio(params: TranscribeParams): Promise<TranscribedText>;
  private audioFormatToMimeType;
  private mapAspectRatioToSize;
  }
@@ -14,6 +14,7 @@ export class OpenAIProvider {
  supportsImageEditing: true,
  supportsVideoGeneration: true,
  supportsAudioGeneration: true,
+ supportsTranscription: true,
  supportedImageAspectRatios: ["1:1", "16:9", "9:16", "4:3", "3:4"],
  supportedVideoAspectRatios: ["16:9", "9:16", "1:1"],
  supportedVideoResolutions: ["480p", "720p", "1080p"],
@@ -96,6 +97,19 @@ export class OpenAIProvider {
  metadata: { model: "tts-1", provider: "openai", voice, format },
  };
  }
+ async transcribeAudio(params) {
+ const audioFile = new File([new Uint8Array(params.audioData)], "audio.wav", { type: params.audioMimeType });
+ const response = await this.client.audio.transcriptions.create({
+ model: "whisper-1",
+ file: audioFile,
+ language: params.language,
+ ...params.providerOptions,
+ });
+ return {
+ text: response.text,
+ metadata: { model: "whisper-1", provider: "openai" },
+ };
+ }
  audioFormatToMimeType(format) {
  const mimeTypes = {
  mp3: "audio/mpeg",
@@ -7,5 +7,6 @@ export declare class ProviderRegistry {
  getImageEditProviders(): MediaProvider[];
  getVideoProviders(): MediaProvider[];
  getAudioProviders(): MediaProvider[];
+ getTranscriptionProviders(): MediaProvider[];
  listCapabilities(): ProviderInfo[];
  }
@@ -22,6 +22,9 @@ export class ProviderRegistry {
  getAudioProviders() {
  return [...this.providers.values()].filter((p) => p.capabilities.supportsAudioGeneration);
  }
+ getTranscriptionProviders() {
+ return [...this.providers.values()].filter((p) => p.capabilities.supportsTranscription);
+ }
  listCapabilities() {
  return [...this.providers.values()].map((p) => ({
  name: p.name,
@@ -5,12 +5,14 @@ export interface MediaProvider {
  editImage(params: EditImageParams): Promise<GeneratedMedia>;
  generateVideo(params: VideoParams): Promise<GeneratedMedia>;
  generateAudio(params: AudioParams): Promise<GeneratedMedia>;
+ transcribeAudio?(params: TranscribeParams): Promise<TranscribedText>;
  }
  export interface ProviderCapabilities {
  supportsImageGeneration: boolean;
  supportsImageEditing: boolean;
  supportsVideoGeneration: boolean;
  supportsAudioGeneration: boolean;
+ supportsTranscription: boolean;
  supportedImageAspectRatios: string[];
  supportedVideoAspectRatios: string[];
  supportedVideoResolutions: string[];
@@ -50,6 +52,16 @@ export interface GeneratedMedia {
  mimeType: string;
  metadata: Record<string, unknown>;
  }
+ export interface TranscribeParams {
+ audioData: Buffer;
+ audioMimeType: string;
+ language?: string;
+ providerOptions?: Record<string, unknown>;
+ }
+ export interface TranscribedText {
+ text: string;
+ metadata: Record<string, unknown>;
+ }
  export interface ProviderInfo {
  name: string;
  capabilities: ProviderCapabilities;
@@ -10,6 +10,7 @@ export class XAIProvider {
  supportsImageEditing: true,
  supportsVideoGeneration: true,
  supportsAudioGeneration: false,
+ supportsTranscription: false,
  supportedImageAspectRatios: ["1:1", "16:9", "9:16", "4:3", "3:4"],
  supportedVideoAspectRatios: ["16:9", "9:16", "1:1"],
  supportedVideoResolutions: ["720p", "1080p"],
@@ -7,6 +7,14 @@ const EXTENSION_TO_MIME = {
  ".webp": "image/webp",
  ".gif": "image/gif",
  ".mp4": "video/mp4",
+ ".mp3": "audio/mpeg",
+ ".wav": "audio/wav",
+ ".flac": "audio/flac",
+ ".ogg": "audio/ogg",
+ ".m4a": "audio/mp4",
+ ".aac": "audio/aac",
+ ".opus": "audio/opus",
+ ".webm": "audio/webm",
  };
  export async function readMediaFile(filePath) {
  const absolutePath = resolve(filePath);
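The new audio entries extend the same extension-to-MIME lookup already used for images and video. A self-contained sketch of the resolution step (the diff does not show how `readMediaFile` handles unknown extensions, so the error branch here is an assumption):

```javascript
// Resolve a MIME type from a file extension, as readMediaFile does before
// handing audio bytes to a transcription provider. The table lists only
// the audio entries added in 1.3.0.
const EXTENSION_TO_MIME = {
  ".mp3": "audio/mpeg",
  ".wav": "audio/wav",
  ".flac": "audio/flac",
  ".ogg": "audio/ogg",
  ".m4a": "audio/mp4",
  ".aac": "audio/aac",
  ".opus": "audio/opus",
  ".webm": "audio/webm",
};

function mimeTypeFor(filePath) {
  const dot = filePath.lastIndexOf(".");
  const ext = dot === -1 ? "" : filePath.slice(dot).toLowerCase();
  const mime = EXTENSION_TO_MIME[ext];
  if (!mime) throw new Error(`Unsupported media extension: ${ext || "(none)"}`);
  return mime;
}

console.log(mimeTypeFor("/tmp/interview.m4a"));
// → audio/mp4
```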
package/build/server.js CHANGED
@@ -4,11 +4,14 @@ import { ProviderRegistry } from "./providers/registry.js";
  import { OpenAIProvider } from "./providers/openai.js";
  import { XAIProvider } from "./providers/xai.js";
  import { GoogleProvider } from "./providers/google.js";
+ import { ElevenLabsProvider } from "./providers/elevenlabs.js";
+ import { BFLProvider } from "./providers/bfl.js";
  import { FileManager } from "./file-manager.js";
  import { buildGenerateImageHandler } from "./tools/generate-image.js";
  import { buildEditImageHandler } from "./tools/edit-image.js";
  import { buildGenerateVideoHandler } from "./tools/generate-video.js";
  import { buildGenerateAudioHandler } from "./tools/generate-audio.js";
+ import { buildTranscribeAudioHandler } from "./tools/transcribe-audio.js";
  import { buildListProvidersHandler } from "./tools/list-providers.js";
  export function createServer(config) {
  const registry = new ProviderRegistry();
@@ -25,25 +28,34 @@ export function createServer(config) {
  registry.register(new GoogleProvider(config.googleApiKey));
  console.error("[server] Registered Google provider");
  }
+ if (config.elevenlabsApiKey) {
+ registry.register(new ElevenLabsProvider(config.elevenlabsApiKey));
+ console.error("[server] Registered ElevenLabs provider");
+ }
+ if (config.bflApiKey) {
+ registry.register(new BFLProvider(config.bflApiKey));
+ console.error("[server] Registered BFL provider");
+ }
  const generateImageHandler = buildGenerateImageHandler(registry, fileManager);
  const editImageHandler = buildEditImageHandler(registry, fileManager);
  const generateVideoHandler = buildGenerateVideoHandler(registry, fileManager);
  const generateAudioHandler = buildGenerateAudioHandler(registry, fileManager);
+ const transcribeAudioHandler = buildTranscribeAudioHandler(registry);
  const listProvidersHandler = buildListProvidersHandler(registry);
  const providerNames = registry.listCapabilities().map((p) => p.name).join(", ") || "none configured";
  const server = new McpServer({ name: "multimodal-mcp", version: "1.0.0" });
- server.tool("generate_image", `Generate an image from a text prompt using AI. Available providers: ${providerNames}`, {
+ server.tool("generate_image", `Generate an image from a text prompt using AI. Providers: openai (DALL-E), xai (Aurora), google (Imagen), bfl (FLUX). Available: ${providerNames}`, {
  prompt: z.string().describe("Text description of the image to generate"),
- provider: z.string().optional().describe("Provider to use: openai, xai, google. Auto-selects if omitted."),
+ provider: z.string().optional().describe("Provider to use: openai, xai, google, bfl. Auto-selects if omitted."),
  aspectRatio: z.string().optional().describe("Aspect ratio: 1:1, 16:9, 9:16, 4:3, 3:4"),
  quality: z.string().optional().describe("Quality level: low, standard, high"),
  outputDirectory: z.string().optional().describe("Directory to save the generated file. Supports absolute or relative paths (resolved from cwd). Defaults to MEDIA_OUTPUT_DIR env var or cwd."),
  providerOptions: z.record(z.string(), z.unknown()).optional().describe("Provider-specific parameters passed through directly"),
  }, async (params) => generateImageHandler(params));
- server.tool("edit_image", `Edit an existing image using AI. Provide the path to an image and a text prompt describing the desired edits. Available providers: ${providerNames}`, {
+ server.tool("edit_image", `Edit an existing image using AI. Provide the path to an image and a text prompt describing the desired edits. Providers: openai, xai, google, bfl (FLUX Kontext). Available: ${providerNames}`, {
  imagePath: z.string().describe("Absolute path to the source image file to edit"),
  prompt: z.string().describe("Text description of the edits to apply to the image"),
- provider: z.string().optional().describe("Provider to use: openai, xai, google. Auto-selects if omitted."),
+ provider: z.string().optional().describe("Provider to use: openai, xai, google, bfl. Auto-selects if omitted."),
  outputDirectory: z.string().optional().describe("Directory to save the edited file. Supports absolute or relative paths (resolved from cwd). Defaults to MEDIA_OUTPUT_DIR env var or cwd."),
  providerOptions: z.record(z.string(), z.unknown()).optional().describe("Provider-specific parameters passed through directly"),
  }, async (params) => editImageHandler(params));
@@ -57,15 +69,21 @@ export function createServer(config) {
  outputDirectory: z.string().optional().describe("Directory to save the generated file. Supports absolute or relative paths (resolved from cwd). Defaults to MEDIA_OUTPUT_DIR env var or cwd."),
  providerOptions: z.record(z.string(), z.unknown()).optional().describe("Provider-specific parameters passed through directly"),
  }, async (params) => generateVideoHandler(params));
- server.tool("generate_audio", `Generate audio (text-to-speech) from text using AI. Available providers: ${providerNames}`, {
- text: z.string().describe("Text to convert to speech"),
- provider: z.string().optional().describe("Provider to use: openai, google. Auto-selects if omitted."),
- voice: z.string().optional().describe("Voice name (provider-specific). OpenAI: alloy, ash, coral, echo, fable, nova, onyx, sage, shimmer. Google: Kore, Charon, Fenrir, Aoede, Puck, etc."),
+ server.tool("generate_audio", `Generate audio from text using AI. Supports text-to-speech and sound effects. Providers: openai, google, elevenlabs. ElevenLabs: use providerOptions.mode = "sound-effect" for sound effects. Available: ${providerNames}`, {
+ text: z.string().describe("Text to convert to speech, or a description of the sound effect to generate"),
+ provider: z.string().optional().describe("Provider to use: openai, google, elevenlabs. Auto-selects if omitted."),
+ voice: z.string().optional().describe("Voice name (provider-specific). OpenAI: alloy, ash, coral, echo, fable, nova, onyx, sage, shimmer. Google: Kore, Charon, Fenrir, Aoede, Puck, etc. ElevenLabs: voice ID."),
  speed: z.number().optional().describe("Speech speed multiplier (OpenAI only): 0.25 to 4.0"),
  format: z.string().optional().describe("Output format (OpenAI only): mp3, opus, aac, flac, wav, pcm"),
  outputDirectory: z.string().optional().describe("Directory to save the generated file. Supports absolute or relative paths (resolved from cwd). Defaults to MEDIA_OUTPUT_DIR env var or cwd."),
  providerOptions: z.record(z.string(), z.unknown()).optional().describe("Provider-specific parameters passed through directly"),
  }, async (params) => generateAudioHandler(params));
+ server.tool("transcribe_audio", `Transcribe audio to text using AI (speech-to-text). Providers: openai (Whisper), elevenlabs (Scribe). Available: ${providerNames}`, {
+ audioPath: z.string().describe("Absolute path to the audio file to transcribe"),
+ provider: z.string().optional().describe("Provider to use: openai, elevenlabs. Auto-selects if omitted."),
+ language: z.string().optional().describe("Language code (e.g., 'en', 'fr', 'es') to hint the transcription language"),
+ providerOptions: z.record(z.string(), z.unknown()).optional().describe("Provider-specific parameters passed through directly"),
+ }, async (params) => transcribeAudioHandler(params));
  server.tool("list_providers", "List all configured media generation providers and their capabilities", async () => listProvidersHandler());
  return server;
  }
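The `generate_audio` registration above routes ElevenLabs sound effects through `providerOptions.mode = "sound-effect"`. A minimal sketch of the argument object a client might pass for that path; the prompt text and the `isSoundEffect` check are illustrative, not part of the package:

```javascript
// Hypothetical arguments for the generate_audio tool in sound-effect mode,
// following the providerOptions.mode convention described in the tool text.
const soundEffectCall = {
  text: "rain hitting a tin roof", // description of the effect, not speech text
  provider: "elevenlabs",
  providerOptions: { mode: "sound-effect" },
};

// A server-side dispatcher could branch on this combination.
const isSoundEffect =
  soundEffectCall.provider === "elevenlabs" &&
  soundEffectCall.providerOptions?.mode === "sound-effect";

console.log(isSoundEffect); // true
```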
@@ -5,7 +5,7 @@ export function buildListProvidersHandler(registry) {
  return {
  content: [{
  type: "text",
- text: "No providers configured. Set one or more API keys: OPENAI_API_KEY, XAI_API_KEY, GEMINI_API_KEY",
+ text: "No providers configured. Set one or more API keys: OPENAI_API_KEY, XAI_API_KEY, GEMINI_API_KEY, ELEVENLABS_API_KEY, BFL_API_KEY",
  }],
  };
  }
@@ -19,6 +19,8 @@ export function buildListProvidersHandler(registry) {
  caps.push("video");
  if (p.capabilities.supportsAudioGeneration)
  caps.push("audio");
+ if (p.capabilities.supportsTranscription)
+ caps.push("transcription");
  return `- ${p.name}: ${caps.join(", ")}`;
  });
  return {
@@ -0,0 +1,19 @@
+ import type { ProviderRegistry } from "../providers/registry.js";
+ export declare function buildTranscribeAudioHandler(registry: ProviderRegistry): (params: {
+ audioPath: string;
+ provider?: string;
+ language?: string;
+ providerOptions?: Record<string, unknown>;
+ }) => Promise<{
+ isError: true;
+ content: {
+ type: "text";
+ text: string;
+ }[];
+ } | {
+ content: {
+ type: "text";
+ text: string;
+ }[];
+ isError?: undefined;
+ }>;
@@ -0,0 +1,48 @@
+ import { readMediaFile } from "../read-media-file.js";
+ import { sanitizeError } from "../errors.js";
+ export function buildTranscribeAudioHandler(registry) {
+ return async (params) => {
+ const provider = params.provider
+ ? registry.getProvider(params.provider)
+ : registry.getTranscriptionProviders()[0];
+ if (!provider) {
+ const available = registry.getTranscriptionProviders().map((p) => p.name).join(", ") || "none";
+ const text = params.provider
+ ? `Provider "${params.provider}" is not configured or does not support transcription. Available transcription providers: ${available}`
+ : `No transcription provider available. Available transcription providers: ${available}`;
+ return {
+ isError: true,
+ content: [{ type: "text", text }],
+ };
+ }
+ if (!provider.capabilities.supportsTranscription || !provider.transcribeAudio) {
+ const available = registry.getTranscriptionProviders().map((p) => p.name).join(", ") || "none";
+ return {
+ isError: true,
+ content: [{
+ type: "text",
+ text: `Provider "${provider.name}" does not support transcription. Available transcription providers: ${available}`,
+ }],
+ };
+ }
+ try {
+ const { data, mimeType } = await readMediaFile(params.audioPath);
+ const result = await provider.transcribeAudio({
+ audioData: data,
+ audioMimeType: mimeType,
+ language: params.language,
+ providerOptions: params.providerOptions,
+ });
+ return {
+ content: [{ type: "text", text: result.text }],
+ };
+ }
+ catch (error) {
+ const message = sanitizeError(error);
+ return {
+ isError: true,
+ content: [{ type: "text", text: `Transcription failed: ${message}` }],
+ };
+ }
+ };
+ }
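The handler above selects a provider in two steps: an explicit `provider` param wins, otherwise the first transcription-capable provider in the registry is used, with the capability check applied afterward. That rule can be exercised in isolation with a stub registry; the mock providers and the `selectProvider` helper here are illustrative sketches, not part of the package:

```javascript
// Hypothetical mock providers mirroring the registry shape the handler expects.
const providers = [
  { name: "openai", capabilities: { supportsTranscription: true },
    transcribeAudio: async () => ({ text: "hello world" }) },
  { name: "xai", capabilities: { supportsTranscription: false } },
];

const registry = {
  getProvider: (name) => providers.find((p) => p.name === name),
  getTranscriptionProviders: () =>
    providers.filter((p) => p.capabilities.supportsTranscription),
};

// Mirrors the handler's selection rule: explicit provider wins; otherwise
// the first transcription-capable provider is auto-selected.
function selectProvider(params) {
  return params.provider
    ? registry.getProvider(params.provider)
    : registry.getTranscriptionProviders()[0];
}

console.log(selectProvider({}).name);                  // "openai" (auto-selected)
console.log(selectProvider({ provider: "xai" }).name); // "xai" (capability check rejects it later)
```

Note that an explicitly named but transcription-incapable provider (like the mock `xai` here) is still returned by selection; the handler's second guard is what produces the "does not support transcription" error for it.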
package/package.json CHANGED
@@ -1,6 +1,7 @@
  {
  "name": "@r16t/multimodal-mcp",
- "version": "1.2.3",
+ "version": "1.3.0",
+ "mcpName": "io.github.rsmdt/multimodal",
  "description": "Multi-provider media generation MCP server",
  "type": "module",
  "main": "build/index.js",