npm - universal-llm-client - Versions diffs - 4.5.0 → 4.5.1 - Mend

universal-llm-client 4.5.0 → 4.5.1

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (174)

package/CHANGELOG.md +12 -0
package/README.md +2 -0
package/dist/ai-model.d.ts +0 -1
package/dist/ai-model.js +0 -1
package/dist/auditor.d.ts +0 -1
package/dist/auditor.js +0 -1
package/dist/client.d.ts +0 -1
package/dist/client.js +0 -1
package/dist/gemma-channel.d.ts +0 -1
package/dist/gemma-channel.js +0 -1
package/dist/gemma-diffusion.d.ts +0 -1
package/dist/gemma-diffusion.js +0 -1
package/dist/http.d.ts +0 -1
package/dist/http.js +0 -1
package/dist/index.d.ts +0 -1
package/dist/index.js +0 -1
package/dist/interfaces.d.ts +0 -1
package/dist/interfaces.js +0 -1
package/dist/mcp.d.ts +0 -1
package/dist/mcp.js +0 -1
package/dist/providers/anthropic.d.ts +0 -1
package/dist/providers/anthropic.js +0 -1
package/dist/providers/google.d.ts +0 -1
package/dist/providers/google.js +0 -1
package/dist/providers/index.d.ts +0 -1
package/dist/providers/index.js +0 -1
package/dist/providers/ollama.d.ts +0 -1
package/dist/providers/ollama.js +0 -1
package/dist/providers/openai.d.ts +2 -1
package/dist/providers/openai.js +303 -74
package/dist/router.d.ts +0 -1
package/dist/router.js +0 -1
package/dist/stream-decoder.d.ts +0 -1
package/dist/stream-decoder.js +0 -1
package/dist/structured-output.d.ts +0 -1
package/dist/structured-output.js +0 -1
package/dist/thinking.d.ts +0 -1
package/dist/thinking.js +0 -1
package/dist/tools.d.ts +0 -1
package/dist/tools.js +0 -1
package/dist/zod-adapter.d.ts +0 -1
package/dist/zod-adapter.js +0 -1
package/package.json +1 -2
package/dist/ai-model.d.ts.map +0 -1
package/dist/ai-model.js.map +0 -1
package/dist/auditor.d.ts.map +0 -1
package/dist/auditor.js.map +0 -1
package/dist/client.d.ts.map +0 -1
package/dist/client.js.map +0 -1
package/dist/gemma-channel.d.ts.map +0 -1
package/dist/gemma-channel.js.map +0 -1
package/dist/gemma-diffusion.d.ts.map +0 -1
package/dist/gemma-diffusion.js.map +0 -1
package/dist/http.d.ts.map +0 -1
package/dist/http.js.map +0 -1
package/dist/index.d.ts.map +0 -1
package/dist/index.js.map +0 -1
package/dist/interfaces.d.ts.map +0 -1
package/dist/interfaces.js.map +0 -1
package/dist/mcp.d.ts.map +0 -1
package/dist/mcp.js.map +0 -1
package/dist/providers/anthropic.d.ts.map +0 -1
package/dist/providers/anthropic.js.map +0 -1
package/dist/providers/google.d.ts.map +0 -1
package/dist/providers/google.js.map +0 -1
package/dist/providers/index.d.ts.map +0 -1
package/dist/providers/index.js.map +0 -1
package/dist/providers/ollama.d.ts.map +0 -1
package/dist/providers/ollama.js.map +0 -1
package/dist/providers/openai.d.ts.map +0 -1
package/dist/providers/openai.js.map +0 -1
package/dist/router.d.ts.map +0 -1
package/dist/router.js.map +0 -1
package/dist/stream-decoder.d.ts.map +0 -1
package/dist/stream-decoder.js.map +0 -1
package/dist/structured-output.d.ts.map +0 -1
package/dist/structured-output.js.map +0 -1
package/dist/thinking.d.ts.map +0 -1
package/dist/thinking.js.map +0 -1
package/dist/tools.d.ts.map +0 -1
package/dist/tools.js.map +0 -1
package/dist/zod-adapter.d.ts.map +0 -1
package/dist/zod-adapter.js.map +0 -1
package/src/ai-model.ts +0 -400
package/src/auditor.ts +0 -213
package/src/client.ts +0 -402
package/src/debug/debug-google-streaming.ts +0 -97
package/src/debug/debug-tool-execution.ts +0 -86
package/src/debug/test-lmstudio-tools.ts +0 -155
package/src/demos/README.md +0 -47
package/src/demos/basic/universal-llm-examples.ts +0 -161
package/src/demos/diffusion-gemma/.env +0 -29
package/src/demos/diffusion-gemma/.env.example +0 -27
package/src/demos/diffusion-gemma/CLAUDE.md +0 -95
package/src/demos/diffusion-gemma/README.md +0 -59
package/src/demos/diffusion-gemma/canvas.ts +0 -1606
package/src/demos/diffusion-gemma/docker-compose.yml +0 -29
package/src/demos/diffusion-gemma/probe-stream.ts +0 -51
package/src/demos/diffusion-gemma/probe-tools.ts +0 -55
package/src/demos/diffusion-gemma/server.ts +0 -1205
package/src/demos/diffusion-gemma/start-vllm.sh +0 -98
package/src/demos/mcp/astrid-memory-demo.ts +0 -295
package/src/demos/mcp/astrid-persona-memory.ts +0 -357
package/src/demos/mcp/mcp-mongodb-demo.ts +0 -275
package/src/demos/mcp/simple-astrid-memory.ts +0 -148
package/src/demos/mcp/simple-mcp-demo.ts +0 -68
package/src/demos/mcp/working-mcp-demo.ts +0 -62
package/src/demos/model-alias-demo.ts +0 -0
package/src/demos/tools/RAG_MEMORY_INTEGRATION.md +0 -267
package/src/demos/tools/astrid-memory-demo.ts +0 -270
package/src/demos/tools/astrid-production-memory-clean.ts +0 -785
package/src/demos/tools/astrid-production-memory.ts +0 -558
package/src/demos/tools/basic-translation-test.ts +0 -66
package/src/demos/tools/chromadb-similarity-tuning.ts +0 -390
package/src/demos/tools/clean-multilingual-conversation.ts +0 -209
package/src/demos/tools/clean-translation-test.ts +0 -119
package/src/demos/tools/clean-universal-multilingual-test.ts +0 -131
package/src/demos/tools/complete-rag-demo.ts +0 -369
package/src/demos/tools/complete-tool-demo.ts +0 -132
package/src/demos/tools/demo-tool-calling.ts +0 -124
package/src/demos/tools/dynamic-language-switching-test.ts +0 -251
package/src/demos/tools/hybrid-thinking-test.ts +0 -154
package/src/demos/tools/memory-integration-test.ts +0 -420
package/src/demos/tools/multilingual-memory-system.ts +0 -802
package/src/demos/tools/ondemand-translation-demo.ts +0 -655
package/src/demos/tools/production-tool-demo.ts +0 -245
package/src/demos/tools/revolutionary-multilingual-test.ts +0 -151
package/src/demos/tools/rigorous-language-analysis.ts +0 -218
package/src/demos/tools/test-universal-memory-system.ts +0 -126
package/src/demos/tools/translation-integration-guide.ts +0 -346
package/src/demos/tools/universal-memory-system.ts +0 -560
package/src/gemma-channel.ts +0 -47
package/src/gemma-diffusion.ts +0 -167
package/src/http.ts +0 -261
package/src/index.ts +0 -180
package/src/interfaces.ts +0 -843
package/src/mcp.ts +0 -345
package/src/providers/anthropic.ts +0 -796
package/src/providers/google.ts +0 -840
package/src/providers/index.ts +0 -8
package/src/providers/ollama.ts +0 -503
package/src/providers/openai.ts +0 -587
package/src/router.ts +0 -785
package/src/stream-decoder.ts +0 -535
package/src/structured-output.ts +0 -759
package/src/test-scripts/test-advanced-tools.ts +0 -310
package/src/test-scripts/test-google-deep-research.ts +0 -33
package/src/test-scripts/test-google-streaming-enhanced.ts +0 -147
package/src/test-scripts/test-google-streaming.ts +0 -63
package/src/test-scripts/test-google-system-prompt-comprehensive.ts +0 -189
package/src/test-scripts/test-google-thinking.ts +0 -46
package/src/test-scripts/test-mcp-config.ts +0 -28
package/src/test-scripts/test-mcp-connection.ts +0 -29
package/src/test-scripts/test-system-message-positions.ts +0 -163
package/src/test-scripts/test-system-prompt-improvement-demo.ts +0 -83
package/src/test-scripts/test-tool-calling.ts +0 -231
package/src/test-scripts/test-vllm-qwen36.ts +0 -256
package/src/tests/ai-model.test.ts +0 -1614
package/src/tests/auditor.test.ts +0 -224
package/src/tests/gemma-diffusion.test.ts +0 -115
package/src/tests/http.test.ts +0 -200
package/src/tests/interfaces.test.ts +0 -117
package/src/tests/providers/anthropic.test.ts +0 -118
package/src/tests/providers/google.test.ts +0 -841
package/src/tests/providers/ollama.test.ts +0 -1034
package/src/tests/providers/openai.test.ts +0 -1511
package/src/tests/router.test.ts +0 -254
package/src/tests/stream-decoder.test.ts +0 -263
package/src/tests/structured-output.test.ts +0 -1450
package/src/tests/thinking.test.ts +0 -65
package/src/tests/tools.test.ts +0 -175
package/src/thinking.ts +0 -73
package/src/tools.ts +0 -246
package/src/zod-adapter.ts +0 -72

package/src/debug/test-lmstudio-tools.ts DELETED

@@ -1,155 +0,0 @@
-/**
- * Debug test specifically for LM Studio tool calling with OpenAI API
- */
-async function testLMStudioTools() {
-    console.log('🔧 Testing LM Studio Tool Calling with OpenAI API\n');
-    // Create LM Studio model using OpenAI-compatible API
-    const model = AIModelFactory.createOpenAIChatModel(
-        'qwen/qwen3-8b',  // Tool-trained model
-        'http://192.168.100.136:1234/v1'  // LM Studio URL
-    );
-    // Create simple test tools (same as our router test)
-    const testTools = [
-        ToolBuilder.createTool<{ expression: string }>(
-            'calculate',
-            'Perform mathematical calculations',
-            {
-                properties: {
-                    expression: {
-                        type: 'string',
-                        description: 'Mathematical expression to evaluate'
-                    }
-                },
-                required: ['expression']
-            },
-            (args) => {
-                try {
-                    const result = eval(args.expression); // Safe for testing
-                    console.log(`🔧 Tool executed: calculate(${args.expression}) = ${result}`);
-                    return `The result of ${args.expression} is ${result}`;
-                } catch (error) {
-                    console.log(`❌ Tool error: calculate(${args.expression}) failed`);
-                    return `Error calculating ${args.expression}: ${error}`;
-                }
-            }
-        ),
-        ToolBuilder.createTool<{ location: string }>(
-            'get_weather',
-            'Get weather information',
-            {
-                properties: {
-                    location: {
-                        type: 'string',
-                        description: 'City and state/country'
-                    }
-                },
-                required: ['location']
-            },
-            (args) => {
-                console.log(`🔧 Tool executed: get_weather(${args.location})`);
-                return `The weather in ${args.location} is sunny and 72°F.`;
-            }
-        )
-    ];
-    // Register tools
-    model.registerTools(testTools);
-    console.log('📝 Registered tools:', testTools.map(t => t.name).join(', '));
-    console.log('');
-    try {
-        await model.ensureReady();
-        console.log('✅ LM Studio model ready\n');
-        // Test 1: Basic calculation
-        console.log('🧮 Test 1: Basic Calculation');
-        console.log('Question: What is 15 multiplied by 23?');
-        const response1 = await model.chatWithTools([
-            { role: 'user', content: 'What is 15 multiplied by 23? Use the calculate tool.' }
-        ]);
-        console.log('Response:', response1.message?.content || 'No content');
-        console.log('Usage:', response1.usage);
-        console.log('');
-        // Test 2: Weather query
-        console.log('🌤️ Test 2: Weather Query');
-        console.log('Question: Weather in San Francisco');
-        const response2 = await model.chatWithTools([
-            { role: 'user', content: 'What is the weather in San Francisco, CA?' }
-        ]);
-        console.log('Response:', response2.message?.content || 'No content');
-        console.log('');
-        // Test 3: Manual tool call inspection (without auto-execution)
-        console.log('🔍 Test 3: Manual Tool Call Inspection');
-        console.log('Question: Calculate 100 / 4 (without auto-execution)');
-        const response3 = await model.chat([
-            { role: 'user', content: 'Calculate 100 divided by 4 using the calculate tool.' }
-        ], {}, {
-            tool_choice: 'auto',
-            executeTools: false  // Don't auto-execute
-        });
-        console.log('Response content:', response3.message?.content || 'No content');
-        console.log('Tool calls detected:', response3.message?.tool_calls?.length || 0);
-        if (response3.message?.tool_calls) {
-            response3.message.tool_calls.forEach((call, i) => {
-                console.log(`  Tool ${i + 1}: ${call.function.name}(${call.function.arguments})`);
-            });
-        }
-        console.log('');
-        // Test 4: Different tool_choice settings
-        console.log('🎯 Test 4: Different tool_choice Settings');
-        const testCases = [
-            { choice: 'auto', desc: 'Auto (let model decide)' },
-            { choice: 'required', desc: 'Required (force tool use)' },
-            { choice: { type: 'function', function: { name: 'calculate' } }, desc: 'Specific tool' }
-        ];
-        for (const testCase of testCases) {
-            console.log(`Testing tool_choice: ${testCase.desc}`);
-            try {
-                const response = await model.chat([
-                    { role: 'user', content: 'What is 50 + 50?' }
-                ], {}, {
-                    tool_choice: testCase.choice as any,
-                    executeTools: false
-                });
-                console.log(`  Content: ${response.message?.content?.substring(0, 100) || 'No content'}${(response.message?.content?.length || 0) > 100 ? '...' : ''}`);
-                console.log(`  Tool calls: ${response.message?.tool_calls?.length || 0}`);
-            } catch (error) {
-                console.log(`  Error: ${(error as Error).message}`);
-            }
-            console.log('');
-        }
-    } catch (error) {
-        console.error('❌ Test failed:', (error as Error).message);
-        console.error('Stack:', (error as Error).stack);
-    } finally {
-        model.dispose();
-        console.log('🏁 Test completed!\n');
-    }
-}
-// Run the test
-if (require.main === module) {
-    testLMStudioTools().catch(console.error);
-}
-export { testLMStudioTools };

package/src/demos/README.md DELETED

@@ -1,47 +0,0 @@
-# Universal LLM Client Demos
-This directory contains various demonstration files for the Universal LLM Client.
-## Directory Structure
-### `/basic`
-Basic usage examples and simple demonstrations.
-### `/tools`
-Tool calling demonstrations and examples:
-- `demo-tool-calling.ts` - Basic tool calling demo
-- `complete-tool-demo.ts` - Comprehensive tool calling examples
-- `production-tool-demo.ts` - Production-ready tool calling implementation
-### `/mcp`
-Model Context Protocol (MCP) integration demos:
-- `simple-mcp-demo.ts` - Clean MCP MongoDB demo (recommended)
-- `mcp-mongodb-demo.ts` - Detailed MCP MongoDB integration
-- `working-mcp-demo.ts` - Working MCP implementation example
-## Running Demos
-From the root directory (`universal-llm-client/`):
-```bash
-# Tool calling demos
-bun run demos/tools/production-tool-demo.ts
-# MCP demos (requires MongoDB MCP server)
-bun run demos/mcp/simple-mcp-demo.ts
-# Basic usage
-bun run demos/basic/[demo-file].ts
-```
-## Requirements
-- Node.js/Bun runtime
-- Configured Ollama instance (for Ollama provider)
-- MongoDB MCP server (for MCP demos)
-- API keys for external providers (OpenAI, Google)

package/src/demos/basic/universal-llm-examples.ts DELETED

@@ -1,161 +0,0 @@
-import { AIModelFactory } from "./factory";
-/**
- * Example: Complete AI Application Setup
- *
- * This example shows how to set up a complete AI application
- * with both chat and embedding capabilities using the factory.
- */
-export async function createAIApplicationExample() {
-    console.log('\n🏗️  Example: Setting up a complete AI application...\n');
-    // Method 1: Using the factory for easy setup
-    const aiSetup = AIModelFactory.createCompleteSetup({
-        ollama: {
-            chatModel: 'gemma3:4b-it-qat',
-            embeddingModel: 'snowflake-arctic-embed2:latest',
-            url: 'http://localhost:11434'
-        },
-        openai: {
-            chatModel: 'google/gemma-3-4b',
-            embeddingModel: 'text-embedding-snowflake-arctic-embed-l-v2.0',
-            url: 'http://localhost:1234/v1'
-            // apiKey: 'your-api-key-here' // For real OpenAI API
-        },
-        google: {
-            chatModel: 'gemma-3-4b-it',
-            apiKey: (process.env.GOOGLE_API_KEY ?? '')
-        }
-    });
-    // Method 2: Individual setup for specific use cases
-    const chatModel = AIModelFactory.createOllamaChatModel('gemma3:4b-it-qat');
-    const embeddingModel = AIModelFactory.createOllamaEmbeddingModel('snowflake-arctic-embed2:latest');
-    // Method 3: Google-specific setup
-    const googleChatModel = AIModelFactory.createGoogleChatModel(
-        'gemma-3-4b-it',
-        (process.env.GOOGLE_API_KEY ?? '')
-    );
-    // Example usage patterns:
-    console.log('🤖 Testing Google Generative AI...');
-    try {
-        const googleResponse = await googleChatModel.chat([
-            { role: 'user', content: 'What is the capital of France?' }
-        ]);
-        console.log('Google Response:', googleResponse.content);
-    } catch (error) {
-        console.error('Google API Error:', error);
-    }
-    console.log('\n🌊 Testing Google Streaming...');
-    try {
-        const streamingGenerator = googleChatModel.chatStream([
-            { role: 'user', content: 'Count from 1 to 5, explaining each number briefly.' }
-        ]);
-        for await (const chunk of streamingGenerator) {
-            process.stdout.write(chunk);
-        }
-        console.log('\n✅ Google streaming completed');
-    } catch (error) {
-        console.error('Google Streaming Error:', error);
-    }
-    // 1. Simple chat
-    const chatResponse = await aiSetup.ollama.chat.chat([
-        { role: 'user', content: 'What is machine learning?' }
-    ]);
-    // 2. Document embedding for search
-    const documents = [
-        'Machine learning is a subset of AI',
-        'Neural networks are inspired by the brain',
-        'TypeScript is a typed superset of JavaScript'
-    ];
-    const embeddings = await Promise.all(
-        documents.map(doc => aiSetup.ollama.embedding.embed(doc))
-    );
-    // 3. Cross-provider comparison including Google
-    const comparisons = await Promise.all([
-        aiSetup.ollama?.chat?.chat([{ role: 'user', content: 'What is the capital of France?' }]).catch(() => null),
-        aiSetup.openai?.chat?.chat([{ role: 'user', content: 'What is the capital of France?' }]).catch(() => null),
-        aiSetup.google?.chat?.chat([{ role: 'user', content: 'What is the capital of France?' }]).catch(() => null)
-    ]);
-    return {
-        aiSetup,
-        chatModel,
-        embeddingModel,
-        googleChatModel,
-        examples: {
-            chatResponse,
-            embeddings,
-            comparisons: {
-                ollama: comparisons[0],
-                openai: comparisons[1],
-                google: comparisons[2]
-            }
-        }
-    };
-}
-/**
- * Test Google API specifically
- */
-export async function testGoogleAPI() {
-    console.log('🧪 Testing Google Generative AI API...\n');
-    const googleModel = AIModelFactory.createGoogleChatModel(
-        'gemma-3-4b-it',
-        (process.env.GOOGLE_API_KEY ?? '')
-    );
-    try {
-        // Test basic chat
-        console.log('💬 Basic chat test...');
-        const response = await googleModel.chat([
-            { role: 'user', content: 'Hello! Can you tell me about TypeScript?' }
-        ]);
-        console.log('Response:', response.content);
-        // Test streaming
-        console.log('\n🌊 Streaming test...');
-        const streamResponse = googleModel.chatStream([
-            { role: 'user', content: 'Tell me a short story about a robot learning to code.' }
-        ]);
-        let fullResponse = '';
-        for await (const chunk of streamResponse) {
-            process.stdout.write(chunk);
-            fullResponse += chunk;
-        }
-        console.log('\n✅ Streaming completed');
-        // Test conversation with context
-        console.log('\n💭 Conversation with context...');
-        const conversationResponse = await googleModel.chat([
-            { role: 'user', content: 'What is the capital of France?' },
-            { role: 'assistant', content: 'The capital of France is Paris.' },
-            { role: 'user', content: 'What is its population?' }
-        ]);
-        console.log('Conversation Response:', conversationResponse.content);
-        console.log('\n✅ All Google API tests completed successfully!');
-        return { success: true, model: googleModel };
-    } catch (error) {
-        console.error('❌ Google API test failed:', error);
-        return { success: false, error };
-    }
-}
-// Uncomment to run the examples
-// testGoogleAPI().then(result => {
-//     console.log('Google API test complete:', result.success);
-// });

package/src/demos/diffusion-gemma/.env DELETED

@@ -1,29 +0,0 @@
-# Optional docker compose overrides for the DiffusionGemma vLLM backend.
-#
-# Start from packages/universal-llm-client:
-#   docker compose --env-file src/demos/diffusion-gemma/.env -f src/demos/diffusion-gemma/docker-compose.yml up -d
-# Public vLLM image to run. If a future nightly regresses DiffusionGemma support,
-# set this to a known-good local or registry tag.
-VLLM_IMAGE=vllm/vllm-openai:gemma
-# Host port for the OpenAI-compatible vLLM API.
-VLLM_PORT=18000
-VLLM_URL=http://localhost:18000
-# DiffusionGemma model served by vLLM.
-MODEL_NAME=RedHatAI/diffusiongemma-26B-A4B-it-NVFP4
-# Single-user local serving defaults. Tune for your GPU.
-GPU_MEM_UTIL=0.28
-MAX_MODEL_LEN=32768
-MAX_NUM_SEQS=1
-DIFFUSION_ENTROPY=0.1
-# Set to 1 only for CUDA graph / torch.compile debugging.
-ENFORCE_EAGER=0
-# Disable vLLM telemetry. In WSL this avoids a py-cpuinfo JSONDecodeError in
-# vLLM's background usage-reporting thread during engine startup/reload.
-VLLM_NO_USAGE_STATS=1

package/src/demos/diffusion-gemma/.env.example DELETED

@@ -1,27 +0,0 @@
-# Optional docker compose overrides for the DiffusionGemma vLLM backend.
-#
-# Start from packages/universal-llm-client:
-#   docker compose --env-file src/demos/diffusion-gemma/.env -f src/demos/diffusion-gemma/docker-compose.yml up -d
-# Public vLLM image to run. If a future nightly regresses DiffusionGemma support,
-# set this to a known-good local or registry tag.
-VLLM_IMAGE=vllm/vllm-openai:gemma
-# Host port for the OpenAI-compatible vLLM API.
-VLLM_PORT=8000
-# DiffusionGemma model served by vLLM.
-MODEL_NAME=RedHatAI/diffusiongemma-26B-A4B-it-NVFP4
-# Single-user local serving defaults. Tune for your GPU.
-GPU_MEM_UTIL=0.28
-MAX_MODEL_LEN=32768
-MAX_NUM_SEQS=1
-DIFFUSION_ENTROPY=0.1
-# Set to 1 only for CUDA graph / torch.compile debugging.
-ENFORCE_EAGER=0
-# Disable vLLM telemetry. In WSL this avoids a py-cpuinfo JSONDecodeError in
-# vLLM's background usage-reporting thread during engine startup/reload.
-VLLM_NO_USAGE_STATS=1

package/src/demos/diffusion-gemma/CLAUDE.md DELETED

@@ -1,95 +0,0 @@
-# DiffusionGemma demo — test harness + "Signal from Noise" canvas
-Standalone Bun server exercising `universal-llm-client` against DiffusionGemma
-(a discrete diffusion LM served by vLLM).
-## Run
-```bash
-bun run demo:diffusion-gemma:engine  # starts vLLM via demo-local docker compose
-bun run demo:diffusion-gemma         # starts the Bun demo server
-```
-- Demo server: **http://localhost:3333** (`/` test harness, `/canvas` diffusion chat UI)
-- vLLM upstream: `VLLM_URL` env, default `http://localhost:8000`
-- Model: `MODEL_NAME` env, default `RedHatAI/diffusiongemma-26B-A4B-it-NVFP4`
-- vLLM is started via `src/demos/diffusion-gemma/docker-compose.yml` and
-  `src/demos/diffusion-gemma/start-vllm.sh` — includes a WSL2 UVA patch and
-  the `entropy_bound` diffusion sampler. Runs as docker container
-  `diffusiongemma` (script is bind-mounted as `/start-vllm.sh`, so edits apply
-  on `docker restart diffusiongemma`). The script also sources
-  `src/demos/diffusion-gemma/.cache/huggingface/diffusion-env.sh`
-  (host-writable through the HF-cache bind mount) — that's how
-  `/api/engine-config` changes settings without recreating the container.
-- **Tuned for single-user local serving** (env-overridable in the start script):
-  `GPU_MEM_UTIL` (default 0.28 ≈ 27 GiB — without caps vLLM grabbed ~88 GiB:
-  69 GiB KV cache for the native 262k context, measured <0.5% used),
-  `MAX_MODEL_LEN` (32768), `MAX_NUM_SEQS` (1), `DIFFUSION_ENTROPY` (0.1),
-  `ENFORCE_EAGER` (0). Weights are 17.4 GiB.
-- **Never re-add `--enforce-eager` casually:** it disables CUDA graphs AND
-  torch.compile and cost 2.2× throughput (387 → 841 tok/s avg, peak 1002,
-  steady-state ~644 on long runs). Set `ENFORCE_EAGER=1` only to debug
-  WSL2/Blackwell graph-capture issues. Entropy 0.1→0.2 measured ≈ no speed
-  change (745–845 tok/s) — the dial trades quality, not meaningful speed,
-  at these settings.
-## Routes
-| Route | What |
-| ----- | ---- |
-| `/` | Test harness UI (chat + compatibility tests via universal-llm-client) |
-| `/canvas` | "Signal from Noise" — cinematic chat UI replaying the diffusion process |
-| `/api/chat` | Chat via universal-llm-client (`messages`, `stream`, `maxTokens`, `temperature`) |
-| `/api/stream-raw` | Direct vLLM SSE proxy preserving chunk timing (`messages` or `prompt`, `maxTokens`, `thinking:false` to disable the thought channel). Always sets `skip_special_tokens:false` so channel markers survive. |
-| `/api/engine-config` | GET current entropy; POST `{entropy}` writes the env file + `docker restart`s the engine (~2–4 min; UI polls `/api/health`) |
-| `/api/health` | Pings vLLM `/v1/models` |
-## Native protocol (no server-side parsers!)
-This vLLM build has **no reasoning parser and no tool-call parser module** —
-request-level `tools` with auto choice 400s. Everything is client-side, against
-the chat template's native markers (visible only with `skip_special_tokens:false`):
-- Reasoning: `<|channel>thought\n …<channel|>answer`. The canvas splits this
-  with a streaming state machine (partial markers carried across chunks) and
-  renders reasoning as a collapsible amber channel above the answer surface.
-- **Canvas reading view:** the mono token surface is the animation; when a
-  reply settles it fades into a rendered-markdown view (zero-dep renderer in
-  the inner script — headings/lists/code/bold/links, all input HTML-escaped
-  first; backticks via `String.fromCharCode(96)` because literal backticks
-  would terminate the outer template literal). Replay/scrub swaps back to the
-  token surface. Root font scales with viewport (`clamp` on `html`) for
-  screen-recording legibility. Max-tokens select goes to 16k (default 4k);
-  `finish_reason:'length'` shows an amber "⚠ capped" warning in phase+footer.
-- Tool calls: `<|tool_call>call:name{k:<|"|>v<|"|>,n:3}<tool_call|>` — pseudo-JSON
-  args (bare keys, `<|"|>` quote token). Send `tools` + `tool_choice:'none'`
-  (declarations still get rendered into the template); history tool turns go as
-  standard structured `tool_calls` + `role:'tool'` messages (template renders
-  them natively).
-- All of this is implemented for the library in `src/gemma-diffusion.ts` and
-  wired into the OpenAI provider (auto-detected by model name; override with
-  `LLMClientOptions.gemmaNativeProtocol`). `chatWithTools` works end-to-end.
-  Tests: `src/tests/gemma-diffusion.test.ts`. Probes: `probe-stream.ts`
-  (chunk timing), `probe-tools.ts` (tool-loop wire format).
-## Things that bite
-- **`canvas.ts` is one giant TS template literal.** Backslash escapes inside the
-  inner `<script>` are eaten by the outer literal (`/\S+/` silently becomes
-  `/S+/`). The inner script is written with ZERO backslashes — newlines via
-  `String.fromCharCode(10)`, tokenizing via charCode scans. Keep it that way.
-- **No hot reload.** `CANVAS_HTML` is bundled at startup — restart the server
-  after editing `canvas.ts` (kill the bun process on :3333, start again).
-- **Don't name a top-level browser var `history`** — `window.history` is
-  unshadowable; the conversation array is called `convo`.
-- **Stream shape (measured):** the vLLM OpenAI stream emits ~1KB bursts, one per
-  finished 256-token diffusion block, every ~0.8–1.2s. There is no per-denoise-step
-  state in the stream; `/canvas` animates each block's reveal during the real
-  compute window of the next block. `probe-stream.ts` logs chunk timing.
-- **The model emits stray unbalanced `<channel|>` closers** occasionally —
-  the parser strips them (`RESIDUAL_SPECIAL` in gemma-diffusion.ts), and it
-  sometimes puts the whole final answer inside the thought channel on
-  post-tool turns.
-- **Entropy is engine-level** (`hf_overrides` read once at model init in
-  vLLM's `diffusion_gemma.py`); per-request `vllm_xargs` is accepted but
-  ignored. Hence the reload-based `/api/engine-config`.
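
The deleted CLAUDE.md describes splitting the reasoning channel with a streaming state machine that carries partial markers across chunks. The library's own version lives in `src/gemma-diffusion.ts`, which this diff does not show, so the following is only a minimal sketch of the idea. The class name `ChannelSplitter`, its methods, and the choice to treat `<|channel>thought\n` as a single opening marker are assumptions made for illustration; stray `<channel|>` closers are simply dropped, as the notes say the real parser does.

```typescript
// Hypothetical sketch, not the library's implementation.
const OPEN = "<|channel>thought\n"; // assumed opening marker (label included)
const CLOSE = "<channel|>";         // closing marker named in the notes

interface SplitResult {
  reasoning: string;
  answer: string;
}

// Length of the longest buffer tail that could still grow into a marker.
function partialMarkerLength(text: string): number {
  let best = 0;
  for (const marker of [OPEN, CLOSE]) {
    const max = Math.min(marker.length - 1, text.length);
    for (let n = max; n > best; n--) {
      if (text.endsWith(marker.slice(0, n))) {
        best = n;
        break;
      }
    }
  }
  return best;
}

class ChannelSplitter {
  private buffer = "";
  private inThought = false;

  // Feed one stream chunk; returns the text that is now safe to render.
  push(chunk: string): SplitResult {
    this.buffer += chunk;
    const out: SplitResult = { reasoning: "", answer: "" };
    const emit = (text: string): void => {
      if (this.inThought) out.reasoning += text;
      else out.answer += text;
    };
    for (;;) {
      const openAt = this.buffer.indexOf(OPEN);
      const closeAt = this.buffer.indexOf(CLOSE);
      if (openAt === -1 && closeAt === -1) break;
      const isOpen = openAt !== -1 && (closeAt === -1 || openAt < closeAt);
      const at = isOpen ? openAt : closeAt;
      emit(this.buffer.slice(0, at));
      this.buffer = this.buffer.slice(at + (isOpen ? OPEN.length : CLOSE.length));
      // A closer always lands in the answer state, so stray closers vanish.
      this.inThought = isOpen;
    }
    // Hold back a possible partial marker until the next chunk arrives.
    const keep = partialMarkerLength(this.buffer);
    emit(this.buffer.slice(0, this.buffer.length - keep));
    this.buffer = this.buffer.slice(this.buffer.length - keep);
    return out;
  }

  // Call at end of stream: whatever is held back is plain text after all.
  flush(): SplitResult {
    const rest = this.buffer;
    this.buffer = "";
    return this.inThought
      ? { reasoning: rest, answer: "" }
      : { reasoning: "", answer: rest };
  }
}
```

A caller would append each `push` result to its reasoning and answer surfaces and call `flush` once the stream ends. The markers only appear in the stream when the request sets `skip_special_tokens:false`, as the notes point out.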

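The same CLAUDE.md gives the native tool-call wire format as `<|tool_call>call:name{k:<|"|>v<|"|>,n:3}<tool_call|>`, with bare keys and `<|"|>` as the quote token. A client-side parser for that shape might look like the sketch below. It assumes flat arguments only (no nested objects or arrays), and `parseToolCalls`, `parseArgs`, and `ParsedToolCall` are hypothetical names, not the library's API.

```typescript
// Hypothetical sketch of a parser for the pseudo-JSON tool-call format.
const QUOTE = '<|"|>'; // quote token named in the notes

interface ParsedToolCall {
  name: string;
  args: Record<string, unknown>;
}

// Bare literals (numbers, true/false/null) go through JSON.parse;
// anything else is kept as a plain string.
function parseLiteral(text: string): unknown {
  try {
    return JSON.parse(text);
  } catch {
    return text;
  }
}

// Parses `k:<|"|>v<|"|>,n:3` into { k: "v", n: 3 }. Assumes a flat object.
function parseArgs(body: string): Record<string, unknown> {
  const args: Record<string, unknown> = {};
  let i = 0;
  while (i < body.length) {
    const colon = body.indexOf(":", i);
    if (colon === -1) break;
    const key = body.slice(i, colon).trim();
    let v = colon + 1;
    while (body[v] === " ") v++; // tolerate spaces after the colon
    if (body.startsWith(QUOTE, v)) {
      // Quoted string: read up to the next quote token, commas included.
      const start = v + QUOTE.length;
      const end = body.indexOf(QUOTE, start);
      if (end === -1) throw new Error(`unterminated string for key "${key}"`);
      args[key] = body.slice(start, end);
      i = end + QUOTE.length;
    } else {
      // Bare literal: read up to the next comma or the end of the body.
      let end = body.indexOf(",", v);
      if (end === -1) end = body.length;
      args[key] = parseLiteral(body.slice(v, end).trim());
      i = end;
    }
    if (body[i] === ",") i++;
  }
  return args;
}

// Extracts every tool call found in a block of model output.
function parseToolCalls(text: string): ParsedToolCall[] {
  const callRe = /<\|tool_call>call:([A-Za-z_][\w.-]*)\{([\s\S]*?)\}<tool_call\|>/g;
  const calls: ParsedToolCall[] = [];
  let match: RegExpExecArray | null;
  while ((match = callRe.exec(text)) !== null) {
    calls.push({ name: match[1], args: parseArgs(match[2]) });
  }
  return calls;
}
```

Because quoted values are read up to the next quote token, commas and colons inside a string argument do not confuse the scan. Bare literals are cut at the first comma, which is one reason this sketch does not attempt nested structures.
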
package/src/demos/diffusion-gemma/README.md DELETED

@@ -1,59 +0,0 @@
-# DiffusionGemma demo
-Standalone Bun demo for testing `universal-llm-client` against DiffusionGemma
-served by vLLM's OpenAI-compatible API.
-## Run the backend
-From `packages/universal-llm-client`:
-```bash
-docker compose -f src/demos/diffusion-gemma/docker-compose.yml up -d
-```
-The compose file runs a `diffusiongemma` container on `localhost:8000`, mounts a
-demo-local Hugging Face cache at `src/demos/diffusion-gemma/.cache/huggingface`,
-and bind-mounts `start-vllm.sh` as the container entrypoint.
-If you already have an older hand-created `diffusiongemma` container, remove it
-before switching to the demo compose file:
-```bash
-docker rm -f diffusiongemma
-```
-Optional overrides:
-```bash
-cp src/demos/diffusion-gemma/.env.example src/demos/diffusion-gemma/.env
-docker compose --env-file src/demos/diffusion-gemma/.env -f src/demos/diffusion-gemma/docker-compose.yml up -d
-```
-Useful knobs are `VLLM_IMAGE`, `GPU_MEM_UTIL`, `MAX_MODEL_LEN`,
-`DIFFUSION_ENTROPY`, `ENFORCE_EAGER`, and `VLLM_NO_USAGE_STATS`.
-## Run the demo UI
-```bash
-bun run src/demos/diffusion-gemma/server.ts
-```
-- Harness: <http://localhost:3333/>
-- Canvas: <http://localhost:3333/canvas>
-- vLLM API: <http://localhost:8000/v1/models>
-## Notes
-- The prior BentoKit setup did not use a `docker-compose.yml`; it was a direct
-  Docker container using a repo-root `scripts/diffusiongemma-start.sh` bind
-  mount. This demo now carries its own compose file and startup script.
-- The default image is `vllm/vllm-openai:gemma`, the vLLM image line that
-  includes DiffusionGemma support. Set `VLLM_IMAGE` if you need to test another
-  local or registry image.
-- The first startup can take several minutes while vLLM loads and compiles the
-  model. Poll `docker logs -f diffusiongemma` or `/api/health` from the demo UI.
-- The `/api/engine-config` endpoint writes `diffusion-env.sh` into the mounted
-  Hugging Face cache and restarts the `diffusiongemma` container.
-- `VLLM_NO_USAGE_STATS=1` is enabled by default because this vLLM image can hit
-  a non-fatal `py-cpuinfo` `JSONDecodeError` in its background usage-reporting
-  thread under WSL during startup/reload.
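
The README notes above say the first startup can take several minutes and suggest polling `/api/health`. A small readiness loop along those lines could look like the sketch below. The helper `waitUntilReady`, its option names, and the default interval and timeout are assumptions for illustration; the readiness check is passed in by the caller, since the response shape of `/api/health` is not shown in this diff.

```typescript
// Hypothetical helper: poll a caller-supplied readiness check until it
// succeeds or the timeout is about to be exceeded.
interface WaitOptions {
  intervalMs?: number; // delay between attempts (assumed default: 5 s)
  timeoutMs?: number;  // overall budget (assumed default: 10 min)
}

async function waitUntilReady(
  check: () => Promise<boolean>,
  { intervalMs = 5000, timeoutMs = 600000 }: WaitOptions = {},
): Promise<boolean> {
  const deadline = Date.now() + timeoutMs;
  for (;;) {
    // A rejected check (for example, connection refused while the engine
    // is still booting) is treated as "not ready yet".
    const ready = await check().catch(() => false);
    if (ready) return true;
    // Stop if sleeping once more would overshoot the deadline.
    if (Date.now() + intervalMs > deadline) return false;
    await new Promise<void>((resolve) => setTimeout(resolve, intervalMs));
  }
}
```

One possible check, assuming that an HTTP 2xx from the demo server means the engine is up, is `() => fetch("http://localhost:3333/api/health").then((r) => r.ok)`. Keeping the check injectable also lets the loop be exercised without a running vLLM container.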