universal-llm-client 4.5.0 → 4.5.1

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (174)
  1. package/CHANGELOG.md +12 -0
  2. package/README.md +2 -0
  3. package/dist/ai-model.d.ts +0 -1
  4. package/dist/ai-model.js +0 -1
  5. package/dist/auditor.d.ts +0 -1
  6. package/dist/auditor.js +0 -1
  7. package/dist/client.d.ts +0 -1
  8. package/dist/client.js +0 -1
  9. package/dist/gemma-channel.d.ts +0 -1
  10. package/dist/gemma-channel.js +0 -1
  11. package/dist/gemma-diffusion.d.ts +0 -1
  12. package/dist/gemma-diffusion.js +0 -1
  13. package/dist/http.d.ts +0 -1
  14. package/dist/http.js +0 -1
  15. package/dist/index.d.ts +0 -1
  16. package/dist/index.js +0 -1
  17. package/dist/interfaces.d.ts +0 -1
  18. package/dist/interfaces.js +0 -1
  19. package/dist/mcp.d.ts +0 -1
  20. package/dist/mcp.js +0 -1
  21. package/dist/providers/anthropic.d.ts +0 -1
  22. package/dist/providers/anthropic.js +0 -1
  23. package/dist/providers/google.d.ts +0 -1
  24. package/dist/providers/google.js +0 -1
  25. package/dist/providers/index.d.ts +0 -1
  26. package/dist/providers/index.js +0 -1
  27. package/dist/providers/ollama.d.ts +0 -1
  28. package/dist/providers/ollama.js +0 -1
  29. package/dist/providers/openai.d.ts +2 -1
  30. package/dist/providers/openai.js +303 -74
  31. package/dist/router.d.ts +0 -1
  32. package/dist/router.js +0 -1
  33. package/dist/stream-decoder.d.ts +0 -1
  34. package/dist/stream-decoder.js +0 -1
  35. package/dist/structured-output.d.ts +0 -1
  36. package/dist/structured-output.js +0 -1
  37. package/dist/thinking.d.ts +0 -1
  38. package/dist/thinking.js +0 -1
  39. package/dist/tools.d.ts +0 -1
  40. package/dist/tools.js +0 -1
  41. package/dist/zod-adapter.d.ts +0 -1
  42. package/dist/zod-adapter.js +0 -1
  43. package/package.json +1 -2
  44. package/dist/ai-model.d.ts.map +0 -1
  45. package/dist/ai-model.js.map +0 -1
  46. package/dist/auditor.d.ts.map +0 -1
  47. package/dist/auditor.js.map +0 -1
  48. package/dist/client.d.ts.map +0 -1
  49. package/dist/client.js.map +0 -1
  50. package/dist/gemma-channel.d.ts.map +0 -1
  51. package/dist/gemma-channel.js.map +0 -1
  52. package/dist/gemma-diffusion.d.ts.map +0 -1
  53. package/dist/gemma-diffusion.js.map +0 -1
  54. package/dist/http.d.ts.map +0 -1
  55. package/dist/http.js.map +0 -1
  56. package/dist/index.d.ts.map +0 -1
  57. package/dist/index.js.map +0 -1
  58. package/dist/interfaces.d.ts.map +0 -1
  59. package/dist/interfaces.js.map +0 -1
  60. package/dist/mcp.d.ts.map +0 -1
  61. package/dist/mcp.js.map +0 -1
  62. package/dist/providers/anthropic.d.ts.map +0 -1
  63. package/dist/providers/anthropic.js.map +0 -1
  64. package/dist/providers/google.d.ts.map +0 -1
  65. package/dist/providers/google.js.map +0 -1
  66. package/dist/providers/index.d.ts.map +0 -1
  67. package/dist/providers/index.js.map +0 -1
  68. package/dist/providers/ollama.d.ts.map +0 -1
  69. package/dist/providers/ollama.js.map +0 -1
  70. package/dist/providers/openai.d.ts.map +0 -1
  71. package/dist/providers/openai.js.map +0 -1
  72. package/dist/router.d.ts.map +0 -1
  73. package/dist/router.js.map +0 -1
  74. package/dist/stream-decoder.d.ts.map +0 -1
  75. package/dist/stream-decoder.js.map +0 -1
  76. package/dist/structured-output.d.ts.map +0 -1
  77. package/dist/structured-output.js.map +0 -1
  78. package/dist/thinking.d.ts.map +0 -1
  79. package/dist/thinking.js.map +0 -1
  80. package/dist/tools.d.ts.map +0 -1
  81. package/dist/tools.js.map +0 -1
  82. package/dist/zod-adapter.d.ts.map +0 -1
  83. package/dist/zod-adapter.js.map +0 -1
  84. package/src/ai-model.ts +0 -400
  85. package/src/auditor.ts +0 -213
  86. package/src/client.ts +0 -402
  87. package/src/debug/debug-google-streaming.ts +0 -97
  88. package/src/debug/debug-tool-execution.ts +0 -86
  89. package/src/debug/test-lmstudio-tools.ts +0 -155
  90. package/src/demos/README.md +0 -47
  91. package/src/demos/basic/universal-llm-examples.ts +0 -161
  92. package/src/demos/diffusion-gemma/.env +0 -29
  93. package/src/demos/diffusion-gemma/.env.example +0 -27
  94. package/src/demos/diffusion-gemma/CLAUDE.md +0 -95
  95. package/src/demos/diffusion-gemma/README.md +0 -59
  96. package/src/demos/diffusion-gemma/canvas.ts +0 -1606
  97. package/src/demos/diffusion-gemma/docker-compose.yml +0 -29
  98. package/src/demos/diffusion-gemma/probe-stream.ts +0 -51
  99. package/src/demos/diffusion-gemma/probe-tools.ts +0 -55
  100. package/src/demos/diffusion-gemma/server.ts +0 -1205
  101. package/src/demos/diffusion-gemma/start-vllm.sh +0 -98
  102. package/src/demos/mcp/astrid-memory-demo.ts +0 -295
  103. package/src/demos/mcp/astrid-persona-memory.ts +0 -357
  104. package/src/demos/mcp/mcp-mongodb-demo.ts +0 -275
  105. package/src/demos/mcp/simple-astrid-memory.ts +0 -148
  106. package/src/demos/mcp/simple-mcp-demo.ts +0 -68
  107. package/src/demos/mcp/working-mcp-demo.ts +0 -62
  108. package/src/demos/model-alias-demo.ts +0 -0
  109. package/src/demos/tools/RAG_MEMORY_INTEGRATION.md +0 -267
  110. package/src/demos/tools/astrid-memory-demo.ts +0 -270
  111. package/src/demos/tools/astrid-production-memory-clean.ts +0 -785
  112. package/src/demos/tools/astrid-production-memory.ts +0 -558
  113. package/src/demos/tools/basic-translation-test.ts +0 -66
  114. package/src/demos/tools/chromadb-similarity-tuning.ts +0 -390
  115. package/src/demos/tools/clean-multilingual-conversation.ts +0 -209
  116. package/src/demos/tools/clean-translation-test.ts +0 -119
  117. package/src/demos/tools/clean-universal-multilingual-test.ts +0 -131
  118. package/src/demos/tools/complete-rag-demo.ts +0 -369
  119. package/src/demos/tools/complete-tool-demo.ts +0 -132
  120. package/src/demos/tools/demo-tool-calling.ts +0 -124
  121. package/src/demos/tools/dynamic-language-switching-test.ts +0 -251
  122. package/src/demos/tools/hybrid-thinking-test.ts +0 -154
  123. package/src/demos/tools/memory-integration-test.ts +0 -420
  124. package/src/demos/tools/multilingual-memory-system.ts +0 -802
  125. package/src/demos/tools/ondemand-translation-demo.ts +0 -655
  126. package/src/demos/tools/production-tool-demo.ts +0 -245
  127. package/src/demos/tools/revolutionary-multilingual-test.ts +0 -151
  128. package/src/demos/tools/rigorous-language-analysis.ts +0 -218
  129. package/src/demos/tools/test-universal-memory-system.ts +0 -126
  130. package/src/demos/tools/translation-integration-guide.ts +0 -346
  131. package/src/demos/tools/universal-memory-system.ts +0 -560
  132. package/src/gemma-channel.ts +0 -47
  133. package/src/gemma-diffusion.ts +0 -167
  134. package/src/http.ts +0 -261
  135. package/src/index.ts +0 -180
  136. package/src/interfaces.ts +0 -843
  137. package/src/mcp.ts +0 -345
  138. package/src/providers/anthropic.ts +0 -796
  139. package/src/providers/google.ts +0 -840
  140. package/src/providers/index.ts +0 -8
  141. package/src/providers/ollama.ts +0 -503
  142. package/src/providers/openai.ts +0 -587
  143. package/src/router.ts +0 -785
  144. package/src/stream-decoder.ts +0 -535
  145. package/src/structured-output.ts +0 -759
  146. package/src/test-scripts/test-advanced-tools.ts +0 -310
  147. package/src/test-scripts/test-google-deep-research.ts +0 -33
  148. package/src/test-scripts/test-google-streaming-enhanced.ts +0 -147
  149. package/src/test-scripts/test-google-streaming.ts +0 -63
  150. package/src/test-scripts/test-google-system-prompt-comprehensive.ts +0 -189
  151. package/src/test-scripts/test-google-thinking.ts +0 -46
  152. package/src/test-scripts/test-mcp-config.ts +0 -28
  153. package/src/test-scripts/test-mcp-connection.ts +0 -29
  154. package/src/test-scripts/test-system-message-positions.ts +0 -163
  155. package/src/test-scripts/test-system-prompt-improvement-demo.ts +0 -83
  156. package/src/test-scripts/test-tool-calling.ts +0 -231
  157. package/src/test-scripts/test-vllm-qwen36.ts +0 -256
  158. package/src/tests/ai-model.test.ts +0 -1614
  159. package/src/tests/auditor.test.ts +0 -224
  160. package/src/tests/gemma-diffusion.test.ts +0 -115
  161. package/src/tests/http.test.ts +0 -200
  162. package/src/tests/interfaces.test.ts +0 -117
  163. package/src/tests/providers/anthropic.test.ts +0 -118
  164. package/src/tests/providers/google.test.ts +0 -841
  165. package/src/tests/providers/ollama.test.ts +0 -1034
  166. package/src/tests/providers/openai.test.ts +0 -1511
  167. package/src/tests/router.test.ts +0 -254
  168. package/src/tests/stream-decoder.test.ts +0 -263
  169. package/src/tests/structured-output.test.ts +0 -1450
  170. package/src/tests/thinking.test.ts +0 -65
  171. package/src/tests/tools.test.ts +0 -175
  172. package/src/thinking.ts +0 -73
  173. package/src/tools.ts +0 -246
  174. package/src/zod-adapter.ts +0 -72
@@ -1,155 +0,0 @@
1
- /**
2
- * Debug test specifically for LM Studio tool calling with OpenAI API
3
- */
4
-
5
- async function testLMStudioTools() {
6
- console.log('🔧 Testing LM Studio Tool Calling with OpenAI API\n');
7
-
8
- // Create LM Studio model using OpenAI-compatible API
9
- const model = AIModelFactory.createOpenAIChatModel(
10
- 'qwen/qwen3-8b', // Tool-trained model
11
- 'http://192.168.100.136:1234/v1' // LM Studio URL
12
- );
13
-
14
- // Create simple test tools (same as our router test)
15
- const testTools = [
16
- ToolBuilder.createTool<{ expression: string }>(
17
- 'calculate',
18
- 'Perform mathematical calculations',
19
- {
20
- properties: {
21
- expression: {
22
- type: 'string',
23
- description: 'Mathematical expression to evaluate'
24
- }
25
- },
26
- required: ['expression']
27
- },
28
- (args) => {
29
- try {
30
- const result = eval(args.expression); // WARNING: eval executes arbitrary code; tolerable only in a local debug script, never on untrusted input
31
- console.log(`🔧 Tool executed: calculate(${args.expression}) = ${result}`);
32
- return `The result of ${args.expression} is ${result}`;
33
- } catch (error) {
34
- console.log(`❌ Tool error: calculate(${args.expression}) failed`);
35
- return `Error calculating ${args.expression}: ${error}`;
36
- }
37
- }
38
- ),
39
-
40
- ToolBuilder.createTool<{ location: string }>(
41
- 'get_weather',
42
- 'Get weather information',
43
- {
44
- properties: {
45
- location: {
46
- type: 'string',
47
- description: 'City and state/country'
48
- }
49
- },
50
- required: ['location']
51
- },
52
- (args) => {
53
- console.log(`🔧 Tool executed: get_weather(${args.location})`);
54
- return `The weather in ${args.location} is sunny and 72°F.`;
55
- }
56
- )
57
- ];
58
-
59
- // Register tools
60
- model.registerTools(testTools);
61
-
62
- console.log('📝 Registered tools:', testTools.map(t => t.name).join(', '));
63
- console.log('');
64
-
65
- try {
66
- await model.ensureReady();
67
- console.log('✅ LM Studio model ready\n');
68
-
69
- // Test 1: Basic calculation
70
- console.log('🧮 Test 1: Basic Calculation');
71
- console.log('Question: What is 15 multiplied by 23?');
72
-
73
- const response1 = await model.chatWithTools([
74
- { role: 'user', content: 'What is 15 multiplied by 23? Use the calculate tool.' }
75
- ]);
76
-
77
- console.log('Response:', response1.message?.content || 'No content');
78
- console.log('Usage:', response1.usage);
79
- console.log('');
80
-
81
- // Test 2: Weather query
82
- console.log('🌤️ Test 2: Weather Query');
83
- console.log('Question: Weather in San Francisco');
84
-
85
- const response2 = await model.chatWithTools([
86
- { role: 'user', content: 'What is the weather in San Francisco, CA?' }
87
- ]);
88
-
89
- console.log('Response:', response2.message?.content || 'No content');
90
- console.log('');
91
-
92
- // Test 3: Manual tool call inspection (without auto-execution)
93
- console.log('🔍 Test 3: Manual Tool Call Inspection');
94
- console.log('Question: Calculate 100 / 4 (without auto-execution)');
95
-
96
- const response3 = await model.chat([
97
- { role: 'user', content: 'Calculate 100 divided by 4 using the calculate tool.' }
98
- ], {}, {
99
- tool_choice: 'auto',
100
- executeTools: false // Don't auto-execute
101
- });
102
-
103
- console.log('Response content:', response3.message?.content || 'No content');
104
- console.log('Tool calls detected:', response3.message?.tool_calls?.length || 0);
105
-
106
- if (response3.message?.tool_calls) {
107
- response3.message.tool_calls.forEach((call, i) => {
108
- console.log(` Tool ${i + 1}: ${call.function.name}(${call.function.arguments})`);
109
- });
110
- }
111
- console.log('');
112
-
113
- // Test 4: Different tool_choice settings
114
- console.log('🎯 Test 4: Different tool_choice Settings');
115
-
116
- const testCases = [
117
- { choice: 'auto', desc: 'Auto (let model decide)' },
118
- { choice: 'required', desc: 'Required (force tool use)' },
119
- { choice: { type: 'function', function: { name: 'calculate' } }, desc: 'Specific tool' }
120
- ];
121
-
122
- for (const testCase of testCases) {
123
- console.log(`Testing tool_choice: ${testCase.desc}`);
124
- try {
125
- const response = await model.chat([
126
- { role: 'user', content: 'What is 50 + 50?' }
127
- ], {}, {
128
- tool_choice: testCase.choice as any,
129
- executeTools: false
130
- });
131
-
132
- console.log(` Content: ${response.message?.content?.substring(0, 100) || 'No content'}${(response.message?.content?.length || 0) > 100 ? '...' : ''}`);
133
- console.log(` Tool calls: ${response.message?.tool_calls?.length || 0}`);
134
-
135
- } catch (error) {
136
- console.log(` Error: ${(error as Error).message}`);
137
- }
138
- console.log('');
139
- }
140
-
141
- } catch (error) {
142
- console.error('❌ Test failed:', (error as Error).message);
143
- console.error('Stack:', (error as Error).stack);
144
- } finally {
145
- model.dispose();
146
- console.log('🏁 Test completed!\n');
147
- }
148
- }
149
-
150
- // Run the test
151
- if (require.main === module) {
152
- testLMStudioTools().catch(console.error);
153
- }
154
-
155
- export { testLMStudioTools };
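The `calculate` tool in the file above hands model-generated text to `eval`, which executes arbitrary JavaScript. A minimal sketch of a safer evaluator follows, assuming the tool only needs `+ - * /`, unary minus, parentheses and decimal numbers. `safeCalculate` is a hypothetical helper written for illustration; it is not part of `universal-llm-client`.

```typescript
// Hypothetical helper (an assumption, not package code): a small
// recursive-descent parser that evaluates arithmetic without eval.
function safeCalculate(expression: string): number {
    const src = expression.replace(/\s+/g, '');
    let pos = 0;

    // expr := term (('+' | '-') term)*
    function parseExpr(): number {
        let value = parseTerm();
        while (src[pos] === '+' || src[pos] === '-') {
            const op = src[pos++];
            const rhs = parseTerm();
            value = op === '+' ? value + rhs : value - rhs;
        }
        return value;
    }

    // term := factor (('*' | '/') factor)*
    function parseTerm(): number {
        let value = parseFactor();
        while (src[pos] === '*' || src[pos] === '/') {
            const op = src[pos++];
            const rhs = parseFactor();
            value = op === '*' ? value * rhs : value / rhs;
        }
        return value;
    }

    // factor := '-' factor | '(' expr ')' | number
    function parseFactor(): number {
        if (src[pos] === '-') {
            pos++;
            return -parseFactor();
        }
        if (src[pos] === '(') {
            pos++;
            const value = parseExpr();
            if (src[pos] !== ')') throw new Error(`Expected ')' at position ${pos}`);
            pos++;
            return value;
        }
        const match = /^\d+(\.\d+)?/.exec(src.slice(pos));
        if (!match) throw new Error(`Unexpected input at position ${pos}`);
        pos += match[0].length;
        return Number(match[0]);
    }

    const result = parseExpr();
    // Anything left over (identifiers, calls, stray operators) is rejected.
    if (pos !== src.length) throw new Error(`Unexpected input at position ${pos}`);
    return result;
}
```

A tool handler could call `safeCalculate(args.expression)` inside the existing `try`/`catch`, so input that is not plain arithmetic becomes a tool error message instead of executed code.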
@@ -1,47 +0,0 @@
1
- # Universal LLM Client Demos
2
-
3
- This directory contains various demonstration files for the Universal LLM Client.
4
-
5
- ## Directory Structure
6
-
7
- ### `/basic`
8
-
9
- Basic usage examples and simple demonstrations.
10
-
11
- ### `/tools`
12
-
13
- Tool calling demonstrations and examples:
14
-
15
- - `demo-tool-calling.ts` - Basic tool calling demo
16
- - `complete-tool-demo.ts` - Comprehensive tool calling examples
17
- - `production-tool-demo.ts` - Production-ready tool calling implementation
18
-
19
- ### `/mcp`
20
-
21
- Model Context Protocol (MCP) integration demos:
22
-
23
- - `simple-mcp-demo.ts` - Clean MCP MongoDB demo (recommended)
24
- - `mcp-mongodb-demo.ts` - Detailed MCP MongoDB integration
25
- - `working-mcp-demo.ts` - Working MCP implementation example
26
-
27
- ## Running Demos
28
-
29
- From the root directory (`universal-llm-client/`):
30
-
31
- ```bash
32
- # Tool calling demos
33
- bun run src/demos/tools/production-tool-demo.ts
34
-
35
- # MCP demos (requires MongoDB MCP server)
36
- bun run src/demos/mcp/simple-mcp-demo.ts
37
-
38
- # Basic usage
39
- bun run src/demos/basic/[demo-file].ts
40
- ```
41
-
42
- ## Requirements
43
-
44
- - Node.js/Bun runtime
45
- - Configured Ollama instance (for Ollama provider)
46
- - MongoDB MCP server (for MCP demos)
47
- - API keys for external providers (OpenAI, Google)
@@ -1,161 +0,0 @@
1
- import { AIModelFactory } from "./factory";
2
-
3
- /**
4
- * Example: Complete AI Application Setup
5
- *
6
- * This example shows how to set up a complete AI application
7
- * with both chat and embedding capabilities using the factory.
8
- */
9
- export async function createAIApplicationExample() {
10
-
11
- console.log('\n🏗️ Example: Setting up a complete AI application...\n');
12
-
13
- // Method 1: Using the factory for easy setup
14
- const aiSetup = AIModelFactory.createCompleteSetup({
15
- ollama: {
16
- chatModel: 'gemma3:4b-it-qat',
17
- embeddingModel: 'snowflake-arctic-embed2:latest',
18
- url: 'http://localhost:11434'
19
- },
20
- openai: {
21
- chatModel: 'google/gemma-3-4b',
22
- embeddingModel: 'text-embedding-snowflake-arctic-embed-l-v2.0',
23
- url: 'http://localhost:1234/v1'
24
- // apiKey: 'your-api-key-here' // For real OpenAI API
25
- },
26
- google: {
27
- chatModel: 'gemma-3-4b-it',
28
- apiKey: (process.env.GOOGLE_API_KEY ?? '')
29
- }
30
- });
31
-
32
- // Method 2: Individual setup for specific use cases
33
- const chatModel = AIModelFactory.createOllamaChatModel('gemma3:4b-it-qat');
34
- const embeddingModel = AIModelFactory.createOllamaEmbeddingModel('snowflake-arctic-embed2:latest');
35
-
36
- // Method 3: Google-specific setup
37
- const googleChatModel = AIModelFactory.createGoogleChatModel(
38
- 'gemma-3-4b-it',
39
- (process.env.GOOGLE_API_KEY ?? '')
40
- );
41
-
42
- // Example usage patterns:
43
-
44
- console.log('🤖 Testing Google Generative AI...');
45
- try {
46
- const googleResponse = await googleChatModel.chat([
47
- { role: 'user', content: 'What is the capital of France?' }
48
- ]);
49
- console.log('Google Response:', googleResponse.content);
50
- } catch (error) {
51
- console.error('Google API Error:', error);
52
- }
53
-
54
- console.log('\n🌊 Testing Google Streaming...');
55
- try {
56
- const streamingGenerator = googleChatModel.chatStream([
57
- { role: 'user', content: 'Count from 1 to 5, explaining each number briefly.' }
58
- ]);
59
-
60
- for await (const chunk of streamingGenerator) {
61
- process.stdout.write(chunk);
62
- }
63
- console.log('\n✅ Google streaming completed');
64
- } catch (error) {
65
- console.error('Google Streaming Error:', error);
66
- }
67
-
68
- // 1. Simple chat
69
- const chatResponse = await aiSetup.ollama.chat.chat([
70
- { role: 'user', content: 'What is machine learning?' }
71
- ]);
72
-
73
- // 2. Document embedding for search
74
- const documents = [
75
- 'Machine learning is a subset of AI',
76
- 'Neural networks are inspired by the brain',
77
- 'TypeScript is a typed superset of JavaScript'
78
- ];
79
-
80
- const embeddings = await Promise.all(
81
- documents.map(doc => aiSetup.ollama.embedding.embed(doc))
82
- );
83
-
84
- // 3. Cross-provider comparison including Google
85
- const comparisons = await Promise.all([
86
- aiSetup.ollama?.chat?.chat([{ role: 'user', content: 'What is the capital of France?' }]).catch(() => null),
87
- aiSetup.openai?.chat?.chat([{ role: 'user', content: 'What is the capital of France?' }]).catch(() => null),
88
- aiSetup.google?.chat?.chat([{ role: 'user', content: 'What is the capital of France?' }]).catch(() => null)
89
- ]);
90
-
91
- return {
92
- aiSetup,
93
- chatModel,
94
- embeddingModel,
95
- googleChatModel,
96
- examples: {
97
- chatResponse,
98
- embeddings,
99
- comparisons: {
100
- ollama: comparisons[0],
101
- openai: comparisons[1],
102
- google: comparisons[2]
103
- }
104
- }
105
- };
106
- }
107
-
108
- /**
109
- * Test Google API specifically
110
- */
111
- export async function testGoogleAPI() {
112
- console.log('🧪 Testing Google Generative AI API...\n');
113
-
114
- const googleModel = AIModelFactory.createGoogleChatModel(
115
- 'gemma-3-4b-it',
116
- (process.env.GOOGLE_API_KEY ?? '')
117
- );
118
-
119
- try {
120
- // Test basic chat
121
- console.log('💬 Basic chat test...');
122
- const response = await googleModel.chat([
123
- { role: 'user', content: 'Hello! Can you tell me about TypeScript?' }
124
- ]);
125
- console.log('Response:', response.content);
126
-
127
- // Test streaming
128
- console.log('\n🌊 Streaming test...');
129
- const streamResponse = googleModel.chatStream([
130
- { role: 'user', content: 'Tell me a short story about a robot learning to code.' }
131
- ]);
132
-
133
- let fullResponse = '';
134
- for await (const chunk of streamResponse) {
135
- process.stdout.write(chunk);
136
- fullResponse += chunk;
137
- }
138
- console.log('\n✅ Streaming completed');
139
-
140
- // Test conversation with context
141
- console.log('\n💭 Conversation with context...');
142
- const conversationResponse = await googleModel.chat([
143
- { role: 'user', content: 'What is the capital of France?' },
144
- { role: 'assistant', content: 'The capital of France is Paris.' },
145
- { role: 'user', content: 'What is its population?' }
146
- ]);
147
- console.log('Conversation Response:', conversationResponse.content);
148
-
149
- console.log('\n✅ All Google API tests completed successfully!');
150
- return { success: true, model: googleModel };
151
-
152
- } catch (error) {
153
- console.error('❌ Google API test failed:', error);
154
- return { success: false, error };
155
- }
156
- }
157
-
158
- // Uncomment to run the examples
159
- // testGoogleAPI().then(result => {
160
- // console.log('Google API test complete:', result.success);
161
- // });
@@ -1,29 +0,0 @@
1
- # Optional docker compose overrides for the DiffusionGemma vLLM backend.
2
- #
3
- # Start from packages/universal-llm-client:
4
- # docker compose --env-file src/demos/diffusion-gemma/.env -f src/demos/diffusion-gemma/docker-compose.yml up -d
5
-
6
- # Public vLLM image to run. If a future nightly regresses DiffusionGemma support,
7
- # set this to a known-good local or registry tag.
8
- VLLM_IMAGE=vllm/vllm-openai:gemma
9
-
10
- # Host port for the OpenAI-compatible vLLM API.
11
- VLLM_PORT=18000
12
-
13
- VLLM_URL=http://localhost:18000
14
-
15
- # DiffusionGemma model served by vLLM.
16
- MODEL_NAME=RedHatAI/diffusiongemma-26B-A4B-it-NVFP4
17
-
18
- # Single-user local serving defaults. Tune for your GPU.
19
- GPU_MEM_UTIL=0.28
20
- MAX_MODEL_LEN=32768
21
- MAX_NUM_SEQS=1
22
- DIFFUSION_ENTROPY=0.1
23
-
24
- # Set to 1 only for CUDA graph / torch.compile debugging.
25
- ENFORCE_EAGER=0
26
-
27
- # Disable vLLM telemetry. In WSL this avoids a py-cpuinfo JSONDecodeError in
28
- # vLLM's background usage-reporting thread during engine startup/reload.
29
- VLLM_NO_USAGE_STATS=1
@@ -1,27 +0,0 @@
1
- # Optional docker compose overrides for the DiffusionGemma vLLM backend.
2
- #
3
- # Start from packages/universal-llm-client:
4
- # docker compose --env-file src/demos/diffusion-gemma/.env -f src/demos/diffusion-gemma/docker-compose.yml up -d
5
-
6
- # Public vLLM image to run. If a future nightly regresses DiffusionGemma support,
7
- # set this to a known-good local or registry tag.
8
- VLLM_IMAGE=vllm/vllm-openai:gemma
9
-
10
- # Host port for the OpenAI-compatible vLLM API.
11
- VLLM_PORT=8000
12
-
13
- # DiffusionGemma model served by vLLM.
14
- MODEL_NAME=RedHatAI/diffusiongemma-26B-A4B-it-NVFP4
15
-
16
- # Single-user local serving defaults. Tune for your GPU.
17
- GPU_MEM_UTIL=0.28
18
- MAX_MODEL_LEN=32768
19
- MAX_NUM_SEQS=1
20
- DIFFUSION_ENTROPY=0.1
21
-
22
- # Set to 1 only for CUDA graph / torch.compile debugging.
23
- ENFORCE_EAGER=0
24
-
25
- # Disable vLLM telemetry. In WSL this avoids a py-cpuinfo JSONDecodeError in
26
- # vLLM's background usage-reporting thread during engine startup/reload.
27
- VLLM_NO_USAGE_STATS=1
@@ -1,95 +0,0 @@
1
- # DiffusionGemma demo — test harness + "Signal from Noise" canvas
2
-
3
- Standalone Bun server exercising `universal-llm-client` against DiffusionGemma
4
- (a discrete diffusion LM served by vLLM).
5
-
6
- ## Run
7
-
8
- ```bash
9
- bun run demo:diffusion-gemma:engine # starts vLLM via demo-local docker compose
10
- bun run demo:diffusion-gemma # starts the Bun demo server
11
- ```
12
-
13
- - Demo server: **http://localhost:3333** (`/` test harness, `/canvas` diffusion chat UI)
14
- - vLLM upstream: `VLLM_URL` env, default `http://localhost:8000`
15
- - Model: `MODEL_NAME` env, default `RedHatAI/diffusiongemma-26B-A4B-it-NVFP4`
16
- - vLLM is started via `src/demos/diffusion-gemma/docker-compose.yml` and
17
- `src/demos/diffusion-gemma/start-vllm.sh` — includes a WSL2 UVA patch and
18
- the `entropy_bound` diffusion sampler. Runs as docker container
19
- `diffusiongemma` (script is bind-mounted as `/start-vllm.sh`, so edits apply
20
- on `docker restart diffusiongemma`). The script also sources
21
- `src/demos/diffusion-gemma/.cache/huggingface/diffusion-env.sh`
22
- (host-writable through the HF-cache bind mount) — that's how
23
- `/api/engine-config` changes settings without recreating the container.
24
- - **Tuned for single-user local serving** (env-overridable in the start script):
25
- `GPU_MEM_UTIL` (default 0.28 ≈ 27 GiB — without caps vLLM grabbed ~88 GiB:
26
- 69 GiB KV cache for the native 262k context, measured <0.5% used),
27
- `MAX_MODEL_LEN` (32768), `MAX_NUM_SEQS` (1), `DIFFUSION_ENTROPY` (0.1),
28
- `ENFORCE_EAGER` (0). Weights are 17.4 GiB.
29
- - **Never re-add `--enforce-eager` casually:** it disables CUDA graphs AND
30
- torch.compile and cost 2.2× throughput (387 tok/s with it vs 841 avg without; peak 1002,
31
- steady-state ~644 on long runs). Set `ENFORCE_EAGER=1` only to debug
32
- WSL2/Blackwell graph-capture issues. Entropy 0.1→0.2 measured ≈ no speed
33
- change (745–845 tok/s) — the dial trades quality, not meaningful speed,
34
- at these settings.
35
-
36
- ## Routes
37
-
38
- | Route | What |
39
- | ----- | ---- |
40
- | `/` | Test harness UI (chat + compatibility tests via universal-llm-client) |
41
- | `/canvas` | "Signal from Noise" — cinematic chat UI replaying the diffusion process |
42
- | `/api/chat` | Chat via universal-llm-client (`messages`, `stream`, `maxTokens`, `temperature`) |
43
- | `/api/stream-raw` | Direct vLLM SSE proxy preserving chunk timing (`messages` or `prompt`, `maxTokens`, `thinking:false` to disable the thought channel). Always sets `skip_special_tokens:false` so channel markers survive. |
44
- | `/api/engine-config` | GET current entropy; POST `{entropy}` writes the env file + `docker restart`s the engine (~2–4 min; UI polls `/api/health`) |
45
- | `/api/health` | Pings vLLM `/v1/models` |
46
-
47
- ## Native protocol (no server-side parsers!)
48
-
49
- This vLLM build has **no reasoning parser and no tool-call parser module** —
50
- request-level `tools` with auto choice 400s. Everything is client-side, against
51
- the chat template's native markers (visible only with `skip_special_tokens:false`):
52
-
53
- - Reasoning: `<|channel>thought\n …<channel|>answer`. The canvas splits this
54
- with a streaming state machine (partial markers carried across chunks) and
55
- renders reasoning as a collapsible amber channel above the answer surface.
56
- - **Canvas reading view:** the mono token surface is the animation; when a
57
- reply settles it fades into a rendered-markdown view (zero-dep renderer in
58
- the inner script — headings/lists/code/bold/links, all input HTML-escaped
59
- first; backticks via `String.fromCharCode(96)` because literal backticks
60
- would terminate the outer template literal). Replay/scrub swaps back to the
61
- token surface. Root font scales with viewport (`clamp` on `html`) for
62
- screen-recording legibility. Max-tokens select goes to 16k (default 4k);
63
- `finish_reason:'length'` shows an amber "⚠ capped" warning in phase+footer.
64
- - Tool calls: `<|tool_call>call:name{k:<|"|>v<|"|>,n:3}<tool_call|>` — pseudo-JSON
65
- args (bare keys, `<|"|>` quote token). Send `tools` + `tool_choice:'none'`
66
- (declarations still get rendered into the template); history tool turns go as
67
- standard structured `tool_calls` + `role:'tool'` messages (template renders
68
- them natively).
69
- - All of this is implemented for the library in `src/gemma-diffusion.ts` and
70
- wired into the OpenAI provider (auto-detected by model name; override with
71
- `LLMClientOptions.gemmaNativeProtocol`). `chatWithTools` works end-to-end.
72
- Tests: `src/tests/gemma-diffusion.test.ts`. Probes: `probe-stream.ts`
73
- (chunk timing), `probe-tools.ts` (tool-loop wire format).
74
-
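The tool-call wire format described in the list above (`call:name{...}` with bare keys and the `<|"|>` quote token) can be parsed client-side roughly as follows. This is a sketch under stated assumptions, not the code in `src/gemma-diffusion.ts`: it handles flat argument objects only, treats unquoted values as numbers, booleans or `null` where they parse as such, and all function and type names are hypothetical.

```typescript
// Hypothetical types and helpers for illustration only.
interface ParsedToolCall {
    name: string;
    args: Record<string, string | number | boolean | null>;
}

const OPEN = '<|tool_call>call:';
const CLOSE = '<tool_call|>';
const QUOTE = '<|"|>';

// Unquoted values: true/false/null/number, else kept as a bare string.
function parseScalar(raw: string): string | number | boolean | null {
    const text = raw.trim();
    if (text === 'true') return true;
    if (text === 'false') return false;
    if (text === 'null') return null;
    const num = Number(text);
    return text !== '' && !Number.isNaN(num) ? num : text;
}

// Parses the text between '{' and '}' (flat key:value pairs, comma separated).
function parseArgs(body: string): ParsedToolCall['args'] {
    const args: ParsedToolCall['args'] = {};
    let pos = 0;
    while (pos < body.length) {
        const colon = body.indexOf(':', pos);
        if (colon === -1) break;
        const key = body.slice(pos, colon).trim();
        pos = colon + 1;
        if (body.startsWith(QUOTE, pos)) {
            // Quoted string: runs to the next quote token, so it may hold commas.
            const end = body.indexOf(QUOTE, pos + QUOTE.length);
            if (end === -1) throw new Error(`Unterminated string for key "${key}"`);
            args[key] = body.slice(pos + QUOTE.length, end);
            pos = end + QUOTE.length;
        } else {
            let end = body.indexOf(',', pos);
            if (end === -1) end = body.length;
            args[key] = parseScalar(body.slice(pos, end));
            pos = end;
        }
        if (body[pos] === ',') pos++;
    }
    return args;
}

// Extracts every complete tool call found in the model output.
function parseToolCalls(text: string): ParsedToolCall[] {
    const calls: ParsedToolCall[] = [];
    let from = 0;
    for (;;) {
        const start = text.indexOf(OPEN, from);
        if (start === -1) break;
        const end = text.indexOf(CLOSE, start);
        if (end === -1) break; // incomplete call, e.g. still streaming
        const inner = text.slice(start + OPEN.length, end); // name{...}
        const brace = inner.indexOf('{');
        if (brace !== -1 && inner.endsWith('}')) {
            calls.push({
                name: inner.slice(0, brace).trim(),
                args: parseArgs(inner.slice(brace + 1, -1)),
            });
        }
        from = end + CLOSE.length;
    }
    return calls;
}
```

Scanning for the quote token before splitting on commas is the design choice that matters here: a value such as `San Francisco, CA` contains a comma, so a plain `split(',')` would cut it in two.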
75
- ## Things that bite
76
-
77
- - **`canvas.ts` is one giant TS template literal.** Backslash escapes inside the
78
- inner `<script>` are eaten by the outer literal (`/\S+/` silently becomes
79
- `/S+/`). The inner script is written with ZERO backslashes — newlines via
80
- `String.fromCharCode(10)`, tokenizing via charCode scans. Keep it that way.
81
- - **No hot reload.** `CANVAS_HTML` is bundled at startup — restart the server
82
- after editing `canvas.ts` (kill the bun process on :3333, start again).
83
- - **Don't name a top-level browser var `history`** — `window.history` is
84
- unshadowable; the conversation array is called `convo`.
85
- - **Stream shape (measured):** the vLLM OpenAI stream emits ~1KB bursts, one per
86
- finished 256-token diffusion block, every ~0.8–1.2s. There is no per-denoise-step
87
- state in the stream; `/canvas` animates each block's reveal during the real
88
- compute window of the next block. `probe-stream.ts` logs chunk timing.
89
- - **The model emits stray unbalanced `<channel|>` closers** occasionally —
90
- the parser strips them (`RESIDUAL_SPECIAL` in gemma-diffusion.ts), and it
91
- sometimes puts the whole final answer inside the thought channel on
92
- post-tool turns.
93
- - **Entropy is engine-level** (`hf_overrides` read once at model init in
94
- vLLM's `diffusion_gemma.py`); per-request `vllm_xargs` is accepted but
95
- ignored. Hence the reload-based `/api/engine-config`.
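The notes above describe splitting the reasoning channel with a streaming state machine that carries partial markers across chunks and drops stray `<channel|>` closers. A minimal sketch of that idea follows. It assumes the thought channel opens with `<|channel>thought\n`, that everything after `<channel|>` is answer text, and that a closer seen outside the thought channel is simply discarded. `ChannelSplitter` and its helpers are hypothetical names, not the library's implementation.

```typescript
type Channel = 'thought' | 'answer';
interface Piece { channel: Channel; text: string }

const THOUGHT_OPEN = '<|channel>thought\n';
const CHANNEL_CLOSE = '<channel|>';

// Length of the longest suffix of `text` that is a proper prefix of `marker`.
function partialMarkerLength(text: string, marker: string): number {
    const max = Math.min(text.length, marker.length - 1);
    for (let len = max; len > 0; len--) {
        if (text.endsWith(marker.slice(0, len))) return len;
    }
    return 0;
}

class ChannelSplitter {
    private buffer = '';
    private channel: Channel = 'answer';

    // Feed one stream chunk; returns the text that is safe to render now.
    push(chunk: string): Piece[] {
        this.buffer += chunk;
        const pieces: Piece[] = [];
        for (;;) {
            const open = this.buffer.indexOf(THOUGHT_OPEN);
            const close = this.buffer.indexOf(CHANNEL_CLOSE);
            if (open === -1 && close === -1) break;
            const isOpen = open !== -1 && (close === -1 || open < close);
            const at = isOpen ? open : close;
            this.emit(pieces, this.buffer.slice(0, at));
            this.buffer = this.buffer.slice(at + (isOpen ? THOUGHT_OPEN : CHANNEL_CLOSE).length);
            // A stray closer while already in 'answer' is dropped: same transition.
            this.channel = isOpen ? 'thought' : 'answer';
        }
        // Hold back a tail that might be the start of a marker split across chunks.
        const keep = Math.max(
            partialMarkerLength(this.buffer, THOUGHT_OPEN),
            partialMarkerLength(this.buffer, CHANNEL_CLOSE),
        );
        this.emit(pieces, this.buffer.slice(0, this.buffer.length - keep));
        this.buffer = this.buffer.slice(this.buffer.length - keep);
        return pieces;
    }

    // Call once at end of stream to release any held-back tail.
    flush(): Piece[] {
        const pieces: Piece[] = [];
        this.emit(pieces, this.buffer);
        this.buffer = '';
        return pieces;
    }

    private emit(pieces: Piece[], text: string): void {
        if (text) pieces.push({ channel: this.channel, text });
    }
}
```

Holding back only the longest possible marker prefix keeps rendering latency low: ordinary text is released immediately, and at most a few characters wait for the next chunk.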
@@ -1,59 +0,0 @@
1
- # DiffusionGemma demo
2
-
3
- Standalone Bun demo for testing `universal-llm-client` against DiffusionGemma
4
- served by vLLM's OpenAI-compatible API.
5
-
6
- ## Run the backend
7
-
8
- From `packages/universal-llm-client`:
9
-
10
- ```bash
11
- docker compose -f src/demos/diffusion-gemma/docker-compose.yml up -d
12
- ```
13
-
14
- The compose file runs a `diffusiongemma` container on `localhost:8000`, mounts a
15
- demo-local Hugging Face cache at `src/demos/diffusion-gemma/.cache/huggingface`,
16
- and bind-mounts `start-vllm.sh` as the container entrypoint.
17
-
18
- If you already have an older hand-created `diffusiongemma` container, remove it
19
- before switching to the demo compose file:
20
-
21
- ```bash
22
- docker rm -f diffusiongemma
23
- ```
24
-
25
- Optional overrides:
26
-
27
- ```bash
28
- cp src/demos/diffusion-gemma/.env.example src/demos/diffusion-gemma/.env
29
- docker compose --env-file src/demos/diffusion-gemma/.env -f src/demos/diffusion-gemma/docker-compose.yml up -d
30
- ```
31
-
32
- Useful knobs are `VLLM_IMAGE`, `GPU_MEM_UTIL`, `MAX_MODEL_LEN`,
33
- `DIFFUSION_ENTROPY`, `ENFORCE_EAGER`, and `VLLM_NO_USAGE_STATS`.
34
-
35
- ## Run the demo UI
36
-
37
- ```bash
38
- bun run src/demos/diffusion-gemma/server.ts
39
- ```
40
-
41
- - Harness: <http://localhost:3333/>
42
- - Canvas: <http://localhost:3333/canvas>
43
- - vLLM API: <http://localhost:8000/v1/models>
44
-
45
- ## Notes
46
-
47
- - The prior BentoKit setup did not use a `docker-compose.yml`; it was a direct
48
- Docker container using a repo-root `scripts/diffusiongemma-start.sh` bind
49
- mount. This demo now carries its own compose file and startup script.
50
- - The default image is `vllm/vllm-openai:gemma`, the vLLM image line that
51
- includes DiffusionGemma support. Set `VLLM_IMAGE` if you need to test another
52
- local or registry image.
53
- - The first startup can take several minutes while vLLM loads and compiles the
54
- model. Poll `docker logs -f diffusiongemma` or `/api/health` from the demo UI.
55
- - The `/api/engine-config` endpoint writes `diffusion-env.sh` into the mounted
56
- Hugging Face cache and restarts the `diffusiongemma` container.
57
- - `VLLM_NO_USAGE_STATS=1` is enabled by default because this vLLM image can hit
58
- a non-fatal `py-cpuinfo` `JSONDecodeError` in its background usage-reporting
59
- thread under WSL during startup/reload.
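Because the first startup can take several minutes, a script that depends on the backend may want to wait for it instead of failing immediately. The sketch below polls the vLLM models endpoint listed in the README until it answers. `waitUntilReady` and `probeVllm` are hypothetical helpers, the attempt count and delay are arbitrary defaults, and treating any 2xx response as "ready" is an assumption.

```typescript
// Hypothetical helper: resolves true once `probe` reports ready,
// or false when the attempts are used up.
async function waitUntilReady(
    probe: () => Promise<boolean>,
    attempts = 60,
    delayMs = 5000,
): Promise<boolean> {
    for (let i = 0; i < attempts; i++) {
        // A refused connection while the container boots counts as "not ready".
        if (await probe().catch(() => false)) return true;
        if (i < attempts - 1) {
            await new Promise<void>((resolve) => setTimeout(resolve, delayMs));
        }
    }
    return false;
}

// URL taken from the README; any 2xx status is treated as ready (assumption).
const probeVllm = async (): Promise<boolean> =>
    (await fetch('http://localhost:8000/v1/models')).ok;
```

A caller would run `await waitUntilReady(probeVllm)` before sending the first chat request. Passing the probe as a function keeps the retry loop independent of the URL, so the same loop could instead target the demo server's `/api/health` route.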