npm - waypoi - Versions diffs - 0.0.0 - Mend

waypoi 0.0.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (260)

package/.github/instructions/ui.instructions.md +42 -0
package/.github/workflows/ci.yml +35 -0
package/.github/workflows/publish.yml +71 -0
package/.github/workflows/release.yml +48 -0
package/.playwright-mcp/console-2026-04-04T01-41-10-746Z.log +2 -0
package/.playwright-mcp/console-2026-04-04T01-41-28-799Z.log +3 -0
package/.playwright-mcp/console-2026-04-05T02-26-51-909Z.log +76 -0
package/.playwright-mcp/page-2026-04-04T01-41-10-816Z.yml +1 -0
package/.playwright-mcp/page-2026-04-04T01-41-29-141Z.yml +77 -0
package/.playwright-mcp/page-2026-04-04T01-41-42-633Z.yml +190 -0
package/.playwright-mcp/page-2026-04-04T01-42-03-929Z.yml +262 -0
package/.playwright-mcp/page-2026-04-04T02-12-54-813Z.yml +6 -0
package/.playwright-mcp/page-2026-04-04T02-14-58-600Z.yml +190 -0
package/.playwright-mcp/page-2026-04-04T02-15-03-923Z.yml +190 -0
package/.playwright-mcp/page-2026-04-04T02-15-07-426Z.yml +190 -0
package/.playwright-mcp/page-2026-04-04T02-15-25-729Z.yml +262 -0
package/.playwright-mcp/page-2026-04-04T02-16-22-984Z.yml +262 -0
package/.playwright-mcp/page-2026-04-04T02-17-00-599Z.yml +190 -0
package/.playwright-mcp/page-2026-04-04T02-17-50-874Z.yml +190 -0
package/.playwright-mcp/page-2026-04-05T02-26-55-570Z.yml +6 -0
package/AGENTS.md +48 -0
package/CHANGELOG.md +131 -0
package/README.md +552 -0
package/assets/agent-mode.png +0 -0
package/assets/categorize.png +0 -0
package/assets/dashboard.png +0 -0
package/assets/endpoint-proxy.png +0 -0
package/assets/icon.png +0 -0
package/assets/mcp-generate-image.png +0 -0
package/assets/mcp-understand-image.png +0 -0
package/assets/peek-token-flow.png +0 -0
package/assets/playground.png +0 -0
package/assets/sankey.png +0 -0
package/cli/index.ts +2805 -0
package/cli/legacyRewrite.ts +108 -0
package/cli/modelRef.ts +24 -0
package/dist/cli/index.js +2536 -0
package/dist/cli/legacyRewrite.js +92 -0
package/dist/cli/modelRef.js +20 -0
package/dist/src/benchmark/artifacts.js +131 -0
package/dist/src/benchmark/capabilityClassifier.js +81 -0
package/dist/src/benchmark/capabilityStore.js +144 -0
package/dist/src/benchmark/config.js +238 -0
package/dist/src/benchmark/gates.js +118 -0
package/dist/src/benchmark/jobs.js +252 -0
package/dist/src/benchmark/runner.js +1847 -0
package/dist/src/benchmark/schema.js +353 -0
package/dist/src/benchmark/suites.js +314 -0
package/dist/src/benchmark/tinyQaDataset.js +422 -0
package/dist/src/benchmark/types.js +25 -0
package/dist/src/config.js +47 -0
package/dist/src/index.js +178 -0
package/dist/src/mcp/client.js +215 -0
package/dist/src/mcp/discovery.js +226 -0
package/dist/src/mcp/policy.js +65 -0
package/dist/src/mcp/registry.js +129 -0
package/dist/src/mcp/service.js +460 -0
package/dist/src/middleware/auth.js +179 -0
package/dist/src/middleware/requestCapture.js +192 -0
package/dist/src/middleware/requestStats.js +118 -0
package/dist/src/pools/builder.js +132 -0
package/dist/src/pools/repository.js +69 -0
package/dist/src/pools/scheduler.js +360 -0
package/dist/src/pools/types.js +2 -0
package/dist/src/protocols/adapters/dashscope.js +267 -0
package/dist/src/protocols/adapters/inferenceV2.js +346 -0
package/dist/src/protocols/adapters/openai.js +27 -0
package/dist/src/protocols/registry.js +99 -0
package/dist/src/protocols/types.js +2 -0
package/dist/src/providers/health.js +153 -0
package/dist/src/providers/importer.js +289 -0
package/dist/src/providers/modelRegistry.js +313 -0
package/dist/src/providers/repository.js +361 -0
package/dist/src/providers/types.js +2 -0
package/dist/src/routes/admin.js +531 -0
package/dist/src/routes/audio.js +295 -0
package/dist/src/routes/chat.js +240 -0
package/dist/src/routes/embeddings.js +157 -0
package/dist/src/routes/images.js +288 -0
package/dist/src/routes/mcp.js +256 -0
package/dist/src/routes/mcpService.js +100 -0
package/dist/src/routes/models.js +48 -0
package/dist/src/routes/responses.js +711 -0
package/dist/src/routes/sessions.js +450 -0
package/dist/src/routes/stats.js +270 -0
package/dist/src/routes/ui.js +97 -0
package/dist/src/routes/videos.js +107 -0
package/dist/src/routing/router.js +338 -0
package/dist/src/services/imageGeneration.js +280 -0
package/dist/src/services/imageUnderstanding.js +352 -0
package/dist/src/services/videoGeneration.js +79 -0
package/dist/src/storage/captureRepository.js +1591 -0
package/dist/src/storage/files.js +157 -0
package/dist/src/storage/imageCache.js +346 -0
package/dist/src/storage/repositories.js +388 -0
package/dist/src/storage/sessionRepository.js +370 -0
package/dist/src/storage/statsRepository.js +204 -0
package/dist/src/transport/httpClient.js +126 -0
package/dist/src/types.js +2 -0
package/dist/src/utils/messageMedia.js +285 -0
package/dist/src/utils/modelCapabilities.js +108 -0
package/dist/src/utils/modelDiscovery.js +170 -0
package/dist/src/version.js +5 -0
package/dist/src/workers/captureRetention.js +25 -0
package/dist/src/workers/configWatcher.js +91 -0
package/dist/src/workers/healthChecker.js +21 -0
package/dist/src/workers/statsRotation.js +41 -0
package/docs/LLM/output_schema.md +312 -0
package/docs/benchmark.md +208 -0
package/docs/mcp-guidelines.md +125 -0
package/docs/mcp-service.md +178 -0
package/docs/opencode.md +86 -0
package/docs/providers.md +79 -0
package/examples/benchmark.config.yaml +28 -0
package/examples/providers/alibaba-dashscope.yaml +88 -0
package/examples/providers/alibaba-llm.yaml +64 -0
package/examples/providers/alibaba-registry.yaml +7 -0
package/examples/providers/inference-v2-ray.yaml +29 -0
package/examples/scenarios/assets/omni-call-sample.wav +0 -0
package/examples/scenarios/custom.jsonl +5 -0
package/examples/scenarios/custom.yaml +40 -0
package/model-form-v2.png +0 -0
package/package.json +66 -0
package/provider-form-v2.png +0 -0
package/provider-form.png +0 -0
package/scripts/manual-test.sh +11 -0
package/scripts/version-from-git.js +23 -0
package/src/benchmark/artifacts.ts +149 -0
package/src/benchmark/capabilityClassifier.ts +99 -0
package/src/benchmark/capabilityStore.ts +174 -0
package/src/benchmark/config.ts +337 -0
package/src/benchmark/gates.ts +164 -0
package/src/benchmark/jobs.ts +312 -0
package/src/benchmark/runner.ts +2519 -0
package/src/benchmark/schema.ts +443 -0
package/src/benchmark/suites.ts +323 -0
package/src/benchmark/tinyQaDataset.ts +428 -0
package/src/benchmark/types.ts +442 -0
package/src/config.ts +44 -0
package/src/index.ts +195 -0
package/src/mcp/client.ts +305 -0
package/src/mcp/discovery.ts +266 -0
package/src/mcp/policy.ts +105 -0
package/src/mcp/registry.ts +164 -0
package/src/mcp/service.ts +611 -0
package/src/middleware/auth.ts +251 -0
package/src/middleware/requestCapture.ts +245 -0
package/src/middleware/requestStats.ts +163 -0
package/src/pools/builder.ts +159 -0
package/src/pools/repository.ts +71 -0
package/src/pools/scheduler.ts +425 -0
package/src/pools/types.ts +117 -0
package/src/protocols/adapters/dashscope.ts +335 -0
package/src/protocols/adapters/inferenceV2.ts +428 -0
package/src/protocols/adapters/openai.ts +32 -0
package/src/protocols/registry.ts +117 -0
package/src/protocols/types.ts +81 -0
package/src/providers/health.ts +207 -0
package/src/providers/importer.ts +402 -0
package/src/providers/modelRegistry.ts +415 -0
package/src/providers/repository.ts +439 -0
package/src/providers/types.ts +113 -0
package/src/routes/admin.ts +666 -0
package/src/routes/audio.ts +372 -0
package/src/routes/chat.ts +301 -0
package/src/routes/embeddings.ts +197 -0
package/src/routes/images.ts +356 -0
package/src/routes/mcp.ts +320 -0
package/src/routes/mcpService.ts +114 -0
package/src/routes/models.ts +50 -0
package/src/routes/responses.ts +872 -0
package/src/routes/sessions.ts +558 -0
package/src/routes/stats.ts +312 -0
package/src/routes/ui.ts +96 -0
package/src/routes/videos.ts +132 -0
package/src/routing/router.ts +501 -0
package/src/services/imageGeneration.ts +396 -0
package/src/services/imageUnderstanding.ts +449 -0
package/src/services/videoGeneration.ts +127 -0
package/src/storage/captureRepository.ts +1835 -0
package/src/storage/files.ts +178 -0
package/src/storage/imageCache.ts +405 -0
package/src/storage/repositories.ts +494 -0
package/src/storage/sessionRepository.ts +419 -0
package/src/storage/statsRepository.ts +238 -0
package/src/transport/httpClient.ts +145 -0
package/src/types.ts +322 -0
package/src/utils/messageMedia.ts +293 -0
package/src/utils/modelCapabilities.ts +161 -0
package/src/utils/modelDiscovery.ts +203 -0
package/src/workers/captureRetention.ts +25 -0
package/src/workers/configWatcher.ts +115 -0
package/src/workers/healthChecker.ts +22 -0
package/src/workers/statsRotation.ts +49 -0
package/tests/benchmarkAdminRoutes.test.ts +82 -0
package/tests/benchmarkBasics.test.ts +116 -0
package/tests/captureAdminRoutes.test.ts +420 -0
package/tests/captureRepository.test.ts +797 -0
package/tests/cliLegacyRewrite.test.ts +45 -0
package/tests/imageGeneration.service.test.ts +107 -0
package/tests/imageUnderstanding.service.test.ts +123 -0
package/tests/mcpPolicy.test.ts +105 -0
package/tests/mcpService.test.ts +1245 -0
package/tests/modelRef.test.ts +23 -0
package/tests/modelsRoutes.test.ts +154 -0
package/tests/sessionMediaCache.test.ts +167 -0
package/tests/statsRoutes.test.ts +323 -0
package/tsconfig.json +15 -0
package/ui/index.html +16 -0
package/ui/package-lock.json +8521 -0
package/ui/package.json +52 -0
package/ui/postcss.config.js +6 -0
package/ui/public/assets/apple-touch-icon.png +0 -0
package/ui/public/assets/favicon-16.png +0 -0
package/ui/public/assets/favicon-32.png +0 -0
package/ui/public/assets/icon-192.png +0 -0
package/ui/public/assets/icon-512.png +0 -0
package/ui/src/App.tsx +27 -0
package/ui/src/api/client.ts +1503 -0
package/ui/src/components/EndpointUsageGuide.tsx +361 -0
package/ui/src/components/Layout.tsx +124 -0
package/ui/src/components/MessageContent.tsx +365 -0
package/ui/src/components/ToolCallMessage.tsx +179 -0
package/ui/src/components/ToolPicker.tsx +442 -0
package/ui/src/components/messageContentParser.test.ts +41 -0
package/ui/src/components/messageContentParser.ts +73 -0
package/ui/src/components/thinkingPreview.test.ts +27 -0
package/ui/src/components/thinkingPreview.ts +15 -0
package/ui/src/components/toMermaidSankey.test.ts +78 -0
package/ui/src/components/toMermaidSankey.ts +56 -0
package/ui/src/components/ui/button.tsx +58 -0
package/ui/src/components/ui/input.tsx +21 -0
package/ui/src/components/ui/textarea.tsx +21 -0
package/ui/src/lib/utils.ts +6 -0
package/ui/src/main.tsx +9 -0
package/ui/src/pages/AgentPlayground.tsx +2010 -0
package/ui/src/pages/Benchmark.tsx +988 -0
package/ui/src/pages/Dashboard.tsx +581 -0
package/ui/src/pages/Peek.tsx +962 -0
package/ui/src/pages/Settings.tsx +2013 -0
package/ui/src/pages/agentPlaygroundPayload.test.ts +109 -0
package/ui/src/pages/agentPlaygroundPayload.ts +97 -0
package/ui/src/pages/agentThinkingContent.test.ts +50 -0
package/ui/src/pages/agentThinkingContent.ts +57 -0
package/ui/src/pages/dashboardTokenUsage.test.ts +66 -0
package/ui/src/pages/dashboardTokenUsage.ts +36 -0
package/ui/src/pages/imageUpload.test.ts +39 -0
package/ui/src/pages/imageUpload.ts +71 -0
package/ui/src/pages/peekFilters.test.ts +29 -0
package/ui/src/pages/peekFilters.ts +13 -0
package/ui/src/pages/peekMedia.test.ts +58 -0
package/ui/src/pages/peekMedia.ts +148 -0
package/ui/src/pages/sessionAutoTitle.test.ts +128 -0
package/ui/src/pages/sessionAutoTitle.ts +106 -0
package/ui/src/stores/settings.ts +58 -0
package/ui/src/styles/globals.css +223 -0
package/ui/src/vite-env.d.ts +8 -0
package/ui/tailwind.config.js +106 -0
package/ui/tsconfig.json +32 -0
package/ui/vite.config.ts +37 -0

package/dist/src/workers/healthChecker.js ADDED

@@ -0,0 +1,21 @@
+"use strict";
+Object.defineProperty(exports, "__esModule", { value: true });
+exports.startHealthChecker = startHealthChecker;
+exports.stopHealthChecker = stopHealthChecker;
+const health_1 = require("../providers/health");
+let healthTimer = null;
+function startHealthChecker(paths) {
+    const intervalMs = 30_000;
+    const run = async () => {
+        await (0, health_1.probeProviderModels)(paths);
+    };
+    healthTimer = setInterval(run, intervalMs);
+    healthTimer.unref();
+    void run();
+}
+function stopHealthChecker() {
+    if (healthTimer) {
+        clearInterval(healthTimer);
+        healthTimer = null;
+    }
+}

package/dist/src/workers/statsRotation.js ADDED

@@ -0,0 +1,41 @@
+"use strict";
+Object.defineProperty(exports, "__esModule", { value: true });
+exports.startStatsRotation = startStatsRotation;
+exports.stopStatsRotation = stopStatsRotation;
+const statsRepository_1 = require("../storage/statsRepository");
+/**
+ * Stats Rotation Worker
+ *
+ * Periodically cleans up stats files older than retention period.
+ * Runs daily by default.
+ */
+const ROTATION_INTERVAL_MS = 24 * 60 * 60 * 1000; // 24 hours
+const DEFAULT_RETENTION_DAYS = 30;
+let rotationTimer = null;
+function startStatsRotation(paths, retentionDays = DEFAULT_RETENTION_DAYS) {
+    if (rotationTimer) {
+        return; // Already running
+    }
+    async function runRotation() {
+        try {
+            const deleted = await (0, statsRepository_1.rotateStats)(paths, retentionDays);
+            if (deleted > 0) {
+                console.log(`[stats-rotation] Deleted ${deleted} stats file(s) older than ${retentionDays} days`);
+            }
+        }
+        catch (error) {
+            console.error("[stats-rotation] Error rotating stats:", error);
+        }
+    }
+    // Run immediately on startup, then periodically
+    runRotation();
+    rotationTimer = setInterval(runRotation, ROTATION_INTERVAL_MS);
+    // Prevent timer from keeping process alive
+    rotationTimer.unref();
+}
+function stopStatsRotation() {
+    if (rotationTimer) {
+        clearInterval(rotationTimer);
+        rotationTimer = null;
+    }
+}
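
Reviewer sketch (not part of the package): both workers above share one shape. They run once at startup, repeat on an interval, and call `unref()` on the timer so it does not keep the process alive. The TypeScript below is a minimal illustration of that pattern under stated assumptions; `createIntervalWorker`, `IntervalWorker`, and the `label` parameter are hypothetical names that do not appear in the diff. It wraps each run in `try`/`catch`, as `statsRotation.js` does, so a rejected run is logged rather than left as an unhandled rejection.

```typescript
// Hypothetical helper for illustration only; not from the waypoi package.
interface IntervalWorker {
  start(): void;
  stop(): void;
  isRunning(): boolean;
}

function createIntervalWorker(
  task: () => Promise<void>,
  intervalMs: number,
  label = "worker"
): IntervalWorker {
  let timer: ReturnType<typeof setInterval> | null = null;

  // Run the task once and log failures instead of letting them escape.
  const tick = async (): Promise<void> => {
    try {
      await task();
    } catch (error) {
      console.error(`[${label}] run failed:`, error);
    }
  };

  return {
    start() {
      if (timer !== null) return; // already running, keep start idempotent
      void tick(); // run immediately on startup
      timer = setInterval(() => void tick(), intervalMs);
      // In Node.js the timer object has unref(); skip it where it does not exist.
      (timer as unknown as { unref?: () => void }).unref?.();
    },
    stop() {
      if (timer !== null) {
        clearInterval(timer);
        timer = null;
      }
    },
    isRunning() {
      return timer !== null;
    },
  };
}
```

Under these assumptions, a caller would create one worker per job and call `start()` during server boot and `stop()` during shutdown; calling `start()` twice has no extra effect.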

package/docs/LLM/output_schema.md ADDED

@@ -0,0 +1,312 @@
+# LLM Output Schema
+This document describes the output format supported by Waypoi's UI for displaying LLM responses, including thinking/reasoning content.
+## Overview
+Waypoi's Playground UI can display the thinking process of LLMs that return reasoning content. The system handles both:
+1. **Native reasoning fields** from LLM APIs (e.g., DeepSeek's `reasoning_content`)
+2. **HTML-style tags** embedded in the response text
+## Supported Thinking Formats
+### 1. Native API Fields
+Some LLM providers include reasoning content in separate fields of the streaming response:
+| Provider | Field Name | Example |
+|----------|------------|---------|
+| DeepSeek | `reasoning_content` | `choices[0].delta.reasoning_content` |
+| Other providers | `reasoning` | `choices[0].delta.reasoning` |
+The Waypoi backend automatically extracts these fields and the frontend wraps them in `<think>` tags for display.
+### 2. HTML-Style Tags
+LLMs can also output thinking content wrapped in HTML-like tags:
+```
+<think>
+This is my thinking process...
+Step 1: Analyze the problem
+Step 2: Consider solutions
+</think>
+This is the final response based on my reasoning above.
+```
+The UI recognizes these tags and renders the thinking content in a collapsible "Thinking process" block.
+## Streaming Response Format
+When `stream: true` is enabled, the API returns Server-Sent Events (SSE) with the following structure:
+### Standard OpenAI Format
+```json
+{
+  "id": "chatcmpl-xxx",
+  "object": "chat.completion.chunk",
+  "created": 1234567890,
+  "model": "gpt-4",
+  "choices": [
+    {
+      "index": 0,
+      "delta": {
+        "content": "Hello",
+        "reasoning_content": "I should greet the user warmly"
+      },
+      "finish_reason": null
+    }
+  ]
+}
+```
+### Frontend Processing
+The frontend processes each chunk:
+1. Extracts `content` (regular response text)
+2. Extracts `reasoning_content` or `reasoning` (thinking process)
+3. Wraps reasoning in `<think>` tags when transitioning from reasoning to content
+4. Combines both into the display message
+Example flow:
+```
+Chunk 1: { reasoning: "Let me think..." }           → Display: "<think>Let me think..."
+Chunk 2: { reasoning: "Step 1: Analyze" }           → Display: "<think>Let me think...Step 1: Analyze"
+Chunk 3: { content: "Based on my analysis" }        → Display: "<think>Let me think...Step 1: Analyze</think>\n\nBased on my analysis"
+```
+## Display Behavior
+### Thinking Block UI
+When the UI detects `<think>...</think>` content, it renders:
+- A collapsible block with a "Thinking process" header
+- Brain icon and expand/collapse chevron
+- Monospace font for the thinking content
+- Collapsed by default to focus on the main response
+### Parsing Logic
+The `MessageContent` component handles three edge cases:
+1. **Standard format**: `<think>...content...</think>` - Properly tagged thinking
+2. **Missing opening tag**: Content before `</think>` is treated as thinking
+3. **Unclosed tag**: `<think>...` during streaming (tag will be closed when content arrives)
+## Supported Models
+The following models are known to provide reasoning content:
+| Model | Reasoning Field | Notes |
+|-------|----------------|-------|
+| DeepSeek-R1 | `reasoning_content` | Chain-of-thought reasoning |
+| DeepSeek-V3 | `reasoning_content` | Extended thinking mode |
+| Other reasoning models | `reasoning` | Generic field support |
+Models that output `<think>` tags in their response (like some Qwen or Llama variants) will also have their thinking content displayed correctly.
+## Implementation Details
+### Backend (`src/routes/responses.ts`)
+The Responses API shim handles reasoning content from Codex-formatted requests:
+```typescript
+if (delta.reasoning_content || delta.reasoning) {
+  const reasoningDelta = delta.reasoning_content || delta.reasoning;
+  sendEvent("response.reasoning_text.delta", {
+    type: "response.reasoning_text.delta",
+    delta: reasoningDelta
+  });
+}
+```
+### Frontend (`ui/src/api/client.ts`)
+The streaming client extracts both content and reasoning:
+```typescript
+const delta = parsed.choices?.[0]?.delta;
+const content = delta?.content;
+const reasoning = delta?.reasoning_content || delta?.reasoning;
+if (content || reasoning) {
+  yield { content: content || '', reasoning: reasoning || undefined };
+}
+```
+### Playground (`ui/src/pages/Playground.tsx`)
+The Playground component tracks reasoning and content separately, then combines them:
+```typescript
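+if (chunk.reasoning) {
+  if (!hasReasoning) {
+    // First reasoning chunk: open the thinking tag
+    reasoningContent = '<think>' + chunk.reasoning;
+    hasReasoning = true;
+  } else {
+    reasoningContent += chunk.reasoning;
+  }
+}
+// First content chunk after reasoning: close the thinking tag once
+if (chunk.content && hasReasoning && !reasoningClosed) {
+  reasoningContent += '</think>';
+  reasoningClosed = true;
+}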
+## Testing
+To verify thinking content display:
+1. Use a model that supports reasoning (e.g., DeepSeek-R1)
+2. Send a complex question requiring multi-step reasoning
+3. Observe that the "Thinking process" collapsible block appears
+4. Expand to see the reasoning content
+5. Verify the final response follows the thinking block
+Example test prompt:
+```
+If a train travels 120 km in 2 hours, then stops for 30 minutes,
+then continues at the same speed for another 90 km,
+what is the total travel time?
+```
+Models with reasoning capability will show their calculation steps in the thinking block before providing the final answer.
+## Future Enhancements
+- Support for multiple thinking blocks in a single response
+- Configurable thinking display (always show/hide by default)
+- Token count display for reasoning vs. response content
+- Export thinking content separately from the final response
+## Codex CLI Specific Schema
+Codex CLI uses a custom event-based protocol rather than the standard OpenAI API format. The Waypoi proxy must transform standard API responses to match Codex's expectations.
+### Key Differences from Standard API
+| Feature | Standard OpenAI | Codex CLI |
+|---------|----------------|-----------|
+| Reasoning Content | Embedded in `delta.content` or separate field | Dedicated `AgentReasoningDeltaEvent` events |
+| Tool Calls | Standard `tool_calls` array | Custom `McpToolCallBeginEvent`/`McpToolCallEndEvent` |
+| Command Execution | Not supported | Special `ExecCommandBeginEvent`/`ExecCommandEndEvent` |
+| Model Requirements | Standard names | Specific names like `gpt-5.1-codex-mini` |
+### Codex-Specific Event Types
+Codex CLI expects the following event types in the stream:
+```typescript
+// Reasoning content
+interface AgentReasoningDeltaEvent {
+  type: 'agent_reasoning_delta';
+  delta: string;
+}
+// Raw reasoning content (for internal processing)
+interface AgentReasoningRawContentDeltaEvent {
+  type: 'agent_reasoning_raw_content_delta';
+  delta: string;
+}
+// Regular message content
+interface AgentMessageDeltaEvent {
+  type: 'agent_message_delta';
+  delta: string;
+}
+// Tool calls
+interface McpToolCallBeginEvent {
+  type: 'mcp_tool_call_begin';
+  name: string;
+  arguments: object;
+}
+interface McpToolCallEndEvent {
+  type: 'mcp_tool_call_end';
+  result: string;
+}
+// Command execution
+interface ExecCommandBeginEvent {
+  type: 'exec_command_begin';
+  command: string;
+  source: 'user' | 'agent';
+}
+interface ExecCommandEndEvent {
+  type: 'exec_command_end';
+  exit_code: number;
+  output: string;
+}
+```
+### Proxy Transformation Rules
+The Waypoi proxy handles the translation between standard OpenAI format and Codex's custom protocol:
+1. **Reasoning Content Extraction**
+   - When `reasoning_content` or `reasoning` fields are detected, they're converted to `AgentReasoningDeltaEvent`
+   - Example: `{"delta": {"reasoning_content": "Thinking step..."}}` → `{"type": "agent_reasoning_delta", "delta": "Thinking step..."}`
+2. **Special Model Handling**
+   - Requests to `gpt-5.1-codex-mini` are routed to specific endpoints
+   - Other Codex-specific models are transformed to match backend requirements
+3. **Tool Call Conversion**
+   - Standard tool calls are converted to `McpToolCallBeginEvent`/`McpToolCallEndEvent` sequence
+   - Custom tool parameters are preserved
+### Implementation in Waypoi
+The proxy implements these transformations in `src/routes/responses.ts`:
+```typescript
+// Convert OpenAI tool calls to Codex MCP events
+if (delta.tool_calls) {
+  delta.tool_calls.forEach(toolCall => {
+    sendEvent("mcp_tool_call_begin", {
+      type: "mcp_tool_call_begin",
+      name: toolCall.function.name,
+      arguments: JSON.parse(toolCall.function.arguments)
+    });
+  });
+}
+// Handle reasoning content from various sources
+if (delta.reasoning_content || delta.reasoning) {
+  const reasoningDelta = delta.reasoning_content || delta.reasoning;
+  sendEvent("response.reasoning_text.delta", {
+    type: "response.reasoning_text.delta",
+    delta: reasoningDelta
+  });
+}
+```
+### Testing with Codex CLI
+To verify Codex CLI compatibility:
+1. Set the model to `gpt-5.1-codex-mini` in your settings
+2. Enable reasoning mode if available
+3. Send a complex prompt requiring multi-step reasoning
+4. Verify the thinking process appears in dedicated blocks
+5. Test tool calling with `@mcp` commands
+Example Codex CLI prompt:
+```
+@model gpt-5.1-codex-mini
+@reasoning
+Explain step by step how you would calculate the area of a triangle with sides 3, 4, and 5.
+```
+The proxy should properly route this request and format the response to match Codex's event structure, with reasoning content separated from the final answer.
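
Reviewer sketch (not part of the package): the "Parsing Logic" section of the doc above names three edge cases for thinking content. The TypeScript below is one way those cases could be handled, assuming the tags are `<think>` and `</think>`; that tag name is an assumption and should be checked against the `MessageContent` component. `splitThinking` and `ParsedMessage` are hypothetical names, and this is not the component's actual implementation.

```typescript
// Hypothetical parser for illustration; tag names are an assumption.
const OPEN_TAG = "<think>";
const CLOSE_TAG = "</think>";

interface ParsedMessage {
  thinking: string | null; // null when no thinking content was found
  answer: string;
  thinkingOpen: boolean; // true while a streamed thinking block is still unclosed
}

function splitThinking(text: string): ParsedMessage {
  const openIdx = text.indexOf(OPEN_TAG);
  const closeIdx = text.indexOf(CLOSE_TAG);

  // Case 2: missing opening tag. Everything before the close tag is thinking.
  if (closeIdx !== -1 && (openIdx === -1 || closeIdx < openIdx)) {
    return {
      thinking: text.slice(0, closeIdx).trim(),
      answer: text.slice(closeIdx + CLOSE_TAG.length).trim(),
      thinkingOpen: false,
    };
  }

  // Case 1: standard format with both tags in order.
  if (openIdx !== -1 && closeIdx !== -1) {
    const before = text.slice(0, openIdx);
    const after = text.slice(closeIdx + CLOSE_TAG.length);
    return {
      thinking: text.slice(openIdx + OPEN_TAG.length, closeIdx).trim(),
      answer: (before + after).trim(),
      thinkingOpen: false,
    };
  }

  // Case 3: unclosed tag, typical while reasoning is still streaming.
  if (openIdx !== -1) {
    return {
      thinking: text.slice(openIdx + OPEN_TAG.length).trim(),
      answer: text.slice(0, openIdx).trim(),
      thinkingOpen: true,
    };
  }

  // No tags at all: the whole text is the answer.
  return { thinking: null, answer: text, thinkingOpen: false };
}
```

A renderer built on this sketch could show `thinking` in the collapsible block and keep the block marked as in progress while `thinkingOpen` is true.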

package/docs/benchmark.md ADDED

@@ -0,0 +1,208 @@
+# Waypoi Benchmark
+Waypoi benchmark now has two roles:
+- `showcase`: a live, user-visible replay of curated examples
+- `diagnostic`: the older internal smoke/capability/regression path
+Default behavior is showcase-first.
+## Quick start
+```bash
+# Default run: showcase suite, one visible replay per example
+waypoi bench
+# List showcase examples
+waypoi bench --list-examples
+# Run one example
+waypoi bench --example showcase-tinyqa-001
+# Pin a model for a showcase example
+waypoi bench --suite showcase --example showcase-tinyqa-001 --model smart
+# Run a diagnostic suite
+waypoi bench --mode diagnostic --suite pool_smoke
+# Add file-driven scenarios
+waypoi bench --scenario ./examples/scenarios/custom.yaml
+# Compare with a previous diagnostic run
+waypoi bench --mode diagnostic --baseline ~/.config/waypoi/benchmarks/bench-2026-02-23T12-00-00-000Z.json
+```
+## CLI options
+- `--suite <name>` built-in suite. Public default is `showcase`.
+- `--example <id>` run one built-in example from the selected suite.
+- `--list-examples` list built-in examples and exit.
+- `--mode <name>` `showcase` or `diagnostic`.
+- `--scenario <path>` scenario file (`.json`, `.jsonl`, `.yaml`, `.yml`).
+- `--model <name>` force one model for all scenarios.
+- `--out <path>` output file (`.json`/`.txt`) or output directory.
+- `--config <path>` benchmark config file (YAML or JSON).
+- `--profile <name>` config profile (default: `local`).
+- `--baseline <path>` previous benchmark report for p95/throughput deltas.
+- `--update-cap-cache` persist capability findings to `$WAYPOI_DIR/capabilities`.
+- `--cap-ttl-days <n>` capability TTL override for freshness (default `7`).
+- `--temperature <n>` run-level generation override for supported modes.
+- `--top-p <n>` run-level generation override (`0..1`) for supported modes.
+- `--max-tokens <n>` run-level generation override (`>=1`) for supported modes.
+- `--presence-penalty <n>` run-level generation override (`-2..2`) for supported modes.
+- `--frequency-penalty <n>` run-level generation override (`-2..2`) for supported modes.
+- `--seed <n>` optional run-level deterministic seed (`>=0`) for supported modes.
+- `--stop <value>` optional stop sequence (string) or comma-separated list in UI.
+## Showcase examples
+The `showcase` suite is the release-facing path. It is built from Hugging Face
+dataset `vincentkoc/tiny_qa_benchmark` (train split):
+- 52 single-question QA prompts
+- chat-mode single-turn runs
+- per-question answer checks via `contains`
+- category/difficulty metadata exposed as expected highlights
+Showcase behavior:
+- sequential only
+- one visible replay per scenario
+- request/response trace is the main artifact
+- verdict explains what passed or failed
+- raw payloads stay in the live event stream; persisted artifacts keep sanitized traces
+## Diagnostic suites
+The older suites remain for engineering use:
+- `smoke`
+- `proxy`
+- `agent`
+- `pool_smoke`
+- `omni_call_smoke`
+- `capabilities`
+Diagnostic behavior:
+- profile-driven warmup and measured runs
+- pass-rate and latency summaries
+- optional baseline regression warnings
+- optional capability cache updates
+Concurrency is no longer part of the benchmark story.
+## Scenario schema
+Required fields:
+- `id: string`
+- `mode: "chat" | "agent" | "responses" | "embeddings" | "image_generation" | "audio_transcription" | "audio_speech" | "omni_call"`
+Mode-specific required fields:
+- `chat | agent | responses | image_generation`: `prompt`
+- `embeddings`: `input`
+- `audio_transcription`: `audioFile`
+- `audio_speech`: `inputText`, `voice`
+- `omni_call`: `audioFile`
+Useful showcase metadata:
+- `title`
+- `summary`
+- `userVisibleGoal`
+- `exampleSource`
+- `inputPreview`
+- `successCriteria`
+- `expectedHighlights`
+- `requiresAvailableTools`
+Assertions:
+- generic: `statusCode`, `maxLatencyMs`
+- chat/agent/responses: `contains`, `notContains`
+- agent: `minToolCalls`, `maxToolCalls`, `requiredToolNames`
+- embeddings: `minItems`, `minVectorLength`
+- image generation: `minImages`
+- audio transcription: `containsText`, `notContainsText`
+- audio speech: `minBytes`, `contentType`
+Validation behavior:
+- schema errors fail fast with `file + index + field`
+- unknown fields become warnings
+Generation parameter precedence for supported modes (`chat`, `agent`, `responses`, `omni_call`):
+1. scenario-level value
+2. run-level override
+3. config defaults
+4. built-in defaults
+### Example: showcase responses scenario
+```json
+{
+  "id": "responses-demo",
+  "mode": "responses",
+  "title": "Responses Demo",
+  "userVisibleGoal": "Show Responses API compatibility.",
+  "prompt": "List two reasons to use a local AI gateway.",
+  "assertions": {
+    "statusCode": 200
+  }
+}
+```
+### Example: showcase tool-calling scenario
+```json
+{
+  "id": "agent-tool-demo",
+  "mode": "agent",
+  "title": "Tool Calling",
+  "prompt": "Use one available tool, then summarize what you learned.",
+  "requiresAvailableTools": true,
+  "assertions": {
+    "statusCode": 200,
+    "minToolCalls": 1
+  }
+}
+```
+## Artifacts and UI behavior
+Each run writes:
+- `bench-<timestamp>.json`
+- `bench-<timestamp>.txt`
+Reports now include:
+- run metadata and effective config
+- per-scenario results
+- sanitized scenario details for history
+- live-show traces for each scenario
+- verdict strings and tool usage summaries
+- optional capability matrix
+The Benchmark UI is optimized for:
+- guided suite selection (showcase or diagnostic suite)
+- selecting one showcase example when applicable
+- tuning generation parameters (temperature, top_p, max_tokens, penalties, seed, stop)
+- using an Advanced section for model override, scenario files, profile, and capability cache controls
+- watching the live trace
+- reading the exact scenario input
+- inspecting tool calls and tool results
+- seeing the final verdict clearly
+## Verification checklist
+- `waypoi bench` defaults to showcase behavior.
+- `waypoi bench --list-examples` lists human-readable examples.
+- Benchmark UI loads showcase examples by default.
+- A showcase run shows scenario input, wire request, response, and verdict.
+- Tool-driven examples are skipped clearly when no MCP tools are available.
+- Diagnostic suites still produce capability and regression information.
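
Reviewer sketch (not part of the package): the "Scenario schema" section of the doc above lists the required fields per mode and a four-level precedence for generation parameters. The TypeScript below restates both rules as code. `missingRequiredFields` and `resolveParam` are hypothetical names, the numeric values in the usage note are made up, and the package's own validation code is not shown in this excerpt, so treat this as an illustration rather than the actual implementation.

```typescript
// Hypothetical helpers for illustration; field names are taken from the doc above.
type Mode =
  | "chat" | "agent" | "responses" | "embeddings"
  | "image_generation" | "audio_transcription" | "audio_speech" | "omni_call";

const REQUIRED_BY_MODE: Record<Mode, string[]> = {
  chat: ["prompt"],
  agent: ["prompt"],
  responses: ["prompt"],
  image_generation: ["prompt"],
  embeddings: ["input"],
  audio_transcription: ["audioFile"],
  audio_speech: ["inputText", "voice"],
  omni_call: ["audioFile"],
};

// Returns the names of required fields that are absent from a scenario.
function missingRequiredFields(scenario: Record<string, unknown>): string[] {
  const missing: string[] = [];
  if (typeof scenario["id"] !== "string") missing.push("id");

  const mode = scenario["mode"];
  const knownMode =
    typeof mode === "string" &&
    Object.prototype.hasOwnProperty.call(REQUIRED_BY_MODE, mode);
  if (!knownMode) {
    missing.push("mode");
    return missing; // mode-specific checks need a valid mode
  }

  for (const field of REQUIRED_BY_MODE[mode as Mode]) {
    if (scenario[field] === undefined) missing.push(field);
  }
  return missing;
}

// Precedence: scenario value, then run override, then config default, then built-in.
function resolveParam<T>(
  scenarioValue: T | undefined,
  runOverride: T | undefined,
  configDefault: T | undefined,
  builtInDefault: T
): T {
  return scenarioValue ?? runOverride ?? configDefault ?? builtInDefault;
}
```

For example, `resolveParam(undefined, 0.2, 0.7, 1)` picks the run-level override `0.2`, while `resolveParam(0, 0.2, 0.7, 1)` keeps the scenario-level `0` because `??` only skips `null` and `undefined`.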

package/docs/mcp-guidelines.md ADDED

@@ -0,0 +1,125 @@
+# MCP Tool Governance Guidelines
+This document is the canonical policy for Waypoi built-in MCP tools (`/mcp`).
+Scope:
+- Applies to built-in tools registered in `src/mcp/service.ts`.
+- Does not enforce behavior for external third-party MCP servers managed under `/admin/mcp/*`.
+## 1) Tool description standard
+Every tool description should be concise and action-first:
+1. Sentence 1: capability summary (what the tool does).
+2. Sentence 2: required default behavior for the caller.
+3. Sentence 3: the biggest pitfall to avoid.
+Binary-producing tools should explicitly mention file-first output behavior.
+## 2) Input schema conventions
+- Use `snake_case` field names.
+- Include explicit bounds/defaults where relevant.
+- Represent incompatible options as mutually exclusive inputs and validate at runtime.
+- Mark optional non-default behavior clearly (for example `include_data`).
+## 3) Output conventions
+Top-level response shape:
+- Success: `{ ok: true, ... }`
+- Error: `{ ok: false, error: { type, message } }`
+For binary-producing tools:
+- Default to lightweight metadata in responses.
+- Require file output when the tool is binary-producing.
+- Return `file_path` values relative to the output root rather than absolute host paths.
+- Make `file_path` / `file_paths` the canonical small-model result fields.
+- Include raw `url` / `b64_json` only with explicit opt-in (`include_data=true`).
+- Keep `content.text` compact and free of inline base64.
+## 4) Error taxonomy
+Use stable typed errors:
+- `invalid_request`: parameter validation and contract violations.
+- `no_diffusion_model`: no suitable model available for image generation.
+- `no_vision_model`: no suitable vision-capable text model available for image understanding.
+- `no_video_model`: no suitable video generation model available.
+- `no_video_output`: video generation completed but no video URL was returned.
+- `upstream_error`: upstream/provider failures not attributable to caller input.
+- `forbidden`: endpoint/policy access denied (for route-level guards).
+Error messages should be deterministic and actionable.
+## 5) Operational behavior
+- Tool handlers should define explicit timeout behavior (for example 60s for image generation, 300s for video generation).
+- Do not silently degrade into inline-only success for binary tools.
+- For binary file-output modes, tools MAY override upstream response format to a byte-bearing format to guarantee file materialization.
+- Retry behavior should be explicit per tool. If no retries are implemented, fail deterministically.
+- In multi-project environments, pin MCP output root via server env:
+  - `WAYPOI_MCP_OUTPUT_ROOT=<absolute path>` (default: `~/.config/waypoi`)
+  - `WAYPOI_MCP_OUTPUT_SUBDIR=work` (or another controlled relative subdir; default: `generated-images`)
+  - `WAYPOI_MCP_STRICT_OUTPUT_ROOT=true` for fail-fast misconfiguration handling.
+## 6) Agent behavior guidelines
+For tool-calling agents:
+1. Prefer file output for binary-generating tools.
+2. Keep responses minimal unless inline data is explicitly needed downstream.
+3. Avoid repeated expensive calls with unchanged arguments.
+4. Use `include_data=true` only for explicit transport requirements.
+5. For image-generation editing, provide at most one source (`image_path` xor `image_url`).
+6. For image-to-text tools, provide exactly one image source (`image_path` xor `image_url`).
+Output goes to `~/.config/waypoi/generated-images` by default. Set `WAYPOI_MCP_OUTPUT_ROOT` to redirect.
+### Safe-default example (`generate_image`)
+```json
+{
+  "name": "generate_image",
+  "arguments": {
+    "prompt": "Minimal icon with clean geometric shape",
+    "include_data": false
+  }
+}
+```
+### Image-edit example (`generate_image`)
+```json
+{
+  "name": "generate_image",
+  "arguments": {
+    "prompt": "Replace the background with a clean studio backdrop",
+    "image_path": "./tmp/input.png",
+    "include_data": false
+  }
+}
+```
+### Image-to-text defaults (`understand_image`)
+- Exactly one image source is required (`image_path` xor `image_url`).
+- Keep `instruction` concise and task-specific unless broad analysis is needed.
+- Treat top-level `text` as the canonical answer field.
+- For local image files, coordinate-sensitive answers should be expressed in original image pixels even when the upload is resized upstream.
+## 7) New MCP tool checklist
+Before adding a new built-in MCP tool:
+1. Description follows the governance template and includes normative guidance.
+2. Input schema uses `snake_case`, bounds/defaults, and validates incompatible combinations.
+3. Output shape follows `{ ok: true|false, ... }`, compact `content.text`, and file-first policy for binary payloads.
+4. Typed errors are stable and mapped to taxonomy.
+5. Tests cover:
+   - policy validation rules,
+   - default payload behavior,
+   - error paths,
+   - tool listing/description visibility.
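
Reviewer sketch (not part of the package): the guidelines above define a top-level `{ ok: true, ... }` / `{ ok: false, error: { type, message } }` response shape, a fixed error taxonomy, and an `image_path` xor `image_url` rule. The TypeScript below shows how those three rules could fit together. `ToolResult`, `ImageSourceArgs`, `checkImageSource`, the `source` field, and the error message strings are hypothetical; only the error type names and argument names come from the doc.

```typescript
// Hypothetical types for illustration; error type names are from the taxonomy above.
type ToolErrorType =
  | "invalid_request" | "no_diffusion_model" | "no_vision_model"
  | "no_video_model" | "no_video_output" | "upstream_error" | "forbidden";

type ToolResult<T> =
  | ({ ok: true } & T)
  | { ok: false; error: { type: ToolErrorType; message: string } };

interface ImageSourceArgs {
  image_path?: string;
  image_url?: string;
}

// requireOne = false models "at most one source" (image editing);
// requireOne = true models "exactly one source" (image understanding).
function checkImageSource(
  args: ImageSourceArgs,
  requireOne: boolean
): ToolResult<{ source: "path" | "url" | "none" }> {
  const hasPath = typeof args.image_path === "string" && args.image_path.length > 0;
  const hasUrl = typeof args.image_url === "string" && args.image_url.length > 0;

  if (hasPath && hasUrl) {
    return {
      ok: false,
      error: { type: "invalid_request", message: "Provide image_path or image_url, not both." },
    };
  }
  if (requireOne && !hasPath && !hasUrl) {
    return {
      ok: false,
      error: { type: "invalid_request", message: "Provide exactly one of image_path or image_url." },
    };
  }
  return { ok: true, source: hasPath ? "path" : hasUrl ? "url" : "none" };
}
```

Because `ok` is a literal discriminant, callers can narrow on `result.ok` before reading either `source` or `error`, which keeps error handling deterministic in the way the taxonomy section asks for.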