noosphere 0.1.2 → 0.2.0
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- package/README.md +2095 -159
- package/dist/index.cjs +226 -15
- package/dist/index.cjs.map +1 -1
- package/dist/index.js +226 -15
- package/dist/index.js.map +1 -1
- package/package.json +1 -1
package/README.md
CHANGED
@@ -7,7 +7,7 @@ One import. Every model. Every modality.
 ## Features
 
 - **4 modalities** — LLM chat, image generation, video generation, and text-to-speech
-- **
+- **Always up-to-date models** — Dynamic auto-fetch from ALL provider APIs at runtime (OpenAI, Anthropic, Google, Groq, Mistral, xAI, Cerebras, OpenRouter)
 - **867+ media endpoints** — via FAL (Flux, SDXL, Kling, Sora 2, VEO 3, Kokoro, ElevenLabs, and hundreds more)
 - **30+ HuggingFace tasks** — LLM, image, TTS, translation, summarization, classification, and more
 - **Local-first architecture** — Auto-detects ComfyUI, Ollama, Piper, and Kokoro on your machine

@@ -59,6 +59,108 @@ const audio = await ai.speak({
 // audio.buffer contains the audio data
 ```
 
+## Dynamic Model Auto-Fetch — Always Up-to-Date
+
+Noosphere **automatically discovers the latest models** from every provider's API at runtime. When Google releases a new Gemini model, when OpenAI drops GPT-5, when Anthropic publishes Claude 4 — **you get them immediately**, without updating Noosphere or any dependency.
+
+### The Problem It Solves
+
+Traditional AI libraries rely on **static model catalogs** hardcoded at build time. The `@mariozechner/pi-ai` dependency ships with ~246 models in a pre-generated `models.generated.js` file. When a provider releases a new model, you'd have to wait for the library maintainer to run `npm run generate-models`, publish a new version, and then you'd `npm update`. This lag can be days or weeks.
+
+### How It Works
+
+On the **first API call**, Noosphere queries every provider's model listing API in parallel and merges the results with the static catalog:
+
+```
+First ai.chat() / ai.image() / ai.stream() call
+  │
+  ├─ 1. Load static pi-ai catalog (246 models with accurate cost/context data)
+  │
+  ├─ 2. Parallel fetch from ALL provider APIs (8 concurrent requests):
+  │      ├── GET https://api.openai.com/v1/models              (Bearer token)
+  │      ├── GET https://api.anthropic.com/v1/models           (x-api-key + anthropic-version)
+  │      ├── GET https://generativelanguage.googleapis.com/... (API key in URL)
+  │      ├── GET https://api.groq.com/openai/v1/models         (Bearer token)
+  │      ├── GET https://api.mistral.ai/v1/models              (Bearer token)
+  │      ├── GET https://api.x.ai/v1/models                    (Bearer token)
+  │      ├── GET https://openrouter.ai/api/v1/models           (Bearer token)
+  │      └── GET https://api.cerebras.ai/v1/models             (Bearer token)
+  │
+  ├─ 3. Filter results (chat models only — exclude embeddings, TTS, whisper, etc.)
+  │
+  ├─ 4. Deduplicate against static catalog (static wins — has accurate cost data)
+  │
+  └─ 5. Merge: Static catalog + newly discovered models = complete model list
+```
+
+### What Gets Fetched Per Provider
+
+| Provider | API Endpoint | Auth | Model Filter | API Protocol |
+|---|---|---|---|---|
+| **OpenAI** | `/v1/models` | Bearer token | `gpt-*`, `o1*`, `o3*`, `o4*`, `chatgpt-*`, `codex-*` | `openai-responses` |
+| **Anthropic** | `/v1/models?limit=100` | `x-api-key` + `anthropic-version` | `claude-*` | `anthropic-messages` |
+| **Google** | `/v1beta/models?key=KEY` | API key in URL | `gemini-*`, `gemma-*` + must support `generateContent` | `google-generative-ai` |
+| **Groq** | `/openai/v1/models` | Bearer token | All (Groq only serves chat models) | `openai-completions` |
+| **Mistral** | `/v1/models` | Bearer token | Exclude `*embed*` | `openai-completions` |
+| **xAI** | `/v1/models` | Bearer token | `grok*` | `openai-completions` |
+| **OpenRouter** | `/api/v1/models` | Bearer token | All (OpenRouter only lists usable models) | `openai-completions` |
+| **Cerebras** | `/v1/models` | Bearer token | All (Cerebras only serves chat models) | `openai-completions` |
+
+### Resilience Guarantees
+
+- **8-second timeout** per provider — slow APIs don't block everything
+- **`Promise.allSettled()`** — if one provider fails, the others still work
+- **Silent failure** — network errors are caught and ignored, static catalog always available
+- **One-time fetch** — results are cached in memory, not re-fetched on every call
+- **Zero config** — works automatically if you have API keys set
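The guarantees above reduce to one merge routine. A minimal, hypothetical sketch — `discoverModels` and its shapes are illustrative stand-ins, not Noosphere's actual internals:

```typescript
type ModelInfo = { id: string; provider: string; name: string };

// Fetch all providers in parallel, tolerate individual failures, and let
// the static catalog win on ID conflicts (it carries accurate cost data).
// Each fetcher is expected to enforce its own timeout internally
// (e.g. via AbortSignal.timeout(8000)).
async function discoverModels(
  staticCatalog: ModelInfo[],
  fetchers: Array<() => Promise<ModelInfo[]>>,
): Promise<ModelInfo[]> {
  // Promise.allSettled: one failing provider never rejects the whole merge
  const results = await Promise.allSettled(fetchers.map((f) => f()));
  const discovered = results
    .filter((r): r is PromiseFulfilledResult<ModelInfo[]> => r.status === 'fulfilled')
    .flatMap((r) => r.value);

  // Deduplicate by provider+id; static entries overwrite dynamic ones
  const byKey = new Map<string, ModelInfo>();
  for (const m of discovered) byKey.set(`${m.provider}/${m.id}`, m);
  for (const m of staticCatalog) byKey.set(`${m.provider}/${m.id}`, m);
  return [...byKey.values()];
}
```

Because failures are filtered out rather than propagated, the static catalog is the worst-case result even if every network call fails.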
+
+### How New Models Become Usable
+
+When a dynamically discovered model isn't in the static catalog, Noosphere constructs a **synthetic Model object** that pi-ai's `complete()` and `stream()` functions can use directly:
+
+```typescript
+// For a new model like "gpt-4.5-turbo" discovered from OpenAI's API:
+{
+  id: 'gpt-4.5-turbo',
+  name: 'gpt-4.5-turbo',
+  api: 'openai-responses',  // Correct protocol for the provider
+  provider: 'openai',
+  baseUrl: 'https://api.openai.com/v1',
+  reasoning: false,         // Inferred from model ID prefix
+  input: ['text', 'image'],
+  cost: { input: 2.5, output: 10, cacheRead: 1.25, cacheWrite: 2.5 }, // From template
+  contextWindow: 128000,    // From template or provider API
+  maxTokens: 16384,         // From template or provider API
+}
+```
+
+**Template inheritance:** Cost and context window data come from a "template" — the first model in the static catalog for that provider. This means new models inherit approximate pricing until the static catalog is updated with exact numbers. For Google, the API returns `inputTokenLimit` and `outputTokenLimit` directly, so context window data is always accurate.
+
+### Force Refresh
+
+```typescript
+const ai = new Noosphere();
+
+// Models are auto-fetched on first call:
+await ai.chat({ model: 'gemini-2.5-ultra', messages: [...] }); // works immediately
+
+// Force a re-fetch if you suspect new models were added mid-session:
+// (access the provider's refreshDynamicModels method via the registry)
+const models = await ai.getModels('llm');
+// Or trigger a full sync:
+await ai.syncModels();
+```
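The template-inheritance rule described above amounts to copying the provider's first static entry and swapping in the new identity. A hypothetical sketch — `syntheticModel` and the trimmed `Model` shape are illustrative, not Noosphere's actual code:

```typescript
type Model = {
  id: string;
  name: string;
  provider: string;
  api: string;
  baseUrl: string;
  cost: { input: number; output: number };
  contextWindow: number;
  maxTokens: number;
};

// Build a usable entry for a freshly discovered model ID by inheriting
// everything except identity from the provider's template model.
function syntheticModel(id: string, template: Model): Model {
  return {
    ...template, // inherit api, baseUrl, approximate cost/context data
    id,          // only the identity changes
    name: id,
  };
}
```

The inherited pricing is approximate by construction; it stays in place until the static catalog ships exact numbers for that model.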
+
+### Why Not Just Use the Provider APIs Directly?
+
+| Approach | Pros | Cons |
+|---|---|---|
+| **Static catalog only** (old) | Accurate costs, fast startup | Stale within days, miss new models |
+| **Dynamic only** | Always current | No cost data, no context window info, slow startup |
+| **Hybrid (Noosphere)** | Best of both — accurate data for known models + immediate access to new ones | New models have estimated costs until catalog update |
+
+---
+
 ## Configuration
 
 API keys are resolved from the constructor config or environment variables (config takes priority):

@@ -478,68 +580,507 @@ A unified gateway that routes to 8 LLM providers through 4 different API protocols
 
 Aggregator providing access to hundreds of additional models including Llama, Deepseek, Mistral, Qwen, and many more. Full list available via `ai.getModels('llm')`.
 
-#### 
+#### The Pi-AI Engine — Deep Dive
+
+Noosphere's LLM provider is powered by `@mariozechner/pi-ai`, part of the **Pi mono-repo** by Mario Zechner (badlogic). Pi is NOT a wrapper like LangChain or Mastra — it's a **micro-framework for agentic AI** (~15K LOC, 4 npm packages) that was built from scratch as a minimalist alternative to Claude Code.
+
+Pi consists of 4 packages in 3 tiers:
+
+```
+TIER 1 — FOUNDATION
+  @mariozechner/pi-ai            LLM API: stream(), complete(), model registry
+                                 0 internal deps, talks to 20+ providers
+
+TIER 2 — INFRASTRUCTURE
+  @mariozechner/pi-agent-core    Agent loop, tool execution, lifecycle events
+                                 Depends on pi-ai
+
+  @mariozechner/pi-tui           Terminal UI with differential rendering
+                                 Standalone, 0 internal deps
+
+TIER 3 — APPLICATION
+  @mariozechner/pi-coding-agent  CLI + SDK: sessions, compaction, extensions
+                                 Depends on all above
+```
+
+Noosphere uses `@mariozechner/pi-ai` (Tier 1) directly for LLM access. But the full Pi ecosystem provides capabilities that can be layered on top.
+
+---
+
+#### How Pi Keeps 200+ Models Updated
+
+Pi does NOT hardcode models. It has an **auto-generation pipeline** that runs at build time:
+
+```
+STEP 1: FETCH (3 sources in parallel)
+  ┌──────────────────┐  ┌──────────────────┐  ┌───────────────┐
+  │ models.dev       │  │ OpenRouter       │  │ Vercel AI     │
+  │ /api.json        │  │ /v1/models       │  │ Gateway       │
+  │                  │  │                  │  │ /v1/models    │
+  │ Context windows  │  │ Pricing ($/M)    │  │ Capability    │
+  │ Capabilities     │  │ Availability     │  │ tags          │
+  │ Tool support     │  │ Provider routing │  │               │
+  └────────┬─────────┘  └────────┬─────────┘  └──────┬────────┘
+           └─────────┬───────────┴────────────────────┘
+                     ▼
+STEP 2: MERGE & DEDUPLICATE
+  Priority: models.dev > OpenRouter > Vercel
+  Key: provider + modelId
+                     │
+                     ▼
+STEP 3: FILTER
+  ✅ tool_call === true
+  ✅ streaming supported
+  ✅ system messages supported
+  ✅ not deprecated
+                     │
+                     ▼
+STEP 4: NORMALIZE
+  Costs → $/million tokens
+  API type → one of 4 protocols
+  Input modes → ["text"] or ["text","image"]
+                     │
+                     ▼
+STEP 5: PATCH (manual corrections)
+  Claude Opus: cache pricing fix
+  GPT-5.4: context window override
+  Kimi K2.5: hardcoded pricing
+                     │
+                     ▼
+STEP 6: GENERATE TypeScript
+  → models.generated.ts (~330KB)
+  → 200+ models with full type safety
+```
+
+Each generated model entry looks like:
+
+```typescript
+{
+  id: "claude-opus-4-6",
+  name: "Claude Opus 4.6",
+  api: "anthropic-messages",
+  provider: "anthropic",
+  baseUrl: "https://api.anthropic.com",
+  reasoning: true,
+  input: ["text", "image"],
+  cost: {
+    input: 15,         // $15/M tokens
+    output: 75,        // $75/M tokens
+    cacheRead: 1.5,    // prompt cache hit
+    cacheWrite: 18.75, // prompt cache write
+  },
+  contextWindow: 200_000,
+  maxTokens: 32_000,
+} satisfies Model<"anthropic-messages">
+```
+
+When a new model is released (e.g., Gemini 3.0), it appears in models.dev/OpenRouter → the script captures it → a new Pi version is published → Noosphere updates its dependency.
+
+---
+
+#### 4 API Protocols — How Pi Talks to Every Provider
+
+Pi abstracts all LLM providers into 4 wire protocols. Each protocol handles the differences in request format, streaming format, auth headers, and response parsing:
+
+| Protocol | Providers | Key Differences |
+|---|---|---|
+| `anthropic-messages` | Anthropic, AWS Bedrock | `system` as top-level field, content as `[{type:"text", text:"..."}]` blocks, `x-api-key` auth, `anthropic-beta` headers |
+| `openai-completions` | OpenAI, xAI, Groq, Cerebras, OpenRouter, Ollama, vLLM | `system` as message with `role:"system"`, content as string, `Authorization: Bearer` auth, `tool_calls` array |
+| `openai-responses` | OpenAI (reasoning models) | New Responses API with server-side context, `store: true`, reasoning summaries |
+| `google-generative-ai` | Google Gemini, Vertex AI | `systemInstruction.parts[{text}]`, role `"model"` instead of `"assistant"`, `functionCall` instead of `tool_calls`, `thinkingConfig` |
+
+The core function `streamSimple()` detects which protocol to use based on `model.api` and handles all the formatting/parsing transparently:
 
-
+```typescript
+// What happens inside Pi when you call Noosphere's chat():
+async function* streamSimple(
+  model: Model,           // includes model.api to determine protocol
+  context: Context,       // { systemPrompt, messages, tools }
+  options?: StreamOptions // { signal, onPayload, thinkingLevel, ... }
+): AsyncIterable<AssistantMessageEvent> {
+  // 1. Format request according to model.api protocol
+  // 2. Open SSE/WebSocket stream
+  // 3. Parse provider-specific chunks
+  // 4. Emit normalized events:
+  //    → text_delta, thinking_delta, tool_call, message_end
+}
+```
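The request-shape differences from the table can be made concrete with a small dispatch sketch. This hypothetical `formatRequest` is not Pi's actual code; it only illustrates three of the four shapes the table spells out (the Responses API shape is omitted):

```typescript
type Msg = { role: string; content: string };

// Serialize the same (systemPrompt, messages) pair per wire protocol.
function formatRequest(api: string, systemPrompt: string, messages: Msg[]): object {
  switch (api) {
    case 'anthropic-messages':
      // system travels as a top-level field
      return { system: systemPrompt, messages };
    case 'openai-completions':
      // system travels as the first message in the list
      return { messages: [{ role: 'system', content: systemPrompt }, ...messages] };
    case 'google-generative-ai':
      // Gemini uses systemInstruction parts; "assistant" becomes role "model"
      return {
        systemInstruction: { parts: [{ text: systemPrompt }] },
        contents: messages.map((m) => ({
          role: m.role === 'assistant' ? 'model' : m.role,
          parts: [{ text: m.content }],
        })),
      };
    default:
      throw new Error(`unknown protocol: ${api}`);
  }
}
```

Dispatching on `model.api` is what lets a single `streamSimple()` entry point serve every provider without per-provider call sites.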
+
+---
+
+#### Agentic Capabilities
+
+These are the capabilities you get access to through the Pi-AI engine:
+
+##### 1. Tool Use / Function Calling
+
+Full structured tool calling is supported across **all major providers**. Tool definitions use TypeBox schemas with runtime validation via AJV:
+
+```typescript
+import { type Tool, StringEnum } from '@mariozechner/pi-ai';
+import { Type } from '@sinclair/typebox';
+
+// Define a tool with typed parameters
+const searchTool: Tool = {
+  name: 'web_search',
+  description: 'Search the web for information',
+  parameters: Type.Object({
+    query: Type.String({ description: 'Search query' }),
+    maxResults: Type.Optional(Type.Number({ default: 5 })),
+    type: StringEnum(['web', 'images', 'news'], { description: 'Search type' }),
+  }),
+};
+
+// Pass tools in context — Pi handles the rest
+const context = {
+  systemPrompt: 'You are a helpful assistant.',
+  messages: [{ role: 'user', content: 'Search for recent AI news' }],
+  tools: [searchTool],
+};
+```
+
+**How tool calling works internally:**
+
+```
+User prompt → LLM → "I need to call web_search"
+                         │
+                         ▼
+             Pi validates arguments with AJV
+             against the TypeBox schema
+                         │
+                   ┌─────┴─────┐
+                   │  Valid?   │
+                   ├─Yes───────┤
+                   │  Execute  │
+                   │  tool     │
+                   ├───────────┤
+                   │  No       │
+                   │  Return   │
+                   │ validation│
+                   │  error to │
+                   │  LLM      │
+                   └───────────┘
+                         │
+                         ▼
+       Tool result → back into context → LLM continues
+```
+
+**Provider-specific tool_choice control:**
+- **Anthropic:** `"auto" | "any" | "none" | { type: "tool", name: "specific_tool" }`
+- **OpenAI:** `"auto" | "none" | "required" | { type: "function", function: { name: "..." } }`
+- **Google:** `"auto" | "none" | "any"`
+
+**Partial JSON streaming:** During streaming, Pi parses tool call arguments incrementally using partial JSON parsing. This means you can see tool arguments being built in real time, not just after the tool call completes.
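The validate-then-execute branch in the flow above can be approximated with a minimal stand-in. Pi itself validates with TypeBox + AJV; this hand-rolled `validateArgs` is a hypothetical simplification that only mimics the required-key and primitive-type checks:

```typescript
// A deliberately tiny schema shape standing in for a compiled JSON Schema.
type MiniSchema = { required: string[]; types: Record<string, string> };

// Returns null when the arguments are valid; otherwise an error string,
// which (per the diagram) would be sent back to the LLM instead of executing.
function validateArgs(schema: MiniSchema, args: Record<string, unknown>): string | null {
  for (const key of schema.required) {
    if (!(key in args)) return `missing required argument: ${key}`;
  }
  for (const [key, value] of Object.entries(args)) {
    const expected = schema.types[key];
    if (expected && typeof value !== expected) {
      return `argument ${key}: expected ${expected}, got ${typeof value}`;
    }
  }
  return null;
}
```

Returning the validation error to the model, rather than throwing, is what lets the loop self-correct: the LLM sees the message and retries with fixed arguments.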
+
+##### 2. Reasoning / Extended Thinking
+
+Pi provides **unified thinking support** across all providers that support it. Thinking blocks are automatically extracted, separated from regular text, and streamed as distinct events:
+
+| Provider | Models | Control Parameters | How It Works |
+|---|---|---|---|
+| **Anthropic** | Claude Opus, Sonnet 4+ | `thinkingEnabled: boolean`, `thinkingBudgetTokens: number` | Extended thinking blocks in response, separate `thinking` content type |
+| **OpenAI** | o1, o3, o4, GPT-5 | `reasoningEffort: "minimal" \| "low" \| "medium" \| "high"` | Reasoning via Responses API, `reasoningSummary: "auto" \| "detailed" \| "concise"` |
+| **Google** | Gemini 2.5 Flash/Pro | `thinking.enabled: boolean`, `thinking.budgetTokens: number` | Thinking via `thinkingConfig`, mapped to effort levels |
+| **xAI** | Grok-4, Grok-3-mini | Native reasoning | Automatic when model supports it |
+
+**Cross-provider thinking portability:** When switching models mid-conversation, Pi converts thinking blocks between formats. Anthropic thinking blocks become `<thinking>` tagged text when sent to OpenAI/Google, and vice versa.
 
-**Tool Use / Function Calling:**
 ```typescript
-//
-
-
-
-
-
+// Thinking is automatically extracted in Noosphere responses:
+const result = await ai.chat({
+  model: 'claude-opus-4-6',
+  messages: [{ role: 'user', content: 'Solve this step by step: 15! / 13!' }],
+});
+
+console.log(result.thinking); // "Let me work through this... 15! = 15 × 14 × 13!..."
+console.log(result.content);  // "15! / 13! = 15 × 14 = 210"
+
+// During streaming, thinking arrives as separate events:
+const stream = ai.stream({ messages: [...] });
+for await (const event of stream) {
+  if (event.type === 'thinking_delta') console.log('[THINKING]', event.delta);
+  if (event.type === 'text_delta') console.log('[RESPONSE]', event.delta);
 }
 ```
 
-
-
-
-- **Google:** `thinking.enabled`, `thinking.budgetTokens` — Gemini 2.5 thinking
-- **xAI:** Grok-4 native reasoning
-- Thinking blocks are automatically extracted and streamed as separate `thinking_delta` events
+##### 3. Vision / Multimodal Input
+
+Models with `input: ["text", "image"]` accept images alongside text. Pi handles the encoding and format differences per provider:
 
-**Vision / Multimodal Input:**
 ```typescript
-// Send images
-{
-  role:
+// Send images to vision-capable models
+const messages = [{
+  role: 'user',
   content: [
-    { type:
-    { type:
-  ]
-}
+    { type: 'text', text: 'What is in this image?' },
+    { type: 'image', data: base64PngString, mimeType: 'image/png' },
+  ],
+}];
+
+// Supported MIME types: image/png, image/jpeg, image/gif, image/webp
+// Images are silently ignored when sent to non-vision models
 ```
 
-**
+**Vision-capable models include:** All Claude models, all GPT-4o/GPT-5 models, Gemini models, Grok-2-vision, Grok-4, and select Groq models.
+
+##### 4. Agent Loop — Autonomous Tool Execution
+
+The `@mariozechner/pi-agent-core` package provides a complete agent loop that automatically cycles through `prompt → LLM → tool call → result → repeat` until the task is done:
+
 ```typescript
-// Built-in agentic execution loop with automatic tool calling
 import { agentLoop } from '@mariozechner/pi-ai';
 
-const events = agentLoop(
-
-
+const events = agentLoop(userMessage, agentContext, {
+  model: getModel('anthropic', 'claude-opus-4-6'),
+  tools: [searchTool, readFileTool, writeFileTool],
+  signal: abortController.signal,
 });
 
 for await (const event of events) {
-
-
-
+  switch (event.type) {
+    case 'agent_start':          // Agent begins
+    case 'turn_start':           // New LLM turn begins
+    case 'message_start':        // LLM starts responding
+    case 'message_update':       // Text/thinking delta received
+    case 'tool_execution_start': // About to execute a tool
+    case 'tool_execution_end':   // Tool finished, result available
+    case 'message_end':          // LLM finished this message
+    case 'turn_end':             // Turn complete (may loop if tools were called)
+    case 'agent_end':            // All done, final messages available
+  }
 }
 ```
 
-**
+**The agent loop state machine:**
+
+```
+[User sends prompt]
+        │
+        ▼
+┌─[Build Context]──▶ [Check Queues]──▶ [Stream LLM]◄── streamFn()
+│                                           │
+│                                     ┌─────┴──────┐
+│                                     │            │
+│                                   text       tool_call
+│                                     │            │
+│                                     ▼            ▼
+│                                  [Done]    [Execute Tool]
+│                                                  │
+│                                             tool result
+│                                                  │
+└──────────────────────────────────────────────────┘
+                (loops back to Stream LLM)
+```
+
+**Key design decisions:**
+- Tools execute **sequentially** by default (parallelism can be added on top)
+- The `streamFn` is **injectable** — you can wrap it with middleware to modify requests per-provider
+- Tool arguments are **validated at runtime** using TypeBox + AJV before execution
+- Aborted/failed responses preserve partial content and usage data
+- Tool results are automatically added to the conversation context
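The state machine above reduces to a short loop. A hypothetical sketch — `runAgent`, `Turn`, and `ToolFn` are illustrative stand-ins for pi-agent-core's richer event-emitting API:

```typescript
type Turn = { text?: string; toolCall?: { name: string; args: unknown } };
type ToolFn = (args: unknown) => Promise<string>;

// Keep calling the model while it asks for tools; feed each result back
// into the context; stop when a plain text answer arrives.
async function runAgent(
  callModel: (history: string[]) => Promise<Turn>,
  tools: Record<string, ToolFn>,
  prompt: string,
): Promise<string> {
  const history = [prompt];
  for (;;) {
    const turn = await callModel(history);
    if (turn.toolCall) {
      // tools run sequentially; the result re-enters the conversation
      const result = await tools[turn.toolCall.name](turn.toolCall.args);
      history.push(`tool:${turn.toolCall.name} -> ${result}`);
      continue;
    }
    return turn.text ?? '';
  }
}
```

The "loops back to Stream LLM" edge in the diagram is simply the `continue` after a tool result is appended.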
+
+##### 5. The `streamFn` Pattern — Injectable Middleware
+
+This is Pi's most powerful architectural feature. The `streamFn` is the function that actually talks to the LLM, and it can be **wrapped with middleware** like Express.js request handlers:
+
+```typescript
+import type { StreamFn } from '@mariozechner/pi-agent-core';
+import { streamSimple } from '@mariozechner/pi-ai';
+
+// Start with Pi's base streaming function
+let fn: StreamFn = streamSimple;
+
+// Wrap it with middleware that modifies requests per-provider
+fn = createMyCustomWrapper(fn, {
+  // Add custom headers for Anthropic
+  onPayload: (payload) => {
+    if (model.provider === 'anthropic') {
+      payload.headers['anthropic-beta'] = 'fine-grained-tool-streaming-2025-05-14';
+    }
+  },
+});
+
+// Each wrapper calls the previous one, forming a chain:
+// request → wrapper3 → wrapper2 → wrapper1 → streamSimple → API
+```
+
+This pattern is what allows projects like OpenClaw to stack **16 provider-specific wrappers** on top of Pi's base streaming — adding beta headers for Anthropic, WebSocket transport for OpenAI, thinking sanitization for Google, reasoning effort headers for OpenRouter, and more — without modifying Pi's source code.
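One way such a wrapper could look: a hypothetical `withHeader` middleware that injects a header and delegates to the next function in the chain. The `Payload`/`StreamFn` types here are simplified stand-ins for Pi's actual signatures:

```typescript
type Payload = { headers: Record<string, string>; body: unknown };
type StreamFn = (payload: Payload) => AsyncIterable<string>;

// Each layer receives the next StreamFn and returns a new one, so layers
// compose like Express middleware: outermost wrapper runs first.
function withHeader(next: StreamFn, name: string, value: string): StreamFn {
  return (payload) =>
    next({ ...payload, headers: { ...payload.headers, [name]: value } });
}
```

Because a wrapper is just a function from `StreamFn` to `StreamFn`, stacking sixteen of them is ordinary function composition; no part of the base implementation needs to change.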
+
+##### 6. Session Management (via pi-coding-agent)
+
+The `@mariozechner/pi-coding-agent` package provides persistent session management with JSONL-based storage:
+
+```typescript
+import { createAgentSession, SessionManager } from '@mariozechner/pi-coding-agent';
+
+// Create a session with full persistence
+const session = await createAgentSession({
+  model: 'claude-opus-4-6',
+  tools: myTools,
+  sessionManager, // handles JSONL persistence
+});
+
+const result = await session.run('Build a REST API');
+// Session is automatically saved to:
+// ~/.pi/agent/sessions/session_abc123.jsonl
+```
+
+**Session file format (append-only JSONL):**
+```jsonl
+{"role":"user","content":"Build a REST API","timestamp":1710000000}
+{"role":"assistant","content":"I'll create...","model":"claude-opus-4-6","usage":{...}}
+{"role":"toolResult","toolCallId":"tc_001","toolName":"bash","content":"OK"}
+{"type":"compaction","summary":"The user asked to build...","preservedMessages":[...]}
+```
+
+**Session operations:**
+- `create()` — new session
+- `open(id)` — restore existing session
+- `continueRecent()` — continue the most recent session
+- `forkFrom(id)` — create a branch (new JSONL referencing parent)
+- `inMemory()` — RAM-only session (for SDK/testing)
+
+##### 7. Context Compaction — Automatic Context Window Management
+
+When the conversation approaches the model's context window limit, Pi automatically **compacts** the history:
+
+```
+1. DETECT:    Calculate inputTokens + outputTokens vs model.contextWindow
+2. TRIGGER:   Proactively before overflow, or as recovery after overflow error
+3. SUMMARIZE: Send history to LLM with a compaction prompt
+4. WRITE:     Append compaction entry to JSONL:
+              {"type":"compaction","summary":"...","preservedMessages":[last N messages]}
+5. CONTINUE:  Context is now summary + recent messages instead of full history
+```
+
+The JSONL file is **never rewritten** — compaction entries are appended, maintaining a complete audit trail.
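The DETECT/TRIGGER steps boil down to a threshold check against the model's context window. A hypothetical sketch; the 80% trigger point is an assumed value, not something the text above specifies:

```typescript
// Decide whether to compact: compare running token usage against a
// fraction of the model's context window, so compaction fires before
// an overflow error rather than after.
function shouldCompact(
  inputTokens: number,
  outputTokens: number,
  contextWindow: number,
  threshold = 0.8, // assumed proactive trigger point
): boolean {
  return inputTokens + outputTokens >= contextWindow * threshold;
}
```

The same predicate also covers the recovery path: after an overflow error, usage necessarily exceeds the threshold, so the next turn compacts first.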
+
+##### 8. Cost Tracking — Cache-Aware Pricing
+
+Pi tracks costs per-request with cache-aware pricing for providers that support prompt caching:
+
 ```typescript
-//
+// Every model has 4 cost dimensions:
+{
+  input: 15,         // $15 per 1M input tokens
+  output: 75,        // $75 per 1M output tokens
+  cacheRead: 1.5,    // $1.50 per 1M cached prompt tokens (read)
+  cacheWrite: 18.75, // $18.75 per 1M cached prompt tokens (write)
+}
+
+// Usage tracking on every response:
 {
-  input:
-  output:
-  cacheRead:
-  cacheWrite:
+  input: 1500,      // tokens consumed as input
+  output: 800,      // tokens generated
+  cacheRead: 5000,  // prompt cache hits
+  cacheWrite: 1500, // prompt cache writes
+  cost: {
+    total: 0.1181,  // total cost in USD (sum of the four components below)
+    input: 0.0225,
+    output: 0.06,
+    cacheRead: 0.0075,
+    cacheWrite: 0.0281,
+  },
+}
+```
+
+**Anthropic and OpenAI** support prompt caching. For providers without caching, `cacheRead` and `cacheWrite` are always 0.
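The cost arithmetic implied by the four dimensions is simply tokens divided by one million, times the $/M rate, summed into a total. A sketch — `computeCost` is illustrative, not pi-ai's API:

```typescript
type Rates = { input: number; output: number; cacheRead: number; cacheWrite: number };
type Usage = { input: number; output: number; cacheRead: number; cacheWrite: number };

// Per-request cost: each dimension is (tokens / 1,000,000) × its $/M rate.
function computeCost(usage: Usage, rates: Rates) {
  const dim = (tokens: number, rate: number) => (tokens / 1_000_000) * rate;
  const cost = {
    input: dim(usage.input, rates.input),
    output: dim(usage.output, rates.output),
    cacheRead: dim(usage.cacheRead, rates.cacheRead),
    cacheWrite: dim(usage.cacheWrite, rates.cacheWrite),
  };
  return { ...cost, total: cost.input + cost.output + cost.cacheRead + cost.cacheWrite };
}
```

With the Claude Opus rates shown earlier, 5,000 cache-read tokens cost only $0.0075 versus $0.075 as fresh input, which is why cache-aware tracking matters for long system prompts.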
+
+##### 9. Extension System (via pi-coding-agent)
+
+Pi supports a plugin system where extensions can register tools, commands, and lifecycle hooks:
+
+```typescript
+// Extensions are TypeScript modules loaded at runtime via jiti
+export default function (api: ExtensionAPI) {
+  // Register a custom tool
+  api.registerTool('my_tool', {
+    description: 'Does something useful',
+    parameters: { /* TypeBox schema */ },
+    execute: async (args) => 'result',
+  });
+
+  // Register a slash command
+  api.registerCommand('/mycommand', {
+    handler: async (args) => { /* ... */ },
+    description: 'Custom command',
+  });
+
+  // Hook into the agent lifecycle
+  api.on('before_agent_start', async (context) => {
+    context.systemPrompt += '\nExtra instructions';
+  });
+
+  api.on('tool_execution_end', async (event) => {
+    // Post-process tool results
+  });
 }
 ```
 
+**Resource discovery chain (priority):**
+1. Project `.pi/` directory (highest)
+2. User `~/.pi/agent/`
+3. npm packages with Pi metadata
+4. Built-in defaults
|
|
1025
|
+
|
|
1026
|
+
##### 10. The Anti-MCP Philosophy — Why Pi Uses CLI Instead
|
|
1027
|
+
|
|
1028
|
+
Pi explicitly **rejects MCP** (Model Context Protocol). Mario Zechner's argument, backed by benchmarks:
|
|
1029
|
+
|
|
1030
|
+
**The token cost problem:**
|
|
1031
|
+
|
|
1032
|
+
| Approach | Tools | Tokens Consumed | % of Claude's Context |
|
|
1033
|
+
|---|---|---|---|
|
|
1034
|
+
| Playwright MCP | 21 tools | 13,700 tokens | 6.8% |
|
|
1035
|
+
| Chrome DevTools MCP | 26 tools | 18,000 tokens | 9.0% |
|
|
1036
|
+
| Pi CLI + README | N/A | 225 tokens | ~0.1% |
|
|
1037
|
+
|
|
1038
|
+
That's a **60-80x reduction** in token consumption. With 5 MCP servers, you lose ~55,000 tokens before doing any work.
|
|
1039
|
+
|
|
1040
|
+
**Benchmark results (120 evaluations):**
|
|
1041
|
+
|
|
1042
|
+
| Approach | Avg Cost | Success Rate |
|
|
1043
|
+
|---|---|---|
|
|
1044
|
+
| CLI (tmux) | $0.37 | 100% |
|
|
1045
|
+
| CLI (terminalcp) | $0.39 | 100% |
|
|
1046
|
+
| MCP (terminalcp) | $0.48 | 100% |
|
|
1047
|
+
|
|
1048
|
+
Same success rate, MCP costs **30% more**.
|
|
1049
|
+
|
|
1050
|
+
**Pi's alternative: Progressive Disclosure via CLI tools + READMEs**

Instead of loading all tool definitions upfront, Pi's agent has `bash` as a built-in tool and discovers CLI tools only when needed:

```
MCP approach:                       Pi approach:
─────────────                       ──────────
Session start →                     Session start →
  Load 21 Playwright tools            Load 4 tools: read, write, edit, bash
  Load 26 Chrome DevTools tools       (225 tokens)
  Load N more MCP tools
  (~55,000 tokens wasted)

When browser needed:                When browser needed:
  Tools already loaded                Agent reads SKILL.md (225 tokens)
  (but context is polluted)           Runs: browser-start.js
                                      Runs: browser-nav.js https://...
                                      Runs: browser-screenshot.js

When browser NOT needed:            When browser NOT needed:
  Tools still consume context         0 tokens wasted
```

**The 4 built-in tools** (what Pi argues is sufficient):

| Tool | What It Does | Why It's Enough |
|---|---|---|
| `read` | Read files (text + images) | Supports offset/limit for large files |
| `write` | Create/overwrite files | Creates directories automatically |
| `edit` | Replace text (oldText→newText) | Surgical edits, like a diff |
| `bash` | Execute any shell command | **bash can do everything else** — replaces MCP entirely |

The key insight: `bash` replaces MCP. Any CLI tool, API call, database query, or system operation can be invoked through bash. The agent reads the tool's README only when it needs it, paying tokens on-demand instead of upfront.
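The `edit` contract from the table above can be sketched as a plain string replacement (an illustrative sketch; the reject-on-missing-oldText behavior is an assumption, not confirmed Pi internals):

```typescript
// Illustrative sketch of the `edit` tool's oldText → newText contract.
// Rejecting the edit when oldText is absent is an assumption for safety,
// not confirmed Pi source.
function applyEdit(file: string, oldText: string, newText: string): string {
  if (!file.includes(oldText)) {
    throw new Error('oldText not found; edit rejected');
  }
  return file.replace(oldText, newText); // replaces the first occurrence only
}
```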

---

### FAL — Media Generation (867+ endpoints)

The largest media generation provider, with dynamic pricing fetched at runtime from the FAL API.

**Other TTS:**
`fal-ai/f5-tts` (voice cloning), `fal-ai/dia-tts`, `fal-ai/minimax/speech-2.6-turbo`, `fal-ai/minimax/speech-2.6-hd`, `fal-ai/chatterbox/text-to-speech`, `fal-ai/index-tts-2/text-to-speech`

#### FAL Provider Internals — How It Actually Works

**Image generation** uses `fal.subscribe()` (queue-based, polls until complete):

```typescript
// Exact request payload sent to FAL:
const response = await fal.subscribe(model, {
  input: {
    prompt: "A sunset over mountains",
    negative_prompt: "blurry",                // from options.negativePrompt
    image_size: { width: 1024, height: 768 }, // from options.width/height
    seed: 42,                                 // from options.seed
    num_inference_steps: 30,                  // from options.steps
    guidance_scale: 7.5,                      // from options.guidanceScale
  },
});

// Response parsing — URL from images array:
const image = response.data?.images?.[0];
// result.url = image?.url
// result.media = { width: image?.width, height: image?.height, format: 'png' }
```

**Video generation** uses `fal.subscribe()`:

```typescript
const response = await fal.subscribe(model, {
  input: {
    prompt: "Ocean waves",
    image_url: "https://...", // from options.imageUrl (image-to-video)
    duration: 5,              // from options.duration
    fps: 24,                  // from options.fps
  },
});

// Response parsing — URL from video object with fallback:
const video = response.data?.video;
// result.url = video?.url ?? response.data?.video_url
// Note: width/height/duration/fps come from INPUT options, not response
```

**TTS** uses `fal.run()` (direct call, NOT subscribe — no queue):

```typescript
const response = await fal.run(model, {
  input: {
    text: "Hello world",
    voice: "af_heart", // from options.voice
    speed: 1.0,        // from options.speed
  },
});

// Response parsing — URL from audio object with fallback:
// result.url = response.data?.audio_url ?? response.data?.audio?.url
```

**Pricing cache and cost tracking:**

```typescript
// Pricing fetched dynamically from FAL API during listModels():
const res = await fetch('https://api.fal.ai/v1/models/pricing', {
  headers: { Authorization: `Key ${this.apiKey}` },
});
// Returns: Array<{ modelId: string, price: number, unit: string }>

// Cached in memory Map, cleared on each listModels() call:
private pricingCache = new Map<string, { price: number; unit: string }>();

// Cost per request pulled from cache (defaults to 0 if not cached):
usage: { cost: pricingCache.get(model)?.price ?? 0 }
```

**Modality inference from model ID — exact string matching:**

```typescript
inferModality(modelId: string, unit: string): Modality {
  // TTS: unit contains 'char' OR modelId contains 'tts'/'kokoro'/'elevenlabs'
  if (unit.includes('char') || ['tts', 'kokoro', 'elevenlabs'].some((s) => modelId.includes(s))) {
    return 'tts';
  }
  // Video: unit contains 'second' OR modelId contains 'video'/'kling'/'sora'/'veo'
  if (unit.includes('second') || ['video', 'kling', 'sora', 'veo'].some((s) => modelId.includes(s))) {
    return 'video';
  }
  // Image: everything else (default)
  return 'image';
}
```

**Error handling:** Only `listModels()` catches errors (returns `[]`). Image/video/speak methods let FAL errors propagate directly — no wrapping.
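That asymmetry can be sketched as follows (a minimal illustration of the policy, not the actual FAL provider source; `fetchCatalog` and `run` are hypothetical stand-ins):

```typescript
// listModels() degrades gracefully; generation methods do not.
async function listModelsSafe<T>(fetchCatalog: () => Promise<T[]>): Promise<T[]> {
  try {
    return await fetchCatalog(); // catalog fetch failures are swallowed...
  } catch {
    return [];                   // ...and become an empty model list
  }
}

async function generateMedia(run: () => Promise<string>): Promise<string> {
  return run(); // image/video/speak: no try/catch, FAL errors reach the caller
}
```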

#### FAL Client Capabilities

The `@fal-ai/client` provides additional features beyond what Noosphere surfaces:

- **Queue API** — `fal.queue.submit()`, `status()`, `result()`, `cancel()`. Supports webhooks, priority levels (`"low"` | `"normal"`), and polling/streaming status modes
- **Streaming API** — `fal.streaming.stream()` with async iterators, chunk-level events, configurable timeout between chunks (15s default)
- **Realtime API** — `fal.realtime.connect()` for WebSocket connections with msgpack encoding, throttle interval (128ms default), frame buffering (1-60 frames)
- **Storage API** — `fal.storage.upload()` with configurable object lifecycle: `"never"` | `"immediate"` | `"1h"` | `"1d"` | `"7d"` | `"30d"` | `"1y"`
- **Retry logic** — 3 retries default, exponential backoff (500ms base, 15s max), jitter enabled, retries on 408/429/500/502/503/504
- **Request middleware** — `withMiddleware()` for request interceptors, `withProxy()` for proxy configuration
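The retry schedule described in the bullets above can be sketched like this (an illustrative reimplementation, not the actual `@fal-ai/client` code; `computeDelay` and `fetchWithRetry` are hypothetical helpers):

```typescript
// Exponential backoff with jitter, using the parameters listed above:
// 500ms base, 15s cap, 3 retries, retry on 408/429/500/502/503/504.
const RETRYABLE = new Set([408, 429, 500, 502, 503, 504]);

function computeDelay(attempt: number, baseMs = 500, maxMs = 15_000): number {
  const exp = Math.min(baseMs * 2 ** attempt, maxMs); // 500, 1000, 2000, ... capped at 15s
  return exp / 2 + Math.random() * (exp / 2);         // jitter: uniform in [exp/2, exp)
}

async function fetchWithRetry(url: string, retries = 3): Promise<Response> {
  for (let attempt = 0; ; attempt++) {
    const res = await fetch(url);
    if (res.ok || !RETRYABLE.has(res.status) || attempt >= retries) return res;
    await new Promise((r) => setTimeout(r, computeDelay(attempt)));
  }
}
```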

---

The `@huggingface/inference` library (v3.15.0) provides 30+ AI tasks, including LLM chat, image generation, TTS, translation, summarization, and classification:

- **Multimodal Input:** Images via `image_url` content chunks in chat messages
- **17 Inference Providers:** Route through Groq, Together, Fireworks, Replicate, Cerebras, Cohere, and more

#### HuggingFace Provider Internals — How It Actually Works

The `HuggingFaceProvider` class (`src/providers/huggingface.ts`, 141 lines) wraps the `@huggingface/inference` library's `HfInference` client. Here's the exact internal flow for each modality:

**Initialization:**

```typescript
// Constructor receives a single API token
constructor(token: string) {
  this.client = new HfInference(token);
  // HfInference stores the token internally and attaches it
  // as Authorization: Bearer <token> to every request
}

// ping() always returns true — HuggingFace is considered
// "available" if the token was provided. No actual HTTP check.
async ping(): Promise<boolean> { return true; }
```

**Chat Completions — exact request flow:**

```typescript
// Default model: meta-llama/Llama-3.1-8B-Instruct
const model = options.model ?? 'meta-llama/Llama-3.1-8B-Instruct';

// Maps directly to HfInference.chatCompletion():
const response = await this.client.chatCompletion({
  model,                            // HuggingFace model ID or inference endpoint
  messages: options.messages,       // Array<{ role, content }> — passed directly
  temperature: options.temperature, // 0.0 - 2.0 (optional)
  max_tokens: options.maxTokens,    // Max output tokens (optional)
});

// Response parsing:
const choice = response.choices?.[0]; // OpenAI-compatible format
const usage = response.usage;         // { prompt_tokens, completion_tokens }
// result.content = choice?.message?.content ?? ''
// result.usage.input = usage?.prompt_tokens
// result.usage.output = usage?.completion_tokens
// result.usage.cost = 0 (always free for HF Inference API)
```

**Image Generation — Blob-to-Buffer conversion pipeline:**

```typescript
// Default model: stabilityai/stable-diffusion-xl-base-1.0
const model = options.model ?? 'stabilityai/stable-diffusion-xl-base-1.0';

// Uses textToImage() which returns a Blob object:
const blob = await this.client.textToImage({
  model,
  inputs: options.prompt, // The text prompt
  parameters: {
    negative_prompt: options.negativePrompt, // What NOT to generate
    width: options.width,                    // Pixel width
    height: options.height,                  // Pixel height
    guidance_scale: options.guidanceScale,   // CFG scale
    num_inference_steps: options.steps,      // Denoising steps
  },
}, { outputType: 'blob' }); // <-- Forces Blob output (not ReadableStream)

// Blob → ArrayBuffer → Node.js Buffer conversion:
const buffer = Buffer.from(await blob.arrayBuffer());
// This is the critical step — HfInference returns a Web API Blob,
// which must be converted to a Node.js Buffer for downstream use.

// Result always reports PNG format regardless of actual model output:
// result.media = { width: options.width ?? 1024, height: options.height ?? 1024, format: 'png' }
```

**Text-to-Speech — Blob-to-Buffer conversion:**

```typescript
// Default model: facebook/mms-tts-eng
const model = options.model ?? 'facebook/mms-tts-eng';

// Uses textToSpeech() — simpler API, just model + text:
const blob = await this.client.textToSpeech({
  model,
  inputs: options.text, // Text to synthesize
  // Note: No voice, speed, or format parameters — these are model-dependent
});

// Same Blob → Buffer conversion:
const buffer = Buffer.from(await blob.arrayBuffer());

// Usage tracks character count, not tokens:
// result.usage = { cost: 0, input: options.text.length, unit: 'characters' }
// result.media = { format: 'wav' }
```

**Model listing — curated defaults, not API discovery:**

```typescript
// Unlike FAL (which fetches from its API) or Pi-AI (which auto-generates),
// HuggingFace returns a HARDCODED list of 3 curated models:
async listModels(modality?: Modality): Promise<ModelInfo[]> {
  const models: ModelInfo[] = [];
  if (!modality || modality === 'image') {
    models.push({ id: 'stabilityai/stable-diffusion-xl-base-1.0', ... });
  }
  if (!modality || modality === 'tts') {
    models.push({ id: 'facebook/mms-tts-eng', ... });
  }
  if (!modality || modality === 'llm') {
    models.push({ id: 'meta-llama/Llama-3.1-8B-Instruct', ... });
  }
  return models;
}
// This means: the registry only KNOWS about 3 models by default,
// but you can use ANY HuggingFace model by passing its ID directly.
// The model just won't appear in getModels() or syncModels() results.
```

#### The 17 HuggingFace Inference Providers

The `@huggingface/inference` library supports routing requests through 17 different inference providers. This means a single HuggingFace model ID can be served by multiple backends with different performance/cost characteristics:

| # | Provider | Type | Strengths |
|---|---|---|---|
| 1 | `hf-inference` | HuggingFace's own | Default, free tier, rate-limited |
| 2 | `hf-dedicated` | Dedicated endpoints | Private, reserved GPU, guaranteed availability |
| 3 | `together-ai` | Together.ai | Fast inference, competitive pricing |
| 4 | `fireworks-ai` | Fireworks.ai | Optimized serving, function calling |
| 5 | `replicate` | Replicate | Pay-per-use, large model catalog |
| 6 | `cerebras` | Cerebras | Extreme speed (WSE-3 hardware) |
| 7 | `groq` | Groq | Ultra-low latency (LPU hardware) |
| 8 | `cohere` | Cohere | Enterprise, embeddings, RAG |
| 9 | `sambanova` | SambaNova | Enterprise RDU hardware |
| 10 | `nebius` | Nebius | European cloud infrastructure |
| 11 | `hyperbolic` | Hyperbolic Labs | Open-access GPU marketplace |
| 12 | `novita` | Novita AI | Cost-efficient inference |
| 13 | `ovh-cloud` | OVHcloud | European sovereign cloud |
| 14 | `aws` | Amazon SageMaker | AWS-managed endpoints |
| 15 | `azure` | Azure ML | Azure-managed endpoints |
| 16 | `google-vertex` | Google Vertex | GCP-managed endpoints |
| 17 | `deepinfra` | DeepInfra | High-throughput inference |

**Provider routing** is handled by the `@huggingface/inference` library's `provider` parameter:

```typescript
// Route through a specific inference provider:
const response = await client.chatCompletion({
  model: 'meta-llama/Llama-3.1-70B-Instruct',
  provider: 'together-ai', // <-- Route through Together.ai
  messages: [...],
});

// NOTE: Noosphere does NOT currently expose the `provider` parameter
// in its ChatOptions type. To use a specific HF inference provider,
// you would need a custom provider or direct @huggingface/inference usage.
```

#### Using HuggingFace Locally — Dedicated Endpoints

HuggingFace Inference Endpoints let you deploy any model on dedicated GPUs. The `@huggingface/inference` library supports this via the `endpointUrl` parameter:

```typescript
// Direct HfInference usage with a local/dedicated endpoint:
import { HfInference } from '@huggingface/inference';

const client = new HfInference('your-token');

// Point to your dedicated endpoint:
const response = await client.chatCompletion({
  model: 'tgi',
  endpointUrl: 'https://your-endpoint.endpoints.huggingface.cloud',
  messages: [{ role: 'user', content: 'Hello' }],
});

// For a truly local setup with TGI (Text Generation Inference):
const localClient = new HfInference(); // No token needed for local
const localResponse = await localClient.chatCompletion({
  model: 'tgi',
  endpointUrl: 'http://localhost:8080', // Local TGI server
  messages: [...],
});
```

**Deploying HuggingFace models locally with TGI:**

```bash
# 1. Install Text Generation Inference (TGI):
docker run --gpus all -p 8080:80 \
  -v /data:/data \
  ghcr.io/huggingface/text-generation-inference:latest \
  --model-id meta-llama/Llama-3.1-8B-Instruct

# 2. For image models, use Inference Endpoints:
#    Deploy via https://ui.endpoints.huggingface.co/
#    Select your model, GPU type, and region
#    Get an endpoint URL like: https://xyz123.endpoints.huggingface.cloud

# 3. For TTS models locally, use the Transformers library:
#    pip install transformers torch
#    Then run a local server that serves the model
```

**Other local deployment options:**

| Method | URL Pattern | Use Case |
|---|---|---|
| TGI Docker | `http://localhost:8080` | Production local LLM serving |
| HF Inference Endpoints | `https://xxxx.endpoints.huggingface.cloud` | Managed dedicated GPU |
| vLLM with HF models | `http://localhost:8000` | High-throughput local serving |
| Transformers + FastAPI | Custom URL | Custom model serving |

#### Unexposed `@huggingface/inference` Parameters

The `chatCompletion()` method accepts many parameters that Noosphere's `ChatOptions` doesn't currently expose. These are available if you use the library directly:

| Parameter | Type | Description |
|---|---|---|
| `temperature` | `number` | Sampling temperature (0-2.0) — **exposed** via `ChatOptions.temperature` |
| `max_tokens` | `number` | Max output tokens — **exposed** via `ChatOptions.maxTokens` |
| `top_p` | `number` | Nucleus sampling threshold (0-1.0) — **not exposed** |
| `top_k` | `number` | Top-K sampling — **not exposed** |
| `frequency_penalty` | `number` | Penalize repeated tokens (-2.0 to 2.0) — **not exposed** |
| `presence_penalty` | `number` | Penalize tokens already present (-2.0 to 2.0) — **not exposed** |
| `repetition_penalty` | `number` | Alternative repetition penalty (>1.0 penalizes) — **not exposed** |
| `stop` | `string[]` | Stop sequences — **not exposed** |
| `seed` | `number` | Deterministic sampling seed — **not exposed** |
| `tools` | `Tool[]` | Function/tool definitions — **not exposed** |
| `tool_choice` | `string \| object` | Tool selection strategy — **not exposed** |
| `tool_prompt` | `string` | System prompt for tool use — **not exposed** |
| `response_format` | `object` | JSON schema constraints — **not exposed** |
| `reasoning_effort` | `string` | Thinking depth level — **not exposed** |
| `stream` | `boolean` | Enable streaming — **not exposed** (use `chatCompletionStream()`) |
| `provider` | `string` | Inference provider routing — **not exposed** |
| `endpointUrl` | `string` | Custom endpoint URL — **not exposed** |
| `n` | `number` | Number of completions — **not exposed** |
| `logprobs` | `boolean` | Return log probabilities — **not exposed** |
| `grammar` | `object` | BNF grammar constraints — **not exposed** |

**Image generation unexposed parameters:**

| Parameter | Type | Description |
|---|---|---|
| `negative_prompt` | `string` | **Exposed** via `ImageOptions.negativePrompt` |
| `width` / `height` | `number` | **Exposed** via `ImageOptions.width/height` |
| `guidance_scale` | `number` | **Exposed** via `ImageOptions.guidanceScale` |
| `num_inference_steps` | `number` | **Exposed** via `ImageOptions.steps` |
| `scheduler` | `string` | Diffusion scheduler type — **not exposed** |
| `target_size` | `object` | Target resize dimensions — **not exposed** |
| `clip_skip` | `number` | CLIP skip layers — **not exposed** |

#### HuggingFace Error Behavior

Unlike other providers, HuggingFaceProvider does **not** catch errors from the `@huggingface/inference` library. All errors propagate directly up to Noosphere's `executeWithRetry()`:

```
HfInference throws → HuggingFaceProvider propagates →
executeWithRetry catches → Noosphere wraps as NoosphereError
```

Common error scenarios:

- **401 Unauthorized** — Invalid or expired token → becomes `AUTH_FAILED`
- **404 Model Not Found** — Model ID doesn't exist on HF Hub → becomes `MODEL_NOT_FOUND`
- **429 Rate Limited** — Free tier limit exceeded → becomes `RATE_LIMITED` (retryable)
- **503 Model Loading** — Model is cold-starting on HF Inference → becomes `PROVIDER_UNAVAILABLE` (retryable)
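A minimal sketch of that status-code mapping (hypothetical; the real `executeWithRetry()` internals may differ, and `mapHttpStatus` is an illustrative helper, not Noosphere source):

```typescript
// Maps the HTTP statuses listed above to the NoosphereError codes they become.
type NoosphereCode = 'AUTH_FAILED' | 'MODEL_NOT_FOUND' | 'RATE_LIMITED' | 'PROVIDER_UNAVAILABLE';

function mapHttpStatus(status: number): { code: NoosphereCode; retryable: boolean } | undefined {
  switch (status) {
    case 401: return { code: 'AUTH_FAILED', retryable: false };
    case 404: return { code: 'MODEL_NOT_FOUND', retryable: false };
    case 429: return { code: 'RATE_LIMITED', retryable: true };
    case 503: return { code: 'PROVIDER_UNAVAILABLE', retryable: true };
    default:  return undefined; // other statuses: provider-specific handling
  }
}
```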
---

### ComfyUI — Local Image Generation

**Provider ID:** `comfyui`
**Modalities:** Image, Video (planned)
**Type:** Local
**Default Port:** 8188
**Source:** `src/providers/comfyui.ts` (155 lines)

Connects to a local ComfyUI instance for Stable Diffusion workflows. ComfyUI is a node-based UI for Stable Diffusion that exposes an HTTP API. Noosphere communicates with it via raw HTTP — no ComfyUI SDK needed.

#### How It Works — Complete Lifecycle

```
User calls ai.image() →
  1. structuredClone(DEFAULT_TXT2IMG_WORKFLOW)  // Deep-clone the template
  2. Inject parameters into workflow nodes      // Mutate the clone
  3. POST /prompt { prompt: workflow }          // Queue the workflow
  4. Receive { prompt_id: "abc-123" }           // Get tracking ID
  5. POLL GET /history/abc-123 every 1000ms     // Check completion
  6. Parse outputs → find SaveImage node        // Locate generated image
  7. GET /view?filename=X&subfolder=Y&type=Z    // Fetch image binary
  8. Return Buffer                              // PNG buffer to caller
```

#### The Complete Workflow JSON — All 7 Nodes

The `DEFAULT_TXT2IMG_WORKFLOW` constant defines a complete SDXL text-to-image pipeline as a ComfyUI node graph of seven nodes. Each key is a **node ID** (string), each value defines the node type and its connections:

```typescript
// Node "3": KSampler — The core diffusion sampling node
'3': {
  class_type: 'KSampler',
  inputs: {
    seed: 0,                // Random seed (overridden by options.seed)
    steps: 20,              // Denoising steps (overridden by options.steps)
    cfg: 7,                 // CFG/guidance scale (overridden by options.guidanceScale)
    sampler_name: 'euler',  // Sampling algorithm
    scheduler: 'normal',    // Noise schedule
    denoise: 1,             // Full denoise (1.0 = txt2img, <1.0 = img2img)
    model: ['4', 0],        // ← Connection: output 0 of node "4" (checkpoint model)
    positive: ['6', 0],     // ← Connection: output 0 of node "6" (positive prompt)
    negative: ['7', 0],     // ← Connection: output 0 of node "7" (negative prompt)
    latent_image: ['5', 0], // ← Connection: output 0 of node "5" (empty latent)
  },
}

// Node "4": CheckpointLoaderSimple — Loads the SDXL model from disk
'4': {
  class_type: 'CheckpointLoaderSimple',
  inputs: {
    ckpt_name: 'sd_xl_base_1.0.safetensors', // Checkpoint file on disk
    // Outputs: [0]=MODEL, [1]=CLIP, [2]=VAE
    //   MODEL → KSampler.model
    //   CLIP  → CLIPTextEncode nodes
    //   VAE   → VAEDecode
  },
}

// Node "5": EmptyLatentImage — Creates the initial noise tensor
'5': {
  class_type: 'EmptyLatentImage',
  inputs: {
    width: 1024,   // Overridden by options.width
    height: 1024,  // Overridden by options.height
    batch_size: 1, // Always 1 image per generation
  },
}

// Node "6": CLIPTextEncode — Positive prompt encoding
'6': {
  class_type: 'CLIPTextEncode',
  inputs: {
    text: '',       // Overridden by options.prompt
    clip: ['4', 1], // ← Connection: output 1 of node "4" (CLIP model)
  },
}

// Node "7": CLIPTextEncode — Negative prompt encoding
'7': {
  class_type: 'CLIPTextEncode',
  inputs: {
    text: '',       // Overridden by options.negativePrompt ?? ''
    clip: ['4', 1], // ← Same CLIP model as positive prompt
  },
}

// Node "8": VAEDecode — Converts latent space to pixel space
'8': {
  class_type: 'VAEDecode',
  inputs: {
    samples: ['3', 0], // ← Connection: output 0 of node "3" (sampled latents)
    vae: ['4', 2],     // ← Connection: output 2 of node "4" (VAE decoder)
  },
}

// Node "9": SaveImage — Saves the final image
'9': {
  class_type: 'SaveImage',
  inputs: {
    filename_prefix: 'noosphere', // Files saved as noosphere_00001.png, etc.
    images: ['8', 0],             // ← Connection: output 0 of node "8" (decoded image)
  },
}
```

**Node connection format:** `['nodeId', outputIndex]` — this is ComfyUI's internal linking system. For example, `['4', 1]` means "output slot 1 of node 4", which is the CLIP model from CheckpointLoaderSimple.

**Visual pipeline flow:**

```
CheckpointLoader["4"] ──MODEL──→ KSampler["3"]
  ├──CLIP──→ CLIPTextEncode["6"] (positive) ──→ KSampler["3"]
  ├──CLIP──→ CLIPTextEncode["7"] (negative) ──→ KSampler["3"]
  └──VAE───→ VAEDecode["8"]
EmptyLatentImage["5"] ──→ KSampler["3"] ──→ VAEDecode["8"] ──→ SaveImage["9"]
```

#### Parameter Injection — How Options Map to Nodes
|
|
1769
|
+
|
|
1770
|
+
```typescript
|
|
1771
|
+
// Deep-clone to avoid mutating the template:
|
|
1772
|
+
const workflow = structuredClone(DEFAULT_TXT2IMG_WORKFLOW);
|
|
1773
|
+
|
|
1774
|
+
// Direct node mutations:
|
|
1775
|
+
workflow['6'].inputs.text = options.prompt; // Positive prompt → Node 6
|
|
1776
|
+
workflow['7'].inputs.text = options.negativePrompt ?? ''; // Negative prompt → Node 7
|
|
1777
|
+
workflow['5'].inputs.width = options.width ?? 1024; // Width → Node 5
|
|
1778
|
+
workflow['5'].inputs.height = options.height ?? 1024; // Height → Node 5
|
|
1779
|
+
|
|
1780
|
+
// Conditional overrides (only if user provided them):
|
|
1781
|
+
if (options.seed !== undefined) workflow['3'].inputs.seed = options.seed;
|
|
1782
|
+
if (options.steps !== undefined) workflow['3'].inputs.steps = options.steps;
|
|
1783
|
+
if (options.guidanceScale !== undefined) workflow['3'].inputs.cfg = options.guidanceScale;
|
|
1784
|
+
// Note: sampler_name, scheduler, and denoise are NOT configurable via Noosphere.
|
|
1785
|
+
// They're hardcoded to euler/normal/1.0
|
|
1786
|
+
```
|
|
1787
|
+
|
|
1788
|
+
#### Queue Submission — POST /prompt
|
|
1789
|
+
|
|
1790
|
+
```typescript
|
|
1791
|
+
const queueRes = await fetch(`${this.baseUrl}/prompt`, {
|
|
1792
|
+
method: 'POST',
|
|
1793
|
+
headers: { 'Content-Type': 'application/json' },
|
|
1794
|
+
body: JSON.stringify({ prompt: workflow }),
|
|
1795
|
+
// ComfyUI expects: { prompt: <workflow_object>, client_id?: string }
|
|
1796
|
+
});
|
|
1797
|
+
|
|
1798
|
+
if (!queueRes.ok) throw new Error(`ComfyUI queue failed: ${queueRes.status}`);
|
|
1799
|
+
|
|
1800
|
+
const { prompt_id } = await queueRes.json();
|
|
1801
|
+
// prompt_id is a UUID like "a1b2c3d4-e5f6-7890-abcd-ef1234567890"
|
|
1802
|
+
// Used to track this specific generation in the history API
|
|
1803
|
+
```

#### Polling Mechanism — Deadline-Based with 1s Intervals

```typescript
private async pollForResult(promptId: string, maxWaitMs = 300000): Promise<ArrayBuffer> {
  const deadline = Date.now() + maxWaitMs; // 300,000ms = 5 minutes

  while (Date.now() < deadline) {
    // Check history for our prompt
    const res = await fetch(`${this.baseUrl}/history/${promptId}`);

    if (!res.ok) {
      await new Promise((r) => setTimeout(r, 1000)); // 1 second between polls
      continue;
    }

    const history = await res.json();
    // History format: { [promptId]: { outputs: { [nodeId]: { images: [...] } } } }

    const entry = history[promptId];
    if (!entry?.outputs) {
      await new Promise((r) => setTimeout(r, 1000)); // Not ready yet
      continue;
    }

    // Search ALL output nodes for images (not just node "9"):
    for (const nodeOutput of Object.values(entry.outputs)) {
      if (nodeOutput.images?.length > 0) {
        const img = nodeOutput.images[0];
        // Fetch the actual image binary:
        const imgRes = await fetch(
          `${this.baseUrl}/view?filename=${img.filename}&subfolder=${img.subfolder}&type=${img.type}`
        );
        return imgRes.arrayBuffer();
      }
    }

    await new Promise((r) => setTimeout(r, 1000));
  }

  throw new Error(`ComfyUI generation timed out after ${maxWaitMs}ms`);
}
```

**Key polling details:**
- **Interval:** Fixed 1000ms (not configurable)
- **Timeout:** 300,000ms = 5 minutes (hardcoded, not from `config.timeout.image`)
- **Deadline-based:** Uses `Date.now() < deadline` comparison, NOT a retry counter
- **Image fetch URL format:** `/view?filename=noosphere_00001_.png&subfolder=&type=output`
- **Returns:** Raw `ArrayBuffer` → converted to `Buffer` by the caller
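The same deadline-plus-interval loop can be factored into a small generic helper. A minimal standalone sketch (the `pollUntil` name and signature are illustrative, not part of Noosphere's API):

```typescript
// Illustrative helper, not Noosphere's actual code: poll `check` every
// `intervalMs` until it returns a non-null value or the deadline passes.
async function pollUntil<T>(
  check: () => Promise<T | null>,
  maxWaitMs: number,
  intervalMs = 1000,
): Promise<T> {
  const deadline = Date.now() + maxWaitMs;
  while (Date.now() < deadline) {
    const result = await check();
    if (result !== null) return result; // Done — analogous to finding images in /history
    await new Promise((r) => setTimeout(r, intervalMs)); // Fixed interval, like the 1s above
  }
  throw new Error(`timed out after ${maxWaitMs}ms`); // Mirrors the ComfyUI timeout error
}
```

A deadline comparison (rather than a retry counter) means slow `/history` responses eat into the total budget instead of extending it.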

#### Auto-Detection — How ComfyUI Gets Discovered

During `Noosphere.init()`, if `autoDetectLocal` is true:

```typescript
// Ping the /system_stats endpoint with a 2-second timeout:
const pingUrl = async (url: string): Promise<boolean> => {
  const controller = new AbortController();
  const timer = setTimeout(() => controller.abort(), 2000); // 2s hard timeout
  try {
    const res = await fetch(url, { signal: controller.signal });
    return res.ok;
  } catch {
    return false; // Connection refused or 2s abort → treat as "not running"
  } finally {
    clearTimeout(timer);
  }
};

// Check ComfyUI specifically:
if (comfyuiCfg?.enabled) {
  const ok = await pingUrl(`${comfyuiCfg.host}:${comfyuiCfg.port}/system_stats`);
  if (ok) {
    this.registry.addProvider(new ComfyUIProvider({
      host: comfyuiCfg.host, // Default: 'http://localhost'
      port: comfyuiCfg.port, // Default: 8188
    }));
  }
}
```

**Environment variable overrides:**
```bash
COMFYUI_HOST=http://192.168.1.100   # Override host
COMFYUI_PORT=8190                   # Override port
```

#### Configuration

```typescript
const ai = new Noosphere({
  local: {
    comfyui: {
      enabled: true,            // Default: true (auto-detected)
      host: 'http://localhost', // Default: 'http://localhost'
      port: 8188,               // Default: 8188
    },
  },
});
```

#### Model Discovery — Dynamic via /object_info

```typescript
async listModels(modality?: Modality): Promise<ModelInfo[]> {
  // Fetches ComfyUI's full node registry:
  const res = await fetch(`${this.baseUrl}/object_info`);
  if (!res.ok) return [];

  // Does NOT parse the response — just uses it as a connectivity check.
  // Returns hardcoded model entries:
  const models: ModelInfo[] = [];
  if (!modality || modality === 'image') {
    models.push({
      id: 'comfyui-txt2img',
      provider: 'comfyui',
      name: 'ComfyUI Text-to-Image',
      modality: 'image',
      local: true,
      cost: { price: 0, unit: 'free' },
      capabilities: { maxWidth: 2048, maxHeight: 2048, supportsNegativePrompt: true },
    });
  }
  if (!modality || modality === 'video') {
    models.push({
      id: 'comfyui-txt2vid',
      provider: 'comfyui',
      name: 'ComfyUI Text-to-Video',
      modality: 'video',
      local: true,
      cost: { price: 0, unit: 'free' },
      capabilities: { maxDuration: 10, supportsImageToVideo: true },
    });
  }
  return models;
}
// NOTE: /object_info is fetched but the response is discarded.
// The actual model list is hardcoded. This means even if you have
// dozens of checkpoints in ComfyUI, Noosphere only exposes 2 model IDs.
```

#### Video Generation — Not Yet Implemented

```typescript
async video(_options: VideoOptions): Promise<NoosphereResult> {
  throw new Error('ComfyUI video generation requires a configured AnimateDiff workflow');
}
// The 'comfyui-txt2vid' model ID is listed but will throw at runtime.
// This is a placeholder for future AnimateDiff/SVD workflow templates.
```

#### Default Workflow Parameters Summary

| Parameter | Default | Configurable | Node |
|---|---|---|---|
| Checkpoint | `sd_xl_base_1.0.safetensors` | No | Node 4 |
| Sampler | `euler` | No | Node 3 |
| Scheduler | `normal` | No | Node 3 |
| Denoise | `1.0` | No | Node 3 |
| Steps | `20` | Yes (`options.steps`) | Node 3 |
| CFG/Guidance | `7` | Yes (`options.guidanceScale`) | Node 3 |
| Seed | `0` | Yes (`options.seed`) | Node 3 |
| Width | `1024` | Yes (`options.width`) | Node 5 |
| Height | `1024` | Yes (`options.height`) | Node 5 |
| Batch Size | `1` | No | Node 5 |
| Filename Prefix | `noosphere` | No | Node 9 |
| Negative Prompt | `''` (empty) | Yes (`options.negativePrompt`) | Node 7 |
| Max Size | `2048x2048` | Via options | Node 5 |
| Output Format | PNG | No | ComfyUI default |
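The table can be read as a simple mapping from image options onto node inputs in the workflow JSON. A hypothetical sketch of that patching step (the `applyOptions` helper and `Workflow` shape are illustrative, not Noosphere's actual internals; node numbers follow the table above):

```typescript
// Illustrative sketch: patch the documented defaults/options onto the
// workflow graph, keyed by the node numbers from the table above.
type Workflow = Record<string, { inputs: Record<string, unknown> }>;

interface ImageOpts {
  steps?: number; guidanceScale?: number; seed?: number;
  width?: number; height?: number; negativePrompt?: string;
}

function applyOptions(workflow: Workflow, o: ImageOpts): Workflow {
  workflow['3'].inputs.steps = o.steps ?? 20;         // Node 3: sampler steps
  workflow['3'].inputs.cfg = o.guidanceScale ?? 7;    // Node 3: CFG scale
  workflow['3'].inputs.seed = o.seed ?? 0;            // Node 3: seed
  workflow['5'].inputs.width = o.width ?? 1024;       // Node 5: latent width
  workflow['5'].inputs.height = o.height ?? 1024;     // Node 5: latent height
  workflow['7'].inputs.text = o.negativePrompt ?? ''; // Node 7: negative prompt
  return workflow;
}
```

Everything else in the table (checkpoint, sampler, scheduler, denoise, batch size, filename prefix) stays at its hardcoded default.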

---

### Local TTS — Piper & Kokoro

**Provider IDs:** `piper`, `kokoro`
**Modality:** TTS
**Type:** Local
**Source:** `src/providers/local-tts.ts` (112 lines)

The `LocalTTSProvider` is a generic adapter for any local TTS server that exposes an OpenAI-compatible `/v1/audio/speech` endpoint. Two instances are created by default — one for Piper, one for Kokoro — but the class works with ANY server implementing this protocol.

#### Supported Engines

| Engine | Default Port | Health Check | Voice Discovery | Description |
|---|---|---|---|---|
| Piper | 5500 | `GET /health` | `GET /voices` (array) | Fast offline TTS, 30+ languages, ONNX models |
| Kokoro | 5501 | `GET /health` | `GET /v1/models` (OpenAI format) | High-quality neural TTS |

#### Provider Instantiation — How Instances Are Created

```typescript
// The LocalTTSProvider constructor takes a config object:
interface LocalTTSConfig {
  id: string;   // Provider ID: 'piper' or 'kokoro'
  name: string; // Display name: 'Piper TTS' or 'Kokoro TTS'
  host: string; // Base URL host
  port: number; // Port number
}

// Two separate instances are created during init():
new LocalTTSProvider({ id: 'piper', name: 'Piper TTS', host: piperCfg.host, port: piperCfg.port })
new LocalTTSProvider({ id: 'kokoro', name: 'Kokoro TTS', host: kokoroCfg.host, port: kokoroCfg.port })

// Each instance is an independent provider in the registry.
// They don't share state or config.
// The baseUrl is constructed as: `${config.host}:${config.port}`
// Example: "http://localhost:5500"
```

#### Health Check — Ping Protocol

```typescript
async ping(): Promise<boolean> {
  try {
    const res = await fetch(`${this.baseUrl}/health`);
    return res.ok; // true if HTTP 200-299
  } catch {
    return false; // Network error, connection refused, etc.
  }
}
// Used during auto-detection in Noosphere.init()
// Also used by: the 2-second AbortController timeout in init()
// Note: /health is checked BEFORE the provider is registered.
// If /health fails, the provider is silently skipped.
```

#### Dual Voice Discovery Mechanism

The `listModels()` method implements a **two-strategy fallback** to discover available voices. This is necessary because different TTS servers expose voices through different API formats:

```typescript
async listModels(modality?: Modality): Promise<ModelInfo[]> {
  if (modality && modality !== 'tts') return [];

  let voices: Array<{ id: string; name?: string }> = [];

  // STRATEGY 1: Piper-style /voices endpoint
  // Expected response: Array<{ id: string, name?: string, ... }>
  try {
    const res = await fetch(`${this.baseUrl}/voices`);
    if (res.ok) {
      const data = await res.json();
      if (Array.isArray(data)) {
        voices = data;
        // Success — skip fallback
      }
    }
  } catch {
    // STRATEGY 2: OpenAI-compatible /v1/models endpoint
    // Expected response: { data: Array<{ id: string, ... }> }
    const res = await fetch(`${this.baseUrl}/v1/models`);
    if (res.ok) {
      const data = await res.json();
      voices = data.data ?? [];
    }
  }

  // Map voices to ModelInfo objects:
  return voices.map((v) => ({
    id: v.id,
    provider: this.id,      // 'piper' or 'kokoro'
    name: v.name ?? v.id,   // Fallback to ID if no name
    modality: 'tts' as const,
    local: true,
    cost: { price: 0, unit: 'free' },
    capabilities: {
      voices: voices.map((vv) => vv.id), // All voice IDs as capabilities
    },
  }));
}
```

**Critical implementation detail:** The fallback is triggered by a `catch` block, NOT by checking the response. This means:
- If `/voices` returns a **non-array** (e.g., `{}`), strategy 1 succeeds but `voices` remains empty
- If `/voices` returns HTTP **404**, strategy 1 "succeeds" (no exception), but `res.ok` is false, so voices stays empty, AND strategy 2 is never tried
- Strategy 2 only runs if `/voices` **throws a network error** (connection refused, DNS failure, etc.)
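A more robust variant would treat any unusable `/voices` response (non-OK status, non-array body, or empty array) as a miss and then try `/v1/models`. A sketch of that shape, with injected fetchers so the decision logic is testable (`fetchVoices`/`fetchModels` are illustrative stand-ins for the two HTTP calls, not Noosphere's API):

```typescript
// Illustrative sketch: fall back on ANY unusable /voices result, not only
// on thrown network errors. The fetchers stand in for the two HTTP calls.
interface Voice { id: string; name?: string }

async function discoverVoices(
  fetchVoices: () => Promise<unknown>,            // stands in for GET /voices
  fetchModels: () => Promise<{ data?: Voice[] }>, // stands in for GET /v1/models
): Promise<Voice[]> {
  try {
    const data = await fetchVoices();
    if (Array.isArray(data) && data.length > 0) return data as Voice[];
  } catch {
    // fall through to strategy 2
  }
  try {
    const data = await fetchModels();
    return data.data ?? [];
  } catch {
    return []; // neither endpoint usable
  }
}
```

With this shape, a Piper server that 404s on `/voices` would still be probed via `/v1/models` instead of silently reporting zero voices.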

**Piper response format** (`GET /voices`):
```json
[
  { "id": "en_US-lessac-medium", "name": "Lessac (English US)" },
  { "id": "en_US-amy-medium", "name": "Amy (English US)" },
  { "id": "de_DE-thorsten-high", "name": "Thorsten (German)" }
]
```

**Kokoro/OpenAI response format** (`GET /v1/models`):
```json
{
  "data": [
    { "id": "kokoro-v1", "object": "model" },
    { "id": "kokoro-v1-jp", "object": "model" }
  ]
}
```

#### Speech Generation — Exact HTTP Protocol

```typescript
async speak(options: SpeakOptions): Promise<NoosphereResult> {
  const start = Date.now();

  // POST to OpenAI-compatible TTS endpoint:
  const res = await fetch(`${this.baseUrl}/v1/audio/speech`, {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({
      model: options.model ?? 'tts-1',          // Default model ID
      input: options.text,                      // Text to synthesize
      voice: options.voice ?? 'default',        // Voice selection
      speed: options.speed ?? 1.0,              // Playback speed multiplier
      response_format: options.format ?? 'mp3', // Output audio format
    }),
  });

  if (!res.ok) {
    // Note: error includes the response body text for debugging
    throw new Error(`Local TTS failed: ${res.status} ${await res.text()}`);
  }

  // Response is raw audio binary — convert to Buffer:
  const audioBuffer = Buffer.from(await res.arrayBuffer());

  return {
    buffer: audioBuffer,
    provider: this.id,                                  // 'piper' or 'kokoro'
    model: options.model ?? options.voice ?? 'default', // Fallback chain
    modality: 'tts',
    latencyMs: Date.now() - start,
    usage: {
      cost: 0,                    // Always free (local)
      input: options.text.length, // CHARACTER count, not tokens
      unit: 'characters',         // Track by characters
    },
    media: {
      format: options.format ?? 'mp3', // Matches requested format
    },
  };
}
```

**Request/Response details:**
| Field | Value | Notes |
|---|---|---|
| Method | `POST` | Always POST |
| URL | `/v1/audio/speech` | OpenAI-compatible standard |
| Content-Type | `application/json` | JSON body |
| Response Content-Type | `audio/mpeg`, `audio/wav`, or `audio/ogg` | Depends on `response_format` |
| Response Body | Raw binary audio | Converted to `Buffer` via `arrayBuffer()` |

**Available formats (from `SpeakOptions.format` type):**
| Format | Typical Size | Quality | Use Case |
|---|---|---|---|
| `mp3` | Smallest | Lossy | Web playback, storage |
| `wav` | Largest | Lossless | Processing, editing |
| `ogg` | Medium | Lossy | Web playback, open format |

#### Usage Tracking — Character-Based

Local TTS tracks usage by **character count**, not tokens:

```typescript
usage: {
  cost: 0,                    // Always 0 for local providers
  input: options.text.length, // JavaScript string .length (UTF-16 code units)
  unit: 'characters',         // Unit identifier for aggregation
}
// Note: .length counts UTF-16 code units, not Unicode codepoints.
// "Hello" = 5, "🎵" = 2 (surrogate pair), "café" = 4
```

This feeds into the global `UsageTracker`, so you can query TTS usage:
```typescript
const usage = ai.getUsage({ modality: 'tts' });
// usage.totalRequests = number of TTS calls
// usage.totalCost = 0 (always free for local)
// usage.byProvider = { piper: 0, kokoro: 0 }
```
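The character-based events aggregate the same way as token-based ones. A minimal sketch of that aggregation (the `UsageEvent` shape and `summarize` helper are illustrative; the real `UsageTracker` fields may differ):

```typescript
// Illustrative sketch: aggregate usage events per modality/provider,
// mirroring the getUsage({ modality }) query shown above.
interface UsageEvent { provider: string; modality: string; cost: number; input: number; unit: string }

function summarize(events: UsageEvent[], modality?: string) {
  const filtered = modality ? events.filter((e) => e.modality === modality) : events;
  const byProvider: Record<string, number> = {};
  for (const e of filtered) byProvider[e.provider] = (byProvider[e.provider] ?? 0) + e.cost;
  return {
    totalRequests: filtered.length,                     // number of calls
    totalCost: filtered.reduce((s, e) => s + e.cost, 0), // 0 for local-only usage
    byProvider,
  };
}
```

Because local TTS events always carry `cost: 0`, a TTS-only summary reports a zero total even when `totalRequests` is large; the `input` field (characters) is what actually grows.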

#### Auto-Detection — Parallel Discovery

Both Piper and Kokoro are detected simultaneously during `init()`:

```typescript
// Inside Noosphere.init(), wrapped in Promise.allSettled():
await Promise.allSettled([
  // ... ComfyUI detection ...
  (async () => {
    if (piperCfg?.enabled) { // enabled: true by default
      const ok = await pingUrl(`${piperCfg.host}:${piperCfg.port}/health`);
      if (ok) {
        this.registry.addProvider(new LocalTTSProvider({
          id: 'piper', name: 'Piper TTS',
          host: piperCfg.host, port: piperCfg.port,
        }));
      }
    }
  })(),
  (async () => {
    if (kokoroCfg?.enabled) { // enabled: true by default
      const ok = await pingUrl(`${kokoroCfg.host}:${kokoroCfg.port}/health`);
      if (ok) {
        this.registry.addProvider(new LocalTTSProvider({
          id: 'kokoro', name: 'Kokoro TTS',
          host: kokoroCfg.host, port: kokoroCfg.port,
        }));
      }
    }
  })(),
]);
```

**Environment variable overrides:**
```bash
PIPER_HOST=http://192.168.1.100 PIPER_PORT=5500
KOKORO_HOST=http://192.168.1.100 KOKORO_PORT=5501
```

#### Setting Up Local TTS Servers

**Piper TTS:**
```bash
# Docker (recommended):
docker run -p 5500:5500 rhasspy/wyoming-piper \
  --voice en_US-lessac-medium

# Or via pip:
pip install piper-tts
# Then run a compatible HTTP server (wyoming-piper or piper-http-server)
```

**Kokoro TTS:**
```bash
# Docker:
docker run -p 5501:8880 ghcr.io/remsky/kokoro-fastapi-cpu:latest

# The Kokoro server exposes OpenAI-compatible endpoints at:
# GET /v1/models        → List available voices
# POST /v1/audio/speech → Generate speech
# GET /health           → Health check
```

---

## Architecture

### The Complete Init() Flow — What Happens When You Create a Noosphere Instance

```typescript
const ai = new Noosphere({ /* config */ });
// At this point: config is resolved, but NO providers are registered.
// The `initialized` flag is false.

await ai.chat({ messages: [...] });
// FIRST call triggers lazy initialization via init()
```

**Initialization sequence (`src/noosphere.ts:240-322`):**

```
1. Constructor:
   ├── resolveConfig(input)          // Merge config > env > defaults
   ├── new Registry(cacheTTLMinutes) // Empty provider registry
   └── new UsageTracker(onUsage)     // Empty event list

2. First API call triggers init():
   ├── Set initialized = true (immediately, before any async work)
   │
   ├── CLOUD PROVIDER REGISTRATION (synchronous):
   │   ├── Collect all API keys from resolved config
   │   ├── If ANY LLM key exists → register PiAiProvider(allKeys)
   │   ├── If FAL key exists → register FalProvider(falKey)
   │   └── If HF token exists → register HuggingFaceProvider(token)
   │
   └── LOCAL SERVICE DETECTION (parallel, async):
       └── Promise.allSettled([
             pingUrl(comfyui /system_stats) → register ComfyUIProvider
             pingUrl(piper /health)         → register LocalTTSProvider('piper')
             pingUrl(kokoro /health)        → register LocalTTSProvider('kokoro')
           ])
```

**Key design decisions:**
- `initialized = true` is set **before** async work, preventing concurrent init() calls
- Cloud providers are registered **synchronously** (no network calls needed)
- Local detection uses `Promise.allSettled()` — a failing ping doesn't block others
- Each ping has a 2-second `AbortController` timeout
- If auto-detection is disabled (`autoDetectLocal: false`), local providers are never registered
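The detection step amounts to racing each probe against a timeout and letting `Promise.allSettled` absorb failures. A standalone sketch of that pattern (the `detect`/`withTimeout` helpers and in-memory probes are illustrative; the real code pings HTTP endpoints):

```typescript
// Illustrative sketch: run independent probes in parallel; a slow or failing
// probe neither blocks nor fails the others.
async function withTimeout<T>(p: Promise<T>, ms: number, fallback: T): Promise<T> {
  let timer: ReturnType<typeof setTimeout>;
  const timeout = new Promise<T>((resolve) => { timer = setTimeout(() => resolve(fallback), ms); });
  try {
    return await Promise.race([p, timeout]);
  } finally {
    clearTimeout(timer!); // don't leave the timer holding the event loop open
  }
}

async function detect(probes: Record<string, () => Promise<boolean>>): Promise<string[]> {
  const names = Object.keys(probes);
  const results = await Promise.allSettled(
    // A rejected probe becomes `false`; a hung probe is cut off by the timeout.
    names.map((n) => withTimeout(probes[n]().catch(() => false), 2000, false)),
  );
  return names.filter((_, i) =>
    results[i].status === 'fulfilled' && (results[i] as PromiseFulfilledResult<boolean>).value,
  );
}
```

Only services whose probe resolves `true` within the window end up registered; everything else is silently skipped, matching the behavior described above.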

### Configuration Resolution — Three-Layer Priority System

The `resolveConfig()` function (`src/config.ts`, 87 lines) implements a strict priority hierarchy:

```
Priority: Explicit Config > Environment Variables > Built-in Defaults
```

**API Key Resolution:**
```typescript
// For each of the 9 supported providers:
const ENV_KEY_MAP = {
  openai: 'OPENAI_API_KEY',
  anthropic: 'ANTHROPIC_API_KEY',
  google: 'GEMINI_API_KEY',
  fal: 'FAL_KEY',
  openrouter: 'OPENROUTER_API_KEY',
  huggingface: 'HUGGINGFACE_TOKEN',
  groq: 'GROQ_API_KEY',
  mistral: 'MISTRAL_API_KEY',
  xai: 'XAI_API_KEY',
};

// Resolution per key:
keys[name] = input.keys?.[name]   // 1. Explicit config
  ?? process.env[envVar];         // 2. Environment variable
                                  // 3. undefined (no default)
```

**Local Service Resolution:**
```typescript
// For each of the 4 local services:
const LOCAL_DEFAULTS = {
  ollama: { host: 'http://localhost', port: 11434, envHost: 'OLLAMA_HOST', envPort: 'OLLAMA_PORT' },
  comfyui: { host: 'http://localhost', port: 8188, envHost: 'COMFYUI_HOST', envPort: 'COMFYUI_PORT' },
  piper: { host: 'http://localhost', port: 5500, envHost: 'PIPER_HOST', envPort: 'PIPER_PORT' },
  kokoro: { host: 'http://localhost', port: 5501, envHost: 'KOKORO_HOST', envPort: 'KOKORO_PORT' },
};

// Resolution per service:
local[name] = {
  enabled: cfgLocal?.enabled ?? true, // Default: enabled
  host: cfgLocal?.host ?? process.env[envHost] ?? defaults.host,
  // Parse the env port only when it is set — `parseInt(undefined)` yields NaN,
  // which is NOT nullish and would leak past `??` instead of hitting the default.
  port: cfgLocal?.port
    ?? (process.env[envPort] ? parseInt(process.env[envPort], 10) : undefined)
    ?? defaults.port,
  type: cfgLocal?.type,
};
```
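The `??` chain generalizes to a small helper that also sidesteps the `parseInt(undefined)` → `NaN` pitfall noted above. An illustrative sketch (the `resolvePort` name is ours, not the library's):

```typescript
// Illustrative helper: resolve a numeric setting with
// explicit config > environment variable > built-in default priority.
function resolvePort(
  explicit: number | undefined,
  envValue: string | undefined,
  fallback: number,
): number {
  if (explicit !== undefined) return explicit; // 1. Explicit config
  if (envValue !== undefined) {
    const parsed = parseInt(envValue, 10);
    if (!Number.isNaN(parsed)) return parsed;  // 2. Environment variable
  }
  return fallback;                             // 3. Built-in default
}
```

Note the explicit `Number.isNaN` guard: `NaN ?? fallback` evaluates to `NaN` because `??` only short-circuits on `null`/`undefined`, so a malformed `COMFYUI_PORT` would otherwise propagate as `NaN`.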

**Other config defaults:**
| Setting | Default | Environment Override |
|---|---|---|
| `autoDetectLocal` | `true` | `NOOSPHERE_AUTO_DETECT_LOCAL` |
| `discoveryCacheTTL` | `60` (minutes) | `NOOSPHERE_DISCOVERY_CACHE_TTL` |
| `retry.maxRetries` | `2` | — |
| `retry.backoffMs` | `1000` | — |
| `retry.failover` | `true` | — |
| `retry.retryableErrors` | `['PROVIDER_UNAVAILABLE', 'RATE_LIMITED', 'TIMEOUT']` | — |
| `timeout.llm` | `30000` (30s) | — |
| `timeout.image` | `120000` (2m) | — |
| `timeout.video` | `300000` (5m) | — |
| `timeout.tts` | `60000` (1m) | — |

### Provider Resolution — Local-First Algorithm

When you call a generation method without specifying a provider, Noosphere resolves one automatically through a three-stage process in `resolveProviderForModality()` (`src/noosphere.ts:324-348`):

```typescript
private resolveProviderForModality(
  modality: Modality,
  preferredId?: string,
  modelId?: string,
): NoosphereProvider {

  // STAGE 1: Model-based resolution
  // If model was specified WITHOUT a provider, search the registry cache
  if (modelId && !preferredId) {
    const resolved = this.registry.resolveModel(modelId, modality);
    if (resolved) return resolved.provider;
    // resolveModel() scans ALL cached models across ALL providers
    // looking for exact match on both modelId AND modality
  }

  // STAGE 2: Default-based resolution
  // Check if user configured a default for this modality
  if (!preferredId) {
    const defaultCfg = this.config.defaults[modality];
    if (defaultCfg) {
      preferredId = defaultCfg.provider;
      // Now fall through to Stage 3 with this preferredId
    }
  }

  // STAGE 3: Provider registry resolution
  const provider = this.registry.resolveProvider(modality, preferredId);
  if (!provider) {
    throw new NoosphereError(
      `No provider available for modality '${modality}'`,
      { code: 'NO_PROVIDER', ... }
    );
  }
  return provider;
}
```

**Registry.resolveProvider() — The local-first algorithm** (`src/registry.ts:31-46`):

```typescript
resolveProvider(modality: Modality, preferredId?: string): NoosphereProvider | null {
  // If a specific provider was requested:
  if (preferredId) {
    const p = this.providers.get(preferredId);
    if (p && p.modalities.includes(modality)) return p;
    return null; // NOT found — returns null, NOT a fallback
  }

  // No preference — scan with local-first priority:
  let bestCloud: NoosphereProvider | null = null;

  for (const p of this.providers.values()) {
    if (!p.modalities.includes(modality)) continue;

    // LOCAL provider found → return IMMEDIATELY (first match wins)
    if (p.isLocal) return p;

    // CLOUD provider → save as fallback (first cloud match only)
    if (!bestCloud) bestCloud = p;
  }

  return bestCloud; // Return first cloud provider, or null
}
```
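Stripped of the registry plumbing, the local-first scan reduces to a short pure function. A standalone sketch (types simplified for illustration):

```typescript
// Illustrative sketch of the local-first scan: the first local provider that
// supports the modality wins; otherwise the first matching cloud provider.
interface Provider { id: string; isLocal: boolean; modalities: string[] }

function pickProvider(providers: Provider[], modality: string): Provider | null {
  let bestCloud: Provider | null = null;
  for (const p of providers) {
    if (!p.modalities.includes(modality)) continue;
    if (p.isLocal) return p;       // local wins immediately
    if (!bestCloud) bestCloud = p; // remember the first cloud match
  }
  return bestCloud;                // first cloud provider, or null
}
```

One consequence worth noting: "first match wins" depends on registration order, so with two local providers for the same modality, whichever was registered earlier is chosen.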

**Resolution priority diagram:**
```
ai.chat({ model: 'gpt-4o' })
│
├─ Stage 1: Search modelCache for 'gpt-4o' with modality 'llm'
│  └── Found in pi-ai cache → return PiAiProvider
│
├─ Stage 2: (skipped — model resolved in Stage 1)
│
└─ Stage 3: (skipped — already resolved)

ai.image({ prompt: 'sunset' })
│
├─ Stage 1: (no model specified, skipped)
│
├─ Stage 2: Check config.defaults.image → none configured
│
└─ Stage 3: resolveProvider('image', undefined)
   ├── Scan providers:
   │   ├── pi-ai: modalities=['llm'] → skip (no 'image')
   │   ├── comfyui: modalities=['image','video'], isLocal=true → RETURN
   │   └── (fal never reached — local wins)
   └── Returns ComfyUIProvider (local-first)

ai.image({ prompt: 'sunset' })  // No local ComfyUI running
│
└─ Stage 3: resolveProvider('image', undefined)
   ├── Scan providers:
   │   ├── pi-ai: no 'image' → skip
   │   ├── fal: modalities=['image','video','tts'], isLocal=false → save as bestCloud
   │   └── huggingface: modalities=['image','tts','llm'], isLocal=false → already have bestCloud
   └── Returns FalProvider (first cloud fallback)
```

### Retry & Failover Logic — Complete Algorithm

The `executeWithRetry()` method (`src/noosphere.ts:350-397`) implements a two-phase error handling strategy: same-provider retries, then cross-provider failover.

```typescript
private async executeWithRetry<T>(
  modality: Modality,
  provider: NoosphereProvider,
  fn: () => Promise<T>,
  failoverFnFactory?: (alt: NoosphereProvider) => (() => Promise<T>) | null,
): Promise<T> {
  const { maxRetries, backoffMs, retryableErrors, failover } = this.config.retry;
  // Default: maxRetries=2, backoffMs=1000, failover=true
  // retryableErrors = ['PROVIDER_UNAVAILABLE', 'RATE_LIMITED', 'TIMEOUT']
  let lastError: Error | undefined;

  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    try {
      return await fn(); // Try the primary provider
    } catch (err) {
      lastError = err instanceof Error ? err : new Error(String(err));

      const isNoosphereErr = err instanceof NoosphereError;
      const code = isNoosphereErr ? err.code : 'GENERATION_FAILED';

      // GENERATION_FAILED is special:
      // - Retryable on same provider (bad prompt, transient model issue)
      // - NOT eligible for cross-provider failover
      const isRetryable = retryableErrors.includes(code) || code === 'GENERATION_FAILED';
      const allowsFailover = code !== 'GENERATION_FAILED' && retryableErrors.includes(code);

      if (!isRetryable || attempt === maxRetries) {
        // FAILOVER PHASE: Try other providers
        if (failover && allowsFailover && failoverFnFactory) {
          const altProviders = this.registry.getAllProviders()
            .filter((p) => p.id !== provider.id && p.modalities.includes(modality));

          for (const alt of altProviders) {
            try {
              const altFn = failoverFnFactory(alt);
              if (altFn) return await altFn(); // Success on alternate provider
            } catch {
              // Continue to next alternate provider
            }
          }
        }
        break; // All retries and failovers exhausted
      }

      // RETRY: Exponential backoff on same provider
      const delay = backoffMs * Math.pow(2, attempt);
      // attempt=0: 1000ms, attempt=1: 2000ms, attempt=2: 4000ms
      await new Promise((resolve) => setTimeout(resolve, delay));
    }
  }

  throw lastError ?? new NoosphereError('Generation failed', { ... });
}
```

**Failover function factory pattern:**

Each generation method passes a factory function that creates the right call for alternate providers:
```typescript
// In chat():
(alt) => alt.chat ? () => alt.chat!(options) : null
// If the alternate provider has chat(), create a function to call it.
// If not (e.g., ComfyUI for LLM), return null → skip this provider.

// In image():
(alt) => alt.image ? () => alt.image!(options) : null

// In video():
(alt) => alt.video ? () => alt.video!(options) : null

// In speak():
(alt) => alt.speak ? () => alt.speak!(options) : null
```

**Complete retry timeline example:**
```
ai.chat() with provider="pi-ai", maxRetries=2, backoffMs=1000

Attempt 0: pi-ai.chat() → RATE_LIMITED
           wait 1000ms (1000 * 2^0)
Attempt 1: pi-ai.chat() → RATE_LIMITED
           wait 2000ms (1000 * 2^1)
Attempt 2: pi-ai.chat() → RATE_LIMITED
           // maxRetries exhausted, RATE_LIMITED allows failover
Failover 1: huggingface.chat() → 503 SERVICE_UNAVAILABLE
Failover 2: (no more providers with 'llm' modality)
throw last error (RATE_LIMITED from pi-ai)
```

**Error classification matrix:**

| Error Code | Same-Provider Retry | Cross-Provider Failover | Typical Cause |
|---|---|---|---|
| `PROVIDER_UNAVAILABLE` | Yes | Yes | Server down, network error |
| `RATE_LIMITED` | Yes | Yes | API quota exceeded |
| `TIMEOUT` | Yes | Yes | Slow response |
| `GENERATION_FAILED` | Yes | **No** | Bad prompt, model error |
| `AUTH_FAILED` | No | No | Wrong API key |
| `MODEL_NOT_FOUND` | No | No | Invalid model ID |
| `INVALID_INPUT` | No | No | Bad parameters |
| `NO_PROVIDER` | No | No | No provider registered |
|
|
2564
|
+
|
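The matrix can be read as two predicates. A sketch of how it might be encoded (helper names are illustrative, not the library's API):

```typescript
// Transient codes: retried on the same provider AND eligible for failover.
const TRANSIENT = new Set(['PROVIDER_UNAVAILABLE', 'RATE_LIMITED', 'TIMEOUT']);

function allowsSameProviderRetry(code: string): boolean {
  // GENERATION_FAILED is retried in place but never failed over —
  // another provider would likely choke on the same input.
  return TRANSIENT.has(code) || code === 'GENERATION_FAILED';
}

function allowsFailover(code: string): boolean {
  return TRANSIENT.has(code);
}
```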

### Model Registry — Internal Data Structures

The Registry (`src/registry.ts`, 137 lines) is the central nervous system that maps providers to models and handles model lookups.

**Internal state:**

```typescript
class Registry {
  // Provider storage: Map<providerId, providerInstance>
  private providers = new Map<string, NoosphereProvider>();
  // Example: { 'pi-ai' → PiAiProvider, 'fal' → FalProvider, 'comfyui' → ComfyUIProvider }

  // Model cache: Map<providerId, { models: ModelInfo[], syncedAt: timestamp }>
  private modelCache = new Map<string, CachedModels>();
  // Example: {
  //   'pi-ai' → { models: [246 ModelInfo objects], syncedAt: 1710000000000 },
  //   'fal'   → { models: [867 ModelInfo objects], syncedAt: 1710000000000 },
  // }

  // Cache TTL in milliseconds (converted from minutes in the constructor)
  private cacheTTLMs: number;
  // Default: 60 * 60 * 1000 = 3,600,000 ms = 1 hour
}
```

**Cache staleness check:**

```typescript
isCacheStale(providerId: string): boolean {
  const cached = this.modelCache.get(providerId);
  if (!cached) return true; // No cache = stale
  return Date.now() - cached.syncedAt > this.cacheTTLMs;
  // Example: if syncedAt was 61 minutes ago and the TTL is 60 minutes → stale
}
```
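A worked example of the rule above with the default 60-minute TTL (a standalone sketch, not the Registry method itself):

```typescript
const cacheTTLMs = 60 * 60 * 1000; // default TTL: 3,600,000 ms

// Stale once strictly more than the TTL has elapsed since the last sync.
function isStale(syncedAt: number, now: number): boolean {
  return now - syncedAt > cacheTTLMs;
}

// Synced 61 minutes ago → stale; synced 59 minutes ago → still fresh.
```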

**Model resolution — linear scan across all caches:**

```typescript
resolveModel(modelId: string, modality: Modality):
  { provider: NoosphereProvider; model: ModelInfo } | null {

  // Scan EVERY provider's cached models:
  for (const [providerId, cached] of this.modelCache) {
    const model = cached.models.find(
      (m) => m.id === modelId && m.modality === modality
    );
    // Must match BOTH modelId AND modality
    if (model) {
      const provider = this.providers.get(providerId);
      if (provider) return { provider, model };
    }
  }
  return null;
}
// Performance: O(n), where n = total models across all providers
// With 246 Pi-AI + 867 FAL + 3 HuggingFace = ~1116 models to scan
// This is fast enough for the use case (called once per request)
```
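Should the linear scan ever become a bottleneck, one alternative (not implemented in the library; the names below are illustrative) is a composite-key index built at sync time, trading a little memory for O(1) lookups:

```typescript
// Hypothetical O(1) lookup index — a design alternative, not library code.
type Entry = { providerId: string };
const index = new Map<string, Entry>();

// Key on modality + model ID, mirroring the two-field match above.
function keyOf(modality: string, modelId: string): string {
  return `${modality}:${modelId}`;
}

// Build once per sync:
index.set(keyOf('llm', 'gpt-4o'), { providerId: 'pi-ai' });
index.set(keyOf('image', 'flux-pro'), { providerId: 'fal' });

// Lookup:
const hit = index.get(keyOf('llm', 'gpt-4o'));
```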

**Sync mechanism:**

```typescript
async syncAll(): Promise<SyncResult> {
  const byProvider: Record<string, number> = {};
  const errors: string[] = [];
  let synced = 0;

  // Sequential sync (NOT parallel) — one provider at a time:
  for (const provider of this.providers.values()) {
    try {
      const models = await provider.listModels();
      this.modelCache.set(provider.id, {
        models,
        syncedAt: Date.now(),
      });
      byProvider[provider.id] = models.length;
      synced += models.length;
    } catch (err) {
      errors.push(`${provider.id}: ${err.message}`);
      byProvider[provider.id] = 0;
      // Note: a failed sync does NOT clear the existing cache
    }
  }

  return { synced, byProvider, errors };
}
```

**Provider info aggregation:**

```typescript
getProviderInfos(modality?: Modality): ProviderInfo[] {
  // Returns summary info for each registered provider:
  // {
  //   id: 'pi-ai',
  //   name: 'pi-ai (LLM Gateway)',
  //   modalities: ['llm'],
  //   local: false,
  //   status: 'online',  // Always 'online' — no live ping check
  //   modelCount: 246,   // From cache, or 0 if not synced
  // }
}
```

### Usage Tracking — In-Memory Event Store

The `UsageTracker` (`src/tracking.ts`, 57 lines) records every API call and provides filtered aggregation.

**Internal state:**

```typescript
class UsageTracker {
  private events: UsageEvent[] = [];                              // Append-only array
  private onUsage?: (event: UsageEvent) => void | Promise<void>;  // Optional callback
}
```

**Recording flow — every API call creates a UsageEvent:**

```typescript
// On SUCCESS (in Noosphere.trackUsage):
const event: UsageEvent = {
  modality: result.modality,            // 'llm' | 'image' | 'video' | 'tts'
  provider: result.provider,            // 'pi-ai', 'fal', etc.
  model: result.model,                  // 'gpt-4o', 'flux-pro', etc.
  cost: result.usage.cost,              // USD amount (0 for free/local)
  latencyMs: result.latencyMs,          // Wall-clock milliseconds
  input: result.usage.input,            // Input tokens or characters
  output: result.usage.output,          // Output tokens (LLM only)
  unit: result.usage.unit,              // 'tokens', 'characters', 'free'
  timestamp: new Date().toISOString(),  // ISO 8601
  success: true,
  metadata,                             // User-provided metadata passthrough
};

// On FAILURE (in Noosphere.trackError):
const event: UsageEvent = {
  modality,
  provider,
  model: model ?? 'unknown',
  cost: 0,                              // No cost on failure
  latencyMs: Date.now() - startMs,      // Time until failure
  timestamp: new Date().toISOString(),
  success: false,
  error: err instanceof Error ? err.message : String(err),
  metadata,
};
```

**Query/aggregation — filtered summary:**

```typescript
getSummary(options?: UsageQueryOptions): UsageSummary {
  let filtered = this.events;

  // Time-range filtering:
  if (options?.since) {
    const since = new Date(options.since).getTime();
    filtered = filtered.filter((e) => new Date(e.timestamp).getTime() >= since);
  }
  if (options?.until) {
    const until = new Date(options.until).getTime();
    filtered = filtered.filter((e) => new Date(e.timestamp).getTime() <= until);
  }

  // Provider/modality filtering:
  if (options?.provider) {
    filtered = filtered.filter((e) => e.provider === options.provider);
  }
  if (options?.modality) {
    filtered = filtered.filter((e) => e.modality === options.modality);
  }

  // Aggregation:
  const byProvider: Record<string, number> = {};
  const byModality = { llm: 0, image: 0, video: 0, tts: 0 };
  let totalCost = 0;

  for (const event of filtered) {
    totalCost += event.cost;
    byProvider[event.provider] = (byProvider[event.provider] ?? 0) + event.cost;
    byModality[event.modality] += event.cost;
  }

  return { totalCost, totalRequests: filtered.length, byProvider, byModality };
}
```

**Usage example:**

```typescript
// Get all usage:
const all = ai.getUsage();
// {
//   totalCost: 0.42,
//   totalRequests: 15,
//   byProvider: { 'pi-ai': 0.40, 'fal': 0.02 },
//   byModality: { llm: 0.40, image: 0.02, video: 0, tts: 0 },
// }

// Get usage for the last hour, LLM only:
const recent = ai.getUsage({
  since: new Date(Date.now() - 3600000),
  modality: 'llm',
});

// Get usage for a specific provider:
const falUsage = ai.getUsage({ provider: 'fal' });

// Real-time callback (set in the constructor):
const ai = new Noosphere({
  onUsage: (event) => {
    console.log(`${event.provider}/${event.model}: $${event.cost} in ${event.latencyMs}ms`);
    // Or: send to analytics, update a dashboard, check a budget
  },
});
```

**Important limitations:**

- Events are stored **in memory only** — lost on process restart
- No deduplication — each retry/failover attempt creates a separate event
- `clear()` wipes all history (called by `dispose()`)
- The `onUsage` callback is `await`ed — a slow callback blocks the response return
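Since events vanish on restart, one way to keep a durable record is to serialize each event from the `onUsage` callback, for example as an NDJSON line appended to a log file. A sketch (the `toNdjsonLine` helper and the log-file idea are illustrative, not part of the library):

```typescript
// One NDJSON line per event — an onUsage callback could append this to a
// file (e.g. via fs/promises appendFile) to survive process restarts.
type UsageEventLike = {
  provider: string;
  model: string;
  cost: number;
  timestamp: string;
  success: boolean;
};

function toNdjsonLine(event: UsageEventLike): string {
  return JSON.stringify(event) + '\n';
}

const line = toNdjsonLine({
  provider: 'pi-ai',
  model: 'gpt-4o',
  cost: 0.01,
  timestamp: '2024-01-01T00:00:00.000Z',
  success: true,
});
```

Keep the callback fast: it is awaited before the response is returned, so heavy work belongs on a queue rather than inline.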

### Streaming Architecture

The `stream()` method (`src/noosphere.ts:73-124`) wraps provider streams with usage tracking:

```typescript
stream(options: ChatOptions): NoosphereStream {
  // Returns IMMEDIATELY (synchronous) — no await.
  // The actual initialization happens lazily on first iteration.

  let innerStream: NoosphereStream | undefined;
  let finalResult: NoosphereResult | undefined;
  let providerRef: NoosphereProvider | undefined;

  // Lazy init — runs on the first for-await-of iteration:
  const ensureInit = async () => {
    if (!this.initialized) await this.init();
    if (!providerRef) {
      providerRef = this.resolveProviderForModality('llm', ...);
      if (!providerRef.stream) throw new NoosphereError(...);
      innerStream = providerRef.stream(options);
    }
  };

  // Wrapped async iterator with usage tracking:
  const wrappedIterator = {
    async *[Symbol.asyncIterator]() {
      await ensureInit(); // Init on first next()
      for await (const event of innerStream!) {
        if (event.type === 'done' && event.result) {
          finalResult = event.result;
          await trackUsage(event.result); // Track when complete
        }
        yield event; // Pass events through
      }
    },
  };

  return {
    [Symbol.asyncIterator]: () => wrappedIterator[Symbol.asyncIterator](),

    // result() — consume the entire stream and return the final result:
    result: async () => {
      if (finalResult) return finalResult; // Already consumed
      for await (const event of wrappedIterator) {
        if (event.type === 'done') return event.result!;
        if (event.type === 'error') throw event.error;
      }
      throw new NoosphereError('Stream ended without result');
    },

    // abort() — signal cancellation:
    abort: () => innerStream?.abort(),
  };
}
```

**Stream event types:**

| Event Type | Fields | When |
|---|---|---|
| `text_delta` | `{ type, delta: string }` | Each text token |
| `thinking_delta` | `{ type, delta: string }` | Each reasoning token |
| `done` | `{ type, result: NoosphereResult }` | Stream complete |
| `error` | `{ type, error: Error }` | Stream failed |

**Note:** Streaming does NOT use `executeWithRetry()`. If the stream fails, there's no automatic retry or failover. The error is yielded as an `error` event and also tracked via `trackError()`.
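Because there is no built-in retry for streams, the consumer should handle the `error` event itself. A sketch of a consumer over the event shapes in the table above (the `consume` helper and the fake stream are illustrative):

```typescript
type StreamEvent =
  | { type: 'text_delta'; delta: string }
  | { type: 'thinking_delta'; delta: string }
  | { type: 'done'; result: { text: string } }
  | { type: 'error'; error: Error };

async function consume(stream: AsyncIterable<StreamEvent>): Promise<string> {
  let text = '';
  for await (const event of stream) {
    if (event.type === 'text_delta') text += event.delta;
    if (event.type === 'error') throw event.error; // your retry policy goes here
    if (event.type === 'done') return event.result.text;
  }
  throw new Error('stream ended without a done event');
}

// Fake stream standing in for ai.stream(...):
async function* fake(): AsyncGenerator<StreamEvent> {
  yield { type: 'text_delta', delta: 'Hel' };
  yield { type: 'text_delta', delta: 'lo' };
  yield { type: 'done', result: { text: 'Hello' } };
}
const final = consume(fake());
```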

### Lifecycle Management — dispose()

```typescript
async dispose(): Promise<void> {
  // 1. Call dispose() on every registered provider (if implemented):
  for (const provider of this.registry.getAllProviders()) {
    if (provider.dispose) {
      await provider.dispose();
      // Currently no built-in provider implements dispose();
      // this hook exists for custom providers that need cleanup.
    }
  }

  // 2. Clear the model cache:
  this.registry.clearCache();

  // 3. Clear usage history:
  this.tracker.clear();

  // Note: does NOT set initialized = false.
  // After dispose(), the instance is NOT reusable for new requests.
}
```
