noosphere 0.1.3 → 0.2.1

package/README.md CHANGED
@@ -7,7 +7,7 @@ One import. Every model. Every modality.
  ## Features
 
  - **4 modalities** — LLM chat, image generation, video generation, and text-to-speech
- - **246+ LLM models** — via Pi-AI gateway (OpenAI, Anthropic, Google, Groq, Mistral, xAI, Cerebras, OpenRouter)
+ - **Always up-to-date models** — Dynamic auto-fetch from ALL provider APIs at runtime (OpenAI, Anthropic, Google, Groq, Mistral, xAI, Cerebras, OpenRouter)
  - **867+ media endpoints** — via FAL (Flux, SDXL, Kling, Sora 2, VEO 3, Kokoro, ElevenLabs, and hundreds more)
  - **30+ HuggingFace tasks** — LLM, image, TTS, translation, summarization, classification, and more
  - **Local-first architecture** — Auto-detects ComfyUI, Ollama, Piper, and Kokoro on your machine
@@ -59,6 +59,209 @@ const audio = await ai.speak({
  // audio.buffer contains the audio data
  ```
 
+ ## Dynamic Model Auto-Fetch — Always Up-to-Date (ALL Providers, ALL Modalities)
+
+ Noosphere **automatically discovers the latest models from every provider's API at runtime** — across **all 4 modalities** (LLM, image, video, TTS). When Google releases a new Gemini model, when OpenAI drops GPT-5, when FAL adds a new video model, when a new image model trends on HuggingFace — **you get them immediately**, without updating Noosphere or any dependency.
+
+ ### The Problem It Solves
+
+ Traditional AI libraries rely on **static model catalogs** hardcoded at build time. The `@mariozechner/pi-ai` dependency ships with ~246 LLM models in a pre-generated `models.generated.js` file. HuggingFace providers typically hardcode 3-5 default models. When a provider releases a new model, you have to wait for the library maintainer to update and publish, and then run `npm update` yourself. This lag can be days or weeks.
+
+ **Noosphere solves this for every provider and every modality simultaneously.**
+
+ ### How It Works — Complete Auto-Fetch Architecture
+
+ Noosphere has **3 independent auto-fetch systems** that work in parallel, one for each provider layer:
+
+ ```
+ ┌────────────────────────────────────────────────────────────┐
+ │                    NOOSPHERE AUTO-FETCH                    │
+ ├────────────────────────────────────────────────────────────┤
+ │                                                            │
+ │  ┌─── Pi-AI Provider (LLM) ─────────────────────────────┐  │
+ │  │ 8 parallel API calls on first chat()/stream():       │  │
+ │  │   OpenAI, Anthropic, Google, Groq, Mistral,          │  │
+ │  │   xAI, OpenRouter, Cerebras                          │  │
+ │  │ → Merges with static pi-ai catalog (246 models)      │  │
+ │  │ → Constructs synthetic Model objects for new ones    │  │
+ │  └──────────────────────────────────────────────────────┘  │
+ │                                                            │
+ │  ┌─── FAL Provider (Image/Video/TTS) ───────────────────┐  │
+ │  │ 1 API call on listModels():                          │  │
+ │  │   GET https://api.fal.ai/v1/models/pricing           │  │
+ │  │ → Returns ALL 867+ endpoints with live pricing       │  │
+ │  │ → Auto-classifies modality from model ID + unit      │  │
+ │  └──────────────────────────────────────────────────────┘  │
+ │                                                            │
+ │  ┌─── HuggingFace Provider (LLM/Image/TTS) ─────────────┐  │
+ │  │ 3 parallel API calls on listModels():                │  │
+ │  │   GET huggingface.co/api/models?pipeline_tag=...     │  │
+ │  │ → text-generation (top 50 trending, inference-ready) │  │
+ │  │ → text-to-image (top 50 trending, inference-ready)   │  │
+ │  │ → text-to-speech (top 30 trending, inference-ready)  │  │
+ │  │ → Includes inference provider mapping + pricing      │  │
+ │  └──────────────────────────────────────────────────────┘  │
+ │                                                            │
+ └────────────────────────────────────────────────────────────┘
+ ```
+
+ ### Layer 1: LLM Auto-Fetch (Pi-AI Provider) — 8 Provider APIs
+
+ On the **first `chat()` or `stream()` call**, Pi-AI queries every LLM provider's model listing API in parallel:
+
+ | Provider | API Endpoint | Auth | Model Filter | API Protocol |
+ |---|---|---|---|---|
+ | **OpenAI** | `GET /v1/models` | Bearer token | `gpt-*`, `o1*`, `o3*`, `o4*`, `chatgpt-*`, `codex-*` | `openai-responses` |
+ | **Anthropic** | `GET /v1/models?limit=100` | `x-api-key` + `anthropic-version: 2023-06-01` | `claude-*` | `anthropic-messages` |
+ | **Google** | `GET /v1beta/models?key=KEY` | API key in URL | `gemini-*`, `gemma-*` + must support `generateContent` | `google-generative-ai` |
+ | **Groq** | `GET /openai/v1/models` | Bearer token | All (Groq only serves chat models) | `openai-completions` |
+ | **Mistral** | `GET /v1/models` | Bearer token | Exclude `*embed*` | `openai-completions` |
+ | **xAI** | `GET /v1/models` | Bearer token | `grok*` | `openai-completions` |
+ | **OpenRouter** | `GET /api/v1/models` | Bearer token | All (all OpenRouter models are usable) | `openai-completions` |
+ | **Cerebras** | `GET /v1/models` | Bearer token | All (Cerebras only serves chat models) | `openai-completions` |
+
+ **How new LLM models become usable:** When a model isn't in the static catalog, Noosphere constructs a **synthetic `Model` object** with the correct API protocol, base URL, and inherited cost data:
+
+ ```typescript
+ // New model "gpt-4.5-turbo" discovered from OpenAI's /v1/models:
+ {
+   id: 'gpt-4.5-turbo',
+   name: 'gpt-4.5-turbo',
+   api: 'openai-responses',                // Correct protocol for OpenAI
+   provider: 'openai',
+   baseUrl: 'https://api.openai.com/v1',
+   reasoning: false,                       // Inferred from model ID prefix
+   input: ['text', 'image'],
+   cost: { input: 2.5, output: 10, ... },  // Inherited from template model
+   contextWindow: 128000,                  // From template or API response
+   maxTokens: 16384,
+ }
+ // This object is passed directly to pi-ai's complete()/stream() — works immediately
+ ```
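
A rough sketch of that construction step (the helper name `buildSyntheticModel` and the exact template-inheritance rules are illustrative assumptions, not Noosphere's actual internals):

```typescript
// Illustrative sketch only: Noosphere's real merge logic lives inside its
// Pi-AI provider; names and inheritance rules here are assumptions.
interface Model {
  id: string;
  name: string;
  api: string;
  provider: string;
  baseUrl: string;
  reasoning: boolean;
  input: string[];
  cost: { input: number; output: number };
  contextWindow: number;
  maxTokens: number;
}

// Inherit protocol, base URL, cost, and limits from a known template model,
// overriding only the identity fields for the newly discovered ID.
function buildSyntheticModel(discoveredId: string, template: Model): Model {
  return {
    ...template,
    id: discoveredId,
    name: discoveredId,
    // Crude heuristic mirroring the "inferred from model ID prefix" note:
    reasoning: /^o\d/.test(discoveredId),
  };
}

const template: Model = {
  id: 'gpt-4o', name: 'gpt-4o', api: 'openai-responses', provider: 'openai',
  baseUrl: 'https://api.openai.com/v1', reasoning: false, input: ['text', 'image'],
  cost: { input: 2.5, output: 10 }, contextWindow: 128000, maxTokens: 16384,
};

const synthetic = buildSyntheticModel('gpt-4.5-turbo', template);
```

Handing such an object straight to the completion call is what makes a newly discovered model usable without a library update.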
+
+ ### Layer 2: Image/Video/TTS Auto-Fetch (FAL Provider) — Pricing API
+
+ FAL already provides a **fully dynamic catalog**. On `listModels()`, it fetches from `https://api.fal.ai/v1/models/pricing`:
+
+ ```typescript
+ // FAL returns an array with ALL available endpoints + live pricing:
+ [
+   { modelId: "fal-ai/flux-pro/v1.1-ultra", price: 0.06, unit: "per_image" },
+   { modelId: "fal-ai/kling-video/v2/master/text-to-video", price: 0.10, unit: "per_second" },
+   { modelId: "fal-ai/kokoro/american-english", price: 0.002, unit: "per_1k_chars" },
+   // ... 867+ endpoints total
+ ]
+
+ // Modality is auto-inferred from model ID + pricing unit:
+ //   - unit contains 'char' OR id contains 'tts'/'kokoro'/'elevenlabs' → TTS
+ //   - unit contains 'second' OR id contains 'video'/'kling'/'sora'/'veo' → Video
+ //   - Everything else → Image
+ ```
+
+ **Result:** Every FAL model is always current — new endpoints appear the moment FAL publishes them. Pricing is always accurate because it comes directly from their API.
+
+ ### Layer 3: LLM/Image/TTS Auto-Fetch (HuggingFace Provider) — Hub API
+
+ Instead of 3 hardcoded defaults, HuggingFace now fetches **trending inference-ready models** from the Hub API across all 3 modalities:
+
+ ```
+ GET https://huggingface.co/api/models
+   ?pipeline_tag=text-generation          ← LLM models
+   &inference_provider=all                ← Only models available via inference API
+   &sort=trendingScore                    ← Most popular first
+   &limit=50                              ← Top 50
+   &expand[]=inferenceProviderMapping     ← Include provider routing + pricing
+ ```
+
+ | Pipeline Tag | Modality | Limit | What It Fetches |
+ |---|---|---|---|
+ | `text-generation` | LLM | 50 | Top 50 trending chat/completion models with active inference endpoints |
+ | `text-to-image` | Image | 50 | Top 50 trending image generation models (SDXL, Flux, etc.) |
+ | `text-to-speech` | TTS | 30 | Top 30 trending TTS models with active inference endpoints |
+
+ **What the Hub API returns per model:**
+ ```json
+ {
+   "id": "Qwen/Qwen2.5-72B-Instruct",
+   "pipeline_tag": "text-generation",
+   "likes": 1893,
+   "downloads": 4521987,
+   "inferenceProviderMapping": [
+     {
+       "provider": "together",
+       "providerId": "Qwen/Qwen2.5-72B-Instruct-Turbo",
+       "status": "live",
+       "providerDetails": {
+         "context_length": 32768,
+         "pricing": { "input": 1.2, "output": 1.2 }
+       }
+     },
+     {
+       "provider": "fireworks-ai",
+       "providerId": "accounts/fireworks/models/qwen2p5-72b-instruct",
+       "status": "live"
+     }
+   ]
+ }
+ ```
+
+ **Noosphere extracts from this:**
+ - Model ID → `id` field
+ - Pricing → first provider with `providerDetails.pricing`
+ - Context window → first provider with `providerDetails.context_length`
+ - Inference providers → list of available providers (Together, Fireworks, Groq, etc.)
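
The extraction rules above can be sketched as a small mapping function. Field names follow the Hub API response shown; the helper itself is illustrative rather than Noosphere's actual code:

```typescript
// Illustrative sketch: maps one Hub API entry to the fields Noosphere extracts.
interface ProviderMapping {
  provider: string;
  providerId?: string;
  status: string;
  providerDetails?: {
    context_length?: number;
    pricing?: { input: number; output: number };
  };
}

interface HubModel {
  id: string;
  inferenceProviderMapping?: ProviderMapping[];
}

function extractModelInfo(model: HubModel) {
  const mappings = model.inferenceProviderMapping ?? [];
  return {
    id: model.id,
    // First provider that reports pricing:
    pricing: mappings.find((m) => m.providerDetails?.pricing)?.providerDetails?.pricing,
    // First provider that reports a context window:
    contextWindow: mappings.find((m) => m.providerDetails?.context_length)
      ?.providerDetails?.context_length,
    // All available inference providers:
    providers: mappings.map((m) => m.provider),
  };
}

const info = extractModelInfo({
  id: 'Qwen/Qwen2.5-72B-Instruct',
  inferenceProviderMapping: [
    { provider: 'together', status: 'live',
      providerDetails: { context_length: 32768, pricing: { input: 1.2, output: 1.2 } } },
    { provider: 'fireworks-ai', status: 'live' },
  ],
});
```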
+
+ **Three requests fire in parallel** (`Promise.allSettled`) with a **10-second timeout** each. If any fails, the 3 hardcoded defaults are always available as fallback.
+
+ ### Resilience Guarantees (All Layers)
+
+ | Guarantee | Pi-AI (LLM) | FAL (Image/Video/TTS) | HuggingFace (LLM/Image/TTS) |
+ |---|---|---|---|
+ | **Timeout** | 8s per provider | No custom timeout | 10s per pipeline_tag |
+ | **Parallelism** | 8 concurrent requests | 1 request (returns all) | 3 concurrent requests |
+ | **Failure handling** | `Promise.allSettled` | Returns `[]` on error | `Promise.allSettled` |
+ | **Fallback** | Static pi-ai catalog (246 models) | Empty list (provider still usable by model ID) | 3 hardcoded defaults |
+ | **Caching** | One-time fetch, cached in memory | Per `listModels()` call | One-time fetch, cached in memory |
+ | **Auth required** | Yes (per-provider API keys) | Yes (FAL key) | Optional (works without token) |
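
The shared failure-handling pattern in the table reduces to one step: keep whatever `Promise.allSettled` fulfilled, and fall back to the static list when nothing succeeded. A minimal sketch (names are illustrative, not Noosphere's internals):

```typescript
// Illustrative sketch of the allSettled-with-fallback pattern described above.
type Settled<T> =
  | { status: 'fulfilled'; value: T[] }
  | { status: 'rejected'; reason: unknown };

function mergeCatalogs<T>(settled: Settled<T>[], staticCatalog: T[]): T[] {
  const fetched = settled
    .filter((r): r is { status: 'fulfilled'; value: T[] } => r.status === 'fulfilled')
    .flatMap((r) => r.value);
  // Every request failed: the static/hardcoded catalog is still usable.
  return fetched.length > 0 ? [...staticCatalog, ...fetched] : staticCatalog;
}

// In real use the input would come from something like:
//   const settled = await Promise.allSettled(providers.map((p) => fetchModels(p)));
const merged = mergeCatalogs(
  [
    { status: 'fulfilled', value: ['gpt-4.5-turbo'] },
    { status: 'rejected', reason: new Error('provider timed out') },
  ],
  ['gpt-4o'],
);
```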
+
+ ### Total Model Coverage
+
+ | Source | Modalities | Model Count | Update Frequency |
+ |---|---|---|---|
+ | Pi-AI static catalog | LLM | ~246 | On npm update |
+ | Pi-AI dynamic fetch | LLM | **All models across 8 providers** | **Every session** |
+ | FAL pricing API | Image, Video, TTS | 867+ | **Every `listModels()` call** |
+ | HuggingFace Hub API | LLM, Image, TTS | Top 130 trending | **Every session** |
+ | ComfyUI `/object_info` | Image | Local checkpoints | **Every `listModels()` call** |
+ | Local TTS `/voices` | TTS | Local voices | **Every `listModels()` call** |
+
+ ### Force Refresh
+
+ ```typescript
+ const ai = new Noosphere();
+
+ // Models are auto-fetched on first call — no action needed:
+ await ai.chat({ model: 'gemini-2.5-ultra', messages: [...] }); // works immediately
+
+ // Trigger a full sync across ALL providers:
+ const result = await ai.syncModels();
+ // result = { synced: 1200+, byProvider: { 'pi-ai': 300, 'fal': 867, 'huggingface': 130, ... }, errors: [] }
+
+ // Get all models for a specific modality:
+ const imageModels = await ai.getModels('image');
+ // Returns: FAL image models + HuggingFace image models + ComfyUI models
+ ```
+
+ ### Why Hybrid (Static + Dynamic)?
+
+ | Approach | Pros | Cons |
+ |---|---|---|
+ | **Static catalog only** | Accurate costs, fast startup | Stale within days, miss new models |
+ | **Dynamic only** | Always current | No cost data, no context window info, slow startup |
+ | **Hybrid (Noosphere)** | Best of both — accurate data for known models + immediate access to new ones | New models have estimated costs until catalog update |
+
+ ---
+
  ## Configuration
 
  API keys are resolved from the constructor config or environment variables (config takes priority):
@@ -1112,28 +1315,125 @@ The largest media generation provider with dynamic pricing fetched at runtime fr
  **Other TTS:**
  `fal-ai/f5-tts` (voice cloning), `fal-ai/dia-tts`, `fal-ai/minimax/speech-2.6-turbo`, `fal-ai/minimax/speech-2.6-hd`, `fal-ai/chatterbox/text-to-speech`, `fal-ai/index-tts-2/text-to-speech`
 
+ #### FAL Provider Internals — How It Actually Works
+
+ **Image generation** uses `fal.subscribe()` (queue-based, polls until complete):
+ ```typescript
+ // Exact request payload sent to FAL:
+ const response = await fal.subscribe(model, {
+   input: {
+     prompt: "A sunset over mountains",
+     negative_prompt: "blurry",                // from options.negativePrompt
+     image_size: { width: 1024, height: 768 }, // from options.width/height
+     seed: 42,                                 // from options.seed
+     num_inference_steps: 30,                  // from options.steps
+     guidance_scale: 7.5,                      // from options.guidanceScale
+   },
+ });
+
+ // Response parsing — URL from images array:
+ const image = response.data?.images?.[0];
+ // result.url = image?.url
+ // result.media = { width: image?.width, height: image?.height, format: 'png' }
+ ```
+
+ **Video generation** uses `fal.subscribe()`:
+ ```typescript
+ const response = await fal.subscribe(model, {
+   input: {
+     prompt: "Ocean waves",
+     image_url: "https://...", // from options.imageUrl (image-to-video)
+     duration: 5,              // from options.duration
+     fps: 24,                  // from options.fps
+   },
+ });
+
+ // Response parsing — URL from video object with fallback:
+ const video = response.data?.video;
+ // result.url = video?.url ?? response.data?.video_url
+ // Note: width/height/duration/fps come from INPUT options, not response
+ ```
+
+ **TTS** uses `fal.run()` (direct call, NOT subscribe — no queue):
+ ```typescript
+ const response = await fal.run(model, {
+   input: {
+     text: "Hello world",
+     voice: "af_heart", // from options.voice
+     speed: 1.0,        // from options.speed
+   },
+ });
+
+ // Response parsing — URL from audio object with fallback:
+ // result.url = response.data?.audio_url ?? response.data?.audio?.url
+ ```
+
+ **Pricing cache and cost tracking:**
+ ```typescript
+ // Pricing fetched dynamically from FAL API during listModels():
+ const res = await fetch('https://api.fal.ai/v1/models/pricing', {
+   headers: { Authorization: `Key ${this.apiKey}` },
+ });
+ // Returns: Array<{ modelId: string, price: number, unit: string }>
+
+ // Cached in an in-memory Map, cleared on each listModels() call:
+ private pricingCache = new Map<string, { price: number; unit: string }>();
+
+ // Cost per request pulled from cache (defaults to 0 if not cached):
+ usage: { cost: pricingCache.get(model)?.price ?? 0 }
+ ```
+
+ **Modality inference from model ID — exact string matching:**
+ ```typescript
+ inferModality(modelId: string, unit: string): Modality {
+   // TTS: unit contains 'char' OR modelId contains 'tts'/'kokoro'/'elevenlabs'
+   if (unit.includes('char') || /tts|kokoro|elevenlabs/.test(modelId)) return 'tts';
+   // Video: unit contains 'second' OR modelId contains 'video'/'kling'/'sora'/'veo'
+   if (unit.includes('second') || /video|kling|sora|veo/.test(modelId)) return 'video';
+   // Image: everything else (default)
+   return 'image';
+ }
+ ```
+
+ **Error handling:** Only `listModels()` catches errors (returns `[]`). Image/video/speak methods let FAL errors propagate directly — no wrapping.
+
  #### FAL Client Capabilities
 
  The `@fal-ai/client` provides additional features beyond what Noosphere surfaces:
 
- - **Queue API** — Submit jobs, poll status, get results, cancel. Supports webhooks and priority levels
- - **Streaming API** — Real-time streaming responses via async iterators
- - **Realtime API** — WebSocket connections for interactive use (e.g., real-time image generation)
- - **Storage API** — File upload with configurable TTL (1h, 1d, 7d, 30d, 1y, never)
- - **Retry logic** — Configurable retries with exponential backoff and jitter
- - **Request middleware** — Custom request interceptors and proxy support
+ - **Queue API** — `fal.queue.submit()`, `status()`, `result()`, `cancel()`. Supports webhooks, priority levels (`"low"` | `"normal"`), and polling/streaming status modes
+ - **Streaming API** — `fal.streaming.stream()` with async iterators, chunk-level events, configurable timeout between chunks (15s default)
+ - **Realtime API** — `fal.realtime.connect()` for WebSocket connections with msgpack encoding, throttle interval (128ms default), frame buffering (1-60 frames)
+ - **Storage API** — `fal.storage.upload()` with configurable object lifecycle: `"never"` | `"immediate"` | `"1h"` | `"1d"` | `"7d"` | `"30d"` | `"1y"`
+ - **Retry logic** — 3 retries default, exponential backoff (500ms base, 15s max), jitter enabled, retries on 408/429/500/502/503/504
+ - **Request middleware** — `withMiddleware()` for request interceptors, `withProxy()` for proxy configuration
 
  ---
 
- ### Hugging Face — Open Source AI (30+ tasks)
+ ### Hugging Face — Open Source AI (30+ tasks, Dynamic Discovery)
 
  **Provider ID:** `huggingface`
  **Modalities:** LLM, Image, TTS
  **Library:** `@huggingface/inference`
+ **Auto-Fetch:** Yes — discovers trending inference-ready models from the Hub API
+
+ Access to the entire Hugging Face Hub ecosystem. Noosphere **automatically discovers the top trending models** across all 3 modalities via the Hub API, filtered to only include models with active inference provider endpoints.
+
+ #### Auto-Discovered Models
+
+ On the first `listModels()` call, HuggingFace fetches from:
+ ```
+ GET https://huggingface.co/api/models?inference_provider=all&pipeline_tag={tag}&sort=trendingScore&limit={n}&expand[]=inferenceProviderMapping
+ ```
+
+ | Pipeline Tag | Modality | Limit | Example Models |
+ |---|---|---|---|
+ | `text-generation` | LLM | 50 | Qwen2.5-72B-Instruct, Llama-3.3-70B, DeepSeek-V3, Mistral-Large |
+ | `text-to-image` | Image | 50 | FLUX.1-dev, Stable Diffusion 3.5, SDXL-Lightning, Playground v2.5 |
+ | `text-to-speech` | TTS | 30 | Kokoro-82M, Bark, MMS-TTS |
 
- Access to the entire Hugging Face Hub ecosystem. Any model hosted on HuggingFace can be used by passing its ID directly.
+ Each discovered model includes **inference provider routing** (Together, Fireworks, Groq, Replicate, etc.) and **pricing data** when available from the provider.
 
- #### Default Models
+ #### Fallback Default Models
+
+ These 3 models are always available, even if the Hub API is unreachable:
 
  | Modality | Default Model | Description |
  |---|---|---|
@@ -1141,7 +1441,7 @@ Access to the entire Hugging Face Hub ecosystem. Any model hosted on HuggingFace
  | Image | `stabilityai/stable-diffusion-xl-base-1.0` | SDXL Base |
  | TTS | `facebook/mms-tts-eng` | MMS TTS English |
 
- Any HuggingFace model ID works — just pass it as the `model` parameter:
+ Any HuggingFace model ID works — just pass it as the `model` parameter (even if it's not in the auto-discovered list):
 
  ```typescript
  await ai.chat({
@@ -1212,6 +1512,265 @@ The `@huggingface/inference` library (v3.15.0) provides 30+ AI tasks, including
  - **Multimodal Input:** Images via `image_url` content chunks in chat messages
  - **17 Inference Providers:** Route through Groq, Together, Fireworks, Replicate, Cerebras, Cohere, and more
 
+ #### HuggingFace Provider Internals — How It Actually Works
+
+ The `HuggingFaceProvider` class (`src/providers/huggingface.ts`, 141 lines) wraps the `@huggingface/inference` library's `HfInference` client. Here's the exact internal flow for each modality:
+
+ **Initialization:**
+ ```typescript
+ // Constructor receives a single API token
+ constructor(token: string) {
+   this.client = new HfInference(token);
+   // HfInference stores the token internally and attaches it
+   // as Authorization: Bearer <token> to every request
+ }
+
+ // ping() always returns true — HuggingFace is considered
+ // "available" if the token was provided. No actual HTTP check.
+ async ping(): Promise<boolean> { return true; }
+ ```
+
+ **Chat Completions — exact request flow:**
+ ```typescript
+ // Default model: meta-llama/Llama-3.1-8B-Instruct
+ const model = options.model ?? 'meta-llama/Llama-3.1-8B-Instruct';
+
+ // Maps directly to HfInference.chatCompletion():
+ const response = await this.client.chatCompletion({
+   model,                            // HuggingFace model ID or inference endpoint
+   messages: options.messages,       // Array<{ role, content }> — passed directly
+   temperature: options.temperature, // 0.0 - 2.0 (optional)
+   max_tokens: options.maxTokens,    // Max output tokens (optional)
+ });
+
+ // Response parsing:
+ const choice = response.choices?.[0]; // OpenAI-compatible format
+ const usage = response.usage;         // { prompt_tokens, completion_tokens }
+ // result.content = choice?.message?.content ?? ''
+ // result.usage.input = usage?.prompt_tokens
+ // result.usage.output = usage?.completion_tokens
+ // result.usage.cost = 0 (always free for HF Inference API)
+ ```
+
+ **Image Generation — Blob-to-Buffer conversion pipeline:**
+ ```typescript
+ // Default model: stabilityai/stable-diffusion-xl-base-1.0
+ const model = options.model ?? 'stabilityai/stable-diffusion-xl-base-1.0';
+
+ // Uses textToImage() which returns a Blob object:
+ const blob = await this.client.textToImage({
+   model,
+   inputs: options.prompt,                    // The text prompt
+   parameters: {
+     negative_prompt: options.negativePrompt, // What NOT to generate
+     width: options.width,                    // Pixel width
+     height: options.height,                  // Pixel height
+     guidance_scale: options.guidanceScale,   // CFG scale
+     num_inference_steps: options.steps,      // Denoising steps
+   },
+ }, { outputType: 'blob' }); // <-- Forces Blob output (not ReadableStream)
+
+ // Blob → ArrayBuffer → Node.js Buffer conversion:
+ const buffer = Buffer.from(await blob.arrayBuffer());
+ // This is the critical step — HfInference returns a Web API Blob,
+ // which must be converted to a Node.js Buffer for downstream use.
+
+ // Result always reports PNG format regardless of actual model output:
+ // result.media = { width: options.width ?? 1024, height: options.height ?? 1024, format: 'png' }
+ ```
+
+ **Text-to-Speech — Blob-to-Buffer conversion:**
+ ```typescript
+ // Default model: facebook/mms-tts-eng
+ const model = options.model ?? 'facebook/mms-tts-eng';
+
+ // Uses textToSpeech() — simpler API, just model + text:
+ const blob = await this.client.textToSpeech({
+   model,
+   inputs: options.text, // Text to synthesize
+   // Note: No voice, speed, or format parameters — these are model-dependent
+ });
+
+ // Same Blob → Buffer conversion:
+ const buffer = Buffer.from(await blob.arrayBuffer());
+
+ // Usage tracks character count, not tokens:
+ //   result.usage = { cost: 0, input: options.text.length, unit: 'characters' }
+ //   result.media = { format: 'wav' }
+ ```
+
+ **Model listing — dynamic Hub API discovery:**
+ ```typescript
+ // HuggingFace now auto-fetches trending models from the Hub API:
+ async listModels(modality?: Modality): Promise<ModelInfo[]> {
+   if (!this.dynamicModels) await this.fetchHubModels();
+   // Returns: 3 hardcoded defaults + top 50 LLM + top 50 image + top 30 TTS
+   // All filtered by inference_provider=all (only inference-ready models)
+ }
+
+ // Hub API request per modality:
+ //   GET https://huggingface.co/api/models
+ //     ?pipeline_tag=text-generation
+ //     &inference_provider=all            ← Only models with active inference endpoints
+ //     &sort=trendingScore                ← Most popular first
+ //     &limit=50
+ //     &expand[]=inferenceProviderMapping ← Include provider routing + pricing
+
+ // Response includes per model:
+ //   - id: "Qwen/Qwen2.5-72B-Instruct"
+ //   - inferenceProviderMapping: [{ provider: "together", status: "live",
+ //       providerDetails: { context_length: 32768, pricing: { input: 1.2 } } }]
+
+ // Pricing and context_length extracted from inferenceProviderMapping
+ // 3 hardcoded defaults always included as fallback
+ // Results cached in memory after first fetch
+ ```
+
+ #### The 17 HuggingFace Inference Providers
+
+ The `@huggingface/inference` library supports routing requests through 17 different inference providers. This means a single HuggingFace model ID can be served by multiple backends with different performance/cost characteristics:
+
+ | # | Provider | Type | Strengths |
+ |---|---|---|---|
+ | 1 | `hf-inference` | HuggingFace's own | Default, free tier, rate-limited |
+ | 2 | `hf-dedicated` | Dedicated endpoints | Private, reserved GPU, guaranteed availability |
+ | 3 | `together-ai` | Together.ai | Fast inference, competitive pricing |
+ | 4 | `fireworks-ai` | Fireworks.ai | Optimized serving, function calling |
+ | 5 | `replicate` | Replicate | Pay-per-use, large model catalog |
+ | 6 | `cerebras` | Cerebras | Extreme speed (WSE-3 hardware) |
+ | 7 | `groq` | Groq | Ultra-low latency (LPU hardware) |
+ | 8 | `cohere` | Cohere | Enterprise, embeddings, RAG |
+ | 9 | `sambanova` | SambaNova | Enterprise RDU hardware |
+ | 10 | `nebius` | Nebius | European cloud infrastructure |
+ | 11 | `hyperbolic` | Hyperbolic Labs | Open-access GPU marketplace |
+ | 12 | `novita` | Novita AI | Cost-efficient inference |
+ | 13 | `ovh-cloud` | OVHcloud | European sovereign cloud |
+ | 14 | `aws` | Amazon SageMaker | AWS-managed endpoints |
+ | 15 | `azure` | Azure ML | Azure-managed endpoints |
+ | 16 | `google-vertex` | Google Vertex | GCP-managed endpoints |
+ | 17 | `deepinfra` | DeepInfra | High-throughput inference |
+
+ **Provider routing** is handled by the `@huggingface/inference` library's internal `provider` parameter:
+ ```typescript
+ // Route through a specific inference provider:
+ const response = await client.chatCompletion({
+   model: 'meta-llama/Llama-3.1-70B-Instruct',
+   provider: 'together-ai', // <-- Route through Together.ai
+   messages: [...],
+ });
+
+ // NOTE: Noosphere does NOT currently expose the `provider` parameter
+ // in its ChatOptions type. To use a specific HF inference provider,
+ // you would need a custom provider or direct @huggingface/inference usage.
+ ```
+
+ #### Using HuggingFace Locally — Dedicated Endpoints
+
+ HuggingFace Inference Endpoints let you deploy any model on dedicated GPUs. The `@huggingface/inference` library supports this via the `endpointUrl` parameter:
+
+ ```typescript
+ // Direct HfInference usage with a local/dedicated endpoint:
+ import { HfInference } from '@huggingface/inference';
+
+ const client = new HfInference('your-token');
+
+ // Point to your dedicated endpoint:
+ const response = await client.chatCompletion({
+   model: 'tgi',
+   endpointUrl: 'https://your-endpoint.endpoints.huggingface.cloud',
+   messages: [{ role: 'user', content: 'Hello' }],
+ });
+
+ // For a truly local setup with TGI (Text Generation Inference):
+ const localClient = new HfInference(); // No token needed for local
+ const localResponse = await localClient.chatCompletion({
+   model: 'tgi',
+   endpointUrl: 'http://localhost:8080', // Local TGI server
+   messages: [...],
+ });
+ ```
+
+ **Deploying HuggingFace models locally with TGI:**
+
+ ```bash
+ # 1. Run Text Generation Inference (TGI) via Docker:
+ docker run --gpus all -p 8080:80 \
+   -v /data:/data \
+   ghcr.io/huggingface/text-generation-inference:latest \
+   --model-id meta-llama/Llama-3.1-8B-Instruct
+
+ # 2. For image models, use Inference Endpoints:
+ #    Deploy via https://ui.endpoints.huggingface.co/
+ #    Select your model, GPU type, and region
+ #    Get an endpoint URL like: https://xyz123.endpoints.huggingface.cloud
+
+ # 3. For TTS models locally, use the Transformers library:
+ #    pip install transformers torch
+ #    Then run a local server that serves the model
+ ```
+
+ **Other local deployment options:**
+
+ | Method | URL Pattern | Use Case |
+ |---|---|---|
+ | TGI Docker | `http://localhost:8080` | Production local LLM serving |
+ | HF Inference Endpoints | `https://xxxx.endpoints.huggingface.cloud` | Managed dedicated GPU |
+ | vLLM with HF models | `http://localhost:8000` | High-throughput local serving |
+ | Transformers + FastAPI | Custom URL | Custom model serving |
+
+ #### Unexposed `@huggingface/inference` Parameters
+
+ The `chatCompletion()` method accepts many parameters that Noosphere's `ChatOptions` doesn't currently expose. These are available if you use the library directly:
+
+ | Parameter | Type | Description |
+ |---|---|---|
+ | `temperature` | `number` | Sampling temperature (0-2.0) — **exposed** via `ChatOptions.temperature` |
+ | `max_tokens` | `number` | Max output tokens — **exposed** via `ChatOptions.maxTokens` |
+ | `top_p` | `number` | Nucleus sampling threshold (0-1.0) — **not exposed** |
+ | `top_k` | `number` | Top-K sampling — **not exposed** |
+ | `frequency_penalty` | `number` | Penalize repeated tokens (-2.0 to 2.0) — **not exposed** |
+ | `presence_penalty` | `number` | Penalize tokens already present (-2.0 to 2.0) — **not exposed** |
+ | `repetition_penalty` | `number` | Alternative repetition penalty (>1.0 penalizes) — **not exposed** |
+ | `stop` | `string[]` | Stop sequences — **not exposed** |
+ | `seed` | `number` | Deterministic sampling seed — **not exposed** |
+ | `tools` | `Tool[]` | Function/tool definitions — **not exposed** |
+ | `tool_choice` | `string \| object` | Tool selection strategy — **not exposed** |
+ | `tool_prompt` | `string` | System prompt for tool use — **not exposed** |
+ | `response_format` | `object` | JSON schema constraints — **not exposed** |
+ | `reasoning_effort` | `string` | Thinking depth level — **not exposed** |
+ | `stream` | `boolean` | Enable streaming — **not exposed** (use `chatCompletionStream()`) |
+ | `provider` | `string` | Inference provider routing — **not exposed** |
+ | `endpointUrl` | `string` | Custom endpoint URL — **not exposed** |
+ | `n` | `number` | Number of completions — **not exposed** |
+ | `logprobs` | `boolean` | Return log probabilities — **not exposed** |
+ | `grammar` | `object` | BNF grammar constraints — **not exposed** |
+
+ **Image generation unexposed parameters:**
+ | Parameter | Type | Description |
+ |---|---|---|
+ | `negative_prompt` | `string` | **Exposed** via `ImageOptions.negativePrompt` |
+ | `width` / `height` | `number` | **Exposed** via `ImageOptions.width/height` |
+ | `guidance_scale` | `number` | **Exposed** via `ImageOptions.guidanceScale` |
+ | `num_inference_steps` | `number` | **Exposed** via `ImageOptions.steps` |
+ | `scheduler` | `string` | Diffusion scheduler type — **not exposed** |
+ | `target_size` | `object` | Target resize dimensions — **not exposed** |
+ | `clip_skip` | `number` | CLIP skip layers — **not exposed** |
+
+ #### HuggingFace Error Behavior
1760
+
1761
+ Unlike other providers, HuggingFaceProvider does **not** catch errors from the `@huggingface/inference` library. All errors propagate directly up to Noosphere's `executeWithRetry()`:
1762
+
1763
+ ```
1764
+ HfInference throws → HuggingFaceProvider propagates →
1765
+ executeWithRetry catches → Noosphere wraps as NoosphereError
1766
+ ```
1767
+
1768
+ Common error scenarios:
1769
+ - **401 Unauthorized** — Invalid or expired token → becomes `AUTH_FAILED`
1770
+ - **404 Model Not Found** — Model ID doesn't exist on HF Hub → becomes `MODEL_NOT_FOUND`
1771
+ - **429 Rate Limited** — Free tier limit exceeded → becomes `RATE_LIMITED` (retryable)
1772
+ - **503 Model Loading** — Model is cold-starting on HF Inference → becomes `PROVIDER_UNAVAILABLE` (retryable)
1773
+
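The status-to-error mapping above can be sketched as a small classifier. The `ErrorCode` union uses the codes named in the list; the function itself is a hypothetical illustration, not Noosphere source:

```typescript
type ErrorCode =
  | 'AUTH_FAILED'
  | 'MODEL_NOT_FOUND'
  | 'RATE_LIMITED'
  | 'PROVIDER_UNAVAILABLE'
  | 'GENERATION_FAILED';

// Classify an HTTP status the way the wrapping is described above.
function classifyHttpStatus(status: number): { code: ErrorCode; retryable: boolean } {
  switch (status) {
    case 401: return { code: 'AUTH_FAILED', retryable: false };
    case 404: return { code: 'MODEL_NOT_FOUND', retryable: false };
    case 429: return { code: 'RATE_LIMITED', retryable: true };        // free-tier limit
    case 503: return { code: 'PROVIDER_UNAVAILABLE', retryable: true }; // model cold start
    default:  return { code: 'GENERATION_FAILED', retryable: false };
  }
}
```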
1215
1774
  ---
1216
1775
 
1217
1776
  ### ComfyUI — Local Image Generation
@@ -1220,17 +1779,237 @@ The `@huggingface/inference` library (v3.15.0) provides 30+ AI tasks, including
1220
1779
  **Modalities:** Image, Video (planned)
1221
1780
  **Type:** Local
1222
1781
  **Default Port:** 8188
1782
+ **Source:** `src/providers/comfyui.ts` (155 lines)
1783
+
1784
+ Connects to a local ComfyUI instance for Stable Diffusion workflows. ComfyUI is a node-based UI for Stable Diffusion that exposes an HTTP API. Noosphere communicates with it via raw HTTP — no ComfyUI SDK needed.
1785
+
1786
+ #### How It Works — Complete Lifecycle
1787
+
1788
+ ```
1789
+ User calls ai.image() →
1790
+ 1. structuredClone(DEFAULT_TXT2IMG_WORKFLOW) // Deep-clone the template
1791
+ 2. Inject parameters into workflow nodes // Mutate the clone
1792
+ 3. POST /prompt { prompt: workflow } // Queue the workflow
1793
+ 4. Receive { prompt_id: "abc-123" } // Get tracking ID
1794
+ 5. POLL GET /history/abc-123 every 1000ms // Check completion
1795
+ 6. Parse outputs → find SaveImage node // Locate generated image
1796
+ 7. GET /view?filename=X&subfolder=Y&type=Z // Fetch image binary
1797
+ 8. Return Buffer // PNG buffer to caller
1798
+ ```
1799
+
1800
+ #### The Complete Workflow JSON — All 7 Nodes
1223
1801
 
1224
- Connects to a local ComfyUI instance for Stable Diffusion workflows.
1802
+ The `DEFAULT_TXT2IMG_WORKFLOW` constant defines a complete SDXL text-to-image pipeline as a ComfyUI node graph. Each key is a **node ID** (string), each value defines the node type and its connections:
1225
1803
 
1226
- #### How It Works
1804
+ ```typescript
1805
+ // Node "3": KSampler — The core diffusion sampling node
1806
+ '3': {
1807
+ class_type: 'KSampler',
1808
+ inputs: {
1809
+ seed: 0, // Random seed (overridden by options.seed)
1810
+ steps: 20, // Denoising steps (overridden by options.steps)
1811
+ cfg: 7, // CFG/guidance scale (overridden by options.guidanceScale)
1812
+ sampler_name: 'euler', // Sampling algorithm
1813
+ scheduler: 'normal', // Noise schedule
1814
+ denoise: 1, // Full denoise (1.0 = txt2img, <1.0 = img2img)
1815
+ model: ['4', 0], // ← Connection: output 0 of node "4" (checkpoint model)
1816
+ positive: ['6', 0], // ← Connection: output 0 of node "6" (positive prompt)
1817
+ negative: ['7', 0], // ← Connection: output 0 of node "7" (negative prompt)
1818
+ latent_image: ['5', 0], // ← Connection: output 0 of node "5" (empty latent)
1819
+ },
1820
+ }
1227
1821
 
1228
- 1. Clones a built-in txt2img workflow template (KSampler + SDXL pipeline)
1229
- 2. Injects your parameters (prompt, dimensions, seed, steps, guidance)
1230
- 3. POSTs the workflow to ComfyUI's `/prompt` endpoint
1231
- 4. Polls `/history/{promptId}` every second until completion (max 5 minutes)
1232
- 5. Fetches the generated image from `/view`
1233
- 6. Returns a PNG buffer
1822
+ // Node "4": CheckpointLoaderSimple — Loads the SDXL model from disk
1823
+ '4': {
1824
+ class_type: 'CheckpointLoaderSimple',
1825
+ inputs: {
1826
+ ckpt_name: 'sd_xl_base_1.0.safetensors', // Checkpoint file on disk
1827
+ // Outputs: [0]=MODEL, [1]=CLIP, [2]=VAE
1828
+ // MODEL → KSampler.model
1829
+ // CLIP → CLIPTextEncode nodes
1830
+ // VAE → VAEDecode
1831
+ },
1832
+ }
1833
+
1834
+ // Node "5": EmptyLatentImage — Creates the initial noise tensor
1835
+ '5': {
1836
+ class_type: 'EmptyLatentImage',
1837
+ inputs: {
1838
+ width: 1024, // Overridden by options.width
1839
+ height: 1024, // Overridden by options.height
1840
+ batch_size: 1, // Always 1 image per generation
1841
+ },
1842
+ }
1843
+
1844
+ // Node "6": CLIPTextEncode — Positive prompt encoding
1845
+ '6': {
1846
+ class_type: 'CLIPTextEncode',
1847
+ inputs: {
1848
+ text: '', // Overridden by options.prompt
1849
+ clip: ['4', 1], // ← Connection: output 1 of node "4" (CLIP model)
1850
+ },
1851
+ }
1852
+
1853
+ // Node "7": CLIPTextEncode — Negative prompt encoding
1854
+ '7': {
1855
+ class_type: 'CLIPTextEncode',
1856
+ inputs: {
1857
+ text: '', // Overridden by options.negativePrompt ?? ''
1858
+ clip: ['4', 1], // ← Same CLIP model as positive prompt
1859
+ },
1860
+ }
1861
+
1862
+ // Node "8": VAEDecode — Converts latent space to pixel space
1863
+ '8': {
1864
+ class_type: 'VAEDecode',
1865
+ inputs: {
1866
+ samples: ['3', 0], // ← Connection: output 0 of node "3" (sampled latents)
1867
+ vae: ['4', 2], // ← Connection: output 2 of node "4" (VAE decoder)
1868
+ },
1869
+ }
1870
+
1871
+ // Node "9": SaveImage — Saves the final image
1872
+ '9': {
1873
+ class_type: 'SaveImage',
1874
+ inputs: {
1875
+ filename_prefix: 'noosphere', // Files saved as noosphere_00001.png, etc.
1876
+ images: ['8', 0], // ← Connection: output 0 of node "8" (decoded image)
1877
+ },
1878
+ }
1879
+ ```
1880
+
1881
+ **Node connection format:** `['nodeId', outputIndex]` — this is ComfyUI's internal linking system. For example, `['4', 1]` means "output slot 1 of node 4", which is the CLIP model from CheckpointLoaderSimple.
1882
+
1883
+ **Visual pipeline flow:**
1884
+ ```
1885
+ CheckpointLoader["4"] ──MODEL──→ KSampler["3"]
1886
+ ├──CLIP──→ CLIPTextEncode["6"] (positive) ──→ KSampler["3"]
1887
+ ├──CLIP──→ CLIPTextEncode["7"] (negative) ──→ KSampler["3"]
1888
+ └──VAE───→ VAEDecode["8"]
1889
+ EmptyLatentImage["5"] ──→ KSampler["3"] ──→ VAEDecode["8"] ──→ SaveImage["9"]
1890
+ ```
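The `['nodeId', outputIndex]` convention can be demonstrated with a tiny resolver that reports which node class feeds a given input. The `WorkflowNode` type and `sourceOf` helper are simplified sketches, not part of Noosphere:

```typescript
interface WorkflowNode {
  class_type: string;
  inputs: Record<string, unknown>; // literals or ['nodeId', outputIndex] links
}

// Return the class_type of the node wired into `input`, or undefined for literals.
function sourceOf(
  workflow: Record<string, WorkflowNode>,
  nodeId: string,
  input: string,
): string | undefined {
  const ref = workflow[nodeId]?.inputs[input];
  if (Array.isArray(ref) && typeof ref[0] === 'string') {
    return workflow[ref[0]]?.class_type; // follow the link to the upstream node
  }
  return undefined; // plain value such as steps: 20
}
```

On the template above, `sourceOf(workflow, '3', 'model')` reports `CheckpointLoaderSimple`, since KSampler's `model` input is the link `['4', 0]`.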
1891
+
1892
+ #### Parameter Injection — How Options Map to Nodes
1893
+
1894
+ ```typescript
1895
+ // Deep-clone to avoid mutating the template:
1896
+ const workflow = structuredClone(DEFAULT_TXT2IMG_WORKFLOW);
1897
+
1898
+ // Direct node mutations:
1899
+ workflow['6'].inputs.text = options.prompt; // Positive prompt → Node 6
1900
+ workflow['7'].inputs.text = options.negativePrompt ?? ''; // Negative prompt → Node 7
1901
+ workflow['5'].inputs.width = options.width ?? 1024; // Width → Node 5
1902
+ workflow['5'].inputs.height = options.height ?? 1024; // Height → Node 5
1903
+
1904
+ // Conditional overrides (only if user provided them):
1905
+ if (options.seed !== undefined) workflow['3'].inputs.seed = options.seed;
1906
+ if (options.steps !== undefined) workflow['3'].inputs.steps = options.steps;
1907
+ if (options.guidanceScale !== undefined) workflow['3'].inputs.cfg = options.guidanceScale;
1908
+ // Note: sampler_name, scheduler, and denoise are NOT configurable via Noosphere.
1909
+ // They're hardcoded to euler/normal/1.0
1910
+ ```
1911
+
1912
+ #### Queue Submission — POST /prompt
1913
+
1914
+ ```typescript
1915
+ const queueRes = await fetch(`${this.baseUrl}/prompt`, {
1916
+ method: 'POST',
1917
+ headers: { 'Content-Type': 'application/json' },
1918
+ body: JSON.stringify({ prompt: workflow }),
1919
+ // ComfyUI expects: { prompt: <workflow_object>, client_id?: string }
1920
+ });
1921
+
1922
+ if (!queueRes.ok) throw new Error(`ComfyUI queue failed: ${queueRes.status}`);
1923
+
1924
+ const { prompt_id } = await queueRes.json();
1925
+ // prompt_id is a UUID like "a1b2c3d4-e5f6-7890-abcd-ef1234567890"
1926
+ // Used to track this specific generation in the history API
1927
+ ```
1928
+
1929
+ #### Polling Mechanism — Deadline-Based with 1s Intervals
1930
+
1931
+ ```typescript
1932
+ private async pollForResult(promptId: string, maxWaitMs = 300000): Promise<ArrayBuffer> {
1933
+ const deadline = Date.now() + maxWaitMs; // 300,000ms = 5 minutes
1934
+
1935
+ while (Date.now() < deadline) {
1936
+ // Check history for our prompt
1937
+ const res = await fetch(`${this.baseUrl}/history/${promptId}`);
1938
+
1939
+ if (!res.ok) {
1940
+ await new Promise((r) => setTimeout(r, 1000)); // 1 second between polls
1941
+ continue;
1942
+ }
1943
+
1944
+ const history = await res.json();
1945
+ // History format: { [promptId]: { outputs: { [nodeId]: { images: [...] } } } }
1946
+
1947
+ const entry = history[promptId];
1948
+ if (!entry?.outputs) {
1949
+ await new Promise((r) => setTimeout(r, 1000)); // Not ready yet
1950
+ continue;
1951
+ }
1952
+
1953
+ // Search ALL output nodes for images (not just node "9"):
1954
+ for (const nodeOutput of Object.values(entry.outputs)) {
1955
+ if (nodeOutput.images?.length > 0) {
1956
+ const img = nodeOutput.images[0];
1957
+ // Fetch the actual image binary:
1958
+ const imgRes = await fetch(
1959
+ `${this.baseUrl}/view?filename=${img.filename}&subfolder=${img.subfolder}&type=${img.type}`
1960
+ );
1961
+ return imgRes.arrayBuffer();
1962
+ }
1963
+ }
1964
+
1965
+ await new Promise((r) => setTimeout(r, 1000));
1966
+ }
1967
+
1968
+ throw new Error(`ComfyUI generation timed out after ${maxWaitMs}ms`);
1969
+ }
1970
+ ```
1971
+
1972
+ **Key polling details:**
1973
+ - **Interval:** Fixed 1000ms (not configurable)
1974
+ - **Timeout:** 300,000ms = 5 minutes (hardcoded, not from `config.timeout.image`)
1975
+ - **Deadline-based:** Uses `Date.now() < deadline` comparison, NOT a retry counter
1976
+ - **Image fetch URL format:** `/view?filename=noosphere_00001_.png&subfolder=&type=output`
1977
+ - **Returns:** Raw `ArrayBuffer` → converted to `Buffer` by the caller
1978
+
1979
+ #### Auto-Detection — How ComfyUI Gets Discovered
1980
+
1981
+ During `Noosphere.init()`, if `autoDetectLocal` is true:
1982
+
1983
+ ```typescript
1984
+ // Ping the /system_stats endpoint with a 2-second timeout:
1985
+ const pingUrl = async (url: string): Promise<boolean> => {
1986
+ const controller = new AbortController();
1987
+ const timer = setTimeout(() => controller.abort(), 2000); // 2s hard timeout
1988
+ try {
1989
+ const res = await fetch(url, { signal: controller.signal });
1990
+ return res.ok;
1991
+ } finally {
1992
+ clearTimeout(timer);
1993
+ }
1994
+ };
1995
+
1996
+ // Check ComfyUI specifically:
1997
+ if (comfyuiCfg?.enabled) {
1998
+ const ok = await pingUrl(`${comfyuiCfg.host}:${comfyuiCfg.port}/system_stats`);
1999
+ if (ok) {
2000
+ this.registry.addProvider(new ComfyUIProvider({
2001
+ host: comfyuiCfg.host, // Default: 'http://localhost'
2002
+ port: comfyuiCfg.port, // Default: 8188
2003
+ }));
2004
+ }
2005
+ }
2006
+ ```
2007
+
2008
+ **Environment variable overrides:**
2009
+ ```bash
2010
+ COMFYUI_HOST=http://192.168.1.100 # Override host
2011
+ COMFYUI_PORT=8190 # Override port
2012
+ ```
1234
2013
 
1235
2014
  #### Configuration
1236
2015
 
@@ -1238,30 +2017,82 @@ Connects to a local ComfyUI instance for Stable Diffusion workflows.
1238
2017
  const ai = new Noosphere({
1239
2018
  local: {
1240
2019
  comfyui: {
1241
- enabled: true,
1242
- host: 'http://localhost',
1243
- port: 8188,
2020
+ enabled: true, // Default: true (auto-detected)
2021
+ host: 'http://localhost', // Default: 'http://localhost'
2022
+ port: 8188, // Default: 8188
1244
2023
  },
1245
2024
  },
1246
2025
  });
1247
2026
  ```
1248
2027
 
1249
- #### Default Workflow
2028
+ #### Model Discovery — /object_info as a Connectivity Check
1250
2029
 
1251
- - **Checkpoint:** `sd_xl_base_1.0.safetensors`
1252
- - **Sampler:** euler with normal scheduler
1253
- - **Default Steps:** 20
1254
- - **Default CFG/Guidance:** 7
1255
- - **Default Size:** 1024x1024
1256
- - **Max Size:** 2048x2048
1257
- - **Output:** PNG
2030
+ ```typescript
2031
+ async listModels(modality?: Modality): Promise<ModelInfo[]> {
2032
+ // Fetches ComfyUI's full node registry:
2033
+ const res = await fetch(`${this.baseUrl}/object_info`);
2034
+ if (!res.ok) return [];
2035
+
2036
+ // Does NOT parse the response — just uses it as a connectivity check.
2037
+ // Returns hardcoded model entries:
2038
+ const models: ModelInfo[] = [];
2039
+ if (!modality || modality === 'image') {
2040
+ models.push({
2041
+ id: 'comfyui-txt2img',
2042
+ provider: 'comfyui',
2043
+ name: 'ComfyUI Text-to-Image',
2044
+ modality: 'image',
2045
+ local: true,
2046
+ cost: { price: 0, unit: 'free' },
2047
+ capabilities: { maxWidth: 2048, maxHeight: 2048, supportsNegativePrompt: true },
2048
+ });
2049
+ }
2050
+ if (!modality || modality === 'video') {
2051
+ models.push({
2052
+ id: 'comfyui-txt2vid',
2053
+ provider: 'comfyui',
2054
+ name: 'ComfyUI Text-to-Video',
2055
+ modality: 'video',
2056
+ local: true,
2057
+ cost: { price: 0, unit: 'free' },
2058
+ capabilities: { maxDuration: 10, supportsImageToVideo: true },
2059
+ });
2060
+ }
2061
+ return models;
2062
+ }
2063
+ // NOTE: /object_info is fetched but the response is discarded.
2064
+ // The actual model list is hardcoded. This means even if you have
2065
+ // dozens of checkpoints in ComfyUI, Noosphere only exposes 2 model IDs.
2066
+ ```
1258
2067
 
1259
- #### Models Exposed
2068
+ #### Video Generation — Not Yet Implemented
1260
2069
 
1261
- | Model ID | Modality | Description |
1262
- |---|---|---|
1263
- | `comfyui-txt2img` | Image | Text-to-image via workflow |
1264
- | `comfyui-txt2vid` | Video | Planned (requires AnimateDiff workflow) |
2070
+ ```typescript
2071
+ async video(_options: VideoOptions): Promise<NoosphereResult> {
2072
+ throw new Error('ComfyUI video generation requires a configured AnimateDiff workflow');
2073
+ }
2074
+ // The 'comfyui-txt2vid' model ID is listed but will throw at runtime.
2075
+ // This is a placeholder for future AnimateDiff/SVD workflow templates.
2076
+ ```
2077
+
2078
+ #### Default Workflow Parameters Summary
2079
+
2080
+ | Parameter | Default | Configurable | Node |
2081
+ |---|---|---|---|
2082
+ | Checkpoint | `sd_xl_base_1.0.safetensors` | No | Node 4 |
2083
+ | Sampler | `euler` | No | Node 3 |
2084
+ | Scheduler | `normal` | No | Node 3 |
2085
+ | Denoise | `1.0` | No | Node 3 |
2086
+ | Steps | `20` | Yes (`options.steps`) | Node 3 |
2087
+ | CFG/Guidance | `7` | Yes (`options.guidanceScale`) | Node 3 |
2088
+ | Seed | `0` | Yes (`options.seed`) | Node 3 |
2089
+ | Width | `1024` | Yes (`options.width`) | Node 5 |
2090
+ | Height | `1024` | Yes (`options.height`) | Node 5 |
2091
+ | Batch Size | `1` | No | Node 5 |
2092
+ | Filename Prefix | `noosphere` | No | Node 9 |
2093
+ | Negative Prompt | `''` (empty) | Yes (`options.negativePrompt`) | Node 7 |
2094
+ | Max Size | `2048x2048` | Via options | Node 5 |
2095
+ | Output Format | PNG | No | ComfyUI default |
1265
2096
 
1266
2097
  ---
1267
2098
 
@@ -1270,99 +2101,889 @@ const ai = new Noosphere({
1270
2101
  **Provider IDs:** `piper`, `kokoro`
1271
2102
  **Modality:** TTS
1272
2103
  **Type:** Local
2104
+ **Source:** `src/providers/local-tts.ts` (112 lines)
1273
2105
 
1274
- Connects to local OpenAI-compatible TTS servers.
2106
+ The `LocalTTSProvider` is a generic adapter for any local TTS server that exposes an OpenAI-compatible `/v1/audio/speech` endpoint. Two instances are created by default — one for Piper, one for Kokoro — but the class works with ANY server implementing this protocol.
1275
2107
 
1276
2108
  #### Supported Engines
1277
2109
 
1278
- | Engine | Default Port | Health Check | Voice Discovery |
1279
- |---|---|---|---|
1280
- | Piper | 5500 | `GET /health` | `GET /voices` |
1281
- | Kokoro | 5501 | `GET /health` | `GET /v1/models` (fallback) |
2110
+ | Engine | Default Port | Health Check | Voice Discovery | Description |
2111
+ |---|---|---|---|---|
2112
+ | Piper | 5500 | `GET /health` | `GET /voices` (array) | Fast offline TTS, 30+ languages, ONNX models |
2113
+ | Kokoro | 5501 | `GET /health` | `GET /v1/models` (OpenAI format) | High-quality neural TTS |
1282
2114
 
1283
- #### API
2115
+ #### Provider Instantiation — How Instances Are Created
1284
2116
 
1285
- Uses the OpenAI-compatible TTS endpoint:
2117
+ ```typescript
2118
+ // The LocalTTSProvider constructor takes a config object:
2119
+ interface LocalTTSConfig {
2120
+ id: string; // Provider ID: 'piper' or 'kokoro'
2121
+ name: string; // Display name: 'Piper TTS' or 'Kokoro TTS'
2122
+ host: string; // Base URL host
2123
+ port: number; // Port number
2124
+ }
1286
2125
 
2126
+ // Two separate instances are created during init():
2127
+ new LocalTTSProvider({ id: 'piper', name: 'Piper TTS', host: piperCfg.host, port: piperCfg.port })
2128
+ new LocalTTSProvider({ id: 'kokoro', name: 'Kokoro TTS', host: kokoroCfg.host, port: kokoroCfg.port })
2129
+
2130
+ // Each instance is an independent provider in the registry.
2131
+ // They don't share state or config.
2132
+ // The baseUrl is constructed as: `${config.host}:${config.port}`
2133
+ // Example: "http://localhost:5500"
1287
2134
  ```
1288
- POST /v1/audio/speech
2135
+
2136
+ #### Health Check — Ping Protocol
2137
+
2138
+ ```typescript
2139
+ async ping(): Promise<boolean> {
2140
+ try {
2141
+ const res = await fetch(`${this.baseUrl}/health`);
2142
+ return res.ok; // true if HTTP 200-299
2143
+ } catch {
2144
+ return false; // Network error, connection refused, etc.
2145
+ }
2146
+ }
2147
+ // Used during auto-detection in Noosphere.init()
2148
+ // During init(), the same /health check runs under the 2-second AbortController timeout.
2149
+ // Note: /health is checked BEFORE the provider is registered.
2150
+ // If /health fails, the provider is silently skipped.
2151
+ ```
2152
+
2153
+ #### Dual Voice Discovery Mechanism
2154
+
2155
+ The `listModels()` method implements a **two-strategy fallback** to discover available voices. This is necessary because different TTS servers expose voices through different API formats:
2156
+
2157
+ ```typescript
2158
+ async listModels(modality?: Modality): Promise<ModelInfo[]> {
2159
+ if (modality && modality !== 'tts') return [];
2160
+
2161
+ let voices: Array<{ id: string; name?: string }> = [];
2162
+
2163
+ // STRATEGY 1: Piper-style /voices endpoint
2164
+ // Expected response: Array<{ id: string, name?: string, ... }>
2165
+ try {
2166
+ const res = await fetch(`${this.baseUrl}/voices`);
2167
+ if (res.ok) {
2168
+ const data = await res.json();
2169
+ if (Array.isArray(data)) {
2170
+ voices = data;
2171
+ // Success — skip fallback
2172
+ }
2173
+ }
2174
+ } catch {
2175
+ // STRATEGY 2: OpenAI-compatible /v1/models endpoint
2176
+ // Expected response: { data: Array<{ id: string, ... }> }
2177
+ const res = await fetch(`${this.baseUrl}/v1/models`);
2178
+ if (res.ok) {
2179
+ const data = await res.json();
2180
+ voices = data.data ?? [];
2181
+ }
2182
+ }
2183
+
2184
+ // Map voices to ModelInfo objects:
2185
+ return voices.map((v) => ({
2186
+ id: v.id,
2187
+ provider: this.id, // 'piper' or 'kokoro'
2188
+ name: v.name ?? v.id, // Fallback to ID if no name
2189
+ modality: 'tts' as const,
2190
+ local: true,
2191
+ cost: { price: 0, unit: 'free' },
2192
+ capabilities: {
2193
+ voices: voices.map((vv) => vv.id), // All voice IDs as capabilities
2194
+ },
2195
+ }));
2196
+ }
2197
+ ```
2198
+
2199
+ **Critical implementation detail:** The fallback is triggered by a `catch` block, NOT by checking the response. This means:
2200
+ - If `/voices` returns a **non-array** (e.g., `{}`), strategy 1 succeeds but `voices` remains empty
2201
+ - If `/voices` returns HTTP **404**, strategy 1 "succeeds" (no exception), but `res.ok` is false, so voices stays empty, AND strategy 2 is never tried
2202
+ - Strategy 2 only runs if `/voices` **throws a network error** (connection refused, DNS failure, etc.)
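A hardened variant would fall back whenever strategy 1 yields nothing, whether `/voices` threw, returned a non-2xx status, or returned a non-array. This is a sketch of that alternative, assuming the same two response shapes; it is not the shipped implementation:

```typescript
async function discoverVoices(
  baseUrl: string,
): Promise<Array<{ id: string; name?: string }>> {
  // Strategy 1: Piper-style /voices (bare array)
  try {
    const res = await fetch(`${baseUrl}/voices`);
    if (res.ok) {
      const data = await res.json();
      if (Array.isArray(data)) return data;
    }
  } catch {
    // swallow and fall through to strategy 2
  }
  // Strategy 2: OpenAI-style /v1/models ({ data: [...] })
  try {
    const res = await fetch(`${baseUrl}/v1/models`);
    if (res.ok) {
      const data = await res.json();
      return data.data ?? [];
    }
  } catch {
    // both strategies failed
  }
  return [];
}
```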
2203
+
2204
+ **Piper response format** (`GET /voices`):
2205
+ ```json
2206
+ [
2207
+ { "id": "en_US-lessac-medium", "name": "Lessac (English US)" },
2208
+ { "id": "en_US-amy-medium", "name": "Amy (English US)" },
2209
+ { "id": "de_DE-thorsten-high", "name": "Thorsten (German)" }
2210
+ ]
2211
+ ```
2212
+
2213
+ **Kokoro/OpenAI response format** (`GET /v1/models`):
2214
+ ```json
1289
2215
  {
1290
- "model": "tts-1",
1291
- "input": "Hello world",
1292
- "voice": "default",
1293
- "speed": 1.0,
1294
- "response_format": "mp3"
2216
+ "data": [
2217
+ { "id": "kokoro-v1", "object": "model" },
2218
+ { "id": "kokoro-v1-jp", "object": "model" }
2219
+ ]
1295
2220
  }
1296
2221
  ```
1297
2222
 
1298
- Supports `mp3`, `wav`, and `ogg` formats. Returns audio as a Buffer.
2223
+ #### Speech Generation — Exact HTTP Protocol
2224
+
2225
+ ```typescript
2226
+ async speak(options: SpeakOptions): Promise<NoosphereResult> {
2227
+ const start = Date.now();
2228
+
2229
+ // POST to OpenAI-compatible TTS endpoint:
2230
+ const res = await fetch(`${this.baseUrl}/v1/audio/speech`, {
2231
+ method: 'POST',
2232
+ headers: { 'Content-Type': 'application/json' },
2233
+ body: JSON.stringify({
2234
+ model: options.model ?? 'tts-1', // Default model ID
2235
+ input: options.text, // Text to synthesize
2236
+ voice: options.voice ?? 'default', // Voice selection
2237
+ speed: options.speed ?? 1.0, // Playback speed multiplier
2238
+ response_format: options.format ?? 'mp3', // Output audio format
2239
+ }),
2240
+ });
2241
+
2242
+ if (!res.ok) {
2243
+ throw new Error(`Local TTS failed: ${res.status} ${await res.text()}`);
2244
+ // Note: error includes the response body text for debugging
2245
+ }
2246
+
2247
+ // Response is raw audio binary — convert to Buffer:
2248
+ const audioBuffer = Buffer.from(await res.arrayBuffer());
2249
+
2250
+ return {
2251
+ buffer: audioBuffer,
2252
+ provider: this.id, // 'piper' or 'kokoro'
2253
+ model: options.model ?? options.voice ?? 'default', // Fallback chain
2254
+ modality: 'tts',
2255
+ latencyMs: Date.now() - start,
2256
+ usage: {
2257
+ cost: 0, // Always free (local)
2258
+ input: options.text.length, // CHARACTER count, not tokens
2259
+ unit: 'characters', // Track by characters
2260
+ },
2261
+ media: {
2262
+ format: options.format ?? 'mp3', // Matches requested format
2263
+ },
2264
+ };
2265
+ }
2266
+ ```
2267
+
2268
+ **Request/Response details:**
2269
+ | Field | Value | Notes |
2270
+ |---|---|---|
2271
+ | Method | `POST` | Always POST |
2272
+ | URL | `/v1/audio/speech` | OpenAI-compatible standard |
2273
+ | Content-Type | `application/json` | JSON body |
2274
+ | Response Content-Type | `audio/mpeg`, `audio/wav`, or `audio/ogg` | Depends on `response_format` |
2275
+ | Response Body | Raw binary audio | Converted to `Buffer` via `arrayBuffer()` |
2276
+
2277
+ **Available formats (from `SpeakOptions.format` type):**
2278
+ | Format | Typical Size | Quality | Use Case |
2279
+ |---|---|---|---|
2280
+ | `mp3` | Smallest | Lossy | Web playback, storage |
2281
+ | `wav` | Largest | Lossless | Processing, editing |
2282
+ | `ogg` | Medium | Lossy | Web playback, open format |
2283
+
2284
+ #### Usage Tracking — Character-Based
2285
+
2286
+ Local TTS tracks usage by **character count**, not tokens:
2287
+
2288
+ ```typescript
2289
+ usage: {
2290
+ cost: 0, // Always 0 for local providers
2291
+ input: options.text.length, // JavaScript string .length (UTF-16 code units)
2292
+ unit: 'characters', // Unit identifier for aggregation
2293
+ }
2294
+ // Note: .length counts UTF-16 code units, not Unicode codepoints.
2295
+ // "Hello" = 5, "🎵" = 2 (surrogate pair), "café" = 4
2296
+ ```
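The code-unit counts noted above can be checked directly in any JavaScript runtime:

```typescript
// .length counts UTF-16 code units, as noted above.
const samples: Array<[string, number]> = [
  ['Hello', 5],
  ['🎵', 2],   // astral codepoint -> surrogate pair
  ['café', 4], // precomposed U+00E9 is a single code unit
];
for (const [text, expected] of samples) {
  if (text.length !== expected) {
    throw new Error(`${JSON.stringify(text)}: got ${text.length}, expected ${expected}`);
  }
}
```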
2297
+
2298
+ This feeds into the global `UsageTracker`, so you can query TTS usage:
2299
+ ```typescript
2300
+ const usage = ai.getUsage({ modality: 'tts' });
2301
+ // usage.totalRequests = number of TTS calls
2302
+ // usage.totalCost = 0 (always free for local)
2303
+ // usage.byProvider = { piper: 0, kokoro: 0 }
2304
+ ```
2305
+
2306
+ #### Auto-Detection — Parallel Discovery
2307
+
2308
+ Both Piper and Kokoro are detected simultaneously during `init()`:
2309
+
2310
+ ```typescript
2311
+ // Inside Noosphere.init(), wrapped in Promise.allSettled():
2312
+ await Promise.allSettled([
2313
+ // ... ComfyUI detection ...
2314
+ (async () => {
2315
+ if (piperCfg?.enabled) { // enabled: true by default
2316
+ const ok = await pingUrl(`${piperCfg.host}:${piperCfg.port}/health`);
2317
+ if (ok) {
2318
+ this.registry.addProvider(new LocalTTSProvider({
2319
+ id: 'piper', name: 'Piper TTS',
2320
+ host: piperCfg.host, port: piperCfg.port,
2321
+ }));
2322
+ }
2323
+ }
2324
+ })(),
2325
+ (async () => {
2326
+ if (kokoroCfg?.enabled) { // enabled: true by default
2327
+ const ok = await pingUrl(`${kokoroCfg.host}:${kokoroCfg.port}/health`);
2328
+ if (ok) {
2329
+ this.registry.addProvider(new LocalTTSProvider({
2330
+ id: 'kokoro', name: 'Kokoro TTS',
2331
+ host: kokoroCfg.host, port: kokoroCfg.port,
2332
+ }));
2333
+ }
2334
+ }
2335
+ })(),
2336
+ ]);
2337
+ ```
2338
+
2339
+ **Environment variable overrides:**
2340
+ ```bash
2341
+ PIPER_HOST=http://192.168.1.100 PIPER_PORT=5500
2342
+ KOKORO_HOST=http://192.168.1.100 KOKORO_PORT=5501
2343
+ ```
2344
+
2345
+ #### Setting Up Local TTS Servers
2346
+
2347
+ **Piper TTS:**
2348
+ ```bash
2349
+ # Docker (recommended):
2350
+ docker run -p 5500:5500 rhasspy/wyoming-piper \
2351
+ --voice en_US-lessac-medium
2352
+
2353
+ # Or via pip:
2354
+ pip install piper-tts
2355
+ # Then run a compatible HTTP server (wyoming-piper or piper-http-server)
2356
+ ```
2357
+
2358
+ **Kokoro TTS:**
2359
+ ```bash
2360
+ # Docker:
2361
+ docker run -p 5501:8880 ghcr.io/remsky/kokoro-fastapi-cpu:latest
2362
+
2363
+ # The Kokoro server exposes OpenAI-compatible endpoints at:
2364
+ # GET /v1/models → List available voices
2365
+ # POST /v1/audio/speech → Generate speech
2366
+ # GET /health → Health check
2367
+ ```
1299
2368
 
1300
2369
  ---
1301
2370
 
1302
2371
  ## Architecture
1303
2372
 
1304
- ### Provider Resolution (Local-First)
2373
+ ### The Complete Init() Flow — What Happens When You Create a Noosphere Instance
2374
+
2375
+ ```typescript
2376
+ const ai = new Noosphere({ /* config */ });
2377
+ // At this point: config is resolved, but NO providers are registered.
2378
+ // The `initialized` flag is false.
2379
+
2380
+ await ai.chat({ messages: [...] });
2381
+ // FIRST call triggers lazy initialization via init()
2382
+ ```
2383
+
2384
+ **Initialization sequence (`src/noosphere.ts:240-322`):**
2385
+
2386
+ ```
2387
+ 1. Constructor:
2388
+ ├── resolveConfig(input) // Merge config > env > defaults
2389
+ ├── new Registry(cacheTTLMinutes) // Empty provider registry
2390
+ └── new UsageTracker(onUsage) // Empty event list
2391
+
2392
+ 2. First API call triggers init():
2393
+ ├── Set initialized = true (immediately, before any async work)
2394
+
2395
+ ├── CLOUD PROVIDER REGISTRATION (synchronous):
2396
+ │ ├── Collect all API keys from resolved config
2397
+ │ ├── If ANY LLM key exists → register PiAiProvider(allKeys)
2398
+ │ ├── If FAL key exists → register FalProvider(falKey)
2399
+ │ └── If HF token exists → register HuggingFaceProvider(token)
2400
+
2401
+ └── LOCAL SERVICE DETECTION (parallel, async):
2402
+ └── Promise.allSettled([
2403
+ pingUrl(comfyui /system_stats) → register ComfyUIProvider
2404
+ pingUrl(piper /health) → register LocalTTSProvider('piper')
2405
+ pingUrl(kokoro /health) → register LocalTTSProvider('kokoro')
2406
+ ])
2407
+ ```
2408
+
2409
+ **Key design decisions:**
2410
+ - `initialized = true` is set **before** async work, preventing concurrent init() calls
2411
+ - Cloud providers are registered **synchronously** (no network calls needed)
2412
+ - Local detection uses `Promise.allSettled()` — a failing ping doesn't block others
2413
+ - Each ping has a 2-second `AbortController` timeout
2414
+ - If auto-detection is disabled (`autoDetectLocal: false`), local providers are never registered
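The flag-before-await pattern can be sketched as a minimal guard. `LazyInit` and `ensureInit` are illustrative names, and storing the in-flight promise so later callers can await it is an assumption about the mechanism, not a quote of the source:

```typescript
class LazyInit {
  private initialized = false;
  private initPromise: Promise<void> = Promise.resolve();

  async ensureInit(detect: () => Promise<void>): Promise<void> {
    if (!this.initialized) {
      this.initialized = true;     // flipped BEFORE any await, so a second
      this.initPromise = detect(); // concurrent caller won't re-run detection
    }
    await this.initPromise; // everyone waits on the same detection pass
  }
}
```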
1305
2415
 
1306
- When you call a generation method without specifying a provider, Noosphere resolves one automatically:
2416
+ ### Configuration Resolution — Three-Layer Priority System
1307
2417
 
1308
- 1. If `model` is specified without `provider` looks up model in registry cache
1309
- 2. If a `default` is configured for the modality → uses that
1310
- 3. Otherwise → **local providers first**, then cloud providers
2418
+ The `resolveConfig()` function (`src/config.ts`, 87 lines) implements a strict priority hierarchy:
1311
2419
 
1312
2420
  ```
1313
- resolveProvider(modality):
1314
- 1. Check user-specified provider ID → return if found
1315
- 2. Check configured defaults → return if found
1316
- 3. Scan all providers:
1317
- → Return first LOCAL provider supporting this modality
1318
- → Fallback to first CLOUD provider
1319
- 4. Throw NO_PROVIDER error
2421
+ Priority: Explicit Config > Environment Variables > Built-in Defaults
1320
2422
  ```
1321
2423
 
1322
- ### Retry & Failover Logic
2424
+ **API Key Resolution:**
2425
+ ```typescript
2426
+ // For each of the 9 supported providers:
2427
+ const ENV_KEY_MAP = {
2428
+ openai: 'OPENAI_API_KEY',
2429
+ anthropic: 'ANTHROPIC_API_KEY',
2430
+ google: 'GEMINI_API_KEY',
2431
+ fal: 'FAL_KEY',
2432
+ openrouter: 'OPENROUTER_API_KEY',
2433
+ huggingface: 'HUGGINGFACE_TOKEN',
2434
+ groq: 'GROQ_API_KEY',
2435
+ mistral: 'MISTRAL_API_KEY',
2436
+ xai: 'XAI_API_KEY',
2437
+ };
1323
2438
 
2439
+ // Resolution per key:
2440
+ keys[name] = input.keys?.[name] // 1. Explicit config
2441
+ ?? process.env[envVar]; // 2. Environment variable
2442
+ // 3. undefined (no default)
1324
2443
  ```
1325
- executeWithRetry(modality, provider, fn):
1326
- for attempt = 0..maxRetries:
1327
- try: return fn()
1328
- catch:
1329
- if error is retryable AND attempts remain:
1330
- wait backoffMs * 2^attempt (exponential backoff)
1331
- retry same provider
1332
- if error is NOT GENERATION_FAILED AND failover enabled:
1333
- try each alternative provider for this modality
1334
- throw last error
2444
+
2445
+ **Local Service Resolution:**
2446
+ ```typescript
2447
+ // For each of the 4 local services:
2448
+ const LOCAL_DEFAULTS = {
2449
+ ollama: { host: 'http://localhost', port: 11434, envHost: 'OLLAMA_HOST', envPort: 'OLLAMA_PORT' },
2450
+ comfyui: { host: 'http://localhost', port: 8188, envHost: 'COMFYUI_HOST', envPort: 'COMFYUI_PORT' },
2451
+ piper: { host: 'http://localhost', port: 5500, envHost: 'PIPER_HOST', envPort: 'PIPER_PORT' },
2452
+ kokoro: { host: 'http://localhost', port: 5501, envHost: 'KOKORO_HOST', envPort: 'KOKORO_PORT' },
2453
+ };
2454
+
2455
+ // Resolution per service:
2456
+ local[name] = {
2457
+ enabled: cfgLocal?.enabled ?? true, // Default: enabled
2458
+ host: cfgLocal?.host ?? process.env[envHost] ?? defaults.host,
2459
+ port: cfgLocal?.port ?? parseInt(process.env[envPort]) ?? defaults.port,
2460
+ type: cfgLocal?.type,
2461
+ };
1335
2462
  ```
 
- **Retryable errors (same provider):** `PROVIDER_UNAVAILABLE`, `RATE_LIMITED`, `TIMEOUT`, `GENERATION_FAILED`
+ **Other config defaults:**
+
+ | Setting | Default | Environment Override |
+ |---|---|---|
+ | `autoDetectLocal` | `true` | `NOOSPHERE_AUTO_DETECT_LOCAL` |
+ | `discoveryCacheTTL` | `60` (minutes) | `NOOSPHERE_DISCOVERY_CACHE_TTL` |
+ | `retry.maxRetries` | `2` | — |
+ | `retry.backoffMs` | `1000` | — |
+ | `retry.failover` | `true` | — |
+ | `retry.retryableErrors` | `['PROVIDER_UNAVAILABLE', 'RATE_LIMITED', 'TIMEOUT']` | — |
+ | `timeout.llm` | `30000` (30s) | — |
+ | `timeout.image` | `120000` (2m) | — |
+ | `timeout.video` | `300000` (5m) | — |
+ | `timeout.tts` | `60000` (1m) | — |
 
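Assuming user-supplied options are shallow-merged over the documented defaults (a sketch under that assumption; `RetryConfig` mirrors the table above, and `resolveRetryConfig` is a hypothetical helper, not the library's exported API):

```typescript
// Shallow-merge of partial user retry options over the documented defaults.
interface RetryConfig {
  maxRetries: number;
  backoffMs: number;
  failover: boolean;
  retryableErrors: string[];
}

const RETRY_DEFAULTS: RetryConfig = {
  maxRetries: 2,
  backoffMs: 1000,
  failover: true,
  retryableErrors: ['PROVIDER_UNAVAILABLE', 'RATE_LIMITED', 'TIMEOUT'],
};

function resolveRetryConfig(user: Partial<RetryConfig> = {}): RetryConfig {
  return { ...RETRY_DEFAULTS, ...user }; // user values override defaults
}

const cfg = resolveRetryConfig({ maxRetries: 4 });
console.log(cfg.maxRetries, cfg.backoffMs); // 4 1000
```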
- **Failover-eligible errors (cross-provider):** `PROVIDER_UNAVAILABLE`, `RATE_LIMITED`, `TIMEOUT` (NOT `GENERATION_FAILED`)
+ ### Provider Resolution — The Local-First Algorithm
 
- ### Model Registry & Caching
+ When you call a generation method without specifying a provider, Noosphere resolves one automatically through a three-stage process in `resolveProviderForModality()` (`src/noosphere.ts:324-348`):
 
- - Models are fetched from providers via `listModels()` and cached in memory
- - Cache TTL is configurable (default: 60 minutes)
- - `syncModels()` forces a refresh of all provider model lists
- - Registry tracks model → provider mappings for fast resolution
+ ```typescript
+ private resolveProviderForModality(
+   modality: Modality,
+   preferredId?: string,
+   modelId?: string,
+ ): NoosphereProvider {
+
+   // STAGE 1: Model-based resolution
+   // If model was specified WITHOUT a provider, search the registry cache
+   if (modelId && !preferredId) {
+     const resolved = this.registry.resolveModel(modelId, modality);
+     if (resolved) return resolved.provider;
+     // resolveModel() scans ALL cached models across ALL providers
+     // looking for exact match on both modelId AND modality
+   }
 
- ### Usage Tracking
+   // STAGE 2: Default-based resolution
+   // Check if user configured a default for this modality
+   if (!preferredId) {
+     const defaultCfg = this.config.defaults[modality];
+     if (defaultCfg) {
+       preferredId = defaultCfg.provider;
+       // Now fall through to Stage 3 with this preferredId
+     }
+   }
+
+   // STAGE 3: Provider registry resolution
+   const provider = this.registry.resolveProvider(modality, preferredId);
+   if (!provider) {
+     throw new NoosphereError(
+       `No provider available for modality '${modality}'`,
+       { code: 'NO_PROVIDER', ... }
+     );
+   }
+   return provider;
+ }
+ ```
+
+ **Registry.resolveProvider() — The local-first algorithm** (`src/registry.ts:31-46`):
+
+ ```typescript
+ resolveProvider(modality: Modality, preferredId?: string): NoosphereProvider | null {
+   // If a specific provider was requested:
+   if (preferredId) {
+     const p = this.providers.get(preferredId);
+     if (p && p.modalities.includes(modality)) return p;
+     return null; // NOT found — returns null, NOT a fallback
+   }
+
+   // No preference — scan with local-first priority:
+   let bestCloud: NoosphereProvider | null = null;
+
+   for (const p of this.providers.values()) {
+     if (!p.modalities.includes(modality)) continue;
+
+     // LOCAL provider found → return IMMEDIATELY (first match wins)
+     if (p.isLocal) return p;
+
+     // CLOUD provider → save as fallback (first cloud match only)
+     if (!bestCloud) bestCloud = p;
+   }
+
+   return bestCloud; // Return first cloud provider, or null
+ }
+ ```
+
+ **Resolution priority diagram:**
+ ```
+ ai.chat({ model: 'gpt-4o' })
+
+ ├─ Stage 1: Search modelCache for 'gpt-4o' with modality 'llm'
+ │    └── Found in pi-ai cache → return PiAiProvider
+
+ ├─ Stage 2: (skipped — model resolved in Stage 1)
+
+ └─ Stage 3: (skipped — already resolved)
+
+ ai.image({ prompt: 'sunset' })
+
+ ├─ Stage 1: (no model specified, skipped)
+
+ ├─ Stage 2: Check config.defaults.image → none configured
+
+ └─ Stage 3: resolveProvider('image', undefined)
+      ├── Scan providers:
+      │    ├── pi-ai: modalities=['llm'] → skip (no 'image')
+      │    ├── comfyui: modalities=['image','video'], isLocal=true → RETURN
+      │    └── (fal never reached — local wins)
+      └── Returns ComfyUIProvider (local-first)
+
+ ai.image({ prompt: 'sunset' }) // No local ComfyUI running
+
+ └─ Stage 3: resolveProvider('image', undefined)
+      ├── Scan providers:
+      │    ├── pi-ai: no 'image' → skip
+      │    ├── fal: modalities=['image','video','tts'], isLocal=false → save as bestCloud
+      │    └── huggingface: modalities=['image','tts','llm'], isLocal=false → already have bestCloud
+      └── Returns FalProvider (first cloud fallback)
+ ```
+
+ ### Retry & Failover Logic — Complete Algorithm
+
+ The `executeWithRetry()` method (`src/noosphere.ts:350-397`) implements a two-phase error-handling strategy: same-provider retries, then cross-provider failover.
+
+ ```typescript
+ private async executeWithRetry<T>(
+   modality: Modality,
+   provider: NoosphereProvider,
+   fn: () => Promise<T>,
+   failoverFnFactory?: (alt: NoosphereProvider) => (() => Promise<T>) | null,
+ ): Promise<T> {
+   const { maxRetries, backoffMs, retryableErrors, failover } = this.config.retry;
+   // Default: maxRetries=2, backoffMs=1000, failover=true
+   // retryableErrors = ['PROVIDER_UNAVAILABLE', 'RATE_LIMITED', 'TIMEOUT']
+   let lastError: Error | undefined;
+
+   for (let attempt = 0; attempt <= maxRetries; attempt++) {
+     try {
+       return await fn(); // Try the primary provider
+     } catch (err) {
+       lastError = err instanceof Error ? err : new Error(String(err));
+
+       const isNoosphereErr = err instanceof NoosphereError;
+       const code = isNoosphereErr ? err.code : 'GENERATION_FAILED';
+
+       // GENERATION_FAILED is special:
+       // - Retryable on same provider (bad prompt, transient model issue)
+       // - NOT eligible for cross-provider failover
+       const isRetryable = retryableErrors.includes(code) || code === 'GENERATION_FAILED';
+       const allowsFailover = code !== 'GENERATION_FAILED' && retryableErrors.includes(code);
+
+       if (!isRetryable || attempt === maxRetries) {
+         // FAILOVER PHASE: Try other providers
+         if (failover && allowsFailover && failoverFnFactory) {
+           const altProviders = this.registry.getAllProviders()
+             .filter((p) => p.id !== provider.id && p.modalities.includes(modality));
+
+           for (const alt of altProviders) {
+             try {
+               const altFn = failoverFnFactory(alt);
+               if (altFn) return await altFn(); // Success on alternate provider
+             } catch {
+               // Continue to next alternate provider
+             }
+           }
+         }
+         break; // All retries and failovers exhausted
+       }
+
+       // RETRY: Exponential backoff on same provider
+       const delay = backoffMs * Math.pow(2, attempt);
+       // attempt=0: 1000ms, attempt=1: 2000ms (the final attempt fails over instead of waiting)
+       await new Promise((resolve) => setTimeout(resolve, delay));
+     }
+   }
+
+   throw lastError ?? new NoosphereError('Generation failed', { ... });
+ }
+ ```
+
+ **Failover function factory pattern:**
+
+ Each generation method passes a factory function that creates the right call for alternate providers:
+ ```typescript
+ // In chat():
+ (alt) => alt.chat ? () => alt.chat!(options) : null
+ // If the alternate provider has chat(), create a function to call it.
+ // If not (e.g., ComfyUI for LLM), return null → skip this provider.
+
+ // In image():
+ (alt) => alt.image ? () => alt.image!(options) : null
+
+ // In video():
+ (alt) => alt.video ? () => alt.video!(options) : null
+
+ // In speak():
+ (alt) => alt.speak ? () => alt.speak!(options) : null
+ ```
+
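The pattern is easy to exercise in isolation. A minimal mock (types simplified; `Provider` here is an illustrative stand-in, not the real `NoosphereProvider`):

```typescript
// Mock of the factory pattern: a provider lacking the capability yields null,
// so the failover loop skips it instead of calling a missing method.
type Provider = { id: string; chat?: (prompt: string) => Promise<string> };

const chatFactory =
  (prompt: string) =>
  (alt: Provider): (() => Promise<string>) | null =>
    alt.chat ? () => alt.chat!(prompt) : null;

const make = chatFactory('hi');
const noLlm: Provider = { id: 'comfyui' }; // no chat() → skipped
const llm: Provider = { id: 'hf', chat: async (p) => p.toUpperCase() };

console.log(make(noLlm)); // null
make(llm)!().then((out) => console.log(out)); // HI
```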
+ **Complete retry timeline example:**
+ ```
+ ai.chat() with provider="pi-ai", maxRetries=2, backoffMs=1000
+
+ Attempt 0:  pi-ai.chat() → RATE_LIMITED
+             wait 1000ms (1000 * 2^0)
+ Attempt 1:  pi-ai.chat() → RATE_LIMITED
+             wait 2000ms (1000 * 2^1)
+ Attempt 2:  pi-ai.chat() → RATE_LIMITED
+             // maxRetries exhausted, RATE_LIMITED allows failover
+ Failover 1: huggingface.chat() → 503 SERVICE_UNAVAILABLE
+ Failover 2: (no more providers with 'llm' modality)
+
+ throw last error (RATE_LIMITED from pi-ai)
+ ```
+
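The backoff schedule in the timeline can be computed directly. A small helper (illustrative, not part of the library) that returns the waits inserted between attempts:

```typescript
// One wait follows each failed attempt except the last,
// which proceeds to failover instead of waiting.
function backoffDelays(maxRetries: number, backoffMs: number): number[] {
  return Array.from({ length: maxRetries }, (_, attempt) => backoffMs * 2 ** attempt);
}

console.log(backoffDelays(2, 1000)); // [1000, 2000], matching the timeline above
```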
+ **Error classification matrix:**
+
+ | Error Code | Same-Provider Retry | Cross-Provider Failover | Typical Cause |
+ |---|---|---|---|
+ | `PROVIDER_UNAVAILABLE` | Yes | Yes | Server down, network error |
+ | `RATE_LIMITED` | Yes | Yes | API quota exceeded |
+ | `TIMEOUT` | Yes | Yes | Slow response |
+ | `GENERATION_FAILED` | Yes | **No** | Bad prompt, model error |
+ | `AUTH_FAILED` | No | No | Wrong API key |
+ | `MODEL_NOT_FOUND` | No | No | Invalid model ID |
+ | `INVALID_INPUT` | No | No | Bad parameters |
+ | `NO_PROVIDER` | No | No | No provider registered |
+
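The matrix reduces to two sets: retryable codes and failover-eligible codes, with `GENERATION_FAILED` in the first but not the second. A sketch (the `classify` helper is illustrative, not exported by noosphere):

```typescript
// Same-provider retry also covers GENERATION_FAILED; failover does not.
const RETRYABLE = new Set(['PROVIDER_UNAVAILABLE', 'RATE_LIMITED', 'TIMEOUT', 'GENERATION_FAILED']);
const FAILOVER_ELIGIBLE = new Set(['PROVIDER_UNAVAILABLE', 'RATE_LIMITED', 'TIMEOUT']);

function classify(code: string): { retry: boolean; failover: boolean } {
  return { retry: RETRYABLE.has(code), failover: FAILOVER_ELIGIBLE.has(code) };
}

console.log(classify('GENERATION_FAILED')); // { retry: true, failover: false }
console.log(classify('AUTH_FAILED'));       // { retry: false, failover: false }
```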
+ ### Model Registry — Internal Data Structures
+
+ The Registry (`src/registry.ts`, 137 lines) maps providers to models and handles model lookups.
+
+ **Internal state:**
+ ```typescript
+ class Registry {
+   // Provider storage: Map<providerId, providerInstance>
+   private providers = new Map<string, NoosphereProvider>();
+   // Example: { 'pi-ai' → PiAiProvider, 'fal' → FalProvider, 'comfyui' → ComfyUIProvider }
+
+   // Model cache: Map<providerId, { models: ModelInfo[], syncedAt: timestamp }>
+   private modelCache = new Map<string, CachedModels>();
+   // Example: {
+   //   'pi-ai' → { models: [246 ModelInfo objects], syncedAt: 1710000000000 },
+   //   'fal' → { models: [867 ModelInfo objects], syncedAt: 1710000000000 },
+   // }
+
+   // Cache TTL in milliseconds (converted from minutes in constructor)
+   private cacheTTLMs: number;
+   // Default: 60 * 60 * 1000 = 3,600,000ms = 1 hour
+ }
+ ```
+
+ **Cache staleness check:**
+ ```typescript
+ isCacheStale(providerId: string): boolean {
+   const cached = this.modelCache.get(providerId);
+   if (!cached) return true; // No cache = stale
+   return Date.now() - cached.syncedAt > this.cacheTTLMs;
+   // Example: if syncedAt was 61 minutes ago and TTL is 60 minutes → stale
+ }
+ ```
+
+ **Model resolution — linear scan across all caches:**
+ ```typescript
+ resolveModel(modelId: string, modality: Modality):
+   { provider: NoosphereProvider; model: ModelInfo } | null {
+
+   // Scan EVERY provider's cached models:
+   for (const [providerId, cached] of this.modelCache) {
+     const model = cached.models.find(
+       (m) => m.id === modelId && m.modality === modality
+     );
+     // Must match BOTH modelId AND modality
+     if (model) {
+       const provider = this.providers.get(providerId);
+       if (provider) return { provider, model };
+     }
+   }
+   return null;
+ }
+ // Performance: O(n) where n = total models across all providers
+ // With 246 Pi-AI + 867 FAL + 3 HuggingFace = ~1116 models to scan
+ // This is fast enough for the use case (called once per request)
+ ```
+
+ **Sync mechanism:**
+ ```typescript
+ async syncAll(): Promise<SyncResult> {
+   const byProvider: Record<string, number> = {};
+   const errors: string[] = [];
+   let synced = 0;
+
+   // Sequential sync (NOT parallel) — one provider at a time:
+   for (const provider of this.providers.values()) {
+     try {
+       const models = await provider.listModels();
+       this.modelCache.set(provider.id, {
+         models,
+         syncedAt: Date.now(),
+       });
+       byProvider[provider.id] = models.length;
+       synced += models.length;
+     } catch (err) {
+       errors.push(`${provider.id}: ${(err as Error).message}`);
+       byProvider[provider.id] = 0;
+       // Note: failed sync does NOT clear existing cache
+     }
+   }
+
+   return { synced, byProvider, errors };
+ }
+ ```
+
+ **Provider info aggregation:**
+ ```typescript
+ getProviderInfos(modality?: Modality): ProviderInfo[] {
+   // Returns summary info for each registered provider:
+   // {
+   //   id: 'pi-ai',
+   //   name: 'pi-ai (LLM Gateway)',
+   //   modalities: ['llm'],
+   //   local: false,
+   //   status: 'online', // Always 'online' — no live ping check
+   //   modelCount: 246, // From cache, or 0 if not synced
+   // }
+ }
+ ```
+
+ ### Usage Tracking — In-Memory Event Store
+
+ The `UsageTracker` (`src/tracking.ts`, 57 lines) records every API call and provides filtered aggregation.
+
+ **Internal state:**
+ ```typescript
+ class UsageTracker {
+   private events: UsageEvent[] = []; // Append-only array
+   private onUsage?: (event: UsageEvent) => void | Promise<void>; // Optional callback
+ }
+ ```
+
+ **Recording flow — every API call creates a UsageEvent:**
+
+ ```typescript
+ // On SUCCESS (in Noosphere.trackUsage):
+ const event: UsageEvent = {
+   modality: result.modality,    // 'llm' | 'image' | 'video' | 'tts'
+   provider: result.provider,    // 'pi-ai', 'fal', etc.
+   model: result.model,          // 'gpt-4o', 'flux-pro', etc.
+   cost: result.usage.cost,      // USD amount (0 for free/local)
+   latencyMs: result.latencyMs,  // Wall-clock milliseconds
+   input: result.usage.input,    // Input tokens or characters
+   output: result.usage.output,  // Output tokens (LLM only)
+   unit: result.usage.unit,      // 'tokens', 'characters', 'free'
+   timestamp: new Date().toISOString(), // ISO 8601
+   success: true,
+   metadata,                     // User-provided metadata passthrough
+ };
+
+ // On FAILURE (in Noosphere.trackError):
+ const event: UsageEvent = {
+   modality, provider,
+   model: model ?? 'unknown',
+   cost: 0,                         // No cost on failure
+   latencyMs: Date.now() - startMs, // Time until failure
+   timestamp: new Date().toISOString(),
+   success: false,
+   error: err instanceof Error ? err.message : String(err),
+   metadata,
+ };
+ ```
+
+ **Query/aggregation — filtered summary:**
+ ```typescript
+ getSummary(options?: UsageQueryOptions): UsageSummary {
+   let filtered = this.events;
+
+   // Time-range filtering:
+   if (options?.since) {
+     const since = new Date(options.since).getTime();
+     filtered = filtered.filter((e) => new Date(e.timestamp).getTime() >= since);
+   }
+   if (options?.until) {
+     const until = new Date(options.until).getTime();
+     filtered = filtered.filter((e) => new Date(e.timestamp).getTime() <= until);
+   }
+
+   // Provider/modality filtering:
+   if (options?.provider) {
+     filtered = filtered.filter((e) => e.provider === options.provider);
+   }
+   if (options?.modality) {
+     filtered = filtered.filter((e) => e.modality === options.modality);
+   }
+
+   // Aggregation:
+   const byProvider: Record<string, number> = {};
+   const byModality = { llm: 0, image: 0, video: 0, tts: 0 };
+   let totalCost = 0;
+
+   for (const event of filtered) {
+     totalCost += event.cost;
+     byProvider[event.provider] = (byProvider[event.provider] ?? 0) + event.cost;
+     byModality[event.modality] += event.cost;
+   }
+
+   return { totalCost, totalRequests: filtered.length, byProvider, byModality };
+ }
+ ```
+
+ **Usage example:**
+ ```typescript
+ // Get all usage:
+ const all = ai.getUsage();
+ // { totalCost: 0.42, totalRequests: 15, byProvider: { 'pi-ai': 0.40, 'fal': 0.02 }, byModality: { llm: 0.40, image: 0.02, video: 0, tts: 0 } }
+
+ // Get usage for last hour, LLM only:
+ const recent = ai.getUsage({
+   since: new Date(Date.now() - 3600000),
+   modality: 'llm',
+ });
+
+ // Get usage for a specific provider:
+ const falUsage = ai.getUsage({ provider: 'fal' });
+
+ // Real-time callback (set in constructor):
+ const ai = new Noosphere({
+   onUsage: (event) => {
+     console.log(`${event.provider}/${event.model}: $${event.cost} in ${event.latencyMs}ms`);
+     // Or: send to analytics, update dashboard, check budget
+   },
+ });
+ ```
+
+ **Important limitations:**
+ - Events are stored **in memory only** — lost on process restart
+ - No deduplication — each retry/failover attempt creates a separate event
+ - `clear()` wipes all history (called by `dispose()`)
+ - The `onUsage` callback is `await`ed — a slow callback blocks the response return
+
+ ### Streaming Architecture
+
+ The `stream()` method (`src/noosphere.ts:73-124`) wraps provider streams with usage tracking:
+
+ ```typescript
+ stream(options: ChatOptions): NoosphereStream {
+   // Returns IMMEDIATELY (synchronous) — no await
+   // The actual initialization happens lazily on first iteration
+
+   let innerStream: NoosphereStream | undefined;
+   let finalResult: NoosphereResult | undefined;
+   let providerRef: NoosphereProvider | undefined;
+
+   // Lazy init — runs on first for-await-of iteration:
+   const ensureInit = async () => {
+     if (!this.initialized) await this.init();
+     if (!providerRef) {
+       providerRef = this.resolveProviderForModality('llm', ...);
+       if (!providerRef.stream) throw new NoosphereError(...);
+       innerStream = providerRef.stream(options);
+     }
+   };
+
+   // Wrapped async iterator with usage tracking:
+   const wrappedIterator = {
+     async *[Symbol.asyncIterator]() {
+       await ensureInit(); // Init on first next()
+       for await (const event of innerStream!) {
+         if (event.type === 'done' && event.result) {
+           finalResult = event.result;
+           await trackUsage(event.result); // Track when complete
+         }
+         yield event; // Pass events through
+       }
+     },
+   };
+
+   return {
+     [Symbol.asyncIterator]: () => wrappedIterator[Symbol.asyncIterator](),
+
+     // result() — consume entire stream and return final result:
+     result: async () => {
+       if (finalResult) return finalResult; // Already consumed
+       for await (const event of wrappedIterator) {
+         if (event.type === 'done') return event.result!;
+         if (event.type === 'error') throw event.error;
+       }
+       throw new NoosphereError('Stream ended without result');
+     },
+
+     // abort() — signal cancellation:
+     abort: () => innerStream?.abort(),
+   };
+ }
+ ```
+
+ **Stream event types:**
+
+ | Event Type | Fields | When |
+ |---|---|---|
+ | `text_delta` | `{ type, delta: string }` | Each text token |
+ | `thinking_delta` | `{ type, delta: string }` | Each reasoning token |
+ | `done` | `{ type, result: NoosphereResult }` | Stream complete |
+ | `error` | `{ type, error: Error }` | Stream failed |
+
+ **Note:** Streaming does NOT use `executeWithRetry()`. If the stream fails, there's no automatic retry or failover. The error is yielded as an `error` event and also tracked via `trackError()`.
+
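Consuming the event types above looks the same whether the stream is real or mocked. A self-contained sketch with a fake stream (the event shapes follow the table; `fakeStream` is an illustrative stand-in for `ai.stream(...)`):

```typescript
// Accumulate text_delta events into the final text, as a consumer would.
type StreamEvent =
  | { type: 'text_delta'; delta: string }
  | { type: 'done'; result: { text: string } };

async function* fakeStream(): AsyncGenerator<StreamEvent> {
  yield { type: 'text_delta', delta: 'Hel' };
  yield { type: 'text_delta', delta: 'lo' };
  yield { type: 'done', result: { text: 'Hello' } };
}

async function collectText(stream: AsyncIterable<StreamEvent>): Promise<string> {
  let text = '';
  for await (const event of stream) {
    if (event.type === 'text_delta') text += event.delta;
  }
  return text;
}

collectText(fakeStream()).then((text) => console.log(text)); // Hello
```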
+ ### Lifecycle Management — dispose()
+
+ ```typescript
+ async dispose(): Promise<void> {
+   // 1. Call dispose() on every registered provider (if implemented):
+   for (const provider of this.registry.getAllProviders()) {
+     if (provider.dispose) {
+       await provider.dispose();
+       // Currently: no built-in provider implements dispose()
+       // This is for custom providers that need cleanup
+     }
+   }
+
+   // 2. Clear the model cache:
+   this.registry.clearCache();
+
+   // 3. Clear usage history:
+   this.tracker.clear();
 
- Every API call (success or failure) records a `UsageEvent`:
-
- ```typescript
- interface UsageEvent {
-   modality: 'llm' | 'image' | 'video' | 'tts';
-   provider: string;
-   model: string;
-   cost: number; // USD
-   latencyMs: number;
-   input?: number; // tokens or characters
-   output?: number; // tokens
-   unit?: string;
-   timestamp: string; // ISO 8601
-   success: boolean;
-   error?: string; // error message if failed
-   metadata?: Record<string, unknown>;
+   // Note: does NOT set initialized=false
+   // After dispose(), the instance is NOT reusable for new requests
 }
 ```