@lloyal-labs/lloyal.node 1.0.4-alpha → 1.0.6-alpha

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
package/README.md CHANGED
@@ -1,11 +1,11 @@
  # lloyal.node
 
- Node.js bindings for [liblloyal](https://github.com/lloyal-ai/liblloyal)—the inference kernel that orchestrates llama.cpp in-process for agentic inference patterns.
+ **Advanced edge inference for Node.js**
 
- ## Installation
+ Inference with forkable state — KV cache, grammar, metrics all clone atomically. Entropy and surprisal mid-generation. Multi-sequence parallel exploration. The control surface llama.cpp exposes, in TypeScript.
 
  ```bash
- npm install lloyal.node
+ npm install @lloyal-labs/lloyal.node
  ```
 
  Prebuilt binaries for 13 platforms:
@@ -19,281 +19,124 @@ Prebuilt binaries for 13 platforms:
  | Windows | x64 | CPU / CUDA / Vulkan |
  | Windows | arm64 | CPU / Vulkan |
 
- Falls back to source build if your platform isn't covered.
+ GPU selection happens at runtime, not install time. See [distribution.md](docs/distribution.md) for details.
 
- ```bash
- LLOYAL_GPU=cuda npm install # NVIDIA
- LLOYAL_GPU=vulkan npm install # AMD/Intel
- LLOYAL_GPU=cpu npm install # Force CPU
- ```
-
- See [DISTRIBUTION.md](./docs/DISTRIBUTION.md) for package details.
-
- ## Quick Start
-
- Complete example with greedy sampling:
-
- ```typescript
- import { createContext } from 'lloyal.node';
-
- async function generate(prompt: string, maxTokens = 100): Promise<string> {
- const ctx = await createContext({
- modelPath: './model.gguf',
- nCtx: 2048,
- nThreads: 4,
- });
+ ---
 
- try {
- const tokens = await ctx.tokenize(prompt);
- await ctx.decode(tokens, 0);
+ ## Examples
 
- const output: number[] = [];
- let pos = tokens.length;
+ Working examples demonstrate each capability:
 
- for (let i = 0; i < maxTokens; i++) {
- const token = ctx.greedySample();
- if (token < 0) break; // EOS
+ | Example | What It Demonstrates |
+ | ----------------------------------------- | ----------------------------------------------------------------------------- |
+ | [`best-of-n/`](./examples/best-of-n/) | Multi-sequence generation, PPL selection, captured logits for fair comparison |
+ | [`speculative/`](./examples/speculative/) | KV forking, draft/verify/accept/reject, `kvCacheRemove` for rejected tokens |
+ | [`entropy/`](./examples/entropy/) | Entropy Decision Tree — `modelEntropy()` mid-generation as control signal |
+ | [`grammar/`](./examples/grammar/) | Pull loop with generators, JSON schema constraints, KV + grammar branching |
+ | [`streaming/`](./examples/streaming/) | Infinite context via BlinkKV, `clearAndReseed`, perplexity tracking |
+ | [`chat/`](./examples/chat/) | Interactive streaming chat |
+ | [`embed/`](./examples/embed/) | Text embeddings extraction |
 
- output.push(token);
- await ctx.decode([token], pos++);
- }
-
- return ctx.detokenize(output);
- } finally {
- ctx.dispose();
- }
- }
-
- const response = await generate('The capital of France is');
- console.log(response);
+ ```bash
+ node examples/best-of-n/best-of-n.mjs
+ node examples/speculative/speculative.mjs
+ node examples/entropy/entropy.mjs
+ node examples/grammar/grammar.mjs
  ```
 
- ## Test-Time Alignment
+ Each example has a README explaining the pattern in depth.
 
- TTA is token-level test-time alignment by exposing logits so TypeScript can apply stateful policies/constraints to the full next-token distribution before sampling—no retraining. Enabling fusion of application state with sampling strategy at every token step. Instead of generating text and validating after, you enforce constraints _during_ generation.
+ ---
 
- This requires two things:
+ ## Core Patterns
 
- 1. **Raw logits** — the probability distribution over all possible next tokens
- 2. **TypeScript sampling** — so your app logic can modify probabilities before selection
+ ### Forkable State
 
- lloyal.node provides the logits. [tsampler](https://github.com/lloyal-ai/tsampler) provides the sampling:
+ KV cache, grammar parser, and perplexity trackers all live behind handles. Handles clone atomically.
 
- ```typescript
- import { createContext } from 'lloyal.node';
- import {
- sampleWithStrategy,
- computeModelEntropy,
- TokenHistoryTracker,
- SamplerWorkspace,
- Xoroshiro128Plus,
- } from '@lloyal/tsampler';
-
- const ctx = await createContext({ modelPath: './model.gguf' });
- const prng = new Xoroshiro128Plus(42); // Deterministic PRNG
- const tokenHistory = new TokenHistoryTracker(64); // For repetition penalties
- const workspace = new SamplerWorkspace(256); // Pre-allocated, zero-alloc hot path
-
- const tokens = await ctx.tokenize(prompt);
- await ctx.decode(tokens, 0);
-
- let pos = tokens.length;
- const output: number[] = [];
-
- while (output.length < maxTokens) {
- const logits = ctx.getLogits();
-
- // === YOUR STEERING LOGIC HERE ===
-
- // Enforce domain rules
- if (currency === 'JPY') {
- logits[DECIMAL_TOKEN] = -Infinity; // JPY has no decimal subdivision
- }
+ **Two forking strategies:**
 
- // Adapt to model confidence
- const entropy = computeModelEntropy(logits);
- const params =
- entropy < 2.0
- ? { topK: 256, temperature: 1.5 } // Low confidence → explore more
- : { topK: 40, temperature: 0.8 }; // High confidence → stay focused
+ | Approach | Method | Use Case |
+ | -------------------- | --------------------------------- | -------------------------------------------- |
+ | **Tag copy** | `kvSeqCopy(src, dst)` | Parallel branches with different seqIds |
+ | **Snapshot/restore** | `kvCacheSave()` / `kvCacheLoad()` | Sequential exploration, return to checkpoint |
 
- // === END STEERING LOGIC ===
+ [`examples/best-of-n/`](./examples/best-of-n/) uses tag copy: each candidate gets its own seqId, and branches run in parallel:
 
- const token = sampleWithStrategy(logits, {
- tokenHistory,
- params,
- workspace,
- prng,
- });
+ ```javascript
+ ctx.kvSeqCopy(0, seqId); // O(1) tag copy, branch diverges on seqId
+ ```
 
- if (token < 0) break;
+ [`examples/grammar/`](./examples/grammar/) uses snapshot/restore — save state, explore branches sequentially, restore between each:
 
- tokenHistory.accept(token);
- output.push(token);
- await ctx.decode([token], pos++);
- }
+ ```javascript
+ const snapshot = await ctx.kvCacheSave(0); // Save checkpoint
+ // ... explore branch ...
+ await ctx.kvCacheLoad(0, snapshot); // Return to checkpoint
  ```
 
- ### Domain Constraints
+ Both approaches also fork grammar state with `cloneSampler()` when grammar constraints are involved.
 
- ```typescript
- // Financial: JPY has no decimal subdivision
- if (currency === 'JPY' && parsingAmount) {
- logits[DECIMAL_TOKEN] = -Infinity;
- DIGIT_TOKENS.forEach((id) => (logits[id] += 2.0));
- }
+ ### Captured Logits
 
- // Legal: Boost required terminology
- if (contractType === 'NDA') {
- CONFIDENTIALITY_TOKENS.forEach((id) => (logits[id] += 5.0));
- }
+ After decode, logits represent P(next_token | context). When forking to multiple sequences, capture logits for fair comparison:
 
- // Medical: Enforce terminology based on actual lab values
- if (glucoseLevel > normalMax) {
- ELEVATED_TOKENS.forEach((id) => (logits[id] += 10.0));
- NORMAL_TOKENS.forEach((id) => (logits[id] = -Infinity));
- }
- ```
+ ```javascript
+ // Capture after prefill
+ const capturedLogits = new Float32Array(ctx.getLogits());
 
- ### Quality Gates
+ // All candidates sample first token from same distribution
+ const token = sampleWithStrategy(capturedLogits, { params, workspace, prng });
 
- ```typescript
- import { computeModelSurprisal, RollingPerplexity } from '@lloyal/tsampler';
-
- const ppl = new RollingPerplexity();
-
- while (generating) {
- const logits = ctx.getLogits();
- const token = sampleWithStrategy(logits, {
- tokenHistory,
- params,
- workspace,
- prng,
- });
-
- const surprisal = computeModelSurprisal(logits, token);
- ppl.addSurprisal(surprisal);
-
- if (ppl.ppl() > 50) {
- // Generation quality degrading — options:
- // 1. Trigger RAG retrieval for more context
- // 2. Prune KV cache (evict stale context)
- // 3. Early stop and retry with different prompt
- }
-
- // ...
- }
+ // Compute surprisal from captured logits (native C++)
+ const surprisal = ctx.modelSurprisal(token, 'nats', capturedLogits);
  ```
 
- ### Entropy-Adaptive Retrieval
+ See [`examples/best-of-n/`](./examples/best-of-n/) for the full pattern.
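As an aside on the math: surprisal is the negative log-probability of the chosen token under softmax(logits). A standalone JavaScript sketch of that computation (illustrative helper, not part of the package API):

```javascript
// surprisal(token) = -log p(token), where p = softmax(logits).
// Max-subtraction (log-sum-exp) keeps the computation numerically stable.
function surprisalNats(logits, tokenId) {
  let max = -Infinity;
  for (const l of logits) max = Math.max(max, l);
  let sumExp = 0;
  for (const l of logits) sumExp += Math.exp(l - max);
  const logProb = logits[tokenId] - max - Math.log(sumExp);
  return -logProb;
}

// Two equally likely tokens: p = 0.5, so surprisal = ln 2 ≈ 0.693 nats
console.log(surprisalNats(new Float32Array([0, 0]), 0));
```

A token the model considered near-certain yields surprisal near zero; a low-probability token yields a large value, which is what makes it usable as a per-token quality signal.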
 
- ```typescript
- import { computeModelEntropy } from '@lloyal/tsampler';
+ ### Entropy as Control Signal
 
- while (generating) {
- const logits = ctx.getLogits();
- const entropy = computeModelEntropy(logits);
+ Model uncertainty mid-generation enables dynamic behavior:
 
- if (entropy > 5.0) {
- // Model is uncertain — retrieve relevant context
- const context = await rag.retrieve(currentQuery);
- await injectContext(ctx, context);
- continue; // Re-evaluate with new context
- }
+ ```javascript
+ const entropy = ctx.modelEntropy('bits');
 
- const token = sampleWithStrategy(logits, {
- tokenHistory,
- params,
- workspace,
- prng,
- });
- // ...
+ if (entropy > 4.0) {
+ // High uncertainty — model is guessing
+ // Trigger retrieval, reduce temperature, or branch
  }
  ```
 
- ## Why TypeScript Sampling?
-
- | | Native C++ | TypeScript (tsampler) |
- | ----------------------- | ------------ | --------------------- |
- | Speed | ~0.3ms/token | ~3-5ms/token |
- | Overhead vs 50ms decode | — | ~6-10% |
- | Logit steering | ❌ | ✅ |
- | Adaptive strategies | ❌ | ✅ |
- | OTA updates | Rebuild app | Ship new JS |
- | Debugging | printf | Full inspect |
-
- The overhead is imperceptible. A 50ms decode dominates; 3ms sampling is noise.
+ See [`examples/entropy/`](./examples/entropy/) for entropy-triggered sampling strategies.
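For intuition, the quantity `modelEntropy('bits')` reports is Shannon entropy over softmax(logits). Sketched standalone (illustrative math, not the package API):

```javascript
// H = -Σ p·log2(p), with p = softmax(logits); max-subtraction for stability
function entropyBits(logits) {
  let max = -Infinity;
  for (const l of logits) max = Math.max(max, l);
  let sumExp = 0;
  for (const l of logits) sumExp += Math.exp(l - max);
  let h = 0;
  for (const l of logits) {
    const p = Math.exp(l - max) / sumExp;
    if (p > 0) h -= p * Math.log2(p);
  }
  return h;
}

// Uniform over 16 tokens → log2(16) = 4 bits; a sharply peaked distribution → near 0
console.log(entropyBits(new Float32Array(16)));
```

This is why thresholds like `entropy > 4.0` are meaningful: 4 bits corresponds to the model spreading its probability mass over roughly 16 equally plausible tokens.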
 
- ### tsampler Capabilities
+ ### Pull Loop with Generators
 
- [tsampler](https://github.com/lloyal-ai/tsampler) provides llama.cpp sampling parity in pure TypeScript:
+ For branching mid-generation, generators provide natural backpressure:
 
- **Sampling methods:** greedy, top-k, top-p, min-p, typical-p, top-n-sigma, temperature, mirostat v1/v2
-
- **Penalties:** repetition, frequency, presence (exact llama.cpp formulas)
-
- **Infrastructure:**
-
- - `Xoroshiro128Plus` — deterministic PRNG, reproducible generations
- - `TokenHistoryTracker` — sliding window for penalty calculations
- - `SamplerWorkspace` — pre-allocated buffers, zero-alloc hot path
- - `computeModelEntropy()` — Shannon entropy in nats
- - `computeModelSurprisal()` — per-token surprisal
- - `RollingPerplexity` — streaming perplexity tracking
-
- ### Native References
-
- lloyal.node includes native C++ implementations for validation:
-
- ```typescript
- // TypeScript implementation
- const tsEntropy = computeModelEntropy(logits);
-
- // Native reference (C++)
- const nativeEntropy = ctx.computeEntropy();
+ ```javascript
+ function* tokenGenerator(ctx, grammarHandle) {
+ while (true) {
+ const logits = ctx.getLogits();
+ ctx.applySampler(grammarHandle, logits);
+ const token = ctx.sample({ temperature: 0.7 });
+ if (ctx.isStopToken(token)) return;
+ ctx.acceptSamplerToken(grammarHandle, token);
+ yield { token, text: ctx.tokenToText(token) };
+ }
+ }
 
- // Should match within float precision
- console.assert(Math.abs(tsEntropy - nativeEntropy) < 1e-5);
+ // Consumer controls pace; stop at the branch point
+ for (const { token, text } of gen) {
+ if (accumulated.includes('"city"')) break; // Pause here, branch
+ }
  ```
 
- Available references:
-
- - `ctx.computeEntropy()` — Shannon entropy in nats
- - `ctx.greedySample()` — argmax token ID
-
- Build with confidence. Validate against native. Deploy TypeScript.
-
- ## Embeddings
-
- lloyal.node supports embedding extraction with configurable pooling:
+ See [`examples/grammar/`](./examples/grammar/) for the full pull loop pattern.
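The pull semantics can be seen without a native context at all. A mock token source (hypothetical stand-in for the grammar-constrained generator) shows how the consumer suspends the producer; note that resuming later requires manual `next()` calls, since `break` inside `for...of` closes a generator via its `return()` method:

```javascript
// Mock producer standing in for tokenGenerator(ctx, grammarHandle)
function* mockTokens(pieces) {
  for (const piece of pieces) yield { text: piece };
}

const gen = mockTokens(['{"', 'city', '"', ': ', '"Par']);
let accumulated = '';

// Manual next() keeps the generator suspended at the branch point
let step = gen.next();
while (!step.done) {
  accumulated += step.value.text;
  if (accumulated.includes('"city"')) break; // paused, nothing further pulled
  step = gen.next();
}

const resumed = gen.next().value.text; // resumes exactly where the loop paused
console.log(accumulated, resumed);
```

Because no token is produced until the consumer asks for it, the loop can fork KV or grammar state at the pause point before pulling again.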
 
- ```typescript
- import { createContext } from 'lloyal.node';
-
- const ctx = await createContext({
- modelPath: './nomic-embed-text.gguf',
- embeddings: true,
- poolingType: 1, // 0=NONE, 1=MEAN, 2=CLS, 3=LAST
- });
-
- async function embed(text: string): Promise<Float32Array> {
- const tokens = await ctx.tokenize(text);
- await ctx.encode(tokens);
-
- const embedding = ctx.getEmbeddings(true); // L2-normalized
- await ctx.kvCacheClear(); // Reset for next text
-
- return embedding;
- }
-
- const vec = await embed('Document to embed');
- console.log(`Dimension: ${ctx.getEmbeddingDimension()}`); // e.g., 768
- ```
+ ---
 
  ## API Reference
 
- **📖 [Full API Documentation](https://lloyal-ai.github.io/lloyal.node)** - Complete reference with examples and type definitions
-
  ### Context Creation
 
  ```typescript
@@ -301,55 +144,88 @@ const ctx = await createContext({
  modelPath: string, // Path to .gguf file (required)
  nCtx?: number, // Context size (default: 2048)
  nThreads?: number, // CPU threads (default: 4)
- nGpuLayers?: number, // Layers to offload to GPU (default: 0)
  embeddings?: boolean, // Enable embedding mode (default: false)
- poolingType?: number // 0=NONE, 1=MEAN, 2=CLS, 3=LAST (default: 0)
+ poolingType?: number, // 0=NONE, 1=MEAN, 2=CLS, 3=LAST
+ nSeqMax?: number, // Max parallel sequences (default: 1)
  });
  ```
 
- ### Inference
-
- | Method | Returns | Description |
- | -------------------------- | ------------------- | ----------------------------------------------------- |
- | `tokenize(text)` | `Promise<number[]>` | Text → token IDs |
- | `detokenize(tokens)` | `Promise<string>` | Token IDs → text |
- | `decode(tokens, position)` | `Promise<void>` | Forward pass, populates KV cache |
- | `getLogits()` | `Float32Array` | Vocabulary-sized probability distribution (zero-copy) |
-
- ### Native References
-
- | Method | Returns | Description |
- | ------------------ | -------- | ----------------------- |
- | `greedySample()` | `number` | Argmax token ID |
- | `computeEntropy()` | `number` | Shannon entropy in nats |
+ ### Core Methods
+
+ | Method | Returns | Description |
+ | ----------------------------- | ------------------- | ------------------------------- |
+ | `tokenize(text)` | `Promise<number[]>` | Text → token IDs |
+ | `detokenize(tokens)` | `Promise<string>` | Token IDs → text |
+ | `tokenToText(token)` | `string` | Single token text (streaming) |
+ | `decode(tokens, pos, seqId?)` | `Promise<void>` | Forward pass, updates KV cache |
+ | `sample(params?)` | `number` | Sample next token |
+ | `isStopToken(token)` | `boolean` | Check for EOS token |
+ | `getLogits()` | `Float32Array` | Raw logits (zero-copy view) |
+
+ ### KV Cache
+
+ | Method | Returns | Description |
+ | ---------------------------------- | ----------------- | ------------------------------ |
+ | `kvCacheSize(seqId?)` | `number` | Tokens in cache |
+ | `kvCacheClear()` | `Promise<void>` | Clear all sequences |
+ | `kvCacheRemove(seqId, start, end)` | `Promise<void>` | Remove token range |
+ | `kvCacheSave(seqId?)` | `Promise<Buffer>` | Snapshot state |
+ | `kvCacheLoad(seqId, state)` | `Promise<void>` | Restore state |
+ | `kvSeqCopy(src, dst)` | `void` | Copy sequence (tag copy, O(1)) |
+ | `kvSeqKeep(seqId)` | `void` | Keep only one sequence |
+ | `clearAndReseed(sinks, tail)` | `Promise<void>` | BlinkKV pattern |
+
+ ### Grammar (Handle-Based)
+
+ | Method | Returns | Description |
+ | -------------------------------- | -------- | --------------------------- |
+ | `jsonSchemaToGrammar(schema)` | `string` | Schema → GBNF |
+ | `createSampler(grammarStr)` | `number` | Create grammar handle |
+ | `cloneSampler(handle)` | `number` | Clone grammar state |
+ | `applySampler(handle, logits)` | `void` | Apply constraints to logits |
+ | `acceptSamplerToken(handle, id)` | `void` | Advance parser state |
+ | `freeSamplerHandle(handle)` | `void` | Release grammar handle |
+
+ ### Metrics
+
+ | Method | Returns | Description |
+ | --------------------------------------- | --------------- | ------------------------------------------ |
+ | `modelEntropy(base?, logits?)` | `number` | Distribution entropy (bits/nats) |
+ | `modelSurprisal(token, base?, logits?)` | `number` | Token surprisal (supports captured logits) |
+ | `createPerplexityTracker()` | `TrackerHandle` | Create tracker (forkable) |
+ | `clonePerplexityTracker(handle)` | `TrackerHandle` | Clone tracker state |
+ | `addSurprisal(handle, value)` | `void` | Add to tracker |
+ | `getPerplexity(handle)` | `number` | Get current PPL |
+ | `freePerplexityTracker(handle)` | `void` | Release tracker |
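The tracker arithmetic assumed here is the standard one: perplexity is the exponential of the mean surprisal (in nats) accumulated so far. A standalone sketch of that bookkeeping (illustrative, not the native handle-based tracker):

```javascript
// ppl = exp( (1/n) Σ surprisal_i ), with surprisals in nats
class RollingPerplexity {
  constructor() { this.sum = 0; this.n = 0; }
  addSurprisal(nats) { this.sum += nats; this.n += 1; }
  ppl() { return this.n === 0 ? NaN : Math.exp(this.sum / this.n); }
}

// If every token had probability 1/8 (surprisal ln 8), perplexity is exactly 8
const tracker = new RollingPerplexity();
tracker.addSurprisal(Math.log(8));
tracker.addSurprisal(Math.log(8));
console.log(tracker.ppl());
```

Because the state is just a running sum and a count, cloning a tracker (as `clonePerplexityTracker` does) is cheap, which is what makes per-branch perplexity tracking practical.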
 
  ### Embeddings
 
- | Method | Returns | Description |
- | --------------------------- | --------------- | ------------------------------------------ |
- | `encode(tokens)` | `Promise<void>` | Forward pass for embedding extraction |
- | `getEmbeddings(normalize?)` | `Float32Array` | Embedding vector, optionally L2-normalized |
- | `getEmbeddingDimension()` | `number` | Vector dimension |
- | `kvCacheClear()` | `Promise<void>` | Clear KV cache between texts |
+ | Method | Returns | Description |
+ | --------------------------- | --------------- | --------------------------- |
+ | `encode(tokens)` | `Promise<void>` | Forward pass for embeddings |
+ | `getEmbeddings(normalize?)` | `Float32Array` | Extract embedding vector |
+ | `getEmbeddingDimension()` | `number` | Vector dimension |
 
  ### Lifecycle
 
- | Method | Description |
- | ----------- | ----------------------------------------------------- |
- | `dispose()` | Free native resources. **Required** — call when done. |
+ | Method | Description |
+ | ----------- | ------------------------------------ |
+ | `dispose()` | Free native resources (**required**) |
+
+ ---
 
- ## LLoyal Ecosystem
+ ## Ecosystem
 
- | Package | Language | What it does |
- | ------------------------------------------------------- | ------------ | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
- | [liblloyal](https://github.com/lloyal-ai/liblloyal) | C++ | Inference kernel. Orchestrates llama.cpp with composable primitives: tokenization, decoding, KV cache, sampling chains, metrics, embeddings. Plus `branch.hpp` / `lease.hpp` for state forking and SMMA. |
- | **lloyal.node** | N-API | Node.js bindings. Zero-copy logits, native references for validation. |
- | [tsampler](https://github.com/lloyal-ai/tsampler) | TypeScript | Sampling library with llama.cpp parity. All filters, penalties, entropy metrics. Plugin for lloyal.node—consumes logits, returns tokens. |
- | [nitro-llama](https://github.com/lloyal-ai/nitro-llama) | React Native | Mobile bindings via Nitro Modules. Same liblloyal primitives on iOS/Android. |
+ | Package | Runtime | Description |
+ | ------------------------------------------------------- | ------------ | --------------------------------- |
+ | [liblloyal](https://github.com/lloyal-ai/liblloyal) | C++ | Header-only inference kernel |
+ | **lloyal.node** | Node.js | This package |
+ | [nitro-llama](https://github.com/lloyal-ai/nitro-llama) | React Native | Mobile bindings via Nitro Modules |
+ | [tsampler](https://github.com/lloyal-ai/tsampler) | TypeScript | Reference sampler implementation |
 
  ## Contributing
 
- See [CONTRIBUTING.md](./CONTRIBUTING.md) for development setup, build instructions, and release process.
+ See [CONTRIBUTING.md](./CONTRIBUTING.md) for development setup and release process.
 
  ## License
 
package/lib/index.d.ts CHANGED
@@ -4,6 +4,48 @@
  * N-API bindings for liblloyal - Node.js native addon for llama.cpp inference
  */
 
+ /**
+ * GPU variant for binary loading
+ *
+ * Specifies which GPU-accelerated binary to load:
+ * - 'default': CPU-only (works everywhere)
+ * - 'cuda': NVIDIA CUDA (requires libcudart.so/cudart64.dll)
+ * - 'vulkan': Vulkan (AMD/Intel/NVIDIA, requires Vulkan runtime)
+ *
+ * If the requested variant is unavailable (package not installed or
+ * runtime libraries missing), loading automatically falls back to CPU.
+ */
+ export type GpuVariant = 'default' | 'cuda' | 'vulkan';
+
+ /**
+ * Options for binary loading
+ *
+ * Controls which native binary variant is loaded when creating a context.
+ * Use this for explicit GPU variant selection with automatic fallback.
+ */
+ export interface LoadOptions {
+ /**
+ * GPU variant to use
+ *
+ * - 'cuda': NVIDIA CUDA (requires libcudart.so)
+ * - 'vulkan': Vulkan (AMD/Intel/NVIDIA)
+ * - 'default' or undefined: CPU only
+ *
+ * If the requested variant is unavailable (missing runtime libraries),
+ * automatically falls back to CPU with a console warning.
+ *
+ * @example
+ * ```typescript
+ * // Request CUDA with automatic fallback to CPU
+ * const ctx = await createContext(
+ * { modelPath: './model.gguf' },
+ * { gpuVariant: 'cuda' }
+ * );
+ * ```
+ */
+ gpuVariant?: GpuVariant;
+ }
+
  /**
  * Pooling type for embedding extraction
  */
@@ -867,13 +909,15 @@ export interface SessionContext {
  * - High surprisal: Model didn't expect this token (low probability)
  *
  * Call after decode() to compute surprisal for any token based on
- * the current logits distribution.
+ * the current logits distribution, or pass captured logits for
+ * offline computation (e.g., best-of-n scoring from prefill logits).
  *
  * @param pickedTokenId - Token ID to compute surprisal for
  * @param base - Logarithm base: "nats" (default) or "bits"
+ * @param logits - Optional Float32Array of logits (uses current context logits if omitted)
  * @returns Surprisal value in specified base
  *
- * @example
+ * @example Current context logits (default)
  * ```typescript
  * await ctx.decode(tokens, position);
  * const token = ctx.sample();
@@ -881,9 +925,18 @@ export interface SessionContext {
  * console.log(`Model surprise: ${surprisal.toFixed(2)} bits`);
  * ```
  *
- * COST: O(1) - direct probability lookup from logits
+ * @example Captured/arbitrary logits (for best-of-n, verification, etc.)
+ * ```typescript
+ * // Capture logits after prefill
+ * const capturedLogits = new Float32Array(ctx.getLogits());
+ *
+ * // Later: compute surprisal from captured logits
+ * const surprisal = ctx.modelSurprisal(token, "nats", capturedLogits);
+ * ```
+ *
+ * COST: O(n_vocab) - softmax normalization required
  */
- modelSurprisal(pickedTokenId: number, base?: 'nats' | 'bits'): number;
+ modelSurprisal(pickedTokenId: number, base?: 'nats' | 'bits', logits?: Float32Array): number;
 
  /**
  * Compute entropy of the entire logits distribution.
@@ -892,12 +945,14 @@ export interface SessionContext {
  * - Low entropy: Model is confident (peaked distribution)
  * - High entropy: Model is uncertain (flat distribution)
  *
- * Call after decode() to analyze the current prediction distribution.
+ * Call after decode() to analyze the current prediction distribution,
+ * or pass captured logits for offline analysis.
  *
  * @param base - Logarithm base: "nats" (default), "bits", or "base10"
+ * @param logits - Optional Float32Array of logits (uses current context logits if omitted)
  * @returns Entropy value in specified base
  *
- * @example
+ * @example Current context logits (default)
  * ```typescript
  * await ctx.decode(tokens, position);
  * const entropy = ctx.modelEntropy("bits");
@@ -906,9 +961,15 @@ export interface SessionContext {
  * }
  * ```
  *
+ * @example Captured/arbitrary logits
+ * ```typescript
+ * const capturedLogits = new Float32Array(ctx.getLogits());
+ * const entropy = ctx.modelEntropy("nats", capturedLogits);
+ * ```
+ *
  * COST: O(n_vocab) - must sum over all token probabilities
  */
- modelEntropy(base?: 'nats' | 'bits'): number;
+ modelEntropy(base?: 'nats' | 'bits', logits?: Float32Array): number;
 
  /**
  * Create a new perplexity tracker.
@@ -1304,9 +1365,14 @@ export interface SessionContext {
  /**
  * Create a new inference context
  *
+ * Loads the appropriate native binary (with automatic GPU fallback) and
+ * creates an inference context for the specified model.
+ *
  * @param options Context creation options
+ * @param loadOptions Optional binary loading options (GPU variant selection)
  * @returns Promise resolving to SessionContext instance
- * @example
+ *
+ * @example Basic usage
  * ```typescript
  * const ctx = await createContext({
  * modelPath: './model.gguf',
@@ -1322,8 +1388,58 @@ export interface SessionContext {
  * ctx.dispose();
  * }
  * ```
+ *
+ * @example With GPU variant selection
+ * ```typescript
+ * // Request CUDA - falls back to CPU if unavailable
+ * const ctx = await createContext(
+ * { modelPath: './model.gguf', nCtx: 4096 },
+ * { gpuVariant: 'cuda' }
+ * );
+ * ```
+ *
+ * @example Using environment variable
+ * ```typescript
+ * // Set LLOYAL_GPU=cuda before running
+ * // createContext will automatically use CUDA if available
+ * const ctx = await createContext({ modelPath: './model.gguf' });
+ * ```
+ */
+ export function createContext(
+ options: ContextOptions,
+ loadOptions?: LoadOptions
+ ): Promise<SessionContext>;
+
+ /**
+ * Load native binary for a specific GPU variant
+ *
+ * Loads the appropriate platform-specific binary with automatic fallback:
+ * 1. Try requested GPU variant (if specified)
+ * 2. Fall back to default (CPU) platform package
+ * 3. Fall back to local build (development: build/Release/lloyal.node)
+ *
+ * Use this for advanced scenarios where you need direct binary access
+ * or want to check variant availability before creating a context.
+ *
+ * @param variant GPU variant: 'cuda', 'vulkan', or undefined for CPU
+ * @returns Native binary module with createContext method
+ * @throws Error if no binary available for the current platform
+ *
+ * @example
+ * ```typescript
+ * // Load default (CPU) binary
+ * const binary = loadBinary();
+ *
+ * // Load CUDA binary (falls back to CPU if unavailable)
+ * const binary = loadBinary('cuda');
+ *
+ * // Create context from loaded binary
+ * const ctx = await binary.createContext({ modelPath: './model.gguf' });
+ * ```
  */
- export function createContext(options: ContextOptions): Promise<SessionContext>;
+ export function loadBinary(variant?: GpuVariant): {
+ createContext(options: ContextOptions): Promise<SessionContext>;
+ };
 
  /**
  * Safe logits access with automatic lifetime management
package/lib/index.js CHANGED
@@ -1,6 +1,3 @@
- const path = require('path');
- const binary = require('node-gyp-build')(path.join(__dirname, '..'));
-
  /**
  * liblloyal-node - Thin N-API wrapper over liblloyal
  *
@@ -9,7 +6,7 @@ const binary = require('node-gyp-build')(path.join(__dirname, '..'));
  *
  * @example
  * ```js
- * const { createContext, withLogits } = require('lloyal.node');
+ * const { createContext, withLogits } = require('@lloyal-labs/lloyal.node');
  *
  * const ctx = await createContext({
  * modelPath: './model.gguf',
@@ -24,7 +21,7 @@ const binary = require('node-gyp-build')(path.join(__dirname, '..'));
  * await ctx.decode(tokens, 0);
  *
  * // Safe logits access (Runtime Borrow Checker pattern)
- * const entropy = await withLogits(ctx, (logits) => {
+ * const entropy = withLogits(ctx, (logits) => {
  * // logits is valid here - use synchronously only!
  * return computeEntropy(logits);
  * });
@@ -36,7 +33,120 @@ const binary = require('node-gyp-build')(path.join(__dirname, '..'));
  * // Cleanup
  * ctx.dispose();
  * ```
+ *
+ * @example GPU variant selection
+ * ```js
+ * // Option 1: Environment variable (affects all contexts)
+ * // Set LLOYAL_GPU=cuda before running
+ *
+ * // Option 2: Per-context selection (recommended)
+ * const ctx = await createContext(
+ * { modelPath: './model.gguf', nCtx: 4096 },
+ * { gpuVariant: 'cuda' } // Falls back to CPU if CUDA unavailable
+ * );
+ * ```
+ */
+
+ /**
+ * Platform package naming: @lloyal-labs/lloyal.node-{platform}-{arch}[-{gpu}]
+ * @param {string} [variant] - GPU variant: 'cuda', 'vulkan', or undefined for CPU
+ * @returns {string} Platform package name
+ */
+ const getPlatformPackageName = (variant) => {
+ const platform = process.platform;
+ const arch = process.arch;
+ // cpu/metal/default = no suffix, cuda/vulkan = suffix
+ const noSuffix = !variant || variant === 'default' || variant === 'cpu' || variant === 'metal';
+ const suffix = noSuffix ? '' : `-${variant}`;
+ return `@lloyal-labs/lloyal.node-${platform}-${arch}${suffix}`;
+ };
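The naming scheme is easy to check in isolation. A pure-function restatement (hypothetical helper that takes platform and arch explicitly instead of reading `process`, for illustration only):

```javascript
// @lloyal-labs/lloyal.node-{platform}-{arch}[-{gpu}]
// cpu/metal/default variants get no suffix; cuda/vulkan are suffixed
const packageNameFor = (platform, arch, variant) => {
  const noSuffix = !variant || variant === 'default' || variant === 'cpu' || variant === 'metal';
  return `@lloyal-labs/lloyal.node-${platform}-${arch}${noSuffix ? '' : `-${variant}`}`;
};

console.log(packageNameFor('linux', 'x64', 'cuda'));     // @lloyal-labs/lloyal.node-linux-x64-cuda
console.log(packageNameFor('darwin', 'arm64', 'metal')); // @lloyal-labs/lloyal.node-darwin-arm64
```

Metal maps to the unsuffixed package because on macOS the default binary already carries Metal support, so only the cross-vendor variants (cuda, vulkan) need distinct packages.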
+
+ /**
+ * Try to load a platform package, return null on failure.
+ * Failures include: package not installed, missing GPU runtime libs (dlopen fails),
+ * or module doesn't export expected interface.
+ * @param {string} packageName - Package name to load
+ * @param {boolean} [verbose=false] - Log failure reasons
+ * @returns {object|null} The native binary module or null
  */
+ const tryLoadPackage = (packageName, verbose = false) => {
+   try {
+     const mod = require(packageName);
+     // Validate it's actually a native module with expected exports
+     if (mod && typeof mod.createContext === 'function') {
+       return mod;
+     }
+     if (verbose) {
+       console.warn(`[lloyal.node] ${packageName} loaded but missing createContext export`);
+     }
+     return null;
+   } catch (e) {
+     if (verbose) {
+       console.warn(`[lloyal.node] Failed to load ${packageName}: ${e.message}`);
+     }
+     return null;
+   }
+ };
+
+ /**
+ * Load the native binary with automatic fallback.
+ *
+ * Loading priority:
+ *   1. Requested GPU variant (if specified)
+ *   2. Default platform package (CPU)
+ *   3. Local build (development: build/Release/lloyal.node)
+ *
+ * @param {string} [variant] - GPU variant: 'cuda', 'vulkan', or undefined for CPU
+ * @returns {object} The native binary module
+ * @throws {Error} If no binary can be loaded
+ */
+ const loadBinary = (variant) => {
+   // Use env var if no variant specified
+   variant = variant ?? process.env.LLOYAL_GPU;
+   // LLOYAL_NO_FALLBACK=1 disables fallback (for CI testing specific packages)
+   const noFallback = process.env.LLOYAL_NO_FALLBACK === '1';
+
+   // 1. Try requested variant (if specified)
+   if (variant && variant !== 'default') {
+     const pkgName = getPlatformPackageName(variant);
+     const binary = tryLoadPackage(pkgName, true); // verbose=true to see errors
+     if (binary) return binary;
+
+     if (noFallback) {
+       throw new Error(
+         `[lloyal.node] GPU variant "${variant}" failed to load. ` +
+         `Package: ${pkgName}. Check that runtime libraries are available.`
+       );
+     }
+     console.warn(`[lloyal.node] GPU variant "${variant}" unavailable, falling back to CPU`);
+   }
+
+   // 2. Try default platform package (CPU)
+   const defaultPkg = getPlatformPackageName();
+   const binary = tryLoadPackage(defaultPkg, true); // verbose=true
+   if (binary) return binary;
+
+   // 3. Try local build (development)
+   try {
+     return require('../build/Release/lloyal.node');
+   } catch (e) {
+     // ignore
+   }
+
+   throw new Error(
+     `No lloyal.node binary found for ${process.platform}-${process.arch}. ` +
+     `Tried: ${variant ? getPlatformPackageName(variant) + ', ' : ''}${defaultPkg}`
+   );
+ };
+
+ // Default binary (loaded lazily on first use)
+ let _binary = null;
+ const getBinary = () => {
+   if (!_binary) {
+     _binary = loadBinary(process.env.LLOYAL_GPU);
+   }
+   return _binary;
+ };

  /**
  * Safe logits access with Runtime Borrow Checker pattern
@@ -97,25 +207,54 @@ module.exports = {
  /**
  * Create a new inference context
  *
- * @param {Object} options
- * @param {string} options.modelPath - Path to .gguf model file
- * @param {number} [options.nCtx=2048] - Context size
- * @param {number} [options.nThreads=4] - Number of threads
- * @returns {Promise<SessionContext>}
+ * @param {ContextOptions} options - Context configuration
+ * @param {LoadOptions} [loadOptions] - Binary loading options
+ * @returns {Promise<SessionContext>} The inference context
+ *
+ * @example
+ * ```js
+ * // Basic usage
+ * const ctx = await createContext({
+ *   modelPath: './model.gguf',
+ *   nCtx: 2048,
+ *   nThreads: 4
+ * });
+ *
+ * // With GPU variant
+ * const ctx = await createContext(
+ *   { modelPath: './model.gguf' },
+ *   { gpuVariant: 'cuda' }
+ * );
+ * ```
  */
- createContext: async (options) => {
-   // For now, createContext is synchronous in C++
-   // Wrap in Promise for future async model loading
+ createContext: async (options, loadOptions) => {
+   const variant = loadOptions?.gpuVariant || process.env.LLOYAL_GPU;
+   const binary = variant ? loadBinary(variant) : getBinary();
    return binary.createContext(options);
  },

  /**
- * Safe logits access with Runtime Borrow Checker pattern
+ * Load binary for a specific GPU variant.
+ * Useful for checking variant availability before creating context.
+ *
+ * @param {string} [variant] - 'cuda', 'vulkan', or undefined for CPU
+ * @returns {object} Native binary module
+ * @throws {Error} If no binary available for platform
+ *
+ * @example
+ * ```js
+ * // Load default (CPU) binary
+ * const binary = loadBinary();
  *
- * Ensures logits are only accessed synchronously within the callback.
+ * // Load CUDA binary (falls back to CPU if unavailable)
+ * const binary = loadBinary('cuda');
+ * ```
+ */
+ loadBinary,
+
+ /**
+ * Safe logits access with Runtime Borrow Checker pattern.
  * See function JSDoc for full documentation.
  */
  withLogits,
-
- SessionContext: binary.SessionContext
 };
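The `getPlatformPackageName` helper in the diff above maps (platform, arch, GPU variant) to a scoped package name, with `cpu`/`metal`/`default` collapsing onto the base platform package. A minimal standalone sketch of that naming convention (reimplemented here for illustration; `platformPackageName` is a hypothetical name, not exported by the package):

```javascript
// Sketch of the naming convention: @lloyal-labs/lloyal.node-{platform}-{arch}[-{gpu}]
// cpu/metal/default map to the base package; cuda/vulkan get a suffix.
const platformPackageName = (platform, arch, variant) => {
  const noSuffix = !variant || ['default', 'cpu', 'metal'].includes(variant);
  const suffix = noSuffix ? '' : `-${variant}`;
  return `@lloyal-labs/lloyal.node-${platform}-${arch}${suffix}`;
};

console.log(platformPackageName('linux', 'x64', 'cuda'));     // @lloyal-labs/lloyal.node-linux-x64-cuda
console.log(platformPackageName('darwin', 'arm64', 'metal')); // @lloyal-labs/lloyal.node-darwin-arm64
```

Note that Metal needs no suffix because the darwin builds ship with Metal support by default, so there is no separate `-metal` package to resolve.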
package/package.json CHANGED
@@ -1,6 +1,6 @@
  {
    "name": "@lloyal-labs/lloyal.node",
-   "version": "1.0.4-alpha",
+   "version": "1.0.6-alpha",
    "description": "Node.js client for liblloyal+llama.cpp",
    "main": "lib/index.js",
    "types": "lib/index.d.ts",
@@ -10,7 +10,6 @@
    },
    "scripts": {
      "download-models": "bash scripts/download-test-models.sh",
-     "install": "node scripts/install.js",
      "build": "node scripts/build.js",
      "build:debug": "cmake-js compile --debug",
      "rebuild": "cmake-js rebuild",
@@ -43,8 +42,8 @@
    },
    "homepage": "https://github.com/lloyal-ai/lloyal.node#readme",
    "dependencies": {
-     "node-addon-api": "^8.5.0",
-     "node-gyp-build": "^4.8.4"
+     "@lloyal-labs/tsampler": "^0.2.0",
+     "node-addon-api": "^8.5.0"
    },
    "devDependencies": {
      "cmake-js": "^7.4.0",
@@ -52,19 +51,19 @@
      "typedoc": "^0.27.5"
    },
    "optionalDependencies": {
-     "@lloyal-labs/lloyal.node-darwin-arm64": "1.0.4-alpha",
-     "@lloyal-labs/lloyal.node-darwin-x64": "1.0.4-alpha",
-     "@lloyal-labs/lloyal.node-linux-arm64": "1.0.4-alpha",
-     "@lloyal-labs/lloyal.node-linux-arm64-cuda": "1.0.4-alpha",
-     "@lloyal-labs/lloyal.node-linux-arm64-vulkan": "1.0.4-alpha",
-     "@lloyal-labs/lloyal.node-linux-x64": "1.0.4-alpha",
-     "@lloyal-labs/lloyal.node-linux-x64-cuda": "1.0.4-alpha",
-     "@lloyal-labs/lloyal.node-linux-x64-vulkan": "1.0.4-alpha",
-     "@lloyal-labs/lloyal.node-win32-arm64": "1.0.4-alpha",
-     "@lloyal-labs/lloyal.node-win32-arm64-vulkan": "1.0.4-alpha",
-     "@lloyal-labs/lloyal.node-win32-x64": "1.0.4-alpha",
-     "@lloyal-labs/lloyal.node-win32-x64-cuda": "1.0.4-alpha",
-     "@lloyal-labs/lloyal.node-win32-x64-vulkan": "1.0.4-alpha"
+     "@lloyal-labs/lloyal.node-darwin-arm64": "1.0.6-alpha",
+     "@lloyal-labs/lloyal.node-darwin-x64": "1.0.6-alpha",
+     "@lloyal-labs/lloyal.node-linux-arm64": "1.0.6-alpha",
+     "@lloyal-labs/lloyal.node-linux-arm64-cuda": "1.0.6-alpha",
+     "@lloyal-labs/lloyal.node-linux-arm64-vulkan": "1.0.6-alpha",
+     "@lloyal-labs/lloyal.node-linux-x64": "1.0.6-alpha",
+     "@lloyal-labs/lloyal.node-linux-x64-cuda": "1.0.6-alpha",
+     "@lloyal-labs/lloyal.node-linux-x64-vulkan": "1.0.6-alpha",
+     "@lloyal-labs/lloyal.node-win32-arm64": "1.0.6-alpha",
+     "@lloyal-labs/lloyal.node-win32-arm64-vulkan": "1.0.6-alpha",
+     "@lloyal-labs/lloyal.node-win32-x64": "1.0.6-alpha",
+     "@lloyal-labs/lloyal.node-win32-x64-cuda": "1.0.6-alpha",
+     "@lloyal-labs/lloyal.node-win32-x64-vulkan": "1.0.6-alpha"
    },
    "engines": {
      "node": ">=22.0.0"
@@ -108,52 +108,31 @@ if (osName === 'darwin') {
    // Create package.json from template
    console.log('\nGenerating package.json...');
    const mainPackageJson = require(path.join(ROOT, 'package.json'));
-   const templatePath = path.join(ROOT, 'packages', 'template', 'package.json');
-
-   let pkgJson;
-   if (fs.existsSync(templatePath)) {
-     pkgJson = require(templatePath);
-   } else {
-     // Fallback template if file doesn't exist yet
-     pkgJson = {
-       name: '@lloyal-labs/lloyal.node-PLATFORM',
-       version: '0.0.0',
-       description: 'Lloyal native binary for PLATFORM',
-       main: 'index.js',
-       files: ['bin/', 'index.js'],
-       repository: {
-         type: 'git',
-         url: 'git+https://github.com/lloyal-ai/lloyal.node.git'
-       },
-       license: 'Apache-2.0'
-     };
-   }

-   // Update with actual values
-   pkgJson.name = `@lloyal-labs/lloyal.node-${packageName}`;
-   pkgJson.version = mainPackageJson.version;
-   pkgJson.description = `Lloyal native binary for ${packageName}`;
-   pkgJson.os = [osName];
-   pkgJson.cpu = [arch];
+   // Platform package exports the binary directly (no index.js wrapper)
+   // This enables runtime dynamic require with automatic fallback:
+   //   require('@lloyal-labs/lloyal.node-linux-x64') -> bin/lloyal.node
+   const pkgJson = {
+     name: `@lloyal-labs/lloyal.node-${packageName}`,
+     version: mainPackageJson.version,
+     description: `Lloyal native binary for ${packageName}`,
+     main: 'bin/lloyal.node',
+     os: [osName],
+     cpu: [arch],
+     files: ['bin/'],
+     repository: {
+       type: 'git',
+       url: 'git+https://github.com/lloyal-ai/lloyal.node.git'
+     },
+     author: 'lloyal.ai',
+     license: 'Apache-2.0'
+   };

    fs.writeFileSync(
      path.join(PKG_DIR, 'package.json'),
      JSON.stringify(pkgJson, null, 2) + '\n'
    );
-   console.log(`  ✓ Created package.json`);
-
-   // Create index.js
-   console.log('\nGenerating index.js...');
-   const indexJs = `// Platform-specific binary package for ${packageName}
- // This file resolves to the native binary in bin/
-
- const path = require('path');
-
- module.exports = path.join(__dirname, 'bin', 'lloyal.node');
- `;
-
-   fs.writeFileSync(path.join(PKG_DIR, 'index.js'), indexJs);
-   console.log(`  ✓ Created index.js`);
+   console.log(`  ✓ Created package.json (main: bin/lloyal.node)`);

    // Summary
    console.log(`\n✅ Platform package created successfully!`);
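For a concrete picture of what the rewritten script emits, this is roughly the manifest for a hypothetical linux-x64 package (the values are illustrative, built from the template literals in the diff above). Pointing `main` at `bin/lloyal.node` is what lets a plain `require()` of the platform package resolve straight to the native addon, with no `index.js` wrapper in between:

```javascript
// Illustrative output of the manifest-generation step for linux-x64.
const pkgJson = {
  name: '@lloyal-labs/lloyal.node-linux-x64',
  version: '1.0.6-alpha',
  description: 'Lloyal native binary for linux-x64',
  main: 'bin/lloyal.node',   // require() resolves directly to the native addon
  os: ['linux'],             // npm skips install on other platforms
  cpu: ['x64'],
  files: ['bin/'],
  repository: {
    type: 'git',
    url: 'git+https://github.com/lloyal-ai/lloyal.node.git'
  },
  author: 'lloyal.ai',
  license: 'Apache-2.0'
};
console.log(JSON.stringify(pkgJson, null, 2));
```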
@@ -1,138 +0,0 @@
- #!/usr/bin/env node
- /**
-  * Smart installer for lloyal.node
-  *
-  * Strategy:
-  *   1. Check if prebuilt binary exists for this platform
-  *   2. If yes, copy to build/Release/ and exit
-  *   3. If no, show helpful error with build-from-source instructions
-  *
-  * Respects LLOYAL_GPU environment variable for GPU variant selection
-  */
-
- const fs = require('fs');
- const path = require('path');
-
- const PLATFORM = process.platform;
- const ARCH = process.arch;
- const ROOT = __dirname + '/..';
- const BUILD_DIR = path.join(ROOT, 'build', 'Release');
-
- // Logging helpers
- const log = (msg) => console.log(`[lloyal.node] ${msg}`);
- const error = (msg) => console.error(`[lloyal.node] ❌ ${msg}`);
-
- /**
-  * Check if a platform package is installed and has binaries
-  */
- function findPrebuilt(packageName) {
-   try {
-     const pkgPath = require.resolve(packageName);
-     const binPath = require(packageName); // index.js exports path to binary
-
-     if (fs.existsSync(binPath)) {
-       const binDir = path.dirname(binPath);
-       return binDir;
-     }
-   } catch (e) {
-     // Package not installed or doesn't export binary path
-   }
-   return null;
- }
-
- /**
-  * Copy prebuilt binaries to build/Release/
-  */
- function installPrebuilt(binDir, packageName) {
-   log(`Found prebuilt binaries in ${packageName}`);
-
-   try {
-     // Create build/Release directory
-     fs.mkdirSync(BUILD_DIR, { recursive: true });
-
-     // Copy all files from bin directory
-     const files = fs.readdirSync(binDir);
-     files.forEach(file => {
-       const src = path.join(binDir, file);
-       const dest = path.join(BUILD_DIR, file);
-
-       if (fs.statSync(src).isFile()) {
-         fs.copyFileSync(src, dest);
-         log(`  ✓ Copied ${file}`);
-       }
-     });
-
-     log(`✅ Installed prebuilt binaries successfully`);
-     process.exit(0);
-   } catch (e) {
-     error(`Failed to install prebuilt: ${e.message}`);
-     // Don't exit - fall through to source build
-   }
- }
-
- /**
-  * Main installation logic
-  */
- function main() {
-   log(`Platform: ${PLATFORM}-${ARCH}`);
-
-   // 1. Check for user-specified GPU variant via environment variable
-   if (process.env.LLOYAL_GPU) {
-     const gpu = process.env.LLOYAL_GPU.toLowerCase();
-     const packageName = `@lloyal-labs/lloyal.node-${PLATFORM}-${ARCH}-${gpu}`;
-
-     log(`LLOYAL_GPU=${gpu}, looking for ${packageName}...`);
-     const binDir = findPrebuilt(packageName);
-
-     if (binDir) {
-       installPrebuilt(binDir, packageName);
-       return; // exit(0) called in installPrebuilt
-     } else {
-       log(`  ⚠️ Package ${packageName} not found`);
-     }
-   }
-
-   // 2. Check for GPU variants in priority order
-   const gpuVariants = ['cuda', 'vulkan'];
-   for (const gpu of gpuVariants) {
-     const packageName = `@lloyal-labs/lloyal.node-${PLATFORM}-${ARCH}-${gpu}`;
-     const binDir = findPrebuilt(packageName);
-
-     if (binDir) {
-       log(`Auto-detected GPU variant: ${gpu}`);
-       installPrebuilt(binDir, packageName);
-       return; // exit(0) called in installPrebuilt
-     }
-   }
-
-   // 3. Check for default platform package (CPU or Metal on macOS)
-   const defaultPackage = `@lloyal-labs/lloyal.node-${PLATFORM}-${ARCH}`;
-   const binDir = findPrebuilt(defaultPackage);
-
-   if (binDir) {
-     installPrebuilt(binDir, defaultPackage);
-     return; // exit(0) called in installPrebuilt
-   }
-
-   // 4. No prebuilt found - error with helpful message
-   log('');
-   error('No prebuilt binary found for your platform');
-   log('');
-   log(`  Platform: ${PLATFORM}-${ARCH}`);
-   log('');
-   log('  Options:');
-   log('  1. Install a platform-specific package:');
-   log(`     npm install @lloyal-labs/lloyal.node-${PLATFORM}-${ARCH}`);
-   log('');
-   log('  2. Build from source (requires C++20, CMake 3.18+):');
-   log('     git clone --recursive https://github.com/lloyal-ai/lloyal.node.git');
-   log('     cd lloyal.node && npm run build');
-   log('');
-   log('  See: https://github.com/lloyal-ai/lloyal.node#building');
-   log('');
-
-   process.exit(1);
- }
-
- // Run installer
- main();
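The deleted `install.js` above picked a binary once, at install time, by copying it into `build/Release/`. The replacement in `lib/index.js` defers that decision to require time: try candidate packages in priority order, swallow load failures (missing package, or `dlopen` failing on absent GPU runtime libraries), and fall back. A minimal sketch of that try-in-order pattern with an injected loader so it runs without any native binaries (`loadWithFallback` and `fakeRequire` are hypothetical names for illustration):

```javascript
// Sketch of the runtime fallback chain that replaces the install-time script:
// try each candidate in order, return the first loader that succeeds.
const loadWithFallback = (candidates, tryRequire) => {
  for (const name of candidates) {
    try {
      const mod = tryRequire(name);
      if (mod) return { name, mod };
    } catch (e) {
      // e.g. package not installed, or dlopen failed on missing GPU libs
    }
  }
  throw new Error(`No binary found. Tried: ${candidates.join(', ')}`);
};

// Simulated environment: the CUDA package is installed but its GPU runtime is missing.
const fakeRequire = (name) => {
  if (name.endsWith('-cuda')) throw new Error('libcuda.so: cannot open shared object file');
  if (name === '@lloyal-labs/lloyal.node-linux-x64') return { createContext() {} };
  throw new Error('MODULE_NOT_FOUND');
};

const { name } = loadWithFallback(
  ['@lloyal-labs/lloyal.node-linux-x64-cuda', '@lloyal-labs/lloyal.node-linux-x64'],
  fakeRequire
);
console.log(name); // the CPU package, after the CUDA load fails
```

The design win over the deleted script: a CUDA wheel that installs fine on a machine without CUDA no longer produces a broken install, because the failure surfaces (and is recovered from) at load time instead.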