@lloyal-labs/lloyal.node 1.0.5-alpha → 1.0.7

package/README.md CHANGED
@@ -1,11 +1,11 @@
  # lloyal.node

- Node.js bindings for [liblloyal](https://github.com/lloyal-ai/liblloyal)—the inference kernel that orchestrates llama.cpp in-process for agentic inference patterns.
+ **Advanced edge inference for Node.js**

- ## Installation
+ A llama.cpp control surface in TypeScript with atomic inference-state forking, real-time rolling perplexity/entropy/surprisal, and multi-sequence parallel exploration primitives.

  ```bash
- npm install lloyal.node
+ npm install @lloyal-labs/lloyal.node
  ```

  Prebuilt binaries for 13 platforms:
@@ -19,281 +19,139 @@ Prebuilt binaries for 13 platforms:
  | Windows | x64 | CPU / CUDA / Vulkan |
  | Windows | arm64 | CPU / Vulkan |

- Falls back to source build if your platform isn't covered.
+ GPU selection happens at runtime, not install time. See [distribution.md](docs/distribution.md) for details.
+
+ ---
+
+ ## Examples
+
+ Working examples demonstrate each capability:
+
+ | Example | What It Demonstrates |
+ | ----------------------------------------- | ----------------------------------------------------------------------------- |
+ | [`best-of-n/`](./examples/best-of-n/) | Branch API parallel generation, PPL selection, fork/produce/commit |
+ | [`speculative/`](./examples/speculative/) | Branch API fork/prune, draft/verify/accept/reject, bonus token sampling |
+ | [`entropy/`](./examples/entropy/) | Entropy Decision Tree — `modelEntropy()` mid-generation as control signal |
+ | [`grammar/`](./examples/grammar/) | Pull loop with generators, JSON schema constraints, KV + grammar branching |
+ | [`streaming/`](./examples/streaming/) | Infinite context via BlinkKV, `clearAndReseed`, perplexity tracking |
+ | [`chat/`](./examples/chat/) | Interactive streaming chat |
+ | [`embed/`](./examples/embed/) | Text embeddings extraction |

  ```bash
- LLOYAL_GPU=cuda npm install # NVIDIA
- LLOYAL_GPU=vulkan npm install # AMD/Intel
- LLOYAL_GPU=cpu npm install # Force CPU
+ node examples/best-of-n/best-of-n.mjs
+ node examples/speculative/speculative.mjs
+ node examples/entropy/entropy.mjs
+ node examples/grammar/grammar.mjs
  ```

- See [DISTRIBUTION.md](./docs/DISTRIBUTION.md) for package details.
+ Each example has a README explaining the pattern in depth.

- ## Quick Start
+ ---

- Complete example with greedy sampling:
+ ## Core Patterns

- ```typescript
- import { createContext } from 'lloyal.node';
+ ### Branch API

- async function generate(prompt: string, maxTokens = 100): Promise<string> {
-   const ctx = await createContext({
-     modelPath: './model.gguf',
-     nCtx: 2048,
-     nThreads: 4,
-   });
+ `Branch` is the primary API for parallel generation. Each branch owns a KV cache sequence, sampler chain, logits snapshot, and perplexity tracker. Fork a branch to explore alternatives, compare by perplexity, prune the losers.

-   try {
-     const tokens = await ctx.tokenize(prompt);
-     await ctx.decode(tokens, 0);
+ ```javascript
+ import { createContext, Branch } from '@lloyal-labs/lloyal.node';

-     const output: number[] = [];
-     let pos = tokens.length;
+ const ctx = await createContext({ modelPath: './model.gguf', nSeqMax: 8 });
+ const tokens = await ctx.tokenize('Once upon a time');
+ await ctx.decode(tokens, 0, 0);

-     for (let i = 0; i < maxTokens; i++) {
-       const token = ctx.greedySample();
-       if (token < 0) break; // EOS
+ // Create root branch, capture logits from prefill
+ const root = Branch.create(ctx, 0, tokens.length, { temperature: 0.8 });
+ root.captureLogits();

-       output.push(token);
-       await ctx.decode([token], pos++);
-     }
+ // Fork N candidates — each gets copied KV, logits, sampler, perplexity
+ const candidates = [1, 2, 3, 4, 5].map((seqId, i) => {
+   const branch = root.fork(seqId);
+   branch.reseedSampler(1000 + i); // Unique PRNG per branch
+   return branch;
+ });

-     return ctx.detokenize(output);
-   } finally {
-     ctx.dispose();
+ // Generate in parallel (interleaved round-robin)
+ for (let t = 0; t < 50; t++) {
+   for (const branch of candidates) {
+     const { token, isStop } = branch.produce(); // Sample (no KV write)
+     if (isStop) continue;
+     branch.commit(token); // Accept + decode + capture
  }
  }

- const response = await generate('The capital of France is');
- console.log(response);
+ // Select best by perplexity, prune losers
+ const best = candidates.reduce((a, b) => a.perplexity < b.perplexity ? a : b);
+ for (const c of candidates) { if (c !== best) c.prune(); }
  ```

- ## Test-Time Alignment
+ **What `fork()` clones:** KV cache sequence, logits snapshot, sampler chain (penalties + PRNG), perplexity tracker. Under unified KV (the default), forking is a metadata-only operation — no KV tensor buffers are copied.

- TTA is token-level test-time alignment by exposing logits so TypeScript can apply stateful policies/constraints to the full next-token distribution before sampling—no retraining. Enabling fusion of application state with sampling strategy at every token step. Instead of generating text and validating after, you enforce constraints _during_ generation.
+ **Use cases:** Best-of-N sampling, speculative decoding, MCTS/LATS tree search, beam search.

- This requires two things:
+ See [`examples/best-of-n/`](./examples/best-of-n/) and [`examples/speculative/`](./examples/speculative/) for complete patterns.

- 1. **Raw logits** — the probability distribution over all possible next tokens
- 2. **TypeScript sampling** — so your app logic can modify probabilities before selection
+ ### Low-Level Forking

- lloyal.node provides the logits. [tsampler](https://github.com/lloyal-ai/tsampler) provides the sampling:
+ For fine-grained control without the Branch wrapper, raw KV and state operations are available:

- ```typescript
- import { createContext } from 'lloyal.node';
- import {
-   sampleWithStrategy,
-   computeModelEntropy,
-   TokenHistoryTracker,
-   SamplerWorkspace,
-   Xoroshiro128Plus,
- } from '@lloyal/tsampler';
-
- const ctx = await createContext({ modelPath: './model.gguf' });
- const prng = new Xoroshiro128Plus(42); // Deterministic PRNG
- const tokenHistory = new TokenHistoryTracker(64); // For repetition penalties
- const workspace = new SamplerWorkspace(256); // Pre-allocated, zero-alloc hot path
-
- const tokens = await ctx.tokenize(prompt);
- await ctx.decode(tokens, 0);
-
- let pos = tokens.length;
- const output: number[] = [];
-
- while (output.length < maxTokens) {
-   const logits = ctx.getLogits();
-
-   // === YOUR STEERING LOGIC HERE ===
-
-   // Enforce domain rules
-   if (currency === 'JPY') {
-     logits[DECIMAL_TOKEN] = -Infinity; // JPY has no decimal subdivision
-   }
+ | Approach | Method | Use Case |
+ | -------------------- | --------------------------------- | -------------------------------------------- |
+ | **Tag copy** | `kvSeqCopy(src, dst)` | Parallel branches with different seqIds |
+ | **Snapshot/restore** | `kvCacheSave()` / `kvCacheLoad()` | Sequential exploration, return to checkpoint |

-   // Adapt to model confidence
-   const entropy = computeModelEntropy(logits);
-   const params =
-     entropy < 2.0
-       ? { topK: 256, temperature: 1.5 } // Low confidence → explore more
-       : { topK: 40, temperature: 0.8 }; // High confidence → stay focused
+ [`examples/grammar/`](./examples/grammar/) uses snapshot/restore to save state, explore branches sequentially, and restore between each:

-   // === END STEERING LOGIC ===
-
-   const token = sampleWithStrategy(logits, {
-     tokenHistory,
-     params,
-     workspace,
-     prng,
-   });
-
-   if (token < 0) break;
-
-   tokenHistory.accept(token);
-   output.push(token);
-   await ctx.decode([token], pos++);
- }
+ ```javascript
+ const snapshot = await ctx.kvCacheSave(0); // Save checkpoint
+ // ... explore branch ...
+ await ctx.kvCacheLoad(0, snapshot); // Return to checkpoint
  ```
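+
+ Tag copy suits parallel branches instead. A minimal sketch, assuming `nSeqMax >= 2`, a prefilled sequence 0, and hypothetical `tokenA`/`tokenB`/`pos` bookkeeping:
+
+ ```javascript
+ ctx.kvSeqCopy(0, 1); // duplicate seq 0's KV into seq 1 (metadata-only under unified KV)
+
+ // The two sequences now advance independently
+ await ctx.decode([tokenA], pos, 0);
+ await ctx.decode([tokenB], pos, 1);
+ ```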

- ### Domain Constraints
-
- ```typescript
- // Financial: JPY has no decimal subdivision
- if (currency === 'JPY' && parsingAmount) {
-   logits[DECIMAL_TOKEN] = -Infinity;
-   DIGIT_TOKENS.forEach((id) => (logits[id] += 2.0));
- }
-
- // Legal: Boost required terminology
- if (contractType === 'NDA') {
-   CONFIDENTIALITY_TOKENS.forEach((id) => (logits[id] += 5.0));
- }
-
- // Medical: Enforce terminology based on actual lab values
- if (glucoseLevel > normalMax) {
-   ELEVATED_TOKENS.forEach((id) => (logits[id] += 10.0));
-   NORMAL_TOKENS.forEach((id) => (logits[id] = -Infinity));
- }
- ```
+ ### Entropy as Control Signal

- ### Quality Gates
+ Model uncertainty mid-generation enables dynamic behavior:

- ```typescript
- import { computeModelSurprisal, RollingPerplexity } from '@lloyal/tsampler';
-
- const ppl = new RollingPerplexity();
-
- while (generating) {
-   const logits = ctx.getLogits();
-   const token = sampleWithStrategy(logits, {
-     tokenHistory,
-     params,
-     workspace,
-     prng,
-   });
-
-   const surprisal = computeModelSurprisal(logits, token);
-   ppl.addSurprisal(surprisal);
-
-   if (ppl.ppl() > 50) {
-     // Generation quality degrading — options:
-     // 1. Trigger RAG retrieval for more context
-     // 2. Prune KV cache (evict stale context)
-     // 3. Early stop and retry with different prompt
-   }
+ ```javascript
+ const entropy = ctx.modelEntropy('bits');

-   // ...
+ if (entropy > 4.0) {
+   // High uncertainty — model is guessing
+   // Trigger retrieval, reduce temperature, or branch
  }
  ```

- ### Entropy-Adaptive Retrieval
+ See [`examples/entropy/`](./examples/entropy/) for entropy-triggered sampling strategies.
 
- ```typescript
- import { computeModelEntropy } from '@lloyal/tsampler';
+ ### Pull Loop with Generators

- while (generating) {
-   const logits = ctx.getLogits();
-   const entropy = computeModelEntropy(logits);
+ For branching mid-generation, generators provide natural backpressure:

-   if (entropy > 5.0) {
-     // Model is uncertain — retrieve relevant context
-     const context = await rag.retrieve(currentQuery);
-     await injectContext(ctx, context);
-     continue; // Re-evaluate with new context
+ ```javascript
+ function* tokenGenerator(ctx, grammarHandle) {
+   while (true) {
+     const logits = ctx.getLogits();
+     ctx.applySampler(grammarHandle, logits);
+     const token = ctx.sample({ temperature: 0.7 });
+     if (ctx.isStopToken(token)) return;
+     ctx.acceptSamplerToken(grammarHandle, token);
+     yield { token, text: ctx.tokenToText(token) };
  }
-
-   const token = sampleWithStrategy(logits, {
-     tokenHistory,
-     params,
-     workspace,
-     prng,
-   });
-   // ...
  }
- ```
-
- ## Why TypeScript Sampling?
-
- | | Native C++ | TypeScript (tsampler) |
- | ----------------------- | ------------ | --------------------- |
- | Speed | ~0.3ms/token | ~3-5ms/token |
- | Overhead vs 50ms decode | — | ~6-10% |
- | Logit steering | ❌ | ✅ |
- | Adaptive strategies | ❌ | ✅ |
- | OTA updates | Rebuild app | Ship new JS |
- | Debugging | printf | Full inspect |
-
- The overhead is imperceptible. A 50ms decode dominates; 3ms sampling is noise.
-
- ### tsampler Capabilities
-
- [tsampler](https://github.com/lloyal-ai/tsampler) provides llama.cpp sampling parity in pure TypeScript:

- **Sampling methods:** greedy, top-k, top-p, min-p, typical-p, top-n-sigma, temperature, mirostat v1/v2
-
- **Penalties:** repetition, frequency, presence (exact llama.cpp formulas)
-
- **Infrastructure:**
-
- - `Xoroshiro128Plus` — deterministic PRNG, reproducible generations
- - `TokenHistoryTracker` — sliding window for penalty calculations
- - `SamplerWorkspace` — pre-allocated buffers, zero-alloc hot path
- - `computeModelEntropy()` — Shannon entropy in nats
- - `computeModelSurprisal()` — per-token surprisal
- - `RollingPerplexity` — streaming perplexity tracking
-
- ### Native References
-
- lloyal.node includes native C++ implementations for validation:
-
- ```typescript
- // TypeScript implementation
- const tsEntropy = computeModelEntropy(logits);
-
- // Native reference (C++)
- const nativeEntropy = ctx.computeEntropy();
-
- // Should match within float precision
- console.assert(Math.abs(tsEntropy - nativeEntropy) < 1e-5);
+ // Consumer controls the pace; decode between pulls, stop at the branch point
+ let accumulated = '';
+ let pos = promptTokens.length; // next KV position (promptTokens from the prefill)
+ for (const { token, text } of tokenGenerator(ctx, grammarHandle)) {
+   accumulated += text;
+   await ctx.decode([token], pos++); // advance so the next pull sees fresh logits
+   if (accumulated.includes('"city"')) break; // Pause here, branch
+ }
  ```

- Available references:
-
- - `ctx.computeEntropy()` — Shannon entropy in nats
- - `ctx.greedySample()` — argmax token ID
-
- Build with confidence. Validate against native. Deploy TypeScript.
+ See [`examples/grammar/`](./examples/grammar/) for the full pull loop pattern.

- ## Embeddings
-
- lloyal.node supports embedding extraction with configurable pooling:
-
- ```typescript
- import { createContext } from 'lloyal.node';
-
- const ctx = await createContext({
-   modelPath: './nomic-embed-text.gguf',
-   embeddings: true,
-   poolingType: 1, // 0=NONE, 1=MEAN, 2=CLS, 3=LAST
- });
-
- async function embed(text: string): Promise<Float32Array> {
-   const tokens = await ctx.tokenize(text);
-   await ctx.encode(tokens);
-
-   const embedding = ctx.getEmbeddings(true); // L2-normalized
-   await ctx.kvCacheClear(); // Reset for next text
-
-   return embedding;
- }
-
- const vec = await embed('Document to embed');
- console.log(`Dimension: ${ctx.getEmbeddingDimension()}`); // e.g., 768
- ```
+ ---

  ## API Reference

- **📖 [Full API Documentation](https://lloyal-ai.github.io/lloyal.node)** - Complete reference with examples and type definitions
-
  ### Context Creation

  ```typescript
@@ -301,55 +159,88 @@ const ctx = await createContext({
    modelPath: string, // Path to .gguf file (required)
    nCtx?: number, // Context size (default: 2048)
    nThreads?: number, // CPU threads (default: 4)
-   nGpuLayers?: number, // Layers to offload to GPU (default: 0)
    embeddings?: boolean, // Enable embedding mode (default: false)
-   poolingType?: number // 0=NONE, 1=MEAN, 2=CLS, 3=LAST (default: 0)
+   poolingType?: number, // 0=NONE, 1=MEAN, 2=CLS, 3=LAST
+   nSeqMax?: number, // Max parallel sequences (default: 1)
  });
  ```

- ### Inference
-
- | Method | Returns | Description |
- | -------------------------- | ------------------- | ----------------------------------------------------- |
- | `tokenize(text)` | `Promise<number[]>` | Text → token IDs |
- | `detokenize(tokens)` | `Promise<string>` | Token IDs → text |
- | `decode(tokens, position)` | `Promise<void>` | Forward pass, populates KV cache |
- | `getLogits()` | `Float32Array` | Vocabulary-sized probability distribution (zero-copy) |
-
- ### Native References
-
- | Method | Returns | Description |
- | ------------------ | -------- | ----------------------- |
- | `greedySample()` | `number` | Argmax token ID |
- | `computeEntropy()` | `number` | Shannon entropy in nats |
+ ### Core Methods
+
+ | Method | Returns | Description |
+ | ----------------------------- | ------------------- | ------------------------------- |
+ | `tokenize(text)` | `Promise<number[]>` | Text → token IDs |
+ | `detokenize(tokens)` | `Promise<string>` | Token IDs → text |
+ | `tokenToText(token)` | `string` | Single token text (streaming) |
+ | `decode(tokens, pos, seqId?)` | `Promise<void>` | Forward pass, updates KV cache |
+ | `sample(params?)` | `number` | Sample next token |
+ | `isStopToken(token)` | `boolean` | Check for EOS token |
+ | `getLogits()` | `Float32Array` | Raw logits (zero-copy view) |
+
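+ Taken together, these methods support a minimal generation loop. A sketch, with placeholder model path and prompt, and `sample()` options assumed to match the examples above:
+
+ ```javascript
+ const ctx = await createContext({ modelPath: './model.gguf' });
+ try {
+   const tokens = await ctx.tokenize('The capital of France is');
+   await ctx.decode(tokens, 0, 0); // prefill sequence 0
+
+   let pos = tokens.length;
+   let text = '';
+   for (let i = 0; i < 64; i++) {
+     const token = ctx.sample({ temperature: 0.8 });
+     if (ctx.isStopToken(token)) break;
+     text += ctx.tokenToText(token); // stream token text incrementally
+     await ctx.decode([token], pos++, 0); // feed back for the next step
+   }
+   console.log(text);
+ } finally {
+   ctx.dispose();
+ }
+ ```
+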
+ ### KV Cache
+
+ | Method | Returns | Description |
+ | ---------------------------------- | ----------------- | ------------------------------ |
+ | `kvCacheSize(seqId?)` | `number` | Tokens in cache |
+ | `kvCacheClear()` | `Promise<void>` | Clear all sequences |
+ | `kvCacheRemove(seqId, start, end)` | `Promise<void>` | Remove token range |
+ | `kvCacheSave(seqId?)` | `Promise<Buffer>` | Snapshot state |
+ | `kvCacheLoad(seqId, state)` | `Promise<void>` | Restore state |
+ | `kvSeqCopy(src, dst)` | `void` | Copy sequence (tag copy, O(1)) |
+ | `kvSeqKeep(seqId)` | `void` | Keep only one sequence |
+ | `clearAndReseed(sinks, tail)` | `Promise<void>` | BlinkKV pattern |
+
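+ A sketch of the BlinkKV pattern, on the assumption that `sinks` and `tail` are token counts (the first `sinks` tokens are kept as attention sinks, the last `tail` tokens are kept, and the middle is evicted); `N_CTX` is a hypothetical constant matching the context size:
+
+ ```javascript
+ // Near the context limit, compact the cache so generation can continue
+ if (ctx.kvCacheSize(0) >= N_CTX - 64) {
+   await ctx.clearAndReseed(4, 512); // assumed: keep 4 sink tokens + last 512 tokens
+ }
+ ```
+
+ See [`examples/streaming/`](./examples/streaming/) for the full pattern.
+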
+ ### Grammar (Handle-Based)
+
+ | Method | Returns | Description |
+ | -------------------------------- | -------- | --------------------------- |
+ | `jsonSchemaToGrammar(schema)` | `string` | Schema → GBNF |
+ | `createSampler(grammarStr)` | `number` | Create grammar handle |
+ | `cloneSampler(handle)` | `number` | Clone grammar state |
+ | `applySampler(handle, logits)` | `void` | Apply constraints to logits |
+ | `acceptSamplerToken(handle, id)` | `void` | Advance parser state |
+ | `freeSamplerHandle(handle)` | `void` | Release grammar handle |
+
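+ The handle lifecycle in one place, as a sketch: the schema shape is illustrative, and passing it as a JSON string is an assumption. This shows one sampling step of a loop like the pull loop above:
+
+ ```javascript
+ const gbnf = ctx.jsonSchemaToGrammar(JSON.stringify({
+   type: 'object',
+   properties: { city: { type: 'string' } },
+   required: ['city'],
+ }));
+ const grammar = ctx.createSampler(gbnf); // numeric handle
+ try {
+   const logits = ctx.getLogits();
+   ctx.applySampler(grammar, logits); // mask tokens the grammar forbids
+   const token = ctx.sample({ temperature: 0.7 });
+   ctx.acceptSamplerToken(grammar, token); // advance the parser
+   // cloneSampler(grammar) snapshots parser state alongside a KV fork
+ } finally {
+   ctx.freeSamplerHandle(grammar); // release the native handle
+ }
+ ```
+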
+ ### Metrics
+
+ | Method | Returns | Description |
+ | --------------------------------------- | --------------- | ------------------------------------------ |
+ | `modelEntropy(base?, logits?)` | `number` | Distribution entropy (bits/nats) |
+ | `modelSurprisal(token, base?, logits?)` | `number` | Token surprisal (supports captured logits) |
+ | `createPerplexityTracker()` | `TrackerHandle` | Create tracker (forkable) |
+ | `clonePerplexityTracker(handle)` | `TrackerHandle` | Clone tracker state |
+ | `addSurprisal(handle, value)` | `void` | Add to tracker |
+ | `getPerplexity(handle)` | `number` | Get current PPL |
+ | `freePerplexityTracker(handle)` | `void` | Release tracker |
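+
+ These compose into a rolling quality gate. A sketch continuing the Core Methods loop above; `'nats'` as the surprisal base mirrors `modelEntropy('bits')` and is an assumption, and the 50-PPL threshold is arbitrary:
+
+ ```javascript
+ const tracker = ctx.createPerplexityTracker();
+ let pos = tokens.length; // position after prefill, as above
+ for (let i = 0; i < 64; i++) {
+   const token = ctx.sample({ temperature: 0.8 });
+   if (ctx.isStopToken(token)) break;
+   ctx.addSurprisal(tracker, ctx.modelSurprisal(token, 'nats'));
+   if (ctx.getPerplexity(tracker) > 50) break; // quality degrading; stop early
+   await ctx.decode([token], pos++, 0);
+ }
+ ctx.freePerplexityTracker(tracker);
+ ```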

  ### Embeddings

- | Method | Returns | Description |
- | --------------------------- | --------------- | ------------------------------------------ |
- | `encode(tokens)` | `Promise<void>` | Forward pass for embedding extraction |
- | `getEmbeddings(normalize?)` | `Float32Array` | Embedding vector, optionally L2-normalized |
- | `getEmbeddingDimension()` | `number` | Vector dimension |
- | `kvCacheClear()` | `Promise<void>` | Clear KV cache between texts |
+ | Method | Returns | Description |
+ | --------------------------- | --------------- | --------------------------- |
+ | `encode(tokens)` | `Promise<void>` | Forward pass for embeddings |
+ | `getEmbeddings(normalize?)` | `Float32Array` | Extract embedding vector |
+ | `getEmbeddingDimension()` | `number` | Vector dimension |
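+
+ A minimal sketch; the model path is a placeholder and `poolingType: 1` selects MEAN pooling:
+
+ ```javascript
+ const ectx = await createContext({
+   modelPath: './nomic-embed-text.gguf', // placeholder embedding model
+   embeddings: true,
+   poolingType: 1, // MEAN
+ });
+ const toks = await ectx.tokenize('Document to embed');
+ await ectx.encode(toks);
+ const vec = ectx.getEmbeddings(true); // L2-normalized
+ await ectx.kvCacheClear(); // reset before the next text
+ console.log(vec.length, ectx.getEmbeddingDimension()); // e.g., 768 768
+ ectx.dispose();
+ ```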
 
  ### Lifecycle

- | Method | Description |
- | ----------- | ----------------------------------------------------- |
- | `dispose()` | Free native resources. **Required** — call when done. |
+ | Method | Description |
+ | ----------- | ------------------------------------ |
+ | `dispose()` | Free native resources (**required**) |
+
+ ---

- ## LLoyal Ecosystem
+ ## Ecosystem

- | Package | Language | What it does |
- | ------------------------------------------------------- | ------------ | ------------ |
- | [liblloyal](https://github.com/lloyal-ai/liblloyal) | C++ | Inference kernel. Orchestrates llama.cpp with composable primitives: tokenization, decoding, KV cache, sampling chains, metrics, embeddings. Plus `branch.hpp` / `lease.hpp` for state forking and SMMA. |
- | **lloyal.node** | N-API | Node.js bindings. Zero-copy logits, native references for validation. |
- | [tsampler](https://github.com/lloyal-ai/tsampler) | TypeScript | Sampling library with llama.cpp parity. All filters, penalties, entropy metrics. Plugin for lloyal.node—consumes logits, returns tokens. |
- | [nitro-llama](https://github.com/lloyal-ai/nitro-llama) | React Native | Mobile bindings via Nitro Modules. Same liblloyal primitives on iOS/Android. |
+ | Package | Runtime | Description |
+ | ------------------------------------------------------- | ------------ | --------------------------------- |
+ | [liblloyal](https://github.com/lloyal-ai/liblloyal) | C++ | Header-only inference kernel |
+ | **lloyal.node** | Node.js | This package |
+ | [nitro-llama](https://github.com/lloyal-ai/nitro-llama) | React Native | Mobile bindings via Nitro Modules |
+ | [tsampler](https://github.com/lloyal-ai/tsampler) | TypeScript | Reference sampler implementation |

  ## Contributing

- See [CONTRIBUTING.md](./CONTRIBUTING.md) for development setup, build instructions, and release process.
+ See [CONTRIBUTING.md](./CONTRIBUTING.md) for development setup and release process.

  ## License