npm - @elizaos/capacitor-llama - Versions diffs - 2.0.0-beta.1 → 2.0.3-beta.3 - Mend

@elizaos/capacitor-llama 2.0.0-beta.1 → 2.0.3-beta.3

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (18)

package/LICENSE +21 -0
package/README.md +64 -43
package/dist/esm/capacitor-llama-adapter.d.ts +92 -1
package/dist/esm/capacitor-llama-adapter.js +455 -53
package/dist/esm/definitions.d.ts +156 -11
package/dist/esm/device-bridge-client.d.ts +17 -0
package/dist/esm/device-bridge-client.js +155 -21
package/dist/esm/index.d.ts +3 -2
package/dist/esm/index.js +3 -2
package/dist/esm/kv-cache-resolver.d.ts +2 -2
package/dist/esm/kv-cache-resolver.js +2 -2
package/dist/esm/token-tree-codec.d.ts +51 -0
package/dist/esm/token-tree-codec.js +217 -0
package/dist/plugin.cjs.js +830 -73
package/dist/plugin.cjs.js.map +1 -1
package/dist/plugin.js +830 -73
package/dist/plugin.js.map +1 -1
package/package.json +12 -8

package/LICENSE ADDED

@@ -0,0 +1,21 @@
+MIT License
+Copyright (c) 2026 Shaw Walters and elizaOS Contributors
+Permission is hereby granted, free of charge, to any person obtaining a copy
+of this software and associated documentation files (the "Software"), to deal
+in the Software without restriction, including without limitation the rights
+to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+copies of the Software, and to permit persons to whom the Software is
+furnished to do so, subject to the following conditions:
+The above copyright notice and this permission notice shall be included in all
+copies or substantial portions of the Software.
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+SOFTWARE.

package/README.md CHANGED

@@ -1,68 +1,89 @@
 # @elizaos/capacitor-llama
-Mobile llama.cpp adapter for Eliza. A **thin wrapper** over
+Mobile llama.cpp adapter for elizaOS. A thin wrapper over
 [`llama-cpp-capacitor`](https://github.com/arusatech/annadata-llama-cpp) that
-maps its contextId-based API onto Eliza's `LocalInferenceLoader` contract,
-so the standard `ActiveModelCoordinator` in `@elizaos/app-core` can switch
-between the desktop (node-llama-cpp) engine and mobile native inference
-transparently.
+maps its contextId-based API onto elizaOS's `LocalInferenceLoader` contract,
+so the `ActiveModelCoordinator` in `@elizaos/ui`
+(`src/services/local-inference/`) can switch between the desktop
+(node-llama-cpp) engine and mobile native inference transparently.
 ## What it does
 - Registers as the runtime's `localInferenceLoader` service during the
-  Capacitor bootstrap.
-- Maps `loadModel({ modelPath })` → `initContext`.
-- Maps `unloadModel()` → `releaseContext` / `releaseAllContexts`.
-- Exposes a `generate()` surface matching the desktop engine.
-- Fans the native `@LlamaCpp_onToken` stream out to Eliza's token listeners.
+  Capacitor bootstrap via `registerCapacitorLlamaLoader(runtime)`.
+- Maps `load({ modelPath })` → `initContext` (one native context per adapter
+  instance; chat and embedding run on separate instances to avoid context
+  collisions).
+- Maps `unload()` → `releaseContext`.
+- Exposes `generate()` and `generateStream()` that target the chat model, and
+  `embed()` that targets a separate embedding-model context.
+- Applies the loaded GGUF's native chat template via `formatChat()` (backed
+  by `llama_chat_apply_template`).
+- Fans the native `@LlamaCpp_onToken` stream out to elizaOS token listeners.
+- Provides `DeviceBridgeClient` — a WebSocket relay that lets an agent
+  container reach a paired mobile device for inference (load, generate, embed,
+  formatChat over a JSON RPC protocol).
+- Provides `serializeTokenTree` / `deserializeTokenTree` — binary codec for
+  the native speculative-decode sampler-hook wire format.
 ## What it does not do
 - It does not ship llama.cpp native binaries — `llama-cpp-capacitor`
   handles iOS (arm64 + x86_64 with Metal) and Android (arm64-v8a,
   armeabi-v7a, x86, x86_64) itself.
-- It does not run on web. On Electrobun / Vite we fall back to the
-  standalone `node-llama-cpp` engine in `@elizaos/app-core`.
+- It does not run on web. On Electrobun / Vite the desktop agent uses the
+  standalone `node-llama-cpp` engine (`LocalInferenceEngine` in
+  `@elizaos/ui`, `src/services/local-inference/engine.ts`).
+- It does not export an elizaOS `Plugin` object; it is wired manually via
+  `registerCapacitorLlamaLoader`.
-## Setup in apps/app
+## Consumption
-1. Install the dependency (already declared here):
+This package is consumed by `@elizaos/ui` in
+`src/api/ios-local-agent-kernel.ts`, which dynamically imports
+`@elizaos/capacitor-llama` and uses the `capacitorLlama` singleton for the
+mobile local-agent kernel. The Capacitor app shell lives in `packages/app`
+(its `package.json` declares the `llama-cpp-capacitor` native dependency).
-   ```bash
-   bun install
-   ```
+Two ways to wire the adapter into a runtime:
-2. Register the loader during Capacitor bootstrap. In `apps/app`'s
-   Capacitor init path (currently in `src/capacitor-shell.ts` or the
-   runtime bootstrap that owns the mobile `AgentRuntime`):
+- **`registerCapacitorLlamaLoader(runtime)`** — registers a
+  `localInferenceLoader` service backed by separate chat and embedding adapter
+  instances. Call it during the mobile runtime bootstrap, in the init path that
+  owns the mobile `AgentRuntime`:
-   ```ts
-   import { registerCapacitorLlamaLoader } from "@elizaos/capacitor-llama";
+  ```ts
+  import { registerCapacitorLlamaLoader } from "@elizaos/capacitor-llama";
-   // After runtime boot, before the Model Hub is mounted:
-   registerCapacitorLlamaLoader(runtime);
-   ```
+  registerCapacitorLlamaLoader(runtime);
+  ```
-3. Run `bunx cap sync` in `apps/app` to pick up the native plugin. iOS and
-   Android builds will pull in `llama-cpp-capacitor`'s prebuilt native
-   libraries automatically.
+- **`capacitorLlama`** — the default singleton `LlamaAdapter`, used directly by
+  callers that don't need per-role context separation.
+After adding native code, run `bunx cap sync` in `packages/app` to pick up the
+native plugin. iOS and Android builds pull in `llama-cpp-capacitor`'s prebuilt
+native libraries automatically.
+## Configuration
+| Env var | Description |
+|---------|-------------|
+| `ELIZA_LLAMA_CACHE_TYPE_K` | KV-cache key type — `f16`, `tbq3_0`, `tbq4_0`. Requires the buun-llama-cpp fork for non-`f16` values. |
+| `ELIZA_LLAMA_CACHE_TYPE_V` | KV-cache value type — same values. |
+Explicit `cacheTypeK`/`cacheTypeV` fields on `LoadOptions` take precedence over env vars.
 ## Scope notes
-- Only **one model is loaded at a time**. `load()` disposes the previous
-  context first so we never double-allocate VRAM on device.
-- GGUF files are downloaded to the app sandbox by the
-  `@elizaos/app-core` downloader (shared with desktop). The mobile UI
-  filters the catalog to small/tiny bucket models only, since anything
-  larger won't realistically run on a phone.
+- Only **one model is loaded per adapter role** at a time. `load()` disposes
+  the previous context for that adapter before reinitializing, so VRAM is
+  never double-allocated.
+- GGUF files are downloaded to the app sandbox by the `@elizaos/ui`
+  downloader (`src/services/local-inference/downloader.ts`, shared with
+  desktop). The mobile UI filters the catalog to small/tiny models only.
 - Streaming tokens flow over Capacitor's native event bus
   (`@LlamaCpp_onToken`). Subscribe via `capacitorLlama.onToken(listener)`.
-- For a full desktop-level feature set (embeddings, reranking, chat
-  templates, tool calling), read the upstream
-  [`llama-cpp-capacitor` README](https://github.com/arusatech/annadata-llama-cpp).
-  This adapter only wires the minimal slice needed for Eliza's agent
-  runtime; extend it as the mobile product grows.
-## Licensing
-MIT — matches `llama-cpp-capacitor` and llama.cpp upstream.
+- The `buun-llama-cpp` fork exposes optional `setCacheType`, `setSpecType`,
+  and `getNativeKernels` bridge methods for TurboQuant KV caches and MTP
+  speculative decoding. Stock builds warn and skip unsupported calls.
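
Note on the new Configuration section: the README states that explicit `cacheTypeK`/`cacheTypeV` fields on `LoadOptions` take precedence over the `ELIZA_LLAMA_CACHE_TYPE_K` / `ELIZA_LLAMA_CACHE_TYPE_V` env vars. The sketch below illustrates that precedence rule only. The names `CacheOptions`, `parseCacheType`, and `resolveCacheTypes` are hypothetical and not taken from the package, and silently ignoring unrecognized env values is an assumption; the real `kv-cache-resolver` may behave differently.

```typescript
// Hypothetical sketch of "explicit option beats env var" resolution.
// None of these names are confirmed exports of @elizaos/capacitor-llama.
type CacheType = "f16" | "tbq3_0" | "tbq4_0";

interface CacheOptions {
  cacheTypeK?: CacheType;
  cacheTypeV?: CacheType;
}

// The three values listed in the README's Configuration table.
const VALID_CACHE_TYPES: readonly CacheType[] = ["f16", "tbq3_0", "tbq4_0"];

// Returns the matching cache type, or undefined for missing/unknown input.
// (Assumption: unknown values are ignored rather than rejected.)
function parseCacheType(raw: string | undefined): CacheType | undefined {
  const value = raw?.trim();
  return VALID_CACHE_TYPES.find((t) => t === value);
}

// Explicit fields win; env vars only fill in fields left undefined.
function resolveCacheTypes(
  options: CacheOptions,
  env: Record<string, string | undefined>,
): CacheOptions {
  return {
    cacheTypeK:
      options.cacheTypeK ?? parseCacheType(env["ELIZA_LLAMA_CACHE_TYPE_K"]),
    cacheTypeV:
      options.cacheTypeV ?? parseCacheType(env["ELIZA_LLAMA_CACHE_TYPE_V"]),
  };
}

// Usage: K is set explicitly, V falls back to the environment.
const resolved = resolveCacheTypes(
  { cacheTypeK: "tbq4_0" },
  { ELIZA_LLAMA_CACHE_TYPE_K: "f16", ELIZA_LLAMA_CACHE_TYPE_V: "tbq3_0" },
);
console.log(resolved);
```

Passing the environment as a parameter instead of reading `process.env` directly keeps the sketch usable in a Capacitor WebView, where `process` may not exist.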

package/dist/esm/capacitor-llama-adapter.d.ts CHANGED

@@ -1,4 +1,95 @@
-import type { LlamaAdapter } from "./definitions";
+import type { EmbedOptions, EmbedResult, GenerateOptions, GenerateResult, GenerateStreamOptions, GenerationEvent, HardwareInfo, LlamaAdapter, LoadOptions, SetSpecTypeArgs } from "./definitions";
+export declare class CapacitorLlamaAdapter implements LlamaAdapter {
+    private plugin;
+    /** Cached loader promise so concurrent `load()` calls don't race to register duplicate listeners. */
+    private pluginLoadPromise;
+    private loadedPath;
+    /**
+     * Native context id this adapter owns. Allocated lazily on first `load()`
+     * from the process-wide `nextContextId` counter so distinct adapter
+     * instances never share a context — see the module-level invariant comment.
+     */
+    private contextId;
+    private tokenIndex;
+    private tokenListeners;
+    private pluginListenerHandle;
+    /**
+     * Latest native completion stats captured by `generateStream`. Read by
+     * the `generate()` wrapper to populate `GenerateResult` without
+     * re-issuing the native call. Cleared at the start of every
+     * `generateStream` invocation.
+     */
+    private lastCompletionStats;
+    private requireContextId;
+    private loadPlugin;
+    getHardwareInfo(): Promise<HardwareInfo>;
+    setCacheType(typeK: string, typeV: string): Promise<void>;
+    setSpecType(args: SetSpecTypeArgs): Promise<void>;
+    isLoaded(): Promise<{
+        loaded: boolean;
+        modelPath: string | null;
+    }>;
+    currentModelPath(): string | null;
+    load(options: LoadOptions): Promise<void>;
+    unload(): Promise<void>;
+    /**
+     * Build the params object for the native completion call. Shared between
+     * the legacy `generate()` path and the new `generateStream()` path so the
+     * cache-key + stop-sequence wiring lives in one place.
+     */
+    private buildNativeParams;
+    /**
+     * Invoke the native completion (or generateText) entry point with a
+     * pre-built params bag. Returns the raw native result; callers map this
+     * to `GenerateResult` or to a `done` event.
+     */
+    private runNativeCompletion;
+    /**
+     * Native bridges currently don't honour per-generation sampler-stage
+     * injection — the Swift / Kotlin side needs separate wiring. Until that
+     * lands we log once per stage and otherwise pass through. The stages
+     * remain in the options object so downstream observers (telemetry,
+     * tests) can still see them.
+     */
+    private logUnwiredSamplerStages;
+    generate(options: GenerateOptions): Promise<GenerateResult>;
+    /**
+     * Streaming generation. Subscribes to the native token event bridge,
+     * starts the completion call, and yields typed `GenerationEvent`s as
+     * tokens arrive. The stream ends with exactly one `done` event (or one
+     * terminal `error`) once the native call resolves.
+     *
+     * Sampler-stage injection (`samplerStages`) and the per-generation
+     * spec-decode toggle (`specDecode`) are accepted but currently pass
+     * through unchanged on the JS side — the Swift / Kotlin bridge wiring is tracked
+     * separately. They flow through as part of the options bag so the
+     * native side can pick them up without an interface change.
+     */
+    generateStream(options: GenerateStreamOptions): AsyncIterable<GenerationEvent>;
+    setDrafter(drafterPath: string | null): Promise<void>;
+    trimMemory(level: "minor" | "major"): Promise<void>;
+    cancelGenerate(): Promise<void>;
+    /**
+     * Round-trip to the loaded GGUF's native chat template via
+     * `LlamaCpp.getFormattedChat`. The plugin's Java side serializes
+     * `messages` as a JSON string and invokes
+     * `cap_format_chat()` → `llama_chat_apply_template()`. Returns the
+     * rendered prompt (or null when the GGUF has no template metadata).
+     */
+    formatChat(messages: {
+        role: string;
+        content: string;
+    }[]): Promise<string | null>;
+    embed(options: EmbedOptions): Promise<EmbedResult>;
+    onToken(listener: (token: string, index: number) => void): () => void;
+    dispose(): Promise<void>;
+}
+/**
+ * Default singleton kept for back-compat with device-bridge-client and
+ * hardware-probe callers that don't distinguish chat vs embedding roles.
+ * The runtime's `localInferenceLoader` service uses per-role instances
+ * instead — see `registerCapacitorLlamaLoader`.
+ */
 export declare const capacitorLlama: LlamaAdapter;
 export declare function registerCapacitorLlamaLoader(runtime: {
     registerService?: (name: string, impl: unknown) => unknown;