@elizaos/capacitor-llama 2.0.0-beta.1 → 2.0.3-beta.3

package/LICENSE ADDED
@@ -0,0 +1,21 @@
+ MIT License
+
+ Copyright (c) 2026 Shaw Walters and elizaOS Contributors
+
+ Permission is hereby granted, free of charge, to any person obtaining a copy
+ of this software and associated documentation files (the "Software"), to deal
+ in the Software without restriction, including without limitation the rights
+ to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+ copies of the Software, and to permit persons to whom the Software is
+ furnished to do so, subject to the following conditions:
+
+ The above copyright notice and this permission notice shall be included in all
+ copies or substantial portions of the Software.
+
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+ AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+ OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ SOFTWARE.
package/README.md CHANGED
@@ -1,68 +1,89 @@
  # @elizaos/capacitor-llama
 
- Mobile llama.cpp adapter for Eliza. A **thin wrapper** over
+ Mobile llama.cpp adapter for elizaOS. A thin wrapper over
  [`llama-cpp-capacitor`](https://github.com/arusatech/annadata-llama-cpp) that
- maps its contextId-based API onto Eliza's `LocalInferenceLoader` contract,
- so the standard `ActiveModelCoordinator` in `@elizaos/app-core` can switch
- between the desktop (node-llama-cpp) engine and mobile native inference
- transparently.
+ maps its contextId-based API onto elizaOS's `LocalInferenceLoader` contract,
+ so the `ActiveModelCoordinator` in `@elizaos/ui`
+ (`src/services/local-inference/`) can switch between the desktop
+ (node-llama-cpp) engine and mobile native inference transparently.
 
  ## What it does
 
  - Registers as the runtime's `localInferenceLoader` service during the
- Capacitor bootstrap.
- - Maps `loadModel({ modelPath })` → `initContext`.
- - Maps `unloadModel()` → `releaseContext` / `releaseAllContexts`.
- - Exposes a `generate()` surface matching the desktop engine.
- - Fans the native `@LlamaCpp_onToken` stream out to Eliza's token listeners.
+ Capacitor bootstrap via `registerCapacitorLlamaLoader(runtime)`.
+ - Maps `load({ modelPath })` → `initContext` (one native context per adapter
+ instance; chat and embedding run on separate instances to avoid context
+ collisions).
+ - Maps `unload()` → `releaseContext`.
+ - Exposes `generate()` and `generateStream()` that target the chat model, and
+ `embed()` that targets a separate embedding-model context.
+ - Applies the loaded GGUF's native chat template via `formatChat()` (backed
+ by `llama_chat_apply_template`).
+ - Fans the native `@LlamaCpp_onToken` stream out to elizaOS token listeners.
+ - Provides `DeviceBridgeClient` — a WebSocket relay that lets an agent
+ container reach a paired mobile device for inference (load, generate, embed,
+ formatChat over a JSON RPC protocol).
+ - Provides `serializeTokenTree` / `deserializeTokenTree` — binary codec for
+ the native speculative-decode sampler-hook wire format.
 
  ## What it does not do
 
  - It does not ship llama.cpp native binaries — `llama-cpp-capacitor`
  handles iOS (arm64 + x86_64 with Metal) and Android (arm64-v8a,
  armeabi-v7a, x86, x86_64) itself.
- - It does not run on web. On Electrobun / Vite we fall back to the
- standalone `node-llama-cpp` engine in `@elizaos/app-core`.
+ - It does not run on web. On Electrobun / Vite the desktop agent uses the
+ standalone `node-llama-cpp` engine (`LocalInferenceEngine` in
+ `@elizaos/ui`, `src/services/local-inference/engine.ts`).
+ - It does not export an elizaOS `Plugin` object; it is wired manually via
+ `registerCapacitorLlamaLoader`.
 
- ## Setup in apps/app
+ ## Consumption
 
- 1. Install the dependency (already declared here):
+ This package is consumed by `@elizaos/ui` in
+ `src/api/ios-local-agent-kernel.ts`, which dynamically imports
+ `@elizaos/capacitor-llama` and uses the `capacitorLlama` singleton for the
+ mobile local-agent kernel. The Capacitor app shell lives in `packages/app`
+ (its `package.json` declares the `llama-cpp-capacitor` native dependency).
 
- ```bash
- bun install
- ```
+ Two ways to wire the adapter into a runtime:
 
- 2. Register the loader during Capacitor bootstrap. In `apps/app`'s
- Capacitor init path (currently in `src/capacitor-shell.ts` or the
- runtime bootstrap that owns the mobile `AgentRuntime`):
+ - **`registerCapacitorLlamaLoader(runtime)`** registers a
+ `localInferenceLoader` service backed by separate chat and embedding adapter
+ instances. Call it during the mobile runtime bootstrap, in the init path that
+ owns the mobile `AgentRuntime`:
 
- ```ts
- import { registerCapacitorLlamaLoader } from "@elizaos/capacitor-llama";
+ ```ts
+ import { registerCapacitorLlamaLoader } from "@elizaos/capacitor-llama";
 
- // After runtime boot, before the Model Hub is mounted:
- registerCapacitorLlamaLoader(runtime);
- ```
+ registerCapacitorLlamaLoader(runtime);
+ ```
 
- 3. Run `bunx cap sync` in `apps/app` to pick up the native plugin. iOS and
- Android builds will pull in `llama-cpp-capacitor`'s prebuilt native
- libraries automatically.
+ - **`capacitorLlama`** is the default singleton `LlamaAdapter`, used directly by
+ callers that don't need per-role context separation.
+
+ After adding native code, run `bunx cap sync` in `packages/app` to pick up the
+ native plugin. iOS and Android builds pull in `llama-cpp-capacitor`'s prebuilt
+ native libraries automatically.
+
+ ## Configuration
+
+ | Env var | Description |
+ |---------|-------------|
+ | `ELIZA_LLAMA_CACHE_TYPE_K` | KV-cache key type — `f16`, `tbq3_0`, `tbq4_0`. Requires the buun-llama-cpp fork for non-`f16` values. |
+ | `ELIZA_LLAMA_CACHE_TYPE_V` | KV-cache value type — same values. |
+
+ Explicit `cacheTypeK`/`cacheTypeV` fields on `LoadOptions` take precedence over env vars.
 
  ## Scope notes
 
- - Only **one model is loaded at a time**. `load()` disposes the previous
- context first so we never double-allocate VRAM on device.
- - GGUF files are downloaded to the app sandbox by the
- `@elizaos/app-core` downloader (shared with desktop). The mobile UI
- filters the catalog to small/tiny bucket models only, since anything
- larger won't realistically run on a phone.
+ - Only **one model is loaded per adapter role** at a time. `load()` disposes
+ the previous context for that adapter before reinitializing, so VRAM is
+ never double-allocated.
+ - GGUF files are downloaded to the app sandbox by the `@elizaos/ui`
+ downloader (`src/services/local-inference/downloader.ts`, shared with
+ desktop). The mobile UI filters the catalog to small/tiny models only.
  - Streaming tokens flow over Capacitor's native event bus
  (`@LlamaCpp_onToken`). Subscribe via `capacitorLlama.onToken(listener)`.
- - For a full desktop-level feature set (embeddings, reranking, chat
- templates, tool calling), read the upstream
- [`llama-cpp-capacitor` README](https://github.com/arusatech/annadata-llama-cpp).
- This adapter only wires the minimal slice needed for Eliza's agent
- runtime; extend it as the mobile product grows.
-
- ## Licensing
-
- MIT — matches `llama-cpp-capacitor` and llama.cpp upstream.
+ - The `buun-llama-cpp` fork exposes optional `setCacheType`, `setSpecType`,
+ and `getNativeKernels` bridge methods for TurboQuant KV caches and MTP
+ speculative decoding. Stock builds warn and skip unsupported calls.
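
Reviewer note on the added Configuration section: the stated precedence rule (explicit `cacheTypeK`/`cacheTypeV` on `LoadOptions` win over the `ELIZA_LLAMA_CACHE_TYPE_K` / `ELIZA_LLAMA_CACHE_TYPE_V` env vars) could be sketched roughly as follows. This is an illustration only, not the package's code: `resolveCacheTypes`, `CacheTypeOptions`, and `ResolvedCacheTypes` are hypothetical names, the environment is modelled as a plain object, and returning `undefined` when neither source is set is an assumption, since the diff does not state a default.

```typescript
// Hypothetical sketch of the documented precedence rule; the helper and type
// names below are NOT taken from @elizaos/capacitor-llama.

/** Subset of load options relevant to KV-cache typing (field names from the README). */
interface CacheTypeOptions {
  cacheTypeK?: string;
  cacheTypeV?: string;
}

/** Result of merging explicit options with environment overrides. */
interface ResolvedCacheTypes {
  typeK: string | undefined;
  typeV: string | undefined;
}

/**
 * Explicit option fields take precedence; the env vars are only consulted
 * when the corresponding field is absent. Assumption: an absent value is
 * `undefined`/`null`, so `??` is used and an empty string counts as "set".
 */
function resolveCacheTypes(
  options: CacheTypeOptions,
  env: Record<string, string | undefined>,
): ResolvedCacheTypes {
  return {
    typeK: options.cacheTypeK ?? env["ELIZA_LLAMA_CACHE_TYPE_K"],
    typeV: options.cacheTypeV ?? env["ELIZA_LLAMA_CACHE_TYPE_V"],
  };
}

// Example: the explicit key type overrides the env var, while the value type
// falls back to the env var because no explicit field was given.
const exampleResolved = resolveCacheTypes(
  { cacheTypeK: "tbq4_0" },
  { ELIZA_LLAMA_CACHE_TYPE_K: "f16", ELIZA_LLAMA_CACHE_TYPE_V: "tbq3_0" },
);
console.log(exampleResolved);
```

Taking the environment as a parameter rather than reading a global keeps the sketch runnable in any host; how the real adapter reads its environment on device is not shown in this diff.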
@@ -1,4 +1,95 @@
- import type { LlamaAdapter } from "./definitions";
+ import type { EmbedOptions, EmbedResult, GenerateOptions, GenerateResult, GenerateStreamOptions, GenerationEvent, HardwareInfo, LlamaAdapter, LoadOptions, SetSpecTypeArgs } from "./definitions";
+ export declare class CapacitorLlamaAdapter implements LlamaAdapter {
+     private plugin;
+     /** Cached loader promise so concurrent `load()` calls don't race to register duplicate listeners. */
+     private pluginLoadPromise;
+     private loadedPath;
+     /**
+      * Native context id this adapter owns. Allocated lazily on first `load()`
+      * from the process-wide `nextContextId` counter so distinct adapter
+      * instances never share a context — see the module-level invariant comment.
+      */
+     private contextId;
+     private tokenIndex;
+     private tokenListeners;
+     private pluginListenerHandle;
+     /**
+      * Latest native completion stats captured by `generateStream`. Read by
+      * the `generate()` wrapper to populate `GenerateResult` without
+      * re-issuing the native call. Cleared at the start of every
+      * `generateStream` invocation.
+      */
+     private lastCompletionStats;
+     private requireContextId;
+     private loadPlugin;
+     getHardwareInfo(): Promise<HardwareInfo>;
+     setCacheType(typeK: string, typeV: string): Promise<void>;
+     setSpecType(args: SetSpecTypeArgs): Promise<void>;
+     isLoaded(): Promise<{
+         loaded: boolean;
+         modelPath: string | null;
+     }>;
+     currentModelPath(): string | null;
+     load(options: LoadOptions): Promise<void>;
+     unload(): Promise<void>;
+     /**
+      * Build the params object for the native completion call. Shared between
+      * the legacy `generate()` path and the new `generateStream()` path so the
+      * cache-key + stop-sequence wiring lives in one place.
+      */
+     private buildNativeParams;
+     /**
+      * Invoke the native completion (or generateText) entry point with a
+      * pre-built params bag. Returns the raw native result; callers map this
+      * to `GenerateResult` or to a `done` event.
+      */
+     private runNativeCompletion;
+     /**
+      * Native bridges currently don't honour per-generation sampler-stage
+      * injection — the Swift / Kotlin side needs separate wiring. Until that
+      * lands we log once per stage and otherwise pass through. The stages
+      * remain in the options object so downstream observers (telemetry,
+      * tests) can still see them.
+      */
+     private logUnwiredSamplerStages;
+     generate(options: GenerateOptions): Promise<GenerateResult>;
+     /**
+      * Streaming generation. Subscribes to the native token event bridge,
+      * starts the completion call, and yields typed `GenerationEvent`s as
+      * tokens arrive. The stream ends with exactly one `done` event (or one
+      * terminal `error`) once the native call resolves.
+      *
+      * Sampler-stage injection (`samplerStages`) and the per-generation
+      * spec-decode toggle (`specDecode`) are accepted but currently pass
+      * through unchanged on the JS side — the Swift / Kotlin bridge wiring is tracked
+      * separately. They flow through as part of the options bag so the
+      * native side can pick them up without an interface change.
+      */
+     generateStream(options: GenerateStreamOptions): AsyncIterable<GenerationEvent>;
+     setDrafter(drafterPath: string | null): Promise<void>;
+     trimMemory(level: "minor" | "major"): Promise<void>;
+     cancelGenerate(): Promise<void>;
+     /**
+      * Round-trip to the loaded GGUF's native chat template via
+      * `LlamaCpp.getFormattedChat`. The plugin's Java side serializes
+      * `messages` as a JSON string and invokes
+      * `cap_format_chat()` → `llama_chat_apply_template()`. Returns the
+      * rendered prompt (or null when the GGUF has no template metadata).
+      */
+     formatChat(messages: {
+         role: string;
+         content: string;
+     }[]): Promise<string | null>;
+     embed(options: EmbedOptions): Promise<EmbedResult>;
+     onToken(listener: (token: string, index: number) => void): () => void;
+     dispose(): Promise<void>;
+ }
+ /**
+  * Default singleton kept for back-compat with device-bridge-client and
+  * hardware-probe callers that don't distinguish chat vs embedding roles.
+  * The runtime's `localInferenceLoader` service uses per-role instances
+  * instead — see `registerCapacitorLlamaLoader`.
+  */
  export declare const capacitorLlama: LlamaAdapter;
  export declare function registerCapacitorLlamaLoader(runtime: {
      registerService?: (name: string, impl: unknown) => unknown;