react-native-litert-lm (npm) 0.3.1 → 0.3.3

Files changed (16)

package/README.md +43 -25
package/android/build.gradle +6 -2
package/android/src/main/java/com/margelo/nitro/dev/litert/litertlm/HybridLiteRTLM.kt +26 -30
package/app.plugin.js +28 -3
package/cpp/HybridLiteRTLM.cpp +146 -63
package/cpp/HybridLiteRTLM.hpp +2 -2
package/lib/hooks.js +4 -0
package/lib/index.d.ts +19 -2
package/lib/index.js +24 -7
package/lib/specs/LiteRTLM.nitro.d.ts +7 -7
package/package.json +19 -13
package/scripts/build-ios-engine.sh +1 -1
package/scripts/download-ios-frameworks.sh +1 -1
package/src/hooks.ts +5 -0
package/src/index.ts +27 -6
package/src/specs/LiteRTLM.nitro.ts +7 -7

package/README.md CHANGED

@@ -1,16 +1,16 @@
 # react-native-litert-lm
-High-performance on-device LLM inference for React Native, powered by [LiteRT-LM](https://github.com/google-ai-edge/LiteRT-LM) and [Nitro Modules](https://github.com/mrousavy/nitro). Optimized for **Gemma 3n** and other on-device language models.
+High-performance on-device LLM inference for React Native, powered by [LiteRT-LM](https://github.com/google-ai-edge/LiteRT-LM) and [Nitro Modules](https://github.com/mrousavy/nitro). Optimized for **Gemma 4** and other on-device language models.
 ## Features
 - 🚀 **Native Performance** — Kotlin (Android) / C++ (iOS) via Nitro Modules JSI bindings
-- 🧠 **Gemma 3n Ready** — First-class support for Gemma 3n E2B/E4B models
+- 🧠 **Gemma 4 Ready** — First-class support for Gemma 4 E2B/E4B multimodal models (text + vision + audio)
 - ⚡ **GPU Acceleration** — GPU delegate (Android), Metal/MPS (iOS)
 - 🔄 **Streaming Support** — Token-by-token generation callbacks
 - 📱 **Cross-Platform** — Android API 26+ / iOS 15.0+
-- 🖼️ **Multimodal** — Image and audio input support (Android)
-- 🧵 **Async API** — Non-blocking inference on background threads
+- 🖼️ **Multimodal** — Image and audio input support
+- 🧵 **Async API** — Non-blocking inference on dedicated large-stack threads
 - 📊 **Real Memory Tracking** — OS-level memory metrics (RSS, native heap, available memory) via native APIs
 - 🧮 **Zero-Copy Buffers** — Memory snapshots stored in native ArrayBuffers via Nitro Modules
 - 📥 **Automatic Model Download** — Downloads models from URL with progress tracking and local caching
@@ -94,7 +94,7 @@ The `example/` directory contains a fully functional test app with a dark-themed
 ## Model Management
-LiteRT-LM models (like Gemma 3n) are large files (3 GB+) and cannot be bundled into your app binary. They are downloaded at runtime.
+LiteRT-LM models (like Gemma 4) are large files (2–4 GB) and cannot be bundled into your app binary. They are downloaded at runtime.
 ### Automatic Downloading
@@ -112,16 +112,15 @@ If you prefer to manage downloads yourself (e.g., using `expo-file-system`), dow
 ```typescript
 import * as FileSystem from "expo-file-system";
+import { GEMMA_4_E2B_IT } from "react-native-litert-lm";
-const MODEL_URL =
-  "https://huggingface.co/litert-community/gemma-3n-2b-it/resolve/main/model.litertlm";
-const localPath = `${FileSystem.documentDirectory}gemma-3n.litertlm`;
+const localPath = `${FileSystem.documentDirectory}gemma-4-E2B-it.litertlm`;
 async function downloadModel() {
   const info = await FileSystem.getInfoAsync(localPath);
   if (info.exists) return localPath;
-  await FileSystem.downloadAsync(MODEL_URL, localPath);
+  await FileSystem.downloadAsync(GEMMA_4_E2B_IT, localPath);
   return localPath;
 }
 ```
@@ -133,7 +132,7 @@ async function downloadModel() {
 The `useModel` hook manages the full model lifecycle: downloading, loading, inference, and cleanup.
 ```typescript
-import { useModel, GEMMA_3N_E2B_IT_INT4 } from "react-native-litert-lm";
+import { useModel, GEMMA_4_E2B_IT } from "react-native-litert-lm";
 import { Platform } from "react-native";
 function App() {
@@ -145,8 +144,8 @@ function App() {
     load,          // Manually trigger load
     deleteModel,   // Delete cached model file
     memorySummary, // Auto-updated memory stats (if tracking enabled)
-  } = useModel(GEMMA_3N_E2B_IT_INT4, {
-    backend: Platform.OS === 'ios' ? 'gpu' : 'cpu',
+  } = useModel(GEMMA_4_E2B_IT, {
+    backend: 'cpu',
     autoLoad: true, // Default: true. Set false to load manually via load().
     systemPrompt: "You are a helpful assistant.",
     enableMemoryTracking: true,
@@ -206,7 +205,7 @@ const warning = checkMultimodalSupport();
 if (warning) {
   console.warn(warning); // Experimental on iOS
 } else {
-  // Image input (for vision models like Gemma 3n)
+  // Image input (for vision models like Gemma 4)
   // Images >1024px are automatically resized to prevent OOM
   const response = await llm.sendMessageWithImage(
     "What's in this image?",
@@ -310,15 +309,20 @@ const buffer = tracker.getNativeBuffer();
 Download `.litertlm` models automatically using the exported URL constants, or manually from [HuggingFace](https://huggingface.co/litert-community):
-| Constant               | Model                                  | Size  | Min RAM |
-| :--------------------- | :------------------------------------- | :---- | :------ |
-| `GEMMA_3N_E2B_IT_INT4` | Gemma 3n E2B (Instruction Tuned, Int4) | ~3 GB | 4 GB+   |
+| Constant               | Model                              | Size     | Min RAM | Auth Required |
+| :--------------------- | :--------------------------------- | :------- | :------ | :------------ |
+| `GEMMA_4_E2B_IT`       | Gemma 4 E2B (Multimodal, IT)       | 2.58 GB  | 4 GB+   | ❌ No          |
+| `GEMMA_4_E4B_IT`       | Gemma 4 E4B (Higher Quality)       | 3.65 GB  | 6 GB+   | ❌ No          |
+| `GEMMA_3N_E2B_IT_INT4` | Gemma 3n E2B (Int4, Multimodal)    | ~1.3 GB  | 4 GB+   | ✅ HuggingFace |
+> **Recommended:** Use `GEMMA_4_E2B_IT` for most use cases. It's multimodal (text + vision + audio) and downloads directly from HuggingFace without requiring an account.
+>
+> **iOS Note:** Models larger than ~2 GB (like Gemma 4) require the `com.apple.developer.kernel.extended-virtual-addressing` entitlement. See [iOS Entitlements](#ios-entitlements) below.
 **Other compatible models** (download manually from HuggingFace):
 | Model         | Size    | Min RAM | Notes                 |
 | ------------- | ------- | ------- | --------------------- |
-| Gemma 3n E4B  | ~4 GB   | 8 GB+   | Higher quality        |
 | Gemma 3 1B    | ~1 GB   | 4 GB+   | Smallest, fastest     |
 | Phi-4 Mini    | ~2 GB   | 4 GB+   | Microsoft's small LLM |
 | Qwen 2.5 1.5B | ~1.5 GB | 4 GB+   | Multilingual          |
@@ -339,7 +343,7 @@ Loads a model from a local path or HTTPS URL.
 | Parameter             | Type     | Default | Description                               |
 | --------------------- | -------- | ------- | ----------------------------------------- |
 | `path`                | `string` | —       | Absolute path to `.litertlm` or HTTPS URL |
-| `config.backend`      | `string` | `'gpu'` | `'cpu'`, `'gpu'`, or `'npu'`              |
+| `config.backend`      | `string` | `'cpu'` | `'cpu'`, `'gpu'`, or `'npu'`              |
 | `config.systemPrompt` | `string` | —       | System prompt for the model               |
 | `config.temperature`  | `number` | `0.7`   | Sampling temperature                      |
 | `config.topK`         | `number` | `40`    | Top-K sampling                            |
@@ -354,7 +358,7 @@ Loads a model from a local path or HTTPS URL.
 | `'gpu'` | GPU / Metal         | Fast    | Recommended default                            |
 | `'npu'` | NPU / Neural Engine | Fastest | Requires supported hardware; falls back to GPU |
-> **iOS**: `'gpu'` uses Metal/MPS and is the recommended backend. The engine automatically tries multiple backend combinations if the primary one fails.
+> **iOS**: `'cpu'` is the recommended default backend. `'gpu'` (Metal/MPS) is also supported. The engine automatically tries multiple backend combinations if the primary one fails.
 ### `sendMessage(message): Promise<string>`
@@ -366,11 +370,11 @@ Streaming generation. Callback signature: `(token: string, isDone: boolean) => v
 ### `sendMessageWithImage(message, imagePath): Promise<string>`
-Send a message with an image (Android only; for vision models like Gemma 3n).
+Send a message with an image (for vision models like Gemma 4 E2B).
 ### `sendMessageWithAudio(message, audioPath): Promise<string>`
-Send a message with audio (Android only).
+Send a message with audio (for audio-capable models like Gemma 4 E2B).
 ### `getStats(): GenerationStats`
@@ -448,7 +452,7 @@ const prompt = applyGemmaTemplate(
 | react-native-nitro-modules | 0.35.0+       |
 | Android API                | 26+ (ARM64)   |
 | iOS                        | 15.0+ (ARM64) |
-| LiteRT-LM Engine            | 0.9.0           |
+| LiteRT-LM Engine            | 0.10.1          |
 ## Platform Support
@@ -463,7 +467,8 @@ const prompt = applyGemmaTemplate(
 | ---------------------------- | ------ | ----------------------------------------------------- |
 | Text inference (blocking)    | ✅     | Via LiteRT-LM C API                                   |
 | Text inference (streaming)   | ✅     | Token-by-token callbacks                              |
-| GPU inference (Metal/MPS)    | ✅     | Recommended backend                                   |
+| CPU inference                | ✅     | Recommended default backend                           |
+| GPU inference (Metal/MPS)    | ✅     | Supported via `backend: 'gpu'`                        |
 | Model download with progress | ✅     | NSURLSession, cached in `Caches/`                     |
 | Memory tracking              | ✅     | `mach_task_basic_info`                                |
 | Multi-turn conversation      | ✅     | Context retained across turns                         |
@@ -471,6 +476,19 @@ const prompt = applyGemmaTemplate(
 | Constrained decoding         | ❌     | Requires llguidance Rust runtime                      |
 | Function calling             | ❌     | Requires Rust CXX bridge runtime                      |
+### iOS Entitlements
+Models larger than ~2 GB (like Gemma 4 E2B at 2.58 GB) require the **Extended Virtual Addressing** entitlement on iOS physical devices. Without it, iOS limits virtual memory to ~2 GB and the app will be killed by Jetsam.
+Add to your app's `.entitlements` file:
+```xml
+<key>com.apple.developer.kernel.extended-virtual-addressing</key>
+<true/>
+```
+> **Note:** This entitlement requires a **paid Apple Developer account** ($99/year). Gemma 3n E2B (~1.3 GB) works without it.
 ## Building the iOS Engine
 The iOS build uses a **Bazel-to-XCFramework pipeline** that compiles the LiteRT-LM C engine and all transitive dependencies into a static library (~83 MB).
@@ -488,7 +506,7 @@ The iOS build uses a **Bazel-to-XCFramework pipeline** that compiles the LiteRT-
 This will:
-1. Clone/checkout LiteRT-LM `v0.9.0` source into `.litert-lm-build/`
+1. Clone/checkout LiteRT-LM `v0.10.1` source into `.litert-lm-build/`
 2. Build `//c:engine` for `ios_arm64` and `ios_sim_arm64` via Bazel
 3. Collect all transitive `.o` files (engine, protobuf, re2, sentencepiece, etc.)
 4. Compile C/C++ stubs for unavailable Rust dependencies
@@ -540,7 +558,7 @@ Additionally, `PromptTemplate` is patched at build time to use a simplified C++
 ```
 - **Android**: Kotlin (`HybridLiteRTLM.kt`) interfacing with the `litertlm-android` AAR.
-- **iOS**: C++ (`HybridLiteRTLM.cpp`) interfacing with the LiteRT-LM C API via a prebuilt `LiteRTLM.xcframework`. Platform-specific code (model downloading, file management) is in Objective-C++ (`ios/IOSDownloadHelper.mm`).
+- **iOS**: C++ (`HybridLiteRTLM.cpp`) interfacing with the LiteRT-LM C API via a prebuilt `LiteRTLM.xcframework`. All engine operations (load, inference, streaming) run on dedicated `pthread` threads with 8 MB stack to accommodate XNNPack's stack requirements. Platform-specific code (model downloading, file management) is in Objective-C++ (`ios/IOSDownloadHelper.mm`).
 > **For contributors**: Changes to `cpp/HybridLiteRTLM.cpp` do not affect Android. Feature changes must be applied to both the Kotlin and C++ implementations.

package/android/build.gradle CHANGED

@@ -9,9 +9,13 @@ plugins {
 // Apply Nitrogen autolinking
 apply from: '../nitrogen/generated/android/LiteRTLM+autolinking.gradle'
+// Read LiteRT-LM SDK version from package.json (single source of truth)
+def packageJson = new groovy.json.JsonSlurper().parseText(file('../package.json').text)
+def litertLmVersion = packageJson.litertLm.androidMavenVersion
 android {
     namespace "dev.litert.litertlm"
-    compileSdk 35
+    compileSdk 36
     defaultConfig {
         minSdk 26  // LiteRT-LM requires API 26+
@@ -84,5 +88,5 @@ dependencies {
     implementation 'org.jetbrains.kotlinx:kotlinx-coroutines-android:1.7.3'
     // LiteRT-LM Kotlin API
-    implementation 'com.google.ai.edge.litertlm:litertlm-android:0.9.0'
+    implementation "com.google.ai.edge.litertlm:litertlm-android:${litertLmVersion}"
 }
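
The Gradle hunks above stop hard-coding the SDK version and read it from `package.json` instead. As a rough illustration of that lookup, here is a minimal TypeScript sketch. Only the field path `litertLm.androidMavenVersion` and the Maven group/artifact come from the diff; the version value `0.0.0-example` and the helper name `mavenCoordinate` are placeholders invented for this example.

```typescript
// Hypothetical stand-in for the package's package.json. Only the field path
// litertLm.androidMavenVersion is taken from the diff; the version value
// below is a placeholder, not the real one.
const packageJsonText = `{
  "name": "react-native-litert-lm",
  "litertLm": { "androidMavenVersion": "0.0.0-example" }
}`;

interface PackageJson {
  litertLm?: { androidMavenVersion?: string };
}

// Build the Maven coordinate the same way the Gradle string interpolation
// does, but with an explicit check for a missing field.
function mavenCoordinate(jsonText: string): string {
  const pkg = JSON.parse(jsonText) as PackageJson;
  const version = pkg.litertLm?.androidMavenVersion;
  if (!version) {
    throw new Error("package.json is missing litertLm.androidMavenVersion");
  }
  return `com.google.ai.edge.litertlm:litertlm-android:${version}`;
}

console.log(mavenCoordinate(packageJsonText));
```

The explicit missing-field check is an addition of this sketch; the Groovy lines in the diff contain no such guard.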

package/android/src/main/java/com/margelo/nitro/dev/litert/litertlm/HybridLiteRTLM.kt CHANGED

@@ -44,8 +44,8 @@ internal class StreamingCallbackListener(
     private val history: MutableList<Message>,
 ) : com.google.ai.edge.litertlm.MessageCallback {
-    override fun onMessage(responseMsg: com.google.ai.edge.litertlm.LiteRTMessage) {
-        val chunk = responseMsg.contents
+    override fun onMessage(responseMsg: com.google.ai.edge.litertlm.Message) {
+        val chunk = responseMsg.contents.contents
             .filterIsInstance<com.google.ai.edge.litertlm.Content.Text>()
             .joinToString("") { it.text }
@@ -123,7 +123,7 @@ class HybridLiteRTLM : HybridLiteRTLMSpec() {
     )
     // Configuration
-    private var backend: Backend = Backend.GPU
+    private var backend: Backend = Backend.CPU
     private var temperature: Double = 0.7
     private var topK: Int = 40
     private var topP: Double = 0.95
@@ -161,21 +161,21 @@ class HybridLiteRTLM : HybridLiteRTLMSpec() {
                 }
                 try {
-                    // Map our Backend enum to LiteRT-LM Backend enum
+                    // Map our Backend enum to LiteRT-LM Backend sealed class
                     val lmBackend = when (backend) {
-                        Backend.GPU -> com.google.ai.edge.litertlm.Backend.GPU
+                        Backend.GPU -> com.google.ai.edge.litertlm.Backend.GPU()
                         Backend.NPU -> {
                             Log.i(TAG, "NPU backend requested - requires hardware support")
-                            com.google.ai.edge.litertlm.Backend.NPU
+                            com.google.ai.edge.litertlm.Backend.NPU()
                         }
-                        else -> com.google.ai.edge.litertlm.Backend.CPU
+                        else -> com.google.ai.edge.litertlm.Backend.CPU()
                     }
-                    // Vision backend: hardcoded to GPU (required by Gemma 3n)
-                    val lmVisionBackend = com.google.ai.edge.litertlm.Backend.GPU
+                    // Vision backend: hardcoded to GPU (required by Gemma models)
+                    val lmVisionBackend = com.google.ai.edge.litertlm.Backend.GPU()
                     // Audio backend: hardcoded to CPU (optimal for audio processing)
-                    val lmAudioBackend = com.google.ai.edge.litertlm.Backend.CPU
+                    val lmAudioBackend = com.google.ai.edge.litertlm.Backend.CPU()
                     Log.i(TAG, "Backend config: main=$lmBackend, vision=$lmVisionBackend (hardcoded), audio=$lmAudioBackend (hardcoded)")
@@ -228,13 +228,13 @@ class HybridLiteRTLM : HybridLiteRTLMSpec() {
             Log.i(TAG, "sendMessage (Promise): $message")
             // Blocking inference (safe here because we are in Promise.parallel worker thread)
-            val userMsg = LiteRTMessage.of(message)
+            val userMsg = LiteRTMessage.of(text = message)
             val startTime = System.nanoTime()
-            val responseMsg = conversation!!.sendMessage(userMsg)
+            val responseMsg = conversation!!.sendMessage(message = userMsg)
             val elapsedMs = (System.nanoTime() - startTime) / 1_000_000.0
             // Extract text
-            val response = responseMsg.contents
+            val response = responseMsg.contents.contents
                 .filterIsInstance<com.google.ai.edge.litertlm.Content.Text>()
                 .joinToString("") { it.text }
@@ -242,6 +242,9 @@ class HybridLiteRTLM : HybridLiteRTLMSpec() {
             history.add(Message(Role.MODEL, response))
             // Update stats with real timing data
+            // Token count heuristic: LiteRT-LM Android SDK does not expose
+            // actual token counts from inference. We approximate using
+            // ~4 chars/token. iOS uses the C API benchmark info for real counts.
             val promptTokens = message.length / 4.0
             val completionTokens = response.length / 4.0
             lastStats = GenerationStats(
@@ -279,8 +282,8 @@ class HybridLiteRTLM : HybridLiteRTLMSpec() {
         )
         try {
-            val userMsg = LiteRTMessage.of(message)
-            conversation!!.sendMessageAsync(userMsg, listener)
+            val userMsg = LiteRTMessage.of(text = message)
+            conversation!!.sendMessageAsync(message = userMsg, callback = listener)
         } catch (e: Exception) {
             Log.e(TAG, "Failed to initiate async generation", e)
             onToken("Error: ${e.message}", true)
@@ -343,19 +346,14 @@ class HybridLiteRTLM : HybridLiteRTLMSpec() {
             // Use factory method Message.of passing a list of Content
             val textContent = Content.Text(message)
-            val contentList = listOf(
-                textContent,
-                Content.ImageFile(processedImagePath)
-            )
-            val userMsg = LiteRTMessage.of(contentList)
+            val userMsg = LiteRTMessage.of(textContent, Content.ImageFile(processedImagePath))
             // Add to history
             history.add(Message(Role.USER, "$message [Image]"))
-            val responseMsg = conversation!!.sendMessage(userMsg)
+            val responseMsg = conversation!!.sendMessage(message = userMsg)
-            val response = responseMsg.contents
+            val response = responseMsg.contents.contents
                 .filterIsInstance<Content.Text>()
                 .joinToString("") { it.text }
@@ -490,18 +488,16 @@ class HybridLiteRTLM : HybridLiteRTLMSpec() {
             // Load audio
-            val contentList = listOf(
+            val userMsg = LiteRTMessage.of(
                 Content.Text(message),
                 Content.AudioFile(audioPath)
             )
-            val userMsg = LiteRTMessage.of(contentList)
             history.add(Message(Role.USER, "$message [Audio]"))
-            val responseMsg = conversation!!.sendMessage(userMsg)
+            val responseMsg = conversation!!.sendMessage(message = userMsg)
-            val response = responseMsg.contents
+            val response = responseMsg.contents.contents
                 .filterIsInstance<Content.Text>()
                 .joinToString("") { it.text }
@@ -628,8 +624,8 @@ class HybridLiteRTLM : HybridLiteRTLMSpec() {
                     // Send system instruction as the first turn to prime the conversation.
                     // LiteRT-LM's Conversation API handles chat template formatting,
                     // including Gemma's <start_of_turn>system block.
-                    val systemMsg = LiteRTMessage.of(listOf(Content.Text(prompt)))
-                    conversation!!.sendMessage(systemMsg)
+                    val systemMsg = LiteRTMessage.of(Content.Text(prompt))
+                    conversation!!.sendMessage(message = systemMsg)
                     Log.i(TAG, "System prompt applied (${prompt.length} chars)")
                 } catch (e: Exception) {
                     Log.w(TAG, "Failed to apply system prompt: ${e.message}")
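
The comment added in the `sendMessage` hunk above explains that Android token counts are approximated at about 4 characters per token. A minimal TypeScript sketch of that arithmetic follows; `estimateTokens` mirrors the `length / 4.0` expression from the diff, while `estimateTokensPerSecond` is a hypothetical helper added here only to show how the estimate could feed a throughput figure.

```typescript
// Approximate token count at ~4 characters per token, mirroring the
// `length / 4.0` heuristic in the Kotlin diff. This is not a tokenizer.
function estimateTokens(text: string): number {
  return text.length / 4.0;
}

// Hypothetical helper (not in the diff): tokens per second derived from the
// estimate and a measured elapsed time in milliseconds.
function estimateTokensPerSecond(response: string, elapsedMs: number): number {
  if (elapsedMs <= 0) return 0; // avoid dividing by zero
  return estimateTokens(response) / (elapsedMs / 1000);
}

console.log(estimateTokens("What is LiteRT-LM?"));
console.log(estimateTokensPerSecond("A short model reply.", 250));
```

Both Kotlin's `String.length` and JavaScript's `string.length` count UTF-16 code units, so the two estimates agree for the same text, though neither reflects the model's real tokenization.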

package/app.plugin.js CHANGED

@@ -2,10 +2,12 @@
  * Expo config plugin for react-native-litert-lm.
  *
  * Ensures correct build settings for the LiteRT-LM native module:
- * - Android: minSdkVersion 26, arm64-v8a ABI filter
- * - iOS: deployment target 15.0
+ * - Android: minSdkVersion 26, Kotlin 2.3.0 (required by litertlm-android AAR)
  */
-const { withGradleProperties, withXcodeProject } = require('@expo/config-plugins');
+const {
+  withGradleProperties,
+  withProjectBuildGradle,
+} = require('@expo/config-plugins');
 function withLiteRTLM(config) {
   // Android: Ensure minSdkVersion is at least 26
@@ -27,6 +29,29 @@ function withLiteRTLM(config) {
     return config;
   });
+  // Android: Pin Kotlin Gradle plugin to 2.3.0
+  // The litertlm-android AAR uses Kotlin 2.3.0 metadata (version defined in
+  // package.json → litertLm.androidMavenVersion).
+  // React Native's default Kotlin version (2.1.0) cannot read this metadata,
+  // so we must force the Kotlin Gradle plugin to 2.3.0 in the project-level
+  // build.gradle. This ensures the fix survives `expo prebuild --clean`.
+  config = withProjectBuildGradle(config, (config) => {
+    if (config.modResults.language === 'groovy') {
+      const contents = config.modResults.contents;
+      // Only add if not already pinned
+      if (!contents.includes("kotlin-gradle-plugin:2.3.0")) {
+        // Replace the unversioned kotlin-gradle-plugin classpath with a pinned one
+        config.modResults.contents = contents.replace(
+          "classpath('org.jetbrains.kotlin:kotlin-gradle-plugin')",
+          "classpath('org.jetbrains.kotlin:kotlin-gradle-plugin:2.3.0')"
+        );
+      }
+    }
+    return config;
+  });
   return config;
 }
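
The Kotlin pinning step above is a plain string replacement on the project-level `build.gradle`. The following TypeScript sketch isolates that logic as a pure function so its behavior on different inputs is easy to see. The two classpath strings and the `includes` guard are copied from the diff; the function name `pinKotlinPlugin` and the sample inputs are assumptions made for illustration.

```typescript
// Strings taken from the plugin diff.
const UNPINNED = "classpath('org.jetbrains.kotlin:kotlin-gradle-plugin')";
const PINNED = "classpath('org.jetbrains.kotlin:kotlin-gradle-plugin:2.3.0')";

// Hypothetical pure-function version of the replacement in the plugin.
function pinKotlinPlugin(contents: string): string {
  // Only add the pin if it is not already present.
  if (contents.includes("kotlin-gradle-plugin:2.3.0")) return contents;
  // String.prototype.replace with a string pattern replaces the first match only.
  return contents.replace(UNPINNED, PINNED);
}

// Sample inputs (invented for this example).
const unversioned = `dependencies {\n  ${UNPINNED}\n}`;
const doubleQuoted =
  'dependencies {\n  classpath("org.jetbrains.kotlin:kotlin-gradle-plugin")\n}';

console.log(pinKotlinPlugin(unversioned).includes(PINNED)); // pinned
console.log(pinKotlinPlugin(doubleQuoted) === doubleQuoted); // left untouched
```

Because the match is an exact literal, a `build.gradle` that writes the classpath with double quotes, or with a different pinned version, passes through unchanged in this sketch.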

package/cpp/HybridLiteRTLM.cpp CHANGED

@@ -23,13 +23,52 @@
 #ifdef __APPLE__
 #include "IOSDownloadHelper.h"
+#include <os/proc.h>
 #endif
 #include <fstream>
 #include <thread>
 #include <regex>
+#include <pthread.h>
+#include <functional>
 namespace margelo::nitro::litertlm {
+// =============================================================================
+// Thread Helper — LiteRT engine operations need >512KB stack (XNNPack, Metal)
+// =============================================================================
+static void runOnLargeStack(std::function<void()> work, size_t stackSize = 8 * 1024 * 1024) {
+  struct Context {
+    std::function<void()> fn;
+    std::exception_ptr exception;
+  };
+  Context ctx{std::move(work), nullptr};
+  pthread_t thread;
+  pthread_attr_t attr;
+  pthread_attr_init(&attr);
+  pthread_attr_setstacksize(&attr, stackSize);
+  int rc = pthread_create(&thread, &attr, [](void* arg) -> void* {
+    auto* c = static_cast<Context*>(arg);
+    try {
+      c->fn();
+    } catch (...) {
+      c->exception = std::current_exception();
+    }
+    return nullptr;
+  }, &ctx);
+  pthread_attr_destroy(&attr);
+  if (rc != 0) {
+    throw std::runtime_error("Failed to create large-stack thread (errno: " + std::to_string(rc) + ")");
+  }
+  pthread_join(thread, nullptr);
+  if (ctx.exception) {
+    std::rethrow_exception(ctx.exception);
+  }
+}
 // =============================================================================
 // JSON Helpers
 // =============================================================================
@@ -70,6 +109,34 @@ std::string HybridLiteRTLM::buildAudioMessageJson(const std::string& text, const
          "]}";
 }
+/**
+ * Strip Gemma / LiteRT-LM control tokens from model output.
+ * The iOS C API returns raw model text including stop/turn markers
+ * that the Android Kotlin SDK strips automatically.
+ */
+static std::string stripControlTokens(const std::string& text) {
+  static const char* tokens[] = {
+    "<end_of_turn>",
+    "<start_of_turn>model",
+    "<start_of_turn>user",
+    "<start_of_turn>",
+    "<eos>",
+  };
+  std::string result = text;
+  for (auto* tok : tokens) {
+    std::string t(tok);
+    size_t pos;
+    while ((pos = result.find(t)) != std::string::npos) {
+      result.erase(pos, t.length());
+    }
+  }
+  // Trim leading/trailing whitespace
+  size_t start = result.find_first_not_of(" \t\n\r");
+  if (start == std::string::npos) return "";
+  size_t end = result.find_last_not_of(" \t\n\r");
+  return result.substr(start, end - start + 1);
+}
 std::string HybridLiteRTLM::extractTextFromResponse(const std::string& jsonResponse) {
   // The C API response JSON is structured as:
   //   {"role":"model","content":[{"type":"text","text":"..."}]}
@@ -102,7 +169,7 @@ std::string HybridLiteRTLM::extractTextFromResponse(const std::string& jsonRespo
           result += jsonResponse[i];
         }
       }
-      return result;
+      return stripControlTokens(result);
     }
   }
@@ -125,11 +192,11 @@ std::string HybridLiteRTLM::extractTextFromResponse(const std::string& jsonRespo
         result += jsonResponse[i];
       }
     }
-    return result;
+    return stripControlTokens(result);
   }
-  // Fallback: return full response
-  return jsonResponse;
+  // Fallback: return full response (still strip control tokens)
+  return stripControlTokens(jsonResponse);
 }
 // =============================================================================
@@ -191,7 +258,9 @@ std::shared_ptr<Promise<void>> HybridLiteRTLM::loadModel(
     const std::string& modelPath,
     const std::optional<LLMConfig>& config) {
   return Promise<void>::async([this, modelPath, config]() {
-    loadModelInternal(modelPath, config);
+    runOnLargeStack([&]() {
+      loadModelInternal(modelPath, config);
+    });
   });
 }
@@ -243,7 +312,7 @@ void HybridLiteRTLM::loadModelInternal(
       modelPath.c_str(),
       backend,
       visionBackend,
-      nullptr // audio executor not supported on iOS yet
+      "cpu" // audio executor: iOS XCFramework lacks compiled audio ops (INTERNAL ERROR at Invoke)
     );
     if (!settings) {
       return false;
@@ -336,7 +405,11 @@ void HybridLiteRTLM::loadModelInternal(
 std::shared_ptr<Promise<std::string>> HybridLiteRTLM::sendMessage(const std::string& message) {
   return Promise<std::string>::async([this, message]() -> std::string {
-    return sendMessageInternal(message);
+    std::string result;
+    runOnLargeStack([&]() {
+      result = sendMessageInternal(message);
+    });
+    return result;
   });
 }
@@ -431,9 +504,13 @@ void HybridLiteRTLM::streamCallbackFn(void* callback_data, const char* chunk,
   if (chunk) {
     std::string token(chunk);
-    ctx->fullResponse += token;
+    // Filter out Gemma control tokens from streamed chunks
+    std::string cleaned = stripControlTokens(token);
+    ctx->fullResponse += cleaned;
     ctx->tokenCount++;
-    ctx->onToken(token, false);
+    if (!cleaned.empty()) {
+      ctx->onToken(cleaned, false);
+    }
   }
 }
@@ -445,34 +522,42 @@ void HybridLiteRTLM::sendMessageAsync(
   auto onTokenCopy = onToken;
   auto messageCopy = message;
-  // Capture shared state safely
-  auto* ctx = new StreamContext();
-  ctx->onToken = std::move(onTokenCopy);
-  ctx->fullResponse = "";
-  ctx->history = &history_;
-  ctx->historyMutex = &mutex_;
-  ctx->userMessage = messageCopy;
-  ctx->lastStats = &lastStats_;
-  ctx->startTime = std::chrono::steady_clock::now();
-  ctx->tokenCount = 0;
+  // Capture shared state safely — use unique_ptr to prevent leaks
+  auto ctxOwner = std::make_unique<StreamContext>();
+  ctxOwner->onToken = std::move(onTokenCopy);
+  ctxOwner->fullResponse = "";
+  ctxOwner->history = &history_;
+  ctxOwner->historyMutex = &mutex_;
+  ctxOwner->userMessage = messageCopy;
+  ctxOwner->lastStats = &lastStats_;
+  ctxOwner->startTime = std::chrono::steady_clock::now();
+  ctxOwner->tokenCount = 0;
 #ifdef __APPLE__
   ensureLoaded();
   std::string msgJson = buildTextMessageJson(messageCopy);
-  int result = litert_lm_conversation_send_message_stream(
-    conversation_, msgJson.c_str(), nullptr,
-    streamCallbackFn, ctx);
+  // Release ownership — the C callback now owns the context via raw pointer.
+  // streamCallbackFn will delete it when done or on error.
+  StreamContext* ctx = ctxOwner.release();
-  if (result != 0) {
-    delete ctx;
-    throw std::runtime_error("LiteRT-LM: Failed to start streaming inference");
-  }
+  // Wrap the initial engine call in runOnLargeStack for consistency
+  // with all other engine entry points (XNNPack needs >512KB stack).
+  runOnLargeStack([&]() {
+    int result = litert_lm_conversation_send_message_stream(
+      conversation_, msgJson.c_str(), nullptr,
+      streamCallbackFn, ctx);
+    if (result != 0) {
+      delete ctx;
+      throw std::runtime_error("LiteRT-LM: Failed to start streaming inference");
+    }
+  });
 #else
   // Non-Apple stub
-  ctx->onToken("[iOS only] Streaming not available on this platform.", true);
-  delete ctx;
+  ctxOwner->onToken("[iOS only] Streaming not available on this platform.", true);
+  // ctxOwner auto-deleted by unique_ptr
 #endif
 }
@@ -484,7 +569,11 @@ std::shared_ptr<Promise<std::string>> HybridLiteRTLM::sendMessageWithImage(
     const std::string& message,
     const std::string& imagePath) {
   return Promise<std::string>::async([this, message, imagePath]() -> std::string {
-    return sendMessageWithImageInternal(message, imagePath);
+    std::string result;
+    runOnLargeStack([&]() {
+      result = sendMessageWithImageInternal(message, imagePath);
+    });
+    return result;
   });
 }
@@ -547,7 +636,11 @@ std::shared_ptr<Promise<std::string>> HybridLiteRTLM::sendMessageWithAudio(
     const std::string& message,
     const std::string& audioPath) {
   return Promise<std::string>::async([this, message, audioPath]() -> std::string {
-    return sendMessageWithAudioInternal(message, audioPath);
+    std::string result;
+    runOnLargeStack([&]() {
+      result = sendMessageWithAudioInternal(message, audioPath);
+    });
+    return result;
   });
 }
@@ -574,7 +667,12 @@ std::string HybridLiteRTLM::sendMessageWithAudioInternal(
     conversation_, msgJson.c_str(), nullptr);
   if (!response) {
-    throw std::runtime_error("LiteRT-LM: sendMessageWithAudio failed");
+    std::string errMsg = "LiteRT-LM: sendMessageWithAudio failed";
+    const char* nativeErr = litert_lm_get_last_error();
+    if (nativeErr && nativeErr[0] != '\0') {
+      errMsg += ": " + std::string(nativeErr);
+    }
+    throw std::runtime_error(errMsg);
   }
   const char* responseStr = litert_lm_json_response_get_string(response);
@@ -607,16 +705,8 @@ std::shared_ptr<Promise<std::string>> HybridLiteRTLM::downloadModel(
 #ifdef __APPLE__
     return litert_lm::downloadModelFile(url, fileName, onProgress);
 #else
-    std::string destPath = "/tmp/" + fileName;
-    std::string curlCmd = "curl -L -o \"" + destPath + "\" \"" + url + "\"";
-    int result = system(curlCmd.c_str());
-    if (result != 0) {
-      throw std::runtime_error("Failed to download model from: " + url);
-    }
-    if (onProgress.has_value()) {
-      onProgress.value()(1.0);
-    }
-    return destPath;
+    // Non-Apple platforms: not supported from C++ (Android uses Kotlin)
+    throw std::runtime_error("Download not available on this platform. Use the Kotlin implementation.");
 #endif
   });
 }
@@ -688,8 +778,8 @@ GenerationStats HybridLiteRTLM::getStats() {
 // =============================================================================
 MemoryUsage HybridLiteRTLM::getMemoryUsage() {
-  double usedMemoryBytes = 0;
-  double totalMemoryBytes = 0;
+  double nativeHeapBytes = 0;
+  double residentBytes = 0;
   double availableBytes = 0;
   bool isLowMemory = false;
@@ -704,33 +794,26 @@ MemoryUsage HybridLiteRTLM::getMemoryUsage() {
                                &count);
   if (kr == KERN_SUCCESS) {
-    usedMemoryBytes = static_cast<double>(info.resident_size);
-  }
-  // Get total physical memory
-  mach_port_t host_port = mach_host_self();
-  struct host_basic_info hostInfo;
-  mach_msg_type_number_t hostCount = HOST_BASIC_INFO_COUNT;
-  kr = host_info(host_port, HOST_BASIC_INFO,
-                  (host_info_t)&hostInfo, &hostCount);
-  if (kr == KERN_SUCCESS) {
-    totalMemoryBytes = static_cast<double>(hostInfo.max_mem);
+    residentBytes = static_cast<double>(info.resident_size);
+    // On iOS, mach_task_basic_info doesn't separate heap from RSS.
+    // Use resident_size (current RSS) as a proxy for native heap usage.
+    nativeHeapBytes = static_cast<double>(info.resident_size);
   }
-  availableBytes = totalMemoryBytes - usedMemoryBytes;
-  if (availableBytes < 0) availableBytes = 0;
+  // Use os_proc_available_memory() (iOS 13+) for accurate Jetsam headroom.
+  // This reports how much memory the process can still allocate before
+  // the system kills it — far more accurate than total_physical - process_rss.
+  availableBytes = static_cast<double>(os_proc_available_memory());
   // Low memory threshold (~200MB available)
-  isLowMemory = (totalMemoryBytes > 0) && (availableBytes < 200.0 * 1024.0 * 1024.0);
+  isLowMemory = availableBytes < 200.0 * 1024.0 * 1024.0;
 #endif
   return MemoryUsage{
-    usedMemoryBytes,          // nativeHeapBytes
-    usedMemoryBytes,          // residentBytes
-    availableBytes,           // availableMemoryBytes
-    isLowMemory               // isLowMemory
+    nativeHeapBytes,            // nativeHeapBytes (RSS as proxy on iOS)
+    residentBytes,              // residentBytes
+    availableBytes,             // availableMemoryBytes
+    isLowMemory                 // isLowMemory
   };
 }
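
The low-memory flag in the revised `getMemoryUsage()` now depends only on the reported headroom being below roughly 200 MB. A minimal TypeScript sketch of that comparison follows, for illustration only: `LOW_MEMORY_THRESHOLD_BYTES` and `isLowMemory` are hypothetical names and are not part of the package's API.

```typescript
// Hypothetical constant mirroring the threshold in the C++ change above.
const LOW_MEMORY_THRESHOLD_BYTES = 200 * 1024 * 1024; // ~200 MB

// Hypothetical helper: true when the available headroom is below the threshold.
function isLowMemory(availableBytes: number): boolean {
  return availableBytes < LOW_MEMORY_THRESHOLD_BYTES;
}

console.log(isLowMemory(150 * 1024 * 1024)); // true
console.log(isLowMemory(512 * 1024 * 1024)); // false
```

Unlike the 0.3.1 code, the comparison no longer requires a non-zero total-memory reading, so the result follows the available-memory value directly.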

package/cpp/HybridLiteRTLM.hpp CHANGED

@@ -3,7 +3,7 @@
 // react-native-litert-lm
 //
 // High-performance LLM inference using LiteRT-LM.
-// Supports Gemma 3n and other .litertlm models.
+// Supports Gemma 4, Gemma 3n, and other .litertlm models.
 //
 // NOTE: This C++ implementation is used for iOS ONLY.
 // Android uses the Kotlin implementation in `android/src/main/java/com/margelo/nitro/dev/litert/litertlm/HybridLiteRTLM.kt`.
@@ -112,7 +112,7 @@ private:
   mutable std::mutex mutex_;
   // Configuration - backend
-  Backend backend_ = Backend::GPU;
+  Backend backend_ = Backend::CPU;
   // System prompt / instruction
   std::string systemPrompt_;

package/lib/hooks.js CHANGED

@@ -51,6 +51,10 @@ function useModel(pathOrUrl, config) {
             enableMemoryTracking,
             maxMemorySnapshots,
         });
+        // Reset ready state — the new instance has no model loaded yet.
+        // This prevents stale isReady=true after Fast Refresh (which
+        // preserves useState but re-runs useEffect).
+        setIsReady(false);
         // Cleanup on unmount
         return () => {
             try {

package/lib/index.d.ts CHANGED

@@ -45,6 +45,10 @@ export { createLLM } from "./modelFactory";
  * Use with model download utilities or as reference.
  */
 export declare const Models: {
+    /** Gemma 4 E2B Instruct (2B parameters, latest generation) */
+    readonly GEMMA_4_E2B: "gemma-4-E2B-it-litert-lm";
+    /** Gemma 4 E4B Instruct (4B parameters, higher quality) */
+    readonly GEMMA_4_E4B: "gemma-4-E4B-it-litert-lm";
     /** Gemma 3n E2B (2B parameters, efficient) */
     readonly GEMMA_3N_E2B: "gemma-3n-E2B-it-litert-lm-preview";
     /** Gemma 3n E4B (4B parameters, higher quality) */
@@ -59,9 +63,10 @@ export declare const Models: {
 export type ModelId = (typeof Models)[keyof typeof Models];
 /**
  * Get the recommended backend for the current platform.
- * Returns 'gpu' for most devices as it provides the best balance of speed and compatibility.
+ * Returns 'cpu' as the safe default. GPU (Metal on iOS, GPU delegate on Android)
+ * is faster but may not be available on all devices or model configurations.
  *
- * @returns The recommended backend ('gpu' for most cases)
+ * @returns The recommended backend ('cpu')
  *
  * @example
  * ```typescript
@@ -106,5 +111,17 @@ export declare function checkBackendSupport(backend: Backend): string | undefine
 export declare function checkMultimodalSupport(): string | undefined;
 /**
  * Download URL for the Gemma 3n E2B IT INT4 model.
+ * Note: Requires a HuggingFace account (gated model).
  */
 export declare const GEMMA_3N_E2B_IT_INT4 = "https://litert.dev/gemma-3n-E2B-it-int4.litertlm";
+/**
+ * Download URL for the Gemma 4 E2B IT model (2.58 GB).
+ * Public — no HuggingFace account required.
+ */
+export declare const GEMMA_4_E2B_IT = "https://huggingface.co/litert-community/gemma-4-E2B-it-litert-lm/resolve/main/gemma-4-E2B-it.litertlm";
+/**
+ * Download URL for the Gemma 4 E4B IT model (3.65 GB).
+ * Higher quality than E2B but requires more device memory.
+ * Public — no HuggingFace account required.
+ */
+export declare const GEMMA_4_E4B_IT = "https://huggingface.co/litert-community/gemma-4-E4B-it-litert-lm/resolve/main/gemma-4-E4B-it.litertlm";

package/lib/index.js CHANGED

@@ -14,7 +14,7 @@ var __exportStar = (this && this.__exportStar) || function(m, exports) {
     for (var p in m) if (p !== "default" && !Object.prototype.hasOwnProperty.call(exports, p)) __createBinding(exports, m, p);
 };
 Object.defineProperty(exports, "__esModule", { value: true });
-exports.GEMMA_3N_E2B_IT_INT4 = exports.Models = exports.createLLM = exports.createNativeBuffer = exports.createMemoryTracker = exports.applyLlamaTemplate = exports.applyPhiTemplate = exports.applyGemmaTemplate = void 0;
+exports.GEMMA_4_E4B_IT = exports.GEMMA_4_E2B_IT = exports.GEMMA_3N_E2B_IT_INT4 = exports.Models = exports.createLLM = exports.createNativeBuffer = exports.createMemoryTracker = exports.applyLlamaTemplate = exports.applyPhiTemplate = exports.applyGemmaTemplate = void 0;
 exports.getRecommendedBackend = getRecommendedBackend;
 exports.checkBackendSupport = checkBackendSupport;
 exports.checkMultimodalSupport = checkMultimodalSupport;
@@ -67,6 +67,10 @@ Object.defineProperty(exports, "createLLM", { enumerable: true, get: function ()
  * Use with model download utilities or as reference.
  */
 exports.Models = {
+    /** Gemma 4 E2B Instruct (2B parameters, latest generation) */
+    GEMMA_4_E2B: "gemma-4-E2B-it-litert-lm",
+    /** Gemma 4 E4B Instruct (4B parameters, higher quality) */
+    GEMMA_4_E4B: "gemma-4-E4B-it-litert-lm",
     /** Gemma 3n E2B (2B parameters, efficient) */
     GEMMA_3N_E2B: "gemma-3n-E2B-it-litert-lm-preview",
     /** Gemma 3n E4B (4B parameters, higher quality) */
@@ -80,9 +84,10 @@ exports.Models = {
 };
 /**
  * Get the recommended backend for the current platform.
- * Returns 'gpu' for most devices as it provides the best balance of speed and compatibility.
+ * Returns 'cpu' as the safe default. GPU (Metal on iOS, GPU delegate on Android)
+ * is faster but may not be available on all devices or model configurations.
  *
- * @returns The recommended backend ('gpu' for most cases)
+ * @returns The recommended backend ('cpu')
  *
  * @example
  * ```typescript
@@ -91,9 +96,9 @@ exports.Models = {
  * ```
  */
 function getRecommendedBackend() {
-    // GPU is the recommended default for all platforms
-    // It provides good performance and broad compatibility
-    return "gpu";
+    // CPU is the safe default — always available, broadly compatible.
+    // GPU is faster but may fail on some models/devices.
+    return "cpu";
 }
 /**
  * Check if a backend configuration is supported on the current platform.
@@ -140,11 +145,23 @@ function checkBackendSupport(backend) {
  */
 function checkMultimodalSupport() {
     if (react_native_1.Platform.OS === "ios") {
-        return "Multimodal (image/audio) is experimental on iOS. Vision and audio executors may not be available in the current build.";
+        return "Multimodal (image/audio) is not available on iOS. The XCFramework lacks compiled vision and audio executor ops.";
     }
     return undefined;
 }
 /**
  * Download URL for the Gemma 3n E2B IT INT4 model.
+ * Note: Requires a HuggingFace account (gated model).
  */
 exports.GEMMA_3N_E2B_IT_INT4 = "https://litert.dev/gemma-3n-E2B-it-int4.litertlm";
+/**
+ * Download URL for the Gemma 4 E2B IT model (2.58 GB).
+ * Public — no HuggingFace account required.
+ */
+exports.GEMMA_4_E2B_IT = "https://huggingface.co/litert-community/gemma-4-E2B-it-litert-lm/resolve/main/gemma-4-E2B-it.litertlm";
+/**
+ * Download URL for the Gemma 4 E4B IT model (3.65 GB).
+ * Higher quality than E2B but requires more device memory.
+ * Public — no HuggingFace account required.
+ */
+exports.GEMMA_4_E4B_IT = "https://huggingface.co/litert-community/gemma-4-E4B-it-litert-lm/resolve/main/gemma-4-E4B-it.litertlm";

package/lib/specs/LiteRTLM.nitro.d.ts CHANGED

@@ -26,18 +26,18 @@ export interface LLMConfig {
     systemPrompt?: string;
     /**
      * Primary compute backend for text generation.
-     * - 'cpu': CPU inference (slower but always available)
-     * - 'gpu': GPU acceleration (fast, recommended)
+     * - 'cpu': CPU inference (safe default, always available)
+     * - 'gpu': GPU acceleration (fast, Metal on iOS, GPU delegate on Android)
      * - 'npu': NPU/Neural Engine (fastest on supported devices)
      *
-     * If not specified, defaults to 'gpu'.
+     * If not specified, defaults to 'cpu'.
      * If specified backend is unavailable, falls back automatically.
      *
      * @remarks
-     * Vision encoder is always set to GPU (required by Gemma 3n).
+     * Vision encoder is always set to GPU (required by Gemma models).
      * Audio encoder is always set to CPU (optimal for audio processing).
      *
-     * @default 'gpu'
+     * @default 'cpu'
      */
     backend?: Backend;
     /**
@@ -104,12 +104,12 @@ export interface MemoryUsage {
 }
 /**
  * LiteRT-LM: High-performance LLM inference engine.
- * Supports Gemma 3n, Phi-4, Qwen, and other .litertlm models.
+ * Supports Gemma 4, Gemma 3n, Phi-4, Qwen, and other .litertlm models.
  *
  * @example
  * ```typescript
  * const llm = createLLM();
- * llm.loadModel('/path/to/gemma-3n.litertlm', { backend: 'gpu' });
+ * llm.loadModel('/path/to/gemma-4-E2B-it.litertlm', { backend: 'cpu' });
  *
  * // Blocking generation
  * const response = llm.sendMessage('What is the capital of France?');

package/package.json CHANGED

@@ -1,7 +1,12 @@
 {
   "name": "react-native-litert-lm",
-  "version": "0.3.1",
-  "description": "High-performance LLM inference for React Native using LiteRT-LM. Optimized for Gemma 3n and other on-device language models.",
+  "version": "0.3.3",
+  "litertLm": {
+    "version": "0.10.1",
+    "androidMavenVersion": "0.10.0",
+    "iosGitTag": "v0.10.1"
+  },
+  "description": "High-performance LLM inference for React Native using LiteRT-LM. Optimized for Gemma 4 and other on-device language models.",
   "license": "MIT",
   "author": "Hugh Chen (https://github.com/hung-yueh)",
   "repository": {
@@ -19,6 +24,7 @@
     "litert-lm",
     "llm",
     "gemma",
+    "gemma-4",
     "gemma-3n",
     "ai",
     "machine-learning",
@@ -69,26 +75,26 @@
     "release": "release-it"
   },
   "devDependencies": {
-    "@expo/config-plugins": "~54.0.4",
-    "@types/react": "~19.1.10",
-    "expo": "^54.0.31",
-    "nitrogen": "^0.35.0",
-    "react": "19.1.0",
-    "react-native": "0.81.5",
+    "@expo/config-plugins": "~55.0.0",
+    "@types/react": "~19.2.10",
     "release-it": "^19.2.4",
     "typescript": "^5.0.0"
   },
   "peerDependencies": {
-    "expo": ">=54.0.0",
+    "expo": ">=55.0.0",
     "react": "*",
-    "react-native": "*"
+    "react-native": "*",
+    "react-native-nitro-modules": "^0.35.0"
   },
   "peerDependenciesMeta": {
     "expo": {
       "optional": true
+    },
+    "react": {
+      "optional": true
+    },
+    "react-native": {
+      "optional": true
     }
-  },
-  "dependencies": {
-    "react-native-nitro-modules": "^0.35.0"
   }
 }

package/scripts/build-ios-engine.sh CHANGED

@@ -16,12 +16,12 @@
 set -euo pipefail
-LITERT_LM_VERSION="v0.9.0"
 LITERT_LM_REPO="https://github.com/google-ai-edge/LiteRT-LM.git"
 FRAMEWORK_NAME="LiteRTLM"
 SCRIPT_DIR="$(cd "$(dirname "$0")" && pwd)"
 PROJECT_ROOT="$(cd "$SCRIPT_DIR/.." && pwd)"
+LITERT_LM_VERSION="$(node -e "console.log(require('$PROJECT_ROOT/package.json').litertLm.iosGitTag)")"
 OUTPUT_DIR="$PROJECT_ROOT/ios/Frameworks"
 C_API_HEADER_DIR="$PROJECT_ROOT/cpp/include"
 BUILD_DIR="$PROJECT_ROOT/.litert-lm-build"

package/scripts/download-ios-frameworks.sh CHANGED

@@ -19,7 +19,7 @@ PROJECT_ROOT="$(cd "$SCRIPT_DIR/.." && pwd)"
 OUTPUT_DIR="$PROJECT_ROOT/ios/Frameworks"
 C_API_HEADER_DIR="$PROJECT_ROOT/cpp/include"
-LITERT_LM_VERSION="v0.9.0"
+LITERT_LM_VERSION="$(node -e "console.log(require('$PROJECT_ROOT/package.json').litertLm.iosGitTag)")"
 GITHUB_RAW="https://github.com/google-ai-edge/LiteRT-LM/raw/${LITERT_LM_VERSION}"
 # Read version from package.json
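
Both iOS scripts now take the LiteRT-LM tag from the `litertLm.iosGitTag` field added to `package.json`, instead of a hard-coded `v0.9.0`. The sketch below shows the same lookup and URL assembly in TypeScript, using the values from this diff. `LitertLmConfig` and `githubRawBase` are hypothetical names introduced for illustration; the scripts themselves do this in shell via `node -e`.

```typescript
// Shape of the new "litertLm" block in package.json (fields taken from the diff).
interface LitertLmConfig {
  version: string;
  androidMavenVersion: string;
  iosGitTag: string;
}

// Hypothetical helper: builds the raw-content base URL the download script assembles.
function githubRawBase(config: LitertLmConfig): string {
  return `https://github.com/google-ai-edge/LiteRT-LM/raw/${config.iosGitTag}`;
}

// Values as published in package.json for 0.3.3.
const litertLm: LitertLmConfig = {
  version: "0.10.1",
  androidMavenVersion: "0.10.0",
  iosGitTag: "v0.10.1",
};

console.log(githubRawBase(litertLm));
```

Keeping the tag in one place means a LiteRT-LM upgrade only touches `package.json`, and the Android Maven version can differ from the iOS tag, as it does here.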

package/src/hooks.ts CHANGED

@@ -108,6 +108,11 @@ export function useModel(
       maxMemorySnapshots,
     });
+    // Reset ready state — the new instance has no model loaded yet.
+    // This prevents stale isReady=true after Fast Refresh (which
+    // preserves useState but re-runs useEffect).
+    setIsReady(false);
     // Cleanup on unmount
     return () => {
       try {

package/src/index.ts CHANGED

@@ -79,6 +79,10 @@ export { createLLM } from "./modelFactory";
  * Use with model download utilities or as reference.
  */
 export const Models = {
+  /** Gemma 4 E2B Instruct (2B parameters, latest generation) */
+  GEMMA_4_E2B: "gemma-4-E2B-it-litert-lm",
+  /** Gemma 4 E4B Instruct (4B parameters, higher quality) */
+  GEMMA_4_E4B: "gemma-4-E4B-it-litert-lm",
   /** Gemma 3n E2B (2B parameters, efficient) */
   GEMMA_3N_E2B: "gemma-3n-E2B-it-litert-lm-preview",
   /** Gemma 3n E4B (4B parameters, higher quality) */
@@ -95,9 +99,10 @@ export type ModelId = (typeof Models)[keyof typeof Models];
 /**
  * Get the recommended backend for the current platform.
- * Returns 'gpu' for most devices as it provides the best balance of speed and compatibility.
+ * Returns 'cpu' as the safe default. GPU (Metal on iOS, GPU delegate on Android)
+ * is faster but may not be available on all devices or model configurations.
  *
- * @returns The recommended backend ('gpu' for most cases)
+ * @returns The recommended backend ('cpu')
  *
  * @example
  * ```typescript
@@ -106,9 +111,9 @@ export type ModelId = (typeof Models)[keyof typeof Models];
  * ```
  */
 export function getRecommendedBackend(): Backend {
-  // GPU is the recommended default for all platforms
-  // It provides good performance and broad compatibility
-  return "gpu";
+  // CPU is the safe default — always available, broadly compatible.
+  // GPU is faster but may fail on some models/devices.
+  return "cpu";
 }
 /**
@@ -158,13 +163,29 @@ export function checkBackendSupport(backend: Backend): string | undefined {
  */
 export function checkMultimodalSupport(): string | undefined {
   if (Platform.OS === "ios") {
-    return "Multimodal (image/audio) is experimental on iOS. Vision and audio executors may not be available in the current build.";
+    return "Multimodal (image/audio) is not available on iOS. The XCFramework lacks compiled vision and audio executor ops.";
   }
   return undefined;
 }
 /**
  * Download URL for the Gemma 3n E2B IT INT4 model.
+ * Note: Requires a HuggingFace account (gated model).
  */
 export const GEMMA_3N_E2B_IT_INT4 =
   "https://litert.dev/gemma-3n-E2B-it-int4.litertlm";
+/**
+ * Download URL for the Gemma 4 E2B IT model (2.58 GB).
+ * Public — no HuggingFace account required.
+ */
+export const GEMMA_4_E2B_IT =
+  "https://huggingface.co/litert-community/gemma-4-E2B-it-litert-lm/resolve/main/gemma-4-E2B-it.litertlm";
+/**
+ * Download URL for the Gemma 4 E4B IT model (3.65 GB).
+ * Higher quality than E2B but requires more device memory.
+ * Public — no HuggingFace account required.
+ */
+export const GEMMA_4_E4B_IT =
+  "https://huggingface.co/litert-community/gemma-4-E4B-it-litert-lm/resolve/main/gemma-4-E4B-it.litertlm";
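
Because `getRecommendedBackend()` now returns `'cpu'`, an app that wants GPU has to opt in explicitly. One way to do that is to check the preferred backend first and fall back to CPU. The sketch below is self-contained and only illustrates the pattern: `resolveBackend`, `SupportCheck`, and `fakeCheck` are hypothetical names. It assumes a check with the same shape as `checkBackendSupport` returns a warning string when there is a concern and `undefined` otherwise.

```typescript
type Backend = "cpu" | "gpu" | "npu";

// Stand-in with the same shape as checkBackendSupport (assumed semantics:
// a string describes a problem, undefined means no known issue).
type SupportCheck = (backend: Backend) => string | undefined;

// Hypothetical helper: keep the preferred backend only if the check reports no issue.
function resolveBackend(preferred: Backend, check: SupportCheck): Backend {
  return check(preferred) === undefined ? preferred : "cpu";
}

// Fake check for demonstration: pretend only the NPU is unavailable.
const fakeCheck: SupportCheck = (b) =>
  b === "npu" ? "NPU not available" : undefined;

console.log(resolveBackend("gpu", fakeCheck)); // gpu
console.log(resolveBackend("npu", fakeCheck)); // cpu
```

In an app, the real `checkBackendSupport` export would take the place of `fakeCheck`, and the result would be passed as the `backend` option when loading a model.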

package/src/specs/LiteRTLM.nitro.ts CHANGED

@@ -30,18 +30,18 @@ export interface LLMConfig {
   /**
    * Primary compute backend for text generation.
-   * - 'cpu': CPU inference (slower but always available)
-   * - 'gpu': GPU acceleration (fast, recommended)
+   * - 'cpu': CPU inference (safe default, always available)
+   * - 'gpu': GPU acceleration (fast, Metal on iOS, GPU delegate on Android)
    * - 'npu': NPU/Neural Engine (fastest on supported devices)
    *
-   * If not specified, defaults to 'gpu'.
+   * If not specified, defaults to 'cpu'.
    * If specified backend is unavailable, falls back automatically.
    *
    * @remarks
-   * Vision encoder is always set to GPU (required by Gemma 3n).
+   * Vision encoder is always set to GPU (required by Gemma models).
    * Audio encoder is always set to CPU (optimal for audio processing).
    *
-   * @default 'gpu'
+   * @default 'cpu'
    */
   backend?: Backend;
@@ -116,12 +116,12 @@ export interface MemoryUsage {
 /**
  * LiteRT-LM: High-performance LLM inference engine.
- * Supports Gemma 3n, Phi-4, Qwen, and other .litertlm models.
+ * Supports Gemma 4, Gemma 3n, Phi-4, Qwen, and other .litertlm models.
  *
  * @example
  * ```typescript
  * const llm = createLLM();
- * llm.loadModel('/path/to/gemma-3n.litertlm', { backend: 'gpu' });
+ * llm.loadModel('/path/to/gemma-4-E2B-it.litertlm', { backend: 'cpu' });
  *
  * // Blocking generation
  * const response = llm.sendMessage('What is the capital of France?');