npm - llama-cpp-capacitor - Versions diffs - 0.0.13 → 0.0.21 - Mend

llama-cpp-capacitor 0.0.13 → 0.0.21

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (34) hide show

package/LlamaCpp.podspec +17 -17
package/Package.swift +27 -27
package/README.md +717 -574
package/android/build.gradle +88 -69
package/android/src/main/AndroidManifest.xml +2 -2
package/android/src/main/CMakeLists-arm64.txt +131 -0
package/android/src/main/CMakeLists-x86_64.txt +135 -0
package/android/src/main/CMakeLists.txt +35 -52
package/android/src/main/java/ai/annadata/plugin/capacitor/LlamaCpp.java +956 -717
package/android/src/main/java/ai/annadata/plugin/capacitor/LlamaCppPlugin.java +710 -590
package/android/src/main/jni-utils.h +7 -7
package/android/src/main/jni.cpp +868 -127
package/cpp/{rn-completion.cpp → cap-completion.cpp} +202 -24
package/cpp/{rn-completion.h → cap-completion.h} +22 -11
package/cpp/{rn-llama.cpp → cap-llama.cpp} +81 -27
package/cpp/{rn-llama.h → cap-llama.h} +32 -20
package/cpp/{rn-mtmd.hpp → cap-mtmd.hpp} +15 -15
package/cpp/{rn-tts.cpp → cap-tts.cpp} +12 -12
package/cpp/{rn-tts.h → cap-tts.h} +14 -14
package/cpp/ggml-cpu/ggml-cpu-impl.h +30 -0
package/dist/docs.json +100 -3
package/dist/esm/definitions.d.ts +45 -2
package/dist/esm/definitions.js.map +1 -1
package/dist/esm/index.d.ts +22 -0
package/dist/esm/index.js +66 -3
package/dist/esm/index.js.map +1 -1
package/dist/plugin.cjs.js +71 -3
package/dist/plugin.cjs.js.map +1 -1
package/dist/plugin.js +71 -3
package/dist/plugin.js.map +1 -1
package/ios/Sources/LlamaCppPlugin/LlamaCpp.swift +596 -596
package/ios/Sources/LlamaCppPlugin/LlamaCppPlugin.swift +591 -514
package/ios/Tests/LlamaCppPluginTests/LlamaCppPluginTests.swift +15 -15
package/package.json +111 -110

package/README.md CHANGED Viewed

@@ -1,574 +1,717 @@
-# llama-cpp Capacitor Plugin
-[![Actions Status](https://github.com/arusatech/llama-cpp/workflows/CI/badge.svg)](https://github.com/arusatech/llama-cpp/actions)
-[![License: MIT](https://img.shields.io/badge/license-MIT-blue.svg)](https://opensource.org/licenses/MIT)
-[![npm](https://img.shields.io/npm/v/llama-cpp-capacitor.svg)](https://www.npmjs.com/package/llama-cpp-capacitor/)
-A native Capacitor plugin that embeds [llama.cpp](https://github.com/ggerganov/llama.cpp) directly into mobile apps, enabling offline AI inference with comprehensive support for text generation, multimodal processing, TTS, LoRA adapters, and more.
-[llama.cpp](https://github.com/ggerganov/llama.cpp): Inference of [LLaMA](https://arxiv.org/abs/2302.13971) model in pure C/C++
-## 🚀 Features
-- **Offline AI Inference**: Run large language models completely offline on mobile devices
-- **Text Generation**: Complete text completion with streaming support
-- **Chat Conversations**: Multi-turn conversations with context management
-- **Multimodal Support**: Process images and audio alongside text
-- **Text-to-Speech (TTS)**: Generate speech from text using vocoder models
-- **LoRA Adapters**: Fine-tune models with LoRA adapters
-- **Embeddings**: Generate vector embeddings for semantic search
-- **Reranking**: Rank documents by relevance to queries
-- **Session Management**: Save and load conversation states
-- **Benchmarking**: Performance testing and optimization tools
-- **Structured Output**: Generate JSON with schema validation
-- **Cross-Platform**: iOS and Android support with native optimizations
-## ✅ **Complete Implementation Status**
-This plugin is now **FULLY IMPLEMENTED** with complete native integration of llama.cpp for both iOS and Android platforms. The implementation includes:
-### **Completed Features**
-- **Complete C++ Integration**: Full llama.cpp library integration with all core components
-- **Native Build System**: CMake-based build system for both iOS and Android
-- **Platform Support**: iOS (arm64, x86_64) and Android (arm64-v8a, armeabi-v7a, x86, x86_64)
-- **TypeScript API**: Complete TypeScript interface matching llama.rn functionality
-- **Native Methods**: All 30+ native methods implemented with proper error handling
-- **Event System**: Capacitor event system for progress and token streaming
-- **Documentation**: Comprehensive README and API documentation
-### **Technical Implementation**
-- **C++ Core**: Complete llama.cpp library with GGML, GGUF, and all supporting components
-- **iOS Framework**: Native iOS framework with Metal acceleration support
-- **Android JNI**: Complete JNI implementation with multi-architecture support
-- **Build Scripts**: Automated build system for both platforms
-- **Error Handling**: Robust error handling and result types
-### **Project Structure**
-```
-llama-cpp/
-├── cpp/                    # Complete llama.cpp C++ library
-│   ├── ggml.c             # GGML core
-│   ├── gguf.cpp           # GGUF format support
-│   ├── llama.cpp          # Main llama.cpp implementation
-│   ├── rn-llama.cpp       # React Native wrapper (adapted)
-│   ├── rn-completion.cpp  # Completion handling
-│   ├── rn-tts.cpp         # Text-to-speech
-│   └── tools/mtmd/        # Multimodal support
-├── ios/
-│   ├── CMakeLists.txt     # iOS build configuration
-│   └── Sources/           # Swift implementation
-├── android/
-│   ├── src/main/
-│   │   ├── CMakeLists.txt # Android build configuration
-│   │   ├── jni.cpp        # JNI implementation
-│   │   └── jni-utils.h    # JNI utilities
-│   └── build.gradle       # Android build config
-├── src/
-│   ├── definitions.ts     # Complete TypeScript interfaces
-│   ├── index.ts           # Main plugin implementation
-│   └── web.ts             # Web fallback
-└── build-native.sh        # Automated build script
-```
-## 📦 Installation
-```sh
-npm install llama-cpp-capacitor
-```
-## 🔨 **Building the Native Library**
-The plugin includes a complete native implementation of llama.cpp. To build the native libraries:
-### **Prerequisites**
-- **CMake** (3.16+ for iOS, 3.10+ for Android)
-- **Xcode** (for iOS builds, macOS only)
-- **Android Studio** with NDK (for Android builds)
-- **Make** or **Ninja** build system
-### **Automated Build**
-```bash
-# Build for all platforms
-npm run build:native
-# Build for specific platforms
-npm run build:ios      # iOS only
-npm run build:android  # Android only
-# Clean native builds
-npm run clean:native
-```
-### **Manual Build**
-#### **iOS Build**
-```bash
-cd ios
-cmake -B build -S .
-cmake --build build --config Release
-```
-#### **Android Build**
-```bash
-cd android
-./gradlew assembleRelease
-```
-### **Build Output**
-- **iOS**: `ios/build/LlamaCpp.framework/`
-- **Android**: `android/src/main/jniLibs/{arch}/libllama-cpp-{arch}.so`
-### iOS Setup
-1. Install the plugin:
-```sh
-npm install llama-cpp
-```
-2. Add to your iOS project:
-```sh
-npx cap add ios
-npx cap sync ios
-```
-3. Open the project in Xcode:
-```sh
-npx cap open ios
-```
-### Android Setup
-1. Install the plugin:
-```sh
-npm install llama-cpp
-```
-2. Add to your Android project:
-```sh
-npx cap add android
-npx cap sync android
-```
-3. Open the project in Android Studio:
-```sh
-npx cap open android
-```
-## 🎯 Quick Start
-### Basic Text Completion
-```typescript
-import { initLlama } from 'llama-cpp';
-// Initialize a model
-const context = await initLlama({
-  model: '/path/to/your/model.gguf',
-  n_ctx: 2048,
-  n_threads: 4,
-  n_gpu_layers: 0,
-});
-// Generate text
-const result = await context.completion({
-  prompt: "Hello, how are you today?",
-  n_predict: 50,
-  temperature: 0.8,
-});
-console.log('Generated text:', result.text);
-```
-### Chat-Style Conversations
-```typescript
-const result = await context.completion({
-  messages: [
-    { role: "system", content: "You are a helpful AI assistant." },
-    { role: "user", content: "What is the capital of France?" },
-    { role: "assistant", content: "The capital of France is Paris." },
-    { role: "user", content: "Tell me more about it." }
-  ],
-  n_predict: 100,
-  temperature: 0.7,
-});
-console.log('Chat response:', result.content);
-```
-### Streaming Completion
-```typescript
-let fullText = '';
-const result = await context.completion({
-  prompt: "Write a short story about a robot learning to paint:",
-  n_predict: 150,
-  temperature: 0.8,
-}, (tokenData) => {
-  // Called for each token as it's generated
-  fullText += tokenData.token;
-  console.log('Token:', tokenData.token);
-});
-console.log('Final result:', result.text);
-```
-## 📚 API Reference
-### Core Functions
-#### `initLlama(params: ContextParams, onProgress?: (progress: number) => void): Promise<LlamaContext>`
-Initialize a new llama.cpp context with a model.
-**Parameters:**
-- `params`: Context initialization parameters
-- `onProgress`: Optional progress callback (0-100)
-**Returns:** Promise resolving to a `LlamaContext` instance
-#### `releaseAllLlama(): Promise<void>`
-Release all contexts and free memory.
-#### `toggleNativeLog(enabled: boolean): Promise<void>`
-Enable or disable native logging.
-#### `addNativeLogListener(listener: (level: string, text: string) => void): { remove: () => void }`
-Add a listener for native log messages.
-### LlamaContext Class
-#### `completion(params: CompletionParams, callback?: (data: TokenData) => void): Promise<NativeCompletionResult>`
-Generate text completion.
-**Parameters:**
-- `params`: Completion parameters including prompt or messages
-- `callback`: Optional callback for token-by-token streaming
-#### `tokenize(text: string, options?: { media_paths?: string[] }): Promise<NativeTokenizeResult>`
-Tokenize text or text with images.
-#### `detokenize(tokens: number[]): Promise<string>`
-Convert tokens back to text.
-#### `embedding(text: string, params?: EmbeddingParams): Promise<NativeEmbeddingResult>`
-Generate embeddings for text.
-#### `rerank(query: string, documents: string[], params?: RerankParams): Promise<RerankResult[]>`
-Rank documents by relevance to a query.
-#### `bench(pp: number, tg: number, pl: number, nr: number): Promise<BenchResult>`
-Benchmark model performance.
-### Multimodal Support
-#### `initMultimodal(params: { path: string; use_gpu?: boolean }): Promise<boolean>`
-Initialize multimodal support with a projector file.
-#### `isMultimodalEnabled(): Promise<boolean>`
-Check if multimodal support is enabled.
-#### `getMultimodalSupport(): Promise<{ vision: boolean; audio: boolean }>`
-Get multimodal capabilities.
-#### `releaseMultimodal(): Promise<void>`
-Release multimodal resources.
-### TTS (Text-to-Speech)
-#### `initVocoder(params: { path: string; n_batch?: number }): Promise<boolean>`
-Initialize TTS with a vocoder model.
-#### `isVocoderEnabled(): Promise<boolean>`
-Check if TTS is enabled.
-#### `getFormattedAudioCompletion(speaker: object | null, textToSpeak: string): Promise<{ prompt: string; grammar?: string }>`
-Get formatted audio completion prompt.
-#### `getAudioCompletionGuideTokens(textToSpeak: string): Promise<Array<number>>`
-Get guide tokens for audio completion.
-#### `decodeAudioTokens(tokens: number[]): Promise<Array<number>>`
-Decode audio tokens to audio data.
-#### `releaseVocoder(): Promise<void>`
-Release TTS resources.
-### LoRA Adapters
-#### `applyLoraAdapters(loraList: Array<{ path: string; scaled?: number }>): Promise<void>`
-Apply LoRA adapters to the model.
-#### `removeLoraAdapters(): Promise<void>`
-Remove all LoRA adapters.
-#### `getLoadedLoraAdapters(): Promise<Array<{ path: string; scaled?: number }>>`
-Get list of loaded LoRA adapters.
-### Session Management
-#### `saveSession(filepath: string, options?: { tokenSize: number }): Promise<number>`
-Save current session to a file.
-#### `loadSession(filepath: string): Promise<NativeSessionLoadResult>`
-Load session from a file.
-## 🔧 Configuration
-### Context Parameters
-```typescript
-interface ContextParams {
-  model: string;                    // Path to GGUF model file
-  n_ctx?: number;                   // Context size (default: 512)
-  n_threads?: number;               // Number of threads (default: 4)
-  n_gpu_layers?: number;            // GPU layers (iOS only)
-  use_mlock?: boolean;              // Lock memory (default: false)
-  use_mmap?: boolean;               // Use memory mapping (default: true)
-  embedding?: boolean;              // Embedding mode (default: false)
-  cache_type_k?: string;            // KV cache type for K
-  cache_type_v?: string;            // KV cache type for V
-  pooling_type?: string;            // Pooling type
-  // ... more parameters
-}
-```
-### Completion Parameters
-```typescript
-interface CompletionParams {
-  prompt?: string;                  // Text prompt
-  messages?: Message[];             // Chat messages
-  n_predict?: number;               // Max tokens to generate
-  temperature?: number;             // Sampling temperature
-  top_p?: number;                   // Top-p sampling
-  top_k?: number;                   // Top-k sampling
-  stop?: string[];                  // Stop sequences
-  // ... more parameters
-}
-```
-## 📱 Platform Support
-| Feature | iOS | Android | Web |
-|---------|-----|---------|-----|
-| Text Generation | ✅ | ✅ | ❌ |
-| Chat Conversations | ✅ | ✅ | ❌ |
-| Streaming | ✅ | ✅ | ❌ |
-| Multimodal | ✅ | ✅ | ❌ |
-| TTS | ✅ | ✅ | ❌ |
-| LoRA Adapters | ✅ | ✅ | ❌ |
-| Embeddings | ✅ | ✅ | ❌ |
-| Reranking | ✅ | ✅ | ❌ |
-| Session Management | ✅ | ✅ | ❌ |
-| Benchmarking | ✅ | ✅ | ❌ |
-## 🎨 Advanced Examples
-### Multimodal Processing
-```typescript
-// Initialize multimodal support
-await context.initMultimodal({
-  path: '/path/to/mmproj.gguf',
-  use_gpu: true,
-});
-// Process image with text
-const result = await context.completion({
-  messages: [
-    {
-      role: "user",
-      content: [
-        { type: "text", text: "What do you see in this image?" },
-        { type: "image_url", image_url: { url: "file:///path/to/image.jpg" } }
-      ]
-    }
-  ],
-  n_predict: 100,
-});
-console.log('Image analysis:', result.content);
-```
-### Text-to-Speech
-```typescript
-// Initialize TTS
-await context.initVocoder({
-  path: '/path/to/vocoder.gguf',
-  n_batch: 512,
-});
-// Generate audio
-const audioCompletion = await context.getFormattedAudioCompletion(
-  null, // Speaker configuration
-  "Hello, this is a test of text-to-speech functionality."
-);
-const guideTokens = await context.getAudioCompletionGuideTokens(
-  "Hello, this is a test of text-to-speech functionality."
-);
-const audioResult = await context.completion({
-  prompt: audioCompletion.prompt,
-  grammar: audioCompletion.grammar,
-  guide_tokens: guideTokens,
-  n_predict: 1000,
-});
-const audioData = await context.decodeAudioTokens(audioResult.audio_tokens);
-```
-### LoRA Adapters
-```typescript
-// Apply LoRA adapters
-await context.applyLoraAdapters([
-  { path: '/path/to/adapter1.gguf', scaled: 1.0 },
-  { path: '/path/to/adapter2.gguf', scaled: 0.5 }
-]);
-// Check loaded adapters
-const adapters = await context.getLoadedLoraAdapters();
-console.log('Loaded adapters:', adapters);
-// Generate with adapters
-const result = await context.completion({
-  prompt: "Test prompt with LoRA adapters:",
-  n_predict: 50,
-});
-// Remove adapters
-await context.removeLoraAdapters();
-```
-### Structured Output
-```typescript
-const result = await context.completion({
-  prompt: "Generate a JSON object with a person's name, age, and favorite color:",
-  n_predict: 100,
-  response_format: {
-    type: 'json_schema',
-    json_schema: {
-      strict: true,
-      schema: {
-        type: 'object',
-        properties: {
-          name: { type: 'string' },
-          age: { type: 'number' },
-          favorite_color: { type: 'string' }
-        },
-        required: ['name', 'age', 'favorite_color']
-      }
-    }
-  }
-});
-console.log('Structured output:', result.content);
-```
-## 🔍 Model Compatibility
-This plugin supports GGUF format models, which are compatible with llama.cpp. You can find GGUF models on Hugging Face by searching for the "GGUF" tag.
-### Recommended Models
-- **Llama 2**: Meta's latest language model
-- **Mistral**: High-performance open model
-- **Code Llama**: Specialized for code generation
-- **Phi-2**: Microsoft's efficient model
-- **Gemma**: Google's open model
-### Model Quantization
-For mobile devices, consider using quantized models (Q4_K_M, Q5_K_M, etc.) to reduce memory usage and improve performance.
-## ⚡ Performance Considerations
-### Memory Management
-- Use quantized models for better memory efficiency
-- Adjust `n_ctx` based on your use case
-- Monitor memory usage with `use_mlock: false`
-### GPU Acceleration
-- iOS: Set `n_gpu_layers` to use Metal GPU acceleration
-- Android: GPU acceleration is automatically enabled when available
-### Threading
-- Adjust `n_threads` based on device capabilities
-- More threads may improve performance but increase memory usage
-## 🐛 Troubleshooting
-### Common Issues
-1. **Model not found**: Ensure the model path is correct and the file exists
-2. **Out of memory**: Try using a quantized model or reducing `n_ctx`
-3. **Slow performance**: Enable GPU acceleration or increase `n_threads`
-4. **Multimodal not working**: Ensure the mmproj file is compatible with your model
-### Debugging
-Enable native logging to see detailed information:
-```typescript
-import { toggleNativeLog, addNativeLogListener } from 'llama-cpp';
-await toggleNativeLog(true);
-const logListener = addNativeLogListener((level, text) => {
-  console.log(`[${level}] ${text}`);
-});
-```
-## 🤝 Contributing
-We welcome contributions! Please see our [Contributing Guide](CONTRIBUTING.md) for details.
-## 📄 License
-This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.
-## 🙏 Acknowledgments
-- [llama.cpp](https://github.com/ggerganov/llama.cpp) - The core inference engine
-- [Capacitor](https://capacitorjs.com/) - The cross-platform runtime
-- [llama.rn](https://github.com/mybigday/llama.rn) - Inspiration for the React Native implementation
-## 📞 Support
-- 📧 Email: support@arusatech.com
-- 🐛 Issues: [GitHub Issues](https://github.com/arusatech/llama-cpp/issues)
-- 📖 Documentation: [GitHub Wiki](https://github.com/arusatech/llama-cpp/wiki)
+# llama-cpp Capacitor Plugin
+[![Actions Status](https://github.com/arusatech/llama-cpp/workflows/CI/badge.svg)](https://github.com/arusatech/llama-cpp/actions)
+[![License: MIT](https://img.shields.io/badge/license-MIT-blue.svg)](https://opensource.org/licenses/MIT)
+[![npm](https://img.shields.io/npm/v/llama-cpp-capacitor.svg)](https://www.npmjs.com/package/llama-cpp-capacitor/)
+A native Capacitor plugin that embeds [llama.cpp](https://github.com/ggerganov/llama.cpp) directly into mobile apps, enabling offline AI inference with comprehensive support for text generation, multimodal processing, TTS, LoRA adapters, and more.
+[llama.cpp](https://github.com/ggerganov/llama.cpp): Inference of [LLaMA](https://arxiv.org/abs/2302.13971) model in pure C/C++
+## 🚀 Features
+- **Offline AI Inference**: Run large language models completely offline on mobile devices
+- **Text Generation**: Complete text completion with streaming support
+- **Chat Conversations**: Multi-turn conversations with context management
+- **Multimodal Support**: Process images and audio alongside text
+- **Text-to-Speech (TTS)**: Generate speech from text using vocoder models
+- **LoRA Adapters**: Fine-tune models with LoRA adapters
+- **Embeddings**: Generate vector embeddings for semantic search
+- **Reranking**: Rank documents by relevance to queries
+- **Session Management**: Save and load conversation states
+- **Benchmarking**: Performance testing and optimization tools
+- **Structured Output**: Generate JSON with schema validation
+- **Cross-Platform**: iOS and Android support with native optimizations
+## ✅ **Complete Implementation Status**
+This plugin is now **FULLY IMPLEMENTED** with complete native integration of llama.cpp for both iOS and Android platforms. The implementation includes:
+### **Completed Features**
+- **Complete C++ Integration**: Full llama.cpp library integration with all core components
+- **Native Build System**: CMake-based build system for both iOS and Android
+- **Platform Support**: iOS (arm64, x86_64) and Android (arm64-v8a, armeabi-v7a, x86, x86_64)
+- **TypeScript API**: Complete TypeScript interface matching llama.rn functionality
+- **Native Methods**: All 30+ native methods implemented with proper error handling
+- **Event System**: Capacitor event system for progress and token streaming
+- **Documentation**: Comprehensive README and API documentation
+### **Technical Implementation**
+- **C++ Core**: Complete llama.cpp library with GGML, GGUF, and all supporting components
+- **iOS Framework**: Native iOS framework with Metal acceleration support
+- **Android JNI**: Complete JNI implementation with multi-architecture support
+- **Build Scripts**: Automated build system for both platforms
+- **Error Handling**: Robust error handling and result types
+### **Project Structure**
+```
+llama-cpp/
+├── cpp/                    # Complete llama.cpp C++ library
+│   ├── ggml.c             # GGML core
+│   ├── gguf.cpp           # GGUF format support
+│   ├── llama.cpp          # Main llama.cpp implementation
+│   ├── rn-llama.cpp       # React Native wrapper (adapted)
+│   ├── rn-completion.cpp  # Completion handling
+│   ├── rn-tts.cpp         # Text-to-speech
+│   └── tools/mtmd/        # Multimodal support
+├── ios/
+│   ├── CMakeLists.txt     # iOS build configuration
+│   └── Sources/           # Swift implementation
+├── android/
+│   ├── src/main/
+│   │   ├── CMakeLists.txt # Android build configuration
+│   │   ├── jni.cpp        # JNI implementation
+│   │   └── jni-utils.h    # JNI utilities
+│   └── build.gradle       # Android build config
+├── src/
+│   ├── definitions.ts     # Complete TypeScript interfaces
+│   ├── index.ts           # Main plugin implementation
+│   └── web.ts             # Web fallback
+└── build-native.sh        # Automated build script
+```
+## 📦 Installation
+```sh
+npm install llama-cpp-capacitor
+```
+## 🔨 **Building the Native Library**
+The plugin includes a complete native implementation of llama.cpp. To build the native libraries:
+### **Prerequisites**
+- **CMake** (3.16+ for iOS, 3.10+ for Android)
+- **Xcode** (for iOS builds, macOS only)
+- **Android Studio** with NDK (for Android builds)
+- **Make** or **Ninja** build system
+### **Automated Build**
+```bash
+# Build for all platforms
+npm run build:native
+# Build for specific platforms
+npm run build:ios      # iOS only
+npm run build:android  # Android only
+# Clean native builds
+npm run clean:native
+```
+### **Manual Build**
+#### **iOS Build**
+```bash
+cd ios
+cmake -B build -S .
+cmake --build build --config Release
+```
+#### **Android Build**
+```bash
+cd android
+./gradlew assembleRelease
+```
+### **Build Output**
+- **iOS**: `ios/build/LlamaCpp.framework/`
+- **Android**: `android/src/main/jniLibs/{arch}/libllama-cpp-{arch}.so`
+### iOS Setup
+1. Install the plugin:
+```sh
+npm install llama-cpp
+```
+2. Add to your iOS project:
+```sh
+npx cap add ios
+npx cap sync ios
+```
+3. Open the project in Xcode:
+```sh
+npx cap open ios
+```
+### Android Setup
+1. Install the plugin:
+```sh
+npm install llama-cpp
+```
+2. Add to your Android project:
+```sh
+npx cap add android
+npx cap sync android
+```
+3. Open the project in Android Studio:
+```sh
+npx cap open android
+```
+## 🎯 Quick Start
+### Basic Text Completion
+```typescript
+import { initLlama } from 'llama-cpp';
+// Initialize a model
+const context = await initLlama({
+  model: '/path/to/your/model.gguf',
+  n_ctx: 2048,
+  n_threads: 4,
+  n_gpu_layers: 0,
+});
+// Generate text
+const result = await context.completion({
+  prompt: "Hello, how are you today?",
+  n_predict: 50,
+  temperature: 0.8,
+});
+console.log('Generated text:', result.text);
+```
+### Chat-Style Conversations
+```typescript
+const result = await context.completion({
+  messages: [
+    { role: "system", content: "You are a helpful AI assistant." },
+    { role: "user", content: "What is the capital of France?" },
+    { role: "assistant", content: "The capital of France is Paris." },
+    { role: "user", content: "Tell me more about it." }
+  ],
+  n_predict: 100,
+  temperature: 0.7,
+});
+console.log('Chat response:', result.content);
+```
+### Streaming Completion
+```typescript
+let fullText = '';
+const result = await context.completion({
+  prompt: "Write a short story about a robot learning to paint:",
+  n_predict: 150,
+  temperature: 0.8,
+}, (tokenData) => {
+  // Called for each token as it's generated
+  fullText += tokenData.token;
+  console.log('Token:', tokenData.token);
+});
+console.log('Final result:', result.text);
+```
+## 🚀 **Mobile-Optimized Speculative Decoding**
+**Achieve 2-8x faster inference with significantly reduced battery consumption!**
+Speculative decoding uses a smaller "draft" model to predict multiple tokens ahead, which are then verified by the main model. This results in dramatic speedups with identical output quality.
+### Basic Usage
+```typescript
+import { initLlama } from 'llama-cpp-capacitor';
+// Initialize with speculative decoding
+const context = await initLlama({
+  model: '/path/to/your/main-model.gguf',         // Main model (e.g., 7B)
+  draft_model: '/path/to/your/draft-model.gguf', // Draft model (e.g., 1.5B)
+  // Speculative decoding parameters
+  speculative_samples: 3,      // Number of tokens to predict speculatively
+  mobile_speculative: true,    // Enable mobile optimizations
+  // Standard parameters
+  n_ctx: 2048,
+  n_threads: 4,
+});
+// Use normally - speculative decoding is automatic
+const result = await context.completion({
+  prompt: "Write a story about AI:",
+  n_predict: 200,
+  temperature: 0.7,
+});
+console.log('🚀 Generated with speculative decoding:', result.text);
+```
+### Mobile-Optimized Configuration
+```typescript
+// Recommended mobile setup for best performance/battery balance
+const mobileContext = await initLlama({
+  // Quantized models for mobile efficiency
+  model: '/models/llama-2-7b-chat.q4_0.gguf',
+  draft_model: '/models/tinyllama-1.1b-chat.q4_0.gguf',
+  // Conservative mobile settings
+  n_ctx: 1024,                 // Smaller context for mobile
+  n_threads: 3,                // Conservative threading
+  n_batch: 64,                 // Smaller batch size
+  n_gpu_layers: 24,            // Utilize mobile GPU
+  // Optimized speculative decoding
+  speculative_samples: 3,      // 2-3 tokens ideal for mobile
+  mobile_speculative: true,    // Enables mobile-specific optimizations
+  // Memory optimizations
+  use_mmap: true,              // Memory mapping for efficiency
+  use_mlock: false,            // Don't lock memory on mobile
+});
+```
+### Performance Benefits
+- **2-8x faster inference** - Dramatically reduced time to generate text
+- **50-80% battery savings** - Less time computing = longer battery life
+- **Identical output quality** - Same text quality as regular decoding
+- **Automatic fallback** - Falls back to regular decoding if draft model fails
+- **Mobile optimized** - Specifically tuned for mobile device constraints
+### Model Recommendations
+| Model Type | Recommended Size | Quantization | Example |
+|------------|------------------|--------------|---------|
+| **Main Model** | 3-7B parameters | Q4_0 or Q4_1 | `llama-2-7b-chat.q4_0.gguf` |
+| **Draft Model** | 1-1.5B parameters | Q4_0 | `tinyllama-1.1b-chat.q4_0.gguf` |
+### Error Handling & Fallback
+```typescript
+// Robust setup with automatic fallback
+try {
+  const context = await initLlama({
+    model: '/models/main-model.gguf',
+    draft_model: '/models/draft-model.gguf',
+    speculative_samples: 3,
+    mobile_speculative: true,
+  });
+  console.log('✅ Speculative decoding enabled');
+} catch (error) {
+  console.warn('⚠️ Falling back to regular decoding');
+  const context = await initLlama({
+    model: '/models/main-model.gguf',
+    // No draft_model = regular decoding
+  });
+}
+```
+## 📚 API Reference
+### Core Functions
+#### `initLlama(params: ContextParams, onProgress?: (progress: number) => void): Promise<LlamaContext>`
+Initialize a new llama.cpp context with a model.
+**Parameters:**
+- `params`: Context initialization parameters
+- `onProgress`: Optional progress callback (0-100)
+**Returns:** Promise resolving to a `LlamaContext` instance
+#### `releaseAllLlama(): Promise<void>`
+Release all contexts and free memory.
+#### `toggleNativeLog(enabled: boolean): Promise<void>`
+Enable or disable native logging.
+#### `addNativeLogListener(listener: (level: string, text: string) => void): { remove: () => void }`
+Add a listener for native log messages.
+### LlamaContext Class
+#### `completion(params: CompletionParams, callback?: (data: TokenData) => void): Promise<NativeCompletionResult>`
+Generate text completion.
+**Parameters:**
+- `params`: Completion parameters including prompt or messages
+- `callback`: Optional callback for token-by-token streaming
+#### `tokenize(text: string, options?: { media_paths?: string[] }): Promise<NativeTokenizeResult>`
+Tokenize text or text with images.
+#### `detokenize(tokens: number[]): Promise<string>`
+Convert tokens back to text.
+#### `embedding(text: string, params?: EmbeddingParams): Promise<NativeEmbeddingResult>`
+Generate embeddings for text.
+#### `rerank(query: string, documents: string[], params?: RerankParams): Promise<RerankResult[]>`
+Rank documents by relevance to a query.
+#### `bench(pp: number, tg: number, pl: number, nr: number): Promise<BenchResult>`
+Benchmark model performance.
+### Multimodal Support
+#### `initMultimodal(params: { path: string; use_gpu?: boolean }): Promise<boolean>`
+Initialize multimodal support with a projector file.
+#### `isMultimodalEnabled(): Promise<boolean>`
+Check if multimodal support is enabled.
+#### `getMultimodalSupport(): Promise<{ vision: boolean; audio: boolean }>`
+Get multimodal capabilities.
+#### `releaseMultimodal(): Promise<void>`
+Release multimodal resources.
+### TTS (Text-to-Speech)
+#### `initVocoder(params: { path: string; n_batch?: number }): Promise<boolean>`
+Initialize TTS with a vocoder model.
+#### `isVocoderEnabled(): Promise<boolean>`
+Check if TTS is enabled.
+#### `getFormattedAudioCompletion(speaker: object | null, textToSpeak: string): Promise<{ prompt: string; grammar?: string }>`
+Get formatted audio completion prompt.
+#### `getAudioCompletionGuideTokens(textToSpeak: string): Promise<Array<number>>`
+Get guide tokens for audio completion.
+#### `decodeAudioTokens(tokens: number[]): Promise<Array<number>>`
+Decode audio tokens to audio data.
+#### `releaseVocoder(): Promise<void>`
+Release TTS resources.
+### LoRA Adapters
+#### `applyLoraAdapters(loraList: Array<{ path: string; scaled?: number }>): Promise<void>`
+Apply LoRA adapters to the model.
+#### `removeLoraAdapters(): Promise<void>`
+Remove all LoRA adapters.
+#### `getLoadedLoraAdapters(): Promise<Array<{ path: string; scaled?: number }>>`
+Get list of loaded LoRA adapters.
+### Session Management
+#### `saveSession(filepath: string, options?: { tokenSize: number }): Promise<number>`
+Save current session to a file.
+#### `loadSession(filepath: string): Promise<NativeSessionLoadResult>`
+Load session from a file.
+## 🔧 Configuration
+### Context Parameters
+```typescript
+interface ContextParams {
+  model: string;                    // Path to GGUF model file
+  n_ctx?: number;                   // Context size (default: 512)
+  n_threads?: number;               // Number of threads (default: 4)
+  n_gpu_layers?: number;            // GPU layers (iOS only)
+  use_mlock?: boolean;              // Lock memory (default: false)
+  use_mmap?: boolean;               // Use memory mapping (default: true)
+  embedding?: boolean;              // Embedding mode (default: false)
+  cache_type_k?: string;            // KV cache type for K
+  cache_type_v?: string;            // KV cache type for V
+  pooling_type?: string;            // Pooling type
+  // ... more parameters
+}
+```
+### Completion Parameters
+```typescript
+interface CompletionParams {
+  prompt?: string;                  // Text prompt
+  messages?: Message[];             // Chat messages
+  n_predict?: number;               // Max tokens to generate
+  temperature?: number;             // Sampling temperature
+  top_p?: number;                   // Top-p sampling
+  top_k?: number;                   // Top-k sampling
+  stop?: string[];                  // Stop sequences
+  // ... more parameters
+}
+```
+## 📱 Platform Support
+| Feature | iOS | Android | Web |
+|---------|-----|---------|-----|
+| Text Generation | ✅ | ✅ | ❌ |
+| Chat Conversations | ✅ | ✅ | ❌ |
+| Streaming | ✅ | ✅ | ❌ |
+| Multimodal | ✅ | ✅ | ❌ |
+| TTS | ✅ | ✅ | ❌ |
+| LoRA Adapters | ✅ | ✅ | ❌ |
+| Embeddings | ✅ | ✅ | ❌ |
+| Reranking | ✅ | ✅ | ❌ |
+| Session Management | ✅ | ✅ | ❌ |
+| Benchmarking | ✅ | ✅ | ❌ |
+## 🎨 Advanced Examples
+### Multimodal Processing
+```typescript
+// Initialize multimodal support
+await context.initMultimodal({
+  path: '/path/to/mmproj.gguf',
+  use_gpu: true,
+});
+// Process image with text
+const result = await context.completion({
+  messages: [
+    {
+      role: "user",
+      content: [
+        { type: "text", text: "What do you see in this image?" },
+        { type: "image_url", image_url: { url: "file:///path/to/image.jpg" } }
+      ]
+    }
+  ],
+  n_predict: 100,
+});
+console.log('Image analysis:', result.content);
+```
+### Text-to-Speech
+```typescript
+// Initialize TTS
+await context.initVocoder({
+  path: '/path/to/vocoder.gguf',
+  n_batch: 512,
+});
+// Generate audio
+const audioCompletion = await context.getFormattedAudioCompletion(
+  null, // Speaker configuration
+  "Hello, this is a test of text-to-speech functionality."
+);
+const guideTokens = await context.getAudioCompletionGuideTokens(
+  "Hello, this is a test of text-to-speech functionality."
+);
+const audioResult = await context.completion({
+  prompt: audioCompletion.prompt,
+  grammar: audioCompletion.grammar,
+  guide_tokens: guideTokens,
+  n_predict: 1000,
+});
+const audioData = await context.decodeAudioTokens(audioResult.audio_tokens);
+```
+### LoRA Adapters
+```typescript
+// Apply LoRA adapters
+await context.applyLoraAdapters([
+  { path: '/path/to/adapter1.gguf', scaled: 1.0 },
+  { path: '/path/to/adapter2.gguf', scaled: 0.5 }
+]);
+// Check loaded adapters
+const adapters = await context.getLoadedLoraAdapters();
+console.log('Loaded adapters:', adapters);
+// Generate with adapters
+const result = await context.completion({
+  prompt: "Test prompt with LoRA adapters:",
+  n_predict: 50,
+});
+// Remove adapters
+await context.removeLoraAdapters();
+```
+### Structured Output
+#### JSON Schema (Auto-converted to GBNF)
+```typescript
+const result = await context.completion({
+  prompt: "Generate a JSON object with a person's name, age, and favorite color:",
+  n_predict: 100,
+  response_format: {
+    type: 'json_schema',
+    json_schema: {
+      strict: true,
+      schema: {
+        type: 'object',
+        properties: {
+          name: { type: 'string' },
+          age: { type: 'number' },
+          favorite_color: { type: 'string' }
+        },
+        required: ['name', 'age', 'favorite_color']
+      }
+    }
+  }
+});
+console.log('Structured output:', result.content);
+```
+#### Direct GBNF Grammar
+```typescript
+// Define GBNF grammar directly for maximum control
+const grammar = `
+root ::= "{" ws name_field "," ws age_field "," ws color_field "}"
+name_field ::= "\\"name\\"" ws ":" ws string_value
+age_field ::= "\\"age\\"" ws ":" ws number_value
+color_field ::= "\\"favorite_color\\"" ws ":" ws string_value
+string_value ::= "\\"" [a-zA-Z ]+ "\\""
+number_value ::= [0-9]+
+ws ::= [ \\t\\n]*
+`;
+const result = await context.completion({
+  prompt: "Generate a person's profile:",
+  grammar: grammar,
+  n_predict: 100
+});
+console.log('Grammar-constrained output:', result.text);
+```
+#### Manual JSON Schema to GBNF Conversion
+```typescript
+import { convertJsonSchemaToGrammar } from 'llama-cpp-capacitor';
+const schema = {
+  type: 'object',
+  properties: {
+    name: { type: 'string' },
+    age: { type: 'number' }
+  },
+  required: ['name', 'age']
+};
+// Convert schema to GBNF grammar
+const grammar = await convertJsonSchemaToGrammar(schema);
+console.log('Generated grammar:', grammar);
+const result = await context.completion({
+  prompt: "Generate a person:",
+  grammar: grammar,
+  n_predict: 100
+});
+```
+## 🔍 Model Compatibility
+This plugin supports GGUF format models, which are compatible with llama.cpp. You can find GGUF models on Hugging Face by searching for the "GGUF" tag.
+### Recommended Models
+- **Llama 2**: Meta's latest language model
+- **Mistral**: High-performance open model
+- **Code Llama**: Specialized for code generation
+- **Phi-2**: Microsoft's efficient model
+- **Gemma**: Google's open model
+### Model Quantization
+For mobile devices, consider using quantized models (Q4_K_M, Q5_K_M, etc.) to reduce memory usage and improve performance.
+## ⚡ Performance Considerations
+### Memory Management
+- Use quantized models for better memory efficiency
+- Adjust `n_ctx` based on your use case
+- Monitor memory usage with `use_mlock: false`
+### GPU Acceleration
+- iOS: Set `n_gpu_layers` to use Metal GPU acceleration
+- Android: GPU acceleration is automatically enabled when available
+### Threading
+- Adjust `n_threads` based on device capabilities
+- More threads may improve performance but increase memory usage
+## 🐛 Troubleshooting
+### Common Issues
+1. **Model not found**: Ensure the model path is correct and the file exists
+2. **Out of memory**: Try using a quantized model or reducing `n_ctx`
+3. **Slow performance**: Enable GPU acceleration or increase `n_threads`
+4. **Multimodal not working**: Ensure the mmproj file is compatible with your model
+### Debugging
+Enable native logging to see detailed information:
+```typescript
+import { toggleNativeLog, addNativeLogListener } from 'llama-cpp';
+await toggleNativeLog(true);
+const logListener = addNativeLogListener((level, text) => {
+  console.log(`[${level}] ${text}`);
+});
+```
+## 🤝 Contributing
+We welcome contributions! Please see our [Contributing Guide](CONTRIBUTING.md) for details.
+## 📄 License
+This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.
+## 🙏 Acknowledgments
+- [llama.cpp](https://github.com/ggerganov/llama.cpp) - The core inference engine
+- [Capacitor](https://capacitorjs.com/) - The cross-platform runtime
+- [llama.rn](https://github.com/mybigday/llama.rn) - Inspiration for the React Native implementation
+## 📞 Support
+- 📧 Email: support@arusatech.com
+- 🐛 Issues: [GitHub Issues](https://github.com/arusatech/llama-cpp/issues)
+- 📖 Documentation: [GitHub Wiki](https://github.com/arusatech/llama-cpp/wiki)