cui-llama.rn 1.7.4 → 1.7.6

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (276)
  1. package/README.md +217 -17
  2. package/android/src/main/CMakeLists.txt +34 -15
  3. package/android/src/main/java/com/rnllama/LlamaContext.java +79 -5
  4. package/android/src/main/java/com/rnllama/RNLlama.java +237 -0
  5. package/android/src/main/jni.cpp +213 -14
  6. package/android/src/main/jniLibs/arm64-v8a/librnllama.so +0 -0
  7. package/android/src/main/jniLibs/arm64-v8a/librnllama_v8.so +0 -0
  8. package/android/src/main/jniLibs/arm64-v8a/librnllama_v8_2.so +0 -0
  9. package/android/src/main/jniLibs/arm64-v8a/librnllama_v8_2_dotprod.so +0 -0
  10. package/android/src/main/jniLibs/arm64-v8a/librnllama_v8_2_dotprod_i8mm.so +0 -0
  11. package/android/src/main/jniLibs/arm64-v8a/librnllama_v8_2_i8mm.so +0 -0
  12. package/android/src/main/jniLibs/x86_64/librnllama.so +0 -0
  13. package/android/src/main/jniLibs/x86_64/librnllama_x86_64.so +0 -0
  14. package/android/src/newarch/java/com/rnllama/RNLlamaModule.java +35 -0
  15. package/android/src/oldarch/java/com/rnllama/RNLlamaModule.java +34 -0
  16. package/cpp/README.md +1 -1
  17. package/cpp/chat-parser.cpp +385 -0
  18. package/cpp/chat-parser.h +120 -0
  19. package/cpp/chat.cpp +726 -596
  20. package/cpp/chat.h +71 -6
  21. package/cpp/common.cpp +56 -38
  22. package/cpp/common.h +9 -3
  23. package/cpp/ggml-backend-reg.cpp +5 -0
  24. package/cpp/ggml-backend.cpp +10 -2
  25. package/cpp/ggml-common.h +4 -0
  26. package/cpp/ggml-cpu/amx/amx.cpp +1 -1
  27. package/cpp/ggml-cpu/amx/mmq.cpp +11 -10
  28. package/cpp/ggml-cpu/arch/arm/cpu-feats.cpp +94 -0
  29. package/cpp/ggml-cpu/arch/arm/quants.c +4114 -0
  30. package/cpp/ggml-cpu/arch/arm/repack.cpp +2163 -0
  31. package/cpp/ggml-cpu/arch/x86/cpu-feats.cpp +327 -0
  32. package/cpp/ggml-cpu/arch/x86/quants.c +4311 -0
  33. package/cpp/ggml-cpu/{ggml-cpu-aarch64.cpp → arch/x86/repack.cpp} +79 -3225
  34. package/cpp/ggml-cpu/arch-fallback.h +184 -0
  35. package/cpp/ggml-cpu/common.h +4 -3
  36. package/cpp/ggml-cpu/ggml-cpu-impl.h +21 -16
  37. package/cpp/ggml-cpu/ggml-cpu.c +123 -104
  38. package/cpp/ggml-cpu/ggml-cpu.cpp +11 -8
  39. package/cpp/ggml-cpu/ops.cpp +330 -148
  40. package/cpp/ggml-cpu/ops.h +1 -0
  41. package/cpp/ggml-cpu/quants.c +1158 -0
  42. package/cpp/ggml-cpu/{ggml-cpu-quants.h → quants.h} +26 -0
  43. package/cpp/ggml-cpu/repack.cpp +1571 -0
  44. package/cpp/ggml-cpu/repack.h +98 -0
  45. package/cpp/ggml-cpu/simd-mappings.h +330 -38
  46. package/cpp/ggml-cpu/{ggml-cpu-traits.cpp → traits.cpp} +1 -1
  47. package/cpp/ggml-cpu/vec.cpp +87 -18
  48. package/cpp/ggml-cpu/vec.h +249 -94
  49. package/cpp/ggml-cpu.h +1 -0
  50. package/cpp/ggml-impl.h +63 -183
  51. package/cpp/ggml-llama-sim.metallib +0 -0
  52. package/cpp/ggml-llama.metallib +0 -0
  53. package/cpp/ggml-metal.m +152 -45
  54. package/cpp/ggml-quants.c +0 -2
  55. package/cpp/ggml.c +61 -21
  56. package/cpp/ggml.h +22 -3
  57. package/cpp/gguf.cpp +24 -3
  58. package/cpp/json-partial.cpp +256 -0
  59. package/cpp/json-partial.h +38 -0
  60. package/cpp/json-schema-to-grammar.cpp +5 -47
  61. package/cpp/json-schema-to-grammar.h +4 -4
  62. package/cpp/llama-arch.cpp +153 -3
  63. package/cpp/llama-arch.h +27 -1
  64. package/cpp/llama-batch.cpp +741 -272
  65. package/cpp/llama-batch.h +112 -54
  66. package/cpp/llama-chat.cpp +30 -8
  67. package/cpp/llama-chat.h +1 -0
  68. package/cpp/llama-context.cpp +524 -339
  69. package/cpp/llama-context.h +38 -17
  70. package/cpp/llama-cparams.cpp +4 -0
  71. package/cpp/llama-cparams.h +2 -0
  72. package/cpp/llama-grammar.cpp +12 -2
  73. package/cpp/llama-graph.cpp +431 -356
  74. package/cpp/llama-graph.h +126 -58
  75. package/cpp/llama-hparams.cpp +10 -2
  76. package/cpp/llama-hparams.h +19 -2
  77. package/cpp/llama-kv-cache-unified-iswa.cpp +279 -0
  78. package/cpp/llama-kv-cache-unified-iswa.h +128 -0
  79. package/cpp/llama-kv-cache-unified.cpp +1841 -0
  80. package/cpp/llama-kv-cache-unified.h +303 -0
  81. package/cpp/llama-kv-cells.h +439 -0
  82. package/cpp/llama-memory-hybrid.cpp +246 -0
  83. package/cpp/llama-memory-hybrid.h +138 -0
  84. package/cpp/llama-memory-recurrent.cpp +1112 -0
  85. package/cpp/llama-memory-recurrent.h +183 -0
  86. package/cpp/llama-memory.cpp +41 -0
  87. package/cpp/llama-memory.h +86 -5
  88. package/cpp/llama-mmap.cpp +1 -1
  89. package/cpp/llama-model-loader.cpp +42 -17
  90. package/cpp/llama-model-saver.cpp +1 -0
  91. package/cpp/llama-model.cpp +1639 -513
  92. package/cpp/llama-model.h +26 -0
  93. package/cpp/llama-sampling.cpp +2 -2
  94. package/cpp/llama-vocab.cpp +65 -28
  95. package/cpp/llama-vocab.h +1 -0
  96. package/cpp/llama.cpp +11 -7
  97. package/cpp/llama.h +150 -42
  98. package/cpp/minja/chat-template.hpp +1 -1
  99. package/cpp/minja/minja.hpp +1 -1
  100. package/cpp/{json.hpp → nlohmann/json.hpp} +3027 -2267
  101. package/cpp/nlohmann/json_fwd.hpp +187 -0
  102. package/cpp/regex-partial.cpp +204 -0
  103. package/cpp/regex-partial.h +56 -0
  104. package/cpp/rn-llama.cpp +646 -35
  105. package/cpp/rn-llama.h +32 -1
  106. package/cpp/rn-tts.h +39 -0
  107. package/cpp/sampling.cpp +7 -8
  108. package/cpp/tools/mtmd/clip-impl.h +5 -0
  109. package/cpp/tools/mtmd/clip.cpp +572 -436
  110. package/cpp/tools/mtmd/clip.h +14 -4
  111. package/cpp/tools/mtmd/mtmd-audio.cpp +0 -86
  112. package/cpp/tools/mtmd/mtmd-audio.h +2 -17
  113. package/cpp/tools/mtmd/mtmd-helper.cpp +175 -12
  114. package/cpp/tools/mtmd/mtmd-helper.h +91 -0
  115. package/cpp/tools/mtmd/mtmd.cpp +368 -248
  116. package/cpp/tools/mtmd/mtmd.h +6 -70
  117. package/cpp/unicode.cpp +5 -0
  118. package/ios/CMakeLists.txt +26 -6
  119. package/ios/RNLlama.h +1 -1
  120. package/ios/RNLlama.mm +153 -3
  121. package/ios/RNLlamaContext.h +9 -1
  122. package/ios/RNLlamaContext.mm +112 -9
  123. package/ios/rnllama.xcframework/ios-arm64/rnllama.framework/Headers/chat-parser.h +120 -0
  124. package/ios/rnllama.xcframework/ios-arm64/rnllama.framework/Headers/chat.h +71 -6
  125. package/ios/rnllama.xcframework/ios-arm64/rnllama.framework/Headers/common.h +9 -3
  126. package/ios/rnllama.xcframework/ios-arm64/rnllama.framework/Headers/ggml-common.h +4 -0
  127. package/ios/rnllama.xcframework/ios-arm64/rnllama.framework/Headers/ggml-cpu.h +1 -0
  128. package/ios/rnllama.xcframework/ios-arm64/rnllama.framework/Headers/ggml-impl.h +63 -183
  129. package/ios/rnllama.xcframework/ios-arm64/rnllama.framework/Headers/ggml.h +22 -3
  130. package/ios/rnllama.xcframework/ios-arm64/rnllama.framework/Headers/json-partial.h +38 -0
  131. package/ios/rnllama.xcframework/ios-arm64/rnllama.framework/Headers/json-schema-to-grammar.h +4 -4
  132. package/ios/rnllama.xcframework/ios-arm64/rnllama.framework/Headers/llama-arch.h +27 -1
  133. package/ios/rnllama.xcframework/ios-arm64/rnllama.framework/Headers/llama-batch.h +112 -54
  134. package/ios/rnllama.xcframework/ios-arm64/rnllama.framework/Headers/llama-chat.h +1 -0
  135. package/ios/rnllama.xcframework/ios-arm64/rnllama.framework/Headers/llama-context.h +38 -17
  136. package/ios/rnllama.xcframework/ios-arm64/rnllama.framework/Headers/llama-cparams.h +2 -0
  137. package/ios/rnllama.xcframework/ios-arm64/rnllama.framework/Headers/llama-graph.h +126 -58
  138. package/ios/rnllama.xcframework/ios-arm64/rnllama.framework/Headers/llama-hparams.h +19 -2
  139. package/ios/rnllama.xcframework/ios-arm64/rnllama.framework/Headers/llama-kv-cache-unified-iswa.h +128 -0
  140. package/ios/rnllama.xcframework/ios-arm64/rnllama.framework/Headers/llama-kv-cache-unified.h +303 -0
  141. package/ios/rnllama.xcframework/ios-arm64/rnllama.framework/Headers/llama-kv-cells.h +439 -0
  142. package/ios/rnllama.xcframework/ios-arm64/rnllama.framework/Headers/llama-memory-hybrid.h +138 -0
  143. package/ios/rnllama.xcframework/ios-arm64/rnllama.framework/Headers/llama-memory-recurrent.h +183 -0
  144. package/ios/rnllama.xcframework/ios-arm64/rnllama.framework/Headers/llama-memory.h +86 -5
  145. package/ios/rnllama.xcframework/ios-arm64/rnllama.framework/Headers/llama-model.h +26 -0
  146. package/ios/rnllama.xcframework/ios-arm64/rnllama.framework/Headers/llama-vocab.h +1 -0
  147. package/ios/rnllama.xcframework/ios-arm64/rnllama.framework/Headers/llama.h +150 -42
  148. package/ios/rnllama.xcframework/ios-arm64/rnllama.framework/Headers/minja/chat-template.hpp +1 -1
  149. package/ios/rnllama.xcframework/ios-arm64/rnllama.framework/Headers/minja/minja.hpp +1 -1
  150. package/ios/rnllama.xcframework/ios-arm64/rnllama.framework/Headers/{json.hpp → nlohmann/json.hpp} +3027 -2267
  151. package/ios/rnllama.xcframework/ios-arm64/rnllama.framework/Headers/nlohmann/json_fwd.hpp +187 -0
  152. package/ios/rnllama.xcframework/ios-arm64/rnllama.framework/Headers/regex-partial.h +56 -0
  153. package/ios/rnllama.xcframework/ios-arm64/rnllama.framework/Headers/rn-llama.h +32 -1
  154. package/ios/rnllama.xcframework/ios-arm64/rnllama.framework/Headers/rn-tts.h +39 -0
  155. package/ios/rnllama.xcframework/ios-arm64/rnllama.framework/ggml-llama.metallib +0 -0
  156. package/ios/rnllama.xcframework/ios-arm64/rnllama.framework/rnllama +0 -0
  157. package/ios/rnllama.xcframework/ios-arm64_x86_64-simulator/rnllama.framework/Headers/chat-parser.h +120 -0
  158. package/ios/rnllama.xcframework/ios-arm64_x86_64-simulator/rnllama.framework/Headers/chat.h +71 -6
  159. package/ios/rnllama.xcframework/ios-arm64_x86_64-simulator/rnllama.framework/Headers/common.h +9 -3
  160. package/ios/rnllama.xcframework/ios-arm64_x86_64-simulator/rnllama.framework/Headers/ggml-common.h +4 -0
  161. package/ios/rnllama.xcframework/ios-arm64_x86_64-simulator/rnllama.framework/Headers/ggml-cpu.h +1 -0
  162. package/ios/rnllama.xcframework/ios-arm64_x86_64-simulator/rnllama.framework/Headers/ggml-impl.h +63 -183
  163. package/ios/rnllama.xcframework/ios-arm64_x86_64-simulator/rnllama.framework/Headers/ggml.h +22 -3
  164. package/ios/rnllama.xcframework/ios-arm64_x86_64-simulator/rnllama.framework/Headers/json-partial.h +38 -0
  165. package/ios/rnllama.xcframework/ios-arm64_x86_64-simulator/rnllama.framework/Headers/json-schema-to-grammar.h +4 -4
  166. package/ios/rnllama.xcframework/ios-arm64_x86_64-simulator/rnllama.framework/Headers/llama-arch.h +27 -1
  167. package/ios/rnllama.xcframework/ios-arm64_x86_64-simulator/rnllama.framework/Headers/llama-batch.h +112 -54
  168. package/ios/rnllama.xcframework/ios-arm64_x86_64-simulator/rnllama.framework/Headers/llama-chat.h +1 -0
  169. package/ios/rnllama.xcframework/ios-arm64_x86_64-simulator/rnllama.framework/Headers/llama-context.h +38 -17
  170. package/ios/rnllama.xcframework/ios-arm64_x86_64-simulator/rnllama.framework/Headers/llama-cparams.h +2 -0
  171. package/ios/rnllama.xcframework/ios-arm64_x86_64-simulator/rnllama.framework/Headers/llama-graph.h +126 -58
  172. package/ios/rnllama.xcframework/ios-arm64_x86_64-simulator/rnllama.framework/Headers/llama-hparams.h +19 -2
  173. package/ios/rnllama.xcframework/ios-arm64_x86_64-simulator/rnllama.framework/Headers/llama-kv-cache-unified-iswa.h +128 -0
  174. package/ios/rnllama.xcframework/ios-arm64_x86_64-simulator/rnllama.framework/Headers/llama-kv-cache-unified.h +303 -0
  175. package/ios/rnllama.xcframework/ios-arm64_x86_64-simulator/rnllama.framework/Headers/llama-kv-cells.h +439 -0
  176. package/ios/rnllama.xcframework/ios-arm64_x86_64-simulator/rnllama.framework/Headers/llama-memory-hybrid.h +138 -0
  177. package/ios/rnllama.xcframework/ios-arm64_x86_64-simulator/rnllama.framework/Headers/llama-memory-recurrent.h +183 -0
  178. package/ios/rnllama.xcframework/ios-arm64_x86_64-simulator/rnllama.framework/Headers/llama-memory.h +86 -5
  179. package/ios/rnllama.xcframework/ios-arm64_x86_64-simulator/rnllama.framework/Headers/llama-model.h +26 -0
  180. package/ios/rnllama.xcframework/ios-arm64_x86_64-simulator/rnllama.framework/Headers/llama-vocab.h +1 -0
  181. package/ios/rnllama.xcframework/ios-arm64_x86_64-simulator/rnllama.framework/Headers/llama.h +150 -42
  182. package/ios/rnllama.xcframework/ios-arm64_x86_64-simulator/rnllama.framework/Headers/minja/chat-template.hpp +1 -1
  183. package/ios/rnllama.xcframework/ios-arm64_x86_64-simulator/rnllama.framework/Headers/minja/minja.hpp +1 -1
  184. package/ios/rnllama.xcframework/{tvos-arm64/rnllama.framework/Headers → ios-arm64_x86_64-simulator/rnllama.framework/Headers/nlohmann}/json.hpp +3027 -2267
  185. package/ios/rnllama.xcframework/ios-arm64_x86_64-simulator/rnllama.framework/Headers/nlohmann/json_fwd.hpp +187 -0
  186. package/ios/rnllama.xcframework/ios-arm64_x86_64-simulator/rnllama.framework/Headers/regex-partial.h +56 -0
  187. package/ios/rnllama.xcframework/ios-arm64_x86_64-simulator/rnllama.framework/Headers/rn-llama.h +32 -1
  188. package/ios/rnllama.xcframework/ios-arm64_x86_64-simulator/rnllama.framework/Headers/rn-tts.h +39 -0
  189. package/ios/rnllama.xcframework/ios-arm64_x86_64-simulator/rnllama.framework/ggml-llama-sim.metallib +0 -0
  190. package/ios/rnllama.xcframework/ios-arm64_x86_64-simulator/rnllama.framework/rnllama +0 -0
  191. package/ios/rnllama.xcframework/tvos-arm64/rnllama.framework/Headers/chat-parser.h +120 -0
  192. package/ios/rnllama.xcframework/tvos-arm64/rnllama.framework/Headers/chat.h +71 -6
  193. package/ios/rnllama.xcframework/tvos-arm64/rnllama.framework/Headers/common.h +9 -3
  194. package/ios/rnllama.xcframework/tvos-arm64/rnllama.framework/Headers/ggml-common.h +4 -0
  195. package/ios/rnllama.xcframework/tvos-arm64/rnllama.framework/Headers/ggml-cpu.h +1 -0
  196. package/ios/rnllama.xcframework/tvos-arm64/rnllama.framework/Headers/ggml-impl.h +63 -183
  197. package/ios/rnllama.xcframework/tvos-arm64/rnllama.framework/Headers/ggml.h +22 -3
  198. package/ios/rnllama.xcframework/tvos-arm64/rnllama.framework/Headers/json-partial.h +38 -0
  199. package/ios/rnllama.xcframework/tvos-arm64/rnllama.framework/Headers/json-schema-to-grammar.h +4 -4
  200. package/ios/rnllama.xcframework/tvos-arm64/rnllama.framework/Headers/llama-arch.h +27 -1
  201. package/ios/rnllama.xcframework/tvos-arm64/rnllama.framework/Headers/llama-batch.h +112 -54
  202. package/ios/rnllama.xcframework/tvos-arm64/rnllama.framework/Headers/llama-chat.h +1 -0
  203. package/ios/rnllama.xcframework/tvos-arm64/rnllama.framework/Headers/llama-context.h +38 -17
  204. package/ios/rnllama.xcframework/tvos-arm64/rnllama.framework/Headers/llama-cparams.h +2 -0
  205. package/ios/rnllama.xcframework/tvos-arm64/rnllama.framework/Headers/llama-graph.h +126 -58
  206. package/ios/rnllama.xcframework/tvos-arm64/rnllama.framework/Headers/llama-hparams.h +19 -2
  207. package/ios/rnllama.xcframework/tvos-arm64/rnllama.framework/Headers/llama-kv-cache-unified-iswa.h +128 -0
  208. package/ios/rnllama.xcframework/tvos-arm64/rnllama.framework/Headers/llama-kv-cache-unified.h +303 -0
  209. package/ios/rnllama.xcframework/tvos-arm64/rnllama.framework/Headers/llama-kv-cells.h +439 -0
  210. package/ios/rnllama.xcframework/tvos-arm64/rnllama.framework/Headers/llama-memory-hybrid.h +138 -0
  211. package/ios/rnllama.xcframework/tvos-arm64/rnllama.framework/Headers/llama-memory-recurrent.h +183 -0
  212. package/ios/rnllama.xcframework/tvos-arm64/rnllama.framework/Headers/llama-memory.h +86 -5
  213. package/ios/rnllama.xcframework/tvos-arm64/rnllama.framework/Headers/llama-model.h +26 -0
  214. package/ios/rnllama.xcframework/tvos-arm64/rnllama.framework/Headers/llama-vocab.h +1 -0
  215. package/ios/rnllama.xcframework/tvos-arm64/rnllama.framework/Headers/llama.h +150 -42
  216. package/ios/rnllama.xcframework/tvos-arm64/rnllama.framework/Headers/minja/chat-template.hpp +1 -1
  217. package/ios/rnllama.xcframework/tvos-arm64/rnllama.framework/Headers/minja/minja.hpp +1 -1
  218. package/ios/rnllama.xcframework/{ios-arm64_x86_64-simulator/rnllama.framework/Headers → tvos-arm64/rnllama.framework/Headers/nlohmann}/json.hpp +3027 -2267
  219. package/ios/rnllama.xcframework/tvos-arm64/rnllama.framework/Headers/nlohmann/json_fwd.hpp +187 -0
  220. package/ios/rnllama.xcframework/tvos-arm64/rnllama.framework/Headers/regex-partial.h +56 -0
  221. package/ios/rnllama.xcframework/tvos-arm64/rnllama.framework/Headers/rn-llama.h +32 -1
  222. package/ios/rnllama.xcframework/tvos-arm64/rnllama.framework/Headers/rn-tts.h +39 -0
  223. package/ios/rnllama.xcframework/tvos-arm64/rnllama.framework/ggml-llama.metallib +0 -0
  224. package/ios/rnllama.xcframework/tvos-arm64/rnllama.framework/rnllama +0 -0
  225. package/ios/rnllama.xcframework/tvos-arm64_x86_64-simulator/rnllama.framework/Headers/chat-parser.h +120 -0
  226. package/ios/rnllama.xcframework/tvos-arm64_x86_64-simulator/rnllama.framework/Headers/chat.h +71 -6
  227. package/ios/rnllama.xcframework/tvos-arm64_x86_64-simulator/rnllama.framework/Headers/common.h +9 -3
  228. package/ios/rnllama.xcframework/tvos-arm64_x86_64-simulator/rnllama.framework/Headers/ggml-common.h +4 -0
  229. package/ios/rnllama.xcframework/tvos-arm64_x86_64-simulator/rnllama.framework/Headers/ggml-cpu.h +1 -0
  230. package/ios/rnllama.xcframework/tvos-arm64_x86_64-simulator/rnllama.framework/Headers/ggml-impl.h +63 -183
  231. package/ios/rnllama.xcframework/tvos-arm64_x86_64-simulator/rnllama.framework/Headers/ggml.h +22 -3
  232. package/ios/rnllama.xcframework/tvos-arm64_x86_64-simulator/rnllama.framework/Headers/json-partial.h +38 -0
  233. package/ios/rnllama.xcframework/tvos-arm64_x86_64-simulator/rnllama.framework/Headers/json-schema-to-grammar.h +4 -4
  234. package/ios/rnllama.xcframework/tvos-arm64_x86_64-simulator/rnllama.framework/Headers/llama-arch.h +27 -1
  235. package/ios/rnllama.xcframework/tvos-arm64_x86_64-simulator/rnllama.framework/Headers/llama-batch.h +112 -54
  236. package/ios/rnllama.xcframework/tvos-arm64_x86_64-simulator/rnllama.framework/Headers/llama-chat.h +1 -0
  237. package/ios/rnllama.xcframework/tvos-arm64_x86_64-simulator/rnllama.framework/Headers/llama-context.h +38 -17
  238. package/ios/rnllama.xcframework/tvos-arm64_x86_64-simulator/rnllama.framework/Headers/llama-cparams.h +2 -0
  239. package/ios/rnllama.xcframework/tvos-arm64_x86_64-simulator/rnllama.framework/Headers/llama-graph.h +126 -58
  240. package/ios/rnllama.xcframework/tvos-arm64_x86_64-simulator/rnllama.framework/Headers/llama-hparams.h +19 -2
  241. package/ios/rnllama.xcframework/tvos-arm64_x86_64-simulator/rnllama.framework/Headers/llama-kv-cache-unified-iswa.h +128 -0
  242. package/ios/rnllama.xcframework/tvos-arm64_x86_64-simulator/rnllama.framework/Headers/llama-kv-cache-unified.h +303 -0
  243. package/ios/rnllama.xcframework/tvos-arm64_x86_64-simulator/rnllama.framework/Headers/llama-kv-cells.h +439 -0
  244. package/ios/rnllama.xcframework/tvos-arm64_x86_64-simulator/rnllama.framework/Headers/llama-memory-hybrid.h +138 -0
  245. package/ios/rnllama.xcframework/tvos-arm64_x86_64-simulator/rnllama.framework/Headers/llama-memory-recurrent.h +183 -0
  246. package/ios/rnllama.xcframework/tvos-arm64_x86_64-simulator/rnllama.framework/Headers/llama-memory.h +86 -5
  247. package/ios/rnllama.xcframework/tvos-arm64_x86_64-simulator/rnllama.framework/Headers/llama-model.h +26 -0
  248. package/ios/rnllama.xcframework/tvos-arm64_x86_64-simulator/rnllama.framework/Headers/llama-vocab.h +1 -0
  249. package/ios/rnllama.xcframework/tvos-arm64_x86_64-simulator/rnllama.framework/Headers/llama.h +150 -42
  250. package/ios/rnllama.xcframework/tvos-arm64_x86_64-simulator/rnllama.framework/Headers/minja/chat-template.hpp +1 -1
  251. package/ios/rnllama.xcframework/tvos-arm64_x86_64-simulator/rnllama.framework/Headers/minja/minja.hpp +1 -1
  252. package/ios/rnllama.xcframework/tvos-arm64_x86_64-simulator/rnllama.framework/Headers/nlohmann/json.hpp +25526 -0
  253. package/ios/rnllama.xcframework/tvos-arm64_x86_64-simulator/rnllama.framework/Headers/nlohmann/json_fwd.hpp +187 -0
  254. package/ios/rnllama.xcframework/tvos-arm64_x86_64-simulator/rnllama.framework/Headers/regex-partial.h +56 -0
  255. package/ios/rnllama.xcframework/tvos-arm64_x86_64-simulator/rnllama.framework/Headers/rn-llama.h +32 -1
  256. package/ios/rnllama.xcframework/tvos-arm64_x86_64-simulator/rnllama.framework/Headers/rn-tts.h +39 -0
  257. package/ios/rnllama.xcframework/tvos-arm64_x86_64-simulator/rnllama.framework/ggml-llama-sim.metallib +0 -0
  258. package/ios/rnllama.xcframework/tvos-arm64_x86_64-simulator/rnllama.framework/rnllama +0 -0
  259. package/jest/mock.js +24 -0
  260. package/package.json +1 -1
  261. package/src/NativeRNLlama.ts +46 -2
  262. package/src/index.ts +105 -1
  263. package/cpp/ggml-cpu/ggml-cpu-aarch64.h +0 -8
  264. package/cpp/ggml-cpu/ggml-cpu-quants.c +0 -13326
  265. package/cpp/ggml-cpu/sgemm.cpp +0 -3544
  266. package/cpp/ggml-cpu/sgemm.h +0 -14
  267. package/cpp/llama-kv-cache.cpp +0 -2827
  268. package/cpp/llama-kv-cache.h +0 -515
  269. package/ios/rnllama.xcframework/ios-arm64/rnllama.framework/Headers/llama-kv-cache.h +0 -515
  270. package/ios/rnllama.xcframework/ios-arm64_x86_64-simulator/rnllama.framework/Headers/llama-kv-cache.h +0 -515
  271. package/ios/rnllama.xcframework/tvos-arm64/rnllama.framework/Headers/llama-kv-cache.h +0 -515
  272. package/ios/rnllama.xcframework/tvos-arm64_x86_64-simulator/rnllama.framework/Headers/json.hpp +0 -24766
  273. package/ios/rnllama.xcframework/tvos-arm64_x86_64-simulator/rnllama.framework/Headers/llama-kv-cache.h +0 -515
  274. /package/cpp/ggml-cpu/{ggml-cpu-traits.h → traits.h} +0 -0
  275. /package/cpp/tools/mtmd/{miniaudio.h → miniaudio/miniaudio.h} +0 -0
  276. /package/cpp/tools/mtmd/{stb_image.h → stb/stb_image.h} +0 -0
package/README.md CHANGED
@@ -55,6 +55,8 @@ For get a GGUF model or quantize manually, see [`Prepare and Quantize`](https://

  ## Usage

+ > **💡 New!** `llama.rn` now supports **multimodal models** with vision and audio capabilities! See the [Multimodal section](#multimodal-vision--audio) for details.
+
  Load model info only:

  ```js
@@ -123,49 +125,162 @@ console.log('Result:', textResult.text)
  console.log('Timings:', textResult.timings)
  ```

- The bindings deisgn inspired by [server.cpp](https://github.com/ggerganov/llama.cpp/tree/master/examples/server) example in llama.cpp:
+ The binding's deisgn inspired by [server.cpp](https://github.com/ggerganov/llama.cpp/tree/master/examples/server) example in llama.cpp:

  - `/completion` and `/chat/completions`: `context.completion(params, partialCompletionCallback)`
  - `/tokenize`: `context.tokenize(content)`
  - `/detokenize`: `context.detokenize(tokens)`
  - `/embedding`: `context.embedding(content)`
+ - `/rerank`: `context.rerank(query, documents, params)`
  - ... Other methods

  Please visit the [Documentation](docs/API) for more details.

  You can also visit the [example](example) to see how to use it.

- ## Session (State)
+ ## Multimodal (Vision & Audio)

- The session file is a binary file that contains the state of the context, it can saves time of prompt processing.
+ `llama.rn` supports multimodal capabilities including vision (images) and audio processing. This allows you to interact with models that can understand both text and media content.
+
+ ### Supported Media Formats
+
+ **Images (Vision):**
+ - JPEG, PNG, BMP, GIF, TGA, HDR, PIC, PNM
+ - Base64 encoded images (data URLs)
+ - Local file paths
+ - \* Not supported HTTP URLs yet
+
+ **Audio:**
+ - WAV, MP3 formats
+ - Base64 encoded audio (data URLs)
+ - Local file paths
+ - \* Not supported HTTP URLs yet
+
+ ### Setup
+
+ First, you need a multimodal model and its corresponding multimodal projector (mmproj) file, see [how to obtain mmproj](https://github.com/ggml-org/llama.cpp/tree/master/tools/mtmd#how-to-obtain-mmproj) for more details.
+
+ ### Initialize Multimodal Support

  ```js
- const context = await initLlama({ ...params })
+ import { initLlama } from 'llama.rn'

- // After prompt processing or completion ...
+ // First initialize the model context
+ const context = await initLlama({
+   model: 'path/to/your/multimodal-model.gguf',
+   n_ctx: 4096,
+   n_gpu_layers: 99, // Recommended for multimodal models
+   // Important: Disable context shifting for multimodal
+   ctx_shift: false,
+ })

- // Save the session
- await context.saveSession('<path to save session>')
+ // Initialize multimodal support with mmproj file
+ const success = await context.initMultimodal({
+   path: 'path/to/your/mmproj-model.gguf',
+   use_gpu: true, // Recommended for better performance
+ })

- // Load the session
- await context.loadSession('<path to load session>')
+ // Check if multimodal is enabled
+ console.log('Multimodal enabled:', await context.isMultimodalEnabled())
+
+ if (success) {
+   console.log('Multimodal support initialized!')
+
+   // Check what modalities are supported
+   const support = await context.getMultimodalSupport()
+   console.log('Vision support:', support.vision)
+   console.log('Audio support:', support.audio)
+ } else {
+   console.log('Failed to initialize multimodal support')
+ }
+
+ // Release multimodal context
+ await context.releaseMultimodal()
  ```

- ## Embedding
+ ### Usage Examples

- The embedding API is used to get the embedding of a text.
+ #### Vision (Image Processing)

  ```js
- const context = await initLlama({
-   ...params,
-   embedding: true,
+ const result = await context.completion({
+   messages: [
+     {
+       role: 'user',
+       content: [
+         {
+           type: 'text',
+           text: 'What do you see in this image?',
+         },
+         {
+           type: 'image_url',
+           image_url: {
+             url: 'file:///path/to/image.jpg',
+             // or base64: 'data:image/jpeg;base64,/9j/4AAQSkZJRgABAQEAYABgAAD...'
+           },
+         },
+       ],
+     },
+   ],
+   n_predict: 100,
+   temperature: 0.1,
  })

- const { embedding } = await context.embedding('Hello, world!')
+ console.log('AI Response:', result.text)
  ```

- - You can use model like [nomic-ai/nomic-embed-text-v1.5-GGUF](https://huggingface.co/nomic-ai/nomic-embed-text-v1.5-GGUF) for better embedding quality.
- - You can use DB like [op-sqlite](https://github.com/OP-Engineering/op-sqlite) with sqlite-vec support to store and search embeddings.
+ #### Audio Processing
+
+ ```js
+ // Method 1: Using structured message content (Recommended)
+ const result = await context.completion({
+   messages: [
+     {
+       role: 'user',
+       content: [
+         {
+           type: 'text',
+           text: 'Transcribe or describe this audio:',
+         },
+         {
+           type: 'input_audio',
+           input_audio: {
+             data: 'data:audio/wav;base64,UklGRiQAAABXQVZFZm10...',
+             // or url: 'file:///path/to/audio.wav',
+             format: 'wav', // or 'mp3'
+           },
+         },
+       ],
+     },
+   ],
+   n_predict: 200,
+ })
+
+ console.log('Transcription:', result.text)
+ ```
+
+ ### Tokenization with Media
+
+ ```js
+ // Tokenize text with media
+ const tokenizeResult = await context.tokenize(
+   'Describe this image: <__media__>',
+   {
+     media_paths: ['file:///path/to/image.jpg']
+   }
+ )
+
+ console.log('Tokens:', tokenizeResult.tokens)
+ console.log('Has media:', tokenizeResult.has_media)
+ console.log('Media positions:', tokenizeResult.chunk_pos_media)
+ ```
+
+ ### Notes
+
+ - **Context Shifting**: Multimodal models require `ctx_shift: false` to maintain media token positioning
+ - **Memory**: Multimodal models require more memory; use adequate `n_ctx` and consider GPU offloading
+ - **Media Markers**: The system automatically handles `<__media__>` markers in prompts. When using structured message content, media items are automatically replaced with this marker
+ - **Model Compatibility**: Ensure your model supports the media type you're trying to process

  ## Tool Calling

@@ -289,6 +404,91 @@ console.log('Result:', text)

  Also, this is how `json_schema` works in `response_format` during completion, it converts the json_schema to gbnf grammar.

+ ## Session (State)
+
+ The session file is a binary file that contains the state of the context, it can saves time of prompt processing.
+
+ ```js
+ const context = await initLlama({ ...params })
+
+ // After prompt processing or completion ...
+
+ // Save the session
+ await context.saveSession('<path to save session>')
+
+ // Load the session
+ await context.loadSession('<path to load session>')
+ ```
+
+ ### Notes
+
+ - \* Session is currently not supported save state from multimodal context, so it only stores the text chunk before the first media chunk.
+
+ ## Embedding
+
+ The embedding API is used to get the embedding of a text.
+
+ ```js
+ const context = await initLlama({
+   ...params,
+   embedding: true,
+ })
+
+ const { embedding } = await context.embedding('Hello, world!')
+ ```
+
+ - You can use model like [nomic-ai/nomic-embed-text-v1.5-GGUF](https://huggingface.co/nomic-ai/nomic-embed-text-v1.5-GGUF) for better embedding quality.
+ - You can use DB like [op-sqlite](https://github.com/OP-Engineering/op-sqlite) with sqlite-vec support to store and search embeddings.
+
+ ## Rerank
+
+ The rerank API is used to rank documents based on their relevance to a query. This is particularly useful for improving search results and implementing retrieval-augmented generation (RAG) systems.
+
+ ```js
+ const context = await initLlama({
+   ...params,
+   embedding: true, // Required for reranking
+   pooling_type: 'rank', // Use rank pooling for rerank models
+ })
+
+ // Rerank documents based on relevance to query
+ const results = await context.rerank(
+   'What is artificial intelligence?', // query
+   [
+     'AI is a branch of computer science.',
+     'The weather is nice today.',
+     'Machine learning is a subset of AI.',
+     'I like pizza.',
+   ], // documents to rank
+   {
+     normalize: 1, // Optional: normalize scores (default: from model config)
+   }
+ )
+
+ // Results are automatically sorted by score (highest first)
+ results.forEach((result, index) => {
+   console.log(`Rank ${index + 1}:`, {
+     score: result.score,
+     document: result.document,
+     originalIndex: result.index,
+   })
+ })
+ ```
+
+ ### Notes
+
+ - **Model Requirements**: Reranking requires models with `RANK` pooling type (e.g., reranker models)
+ - **Embedding Enabled**: The context must have `embedding: true` to use rerank functionality
+ - **Automatic Sorting**: Results are returned sorted by relevance score in descending order
+ - **Document Access**: Each result includes the original document text and its index in the input array
+ - **Score Interpretation**: Higher scores indicate higher relevance to the query
+
+ ### Recommended Models
+
+ - [jinaai - jina-reranker-v2-base-multilingual-GGUF](https://huggingface.co/gpustack/jina-reranker-v2-base-multilingual-GGUF)
+ - [BAAI - bge-reranker-v2-m3-GGUF](https://huggingface.co/gpustack/bge-reranker-v2-m3-GGUF)
+ - Other models with "rerank" or "reranker" in their name and GGUF format
+
  ## Mock `llama.rn`

  We have provided a mock version of `llama.rn` for testing purpose you can use on Jest:
package/android/src/main/CMakeLists.txt CHANGED
@@ -27,12 +27,11 @@ set(
      ${RNLLAMA_LIB_DIR}/ggml-cpu/amx/mmq.cpp
      ${RNLLAMA_LIB_DIR}/ggml-cpu/ggml-cpu.c
      ${RNLLAMA_LIB_DIR}/ggml-cpu/ggml-cpu.cpp
-     ${RNLLAMA_LIB_DIR}/ggml-cpu/ggml-cpu-aarch64.cpp
-     ${RNLLAMA_LIB_DIR}/ggml-cpu/ggml-cpu-quants.c
-     ${RNLLAMA_LIB_DIR}/ggml-cpu/ggml-cpu-traits.cpp
+     ${RNLLAMA_LIB_DIR}/ggml-cpu/quants.c
+     ${RNLLAMA_LIB_DIR}/ggml-cpu/traits.cpp
+     ${RNLLAMA_LIB_DIR}/ggml-cpu/repack.cpp
      ${RNLLAMA_LIB_DIR}/ggml-cpu/unary-ops.cpp
      ${RNLLAMA_LIB_DIR}/ggml-cpu/binary-ops.cpp
-     ${RNLLAMA_LIB_DIR}/ggml-cpu/sgemm.cpp
      ${RNLLAMA_LIB_DIR}/ggml-cpu/vec.cpp
      ${RNLLAMA_LIB_DIR}/ggml-cpu/ops.cpp
      ${RNLLAMA_LIB_DIR}/ggml-opt.cpp
@@ -41,6 +40,9 @@ set(
      ${RNLLAMA_LIB_DIR}/gguf.cpp
      ${RNLLAMA_LIB_DIR}/log.cpp
      ${RNLLAMA_LIB_DIR}/llama-impl.cpp
+     ${RNLLAMA_LIB_DIR}/chat-parser.cpp
+     ${RNLLAMA_LIB_DIR}/json-partial.cpp
+     ${RNLLAMA_LIB_DIR}/regex-partial.cpp
      # Multimodal support
      ${RNLLAMA_LIB_DIR}/tools/mtmd/mtmd.cpp
      ${RNLLAMA_LIB_DIR}/tools/mtmd/mtmd-audio.cpp
@@ -52,7 +54,6 @@ set(
      ${RNLLAMA_LIB_DIR}/llama-adapter.cpp
      ${RNLLAMA_LIB_DIR}/llama-chat.cpp
      ${RNLLAMA_LIB_DIR}/llama-context.cpp
-     ${RNLLAMA_LIB_DIR}/llama-kv-cache.cpp
      ${RNLLAMA_LIB_DIR}/llama-arch.cpp
      ${RNLLAMA_LIB_DIR}/llama-batch.cpp
      ${RNLLAMA_LIB_DIR}/llama-cparams.cpp
@@ -60,6 +61,10 @@ set(
      ${RNLLAMA_LIB_DIR}/llama.cpp
      ${RNLLAMA_LIB_DIR}/llama-model.cpp
      ${RNLLAMA_LIB_DIR}/llama-model-loader.cpp
+     ${RNLLAMA_LIB_DIR}/llama-kv-cache-unified.cpp
+     ${RNLLAMA_LIB_DIR}/llama-kv-cache-unified-iswa.cpp
+     ${RNLLAMA_LIB_DIR}/llama-memory-hybrid.cpp
+     ${RNLLAMA_LIB_DIR}/llama-memory-recurrent.cpp
      ${RNLLAMA_LIB_DIR}/llama-mmap.cpp
      ${RNLLAMA_LIB_DIR}/llama-vocab.cpp
      ${RNLLAMA_LIB_DIR}/llama-memory.cpp
@@ -71,7 +76,8 @@ set(
      ${RNLLAMA_LIB_DIR}/common.cpp
      ${RNLLAMA_LIB_DIR}/chat.cpp
      ${RNLLAMA_LIB_DIR}/json-schema-to-grammar.cpp
-     ${RNLLAMA_LIB_DIR}/json.hpp
+     ${RNLLAMA_LIB_DIR}/nlohmann/json.hpp
+     ${RNLLAMA_LIB_DIR}/nlohmann/json_fwd.hpp
      ${RNLLAMA_LIB_DIR}/minja/minja.hpp
      ${RNLLAMA_LIB_DIR}/minja/chat-template.hpp
      ${RNLLAMA_LIB_DIR}/rn-llama.cpp
@@ -81,16 +87,28 @@ set(

  find_library(LOG_LIB log)

- function(build_library target_name cpu_flags)
+ function(build_library target_name arch cpu_flags)
+     if (NOT ${arch} STREQUAL "generic")
+         set(SOURCE_FILES_ARCH
+             ${RNLLAMA_LIB_DIR}/ggml-cpu/arch/${arch}/quants.c
+             ${RNLLAMA_LIB_DIR}/ggml-cpu/arch/${arch}/repack.cpp
+         )
+     endif ()
+
      add_library(
          ${target_name}
          SHARED
          ${SOURCE_FILES}
+         ${SOURCE_FILES_ARCH}
      )

      target_link_libraries(${target_name} ${LOG_LIB} android)

-     target_compile_options(${target_name} PRIVATE -DLM_GGML_USE_CPU -DLM_GGML_USE_CPU_AARCH64 -DRNLLAMA_USE_FD_FILE -pthread ${cpu_flags})
+     if (${arch} STREQUAL "generic")
+         target_compile_options(${target_name} PRIVATE -DLM_GGML_CPU_GENERIC)
+     endif ()
+
+     target_compile_options(${target_name} PRIVATE -DLM_GGML_USE_CPU -DLM_GGML_USE_CPU_REPACK -DRNLLAMA_USE_FD_FILE -pthread ${cpu_flags})

      if (${CMAKE_BUILD_TYPE} STREQUAL "Debug")
          target_compile_options(${target_name} PRIVATE -DRNLLAMA_ANDROID_ENABLE_LOGGING)
@@ -111,17 +129,17 @@ endfunction()


  # Default target (no specific CPU features)
- build_library("rnllama" "")
+ build_library("rnllama" "generic" "")

  if (${ANDROID_ABI} STREQUAL "arm64-v8a")
      # ARM64 targets
      # Removing fp16 for now as it leads to issues with some models like deepseek r1 distills
      # https://github.com/mybigday/llama.rn/pull/110#issuecomment-2609918310
-     build_library("rnllama_v8" "-march=armv8-a")
-     build_library("rnllama_v8_2" "-march=armv8.2-a")
-     build_library("rnllama_v8_2_dotprod" "-march=armv8.2-a+dotprod")
-     build_library("rnllama_v8_2_i8mm" "-march=armv8.2-a+i8mm")
-     build_library("rnllama_v8_2_dotprod_i8mm" "-march=armv8.2-a+dotprod+i8mm")
+     build_library("rnllama_v8" "arm" "-march=armv8-a")
+     build_library("rnllama_v8_2" "arm" "-march=armv8.2-a")
+     build_library("rnllama_v8_2_dotprod" "arm" "-march=armv8.2-a+dotprod")
+     build_library("rnllama_v8_2_i8mm" "arm" "-march=armv8.2-a+i8mm")
+     build_library("rnllama_v8_2_dotprod_i8mm" "arm" "-march=armv8.2-a+dotprod+i8mm")

      # https://github.com/ggerganov/llama.cpp/blob/master/docs/android.md#cross-compile-using-android-ndk
      # llama.cpp will deal with the cpu features
@@ -131,5 +149,6 @@ if (${ANDROID_ABI} STREQUAL "arm64-v8a")

  elseif (${ANDROID_ABI} STREQUAL "x86_64")
      # x86_64 target
-     build_library("rnllama_x86_64" "-march=x86-64" "-mtune=intel" "-msse4.2" "-mpopcnt")
+     build_library("rnllama_x86_64" "x86" "-march=x86-64" "-mtune=intel" "-msse4.2" "-mpopcnt")
+
  endif ()
package/android/src/main/java/com/rnllama/LlamaContext.java CHANGED
@@ -134,8 +134,6 @@ public class LlamaContext {
        modelName,
        // String chat_template,
        params.hasKey("chat_template") ? params.getString("chat_template") : "",
-       // String reasoning_format,
-       params.hasKey("reasoning_format") ? params.getString("reasoning_format") : "none",
        // boolean embedding,
        params.hasKey("embedding") ? params.getBoolean("embedding") : false,
        // int embd_normalize,
@@ -207,6 +205,7 @@ public class LlamaContext {
      String tools = params.hasKey("tools") ? params.getString("tools") : "";
      Boolean parallelToolCalls = params.hasKey("parallel_tool_calls") ? params.getBoolean("parallel_tool_calls") : false;
      String toolChoice = params.hasKey("tool_choice") ? params.getString("tool_choice") : "";
+     Boolean enableThinking = params.hasKey("enable_thinking") ? params.getBoolean("enable_thinking") : false;
      return getFormattedChatWithJinja(
        this.context,
        messages,
@@ -214,7 +213,8 @@ public class LlamaContext {
        jsonSchema,
        tools,
        parallelToolCalls,
-       toolChoice
+       toolChoice,
+       enableThinking
      );
    }

@@ -303,12 +303,25 @@ public class LlamaContext {
        }
      }

+     int[] guide_tokens = null;
+     if (params.hasKey("guide_tokens")) {
+       ReadableArray guide_tokens_array = params.getArray("guide_tokens");
+       guide_tokens = new int[guide_tokens_array.size()];
+       for (int i = 0; i < guide_tokens_array.size(); i++) {
+         guide_tokens[i] = (int) guide_tokens_array.getDouble(i);
+       }
+     }
+
      WritableMap result = doCompletion(
        this.context,
        // String prompt,
        params.getString("prompt"),
+       // int[] guide_tokens,
+       guide_tokens,
        // int chat_format,
        params.hasKey("chat_format") ? params.getInt("chat_format") : 0,
+       // String reasoning_format,
+       params.hasKey("reasoning_format") ? params.getString("reasoning_format") : "none",
        // String grammar,
        params.hasKey("grammar") ? params.getString("grammar") : "",
        // String json_schema,
@@ -319,6 +332,8 @@ public class LlamaContext {
        params.hasKey("grammar_triggers") ? params.getArray("grammar_triggers") : null,
        // ReadableArray preserved_tokens,
        params.hasKey("preserved_tokens") ? params.getArray("preserved_tokens") : null,
+       // boolean thinking_forced_open,
+       params.hasKey("thinking_forced_open") ? params.getBoolean("thinking_forced_open") : false,
        // float temperature,
        params.hasKey("temperature") ? (float) params.getDouble("temperature") : 0.7f,
        // int n_threads,
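The hunks above move `reasoning_format` from context initialization to per-completion, and add `enable_thinking` to Jinja chat formatting plus `thinking_forced_open` to completion. For orientation, here is a minimal sketch of how these might be passed from JS. The key names (`enable_thinking`, `reasoning_format`) are taken from the `params.hasKey(...)` reads above; the exact TypeScript surface lives in `package/src/index.ts` and `package/src/NativeRNLlama.ts` (changed in this diff but not shown), so treat the shape and the `'deepseek'` value as assumptions — only `'none'` appears as a default here.

```ts
// Hypothetical usage sketch — parameter names mirror the Java bridge above.
const result = await context.completion({
  messages: [{ role: 'user', content: 'Why is the sky blue?' }],
  // Asks the chat template to open a thinking block (enableThinking above)
  enable_thinking: true,
  // Now parsed per completion rather than fixed at context init ('none' is the default)
  reasoning_format: 'deepseek', // illustrative value, not confirmed by this diff
  n_predict: 256,
})
console.log(result.text)
```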
@@ -423,6 +438,27 @@ public class LlamaContext {
      return result;
    }

+   public WritableArray getRerank(String query, ReadableArray documents, ReadableMap params) {
+     if (isEmbeddingEnabled(this.context) == false) {
+       throw new IllegalStateException("Embedding is not enabled but required for reranking");
+     }
+
+     // Convert ReadableArray to Java string array
+     String[] documentsArray = new String[documents.size()];
+     for (int i = 0; i < documents.size(); i++) {
+       documentsArray[i] = documents.getString(i);
+     }
+
+     WritableArray result = rerank(
+       this.context,
+       query,
+       documentsArray,
+       // int normalize,
+       params.hasKey("normalize") ? params.getInt("normalize") : -1
+     );
+     return result;
+   }
+
    public String bench(int pp, int tg, int pl, int nr) {
      return bench(this.context, pp, tg, pl, nr);
    }
@@ -487,6 +523,34 @@ public class LlamaContext {
      releaseMultimodal(this.context);
    }

+   public boolean initVocoder(String vocoderModelPath) {
+     return initVocoder(this.context, vocoderModelPath);
+   }
+
+   public boolean isVocoderEnabled() {
+     return isVocoderEnabled(this.context);
+   }
+
+   public String getFormattedAudioCompletion(String speakerJsonStr, String textToSpeak) {
+     return getFormattedAudioCompletion(this.context, speakerJsonStr, textToSpeak);
+   }
+
+   public WritableArray getAudioCompletionGuideTokens(String textToSpeak) {
+     return getAudioCompletionGuideTokens(this.context, textToSpeak);
+   }
+
+   public WritableArray decodeAudioTokens(ReadableArray tokens) {
+     int[] toks = new int[tokens.size()];
+     for (int i = 0; i < tokens.size(); i++) {
+       toks[i] = (int) tokens.getDouble(i);
+     }
+     return decodeAudioTokens(this.context, toks);
+   }
+
+   public void releaseVocoder() {
+     releaseVocoder(this.context);
+   }
+
    public void release() {
      freeContext(context);
    }
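The vocoder surface above (together with the new `cpp/rn-tts.h` and the `guide_tokens` completion parameter) is new in this release and is not covered by the README diff. A minimal text-to-speech sketch follows, assuming the JS wrappers in `package/src/index.ts` mirror these Java method names one-to-one; the result field that carries the generated audio tokens is not visible in this diff, so it is left as a placeholder.

```ts
// Hypothetical TTS flow sketch — method names mirror the Java bridge above;
// the exact JS shapes and return fields are assumptions.
const ok = await context.initVocoder('path/to/vocoder-model.gguf')
if (ok && (await context.isVocoderEnabled())) {
  // Build the TTS prompt; the first argument is an optional speaker-config JSON string
  const prompt = await context.getFormattedAudioCompletion(null, 'Hello world')

  // Guide tokens steer sampling toward the expected audio-token sequence
  const guideTokens = await context.getAudioCompletionGuideTokens('Hello world')

  // guide_tokens is the new doCompletion parameter introduced in this version
  const result = await context.completion({ prompt, guide_tokens: guideTokens, n_predict: 512 })

  // Decode the generated audio tokens into audio samples; which field on `result`
  // carries them is not shown in this diff, so this variable is a placeholder
  const audioTokenIds: number[] = [] // ← collect token ids from the completion result
  const audio = await context.decodeAudioTokens(audioTokenIds)

  await context.releaseVocoder()
}
```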
@@ -588,7 +652,6 @@ public class LlamaContext {
    protected static native long initContext(
      String model_path,
      String chat_template,
-     String reasoning_format,
      boolean embedding,
      int embd_normalize,
      int n_ctx,
@@ -625,7 +688,8 @@ public class LlamaContext {
      String jsonSchema,
      String tools,
      boolean parallelToolCalls,
-     String toolChoice
+     String toolChoice,
+     boolean enableThinking
    );
    protected static native String getFormattedChat(
      long contextPtr,
@@ -644,12 +708,15 @@ public class LlamaContext {
    protected static native WritableMap doCompletion(
      long context_ptr,
      String prompt,
+     int[] guide_tokens,
      int chat_format,
+     String reasoning_format,
      String grammar,
      String json_schema,
      boolean grammar_lazy,
      ReadableArray grammar_triggers,
      ReadableArray preserved_tokens,
+     boolean thinking_forced_open,
      float temperature,
      int n_threads,
      int n_predict,
@@ -690,6 +757,7 @@ public class LlamaContext {
      String text,
      int embd_normalize
    );
+   protected static native WritableArray rerank(long contextPtr, String query, String[] documents, int normalize);
    protected static native String bench(long contextPtr, int pp, int tg, int pl, int nr);
    protected static native int applyLoraAdapters(long contextPtr, ReadableArray loraAdapters);
    protected static native void removeLoraAdapters(long contextPtr);
@@ -698,4 +766,10 @@ public class LlamaContext {
    protected static native void setupLog(NativeLogCallback logCallback);
    protected static native void unsetLog();
    protected static native void releaseMultimodal(long contextPtr);
+   protected static native boolean isVocoderEnabled(long contextPtr);
+   protected static native String getFormattedAudioCompletion(long contextPtr, String speakerJsonStr, String textToSpeak);
+   protected static native WritableArray getAudioCompletionGuideTokens(long contextPtr, String textToSpeak);
+   protected static native WritableArray decodeAudioTokens(long contextPtr, int[] tokens);
+   protected static native boolean initVocoder(long contextPtr, String vocoderModelPath);
+   protected static native void releaseVocoder(long contextPtr);
  }