npm - @novastera-oss/llamarn - Versions diffs - 0.3.1 → 0.4.1 - Mend

@novastera-oss/llamarn 0.3.1 → 0.4.1

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (347)

package/README.md CHANGED

@@ -18,6 +18,9 @@
 * Chat completion with templates (including Jinja template support)
 * Embeddings generation
 * Function/tool calling support
+* **Advanced thinking and reasoning support** for compatible models
+* **Flexible reasoning budget control** (unlimited, disabled, or limited)
+* **Multiple reasoning format support** (none, auto, deepseek, deepseek-legacy)
 ## What Needs Help
@@ -38,6 +41,8 @@ We welcome contributions, especially in these areas:
 3. **Tool Support**:
    * Improving tool calling functionality for complex interactions
    * Better JSON validation and error handling
+   * Enhanced thinking and reasoning model support
+   * Advanced reasoning format implementations
 4. **Testing**:
    * Automated testing using the example project
@@ -139,7 +144,10 @@ import { initLlama } from '@novastera-oss/llamarn';
 const context = await initLlama({
   model: 'path/to/model.gguf',
   n_ctx: 2048,
-  n_batch: 512
+  n_batch: 512,
+  // Optional: Enable thinking and reasoning capabilities
+  reasoning_budget: -1,  // Unlimited thinking
+  reasoning_format: 'auto'  // Automatic reasoning format detection
 });
 // Generate a completion
@@ -162,7 +170,10 @@ const context = await initLlama({
   model: 'path/to/model.gguf',
   n_ctx: 4096,
   n_batch: 512,
-  use_jinja: true  // Enable Jinja template parsing
+  use_jinja: true,  // Enable Jinja template parsing
+  // Optional: Configure thinking and reasoning
+  reasoning_budget: -1,  // Enable unlimited thinking
+  reasoning_format: 'deepseek'  // Use DeepSeek reasoning format
 });
 // Chat completion with messages
@@ -189,9 +200,47 @@ const context = await initLlama({
   model: 'path/to/model.gguf',
   n_ctx: 2048,
   n_batch: 512,
-  use_jinja: true  // Enable template handling for tool calls
+  use_jinja: true,  // Enable template handling for tool calls
+  parse_tool_calls: true,  // Enable tool call parsing (auto-enabled with use_jinja)
+  parallel_tool_calls: false  // Disable parallel tool calls for compatibility
+});
+```
+### Thinking and Reasoning Models
+For models that support reasoning and thinking, you can enable advanced thinking functionality:
+```js
+import { initLlama } from '@novastera-oss/llamarn';
+// Initialize a reasoning model with thinking capabilities
+const context = await initLlama({
+  model: 'path/to/reasoning-model.gguf',
+  n_ctx: 4096,
+  n_batch: 512,
+  use_jinja: true,
+  // Thinking and reasoning options
+  reasoning_budget: -1,           // -1 = unlimited thinking, 0 = disabled, >0 = limited
+  reasoning_format: 'deepseek',   // Use DeepSeek reasoning format
+  thinking_forced_open: true,     // Force the model to always output thinking
+  parse_tool_calls: true,         // Enable tool call parsing
+  parallel_tool_calls: false      // Disable parallel tool calls for compatibility
+});
+// Chat completion with thinking enabled
+const result = await context.completion({
+  messages: [
+    { role: 'system', content: 'You are a helpful assistant. Think through problems step by step.' },
+    { role: 'user', content: 'Solve this math problem: What is 15% of 240?' }
+  ],
+  temperature: 0.7
 });
+console.log('Response:', result.text);
+// The response may include thinking tags like <think>...</think> depending on the model
+```
 // Create a chat with tool calling
 const response = await context.completion({
   messages: [
@@ -260,6 +309,40 @@ const embeddingResponse = await context.embedding({
 console.log('Embedding:', embeddingResponse.data[0].embedding);
 ```
+## Advanced Configuration Options
+### Thinking and Reasoning Parameters
+The library supports advanced thinking and reasoning capabilities for models that support them:
+- **`reasoning_budget`**: Controls the amount of thinking allowed
+  - `-1`: Unlimited thinking (default)
+  - `0`: Disabled thinking
+  - `>0`: Limited thinking with the specified budget
+- **`reasoning_format`**: Controls how thinking is parsed and returned
+  - `'none'`: Leave thoughts unparsed in message content
+  - `'auto'`: Same as deepseek (default)
+  - `'deepseek'`: Extract thinking into `message.reasoning_content`
+  - `'deepseek-legacy'`: Extract thinking with streaming behavior
+- **`thinking_forced_open`**: Forces reasoning models to always output thinking
+  - `false`: Normal thinking behavior (default)
+  - `true`: Always include thinking tags in output
+- **`parse_tool_calls`**: Enables tool call parsing
+  - `true`: Parse and extract tool calls (default)
+  - `false`: Disable tool call parsing
+  - **Note**: Automatically enabled when `use_jinja` is true
+- **`parallel_tool_calls`**: Enables multiple tool calls in a single response
+  - `false`: Single tool calls only (default, for compatibility)
+  - `true`: Allow parallel tool calls (only supported by some models)
+### Automatic Tool Call Enhancement
+When `use_jinja` is enabled, `parse_tool_calls` is automatically enabled because Jinja templates provide better tool calling capabilities. This ensures optimal tool support when using advanced templates.
 ## Model Path Handling
 The module accepts different path formats depending on the platform:
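
Note on the README hunk above: the added example logs `result.text` and says the response may include `<think>...</think>` tags depending on the model. A caller that wants the final answer on its own could post-process the text along the following lines. This is a minimal sketch and not part of the package: `splitThinking` is a hypothetical helper, and the handling of output that starts mid-thought (a closing tag with no opening tag, which might occur if the opening tag is already part of the prompt) is an assumption about model behavior, not something the README states.

```js
// Hypothetical helper, not an API of @novastera-oss/llamarn.
// Assumes reasoning is delimited by <think>...</think>, as the README comment suggests.
function splitThinking(text) {
  const thoughts = [];
  let rest = text;

  // Assumption: output may begin mid-thought and contain only the closing tag.
  const firstClose = rest.indexOf('</think>');
  const firstOpen = rest.indexOf('<think>');
  if (firstClose !== -1 && (firstOpen === -1 || firstClose < firstOpen)) {
    thoughts.push(rest.slice(0, firstClose).trim());
    rest = rest.slice(firstClose + '</think>'.length);
  }

  // Collect every closed <think>...</think> block and drop it from the answer.
  rest = rest.replace(/<think>([\s\S]*?)<\/think>/g, (_match, inner) => {
    thoughts.push(inner.trim());
    return '';
  });

  // An unclosed <think> (for example, truncated output) counts as reasoning to the end.
  const open = rest.indexOf('<think>');
  if (open !== -1) {
    thoughts.push(rest.slice(open + '<think>'.length).trim());
    rest = rest.slice(0, open);
  }

  return { reasoning: thoughts.filter(Boolean).join('\n'), answer: rest.trim() };
}

const sample = '<think>15% is 0.15; 0.15 * 240 = 36.</think>15% of 240 is 36.';
console.log(splitThinking(sample));
```

According to the README text in this hunk, `reasoning_format: 'deepseek'` already extracts thinking into `message.reasoning_content`, so a helper like this would mostly matter when the format is `'none'` and thoughts stay in the message content.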

package/RNLlamaCpp.podspec CHANGED

@@ -53,7 +53,7 @@ Pod::Spec.new do |s|
   # Compiler settings
   s.pod_target_xcconfig = {
     "HEADER_SEARCH_PATHS" => "\"$(PODS_TARGET_SRCROOT)/ios/include\" \"$(PODS_TARGET_SRCROOT)/cpp\" \"$(PODS_TARGET_SRCROOT)/ios/generated/RNLlamaCppSpec\" \"$(PODS_TARGET_SRCROOT)/ios/generated\" \"$(PODS_TARGET_SRCROOT)/cpp/llama.cpp\" \"$(PODS_TARGET_SRCROOT)/cpp/llama.cpp/include\" \"$(PODS_TARGET_SRCROOT)/cpp/llama.cpp/ggml/include\" \"$(PODS_TARGET_SRCROOT)/cpp/llama.cpp/common\" \"$(PODS_TARGET_SRCROOT)/cpp/llama.cpp/vendor\" \"$(PODS_ROOT)/boost\" \"$(PODS_ROOT)/Headers/Public/React-bridging\" \"$(PODS_ROOT)/Headers/Public/React\"",
-    "OTHER_CPLUSPLUSFLAGS" => "-DFOLLY_NO_CONFIG -DFOLLY_MOBILE=1 -DFOLLY_USE_LIBCPP=1 -DLLAMA_METAL -DRCT_NEW_ARCH_ENABLED=1 -DFBJSRT_EXPORTED=1",
+    "OTHER_CPLUSPLUSFLAGS" => "-DFOLLY_NO_CONFIG -DFOLLY_MOBILE=1 -DFOLLY_USE_LIBCPP=1 -DFOLLY_CFG_NO_COROUTINES=1 -DLLAMA_METAL -DRCT_NEW_ARCH_ENABLED=1 -DFBJSRT_EXPORTED=1",
     "CLANG_CXX_LANGUAGE_STANDARD" => "c++17",
     "GCC_OPTIMIZATION_LEVEL" => "3", # Maximum optimization
     "SWIFT_OPTIMIZATION_LEVEL" => "-O",

package/android/CMakeLists.txt CHANGED

@@ -78,9 +78,17 @@ add_library(
     ${CPP_DIR}/rn-completion.cpp
 )
-# Suppress unused function warnings for llama.cpp code
-target_compile_options(common PRIVATE -Wno-unused-function)
-target_compile_options(RNLlamaCpp PRIVATE -Wno-unused-function)
+# Suppress additional warnings that are treated as errors in Expo SDK 54
+target_compile_options(common PRIVATE )
+# Use React Native's compile options function for proper C++ flags and RN_SERIALIZABLE_STATE
+if(ReactAndroid_VERSION_MINOR GREATER_EQUAL 80)
+    # Add additional warning suppressions for RNLlamaCpp target
+    target_compile_reactnative_options(RNLlamaCpp PRIVATE)
+    target_compile_options(RNLlamaCpp PRIVATE -Wno-unused-function)
+else()
+    target_compile_options(RNLlamaCpp PRIVATE -Wno-unused-function)
+endif()
 # Check if Vulkan backend library is available
 set(VULKAN_BACKEND_AVAILABLE FALSE)

package/android/generated/jni/react/renderer/components/RNLlamaCppSpec/RNLlamaCppSpecJSI.h CHANGED

@@ -18,7 +18,7 @@ namespace facebook::react {
 #pragma mark - NativeRNLlamaCppLlamaModelParams
-template <typename P0, typename P1, typename P2, typename P3, typename P4, typename P5, typename P6, typename P7, typename P8, typename P9, typename P10, typename P11, typename P12, typename P13, typename P14, typename P15, typename P16, typename P17, typename P18, typename P19, typename P20, typename P21, typename P22, typename P23>
+template <typename P0, typename P1, typename P2, typename P3, typename P4, typename P5, typename P6, typename P7, typename P8, typename P9, typename P10, typename P11, typename P12, typename P13, typename P14, typename P15, typename P16, typename P17, typename P18, typename P19, typename P20, typename P21, typename P22, typename P23, typename P24, typename P25, typename P26, typename P27, typename P28>
 struct NativeRNLlamaCppLlamaModelParams {
   P0 model;
   P1 n_ctx;
@@ -42,10 +42,15 @@ struct NativeRNLlamaCppLlamaModelParams {
   P19 chat_template;
   P20 use_jinja;
   P21 verbose;
-  P22 lora_adapters;
-  P23 grammar;
+  P22 reasoning_budget;
+  P23 reasoning_format;
+  P24 thinking_forced_open;
+  P25 parse_tool_calls;
+  P26 parallel_tool_calls;
+  P27 lora_adapters;
+  P28 grammar;
   bool operator==(const NativeRNLlamaCppLlamaModelParams &other) const {
-    return model == other.model && n_ctx == other.n_ctx && n_batch == other.n_batch && n_ubatch == other.n_ubatch && n_threads == other.n_threads && n_keep == other.n_keep && n_gpu_layers == other.n_gpu_layers && use_mmap == other.use_mmap && use_mlock == other.use_mlock && vocab_only == other.vocab_only && embedding == other.embedding && seed == other.seed && rope_freq_base == other.rope_freq_base && rope_freq_scale == other.rope_freq_scale && yarn_ext_factor == other.yarn_ext_factor && yarn_attn_factor == other.yarn_attn_factor && yarn_beta_fast == other.yarn_beta_fast && yarn_beta_slow == other.yarn_beta_slow && logits_all == other.logits_all && chat_template == other.chat_template && use_jinja == other.use_jinja && verbose == other.verbose && lora_adapters == other.lora_adapters && grammar == other.grammar;
+    return model == other.model && n_ctx == other.n_ctx && n_batch == other.n_batch && n_ubatch == other.n_ubatch && n_threads == other.n_threads && n_keep == other.n_keep && n_gpu_layers == other.n_gpu_layers && use_mmap == other.use_mmap && use_mlock == other.use_mlock && vocab_only == other.vocab_only && embedding == other.embedding && seed == other.seed && rope_freq_base == other.rope_freq_base && rope_freq_scale == other.rope_freq_scale && yarn_ext_factor == other.yarn_ext_factor && yarn_attn_factor == other.yarn_attn_factor && yarn_beta_fast == other.yarn_beta_fast && yarn_beta_slow == other.yarn_beta_slow && logits_all == other.logits_all && chat_template == other.chat_template && use_jinja == other.use_jinja && verbose == other.verbose && reasoning_budget == other.reasoning_budget && reasoning_format == other.reasoning_format && thinking_forced_open == other.thinking_forced_open && parse_tool_calls == other.parse_tool_calls && parallel_tool_calls == other.parallel_tool_calls && lora_adapters == other.lora_adapters && grammar == other.grammar;
   }
 };
@@ -80,6 +85,11 @@ struct NativeRNLlamaCppLlamaModelParamsBridging {
       bridging::fromJs<decltype(types.chat_template)>(rt, value.getProperty(rt, "chat_template"), jsInvoker),
       bridging::fromJs<decltype(types.use_jinja)>(rt, value.getProperty(rt, "use_jinja"), jsInvoker),
       bridging::fromJs<decltype(types.verbose)>(rt, value.getProperty(rt, "verbose"), jsInvoker),
+      bridging::fromJs<decltype(types.reasoning_budget)>(rt, value.getProperty(rt, "reasoning_budget"), jsInvoker),
+      bridging::fromJs<decltype(types.reasoning_format)>(rt, value.getProperty(rt, "reasoning_format"), jsInvoker),
+      bridging::fromJs<decltype(types.thinking_forced_open)>(rt, value.getProperty(rt, "thinking_forced_open"), jsInvoker),
+      bridging::fromJs<decltype(types.parse_tool_calls)>(rt, value.getProperty(rt, "parse_tool_calls"), jsInvoker),
+      bridging::fromJs<decltype(types.parallel_tool_calls)>(rt, value.getProperty(rt, "parallel_tool_calls"), jsInvoker),
       bridging::fromJs<decltype(types.lora_adapters)>(rt, value.getProperty(rt, "lora_adapters"), jsInvoker),
       bridging::fromJs<decltype(types.grammar)>(rt, value.getProperty(rt, "grammar"), jsInvoker)};
     return result;
@@ -174,6 +184,26 @@ struct NativeRNLlamaCppLlamaModelParamsBridging {
     return bridging::toJs(rt, value);
   }
+  static double reasoning_budgetToJs(jsi::Runtime &rt, decltype(types.reasoning_budget) value) {
+    return bridging::toJs(rt, value);
+  }
+  static jsi::String reasoning_formatToJs(jsi::Runtime &rt, decltype(types.reasoning_format) value) {
+    return bridging::toJs(rt, value);
+  }
+  static bool thinking_forced_openToJs(jsi::Runtime &rt, decltype(types.thinking_forced_open) value) {
+    return bridging::toJs(rt, value);
+  }
+  static bool parse_tool_callsToJs(jsi::Runtime &rt, decltype(types.parse_tool_calls) value) {
+    return bridging::toJs(rt, value);
+  }
+  static bool parallel_tool_callsToJs(jsi::Runtime &rt, decltype(types.parallel_tool_calls) value) {
+    return bridging::toJs(rt, value);
+  }
   static jsi::Array lora_adaptersToJs(jsi::Runtime &rt, decltype(types.lora_adapters) value) {
     return bridging::toJs(rt, value);
   }
@@ -252,6 +282,21 @@ struct NativeRNLlamaCppLlamaModelParamsBridging {
     if (value.verbose) {
       result.setProperty(rt, "verbose", bridging::toJs(rt, value.verbose.value(), jsInvoker));
     }
+    if (value.reasoning_budget) {
+      result.setProperty(rt, "reasoning_budget", bridging::toJs(rt, value.reasoning_budget.value(), jsInvoker));
+    }
+    if (value.reasoning_format) {
+      result.setProperty(rt, "reasoning_format", bridging::toJs(rt, value.reasoning_format.value(), jsInvoker));
+    }
+    if (value.thinking_forced_open) {
+      result.setProperty(rt, "thinking_forced_open", bridging::toJs(rt, value.thinking_forced_open.value(), jsInvoker));
+    }
+    if (value.parse_tool_calls) {
+      result.setProperty(rt, "parse_tool_calls", bridging::toJs(rt, value.parse_tool_calls.value(), jsInvoker));
+    }
+    if (value.parallel_tool_calls) {
+      result.setProperty(rt, "parallel_tool_calls", bridging::toJs(rt, value.parallel_tool_calls.value(), jsInvoker));
+    }
     if (value.lora_adapters) {
       result.setProperty(rt, "lora_adapters", bridging::toJs(rt, value.lora_adapters.value(), jsInvoker));
     }
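
Note on the hunks above: the five new fields (`reasoning_budget`, `reasoning_format`, `thinking_forced_open`, `parse_tool_calls`, `parallel_tool_calls`) are bridged as optional values, so a caller may omit any of them. The sketch below shows how the defaults documented in the README part of this diff could be resolved on the JavaScript side. `resolveReasoningOptions` is a hypothetical function, not part of the package, and two points are assumptions: that `use_jinja: true` overrides an explicit `parse_tool_calls: false`, and that a budget below `-1` should be rejected.

```js
// Hypothetical sketch; the native module may resolve defaults differently.
const REASONING_FORMATS = ['none', 'auto', 'deepseek', 'deepseek-legacy'];

function resolveReasoningOptions(params = {}) {
  const format = params.reasoning_format ?? 'auto'; // README: 'auto' is the default
  if (!REASONING_FORMATS.includes(format)) {
    throw new Error(`Unknown reasoning_format: ${format}`);
  }

  const budget = params.reasoning_budget ?? -1; // README: -1 = unlimited (default)
  if (typeof budget !== 'number' || Number.isNaN(budget) || budget < -1) {
    // Assumption: values below -1 have no documented meaning.
    throw new RangeError(`Invalid reasoning_budget: ${budget}`);
  }

  return {
    reasoning_budget: budget,
    reasoning_format: format,
    thinking_forced_open: params.thinking_forced_open ?? false,
    // README: parse_tool_calls is automatically enabled when use_jinja is true.
    parse_tool_calls: params.use_jinja === true ? true : (params.parse_tool_calls ?? true),
    parallel_tool_calls: params.parallel_tool_calls ?? false,
  };
}

console.log(resolveReasoningOptions({ use_jinja: true, reasoning_format: 'deepseek' }));
```

Using `??` instead of `||` keeps an explicit `reasoning_budget: 0` (thinking disabled) from being replaced by the default.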

package/android/src/main/cpp/include/llama.h CHANGED

@@ -64,8 +64,6 @@ extern "C" {
     typedef struct llama_memory_i * llama_memory_t;
-    struct llama_kv_cache; // DEPRECATED (use llama_memory instead)
     typedef int32_t llama_pos;
     typedef int32_t llama_token;
     typedef int32_t llama_seq_id;
@@ -152,6 +150,7 @@ extern "C" {
         //LLAMA_FTYPE_MOSTLY_Q4_0_8_8      = 35, // removed from gguf files, use Q4_0 and runtime repack
         LLAMA_FTYPE_MOSTLY_TQ1_0         = 36, // except 1d tensors
         LLAMA_FTYPE_MOSTLY_TQ2_0         = 37, // except 1d tensors
+        LLAMA_FTYPE_MOSTLY_MXFP4_MOE     = 38, // except 1d tensors
         LLAMA_FTYPE_GUESSED = 1024, // not specified in the model file
     };
@@ -284,10 +283,11 @@ extern "C" {
         const struct llama_model_kv_override * kv_overrides;
         // Keep the booleans together to avoid misalignment during copy-by-value.
-        bool vocab_only;    // only load the vocabulary, no weights
-        bool use_mmap;      // use mmap if possible
-        bool use_mlock;     // force system to keep model in RAM
-        bool check_tensors; // validate model tensor data
+        bool vocab_only;      // only load the vocabulary, no weights
+        bool use_mmap;        // use mmap if possible
+        bool use_mlock;       // force system to keep model in RAM
+        bool check_tensors;   // validate model tensor data
+        bool use_extra_bufts; // use extra buffer types (used for weight repacking)
     };
     // NOTE: changing the default values of parameters marked as [EXPERIMENTAL] may cause crashes or incorrect results in certain configurations
@@ -312,7 +312,7 @@ extern "C" {
         float    yarn_beta_fast;   // YaRN low correction dim
         float    yarn_beta_slow;   // YaRN high correction dim
         uint32_t yarn_orig_ctx;    // YaRN original context size
-        float    defrag_thold;     // defragment the KV cache if holes/size > thold, <= 0 disabled (default)
+        float    defrag_thold;     // [DEPRECATED] defragment the KV cache if holes/size > thold, <= 0 disabled (default)
         ggml_backend_sched_eval_callback cb_eval;
         void * cb_eval_user_data;
@@ -467,8 +467,6 @@ extern "C" {
     LLAMA_API           llama_memory_t   llama_get_memory  (const struct llama_context * ctx);
     LLAMA_API  enum llama_pooling_type   llama_pooling_type(const struct llama_context * ctx); // TODO: rename to llama_get_pooling_type
-    DEPRECATED(LLAMA_API struct llama_kv_cache * llama_get_kv_self(struct llama_context * ctx), "use llama_get_memory instead");
     LLAMA_API const struct llama_vocab * llama_model_get_vocab(const struct llama_model * model);
     LLAMA_API enum llama_rope_type       llama_model_rope_type(const struct llama_model * model);
@@ -537,6 +535,9 @@ extern "C" {
     // Returns true if the model is recurrent (like Mamba, RWKV, etc.)
     LLAMA_API bool llama_model_is_recurrent(const struct llama_model * model);
+    // Returns true if the model is diffusion-based (like LLaDA, Dream, etc.)
+    LLAMA_API bool llama_model_is_diffusion(const struct llama_model * model);
     // Returns 0 on success
     LLAMA_API uint32_t llama_model_quantize(
             const char * fname_inp,
@@ -552,6 +553,24 @@ extern "C" {
             struct llama_model * model,
             const char * path_lora);
+    // Functions to access the adapter's GGUF metadata scalar values
+    // - The functions return the length of the string on success, or -1 on failure
+    // - The output string is always null-terminated and cleared on failure
+    // - When retrieving a string, an extra byte must be allocated to account for the null terminator
+    // - GGUF array values are not supported by these functions
+    // Get metadata value as a string by key name
+    LLAMA_API int32_t llama_adapter_meta_val_str(const struct llama_adapter_lora * adapter, const char * key, char * buf, size_t buf_size);
+    // Get the number of metadata key/value pairs
+    LLAMA_API int32_t llama_adapter_meta_count(const struct llama_adapter_lora * adapter);
+    // Get metadata key name by index
+    LLAMA_API int32_t llama_adapter_meta_key_by_index(const struct llama_adapter_lora * adapter, int32_t i, char * buf, size_t buf_size);
+    // Get metadata value as a string by index
+    LLAMA_API int32_t llama_adapter_meta_val_str_by_index(const struct llama_adapter_lora * adapter, int32_t i, char * buf, size_t buf_size);
     // Manually free a LoRA adapter
     // Note: loaded adapters will be free when the associated model is deleted
     LLAMA_API void llama_adapter_lora_free(struct llama_adapter_lora * adapter);
@@ -662,111 +681,6 @@ extern "C" {
     // Check if the memory supports shifting
     LLAMA_API bool llama_memory_can_shift(llama_memory_t mem);
-    //
-    // KV cache for self-attention (TODO: deprecate in favor of llama_memory)
-    //
-    // Returns the number of tokens in the KV cache (slow, use only for debug)
-    // If a KV cell has multiple sequences assigned to it, it will be counted multiple times
-    DEPRECATED(LLAMA_API int32_t llama_kv_self_n_tokens(const struct llama_context * ctx),
-               "Use llama_kv_self_seq_pos_max() and llama_kv_self_seq_pos_min() instead (https://github.com/ggml-org/llama.cpp/issues/13793)");
-    // Returns the number of used KV cells (i.e. have at least one sequence assigned to them)
-    DEPRECATED(LLAMA_API int32_t llama_kv_self_used_cells(const struct llama_context * ctx),
-               "Use llama_kv_self_seq_pos_max() and llama_kv_self_seq_pos_min() instead (https://github.com/ggml-org/llama.cpp/issues/13793)");
-    // Clear the KV cache - both cell info is erased and KV data is zeroed
-    DEPRECATED(LLAMA_API void llama_kv_self_clear(
-                struct llama_context * ctx),
-            "Use llama_memory_clear() instead");
-    // Removes all tokens that belong to the specified sequence and have positions in [p0, p1)
-    // Returns false if a partial sequence cannot be removed. Removing a whole sequence never fails
-    // seq_id < 0 : match any sequence
-    // p0 < 0     : [0,  p1]
-    // p1 < 0     : [p0, inf)
-    DEPRECATED(LLAMA_API bool llama_kv_self_seq_rm(
-            struct llama_context * ctx,
-                    llama_seq_id   seq_id,
-                       llama_pos   p0,
-                       llama_pos   p1),
-            "Use llama_memory_seq_rm() instead");
-    // Copy all tokens that belong to the specified sequence to another sequence
-    // Note that this does not allocate extra KV cache memory - it simply assigns the tokens to the new sequence
-    // p0 < 0 : [0,  p1]
-    // p1 < 0 : [p0, inf)
-    DEPRECATED(LLAMA_API void llama_kv_self_seq_cp(
-            struct llama_context * ctx,
-                    llama_seq_id   seq_id_src,
-                    llama_seq_id   seq_id_dst,
-                       llama_pos   p0,
-                       llama_pos   p1),
-            "Use llama_memory_seq_cp() instead");
-    // Removes all tokens that do not belong to the specified sequence
-    DEPRECATED(LLAMA_API void llama_kv_self_seq_keep(
-            struct llama_context * ctx,
-                    llama_seq_id   seq_id),
-            "Use llama_memory_seq_keep() instead");
-    // Adds relative position "delta" to all tokens that belong to the specified sequence and have positions in [p0, p1)
-    // If the KV cache is RoPEd, the KV data is updated accordingly:
-    //   - lazily on next llama_decode()
-    // p0 < 0 : [0,  p1]
-    // p1 < 0 : [p0, inf)
-    DEPRECATED(LLAMA_API void llama_kv_self_seq_add(
-            struct llama_context * ctx,
-                    llama_seq_id   seq_id,
-                       llama_pos   p0,
-                       llama_pos   p1,
-                       llama_pos   delta),
-            "Use llama_memory_seq_add() instead");
-    // Integer division of the positions by factor of `d > 1`
-    // If the KV cache is RoPEd, the KV data is updated accordingly:
-    //   - lazily on next llama_decode()
-    // p0 < 0 : [0,  p1]
-    // p1 < 0 : [p0, inf)
-    DEPRECATED(LLAMA_API void llama_kv_self_seq_div(
-            struct llama_context * ctx,
-                    llama_seq_id   seq_id,
-                       llama_pos   p0,
-                       llama_pos   p1,
-                             int   d),
-            "Use llama_memory_seq_div() instead");
-    // Returns the smallest position present in the KV cache for the specified sequence
-    // This is typically non-zero only for SWA caches
-    // Note that all positions in the range [pos_min, pos_max] are guaranteed to be present in the KV cache
-    // Return -1 if the sequence is empty
-    DEPRECATED(LLAMA_API llama_pos llama_kv_self_seq_pos_min(
-            struct llama_context * ctx,
-                    llama_seq_id   seq_id),
-            "Use llama_memory_seq_pos_min() instead");
-    // Returns the largest position present in the KV cache for the specified sequence
-    // Note that all positions in the range [pos_min, pos_max] are guaranteed to be present in the KV cache
-    // Return -1 if the sequence is empty
-    DEPRECATED(LLAMA_API llama_pos llama_kv_self_seq_pos_max(
-            struct llama_context * ctx,
-                    llama_seq_id   seq_id),
-            "Use llama_memory_seq_pos_max() instead");
-    // Defragment the KV cache
-    // This will be applied:
-    //   - lazily on next llama_decode()
-    DEPRECATED(LLAMA_API void llama_kv_self_defrag(struct llama_context * ctx),
-            "simply remove this call, the context will automatically decide when to do a defragmentation based on 'defrag_thold'");
-    // Check if the context supports KV cache shifting
-    DEPRECATED(LLAMA_API bool llama_kv_self_can_shift(const struct llama_context * ctx),
-            "use llama_memory_can_shift() instead");
-    // Apply the KV cache updates (such as K-shifts, defragmentation, etc.)
-    DEPRECATED(LLAMA_API void llama_kv_self_update(struct llama_context * ctx),
-            "simply remove this call, updates are applied lazily on the next llama_decode()");
     //
     // State / sessions
     //
@@ -865,6 +779,29 @@ extern "C" {
                           size_t   n_token_capacity,
                           size_t * n_token_count_out);
+#define LLAMA_STATE_SEQ_FLAGS_SWA_ONLY 1
+    typedef uint32_t llama_state_seq_flags;
+    LLAMA_API size_t llama_state_seq_get_size_ext(
+            struct llama_context * ctx,
+                    llama_seq_id   seq_id,
+           llama_state_seq_flags   flags);
+    LLAMA_API size_t llama_state_seq_get_data_ext(
+            struct llama_context * ctx,
+                         uint8_t * dst,
+                          size_t   size,
+                    llama_seq_id   seq_id,
+           llama_state_seq_flags   flags);
+    LLAMA_API size_t llama_state_seq_set_data_ext(
+            struct llama_context * ctx,
+                   const uint8_t * src,
+                          size_t   size,
+                    llama_seq_id   dest_seq_id,
+           llama_state_seq_flags   flags);
     //
     // Decoding
     //
@@ -1432,6 +1369,8 @@ extern "C" {
         ggml_opt_get_optimizer_params get_opt_pars; // callback for calculating optimizer parameters
         void * get_opt_pars_ud;                     // userdata for calculating optimizer parameters
+        enum ggml_opt_optimizer_type optimizer_type;
     };
     LLAMA_API void llama_opt_init(struct llama_context * lctx, struct llama_model * model, struct llama_opt_params lopt_params);
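
Note on the `llama.h` hunks above: this version drops `llama_get_kv_self` and the whole `llama_kv_self_*` family, so native code that still calls them will no longer compile against the bundled header. The replacement names below are taken from the deprecation messages in the removed lines. The scanner itself (`findRemovedCalls`) is a hypothetical helper sketched for illustration, not a tool shipped with the package or with llama.cpp.

```js
// Replacement names come from the DEPRECATED(...) messages removed in this diff.
const REMOVED_KV_SELF_API = {
  llama_get_kv_self: 'llama_get_memory',
  // The header pointed these two at llama_kv_self_seq_pos_min/max, which are removed as well.
  llama_kv_self_n_tokens: 'llama_memory_seq_pos_min / llama_memory_seq_pos_max',
  llama_kv_self_used_cells: 'llama_memory_seq_pos_min / llama_memory_seq_pos_max',
  llama_kv_self_clear: 'llama_memory_clear',
  llama_kv_self_seq_rm: 'llama_memory_seq_rm',
  llama_kv_self_seq_cp: 'llama_memory_seq_cp',
  llama_kv_self_seq_keep: 'llama_memory_seq_keep',
  llama_kv_self_seq_add: 'llama_memory_seq_add',
  llama_kv_self_seq_div: 'llama_memory_seq_div',
  llama_kv_self_seq_pos_min: 'llama_memory_seq_pos_min',
  llama_kv_self_seq_pos_max: 'llama_memory_seq_pos_max',
  llama_kv_self_can_shift: 'llama_memory_can_shift',
  llama_kv_self_defrag: '(remove the call)',
  llama_kv_self_update: '(remove the call)',
};

// Report every line of C/C++ source text that calls a removed function.
function findRemovedCalls(source) {
  const hits = [];
  source.split('\n').forEach((line, index) => {
    for (const [name, replacement] of Object.entries(REMOVED_KV_SELF_API)) {
      // \b plus the trailing "(" makes sure only the exact function name matches.
      if (new RegExp(`\\b${name}\\s*\\(`).test(line)) {
        hits.push({ line: index + 1, name, replacement });
      }
    }
  });
  return hits;
}

const legacy = [
  'llama_kv_self_clear(ctx);',
  'int n = 0;',
  'llama_kv_self_seq_rm(ctx, 0, -1, -1);',
].join('\n');
console.log(findRemovedCalls(legacy));
```

Per the declarations kept in the header, the `llama_memory_*` functions take a `llama_memory_t` obtained from `llama_get_memory(ctx)` rather than the context itself, so a migration changes the first argument as well as the function name.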

package/android/src/main/jniLibs/arm64-v8a/libggml-base.so CHANGED

Binary file

package/android/src/main/jniLibs/arm64-v8a/libggml-cpu.so CHANGED

Binary file

package/android/src/main/jniLibs/arm64-v8a/libggml.so CHANGED

Binary file

package/android/src/main/jniLibs/arm64-v8a/libllama.so CHANGED

Binary file

package/android/src/main/jniLibs/armeabi-v7a/libggml-base.so CHANGED

Binary file

package/android/src/main/jniLibs/armeabi-v7a/libggml-cpu.so CHANGED

Binary file

package/android/src/main/jniLibs/armeabi-v7a/libggml.so CHANGED

Binary file

package/android/src/main/jniLibs/armeabi-v7a/libllama.so CHANGED

Binary file

package/android/src/main/jniLibs/x86/libggml-base.so CHANGED

Binary file

package/android/src/main/jniLibs/x86/libggml-cpu.so CHANGED

Binary file

package/android/src/main/jniLibs/x86/libggml.so CHANGED

Binary file

package/android/src/main/jniLibs/x86/libllama.so CHANGED

Binary file

package/android/src/main/jniLibs/x86_64/libggml-base.so CHANGED

Binary file

package/android/src/main/jniLibs/x86_64/libggml-cpu.so CHANGED

Binary file

package/android/src/main/jniLibs/x86_64/libggml.so CHANGED

Binary file

package/android/src/main/jniLibs/x86_64/libllama.so CHANGED

Binary file

package/cpp/LlamaCppModel.cpp CHANGED

@@ -948,16 +948,8 @@ jsi::Value LlamaCppModel::embeddingJsi(jsi::Runtime& rt, const jsi::Value* args,
       throw std::runtime_error("Invalid embedding dimension");
     }
-    // For OpenAI compatibility, default to mean pooling
-    enum llama_pooling_type pooling_type = LLAMA_POOLING_TYPE_MEAN;
-    if (options.hasProperty(rt, "pooling") && options.getProperty(rt, "pooling").isString()) {
-      std::string pooling = options.getProperty(rt, "pooling").getString(rt).utf8(rt);
-      if (pooling == "last") {
-        pooling_type = LLAMA_POOLING_TYPE_LAST;
-      } else if (pooling == "cls" || pooling == "first") {
-        pooling_type = LLAMA_POOLING_TYPE_CLS;
-      }
-    }
+    // Note: Pooling is handled automatically by llama_get_embeddings()
+    // The function returns the appropriate embedding based on the model's configuration
     // Get the embeddings
     std::vector<float> embedding_vec(n_embd);