cortex-llm 1.0.7__tar.gz → 1.0.9__tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (58)
  1. {cortex_llm-1.0.7 → cortex_llm-1.0.9}/PKG-INFO +5 -1
  2. {cortex_llm-1.0.7 → cortex_llm-1.0.9}/README.md +4 -0
  3. {cortex_llm-1.0.7 → cortex_llm-1.0.9}/cortex/__init__.py +1 -1
  4. {cortex_llm-1.0.7 → cortex_llm-1.0.9}/cortex/inference_engine.py +48 -8
  5. {cortex_llm-1.0.7 → cortex_llm-1.0.9}/cortex/metal/mlx_converter.py +105 -4
  6. {cortex_llm-1.0.7 → cortex_llm-1.0.9}/cortex/ui/cli.py +231 -124
  7. {cortex_llm-1.0.7 → cortex_llm-1.0.9}/cortex/ui/markdown_render.py +9 -0
  8. {cortex_llm-1.0.7 → cortex_llm-1.0.9}/cortex_llm.egg-info/PKG-INFO +5 -1
  9. {cortex_llm-1.0.7 → cortex_llm-1.0.9}/pyproject.toml +1 -1
  10. {cortex_llm-1.0.7 → cortex_llm-1.0.9}/setup.py +1 -1
  11. {cortex_llm-1.0.7 → cortex_llm-1.0.9}/LICENSE +0 -0
  12. {cortex_llm-1.0.7 → cortex_llm-1.0.9}/cortex/__main__.py +0 -0
  13. {cortex_llm-1.0.7 → cortex_llm-1.0.9}/cortex/config.py +0 -0
  14. {cortex_llm-1.0.7 → cortex_llm-1.0.9}/cortex/conversation_manager.py +0 -0
  15. {cortex_llm-1.0.7 → cortex_llm-1.0.9}/cortex/fine_tuning/__init__.py +0 -0
  16. {cortex_llm-1.0.7 → cortex_llm-1.0.9}/cortex/fine_tuning/dataset.py +0 -0
  17. {cortex_llm-1.0.7 → cortex_llm-1.0.9}/cortex/fine_tuning/mlx_lora_trainer.py +0 -0
  18. {cortex_llm-1.0.7 → cortex_llm-1.0.9}/cortex/fine_tuning/trainer.py +0 -0
  19. {cortex_llm-1.0.7 → cortex_llm-1.0.9}/cortex/fine_tuning/wizard.py +0 -0
  20. {cortex_llm-1.0.7 → cortex_llm-1.0.9}/cortex/gpu_validator.py +0 -0
  21. {cortex_llm-1.0.7 → cortex_llm-1.0.9}/cortex/metal/__init__.py +0 -0
  22. {cortex_llm-1.0.7 → cortex_llm-1.0.9}/cortex/metal/gpu_validator.py +0 -0
  23. {cortex_llm-1.0.7 → cortex_llm-1.0.9}/cortex/metal/memory_pool.py +0 -0
  24. {cortex_llm-1.0.7 → cortex_llm-1.0.9}/cortex/metal/mlx_accelerator.py +0 -0
  25. {cortex_llm-1.0.7 → cortex_llm-1.0.9}/cortex/metal/mlx_compat.py +0 -0
  26. {cortex_llm-1.0.7 → cortex_llm-1.0.9}/cortex/metal/mps_optimizer.py +0 -0
  27. {cortex_llm-1.0.7 → cortex_llm-1.0.9}/cortex/metal/optimizer.py +0 -0
  28. {cortex_llm-1.0.7 → cortex_llm-1.0.9}/cortex/metal/performance_profiler.py +0 -0
  29. {cortex_llm-1.0.7 → cortex_llm-1.0.9}/cortex/model_downloader.py +0 -0
  30. {cortex_llm-1.0.7 → cortex_llm-1.0.9}/cortex/model_manager.py +0 -0
  31. {cortex_llm-1.0.7 → cortex_llm-1.0.9}/cortex/quantization/__init__.py +0 -0
  32. {cortex_llm-1.0.7 → cortex_llm-1.0.9}/cortex/quantization/dynamic_quantizer.py +0 -0
  33. {cortex_llm-1.0.7 → cortex_llm-1.0.9}/cortex/template_registry/__init__.py +0 -0
  34. {cortex_llm-1.0.7 → cortex_llm-1.0.9}/cortex/template_registry/auto_detector.py +0 -0
  35. {cortex_llm-1.0.7 → cortex_llm-1.0.9}/cortex/template_registry/config_manager.py +0 -0
  36. {cortex_llm-1.0.7 → cortex_llm-1.0.9}/cortex/template_registry/interactive.py +0 -0
  37. {cortex_llm-1.0.7 → cortex_llm-1.0.9}/cortex/template_registry/registry.py +0 -0
  38. {cortex_llm-1.0.7 → cortex_llm-1.0.9}/cortex/template_registry/template_profiles/__init__.py +0 -0
  39. {cortex_llm-1.0.7 → cortex_llm-1.0.9}/cortex/template_registry/template_profiles/base.py +0 -0
  40. {cortex_llm-1.0.7 → cortex_llm-1.0.9}/cortex/template_registry/template_profiles/complex/__init__.py +0 -0
  41. {cortex_llm-1.0.7 → cortex_llm-1.0.9}/cortex/template_registry/template_profiles/complex/reasoning.py +0 -0
  42. {cortex_llm-1.0.7 → cortex_llm-1.0.9}/cortex/template_registry/template_profiles/standard/__init__.py +0 -0
  43. {cortex_llm-1.0.7 → cortex_llm-1.0.9}/cortex/template_registry/template_profiles/standard/alpaca.py +0 -0
  44. {cortex_llm-1.0.7 → cortex_llm-1.0.9}/cortex/template_registry/template_profiles/standard/chatml.py +0 -0
  45. {cortex_llm-1.0.7 → cortex_llm-1.0.9}/cortex/template_registry/template_profiles/standard/gemma.py +0 -0
  46. {cortex_llm-1.0.7 → cortex_llm-1.0.9}/cortex/template_registry/template_profiles/standard/llama.py +0 -0
  47. {cortex_llm-1.0.7 → cortex_llm-1.0.9}/cortex/template_registry/template_profiles/standard/simple.py +0 -0
  48. {cortex_llm-1.0.7 → cortex_llm-1.0.9}/cortex/ui/__init__.py +0 -0
  49. {cortex_llm-1.0.7 → cortex_llm-1.0.9}/cortex/ui/terminal_app.py +0 -0
  50. {cortex_llm-1.0.7 → cortex_llm-1.0.9}/cortex_llm.egg-info/SOURCES.txt +0 -0
  51. {cortex_llm-1.0.7 → cortex_llm-1.0.9}/cortex_llm.egg-info/dependency_links.txt +0 -0
  52. {cortex_llm-1.0.7 → cortex_llm-1.0.9}/cortex_llm.egg-info/entry_points.txt +0 -0
  53. {cortex_llm-1.0.7 → cortex_llm-1.0.9}/cortex_llm.egg-info/not-zip-safe +0 -0
  54. {cortex_llm-1.0.7 → cortex_llm-1.0.9}/cortex_llm.egg-info/requires.txt +0 -0
  55. {cortex_llm-1.0.7 → cortex_llm-1.0.9}/cortex_llm.egg-info/top_level.txt +0 -0
  56. {cortex_llm-1.0.7 → cortex_llm-1.0.9}/setup.cfg +0 -0
  57. {cortex_llm-1.0.7 → cortex_llm-1.0.9}/tests/test_apple_silicon.py +0 -0
  58. {cortex_llm-1.0.7 → cortex_llm-1.0.9}/tests/test_metal_optimization.py +0 -0

{cortex_llm-1.0.7 → cortex_llm-1.0.9}/PKG-INFO
@@ -1,6 +1,6 @@
  Metadata-Version: 2.4
  Name: cortex-llm
- Version: 1.0.7
+ Version: 1.0.9
  Summary: GPU-Accelerated LLM Terminal for Apple Silicon
  Home-page: https://github.com/faisalmumtaz/Cortex
  Author: Cortex Development Team
@@ -131,6 +131,10 @@ Cortex supports:
  - `docs/template-registry.md`
  - **Inference engine details** and backend behavior
  - `docs/inference-engine.md`
+ - **Tooling (experimental, WIP)** for repo-scoped read/search and optional file edits with explicit confirmation
+ - `docs/cli.md`
+
+ **Important (Work in Progress):** Tooling is actively evolving and should be considered experimental. Behavior, output format, and available actions may change; tool calls can fail; and UI presentation may be adjusted. Use tooling on non-critical work first, and always review any proposed file changes before approving them.

  ## Configuration

{cortex_llm-1.0.7 → cortex_llm-1.0.9}/README.md
@@ -73,6 +73,10 @@ Cortex supports:
  - `docs/template-registry.md`
  - **Inference engine details** and backend behavior
  - `docs/inference-engine.md`
+ - **Tooling (experimental, WIP)** for repo-scoped read/search and optional file edits with explicit confirmation
+ - `docs/cli.md`
+
+ **Important (Work in Progress):** Tooling is actively evolving and should be considered experimental. Behavior, output format, and available actions may change; tool calls can fail; and UI presentation may be adjusted. Use tooling on non-critical work first, and always review any proposed file changes before approving them.

  ## Configuration

{cortex_llm-1.0.7 → cortex_llm-1.0.9}/cortex/__init__.py
@@ -5,7 +5,7 @@ A high-performance terminal interface for running Hugging Face LLMs locally
  with exclusive GPU acceleration via Metal Performance Shaders (MPS) and MLX.
  """

- __version__ = "1.0.7"
+ __version__ = "1.0.9"
  __author__ = "Cortex Development Team"
  __license__ = "MIT"

{cortex_llm-1.0.7 → cortex_llm-1.0.9}/cortex/inference_engine.py
@@ -243,6 +243,33 @@ class InferenceEngine:
  tokens_generated = 0
  first_token_time = None
  last_metrics_update = time.time()
+ stream_total_text = ""
+ stream_cumulative = False
+
+ def normalize_stream_chunk(chunk: Any) -> str:
+ """Normalize streaming output to delta chunks when backend yields cumulative text."""
+ nonlocal stream_total_text, stream_cumulative
+ if chunk is None:
+ return ""
+ if not isinstance(chunk, str):
+ chunk = str(chunk)
+
+ if stream_cumulative:
+ if chunk.startswith(stream_total_text):
+ delta = chunk[len(stream_total_text):]
+ stream_total_text = chunk
+ return delta
+ stream_total_text += chunk
+ return chunk
+
+ if stream_total_text and len(chunk) > len(stream_total_text) and chunk.startswith(stream_total_text):
+ stream_cumulative = True
+ delta = chunk[len(stream_total_text):]
+ stream_total_text = chunk
+ return delta
+
+ stream_total_text += chunk
+ return chunk

  try:
  # Use MLX accelerator's optimized generation if available
@@ -262,10 +289,14 @@ class InferenceEngine:
  if self._cancel_event.is_set():
  self.status = InferenceStatus.CANCELLED
  break
-
+
+ delta = normalize_stream_chunk(token) if request.stream else str(token)
+ if not delta:
+ continue
+
  if first_token_time is None:
  first_token_time = time.time() - start_time
-
+
  tokens_generated += 1

  # Update metrics less frequently
@@ -284,13 +315,18 @@ class InferenceEngine:
  last_metrics_update = current_time

  # Token is already a string from generate_optimized
- yield token
+ yield delta

  if any(stop in token for stop in request.stop_sequences):
  break
  elif mlx_generate:
  # Fallback to standard MLX generation
- logger.info("Using standard MLX generation")
+ if request.stream and mlx_stream_generate:
+ logger.info("Using MLX streaming generation")
+ generate_fn = mlx_stream_generate
+ else:
+ logger.info("Using standard MLX generation")
+ generate_fn = mlx_generate

  # Import sample_utils for creating sampler
  try:
@@ -314,7 +350,7 @@ class InferenceEngine:
  if request.seed is not None and request.seed >= 0:
  mx.random.seed(request.seed)

- for response in mlx_generate(
+ for response in generate_fn(
  model,
  tokenizer,
  **generation_kwargs
@@ -328,10 +364,14 @@ class InferenceEngine:
  token = response.text
  else:
  token = str(response)
-
+
+ delta = normalize_stream_chunk(token) if request.stream else token
+ if request.stream and not delta:
+ continue
+
  if first_token_time is None:
  first_token_time = time.time() - start_time
-
+
  tokens_generated += 1

  # Update metrics less frequently to reduce overhead
@@ -352,7 +392,7 @@ class InferenceEngine:
  )
  last_metrics_update = current_time

- yield token
+ yield delta

  if any(stop in token for stop in request.stop_sequences):
  break
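
The inference_engine.py changes above make the engine tolerant of backends that stream either incremental deltas or the full cumulative text on every step: `normalize_stream_chunk` detects the cumulative case and yields only the new suffix. A minimal standalone sketch of that detection heuristic (illustrative only; the real helper keeps its state in `nonlocal` variables inside `generate`):

```python
# Standalone sketch of the cumulative-vs-delta detection used above. State
# lives on a small class here instead of the nonlocal variables used inside
# InferenceEngine.generate; all names are illustrative.
class StreamNormalizer:
    def __init__(self) -> None:
        self.total = ""          # all text seen so far
        self.cumulative = False  # flips once cumulative output is detected

    def push(self, chunk) -> str:
        if chunk is None:
            return ""
        chunk = chunk if isinstance(chunk, str) else str(chunk)
        if self.cumulative:
            if chunk.startswith(self.total):
                delta = chunk[len(self.total):]
                self.total = chunk
                return delta
            self.total += chunk
            return chunk
        # Heuristic: a chunk that extends everything seen so far means the
        # backend is re-sending the full text on every step.
        if self.total and len(chunk) > len(self.total) and chunk.startswith(self.total):
            self.cumulative = True
            delta = chunk[len(self.total):]
            self.total = chunk
            return delta
        self.total += chunk
        return chunk


n = StreamNormalizer()
print([n.push(c) for c in ["Hel", "Hello", "Hello wor", "Hello world"]])
# -> ['Hel', 'lo', ' wor', 'ld']: cumulative input collapses to per-step deltas
```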

{cortex_llm-1.0.7 → cortex_llm-1.0.9}/cortex/metal/mlx_converter.py
@@ -66,9 +66,22 @@ class MLXConverter:
  self.cache_dir.mkdir(parents=True, exist_ok=True)
  self.conversion_cache = self.cache_dir / "conversion_cache.json"
  self._load_conversion_cache()
+ self._warned_mlx_lm_compat = False

  logger.info(f"MLX Converter initialized with cache dir: {self.cache_dir}")
  logger.info(f"MLX LM available: {mlx_utils is not None and load is not None}")
+
+ def _warn_mlx_lm_compat(self, missing: str) -> None:
+ """Warn once when mlx-lm is missing newer helper APIs."""
+ if self._warned_mlx_lm_compat:
+ return
+ self._warned_mlx_lm_compat = True
+ message = (
+ f"[WARN] mlx-lm is missing '{missing}'. Using compatibility fallback. "
+ "For best support, upgrade mlx-lm to a newer version."
+ )
+ logger.warning(message)
+ print(message)

  def _load_conversion_cache(self) -> None:
  """Load conversion cache metadata."""
@@ -206,6 +219,83 @@ class MLXConverter:

  return download_dir

+ def _mlx_get_model_path(self, source_path: Path) -> Tuple[Path, Optional[str]]:
+ """Resolve model path with MLX LM compatibility fallbacks."""
+ if mlx_utils is not None and hasattr(mlx_utils, "get_model_path"):
+ return mlx_utils.get_model_path(str(source_path))
+ self._warn_mlx_lm_compat("get_model_path")
+
+ # Fallback: local path or direct HF download.
+ model_path = Path(source_path)
+ if model_path.exists():
+ hf_repo = None
+ try:
+ from huggingface_hub import ModelCard
+
+ card_path = model_path / "README.md"
+ if card_path.is_file():
+ card = ModelCard.load(card_path)
+ hf_repo = getattr(card.data, "base_model", None)
+ except Exception:
+ hf_repo = None
+ return model_path, hf_repo
+
+ try:
+ model_path = Path(
+ snapshot_download(
+ str(source_path),
+ allow_patterns=[
+ "*.json",
+ "model*.safetensors",
+ "*.py",
+ "tokenizer.model",
+ "*.tiktoken",
+ "tiktoken.model",
+ "*.txt",
+ "*.jsonl",
+ "*.jinja",
+ ],
+ )
+ )
+ except Exception as e:
+ raise RuntimeError(f"Failed to download model from Hugging Face: {e}") from e
+
+ return model_path, str(source_path)
+
+ def _mlx_fetch_from_hub(
+ self,
+ model_path: Path,
+ trust_remote_code: bool = False
+ ) -> Tuple[Any, Dict[str, Any], Any]:
+ """Fetch model/config/tokenizer with MLX LM compatibility fallbacks."""
+ if mlx_utils is not None and hasattr(mlx_utils, "fetch_from_hub"):
+ return mlx_utils.fetch_from_hub(
+ model_path,
+ lazy=True,
+ trust_remote_code=trust_remote_code
+ )
+ self._warn_mlx_lm_compat("fetch_from_hub")
+
+ if mlx_utils is not None and hasattr(mlx_utils, "load_model") and hasattr(mlx_utils, "load_tokenizer"):
+ model, model_config = mlx_utils.load_model(model_path, lazy=True)
+ try:
+ tokenizer = mlx_utils.load_tokenizer(
+ model_path,
+ eos_token_ids=model_config.get("eos_token_id", None),
+ tokenizer_config_extra={"trust_remote_code": trust_remote_code},
+ )
+ except TypeError:
+ tokenizer = mlx_utils.load_tokenizer(
+ model_path,
+ eos_token_ids=model_config.get("eos_token_id", None),
+ )
+ return model, model_config, tokenizer
+
+ raise RuntimeError(
+ "mlx_lm.utils is missing required helpers (fetch_from_hub/load_model). "
+ "Upgrade mlx-lm to a newer version."
+ )
+
  def _requires_sentencepiece(self, model_path: Path) -> bool:
  """Return True if the model likely needs SentencePiece."""
  # If a fast tokenizer is present, SentencePiece should not be required.
@@ -379,10 +469,17 @@ class MLXConverter:
  # Build quantization configuration
  quantize_config = self._build_quantization_config(config)

- model_path, hf_repo = mlx_utils.get_model_path(str(source_path))
- model, model_config, tokenizer = mlx_utils.fetch_from_hub(
- model_path, lazy=True, trust_remote_code=False
- )
+ try:
+ model_path, hf_repo = self._mlx_get_model_path(Path(source_path))
+ except Exception as e:
+ return False, f"Model path resolution failed: {e}", None
+
+ try:
+ model, model_config, tokenizer = self._mlx_fetch_from_hub(
+ model_path, trust_remote_code=False
+ )
+ except Exception as e:
+ return False, f"Model fetch failed: {e}", None

  dtype = model_config.get("torch_dtype", None)
  if dtype in ["float16", "bfloat16", "float32"]:
@@ -398,6 +495,8 @@ class MLXConverter:
  model.update(tree_map_with_path(set_dtype, model.parameters()))

  if config.quantization != QuantizationRecipe.NONE:
+ if mlx_utils is None or not hasattr(mlx_utils, "quantize_model"):
+ return False, "MLX LM quantize_model not available; upgrade mlx-lm.", None
  quant_predicate = None
  if quantize_config and "quant_predicate" in quantize_config:
  quant_predicate = quantize_config["quant_predicate"]
@@ -411,6 +510,8 @@ class MLXConverter:
  )

  normalized_hf_repo = self._normalize_hf_repo(hf_repo)
+ if mlx_utils is None or not hasattr(mlx_utils, "save"):
+ return False, "MLX LM save() not available; upgrade mlx-lm.", None
  mlx_utils.save(output_path, model_path, model, tokenizer, model_config, hf_repo=normalized_hf_repo)
  logger.info("MLX conversion completed")
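
The mlx_converter.py changes above stop assuming a recent mlx-lm: the converter feature-detects the `mlx_lm.utils` helpers it relies on (`get_model_path`, `fetch_from_hub`, `quantize_model`, `save`, with `load_model`/`load_tokenizer` as a secondary fallback), warns once, and then either falls back or fails with a clear message. A hedged sketch of that capability probe, with helper names taken from the diff and the import path assumed to match what the converter already uses:

```python
# Feature-detect the mlx_lm.utils helpers needed for conversion; a sketch of
# the hasattr-based compatibility pattern shown in the diff, not the package's
# exact code.
try:
    from mlx_lm import utils as mlx_utils  # assumed import path; absent if mlx-lm is not installed
except ImportError:
    mlx_utils = None

REQUIRED_HELPERS = ("get_model_path", "fetch_from_hub", "quantize_model", "save")


def missing_mlx_helpers() -> list:
    """Return the helper names this mlx-lm install does not provide."""
    if mlx_utils is None:
        return list(REQUIRED_HELPERS)
    return [name for name in REQUIRED_HELPERS if not hasattr(mlx_utils, name)]


missing = missing_mlx_helpers()
if missing:
    print(f"[WARN] mlx-lm is missing {', '.join(missing)}; compatibility fallbacks will be used.")
```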
 

{cortex_llm-1.0.7 → cortex_llm-1.0.9}/cortex/ui/cli.py
@@ -18,6 +18,7 @@ from textwrap import wrap

  from rich.live import Live
  from rich.style import Style
+ from rich.text import Text


  logger = logging.getLogger(__name__)
@@ -30,6 +31,8 @@ from cortex.conversation_manager import ConversationManager, MessageRole
  from cortex.model_downloader import ModelDownloader
  from cortex.template_registry import TemplateRegistry
  from cortex.fine_tuning import FineTuneWizard
+ from cortex.tools import ToolRunner
+ from cortex.tools import protocol as tool_protocol
  from cortex.ui.markdown_render import ThinkMarkdown, PrefixedRenderable, render_plain_with_think


@@ -58,6 +61,11 @@ class CortexCLI:

  # Initialize fine-tuning wizard
  self.fine_tune_wizard = FineTuneWizard(model_manager, config)
+
+ # Tooling support (always enabled)
+ self.tool_runner = ToolRunner(Path.cwd())
+ self.tool_runner.set_confirm_callback(self._confirm_tool_change)
+ self.max_tool_iterations = 4


  self.running = True
@@ -132,6 +140,86 @@ class CortexCLI:
  # Don't call sys.exit() here - let the main loop exit naturally
  # This prevents traceback from the parent process
  print("\n", file=sys.stderr) # Just add a newline for cleaner output
+
+ def _confirm_tool_change(self, prompt: str) -> bool:
+ """Prompt user to approve a tool-driven change."""
+ print("\n" + prompt)
+ response = input("Apply change? [y/N]: ").strip().lower()
+ return response in {"y", "yes"}
+
+ def _ensure_tool_instructions(self) -> None:
+ """Inject tool instructions into the conversation once."""
+ conversation = self.conversation_manager.get_current_conversation()
+ if conversation is None:
+ conversation = self.conversation_manager.new_conversation()
+ marker = "[CORTEX_TOOL_INSTRUCTIONS v2]"
+ for message in conversation.messages:
+ if message.role == MessageRole.SYSTEM and marker in message.content:
+ return
+ self.conversation_manager.add_message(MessageRole.SYSTEM, self.tool_runner.tool_instructions())
+
+ def _summarize_tool_call(self, call: dict) -> str:
+ name = str(call.get("name", "tool"))
+ args = call.get("arguments") or {}
+ parts = []
+ preferred = ("path", "query", "anchor", "start_line", "end_line", "recursive", "max_results")
+ for key in preferred:
+ if key in args:
+ value = args[key]
+ if isinstance(value, str) and len(value) > 60:
+ value = value[:57] + "..."
+ parts.append(f"{key}={value!r}")
+ if not parts and args:
+ for key in list(args.keys())[:3]:
+ value = args[key]
+ if isinstance(value, str) and len(value) > 60:
+ value = value[:57] + "..."
+ parts.append(f"{key}={value!r}")
+ arg_str = ", ".join(parts)
+ return f"{name}({arg_str})" if arg_str else f"{name}()"
+
+ def _summarize_tool_result(self, result: dict) -> str:
+ name = str(result.get("name", "tool"))
+ if not result.get("ok", False):
+ error = result.get("error") or "unknown error"
+ return f"{name} -> error: {error}"
+ payload = result.get("result") or {}
+ if name == "list_dir":
+ entries = payload.get("entries") or []
+ return f"{name} -> entries={len(entries)}"
+ if name == "search":
+ matches = payload.get("results") or []
+ return f"{name} -> results={len(matches)}"
+ if name == "read_file":
+ path = payload.get("path") or ""
+ start = payload.get("start_line")
+ end = payload.get("end_line")
+ if start and end:
+ return f"{name} -> {path} lines {start}-{end}"
+ if start:
+ return f"{name} -> {path} from line {start}"
+ return f"{name} -> {path}"
+ if name in {"write_file", "create_file", "delete_file", "replace_in_file", "insert_after", "insert_before"}:
+ path = payload.get("path") or ""
+ return f"{name} -> {path}"
+ return f"{name} -> ok"
+
+ def _print_tool_activity(self, tool_calls: list, tool_results: list) -> None:
+ lines = []
+ for call, result in zip(tool_calls, tool_results):
+ lines.append(f"tool {self._summarize_tool_call(call)} -> {self._summarize_tool_result(result)}")
+ if not lines:
+ return
+ text = Text("\n".join(lines), style=Style(color="bright_black", italic=True))
+ renderable = PrefixedRenderable(text, prefix=" ", prefix_style=Style(dim=True), indent=" ", auto_space=False)
+ original_console_width = self.console._width
+ target_width = max(40, int(self.get_terminal_width() * 0.75))
+ self.console.width = target_width
+ try:
+ self.console.print(renderable, highlight=False, soft_wrap=True)
+ self.console.print()
+ finally:
+ self.console._width = original_console_width


  def get_terminal_width(self) -> int:
@@ -1110,16 +1198,10 @@ class CortexCLI:
  except Exception as e:
  logger.debug(f"Failed to get template profile: {e}")

- # Build conversation context with proper formatting BEFORE adding to conversation
- formatted_prompt = self._format_prompt_with_chat_template(user_input)
-
- # DEBUG: Uncomment these lines to see the exact prompt being sent to the model
- # This is crucial for debugging when models give unexpected responses
- # It shows the formatted prompt with all special tokens and formatting
- # print(f"\033[33m[DEBUG] Formatted prompt being sent to model:\033[0m", file=sys.stderr)
- # print(f"\033[33m{repr(formatted_prompt[:200])}...\033[0m", file=sys.stderr)
-
- # Now add user message to conversation history
+ # Ensure tool instructions are present before adding user message
+ self._ensure_tool_instructions()
+
+ # Now add user message to conversation history
  self.conversation_manager.add_message(MessageRole.USER, user_input)

  # Start response on a new line; prefix is rendered with the markdown output.
@@ -1134,130 +1216,154 @@ class CortexCLI:
  except Exception as e:
  logger.debug(f"Could not get stop sequences: {e}")

- # Create generation request with formatted prompt
- request = GenerationRequest(
- prompt=formatted_prompt,
- max_tokens=self.config.inference.max_tokens,
- temperature=self.config.inference.temperature,
- top_p=self.config.inference.top_p,
- top_k=self.config.inference.top_k,
- repetition_penalty=self.config.inference.repetition_penalty,
- stream=self.config.inference.stream_output,
- seed=self.config.inference.seed if self.config.inference.seed >= 0 else None,
- stop_sequences=stop_sequences
- )
-
- # Generate response
+ # Generate response (with tool loop)
  self.generating = True
- generated_text = ""
- start_time = time.time()
- token_count = 0
- first_token_time = None

  try:
- # Reset streaming state for reasoning templates if supported
- if uses_reasoning_template and template_profile and template_profile.supports_streaming():
- if hasattr(template_profile, 'reset_streaming_state'):
- template_profile.reset_streaming_state()
+ tool_iterations = 0
+ while tool_iterations < self.max_tool_iterations:
+ tool_iterations += 1

- display_text = ""
- accumulated_response = ""
- last_render_time = 0.0
- render_interval = 0.05 # seconds
- prefix_style = Style(color="cyan")
+ formatted_prompt = self._format_prompt_with_chat_template(user_input, include_user=False)

- def build_renderable(text: str):
- if getattr(self.config.ui, "markdown_rendering", True):
- markdown = ThinkMarkdown(
- text,
- code_theme="monokai",
- use_line_numbers=False,
- syntax_highlighting=getattr(self.config.ui, "syntax_highlighting", True),
- )
- renderable = markdown
- else:
- renderable = render_plain_with_think(text)
+ # DEBUG: Uncomment these lines to see the exact prompt being sent to the model
+ # print(f"\033[33m[DEBUG] Formatted prompt being sent to model:\033[0m", file=sys.stderr)
+ # print(f"\033[33m{repr(formatted_prompt[:200])}...\033[0m", file=sys.stderr)

- return PrefixedRenderable(renderable, prefix="⏺ ", prefix_style=prefix_style, indent=" ")
+ request = GenerationRequest(
+ prompt=formatted_prompt,
+ max_tokens=self.config.inference.max_tokens,
+ temperature=self.config.inference.temperature,
+ top_p=self.config.inference.top_p,
+ top_k=self.config.inference.top_k,
+ repetition_penalty=self.config.inference.repetition_penalty,
+ stream=self.config.inference.stream_output,
+ seed=self.config.inference.seed if self.config.inference.seed >= 0 else None,
+ stop_sequences=stop_sequences
+ )

- original_console_width = self.console._width
- target_width = max(40, int(self.get_terminal_width() * 0.75))
- self.console.width = target_width
- try:
- with Live(
- build_renderable(""),
- console=self.console,
- auto_refresh=False,
- refresh_per_second=20,
- transient=False,
- vertical_overflow="visible",
- ) as live:
- for token in self.inference_engine.generate(request):
- if first_token_time is None:
- first_token_time = time.time()
+ generated_text = ""
+ start_time = time.time()
+ token_count = 0
+ first_token_time = None
+ tool_calls_started = False

- generated_text += token
- token_count += 1
+ if uses_reasoning_template and template_profile and template_profile.supports_streaming():
+ if hasattr(template_profile, 'reset_streaming_state'):
+ template_profile.reset_streaming_state()

- display_token = token
- if uses_reasoning_template and template_profile and template_profile.supports_streaming():
- display_token, should_display = template_profile.process_streaming_response(
- token, accumulated_response
- )
- accumulated_response += token
- if not should_display:
- display_token = ""
+ display_text = ""
+ accumulated_response = ""
+ last_render_time = 0.0
+ render_interval = 0.05 # seconds
+ prefix_style = Style(color="cyan")
+
+ def build_renderable(text: str):
+ if getattr(self.config.ui, "markdown_rendering", True):
+ markdown = ThinkMarkdown(
+ text,
+ code_theme="monokai",
+ use_line_numbers=False,
+ syntax_highlighting=getattr(self.config.ui, "syntax_highlighting", True),
+ )
+ renderable = markdown
+ else:
+ renderable = render_plain_with_think(text)

- if display_token:
- display_text += display_token
+ return PrefixedRenderable(renderable, prefix="⏺", prefix_style=prefix_style, indent=" ", auto_space=True)

- now = time.time()
- if display_token and ("\n" in display_token or now - last_render_time >= render_interval):
+ original_console_width = self.console._width
+ target_width = max(40, int(self.get_terminal_width() * 0.75))
+ self.console.width = target_width
+ try:
+ with Live(
+ build_renderable(""),
+ console=self.console,
+ auto_refresh=False,
+ refresh_per_second=20,
+ transient=False,
+ vertical_overflow="visible",
+ ) as live:
+ for token in self.inference_engine.generate(request):
+ if first_token_time is None:
+ first_token_time = time.time()
+
+ generated_text += token
+ token_count += 1
+
+ if not tool_calls_started and tool_protocol.find_tool_calls_block(generated_text)[0] is not None:
+ tool_calls_started = True
+ display_text = "<think>tools running...</think>"
+ live.update(build_renderable(display_text), refresh=True)
+
+ display_token = token
+ if uses_reasoning_template and template_profile and template_profile.supports_streaming():
+ display_token, should_display = template_profile.process_streaming_response(
+ token, accumulated_response
+ )
+ accumulated_response += token
+ if not should_display:
+ display_token = ""
+
+ if not tool_calls_started and display_token:
+ display_text += display_token
+
+ now = time.time()
+ if (not tool_calls_started and display_token and
+ ("\n" in display_token or now - last_render_time >= render_interval)):
+ live.update(build_renderable(display_text), refresh=True)
+ last_render_time = now
+
+ if not tool_calls_started and uses_reasoning_template and template_profile:
+ final_text = template_profile.process_response(generated_text)
+ generated_text = final_text
+ if not template_profile.config.show_reasoning:
+ display_text = final_text
  live.update(build_renderable(display_text), refresh=True)
- last_render_time = now
+ finally:
+ self.console._width = original_console_width

- if uses_reasoning_template and template_profile:
- final_text = template_profile.process_response(generated_text)
- generated_text = final_text
- if not template_profile.config.show_reasoning:
- display_text = final_text
+ tool_calls, parse_error = tool_protocol.parse_tool_calls(generated_text)
+ if parse_error:
+ print(f"\n\033[31m✗ Tool call parse error:\033[0m {parse_error}", file=sys.stderr)

- live.update(build_renderable(display_text), refresh=True)
- finally:
- self.console._width = original_console_width
+ if tool_calls:
+ tool_results = self.tool_runner.run_calls(tool_calls)
+ self._print_tool_activity(tool_calls, tool_results)
+ self.conversation_manager.add_message(
+ MessageRole.SYSTEM,
+ tool_protocol.format_tool_results(tool_results)
+ )
+ if tool_iterations >= self.max_tool_iterations:
+ print("\n\033[31m✗\033[0m Tool loop limit reached.", file=sys.stderr)
+ break
+ continue

- # Display final metrics in a clean, professional way
- elapsed = time.time() - start_time
- if token_count > 0 and elapsed > 0:
- tokens_per_sec = token_count / elapsed
- first_token_latency = first_token_time - start_time if first_token_time else 0
-
- # Build metrics parts - all will be wrapped in dim for subtlety
- metrics_parts = []
-
- if first_token_latency > 0.1:
- # First token latency
- metrics_parts.append(f"first {first_token_latency:.2f}s")
-
- # Total time
- metrics_parts.append(f"total {elapsed:.1f}s")
-
- # Token count
- metrics_parts.append(f"tokens {token_count}")
-
- # Throughput
- metrics_parts.append(f"speed {tokens_per_sec:.1f} tok/s")
-
- # Print entire metrics line as dim/secondary to make it less prominent
- # Indent metrics to align with response text
- metrics_line = " · ".join(metrics_parts)
- print(f" \033[2m{metrics_line}\033[0m")
-
- if token_count >= request.max_tokens:
- print(f" \033[2m(output truncated at max_tokens={request.max_tokens}; increase in config.yaml)\033[0m")
-
- # Add assistant message to conversation history
- self.conversation_manager.add_message(MessageRole.ASSISTANT, generated_text)
+ final_text = generated_text
+ if parse_error:
+ final_text = tool_protocol.strip_tool_blocks(generated_text)
+ if tool_calls_started and final_text.strip():
+ self.console.print(build_renderable(final_text))
+
+ elapsed = time.time() - start_time
+ if token_count > 0 and elapsed > 0:
+ tokens_per_sec = token_count / elapsed
+ first_token_latency = first_token_time - start_time if first_token_time else 0
+
+ metrics_parts = []
+ if first_token_latency > 0.1:
+ metrics_parts.append(f"first {first_token_latency:.2f}s")
+ metrics_parts.append(f"total {elapsed:.1f}s")
+ metrics_parts.append(f"tokens {token_count}")
+ metrics_parts.append(f"speed {tokens_per_sec:.1f} tok/s")
+ metrics_line = " · ".join(metrics_parts)
+ print(f" \033[2m{metrics_line}\033[0m")
+
+ if token_count >= request.max_tokens:
+ print(f" \033[2m(output truncated at max_tokens={request.max_tokens}; increase in config.yaml)\033[0m")
+
+ self.conversation_manager.add_message(MessageRole.ASSISTANT, final_text)
+ break

  except Exception as e:
  print(f"\n\033[31m✗ Error:\033[0m {str(e)}", file=sys.stderr)
@@ -1274,7 +1380,7 @@ class CortexCLI:
  except (KeyboardInterrupt, EOFError):
  raise

- def _format_prompt_with_chat_template(self, user_input: str) -> str:
+ def _format_prompt_with_chat_template(self, user_input: str, include_user: bool = True) -> str:
  """Format the prompt with appropriate chat template for the model."""
  # Get current conversation context
  conversation = self.conversation_manager.get_current_conversation()
@@ -1297,10 +1403,11 @@ class CortexCLI:
  })

  # Add current user message
- messages.append({
- "role": "user",
- "content": user_input
- })
+ if include_user:
+ messages.append({
+ "role": "user",
+ "content": user_input
+ })

  # Use template registry to format messages
  try:
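
The cli.py changes above wrap each user turn in a bounded tool loop: the model's reply is scanned for tool calls; any calls are executed through `ToolRunner` (file edits gated by the y/N confirmation callback), the results are fed back as a SYSTEM message, and generation repeats, up to `max_tool_iterations` (4). The skeleton below is illustrative only: `generate`, `parse_tool_calls`, and `run_calls` are stand-in stubs, and the `[TOOL_CALLS]` marker is invented here, not the real cortex.tools protocol.

```python
# Illustrative-only skeleton of the bounded tool loop; streaming, rendering,
# and the real tool-call protocol are elided.
MAX_TOOL_ITERATIONS = 4  # mirrors self.max_tool_iterations in the diff


def generate(history):
    # Stub model: request a tool once, then answer after seeing its result.
    saw_results = any(m["role"] == "system" and m["content"].startswith("TOOL RESULTS")
                      for m in history)
    if saw_results:
        return "The CLI entry point lives in cortex/ui/cli.py."
    return "[TOOL_CALLS] read_file path=cortex/ui/cli.py"


def parse_tool_calls(reply):
    # Stub parser: the real protocol returns (calls, parse_error).
    if reply.startswith("[TOOL_CALLS]"):
        return [{"name": "read_file", "arguments": {"path": "cortex/ui/cli.py"}}], None
    return [], None


def run_calls(calls):
    # Stub runner: the real ToolRunner performs repo-scoped reads/searches and
    # asks for y/N confirmation before applying any file edit.
    return [{"name": c["name"], "ok": True, "result": {"path": c["arguments"]["path"]}}
            for c in calls]


def run_turn(history):
    for _ in range(MAX_TOOL_ITERATIONS):
        reply = generate(history)
        calls, parse_error = parse_tool_calls(reply)
        if not calls:
            history.append({"role": "assistant", "content": reply})
            return reply
        results = run_calls(calls)
        history.append({"role": "system", "content": "TOOL RESULTS " + repr(results)})
    return "Tool loop limit reached."


print(run_turn([{"role": "user", "content": "Where is the CLI implemented?"}]))
```

The real loop additionally hides streamed output once a tool-call block is detected ("tools running...") and re-renders the final answer after the last iteration.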

{cortex_llm-1.0.7 → cortex_llm-1.0.9}/cortex/ui/markdown_render.py
@@ -190,11 +190,13 @@ class PrefixedRenderable:
  prefix: str,
  prefix_style: Style | None = None,
  indent: str | None = None,
+ auto_space: bool = False,
  ) -> None:
  self.renderable = renderable
  self.prefix = prefix
  self.prefix_style = prefix_style
  self.indent = indent if indent is not None else " " * len(prefix)
+ self.auto_space = auto_space

  def __rich_console__(self, console: Console, options):
  prefix_width = cell_len(self.prefix)
@@ -205,6 +207,7 @@

  yield Segment(self.prefix, self.prefix_style)

+ inserted_space = False
  for segment in console.render(self.renderable, inner_options):
  if segment.control:
  yield segment
@@ -213,6 +216,12 @@
  text = segment.text
  style = segment.style

+ if self.auto_space and not inserted_space:
+ if text:
+ if not text[0].isspace():
+ yield Segment(" ", None)
+ inserted_space = True
+
  if "\n" not in text:
  yield segment
  continue
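
The markdown_render.py change above adds an `auto_space` flag to `PrefixedRenderable`: when enabled, a single separating space is emitted before the first non-whitespace content segment, so callers can pass a bare prefix like "⏺" instead of baking a trailing space into it (which is exactly what the cli.py change does). A plain-string sketch of the behavior; the real class walks Rich segments and also re-indents wrapped lines:

```python
# Plain-string sketch of auto_space; illustrative only.
def prefix_first_line(prefix: str, first_segment: str, auto_space: bool = False) -> str:
    needs_space = auto_space and first_segment and not first_segment[0].isspace()
    return prefix + (" " if needs_space else "") + first_segment


print(prefix_first_line("⏺ ", "Hello"))                  # old style: space baked into the prefix
print(prefix_first_line("⏺", "Hello", auto_space=True))  # new style: space inserted automatically
```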

{cortex_llm-1.0.7 → cortex_llm-1.0.9}/cortex_llm.egg-info/PKG-INFO
@@ -1,6 +1,6 @@
  Metadata-Version: 2.4
  Name: cortex-llm
- Version: 1.0.7
+ Version: 1.0.9
  Summary: GPU-Accelerated LLM Terminal for Apple Silicon
  Home-page: https://github.com/faisalmumtaz/Cortex
  Author: Cortex Development Team
@@ -131,6 +131,10 @@ Cortex supports:
  - `docs/template-registry.md`
  - **Inference engine details** and backend behavior
  - `docs/inference-engine.md`
+ - **Tooling (experimental, WIP)** for repo-scoped read/search and optional file edits with explicit confirmation
+ - `docs/cli.md`
+
+ **Important (Work in Progress):** Tooling is actively evolving and should be considered experimental. Behavior, output format, and available actions may change; tool calls can fail; and UI presentation may be adjusted. Use tooling on non-critical work first, and always review any proposed file changes before approving them.

  ## Configuration

{cortex_llm-1.0.7 → cortex_llm-1.0.9}/pyproject.toml
@@ -4,7 +4,7 @@ build-backend = "setuptools.build_meta"

  [project]
  name = "cortex-llm"
- version = "1.0.7"
+ version = "1.0.9"
  description = "GPU-Accelerated LLM Terminal for Apple Silicon"
  readme = "README.md"
  license = "MIT"

{cortex_llm-1.0.7 → cortex_llm-1.0.9}/setup.py
@@ -26,7 +26,7 @@ def read_requirements():

  setup(
  name="cortex-llm",
- version="1.0.7",
+ version="1.0.9",
  author="Cortex Development Team",
  description="GPU-Accelerated LLM Terminal for Apple Silicon",
  long_description=README,