nex-code 0.4.38 → 0.4.40

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
package/README.md CHANGED
@@ -78,16 +78,16 @@ npm update -g nex-code
 
 ---
 
-## vs. Claude Code & Gemini CLI
-
-| | nex-code | Claude Code | Gemini CLI |
-|---|---|---|---|
-| Free tier | ✅ Ollama Cloud flat-rate | ❌ subscription required | ⚠️ limited free quota |
-| Open models | ✅ devstral, Kimi K2, Qwen3 | ❌ Anthropic only | ❌ Google only |
-| Local Ollama | ✅ | ❌ | ❌ |
-| Multi-provider | ✅ swap with one env var | ❌ | ❌ |
-| VS Code sidebar | ✅ built-in, same install | | ❌ |
-| Startup time | ~100ms | ~2–4s | ~1–2s |
+## Why nex-code?
+
+| Feature | nex-code | Closed-source alternatives |
+|---|---|---|
+| Free tier | ✅ Ollama Cloud flat-rate | ❌ subscription or limited quota |
+| Open models | ✅ devstral, Kimi K2, Qwen3 | ❌ vendor-locked |
+| Local Ollama | ✅ | ❌ |
+| Multi-provider | ✅ swap with one env var | ❌ |
+| VS Code sidebar | ✅ built-in | ⚠️ partial |
+| Startup time | ~100ms | 1–4s |
 | Runtime deps | 2 | heavy | heavy |
 | Infra tools | ✅ SSH, Docker, K8s built-in | ❌ | ❌ |
 
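The "swap with one env var" row above can be made concrete. A minimal sketch, assuming only that `DEFAULT_MODEL` is the variable involved (this README's later sections set it in `.env`); the model name is an example and `nex-code` itself is not invoked here:

```shell
# Swap the active model by changing a single env var.
# DEFAULT_MODEL comes from this README's own .env instructions;
# the model name is just an example value.
export DEFAULT_MODEL=devstral-2:123b
echo "active model: $DEFAULT_MODEL"
```

The same switch can be made persistent with a `DEFAULT_MODEL=...` line in `.env`, as the README describes further down.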
@@ -101,7 +101,7 @@ npm update -g nex-code
 
 **Open-model first.** Not locked to any single vendor. Tool tiers (`essential / standard / full`) adapt automatically to the model's capability level, so smaller models don't receive tool schemas they can't handle. A 5-layer auto-fix loop catches and retries malformed tool calls without user intervention.
 
-**Smart model routing.** The built-in `/benchmark` system tests all configured models against 56 real nex-code tool-calling tasks across 5 task categories. The results feed a routing table so nex-code can automatically switch to the best model for the detected task type:
+**Smart model routing.** The built-in `/benchmark` system tests all configured models against 62 real nex-code tool-calling tasks across 5 task categories. The results feed a routing table so nex-code can automatically switch to the best model for the detected task type:
 
 | Detected task | Routed model (example) |
 | ------------------------- | --------------------------- |
@@ -125,7 +125,7 @@ The verify phase catches incomplete work before reporting "done" — if tests fa
 
 **Lightweight.** 2 runtime dependencies (`axios`, `dotenv`). Starts in ~100ms. No Python, no heavy runtime, no daemon process.
 
-**Server-aware from the first message.** When your prompt contains a URL whose domain matches a configured SSH profile (e.g. `jarvis.example.com` → profile `jarvis`), nex-code probes the server before responding — listing ports, running processes, and data directories. The model receives this topology before its first token, so it goes straight to `ssh_exec` instead of reading local files.
+**Server-aware from the first message.** When your prompt contains a URL whose domain matches a configured SSH profile (e.g. `server.example.com` → profile `server`), nex-code probes the server before responding — listing ports, running processes, and data directories. The model receives this topology before its first token, so it goes straight to `ssh_exec` instead of reading local files.
 
 **Few-shot behavior injection.** On each session start, nex-code injects a short example of the correct tool sequence for the detected task type (sysadmin → check remote logs first; coding → read file before editing; data → explain before rewriting). Works across all models without fine-tuning. Customize with your own high-scoring sessions via `npm run extract-examples`.
 
@@ -156,38 +156,27 @@ The verify phase catches incomplete work before reporting "done" — if tests fa
 ## Ollama Cloud — Recommended Model Setup
 
 nex-code was built with Ollama Cloud as its primary provider. No subscription, no billing surprises.
-Rankings are based on nex-code's own `/benchmark` — 15 tool-calling tasks against real nex-code schemas.
+Rankings are based on nex-code's own `/benchmark` — a 14-task quick benchmark against real nex-code schemas (62 tasks in a full run).
 
 ### Flat-Rate / Pay-as-you-go
 
 <!-- nex-benchmark-start -->
-<!-- Updated: 2026-03-29 — run `/benchmark --discover` after new Ollama Cloud releases -->
+<!-- Updated: 2026-04-01 — run `/benchmark --discover` after new Ollama Cloud releases -->
 
 | Rank | Model | Score | Avg Latency | Context | Best For |
 |---|---|---|---|---|---|
-| 🥇 | `qwen3-vl:235b` | **77.1** | 14.4s | 131K | Overall #1: frontier tool selection, data + agentic tasks |
-| 🥈 | `qwen3-vl:235b-instruct` | 76.3 | 6.5s | 131K | Best latency/score balance, recommended default |
-| 🥉 | `rnj-1:8b` | 74 | 3.7s | 131K | — |
-| — | `ministral-3:8b` | 73.1 | 2.3s | 131K | Fastest strong model, 2.2s latency, 70+ score |
-| — | `qwen3-coder-next` | 71.4 | 2.8s | 256K | — |
-| — | `qwen3-next:80b` | 70.6 | 11.6s | 131K | — |
-| — | `qwen3.5:397b` | 68.9 | 3.9s | 256K | — |
-| — | `minimax-m2.7` | 68.7 | 6.8s | 200K | — |
-| — | `glm-5` | 67.6 | 4.5s | 131K | — |
-| — | `devstral-2:123b` | 67.6 | 2.0s | 131K | Sysadmin + SSH tasks, reliable coding |
-| — | `glm-4.7` | 66.5 | 5.1s | 131K | — |
-| — | `kimi-k2-thinking` | 66.3 | 18.4s | 256K | — |
-| — | `ministral-3:14b` | 65.8 | 3.8s | 131K | — |
-| — | `devstral-small-2:24b` | 65.5 | 2.3s | 131K | Fast sub-agents, simple lookups |
-| — | `ministral-3:3b` | 65.4 | 2.2s | 32K | — |
-| — | `kimi-k2.5` | 65.2 | 3.5s | 256K | Large repos — faster than k2:1t |
-| — | `kimi-k2:1t` | 65.2 | 4.2s | 256K | Large repos (>100K tokens) |
-| — | `minimax-m2.1` | 64.2 | 5.4s | 200K | — |
-| — | `glm-4.6` | 63.9 | 4.9s | 131K | — |
-| — | `qwen3-coder:480b` | 63.2 | 14.1s | 131K | Heavy coding sessions, large context |
-| — | `nemotron-3-super` | 61.3 | 2.6s | 256K | — |
-| — | `gpt-oss:20b` | 60.9 | 2.5s | 131K | Fast small model, good overall score |
-| — | `mistral-large-3:675b` | 60.8 | 3.8s | 131K | — |
+| 🥇 | `qwen3-vl:235b-instruct` | **79.9** | 3.8s | 131K | Best latency/score balance, recommended default |
+| 🥈 | `qwen3-vl:235b` | 79.4 | 12.3s | 131K | Frontier tool selection, data + agentic tasks |
+| 🥉 | `qwen3-coder-next` | 74.9 | 1.7s | 256K | — |
+| — | `rnj-1:8b` | 74.6 | 2.5s | 131K | — |
+| — | `ministral-3:8b` | 74.2 | 1.2s | 131K | Fastest strong model, 1.2s latency, 70+ score |
+| — | `qwen3.5:397b` | 72.8 | 2.1s | 256K | — |
+| — | `qwen3-next:80b` | 71.3 | 10.3s | 131K | — |
+| — | `devstral-2:123b` | 69.9 | 1.6s | 131K | Sysadmin + SSH tasks, reliable coding |
+| — | `minimax-m2.7` | 69.4 | 4.1s | 200K | — |
+| — | `glm-5` | 69 | 7.6s | 131K | — |
+| — | `glm-4.7` | 67.8 | 3.7s | 131K | — |
+| — | `kimi-k2-thinking` | 62 | 2.4s | 256K | — |
 
 > Rankings are nex-code-specific: tool name accuracy, argument validity, schema compliance.
 > Toolathon (Minimax SOTA) measures different task types — run `/benchmark --discover` after model releases.
@@ -208,14 +197,15 @@ NEX_FAST_MODEL=devstral-small-2:24b # quick lookups, fast sub-agents
 ### Run the benchmark yourself
 
 ```bash
-/benchmark            # full run: 15 tasks × 5 models
-/benchmark --quick    # fast run: 7 tasks × 3 models
+/benchmark            # full run: 62 tasks × 5 models
+/benchmark --quick    # fast run: 14 tasks × 3 models (doubled from 7 for better resolution)
 /benchmark --discover # detect new Ollama Cloud models, benchmark + auto-update README
 /benchmark --models=minimax-m2.7:cloud,qwen3-coder:480b
 /benchmark --history  # show OpenClaw nightly trend
 ```
 
 Switch anytime: `/model devstral-2:123b` or update `DEFAULT_MODEL` in `.env`.
+The best models discovered are automatically saved to `~/.nex-code/.env` to persist globally across all your projects.
 Auto-discovery runs weekly via the scheduled improvement task and updates this table automatically.
 
 ---
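The global persistence added in this version can be pictured as dotenv layering. A stand-alone sketch under stated assumptions: the `~/.nex-code/.env` path is taken from the README text above, but the precedence (a project-local `.env` overriding the global file) is an assumption about nex-code's load order, and temp files stand in for the real ones:

```shell
# Simulate a global ~/.nex-code/.env (written by /benchmark --discover,
# per the README) being layered under a project-local .env.
# Paths and override order here are illustrative assumptions.
dir=$(mktemp -d)
printf 'DEFAULT_MODEL=qwen3-vl:235b-instruct\n' > "$dir/global.env"
printf 'DEFAULT_MODEL=devstral-2:123b\n' > "$dir/local.env"
set -a                 # auto-export everything sourced below
. "$dir/global.env"    # global default
. "$dir/local.env"     # project-local value wins
set +a
echo "resolved model: $DEFAULT_MODEL"
```

Under that assumed order, the project-local value is what the session ends up using.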
@@ -672,7 +662,7 @@ Or create `.nex/servers.json` manually:
 {
   "prod": {
     "host": "94.130.37.43",
-    "user": "jarvis",
+    "user": "deploy",
     "port": 22,
     "key": "~/.ssh/id_rsa",
     "os": "almalinux9",
@@ -728,7 +718,7 @@ Create `.nex/deploy.json` (or use `/init deploy`):
 "api": {
   "server": "prod",
   "method": "git",
-  "remote_path": "/home/jarvis/my-api",
+  "remote_path": "/home/deploy/my-api",
   "branch": "main",
   "deploy_script": "npm ci --omit=dev && sudo systemctl restart my-api",
   "health_check": "systemctl is-active my-api"
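The `deploy.json` fields above imply a remote command sequence. A dry-run sketch that only assembles and prints that command: the `ssh prod "…"` form and the execution order are assumptions about nex-code's internals, while the field values mirror the example config shown:

```shell
# Assemble the remote deploy command from the deploy.json fields above.
# Printed rather than executed: real runs go through nex-code's ssh_exec.
server="prod"
remote_path="/home/deploy/my-api"
branch="main"
deploy_script="npm ci --omit=dev && sudo systemctl restart my-api"
health_check="systemctl is-active my-api"
cmd="cd $remote_path && git pull origin $branch && $deploy_script && $health_check"
echo "ssh $server \"$cmd\""
```

Chaining the health check with `&&` means a failed restart surfaces immediately instead of reporting a successful deploy.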
@@ -925,6 +915,16 @@ The agent follows a repeating cycle on a dedicated `autoresearch/<tag>` branch:
 /ar-clear            # reset experiment history
 ```
 
+The loop can also run **headless** — useful for unattended overnight sessions:
+
+```bash
+nex-code --task "/ar-self-improve" --no-auto-orchestrate --max-turns 200
+```
+
+`/ar-self-improve` uses nex-code's own 14-task quick benchmark as the fitness metric. Each experiment that raises the average score above the session baseline is kept; all others are reverted with `git reset`. The benchmark output includes a **Failing tasks** section that names which tasks each model got wrong, making root causes immediately visible.
+
+> **Self-improvement history** (2026-03-31): baseline 86.7 → **92.9** (+6.2 pts) in one session. Key fix: rewording the `edit_file` tool description so models call it directly instead of first calling `read_file`. `rnj-1:8b` jumped from 77.1 → 97.9 on that change alone.
+
 ### Memory
 
 Persistent project memory that survives across sessions: