nex-code 0.4.38 → 0.4.40

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
package/README.md CHANGED
@@ -78,16 +78,16 @@ npm update -g nex-code
 
 ---
 
-## vs. Claude Code & Gemini CLI
-
-| | nex-code | Claude Code | Gemini CLI |
-|---|---|---|---|
-| Free tier | ✅ Ollama Cloud flat-rate | ❌ subscription required | ⚠️ limited free quota |
-| Open models | ✅ devstral, Kimi K2, Qwen3 | ❌ Anthropic only | ❌ Google only |
-| Local Ollama | ✅ | ❌ | ❌ |
-| Multi-provider | ✅ swap with one env var | ❌ | ❌ |
-| VS Code sidebar | ✅ built-in, same install | | ❌ |
-| Startup time | ~100ms | ~2–4s | ~1–2s |
+## Why nex-code?
+
+| Feature | nex-code | Closed-source alternatives |
+|---|---|---|
+| Free tier | ✅ Ollama Cloud flat-rate | ❌ subscription or limited quota |
+| Open models | ✅ devstral, Kimi K2, Qwen3 | ❌ vendor-locked |
+| Local Ollama | ✅ | ❌ |
+| Multi-provider | ✅ swap with one env var | ❌ |
+| VS Code sidebar | ✅ built-in | ⚠️ partial |
+| Startup time | ~100ms | 1–4s |
 | Runtime deps | 2 | heavy | heavy |
 | Infra tools | ✅ SSH, Docker, K8s built-in | ❌ | ❌ |
 
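The "swap with one env var" row above can be made concrete. A minimal sketch, assuming only that `DEFAULT_MODEL` is the variable involved (this README's later sections set it in `.env`); the model name is an example and `nex-code` itself is not invoked here:

```shell
# Swap the active model by changing a single env var.
# DEFAULT_MODEL comes from this README's own .env instructions;
# the model name is just an example value.
export DEFAULT_MODEL=devstral-2:123b
echo "active model: $DEFAULT_MODEL"
```

The same switch can be made persistent with a `DEFAULT_MODEL=...` line in `.env`, as the README describes further down.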
@@ -101,7 +101,7 @@ npm update -g nex-code
 
 **Open-model first.** Not locked to any single vendor. Tool tiers (`essential / standard / full`) adapt automatically to the model's capability level, so smaller models don't receive tool schemas they can't handle. A 5-layer auto-fix loop catches and retries malformed tool calls without user intervention.
 
-**Smart model routing.** The built-in `/benchmark` system tests all configured models against 56 real nex-code tool-calling tasks across 5 task categories. The results feed a routing table so nex-code can automatically switch to the best model for the detected task type:
+**Smart model routing.** The built-in `/benchmark` system tests all configured models against 62 real nex-code tool-calling tasks across 5 task categories. The results feed a routing table so nex-code can automatically switch to the best model for the detected task type:
 
 | Detected task | Routed model (example) |
 | ------------------------- | --------------------------- |
@@ -125,7 +125,7 @@ The verify phase catches incomplete work before reporting "done" — if tests fa
 
 **Lightweight.** 2 runtime dependencies (`axios`, `dotenv`). Starts in ~100ms. No Python, no heavy runtime, no daemon process.
 
-**Server-aware from the first message.** When your prompt contains a URL whose domain matches a configured SSH profile (e.g. `jarvis.example.com` → profile `jarvis`), nex-code probes the server before responding — listing ports, running processes, and data directories. The model receives this topology before its first token, so it goes straight to `ssh_exec` instead of reading local files.
+**Server-aware from the first message.** When your prompt contains a URL whose domain matches a configured SSH profile (e.g. `server.example.com` → profile `server`), nex-code probes the server before responding — listing ports, running processes, and data directories. The model receives this topology before its first token, so it goes straight to `ssh_exec` instead of reading local files.
 
 **Few-shot behavior injection.** On each session start, nex-code injects a short example of the correct tool sequence for the detected task type (sysadmin → check remote logs first; coding → read file before editing; data → explain before rewriting). Works across all models without fine-tuning. Customize with your own high-scoring sessions via `npm run extract-examples`.
 
@@ -156,38 +156,27 @@ The verify phase catches incomplete work before reporting "done" — if tests fa
 ## Ollama Cloud — Recommended Model Setup
 
 nex-code was built with Ollama Cloud as its primary provider. No subscription, no billing surprises.
-Rankings are based on nex-code's own `/benchmark` — 15 tool-calling tasks against real nex-code schemas.
+Rankings are based on nex-code's own `/benchmark` — a 14-task quick benchmark against real nex-code schemas (62 tasks in a full run).
 
 ### Flat-Rate / Pay-as-you-go
 
 <!-- nex-benchmark-start -->
-<!-- Updated: 2026-03-29 — run `/benchmark --discover` after new Ollama Cloud releases -->
+<!-- Updated: 2026-04-01 — run `/benchmark --discover` after new Ollama Cloud releases -->
 
 | Rank | Model | Score | Avg Latency | Context | Best For |
 |---|---|---|---|---|---|
-| 🥇 | `qwen3-vl:235b` | **77.1** | 14.4s | 131K | Overall #1: frontier tool selection, data + agentic tasks |
-| 🥈 | `qwen3-vl:235b-instruct` | 76.3 | 6.5s | 131K | Best latency/score balance, recommended default |
-| 🥉 | `rnj-1:8b` | 74 | 3.7s | 131K | — |
-| — | `ministral-3:8b` | 73.1 | 2.3s | 131K | Fastest strong model, 2.2s latency, 70+ score |
-| — | `qwen3-coder-next` | 71.4 | 2.8s | 256K | — |
-| — | `qwen3-next:80b` | 70.6 | 11.6s | 131K | — |
-| — | `qwen3.5:397b` | 68.9 | 3.9s | 256K | — |
-| — | `minimax-m2.7` | 68.7 | 6.8s | 200K | — |
-| — | `glm-5` | 67.6 | 4.5s | 131K | — |
-| — | `devstral-2:123b` | 67.6 | 2.0s | 131K | Sysadmin + SSH tasks, reliable coding |
-| — | `glm-4.7` | 66.5 | 5.1s | 131K | — |
-| — | `kimi-k2-thinking` | 66.3 | 18.4s | 256K | — |
-| — | `ministral-3:14b` | 65.8 | 3.8s | 131K | — |
-| — | `devstral-small-2:24b` | 65.5 | 2.3s | 131K | Fast sub-agents, simple lookups |
-| — | `ministral-3:3b` | 65.4 | 2.2s | 32K | — |
-| — | `kimi-k2.5` | 65.2 | 3.5s | 256K | Large repos — faster than k2:1t |
-| — | `kimi-k2:1t` | 65.2 | 4.2s | 256K | Large repos (>100K tokens) |
-| — | `minimax-m2.1` | 64.2 | 5.4s | 200K | — |
-| — | `glm-4.6` | 63.9 | 4.9s | 131K | — |
-| — | `qwen3-coder:480b` | 63.2 | 14.1s | 131K | Heavy coding sessions, large context |
-| — | `nemotron-3-super` | 61.3 | 2.6s | 256K | — |
-| — | `gpt-oss:20b` | 60.9 | 2.5s | 131K | Fast small model, good overall score |
-| — | `mistral-large-3:675b` | 60.8 | 3.8s | 131K | — |
+| 🥇 | `qwen3-vl:235b-instruct` | **79.9** | 3.8s | 131K | Best latency/score balance, recommended default |
+| 🥈 | `qwen3-vl:235b` | 79.4 | 12.3s | 131K | Frontier tool selection, data + agentic tasks |
+| 🥉 | `qwen3-coder-next` | 74.9 | 1.7s | 256K | — |
+| — | `rnj-1:8b` | 74.6 | 2.5s | 131K | — |
+| — | `ministral-3:8b` | 74.2 | 1.2s | 131K | Fastest strong model, 1.2s latency, 70+ score |
+| — | `qwen3.5:397b` | 72.8 | 2.1s | 256K | — |
+| — | `qwen3-next:80b` | 71.3 | 10.3s | 131K | — |
+| — | `devstral-2:123b` | 69.9 | 1.6s | 131K | Sysadmin + SSH tasks, reliable coding |
+| — | `minimax-m2.7` | 69.4 | 4.1s | 200K | — |
+| — | `glm-5` | 69 | 7.6s | 131K | — |
+| — | `glm-4.7` | 67.8 | 3.7s | 131K | — |
+| — | `kimi-k2-thinking` | 62 | 2.4s | 256K | — |
 
 > Rankings are nex-code-specific: tool name accuracy, argument validity, schema compliance.
 > Toolathon (Minimax SOTA) measures different task types — run `/benchmark --discover` after model releases.
@@ -208,14 +197,15 @@ NEX_FAST_MODEL=devstral-small-2:24b # quick lookups, fast sub-agents
 ### Run the benchmark yourself
 
 ```bash
-/benchmark            # full run: 15 tasks × 5 models
-/benchmark --quick    # fast run: 7 tasks × 3 models
+/benchmark            # full run: 62 tasks × 5 models
+/benchmark --quick    # fast run: 14 tasks × 3 models (doubled from 7 for better resolution)
 /benchmark --discover # detect new Ollama Cloud models, benchmark + auto-update README
 /benchmark --models=minimax-m2.7:cloud,qwen3-coder:480b
 /benchmark --history  # show OpenClaw nightly trend
 ```
 
 Switch anytime: `/model devstral-2:123b` or update `DEFAULT_MODEL` in `.env`.
+The best models discovered are automatically saved to `~/.nex-code/.env` to persist globally across all your projects.
 Auto-discovery runs weekly via the scheduled improvement task and updates this table automatically.
 
 ---
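The global persistence added in this version can be pictured as dotenv layering. A stand-alone sketch under stated assumptions: the `~/.nex-code/.env` path is taken from the README text above, but the precedence (a project-local `.env` overriding the global file) is an assumption about nex-code's load order, and temp files stand in for the real ones:

```shell
# Simulate a global ~/.nex-code/.env (written by /benchmark --discover,
# per the README) being layered under a project-local .env.
# Paths and override order here are illustrative assumptions.
dir=$(mktemp -d)
printf 'DEFAULT_MODEL=qwen3-vl:235b-instruct\n' > "$dir/global.env"
printf 'DEFAULT_MODEL=devstral-2:123b\n' > "$dir/local.env"
set -a                 # auto-export everything sourced below
. "$dir/global.env"    # global default
. "$dir/local.env"     # project-local value wins
set +a
echo "resolved model: $DEFAULT_MODEL"
```

Under that assumed order, the project-local value is what the session ends up using.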
@@ -672,7 +662,7 @@ Or create `.nex/servers.json` manually:
 {
   "prod": {
     "host": "94.130.37.43",
-    "user": "jarvis",
+    "user": "deploy",
     "port": 22,
     "key": "~/.ssh/id_rsa",
     "os": "almalinux9",
@@ -728,7 +718,7 @@ Create `.nex/deploy.json` (or use `/init deploy`):
 "api": {
   "server": "prod",
   "method": "git",
-  "remote_path": "/home/jarvis/my-api",
+  "remote_path": "/home/deploy/my-api",
   "branch": "main",
   "deploy_script": "npm ci --omit=dev && sudo systemctl restart my-api",
   "health_check": "systemctl is-active my-api"
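The `deploy.json` fields above imply a remote command sequence. A dry-run sketch that only assembles and prints that command: the `ssh prod "…"` form and the execution order are assumptions about nex-code's internals, while the field values mirror the example config shown:

```shell
# Assemble the remote deploy command from the deploy.json fields above.
# Printed rather than executed: real runs go through nex-code's ssh_exec.
server="prod"
remote_path="/home/deploy/my-api"
branch="main"
deploy_script="npm ci --omit=dev && sudo systemctl restart my-api"
health_check="systemctl is-active my-api"
cmd="cd $remote_path && git pull origin $branch && $deploy_script && $health_check"
echo "ssh $server \"$cmd\""
```

Chaining the health check with `&&` means a failed restart surfaces immediately instead of reporting a successful deploy.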
@@ -925,6 +915,16 @@ The agent follows a repeating cycle on a dedicated `autoresearch/<tag>` branch:
 /ar-clear            # reset experiment history
 ```
 
+The loop can also run **headless** — useful for unattended overnight sessions:
+
+```bash
+nex-code --task "/ar-self-improve" --no-auto-orchestrate --max-turns 200
+```
+
+`/ar-self-improve` uses nex-code's own 14-task quick benchmark as the fitness metric. Each experiment that raises the average score above the session baseline is kept; all others are reverted with `git reset`. The benchmark output includes a **Failing tasks** section that names which tasks each model got wrong, making root causes immediately visible.
+
+> **Self-improvement history** (2026-03-31): baseline 86.7 → **92.9** (+6.2 pts) in one session. Key fix: rewording the `edit_file` tool description so models call it directly instead of first calling `read_file`. `rnj-1:8b` jumped from 77.1 → 97.9 on that change alone.
+
 ### Memory
 
 Persistent project memory that survives across sessions: