nex-code 0.4.38 → 0.4.40
This diff shows the content of publicly available package versions released to one of the supported registries. It is provided for informational purposes only and reflects the changes between package versions as they appear in their public registries.
- package/README.md +41 -41
- package/dist/benchmark.js +419 -378
- package/dist/nex-code.js +698 -632
- package/dist/skills/autoresearch.js +249 -18
- package/package.json +3 -6
package/README.md
CHANGED
@@ -78,16 +78,16 @@ npm update -g nex-code
 
 ---
 
-##
-
-| | nex-code |
-
-| Free tier | ✅ Ollama Cloud flat-rate | ❌ subscription
-| Open models | ✅ devstral, Kimi K2, Qwen3 | ❌
-| Local Ollama | ✅ | ❌ |
-| Multi-provider | ✅ swap with one env var | ❌ |
-| VS Code sidebar | ✅ built-in
-| Startup time | ~100ms |
+## Why nex-code?
+
+| Feature | nex-code | Closed-source alternatives |
+|---|---|---|
+| Free tier | ✅ Ollama Cloud flat-rate | ❌ subscription or limited quota |
+| Open models | ✅ devstral, Kimi K2, Qwen3 | ❌ vendor-locked |
+| Local Ollama | ✅ | ❌ |
+| Multi-provider | ✅ swap with one env var | ❌ |
+| VS Code sidebar | ✅ built-in | partial |
+| Startup time | ~100ms | 1–4s |
 | Runtime deps | 2 | heavy |
 | Infra tools | ✅ SSH, Docker, K8s built-in | ❌ |
 
@@ -101,7 +101,7 @@ npm update -g nex-code
 
 **Open-model first.** Not locked to any single vendor. Tool tiers (`essential / standard / full`) adapt automatically to the model's capability level, so smaller models don't receive tool schemas they can't handle. A 5-layer auto-fix loop catches and retries malformed tool calls without user intervention.
 
-**Smart model routing.** The built-in `/benchmark` system tests all configured models against
+**Smart model routing.** The built-in `/benchmark` system tests all configured models against 62 real nex-code tool-calling tasks across 5 task categories. The results feed a routing table so nex-code can automatically switch to the best model for the detected task type:
 
 | Detected task | Routed model (example) |
 | ------------------------- | --------------------------- |
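The routing table described in the hunk above reduces to a plain lookup keyed by the detected task type. A minimal sketch in JavaScript; the task names, model choices, and `routeModel` helper are illustrative assumptions, not nex-code's actual internals:

```javascript
// Hypothetical sketch of benchmark-driven model routing.
// Task categories and model assignments are assumptions for illustration;
// the real table is produced by /benchmark results.
const routingTable = {
  coding:   "qwen3-vl:235b-instruct",
  sysadmin: "devstral-2:123b",
  data:     "qwen3-vl:235b",
};

function routeModel(detectedTask, fallback = "qwen3-vl:235b-instruct") {
  // Task types that were never benchmarked fall back to the default model.
  return routingTable[detectedTask] ?? fallback;
}

console.log(routeModel("sysadmin"));    // devstral-2:123b
console.log(routeModel("translation")); // falls back to the default model
```

The point is only that, once the benchmark has run, routing is a constant-time lookup with a safe default.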
@@ -125,7 +125,7 @@ The verify phase catches incomplete work before reporting "done" — if tests fa
 
 **Lightweight.** 2 runtime dependencies (`axios`, `dotenv`). Starts in ~100ms. No Python, no heavy runtime, no daemon process.
 
-**Server-aware from the first message.** When your prompt contains a URL whose domain matches a configured SSH profile (e.g. `
+**Server-aware from the first message.** When your prompt contains a URL whose domain matches a configured SSH profile (e.g. `server.example.com` → profile `server`), nex-code probes the server before responding — listing ports, running processes, and data directories. The model receives this topology before its first token, so it goes straight to `ssh_exec` instead of reading local files.
 
 **Few-shot behavior injection.** On each session start, nex-code injects a short example of the correct tool sequence for the detected task type (sysadmin → check remote logs first; coding → read file before editing; data → explain before rewriting). Works across all models without fine-tuning. Customize with your own high-scoring sessions via `npm run extract-examples`.
 
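The domain-to-profile match described in the hunk above boils down to comparing a URL's hostname against each configured SSH profile's host. A hedged sketch: the profile data and the `matchProfile` helper are invented for illustration, not nex-code's real probe logic.

```javascript
// Illustrative sketch of URL -> SSH-profile matching.
// Profile names and hosts are assumptions, not real configuration.
const sshProfiles = {
  server: { host: "server.example.com", user: "deploy" },
  prod:   { host: "94.130.37.43",       user: "deploy" },
};

function matchProfile(prompt) {
  // Pull the first URL out of the prompt, then compare its hostname
  // against each configured profile's host.
  const url = prompt.match(/https?:\/\/[^\s]+/)?.[0];
  if (!url) return null;
  const { hostname } = new URL(url);
  return Object.entries(sshProfiles)
    .find(([, profile]) => profile.host === hostname)?.[0] ?? null;
}

console.log(matchProfile("why is https://server.example.com/api slow?")); // "server"
```

Once a profile matches, the server probe (ports, processes, data directories) can run against that profile's credentials before the model sees the prompt.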
@@ -156,38 +156,27 @@ The verify phase catches incomplete work before reporting "done" — if tests fa
 ## Ollama Cloud — Recommended Model Setup
 
 nex-code was built with Ollama Cloud as its primary provider. No subscription, no billing surprises.
-Rankings are based on nex-code's own `/benchmark` —
+Rankings are based on nex-code's own `/benchmark` — a 14-task quick benchmark against real nex-code schemas (62 tasks in a full run).
 
 ### Flat-Rate / Pay-as-you-go
 
 <!-- nex-benchmark-start -->
-<!-- Updated: 2026-
+<!-- Updated: 2026-04-01 — run `/benchmark --discover` after new Ollama Cloud releases -->
 
 | Rank | Model | Score | Avg Latency | Context | Best For |
 |---|---|---|---|---|---|
-| 🥇 | `qwen3-vl:235b` | **
-| 🥈 | `qwen3-vl:235b
-| 🥉 | `
-| — | `
-| — | `
-| — | `qwen3
-| — | `qwen3
-| — | `
-| — | `
-| — | `
-| — | `glm-4.7` |
-| — | `kimi-k2-thinking` |
-| — | `ministral-3:14b` | 65.8 | 3.8s | 131K | — |
-| — | `devstral-small-2:24b` | 65.5 | 2.3s | 131K | Fast sub-agents, simple lookups |
-| — | `ministral-3:3b` | 65.4 | 2.2s | 32K | — |
-| — | `kimi-k2.5` | 65.2 | 3.5s | 256K | Large repos — faster than k2:1t |
-| — | `kimi-k2:1t` | 65.2 | 4.2s | 256K | Large repos (>100K tokens) |
-| — | `minimax-m2.1` | 64.2 | 5.4s | 200K | — |
-| — | `glm-4.6` | 63.9 | 4.9s | 131K | — |
-| — | `qwen3-coder:480b` | 63.2 | 14.1s | 131K | Heavy coding sessions, large context |
-| — | `nemotron-3-super` | 61.3 | 2.6s | 256K | — |
-| — | `gpt-oss:20b` | 60.9 | 2.5s | 131K | Fast small model, good overall score |
-| — | `mistral-large-3:675b` | 60.8 | 3.8s | 131K | — |
+| 🥇 | `qwen3-vl:235b-instruct` | **79.9** | 3.8s | 131K | Best latency/score balance — recommended default |
+| 🥈 | `qwen3-vl:235b` | 79.4 | 12.3s | 131K | Overall #1 — frontier tool selection, data + agentic tasks |
+| 🥉 | `qwen3-coder-next` | 74.9 | 1.7s | 256K | — |
+| — | `rnj-1:8b` | 74.6 | 2.5s | 131K | — |
+| — | `ministral-3:8b` | 74.2 | 1.2s | 131K | Fastest strong model — 1.2s latency, 70+ score |
+| — | `qwen3.5:397b` | 72.8 | 2.1s | 256K | — |
+| — | `qwen3-next:80b` | 71.3 | 10.3s | 131K | — |
+| — | `devstral-2:123b` | 69.9 | 1.6s | 131K | Sysadmin + SSH tasks, reliable coding |
+| — | `minimax-m2.7` | 69.4 | 4.1s | 200K | — |
+| — | `glm-5` | 69.0 | 7.6s | 131K | — |
+| — | `glm-4.7` | 67.8 | 3.7s | 131K | — |
+| — | `kimi-k2-thinking` | 62.0 | 2.4s | 256K | — |
 
 > Rankings are nex-code-specific: tool name accuracy, argument validity, schema compliance.
 > Toolathon (Minimax SOTA) measures different task types — run `/benchmark --discover` after model releases.
@@ -208,14 +197,15 @@ NEX_FAST_MODEL=devstral-small-2:24b # quick lookups, fast sub-agents
 ### Run the benchmark yourself
 
 ```bash
-/benchmark            # full run:
-/benchmark --quick    # fast run:
+/benchmark            # full run: 62 tasks × 5 models
+/benchmark --quick    # fast run: 14 tasks × 3 models (doubled from 7 for better resolution)
 /benchmark --discover # detect new Ollama Cloud models, benchmark + auto-update README
 /benchmark --models=minimax-m2.7:cloud,qwen3-coder:480b
 /benchmark --history  # show OpenClaw nightly trend
 ```
 
 Switch anytime: `/model devstral-2:123b` or update `DEFAULT_MODEL` in `.env`.
+The best discovered models are automatically saved to `~/.nex-code/.env` so they persist globally across all your projects.
 Auto-discovery runs weekly via the scheduled improvement task and updates this table automatically.
 
 ---
@@ -672,7 +662,7 @@ Or create `.nex/servers.json` manually:
 {
   "prod": {
     "host": "94.130.37.43",
-    "user": "
+    "user": "deploy",
     "port": 22,
     "key": "~/.ssh/id_rsa",
     "os": "almalinux9",
@@ -728,7 +718,7 @@ Create `.nex/deploy.json` (or use `/init deploy`):
   "api": {
     "server": "prod",
     "method": "git",
-    "remote_path": "/home/
+    "remote_path": "/home/deploy/my-api",
     "branch": "main",
     "deploy_script": "npm ci --omit=dev && sudo systemctl restart my-api",
     "health_check": "systemctl is-active my-api"
@@ -925,6 +915,16 @@ The agent follows a repeating cycle on a dedicated `autoresearch/<tag>` branch:
 /ar-clear      # reset experiment history
 ```
 
+The loop can also run **headless** — useful for unattended overnight sessions:
+
+```bash
+nex-code --task "/ar-self-improve" --no-auto-orchestrate --max-turns 200
+```
+
+`/ar-self-improve` uses nex-code's own 14-task quick benchmark as the fitness metric. Each experiment that raises the average score above the session baseline is kept; all others are reverted with `git reset`. The benchmark output includes a **Failing tasks** section that names which tasks each model got wrong, making root causes immediately visible.
+
+> **Self-improvement history** (2026-03-31): baseline 86.7 → **92.9** (+6.2 pts) in one session. Key fix: rewording the `edit_file` tool description so models call it directly instead of first calling `read_file`. `rnj-1:8b` jumped from 77.1 → 97.9 on that change alone.
+
 ### Memory
 
 Persistent project memory that survives across sessions: