nex-code 0.5.11 → 0.5.13

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
package/README.md CHANGED
@@ -76,42 +76,58 @@ On first launch, an interactive setup wizard guides you through provider and cre
  Rankings from nex-code's own `/benchmark` — 62 tasks testing tool selection, argument validity, and schema compliance.

  <!-- nex-benchmark-start -->
- <!-- Updated: 2026-04-09 — run `/benchmark --discover` after new Ollama Cloud releases -->
+ <!-- Updated: 2026-04-12 — run `/benchmark --discover` after new Ollama Cloud releases -->

  | Rank | Model | Score | Avg Latency | Context | Best For |
  |---|---|---|---|---|---|
- | 🥇 | `qwen3-vl:235b` | **80.1** | 12.9s | 131K | Overall #1 — frontier tool selection, data + agentic tasks |
- | 🥈 | `rnj-1:8b` | 78.6 | 2.7s | 131K | — |
- | 🥉 | `qwen3-vl:235b-instruct` | 78.4 | 7.3s | 131K | Best latency/score balance recommended default |
- | — | `nemotron-3-super` | 76.2 | 2.8s | 256K | — |
- | — | `deepseek-v3.1:671b` | 74.8 | 5.6s | 131K | — |
- | — | `qwen3-coder-next` | 74.5 | 2.9s | 256K | — |
- | — | `ministral-3:3b` | 73.6 | 2.4s | 32K | — |
- | — | `ministral-3:8b` | 72.6 | 1.9s | 131K | Fastest strong model — 2.2s latency, 70+ score |
- | — | `qwen3-next:80b` | 72.2 | 11.5s | 131K | — |
- | — | `mistral-large-3:675b` | 70.9 | 5.7s | 131K | — |
- | — | `devstral-small-2:24b` | 70.9 | 2.8s | 131K | Fast sub-agents, simple lookups |
- | — | `devstral-2:123b` | 70.9 | 4.0s | 131K | Sysadmin + SSH tasks, reliable coding |
- | — | `minimax-m2.1` | 70.7 | 4.3s | 200K | — |
- | — | `gpt-oss:20b` | 70.2 | 3.9s | 131K | Fast small model, good overall score |
- | — | `kimi-k2:1t` | 69.9 | 5.0s | 256K | Large repos (>100K tokens) |
- | — | `kimi-k2.5` | 69 | 5.8s | 256K | Large repos faster than k2:1t |
- | — | `kimi-k2-thinking` | 69 | 4.0s | 256K | |
- | — | `glm-5` | 69 | 7.2s | 131K | — |
- | — | `glm-5.1` | 68.8 | 9.7s | ? | |
- | — | `gemma4:31b` | 68.7 | 3.3s | ? | — |
- | — | `minimax-m2.7` | 68.6 | 5.1s | 200K | — |
- | — | `nemotron-3-nano:30b` | 67.8 | 2.9s | 131K | — |
- | — | `ministral-3:14b` | 67.7 | 2.3s | 131K | — |
- | — | `qwen3-coder:480b` | 67.2 | 7.7s | 131K | Heavy coding sessions, large context |
- | — | `qwen3.5:397b` | 67.1 | 7.2s | 256K | |
- | — | `glm-4.6` | 65.2 | 7.5s | 131K | — |
- | — | `gpt-oss:120b` | 64.6 | 3.7s | 131K | — |
+ | 🥇 | `qwen3-vl:235b` | **100** | 13.4s | 131K | Overall #1 — frontier tool selection, data + agentic tasks |
+ | 🥈 | `qwen3-vl:235b-instruct` | 97.5 | 7.7s | 131K | Best latency/score balance, recommended default |
+ | 🥉 | `glm-4.6` | 97.5 | 26.8s | 131K | — |
+ | — | `qwen3-next:80b` | 97.2 | 8.0s | 131K | — |
+ | — | `deepseek-v3.1:671b` | 94.5 | 3.1s | 131K | — |
+ | — | `qwen3-coder-next` | 94.3 | 2.2s | 256K | — |
+ | — | `qwen3.5:397b` | 94.3 | 4.2s | 256K | — |
+ | — | `ministral-3:8b` | 94.3 | 1.6s | 131K | Fastest strong model (1.6s latency, 94+ score) |
+ | — | `minimax-m2.7` | 92.9 | 4.7s | 200K | — |
+ | — | `rnj-1:8b` | 92.2 | 2.1s | 131K | — |
+ | — | `glm-5` | 91.7 | 3.6s | 131K | — |
+ | — | `nemotron-3-super` | 91.4 | 1.7s | 256K | — |
+ | — | `ministral-3:14b` | 91.2 | 1.5s | 131K | — |
+ | — | `qwen3-coder:480b` | 91.0 | 8.3s | 131K | Heavy coding sessions, large context |
+ | — | `glm-4.7` | 90.7 | 4.1s | 131K | — |
+ | — | `devstral-2:123b` | 90.3 | 8.1s | 131K | Sysadmin + SSH tasks, reliable coding |
+ | — | `kimi-k2:1t` | 90.3 | 3.7s | 256K | Large repos (>100K tokens) |
+ | — | `minimax-m2` | 90.0 | 3.4s | 200K | — |
+ | — | `devstral-small-2:24b` | 88.8 | 6.8s | 131K | Fast sub-agents, simple lookups |
+ | — | `kimi-k2-thinking` | 88.7 | 4.3s | 256K | — |
+ | — | `minimax-m2.1` | 88.1 | 2.5s | 200K | — |
+ | — | `glm-5.1` | 87.2 | 5.0s | ? | — |
+ | — | `kimi-k2.5` | 86.2 | 4.8s | 256K | Large repos, faster than k2:1t |
+ | — | `gemma4:31b` | 85.2 | 4.8s | ? | — |
+ | — | `minimax-m2.5` | 84.2 | 6.8s | 131K | Multi-agent, large context |
+ | — | `gpt-oss:120b` | 83.9 | 2.8s | 131K | — |
+ | — | `mistral-large-3:675b` | 82.5 | 7.0s | 131K | — |
+ | — | `ministral-3:3b` | 82.4 | 1.3s | 32K | — |
+ | — | `gpt-oss:20b` | 81.1 | 1.5s | 131K | Fast small model, good overall score |
+ | — | `nemotron-3-nano:30b` | 78.3 | 2.3s | 131K | — |
+ | — | `gemini-3-flash-preview` | 76.5 | 3.3s | 131K | — |
+ | — | `deepseek-v3.2` | 65.4 | 14.3s | 131K | — |
+ | — | `cogito-2.1:671b` | 65.2 | 3.4s | 131K | — |

  > Rankings are nex-code-specific: tool name accuracy, argument validity, schema compliance.
  > Toolathon (Minimax SOTA) measures different task types — run `/benchmark --discover` after model releases.
  <!-- nex-benchmark-end -->

+ <!-- nex-routing-start -->
+ <!-- Updated: 2026-04-15 -->
+
+ **Model routing by task type** (auto-updated by `/benchmark --all`):
+
+ | Category | Model | Score |
+ |---|---|---|
+ | coding | `new` | 90/100 |
+ <!-- nex-routing-end -->
+
  **Recommended `.env`:**

  ```env
@@ -418,7 +434,7 @@ See [DEVELOPMENT.md](DEVELOPMENT.md) for full architecture details.
  npm test              # 97 suites, 3920 tests
  npm run typecheck     # TypeScript noEmit check
  npm run benchmark:gate     # 7-task smoke test (blocks push on regression)
- npm run benchmark:reallife # 35 real-world tasks across 7 categories
+ npm run benchmark:reallife # 35 real-life tasks across 7 categories
  ```

  ---