ada-agent 0.1.0 → 0.2.0
- package/README.md +14 -7
- package/bench/README.md +88 -88
- package/bench/swebench.mjs +242 -242
- package/docs/architecture.md +163 -139
- package/docs/architecture.svg +73 -73
- package/docs/cloudflare.md +81 -0
- package/docs/connectors.md +49 -48
- package/docs/integrations.md +62 -59
- package/package.json +65 -64
- package/src/client/catalog.json +1 -0
- package/src/client/cli.ts +1262 -1253
- package/src/client/models-dev.ts +106 -52
- package/src/selfcheck.ts +26 -0
- package/src/server/config.ts +65 -58
- package/src/server/providers/openai-compat.ts +78 -76
- package/src/server/providers/registry.ts +32 -31
- package/src/server/router.ts +33 -29
- package/src/shared/types.ts +21 -20
package/README.md
CHANGED
@@ -52,19 +52,25 @@ The backend proxies any OpenAI-compatible upstream and translates the one that i
 |---|---|---|
 | OpenAI | `gpt-*`, `o*` | `OPENAI_API_KEY` |
 | Anthropic | `claude-*` | `ANTHROPIC_API_KEY` |
-| Google Gemini | `gemini-*` | `GEMINI_API_KEY` |
-| Mistral | `mistral
-| Groq | — | `GROQ_API_KEY` |
+| Google Gemini | `gemini-*`, `gemma-*` | `GEMINI_API_KEY` |
+| Mistral | `mistral-*`, `codestral-*`, … | `MISTRAL_API_KEY` |
 | DeepSeek | `deepseek-*` | `DEEPSEEK_API_KEY` |
-| Together | — | `TOGETHER_API_KEY` |
 | xAI (Grok) | `grok-*` | `XAI_API_KEY` |
-| DashScope (Qwen) |
+| DashScope (Qwen) | `qwen-*`, `qwq-*` | `DASHSCOPE_API_KEY` |
+| **Cloudflare** (Workers AI / AI Gateway) | `@cf/*` (e.g. `@cf/moonshotai/kimi-k2.7-code`) | `CLOUDFLARE_API_TOKEN` (+ `CLOUDFLARE_ACCOUNT_ID`) |
+| Groq | `groq/<model>` | `GROQ_API_KEY` |
+| Together | `together/<model>` | `TOGETHER_API_KEY` |
 | OpenRouter | everything else | `OPENROUTER_API_KEY` |
 | **Ollama (local)** | `name:tag` (e.g. `qwen2.5-coder:latest`) | *keyless* |
 
-Routing: a model id containing `:` → local Ollama;
+Routing: a model id containing `:` → local Ollama; `@cf/*` → Cloudflare; `groq/…`/`together/…` pick
+those providers (their model names — `llama-3.3`, `gemma2` — are ambiguous, so they're explicit);
+otherwise by prefix; an explicit `provider`
 field always wins. Set only the keys you have — the rest stay dormant (vendor SDKs load lazily).
 
+**Cloudflare** (Workers AI or AI Gateway) is a step-by-step of its own — see
+**[docs/cloudflare.md](docs/cloudflare.md)**.
+
 ---
 
 ## Install
@@ -153,7 +159,8 @@ shows in the prompt line. In **ask** mode each gated tool prompts with what it w
 **auto** runs tools without asking (destructive `bash` still confirms). `--yolo` starts in **auto**.
 
 **Subcommands:** `ada mcp …` (connectors) · `ada skill add <url>` · `ada worktree add <name>` ·
-`ada
+`ada catalog [provider]` (offline model/price catalog) · `ada serve` (HTTP API) · `ada share`
+(view a session) · `ada acp` (editor bridge). See
 [docs/integrations.md](docs/integrations.md) for the HTTP API, the typed SDK, and ACP.
 
 **Orchestration strategies** — the harness runs pluggable agent architectures (`--strategy <name>`
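The routing rules added to the README in the hunk above can be read as a precedence list. The following is a minimal sketch of that reading, not the package's actual router: the `Provider` names, `PREFIX_RULES`, and `routeModel` are hypothetical, the order of the checks is an assumption, and `o*` is approximated as "`o` followed by a digit".

```typescript
// Hypothetical provider names; the real registry in ada-agent may differ.
type Provider =
  | "openai" | "anthropic" | "gemini" | "mistral" | "deepseek" | "xai"
  | "dashscope" | "cloudflare" | "groq" | "together" | "openrouter" | "ollama";

// Prefix table taken from the README rows in the diff above.
// Assumption: "o*" means "o" followed by a digit.
const PREFIX_RULES: Array<[RegExp, Provider]> = [
  [/^(gpt-|o\d)/, "openai"],
  [/^claude-/, "anthropic"],
  [/^(gemini-|gemma-)/, "gemini"],
  [/^(mistral-|codestral-)/, "mistral"],
  [/^deepseek-/, "deepseek"],
  [/^grok-/, "xai"],
  [/^(qwen-|qwq-)/, "dashscope"],
];

function routeModel(modelId: string, explicitProvider?: Provider): Provider {
  // An explicit provider field always wins.
  if (explicitProvider) return explicitProvider;
  // name:tag ids go to local Ollama.
  if (modelId.includes(":")) return "ollama";
  // Cloudflare Workers AI / AI Gateway ids.
  if (modelId.startsWith("@cf/")) return "cloudflare";
  // Groq and Together are selected explicitly because their model names are ambiguous.
  if (modelId.startsWith("groq/")) return "groq";
  if (modelId.startsWith("together/")) return "together";
  // Otherwise route by known prefix.
  for (const [pattern, provider] of PREFIX_RULES) {
    if (pattern.test(modelId)) return provider;
  }
  // Everything else falls through to OpenRouter.
  return "openrouter";
}
```

Under these assumptions, `routeModel("qwen2.5-coder:latest")` selects Ollama because of the `:`, while an untagged id that matches no prefix falls through to OpenRouter.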
package/bench/README.md
CHANGED
@@ -1,88 +1,88 @@
-# Benchmarking ada on SWE-bench Verified
-
-ada can run **SWE-bench Verified** — give the agent a real GitHub issue, let it edit the repo, and
-score whether the repo's test suite passes. This directory has the **generation** half (ada produces
-patches); **scoring** is the official `swebench` Docker harness — we don't reimplement it, because
-that's the only way to get correct, comparable numbers.
-
-```
-dataset (issues) ──▶ bench/swebench.mjs ──▶ predictions.jsonl ──▶ official swebench eval ──▶ resolved %
-(ada edits the repo, (Docker: apply patch +
-per isolated clone) test_patch, run tests)
-```
-
-## Prerequisites
-
-- **ada-server running with provider keys** — the harness drives `ada -p`, which needs the backend:
-```bash
-export ANTHROPIC_API_KEY=sk-ant-... # and/or OPENAI_API_KEY, etc.
-ada-server # http://localhost:8787
-```
-- `git` + network (the harness clones each task repo; clones are cached under `~/.cache/ada-swebench`).
-- For scoring: **Docker** and the **`swebench`** Python package (`pip install swebench`). Allow plenty
-of disk — the official images are large.
-
-## 1. Get the dataset
-
-SWE-bench Verified (500 instances) lives on Hugging Face. Export it to JSONL once:
-
-```python
-# pip install datasets
-from datasets import load_dataset
-load_dataset("princeton-nlp/SWE-bench_Verified", split="test").to_json("swe-bench-verified.jsonl")
-```
-
-## 2. Generate predictions with ada
-
-```bash
-# smoke test on 5 instances first
-node bench/swebench.mjs --dataset swe-bench-verified.jsonl --model claude-opus-4-8 \
---out runs/opus --limit 5 --concurrency 2
-
-# a specific instance, or the whole set
-node bench/swebench.mjs --dataset swe-bench-verified.jsonl --model claude-opus-4-8 \
---out runs/opus --instances astropy__astropy-12907
-```
-
-For each instance it clones the repo at `base_commit` into an isolated dir, hands ada the issue text
-(`ada -p … --json`, auto-approve), captures `git diff` as the model patch, and appends an
-official-format line to `runs/opus/predictions.jsonl`:
-
-```json
-{"instance_id": "...", "model_name_or_path": "claude-opus-4-8", "model_patch": "diff --git ..."}
-```
-
-It also writes `meta.jsonl` (seconds, patch size, token/cost usage per instance). Re-running **resumes**
-— instances already in `predictions.jsonl` are skipped. Flags: `--limit N`, `--instances a,b`,
-`--concurrency` (default 2), `--timeout` seconds per instance (default 1200), `--out <dir>`.
-
-Swap `--model` to compare models on the same tasks (`gpt-...`, `qwen2.5-coder:latest`, …) — ada routes
-each to the right provider.
-
-## 3. Score with the official harness
-
-```bash
-python -m swebench.harness.run_evaluation \
---dataset_name princeton-nlp/SWE-bench_Verified \
---predictions_path runs/opus/predictions.jsonl \
---max_workers 4 --run_id ada-opus
-```
-
-It applies each patch + the held-out `test_patch` in Docker, runs the `FAIL_TO_PASS` / `PASS_TO_PASS`
-tests, and reports the **resolved rate** plus a per-instance breakdown.
-
-## Notes & honest caveats
-
-- ada is told **not to touch tests** (the grader supplies its own); the patch is whatever ada changed
-in the source.
-- An empty patch (ada gave up / errored) is still recorded — it just counts as unresolved.
-- This measures ada's default `react` loop. Try `ADA_MODEL`, a different `--model`, or wire a
-`--strategy` into the harness to compare setups.
-- Other benchmarks (HumanEval, Aider polyglot) fit the same generate-then-score shape; ask and we'll
-add a sibling script.
-
-## Quick check
-
-```bash
-node bench/swebench.mjs --selftest # offline: validates the prompt/prediction/arg helpers
-```
+# Benchmarking ada on SWE-bench Verified
+
+ada can run **SWE-bench Verified** — give the agent a real GitHub issue, let it edit the repo, and
+score whether the repo's test suite passes. This directory has the **generation** half (ada produces
+patches); **scoring** is the official `swebench` Docker harness — we don't reimplement it, because
+that's the only way to get correct, comparable numbers.
+
+```
+dataset (issues) ──▶ bench/swebench.mjs ──▶ predictions.jsonl ──▶ official swebench eval ──▶ resolved %
+(ada edits the repo, (Docker: apply patch +
+per isolated clone) test_patch, run tests)
+```
+
+## Prerequisites
+
+- **ada-server running with provider keys** — the harness drives `ada -p`, which needs the backend:
+```bash
+export ANTHROPIC_API_KEY=sk-ant-... # and/or OPENAI_API_KEY, etc.
+ada-server # http://localhost:8787
+```
+- `git` + network (the harness clones each task repo; clones are cached under `~/.cache/ada-swebench`).
+- For scoring: **Docker** and the **`swebench`** Python package (`pip install swebench`). Allow plenty
+of disk — the official images are large.
+
+## 1. Get the dataset
+
+SWE-bench Verified (500 instances) lives on Hugging Face. Export it to JSONL once:
+
+```python
+# pip install datasets
+from datasets import load_dataset
+load_dataset("princeton-nlp/SWE-bench_Verified", split="test").to_json("swe-bench-verified.jsonl")
+```
+
+## 2. Generate predictions with ada
+
+```bash
+# smoke test on 5 instances first
+node bench/swebench.mjs --dataset swe-bench-verified.jsonl --model claude-opus-4-8 \
+--out runs/opus --limit 5 --concurrency 2
+
+# a specific instance, or the whole set
+node bench/swebench.mjs --dataset swe-bench-verified.jsonl --model claude-opus-4-8 \
+--out runs/opus --instances astropy__astropy-12907
+```
+
+For each instance it clones the repo at `base_commit` into an isolated dir, hands ada the issue text
+(`ada -p … --json`, auto-approve), captures `git diff` as the model patch, and appends an
+official-format line to `runs/opus/predictions.jsonl`:
+
+```json
+{"instance_id": "...", "model_name_or_path": "claude-opus-4-8", "model_patch": "diff --git ..."}
+```
+
+It also writes `meta.jsonl` (seconds, patch size, token/cost usage per instance). Re-running **resumes**
+— instances already in `predictions.jsonl` are skipped. Flags: `--limit N`, `--instances a,b`,
+`--concurrency` (default 2), `--timeout` seconds per instance (default 1200), `--out <dir>`.
+
+Swap `--model` to compare models on the same tasks (`gpt-...`, `qwen2.5-coder:latest`, …) — ada routes
+each to the right provider.
+
+## 3. Score with the official harness
+
+```bash
+python -m swebench.harness.run_evaluation \
+--dataset_name princeton-nlp/SWE-bench_Verified \
+--predictions_path runs/opus/predictions.jsonl \
+--max_workers 4 --run_id ada-opus
+```
+
+It applies each patch + the held-out `test_patch` in Docker, runs the `FAIL_TO_PASS` / `PASS_TO_PASS`
+tests, and reports the **resolved rate** plus a per-instance breakdown.
+
+## Notes & honest caveats
+
+- ada is told **not to touch tests** (the grader supplies its own); the patch is whatever ada changed
+in the source.
+- An empty patch (ada gave up / errored) is still recorded — it just counts as unresolved.
+- This measures ada's default `react` loop. Try `ADA_MODEL`, a different `--model`, or wire a
+`--strategy` into the harness to compare setups.
+- Other benchmarks (HumanEval, Aider polyglot) fit the same generate-then-score shape; ask and we'll
+add a sibling script.
+
+## Quick check
+
+```bash
+node bench/swebench.mjs --selftest # offline: validates the prompt/prediction/arg helpers
+```
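The bench README in the hunk above says that re-running the harness resumes by skipping instances already present in `predictions.jsonl`. One way such a check could work is sketched below. This is an illustration, not the code in `bench/swebench.mjs`: `completedIds` and `pendingIds` are hypothetical helpers, and silently ignoring malformed lines is an assumption.

```typescript
// Shape of one prediction line, as shown in the README's JSON example.
interface Prediction {
  instance_id: string;
  model_name_or_path: string;
  model_patch: string;
}

// Collect the instance ids already present in a predictions.jsonl body.
// Assumption: blank and malformed lines are skipped rather than treated as errors.
function completedIds(jsonl: string): Set<string> {
  const done = new Set<string>();
  for (const line of jsonl.split("\n")) {
    const trimmed = line.trim();
    if (!trimmed) continue;
    try {
      const row = JSON.parse(trimmed) as Partial<Prediction>;
      if (typeof row.instance_id === "string") done.add(row.instance_id);
    } catch {
      // Ignore lines that do not parse, e.g. a partially written trailing line.
    }
  }
  return done;
}

// Return the ids that still need a run, preserving dataset order.
function pendingIds(all: string[], jsonl: string): string[] {
  const done = completedIds(jsonl);
  return all.filter((id) => !done.has(id));
}
```

Because an empty patch is still recorded as a prediction line, an instance that produced no diff would count as done in this sketch and would not be retried on resume.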