ada-agent 0.1.0 → 0.2.0

package/README.md CHANGED
@@ -52,19 +52,25 @@ The backend proxies any OpenAI-compatible upstream and translates the one that i
  |---|---|---|
  | OpenAI | `gpt-*`, `o*` | `OPENAI_API_KEY` |
  | Anthropic | `claude-*` | `ANTHROPIC_API_KEY` |
- | Google Gemini | `gemini-*` | `GEMINI_API_KEY` |
- | Mistral | `mistral-*` | `MISTRAL_API_KEY` |
- | Groq | — | `GROQ_API_KEY` |
+ | Google Gemini | `gemini-*`, `gemma-*` | `GEMINI_API_KEY` |
+ | Mistral | `mistral-*`, `codestral-*`, … | `MISTRAL_API_KEY` |
  | DeepSeek | `deepseek-*` | `DEEPSEEK_API_KEY` |
- | Together | — | `TOGETHER_API_KEY` |
  | xAI (Grok) | `grok-*` | `XAI_API_KEY` |
- | DashScope (Qwen) | | `DASHSCOPE_API_KEY` |
+ | DashScope (Qwen) | `qwen-*`, `qwq-*` | `DASHSCOPE_API_KEY` |
+ | **Cloudflare** (Workers AI / AI Gateway) | `@cf/*` (e.g. `@cf/moonshotai/kimi-k2.7-code`) | `CLOUDFLARE_API_TOKEN` (+ `CLOUDFLARE_ACCOUNT_ID`) |
+ | Groq | `groq/<model>` | `GROQ_API_KEY` |
+ | Together | `together/<model>` | `TOGETHER_API_KEY` |
  | OpenRouter | everything else | `OPENROUTER_API_KEY` |
  | **Ollama (local)** | `name:tag` (e.g. `qwen2.5-coder:latest`) | *keyless* |
 
- Routing: a model id containing `:` → local Ollama; otherwise by prefix; an explicit `provider`
+ Routing: a model id containing `:` → local Ollama; `@cf/*` Cloudflare; `groq/…`/`together/…` pick
+ those providers (their model names — `llama-3.3`, `gemma2` — are ambiguous, so they're explicit);
+ otherwise by prefix; an explicit `provider`
  field always wins. Set only the keys you have — the rest stay dormant (vendor SDKs load lazily).
 
+ **Cloudflare** (Workers AI or AI Gateway) is a step-by-step of its own — see
+ **[docs/cloudflare.md](docs/cloudflare.md)**.
+
  ---
 
  ## Install
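
The routing rule added in the hunk above can be read as an ordered series of checks. The following is a minimal Python sketch of that reading, not the package's actual implementation. The function name `route`, the lowercase provider labels, the check order beyond what the README states, and the interpretation of `o*` as "the letter o followed by a digit" are all assumptions made for illustration.

```python
import re

# Prefix table transcribed from the 0.2.0 side of the provider table.
# The provider labels on the right are hypothetical identifiers.
PREFIX_ROUTES = [
    (("gpt-",), "openai"),
    (("claude-",), "anthropic"),
    (("gemini-", "gemma-"), "gemini"),
    (("mistral-", "codestral-"), "mistral"),
    (("deepseek-",), "deepseek"),
    (("grok-",), "xai"),
    (("qwen-", "qwq-"), "dashscope"),
]


def route(model, provider=None):
    """Pick a provider label for a model id, following the README's stated order."""
    if provider:                      # an explicit `provider` field always wins
        return provider
    if ":" in model:                  # `name:tag` ids go to local Ollama
        return "ollama"
    if model.startswith("@cf/"):      # Cloudflare Workers AI / AI Gateway
        return "cloudflare"
    if model.startswith("groq/"):     # explicit, because Groq model names are ambiguous
        return "groq"
    if model.startswith("together/"):
        return "together"
    if re.match(r"o\d", model):       # assumption: `o*` means o1, o3, ... style ids
        return "openai"
    for prefixes, name in PREFIX_ROUTES:
        if model.startswith(prefixes):
            return name
    return "openrouter"               # "everything else" in the table


if __name__ == "__main__":
    for model_id in ("qwen2.5-coder:latest", "groq/llama-3.3", "llama-3.3"):
        print(model_id, "->", route(model_id))
```

Under these assumptions a bare `llama-3.3` falls through to OpenRouter, which matches the README's point that Groq and Together models must be addressed with their explicit `groq/` and `together/` prefixes.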
@@ -153,7 +159,8 @@ shows in the prompt line. In **ask** mode each gated tool prompts with what it w
  **auto** runs tools without asking (destructive `bash` still confirms). `--yolo` starts in **auto**.
 
  **Subcommands:** `ada mcp …` (connectors) · `ada skill add <url>` · `ada worktree add <name>` ·
- `ada serve` (HTTP API) · `ada share` (view a session) · `ada acp` (editor bridge). See
+ `ada catalog [provider]` (offline model/price catalog) · `ada serve` (HTTP API) · `ada share`
+ (view a session) · `ada acp` (editor bridge). See
  [docs/integrations.md](docs/integrations.md) for the HTTP API, the typed SDK, and ACP.
 
  **Orchestration strategies** — the harness runs pluggable agent architectures (`--strategy <name>`
package/bench/README.md CHANGED
@@ -1,88 +1,88 @@
- # Benchmarking ada on SWE-bench Verified
-
- ada can run **SWE-bench Verified** — give the agent a real GitHub issue, let it edit the repo, and
- score whether the repo's test suite passes. This directory has the **generation** half (ada produces
- patches); **scoring** is the official `swebench` Docker harness — we don't reimplement it, because
- that's the only way to get correct, comparable numbers.
-
- ```
- dataset (issues) ──▶ bench/swebench.mjs ──▶ predictions.jsonl ──▶ official swebench eval ──▶ resolved %
-                      (ada edits the repo,                         (Docker: apply patch +
-                       per isolated clone)                          test_patch, run tests)
- ```
-
- ## Prerequisites
-
- - **ada-server running with provider keys** — the harness drives `ada -p`, which needs the backend:
-   ```bash
-   export ANTHROPIC_API_KEY=sk-ant-... # and/or OPENAI_API_KEY, etc.
-   ada-server # http://localhost:8787
-   ```
- - `git` + network (the harness clones each task repo; clones are cached under `~/.cache/ada-swebench`).
- - For scoring: **Docker** and the **`swebench`** Python package (`pip install swebench`). Allow plenty
-   of disk — the official images are large.
-
- ## 1. Get the dataset
-
- SWE-bench Verified (500 instances) lives on Hugging Face. Export it to JSONL once:
-
- ```python
- # pip install datasets
- from datasets import load_dataset
- load_dataset("princeton-nlp/SWE-bench_Verified", split="test").to_json("swe-bench-verified.jsonl")
- ```
-
- ## 2. Generate predictions with ada
-
- ```bash
- # smoke test on 5 instances first
- node bench/swebench.mjs --dataset swe-bench-verified.jsonl --model claude-opus-4-8 \
-   --out runs/opus --limit 5 --concurrency 2
-
- # a specific instance, or the whole set
- node bench/swebench.mjs --dataset swe-bench-verified.jsonl --model claude-opus-4-8 \
-   --out runs/opus --instances astropy__astropy-12907
- ```
-
- For each instance it clones the repo at `base_commit` into an isolated dir, hands ada the issue text
- (`ada -p … --json`, auto-approve), captures `git diff` as the model patch, and appends an
- official-format line to `runs/opus/predictions.jsonl`:
-
- ```json
- {"instance_id": "...", "model_name_or_path": "claude-opus-4-8", "model_patch": "diff --git ..."}
- ```
-
- It also writes `meta.jsonl` (seconds, patch size, token/cost usage per instance). Re-running **resumes**
- — instances already in `predictions.jsonl` are skipped. Flags: `--limit N`, `--instances a,b`,
- `--concurrency` (default 2), `--timeout` seconds per instance (default 1200), `--out <dir>`.
-
- Swap `--model` to compare models on the same tasks (`gpt-...`, `qwen2.5-coder:latest`, …) — ada routes
- each to the right provider.
-
- ## 3. Score with the official harness
-
- ```bash
- python -m swebench.harness.run_evaluation \
-   --dataset_name princeton-nlp/SWE-bench_Verified \
-   --predictions_path runs/opus/predictions.jsonl \
-   --max_workers 4 --run_id ada-opus
- ```
-
- It applies each patch + the held-out `test_patch` in Docker, runs the `FAIL_TO_PASS` / `PASS_TO_PASS`
- tests, and reports the **resolved rate** plus a per-instance breakdown.
-
- ## Notes & honest caveats
-
- - ada is told **not to touch tests** (the grader supplies its own); the patch is whatever ada changed
-   in the source.
- - An empty patch (ada gave up / errored) is still recorded — it just counts as unresolved.
- - This measures ada's default `react` loop. Try `ADA_MODEL`, a different `--model`, or wire a
-   `--strategy` into the harness to compare setups.
- - Other benchmarks (HumanEval, Aider polyglot) fit the same generate-then-score shape; ask and we'll
-   add a sibling script.
-
- ## Quick check
-
- ```bash
- node bench/swebench.mjs --selftest # offline: validates the prompt/prediction/arg helpers
- ```
+ # Benchmarking ada on SWE-bench Verified
+
+ ada can run **SWE-bench Verified** — give the agent a real GitHub issue, let it edit the repo, and
+ score whether the repo's test suite passes. This directory has the **generation** half (ada produces
+ patches); **scoring** is the official `swebench` Docker harness — we don't reimplement it, because
+ that's the only way to get correct, comparable numbers.
+
+ ```
+ dataset (issues) ──▶ bench/swebench.mjs ──▶ predictions.jsonl ──▶ official swebench eval ──▶ resolved %
+                      (ada edits the repo,                         (Docker: apply patch +
+                       per isolated clone)                          test_patch, run tests)
+ ```
+
+ ## Prerequisites
+
+ - **ada-server running with provider keys** — the harness drives `ada -p`, which needs the backend:
+   ```bash
+   export ANTHROPIC_API_KEY=sk-ant-... # and/or OPENAI_API_KEY, etc.
+   ada-server # http://localhost:8787
+   ```
+ - `git` + network (the harness clones each task repo; clones are cached under `~/.cache/ada-swebench`).
+ - For scoring: **Docker** and the **`swebench`** Python package (`pip install swebench`). Allow plenty
+   of disk — the official images are large.
+
+ ## 1. Get the dataset
+
+ SWE-bench Verified (500 instances) lives on Hugging Face. Export it to JSONL once:
+
+ ```python
+ # pip install datasets
+ from datasets import load_dataset
+ load_dataset("princeton-nlp/SWE-bench_Verified", split="test").to_json("swe-bench-verified.jsonl")
+ ```
+
+ ## 2. Generate predictions with ada
+
+ ```bash
+ # smoke test on 5 instances first
+ node bench/swebench.mjs --dataset swe-bench-verified.jsonl --model claude-opus-4-8 \
+   --out runs/opus --limit 5 --concurrency 2
+
+ # a specific instance, or the whole set
+ node bench/swebench.mjs --dataset swe-bench-verified.jsonl --model claude-opus-4-8 \
+   --out runs/opus --instances astropy__astropy-12907
+ ```
+
+ For each instance it clones the repo at `base_commit` into an isolated dir, hands ada the issue text
+ (`ada -p … --json`, auto-approve), captures `git diff` as the model patch, and appends an
+ official-format line to `runs/opus/predictions.jsonl`:
+
+ ```json
+ {"instance_id": "...", "model_name_or_path": "claude-opus-4-8", "model_patch": "diff --git ..."}
+ ```
+
+ It also writes `meta.jsonl` (seconds, patch size, token/cost usage per instance). Re-running **resumes**
+ — instances already in `predictions.jsonl` are skipped. Flags: `--limit N`, `--instances a,b`,
+ `--concurrency` (default 2), `--timeout` seconds per instance (default 1200), `--out <dir>`.
+
+ Swap `--model` to compare models on the same tasks (`gpt-...`, `qwen2.5-coder:latest`, …) — ada routes
+ each to the right provider.
+
+ ## 3. Score with the official harness
+
+ ```bash
+ python -m swebench.harness.run_evaluation \
+   --dataset_name princeton-nlp/SWE-bench_Verified \
+   --predictions_path runs/opus/predictions.jsonl \
+   --max_workers 4 --run_id ada-opus
+ ```
+
+ It applies each patch + the held-out `test_patch` in Docker, runs the `FAIL_TO_PASS` / `PASS_TO_PASS`
+ tests, and reports the **resolved rate** plus a per-instance breakdown.
+
+ ## Notes & honest caveats
+
+ - ada is told **not to touch tests** (the grader supplies its own); the patch is whatever ada changed
+   in the source.
+ - An empty patch (ada gave up / errored) is still recorded — it just counts as unresolved.
+ - This measures ada's default `react` loop. Try `ADA_MODEL`, a different `--model`, or wire a
+   `--strategy` into the harness to compare setups.
+ - Other benchmarks (HumanEval, Aider polyglot) fit the same generate-then-score shape; ask and we'll
+   add a sibling script.
+
+ ## Quick check
+
+ ```bash
+ node bench/swebench.mjs --selftest # offline: validates the prompt/prediction/arg helpers
+ ```
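
The bench README describes a resumable generation step: each finished instance is appended to `predictions.jsonl` as one JSON line, and a re-run skips any `instance_id` already recorded there. The real harness is the Node script `bench/swebench.mjs`, whose source is not shown in this diff, so the Python below is only a sketch of the described behavior. The helper names (`load_done_ids`, `select_pending`, `append_prediction`) are hypothetical, and applying `--limit` after the skip filter is an assumption.

```python
import json
from pathlib import Path


def load_done_ids(predictions_path):
    """Return the instance_ids already recorded (empty set if the file is missing)."""
    path = Path(predictions_path)
    if not path.exists():
        return set()
    done = set()
    with path.open(encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if line:  # tolerate blank lines between records
                done.add(json.loads(line)["instance_id"])
    return done


def select_pending(instances, done, only=None, limit=None):
    """Drop finished rows, keep only ids in `only` (if given), then cap at `limit`."""
    pending = [
        row for row in instances
        if row["instance_id"] not in done
        and (only is None or row["instance_id"] in only)
    ]
    return pending if limit is None else pending[:limit]


def append_prediction(predictions_path, instance_id, model, patch):
    """Append one line with the three keys shown in the README's JSON example."""
    record = {
        "instance_id": instance_id,
        "model_name_or_path": model,
        "model_patch": patch,  # may be empty; the README says empty patches are still recorded
    }
    path = Path(predictions_path)
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(record) + "\n")
```

Appending one line per instance, rather than rewriting the file, is what makes resuming cheap: an interrupted run loses at most the instance that was in flight.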