ada-agent 0.2.0 → 0.3.0

Files changed (39)
  1. package/README.md +262 -263
  2. package/bench/README.md +88 -88
  3. package/bench/swebench.mjs +242 -242
  4. package/docs/architecture.md +163 -163
  5. package/docs/architecture.svg +73 -73
  6. package/docs/cloudflare.md +81 -81
  7. package/docs/connectors.md +49 -49
  8. package/docs/integrations.md +62 -62
  9. package/package.json +66 -65
  10. package/skills/aesthetic-direction/SKILL.md +24 -24
  11. package/skills/color-palette/SKILL.md +24 -24
  12. package/skills/component-library/SKILL.md +23 -23
  13. package/skills/dark-mode/SKILL.md +24 -24
  14. package/skills/dashboard-ui/SKILL.md +23 -23
  15. package/skills/design-system/SKILL.md +24 -24
  16. package/skills/design-tokens/SKILL.md +24 -24
  17. package/skills/empty-states/SKILL.md +23 -23
  18. package/skills/hero-section/SKILL.md +23 -23
  19. package/skills/micro-interactions/SKILL.md +23 -23
  20. package/skills/motion-design/SKILL.md +23 -23
  21. package/skills/page-transitions/SKILL.md +23 -23
  22. package/skills/pricing-page/SKILL.md +23 -23
  23. package/skills/scroll-animation/SKILL.md +23 -23
  24. package/skills/skeleton-loader/SKILL.md +23 -23
  25. package/skills/tailwind-theme/SKILL.md +24 -24
  26. package/skills/typography/SKILL.md +24 -24
  27. package/skills/ui-polish/SKILL.md +24 -24
  28. package/skills/ui-review/SKILL.md +24 -24
  29. package/skills/web-fonts/SKILL.md +24 -24
  30. package/src/client/autostart.ts +93 -0
  31. package/src/client/catalog.json +1 -1
  32. package/src/client/cli.ts +1275 -1262
  33. package/src/client/models-dev.ts +106 -106
  34. package/src/selfcheck.ts +404 -390
  35. package/src/server/config.ts +65 -65
  36. package/src/server/providers/openai-compat.ts +78 -78
  37. package/src/server/providers/registry.ts +32 -32
  38. package/src/server/router.ts +33 -33
  39. package/src/shared/types.ts +21 -21
package/bench/README.md CHANGED
@@ -1,88 +1,88 @@
- # Benchmarking ada on SWE-bench Verified
-
- ada can run **SWE-bench Verified** — give the agent a real GitHub issue, let it edit the repo, and
- score whether the repo's test suite passes. This directory has the **generation** half (ada produces
- patches); **scoring** is the official `swebench` Docker harness — we don't reimplement it, because
- that's the only way to get correct, comparable numbers.
-
- ```
- dataset (issues) ──▶ bench/swebench.mjs ──▶ predictions.jsonl ──▶ official swebench eval ──▶ resolved %
- (ada edits the repo, (Docker: apply patch +
- per isolated clone) test_patch, run tests)
- ```
-
- ## Prerequisites
-
- - **ada-server running with provider keys** — the harness drives `ada -p`, which needs the backend:
- ```bash
- export ANTHROPIC_API_KEY=sk-ant-... # and/or OPENAI_API_KEY, etc.
- ada-server # http://localhost:8787
- ```
- - `git` + network (the harness clones each task repo; clones are cached under `~/.cache/ada-swebench`).
- - For scoring: **Docker** and the **`swebench`** Python package (`pip install swebench`). Allow plenty
- of disk — the official images are large.
-
- ## 1. Get the dataset
-
- SWE-bench Verified (500 instances) lives on Hugging Face. Export it to JSONL once:
-
- ```python
- # pip install datasets
- from datasets import load_dataset
- load_dataset("princeton-nlp/SWE-bench_Verified", split="test").to_json("swe-bench-verified.jsonl")
- ```
-
- ## 2. Generate predictions with ada
-
- ```bash
- # smoke test on 5 instances first
- node bench/swebench.mjs --dataset swe-bench-verified.jsonl --model claude-opus-4-8 \
- --out runs/opus --limit 5 --concurrency 2
-
- # a specific instance, or the whole set
- node bench/swebench.mjs --dataset swe-bench-verified.jsonl --model claude-opus-4-8 \
- --out runs/opus --instances astropy__astropy-12907
- ```
-
- For each instance it clones the repo at `base_commit` into an isolated dir, hands ada the issue text
- (`ada -p … --json`, auto-approve), captures `git diff` as the model patch, and appends an
- official-format line to `runs/opus/predictions.jsonl`:
-
- ```json
- {"instance_id": "...", "model_name_or_path": "claude-opus-4-8", "model_patch": "diff --git ..."}
- ```
-
- It also writes `meta.jsonl` (seconds, patch size, token/cost usage per instance). Re-running **resumes**
- — instances already in `predictions.jsonl` are skipped. Flags: `--limit N`, `--instances a,b`,
- `--concurrency` (default 2), `--timeout` seconds per instance (default 1200), `--out <dir>`.
-
- Swap `--model` to compare models on the same tasks (`gpt-...`, `qwen2.5-coder:latest`, …) — ada routes
- each to the right provider.
-
- ## 3. Score with the official harness
-
- ```bash
- python -m swebench.harness.run_evaluation \
- --dataset_name princeton-nlp/SWE-bench_Verified \
- --predictions_path runs/opus/predictions.jsonl \
- --max_workers 4 --run_id ada-opus
- ```
-
- It applies each patch + the held-out `test_patch` in Docker, runs the `FAIL_TO_PASS` / `PASS_TO_PASS`
- tests, and reports the **resolved rate** plus a per-instance breakdown.
-
- ## Notes & honest caveats
-
- - ada is told **not to touch tests** (the grader supplies its own); the patch is whatever ada changed
- in the source.
- - An empty patch (ada gave up / errored) is still recorded — it just counts as unresolved.
- - This measures ada's default `react` loop. Try `ADA_MODEL`, a different `--model`, or wire a
- `--strategy` into the harness to compare setups.
- - Other benchmarks (HumanEval, Aider polyglot) fit the same generate-then-score shape; ask and we'll
- add a sibling script.
-
- ## Quick check
-
- ```bash
- node bench/swebench.mjs --selftest # offline: validates the prompt/prediction/arg helpers
- ```
+ # Benchmarking ada on SWE-bench Verified
+
+ ada can run **SWE-bench Verified** — give the agent a real GitHub issue, let it edit the repo, and
+ score whether the repo's test suite passes. This directory has the **generation** half (ada produces
+ patches); **scoring** is the official `swebench` Docker harness — we don't reimplement it, because
+ that's the only way to get correct, comparable numbers.
+
+ ```
+ dataset (issues) ──▶ bench/swebench.mjs ──▶ predictions.jsonl ──▶ official swebench eval ──▶ resolved %
+ (ada edits the repo, (Docker: apply patch +
+ per isolated clone) test_patch, run tests)
+ ```
+
+ ## Prerequisites
+
+ - **ada-server running with provider keys** — the harness drives `ada -p`, which needs the backend:
+ ```bash
+ export ANTHROPIC_API_KEY=sk-ant-... # and/or OPENAI_API_KEY, etc.
+ ada-server # http://localhost:8787
+ ```
+ - `git` + network (the harness clones each task repo; clones are cached under `~/.cache/ada-swebench`).
+ - For scoring: **Docker** and the **`swebench`** Python package (`pip install swebench`). Allow plenty
+ of disk — the official images are large.
+
+ ## 1. Get the dataset
+
+ SWE-bench Verified (500 instances) lives on Hugging Face. Export it to JSONL once:
+
+ ```python
+ # pip install datasets
+ from datasets import load_dataset
+ load_dataset("princeton-nlp/SWE-bench_Verified", split="test").to_json("swe-bench-verified.jsonl")
+ ```
+
+ ## 2. Generate predictions with ada
+
+ ```bash
+ # smoke test on 5 instances first
+ node bench/swebench.mjs --dataset swe-bench-verified.jsonl --model claude-opus-4-8 \
+ --out runs/opus --limit 5 --concurrency 2
+
+ # a specific instance, or the whole set
+ node bench/swebench.mjs --dataset swe-bench-verified.jsonl --model claude-opus-4-8 \
+ --out runs/opus --instances astropy__astropy-12907
+ ```
+
+ For each instance it clones the repo at `base_commit` into an isolated dir, hands ada the issue text
+ (`ada -p … --json`, auto-approve), captures `git diff` as the model patch, and appends an
+ official-format line to `runs/opus/predictions.jsonl`:
+
+ ```json
+ {"instance_id": "...", "model_name_or_path": "claude-opus-4-8", "model_patch": "diff --git ..."}
+ ```
+
+ It also writes `meta.jsonl` (seconds, patch size, token/cost usage per instance). Re-running **resumes**
+ — instances already in `predictions.jsonl` are skipped. Flags: `--limit N`, `--instances a,b`,
+ `--concurrency` (default 2), `--timeout` seconds per instance (default 1200), `--out <dir>`.
+
+ Swap `--model` to compare models on the same tasks (`gpt-...`, `qwen2.5-coder:latest`, …) — ada routes
+ each to the right provider.
+
+ ## 3. Score with the official harness
+
+ ```bash
+ python -m swebench.harness.run_evaluation \
+ --dataset_name princeton-nlp/SWE-bench_Verified \
+ --predictions_path runs/opus/predictions.jsonl \
+ --max_workers 4 --run_id ada-opus
+ ```
+
+ It applies each patch + the held-out `test_patch` in Docker, runs the `FAIL_TO_PASS` / `PASS_TO_PASS`
+ tests, and reports the **resolved rate** plus a per-instance breakdown.
+
+ ## Notes & honest caveats
+
+ - ada is told **not to touch tests** (the grader supplies its own); the patch is whatever ada changed
+ in the source.
+ - An empty patch (ada gave up / errored) is still recorded — it just counts as unresolved.
+ - This measures ada's default `react` loop. Try `ADA_MODEL`, a different `--model`, or wire a
+ `--strategy` into the harness to compare setups.
+ - Other benchmarks (HumanEval, Aider polyglot) fit the same generate-then-score shape; ask and we'll
+ add a sibling script.
+
+ ## Quick check
+
+ ```bash
+ node bench/swebench.mjs --selftest # offline: validates the prompt/prediction/arg helpers
+ ```
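
The README in this hunk says that re-running the harness resumes, skipping instances already present in `predictions.jsonl`, and that empty patches are still recorded. The following is a minimal Python sketch of that idea, written for illustration only; it is not the code in `bench/swebench.mjs`, and the function names `load_done_ids`, `append_prediction`, and `pending` are hypothetical. Only the three record keys (`instance_id`, `model_name_or_path`, `model_patch`) come from the README.

```python
import json
from pathlib import Path


def load_done_ids(predictions_path):
    """Return the instance_ids already recorded in a predictions JSONL file."""
    path = Path(predictions_path)
    if not path.exists():
        return set()
    done = set()
    with path.open(encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if not line:
                continue
            try:
                record = json.loads(line)
            except json.JSONDecodeError:
                # Assumption: an interrupted run may leave a truncated last
                # line; treat it as "not done" rather than failing.
                continue
            if isinstance(record, dict) and "instance_id" in record:
                done.add(record["instance_id"])
    return done


def append_prediction(predictions_path, instance_id, model, patch):
    """Append one prediction line; an empty patch is recorded as well."""
    record = {
        "instance_id": instance_id,
        "model_name_or_path": model,
        "model_patch": patch,
    }
    path = Path(predictions_path)
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(record) + "\n")


def pending(instances, predictions_path):
    """Filter dataset rows down to those without a recorded prediction."""
    done = load_done_ids(predictions_path)
    return [inst for inst in instances if inst["instance_id"] not in done]
```

Appending one line per finished instance, rather than rewriting the file, is what makes this kind of resume cheap: a crash loses at most the instance in flight, and the next run only needs the set of ids already on disk.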