@athenaflow/plugin-web-bench 1.0.5
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- package/.claude-plugin/plugin.json +16 -0
- package/.codex-plugin/plugin.json +16 -0
- package/.mcp.json +19 -0
- package/dist/1.0.3/.agents/plugins/marketplace.json +14 -0
- package/dist/1.0.3/claude/plugin/.claude-plugin/plugin.json +16 -0
- package/dist/1.0.3/claude/plugin/.mcp.json +19 -0
- package/dist/1.0.3/claude/plugin/package.json +9 -0
- package/dist/1.0.3/claude/plugin/skills/evaluate-task/SKILL.md +173 -0
- package/dist/1.0.3/claude/plugin/skills/evaluate-task/agents/claude.yaml +2 -0
- package/dist/1.0.3/claude/plugin/skills/execute-task/SKILL.md +133 -0
- package/dist/1.0.3/claude/plugin/skills/execute-task/agents/claude.yaml +2 -0
- package/dist/1.0.3/claude/plugin/skills/generate-report/SKILL.md +204 -0
- package/dist/1.0.3/claude/plugin/skills/generate-report/agents/claude.yaml +2 -0
- package/dist/1.0.3/claude/plugin/skills/load-dataset/SKILL.md +209 -0
- package/dist/1.0.3/claude/plugin/skills/load-dataset/agents/claude.yaml +2 -0
- package/dist/1.0.3/claude/plugin/skills/run-benchmark/SKILL.md +92 -0
- package/dist/1.0.3/claude/plugin/skills/run-benchmark/agents/claude.yaml +3 -0
- package/dist/1.0.3/codex/plugin/.codex-plugin/plugin.json +16 -0
- package/dist/1.0.3/codex/plugin/.mcp.json +19 -0
- package/dist/1.0.3/codex/plugin/package.json +9 -0
- package/dist/1.0.3/codex/plugin/skills/evaluate-task/SKILL.md +173 -0
- package/dist/1.0.3/codex/plugin/skills/evaluate-task/agents/claude.yaml +2 -0
- package/dist/1.0.3/codex/plugin/skills/execute-task/SKILL.md +133 -0
- package/dist/1.0.3/codex/plugin/skills/execute-task/agents/claude.yaml +2 -0
- package/dist/1.0.3/codex/plugin/skills/generate-report/SKILL.md +204 -0
- package/dist/1.0.3/codex/plugin/skills/generate-report/agents/claude.yaml +2 -0
- package/dist/1.0.3/codex/plugin/skills/load-dataset/SKILL.md +209 -0
- package/dist/1.0.3/codex/plugin/skills/load-dataset/agents/claude.yaml +2 -0
- package/dist/1.0.3/codex/plugin/skills/run-benchmark/SKILL.md +92 -0
- package/dist/1.0.3/codex/plugin/skills/run-benchmark/agents/claude.yaml +3 -0
- package/dist/1.0.3/release.json +18 -0
- package/dist/1.0.5/.agents/plugins/marketplace.json +14 -0
- package/dist/1.0.5/claude/plugin/.claude-plugin/plugin.json +16 -0
- package/dist/1.0.5/claude/plugin/.mcp.json +19 -0
- package/dist/1.0.5/claude/plugin/package.json +9 -0
- package/dist/1.0.5/claude/plugin/skills/evaluate-task/SKILL.md +173 -0
- package/dist/1.0.5/claude/plugin/skills/evaluate-task/agents/claude.yaml +2 -0
- package/dist/1.0.5/claude/plugin/skills/execute-task/SKILL.md +133 -0
- package/dist/1.0.5/claude/plugin/skills/execute-task/agents/claude.yaml +2 -0
- package/dist/1.0.5/claude/plugin/skills/generate-report/SKILL.md +204 -0
- package/dist/1.0.5/claude/plugin/skills/generate-report/agents/claude.yaml +2 -0
- package/dist/1.0.5/claude/plugin/skills/load-dataset/SKILL.md +209 -0
- package/dist/1.0.5/claude/plugin/skills/load-dataset/agents/claude.yaml +2 -0
- package/dist/1.0.5/claude/plugin/skills/run-benchmark/SKILL.md +92 -0
- package/dist/1.0.5/claude/plugin/skills/run-benchmark/agents/claude.yaml +3 -0
- package/dist/1.0.5/codex/plugin/.codex-plugin/plugin.json +16 -0
- package/dist/1.0.5/codex/plugin/.mcp.json +19 -0
- package/dist/1.0.5/codex/plugin/package.json +9 -0
- package/dist/1.0.5/codex/plugin/skills/evaluate-task/SKILL.md +173 -0
- package/dist/1.0.5/codex/plugin/skills/evaluate-task/agents/claude.yaml +2 -0
- package/dist/1.0.5/codex/plugin/skills/execute-task/SKILL.md +133 -0
- package/dist/1.0.5/codex/plugin/skills/execute-task/agents/claude.yaml +2 -0
- package/dist/1.0.5/codex/plugin/skills/generate-report/SKILL.md +204 -0
- package/dist/1.0.5/codex/plugin/skills/generate-report/agents/claude.yaml +2 -0
- package/dist/1.0.5/codex/plugin/skills/load-dataset/SKILL.md +209 -0
- package/dist/1.0.5/codex/plugin/skills/load-dataset/agents/claude.yaml +2 -0
- package/dist/1.0.5/codex/plugin/skills/run-benchmark/SKILL.md +92 -0
- package/dist/1.0.5/codex/plugin/skills/run-benchmark/agents/claude.yaml +3 -0
- package/dist/1.0.5/release.json +18 -0
- package/package.json +13 -0
- package/skills/evaluate-task/SKILL.md +173 -0
- package/skills/evaluate-task/agents/claude.yaml +2 -0
- package/skills/execute-task/SKILL.md +133 -0
- package/skills/execute-task/agents/claude.yaml +2 -0
- package/skills/generate-report/SKILL.md +204 -0
- package/skills/generate-report/agents/claude.yaml +2 -0
- package/skills/load-dataset/SKILL.md +209 -0
- package/skills/load-dataset/agents/claude.yaml +2 -0
- package/skills/run-benchmark/SKILL.md +92 -0
- package/skills/run-benchmark/agents/claude.yaml +3 -0
@@ -0,0 +1,209 @@
+---
+name: load-dataset
+description: >
+  Download and prepare the Halluminate/WebBench dataset from HuggingFace for benchmarking.
+  Triggers: "load dataset", "download WebBench", "prepare benchmark data", "fetch tasks".
+  Downloads the CSV dataset via curl, converts to JSONL with Node.js, applies optional filters
+  (category, sample size, website allowlist/blocklist), and writes web-bench-tasks.jsonl to the
+  working directory. Zero Python dependencies — uses only curl and Node.js.
+  Does NOT execute tasks — use execute-task for that.
+allowed-tools: Bash Read Write Edit Glob
+---
+
+# Load WebBench Dataset
+
+Download the Halluminate/WebBench dataset from HuggingFace and prepare it for benchmark execution.
+
+## Dataset Source
+
+- **HuggingFace:** `Halluminate/WebBench`
+- **Source file:** `webbenchfinal.csv` (CSV format)
+- **Size:** ~2,454 tasks across 452 websites
+- **Fields per row:** `ID` (int), `Starting_URL` (string), `Category` (enum), `Task` (string)
+
+## Pre-check: Skip Download if Dataset Exists
+
+Before downloading, check if `web-bench-tasks.jsonl` already exists in the working directory:
+
+```bash
+if [ -f web-bench-tasks.jsonl ]; then
+  echo "Dataset already exists: $(wc -l < web-bench-tasks.jsonl) tasks"
+  head -1 web-bench-tasks.jsonl
+fi
+```
+
+**If `web-bench-tasks.jsonl` exists and is non-empty, skip the download and conversion entirely.** Jump straight to [Applying Filters](#applying-filters) if filters need to be applied, or report the existing dataset to the tracker.
+
+Only proceed with download if the file does not exist or is empty.
+
+## Download Method
+
+Download the CSV directly with `curl`, then convert to JSONL with Node.js. No Python dependencies required.
+
+### Step 1: Download the CSV
+
+```bash
+curl -fSL -o web-bench-dataset.csv \
+  "https://huggingface.co/datasets/Halluminate/WebBench/resolve/main/webbenchfinal.csv"
+```
+
+If the above URL fails (HuggingFace sometimes changes paths), try:
+
+```bash
+curl -fSL -o web-bench-dataset.csv \
+  "https://huggingface.co/datasets/Halluminate/WebBench/raw/main/webbenchfinal.csv"
+```
+
+### Step 2: Convert CSV to JSONL
+
+```bash
+node -e "
+const fs = require('fs');
+const csv = fs.readFileSync('web-bench-dataset.csv', 'utf-8');
+const lines = csv.split('\n');
+const header = lines[0].split(',').map(h => h.trim().replace(/^\"|\"$/g, ''));
+
+// Find column indices
+const idIdx = header.findIndex(h => h === 'ID');
+const urlIdx = header.findIndex(h => h === 'Starting_URL');
+const catIdx = header.findIndex(h => h === 'Category');
+const taskIdx = header.findIndex(h => h === 'Task');
+
+const out = fs.createWriteStream('web-bench-tasks.jsonl');
+let count = 0;
+
+for (let i = 1; i < lines.length; i++) {
+  const line = lines[i].trim();
+  if (!line) continue;
+
+  // Parse CSV line respecting quoted fields
+  const fields = [];
+  let field = '';
+  let inQuotes = false;
+  for (let j = 0; j < line.length; j++) {
+    const ch = line[j];
+    if (ch === '\"') {
+      inQuotes = !inQuotes;
+    } else if (ch === ',' && !inQuotes) {
+      fields.push(field.trim());
+      field = '';
+    } else {
+      field += ch;
+    }
+  }
+  fields.push(field.trim());
+
+  if (fields.length > taskIdx) {
+    out.write(JSON.stringify({
+      id: parseInt(fields[idIdx], 10),
+      url: fields[urlIdx],
+      category: fields[catIdx],
+      task: fields[taskIdx]
+    }) + '\n');
+    count++;
+  }
+}
+
+out.end();
+console.log('Wrote ' + count + ' tasks to web-bench-tasks.jsonl');
+"
+```
+
+### Step 3: Verify the output
+
+```bash
+wc -l web-bench-tasks.jsonl
+head -1 web-bench-tasks.jsonl
+node -e "
+const fs = require('fs');
+const tasks = fs.readFileSync('web-bench-tasks.jsonl','utf-8').trim().split('\n').map(JSON.parse);
+const cats = {};
+const sites = new Set();
+for (const t of tasks) {
+  cats[t.category] = (cats[t.category] || 0) + 1;
+  sites.add(t.url);
+}
+for (const [c, n] of Object.entries(cats).sort()) console.log('  ' + c + ': ' + n);
+console.log('Total: ' + tasks.length + ' tasks across ' + sites.size + ' websites');
+"
+```
+
+## Applying Filters
+
+After downloading, apply filters based on tracker configuration. All filters use Node.js.
+
+### Category Filter
+
+If the tracker specifies a category filter (e.g., `READ`, `CREATE`):
+
+```bash
+node -e "
+const fs = require('fs');
+const category = process.argv[1];
+const tasks = fs.readFileSync('web-bench-tasks.jsonl','utf-8').trim().split('\n').map(JSON.parse);
+const filtered = tasks.filter(t => t.category === category);
+fs.writeFileSync('web-bench-tasks.jsonl', filtered.map(JSON.stringify).join('\n') + '\n');
+console.log('Filtered to ' + filtered.length + ' ' + category + ' tasks');
+" "READ"
+```
+
+### Sample Size
+
+If the tracker specifies a sample size (e.g., `--sample 50`):
+
+```bash
+node -e "
+const fs = require('fs');
+const n = parseInt(process.argv[1], 10);
+const tasks = fs.readFileSync('web-bench-tasks.jsonl','utf-8').trim().split('\n').map(JSON.parse);
+
+// Deterministic shuffle (seed-based) for reproducibility
+function seededShuffle(arr, seed) {
+  const a = [...arr];
+  let s = seed;
+  for (let i = a.length - 1; i > 0; i--) {
+    s = (s * 1664525 + 1013904223) & 0xffffffff;
+    const j = ((s >>> 0) % (i + 1));
+    [a[i], a[j]] = [a[j], a[i]];
+  }
+  return a;
+}
+
+const sample = seededShuffle(tasks, 42).slice(0, Math.min(n, tasks.length));
+fs.writeFileSync('web-bench-tasks.jsonl', sample.map(JSON.stringify).join('\n') + '\n');
+console.log('Sampled ' + sample.length + ' tasks');
+" "50"
+```
+
+### Website Blocklist
+
+```bash
+node -e "
+const fs = require('fs');
+const blocklist = new Set(process.argv[1] ? process.argv[1].split(',') : []);
+const tasks = fs.readFileSync('web-bench-tasks.jsonl','utf-8').trim().split('\n').map(JSON.parse);
+const filtered = tasks.filter(t => !blocklist.has(t.url));
+fs.writeFileSync('web-bench-tasks.jsonl', filtered.map(JSON.stringify).join('\n') + '\n');
+console.log(filtered.length + ' tasks after blocklist filter');
+" ""
+```
+
+## Output
+
+- **File:** `web-bench-tasks.jsonl` in working directory
+- **Intermediate file:** `web-bench-dataset.csv` (can be deleted after conversion)
+- **Format:** One JSON object per line
+- **Schema:**
+  ```json
+  {"id": 42, "url": "https://acehardware.com", "category": "READ", "task": "Navigate to..."}
+  ```
+
+## Cleanup
+
+After successful conversion, remove the intermediate CSV:
+
+```bash
+rm -f web-bench-dataset.csv
+```
+
+Report the total count and category breakdown to the tracker.
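The sampling filter above leans on the LCG shuffle being deterministic for a fixed seed. A quick standalone check of that property, using the same constants on a synthetic ID list (a sketch; assumes only `node` on PATH, and the 100-element array is fabricated for illustration):

```shell
sample_ids() {
  node -e "
  // Same LCG shuffle as the skill's sampling step, seeded with 42.
  function seededShuffle(arr, seed) {
    const a = [...arr];
    let s = seed;
    for (let i = a.length - 1; i > 0; i--) {
      s = (s * 1664525 + 1013904223) & 0xffffffff;
      const j = ((s >>> 0) % (i + 1));
      [a[i], a[j]] = [a[j], a[i]];
    }
    return a;
  }
  const ids = Array.from({ length: 100 }, (_, k) => k);
  console.log(seededShuffle(ids, 42).slice(0, 5).join(','));
  "
}
first=$(sample_ids)
second=$(sample_ids)
[ "$first" = "$second" ] && echo "deterministic sample: $first"
```

Two invocations must print identical IDs; changing the seed is the only way to get a different draw, which keeps `--sample N` runs comparable across machines.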
@@ -0,0 +1,92 @@
+---
+name: run-benchmark
+description: >
+  Run the WebBench browser agent benchmark — main entry point and orchestrator.
+  Triggers: "run benchmark", "run WebBench", "start benchmark", "benchmark browser agent",
+  "web bench", "execute WebBench", "run web-bench".
+  Parses user configuration (category filter, sample size, resume), delegates to
+  load-dataset, execute-task, evaluate-task, and generate-report skills.
+  This is the user-invocable orchestrator that ties the full benchmark pipeline together.
+allowed-tools: Read Write Edit Glob Grep Bash Task mcp__browser__ping mcp__browser__navigate mcp__browser__find mcp__browser__get_element mcp__browser__get_form mcp__browser__get_field mcp__browser__click mcp__browser__type mcp__browser__press mcp__browser__select mcp__browser__hover mcp__browser__drag mcp__browser__scroll mcp__browser__scroll_to mcp__browser__wheel mcp__browser__snapshot mcp__browser__screenshot mcp__browser__go_back mcp__browser__go_forward mcp__browser__reload mcp__browser__list_pages mcp__browser__close_page
+---
+
+# Run WebBench Benchmark
+
+Main entry point for running the WebBench browser agent benchmark. This skill is used in interactive (single-session) mode. For multi-session workflow execution, see the system prompt.
+
+## Input
+
+Parse configuration from: `$ARGUMENTS`
+
+Supported flags:
+
+| Flag | Description | Default |
+|------|-------------|---------|
+| `--category <CAT>` | Filter tasks by category (READ, CREATE, UPDATE, DELETE, FILE_MANIPULATION) | All categories |
+| `--sample <N>` | Random sample of N tasks (deterministic seed=42) | Full dataset |
+| `--resume` | Resume from existing web-bench-results.jsonl, skip completed task IDs | Fresh run |
+| `--report-only` | Skip execution, just generate report from existing results | Full run |
+
+Examples:
+- `run-benchmark --category READ --sample 50` — 50 random READ tasks
+- `run-benchmark --resume` — continue from where last run stopped
+- `run-benchmark --report-only` — just aggregate existing results
+
+## Interactive Execution Protocol
+
+When run interactively (not via the workflow loop), this skill executes the full pipeline in a single session:
+
+### 1. Setup
+
+1. Parse arguments
+2. Check for existing state (`web-bench-tasks.jsonl`, `web-bench-results.jsonl`)
+3. If `--resume` and results exist: determine completed task IDs, skip them
+4. If not resuming: load the `load-dataset` skill to download and prepare the dataset
+5. Report configuration and task count
+
+### 2. Execute Tasks
+
+For each task in `web-bench-tasks.jsonl` (skipping completed if resuming):
+
+1. Read the task line
+2. Record start time: `date +%s%3N`
+3. Load `execute-task` methodology and perform browser automation
+4. Load `evaluate-task` methodology and score the result
+5. Record end time: `date +%s%3N`, compute duration
+6. Append result to `web-bench-results.jsonl`:
+   ```json
+   {"id": 42, "url": "...", "category": "READ", "task": "...", "score": 1.0, "verdict": "PASS", "reasoning": "...", "error": null, "duration_ms": 34200, "tokens_used": {"input": 12450, "output": 3200}, "timestamp": "2026-03-19T14:30:00Z"}
+   ```
+7. Print progress: `[42/2454] PASS (1.0) — acehardware.com — READ — 34.2s`
+
+### 3. Generate Report
+
+After all tasks are processed (or if `--report-only`):
+
+1. Load `generate-report` methodology
+2. Aggregate `web-bench-results.jsonl` into `web-bench-report.md`
+3. Print summary statistics to console
+
+## Token Tracking
+
+Token usage should be tracked per task. The agent should estimate tokens consumed during task execution by recording:
+
+- **Input tokens:** Approximate from the size of prompts, page snapshots, and tool responses received during execution
+- **Output tokens:** Approximate from the size of responses and tool calls generated
+
+If exact token counts are available from the session metadata, prefer those over estimates.
+
+## Progress Display
+
+After each task, print a status line:
+
+```
+[1/50] PASS (1.0)    acehardware.com  READ    34.2s  15,650 tokens
+[2/50] FAIL (0.0)    airbnb.com       CREATE  12.1s   8,200 tokens  [auth_required]
+[3/50] PARTIAL (0.5) amazon.com       READ    45.8s  22,100 tokens
+```
+
+## Guardrails
+
+- **Always append, never overwrite** results. The JSONL file is append-only.
+- **Respect the dataset.** Do not modify task descriptions or skip tasks without recording a FAIL.
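The `--resume` behavior described in Setup step 3 amounts to a set difference between task IDs and result IDs. A minimal sketch with fabricated demo files (real runs use `web-bench-tasks.jsonl` and `web-bench-results.jsonl`; the `demo-*` names here are illustrative only):

```shell
# Fixture data for illustration only; not real benchmark output.
printf '%s\n' '{"id":1,"verdict":"PASS"}' '{"id":3,"verdict":"FAIL"}' > demo-results.jsonl
printf '%s\n' '{"id":1}' '{"id":2}' '{"id":3}' '{"id":4}' > demo-tasks.jsonl

remaining=$(node -e "
const fs = require('fs');
const read = f => fs.readFileSync(f, 'utf-8').trim().split('\n').map(JSON.parse);
// Completed IDs come from the append-only results file.
const done = new Set(read('demo-results.jsonl').map(r => r.id));
const left = read('demo-tasks.jsonl').filter(t => !done.has(t.id));
console.log(left.map(t => t.id).join(','));
")
echo "tasks left to run: $remaining"
```

Because results are append-only, a crashed run resumes cleanly: any task without a result line is simply still pending.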
@@ -0,0 +1,16 @@
+{
+  "name": "web-bench",
+  "version": "1.0.3",
+  "description": "WebBench benchmark runner \u2014 executes real-world browser tasks from the Halluminate/WebBench dataset, scores via LLM-as-judge, and produces evaluation reports",
+  "author": {
+    "name": "Athenaflow"
+  },
+  "skills": "./skills/",
+  "mcpServers": "./.mcp.json",
+  "interface": {
+    "displayName": "WebBench",
+    "shortDescription": "Browser task benchmarks scored via LLM-as-judge",
+    "developerName": "Athenaflow",
+    "category": "Evaluation"
+  }
+}
@@ -0,0 +1,19 @@
+{
+  "mcpServers": {
+    "agent-web-interface": {
+      "command": "npx",
+      "args": ["-y", "agent-web-interface@latest"],
+      "env": {
+        "NODE_ENV": "production"
+      },
+      "options": [
+        { "label": "Auto (user → persistent → isolated)", "env": {} },
+        { "label": "User's Chrome", "env": { "AWI_BROWSER_MODE": "user" } },
+        { "label": "Persistent profile", "env": { "AWI_BROWSER_MODE": "persistent" } },
+        { "label": "Isolated (temp profile)", "env": { "AWI_BROWSER_MODE": "isolated" } },
+        { "label": "Headless", "env": { "AWI_HEADLESS": "true" } },
+        { "label": "Connect to CDP endpoint", "env": { "AWI_CDP_URL": "http://localhost:9222" } }
+      ]
+    }
+  }
+}
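The `options` array above reads as a menu of label-to-env overrides rather than standard MCP server fields; merging a chosen option over the base `env` block would look roughly like this sketch (labels and env names are copied from the config above, but the merge logic itself is an assumption about how a host would apply them):

```shell
merged=$(node -e "
// Two of the documented options, copied from the config above.
const base = { NODE_ENV: 'production' };
const options = [
  { label: 'Isolated (temp profile)', env: { AWI_BROWSER_MODE: 'isolated' } },
  { label: 'Headless', env: { AWI_HEADLESS: 'true' } }
];
const pick = options.find(o => o.label === 'Headless');
// Option env wins over the base env on key collisions.
console.log(JSON.stringify({ ...base, ...pick.env }));
")
echo "$merged"
```

The resulting object is what the server process would receive as its environment, e.g. `NODE_ENV=production AWI_HEADLESS=true`.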
@@ -0,0 +1,9 @@
+{
+  "name": "@athenaflow/plugin-web-bench",
+  "version": "1.0.3",
+  "description": "WebBench benchmark runner — executes real-world browser tasks from the Halluminate/WebBench dataset, scores via LLM-as-judge, and produces evaluation reports",
+  "license": "MIT",
+  "publishConfig": {
+    "access": "public"
+  }
+}
@@ -0,0 +1,173 @@
+---
+name: evaluate-task
+description: >
+  Evaluate whether a WebBench task was successfully completed using LLM-as-judge scoring.
+  Triggers: "evaluate task", "score task", "judge result", "grade benchmark task".
+  Examines the execution trace, final page state, and extracted data against the original
+  task description. Produces a structured verdict (PASS/PARTIAL/FAIL) with reasoning.
+  Does NOT execute browser actions — use execute-task for that.
+allowed-tools: Read Write Edit
+---
+
+# Evaluate WebBench Task
+
+Judge whether a completed task execution meets the success criteria defined in the original task description. This is a post-hoc evaluation — no browser interaction, only analysis of the execution trace and captured state.
+
+## Input
+
+You receive:
+
+1. **Original task:** `{"id": 42, "url": "...", "category": "READ", "task": "Navigate to the news section and summarize..."}`
+2. **Execution trace:** Actions taken, final URL, extracted data, blockers encountered, screenshots
+
+## Scoring Rubric
+
+### Verdict Scale
+
+| Verdict | Score | Criteria |
+|---------|-------|----------|
+| **PASS** | 1.0 | Task fully completed. All requested information extracted or all requested actions performed. |
+| **PARTIAL** | 0.5 | Task partially completed. Some but not all requirements met. Meaningful progress was made. |
+| **FAIL** | 0.0 | Task not completed. No meaningful progress, wrong information, or blocked before starting. |
+
+### Category-Specific Evaluation
+
+#### READ Tasks
+- **PASS:** All requested data was extracted accurately and completely
+- **PARTIAL:** Some data extracted but incomplete (e.g., found the page but missed some fields)
+- **FAIL:** Wrong data, wrong page, or no data extracted
+
+#### CREATE Tasks
+- **PASS:** Item was created as specified, confirmation visible
+- **PARTIAL:** Creation started but not confirmed (e.g., form filled but not submitted)
+- **FAIL:** Could not reach the creation form, or creation failed
+
+#### UPDATE Tasks
+- **PASS:** Data was modified as specified, change confirmed
+- **PARTIAL:** Found the item but could not complete the modification
+- **FAIL:** Could not find the item or reach the edit interface
+
+#### DELETE Tasks
+- **PASS:** Item was deleted and removal confirmed
+- **PARTIAL:** Found the item and initiated deletion but could not confirm
+- **FAIL:** Could not find the item or reach the delete action
+
+#### FILE_MANIPULATION Tasks
+- **PASS:** File downloaded with correct name/content
+- **PARTIAL:** Download initiated but not verified
+- **FAIL:** Could not locate or download the file
+
+### Blocker Handling
+
+If the execution trace contains blockers, evaluate based on the blocker type:
+
+| Blocker | Verdict | Reasoning |
+|---------|---------|-----------|
+| Login required (no credentials) | FAIL | Infrastructure limitation — task requires auth |
+| CAPTCHA | FAIL | Infrastructure limitation — cannot solve programmatically |
+| Site down / 404 | FAIL | External dependency — site unavailable |
+| Geo-restricted | FAIL | Infrastructure limitation — content not accessible |
+| Paywall | FAIL | Infrastructure limitation — paid content |
+| Pop-up could not be dismissed | PARTIAL or FAIL | Depends on whether task could proceed |
+
+### Evaluation Dimensions
+
+Score each dimension and use them to determine the overall verdict:
+
+1. **Navigation (required):** Did the agent reach the correct page/section?
+   - Correct site? Correct section? Correct page?
+
+2. **Comprehension (required):** Did the agent understand what was being asked?
+   - Did it attempt the right type of action? Did it target the right elements?
+
+3. **Completeness (required):** Did the agent fulfill ALL parts of the task?
+   - Multi-part tasks: each part must be addressed
+   - Quantitative tasks: all requested data points must be present
+
+4. **Accuracy (for READ tasks):** Is the extracted information correct?
+   - Does it match what's visible on the page?
+   - Are numbers, names, and details accurate?
+
+5. **Confirmation (for WRITE tasks):** Is there evidence the action was performed?
+   - Success message visible? Item appears in list? State changed?
+
+## Evaluation Process
+
+### Step 1: Parse the Task Requirements
+
+Break the task description into discrete, verifiable requirements:
+
+```
+Task: "Navigate to the news section and summarize the headline and key points from the latest science policy update."
+
+Requirements:
+1. Navigate to the news section
+2. Find the latest science policy update
+3. Extract the headline
+4. Extract key points
+```
+
+### Step 2: Check Each Requirement Against the Trace
+
+For each requirement, determine if the execution trace shows it was fulfilled:
+
+```
+1. Navigate to news section → DONE (action 2: clicked "News", URL changed to /news)
+2. Find latest science policy update → DONE (action 3: found article "New Science Policy...")
+3. Extract headline → DONE (extracted: "New Science Policy Framework Announced")
+4. Extract key points → NOT DONE (only headline extracted, no key points)
+```
+
+### Step 3: Determine Verdict
+
+- All requirements met → **PASS**
+- Some requirements met → **PARTIAL**
+- No requirements met or fundamentally wrong approach → **FAIL**
+
+### Step 4: Write Reasoning
+
+Provide clear, structured reasoning:
+
+```
+Verdict: PARTIAL (0.5)
+Reasoning: Agent successfully navigated to the news section and identified the correct article.
+The headline was extracted accurately. However, the task also requested "key points" from the
+article, which were not extracted. 3 of 4 requirements met.
+```
+
+## Output Format
+
+Return a structured evaluation result:
+
+```json
+{
+  "task_id": 42,
+  "verdict": "PARTIAL",
+  "score": 0.5,
+  "reasoning": "Agent navigated correctly and extracted the headline, but missed the key points requirement. 3/4 requirements fulfilled.",
+  "requirements_met": 3,
+  "requirements_total": 4,
+  "blocker": null
+}
+```
+
+If a blocker prevented execution:
+
+```json
+{
+  "task_id": 43,
+  "verdict": "FAIL",
+  "score": 0.0,
+  "reasoning": "Task requires account login. No credentials available — infrastructure limitation.",
+  "requirements_met": 0,
+  "requirements_total": 3,
+  "blocker": "auth_required"
+}
+```
+
+## Guardrails
+
+- **Be strict but fair.** A task that asks for 5 data points and delivers 4 is PARTIAL, not PASS.
+- **Do not hallucinate success.** If the trace doesn't show evidence of completion, it didn't happen.
+- **Separate agent failure from infrastructure failure.** Auth requirements, CAPTCHAs, and site outages are not agent failures — but they are still FAIL verdicts for scoring purposes. Note the distinction in reasoning.
+- **Evaluate what was asked, not what was attempted.** A well-executed wrong approach is still a FAIL.
@@ -0,0 +1,133 @@
|
|
|
1
|
+
---
|
|
2
|
+
name: execute-task
|
|
3
|
+
description: >
|
|
4
|
+
Execute a single WebBench benchmark task via browser automation.
|
|
5
|
+
Triggers: "execute task", "run task", "perform benchmark task", "browser task".
|
|
6
|
+
Navigates to the Starting_URL, interprets the natural-language task description,
|
|
7
|
+
performs the required browser actions, and captures the final state (screenshot + snapshot).
|
|
8
|
+
Records an execution trace with actions taken and errors encountered.
|
|
9
|
+
Does NOT evaluate success — use evaluate-task for that.
|
|
10
|
+
allowed-tools: Read Write Edit Bash mcp__browser__ping mcp__browser__navigate mcp__browser__find mcp__browser__get_element mcp__browser__get_form mcp__browser__get_field mcp__browser__click mcp__browser__type mcp__browser__press mcp__browser__select mcp__browser__hover mcp__browser__drag mcp__browser__scroll mcp__browser__scroll_to mcp__browser__wheel mcp__browser__snapshot mcp__browser__screenshot mcp__browser__go_back mcp__browser__go_forward mcp__browser__reload mcp__browser__list_pages mcp__browser__close_page
|
|
11
|
+
---
|
|
12
|
+
|
|
13
|
+
# Execute WebBench Task
|
|
14
|
+
|
|
15
|
+
Execute a single WebBench task via browser automation. This skill handles the browser interaction — navigating, clicking, typing, extracting data — required to complete the task.
|
|
16
|
+
|
|
17
|
+
## Input
|
|
18
|
+
|
|
19
|
+
You receive a single task object:
|
|
20
|
+
|
|
21
|
+
```json
|
|
22
|
+
{"id": 42, "url": "https://acehardware.com", "category": "READ", "task": "Navigate to the news section and summarize..."}
|
|
23
|
+
```
|
|
24
|
+
|
|
25
|
+
## Execution Protocol
|
|
26
|
+
|
|
27
|
+
### 1. Record Start Time
|
|
28
|
+
|
|
29
|
+
Before any browser interaction:
|
|
30
|
+
|
|
31
|
+
```bash
|
|
32
|
+
date +%s%3N
|
|
33
|
+
```
|
|
34
|
+
|
|
35
|
+
Save this as `start_time_ms`. You will need it for the result record.

### 2. Navigate to Starting URL

```
navigate → task.url
```

Wait for the page to load. Take an initial snapshot to understand the page structure.

### 3. Interpret the Task

Read the task description carefully. Classify it:

| Category | What to Do |
|----------|------------|
| **READ** | Navigate to the right page/section and extract the requested information. Your "result" is the extracted data. |
| **CREATE** | Fill forms, create accounts/entries/posts as described. Your "result" is confirmation the item was created. |
| **UPDATE** | Find existing data and modify it as described. Your "result" is confirmation the update was applied. |
| **DELETE** | Find and remove the specified item. Your "result" is confirmation of deletion. |
| **FILE_MANIPULATION** | Download the specified file. Your "result" is the filename and confirmation of download. |
### 4. Execute Browser Actions

Work through the task step by step:

1. **Observe**: use `snapshot` or `find` to understand the current page state.
2. **Plan**: decide the next action based on what you see.
3. **Act**: use the appropriate MCP tool (`click`, `type`, `select`, etc.).
4. **Verify**: check that the action had the expected effect.

**Key principles:**

- **Use `find` with `kind` filters** to locate interactive elements (buttons, links, textboxes).
- **Use `snapshot`** to get the full page state when you need orientation.
- **Use `screenshot`** to visually verify state when snapshots are ambiguous.
- **Handle pop-ups and modals**: cookie banners, newsletter pop-ups, and chat widgets can block interaction. Dismiss them before proceeding.
- **Stay on the specified site**: tasks often say "Only use [site] to achieve the task." Respect this constraint.
- **Handle pagination**: if data spans multiple pages, navigate through them.
- **Be patient with slow sites**: some sites load content dynamically. If elements aren't found immediately, try scrolling or waiting.

### 5. Handle Common Obstacles

| Obstacle | Strategy |
|----------|----------|
| **Cookie consent banner** | Find and click "Accept" or "Close" |
| **Login required** | Record as a blocker; do not attempt to create accounts or guess credentials |
| **CAPTCHA** | Record as a blocker; it cannot be solved programmatically |
| **Paywall** | Record as a blocker |
| **Geo-restricted content** | Record as a blocker |
| **Site down / 404** | Record as an error |
| **Pop-up / overlay blocking interaction** | Dismiss it, then continue |
### 6. Capture Final State

After completing the task (or hitting a blocker):

1. Take a **screenshot** of the final page state.
2. Take a **snapshot** of the final DOM state.
3. Record the **current URL**.

### 7. Record End Time

```bash
date +%s%3N
```

Save as `end_time_ms`. Compute `duration_ms = end_time_ms - start_time_ms`.
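
The arithmetic can be sketched as follows (GNU `date` assumed, as above; the `sleep` stands in for the actual browser work):

```shell
start_time_ms=$(date +%s%3N)   # recorded in step 1
sleep 0.2                      # placeholder for the task execution
end_time_ms=$(date +%s%3N)     # recorded in step 7
duration_ms=$((end_time_ms - start_time_ms))
echo "$duration_ms"
```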

### 8. Build Execution Trace

Construct a structured trace of what happened:

```json
{
  "task_id": 42,
  "actions": [
    {"step": 1, "action": "navigate", "target": "https://acehardware.com", "result": "loaded"},
    {"step": 2, "action": "click", "target": "News section link", "result": "navigated to /news"},
    {"step": 3, "action": "extract", "target": "headline text", "result": "Black Friday Deals..."}
  ],
  "final_url": "https://acehardware.com/news",
  "blockers": [],
  "extracted_data": "The headline is 'Black Friday Deals'...",
  "duration_ms": 34200
}
```
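
A trace in this shape can be given a quick well-formedness check with `jq` before handing it back. This is a sketch only: it assumes `jq` is available, and the filename is illustrative.

```shell
# Write a sample trace, then verify a few expected fields are present
cat > trace.json <<'EOF'
{"task_id": 42, "actions": [{"step": 1, "action": "navigate", "target": "https://acehardware.com", "result": "loaded"}], "final_url": "https://acehardware.com/news", "blockers": [], "extracted_data": "...", "duration_ms": 34200}
EOF
ok=$(jq 'has("task_id") and (.actions | length > 0) and (.duration_ms >= 0)' trace.json)
echo "$ok"
```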

## Output

Return the execution trace to the calling context (system prompt / run-benchmark skill). Do NOT write results to disk; that is the system prompt's responsibility after evaluation.

## Guardrails

- **One task only.** Execute exactly the task given. Do not browse further or attempt other tasks.
- **No account creation.** If a task requires login, record it as a blocker. Do not create accounts.
- **No credential guessing.** Never attempt to guess passwords or bypass authentication.
- **Time limit awareness.** If a task is taking an unreasonable number of steps (more than 20 actions), consider it likely stuck and record what you have.
- **No destructive actions on real sites.** For CREATE/UPDATE/DELETE tasks, be aware these run against real production websites. If the task would create real accounts or modify real data, record this concern but still attempt the task as specified (this is the nature of the benchmark).