openrouter-agent-cli 0.1.2__tar.gz
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- openrouter_agent_cli-0.1.2/LICENSE +22 -0
- openrouter_agent_cli-0.1.2/PKG-INFO +270 -0
- openrouter_agent_cli-0.1.2/README.md +260 -0
- openrouter_agent_cli-0.1.2/openrouter_agent_cli/__init__.py +2 -0
- openrouter_agent_cli-0.1.2/openrouter_agent_cli/__main__.py +6 -0
- openrouter_agent_cli-0.1.2/openrouter_agent_cli/cli.py +748 -0
- openrouter_agent_cli-0.1.2/openrouter_agent_cli.egg-info/PKG-INFO +270 -0
- openrouter_agent_cli-0.1.2/openrouter_agent_cli.egg-info/SOURCES.txt +12 -0
- openrouter_agent_cli-0.1.2/openrouter_agent_cli.egg-info/dependency_links.txt +1 -0
- openrouter_agent_cli-0.1.2/openrouter_agent_cli.egg-info/entry_points.txt +2 -0
- openrouter_agent_cli-0.1.2/openrouter_agent_cli.egg-info/requires.txt +1 -0
- openrouter_agent_cli-0.1.2/openrouter_agent_cli.egg-info/top_level.txt +1 -0
- openrouter_agent_cli-0.1.2/pyproject.toml +19 -0
- openrouter_agent_cli-0.1.2/setup.cfg +4 -0
@@ -0,0 +1,22 @@
+MIT License
+
+Copyright (c) 2026
+
+Permission is hereby granted, free of charge, to any person obtaining a copy
+of this software and associated documentation files (the "Software"), to deal
+in the Software without restriction, including without limitation the rights
+to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+copies of the Software, and to permit persons to whom the Software is
+furnished to do so, subject to the following conditions:
+
+The above copyright notice and this permission notice shall be included in all
+copies or substantial portions of the Software.
+
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+SOFTWARE.
+
@@ -0,0 +1,270 @@
+Metadata-Version: 2.4
+Name: openrouter-agent-cli
+Version: 0.1.2
+Summary: Standalone terminal agent for OpenRouter with tool actions and context management.
+Requires-Python: >=3.10
+Description-Content-Type: text/markdown
+License-File: LICENSE
+Requires-Dist: httpx>=0.27
+Dynamic: license-file
+
+# openrouter-agent-cli
+
+Standalone terminal agent for OpenRouter models with:
+- tool actions (`run_bash`)
+- interactive permission gating (`allow` / `deny` / `ask`)
+- session persistence
+- context visibility and compaction
+
+## Install
+
+```bash
+cd openrouter-agent-cli
+python3 -m venv .venv
+source .venv/bin/activate
+pip install -e .
+```
+
+## Run
+
+```bash
+export OPENROUTER_API_KEY=sk-or-...
+openrouter-agent
+```
+
+Or without installation:
+
+```bash
+export OPENROUTER_API_KEY=sk-or-...
+python -m openrouter_agent_cli.cli
+```
+
+## Non-interactive prompt
+
+`--prompt` (short `-p`) lets another process run the CLI with a single user message, emit only the assistant reply to `stdout`, and exit immediately. Operation logs, tool call summaries, and permission notices are written to `stderr`, and tool calls are automatically denied unless you disable tools with `--no-tools`.
+
+Example:
+
+```bash
+openrouter-agent --prompt "Explain tail recursion" --no-tools
+```
+
+## Useful flags
+
+```bash
+openrouter-agent \
+  --model arcee-ai/trinity-large-preview:free \
+  --session-id my-session \
+  --workdir ~/Projects \
+  --max-turns 24 \
+  --max-history-messages 60 \
+  --command-timeout 30
+```
+
+Disable tools:
+
+```bash
+openrouter-agent --no-tools
+```
+
+## Slash commands
+
+- `/help`
+- `/exit`
+- `/model [id]`
+- `/usage`
+- `/context [n]`
+- `/compact`
+- `/clear`
+- `/tools`
+- `/tools on|off`
+- `/allow <tool|*>`
+- `/deny <tool|*>`
+- `/unallow <tool|*>`
+- `/undeny <tool|*>`
+- `/cwd [path]`
+
+## Context management
+
+- history is saved in `~/.openrouter-agent-cli/sessions/<session_id>.json`
+- `/usage` shows rough token estimate
+- `/compact` forces summarization
+- automatic compaction triggers when non-system message count exceeds `--max-history-messages`
+
+## Security notes
+
+- `run_bash` executes shell commands on your machine in `--workdir`
+- default policy is `ask` for every tool call
+- use `/deny *` for a fully no-tools session
+- default model is free-tier (`arcee-ai/trinity-large-preview:free`); override with `--model` or `OPENROUTER_MODEL`
+
+## Tool schema seen by the model
+
+When tools are enabled, each OpenRouter request includes this tool definition:
+
+```json
+[
+  {
+    "type": "function",
+    "function": {
+      "name": "run_bash",
+      "description": "Run a shell command in the current working directory and return stdout/stderr.",
+      "parameters": {
+        "type": "object",
+        "properties": {
+          "command": {
+            "type": "string",
+            "description": "Shell command to execute."
+          },
+          "timeout_seconds": {
+            "type": "integer",
+            "description": "Execution timeout in seconds (1-600).",
+            "default": 30
+          }
+        },
+        "required": ["command"]
+      }
+    }
+  }
+]
+```
+
+Request body shape sent to OpenRouter (simplified):
+
+```json
+{
+  "model": "arcee-ai/trinity-large-preview:free",
+  "messages": [...],
+  "temperature": 0,
+  "max_tokens": 4096,
+  "tools": [...],
+  "tool_choice": "auto"
+}
+```
+
+If tools are disabled (`--no-tools` or `/tools off`), the request sets:
+
+```json
+{
+  "tool_choice": "none"
+}
+```
+
+## How `run_bash` is invoked
+
+Execution flow per user turn:
+
+1. Model returns `tool_calls` in assistant message.
+2. CLI decodes `function.arguments` JSON into a dict.
+3. Permission policy is applied:
+   - `deny` list blocks immediately.
+   - `allow` list runs immediately.
+   - otherwise prompt user (`y/n/a/d`).
+4. For `run_bash`, CLI executes:
+   - `asyncio.create_subprocess_shell(command, cwd=<workdir>, stdout=PIPE, stderr=PIPE)`
+   - waits with `asyncio.wait_for(..., timeout_seconds)`
+   - kills process on timeout
+5. CLI formats stdout/stderr/exit code to text and appends a tool result message:
+   - role: `tool`
+   - tool_call_id: model-provided id
+   - content: command output (capped to 8000 chars before being sent back to model)
+
+Example tool call from model:
+
+```json
+{
+  "id": "call_123",
+  "type": "function",
+  "function": {
+    "name": "run_bash",
+    "arguments": "{\"command\":\"ls -la\",\"timeout_seconds\":30}"
+  }
+}
+```
+
+Example tool result message added by CLI:
+
+```json
+{
+  "role": "tool",
+  "tool_call_id": "call_123",
+  "content": "total 64\n-rw-r--r-- ..."
+}
+```
+
+Note: despite the name `run_bash`, execution uses `create_subprocess_shell` (system shell), not an explicit `bash` binary unless the command itself invokes `bash`.
+
+## Prompt A/B testing
+
+This repo includes a small harness for comparing system prompts:
+
+- script: `scripts/ab_test_system_prompts.py`
+- prompt variants:
+  - `prompts/system_prompt_control.md`
+  - `prompts/system_prompt_agentic_v1.md`
+- sample tasks: `ab_tests/tasks_sample.txt`
+
+Run prompt-only comparison (no tools):
+
+```bash
+export OPENROUTER_API_KEY=sk-or-...
+python scripts/ab_test_system_prompts.py \
+  --tool-mode none \
+  --model arcee-ai/trinity-large-preview:free
+```
+
+Run with tool execution enabled (use cautiously):
+
+```bash
+export OPENROUTER_API_KEY=sk-or-...
+python scripts/ab_test_system_prompts.py \
+  --tool-mode execute \
+  --workdir "$(pwd)" \
+  --model arcee-ai/trinity-large-preview:free
+```
+
+Artifacts are written to `ab_tests/results/<timestamp>/`:
+
+- `results.json` full transcripts and metadata
+- `summary.csv` flat comparison table
+- `summary.md` quick markdown summary
+
+Run a harder repeated suite (2 prompts x 6 tasks x 3 repeats):
+
+```bash
+export OPENROUTER_API_KEY=sk-or-...
+python scripts/ab_test_system_prompts.py \
+  --tool-mode execute \
+  --tasks-file ab_tests/tasks_hard_suite_v1.txt \
+  --repeats 3 \
+  --max-turns 3 \
+  --max-tokens 1000 \
+  --request-timeout 40 \
+  --command-timeout 20 \
+  --workdir "$(pwd)" \
+  --model arcee-ai/trinity-large-preview:free \
+  --output-dir ab_tests/results/hard_suite_v1_r3
+```
+
+Evaluate quality and groundedness from a run:
+
+```bash
+export OPENROUTER_API_KEY=sk-or-...
+python scripts/evaluate_ab_results.py \
+  --results ab_tests/results/hard_suite_v1_r3/results.json \
+  --judge-model arcee-ai/trinity-large-preview:free \
+  --output-dir ab_tests/results/hard_suite_v1_r3/eval
+```
+
+Evaluator artifacts:
+
+- `evaluation.json` per-case raw evaluation details
+- `evaluation.csv` tabular scores
+- `leaderboard.md` aggregated per-prompt ranking
+
+## Findings and release docs
+
+- benchmark findings: `docs/AB_FINDINGS_2026-02-21.md`
+- public release checklist: `docs/PUBLIC_RELEASE_CHECKLIST.md`
+- security policy: `SECURITY.md`
+- env template: `.env.example`
@@ -0,0 +1,260 @@
+# openrouter-agent-cli
+
+Standalone terminal agent for OpenRouter models with:
+- tool actions (`run_bash`)
+- interactive permission gating (`allow` / `deny` / `ask`)
+- session persistence
+- context visibility and compaction
+
+## Install
+
+```bash
+cd openrouter-agent-cli
+python3 -m venv .venv
+source .venv/bin/activate
+pip install -e .
+```
+
+## Run
+
+```bash
+export OPENROUTER_API_KEY=sk-or-...
+openrouter-agent
+```
+
+Or without installation:
+
+```bash
+export OPENROUTER_API_KEY=sk-or-...
+python -m openrouter_agent_cli.cli
+```
+
+## Non-interactive prompt
+
+`--prompt` (short `-p`) lets another process run the CLI with a single user message, emit only the assistant reply to `stdout`, and exit immediately. Operation logs, tool call summaries, and permission notices are written to `stderr`, and tool calls are automatically denied unless you disable tools with `--no-tools`.
+
+Example:
+
+```bash
+openrouter-agent --prompt "Explain tail recursion" --no-tools
+```
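As a rough illustration of the non-interactive mode described above, a calling process could wrap the CLI as sketched below. The `openrouter-agent` command and the `--prompt` / `--no-tools` flags come from the README; the helper names `build_command` and `ask`, the default timeout, and the assumption that failures are signalled by a non-zero exit code are hypothetical and not confirmed by the package.

```python
import subprocess


def build_command(prompt: str, executable: str = "openrouter-agent") -> list[str]:
    """Return the argv for a one-shot invocation with tools disabled."""
    return [executable, "--prompt", prompt, "--no-tools"]


def ask(prompt: str, timeout: float = 120.0) -> str:
    """Run the CLI once and return whatever it printed to stdout.

    The README states that logs and notices go to stderr, so stderr is
    captured separately and only surfaced when the process fails.
    Assumption: a failure is reported through a non-zero exit code.
    """
    result = subprocess.run(
        build_command(prompt),
        capture_output=True,
        text=True,
        timeout=timeout,
    )
    if result.returncode != 0:
        raise RuntimeError(f"openrouter-agent failed: {result.stderr.strip()}")
    return result.stdout.strip()
```

Calling `ask("Explain tail recursion")` would need `OPENROUTER_API_KEY` in the environment and the CLI on `PATH`.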
+
+## Useful flags
+
+```bash
+openrouter-agent \
+  --model arcee-ai/trinity-large-preview:free \
+  --session-id my-session \
+  --workdir ~/Projects \
+  --max-turns 24 \
+  --max-history-messages 60 \
+  --command-timeout 30
+```
+
+Disable tools:
+
+```bash
+openrouter-agent --no-tools
+```
+
+## Slash commands
+
+- `/help`
+- `/exit`
+- `/model [id]`
+- `/usage`
+- `/context [n]`
+- `/compact`
+- `/clear`
+- `/tools`
+- `/tools on|off`
+- `/allow <tool|*>`
+- `/deny <tool|*>`
+- `/unallow <tool|*>`
+- `/undeny <tool|*>`
+- `/cwd [path]`
+
+## Context management
+
+- history is saved in `~/.openrouter-agent-cli/sessions/<session_id>.json`
+- `/usage` shows rough token estimate
+- `/compact` forces summarization
+- automatic compaction triggers when non-system message count exceeds `--max-history-messages`
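The session path and the automatic-compaction trigger listed above can be sketched in a few lines. This is only an approximation of the stated rules: the function names `session_path` and `needs_compaction` are hypothetical, and the README does not describe the layout of the session file or how summarization itself is performed.

```python
from pathlib import Path


def session_path(session_id: str) -> Path:
    """Location of the history file, following the path given in the README."""
    return Path.home() / ".openrouter-agent-cli" / "sessions" / f"{session_id}.json"


def needs_compaction(messages: list[dict], max_history_messages: int) -> bool:
    """True when the number of non-system messages exceeds the limit."""
    non_system = sum(1 for m in messages if m.get("role") != "system")
    return non_system > max_history_messages
```

With `--max-history-messages 60`, a history holding one system message and 60 other messages would not yet trigger compaction under this reading; the 61st non-system message would.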
+
+## Security notes
+
+- `run_bash` executes shell commands on your machine in `--workdir`
+- default policy is `ask` for every tool call
+- use `/deny *` for a fully no-tools session
+- default model is free-tier (`arcee-ai/trinity-large-preview:free`); override with `--model` or `OPENROUTER_MODEL`
+
+## Tool schema seen by the model
+
+When tools are enabled, each OpenRouter request includes this tool definition:
+
+```json
+[
+  {
+    "type": "function",
+    "function": {
+      "name": "run_bash",
+      "description": "Run a shell command in the current working directory and return stdout/stderr.",
+      "parameters": {
+        "type": "object",
+        "properties": {
+          "command": {
+            "type": "string",
+            "description": "Shell command to execute."
+          },
+          "timeout_seconds": {
+            "type": "integer",
+            "description": "Execution timeout in seconds (1-600).",
+            "default": 30
+          }
+        },
+        "required": ["command"]
+      }
+    }
+  }
+]
+```
+
+Request body shape sent to OpenRouter (simplified):
+
+```json
+{
+  "model": "arcee-ai/trinity-large-preview:free",
+  "messages": [...],
+  "temperature": 0,
+  "max_tokens": 4096,
+  "tools": [...],
+  "tool_choice": "auto"
+}
+```
+
+If tools are disabled (`--no-tools` or `/tools off`), the request sets:
+
+```json
+{
+  "tool_choice": "none"
+}
+```
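To make the relationship between the tool schema and the request body concrete, here is a minimal sketch that assembles the simplified body shown above. The field values are taken from the README; the names `RUN_BASH_TOOL` and `build_request_body` are hypothetical. The README does not say whether the `tools` list is still sent when tools are disabled, so this sketch assumes it is and only switches `tool_choice`.

```python
DEFAULT_MODEL = "arcee-ai/trinity-large-preview:free"

# Tool definition copied from the schema shown in the README.
RUN_BASH_TOOL = {
    "type": "function",
    "function": {
        "name": "run_bash",
        "description": "Run a shell command in the current working directory and return stdout/stderr.",
        "parameters": {
            "type": "object",
            "properties": {
                "command": {"type": "string", "description": "Shell command to execute."},
                "timeout_seconds": {
                    "type": "integer",
                    "description": "Execution timeout in seconds (1-600).",
                    "default": 30,
                },
            },
            "required": ["command"],
        },
    },
}


def build_request_body(
    messages: list[dict],
    tools_enabled: bool = True,
    model: str = DEFAULT_MODEL,
) -> dict:
    """Assemble the simplified request body from the README."""
    return {
        "model": model,
        "messages": messages,
        "temperature": 0,
        "max_tokens": 4096,
        # Assumption: the tool list is sent either way; only tool_choice changes.
        "tools": [RUN_BASH_TOOL],
        "tool_choice": "auto" if tools_enabled else "none",
    }
```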
+
+## How `run_bash` is invoked
+
+Execution flow per user turn:
+
+1. Model returns `tool_calls` in assistant message.
+2. CLI decodes `function.arguments` JSON into a dict.
+3. Permission policy is applied:
+   - `deny` list blocks immediately.
+   - `allow` list runs immediately.
+   - otherwise prompt user (`y/n/a/d`).
+4. For `run_bash`, CLI executes:
+   - `asyncio.create_subprocess_shell(command, cwd=<workdir>, stdout=PIPE, stderr=PIPE)`
+   - waits with `asyncio.wait_for(..., timeout_seconds)`
+   - kills process on timeout
+5. CLI formats stdout/stderr/exit code to text and appends a tool result message:
+   - role: `tool`
+   - tool_call_id: model-provided id
+   - content: command output (capped to 8000 chars before being sent back to model)
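Steps 2 and 3 of the flow above might look roughly like the sketch below. It is not the package's code: `Decision`, `decide`, and `decode_arguments` are hypothetical names, checking the deny list before the allow list simply follows the order in which the README lists them, and treating `*` as a match-all entry is an assumption based on the `/allow <tool|*>` and `/deny <tool|*>` slash commands.

```python
import json
from enum import Enum


class Decision(Enum):
    DENY = "deny"
    ALLOW = "allow"
    ASK = "ask"


def decode_arguments(tool_call: dict) -> dict:
    """Decode the JSON string in function.arguments into a dict."""
    args = json.loads(tool_call["function"]["arguments"])
    if not isinstance(args, dict):
        raise ValueError("tool arguments must decode to a JSON object")
    return args


def decide(tool_name: str, allow: set[str], deny: set[str]) -> Decision:
    """Deny list first, then allow list, otherwise ask the user.

    Assumption: "*" in either set matches every tool name.
    """
    if tool_name in deny or "*" in deny:
        return Decision.DENY
    if tool_name in allow or "*" in allow:
        return Decision.ALLOW
    return Decision.ASK
```

With both sets empty, every call resolves to `Decision.ASK`, which matches the default policy stated in the security notes.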
+
+Example tool call from model:
+
+```json
+{
+  "id": "call_123",
+  "type": "function",
+  "function": {
+    "name": "run_bash",
+    "arguments": "{\"command\":\"ls -la\",\"timeout_seconds\":30}"
+  }
+}
+```
+
+Example tool result message added by CLI:
+
+```json
+{
+  "role": "tool",
+  "tool_call_id": "call_123",
+  "content": "total 64\n-rw-r--r-- ..."
+}
+```
+
+Note: despite the name `run_bash`, execution uses `create_subprocess_shell` (system shell), not an explicit `bash` binary unless the command itself invokes `bash`.
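The execution and result-formatting steps can be approximated with the standard `asyncio` subprocess API named in the README. In the sketch below, the 8000-character cap and the `role` / `tool_call_id` / `content` fields come from the README; the exact text layout of the output, the timeout message, and clamping `timeout_seconds` to the 1-600 range from the schema description are assumptions, and the function names are hypothetical.

```python
import asyncio

MAX_OUTPUT_CHARS = 8000  # cap stated in the README


async def run_bash(command: str, workdir: str, timeout_seconds: int = 30) -> str:
    """Run a command in the system shell and return its output as text."""
    # Assumption: clamp to the 1-600 range given in the schema description.
    timeout = max(1, min(600, timeout_seconds))
    proc = await asyncio.create_subprocess_shell(
        command,
        cwd=workdir,
        stdout=asyncio.subprocess.PIPE,
        stderr=asyncio.subprocess.PIPE,
    )
    try:
        stdout, stderr = await asyncio.wait_for(proc.communicate(), timeout)
    except asyncio.TimeoutError:
        proc.kill()  # kill the shell process on timeout
        await proc.wait()
        return f"timed out after {timeout}s"
    # Assumption: this text layout is illustrative, not the CLI's real format.
    text = (
        f"exit_code: {proc.returncode}\n"
        f"stdout:\n{stdout.decode(errors='replace')}\n"
        f"stderr:\n{stderr.decode(errors='replace')}"
    )
    return text[:MAX_OUTPUT_CHARS]


def tool_result_message(tool_call_id: str, content: str) -> dict:
    """Build the tool result message appended to the conversation."""
    return {"role": "tool", "tool_call_id": tool_call_id, "content": content}
```

Note that `proc.kill()` targets the shell that `create_subprocess_shell` started; child processes spawned by the command may need separate handling, which this sketch does not attempt.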
+
+## Prompt A/B testing
+
+This repo includes a small harness for comparing system prompts:
+
+- script: `scripts/ab_test_system_prompts.py`
+- prompt variants:
+  - `prompts/system_prompt_control.md`
+  - `prompts/system_prompt_agentic_v1.md`
+- sample tasks: `ab_tests/tasks_sample.txt`
+
+Run prompt-only comparison (no tools):
+
+```bash
+export OPENROUTER_API_KEY=sk-or-...
+python scripts/ab_test_system_prompts.py \
+  --tool-mode none \
+  --model arcee-ai/trinity-large-preview:free
+```
+
+Run with tool execution enabled (use cautiously):
+
+```bash
+export OPENROUTER_API_KEY=sk-or-...
+python scripts/ab_test_system_prompts.py \
+  --tool-mode execute \
+  --workdir "$(pwd)" \
+  --model arcee-ai/trinity-large-preview:free
+```
+
+Artifacts are written to `ab_tests/results/<timestamp>/`:
+
+- `results.json` full transcripts and metadata
+- `summary.csv` flat comparison table
+- `summary.md` quick markdown summary
+
+Run a harder repeated suite (2 prompts x 6 tasks x 3 repeats):
+
+```bash
+export OPENROUTER_API_KEY=sk-or-...
+python scripts/ab_test_system_prompts.py \
+  --tool-mode execute \
+  --tasks-file ab_tests/tasks_hard_suite_v1.txt \
+  --repeats 3 \
+  --max-turns 3 \
+  --max-tokens 1000 \
+  --request-timeout 40 \
+  --command-timeout 20 \
+  --workdir "$(pwd)" \
+  --model arcee-ai/trinity-large-preview:free \
+  --output-dir ab_tests/results/hard_suite_v1_r3
+```
+
+Evaluate quality and groundedness from a run:
+
+```bash
+export OPENROUTER_API_KEY=sk-or-...
+python scripts/evaluate_ab_results.py \
+  --results ab_tests/results/hard_suite_v1_r3/results.json \
+  --judge-model arcee-ai/trinity-large-preview:free \
+  --output-dir ab_tests/results/hard_suite_v1_r3/eval
+```
+
+Evaluator artifacts:
+
+- `evaluation.json` per-case raw evaluation details
+- `evaluation.csv` tabular scores
+- `leaderboard.md` aggregated per-prompt ranking
+
+## Findings and release docs
+
+- benchmark findings: `docs/AB_FINDINGS_2026-02-21.md`
+- public release checklist: `docs/PUBLIC_RELEASE_CHECKLIST.md`
+- security policy: `SECURITY.md`
+- env template: `.env.example`