evalloop 0.1.1__tar.gz
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- evalloop-0.1.1/.github/workflows/publish.yml +29 -0
- evalloop-0.1.1/.gitignore +9 -0
- evalloop-0.1.1/PKG-INFO +319 -0
- evalloop-0.1.1/README.md +301 -0
- evalloop-0.1.1/evalloop/__init__.py +4 -0
- evalloop-0.1.1/evalloop/_utils.py +15 -0
- evalloop-0.1.1/evalloop/baseline.py +99 -0
- evalloop-0.1.1/evalloop/capture.py +424 -0
- evalloop-0.1.1/evalloop/cli.py +361 -0
- evalloop-0.1.1/evalloop/db.py +139 -0
- evalloop-0.1.1/evalloop/defaults.py +127 -0
- evalloop-0.1.1/evalloop/scorer.py +325 -0
- evalloop-0.1.1/pyproject.toml +29 -0
- evalloop-0.1.1/tests/__init__.py +0 -0
- evalloop-0.1.1/tests/golden_set.py +147 -0
- evalloop-0.1.1/tests/test_baseline.py +101 -0
- evalloop-0.1.1/tests/test_capture.py +467 -0
- evalloop-0.1.1/tests/test_cli_init.py +171 -0
- evalloop-0.1.1/tests/test_defaults.py +106 -0
- evalloop-0.1.1/tests/test_scorer.py +237 -0

evalloop-0.1.1/.github/workflows/publish.yml
ADDED

@@ -0,0 +1,29 @@
name: Publish to PyPI

on:
  push:
    tags:
      - "v*"

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      - run: pip install -e ".[dev]" && python -m pytest tests/ -q

  publish:
    needs: test
    runs-on: ubuntu-latest
    permissions:
      id-token: write  # required for PyPI Trusted Publishing (no token needed)
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      - run: pip install build && python -m build
      - uses: pypa/gh-action-pypi-publish@release/v1

evalloop-0.1.1/PKG-INFO
ADDED

@@ -0,0 +1,319 @@
Metadata-Version: 2.4
Name: evalloop
Version: 0.1.1
Summary: Closed-loop eval monitoring for LLM-powered products
License: MIT
Requires-Python: >=3.9
Requires-Dist: click>=8.0.0
Provides-Extra: anthropic
Requires-Dist: anthropic>=0.40.0; extra == 'anthropic'
Provides-Extra: dev
Requires-Dist: pytest-asyncio>=0.21.0; extra == 'dev'
Requires-Dist: pytest>=7.0.0; extra == 'dev'
Provides-Extra: openai
Requires-Dist: openai>=1.0.0; extra == 'openai'
Provides-Extra: voyage
Requires-Dist: voyageai>=0.2.4; extra == 'voyage'
Description-Content-Type: text/markdown

# evalloop

**Sentry for AI behavior.** Closed-loop eval monitoring for LLM-powered products.

You changed your prompt on Friday. Your bot broke. Your users noticed before you did.

evalloop wraps your LLM client with one line. Every call is scored against your known-good baselines in the background — zero added latency. Regressions surface in your terminal before they reach your users.

---

## Who this is for

**You're building an AI product** — a chatbot, support bot, summarizer, coding assistant. You're making prompt changes regularly. You have no eval system. You've shipped a regression at least once and found out from a user.

| If you're... | Your pain | evalloop gives you... |
|---|---|---|
| A solo founder shipping fast | You're the only engineer. When a prompt change breaks your bot over a weekend, users notice before you do. You have no eval infra — no time to build one. | Score trends after every call so regressions surface in 2 minutes, not Monday morning |
| An AI engineer at a startup | You improve prompts weekly but can't prove quality went up. Your PM asks "did the last change make it better?" and you have no answer. | A shareable eval report — timestamped scores by prompt version — so you can show your PM exactly what changed and when |
| An ML engineer with internal LLM tools | Internal tools have no user feedback loop. If your tool degrades, users just quietly stop using it. You'd never know without monitoring. | Automated regression alerts so quality drops surface before users abandon the tool |

**Before evalloop:**
```
Prompt change pushed Friday 5pm
→ Users complain Monday morning
→ 4 hours debugging which change broke what
→ Average detection time: 3 days
```

**After evalloop:**
```
Prompt change pushed Friday 5pm
→ evalloop watch alerts Friday 5:02pm
→ Developer reverts before closing laptop
→ Average detection time: 2 minutes
```

---

## Install

```bash
pip install evalloop
```

Requires Python 3.9+. Then run the setup wizard:

```bash
evalloop init
```

evalloop auto-detects your scoring backend from environment variables — no extra config needed:

| Keys you have | Scoring backend | Accuracy |
|---|---|---|
| `VOYAGE_API_KEY` | Semantic embeddings via Voyage AI | Best |
| `ANTHROPIC_API_KEY` | LLM-as-judge via claude-haiku | Good |
| Neither | Heuristics only (length/format) | Basic |

If you're already wrapping an Anthropic or OpenAI client, evalloop uses your existing key automatically — no extra signup required.

---

## Quickstart

```python
from evalloop import wrap
import anthropic

# One line change — evalloop uses your existing ANTHROPIC_API_KEY automatically
client = wrap(anthropic.Anthropic(), task_tag="qa")

resp = client.messages.create(
    model="claude-haiku-4-5-20251001",
    max_tokens=256,
    messages=[{"role": "user", "content": "What is the capital of France?"}],
)
# ^ scored against your baselines in the background. Zero added latency.
```

Then check your dashboard:

```
$ evalloop status

🟢 [qa]
   Calls captured : 142
   Scored         : 142
   Avg (7d)       : 0.81
   Avg (24h)      : 0.79
   Trend (recent) : ▆▆▇▇▆▅▆▇▇▆▇▆▇▇▆▆▇▆▇▆
```

Or watch for regressions automatically:

```
$ evalloop watch --interval 60

evalloop watch — polling every 60s. Ctrl-C to stop.

🔴 [qa]
   Calls captured : 201
   Avg (7d)       : 0.81
   Avg (24h)      : 0.71
   ⚠ Regression   : score dropped 10.0pp in last 24h

^C
Watch stopped.
```

---

## How it works

```
your call
    │
    ▼
wrap(client).messages.create()
    │  returns immediately (zero latency added)
    │
    └─▶ background thread
            │
            ├─▶ infer task tag (from system prompt, if not set)
            ├─▶ score(output, baselines)
            │       ├─▶ heuristics (empty, too short, too long)
            │       └─▶ cosine similarity via Voyage AI
            └─▶ sqlite insert → ~/.evalloop/calls.db
```

- **Zero latency** — scoring runs in a background thread
- **Silent on errors** — disk full, API down, DB locked → log to stderr, never crash your app
- **Graceful exit** — on normal Python exit, flushes the queue so no captures are lost
- **Degraded mode** — if Voyage AI is unavailable, heuristic-only scoring continues (flags `degraded_mode`)
- **Self-hosted** — all data stays local in `~/.evalloop/`

---

## Auto tag inference

If you don't set `task_tag`, evalloop infers it from your system prompt:

```python
client = wrap(anthropic.Anthropic())  # no task_tag needed

resp = client.messages.create(
    model="claude-haiku-4-5-20251001",
    max_tokens=256,
    system="You are a helpful assistant that summarizes documents.",
    # ↑ evalloop detects → task_tag="summarization"
    messages=[{"role": "user", "content": "Summarize this article..."}],
)
```

Supported task types for auto-inference: `qa`, `summarization`, `code`, `customer-service`, `classification`

---

## CLI commands

```bash
# First-time setup — detect keys, choose scoring backend, install baselines
evalloop init

# Score trends for all tags
evalloop status

# Filter to one tag
evalloop status --tag qa

# Watch for regressions (polls every 60s by default)
evalloop watch
evalloop watch --tag qa --interval 30

# Export calls + scores for sharing or analysis
evalloop export                            # JSON to stdout
evalloop export --format csv -o out.csv    # CSV to file
evalloop export --tag qa --limit 500       # filtered export

# Manage baselines
evalloop baseline add "your good output" --tag my-task
evalloop baseline list
evalloop baseline install                  # install all built-in defaults
evalloop baseline install --tag qa         # install for one tag
evalloop baseline install --overwrite      # replace existing defaults

# List available built-in task types
evalloop defaults
```

---

## Scoring

evalloop uses **consistency-first scoring**: deterministic heuristics run first, then the best available semantic backend.

| Signal | What it catches |
|--------|----------------|
| Empty / whitespace output | Total failures |
| Too short (< 15% of baseline length) | Truncation |
| Too long (> 5x baseline length) | Rambling, prompt injection |
| Cosine similarity to baseline centroid | Semantic drift (Voyage backend) |
| LLM-as-judge rating | Quality regression (Anthropic backend) |
| `degraded_mode` flag | No scoring backend available — heuristics only |

Backend is auto-detected from your environment. Run `evalloop init` to configure.

Scores range 0.0–1.0. A score below your 7-day average by >5pp triggers a regression flag.

---

## Baselines

Baselines are your known-good outputs. evalloop ships with curated defaults for common task types — you get meaningful scores immediately on install.

```bash
# Install defaults for all task types
evalloop baseline install

# Add your own known-good output
evalloop baseline add "The capital of France is Paris." --tag qa

# List all tags with baselines
evalloop baseline list
```

Built-in task types: `qa`, `summarization`, `code`, `customer-service`, `classification`

---

## OpenAI support

```python
from evalloop import wrap
import openai

# pip install "evalloop[openai]"
client = wrap(openai.OpenAI(), task_tag="summarization")

resp = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": "You summarize documents."},
        {"role": "user", "content": "Summarize: ..."},
    ],
)
```

---

## Data & Privacy

evalloop stores the following locally in `~/.evalloop/calls.db`:

| Field | Stored by default | With `store_inputs=False` |
|-------|:-----------------:|:------------------------:|
| Timestamp, model, latency | ✅ | ✅ |
| Output text | ✅ | ✅ |
| Score, flags, confidence | ✅ | ✅ |
| Input messages (may contain PII) | ✅ | ❌ |

For PII-sensitive environments (HIPAA, GDPR), opt out of input storage:

```python
client = wrap(anthropic.Anthropic(), task_tag="support", store_inputs=False)
# Input messages are never written to disk. Output + scores are still captured.
```

All data stays on your machine. evalloop has no cloud component.

---

## Export & share

Share your eval data with your team:

```bash
# Export last 1000 calls as JSON
evalloop export > eval-report.json

# Export as CSV for Excel/Sheets
evalloop export --format csv -o eval-report.csv

# Export a specific tag
evalloop export --tag qa --limit 500
```

---

## Architecture

- **Storage**: SQLite at `~/.evalloop/calls.db` (WAL mode — concurrent reads while writing)
- **Baselines**: JSONL files at `~/.evalloop/baselines/<tag>.jsonl`
- **Scoring backends**: Voyage AI `voyage-3-lite` (semantic embeddings) or Claude Haiku (LLM-as-judge) — auto-detected from env vars
- **Score model provenance**: stored per-row so history remains interpretable if backend changes
- **No cloud dependency** — everything runs locally

---

## License

MIT

evalloop-0.1.1/README.md
ADDED

@@ -0,0 +1,301 @@
# evalloop

**Sentry for AI behavior.** Closed-loop eval monitoring for LLM-powered products.

You changed your prompt on Friday. Your bot broke. Your users noticed before you did.

evalloop wraps your LLM client with one line. Every call is scored against your known-good baselines in the background — zero added latency. Regressions surface in your terminal before they reach your users.

---

## Who this is for

**You're building an AI product** — a chatbot, support bot, summarizer, coding assistant. You're making prompt changes regularly. You have no eval system. You've shipped a regression at least once and found out from a user.

| If you're... | Your pain | evalloop gives you... |
|---|---|---|
| A solo founder shipping fast | You're the only engineer. When a prompt change breaks your bot over a weekend, users notice before you do. You have no eval infra — no time to build one. | Score trends after every call so regressions surface in 2 minutes, not Monday morning |
| An AI engineer at a startup | You improve prompts weekly but can't prove quality went up. Your PM asks "did the last change make it better?" and you have no answer. | A shareable eval report — timestamped scores by prompt version — so you can show your PM exactly what changed and when |
| An ML engineer with internal LLM tools | Internal tools have no user feedback loop. If your tool degrades, users just quietly stop using it. You'd never know without monitoring. | Automated regression alerts so quality drops surface before users abandon the tool |

**Before evalloop:**
```
Prompt change pushed Friday 5pm
→ Users complain Monday morning
→ 4 hours debugging which change broke what
→ Average detection time: 3 days
```

**After evalloop:**
```
Prompt change pushed Friday 5pm
→ evalloop watch alerts Friday 5:02pm
→ Developer reverts before closing laptop
→ Average detection time: 2 minutes
```

---

## Install

```bash
pip install evalloop
```

Requires Python 3.9+. Then run the setup wizard:

```bash
evalloop init
```

evalloop auto-detects your scoring backend from environment variables — no extra config needed:

| Keys you have | Scoring backend | Accuracy |
|---|---|---|
| `VOYAGE_API_KEY` | Semantic embeddings via Voyage AI | Best |
| `ANTHROPIC_API_KEY` | LLM-as-judge via claude-haiku | Good |
| Neither | Heuristics only (length/format) | Basic |

If you're already wrapping an Anthropic or OpenAI client, evalloop uses your existing key automatically — no extra signup required.

---

## Quickstart

```python
from evalloop import wrap
import anthropic

# One line change — evalloop uses your existing ANTHROPIC_API_KEY automatically
client = wrap(anthropic.Anthropic(), task_tag="qa")

resp = client.messages.create(
    model="claude-haiku-4-5-20251001",
    max_tokens=256,
    messages=[{"role": "user", "content": "What is the capital of France?"}],
)
# ^ scored against your baselines in the background. Zero added latency.
```

Then check your dashboard:

```
$ evalloop status

🟢 [qa]
   Calls captured : 142
   Scored         : 142
   Avg (7d)       : 0.81
   Avg (24h)      : 0.79
   Trend (recent) : ▆▆▇▇▆▅▆▇▇▆▇▆▇▇▆▆▇▆▇▆
```

Or watch for regressions automatically:

```
$ evalloop watch --interval 60

evalloop watch — polling every 60s. Ctrl-C to stop.

🔴 [qa]
   Calls captured : 201
   Avg (7d)       : 0.81
   Avg (24h)      : 0.71
   ⚠ Regression   : score dropped 10.0pp in last 24h

^C
Watch stopped.
```

---

## How it works

```
your call
    │
    ▼
wrap(client).messages.create()
    │  returns immediately (zero latency added)
    │
    └─▶ background thread
            │
            ├─▶ infer task tag (from system prompt, if not set)
            ├─▶ score(output, baselines)
            │       ├─▶ heuristics (empty, too short, too long)
            │       └─▶ cosine similarity via Voyage AI
            └─▶ sqlite insert → ~/.evalloop/calls.db
```

- **Zero latency** — scoring runs in a background thread
- **Silent on errors** — disk full, API down, DB locked → log to stderr, never crash your app
- **Graceful exit** — on normal Python exit, flushes the queue so no captures are lost
- **Degraded mode** — if Voyage AI is unavailable, heuristic-only scoring continues (flags `degraded_mode`)
- **Self-hosted** — all data stays local in `~/.evalloop/`
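
The pipeline above boils down to a queue drained by a daemon thread: the caller enqueues and returns immediately, and a flush on exit keeps captures from being lost. A minimal sketch of that capture pattern (an illustration, not evalloop's actual implementation in `capture.py`):

```python
import atexit
import queue
import threading

class BackgroundScorer:
    """Capture-and-score off the hot path: callers enqueue and return
    immediately; a daemon thread drains the queue and records scores."""

    def __init__(self, score_fn):
        self._score_fn = score_fn
        self._queue = queue.Queue()
        self.records = []
        threading.Thread(target=self._drain, daemon=True).start()
        atexit.register(self.flush)  # flush on normal exit so no captures are lost

    def capture(self, output: str) -> None:
        self._queue.put(output)  # O(1): adds no latency to the wrapped call

    def _drain(self) -> None:
        while True:
            output = self._queue.get()
            try:
                self.records.append((output, self._score_fn(output)))
            except Exception:
                pass  # never crash the host app; log to stderr instead
            finally:
                self._queue.task_done()

    def flush(self) -> None:
        self._queue.join()  # block until every queued capture has been scored

scorer = BackgroundScorer(score_fn=lambda text: 0.0 if not text.strip() else 1.0)
scorer.capture("Paris is the capital of France.")
scorer.capture("   ")  # an empty/whitespace output scores 0.0
scorer.flush()
print([s for _, s in scorer.records])  # → [1.0, 0.0]
```

The single worker drains in FIFO order, so scores come back in capture order; `atexit` + `queue.join()` is what makes the "graceful exit" bullet above work.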

---

## Auto tag inference

If you don't set `task_tag`, evalloop infers it from your system prompt:

```python
client = wrap(anthropic.Anthropic())  # no task_tag needed

resp = client.messages.create(
    model="claude-haiku-4-5-20251001",
    max_tokens=256,
    system="You are a helpful assistant that summarizes documents.",
    # ↑ evalloop detects → task_tag="summarization"
    messages=[{"role": "user", "content": "Summarize this article..."}],
)
```

Supported task types for auto-inference: `qa`, `summarization`, `code`, `customer-service`, `classification`
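
One plausible way to implement this kind of inference is keyword matching over the lowercased system prompt. The keyword table below is invented for illustration; evalloop's real rules live in `capture.py`:

```python
# Hypothetical keyword table, one entry per supported task type.
TAG_KEYWORDS = {
    "summarization": ("summar",),
    "code": ("code", "program", "function"),
    "customer-service": ("customer", "support"),
    "classification": ("classif", "categor"),
    "qa": ("question", "answer"),
}

def infer_task_tag(system_prompt: str, default: str = "qa") -> str:
    """Pick the first tag whose keywords appear in the lowercased prompt."""
    text = system_prompt.lower()
    for tag, keywords in TAG_KEYWORDS.items():
        if any(k in text for k in keywords):
            return tag
    return default

print(infer_task_tag("You are a helpful assistant that summarizes documents."))  # → summarization
print(infer_task_tag("You are terse."))  # no keyword match: falls back to "qa"
```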

---

## CLI commands

```bash
# First-time setup — detect keys, choose scoring backend, install baselines
evalloop init

# Score trends for all tags
evalloop status

# Filter to one tag
evalloop status --tag qa

# Watch for regressions (polls every 60s by default)
evalloop watch
evalloop watch --tag qa --interval 30

# Export calls + scores for sharing or analysis
evalloop export                            # JSON to stdout
evalloop export --format csv -o out.csv    # CSV to file
evalloop export --tag qa --limit 500       # filtered export

# Manage baselines
evalloop baseline add "your good output" --tag my-task
evalloop baseline list
evalloop baseline install                  # install all built-in defaults
evalloop baseline install --tag qa         # install for one tag
evalloop baseline install --overwrite      # replace existing defaults

# List available built-in task types
evalloop defaults
```

---

## Scoring

evalloop uses **consistency-first scoring**: deterministic heuristics run first, then the best available semantic backend.

| Signal | What it catches |
|--------|----------------|
| Empty / whitespace output | Total failures |
| Too short (< 15% of baseline length) | Truncation |
| Too long (> 5x baseline length) | Rambling, prompt injection |
| Cosine similarity to baseline centroid | Semantic drift (Voyage backend) |
| LLM-as-judge rating | Quality regression (Anthropic backend) |
| `degraded_mode` flag | No scoring backend available — heuristics only |

Backend is auto-detected from your environment. Run `evalloop init` to configure.

Scores range 0.0–1.0. A score below your 7-day average by >5pp triggers a regression flag.
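
The length heuristics and the regression rule can be sketched directly from the thresholds above. The specific penalty values (0.2, 0.3) are invented for this sketch, not evalloop's real scores:

```python
from statistics import mean

def heuristic_score(output: str, baselines: list[str]) -> float:
    """Deterministic length heuristics against the mean baseline length."""
    if not output.strip():
        return 0.0                   # empty/whitespace: total failure
    avg_len = mean(len(b) for b in baselines)
    if len(output) < 0.15 * avg_len:
        return 0.2                   # likely truncation
    if len(output) > 5 * avg_len:
        return 0.3                   # rambling / prompt injection
    return 1.0

def regression_flag(avg_7d: float, avg_24h: float, threshold_pp: float = 5.0) -> bool:
    """Flag when the 24h average sits more than threshold_pp percentage
    points below the 7-day average (scores live on a 0.0-1.0 scale)."""
    return (avg_7d - avg_24h) * 100 > threshold_pp

baselines = ["The capital of France is Paris."]
print(heuristic_score("Paris.", baselines))         # → 1.0
print(regression_flag(avg_7d=0.81, avg_24h=0.71))   # → True, the 🔴 watch case above
```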

---

## Baselines

Baselines are your known-good outputs. evalloop ships with curated defaults for common task types — you get meaningful scores immediately on install.

```bash
# Install defaults for all task types
evalloop baseline install

# Add your own known-good output
evalloop baseline add "The capital of France is Paris." --tag qa

# List all tags with baselines
evalloop baseline list
```

Built-in task types: `qa`, `summarization`, `code`, `customer-service`, `classification`
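
Because baselines are plain JSONL files (see Architecture below), they are easy to script against. A sketch of append and load helpers; the per-line record shape (`{"text": ...}`) is an assumption, not evalloop's documented format:

```python
import json
import tempfile
from pathlib import Path

def add_baseline(root: Path, tag: str, text: str) -> None:
    """Append one known-good output to <root>/baselines/<tag>.jsonl."""
    path = root / "baselines" / f"{tag}.jsonl"
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("a", encoding="utf-8") as f:
        f.write(json.dumps({"text": text}) + "\n")

def load_baselines(root: Path, tag: str) -> list[str]:
    """Read every baseline for a tag; a missing tag yields an empty list."""
    path = root / "baselines" / f"{tag}.jsonl"
    if not path.exists():
        return []
    with path.open(encoding="utf-8") as f:
        return [json.loads(line)["text"] for line in f if line.strip()]

root = Path(tempfile.mkdtemp())  # stand-in for ~/.evalloop/
add_baseline(root, "qa", "The capital of France is Paris.")
print(load_baselines(root, "qa"))  # → ['The capital of France is Paris.']
```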

---

## OpenAI support

```python
from evalloop import wrap
import openai

# pip install "evalloop[openai]"
client = wrap(openai.OpenAI(), task_tag="summarization")

resp = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": "You summarize documents."},
        {"role": "user", "content": "Summarize: ..."},
    ],
)
```

---

## Data & Privacy

evalloop stores the following locally in `~/.evalloop/calls.db`:

| Field | Stored by default | With `store_inputs=False` |
|-------|:-----------------:|:------------------------:|
| Timestamp, model, latency | ✅ | ✅ |
| Output text | ✅ | ✅ |
| Score, flags, confidence | ✅ | ✅ |
| Input messages (may contain PII) | ✅ | ❌ |

For PII-sensitive environments (HIPAA, GDPR), opt out of input storage:

```python
client = wrap(anthropic.Anthropic(), task_tag="support", store_inputs=False)
# Input messages are never written to disk. Output + scores are still captured.
```

All data stays on your machine. evalloop has no cloud component.
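
The table above implies a simple record-building rule: metadata, output, and scores are always kept, and input messages are dropped before anything touches disk. A sketch with field names invented for illustration:

```python
import time

def build_record(model: str, messages: list, output: str, score: float,
                 store_inputs: bool = True) -> dict:
    """Assemble the row to persist; with store_inputs=False the input
    messages never make it into the record at all."""
    record = {"ts": time.time(), "model": model, "output": output, "score": score}
    if store_inputs:
        record["input_messages"] = messages
    return record

row = build_record(
    "claude-haiku-4-5-20251001",
    [{"role": "user", "content": "My account email is jane@example.com"}],
    "Done, I've updated your preferences.",
    0.9,
    store_inputs=False,
)
print("input_messages" in row)  # → False
```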

---

## Export & share

Share your eval data with your team:

```bash
# Export last 1000 calls as JSON
evalloop export > eval-report.json

# Export as CSV for Excel/Sheets
evalloop export --format csv -o eval-report.csv

# Export a specific tag
evalloop export --tag qa --limit 500
```

---

## Architecture

- **Storage**: SQLite at `~/.evalloop/calls.db` (WAL mode — concurrent reads while writing)
- **Baselines**: JSONL files at `~/.evalloop/baselines/<tag>.jsonl`
- **Scoring backends**: Voyage AI `voyage-3-lite` (semantic embeddings) or Claude Haiku (LLM-as-judge) — auto-detected from env vars
- **Score model provenance**: stored per-row so history remains interpretable if backend changes
- **No cloud dependency** — everything runs locally
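
A minimal sketch of this storage setup, showing WAL mode and a per-row `score_model` provenance column. The schema and column names are illustrative, not evalloop's actual schema in `db.py`:

```python
import sqlite3
import tempfile
from pathlib import Path

db_path = Path(tempfile.mkdtemp()) / "calls.db"  # stand-in for ~/.evalloop/calls.db
conn = sqlite3.connect(db_path)
conn.execute("PRAGMA journal_mode=WAL")  # readers don't block the background writer
conn.execute(
    """CREATE TABLE IF NOT EXISTS calls (
           ts REAL, model TEXT, tag TEXT, output TEXT,
           score REAL, score_model TEXT  -- provenance: which backend scored this row
       )"""
)
conn.execute(
    "INSERT INTO calls VALUES (?, ?, ?, ?, ?, ?)",
    (1700000000.0, "claude-haiku-4-5-20251001", "qa",
     "Paris is the capital of France.", 0.93, "voyage-3-lite"),
)
conn.commit()
print(conn.execute("SELECT COUNT(*) FROM calls").fetchone()[0])  # → 1
```

Storing the scoring backend per row is what lets a score history stay interpretable after switching, say, from heuristics-only to Voyage embeddings.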

---

## License

MIT

evalloop-0.1.1/evalloop/_utils.py
ADDED

@@ -0,0 +1,15 @@
"""
_utils.py — shared internal helpers.
"""

from __future__ import annotations

import sys


def _warn(msg: str) -> None:
    """Write a warning to stderr. Never raises."""
    try:
        print(msg, file=sys.stderr)
    except Exception:  # noqa: BLE001
        pass